How to extract Chinese text from a scanned PDF document? Thread poster: Yiting "Amy" Hsiao
I received a scanned Simplified Chinese document to translate into English. Does anyone know how to extract the Simplified Chinese text from the PDF into Word, so that I can translate and edit it (and keep the original format)? I've tried using Adobe's "save as word" function, but only 20% of the Chinese characters were kept; all the rest came out as garbled code. I also tried Automator on the Mac, but it didn't work. I also tried submitting it to an online OCR converter, but the result came out as images in Word, which makes it uneditable. Could any experienced translators give me some help here? Thank you so much!!!

esperantisto Local time: 11:13 Member (2006) English to Russian + ...
Use an OCR program such as ABBYY FineReader or OmniPage.

xxLecraxx (X) Germany Local time: 09:13 French to German + ... can't keep the original format | Jan 12, 2014
Hello, I don't think you'll be able to keep the original format. You'll have to format it later. The only procedure I can think of is the following:
1. Copy the Chinese text in the PDF.
2. Paste it into Windows Editor (Notepad) to get a plain text file.
3. Copy the text in Editor.
4. Paste it into a Word file.
Then you can edit the text, e.g. by removing the paragraphs etc., before you start translating. The original format can be restored at the end.

Tony M France Local time: 09:13 Member French to English + ...
I don't know anything about the special characteristics of Chinese script, but I somehow suspect it is likely to cause problems for any kind of OCR program, unless maybe one has been developed specially in China, say? I'd have thought the quickest, cheapest, and simplest solution would have been to simply get the source text re-typed, and then do any formatting manually after translation.

Generally, OCR programmes do not reproduce formatting very well, inasmuch as their output is usually a 'fudge' to produce a facsimile of the formatting, which is a long way from being the same thing as actually reproducing the formatting! What they produce may look fine, until you start to translate; then it can turn out to be a total nightmare and, in my personal experience, waste you a great deal more time than if you had just started with plain, unformatted text.

Again, I don't know about Chinese, but in terms of Western 'Roman alphabet' languages, a good typist can reproduce the source text with only basic formatting more quickly and cheaply than I can do it myself. Of course, if it is not possible to preserve the original formatting, maybe in any case re-typing is unnecessary? Although I like having a source text I can translate by over-typing, in point of fact it is rarely essential, and one can often translate direct from the PDF original; the time and money saved can then be put to better use attempting to recreate the original formatting.
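For the plain-text route discussed above (get the characters into a plain text file, tidy them, and leave the formatting for later), the tidying step can be scripted. The Python sketch below is only an illustration and is not something anyone in the thread used: the function name `normalize_ocr_text` and the chosen character ranges are assumptions. It joins hard line breaks inside a paragraph and drops stray spaces between Chinese characters, while leaving spaces around Latin text and digits alone.

```python
import re

# Assumed ranges: CJK ideographs (incl. Extension A), CJK punctuation,
# and full-width forms. Rarer extension blocks are not covered.
CJK = r"\u3400-\u4dbf\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"


def normalize_ocr_text(raw: str) -> str:
    """Tidy pasted or OCR'd Chinese text while keeping paragraph breaks."""
    # A blank line is treated as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for para in paragraphs:
        # Join the hard-wrapped lines of one paragraph into a single line.
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        text = " ".join(lines)
        # Remove whitespace that sits between two CJK characters.
        text = re.sub(rf"(?<=[{CJK}])\s+(?=[{CJK}])", "", text)
        cleaned.append(text)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "合同 条款\n如下\n\nISO 9001 标准"
    print(normalize_ocr_text(sample))
```

The result can then be pasted into Word or loaded into a CAT tool as a plain source text, with the layout rebuilt after translation.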
Paulinho Fonseca Brazil Local time: 05:13 Member (2011) English to Portuguese + ... How to extract text from PDF...? | Jan 12, 2014 |
I had the same experience with a client last year. I asked the company if they could provide me with the original PDF or Word doc, and the reply was negative. I had to review my quote, as the client wanted both clean and unclean versions. In the end I had to type out the whole doc and then translate it.
It is almost perfect in my languages. I assume that for Chinese it is even better. There is a trial version. Also PlusTools from Wordfast.

Phil Hand China Local time: 16:13 Chinese to English
ABBYY is OK, but I find the formatting it does quite irritating. I use a freebie called 汉王 (Hanwang), which allows you to copy the OCR results (the characters) as text and paste them into a Word file. I then translate the Word file in my CAT tool and reconstruct the format in the English document. Hanwang isn't super-accurate, but it's good enough.

Optical character recognition (OCR) | Jan 13, 2014
You need to find good OCR software, or create your own that focuses on Chinese characters. And if OCR won't work, you are left with typing out the original text. By the way, Marcel, the text in scanned documents can't be copied or pasted. As far as the computer is concerned, it is a picture. Only OCR software can sort it out.
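As a rough illustration of what running OCR on a scanned page involves, here is a Python sketch that calls the open-source Tesseract command-line program. Tesseract is not mentioned anywhere in this thread; it is swapped in purely as an example. The sketch assumes that `tesseract` is installed together with its Simplified Chinese language data (`chi_sim`) and that the PDF page has already been exported as an image file. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="chi_sim"):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="chi_sim"):
    """Run Tesseract on one page image and return the recognised text.

    Assumes the tesseract executable and the requested language data
    are installed; raises RuntimeError if the executable is missing.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang), check=True
    )
    # By default Tesseract writes plain text to <output base>.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # "page1.png" is a placeholder for a page exported from the scanned PDF.
    print(ocr_page("page1.png", "page1"))
```

The output is plain text only, so any layout still has to be rebuilt by hand afterwards.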
xxLecraxx (X) Germany Local time: 09:13 French to German + ...
Ricardy Ricot wrote: "By the way, Marcel, the text in scanned documents can't be copied or pasted. As far as the computer is concerned, it is a picture. Only OCR software can sort it out."

You're right. I skipped the word 'scanned', sorry. (: But how could she keep even 20% of the characters when she tried to save the document as Word? If it was only a picture, it shouldn't be possible at all. Or does Adobe Acrobat Pro come with built-in OCR?
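A figure such as "20% of the characters were kept" can be checked roughly by counting how much of the extracted text is actually Chinese. Below is a minimal Python sketch, assuming the text has already been extracted into a string. The function name `cjk_ratio` is made up for this example, and only the main CJK ideograph blocks are counted, so punctuation and rarer characters count as non-Chinese.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK ideograph blocks."""
    code = ord(ch)
    # CJK Unified Ideographs and CJK Unified Ideographs Extension A.
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def cjk_ratio(text: str) -> float:
    """Share of non-whitespace characters that are CJK ideographs."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(is_cjk(ch) for ch in visible) / len(visible)


if __name__ == "__main__":
    extracted = "合同 abc 123"
    print(f"{cjk_ratio(extracted):.0%} of the characters are Chinese")
```

A low ratio for a document known to be almost entirely Chinese suggests that the extraction or the OCR text layer is unreliable and that the page needs proper OCR or re-typing.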
To be frank, Marcel, I do not know. It could be, because once a document is scanned, it is just a picture as far as the computer is concerned.

Lincoln Hui Hong Kong Local time: 16:13 Member Chinese to English + ...
OCR has been an integral function of Acrobat since 2008. Heck, the driver suite that my printer comes with has OCR.
[Edited at 2014-01-13 16:45 GMT]

online pdf text extractor | Oct 28, 2015
You can try this free online PDF text extractor (http://www.online-code.net/pdf-to-word.html). It supports extracting Chinese text from PDF documents: you just upload your PDF, and the tool extracts the content of all pages as text online.