help needed on converting PDF to Word format
Thread poster: Luke Mersh
Luke Mersh
United Kingdom
Local time: 09:41
Spanish to English
Aug 18, 2015

Dear colleagues,
I have been sent some PDFs that are really scanned documents saved as PDFs. My problem is that when I convert them to Word format they are still images, so I am unable to do a word count without re-typing the PDF into a Word document.

Can anybody tell me if there is a way to convert these image-type PDFs into a Word document without re-typing the whole document, so that I can get a word count?

Many thanks


 
..... (X)
Local time: 18:41
OCR Aug 19, 2015

Hi Luke,

The technology you are looking for is called OCR (optical character recognition). It will convert scanned PDFs or images to text format. The quality of the extraction will depend both on the tool you use and the quality of the PDF (is the scan clear, is it straight, is it high enough resolution, etc.).

The most popular tool for this is probably made by the company ABBYY. There are also a few free options from other companies.

Kevin
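
Once an OCR tool has exported its result as plain text, the word count itself is simple to obtain. The snippet below is only a minimal sketch of that last step, not a replacement for any of the tools mentioned in this thread. The function name `count_words` and the sample text are made up for illustration, and the count may differ slightly from the one Word reports, because stray symbols left behind by OCR are skipped here.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one
    word character (letter, digit, or underscore).

    Tokens made up only of punctuation, such as a stray "-" left
    behind by OCR, are not counted.
    """
    tokens = text.split()
    return sum(1 for token in tokens if re.search(r"\w", token))


# Hypothetical OCR output from a page with printed ECG results.
sample = "Heart rate: 72 bpm\nPR interval - 160 ms"

print(count_words(sample))  # 8 (the lone "-" is skipped)
```

To use it on a real job, read the text file exported by the OCR program and pass its contents to `count_words`.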


 
Luke Mersh
United Kingdom
Local time: 09:41
Spanish to English
TOPIC STARTER
OCR Aug 19, 2015

Thank you.

My only concern is that some of the PDFs are ECG graphs with printed results on them.
Regards


 
Chris Pr
United Kingdom
Local time: 09:41
German to English
Another two options to play with... Aug 19, 2015

Hello Luke,

Kevin was entirely correct in his previous comment about optical recognition being the only applicable solution to your original question.

Two options you might like to check out (via a web search):

Nuance PDF
Nitro PDF

Whether your ECGs will display correctly is entirely debatable, but these have been the best two performers I'm aware of to date.

And whether you're prepared to upload potentially sensitive docs for conversion remains another matter entirely. That said, trial versions for download are also available for (more discreet) local conversion offline. NDA items should never, ever be entrusted to cloud solutions (not to mention TMs, termbases or any other material you'd consider sensitive or private).

On that last note, applicable to all translators generally, never forget that cloud=share (you're no longer in control of the information uploaded), period, full stop.

By the same simple equation, convenience=trade-off.

Best of British,
Chris


 
esperantisto
Local time: 12:41
Member (2006)
English to Russian
SITE LOCALIZER
How? Aug 19, 2015

Luke Mersh wrote:

when I convert them to Word format they are still like images


And how do you actually perform the conversion?


 
neilmac
Spain
Local time: 10:41
Spanish to English
Nitro etc Aug 19, 2015

Nitro or OmniPage are the best conversion programs I've found - I find Nitro is easier/less complicated to use. However, if they are scanned PDFs the results might never be optimal.
If the texts are short, I might prefer to retype/recreate the texts in Word. Or explain to the client that the format is causing problems and you will need to charge extra, or extend the agreed deadline, unless they can provide you with the text in a more workable format...

http://www.nuance.com/for-individuals/by-product/omnipage/index.htm

[Edited at 2015-08-19 09:32 GMT]


 
Paula Darwish
United Kingdom
Local time: 09:41
Member (2013)
Turkish to English
OCR software and different alphabets Aug 19, 2015

In my experience, some are better than others at recognising different alphabets, so you need to try the software on your particular language. They can probably all do English OK, but in my translation language (Turkish) OmniPage is the best one I have found for recognising the characters of the Turkish alphabet.

 
Andrzej Mierzejewski
Poland
Local time: 10:41
Polish to English
OCR Aug 19, 2015

For the ES-EN language pair, any commercially available OCR software should be OK. AFAIK the prices for a single computer licence are approx. EUR 100-150. At that price, and with a capacity of approx. 7 pages a minute, the return on investment is very quick. The OCR quality is also very high but it still depends on the image quality.

I've recently worked on a PDF file with English text scanned from 34 paper pages. The character count was 53,500. The spellcheck found no more than 30 errors.
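
To put those figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch that takes the numbers quoted in this post at face value (34 pages, 53,500 characters, at most 30 spellcheck errors, roughly 7 pages a minute); the function names are made up for illustration, and real throughput and accuracy will vary with scan quality.

```python
# Figures quoted in the post (approximate, taken at face value).
PAGES = 34
CHARACTERS = 53_500
SPELLCHECK_ERRORS = 30   # "no more than 30", so an upper bound
PAGES_PER_MINUTE = 7     # approximate OCR capacity


def ocr_minutes(pages: int, pages_per_minute: float) -> float:
    """Estimated OCR processing time in minutes."""
    return pages / pages_per_minute


def errors_per_thousand_chars(errors: int, characters: int) -> float:
    """Spellcheck errors per 1,000 characters of recognised text."""
    return errors / characters * 1000


print(f"OCR time: about {ocr_minutes(PAGES, PAGES_PER_MINUTE):.1f} minutes")
print(
    "Error rate: at most "
    f"{errors_per_thousand_chars(SPELLCHECK_ERRORS, CHARACTERS):.2f} "
    "errors per 1,000 characters"
)
```

In other words, the whole file would take around five minutes to process, with well under one flagged error per 1,000 characters.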

Of course, for more complex page layouts (tables etc.), more manual work is still needed in order to give the output DOC file the 'as-original' look.

Let me repeat esperantisto's question:

Luke, what do you mean by: "my problem is that when I convert them to Word format they are still like images, so I am unable to do a word count without re-typing the PDF into a word document."?

That's not an image-to-text conversion, for sure.

Best regards

AM


 
Rachel Waddington
United Kingdom
Local time: 09:41
Dutch to English
Do you really need to convert the files? Aug 19, 2015

In this situation (assuming you don't manage to get a good result with the OCR software) I would just quote a rate based upon the target language word count and type the translation into a Word file. Agencies have always been happy with this approach.

 
Chris Pr
United Kingdom
Local time: 09:41
German to English
Very true... Aug 19, 2015

Very true for agency work.

But direct clients can be charged a premium for providing a fully translated "clone" of the original PDF, complete with all charts, diagrams, images etc. perfectly formatted as in the original document - albeit in the DOCX format that these conversion programs tend to export.


 
Rafael Harriet
Spain
Local time: 10:41
German to Spanish
One more option Aug 20, 2015

I usually work with ABBYY FineReader and the quality is very good.

My two cents. Good luck!!


 
Luke Mersh
United Kingdom
Local time: 09:41
Spanish to English
TOPIC STARTER
Abbyy finereader Aug 20, 2015

After reading your posts, and having already done a webinar on OCR, I have decided to use the trial of ABBYY FineReader, which seems to do a good job.


 
Andrzej Mierzejewski
Poland
Local time: 10:41
Polish to English
just quote a rate based upon the target language...? Aug 20, 2015

Rachel Waddington wrote:

In this situation (assuming you don't manage to get a good result with the OCR software) I would just quote a rate based upon the target language word count and type the translation into a Word file. Agencies have always been happy with this approach.


So, do you think an agency would wait until the job is done, and only then the translator payment would be calculated, and the customer would know how much they should pay?

Well, in my country the agencies normally know the job size, whether in words or characters, when they ask translators for availability. A good agency is expected to have and use reliable OCR software just in order to tell their customer the price.

It's very unusual for an agency not to know the job size, as the work time and all invoices depend on it. It can happen when all the staff are young and inexperienced - but just once, and not again.

Let's not allow ourselves to do the agencies' job! Remember that they take a significant share of what the customers pay.

Regards

AM

[Edited at 2015-08-20 09:12 GMT]


 
Platary (X)
Local time: 10:41
German to French
Strange Aug 20, 2015

Andrzej Mierzejewski wrote:

A good agency is expected to have and use reliable OCR software just in order to tell their customer the price.



So "a good agency" should be able to send the translator an editable text, shouldn't it?

If not, it's not "a good agency". And that is the case here.

[Edited at 2015-08-20 09:27 GMT]


 
Rachel Waddington
United Kingdom
Local time: 09:41
Dutch to English
Yes Aug 20, 2015

Andrzej Mierzejewski wrote:

Rachel Waddington wrote:

In this situation (assuming you don't manage to get a good result with the OCR software) I would just quote a rate based upon the target language word count and type the translation into a Word file. Agencies have always been happy with this approach.


So, do you think an agency would wait until the job is done, and only then the translator payment would be calculated, and the customer would know how much they should pay?

Well, in my country the agencies normally know the job size, whether in words or characters, when they ask translators for availability. A good agency is expected to have and use reliable OCR software just in order to tell their customer the price.

It's very unusual for an agency not to know the job size, as the work time and all invoices depend on it. It can happen when all the staff are young and inexperienced - but just once, and not again.

Let's not allow ourselves to do the agencies' job! Remember that they take a significant share of what the customers pay.

Regards

AM

[Edited at 2015-08-20 09:12 GMT]


Yes, in cases where the agency cannot provide an editable text I would always propose invoicing based on the target text and this has never been a problem. It's becoming less common nowadays, but still happens occasionally. In any case I would regard it as the agency's job to do the OCRing, not mine. Direct clients are a different thing, obviously.


 