Pages in topic:   < [1 2]
OCR-ing graphics embedded in Word?
Thread poster: pj-ffm
pj-ffm
pj-ffm
Local time: 19:16
German to English
TOPIC STARTER
Slight issue, convert text to table is greyed out... Jun 15, 2011

Hi István,

I really appreciate you looking into this!

I've played around with your method and have a slight issue:

When I "select all" I can no longer choose to "convert text to table"; it is greyed out.

I have varied the area selected, and have discovered that (at least) when there is a change in the page format from portrait to landscape within the doc it no longer allows me to convert to table. (Unfortunately the docs I am currently translating have quite a few page format changes like this to allow for wide tables, graphics etc.)

Any suggestions, or is this method just not going to work with such docs?

I think the concept you suggested is very promising anyway and will see if I can sort out a workflow for "normal" files that don't change as described above. And subsequently (hopefully) enhance it to append the OCR'd text below the graphic in each case rather than replacing it. That would represent a perfect solution that would allow a document to be "pre-processed" with minimum effort and then translated from start to finish with the CAT as usual! What a utopian concept!

cheers,
Peter.

UPDATE: It seems that whenever the selected area traverses a "section break" then "text to table" is greyed out.

[Edited at 2011-06-15 09:43 GMT]

Also, in my document some paragraph numbering and "-" characters are taken into the graphics column in addition to the actual graphics for some reason. But this can be cleaned.

[Edited at 2011-06-15 09:45 GMT]


 
István Hirsch
István Hirsch  Identity Verified
Local time: 19:16
English to Hungarian
Some unwanted characters in the graphics column Jun 15, 2011

Perhaps they ended up there because they had a tab character in front of them.
As a first step, did you try temporarily replacing every tab in the document with something else?
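The temporary-replacement idea can be sketched outside Word as a plain string round trip. This is only an illustration in Python, not the Word procedure itself; the function names and the candidate characters are made up for the example. It checks that the placeholder does not already occur in the text, which matters for documents where a character such as § is already in use.

```python
def pick_placeholder(text, candidates=("§", "¤", "¦")):
    """Return the first candidate character that does not occur in text.

    The candidates are single characters (an arbitrary, hypothetical choice),
    so a placeholder that is absent from the text can always be restored
    without ambiguity.
    """
    for candidate in candidates:
        if candidate not in text:
            return candidate
    raise ValueError("every candidate placeholder already occurs in the text")


def hide_tabs(text):
    """Replace each tab with a safe placeholder; return (new_text, placeholder)."""
    placeholder = pick_placeholder(text)
    return text.replace("\t", placeholder), placeholder


def restore_tabs(text, placeholder):
    """Undo hide_tabs by turning the placeholder back into tabs."""
    return text.replace(placeholder, "\t")


if __name__ == "__main__":
    sample = "§ 5\tAbs. 2\tText"
    hidden, placeholder = hide_tabs(sample)
    print(repr(hidden), repr(placeholder))
    print(restore_tabs(hidden, placeholder) == sample)
```

In Word itself the same check can be done by hand: search for the intended placeholder first, and only use it if Word reports no matches.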


 
pj-ffm
pj-ffm
Local time: 19:16
German to English
TOPIC STARTER
I replaced tabs with §§§ Jun 15, 2011

Maybe §§§ was a bad choice..?

I can probably work out how to avoid the odd characters and auto-numbering being incorporated with further analysis. It does make me wonder whether after re-importing the column, the rest of the doc will still keep its formatting though...

Do you think the "section break" issue is resolvable, or is this just a "feature" of Word?

cheers,
Peter.


 
István Hirsch
István Hirsch  Identity Verified
Local time: 19:16
English to Hungarian
Ideas Jun 15, 2011

Perhaps temporarily replacing the section breaks (^b) can solve some of the problems (the greying out).
If some characters keep going into the graphics column, make the non-printing characters visible by clicking the paragraph mark (¶) in the toolbar. Then check which characters come directly before the "misbehaving" ones that end up in the graphics column. If this remains unsolved, I would send them to OCR too rather than delete them.


 
Eugenia Morris
Eugenia Morris
United Kingdom
Local time: 18:16
English to Russian
+ ...
thanks! Jun 15, 2011

István Hirsch wrote:

I think it is also possible to convert the whole Word file (as is) into a Pdf file (with a free pdf maker or Adobe), OCR this Pdf file, and translate the product of OCR in Word.


it sounds so obvious, I am just wondering why I have never used it. Thanks a lot!


 
pj-ffm
pj-ffm
Local time: 19:16
German to English
TOPIC STARTER
I guess I need a list of "codes" used by Word Jun 16, 2011

István Hirsch wrote:

Perhaps temporarily replacing the section breaks (^b) can solve some of the problems (the greying out).
If some characters keep going into the graphics column, make the non-printing characters visible by clicking the paragraph mark (¶) in the toolbar. Then check which characters come directly before the "misbehaving" ones that end up in the graphics column. If this remains unsolved, I would send them to OCR too rather than delete them.


I wasn't aware that it was possible to search/replace things like section breaks, but this would most likely solve the problem. (I'll have to play around with this)

When I look at the document in draft mode, I see two types of section break (I'm not sure of the English names: "Nächste Seite" (Next Page) and "Fortlaufend" (Continuous?)) that I guess have different codes. (Do these have something to do with the chapter numbering?)

Is there a comprehensive list of codes used in Word and/or can I reveal them in the actual file?

I will investigate the spurious characters and occasional chapter number that gets put into the second column in more depth. I'm sure I can sort that out.

cheers,
Peter.
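On revealing the breaks in the actual file: if the document is saved as .docx, the section information can be read straight out of the package. The following is a rough Python sketch, assuming the usual (transitional) WordprocessingML namespace and assuming that a missing break type means "nextPage" and a missing orientation means "portrait". The function names are invented for the example, and documents with tracked section changes may list extra entries.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by word/document.xml in ordinary .docx files (assumption:
# the file is not saved in the "strict" Open XML variant).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def sections_from_xml(xml_bytes):
    """Return a (start_type, orientation) tuple for every <w:sectPr> element."""
    root = ET.fromstring(xml_bytes)
    sections = []
    for sect in root.iter(f"{{{W}}}sectPr"):
        kind = sect.find(f"{{{W}}}type")
        size = sect.find(f"{{{W}}}pgSz")
        start_type = kind.get(f"{{{W}}}val") if kind is not None else None
        orientation = size.get(f"{{{W}}}orient") if size is not None else None
        # Fall back to the assumed defaults when the attribute is absent.
        sections.append((start_type or "nextPage", orientation or "portrait"))
    return sections


def list_sections(docx_path):
    """Open a .docx file as a zip archive and list its sections."""
    with zipfile.ZipFile(docx_path) as archive:
        return sections_from_xml(archive.read("word/document.xml"))
```

Calling `list_sections("mydoc.docx")` (a hypothetical file name) gives one entry per section, which makes it easy to see where the portrait/landscape switches sit before trying "convert text to table" on a selection.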


 
István Hirsch
István Hirsch  Identity Verified
Local time: 19:16
English to Hungarian
Hi Peter Jun 16, 2011

Your file seems much more complex than the sample file my suggestion was based on.
To go on (perhaps to find another solution that better matches your file), I would need to see at least part of the file ([email protected]). If it is sensitive, do this on the copy you send me(!): replace all letters with x

Go to Edit/Replace, tick the "Use wildcards" box and enter Find: [a-zA-Z] Replace with: x

Cheers
István
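For anyone who wants to see what this masking does to a text, here is a small Python sketch of the same idea. It is an illustration only: Python's regex engine is not Word's wildcard search, and Word may treat the range differently. In Python, [a-zA-Z] leaves accented letters and ß untouched, so the second function (an addition for comparison, not part of the suggestion above) masks every alphabetic character.

```python
import re


def mask_ascii_letters(text):
    """Replace only the unaccented letters a-z and A-Z with 'x'."""
    return re.sub(r"[a-zA-Z]", "x", text)


def mask_all_letters(text):
    """Replace every alphabetic character, including ä, ö, ü and ß, with 'x'."""
    return "".join("x" if ch.isalpha() else ch for ch in text)


if __name__ == "__main__":
    sample = "Größe: 15 MB"
    print(mask_ascii_letters(sample))  # xxößx: 15 xx
    print(mask_all_letters(sample))    # xxxxx: 15 xx
```

Digits, punctuation, tabs and line breaks are left alone in both versions, so the layout that causes the problem should survive the masking.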


 
Oliver Walter
Oliver Walter  Identity Verified
United Kingdom
Local time: 18:16
German to English
+ ...
docx is zip Jun 16, 2011

Natalie wrote:
1) you cannot make a ZIP file by renaming anything
2) if the images are embedded they are part of the doc file
3) what would you expect from a ZIP? It is just an archive


Well, surprisingly, you can make a Zip file by renaming - I have just done it. Obviously it's a special situation (the file was in fact already in a Zip format), which in my case is this:
  1. I have Word2000 and I added the Microsoft Office converters (from the Microsoft website; an executable file of about 40 megabytes) a couple of weeks ago, so now it can read and write .docx and .docm files
  2. I took a document recently sent to me, containing a picture and some text, and sent as an RTF file (it was probably made using a recent version of Microsoft Works and my friend sends these files as .rtf for compatibility with a range of word processors)
  3. I opened this .rtf file with Word2000 and saved it as a file ZZZ.docx
  4. I used Windows Explorer to find ZZZ.docx and rename it to ZZZ.zip
  5. I double-clicked it in Windows Explorer: WinZip opened it and I saw that it contained the image (called image1.jpeg) and 14 other files, most of them called something.xml
  6. I opened image1.jpeg with an image editor and it was indeed the correct image
This confirms what I saw somewhere 2 or 3 years ago: .docx is a format in which the text and formatting information in a file is stored as .xml files, the images in a standard image format, and the whole is then compressed using a Zip-compatible method.

Oliver
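The renaming step is not strictly needed if a script does the unpacking. As a minimal sketch in Python, assuming the document is a .docx and that the pictures are stored under word/media/ inside the archive (the usual place, but not something verified for every file), the embedded images can be listed and copied out for OCR like this. The function names are made up for the example.

```python
import zipfile

MEDIA_PREFIX = "word/media/"  # assumed location of embedded pictures


def list_embedded_images(docx_path):
    """Return the archive names of all files stored under word/media/."""
    with zipfile.ZipFile(docx_path) as archive:
        return [name for name in archive.namelist()
                if name.startswith(MEDIA_PREFIX)]


def extract_embedded_images(docx_path, target_dir):
    """Copy the embedded pictures into target_dir and return their names."""
    with zipfile.ZipFile(docx_path) as archive:
        names = [name for name in archive.namelist()
                 if name.startswith(MEDIA_PREFIX)]
        archive.extractall(target_dir, members=names)
    return names
```

For the file described above, `list_embedded_images("ZZZ.docx")` would be expected to include the image1.jpeg entry; the extracted files keep their word/media/ sub-folder inside the target directory.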


 
pj-ffm
pj-ffm
Local time: 19:16
German to English
TOPIC STARTER
Hi István Jun 16, 2011

István Hirsch wrote:

Your file seems much more complex than the sample file my suggestion was based on.
To go on (perhaps to find another solution that better matches your file), I would need to see at least part of the file ([email protected]). If it is sensitive, do this on the copy you send me(!): replace all letters with x

Go to Edit/Replace, tick the "Use wildcards" box and enter Find: [a-zA-Z] Replace with: x

Cheers
István


I really appreciate the time you're putting into this.
I think if we can find a solution here it would be useful to quite a few people.

It's a fairly large document (around 15MB) so it's maybe better if I convert a section of it (hoping the "nasty" stuff is included) for you to experiment with. Where should I send it?

cheers,
Pete.


 
István Hirsch
István Hirsch  Identity Verified
Local time: 19:16
English to Hungarian
Hi Peter Jun 16, 2011

Please send only a little problematic part to my email address above.

Cheers
István


 