This week a customer from an advertising agency I work for sent me a PDF file to translate; it contained a newspaper article she had scanned. What struck me right away was that the whole article was an image and that it was laid out in a number of columns with a headline and large picture at the top. It looked quite nice as pieces of journalism go, but how did she expect me to prepare a quotation and translate the text? To be honest, I don't think she realised what she was asking of me; these days it's so common for people in PR, sales and marketing to work with PDFs of glossy-looking articles that they don't realise how tricky they can be for translators to deal with.
One way of handling such image files is to load them into Adobe Acrobat®, a powerful but pricey application for creating and editing PDF documents. You import the file and then process it using the optical character recognition (OCR) feature. In theory, this "captures" any text found in the image and makes it editable. After doing that and saving the results as a new PDF file, you can export it to an external word-processing application like Microsoft Word® and then check whether all the text has been captured and reproduced correctly. It's only once this last step has been taken that you can actually start translating.
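Part of that final check can be done mechanically before you read through the text yourself. Here is a minimal sketch in Python, assuming you have saved the OCR output as plain text. The function name and the heuristic are my own illustration, not something Acrobat or Word provides: it simply flags words that mix letters and digits, which is a typical sign of a misread character (an "o" captured as "0", say).

```python
import re

def suspicious_tokens(text):
    """Return (line_number, token) pairs that mix letters and digits.

    Hypothetical helper: a rough heuristic for spotting OCR misreads,
    not a replacement for proofreading the converted file.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Split each line into runs of letters and digits.
        for token in re.findall(r"[^\W_]+", line):
            has_letter = any(ch.isalpha() for ch in token)
            has_digit = any(ch.isdigit() for ch in token)
            if has_letter and has_digit:
                hits.append((line_no, token))
    return hits

sample = "The c0lumn layout\nwas fine in 2014"
print(suspicious_tokens(sample))
```

It will also flag legitimate items such as "4th", so treat the result as a list of places to look rather than a list of errors.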
If you do most of your translating in a CAT tool, you may also want to run the editable file through a utility such as Dave Turner's CodeZapper before importing it. This can reduce the number of formatting "tags" or "codes" in the file, which otherwise appear in the translation grid and slow you down, as you have to insert them in your translation one by one.
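To picture why converted files are so tag-heavy, think of the text as a series of formatting "runs": OCR output often splits a single word into several runs even though they all look the same. The toy Python sketch below merges neighbouring runs that share identical formatting. It only illustrates the general idea; it is not how CodeZapper actually works, and both the run representation and the assumption of one tag per run boundary are mine.

```python
def merge_runs(runs):
    """Merge neighbouring (text, formatting) runs with identical formatting."""
    merged = []
    for text, fmt in runs:
        if merged and merged[-1][1] == fmt:
            # Same formatting as the previous run: join the two texts.
            merged[-1] = (merged[-1][0] + text, fmt)
        else:
            merged.append((text, fmt))
    return merged

def tag_count(runs):
    # Assumption for illustration: one tag at each boundary between runs.
    return max(len(runs) - 1, 0)

# Hypothetical segment as it might come out of an OCR conversion.
segment = [("Trans", "plain"), ("lat", "plain"), ("ion ", "plain"), ("tools", "bold")]
cleaned = merge_runs(segment)
print(tag_count(segment), "tags before,", tag_count(cleaned), "after")
```

In this made-up segment only the boundary before the bold word is a real formatting change, so that is the only tag left after merging.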
Well, after creating an editable Word file from Adobe Acrobat XI and not being very impressed with the outcome, I remembered a blog post that Dominique Pivard wrote a while ago about handling scanned PDFs using a Web-based CAT tool called Wordfast Anywhere®. Dominique has made a large number of short but generally very instructive videos on CAT tools that you can watch on his blog or on YouTube for free, and this is one of them.
I watched the video twice (just to make sure I'd understood everything!), set up a free user account on the Wordfast Anywhere site and then uploaded the original scanned PDF file to it. You need to create a translation memory and set the source and target languages before it processes the PDF, but once you've done that, you're off! The Wordfast Anywhere server processed my PDF file using a powerful OCR algorithm and converted it into an editable file in just a few minutes. It lets you either translate the output in a Wordfast environment directly on the server or download the file and translate it by other means if you wish (e.g. in a desktop CAT tool).
The conversion results were very good and the file didn't need much fine-tuning at all – it was better than Acrobat's output and didn't cost me a penny.
Many thanks to Dominique for making his video tutorial. If you'd like to watch it, then just click here. (The 4-minute video will start running as soon as the page has loaded in your browser.)
Carl
image: Wordfast logo © Wordfast LLC