PDF Translation Complications
PDF files are so prevalent because they preserve content in its designated format, irrespective of the operating system or software displaying the file. This is important because different browsers, applications, and operating systems render fonts, colors, and layouts slightly differently. A PDF functions something like a digital photocopy. Unfortunately, the same encoding and rendering attributes that provide this constancy and control also mean that the tools we use for translation projects can struggle (or fail) to parse text correctly, especially in design-heavy or graphically complex documents. This can result in quality issues, added cost, or unpleasant surprises for the client.
The first thing to know is that there are really two kinds of PDFs – digitally created PDFs in which the text is encoded in its own layer, allowing it to be parsed and searched relatively easily, and image PDFs that are usually based on a photograph or a scanned image.
Both kinds present problems for translation projects, but in different ways:
- Image PDFs are often not parseable at all, and must be manually re-created or analyzed through optical character recognition (OCR) tools, which can add cost or introduce errors.
- Text PDFs can be parsed through Acrobat or OCR, but images, typography, and intricate designs must be manually re-created by our staff, which duplicates careful work done by the original design team but with less control.
Of the two, text PDFs that were composed digitally represent both the lion’s share of the PDFs that we see and the most easily avoided problem. That’s because they were almost always composed in-house using software that we know how to use, like Adobe InDesign and Illustrator or Microsoft Office. Lots of folks send PDFs because they are portable, easily packaged, or just out of habit. However, in this case, we really need the original source file. If we have to replicate the original design specs based on the PDF, we usually have to charge a fee for Desktop Publishing Services, billed at $65/hr. We are willing to bet your designer would prefer that we get the original too, rather than have us go back and match fonts by eye, manually lay out images with text, and reverse-engineer all the little details.
Exporting the content of a PDF into an editable format introduces a variety of issues, including:
- Poor machine parsing of text – confusion regarding word breaks, leading to poor machine readability of segments
- Mismatched fonts
- Mismatched text spacing
- Color changes, both shifts in hue and inconsistencies within solid blocks of color
Of these issues, the most time-consuming is poor machine readability of the text, because it interferes with word counts and with identifying repetitions (to learn more about translation memories and repetitions, please see our Localization 101 page). These errors require our staff to go over every piece of text in the document before translation to ensure that word counts are correct and that segments are properly identified, which is a crucial component in establishing terminological consistency.
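To make the word-count and repetition problem concrete, here is a minimal Python sketch. It is only an illustration: the sample sentences, the stray break, and the two helper functions are all made up for this example, and exact string matching is a deliberate simplification, not how memoQ or any other translation tool actually analyzes a file.

```python
from collections import Counter


def word_count(segments):
    """Count whitespace-separated words across all segments."""
    return sum(len(seg.split()) for seg in segments)


def repetitions(segments):
    """Count segments that exactly repeat an earlier segment."""
    counts = Counter(segments)
    return sum(n - 1 for n in counts.values())


# Text as it would come from the original source file (hypothetical sample).
clean = [
    "Press the power button.",
    "Wait for the green light.",
    "Press the power button.",
]

# The same text after a hypothetical bad export: a stray break
# has split "button" into two pieces in the third segment.
damaged = [
    "Press the power button.",
    "Wait for the green light.",
    "Press the power but ton.",
]

print(word_count(clean), repetitions(clean))      # 13 1
print(word_count(damaged), repetitions(damaged))  # 14 0
```

A single stray break adds a word to the count and hides a repetition that would otherwise have been recognized, which is why every segment has to be checked by hand before translation starts.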
The visual issues in the export also require fixing. While our support staff have serious chops, making them fix things that could have remained unbroken from the start just doesn’t make a lot of sense.
For image PDFs, clients should expect to see a small fee on a quote for document recreation. It’s usually around $65-$130, but potentially more, depending on size and complexity. This entails one of our staff members either running OCR, where possible, and manually checking every word, or re-typing the entire source document so that we can import the strings into memoQ.
So, in short, we can actually translate that PDF – but not without some hassle and/or added cost, which is almost always avoidable. If you have that source file, save yourself some time and your designer some stress and send it over. Everyone will be happier.
Want to get going on a specific project? Hit our quote page to upload files, enter parameters, and start talking specifics.