Sunday, February 9, 2014

Custom PDF Font Encoding: Why You Should Care and What You Can Do About It

As part of my production workflow, I regularly get PDFs of construction work orders. They are highly technical documents that contain all kinds of important information. Recently, I've come across some of these PDFs that were non-searchable. It took me a while to figure out the problem, because the PDFs weren't scanned. They had live text. I could highlight and comment on the text using PDF commenting tools. If it weren't live text, I wouldn't have been able to use the text annotation tools.

And by non-searchable, I mean this: I see the word "work" in the body of the PDF, but when I search for that word using Acrobat's Find function, no matches are found.

So I tried searching for the word "WORK." But again, no matches are found. What the heck?! 

I tried running the PDF through a couple of different PDF conversion programs (PDF2ID and PDF2DTP) and got nowhere.

As converted by PDF2ID (Recosoft)

As converted by PDF2DTP (Markzware)

One of the conversion tools did give me an error message, but it was less than helpful.

I tried copying text from the PDF and pasting it into a text editor and my email program, and I got the same gibberish.

For kicks, I tried saving the troublesome PDF as a Microsoft Office document. Not only did Acrobat save it out correctly with editable text, it even converted the text highlights!

Fast forward a few weeks, and I got to thinking that perhaps the inability to search within the PDF had something to do with fonts. So I went to the troublesome PDF and looked at the Fonts tab within Document Properties. The encoding was listed as "Custom." Now, I'm neither a font developer nor a PDF developer, but rest assured I had never seen "Custom" encoding before. I'm used to seeing things like "Ansi" or "Identity-H."

Custom Font Encoding

So I opened a non-troublesome (fully searchable) PDF from the same client and checked the font encoding there. It was "Built-in."

Built-in Font Encoding

Now, I don't know what "Built-in" means either, but I know that those PDFs are searchable. A quick scan of Google led me to other people who have the same problem. While there doesn't seem to be an easy way to simply change the font encoding, I have come up with a solution.

I remember reading a few years ago that adding tags to a PDF somehow fixes the document so that when you select a paragraph of text and then copy and paste it into InDesign, you won't get hard returns at the end of each line. I also know that tags are really important for making "Accessible" documents. Now normally, I never have to create Accessible documents, so I don't bother with learning all the details involved in their creation. But on a hunch, I decided to add tags to the troublesome PDF and see what happened.

After Acrobat added tags to the document, a quick check of the Fonts tab in Document Properties revealed something new. There was now something listed below the Custom font encoding.

Now I tried doing a Find. And sure enough, the Find function now worked as expected! Adding Tags to the document somehow fixed the weird font issue and also made it so that I could convert the PDF using my PDF to InDesign conversion tools.

If you're interested in taking a deeper dive into PDF tags, check out the article What are “PDF tags” and why should I care?