As part of my production workflow, I regularly get PDFs of construction work orders. They are highly technical documents that contain all kinds of important information. Recently, I've come across some of these PDFs that were non-searchable. It took me a while to figure out the problem, because the PDFs weren't scanned. They had live text. I could highlight and comment on the text using PDF commenting tools. If it weren't live text, I wouldn't have been able to use the text annotation tools.
And by non-searchable, I mean this: I see the word "work" in the body of the PDF, but when I search for that word using Acrobat's Find function, no matches are found.
So I tried searching for the word "WORK." But again, no matches are found. What the heck?!
|As converted by PDF2ID (Recosoft)|
|As converted by PDF2DTP (Markzware)|
However, one of the conversion tools gave me a less-than-helpful error message.
I tried copying text from the PDF and pasting it into a text editor and my email program, and I got the same gibberish.
For kicks, I tried saving the troublesome PDF as a Microsoft Office document. Not only did Acrobat save it out correctly with editable text, it even converted the text highlights!
Fast forward a few weeks and I got to thinking that perhaps the inability to search within the PDF had something to do with fonts. So I went to the troublesome PDF and looked at the Fonts tab within Document Properties. The encoding was listed as "Custom." Now, I'm neither a font developer nor a PDF developer, but rest assured I had never seen "Custom" encoding before. I'm used to seeing things like "Ansi" or "Identity-H."
|Custom Font Encoding|
So I opened a non-troublesome (fully searchable) PDF from the same client and checked the font encoding there. It was "Built-in."
|Built-in Font Encoding|
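Why might a font's encoding matter for searching? One plausible explanation, and it is only a guess about this particular file, is that a PDF stores text as character codes, and those codes only become searchable letters if the viewer has a way to map them back to Unicode. The toy Python sketch below illustrates that idea. It is not a PDF parser: the encoding table, the byte values, and the function names (`render`, `extract_text`, `find`) are all made up for illustration, and the Latin-1 fallback merely stands in for a viewer guessing at the meaning of the codes.

```python
from typing import Dict, Optional

# Toy model only: every table and byte value here is hypothetical.
# A made-up "Custom" encoding in which the font assigns arbitrary codes to glyphs.
CUSTOM_ENCODING: Dict[int, str] = {0x01: "w", 0x02: "o", 0x03: "r", 0x04: "k"}

# What the page would store for the visible word "work" under that encoding.
content_bytes = bytes([0x01, 0x02, 0x03, 0x04])


def render(codes: bytes) -> str:
    """What you see on screen: the font draws the right glyph for each code."""
    return "".join(CUSTOM_ENCODING[c] for c in codes)


def extract_text(codes: bytes, to_unicode: Optional[Dict[int, str]] = None) -> str:
    """What Find and copy/paste work with: codes mapped back to characters.

    With no code-to-character table, this sketch reads the codes as Latin-1,
    which stands in for a viewer that has to guess.
    """
    if to_unicode is None:
        return codes.decode("latin-1")
    return "".join(to_unicode[c] for c in codes)


def find(needle: str, codes: bytes,
         to_unicode: Optional[Dict[int, str]] = None) -> bool:
    """Case-insensitive search over the extracted text, like a Find box."""
    return needle.lower() in extract_text(codes, to_unicode).lower()


print(render(content_bytes))                         # the page looks fine
print(find("work", content_bytes))                   # no mapping, so no match
print(find("WORK", content_bytes, CUSTOM_ENCODING))  # mapping present, match found
```

In this sketch the page renders correctly either way, but the search only succeeds once a code-to-character table is available, which matches the symptoms described above: text that looks live, yet can't be found and pastes as gibberish.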
I remember reading a few years ago that adding tags to a PDF somehow fixes the document so that when you select a paragraph of text and then copy and paste it into InDesign, you won't get hard returns at the end of each line. I also know that tags are really important for making "Accessible" documents. Now normally, I never have to create Accessible documents, so I don't bother with learning all the details involved in their creation. But on a hunch, I decided to add tags to the troublesome PDF and see what happened.
After Acrobat added tags to the document, a quick check of the Fonts tab in Document Properties revealed something new. There was now something listed below the Custom font encoding.
Now I tried doing a Find. And sure enough, the Find function worked as expected! Adding tags to the document somehow fixed the weird font issue and also made it possible to convert the PDF using my PDF-to-InDesign conversion tools.
If you're interested in taking a deeper dive into learning about PDF tags, check out this article at AcrobatUsers.com: What are “PDF tags” and why should I care?