As part of my production workflow, I regularly get PDFs of construction work orders. They are highly technical documents that contain all kinds of important information. Recently, I've come across some of these PDFs that were non-searchable. It took me a while to figure out the problem, because the PDFs weren't scanned. They had live text. I could highlight and comment on the text using PDF commenting tools. If it weren't live text, I wouldn't have been able to use the text annotation tools.
And by non-searchable, I mean this: I see the word "work" in the body of the PDF, but when I search for that word using Acrobat's Find function, no matches are found.
So I tried searching for the word "WORK." But again, no matches are found. What the heck?!
As converted by PDF2ID (Recosoft)
As converted by PDF2DTP (Markzware)
However, one of the conversion tools gave me a less-than-helpful error message.
I tried copying and pasting text from the PDF into a text editor and my email program, and I got the same gibberish.
For kicks, I tried saving the troublesome PDF as a Microsoft Office document. Not only did Acrobat save it out correctly with editable text, it even converted the text highlights!
Fast forward a few weeks and I got to thinking that perhaps the inability to search within the PDF had something to do with fonts. So I went to the troublesome PDF and looked at the Fonts tab within Document Properties. The encoding was listed as "Custom." Now, I'm neither a font developer nor a PDF developer, but rest assured I had never seen "Custom" encoding before. I'm used to seeing things like "Ansi" or "Identity-H."
Custom Font Encoding
So I opened a non-troublesome (fully searchable) PDF from the same client and checked the font encoding there. It was "Built-in."
Built-in Font Encoding
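Why would an encoding matter for searching? Here is a toy sketch in Python of one plausible explanation. It is an assumption on my part, not something verified against the troublesome PDF: if a font assigns its own arbitrary codes to glyphs and the file carries no mapping from those codes back to real characters, the page can look perfect while search and copy see only the raw codes. The code table and function names below are hypothetical and exist only for illustration.

```python
from typing import Dict, Optional

# Hypothetical custom encoding: the font assigns arbitrary codes to glyphs.
# (Illustrative only; not extracted from any real PDF.)
CUSTOM_ENCODING: Dict[int, str] = {0x01: "w", 0x02: "o", 0x03: "r", 0x04: "k"}

# What the page content stores for the visible word "work": glyph codes.
stored_codes = bytes([0x01, 0x02, 0x03, 0x04])


def extract_text(codes: bytes, to_unicode: Optional[Dict[int, str]] = None) -> str:
    """Return the text that a search or copy operation would see."""
    if to_unicode is None:
        # No code-to-character mapping: fall back to reading codes as Latin-1.
        return codes.decode("latin-1")
    # With a mapping, each code is translated to the character it draws.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


def find(codes: bytes, needle: str, to_unicode: Optional[Dict[int, str]] = None) -> bool:
    """Case-insensitive search over the extracted text."""
    return needle.lower() in extract_text(codes, to_unicode).lower()


# Without the mapping the search misses, even though the glyphs spell "work".
print(find(stored_codes, "work"))
# With the mapping the same stored codes become searchable.
print(find(stored_codes, "work", CUSTOM_ENCODING))
```

In this model the word on the page never changes. Only the presence of the lookup table decides whether Find, copy and paste, or a conversion tool gets readable text or gibberish.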
I remember reading a few years ago that adding tags to a PDF somehow fixes the document so that when you select a paragraph of text and then copy and paste it into InDesign, you won't get hard returns at the end of each line. I also know that tags are really important for making "Accessible" documents. Now normally, I never have to create Accessible documents, so I don't bother with learning all the details involved in their creation. But on a hunch, I decided to add tags to the troublesome PDF and see what happened.
After Acrobat added tags to the document, a quick check of the Fonts tab in Document Properties revealed something new. There was now something listed below the Custom font encoding.
Then I tried doing a Find. And sure enough, the Find function now worked as expected! Adding tags to the document somehow fixed the weird font issue and also made it possible to convert the PDF using my PDF-to-InDesign conversion tools.
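I can't say exactly what Acrobat changed when it added the tags. For anyone who wants to compare a PDF before and after tagging without opening Document Properties, here is a rough Python sketch that counts font-related entries in the raw file bytes. The function name and sample bytes are hypothetical, and the approach is deliberately crude: it only sees entries stored uncompressed, so a PDF that packs its objects into compressed streams will under-report. Treat the numbers as a hint, not a diagnosis.

```python
import re
from typing import Dict


def font_hints(pdf_bytes: bytes) -> Dict[str, int]:
    """Count font-related name tokens visible in the raw bytes of a PDF."""
    return {
        # Font dictionaries; the \b keeps /FontDescriptor from being counted.
        "fonts": len(re.findall(rb"/Type\s*/Font\b", pdf_bytes)),
        # Entries that declare how character codes map to glyphs.
        "encodings": len(re.findall(rb"/Encoding\b", pdf_bytes)),
        # Entries that map character codes back to Unicode text.
        "to_unicode": len(re.findall(rb"/ToUnicode\b", pdf_bytes)),
    }


# Tiny hand-written fragment standing in for a real file (illustrative only).
sample = (
    b"1 0 obj << /Type /Font /Subtype /Type1 /Encoding 2 0 R >> endobj\n"
    b"3 0 obj << /Type /FontDescriptor >> endobj\n"
)
print(font_hints(sample))

# For a real file, read it in binary mode first, for example:
# with open("before-tagging.pdf", "rb") as f:   # hypothetical filename
#     print(font_hints(f.read()))
```

Running it on both the untagged and the tagged copy and comparing the counts is one way to see whether anything font-related was added.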
If you're interested in taking a deeper dive into learning about PDF tags, check out this article at AcrobatUsers.com: What are “PDF tags” and why should I care?
This is brilliant. Very carefully laid out. It expands my understanding. Unfortunately, the "fix" of tagging my document didn't change the encoding for me... maybe because we're dealing with a whole new generation of tools. But I learned something from this in any case, and appreciate it.