PhilHibbs said:
languagegeek said:
It looks like some search-and-replace was done, turning "f+l" into "ﬂ" (Unicode U+FB02) and "f+i" into "ﬁ" (U+FB01). Either that or the version of Garamond you're using has bad character names (unlikely). I assume you're using Adobe InDesign or Quark, so turning on "ligatures" will give nicely shaped fl's while the text remains searchable.
So it can create a PDF that uses the ligatures for display but has an "alternative" searchable version of the text as well? Actually, I think I've seen scanned PDFs that are searchable: what you see is the scanned bitmap, but there's a text representation hidden somewhere that the search uses.
In a nutshell: there are characters (underlying), and there are glyphs (surface). Each character has a Unicode index number (its code point). That number is consistent from computer to computer, i.e. "à" on a Mac is the same as "à" on your PC. It didn't use to be that way, but Unicode is now virtually universal.
So, "f" has a Unicode number (0066), "l" is (006C) and "y" is (0079). When I do a search for "fly", I type those letters in the search box, and the computer looks through the PDF for 0066+006C+0079.
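To make the numbers concrete, here is a small Python sketch that prints the code points a search for "fly" is really looking for. The variable names are mine, just for illustration.

```python
# Each character in the search term maps to one Unicode code point.
word = "fly"

# ord() gives the code point as an integer; format it as 4-digit hex.
code_points = [f"{ord(ch):04X}" for ch in word]

print(code_points)  # ['0066', '006C', '0079']
```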
However, proper typography requires ligatures between certain combinations of letters, the most common being "ﬂ", "ﬀ" and "ﬁ". In the old days, the solution was to make up unique new characters for these ligatures. When Unicode came along, these precombined characters had to be included for compatibility purposes, so "ﬂ" was given the code (FB02) and so on. Typographers using old fonts or old software continued the tradition of replacing every "f+l" (0066+006C) with (FB02). That looks fine on the printed page, but in the digital age it causes problems: if I search for "fly", I won't find anything, because the "f+l" has been replaced with the single character "ﬂ" (FB02).
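The failed search is easy to reproduce. The following is only a sketch with a made-up sample string: it shows that text containing the precombined character FB02 does not match "fly", and that Unicode compatibility normalization (NFKC, via Python's standard `unicodedata` module) folds the ligature back into "f" + "l". Whether a given PDF viewer does this kind of folding when searching is a separate question that I'm not making any claim about here.

```python
import unicodedata

# Hypothetical sample text in which "f" + "l" was replaced by U+FB02.
text_with_ligature = "The \ufb02y landed."

# A plain substring search fails: the text contains FB02+0079,
# not 0066+006C+0079.
found_raw = "fly" in text_with_ligature
print(found_raw)  # False

# NFKC normalization replaces compatibility characters such as U+FB02
# with their plain equivalents ("f" followed by "l").
normalized = unicodedata.normalize("NFKC", text_with_ligature)
found_normalized = "fly" in normalized
print(found_normalized)  # True
```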
What good-quality modern fonts and half-decent desktop publishing software do is this: every time the character sequence "f+l" occurs, they display it with a surface ligature glyph "ﬂ". The glyph does *not* alter the underlying characters at all; the change is purely cosmetic, superficial. The underlying sequence remains "f+l". Thus my search for "fly" will work perfectly. In this day and age, the legacy precombined ligatures like "ﬂ" (FB02) should not be used.
Searchable scanned PDFs are a different story. Here someone ran OCR on the scans and produced a background text layer to accompany the scanned images. These text layers aren't perfect, and they really run into trouble when non-English words are present.