Hard Problems and Fuzzy Solutions

Google Book Search (‘search the full text of books and discover new ones’) now supports ‘text versions’ of some of the out-of-copyright books in the Google Book Search database. Google Blogoscoped has a report. This is interesting: Google is OCR’ing books which have been scanned and figuring out how to reconstitute a reasonable ASCII version of the underlying text. It’s also interesting that it is not possible to get a consistently good result — mind you, Blogoscoped picks a hard example, a Shakespeare text set with the old long ‘s’ characters, which OCR tends to read as ‘f’s. But computing the underlying text of a book, if you don’t already have it, is a really hard problem.
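To see why the long-s case is genuinely hard rather than a simple find-and-replace, here is a minimal sketch (my own illustration, nothing to do with Google’s actual pipeline): every ‘f’ the OCR emits might really be an ‘s’, so a single misread word fans out into multiple plausible originals, and only a dictionary or language model can pick between them.

```python
from itertools import product

def candidates(word):
    """Generate plausible originals for an OCR'd word in which the
    long s may have been read as 'f': each 'f' could really be 's'."""
    options = [('f', 's') if ch == 'f' else (ch,) for ch in word]
    return {''.join(combo) for combo in product(*options)}

# 'fong' could be the OCR's reading of 'song' -- or a real word 'fong'.
print(candidates("fong"))   # {'fong', 'song'}
# With several f's the ambiguity multiplies: 2^n candidates per word.
print(len(candidates("fatiffy")))
```

The candidate set doubles with every ambiguous character, which is why reconstructing clean text from page images needs real linguistic modelling, not just character substitution.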

So what? Well, it suggests that pirating books in Google Book Search is, and is likely to remain, a very tough proposition. It is easy to make dumb copies, and easy enough if you can invest the effort in re-keying, but making accurate, usable, automated copies, with the text in the file, from an image file? Don’t even try. Google, with all their software geniuses, can’t do it, so there is little chance of a pirate in Macao getting a quality solution. Exact Editions has a very similar production and content management system to the Google Book Search service, so it looks as though it’s going to remain very difficult to produce useful pirate issues of Exact Editions magazines unless the pirate gets access to the publisher’s copies of the PDF files. PDF files contain a lot more useful information than the dumb JPEGs that Google Book Search and Exact Editions ship out to web browsers. Publishers who care about their digital rights should be very careful with the security of their PDFs.

Fuzzy solutions? That part is easy: even a poor machine-readable text version is pretty good for automated searching; witness the way Google, Yahoo and MSFT Live already search out-of-copyright books. The fuzzy solution has been working for a while.
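The reason noisy OCR is still searchable can be sketched in a few lines. This is a toy of my own devising (the sample ‘pages’ and the single f/s normalisation rule are illustrative assumptions, not anyone’s real indexing scheme): if the index folds the common misreads at both indexing time and query time, a clean query still finds the dirty page.

```python
import re
from collections import defaultdict

def normalize(token):
    # Fold the classic long-s misread, so OCR'd 'fpring' and the
    # query 'spring' land on the same index key. (A real system
    # would use a fuller confusion table and a language model.)
    return token.lower().replace('f', 's')

def build_index(pages):
    """Map each normalized token to the set of page numbers it appears on."""
    index = defaultdict(set)
    for page_no, text in enumerate(pages, 1):
        for tok in re.findall(r"[A-Za-z]+", text):
            index[normalize(tok)].add(page_no)
    return index

def search(index, query):
    return sorted(index.get(normalize(query), set()))

# Page 1 is 'noisy OCR': 'spring' came through as 'fpring'.
pages = [
    "Sweet lovers love the fpring",
    "Rough winds do shake the darling buds",
]
index = build_index(pages)
print(search(index, "spring"))   # [1] -- the noisy page is still found
print(search(index, "buds"))     # [2]
```

The trade-off is deliberate: folding ‘f’ into ‘s’ costs a little precision (genuine f-words and s-words collide) but buys recall, which is exactly the bargain a search engine over imperfect OCR wants to make.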