Preservation of magazine content and its accessibility


Reading-Room of the Boston Public Library, 1871 (Wikimedia Commons)

Clifford Lynch has just published a thoughtful and carefully researched article on Stewardship in the “Age of Algorithms”. Lynch’s central claim is that the web, algorithm-based institutions and our contemporary social media are, in practical terms, impossible to archive and preserve via traditional approaches. The traditional approaches he has in mind are ‘format migration’ and ‘emulation’, but he appears to treat these as alternatives when they are essentially complementary: one cannot have format migration without emulation, and any software emulation requires appropriate file format inputs. Lynch gives us a handy explanation of the format migration strategy:

The traditional models of digital archiving are twofold: format migration and emulation. Both, of course, assume a substrate, which is now relatively well understood and implemented with a fairly high degree of confidence, assuming reasonably consistent and continuous funding, of bit-level preservation by migration from one storage technology to the next as necessary [19]. The first approach, format migration, is best suited to “document-like” objects: PDFs, Microsoft Word files, audio, video, XML, JPEG, TIFF, etc. Here the idea is that, as standards, or de facto standards, gradually evolve and the ecosystem to deal with those types of files shift, curators will migrate the file formats, but this strategy is not necessarily as simple as it seems. New file formats are often not isomorphic to older ones. Formats may be proprietary and/or undocumented, and even objects claiming to conform to well-known standards may not implement these standards correctly or may add proprietary extensions. (Lynch, Stewardship in the “Age of Algorithms”)
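The “substrate” Lynch mentions, bit-level preservation by migration from one storage technology to the next, can be pictured with a minimal sketch: compute a checksum before copying to new storage, copy, and verify the bits survived intact. This is an illustrative sketch only, not any institution’s actual tooling; the function names and file names are invented for the example.

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def migrate(src: Path, dst_dir: Path) -> Path:
    """Copy src to a new storage location and verify bit-level fidelity."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    before = sha256(src)
    shutil.copy2(src, dst)  # copy bytes and metadata to the new substrate
    after = sha256(dst)
    if before != after:
        raise IOError(f"bit-level corruption while migrating {src.name}")
    return dst
```

Note that this only preserves the bits; the harder problem Lynch describes, migrating the *format* those bits encode, begins after this step succeeds.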

While this may be a small correction to Lynch’s overall argument, it points to an important consequence. In digital culture, preservation — even of traditional cultural objects — is not a closed process. Once we aim to preserve analog or primarily physical cultural objects (for example photographs or books or magazines) in digital repositories and databases, we are implicitly committed to the ongoing task of enabling and facilitating new forms of access, precisely because our digital culture and its evolving technological mix will keep inventing new ways of interacting with and enjoying these traditional cultural objects. Nor is it a settled or obvious question how these improvements and developments should be pursued.

This need for an ongoing commitment to preservation struck us at Exact Editions very early in our development. We firmly believe that magazines, books and newspapers are all becoming more digital, but we also took it as an article of faith that back issues and archives hold important and valuable content that should be available to digital subscribers. Our solutions, whether through web browsers or via apps on smartphones and tablets, have therefore always ensured that archives and back issues are accessible to subscribers of current issues. Just as the software of a digital magazine has to welcome and display each new issue, so its database should reach back and awaken back issues that in a print culture are usually filed inertly and inaccessibly on shelves or in forgotten heaps.

The digital transition is nevertheless very real. The magazines that users read on branded apps are in one way strictly equivalent to the printed versions they might have read in the Boston Public Library reading room or in a Viennese café of the 19th century (with those bamboo frames that we still sometimes see), yet the reading experience and the arrangement of the digital editions are very different. Only with a digital app or a web browser could one expect to see all the decades or years of a magazine’s issues tidily piled up on a virtual desk. This stacking of issues emulates, in virtual form, a tidy arrangement that would be quite hard to achieve with printed issues, and from most points of view it is a much handier solution than collecting previous issues in carton sorters.


Slightly Foxed — incomplete archive on a physical desktop

Digital archiving, even of historic and contemporary print formats, is not easy; but it is both obviously possible and culturally necessary. The 12 complete magazine archives showcased in celebration of International Digital Preservation Day #IDPD17 are all growing, and their preservation needs in 5 or 10 years’ time may not be anticipated by their current formats and their existing software. By then we hope to have found out what mixed reality, blockchains and machine learning are surely going to teach us. The archives may in one way look the same, but they may behave a little differently.


Controlling Your Own Destiny

It continues to amaze me that many magazines do not take care to preserve an effective archive of their published issues. Most of the smaller magazines have learned that it is handy and useful to retain an archive of their back issues in PDF form. PDFs are an important insurance policy, even if the publisher plans to deliver the archive in some other way.

The biggest magazine companies often find it surprisingly difficult to lay their hands on a solid collection of PDFs. Sometimes all that they have managed to archive is the vectorised form of the PDF (which means that all the text information in the file is lost). We were running a test for a big magazine, and the publisher was appalled to be told that the repro house would charge £5,000 pa for supplying a PDF of each issue. In these straitened times publishers don’t like taking on additional costs. And why should they? Since spinning a PDF out of the workflow is simply a matter of flipping a switch at the appropriate moment, this is outrageous profiteering by the repro house.

Magazine publishers are a special case in this negligence of their archives. Newspaper publishers and book publishers have learned through bitter experience that a solid archive of PDFs of their publications is necessary. And it is easy to do. The repro houses are only trying it on: if a major magazine tells its repro house that it will lose the business unless PDFs are supplied on demand, the repro house will jump to flip the switch. The cost to them is zero.

The Trouble with Orphans

In the old dispensation, books (and magazines and newspapers) used to be published and then gradually disappear. A few copies of any particular print run would be kept in archival conditions in important libraries, but by and large they gradually mouldered away; in fact they biodegraded into mulch. Something similar happened to the ‘copyrights’, to the intellectual property that the publications crystallised. After a few decades, and with the exception of a very few masterpieces or works of genius, the intellectual property they represented was of negligible value or interest, and they would sooner or later fall into the public domain, probably before the physical book biodegraded. At that point the ‘IP’ did not matter, or rather it mattered only to the public domain.

In the last 10 years there has been a growing tumult about ‘orphan’ copyrights, or ‘orphan works’. In the eyes of some of the key critics, e.g. James Grimmelmann, the real problem with the Google Book Search proposition and the Settlement that Google is reaching with authors and publishers is all about the orphans being swept up into the maw of Google’s 7 million (and counting) scanned digital books.

But the funny thing about ‘orphan works’ is that the very category is defined by the technology which makes it possible to replace or preserve printed books with digital books that could last forever. The books and photographs are no more ‘orphan’ than they ever were; it is just that they now look as though they should be immortal rather than biodegraded, and who is to speak up for them on that score? Suddenly old and mostly forgotten copyrights seem to have some possible value, because the digital books could last for ever, and who knows but some of them (a few) will surely have considerable hidden value? To many of the critics it does not seem right that this value should accrue or crystallise to Google (and to the Authors Guild and some publishers), rather than to anybody else.

If you buy the idea that the ‘orphan’ status of a book (or some other piece of intellectual property) is more a function of the new technology than of the technology that was around when the object was born or conceived, there is an interesting corollary: as technology improves, the orphans become more valuable. The ‘orphans’ may indeed become a lot more valuable when computation advances again, as it will, especially if they can be used to construct something else, something that we don’t yet understand. Who is to say what value they have? There is a lot in the Google Books Settlement about ‘non-consumptive’ research (which roughly means ‘reading by computers and software’). Who knows how valuable that could become?
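Non-consumptive research of the kind just described can be pictured with a toy sketch: software computes aggregate statistics over a corpus, “reading” every word, without ever presenting a passage for human reading. The corpus and the function here are illustrative inventions, not anything specified by the Settlement itself.

```python
import re
from collections import Counter

def term_frequencies(corpus: list[str]) -> Counter:
    """Aggregate word counts across a corpus: the computer 'reads'
    every text, but no passage is ever displayed to a human."""
    counts = Counter()
    for text in corpus:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# A toy corpus; real non-consumptive research would run over
# millions of scanned books.
corpus = [
    "The whale swam on. The whale dived.",
    "A whale, again a whale.",
]
print(term_frequencies(corpus).most_common(2))
```

Even this trivial statistic, scaled to millions of otherwise-forgotten titles, hints at why orphan copyrights might acquire value they never had in print.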

We may get a glimpse of this when Wolfram’s intriguing Alpha project is unveiled. From Rudy Rucker’s recent blog about what Alpha portends, it certainly sounds to me as though Wolfram and his team have been doing some pretty sensitive ‘non-consumptive’ reading of key reference books:

I asked him how he is handling the daunting task of finding out all the possible scientific models. “There’s only so many linear feet of reference books that exist in the world,” remarked Wolfram. “Nowadays when I go into a library I look at the reference shelves and try and estimate how many of them we’ve picked up. I think we’re close to ninety percent by now. Right now my office is mounded with books with bookmarks for things we still need to implement, and one by one the bookmarks and the books are going away.”

When the database representation of what a book is about gets to be that powerful and expressive, non-consumptive reading is arguably more useful and valuable than the old-fashioned human kind of reading. Orphan copyrights, in a clever enough computer environment, have much more value than their publishers or authors could have imagined.

Who Wants to Own Books?

Do we really want to own books? Do institutions really want to own books? I am a bit agnostic about the first question, but I suspect that on the second question the answer is more clear cut. For quite a lot of reasons (but not because they might want to sell them) institutional libraries really do want to own the books they have. They want to be able to keep them for as long as they might need them (and a bit longer) and they want to be able to do things with them that they and their users have not yet thought about. They want ownership in the round.

These thoughts are prompted by the comment of a friend who is rather taken with the Spotify music service, which streams you as much music as you can listen to, but does it all on the fly. My friend is rather relieved no longer to be carrying his gigabytes of digital music and thinks that something similar may be on the cards for books.

Suppose that you could have access to pretty well everything ever published, but you did not actually own any of it. This new service (call it Yangtse Book Search or Facebook Library) would allow you to sample everything in its purview (search and browse access), but the full reading rights would be gently rationed. Leave to one side, for a moment, how that deal would be negotiated with the authors and publishers, and look at the matter as a consumer. Would you be willing to pay $9.99 a month for reading access to five books a month, or $19.99 a month for reading access to thirty books a month (US or UK market only — not world literature in original languages)? Once you have selected your books for the month, they remain open to you for an agreed period. A slew of specialist technical and professional books would be outside the catchment zone; they would cost extra. Household or family access via the cable channel? No problem: $29.99 a month for the 30-book offering.
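Taking the made-up tiers above at face value, the per-book arithmetic is easy to tabulate. A quick sketch, using only the hypothetical prices from the paragraph:

```python
# Hypothetical subscription tiers from the thought experiment above:
# (monthly price in dollars, books readable per month)
tiers = {
    "individual, 5 books":  (9.99, 5),
    "individual, 30 books": (19.99, 30),
    "household, 30 books":  (29.99, 30),
}

for name, (price, books) in tiers.items():
    print(f"{name}: ${price / books:.2f} per book read")
```

The heavy reader pays roughly a third as much per book as the light reader, which is the sort of gradient a usage-based royalty pool would have to accommodate.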

I don’t know about you, but I would be mostly content with access to that digital library, and with enough subscribers it would pay the necessary royalties to authors, agents and publishers (no doubt mostly usage based). But libraries would not be very happy with such a scheme. Libraries want to own books, and they want to own books which readers may never consult. Or hardly ever.

If persistent search but intermittent reading is the future of our individual enjoyment of books, widely shared but not owned, it would seem to me that the Google Books Settlement has got things upside down. In the Settlement, Google envisages delivering annual licenses to libraries for large collections of books (collections with shifting contents), while individuals will be able to buy ‘lifetime’ access (quasi-ownership) to individual titles. The asymmetry in the Settlement between the market for individuals and the market for libraries is striking, and not really explained or justified, but it may be completely the wrong way round. Perhaps Google should go back to the negotiating table with the Books Rights Registry and strike a different deal? Or is that a complementary proposition that the BRR should offer to a Google competitor? The competing service would have the franchise to sell ‘term of copyright’ licenses to libraries for particular titles and short-term access rights to individuals. That should introduce some competition, which could help to mitigate the monopoly charge.

Google’s Newspaper Project

Barbara Quint (always worth reading), in Information Today, has some interesting comments as she reports on the massive Google Newspaper archiving project.

Google’s efforts in this space are undoubtedly impressive, but the readable quality of old newspapers is inevitably poor. Take a look at this 1944 issue of the St Petersburg Times. Google can break the newspaper up into articles and can find the headline “Russians Nearing Minsk” — this is not easy, and totally cool — but the readability of a facsimile of a 1944 newspaper is going to be poor. I also find it a trifle intriguing that searching on the phrase “Russians Nearing Minsk” in the Google archive, or indeed in the complete Google web index, finds no results at all. How can that be? Did I misread the headline?

There is a real question in my mind why Google is doing this. I take it that their process is entirely (OK, 99.99%) automated; there would be no justification for doing it if it cost them significant man-hours. I also take it that they are in principle willing to digitise every newspaper. Google does not usually bother to ‘negotiate’ about what content should be put into its system: anything that can be scanned and that comes from one of their partners goes into the maw. Google does not do things by halves: see the correspondingly outlandish projection from Chad Hurley that Google’s YouTube appetite will lead to exponential growth of video on the web and in the cloud.

It is intriguing how Google’s confidence that every newspaper edition is worth digitising contrasts with the widespread gloom in the mainstream newspaper business, especially among the local papers, that it fundamentally has no basis for a profitable future. It is very puzzling that so few newspapers have made proper efforts to sell digital editions to their current subscribers, deeply puzzling given Google’s appreciation of the value of archival databases.