Digital Preservation the Planets Way

As a representative of the UK Society of Archivists, which is a member of the Digital Preservation Coalition, I attended the first day of this 3-day event on a partial DPC scholarship. It gave an overview of digital preservation and of the Planets project. Planets is a 4-year project funded by the European Community, with 16 partner organisations and a budget of 13.7 million euros, which shows a high level of commitment from the EC. The programme is due to finish in May 2010. The Planets approach wraps preservation planning services, action services, characterisation services and a testbed within an interoperability framework. It seeks to respond to the OAIS reference model, and it became clear as the day went on that a knowledge of OAIS terminology was useful in following the talks, which often referred to SIPs, AIPs and DIPs.

After a keynote by Sheila Anderson, Director of the Centre for E-Research at King's College, which touched upon some of the important principles that underlie digital preservation and outlined some projects that the Centre is involved in, we got into a day that blended general information about digital preservation with quite detailed information about the Planets services and tools.

Ross King from the Austrian Institute of Technology gave a good overview, looking at the scale of the digital universe and the challenges and incentives to preserve. Between now and 2019, the volume of content that organisations need to hold will rise twenty-five fold, from an average of 20TB to over 500TB (from Are You Ready? Assessing Whether Organisations are Prepared for Digital Preservation – PDF). We would need about 1 trillion CD-ROMs to hold all of the digital information produced in 2009. Importantly, we have now reached a point at which information creation is exceeding storage capacity, so the question of what to preserve is becoming increasingly important. I found this point interesting, as at the last talk that I attended on digital preservation we heard the old cry of ‘why not keep everything digital – storage is no problem’.

Digital preservation is about using standards, best practices and technologies to ensure access over time. With digital information, there are challenges around bit-stream preservation (bytes and hardware) and logical preservation (software and format). Expounding on the challenge of formats, King said that typically knowledge workers produce at least two thirds of their documents in proprietary formats. These formats have high preservation risks relating to limited long-term support and limited backwards-compatibility.

The preservation planning process is also vital, and Planets provides help and guidance on this. It is important to know what we want to preserve, profile the collections and identify the risks in order to mitigate them. Hans Hofman of the National Archives of the Netherlands gave an introduction to preservation planning. A preservation plan should define a series of preservation actions that need to be taken to address identified risks for a given set of digital objects or records. It is the translation of a preservation policy. He talked about the importance of looking at objects in context, and about how to prepare in order to create a preservation planning strategy: the need to understand the organisational context, the resources and skills that are available. Often small organisations simply do not have the resources, and so large organisations inevitably lead the way in this area. Hans went through the step-by-step process of defining requirements, evaluating alternatives, analysing results and ending up with recommendations with which to build a preservation plan, which then needs to be monitored over time.

Planets has developed a testbed to investigate how preservation services act on digital objects (now open to anyone). Edith Michaeler of the Austrian National Library explained that this provides a controlled environment for experimentation with your own data, as well as with structured test data that you can use (Corpora). It enables you to identify suitable tools and make informed decisions. The testbed is run centrally, so everything is made available to users and experiments can benefit the whole community. It is very much integrated within the whole Planets framework. Edith took us through the 6 steps to run an experiment: defining the properties, designing the experiment, running it, the results, the analysis and the evaluation. The testbed enables experiments in migration, load test migration and viewing in an emulator as well as characterisation and validation. So, you might use the testbed to answer a question such as which format to migrate a file to, or to see if a tool behaves in the way that you expected.

The incentives for digital preservation are many, and for businesses, incentives such as legislative compliance and clarification of rights may be important. But business decisions are generally made for the short term, on the basis of a calculated return on investment. So, maybe we need to place digital preservation in the area of risk management rather than investment. The risk needs to be quantified, which is not an easy task. How much is produced? What are the objects worth? How long do they retain their value? What does it cost to preserve? If we can estimate the financial risk, we can justify the preventative investment in digital preservation (see MK Bergman, Untapped Assets – PDF).
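These questions can be turned into a rough back-of-envelope calculation. The sketch below is purely illustrative: the function names, the idea of a single annual probability of loss and all of the figures are my own assumptions, and none of them come from the talk or from the Bergman paper.

```python
def expected_annual_loss(objects_per_year, value_per_object,
                         years_of_value, annual_loss_probability):
    """Rough expected loss per year if nothing is done to preserve the objects.

    Every input is a hypothetical estimate that an organisation would supply.
    """
    # Objects that still hold value at any one time (steady state).
    objects_at_risk = objects_per_year * years_of_value
    value_at_risk = objects_at_risk * value_per_object
    return value_at_risk * annual_loss_probability


def preservation_is_justified(expected_loss, annual_preservation_cost):
    """Preventative spending is justified when it costs less than the expected loss."""
    return annual_preservation_cost < expected_loss


# Invented figures, for illustration only.
loss = expected_annual_loss(
    objects_per_year=10_000,
    value_per_object=5.0,
    years_of_value=10,
    annual_loss_probability=0.02,
)
print(round(loss, 2))
print(preservation_is_justified(loss, annual_preservation_cost=6_000))
```

The hard part, as the speakers acknowledged, is arriving at defensible values for the inputs; the arithmetic itself is trivial once they exist.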

During the panel discussion, the idea of ‘selling’ digital preservation on the basis of risk was discussed. Earlier in the day William Kilbride, director of the Digital Preservation Coalition, talked about digital preservation as sustaining opportunities over time, and for many delegates this was much more in tune with their sentiments. He outlined the work of the DPC, and emphasised the community-based and collaborative approach it takes to raising awareness of digital preservation.

Clive Billenness went through how Planets works with the whole life-cycle of digital preservation:

1. Identify the risks
2. Assess the risks (how severe they are, whether they are immediate or long-term)
3. Plan and evaluate
4. Implementation plan
5. Update and review

The cycle will be repeated if there is a new risk trigger, which might be anything that signifies a change in practice, whether it be a change in policy, a change in the business environment or a change in the technical environment. Planets has tools to help across the whole life-cycle: the Plato Preservation Planning Tool, the Planets Characterisation Services, the Testbed and the Planets Core Registry, which is a file format registry based upon Pronom, including preservation action tools and file formats, taking a community-based approach to preservation.

Types of preservation action were explained by Sara van Bussel of the National Library of the Netherlands. She talked about logical preservation and accessing bit streams, and how interpretation may depend on obsolete operating systems, applications or formats. Sara summarised migration and emulation as preservation processes. Migration means changing the object over time to make it accessible in the current environment, whatever that may be. This risks introducing inconsistencies, functionality can be lost and quality assessment can be difficult. Migration can happen before something comes into the system or whilst it is in the system. It can also happen on access, so it is demand-led. Emulation means changing the environment over time, so no changes to the object are needed. But it is technically challenging and the user has to have knowledge about the original environment. An emulator emulates a hardware configuration. You need your original operating system and software, so they must be preserved. Emulation can be useful for viewing a website in a web archive, for opening old files, from WordPerfect files to databases, and for executing programs, such as games or scientific applications. It is also possible to use migration through emulation, which can get round the problem of a migration tool becoming obsolete.

We were told about the Planets Gap Analysis (PDF), which looked at existing file formats, and found 137 different formats in 76 institutions. The most archived file formats in archives, libraries and museums are tiff, jpg, pdf and xml, but archives hardly archive the mp3 format, while libraries and museums frequently do. Only 22% of the archived file formats were found in four or more institutions, and only two file formats, tiff and jpg, were found in over half of all institutions. So, most preservation action tools are for common file formats, which means that more obscure file formats may have a problem. However, Sara gave three examples where the environment is quite different. For DAISY, which is a format for audio books for the blind, there is a consortium of content providers who address issues arising with new versions of the format. For FITS, a format for astronomical data, digital preservation issues are often solved by the knowledgeable user-base. But with sheet music the community is quite fragmented and uncoordinated, so it is difficult to get a consensus to work together.

The Gap Analysis found that out of the top ten file formats, 9 are covered by migration tools known or used by Planets partners. XML is not covered, but it is usually the output of a file format rather than the input, so maybe this is not surprising. Many tools are flexible, so they can address many types of format, but each organisation has specific demands that might not be fulfilled by available tools.

Manfred Thaller from the University at Cologne gave a detailed account of the Planets Characterisation Services. He drew attention to the most basic layer of digital information – the 1s and 0s that make up a bit-stream. He showed a very simple image and how this can be represented by 1s and 0s, with a ‘5, 6’ to indicate the rows and columns (…or is that columns and rows – the point being that information such as this is vital!). To act on a file you need to identify it, validate it, extract information from it and undertake comparison. If you do not know what kind of file you have – maybe you have a bit-stream but do not know what it represents – DROID can help to interpret the file and it also assigns a permanent identifier to the file. DROID uses the PRONOM-based Planets File Format Registry.
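The point about the ‘5, 6’ can be made concrete with a few lines of Python. This is a minimal sketch using an invented 30-bit pattern, not the image shown in the talk: the same bit-stream is laid out once as 5 rows of 6 columns and once as 6 rows of 5 columns, and only one of the two readings gives the intended picture.

```python
def render(bits, rows, columns):
    """Lay a flat string of '0'/'1' characters out as a rows x columns grid."""
    if len(bits) != rows * columns:
        raise ValueError("bit-stream length does not match the stated dimensions")
    lines = []
    for r in range(rows):
        row = bits[r * columns:(r + 1) * columns]
        lines.append(row.replace("1", "#").replace("0", "."))
    return "\n".join(lines)


# Hypothetical bit-stream: a hollow rectangle when read as 5 rows of 6 columns.
BITS = "111111" "100001" "100001" "100001" "111111"

print(render(BITS, rows=5, columns=6))
print()
print(render(BITS, rows=6, columns=5))
```

Both calls succeed, because 30 bits fit either shape, so nothing in the bit-stream itself tells you which reading is right. That knowledge has to be recorded alongside it.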

Thaller emphasised that validation is often complex, and in real life we have to take files that are not necessarily valid. There is no Planets-born validation service, but Planets does provide tools like JHOVE. Extraction is easier to deal with – the examination of what is really in a file. Many services extract some characteristics from a file. The traditional approach is to build a tool for each file format. The Planets Extensible Characterisation Language (XCL) approach is to have one tool which extracts characteristics from many kinds of files. It provides a file format description language as well as a general container format for file characterisation.
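To illustrate the difference between one tool per format and one tool driven by format descriptions, here is a minimal sketch. It is not XCL and does not use XCL syntax: the description structure and the function name are invented for this example. The byte offsets are those of the published PNG and BMP header layouts, and the sample data is synthetic.

```python
# Declarative descriptions: property name -> (byte offset, length, byte order).
# The structure is invented for this sketch; it is not XCL syntax.
FORMAT_DESCRIPTIONS = {
    "png": {"width": (16, 4, "big"), "height": (20, 4, "big")},
    "bmp": {"width": (18, 4, "little"), "height": (22, 4, "little")},
}


def extract(data, format_name):
    """One generic extractor: reads whatever the format description lists."""
    description = FORMAT_DESCRIPTIONS[format_name]
    properties = {}
    for name, (offset, length, byte_order) in description.items():
        field = data[offset:offset + length]
        if len(field) != length:
            raise ValueError(f"file too short to contain {name}")
        properties[name] = int.from_bytes(field, byte_order)
    return properties


# Synthetic headers, built only to exercise the extractor.
png_header = (b"\x89PNG\r\n\x1a\n" + (13).to_bytes(4, "big") + b"IHDR"
              + (640).to_bytes(4, "big") + (480).to_bytes(4, "big"))
bmp_header = (b"BM" + bytes(12) + (40).to_bytes(4, "little")
              + (640).to_bytes(4, "little") + (480).to_bytes(4, "little"))

print(extract(png_header, "png"))
print(extract(bmp_header, "bmp"))
```

Supporting a new format here means adding a description rather than writing a new tool, and because every format reports its properties under the same names, the results can be compared directly, for instance before and after a migration.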

Hannes Kulovitz from the Vienna University of Technology talked about how Plato, an interactive software tool provided by Planets, can help in preservation planning and went through the process of defining requirements, evaluating alternatives, analysing results, making recommendations and building the preservation plan. In the ensuing discussion it became clear that the planning process is a major part of the preservation process, especially as each format requires its own plan. The plan should be seen as requiring a major investment of time and effort, and it will then facilitate more effective automation of the processes involved.
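One simple way to picture the step from requirements to a recommendation is a weighted scoring of the alternatives. The sketch below is an assumption on my part about how such an evaluation could be laid out, not a description of how Plato works internally; the requirement names, weights, candidate actions and scores are all invented.

```python
# Hypothetical requirements with weights that sum to 1.
REQUIREMENTS = {"preserves_appearance": 0.5, "open_format": 0.3, "low_cost": 0.2}

# Hypothetical scores (0 to 5) for each candidate preservation action.
ALTERNATIVES = {
    "migrate to PDF/A": {"preserves_appearance": 4, "open_format": 5, "low_cost": 3},
    "migrate to TIFF": {"preserves_appearance": 5, "open_format": 4, "low_cost": 1},
    "keep original": {"preserves_appearance": 5, "open_format": 1, "low_cost": 5},
}


def weighted_score(scores, weights):
    """Combine per-requirement scores into a single figure using the weights."""
    return sum(weights[name] * scores[name] for name in weights)


def rank(alternatives, weights):
    """Return (alternative, score) pairs, best first."""
    scored = {name: weighted_score(s, weights) for name, s in alternatives.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


for name, score in rank(ALTERNATIVES, REQUIREMENTS):
    print(f"{name}: {score:.2f}")
```

Changing the weights changes the ranking, which is one reason the requirements-gathering stage deserves the time and effort that the speakers said it needs.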

Ross King made a return to the podium to talk about integrating the components of digital preservation. There is no archive component being developed as part of Planets, so the assumption is that institutions already have this. His talk concentrated on workflows through Planets, with suggested templates for submission, migration and access. He then went on to give a case study of the British Library (tiff images of newspapers). The content is complex and the template had to be changed to accommodate the requirements. He built up the workflow through the various stages and referred to options for using various Planets tools along the way. I would have liked this case study to be enlarged and followed through more clearly, as giving an example helps to clarify the way that the numerous tools available as part of Planets may be used.

We ended with a glimpse of the future of Planets. Most outputs will be freely available, under an Apache 2 licence. But to get take-up there must be a sustainability plan to maintain and develop the software, ensure continued access to Planets services (currently based at the University of Glasgow), support partners who are committed to the use of Planets, grow the community of users and promote further research and development. With this in mind, a decision has been taken to form something along the lines of an Open Planets Foundation (OPF), a not-for-profit organisation, limited by guarantee under UK law but with global membership. There has already been a commitment to this and there is financial support, but Billenness was naturally reserved about being explicit here because the OPF will be a new body. There will be different classes of membership and the terms of membership are currently being finalised. But most of the software will remain free to download.

Image shows Planets Interoperability Framework.