Democracy 2.0 in the US

May 28, 2010 / Jane Stevenson

Democracy 2.0: A Case Study in Open Government from across the pond.

I have just listened to a presentation by David Ferriero – 10th Archivist of the US at the National Archives and Records Administration (www.archives.gov). He was talking about democracy, about being open and participatory. He contrasted the very early days of American independence, where there was a high level of secrecy in Government, to the current climate, where those who make decisions are not isolated from the citizens, and citizens’ voices can be heard. He referred to this as ‘Democracy 2.0.’ Barack Obama set out his open government directive right from the off, promoting the principles of more transparency, participation and collaboration. Ferriero talked about seeking to inform, educate and maybe even entertain citizens.

The backbone of open government must be good record keeping. Records document individual rights and entitlements, record actions of government and who is responsible and accountable. They give us the history of the national experience. Only 2-3 percent of records created in conducting the public’s business are considered to be of permanent value and therefore kept in the US archives (still, obviously, a mind-bogglingly huge amount of stuff).

Ferriero emphasised the need to ensure that Federal records of historical value are in good order. But there are still too many records at risk of damage or loss. A recent review of record keeping in Federal Agencies showed that 4 out of 5 agencies are at high or moderate risk of improper destruction of records. Cost effective IT solutions are required to address this, and NARA is looking to lead in this area. An electronic records archive (ERA) is being built in partnership with the private sector to hold all the Federal Government’s electronic records, and Ferriero sees this as the priority and the most important challenge for the National Archives. He felt that new kinds of records, such as those created as a result of social media, bring new challenges, and an ERA needs to be able to take care of these types of records.

Change in processes and change in culture is required to meet the new online landscape. The whole commerce of information has changed permanently and we need to be good stewards of the new dynamic. There needs to be better engagement with employees and with the public. NARA are looking to improve their online capabilities to improve the delivery of records. They are developing their catalogue into a social catalogue that allows users to contribute and using Web 2.0 tools to allow greater communication between staff. They are also going beyond their own website to reach users where they are, using YouTube, Twitter, blogs, etc. They intend to develop a comprehensive social media strategy (which will be well worth reading if it does emerge).

The US Government are publishing high value datasets on data.gov and Ferriero said that they are eager to see the response to this, in terms of the innovative use of data. They are searching for ways to step up digitisation – looking at what to prioritise and how to accomplish the most with least cost. They want to provide open government leadership to Federal Agencies, for example, mediating in disputes relating to FoI. There are around 2,000 different security classification guides in the government, which makes record processing very complex. There is a big backlog of documents waiting to be declassified, some pertaining to World War Two, the Korean War and the Vietnam War, so they will be of great interest to researchers.

Ferriero also talked about the challenge of making the distinction between business records and personal records. He felt that the personal has to be there, within the archive, to help future researchers recreate the full picture of events.

There is still a problem with Government Agencies all doing their own thing. The Chief Information Officers of all agencies have a Council (the CIO Council). The records managers have the Records Management Council. But it is a case of never the twain shall meet at the moment. Even within Agencies the two often have nothing to do with each other... there are now plans to address this!

This was a presentation that ticked many of the boxes of concern – the importance of addressing electronic records, new media, bringing people together to create efficiencies and engaging the citizens. But then, of course, it’s easy to do that in words….

Archives Hub contributors’ survey

May 17, 2010 / Jane Stevenson

We would be very grateful if contributors to the Archives Hub could fill in this short survey for us. It is invaluable in helping us to understand your needs and priorities, and to plan for future development and enhancements.

Contributors’ survey: http://www.archiveshub.ac.uk/blog/?page_id=2317

This survey is specifically for contributors of archival descriptions rather than researchers. However, we are always keen to hear any views that you have on the strengths and weaknesses of the Hub, so do please email us with any feedback you have.

Many thanks,
from the Archives Hub team.

Opening up UK archives data (II)

May 11, 2010 / Jane Stevenson

This is the second post relating to the recent UKAD meeting, concentrating on the brainstorming that took place around digital and digitised archives.

The driving forces that were identified:

Crowd-sourcing – metadata generation
Attracts funding
Promotes access
Open up wealth of possibility
Remain relevant
Meet user expectations
Centres of excellence in digitisation – common approach
Collections already digitised are hidden – in silos – return on investment
Potential to capture richer information about users
Potential to draw people in
Increasing ‘digitisation on demand’ – needs to be harnessed effectively
Increasing amount of born-digital media need to be made accessible online – drive to discoverability of digital materials
Changing profession – becoming more confident in this area as a result of above
Web makes it much easier

The group felt that it all added up to a resounding “we have to do this!”.

The resistors included:

Systems don’t talk to each other
Insufficient metadata of legacy digitised material – retroconversion – cost*
Copyright/IPR – complex, lots of local specificity
Work needed to marry user generated content and standard metadata
Community resistance to UGC
Vast amounts of content – prioritisation is intellectually challenging
Bulk digitisation is happening commercially – restricted rights
Clashes with business models – or perception that it does (e.g. models based on commercial digitisation assume increasing return on investment; the opposite may occur if the most commercially enticing material digitised first)
Fears – grounded in truth – could affect funding: diminish user/visitor numbers on site, diminishes value of on-site expertise
Challenges in bringing catalogue data and digital object systems together
Query: not ultimately cost effective
Cost
Web makes it easier – but it’s hard to keep up…

The group looked at actions that are required:

1. Accrue evidence of user demand and current behaviour

Identify user communities (family, academic, student researchers)
Secondary research of existing analysis
Market research
Produce cost-benefit analysis – impact on site visits?

2. Systems talking to each other

People talking to each other about systems!
Develop definitive list of systems in use – a picture of UK situation > crosswalks/maps between (see Library world)
Needs to cover both catalogue and digital object management systems
Discmap?

3. Copyright/IPR

Produce decision tree to help archivists make decisions – risk assessment but beware risk aversion
Encourage sharing of experience/lessons learned
Gathering what has already been done

4. Impact of digitised resources

Gather existing articles/research
Share practice in assessing impact in differing contexts

5. Metadata and costs

Establish costs of differing levels of metadata generation
Identify how much data needs to be converted into digital metadata (how much is not online?)

6. Identify quick wins!

Working together to create user cases and examples, sharing experience, getting involved in Resource Discovery Task Force and linking projects to this

Of course, the gathering of such evidence can help us to see where we are and where we need to go, and also how to get there. But implementation is quite another thing. The UKAD Network is hoping to build upon this work to encourage collaborative initiatives and the sharing of expertise and experiences. We are considering events and training opportunities that might help. We do feel that it will be useful to create a stronger presence for UKAD, as a means to provide a focus for this work, and we are looking at low-cost options to do this.

Poll: new Hub website

May 4, 2010 / Jane Stevenson

Do you like the new Hub website? Let us know below!

Opening up UK archives data (i)

April 16, 2010 / Jane Stevenson

On 14th April the UK Archives Discovery Network (UKAD) met in Manchester to discuss challenges surrounding the opening up of archival data. We were looking to develop our understanding of the key issues driving or preventing these developments and to start pulling together an action plan. We also talked about digital and digitised archives, which I’ll blog about in a separate post.

We split into two groups to brainstorm driving and restraining factors. There was no chance of drying up – we all had plenty to say, and of course, the restraining influences grew rapidly, threatening to outstrip the drivers by quite some way. However, in the end we had a good balance, and we felt that the day had been very positive. Summing up the position is one thing, of course, and implementing actions is quite another, but we hope to start putting some things into place that will help to take us along the road to promoting archival discovery.

We are looking to create a UKAD website, which will help us to promote UKAD to archivists and others, and we’ll let you know about that as soon as we can.

With thanks to Melinda Haunton from The National Archives, who, as the UKAD secretary, gallantly pulled together the large number of flip charts and made them into something coherent, here is a summary of the points.

Our driving forces included:

Perceived user demand
Time saving – easier to search, more effective customer service
Opportunities – for use of data and for benefiting from others’ use
Government policy drivers in this direction (data.gov.uk is evidence of Govt buy-in)
Rich data – think about opportunities to make the most of events, people, places, concepts within the finding aids
Serendipitous collaboration – working together is a big driver – a common way to hear about initiatives and experiences of others that could be of benefit to you
Potential to get new users – eg via GIS data connected to archives data – users who may not think of using archives
Standards exist to drive openness
Sustainability of resources – less tied to a single service if data is open
Enrichment and adding value – others can enrich our data
Archives making use of others’ open data – sector benefits from open data as well as contributing to it
Connecting archives – new narratives – data can coalesce around events, people, places, subjects
Exposure of holdings – especially for small repositories who have limited resources to promote themselves
Unlikely to be restrictions on opening up descriptive data (unlike digital/digitised archives)
Could glean evidence of impact – ways to gather usage statistics are increasingly effective – provide evidence of benefits
Opening up could reach out to excluded communities more effectively (different routes into archives)
Potential for wider impact – e.g. in demonstrating impact of academic research (RAE)

Our restraining forces included:

Lack of evidence of user demand – it may not be what we expect/assume
APIs – where they exist, are they used? (possibly not)
Users’ understanding what they’ll get – you won’t normally get direct access to archives through descriptions
Proprietary software providers – may not ‘play ball’
Archivists’ understanding of open data issues – need understanding to get buy-in
Access to developer expertise – archivists frequently find getting IT or developer support very difficult
Machine to machine – not visual, not easy to sell – need to understand the potential
Messy data – all the issues we are so aware of with different data sources; the balkanisation of data
Backlogs – if it’s not catalogued, we can’t open it up
Sustainability of resources
Data becoming out of date as it gets further from the original source – end up reusing out-of-date data
Contractual embargoes – e.g. involving commercial partners e.g. software providers
Dependencies – potentially data may be dependent on other things – e.g. attached to schema, source code, IPR
Evidence of impact – can be difficult to get this and prove the worth of open data
Branding – or lack of it on reused open data – may affect funding if funders can’t see direct benefits
Loss of control causes fear – once it’s open anything can happen
Lack of ‘archival developers’ – very few developers with some understanding of archives and archival issues

Our Actions included:

Working together – collaborative evidence gathering and sharing, not competing – use examples/evidence from others
Evidence – case studies, knowing what researchers are requesting, evidence for advantages of digitising
Understanding funders – shared understanding of funders can help with internal funding
Archives developer days – bringing developers together as has been done with Dev8D – collaborative approach to programming
Strategy for approaching software vendors to get buy-in – appeal to their commercial interests, a concerted approach from aggregators may be more effective
UK based evaluation of archival cataloguing systems – still know little about percentages using different systems and evaluation of systems
Conference/workshops to raise awareness of buy in including practical demonstrations – must be interactive and practical and encourage sharing of projects, experiences and ideas

George Bernard Shaw: Man and Cameraman

April 12, 2010 / Jane Stevenson

The playwright George Bernard Shaw (1856-1950) was an avid amateur photographer at a time when the art was developing. In his lifetime he was known as a photographer but his collection has been largely left untouched since his death in 1950. Man & Cameraman aims to give his images back to the public.

This feature is illustrated with twelve photographs from the George Bernard Shaw Photographs collection at the London School of Economics. There is an introduction by Karyn Stuckey, Man & Cameraman Archivist at LSE, some suggested links, and some suggested reading (with links to Copac and Zetoc records).

George Bernard Shaw: Man and Cameraman.

Reinventing the wheel: the new Hub website

April 7, 2010 / Jane Stevenson

On 1st April 2010 the Archives Hub website changed. It was not just about a new look and feel, but a whole new site. The Hub team spent several months planning the new architecture, navigation and content. Most of the content was rewritten and this gave us a great opportunity to think about a coherent approach where we could be consistent in our tone and terminology and really think about what each page should say. We wanted the site to be intuitive and for each page to be useful and attractive, and not give an overwhelming amount of information.

We decided to introduce plenty of images, to lift the site visually, and we wanted to keep plenty of whitespace, to make it easy on the eye. In addition, the website designers, True North, helped us to think about our identity and the importance of presenting the Archives Hub in a way that conveys confidence, self-belief, professionalism and warmth.

The Archives Hub has getting on for 200 contributors now, which is quite an achievement, and we are very appreciative of the effort that our contributors put into creating descriptions for the Hub. We want to continue to develop the site with a focus on archivists as well as on researchers, as we see both groups of users as vital to us, and in fact they often overlap. We hope that our ‘Archivists’ section is helpful and informative for contributors and other information professionals interested in what we do and in issues around online data and interoperability.

Our Features section takes over from the old ‘Collections of the Month’ idea, bringing the same message about the breadth and depth of Hub content and enabling us to showcase contributors and wonderful collections.

Our ‘Researchers’ section is going to be expanded, although we are keen to keep it focussed and easy to scan and digest. We are looking at ways that we can continue to support researchers in using the Hub to the greatest advantage. Of course, the main way is to provide an effective search interface and to continue to expand the content. And this brings us on to the search – as well as a whole new information site, we have upgraded our software. We are now using ‘Cheshire 3’, which enables us to provide functionality that we could not provide before. We will be talking more about that in subsequent blogs. The new software is running on all-new hardware, so in fact we really have fundamentally changed the whole Archives Hub, but we hope that we have retained what is good about the site and about our service.

Linked Data: one thing leads to another

March 3, 2010 / Jane Stevenson

I attended another very successful Linked Data meetup in London on 24 February. This was the second meetup, and the buzz being created by Linked Data has clearly been generating a great deal of interest, as around 200 people signed up to attend. All of the speakers were excellent, and found that very helpful balance between being expert and informative whilst getting their points across clearly and in a way that non-technical people could understand.

Tom Heath (Talis) took us back to the basic principles of Linked Data. It is about taking statements and expressing them in a way that meshes more closely with the architecture of the Web. This is done by assigning identifiers to things. A statement such as Jane Stevenson works at Mimas can be broken down and each part can be given a URI identifier. I have a URI (http://www.archiveshub.ac.uk/janefoaf.rdf) and Mimas has a URI (http://www.mimas.ac.uk/). The predicate that describes the relationship ‘worksAt’ must also have a URI. Creating statements like this, we can start linking datasets together.
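To make the idea a little more concrete, here is a minimal sketch in Python of that statement as a subject, predicate, object triple, written out in the N-Triples style. The two URIs for me and for Mimas are the ones given above; the ‘worksAt’ URI is purely an assumption for illustration (an example.org placeholder), not a real vocabulary term.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate and an object, each a URI."""
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        # N-Triples writes each URI in angle brackets and ends the line with a dot.
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


JANE = "http://www.archiveshub.ac.uk/janefoaf.rdf"
MIMAS = "http://www.mimas.ac.uk/"
# Hypothetical predicate URI, invented for this sketch only.
WORKS_AT = "http://example.org/vocab/worksAt"

statement = Triple(JANE, WORKS_AT, MIMAS)
print(statement.to_ntriples())
```

Because every part of the statement is a URI, another dataset that uses the same URI for Mimas can be joined to this one without any further agreement between the two publishers.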

Tom talked about how this Linked Data way of thinking challenges the existing metaphors that drive the Web, and this was emphasised throughout the sessions. The document metaphor is everywhere – we use it all the time when talking about the Web; we speak about the desktop, about our files, and about pages as documents. It is a bit like thinking about the Web as a library, but is this a useful metaphor? Maybe we should be moving towards the idea of an exploratory space, where we can reach out, touch and interact with things. Linked Data is not about looking for specific documents. If I take the Archives Hub as an example, Linked Data is not so much concerned with the fact that there is a page (document) about the Agatha Christie archive; what it is concerned about is the things/concepts within that page. Agatha Christie is one of the concepts, but there are many others – other people, places, subjects. You could say the description is about many things that are linked together in the text (in a way humans can understand), but it is presented as a page about Agatha Christie. This traditional way of thinking hides references within documents; they are not ‘first class citizens of the web’ in themselves. Of course, a researcher may be wanting information about Agatha Christie archives, and then this description will be very relevant. But they may be looking for information about other concepts within the page. If ‘Torquay’ and ‘novelist’ and ‘nursing’ and ‘Poirot’ and all the other concepts were brought to the fore as things within their own right, then the data could really be enriched. With Linked Data you can link out to other data about the same concepts and bring it all together.

Tom spoke very eloquently about how you can describe any aspect you like of any thing you like by giving identifiers to things – it means you can interact with them directly. If a researcher wants to know about the entity of Agatha Christie, the Linked Data web would allow them to gather information about that topic from many different sources; if concepts relating to her are linked in a structured way, then the researcher can undertake a voyage of discovery around their topic, utilising the power that machines have to link structured data, rather than doing all the linking up manually. So, it is not a case of gathering ‘documents’ about a subject, but of gathering information about a subject. However, if you have information on the source of the data that you gather (the provenance), then you can go to the source as well. Linked Data does not mean documents are unimportant, but it means that they are one of the things on the Web along with everything else.
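As a rough sketch of what ‘gathering information about a subject’ from several sources might look like, the Python below pulls together every statement about one entity from two separate lists of triples. All of the URIs are hypothetical example.org placeholders, and the two tiny datasets are made up for illustration; the point is only that a shared identifier is what lets the machine do the linking.

```python
from collections import defaultdict

# Hypothetical identifiers, invented for this sketch.
CHRISTIE = "http://example.org/id/agatha-christie"
TORQUAY = "http://example.org/id/torquay"

# Two independent sources that happen to use the same URI for the same person.
archive_data = [
    (CHRISTIE, "http://example.org/vocab/occupation", "http://example.org/id/novelist"),
    (CHRISTIE, "http://example.org/vocab/associatedPlace", TORQUAY),
]
other_data = [
    (CHRISTIE, "http://example.org/vocab/createdCharacter", "http://example.org/id/poirot"),
    (TORQUAY, "http://example.org/vocab/locatedIn", "http://example.org/id/devon"),
]


def describe(uri, *datasets):
    """Collect every predicate and object stated about `uri` across the datasets."""
    description = defaultdict(set)
    for dataset in datasets:
        for subject, predicate, obj in dataset:
            if subject == uri:
                description[predicate].add(obj)
    return dict(description)


christie = describe(CHRISTIE, archive_data, other_data)
for predicate, objects in sorted(christie.items()):
    print(predicate, sorted(objects))
```

Calling describe() again with the Torquay URI would follow the link one step further, which is the ‘voyage of discovery’ in miniature.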

Having a well-known data provider such as the BBC involved in Linked Data provides a great example of what can be done with the sort of information that we all use and understand. The BBC Wildlife Finder is about concepts and entities in the natural world. People may want to know about specific BBC programmes, but they are more likely to want to know about lions, or tigers, or habitats, or breeding, or other specific topics covered in the programmes. The BBC are enabling people to explore the natural world through using Linked Data. What underlies this is the importance of having URIs for all concepts. If you have these, then you are free to combine them as you wish. All resources, therefore, have HTTP URIs. If you want to talk about sounds that lions make, or just the programmes about lions, or just one aspect of a lion’s behaviour, then you need to make sure each of these concepts has an identifier.

Wildlife Finder has almost no data itself; it comes from elsewhere. They pull stuff onto the pages, whether it is data from the BBC or from elsewhere. DBPedia (Wikipedia output in RDF) is particularly important to the BBC as a source of information. The BBC actually go to Wikipedia and edit the text from there, something that benefits Wikipedia and other users of Wikipedia. There is no point replicating data that is already available. DBPedia provides a big controlled vocabulary – you can use the URI from Wikipedia to clarify what you are talking about, and it provides a way to link stuff together.

Tom Scott from the BBC told us that the BBC have only just released all the raw data as RDF. If you go to the URL it content negotiates to give you what you want (though he pointed out that it is not quite perfect yet). Tom showed us the RDF data for an Eastern Gorilla, providing all of the data about concepts that go with Eastern Gorillas in a structured form, including links to programmes and other sources of information.
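Content negotiation simply means that the client says which format it would like in the Accept header, and the server returns that representation from the same URL. A minimal Python sketch of the client side is below. It only builds the request and does not send it; the URL is a hypothetical example.org address rather than the BBC’s own, and ‘application/rdf+xml’ is the registered media type for RDF/XML.

```python
from urllib.request import Request


def rdf_request(url, media_type="application/rdf+xml"):
    """Build (but do not send) a GET request asking for an RDF representation."""
    return Request(url, headers={"Accept": media_type})


# Hypothetical URL, used here only to show the shape of the request.
req = rdf_request("http://example.org/wildlife/Eastern_Gorilla")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

A browser asking for the same URL would send an Accept header preferring HTML, and so would get the ordinary web page instead.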

Having two heavyweights such as the BBC and the UK Government involved in Linked Data certainly helps give it momentum. The Government appears to have understood that the potential for providing data as open Linked Data is tremendous, in terms of commercial exploitation, social capital and improving public service delivery. A number of times during the sessions the importance of doing things in a ‘web-centric’ way was emphasised. John Sheridan from The National Archives talked about data.gov.uk and the importance of having ‘data you can click on’. Fundamentally, Linked Data standards enable the publication of data in a very distributed way. People can gather the data in ways that are useful to them. For example, with data about schools, what is most useful is likely to be a combination of data, but rather than trying to combine the data internally before publishing it, the Government want all the data providers to publish their data and then others can combine it to suit their own needs – you don’t then have to second guess what those needs are.

Jeni Tennison, from data.gov.uk, talked about the necessity of working out core design patterns to allow Linked Data to be published fast and relatively cheaply. I felt that there was a very healthy emphasis on this need to be practical, to show benefits and to help people wanting to publish RDF. You can’t expect people to just start working with RDF and SPARQL (the query language for RDF). You have to make sure it is easy to query and process, which means creating nice friendly APIs for them to use.
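One way to picture a ‘friendly API’ is as a small function that writes the SPARQL on the user’s behalf. The Python sketch below is an assumption about how such a wrapper might look, not anything shown at the meetup: the function name and the School class URI are hypothetical, while rdfs:label is a standard RDF Schema property.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def things_of_type_query(rdf_type, limit=10):
    """Return a SPARQL SELECT query listing things of one type with their labels."""
    return (
        "SELECT ?thing ?label WHERE {\n"
        f"  ?thing a <{rdf_type}> ;\n"
        f"         <{RDFS_LABEL}> ?label .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical class URI, for illustration only.
query = things_of_type_query("http://example.org/def/School", limit=5)
print(query)
```

The caller only has to supply a type and a limit; the query text itself stays out of sight, which is the sort of gentle entry point Jeni was describing.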

Jeni talked about laying tracks to start people off, helping people to publish their data in a way that can be consumed easily. She referred to ‘patterns’ for URIs for public sector things, definitions, classes, datasets, and providing recommendations on how to make URIs persistent. The Government have initial URI sets for areas such as legislation, schools, geographies, etc. She also referred to the importance of versioning: with things having multiple sources and multiple versions over time, it is important to be able to relate back to previous states. They are looking at using named graphs in order to collect together information that has a particular source, which provides a way of getting time-sliced data. Finally, ensuring that provenance is recorded (where something originated, processing, validation, etc.) helps with building trust.
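The named graph idea can be sketched in Python by adding a fourth element to each statement, naming the graph it belongs to, and keeping provenance against the graph rather than against each statement. Everything here is an assumption for illustration: the URIs, the ‘status’ values and the dates are invented, and a real triple store would handle this quite differently under the bonnet.

```python
from datetime import date
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # URI of the named graph that holds this statement


# Hypothetical identifiers, invented for this sketch.
SCHOOL = "http://example.org/id/school/1"
STATUS = "http://example.org/def/status"
GRAPH_2009 = "http://example.org/graph/2009"
GRAPH_2010 = "http://example.org/graph/2010"

# Provenance is recorded once per graph: its source and when it became current.
GRAPH_INFO = {
    GRAPH_2009: {"source": "http://example.org/source/register", "valid_from": date(2009, 1, 1)},
    GRAPH_2010: {"source": "http://example.org/source/register", "valid_from": date(2010, 1, 1)},
}

QUADS = [
    Quad(SCHOOL, STATUS, "open", GRAPH_2009),
    Quad(SCHOOL, STATUS, "closed", GRAPH_2010),
]


def state_as_of(quads, graph_info, when):
    """Return the statements from the most recent graph already valid on `when`."""
    valid = [g for g, info in graph_info.items() if info["valid_from"] <= when]
    if not valid:
        return []
    latest = max(valid, key=lambda g: graph_info[g]["valid_from"])
    return [q for q in quads if q.graph == latest]


for when in (date(2009, 6, 1), date(2010, 6, 1)):
    print(when, [q.obj for q in state_as_of(QUADS, GRAPH_INFO, when)])
```

Asking for the state on a given date gives a time slice, and the graph’s recorded source tells you where that slice came from.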

There was some interesting discussion on responsibilities for minting URIs. Certain domains can be seen to have responsibilities for certain areas, for example, the government minting URIs for government departments and schools, the World Health Organisation for health related concepts. But should we trust DBPedia URIs? This is an area where we simply have to make our own judgements. The BBC reuse the DBPedia URI slugs (the string-part in a URL to identify, describe and access a resource) on their own URLs for wildlife, so their URLs have the ‘bbc.co.uk’ bit and the DBPedia bit for the resource. This helps to create some cohesion across the Web.
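Reusing a slug is a small piece of string handling, sketched in Python below. The DBPedia URI follows the familiar dbpedia.org/resource/ pattern, but the base URL is a hypothetical example.org address standing in for the publisher’s own domain, and the function name is mine.

```python
def reuse_slug(source_uri, local_base):
    """Take the last path segment (the slug) of one URI and append it to another base."""
    slug = source_uri.rstrip("/").rsplit("/", 1)[-1]
    return local_base.rstrip("/") + "/" + slug


# Hypothetical local base; a real publisher would use their own domain and path.
local_url = reuse_slug(
    "http://dbpedia.org/resource/Eastern_Gorilla",
    "http://example.org/wildlife/",
)
print(local_url)
```

The two URLs then differ only in the domain part, which makes it easy for people and machines to see that they are about the same thing.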

There was also discussion about the risks of costs and monopolies – can you rely on data sources long-term? Might they start to charge? Speakers were asked about the use of http URIs – applications should not need to pick them apart in order to work out what they mean. They are opaque identifiers, but they are used by people, so it is useful for them to be usable by people, i.e. readable and understandable. As long as the information is made available in the metadata we can all use it. But we have got to be careful to avoid using URIs that are not persistent – if a page title in Wikipedia changes the URI changes, and if the BBC are using the Wikipedia URI slug then there is a problem. Tom Scott made the point that it is worth choosing persistence over usability.

The development of applications is probably one of the main barriers to uptake of Linked Data. It is very different, and more challenging, than building applications on top of a known database under your control. In Linked Data applications need to access multiple datasets.

The session ended by again stressing the importance of thinking differently about data on the Web: to start with things that people care about, not with a document-centric way of thinking. This is what the BBC have done with the Wildlife Finder. People care about lions, about the savannah, about hunting, about life-span, not about specific documents. It is essential to identify the real world things within your website. It is the modelling that is one of the biggest challenges – thinking about what you are talking about and giving those things URIs. A modelled approach means you can start to let machines do the things they are best at, and leave people to do the things that they are best at.

Post by Jane Stevenson (jane.stevenson@manchester.ac.uk)

Image: Linked Data Meetup, February 2010, panel discussion.

It’s all about YOU: Manchester as an Open Data city

February 12, 2010 / Jane Stevenson

There are plans afoot to declare Manchester as an Open Data city. At the Manchester Social Media Cafe last week I attended a presentation by Julian Tait, a founder of the Social Media Cafe, who talked to us about why this would be a good thing.

The Open Data initiative emerged as a result of Future Everything 2009, a celebration of the digital future in art, music and ideas. But what is an Open Data city? It is based upon the principle that data is the life blood of a city; it allows cities to operate, function, develop and respond, to be dynamic and to evolve. Huge datasets are generally held in places that are inaccessible to many of the populace; they are largely hidden. If data is opened up then applications of the data can be hugely expanded and the possibilities would be limitless.

There are currently moves by central government to open up datasets, to enable us to develop a greater awareness and understanding of society and of our environment. We now have data.gov.uk and we can go there and download data (currently around 2000 datasets) and use the data as we want to. But for data to have meaning to people within a city there has to be something at a city level; at a scale that feels more relevant to people in an everyday context.

Open data may be (should be?) seen as a part of the democratic process. It brings transparency, and helps to hold government to account. There are examples of the move towards transparency – sites such as They Work For You, which allows us all to keep tabs on our MP, and MySociety. In the US, the District of Columbia has an initiative known as Apps for Democracy, providing prizes for innovative apps as a way to engage the community in ‘digital democracy’.

The key here is that if data is thrown open it may be used for very surprising, unpredictable and valuable things: “The first edition of Apps for Democracy yielded 47 web, iPhone and Facebook apps in 30 days – a $2,300,000 value to the city at a cost of $50,000”.

Mapumental is a very new initiative where you can investigate areas of the UK, looking at house price indexes, public transport data, etc. If we have truly open data, we could really build on this idea. We might be able to work out the best places to live if we want a quiet area with certain local amenities, and need to be at work for a certain time but have certain constraints on travel. Defra has a noise map of England, but it is not open information – we can’t combine it with other information.

Julian felt that Open Data will only work if it benefits people in their everyday existence. This may be true on a city scale. On a national scale I think that people have to be more visionary. It may or may not have a discernible impact on everyday living, but it is very likely to facilitate research that will surely benefit us in the long term, be it medically, environmentally or economically.

The Open Data initiative is being sold on the idea of people becoming engaged, empowered and informed. But there are those that have their reservations. What will happen if we open up everything? Will complex issues be simplified? Is there a danger that transparent information will encourage people to draw simplistic inferences, or to come to the ‘wrong’ conclusions? Maybe we will lose the subtleties that can be found within datasets; maybe we will encourage misinformation. Maybe we will condemn areas of our cities to be ghettos. With so much information at our fingertips about where we should live, the ‘better areas’ might continue to benefit at the expense of other areas.

The key question is whether society is better off with the information or without the information. Certainly the UK Government is behind the initiative, and the recent ‘Smarter Government’ (PDF) document made a commitment to the opening up of datasets. The Government believes it can save money by opening up data, which, of course, is going to be a strong incentive.

For archivists the whole move towards numerous channels of information, open data, mashing up, recombining, reusing, keeping data fluid and dynamic is something of a nightmare from a professional point of view. In addition, if we start to see the benefits of providing everyone with access to all data, enabling them to do new and exciting things with it, then might we change our perspective about appraisal and selection? Does this make it more imperative that we keep everything?

Image: B of the Bang, Manchester

Digital Preservation the Planets Way

February 9, 2010 / Jane Stevenson

As a representative of the UK Society of Archivists, which is a member of the Digital Preservation Coalition, I attended the first day of this 3-day event on a partial DPC scholarship. It gave an overview of digital preservation and of the Planets project. Planets is a 4-year project that is European Community funded, with 16 partner organisations, and a budget of 13.7 million euros, showing a high level of commitment from the EC. The programme is due to finish May 2010. The Planets approach wraps preservation planning services, action services, characterisation services and a testbed within an interoperability framework. It seeks to respond to the OAIS reference model and it became clear as the day went on that having a knowledge of OAIS terminology was useful in following the talks, which often referred to SIPs, AIPs and DIPs.

After a keynote by Sheila Anderson, Director of the Centre for E-Research at King’s College, which touched upon some of the important principles that underlie digital preservation and outlined some projects that the Centre is involved in, we got into a day that blended general information about digital preservation with quite detailed information about the Planets services and tools.

Ross King from the Austrian Institute of Technology gave a good overview, looking at the scale of the digital universe and the challenges and incentives to preserve. Between now and 2019, the volume of content that organisations need to hold will rise twenty-five fold from an average of 20TB to over 500TB (from Are You Ready? Assessing Whether Organisations are Prepared for Digital Preservation – PDF). We would need about 1 trillion CD-ROMs to hold all of the digital information produced in 2009. Importantly, we have now reached a point at which information creation is exceeding storage capacity, so the question of what to preserve is becoming increasingly important. I found this point interesting, as at the last talk that I attended on digital preservation we heard the old cry of ‘why not keep everything digital – storage is no problem’.

Digital preservation is about using standards, best practices and technologies to ensure access over time. With digital information, there are challenges around bit-stream preservation (bytes and hardware) and logical preservation (software and format). Expounding on the challenge of formats, King said that typically knowledge workers produce at least two thirds of their documents in proprietary formats. These formats have high preservation risks relating to limited long-term support and limited backwards-compatibility.

The preservation planning process is also vital, and Planets provides help and guidance on this. It is important to know what we want to preserve, profile the collections and identify the risks in order to mitigate them. Hans Hofman of the National Archives of the Netherlands gave an introduction to preservation planning. A preservation plan should define a series of preservation actions that need to be taken to address identified risks for a given set of digital objects or records. It is the translation of a preservation policy. He talked about the importance of looking at objects in context, and about how to prepare in order to create a preservation planning strategy: the need to understand the organisational context, the resources and skills that are available. Often small organisations simply do not have the resources, and so large organisations inevitably lead the way in this area. Hans went through the step-by-step process of defining requirements, evaluating alternatives, analysing results and ending up with recommendations with which to build a preservation plan, which then needs to be monitored over time.

Planets has developed a testbed to investigate how preservation services act on digital objects (now open to anyone, see https://testbed.planets-project.eu/testbed/). Edith Michaeler of the Austrian National Library explained that this provides a controlled environment for experimentation with your own data, as well as with structured test data that you can use (Corpora). It enables you to identify suitable tools and make informed decisions. The testbed is run centrally, so everything is made available to users and experiments can benefit the whole community. It is very much integrated within the whole Planets framework. Edith took us through the 6 steps to run an experiment: defining the properties, designing the experiment, running it, the results, the analysis and the evaluation. The testbed enables experiments in migration, load test migration and viewing in an emulator as well as characterisation and validation. So, you might use the testbed to answer a question such as which format to migrate a file to, or to see if a tool behaves in the way that you expected.

The incentives for digital preservation are many, and for businesses things like legislative compliance and clarification of rights may be important ones. But business decisions are generally made based on the short-term and on a calculated return on investment. So, maybe we need to place digital preservation in the area of risk management rather than investment. The risk needs to be quantified, which is not an easy task. How much is produced? What are the objects worth? How long do they retain their value? What does it cost to preserve? If we can estimate the financial risk, we can justify the preventative investment in digital preservation (see MK Bergman, Untapped Assets – PDF).
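To make the risk argument concrete, here is a minimal sketch of how those four questions might be turned into a rough comparison. It is not taken from the talk or from the Bergman paper: the `Collection` fields, the simple year-by-year loss model and every figure below are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """Hypothetical inputs; every figure would have to be estimated in practice."""
    name: str
    value_per_year: float            # what the objects are worth to us each year
    years_of_value: int              # how long they retain their value
    annual_loss_probability: float   # chance per year of losing access if we do nothing
    annual_preservation_cost: float  # what it costs per year to preserve them


def expected_loss(c: Collection) -> float:
    """Expected value lost over the whole period if nothing is preserved."""
    still_accessible = 1.0
    loss = 0.0
    for year in range(c.years_of_value):
        # Probability that access is lost in this particular year.
        lost_now = still_accessible * c.annual_loss_probability
        remaining_years = c.years_of_value - year
        loss += lost_now * remaining_years * c.value_per_year
        still_accessible *= 1.0 - c.annual_loss_probability
    return loss


def preservation_cost(c: Collection) -> float:
    """Total cost of preserving the collection for as long as it has value."""
    return c.annual_preservation_cost * c.years_of_value


# Invented figures for an imaginary collection of project reports.
reports = Collection("project reports", 10_000.0, 10, 0.05, 1_500.0)
print(f"Expected loss if we do nothing: {expected_loss(reports):,.0f}")
print(f"Cost of preserving:             {preservation_cost(reports):,.0f}")
```

The hard part, as the talk made clear, is not the arithmetic but arriving at defensible estimates for the inputs.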

During the panel discussion, the idea of ‘selling’ digital preservation on the basis of risk was discussed. Earlier in the day William Kilbride, director of the Digital Preservation Coalition, talked about digital preservation as sustaining opportunities over time, and for many delegates, this was much more in tune with their sentiments. He outlined the work of the DPC, and emphasised the community-based and collaborative approach it takes to raising awareness of digital preservation.

Clive Billenness went through how Planets works with the whole life-cycle of digital preservation:

1. Identify the risks
2. Assess the risks (how severe they are, whether they are immediate or long-term)
3. Plan and evaluate
4. Implementation plan
5. Update and review

The cycle will be repeated if there is a new risk trigger, which might be anything that signifies a change in practice, whether it be a change in policy, a change in the business environment or a change in the technical environment. For the whole life-cycle, Planets has tools to help: the Plato Preservation Planning Tool, the Planets Characterisation Services, the Testbed and the Planets Core Registry, which is a file format registry based upon Pronom, including preservation action tools and file formats, taking a community-based approach to preservation.

Types of preservation action were explained by Sara van Bussel of the National Library of the Netherlands. She talked about logical preservation and accessing bit streams, and how interpretation may depend on obsolete operating systems, applications or formats. Sara summarised migration and emulation as preservation processes. Migration means changing the object over time to make it accessible in the current environment, whatever that may be. This risks introducing inconsistencies, functionality can be lost and quality assessment can be difficult. Migration can happen before something comes into the system or whilst it is in the system. It can also happen on access, so it is demand-led. Emulation means changing the environment over time, so no changes to the object are needed. But it is technically challenging and the user has to have knowledge about the original environment. An emulator emulates a hardware configuration. You need your original operating system and software, so they must be preserved. Emulation can be useful for viewing a website in a web archive, for opening old files, from WordPerfect files to databases, and for executing programs, such as games or scientific applications. It is also possible to use migration through emulation, which can get round the problem of a migration tool becoming obsolete.

We were told about the Planets Gap Analysis (PDF), which looked at existing file formats, and found 137 different formats in 76 institutions. The most archived file formats in archives, libraries and museums are tiff, jpg, pdf and xml, but archives hardly archive the mp3 format, while libraries and museums frequently do. Only 22% of the archived file formats were found in four or more institutions, and only two file formats, tiff and jpg, were found in over half of all institutions. So, most preservation action tools are for common file formats, which means that more obscure file formats may have a problem. However, Sara gave three examples where the environment is quite different. For DAISY, which is a format for audio books for the blind, there is a consortium of content providers who address issues arising with new versions of the format. For FITS, a format for astronomical data, digital preservation issues are often solved by the knowledgeable user-base. But with sheet music the community is quite fragmented and uncoordinated so it is difficult to get a consensus to work together.

The Gap Analysis found that out of the top ten file formats, 9 are covered by migration tools known or used by Planets partners. XML is not covered, but it is usually the output of a file format rather than the input, so maybe this is not surprising. Many tools are flexible, so they can address many types of format, but each organisation has specific demands that might not be fulfilled by available tools.
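For anyone wanting to run a similar tally over their own holdings, the kind of count behind figures such as ‘22% of formats found in four or more institutions’ can be sketched in a few lines. The survey data below is invented for illustration and bears no relation to the actual Planets results; the function names are likewise my own.

```python
from collections import Counter

# Invented survey: which file formats each (imaginary) institution archives.
survey = {
    "archive_a": {"tiff", "jpg", "pdf", "xml"},
    "archive_b": {"tiff", "jpg", "pdf"},
    "library_a": {"tiff", "jpg", "mp3", "xml"},
    "library_b": {"tiff", "pdf", "mp3", "daisy"},
    "museum_a": {"tiff", "jpg", "fits"},
}


def institutions_per_format(holdings):
    """Count how many institutions hold each format."""
    counts = Counter()
    for formats in holdings.values():
        counts.update(formats)
    return counts


def share_held_by_at_least(counts, n):
    """Fraction of all formats that are held by n or more institutions."""
    return sum(1 for c in counts.values() if c >= n) / len(counts)


counts = institutions_per_format(survey)
print(counts.most_common())
print(f"{share_held_by_at_least(counts, 4):.0%} of formats are in 4+ institutions")
```

The formats that fall below the threshold are the ones least likely to be served by existing preservation action tools.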

Manfred Thaller from the University of Cologne gave a detailed account of the Planets Characterisation Services. He drew attention to the most basic layer of digital information – the 1’s and 0’s that make up a bit-stream. He showed a very simple image and how this can be represented by 1’s and 0’s, with a ‘5, 6’ to indicate the rows and columns (…or is that columns and rows – the point being that information such as this is vital!). To act on a file you need to identify it, validate it, extract information, undertake comparison. If you do not know what kind of file you have – maybe you have a bit-stream but do not know what it represents – DROID can help to interpret the file and it also assigns a permanent identifier to the file. DROID uses the PRONOM-based Planets File Format Registry.
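The ‘5, 6’ point is easy to demonstrate. The sketch below is my own reconstruction of the idea, not the image Thaller showed: the 30-bit pattern and the `render` function are assumptions. It draws the same bit-stream once as 5 rows of 6 and once as 6 rows of 5.

```python
from typing import List

# An invented 30-bit stream; read as 5 rows of 6 columns it draws a small ring.
BITS = "011110" "100001" "100001" "100001" "011110"


def render(bits: str, rows: int, cols: int) -> List[str]:
    """Lay the bit-stream out as rows x cols, using '#' for 1 and '.' for 0."""
    if len(bits) != rows * cols:
        raise ValueError("bit-stream length does not match rows x cols")
    return [
        bits[r * cols:(r + 1) * cols].replace("1", "#").replace("0", ".")
        for r in range(rows)
    ]


print("Read as 5 rows x 6 columns:")
print("\n".join(render(BITS, 5, 6)))
print()
print("Read as 6 rows x 5 columns:")
print("\n".join(render(BITS, 6, 5)))
```

The bits are identical in both cases; only the interpretation differs, and only one of the two gives back the picture that was intended.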

Thaller emphasised that validation is often complex, and in real life we have to take files that are not necessarily valid. There is no Planets-born validation service, but it does provide tools like JHOVE. Extraction is easier to deal with – the examination of what is really in a file. Many services extract some characteristics from a file. The traditional approach is to build a tool for each file format. The Planets Extensible Characterisation Language (XCL) approach is to have one tool which extracts characteristics from many kinds of files. It provides a file format description language as well as a general container format for file characterisation.

Hannes Kulovitz from the Vienna University of Technology talked about how Plato, an interactive software tool provided by Planets, can help in preservation planning and went through the process of defining requirements, evaluating alternatives, analysing results, recommendations and building the preservation plan. In the ensuing discussion it became clear that the planning process is a major part of the preservation process, especially as each format requires its own plan. The plan should be seen as requiring a major investment of time and effort, and it will then facilitate more effective automation of the processes involved.

Ross King made a return to the podium to talk about integrating the components of digital preservation. There is no archive component being developed as part of Planets, so the assumption is that institutions already have this. His talk concentrated on workflows through Planets, with suggested templates for submission, migration and access. He then went on to give a case study of the British Library (tiff images of newspapers). The content is complex and it required the template to be changed to accommodate the requirements. He built up the workflow through the various stages and referred to options for using various Planets’ tools along the way. I would have liked this case study to be enlarged and followed through more clearly, as giving an example helps to clarify the way that the numerous tools available as part of Planets may be used.

We ended with a glimpse of the future of Planets. Most outputs will be freely available, under an Apache 2 licence. But to get take-up there must be a sustainability plan to maintain and develop the software, ensure continued access to Planets services (currently based at the University of Glasgow), support partners who are committed to the use of Planets, grow the community of users and promote further research and development. With this in mind, a decision has been taken to form something along the lines of an Open Planets Foundation (OPF), a not-for-profit organisation, limited by guarantee under UK law but with global membership. There has already been a commitment to this and there is financial support but Billenness was naturally reserved about being explicit here because the OPF will be a new body. There will be different classes of membership and the terms of membership are currently being finalised. But most of the software will remain free to download.

Image shows Planets Interoperability Framework.