Digital Curation: think use, not preservation

For the keynote presentation at the DCC/RIN Research Data Management Forum on ‘The Economics of Applying and Sustaining Digital Curation’, Chris Rusbridge gave us some reflections from the Blue Ribbon Task Force (BRTF) on Sustainable Digital Preservation and Access (http://brtf.sdsc.edu/about.html). This was a two-year project, finishing earlier this year, and the final report is available from: http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf

Chris kicked off by asking us to think about how we currently support access to digital information. Avenues include government grants, advertisements (e.g. through Google), subscriptions (to journals), pay per service (e.g. Amazon Web Services), and donations.

One of the key themes that he raised and returned to was the alignment, or lack of alignment, between those who pay, those who provide and those who benefit from digital data: they are not necessarily the same, and the more different they are, the harder it may be to create a sustainable model. Who owns, who benefits, who selects, who preserves, who pays? This has interesting parallels with archive repositories, where an institution may pay for the acquisition, appraisal, storage, cataloguing and access for these resources, but the beneficiaries are far broader than just members of the institution. Some institutions may require payment for access, but others will provide access free of charge. They may see this as a means to enhance their reputation and status as a learned society.

Around 15 years ago we started to think about digital preservation as a technical problem and then the OAIS reference model was produced. The technical capabilities that we now have are well up to the task, although Chris warned that the most elegant technical solution is no good if it is not sustainable; digital preservation has to be a sustainable economic activity. Today the focus is on the economic and organisational problems. It is not just about money; it requires building upon a value proposition, providing incentives to act and defining roles and responsibilities.

Digital preservation represents a derived demand.  No one ‘wants’ preservation per se; what they want is access to a resource.  It is not easy to sell a derived demand – often it needs to be sold on some other  basis. This idea of selling the importance of providing use (over time) rather than trying to sell the idea of preservation was emphasised throughout the Forum.

Digital preservation is also ‘path dependent’, meaning that the actions and decisions you take change over time; they are different at different points of the life-cycle. Today’s actions can remove other options for all time.

Cultural issues and mindset may be a problem here, and I was interested in the potential problem Chris proposed of the ‘free-rider’ culture when it comes to making research datasets available. It may be that some (many?) researchers don’t want to pay for things, undervalue services and maybe underestimate costs. Researchers may also resent conformity and what they see as bureaucracy. All in all, it may be difficult to make a case that researchers should in some way pay. This may be compounded by a sense that money invested in preservation is money taken out of research. Chris suggested that the incentives for preservation are less apparent to the individual researcher, but are more clearly defined when the data is aggregated.

Typically, long-term preservation activities have been funded by short-term resource allocation, although maybe this is gradually changing; a more thorny issue is that of recognising and valuing the benefits of digital preservation, to provide incentives that attract funding. More work needs to be done on articulating the benefits in order to cultivate a sense of the value. However, other speakers at the Forum wondered whether we should actually take the value as a given – maybe we shouldn’t keep asking the question about benefits, but simply acknowledge that it is the right thing to make research and other digital outputs available long-term? We may be creating problems for ourselves if we emphasise the need to demonstrate value too much, and then struggle to quantify the value. However, this was just one argument, and overall I think that there was a belief that we do need to understand and articulate the benefits of providing long-term access.

There is often a lack of clear responsibility around digital preservation – maybe this is one of those areas where it’s always thought to be someone else’s responsibility? So, appropriate organisation and governance is essential for efficient ongoing preservation, especially when considering the tendency for data to be transferred – these ‘handoffs’ need to be secure.

The three imperatives that the BRTF report comes up with are: to articulate a compelling value proposition; to provide clear incentives to preserve in the public interest; to define roles and responsibilities.

Commenting briefly on the post-BRTF developments, Chris mentioned the EU digital agenda and the LIBER pan-European survey on sustainability preparedness.

There are some mandates emerging:  the NERC and ESRC, for example.  Some publishers do require authors to make available data that substantiates an article, but at present this is not rigorous enough. We need to focus more on the data behind the research and how important it is.

Chris contrasted domain data repositories and institutional data repositories. Domain data repositories: leverage scale and expertise; are valuable for ‘high curation’ data; can carry out a ‘community proxy’ role such as tool development; aggregate demand; are potentially vulnerable to policy change (e.g. AHDS). A mixed funding model is desirable for domain data repositories (e.g. ICPSR). Institutional data repositories: have a reputational business case (risk management, records management aspects, showcasing); should be aligned with institutional goals; can link to institutional research services (e.g. universal backup); can work well for ‘low curation’ cases (relatively small, static datasets); aggregate demand across a set of disciplines.

One issue that came up in the discussion was that we must remember that in fact digital preservation is relatively cheap, especially when compared to the preservation of hard-copy archives, held in acid-free boxes on rows and rows of shelving in secure, controlled search rooms.  So, if the cost is actually not prohibitive, and the technical know-how is there, then it seems imperative to address the organisational issues and to really hammer home the true value of preserving our digital data.

Opening the door to demonstrating value

The Archives Hub team value the links that we have with our contributors, who, after all, make the Hub what it is. We have a Contributors’ Forum in order to establish and develop links with contributors and get their feedback on Hub developments.

This week we ran a Contributors’ Forum that concentrated on measuring impact, something that is becoming increasingly important in order to demonstrate value. Unfortunately, we ended up with quite a small group, despite sending out some enticing emails – maybe a sign of the difficult times. But we still had a stimulating discussion, and for us it is always very valuable to get a perspective from the actual archives repositories.

We spent the first part of the morning with updates on the Hub and reports from the contributors: John Rylands at Manchester, Salford, Liverpool and Glasgow. Joy then gave a presentation on measuring impact, reflecting on some work that the Archives Hub, Copac and Zetoc services have carried out through online surveys and one-to-one interviews with researchers in order to create case studies.

In the afternoon we concentrated on measuring impact by asking the contributors to think about (i) what sort of information they currently collect about their researchers and (ii) what sort of information they would like to have. Overall, it seems that most archives have some form of registration, where researchers give some details about themselves. But the information recorded varies, not surprisingly. Sometimes information such as the items consulted is given, sometimes researchers are asked to specify their subject area, and at Glasgow they are asked how they found out about the University Archive. At Liverpool all of the requisition slips are studiously kept, so that there is a record of who has looked at what, and at Glasgow there is a log of everything leaving the strong room, and I’m sure that for most archives this is the case. At Salford, phone and email enquiries are all logged, and website statistics are kept.

However, it seems that in general there is very little information on what happens next. How does the visit to the archive benefit the researcher? Do they use what they have found in publications? reports? articles? The archive repository may find this sort of information out if the researcher asks about copyright issues, but otherwise it is very hard to know. We agreed that informal networks can be valuable here. Archivists often get to know regular researchers, and in fact, this may be more likely to happen at smaller repositories where there is a lone archivist. But this can only account for a small part of the use of the collections. In fact, two of our contributors said that a reception desk had recently been installed so that researchers often don’t really interact directly with the archivist unless they have a particular query, so whilst this may be more efficient, it may distance us more from our users.

Also, it seems that the information that is gathered is not really utilised. It ‘may’ go into reports, and it ‘may’ be used for funding applications, but the suggestion is that this is done in a rather ad hoc manner. At Glasgow, it is important to show that the researchers and students from the University are being prioritised, so the information gathered can help to support this kind of situation.

From the discussion that we had around this topic, our first likely action arose: If there is an easy way for a researcher to grab an archival reference, it will encourage people to include the correct citation, which will help with tracking the use of archives.  This is something that we should be able to introduce for the Hub.

We talked about how easy it would be to simply ask researchers if they will speak about their research. Maybe they could be encouraged to put something about this on the registration form. We felt that if we are honest about what we need (which is often to demonstrate our value in order to secure continued funding), then researchers may be more willing than we might suppose. There is, undoubtedly, a huge feeling of goodwill towards archives, and, as one contributor said, we may be pushing at an open door here.

We talked about the sort of information we would like to gather, and came up with some possibilities:

We would like to know how researchers are coming to the repository – e.g. from the Archives Hub, from a Hub Spoke, from the NRA?
We would like to know if users find what they need from the archival descriptions themselves. Maybe more detailed descriptions sometimes provide the information that they need – they might even show that the archive is not relevant to the research, thus saving the researcher a wasted visit (a positive negative outcome!).
We would like to know more about how people behave when looking at an archive catalogue: Where do they navigate to? Do they explore the catalogue? Do they search laterally?

From the discussion with these contributors, it seems that the Archives Hub is having to place more emphasis on issues around ‘market penetration’ than the repositories themselves are at present, although it was felt that this is starting to change and that archives may well be faced with more pressure to understand their markets and how to reach them effectively.

Finally, we came up with another action, which was to try to compile 3 case studies over the next year. John Rylands agreed to work with us on the first one, so that we can test out how best to approach this. It may be that telling stories is the most fruitful way to get a sense of the impact that archives have. But we cannot ignore the fact that statistics are required, and we do have to continue to look for different ways to demonstrate our value.

Do we need index terms?

Archival descriptions need to include associated subjects, names and places as index terms. Is that self-evident? Well, certainly we need to do what we can to provide ways into an archive, and you might say the more ways to access it the better. But do archival descriptions need index terms? Do they add anything that keyword searches don’t have?

The Archives Hub encourage our contributors to add access points, which is EAD speak for index terms for subjects, names and places that reflect the content of the description, and therefore the archive. But if those terms are already included in the description, with the technology at our disposal, maybe we can dispense with them as access points and simply query the main body of the description? What are the arguments in favour of keeping index terms?

1. It’s about what is significant. One of the great challenges with archives is drawing out what is important within the archive; enabling researchers to know whether the archive is relevant to them. But this is always going to be a very imperfect exercise. I remember cataloguing an architect’s diaries (Robert Mylne, architect of Blackfriars Bridge) and ending up taking months because I couldn’t bear to leave out any people, or place names or buildings, or building techniques, etc. What if someone really wanted to know about stanchions? If I didn’t mention them, then a search would not bring back the Mylne diaries, and I would have failed to connect researcher to research material. The reality is that with the time and resources at our disposal, what we need to try to do is reflect what is ‘most significant’ and include ‘key concepts’, accepting that this is a somewhat subjective judgement and hoping that this is enough to lead the researcher in the right direction. For the Hub we usually recommend adding somewhere between 3 and 10 index terms to a description. It means that the archivist can (arguably) draw out the most pertinent subjects and list the most significant people.

2. It allows for drawing out entities. So, in a sentence like “The collection comprises of material relating to the British National Antarctic Expedition, 1901-1904 (leader Robert Falcon Scott), the British Antarctic Expedition, 1907-1909, led by Shackleton, correspondence with his family, miscellaneous papers and biographical information”, you can separate out the entities: corporate bodies such as British National Antarctic Expedition, 1901-1904, and personal names such as Robert Falcon Scott. This is very useful for machine processing of content, as machines do not know that Robert Falcon Scott is a personal name (although we are increasingly developing sophisticated text mining techniques to address this).

It can be particularly useful where the entities are not obvious from the text, such as “[A]s well as material relating to his broadcast and published works, the archive also includes many scripts…”. Notice a lack of definite subject terms such as ‘playwright’, or ‘writer’. A human user may infer this, but a general search on ‘playwright’ will not bring back any results because a machine has to know it too, in order to serve the human user.
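To make the point concrete, here is a minimal sketch in Python of how a plain keyword search behaves with and without index terms. It is not how the Hub's search actually works: the `Description` structure, the `search` function and the collection title are all hypothetical, made up for illustration. Only the quoted scope and content text and the two subject terms come from the description discussed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Description:
    title: str
    scope_and_content: str
    subjects: List[str] = field(default_factory=list)  # index terms


def search(descriptions, term, use_index_terms):
    """Return titles whose text (and optionally index terms) contain term."""
    needle = term.lower()
    hits = []
    for d in descriptions:
        haystack = d.scope_and_content.lower()
        if use_index_terms:
            # Index terms are searched alongside the free text.
            haystack += " " + " ".join(d.subjects).lower()
        if needle in haystack:
            hits.append(d.title)
    return hits


hopcraft = Description(
    title="Arthur Hopcraft papers",  # hypothetical title for illustration
    scope_and_content=(
        "As well as material relating to his broadcast and published "
        "works, the archive also includes many scripts"
    ),
    subjects=["Playwrights", "Writers"],
)

print(search([hopcraft], "playwright", use_index_terms=False))
print(search([hopcraft], "playwright", use_index_terms=True))
```

The first call prints an empty list, because the word never appears in the free text; the second finds the description through its index terms.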

3. You can then apply consistency to the entities, in terms of using a pre-defined controlled vocabulary. But in a world where folksonomies are becoming increasingly popular, with increasing use of user tagging, does it make sense to insist on controlled vocabularies?

Take the example above, which is about Arthur Hopcraft. The index terms do include ‘playwrights’ and ‘writers’ so that the user can do a keyword search on these terms, or a specific subject search, and find the description. However, there is an obvious flaw here: the archivist has chosen these terms. Whilst they do both come from the Unesco thesaurus, she could easily have chosen different terms. The index terms do not include ‘scriptwriter’ for example. They do not include ‘television’ or ‘journalism’, both of which could have reasonably been used for this description. We end up with some descriptions that use ‘playwrights’ as a controlled vocabulary term, but others that don’t, and some that maybe use ‘scriptwriters’ when they are essentially about the same subject, or ‘authors’ which is the Unesco preferred term for scriptwriters.

But you cannot cover everything, so you have to make a choice about which subject terms to use. The question is: is it better to have some subject terms rather than none, even if they do not necessarily cover ‘all’ subjects, and so the researcher may carry out a subject search and not find the archive? One important point is that with or without subject terms, you have the same problem; it is just that a specific subject search does actually narrow what the researcher is searching on – the search may not include other fields, such as the scope & content or biographical history. Therefore whilst a subject search helps the researcher to find the most significant collections, it may exclude some collections that might be very pertinent for their research (collections that they may find through a keyword search).

4. Index terms allow for clarification of which entity you are talking about. This can be particularly helpful with identifying people and corporations. The scope and content may refer to Lindsay Anderson, but the index entry will provide the dates and maybe an epithet to clarify that this is Lindsay Gordon Anderson, 1923-1994, film director. You could add this information to the scope and content, but it would tend to make it much more dense and arguably more difficult to read if you did this with all names. It would also imply that all names are of equal significance, and it would not be very helpful for machine processing unless you marked it up so that a machine could identify it as a personal name.

5. Index terms allow for connecting the same entity throughout the system. A very useful and powerful reason to have index terms. The main issue here is that contributors do not always enter the same thing, even with rules and sources to draw upon. Personal and corporate names are usually consistent, but inevitably the addition of the epithet, which is much more of an archival practice than a library practice, means that one person often has a number of different entries. If you took the epithet away, at least for the purposes of identifying the same entity, then things would work reasonably well. For subjects it’s more a case of just the number of subjects that can be used to describe an archive. If you look for all the descriptions with the subject of ‘first world war’, then you won’t find all the descriptions that are significantly about this subject because some of them are indexed with ‘world war one’, and others may use ‘war’ and ‘conflict’.

The way around this for the Hub is our ‘Subject Finder’. This is different from a straightforward subject search. It actually looks for similar terms and brings them together. So, a search for ‘first world war’ will bring back ‘world war one’. Similarly, a search for ‘railways’ will bring back the Library of Congress heading of ‘railroads’.
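As a rough illustration of the idea of bringing similar terms together, here is a small Python sketch. I should stress that this is not the Subject Finder's actual implementation: the hand-built synonym groups, the function names and the collection titles are all assumptions made up for the example.

```python
# Hypothetical groups of subject terms treated as equivalent.
SYNONYM_GROUPS = [
    {"first world war", "world war one"},
    {"railways", "railroads"},
]


def expand_subject(term):
    """Return the set of terms considered similar to the one searched for."""
    t = term.strip().lower()
    for group in SYNONYM_GROUPS:
        if t in group:
            return set(group)
    return {t}


def subject_finder(descriptions, term):
    """Return titles indexed with the term or any similar term."""
    wanted = expand_subject(term)
    return sorted(
        title
        for title, subjects in descriptions.items()
        if wanted & {s.lower() for s in subjects}
    )


# Hypothetical collections and their subject index terms.
descriptions = {
    "Collection A": ["First World War"],
    "Collection B": ["World War One"],
    "Collection C": ["Railroads"],
    "Collection D": [],  # no subject terms at all
}

print(subject_finder(descriptions, "first world war"))
print(subject_finder(descriptions, "railways"))
```

The first search brings back Collections A and B, and the second brings back Collection C. Collection D, which has no subject terms, is never returned, however well the terms are grouped.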

The Subject Finder helps, but does not completely address this problem of the differing choice of terms. It cannot by-pass the fact that sometimes descriptions do not include any subject terms, so then they will not show up in a subject search. Recently I was looking for archives in the Hub on ‘exploration’, and was surprised to find that many of the Antarctic expeditions collections were not listed in the results. This was because some repositories did not use this subject term; a perfectly legitimate choice not to use it, but many other similar archives do use it.

I still feel that it is worth adding the significant entities as index terms, even with the problems of selecting what is ‘significant’ and with the inconsistencies that we have. Cataloguing as a whole is a subjective exercise, and it will never be perfect. For those who say that index terms are out-dated, I can only say that they are proving pretty useful for our current Linked Data project, and that is certainly pretty up to the minute in terms of Web technologies.

One final point in favour: the Archives Hub index terms exist within the descriptions as clickable links. This allows researchers to carry out ‘lateral’ searches, and it is a popular means to traverse descriptions, exploring from one subject to another, from one person to another.

Whether we should also consider enabling researchers to tag descriptions themselves is a whole other issue for another blog post…

This is not a complete case for and against by any means, but I think I’d better leave it there. I’d love to hear your views.