UKAD Forum

The National Archives (used under a CC licence from http://www.flickr.com/photos/that_james/2693236972/)

Weds 2nd March was the inaugural event of the UK Archives Discovery Network – better known as UKAD.  Held at the National Archives, the UKAD Forum was a chance for archive practitioners to get together, share ideas, and hear about interesting new projects.

The day was organised into 3 tracks: A key themes for information discovery; B standards and crowdsourcing; and C demonstrating sites and systems.  Plenary sessions came from John Sheridan of TNA, Richard Wallis of Talis, David Flanders of Jisc, and Teresa Doherty of the Women’s Library.

I would normally have been tweeting away, but unfortunately although I could connect to the wifi, I couldn’t get any further!  So here are my edited highlights of the day (also known as ‘tweets I wish I could have sent’).

John Sheridan kicked off the proceedings by talking about open data.  The government’s Coalition Agreement contains a commitment to open data, which obviously affects The National Archives, as repository for government data.  They are using light-weight existing Linked Data vocabularies, and then specialising them for their needs. I was particularly interested to hear about the challenges posed by legislation.gov.uk, explained by John as ‘A changes B when C says so’: new legislation may alter existing legislation, and these changes might come into force at a time specified by a third piece of legislation…

Richard Wallis carried on the open data theme, by talking about Linked Data and Linked Open Data. His big prediction? That the impact of Linked Data will be greater than the impact of the World Wide Web it builds on. A potentially controversial statement, delivered with a very nice slide deck.

Off to the tracks, and I headed for track B to hear Victoria Peters from Strathclyde talk about ICA-AtoM.  This is open source, web-based archival description software, aimed at archivists and institutions with limited financial and technical resources.  It looks rather nifty, and supports EAD and EAC import and export, as well as digital objects.  If you want to try it out, you can download a demo from the ICA-AtoM website, or have a look at Strathclyde’s installation.

Bill Stockting from the BL gave us an update on EAD and EAC-CPF.  I’m just starting to learn about EAC-CPF, so it was interesting to hear the plans for it.  One of Bill’s main points was that they’re trying to move beyond purely archival concerns, and are hoping that EAC-CPF can be used in other domains, such as MARC.  This is an interesting development, and I hope to hear more about it in the future!  Bill also mentioned SNAC, the Social Networks and Archival Context project, which is looking at using EAC-CPF with a number of tools (including VIAF) to ‘“unlock” descriptions of people from finding aids and link them together in exciting new ways’.

David Flanders’ post-lunch plenary provided absolutely my favourite moment of the day: David said ‘Technology will fail if not supported by the users’… and then, with perfect timing, the projector turned off.  One of David’s key points was that ‘you are not your users’.  You can’t be both expert and user, and you will never know exactly what users want from your systems, or how they will use them, unless you actually ask them! Get users involved in your projects and bids, and you’re likely to be much more successful.

Alexandra Eveleigh spoke in track B about ‘crowds and communities: user participation in the archives’.  I especially liked her distinction between ‘crowds’ and ‘communities’ – crowds are likely to be larger, and quickly dip in and out, while communities are likely to be smaller overall, but dedicate more time and effort.  She also pointed out that getting users involved isn’t a new thing – there’s always been a place in archives for those pursuing ‘serious leisure’, and bringing their own specialist knowledge and experience.  A point Alexandra made that I found particularly interesting was that of being fair to your users – don’t ask them to participate and help you, if you’re not going to listen to their opinions!

I have to admit that I’d never really heard of Historypin before I saw them on the conference programme.  Don’t click on that link if you have anything you need to get done today!  Historypin takes old photographs, and ‘pins’ them to their exact geographic location using Google maps.  You can see them in streetview, overlaid on the modern background, and it is absolutely fascinating.  Photos can be contributed by anyone, and anyone can add stories or more information to photos on the site.  One of the developments on the way is the ability to ‘pin’ video and audio clips in the same way.

CEO Nick Stanhope was keen to point out that Historypin is a not-for-profit – they’re in partnership with Google, but not owned by them, and they don’t ask for any rights to any of the material posted on Historypin.  They’re keen to work with archives to add their photographic collections, and have a couple of things they hope to soon be able to offer archives in return (as well as increased exposure!):  they’ll be allowing any archive to have an instance of Historypin embedded on the archive’s site for free.  They’re also developing a smartphone app, and will be offering any archive their own branded version of the app – for free!  These developments sound really exciting, and I hope we hear more from them soon.

Teresa Doherty’s closing plenary was on the re-launch of the Genesis project.  As Teresa said, ‘many of you will be sitting there thinking “this isn’t plenary material! what’s going on?”’, but Teresa definitely made it a plenary worth attending.  Genesis is a project which allows users to cross-search women’s studies resources from museums, libraries and archives in the UK, and Teresa made the persuasive point that while the project itself might not be revolutionary, how they’ve done it is.  Genesis has had no funding since 200 – everything they’ve done since then, including the relaunch, has been done with only the in-house resources they have available.  They’ve used SRU to search the Archives Hub, and managed to put together a valuable service with minimal resources.

As a librarian and a new professional, I found Teresa’s insights into the history of archival cataloguing particularly fascinating.  I knew that ISAD(G) was released in 1996, but I hadn’t had any real understanding of what that meant: that before 1996, there were no standards or guidelines for archival cataloguing. Each institution would catalogue in entirely their own way – a revelation to me, and completely alien to my entirely standards-based professional background!  And I now have a new mantra, learned from one of Teresa’s old managers back in the early 90s:

‘We may not have a database now, but if we have structured data then one day we will have a database to put it in!’

I don’t think I’ve ever heard a better definition of the interoperability mindset.

After the day officially ended, it was off to the pub for a swift pint and wind-down. An excellent, instructive, and fun day.

Slides from the day are available on SlideShare – tag ukad.

A bit about Resource Discovery

The UK Archives Discovery Network (UKAD) recently advertised our upcoming Forum on the archives-nra listserv. This prompted one respondent to ask whether ‘resource discovery’ is what we now call cataloguing and getting the catalogues online. The respondent went on to ask why we feel it necessary to change the terminology of what we do, and labelled the term resource discovery as ‘gobledegook’. My first reaction to this was one of surprise, as I see it as a pretty plain-talking way of describing the location and retrieval of information, but then I thought that it’s always worth considering how people react and what leads them to take a different perspective.

It made me think that even within a fairly small community, which archivists are, we can exist in very different worlds and have very different experiences and understanding. To me, ‘resource discovery’ is a given; it is not in any way an obscure term or a novel concept. But I now work in a very different environment from when I was an archivist looking after physical collections, and maybe that gives me a particular perspective. Being manager of the Archives Hub, I have found that a significant amount of time has to be dedicated to learning new things and absorbing new terminology. There seem to be learning curves all over the place, some little and some big. Learning curves around understanding how our Hub software (Cheshire) processes descriptions, Encoded Archival Description, deciding whether to move to the EAD schema, understanding namespaces, search engine optimisation, sitemaps, application programming interfaces, character encoding, stylesheets, log reports, ways to measure impact, machine-to-machine interfaces, scripts for automated data processing, linked data and the semantic web, etc. A great deal of this is about the use of technology, and figuring out how much you need to know about technology in order to use it to maximum effect. It is often a challenge, and our current Linked Data project, Locah, is very much a case in point (see the Locah blog). Of course, it is true that terminology can sometimes get in the way of understanding, and indeed, defining and having a common understanding of terms is often itself a challenge.

My expectation is that there will always be new standards, concepts and innovations to wrestle with, try to understand, integrate or exclude, accept or reject, on pretty much a daily basis. When I was the archivist at the RIBA (Royal Institute of British Architects), back in the 1990s, my world centered much more around solid realities: around storerooms, temperature and humidity, acquisitions, appraisal, cataloguing, searchrooms and the never ending need for more space and more resources. I certainly had to learn new things, but I also had to spend far more time than I do now on routine or familiar tasks; very important, worthwhile tasks, but still largely familiar and centered around the institution that I worked for and the concepts and terminology commonly used by archivists. If someone had asked me what resource discovery meant back then, I’m not sure how I would have responded. I think I would have said that it was to do with cataloguing, and I would have recognised the importance of consistency in cataloguing. I might have mentioned our Website, but only in as far as it provided access through to our database. The issues around cross-searching were still very new and ideas around usability and accessibility were yet to develop.

Now, I think about resource discovery a great deal, because I see it as part of my job to think of how to best represent the contributors who put time and effort into creating descriptions for the Hub. To use another increasingly pervasive term, I want to make the data that we have ‘work harder’. For me, catalogues that are available within repositories are just the beginning of the process. That’s fine if you have researchers who know that they are interested in your particular collections. But we need to think much more broadly about our potential global market: all the people out there who don’t know they are interested in archives – some, even, who don’t really know what archives are. To reach them, we have to think beyond individual repositories and we have to see things from the perspective of the researcher. How can we integrate our descriptions into the ‘global information environment’ in a much more effective way? A most basic step here, for example, is to think about search engine optimisation. Exposing archival descriptions through Google, and other search engines, has to be one very effective way to bring in new researchers. But it is not a straightforward exercise – books are written about SEO and experts charge for their services in helping optimise data for the Web. For the Archives Hub, we were lucky enough to be part of an exercise looking at SEO and how to improve it for our site. We are still (pretty much as I write) working on exposing our actual descriptions more effectively.

Linked Data provides another whole world of unfamiliar terminology to get your head round. Entities, triples, URI patterns, data models, concepts and real world things, sparql queries, vocabularies – the learning curve has indeed been steep. Working on outputting our data as RDF (a modelling framework for Linked Data) has made me think again about our approach to cataloguing and cataloguing standards. At the Hub, we’re always on about standards and interoperability, and it’s when you come to something like Linked Data, where there are exciting possibilities for all sorts of data connections, well beyond just the archive community, that you start to wish that archivists catalogued far more consistently. If only we had consistent ‘extent’ data, for example, we could look at developing a lovely map-based visualisation showing where there are archives based on specific subjects all around the country and have a sense of where there are more collections and where there are fewer collections. If only we had consistent entries for people’s names, we could do the same sort of thing here, but even with thesauri, we often have more than one name entry for the same person. I sometimes think that cataloguing is more of an art than a science, partly because it is nigh on impossible to know what the future will bring, and therefore knowing how to catalogue to make the most of as yet unknown technologies is tricky to say the least. But also, even within the environment we now have, archivists do not always fully appreciate the global and digital environment which requires new ways of thinking about description. Which brings me back to the idea of whether resource discovery is another term for cataloguing and getting catalogues online. No, it is not. It is about the user perspective, about how researchers locate resources and how we can improve that experience.
The term ‘resource’ has increasingly become identified with the Web as a way to define the fundamental elements of the Web: objects that are available and can be accessed through the Internet, in fact, any concept that has an identity expressed as a URI. Yes, cataloguing is key to archives discovery, cataloguing to recognised standards is vital, and getting catalogues online in your own particular system is great…but there is so much more to the whole subject of enabling researchers to find, understand and use archives and integrating archives into the global world of resources available via the Web.

First class citizens of the Web

Linked Data enthusiasts like to talk about making concepts within data into first-class citizens. This should appeal to archivists. The idea that the concepts within our data are equal sounds very democratic, and is very appealing for rich data such as archival descriptions. But, where does that leave the notion of the all important top-level archival collection description? Archivists do tend to treat the collection description as superior; the series, sub-series, file, item, etc., are important, but subservient to the collection. You may argue that actually they are not less important, but they must be seen in the context of the collection. But I would still propose that (certainly within the UK) the collection-level description generally tends to be the focus and is considered to be the ‘right’ way into the collection, or at least, because of the way we catalogue, it becomes the main way into the collection.

Linked Data uses as its basis the data graph. This is different from the relational model and the tree structure model. In a graph, entities are all linked together in such a way that none has special status. All concepts are linked and the links are specified – that is to say, the relationships are clarified. In a tree structure, everything filters down, so it is inevitable that the top of the tree does seem like the most important part of the data. A data graph can be thought of as a tree structure where links go both ways, and nothing is top or bottom. You could still talk about the collection description being the ‘parent’ of the series description, but the series description is represented equally in RDF. But, maybe more fundamentally than this, Linked Data really moves away from the idea of the record as being at the heart of things and replaces this with the idea of concepts being paramount. The record simply becomes one other piece of data, one other concept.
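To make the contrast with a tree a little more concrete, here is a minimal Python sketch of a data graph held as plain (subject, predicate, object) triples. Everything in it is an assumption made for illustration: the `ex:` identifiers, the predicate names and the helper functions are invented here and are not taken from the Hub data or from any real vocabulary. The point is only that, once each link can be followed in both directions, you can start from a file or a topic just as easily as from the collection.

```python
from collections import defaultdict

# Hypothetical identifiers and predicates, invented for this sketch.
TRIPLES = [
    ("ex:collection1", "ex:hasPart", "ex:series1"),
    ("ex:series1", "ex:hasPart", "ex:file1"),
    ("ex:collection1", "ex:associatedWith", "ex:person1"),
    ("ex:file1", "ex:about", "ex:topic1"),
]


def build_index(triples):
    """Index every triple under both its subject and its object."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj, "out"))
        index[obj].append((predicate, subject, "in"))
    return index


def reachable(index, start):
    """Return every node that can be reached from start, in either direction."""
    seen = {start}
    queue = [start]
    while queue:
        node = queue.pop()
        for _, other, _ in index[node]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


INDEX = build_index(TRIPLES)

# Starting from the topic reaches the same nodes as starting from the collection.
print(sorted(reachable(INDEX, "ex:topic1")))
print(sorted(reachable(INDEX, "ex:collection1")))
```

The collection is still the ‘parent’ of the series in the sense that an `ex:hasPart` link exists, but nothing in the structure makes it the required entry point.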

This type of modelling accords with the idea that users want to access the data from all sorts of starting points, and that they are usually interested in finding out about something real (a subject, a person) rather than an archive per se. When you model your data into RDF what you are trying to think about is exactly that – how will people want to access this data. In Australia, the record series is the preferred descriptive entry, and a huge amount has been written about the merits of this approach. It seems to me, with RDF, we don’t need to start with the collection or start with the series. We don’t need to start with anything.

Linked Data graph

This diagram, courtesy of Talis, shows part of a data graph for modelling information about spacecraft. You can see how the subjects (which are always represented by URLs) have values that may be literal (in rectangular boxes) or may point to other resources (URLs). Some of this data may come from other datasets (use of the same URL for a spacecraft enables you to link to a different resource and use the values within that resource).
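As a rough sketch of what the diagram describes, the Python below distinguishes literal values from values that point to other resources, and merges two small datasets that happen to use the same identifier for a spacecraft. All of the URLs, property names and values are placeholders invented for this example; they are not the ones in the Talis diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A plain value, such as a name (the rectangular boxes in the diagram)."""
    value: str


@dataclass(frozen=True)
class Resource:
    """Something identified by a URL, which can have statements of its own."""
    uri: str


# Hypothetical identifiers, invented for this sketch.
SPACECRAFT = Resource("http://example.org/spacecraft/1")
LAUNCH = Resource("http://example.org/launch/1")

# Two separate datasets that use the same URL for the spacecraft.
dataset_a = {
    (SPACECRAFT, "name", Literal("Example Probe")),
    (SPACECRAFT, "launch", LAUNCH),
}
dataset_b = {
    (SPACECRAFT, "mass_kg", Literal("100")),
    (LAUNCH, "site", Literal("Example Launch Site")),
}


def describe(graph, subject):
    """Collect the property/value pairs stated about one subject."""
    return {predicate: obj for s, predicate, obj in graph if s == subject}


# Because the identifier is shared, merging is just a set union.
merged = dataset_a | dataset_b
description = describe(merged, SPACECRAFT)
print(description)
print(describe(merged, description["launch"]))
```

The shared URL is what does the work here: neither dataset needed to know about the other in advance.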

The emphasis here is on the data – the concepts – not on the carrier of the data – the ‘record’.

In our LOCAH project we will need to look at the issue of hierarchy of multi-level descriptions. In truth, I am not yet familiar enough with Linked Data to really understand how this is going to work, and we have not yet really started to tackle this work. I think I’m still struggling to move away from thinking of the record as the basis of things, because, to coin a rather tiresome phrase, RDF modelling is a paradigm shift.  RDF is all about relationships between concepts and I will be interested to see where this leaves relationships between hierarchical parts of an archive description. But I am heartened by Rob Styles’ (of Talis) assertion that RDF allows anyone to say anything about anything.

Who is the creator?

I am currently working on an exciting new Linked Data project, looking at exposing the Archives Hub metadata in a different way, that could provide great potential for new uses of the data. More on that in future posts. But it has got me thinking about the thorny issue of ‘Name of creator(s)’, as ISAD(G) says. The ‘creator’ of the archive. In RDF modelling (required for Linked Data output) we need to think about how data elements relate to each other and be explicit about the data elements and the relationships between concepts.

Dublin Core has a widely used ‘creator’ element – it would be nice and easy to use that to define the relationship between the person and the archive. The ‘Sir Ernest Shackleton Collection’ was created by Sir Ernest Shackleton. There is our statement. For RDF we’ll want to identify the names of things with URIs, but leaving that for now, what I’m interested in here is the predicate – the collection was created by Sir Ernest Shackleton, an Antarctic explorer whose papers are represented on the Hub.

The only trouble with this is that the collection was not created by him. Well, it was and it wasn’t. The ‘collection’ as a group of things was created by him. That particular group of things would not exist otherwise. But people will usually take ‘created by’ to mean ‘authored by’. It is quite possible that none of the items in the collection were authored by Sir Ernest Shackleton. ISAD(G) refers to the ‘creation, accumulation and maintenance’ and uses ‘creator’ as shorthand for these three different activities. EAD uses ‘origination’ for the ‘individual or organisation responsible for the creation, accumulation or assembly of the described materials’. Maybe that definition is more accurate because it says ‘or assembly’. The idea of an originator appears to get nimbly around the fact that the person or organisation we attribute the archive to is not necessarily the author – they did not necessarily create any of the records. But the OED defines the originator as the person who originates something, the creator.

It all seems to hang upon whether the creator can reasonably mean the creator of this archive collection – they are responsible for this collection of materials coming together. The trouble is, even if we go with that, it might work within an archival context – we all agree that this is what we mean – but it doesn’t work so well in a general context. If our Linked Data statement is that the Sir Ernest Shackleton collection ‘was created by’ Sir Ernest Shackleton then this is going to be seen, semantically, as the bog-standard meaning of creator, especially if we use a vocabulary that usually defines creator as author. Dublin Core has dc:creator. Dublin Core does not really have the concept of an archival originator, and I suspect that there are no other vocabularies that have addressed this need.
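To show how much rests on the choice of predicate, here is a small Python sketch that writes the same statement twice in N-Triples style, once with the Dublin Core creator term and once with a made-up archival predicate. The collection and person URIs and the `accumulatedBy` property are purely hypothetical, invented for this example; only the Dublin Core term URI is a real one. This is an illustration of the problem, not a proposed solution.

```python
# The Dublin Core 'creator' term (a real, published URI).
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"

# Hypothetical predicate and identifiers, invented for this sketch.
ACCUMULATED_BY = "http://example.org/vocab/accumulatedBy"
COLLECTION = "http://example.org/id/collection/shackleton"
PERSON = "http://example.org/id/person/shackleton"


def to_ntriple(subject, predicate, obj):
    """Format one statement between URI-identified things as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Same subject and object; only the predicate, and so the meaning, changes.
as_creator = to_ntriple(COLLECTION, DCTERMS_CREATOR, PERSON)
as_accumulator = to_ntriple(COLLECTION, ACCUMULATED_BY, PERSON)

print(as_creator)
print(as_accumulator)
```

A consumer outside the archive world will read the first line with the general meaning of creator, while the second only means something if the vocabulary behind it is defined and shared.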

I would like to end this post with an insightful solution…but none such is coming to me at present. I suppose the most accurate one word description of the role of this person or organisation is ‘accumulator’ or ‘gatherer’. But something doesn’t sound quite right when you start talking about the accumulator. Sounds a bit like a Hollywood movie. Maybe gives it a certain air of mystery, but for representing data in RDF we need clarity and consistency in the use of terms.