Is Linked Data an Appropriate Technology for Implementing an Archive’s Catalogue?

Here at the Archives Hub we’ve not been so focussed on Linked Data (LD) in recent years, as we’ve mainly been working on developing and embedding our new system and workflows. However, we have remained interested in what’s going on and are still looking at making Linked Data available in a sustainable way. We did a substantial amount of work a number of years back on the LOCAH project, through which we provided a subset of archival linked data at data.archiveshub.ac.uk. Our next step this time round is likely to be embedding schema.org markup within the Hub descriptions. We’ve been closely involved in the W3C Schema Architypes Group activities, with Archives Hub URIs forming the basis of the group’s proposals to extend the “Schema.org schema for the improved representation of digital and physical archives and their contents”.
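By way of illustration, here is a minimal sketch of the kind of schema.org markup (JSON-LD built with Python) that might be embedded in a Hub description. The collection URI, names and dates are invented, and the types used follow the Architypes group’s archives proposals rather than anything we have yet published.

```python
import json

# Illustrative sketch only: the identifier, names and dates are invented, and
# the ArchiveComponent / ArchiveOrganization types follow the Schema
# Architypes proposals rather than markup the Hub has actually published.
description = {
    "@context": "https://schema.org",
    "@type": ["Collection", "ArchiveComponent"],
    "@id": "https://archiveshub.jisc.ac.uk/data/gb123-example",  # hypothetical URI
    "name": "Example Family Papers",
    "holdingArchive": {
        "@type": "ArchiveOrganization",
        "name": "Example University Archives",
    },
    "temporalCoverage": "1890/1920",
}

# This JSON-LD could be embedded in a description page inside a
# <script type="application/ld+json"> element.
print(json.dumps(description, indent=2))
```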

We are also aiming to reconnect more closely with the LODLAM community generally, and to this end I attended a TNA ‘Big Ideas’ session ‘Is Linked Data an appropriate technology for implementing an archive’s catalogue?’ given by Jean-Luc Cochard of the Swiss Federal Archives. I took a few notes which I thought it might be useful to share here.

Why look at Linked Data?

This was initially inspired by the Stanford LD 2011 workshop and the 2014 Open data.swiss initiative. In 2014 they built their first ‘aLOD’ prototype – http://alod.ch/

The Swiss have many archive silos; they transformed the content of some of these systems to LD and were then able to merge them. They created basic LD views, Jean-Luc noting that the LD data is less structured than the data in the main archival systems, for example: http://data.ge.alod.ch/id/archivalresource/adl-j-125

They also developed a new interface, http://alod.ch/search/, with which they are trying an innovative approach to presenting the data, such as providing a histogram of dates. It is currently just a prototype interface running off SPARQL, with only 16,000 entries so far.

They are now implementing a new archival information system (AIS) and are considering LD technology for the new system, but may go with a more conventional database approach. The new system has to work with the overall technical architecture.

Linked data maturity?

Jean-Luc noted that although 90% of the archive is currently analogue, they expect born-digital material to expand by a factor of ten within three years. The system needs to cope with 50M – 1.5B triples. They have implemented Stardog triple store versions 5.0.5 and 5.2. The larger configuration is a machine with 1 TB of RAM, 56 CPUs and 8 TB of disk.

As part of performance testing they tried loading the system with up to 10 billion triples and running various insert, delete and query functions. The larger configuration allowed 50M triple inserts in 5 minutes; 100M+ triples took 20 minutes to insert. The update function was found to be quite stable. They then combined querying with triple insertions at the same time, which highlighted some issues with slow insertions on the smaller machine. They also tried full-text indexing with the larger configuration and got very variable results, with some very slow response times on insertions; the latter turned out to be a bug in the system.
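For anyone curious about what such a test looks like in practice, here is a minimal sketch assuming a triple store that exposes standard SPARQL 1.1 query and update endpoints over HTTP; the endpoint URLs, database name and data are placeholders, not the Swiss Federal Archives’ actual configuration.

```python
import time
import requests

# Rough sketch of the kind of load test described: time bulk INSERTs against a
# SPARQL update endpoint while read queries run alongside. The endpoint URLs,
# database name and data are placeholders, not the actual Swiss setup.
UPDATE_ENDPOINT = "http://localhost:5820/mydb/update"  # hypothetical Stardog-style endpoint
QUERY_ENDPOINT = "http://localhost:5820/mydb/query"

def insert_batch(n):
    """Insert n synthetic triples in a single SPARQL UPDATE request and time it."""
    triples = "\n".join(
        f'<http://example.org/resource/{i}> <http://example.org/value> "value {i}" .'
        for i in range(n)
    )
    start = time.time()
    r = requests.post(UPDATE_ENDPOINT, data={"update": f"INSERT DATA {{ {triples} }}"})
    r.raise_for_status()
    return time.time() - start

def count_triples():
    """Run a simple read query, so reads and writes can be interleaved."""
    query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"
    r = requests.get(QUERY_ENDPOINT, params={"query": query},
                     headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    return r.json()["results"]["bindings"][0]["n"]["value"]

if __name__ == "__main__":
    print(f"10,000 triples inserted in {insert_batch(10_000):.1f}s")
    print("store now holds", count_triples(), "triples")
```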

Is Linked Data adequate for the task?

A key weakness of their current archival system is that you can only assign records to one provenance/person. Also, their current system can’t connect records to other databases, so they have the usual silo problem. Linked data can solve some of these problems. As part of the project they looked at various specs and standards:

  • BIBFRAME v2.0 (2016)
  • Europeana EDM, released 2014
  • EGAD activities – RiC-CM and RiC-O (Records in Contexts), based on OWL
  • A local initiative – the Matterhorn RDF Model, which uses existing technologies: RDA, BPMN, DC, PREMIS. There is a first draft available.

They also looked at relevant EU R&D projects: ‘Prelia’, on the preservation of LD, and ‘Diachron’, on managing the evolution and preservation of LD.

Jean-Luc noted that the versatility of LD is appealing for several reasons –

  • It can be used at both the data and metadata levels.
  • It brings together multiple data models.
  • It allows data model evolution.
  • They believe it is adequate for publishing an archive catalogue on the web.
  • It can be used in a closed environment.

Jean-Luc mentioned a dilemma they have between RDF-based triple stores and graph databases. Graph databases tend to be proprietary solutions, but have some advantages. Graph databases tend to use ACID transactions, intended to guarantee validity even in the event of errors, power failures, etc., but they are not sure how ACID-reliable triple stores are.

Their next step is expert discussion of a common approach, with a common RDF model. Further investigation is needed regarding triple store weaknesses.

Who is the creator?

I am currently working on an exciting new Linked Data project, looking at exposing the Archives Hub metadata in a different way, which could provide great potential for new uses of the data. More on that in future posts. But it has got me thinking about the thorny issue of ‘Name of creator(s)’, as ISAD(G) calls it. The ‘creator’ of the archive. In RDF modelling (required for Linked Data output) we need to think about how data elements relate to each other and be explicit about the data elements and the relationships between concepts.

Dublin Core has a widely used ‘creator’ element – it would be nice and easy to use that, or a simple ‘createdBy’ predicate, to define the relationship between the person and the archive. The ‘Sir Ernest Shackleton Collection’ createdBy Sir Ernest Shackleton. There is our statement. For RDF we’ll want to identify the names of things with URIs, but leaving that for now, what I’m interested in here is the predicate – the collection was created by Sir Ernest Shackleton, an Antarctic explorer whose papers are represented on the Hub.

The only trouble with this is that the collection was not created by him. Well, it was and it wasn’t. The ‘collection’ as a group of things was created by him. That particular group of things would not exist otherwise. But people will usually take ‘created by’ to mean ‘authored by’. It is quite possible that none of the items in the collection were authored by Sir Ernest Shackleton. ISAD(G) refers to the ‘creation, accumulation and maintenance’ and uses ‘creator’ as shorthand for these three different activities. EAD uses ‘origination’ for the ‘individual or organisation responsible for the creation, accumulation or assembly of the described materials’. Maybe that definition is more accurate because it says ‘or assembly’. The idea of an originator appears to get nimbly around the fact that the person or organisation we attribute the archive to is not necessarily the author – they did not necessarily create any of the records. But the OED defines the originator as the person who originates something, the creator.

It all seems to hang upon whether the creator can reasonably mean the creator of this archive collection – they are responsible for this collection of materials coming together. The trouble is, even if we go with that, it might work within an archival context – we all agree that this is what we mean – but it doesn’t work so well in a general context. If our Linked Data statement is that the Sir Ernest Shackleton collection ‘was created by’ Sir Ernest Shackleton then this is going to be seen, semantically, as the bog-standard meaning of creator, especially if we use a vocabulary that usually defines creator as author. Dublin Core has dc:creator. Dublin Core does not really have the concept of an archival originator, and I suspect that there are no other vocabularies that have addressed this need.
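To make the dilemma concrete, here is a rough sketch using Python and rdflib: the same relationship expressed once with dcterms:creator and once with a hypothetical local ‘originator’ property. The hub: namespace and both resource URIs are invented for illustration; neither is an existing Archives Hub vocabulary.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import DCTERMS

# All URIs below are invented for illustration; the hub: 'originator' property
# is hypothetical, not an existing Archives Hub vocabulary term.
HUB = Namespace("http://example.org/vocab/")
collection = URIRef("http://example.org/archives/shackleton-collection")
person = URIRef("http://example.org/people/ernest-shackleton")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("hub", HUB)

# Reading 1: the widely understood, but potentially misleading, statement
g.add((collection, DCTERMS.creator, person))

# Reading 2: a local property matching the EAD sense of 'responsible for the
# creation, accumulation or assembly' of the materials
g.add((collection, HUB.originator, person))

print(g.serialize(format="turtle"))
```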

I would like to end this post with an insightful solution…but none such is coming to me at present. I suppose the most accurate one word description of the role of this person or organisation is ‘accumulator’ or ‘gatherer’. But something doesn’t sound quite right when you start talking about the accumulator. Sounds a bit like a Hollywood movie. Maybe gives it a certain air of mystery, but for representing data in RDF we need clarity and consistency in the use of terms.

Linked Data: one thing leads to another

I attended another very successful Linked Data meetup in London on 24 February. This was the second meetup, and the buzz being created by Linked Data has clearly been generating a great deal of interest, as around 200 people signed up to attend.
All of the speakers were excellent, and found that very helpful balance between being expert and informative whilst getting their points across clearly and in a way that non-technical people could understand.

Tom Heath (Talis) took us back to the basic principles of Linked Data. It is about taking statements and expressing them in a way that meshes more closely with the architecture of the Web. This is done by assigning identifiers to things. A statement such as ‘Jane Stevenson works at Mimas’ can be broken down and each part can be given a URI identifier. I have a URI (http://www.archiveshub.ac.uk/janefoaf.rdf) and Mimas has a URI (http://www.mimas.ac.uk/). The predicate that describes the relationship ‘worksAt’ must also have a URI. Creating statements like this, we can start linking datasets together.
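As a minimal sketch of that statement in RDF (using Python’s rdflib; the ‘worksAt’ predicate URI below is invented for illustration rather than taken from an established vocabulary):

```python
from rdflib import Graph, Namespace, URIRef

# The 'worksAt' predicate URI is invented for illustration; in practice you
# would reuse a term from an established vocabulary where one exists.
REL = Namespace("http://example.org/relations/")

jane = URIRef("http://www.archiveshub.ac.uk/janefoaf.rdf")
mimas = URIRef("http://www.mimas.ac.uk/")

g = Graph()
g.add((jane, REL.worksAt, mimas))  # subject, predicate and object are all URIs

print(g.serialize(format="nt"))
```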

Tom talked about how this Linked Data way of thinking challenges the existing metaphors that drive the Web, and this was emphasised throughout the sessions. The document metaphor is everywhere – we use it all the time when talking about the Web; we speak about the desktop, about our files, and about pages as documents. It is a bit like thinking about the Web as a library, but is this a useful metaphor? Maybe we should be moving towards the idea of an exploratory space, where we can reach out, touch and interact with things. Linked Data is not about looking for specific documents. If I take the Archives Hub as an example, Linked Data is not so much concerned with the fact that there is a page (document) about the Agatha Christie archive; what it is concerned about is the things/concepts within that page. Agatha Christie is one of the concepts, but there are many others – other people, places, subjects. You could say the description is about many things that are linked together in the text (in a way humans can understand), but it is presented as a page about Agatha Christie. This traditional way of thinking hides references within documents; they are not ‘first class citizens of the web’ in themselves. Of course, a researcher may be wanting information about Agatha Christie archives, and then this description will be very relevant. But they may be looking for information about other concepts within the page. If ‘Torquay’ and ‘novelist’ and ‘nursing’ and ‘Poirot’ and all the other concepts were brought to the fore as things in their own right, then the data could really be enriched. With Linked Data you can link out to other data about the same concepts and bring it all together.

Tom spoke very eloquently about how you can describe any aspect you like of any thing you like by giving identifiers to things – it means you can interact with them directly. If a researcher wants to know about the entity of Agatha Christie, the Linked Data web would allow them to gather information about that topic from many different sources; if concepts relating to her are linked in a structured way, then the researcher can undertake a voyage of discovery around their topic, utilising the power that machines have to link structured data, rather than doing all the linking up manually. So, it is not a case of gathering ‘documents’ about a subject, but of gathering information about a subject. However, if you have information on the source of the data that you gather (the provenance), then you can go to the source as well. Linked Data does not mean documents are unimportant, but it means that they are one of the things on the Web along with everything else.

Having a well-known data provider such as the BBC involved in Linked Data provides a great example of what can be done with the sort of information that we all use and understand. The BBC Wildlife Finder is about concepts and entities in the natural world. People may want to know about specific BBC programmes, but they are more likely to want to know about lions, or tigers, or habitats, or breeding, or other specific topics covered in the programmes. The BBC are enabling people to explore the natural world through using Linked Data. What underlies this is the importance of having URIs for all concepts. If you have these, then you are free to combine them as you wish. All resources, therefore, have HTTP URIs. If you want to talk about the sounds that lions make, or just the programmes about lions, or just one aspect of a lion’s behaviour, then you need to make sure each of these concepts has an identifier.

Wildlife Finder has almost no data itself; it comes from elsewhere. They pull stuff onto the pages, whether it is data from the BBC or from elsewhere. DBPedia (Wikipedia output in RDF) is particularly important to the BBC as a source of information. The BBC actually go to Wikipedia and edit the text there, something that benefits Wikipedia and other users of Wikipedia. There is no point replicating data that is already available. DBPedia provides a big controlled vocabulary – you can use the URI from Wikipedia to clarify what you are talking about, and it provides a way to link stuff together.
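As a rough sketch of that idea (the local URI below is invented; the DBpedia URI is real): asserting that your local concept is the same thing as a DBpedia resource lets anyone holding the DBpedia URI find and combine your data with theirs.

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import OWL, RDFS

# The local concept URI is invented; the DBpedia URI is the real resource for
# lions. The owl:sameAs link lets others connect this data to theirs.
local_lion = URIRef("http://example.org/wildlife/lion")
dbpedia_lion = URIRef("http://dbpedia.org/resource/Lion")

g = Graph()
g.add((local_lion, RDFS.label, Literal("Lion")))
g.add((local_lion, OWL.sameAs, dbpedia_lion))

print(g.serialize(format="turtle"))
```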

Tom Scott from the BBC told us that the BBC have only just released all the raw data as RDF. If you go to the URL it content negotiates to give you what you want (though he pointed out that it is not quite perfect yet). Tom showed us the RDF data for an Eastern Gorilla, providing all of the data about concepts that go with Eastern Gorillas in a structured form, including links to programmes and other sources of information.
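Content negotiation is easy to try for yourself. Here is a small sketch using Python’s requests library; since the BBC URLs shown in the talk may have changed, it uses a DBpedia resource URI purely to illustrate the mechanism of asking one URI for either HTML or RDF.

```python
import requests

# Sketch of HTTP content negotiation: the same resource URI can return HTML
# for a browser or RDF for a machine, depending on the Accept header. A
# DBpedia URI is used here purely to illustrate the mechanism.
uri = "http://dbpedia.org/resource/Eastern_gorilla"

html = requests.get(uri, headers={"Accept": "text/html"})
rdf = requests.get(uri, headers={"Accept": "text/turtle"})

print(html.status_code, html.headers.get("Content-Type"))
print(rdf.status_code, rdf.headers.get("Content-Type"))
print(rdf.text[:500])  # the first few hundred characters of the Turtle
```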

Having two heavyweights such as the BBC and the UK Government involved in Linked Data certainly helps give it momentum. The Government appears to have understood that the potential for providing data as open Linked Data is tremendous, in terms of commercial exploitation, social capital and improving public service delivery. A number of times during the sessions the importance of doing things in a ‘web-centric’ way was emphasised. John Sheridan from The National Archives talked about data.gov.uk and the importance of having ‘data you can click on’. Fundamentally, Linked Data standards enable the publication of data in a very distributed way. People can gather the data in ways that are useful to them. For example, with data about schools, what is most useful is likely to be a combination of data, but rather than trying to combine the data internally before publishing it, the Government want all the data providers to publish their data and then others can combine it to suit their own needs – you don’t then have to second guess what those needs are.

Jeni Tennison, from data.gov.uk, talked about the necessity of working out core design patterns to allow Linked Data to be published fast and relatively cheaply. I felt that there was a very healthy emphasis on this need to be practical, to show benefits and to help people wanting to publish RDF. You can’t expect people to just start working with RDF and SPARQL (the query language for RDF). You have to make sure it is easy to query and process, which means creating nice friendly APIs for them to use.

Jeni talked about laying tracks to start people off, helping people to publish their data in a way that can be consumed easily. She referred to ‘patterns’ for URIs for public sector things, definitions, classes, datasets, and providing recommendations on how to make URIs persistent. The Government have initial URI sets for areas such as legislation, schools, geographies, etc. She also referred to the importance of versioning: with things having multiple sources and multiple versions over time, it is important to be able to relate back to previous states. They are looking at using named graphs in order to collect together information that has a particular source, which provides a way of getting time-sliced data. Finally, ensuring that provenance is recorded (where something originated, processing, validation, etc.) helps with building trust.
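As a rough sketch of the named-graph idea (using rdflib; all URIs and figures below are invented): each dated load of a dataset can go into its own graph, so a query can target a particular snapshot or ask which graph (and hence which source) a statement came from.

```python
from rdflib import Dataset, Namespace, URIRef, Literal

# All URIs and figures are invented. Each dated load goes into its own named
# graph, so queries can target one snapshot or ask which graph (source)
# asserts a given statement.
EX = Namespace("http://example.org/")
ds = Dataset()

snapshot = ds.graph(URIRef("http://example.org/graph/schools/2010-02-01"))
snapshot.add((EX.school123, EX.pupils, Literal(450)))

# Which graphs assert a pupil count for this school, and what is it?
results = ds.query("""
    SELECT ?g ?pupils WHERE {
      GRAPH ?g { <http://example.org/school123> <http://example.org/pupils> ?pupils }
    }
""")
for g, pupils in results:
    print(g, pupils)
```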

There was some interesting discussion on responsibilities for minting URIs. Certain domains can be seen to have responsibilities for certain areas, for example, the government minting URIs for government departments and schools, the World Health Organisation for health related concepts. But should we trust DBPedia URIs? This is an area where we simply have to make our own judgements. The BBC reuse the DBPedia URI slugs (the string-part in a URL to identify, describe and access a resource) on their own URLs for wildlife, so their URLs have the ‘bbc.co.uk’ bit and the DBPedia bit for the resource. This helps to create some cohesion across the Web.

There was also discussion about the risks of costs and monopolies – can you rely on data sources long-term? Might they start to charge? Speakers were asked about the use of http URIs – applications should not need to pick them apart in order to work out what they mean. They are opaque identifiers, but they are used by people, so it is useful for them to be usable by people, i.e. readable and understandable. As long as the information is made available in the metadata we can all use it. But we have got to be careful to avoid using URIs that are not persistent – if a page title in Wikipedia changes the URI changes, and if the BBC are using the Wikipedia URI slug then that is a problem. Tom Scott made the point that it is worth choosing persistence over usability.

The development of applications is probably one of the main barriers to uptake of Linked Data. It is very different from, and more challenging than, building applications on top of a known database under your control. In Linked Data, applications need to access multiple datasets.

The session ended by again stressing the importance of thinking differently about data on the Web: start with things that people care about, not with a document-centric way of thinking. This is what the BBC have done with the Wildlife Finder. People care about lions, about the savannah, about hunting, about life-span, not about specific documents. It is essential to identify the real world things within your website. It is the modelling that is one of the biggest challenges – thinking about what you are talking about and giving those things URIs. A modelled approach means you can start to let machines do the things they are best at, and leave people to do the things that they are best at.

Post by Jane Stevenson (jane.stevenson@manchester.ac.uk)

Image: Linked Data Meetup, February 2010, panel discussion.

A few thoughts on context and content

I have been reading with interest the post and comments on Mark Matienzo’s blog: http://thesecretmirror.com. He asks ‘Must contextual description be bound to records description?’

I tend to agree with his point of view that this is not a good thing. The Archives Hub uses EAD, and our contributors happily add excellent biographical and administrative history information into their descriptions, via the <bioghist> tag, information that I am sure is very valuable for researchers. But should our descriptions leave out this sort of information and be just descriptions of the collection and no more? Wouldn’t it be so much more sensible to then link to contextual information that is stored separately?
Possibly, on the other side of the argument, if archivists created separate biographical/administrative history records, would they still want to contextualise them for specific collection descriptions anyway? It makes perfect sense to have the information separate to the collection description if it is going to be shared, but will archivists want to modify it to make it relevant to particular collections? Is it sensible to link to a comprehensive biographical record for someone when you are describing a very small collection that only refers to a year in their life?
Of course, we don’t have the issue with EAD at the moment, in so far as we can’t include an EAC-CPF record in an EAD record anyway, because it doesn’t allow stuff to be included from other XML schemas (no components from other namespaces can be used in EAD). But I can’t help thinking that an attractive model for something like the Archives Hub would be collection descriptions (including sub-fonds, series, items) that can link to whatever contextual information is appropriate, whether that information is stored by us or elsewhere. This brings me back to my current interest – Linked Data. If the Web is truly moving towards the Linked Data model, then maybe EAD should be revised in line with this? By breaking information down into logical components, it can be recombined in more imaginative ways – open and flexible data!
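To make that model concrete, here is a rough sketch using rdflib, with invented URIs throughout: the collection description stays lean and simply points to a separately maintained biographical record, which could be held by us or by someone else entirely.

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS, RDFS

# All URIs are invented. The collection description holds collection-level
# information plus a link; the biographical/administrative history lives in
# its own record, maintained and shared separately.
collection = URIRef("http://example.org/archives/gb123-smith-papers")
agent = URIRef("http://example.org/agents/john-smith")  # could equally be a URI held elsewhere

g = Graph()
g.add((collection, RDFS.label, Literal("John Smith Papers")))
g.add((collection, DCTERMS.creator, agent))  # the link out to the contextual record

# A consumer could then dereference the agent URI and pull in whatever
# biographical detail is published there, e.g. Graph().parse(str(agent)).
print(g.serialize(format="turtle"))
```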

Linked Data: towards the Semantic Web

The Semantic Web has always interested me, although some years have elapsed since I first came across it. It feels like it took a back seat for a while, but now it is back and starting to go places, particularly with the advent of Linked Data, which is a central concept behind the Semantic Web.
The first Linked Data Meetup was recently held in London, with presentations, case studies, panels and a free bar in the evening, courtesy of Talis and the prize winners of Best-in-use-Track Paper award from the European Semantic Web conference, who generously donated their winnings behind the bar. The venue may have been hidden away in Hammersmith, but the room was packed and the general atmosphere was one of expectation and enthusiasm.
I am still in the process of trying to grasp the issues surrounding the Semantic Web, and whilst some of the presentations at this event were a little over my head, there was certainly a great deal to inform and interest, with a good mix of people, including programmers, information professionals and others, although I was probably the only archivist!
One of the most important messages that came across was the importance of http URIs, without which linked data cannot work. URIs may commonly be URLs but essentially they are also unique identifiers, and this is what is important about them. We heard about what the BBC are up to from Tom Scott. They are making great strides with linked data, creating identifiers for every programme, in order to make the programme into an entity. But there are identifiers for a great deal more than just programmes – natural history is a subject area they have been focussing on, and now they have identifiers for animals, for groups of animals, for species, for where they live, etc. By ensuring that all of these entities have URIs it is possible to think about linking them in imaginative ways. Furthermore, relationships between entities have URIs – this is where the idea of triples comes in, referring to the concept of a subject linked to an object through a relationship.
The three parts of each triple are called its subject, predicate, and object. A triple mirrors the basic structure of a simple sentence, such as: the Archives Hub is based at Mimas. The Hub is the subject, ‘is based at’ is the predicate, and Mimas is the object.
Whilst humans may read sentences such as this and understand the entities and the relationships, the Semantic Web vision is that machines can do the same – finding, sharing, analysing and combining information.
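As a small sketch of the machine side of this (using rdflib; the URIs and the ‘isBasedAt’ and ‘locatedIn’ predicates are invented for illustration): once statements like these are expressed as triples, a SPARQL query can find and combine them.

```python
from rdflib import Graph, Namespace, Literal

# The URIs and predicates are invented for illustration.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.ArchivesHub, EX.isBasedAt, EX.Mimas))
g.add((EX.Mimas, EX.locatedIn, Literal("Manchester")))

# A machine can now answer a question a human would read out of the two
# sentences: where is the Archives Hub based, and where is that place?
results = g.query("""
    SELECT ?org ?place WHERE {
      <http://example.org/ArchivesHub> <http://example.org/isBasedAt> ?org .
      ?org <http://example.org/locatedIn> ?place .
    }
""")
for org, place in results:
    print(org, place)
```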
Issues such as sustainability were raised, and the great need to make Linked Data easier to create and use. We heard about DataIncubator.org, a project that is creating and publishing Linked Data. The Talis Connected Commons scheme offers free access to the Talis platform for public domain data, which means you have access to an online triple store. Talis will host the data, although the end goal is for the original curator of the data to take it back and publish it themselves. But this does seem to be a great way to help get the momentum going on Linked Data. Talis are one of the leading suppliers of library software, but clearly they have decided to put their weight behind the Semantic Web, and they are keen to engage the community in this by providing help and support with dataset conversion, that is to say, conversion of data into RDF.
There was some talk of the need to encourage community norms, for example, with linking and attribution, something that is particularly important when taking someone else’s data. People should be able to trace the path back to the original dataset. Another issue that came up was the need to work together, particularly avoiding different people working on converting the same dataset. It is important to make all of the code available and to benefit from shared expertise. It was very obvious that the people taking part in this event and showing us their projects were keen to collaborate and take a very open approach.
Leigh Dodds from Talis explained that dataincubator.org has already converted some major datasets, such as the NASA space flight dataset, which includes every space flight launch since 1950, and OpenLibrary, which already publishes RDF, but the modelling of the data was not great, so Talis have helped with this. The data that Leigh talked about is already in the public domain, so the essential task is to model it for output as RDF. Leigh gave us two of his wish-list datasets for possible conversion: the Prelinger Archives, a collection of over 2,000 historic films (the content is in the Internet Archive), and Lego, which adds a fun element and would mean a meeting of similar minds, as people into Lego are generally as anal as those who are into the Semantic Web!
Whilst many of the participants at the Linked Data Meetup were enthusiastic programmers rather than business people or managers, there was still a sense of the importance of the business case and taking a more intelligent approach to promotion and marketing.
Archivists are always very interested in issues of privacy, rights, and the ownership of data, and these issues were also recognised and discussed, though not in any detail. There did seem to be a rather curious suggestion of changing copyright law to ‘protect facts’, and thus bring it more in line with what is happening in the online environment.
As well as examples of what is happening at the BBC, we heard about various other projects, such as timetric, a project to enable people to find, store, share, track, publish and understand statistics. This is essentially about linking statistics and URIs and creating meaningful relationships between numbers. One of the interesting observations made here was that it is better to collect the data first and then decide how to sort and present it, rather than beforehand, because otherwise you may design something that does not fit in with what people want.
For me, the Government Data Panel was one of the highlights of the day. It gave me a good sense of what is happening at the moment with Linked Data and what the issues are. Tim Berners-Lee (inventor of the Web) and Nigel Shadbolt talked about the decision to prioritise UK government data within the Linked Data project – clearly it is of great value for a whole host of reasons, and a critical mass of data can be achieved if the government are on board, and also we should not forget that it is ‘our data’ so it should be opened up to us – public sector data touches all of us, businesses, institutions, individuals, groups, processes, etc.
The Linked Data project is not about changing the way government data is managed but about access, enabling the data to be used by all kinds of people for all kinds of things. It is not just about transparency, but about actually running things better – it may increase efficiencies if the data is opened up in this way. Tim Berners-Lee told us how government ministers tended to refer to ‘the database’ of information, as in the creation of one massive database, a misconception of what this Linked Data project is all about. Ministers have also raised worries about personal data, about whether this project will require more time and effort from them, and whether they will have to change their practices. But within government there are a few early adopters who ‘get it’, and it will be important to try to clone that understanding! There was brief mention, in passing, of the Ordnance Survey being charged to make money to run its operations, and therefore there is a problem with getting this data. Similarly, when parts of the public sector were privatised, the franchises took the data with them (e.g. train timetables).
Location data was recognised as being of great importance. A huge percentage of data has location in it, and it can act as a hub to join disparate datasets. We need an RDF datastore of counties, authorities, constituencies, etc., and we should think about the importance of using the same identifier for a particular location so that we can use the location data in this way.
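As a sketch of the ‘location as a hub’ idea (all URIs and figures invented): two notional datasets that both use the same URI for a county can be joined on it in a single query.

```python
from rdflib import Graph, Namespace, Literal

# All URIs and figures are invented. Two notional datasets (schools and
# health) both refer to the same county by the same URI, so a single query
# can join them on that shared location.
EX = Namespace("http://example.org/")
g = Graph()

# 'schools' dataset
g.add((EX.school42, EX.inCounty, EX.Devon))
g.add((EX.school42, EX.pupils, Literal(300)))

# 'health' dataset
g.add((EX.clinic7, EX.inCounty, EX.Devon))
g.add((EX.clinic7, EX.patients, Literal(5200)))

# Join the two datasets on the shared county URI
results = g.query("""
    SELECT ?school ?clinic ?county WHERE {
      ?school <http://example.org/inCounty> ?county ;
              <http://example.org/pupils>   ?p .
      ?clinic <http://example.org/inCounty> ?county ;
              <http://example.org/patients> ?n .
    }
""")
for school, clinic, county in results:
    print(school, clinic, county)
```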
There was recognition that we have tended to conflate Linked Data and open data, but they are different. It is important to stress that open data may not be data that is linked up, and Linked Data may not be open; it may have restricted access. But if we can start to join up datasets, we can bring whole new value to them, for example, combining medical and educational data in different ways, maybe in ways we have not yet thought about. We want to shift the presumption that the data should be held close unless a reason is given to give it up (an FoI request!). If the data can be made available through FoI, then why not provide it as Linked Data?
One of the big challenges that was highlighted was with local government, where attitudes are not quite so promising as with central government. Unfortunately, as one panel member put it, we are not in a benevolent dictatorship so we cannot order people to open up the data! It is certainly a difficult issue, and although it was pointed out that there are some examples of local authorities who are really keen to open up their data, many are not, and Crown copyright does not apply to local authorities.
Tim encouraged us all to make RDF files, create tools, enable mash-ups, and so on, so that people can take data and do things with it. So, do go and visit http://data.gov.uk once it is up and running and show that you support the initiative.
Whilst other initiatives in e-government and standards do appear to have come and gone, it may be that we wouldn’t have got to where we are now without them, so often these things are all part of the evolutionary process. The approach to the Linked Data Project is bottom-up, which is important for its sustainability. Whilst the support of the Prime Minister is important, in a way it is the support of the lower levels in government that is more important.
The Semantic Web could bring enormous benefits if it is realised. The closing presentation by Tom Heath, from Talis, gave a sense of this, as well as a realistic assessment of what lies ahead. The work that is going on demonstrated what might be achievable, but it also demonstrated that we are in the very early stages of this journey. There are huge challenges around the quality of the data and disambiguation. I find it exciting because it takes us along the road of computers as intelligent agents, opening up data and enabling it to be used in new and imaginative ways.
If any archivists out there are thinking of doing anything with Linked Data we would be very interested to hear from you!