The Archives Hub Names Project
The Archives Hub team and Knowledge Integration, our system suppliers, are embarking upon a short four month project to start to lay the groundwork, define the challenges and test the approaches to presenting end users with a name-based means to search, and connect to a broad range of resources related to people and organisations. I will be blogging about the project as we go along.
Our key aims in the long-term are:
- To provide the end user with a way to search for people and organisations and find a range of material relevant to their research
- To enable connections to be made between resources within and external to Jisc, using names as the main focus
- To bring archive collections together in an intellectual sense and provide different contexts to collections by creating networks across our data
This first project will not create an end-user interface, but will concentrate on processing, matching names and linking resources. We want to explore how this can be administered in order to be sustainable over time. In the end, the most challenging part of working with the names we have is identification, disambiguation and matching. The aim is to explore the space and start to formulate a longer-term plan for the full implementation of names as entities within the Archives Hub.
Creation of name records from EAD description records
NB: This blog often refers to personal names for convenience, but names include personal, family and corporate entities.
EAD descriptions include personal, family and corporate names. These ‘entities’ may be listed as archival creators and also associated with the collection as index terms. Archival creators may optionally be given biographical or administrative histories. The relationship of the collection with names in the index is not made explicit in the description (in a structural way), though it may often be gleaned from the descriptive information within the EAD record.
Creating name records for all names
We are proposing to begin by creating name records for all of these entries, no matter how thin the information for each entry may be.
Here is a random selection of names that are included in Archives Hub records:
Thatcher, J. Wells, Barrister at Law
Stanley, Catherine, 1792-1862
Roe, Alfred Charles
Rowlatt, Mary, b 1908
Milligan, Spike, 1918-2002
Fawcett, Margaret, d. 1987
Rolfe, Alan, 1908-2002 actor
Mayers, Frederick J (fl 1896-1937 : designer : Kidderminster, England)
Only a percentage of names have life dates. Some have born or death dates, some floruit dates.
Of course, the life dates, occupations and outputs of many people are not known, or may be very difficult to find. Also, life dates will change when a birth date is joined by a death date. Epithets may also change over time (and they are not controlled vocabulary anyway).
In addition, we have inverted and non-inverted names on the Archive Hub, names with punctuation in different places, names with and without brackets, etc. These issues create identification challenges.
Even taking names as creators and names as index terms within one single description, the match is often not exact:
Millicent Garrett Fawcett (creator name)
Fawcett, Dame Millicent. (1847-1929) nee Garrett, Feminist and Suffragist (index term)
Lingard, Joan (creator name)
Lingard, Joan Amelia, 1932- (index term)
The archival descriptions on the Archives Hub vary a great deal in terms of the structure, and different repositories have different approaches to cataloguing. Some do not add name of creator, some do not add index terms, some add them intermittently, and often the same name is added differently for different collections within the same repository. In many cases the cataloguer does not add life dates, even when they are known, or they are added to the name as creator but not in the index list, or vice versa. This sounds like a criticism, but the reality is that there are many reasons why catalogues have ended up as they are.
There has not been a strong tradition amongst archivists of adding names as unique identifiable entities, but of course, it has only been in the last few decades that we have had the potential, which is becoming increasingly sophisticated, of linking data through entity relationships, and creating so much more than stand-alone catalogue records. Many archivists still think primarily in terms of human readable descriptions. Some people feel that with the advent of Google and sophisticated text analysis, there is no need to add names in this structured way, and there is no need for index terms at all. But in reality search engines generally recommend structured data, and they are using it in sophisticated ways. Schema.org is for structured data on the web, an initiative started by Google, Microsoft, Yahoo, and Yandex. Explicit markup helps search engines understand content and it potentially helps with search engine optimisation (ensuring your content surfaces on search engines). Also, if we want to move down the Linked Data road, even if we are not thinking in terms of creating strict RDF Linked Data, we need to identify entities and provide unique identifiers for them (URLs on the web). Going back to Tim Berners-Lee’s seminal Linked Data article from 2006:
“The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.”
So, including names explicitly provides huge potential (as well as subjects, places and other entities) and it has become more important, not less important. Indeed, I would go so far as to say that structured data is more important than standards compliant data, especially as, in my experience, standards are often not strictly adhered to, and also, they need constant updating in order to be relevant and useful.
The idea with our project is that we start with name records for every entity – a pot of data we can work with. We may create Encoded Archival Context (Corporate Bodies, Persons and Families), otherwise known as EAC-CPF…but that is not important at this stage. EAC is important for data ingest and output, and we intend to use it for that purpose, so it will come into the picture at some point.
The power of the anonymous
There are benefits in creating name records for people who are essentially anonymous or not easily identifiable. Firstly, these records have unknown potential; they may become key to making a particular connection at some point, bearing in mind that the Archives Hub continually takes new records in. Secondly, we can use these records to help with identification, and the matching work that we undertake may help to put more flesh on the bones of a basic name record. If we have ‘Grote, Arthur’ and then we come across ‘Grote, Arthur, 1840-1912’, we can potentially use this information and create a match. Of course, the whole business of inference is a tricky thing – you need more than a matching surname and forename to create a ‘same as’ relationship (I won’t get into that now). But the point is that a seemingly ‘orphan’ name may turn out to have utility. It may, indeed, provide the key to unlocking our understanding of particular events – the relationships and connections between people and other entities are what enable us to understand more about our history.
Components of a name record
So, all names will have name records, some with just a name, some with life dates of different sorts, some with biographical or administrative histories. The exception to this may be names that are not identifiable as people or organisations. It is potentially possible to discover the type of entity from the context, but that is a whole separate piece of work. Hundreds of names on the Archives Hub are simply labelled as ‘creator’ or ‘name’. This is down to historical circumstance – partly the Archives Hub made errors in the past (our old cataloguing tool which entered creators as simply EAD ‘origination’), partly other systems we ingest data from. At the moment, for example, we are taking in descriptions from Axiell’s AdLib system, but the system does not mark up creator names as people or organisations (unless the cataloguer explicitly adds this), so we cannot get that information. This is probably a reflection of a time when semantically structured data was simply less important. If a human reads ‘Elizabeth Gaskell’ in a catalogue entry they are likely to understand what that string means; if undertaking large-scale automated processing, it is just a string of characters, unless it includes semantic information.
From the name records that we create, we intend to develop and run algorithms to match names. In many cases, we should be able to draw several names together, with a ‘same-as’ relationship. Some may be more doubtful, others more certain. I will talk about that as we get into the work.
At the moment, we have some ideas about how we will work with these individual records in terms of the workflow and the end user experience, but we have not made any final decisions, and we think that what is most important at this stage is the creation and experimentation with algorithms to see what we can get.
Master name records
We intend to create master records for people and organisations. The principle is to see these master records not as something within the archives domain, but as stand-alone records about a person or organisation that enable a range of resources to be drawn together.
So, we might have several name records for one person:
Example of master record, with various related information included:
Webb, Martha Beatrice, 1858-1943, social reformer and historian
Examples of additional name records that should link to the master record:
Webb, Beatrice, 1858-1943 (good match)
Webb, Martha Beatrice, 1858-1943, economist and reformer (good match)
Webb, Martha Beatrice, nee Potter, 1858-1943 (good match)
Webb, M.B. b. 1858 (possible match)
Potter, Martha Beatrice, b 1858
…might well not be a match, in which case it would stand separately, and the archive connected to it would not benefit from the links being made.
We have discussed the pros and cons of creating master records for all names. It makes sense to bring together all of the Beatrice Webb names into one master record – there is plenty that can be said about that individual; but does it make sense to have a master record for single orphaned names with no life dates and nothing (as yet) more to say about that individual? That is a question we have yet to answer.
The principle is to have name records that enables us to create links to the Archives Hub entries and also to other Jisc services and resources beyond that – resources outside of the archives domain. Many of these resources may also help us with our own identification and matching processes. It is important to benefit from the work that has already been done in this area.
We are looking at various name resources and assessing where our priorities will be. This is a fairly short project, and we won’t have time to look at more than a handful of options. But we are currently thinking in terms of VIAF, ORCID and Wikidata. More on that to follow.
Personally, I’ve been thinking about working with names for several years. We have been asked about it quite a bit. But the challenge is so big and nebulous in many ways. It has not been feasible to embark upon this kind of work in the past, as our system has not supported the kind of systematic processing that is required. We are also able to benefit from the expertise K-Int can bring to data processing. It is one thing doing this as a stand-alone project; it is quite another to think about a live service, long term sustainability, version control and revisions, ingest from different systems, etc. And also, to break it down into logical phases of work. It is exciting, but it is going to involve a great deal of hard work and hard thinking.