Names (8): A 4 year old in red wellington boots

Firstly, an apology to those who commented. I was on a temporary machine for a while and didn’t get the notifications to approve the comments. I really appreciate feedback! And we need to think about this whole topic as an archive community.

Secondly, I wanted to pick up on some comments:

“If cataloguing archivists have access to a central pot of name authorities we are more likely to spot and re-use existing authority entries. So if one archivist identified Elizabeth Roberts 1790-1865 (artist) with a little potted biography which placed her in Penge, then a later archivist finding material from Lizzie Roberts in Penge in 1850s is much more likely to put 2 and 2 together manually”

In fact, one of the potential developments from the work we are doing is an interface specifically for cataloguers. The whole issue of ‘match’, ‘probable’ and ‘possible’ is tricky to present to end users, but relatively easy to present to cataloguers to help with creating names that will successfully be connected. So, we are bearing that in mind as a future development.

“When I looked at the list of names used in this article I thought ‘someone just doesn’t know what to include to properly describe a name”

Yes…I think that sometimes, when I am thinking about how to reconcile the massive variations and how to work with the lack of structure. But then I remember what it was like (when I was a proper archivist) to catalogue within time constraints. And I also remember that I am someone who spends half my life thinking about data! In addition, the point is that with archives it is perfectly valid to enter a name such as ‘Julia (fl 1976)’ because that is what you get from the item you are cataloguing, and nothing more. Maybe you could undertake research to find out who that is, but that would extend the time it takes to catalogue by days, if not weeks and months. For a researcher, this might jog something in the mind and lead to a connection being made. Something is better than nothing. For me, the entries that are rather more frustrating are names such as ‘various’ or ‘Author: various’, or ‘James MacAllister and various’ because these just aren’t names. However, many of these entries were probably created in a time when semantically structured data was not so important.

“The other way of dealing with this is to leave the final decision up to the end-user.”

Yes, this is a fair point. In our current thinking, the idea is that we have levels of confidence that we present to the user, and that allows them to make the decision. But we still need to think carefully about how to do this in a way that most clearly conveys meaning. The most difficult thing is to convey that even though you have linked several collection descriptions to one name, other name strings may also be a match. But at the end of the day, there is always the issue that the decisions you make around the navigation and options provided to end users mean they are likely to exclude some relevant results. A subject search will exclude any archives not indexed with that subject. Do you therefore dispense with a subject search? (More on this in future posts, as machine learning may present us with new tools to create subject entries.)

Since my last post we actually hit the point of ‘blimey, this is just too difficult’. We really weren’t sure we were going to make this work, given the tremendous variations and, in particular, the lack of structure.

However, we have hacked our way through the undergrowth to create a path that I think will fulfil many of our aims. There is so much I could say, if I got into the detail of this, but I will spare you too much discussion around EAD and JSON structure!

A good part of the last few weeks from my point of view has been clarifying the thinking around what is required when processing names. I came up with the idea of the ‘4 pillars of names’.

1. Matching

This refers to comparing and grouping names.  

Matching does not require us to know if it is a person or an organisation or to know anything about meaning at all. It is simply a process to group names.  So, ‘D J MacDonald’ could be a company or a person.  The question is, does that match ‘David John MacDonald’ or ‘D J MacDonald, manufacturers, Carlisle’?

Matching is therefore also about levels of confidence. It is about saying ‘D J MacDonald b.1932’ is the same as ‘D MacDonald b.1932’….or not. 

Matching may also mean matching a creator name and an index term within a record. For more on this, see below.  
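To make the idea of confidence levels a little more concrete, here is a minimal sketch in Python. It is purely illustrative and is not the code we are running. The function names, the labels it returns and the rules themselves are all assumptions made for the example: the last word is treated as the surname, a single letter is treated as an initial, and ‘b.YYYY’ is read as a birth year.

```python
import re

# Assumption for this sketch: a birth year is written as 'b.1932' or 'b. 1932'.
BIRTH_YEAR = re.compile(r"\bb\.\s*(\d{4})")


def parse(name):
    """Split a raw name string into lower-cased word tokens and an optional birth year."""
    year = BIRTH_YEAR.search(name)
    text = BIRTH_YEAR.sub("", name)
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens, (year.group(1) if year else None)


def compatible(a, b):
    """True if two forename tokens agree, treating a single letter as an initial."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def match_confidence(name_a, name_b):
    """Return 'match', 'probable', 'possible' or 'none' for two name strings."""
    tokens_a, year_a = parse(name_a)
    tokens_b, year_b = parse(name_b)
    if not tokens_a or not tokens_b:
        return "none"
    # Assumption: the last word is the surname, and surnames must be identical.
    if tokens_a[-1] != tokens_b[-1]:
        return "none"
    # Conflicting birth years rule a match out.
    if year_a and year_b and year_a != year_b:
        return "none"
    fore_a, fore_b = tokens_a[:-1], tokens_b[:-1]
    if not all(compatible(x, y) for x, y in zip(fore_a, fore_b)):
        return "none"
    if year_a and year_b:
        # Same year: full confidence only if the same number of forenames is given.
        return "match" if len(fore_a) == len(fore_b) else "probable"
    # Without a date on both sides we only have the name to go on.
    return "possible"
```

With these made-up rules, ‘D J MacDonald b.1932’ and ‘D MacDonald b.1932’ come out as ‘probable’, while ‘D J MacDonald’ and ‘David John MacDonald’ are only ‘possible’. Note that nothing here needs to know whether the name is a person or a company.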

2. Meaning

Name meaning is about whether it is a personal, corporate or family name.  Many creator names are just ‘creator’. There is no tagging to distinguish the type. Index terms have to have a type, but matching them up to creator name is not always easy. See more on that below.

3. Search behaviour

What happens when the user clicks on the name? Previous posts have presented our ideas for this. Whilst we are not yet ready to develop an end user interface, the options that are available to us for display are necessarily constrained by how we process the data. So we do need to think about this now.

4. Display

How we display a name record, or a name page. Again, not something we are focussing on now, other than to think about the sorts of features that we want to include.

* * *

Our discussions have been characterised by ‘one step forwards two steps backwards’, which can feel a little dispiriting. But we believe we have now sorted out the approach we need to take. I have spent a lot of time working collaboratively with Rob Tice from Knowledge Integration, unpicking the (many and varied) challenges in the data and as a result we’ve agreed an approach that we believe will produce the data that we want.

So, this again consists of 4 parts – a 4-step process that covers matching and meaning.

1. Matching within a collection description

We need to try to match the creator name to the index term, if we have both. This is the first step in the workflow. To do this, the processing needs to identify names within one collection (each name needs to be attached to a collection via a reference).

Taking the description of the Caledonian Railway Company as an example (https://archiveshub.jisc.ac.uk/data/gb248-ugd008/7andugd8/38). The name appears as:

Creator: Caledonian Railway (railway company: 1845-1923: Scotland)
Bioghist: Caledonian Railway
Index term: Caledonian Railway, 1845-1923

We want to create one entry for these names that we take forwards into the de-duplication process. In this case, the names are all marked up as corporate names. But in many cases the creator is not marked up in this way. We need a process to match these entities to say that they are the same. This is about applying matching at the level of one collection, rather than across collections. When you apply it to one collection, you can decide to make more assumptions. For example,

Creator: Dorothy Johnson
Index term: Johnson, Dorothy, 1909-1966, Researcher into theatre history

This creator is not marked up as a personal name. If we worked with these entries in our general de-duplication, so that they were not associated with one particular collection, we could not say they are the same person. Indeed, we could not identify ‘Dorothy Johnson’ as a person, only as a creator. The relationship of these two entries would get lost. But within one collection description, we can make the assumption that they represent the same thing.

If we make this the first step we can remove many of the creator-as-string names from the processing – they will already be matched to a structured index term.
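As a rough illustration of this first step, the sketch below matches a creator string to an index term from the same collection description. It is only a sketch under stated assumptions: the dictionary layout, the `type` labels and the rule itself (the words of one entry must all appear in the other) are invented for the example and are not our actual processing.

```python
import re


def words(text):
    """Lower-cased set of the words and numbers in a name string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_creator(creator, index_terms):
    """Return the first index term that matches the creator string, or None.

    Assumption: both entries come from ONE collection description, so we
    accept a match when the words of one entry are all found in the other.
    This would be far too generous across collections.
    """
    creator_words = words(creator)
    if not creator_words:
        return None
    for term in index_terms:
        term_words = words(term["text"])
        if not term_words:
            continue
        if creator_words <= term_words or term_words <= creator_words:
            return term
    return None
```

The useful side effect is that the matched index term carries a type, so an untyped creator such as ‘Dorothy Johnson’ picks up the fact that it is a personal name from ‘Johnson, Dorothy, 1909-1966, Researcher into theatre history’.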

2. Structuring data

This is a process of following rules to structure data. Many names are not structured. PIDs (persistent identifiers) can by-pass this need for consistency, but at present the archive community barely uses recognised identifiers. I have posted previously on name authorities and structure. So, anyway, to introduce a bit of EAD, you might have:

<persname>Florence Nightingale, 1820-1910</persname>

or

<persname><emph altrender="surname">Nightingale</emph><emph altrender="forename">Florence</emph><emph altrender="dates">1820-1910</emph><emph altrender="epithet">Reformer of Hospital Nursing</emph></persname>

If we can process the first entry to give the kind of structure you see in the second entry, that enables us to carry out de-duplication, and gives us a much better chance of matching it to other entries. This is decidedly non-trivial, and we won’t be able to do this for all names.
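To show why the second form is so much easier to work with, here is a small sketch that reads both kinds of entry using Python’s standard `xml.etree.ElementTree` module. The function name and the `unstructured` key are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def read_persname(xml_text):
    """Return the labelled parts of a <persname>, or its raw text if unstructured."""
    element = ET.fromstring(xml_text)
    # Collect each <emph altrender="..."> child as a labelled part of the name.
    parts = {
        child.get("altrender"): (child.text or "").strip()
        for child in element.findall("emph")
        if child.get("altrender")
    }
    if parts:
        return parts
    # No structure to work with: all we have is one string.
    return {"unstructured": (element.text or "").strip()}
```

For the structured entry this gives us surname, forename, dates and epithet as separate values that we can compare directly. For the first entry all we get back is the string ‘Florence Nightingale, 1820-1910’, which still has to be picked apart by rules.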

3. De-Duplication

This is the process outlined in the blog post on de-duplication at scale. Once the other processes are in place, we are in a position to run the de-duplication process, and start to try out different levels of confidence with matching.

A working example: George Bernard Shaw

collection match:

George Bernard Shaw (gb97-photographs)
matches:
Shaw, George Bernard, 1856-1950, author and playwright (gb97-photographs)

structure rules:

apply rule: if it includes YYYY-YYYY and the preceding words include a comma then the first entry is a surname and the second entry is a forename
apply rule: YYYY-YYYY is a date
apply rule: words after YYYY-YYYY are additional information

Creates:
Surname: Shaw
Forename: George Bernard
Dates: 1856-1950
Additional information: author and playwright

de-duplication:

The structured entry matches a name from another description:

Surname: Shaw
Forename: G.B.
Dates: 1856-1950
Additional information: playwright.
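The structure rules and the de-duplication check in this working example can be sketched in a few lines of Python. Again, this is a sketch under assumptions rather than the real workflow: the function names and dictionary keys are made up, and the comparison (same surname, same dates, same forename initials) is just one possible rule, chosen because it lets ‘George Bernard’ agree with ‘G.B.’.

```python
import re

# Rule: YYYY-YYYY is a date.
DATE_RANGE = re.compile(r"\b\d{4}-\d{4}\b")


def structure(name):
    """Apply the three structure rules; return None if they do not apply."""
    found = DATE_RANGE.search(name)
    if not found:
        return None
    before = name[:found.start()]
    # Rule: a comma before the date means 'surname, forename'.
    if "," not in before:
        return None
    parts = [p.strip() for p in before.split(",") if p.strip()]
    if len(parts) < 2:
        return None
    # Rule: words after the date are additional information.
    additional = name[found.end():].strip(" ,.")
    return {
        "surname": parts[0],
        "forename": parts[1],
        "dates": found.group(0),
        "additional": additional,
    }


def initials(forename):
    """'George Bernard' and 'G.B.' both become 'GB'."""
    return "".join(word[0].upper() for word in re.findall(r"[A-Za-z]+", forename))


def same_entity(a, b):
    """Assumed rule: same surname, same dates and the same forename initials."""
    return (
        a["surname"].lower() == b["surname"].lower()
        and a["dates"] == b["dates"]
        and initials(a["forename"]) == initials(b["forename"])
    )
```

Running `structure` on ‘Shaw, George Bernard, 1856-1950, author and playwright’ produces the four fields shown above, and `same_entity` then accepts the ‘G.B.’ entry from the other description. A string such as ‘George Bernard Shaw’, with no comma and no dates, is left alone, which is exactly why the collection match has to come first.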

*****

So, we are now in the process of implementing this workflow. The current phase of this project will not allow us to complete this work, but it will lay the foundations. Of course, we’ll find other challenges and issues. We still don’t know how successful we will be. There will definitely be names we can’t match and we can’t identify as personal or corporate. But then it is down to how we present the information to the end user.

I called this post ‘A 4 year old in red wellington boots’ because in her comment on the previous blog post Teresa used that as a metaphor for how we can think about data. We need to explore, to play with data, to search and discover, to not mind getting dirty. It is easy to get stressed about not getting everything right; but we need to jump into the puddles and just see what happens!

(instagram: shelightsthesky_photography)

Exploring British Design at the Europeana AGM 2015

I’m just back from another enjoyable and useful Europeana Network Association event where I gave a four minute ‘Ignite Talk’ on our recently completed ‘Exploring British Design’ project that Pete and Jane worked on. As it was such a short talk, I wanted to make sure I got the timing right, so actually wrote the talk out. I think it gives quite a good summary of the project, as well as mentioning our connection with Europeana, so I thought it would be worth posting it here along with a link to the slides:

“Hello, my name is Adrian Stevenson and I’m a Senior Technical Coordinator working for Jisc in the UK.

[Introduction slide]

Today I want to briefly outline a one year project we’ve recently completed called ‘Exploring British Design’ which was funded by the Arts and Humanities Research Council.

The technical work and front-end interface for Exploring British Design was developed by the Archives Hub based in the UK. The Hub aggregates archival descriptions from about 280 institutions in the UK, from the very large such as the British Library to the very small such as Shakespeare’s Globe Theatre, making these archives available to be searched through our website and APIs, and findable on Google. For some institutions, the Archives Hub provides their only web presence, so it’s an important service for the archives sector in the UK.

For ‘Exploring British Design’ we collaborated with one of our enthusiastic contributors, the Brighton Design Archive, based at the University of Brighton. We used the ‘Britain Can Make It’ exhibition from 1946 as a focal point because the Archive has rich collections relating to this exhibition.

So what’s the connection with Europeana? The Archives Hub is in the process of contributing data to the Archives Portal Europe. The plan is that the portal data will be available through Europeana at some point in the future.

[Home page slide]

So let’s have a look. This is the home page of the website. You can see that we take people, i.e. the designers and architects, their organisations, and the events they were involved with, such as the exhibition, as the starting points, i.e. not the archive records as such.

What’s unique about this project is that we’re going beyond the record as being about one person, one organisation and having one focus. The reality is that archives are about the connections between all sorts of people, places, and events, such as exhibitions, and much of this information is effectively ‘locked in’ the archival records. This is what we’re trying to draw out.

The idea is that anything can be a primary focus:  people, organisations, places, events or archive collections. Some of you may recognise this as an idea relating to linked data, and indeed this is loosely the approach we took for the under the hood implementation. We also looked at an archival name authority standard called EAC-CPF to help with this.

[Designer slide]

You see here how we’ve tried to emphasise the relationship types, such as ‘friend of’, ‘collaborates with’, ‘colleague of’ and so on. Researchers are most interested in people, events, etc. not in archives per se.

[Exhibition slide]

This is a view of the exhibition page, focussing in on it as an event in its own right with a location, related people, etc. This sort of information hasn’t historically been captured all that usefully in archival descriptions.

[Visualisation slide]

We included visualisations, but these actually fall far short of the complexity of the relationships. It’s quite hard to get these to work effectively, but they give a sense of the relationships between architect Jane Drew and Le Corbusier, or even Croydon High School for Girls.

So hopefully you can get a sense of how we’ve tried to present researchers with more flexible routes through the connections we created, helping to surface relationships between people, organisations and events that were effectively hidden in the more traditional document-based way of presenting information.”

There was an excellent reception in the evening at the Rijksmuseum where we were lucky enough to get a private view of the ‘Gallery of Honour’. It was a great opportunity to get a picture in front of Rembrandt’s ‘Night Watch’, so we made the most of it. Thanks again to Europeana!

Adrian Stevenson and others in front of Rembrandt’s ‘Night Watch’ at the Rijksmuseum, Amsterdam.

‘I’m Spartacus!’ (or giving a name authority)


This is the second blog post about the recent UKAD survey on indexing and name authorities (as stated previously a report on the survey will be made available shortly).


It seems to me that there is some confusion over what authority records actually are. When we came up with our survey it was clear that defining these terms is not always that straightforward and we often make assumptions that are not necessarily shared. We created a glossary for the survey, and defined a name authority record as:

“An entry for a person or corporate body that includes additional elements about the entity, providing contextual information as well as a name index entry.”

However, it is clear that some respondents were thinking of name index entries rather than more complete authority records. According to our survey, which received 93 responses, 34 maintain authority records that follow recognised rules or sources (although comments indicate that the number of these records may be very limited), 14 follow local practice and 29 do not maintain authority records. Bear in mind that responses were not per institution, so the figures can only tell us so much. But what they do indicate is: (i) there is some confusion about what authority records are (ii) some repositories maintain authority records that follow their own in-house practice rather than recognised standards (iii) it is important for archivists that the software cataloguing systems they use support the creation of authority records.

Many repositories use the original records to create authority records, which is one reason why archivists are in the best position to provide this kind of detailed and useful information to researchers. The original records can give a real insight into individuals, particularly lesser-known individuals. Many archivists base their name authority records on ISAAR(CPF), which gives a level of consistency, but many do not, maybe reflecting the fact that ISAAR is a recent standard (first edition 1996), and cataloguing is not a recent phenomenon.

If the authorised form of the actual name is following recognised rules, this provides for effective resource discovery. But in reality we know that there are often many versions of an individual out there. Here are the entries on the Archives Hub for David Lloyd George:

  • George David Lloyd
  • George David Lloyd 1863-1945 1st Earl Lloyd George Of Dwyfor Statesman
  • George David Lloyd 1863-1945 1st Earl Lloyd George Of Dwyfor Statesman And Prime Minister
  • George David Lloyd 1863-1945 Emph Altrender Epithet Prime Minister
  • George David Lloyd 1863-1945 First Earl Lloyd-george Of Dwyfor Prime Minister
  • Lloyd George David
  • Lloyd George David 1863-1945
  • Lloyd George David 1863-1945 1st Earl Lloyd George Of Dwyfor Statesman
  • Lloyd George David 1863-1945 1st Earl Lloyd-george Of Dwyfor Statesman
This illustrates quite nicely the problems of including an epithet, and even more clearly the problems of NCA Rules insisting on using the last element of a surname, even if it is a compound or hyphenated surname. I will never understand that one…sigh.
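As a small illustration of how such variants might be pulled together, the sketch below groups entries by the words that come before the dates, ignoring their order. This is an assumption-laden first pass and not how the Hub indexes names: the function names are invented, and a rule this crude would also merge different people whose names happen to share the same words.

```python
import re
from collections import defaultdict


def name_key(entry):
    """Sorted, lower-cased words that appear before the first YYYY-YYYY date."""
    head = re.split(r"\b\d{4}-\d{4}\b", entry, maxsplit=1)[0]
    return tuple(sorted(head.lower().split()))


def group_variants(entries):
    """Group index entries that share the same name words, in any order."""
    groups = defaultdict(list)
    for entry in entries:
        groups[name_key(entry)].append(entry)
    return dict(groups)
```

Applied to the list above, ‘George David Lloyd’ and ‘Lloyd George David’ share a key, so every variant falls into a single group, whatever the epithet says.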


I love one of the responses to the question of which sources are used for authority records: ‘books, the internet, people’. In a way this reflects the diversity of sources used, which include encyclopaedias, directories, books, journals and registers as well as donor knowledge. This shows how important the expertise of archivists is in using various sources to bring together valuable information about individuals, families and corporate bodies. Authority records maximise the benefits of the information archivists gather together for their work, bringing it to researchers and giving them new ways into collections.

Archivists have to work with the software that they have, and sometimes this imposes certain limitations. One respondent mentioned the need to avoid using the ampersand, for example. Many repositories use CALM, and this is compliant with ISAAR(CPF), which should provide a great boost to archivists wanting to create authority records.

I do think that archivists should really be starting to think more carefully about the benefits of name authority records, and we need to have a more co-ordinated and collective approach to this. As one respondent put it, ‘We don’t create these at present, and I wonder whether we ever ought to? Surely this is most sensible as a global resource that we can contribute to and share.’ For my part, I would be very keen for the Archives Hub to facilitate this, and I hope that this is something we can look to in the future.

Image: Flickr Creative Commons Steeljam photostream


What’s in a Name?

I have just been taking a look through the results of a recent survey by the UK Archives Discovery Network (UKAD) Working Group. The Working Group are getting together this week and will be looking at making the results public.
The main thing that struck me was the variety of responses. If we thought that this survey might clarify the situation, I’m beginning to wonder if all that it clarifies is that the situation is not clear!
I’m just going to concentrate on name indexing here, and leave place and subject for another blog post.
It seems that only a small proportion of archivists (as reflected in this survey) do not think that indexing is important. Of the 80 responses, 49 indexed to recognised rules and 23 indexed in line with local practice; 13 did not index and 23 went for ‘other’, which tended to mean they were in the process of creating an index, moving to an index following recognised standards or had legacy data with some indexes.
The survey revealed many reasons to create a name index as a means to access archives:
  • for enhanced resource discovery
  • many users want to search by name (respondents indicated it is a very popular search option)
  • it brings together collections that reference the same people
  • it is a way researchers look for connections
  • it aids interdisciplinary research
  • to identify people involved in particular works and their roles
  • it helps researchers to narrow down larger numbers of hits to just relevant collections
  • it promotes interoperability
  • it addresses problems with variants of the name, name changes, or different people with the same name (aids reliability)
  • it is at the heart of family history research
  • it is useful for answering enquiries
  • it is useful for selecting material, e.g. for exhibitions
When asked why name indexing is not carried out, there were a number of reasons:
  • free text retrieval makes name indexing redundant
  • lack of funding
  • lack of training
  • lack of staff resource
  • the current system does not support indexing
  • it has never been done
  • uncertainty about how to index effectively
  • uncertainty about benefits
Out of 100 responses, 46 felt that name indexing is very important, 33 felt it is reasonably important and 11 felt it is a low priority. The main reason given for name indexing being a low priority was the pressing need to deal with cataloguing backlogs and actually get some kind of description out there. It also seems that archivists do not always feel that they have the evidence to suggest that indexing is of benefit to researchers (or enough benefit to warrant the time involved).
The level at which collections are indexed was often given as ‘whatever is appropriate’ and clearly varied widely. I had expected it to be much higher for collection-level descriptions, but this was not the case.
We asked which sources are used for names, and again the answers were varied. Many people clearly do use the original records, with the National Register of Archives and Dictionary of National Biography coming in close behind. There was mention of Wikipedia, and even Google. In terms of rules, a majority do use the NCA Rules, and more use in-house rules than use AACR2. Several respondents said they use ISAAR(CPF), which is curious, as this standard is for name authority records and states that the main name entry should follow recognised rules (e.g. NCA Rules). I wonder if people were thinking of name authority records rather than basic index entries.
More on the survey to follow. And the UKAD Network will be publishing the results via the listserv, archives-discovery-network@jiscmail.ac.uk. Make sure you sign up to this if you are interested in these kinds of activities: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=ARCHIVES-DISCOVERY-NETWORK

Where next for the National Archives Network…?

Joy and I went to a meeting last week at The National Archives to discuss the issues surrounding the National Archives Network, and the possible future directions that the archive community might take. We came away with our heads full of ideas and issues to take forward – so a job well done I think.

The National Archives Network as a concept really began after the 1998 seminal report by the National Council on Archives, ‘Archives On-line: The Establishment of a United Kingdom Archival Network‘ (PDF file). The vision was to create a single portal to enable people to search across UK archives. However, it is not really surprising that this never materialised given the resources and technical support necessary to make such a huge concept work. The landscape has changed since the report came out, and this solution seems to be less relevant nowadays. However, the concept of a network and the importance of collaboration and sharing data have continued to be very much on the agenda.

The meeting was initiated by Nick Kingsley and Amy Warner from TNA National Advisory Services. It included representatives from The Archives Hub, AIM25, SCAN, ANW, Genesis and Janus, as well as a number of other interested archivists from various organisations. The morning was dedicated to brief talks about the various strands of the network, and it quickly emerged that we had many things in common in terms of how we were working and the sorts of development ideas that we had, and therefore there would clearly be an advantage in sharing knowledge and experience and working together to enhance our services for the benefit of our users.

In the afternoon we formed into 3 groups to talk about name authority files, searching and sharing data and also hidden archives. A number of broad points came out of these break out groups and also the discussion that followed:

We need to ensure that our catalogues are searchable by Google (no surprises there) – it looks like some of us have tackled this more successfully than others, and obviously there are issues about databases that are not accessible to Google. It is important for contributors that services like the Hub and AIM25 are available via Google, and this provides an additional motivation for contributing to such union catalogues.

We really need to come together to think more carefully about name authority files – how these are created, who is responsible for them, how we can even start to think about reaching a situation where there is actually just one name authority file for each person!

It is important to progress on the basis of exposing our data so that it can be easily shared. This means working together on various options, including import/export options and Web Services that allow machine-to-machine access to the data. There are also issues here about the format of some of the catalogues. Some work has already taken place on exporting EAD data from DS CALM and AdLib, two major archive management systems. The Archives Hub and AIM25 have also been working together with the aim of enabling contributors to add the same description to both services.

We talked about other areas where sharing our experiences and understanding would be of great benefit, including Website design and how to present collection and multi-level finding aids online. We also recognised the importance of gathering together more information about our users – what they want, what they expect, what would be of benefit to them. In the end, this is one of the keys to producing a useful and rewarding service.

The meeting was very positive, and there are plans to take some of these issues forward through working groups as well as meeting again as a whole group, maybe sharing some of the specific projects that we have been involved with and collaborating on future initiatives.