Names (6): Deduplication at scale

I have written several blog posts setting out ideas and thoughts about the challenges with names. This post sets out some of our plans for creating name records for a national aggregator; something that can work at scale and in a sustainable way. The technical work is largely being undertaken by Knowledge Integration, our system suppliers, working closely with the Archives Hub team.

Consider one repository – one Hub contributor. They have multiple archives described on the Archives Hub, and maybe hundreds or thousands of agents (people and organisations) included in those descriptions. All of this information will be put into a ‘management index’. This will be done for all contributors. So, the management index will include all the content, from all levels, including all the names. A huge bucket of data to start us off.

A names authority source such as VIAF or any other names data that we would like to work with will not be treated any differently to Archives Hub data at this stage. In essence matching names is matching names, whatever the data source. So, matching Archives Hub names internally is the same as matching Archives Hub names to VIAF, or to Library Hub, for example. However, this ‘names authority’ data will not go into our big bucket of Archives Hub data, because, unless we create a match with a name on the Hub, the authority data is not relevant to us. Putting the whole of VIAF into our bucket of data would create something truly huge. It is only if we think that this external data source has a name that matches a person or organisation on the Hub that it becomes important. So data from external sources are stored in separate reference indexes (buckets) for the purposes of matching.

Tokenisation

Knowledge Integration are employing a method known as tokenisation, which allows us to group the data from the indexes into levels. (It is quite technical and I’m not qualified to go into it in detail, so I only refer briefly to the basic principles here. Wikipedia has quite a good description of tokenisation.) With this process, we can establish levels that we believe will suit our purposes in terms of confidence. Level 1 might be for what we think is a guaranteed match, such as where an identifier matches. So, for example, Wikidata might have the VIAF identifier included, so that the VIAF and Wikidata name can be matched. In some cases, the Archives Hub data includes VIAF IDs, so then the Hub data can be matched to VIAF. We also hope to work with and create matches to Library Hub data, as they also have VIAF IDs.

Image showing versions of a name all with the same ID.
If all versions of a name have the same ID then they can be matched.

Level 2 might be a more configurable threshold based around the name. We might say that a match on name and date of birth, for example, is very likely an indication of a ‘same as’ relationship. We might say that ‘James T Kirk’ is the same person as ‘James Kirk’ if we have the same date of birth. This is where trial and error is inevitable, in order to test out degrees of confidence. Level 3 might bring in supporting information, such as biographical history or information about occupation or associated places. It is not useful by itself, but in conjunction with the name, it can add a degree of certainty.

Screenshot of part of a biographical history
Biographical information may be used to help match names

We are also thinking about a Level 4 for approaches that are Archives Hub specific. For example, if the same name is provided by the same repository, could we say it is more likely to be the same person?
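To make the idea of levels a little more concrete, here is a minimal sketch of how Level 1 and Level 2 might be expressed. It is not Knowledge Integration's implementation: the `NameRecord` class, its field names, and the rule of comparing only the first and last words of a name are all assumptions made for illustration, and a birth year stands in for date of birth.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NameRecord:
    # Hypothetical record shape, for illustration only.
    name: str
    birth_year: Optional[int] = None
    viaf_id: Optional[str] = None


def first_and_last(name: str) -> Optional[Tuple[str, str]]:
    """Return the first and last words of a name, lower-cased, or None if empty."""
    words = name.lower().split()
    return (words[0], words[-1]) if words else None


def match_level(a: NameRecord, b: NameRecord) -> Optional[int]:
    """Return 1 or 2 for the confidence level of a match, or None for no match."""
    # Level 1: both records carry the same identifier.
    if a.viaf_id is not None and a.viaf_id == b.viaf_id:
        return 1
    # Level 2 (assumed rule): same birth year, and the first and last words of
    # the name agree, so 'James T Kirk' and 'James Kirk' count as a match.
    key_a, key_b = first_and_last(a.name), first_and_last(b.name)
    if (a.birth_year is not None and a.birth_year == b.birth_year
            and key_a is not None and key_a == key_b):
        return 2
    return None
```

Levels 3 and 4 are left out because they depend on supporting information and on Hub-specific rules that still need to be tested.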

This tokenisation process is all about creating a configurable process for deduplication. Tokens are created only for the purposes of matching. Once we have our levels decided, we can create a deduplication index and run the matching algorithm to see what we get.

Approaches to indexing

For deduplication indexing, the first thing to do is to convert to lower case and remove all of the non-alpha characters. (NB: for non-Latin scripts, there are challenges that we may not be able to tackle in this phase of the project.)
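As a rough illustration of this first step, the sketch below lower-cases a name entry and strips everything that is not a letter. The function name is hypothetical, and three choices are assumptions rather than anything we have settled: accents are folded to plain letters, spaces are kept as word separators, and digits are dropped, so dates would need to be tokenised separately.

```python
import re
import unicodedata


def normalise(name: str) -> str:
    """Lower-case a name entry and keep only the letters a-z, separated by single spaces."""
    # Assumption: fold accented Latin letters to their base letter (é -> e).
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alpha characters (punctuation, digits) with a space.
    letters_only = re.sub(r"[^a-z]+", " ", folded.lower())
    # Collapse repeated spaces and trim the ends.
    return " ".join(letters_only.split())
```

Note that the ASCII folding simply discards characters from non-Latin scripts, which is one reason those scripts need separate treatment.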

The tokens within the record will be indexed in multiple ways within the deduplication index to facilitate matching. This includes indexing all words in the order that they appear, and also individual word matches.

Then, particularly when considering using text such as biographies to help identify matches, we can use bigrams and trigrams. These essentially divide text into two- and three-word chunks. A search can then identify how many groups of two and three words have matched. Generally, this is a useful method of ascertaining whether documents are about the same thing. It may help us with identifying name matches based upon supporting information. This is very much an exploratory approach, and we don’t know if it will help substantially with this project, but certainly it will be worth trying out this approach, and also considering using it for future data analysis projects.
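A small sketch of the idea, assuming the text has already been reduced to plain words separated by spaces. The function names are hypothetical, and a real index would do this at search time rather than in memory.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of n-word chunks (as tuples) found in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(a: str, b: str, n: int) -> int:
    """Count how many distinct n-word chunks two texts have in common."""
    return len(word_ngrams(a, n) & word_ngrams(b, n))
```

For example, ‘social reformer and economist’ and ‘economist and social reformer’ share one bigram (‘social reformer’) and no trigrams.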

Character trigrams break down individual words into groups of three characters and may be useful for the actual names. This should be useful for a more fuzzy matching approach, and it should help to deal with typos. It can also help with things like plurals, which is relevant for working with the supporting information.
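To show why character trigrams tolerate typos, here is a sketch that scores two words by the proportion of trigrams they share. Using the Jaccard measure (shared trigrams divided by all distinct trigrams) is an assumption made for this example; the actual scoring may well differ.

```python
def char_trigrams(word: str) -> set:
    """Return the set of three-character chunks in a word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two words' trigram sets, from 0.0 to 1.0."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0  # words shorter than three characters have no trigrams
    return len(ta & tb) / len(ta | tb)
```

With this measure ‘Churchill’ and the misspelling ‘Churchil’ share six of seven distinct trigrams, so they score highly even though the strings are not identical.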

We are also going to explore hypocorisms. This means trying out matches for names such as Jim, Jimmy and James or Ned, Ed, Ted and Edward. A hypocorism is often defined as a pet name or term of endearment, but for us it is more about forename variations. Obviously Jim Jones is not necessarily the same person as James Jones, but there is a possibility of it, so it is useful to make that kind of match on name synonyms.

Hypocorism refers to pet names or terms of endearment
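One simple way to try this out is a lookup table of forename variants. The table below contains only the examples mentioned above and is purely illustrative. A real table would be much larger and would not be one-to-one: ‘Ed’ could equally be Edmund or Edwin, so a match here only ever signals a possibility.

```python
# Illustrative table only: maps a forename variant to one assumed full form.
HYPOCORISMS = {
    "jim": "james",
    "jimmy": "james",
    "ned": "edward",
    "ed": "edward",
    "ted": "edward",
}


def canonical_forename(forename: str) -> str:
    """Return the assumed full form of a forename, or the forename itself."""
    key = forename.lower()
    return HYPOCORISMS.get(key, key)


def forenames_may_match(a: str, b: str) -> bool:
    """True if two forenames could be variants of the same name."""
    return canonical_forename(a) == canonical_forename(b)
```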

From this indexing approach we can try things out and see what works. There is little doubt that it will require an iterative and flexible approach. We can’t afford to set up a whole process that proves ineffective so that we have to start again. We need an approach that is basically sound and allows for infinite adjustments. This is particularly vital because this is about creating a framework that will be successful on an on-going basis, for a national-scale service. That is an entirely different challenge to creating a successful outcome for a finite project where you are not expecting to implement the process on an on-going basis. Apart from anything else, a project with a defined timescale and outcome gives you more leeway to have a bit of human intervention and tweak things manually to get a good result.

Group records

Using the tokenisers and matching methods we can try processing the data for matches. When records are matched with a degree of certainty, a group record is created in the deduplication index. It is allocated a group id and contains the ids of all of the linked records. This is used as the basis for the ‘master record’ creation.
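The sketch below shows one way group records could be built from pairs of matched record ids. It is an assumption that matches are merged transitively (if A matches B and B matches C, all three share a group); the function name and the ‘group-1’ id format are also made up for the example.

```python
from collections import defaultdict


def build_groups(record_ids, matched_pairs):
    """Return {group_id: [record ids]} for every set of linked records."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the representative record, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    members = defaultdict(list)
    for rid in record_ids:
        members[find(rid)].append(rid)

    # Only records that matched something get a group record.
    linked = sorted(sorted(ids) for ids in members.values() if len(ids) > 1)
    return {f"group-{n}": ids for n, ids in enumerate(linked, start=1)}
```

Transitive merging is convenient, but it can chain weak matches together, which is one reason the confidence thresholds matter so much.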

Primary or master records

I have previously blogged some thoughts about the ‘master record’ idea. Our current proposal is that every Archives Hub name is a primary record, unless it is matched. So, if we start out with six variations of Martha Beatrice Webb, 1858-1943, then at that point they are all primary records and they would all display. If we match four of them, to a confidence threshold that we are happy with, then we have three primary records. One of the primary records covers four archives. We may be able to still link the other two instances of this name to the aggregated record, but we can assign a lower confidence threshold to this.

Diagram showing instances of the name Beatrice Webb and how they might match.
Deduplication for ‘Beatrice Webb’

In the above example (which is made up, but reflects some of the variations for this particular name) four of the instances of the name have been matched, and so that creates a new primary record, with child records. Two of the instances have not been matched. We might link them in some way, hence the dotted line, or they might end up as entirely separate primary records. The instance of Beatrix Potter, nee Webb, has not been matched (these two individuals are often confused, especially as they have the same year of death). If we set levels of confidence wrongly, this name could easily be matched to ‘Beatrice Webb’.

The reasoning behind this approach is that we aggregate where we can, but we have a model that works comfortably with the impossibility of matching all names. Ideally we provide end users with one name record for one person – a record that links to archive collections and other related resources. But we have to balance this against levels of confidence, and we have to be careful about creating false matches. Where we do create a match, the records that were previously primary records become ‘child records’ and they no longer display in the end user interface. This means we reduce the likelihood of the end user searching for ‘william churchill’ and getting 25 results. We aim for one result, linking to all relevant archives, but we may end up with two or three results for names that have many variations, which is still a vast improvement.
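As a sketch of the display rule described here: matched records become children of a group and drop out of the results, while unmatched records stay as primary records. The function name and the shape of the `groups` mapping are assumptions made for illustration.

```python
def displayed_records(record_ids, groups):
    """Return the ids an end user would see: one per group, plus unmatched records."""
    children = {rid for ids in groups.values() for rid in ids}
    unmatched = [rid for rid in record_ids if rid not in children]
    return sorted(groups) + unmatched
```

With six records for Beatrice Webb and one group covering four of them, this returns three results: the group record and the two unmatched records.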

If we have several primary records for the same person (due to name variations) then it may be that new data we receive will help us create a match. This cannot be a static process; it has to be an effective ongoing workflow.

Names (5): The Problem of anonymity

Cartoon of person asking 'who am I?'

It is easy to focus on names that represent fairly well known people. But one of the challenges for archives is to work with little known people – names that represent someone who is referenced in a catalogue – maybe they are indexed because they are a correspondent for example – they appear in one of a series of letters – but there is no more information about them other than their name. They may be referenced in other sources, but we have little to go on in order to discover that, and often they won’t be represented – it may be that this is the only written source that includes them.

In a names service, we can add a name – let’s say ‘Louisa Jane Justamond’ – a name from https://archiveshub.jisc.ac.uk/data/gb12-ms.add.8556 (‘The Garland continued’, a collection of poems addressed to her). We only have that one instance of that name. It is not in VIAF, it is not in Wikidata. There is an instance listed in ‘A genealogical and heraldic dictionary of the landed gentry of Great Britain’ (a precursor to Burke’s peerage). But unless we decide to use that as an external source, write a name matching algorithm and decide, on levels of confidence, that it is indeed a match, that is not going to help us. We are left with a name attached to one archive collection and nothing else.

We can create a name record for Justamond, but if we display it on the Archives Hub it will simply show her name and a link back to the related description.  It will be extremely minimal.

However, what we don’t know is whether new collections will be added to the Archives Hub, or new information added to Wikidata or another source that we use, such that this person becomes more identifiable.   We simply don’t know what the value of a name might be.  In the future, having a record of this person could prove to be immensely useful in making a connection.

Archives have what you might call a long tail of names. It is something that characterises our holdings. It is something that sets us apart from libraries and museums, at least to a degree.  Most names represented in library holdings (or names they represent in their catalogues and other finding aids) represent identifiable people.

Graph showing the long tail of names
The long tail of names

In archives, we have collections that represent ordinary people, not published, not celebrated, not notorious, with no documented place in history. We also have collections that include people where it is hard to know whether an individual is more widely known, because the archive collection does not entirely identify them.

Either way, it leaves us with a question about how to deal with a name that has nothing else attached to it other than ‘this name is in this letter’.

Building an index of all names means that we have a store of data that can be used for further exploration. It could sit behind the scenes, but it can be used to try out tools, data manipulation and matching.  In other words, the data is a separate thing from what you decide to display.

Having a name (maybe not knowing exactly who the name represents) and knowing that the name is in three different archives has value.  We can say ‘in the absence of any other information, we assume these names represent the same person’, or we can simply present the information and not make any conclusions (although that begs the question of how you present it without encouraging assumptions).  It is then up to researchers to explore further.  We might find new data sources that help to clarify names. We might get new descriptions that help to do this.

Many archival descriptions include subjects and, to a lesser extent, places. If you have Stephen Merryweather in one, with an index term of botany, and S. Merryweather in another, with the same index term, then you could say it is more likely to be a match. There is a question of how you might then present that information. The use of algorithms raises the issue of how to convey levels of confidence. It feels as if we need to have a more sophisticated – and recognised – means of presenting levels of confidence.
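A sketch of how a shared index term might support a match between a full forename and an initial. The record shape (a dictionary with `forename`, `surname` and `subjects`) and both function names are assumptions made for illustration, and the result is only a hint that a match is more likely, not a confidence score.

```python
def forename_compatible(a: str, b: str) -> bool:
    """True if the forenames are equal, or one is an initial of the other."""
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    if len(a) == 1 or len(b) == 1:
        return a[:1] == b[:1]
    return a == b


def supported_match(rec_a: dict, rec_b: dict) -> bool:
    """True if the names are compatible and the records share an index term."""
    same_surname = rec_a["surname"].lower() == rec_b["surname"].lower()
    shared_subjects = set(rec_a["subjects"]) & set(rec_b["subjects"])
    return (same_surname
            and forename_compatible(rec_a["forename"], rec_b["forename"])
            and bool(shared_subjects))
```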

This whole issue of confidence levels is more of a focus for archives, because of the anonymity I’ve talked about.

Diagram showing Relationships of data involved in creating name records
Relationships of data involved in creating name records

The ‘Name’ records shown above are the names within archival descriptions (EAD records on the Hub). These names can be pulled out from ‘origination’ (creator) and from ‘persname’ (usually in the controlaccess index section, but potentially elsewhere in the description). These names may represent ‘unknown’ people; the EAD may not even indicate whether they are personal or corporate or family names. They may not include dates, they may just be ‘Mary Fleming’ or ‘Mary Fleming fl 1717’. They may also be ‘unknown’, ‘[unknown]’, or even ‘unknown unknown’ (keeping the surname, forename structure!). They may be ‘Name of author (various)’ or ‘Various health authority bodies’ or ‘Possibly Miss M. Lindsay’. All these are examples from our data. They illustrate the conflict between human readable data – where ‘unknown’ is useful – and machine processable data – where semantics are important, and a name is ideally just a name.

If we create ‘Name’ entries for all of these then we have a store of data to work with, something I’ve mentioned before in my Names Project blogs.  We can then find out how many ‘Mary Fleming’ entries there are, or  how many ‘M Fleming’ entries. How we then choose to display that information to end users is a separate question.  But with the advances in machine learning, it is becoming an increasingly pertinent question.
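As a simple illustration of working with such a store, the sketch below counts name entries and sets aside placeholder values. The function name and the placeholder pattern are assumptions: the pattern covers only the three ‘unknown’ variants quoted above, not entries such as ‘Name of author (various)’.

```python
import re
from collections import Counter

# Matches 'unknown', '[unknown]' and 'unknown unknown' only (illustrative).
PLACEHOLDER = re.compile(r"^\[?unknown( unknown)?\]?$", re.IGNORECASE)


def count_names(entries):
    """Return (Counter of lower-cased names, number of placeholder entries)."""
    counts = Counter()
    placeholders = 0
    for entry in entries:
        cleaned = " ".join(entry.split())  # collapse stray whitespace
        if PLACEHOLDER.match(cleaned):
            placeholders += 1
        else:
            counts[cleaned.lower()] += 1
    return counts, placeholders
```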

We have an opportunity with archival metadata, with the way that archives represent ‘ordinary life’. But it is a challenge. Catalogues are still not really set up to identify entities (in a way that works for machine processing). We create what we refer to as ‘name authorities’ but we do not usually consider the importance of matching names outside of individual organisations. The Archives Hub has an opportunity to work on behalf of UK archives to try to draw out people and, in a sense, identify them, or at least, enable them to be more contextualised. But it will require a good deal of experimentation and expertise in working with disparate data. However, if we create a pool of names and provide an API, that would enable others to work with the data, and try different approaches. This is a big challenge, and it needs a concerted and collaborative approach.

picture of anonymous crowd

Names (4): Ethics and identity

As archivists, we deal with ethical issues a good deal.  But the ability to link disparate and diverse data sources opens up new challenges in this area, and I wanted to explore this a bit.

If you do a general search for ethics and data, top of the list comes health. An interesting example of data join-up is the move to link health data to census data, which could potentially highlight where health needs are not being met:

“Health services are required to demonstrate that they are meeting the needs of ethnic minority populations. This is difficult, because routine data on health rarely include reliable data on ethnicity. But data on ethnicity are included in census returns, and if health and census data for the same individuals can be linked, the problem might be solved.” (Ethnicity and the ethics of data linkage)

However, individuals who stated their ethnicity in census returns were not told that this might subsequently be linked with their health data. Should explicit informed consent be given? Given the potential benefits, is this a reasonable ask? It is certainly getting into hazardous terrain to ignore the principle of informed consent. In their book ‘Rethinking Informed Consent in Bioethics’, Manson and O’Neill argue that informed consent cannot be fully specific or fully explicit. They argue for a distinctive approach where rights can be waived or set aside in controlled and specific ways.

This leads to a wider question: is fully explicit and specific informed consent actually achievable within the joined-up online world? A world where data travels across connections, is blended, re-mixed, re-purposed. A world where APIs allow data to be accessed and utilised for all sorts of purposes, and ‘open data’ has become a rallying cry. Is there a need to engage the public more fully in order to gain public confidence in what open data really means, and in order to debate what ‘informed consent’ is, and where it is really required?

I am working on a project to create name records, and I am looking at bringing data sources together. Of course, this is hardly new. Wikipedia is the most well-known hub for biographical data. Anyone can add anything to a Wikipedia page (within some limits, and with some policing and editing by Wikipedia, but in essence it is an open database). Wikidata, Wikipedia’s structured-data sister project, is about bringing sources together in an automated way. Projects within cultural heritage are also working on linked data approaches to create rich sources of information on people. SNAC has taken archival data from many different archive repositories and brought it together. A page for one person, such as Martin Luther King, provides a whole host of associations and links. These sources are not all individually checked and verified, because this kind of work has to be done algorithmically. However, there is a great deal of provenance information, so that all sources used are clear.

image of page from the real face of white australia website
The Real Face of White Australia

There are some amazing projects working to reveal hidden histories. Tim Sherratt has done some brilliant work with Australian records, including projects such as Invisible Australians, which aims to reveal hidden lives using biographical information found in the records. He has helped to create some wonderful sites that reveal histories that have been marginalised. Tim talks about ‘hacking heritage’ and says: ‘By manipulating the contexts of cultural heritage collections we can start to see their limits and biases. By hacking heritage we can move beyond search interfaces and image galleries to develop an understanding of what’s missing.’ (Hacking heritage, blog post) He emphasises that access to indigenous cultural collections should be subject to community consultation and control. But what does community consultation and control really mean?

I have always been keen to work with the names in archival descriptions – archival creators and all the other people who are associated with a collection. They are listed in the catalogue (leastways the names that we can work with are listed – many names obviously aren’t included, but that’s another story), so they are already publicly declared. It is not a case of whether the name should be made public at all, or, at least, that decision has been made already by the cataloguer.   But our plan is to take the names and bring them to the fore – to give them their own existence within our service.  We are taking them out of the context of a single archive collection and putting them into a broader one. In so doing, we want to give the archive collections themselves more social context, we want to give more effective access to distributed historical records, and we also want to enable researchers to travel through connections to create their own narratives.

This may help to reveal things about our history and highlight the roles that people have played. It may bring to the fore people who have been marginalised. Of course, it does not address the problem of biases and subjective approaches to accessions and cataloguing. But a joined-up approach may help us to see those biases and gaps; to understand more about the silent spaces.

Creating persistent identifiers and linking data reveals knowledge. It is tempting to see that in simple terms as a good thing. But what about privacy and ethics? Even if someone is no longer living, there are still privacy issues, and many people represented in archives are alive.

Do individuals want to be persistently identified? What about if they change their identity? Do they want a pseudonym associated with their real name? They might have very good reasons for keeping their identity private. Persistent identification encourages openness and transparency, which can have real benefits, but it is not always benign.  It is like any information – it can be used for good and bad purposes, and who is to say what is good and what is not? Obviously we have GDPR and the Data Protection Act, and these have a good deal to say about obligations, the value of historical research and the right to be forgotten. This is something we’ll need to take into account. But linked data principles are not so much about working with personal data as working with data that may not seem personal, but that can help to reveal things when linked with other sources of data.

GDPR supports the principle of transparency and the importance of people’s awareness and control over what happens to their personal data. Even if we are not creating and storing personal data, it seems important to engage with data protection and what this means. The challenge of how to think about data when it is part of an ever shifting and growing  global data environment seems to me to be a huge one.

Certainly the horse has bolted to some degree with regard to joining up data. The Web lowered barriers considerably, and now we increasingly have structured data, so it is somewhat like one gigantic database. Finding things out about individuals is entirely feasible with or without something like a Names service created by the Archives Hub. We are not creating any new content, but creating this interface means we are consciously bringing data together, and obviously we want to be responsible, and respect people’s right to privacy. Clearly it is entirely impractical to try to get permission from all those living people who might be included. So, in the end, we are taking a degree of risk with privacy. Of course, we will un-publish on request, and engage with any feedback and concerns. But at present we are taking the view that the advantages and benefits outweigh the risks.

Image of exhibition photograph of black rights march

“Imagine being a sibling in a family that continually removes you from photos; tries its best to erase you…As you go through [the scrapbook] you see events where you know you were there, but you are still missing.”  Lae’l Hughes-Watkins (University of Maryland) gave an impassioned and inspiring talk at DCDC 2019 about her experiences.  She argued that archivists need to interrogate the reality that has been presented, and accept that our ideas of neutrality are misplaced. She wants a history that actively represents her – her history and culture, and experiences as a black woman in the USA. She related moving stories of people with amazing stories (and amazing archives) who distrust cultural institutions because they don’t feel included or represented.

This may seem a long way away from our small project to create name records, but in reality our project could be seen as one very small part of a move towards what Lae’l is talking about. Bringing descriptions together from across the UK maybe helps us to play a small role in this – aiming to move towards documenting the full breadth of human experience. The archives that we cover may retain the biases and gaps for some time to come (probably for ever, given that documentary evidence tends to represent the powerful and the elite much more strongly), but by aggregating and creating connections with other sources, we help to paint a bigger picture. By creating name records we help to contextualise people, making it much easier to bring other lives and events into the picture. It is a move towards recognising the limitation of what is actually in the archive, and reaching out to take advantage of what is on the Web. In doing this through explicitly identifying people we do leave ourselves more open to the dangers of not respecting privacy or anonymity. When we plug fully into the Web, we become a part of its infinite possibilities, which is always going to be a revealing, exciting, uncontrollable and risky business. By allowing others to use this data in different ways, we open it up to diverse perspectives and uses.

Names (3): One name record to bind them

It has been great to get comments and feedback around names, and I wanted to expand upon something that a few people have commented on: the ideal of one ‘authority record’ for one person or organisation.

model showing relationships of catalogues and name records
Model showing potential relationships between catalogues and name records

The above diagram is a proposal for the relationships we might have – note that it is a working model, and may well change over time. You can see the catalogues (the descriptions of archives) include people, some with biographical histories, and these people are either creators of archive collections or referenced in them. Each of these people then gets a name record (bottom left box), so we might have e.g. three name records for the same name (and the same name may potentially be the same person, or may not). We will work with the store of records that we have with the aim of creating matches, and ending up with a generic or main name record (green box, top left).

The ‘main record’ or ‘master record’ or whatever we might call it, for each individual person or organisation, is not an ‘archival record’. It is not intended simply to be a reflection of what is in our own data. It is intended to be a page dedicated to that person or organisation.  Our current feeling is that this should not be seen as domain specific; in fact, we want to get away from the idea that data is domain specific.  It is about an entity (a person or organisation), and what we know of that entity.

Keeping in mind the green box, and looking at the person page for Robin Day from Exploring British Design, a previous AHRC project we ran with Brighton Design Archive, you get a sense of the type of thing we mean.

Page for Robin Day, from the Exploring British Design website
Exploring British Design: Robin Day

This page presents as a general information page about a designer. It is not branded as a page about archives. It takes information in from different sources. Is it an ‘authority’ record? I’m really not sure; I wouldn’t call it that. The point is really that it enables researchers to put Robin Day into the context of other people, organisations, places and events, or at least it demonstrates how that can be done. It creates a network, and it intends to show the value of including archives in a network, rather than standing apart, in their ‘own world’.

Screenshot of an entity relationship diagram for Robin Day
Visualised relationships

The network can easily be visualised. There are tools out there to do this. The challenge is to create the data to feed into these visualisers. Again, this visualisation is not about archival name authority records; it is not domain specific.

In the Robin Day page, we have a section for related archives and museum resources.

screenshot showing archives related to Robin Day
Related archive and museum resources

This lists archives Robin Day is the ‘creator of’ or archives he is ‘associated with’. It links to the Archives Hub, but also to other sources. One of the options for end users is to go and find out more about the archival sources, but it is not prioritised above other options.

So, this is essentially the idea – a page for a person, a page for an organisation. An information resource that focuses on creating a network of connections. We think this is a good approach, but creating something along these lines that is automated, sustainable and effective within an ongoing national service is much harder.

Why not just use this one record, link to the archive catalogues, and dispense with the individual name records that we have created? There are three reasons to consider providing access to the individual name records: biographical history, uncertainty around matching, and ingesting name authority records.

I have already written about biographical and administrative history in a separate post.

In this phase of the Names Project the individual records for Beatrice Webb (as a name example), will be created either from the creator name or index terms that we have in the Archives Hub catalogues.

The main problem is the wide variation in name entries.

Webb, Beatrice
Webb, Beatrice, née Potter
Webb (Martha) Beatrice, 1858-1943
Webb, Martha Beatrice, 1858-1943
Webb;[Martha] Beatrice [nee Potter] 1858-1943

These are all entries in the Archives Hub.  We can match them all up, but can we say they are all the same?   Names without dates should not be matched with certainty, but quite often they will be the same person. (Beatrix Potter also often ends up being linked with Beatrice Webb, née Potter).
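One small, practical step is to separate the entries that carry life dates from those that do not, since only the former can be matched on dates. The sketch below is an assumption about how that might be done: it looks for a pair of four-digit years joined by a hyphen, as in the entries above, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Two four-digit years joined by a hyphen, e.g. '1858-1943'.
LIFE_DATES = re.compile(r"\b(\d{4})\s*-\s*(\d{4})\b")


def life_dates(entry: str) -> Optional[Tuple[int, int]]:
    """Return (birth year, death year) if the entry contains life dates, else None."""
    match = LIFE_DATES.search(entry)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Entries for which this returns None, such as ‘Webb, Beatrice’, would then need a lower level of confidence or supporting information before being matched.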

The decision we need to make is whether to provide links to these individual name records that we will have, or only use them as a source of data.  It seems valuable to enable end users to see these names as a group, but it is another thing to risk integrating information from them all into one name record.  There is no perfect answer to this, but it does seem important to clearly indicate the level of uncertainty.  So many names that we have don’t have life dates, or have variations in structure.  What we are looking to achieve is a clear provenance, giving end users the best understanding of what they are seeing.

What about name records that have been created by our contributors?  The name records we create ourselves from catalogue descriptions will generally be no more than the name, dates, and biographical history.  But, going forwards, we will want to work with much more detailed name records.

For Exploring British Design we created rich name records with an entity-relationship structure (essentially using the EAC-CPF structure and working in RDF),  to demonstrate the power of connecting entities.  For this purpose, we partially hand-crafted the name records, as well as carrying out some very complex processing to create various connections.

Part of the timeline for Robin Day

The example above shows events from the Robin Day timeline, with linked connections to related organisations.  If we ingest EAC-CPF records we might get timelines like this.

Name records may also include relationships. The Borthwick Institute has good examples of name records with plenty of rich relationship information, for example the record for Charles Lindley Wood, Viscount Halifax.

An excerpt from a Borthwick entry for Charles Lindley Wood

If we took this record into the Archives Hub it might seem to make sense for it to become the main person record for Wood.  But that would involve a process of making choices, preferencing one name record over another.  Possible, but tricky to do in an automated way. Another record office might also have a splendid example of a name entry for this person, with some different data. Furthermore, this record has links to the Borthwick catalogue. We would potentially have to remove these links.

It would be very challenging to create one record from several source EAC-CPF records for the same person –  to blend timelines, or sort out relationships listed in different records, bearing in mind that it needs to be done in an automated way, keeping version control and dealing with revisions and new data coming in that might add to the name record.  How could we compare and blend two lists of relationships? Or two chronologies? We’d probably end up having to keep them all, and then potentially have similar but different relationships and chronologies, giving a slightly confused user experience.

If we do ingest records like the one above, we will have to figure out how these  more detailed records will relate to what we have already created.  If, as planned, we have one generic name record for a person, it makes the job easier, as we won’t be looking to make any one EAC-CPF record into the main name record, we will simply link to it from the main record. Bear in mind, our main record is intended to be a domain-neutral entry – linking to other sources beyond archives.  EAC-CPF records might do this to some extent, but they are unlikely to link to the Jisc Library Hub, and probably won’t link to Wikidata, or other external sources.   They are far more likely to provide internal links to the archive catalogue they relate to.

Arguably, it might be easier to forget about creating name records ourselves (from the catalogue entries) and just work with name records that have been created by our contributors (which are likely to be well-structured and include life dates). But if we do that, the pot of names will grow slowly, as only a small proportion of repositories create name records. We can’t realistically give the end user a few thousand name records covering maybe 1-2% of our names – they might search for ‘Winston Churchill’ as a name, and find that we don’t have him!  It would not remove the problem of name matching, and it would make the whole idea of reaching out beyond the archive domain, by linking into other resources using our names as the hook, rather ineffectual.

Therefore, we propose to keep the separate name records in our system. We also propose to create a ‘generic record’, which is what would be prominent in the Archives Hub display. We would then have the potential to link the records together, to blend them, and to try some text mining and analysis techniques. It gives us options.  It would not be sensible to make those decisions now. It is better to lay the groundwork that enables us to be flexible.   This approach allows us to link to an individual name record where we don’t feel able to confirm a ‘same as’ relationship. It presents the option to the end user – here is a name – we think this is the same person, so we’ve provided a link.
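As a rough sketch of what this could look like as a data structure (every class and field name here is hypothetical, and the repository labels are placeholders rather than real contributors), the generic record simply holds links to the individual name records, each carrying its source and whether we consider the ‘same as’ relationship confirmed:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NameLink:
    entry: str      # the name exactly as it appears in the source description
    source: str     # provenance, e.g. the contributing repository
    same_as: bool   # True only when we are prepared to assert 'same as'


@dataclass
class GenericRecord:
    label: str
    links: List[NameLink] = field(default_factory=list)

    def confirmed(self) -> List[NameLink]:
        """Individual records we are willing to call the same person."""
        return [link for link in self.links if link.same_as]

    def tentative(self) -> List[NameLink]:
        """Individual records offered to the user as 'possibly the same'."""
        return [link for link in self.links if not link.same_as]


# Placeholder sources; the entries are taken from the Beatrice Webb list above.
webb = GenericRecord("Webb, Martha Beatrice, 1858-1943")
webb.links.append(NameLink("Webb, Martha Beatrice, 1858-1943", "Repository A", True))
webb.links.append(NameLink("Webb, Beatrice", "Repository B", False))

for link in webb.tentative():
    print(f"Possibly the same person: {link.entry} ({link.source})")
```

Nothing is merged in this structure. The individual records stay intact, so a decision to blend them later, or to revise a link, does not lose the provenance.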

The end user experience needs to make sense and not mislead or provide false information. Links to brief name records could seem confusing, but, as I have said, trying to bring together in one record all the information from several name records, with  their biographies, relationships, aliases, events, related resources, is likely to be a nightmare.  In the end, it will take a good deal more testing and working with researchers to work out what is best.

Archives Hub Names Project (2): Biographical History

It is a somewhat vexed question how to treat biographical and administrative history (in this post I’ll focus on biographical history).  This is an ISAD(G) field and an EAD field. ISAD defines it as providing “an administrative history of, or biographical details on, the creator (or creators) of the unit of description to place the material in context and make it better understood”.  It advises for personal names to include “full names and titles, dates of birth and death, place of birth, successive places of domicile, activities, occupation or offices, original and any other names, significant accomplishments, and place of death”.

On the Archives Hub we have a whole range of biographical histories – from very short to very comprehensive.  I have had conversations with archivists who believe that ‘putting the collection in context’ means giving information that is particularly relevant for that archive rather than giving a general history. Conversely, many biographical history entries do give a very full biography, even if the collection only relates to one aspect of a person’s life and work. They may also include information that is not readily available elsewhere, as it may have been discovered as part of the cataloguing process.

The question is, if we create a generic name record for a person, how do we treat this biographical information? There are a number of alternatives.

(1) Add all biographical history entries to the record

If you look at a SNAC example:  https://snaccooperative.org/view/54801840 you can see that this is the approach. It has merits – all of the biographical information is brought together. But it can mean a great deal of repetition, and the ordering of the entries can seem rather illogical, with short entries first and then longer comprehensive entries at the end.

Whilst most biographical history entries are pretty good, it also means a few not very helpful entries may be included, and may be top of the order. In addition, putting all the entries in together doesn’t always seem to make much sense. In the example below there are just three short entries for a major figure in women’s liberation. They are automatically brought in from the catalogue entry for individual collections. Sometimes the biographical entries in individual catalogues suffer from system migration and various data processing issues that mean you end up with field contents that are not ideal.

Millicent Garrett Fawcett biographical histories in SNAC

The question is whether this approach provides a useful and effective end user experience.

Where there is one entry for a creator, with one biographical history, there is no issue other than whether the entry makes sense as an overall biographical entry for that person or organisation. But we have to consider the common situation where there will be a dozen or more entries. Even if we start with one entry, others may be added over time.  Generally, there will be repetition and information gaps, but in many cases this approach will provide a good deal of relevant information.

(2) Keep the biographical history entries with the individual name records

At the moment our plan is to create individual name records for each person, as well as a generic master record.  We haven’t yet worked out the way this might be presented to the end user.  But we could keep the biographical histories with the individual entries we have for names. The generic record would link to these entries, and to the information they contain.  This makes sense, as it keeps the biographical histories separate, and within the entries they were written to accompany. Repetition is not an issue as it is clear why that might happen.  But the end user has to go to each entry in turn to read this information.

(3) Keep biographical history entries with individual name records, but enable the information to be viewed in the generic master record

We have been thinking about giving the end user the option to ‘click to see all biographical histories created for this person’. That would help with expectations. Simply presenting a page with a dozen similar biographical histories is likely to confuse people, but  enabling them to make a decision to view entries gives us more opportunity for explanation – the link could include a brief explanatory note.

(4) Select one biographical history to be in the generic record

We have discussed this idea, but it is really a non-starter. How do you select one entry? What would the criteria be if it is automated? The longest?

(5) Link to a generic biography if available

This is the idea of drawing in the Wikipedia entry for that person or organisation, or potentially using another source.  There is a certain risk to pulling in data from an external source as the ‘definitive’ biographical information, but the source would always be cited, and it does start to move towards the principle of bringing different sources of information together. If we want to create a more generic resource, we are going to have to take risks with using external sources.

I would be interested in any comments on this.

Names Project (1): Creation of name records

The Archives Hub Names Project

The Archives Hub team and Knowledge Integration, our system suppliers, are embarking upon a short four-month project to start to lay the groundwork, define the challenges and test the approaches to presenting end users with a name-based means to search, and connect to a broad range of resources related to people and organisations.  I will be blogging about the project as we go along.

Our key aims in the long-term are:

  • To provide the end user with a way to search for people and organisations and find a range of material relevant to their research
  • To enable connections to be made between resources within and external to Jisc, using names as the main focus
  • To bring archive collections together in an intellectual sense and provide different contexts to collections by creating networks across our data

This first project will not create an end-user interface, but will concentrate on processing,  matching names and linking resources. We want to explore how this can be administered in order to be sustainable over time.  In the end, the most challenging part of working with the names we have is identification, disambiguation and matching.  The aim is to explore the space and start to formulate a longer-term plan for the full implementation of names as entities within the Archives Hub.

Creation of name records from EAD description records

NB: This blog often refers to personal names for convenience, but names include personal, family and corporate entities.

EAD descriptions include personal, family and corporate names.  These ‘entities’ may be listed as archival creators and also associated with the collection as index terms. Archival creators may optionally be given biographical or administrative histories.  The relationship of the collection with names in the index is not made explicit in the description (in a structural way), though it may often be gleaned from the descriptive information within the EAD record.

Creating name records for all names

We are proposing to begin by creating name records for all of these entries, no matter how thin the information for each entry may be.

Here is a random selection of names that are included in Archives Hub records:

Grote, Arthur
Gaskell, Arthur
Wilson, John
Thatcher, J. Wells, Barrister at Law
Barron, Margaret
Stanley, Catherine, 1792-1862
Roe, Alfred Charles
Rowlatt, Mary, b 1908
Milligan, Spike, 1918-2002
Fawcett, Margaret, d. 1987
Rolfe, Alan, 1908-2002 actor
Mayers, Frederick J (fl 1896-1937 : designer : Kidderminster, England)
Joan

Only a percentage of names have life dates. Some have birth or death dates, some floruit dates.

Of course, the life dates, occupations and outputs of many people are not known, or may be very difficult to find.  Also, life dates will change when a birth date is joined by a death date. Epithets may also change over time (and they are not controlled vocabulary anyway).

In addition, we have inverted and non-inverted names on the Archives Hub, names with punctuation in different places, names with and without brackets, etc.  These issues create identification challenges.
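To give a feel for the different date forms in the list above, here is a small sketch that pulls out full life dates, birth-only, death-only and floruit dates from a name string. The patterns are my own guesses based only on the examples shown here; real data will contain forms these do not catch, and the names `LifeDates` and `extract_dates` are illustrative.

```python
import re
from typing import NamedTuple, Optional


class LifeDates(NamedTuple):
    kind: str             # "life", "birth", "death" or "floruit"
    start: Optional[int]
    end: Optional[int]


def extract_dates(entry: str) -> Optional[LifeDates]:
    """Return the first recognisable date statement in a name entry, if any."""
    # Floruit range, e.g. "fl 1896-1937"; checked first so that it is not
    # mistaken for a pair of life dates.
    m = re.search(r"\bfl\.?\s*(\d{4})\s*-\s*(\d{4})", entry)
    if m:
        return LifeDates("floruit", int(m.group(1)), int(m.group(2)))
    # Full life dates, e.g. "1918-2002".
    m = re.search(r"\b(\d{4})\s*-\s*(\d{4})\b", entry)
    if m:
        return LifeDates("life", int(m.group(1)), int(m.group(2)))
    # Birth only, written as "b 1908" or "b. 1908".
    m = re.search(r"\bb\.?\s*(\d{4})\b", entry)
    if m:
        return LifeDates("birth", int(m.group(1)), None)
    # Birth only, written as an open range such as "1932-".
    m = re.search(r"\b(\d{4})\s*-(?!\s*\d)", entry)
    if m:
        return LifeDates("birth", int(m.group(1)), None)
    # Death only, written as "d 1987" or "d. 1987".
    m = re.search(r"\bd\.?\s*(\d{4})\b", entry)
    if m:
        return LifeDates("death", None, int(m.group(1)))
    return None


for name in ["Milligan, Spike, 1918-2002", "Rowlatt, Mary, b 1908",
             "Fawcett, Margaret, d. 1987", "Grote, Arthur"]:
    print(name, "->", extract_dates(name))
```

Keeping the kind of date alongside the years matters: a birth-only entry and a full life-dates entry for the same person should be treated as compatible rather than as different strings.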

Even taking names as creators and names as index terms within one single description, the match is often not exact:

Millicent Garrett Fawcett (creator name)
Fawcett, Dame Millicent. (1847-1929) nee Garrett, Feminist and Suffragist (index term)

Lingard, Joan (creator name)
Lingard, Joan Amelia, 1932- (index term)

The archival descriptions on the Archives Hub vary a great deal in terms of the structure, and different repositories have different approaches to cataloguing.  Some do not add name of creator, some do not add index terms, some add them intermittently, and often the same name is added differently for different collections within the same repository.  In many cases the cataloguer does not add life dates, even when they are known, or they are added to the name as creator but not in the index list, or vice versa. This sounds like a criticism, but the reality is that there are many reasons why catalogues have ended up as they are.

There has not been a strong tradition amongst archivists of adding names as unique identifiable entities, but of course, it has only been in the last few decades that we have had the potential, which is becoming increasingly sophisticated, of linking data through entity relationships, and creating so much more than stand-alone catalogue records. Many archivists still think primarily in terms of human readable descriptions.  Some people feel that with the advent of Google and sophisticated text analysis, there is no need to add names in this structured way, and there is no need for index terms at all.  But in reality search engines generally recommend structured data, and they are using it in sophisticated ways.  Schema.org is for structured data on the web, an initiative started by Google, Microsoft, Yahoo, and Yandex. Explicit markup helps search engines understand content and it potentially helps with search engine optimisation (ensuring your content surfaces on search engines).  Also, if we want to move down the Linked Data road, even if we are not thinking in terms of creating strict RDF Linked Data, we need to identify entities and provide unique identifiers for them (URLs on the web). Going back to Tim Berners-Lee’s seminal Linked Data article from 2006:

“The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data.  With linked data, when you have some of it, you can find other, related, data.”

So, including names explicitly provides huge potential (as well as subjects, places and other entities) and it has become more important, not less important. Indeed, I would go so far as to say that structured data is more important than standards compliant data, especially as, in my experience, standards are often not strictly adhered to, and also, they need constant updating in order to be relevant and useful.

The idea with our project is that we start with name records for every entity – a pot of data we can work with. We may create Encoded Archival Context (Corporate Bodies, Persons and Families), otherwise known as EAC-CPF…but that is not important at this stage.  EAC is important for data ingest and output, and we intend to use it for that purpose, so it will come into the picture at some point.

The power of the anonymous

There are benefits in creating name records for people who are essentially anonymous or not easily identifiable.  Firstly, these records have unknown potential; they may become key to making a particular connection at some point, bearing in mind that the Archives Hub continually takes new records in. Secondly, we can use these records to help with identification, and the matching work that we undertake may help to put more flesh on the bones of a basic name record.  If we have ‘Grote, Arthur’ and then we come across ‘Grote, Arthur, 1840-1912’, we can potentially use this information and create a match. Of course, the whole business of inference is a tricky thing – you need more than a matching surname and forename to create a ‘same as’ relationship (I won’t get into that now). But the point is that a seemingly ‘orphan’ name may turn out to have utility. It may, indeed, provide the key to unlocking our understanding of particular events – the relationships and connections between people and other entities are what enable us to understand more about our history.

Components of a name record

So, all names will have name records, some with just a name, some with life dates of different sorts, some with biographical or administrative histories. The exception to this may be names that are not identifiable as people or organisations.  It is potentially possible to discover the type of entity from the context, but that is a whole separate piece of work.  Hundreds of names on the Archives Hub are simply labelled as ‘creator’ or ‘name’. This is down to historical circumstance – partly the Archives Hub made errors in the past (our old cataloguing tool entered creators simply as EAD ‘origination’), and partly it comes from other systems we ingest data from.  At the moment, for example, we are taking in descriptions from Axiell’s AdLib system, but the system does not mark up creator names as people or organisations (unless the cataloguer explicitly adds this), so we cannot get that information. This is probably a reflection of a time when semantically structured data was simply less important. If a human reads ‘Elizabeth Gaskell’ in a catalogue entry they are likely to understand what that string means; if undertaking large-scale automated processing, it is just a string of characters, unless it includes semantic information.

From the name records that we create, we intend to develop and run algorithms to match names. In many cases, we should be able to draw several names together, with a ‘same-as’ relationship. Some may be more doubtful, others more certain. I will talk about that as we get into the work.

At the moment, we have some ideas about how we will work with these individual records in terms of the workflow and the end user experience, but we have not made any final decisions, and we think that what is most important at this stage is the creation and experimentation with algorithms to see what we can get.

Master name records

We intend to create master records for people and organisations. The principle is to see these master records not as something within the archives domain, but as stand-alone records about a person or organisation that enable a range of resources to be drawn together.

So, we might have several name records for one person:

Example of master record, with various related information included:
Webb, Martha Beatrice, 1858-1943, social reformer and historian

Examples of additional name records that should link to the master record:
Webb, Beatrice, 1858-1943 (good match)
Webb, Martha Beatrice, 1858-1943, economist and reformer (good match)
Webb, Martha Beatrice, nee Potter, 1858-1943 (good match)
Webb, M.B. b. 1858 (possible match)
but…
Potter, Martha Beatrice, b 1858
…might well not be a match, in which case it would stand separately, and the archive connected to it would not benefit from the links being made.
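The grading in the list above could be sketched along the following lines. This assumes the names have already been split into surname, forenames and dates, and the rules themselves are only my illustration of the idea, not the algorithms we will actually run: a ‘good’ match needs full forenames and both life dates to agree, anything compatible but thinner is ‘possible’, and a different surname or conflicting dates leaves the record standing separately.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NameRecord:
    surname: str
    forenames: Tuple[str, ...]
    birth: Optional[int] = None
    death: Optional[int] = None


def _forename_strength(candidate: Tuple[str, ...],
                       master: Tuple[str, ...]) -> Optional[str]:
    """'strong' if every forename is found in full, 'weak' if some are only
    initials (or there are no forenames), None if any forename conflicts."""
    if not candidate:
        return "weak"
    weak = False
    for name in candidate:
        if name in master:
            continue
        if len(name) == 1 and any(m.startswith(name) for m in master):
            weak = True
            continue
        return None
    return "weak" if weak else "strong"


def grade(candidate: NameRecord, master: NameRecord) -> str:
    """Return 'good', 'possible' or 'unmatched' for a candidate name record."""
    if candidate.surname != master.surname:
        return "unmatched"
    strength = _forename_strength(candidate.forenames, master.forenames)
    if strength is None:
        return "unmatched"
    # A known date that disagrees with the master record rules the match out.
    for ours, theirs in ((candidate.birth, master.birth),
                         (candidate.death, master.death)):
        if ours is not None and theirs is not None and ours != theirs:
            return "unmatched"
    dates_agree = (candidate.birth is not None and candidate.birth == master.birth
                   and candidate.death is not None and candidate.death == master.death)
    return "good" if strength == "strong" and dates_agree else "possible"


master = NameRecord("webb", ("martha", "beatrice"), 1858, 1943)
print(grade(NameRecord("webb", ("beatrice",), 1858, 1943), master))
print(grade(NameRecord("webb", ("m", "b"), 1858), master))
print(grade(NameRecord("potter", ("martha", "beatrice"), 1858), master))
```

Under these rules the ‘Potter, Martha Beatrice’ record is never linked, even though a human would spot the maiden name. Catching that would need extra evidence, such as a née clause on the master record.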

We have discussed the pros and cons of creating master records for all names.  It makes sense to bring together all of the Beatrice Webb names into one master record – there is plenty that can be said about that individual; but does it make sense to have a master record for single orphaned names with no life dates and nothing (as yet) more to say about that individual?  That is a question we have yet to answer.

diagram showing link between archive, name records and master records
The archive is described through an EAD description held on our system (the CIIM). We take all the names from this to create a huge store of individual names. From this, we aim to create and update ‘definitive’ name records.

The principle is to have name records that enable us to create links to the Archives Hub entries and also to other Jisc services and resources beyond that – resources outside of the archives domain.  Many of these resources may also help us with our own identification and matching processes. It is important to benefit from the work that has already been done in this area.

We are looking at various name resources and assessing where our priorities will be.  This is a fairly short project, and we won’t have time to look at more than a handful of options. But we are currently thinking in terms of VIAF, ORCID and Wikidata. More on that to follow.

Personally, I’ve been thinking about working with names for several years. We have been asked about it quite a bit. But the challenge is so big and nebulous in many ways. It has not been feasible to embark upon this kind of work in the past, as our system has not supported the kind of systematic processing that is required. We are also able to benefit from the expertise K-Int can bring to data processing. It is one thing doing this as a stand-alone project; it is quite another to think about a live service, long term sustainability, version control and revisions, ingest from different systems, etc.  And also, to break it down into logical phases of work.  It is exciting, but it is going to involve a great deal of hard work and hard thinking.

Archives Hub Survey Results: What do people want from an archives aggregation service?

The 2018 Archives Hub online survey was answered by 83 respondents. The majority were in the UK, but a significant number were in other parts of Europe, the USA or further afield, including Australia, New Zealand and Africa. Nearly 50% were from higher or further education, and most were using it for undergraduate, postgraduate and academic research. Other users were spread across different sectors or retired, and using it for various reasons, including teaching, family history and leisure or archives administration.

We do find that a substantial number of people are kind enough to answer the survey, although they have not used the service yet. On this survey 60% were not regular users, so that is quite a large number, and maybe indicates how many first-time users we get on the service. Of those users, half expected to use it regularly, so it is likely they are students or other people with a sustained research interest. The other 40% use the Hub at varying levels of regularity. Overall, the findings indicate that we cannot assume any pattern of use, and this is corroborated by previous surveys.

Ease of use was generally good, with 43% finding it easy or very easy, but a few people felt it was difficult to use. This is likely to be the verdict of inexperienced users, and it may be that they are not familiar with archives, but it behoves us to keep thinking about users who need more support and help. We aim to make the Hub suitable for all levels of users, but it is true to say that we have a focus on academic use, so we would not want to simplify it to the point where functionality is lost.

I found one comment particularly elucidating: “You do need to understand how physical archives work to negotiate the resource, but in terms of teaching this actually makes it really useful as a way to teach students to use a physical archive.”  I think this is very true: archives are catalogued in a certain way, that may not be immediately obvious to someone new to them. The hierarchy gives important context but can make navigation more complicated. The fact that some large collections have a short summary description and other smaller archives have a detailed item-level description adds to the confusion.

One negative comment that we got maybe illustrates the problem with relevance ranking: “It is terribly unhelpful! It gives irrelevant stuff upfront, and searches for one’s terms separately, not together.” You always feel bad about someone having such a bad experience, but it is impossible to know if you could easily help the individual by just suggesting a slightly different search approach, or whether they are really looking for archival material at all. This particular user was a retired person undertaking family history, and they couldn’t access a specific letter they wanted to find. Relevance ranking is always tricky – it is not always obvious why you get the results that you do, but on the whole we’ve had positive comments about relevance ranking, and it is not easy to see how it could be markedly improved.  The Hub automatically uses AND for phrase searches, which is fairly standard practice. If you search for ‘gold silver’ you will probably get the terms close to each other but not as a phrase, but if you search for ‘cotton mills’ you will get the phrase ranked higher than e.g. ‘mill-made cotton’ or ‘cotton spinning mill’.  One of the problems is that the phrase may not be in the title, although the title is ranked higher than other fields overall. So, you may see in your hit list ‘Publication proposals’ or ‘Synopses’ and only see ‘cotton mills’ if you go into the description. On the face of it, you may think that the result is not relevant.

What do you most value about the Archives Hub?

All of our surveys have clearly indicated that a comprehensive service providing detailed descriptions of materials is what people want most of all. It seems to be more important than providing digital content, which may indicate an acknowledgement from many researchers that most archives are not, and will not be, digitised. We also have some evidence from focus groups and talking to our contributors that many researchers really value working with physical materials, and do not necessarily see digital surrogates as a substitute for this. Having said that, providing links to digital materials still ranks very highly in our surveys. In the 2018 survey we asked whether researchers prefer to search physical and digital archives separately or together, in order to try to get more of a sense of how important digital content is. Respondents put a higher value on searching both together, although overall the results were not compelling one way or the other. But it does seem clear that a service providing access to purely digital content is not what researchers want. One respondent cited Europeana as being helpful because it provided the digital content, but it is unclear whether they would therefore prefer a service like Europeana that does not provide access to anything unless it is digital.

Searching by name, subject and place are clearly seen as important functions. Many of our contributors do index their descriptions, but overall indexing is inconsistent, and some repositories don’t do it at all. This means that a name or subject search inevitably filters out some important and relevant material. But in the end, this will happen with all searches. Results depend upon the search strategy used, and with archives, which are so idiosyncratic, there is no way to ensure that a researcher finds everything relating to their subject.  We are currently working on introducing name records (using EAC-CPF). But this is an incredibly difficult area of work. The most challenging aspect of providing name records is disambiguation. In the archives world, we have not traditionally had a consistent way of referring to individuals. In many of the descriptions that we have, life dates are not provided, even when available, and the archive community has a standard (NCA Rules) that is not always helpful for an online environment or for automated processing. It actually encourages cataloguers to split up a compound or hyphenated surname in a way that can make it impossible to then match the name. For example, what you would ideally want is an entry such as ‘Sackville-West, Victoria Mary (1892-1962) Writer’, but according to the NCA Rules, you should enter something like ‘West Victoria Mary Sackville- 1892-1962 poet, novelist and biographer’. The epithet is always likely to vary, which doesn’t help matters, but entering the name itself in this non-standard way is particularly frustrating in terms of name matching.  On the Hub we are encouraging the use of VIAF identifiers, which, if used widely, would massively facilitate name matching. But at the moment use is so small that this is really only a drop in the ocean.
In addition, we have to think about whether we enable contributors to create new name records, whether we create them out of archive descriptions, and how we then match the names to names already on the Hub, whether we ingest names from other sources and try to deal with the inevitable variations and inconsistencies.  Archivists often refer to their own store of names as ‘authorities’ but in truth there is often nothing authoritative about them; they are done following in-house conventions. These challenges will not prevent us from going forwards with this work, but they are major hurdles, and one thing is clear: we will not end up with a perfect situation. Researchers will look for a name such as ‘Arthur Wellesley’ or ‘Duke of Wellington’ and will probably get several results. Our aim is to reduce the number of results as much as we can, but reducing all variations to a single result is not going to happen for many individuals, and probably for some organisations. Try searching SNAC (http://snaccooperative.org/), a name-based resource, for Wellington, Arthur Wellesley, to get an idea of the variations that you can get in the user interface, even after a substantial amount of work to try to disambiguate and bring names together.
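To illustrate why the split surname in the Sackville-West example is awkward for automated processing, here is a sketch of a heuristic that tries to put the surname back together. It is based solely on the one example quoted above, not on the text of the NCA Rules, so the assumption that the surname prefix always appears as a later word ending in a hyphen may well not hold generally. The function name is my own.

```python
import re


def rejoin_split_surname(entry: str) -> str:
    """Rebuild 'Prefix-Surname, Forenames ...' from 'Surname Forenames Prefix- ...'.

    Returns the entry unchanged if no word made of letters plus a trailing
    hyphen is found after the first word.
    """
    tokens = entry.split()
    for i, token in enumerate(tokens[1:], start=1):
        # Letters followed by a hyphen, e.g. "Sackville-"; an open date
        # such as "1932-" contains digits and is deliberately ignored.
        if re.fullmatch(r"[^\W\d_]+-", token):
            surname = token + tokens[0].rstrip(",")
            forenames = " ".join(tokens[1:i])
            rest = " ".join(tokens[i + 1:])
            head = f"{surname}, {forenames}" if forenames else surname
            return " ".join(part for part in (head, rest) if part)
    return entry


print(rejoin_split_surname(
    "West Victoria Mary Sackville- 1892-1962 poet, novelist and biographer"
))
```

Even when a heuristic like this works, it has to be applied before any matching is attempted, and it is one more place where an incorrect guess can creep in, which is why an identifier such as a VIAF ID is so much more useful.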

The 2018 survey asked about the importance of providing information on how to access a collection, and 75% saw this as very important. This clearly indicates that we cannot assume that people are familiar with the archival landscape. Some time ago we introduced a link on all top-level entries ‘how to access these materials’. We have just changed that to ‘advice on accessing these materials’, as we felt that the former suggested that the materials are readily accessible (i.e. digital), and we have also introduced the link on all description pages, down to item-level. In the last year, the link has been clicked on 11,592 times, and the average time spent on the resulting information page is 1 minute, so this is clearly very important help for users. People are also indicating that general advice on how to discover and use archives is a high priority (59% saw this as of high value). So, we are keen to do more to help people navigate and understand the Archives Hub and the use of archives. We are just in the process of re-organising our ‘Researching‘ section of the website, to help make it easier to use and more focussed.

There were a number of suggestions for improvements to the Hub. One that stood out was the need to enable researchers to find archives from one repository. At the moment, our repository filter only provides the top 20 repositories, but we plan to extend this. It is partly a case of working out how best to do it, when the list of results could be over 300. We are considering a ‘more’ link to enable users to scroll down the list. Many other comments about improvements related back to being more comprehensive.

One respondent noted that ‘there was no option for inexperienced users’. It is clear that a number of users do find it hard to understand. However, to a degree this has to reflect the way archives are presented and catalogued, and it is unclear whether some users of the Hub are aware of what sort of materials are being presented to them and what their expectations are. We do have a Guide to Using Archives specifically for beginners, and this has been used 5,795 times in the last year, with consistently high use since it was introduced. It may be that we should give this higher visibility within the description pages.

Screenshot of Hub page on using archives
Guide to Using Archives

What we will do immediately as a result of the survey is to link this into our page on accessing materials, which is linked from all descriptions, so that people can find it more easily. We used to have a ‘what am I looking at?’ kind of link on each page, and we could re-introduce this, maybe putting the link on our ‘Archive Collection’ and ‘Archive Unit’ icons.

It is particularly important to us that the survey indicated people that use the Hub do go on to visit a repository. We would not expect all use to translate into a visit, but the 2018 survey indicated 25% have visited a repository and 48% are likely to in the future. A couple of respondents said that they used it as a teaching tool or a tool to help others, who have then gone on to visit archives. People referred to a whole range of repositories they have or will visit, from local authority through to university and specialist archives.

screenshot of survey results
I have found material using the Archives Hub that I would not otherwise have discovered

59% had found materials using the Hub that they felt they would not have found otherwise. This makes the importance of aggregation very clear, and probably reflects our good ranking on Google and other search engines, which brings people into the Archives Hub who might otherwise not have found it, or the archives it describes.

How the Exploring British Design project informed the development of the Archives Hub

Back in 2014 the Archives Hub joined forces with The University of Brighton Design Archives for an exciting new project, funded by the Arts and Humanities Research Council, ‘Exploring British Design’ (EBD).

The project explored Britain’s design history by connecting design-related content in different archives, with the aim of giving researchers the freedom to explore around and within archives.

You can read a number of blog posts on the project, and there is also a video introducing the EBD website on YouTube, but in this post I wanted to set out how we have learned from the project and how it has informed the development of the new Archives Hub.

Unfortunately, we may not be able to maintain the website longer term, and so it seemed timely to reflect on how the principles used in this project are being taken forward.

Modelling the Data

A key component of EBD was our move away from the traditional approach of putting the archive collection at the centre of the user experience. Instead, we wanted to reflect the richness of the content – the people, organisations, places, subjects, events that a collection represents.

We had many discussions and filled many pieces of paper with ideas about how this might work.

rough ideas for data connectivity
Coming up with ideas for how EBD should work

We then took these ideas and translated them into our basic model.

model of data for EBD
Relationships between entities in the EBD data

Archives are represented on our model as one aspect of the whole. They are a resource to be referenced, as are bibliographic resources and objects. They relate to the whole – to agents, time periods, places and events. This essentially puts them into a whole range of contexts, which can expand as the data grows.

Screenshot of EBD homepage
Homepage of Exploring British Design: People are foremost.

The Exploring British Design website was one way to reflect the inter-connected model that we created.

We have taken the principles of this approach with the new Archives Hub architecture and website, which was launched back in December 2016. Whilst the archive collection description stays very much in the forefront of the users’ experience, we have introduced additional tabs to represent themed collections and repositories. All three of these sources of information are, in a data and processing sense, treated equally. The user searches the Hub and the search runs across these three data sources. The model allows us to be flexible with how we present the data, so we could also try different interfaces in future, maybe foregrounding images, or events.

screenshot of Archives Hub search results
Search for ‘design industry’ gives results across Archive Collections, Themed Collections and Repositories

Names

The EBD project had a particular focus on people. We opted to combine machine methods of data extraction – data taken partly from our existing archive descriptions as well as from other external sources – with manual methods, to create rich records about designers. This manual approach is not sustainable for a large-scale service like the Archives Hub, but it shows what is possible in terms of creating more context and connectivity.

screenshot of a person page from the EBD website
EBD website showing a person page

We wanted to indicate that well-structured data allows a great deal more flexibility in presentation. In this case the ‘Archive and Museum Resources’ are one link in the list of resources about or related to the individual. We could have come up with other ways to present the information, given how it was structured.

We are intending to introduce names pages to the Archives Hub, which will then more clearly echo the EBD approach. They will largely be created through automated processes, as we need to create them at scale. They will generally be quite brief, without the ideal structure or depth, but the principle remains that we can then link from a person page to a host of related resources. The Hub website will have a new tab for ‘Names’ and end users will be able to run searches that take in collections, themes, repositories, people and organisations.

The EBD project allowed us to explore standards used for the creation of names data. It was our first experience of using Encoded Archival Context (Corporate Bodies, Persons and Families) (EAC-CPF), so we could start to see what we could do with it, as well as discover some of the shortcomings of the standard, as our data went beyond what is supported. For example, we wanted to link images to people and events but this was not covered by the standard. It was useful to have this preliminary exploration of it, and what it can – and can’t – do, as we look to adopt it for names within the Archives Hub.

Structured Data

One of the things the project did reinforce for me was the importance of indexing. On the Archives Hub we have always recommended indexing, but we have had mixed reactions from archivists, some feeling that it is less useful than detailed narrative, some saying that it is not needed ‘now we have Google’, some simply saying they don’t have time.

Indexing has many advantages, some of which I’ve touched on in various blog posts – and one at the top of the list is that it brings the advantages of structured data. A name in a narrative can, in theory, be pulled out and utilised as a point of connectivity, but a name as an index term tends to be a great deal easier to work with: it is identified as a name, it usually has structured surname, forename content, it usually includes life dates and may include titles and epithets to help unambiguously identify an individual.
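
To make the point about index terms a little more concrete, here is a minimal sketch in Python of how a structured name might be pulled apart. It assumes a comma-separated form of ‘surname, forename, life dates, epithet’, which is an illustration only; the `PersonName` class and `parse_index_term` function are hypothetical names, and this is not how the Archives Hub actually processes names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    """Hypothetical container for the parts of a name index term."""
    surname: str
    forename: Optional[str] = None
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    epithet: Optional[str] = None


# Life dates such as "1769-1852", "1769-" or "-1852" (hyphen or en dash).
DATES = re.compile(r"^(\d{4})?\s*[-\u2013]\s*(\d{4})?$")


def parse_index_term(term: str) -> PersonName:
    """Split an assumed 'surname, forename, dates, epithet' index term."""
    parts = [p.strip() for p in term.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty index term")
    name = PersonName(surname=parts[0])
    extras = []
    for part in parts[1:]:
        match = DATES.match(part)
        if match and (match.group(1) or match.group(2)):
            name.birth_year = int(match.group(1)) if match.group(1) else None
            name.death_year = int(match.group(2)) if match.group(2) else None
        elif name.forename is None:
            name.forename = part
        else:
            # Anything left over is treated as a title or epithet.
            extras.append(part)
    name.epithet = ", ".join(extras) or None
    return name


wellington = parse_index_term("Wellesley, Arthur, 1769-1852, 1st Duke of Wellington")
print(wellington)
```

A name buried in a narrative offers none of these handles without first working out where the name starts and ends, which is exactly why the index term is so much easier to work with.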

EBD was all about structured data, and we gave ourselves the luxury of adding to the data by hand, creating rich structured records about designers. This was partly to demonstrate what could be done in an interface, but we were well aware that it would be problematic to create records of that level of detail at scale. However, as we start to grapple with expanding name records in the Archives Hub, we have EBD as a reference point. It has helped us to think more about approaches and priorities when creating name records. If we were to create an EAC Editor (similar to our EAD Editor) we would think carefully about how to facilitate creating relationships. For example, should there be a controlled list of relationship types, e.g. ‘worked with, collaborated with, had professional connection with, influenced by, spouse of’? These are some of the relationships we used in EBD, after much discussion about how best to approach this. Or would it be more practical to stick to ‘associated with’ (i.e. not defined), which is easier, but far less useful to a researcher? Could we have both? How would one combine them in an interface? Another example is the potential to create timelines. If we wanted to provide end users with timelines, we would need to focus on time-bound events. There are many issues to consider here, not least of which is how comprehensive the timeline would be.
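
One possible answer to ‘could we have both?’ is sketched below in Python: a controlled list of relationship types, with ‘associated with’ as the fallback when nothing more specific is known. The relationship labels are the ones we used in EBD, but the class and function names are hypothetical, and this is only one way it might be modelled, not a description of an actual EAC Editor.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    """Controlled list of relationship types, plus a generic fallback."""
    WORKED_WITH = "worked with"
    COLLABORATED_WITH = "collaborated with"
    PROFESSIONAL_CONNECTION = "had professional connection with"
    INFLUENCED_BY = "influenced by"
    SPOUSE_OF = "spouse of"
    ASSOCIATED_WITH = "associated with"  # undefined relationship


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    type: RelationType = RelationType.ASSOCIATED_WITH


def parse_relation(label: str) -> RelationType:
    """Map a free-text label to the controlled list, else fall back."""
    try:
        return RelationType(label.strip().lower())
    except ValueError:
        return RelationType.ASSOCIATED_WITH


def describe(rel: Relationship) -> str:
    """Render a relationship as a short phrase for display."""
    return f"{rel.source} {rel.type.value} {rel.target}"


# 'Designer A' and 'Designer B' are placeholder names.
print(describe(Relationship("Designer A", "Designer B", parse_relation("Influenced by"))))
print(describe(Relationship("Designer A", "Designer B", parse_relation("knew"))))
```

An interface could then offer filtering on the specific types where they exist, while still showing the undefined ‘associated with’ links rather than losing them.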

The vexed question of how to combine data from name descriptions created by several institutions is not something we really dealt with in EBD, but that will be one of the biggest challenges for us in aiming to implement name data on the Archives Hub.

The level of granularity that you decide upon has massive implications for complexity, resources and benefits. The more granular the data, the more potential for researchers to be able to drill down into lives, events, locations, etc. So including life dates allows for a search for designers from 1946; including places of education allows for exploring possible connections through education, but adding dates of education allows for a more specific focus still.
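
As a rough illustration of what granularity buys you, the Python sketch below filters people by whether their life dates cover a given year. The records are invented placeholders and the `Designer` class and `alive_in` function are hypothetical; the point is simply that a record without life dates cannot take part in this kind of search at all.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Designer:
    """Hypothetical name record with optional life dates."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def alive_in(designer: Designer, year: int) -> bool:
    """True if the life dates cover the year; unknown birth never matches."""
    if designer.birth_year is None or year < designer.birth_year:
        return False
    # A missing death year is treated as 'still living or not recorded'.
    return designer.death_year is None or year <= designer.death_year


def designers_from(records: List[Designer], year: int) -> List[str]:
    """Names of the designers whose life dates include the given year."""
    return [d.name for d in records if alive_in(d, year)]


# Placeholder records, not real data.
records = [
    Designer("Designer A", 1900, 1980),
    Designer("Designer B", 1950, None),
    Designer("Designer C"),  # no life dates, so cannot be found by year
    Designer("Designer D", 1910, 1945),
]
print(designers_from(records, 1946))
```

Adding places and dates of education to each record would allow the same kind of filter to be applied to those too, at the cost of more cataloguing effort per name.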

Explaining our approach

One thing that struck me about this project was that it was harder than I had anticipated to convey to people what we were trying to achieve and what we could achieve. I tended to find that showing the website raised a number of expectations that I knew would be difficult to fulfil, and if I’m being honest, I sometimes felt rather frustrated at the lack of recognition of what we had achieved – it’s really not easy to combine, process and present different data sources! It is ironic that the more we press forwards with new functionality, and try to push the boundaries of what we do, the more it seems that people ask for developments that are beyond that! You can try to modify expectations by getting deep down and technical with the challenges involved in aggregating and enhancing data created over time, by different people, in different environments (we worked with CSV data, EAC-CPF data, RDF and geodata for example), with different perspectives and priorities. But detailed explanations of technical challenges are not going to work for most audiences. End users see and make an assessment of the website; they shouldn’t really need to be aware of what is going on behind the scenes.

Originally, in our project specification, we asked the question: “How can we encourage researchers, archive and museum professionals, and the public, to apprehend an integrated and extended rather than collection-specific sense of Britain’s design history?” Whilst we did not get as far towards answering this question as we had hoped, the work that we did made me feel that it might be harder than I had envisaged. People are very used to the traditional catalogues and other finding aids that are out there, and it creates a certain (possibly unconscious) mindset. I know this too well, because, as an archivist, I have had to adjust my own thinking to see data in a different way and appreciate that traditional approaches to cataloguing and discoverability are not always suited to the digital online age.

Data Model

The hierarchical approach to data is very embedded among archivists, and this is what people are used to being presented with. Unless archivists catalogue in a different way, providing more structured information about entities (names, places, etc.), actually presenting things in a more connected way is hard.

image of hierarchical folders
A folder structure is often used to represent archival hierarchy

A more inter-connected model, which eschews linear hierarchy in favour of fluid entity relationships and allows for a more flexible approach with the front-end interface to the data, relies upon the quality, structure and consistency of the data. If we don’t have place names at all we can’t provide a search by place. If we don’t have place names that are unambiguously identified (i.e. not just ‘Cambridge’) then we can provide a search by place, but a researcher will be presented with all places called Cambridge, anywhere in the world (including the US, Australia and Jamaica).
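
The ‘Cambridge’ problem can be shown with a very small Python sketch. The `Place` class and `find_places` function are hypothetical, and qualifying a place by country is just one assumed way of identifying it unambiguously; a real system might use identifiers from a gazetteer instead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Place:
    """Hypothetical place entity; country is the assumed qualifier."""
    name: str
    country: Optional[str] = None


def find_places(places: List[Place], name: str,
                country: Optional[str] = None) -> List[Place]:
    """Match on name, narrowing by country only when one is supplied."""
    matches = [p for p in places if p.name.lower() == name.lower()]
    if country is not None:
        matches = [p for p in matches if p.country == country]
    return matches


places = [
    Place("Cambridge", "United Kingdom"),
    Place("Cambridge", "United States"),
    Place("Cambridge", "Australia"),
    Place("Cambridge", "Jamaica"),
    Place("Brighton", "United Kingdom"),
]

# A bare name returns every Cambridge; the qualifier narrows it to one.
print(len(find_places(places, "Cambridge")))
print(find_places(places, "Cambridge", country="United Kingdom"))
```

If the catalogue data only ever says ‘Cambridge’, the second kind of search is simply not available, however good the interface is.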

A diagram showing archives and other entities connected
An example of connected entities

The new Archives Hub was designed on the basis of a model that allows for entities to be introduced and new connections made.

Archives Hub Entity Relationship diagram
Entities within the Archives Hub system

So, the tabs that the end user sees in the interface can be modified and extended over time. Searches can be run across all entities; it is not solely about retrieving descriptions of archives. This approach allows for researchers to find e.g. repositories that are significantly about ‘design’ or repositories that are located in London. It allows us to introduce Themed Collections as a separate type of description, so a student doing a project on ‘plastics’ would discover the Museum of Design in Plastics as a resource alongside archive collections at repositories including Brighton Design Archives, the V&A and the Paul Mellon Centre.

screenshot of Archives Hub search results
Search for ‘plastics and design’ shows archives and themed resources

Website Maintenance

One of the things I’ve learnt from this project is that you need to factor in the ongoing costs and effort of maintaining a project website. The EBD website is quite sophisticated, which means there are substantial technical dependencies, and we ended up running into issues with security, upgrades and compatibility of software, issues that are par for the course for a website but nonetheless need dealing with promptly. Maybe we should have factored this in more than we did, as we know the systems administration required for the Archives Hub is no small thing, but when you are in the throes of a project your focus is on the objectives and final output more than the ongoing issues. We cannot maintain a site long-term that is not being regularly used. EBD does not get the level of use that would justify the resources we would have to put into it on an ongoing basis.

Conclusion

When we were creating the model for the Archives Hub, we thought as much about flexibility and future potential as anything else. This is one thing that we have learnt from running the Hub for 25 years and from projects like Exploring British Design. You need to plan for potential developments in order to start to work with cataloguers, to get the data into the shape that you need it to be. We wanted to be able to introduce additional entities, so that we could have names, places, languages, images, or any other entities as ‘first class citizens‘ of the Hub. We wanted to be able to enhance the end user’s ability to take different paths, and locate relevant archives through different avenues of exploration.

We need to temper our ambitions for the Hub with the realities of cataloguing, aggregation and resources available, and we need as much information as we can get about what researchers really want; but this is why it is so important to encompass potential as well as current functionality. We may not be able to introduce everything we have envisioned or that users ask for right now; but it is important to understand the vital link between approaches to cataloguing, adherence to data standards, and front end functionality. We created visualisations for EBD and we would love to do this for the Hub, but it was not an easy thing to do, and so we would need to consider what the data allows, the software options available, whether the technical requirements are sustainable over time, and the effectiveness of the end result for the researcher.

Visualisation showing connections to Elizabeth Denby
Visualisation for Elizabeth Denby

When we demonstrated the visualisations in EBD, they had the wow factor that was arguably lacking in the main text-based site, but for serious researchers the wow factor is a great deal less important than the breadth and depth of the content, and that requires a model that is fundamentally rigorous, sustainable over time and realistic in terms of the data that you have to work with.

From Ivory Tower to People Power

Here is a presentation I gave at ELAG 2015 to introduce our innovation project, Exploring British Design. The presentation is entitled ‘From Ivory Tower to People Power‘ (YouTube link) and emphasises the collaborative nature of the project and the focus on people as a topic, rather than on archival description, which is not always the best starting place for researchers. The presentation covers:

  • Aims of the project
  • Workshops with postgraduate students about how they research and analysis of their research paths
  • Workshops with postgraduates about websites: what students do and don’t like in terms of discovery
  • Traditional archival cataloguing ‘lock in’ of entities such as people, places and events
  • Connectivity beyond single A to B connections; ‘anything can be a focus’ and can link to a myriad of other things
  • Use of EAC-CPF (XML standard for archival authority files)
  • Creating the data, handcrafting data, limitations of our approach, too many ideas not enough time!
  • Demonstration of the Website