Names (4): Ethics and identity

As archivists, we deal with ethical issues a good deal.  But the ability to link disparate and diverse data sources opens up new challenges in this area, and I wanted to explore this a bit.

If you do a general search for ethics and data, top of the list comes health. An interesting example of data join-up is the move to link health data to census data, which could potentially highlight where health needs are not being met:

“Health services are required to demonstrate that they are meeting the needs of ethnic minority populations. This is difficult, because routine data on health rarely include reliable data on ethnicity. But data on ethnicity are included in census returns, and if health and census data for the same individuals can be linked, the problem might be solved.” (Ethnicity and the ethics of data linkage)

However, individuals who stated their ethnicity in census returns were not told that this might subsequently be linked with their health data. Should explicit informed consent be given? Given the potential benefits, is this a reasonable ask? It is certainly getting into hazardous terrain to ignore the principle of informed consent. In their book ‘Rethinking Informed Consent in Bioethics’, Manson and O’Neill argue that informed consent cannot be fully specific or fully explicit. They argue for a distinctive approach where rights can be waived or set aside in controlled and specific ways.

This leads to a wider question: is fully explicit and specific informed consent actually achievable within the joined-up online world? This is a world where data travels across connections and is blended, re-mixed and re-purposed; a world where APIs allow data to be accessed and utilised for all sorts of purposes, and ‘open data’ has become a rallying cry. Is there a need to engage the public more fully in order to gain public confidence in what open data really means, and in order to debate what ‘informed consent’ is, and where it is really required?

I am working on a project to create name records, and I am looking at bringing data sources together. Of course, this is hardly new. Wikipedia is the most well-known hub for biographical data. Anyone can add anything to a Wikipedia page (within some limits, and with some policing and editing by Wikipedia, but in essence it is an open database). Wikidata, the structured data project that sits alongside Wikipedia, is about bringing sources together in an automated way. Projects within cultural heritage are also working on linked data approaches to create rich sources of information on people. SNAC has taken archival data from many different archive repositories and brought it together. A page for one person, such as Martin Luther King, provides a whole host of associations and links. These sources are not all individually checked and verified, because this kind of work has to be done algorithmically. However, there is a great deal of provenance information, so that all sources used are clear.

image of page from the face of white australia website
The Face of White Australia

There are some amazing projects working to reveal hidden histories. Tim Sherratt has done some brilliant work with Australian records, including projects such as Invisible Australians, which aims to reveal hidden lives using biographical information found in the records. He has helped to create some wonderful sites that reveal histories that have been marginalised. Tim talks about ‘hacking heritage’ and says: ‘By manipulating the contexts of cultural heritage collections we can start to see their limits and biases. By hacking heritage we can move beyond search interfaces and image galleries to develop an understanding of what’s missing.’ (Hacking heritage, blog post) He emphasises that access to indigenous cultural collections should be subject to community consultation and control. But what do community consultation and control really mean?

I have always been keen to work with the names in archival descriptions – archival creators and all the other people who are associated with a collection. They are listed in the catalogue (leastways the names that we can work with are listed – many names obviously aren’t included, but that’s another story), so they are already publicly declared. It is not a case of whether the name should be made public at all, or, at least, that decision has been made already by the cataloguer.   But our plan is to take the names and bring them to the fore – to give them their own existence within our service.  We are taking them out of the context of a single archive collection and putting them into a broader one. In so doing, we want to give the archive collections themselves more social context, we want to give more effective access to distributed historical records, and we also want to enable researchers to travel through connections to create their own narratives.

This may help to reveal things about our history and highlight the roles that people have played. It may bring to the fore people who have been marginalised. Of course, it does not address the problem of biases and subjective approaches to accessions and cataloguing. But a joined-up approach may help us to see those biases and gaps; to understand more about the silent spaces.

Creating persistent identifiers and linking data reveals knowledge. It is tempting to see that in simple terms as a good thing. But what about privacy and ethics? Even if someone is no longer living, there are still privacy issues, and many people represented in archives are alive.

Do individuals want to be persistently identified? What about if they change their identity? Do they want a pseudonym associated with their real name? They might have very good reasons for keeping their identity private. Persistent identification encourages openness and transparency, which can have real benefits, but it is not always benign.  It is like any information – it can be used for good and bad purposes, and who is to say what is good and what is not? Obviously we have GDPR and the Data Protection Act, and these have a good deal to say about obligations, the value of historical research and the right to be forgotten. This is something we’ll need to take into account. But linked data principles are not so much about working with personal data as working with data that may not seem personal, but that can help to reveal things when linked with other sources of data.

GDPR supports the principle of transparency and the importance of people’s awareness and control over what happens to their personal data. Even if we are not creating and storing personal data, it seems important to engage with data protection and what this means. The challenge of how to think about data when it is part of an ever shifting and growing  global data environment seems to me to be a huge one.

Certainly the horse has bolted to some degree with regard to joining up data. The Web lowered barriers considerably, and now we increasingly have structured data, so it is somewhat like one gigantic database. Finding things out about individuals is entirely feasible with or without something like a Names service created by the Archives Hub. We are not creating any new content, but creating this interface means we are consciously bringing data together, and obviously we want to be responsible, and respect people’s right to privacy. Clearly it is entirely impractical to try to get permission from all those living people who might be included. So, in the end, we are taking a degree of risk with privacy. Of course, we will un-publish on request, and engage with any feedback and concerns. But at present we are taking the view that the advantages and benefits outweigh the risks.

Image of exhibition photograph of black rights march

“Imagine being a sibling in a family that continually removes you from photos; tries its best to erase you…As you go through [the scrapbook] you see events where you know you were there, but you are still missing.”  Lae’l Hughes-Watkins (University of Maryland) gave an impassioned and inspiring talk at DCDC 2019 about her experiences.  She argued that archivists need to interrogate the reality that has been presented, and accept that our ideas of neutrality are misplaced. She wants a history that actively represents her – her history and culture, and experiences as a black woman in the USA. She related moving stories of people with amazing stories (and amazing archives) who distrust cultural institutions because they don’t feel included or represented.

This may seem a long way away from our small project to create name records, but in reality our project could be seen as one very small part of a move towards what Lae’l is talking about. Bringing descriptions together from across the UK maybe helps us to play a small role in this, aiming to move towards documenting the full breadth of human experience. The archives that we cover may retain the biases and gaps for some time to come (probably for ever, given that documentary evidence tends to represent the powerful and the elite much more strongly), but by aggregating and creating connections with other sources, we help to paint a bigger picture. By creating name records we help to contextualise people, making it much easier to bring other lives and events into the picture. It is a move towards recognising the limitations of what is actually in the archive, and reaching out to take advantage of what is on the Web. In doing this through explicitly identifying people we do leave ourselves more open to the dangers of not respecting privacy or anonymity. When we plug fully into the Web, we become a part of its infinite possibilities, which is always going to be a revealing, exciting, uncontrollable and risky business. By allowing others to use this data in different ways, we open it up to diverse perspectives and uses.

Names (3): One name record to bind them

It has been great to get comments and feedback around names, and I wanted to expand upon something that a few people have commented on: the ideal of one ‘authority record’ for one person or organisation.

Model showing potential relationships between catalogues and name records

The above diagram is a proposal for the relationships we might have. Note that it is a working model, and may well change over time. You can see the catalogues (the descriptions of archives) include people, some with biographical histories, and these people are either creators of archive collections or referenced in them. Each of these people then gets a name record (bottom left box), so we might have e.g. three name records for the same name (and the same name may potentially be the same person…or may not). We will work with the store of records that we have with the aim of creating matches, and ending up with a generic or main name record (green box, top left).

The ‘main record’ or ‘master record’ or whatever we might call it, for each individual person or organisation, is not an ‘archival record’. It is not intended simply to be a reflection of what is in our own data. It is intended to be a page dedicated to that person or organisation.  Our current feeling is that this should not be seen as domain specific; in fact, we want to get away from the idea that data is domain specific.  It is about an entity (a person or organisation), and what we know of that entity.

Keeping in mind the green box, and looking at the person page for Robin Day from Exploring British Design, a previous AHRC project we ran with Brighton Design Archive, you get a sense of the type of thing we mean.

Page for Robin Day, from the Exploring British Design website
Exploring British Design: Robin Day

This page presents as a general information page about a designer. It is not branded as a page about archives. It takes information in from different sources. Is it an ‘authority’ record? I’m really not sure; I wouldn’t call it that. The point is really that it enables researchers to put Robin Day into the context of other people, organisations, places and events, or at least it demonstrates how that can be done. It creates a network, and it intends to show the value of including archives in a network, rather than standing apart, in their ‘own world’.

Screenshot of an entity relationship diagram for Robin Day
Visualised relationships

The network can easily be visualised. There are tools out there to do this. The challenge is to create the data to feed into these visualisers. Again, this visualisation is not about archival name authority records; it is not domain specific.

In the Robin Day page, we have a section for related archives and museum resources.

screenshot showing archives related to Robin Day
Related archive and museum resources

This lists archives Robin Day is the ‘creator of’ or archives he is ‘associated with’. It links to the Archives Hub, but also to other sources. One of the options for end users is to go and find out more about the archival sources, but it is not prioritised above other options.

So, this is essentially the idea – a page for a person, a page for an organisation. An information resource that focuses on creating a network of connections. We think this is a good approach, but creating something along these lines that is automated, sustainable and effective within an ongoing national service is much harder.

Why not just use this one record, link to the archive catalogues, and dispense with the individual name records that we have created? There are three reasons to consider providing access to the individual name records: biographical history, uncertainty around matching, and ingesting name authority records.

I have already written about biographical and administrative history in a separate post.

In this phase of the Names Project the individual records for Beatrice Webb (as a name example) will be created either from the creator name or from the index terms that we have in the Archives Hub catalogues.

The main problem is the wide variation in name entries.

Webb, Beatrice
Webb, Beatrice, née Potter
Webb (Martha) Beatrice, 1858-1943
Webb, Martha Beatrice, 1858-1943
Webb;[Martha] Beatrice [nee Potter] 1858-1943

These are all entries in the Archives Hub.  We can match them all up, but can we say they are all the same?   Names without dates should not be matched with certainty, but quite often they will be the same person. (Beatrix Potter also often ends up being linked with Beatrice Webb, née Potter).
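To make the point about certainty a little more concrete, here is a minimal Python sketch of how entries like these might be compared. It is an illustration only and not the project's matching code: the `signature` and `compare` functions, and the three grades "same", "possible" and "different", are assumptions made for this example. The rule it applies is the one described above: agreeing life dates support a match, while a name without dates can only ever be a possible match.

```python
import re
import unicodedata

# The five Archives Hub entries quoted above.
ENTRIES = [
    "Webb, Beatrice",
    "Webb, Beatrice, née Potter",
    "Webb (Martha) Beatrice, 1858-1943",
    "Webb, Martha Beatrice, 1858-1943",
    "Webb;[Martha] Beatrice [nee Potter] 1858-1943",
]

DATES = re.compile(r"\b(\d{4})-(\d{4})\b")


def fold(text):
    """Lower-case and strip accents, so 'née' and 'nee' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def signature(entry):
    """Reduce an entry to (surname, set of forenames, life dates or None)."""
    dates = DATES.search(entry)
    text = DATES.sub(" ", fold(entry))
    # Drop the maiden name, so 'née Potter' cannot be mistaken for a surname.
    text = re.sub(r"\bnee\s+\w+", " ", text)
    words = re.findall(r"[a-z]+", text)
    return words[0], frozenset(words[1:]), dates.groups() if dates else None


def compare(a, b):
    """Grade a pair of entries (hypothetical grading, for illustration)."""
    surname_a, forenames_a, dates_a = signature(a)
    surname_b, forenames_b, dates_b = signature(b)
    compatible = forenames_a <= forenames_b or forenames_b <= forenames_a
    if surname_a != surname_b or not compatible:
        return "different"
    if dates_a and dates_b:
        return "same" if dates_a == dates_b else "different"
    return "possible"  # at least one entry has no dates, so never certain


for entry in ENTRIES:
    print(entry, "->", compare(entry, "Webb, Martha Beatrice, 1858-1943"))
```

Because the maiden name is set aside before comparison, an entry such as "Potter, Beatrix, 1866-1943" is graded as different from "Webb, Beatrice, née Potter" in this sketch. Real data would need far more care than this, for example with initials, titles and transposed forenames.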

The decision we need to make is whether to provide links to these individual name records that we will have, or only use them as a source of data.  It seems valuable to enable end users to see these names as a group, but it is another thing to risk integrating information from them all into one name record.  There is no perfect answer to this, but it does seem important to clearly indicate the level of uncertainty.  So many names that we have don’t have life dates, or have variations in structure.  What we are looking to achieve is a clear provenance, giving end users the best understanding of what they are seeing.

What about name records that have been created by our contributors?  The name records we create ourselves from catalogue descriptions will generally be no more than the name, dates, and biographical history.  But, going forwards, we will want to work with much more detailed name records.

For Exploring British Design we created rich name records with an entity-relationship structure (essentially using the EAC-CPF structure and working in RDF),  to demonstrate the power of connecting entities.  For this purpose, we partially hand-crafted the name records, as well as carrying out some very complex processing to create various connections.

screenshot of part of the timeline for Robin Day
Part of the timeline for Robin Day

The example above shows events from the Robin Day timeline, with linked connections to related organisations.  If we ingest EAC-CPF records we might get timelines like this.

Name records may also include relationships. The Borthwick Institute has good examples of name records with plenty of rich relationship information. e.g. Charles Lindley Wood, Viscount Halifax.

screenshot of part of the Viscount Wood record showing relationships to other people
An excerpt from a Borthwick entry for Charles Lindley Wood

If we took this record into the Archives Hub it might seem to make sense for it to become the main person record for Wood.  But that would involve a process of making choices, preferencing one name record over another.  Possible, but tricky to do in an automated way. Another record office might also have a splendid example of a name entry for this person, with some different data. Furthermore, this record has links to the Borthwick catalogue. We would potentially have to remove these links.

It would be very challenging to create one record from several source EAC-CPF records for the same person –  to blend timelines, or sort out relationships listed in different records, bearing in mind that it needs to be done in an automated way, keeping version control and dealing with revisions and new data coming in that might add to the name record.  How could we compare and blend two lists of relationships? Or two chronologies? We’d probably end up having to keep them all, and then potentially have similar but different relationships and chronologies, giving a slightly confused user experience.
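As a very rough sketch of why blending is awkward, the Python below merges two chronologies in the most cautious way I can think of: events are only treated as the same when the year and the wording agree after trivial normalisation, and every event keeps a note of where it came from. The person, events and repository names are all invented for the example, and `blend_timelines` is a hypothetical helper, not something we have built.

```python
from collections import defaultdict


def blend_timelines(timelines):
    """Merge {source: [(year, text), ...]} into one list, recording sources.

    Two events count as the same only if the year matches and the text is
    identical after lower-casing and tidying whitespace; everything else is
    kept as a separate event.
    """
    merged = defaultdict(list)
    for source, events in timelines.items():
        for year, text in events:
            key = (year, " ".join(text.lower().split()).rstrip("."))
            merged[key].append((source, text))

    blended = []
    for (year, _), variants in sorted(merged.items()):
        blended.append({
            "year": year,
            "text": variants[0][1],  # wording from the first source seen
            "sources": [source for source, _ in variants],
        })
    return blended


# Invented data for a fictional person, as held by two fictional repositories.
timelines = {
    "repository-a": [(1901, "Born in Leeds."), (1925, "Joined the firm")],
    "repository-b": [(1901, "born in  Leeds"), (1925, "Became a partner in the firm")],
}

for event in blend_timelines(timelines):
    print(event["year"], event["text"], event["sources"])
```

The two 1901 entries collapse into one, but the two 1925 entries survive side by side, because nothing in the data says whether they describe the same event. That is exactly the 'similar but different' outcome described above.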

If we do ingest records like the one above, we will have to figure out how these  more detailed records will relate to what we have already created.  If, as planned, we have one generic name record for a person, it makes the job easier, as we won’t be looking to make any one EAC-CPF record into the main name record, we will simply link to it from the main record. Bear in mind, our main record is intended to be a domain-neutral entry – linking to other sources beyond archives.  EAC-CPF records might do this to some extent, but they are unlikely to link to the Jisc Library Hub, and probably won’t link to Wikidata, or other external sources.   They are far more likely to provide internal links to the archive catalogue they relate to.

Arguably, it might be easier to forget about creating name records ourselves (from the catalogue entries) and just work with name records that have been created by our contributors (which are likely to be well-structured and include life dates). But if we do that, the pot of names will grow slowly, as only a small proportion of repositories create name records. We can’t realistically give the end user a few thousand name records covering maybe 1-2% of our names – they might search for ‘Winston Churchill’ as a name, and find that we don’t have him!  It would not remove the problem of name matching, and it would make the whole idea of reaching out beyond the archive domain, by linking into other resources using our names as the hook, rather ineffectual.

Therefore, we propose to keep the separate name records in our system. We also propose to create a ‘generic record’, which is what would be prominent in the Archives Hub display. We would then have the potential to link the records together, to blend them, to try some text mining and analysis techniques. It gives us options. It would not be sensible to make those decisions now. It is better to lay the groundwork that enables us to be flexible. This approach allows us to link to an individual name record where we don’t feel able to confirm a ‘same as’ relationship. It presents the option to the end user: here is a name, we think this is the same person, so we’ve provided a link.
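One way to picture this arrangement is as a generic record that holds links to individual name records, with each link stating how sure we are. The Python below is only a sketch of that idea. The class names, the identifiers, the catalogue labels and the two relation values ("same_as" and "possibly_same_as") are all assumptions for illustration, not our actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class NameRecord:
    record_id: str  # hypothetical identifier
    entry: str      # the name exactly as it appears in the catalogue
    source: str     # where the entry came from (provenance)


@dataclass
class GenericRecord:
    label: str
    links: list = field(default_factory=list)

    def link(self, record, confirmed, reason):
        """Attach an individual name record, stating how sure we are."""
        relation = "same_as" if confirmed else "possibly_same_as"
        self.links.append({"record": record, "relation": relation, "reason": reason})

    def by_relation(self, relation):
        """List the catalogue entries linked with the given relation."""
        return [l["record"].entry for l in self.links if l["relation"] == relation]


webb = GenericRecord("Beatrice Webb")
webb.link(NameRecord("n1", "Webb, Martha Beatrice, 1858-1943", "catalogue A"),
          confirmed=True, reason="name and life dates agree")
webb.link(NameRecord("n2", "Webb, Beatrice", "catalogue B"),
          confirmed=False, reason="no life dates in the entry")

print(webb.by_relation("same_as"))
print(webb.by_relation("possibly_same_as"))
```

Nothing is merged or thrown away here. The individual records stay as they are, and the display can decide later how to present confirmed and tentative links.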

The end user experience needs to make sense and not mislead or provide false information. Links to brief name records could seem confusing, but, as I have said, trying to bring together in one record all the information from several name records, with their biographies, relationships, aliases, events, related resources, is likely to be a nightmare. In the end, it will take a good deal more testing and working with researchers to work out what is best.

Archives Hub Names Project (2): Biographical History

It is a somewhat vexed question how to treat biographical and administrative history (in this post I’ll focus on biographical history).  This is an ISAD(G) field and an EAD field. ISAD defines it as providing “an administrative history of, or biographical details on, the creator (or creators) of the unit of description to place the material in context and make it better understood”.  It advises for personal names to include “full names and titles, dates of birth and death, place of birth, successive places of domicile, activities, occupation or offices, original and any other names, significant accomplishments, and place of death”.

On the Archives Hub we have a whole range of biographical histories – from very short to very comprehensive.  I have had conversations with archivists who believe that ‘putting the collection in context’ means giving information that is particularly relevant for that archive rather than giving a general history. Conversely, many biographical history entries do give a very full biography, even if the collection only relates to one aspect of a person’s life and work. They may also include information that is not readily available elsewhere, as it may have been discovered as part of the cataloguing process.

The question is, if we create a generic name record for a person, how do we treat this biographical information? There are a number of alternatives.

(1) Add all biographical history entries to the record

If you look at a SNAC example:  https://snaccooperative.org/view/54801840 you can see that this is the approach. It has merits – all of the biographical information is brought together. But it can mean a great deal of repetition, and the ordering of the entries can seem rather illogical, with short entries first and then longer comprehensive entries at the end.

Whilst most biographical history entries are pretty good, it also means a few not very helpful entries may be included, and may be top of the order. In addition, putting all the entries in together doesn’t always seem to make much sense. In the example below there are just three short entries for a major figure in women’s liberation. They are automatically brought in from the catalogue entry for individual collections. Sometimes the biographical entries in individual catalogues suffer from system migration and various data processing issues that mean you end up with field contents that are not ideal.

Millicent Garrett Fawcett biographical histories in SNAC

The question is whether this approach provides a useful and effective end user experience.

Where there is one entry for a creator, with one biographical history, there is no issue other than whether the entry makes sense as an overall biographical entry for that person or organisation. But we have to consider the common situation where there will be a dozen or more entries. Even if we start with one entry, others may be added over time.  Generally, there will be repetition and information gaps, but in many cases this approach will provide a good deal of relevant information.
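If we did bring all entries together, some of the repetition and ordering problems could be softened automatically. The Python sketch below is one possible approach and not something we have decided on: it lists the fullest entries first and folds near-identical texts into one, while keeping a note of every collection the text came from. The function name, the 0.9 similarity threshold and the placeholder biographies are all assumptions for the example; the comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def arrange_biographies(entries, threshold=0.9):
    """Order (source, text) pairs longest first and fold near-duplicates.

    A text is folded into an already kept entry when the two are at least
    `threshold` similar. The threshold is a guess, not a tested value.
    """
    kept = []
    for source, text in sorted(entries, key=lambda e: len(e[1]), reverse=True):
        for item in kept:
            ratio = SequenceMatcher(None, item["text"].lower(), text.lower()).ratio()
            if ratio >= threshold:
                item["sources"].append(source)  # same text, another collection
                break
        else:
            kept.append({"text": text, "sources": [source]})
    return kept


# Placeholder texts only; these are not real biographical histories.
entries = [
    ("collection-1", "Placeholder biography: a campaigner and writer active in several societies."),
    ("collection-2", "Placeholder biography: a campaigner and writer active in several societies."),
    ("collection-3", "Campaigner."),
]

for item in arrange_biographies(entries):
    print(item["sources"], item["text"])
```

This only changes the order and removes obvious repetition. It does not choose a single 'best' entry, and a short entry may still be the only one that contains a particular piece of information.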

(2) Keep the biographical history entries with the individual name records

At the moment our plan is to create individual name records for each person, as well as a generic master record.  We haven’t yet worked out the way this might be presented to the end user.  But we could keep the biographical histories with the individual entries we have for names. The generic record would link to these entries, and to the information they contain.  This makes sense, as it keeps the biographical histories separate, and within the entries they were written to accompany. Repetition is not an issue as it is clear why that might happen.  But the end user has to go to each entry in turn to read this information.

(3) Keep biographical history entries with individual name records, but enable the information to be viewed in the generic master record

We have been thinking about giving the end user the option to ‘click to see all biographical histories created for this person’. That would help with expectations. Simply presenting a page with a dozen similar biographical histories is likely to confuse people, but  enabling them to make a decision to view entries gives us more opportunity for explanation – the link could include a brief explanatory note.

(4) Select one biographical history to be in the generic record

We have discussed this idea, but it is really a non-starter. How do you select one entry? What would the criteria be if it is automated? The longest?

(5) Link to a generic biography if available

This is the idea of drawing in the Wikipedia entry for that person or organisation, or potentially using another source. There is a certain risk to pulling in data from an external source as the ‘definitive’ biographical information, but the source would always be cited, and it does start to move towards the principle of bringing different sources of information together. If we want to create a more generic resource, we are going to have to take risks with using external sources.

I would be interested in any comments on this.

Names Project (1): Creation of name records

The Archives Hub Names Project

The Archives Hub team and Knowledge Integration, our system suppliers, are embarking upon a short four-month project to start to lay the groundwork, define the challenges and test the approaches to presenting end users with a name-based means to search, and connect to a broad range of resources related to people and organisations. I will be blogging about the project as we go along.

Our key aims in the long-term are:

  • To provide the end user with a way to search for people and organisations and find a range of material relevant to their research
  • To enable connections to be made between resources within and external to Jisc, using names as the main focus
  • To bring archive collections together in an intellectual sense and provide different contexts to collections by creating networks across our data

This first project will not create an end-user interface, but will concentrate on processing,  matching names and linking resources. We want to explore how this can be administered in order to be sustainable over time.  In the end, the most challenging part of working with the names we have is identification, disambiguation and matching.  The aim is to explore the space and start to formulate a longer-term plan for the full implementation of names as entities within the Archives Hub.

Creation of name records from EAD description records

NB: This blog often refers to personal names for convenience, but names include personal, family and corporate entities.

EAD descriptions include personal, family and corporate names. These ‘entities’ may be listed as archival creators and also associated with the collection as index terms. Archival creators may optionally be given biographical or administrative histories. The relationship of the collection with names in the index is not made explicit in the description (in a structural way), though it may often be gleaned from the descriptive information within the EAD record.

Creating name records for all names

We are proposing to begin by creating name records for all of these entries, no matter how thin the information for each entry may be.

Here is a random selection of names that are included in Archives Hub records:

Grote, Arthur
Gaskell, Arthur
Wilson, John
Thatcher, J. Wells, Barrister at Law
Barron, Margaret
Stanley, Catherine, 1792-1862
Roe, Alfred Charles
Rowlatt, Mary, b 1908
Milligan, Spike, 1918-2002
Fawcett, Margaret, d. 1987
Rolfe, Alan, 1908-2002 actor
Mayers, Frederick J (fl 1896-1937 : designer : Kidderminster, England)
Joan

Only a percentage of names have life dates. Some have birth or death dates, some floruit dates.

Of course, the life dates, occupations and outputs of many people are not known, or may be very difficult to find.  Also, life dates will change when a birth date is joined by a death date. Epithets may also change over time (and they are not controlled vocabulary anyway).

In addition, we have inverted and non-inverted names on the Archives Hub, names with punctuation in different places, names with and without brackets, etc. These issues create identification challenges.
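To give a flavour of what processing these strings involves, here is a small Python sketch that pulls the surname, forenames and dates out of entries like those listed above. It is an illustration under simplifying assumptions, not our actual processing: the function names and the output structure are invented, and it only recognises the date patterns seen in this sample (life dates, ‘b’, ‘d.’, ‘fl’ and an open-ended birth date).

```python
import re


def parse_dates(entry):
    """Return the kind of date found in an entry, or None if there is none."""
    m = re.search(r"\bfl\.?\s*(\d{4})\s*-\s*(\d{4})\b", entry)
    if m:
        return {"type": "floruit", "from": int(m[1]), "to": int(m[2])}
    m = re.search(r"\b(\d{4})\s*-\s*(\d{4})\b", entry)
    if m:
        return {"type": "life", "born": int(m[1]), "died": int(m[2])}
    # 'b 1908' or an open-ended date such as '1932-'
    m = (re.search(r"\bb\.?\s*(\d{4})\b", entry)
         or re.search(r"\b(\d{4})\s*-(?!\s*\d)", entry))
    if m:
        return {"type": "born", "born": int(m[1])}
    m = re.search(r"\bd\.?\s*(\d{4})\b", entry)
    if m:
        return {"type": "died", "died": int(m[1])}
    return None


def parse_entry(entry):
    """Split an inverted entry into surname and forenames; keep the raw string."""
    record = {"raw": entry, "surname": None, "forenames": None,
              "dates": parse_dates(entry)}
    if "," in entry:  # only inverted entries can be split with any confidence
        surname, rest = entry.split(",", 1)
        record["surname"] = surname.strip()
        record["forenames"] = re.split(r"[,(]", rest, maxsplit=1)[0].strip()
    return record


for name in ["Grote, Arthur", "Rowlatt, Mary, b 1908", "Fawcett, Margaret, d. 1987",
             "Mayers, Frederick J (fl 1896-1937 : designer : Kidderminster, England)",
             "Joan"]:
    print(parse_entry(name))
```

Note that ‘Joan’ is left unsplit. There is no way to tell from the string alone whether it is a forename, a surname or something else, which is the identification problem in miniature.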

Even taking names as creators and names as index terms within one single description, the match is often not exact:

Millicent Garrett Fawcett (creator name)
Fawcett, Dame Millicent. (1847-1929) nee Garrett, Feminist and Suffragist (index term)

Lingard, Joan (creator name)
Lingard, Joan Amelia, 1932- (index term)
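A simple way to see that these pairs are related, despite not matching exactly, is to ignore word order and punctuation and ask how many of the words in the creator name also appear in the index term. The Python sketch below does just that. It is a deliberately naive illustration (the `overlap` function is hypothetical), and a high score would only ever suggest a candidate for matching, never a confirmed ‘same as’.

```python
import re


def tokens(name):
    """The lower-cased words in a name, ignoring order, digits and punctuation."""
    return set(re.findall(r"[a-z]+", name.lower()))


def overlap(creator, index_term):
    """Share of the creator name's words that also appear in the index term."""
    creator_words = tokens(creator)
    if not creator_words:
        return 0.0
    return len(creator_words & tokens(index_term)) / len(creator_words)


pairs = [
    ("Millicent Garrett Fawcett",
     "Fawcett, Dame Millicent. (1847-1929) nee Garrett, Feminist and Suffragist"),
    ("Lingard, Joan", "Lingard, Joan Amelia, 1932-"),
    ("Wilson, John", "Lingard, Joan Amelia, 1932-"),
]

for creator, index_term in pairs:
    print(creator, "|", index_term, "|", overlap(creator, index_term))
```

Both pairs quoted above score 1.0, because every word of the creator name is present in the index term, while the unrelated third pair scores 0.0. Common names such as ‘Wilson, John’ show the limits of this: two different people can share every word.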

The archival descriptions on the Archives Hub vary a great deal in terms of the structure, and different repositories have different approaches to cataloguing.  Some do not add name of creator, some do not add index terms, some add them intermittently, and often the same name is added differently for different collections within the same repository.  In many cases the cataloguer does not add life dates, even when they are known, or they are added to the name as creator but not in the index list, or vice versa. This sounds like a criticism, but the reality is that there are many reasons why catalogues have ended up as they are.

There has not been a strong tradition amongst archivists of adding names as unique identifiable entities, but of course, it has only been in the last few decades that we have had the potential, which is becoming increasingly sophisticated, of linking data through entity relationships, and creating so much more than stand-alone catalogue records. Many archivists still think primarily in terms of human readable descriptions. Some people feel that with the advent of Google and sophisticated text analysis, there is no need to add names in this structured way, and there is no need for index terms at all. But in reality search engines generally recommend structured data, and they are using it in sophisticated ways. Schema.org, an initiative started by Google, Microsoft, Yahoo and Yandex, provides a shared vocabulary for structured data on the web. Explicit markup helps search engines understand content and it potentially helps with search engine optimisation (ensuring your content surfaces on search engines). Also, if we want to move down the Linked Data road, even if we are not thinking in terms of creating strict RDF Linked Data, we need to identify entities and provide unique identifiers for them (URLs on the web). Going back to Tim Berners-Lee’s seminal Linked Data article from 2006:

“The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data.  With linked data, when you have some of it, you can find other, related, data.”

So, including names explicitly provides huge potential (as well as subjects, places and other entities) and it has become more important, not less important. Indeed, I would go so far as to say that structured data is more important than standards compliant data, especially as, in my experience, standards are often not strictly adhered to, and also, they need constant updating in order to be relevant and useful.
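As an illustration of what explicit markup for a name could look like, the Python below builds a small Schema.org description of a person as JSON-LD, using one of the names that appears in Archives Hub records. This is a sketch only. `Person`, `name`, `birthDate`, `deathDate` and `sameAs` are genuine Schema.org terms, but the `person_jsonld` function and the example.org URLs are placeholders invented for this example, not identifiers we have minted.

```python
import json


def person_jsonld(uri, name, birth=None, death=None, same_as=()):
    """Build a minimal Schema.org Person description as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Person",
        "@id": uri,  # the unique identifier for this entity
        "name": name,
    }
    if birth:
        doc["birthDate"] = str(birth)
    if death:
        doc["deathDate"] = str(death)
    if same_as:
        doc["sameAs"] = list(same_as)  # links out to other descriptions
    return json.dumps(doc, indent=2)


print(person_jsonld(
    "https://example.org/names/milligan-spike",  # placeholder URL
    "Spike Milligan",
    birth=1918,
    death=2002,
    same_as=["https://example.org/other-source/spike-milligan"],  # placeholder
))
```

The identifier and the `sameAs` links are what turn a string in a catalogue into something a machine can follow to other, related data.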

The idea with our project is that we start with name records for every entity – a pot of data we can work with. We may create Encoded Archival Context (Corporate Bodies, Persons and Families), otherwise known as EAC-CPF…but that is not important at this stage.  EAC is important for data ingest and output, and we intend to use it for that purpose, so it will come into the picture at some point.

The power of the anonymous

There are benefits in creating name records for people who are essentially anonymous or not easily identifiable.  Firstly, these records have unknown potential; they may become key to making a particular connection at some point, bearing in mind that the Archives Hub continually takes new records in. Secondly, we can use these records to help with identification, and the matching work that we undertake may help to put more flesh on the bones of a basic name record.  If we have ‘Grote, Arthur’ and then we come across ‘Grote, Arthur, 1840-1912’, we can potentially use this information and create a match. Of course, the whole business of inference is a tricky thing – you need more than a matching surname and forename to create a ‘same as’ relationship (I won’t get into that now). But the point is that a seemingly ‘orphan’ name may turn out to have utility. It may, indeed, provide the key to unlocking our understanding of particular events – the relationships and connections between people and other entities are what enable us to understand more about our history.

Components of a name record

So, all names will have name records, some with just a name, some with life dates of different sorts, some with biographical or administrative histories. The exception to this may be names that are not identifiable as people or organisations.  It is potentially possible to discover the type of entity from the context, but that is a whole separate piece of work.  Hundreds of names on the Archives Hub are simply labelled as ‘creator’ or ‘name’. This is down to historical circumstance: partly errors the Archives Hub made in the past (our old cataloguing tool entered creators simply as EAD ‘origination’), and partly the other systems we ingest data from.  At the moment, for example, we are taking in descriptions from Axiell’s AdLib system, but the system does not mark up creator names as people or organisations (unless the cataloguer explicitly adds this), so we cannot get that information. This is probably a reflection of a time when semantically structured data was simply less important. If a human reads ‘Elizabeth Gaskell’ in a catalogue entry they are likely to understand what that string means; to large-scale automated processing, it is just a string of characters, unless it includes semantic information.

From the name records that we create, we intend to develop and run algorithms to match names. In many cases, we should be able to draw several names together, with a ‘same-as’ relationship. Some may be more doubtful, others more certain. I will talk about that as we get into the work.

At the moment, we have some ideas about how we will work with these individual records in terms of the workflow and the end user experience, but we have not made any final decisions, and we think that what is most important at this stage is the creation and experimentation with algorithms to see what we can get.

Master name records

We intend to create master records for people and organisations. The principle is to see these master records not as something within the archives domain, but as stand-alone records about a person or organisation that enable a range of resources to be drawn together.

So, we might have several name records for one person:

Example of master record, with various related information included:
Webb, Martha Beatrice, 1858-1943, social reformer and historian

Examples of additional name records that should link to the master record:
Webb, Beatrice, 1858-1943 (good match)
Webb, Martha Beatrice, 1858-1943, economist and reformer (good match)
Webb, Martha Beatrice, nee Potter, 1858-1943 (good match)
Webb, M.B. b. 1858 (possible match)
but…
Potter, Martha Beatrice, b 1858
…might well not be a match, in which case it would stand separately, and the archive connected to it would not benefit from the links being made.
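To make the grading above a little more concrete, here is a minimal Python sketch. The `NameRecord` structure, the `grade` function and the rules it applies are all hypothetical, chosen only to reproduce the Webb examples; they are not the algorithms we will end up using. The sketch assumes the names have already been split into surname, forenames and years, and it makes no attempt to handle maiden names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NameRecord:
    surname: str
    forenames: str  # full forenames or initials, e.g. "M.B."
    birth: Optional[int] = None
    death: Optional[int] = None


def name_tokens(forenames: str) -> list:
    """'Martha Beatrice' -> ['martha', 'beatrice']; 'M.B.' -> ['m', 'b']."""
    return forenames.lower().replace(".", " ").split()


def forename_agreement(a: str, b: str) -> str:
    """Return 'full', 'initials' or 'none' (a hypothetical grading)."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return "none"
    if set(ta) <= set(tb) or set(tb) <= set(ta):
        return "full"  # e.g. 'Beatrice' within 'Martha Beatrice'
    if [t[0] for t in ta] == [t[0] for t in tb]:
        return "initials"  # e.g. 'M.B.' against 'Martha Beatrice'
    return "none"


def years_conflict(a: Optional[int], b: Optional[int]) -> bool:
    return a is not None and b is not None and a != b


def grade(master: NameRecord, other: NameRecord) -> str:
    """Grade 'other' against a master record."""
    if master.surname.lower() != other.surname.lower():
        return "no match"
    agreement = forename_agreement(master.forenames, other.forenames)
    if agreement == "none":
        return "no match"
    if years_conflict(master.birth, other.birth) or years_conflict(
        master.death, other.death
    ):
        return "no match"
    has_both_dates = other.birth is not None and other.death is not None
    if agreement == "full" and has_both_dates:
        return "good match"
    return "possible match"
```

Under these assumed rules, ‘Webb, Beatrice, 1858-1943’ grades as a good match, ‘Webb, M.B. b. 1858’ as a possible match, and ‘Potter, Martha Beatrice, b 1858’ as no match, because the sketch compares surnames only and knows nothing about ‘nee Potter’.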

We have discussed the pros and cons of creating master records for all names.  It makes sense to bring together all of the Beatrice Webb names into one master record – there is plenty that can be said about that individual; but does it make sense to have a master record for single orphaned names with no life dates and nothing (as yet) more to say about that individual?  That is a question we have yet to answer.

diagram showing link between archive, name records and master records
The archive is described through an EAD description held on our system (the CIIM). We take all the names from this to create a huge store of individual names. From this, we aim to create and update ‘definitive’ name records.

The principle is to have name records that enable us to create links to the Archives Hub entries and also to other Jisc services and resources beyond that – resources outside of the archives domain.  Many of these resources may also help us with our own identification and matching processes. It is important to benefit from the work that has already been done in this area.

We are looking at various name resources and assessing where our priorities will be.  This is a fairly short project, and we won’t have time to look at more than a handful of options. But we are currently thinking in terms of VIAF, ORCID and Wikidata. More on that to follow.

Personally, I’ve been thinking about working with names for several years. We have been asked about it quite a bit. But the challenge is so big and nebulous in many ways. It has not been feasible to embark upon this kind of work in the past, as our system has not supported the kind of systematic processing that is required. We are now also able to benefit from the expertise K-Int can bring to data processing. It is one thing doing this as a stand-alone project; it is quite another to think about a live service, long-term sustainability, version control and revisions, ingest from different systems, etc., and to break it all down into logical phases of work.  It is exciting, but it is going to involve a great deal of hard work and hard thinking.

Online Resources: Explore archives in different ways

Archives Hub feature for May 2020

The Archives Hub includes descriptions called Online Resources.  These sit alongside Archive Collection descriptions and Repository descriptions.

screenshot hit list showing online resources on health
Online Resources on health

Online Resources are collections of resources, typically digitised content. They are often created as part of a project, and usually based on a specific theme. But the definition is purposely very loose. They are essentially any web sites that offer any kind of introduction, interpretation, or way into archives, other than the more traditional archival descriptions for individual collections.

All Online Resources point to a website, but that doesn’t mean that they only represent digital materials. The website may provide narrative and context for physical collections.  A good example of this is War Child. The site is a story about the Evacuee Archive – how it came into being, the man who created it, what he has experienced.

screenshot of War Child site homepage
War Child

It aims to explore and document the life of this archive. The archive is largely paper-based, and includes some recordings and artefacts.  War Child provides a wonderful, creative experience, thinking about how people engage with archives and how individuals are shaped by archives.

Many Online Resources do represent digital collections, and frequently they showcase collaborations. Windows on Genius is a project by the University of Cambridge and University of Sussex that spans two digital collections, giving access to the works of Sir Isaac Newton.  Other Online Resources are materials within one institution, brought together by topic, such as Selected Sources on Healthcare, at the University of Warwick. This is a selection of primary sources relating to British healthcare before the foundation of the National Health Service.

screenshot of map showing endangered languages
Map showing endangered languages

Some Resources are ‘artificial collections’ that have been brought together to aid researchers, such as Endangered Languages – a digital repository created by SOAS, specialising in preserving and publishing endangered language documentation materials from around the world.

Quite often Online Resources provide help with interpretation and using sources for teaching, such as the Pre-Raphaelite resource, which provides teaching materials and allows for personal collections to be created.

screenshot showing link to teaching resources on the Pre-Raphaelite website
Pre-Raphaelite Illustrations learning resource

Something like this is a wonderful introduction to a subject for a new researcher.

Some of the resources are simply digital collections. Potentially they could also be described simply as Archive Collections.

screenshot of BT digital archive website
The BT Digital Archive

For example the BT Digital Archives Online Resource is an archive collection, and indeed, we do have this collection listed as The BT Digital Archives collection. However, the Online Resource takes the user to the full catalogue, and it provides further context and showcases highlights from the collection.

Our rationale for having Online Resources is more about servicing the end user than the strict definition of what an archive collection is and whether it can be described as an online resource. We want to make sure people find the materials, and we also want to promote any added value that they can get through narrative, context and interpretation that the holding institution provides.

We aim to increase the number of Online Resource descriptions – we create them ourselves when we find good resources, and our current contributors can also create them quickly and easily. If an Online Resource is offered by a non-contributor, we can create it for them, or provide a specific type of access to our cataloguing tool, to allow them to create the entry.  Each entry provides another discovery channel: for the short amount of time it takes to write, a resource may be found by a researcher who would otherwise never have known about it.

These digital collections and physical archives and websites for learning, teaching, and research include a wealth of materials from many institutions across the UK. From fashion to photography, dance to Darwin, soldiers to Shakespeare, these websites represent a whole range of archival resources, often with strong visual themes that can be used for research, learning and teaching.  Explore the Online Resources, and do get in touch if you have any suggestions for additions to our catalogue!

Results of the 2018 student digital experience insights survey

Jisc aims to understand more about the student experience and student needs as part of its mission within UK higher and further education. The recent digital experience survey offers some useful findings about how students feel when it comes to digital skills and the digital experience.

37,720 students across 83 higher and further education institutions (HE and FE) are included in the data, equivalent to approximately 16% of colleges and 30% of universities in the UK.

Key findings are:

  • Students – regardless of setting – are positive about the quality of their institution’s digital provision, as well as digital teaching and learning on their course.
  • Over a third of all students want digital technologies to be used more on their course, although this does imply that the majority do not share this view.
  • Only 50% of FE and 69% of HE students think digital skills are important for their chosen career, and few agreed that their course prepares them for the digital workplace. This implies that there are many students who do not think digital skills are essential.
  • Many students bring their own devices to their institution but can’t use these to access subject-specialist software or online learning content. This indicates a lack of flexibility and interoperability.
  • One in five students use assistive or adaptive technologies, with 8% of HE and 6% of FE students considering these vital to their learning needs.
  • About eight in ten students used a smartphone to support their learning, which is no surprise, and shows the importance of ensuring that sites are mobile-friendly.
  • Around 10% of FE students rated Google search as their number one app or tool, compared with just over 1% of HE students. HE students on the other hand were twice as likely to cite Google Scholar as they were to cite Google on its own as a search tool. HE students also used a wider range of tools for online research, including online journals and journal catalogues.
  • A third of all students turned first to their fellow students when looking for support with digital devices or skills. A third of FE students turned first to their lecturers in comparison with only 8% of HE students. A third of HE students turned to online information in comparison with only 14% of FE students.

It appears that students feel there should be greater opportunities to work more flexibly, both in terms of device use and learning spaces, but overall the responses are generally positive in terms of the digital experience and there are high levels of overall satisfaction with institutional provision (FE: 74%, HE: 88%) and the quality of teaching and learning on students’ courses (FE: 72%, HE: 74%).

Read the full report:
http://repository.jisc.ac.uk/6967/1/Digital_experience_insights_survey_2018.pdf

Archives Hub Survey Results: What do people want from an archives aggregation service?

The 2018 Archives Hub online survey was answered by 83 respondents. The majority were in the UK, but a significant number were in other parts of Europe, the USA or further afield, including Australia, New Zealand and Africa. Nearly 50% were from higher or further education, and most were using it for undergraduate, postgraduate and academic research. Other users were spread across different sectors or retired, and using it for various reasons, including teaching, family history and leisure or archives administration.

We do find that a substantial number of people are kind enough to answer the survey, although they have not used the service yet. On this survey 60% were not regular users, so that is quite a large number, and maybe indicates how many first-time users we get on the service. Of those users, half expected to use it regularly, so it is likely they are students or other people with a sustained research interest. The other 40% use the Hub at varying levels of regularity. Overall, the findings indicate that we cannot assume any pattern of use, and this is corroborated by previous surveys.

Ease of use was generally good, with 43% finding it easy or very easy, but a few people felt it was difficult to use. This is likely to be the verdict of inexperienced users, and it may be that they are not familiar with archives, but it behoves us to keep thinking about users who need more support and help. We aim to make the Hub suitable for all levels of users, but it is true to say that we have a focus on academic use, so we would not want to simplify it to the point where functionality is lost.

I found one comment particularly elucidating: “You do need to understand how physical archives work to negotiate the resource, but in terms of teaching this actually makes it really useful as a way to teach students to use a physical archive.”  I think this is very true: archives are catalogued in a certain way, that may not be immediately obvious to someone new to them. The hierarchy gives important context but can make navigation more complicated. The fact that some large collections have a short summary description and other smaller archives have a detailed item-level description adds to the confusion.

One negative comment that we got maybe illustrates the problem with relevance ranking: “It is terribly unhelpful! It gives irrelevant stuff upfront, and searches for one’s terms separately, not together.” You always feel bad about someone having such a bad experience, but it is impossible to know if you could easily help the individual by just suggesting a slightly different search approach, or whether they are really looking for archival material at all. This particular user was a retired person undertaking family history, and they couldn’t access a specific letter they wanted to find. Relevance ranking is always tricky – it is not always obvious why you get the results that you do, but on the whole we’ve had positive comments about relevance ranking, and it is not easy to see how it could be markedly improved.  The Hub automatically uses AND for multi-word searches, which is fairly standard practice. If you search for ‘gold silver’ you will probably get the terms close to each other but not as a phrase, but if you search for ‘cotton mills’ you will get the phrase ranked higher than e.g. ‘mill-made cotton’ or ‘cotton spinning mill’.  One of the problems is that the phrase may not be in the title, although the title is ranked higher than other fields overall. So, you may see in your hit list ‘Publication proposals’ or ‘Synopses’ and only see ‘cotton mills’ if you go into the description. On the face of it, you may think that the result is not relevant.
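To show why a relevant hit can look irrelevant in the hit list, here is a toy Python sketch of this kind of ranking. It is not how the Hub’s search engine actually scores results: the `score` function, the field weights and the phrase bonus are all invented for illustration, and the sketch does no stemming, so ‘mill’ would not match ‘mills’. It simply requires every term to be present (AND), weights the title above the description, and adds a bonus where the terms appear together as a phrase.

```python
import re

# Hypothetical weights: a title hit counts double a description hit.
FIELD_WEIGHTS = {"title": 2.0, "description": 1.0}
PHRASE_BONUS = 5.0


def tokens(text: str) -> list:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, record: dict) -> float:
    terms = tokens(query)
    if not terms:
        return 0.0
    # AND: every query term must occur somewhere in the record.
    all_words = set()
    for field in FIELD_WEIGHTS:
        all_words.update(tokens(record.get(field, "")))
    if not all(term in all_words for term in terms):
        return 0.0
    phrase = " " + " ".join(terms) + " "
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokens(record.get(field, ""))
        total += weight * sum(words.count(term) for term in terms)
        # Bonus when the terms appear next to each other, in order.
        if len(terms) > 1 and phrase in " " + " ".join(words) + " ":
            total += weight * PHRASE_BONUS
    return total
```

In this sketch, a record titled ‘Publication proposals’ whose description mentions ‘cotton mills’ outranks one that has ‘cotton’ and ‘mills’ in separate places, even though nothing in its title suggests why it was returned.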

screenshot of survey showing what people value
What do you most value about the Archives Hub?

All of our surveys have clearly indicated that a comprehensive service providing detailed descriptions of materials is what people want most of all. It seems to be more important than providing digital content, which may indicate an acknowledgement from many researchers that most archives are not, and will not be, digitised. We also have some evidence from focus groups and talking to our contributors that many researchers really value working with physical materials, and do not necessarily see digital surrogates as a substitute for this. Having said that, providing links to digital materials still ranks very highly in our surveys. In the 2018 survey we asked whether researchers prefer to search physical and digital archives separately or together, in order to try to get more of a sense of how important digital content is. Respondents put a higher value on searching both together, although overall the results were not compelling one way or the other. But it does seem clear that a service providing access to purely digital content is not what researchers want. One respondent cited Europeana as being helpful because it provided the digital content, but it is unclear whether they would therefore prefer a service like Europeana that does not provide access to anything unless it is digital.

Searching by name, subject and place are clearly seen as important functions. Many of our contributors do index their descriptions, but overall indexing is inconsistent, and some repositories don’t do it at all. This means that a name or subject search inevitably filters out some important and relevant material. But in the end, this will happen with all searches. Results depend upon the search strategy used, and with archives, which are so idiosyncratic, there is no way to ensure that a researcher finds everything relating to their subject.  We are currently working on introducing name records (using EAC-CPF). But this is an incredibly difficult area of work. The most challenging aspect of providing name records is disambiguation. In the archives world, we have not traditionally had a consistent way of referring to individuals. In many of the descriptions that we have, life dates are not provided, even when available, and the archive community has a standard (NCA Rules) that is not always helpful for an online environment or for automated processing. It actually encourages cataloguers to split up a compound or hyphenated surname in a way that can make it impossible to then match the name. For example, what you would ideally want is an entry such as ‘Sackville-West, Victoria Mary (1892-1962) Writer’, but according to the NCA Rules, you should enter something like ‘West Victoria Mary Sackville- 1892-1962 poet, novelist and biographer’. The epithet is always likely to vary, which doesn’t help matters, but entering the name itself in this non-standard way is particularly frustrating in terms of name matching.  On the Hub we are encouraging the use of VIAF identifiers, which, if used widely, would massively facilitate name matching. But at the moment use is so small that this is really only a drop in the ocean.
In addition, we have to think about whether we enable contributors to create new name records, whether we create them out of archive descriptions, and how we then match the names to names already on the Hub, whether we ingest names from other sources and try to deal with the inevitable variations and inconsistencies.  Archivists often refer to their own store of names as ‘authorities’ but in truth there is often nothing authoritative about them; they are done following in-house conventions. These challenges will not prevent us from going forwards with this work, but they are major hurdles, and one thing is clear: we will not end up with a perfect situation. Researchers will look for a name such as ‘Arthur Wellesley’ or ‘Duke of Wellington’ and will probably get several results. Our aim is to reduce the number of results as much as we can, but reducing all variations to a single result is not going to happen for many individuals, and probably for some organisations. Try searching SNAC (http://snaccooperative.org/), a name-based resource, for Wellington, Arthur Wellesley, to get an idea of the variations that you can get in the user interface, even after a substantial amount of work to try to disambiguate and bring names together.
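To give a flavour of the kind of clean-up the split-surname problem calls for, here is a small Python sketch that rejoins a split hyphenated surname. It handles only the one pattern quoted above (‘West Victoria Mary Sackville- 1892-1962 …’); the `rejoin_surname` function and its regular expression are hypothetical, and real entries vary far more than this.

```python
import re
from typing import Optional

# Assumed shape: last-part, forenames, hyphenated prefix, life dates, epithet.
INVERTED = re.compile(
    r"^(?P<last>\S+)\s+(?P<forenames>.+?)\s+(?P<prefix>\S+-)\s+"
    r"(?P<birth>\d{4})-(?P<death>\d{4})\s*(?P<epithet>.*)$"
)


def rejoin_surname(entry: str) -> Optional[dict]:
    """Rebuild 'Sackville-West' from 'West ... Sackville-'.

    Returns None when the entry does not follow the assumed pattern.
    """
    found = INVERTED.match(entry.strip())
    if found is None:
        return None
    return {
        "surname": found.group("prefix") + found.group("last"),
        "forenames": found.group("forenames"),
        "dates": f"{found.group('birth')}-{found.group('death')}",
        "epithet": found.group("epithet"),
    }
```

Run on the Sackville-West entry quoted above, this gives a surname of ‘Sackville-West’ and forenames of ‘Victoria Mary’, which could then be compared with the ideal form. An entry that does not fit the pattern is left alone rather than guessed at.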

The 2018 survey asked about the importance of providing information on how to access a collection, and 75% saw this as very important. This clearly indicates that we cannot assume that people are familiar with the archival landscape. Some time ago we introduced a link on all top-level entries ‘how to access these materials’. We have just changed that to ‘advice on accessing these materials’, as we felt that the former suggested that the materials are readily accessible (i.e. digital), and we have also introduced the link on all description pages, down to item-level. In the last year, the link has been clicked on 11,592 times, and the average time spent on the resulting information page is 1 minute, so this is clearly very important help for users. People are also indicating that general advice on how to discover and use archives is a high priority (59% saw this as of high value). So, we are keen to do more to help people navigate and understand the Archives Hub and the use of archives. We are just in the process of re-organising our ‘Researching‘ section of the website, to help make it easier to use and more focussed.

There were a number of suggestions for improvements to the Hub. One that stood out was the need to enable researchers to find archives from one repository. At the moment, our repository filter only provides the top 20 repositories, but we plan to extend this. It is partly a case of working out how best to do it, when the list of results could be over 300. We are considering a ‘more’ link to enable users to scroll down the list. Many other comments about improvements related back to being more comprehensive.

One respondent noted that ‘there was no option for inexperienced users’. It is clear that a number of users do find it hard to understand. However, to a degree this has to reflect the way archives are presented and catalogued, and it is unclear whether some users of the Hub are aware of what sort of materials are being presented to them and what their expectations are. We do have a Guide to Using Archives specifically for beginners, and this has been used 5,795 times in the last year, with consistently high use since it was introduced. It may be that we should give this higher visibility within the description pages.

Screenshot of Hub page on using archives
Guide to Using Archives

What we will do immediately as a result of the survey is to link this into our page on accessing materials, which is linked from all descriptions, so that people can find it more easily. We used to have a ‘what am I looking at?’ kind of link on each page, and we could re-introduce this, maybe putting the link on our ‘Archive Collection’ and ‘Archive Unit’ icons.

It is particularly important to us that the survey indicated that people who use the Hub do go on to visit a repository. We would not expect all use to translate into a visit, but the 2018 survey indicated 25% have visited a repository and 48% are likely to in the future. A couple of respondents said that they used it as a teaching tool or a tool to help others, who have then gone on to visit archives. People referred to a whole range of repositories they have or will visit, from local authority through to university and specialist archives.

screenshot of survey results
I have found material using the Archives Hub that I would not otherwise have discovered

59% had found materials using the Hub that they felt they would not have found otherwise. This makes the importance of aggregation very clear, and probably reflects our good ranking on Google and other search engines, which brings in people who might otherwise have found neither the Archives Hub nor the archives.

How the Exploring British Design project informed the development of the Archives Hub

Back in 2014 the Archives Hub joined forces with The University of Brighton Design Archives for an exciting new project, funded by the Arts and Humanities Research Council, ‘Exploring British Design’ (EBD).

The project explored Britain’s design history by connecting design-related content in different archives, with the aim of giving researchers the freedom to explore around and within archives.

You can read a number of blog posts on the project, and there is also a video introducing the EBD website on YouTube, but in this post I wanted to set out how we have learned from the project and how it has informed the development of the new Archives Hub.

Unfortunately, we may not be able to maintain the website longer term, and so it seemed timely to reflect on how the principles used in this project are being taken forward.

Modelling the Data

A key component of EBD was our move away from the traditional approach of putting the archive collection at the centre of the user experience. Instead, we wanted to reflect the richness of the content – the people, organisations, places, subjects, events that a collection represents.

We had many discussions and filled many pieces of paper with ideas about how this might work.

rough ideas for data connectivity
Coming up with ideas for how EBD should work

We then took these ideas and translated them into our basic model.

model of data for EBD
Relationships between entities in the EBD data

Archives are represented on our model as one aspect of the whole. They are a resource to be referenced, as are bibliographic resources and objects. They relate to the whole – to agents, time periods, places and events. This essentially puts them into a whole range of contexts, which can expand as the data grows.

Screenshot of EBD homepage
Homepage of Exploring British Design: People are foremost.

The Exploring British Design website was one way to reflect the inter-connected model that we created.

We have taken the principles of this approach with the new Archives Hub architecture and website, which was launched back in December 2016. Whilst the archive collection description stays very much in the forefront of the users’ experience, we have introduced additional tabs to represent themed collections and repositories. All three of these sources of information are, in a data and processing sense, treated equally. The user searches the Hub and the search runs across these three data sources. The model allows us to be flexible with how we present the data, so we could also try different interfaces in future, maybe foregrounding images, or events.

screenshot of Archives Hub search results
Search for ‘design industry’ gives results across Archive Collections, Themed Collections and Repositories

Names

The EBD project had a particular focus on people. We opted to combine machine methods of data extraction – data taken partly from our already existent archive descriptions as well as from other external sources – with manual methods, to create rich records about designers. This manual approach is not sustainable for a large-scale service like the Archives Hub, but it shows what is possible in terms of creating more context and connectivity.

screenshot of a person page from the EBD website
EBD website showing a person page

We wanted to indicate that well-structured data allows a great deal more flexibility in presentation. In this case the ‘Archive and Museum Resources’ are one link in the list of resources about or related to the individual. We could have come up with other ways to present the information, given how it was structured.

We are intending to introduce names pages to the Archives Hub, which will then more clearly echo the EBD approach. They will largely be created through automated processes, as we need to create them at scale. They will generally be quite brief, without the ideal structure or depth, but the principle remains that we can then link from a person page to a host of related resources. The Hub website will have a new tab for ‘Names’ and end users will be able to run searches that take in collections, themes, repositories, people and organisations.

The EBD project allowed us to explore standards used for the creation of names data. It was our first experience of using Encoded Archival Context (Corporate Bodies, Persons and Families) (EAC-CPF), so we could start to see what we could do with it, as well as discover some of the shortcomings of the standard, as our data went beyond what is supported. For example, we wanted to link images to people and events but this was not covered by the standard. It was useful to have this preliminary exploration of it, and what it can – and can’t – do, as we look to adopt it for names within the Archives Hub.

Structured Data

One of the things the project did reinforce for me was the importance of indexing. On the Archives Hub we have always recommended indexing, but we have had mixed reactions from archivists, some feeling that it is less useful than detailed narrative, some saying that it is not needed ‘now we have Google’, some simply saying they don’t have time.

Indexing has many advantages, some of which I’ve touched on in various blog posts – and one at the top of the list, is that it brings the advantages of structured data. A name in a narrative can, in theory, be pulled out and utilised as a point of connectivity, but a name as an index term tends to be a great deal easier to work with: it is identified as a name, it usually has structured surname, forename content, it usually includes life dates and may include titles and epithets to help unambiguously identify an individual.

EBD was all about structured data, and we gave ourselves the luxury of adding to the data by hand, creating rich structured records about designers. This was partly to demonstrate what could be done in an interface, but we were well aware that it would be problematic to create records of that level of detail at scale. However, as we start to grapple with expanding name records in the Archives Hub, we have EBD as a reference point. It has helped us to think more about approaches and priorities when creating name records. If we were to create an EAC Editor (similar to our EAD Editor) we would think carefully about how to facilitate creating relationships. For example, the type of relationship – should there be a controlled list of relationship types? e.g. ‘worked with, collaborated with, had professional connection with, influenced by,  spouse of’ – these are some of the relationships we used in EBD, after much discussion about how best to approach this. Or would it be more practical to stick to ‘associated with’ (i.e. not defined), which is easier, but far less useful to a researcher. Could we have both? How would one combine them in an interface?  Another example – the potential to create timelines. If we wanted to provide end users with timelines, we would need to focus on time-bound events. There are many issues to consider here, not least of which is how comprehensive the timeline would be.
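One way of thinking about the ‘could we have both?’ question is sketched below in Python. This is only an illustration, not a design we have settled on: the `RelationType` list reuses the relationship labels mentioned above, and the `to_relation` helper (a hypothetical name) falls back to the undefined ‘associated with’ whenever a label is not in the controlled list.

```python
from enum import Enum


class RelationType(Enum):
    WORKED_WITH = "worked with"
    COLLABORATED_WITH = "collaborated with"
    PROFESSIONAL_CONNECTION = "had professional connection with"
    INFLUENCED_BY = "influenced by"
    SPOUSE_OF = "spouse of"
    # Undefined fallback for anything outside the controlled list.
    ASSOCIATED_WITH = "associated with"


def to_relation(label: str) -> RelationType:
    """Map a cataloguer's label to a controlled type, or fall back."""
    try:
        return RelationType(label.strip().lower())
    except ValueError:
        return RelationType.ASSOCIATED_WITH
```

So `to_relation("Spouse of")` gives the specific type, while an unlisted label such as ‘friend of’ becomes ‘associated with’. The relationship is still recorded, but the researcher gets less from it.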

The vexed question of how to combine data from name descriptions created by several institutions is not something we really dealt with in EBD, but that will be one of the biggest challenges for us in aiming to implement name data on the Archives Hub.

The level of granularity that you decide upon has massive implications for complexity, resources and benefits. The more granular the data, the more potential for researchers to be able to drill down into lives, events, locations, etc. So including life dates allows for a search for designers from 1946; including places of education allows for exploring possible connections through education, but adding dates of education allows for a more specific focus still.
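A tiny Python sketch may help to show what granularity buys you. The `Designer` structure and the `alive_in` function are hypothetical, and the sample records are invented placeholders rather than EBD data. A record with life dates can be found by a year-based search; a record without them cannot, however relevant the person may be.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Designer:
    name: str
    birth: Optional[int] = None
    death: Optional[int] = None


def alive_in(designers: List[Designer], year: int) -> List[str]:
    """Names of designers whose recorded life dates include the year.

    Records with no birth year are skipped: there is nothing to search on.
    A missing death year is treated as 'not known to have died by then'.
    """
    return [
        d.name
        for d in designers
        if d.birth is not None
        and d.birth <= year
        and (d.death is None or d.death >= year)
    ]
```

For example, with placeholder records ‘Designer A’ (1900-1970), ‘Designer B’ (1950-2010) and ‘Designer C’ (no dates), a search for 1946 returns only ‘Designer A’. ‘Designer C’ may well have been working in 1946, but the data cannot tell us.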

Explaining our approach

One thing that struck me about this project was that it was harder than I had anticipated to convey to people what we were trying to achieve and what we could achieve. I tended to find that showing the website raised a number of expectations that I knew would be difficult to fulfill, and if I’m being honest, I sometimes felt rather frustrated at the lack of recognition of what we had achieved – it’s really not easy to combine, process and present different data sources!  It is ironic that the more we press forwards with new functionality, and try to push the boundaries of what we do, the more it seems that people ask for developments that are beyond that!  You can try to modify expectations by getting deep down and technical with the challenges involved in aggregating and enhancing data created over time, by different people, in different environments (we worked with CSV data, EAC-CPF data, RDF and geodata for example), with different perspectives and priorities.  But detailed explanations of technical challenges are not going to work for most audiences. End users see and make an assessment of the website; they shouldn’t really need to be aware of what is going on behind the scenes.

Originally, in our project specification, we asked the question: “How can we encourage researchers, archive and museum professionals, and the public, to apprehend an integrated and extended rather than collection-specific sense of Britain’s design history?”  Whilst we did not get as far towards answering this question as we had hoped, the work that we did made me feel that it might be harder than I had envisaged. People are very used to the traditional catalogues and other finding aids that are out there, and it creates a certain (possibly unconscious) mindset. I know this too well, because, as an archivist, I have had to adjust my own thinking to see data in a different way and appreciate that traditional approaches to cataloguing and discoverability are not always suited to the digital online age.

Data Model

The hierarchical approach to data is very embedded among archivists, and this is what people are used to being presented with.  Unless archivists catalogue in a different way, providing more structured information about entities (names, places, etc) then actually presenting things in a more connected way is hard.

image of hierarchical folders
A folder structure is often used to represent archival hierarchy

A more inter-connected model, which eschews linear hierarchy in favour of fluid entity relationships and allows for a more flexible approach to the front-end interface, relies upon the quality, structure and consistency of the data. If we don’t have place names at all we can’t provide a search by place. If we don’t have place names that are unambiguously identified (i.e. not just ‘Cambridge’) then we can provide a search by place, but a researcher will be presented with all places called Cambridge, anywhere in the world (including the US, Australia and Jamaica).
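The Cambridge problem can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Hub's data model: the Place and Description classes, the 'identifier' field and the 'place:1' style identifiers are all hypothetical stand-ins for whatever authority identifier a cataloguer might record.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Place:
    label: str
    country: Optional[str] = None
    identifier: Optional[str] = None  # hypothetical authority identifier


@dataclass
class Description:
    title: str
    places: list = field(default_factory=list)


def search_by_label(descriptions, label):
    """Match on the text of the place name only."""
    return [d.title for d in descriptions
            if any(p.label == label for p in d.places)]


def search_by_identifier(descriptions, identifier):
    """Match on an unambiguous identifier, where one was recorded."""
    return [d.title for d in descriptions
            if any(p.identifier == identifier for p in d.places)]


# Invented descriptions with different levels of place data.
descriptions = [
    Description("Papers A", [Place("Cambridge", "England", "place:1")]),
    Description("Papers B", [Place("Cambridge", "United States", "place:2")]),
    Description("Papers C", [Place("Cambridge")]),  # name only, not identified
    Description("Papers D"),                        # no place data at all
]

print(search_by_label(descriptions, "Cambridge"))
print(search_by_identifier(descriptions, "place:1"))
```

The search by label returns every Cambridge, the search by identifier returns only the one that was identified, and 'Papers D' cannot be found by place at all. Note that 'Papers C' is also missed by the identifier search, which is the cost of inconsistent data.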

A diagram showing archives and other entities connected
An example of connected entities

The new Archives Hub was designed on the basis of a model that allows for entities to be introduced and new connections made.

Archives Hub Entity Relationship diagram
Entities within the Archives Hub system

So, the tabs that the end user sees in the interface can be modified and extended over time. Searches can be run across all entities; it is not solely about retrieving descriptions of archives. This approach allows for researchers to find e.g. repositories that are significantly about ‘design’ or repositories that are located in London. It allows us to introduce Themed Collections as a separate type of description, so a student doing a project on ‘plastics’ would discover the Museum of Design in Plastics as a resource alongside archive collections at repositories including Brighton Design Archives, the V&A and the Paul Mellon Centre.

screenshot of Archives Hub search results
Search for ‘plastics and design’ shows archives and themed resources

Website Maintenance

One of the things I’ve learnt from this project is that you need to factor in the ongoing costs and effort of maintaining a project website. The EBD website is quite sophisticated, which means there are substantial technical dependencies, and we ended up running into issues with security, upgrades and compatibility of software, issues that are par for the course for a website but nonetheless need dealing with promptly. Maybe we should have factored this in more than we did, as we know the systems administration required for the Archives Hub is no small thing, but when you are in the throes of a project your focus is on the objectives and final output more than the ongoing issues. We cannot maintain a site long-term that is not being regularly used. EBD does not get the level of use that would justify the resources we would have to put into it on an ongoing basis.

Conclusion

When we were creating the model for the Archives Hub, we thought as much about flexibility and future potential as anything else. This is one thing that we have learnt from running the Hub for 25 years and from projects like Exploring British Design. You need to plan for potential developments in order to start to work with cataloguers, to get the data into the shape that you need it to be. We wanted to be able to introduce additional entities, so that we could have names, places, languages, images, or any other entities as ‘first class citizens’ of the Hub. We wanted to be able to enhance the end user’s ability to take different paths, and locate relevant archives through different avenues of exploration.

We need to temper our ambitions for the Hub with the realities of cataloguing, aggregation and resources available, and we need as much information as we can get about what researchers really want; but this is why it is so important to encompass potential as well as current functionality. We may not be able to introduce everything we have envisioned or that users ask for right now; but it is important to understand the vital link between approaches to cataloguing, adherence to data standards, and front end functionality. We created visualisations for EBD and we would love to do this for the Hub, but it was not an easy thing to do, and so we would need to consider what the data allows, the software options available, whether the technical requirements are sustainable over time, and the effectiveness of the end result for the researcher.

Visualisation showing connections to Elizabeth Denby
Visualisation for Elizabeth Denby

When we demonstrated the visualisations in EBD, they had the wow factor that was arguably lacking in the main text-based site, but for serious researchers the wow factor is a great deal less important than the breadth and depth of the content, and that requires a model that is fundamentally rigorous, sustainable over time and realistic in terms of the data that you have to work with.

 

Archives Hub Data and Workflow

Introduction

As those of you who contribute to or use the Hub will know, we went live with our new system in Dec 2016.  At the heart of our new system is our new workflow.  One of the key requirements that we set out with when we migrated to a new system was a more robust and sustainable workflow; the system was chosen on the basis that it could accommodate what we needed.

This post is about the EAD (Encoded Archival Description) descriptions, and how they progress through our processing workflow. It is the data that is at the heart of the Archives Hub world. We also work with EAG (Encoded Archival Guide) for repository descriptions, and EAC-CPF (Encoded Archival Context, Corporate bodies, Persons and Families) for name entities. Our system actually works with JSON internally, but EAD remains our means of taking in data and providing data out via our API.

On the Archives Hub now we have two main means of data ingest, via our own EAD Editor, which can be thought of as ‘internal’, and via exports from archive systems, which can be thought of as ‘external’.

Data Ingest via the EAD Editor

1. The nature of the EAD

The Editor creates EAD according to the Archives Hub requirements. These have been carefully worked out over time, and we have a page detailing them at http://archiveshub.jisc.ac.uk/eadforthehub

screenshot of eadforthehub page
Part of a Hub webpage about EAD requirements

When we started work on the new system, we were aware that having a clear and well-documented set of requirements was key. I would recommend having this before starting to implement a new system! But, as is often the case with software development, we didn’t have the luxury of doing that – we had to work it out as we went along, which was sometimes problematic, because you really need to know exactly what your data requirements are in order to set your system up. For example, simply knowing which fields are mandatory and which are not (ostensibly simple, but in reality this took us a good deal of thought, analysis and discussion).
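As an illustration of what 'knowing which fields are mandatory' looks like once it is written down, here is a minimal sketch in Python using the standard library XML parser. The MANDATORY list is an assumption made up for this example, not the Hub's actual requirements (those are on the page linked above), and the sketch assumes EAD without an XML namespace.

```python
import xml.etree.ElementTree as ET

# Illustrative list only: label -> path relative to <archdesc>.
MANDATORY = {
    "reference": "did/unitid",
    "title": "did/unittitle",
    "date": "did/unitdate",
    "scope and content": "scopecontent",
}


def missing_fields(ead_xml):
    """Return the labels of mandatory fields that are absent or empty."""
    root = ET.fromstring(ead_xml)
    archdesc = root if root.tag == "archdesc" else root.find(".//archdesc")
    if archdesc is None:
        return list(MANDATORY)
    missing = []
    for label, path in MANDATORY.items():
        element = archdesc.find(path)
        # An element that exists but holds no text counts as missing.
        if element is None or not "".join(element.itertext()).strip():
            missing.append(label)
    return missing


# Invented sample: the date is present but empty, scope and content is absent.
sample = """<ead><archdesc level="fonds"><did>
<unitid>GB 000 EXAMPLE</unitid>
<unittitle>Example Papers</unittitle>
<unitdate></unitdate>
</did></archdesc></ead>"""

print(missing_fields(sample))
```

Even in a toy version, decisions appear that need discussion: whether an empty element counts as present, and whether a field is required at every level or only at the top.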

Screenshot of the EAD Editor
EAD Editor

2. The scope of the EAD

EAD has plenty of tags and attributes! And they can be used in many ways. We can’t accommodate all of this in our Editor. Not only would it take time and effort, but it would result in a complicated interface that would not be easy to use.

screenshot of EAD Tag Library
EAD Tag Library

So, when we created the new Editor, we included the tags and attributes for data that contributors have commonly provided to the Hub, with a few more additions that we discussed and felt were worthwhile for various reasons. We are currently looking again at what we could potentially add to the Editor, and prioritising developments. For example, the <materialspec> EAD tag is not accommodated at the moment. But if we find that our contributors use it, then there is a good argument for including it, as details specific to types of materials, such as map scales, can be useful to the end user.

We don’t believe that the Archives Hub necessarily needs to reflect the entire local catalogue of a contributor. It is perfectly reasonable to have a level of detail locally that is not brought across into an aggregator. Having said that, we do have contributors who use the Archives Hub as their sole online catalogue, so we do want to meet their needs for descriptive data. Field headings are an example of content we don’t utilise. These are contained within <head> tags in EAD. The Editor doesn’t provide for adding these. (A contributor who creates data elsewhere may include <head> tags, but they just won’t be used on the Hub, see Uploading to the Editor).

We will continue to review the scope in terms of what the Editor displays and allows contributors to enter and revise; it will always be a work in progress.

3. Uploading to the Editor

In terms of data, the ability to upload to the Editor creates challenges for us. We wanted to preserve this functionality, as we had it on the old Editor, but as EAD is so permissive, the descriptions can vary enormously, and we simply can’t cope with every possible permutation. We undertake the main data analysis and processing within our main system, and trying to effectively replicate this in the Editor in order to upload descriptions would be duplicating effort and create significant overheads. One of our approaches to this issue is that we will preserve the data that is uploaded, but it may not display in the Editor. If you think of the model as ‘data in’ > ‘data editing’ > ‘data out’, then the idea is that the ‘data in’ and ‘data out’ provides all the EAD, but the ‘data editing’ may not necessarily allow for editing of all the data. A good example of this situation occurs with the <head> tag, which is used for section headings. We don’t use these on the Hub, but we can ensure they remain in the EAD and they are there in the output from the Editor, so they are retained, but not displayed in the Editor. They can then be accessed by other means, such as through an XML Editor, and displayed in other interfaces.
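The ‘data in’ > ‘data editing’ > ‘data out’ idea can be sketched as follows. This is not the Editor's code: the two functions and the sample description are invented for illustration, using Python's standard XML library. The point is that the editing view shows only some fields, while the full document, including <head>, is what gets written back out.

```python
import xml.etree.ElementTree as ET


def editable_fields(ead_xml):
    """What a hypothetical editing form would display: no <head> content."""
    root = ET.fromstring(ead_xml)
    return {
        "title": root.findtext(".//unittitle"),
        "scope": root.findtext(".//scopecontent/p"),
    }


def apply_edit(ead_xml, new_title):
    """Change one field in place and serialise the whole document again."""
    root = ET.fromstring(ead_xml)
    root.find(".//unittitle").text = new_title
    return ET.tostring(root, encoding="unicode")


# Invented upload containing a section heading the editing view ignores.
uploaded = (
    "<archdesc><did><unittitle>Old title</unittitle></did>"
    "<scopecontent><head>Scope and Content</head><p>Letters.</p></scopecontent>"
    "</archdesc>"
)

print(editable_fields(uploaded))
print(apply_edit(uploaded, "New title"))
```

Because the edit is applied to the parsed document rather than rebuilt from the form fields, anything the form does not know about passes through untouched.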

We have disabled upload of exports from the Calm system to the Editor at present, as we found that the data variations, which often caused the EAD to be invalid, were too much for our Editor to cope with. It has to analyse the data that comes in and decide which fields to populate with which data. Some are straightforward – ‘title’ goes into <unittitle> for example, but some are not…for example, Calm has references and alternative references, and we don’t have this in our system, so they cause problems for the Editor.

4. Output from the Editor

When a description is submitted to the Archives Hub from the Editor, it is uploaded to our system (CIIM, pronounced ‘sim’), which is provided by Knowledge Integration, and modified for our own data processing requirements.

Screenshot of the CIIM
CIIM Browse screen

The CIIM framework allows us to implement data checking and customised transformations, which can be specific to individual repositories. For the data from the Editor, we know that we only need a fairly basic default processing, because we are in control of the EAD that is created. However, we will have to consider working with EAD that is uploaded to the Editor, but has not been created in the Editor – this may lead to a requirement for additional data checking and transformations. But the vast majority of the time descriptions are created in the Editor, so we know they are good, valid, Hub EAD, and they should go through our processing with no problems.
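The general pattern of default processing plus repository-specific transformations might be sketched like this. To be clear, this is not the CIIM API or our configuration: the step functions, the 'repository-x' key and the reference prefix are hypothetical, and records are reduced to plain Python dictionaries for the sake of the example.

```python
def strip_whitespace(record):
    """Default step: trim stray whitespace from every text field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def require_title(record):
    """Default check: refuse a description that has no title."""
    if not record.get("title"):
        raise ValueError("description has no title")
    return record


def prefix_reference(record):
    """Hypothetical repository-specific fix: add a missing reference prefix."""
    reference = record["reference"]
    if not reference.startswith("X/"):
        record = {**record, "reference": "X/" + reference}
    return record


# Every description gets the default steps; some repositories get extra ones.
DEFAULT_STEPS = [strip_whitespace, require_title]
REPOSITORY_STEPS = {"repository-x": [prefix_reference]}


def process(record, repository):
    """Run the default steps, then any steps registered for this repository."""
    for step in DEFAULT_STEPS + REPOSITORY_STEPS.get(repository, []):
        record = step(record)
    return record


print(process({"title": " Papers ", "reference": "1"}, "editor"))
print(process({"title": "Papers", "reference": "1"}, "repository-x"))
```

Data created in the Editor would only ever need the default steps, which is why it is the simpler case.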

Data Ingest from External Data Providers

1. The nature of the EAD

EAD from systems such as Calm, Archivists’ Toolkit and AtoM is going to vary far more than EAD produced from the Editor. Some of the archival management systems have EAD exports. To have an export is one thing; it is not the same as producing EAD that the Hub can ingest. There are a number of factors here. The way people catalogue varies enormously, so, aside from the system itself, the content can be unpredictable – we have to deal with how people enter references; how they enter dates; whether they provide normalised dates for searching; whether entries in fields such as language are properly divided up, or whether one entry box is used for ‘English, French, Latin’, or ‘English and a small amount of Latin’; whether references are always unique; whether levels are used to group information, rather than to represent a group of materials; what people choose to put into ‘origination’ and if they use both ‘origination’ and ‘creator’; whether fields are customised, etc. etc.
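Two of these checks, the language field and unique references, are simple enough to sketch in Python. This is an illustration of the kind of analysis involved, not our actual processing: the KNOWN_LANGUAGES set is a tiny made-up subset, the function names are invented, and multi-word language names are not handled.

```python
import re
from collections import Counter

# Illustrative subset only; a real list would come from a language standard.
KNOWN_LANGUAGES = {"English", "French", "Latin", "Welsh"}


def split_languages(entry):
    """Return (languages found, whether the entry needs a human to look at it)."""
    words = re.findall(r"[^\W\d_]+", entry)
    languages = [w for w in words if w in KNOWN_LANGUAGES]
    # Anything other than language names and 'and' suggests a narrative entry.
    leftover = [w for w in words
                if w not in KNOWN_LANGUAGES and w.lower() != "and"]
    return languages, bool(leftover)


def duplicate_references(references):
    """Return references that occur more than once, in sorted order."""
    counts = Counter(references)
    return sorted(ref for ref, n in counts.items() if n > 1)


print(split_languages("English, French, Latin"))
print(split_languages("English and a small amount of Latin"))
print(duplicate_references(["A/1", "A/2", "A/1", "A/3"]))
```

The first entry divides cleanly. The second still yields two languages, but the qualification ‘a small amount of’ cannot be kept in a list of language names, so the sketch flags it for review rather than silently dropping it.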

The system itself will influence the EAD output. A system will have a template, or transformation process, that maps the internal content to EAD. We have only worked in any detail with the Calm template so far. Axiell, the provider of Calm, made some changes for us, for example, only six languages were exporting when we first started testing the export, so they expanded this list, and then we made additional changes, such as allowing for multiple creators, subjects and dates to export, and ensuring languages in Welsh would export. This does mean that any potential Calm exporter needs to use this new template, but Axiell are going to add it to their next upgrade of Calm.

We are currently working to modify the AdLib template, before we start testing out the EAD export. Our experience with Calm has shown us that we have to test the export with a wide variety of descriptions, and modify it accordingly, and we eventually get to a reasonably stable point, where the majority of descriptions export OK.

We’ve also done some work with AtoM, and we are hoping to be able to harvest descriptions directly from the system.

2. The scope of the EAD

As stated above, finding aids can be wide ranging, and EAD was designed to reflect this, but as a result it is not always easy to work with. We have worked with some individual Calm users to extend the scope of what we take in from them, where they have used fields that were not being exported. For instance, information about condition and reproduction was not exporting in one case, due to the particular fields used in Calm, which were not mapping to EAD in the template. We’ve also had instances of index terms not exporting, and sometimes this has been due to the particular way an institution has set up their system. It is perfectly possible for an institution to modify the template themselves so that it suits their own particular catalogues, but this is something we are cautious about, as having large numbers of customised exports is going to be harder to manage, and may lead to more unpredictable EAD.

3. Uploading to the Editor

In the old Hub world, we expected exports to be uploaded to the Editor. A number of our contributors preferred to do this, particularly for adding index terms. However, this led to problems for us because we ended up with such varied EAD, which militated against our aim of interoperable content. If you catalogue in a system, export from that system, upload to another system, edit in that system, then submit to an aggregator (and you do this sometimes, but other times you don’t), you are likely to run into problems with version control. Over the past few years we have done a considerable amount of work to clarify ‘master’ copies of descriptions. We have had situations where contributors have ended up with different versions to ours, and not necessarily been aware of it. Sometimes the level of detail would be greater in the Hub version, sometimes in the local version. It led to a deal of work sorting this out, and on some occasions data simply had to be lost in the interests of ending up with one master version, which is not a happy situation.

We are therefore cautious about uploading to the Editor, and we are recommending to contributors that they either provide their data directly (through exports) or they use the Editor. We are not ruling out a hybrid approach if there is a good reason for it, but we need to be clear about when we are doing this, what the workflow is, and where the master copy resides.

4. Output from Exported Descriptions

When we pass the exports through our processing, we carry out automated transformations based on analysis of the data. The EAD that we end up with – the processed version – is appropriate for the Hub. It is suitable for our interface, for aggregated searching, and for providing to others through our APIs. The original version is kept, so that we have a complete audit trail, and we can provide it back to the contributor. The processed EAD is provided to the Archives Portal Europe. If we did not carry out the processing, APE could not ingest many of the descriptions, or else they would ingest, but not display to the optimum standard.

Future Developments

Our automated workflow is working well. We have taken complete, or near complete, exports from Calm users such as the Universities of Nottingham, Hull and (shortly) Warwick, and a number of Welsh local authority archives. This is a very effective way to ensure that we have up-to-date and comprehensive data.

We have well over one hundred active users of the EAD Editor and we also have a number of potential contributors who have signed up to it, keen to be part of the Archives Hub.

We intend to keep working on exports, and also hope to return to some work we started a few years ago on taking in Excel data. This is likely to require contributors to use our own Excel template, as it is impractical to work with locally produced templates. The problem is that working with one repository’s spreadsheet, translating it into EAD, could take weeks of work, and it would not replicate to other repositories, who will have different spreadsheets. Whilst Excel is reasonably simple, and most offices have it, it is also worth bearing in mind that creating data in Excel has considerable shortcomings. It is not designed for hierarchical archival data, which has requirements in terms of both structure and narrative, and is constantly being revised. TNA’s Discovery are also working with Excel, so we may be able to collaborate with them in progressing this area of work.
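To show why a fixed template matters, here is a minimal sketch of turning flat spreadsheet rows into a hierarchy. Everything about it is an assumption for illustration: the three column names, the convention that a reference such as ‘ABC/1/1’ places a row beneath ‘ABC/1’, and the use of CSV text (as might be saved from Excel) so that the example needs only the Python standard library. It also assumes that parent rows appear before their children.

```python
import csv
import io


def build_tree(csv_text):
    """Nest rows under their parents, using '/' in the reference as the hierarchy."""
    nodes = {}
    roots = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        node = {
            "reference": row["reference"],
            "level": row["level"],
            "title": row["title"],
            "children": [],
        }
        nodes[node["reference"]] = node
        # 'ABC/1/1' -> parent 'ABC/1'; a reference with no '/' has no parent.
        parent_reference = node["reference"].rpartition("/")[0]
        parent = nodes.get(parent_reference)
        if parent is None:
            roots.append(node)
        else:
            parent["children"].append(node)
    return roots


# Invented rows following the hypothetical template.
sheet = """reference,level,title
ABC,fonds,Example Papers
ABC/1,series,Correspondence
ABC/1/1,item,Letter
ABC/2,series,Photographs
"""

tree = build_tree(sheet)
print([child["title"] for child in tree[0]["children"]])
```

The code is short only because the template fixes the columns and the reference convention in advance. A locally produced spreadsheet with different columns, or with hierarchy expressed through indentation or merged cells, would need its own mapping, which is where the weeks of work come from.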

Our new architecture is working well, and it is gratifying to see that what we envisaged when we started working with Knowledge Integration and started setting out our vision for our workflow is now a reality.  Nothing stands still in archives, in standards, in technology or in user requirements, so we cannot stand still either, but we have a set-up that enables us to be flexible, and modify our processing to meet any new challenges.