Genre and form within archival descriptions

September 27, 2021 / Jane Stevenson

We spend a great deal of time discussing each field in an archival description as part of the process of data aggregation and normalisation. But some fields raise more questions than others. I think overall we’ve probably spent the most time on the unique reference for each unit of description, which is so important when identifying and sorting descriptions and moving them around. Creator has also thrown up a number of challenges. Recently we’ve been thinking about ‘Genre/Form’. So, I thought I would post about it, as it reflects many of the types of issues that we think about as an aggregator.

On the Archives Hub, less than 1% of descriptions have genres or forms included. They can be in the core descriptive area and within the ‘control’ area as index terms – most are in the descriptive area. Quite a few of them are in our Online Resource descriptions of web resources that feature/display/explain archives; in particular, they are in descriptions created for digitisation projects, where adding this information was part of the cataloguing process. Clearly, it is not common practice to add this information in archival descriptions.

When very few descriptions have a type of descriptive data – in this case genre/form – then the only thing you can really do is display it. If you provide a search or filter so that end users can find genre/form content, such as ‘photographs’ or ‘maps’ or ‘typescripts’ then you are encouraging them to narrow down their search to a tiny percentage of the descriptions – only those ones that have these terms included. Most users will assume that a search for ‘photographs’ will find all of the descriptions that include photos, when in reality it would find just a few percent. So, it is not a useful search; it is really a very misleading search. For this reason, in the imminent upgrade to the Archives Hub we are removing the links that we currently have on the genre/form entities, so that they do not create new searches.

Even displaying this data could be seen as misleading, because then the user might think that a description that doesn’t list ‘photographs’, for example, doesn’t have them, because other descriptions do list photographs. It is hard to convey to users that descriptions vary enormously. Even writing this now, I start to wonder whether it is worth us displaying the genre/form content at all when it may mislead in this way. Yet, it certainly can be useful for a researcher to know the types of content within a large collection.

Within the descriptions that do use this field, many are as you might expect, e.g. ‘photographs, leaflets, posters, letters, ephemera, books’. Others are more descriptive, e.g. ‘silver instruments in hard leather box’ or ‘Correspondence and other documents, architectural drawings, engineering contract drawings, and naval architecture publication’ or ‘Small ring-bound notepad’. Descriptive entries can convey more to a researcher, but they provide real challenges if you want to use the terms as links to allow users to search for other similar items. Also, a ‘small notepad’ might be ‘manuscript’ or ‘typescript’. If an end user searches for ‘typescript’ they would not find the small notepad. This is the problem of a lack of controlled vocabulary, and the problem of what ‘genre’ and ‘form’ really mean. The difficulty of separating them is clearly why they have ended up being bundled together.

We have not made an analysis of the use of controlled vocabulary, but it is clear that in general terms are not controlled. In our own EAD Editor, we provide links to the Getty Thesaurus of Graphic Materials and the Art and Architecture Thesaurus, but I am not sure how appropriate these are to describe all materials within an archive. Obviously an archive can include pretty much anything. If we just stuck to controlled vocabularies, we would probably omit some items. The Ivan Bunin collection from the University of Leeds is a great example of a description that lists a whole range of items – really useful to have, but difficult to see how this would work in a structured, controlled vocabulary world. In general, it seems to be common practice simply to list genre and form using local terms, which will differ between institutions, between cataloguers, and over time.

One of the issues I’ve mused upon is whether people are more likely to add a form such as ‘photographs’ and omit a form such as ‘typescript’, even if there are only a very few photographs, and a great deal of typescript material. Do the terms included really reflect the make-up of the collection? I suspect that cataloguers might think that end users are more interested in finding photographs or maps as genre types than finding typescript documents, and that may well be true. Also, it would be very difficult to list all the material types within a large collection, so only the main types, or clearly defined types are likely to be included.

As an aggregator, we have to understand and appreciate that each contributor has their own approach to cataloguing, and will use fields differently, or use them regularly, sometimes, or not at all. But also, I’m sure many of our contributors would say that across their descriptions there isn’t the level of consistency they would like, for various historical reasons. This is just multiplied when everything is aggregated. Aggregation allows for the power of global editing and enhancement, UK-wide interrogation and cross-searching, and serendipitous discovery. It is enormously powerful. It also creates a headache with how to harmonise everything in order to effectively do this.

The particular issue with genre/form came up because we are developing an Excel (spreadsheet) template for people to use if they prefer to catalogue in this way. We want to make sure the template is user friendly. We have included a column named ‘Genres/Forms’ and in the end we have simply made it a descriptive field without trying to structure or control the content. We will not try to add the content to our indexes, because of this complication of turning the text into structured data, and because we are not sure that it is really all that useful for the reasons outlined above.

Somewhat related to this, the new EAD standard, EAD3, has rather unhelpfully removed the sub-categories of ‘physical description’, which are ‘extent’, ‘genreform’, ‘dimensions’ and ‘physfacet’, so that they all have to be bundled into just one field. Either that or you have to add a structured physical description which requires you to add a value from a list: carrier, material type, space occupied or other physdesc structured type (which asks you to then add the ‘other’ type). I can just imagine going back to all our contributors and asking them to add a type to all their physical description information! If we move to EAD3, we would remove the demarcation that tells us the information is about the genre/form or about the extent. This is potentially a deal breaker for us adopting EAD3, as taking away structure that is already there seems like madness. You could argue that simply having one free text field for physical description gets us off the hook with our attempts to work with the data (e.g. potentially using extent to provide a search to help convey the size of collections to users) – if it was completely unstructured then any attempt to analyse and present it differently would be impossible. However, just the process of putting these sub-fields together into one field would actually be extremely difficult due to the fact that different institutions have different patterns of data input. ISAD(G), the archival standard for description, doesn’t refer to form or genre at all, but recommends adding extent and medium, such as ‘42 photographs’ or ‘330 files’, or else adding the overall storage space, such as 20 cubic metres. It doesn’t really go in for promoting structured data.
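To illustrate what that demarcation gives us, here is a minimal Python sketch that reads the sub-categories out of a physical description. The XML fragment and its values are invented for illustration, and the function name is my own; it is not how the Archives Hub processes data, just a way of showing that separate sub-elements can be pulled apart trivially, whereas one free text field cannot.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using the sub-categories named above; values are invented.
PHYSDESC = """
<physdesc>
  <extent>42 photographs</extent>
  <genreform>photographs</genreform>
  <dimensions>10 x 15 cm</dimensions>
</physdesc>
"""

def split_physdesc(xml_text):
    """Map each sub-element name to the list of text values found under it."""
    root = ET.fromstring(xml_text.strip())
    parts = {}
    for child in root:
        parts.setdefault(child.tag, []).append((child.text or "").strip())
    return parts

parts = split_physdesc(PHYSDESC)
print(parts["extent"])
print(parts["genreform"])
```

If all of this were bundled into a single text field, telling the extent from the genre/form would need guesswork about each institution's pattern of data input.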

For those interested, here is a breakdown of the genre/form entries that have been used at least 10 times, just to give an idea of some common terms (though most entries include several types, so they will not appear in this list):

List of genre and form types used in the Archives Hub

(Corrrespondence may be down to a rather extensive cut and paste error).

I’m not going to get into the thorny issue of what ‘genre’ is and what ‘form’ is. They were put together in EAD, whilst ISAD(G) doesn’t use these terms at all, but refers to ‘medium’. The distinction seems very blurred, and there are many archivists who will have more idea of what the definitions are than I do. I think it is very much open to interpretation for individual cataloguers – so we have entries like ‘small boxes’, ‘New Orleans-style jazz’ and ‘Museum administration’ and ‘social history’ as well as ‘personal papers’, ‘manuscripts’, ‘typescripts’ and ‘sound’.

In the end genre/form is a field that seems potentially very useful – the idea that researchers can search for maps, or prints, drawings or postcards, CDs or tape, is appealing, but in reality, we have never really prioritised this information in our catalogues. In our machine learning project, just kicking off, we may explore the possibility of interrogating descriptions to potentially add genre/form. It would be interesting to see how well this works. But I wouldn’t bet my house on it…or even my outhouse – the narrative style of most catalogues is likely to hinder any effective identification of material types.

We would love to hear from you if you utilise this field. Do you think it is useful? Do you try to add a comprehensive list of genres/forms? Do you think that researchers really want to search by material type?

Name Authorities in Archives

April 1, 2020 / Jane Stevenson

There have been some threads on archives-nra recently about adding names to archival catalogues, so I thought it might be good to blog about it, reflecting the Archives Hub’s experience and knowledge on this topic after 20 years of working with aggregated data, and three Linked Data projects. We are also about to embark upon a ‘Names Project’ with the intention, in the first phase, of laying the groundwork for creating something that is interoperable and sustainable. The idea is to develop the Archives Hub so that we can include name records, with the ability to ingest and process them automatically and at scale, which is a big challenge.

What rules or guides should you follow for creating a name?

To start with, one of the interesting things about this topic is that the development of persistent unique identifiers (PIDs) should actually make consistency with name form and pattern less of an issue. (I say that advisedly, as someone who has always promoted consistency, following rules, and using care with constructing names). Of course, this only works if PIDs are assigned to names. To take an example – here is a list of names for one person, the Victorian social reformer Beatrice Webb:

Webb Martha Beatrice 1858-1943 Social Reformer
Webb Beatrice 1858-1943 social reformer
Webb, Martha Beatrice. ( 1858-1943) nee Potter Social Reformer
Webb Martha Beatrice 1858-1943 nee Potter, social reformer and historian
WEBB, Beatrice, 1858-1943
Webb, Martha Beatrice
Webb, M.B., 1858-1943
Webb, Martha, b. 1858, wife of Sydney Webb
Potter, Martha Beatrice, 1858-1943
Martha Beatrice Potter, 1858-1943

If all instances of this name were accompanied by recognised and agreed identifiers, then job done, we know they all represent the same person, whatever the form of the name.

It is important to state that ‘knowing who this is’ applies to both humans and machines. Humans will probably gather that these all represent the same person; the question is whether they can all be matched programmatically.
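To give a feel for the machine side of this, here is a rough Python sketch that builds a crude match key from some of the forms listed above. The key (first word plus any life dates) is an assumption of mine made purely for illustration; it is not the matching we use on the Archives Hub.

```python
import re

VARIANTS = [
    "Webb Martha Beatrice 1858-1943 Social Reformer",
    "WEBB, Beatrice, 1858-1943",
    "Webb, Martha Beatrice",
    "Webb, M.B., 1858-1943",
    "Potter, Martha Beatrice, 1858-1943",
]

def naive_key(name):
    """Crude key: the first word, lower-cased, plus life dates if any are present."""
    first_word = re.split(r"[,\s]+", name.strip())[0].lower()
    dates = re.search(r"\b\d{4}-\d{4}\b", name)
    return (first_word, dates.group(0) if dates else None)

keys = {naive_key(v) for v in VARIANTS}
for key in sorted(keys, key=str):
    print(key)
```

Even on this small sample the key splits one person into three: the entry without dates and the entry under the maiden name both fall outside the main group, which a human reader would not be fooled by.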

Still, we’ve a long way to go before universal PID harmony, and we’ve also got the problem of which identifiers to use. So, we’re back to rules for the construction of a name, which, of course, have many advantages besides disambiguation.

The archives community in the UK is likely to turn to ISAAR(CPF) and the NCA Rules. Sometimes the question is asked about which one to use, but in truth they are complementary and so a choice is not needed.

ISAAR(CPF) is about a full name authority, as it is generally understood within the archive community – essentially a biographical record, documenting the nature, context and activities of the entity, preferably providing relationships to other people and organisations. The idea is to provide context about the records’ creation and use, which helps the user to understand and interpret the archive collection, something that archivists see as an essential activity.

The term ‘name authority’ can simply apply to the name itself as well as to a full record about an entity. This is typically the case in the Library world, but even in NCA Rules (which I’ll come on to), the term is defined as “the recognised, authorised or prescribed form of a name”. This difference in definition can sometimes cause confusion.

ISAAR(CPF) states that it “is intended to be used in conjunction with existing national standards or as the basis for the development of national standards”, and “rules and conventions for standardizing access points may be developed nationally”. ISAAR(CPF) is about a whole lot more than the name; it is about describing the entity and the relationships it has with other entities. For the authorised form of the name, you are prompted to use national conventions or other guidance. The standard also allows for other forms of the name, but essentially the way the name is constructed and the dividers used are not prescribed. ISAAR(CPF) does also allow for an authority record identifier – which comes back to the PIDs mentioned above, but it does not prescribe the identifier used.

So, that leaves NCA Rules, which are about the construction of names. I’m just going to focus on personal names for this blog post.

As far as I’m concerned, there is loads of good and useful stuff in these Rules. Going through the rules really brings home just how complicated names can be. Everything from medieval surnames to Greek names to names with no identifiable surname, pre-titles and epithets is addressed. I’ve particularly found the rules on royal names and papal names useful myself. If you want to know how to deal with William of Malmesbury or the Duchess of Marlborough it’s great.

The NCA Rules were created in 1997, which is an age away in terms of the modern digital and online age, and yet so much of what they say is still useful, because in the end we haven’t changed our names. However, in the digital age we continue to change how the names that we construct are stored, transferred, displayed and used. I think that this means that parts of the NCA Rules are no longer so helpful.

Hyphenated and compound surnames

This is a particular problem as far as I’m concerned. If you want to enter the name of William Henry Fox Talbot, the Rules propose using Talbot | William Henry | Fox. You can cross-reference to ‘Fox Talbot’. However, in modern databases and formats like XML, you are in danger of ending up with:

Surname: Talbot
Forenames: William Henry Fox

In terms of archival catalogues, this may not be so bad. If there is a search by name, it usually searches across the whole name, so ‘Fox Talbot’ as a search is likely to bring back the record. The display of the name may be Talbot, William Henry Fox, but a researcher is likely to understand who that is from the context of the description. Humans are generally good at interpretation through context.

However, for those of us pushing forward with the principles of joined-up data and moving towards the ideal of Linked Data (even if we don’t fully get there), this structure is a problem. In the Archives Hub we could end up with:

<persname>
<surname>Talbot</surname>
<forename>William Henry Fox</forename>
<dates>1800-1877</dates>
</persname>

Clearly, this is not correct, and it becomes harder to connect it with other instances of the name. As stated above, if we all used agreed PIDs it would not be such a problem, e.g.

<persname authfilenumber="https://viaf.org/viaf/54325833" source="viaf">
<surname>Talbot</surname>
<forename>William Henry Fox</forename>
<dates>1800-1877</dates>
</persname>

But applying these PIDs (even if we do manage to agree what they are) to all our catalogues retrospectively….well, that’s a bit of a job. And it would require the kind of analysis of names that much prefers semantically well-structured names, so kind of a catch-22.
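As a sketch of why the identifier helps, the Python below groups name entries by the authfilenumber attribute rather than by the text of the name. The second record, with ‘Fox Talbot’ as the surname, is a hypothetical variant I have added for illustration, and the function name is my own.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RECORDS = [
    '<persname authfilenumber="https://viaf.org/viaf/54325833" source="viaf">'
    '<surname>Talbot</surname><forename>William Henry Fox</forename></persname>',
    # Hypothetical variant entry for the same person, structured differently.
    '<persname authfilenumber="https://viaf.org/viaf/54325833" source="viaf">'
    '<surname>Fox Talbot</surname><forename>William Henry</forename></persname>',
]

def group_by_identifier(xml_records):
    """Group display forms of names under the identifier they carry."""
    groups = defaultdict(list)
    for xml_text in xml_records:
        element = ET.fromstring(xml_text)
        pid = element.get("authfilenumber")  # None when no identifier was recorded
        display = f"{element.findtext('surname')}, {element.findtext('forename')}"
        groups[pid].append(display)
    return dict(groups)

groups = group_by_identifier(RECORDS)
for pid, names in groups.items():
    print(pid, names)
```

Both forms end up under the one identifier, however the surname was entered. Entries with no identifier would all land under None, which is exactly the retrospective problem.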

That leaves the recommended route of stating that the ‘entry element’ for the surname is, well, the surname. Hyphenated surnames are the same. NCA Rules plumps for “Lewis” as the entry element for Cecil Day-Lewis. I would argue for it being “Day-Lewis”.

I think there is a similar issue with prefixes such as ‘Du’ and ‘Van’. Putting Daphne Du Maurier under ‘Maurier’ is not right…

Being part of the wider world

One reason ‘Maurier, Daphne Du’ is not right is clear when you look at http://viaf.org/viaf/24600806. This is the Virtual International Authority File entry for Daphne Du Maurier. Only the Lebanese National Library has gone for ‘Maurier, Daphne Du’. Of course, the name has still been matched with the others, so no harm done in a sense, at least on VIAF. But it doesn’t really help matters to be out of sync with everyone else where names are concerned.

VIAF is the Virtual International Authority File, and it is a good place to start when thinking about persistent unique identifiers and the benefits of data join-up. It is not perfect, but what it does is to push things towards interlinked data and to enable the kind of connectivity that Linked Data is after. Other authority files are available (which is part of the problem), but VIAF is widely used, it sources from many countries, and it is fairly comprehensive.

Going back to Beatrice Webb. She is also in VIAF: http://viaf.org/viaf/86607236. Just as with Daphne Du Maurier, you can see how all the variations in the name have been brought together. But there is more value to VIAF than this. It also brings in other data. As well as related names, related works and publishers it includes links to Wikipedia in a whole range of languages. It also records the ISNI (International Standard Name Identifier) and WorldCat Identifier.

All of these links provide the potential for any data at those destinations to be brought together. If you go to the English page on Wikipedia for Beatrice Webb, that includes content sourced from Wikidata: https://www.wikidata.org/wiki/Q242666 – another instance of sharing information across different services.

Going back to ISAAR(CPF), the standard states that repositories “can more easily share or link contextual information about this source if it has been maintained in a standardized manner. Such standardization is of particular international benefit when the sharing or linking of contextual information is likely to cross national boundaries”. But as the name entry itself is down to national standards, the question is whether the NCA Rules do encourage standardisation. I would say that they do, on the whole, but with caveats, including those mentioned above. You may have others. I know that some archivists are not happy with the treatment of women’s names, such as the advice: “A woman who marries and adopts her husband’s surname is to be entered under that name.” I tend to think that it is important to include the maiden name, and I think we should consider both linking up data (links from information created before the person was married, for example) and also how end users will search – rules are no good if they act against people actually finding the information that they want.

Epithet and structure

Epithets are used by archivists, but most domains do not use them. The library world does not add epithets. We like them for adding context, and they are often very useful. However, they do add to the level of variation considerably. For our Names Project we will probably exclude the epithet from name matching (if we can – they are not always easy to isolate). With the time and tenacity, you could utilise them to help with matching, but one of the challenges with algorithmic solutions is that you have to draw the line according to your resources. Epithets are really useful, but we can never hope to standardise them. What we really need to do is to identify them as part of the structure. In EAC-CPF (the XML standard for name entities that is based upon ISAAR(CPF)), information typically included in epithets is separated out semantically:

<nameEntry>
<part localType="surname">Emberton</part>
<part localType="forename">Joseph</part>
</nameEntry>

And then also included in this entry:

<occupation>
<term>Architect</term>
</occupation>

This is perfect. You can display the information together, or separately; you can search them together, or separately. Records in Context (RiC) has sought to provide a conceptual model that brings together ICA archival standards. It is not a set of cataloguing guidelines, but it does use the language of structured data, and to some degree Linked Data – names are ‘entities’ and they have properties and relations with other entities. It encourages the idea of separating out things like occupation (often part of our epithet) from the agent (person) so that you can, for example, link one occupation to many people. This is more feasible if we create the entries in a structured way so that you can separate these two pieces of information (a person pursues an occupation / an occupation is pursued by a person), but often we don’t do this, and our cataloguing systems don’t help us to do this.
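The ‘together, or separately’ point can be sketched in a few lines of Python. The record wrapper element here is hypothetical and only there to hold the two snippets in one document (real EAC-CPF nests them more deeply and uses a namespace), and the display convention is just one I have picked for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical <record> wrapper around the two fragments shown above.
ENTRY = """
<record>
  <nameEntry>
    <part localType="surname">Emberton</part>
    <part localType="forename">Joseph</part>
  </nameEntry>
  <occupation>
    <term>Architect</term>
  </occupation>
</record>
"""

root = ET.fromstring(ENTRY.strip())
name_parts = {part.get("localType"): part.text for part in root.iter("part")}
occupation = root.findtext("occupation/term")

def display_name(parts, occupation=None):
    """Build a display string, adding the occupation as an epithet only if asked."""
    name = f"{parts['surname']}, {parts['forename']}"
    return f"{name}, {occupation.lower()}" if occupation else name

print(display_name(name_parts))
print(display_name(name_parts, occupation))
```

Because the occupation is held apart from the name, the matching can use the name alone while the display can still show the epithet.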

Dividers

In many ways these are part of a name for the purposes of processing, but dividers are not really covered in most standards. Standards tend to cop out by using pipes: Charles I | 1600-1649 | King of Great Britain and Ireland. This is an agnostic stance – the dividers are up to you. NCA Rules: “In the Rules there are no mandatory conventions for punctuation and abbreviation. These will continue to conform to the house style of each repository.” For house style read “everyone do what they prefer”.

This has meant that we have a great and interesting variety of dividers. The divider diversity can be overcome programmatically, but it is still another complication in terms of consistency. Have a look at the names at the bottom of this record: https://archiveshub.jisc.ac.uk/data/gb982-sww. The problems are *not* because Aberystwyth have entered them incorrectly – the data is fine; they are because this record was created in a long past time (2010) of the Archives Hub’s first online data creation tool, which was fairly basic. We then attempted some global normalisation work on these older descriptions…and there’s the rub. If you write something to say ‘turn X into Y’ that usually works fine. But the more complicated it gets, the harder it is to satisfy all the data, so to speak. It is more like ‘turn P, Q, R, S, T, U, V, W, X into Y, but if you come across A, B or C turn them into Z’. With the above example the excess of dividers is because there is punctuation within the XML record itself, but we also apply punctuation during display, as most records don’t include it. We are now working on more effective ways to standardise (which is a slow process because we don’t have many staff, whilst we also have loads of things clamouring for attention). We could have a recognised and agreed use of dividers, e.g.:

Churchill, Sir Winston Leonard Spencer, 1874-1965, Knight, statesman and historian

Churchill, Sir Winston Leonard Spencer (1874-1965), Knight, statesman and historian

Dates in brackets seems to be the most common approach in archives, although maybe less so in other domains. Both these names can be easily matched – the dividers are not a problem. But it can get harder with various combinations of pre-titles, titles, epithets, born, died, floruit, question marks, nee, square brackets, commas, stops, semi-colons and colons. However, the ideal is that the parts of the name are created separately, and then displayed as preferred, which I come back to below.
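As a rough illustration of why these two forms are easy to match, here is a Python sketch with one regular expression that accepts dates either after a comma or in brackets. The pattern is my own and only covers these two shapes; it makes no attempt at floruit dates, question marks, nee and the rest.

```python
import re

NAME_PATTERN = re.compile(
    r"^(?P<surname>[^,]+),\s*"           # entry element, up to the first comma
    r"(?P<forenames>[^,(]+?)\s*"         # forenames, including any pre-title
    r"[,(]\s*(?P<dates>\d{4}-\d{4})\)?"  # dates after a comma or inside brackets
    r"(?:,\s*(?P<epithet>.+))?$"         # optional epithet at the end
)

def parse_name(text):
    """Return the named parts of a name string, or None if it does not fit the pattern."""
    match = NAME_PATTERN.match(text.strip())
    return match.groupdict() if match else None

with_commas = parse_name(
    "Churchill, Sir Winston Leonard Spencer, 1874-1965, Knight, statesman and historian"
)
with_brackets = parse_name(
    "Churchill, Sir Winston Leonard Spencer (1874-1965), Knight, statesman and historian"
)
print(with_commas == with_brackets)  # True
```

Once both strings have been reduced to the same parts, the divider is no longer part of the comparison. Every extra combination of punctuation means another branch in the pattern, which is where it starts to get hard.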

So, what should we do?

My best advice to anyone creating names is to follow ISAAR(CPF) and to use NCA Rules, but also think about a name in a broader context and be aware of international standards and identifiers – it is great if you can include recognised identifiers such as VIAF, ISNI or ORCID. We want to share data with the outside world (outside of the archival domain) so we don’t want to be too focused on archival standards and ignore web standards and common ways of doing things. We have to work within the systems we have, so sometimes you cannot structure a name as you would like, but aim for consistency, semantic structure as far as you are able, and practices that are not out of step with everyone else’s practices. This means that we can more easily join our data up and create a space that researchers can navigate in infinite ways and for infinite purposes.

Cataloguing Matters: being part of the information landscape

November 9, 2015 / Jane Stevenson

Consider the following questions, which use the topic of design history, but could be for any topic area:

Where did this person work?
Who did they know?
What can I find out about furniture design in London in the early 20th century?
Who designed this early 20th century chair?
Did the designer feature in this exhibition?
Can I find photographs of this section of the exhibition?
Did these designers both feature in this design exhibition?
Who influenced this designer?

These are surely typical questions for researchers. They are the sort of questions we thought about when we were working on Exploring British Design. But these questions do not start with the archive. An archive collection may well hold many answers for these questions, but there remains the problem of connecting these questions to the archive: ‘I’m interested in this designer/in this chair that they designed/in this exhibition they designed for’. We need these questions to lead to archival sources when appropriate.

Which comes first, the research question or the archive? We tend to assume the archive, but for many researchers the question comes first and the archive is, at that point, not known to them. We become a little fish in a very big pond when we enter the Web, and so we need to find ways for researchers who may not be aware of us to hook our collections.

Wellcome Library, London: Fishing

But how do we achieve this? We often have to work within considerable constraints and we cannot simply transform our cataloguing practices. There are many technical solutions that we can potentially make use of, but I think the first challenge lies with the data itself.

Individual archive repositories may think that there is no way they can move into a Linked Data world of RDF and triples and persistent URIs. But I think there are steps that we could take to help our descriptive data better fit this kind of landscape and have the potential to be more ‘linked in’ to other sources and to researchers’ paths of discovery.

Our descriptive practices tend to treat collections as stand-alone entities, rather than integrated parts of a whole information landscape. We do not think enough about cataloguing in a way that facilitates an integrated approach where we can connect to other information resources and allow researchers to come to archives through searching for people, organisations, subjects, events and places, and to come to them as part of a whole network of resources. I think we need to try to think about providing potential ‘connectors’ within our descriptions that will allow them to be hooked into the landscape more effectively.

Here are some thoughts about how we can help to achieve this:

Be consistent when cataloguing

This may sound straightforward, but having looked at thousands of descriptions from well over two hundred archive repositories, I can announce that it doesn’t always happen. For instance, simply entering the name of the repository in the same way for every catalogue entry, and not adding the repository code as part of the repository name, for example, ensures that the repository is always correctly identified. This means that all of the collection descriptions can be clearly identified as being from the same institution.

When entering names, think about how you structure them. Many archive systems provide a means to store and link to names, and yet it is amazing how often one name varies. If nothing else, think about adding life dates to a name, which helps with unique identification.

Here on the Archives Hub we have to hold our hands up for a potentially unhelpful practice up till now of encouraging the creator name to be entered in one way under ‘name of creator’ and another way as an index term. This reflects a time when we were less focussed on machine processing. Of course, it is much better to enter the name consistently, so that the connection can clearly be made.

Try to use rules and standards

Whilst I have increasingly become somewhat frustrated by our standards, I still think there is a role for standards to encourage consistency, clarify meaning and help draw things together. If you enter subjects using UKAT or LCSH, then try to ensure that all your terms really do come from these thesauri. We get examples where the thesaurus is named, but the subject is not actually from the thesaurus.

When entering things like language codes (ISO standard 639), take a few minutes just to find out about them. They need to be lower case, to be consistent. It’s worth having an understanding of what you are doing and why.
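A small Python sketch of the sort of check this implies is below. The function name is my own, and it only tests the shape of a code (two or three lower-case letters, as ISO 639 codes have); it does not check that the code actually exists in the standard.

```python
import re

def normalise_language_code(code):
    """Lower-case a code and check it has the two- or three-letter shape of an ISO 639 code."""
    cleaned = code.strip().lower()
    if not re.fullmatch(r"[a-z]{2,3}", cleaned):
        raise ValueError(f"not shaped like an ISO 639 code: {code!r}")
    return cleaned

print(normalise_language_code(" ENG "))  # eng
```

Something like ‘English’ entered in the code field would raise an error here rather than slipping through.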

Do the same for dates. Think about what a normalised date is and why it is important. It is really worth having a sense of why these things matter. ISO 8601: “The purpose of this standard is to provide an unambiguous and well-defined method of representing dates and times”. That sounds perfect for archives, and well worth adopting.
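For a single, fully known date, normalising is straightforward. This Python sketch assumes the date is written as day, month name and year in English, and that the default locale is in use; date ranges and uncertain dates, which are common in archives, would need more than this.

```python
from datetime import datetime

def normalise_date(text, fmt="%d %B %Y"):
    """Parse a date such as '6 December 1864' and return it in ISO 8601 form."""
    return datetime.strptime(text, fmt).date().isoformat()

print(normalise_date("6 December 1864"))  # 1864-12-06
```

The normalised form sorts correctly as plain text and cannot be misread as month-first or day-first, which is the whole point.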

Think about the questions people ask

One of the advantages of indexing is that it gives you a chance to consider the terms you can best use to ‘advertise’ your collection. It is really good if the scope and content text is clearly reflected in the index terms. If your archive is about a designer and you have described things they have designed, then you might realise that you haven’t used ‘furniture designer’ in your text, but this is a good index term to use. Or maybe the archive is really useful for those looking at the history of design education, but you haven’t yet actually used the term ‘design education’ in your text. You can add it as an index term.

Try to think beyond the UK

This may again sound obvious, but unfortunately NCA Rules don’t exactly encourage this, even if they don’t prevent it. When indexing by place name, add the county and certainly add the country. Indexing really helps us think about this. Consider a typical biographical history entry:

Charles Edward Sayle was born in Cambridge on 6 December 1864. He entered New College, Oxford, in 1883, and St John’s College, Cambridge, in 1890.

It wouldn’t look quite right putting:

Charles Edward Sayle was born in Cambridge, England, on 6 December 1864. He entered New College, Oxford, England, in 1883, and St John’s College, Cambridge, England, in 1890 [etc]

And whilst it would help with identification, it is still unstructured. Much better to ensure you have:

Place name: Cambridge, England

in your structured index terms. True, there are plenty of ways to mark this up, some better than others, but at least having an index term entered in this way is a great help with uniquely identifying the Cambridge that you mean, as opposed to the one in the US or New Zealand, Australia or Jamaica. It means that a researcher who is researching a topic around ‘Cambridge, England’ is more likely to find your description.

If you put your place names into some kind of specific field, try to avoid something like ‘Cardiff, Merthyr Tydfil, Cambridge’ (an example from the Hub) or ‘Cambridge etc’ (another example). So if your cataloguing software provides just one box for place, repeat that box for each place, rather than putting a load of them into one entry. If it allows for something like place name and country, use that to your advantage.
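
As a rough illustration of the difference, here is a small Python sketch of place names held as separate, structured entries rather than as one combined string. The PlaceTerm class and its field names are hypothetical, not taken from any particular cataloguing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceTerm:
    """One index term for one place (hypothetical structure for illustration)."""
    name: str
    country: str

    def label(self) -> str:
        # The form a researcher would see, e.g. 'Cambridge, England'
        return f"{self.name}, {self.country}"


# One entry per place, instead of 'Cardiff, Merthyr Tydfil, Cambridge' in a single box
places = [
    PlaceTerm("Cardiff", "Wales"),
    PlaceTerm("Merthyr Tydfil", "Wales"),
    PlaceTerm("Cambridge", "England"),
]

for place in places:
    print(place.label())
```

Because each place is its own entry, and the country is its own field, a system can tell this Cambridge apart from the others without having to guess where one name ends and the next begins.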

Put index terms into the most appropriate categories

On the Hub we’ve had plenty of family names as personal names and personal names as corporate names. But most commonly we get genre as subject. If your archive contains photographs, press cuttings, preliminary sketches or parchment (animal material) and you want to make this known, you need to index by material type, not by subject … unless the archive is about photographs or about parchment.

Think Reference!

Your references should make it possible to uniquely identify every part of a hierarchical archival description. Think carefully about adding them and make sure you have a unique reference at every level of description. Some archival software will automatically generate references, which does help because then they will be consistent and unique. But if you add them manually, it is very easy to make a mistake. On the Hub we found several hundred duplicated ‘unique references’ when we went through a major exercise to clean up references. And I do mean several hundred, not one or two.
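
If you can export a list of your references, a check for duplicates is simple to run. This is a minimal Python sketch, not the process we used on the Hub. The reference codes are made up, and treating references that differ only in case or stray spaces as duplicates is my assumption.

```python
from collections import Counter


def duplicated_references(references):
    """Return the references that occur more than once.

    Assumption: references that differ only in case or in leading or
    trailing spaces are treated as the same reference.
    """
    counts = Counter(ref.strip().upper() for ref in references)
    return sorted(ref for ref, count in counts.items() if count > 1)


# Hypothetical reference codes from one hierarchical description
refs = ["ABC/1", "ABC/1/1", "ABC/1/2", "abc/1/2 ", "ABC/2"]
print(duplicated_references(refs))  # ['ABC/1/2']
```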

Optimise titles for search engines

All of this structured data really needs the boost of descriptions that work well on the Web, simply in terms of being discoverable. SEO, or search engine optimisation, is a big topic, and your system may not allow you to make many changes, but in terms of the actual data, do think about appropriate titles, which are a really good hook. Include the most significant words, make sure they are not too long (so they are easy to scan for a researcher looking through various sources) and make titles at collection level self-explanatory. At lower levels this is not such a problem, as it is possible to append the collection title, to help with interpretation.

Look at how your data exports

This may not be possible or practical for everyone, but if you can do it I think it is a really good indicator of how interoperable your data might be to see how it looks when it is exported, because that is when you are removing it from the comfort of your own familiar system, and potentially unleashing it into the world.

The Event Horizon

I am particularly interested at the moment in events. History is made up of events. Who’d have thought it, when our standards barely mention them!

EAD (Encoded Archival Description), our XML standard for descriptions, doesn’t allow for events to be indexed at all. CIDOC CRM is “a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation” and it is event-based. It may be a complicated beast, not much used within our domain, and many of you won’t be aware of it, but there is certainly something to be said for putting events at the forefront of how we think.

The new EAC-CPF standard allows for chronologies in the biographical section of the description, but this is more about narrative (a list of events as part of one name authority description). It is based partly on ISAAR(CPF) which states “Record in narrative form or as a chronology the main life events, activities, achievements and/or roles of the entity being described.” Narrative form is fine for a researcher who is perusing the description, but structured form that is not necessarily so closely tied to narrative is what is required for more integration of data.

People, organisations, places and subjects are all linked by events – that is where the connections and the stories are. I think it is a major shortcoming of our standards that events are not given more prominence.

I mention this here really just to say that whilst we can improve our descriptions, and think about consistency and indexing, I do think that there are things around cataloguing that may require a more fundamental change.

Research Questions

Coming back to the initial questions I asked: our cataloguing can help to ensure that archives are networked in to the broader information landscape. This is not about whether the archive can in principle answer these questions. It is about whether a researcher asking these questions (maybe initially through Google, maybe through other generic channels) can discover the archive as one source amongst many:

Where did this person work?
Uniquely identify the place(s).

Who did they know?
Make sure names are consistent and try to make them unambiguous.

What can I find out about furniture design in London in the early 20th century?
Use appropriate index terms and think about how to add dates consistently.

Who designed this early 20th century chair?
Adding index terms helps to draw out significant names and concepts.

Did the designer feature in this exhibition?
Entering the exhibition name and designer name as structured data will help.

Can I find photographs of this section of the exhibition?
Make sure you have the exhibition name clearly stated and think about using index terms for formats – these are a great means for researchers to find types of material. But remember, this is ‘photographs’ as a genre or form, not as a subject!

Did these designers both feature in this design exhibition?
Not so easy, as we don’t index by event type.

Who influenced this designer?
Not so easy, as we don’t tend to provide structured relationship information to connect people other than in the broadest possible way (these two people were ‘associated’). But that is another story…

To my mind, cataloguing is a skill, and it is really worth thinking carefully about what you are cataloguing and how you catalogue it. It is more important to think about this now than it was 30 years ago, because 30 years ago we were working with narrative descriptions and index cards. Now we want our data to be interconnected.

The Quest for Single Search

September 26, 2011 / Jane Stevenson

This post is based on a report published by OCLC Research, Single Search: The Quest for the Holy Grail (Leah Prescott and Ricky Erway, 2011).

It is less than ideal when users can benefit from a single search option for resources across the internet, but within an institution they are presented with a range of search systems for different services and resources. A single search obviously allows researchers to search across the organisation’s resources; it may also give a sense of the rich resources of an organisation and may provide a motivation to build upon them.

The OCLC report is based upon discussions with nine organisations that have implemented single search. There are certainly substantial challenges, not least the resources required and the need for effective collaboration across an institution. But it is clear that single search, if it is provided effectively, will help researchers and will help to harmonize collections management.

Single search needs to simplify rather than complicate the user experience, and sometimes the challenges this poses are not addressed and a single search ends up being a frustrating or confusing experience. We know that some users find navigating archival hierarchical descriptions confusing; adding library and museum items to this increases the challenge. Different collections may be catalogued very differently and to different levels of granularity, so presenting a coherent list of results is not easy. Added to this, many institutions now have digital collections, but only a part of their resources are digitised and so there is a need to indicate clearly what is digital (what can be accessed digitally) and what requires a visit to the institution.

The OCLC report refers to single search having the ability to ‘fundamentally change how an institution identifies itself’. Maybe if the single search represents a large part of the resources of an institution this is true; it is not likely to be the case in a university, where the collections are only a small part of the university’s business. Single search may enable curators, archivists and librarians themselves to get a more coherent view of the collections. This could be a useful advantage, as we know that often curators in charge of one collection or subject area do not necessarily have a good understanding of the whole. It may encourage a more efficient and streamlined approach to collections management.

Amongst the nine institutions that formed part of the OCLC discussions, some did have a mandate to create single search, but even with this kind of directive, there is a need for senior managers to provide the resources required and ensure that it is made a priority. In addition, the issue of individual motivation is significant. I think this is a fascinating area that is sometimes overlooked: the extent to which the staff involved are motivated to work together and to achieve a vision must have a substantial impact on the outcome. What sort of role do ‘champions’ play? How important are they? Does it come down to individuals with intellectual curiosity and the willingness to learn new skills and change working habits? Is it important for the institution to foster this kind of attitude in order to ensure that innovations like single search are likely to work? One of the institutions in the OCLC report referred to the staff that had been selected to work on a single search as being selected for their ‘interest, skills and capacity to work on the program’. I have certainly come across colleagues who are frustrated by a lack of co-operation from other staff, which can significantly hamper any kind of innovative changes to metadata creation and cross-searching.

I think that attitudes are key to success in a project like this, where working practices may have to change and habits may need to be broken. It reminds me of that great YouTube video of the lone dancer who is joined by just one person – one is a crazy lone dancer, and others tend to try to ignore him/her; but once just one person joins in you have a group, and once you have two, then you’re more likely to get three, then four, and then the group builds up to the extent where those who are reluctant to join in anything a bit new or different, where they might embarrass themselves, end up joining in because not joining in becomes the exception rather than the rule. It’s a slightly different scenario but the point is similar.

The size of the institution is likely to have an impact. A small institution is often more agile, and getting buy-in may be easier, although there may be less resource to draw on. Maybe for a large organisation, trying to implement something that cuts through the departments and teams in a very horizontal way, like single search, is harder if the organisational structures remain the same. The priorities of the different departments involved may end up pulling against the project. It becomes all the more important to define the goals, get buy-in at the right levels, have clear and effective communication channels, and also find an effective way to keep the momentum and motivation going.

The OCLC report makes one observation which resonates very much with me: ‘It is important for the success of the project to have representation…from IT units, as weak motivation within the IT area of an organization has the power to paralyze such a project.’ The important thing here seems to be to ensure that the right people are included at the right stages in the project. IT should be brought in right at the outset and a real effort should be made to develop not only a common understanding but also a feeling of good will and strong motivation.

As the OCLC report states: ‘The reality of achieving an integrated access vision could mean overturning years or decades of institutional thinking, which has segmented collections management practice among the three different sectors of LAMs.’ Professionals within libraries, archives and museums have their own perspectives and values, and are often very caught up in their own long-standing practices. There may be good reason for this – often curators and archivists have had to fight over time to ensure their collections are properly looked after and catalogued. But a single search may call for a more compromised approach, and certainly it is likely to call for different thinking and finding new ways to represent the collections.

The ‘Technological Considerations’ section of the report is well worth reading, giving a short summary of some of the options. This is an area where the Archives Hub is very well aware of the pros and cons of different approaches. For an institution wanting to implement single search, there are a number of approaches: systems where you adopt batch export; systems where an API is used to pull the data in dynamically; a single system that replaces all the separate systems or multiple systems harvesting to a central repository; a federated search where each separate system is queried and results are brought back and presented to the user; a central index that is searched rather than the individual systems. All of these have pros and cons around things like flexibility, speed, currency and professional practices.
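
To make two of these options a little more concrete, here is a toy Python sketch contrasting a federated search with a central index. Everything in it is hypothetical: the three ‘systems’ are just lists of made-up titles, and real systems would be queried over a network and hold far richer records.

```python
from collections import defaultdict

# Hypothetical records held by three separate collection systems
SYSTEMS = {
    "library": ["History of furniture design", "Design education in Britain"],
    "archive": ["Papers of a furniture designer"],
    "museum": ["Oak chair, early 20th century"],
}


def words(title):
    """Split a title into lower-case words, ignoring commas."""
    return set(title.lower().replace(",", "").split())


def federated_search(term):
    """Query every system at search time and merge what comes back."""
    results = []
    for system, titles in SYSTEMS.items():
        results.extend((system, t) for t in titles if term.lower() in words(t))
    return results


def build_central_index(systems):
    """Harvest every record once, in advance, into a single word index."""
    index = defaultdict(list)
    for system, titles in systems.items():
        for title in titles:
            for word in words(title):
                index[word].append((system, title))
    return index


CENTRAL_INDEX = build_central_index(SYSTEMS)


def central_search(term):
    """Look the term up in the pre-built index; the source systems are not touched."""
    return CENTRAL_INDEX.get(term.lower(), [])


print(federated_search("furniture"))
print(central_search("furniture"))
```

Both searches return the same two records here. The difference is in when the work is done: the federated search is always current but depends on every system responding at search time, whereas the central index is quick to search but only as current as the last harvest.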

Of course, a further very important consideration will be digital assets, and the need to take a systematic approach here. Institutions may have Digital Asset Management Systems, but do these operate effectively with other collections systems? Do digital assets exist in the different collections management systems? Are there shared metadata standards for digital assets?

Metadata Considerations present a whole new raft of challenges. I think that all too often those outside of the domains – maybe the managers who want to see single search and a more integrated approach – do not appreciate the substantial differences in approach between libraries, archives and museums. It is thought that because they all have something to do with that nebulous concept of ‘cultural heritage’ they should all play together relatively easily. But each domain has built up its own world-view over many decades; the development of standards and best practice involves a great deal of hard work. It could be argued that finding ways to present catalogues or finding aids to users in a way that is as simple and straightforward as possible is not compatible with single search. It may be that single search, while seeking to provide an integrated approach, actually creates a more complex interface as a result of trying to integrate collection-based and hierarchical archival descriptions, item-based museum artifact descriptions and largely open access and usually non-unique library collections.

One of the biggest problems is that metadata is expensive to create. Automated metadata provides one solution but it is a very partial solution, especially for unique archival and museum collections. Another challenge is that usually metadata has been created over long periods of time using a variety of systems, sometimes migrating from one system to another (often with patchy results). Metadata is messy, and yet standards lie at the heart of effective integration. But even standards are usually at different stages of evolution, and standards adopted by each of the domains do not necessarily harmonise very well.

One of the issues we have noticed on the Hub is the tendency for collections that are catalogued in great detail to overwhelm more summary descriptions. It can give the effect that those catalogued in more detail are more important. If you search on the Archives Hub relatively frequently, you are likely to come across ‘University of Liverpool Staff Papers’ because they have been very thoroughly catalogued. There may be really good stuff in there, but should this one collection seem to be so much more important than so many others? Yet detailed cataloguing is surely a good thing?

There are also issues around vocabularies, and the tendency to implement multiple vocabularies within the same community. The Hub allows for any recognised vocabulary to be used for index terms, but that does mean that personal names, for example, may be entered using NCA Rules or AACR, so you will inevitably end up with several different entries for the same name. The OCLC report refers to the need to harmonise metadata, trying to standardise terms, but I think that for us the way forward is generally to try to use the ever increasing sophistication of data processing tools to get round this problem, because we will never get 200 institutions to put things into the system in exactly the same way. Having said that, we are finding that as most of our contributors use our EAD Editor now, the descriptions are much more consistent and easier to integrate.
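
As a very simple illustration of the kind of data processing I mean, this Python sketch groups name entries that differ only in case, punctuation or spacing. It is an assumption-laden toy rather than a description of how the Hub works: the function names are mine, the variant strings are invented, and genuine differences between cataloguing rules (such as added dates or epithets) need much more than this.

```python
import re
from collections import defaultdict


def match_key(name: str) -> str:
    """Reduce a name entry to a comparison key: lower case, no punctuation, single spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", name.casefold())
    return " ".join(cleaned.split())


def group_variants(names):
    """Group name entries that share the same comparison key."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return dict(groups)


# Invented variants of the same name, as they might arrive from different contributors
entries = ["Sayle, Charles Edward", "SAYLE, Charles Edward.", "Sayle,  Charles  Edward"]
print(group_variants(entries))
```

All three entries end up under the one key, ‘sayle charles edward’, so they can be presented to the researcher as a single name while the original forms are kept.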

The OCLC report ends with some advice on the user interface, and one comment that I wholeheartedly agree with is the advice to hire or consult professional designers if you possibly can. Web presence is so important and Websites are often quite poorly designed. An ideal is to carry out user testing, but having done this ourselves, we know just how much time and effort it can take, and for many archives this is really quite a barrier. Even just testing with a small handful of users is very worthwhile. It’s amazing how much you find you have taken for granted that researchers will question. It’s good to see the importance of rights management emphasised and the need to clearly define access to content. This is becoming increasingly relevant, as data is republished, shared and recombined.

Single search is an important goal, not least because ‘the challenges inherent in this information divide ultimately expect researchers to compartmentalize their interests in a similar manner, rather than encouraging more multi-disciplinary approaches that focus on the research inquiry (rather than the nature and custody of the resources).’ Our approach has tended to suit our own professional outlooks; it should be geared towards what researchers want and need.

All quotes taken from Prescott, Leah and Ricky Erway. 2011. Single Search: The Quest for the Holy Grail. Dublin, Ohio: OCLC Research.

Image from www.digital-delight.ch