Genre and form within archival descriptions

September 27, 2021 / Jane Stevenson

We spend a great deal of time discussing each field in an archival description as part of the process of data aggregation and normalisation. But some fields raise more questions than others. I think overall we’ve probably spent the most time on the unique reference for each unit of description, which is so important when identifying and sorting descriptions and moving them around. Creator has also thrown up a number of challenges. Recently we’ve been thinking about ‘Genre/Form’. So, I thought I would post about it, as it reflects many of the types of issues that we think about as an aggregator.

On the Archives Hub, less than 1% of descriptions have genres or forms included. They can be in the core descriptive area and within the ‘control’ area as index terms – most are in the descriptive area. Quite a few of them are in our Online Resource descriptions of web resources that feature/display/explain archives, in particular they are in descriptions created for digitisation projects, where adding this information was part of the cataloguing process. In conclusion, it is clearly not common practice to add this information in archival descriptions.

When very few descriptions have a type of descriptive data – in this case genre/form – then the only thing you can really do is display it. If you provide a search or filter so that end users can find genre/form content, such as ‘photographs’ or ‘maps’ or ‘typescripts’ then you are encouraging them to narrow down their search to a tiny percentage of the descriptions – only those ones that have these terms included. Most users will assume that a search for ‘photographs’ will find all of the descriptions that include photos, when in reality it would find just a few percent. So, it is not a useful search; it is really a very misleading search. For this reason, in the imminent upgrade to the Archives Hub we are removing the links that we currently have on the genre/form entities, so that they do not create new searches.

Even displaying this data could be seen as misleading, because then the user might think that a description that doesn’t list ‘photographs’, for example, doesn’t have them, because other descriptions do list photographs. It is hard to convey to users that descriptions vary enormously. Even writing this now, I start to wonder whether it is worth us displaying the genre/form content at all when it may mislead in this way. Yet, it certainly can be useful for a researcher to know the types of content within a large collection.

Within the descriptions that do use this field, many are as you might expect, e.g. ‘photographs, leaflets, posters, letters, ephemera, books’. Others are more descriptive, e.g. ‘silver instruments in hard leather box’ or ‘Correspondence and other documents, architectural drawings, engineering contract drawings, and naval architecture publication’ or ‘Small ring-bound notepad’. Descriptive entries can convey more to a researcher, but they provide real challenges if you want to use the terms as links to allow users to search for other similar items. Also, a ‘small notepad’ might be ‘manuscript’ or ‘typescript’. If an end user searches for ‘typescript’ they would not find the small notepad. This is the problem of a lack of controlled vocabulary, and the problem of what ‘genre’ and ‘form’ really mean. The difficulty of separating them is clearly why they have ended up being bundled together.

We have not made an analysis of the use of controlled vocabulary, but it is clear that in general terms are not controlled. In our own EAD Editor, we provide links to the Getty Thesaurus of Graphic Materials and the Art and Architecture Thesaurus, but I am not sure how appropriate these are to describe all materials within an archive. Obviously an archive can include pretty much anything. If we just stuck to controlled vocabularies, we would probably omit some items. The Ivan Bunin collection from the University of Leeds is a great example of a description that lists a whole range of items – really useful to have, but difficult to see how this would work in a structured, controlled vocabulary world. In general, it seems to be common practice simply to list genre and form using local terms, which will differ between institutions, between cataloguers, and over time.

One of the issues I’ve mused upon is whether people are more likely to add a form such as ‘photographs’ and omit a form such as ‘typescript’, even if there are only a very few photographs, and a great deal of typescript material. Do the terms included really reflect the make-up of the collection? I suspect that cataloguers might think that end users are more interested in finding photographs or maps as genre types than finding typescript documents, and that may well be true. Also, it would be very difficult to list all the material types within a large collection, so only the main types, or clearly defined types are likely to be included.

As an aggregator, we have to understand and appreciate that each contributor has their own approach to cataloguing, and will use fields differently, or use them regularly, sometimes, or not at all. But also, I’m sure many of our contributors would say that across their descriptions there isn’t the level of consistency they would like, for various historical reasons. This is just multiplied when everything is aggregated. Aggregation allows for the power of global editing and enhancement, UK-wide interrogation and cross-searching, and serendipitous discovery. It is enormously powerful. It also creates a headache with how to harmonise everything in order to effectively do this.

The particular issue with genre/form came up because we are developing an Excel (spreadsheet) template for people to use if they prefer to catalogue in this way. We want to make sure the template is user friendly. We have included a column named ‘Genres/Forms’ and in the end we have simply made it a descriptive field without trying to structure or control the content. We will not try to add the content to our indexes, because of this complication of turning the text into structured data, and because we are not sure that it is really all that useful for the reasons outlined above.

Somewhat related to this, the new EAD standard, EAD3, has rather unhelpfully removed the sub-categories of ‘physical description‘, which are ‘extent’, ‘genreform’, ‘dimensions’ and ‘physfacet’ so that they all have to be bundled into just one field. Either that or you have to add a structured physical description which requires you to add a value from a list: carrier, material type, space occupied or other physdesc structured type (which asks you to then add the ‘other’ type). I can just imagine going back to all our contributors and asking them to add a type to all their physical description information! If we move to EAD3, we would remove the demarcation that tells us the information is about the genre/form or about the extent. This is potentially a deal breaker for us adopting EAD3, as taking away structure that is already there seems like madness. You could argue that simply having one free text field for physical description gets us off the hook with our attempts to work with the data (e.g. potentially using extent to provide a search to help convey the size of collections to users) – if it was completely unstructured then any attempt to analyse and present it differently would be impossible. However, just the process of putting these sub-fields together into one field would actually be extremely difficult due to the fact that different institutions have different patterns of data input. ISAD(G), the archival standard for description, doesn’t refer to form or genre at all, but recommends adding extent and medium, such as ’42 photographs’ or ‘330 files’, or else adding the overall storage space, such as 20 cubic metres. It doesn’t really go in for promoting structured data.

For those interested, here is a breakdown of the genre/form entries that have been used at least 10 times, just to give an idea of some common terms (though most entries include several types, so they will not appear in this list):

List of genre and form types used in the Archives Hub

(Corrrespondence may be down to a rather extensive cut and paste error).

I’m not going to get into the thorny issue of what ‘genre’ is and what ‘form’ is. They were put together in EAD, whilst ISAD(G) doesn’t use these terms at all, but refers to ‘medium’. The distinction seems very blurred, and there are many archivists who will have more idea of what the definitions are than I do. I think it is very much open to interpretation for individual cataloguers – so we have entries like ‘small boxes’, ‘New Orleans-style jazz’ and ‘Museum administration’ and ‘social history’ as well as ‘personal papers’, ‘manuscripts’, ‘typescripts’ and ‘sound’.

In the end genre/form is a field that seems potentially very useful – the idea that researchers can search for maps, or prints, drawings or postcards, CDs or tape, is appealing, but in reality, we have never really prioritised this information in our catalogues. In our machine learning project, just kicking off, we may explore the possibility of interrogating descriptions to potentially add genre/form. It would be interesting to see how well this works. But I wouldn’t bet my house on it…or even my outhouse – the narrative style of most catalogues is likely to hinder any effective identification of material types.

We would love to hear from you if you utilise this field. Do you think it is useful? Do you try to add a comprehensive list of genres/forms? Do you think that researchers really want to search by material type?

Archives Hub Search Analysis

February 29, 2016 / Jane Stevenson

Search logs can give us an insight into how people really search. Our current system provides ‘search logs’ that show the numbers based on the different search criteria and faceting that the Hub offers, including combined searches. We can use these to help us understand how our users search and to give us pointers to improve our interface.

The Archives Hub has a ‘default search’ on the homepage and on the main search page, so that the user can simply type a search into the box provided. This is described as a keyword search, as the user is entering their own significant search terms and the results returned include any archival description where the term(s) are used.

The researcher can also choose to narrow down their search by type. The figure below shows the main types the Archives Hub currently has. Within these types we also have boolean type options (all, exact, phrase), but we have not analysed these at this point other than for the main keyword search.

Archives Hub search box showing the types of searches available

There are caveats to this analysis.

1. Result will include spiders and spam

With our search logs, excluding bots is not straightforward, something which I refer to in a previous post: Archives Logs and Google Analytics. We are shortly to migrate to an entirely new system, so for this analysis we decided to accept that the results may be slightly skewed by these types of searches. And, of course, these crawlers often perform a genuine service, exposing archive descriptions through different search engines and other systems.

2. There are a small number of unaccounted for searches

Unidentified searches only account for 0.5% of the total, and we could investigate the origins of these searches, but we felt the time it would take was not worth it at this point in time.

3. Figures will include searches from the browse list.

These figures include searches actioned by clicking on a browse list, e.g. a list of subjects or a list of creators.

4. Creator, Subject and Repository include faceted searching

The Archives Hub currently has faceted searching for these entities, so when a user clicks to filter down by a specific subject, that counts as a subject search.

Results for One Month (October 2015)

For October 2015 the total searches are 19,415. The keyword search dominates, with a smaller use of the ‘any’ and ‘phrase’ options within the keyword search. This is no surprise, but this ‘default search’ still forms only 36% of the whole, which does not necessarily support the idea that researchers always want a ‘google type’ search box.

We did not analyse these additional filters (‘any/phrase/exact’) for all of the searches, but looking at them for ‘keyword’ gives a general sense that they are useful, but not highly used.

A clear second is search by subject, with 17% of the total. The subject search was most commonly combined with other searches, such as a keyword and further subject search. Interestingly, subject is the only search where a combined subject + other search(es) is higher than a single subject search. If we look at the results over a year, the combined subject search is by far the highest number for the whole year, in fact it is over 50% of the total searches. This strongly suggests that bots are commonly responsible for combined subject searches.

These searches are often very long and complex, as can be seen from the search logs:

[2015-09-17 07:36:38] INFO: 94.212.216.52:: [+0.000 s] search:: [+0.044 s] Searching CQL query: (dc.subject exact “books of hours” and/cql.relevant/cql.proxinfo (dc.subject exact “protestantism” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “authors, classical” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “law” and/cql.relevant/cql.proxinfo (dc.subject exact “poetry” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “sermons” and/cql.relevant/cql.proxinfo bath.personalname exact “rawlinson richard 1690-1755 antiquary and nonjuror”))))))))):: [+0.050 s] 1 Hits:: Total time: 0.217 secs

It is most likely that the bots are not nefarious; they may be search engine bots, or they may be indexing for the purposes of information services of some kind, such as bibliographic services, but they do make attempts to assess the value of the various searches on the Hub very difficult.

Of the remaining search categories available from the main search page, it is no surprise that ‘title’ is used a fair bit, at 6.5%, and then after that creator, name, and organisation and personal name. These are all fairly even. For October 2015 they are around 3% of the total each, and it seems to be similar for other months.

The repository filter is popular. Researchers can select a single repository to find all of their descriptions (157), select a single repository and also search terms (916), and also search for all the descriptions from a single repository from our map of contributors (125). This is a total of 1,198, which is 6.1% of the total. If we also add the faceted filter by repository, after a search has been carried out, the total is 2,019, and the percentage is 10.4%. Looking at the whole year, the various options to select repository become an even bigger percentage of the total, in particular the faceted filter by repository. This suggests that improvements to the ability to select repositories, for example, by allowing researchers to select more than one repository, or maybe type of repository, would be useful.

Google Map on the Hub showing the link to search by contributor

We have a search within multi-level descriptions, introduced a few years ago, and that clearly does get a reasonable amount of use, with 1,404 uses in this particular month, or 7.2% of the total. This is particularly striking as this is only available within multi-level descriptions. It is no surprise that this is valuable for lengthy descriptions that may span many pages.

The searches that get minimal use are identifier, genre, family name and epithet. This is hardly surprising, and illustrates nicely some of the issues around how to measure the value of something like this.

Identifier enables users to search by the archival reference. This may not seem all that useful, but it tends to be popular with archivists, who use the Hub as an administrative tool. However, the current Archives Hub reference search is poor, and the results are often confusing. It seems likely that our contributors would use this search more if the results were more appropriate. We believe it can fulfill this administrative function well if we adjust the search to give better quality results; it is never likely to be a highly popular search option for researchers as it requires knowledge of the reference numbers of particular descriptions.

Epithet is tucked away in the browse list, so a ‘search’ will only happen if someone browses by epithet and then clicks on a search result. Would it be more highly used if we had a ‘search by occupation or activity’? There seems little doubt of this. It is certainly worth considering making this a more prominent search option, or at least getting more user feedback about whether they would use a search like this. However, its efficacy may be compromised by the extremely permissive nature of epithet for archival descriptions – the information is not at all rigorous or consistent.

Family name is not provided as a main search option, and is only available by browsing for a family name and clicking on a result, as with epithet. The main ‘name’ search option enables users to search by family name. We did find the family name search was much higher for the whole year, maybe an indication of use by family historians and of the importance of family estate records.

Genre is in the main list of search options, but we have very few descriptions that provide the form or medium of the archive. However, users are not likely to know this, and so the low use may also be down to our use of ‘Media type’, which may not be clear, and a lack of clarity about what sort of media types people can search for. There is also, of course, the option that people don’t want to search on this facet. However, looking at the annual search figures, we have 1,204 searches by media type, which is much more significant, and maybe could be built up if we had something like radio buttons for ‘photographs’, ‘manuscripts’, ‘audio’ that were more inviting to users. But, with a lack of categorisation by genre within the descriptions that we have, a search on genre will mean that users filter out a substantial amount of relevant material. A collection of photographs may not be catalogued by genre at all, and so the user would only get ‘photographs’ through a keyword search.

Place name is an interesting area. We have always believed that users would find an effective ‘search by place’ useful. Our place search is in the main search options, but most archivists do not index their descriptions by place and because of this it does not seem appropriate to promote a place name search. We would be very keen to find ways to analyse our descriptions and consider whether place names could be added as index terms, but unless this happens, place name is rather like media type – if we promote it as a means to find descriptions on the Archives Hub, then a hit list would exclude all of those descriptions that do not include place names.

This is one of the most difficult areas for a service like the Archives Hub. We want to provide search options that meet our users’ needs, but we are aware of the varied nature of the data. If a researcher is interested in ‘Bath’ then they can search for it as a keyword, but they will get all references to bath, which is not at all the same as archives that are significantly about Bath in Gloucestershire. But if they search for place name: bath, then they exclude any descriptions that are significantly about Bath, but not indexed by place. In addition, words like this, that have different meanings, can confuse the user in terms of the relevance of the results because ‘bath’ is less likely to appear in the title. It may simply be that somewhere in the description, there is a reference to a Dr Bath, for example.

This is one reason why we feel that encouraging the use of faceted search will be better for our users. A more simple initial search is likely to give plenty of results, and then the user can go from there to filter by various criteria.

It is worth mentioning ‘date’ search. We did have this at one point, but it did not give good results. This is partly due to many units of description not including normalised dates. But the feedback that we have received suggests that a date search would be popular, which is not surprising for an archives service. We are planning to provide a filter by date, as well as the ordering by date that we currently have.

Finally, I was particularly interested to see how popular our ‘search collection level only’ is. This enables users to only see ‘top level’ results, rather than all of the series and items as well. As it is a constant challenge to present hierarchical descriptions effectively, this would seem to be one means to simplify things. However, for October 2015 we had 17 uses of this function, and for the whole year only 148. This is almost negligible. It is curious that so few users chose to use this. Is it an indication that they don’t find it useful, or that they didn’t know what it means? We plan to have this as a faceted option in the future, and it will be interesting to see if that makes it more popular or not.

We are considering whether we should run this exercise using some sort of filtering to check for search engines, dubious IP addresses, spammers, etc., and therefore get a more accurate result in terms of human users. We would be very interested to hear from anyone who has undertaken this kind of exercise.

Cataloguing Matters: being part of the information landscape

November 9, 2015 / Jane Stevenson

Consider the following questions, which use the topic of design history, but could be for any topic area:

Where did this person work?
Who did they know?
What can I find out about furniture design in London in the early 20th century?
Who designed this early 20th century chair?
Did the designer feature in this exhibition?
Can I find photographs of this section of the exhibition?
Did these designers both feature in this design exhibition?
Who influenced this designer?

These are surely typical questions for researchers. They are the sort of questions we thought about when we were working on Exploring British Design. But these questions do not start with the archive. An archive collection may well hold many answers for these questions, but there remains the problem of connecting these questions to the archive: ‘I’m interested in this designer/in this chair that they designed/in this exhibition they designed for’. We need these questions to lead to archival sources when appropriate.

Which comes first, the research question or the archive? We tend to assume the archive, but for many researchers the question comes first and the archive is, at that point, not known to them. We become a little fish in a very big pond when we enter the Web, and so we need to find ways for researchers who may not be aware of us to hook our collections.

Wellcome Library, London: Fishing

But how do we achieve this? We often have to work within considerable constraints and we cannot simply transform our cataloguing practices. There are many technical solutions that we can potentially make use of, but I think the first challenge lies with the data itself.

Individual archive repositories may think that there is no way they can move into a Linked Data world of RDF and triples and persistent URIs. But I think there are steps that we could take to help our descriptive data better fit this kind of landscape and have the potential to be more ‘linked in’ to other sources and to researchers’ paths of discovery.

Our descriptive practices tend to treat collections as stand-alone entities, rather than integrated parts of a whole information landscape. We do not think enough about cataloguing in a way that facilitates an integrated approach where we can connect to other information resource and allow researchers to come to archives through searching for people, organisations, subjects, events and places, and to come to them as part of a whole network of resources. I think we need to try to think about providing potential ‘connectors’ within our descriptions that will allow them to be hooked into the landscape more effectively.

Here are some thoughts about how we can help to achieve this:

Be consistent when cataloguing

This may sound straightforward, but having looked at thousands of descriptions from well over two hundred archive repositories, I can announce that it doesn’t always happen. For instance, simply entering the name of the repository in the same way for every catalogue entry, and not adding the repository code as part of the repository name, for example, ensures that the repository is always correctly identified. This means that all of the collection descriptions can be clearly identified as being from the same institution.

When entering names, think about how you structure them. Many archive systems provide a means to store and link to names, and yet it is amazing how often one name varies. If nothing else, think about adding life dates to a name, which helps with unique identification.

Here on the Archives Hub we have to hold our hands up for a potentially unhelpful practice up till now of encouraging the creator name to be entered in one way under ‘name of creator’ and another way as an index term. This reflects a time when we were less focussed on machine processing. Of course, it is much better to enter the name consistently, so that the connection can clearly be made.

Try to use rules and standards

Whilst I have increasingly become somewhat frustrated by our standards, I still think there is a role for standards to encourage consistency, clarify meaning and help draw things together. If you enter subjects using UKAT or LCSH, then try to ensure that all your terms really do come from these thesauri. We get examples where the thesaurus is named, but the subject is not actually from the thesaurus.

When entering things like language codes (ISO standard 639), take a few minutes just to find out about them. They need to be lower case, to be consistent. It’s worth having an understanding of what you are doing and why.

Do the same for dates. Think about what a normalised date is and why it is important. It is really worth having a sense of why these things matter. ISO8601: “The purpose of this standard is to provide an unambiguous and well-defined method of representing dates and times”. That sounds perfect for archives, and well worth adopting.

Think about the questions people ask

One of the advantages of indexing is that it gives you a chance to consider the terms you can best use to ‘advertise’ your collection. It is really good to if the scope and content text is clearly reflected in the index terms. If your archive is about a designer and you have described things they have designed, then you might realise that you haven’t used ‘furniture designer’ in your text, but this is a good index term to use. Or maybe the archive is really useful for those looking at the history of design education, but you haven’t yet actually used the term ‘design education’ in your text. You can add it as an index term.

Try to think beyond the UK

This may again sound obvious, but unfortunately NCA Rules don’t exactly encourage this, even if they don’t prevent it. When indexing by place name, add the county and certainly add the country. Indexing really helps us think about this. Consider a typical biographical history entry:

Charles Edward Sayle was born in Cambridge on 6 December 1864. He entered New College, Oxford, in 1883, and St John’s College, Cambridge, in 1890.

It wouldn’t look quite right putting:

Charles Edward Sayle was born in Cambridge, England, on 6 December 1864. He entered New College, Oxford, England, in 1883, and St John’s College, Cambridge, England, in 1890 [etc]

And whilst it would help with identification, it is still unstructured. Much better to ensure you have:

Place name: Cambridge, England

in your structured index terms. True, there are plenty of ways to mark this up, some better than others, but at least having an index term entered in this way is a great help with uniquely identifying the Cambridge that you mean, as opposed to the one in the US or New Zealand, Australia or Jamaica. It means that a researcher who is researching a topic around ‘Cambridge, England’ is more likely to find your description.

If you put your place names into some kind of specific field, try to avoid something like ‘Cardiff, Merthyr Tydfil, Cambridge’ (an example from the Hub) or ‘Cambridge etc’ (another example). So if your cataloguing software provides just one box for place, repeat that box for each place, rather than putting a load of them into one entry. If it allows for something like place name and country, use that to your advantage.

Put index terms into the most appropriate categories

On the Hub we’ve had plenty of family names as personal names and personal names as corporate names. But most commonly we get genre as subject. If your archive contains photographs, press cuttings, preliminary sketches or parchment (animal material) and you want to make this known, you need to index by material type, not by subject ….unless the archive is about photographs or about parchment.

Think Reference!

Your references should give the ability to uniquely identifying every part of a hierarchical archival description. Think carefully about adding them and make sure you have a unique reference at every level of description. Some archival software will automatically generate references, which does help because then they will be consistent and unique. But if you add them manually, it is very easy to make a mistake. On the Hub we found several hundred duplicated ‘unique references’ when we went through a major exercise to clean up references. And I do mean several hundred, not one or two.

Optimise titles for search engines

All of this structured data really needs the boost of descriptions that work well on the Web, simply in terms of being discoverable. SEO, or search engine optimisation, is a big topic, and your system may not allow you to make many changes, but in terms of the actual data, do think about appropriate titles, which are a really good hook. Include the most significant words, make sure that are not too long (so they are easy to scan for a researcher looking through various sources) and make titles at collection level self-explanatory. At lower levels this is not such a problem, as it is possible to append the collection title, to help with interpretation.

Look at how your data exports

This may not be possible or practical for everyone, but if you can do it I think it is a really good indicator of how interoperable your data might be to see how it looks when it is exported, because that is when you are removing it from the comfort of your own familiar system, and potentially unleashing it into the world.

The Event Horizon

I am particularly interested at the moment in events. History is made up of events. Who’d have thought it, when our standards barely mention them!

EAD (Encoded Archival Description), our XML standard for descriptions, doesn’t allow for events to be indexed at all. CIDOC CRM is “a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation” and it is event-based. It may be a complicated beast, not much used within our domain, and many of you won’t be aware of it, but there is certainly something to be said for putting events at the forefront of how we think.

The new EAC-CPF standard allows for chronologies in the biographical section of the description, but this is more about narrative (a list of events as part of one name authority description). It is based partly on ISAAR(CPF) which states “Record in narrative form or as a chronology the main life events, activities, achievements and/or roles of the entity being described.” Narrative form is fine for a researcher who is perusing the description, but structured form that is not necessarily so closely tied to narrative is what is required for more integration of data.

People, organisations, places and subjects are all linked by events – that is where the connections and the stories are. I think it is a major shortcoming of our standards that events are not given more prominence.

I mention this here really just to say that whilst we can improve our descriptions, and think about consistency and indexing, I do think that there are things around cataloguing that may require a more fundamental change.

Research Questions

Coming back to the initial questions I asked. Our cataloguing can help to ensure that archives are networked in to the broader information landscape. This is not about whether the archive can in principle answer these questions. It is about whether a researcher asking these questions (maybe initially through Google, maybe through other generic channels) can discover the archive as one source amongst many:

Where did this person work?
Uniquely identify the place(s).

Who did they know?
Make sure names are consistent and try to make them unambiguous.

What can I find out about furniture design in London in the early 20th century?
Use appropriate index terms and think about how to add dates consistently.

Who designed this early 20th century chair?
Adding index terms helps to draw out significant names and concepts

Did the designer feature in this exhibition?
Entering the exhibition name and designer name as structured data will help.

Can I find photographs of this section of the exhibition?
Make sure you have the exhibition name clearly stated and think about using index terms for formats – these are a great means for researchers to find types of material. But remember, this is ‘photographs’ as a genre or form, not as a subject!

Did these designers both feature in this design exhibition?
Not so easy, as we don’t index by event type

Who influenced this designer?
Not so easy, as we don’t tend to provide structured relationship information to connect people other than in the broadest possible way (these two people were ‘associated’). But that is another story…

To my mind, cataloguing is a skill, and it is really worth thinking about what you are cataloguing and how you catalogue it carefully. It is more important to think about this now than it was 30 years ago, because 30 years ago we were working with with narrative descriptions and index cards. Now we want our data to be interconnected.

EAD and Next Generation Discovery

January 3, 2014 / Jane Stevenson / 2 Comments

This post is in response to a recent article in Code4Lib, ‘Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems‘ by M. Bron, M. Proffitt and B. Washburn. All quotes are from that article, which looked at the instances of tags within ArchiveGrid, the US based archival aggregation run by OCLC. This post compares some of their findings to the UK based Archives Hub.

Date

In the ArchivesGrid analysis, the <unitdate> field use is around 72% within the high-level (usually collection level) description. The Archives Hub does significantly better here, with an almost universal inclusion of dates at this level of description. Therefore, a date search is not likely to exclude any potentially relevant descriptions. This is important, as researchers are likely to want to restrict their searches by date. Our new system also allows sorting retrieved results by date. The only issue we have is where the dates are non-standard and cause the ordering to break down in some way. But we do have both displayed dates and normalised dates, to enable better machine processing of the data.

Collection Title

“for sorting and browsing…utility depends on the content of the element.”

Titles are always provided, but they are very varied. Setting aside lower-level descriptions, which are particularly problematic, titles may be more or less informative. We may introduce sorting by title, but the utility of this will be limited. It is unlikely that titles will ever be controlled to the extent that they have a level of consistency, but it would be fascinating to analyse titles within the context of the ways people search on the Web, and see if we can gauge the value of different approaches to creating titles. In other words, what is the best type of title in terms of attracting researchers’ attention, search engine optimisation, display within search engine results, etc?

Lower-level descriptions tend to have titles such as ‘Accounts’, ‘Diary’ or something more difficult to understand out of context such as ‘Pigs and boars’ or ‘The Moon Dragon’. It is clearly vital to maintain the relationship of these lower-level descriptions to their parent level entries, otherwise they often become largely meaningless. But this should be perfectly possible when working on the Web.

It is important to ensure that a researcher finding a lower-level description through a general search engine gets a meaningful result.

Archives Hub search result from a Google search — A search result within Google

The above result is from a search for ‘garrick theatre archives joanna lumley’ – the sort of search a researcher might carry out. Whilst the link is directly to a lower -level entry for a play at the Garrick Theatre, the heading is for the archive collection. This entry is still not ideal, as the lower-level heading should be present as well. But it gives a reasonable sense of what the researcher will get if they click on this link. It includes the <unitid> from the parent entry and the URL for the lower-level, with the first part of the <scopecontent> for the entry. It also includes the Archives Hub tag line, which could be considered superfluous to a search for Garrick Theatre archives! However, it does help to embed the idea of a service in the mind of the researcher – something they can use for their research.

Extent

“It would be useful to be able to sort by size of collection, however, this would require some level of confidence that the <extent> tag is both widely used and that the content of the tag would lends itself to sorting.”

This was an idea we had when working on our Linked Data output. We wanted to think about visualizations that would help researchers get a sense of the collections that are out there, where they are, how relevant they are, and so on. In theory the ‘extent’ could help with a weighting system, where we could think about a map-based visualization showing concentrations of archives about a person or subject. We could also potentially order results by size – from the largest archive to the smallest archive that matches a researchers’ search term. However, archivists do not have any kind of controlled vocabulary for ‘extent’. So, within the Archives Hub this field can contain anything from numbers of boxes and folders to length in linear metres, dimensions in cubic metres and items in terms of numbers of photographs, pamphlets and other formats. ISAD(G) doesn’t really help with this; the examples they give simply serve to show how varied the description of extent can be.

Genre

“Other examples of desired functionality include providing a means in the interface to limit a search to include only items that are in a certain genre (for example, photographs)”.

This is something that could potentially be useful to researchers, but archivists don’t tend to provide the necessary data. We would need descriptions to include the genre, using controlled vocabulary. If we had this we could potentially enable researchers to select types of materials they are interested in, or simply include a flag to show, e.g. where a collection includes photographs.

The problem with introducing a genre search is that you run the risk of excluding key descriptions, because the search will only include results where the description includes that data in the appropriate location. If the word ‘photograph’ is in the general description only then a specific genre search won’t find it. This means a large collection of photographs may be excluded from a search for photographs.

Subject

In the Bron/Proffitt/Washburn article <controlaccess> is present around 72% of the time. I was surprised that they did not choose to analyse tags within <controlaccess> as I think these ‘access points’ can play a very important role in archival descrpition. They use the presence of <controlaccess> as an indication of the presence of subjects, and make the point that “given differences in library and archival practices, we would expect control of form and genre terms to be relatively high, and control of names and subjects to be relatively low.”

On the Archives Hub, use of subjects is relatively high (as well as personal and corporate names) and use of form and genre is very low. However, it is true to say that we have strongly encouraged adding subject terms, and archivists don’t generally see this as integral to cataloguing (although some certainly do!), so we like to think that we are partly responsible for such a high use of subject terms.

Subject terms are needed because they (1) help to pull out significant subjects, often from collections that are very diverse, (2) enable identification of words such as ‘church’ and ‘carpenter’ (ie. they are subjects, not surnames), (3) allow researchers to continue searching across the Archives Hub by subject (subjects are all linked to the browse list) and therefore pull collections together by theme (4) enable advanced searching (which is substantially used on the Hub).

Names (personal and corporate)

In Bron/Proffitt/Washburn the <origination> tag is present 87% of the time. The analysis did not include the use of <persname> and <corpname> within <origination> to identify the type of originator. In the Archives Hub the originator is a required field, and is present 99%+ of the time. However, we made what I think is a mistake in not providing for the addition of personal or corporate name identification within <origination> via our EAD Editor (for creating descriptions) or by simply recommending it as best practice. This means that most of our originators cannot be distinguished as people or corporate bodies. In addition, we have a number where several names are within one <origination> tag and where terms such as ‘and others’, ‘unknown’ or ‘various’ are used. This type of practice is disadvantageous to machine processing. We are looking to rectify it now, but addressing something like this in retrospect is never easy to do. The ideal is that all names within origination are separately entered and identified as people or organisations.

We do also have names within <controlaccess>, and this brings the same advantages as for <subjects>, ensuring the names are properly structured, can be used for searching and for bringing together archives relating to any one individual or organisation.

Repository

“Use of this element falls into the promising complete category (99.46%: see Table 7). However, a variety of practice is in play, with the name of the repository being embellished with <subarea> and <address> tags nested within <repository>.”

On the Archives Hub repository is mandatory, but as yet we do not have a checking system whereby a description is rejected if it does not contain this field. We are working towards something like this, using scripts to check for key information to help ensure validity and consistency at least to a minimum standard. On one occasion we did take in a substantial number of descriptions from a repository that omitted the name of repository, which is not very useful for an aggregation service! However, one thing about <repository> is that it is easy to add because it is always the same entry. Or at least it should be….we did recently discovery that a number of repositories had entered their name in various ways over the years and this is something we needed to correct.

Scope and content, biographical history and abstract

It is notable that in the US <abstract> is widely used, whereas we don’t use it at all. It is intended as a very brief summary, whereas <scopecontent> can be of any length.

“For search, its worth noting that the semantics of these elements are different, and may result in unexpected and false “relevance””

One of the advantages of including <controlaccess> terms is to mitigate against this kind of false relevance, as a search for ‘mason’ as a person and ‘mason’ as a subject is possible through restricted field searching.

The Bron/Proffitt /Washburn analysis shows <bioghist> used 70% of the time. This is lower than the Archives Hub, where it is rare for this field not to be included. Archivists seem to have a natural inclination to provide a reasonably detailed biographical history, especially for a large collection focussed on one individual or organisation.

Digital Archival Objects

It is a shame that the analysis did not include instances of <dao>, but it is likely to be fairly low (in line with previous analysis by Wisser and Dean, which puts it lower than 10%). The Archives Hub currently includes around 1,200 instances of images or links to digital content. But what would be interesting is to see how this is growing over time and whether the trajectory indicates that in 5 years or so we will be able to provide researchers with routes into much of the Archives Hub content. However, it is worth bearing in mind that many archives are not digitised and are not likely to be digitised, so it is important for us not to raise expectations that links to digital content will become a matter of course.

The Future of Discovery

“In order to make EAD-encoded finding aids more well suited for use in discovery systems, the population of key elements will need to be moved closer to high or (ideally) complete.”

This is undoubtedly true, but I wonder whether the priority over and above completeness is consistency and controlled vocabulary where appropriate. There is an argument in favour of a shorter description, that may exclude certain information about a collection, but is well structured and easier to machine process. (Of course, completeness and consistency is the ideal!).

The article highlights geo-location as something that is emerging within discovery services. The Archives Hub is planning on promoting this as an option once we move to the revised EAD schema (which will allow for this to be included), but it is a question of whether archivists choose to include geographical co-ordinates in their catalogues. We may need to find ways to make this as easy as possible and to show the potential benefits of doing so.

In terms of the future, we need a different perspective on what EAD can and should be:

“In the early days of EAD the focus was largely on moving finding aids from typescript to SGML and XML. Even with much attention given over to the development of institutional and consortial best practice guidelines and requirements, much work was done by brute force and often with little attention given to (or funds allocated for) making the data fit to the purpose of discovery.”

However, I would argue that one of the problems is that archivists sometimes still think in terms of typescript finding aids; of a printed finding aid that is available within the search room, and then made available online….as if they are essentially the same thing and we can use the same approach with both. I think more needs to be done to promote, explain and discuss ‘next generation finding aids’. By working with Linked Data, I have gained a very different perspective on what is possible, challenging the traditional approach to hierarchical finding aids.

Maybe we need some ‘next generation discovery’ workshops and discussions – but in order to really broaden our horizons we will need to take heed of what is going on outside of our own domain. We can no longer consider archival practice in isolation from discovery in the most general sense because the complexity and scale of online discovery requires us to learn from others with expertise and understanding of digital technologies.

Supporting Historians: responding to changing research practices

February 25, 2013 / Jane Stevenson

This post picks out some highlights from a report from Ithaka S+R, “Supporting the Changing Research Practices of Historians” by Roger C Schonfeld and Jennifer Rutner (December 2012). It concentrates on findings that are of particular relevance for archivists and for discovery. The report is recommended reading. It is a US study, but clearly there are strong similarities with other countries.

The report finds that underlying research methods are still broadly as they were but practices have changed considerably: “Based on interviews with dozens of historians, librarians, archivists, and other support services providers, this project has found that the underlying research methods of many historians remain fairly recognizable even with the introduction of new tools and technologies, but the day to day research practices of all historians have changed fundamentally.”

It goes on to summarise the improvements that archives might make to meet changing needs, none of which are unexpected: “For archives, we recommend ongoing improvements to access through improved finding aids, digitization, and discovery tool integration, as well as expanded opportunities for archivists to help historians interpret collections, to build connections among users, and to instruct PhD students in the use of archives.”

It is very encouraging to see the positive comments about researchers’ interactions with archivists: “Having a meeting with the archivist and librarian is really fantastic, because they help you understand what is in the archive, and what you might be able to use.” It is clear from the study that archivists have a vital role to play as key collaborators and colleagues of historians, and their value is clear: “Archivists are often able  to hone and direct an inquiry, bringing to light items and collections that the researcher may have been unaware of.”

The study does highlight the changing nature of interactions with archival material, as a result of the use of digital cameras in particular, which enables the analytical work to take place elsewhere. It is generally felt to be a convenient and time-saving option, enabling long-term interaction with resources outside of the reading room. This development is actually described as “the single most significant shift in research practices among historians.” It raises questions about whether the role of the archivist changes when the analytical work is displaced from the archive, as archivists may have less opportunity for intellectual engagement with researchers. The study does highlight a possible issue with digital copies, namely the separation of metadata from content, where the researcher has hundreds of images and needs to organise them constructively, and it also found that scholars are struggling to work with digitised non-textual content effectively.

The ability to find time for research trips was a primary challenge for many researchers. “Interviewees repeatedly emphasized that the amount of time they are able to spend in the archives shapes the nature of the interaction with the sources significantly.” Because most struggle to find time for research trips, digitised sources are hugely beneficial.

The study found that digitised finding aids help researchers to “travel more strategically”. It suggests that high-quality finding aids may become more important as researchers move more towards photographic visits to archives, rather than serendipitous visits. This connection is something I have not thought about before, and I would be very interested to hear what archivists think about this idea.

Of major relevance for a service like the Archives Hub is the conclusion about finding aids:

“The use of online finding aids greatly facilitates, and sometimes displaces, these visits. If a “good” finding aid is readily available online, this might make a scouting visit unnecessary, depending on the importance of the archive to the research project. In some cases, researchers were able to rule out a visit to an archive based on the online finding aids, and re-purpose funds and effort to tracking down other sources for the project.”

This study is a clear endorsement for our belief (which, I should say, is also backed up by our own researcher surveys) that finding aids play a role not only in identifying and prioritising sources, but also in providing enough information in themselves to make a visit unnecessary. As well as this, they may have a kind of positive negative effect: the researcher knows that materials can be ruled out. The study strongly emphasised the need for “searchable databases” and “centralized searching” and participants talked about the problem with locating each collection independently, especially across the diverse types of archive repository: “The process of identifying archives – in some cases small, local archives or international archives – can present an amazing challenge to researchers.” Clearly comprehensive cross-searching search tools are a huge boon to researchers.

In terms of discovery, Google is clearly a major tool and there was a feeling that it was the most comprehensive discovery tool, as well as being convenient and easy to use. It is often used at the start of a searching process.: “Generally, historians discover finding aids through Google searches and archive websites.” There is a clear demand for more descriptions online: “The general consensus among interviewees was that more online finding aids would greatly benefit their research, and that archives should continue to make efforts to make these accessible online. Continued and expanded efforts to develop finding aids more efficiently and to make them available digitally would seem to support the needs of historians for improved access.”

In terms of PhD students (and maybe others who are inexperienced researchers), the study found issues with the use of archives and other sources:

“Interviews with PhD candidates indicated that there is often little support for them in learning about new research methods or practices, either in their department or elsewhere at their institution, of which they are aware. While the subject matter treated by historians continues to diversify dramatically, new methodologies develop, and research practices change rapidly, it is clearly critically important that students have a grounding in the methods and practices of the field.” The Archives Hub has recently produced a brief Guide to Using Archives for the Inexperienced, and discussions on the archives email list showed just how much this is an important topic for archivists and how there was a general consensus that PhD students need more training on research methodologies.

Summing up, the report makes six recommendations specifically for Archives:

1. More online finding aids
2. More digitisation
3. Discovery tools that promote cross-searching, crossing institutional boundaries and encompassing small and local record offices
4. Adequate resources for ensuring the expertise of the archivist continues to be available, enabling archivists to be active interpreters of the collections
5. Adapting to and facilitating the use of digital cameras and scanners in reading rooms
6. Training PhD students in the use of archives

There is a great deal more of interest and relevance in the report around searching, Google Scholar, the use of the academic library, organising and managing research, citation management and digital research methods. It is very well worth reading.

An evaluation of the use of archives and the Archives Hub

November 15, 2012 / Jane Stevenson

This blog is based upon a report written by colleagues at Mimas* presenting the results of the evaluation of our innovative Linked Data interface, ‘Linking Lives‘. The evaluation consisted of a survey and a focus group, with 10 participants including PhD students and MA students studying history, politics and social sciences. We asked participants a number of questions about the Archives Hub service, in order to provide context for their thoughts on the Linking Lives interface.

This blog post concentrates on their responses relating to the use of archives, methods of searching and interpretation of results. You can read more about their responses to the Linking Lives interface on our Linking Lives blog.

Use of Archives and Primary Source Materials

We felt that it was important to establish how important archives are to the participants in our survey and focus group. We found that “without exception, all of the respondents expressed a need for primary resources” (Evaluation report). One respondent said:

“I would not consider myself to be doing proper history if I wasn’t either reinterpreting primary sources others had written about, or looking at primary sources nobody has written about. It is generally expected for history to be based on primary sources, I think.” (Survey response)

One of the most important factors to the respondents was originality in research. Other responses included acknowledgement of how archives give structure to research, bringing out different angles and perspectives and also highlighting areas that have been neglected. Archives give substance to research and they enable researchers to distinguish their own work:

“Primary sources are very valuable for my research because they allow me to put together my own interpretation, rather than relying on published findings elsewhere.” (Survey response)

Understanding of Archives

It is often the case that people have different perceptions of what archives are, and with the Linking Lives evaluation work this was confirmed. Commonly there is a difference between social scientists and historians; the former concentrating on datasets (e.g. data from the Office of National Statistics) and the latter on materials created during a person’s life or the activities of an organisation and deemed worthy of permanently preserving. The evaluation report states:

“The participants that had a similar understanding of what an archive was to the Archive Hub’s definition had a more positive experience than those who didn’t share that definition.”

This is a valuable observation for the work of the Hub in a general sense, as well as the Linking Lives interface, because it demonstrates how initial perceptions and expectations can influence attitudes towards the service. In addition, the evaluation work highlighted another common fallacy: that an archive is essentially a library. Some of the participants in the survey expected the Archives Hub to provide them with information about published sources, such as research papers.

These findings highlight one of the issues when trying to evaluate the likely value of an innovative service: researchers do not think in the same language or with the same perspectives as information professionals. I wonder if we have a tendency to present services and interfaces modelled from our own standpoint rather than from the standpoint of the researcher.

Search Techniques and Habits

“Searches were often not particularly expansive, and participants searched for specific details which were unique to their line of enquiry” (Evaluation report). Examples include titles of women’s magazines, personal names or places. If the search returned nothing, participants might then broaden it out.

Participants said they would repeatedly return to archives or websites they were familiar with, often linked to quite niche research topics. This highlights how a positive experience with a service when it is first used may have a powerful effect over the longer term.

The survey found that online research was a priority:

“Due to conflicting pressures on time and economic resources, online searching was prevalent amongst the sample. Often research starts online and the majority is done online. Visits to see archives in person, although still seen as necessary, are carefully evaluated.” (Evaluation report)

The main resources participants used were Google and Google Scholar (the most ubiquitous search engines used) as well as The National Archives, Google Books and ESDS. Specialist archives were referred to relating to specific search areas (e.g. The People’s History Museum, the Wellcome Library, the Mass Observation Archive).

Thoughts and Comments About the Archives Hub

All participants found the Hub easy to navigate and most found locating resources intuitive. As part of the survey we asked the participants to find certain resources, and almost all of them provided the right answers with seemingly no difficulty.

“It is clear. The descent of folders and references at the top are good for referencing/orientating oneself. The descriptions are good – they obviously can’t contain everything that could be useful to everyone and still be a summary. It is similar to other archive searches so it is clear.” (Survey response, PhD history student)

The social scientists that took part in the evaluation were less positive about the Archives Hub than the historians. Clearly many social science students are looking for datasets, and these are generally not represented on the Hub. There was a feeling that contemporary sources are not well represented, and these are often more important to researchers in fields like politics and sociology. But overall comments were very positive:

“…if anyone ever asked about how to search archives online I’d definitely point them to the Archives Hub”.

“Useful. It will save me making specific searches at universities.”

Archives Hub Content

It was interesting to see the sorts of searches participants made. A search for ‘spatial ideas’ by one participant did not yield useful results. This would not surprise many archivists – collections are generally not catalogued to draw out such concepts (neither Unesco nor UKAT have a subject heading for this; LCSH has ‘spatial analysis’). However, there may well be collections that cover a subject like this, if the researcher is prepared to dig deep enough and think about different approaches to searching. Another participant commented that “you can’t just look for the big themes”. This is the type of search that might benefit from us drawing together archive collections around themes, but this is always a very flawed approach. This is one reason that we have Features, which showcase archives around subjects but do not try to provide a ‘comprehensive’ view onto a subject.

This kind of feedback from researchers helps us to think about how to more effectively present the Archives Hub. Expectations are such an important part of researchers’ experiences. It is not possible to completely mitigate against expectations that do not match reality, but we could, for example, have a page on ‘The Archives Hub for Social Scientists’ that would at least provide those who looked at it with a better sense of what the Hub may or may not provide for them (whether anyone would read it is another matter!).

This survey, along with previous surveys we have carried out, emphasises the importance of a comprehensive service and a clear scope (“it wasn’t clear to me what subjects or organisations are covered”). However, with the nature of archives, it is very difficult to give this kind of information with any accuracy, as the collections represented are diverse and sometimes unexpected. in the end you cannot entirely draw a clear line around the scope of the Archives Hub, just like you cannot draw a clear line around the subjects represented in any one archive. The Hub also changes continuously, with new descriptions added every week. Cataloguing is not a perfect art; it can draw out key people, places, subjects and events, but it cannot hope to reflect everything about a collection, and the knowledge a researcher brings with them may help to draw out information from a collection that was not explicitly provided in the description. If a researcher is prepared to spend a bit of time searching, there is always the chance that they may stumble across sources that are new to them and potentially important:

“…another student who was mainly focused on the use of the Kremlin Archives did point out that [the Archives Hub] brought up the Walls and Glasier papers, which were new to [them]”.

Even if you provide a list of subjects, what does that really mean? Archives will not cover a subject comprehensively; they were not written with that in mind; they were created for other purposes – that is their strength in many ways – it is what makes them a rich and exciting resource, but it does not make it easy to accurately describe them for researchers. Just one series of correspondence may refer to thousands of subjects, some in passing, some more substantially, but archivists generally don’t have time to go through an entire series and draw out every concept.

If the Archives Hub included a description for every archive held at an HE institution across the UK, or for every specialist repository, what would that signify? It would be comprehensive in one sense, but in a sense that may not mean much to researchers. It would be interesting to ask researchers what they see as ‘comprehensive resources’ as it is hard to see how these could really exist, particularly when talking about unpublished sources.

Relevance of Search Results

The difficulties some participants had with the relevance of results comes back to the problem of how to catalogue resources that often cover a myriad of subjects, maybe superficially, maybe in detail; maybe from a very biased perspective. If a researcher looks for ‘social housing manchester’ then the results they get will be accurate in a sense – the machine will do its job and find collections with these terms, and there will be weighting of different fields (eg. the title will be highly weighted), but they still may not get the results they expect, because collections may not explicitly be about social housing in Manchester. The researcher needs to do a bit more work to think about what might be in the collection and whether it might be relevant. However, cataloguers are at fault to some extent. We do get descriptions sent to the Hub where the subjects listed seem inadequate or they do not seem to reflect the scope and content that has been provided. Sometimes a subject is listed but there is no sense of why it is included in the rest of the description. Sometimes a person is included in the index terms but they are not described in the content. This does not help researchers to make sense of what they see.

I do think that there are lessons here for archivists, or those who catalogue archives. I don’t think that enough thought is gives to the needs of the researcher. The inconsistent use of subject terms, for example, and the need for a description of the archive to draw out key concepts a little more clearly. Some archivists don’t see the need to add index terms, and think in terms of technologies like Google being able to search by keyword, therefore that is enough. But it isn’t enough. Researchers need more than this. They need to know what the collection is substantially about, they need to search across other collections about similar subjects. Controlled vocabulary enables this kind of exploratory searching. There is a big difference between searching for ‘nuclear disarmament’ as a keyword, which means it might exist anywhere within the description, and searching for it as a subject – a significant topic within an archive.

*Linking Lives Evaluation: Final Report (October 2012) by Lisa Charnock, Frank Manista, Janine Rigby and Joy Palmer

More Product, Less Processing?

January 5, 2012 / Jane Stevenson / 4 Comments

I’ve been reading a fascinating article by Mark A. Greene and Dennis Meissner, ‘More Product Less Process: Revamping Traditional Archival Processing‘ (PDF). I wanted to offer a summary of the article.

The essence of this article is that archivists spend too long processing collections (appraising, cataloguing and carrying out minor preservation). This approach is not working; the cataloguing backlog continues to increase. We are too conservative, cautious and set in our ways, and we need to think about a new approach to cataloguing that is more pragmatic and user-focussed. The article was written by archivists in the USA, but would seem to apply to archives here in the UK, where we know that the backlog is a continuing problem.

I think the article makes the argument well and with a good deal of conviction. The bottom line is that we must rethink our approach unless we are to continue to accrue backlogs and deny researchers access to hugely valuable primary source material.

However, there are arguments in support of detailed cataloguing. For digital archives it is extremely useful to provide metadata at the item level, enabling such useful resources as http://archiveshub.ac.uk/data/gb1837des-dca?page=3#id634580. With this detailed list, researches can see digital resources described and then access them directly. It could be argued that if a collection is to be digitised, providing this sort of level of metadata is appropriate, and in general it is the more valuable and highly used collections that are digitised. But for born-digital collections, this level of detail would be totally unsustainable.

Also, I wonder if the work that volunteers do should be taken into account – they may be able to help us catalogue in more detail, whilst trained archivists continue to create the main collection or series-level descriptions. I remember a whole band of NADFAS volunteers cataloguing photographs where I used to work. Furthermore, I was speaking to an archivist recently who said that they had taken the time to weed out duplicates (something this report criticises)…and then sold them on eBay for a tidy profit, that helped them fund their very under-resourced archive (they had the rights to do this!). So, maybe there are factors to take into consideration that support a detailed approach, but I think a bold approach to examining this whole area in UK archives would be very welcome.

Some of the points made in the report:

Archivists spend too much time cataloguing, not necessarily doing what is necessary. We think in terms of an ideal that we have to reach, although we haven’t actually articulated what this ideal is, and really examined it.
We are too attached to old-fashioned ways of doing things, which worked when we had smaller collections to deal with, but are not appropriate for large 20th century collections.
We give a higher priority to serving the needs of our collections rather than the needs of our users.
We need a new set of guidelines that focus on what we absolutely need to do.
We need to discuss, debate and examine our approach to cataloguing, and not be defensive about our roles.
We tend to arrange collections down to item level. In particular, we carry out preservation activities to this level. We accept the premise that basic preservation steps necessitate an item-level approach.
We often remove all metal fastenings and put materials into acid-free folders. So, even if we do not describe collections down to item level (maybe we just describe at collection or series level), we go down to this level of detail in our preservation activities. Yet, with good climate control, metal fasteners should not rust, and as yet we do not have strong evidence of a detrimental effect of standard manila folders if the materials is stored in a controlled environment.
We often weed out duplicates throughout a collection, which requires processing down to item level. Is this really worth doing?
The various sources of advice about the level of detail we process archives to are inconsistent. Some sources advocate description to series level, but preservation activities to item level. NARA advocates preservation in accordance with intrinsic value and anticipated use, so, for example, new folders should only be used if current ones are damaged, and metal fasteners should be removed only if ‘appropriate’ – meaning where they are causing obvious damage.
We seem to believe that we need to aspire to ‘a substantial, multi-layered, descriptive finding aid,’ a reflection of ‘slow, careful scholarly research’. But in reality, maybe we should adopt a more flexible approach, taking each collection in turn on its merits. Some may justify detailed cataloguing, but many do not.
We should take the position that users come to do research, and that we do not have to do this for them in advance.
We should ‘get beyond our absurd over-cautiousness’ about providing access to unprocessed collections, and make them available unless there are good legal or preservation reasons to restrict access or the collection is of extremely high value.
We have very inadequate processing metrics. Attempts to quantify processing expectations have resulted in wildly differing figures. Figures given in various studies include 3, 6.9, 8, 12.7 and 10.6 hours per cubic foot. Other studies have come up with between 3 and 5.5 days per foot.
One major study by an archive centre revealed 15.1 hours were spent on each cubic foot, far more than the value that was placed upon what was accomplished. The study gave ‘an improved sense of the real and total costs involved’.
The Greene/Meissner study looked at various projects funded by NHPRC grants (National Historical Publications & Records Committee), and found an average productivity figure of 9 hours per foot, but with highs of around 67 hours per foot. It also conducted an email survey and found expectations of processing times averaged at 14.8 hours, although there was a high of 250 hours!
Grant funding often encourages an item-level focus, rather than helping us to really tackle our substantial backlogs. There should be more of a requirement to justify meticulous processing – it should only be for exceptional collections.
The study recommends aiming for a processing rate of 4 hours per cubic foot for most large 20th century collections, using a series-level approach for description and preservation.
Studies show a lack of standardisation, not only in our definitions but also around the levels of arrangement, preservation and access that are useful and necessary. We do not have proper administrative controls over this work. We tend to argue for each of us having a unique situation, that does not allow for comparison, and we do not have a common sense of acceptibile policies and procedures.
Whilst we continue to process to item level, a substantial number do not make catalogues available through OPACs or Websites, arguably prioritising processing over user needs.

The report concludes that maybe we should recognise that ‘the use of archival records…is the ultimate purpose of identification and administration.’ (SAA, Planning for the Archival Profession, 1986). Maybe we should agree that a collection is catalogued if it ‘can be used productively for research.’ And maybe we should be willing to take a different approach for each collection, making choices and setting priorities, rather than being too caught up in a ‘love of craftmanship’ that could be seen as fastidiousness that does not truly serve the user.

The question seems to be how much would be lost by putting speed of processing before careful examination of all documents in a collection. Maybe this does require defining good cataloguing? Maybe we believe that our professional standing is tied up with undertaking detailed cataloguing…more so than the ever increasing growth of backlogs, where the papers are entirely unaccessible to researchers?

Greene and Meissner state that there should be a ‘golden minimum’ for processing, where we adequately address user needs and only go beyond this where there are demonstrable business reasons. They also believe that arrangement, description and preservation should all occur at the same level of detail, again, unless there are good reasons to deviate from this.

What do you think…?

Out and about or Hub contributor training

August 17, 2011 / Lisa

Every year we provide our contributors and potential contributors with free training on how to use our EAD editor software.

The days are great fun and we really enjoy the chance to meet archivists from around the UK and find out what they are working on.

The EAD editor has been developed so that archivists can create online descriptions of their collections without having to know EAD. It’s intuitive and user friendly and allows contributors to easily add collection level and multi-level descriptions to the Hub. Users can also enhance their descriptions by adding digital archival objects – images, documents and sound files.

Our training days are a mixture of presentation, demonstration and practical hands on. We (The training team consists of Jane, Beth and myself) tend to start by talking a little about Hub news and developments to set the scene for the day and then we move onto why the Hub uses EAD and why using standards is important for interoperability and means that more ‘stuff’ can be done with the data. We go from here on to a hands-on session that demonstrates how to create a basic record. We cover also cover adding lower level components and images and we show contributors how to add index terms to their descriptions. (Something that we heartily endorse! We LOVE standards and indexing!).

We always like to tailor our training to the users, and encourage users to bring along their own descriptions for the hands-on sessions. Some users manage to submit their first descriptions to the Hub by the end of the training session!

This year we have done training in Manchester and London, for the Lifeshare project team in Sheffield and for the Oxford colleges. We are also hoping (if we get enough take up) to run courses in Glasgow and Cardiff this year. (6^th Sept at Glasgow Caledonian, Cardiff date TBC. Email archiveshub@mimas.ac.uk to book a place)

So far this year three new contributors have joined the Hub as a result of training: Middle East Centre Archive, St Antony’s College, Oxford; Salford City Archive and the Taylor Institute, Oxford. We’ve also enabled four of our existing contributors to start updating their collections on the Hub: National Fairground Archive, the Co-operative Archive, St John’s College, Oxford and the V&A.

We have been given some great feedback this year and 100% of our attendees agreed/strongly agreed that they were satisfied with the content and teaching style of the course.

Some our feedback:

A very good introductory session to working with the EAD editor for the Archives Hub. I have not used the Archives Hub for a long time so an excellent refresher course.

This was a fantastic workshop – excellently designed resources, Lisa and Jane were really helpful (and patient!). The hands-on aspect was really useful: I now feel quite confident about creating EAD records for the Hub, and even more confident that the Hub team are on hand with online help

The hands on experience and being able to ask questions of the course leaders as things happened was really useful. Being able to work on something relevant to me was also a bonus.

Excellent presentation and delivery. I came along with a theoretical but not a practical knowledge of the Archives Hub and its workings, and the training session was pitched perfectly and was completely relevant to my job. Many thanks.

The Hub team train archivists how to use the EAD editor, archive students about EAD and Social media and research students in how to use the Hub to search for primary source materials. You can find our list of training that we provide on our training pages: http://archiveshub.ac.uk/trainingmodules/ . We’re always happy to hear from people who are interested in training – do let us know!

Opening up UK archives data (II)

May 11, 2010 / Jane Stevenson / 1 Comment

This is the second post relating to the recent UKAD meeting, concentrating on the brainstorming that took place around digital and digitised archives.

The driving forces that were identified:

Crowd-sourcing – metadata generation
Attracts funding
Promotes access
Open up wealth of possibility
Remain relevant
Meet user expectations
Centres of excellence in digitisation – common approach
Collections already digitised are hidden – in silos – return on investment
Potential to capture richer information about users
Potential to draw people in
Increasing ‘digitisation on demand’ – needs to be harnessed effectively
Increasing amount of born-digital media need to be made accessible online – drive to discoverability of digital materials
Changing profession – becoming more confident in this area as a result of above
Web makes it much easier

The group felt that it all added up to a resouding “we have to do this!”.

The resistors included:

Systems don’t talk to each other
Insufficient metadata of legacy digitised material – retroconversion – cost*
Copyright/IPR – complex, lots of local specificity
Work needed to marry user generated content and standard metadata
Community resistance to UGC
Vast amounts of content – prioritisation is intellectually challenging
Bulk digitisation is happening commercially – restricted rights
Clashes with business models – or perception that it does (e.g. models based on commercial digitisation assume increasing return on investment; the opposite may occur if the most commercially enticing material digitised first)
Fears – grounded in truth – could affect funding: diminish user/visitor numbers on site, diminishes value of on-site expertise
Challenges in bringing catalogue data and digital object systems together
Query: not ultimately cost effective
Cost
Web makes it easier – but it’s hard to keep up…

The group looked at actions that are required:

1. Accrue evidence of user demand and current behaviour

Identify user communities (family, academic, student researchers)
Secondary research of existing analysis
Market research
Produce cost-benefit analysis – impact on site visits?

2. Systems talking to each other

People talking to each other about systems!
Develop definitive list of systems in use – a picture of UK situation > crosswalks/maps between (see Library world)
Needs to cover both catalogue and digital object management systems
Discmap?

3. Copyright/IPR

Produce decision tree to help archivists make decisions – risk assessment but beware risk aversion
Encourage sharing of experience/lessons learned
Gathering what has already been done

4. Impact of digitised resources

Gather existing articles/research
Share practice in assessing impact in differing contexts

5. Metadata and costs

Establish costs of differing levels of metadata generation
Identify how much data needs to be converted into digital metadata (how much is not online?)

6. Identify quick wins!

Working together to create user cases and examples, sharing experience, getting onvolved in Resource Discovery Task Force and linking projects to this

Of course, the gathering of such evidence can help us to see where we are and where we need to go, and also how to get there. But implementation is quite another thing. The UKAD Network is hoping to build upon this work to encourage collaborative initiatives and the sharing of expertise and experiences. We are considering events and training opportunities that might help. We do feel that it will be useful to create a stronger presence for UKAD, as a means to provide a focus for this work, and we are looking at low-cost options to do this.

Opening up UK archives data (i)

April 16, 2010 / Jane Stevenson

On 14th April the UK Archives Discovery Network (UKAD) met in Manchester to discuss challenges surrounding the opening up of archival data. We were looking to develop our understanding of the key issues driving or preventing these developments and to start pulling together an action plan. We also talked about digital and digitised archives, which I’ll blog about in a separate post.

We split into two groups to brainstorm driving and restraining factors. There was no chance of drying up – we all had plenty to say, and of course, the restraining influences grew rapidly, threatening to outstrip the drivers by quite some way. However, in the end we had a good balance, and we felt that the day had been very positive, although summing up the position is one thing, implementing actions is quite another. However, we hope to start putting some things into place that will help to take us along the road to promoting archival discovery.

We are looking to create a UKAD website, which will help us to promote UKAD to archivists and others, and we’ll let you know about that as soon as we can.

With thanks to Melinda Haunton from The National Archives, who, as the UKAD secretary, galliantly pulled together the large number of flip charts and made them into something coherent, here is a summary of the points.

Our driving forces included:

Perceived user demand
Time saving – easier to search, more effective customer service
Opportunities – for use of data and for benefiting from others’ use
Government policy drivers in this direction (data.gov.uk is evidence of Govt buy-in)
Rich data – think about opportunities to make the most of events, people, places, concepts within the finding aids
Serendipitous collaboration – working together is a big driver – a common way to hear about initiatives and experiences of others that could be of benefit to you
Potential to get new users – eg via GIS data connected to archives data – users who may not think of using archives
Standards exist to drive openness
Sustainability of resources – less tied to a single service if data is open
Enrichment and adding value – others can enrich our data
Archives making use of others’ open data – sector benefits from open data as well as contributing to it
Connecting archives – new narratives – data can coalesce around events, people, places, subjects
Exposure of holdings – especially for small repositories who have limited resources to promote themselves
Unlikely to be restrictions on opening up descriptive data (unlike digital/digitised archives)
Could glean evidence of impact – ways to gather usage statistics are increasingly effective – provide evidence of benefits
Opening up could reach out to excluded communities more effectively (different routes into archives)
Potential for wider impact – e.g. in demonstrating impact of academic research (RAE)

Our restraining forces included:

Lack of evidence of user demand – it may not be what we expect/assume
APIs – where they exist, are they used? (possibly not)
Users’ understanding what they’ll get – you won’t normally get direct access to archives through descriptions
Proprietary software providers – may not ‘play ball’
Archivists understanding of open data issues – need understanding to get buy-in
Access to developer expertise – archivists frequently find getting IT or developer support very difficult
Machine to machine – not visual, not easy to sell – need to understand the potential
Messy data – all the issues we are so aware of with different data sources; the balkanisation of data
Backlogs – if its not catalogued, we can’t open it up
Sustainability of resources
Data becoming out of date as it gets further from the original source – end up reusing out-of-date data
Contractual embargoes – e.g. involving commercial partners e.g. software providers
Dependencies – potentially data may be dependent on other things – e.g. attached to schema, source code, IPR
Evidence of impact – can be difficult to get this and prove the worth of open data
Branding – or lack of it on reused open data – may affect funding if funders can’t see direct benefits
Loss of control causes fear – once its open anything can happen
Lack of ‘archival developers’ – very few developers with some understanding of archives and archival issues

Our Actions included:

Working together – collaborative evidence gathering and sharing, not competing – use examples/evidence from others
Evidence – case studies, knowing what researchers are requesting, evidence for advantages of digitising
Understanding funders – shared understanding of funders can help with internal funding
Archives developer days – bringing developers together as has been done with Dev8D – collaborative approach to programming
Strategy for approaching software vendors to get buy-in – appeal to their commercial interests, a concerted approach from aggregators may be more effective
UK based evaluation of archival cataloguing systems – still know little about percentages using different systems and evaluation of systems
Conference/workshops to raise awareness of buy in including practical demonstrations – must be interactive and practical and encourage sharing of projects, experiences and ideas