EAD and Next Generation Discovery

This post is in response to a recent article in Code4Lib, ‘Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems’ by M. Bron, M. Proffitt and B. Washburn. All quotes are from that article, which looked at the instances of tags within ArchiveGrid, the US-based archival aggregation run by OCLC. This post compares some of their findings to the UK-based Archives Hub.

Date

In the ArchiveGrid analysis, <unitdate> is used in around 72% of high-level (usually collection-level) descriptions. The Archives Hub does significantly better here, with an almost universal inclusion of dates at this level of description. Therefore, a date search is not likely to exclude any potentially relevant descriptions. This is important, as researchers are likely to want to restrict their searches by date. Our new system also allows sorting retrieved results by date. The only issue we have is where the dates are non-standard and cause the ordering to break down in some way. But we do have both displayed dates and normalised dates, to enable better machine processing of the data.
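As a rough illustration of why the normalised form matters for sorting, the sketch below orders a few descriptions by the start year of a normalised date rather than by the displayed text. It assumes the normalised value sits in the normal attribute of <unitdate> in the form 'YYYY/YYYY'; the sample records and function names are invented for the example and are not taken from the Hub's own system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: displayed dates vary, normalised dates do not.
SAMPLES = [
    '<unitdate normal="1920/1935">1920s-mid 1930s</unitdate>',
    '<unitdate normal="1850/1899">later 19th century</unitdate>',
    '<unitdate>c. 1900?</unitdate>',  # no normalised form supplied
]


def start_year(unitdate_xml):
    """Return the first year of the normalised date, or None if absent."""
    element = ET.fromstring(unitdate_xml)
    normal = element.get("normal")
    if not normal:
        return None
    return int(normal.split("/")[0][:4])


def sort_by_date(samples):
    """Sort by start year; entries without a normalised date go last."""
    return sorted(
        samples,
        key=lambda s: (start_year(s) is None, start_year(s) or 0),
    )


for record in sort_by_date(SAMPLES):
    print(ET.fromstring(record).text)
```

The entry without a normalised date cannot be placed reliably, so the sketch simply pushes it to the end. A real system would need a policy for such cases.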

Collection Title

“for sorting and browsing…utility depends on the content of the element.”

Titles are always provided, but they are very varied. Setting aside lower-level descriptions, which are particularly problematic, titles may be more or less informative. We may introduce sorting by title, but the utility of this will be limited. It is unlikely that titles will ever be controlled to the extent that they have a level of consistency, but it would be fascinating to analyse titles within the context of the ways people search on the Web, and see if we can gauge the value of different approaches to creating titles. In other words, what is the best type of title in terms of attracting researchers’ attention, search engine optimisation, display within search engine results, etc?

Lower-level descriptions tend to have titles such as ‘Accounts’, ‘Diary’ or something more difficult to understand out of context such as ‘Pigs and boars’ or ‘The Moon Dragon’. It is clearly vital to maintain the relationship of these lower-level descriptions to their parent level entries, otherwise they often become largely meaningless. But this should be perfectly possible when working on the Web.

It is important to ensure that a researcher finding a lower-level description through a general search engine gets a meaningful result.

Archives Hub search result from a Google search

The above result is from a search for ‘garrick theatre archives joanna lumley’ – the sort of search a researcher might carry out. Whilst the link is directly to a lower-level entry for a play at the Garrick Theatre, the heading is for the archive collection. This entry is still not ideal, as the lower-level heading should be present as well. But it gives a reasonable sense of what the researcher will get if they click on this link. It includes the <unitid> from the parent entry and the URL for the lower-level entry, with the first part of the <scopecontent> for the entry. It also includes the Archives Hub tag line, which could be considered superfluous to a search for Garrick Theatre archives! However, it does help to embed the idea of a service in the mind of the researcher – something they can use for their research.

Extent

“It would be useful to be able to sort by size of collection, however, this would require some level of confidence that the <extent> tag is both widely used and that the content of the tag would lends itself to sorting.”

This was an idea we had when working on our Linked Data output. We wanted to think about visualizations that would help researchers get a sense of the collections that are out there, where they are, how relevant they are, and so on. In theory the ‘extent’ could help with a weighting system, where we could think about a map-based visualization showing concentrations of archives about a person or subject. We could also potentially order results by size – from the largest archive to the smallest archive that matches a researcher’s search term. However, archivists do not have any kind of controlled vocabulary for ‘extent’. So, within the Archives Hub this field can contain anything from numbers of boxes and folders to length in linear metres, dimensions in cubic metres and items in terms of numbers of photographs, pamphlets and other formats. ISAD(G) doesn’t really help with this; the examples it gives simply serve to show how varied the description of extent can be.
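To make the difficulty concrete, here is a small sketch that tries to pull a quantity and a unit out of free-text extent statements. The sample strings and the regular expression are assumptions made up for illustration. Even where a number can be extracted, the units (boxes, linear metres, photographs) share no common scale, so the values cannot be sorted against each other.

```python
import re

# Hypothetical extent statements of the kinds described above.
EXTENTS = [
    "12 boxes",
    "3.5 linear metres",
    "0.2 cubic metres",
    "150 photographs",
    "2 boxes and 14 folders",
    "1 large volume, various loose papers",
]

# A number followed by one word, optionally followed by "metres".
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+([a-z]+(?:\s+metres)?)")


def parse_extent(text):
    """Return (quantity, unit) pairs found in a free-text extent."""
    return [(float(q), unit) for q, unit in PATTERN.findall(text.lower())]


for extent in EXTENTS:
    print(extent, "->", parse_extent(extent))
```

The last sample shows how quickly this kind of pattern matching goes wrong: the word after the number is an adjective, not a unit.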

Genre

“Other examples of desired functionality include providing a means in the interface to limit a search to include only items that are in a certain genre (for example, photographs)”.

This is something that could potentially be useful to researchers, but archivists don’t tend to provide the necessary data. We would need descriptions to include the genre, using controlled vocabulary. If we had this we could potentially enable researchers to select types of materials they are interested in, or simply include a flag to show, e.g. where a collection includes photographs.

The problem with introducing a genre search is that you run the risk of excluding key descriptions, because the search will only include results where the description includes that data in the appropriate location. If the word ‘photograph’ is in the general description only then a specific genre search won’t find it. This means a large collection of photographs may be excluded from a search for photographs.
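The sketch below shows this exclusion effect on two invented descriptions. It assumes genre terms would be recorded in <genreform> within <controlaccess>. The function names and sample data are hypothetical, and the matching is far cruder than a real search index.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptions: only the first has a genre access point.
DESCRIPTIONS = {
    "A": """<archdesc><did><unittitle>Studio portraits</unittitle></did>
            <controlaccess><genreform>Photographs</genreform></controlaccess>
            </archdesc>""",
    "B": """<archdesc><did><unittitle>Expedition papers</unittitle></did>
            <scopecontent><p>Includes 4,000 photographs.</p></scopecontent>
            </archdesc>""",
}


def genre_search(term):
    """Match only within <genreform> access points."""
    hits = []
    for key, xml_text in DESCRIPTIONS.items():
        root = ET.fromstring(xml_text)
        genres = [g.text or "" for g in root.iter("genreform")]
        if any(term.lower() in g.lower() for g in genres):
            hits.append(key)
    return hits


def keyword_search(term):
    """Match anywhere in the text of the description."""
    hits = []
    for key, xml_text in DESCRIPTIONS.items():
        root = ET.fromstring(xml_text)
        full_text = " ".join(root.itertext())
        if term.lower() in full_text.lower():
            hits.append(key)
    return hits


print(genre_search("photograph"))
print(keyword_search("photograph"))
```

Description B holds far more photographs than A, yet only the keyword search finds it.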

Subject

In the Bron/Proffitt/Washburn article <controlaccess> is present around 72% of the time. I was surprised that they did not choose to analyse tags within <controlaccess> as I think these ‘access points’ can play a very important role in archival description. They use the presence of <controlaccess> as an indication of the presence of subjects, and make the point that “given differences in library and archival practices, we would expect control of form and genre terms to be relatively high, and control of names and subjects to be relatively low.”

On the Archives Hub, use of subjects is relatively high (as well as personal and corporate names) and use of form and genre is very low. However, it is true to say that we have strongly encouraged adding subject terms, and archivists don’t generally see this as integral to cataloguing (although some certainly do!), so we like to think that we are partly responsible for such a high use of subject terms.

Subject terms are needed because they (1) help to pull out significant subjects, often from collections that are very diverse, (2) enable identification of words such as ‘church’ and ‘carpenter’ (i.e. they are subjects, not surnames), (3) allow researchers to continue searching across the Archives Hub by subject (subjects are all linked to the browse list) and therefore pull collections together by theme, and (4) enable advanced searching (which is substantially used on the Hub).

Names (personal and corporate)

In Bron/Proffitt/Washburn the <origination> tag is present 87% of the time. The analysis did not include the use of <persname> and <corpname> within <origination> to identify the type of originator. In the Archives Hub the originator is a required field, and is present 99%+ of the time. However, we made what I think is a mistake in not providing for the addition of personal or corporate name identification within <origination> via our EAD Editor (for creating descriptions) or by simply recommending it as best practice. This means that most of our originators cannot be distinguished as people or corporate bodies. In addition, we have a number where several names are within one <origination> tag and where terms such as ‘and others’, ‘unknown’ or ‘various’ are used. This type of practice is disadvantageous to machine processing. We are looking to rectify it now, but addressing something like this in retrospect is never easy to do. The ideal is that all names within origination are separately entered and identified as people or organisations.
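As a sketch of how such originators might be flagged for retrospective clean-up, the function below reports the problems described above for a single <origination> element. The list of vague terms comes from the paragraph above. Treating a semicolon as a sign of several names is purely an assumption for the example, as are the function name and sample data.

```python
import re
import xml.etree.ElementTree as ET

# Terms mentioned above that say little about who created the archive.
VAGUE_TERMS = re.compile(r"\b(and others|unknown|various)\b", re.IGNORECASE)


def origination_problems(origination_xml):
    """List reasons an <origination> element is hard to machine process."""
    element = ET.fromstring(origination_xml)
    problems = []
    typed = element.findall("persname") + element.findall("corpname")
    if not typed:
        problems.append("no persname or corpname")
    text = " ".join(element.itertext())
    if VAGUE_TERMS.search(text):
        problems.append("vague term")
    # Assumption: a semicolon suggests more than one name was entered.
    if ";" in text:
        problems.append("possibly several names in one element")
    return problems


print(origination_problems(
    "<origination>Smith, John; Jones, Mary and others</origination>"
))
print(origination_problems(
    "<origination><persname>Smith, John</persname></origination>"
))
```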

We do also have names within <controlaccess>, and this brings the same advantages as for subjects, ensuring the names are properly structured and can be used for searching and for bringing together archives relating to any one individual or organisation.

Repository

“Use of this element falls into the promising complete category (99.46%: see Table 7). However, a variety of practice is in play, with the name of the repository being embellished with <subarea> and <address> tags nested within <repository>.”

On the Archives Hub, <repository> is mandatory, but as yet we do not have a checking system whereby a description is rejected if it does not contain this field. We are working towards something like this, using scripts to check for key information to help ensure validity and consistency at least to a minimum standard. On one occasion we did take in a substantial number of descriptions from a repository that omitted the name of the repository, which is not very useful for an aggregation service! However, one thing about <repository> is that it is easy to add because it is always the same entry. Or at least it should be… we did recently discover that a number of repositories had entered their name in various ways over the years, and this is something we needed to correct.
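A checking script of the kind described might look something like the sketch below. The list of required elements is an assumption for the example, not the Hub's actual minimum standard, and the function name is hypothetical. An element counts as missing if it is absent from <did> or contains only whitespace.

```python
import xml.etree.ElementTree as ET

# Assumed minimum set for the example; not the Hub's actual requirements.
REQUIRED = ["unitid", "unittitle", "unitdate", "repository", "origination"]

SAMPLE = """<archdesc level="fonds"><did>
  <unitid>GB 1 ABC</unitid>
  <unittitle>Family papers</unittitle>
  <unitdate normal="1900/1950">1900-1950</unitdate>
  <origination>Smith family</origination>
  <repository>   </repository>
</did></archdesc>"""


def missing_fields(ead_xml):
    """Return required elements that are absent or empty within <did>."""
    root = ET.fromstring(ead_xml)
    did = root.find(".//did")
    missing = []
    for tag in REQUIRED:
        # Search at any depth, since e.g. unitdate may sit inside unittitle.
        element = next(did.iter(tag), None) if did is not None else None
        if element is None or not "".join(element.itertext()).strip():
            missing.append(tag)
    return missing


print(missing_fields(SAMPLE))
```

A description with a non-empty result could then be rejected or sent back to the contributor.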

Scope and content, biographical history and abstract

It is notable that in the US <abstract> is widely used, whereas we don’t use it at all. It is intended as a very brief summary, whereas <scopecontent> can be of any length.

“For search, its worth noting that the semantics of these elements are different, and may result in unexpected and false “relevance””

One of the advantages of including <controlaccess> terms is that they mitigate this kind of false relevance, as restricted field searching makes it possible to search for ‘mason’ as a person or ‘mason’ as a subject.

The Bron/Proffitt/Washburn analysis shows <bioghist> used 70% of the time. This is lower than the Archives Hub, where it is rare for this field not to be included. Archivists seem to have a natural inclination to provide a reasonably detailed biographical history, especially for a large collection focussed on one individual or organisation.

Digital Archival Objects

It is a shame that the analysis did not include instances of <dao>, but it is likely to be fairly low (in line with previous analysis by Wisser and Dean, which puts it lower than 10%). The Archives Hub currently includes around 1,200 instances of images or links to digital content. But what would be interesting is to see how this is growing over time and whether the trajectory indicates that in 5 years or so we will be able to provide researchers with routes into much of the Archives Hub content. However, it is worth bearing in mind that many archives are not digitised and are not likely to be digitised, so it is important for us not to raise expectations that links to digital content will become a matter of course.

The Future of Discovery

“In order to make EAD-encoded finding aids more well suited for use in discovery systems, the population of key elements will need to be moved closer to high or (ideally) complete.”

This is undoubtedly true, but I wonder whether the priority over and above completeness is consistency and controlled vocabulary where appropriate. There is an argument in favour of a shorter description that may exclude certain information about a collection, but is well structured and easier to machine process. (Of course, completeness and consistency together are the ideal!)

The article highlights geo-location as something that is emerging within discovery services. The Archives Hub is planning on promoting this as an option once we move to the revised EAD schema (which will allow for this to be included), but it is a question of whether archivists choose to include geographical co-ordinates in their catalogues. We may need to find ways to make this as easy as possible and to show the potential benefits of doing so.

In terms of the future, we need a different perspective on what EAD can and should be:

“In the early days of EAD the focus was largely on moving finding aids from typescript to SGML and XML. Even with much attention given over to the development of institutional and consortial best practice guidelines and requirements, much work was done by brute force and often with little attention given to (or funds allocated for) making the data fit to the purpose of discovery.”

However, I would argue that one of the problems is that archivists sometimes still think in terms of typescript finding aids; of a printed finding aid that is available within the search room, and then made available online….as if they are essentially the same thing and we can use the same approach with both. I think more needs to be done to promote, explain and discuss ‘next generation finding aids’. By working with Linked Data, I have gained a very different perspective on what is possible, challenging the traditional approach to hierarchical finding aids.

Maybe we need some ‘next generation discovery’ workshops and discussions – but in order to really broaden our horizons we will need to take heed of what is going on outside of our own domain. We can no longer consider archival practice in isolation from discovery in the most general sense because the complexity and scale of online discovery requires us to learn from others with expertise and understanding of digital technologies.

An evaluation of the use of archives and the Archives Hub

This blog post is based upon a report written by colleagues at Mimas* presenting the results of the evaluation of our innovative Linked Data interface, ‘Linking Lives’. The evaluation consisted of a survey and a focus group, with 10 participants including PhD students and MA students studying history, politics and social sciences. We asked participants a number of questions about the Archives Hub service, in order to provide context for their thoughts on the Linking Lives interface.

This blog post concentrates on their responses relating to the use of archives, methods of searching and interpretation of results. You can read more about their responses to the Linking Lives interface on our Linking Lives blog.

Use of Archives and Primary Source Materials

We felt that it was important to establish how important archives are to the participants in our survey and focus group. We found that “without exception, all of the respondents expressed a need for primary resources” (Evaluation report). One respondent said:

“I would not consider myself to be doing proper history if I wasn’t either reinterpreting primary sources others had written about, or looking at primary sources nobody has written about. It is generally expected for history to be based on primary sources, I think.” (Survey response)

One of the most important factors to the respondents was originality in research. Other responses included acknowledgement of how archives give structure to research, bringing out different angles and perspectives and also highlighting areas that have been neglected. Archives give substance to research and they enable researchers to distinguish their own work:

“Primary sources are very valuable for my research because they allow me to put together my own interpretation, rather than relying on published findings elsewhere.” (Survey response)

Understanding of Archives

It is often the case that people have different perceptions of what archives are, and the Linking Lives evaluation work confirmed this. Commonly there is a difference between social scientists and historians; the former concentrating on datasets (e.g. data from the Office for National Statistics) and the latter on materials created during a person’s life or the activities of an organisation and deemed worthy of permanent preservation. The evaluation report states:

“The participants that had a similar understanding of what an archive was to the Archive Hub’s definition had a more positive experience than those who didn’t share that definition.”

This is a valuable observation for the work of the Hub in a general sense, as well as the Linking Lives interface, because it demonstrates how initial perceptions and expectations can influence attitudes towards the service. In addition, the evaluation work highlighted another common fallacy: that an archive is essentially a library. Some of the participants in the survey expected the Archives Hub to provide them with information about published sources, such as research papers.

These findings highlight one of the issues when trying to evaluate the likely value of an innovative service: researchers do not think in the same language or with the same perspectives as information professionals. I wonder if we have a tendency to present services and interfaces modelled from our own standpoint rather than from the standpoint of the researcher.

Search Techniques and Habits

“Searches were often not particularly expansive, and participants searched for specific details which were unique to their line of enquiry” (Evaluation report). Examples include titles of women’s magazines, personal names or places. If the search returned nothing, participants might then broaden it out.

Participants said they would repeatedly return to archives or websites they were familiar with, often linked to quite niche research topics. This highlights how a positive experience with a service when it is first used may have a powerful effect over the longer term.

The survey found that online research was a priority:

“Due to conflicting pressures on time and economic resources, online searching was prevalent amongst the sample. Often research starts online and the majority is done online. Visits to see archives in person, although still seen as necessary, are carefully evaluated.”  (Evaluation report)

The main resources participants used were Google and Google Scholar (the most ubiquitous search engines used) as well as The National Archives, Google Books and ESDS. Specialist archives were referred to relating to specific search areas (e.g. The People’s History Museum, the Wellcome Library, the Mass Observation Archive).

Thoughts and Comments About the Archives Hub

All participants found the Hub easy to navigate and most found locating resources intuitive. As part of the survey we asked the participants to find certain resources, and almost all of them provided the right answers with seemingly no difficulty.

“It is clear. The descent of folders and references at the top are good for referencing/orientating oneself. The descriptions are good – they obviously can’t contain everything that could be useful to everyone and still be a summary. It is similar to other archive searches so it is clear.” (Survey response, PhD history student)

The social scientists who took part in the evaluation were less positive about the Archives Hub than the historians. Clearly many social science students are looking for datasets, and these are generally not represented on the Hub. There was a feeling that contemporary sources are not well represented, and these are often more important to researchers in fields like politics and sociology. But overall comments were very positive:

“…if anyone ever asked about how to search archives online I’d definitely point them to the Archives Hub”.

“Useful. It will save me making specific searches at universities.”

Archives Hub Content

It was interesting to see the sorts of searches participants made. A search for ‘spatial ideas’ by one participant did not yield useful results. This would not surprise many archivists – collections are generally not catalogued to draw out such concepts (neither Unesco nor UKAT have a subject heading for this; LCSH has ‘spatial analysis’). However, there may well be collections that cover a subject like this, if the researcher is prepared to dig deep enough and think about different approaches to searching. Another participant commented that “you can’t just look for the big themes”. This is the type of search that might benefit from us drawing together archive collections around themes, but this is always a very flawed approach. This is one reason that we have Features, which showcase archives around subjects but do not try to provide a ‘comprehensive’ view onto a subject.

This kind of feedback from researchers helps us to think about how to more effectively present the Archives Hub. Expectations are such an important part of researchers’ experiences. It is not possible to completely mitigate against expectations that do not match reality, but we could, for example, have a page on ‘The Archives Hub for Social Scientists’ that would at least provide those who looked at it with a better sense of what the Hub may or may not provide for them (whether anyone would read it is another matter!).

This survey, along with previous surveys we have carried out, emphasises the importance of a comprehensive service and a clear scope (“it wasn’t clear to me what subjects or organisations are covered”). However, with the nature of archives, it is very difficult to give this kind of information with any accuracy, as the collections represented are diverse and sometimes unexpected. In the end you cannot entirely draw a clear line around the scope of the Archives Hub, just as you cannot draw a clear line around the subjects represented in any one archive. The Hub also changes continuously, with new descriptions added every week. Cataloguing is not a perfect art; it can draw out key people, places, subjects and events, but it cannot hope to reflect everything about a collection, and the knowledge a researcher brings with them may help to draw out information from a collection that was not explicitly provided in the description. If a researcher is prepared to spend a bit of time searching, there is always the chance that they may stumble across sources that are new to them and potentially important:

“…another student who was mainly focused on the use of the Kremlin Archives did point out that [the Archives Hub] brought up the Walls and Glasier papers, which were new to [them]”.

Even if you provide a list of subjects, what does that really mean? Archives will not cover a subject comprehensively; they were not written with that in mind; they were created for other purposes – that is their strength in many ways – it is what makes them a rich and exciting resource, but it does not make it easy to accurately describe them for researchers. Just one series of correspondence may refer to thousands of subjects, some in passing, some more substantially, but archivists generally don’t have time to go through an entire series and draw out every concept.

If the Archives Hub included a description for every archive held at an HE institution across the UK, or for every specialist repository, what would that signify? It would be comprehensive in one sense, but in a sense that may not mean much to researchers. It would be interesting to ask researchers what they see as ‘comprehensive resources’ as it is hard to see how these could really exist, particularly when talking about unpublished sources.

Relevance of Search Results

The difficulties some participants had with the relevance of results come back to the problem of how to catalogue resources that often cover a myriad of subjects, maybe superficially, maybe in detail; maybe from a very biased perspective. If a researcher looks for ‘social housing manchester’ then the results they get will be accurate in a sense – the machine will do its job and find collections with these terms, and there will be weighting of different fields (e.g. the title will be highly weighted), but they still may not get the results they expect, because collections may not explicitly be about social housing in Manchester. The researcher needs to do a bit more work to think about what might be in the collection and whether it might be relevant. However, cataloguers are at fault to some extent. We do get descriptions sent to the Hub where the subjects listed seem inadequate or they do not seem to reflect the scope and content that has been provided. Sometimes a subject is listed but there is no sense of why it is included in the rest of the description. Sometimes a person is included in the index terms but they are not described in the content. This does not help researchers to make sense of what they see.

I do think that there are lessons here for archivists, or those who catalogue archives. I don’t think that enough thought is given to the needs of the researcher. Examples are the inconsistent use of subject terms, and descriptions of the archive that do not draw out key concepts clearly enough. Some archivists don’t see the need to add index terms, reasoning that technologies like Google can search by keyword and that this is enough. But it isn’t enough. Researchers need more than this. They need to know what the collection is substantially about, and they need to search across other collections about similar subjects. Controlled vocabulary enables this kind of exploratory searching. There is a big difference between searching for ‘nuclear disarmament’ as a keyword, which means it might exist anywhere within the description, and searching for it as a subject – a significant topic within an archive.

 

*Linking Lives Evaluation: Final Report (October 2012) by Lisa Charnock, Frank Manista, Janine Rigby and Joy Palmer

Excel template

Update May 2015: Please note that we need to make some changes to the Excel template and we are not currently working with Excel data. We hope to be able to offer this service in the future.

As part of Project Headway we wanted to create an Excel template which archives could use to catalogue and create EAD. We know that some archives – especially smaller and under-resourced archives – are using spreadsheets or word processing software to catalogue, and often lack the time or resources to switch to using an archival management system. While users can catalogue directly on to the EAD Editor, this isn’t a perfect solution –  it won’t work in some older browsers, or offline.

While we would have liked to offer a script that allowed users to convert their own Excel catalogues to EAD, it soon became apparent that this wasn’t an option. We would have needed to produce a script for each institution, and relied on the institution using Excel in a very consistent, systematic way – and a way that was ISAD(G) compliant, and could easily be mapped to EAD. So we decided to start off with a simple template, which we can adapt to individual user needs if required.

I’d never worked with XML in Excel before, and a lot of the process was simply trial-and-error, googling error messages, and sending forlorn messages to my programmer husband asking ‘what on earth is denormalised data and how do I stop it?’. I found the office.microsoft.com and msdn.microsoft.com sites useful for figuring out the basics of getting XML in and out of Excel – though I often turned to support elsewhere, too (eg Microsoft support will only tell you that denormalised data is not supported – not what it is or how to fix it).

To get started with using XML in Excel, you need to have the XML add-in installed (it says 2003, but will work with other versions) and then make sure you can see the ‘developer’ tab – if you can’t, it’s under options -> customize ribbon.

While it’s hard (in retrospect) to remember all of the stages I went through in the trial-and-error, I know I started by trying to create an XSD (XML schema file) from in-Excel data entry. It failed. I tried importing the EAD.xsd – which just failed, silently (no error messages – no messages at all).

I was also concerned that the official EAD.xsd was too complicated for my (and our users’) needs – for instance, this project didn’t require lists of enumeration values. I needed something a bit simpler – and I’d already figured out that Excel couldn’t handle multi-level descriptions – so I needed to start with something collection-level only, too.

I created a basic EAD collection-level description in the Archives Hub EAD Editor, saved it as XML, removed the DTD declaration (not allowed in Excel), and imported it (using developer -> xml -> import).  Clicking on ‘source’ in the developer XML tab then shows you the XML fields.

XML map in Excel

You can then export this map as an XSD, creating your XML schema.  Of course, it wasn’t that easy. This is where denormalised data cropped up – and stopped me from exporting. I have to admit, I’m still not entirely sure what exactly denormalised data is – and given definitions such as:

A denormalised data model is not the same as a data model that has not been normalised, and denormalisation should only take place after a satisfactory level of normalisation has taken place and that any required constraints and/or rules have been created to deal with the inherent anomalies in the design. For example, all the relations are in third normal form and any relations with join and multi-valued dependencies are handled appropriately.

(from the usually introductory-friendly Wikipedia)

I’m not sure I’ll ever find out (if you have a really good explanation, please do comment!). But what I did find out was what it meant for me in the context of this XML mapping: no repeated fields. EAD allows for repeated fields – for instance, multiple subjects would be encoded as:

<controlaccess> <subject>subject</subject><subject>subject 2</subject></controlaccess>

Try to import that into Excel, and you get, well, a mess. The whole description appears twice – once with subject, and once with subject 2. And if you try to export the schema, you get the error message that the map is not exportable because it contains denormalized data.
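For what it’s worth, in the general database sense data is usually called denormalised when the same value is stored more than once. The sketch below is only an illustration of that effect, not a reproduction of what Excel does internally. It flattens a description with two subjects into rows, and the title ends up repeated on every row. The sample XML and the function name are invented for the example.

```python
import itertools
import xml.etree.ElementTree as ET

SAMPLE = """<archdesc>
  <did><unittitle>Family papers</unittitle></did>
  <controlaccess>
    <subject>subject</subject>
    <subject>subject 2</subject>
  </controlaccess>
</archdesc>"""


def flatten(xml_text, fields):
    """Flatten a description into rows, one per combination of values."""
    root = ET.fromstring(xml_text)
    # Collect every value for each field; a missing field becomes [None].
    values = [[e.text for e in root.iter(f)] or [None] for f in fields]
    return [dict(zip(fields, combo)) for combo in itertools.product(*values)]


rows = flatten(SAMPLE, ["unittitle", "subject"])
for row in rows:
    print(row)
```

Add a second repeated field with three values and the same description would spread across six rows, which gives a sense of how quickly a flat table stops being a sensible home for this kind of data.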

For this reason, Excel won’t support hierarchy. In EAD, the same fields are repeated at component level as at collection level, just inside a different wrapper. If you thought it got messy when you added a single repeated field, just imagine having anything up to several thousand…

So, strip everything down to a single instance (which means separating collection and component level into different spreadsheets), and you have an XSD which will export (follow instructions in step 4 of that link – if you get a VBA error, debug instructions are in step 2). Hurrah! But how to make it useable?

Well, you have to put it back into Excel, and map the XML fields to Excel cells. This was tedious, but achievably tedious rather than crawling-through-help-forums tedious. Open up a new Excel document, click on ‘source’, and choose your shiny new XSD. This will give you a list of all the fields, in the right-hand pane. Mapping them to cells is simply a case of drag-and-drop – once you’ve mapped a field to a cell, that cell will be outlined in blue (as long as the source pane is showing). There’s an option to have Excel auto-label your fields with the content of the XML tag, but I decided that wouldn’t give the user-friendly interface I wanted, so I labelled them myself. Then colour-coded them. The result?

Screenshot of collection-level template

I had to tweak the exported XSD a little to allow for a field in which users can enter the reference codes of any components. This was my first experience of hand-coding any of an XML schema, and it took a few tries to get right! But I managed to add and map the <dsc> and <c> elements:

<xsd:element minOccurs="0" nillable="true" name="dsc" form="unqualified">
  <xsd:complexType>
    <xsd:sequence minOccurs="0">
      <xsd:element minOccurs="0" nillable="true" type="xsd:string" name="c" form="unqualified"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

(If I wanted to play with the XSD a bit more, I guess I could make mandatory fields really mandatory, by fiddling with the minOccurs and/or nillable attributes, but I haven’t worked up the courage yet…)

This allows users to enter the reference codes of parent/child descriptions. Each component needs its own spreadsheet, and its own XML export. These are then run through a script by our programmer, which will use these parent/child references to create a single, hierarchical description. Theoretically, anyway – we haven’t been able to do much testing on it yet, and we’re not sure how well it will cope with components that are more than a level or two deep.
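To make the idea concrete, here is a minimal sketch in Python of how flat exports might be nested using parent references. It is not our programmer's actual script. The flat layout is an assumption made purely for illustration: each export is a hypothetical <record> (or <collection>) element holding <unitid>, <unittitle> and, for components, a <parent> element containing the parent's reference code. The output is a simplified nesting of <c> elements, not complete EAD.

```python
import xml.etree.ElementTree as ET


def build_hierarchy(collection_xml, component_xmls):
    """Nest flat component records under the collection via parent references.

    Assumes a hypothetical flat layout: every record has a <unitid>, and
    every component also has a <parent> holding its parent's reference code.
    """
    collection = ET.fromstring(collection_xml)
    dsc = collection.find("dsc")
    if dsc is None:
        dsc = ET.SubElement(collection, "dsc")
    # Drop the placeholder <c> entries that only hold reference codes.
    for placeholder in list(dsc):
        dsc.remove(placeholder)

    # Map each reference code to the element its children should sit under.
    nodes = {collection.findtext("unitid"): dsc}
    pending = [ET.fromstring(xml) for xml in component_xmls]

    # Components may arrive in any order, so keep passing over the list
    # until every component has found its parent.
    while pending:
        still_pending = []
        for comp in pending:
            parent = nodes.get(comp.findtext("parent"))
            if parent is None:
                still_pending.append(comp)
                continue
            c = ET.SubElement(parent, "c")
            for child in list(comp):
                if child.tag != "parent":
                    c.append(child)
            nodes[comp.findtext("unitid")] = c
        if len(still_pending) == len(pending):
            raise ValueError("some components refer to an unknown parent")
        pending = still_pending
    return collection


if __name__ == "__main__":
    coll = ("<collection><unitid>GB 1</unitid><unittitle>Papers</unittitle>"
            "<dsc><c>GB 1/1</c></dsc></collection>")
    comps = [
        "<record><unitid>GB 1/1/1</unitid><unittitle>Diary</unittitle>"
        "<parent>GB 1/1</parent></record>",
        "<record><unitid>GB 1/1</unitid><unittitle>Accounts</unittitle>"
        "<parent>GB 1</parent></record>",
    ]
    print(ET.tostring(build_hierarchy(coll, comps), encoding="unicode"))
```

Because the loop keeps retrying components whose parent has not been placed yet, depth is not limited in this sketch; whether the real script copes as well with deep hierarchies is exactly what still needs testing.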

Remember denormalised data, and how you can’t have repeated fields? Obviously we can’t tell contributors that they can only have a single subject for each description! So in repeatable fields, multiple entries are pipe | delimited, so we can split them, eg:

<controlaccess><subject>subject1|subject2|subject3</subject></controlaccess>

to

<controlaccess><subject>subject1</subject><subject>subject2</subject><subject>subject3</subject></controlaccess>

If users enter their subject sources in the same order, they’ll be matched up as attributes to the correct subject. The script also removes any empty fields (valid XML, but they break the EAD Editor), and adds the special Archives Hub mark-up for access points (used to distinguish between eg surname and forename in a personal name, and handy for linked data).
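As an illustration only, the splitting and tidying steps could look something like the Python below. This is a sketch under assumptions, not the script we actually run: the function names are made up, the subject sources are assumed to arrive as a second pipe-delimited string, and they are written to the EAD source attribute. The Hub-specific access point mark-up is not attempted here.

```python
import xml.etree.ElementTree as ET


def split_repeated(parent, tag, sources=""):
    """Replace each pipe-delimited <tag> under parent with one element per value.

    `sources` is an optional pipe-delimited string in the same order as the
    values; each entry becomes the `source` attribute of the matching element.
    """
    source_list = [s.strip() for s in sources.split("|")] if sources else []
    for elem in parent.findall(tag):
        values = [v.strip() for v in (elem.text or "").split("|")]
        position = list(parent).index(elem)
        parent.remove(elem)
        for i, value in enumerate(values):
            if not value:
                continue  # skip blanks such as "a||b" but keep the ordering
            new = ET.Element(tag)
            new.text = value
            if i < len(source_list) and source_list[i]:
                new.set("source", source_list[i])
            parent.insert(position, new)
            position += 1


def remove_empty(root):
    """Remove descendants with no text, no children and no attributes."""
    for child in list(root):
        remove_empty(child)  # prune children first, then test the parent
        if len(child) == 0 and not (child.text or "").strip() and not child.attrib:
            root.remove(child)


if __name__ == "__main__":
    ca = ET.fromstring(
        "<controlaccess><subject>subject1|subject2|subject3</subject>"
        "<persname></persname></controlaccess>"
    )
    split_repeated(ca, "subject", sources="lcsh|ukat|lcsh")
    remove_empty(ca)
    print(ET.tostring(ca, encoding="unicode"))
```

Matching sources by position is fragile if a user leaves a gap in one list but not the other, which is why the sketch keeps the original index of each value rather than renumbering after blanks are dropped.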

And there we are: a description, created in Excel, that’s valid EAD. We’re still in the process of testing the template, and making sure that it’s robust and meets users’ needs. If you’d like to be involved with testing, please get in touch.

 

The modern archivist: working with people and technology

I’ve recently read Kate Theimer’s very excellent post on Honest Tips for Wannabe Archivists Out There.

This is something that I’ve thought about quite a bit, as I work as the manager of an online service for Archives and I do training and teaching for archivists and archive students around creating online descriptions. I would like to direct this blog post to archive students or those considering becoming archivists. I think this applies equally to records managers, although sometimes they have a more defined role in terms of audience, so the perspective may be somewhat different.

It’s fine if you have ‘a love of history’, if you ‘feel a thrill when handling old documents’. That’s a good start. I’ve heard this kind of thing frequently as a motivation for becoming an archivist. But this is not enough. It is more important to have the desire to make those archives available to others; to provide a service for researchers. To become an archivist is to become a service provider, not an historian. It may not sound as romantic, but as far as I am concerned it is what we are, and we should be proud of the service we provide, which is extremely valuable to society. Understanding how researchers might use the archives is, of course, very important, so that you can help to support them in their work. Love of the materials, and love of the subject (especially in a specialist repository) should certainly help you with this core role. Indeed, you will build an understanding of your collections, and become more expert in them over time, which is one of the wonderful things about being an archivist.

Your core role is to make archives available to the community – for many of us, the community is potentially anyone, for some of us it may be more restricted in scope. So, you have an interest in the materials, you need to make them available. To do this you need to understand the vital importance of cataloguing. It is this that gives people a way in to the archives. Cataloguing is a real skill, not something to be dismissed as simply creating a list of what you have. It is something to really work on and think about. I have seen enough inconsistent catalogues over the last ten years to tell you that being rigorous, systematic and standards-based in cataloguing is incredibly important, and technology is our friend in this aim. Furthermore, the whole notion of ‘cataloguing’ is changing, a change led by the opportunities of the modern digital age and the perspectives and requirements of those who use technology in their everyday life and work. We need to be aware of this, willing (even excited!) to embrace what this means for our profession and ready to adapt.

This brings me to the subject I am particularly interested in: the use of technology. Cataloguing *is* using technology, and dissemination *is* using technology. That is, it should be and it needs to be if you want to make an impact; if you want to effectively disseminate your descriptions and increase your audience. It is simply no good to see this profession as in any way apart from technology. I would say that technology is more central to being an archivist than to many professions, because we *deal in information*. It may be that you can find a position where you can keep technology at arm’s length, but these types of positions will become few and far between. How can you be someone who works professionally with information, and not be prepared to embrace the information environment? The Web, email, social networks, databases: these are what we need to use to do our jobs. We generally have limited resources, and technology can help us make the most of the resources we have; conversely, we need to make informed choices about the technology we use and what sort of impact it will have. Should you use Flickr to disseminate content? What are the pros and cons? Is ‘augmented reality’ a reality for us? Should you be looking at Linked Data? What is it and why might it be important? What about Big Data? It may sound like the latest buzz phrase but it’s big business, and can potentially save time and money. Is your system fit for purpose? Does it create effective online catalogues? How interoperable is it? How adaptable?

Before I give the impression that you need to become some sort of technical whizz-kid, I should make clear that I am not talking about being an out-and-out techie – a software developer or programmer. I am talking about an understanding of technology and how to use it effectively. I am also talking about the ability to talk to technical colleagues in order to achieve this. Furthermore, I am talking about a willingness to embrace what technology offers and not be scared to try things out. It’s not always easy. Technology is fast-moving and sometimes bewildering. But it has to be seen as our ally, as something that can help us to bring archives to the public and to promote a greater understanding of what we do. We use it to catalogue, and I have written previously about how our choice of system has a great impact on our catalogues, and how important it is to be aware of this.

Our role in using technology is really *all about people*. I often think of myself as the middleman, between the technology (the developers) and the audience. My role is to understand technology well enough to work with it, and work with experts, to harness it in order to constantly evolve and use it to best advantage, but also to constantly communicate with archivists and with researchers, to have an understanding of requirements and make sure that we are relevant to end-users. It’s a role, therefore, that is about working with people. For most archivists, this role will be within a record office or repository, but either way, working with people is the other side of the coin to working with technology. They are both central to the world of archives.

If you wonder how you can possibly think about everything that technology has to offer: well, you can’t. But that’s why it is even more vital now than it has ever been to think of yourself as being in a collaborative profession. You need to take advantage of the experience and knowledge of colleagues, both within the archives profession and further afield. It’s no good sitting in a bubble at your repository. We need to talk to each other and benefit from sharing our understanding. We need to be outgoing. If you are an introvert, if you are a little shy and quiet, that’s not a problem; but you may have to make a little more effort to engage and to reach out and be an active part of your profession.

They say ‘never work with children and animals’ in show business because both are unpredictable; but in our profession we should be aware that working with people and technology is our bread and butter. Understanding how to catalogue archives to make them available online, to use social networks to communicate our messages, to think about systems that will best meet the needs of archives management, to assess new technologies and tools that may help us in our work. These are vital to the role of a modern professional archivist.

HubbuB: March 2012

New collections on the Hub

A special mention for the University of Worcester Research Collections – they have now been added to the Hub as collection level descriptions, thanks largely to their HLF ‘Skills for the Future’ trainee, Sarah.

We are delighted to have the Royal College of Psychiatrists as a new contributor, adding to a number of distinguished Royal Colleges already on the Hub.

Feature for March

This month we step into the world of augmented reality with a feature about the SCARLET project:

The feature tells us that “The SCARLET ‘app’ now enables students to study early editions of Dante’s Divine Comedy, for example, while simultaneous viewing catalogue data, digital images, webpages and online learning resources on their tablet devices and phones.” It all sounds very exciting, and something that archives can really play a very active part in.

EAD Editor

We’ve been busy testing the new instance of the EAD Editor, which will be released soon. We’ll be able to tell you more about that shortly.

We now have a page giving you information about the ‘right click’ menu that helps you with things like paragraphs, lists and links:

SRU and OAI-PMH

APIs are becoming increasingly important with the open data agenda. We have provided APIs for some years now. Recently we have updated the information on these to help developers who would like to use them to access Hub descriptions: http://archiveshub.ac.uk/sru/ and http://archiveshub.ac.uk/oaipmh/

The SRU interface is used to provide data to Genesis, the portal for Women’s Studies: http://www.londonmet.ac.uk/genesis/. It means that the data is only held in one place, but a different interface provides access to select descriptions – in this case, descriptions relating to women.

APIs may not mean a great deal to you, as they are primarily something developers use to create new interfaces, mash-ups and cross-data explorations, but do pass this on if you know of developers interested in working with our data. We want to ensure that archives are at the heart of innovations in opening up and exploring data connections.
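For any developers reading, here is a small Python sketch of what request URLs for the two interfaces might look like. It only builds the URLs and does not contact the service. The parameter names come from the SRU and OAI-PMH standards themselves; the SRU version number, the oai_dc metadata prefix and the example query are assumptions, so check the API pages above for what the Hub actually supports.

```python
from urllib.parse import urlencode

# Base URLs as given in this post.
SRU_BASE = "http://archiveshub.ac.uk/sru/"
OAI_BASE = "http://archiveshub.ac.uk/oaipmh/"


def sru_search_url(cql_query, maximum_records=10, start_record=1, base=SRU_BASE):
    """Build an SRU searchRetrieve URL for a CQL query.

    The version value is an assumption; the server may expect another.
    """
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base + "?" + urlencode(params)


def oai_list_records_url(metadata_prefix="oai_dc", base=OAI_BASE):
    """Build an OAI-PMH ListRecords URL.

    oai_dc is the prefix every OAI-PMH repository must offer; other
    prefixes are repository-specific.
    """
    return base + "?" + urlencode({"verb": "ListRecords",
                                   "metadataPrefix": metadata_prefix})


if __name__ == "__main__":
    print(sru_search_url("women", maximum_records=5))
    print(oai_list_records_url())
```

A portal like Genesis would presumably send a query of this kind and then present the returned records in its own interface, which is how the data can live in one place while appearing in several.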

Page about identifiers

Some of you may have read my recent blog post about issues with identifiers for archives and for archive descriptions. We now have a page on the Hub to help explain what a persistent unique identifier is and how you create it:

http://archiveshub.ac.uk/identifiers/

As ever, please ask us if you have any questions about this.

Former Reference

The Archives Hub now displays former reference with the label of ‘alternative ref’. This is because for some contributors the former reference is, in fact, the main reference, so we felt this was the best compromise. For example: http://archiveshub.ac.uk/data/gb1069-12 (see lower level entries).

The new EAD Editor will allow for descriptions with a former reference to be uploaded, edited and removed, but it will not provide the facility to create them from scratch.

Case Studies Wanted!

Finally, we have a case studies section – http://archiveshub.ac.uk/casestudies/. We’d love to hear from any researchers willing to provide us with a case study. It is a really useful way for us to convey the importance of the Hub to our funders.

More Product, Less Processing?

I’ve been reading a fascinating article by Mark A. Greene and Dennis Meissner, ‘More Product Less Process: Revamping Traditional Archival Processing‘ (PDF). I wanted to offer a summary of the article.

The essence of this article is that archivists spend too long processing collections (appraising, cataloguing and carrying out minor preservation). This approach is not working; the cataloguing backlog continues to increase. We are too conservative, cautious and set in our ways, and we need to think about a new approach to cataloguing that is more pragmatic and user-focussed. The article was written by archivists in the USA, but would seem to apply to archives here in the UK, where we know that the backlog is a continuing problem.

I think the article makes the argument well and with a good deal of conviction. The bottom line is that we must rethink our approach unless we are to continue to accrue backlogs and deny researchers access to hugely valuable primary source material.

However, there are arguments in support of detailed cataloguing. For digital archives it is extremely useful to provide metadata at the item level, enabling such useful resources as http://archiveshub.ac.uk/data/gb1837des-dca?page=3#id634580. With this detailed list, researchers can see digital resources described and then access them directly. It could be argued that if a collection is to be digitised, providing this sort of level of metadata is appropriate, and in general it is the more valuable and highly used collections that are digitised. But for born-digital collections, this level of detail would be totally unsustainable.

Also, I wonder if the work that volunteers do should be taken into account – they may be able to help us catalogue in more detail, whilst trained archivists continue to create the main collection or series-level descriptions. I remember a whole band of NADFAS volunteers cataloguing photographs where I used to work. Furthermore, I was speaking to an archivist recently who said that they had taken the time to weed out duplicates (something this report criticises)…and then sold them on eBay for a tidy profit, that helped them fund their very under-resourced archive (they had the rights to do this!). So, maybe there are factors to take into consideration that support a detailed approach, but I think a bold approach to examining this whole area in UK archives would be very welcome.

Some of the points made in the report:

  • Archivists spend too much time cataloguing, not necessarily doing what is necessary. We think in terms of an ideal that we have to reach, although we haven’t actually articulated what this ideal is, and really examined it.
  • We are too attached to old-fashioned ways of doing things, which worked when we had smaller collections to deal with, but are not appropriate for large 20th century collections.
  • We give a higher priority to serving the needs of our collections rather than the needs of our users.
  • We need a new set of guidelines that focus on what we absolutely need to do.
  • We need to discuss, debate and examine our approach to cataloguing, and not be defensive about our roles.
  • We tend to arrange collections down to item level. In particular, we carry out preservation activities to this level. We accept the premise that basic preservation steps necessitate an item-level approach.
  • We often remove all metal fastenings and put materials into acid-free folders. So, even if we do not describe collections down to item level (maybe we just describe at collection or series level), we go down to this level of detail in our preservation activities. Yet, with good climate control, metal fasteners should not rust, and as yet we do not have strong evidence of a detrimental effect of standard manila folders if the material is stored in a controlled environment.
  • We often weed out duplicates throughout a collection, which requires processing down to item level. Is this really worth doing?
  • The various sources of advice about the level of detail we process archives to are inconsistent. Some sources advocate description to series level, but preservation activities to item level. NARA advocates preservation in accordance with intrinsic value and anticipated use, so, for example, new folders should only be used if current ones are damaged, and metal fasteners should be removed only if ‘appropriate’ – meaning where they are causing obvious damage.
  • We seem to believe that we need to aspire to ‘a substantial, multi-layered, descriptive finding aid,’ a reflection of ‘slow, careful scholarly research’.  But in reality, maybe we should adopt a more flexible approach, taking each collection in turn on its merits. Some may justify detailed cataloguing, but many do not.
  • We should take the position that users come to do research, and that we do not have to do this for them in advance.
  • We should ‘get beyond our absurd over-cautiousness’ about providing access to unprocessed collections, and make them available unless there are good legal or preservation reasons to restrict access or the collection is of extremely high value.
  • We have very inadequate processing metrics. Attempts to quantify processing expectations have resulted in wildly differing figures. Figures given in various studies include 3, 6.9, 8, 12.7 and 10.6 hours per cubic foot. Other studies have come up with between 3 and 5.5 days per foot.
  • One major study  by an archive centre revealed 15.1 hours were spent on each cubic foot, far more than the value that was placed upon  what was accomplished. The study gave ‘an improved sense of the real and total costs involved’.
  • The Greene/Meissner study looked at various projects funded by NHPRC grants (National Historical Publications and Records Commission), and found an average productivity figure of 9 hours per foot, but with highs of around 67 hours per foot. It also conducted an email survey and found expectations of processing times averaged at 14.8 hours, although there was a high of 250 hours!
  • Grant funding often encourages an item-level focus, rather than helping us to really tackle our substantial backlogs. There should be more of a requirement to justify meticulous processing – it should only be for exceptional collections.
  • The study recommends aiming for a processing rate of 4 hours per cubic foot for most large 20th century collections, using a series-level approach for description and preservation.
  • Studies show a lack of standardisation, not only in our definitions but also around the levels of arrangement, preservation and access that are useful and necessary. We do not have proper administrative controls over this work. We tend to argue for each of us having a unique situation that does not allow for comparison, and we do not have a common sense of acceptable policies and procedures.
  • Whilst we continue to process to item level, a substantial number do not make catalogues available through OPACs or Websites, arguably prioritising processing over user needs.
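To put the per-foot figures above in perspective, here is a small Python sketch of the arithmetic. The rates are the ones quoted in the list; the collection size of 100 cubic feet is a hypothetical example, not a figure from the report.

```python
# Processing rates in hours per cubic foot, as quoted in the points above.
RATES = {
    "recommended target": 4,
    "NHPRC project average": 9,
    "survey expectation (average)": 14.8,
    "archive centre study": 15.1,
}

COLLECTION_SIZE = 100  # cubic feet; a made-up example collection


def processing_hours(cubic_feet, hours_per_foot):
    """Total processing time for a collection at a given rate."""
    return cubic_feet * hours_per_foot


if __name__ == "__main__":
    for label, rate in RATES.items():
        hours = processing_hours(COLLECTION_SIZE, rate)
        print(f"{label}: {hours:.0f} hours")
```

For a collection of that size, the recommended rate works out at 400 hours against 1,510 hours at the archive centre's measured rate, which shows how much of a backlog the difference in approach represents.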

The report concludes that maybe we should recognise that ‘the use of archival records…is the ultimate purpose of identification and administration.’ (SAA, Planning for the Archival Profession, 1986).  Maybe we should agree that a collection is catalogued if it ‘can be used productively for research.’ And maybe we should be willing to take a different approach for each collection, making choices and setting priorities, rather than being too caught up in a ‘love of craftmanship’ that could be seen as fastidiousness that does not truly serve the user.

The question seems to be how much would be lost by putting speed of processing before careful examination of all documents in a collection. Maybe this does require defining good cataloguing? Maybe we believe that our professional standing is tied up with undertaking detailed cataloguing…more so than with addressing the ever-increasing backlogs, where the papers are entirely inaccessible to researchers?

Greene and Meissner state that there should be a ‘golden minimum’ for processing, where we adequately address user needs and only go beyond this where there are demonstrable business reasons. They also believe that arrangement, description and preservation should all occur at the same level of detail, again, unless there are good reasons to deviate from this.

What do you think…?

Out and about or Hub contributor training

Every year we provide our contributors and potential contributors with free training on how to use our EAD editor software.

The days are great fun and we really enjoy the chance to meet archivists from around the UK and find out what they are working on.

The EAD editor has been developed so that archivists can create online descriptions of their collections without having to know EAD.  It’s intuitive and user friendly and allows contributors to easily add collection level and multi-level descriptions to the Hub.  Users can also enhance their descriptions by adding digital archival objects  – images, documents and sound files.

Contributor training day

Our training days are a mixture of presentation, demonstration and practical hands-on work. We (the training team consists of Jane, Beth and myself) tend to start by talking a little about Hub news and developments to set the scene for the day, and then we move on to why the Hub uses EAD and why using standards is important for interoperability and means that more ‘stuff’ can be done with the data. We go from here on to a hands-on session that demonstrates how to create a basic record. We also cover adding lower level components and images, and we show contributors how to add index terms to their descriptions. (Something that we heartily endorse! We LOVE standards and indexing!)

We always like to tailor our training to the users, and encourage users to bring along their own descriptions for the hands-on sessions. Some users manage to submit their first descriptions to the Hub by the end of the training session!

This year we have done training in Manchester and London, for the Lifeshare project team in Sheffield and for the Oxford colleges. We are also hoping (if we get enough take up) to run courses in Glasgow and Cardiff this year. (6th Sept at Glasgow Caledonian, Cardiff date TBC. Email archiveshub@mimas.ac.uk to book a place)

So far this year three new contributors have joined the Hub as a result of training:  Middle East Centre Archive, St Antony’s College, Oxford; Salford City Archive and the Taylor Institute, Oxford. We’ve also enabled four of our existing contributors to start updating their collections on the Hub: National Fairground Archive, the Co-operative Archive, St John’s College, Oxford and the V&A.

We have been given some great feedback this year and 100% of our attendees agreed/strongly agreed that they were satisfied with the content and teaching style of the course.

Some of our feedback:

A very good introductory session to working with the EAD editor for the Archives Hub. I have not used the Archives Hub for a long time so an excellent refresher course.

This was a fantastic workshop – excellently designed resources, Lisa and Jane were really helpful (and patient!). The hands-on aspect was really useful: I now feel quite confident about creating EAD records for the Hub, and even more confident that the Hub team are on hand with online help

The hands on experience and being able to ask questions of the course leaders as things happened was really useful. Being able to work on something relevant to me was also a bonus.

Excellent presentation and delivery. I came along with a theoretical but not a practical knowledge of the Archives Hub and its workings, and the training session was pitched perfectly and was completely relevant to my job. Many thanks.

The Hub team train archivists in how to use the EAD editor, archive students in EAD and social media, and research students in how to use the Hub to search for primary source materials. You can find the list of training we provide on our training pages: http://archiveshub.ac.uk/trainingmodules/ . We’re always happy to hear from people who are interested in training – do let us know!

HubbuB: August 2011

We are out and about in August. Jane and Joy will be going to the Society of American Archivists’ Conference this year, speaking as part of a panel session. We will be talking about Discovery, the Archives Hub and Linked Data. We’re also very excited to be visiting the OCLC offices in Dublin, Ohio. Lisa and Bethan will be at the Archives and Records Association conference in Edinburgh, so go and say hello if you are there. Lisa is also speaking at the conference.

Our Monthly Feature is all levitating women and moustachioed men, as we take a trip into Magic and Illusion at the Fairground Archive: http://archiveshub.ac.uk/features/magic/. Some great images, and a lovely photograph of Cyril Critchlow, a wizard in his 80s, performing as ‘Wizardo, Harry Potter’s grandfather’!

We’ve recently created a page of Top Tips for Cataloguing: http://archiveshub.ac.uk/cataloguingtips/. These are some of the key areas that we believe are important for good online catalogues. We do still find that archivists don’t always think about the global online environment, so it’s worth setting out some of the most important points to bear in mind. It’s partly about thinking of the audience, browsing the Web, using Google, scanning pages for relevant content, and it’s partly about descriptions – ensuring that the title is as clear and self-explanatory as possible, thinking about how best to describe the archive in a way that is user-friendly.

We’ve been talking about ways to help get descriptions onto the Hub when they are created in Microsoft Word or Excel. We’re just exploring possibilities at the moment, but we are interested in anyone who uses, or knows anyone who uses, Microsoft Word to catalogue. Maybe smaller offices, or maybe you ask volunteers to do some of this?

We know people do use Microsoft Excel as well. We are thinking about ‘Tips for using Excel’. Would this be useful? We don’t necessarily want to give the impression that Excel is the most appropriate choice for cataloguing – it’s spreadsheet software, not really designed for complex hierarchical archives. But we do realise that for some people, the choice of what to use is limited, and we want to do our best to accommodate the realities that people are faced with.

We’ve had some interest in the idea of researchers being able to request digital copies of archives through the Hub. That is, a researcher comes across an archive they would like to see, and they would like digital copies, so they indicate this in some way. This is not yet fully thought out, but again, we’d need to know if there is a need for this. How many offices are starting to digitise on demand?

Finally, we’re covering music, dance, plants, medicine and the Middle East with our latest contributors. Check out who is recently on board on our contributors’ page:
http://archiveshub.ac.uk/contributors/

A bit about Resource Discovery

The UK Archives Discovery Network (UKAD) recently advertised our forthcoming Forum on the archives-nra listserv. This prompted one respondent to ask whether ‘resource discovery’ is what we now call cataloguing and getting the catalogues online. The respondent went on to ask why we feel it necessary to change the terminology of what we do, and labelled the term resource discovery as ‘gobbledegook’. My first reaction to this was one of surprise, as I see it as a pretty plain-talking way of describing the location and retrieval of information, but then I thought that it’s always worth considering how people react and what leads them to take a different perspective.

It made me think that even within a fairly small community, which archivists are, we can exist in very different worlds and have very different experiences and understanding. To me, ‘resource discovery’ is a given; it is not in any way an obscure term or a novel concept. But I now work in a very different environment from when I was an archivist looking after physical collections, and maybe that gives me a particular perspective. Being manager of the Archives Hub, I have found that a significant amount of time has to be dedicated to learning new things and absorbing new terminology. There seem to be learning curves all over the place, some little and some big. Learning curves around understanding how our Hub software (Cheshire) processes descriptions, Encoded Archival Description, deciding whether to move to the EAD schema, understanding namespaces, search engine optimisation, sitemaps, application programming interfaces, character encoding, stylesheets, log reports, ways to measure impact, machine-to-machine interfaces, scripts for automated data processing, linked data and the semantic web, etc. A great deal of this is about the use of technology, and figuring out how much you need to know about technology in order to use it to maximum effect. It is often a challenge, and our current Linked Data project, Locah, is very much a case in point (see the Locah blog). Of course, it is true that terminology can sometimes get in the way of understanding, and indeed, defining and having a common understanding of terms is often itself a challenge.

My expectation is that there will always be new standards, concepts and innovations to wrestle with, try to understand, integrate or exclude, accept or reject, on pretty much a daily basis. When I was the archivist at the RIBA (Royal Institute of British Architects), back in the 1990s, my world centered much more around solid realities: around storerooms, temperature and humidity, acquisitions, appraisal, cataloguing, searchrooms and the never-ending need for more space and more resources. I certainly had to learn new things, but I also had to spend far more time than I do now on routine or familiar tasks; very important, worthwhile tasks, but still largely familiar and centered around the institution that I worked for and the concepts and terminology commonly used by archivists. If someone had asked me what resource discovery meant back then, I’m not sure how I would have responded. I think I would have said that it was to do with cataloguing, and I would have recognised the importance of consistency in cataloguing. I might have mentioned our Website, but only in as far as it provided access through to our database. The issues around cross-searching were still very new and ideas around usability and accessibility were yet to develop.

Now, I think about resource discovery a great deal, because I see it as part of my job to think of how to best represent the contributors who put time and effort into creating descriptions for the Hub. To use another increasingly pervasive term, I want to make the data that we have ‘work harder’. For me, catalogues that are available within repositories are just the beginning of the process. That’s fine if you have researchers who know that they are interested in your particular collections. But we need to think much more broadly about our potential global market: all the people out there who don’t know they are interested in archives – some, even, who don’t really know what archives are. To reach them, we have to think beyond individual repositories and we have to see things from the perspective of the researcher. How can we integrate our descriptions into the ‘global information environment’ in a much more effective way? A most basic step here, for example, is to think about search engine optimisation. Exposing archival descriptions through Google, and other search engines, has to be one very effective way to bring in new researchers. But it is not a straightforward exercise – books are written about SEO and experts charge for their services in helping optimise data for the Web. For the Archives Hub, we were lucky enough to be part of an exercise looking at SEO and how to improve it for our site. We are still (pretty much as I write) working on exposing our actual descriptions more effectively.

Linked Data provides another whole world of unfamiliar terminology to get your head round. Entities, triples, URI patterns, data models, concepts and real world things, SPARQL queries, vocabularies – the learning curve has indeed been steep. Working on outputting our data as RDF (a modelling framework for Linked Data) has made me think again about our approach to cataloguing and cataloguing standards. At the Hub, we’re always on about standards and interoperability, and it’s when you come to something like Linked Data, where there are exciting possibilities for all sorts of data connections, well beyond just the archive community, that you start to wish that archivists catalogued far more consistently. If only we had consistent ‘extent’ data, for example, we could look at developing a lovely map-based visualisation showing where there are archives based on specific subjects all around the country and have a sense of where there are more collections and where there are fewer collections. If only we had consistent entries for people’s names, we could do the same sort of thing here, but even with thesauri, we often have more than one name entry for the same person. I sometimes think that cataloguing is more of an art than a science, partly because it is nigh on impossible to know what the future will bring, and therefore knowing how to catalogue to make the most of as yet unknown technologies is tricky to say the least. But also, even within the environment we now have, archivists do not always fully appreciate the global and digital environment which requires new ways of thinking about description. Which brings me back to the idea of whether resource discovery is another term for cataloguing and getting catalogues online. No, it is not. It is about the user perspective, about how researchers locate resources and how we can improve that experience.
It has increasingly become identified with the Web as a way to define the fundamental elements of the Web: objects that are available and can be accessed through the Internet, in fact, any concept that has an identity expressed as a URI. Yes, cataloguing is key to archives discovery, cataloguing to recognised standards is vital, and getting catalogues online in your own particular system is great…but there is so much more to the whole subject of enabling researchers to find, understand and use archives and integrating archives into the global world of resources available via the Web.