Archives Hub Data and Workflow

Introduction

As those of you who contribute to or use the Hub will know, we went live with our new system in Dec 2016.  At the heart of our new system is our new workflow.  One of the key requirements that we set out with when we migrated to a new system was a more robust and sustainable workflow; the system was chosen on the basis that it could accommodate what we needed.

This post is about the EAD (Encoded Archival Description) descriptions, and how they progress through our processing workflow. It is the data that is at the heart of the Archives Hub world. We also work with EAG (Encoded Archival Guide) for repository descriptions, and EAC-CPF (Encoded Archival Context, Corporate bodies, Persons and Families) for name entities. Our system actually works with JSON internally, but EAD remains our means of taking in data and providing data out via our API.

On the Archives Hub we now have two main means of data ingest: via our own EAD Editor, which can be thought of as ‘internal’, and via exports from archive systems, which can be thought of as ‘external’.

Data Ingest via the EAD Editor

1. The nature of the EAD

The Editor creates EAD according to the Archives Hub requirements. These have been carefully worked out over time, and we have a page detailing them at http://archiveshub.jisc.ac.uk/eadforthehub

Part of a Hub webpage about EAD requirements

When we started work on the new system, we were aware that having a clear and well-documented set of requirements was key. I would recommend having this before starting to implement a new system! But, as is often the case with software development, we didn’t have the luxury of doing that; we had to work it out as we went along, which was sometimes problematic, because you really need to know exactly what your data requirements are in order to set your system up. For example, you need to know which fields are mandatory and which are not. This sounds simple, but in reality it took us a good deal of thought, analysis and discussion.

EAD Editor

2. The scope of the EAD

EAD has plenty of tags and attributes! And they can be used in many ways. We can’t accommodate all of this in our Editor. Not only would it take time and effort, but it would result in a complicated interface that would not be easy to use.

EAD Tag Library

So, when we created the new Editor, we included the tags and attributes for data that contributors have commonly provided to the Hub, with a few more additions that we discussed and felt were worthwhile for various reasons. We are currently looking again at what we could potentially add to the Editor, and prioritising developments. For example, the <materialspec> EAD tag is not accommodated at the moment. But if we find that our contributors use it, then there is a good argument for including it, as details specific to types of materials, such as map scales, can be useful to the end user.

We don’t believe that the Archives Hub necessarily needs to reflect the entire local catalogue of a contributor. It is perfectly reasonable to have a level of detail locally that is not brought across into an aggregator. Having said that, we do have contributors who use the Archives Hub as their sole online catalogue, so we do want to meet their needs for descriptive data. Field headings are an example of content we don’t utilise. These are contained within <head> tags in EAD. The Editor doesn’t provide for adding these. (A contributor who creates data elsewhere may include <head> tags, but they just won’t be used on the Hub; see ‘Uploading to the Editor’ below.)

We will continue to review the scope in terms of what the Editor displays and allows contributors to enter and revise; it will always be a work in progress.

3. Uploading to the Editor

In terms of data, the ability to upload to the Editor creates challenges for us. We wanted to preserve this functionality, as we had it on the old Editor, but as EAD is so permissive, the descriptions can vary enormously, and we simply can’t cope with every possible permutation. We undertake the main data analysis and processing within our main system, and trying to effectively replicate this in the Editor in order to upload descriptions would duplicate effort and create significant overheads. One of our approaches to this issue is that we will preserve the data that is uploaded, but it may not display in the Editor. If you think of the model as ‘data in’ > ‘data editing’ > ‘data out’, then the idea is that the ‘data in’ and ‘data out’ provide all the EAD, but the ‘data editing’ may not necessarily allow for editing of all the data. A good example of this situation occurs with the <head> tag, which is used for section headings. We don’t use these on the Hub, but we can ensure they remain in the EAD and they are there in the output from the Editor, so they are retained, but not displayed in the Editor. They can then be accessed by other means, such as through an XML Editor, and displayed in other interfaces.
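The ‘data in’ > ‘data editing’ > ‘data out’ idea can be illustrated with a small sketch. This is not the Editor’s actual code: the set of editable tags, the function names and the sample record are all assumptions made for the example, using Python’s standard XML library. It shows a <head> element that is never offered for editing but is still present in the output.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of EAD elements that an editing form exposes.
EDITABLE_TAGS = {"unittitle", "unitdate", "p"}

def editable_fields(root):
    """Return (tag, text) pairs for the elements an editing form would show."""
    return [(el.tag, (el.text or "").strip())
            for el in root.iter() if el.tag in EDITABLE_TAGS]

def data_out(root):
    """Serialise the whole tree, including elements the form never displayed."""
    return ET.tostring(root, encoding="unicode")

# Invented sample description, for illustration only.
SAMPLE = (
    "<archdesc level='fonds'>"
    "<did><unittitle>Papers of an example person</unittitle></did>"
    "<scopecontent><head>Scope and Content</head>"
    "<p>Letters and diaries.</p></scopecontent>"
    "</archdesc>"
)

root = ET.fromstring(SAMPLE)      # data in
fields = editable_fields(root)    # data editing: <head> is not offered
root.find("did/unittitle").text = "Papers of an example person (revised)"
output = data_out(root)           # data out: <head> is still there
```

The point of the design is that editing works on the full tree rather than on a copy built from the form fields, so anything the form does not understand passes through untouched.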

We have disabled upload of exports from the Calm system to the Editor at present, as we found that the data variations, which often caused the EAD to be invalid, were too much for our Editor to cope with. It has to analyse the data that comes in and decide which fields to populate with which data. Some are straightforward – ‘title’ goes into <unittitle> for example, but some are not…for example, Calm has references and alternative references, and we don’t have this in our system, so they cause problems for the Editor.
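A rough sketch of the mapping problem follows. The field names and the mapping table are hypothetical (they are not Calm’s export format or the Editor’s real configuration); the idea is simply that fields with a known target are mapped, and anything without one is set aside rather than guessed at.

```python
# Hypothetical mapping from incoming field names to EAD elements.
FIELD_TO_EAD = {
    "title": "unittitle",
    "date": "unitdate",
    "reference": "unitid",
}

def map_record(record):
    """Split a record into fields we can map to EAD and fields we cannot."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        tag = FIELD_TO_EAD.get(field)
        if tag is None:
            unmapped[field] = value   # no home for this in our model
        else:
            mapped[tag] = value
    return mapped, unmapped

# Invented record with an alternative reference, which has no target.
record = {
    "title": "Letter to a friend",
    "reference": "ABC/1",
    "alternative_reference": "OLD 42",
}
mapped, unmapped = map_record(record)
```

Here the title and reference are mapped, while the alternative reference ends up in `unmapped`, which is the kind of case that causes difficulty on upload.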

4. Output from the Editor

When a description is submitted to the Archives Hub from the Editor, it is uploaded to our system (CIIM, pronounced ‘sim’), which is provided by Knowledge Integration, and modified for our own data processing requirements.

CIIM Browse screen

The CIIM framework allows us to implement data checking and customised transformations, which can be specific to individual repositories. For the data from the Editor, we know that we only need a fairly basic default processing, because we are in control of the EAD that is created. However, we will have to consider working with EAD that is uploaded to the Editor, but has not been created in the Editor – this may lead to a requirement for additional data checking and transformations. But the vast majority of the time descriptions are created in the Editor, so we know they are good, valid, Hub EAD, and they should go through our processing with no problems.
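As an illustration of default processing plus repository-specific transformations, here is a minimal sketch in plain Python. It is not CIIM code or configuration; the step functions, the repository key and the sample description are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Description = Dict[str, str]
Step = Callable[[Description], Description]

def trim_whitespace(desc: Description) -> Description:
    """Collapse runs of whitespace in every field."""
    return {key: " ".join(value.split()) for key, value in desc.items()}

def require_title(desc: Description) -> Description:
    """A basic data check: every description needs a title."""
    if not desc.get("unittitle"):
        raise ValueError("description has no unittitle")
    return desc

def strip_trailing_full_stop(desc: Description) -> Description:
    """Hypothetical tweak wanted by one repository only."""
    out = dict(desc)
    out["unittitle"] = out["unittitle"].rstrip(".")
    return out

DEFAULT_STEPS: List[Step] = [trim_whitespace, require_title]
REPOSITORY_STEPS: Dict[str, List[Step]] = {
    "example-repo": [strip_trailing_full_stop],   # hypothetical repository key
}

def process(desc: Description, repository: str) -> Description:
    """Run the default steps, then any steps specific to the repository."""
    processed = dict(desc)   # work on a copy so the original is kept as supplied
    for step in DEFAULT_STEPS + REPOSITORY_STEPS.get(repository, []):
        processed = step(processed)
    return processed

original = {"unittitle": "  Minute   books. "}
processed = process(original, "example-repo")
default_only = process(original, "another-repo")
```

Descriptions created in the Editor would only need the default steps; extra steps are added per repository where the incoming data calls for them.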

Data Ingest from External Data Providers

1. The nature of the EAD

EAD from systems such as Calm, Archivists’ Toolkit and AtoM is going to vary far more than EAD produced from the Editor. Some of the archival management systems have EAD exports. To have an export is one thing; it is not the same as producing EAD that the Hub can ingest. There are a number of factors here. The way people catalogue varies enormously, so, aside from the system itself, the content can be unpredictable. We have to deal with how people enter references; how they enter dates; whether they provide normalised dates for searching; whether entries in fields such as language are properly divided up, or whether one entry box is used for ‘English, French, Latin’, or ‘English and a small amount of Latin’; whether references are always unique; whether levels are used to group information, rather than to represent a group of materials; what people choose to put into ‘origination’ and if they use both ‘origination’ and ‘creator’; whether fields are customised; and so on.
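Two of these checks, language entries and unique references, can be sketched quite simply. The list of known languages, the function names and the sample values are assumptions for the example, not our actual processing rules.

```python
import re
from collections import Counter

# Hypothetical list of language names we are prepared to recognise.
KNOWN_LANGUAGES = {"English", "French", "Latin", "Welsh"}

def normalise_languages(entry):
    """Split a free-text language entry; return (recognised, needs_review)."""
    recognised, needs_review = [], []
    for part in re.split(r"[,;]|\band\b", entry):
        part = part.strip()
        if not part:
            continue
        if part in KNOWN_LANGUAGES:
            recognised.append(part)
        else:
            needs_review.append(part)   # leave for a person to look at
    return recognised, needs_review

def duplicate_references(references):
    """Return the references that occur more than once."""
    counts = Counter(references)
    return sorted(ref for ref, count in counts.items() if count > 1)

clean = normalise_languages("English, French, Latin")
messy = normalise_languages("English and a small amount of Latin")
dupes = duplicate_references(["A/1", "A/2", "A/1"])
```

The first entry splits cleanly into three languages. The second yields ‘English’ plus the fragment ‘a small amount of Latin’, which an automated process cannot safely interpret, so it is flagged for review rather than guessed at.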

The system itself will influence the EAD output. A system will have a template, or transformation process, that maps the internal content to EAD. We have only worked in any detail with the Calm template so far. Axiell, the provider of Calm, made some changes for us. For example, only six languages were exporting when we first started testing the export, so they expanded this list. We then made additional changes, such as allowing for multiple creators, subjects and dates to export, and ensuring languages in Welsh would export. This does mean that any potential Calm exporter needs to use this new template, but Axiell are going to add it to their next upgrade of Calm.

We are currently working to modify the AdLib template, before we start testing out the EAD export. Our experience with Calm has shown us that we have to test the export with a wide variety of descriptions, and modify it accordingly, and we eventually get to a reasonably stable point, where the majority of descriptions export OK.

We’ve also done some work with AtoM, and we are hoping to be able to harvest descriptions directly from the system.

2. The scope of the EAD

As stated above, finding aids can be wide ranging, and EAD was designed to reflect this, but as a result it is not always easy to work with. We have worked with some individual Calm users to extend the scope of what we take in from them, where they have used fields that were not being exported. For instance, information about condition and reproduction was not exporting in one case, due to the particular fields used in Calm, which were not mapping to EAD in the template. We’ve also had instances of index terms not exporting, and sometimes this has been due to the particular way an institution has set up their system. It is perfectly possible for an institution to modify the template themselves so that it suits their own particular catalogues, but this is something we are cautious about, as having large numbers of customised exports is going to be harder to manage, and may lead to more unpredictable EAD.

3. Uploading to the Editor

In the old Hub world, we expected exports to be uploaded to the Editor. A number of our contributors preferred to do this, particularly for adding index terms. However, this led to problems for us because we ended up with such varied EAD, which militated against our aim of interoperable content. If you catalogue in a system, export from that system, upload to another system, edit in that system, then submit to an aggregator (and you do this sometimes, but other times you don’t), you are likely to run into problems with version control. Over the past few years we have done a considerable amount of work to clarify ‘master’ copies of descriptions. We have had situations where contributors have ended up with different versions to ours, and not necessarily been aware of it. Sometimes the level of detail would be greater in the Hub version, sometimes in the local version. It led to a deal of work sorting this out, and on some occasions data simply had to be lost in the interests of ending up with one master version, which is not a happy situation.

We are therefore cautious about uploading to the Editor, and we are recommending to contributors that they either provide their data directly (through exports) or they use the Editor. We are not ruling out a hybrid approach if there is a good reason for it, but we need to be clear about when we are doing this, what the workflow is, and where the master copy resides.

4. Output from Exported Descriptions

When we pass the exports through our processing, we carry out automated transformations based on analysis of the data. The EAD that we end up with – the processed version – is appropriate for the Hub. It is suitable for our interface, for aggregated searching, and for providing to others through our APIs. The original version is kept, so that we have a complete audit trail, and we can provide it back to the contributor. The processed EAD is provided to the Archives Portal Europe. If we did not carry out the processing, APE could not ingest many of the descriptions, or else they would ingest, but not display to the optimum standard.

Future Developments

Our automated workflow is working well. We have taken complete, or near complete,  exports from Calm users such as the Universities of Nottingham, Hull and (shortly) Warwick, and a number of Welsh local authority archives. This is a very effective way to ensure that we have up-to-date and comprehensive data.

We have well over one hundred active users of the EAD Editor and we also have a number of potential contributors who have signed up to it, keen to be part of the Archives Hub.

We intend to keep working on exports, and also hope to return to some work we started a few years ago on taking in Excel data. This is likely to require contributors to use our own Excel template, as it is impractical to work with locally produced templates. The problem is that working with one repository’s spreadsheet, translating it into EAD, could take weeks of work, and it would not replicate to other repositories, who will have different spreadsheets. Whilst Excel is reasonably simple, and most offices have it, it is also worth bearing in mind that creating data in Excel has considerable shortcomings. It is not designed for hierarchical archival data, which has requirements in terms of both structure and narrative, and is constantly being revised. TNA’s Discovery are also working with Excel, so we may be able to collaborate with them in progressing this area of work.

Our new architecture is working well, and it is gratifying to see that what we envisaged when we started working with Knowledge Integration and started setting out our vision for our workflow is now a reality.  Nothing stands still in archives, in standards, in technology or in user requirements, so we cannot stand still either, but we have a set-up that enables us to be flexible, and modify our processing to meet any new challenges.

Archives Hub Search Analysis

Search logs can give us an insight into how people really search. Our current system provides ‘search logs’ that show the numbers based on the different search criteria and faceting that the Hub offers, including combined searches. We can use these to help us understand how our users search and to give us pointers to improve our interface.

The Archives Hub has a ‘default search’ on the homepage and on the main search page, so that the user can simply type a search into the box provided. This is described as a keyword search, as the user is entering their own significant search terms and the results returned include any archival description where the term(s) are used.

The researcher can also choose to narrow down their search by type. The figure below shows the main types the Archives Hub currently has. Within these types we also have boolean type options (all, exact, phrase), but we have not analysed these at this point other than for the main keyword search.

Archives Hub search box showing the types of searches available

There are caveats to this analysis.

1. Results will include spiders and spam

With our search logs, excluding bots is not straightforward, something which I refer to in a previous post: Archives Logs and Google Analytics. We are shortly to migrate to an entirely new system, so for this analysis we decided to accept that the results may be slightly skewed by these types of searches. And, of course, these crawlers often perform a genuine service, exposing archive descriptions through different search engines and other systems.

2. There is a small number of unaccounted-for searches

Unidentified searches only account for 0.5% of the total, and we could investigate the origins of these searches, but we felt the time it would take was not worth it at this point in time.

3. Figures will include searches from the browse list

These figures include searches actioned by clicking on a browse list, e.g. a list of subjects or a list of creators.

4. Creator, Subject and Repository include faceted searching

The Archives Hub currently has faceted searching for these entities, so when a user clicks to filter down by a specific subject, that counts as a subject search.

Results for One Month (October 2015)

Monthly figures for searches

For October 2015 the total searches are 19,415. The keyword search dominates, with a smaller use of the ‘any’ and ‘phrase’ options within the keyword search. This is no surprise, but this ‘default search’ still forms only 36% of the whole, which does not necessarily support the idea that researchers always want a ‘google type’ search box.

We did not analyse these additional filters (‘any/phrase/exact’) for all of the searches, but looking at them for ‘keyword’ gives a general sense that they are useful, but not highly used.

A clear second is search by subject, with 17% of the total. The subject search was most commonly combined with other searches, such as a keyword and further subject search. Interestingly, subject is the only search where a combined subject + other search(es) is higher than a single subject search. If we look at the results over a year, the combined subject search is by far the highest number for the whole year, in fact it is over 50% of the total searches. This strongly suggests that bots are commonly responsible for combined subject searches.

These searches are often very long and complex, as can be seen from the search logs:

[2015-09-17 07:36:38] INFO: 94.212.216.52:: [+0.000 s] search:: [+0.044 s] Searching CQL query: (dc.subject exact “books of hours” and/cql.relevant/cql.proxinfo (dc.subject exact “protestantism” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “authors, classical” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “law” and/cql.relevant/cql.proxinfo (dc.subject exact “poetry” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “sermons” and/cql.relevant/cql.proxinfo bath.personalname exact “rawlinson richard 1690-1755 antiquary and nonjuror”))))))))):: [+0.050 s] 1 Hits:: Total time: 0.217 secs

It is most likely that the bots are not nefarious; they may be search engine bots, or they may be indexing for the purposes of  information services of some kind, such as bibliographic services, but they do make attempts to assess the value of the various searches on the Hub very difficult.
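One way to pick out searches like this is to count the subject clauses in each logged query. The sketch below is an assumption about how that might be done, not something we currently run: the threshold of three is arbitrary, and the sample line is a shortened, illustrative line in the same shape as the log entry above.

```python
import re

# Matches each 'dc.subject exact "..."' clause, with straight or curly quotes.
SUBJECT_CLAUSE = re.compile(r'dc\.subject exact [“"]([^”"]+)[”"]')
HITS = re.compile(r"(\d+) Hits")

def summarise(line, max_subjects=3):
    """Count subject clauses in a logged query and flag unusually long ones."""
    subjects = SUBJECT_CLAUSE.findall(line)
    hits = HITS.search(line)
    return {
        "subject_clauses": len(subjects),
        "distinct_subjects": len(set(subjects)),
        "hits": int(hits.group(1)) if hits else None,
        # Arbitrary threshold: a person rarely combines this many subjects.
        "suspect": len(subjects) > max_subjects,
    }

# Shortened, illustrative log line (not a verbatim entry).
LINE = (
    "Searching CQL query: "
    "(dc.subject exact “books of hours” and/cql.relevant/cql.proxinfo "
    "(dc.subject exact “protestantism” and/cql.relevant/cql.proxinfo "
    "(dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo "
    "(dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo "
    "bath.personalname exact “rawlinson richard 1690-1755 antiquary and nonjuror”))))"
    ":: [+0.050 s] 1 Hits"
)

summary = summarise(LINE)
```

Repeated subjects within one query, as in this line, are a further hint that the search was generated automatically by following facet links rather than typed by a person.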

Of the remaining search categories available from the main search page, it is no surprise that ‘title’ is used a fair bit, at 6.5%, and then after that creator, name, and organisation and personal name. These are all fairly even. For October 2015 they are around 3% of the total each, and it seems to be similar for other months.

The repository filter is popular. Researchers can select a single repository to find all of their descriptions (157), select a single repository and also search terms (916), and also search for all the descriptions from a single repository from our map of contributors (125). This is a total of 1,198, which is 6.1% of the total. If we also add the faceted filter by repository, after a search has been carried out, the total is 2,019, and the percentage is 10.4%. Looking at the whole year, the various options to select repository become an even bigger percentage of the total, in particular the faceted filter by repository.   This suggests that improvements to the ability to select repositories, for example, by allowing researchers to select more than one repository, or maybe type of repository, would be useful.
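For anyone wanting to check the arithmetic, the repository figures can be reproduced from the numbers quoted above. The variable names are just labels for this example.

```python
TOTAL_SEARCHES = 19_415          # all searches in October 2015

repository_only = 157            # single repository, all descriptions
repository_with_terms = 916      # single repository plus search terms
repository_from_map = 125        # chosen from the map of contributors
with_faceted_filter = 2_019      # total once the faceted filter is added

def share(count, total=TOTAL_SEARCHES):
    """Percentage of all searches, to two decimal places."""
    return round(100 * count / total, 2)

direct_total = repository_only + repository_with_terms + repository_from_map
direct_share = share(direct_total)          # 6.17, quoted as 6.1% in the text
faceted_share = share(with_faceted_filter)  # 10.4
```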

Google Map on the Hub showing the link to search by contributor

We have a search within multi-level descriptions, introduced a few years ago, and that clearly does get a reasonable amount of use, with 1,404 uses in this particular month, or 7.2% of the total. This is particularly striking as this is only available within multi-level descriptions. It is no surprise that this is valuable for lengthy descriptions that may span many pages.

The searches that get minimal use are identifier, genre, family name and epithet. This is hardly surprising, and illustrates nicely some of the issues around how to measure the value of something like this.

Identifier enables users to search by the archival reference. This may not seem all that useful, but it tends to be popular with archivists, who use the Hub as an administrative tool. However, the current Archives Hub reference search is poor, and the results are often confusing. It seems likely that our contributors would use this search more if the results were more appropriate. We believe it can fulfill this administrative function well if we adjust the search to give better quality results; it is never likely to be a highly popular search option for researchers as it requires knowledge of the reference numbers of particular descriptions.

Epithet is tucked away in the browse list, so a ‘search’ will only happen if someone browses by epithet and then clicks on a search result. Would it be more highly used if we had a ‘search by occupation or activity’? There seems little doubt of this. It is certainly worth considering making this a more prominent search option, or at least getting more user feedback about whether they would use a search like this. However, its efficacy may be compromised by the extremely permissive nature of epithet for archival descriptions – the information is not at all rigorous or consistent.

Family name is not provided as a main search option, and is only available by browsing for a family name and clicking on a result, as with epithet. The main ‘name’ search option enables users to search by family name. We did find the family name search was much higher for the whole year, maybe an indication of use by family historians and of the importance of family estate records.

Genre is in the main list of search options, but we have very few descriptions that provide the form or medium of the archive. However, users are not likely to know this, and so the low use may also be down to our use of ‘Media type’, which may not be clear, and a lack of clarity about what sort of media types people can search for. There is also, of course, the option that people don’t want to search on this facet. However, looking at the annual search figures, we have 1,204 searches by media type, which is much more significant, and maybe could be built up if  we had something like radio buttons for ‘photographs’, ‘manuscripts’, ‘audio’ that were more inviting to users. But, with a lack of categorisation by genre within the descriptions that we have, a search on genre will mean that users filter out a substantial amount of relevant material. A collection of photographs may not be catalogued by genre at all, and so the user would only get ‘photographs’ through a keyword search.

Place name is an interesting area. We have always believed that users would find an effective ‘search by place’ useful. Our place search is in the main search options, but most archivists do not index their descriptions by place and because of this it does not seem appropriate to promote a place name search. We would be very keen to find ways to analyse our descriptions and consider whether place names could be added as index terms, but unless this happens, place name is rather like media type – if we promote it as a means to find descriptions on the Archives Hub, then a hit list would exclude all of those descriptions that do not include place names.

This is one of the most difficult areas for a service like the Archives Hub. We want to provide search options that meet our users’ needs, but we are aware of the varied nature of the data. If a researcher is interested in ‘Bath’ then they can search for it as a keyword, but they will get all references to bath, which is not at all the same as archives that are significantly about Bath in Somerset. But if they search for place name: bath, then they exclude any descriptions that are significantly about Bath, but not indexed by place. In addition, words like this, that have different meanings, can confuse the user in terms of the relevance of the results because ‘bath’ is less likely to appear in the title. It may simply be that somewhere in the description, there is a reference to a Dr Bath, for example.
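The trade-off can be shown with a toy example. The three descriptions below are invented, and the two functions are a deliberately naive sketch rather than how the Hub search works: a keyword search looks anywhere in the text, while a place search only looks at index terms.

```python
# Invented descriptions; only the first has been indexed by place.
DESCRIPTIONS = [
    {"title": "Records of a Bath theatre",
     "text": "Playbills and accounts.",
     "places": ["Bath"]},
    {"title": "Papers of an architect",
     "text": "Plans for buildings in Bath.",
     "places": []},
    {"title": "Letters of a physician",
     "text": "Includes a letter from Dr Bath.",
     "places": []},
]

def keyword_search(term):
    """Match the term anywhere in the title or text."""
    term = term.lower()
    return [d["title"] for d in DESCRIPTIONS
            if term in (d["title"] + " " + d["text"]).lower()]

def place_search(place):
    """Match only against place index terms."""
    place = place.lower()
    return [d["title"] for d in DESCRIPTIONS
            if place in [p.lower() for p in d["places"]]]

by_keyword = keyword_search("bath")
by_place = place_search("bath")
```

The keyword search returns all three, including the letter from Dr Bath, which has nothing to do with the city. The place search returns only the theatre records and misses the architect’s papers, which are relevant but not indexed.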

This is one reason why we feel that encouraging the use of faceted search will be better for our users. A simpler initial search is likely to give plenty of results, and then the user can go from there to filter by various criteria.

It is worth mentioning ‘date’ search. We did have this at one point, but it did not give good results. This is partly due to many units of description not including normalised dates. But the feedback that we have received suggests that a date search would be popular, which is not surprising for an archives service.  We are planning to provide a filter by date, as well as the ordering by date that we currently have.

Finally, I was particularly interested to see how popular our ‘search collection level only’ is. This enables users to only see ‘top level’ results, rather than all of the series and items as well. As it is a constant challenge to present hierarchical descriptions effectively, this would seem to be one means to simplify things. However, for October 2015 we had 17 uses of this function, and for the whole year only 148. This is almost negligible. It is curious that so few users chose to use this. Is it an indication that they don’t find it useful, or that they didn’t know what it means? We plan to have this as a faceted option in the future, and it will be interesting to see if that makes it more popular or not.

We are considering whether we should run this exercise using some sort of filtering to check for search engines, dubious IP addresses, spammers, etc., and therefore get a more accurate result in terms of human users. We would be very interested to hear from anyone who has undertaken this kind of exercise.

Connecting through defining people and relationships

If, as a researcher, you search for ‘Jane Drew’, the celebrated architect and town planner, on the Archives Hub, amongst other things, you might discover a single item, “Letter from Jane B Drew to John and Myfanwy Piper”, a letter in the “Papers of John and Myfanwy Piper”.

You can see that it’s a letter in a collection at the Tate Gallery Archive. The description of the collection is an example of a good quality traditional archival catalogue, giving a fairly detailed listing of the content of this particular collection. But as a researcher you are really interested in just this one letter. You may ask yourself a number of questions, possibly starting with (1) Is this the Jane Drew I’m interested in? and then (2) What is the relationship between Jane Drew and John and Myfanwy Piper? You may well be able to find answers by accessing the letter itself, but at this stage you may just want to place this connection in the broader context of Jane Drew’s life and work. As a researcher, understanding how these people are connected may shed light on your research interests.

In this blog I want to think about this question of relationships. The fact is that archivists rarely provide structured information about relationships; if there is information, it is usually in the biographical history, which might outline key events and people in someone’s life, referring to their parents, work colleagues, friends, etc. The nature of the relationship is sometimes explicitly given, but often it is not. Our standards don’t really say much about relationships between the entities (people, organisations, places, etc) that we describe in our catalogues.

Going back to the Papers of John and Myfanwy Piper as an example, the biographical history includes the following:

[John] Piper began writing reviews from the late 1920s making a name for himself as a critic writing for periodicals like ‘The Listener’ and the ‘Architectural Review’. From 1935-1937 he assisted Myfanwy Evans, with the production of a quarterly review of contemporary European abstract painting called ‘Axis’. In 1937 Piper was commissioned by his friend John Betjeman to write the ‘Shell Guide to Oxfordshire’. Piper went on to write and provide photographs for a number of the guides as well as edit the series. In the same year John Piper married the writer Myfanwy Evans.

This is typical of a biographical history: useful historical information about the individual or organisation. Within this there is information we can potentially use to create explicit relationship information:

John Piper ‘worked with’ Myfanwy Evans
John Piper ‘was friends with’ John Betjeman
John Piper ‘worked for’ John Betjeman
John Piper ‘was married to’ Myfanwy Evans

There are a number of issues to consider here:

How can we unambiguously identify the people?
How do we choose the vocabulary we use to define the relationships?
Do we try to include dates?
Is it reasonable for us to interpret relationships as ‘friendships’ or ‘collaborations’ if this is not actually explicit?
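To make these questions concrete, here is one possible way of holding the four statements above as structured data. It is only a sketch: the class, its field names and the choice of which statements count as ‘explicit’ are my own assumptions for the example, and plain name strings stand in for the unambiguous identifiers that a real implementation would need.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Relationship:
    subject: str                 # a name string; a real system needs an identifier
    predicate: str               # drawn from whatever vocabulary is agreed
    obj: str
    date: Optional[str] = None   # optional, as the source may not give one
    explicit: bool = True        # False where the cataloguer has interpreted

# Dates are taken from the biographical history quoted above.
# The 'explicit' flags reflect one reading of that text, for illustration.
RELATIONSHIPS = [
    Relationship("John Piper", "worked with", "Myfanwy Evans",
                 date="1935-1937", explicit=False),
    Relationship("John Piper", "was friends with", "John Betjeman"),
    Relationship("John Piper", "worked for", "John Betjeman",
                 date="1937", explicit=False),
    Relationship("John Piper", "was married to", "Myfanwy Evans", date="1937"),
]

def connections(name):
    """Everyone linked to the named person, whichever side they appear on."""
    return sorted({r.obj if r.subject == name else r.subject
                   for r in RELATIONSHIPS if name in (r.subject, r.obj)})

piper_links = connections("John Piper")
betjeman_links = connections("John Betjeman")
inferred = [r.predicate for r in RELATIONSHIPS if not r.explicit]
```

Even in this tiny form, each of the four issues shows up as a design decision: what goes in the name fields, what values the predicate may take, whether the date is required, and whether interpretation is recorded or hidden.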

We are looking at some of these issues through our AHRC project, Exploring British Design. They are all issues that archivists need to explore in a debate around relationship information, but the first issue to consider is simply whether we should be thinking more about including this kind of relationship information in our archival finding aids. Is it something that would be of real value to end users? This issue is coming more to the fore as we start to think about implementing ISAAR (CPF) and working with EAC-CPF, and also as Linked Open Data gains traction.

In a (well worth reading) recent article in the Journal of Contemporary Archival Studies, on the potential impact of EAC-CPF, K. M. Wisser reports the findings of a survey about relationship information. The survey received 208 responses from archivists/archives in the US. Wisser wrote: “The survey results indicate that the archival community has only just begun to consider relationships in the context of archival description and the role that explicit description of those relationships may play.”

As one respondent wrote:

“relationships are among the most important facets in a collection and deserve a high priority in description. One cannot understand the historical value of an event, person, or organization without knowing [the] relationship among and between them.”

One thing that really strikes me in Wisser’s findings is that archivists see relationships that are documented outside of the collection as almost as significant as those that are documented within the collection. Going back to our original topic of Jane Drew: who else did Jane Drew work with? Should we provide that information to our users, whether or not it is documented within the collection? Is our role to give as full an account as we can of Drew’s life and career? Is it to limit ourselves to what is within the collection?

Wisser’s survey asked respondents about the importance of relationship types. It is curious to me that archivists rated ‘collaborated with’ as a more important relationship than ‘studied with’; they rated a friendship as far more important when it was documented in the collection; and they rated ‘influenced by’ as generally not so important. I’m surprised that the respondents had such definite ideas about the relative importance of different types of relationships, especially when the majority appeared to agree with the importance of ‘objective cataloguing’.

In our Exploring British Design project, the work we did with researchers definitely confirmed to me the fairly self-evident observation that any relationship can be of major significance in research, even if it appears of minor significance within the archive, or indeed, within the literature in general. A brief collaboration may have been a crucial influence, a short friendship may have had hitherto unrealised impact, and anyway, the importance of the relationship depends upon the research you are doing. Researchers are not really aware of how challenging it is for us as information professionals to establish these kinds of relationships in ways that they can then access. But it is clear that this is the sort of connectivity they are after.

One of the challenges with documenting relationship types is that they can be hard to define. As Wisser notes:

“The concept of influence, however, proved the most problematic. Comments such as ‘influence is a squishy sort of relationship’ and ‘I think it would often be very difficult to prove that Entity A was influenced by Entity B’ indicate a notion of intangibility.”

The conclusion could be that we should leave well alone relationships that are hard to define. On the other hand, if we are in a position, as we research a collection, to highlight potential connections, that action could be of major value to a researcher, who may otherwise never know about a link that ends up being crucial to their particular research. The relationships that are easy to define are likely to have been defined already.

One thing that strikes me about the whole notion of introducing interpretation and opinion into cataloguing (a possible argument against defining relationships) is that the horse has pretty much bolted. I’ve looked at enough ‘objective’ descriptions to be aware that the names archivists choose to add as index terms are a choice; they inevitably have to be an opinion about the names significant enough to add as index terms. And subjects are a similar case – some collections are indexed thoroughly, some not at all.

Aside from indexing, each person would create a different scope and content entry, including and excluding different information, and whether you call that subjective or not, it is certainly always selective. You could also argue that the level of detailed hierarchical cataloguing might indicate the relative importance of the collection. On the Archives Hub there are some collections catalogued in huge detail, and it is inevitable that researchers will assume these collections are particularly important.

All of these choices have implications for discoverability.

In Wisser’s survey, a significant proportion of respondents felt that the importance of a relationship should be based upon the use of the collection.  But this, again, raises the question: When thinking about relationships, is the cataloguer reflecting the scope of the collection, or are they trying to give as full a picture as they can of the person or organisation? Are we within the world of the collection; or is the collection within the world?

The reason that I believe that we should think beyond the bounds of the collection content is that I think it promises much richer rewards for our users and encourages archives to be a major player within a broader landscape of information resources. I base my thinking on the premise that the researcher is primarily interested in their research topic, which is not likely to be an archive collection per se, but rather an event, a person, an organisation, a subject, and the way things are connected. I think archivists are still tending to think in terms of a document that describes a collection, rather than how to link the collection into the cultural heritage landscape, and even more broadly beyond that. I wonder if archivists don’t always think beyond the catalogues they currently create because the researchers they have contact with (who visit the archive) are already fairly confident they want to use that repository, or a particular archive within that repository. In other words, the researcher is already in their space. When I worked in a specialist archive, I thought about researchers discovering our archive as a whole (having an online presence) and then I thought about them using our collections (individual collections each with their own description); I didn’t think about how our collections could be seen as part of a whole information landscape.

The loudest – and most convincing – argument I hear against this kind of approach is that it takes time, and archivists are short on time. But I wonder if that means we have to think fundamentally differently. Let us go back to Jane Drew, and think about the value of relationships for research into her life and work…

If one archive collection description highlights just a few relationships, this could take us a long way (although relationship types are a whole different thing…). If the individuals and organisations are unambiguously identified, this can help with the process of creating links out to other data sources, so that information can be linked together; then we have the chance to benefit from finding out about relationships that have been defined elsewhere. In other words, the connections one person has throughout their life can only be fully realised through the pooling of information resources, very much a joint effort. If the data is structured it can potentially be brought together.

Traditional archival cataloguing focuses on the collection, and what is documented within the collection. It tends to think in terms of a self-contained document. Pursuing relationships breaks the bounds of any one information source. That seems like a good thing, but it raises questions around approaches to cataloguing. One obvious way to tackle this is to start to think more about archival authority records. These should enable us to move beyond a collection-centric description of the collection and towards a more entity based approach, because you describe an agent (entity) independently of any one archival collection. Another option is to think in a Linked Data way, where you are concentrating on entities and relationships.

There are so many questions raised by the whole area of entities and relationships. A few of my current conclusions are:

We should primarily be led by what benefits research. Researchers are far less likely to think in terms of individual archive collections, and far more likely to think in terms of research areas (topics). The Web gives us the opportunity to think in a broader context.

Maybe it is worth considering taking some of the time used to provide a really detailed biographical history as an unstructured narrative, or the time to provide a really detailed multi-level description, and taking more time to provide (or provide the potential for) connections between our descriptions and the larger information environment. This could allow researchers to bring together much more comprehensive information, even if what we provide about individual collections is less detailed. Just adding something like a VIAF identifier to a name would be a great big leap forwards (http://viaf.org/viaf/51792789).

There is great value in being a small fish in a big pond, because most researchers are fishing for data in the big pond. As Wisser’s article says, “relationships are…seen to free collections from the isolation of individual repositories.” If we aim to be part of the big pond, we can continue to tend our smaller ponds as well!

To go back to the Piper Collection and Jane Drew… I used this as a random example, thinking of a researcher interested in one particular designer. But of course, the Tate Gallery Archive can’t be expected to define all the relationships within the description. It’s great that they have provided enough detail to find this one individual item – without that, we would not know about the connection with Jane Drew. I’m arguing for unambiguously identifying entities (people, organisations) because if we can potentially link this instance of ‘Jane Drew’ to other instances in other information sources, then it is very possible that we can find out more about this relationship; and if the relationship can’t be established through other sources, then maybe this archive provides unique evidence of a connection that could significantly benefit research.

EAD and Next Generation Discovery

This post is in response to a recent article in Code4Lib, ‘Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems’ by M. Bron, M. Proffitt and B. Washburn. All quotes are from that article, which looked at the instances of tags within ArchiveGrid, the US-based archival aggregation run by OCLC. This post compares some of their findings to the UK-based Archives Hub.

Date

In the ArchiveGrid analysis, the <unitdate> field use is around 72% within the high-level (usually collection level) description. The Archives Hub does significantly better here, with an almost universal inclusion of dates at this level of description. Therefore, a date search is not likely to exclude any potentially relevant descriptions. This is important, as researchers are likely to want to restrict their searches by date. Our new system also allows sorting retrieved results by date. The only issue we have is where the dates are non-standard and cause the ordering to break down in some way. But we do have both displayed dates and normalised dates, to enable better machine processing of the data.
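To illustrate why having both a displayed date and a normalised date helps with ordering, here is a minimal sketch in Python. It assumes the normalised form is held in the `normal` attribute of <unitdate>, as EAD 2002 allows, in a `YYYY` or `YYYY/YYYY` style. The sample records and the function names are invented for illustration, and this is not the Hub's actual sorting code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample records; real descriptions are far richer.
SAMPLE = """
<records>
  <did><unittitle>Letters</unittitle><unitdate normal="1935/1942">1935-1942</unitdate></did>
  <did><unittitle>Diaries</unittitle><unitdate normal="1899">c.1899</unitdate></did>
  <did><unittitle>Photographs</unittitle><unitdate>early twentieth century</unitdate></did>
</records>
"""

def start_year(did):
    """Return the first year of the normalised date, or None if it is absent or unusable."""
    unitdate = did.find("unitdate")
    normal = unitdate.get("normal") if unitdate is not None else None
    if not normal:
        return None
    first = normal.split("/")[0]  # "1935/1942" -> "1935"
    year = first[:4]
    return int(year) if year.isdigit() else None

def sort_by_date(dids):
    """Sort by start year; records without a usable normalised date go last."""
    return sorted(dids, key=lambda d: (start_year(d) is None, start_year(d) or 0))

dids = ET.fromstring(SAMPLE).findall("did")
titles = [d.findtext("unittitle") for d in sort_by_date(dids)]
print(titles)
```

The third record has only a displayed date, so it cannot be placed in sequence and falls to the end of the list, which is the kind of breakdown in ordering described above.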

Collection Title

“for sorting and browsing…utility depends on the content of the element.”

Titles are always provided, but they are very varied. Setting aside lower-level descriptions, which are particularly problematic, titles may be more or less informative. We may introduce sorting by title, but the utility of this will be limited. It is unlikely that titles will ever be controlled to the extent that they have a level of consistency, but it would be fascinating to analyse titles within the context of the ways people search on the Web, and see if we can gauge the value of different approaches to creating titles. In other words, what is the best type of title in terms of attracting researchers’ attention, search engine optimisation, display within search engine results, etc?

Lower-level descriptions tend to have titles such as ‘Accounts’, ‘Diary’ or something more difficult to understand out of context such as ‘Pigs and boars’ or ‘The Moon Dragon’. It is clearly vital to maintain the relationship of these lower-level descriptions to their parent level entries, otherwise they often become largely meaningless. But this should be perfectly possible when working on the Web.

It is important to ensure that a researcher finding a lower-level description through a general search engine gets a meaningful result.

Archives Hub search result from a Google search
A search result within Google

The above result is from a search for ‘garrick theatre archives joanna lumley’ – the sort of search a researcher might carry out. Whilst the link is directly to a lower-level entry for a play at the Garrick Theatre, the heading is for the archive collection. This entry is still not ideal, as the lower-level heading should be present as well. But it gives a reasonable sense of what the researcher will get if they click on this link. It includes the <unitid> from the parent entry and the URL for the lower-level entry, with the first part of the <scopecontent> for the entry. It also includes the Archives Hub tag line, which could be considered superfluous to a search for Garrick Theatre archives! However, it does help to embed the idea of a service in the mind of the researcher – something they can use for their research.

Extent

“It would be useful to be able to sort by size of collection, however, this would require some level of confidence that the <extent> tag is both widely used and that the content of the tag would lends itself to sorting.”

This was an idea we had when working on our Linked Data output. We wanted to think about visualizations that would help researchers get a sense of the collections that are out there, where they are, how relevant they are, and so on. In theory the ‘extent’ could help with a weighting system, where we could think about a map-based visualization showing concentrations of archives about a person or subject. We could also potentially order results by size – from the largest archive to the smallest archive that matches a researcher’s search term. However, archivists do not have any kind of controlled vocabulary for ‘extent’. So, within the Archives Hub this field can contain anything from numbers of boxes and folders to length in linear metres, dimensions in cubic metres and items in terms of numbers of photographs, pamphlets and other formats. ISAD(G) doesn’t really help with this; the examples they give simply serve to show how varied the description of extent can be.
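A small sketch may help to show why this variety defeats sorting. The extent statements below are invented examples of the kinds just described, and the pattern is a deliberately naive assumption about how a quantity and a unit might be pulled out of free text. It is not a proposal for how the Hub should parse extents.

```python
import re

# Hypothetical extent statements of the kinds described above.
EXTENTS = [
    "12 boxes",
    "3.5 linear metres",
    "0.2 cubic metres",
    "450 photographs",
    "1 box, 3 folders",
]

# A number, some whitespace, then a unit running up to a comma or the end of the text.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+([a-z ]+?)(?=,|$)")

def parse_extent(text):
    """Return a list of (quantity, unit) pairs found in a free-text extent."""
    return [(float(n), unit.strip()) for n, unit in PATTERN.findall(text.lower())]

# Collect every distinct unit that appears across the sample.
units = {unit for extent in EXTENTS for _, unit in parse_extent(extent)}
print(sorted(units))
```

Even this tiny sample yields six different units, and ‘box’ and ‘boxes’ are not recognised as the same thing. Without an agreed vocabulary and some way of converting between units, there is no single measure on which to order the results.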

Genre

“Other examples of desired functionality include providing a means in the interface to limit a search to include only items that are in a certain genre (for example, photographs)”.

This is something that could potentially be useful to researchers, but archivists don’t tend to provide the necessary data. We would need descriptions to include the genre, using controlled vocabulary. If we had this we could potentially enable researchers to select types of materials they are interested in, or simply include a flag to show, e.g. where a collection includes photographs.

The problem with introducing a genre search is that you run the risk of excluding key descriptions, because the search will only include results where the description includes that data in the appropriate location. If the word ‘photograph’ is in the general description only then a specific genre search won’t find it. This means a large collection of photographs may be excluded from a search for photographs.
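The risk can be shown with a minimal sketch. The two records below are invented, and the field names simply echo the EAD elements <genreform> and <scopecontent>. The search functions are simplistic stand-ins for a real search engine, assumed only for the purpose of illustration.

```python
# Hypothetical records: only the first has genre data in the structured field.
RECORDS = [
    {
        "title": "Studio archive",
        "genreform": ["Photographs"],
        "scopecontent": "Portrait negatives and prints.",
    },
    {
        "title": "Family papers",
        "genreform": [],
        "scopecontent": "Includes some 2,000 photographs of the estate.",
    },
]

def genre_search(records, term):
    """Match only against the structured genre terms."""
    term = term.lower()
    return [r["title"] for r in records
            if any(term in g.lower() for g in r["genreform"])]

def keyword_search(records, term):
    """Match against the genre terms and the free-text description."""
    term = term.lower()
    return [r["title"] for r in records
            if term in r["scopecontent"].lower()
            or any(term in g.lower() for g in r["genreform"])]

print(genre_search(RECORDS, "photograph"))    # only the record with genre data
print(keyword_search(RECORDS, "photograph"))  # both records
```

The ‘Family papers’ record holds the larger body of photographs, but the genre search misses it because the information is only in the general description.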

Subject

In the Bron/Proffitt/Washburn article <controlaccess> is present around 72% of the time. I was surprised that they did not choose to analyse tags within <controlaccess> as I think these ‘access points’ can play a very important role in archival description. They use the presence of <controlaccess> as an indication of the presence of subjects, and make the point that “given differences in library and archival practices, we would expect control of form and genre terms to be relatively high, and control of names and subjects to be relatively low.”

On the Archives Hub, use of subjects is relatively high (as well as personal and corporate names) and use of form and genre is very low. However, it is true to say that we have strongly encouraged adding subject terms, and archivists don’t generally see this as integral to cataloguing (although some certainly do!), so we like to think that we are partly responsible for such a high use of subject terms.

Subject terms are needed because they (1) help to pull out significant subjects, often from collections that are very diverse, (2) enable identification of words such as ‘church’ and ‘carpenter’ (i.e. they are subjects, not surnames), (3) allow researchers to continue searching across the Archives Hub by subject (subjects are all linked to the browse list) and therefore pull collections together by theme, and (4) enable advanced searching (which is substantially used on the Hub).

Names (personal and corporate)

In Bron/Proffitt/Washburn the <origination> tag is present 87% of the time. The analysis did not include the use of <persname> and <corpname> within <origination> to identify the type of originator. In the Archives Hub the originator is a required field, and is present 99%+ of the time. However, we made what I think is a mistake in not providing for the addition of personal or corporate name identification within <origination> via our EAD Editor (for creating descriptions) or by simply recommending it as best practice. This means that most of our originators cannot be distinguished as people or corporate bodies. In addition, we have a number where several names are within one <origination> tag and where terms such as ‘and others’, ‘unknown’ or ‘various’ are used. This type of practice is disadvantageous to machine processing. We are looking to rectify it now, but addressing something like this in retrospect is never easy to do. The ideal is that all names within origination are separately entered and identified as people or organisations.
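As a rough sketch of the kind of retrospective check this implies, the Python below looks inside an <origination> element for <persname> or <corpname> and flags the vague terms mentioned above. The function name and the sample fragments are hypothetical, and the sketch ignores XML namespaces. It is an illustration, not the script we actually use.

```python
import xml.etree.ElementTree as ET

# Terms quoted in the text above as being unhelpful for machine processing.
VAGUE_TERMS = ("and others", "unknown", "various")

def describe_origination(xml_fragment):
    """Return (name element types found, list of problems) for one <origination>."""
    origination = ET.fromstring(xml_fragment)
    # Which direct children identify the originator as a person or a corporate body?
    kinds = [child.tag for child in origination
             if child.tag in ("persname", "corpname")]
    # Look for vague terms anywhere in the text of the element.
    text = "".join(origination.itertext()).lower()
    problems = [term for term in VAGUE_TERMS if term in text]
    if not kinds:
        problems.append("no persname or corpname")
    return kinds, problems

good = "<origination><persname>Jane Drew</persname></origination>"
poor = "<origination>Smith family and others</origination>"

print(describe_origination(good))
print(describe_origination(poor))
```

The first fragment is identified as a person with no problems reported. The second cannot be classified at all and is also flagged for ‘and others’.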

We do also have names within <controlaccess>, and this brings the same advantages as for subjects: the names are properly structured, and can be used for searching and for bringing together archives relating to any one individual or organisation.

Repository

“Use of this element falls into the promising complete category (99.46%: see Table 7). However, a variety of practice is in play, with the name of the repository being embellished with <subarea> and <address> tags nested within <repository>.”

On the Archives Hub repository is mandatory, but as yet we do not have a checking system whereby a description is rejected if it does not contain this field. We are working towards something like this, using scripts to check for key information to help ensure validity and consistency at least to a minimum standard. On one occasion we did take in a substantial number of descriptions from a repository that omitted the name of the repository, which is not very useful for an aggregation service! However, one thing about <repository> is that it is easy to add because it is always the same entry. Or at least it should be… we did recently discover that a number of repositories had entered their name in various ways over the years and this is something we needed to correct.
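A checking script of this kind could be quite simple. The sketch below is an assumption about how such a check might look, not a description of our actual scripts. The list of required elements is limited to the two fields described in this post as mandatory, the sample description is invented, and XML namespaces are ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Illustrative list only: the two fields this post describes as mandatory.
REQUIRED = ("repository", "origination")

def missing_fields(ead_xml, required=REQUIRED):
    """Return the required elements that are absent or contain no text."""
    root = ET.fromstring(ead_xml)
    missing = []
    for tag in required:
        element = root.find(f".//{tag}")
        # An element that is present but blank is treated as missing.
        if element is None or not "".join(element.itertext()).strip():
            missing.append(tag)
    return missing

# Hypothetical description with a blank <repository>.
SAMPLE = """<ead><archdesc level="fonds"><did>
<unittitle>Papers of a hypothetical society</unittitle>
<origination><corpname>Hypothetical Society</corpname></origination>
<repository>   </repository>
</did></archdesc></ead>"""

print(missing_fields(SAMPLE))
```

A description could then be held back, or returned to the contributor, whenever the list that comes back is not empty.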

Scope and content, biographical history and abstract

It is notable that in the US <abstract> is widely used, whereas we don’t use it at all. It is intended as a very brief summary, whereas <scopecontent> can be of any length.

“For search, its worth noting that the semantics of these elements are different, and may result in unexpected and false “relevance””

One of the advantages of including <controlaccess> terms is that they help to guard against this kind of false relevance, as restricted field searching makes it possible to search for ‘mason’ as a person or for ‘mason’ as a subject.

The Bron/Proffitt/Washburn analysis shows <bioghist> used 70% of the time. This is lower than the Archives Hub, where it is rare for this field not to be included. Archivists seem to have a natural inclination to provide a reasonably detailed biographical history, especially for a large collection focussed on one individual or organisation.

Digital Archival Objects

It is a shame that the analysis did not include instances of <dao>, but it is likely to be fairly low (in line with previous analysis by Wisser and Dean, which puts it lower than 10%). The Archives Hub currently includes around 1,200 instances of images or links to digital content. But what would be interesting is to see how this is growing over time and whether the trajectory indicates that in 5 years or so we will be able to provide researchers with routes into much of the Archives Hub content. However, it is worth bearing in mind that many archives are not digitised and are not likely to be digitised, so it is important for us not to raise expectations that links to digital content will become a matter of course.

The Future of Discovery

“In order to make EAD-encoded finding aids more well suited for use in discovery systems, the population of key elements will need to be moved closer to high or (ideally) complete.”

This is undoubtedly true, but I wonder whether the priority over and above completeness is consistency and controlled vocabulary where appropriate. There is an argument in favour of a shorter description, that may exclude certain information about a collection, but is well structured and easier to machine process. (Of course, completeness and consistency is the ideal!).

The article highlights geo-location as something that is emerging within discovery services. The Archives Hub is planning on promoting this as an option once we move to the revised EAD schema (which will allow for this to be included), but it is a question of whether archivists choose to include geographical co-ordinates in their catalogues. We may need to find ways to make this as easy as possible and to show the potential benefits of doing so.

In terms of the future, we need a different perspective on what EAD can and should be:

“In the early days of EAD the focus was largely on moving finding aids from typescript to SGML and XML. Even with much attention given over to the development of institutional and consortial best practice guidelines and requirements, much work was done by brute force and often with little attention given to (or funds allocated for) making the data fit to the purpose of discovery.”

However, I would argue that one of the problems is that archivists sometimes still think in terms of typescript finding aids; of a printed finding aid that is available within the search room, and then made available online….as if they are essentially the same thing and we can use the same approach with both. I think more needs to be done to promote, explain and discuss ‘next generation finding aids’. By working with Linked Data, I have gained a very different perspective on what is possible, challenging the traditional approach to hierarchical finding aids.

Maybe we need some ‘next generation discovery’ workshops and discussions – but in order to really broaden our horizons we will need to take heed of what is going on outside of our own domain. We can no longer consider archival practice in isolation from discovery in the most general sense because the complexity and scale of online discovery requires us to learn from others with expertise and understanding of digital technologies.

An evaluation of the use of archives and the Archives Hub

This blog post is based upon a report written by colleagues at Mimas* presenting the results of the evaluation of our innovative Linked Data interface, ‘Linking Lives’. The evaluation consisted of a survey and a focus group, with 10 participants including PhD students and MA students studying history, politics and social sciences. We asked participants a number of questions about the Archives Hub service, in order to provide context for their thoughts on the Linking Lives interface.

This blog post concentrates on their responses relating to the use of archives, methods of searching and interpretation of results. You can read more about their responses to the Linking Lives interface on our Linking Lives blog.

Use of Archives and Primary Source Materials

We felt that it was important to establish how important archives are to the participants in our survey and focus group. We found that “without exception, all of the respondents expressed a need for primary resources” (Evaluation report). One respondent said:

“I would not consider myself to be doing proper history if I wasn’t either reinterpreting primary sources others had written about, or looking at primary sources nobody has written about. It is generally expected for history to be based on primary sources, I think.” (Survey response)

One of the most important factors to the respondents was originality in research. Other responses included acknowledgement of how archives give structure to research, bringing out different angles and perspectives and also highlighting areas that have been neglected. Archives give substance to research and they enable researchers to distinguish their own work:

“Primary sources are very valuable for my research because they allow me to put together my own interpretation, rather than relying on published findings elsewhere.” (Survey response)

Understanding of Archives

It is often the case that people have different perceptions of what archives are, and with the Linking Lives evaluation work this was confirmed. Commonly there is a difference between social scientists and historians; the former concentrating on datasets (e.g. data from the Office for National Statistics) and the latter on materials created during a person’s life or the activities of an organisation and deemed worthy of permanent preservation. The evaluation report states:

“The participants that had a similar understanding of what an archive was to the Archive Hub’s definition had a more positive experience than those who didn’t share that definition.”

This is a valuable observation for the work of the Hub in a general sense, as well as the Linking Lives interface, because it demonstrates how initial perceptions and expectations can influence attitudes towards the service. In addition, the evaluation work highlighted another common fallacy: that an archive is essentially a library. Some of the participants in the survey expected the Archives Hub to provide them with information about published sources, such as research papers.

These findings highlight one of the issues when trying to evaluate the likely value of an innovative service: researchers do not think in the same language or with the same perspectives as information professionals. I wonder if we have a tendency to present services and interfaces modelled from our own standpoint rather than from the standpoint of the researcher.

Search Techniques and Habits

“Searches were often not particularly expansive, and participants searched for specific details which were unique to their line of enquiry” (Evaluation report). Examples include titles of women’s magazines, personal names or places. If the search returned nothing, participants might then broaden it out.

Participants said they would repeatedly return to archives or websites they were familiar with, often linked to quite niche research topics. This highlights how a positive experience with a service when it is first used may have a powerful effect over the longer term.

The survey found that online research was a priority:

“Due to conflicting pressures on time and economic resources, online searching was prevalent amongst the sample. Often research starts online and the majority is done online. Visits to see archives in person, although still seen as necessary, are carefully evaluated.”  (Evaluation report)

The main resources participants used were Google and Google Scholar (the most ubiquitous search engines used) as well as The National Archives, Google Books and ESDS. Specialist archives were referred to relating to specific search areas (e.g. The People’s History Museum, the Wellcome Library, the Mass Observation Archive).

Thoughts and Comments About the Archives Hub

All participants found the Hub easy to navigate and most found locating resources intuitive. As part of the survey we asked the participants to find certain resources, and almost all of them provided the right answers with seemingly no difficulty.

“It is clear. The descent of folders and references at the top are good for referencing/orientating oneself. The descriptions are good – they obviously can’t contain everything that could be useful to everyone and still be a summary. It is similar to other archive searches so it is clear.” (Survey response, PhD history student)

The social scientists that took part in the evaluation were less positive about the Archives Hub than the historians. Clearly many social science students are looking for datasets, and these are generally not represented on the Hub. There was a feeling that contemporary sources are not well represented, and these are often more important to researchers in fields like politics and sociology. But overall comments were very positive:

“…if anyone ever asked about how to search archives online I’d definitely point them to the Archives Hub”.

“Useful. It will save me making specific searches at universities.”

Archives Hub Content

It was interesting to see the sorts of searches participants made. A search for ‘spatial ideas’ by one participant did not yield useful results. This would not surprise many archivists – collections are generally not catalogued to draw out such concepts (neither Unesco nor UKAT have a subject heading for this; LCSH has ‘spatial analysis’). However, there may well be collections that cover a subject like this, if the researcher is prepared to dig deep enough and think about different approaches to searching. Another participant commented that “you can’t just look for the big themes”. This is the type of search that might benefit from us drawing together archive collections around themes, but this is always a very flawed approach. This is one reason that we have Features, which showcase archives around subjects but do not try to provide a ‘comprehensive’ view onto a subject.

This kind of feedback from researchers helps us to think about how to present the Archives Hub more effectively. Expectations are such an important part of researchers’ experiences. It is not possible to guard completely against expectations that do not match reality, but we could, for example, have a page on ‘The Archives Hub for Social Scientists’ that would at least provide those who looked at it with a better sense of what the Hub may or may not provide for them (whether anyone would read it is another matter!).

This survey, along with previous surveys we have carried out, emphasises the importance of a comprehensive service and a clear scope (“it wasn’t clear to me what subjects or organisations are covered”). However, with the nature of archives, it is very difficult to give this kind of information with any accuracy, as the collections represented are diverse and sometimes unexpected. In the end you cannot entirely draw a clear line around the scope of the Archives Hub, just like you cannot draw a clear line around the subjects represented in any one archive. The Hub also changes continuously, with new descriptions added every week. Cataloguing is not a perfect art; it can draw out key people, places, subjects and events, but it cannot hope to reflect everything about a collection, and the knowledge a researcher brings with them may help to draw out information from a collection that was not explicitly provided in the description. If a researcher is prepared to spend a bit of time searching, there is always the chance that they may stumble across sources that are new to them and potentially important:

“…another student who was mainly focused on the use of the Kremlin Archives did point out that [the Archives Hub] brought up the Walls and Glasier papers, which were new to [them]”.

Even if you provide a list of subjects, what does that really mean? Archives will not cover a subject comprehensively; they were not written with that in mind; they were created for other purposes – that is their strength in many ways – it is what makes them a rich and exciting resource, but it does not make it easy to accurately describe them for researchers. Just one series of correspondence may refer to thousands of subjects, some in passing, some more substantially, but archivists generally don’t have time to go through an entire series and draw out every concept.

If the Archives Hub included a description for every archive held at an HE institution across the UK, or for every specialist repository, what would that signify? It would be comprehensive in one sense, but in a sense that may not mean much to researchers. It would be interesting to ask researchers what they see as ‘comprehensive resources’ as it is hard to see how these could really exist, particularly when talking about unpublished sources.

Relevance of Search Results

The difficulties some participants had with the relevance of results come back to the problem of how to catalogue resources that often cover a myriad of subjects, maybe superficially, maybe in detail; maybe from a very biased perspective. If a researcher looks for ‘social housing manchester’ then the results they get will be accurate in a sense – the machine will do its job and find collections with these terms, and there will be weighting of different fields (e.g. the title will be highly weighted), but they still may not get the results they expect, because collections may not explicitly be about social housing in Manchester. The researcher needs to do a bit more work to think about what might be in the collection and whether it might be relevant. However, cataloguers are at fault to some extent. We do get descriptions sent to the Hub where the subjects listed seem inadequate or they do not seem to reflect the scope and content that has been provided. Sometimes a subject is listed but there is no sense of why it is included in the rest of the description. Sometimes a person is included in the index terms but they are not described in the content. This does not help researchers to make sense of what they see.

I do think that there are lessons here for archivists, or those who catalogue archives. I don’t think that enough thought is given to the needs of the researcher: subject terms are used inconsistently, for example, and descriptions need to draw out key concepts a little more clearly. Some archivists don’t see the need to add index terms, reasoning that technologies like Google can search by keyword and that this is enough. But it isn’t enough. Researchers need more than this. They need to know what the collection is substantially about, and they need to search across other collections about similar subjects. Controlled vocabulary enables this kind of exploratory searching. There is a big difference between searching for ‘nuclear disarmament’ as a keyword, which means it might exist anywhere within the description, and searching for it as a subject – a significant topic within an archive.
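To make the keyword/subject distinction concrete, here is a toy sketch in Python. The two records, their field names and the two search functions are invented for illustration; this is not how the Hub's search is implemented.

```python
# Toy records; the field names and content are invented for illustration.
records = [
    {
        "title": "Papers of a peace campaigner",
        "scope": "Correspondence and leaflets on nuclear disarmament marches.",
        "subjects": ["Nuclear disarmament", "Peace movements"],
    },
    {
        "title": "Records of a local council",
        "scope": "Minutes, including one motion mentioning nuclear disarmament.",
        "subjects": ["Local government"],
    },
]


def keyword_search(term, records):
    """Match the term anywhere in the free text of a description."""
    term = term.lower()
    return [
        r["title"]
        for r in records
        if term in (r["title"] + " " + r["scope"]).lower()
    ]


def subject_search(term, records):
    """Match the term only against the controlled subject terms."""
    term = term.lower()
    return [
        r["title"]
        for r in records
        if term in (s.lower() for s in r["subjects"])
    ]


keyword_hits = keyword_search("nuclear disarmament", records)
subject_hits = subject_search("nuclear disarmament", records)
print(keyword_hits)  # both descriptions mention the phrase somewhere
print(subject_hits)  # only the collection indexed under the subject
```

The keyword search returns both descriptions, including the one that mentions the phrase only in passing; the subject search returns only the collection that a cataloguer judged to be substantially about it.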

 

*Linking Lives Evaluation: Final Report (October 2012) by Lisa Charnock, Frank Manista, Janine Rigby and Joy Palmer

Excel template

Update May 2015: Please note that we need to make some changes to the Excel template and we are not currently working with Excel data. We hope to be able to offer this service in the future.

As part of Project Headway we wanted to create an Excel template which archives could use to catalogue and create EAD. We know that some archives – especially smaller and under-resourced archives – are using spreadsheets or word processing software to catalogue, and often lack the time or resources to switch to using an archival management system. While users can catalogue directly on to the EAD Editor, this isn’t a perfect solution –  it won’t work in some older browsers, or offline.

While we would have liked to offer a script that allowed users to convert their own Excel catalogues to EAD, it soon became apparent that this wasn’t an option. We would have needed to produce a script for each institution, and relied on the institution using Excel in a very consistent, systematic way – and a way that was ISAD(G) compliant, and could easily be mapped to EAD. So we decided to start off with a simple template, which we can adapt to individual user needs if required.

I’d never worked with XML in Excel before, and a lot of the process was simply trial-and-error, googling error messages, and sending forlorn messages to my programmer husband asking ‘what on earth is denormalised data and how do I stop it?’. I found the office.microsoft.com and msdn.microsoft.com sites useful for figuring out the basics of getting XML in and out of Excel – though I often turned to support elsewhere, too (eg Microsoft support will only tell you that denormalised data is not supported – not what it is or how to fix it).

To get started with using XML in Excel, you need to have the XML add-in installed (it says 2003, but will work with other versions) and then make sure you can see the ‘developer’ tab – if you can’t, it’s under options -> customize ribbon.

While it’s hard (in retrospect) to remember all of the stages I went through in the trial-and-error, I know I started by trying to create an XSD (XML schema file) from in-Excel data entry. It failed. I tried importing the EAD.xsd – which just failed, silently (no error messages, no messages at all).

I was also concerned that the official EAD.xsd was too complicated for my (and our users’) needs – for instance, this project didn’t require lists of enumeration values. I needed something a bit simpler – and I’d already figured out that Excel couldn’t handle multi-level descriptions – so I needed to start with something collection-level only, too.

I created a basic EAD collection-level description in the Archives Hub EAD Editor, saved it as XML, removed the DTD declaration (not allowed in Excel), and imported it (using developer -> xml -> import).  Clicking on ‘source’ in the developer XML tab then shows you the XML fields.

XML map in Excel

You can then export this map as an XSD, creating your XML schema.  Of course, it wasn’t that easy. This is where denormalised data cropped up – and stopped me from exporting. I have to admit, I’m still not entirely sure what exactly denormalised data is – and given definitions such as:

A denormalised data model is not the same as a data model that has not been normalised, and denormalisation should only take place after a satisfactory level of normalisation has taken place and that any required constraints and/or rules have been created to deal with the inherent anomalies in the design. For example, all the relations are in third normal form and any relations with join and multi-valued dependencies are handled appropriately.

(from the usually introductory-friendly Wikipedia)

I’m not sure I’ll ever find out (if you have a really good explanation, please do comment!). But what I did find out was what it meant for me in the context of this XML mapping: no repeated fields. EAD allows for repeated fields – for instance, multiple subjects would be encoded as:

<controlaccess> <subject>subject</subject><subject>subject 2</subject></controlaccess>

Try to import that into Excel, and you get, well, a mess. The whole description appears twice – once with subject, and once with subject 2. And if you try to export the schema, you get the error message that the map is not exportable because it contains denormalized data.
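A rough way to see why a repeated field turns into a duplicated description is to flatten the XML into table rows by hand. The Python sketch below is only an illustration of the general idea (one row per combination of values); it is not a description of what Excel does internally, and the sample title is invented.

```python
import xml.etree.ElementTree as ET
from itertools import product

xml_text = """
<archdesc>
  <did><unittitle>Example collection</unittitle></did>
  <controlaccess>
    <subject>subject</subject>
    <subject>subject 2</subject>
  </controlaccess>
</archdesc>
"""

root = ET.fromstring(xml_text)

# Collect the values of each field, keyed by tag name.
columns = {}
for element in root.iter():
    text = (element.text or "").strip()
    if text:
        columns.setdefault(element.tag, []).append(text)

# One row per combination of values: a repeated field repeats every other field.
headers = list(columns)
rows = [dict(zip(headers, combo)) for combo in product(*columns.values())]

for row in rows:
    print(row)
```

The title appears in both rows, once alongside each subject. With a second repeated field the number of rows multiplies again, which is why whole component hierarchies are out of the question in a flat sheet.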

For this reason, Excel won’t support hierarchy. In EAD, the same fields are repeated at component level as at collection-level, just inside a different wrapper. If you thought it got messy when you add a single repeated field, just imagine having anything up to several thousand…

So, strip everything down to a single instance (which means separating collection and component level into different spreadsheets), and you have an XSD which will export (follow instructions in step 4 of that link – if you get a VBA error, debug instructions are in step 2). Hurrah! But how to make it useable?

Well, you have to put it back into Excel, and map the XML fields to Excel cells. This was tedious, but achievably tedious rather than crawling-through-help-forums tedious. Open up a new Excel document, click on ‘source’, and choose your shiny new XSD. This will give you a list of all the fields, in the right-hand pane. Mapping them to cells is simply a case of drag-and-drop – once you’ve mapped a field to a cell, that cell will be outlined in blue (as long as the source pane is showing). There’s an option to have Excel auto-label your fields with the content of the XML tag, but I decided that wouldn’t give the user-friendly interface I wanted, so I labelled them myself. Then colour-coded them. The result?

Screenshot of collection-level template

I had to tweak the exported XSD a little to allow for a field in which users can enter the reference codes of any components. This was my first experience of hand-coding any of an XML schema, and it took a few tries to get right! But I managed to add and map the <dsc> and <c> elements:

<xsd:element minOccurs="0" nillable="true" name="dsc" form="unqualified">
  <xsd:complexType>
    <xsd:sequence minOccurs="0">
      <xsd:element minOccurs="0" nillable="true" type="xsd:string" name="c" form="unqualified"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

(If I wanted to play with the XSD a bit more, I guess I could make mandatory fields really mandatory, by fiddling with the minOccurs and/or nillable attributes, but I haven’t worked up the courage yet…)

This allows users to enter the reference codes of parent/child descriptions. Each component needs its own spreadsheet, and its own XML export. These are then run through a script by our programmer, which will use these parent/child references to create a single, hierarchical description. Theoretically, anyway – we haven’t been able to do much testing on it yet, and we’re not sure how well it will cope with components that are more than a level or two deep.
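For anyone curious what stitching the separate exports back together might look like, here is a minimal Python sketch. It is not our programmer's script: the flat record structure, the made-up reference codes and the function names are all assumptions for illustration, and the output is only the nested skeleton (no eadheader or other required elements), not valid EAD.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat records, one per exported spreadsheet.
# "parent" holds the reference code of the parent description (None for the collection).
records = [
    {"ref": "GB 000 ABC", "parent": None, "title": "Example collection"},
    {"ref": "GB 000 ABC/1", "parent": "GB 000 ABC", "title": "Correspondence"},
    {"ref": "GB 000 ABC/1/1", "parent": "GB 000 ABC/1", "title": "Letters, 1901"},
    {"ref": "GB 000 ABC/2", "parent": "GB 000 ABC", "title": "Photographs"},
]


def describe(parent_element, tag, record):
    """Add a description element with a minimal <did> and return it."""
    element = ET.SubElement(parent_element, tag)
    did = ET.SubElement(element, "did")
    ET.SubElement(did, "unitid").text = record["ref"]
    ET.SubElement(did, "unittitle").text = record["title"]
    return element


def build_hierarchy(records):
    """Nest each component inside its parent, to any depth."""
    children = {}
    for record in records:
        children.setdefault(record["parent"], []).append(record)

    (collection,) = children[None]  # exactly one top-level description expected
    ead = ET.Element("ead")
    archdesc = describe(ead, "archdesc", collection)
    dsc = ET.SubElement(archdesc, "dsc")

    def attach(container, parent_ref):
        # Recursively add <c> elements for every record pointing at parent_ref.
        for record in children.get(parent_ref, []):
            component = describe(container, "c", record)
            attach(component, record["ref"])

    attach(dsc, collection["ref"])
    return ead


ead = build_hierarchy(records)
print(ET.tostring(ead, encoding="unicode"))
```

Grouping the records by parent reference first means the order of the exports doesn't matter, and the recursion follows the references down as many levels as the data contains.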

Remember denormalised data, and how you can’t have repeated fields? Obviously we can’t tell contributors that they can only have a single subject for each description! So in repeatable fields, multiple entries are pipe | delimited, so we can split them, eg:

<controlaccess><subject>subject1|subject2|subject3</subject></controlaccess>

to

<controlaccess><subject>subject1</subject><subject>subject2</subject><subject>subject3</subject></controlaccess>

If users enter their subject sources in the same order, they’ll be matched up as attributes to the correct subject. The script also removes any empty fields (valid XML, but they break the EAD Editor), and adds the special Archives Hub mark-up for access points (used to distinguish between eg surname and forename in a personal name, and handy for linked data).
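As a rough sketch of the splitting and tidying steps (not the actual script), the Python below splits a pipe-delimited field into repeated elements, pairs each value with a source, and drops empty elements. It assumes the sources arrive as a pipe-delimited `source` attribute on the same element, which may not match how the template really exports them; the function names and the example source values are also made up for illustration. The Archives Hub access point mark-up is not covered.

```python
import xml.etree.ElementTree as ET


def split_repeated(parent, tag, delimiter="|"):
    """Replace each delimited <tag> child of parent with one element per value."""
    rebuilt = []
    for element in list(parent):
        if element.tag != tag or delimiter not in (element.text or ""):
            rebuilt.append(element)
            continue
        values = [v.strip() for v in element.text.split(delimiter)]
        # Assumption: sources are pipe-delimited in a "source" attribute,
        # in the same order as the values.
        sources = [s.strip() for s in element.get("source", "").split(delimiter)]
        for position, value in enumerate(values):
            if not value:
                continue
            new = ET.Element(tag)
            new.text = value
            if position < len(sources) and sources[position]:
                new.set("source", sources[position])
            rebuilt.append(new)
    # Swap the old children for the rebuilt list, keeping document order.
    for child in list(parent):
        parent.remove(child)
    parent.extend(rebuilt)


def remove_empty(element):
    """Drop descendants that have neither text nor child elements."""
    for child in list(element):
        remove_empty(child)
        if len(child) == 0 and not (child.text or "").strip():
            element.remove(child)


xml_text = (
    "<controlaccess>"
    '<subject source="lcsh|unesco|lcsh">subject1|subject2|subject3</subject>'
    "<persname></persname>"
    "</controlaccess>"
)
controlaccess = ET.fromstring(xml_text)
split_repeated(controlaccess, "subject")
remove_empty(controlaccess)
print(ET.tostring(controlaccess, encoding="unicode"))
```

Matching purely by position is what makes the "same order" instruction so important: if a contributor leaves out one source, every later subject would pick up the wrong one.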

And there we are: a description, created in Excel, that’s valid EAD. We’re still in the process of testing the template, and making sure that it’s robust and meets users’ needs. If you’d like to be involved with testing, please get in touch.

 

The modern archivist: working with people and technology

I’ve recently read Kate Theimer’s very excellent post on Honest Tips for Wannabe Archivists Out There.

This is something that I’ve thought about quite a bit, as I work as the manager of an online service for Archives and I do training and teaching for archivists and archive students around creating online descriptions. I would like to direct this blog post to archive students or those considering becoming archivists. I think this applies equally to records managers, although sometimes they have a more defined role in terms of audience, so the perspective may be somewhat different.

It’s fine if you have ‘a love of history’, if you ‘feel a thrill when handling old documents’. That’s a good start. I’ve heard this kind of thing frequently as a motivation for becoming an archivist. But this is not enough. It is more important to have the desire to make those archives available to others; to provide a service for researchers. To become an archivist is to become a service provider, not an historian. It may not sound as romantic, but as far as I am concerned it is what we are, and we should be proud of the service we provide, which is extremely valuable to society. Understanding how researchers might use the archives is, of course, very important, so that you can help to support them in their work. Love of the materials, and love of the subject (especially in a specialist repository) should certainly help you with this core role. Indeed, you will build an understanding of your collections, and become more expert in them over time, which is one of the wonderful things about being an archivist.

Your core role is to make archives available to the community – for many of us, the community is potentially anyone, for some of us it may be more restricted in scope. So, you have an interest in the materials, you need to make them available. To do this you need to understand the vital importance of cataloguing. It is this that gives people a way in to the archives. Cataloguing is a real skill, not something to be dismissed as simply creating a list of what you have. It is something to really work on and think about. I have seen enough inconsistent catalogues over the last ten years to tell you that being rigorous, systematic and standards-based in cataloguing is incredibly important, and technology is our friend in this aim. Furthermore, the whole notion of ‘cataloguing’ is changing, a change led by the opportunities of the modern digital age and the perspectives and requirements of those who use technology in their everyday life and work. We need to be aware of this, willing (even excited!) to embrace what this means for our profession and ready to adapt.

image of control room

This brings me to the subject I am particularly interested in: the use of technology. Cataloguing *is* using technology, and dissemination *is* using technology. That is, it should be and it needs to be if you want to make an impact; if you want to effectively disseminate your descriptions and increase your audience. It is simply no good to see this profession as in any way apart from technology. I would say that technology is more central to being an archivist than to many professions, because we *deal in information*. It may be that you can find a position where you can keep technology at arm’s length, but these types of positions will become few and far between. How can you be someone who works professionally with information, and not be prepared to embrace the information environment? The Web, email, social networks, databases: these are what we need to use to do our jobs. We generally have limited resources, and technology can help us make the most of the resources we have; conversely, we may need to make informed choices about the technology we use and what sort of impact it will have. Should you use Flickr to disseminate content? What are the pros and cons? Is ‘augmented reality’ a reality for us? Should you be looking at Linked Data? What is it and why might it be important? What about Big Data? It may sound like the latest buzz phrase but it’s big business, and can potentially save time and money. Is your system fit for purpose? Does it create effective online catalogues? How interoperable is it? How adaptable?

Before I give the impression that you need to become some sort of technical whizz-kid, I should make clear that I am not talking about being an out-and-out techie – a software developer or programmer. I am talking about an understanding of technology and how to use it effectively. I am also talking about the ability to talk to technical colleagues in order to achieve this. Furthermore, I am talking about a willingness to embrace what technology offers and not be scared to try things out. It’s not always easy. Technology is fast-moving and sometimes bewildering. But it has to be seen as our ally, as something that can help us to bring archives to the public and to promote a greater understanding of what we do. We use it to catalogue, and I have written previously about how our choice of system has a great impact on our catalogues, and how important it is to be aware of this.

Our role in using technology is really *all about people*. I often think of myself as the middleman, between the technology (the developers) and the audience. My role is to understand technology well enough to work with it, and work with experts, to harness it in order to constantly evolve and use it to best advantage, but also to constantly communicate with archivists and with researchers. To have an understanding of requirements and make sure that we are relevant to end-users. It’s a role, therefore, that is about working with people. For most archivists, this role will be within a record office or repository, but either way, working with people is the other side of the coin to working with technology. They are both central to the world of archives.

If you wonder how you can possibly think about everything that technology has to offer: well, you can’t. But that’s why it is even more vital now than it has ever been to think of yourself as being in a collaborative profession. You need to take advantage of the experience and knowledge of colleagues, both within the archives profession and further afield. It’s no good sitting in a bubble at your repository. We need to talk to each other and benefit from sharing our understanding. We need to be outgoing. If you are an introvert, if you are a little shy and quiet, that’s not a problem; but you may have to make a little more effort to engage and to reach out and be an active part of your profession.

They say ‘never work with children and animals’ in show business because both are unpredictable; but in our profession we should be aware that working with people and technology is our bread and butter. Understanding how to catalogue archives to make them available online, to use social networks to communicate our messages, to think about systems that will best meet the needs of archives management, to assess new technologies and tools that may help us in our work. These are vital to the role of a modern professional archivist.

HubbuB: March 2012

New collections on the Hub

A special mention for the University of Worcester Research Collections – they have now been added to the Hub as collection level descriptions, thanks largely to their HLF ‘Skills for the Future’ trainee, Sarah.

We are delighted to have the Royal College of Psychiatrists as a new contributor, adding to a number of distinguished Royal Colleges already on the Hub.

Feature for March

This month we step into the world of augmented reality with a feature about the SCARLET project:

The feature tells us that “The SCARLET ‘app’ now enables students to study early editions of Dante’s Divine Comedy, for example, while simultaneous viewing catalogue data, digital images, webpages and online learning resources on their tablet devices and phones.” It all sounds very exciting, and something that archives can really play a very active part in.

EAD Editor

We’ve been busy testing the new instance of the EAD Editor, which will be released soon. We’ll be able to tell you more about that shortly.

We now have a page giving you information about the ‘right click’ menu that helps you with things like paragraphs, lists and links:

SRU and OAI-PMH

APIs are becoming increasingly important with the open data agenda. We have provided APIs for some years now. Recently we have updated the information on these to help developers who would like to use them to access Hub descriptions: http://archiveshub.ac.uk/sru/ and http://archiveshub.ac.uk/oaipmh/

The SRU interface is used to provide data to Genesis, the portal for Women’s Studies: http://www.londonmet.ac.uk/genesis/. It means that the data is only held in one place, but a different interface provides access to select descriptions – in this case, descriptions relating to women.

APIs may not mean a great deal to you, as they are primarily something developers use to create new interfaces, mash-ups and cross-data explorations, but do pass this on if you know of developers interested in working with our data. We want to ensure that archives are at the heart of innovations in opening up and exploring data connections.
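For developers wondering what a request might look like, here is a small Python sketch that builds an SRU searchRetrieve URL against the base address given above. The parameter names come from the SRU standard; the version number and the `dc.subject` index in the example query are assumptions that should be checked against our SRU page, and the function name is made up. No request is actually sent.

```python
from urllib.parse import urlencode

SRU_BASE = "http://archiveshub.ac.uk/sru/"  # endpoint given in the post


def sru_search_url(cql_query, maximum_records=10, start_record=1):
    """Build an SRU searchRetrieve URL; no request is sent."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",  # assumption: check which SRU version the service supports
        "query": cql_query,  # a CQL query string
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return SRU_BASE + "?" + urlencode(params)


# The index name dc.subject is an assumption, used here only as an example.
url = sru_search_url('dc.subject = "women"')
print(url)
```

A portal like Genesis could build queries of this kind to pull back just the descriptions relevant to its audience, while the data itself stays in one place on the Hub.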

Page about identifiers

Some of you may have read my recent blog post about issues with identifiers for archives and for archive descriptions. We now have a page on the Hub to help explain what a persistent unique identifier is and how you create it:

http://archiveshub.ac.uk/identifiers/

As ever, please ask us if you have any questions about this.

Former Reference

The Archives Hub now displays former reference with the label of ‘alternative ref’. This is because for some contributors the former reference is, in fact, the main reference, so we felt this was the best compromise. For example: http://archiveshub.ac.uk/data/gb1069-12 (see lower level entries).

The new EAD Editor will allow for descriptions with a former reference to be uploaded, edited and removed, but it will not provide the facility to create them from scratch.

Case Studies Wanted!

Finally, we have a case studies section – http://archiveshub.ac.uk/casestudies/. We’d love to hear from any researchers willing to provide us with a case study. It is a really useful way for us to convey the importance of the Hub to our funders.

More Product, Less Processing?

I’ve been reading a fascinating article by Mark A. Greene and Dennis Meissner, ‘More Product Less Process: Revamping Traditional Archival Processing‘ (PDF). I wanted to offer a summary of the article.

image of scales

The essence of this article is that archivists spend too long processing collections (appraising, cataloguing and carrying out minor preservation). This approach is not working; the cataloguing backlog continues to increase. We are too conservative, cautious and set in our ways, and we need to think about a new approach to cataloguing that is more pragmatic and user-focussed. The article was written by archivists in the USA, but would seem to apply to archives here in the UK, where we know that the backlog is a continuing problem.

I think the article makes the argument well and with a good deal of conviction. The bottom line is that we must rethink our approach unless we are to continue to accrue backlogs and deny researchers access to hugely valuable primary source material.

However, there are arguments in support of detailed cataloguing. For digital archives it is extremely useful to provide metadata at the item level, enabling such useful resources as http://archiveshub.ac.uk/data/gb1837des-dca?page=3#id634580. With this detailed list, researchers can see digital resources described and then access them directly. It could be argued that if a collection is to be digitised, providing this sort of level of metadata is appropriate, and in general it is the more valuable and highly used collections that are digitised. But for born-digital collections, this level of detail would be totally unsustainable.

Also, I wonder if the work that volunteers do should be taken into account – they may be able to help us catalogue in more detail, whilst trained archivists continue to create the main collection or series-level descriptions. I remember a whole band of NADFAS volunteers cataloguing photographs where I used to work. Furthermore, I was speaking to an archivist recently who said that they had taken the time to weed out duplicates (something this report criticises)…and then sold them on eBay for a tidy profit, that helped them fund their very under-resourced archive (they had the rights to do this!). So, maybe there are factors to take into consideration that support a detailed approach, but I think a bold approach to examining this whole area in UK archives would be very welcome.

Some of the points made in the report:

  • Archivists spend too much time cataloguing, not necessarily doing what is necessary. We think in terms of an ideal that we have to reach, although we haven’t actually articulated what this ideal is, and really examined it.
  • We are too attached to old-fashioned ways of doing things, which worked when we had smaller collections to deal with, but are not appropriate for large 20th century collections.
  • We give a higher priority to serving the needs of our collections rather than the needs of our users.
  • We need a new set of guidelines that focus on what we absolutely need to do.
  • We need to discuss, debate and examine our approach to cataloguing, and not be defensive about our roles.
  • We tend to arrange collections down to item level. In particular, we carry out preservation activities to this level. We accept the premise that basic preservation steps necessitate an item-level approach.
  • We often remove all metal fastenings and put materials into acid-free folders. So, even if we do not describe collections down to item level (maybe we just describe at collection or series level), we go down to this level of detail in our preservation activities. Yet, with good climate control, metal fasteners should not rust, and as yet we do not have strong evidence of a detrimental effect of standard manila folders if the material is stored in a controlled environment.
  • We often weed out duplicates throughout a collection, which requires processing down to item level. Is this really worth doing?
  • The various sources of advice about the level of detail we process archives to are inconsistent. Some sources advocate description to series level, but preservation activities to item level. NARA advocates preservation in accordance with intrinsic value and anticipated use, so, for example, new folders should only be used if current ones are damaged, and metal fasteners should be removed only if ‘appropriate’ – meaning where they are causing obvious damage.
  • We seem to believe that we need to aspire to ‘a substantial, multi-layered, descriptive finding aid,’ a reflection of ‘slow, careful scholarly research’.  But in reality, maybe we should adopt a more flexible approach, taking each collection in turn on its merits. Some may justify detailed cataloguing, but many do not.
  • We should take the position that users come to do research, and that we do not have to do this for them in advance.
  • We should ‘get beyond our absurd over-cautiousness’ about providing access to unprocessed collections, and make them available unless there are good legal or preservation reasons to restrict access or the collection is of extremely high value.
  • We have very inadequate processing metrics. Attempts to quantify processing expectations have resulted in wildly differing figures. Figures given in various studies include 3, 6.9, 8, 12.7 and 10.6 hours per cubic foot. Other studies have come up with between 3 and 5.5 days per foot.
  • One major study  by an archive centre revealed 15.1 hours were spent on each cubic foot, far more than the value that was placed upon  what was accomplished. The study gave ‘an improved sense of the real and total costs involved’.
  • The Greene/Meissner study looked at various projects funded by NHPRC grants (National Historical Publications & Records Commission), and found an average productivity figure of 9 hours per foot, but with highs of around 67 hours per foot. It also conducted an email survey and found expectations of processing times averaged at 14.8 hours, although there was a high of 250 hours!
  • Grant funding often encourages an item-level focus, rather than helping us to really tackle our substantial backlogs. There should be more of a requirement to justify meticulous processing – it should only be for exceptional collections.
  • The study recommends aiming for a processing rate of 4 hours per cubic foot for most large 20th century collections, using a series-level approach for description and preservation.
  • Studies show a lack of standardisation, not only in our definitions but also around the levels of arrangement, preservation and access that are useful and necessary. We do not have proper administrative controls over this work. We tend to argue for each of us having a unique situation, that does not allow for comparison, and we do not have a common sense of acceptable policies and procedures.
  • Whilst we continue to process to item level, a substantial number do not make catalogues available through OPACs or Websites, arguably prioritising processing over user needs.

The report concludes that maybe we should recognise that ‘the use of archival records…is the ultimate purpose of identification and administration.’ (SAA, Planning for the Archival Profession, 1986).  Maybe we should agree that a collection is catalogued if it ‘can be used productively for research.’ And maybe we should be willing to take a different approach for each collection, making choices and setting priorities, rather than being too caught up in a ‘love of craftmanship’ that could be seen as fastidiousness that does not truly serve the user.

The question seems to be how much would be lost by putting speed of processing before careful examination of all documents in a collection. Maybe this does require defining good cataloguing? Maybe we believe that our professional standing is tied up with undertaking detailed cataloguing…more so than the ever increasing growth of backlogs, where the papers are entirely inaccessible to researchers?

Greene and Meissner state that there should be a ‘golden minimum’ for processing, where we adequately address user needs and only go beyond this where there are demonstrable business reasons. They also believe that arrangement, description and preservation should all occur at the same level of detail, again, unless there are good reasons to deviate from this.

What do you think…?