Archives Hub Data and Workflow


As those of you who contribute to or use the Hub will know, we went live with our new system in Dec 2016.  At the heart of our new system is our new workflow.  One of the key requirements that we set out with when we migrated to a new system was a more robust and sustainable workflow; the system was chosen on the basis that it could accommodate what we needed.

This post is about the EAD (Encoded Archival Description) descriptions, and how they progress through our processing workflow. It is the data that is at the heart of the Archives Hub world. We also work with EAG (Encoded Archival Guide) for repository descriptions, and EAC-CPF (Encoded Archival Context, Corporate bodies, Persons and Families) for name entities. Our system actually works with JSON internally, but EAD remains our means of taking in data and providing data out via our API.

On the Archives Hub now we have two main means of data ingest, via our own EAD Editor, which can be thought of as ‘internal’, and via exports from archive systems, which can be thought of as ‘external’.

Data Ingest via the EAD Editor

1. The nature of the EAD

The Editor creates EAD according to the Archives Hub requirements. These have been carefully worked out over time, and we have a page detailing them at

screenshot of eadforthehub page
Part of a Hub webpage about EAD requirements

When we started work on the new system, we were aware that having a clear and well-documented set of requirements was key. I would recommend having this before starting to implement a new system! But, as is often the case with software development, we didn’t have the luxury of doing that – we had to work it out as we went along, which was sometimes problematic, because you really need to know exactly what your data requirements are in order to set your system up. For example, you need to know which fields are mandatory and which are not (ostensibly simple, but in reality this took us a good deal of thought, analysis and discussion).

Screenshot of the EAD Editor
EAD Editor

2. The scope of the EAD

EAD has plenty of tags and attributes! And they can be used in many ways. We can’t accommodate all of this in our Editor. Not only would that take time and effort, but it would result in a complicated interface that would not be easy to use.

screenshot of EAD Tag Library
EAD Tag Library

So, when we created the new Editor, we included the tags and attributes for data that contributors have commonly provided to the Hub, with a few more additions that we discussed and felt were worthwhile for various reasons. We are currently looking again at what we could potentially add to the Editor, and prioritising developments. For example, the <materialspec> EAD tag is not accommodated at the moment. But if we find that our contributors use it, then there is a good argument for including it, as details specific to types of materials, such as map scales, can be useful to the end user.

We don’t believe that the Archives Hub necessarily needs to reflect the entire local catalogue of a contributor. It is perfectly reasonable to have a level of detail locally that is not brought across into an aggregator. Having said that, we do have contributors who use the Archives Hub as their sole online catalogue, so we do want to meet their needs for descriptive data. Field headings are an example of content we don’t utilise. These are  contained within <head> tags in EAD. The Editor doesn’t provide for adding these. (A contributor who creates data elsewhere may include <head> tags, but they just won’t be used on the Hub, see Uploading to the Editor).

We will continue to review the scope in terms of what the Editor displays and allows contributors to enter and revise; it will always be a work in progress.

3. Uploading to the Editor

In terms of data, the ability to upload to the Editor creates challenges for us. We wanted to preserve this functionality, as we had it on the old Editor, but as EAD is so permissive, the descriptions can vary enormously, and we simply can’t cope with every possible permutation. We undertake the main data analysis and processing within our main system, and trying to effectively replicate this in the Editor in order to upload descriptions would duplicate effort and create significant overheads. One of our approaches to this issue is that we will preserve the data that is uploaded, but it may not display in the Editor. If you think of the model as ‘data in’ > ‘data editing’ > ‘data out’, then the idea is that the ‘data in’ and ‘data out’ provides all the EAD, but the ‘data editing’ may not necessarily allow for editing of all the data. A good example of this situation occurs with the <head> tag, which is used for section headings. We don’t use these on the Hub, but we can ensure they remain in the EAD and they are there in the output from the Editor, so they are retained, but not displayed in the Editor. They can then be accessed by other means, such as through an XML Editor, and displayed in other interfaces.
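
The ‘data in’ > ‘data editing’ > ‘data out’ idea can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Editor's actual code: the sample record, the HIDDEN_TAGS set and the function names are all made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up EAD fragment that includes a <head> section heading.
SAMPLE_EAD = """<archdesc level="fonds">
  <did>
    <unittitle>Papers of an example society</unittitle>
    <unitdate normal="1900/1950">1900-1950</unitdate>
  </did>
  <scopecontent>
    <head>Scope and Content</head>
    <p>Minutes and correspondence.</p>
  </scopecontent>
</archdesc>"""

# Tags this hypothetical editing form does not expose to the cataloguer.
HIDDEN_TAGS = {"head"}


def editable_fields(root):
    """Return (tag, text) pairs the editing form would show, skipping hidden tags."""
    fields = []
    for element in root.iter():
        text = (element.text or "").strip()
        if text and element.tag not in HIDDEN_TAGS:
            fields.append((element.tag, text))
    return fields


def export(root):
    """Serialise the whole tree, including elements the form never displayed."""
    return ET.tostring(root, encoding="unicode")


root = ET.fromstring(SAMPLE_EAD)   # data in
shown = editable_fields(root)      # data editing: <head> is not offered
output = export(root)              # data out: <head> is still there
```

The point of the sketch is that hiding a field in the form and dropping it from the data are separate decisions: the tree is never pruned, so whatever came in goes back out.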

We have disabled upload of exports from the Calm system to the Editor at present, as we found that the data variations, which often caused the EAD to be invalid, were too much for our Editor to cope with. It has to analyse the data that comes in and decide which fields to populate with which data. Some are straightforward – ‘title’ goes into <unittitle> for example, but some are not…for example, Calm has references and alternative references, and we don’t have this in our system, so they cause problems for the Editor.

4. Output from the Editor

When a description is submitted to the Archives Hub from the Editor, it is uploaded to our system (CIIM, pronounced ‘sim’), which is provided by Knowledge Integration, and modified for our own data processing requirements.

Screenshot of the CIIM
CIIM Browse screen

The CIIM framework allows us to implement data checking and customised transformations, which can be specific to individual repositories. For the data from the Editor, we know that we only need a fairly basic default processing, because we are in control of the EAD that is created. However, we will have to consider working with EAD that is uploaded to the Editor, but has not been created in the Editor – this may lead to a requirement for additional data checking and transformations. But the vast majority of the time descriptions are created in the Editor, so we know they are good, valid, Hub EAD, and they should go through our processing with no problems.

Data Ingest from External Data Providers

1. The nature of the EAD

EAD from systems such as Calm, Archivist’s Toolkit and AtoM is going to vary far more than EAD produced from the Editor. Some of the archival management systems have EAD exports. To have an export is one thing; it is not the same as producing EAD that the Hub can ingest. There are a number of factors here. The way people catalogue varies enormously, so, aside from the system itself, the content can be unpredictable – we have to deal with how people enter references; how they enter dates; whether they provide normalised dates for searching; whether entries in fields such as language are properly divided up, or whether one entry box is used for ‘English, French, Latin’, or ‘English and a small amount of Latin’; whether references are always unique; whether levels are used to group information, rather than to represent a group of materials; what people choose to put into ‘origination’ and if they use both ‘origination’ and ‘creator’; whether fields are customised, etc. etc.
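
To give a flavour of the language problem, here is a rough Python sketch of splitting a single free-text entry into separate EAD <language> elements with ISO 639-2 codes. The lookup table is deliberately tiny and the function names are invented for the example; real data would need a much fuller table and more careful handling of phrases.

```python
import re

# Small illustrative lookup of ISO 639-2 (bibliographic) codes.
ISO_639_2 = {"english": "eng", "french": "fre", "latin": "lat", "welsh": "wel"}


def split_languages(entry):
    """Find known language names in a free-text entry, in order of appearance."""
    found, seen = [], set()
    for word in re.findall(r"[A-Za-z]+", entry):
        code = ISO_639_2.get(word.lower())
        if code and code not in seen:
            seen.add(code)
            found.append((word, code))
    return found


def to_langmaterial(entry):
    """Build a <langmaterial> element with one <language> per language found."""
    languages = "".join(
        f'<language langcode="{code}">{name}</language>'
        for name, code in split_languages(entry)
    )
    return f"<langmaterial>{languages}</langmaterial>"
```

Both ‘English, French, Latin’ and ‘English and a small amount of Latin’ come out as separate languages with this approach, although the nuance of ‘a small amount’ is lost, which is exactly the kind of trade-off that has to be decided case by case.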

The system itself will also influence the EAD output. A system will have a template, or transformation process, that maps the internal content to EAD. We have only worked in any detail with the Calm template so far. Axiell, the provider of Calm, made some changes for us. For example, only six languages were exporting when we first started testing the export, so they expanded this list; we then made additional changes, such as allowing for multiple creators, subjects and dates to export, and ensuring languages in Welsh would export. This does mean that any potential Calm exporter needs to use this new template, but Axiell are going to add it to their next upgrade of Calm.

We are currently working to modify the AdLib template, before we start testing out the EAD export. Our experience with Calm has shown us that we have to test the export with a wide variety of descriptions, and modify it accordingly, and we eventually get to a reasonably stable point, where the majority of descriptions export OK.

We’ve also done some work with AtoM, and we are hoping to be able to harvest descriptions directly from the system.

2. The scope of the EAD

As stated above, finding aids can be wide ranging, and EAD was designed to reflect this, but as a result it is not always easy to work with. We have worked with some individual Calm users to extend the scope of what we take in from them, where they have used fields that were not being exported. For instance, information about condition and reproduction was not exporting in one case, due to the particular fields used in Calm, which were not mapping to EAD in the template. We’ve also had instances of index terms not exporting, and sometimes this had been due to the particular way an institution has set up their system. It is perfectly possible for an institution to modify the template themselves so that it suits their own particular catalogues, but this is something we are cautious about, as having large numbers of customised exports is going to be harder to manage, and may lead to more unpredictable EAD.

3. Uploading to the Editor

In the old Hub world, we expected exports to be uploaded to the Editor. A number of our contributors preferred to do this, particularly for adding index terms. However, this led to problems for us because we ended up with such varied EAD, which militated against our aim of interoperable content. If you catalogue in a system, export from that system, upload to another system, edit in that system, then submit to an aggregator (and you do this sometimes, but other times you don’t), you are likely to run into problems with version control. Over the past few years we have done a considerable amount of work to clarify ‘master’ copies of descriptions. We have had situations where contributors have ended up with different versions to ours, and not necessarily been aware of it. Sometimes the level of detail would be greater in the Hub version, sometimes in the local version. It led to a deal of work sorting this out, and on some occasions data simply had to be lost in the interests of ending up with one master version, which is not a happy situation.

We are therefore cautious about uploading to the Editor, and we are recommending to contributors that they either provide their data directly (through exports) or they use the Editor. We are not ruling out a hybrid approach if there is a good reason for it, but we need to be clear about when we are doing this, what the workflow is, and where the master copy resides.

4. Output from Exported Descriptions

When we pass the exports through our processing, we carry out automated transformations based on analysis of the data. The EAD that we end up with – the processed version – is appropriate for the Hub. It is suitable for our interface, for aggregated searching, and for providing to others through our APIs. The original version is kept, so that we have a complete audit trail, and we can provide it back to the contributor. The processed EAD is provided to the Archives Portal Europe. If we did not carry out the processing, APE could not ingest many of the descriptions, or else they would ingest, but not display to the optimum standard.

Future Developments

Our automated workflow is working well. We have taken complete, or near complete,  exports from Calm users such as the Universities of Nottingham, Hull and (shortly) Warwick, and a number of Welsh local authority archives. This is a very effective way to ensure that we have up-to-date and comprehensive data.

We have well over one hundred active users of the EAD Editor and we also have a number of potential contributors who have signed up to it, keen to be part of the Archives Hub.

We intend to keep working on exports, and also hope to return to some work we started a few years ago on taking in Excel data. This is likely to require contributors to use our own Excel template, as it is impractical to work with locally produced templates. The problem is that working with one repository’s spreadsheet, translating it into EAD, could take weeks of work, and it would not replicate to other repositories, who will have different spreadsheets. Whilst Excel is reasonably simple, and most offices have it, it is also worth bearing in mind that creating data in Excel has considerable shortcomings. It is not designed for hierarchical archival data, which has requirements in terms of both structure and narrative, and is constantly being revised. TNA’s Discovery are also working with Excel, so we may be able to collaborate with them in progressing this area of work.

Our new architecture is working well, and it is gratifying to see that what we envisaged when we started working with Knowledge Integration and started setting out our vision for our workflow is now a reality.  Nothing stands still in archives, in standards, in technology or in user requirements, so we cannot stand still either, but we have a set-up that enables us to be flexible, and modify our processing to meet any new challenges.

The Building Blocks of the New Archives Hub

This is the first post outlining what the Archives Hub team have been up to over the past 18 months in creating a new system. We have worked with Knowledge Integration (K-Int) to create a new back end, using their CIIM software and Elasticsearch, and we’ve worked with Gooii and Sero to create a new interface. We are also building a new EAD Editor for cataloguing. Underlying all this we have a new data workflow and we will be implementing this through a new administrative interface. This post summarises some of the building blocks – our overall approach, objectives and processes.

What did we want to achieve?

The Archives Hub started off as a pilot project and has been running continuously as a service aggregating UK archival descriptions since 1999 (officially launched in 2001). That’s a long time to build up experience, to try things out, to have successes and failures, and to learn from mistakes.

The new Hub aimed to learn lessons from the past and to build positively upon our experiences.

Our key goals were:

  • sustainability
  • extensibility
  • reusability

Within these there is an awful lot I could unpack. But to keep it brief…

It was essential to come up with a system that could be maintained with the resources we had. In fact, we aimed to create a system that could be maintained to a basic level (essentially the data processing) with less effort than before. This included enabling contributors to administer their own data through access to a new interface, rather than having to go through the Hub team. Our more automated approach to basic processing would give us more resource to concentrate on added value, and this is essential in order to keep the service going, because a service has to develop  to remain relevant and meet changing needs.

The system had to be ‘future proof’ to the extent that we could make it so. One way to achieve this is to have a system that can be altered and extended over time; to make sure it is reasonably modular so that elements can be changed and replaced.

Key for us was that we wanted to end up with a store of data that could potentially be used in other interfaces and services. This is a substantial leap from thinking in terms of just servicing your own interface. But it is essential in the global digital age, and when thinking about value and impact, to think beyond your own environment and think in terms of  opportunities for increasing the profile and use of archives and of connecting data. There can be a tension between this kind of objective of openness and the need to clearly demonstrate the impact of the service, as you are pushing data beyond the bounds of your own scope and control, but it is essential for archives to be ‘out there’ in the digital environment, and we cannot shy away from the challenges that this raises.

In pursuing these goals, we needed to bring our contributors along with us. Our aims were going to have implications for them, so it was important to explain what we were doing and why.

Data Model for Sustainability

It is essential to create the right foundation. At the heart of what we do is the data (essentially meaning the archive descriptions, although future posts will introduce other types of data, namely repository descriptions and ‘name authorities’). Data comes in, is processed, is stored and accessed, and it flows out to other systems. It is the data that provides the value, and we know from experience that the data itself provides the biggest challenges.

The Archives Hub system that we originally created, working with the University of Liverpool and Cheshire software, allowed us to develop a successful aggregator, and we are proud of the many things we achieved. Aggregation was new, and, indeed, data standards were relatively new, and the aim was essentially to bring in data and provide access to it via our Archives Hub website. The system was not designed with a focus on a consistent workflow and sustainability was something of an unknown quantity, although the use of Encoded Archival Description (EAD) for our archive collection descriptions gave us a good basis in structured data. But in recent years the Hub started to become out of step with the digital environment.

For the new Hub we wanted to think about a more flexible model. We wanted the potential to add new ‘entities’. These may be described as any real world thing, so they might include archive descriptions, people, organisations, places, subjects, languages, repositories and events. If you create a model that allows for representing different entities, you can start to think about different perspectives, different ways to access the data and to connect the data up. It gives the potential for many different contexts and narratives.

We didn’t have the time and resource to bring in all the entities that we might have wanted to include; but a model that is based upon entities and relationships leaves the door open to further development. We needed a system that was compatible with this way of thinking. In fact, we went live without the ‘People and Organisations’ entity that we have been working on, but we can implement it when we are ready because the system allows for this.

Archives Hub Entity Relationship diagram
Entities within the Archives Hub system

The company that we employed to build the system had to be able to meet the needs of this type of model. That made it likely that we would need a supplier who already had this type of system. We found that with Knowledge Integration, who understood our modelling and what we were trying to achieve, and who had undertaken similar work aggregating descriptions of museum content.

Data Standards

The Hub works with Encoded Archival Description, so descriptions have to be valid EAD, and they have to conform to ISAD(G) (which EAD does). Originally the Hub employed a data editor, so that all descriptions were manually checked. This has the advantage of supporting contributors in a very 1-2-1 way, and working on the content of descriptions as well as the standardisation (e.g. thinking about what it means to have a useful title as well as thinking about the markup and format) and it was probably essential when we set out. But this approach had two significant shortcomings – content was changed without liaising with the contributor, which creates version control issues, and manual checking inevitably led to a lack of consistency and non-repeatable processes. It was resource intensive and not rigorous enough.

In order to move away from this and towards machine based processing we embarked upon a long process, over several months, of discussing ‘Hub data requirements’. It sometimes led to brain-frying discussions, and required us to make difficult decisions about what we would make mandatory. We talked in depth about pretty much every element of a description; we talked about levels of importance – mandatory, recommended, desirable; we asked contributors their opinions; we looked at our data from so many different angles. It was one of the more difficult elements of the work.  Two brief examples of this (I could list many more!):

Name of Creator

Name of creator is an ISAD(G) mandatory field. It is important for an understanding of the context of an archive. We started off by thinking it should be mandatory and most contributors agreed. But when we looked at our current data, hundreds of descriptions did not include a name of creator. We thought about whether we could make it mandatory for a ‘fonds’ (as opposed to an artificial collection), but there can be instances where the evidence points to a collection with a shared provenance, but the creator is not known. We looked at all the instances of ‘unknown’, ‘several’, ‘various’, etc. within the name of creator field. They did not fulfill the requirement either – the name of a creator is not ‘unknown’. We couldn’t go back to contributors and ask them to provide a creator name for so many descriptions. We knew that it was a bad idea to make it mandatory, but then not enforce it (we had already got into problems with an inconsistent approach to our data guidelines). We had to have a clear position. For me personally it was hard to let go of creator as mandatory! It didn’t feel right. It meant that we couldn’t enforce it with new data coming in. But it was the practical decision because if you say ‘this is mandatory except for the descriptions that don’t have it’ then the whole idea of a consistent and rigorous approach starts to be problematic.

Access Conditions

This is not an ISAD(G) mandatory field – a good example of where the standard lags behind the reality. For an online service, providing information about access is essential. We know that researchers value this information. If they are considering travelling to a repository, they need to be aware that the materials they want are available. So, we made this mandatory, but that meant we had to deal with something like 500 collections that did not include this information. However, one of the advantages of this type of information is that it is feasible to provide standard ‘boiler plate’ text, and this is what we offered to our contributors. It may mean some slightly unsatisfactory ‘catch all’ conditions of access, but overall we improved and updated the access information in many descriptions, and we will ask for it as mandatory with future data ingest.
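
Once a decision like this is made, checking for it is straightforward to automate. The sketch below, in Python, reports which required elements a description lacks. The MANDATORY list is an assumption for the sake of the example (the text above only establishes that access conditions are mandatory and name of creator is not), and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Assumed for illustration only: paths to elements treated as mandatory.
# <accessrestrict> is the EAD element for conditions governing access.
MANDATORY = ["did/unittitle", "did/unitdate", "accessrestrict"]


def missing_fields(archdesc):
    """List the mandatory paths that are absent or contain no text."""
    missing = []
    for path in MANDATORY:
        element = archdesc.find(path)
        if element is None or not "".join(element.itertext()).strip():
            missing.append(path)
    return missing


description = ET.fromstring(
    "<archdesc><did><unittitle>Estate papers</unittitle>"
    "<unitdate>1850-1900</unitdate></did></archdesc>"
)
problems = missing_fields(description)
```

Run across every description, a check like this gives the list of collections that need access information added before a rule can be enforced consistently.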

Normalizing the Data

Our rather ambitious goal was to improve the consistency of the data, by which I mean reducing variation, where appropriate, with things like date formats, name of repository, names of rules or source used for index terms, and also ensuring good practice with globally unique references.

To simplify somewhat, our old approach led us to deal with the variations in the data that we received in a somewhat ad hoc way, creating solutions to fix specific problems; solutions that were often implemented at the interface rather than within the back-end system. Over time this led to a somewhat messy level of complexity and a lack of coherence.

When you aggregate data from many sources, one of the most fundamental activities is to enable it to be brought together coherently for search and display, so you are often carrying out some kind of processing to standardise it. This can be characterised as simple processing and complex processing:

1) If X then Y

2) If X then Y or Z depending on whether A is present, and whether B and C match or do not match and whether the contributor is E or F.

The first example is straightforward; the second can get very complicated.
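
In code, the difference looks something like the Python sketch below. The record fields (‘title’, ‘reference’, ‘alternative_reference’) and the contributor codes are hypothetical stand-ins for X, Y, A, B, C, E and F, chosen only to make the shape of the logic visible.

```python
def simple_rule(record):
    """If X then Y: if there is a title, trim stray whitespace from it."""
    if record.get("title"):
        record["title"] = record["title"].strip()
    return record


# Hypothetical contributor codes standing in for 'E' and 'F'.
USE_ALTERNATIVE_REFERENCE = {"contributor-e"}


def complex_rule(record, contributor):
    """If X then Y or Z, depending on A, on whether B and C match, and on who sent it."""
    reference = record.get("reference")
    alternative = record.get("alternative_reference")
    if reference is None:
        return record  # X is not true, so nothing to do
    if (
        alternative  # A is present
        and alternative != reference  # B and C do not match
        and contributor in USE_ALTERNATIVE_REFERENCE  # the contributor is E
    ):
        record["reference"] = alternative  # Z
    return record  # otherwise Y: leave the reference alone
```

Each extra condition doubles the number of paths to test, which is why rules of the second kind become hard to reason about once a few dozen of them have accumulated.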

If you make these decisions as you go along, then after so many years you can end up with a level of complexity that becomes rather like a mass of lengths of string that have been tangled up in the middle – you just about manage to ensure that the threads in and out are still showing (the data in at one end; the data presented through the interface the researcher uses at the other) but the middle is impossible to untangle and becomes increasingly difficult to manage.

This is eventually going to create problems for three main reasons. Firstly, it becomes harder to introduce more clauses to fix various data issues without unforeseen impacts, secondly it is almost impossible to carry out repeatable processes, and thirdly (and really as a result of the other two), it becomes very difficult to provide the data as one reasonably coherent, interoperable set of data for the wider world.

We needed to go beyond the idea of the Archives Hub interface being the objective; we needed to open up the data, to ensure that contributors could get the maximum impact from providing the data to the Archives Hub. We needed to think of the Hub not as the end destination but as a means to enable many more (as yet maybe unknown) destinations. By doing this, we would also set things up for if and when we wanted to make significant changes to our own interface.

This is a game changer. It sounds like the right thing to do, but the problem is that it meant tackling the descriptions we already had on the Hub to introduce more consistency. Thousands of descriptions with hundreds of thousands of units created over time, in different systems, with different mindsets, different ‘standards’, different migration paths. This is a massive challenge, and it wasn’t possible for us to be too idealistic; we had to think about a practical approach to transforming descriptions and creating descriptions that makes them more re-usable and interoperable. Not perfect, but better.

Migrating the Data

Once we had our Hub requirements in place, we could start to think about the data we currently have, and how to make sure it met our requirements. We knew that we were going to implement ‘pipelines’ for incoming data (see below) within the new system, but that was not exactly the same process as migrating data from old world to new, as migration is a one-off process. We worked slowly and carefully through a spreadsheet, over the best part of a year, with a line for each contributor. We used XSLT transforms (essentially scripts to transform data). For each contributor we assessed the data and had to work out what sort of processing was needed. This was immensely time-consuming and sometimes involved complex logic and careful checking, as it is very easy with global edits to change one thing and find knock-on effects elsewhere that you don’t want.

The migration process was largely done through use of these scripts, but we had a substantial amount of manual editing to do, where automation simply couldn’t deal with the issues. For example:

  • dates such as 1800/190, 1900-20-04, 8173/1878
  • non-unique references, often the result of human error
  • corporate names with surnames included
  • personal names that were really family names
  • missing titles, dates or languages
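
Automation can at least find problems like the first two, even when fixing them needs a person. The following Python sketch flags implausible normalised dates and repeated references. It assumes dates are written as YYYY, YYYY-MM or YYYY-MM-DD, optionally as a start/end pair separated by ‘/’; the function names and the latest_year cut-off are choices made for the example.

```python
import re
from collections import Counter
from datetime import date

DATE_PART = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def parse_part(text):
    """Return (year, month, day) for YYYY, YYYY-MM or YYYY-MM-DD, else None."""
    match = DATE_PART.fullmatch(text)
    if not match:
        return None
    year = int(match.group(1))
    month = int(match.group(2) or 1)
    day = int(match.group(3) or 1)
    try:
        date(year, month, day)  # rejects impossible months and days
    except ValueError:
        return None
    return (year, month, day)


def is_plausible(normal, latest_year=2100):
    """Check a single date or a start/end range for shape, validity and order."""
    parts = [parse_part(part) for part in normal.split("/")]
    if len(parts) > 2 or None in parts:
        return False
    if any(part[0] > latest_year for part in parts):
        return False
    return len(parts) == 1 or parts[0] <= parts[1]


def duplicate_references(references):
    """Return the references that occur more than once."""
    counts = Counter(references)
    return sorted(ref for ref, count in counts.items() if count > 1)
```

All three of the example dates above fail this check, each for a different reason: a truncated year, an impossible month, and a year far in the future.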

When working through manual edits, our aim was to liaise with the contributor, but in the end there was so much to do that we made decisions that we thought were sensible and reasonable. Being an archivist and having significant experience of cataloguing made me feel qualified to do this. With some contributors, we also knew that they were planning a re-submission of all their descriptions, so we just needed to get the current descriptions migrated temporarily, and a non-ideal edit might therefore be fine just for a short period of time. Even with this approach we ended up with a very small number of descriptions that we could not migrate for the going live date because we needed more time to figure out how to get them up to the required standard.

Creating Pipelines

Our approach to data normalization for incoming descriptions was to create ‘pipelines’. More about this in another blog post, but essentially, we knew that we had to implement repeatable transformation processes. We had data from many different contributors, with many variations. We needed a set of pipelines so that we could work with data from each individual contributor appropriately. The pipelines include things like:

  • fix problems with web links (where the link has not been included, or the link text has not been included)
  • remove empty tags
  • add ISO language code
  • take archon codes out of names of repositories

Of course, for many contributors these processes will be the same – there would be a default approach, but we sometimes will need to vary the pipelines as appropriate for individual contributors. For example:

  • add access information where it is not present
  • use the ‘alternative reference’ (created in Calm) as the main reference

We will be implementing these pipelines in our new world, through the administration interface that K-Int have built. We’re just starting on that particular journey!


We were ambitious, and whilst I think we’ve managed to fulfill many of the goals that we had, we did have to modify our data standards to ‘lower the bar’ as we went along. It is far better to set data standards at the outset as changing them part way through usually has ramifications, but it is difficult to do this when you have not yet worked through all the data. In hindsight, maybe we should have interrogated the data we have much more to begin with, to really see the full extent of the variations and missing data…but maybe that would have put us off ever starting the project!

The data is key. If you are aggregating from many different sources, and you are dealing with multi-level descriptions that may be revised every month, every year, or over many years, then the data is the biggest challenge, not the technical set-up. It was essential to think about the data and the workflow first and foremost.

It was important to think about what the contributors can do – what is realistic for them. The Archives Hub contributors clearly see the benefits of contributing and are prepared to put what resources they can into it, but their resources are limited. You can’t set the bar too high, but you can nudge it up in certain ways if you give good reasons for doing so.

It is really useful to have a model that conveys the fundamentals of your data organisation. We didn’t apply the model to the environment; we created the environment from the model. A model that can be extended over time helps to make sure the service remains relevant and meets new requirements.


Archives Wales Catalogues Online: Working with the Archives Hub

Stacy Capner reflects on her first six months as Project Officer for the Archives Wales Catalogues Online project, a collaboration between the Archives and Records Council Wales and the Archives Hub to increase the discoverability of Welsh archives.

For a few years now there has been a strategic goal to get Wales’ archive collections more prominently ‘out there’ using the Archives Wales website. Collection level descriptions have been made available previously through the ‘Archives Network Wales’ project, but the aim now is to create a single portal to search and access multi-level descriptions from across services. The Archives Hub has an established, standards based way of doing this, so instead of re-inventing the wheel, Archives and Records Council Wales (ARCW) saw an opportunity to work with them to achieve these aims.

The work to take data from Welsh Archives into the Archives Hub started some time ago, but it became clear that getting exports from different systems and working with different cataloguing practices required more dedicated 1-2-1 liaison. I am the project officer on a defined project which began in April to provide dedicated support to archive services across Wales and to establish requirements for uploading their catalogue data to the Archives Hub (and subsequently to Archives Wales).

This project is supported by the Welsh Government through its Museums Archives and Libraries Division, with a grant to Swansea University, a member of ARCW and a long-standing contributor to the Hub. I’m on secondment from the University to the project, which means I’ve found myself back in my northern neck of the woods working alongside the Archives Hub team. This project has come at a time when the Archives Hub have been putting a lot of thought into their processes for uploading data straight from systems, which means that the requirements for Welsh services have started to define an approach which could be applied to archive services across Scotland, England and Northern Ireland.

Here are my reflections on the project so far:

  1. Wales has fantastic collections, holding internationally significant material. They deserve to be promoted, accessible and searchable to as wide an audience as possible. Some examples-

National Library of Wales, The Survey of the Manors of Crickhowell & Tretower (inscribed in the UNESCO Memory of the World Register, 2016)

Swansea University, South Wales Coalfield Collection

West Glamorgan, Neath Abbey Ironworks collection (inscribed in the UNESCO Memory of the World register, 2014)

Bangor University, Penrhyn Estate papers (including material relating to the sugar plantations in Jamaica)

Photograph of Ammanford colliers and workmen standing in front of anthracite truck, c 1900. From the South Wales Coalfield Collection. Source: Richard Burton Archives, Swansea University (Ref: SWCC/PHO/COL/11)
  2. Don’t be scared of EAD! I was. My knowledge of EAD (Encoded Archival Description) hadn’t been refreshed in 10 years, since Jane Stevenson got us to create brownie recipes using EAD tags on the archives course. So, whilst I started the task with confidence in cataloguing and cataloguing systems, my first month or so was spent learning about the Archives Hub EAD requirements. For contributors, one of the benefits of the Archives Hub is that they’ve created guidance, tools and processes so that archivists don’t have to become experts at creating or understanding EAD (though it is useful and interesting, if you get the chance!).
  3. The Archives Hub team are great! Their contributor numbers are growing (over 300 now) and their new website and editor are only going to make it easier for archive services to contribute and for researchers to search. What has struck me is that the team are all hot on data, standards and consistency, but it’s combined with a willingness to find solutions/processes which won’t put too much extra pressure on archive services wishing to contribute. It’s a balance that seems to work well and will be crucial for this project.
  4. The information gathering stage was interesting. And tiring. I visited every ARCW member archive service in Wales to introduce them to the project, find out what cataloguing systems they were using, and to review existing electronic catalogues. Most services in Wales are using Calm, though other systems currently being used include internally created databases, AtoM, Archivists Toolkit and Modes. It was really helpful to see how fields were being used, how services had adapted systems to suit them, and how all of this fitted in to Archives Hub requirements for interoperability.

    Photo of icecream
    Perks of working visits to beautiful parts of Wales.
  5. The support stage is set to be more interesting. And probably more tiring! The next 6 months will be spent providing practical support to services to help enable their catalogues to meet Archives Hub requirements. I’ll be able to address most of the smaller, service specific, tasks on site visits. The Hub team and I have identified a number of trickier ‘issues’ which we’ll hash out with further meetings and feedback from services. I can foresee further blog posts on these so briefly they are:
  • Multilingualism- most services catalogue Welsh items/collections in Welsh, English items/collections in English and multi-language item/collections bilingually. However, the method of doing this across services (and within services) isn’t consistent. We’re going to look at what can be done to ensure that descriptions in multiple languages are both human and machine readable.
  • Ref no/Alt ref- due to legacy issues with non-hierarchical catalogues, or just services’ personal preference, there are variations in the use of these fields. Some services use the ref no as the reference, others use the alt ref no as the reference. This isn’t a problem (as long as it’s consistent). Some services use ref no as the reference but not at series level, others use the alt ref no as the reference but not at series level. This will prove a little trickier for the Archives Hub to handle but hopefully workarounds for individual services will be found.
  • Extent fields missing- this is a mandatory field at collection level for the Archives Hub. It’s important to give researchers an idea of the size of the collection/series (it’s also an ISAD(G) required field). However, many services have hundreds of collection level descriptions which are missing extent. It’s not something I’ll practically be able to address on my support visits so the possibility of further work/funding will be looked into.
  • Indexing- this is understandably very important to the Archives Hub (they explain why here). For several archive services in Wales it seems to have been a step too far in the cataloguing process, mainly due to a lack of resource/time/training. Most have used imported terms from an old database or nothing at all. Although this will not prevent services from contributing catalogues to the Archives Hub, it does open up opportunities to think about partnership projects which might address this in the future (including looking at Welsh language index terms).

The project has made me think about how I’ve catalogued in the past. It’s made me much more aware that catalogues shouldn’t just be an inward-facing, local or an intellectual control based task; we should be constantly aware of making our descriptions more discoverable to researchers. And it’s shown me the importance of standards and consistency in achieving this (I feel like I’ve referenced consistency a lot in this one blog post; consistency is important!).  I hope that the project is also prompting Welsh archive services to reflect on the accessibility of their own cataloguing- something which might not have been looked at in many years.

There’s a lot of work to be done, both in this foundation work and in further funding/projects which might come off the back of it. But hopefully in the next few years you’ll be discovering much more of Wales’ archive collections online.

Stacy Capner
Project Officer
Archives Wales Catalogues Online


Archives Hub EAD Editor –

Archives Hub contributors – list and map


Micro sites: local interfaces for Archives Hub contributors


Back in 2008 the Archives Hub embarked upon a project to become distributed; the aim was to give control of their data to the individual contributors. Every contributor could host their own data by installing and running a ‘mini Hub’. This would give them an administrative interface to manage their descriptions and a web interface for searching.

Five years later we had 6 distributed ‘spokes’ for 6 contributors. This was actually reduced from 8, which was the highest number of institutions that took up the invitation to hold their own data out of around 180 contributors at the time.

The primary reason for the lack of success was identified as a lack of technical knowledge and the skills required for setting up and maintaining the software. In addition to this, many institutions are not willing to install unknown software or maintain an unfamiliar operating system. Of course, many Hub contributors already had a management system, and so they may not have wanted to run a second system; but a significant number did not (and still don’t) have their own system. Part of the reason many institutions want an out-of-the-box solution is that they do not have consistent or effective IT support, so they need something that is intuitive to use.

The spokes institutions ended up requiring a great deal of support from the central Hub team; and at the same time they found that running their spoke took a good deal of their own time. In the end, setting up a server with an operating system and bespoke software (Cheshire in this case) is not a trivial thing, even with step-by-step instructions, because there are many variables and external factors that impact on the process. We realised that running the spokes effectively would probably require a full-time member of the Hub team in support, which was not really feasible, but even then it was doubtful whether the spokes institutions could find the IT support they required on an ongoing basis, as they needed a secure server and they needed to upgrade the software periodically.

Another big issue with the distributed model was that the central Hub team could no longer work on the Hub data in its entirety, because the spoke institutions had the master copy of their own data. We are increasingly keen to work cross-platform, using the data in different applications. This requires the data to be consistent, and therefore we wanted to have a central store of data so that we could work on standardising the descriptions.

The Hub team spend a substantial amount of time processing the data, in order to be able to work with it more effectively. For example, a very substantial (and continuing) amount of work has been done to create persistent URIs for all levels of description (i.e. series, item, etc.). This requires rigorous consistency and no duplication of references. When we started to work on this we found that we had hundreds of duplicate references due to both human error and issues with our workflow (which in some cases meant we had loaded a revised description along with the original description). Also, because we use archival references in our URIs, we were somewhat nonplussed to discover that there was an issue with duplicates arising from references such as GB 234 5AB and GB 2345 AB, which become identical once the spaces are removed. We therefore had to change our URI pattern, which led to substantial additional work (we used a hyphen to create gb234-5ab and gb2345-ab).
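The collision described above can be sketched in a few lines of Python. This is only an illustration, not the Hub's actual code: the function names are invented, and it assumes every reference has the form 'GB, repository code, local reference' separated by spaces.

```python
import re
from collections import defaultdict


def naive_slug(reference):
    """Illustrative 'old' pattern: lower-case and drop all whitespace."""
    return re.sub(r"\s+", "", reference.lower())


def hyphenated_slug(reference):
    """Keep the boundary between repository code and local reference.

    Assumes the form '<country> <repository code> <local reference>'.
    """
    country, repository, local = reference.split(maxsplit=2)
    local = re.sub(r"\s+", "", local)
    return f"{country}{repository}-{local}".lower()


def find_collisions(references, slugger):
    """Return only the slugs that more than one distinct reference maps to."""
    groups = defaultdict(set)
    for ref in references:
        groups[slugger(ref)].add(ref)
    return {slug: sorted(refs) for slug, refs in groups.items() if len(refs) > 1}


refs = ["GB 234 5AB", "GB 2345 AB"]
print(find_collisions(refs, naive_slug))       # both references share one slug
print(find_collisions(refs, hyphenated_slug))  # empty: the hyphen keeps them apart
```

Running a check like `find_collisions` over the full set of references before minting URIs is one way to surface duplicates early, whatever their cause.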

We also carry out more minor data corrections, such as correcting character encoding (usually an issue with characters such as accented letters) and creating normalised dates (machine processable dates).

In addition to these types of corrections, we run validation checks and correct anything that is not valid according to the EAD schema, and we are planning, longer term, to set up a workflow such that we can implement some enhancement routines, such as adding a ‘personal name’ or ‘corporate name’ identifying tag to our creator names.

These data corrections/enhancements have been applied to data held centrally. We have tried to work with the distributed data, but it is very hard to maintain version control, as the data is constantly being revised, and we have ended up with some instances where identifying the ‘master’ copy of the data has become problematic.

We are currently working towards a more automated system of data corrections/enhancement, and this makes it important that we hold all of the data centrally, so that we ensure that the workflow is clear and we do not end up with duplicate, slightly different versions of descriptions. (NB: there are ways to work more effectively with distributed data, but we do not have the resources to set up this kind of environment at present – it may be something for the longer term).

We concluded that the distributed model was not sustainable, but we still wanted to provide a front-end for contributors. We therefore came up with the idea of the ‘micro sites’.

What are Hub Micro Sites?

The micro sites are a template based local interface for individual Hub contributors. They use a feed of the contributor’s data from the central Archives Hub, so the data is only held in one place but accessible through both interfaces: the Hub and the micro site. The end-user performs a search on a micro site, the search request goes to the central Hub, and the results are returned and displayed in the micro site interface.

screenshot of brighton micro site
Brighton Design Archives micro site homepage

The principles underlying the micro sites are that they need to be:

•    Sustainable
•    Low cost
•    Efficient
•    Realistically resourced

A Template Approach?

As part of our aim of ensuring a sustainable and low-cost solution we knew we had to adopt a one-size-fits-all model. The aim is to be able to set up a new micro site with minimal effort, as the basic look and feel stays the same. Only the branding, top and bottom banners, basic text and colours change. This gives enough flexibility for a micro site to reflect an institution’s identity, through its logo and colours, but it means that we avoid customisation, which can be very time-consuming to maintain.

The micro sites use an open approach, so it would be possible for institutions to customise themselves, by manipulating the stylesheets. However, this is not something that the Archives Hub can support, and therefore the institution would need to have the expertise necessary to maintain this themselves.

The Consultation Process

We started by talking to the Spokes institutions and getting their feedback about the strengths and weaknesses of the spokes and what might replace them. We then sent out a survey to Hub contributors to ascertain whether there would be a demand for the micro sites.

Institutions preferred the micro sites to be hosted by the Archives Hub. This reflects the lack of technical support within UK archives. This solution is also likely to be more efficient for us, as providing support at a distance is often more complicated than maintaining services in-house.

The responders generally did not have images displayed on the Hub, but intended to in the future, so this needed to be taken into account. We also asked about experiences with understanding and using APIs. The response showed that people had no experience of APIs and did not really understand what they were, but were keen to find out more.

We asked for requirements and preferences, which we have taken into account as much as possible, but we explained that we would have to take a uniform approach, so it was likely that there would need to be compromises.

After a period of development, we met with the early adopters of the micro sites (see below) to update them on our progress and get additional requirements from them. We considered these requirements in terms of how practical they would be to implement in the time scale that we were working towards, and we then prioritised the requirements that we would aim to implement before going live.

The additional requirements included:

  • Search in multi-level description: the ability to search within a description to find just the components that include the search term
  • Reference search: useful for contributors for administrative purposes
  • Citation: title and reference, to encourage researchers to cite the archive correctly
  • Highlight: highlighting of the search term(s)
  • Links to ‘search again’ and to ‘go back’ to the collection result
  • The addition of Google Analytics code in the pages, to enable impact analysis

The Development Process

We wanted the micro sites to be a ‘stand alone’ implementation, not tied to the Archives Hub. We could have utilised the Hub, effectively creating duplicate instances of the interface, but this would have created dependencies.  We felt that it was important for the micro sites to be sustainable independent of our current Hub platform.

In fact, the micro sites have been developed using Java, whereas the Hub uses Python, a completely different programming language. This happened mainly because we had a Java programmer on the team. It may seem a little odd to do this, as opposed to simply filtering the Hub data with Python, but we think that it has had unforeseen benefits. Namely, the programmers who have worked on the micro sites have been able to come at the task afresh, and work on new ways to solve the many challenges that we faced. As a result of this we have implemented some solutions with the micro sites that are not implemented on the Hub. Equally, there were certainly functions within the Hub that we could not replicate with the micro sites – mainly those that were specifically set up for the aggregated nature of the Hub (e.g. browsing across the Hub content).

It was a steep learning curve for a developer, as the development required a good understanding of hierarchical archival descriptions, and also an appreciation of the challenges that come from a diverse data set. As with pretty much all Hub projects, it is the diverse nature of the data set that is the main hurdle. Developers need patterns; they need something to work with, something consistent. There isn’t too much of that with aggregated archives catalogues!

The developer utilised what he could from the Hub, but it is the nature of programming that reverse engineering of someone else’s code can be a great deal harder than re-coding, so in many cases the coding was done from scratch. For example, the table of contents is a particularly tricky thing to recreate, but the code used for the current Hub proved to be too complex to work with, as it has been built up over a decade and is designed to work within the Hub environment. The table of contents requires the hierarchy to be set out, collapsible folder structures, links to specific parts of the description with further navigation from there to allow the researcher to navigate up and down, so it is a complex thing to create and it took some time to achieve.

The feed of data has to provide the necessary information for the creation of the hierarchy, and our feed comes through SRU (Search/Retrieve via URL), which is a standard search protocol for Internet search queries using Contextual Query Language (CQL). This was already available through the Hub API, and the micro sites application makes use of SRU in order to perform most of the standard searches that are available on the Hub. Essentially, all of the micro sites are provided by a single web application that acts as a layer on the Archives Hub. To access an individual micro site, the contributor provides a shortened version of the institution’s name as a sub-string to the micro sites web address. This then filters the data accordingly for that institution, and sets up the site with the appropriate branding. The latter is achieved through CSS stylesheets, individually tailored for the given institution by a stand-alone Java application and a standard CSS template.
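To give a flavour of what an SRU request filtered by institution might look like, here is a minimal Python sketch. The request parameters (operation, version, query, startRecord, maximumRecords) are standard SRU; the endpoint address and the 'hub.institution' index name are invented for illustration and are not the Hub's real ones.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: the real address of the Hub SRU service is not given here.
SRU_BASE = "https://example.org/sru"


def build_sru_url(term, institution, start=1, page_size=20):
    """Build an SRU searchRetrieve URL restricted to one contributor."""
    safe_term = term.replace('"', '\\"')
    # 'cql.anywhere' comes from the CQL context set; 'hub.institution' is an
    # invented index name standing in for whatever the service really uses.
    cql = f'cql.anywhere all "{safe_term}" and hub.institution = "{institution}"'
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,
        "startRecord": start,
        "maximumRecords": page_size,
    }
    return f"{SRU_BASE}?{urlencode(params)}"


url = build_sru_url("theatre programmes", "salford")
print(url)
# Decode the query string again to see the CQL the server would receive.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The point of the sketch is that the institution filter lives in the query itself, so one application can serve every micro site simply by changing that one value.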

Page Display

One of the changes that the developer suggested for the micro sites concerns the intellectual division of the descriptions. On the current Hub, a description may carry over many pages, but each page does not represent anything specific about the hierarchy, it is just a case of the description continuing from one page to the next. With the micro sites we have introduced the idea that each ‘child’ description of the top level is represented on one page. This can more easily be shown through a screenshot:

screenshot of table of contents from Salford Archives
Table of Contents of the Walter Greenwood Collection showing the tree structure

In the screenshot above, the series ‘Theatre Programmes, Playbills, etc’ is a first-level child description (a series description) of the archive collection ‘The Walter Greenwood Collection’.  Within this series there are a number of sub-series, the first of which is ‘Love on the Dole’, the last of which is ‘A Taste of Honey’. The researcher will therefore get a page that contains everything within this one series – all sub-series and items – if there are any described in the series.

screenshot of a page from Salford Archives
Page for ‘Theatre Programmes, Playbills, etc’ within the Walter Greenwood Collection

The sense of hierarchy and belonging is further reinforced by repeating the main collection title at the top of every right hand pane. The only potential downside to this approach is that it leads to variable length ‘child’ description pages, but we felt it was a reasonable trade-off because it enables the researcher to get a sense of the structure of the collection. Usually it means that they can see everything within one series on one page, as this is the most typical first child level of an archival description. In EAD representation, this is everything contained within the <c01> tag or top level <c> tag.
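The 'one page per first-level child' idea can be sketched roughly as follows. This is a simplified Python illustration, not the micro sites code (which is written in Java): the sample record is a cut-down invention based on the titles mentioned above, and it assumes un-namespaced EAD with the components held directly inside <dsc>.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down, invented sample; the level values here are assumptions.
SAMPLE = """<ead><archdesc level="fonds">
  <did><unittitle>The Walter Greenwood Collection</unittitle></did>
  <dsc>
    <c01 level="series"><did><unittitle>Theatre Programmes, Playbills, etc</unittitle></did>
      <c02 level="subseries"><did><unittitle>Love on the Dole</unittitle></did></c02>
      <c02 level="subseries"><did><unittitle>A Taste of Honey</unittitle></did></c02>
    </c01>
  </dsc>
</archdesc></ead>"""

COMPONENT = re.compile(r"c(\d\d)?")  # matches <c> as well as <c01> to <c12>


def child_pages(ead_xml):
    """Return one 'page' per first-level component of the description."""
    root = ET.fromstring(ead_xml)
    collection_title = root.findtext("archdesc/did/unittitle")
    dsc = root.find("archdesc/dsc")
    pages = []
    if dsc is None:
        return pages
    for child in dsc:  # direct children only, i.e. <c01> or top-level <c>
        if not COMPONENT.fullmatch(child.tag):
            continue
        nested = [
            c.findtext("did/unittitle")
            for c in child.iter()
            if c is not child and COMPONENT.fullmatch(c.tag)
        ]
        pages.append({
            "collection": collection_title,  # repeated at the top of each page
            "title": child.findtext("did/unittitle"),
            "contains": nested,
        })
    return pages


for page in child_pages(SAMPLE):
    print(page)
```

Each dictionary carries the collection title as well as its own, which mirrors the decision to repeat the main collection title on every page.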

Next Steps

We are currently testing the micro sites with early adopters: Glasgow University Archive Services, Salford University Archives, Brighton Design Archives and the University of Manchester John Rylands Library.

We aim to go live during September 2014 (although it has been hard to fix a live date, as with a new and innovative service such as the micro sites unforeseen problems tend to emerge with alarming regularity). We will see what sort of feedback we get, and it is likely that we will find a few things need addressing as a result of putting the micro sites out into the big wide world. We intend to arrange a meeting for the early adopters to come together again and feed back to us, so that we can consider whether we need a ‘phase 2’ to iron out any problems and make any enhancements. We may at that stage invite other interested institutions, to explain the process and look at setting up further sites. But certainly our aim is to roll out the micro sites to other Archives Hub institutions.

A European Journey: The Archives Portal Europe

In January 2013 the Archives Hub became the UK ‘Country Manager’ for the Archives Portal Europe.

The Archives Portal Europe (APE) is a European aggregator for archives. The website provides more information about the APE vision:

Borders between European countries have changed often during the course of history. States have merged and separated and it is these changing patterns that form the basis for a common ground as well as for differences in their development. It is this tension between their shared history and diversity that makes their respective histories even more interesting. By collating archival material that has been created during these historical and political evolutions, the Archives Portal Europe aims to provide the opportunity to compare national and regional developments and to understand their uniqueness while simultaneously placing them within the larger European context.

The portal will help visitors not only to dig deeper into their own fields of interest, but also to discover new sources by giving an overview of the jigsaw puzzle of archival holdings across Europe in all their diversity.

For many countries, the Country Manager role is taken on by the national archives. However, for the UK the Archives Hub was in a good position to work with APE. The Archives Hub is an aggregation of archival descriptions held across the UK. We work with and store content in Encoded Archival Description (EAD), which provides us with a head start in terms of contributing content.

Jane Stevenson, the Archives Hub Manager, attended an APE workshop in Pisa in January 2013, to learn more about the tools that the project provides to help Country Managers and contributors to provide their data. Since then, Jane has also attended a conference in Dublin, Building Infrastructures for Archives in a Digital World, where she talked about A Licence to Thrill: the benefits of open data. APE has provided a great opportunity to work with European colleagues; it is not just about creating a pan-European portal, it is also about sharing and learning together. At present, APE has a project called APEx, which is an initiative for “expanding, enriching, enhancing and sustaining” the portal.

How Content is Provided to APE

The way that APE normally works is through a Country Manager providing support to institutions wishing to contribute descriptions. However, for the UK, the Archives Hub takes on the role of providing the content directly, as it comes via the Hub and into APE. This is not to say that institutions cannot undertake to do this work themselves. The British Library, for example, will be working with their own data and submitting it to APE. But for many archives, the task of creating EAD and checking for validity would be beyond their resources. In addition, this model of working shows the benefits of using interoperable standards; the Archives Hub already processes and validates EAD, so we have a good understanding of what is required for the Archives Portal Europe.

All that Archives Hub institutions need to do to become part of APE is to create their own directory entry. These entries are created using Encoded Archival Guide (EAG), but the archivist does not need to be familiar with EAG, as they are simply presented with a form to fill in. The directory entry can be quite brief, or very detailed, including information on opening hours, accessibility, reprographic services, search room places, internet access and the history of the archive.

Fig 1: EAG entry for the University of East London

Once the entry is created, we can upload the data. If the data is valid, this takes very little time to do, and immediately the archive is part of a national aggregation and a European aggregation.

APE Data Preparation Tool

The Data Preparation Tool allows us to upload EAD content and validate it. You can see on the screen shot below a list of EAD files from the Mills Archive that have been uploaded to the Tool, and the Tool will allow us to ‘convert and validate’ them. There are various options for checking against different flavours of EAD and there is also the option to upload EAC-CPF (which is not something the Hub is working with as yet) and EAG.

Fig 2: Data Preparation Tool

If all goes according to plan, the validation results in a whole batch of valid files, and you are ready to upload the data. Sometimes there will be an invalid file and you need to take a look at the validation message and figure out what you need to do (the error message in this screenshot relates to ‘example 2’ below).

Fig 3: Data Preparation Tool with invalid files

APE Dashboard

The Dashboard is an interface provided to an APE Country Manager to enable them to administer their landscape. The first job is to create the archival landscape. For the UK we decided to group the archives into type:

Fig 4: Archival Landscape

The landscape can be modified as we go, but it is good to keep the basic categories, so it’s worth thinking about this from the outset. We found that many other European countries divide their archives differently, reflecting their own landscape, particularly in terms of how local government is organised. We did have a discussion about the advantages of all using the same categories, but it seemed better for the end-user to be presented with categories suitable for the way UK archives are organised.

Within the Dashboard, the Country Manager creates logins for all of the archive repositories contributing to APE. The repositories can potentially use these logins to upload EAD to the dashboard, validate and correct if necessary and then publish. But at present, the Archives Hub is taking on this role for almost all repositories. One advantage of doing this is that we can identify issues that surface across the data, and work out how best to address these issues for all repositories, rather than each one having to take time to investigate their own data.

Working with the Data

When the Archives Hub started to work with APE, we began by undertaking a comparison of Hub EAD and APE EAD. Jane created a document setting out the similarities and differences between the two flavours of EAD. Whilst the Hub and APE both use EAD, this does not mean that the two will be totally compatible. EAD is quite permissive and so for services like aggregators choices have to be made about which fields to use and how to style the content using XSLT stylesheets. To try to cover all possible permutations of EAD use would be a huge task!

There have been two main scenarios when dealing with data issues for APE:

(1) the data is not valid EAD or it is in some way incorrect

(2) the data is valid EAD but the APE stylesheet cannot yet deal with it

We found a combination of both scenarios. For the first, the onus is on the Archives Hub to deal with the data issues at source. This enables us to improve the data at the same time as ensuring that it can be ingested into APE. For the second, we explain the issue to the APE developer, so that the stylesheet can be modified.

Here are just a few examples of some of the issues we worked through.

Example 1: Digital Archival Objects

APE was omitting the <daodesc> content:

<dao href="" show="embed"><daodesc><p>'Gary Popstar' by Julian Opie</p></daodesc></dao>

 Content of <daodesc><p> should be transferred to <dao@xlink:title>. It would then be displayed as mouse-over text to the icons used in the APE for highlighting digital content. Would that solution be ok?

In this instance the problem was due to the Hub using the DTD and APE using the schema, and a small transformation done by APE when they ingested the data sufficed to provide a solution.
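As a rough idea of the kind of small transformation involved, here is a Python sketch that moves the <daodesc> text into an xlink:title attribute. APE's actual transformation is not shown in this post and is presumably done in XSLT; dropping the <daodesc> element afterwards is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)  # serialise with the usual 'xlink:' prefix


def move_daodesc_to_title(dao):
    """Copy the text of <daodesc> into xlink:title, then drop <daodesc>."""
    daodesc = dao.find("daodesc")
    if daodesc is None:
        return dao
    # Gather text from any nested elements (<p> etc.) and tidy the whitespace.
    text = " ".join("".join(daodesc.itertext()).split())
    if text:
        dao.set(f"{{{XLINK}}}title", text)
    dao.remove(daodesc)  # assumption: the target format does not keep <daodesc>
    return dao


dao = ET.fromstring(
    """<dao href="" show="embed"><daodesc><p>'Gary Popstar' by Julian Opie</p></daodesc></dao>"""
)
print(ET.tostring(move_daodesc_to_title(dao), encoding="unicode"))
```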

Example 2: EAD Level Attribute

Archivists are all familiar with the levels within archival descriptions. Unfortunately, ISAD(G), the standard for archival description, is not very helpful with enforcing controlled vocabulary here, simply suggesting terms like Fonds, Sub-fonds, Series, Sub-series. EAD has a more definite list of values:

  • collection
  • fonds
  • class
  • recordgrp
  • series
  • subfonds
  • subgrp
  • subseries
  • file
  • item
  • otherlevel

Inevitably this means that the Archives Hub has ended up with variations in these values. In addition, some descriptions use an attribute value called ‘otherlevel’ for values that are not, in fact, other levels, but are recognised levels.

We had to deal with quite a few variations: Subfonds, SubFonds, sub-fonds, Sub-fonds, sub fonds, for example. I needed to discuss these values with the APE developer and we decided that the Hub data should be modified to only use the EAD specified values.

For example:

<c level="otherlevel" otherlevel="sub-fonds">

needed to be changed to:

<c level="subfonds">

At the same time the APE stylesheet also needed to be modified to deal with all recognised level values. Where the level was not a recognised EAD value, e.g. ‘piece’, then ‘otherlevel’ is valid, and the APE stylesheet was modified to recognise this.
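The clean-up of level values could be approached along these lines. This is a minimal Python sketch rather than the routine the Hub actually ran; the function name is invented, and the list of recognised values is the one given above (minus 'otherlevel' itself).

```python
import re
import xml.etree.ElementTree as ET

# The recognised EAD level values listed above, excluding 'otherlevel' itself.
EAD_LEVELS = {
    "collection", "fonds", "class", "recordgrp", "series",
    "subfonds", "subgrp", "subseries", "file", "item",
}


def normalise_level(component):
    """Rewrite variants such as 'Sub-fonds' or 'sub fonds' to the EAD value.

    Values that are genuinely not EAD levels (e.g. 'piece') are left alone.
    """
    level = component.get("level")
    raw = component.get("otherlevel") if level == "otherlevel" else level
    if raw is None:
        return component
    candidate = re.sub(r"[\s_-]+", "", raw).lower()
    if candidate in EAD_LEVELS:
        component.set("level", candidate)
        component.attrib.pop("otherlevel", None)
    return component


for xml in (
    '<c level="otherlevel" otherlevel="sub-fonds"/>',
    '<c level="SubFonds"/>',
    '<c level="otherlevel" otherlevel="piece"/>',
):
    print(ET.tostring(normalise_level(ET.fromstring(xml)), encoding="unicode"))
```

Only values that match a recognised level after removing case, spaces and hyphens are changed, so a legitimate 'otherlevel' such as 'piece' passes through untouched.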

Example 3: Data within <title> tag

We discovered that for certain fields, such as biographical history, any content within a <title> tag was being omitted from the APE display. This simply required a minor adjustment to the stylesheet.
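The same kind of omission is easy to reproduce outside XSLT whenever code reads only an element's direct text and ignores nested tags. The Python sketch below uses an invented sample sentence to show the difference; it is an analogy, not the APE stylesheet logic.

```python
import xml.etree.ElementTree as ET

# Invented sample: a biographical history paragraph with an inline <title>.
bioghist = ET.fromstring(
    "<bioghist><p>Author of the novel <title>Love on the Dole</title>.</p></bioghist>"
)
p = bioghist.find("p")

# Reading only the direct text stops at the first nested element,
# so the title and everything after it are lost.
direct_only = p.text

# Walking all text nodes keeps the content of <title> and the trailing text.
full_text = "".join(p.itertext())

print(repr(direct_only))
print(repr(full_text))
```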

Where are we Now?

The APE developers are constantly working to improve the stylesheets to work with EAD from across Europe. Most of the issues that we have had have now been dealt with. We will continue to check the UK data as we upload it, and go through the process described above, correcting data issues at source and reporting validation problems to the APE team.

The UK Archival Landscape in Europe

By being part of the Archives Portal Europe, UK archives benefit from more exposure, and researchers benefit from being able to connect archives in new and different ways.  UK archives are now being featured on the APE homepage.

Wiener Library: Antisemitic board game, 1936

APE and the APEx project provide a great community environment. The project provides news from archives across Europe; it has a section for people to contribute articles; and it runs and advertises events across Europe. Most importantly for the Archives Hub, it provides effective tools along with knowledgeable staff, so that there is a supportive environment to facilitate our role as Country Manager.

APEx Project meeting in Berlin, Nov 2013 (copyright: Bundesarchiv)

EAD and Next Generation Discovery

This post is in response to a recent article in Code4Lib, ‘Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems’ by M. Bron, M. Proffitt and B. Washburn. All quotes are from that article, which looked at the instances of tags within ArchiveGrid, the US-based archival aggregation run by OCLC. This post compares some of their findings to the UK-based Archives Hub.


In the ArchiveGrid analysis, the <unitdate> field use is around 72% within the high-level (usually collection level) description. The Archives Hub does significantly better here, with an almost universal inclusion of dates at this level of description. Therefore, a date search is not likely to exclude any potentially relevant descriptions. This is important, as researchers are likely to want to restrict their searches by date. Our new system also allows sorting retrieved results by date. The only issue we have is where the dates are non-standard and cause the ordering to break down in some way. But we do have both displayed dates and normalised dates, to enable better machine processing of the data.
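As a rough sketch of why normalised dates matter for sorting, the Python below orders <unitdate> elements by the start year in their 'normal' attribute and puts any date without a usable normalised value at the end. This is an illustration only, not how the Hub system does it; the function names are hypothetical and the sketch assumes the 'normal' value begins with a four-digit year.

```python
import re
import xml.etree.ElementTree as ET

def start_year(unitdate):
    """Return the first year of the 'normal' attribute, or None if unusable."""
    normal = unitdate.get("normal", "")
    match = re.match(r"\d{4}", normal)
    return int(match.group(0)) if match else None

def sort_by_date(unitdates):
    """Sort by start year; dates with no usable normalised value go last."""
    def key(unitdate):
        year = start_year(unitdate)
        return (year is None, year or 0)
    return sorted(unitdates, key=key)
```

The displayed date ('1950s', 'early 20th century') is left untouched for the researcher to read, while the normalised value does the work of ordering.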

Collection Title

“for sorting and browsing…utility depends on the content of the element.”

Titles are always provided, but they are very varied. Setting aside lower-level descriptions, which are particularly problematic, titles may be more or less informative. We may introduce sorting by title, but the utility of this will be limited. It is unlikely that titles will ever be controlled to the extent that they have a level of consistency, but it would be fascinating to analyse titles within the context of the ways people search on the Web, and see if we can gauge the value of different approaches to creating titles. In other words, what is the best type of title in terms of attracting researchers’ attention, search engine optimisation, display within search engine results, etc?

Lower-level descriptions tend to have titles such as ‘Accounts’, ‘Diary’ or something more difficult to understand out of context such as ‘Pigs and boars’ or ‘The Moon Dragon’. It is clearly vital to maintain the relationship of these lower-level descriptions to their parent level entries, otherwise they often become largely meaningless. But this should be perfectly possible when working on the Web.

It is important to ensure that a researcher finding a lower-level description through a general search engine gets a meaningful result.

Archives Hub search result from a Google search
A search result within Google




The above result is from a search for ‘garrick theatre archives joanna lumley’ – the sort of search a researcher might carry out. Whilst the link is directly to a lower-level entry for a play at the Garrick Theatre, the heading is for the archive collection. This entry is still not ideal, as the lower-level heading should be present as well. But it gives a reasonable sense of what the researcher will get if they click on this link. It includes the <unitid> from the parent entry and the URL for the lower-level, with the first part of the <scopecontent> for the entry. It also includes the Archives Hub tag line, which could be considered superfluous to a search for Garrick Theatre archives! However, it does help to embed the idea of a service in the mind of the researcher – something they can use for their research.


“It would be useful to be able to sort by size of collection, however, this would require some level of confidence that the <extent> tag is both widely used and that the content of the tag would lends itself to sorting.”

This was an idea we had when working on our Linked Data output. We wanted to think about visualizations that would help researchers get a sense of the collections that are out there, where they are, how relevant they are, and so on. In theory the ‘extent’ could help with a weighting system, where we could think about a map-based visualization showing concentrations of archives about a person or subject. We could also potentially order results by size – from the largest archive to the smallest archive that matches a researcher’s search term. However, archivists do not have any kind of controlled vocabulary for ‘extent’. So, within the Archives Hub this field can contain anything from numbers of boxes and folders to length in linear metres, dimensions in cubic metres and items in terms of numbers of photographs, pamphlets and other formats. ISAD(G) doesn’t really help with this; the examples it gives simply serve to show how varied the description of extent can be.
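To make the problem concrete, here is a small Python sketch that pulls quantity and unit pairs out of free-text extent statements. The parse_extent function and the sample strings are invented for illustration. Even when the numbers can be extracted, the units (boxes, folders, linear metres, photographs) cannot be compared with each other, which is why sorting by size is so hard.

```python
import re

def parse_extent(text):
    """Split an extent statement into (quantity, unit) pairs where possible."""
    pairs = []
    # Treat commas, semicolons and the word 'and' as separators.
    for part in re.split(r",|;|\band\b", text):
        match = re.search(r"(\d+(?:\.\d+)?)\s+(.+?)\s*$", part)
        if match:
            pairs.append((float(match.group(1)), match.group(2).lower()))
    return pairs

# Invented examples of the variety found in extent fields.
SAMPLES = [
    "12 boxes and 4 folders",
    "3.5 linear metres",
    "Approximately 200 photographs",
]
```

Running parse_extent over the samples gives four pairs with four different units, none of which tells us which collection is actually the largest.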


“Other examples of desired functionality include providing a means in the interface to limit a search to include only items that are in a certain genre (for example, photographs)”.

This is something that could potentially be useful to researchers, but archivists don’t tend to provide the necessary data. We would need descriptions to include the genre, using controlled vocabulary. If we had this we could potentially enable researchers to select types of materials they are interested in, or simply include a flag to show, e.g. where a collection includes photographs.

The problem with introducing a genre search is that you run the risk of excluding key descriptions, because the search will only include results where the description includes that data in the appropriate location. If the word ‘photograph’ is in the general description only then a specific genre search won’t find it. This means a large collection of photographs may be excluded from a search for photographs.
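The exclusion problem can be shown with a tiny Python sketch. The two descriptions below are invented, and the field names simply echo the EAD tags <genreform> and <scopecontent>; this is not the Hub's search code. A search restricted to the genre field misses the second collection, even though it consists largely of photographs.

```python
# Invented sample descriptions: only the first has a controlled genre term.
DESCRIPTIONS = [
    {
        "title": "Family papers",
        "genreform": ["Photographs"],
        "scopecontent": "Correspondence, diaries and photographs.",
    },
    {
        "title": "Studio archive",
        "genreform": [],
        "scopecontent": "A large collection of photographs from a commercial studio.",
    },
]

def genre_search(descriptions, genre):
    """Match only where the genre has been entered as an access point."""
    wanted = genre.lower()
    return [
        d["title"] for d in descriptions
        if wanted in (g.lower() for g in d["genreform"])
    ]

def keyword_search(descriptions, word):
    """Match the word anywhere in the general description."""
    wanted = word.lower()
    return [d["title"] for d in descriptions if wanted in d["scopecontent"].lower()]
```

Here genre_search finds only ‘Family papers’, while keyword_search finds both collections.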


In the Bron/Proffitt/Washburn article <controlaccess> is present around 72% of the time. I was surprised that they did not choose to analyse tags within <controlaccess> as I think these ‘access points’ can play a very important role in archival description. They use the presence of <controlaccess> as an indication of the presence of subjects, and make the point that “given differences in library and archival practices, we would expect control of form and genre terms to be relatively high, and control of names and subjects to be relatively low.”

On the Archives Hub, use of subjects is relatively high (as well as personal and corporate names) and use of form and genre is very low. However, it is true to say that we have strongly encouraged adding subject terms, and archivists don’t generally see this as integral to cataloguing (although some certainly do!), so we like to think that we are partly responsible for such a high use of subject terms.

Subject terms are needed because they (1) help to pull out significant subjects, often from collections that are very diverse, (2) enable identification of words such as ‘church’ and ‘carpenter’ (i.e. they are subjects, not surnames), (3) allow researchers to continue searching across the Archives Hub by subject (subjects are all linked to the browse list) and therefore pull collections together by theme, and (4) enable advanced searching (which is substantially used on the Hub).

Names (personal and corporate)

In Bron/Proffitt/Washburn the <origination> tag is present 87% of the time. The analysis did not include the use of <persname> and <corpname> within <origination> to identify the type of originator. In the Archives Hub the originator is a required field, and is present 99%+ of the time. However, we made what I think is a mistake in not providing for the addition of personal or corporate name identification within <origination> via our EAD Editor (for creating descriptions) or by simply recommending it as best practice. This means that most of our originators cannot be distinguished as people or corporate bodies. In addition, we have a number where several names are within one <origination> tag and where terms such as ‘and others’, ‘unknown’ or ‘various’ are used. This type of practice is disadvantageous to machine processing. We are looking to rectify it now, but addressing something like this in retrospect is never easy to do. The ideal is that all names within origination are separately entered and identified as people or organisations.
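A minimal Python sketch of the kind of check that could flag these originators for attention. It is an illustration, not an existing Hub script: the function name, the wording of the messages and the list of placeholder terms (taken from the examples above) are assumptions, and the placeholder test is a simple substring match.

```python
import xml.etree.ElementTree as ET

# EAD name elements that identify the type of originator.
NAME_TAGS = {"persname", "corpname", "famname"}
# Terms mentioned above that make an originator hard to process.
PLACEHOLDERS = ("and others", "unknown", "various")

def origination_issues(origination):
    """Return a list of problems found in one <origination> element."""
    issues = []
    if not any(el.tag in NAME_TAGS for el in origination.iter()):
        issues.append("name not identified as person, family or corporate body")
    text = "".join(origination.itertext()).lower()
    for term in PLACEHOLDERS:
        if term in text:
            issues.append("placeholder term: " + term)
    return issues
```

An originator entered as plain text, such as ‘Smith, John and others’, would be flagged twice; one wrapped in <persname> with a single name would pass.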

We do also have names within <controlaccess>, and this brings the same advantages as for subjects: the names are properly structured, and can be used for searching and for bringing together archives relating to any one individual or organisation.


“Use of this element falls into the promising complete category (99.46%: see Table 7). However, a variety of practice is in play, with the name of the repository being embellished with <subarea> and <address> tags nested within <repository>.”

On the Archives Hub repository is mandatory, but as yet we do not have a checking system whereby a description is rejected if it does not contain this field. We are working towards something like this, using scripts to check for key information to help ensure validity and consistency at least to a minimum standard. On one occasion we did take in a substantial number of descriptions from a repository that omitted the name of the repository, which is not very useful for an aggregation service! However, one thing about <repository> is that it is easy to add because it is always the same entry. Or at least it should be… we did recently discover that a number of repositories had entered their name in various ways over the years and this is something we needed to correct.
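A check of this sort might look something like the Python sketch below. It is only an illustration: the function name is hypothetical, and the list of required fields is an assumption made for the example (apart from <repository>, which is mandatory as described above). It reports any required element that is missing from the top-level <did>, or present but empty.

```python
import xml.etree.ElementTree as ET

# Assumed list for illustration; only <repository> is stated as mandatory above.
REQUIRED = ("unitid", "unittitle", "unitdate", "repository")

def missing_fields(xml_text, required=REQUIRED):
    """Return the required <did> elements that are absent or empty."""
    root = ET.fromstring(xml_text)
    did = root.find(".//did")
    if did is None:
        return list(required)
    missing = []
    for tag in required:
        element = did.find(tag)
        if element is None or not "".join(element.itertext()).strip():
            missing.append(tag)
    return missing
```

A description could then be held back, or flagged to the contributor, whenever the returned list is not empty.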

Scope and content, biographical history and abstract

It is notable that in the US <abstract> is widely used, whereas we don’t use it at all. It is intended as a very brief summary, whereas <scopecontent> can be of any length.

“For search, its worth noting that the semantics of these elements are different, and may result in unexpected and false “relevance””

One of the advantages of including <controlaccess> terms is that they mitigate this kind of false relevance, as restricted field searching makes it possible to search for ‘mason’ as a person or ‘mason’ as a subject.

The Bron/Proffitt/Washburn analysis shows <bioghist> used 70% of the time. This is lower than the Archives Hub, where it is rare for this field not to be included. Archivists seem to have a natural inclination to provide a reasonably detailed biographical history, especially for a large collection focussed on one individual or organisation.

Digital Archival Objects

It is a shame that the analysis did not include instances of <dao>, but it is likely to be fairly low (in line with previous analysis by Wisser and Dean, which puts it lower than 10%). The Archives Hub currently includes around 1,200 instances of images or links to digital content. But what would be interesting is to see how this is growing over time and whether the trajectory indicates that in 5 years or so we will be able to provide researchers with routes into much of the Archives Hub content. However, it is worth bearing in mind that many archives are not digitised and are not likely to be digitised, so it is important for us not to raise expectations that links to digital content will become a matter of course.

The Future of Discovery

“In order to make EAD-encoded finding aids more well suited for use in discovery systems, the population of key elements will need to be moved closer to high or (ideally) complete.”

This is undoubtedly true, but I wonder whether the priority over and above completeness is consistency and controlled vocabulary where appropriate. There is an argument in favour of a shorter description that may exclude certain information about a collection, but is well structured and easier to machine process. (Of course, having both completeness and consistency is the ideal!)

The article highlights geo-location as something that is emerging within discovery services. The Archives Hub is planning on promoting this as an option once we move to the revised EAD schema (which will allow for this to be included), but it is a question of whether archivists choose to include geographical co-ordinates in their catalogues. We may need to find ways to make this as easy as possible and to show the potential benefits of doing so.

In terms of the future, we need a different perspective on what EAD can and should be:

“In the early days of EAD the focus was largely on moving finding aids from typescript to SGML and XML. Even with much attention given over to the development of institutional and consortial best practice guidelines and requirements, much work was done by brute force and often with little attention given to (or funds allocated for) making the data fit to the purpose of discovery.”

However, I would argue that one of the problems is that archivists sometimes still think in terms of typescript finding aids; of a printed finding aid that is available within the search room, and then made available online, as if they are essentially the same thing and we can use the same approach with both. I think more needs to be done to promote, explain and discuss ‘next generation finding aids’. By working with Linked Data, I have gained a very different perspective on what is possible, challenging the traditional approach to hierarchical finding aids.

Maybe we need some ‘next generation discovery’ workshops and discussions – but in order to really broaden our horizons we will need to take heed of what is going on outside of our own domain. We can no longer consider archival practice in isolation from discovery in the most general sense because the complexity and scale of online discovery requires us to learn from others with expertise and understanding of digital technologies.








HubbuB: October 2011

Europeana and APENet

I have just come back from the Europeana Tech conference, a 2 day event on various aspects of Europeana’s work and on related topics to do with data. The big theme was ‘open, open, open’, as well, of course, as the benefits of a European portal for cultural heritage. I was interested to hear about Europeana’s Linked Data output, but my understanding is that at present, we cannot effectively link to their data, because they don’t provide URIs for concepts. In other words, identifiers for names, which would let us say, for example, that our ‘George Bernard Shaw’ is the same as ‘George Bernard Shaw’ represented on Europeana.

I am starting to think about the Hub being part of APENet and Europeana. APENet is the archival aggregator for Europe. I have been in touch with them about the possibility of contributing our data, and if the Hub was to contribute, we could probably start from next year. Europeana only provide metadata for digital content, so we could only supply descriptions where the user can link to the digital content, but this may well be worth doing, as a means to promote the collections of any Hub contributors who do link to digital materials.

If you are a contributor, or potential contributor, we would like to know what you think: we have a quick question for you. It simply asks if you think it’s a good idea to be part of these European initiatives. We’d love to get your views, and you only have to leave your name and a comment if you want to.

Flickr: an easy way to provide images online

You will be aware that contributors can now add images to descriptions and links to digital content of all kinds. The idea is that the digital content then forms an integral whole with the metadata, and it is also interoperable with other systems.

I’ve just seen an announcement by the University of Northampton, who have recently added materials to Flickr. I know that many contributors struggle to get server space to put their digital content online, so this is one possible option, and of course it does reach a huge number of people this way. There may be risks associated with the persistence of the URIs for the images, but then that is the case wherever you put them.

On the Hub we now have a number of images and links to content.

Ideally, contributors would supply digital content at item level, so the metadata is directly about the image/digital content, but it is fine to provide it at any level that is appropriate. The EAD Editor makes adding links easy. If you aren’t sure what to do, please do email us.

Preferred Citation

We never had the field for the preferred citation in our old template for the creation of EAD, and it has not been in the EAD Editor up till now. We were prompted to think about this after seeing the results of a survey on the use of EAD fields presented at the Society of American Archivists conference. Around 80% of archive institutions do use it. We think it’s important to advise people how to cite the archive, so we are planning to provide this in the Editor and may be able to carry out global edits to add this to contributors’ data.

List of Contributors

Our list of contributors within the main search page has now been revised, and we hope it looks substantially more sensible, and that it is better for researchers. This process really reminded us how hard it is to come up with one order for institutions that works for everyone!  We are currently working on a regional search, something that will act as an alternative way to limit searching. We hope to introduce this next year.

And finally…A very engaging Linked Data interface

This interface demonstration by Tim Sherratt shows how something driven by Linked Data can really be very effective. It also uses some of the Archives Hub vocabulary from our own Linked Data work, which is a nice indication of how people have taken notice of what we have been doing. There is a great blog post about it by Pete Johnston, Storytelling, archives and Linked Data. I agree with Pete that this sort of work is so exciting, and really shows the potential of the Linked Data Web for enabling individual and collective storytelling…something we, as archivists, really must be a part of.

Archives Wales

I recently attended the ‘Online Development in Wales’ day organised by ARCW (Archives and Records Council Wales) to talk about the Porth Archifau (Archives Hub). I found out a good deal about what is happening in Wales at the moment and heard about plans and wishes for future developments.

In her introduction, Charlotte Hodgson from ARCW talked about the need for online catalogues with images rather than the other way around. Maybe there is too much emphasis on digitisation of images which become separated from their context. She referred to the good work of Archives Network Wales (ANW), but acknowledged that Wales is in danger of falling behind with online catalogues. There is a need to maximise opportunities, minimise duplication and effectively deploy resources.

Kim Collis from ARCW gave some background on ANW (now Archives Wales), which is a searchable database for collection-level descriptions that uses a MySQL database and a Typo3 front-end. It has stayed relatively static since it was first developed; the emphasis of individual offices may have moved to their own web presence (many were using CALM and there was something of a race to get their catalogues online). The front-end of the ANW site has not necessarily always been very user-friendly and has not provided the depth of information that it might do. However, it was developed in a standards-based way, and this stands it in good stead for future development. ‘Archives Wales’ was a bolt-on to the database, giving more information and including additional information about repositories, making a more complete and visually appealing site.

There has been some geo-tagging within ANW recently. This was seen as a good way to link in with People’s Collection Wales, enabling users to find out more information about, for example, a family that has owned an estate. Kim talked about a number of possible developments, such as a project to provide links to searchable tithe apportionment transcripts. The idea is to allow volunteers to transcribe the images.

Kim talked about the need to improve branding and identity. The site must be kept up to date to give it credibility. But there is, in a sense, competition with repository websites because many repositories want to prioritise these. I think it is worth impressing upon archivists the importance of cross-searching capability that aggregators provide, as well as the value of searching within a repository. We should not presuppose that researchers primarily want to know what is at just one individual office; they usually want to find ‘stuff’ on their topic of interest and then go down to the more detailed level of individual sources of information.

Sam Velumyl from The National Archives talked about the Discovery initiative at TNA, which provides a new information architecture that will accommodate the different systems that TNA has.   The idea is that it can accommodate the integration of other systems easily, making it a more sustainable and flexible solution. They are going to be carrying out an exercise in gathering feedback on Discovery, and you’re likely to hear about that very soon.  Sam said that the feedback will help TNA to decide upon their priorities. It may be that A2A will become active again, but at present this has not been decided.  There were concerns in the room that it is very difficult to get TNA to provide data back out of A2A.

People’s Collection Wales, which was presented to us by three speakers, is very much geared towards user-friendly and fun engagement in the history and culture of Wales. It works on the basis of everything being an item, and it gathers items together in collections by topic, not in the way that archivists would normally understand collections, but simply by areas that will be of interest to users. It is quite an eclectic experience, designed to draw in a broad section of the community and promote learning and understanding of Welsh history.  Re-purposing is a strong principle behind PCW. It integrates social media to encourage the idea of sharing the photograph or interview or whatever on Facebook or Twitter. It also has a scrapbook function so that people can gather together their own collections. It does link to the item within context, so you can link back to the website of the depositor.

PCW are going to be using an API to upload collection records  from Archives Wales. I got a little confused about this, as they also spoke about manual upload. I think the automated upload will only be for certain records.  They are also doing some interesting work with GIS, to enable users to do things like look at maps over time to see how a place has developed, and looking at making museum objects viewable in a 3-D way.

My plea to PCW is to make their titles clickable links where it seems as if they should be clickable. I found the site fun, with some great stuff, but it can take a while to understand what you are looking at. I went to browse the collections and many of them are untitled, and it’s not really clear what they are representing. I tried the map interface and looked for ‘castle’ near ‘barmouth’ and I was taken to a page of images of people talking about the Eisteddfod. The second time it worked better, but some of the images were not actually images and one of them remained in place when I did another search and I couldn’t delete it from the display, and I had a few more experiences of searches hanging and the display freezing. But then other searches worked well and I started getting links from places to objects. So, it was a mixed bag for me, and it seemed quite beta in terms of functionality, and also it was very slow, and I do think that’s a problem.  It feels very experimental, with loads of good ideas, but I wonder if it would be better to concentrate on developing fewer ideas but making them more effective.

The afternoon was more focussed on solutions for getting archives online. CyMAL recently commissioned research to analyse requirements for extending online access to archive catalogues in Wales, building on ARCW, and Sarah Horton gave us a summary of some of the findings. Some of the stats were quite interesting: 11 local authority services use CALM, 1 uses the Archivists’ Toolkit and 1 uses Word. In higher education: 3 CALM, 1 Word, 1 no formal catalogue. The National Library of Wales uses the virtual library system and AC-NMW uses AdLib. The survey found that the application of authority files and data standards was variable.

For online access: 3 via CALMView, but there are barriers to this for many offices, one being IT and their concerns about security. 4 services provide access via their own systems, 2 via PDF documents. About 8,000 collections are listed on Archives Wales and 2,000 on the Hub.

9 services have backlogs of between 10% and 30%, 6 of over 30%, and more if poor quality catalogues are taken into account. Many catalogues remain in manual form only.

We had a very interesting talk on the Black Country History website. Linda Ellis talked about how important it was for the project to be sustainable right from the outset. The project was about working together to reduce costs and create a sustainable online resource. The original website used the Axiell DSCovery software, but it was not fit for purpose. The redevelopment was by Orangeleaf Systems using their CollectionsBase system and WordPress, which means it is very easy to create different front-ends. There are a number of microsites, such as one for geology, filtered by keyword, a great idea for a way to target different audiences with minimal additional effort. Partners can upload data when they like via an XML export from CALM. CollectionsBase will also take Excel, Access and manual data entry. There is an API, so the data goes on to Culture Grid and Europeana.

Altogether a very stimulating day, with a good vibe and plenty of discussion.