Big Data, Small Data and Meaning

Victorian joke. From the Victorian Meme Machine, a BL Labs project (http://www.digitalvictorianist.com/)

BL Labs is an initiative funded by the Mellon Foundation that invites researchers and developers to work with the British Library and its digital data to address research questions. The 2014 Symposium showcased some of the innovative and exploratory projects funded through the initiative. This year’s competition winners are the Victorian Meme Machine, which is creating a database of Victorian jokes, and the Text to Image Linking Tool (TILT), which links areas on a page image to a clear transcription of the content.

Tim Hitchcock, Professor of Digital History at the University of Sussex, opened with a great keynote talk. He started out by stressing the role of libraries, archives and museums in preserving memory, and their central place in a complex ecology of knowledge discovery, dissemination and reflection. He felt it was essential to remember this when we get too caught up in pursuing shiny new ideas. It is important to continually rethink what it is to be an information professional, whilst also respecting the basic principles that a library (or archive, or museum) was created to serve.

Tim Hitchcock’s talk was Big Data, Small Data and Meaning. He said that conundrums of size mean there is a danger of a concentration on Big Data and a corresponding neglect of Small Data. But can we view and explore a world encompassing both the minuscule and the massive? Hitchcock introduced the concept of the macroscope, a term coined by Piers Anthony in his 1969 science fiction novel Macroscope. He used the term in his talk to consider the idea of a macro view of data. How has the principle of the macroscope influenced the digital humanities? Hitchcock referred to Katy Börner’s work on Plug-and-Play Macroscopes: “Macroscopes let us observe what is at once too great or too slow or too complex for the human eye and mind to notice and comprehend.” (See http://vimeo.com/33413091 for an introductory video).

Hitchcock felt that ideally macroscopes should allow us to observe patterns across large data while at the same time showing the detail within small data. The way that he talked about Big Data within the context of both the big and the small helped me to make more sense of Big Data methods. I think that within the archive community there has been something of a collective head-scratching around Big Data: what its significance is, and how it relates to what we do. In a way it helps to think of it alongside the analysis that Small Data allows researchers to undertake.

Graph from Paper Machines
Paper Machines visualisation (http://papermachines.org/)

Hitchcock gave some further examples of Big Data projects. Paper Machines is a plugin for Zotero that enables topic modelling analysis. It allows the user to curate a large collection of works and explore its characteristics with some great results; but the analysis does not really address detail.

The History Manifesto, by Jo Guldi and David Armitage talks about how Big Data might be used to redefine the role of Digital Humanities. But Hitchcock criticised it for dismissing micro-history as essentially irrelevant.

Scott Weingart is also a fan of the macroscope. He is a convincing advocate for network analysis, which he talks about in his blog, The modern role of DH in a data-driven world:

“distant reading occludes as much as it reveals, resulting in significant ethical breaches in our digital world. Network analysis and the humanities offers us a way out, a way to bridge personal stories with the big picture, and to bring a much-needed ethical eye to the modern world.”

Hitchcock posited that the large scale is often seen as a route to impact in policy formation, and this is an attractive inducement to think large. In working on a big data scale, Humanities can speak to power more convincingly; it can lead to a more powerful voice and more impact.

We were introduced to Ben Schmidt’s work, Prochronisms, which uses TV anachronisms to explore changes in language. Schmidt has done work around particular TV programmes and films, looking at both the overall use of language and the specifics of word use. One example of his work is the analysis of 12 Years a Slave:

visual representation of language in 12 Years a Slave
12 Years a Slave: Word Analysis (http://www.prochronism.com/)

‘the language Ridley introduces himself is full of dramatically modern words like “outcomes,” “cooperative,” and “internationally:” but that where he sticks to Northup’s own words, the film is giving us a good depiction of how things actually sounded. This is visible in the way that the orange ball is centered much higher than the blue one: higher translates to “more common then than now.”’

Schmidt gives very entertaining examples of anachronisms, for example, the use of ‘parenting a child’ in the TV drama series Downton Abbey, a phrase which only shows up in literature 5 times during the 1920s, and in a rather different context to our modern use. His close reading of context also throws up surprises, such as his analysis of the use of the word ‘stuff’ in Downton Abbey (as in ‘family stuff’ or ‘general stuff’), which does not appear to be anachronistic and yet viewers feel that it is a modern term. (A word of warning: the site is fascinating and it’s hard to stop reading it once you start!)
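Loosely, this kind of anachronism-spotting compares a word’s frequency in a script against its frequency in a period corpus versus a modern one. The sketch below is purely illustrative, with invented counts; it is not Schmidt’s actual method or data.

```python
from collections import Counter
import math

def anachronism_scores(script_tokens, period_counts, modern_counts):
    """Log-ratio of each word's modern frequency to its period frequency.

    Positive scores suggest a word is more common now than in the period
    (a potential anachronism); negative scores suggest the reverse.
    All counts here are illustrative stand-ins for real corpus data.
    """
    period_total = sum(period_counts.values())
    modern_total = sum(modern_counts.values())
    scores = {}
    for word in set(script_tokens):
        # Add-one smoothing so unseen words don't cause division by zero
        p = (period_counts.get(word, 0) + 1) / (period_total + 1)
        m = (modern_counts.get(word, 0) + 1) / (modern_total + 1)
        scores[word] = math.log(m / p)
    return scores

# Toy corpora: 'parenting' is far more common in the modern corpus
period = Counter({"family": 50, "stuff": 10, "parenting": 1})
modern = Counter({"family": 50, "stuff": 12, "parenting": 40})
scores = anachronism_scores(["family", "stuff", "parenting"], period, modern)
```

With these made-up figures, ‘parenting’ scores high (modern-skewed) while ‘family’ and ‘stuff’ do not, echoing the Downton Abbey examples above.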

Professor Hitchcock gave this work as an example of using a macroscope effectively to combine the large and the small. Schmidt reveals narrative arcs; maybe showing us something that hasn’t been revealed before…and at the same time creates anxiety amongst script writers with his stark analysis!

Viewing data on a series of scales simultaneously seems a positive development, even with the pitfalls. But are humanists privileging social science types of analysis over more traditional humanist ones? Working with Big Data can be hugely productive and fun, and it can encourage collaboration, but are humanist scholars losing touch with what they traditionally do best? Language and art, cultural construction and human experience are complex things. Scholars therefore need to encompass close reading and Small Data in their work in order to get a nuanced reading.  Our urge towards the all-inclusive is largely irresistible, but in this fascination we may lose the detail. The global image needs to be balanced with a view from the other end of the macroscope.

It is important to represent and mobilise the powerless rather than always thinking about the relationship to the powerful; to analyse the construct of power rather than being held in the grip of power and technology. Histories of small things are often what gives voice to those who are marginalised. Humanists should encompass the peculiar and eccentric; they should not ignore the power of the particular.

Graph showing evidence for the Higgs boson particle
Graph showing evidence for the Higgs particle (http://www.atlas.ch/news/2012/latest-results-from-higgs-search.html)

Of course, Big Data can have huge and fundamental results. The discovery of the Higgs particle was the result of massive data crunching and finding a small ‘bump’ in the data that gave evidence to support its existence. The other smaller data variations needed to be ignored in this scenario. It was a case of millions of rolls of the dice to discover the elusive particle. But if this approach is applied across the board, the assumption is that the signal, or the evidence, will come through, despite the extraneous blips and bumps. It doesn’t matter if you are using dirty data because small hiccups are just ignored.  But humanists need to read data with an eye to peculiarities and they should consider the value of digital tools that allow them to think small.

Hitchcock believes that to perform humanities effectively we need to contextualise. The importance of context is never lost on an archivist, as it is a cornerstone of our work. Big Data analysis can lose this context; Small Data is all about understanding context to derive meaning.

Using the example of voice onset timing, which refers to the tiny breathy gap before speaking, Hitchcock showed that a couple of milliseconds of empty space can demand close reading, because it actually changes depending on who you are talking to, and it reveals some really interesting findings. A Big Data approach would simply miss this fascinating detail.

Big Data has its advantages, but it can mean that you don’t look really closely at the data set itself. There is a danger that you present your results in a compelling graph or visualisation, but it is hard to see whether they reflect a flawed reality. You may understand the whole, and draw valuable conclusions, but you don’t take note of what a single line can tell you.

Exploring British Design: Research Paths

Introduction

As part of our Exploring British Design project we are organising workshops for researchers, aiming to understand more about their approaches to research, and their understanding and use of archives. Our intention is to create an interface that reflects user requirements and, potentially, explores ideas that we gather from our workshops.

Of course, we can only hope to engage with a very small selection of researchers in this way, but our first workshop at Brighton Design Archive showed us just how valuable this kind of face-to-face communication can be.

We gathered together a small group of 7 postgraduate design students and divided them into 4 groups: three pairs and one lone researcher. We asked them to undertake 2 exercises. This post is about the first exercise and the follow-up discussion. For this exercise, we presented each group with an event, person or building:

The Festival of Britain, 1951
Black Eyes and Lemonade Exhibition, Whitechapel Art Gallery, 1951
Natasha Kroll (1912-2004)
Simpsons of Piccadilly, London

We gave each group a large piece of paper, and simply asked them to discuss and chart their research paths around the subject they had been given. Each group was joined by a facilitator, who was not there to lead in any way, but just to clarify where necessary, listen to the students and make notes.

Case Study

Researchers charting their research paths for the Festival of Britain
Researchers charting their research paths for the Festival of Britain

I worked with two design students, Richard and Caroline, both postgraduate students researching aspects of design at The University of Brighton. They were looking at the subject of the Festival of Britain (FoB). It fascinated me that even when they were talking about how to represent their research paths, one instinctively went to list their methods, the other to draw theirs, in a more graphic kind of mind map. It was an immediate indication of how people think differently. They ended up using the listing method (see left).


diagram showing stages of research
Potential research paths for the Festival of Britain

The above represents the research paths of Richard and Caroline. It became clear early on that they would take somewhat different paths, although they went on to agree about many of the principles of research. Caroline immediately said that she would go to the University library first of all, and then probably the central library in Brighton. It is her habit to start with the library, mainly because she likes to think locally before casting the net wider; she prefers the physicality of the resources to the virtual environment of the Web. She likes the opportunity to browse, and to consider the critical theory written around the subject as a starting point. Caroline also prefers to go to a library or archive and take pictures of resources, so that she can work through them at her leisure, and she noted how high charges for the use of digital cameras can inhibit research.

Richard started with an online search. He thought about the sort of websites that he would gravitate towards – sites that were directly about the topic, such as an exhibition website. He referred to Wikipedia early on, but saw it as a potential starting place to find links to useful websites, through the external links that it includes, rather than using the content of Wikipedia articles.

Richard took a very visual approach. He focused in on the FoB logo (we used this as a representation of the Festival) and thought about researching that. He also talked about whether the FoB might have been an exhibition that showcased design, and liked the idea of an object-based approach, researching things such as furniture or domestic objects that might have been part of the exhibition. It was clear that his approach was based upon his own interests and background as a film maker. He focused on what interested and excited him; the more visual aspects including the concrete things that could be seen, rather than thinking in a text-based way.

Caroline had previous experience of working in an archive, and her approach reflected this, as well as a more text-based way of thinking. She talked about a preference for being in control of her research, so using familiar routes was preferable. She would email the Design Archives at Brighton, but that was not top of the list because it was more of an unknown quantity than the library that she was used to. Maybe because she has worked in an archive, she referred to using film archives for her research;  whereas Richard, although a film maker, did not think of this so readily. Past experience was clearly important here.

Both researchers saw the library as a place for serendipitous research. They agreed that this browsing approach was more effective in a library than online. They were clearly attracted to the idea of searching the library shelves and discovering sources that they had not known about. I asked why they felt that this was more effective than an online exploration of resources. It seemed to be partly to do with the physicality of the environment, and also because they felt that the choice of search term online has a substantial effect on what is, and isn’t, found.

Both researchers were also very focused on issues of trust; both were very much of the opinion that they would assess their sources in terms of provenance and authorship.

In addition, they liked the idea of being able to search by user-generated tags and to have the ability to add tags to content.

General Discussion

In the general discussion some of the points made in the case study were reinforced. In summary:

Participants found the exercise easy to do. It was not hard to think about how they would research the topics they were given. They found it interesting to reflect on their research paths and to share this with others.

For one other participant the library was the first port of call, but the majority started online.

Some took a more historical approach, others a much more narrative and story-based approach.  There were different emphases, which seemed to be borne out of personality, experiences and preferences. For example, some thought more about the ordering of the evidence, others thought more about what was visually stimulating.
It was therefore clear that different researchers took different approaches based on what they were drawn to, which usually reflected their interests and strengths.

There was a strong feeling about trust being vital when assessing sources. Knowing the provenance of an article or piece of writing was essential.

The participants agreed that putting time and effort into gathering evidence is part of the enjoyment of research. One mentioned the idea that ‘a bit of pain’ makes the end result all the more rewarding! They were taken aback at the idea that discovery services feel pressured to constantly simplify in order to meet researchers’ needs. They understood that research is a skill and a process that takes time and effort (although, of course, this may not be how the majority of undergraduates or more inexperienced researchers feel). Certainly they agreed that information must not be withheld; it must be accessible. We (service providers) need to provide signposts, to allow researchers to take their own paths. There was discussion about ‘sleuthing’ as part of the research process, and trying unorthodox routes, as chance discoveries may be made. But there was consensus that researchers do not need or wish to be nannied!

All researchers did use Google at some point, usually to start their search. Funnily enough, some participants had quite long discussions about what they would do before they realised they would actually have gone to Google first of all. It is so common now that most people don’t think about it. It seemed to operate very much as a starting point, from where the researchers would go to sites, assess their worth and ensure that the information was trustworthy.

[There will be follow up posts to this, providing more information about our researcher workshops, summarising the second activity, which was more focused on archive sources, and continuing to document our Exploring British Design project.]


Exploring British Design: New Routes through Content

At the moment, the Archives Hub takes a largely traditional approach to the navigation and display of archive collections. The approach is predicated on hundreds of years of archival theory, expanded upon in numerous books, articles, conferences and standards. It is built upon “respect des fonds” and original order. Archival provenance tells us that it is essential to provide the context of a single item within the whole archive collection; this is required in order to  understand and interpret said item.

ISAD(G) reinforces the ‘top down’ approach. The hierarchy of an archive collection is usually visualised as a tree structure, often using folders. The connections show a top-down or bottom-up approach, linking each parent to its child(ren).

image of hierarchical folders
A folder structure is often used to represent archival hierarchy

This principle of archival hierarchy makes very good sense. The importance of this sort of context is clear: one individual letter, one photograph, one drawing, can only reveal so much on its own. But being able to see that it forms part of a series, and part of a larger collection, gives it a fuller story.

However, I wonder if our strong focus on this type of context has meant that archivists have sometimes forgotten that there are other types of context, other routes through content. With the digital environment that we now have, and the tools at our disposal, we can broaden out our ambitions with regards to how to display and navigate through archives, and how we think of them alongside other sources of information. This is not an ‘either or’ scenario; we can maintain the archival context whilst enabling other ways to explore, via other interfaces and applications. This is the beauty of machine processable data – the data remains unchanged, but there can be numerous interfaces to the data, for different audiences and different purposes.

Providing different routes into archives, showing different contexts, and enabling researchers to create their own narratives, can potentially be achieved through a focus on the ‘real things’ within an archive description; the people, organisations and places, and also the events surrounding them.

image of entities and links
Very simplified model of entities within archive descriptions and links between them

This is a very simplified image, intended to convey the idea of extracting people, organisations and places from the data within archive descriptions (at all levels of description). Ideally, these entities and connections can be brought together within events, which can be built upon the principle of relationships between entities (i.e. a person was at a place at a particular time).

Exploring British Design is a project seeking to probe this kind of approach. By treating these entities as an important part of the ‘networks of things’, and by finding connections between the entities, we give researchers new routes through the content and the potential to tell new stories and make new discoveries. The idea is to explore ways to help us become more fully a part of the Web, to ensure that archives are not resources in isolation, but a part of the story.

A diagram showing archives and other entities connected
An example of connected entities


For this project, we are focussing on a small selection of data, around British design, extracting entities from the Archives Hub data, and considering how the content within the descriptions can be opened up to help us put it into new contexts.

We are creating biographical records that can be used to include structured data around relationships, places and events.  We aim to extract people from the archive descriptions in which they are ‘embedded’ so that we can treat them as entities – they can connect not only to archive collections they created or are associated with, but they can also connect to other people, to organisations, to events, to places and subjects. For example, Joseph Emberton designed Simpsons in Piccadilly, London, in 1936. There, we have the person, the building, the location and the time.
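The Joseph Emberton example can be modelled as a small network of typed entities linked through dated events. The sketch below is one possible shape for such structured data; the identifiers and field names are hypothetical illustrations, not the project’s actual schema.

```python
# Minimal entity-and-event model; all names and fields are hypothetical
# illustrations, not the Exploring British Design data model.
entities = {
    "person/emberton": {"type": "person", "name": "Joseph Emberton"},
    "building/simpsons": {"type": "building", "name": "Simpsons of Piccadilly"},
    "place/london": {"type": "place", "name": "London"},
}

# An event ties entities together with a date, so that
# "person X designed building Y at place Z in year T" is one record.
events = [
    {
        "type": "designed",
        "agent": "person/emberton",
        "object": "building/simpsons",
        "place": "place/london",
        "date": "1936",
    }
]

def connected_to(entity_id, events):
    """Return the ids of entities linked to entity_id through any event."""
    linked = set()
    for ev in events:
        ids = {v for k, v in ev.items() if k in ("agent", "object", "place")}
        if entity_id in ids:
            linked |= ids - {entity_id}
    return linked
```

The point of the structure is that any entity can be a starting point: asking what connects to the building yields the designer and the place, just as asking about the person yields the building.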

With this paradigm, the archive becomes one of the ‘nodes’ of the network, with the other entities equally to the fore, and the ability to connect them together shows how we can start to make connections between different archive collections. The idea is that a researcher could come into an archive from any type of starting point. The above diagram (created just as an example) includes everything from ‘1970s TV comedy’ through to the use of Portland stone, and it links the Brighton Design Archive, the V&A Theatre and Performance Archive and the University of the Arts London Archive. The long term aim is that our endeavours to open up our data will ensure that it can be connected to other data sources (that have also been made open); sources outside of our own sphere (the Archives Hub data). The traditional interface has its merits; certainly we need to continue to provide archival context and navigation through collections; but we can be more imaginative in how we think about displaying content. We don’t need to have just one interface onto our data. We need to ensure that archives are part of the bigger story, that they can be seen in all sorts of contexts, and that they are not relegated to being a bit part, isolated from everything else.


WikiLinks

WikiLinks – Guest Blog by Andy Young

Between March and June 2014 I conducted a piece of social media-oriented research on behalf of the Archives Hub, the primary purpose of which was to measure the impact of adding links from specific Wikipedia articles featuring Hub content on the traffic that comes into the Hub website. As well as providing the Hub administrators – and, indeed, the profession as a whole – with a gauge as to whether the amount of time invested in creating links is worthwhile when compared to the benefits of impact, this research benefitted me personally in that it allowed me the opportunity to potentially earn credits on the Archives & Records Association’s Registration Scheme, under the ‘Contributions to the profession’ category.

The first phase of the study involved me identifying twenty archival collections listed in the Hub, with no existing links from related Wikipedia pages, which I could treat as measurable research subjects. This was done simply by entering specific Hub collection level descriptions into the Wikipedia search engine. (If a link to the Hub had already been created, I eliminated that particular collection from the study.) In order to achieve a fair and balanced piece of research, I selected collections of a relatively similar size and status, and avoided those relating to any significant public events running concurrent to, or immediately prior to, the commencement of the research, e.g. the local elections in England and the World Cup. My feeling was that such collections could have been subject to closer scrutiny from researchers while the study was underway, which, in turn, would have resulted in an unexpected increase in Hub-searching activity. This, in essence, would have undermined the credibility of the study. I also made sure that the Wikipedia pages I utilised didn’t already include links to the collection-holding repositories, as this could potentially sway researchers away from clicking the newly-created links to the Hub descriptions, thereby affecting the accuracy of the research.

The twenty collections selected, along with their corresponding Wikipedia links, are shown in the table below.

table showing list of Hub collections with wikipedia links
List of Collections used in the study with the Wikipedia URLs

Once the Hub collections and related Wikipedia pages had been identified, I then added new links to the individual pages using Wikipedia’s built-in editing tool. In the interests of consistency, I embedded each new link in the ‘External Links’ section on each of the pages I modified. I then used Google Analytics, in conjunction with an Excel spreadsheet, to collate and record Hub traffic data for each individual collection for the twelve-week period prior to the start of the study, specifically from the 22nd December, 2013 to the 15th March, 2014. This was done in order to enable me to generate a measurement of the overall impact of the newly-created links on incoming Hub traffic. The cumulative results for each collection, for the twelve-week period prior to the commencement of the study, are shown below.

table showing page views for collections prior to wikipedia links
Page views for collections in a 12 week period prior to the creation of the Wikipedia links

Over the course of the next twelve weeks, from the 17th March, 2014 to the 7th June, 2014, I used Google Analytics once again to monitor incoming Hub traffic, with a reading being taken at the end of every fourth week in order to identify any significant traffic fluctuations or changes. The four-week hit statistics for each of the twenty collections are shown in the table below.

table showing hits for hub collections when the links were on wikipedia
Hits for Hub collections during the Wikipedia study

At the end of the twelve-week research period it was evident from the accumulated data that fourteen of the twenty collections had each experienced an increase in traffic compared to the previous twelve-week period. Indeed, of the fourteen, two collections, namely the Ramsay MacDonald Papers and the London South Bank University Archives, had each received well in excess of 100 additional hits compared to the pre-link period. Of the remaining six collections, only the Sadler’s Wells Theatre Archive had decreased in hits significantly, down 109 from the previous period. Although it isn’t possible to say definitively why this decrease occurred, it may have been due to the fact that at some point during the research, a new link had been added to the Sadler’s Wells Theatre Archive Wikipedia page giving researchers the option to examine ‘Archival material relating to Sadler’s Wells Theatre listed at the UK National Archives.’ Taking this modification into account, it seems fair to suggest that any researchers interested in the Sadler’s Wells Theatre material may have been drawn to this link description rather than the newly-added link to the Hub description essentially because it makes mention of the country’s principal archival repository, TNA.

The cumulative number of hits for each of the twenty collections during the research period are presented in the table below. This table also shows the positive and negative numerical differences in hits for each of the collections compared to the twelve-week period prior to the start of the research.
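The comparison in these tables amounts to a per-collection difference in page views across two twelve-week windows. A minimal sketch of that calculation, using invented figures chosen only to echo two results mentioned above (the study’s real data is in the tables):

```python
# Hypothetical page-view counts; the study's actual figures are in the
# tables above, these are invented for illustration only.
pre_links = {
    "Ramsay MacDonald Papers": 150,
    "Sadler's Wells Theatre Archive": 300,
}
post_links = {
    "Ramsay MacDonald Papers": 270,
    "Sadler's Wells Theatre Archive": 191,
}

def hit_differences(pre, post):
    """Per-collection change in views; positive means more traffic
    after the Wikipedia links were added."""
    return {name: post[name] - pre[name] for name in pre}

diffs = hit_differences(pre_links, post_links)
increased = [name for name, d in diffs.items() if d > 0]
```

With these stand-in numbers the first collection gains over 100 hits and the second loses 109, mirroring the pattern the study reports.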

table showing cumulative hits for collections with positive and negative changes shown
Cumulative hits for collections with positive and negative differences shown

Conclusion

This piece of research has demonstrated that the simple task of linking online archival descriptions to a popular social media reference tool such as Wikipedia can yield extremely positive results. It has shown, moreover, that there are clear benefits, both for the archival repository/aggregator and the individual researcher, when catalogue data is linked and shared. Not only that, it has proven that a successful outcome can be achieved in a relatively short space of time, and, truth be told, with only a small amount of physical effort. The process of checking whether links from specific Hub collections already existed in Wikipedia and then adding them to the website if they didn’t, took little more than three hours to complete, and, for the most part, basically involved me copying data from one website and pasting it onto another. Ultimately, the sheer simplicity of this exercise, coupled with the knowledge that interest in the vast majority of the Hub collections increased as a result of the Wikipedia editing, confirms, to my mind at least, that archive services the world over – especially those blessed with a healthy number of volunteers – would benefit from embarking on linked data projects of this nature. After all, it’s like Benjamin Franklin said, “An investment in knowledge always pays the best interest.”

Micro sites: local interfaces for Archives Hub contributors

Background

Back in 2008 the Archives Hub embarked upon a project to become distributed; the aim was to give control of their data to the individual contributors. Every contributor could host their own data by installing and running a ‘mini Hub’. This would give them an administrative interface to manage their descriptions and a web interface for searching.

Five years later we had six distributed ‘spokes’ for six contributors. This was down from eight, the highest number of institutions, out of around 180 contributors at the time, that took up the invitation to hold their own data.

The primary reason for the lack of success was identified as a lack of the technical knowledge and skills required for setting up and maintaining the software. In addition, many institutions are not willing to install unknown software or maintain an unfamiliar operating system. Of course, many Hub contributors already had a management system, and so may not have wanted to run a second system; but a significant number did not (and still don’t) have their own system. Part of the reason many institutions want an out-of-the-box solution is that they do not have consistent or effective IT support, so they need something that is intuitive to use.

The spokes institutions ended up requiring a great deal of support from the central Hub team; and at the same time they found that running their spoke took a good deal of their own time. In the end, setting up a server with an operating system and bespoke software (Cheshire in this case) is not a trivial thing, even with step-by-step instructions, because there are many variables and external factors that impact on the process. We realised that running the spokes effectively would probably require a full-time member of the Hub team in support, which was not really feasible, but even then it was doubtful whether the spokes institutions could find the IT support they required on an ongoing basis, as they needed a secure server and they needed to upgrade the software periodically.

Another big issue with the distributed model was that the central Hub team could no longer work on the Hub data in its entirety, because the spoke institutions had the master copy of their own data. We are increasingly keen to work cross-platform, using the data in different applications. This requires the data to be consistent, and therefore we wanted to have a central store of data so that we could work on standardising the descriptions.

The Hub team spend a substantial amount of time processing the data, in order to be able to work with it more effectively. For example, a very substantial (and continuing) amount of work has been done to create persistent URIs for all levels of description (i.e. series, item, etc.). This requires rigorous consistency and no duplicated references. When we started to work on this we found that we had hundreds of duplicate references, due both to human error and to issues with our workflow (which in some cases meant we had loaded a revised description along with the original description). Also, because we use archival references in our URIs, we were somewhat nonplussed to discover that there was an issue with duplicates arising from references such as GB 234 5AB and GB 2345 AB. We therefore had to change our URI pattern, which led to substantial additional work (we used a hyphen to create gb234-5ab and gb2345-ab).
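By way of illustration only (this is not the Hub’s production code, and the function name is ours), the hyphenated pattern can be sketched like this:

```python
def make_uri_slug(country_code, repo_code, local_ref):
    """Build a persistent URI slug from an archival reference.

    Joining the repository code and local reference with a hyphen keeps
    references such as 'GB 234 5AB' and 'GB 2345 AB' distinct; simply
    stripping spaces would collapse both to 'gb2345ab'.
    """
    # Normalise case and remove internal spaces so each part is one token
    repo = repo_code.replace(" ", "").lower()
    ref = local_ref.replace(" ", "").lower()
    return f"{country_code.lower()}{repo}-{ref}"

# The two references that collided under the old, hyphen-less pattern:
print(make_uri_slug("gb", "234", "5AB"))   # gb234-5ab
print(make_uri_slug("gb", "2345", "AB"))   # gb2345-ab
```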

We also carry out more minor data corrections, such as correcting character encoding (usually an issue with characters such as accented letters) and creating normalised dates (machine processable dates).
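A very simplified sketch of the date normalisation idea (the Hub’s actual rules are more involved and handle far more variants than this):

```python
import re

def normalise_date(display_date):
    """Turn a human-readable date into a machine-processable form.

    'c. 1890-1900' -> '1890/1900'; '1923' -> '1923'.
    Returns None when no four-digit year can be found.
    """
    years = re.findall(r"\b(\d{4})\b", display_date)
    if not years:
        return None  # e.g. 'n.d.' (no date)
    if len(years) == 1:
        return years[0]
    return f"{years[0]}/{years[-1]}"

print(normalise_date("c. 1890-1900"))  # 1890/1900
print(normalise_date("n.d."))          # None
```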

In addition to these types of corrections, we run validation checks and correct anything that is not valid according to the EAD schema, and we are planning, longer term, to set up a workflow such that we can implement some enhancement routines, such as adding a ‘personal name’ or ‘corporate name’ identifying tag to our creator names.
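Full validation means checking against the EAD schema with a schema-aware tool; as a stdlib-only sketch of the shape of such a pipeline (the rule shown is just one example, not the Hub’s actual check list):

```python
import xml.etree.ElementTree as ET

def check_ead_fragment(xml_text):
    """Well-formedness check plus one example business rule.

    Real validation would run the fragment against the EAD XSD;
    this only illustrates how problems can be collected and reported.
    """
    problems = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as e:
        return [f"not well-formed: {e}"]
    # Example rule: every component <c> should carry a level attribute
    for c in root.iter("c"):
        if "level" not in c.attrib:
            problems.append("<c> without level attribute")
    return problems

good = "<archdesc><dsc><c level='series'/></dsc></archdesc>"
bad = "<archdesc><dsc><c/></dsc></archdesc>"
print(check_ead_fragment(good))  # []
print(check_ead_fragment(bad))   # ['<c> without level attribute']
```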

These data corrections/enhancements have been applied to data held centrally. We have tried to work with the distributed data, but it is very hard to maintain version control, as the data is constantly being revised, and we have ended up with some instances where identifying the ‘master’ copy of the data has become problematic.

We are currently working towards a more automated system of data corrections/enhancement, and this makes it important that we hold all of the data centrally, so that we ensure that the workflow is clear and we do not end up with duplicate, slightly different versions of descriptions. (NB: there are ways to work more effectively with distributed data, but we do not have the resources to set up this kind of environment at present – it may be something for the longer term).

We concluded that the distributed model was not sustainable, but we still wanted to provide a front-end for contributors. We therefore came up with the idea of the ‘micro sites’.

What are Hub Micro Sites?

The micro sites are a template-based local interface for individual Hub contributors. They use a feed of the contributor’s data from the central Archives Hub, so the data is held in only one place but is accessible through both interfaces: the Hub and the micro site. The end-user performs a search on a micro site, the search request goes to the central Hub, and the results are returned and displayed in the micro site interface.

screenshot of brighton micro site
Brighton Design Archives micro site homepage

The principles underlying the micro sites are that they need to be:

•    Sustainable
•    Low cost
•    Efficient
•    Realistically resourced

A Template Approach?

As part of our aim of ensuring a sustainable and low-cost solution we knew we had to adopt a one-size-fits-all model. The aim is to be able to set up a new micro site with minimal effort, as the basic look and feel stays the same. Only the branding, top and bottom banners, basic text and colours change. This gives enough flexibility for a micro site to reflect an institution’s identity, through its logo and colours, but it means that we avoid customisation, which can be very time-consuming to maintain.

The micro sites use an open approach, so it would be possible for institutions to customise the sites themselves by editing the stylesheets. However, this is not something that the Archives Hub can support, so the institution would need the expertise necessary to maintain any customisations themselves.

The Consultation Process

We started by talking to the Spokes institutions and getting their feedback about the strengths and weaknesses of the spokes and what might replace them. We then sent out a survey to Hub contributors to ascertain whether there would be a demand for the micro sites.

Institutions preferred the micro sites to be hosted by the Archives Hub. This reflects the lack of technical support within UK archives. This solution is also likely to be more efficient for us, as providing support at a distance is often more complicated than maintaining services in-house.

Respondents generally did not have images displayed on the Hub, but intended to in the future, so this needed to be taken into account. We also asked about experiences with understanding and using APIs. The responses showed that people had no experience of APIs and did not really understand what they were, but were keen to find out more.

We asked for requirements and preferences, which we have taken into account as much as possible, but we explained that we would have to take a uniform approach, so it was likely that there would need to be compromises.

After a period of development, we met with the early adopters of the micro sites (see below) to update them on our progress and get additional requirements from them. We considered these requirements in terms of how practical they would be to implement in the time scale that we were working towards, and we then prioritised the requirements that we would aim to implement before going live.

The additional requirements included:

  • Search in multi-level description: the ability to search within a description to find just the components that include the search term
  • Reference search: useful for contributors for administrative purposes
  • Citation: title and reference, to encourage researchers to cite the archive correctly
  • Highlight: highlighting of the search term(s)
  • Links to ‘search again’ and to ‘go back’ to the collection result
  • The addition of Google Analytics code in the pages, to enable impact analysis

The Development Process

We wanted the micro sites to be a ‘stand alone’ implementation, not tied to the Archives Hub. We could have utilised the Hub, effectively creating duplicate instances of the interface, but this would have created dependencies.  We felt that it was important for the micro sites to be sustainable independent of our current Hub platform.

In fact, the micro sites have been developed using Java, whereas the Hub uses Python, a completely different programming language. This happened mainly because we had a Java programmer on the team. It may seem a little odd to do this, as opposed to simply filtering the Hub data with Python, but we think that it has had unforeseen benefits: the programmers who have worked on the micro sites have been able to come at the task afresh, and work on new ways to solve the many challenges that we faced. As a result, we have implemented some solutions in the micro sites that are not implemented on the Hub. Equally, there were certainly functions within the Hub that we could not replicate in the micro sites – mainly those that were specifically set up for the aggregated nature of the Hub (e.g. browsing across the Hub content).

It was a steep learning curve for a developer, as the development required a good understanding of hierarchical archival descriptions, and also an appreciation of the challenges that come from a diverse data set. As with pretty much all Hub projects, it is the diverse nature of the data set that is the main hurdle. Developers need patterns; they need something to work with, something consistent. There isn’t too much of that with aggregated archives catalogues!

The developer utilised what he could from the Hub, but it is the nature of programming that reverse engineering someone else’s code can be a great deal harder than re-coding, so in many cases the coding was done from scratch. For example, the table of contents is a particularly tricky thing to recreate; the code used for the current Hub proved too complex to work with, as it has been built up over a decade and is designed to work within the Hub environment. The table of contents has to set out the hierarchy, provide collapsible folder structures, and link to specific parts of the description, from which the researcher can navigate up and down, so it is a complex thing to create and it took some time to achieve.

The feed of data has to provide the necessary information for the creation of the hierarchy, and our feed comes through SRU (Search/Retrieve via URL), which is a standard search protocol for Internet search queries using Contextual Query Language (CQL).  This was already available through the Hub API, and the micro sites application makes use of SRU in order to perform most of the standard searches that are available on the Hub.  Essentially, each of the micro sites is provided by a single web application that acts as a layer on the Archives Hub.  To access the individual micro sites, the contributor provides a shortened version of the institution’s name as a sub-string to the micro sites web address.  This then filters the data accordingly for that institution, and sets up the site with the appropriate branding.  The latter is achieved through CSS stylesheets, individually tailored for the given institution by a stand-alone Java application and a standard CSS template.
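To give a feel for what such a filtered SRU request looks like (the base URL and index names here are invented for illustration, not the Hub’s real API):

```python
from urllib.parse import urlencode

# Hypothetical endpoint and CQL index names, for illustration only.
SRU_BASE = "https://archiveshub.example/sru"

def sru_search_url(repository, query, start=1, maximum=20):
    """Build an SRU 1.2 searchRetrieve request whose CQL query is
    filtered to a single contributor, as a micro site would do."""
    cql = f'bath.description all "{query}" and hub.repository = "{repository}"'
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return SRU_BASE + "?" + urlencode(params)

print(sru_search_url("brighton", "design council"))
```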

Page Display

One of the changes that the developer suggested for the micro sites concerns the intellectual division of the descriptions. On the current Hub, a description may carry over many pages, but each page does not represent anything specific about the hierarchy, it is just a case of the description continuing from one page to the next. With the micro sites we have introduced the idea that each ‘child’ description of the top level is represented on one page. This can more easily be shown through a screenshot:

screenshot of table of contents from Salford Archives
Table of Contents of the Walter Greenwood Collection showing the tree structure

In the screenshot above, the series ‘Theatre Programmes, Playbills, etc’ is a first-level child description (a series description) of the archive collection ‘The Walter Greenwood Collection’.  Within this series there are a number of sub-series, the first of which is ‘Love on the Dole’, the last of which is ‘A Taste of Honey’. The researcher will therefore get a page that contains everything within this one series – all sub-series and items – if there are any described in the series.

screenshot of a page from Salford Archives
Page for ‘Theatre Programmes, Playbills, etc’ within the Walter Greenwood Collection

The sense of hierarchy and belonging is further reinforced by repeating the main collection title at the top of every right-hand pane.  The only potential downside to this approach is that it leads to variable-length ‘child’ description pages, but we felt it was a reasonable trade-off because it enables the researcher to get a sense of the structure of the collection. Usually it means that they can see everything within one series on one page, as this is the most typical first child level of an archival description.  In EAD representation, this is everything contained within the <c01> tag or top-level <c> tag.
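The grouping of everything beneath each first-level child onto one page can be sketched with Python’s standard XML library (a simplified EAD fragment; the micro sites themselves are written in Java):

```python
import xml.etree.ElementTree as ET

def first_level_pages(ead_xml):
    """Group an EAD <dsc> into one 'page' per first-level child (<c01>
    or top-level <c>), each page holding that child's whole subtree."""
    root = ET.fromstring(ead_xml)
    dsc = root.find(".//dsc")
    pages = []
    for child in dsc:
        if child.tag in ("c", "c01"):
            title = child.findtext("./did/unittitle", default="(untitled)")
            descendants = sum(1 for _ in child.iter()) - 1  # exclude child itself
            pages.append((title, descendants))
    return pages

ead = """<archdesc><dsc>
  <c01><did><unittitle>Theatre Programmes, Playbills, etc</unittitle></did>
    <c02><did><unittitle>Love on the Dole</unittitle></did></c02>
    <c02><did><unittitle>A Taste of Honey</unittitle></did></c02>
  </c01>
</dsc></archdesc>"""
print(first_level_pages(ead))  # one page for the whole series
```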

Next Steps

We are currently testing the micro sites with early adopters: Glasgow University Archive Services, Salford University Archives, Brighton Design Archives and the University of Manchester John Rylands Library.

We aim to go live during September 2014 (although it has been hard to fix a live date, as with a new and innovative service such as the micro sites unforeseen problems tend to emerge with alarming regularity). We will see what sort of feedback we get, and it is likely that we will find a few things need addressing as a result of putting the micro sites out into the big wide world. We intend to arrange a meeting for the early adopters to come together again and feed back to us, so that we can consider whether we need a ‘phase 2’ to iron out any problems and make any enhancements. We may at that stage invite other interested institutions, to explain the process and look at setting up further sites. But certainly our aim is to roll out the micro sites to other Archives Hub institutions.

Exploring British Design: An Introduction

We are very pleased to announce that the Archives Hub has joined forces with The University of Brighton Design Archives for an exciting new project, funded by the Arts and Humanities Research Council, ‘Exploring British Design’. The project is funded as one of ten new ‘Amplification Awards’ from the AHRC.

We will be working with Catherine Moriarty, Curatorial Director of the University of Brighton Design Archives and Professor of Art and Design History in the Faculty of Arts.  Catherine, myself and others on the project aim to provide you with updates and insights through the Archives Hub blog over the next 12 months.

* * *

The project will explore Britain’s design history by connecting design-related content in different archives. A collaboration between researchers, information professionals, technologists, curators and historians, it aims to give researchers the freedom to explore the depth of detail held in British design archives.

We will be working with researchers to understand more about their use of archives and methods of archival research within design history. We aim to answer a number of research questions:

1. How can we link digital content and subject expertise in order to make archival content more discoverable for researchers? How can we increase the discoverability of design archives in and beyond the HE sector?

2. How can connected archival data better recover ‘lost moments of design action’? (Dilnot 2013: 337)

3. How might a website co-designed by researchers, rather than a top-down collection-defined approach to archive content, enhance engagement with and understanding of British design? How can we encourage researchers, archive and museum professionals, and the public, to apprehend an integrated and extended rather than collection-specific sense of Britain’s design history?

4. How can the principles of archive arrangement/description be made meaningful and useful to researchers? Are these principles sometimes a hindrance to public understanding, or can they be utilised to better effect to aid interpretation?

We want to use this opportunity to explore ways of presenting archival data beyond the traditional collection level description. We will be working with three main sources of data:

1) We will be utilising and enhancing the data within the Archives Hub, starting with the descriptions of the collections held at Brighton Design Archives, but also utilising other descriptions of archives held all across the UK, covering manufacturing history, art schools, personal perspectives and professional contexts, so that we make the most of the diversity of the archives described on the Hub.

2) We will be creating archival authority records, using the EAC-CPF XML format for ISAAR(CPF) records.

3) We will be working with the Design Museum and looking to integrate their object-based data into our data set.

We will also be working to integrate other sources of data into our authority records.

We aim to provide a front-end that demonstrates what is possible with rich and connected data sources. Our intention is to be led by researchers in this endeavour. It will give us the opportunity to explore researcher needs and requirements, and to understand more about the importance of familiarity with interfaces compared to the possibilities for ‘disruptive’ approaches that propose more radical solutions to interrogating the data.

We are grateful to the AHRC for giving us the opportunity to explore these important questions and take digital research to another level.

 

 

A European Journey: The Archives Portal Europe

In January 2013 the Archives Hub became the UK ‘Country Manager’ for the Archives Portal Europe.

The Archives Portal Europe (APE) is a European aggregator for archives. The website provides more information about the APE vision:

Borders between European countries have changed often during the course of history. States have merged and separated and it is these changing patterns that form the basis for a common ground as well as for differences in their development. It is this tension between their shared history and diversity that makes their respective histories even more interesting. By collating archival material that has been created during these historical and political evolutions, the Archives Portal Europe aims to provide the opportunity to compare national and regional developments and to understand their uniqueness while simultaneously placing them within the larger European context.

The portal will help visitors not only to dig deeper into their own fields of interest, but also to discover new sources by giving an overview of the jigsaw puzzle of archival holdings across Europe in all their diversity.

For many countries, the Country Manager role is taken on by the national archives. However, for the UK the Archives Hub was in a good position to work with APE. The Archives Hub is an aggregation of archival descriptions held across the UK. We work with and store content in Encoded Archival Description (EAD), which provides us with a head start in terms of contributing content.

Jane Stevenson, the Archives Hub Manager, attended an APE workshop in Pisa in January 2013, to learn more about the tools that the project provides to help Country Managers and contributors to provide their data. Since then, Jane has also attended a conference in Dublin, Building Infrastructures for Archives in a Digital World, where she talked about A Licence to Thrill: the benefits of open data. APE has provided a great opportunity to work with European colleagues; it is not just about creating a pan-European portal, it is also about sharing and learning together. At present, APE has a project called APEx, which is an initiative for “expanding, enriching, enhancing and sustaining” the portal.

How Content is Provided to APE

The way that APE normally works is through a Country Manager providing support to institutions wishing to contribute descriptions. However, for the UK, the Archives Hub takes on the role of providing the content directly, as it comes via the Hub and into APE. This is not to say that institutions cannot undertake to do this work themselves. The British Library, for example, will be working with their own data and submitting it to APE. But for many archives, the task of creating EAD and checking for validity would be beyond their resources. In addition, this model of working shows the benefits of using interoperable standards; the Archives Hub already processes and validates EAD, so we have a good understanding of what is required for the Archives Portal Europe.

All that Archives Hub institutions need to do to become part of APE is to create their own directory entry. These entries are created using Encoded Archival Guide (EAG), but the archivist does not need to be familiar with EAG, as they are simply presented with a form to fill in. The directory entry can be quite brief, or very detailed, including information on opening hours, accessibility, reprographic services, search room places, internet access and the history of the archive.

blog-ape-eag
Fig 1: EAG entry for the University of East London

Once the entry is created, we can upload the data. If the data is valid, this takes very little time to do, and immediately the archive is part of a national aggregation and a European aggregation.

APE Data Preparation Tool

The Data Preparation Tool allows us to upload EAD content and validate it. You can see in the screenshot below a list of EAD files from the Mills Archive that have been uploaded to the Tool, which will allow us to ‘convert and validate’ them. There are various options for checking against different flavours of EAD, and there is also the option to upload EAC-CPF (which is not something the Hub is working with as yet) and EAG.

blog-ape-dpt1
Fig 2: Data Preparation Tool

If all goes according to plan, the validation results in a whole batch of valid files, and you are ready to upload the data. Sometimes there will be an invalid file and you need to take a look at the validation message and figure out what you need to do (the error message in this screenshot relates to ‘example 2’ below).

blog-ape-dpt2
Fig 3: Data Preparation Tool with invalid files

 APE Dashboard

The Dashboard is an interface provided to an APE Country Manager to enable them to administer their landscape. The first job is to create the archival landscape. For the UK we decided to group the archives by type:

blog-ape-landscape
Fig 4: Archival Landscape

The landscape can be modified as we go, but it is good to keep the basic categories, so it’s worth thinking about this from the outset. We found that many other European countries divide their archives differently, reflecting their own landscape, particularly in terms of how local government is organised. We did have a discussion about the advantages of all using the same categories, but it seemed better for the end-user to be presented with categories suitable for the way UK archives are organised.

Within the Dashboard, the Country Manager creates logins for all of the archive repositories contributing to APE. The repositories can potentially use these logins to upload EAD to the dashboard, validate and correct if necessary and then publish. But at present, the Archives Hub is taking on this role for almost all repositories. One advantage of doing this is that we can identify issues that surface across the data, and work out how best to address these issues for all repositories, rather than each one having to take time to investigate their own data.

Working with the Data

When the Archives Hub started to work with APE, we began by undertaking a comparison of Hub EAD and APE EAD. Jane created a document setting out the similarities and differences between the two flavours of EAD. Whilst the Hub and APE both use EAD, this does not mean that the two will be totally compatible. EAD is quite permissive and so for services like aggregators choices have to be made about which fields to use and how to style the content using XSLT stylesheets. To try to cover all possible permutations of EAD use would be a huge task!

There have been two main scenarios when dealing with data issues for APE:

(1) the data is not valid EAD or it is in some way incorrect

(2) the data is valid EAD but the APE stylesheet cannot yet deal with it

We found that there were a combination of these types of scenarios. For the first, the onus is on the Archives Hub to deal with the data issues at source. This enables us to improve the data at the same time as ensuring that it can be ingested into APE. For the second, we explain the issue to the APE developer, so that the stylesheet can be modified.

Here are just a few examples of some of the issues we worked through.

Example 1: Digital Archival Objects

APE was omitting the <daodesc> content:

<dao href="http://www.tate.org.uk/art/images/work/P/P78/P78315_8.jpg" show="embed"><daodesc><p>’Gary Popstar’ by Julian Opie</p></daodesc></dao>

The APE developer suggested: “Content of <daodesc><p> should be transferred to <dao@xlink:title>. It would then be displayed as mouse-over text to the icons used in the APE for highlighting digital content. Would that solution be ok?”

In this instance the problem was due to the Hub using the DTD and APE using the schema, and a small transformation done by APE when they ingested the data sufficed to provide a solution.
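The kind of transformation involved can be sketched in Python (APE’s actual conversion was done on their side, quite possibly in XSLT; this is purely illustrative):

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

def lift_daodesc(dao_xml):
    """Move the text of <daodesc><p> into an xlink:title attribute on
    <dao>, mirroring the transformation described above."""
    dao = ET.fromstring(dao_xml)
    desc = dao.find("./daodesc/p")
    if desc is not None and desc.text:
        dao.set(f"{{{XLINK}}}title", desc.text)
        dao.remove(dao.find("./daodesc"))
    return ET.tostring(dao, encoding="unicode")

src = ('<dao href="http://www.tate.org.uk/art/images/work/P/P78/P78315_8.jpg"'
       ' show="embed"><daodesc><p>Gary Popstar by Julian Opie</p></daodesc></dao>')
print(lift_daodesc(src))
```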

Example 2: EAD Level Attribute

Archivists are all familiar with the levels within archival descriptions. Unfortunately, ISAD(G), the standard for archival description, is not very helpful with enforcing controlled vocabulary here, simply suggesting terms like Fonds, Sub-fonds, Series, Sub-series. EAD has a more definite list of values:

  • collection
  • fonds
  • class
  • recordgrp
  • series
  • subfonds
  • subgrp
  • subseries
  • file
  • item
  • otherlevel

Inevitably this means that the Archives Hub has ended up with variations in these values. In addition, some descriptions use an attribute value called ‘otherlevel’ for values that are not, in fact, other levels, but are recognised levels.

We had to deal with quite a few variations: Subfonds, SubFonds, sub-fonds, Sub-fonds, sub fonds, for example. I needed to discuss these values with the APE developer and we decided that the Hub data should be modified to only use the EAD specified values.

For example:

<c level="otherlevel" otherlevel="sub-fonds">

needed to be changed to:

<c level="subfonds">

At the same time the APE stylesheet also needed to be modified to deal with all recognised level values. Where the level was not a recognised EAD value, e.g. ‘piece’, then ‘otherlevel’ is valid, and the APE stylesheet was modified to recognise this.
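A minimal sketch of the normalisation we applied to the Hub data (the function is ours, for illustration; the value list is EAD’s):

```python
import re

# EAD's permitted values for the level attribute
EAD_LEVELS = {
    "collection", "fonds", "class", "recordgrp", "series", "subfonds",
    "subgrp", "subseries", "file", "item", "otherlevel",
}

def normalise_level(level, otherlevel=None):
    """Map variant spellings such as 'Sub-fonds' or 'sub fonds' onto the
    values EAD specifies; keep genuinely other levels as 'otherlevel'.

    Returns (level, otherlevel) as they should appear on the <c> tag.
    """
    raw = (otherlevel or level or "").strip().lower()
    collapsed = re.sub(r"[\s_-]+", "", raw)  # 'sub-fonds' -> 'subfonds'
    if collapsed in EAD_LEVELS:
        return collapsed, None
    return "otherlevel", raw  # e.g. 'piece' stays a valid otherlevel

print(normalise_level("otherlevel", "Sub-fonds"))  # ('subfonds', None)
print(normalise_level("otherlevel", "piece"))      # ('otherlevel', 'piece')
```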

Example 3: Data within <title> tag

We discovered that for certain fields, such as biographical history, any content within a <title> tag was being omitted from the APE display. This simply required a minor adjustment to the stylesheet.

Where are we Now?

The APE developers are constantly working to improve the stylesheets to work with EAD from across Europe. Most of the issues that we have had have now been dealt with. We will continue to check the UK data as we upload it, and go through the process described above, correcting data issues at source and reporting validation problems to the APE team.

The UK Archival Landscape in Europe

By being part of the Archives Portal Europe, UK archives benefit from more exposure, and researchers benefit from being able to connect archives in new and different ways.  UK archives are now being featured on the APE homepage.

blog-ape-feature1
Wiener Library: Antisemitic board game, 1936

APE and the APEx project provide a great community environment. They provide news from archives across Europe: http://www.apex-project.eu/index.php/news; a section for people to contribute articles: http://www.apex-project.eu/index.php/articles; and events run and advertised across Europe: http://www.apex-project.eu/index.php/events/cat.listevents/. Most importantly for the Archives Hub, APEx provides effective tools along with knowledgeable staff, so that there is a supportive environment to facilitate our role as Country Manager.

blog-ape-berlingroup
APEx Project meeting in Berlin, Nov 2013

See: http://www.flickr.com/photos/apex_project/10723988866/in/set-72157637409343664/
(copyright: Bundesarchiv)

EAD and Next Generation Discovery

This post is in response to a recent article in Code4Lib, ‘Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems’ by M. Bron, M. Proffitt and B. Washburn. All quotes are from that article, which looked at the instances of tags within ArchiveGrid, the US-based archival aggregation run by OCLC. This post compares some of their findings to the UK-based Archives Hub.

Date

In the ArchiveGrid analysis, the <unitdate> field is used around 72% of the time within the high-level (usually collection-level) description. The Archives Hub does significantly better here, with almost universal inclusion of dates at this level of description. Therefore, a date search is not likely to exclude any potentially relevant descriptions. This is important, as researchers are likely to want to restrict their searches by date. Our new system also allows sorting retrieved results by date. The only issue we have is where the dates are non-standard and cause the ordering to break down in some way. But we do have both display dates and normalised dates, to enable better machine processing of the data.

Collection Title

“for sorting and browsing…utility depends on the content of the element.”

Titles are always provided, but they are very varied. Setting aside lower-level descriptions, which are particularly problematic, titles may be more or less informative. We may introduce sorting by title, but the utility of this will be limited. It is unlikely that titles will ever be controlled to the extent that they have a level of consistency, but it would be fascinating to analyse titles within the context of the ways people search on the Web, and see if we can gauge the value of different approaches to creating titles. In other words, what is the best type of title in terms of attracting researchers’ attention, search engine optimisation, display within search engine results, etc?

Lower-level descriptions tend to have titles such as ‘Accounts’, ‘Diary’ or something more difficult to understand out of context such as ‘Pigs and boars’ or ‘The Moon Dragon’. It is clearly vital to maintain the relationship of these lower-level descriptions to their parent level entries, otherwise they often become largely meaningless. But this should be perfectly possible when working on the Web.

It is important to ensure that a researcher finding a lower-level description through a general search engine gets a meaningful result.

Archives Hub search result from a Google search
A search result within Google

The above result is from a search for ‘garrick theatre archives joanna lumley’ – the sort of search a researcher might carry out. Whilst the link is directly to a lower-level entry for a play at the Garrick Theatre, the heading is for the archive collection. This entry is still not ideal, as the lower-level heading should be present as well. But it gives a reasonable sense of what the researcher will get if they click on this link. It includes the <unitid> from the parent entry and the URL for the lower level, with the first part of the <scopecontent> for the entry.  It also includes the Archives Hub tag line, which could be considered superfluous to a search for Garrick Theatre archives! However, it does help to embed the idea of a service in the mind of the researcher – something they can use for their research.

Extent

“It would be useful to be able to sort by size of collection; however, this would require some level of confidence that the <extent> tag is both widely used and that the content of the tag would lend itself to sorting.”

This was an idea we had when working on our Linked Data output. We wanted to think about visualizations that would help researchers get a sense of the collections that are out there, where they are, how relevant they are, and so on. In theory the ‘extent’ could help with a weighting system, where we could think about a map-based visualization showing concentrations of archives about a person or subject. We could also potentially order results by size – from the largest archive to the smallest archive that matches a researcher’s search term. However, archivists do not have any kind of controlled vocabulary for ‘extent’. So, within the Archives Hub this field can contain anything from numbers of boxes and folders to length in linear metres, dimensions in cubic metres and items in terms of numbers of photographs, pamphlets and other formats. ISAD(G) doesn’t really help with this; the examples it gives simply serve to show how varied the description of extent can be.
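To show why uncontrolled extent statements are so awkward for weighting, here is a deliberately crude extractor (the patterns are hypothetical and would miss many real variants; nothing like this runs on the Hub):

```python
import re

def extent_weight(extent):
    """Extract a rough comparable size from a free-text extent statement.

    Returns (value, unit) for the first quantity found, or None.
    Different units (boxes vs linear metres) still cannot be compared
    directly, which is the core of the problem described above.
    """
    m = re.search(
        r"(\d+(?:\.\d+)?)\s*"
        r"(box(?:es)?|linear metres?|items?|folders?|photographs?)",
        extent,
        re.IGNORECASE,
    )
    if not m:
        return None
    return float(m.group(1)), m.group(2).lower()

print(extent_weight("24 boxes and 3 folders"))  # (24.0, 'boxes')
print(extent_weight("5.5 linear metres"))       # (5.5, 'linear metres')
```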

Genre

“Other examples of desired functionality include providing a means in the interface to limit a search to include only items that are in a certain genre (for example, photographs)”.

This is something that could potentially be useful to researchers, but archivists don’t tend to provide the necessary data. We would need descriptions to include the genre, using controlled vocabulary. If we had this we could potentially enable researchers to select types of materials they are interested in, or simply include a flag to show, e.g. where a collection includes photographs.

The problem with introducing a genre search is that you run the risk of excluding key descriptions, because the search will only include results where the description includes that data in the appropriate location. If the word ‘photograph’ is in the general description only then a specific genre search won’t find it. This means a large collection of photographs may be excluded from a search for photographs.
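
One hypothetical way to soften this trade-off is to pair a strict genre filter with a free-text fallback, so that records mentioning ‘photograph’ only in the general description are still surfaced. The field names below are illustrative, not the Archives Hub schema.

```python
# Sketch of a genre search with a free-text fallback: 'strict' holds
# records where the genre field is populated; 'fallback' holds records
# that only mention the term in the general description.
def genre_search(records, term):
    strict = [r for r in records if term in r.get("genreform", [])]
    fallback = [r for r in records
                if term in r.get("scopecontent", "").lower()
                and r not in strict]
    return strict, fallback

records = [
    {"id": "a", "genreform": ["photograph"], "scopecontent": ""},
    {"id": "b", "genreform": [], "scopecontent": "A large collection of photographs."},
]
strict, fallback = genre_search(records, "photograph")
```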

Subject

In the Bron/Proffitt/Washburn article <controlaccess> is present around 72% of the time. I was surprised that they did not choose to analyse tags within <controlaccess>, as I think these ‘access points’ can play a very important role in archival description. They use the presence of <controlaccess> as an indication of the presence of subjects, and make the point that “given differences in library and archival practices, we would expect control of form and genre terms to be relatively high, and control of names and subjects to be relatively low.”

On the Archives Hub, use of subjects is relatively high (as well as personal and corporate names) and use of form and genre is very low. However, it is true to say that we have strongly encouraged adding subject terms, and archivists don’t generally see this as integral to cataloguing (although some certainly do!), so we like to think that we are partly responsible for such a high use of subject terms.

Subject terms are needed because they (1) help to pull out significant subjects, often from collections that are very diverse, (2) enable identification of words such as ‘church’ and ‘carpenter’ (i.e. they are subjects, not surnames), (3) allow researchers to continue searching across the Archives Hub by subject (subjects are all linked to the browse list) and therefore pull collections together by theme, and (4) enable advanced searching (which is substantially used on the Hub).

Names (personal and corporate)

In Bron/Proffitt/Washburn the <origination> tag is present 87% of the time. The analysis did not include the use of <persname> and <corpname> within <origination> to identify the type of originator. In the Archives Hub the originator is a required field, and is present 99%+ of the time. However, we made what I think is a mistake in not providing for the addition of personal or corporate name identification within <origination> via our EAD Editor (for creating descriptions) or by simply recommending it as best practice. This means that most of our originators cannot be distinguished as people or corporate bodies. In addition, we have a number where several names are within one <origination> tag and where terms such as ‘and others’, ‘unknown’ or ‘various’ are used. This type of practice is disadvantageous to machine processing. We are looking to rectify it now, but addressing something like this in retrospect is never easy to do. The ideal is that all names within origination are separately entered and identified as people or organisations.
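
A retrospective clean-up along the lines described above might look something like this sketch, which assumes semicolon-separated names and a small list of filler terms – both invented for illustration, not a description of our actual scripts.

```python
# Hypothetical clean-up sketch for multi-name <origination> values:
# split the packed value into individual names and drop filler terms
# such as 'and others' that defeat machine processing.
FILLER = {"and others", "unknown", "various"}

def split_origination(value):
    parts = [p.strip() for p in value.split(";")]
    return [p for p in parts if p and p.lower() not in FILLER]
```

Even a simple routine like this only gets you part of the way: identifying each resulting name as a person or a corporate body still needs human judgement or authority-file matching.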

We do also have names within <controlaccess>, and this brings the same advantages as for subjects, ensuring the names are properly structured and can be used for searching and for bringing together archives relating to any one individual or organisation.

Repository

“Use of this element falls into the promising complete category (99.46%: see Table 7). However, a variety of practice is in play, with the name of the repository being embellished with <subarea> and <address> tags nested within <repository>.”

On the Archives Hub repository is mandatory, but as yet we do not have a checking system whereby a description is rejected if it does not contain this field. We are working towards something like this, using scripts to check for key information to help ensure validity and consistency at least to a minimum standard. On one occasion we did take in a substantial number of descriptions from a repository that omitted the name of repository, which is not very useful for an aggregation service! However, one thing about <repository> is that it is easy to add because it is always the same entry. Or at least it should be… we did recently discover that a number of repositories had entered their name in various ways over the years, and this is something we needed to correct.
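
This kind of correction is, at heart, a lookup from variant spellings to one canonical form. A minimal sketch, using invented repository names rather than real Hub data:

```python
# Hypothetical sketch: map the variant repository names that accumulated
# over the years onto a single canonical form. Unrecognised names pass
# through unchanged so a human can review them.
CANONICAL = {
    "example record office": "Example Record Office",
    "the example record office": "Example Record Office",
    "example record office, university of example": "Example Record Office",
}

def normalise_repository(name):
    return CANONICAL.get(name.strip().lower(), name.strip())
```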

Scope and content, biographical history and abstract

It is notable that in the US <abstract> is widely used, whereas we don’t use it at all. It is intended as a very brief summary, whereas <scopecontent> can be of any length.

“For search, it’s worth noting that the semantics of these elements are different, and may result in unexpected and false ‘relevance’.”

One of the advantages of including <controlaccess> terms is to mitigate this kind of false relevance: restricted field searching makes it possible to distinguish a search for ‘mason’ as a person from ‘mason’ as a subject.
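
The idea can be sketched very simply (field names here are illustrative, not the Archives Hub schema): the same word lives in different fields, and searching one field keeps the meanings apart.

```python
# Hypothetical sketch of restricted-field searching: 'mason' as a
# personal name and 'mason' as a subject sit in separate fields, so a
# fielded search distinguishes them where full-text search cannot.
def field_search(records, field, term):
    return [r for r in records
            if any(term in value for value in r.get(field, []))]

records = [
    {"id": "1", "persname": ["mason, william"], "subject": []},
    {"id": "2", "persname": [], "subject": ["masons and masonry"]},
]
```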

The Bron/Proffitt/Washburn analysis shows <bioghist> used 70% of the time. This is lower than on the Archives Hub, where it is rare for this field not to be included. Archivists seem to have a natural inclination to provide a reasonably detailed biographical history, especially for a large collection focussed on one individual or organisation.

Digital Archival Objects

It is a shame that the analysis did not include instances of <dao>, but it is likely to be fairly low (in line with previous analysis by Wisser and Dean, which puts it lower than 10%). The Archives Hub currently includes around 1,200 instances of images or links to digital content. But what would be interesting is to see how this is growing over time and whether the trajectory indicates that in 5 years or so we will be able to provide researchers with routes into much of the Archives Hub content. However, it is worth bearing in mind that many archives are not digitised and are not likely to be digitised, so it is important for us not to raise expectations that links to digital content will become a matter of course.

The Future of Discovery

“In order to make EAD-encoded finding aids more well suited for use in discovery systems, the population of key elements will need to be moved closer to high or (ideally) complete.”

This is undoubtedly true, but I wonder whether the priority, over and above completeness, is consistency and controlled vocabulary where appropriate. There is an argument in favour of a shorter description that may exclude certain information about a collection but is well structured and easier to machine process. (Of course, completeness and consistency together are the ideal!)

The article highlights geo-location as something that is emerging within discovery services. The Archives Hub is planning on promoting this as an option once we move to the revised EAD schema (which will allow for this to be included), but it is a question of whether archivists choose to include geographical co-ordinates in their catalogues. We may need to find ways to make this as easy as possible and to show the potential benefits of doing so.

In terms of the future, we need a different perspective on what EAD can and should be:

“In the early days of EAD the focus was largely on moving finding aids from typescript to SGML and XML. Even with much attention given over to the development of institutional and consortial best practice guidelines and requirements, much work was done by brute force and often with little attention given to (or funds allocated for) making the data fit to the purpose of discovery.”

However, I would argue that one of the problems is that archivists sometimes still think in terms of typescript finding aids; of a printed finding aid that is available within the search room and then made available online… as if they are essentially the same thing and we can use the same approach with both. I think more needs to be done to promote, explain and discuss ‘next generation finding aids’. By working with Linked Data, I have gained a very different perspective on what is possible, challenging the traditional approach to hierarchical finding aids.

Maybe we need some ‘next generation discovery’ workshops and discussions – but in order to really broaden our horizons we will need to take heed of what is going on outside of our own domain. We can no longer consider archival practice in isolation from discovery in the most general sense because the complexity and scale of online discovery requires us to learn from others with expertise and understanding of digital technologies.


Digital Humanities: Patterns, Pictures and Paradigms

The recent Digital Humanities @ University of Manchester conference presented research and pondered issues surrounding digital humanities. I attended the morning of the conference, interested to understand more about the discipline and how archivists might interact with digital humanists, and consider ways of opening up their materials that might facilitate this new kind of approach.

Visualisation within digital humanities was presented in a keynote by Dr Massimo Riva, from Brown University. He talked about the importance of methodologies based on computation, whether the sources are analogue or digital, and how these techniques are becoming increasingly essential for the humanities. He asked whether a picture is worth one million words, and presented some thought-provoking quotes relating to visualisation, such as John Berger’s: “The relation between what we see and what we know is never settled.” (Ways of Seeing, 1972).

Riva talked about how visual projection is increasingly tied up with who we are and what we do. But is digital humanities translational or transformative? Are these tools useful for the pursuit of traditional scholarly goals, or do they herald a new paradigm?  Does digital humanities imply that scholars are making things as they research, not just generating texts?  Riva asked how we can combine close reading of individual artifacts and ‘distant reading’ of patterns across millions of artifacts. He posited that visualisation helps with issues of scale; making sense of huge amounts of data. It also helps cross boundaries of language and communication.

Riva talked about the fascinating Cave Writing at Brown University, a new kind of cognitive experience. It is a four-wall, immersive virtual reality device, a room of words. This led into his thoughts about data as a type of artifact and the nature of the archive.

“On the cusp of the twenty–first century…we speak of an ex–static archive, of an archive not assembled behind stone walls but suspended in a liquid element behind a luminous screen; the archive becomes a virtual repository of knowledge without visible limits, an archive in which the material now becomes immaterial.” This change “has altered in still unimaginable ways our relationship to the archive”. (Voss & Werner, 1999)

The Garibaldi panorama is 276 feet long and tells the story of Garibaldi, the Italian general and politician. It is fragile and cannot be directly consulted by scholars, so the whole panorama was photographed in 91 digital images in 2007. The digital experience is clearly different from the physical experience, but the resulting digital panorama can be interacted with in many different ways and is widely available via the website, along with various tools to help researchers interpret it. It is interesting to think about how much this is in itself a curated experience, and how much it is an experience that the user curates themselves. Maybe it is both. If it is curated, then it is not really the archivists who are the curators, but those who have created the experience – those with the ability to create such technical digital environments. It is also possible for students to create their own resources, and then for those resources to become part of the experience, such as an interactive timeline based on the panorama. So, students can enhance the metadata as a form of digital scholarship.

Riva showed an example of a collaborative environment where students can take the parts of the panorama that interest them and explore them, finding links and connections and studying parts of the panorama alongside relevant texts. It is fascinating as an archivist to see examples like this where the original archive remains the basis of the scholarly endeavour. The artifact is at a distance from the actual experience, but the researcher can analyse it at a very detailed level. It raises the whole debate around the importance of studying the original archive. As tools and environments become more and more sophisticated, it is possible to argue that the added value of a digital experience is very substantial and, for many researchers, preferable to handling the original.

Riva talked about the learning curve with the software. Scholars struggled to understand the full potential of it and what they could do and needed to invest time in this. But an important positive was that students could feedback to the programmers, in order to help them improve the environment.

We had short presentations on a diverse range of projects, all of which showed how digital humanities is helping to reveal history to us in many ways. Dr Guyda Armstrong made the point that library catalogues are more than they might seem – they are a part of cultural history. This is reflected in a bid for funding for a Digging into Data project, metaSCOPE, looking at bibliographical metadata as data for massive-scale cultural history. The questions the project hopes to answer are many: how are different cultures expressed in the data? How do library collections data reflect the epistemic values, national and disciplinary cultures, and artifacts of production and dissemination expressed in their creation? This project could help with mapping the history of publishing in space and time, as well as showing the history of one book over time.

We saw many examples of how visual work and digital humanities approaches can bring history to life and help with new understanding of many areas of research. I was interested to hear how the mapping of the Caribbean during the 18th century opened up the coastline to the slave traders, but the interior, which was not mapped in any detail, remained in many ways a free area, where the slave traders did not have control. The mapping had a direct influence on many people’s lives in very fundamental ways.

Another point that really stood out to me was the danger of numbers averaging out the human experience – a challenge for the digital humanities approach since, at the same time, numbers can give great insights into history. Maybe this is a very good reason why those who create tools and those who use them benefit from a shared understanding.

“All archaeological excavation is destruction”, so what actually lives on is the record you create, says Dr Stuart Campbell. Traditional monographs synthesize all the data. They represent what is created through the process of excavation. It is a very conventional approach. But things are changing, and digital archiving creates new ways of working in the virtual world of archaeological data. Dr Campbell made the point that interpretation is often privileged over the data itself in traditional methods, but new approaches open up the data, allowing more narratives to be created. The process of data creation becomes apparent, and the approach scales up to allow querying that breaks out beyond the boundaries of archaeological sites. For example, he talked about looking at patterns on ancient pottery and plotting where the pottery comes from. New sophisticated tools allow different dimensions to be brought into the research. Links can now be created that bring various social dimensions to archaeological discoveries, but what these connections really represent is less well understood or theorised.

Seemingly a contrast to many of the projects, a project to recreate the Gaskell house in Manchester is more about the physical experience. People will be able to take books down from the shelves, sit down and read them. But actually there is a digital approach here too, as the intention is to add value to the experience by enabling visitors to leaf through digital copies of Gaskell’s works and find out more about the process of writing and publishing by showing different versions of the same stories: handwritten, with annotations, and published. It is enhancing the physical, tactile experience through digital means.

To end the morning we had a cautionary tale about the vulnerability of websites. A very impressive site, allowing users to browse in detail through an Arabic manuscript, is to be taken down, presumably because of changes in personnel or priorities at the hosting institution. The sustainability of the digital approach is in itself a huge topic, whether it concerns the data or the dissemination approaches.


The Archives Hub, Swedish business, Welsh steel and British banks

The Hub’s Jane Stevenson and Bethan Ruddock, with Stacy Capner, Nicholas Webb, and the delegation from the Swedish Business Archives Association

On 24th October, the Archives Hub was delighted to host a meeting of our colleagues from Sweden here in Manchester. The visitors were archivists with a particular interest in business and industry, and so we were very happy that Nicholas Webb from Barclays’ Group Archive and Stacy Capner, Business Archives Development Officer for Wales, both agreed to come along and speak.

Jane Stevenson opened with a presentation on the UK archival landscape – a topic that sounds easy in theory, but in practice is somewhat broad in scope! However, we tried to give our colleagues an overview of the professional bodies, standards, training and career opportunities, and the concerns and challenges that make up the UK archives scene.

Per-Ola Karlsson, Head of Archives at the Swedish Centre for Business History, gave a talk on his work with the Centre. It was a shame that more colleagues from the business sector couldn’t join us, because it was fascinating to hear about this approach to managing business archives. Per-Ola informed us that the Centre is the world’s largest private archive. The basic model is to hold business archives centrally; the Centre will take in any business archive, and its holdings include some of the leading businesses in Sweden, such as Ericsson, H&M and Unilever.

Per-Ola gave us some context to the formation of the Centre. Originally the assumption was that companies should take responsibility for their own archives, but this changed during the 1960s, when companies were ceasing to exist and their archives were under threat. It was interesting to hear that the Government waded in on the debate, pressing for a solution (but reluctant to stump up any funds!). Eventually regional business archives were established, and now the National Centre operates as a centre of expertise in business archives. Sweden has the most private business archives of any of the Nordic countries, and the contrast between the Swedish and Norwegian approaches is marked, with Norway selecting companies’ archives and Sweden encouraging all companies to deposit.

The pricing for the use of the Centre is by shelf metres. The depositor retains ownership and control, which in itself is a risk when the staff at the Centre invest so much time and effort in curating the collections. But they see their role as advocates and persuaders – they need to convince businesses that it makes good business sense to have an archive.  It means that requests for access can be vetted by the company, but many archives are fully open for researchers. Per-Ola talked about his role – in many ways serving the companies first, because the essence of the work is to attract archives; this is what will make the centre successful as a research centre.

It seemed to be a really positive thing to have this kind of model in so far as it promotes the importance of business archives and ensures there is a centre for advocating the vital importance of these archives for future research. The UK does great work through the Business Archives Council, but we wonder what business archivists would think of this kind of model for the UK? A central store for business archives, and a central pool of expertise. It means that in Sweden, archivists working within a business are much less common.

Stacy took us through the landscape of Wales, as told through its archives of industry. Coal, steel, iron, lager production, nuclear power – they are all quite localised, and tied in with local history in Wales. In the 1960s, with the decline of heavy industry, many archives ended up in local record offices, but collection was not systematic. There are no private business archives in Wales that are professionally managed.

Stacy pointed out that business archives are often more likely to be left uncatalogued – they are hard to deal with and understand, and more ‘attractive’ archives may take priority. Yet projects such as ‘Wales: Powering the World‘ show how business archives can be successfully used. One of the project’s outputs was a resource created by two Swansea University students encouraging others to use archives (and especially business archives) to find research material.

We moved on to look at the archive at Barclays. Nick Webb gave us a thought-provoking talk that highlighted the role of an archive in a company that is struggling to regain its reputation. He gave very persuasive arguments around the vital role of an archive in providing transparency and, if not an objective view of history, at least a view that can be supported by documentary evidence. For instance, the archive shows Barclays’ true relationship to the slave trade, which is not as has often been portrayed. Whatever else the bank might be accused of, they had quite a strong Quaker history and campaigned against slavery. His lovely turn of phrase about archives being ‘a force against corporate amnesia’ really summed this up well. It was interesting to note how much the archive is used by employees – it really seemed that it has an important role to play and that this is properly recognised within the bank, especially since the team often put a monetary value on what they do! Nick had a great anecdote about a student who came into the archive to plough through material about Barclays’ work in Libya. He declared that the archive was the best source on pre-Gaddafi Libyan history that he had come across – a great example of the surprises that are hidden within collections.

We ended with Bethan Ruddock and Jane Stevenson talking a bit about ‘the online archivist’ and expanding on some of the challenges archivists face in the digital age.

Altogether we had a great day. It was a valuable opportunity to hear about how another country approaches the challenges of business archives, and for us it was also a means to get a better understanding of the landscape of business archives within the UK.