Here is a presentation I gave at ELAG 2015 to introduce our innovation project, Exploring British Design. The presentation is entitled ‘From Ivory Tower to People Power’ (YouTube link) and emphasises the collaborative nature of the project and the focus on people as a topic, rather than on archival description, which is not always the best starting place for researchers. The presentation covers:
Aims of the project
Workshops with postgraduate students about how they research and analysis of their research paths
Workshops with postgraduates about websites: what students do and don’t like in terms of discovery
Traditional archival cataloguing ‘lock in’ of entities such as people, places and events.
Connectivity beyond single A to B connections; ‘anything can be a focus’ and can link to a myriad of other things
Use of EAC-CPF (XML standard for archival authority files)
Creating the data, handcrafting data, limitations of our approach, too many ideas not enough time!
If, as a researcher, you search for ‘Jane Drew’, the celebrated architect and town planner, on the Archives Hub, amongst other things, you might discover a single item, “Letter from Jane B Drew to John and Myfanwy Piper”, a letter in the “Papers of John and Myfanwy Piper”.
You can see that it’s a letter in a collection at the Tate Gallery Archive. The description of the collection is an example of a good quality traditional archival catalogue, giving a fairly detailed listing of the content of this particular collection. But as a researcher you are really only interested in this one letter. You may ask yourself a number of questions, possibly starting with (1) Is this the Jane Drew I’m interested in? and then (2) What is the relationship between Jane Drew and John and Myfanwy Piper? You may well be able to find answers by accessing the letter itself, but at this stage you may just want to place this connection in the broader context of Jane Drew’s life and work. As a researcher, understanding how these people are connected may shed light on your research interests.
In this blog I want to think about this question of relationships. The fact is that archivists rarely provide structured information about relationships; if there is information, it is usually in the biographical history, which might outline key events and people in someone’s life, referring to their parents, work colleagues, friends, etc. The nature of the relationship is sometimes explicitly given, but often it is not. Our standards don’t really say much about relationships between the entities (people, organisations, places, etc) that we describe in our catalogues.
Going back to the Papers of John and Myfanwy Piper as an example, the biographical history includes the following:
[John] Piper began writing reviews from the late 1920s making a name for himself as a critic writing for periodicals like ‘The Listener’ and the ‘Architectural Review’. From 1935-1937 he assisted Myfanwy Evans, with the production of a quarterly review of contemporary European abstract painting called ‘Axis’. In 1937 Piper was commissioned by his friend John Betjeman to write the ‘Shell Guide to Oxfordshire’. Piper went on to write and provide photographs for a number of the guides as well as edit the series. In the same year John Piper married the writer Myfanwy Evans.
This is typical of a biographical history – useful historical information about the individual or organisation. Within this there is information we can potentially use to create explicit relationship information:
John Piper ‘worked with’ Myfanwy Evans
John Piper ‘was friends with’ John Betjeman
John Piper ‘worked for’ John Betjeman
John Piper ‘was married to’ Myfanwy Evans
There are a number of issues to consider here:
How can we unambiguously identify the people?
How do we choose the vocabulary we use to define the relationships?
Do we try to include dates?
Is it reasonable for us to interpret relationships as ‘friendships’ or ‘collaborations’ if this is not actually explicit?
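To make these issues a little more concrete, here is a minimal sketch of how the four statements above might be held as structured data. Everything in it is an assumption for illustration: the class and field names, the `local:` identifiers and the small vocabulary are hypothetical, and are not taken from any standard or from our project data. The `stated` flag is one possible way of recording whether a relationship is explicit in the source text or is the cataloguer's interpretation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabulary; a real project would need to agree its own.
RELATIONSHIP_TYPES = {"worked with", "was friends with", "worked for", "was married to"}


@dataclass(frozen=True)
class Relationship:
    subject_id: str            # an identifier, not just a name string
    relationship_type: str     # must come from RELATIONSHIP_TYPES
    object_id: str
    start_year: Optional[int] = None
    end_year: Optional[int] = None
    stated: bool = True        # False = interpreted by the cataloguer

    def __post_init__(self):
        # Reject terms that are not in the agreed vocabulary.
        if self.relationship_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.relationship_type}")


# Placeholder local identifiers (hypothetical, for illustration only).
PIPER = "local:john-piper"
EVANS = "local:myfanwy-evans"
BETJEMAN = "local:john-betjeman"

relationships = [
    Relationship(PIPER, "worked with", EVANS, 1935, 1937),
    Relationship(PIPER, "was friends with", BETJEMAN),
    # 'commissioned by' read as 'worked for': an interpretation, so flagged.
    Relationship(PIPER, "worked for", BETJEMAN, 1937, stated=False),
    Relationship(PIPER, "was married to", EVANS, 1937),
]


def related_to(entity_id, rels):
    """Return the identifiers of everyone linked to entity_id, in either direction."""
    linked = set()
    for rel in rels:
        if rel.subject_id == entity_id:
            linked.add(rel.object_id)
        elif rel.object_id == entity_id:
            linked.add(rel.subject_id)
    return linked
```

Even in a toy like this, each of the questions above turns into a design decision: what counts as an identifier, who maintains the vocabulary, whether dates are optional, and how an interpretation is distinguished from a statement in the source.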
We are looking at some of these issues through our AHRC project, Exploring British Design. They are all issues that archivists need to explore in a debate around relationship information, but the first issue to consider is simply whether we should be thinking more about including this kind of relationship information in our archival finding aids. Is it something that would be of real value to end users? This issue is coming more to the fore as we start to think about implementing ISAAR (CPF) and working with EAC-CPF, and also as Linked Open Data gains traction.
In a (well worth reading) recent article in the Journal of Contemporary Archival Studies, on the potential impact of EAC-CPF, K.M. Wisser reports the findings of a survey about relationship information. The survey received 208 responses from archivists/archives in the US. Wisser wrote: “The survey results indicate that the archival community has only just begun to consider relationships in the context of archival description and the role that explicit description of those relationships may play.”
As one respondent wrote:
“relationships are among the most important facets in a collection and deserve a high priority in description. One cannot understand the historical value of an event, person, or organization without knowing [the] relationship among and between them.”
One thing that really strikes me in Wisser’s findings is that archivists see relationships that are documented outside of the collection as almost as significant as those that are documented within the collection. Going back to our original topic of Jane Drew: who else did Jane Drew work with? Should we provide that information to our users, whether or not it is documented within the collection? Is our role to give as full an account as we can of Drew’s life and career? Is it to limit ourselves to what is within the collection?
Wisser’s survey asked respondents about the importance of relationship types. It is curious to me that archivists rated ‘collaborated with’ as a more important relationship than ‘studied with’; they rated a friendship as far more important when it was documented in the collection; and they rated ‘influenced by’ as generally not so important. I’m surprised that the respondents had such definite ideas about the relative importance of different types of relationships, especially when the majority appeared to agree with the importance of ‘objective cataloguing’.
In our Exploring British Design project, the work we did with researchers definitely confirmed to me the fairly self-evident observation that any relationship can be of major significance in research, even if it appears of minor significance within the archive, or indeed, within the literature in general. A brief collaboration may have been a crucial influence, a short friendship may have had hitherto unrealised impact, and anyway, the importance of the relationship depends upon the research you are doing. Researchers are not really aware of how challenging it is for us as information professionals to establish these kinds of relationships in ways that they can then access. But it is clear that this is the sort of connectivity they are after.
One of the challenges with documenting relationship types is that they can be hard to define. As Wisser notes:
“The concept of influence, however, proved the most problematic. Comments such as ‘influence is a squishy sort of relationship’ and ‘I think it would often be very difficult to prove that Entity A was influenced by Entity B’ indicate a notion of intangibility.”
The conclusion could be that we should leave well alone relationships that are hard to define. On the other hand, if we are in a position, as we research a collection, to highlight potential connections, that action could be of major value to a researcher, who may otherwise never know about a link that ends up being crucial to their particular research. The relationships that are easy to define are likely to have been defined already.
One thing that strikes me about the whole notion of introducing interpretation and opinion into cataloguing (a possible argument against defining relationships) is that the horse has pretty much bolted. I’ve looked at enough ‘objective’ descriptions to be aware that the names archivists choose to add as index terms are a choice; they inevitably have to be an opinion about the names significant enough to add as index terms. And subjects are a similar case – some collections are indexed thoroughly, some not at all.
Aside from indexing, each person would create a different scope and content entry, including and excluding different information, and whether you call that subjective or not, it is certainly always selective. You could also argue that the level of detailed hierarchical cataloguing might indicate the relative importance of the collection. On the Archives Hub there are some collections catalogued in huge detail, and it is inevitable that researchers will assume these collections are particularly important.
All of these choices have implications for discoverability.
In Wisser’s survey, a significant proportion of respondents felt that the importance of a relationship should be based upon the use of the collection. But this, again, raises the question: When thinking about relationships, is the cataloguer reflecting the scope of the collection, or are they trying to give as full a picture as they can of the person or organisation? Are we within the world of the collection; or is the collection within the world?
The reason that I believe that we should think beyond the bounds of the collection content is that I think it promises much richer rewards for our users and encourages archives to be a major player within a broader landscape of information resources. I base my thinking on the premise that the researcher is primarily interested in their research topic, which is not likely to be an archive collection per se, but rather an event, a person, an organisation, a subject, and the way things are connected. I think archivists are still tending to think in terms of a document that describes a collection, rather than how to link the collection into the cultural heritage landscape, and even more broadly beyond that. I wonder if archivists don’t always think beyond the catalogues they currently create because the researchers they have contact with (who visit the archive) are already fairly confident they want to use that repository, or a particular archive within that repository. In other words, the researcher is already in their space. When I worked in a specialist archive, I thought about researchers discovering our archive as a whole (having an online presence) and then I thought about them using our collections (individual collections each with their own description); I didn’t think about how our collections could be seen as part of a whole information landscape.
The loudest – and most convincing – argument I hear against this kind of approach is that it takes time, and archivists are short on time. But I wonder if that means we have to think fundamentally differently. Let’s go back to Jane Drew, and think about the value of relationships for research into her life and work…
If one archive collection description highlights just a few relationships, this could take us a long way (although relationship types are a whole different thing…). If the individuals and organisations are unambiguously identified, this can help with the process of creating links out to other data sources, so that information can be linked together; then we have the chance to benefit from finding out about relationships that have been defined elsewhere. In other words, the connections one person has throughout their life can only be fully realised through the pooling of information resources, very much a joint effort. If the data is structured it can potentially be brought together.
Traditional archival cataloguing focuses on the collection, and what is documented within the collection. It tends to think in terms of a self-contained document. Pursuing relationships breaks the bounds of any one information source. That seems like a good thing, but it raises questions around approaches to cataloguing. One obvious way to tackle this is to start to think more about archival authority records. These should enable us to move beyond a collection-centric description of the collection and towards a more entity based approach, because you describe an agent (entity) independently of any one archival collection. Another option is to think in a Linked Data way, where you are concentrating on entities and relationships.
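As a rough illustration of what an entity-based description might look like, the sketch below builds a small EAC-CPF-style fragment for one agent and its relationships, independently of any collection. It is deliberately simplified and rests on several assumptions: it leaves out the namespace and the control section, so it would not validate against the schema; the element and attribute names reflect my understanding of EAC-CPF and should be checked against the published standard; and the name forms and relationship types chosen are hypothetical.

```python
import xml.etree.ElementTree as ET


def authority_fragment(name, entity_type, relations):
    """Build a simplified, non-validating EAC-CPF-style fragment.

    relations is a list of (relation_type, related_name) pairs.
    """
    root = ET.Element("eac-cpf")
    description = ET.SubElement(root, "cpfDescription")

    # Who or what the record describes.
    identity = ET.SubElement(description, "identity")
    ET.SubElement(identity, "entityType").text = entity_type
    name_entry = ET.SubElement(identity, "nameEntry")
    ET.SubElement(name_entry, "part").text = name

    # Relationships to other agents, each with a typed link.
    relations_el = ET.SubElement(description, "relations")
    for relation_type, related_name in relations:
        rel = ET.SubElement(
            relations_el, "cpfRelation", {"cpfRelationType": relation_type}
        )
        ET.SubElement(rel, "relationEntry").text = related_name
    return root


# Hypothetical name forms and relationship types, for illustration only.
record = authority_fragment(
    "Piper, John",
    "person",
    [("family", "Evans, Myfanwy"), ("associative", "Betjeman, John")],
)
print(ET.tostring(record, encoding="unicode"))
```

The point of the sketch is simply that the agent is the root of the record, with relationships hanging off it, rather than the agent being a paragraph inside the description of one collection.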
There are so many questions raised by the whole area of entities and relationships. A few of my current conclusions are:
We should primarily be led by what benefits research. Researchers are far less likely to think in terms of individual archive collections, and far more likely to think in terms of research areas (topics). The Web gives us the opportunity to think in a broader context.
Maybe it is worth considering taking some of the time used to provide a really detailed biographical history as an unstructured narrative, or the time to provide a really detailed multi-level description, and taking more time to provide (or provide the potential for) connections between our descriptions and the larger information environment. This could allow researchers to bring together much more comprehensive information, even if what we provide about individual collections is less detailed. Just adding something like a VIAF identifier to a name would be a great big leap forwards (http://viaf.org/viaf/51792789).
There is great value in being a small fish in a big pond, because most researchers are fishing for data in the big pond. As Wisser’s article says, “relationships are…seen to free collections from the isolation of individual repositories.” If we aim to be part of the big pond, we can continue to tend our smaller ponds as well!
To go back to the Piper Collection and Jane Drew… I used this as a random example, thinking of a researcher interested in one particular designer. But of course, the Tate Gallery Archive can’t be expected to define all the relationships within the description. It’s great that they have provided enough detail to find this one individual item – without that, we would not know about the connection with Jane Drew. I’m arguing for unambiguously identifying entities (people, organisations) because if we can potentially link this instance of ‘Jane Drew’ to other instances in other information sources, then it is very possible that we can find out more about this relationship; and if the relationship can’t be established through other sources, then maybe this archive provides unique evidence of a connection that could significantly benefit research.
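The idea of linking instances of a name across information sources can be sketched quite simply. The example below is only an illustration under stated assumptions: the records, source names and name forms are invented, and `example-id-1` is a placeholder rather than a real VIAF number. It groups records that carry the same identifier and sets aside those that have none, since a name string alone is not safe to match on.

```python
from collections import defaultdict

VIAF_BASE = "http://viaf.org/viaf/"


def viaf_uri(viaf_id):
    """Build a VIAF-style URI from an identifier."""
    return f"{VIAF_BASE}{viaf_id}"


def group_by_identifier(records):
    """Group records from any source by their shared identifier.

    Records without an identifier cannot be matched safely,
    so they are returned separately.
    """
    linked = defaultdict(list)
    unlinked = []
    for record in records:
        identifier = record.get("identifier")
        if identifier:
            linked[identifier].append(record)
        else:
            unlinked.append(record)
    return dict(linked), unlinked


# Hypothetical records; the sources, name forms and identifier are invented.
records = [
    {"source": "archive-a", "name": "Drew, Jane B.", "identifier": viaf_uri("example-id-1")},
    {"source": "library-b", "name": "Jane Drew", "identifier": viaf_uri("example-id-1")},
    {"source": "archive-c", "name": "Jane Drew", "identifier": None},
]

linked, unlinked = group_by_identifier(records)
for identifier, group in linked.items():
    print(identifier, [r["source"] for r in group])
print("Not yet identified:", [r["source"] for r in unlinked])
```

Note that the two linked records use different forms of the name, while the unlinked record shares a name form with one of them. It is the identifier, not the string, that makes the connection reliable.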
How might a website co-designed by researchers, rather than a top-down collection-defined approach to archive content, enhance engagement with and understanding of British design?
The workshops that we have run were one of the key ways that we hoped to understand more about how postgraduates and others research their topics, what they liked and didn’t like about websites, and in a general sense how they think and understand resources, and how we can tune into that thinking.
In the blog posts that we have created so far, we set out one of our central ideas:
Providing different routes into archives, showing different contexts, and enabling researchers to create their own narratives, can potentially be achieved through a focus on the ‘real things’ within an archive description; the people, organisations and places, and also the events surrounding them.
The feedback from the workshops gave us plenty to work with, and here I wanted to draw out some of the key messages that we are using to help us design an interface.
Researchers often think visually
Several of the participants in our workshops were visual thinkers. Maybe we had a slightly biased group, in that they work within or study design, but it seems reasonable to conclude that a visual approach can be attractive and engaging. We want to find a way to represent information more visually, whilst providing a rich and detailed resource. Our belief is that the visual should not dominate or hide the textual, as does often happen with cultural heritage resources, but that they should work better together.
Researchers often think in terms of creating a story or narrative
When we asked our participants to focus on an individual object, several of them thought in terms of its ‘story’. It seemed to me that most of the discussions that we had assumed a narrative type approach. It is hardly surprising, as when we talk about people, places and events we connect them together. It is a natural thing to do.
Different types of contexts provide value
When we asked workshop participants to think about how they would go about researching the object they were given, they tended to think of ways to contextualise it. They were interested in where it came from, in its physicality and its story. For example, we gave out photographs of an exhibition and they wanted to know where the photographs were taken, more about the exhibition and the designers involved in it, what else was going on at that time? Our idea with Exploring British Design is that we can create records that allow these kinds of contexts to flourish. The participants did not concentrate on traditional archival context, as they did not tend to recognise this in the same way as archivists – it is one perspective amongst many.
We cannot provide a substitute for the value of handling the original object, and it was clear that researchers found this to be immensely valuable, but we can help to provide context that helps to scope reality.
Uncovering the obscure is a good thing
Not surprisingly, our workshop participants were keen that their research efforts should result in finding little-known information that they could utilise. They talked about the excitement of uncovering information and the benefits for their work.
Habits are part of the approach to research
The balance between being innovative and anchoring an interface in what people are familiar with seems to be important.
Trust is very important
The importance of trust was stressed at all of our workshops, and the need to know the context of information. We need to build something that researchers believe is a quality resource, with information they can rely on.
Serendipity is good…although it can lead you astray
It was clear that our participants wanted to explore, and liked the idea of coming across the unexpected. Several of them felt that the library bookshelves provide a good opportunity to browse and discover new sources (they talked about this more than the serendipity of the web). But there was also a note of caution about time wasted pursuing different avenues of information. It seems good to build in serendipity, whilst providing an interface that gives clear landmarks and signposts.
Search and Relevance
Our workshop participants were clear that choice of search terms has a big influence on what you find, and this can be a disadvantage. You may be presented with a search box, and you don’t really know what to search for to get what you want, especially if you don’t know what you want! Also, the relevance ranking can be a puzzle. Library databases often seem to give results that don’t make that much sense.
One thing that stood out to me was the contrast between the willingness to use Google, which is a simple search box, with no indication of how to search, that brings back huge amounts of results, and the criticisms of library databases, where choice of search term is crucial and where ‘too many results’ are seen as a problem. It seemed that the key here was effective relevance ranking, but our workshop participants did agree that relevance ranking can deceive: the first page of results may look good, but you don’t really know what you are missing. Google is good at providing a first page of useful looking results… and maybe that’s enough to stop most people wondering about what they might be missing!
Exploring British Design
As our project has progressed, I think it is fair to say that we have benefitted hugely from the input of the students and academics that we have talked to, not only for this project but also more generally. But it was not possible for us to manage to implement a directly co-designed website. The logistics of the project didn’t allow for this, as we wanted to gather input to inform the project, and then we had the complications of pulling together the data, designing the back end and the API. We would probably have needed at least another 6 months on the project to go back to the workshop participants and ask them about the website design as we went along.
But I think we have achieved a good deal in terms of engagement. Our Exploring British Design project has been about other ways through content, moving away from a search box and a list of search results, and thinking about immersing researchers in a ‘landscape’, where they can orientate themselves but also explore freely. So, we are thinking about engagement in terms of a more visually attractive and immersive experience, giving researchers the opportunity to follow connections in a way that gives them a sense of movement through the design landscape, hints at the unknown, and shows the relevancy of the entities that are featured in the website. We hope to show how this can potentially expand understanding because it allows for a wider context and more varied narratives.
In the next project post we hope to present our interface for this pilot project!
Last week I attended a very full and lively Europeana Tech conference. Here are some of the main initiatives and ideas I have taken away with me:
Think in terms of improvement, not perfection
Do the best you can with what you have; incorrect data may not be as bad as we think, and maybe users’ expectations are changing, so that they are increasingly willing to work with incomplete or imperfect data. Some of the speakers talked about successful crowd-sourcing – people are often happy to correct your metadata for you and a well thought-out crowd-sourcing project can give great results.
The British Library currently have an initiative to encourage tagging of their images on Flickr Commons and they also have a crowd-sourcing geo-referencer project.
The Cooper Hewitt Museum site takes a different and more informal approach to what we might usually expect from a cultural heritage site. The homepage goes for an honest approach:
“This is a kind of living document, meaning that development is ongoing — object research is being added, bugs are being fixed, and erroneous terms are being revised. In spite of the eccentricities of raw data, you can begin exploring the collection and discovering unexpected connections among objects and designers.”
The ‘here is some stuff’ and ‘show me more stuff’ type of approach was noticeable throughout the conference, with different speakers talking about their own websites. Seb Chan from the Cooper Hewitt Museum talked about the importance of putting information out there: even if you have very little, it is better than nothing (e.g. https://collection.cooperhewitt.org/objects/18446665).
The speaker from Google, Chris Welty, is best known for his work on ontologies in the Semantic Web and IBM’s Watson. He spoke about cognitive computing, and his message was ‘maybe it’s OK to be wrong’. Something may well still be useful, even if it is not perfectly precise. We are increasingly understanding that the Web is in a state of continuous improvement, and so we should focus on improvement, not perfection. What we want is for mistakes to decrease, and for new functionality not to break old functionality. Chris talked about the importance of having a metric – something that is believable – that you can use to measure improvement. He also spoke about what is ‘true’ and the need for a ‘ground truth’ in an environment where problems often don’t have a right or wrong answer. What is the truth about an image? If you show an image to a human and ask them to talk about it they could talk for a long time. What are the right things to say about it? What should a machine see? To know this, or to know it better, Chris said, Google needs data – more and more and more data. He made it clear that the data is key and it will help us on the road to continuous improvement. He used the example of searching for pictures of flowers using Google to find ‘paintings with flowers’. If you did this search 5 years ago you probably wouldn’t get just paintings with flowers. The search has improved, and it will continue to improve. A search for ‘paintings with tulips’ now is likely to show you just tulips. However, he gave the example of ‘paintings with flowers by French artists’ – a search where you start to see errors as the results are not all by French artists. A current problem Google are dealing with is mixed language queries, such as ‘paintings des fleurs’, which opens a whole can of worms. But Chris’ message was that metadata matters: it is the metadata that makes this kind of searching possible.
The Success of Failure
Related to the point about improvement, the message is that being ‘wrong’ or ‘failing’ should be seen in a much more positive light. Chris Welty told us that two thirds of his work doesn’t make it into a live environment, and he has no problem with that. Of course, it’s hard not to think that Google can afford to fail rather more than many of us! But I did have an interesting conversation with colleagues, via Twitter, around the importance of senior management and funders understanding that we can learn a great deal from what is perceived as failure, and we shouldn’t feel compelled to hide it away.
Think in terms of Entities
We had a small group conversation where this came up, and a colleague said to me ‘but surely that’s obvious’. But as archivists we have always been very centered on documents rather than things – on the archive collection, and the archive collection description. The trend that I was seeing reflected at Europeana Tech continued to be towards connections, narratives, pathways, utilising new tools for working with data, for improving data quality and linking data, for adding geo-coordinates and describing new entities, for making images more interoperable and contextualising information. The principle underlying this was that we should start from the real world – the real world entities – and go from there. Various data models were explored, such as the Europeana Data Model and CIDOC CRM, and speakers explained how entities can connect, and enable a richer landscape. Data models are a tricky one because they can help to focus on key entities and relationships, but they can be very complex and rather off-putting. The EDM seems to split the crowd somewhat, and there was some criticism that it is not event-based like CIDOC CRM, but the CRM is often criticised for being very complex and difficult to understand. Anyway, setting that aside, the overall message was that relationships are key, however we decide to model them.
Cataloguing will never capture everyone’s research interests
An obvious point, but I thought it was quite well conveyed in the conference. Do we catalogue with the assumption that people know what they need? What about researchers interested in how ‘sad’ is expressed throughout history, or fashions for facial hair, or a million other topics that simply don’t fit in with the sorts of keywords and subject terms we normally use. We’ll never be able to meet these needs, but putting out as much data as we can, and making it open, allows others to explore, tag and annotate and create infinite groups of resources. It can be amazing and moving, what people create: Every3Minutes.
There’s so much out there to explore….
There are so many great looking tools and initiatives worth looking at, so many places to go and experiment with open data, so many APIs enabling so much potential. I ended up with a very long list of interesting looking sites to check out. But I couldn’t help feeling that so few of us have the time or resource to actually take advantage of this busy world of technology. We heard about Europeana Labs, which has around 100 ‘hardcore’ users and 2,200 registered keys (required for API use). It is described as “a playground for remixing and using your cultural and scientific heritage. A place for inspiration, innovation and sharing.” I wondered if we would ever have the time to go and have a play. But then maybe we should shift focus away from not being able to do these things ourselves, and simply allow others to use the data, and to adopt the tools and techniques that are available – people can create all sorts of things. One example amongst many we heard about at the conference is a cultural collage: zenlan.com/collage. It comes back to what is now quite an old adage, ‘the best innovation may not be done by you’. APIs enable others to innovate, and what interests people can be a real surprise. Bill Thompson from the BBC referred to a huge interest in old listings from Radio Times, which are now available online.
The International Image Interoperability Framework
I list the IIIF because it jumped out at me as a framework that seems to be very popular – several speakers referred to it, and in very positive terms. I hadn’t heard of it before, but it seemed to be seen as a practical means to ensure that images are interoperable, and can be moved around different systems.
One of my favourite thoughts from the conference, from the ever-inspirational Tim Sherratt, was that big ideas should enable little ideas. The little ideas are often what really makes the world go round. You don’t have to always think big. In fact, many sites have suffered from the tendency to try to do everything. Just because you can add tons of features to your applications, it doesn’t mean you should.
The Importance of Orientation
How would you present your collections if you didn’t have a search box? This is the question I asked myself after listening to George Oates, from Good Form and Spectacle. She is a User Interface expert, and has worked on Flickr and for the Internet Archive amongst other things. I thought her argument about the need to help orientate users was interesting, as so often we are told that the ‘Google search box’ is the key thing, and what users expect. She talked about some of her experiments with front end interfaces that allow users to look at things differently, such as the V&A Spelunker. She spoke in terms of landmarks and paths that users could follow. I wonder if this is easier said than done with archives without over-curating what you have or excluding material that is less well catalogued, or does not have a nice image to work with. But I certainly think it is an idea worth exploring.
We recently ran a second workshop as part of our Exploring British Design project. The workshops aim to understand more about approaches to research, and researchers’ understanding and use of archives.
The second workshop was run largely on the same basis as the first workshop, using the same exercises.
Looking at what our researchers said and documented about their research paths over the two workshops, some points came out quite strongly:
Google is by far the most common starting point, but its shortcomings are clear and issues of trust come up frequently.
There is often a strong visual emphasis to research, including searching for images and the use of Pinterest; there seems to be a split between those who gravitate towards a more text-based approach and those who think visually (many of our participants were graphic designers though!).
It is common to utilise the references listed in Wikipedia articles.
The library as a source is seen as part of a diverse landscape – it is one place to go to, albeit an important one. It is not the first port of call for the majority.
Aggregators are not specifically referred to very often. But they may be seen as a place to go if other searches don’t yield useful results.
Talking to people is very important, be it lecturers, experts, colleagues or friends.
Online research is more immediate, and usually takes less effort, but there are issues of trust and it may not yield specific enough results, or uncover the more obscure sources.
There is a tendency to start from the general and work towards the more specific. With the research paths of most of the researchers, the library/archive was somewhere in the middle of this process.
Personal habits and past experience play a very large part, but there is a real interest in finding new routes through research, so habit is not a sticking point, but simply the dominant influence unless it is challenged.
For the second workshop, the first exercise asked participants to document their likely research paths around a topic.
We had four pairs of researchers looking at different topics, and we left them to discuss their research paths for about 45 minutes. The discussions following the exercise picked up on a number of areas:
Online vs Offline
We kicked off by asking the researchers about online versus ‘offline’ research paths. One participant commented that she saw online as a route through to traditional research – maybe to locate a library or archive – ‘online is telling me where to look’ but in itself it is too general and not specific enough; whereas the person she was paired with tended to do more research online. He saw online as giving the benefit of immediacy – at any time of day or night he could access content. The issue of trust came up in the discussion around this issue, and one participant summed it up nicely: “If you do online research there is less effort but there is less trust; if you research offline there is more effort but there is more trust.”
Following on from the discussion about how people go about using online services, there was a comment that things found online are often the more obvious, the more used and cited resources. Visiting a library or archive may give more opportunity to uncover little known sources that help with original research. This seemed to be endorsed by most participants, one commenting that Pinterest tends to reflect what is trendy and popular. However, there was also a view that something like Pinterest can lead researchers to new sources, as they are benefiting from the efforts, and sometimes the quite obsessive enthusiasms, of a wide range of people.
There was agreement that online research can lead to ‘information dumping’, where you build up a formidable collection of resources, but are unlikely to get round to sorting them all out and using them.
The issue of effort came up later in the discussion when referring to a particular university library (probably typical of many university libraries), and the amount of effort involved in using its databases. There was a comment about how you need to ‘work yourself up to an afternoon in the library’ and there seemed to be a general agreement that the ‘search across all resources’ often produced quite meaningless results. When compared to Google, the issue seems to be that relevance ranking is not effective, so the top results often don’t match your requirements. There was also some discussion around the way that library resource discovery services often involve too many steps, and there is effort in understanding how the catalogue works. One participant, whose research centres on the Web and the online user experience, felt that printed sources were of little use to him, as they were out of date very quickly.
Curating your sources
One researcher talked about using Pinterest to organise findings visually. This was followed up by another researcher talking about how with online research you can organise and collect things yourself. It facilitates ‘curating’ your own collection of resources. It can also be easier to remember resources if they are visual. Comparing Pinterest to the Library – with the former you click to add the image to your board; with the Library you pay a visit, you find the book, you take it to the scanner, you pay to take a scan…although it is increasingly possible to take pictures of books using your own device. But the general feeling was that the Web was far quicker and more immediate.
Attitudes towards research
One participant felt that there might be a split between those more like him who see research as ‘a means to an end’ and those who enjoy the process itself. So maybe some are looking for the shortest route to the end goal, and others see research as a more exploratory activity and expect it to take time and effort. This may partly be a result of the nature and scope of the research. Short time scales preclude in-depth research.
Talking about serendipitous approaches, someone commented that browsing the library shelves can be constructive, as you can find books around your subject that you weren’t aware existed. This is replicated to some extent in something like Amazon, which suggests books you might be interested in. There was also some feeling that exploring too many avenues can take the researcher off topic and take up a great deal of time.
Trust and Citation
The issue of trust is important. A first-hand experience, whether of a place you are researching, or using physical archive sources, is the most trustworthy, because you are seeing with your own eyes, experiencing first hand or looking at primary sources first hand; a library provides the next level of trust, as a book is an interpretation, and you may feel it requires corroboration; the online world is the least trustworthy. You will have the least trust if you are looking at a website where you don’t know who or what is behind it. There was agreement that trust can come through crowd sourced information, but also some discussion around how to cite this (for example, using the Harvard system to reference web pages and crowd sourced resources). This led on to a short discussion around the credibility of what is cited within research. Maybe attitudes to Wikipedia are slowly changing, but at present there is generally still a feeling that a researcher cannot cite it as a source. There are traditions within disciplines around how to cite and what are the ‘right’ things to cite.
[Further posts on Exploring British Design will follow, with reflections on our workshops and updates on the project generally]
The BL Labs is an initiative funded by the Mellon Foundation that invites researchers and developers to work with the BL and their digital data to address research questions. The Symposium 2014 showcased some of the work funded by the Labs, presenting innovative and exploratory projects that have been funded through this initiative. This year’s competition winners are the Victorian Meme Machine, creating a database of Victorian jokes, and a Text to Image Linking Tool (TILT) for linking areas on a page image and a clear transcription of the content.
Tim Hitchcock, Professor of Digital History from the University of Sussex, opened with a great keynote talk. He started out by stressing the role of libraries, archives and museums in preserving memory and their central place in a complex ecology of knowledge discovery, dissemination and reflection. He felt it was essential to remember this when we get too caught up in pursuing shiny new ideas. It is important to continually rethink what it is to be an information professional; whilst also respecting the basic principles that a library (archive, museum) was created to serve.
Tim Hitchcock’s talk was Big Data, Small Data and Meaning. He said that conundrums of size mean there is a danger of a concentration on Big Data and a corresponding neglect of Small Data. But can we view and explore a world encompassing both the minuscule and the massive? Hitchcock introduced the concept of the macroscope, a term coined in a science fiction novel by Piers Anthony back in 1970. He used this term in his talk to consider the idea of a macro view of data. How has the principle of the macroscope influenced the digital humanities? Hitchcock referred to Katy Börner’s work with Plug-and-Play Macroscopes: “Macroscopes let us observe what is at once too great or too slow or too complex for the human eye and mind to notice and comprehend.” (See http://vimeo.com/33413091 for an introductory video).
Hitchcock felt that ideally macroscopes should be used to observe patterns across large data and at the same time show the detail within small data. The way that he talked about Big Data within the context of both the big and the small helped me to make more sense of Big Data methods. I think that within the archive community there has been something of a collective head scratching around Big Data; what its significance is, and how it relates to what we do. In a way it helps to think of it alongside the analysis that Small Data allows researchers to undertake.
Hitchcock gave some further examples of Big Data projects. Paper Machines is a plugin for Zotero that enables topic modelling analysis. It allows the user to curate a large collection of works and explore its characteristics with some great results; but the analysis does not really address detail.
The History Manifesto, by Jo Guldi and David Armitage, talks about how Big Data might be used to redefine the role of Digital Humanities. But Hitchcock criticised it for dismissing micro-history as essentially irrelevant.
“distant reading occludes as much as it reveals, resulting in significant ethical breaches in our digital world. Network analysis and the humanities offers us a way out, a way to bridge personal stories with the big picture, and to bring a much-needed ethical eye to the modern world.”
Hitchcock posited that the large scale is often seen as a route to impact in policy formation, and this is an attractive inducement to think large. In working on a big data scale, Humanities can speak to power more convincingly; it can lead to a more powerful voice and more impact.
We were introduced to Ben Schmidt’s work, Prochronisms. This uses TV anachronisms to learn about changes in language, working at different scales of textual analysis. Schmidt has done some work around particular TV programmes and films, looking at both the overall use of language and the specifics of word use. One example of his work is the analysis of 12 Years a Slave:
‘the language Ridley introduces himself is full of dramatically modern words like “outcomes,” “cooperative,” and “internationally:” but that where he sticks to Northup’s own words, the film is giving us a good depiction of how things actually sounded. This is visible in the way that the orange ball is centered much higher than the blue one: higher translates to “more common than then now.”‘
Schmidt gives very entertaining examples of anachronisms, for example, the use of ‘parenting a child’ in the TV drama series Downton Abbey, which only shows up in literature 5 times during the 1920s and in a rather different context to our modern use; his close reading of context also throws up surprises, such as his analysis of the use of the word ‘stuff’ in Downton Abbey (as in ‘family stuff’ or ‘general stuff’), which does not appear to be anachronistic and yet viewers feel that it is a modern term. (A word of warning: the site is fascinating and it’s hard to stop reading it once you start!)
Professor Hitchcock gave this work as an example of using a macroscope effectively to combine the large and the small. Schmidt reveals narrative arcs; maybe showing us something that hasn’t been revealed before…and at the same time creates anxiety amongst script writers with his stark analysis!
Viewing data on a series of scales simultaneously seems a positive development, even with the pitfalls. But are humanists privileging social science types of analysis over more traditional humanist ones? Working with Big Data can be hugely productive and fun, and it can encourage collaboration, but are humanist scholars losing touch with what they traditionally do best? Language and art, cultural construction and human experience are complex things. Scholars therefore need to encompass close reading and Small Data in their work in order to get a nuanced reading. Our urge towards the all-inclusive is largely irresistible, but in this fascination we may lose the detail. The global image needs to be balanced with a view from the other end of the macroscope.
It is important to represent and mobilise the powerless rather than always thinking about the relationship to the powerful; to analyse the construct of power rather than being held in the grip of power and technology. Histories of small things are often what gives voice to those who are marginalised. Humanists should encompass the peculiar and eccentric; they should not ignore the power of the particular.
Of course, Big Data can have huge and fundamental results. The discovery of the Higgs particle was the result of massive data crunching and finding a small ‘bump’ in the data that gave evidence to support its existence. The other smaller data variations needed to be ignored in this scenario. It was a case of millions of rolls of the dice to discover the elusive particle. But if this approach is applied across the board, the assumption is that the signal, or the evidence, will come through, despite the extraneous blips and bumps. It doesn’t matter if you are using dirty data because small hiccups are just ignored. But humanists need to read data with an eye to peculiarities and they should consider the value of digital tools that allow them to think small.
Hitchcock believes that to perform humanities effectively we need to contextualise. And the importance of context is never lost to an archivist, as this is a cornerstone of our work. Big Data analysis can lose this context; Small Data is all about understanding context to derive meaning.
Using the example of voice onset timing, which refers to the tiny breathy gap before speaking, Hitchcock showed that a couple of milliseconds of empty space can demand close reading, because it actually changes depending on who you are talking to, and it reveals some really interesting findings. A Big Data approach would simply miss this fascinating detail.
Big data has its advantages, but it can mean that you don’t look really closely at the data set itself. There is a danger you present your results in a compelling graph or visualisation, but it is hard to see whether it is a flawed reality. You may understand the whole thing, and you can draw valuable conclusions, but you don’t take note of what the single line can tell you.
As part of our Exploring British Design project we are organising workshops for researchers, aiming to understand more about their approaches to research, and their understanding and use of archives. Our intention is to create an interface that reflects user requirements and, potentially, explores ideas that we gather from our workshops.
Of course, we can only hope to engage with a very small selection of researchers in this way, but our first workshop at Brighton Design Archive showed us just how valuable this kind of face-to-face communication can be.
We gathered together a small group of 7 postgraduate design students. We divided them into 4 groups (3 pairs of researchers and a lone researcher), and we asked them to undertake 2 exercises. This post is about the first exercise and follow up discussion. For this exercise, we presented each group with an event, person or building:
The Festival of Britain, 1951
Black Eyes and Lemonade Exhibition, Whitechapel Art Gallery, 1951
Natasha Kroll (1912-2004)
Simpsons of Piccadilly, London
We gave each group a large piece of paper, and simply asked them to discuss and chart their research paths around the subject they had been given. Each group was joined by a facilitator, who was not there to lead in any way, but just to clarify where necessary, listen to the students and make notes.
I worked with two design students, Richard and Caroline, both postgraduate students researching aspects of design at The University of Brighton. They were looking at the subject of the Festival of Britain (FoB). It fascinated me that even when they were talking about how to represent their research paths, one instinctively went to list their methods, the other to draw theirs, in a more graphic kind of mind map. It was an immediate indication of how people think differently. They ended up using the listing method (see left).
The above represents the research paths of Richard and Caroline. It became clear early on that they would take somewhat different paths, although they went on to agree about many of the principles of research. Caroline immediately said that she would go to the University library first of all and then probably the central library in Brighton. It is her habit to start with the library, mainly because she likes to think locally before casting the net wider; she also prefers the physicality of the resources to the virtual environment of the Web. She likes the opportunity to browse, and to consider the critical theory that is written around the subject as a starting point. Caroline prefers to go to a library or archive and take pictures of resources, so that she can then work through them at her leisure. She stressed the importance of being able to do this, and noted that high charges for the use of digital cameras can inhibit research.
Richard started with an online search. He thought about the sort of websites that he would gravitate towards – sites that were directly about the topic, such as an exhibition website. He referred to Wikipedia early on, but saw it as a potential starting place to find links to useful websites, through the external links that it includes, rather than using the content of Wikipedia articles.
Richard took a very visual approach. He focused in on the FoB logo (we used this as a representation of the Festival) and thought about researching that. He also talked about whether the FoB might have been an exhibition that showcased design, and liked the idea of an object-based approach, researching things such as furniture or domestic objects that might have been part of the exhibition. It was clear that his approach was based upon his own interests and background as a film maker. He focused on what interested and excited him; the more visual aspects including the concrete things that could be seen, rather than thinking in a text-based way.
Caroline had previous experience of working in an archive, and her approach reflected this, as well as a more text-based way of thinking. She talked about a preference for being in control of her research, so using familiar routes was preferable. She would email the Design Archives at Brighton, but that was not top of the list because it was more of an unknown quantity than the library that she was used to. Maybe because she has worked in an archive, she referred to using film archives for her research; whereas Richard, although a film maker, did not think of this so readily. Past experience was clearly important here.
Both researchers saw the library as a place for serendipitous research. They agreed that this browsing approach was more effective in a library than online. They were clearly attracted to the idea of searching the library shelves, and discovering sources that they had not known about. I asked why they felt that this was more effective than an online exploration of resources. It seemed to be partly to do with the dependency of the physical environment and also because they felt that the choice of search term online has a substantial effect on what is, and isn’t, found.
Both researchers were also very focused on issues of trust; both were very much of the opinion that they would assess their sources in terms of provenance and authorship.
In addition, they liked the idea of being able to search by user-generated tags and to have the ability to add tags to content.
In the general discussion some of the points made in the case study were reinforced. In summary:
Participants found the exercise easy to do. It was not hard to think about how they would research the topics they were given. They found it interesting to reflect on their research paths and to share this with others.
For one other participant the library was the first port of call, but the majority started online.
Some took a more historical approach, others a much more narrative and story-based approach. There were different emphases, which seemed to be borne out of personality, experiences and preferences. For example, some thought more about the ordering of the evidence, others thought more about what was visually stimulating.
It was therefore clear that different researchers took different approaches based on what they were drawn to, which usually reflected their interests and strengths.
There was a strong feeling about trust being vital when assessing sources. Knowing the provenance of an article or piece of writing was essential.
The participants agreed that putting time and effort into gathering evidence is part of the enjoyment of research. One mentioned the idea that ‘a bit of pain’ makes the end result all the more rewarding! They were taken aback at the idea that discovery services feel pressured to constantly simplify in order to ensure that we meet researchers’ needs. They understood that research is a skill and a process that takes time and effort (although, of course, this may not be how the majority of undergraduates or more inexperienced researchers feel). Certainly they agreed that information must not be withheld; it must be accessible. We (service providers) need to provide signposts, to allow researchers to take their own paths. There was discussion about ‘sleuthing’ as part of the research process, and trying unorthodox routes, as chance discoveries may be made. But there was consensus that researchers do not need or wish to be nannied!
All researchers did use Google at some point, usually to start their search. Funnily enough, some participants had quite long discussions about what they would do, before they realised they would actually have gone to Google first of all. It is so common now that most people don’t think about it. It seemed to operate very much as a starting point, from where the researchers would go to sites, assess their worth and ensure that the information was trustworthy.
[There will be follow up posts to this, providing more information about our researcher workshops, summarising the second activity, which was more focused on archive sources, and continuing to document our Exploring British Design project.]
At the moment, the Archives Hub takes a largely traditional approach to the navigation and display of archive collections. The approach is predicated on hundreds of years of archival theory, expanded upon in numerous books, articles, conferences and standards. It is built upon “respect des fonds” and original order. Archival provenance tells us that it is essential to provide the context of a single item within the whole archive collection; this is required in order to understand and interpret said item.
ISAD(G) reinforces the ‘top down’ approach. The hierarchy of an archive collection is usually visualised as a tree structure, often using folders. The connections show a top-down or bottom-up approach, linking each parent to its child(ren).
This principle of archival hierarchy makes very good sense. The importance of this sort of context is clear: one individual letter, one photograph, one drawing, can only reveal so much on its own. But being able to see that it forms part of a series, and part of a larger collection, gives it a fuller story.
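As a rough illustration of this kind of hierarchical context, the sketch below models a description that knows its parent and walks upwards to recover the chain from collection to item. The class, the function and the example titles are hypothetical, chosen only for illustration; this is not how the Archives Hub stores or displays its data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Description:
    """One unit of description at any level of the hierarchy (hypothetical model)."""
    level: str                      # e.g. "collection", "series", "item"
    title: str
    parent: Optional["Description"] = None


def context_path(desc: Optional[Description]) -> list:
    """Return the titles from the top-level collection down to `desc`."""
    path = []
    while desc is not None:
        path.append(desc.title)
        desc = desc.parent          # step up one level towards the collection
    return list(reversed(path))


# Illustrative titles only, not real catalogue entries.
collection = Description("collection", "Example collection")
series = Description("series", "Correspondence", parent=collection)
letter = Description("item", "Letter to a colleague", parent=series)

print(" > ".join(context_path(letter)))
```

Read from the top down, the path supplies the series and collection that give a single letter its fuller story.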
However, I wonder if our strong focus on this type of context has meant that archivists have sometimes forgotten that there are other types of context, other routes through content. With the digital environment that we now have, and the tools at our disposal, we can broaden out our ambitions with regards to how to display and navigate through archives, and how we think of them alongside other sources of information. This is not an ‘either or’ scenario; we can maintain the archival context whilst enabling other ways to explore, via other interfaces and applications. This is the beauty of machine processable data – the data remains unchanged, but there can be numerous interfaces to the data, for different audiences and different purposes.
Providing different routes into archives, showing different contexts, and enabling researchers to create their own narratives, can potentially be achieved through a focus on the ‘real things’ within an archive description; the people, organisations and places, and also the events surrounding them.
This is a very simplified image, intended to convey the idea of extracting people, organisations and places from the data within archive descriptions (at all levels of description). Ideally, these entities and connections can be brought together within events, which can be built upon the principle of relationships between entities (i.e. a person was at a place at a particular time).
Exploring British Design is a project seeking to probe this kind of approach. By treating these entities as an important part of the ‘networks of things’, and by finding connections between the entities, we give researchers new routes through the content and the potential to tell new stories and make new discoveries. The idea is to explore ways to help us become more fully a part of the Web, to ensure that archives are not resources in isolation, but a part of the story.
For this project, we are focussing on a small selection of data, around British design, extracting entities from the Archives Hub data, and considering how the content within the descriptions can be opened up to help us put it into new contexts.
We are creating biographical records that can be used to include structured data around relationships, places and events. We aim to extract people from the archive descriptions in which they are ‘embedded’ so that we can treat them as entities – they can connect not only to archive collections they created or are associated with, but they can also connect to other people, to organisations, to events, to places and subjects. For example, Joseph Emberton designed Simpsons in Piccadilly, London, in 1936. There, we have the person, the building, the location and the time.
With this paradigm, the archive becomes one of the ‘nodes’ of the network, with the other entities equally to the fore, and the ability to connect them together shows how we can start to make connections between different archive collections. The idea is that a researcher could come into an archive from any type of starting point. The above diagram (created just as an example) includes ‘1970’s TV comedy’ through to the use of Portland stone, and it links the Brighton Design Archive, the V&A Theatre and Performance Archive and the University of the Arts London Archive. The long term aim is that our endeavours to open up our data will ensure that it can be connected to other data sources (that have also been made open); sources outside of our own sphere (the Archives Hub data). The traditional interface has its merits; certainly we need to continue to provide archival context and navigation through collections; but we can be more imaginative in how we think about displaying content. We don’t need to just have one interface onto our data. We need to ensure that archives are part of the bigger story, that they can be seen in all sorts of contexts, and they are not relegated to being a bit part, isolated from everything else.
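The idea of treating people, buildings, places and dates as connected entities, as in the Joseph Emberton example, can be sketched in a few lines of Python. This is only a minimal illustration under assumed names: the `Entity` and `Event` classes, the role labels and the `connections` function are hypothetical and do not represent the project's actual data model or EAC-CPF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A 'real thing' extracted from a description (hypothetical model)."""
    kind: str   # e.g. "person", "building", "place", "archive"
    name: str


@dataclass
class Event:
    """Brings entities together at a point in time."""
    label: str
    year: int
    participants: dict  # role name -> Entity


def connections(entity, events):
    """Return every other entity that shares at least one event with `entity`."""
    linked = set()
    for event in events:
        members = set(event.participants.values())
        if entity in members:
            linked |= members - {entity}
    return linked


# The example from the text: the person, the building, the location and the time.
emberton = Entity("person", "Joseph Emberton")
simpsons = Entity("building", "Simpsons")
piccadilly = Entity("place", "Piccadilly, London")

events = [
    Event(
        label="designed",
        year=1936,
        participants={"designer": emberton, "work": simpsons, "location": piccadilly},
    )
]

# Any entity can be the starting point, not only the archive or the person.
for start in (emberton, piccadilly):
    print(start.name, "->", sorted(e.name for e in connections(start, events)))
```

Because every entity is treated the same way, a lookup that starts from the place reaches the person and the building just as easily as one that starts from the person.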
Between March and June 2014 I conducted a piece of social media-oriented research on behalf of the Archives Hub, the primary purpose of which was to measure the impact of adding links from specific Wikipedia articles featuring Hub content on the traffic that comes into the Hub website. As well as providing the Hub administrators – and, indeed, the profession as a whole – with a gauge as to whether the amount of time invested in creating links is worthwhile when compared to the benefits of impact, this research benefitted me personally in that it allowed me the opportunity to potentially earn credits on the Archives & Records Association’s Registration Scheme, under the ‘Contributions to the profession’ category.
The first phase of the study involved me identifying twenty archival collections listed in the Hub, with no existing links to related Wikipedia pages, which I could treat as measurable research subjects. This was done simply by entering specific Hub collection level descriptions into the Wikipedia search engine. (If a link to the Hub had already been created, I eliminated that particular collection from the study.) In order to achieve a fair and balanced piece of research, I selected collections of a relatively similar size and status, and avoided those relating to any significant public events running concurrently with, or immediately prior to, the commencement of the research, e.g. local elections in England, the World Cup. My feeling was that such collections could have been subject to closer scrutiny from researchers while the study was underway, which, in turn, would have resulted in an unexpected increase in Hub-searching activity. This, in essence, would have undermined the credibility of the study. I also made sure that the Wikipedia pages I utilised didn’t already include links to the collection-holding repositories, as this could potentially sway researchers away from clicking the newly-created links to the Hub descriptions, thereby affecting the accuracy of research.
The twenty collections selected, along with their corresponding Wikipedia links, are shown in the table below.
Once the Hub collections and related Wikipedia pages had been identified, I then added new links to the individual pages using Wikipedia’s built-in editing tool. In the interests of consistency, I embedded each new link in the ‘External Links’ section on each of the pages I modified. I then used Google Analytics, in conjunction with an Excel spreadsheet, to collate and record Hub traffic data for each individual collection for the twelve-week period prior to the start of the study, specifically from the 22nd December, 2013 to the 15th March, 2014. This was done in order to enable me to generate a measurement of the overall impact of the newly-created links on incoming Hub traffic. The cumulative results for each collection, for the twelve-week period prior to the commencement of the study, are shown below.
Over the course of the next twelve weeks, from the 17th March, 2014 to the 7th June, 2014, I used Google Analytics once again to monitor incoming Hub traffic, with a reading being taken at the end of every fourth week in order to identify any significant traffic fluctuations or changes. The four-week hit statistics for each of the twenty collections are shown in the table below.
At the end of the twelve-week research period it was evident from the accumulated data that fourteen of the twenty collections had each experienced an increase in traffic compared to the previous twelve-week period. Indeed, of the fourteen, two collections, namely the Ramsay MacDonald Papers and the London South Bank University Archives, had each received well in excess of 100 additional hits compared to the pre-link period. Of the remaining six collections, only the Sadler’s Wells Theatre Archive had decreased in hits significantly, down 109 from the previous period. Although it isn’t possible to say definitively why this decrease occurred, it may have been due to the fact that at some point during the research, a new link had been added to the Sadler’s Wells Theatre Archive Wikipedia page giving researchers the option to examine ‘Archival material relating to Sadler’s Wells Theatre listed at the UK National Archives.’ Taking this modification into account, it seems fair to suggest that any researchers interested in the Sadler’s Wells Theatre material may have been drawn to this link description rather than the newly-added link to the Hub description essentially because it makes mention of the country’s principal archival repository, TNA.
The cumulative number of hits for each of the twenty collections during the research period is presented in the table below. This table also shows the positive and negative numerical differences in hits for each of the collections compared to the twelve-week period prior to the start of the research.
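For anyone wanting to repeat the comparison without a spreadsheet, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the collection names and hit counts are invented placeholders, not the figures recorded in this study.

```python
def compare_traffic(before, after):
    """Return {collection: hits_after - hits_before} for collections in both periods."""
    return {name: after[name] - before[name] for name in before if name in after}


def summarise(differences):
    """Group collections by whether their hits went up or down."""
    return {
        "increased": sorted(n for n, d in differences.items() if d > 0),
        "decreased": sorted(n for n, d in differences.items() if d < 0),
    }


# Placeholder figures, invented purely for illustration.
before = {"Collection A": 40, "Collection B": 120, "Collection C": 15}
after = {"Collection A": 65, "Collection B": 90, "Collection C": 15}

differences = compare_traffic(before, after)
summary = summarise(differences)
```

A collection whose hits are unchanged appears in neither group, which matches the way the table separates positive and negative differences.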
This piece of research has demonstrated that the simple task of linking online archival descriptions to a popular social media reference tool such as Wikipedia can yield extremely positive results. It has shown, moreover, that there are clear benefits, both for the archival repository/aggregator and the individual researcher, when catalogue data is linked and shared. Not only that, it has proven that a successful outcome can be achieved in a relatively short space of time, and, truth be told, with only a small amount of physical effort. The process of checking whether links from specific Hub collections already existed in Wikipedia and then adding them to the website if they didn’t, took little more than three hours to complete, and, for the most part, basically involved me copying data from one website and pasting it onto another. Ultimately, the sheer simplicity of this exercise, coupled with the knowledge that interest in the vast majority of the Hub collections increased as a result of the Wikipedia editing, confirms, to my mind at least, that archive services the world over – especially those blessed with a healthy number of volunteers – would benefit from embarking on linked data projects of this nature. After all, it’s like Benjamin Franklin said, “An investment in knowledge always pays the best interest.”
Back in 2008 the Archives Hub embarked upon a project to become distributed; the aim was to give control of their data to the individual contributors. Every contributor could host their own data by installing and running a ‘mini Hub’. This would give them an administrative interface to manage their descriptions and a web interface for searching.
Five years later we had 6 distributed ‘spokes’ for 6 contributors. This was actually reduced from 8, which was the highest number of institutions that took up the invitation to hold their own data out of around 180 contributors at the time.
The primary reason for the lack of success was identified as a lack of technical knowledge and the skills required for setting up and maintaining the software. In addition to this, many institutions are not willing to install unknown software or maintain an unfamiliar operating system. Of course, many Hub contributors already had a management system, and so they may not have wanted to run a second system; but a significant number did not (and still don’t) have their own system. Part of the reason many institutions want an out-of-the-box solution is that they do not have consistent or effective IT support, so they need something that is intuitive to use.
The spokes institutions ended up requiring a great deal of support from the central Hub team; and at the same time they found that running their spoke took a good deal of their own time. In the end, setting up a server with an operating system and bespoke software (Cheshire in this case) is not a trivial thing, even with step-by-step instructions, because there are many variables and external factors that impact on the process. We realised that running the spokes effectively would probably require a full-time member of the Hub team in support, which was not really feasible, but even then it was doubtful whether the spokes institutions could find the IT support they required on an ongoing basis, as they needed a secure server and they needed to upgrade the software periodically.
Another big issue with the distributed model was that the central Hub team could no longer work on the Hub data in its entirety, because the spoke institutions had the master copy of their own data. We are increasingly keen to work cross-platform, using the data in different applications. This requires the data to be consistent, and therefore we wanted to have a central store of data so that we could work on standardising the descriptions.
The Hub team spend a substantial amount of time processing the data, in order to be able to work with it more effectively. For example, a very substantial (and continuing) amount of work has been done to create persistent URIs for all levels of description (i.e. series, item, etc.). This requires rigorous consistency and no duplicate references. When we started to work on this we found that we had hundreds of duplicate references due to both human error and issues with our workflow (which in some cases meant we had loaded a revised description along with the original description). Also, because we use archival references in our URIs, we were somewhat nonplussed to discover that there was an issue with duplicates arising from references such as GB 234 5AB and GB 2345 AB, which become identical once the spaces are removed. We therefore had to change our URI pattern, which led to substantial additional work (we used a hyphen to create gb234-5ab and gb2345-ab).
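The clash between those two references is easy to reproduce. The sketch below is not our production code. It assumes that the earlier pattern simply lower-cased the reference and dropped the spaces, and that a reference always has the form "country code, repository code, local reference". All of the function names are illustrative.

```python
def old_style_id(reference):
    """Assumed earlier pattern: lower-case the reference and drop all spaces."""
    return "".join(reference.split()).lower()


def hyphenated_id(reference):
    """Join country and repository codes, then a hyphen, then the local reference."""
    parts = reference.split()
    if len(parts) < 3:
        raise ValueError(f"expected country, repository and local parts: {reference!r}")
    prefix = (parts[0] + parts[1]).lower()
    local = "".join(parts[2:]).lower()
    return f"{prefix}-{local}"


def find_duplicates(references, make_id):
    """Return {identifier: [references]} for identifiers shared by several references."""
    seen = {}
    for ref in references:
        seen.setdefault(make_id(ref), []).append(ref)
    return {key: refs for key, refs in seen.items() if len(refs) > 1}


refs = ["GB 234 5AB", "GB 2345 AB"]
clashes_before = find_duplicates(refs, old_style_id)   # both collapse to gb2345ab
clashes_after = find_duplicates(refs, hyphenated_id)   # no clash
```

Running a check like `find_duplicates` over the whole data set is one way of spotting clashes before the URIs are minted rather than after.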
We also carry out more minor data corrections, such as correcting character encoding (usually an issue with characters such as accented letters) and creating normalised dates (machine processable dates).
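As a rough illustration of what a normalised date means in practice, here is a minimal Python sketch. It is not our actual routine: it only handles a single year or a simple year range, and it assumes the target is the ISO 8601 style of writing a range as start/end. Anything it does not recognise is returned as None so that it can be checked by hand.

```python
import re


def normalise_date(text):
    """Return a machine-processable form for a year or a year range, else None."""
    text = text.strip()

    single = re.fullmatch(r"(\d{4})", text)
    if single:
        return single.group(1)

    # Accept a hyphen or an en dash between the two years, with optional spaces.
    span = re.fullmatch(r"(\d{4})\s*[-–]\s*(\d{4})", text)
    if span:
        start, end = span.groups()
        if int(start) <= int(end):
            return f"{start}/{end}"

    return None


examples = ["1914", "1914-1918", "1914 – 1918", "early twentieth century"]
normalised = [normalise_date(e) for e in examples]
```

Free-text dates such as "early twentieth century" are deliberately left alone here, since guessing a range for them is an editorial decision rather than a mechanical one.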
In addition to these types of corrections, we run validation checks and correct anything that is not valid according to the EAD schema, and we are planning, longer term, to set up a workflow such that we can implement some enhancement routines, such as adding a ‘personal name’ or ‘corporate name’ identifying tag to our creator names.
These data corrections/enhancements have been applied to data held centrally. We have tried to work with the distributed data, but it is very hard to maintain version control, as the data is constantly being revised, and we have ended up with some instances where identifying the ‘master’ copy of the data has become problematic.
We are currently working towards a more automated system of data corrections/enhancement, and this makes it important that we hold all of the data centrally, so that we ensure that the workflow is clear and we do not end up with duplicate, slightly different versions of descriptions. (NB: there are ways to work more effectively with distributed data, but we do not have the resources to set up this kind of environment at present – it may be something for the longer term).
We concluded that the distributed model was not sustainable, but we still wanted to provide a front-end for contributors. We therefore came up with the idea of the ‘micro sites’.
What are Hub Micro Sites?
The micro sites are a template-based local interface for individual Hub contributors. They use a feed of the contributor’s data from the central Archives Hub, so the data is only held in one place but accessible through both interfaces: the Hub and the micro site. The end-user performs a search on a micro site, the search request goes to the central Hub, and the results are returned and displayed in the micro site interface.
The principles underlying the micro sites are that they need to be:
As part of our aim of ensuring a sustainable and low-cost solution we knew we had to adopt a one-size-fits-all model. The aim is to be able to set up a new micro site with minimal effort, as the basic look and feel stays the same. Only the branding, top and bottom banners, basic text and colours change. This gives enough flexibility for a micro site to reflect an institution’s identity, through its logo and colours, but it means that we avoid customisation, which can be very time-consuming to maintain.
The micro sites use an open approach, so it would be possible for institutions to customise themselves, by manipulating the stylesheets. However, this is not something that the Archives Hub can support, and therefore the institution would need to have the expertise necessary to maintain this themselves.
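To give a flavour of the template idea, here is a small sketch of a stylesheet being generated from one shared template plus a handful of branding values. It is written in Python purely for illustration; the CSS selectors, the setting names and the example values are all hypothetical and are not taken from the real micro sites.

```python
from string import Template

# One shared template; only the branding values differ between institutions.
CSS_TEMPLATE = Template("""\
.banner-top { background-color: $primary_colour; }
.banner-top .logo { background-image: url("$logo_url"); }
a { color: $link_colour; }
""")


def build_stylesheet(branding):
    """Fill the shared template; raises KeyError if a branding value is missing."""
    return CSS_TEMPLATE.substitute(branding)


# Hypothetical branding values for one institution.
example_branding = {
    "primary_colour": "#004466",
    "logo_url": "/branding/example/logo.png",
    "link_colour": "#aa3300",
}
stylesheet = build_stylesheet(example_branding)
```

Using `substitute` rather than `safe_substitute` means a site with an incomplete set of branding values fails at build time instead of going live with a broken stylesheet.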
The Consultation Process
We started by talking to the Spokes institutions and getting their feedback about the strengths and weaknesses of the spokes and what might replace them. We then sent out a survey to Hub contributors to ascertain whether there would be a demand for the micro sites.
Institutions preferred the micro sites to be hosted by the Archives Hub. This reflects the lack of technical support within UK archives. This solution is also likely to be more efficient for us, as providing support at a distance is often more complicated than maintaining services in-house.
The responders generally did not have images displayed on the Hub, but intended to in the future, so this needed to be taken into account. We also asked about experiences with understanding and using APIs. The response showed that people had no experience of APIs and did not really understand what they were, but were keen to find out more.
We asked for requirements and preferences, which we have taken into account as much as possible, but we explained that we would have to take a uniform approach, so it was likely that there would need to be compromises.
After a period of development, we met with the early adopters of the micro sites (see below) to update them on our progress and get additional requirements from them. We considered these requirements in terms of how practical they would be to implement in the time scale that we were working towards, and we then prioritised the requirements that we would aim to implement before going live.
The additional requirements included:
Search in multi-level description: the ability to search within a description to find just the components that include the search term
Reference search: useful for contributors for administrative purposes
Citation: title and reference, to encourage researchers to cite the archive correctly
Highlight: highlighting of the search term(s)
Links to ‘search again’ and to ‘go back’ to the collection result
The addition of Google Analytics code in the pages, to enable impact analysis
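Of these, the highlighting of search terms is the easiest to illustrate. The sketch below is one possible approach in Python and is not the code used in the micro sites: it wraps each match in a span, ignoring case, and escapes the rest of the text so that it is safe to place in a page. The "highlight" class name is a made-up example.

```python
import html
import re


def highlight(text, terms):
    """Wrap case-insensitive matches of the search terms in a highlight span."""
    terms = [t for t in terms if t]
    if not terms:
        return html.escape(text)

    # Longest terms first, so a longer phrase wins over a shorter term inside it.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")", re.IGNORECASE)

    # With one capturing group, re.split alternates: unmatched text, match, unmatched text...
    pieces = pattern.split(text)
    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            out.append(f'<span class="highlight">{html.escape(piece)}</span>')
        else:
            out.append(html.escape(piece))
    return "".join(out)


marked = highlight("Love on the Dole", ["dole"])
```

The original capitalisation of the title is preserved, because the matched text itself is wrapped rather than being replaced by the search term.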
The Development Process
We wanted the micro sites to be a ‘stand alone’ implementation, not tied to the Archives Hub. We could have utilised the Hub, effectively creating duplicate instances of the interface, but this would have created dependencies. We felt that it was important for the micro sites to be sustainable independent of our current Hub platform.
In fact, the micro sites have been developed using Java, whereas the Hub uses Python, a completely different programming language. This happened mainly because we had a Java programmer on the team. It may seem a little odd to do this, as opposed to simply filtering the Hub data with Python, but we think that it has had unforeseen benefits: the programmers who have worked on the micro sites have been able to come at the task afresh, and work on new ways to solve the many challenges that we faced. As a result of this we have implemented some solutions with the micro sites that are not implemented on the Hub. Equally, there were certainly functions within the Hub that we could not replicate with the micro sites – mainly those that were specifically set up for the aggregated nature of the Hub (e.g. browsing across the Hub content).
It was a steep learning curve for a developer, as the development required a good understanding of hierarchical archival descriptions, and also an appreciation of the challenges that come from a diverse data set. As with pretty much all Hub projects, it is the diverse nature of the data set that is the main hurdle. Developers need patterns; they need something to work with, something consistent. There isn’t too much of that with aggregated archives catalogues!
The developer utilised what he could from the Hub, but it is the nature of programming that reverse engineering of someone else’s code can be a great deal harder than re-coding, so in many cases the coding was done from scratch. For example, the table of contents is a particularly tricky thing to recreate, but the code used for the current Hub proved to be too complex to work with, as it has been built up over a decade and is designed to work within the Hub environment. The table of contents requires the hierarchy to be set out, collapsible folder structures, links to specific parts of the description with further navigation from there to allow the researcher to navigate up and down, so it is a complex thing to create and it took some time to achieve.
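To show why the hierarchy matters, here is a much-simplified sketch of building a nested table of contents from the component levels of a description. It is in Python rather than the Java used for the micro sites, it assumes EAD without XML namespaces, and the sample description is invented. The real table of contents also needs collapsible folders and navigation links, which are left out here.

```python
import re
import xml.etree.ElementTree as ET

COMPONENT_TAG = re.compile(r"c(\d\d)?")  # matches <c>, <c01>, <c02>, ...


def is_component(element):
    """True for EAD component elements such as <c> or <c01>."""
    return isinstance(element.tag, str) and COMPONENT_TAG.fullmatch(element.tag) is not None


def title_of(component):
    """Title of a component, taken from did/unittitle."""
    return (component.findtext("did/unittitle") or "[untitled]").strip()


def build_toc(parent):
    """Recursively collect the titles of the components nested under parent."""
    return [
        {"title": title_of(child), "children": build_toc(child)}
        for child in parent
        if is_component(child)
    ]


def render(toc, depth=0):
    """Flatten the nested structure into indented lines."""
    lines = []
    for entry in toc:
        lines.append("  " * depth + entry["title"])
        lines.extend(render(entry["children"], depth + 1))
    return lines


# Invented sample description, for illustration only.
SAMPLE = """<dsc>
  <c01><did><unittitle>Series A</unittitle></did>
    <c02><did><unittitle>File A/1</unittitle></did>
      <c03><did><unittitle>Item A/1/1</unittitle></did></c03>
    </c02>
  </c01>
  <c01><did><unittitle>Series B</unittitle></did></c01>
</dsc>"""

toc_lines = render(build_toc(ET.fromstring(SAMPLE)))
```

Even this toy version relies on every component having a title in the same place, which is exactly the kind of consistency that aggregated catalogue data does not always offer.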
The feed of data has to provide the necessary information for the creation of the hierarchy, and our feed comes through SRU (Search/Retrieve via URL), which is a standard search protocol for Internet search queries using Contextual Query Language (CQL). This was already available through the Hub API, and the micro sites application makes use of SRU in order to perform most of the standard searches that are available on the Hub. Essentially, each of the micro sites is provided by a single web application that acts as a layer on the Archives Hub. To access the individual micro sites, the contributor provides a shortened version of the institution’s name as a sub-string to the micro sites web address. This then filters the data accordingly for that institution, and sets up the site with the appropriate branding. The latter is achieved through CSS stylesheets, individually tailored for the given institution by a stand-alone Java application and a standard CSS template.
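As a sketch of what such a request might look like, the Python below takes the institution from the web address and builds an SRU searchRetrieve URL with a CQL query restricted to that institution. The parameter names (operation, version, query, startRecord, maximumRecords) are standard SRU. The base URL, the shape of the micro site path and the `hub.institution` index name are hypothetical and are not the Hub's real configuration.

```python
from urllib.parse import urlencode

SRU_BASE = "https://hub.example.org/sru"   # hypothetical endpoint
INSTITUTION_INDEX = "hub.institution"      # hypothetical CQL index name


def institution_from_path(path):
    """Take the first path segment as the institution's short name (assumed URL shape)."""
    segments = [s for s in path.split("/") if s]
    return segments[0] if segments else None


def cql_quote(term):
    """Quote a term for CQL, escaping backslashes and double quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(institution, user_terms):
    """Restrict the user's search to a single institution's descriptions."""
    return (
        f"{INSTITUTION_INDEX} = {cql_quote(institution)} "
        f"and cql.serverChoice all {cql_quote(user_terms)}"
    )


def sru_params(institution, user_terms, start=1, page_size=20):
    """Standard SRU searchRetrieve parameters for one page of results."""
    return {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": build_query(institution, user_terms),
        "startRecord": start,
        "maximumRecords": page_size,
    }


def sru_url(institution, user_terms, start=1, page_size=20):
    return SRU_BASE + "?" + urlencode(sru_params(institution, user_terms, start, page_size))


institution = institution_from_path("/salford/search")
params = sru_params(institution, "love on the dole")
url = sru_url(institution, "love on the dole")
```

Because the institution filter is added on the server side of the micro site, the end-user only ever types their own search terms.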
One of the changes that the developer suggested for the micro sites concerns the intellectual division of the descriptions. On the current Hub, a description may carry over many pages, but each page does not represent anything specific about the hierarchy; it is just a case of the description continuing from one page to the next. With the micro sites we have introduced the idea that each ‘child’ description of the top level is represented on one page. This can more easily be shown through a screenshot:
In the screenshot above, the series ‘Theatre Programmes, Playbills, etc’ is a first-level child description (a series description) of the archive collection ‘The Walter Greenwood Collection’. Within this series there are a number of sub-series, the first of which is ‘Love on the Dole’, the last of which is ‘A Taste of Honey’. The researcher will therefore get a page that contains everything within this one series – all sub-series and items – if there are any described in the series.
The sense of hierarchy and belonging is further reinforced by repeating the main collection title at the top of every right-hand pane. The only potential downside to this approach is that it leads to variable length ‘child’ description pages, but we felt it was a reasonable trade-off because it enables the researcher to get a sense of the structure of the collection. Usually it means that they can see everything within one series on one page, as this is the most typical first child level of an archival description. In EAD representation, this is everything contained within the <c01> tag or top level <c> tag.
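The page division can be sketched as follows. This is an illustrative Python version, not the micro sites code: it assumes EAD without namespaces, and it reduces each page to a heading and a list of titles. The sample uses the Walter Greenwood titles mentioned above in a cut-down structure, and the second series is an invented placeholder added only so that the example produces more than one page.

```python
import re
import xml.etree.ElementTree as ET

COMPONENT_TAG = re.compile(r"c(\d\d)?")  # matches <c>, <c01>, <c02>, ...


def is_component(element):
    return isinstance(element.tag, str) and COMPONENT_TAG.fullmatch(element.tag) is not None


def pages(archdesc):
    """One page per first-level child, holding everything described beneath it."""
    collection = archdesc.findtext("did/unittitle")
    dsc = archdesc.find("dsc")
    result = []
    if dsc is None:
        return result
    for child in dsc:
        if not is_component(child):
            continue
        # iter() yields the child itself first, then its descendants in document order.
        titles = [c.findtext("did/unittitle") for c in child.iter() if is_component(c)]
        result.append({
            "collection": collection,   # repeated at the top of every page
            "heading": titles[0],
            "contents": titles[1:],
        })
    return result


SAMPLE = """<archdesc>
  <did><unittitle>The Walter Greenwood Collection</unittitle></did>
  <dsc>
    <c01><did><unittitle>Theatre Programmes, Playbills, etc</unittitle></did>
      <c02><did><unittitle>Love on the Dole</unittitle></did></c02>
      <c02><did><unittitle>A Taste of Honey</unittitle></did></c02>
    </c01>
    <c01><did><unittitle>Placeholder series</unittitle></did></c01>
  </dsc>
</archdesc>"""

result_pages = pages(ET.fromstring(SAMPLE))
```

The length of each page depends entirely on how much is described beneath that first-level child, which is the variable page length trade-off mentioned above.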
We are currently testing the micro sites with early adopters: Glasgow University Archive Services, Salford University Archives, Brighton Design Archives and the University of Manchester John Rylands Library.
We aim to go live during September 2014 (although it has been hard to fix a live date, as with a new and innovative service such as the micro sites unforeseen problems tend to emerge with alarming regularity). We will see what sort of feedback we get, and it is likely that we will find a few things need addressing as a result of putting the micro sites out into the big wide world. We intend to arrange a meeting for the early adopters to come together again and feed back to us, so that we can consider whether we need a ‘phase 2’ to iron out any problems and make any enhancements. We may at that stage invite other interested institutions, to explain the process and look at setting up further sites. But certainly our aim is to roll out the micro sites to other Archives Hub institutions.