With a little help from the Interface

January 4, 2013 / Jane Stevenson

It is tempting to forge ahead with ambitious plans for Web interfaces that grab the attention, that look impressive and do new and whizzy things. But I largely agree with Lloyd Rutledge that we want “less emphasis on grand new interfaces” (Lloyd Rutledge, The Semantic Web – ISWC 2010, Selected Papers). I think it is important to experiment with exciting, innovative interfaces, but the priority needs to be creating interfaces that are effective for users, and that usually means a level of familiarity and supporting the idea that “users of the Web feel it acts they way they always knew it should (even though they actually couldn’t imagine it beforehand).” Maybe the key is to make new things feel familiar, so that we aren’t asking users to learn a whole new literacy, but a new literacy will gradually emerge and evolve.

For the Archives Hub, we face similar challenges to many websites that promote and provide access to archives, although our challenges are compounded by being an aggregator and not being in control of the content of the descriptions. We are seeking to gradually modify and improve our interfaces, in the hope that we help to make the users’ discovery experiences more effective, and encourage people to engage with archives.

One of our aims is to introduce options for users that allow them to navigate around in a fairly flexible manner, meeting different levels of experience and need, but without cluttering the screen or making the navigation look complicated and off-putting. Interviews with researchers have indicated how people have a tendency to ‘click and see’, learning as they go, but expecting useful results fairly quickly, so we want to work with this principle, to use hyperlinks effectively, on the understanding that the terminology used and the general layout of the page will have an effect on user expectations.

A Separation of Parts

One of the issues when presenting an archival description is how to separate out the ‘further actions’ or ‘find out more’ from the basic content. The challenge here is compounded by the fact that researchers often believe the description is the actual content, and not just metadata, or alternatively they assume that they can always access a digital resource.

We have tried to simplify the display by introducing a Utility Bar. It is intended to bring together the further options available to the end user. The idea is to make the presentation neater, show the additional options more clearly, and also keep the main description clear and self-contained.

The user can click to find out how to access the materials, to find out where the repository is located in the UK or contact the repository by email. We are planning to make the email contact link more direct, opening an email and populating it with the email address of the repository in order to cut down on the number of stages the user has to go through (currently we link to the Archon directory of Archive services). We can also modify other aspects of the Utility Bar over time, adding functionality as required, so it is a way to make the display more extensible.

We have included links to social networking sites, although in truth we have no real evidence that these are required or used. This really was a case of ‘suck it and see’ and it will be interesting to investigate whether this functionality really is of value. We certainly have a lively following on Twitter, and indications are that our Twitter presence is valued, so we do believe that social networking sites play an important part in what we do.

We have also included the ability to view different formats. This will not be of value to most researchers, but it is intended to be part of our mission to open up the data and give a sense of transparency – anyone can see the encoding behind the description and see that it is freely available. Some of our contributors may find it useful, as well as developers interested in the XML behind the scenes.

The Biggest Challenge: how to present an archive description

Until recently we presented users with an initial hit list of results, which enabled them to see the title of a description and choose between a ‘summary’ presentation and a ‘full’ presentation. However, feedback indicates that users don’t know what we mean by this. Firstly, they haven’t yet seen the description, so there is nothing on which to base the choice of link to click, and secondly, what is the definition of ‘summary’ and ‘full’ anyway? Our intention was to give the user the choice of a fairly brief, one page summary description, with the key descriptive data about the archive collection, or the full, complete description, which may run to many pages. A further consideration was that we could only provide highlighting of terms on a single page, so if we only had the full description, highlighting would not be possible.

There are a number of issues here. (a) Descriptions may be exactly the same for summary and full because sometimes they are short, only including key fields, and they do not provide multi-level content; the full description will only provide more information if the cataloguer has filled in additional fields, or created a multi-level display. (b) ‘Summary’ usually means a cut-down version of something, taking key elements, but we do not do this; we simply select what we believe to be the key fields. For example, Scope and Content may actually be very long and detailed, but it would always be part of the ‘summary’ description. (c) Fields that are excluded from the summary view may be particularly important in some cases – for example, the collection may be closed for a period of time, and this would really be key information for a researcher.

With the new Utility Bar we changed ‘summary’ and ‘full’ to become ‘brief’ and ‘detailed’. We felt that this more accurately reflects what these options represent. At present we have continued with the same principle of displaying selected fields in the ‘brief’ description, but we feel that this approach should be revised. After much discussion, we have (almost) decided that we will change our approach here. The brief description will become simply the collection-level description in its entirety; the detailed description will be the multi-level description. This gives the advantage of a certain level of consistency, but there are still potential pitfalls. Two of the key issues are (a) that ‘brief’ may actually be quite long (a collection description can still be very long) and (b) that many descriptions are not multi-level, so there would be no difference between the two descriptions. Therefore, we will look at creating a scenario where the user only gets the ‘Detailed Description’ link when the description is multi-level. If we can do this we will may change the terminology; but in the end there is no real user-friendly way to succinctly describe a collection-level as opposed to a multi-level description, simply because many people are not aware of what archival hierarchy really means.

As well as introducing the Utility Bar we changed the hit list of results to link the title of the description to the brief view. We simply show the title and the date(s) of the archive, as we feel that these are the key pieces of information that the researcher needs in order to select relevant collections to view.

Centralised Innovation

For some of the more complex changes we want to make, we need to first of all centralise the Archives Hub, so that the descriptions are all held by us. For some time we thought that this seemed like a retrograde step: to move from a federated system to a centralised system. But a federated system adds a whole layer of complexity because not only do you not have control over the data you are presenting; you do not have control over some of the data at all, to view it, and examine any issues with it, and also to potentially improve the consistency (of the markup in particular). In addition, there is a dependency between the centralised system and the local systems that form the federated model. Centralising the data will actually allow us to make it more openly available as well, and to continue to innovate more easily.

Multiple Gateways: Multiple Interfaces

We will continue to work to improve the Archives Hub interface and navigation, but we are well aware that increasingly people use alternative interfaces, or search techniques. As Lorcan Dempsey states: “options have multiplied and the breadth of interest of the local gateway is diminished: it provides access only to a part of what I am potentially interested in.” We need to be thinking more broadly: “The challenge is not now only to improve local systems, it is to make library resources discoverable in other venues and systems, in the places where their users are having their discovery experiences.” (Lorcan Dempsey’s Webblog). This is partly why we believe that we need to concentrate on presenting the descriptions themselves more effectively – users increasingly come directly to descriptions from search engines like Google, rather than coming to the Archives Hub homepage and entering a search from there. We need to think about any page within our site as a landing page, and how best to help users from there, to discovery more about what we have to offer them.

Season’s greeting and Christmas closure

December 20, 2012 / Bethan Ruddock

"Sunshine Annual 1938. The brightest of the year." — “Sunshine Annual 1938. The brightest of the year.”
The Sunshine Annual was a children’s annual produced by the Co-op movement.
Image copyright © National Co-operative Archive.

The Archives Hub team wish everyone a very Merry Christmas, and a Happy New Year!

The Archives Hub office will close on 21st December and will reopen on the 2nd January.

The Archives Hub service will be available over Christmas and New Year, but there will be no helpdesk support. Any queries sent over this period will be dealt with when we return.

The Hub out and about – presenting, training, and pubbing

December 4, 2012 / Bethan Ruddock

The Hub team like to get out and about to present, teach, and chat about archives and information. It can get a bit lonely being a purely online service, with our users and contributors at the other end of an email or phone call, so we try to ensure that we take advantage of chances to meet them face-to-face.

The last week of November was a busy week for this! On the Wednesday Jane and I (Bethan) gave a presentation to the MA Library & Information students at MMU.

Interoperability, networking and standards from Bethan Ruddock

We’ve given similar presentations to Archive students and early-career professionals in the past, but this is the first time we’ve given one to Library students. I’m pleased to say it worked well – the students were engaged and knowledgeable about archives, and how issues in libraries and archives cross-over.

It’s always very encouraging and stimulating to meet an enthusiastic group (I’d also met them the week before to talk about professional organisations), and both Jane and I really enjoyed giving the session. We had some nice feedback from the students, too, with one person saying:

The workshop was informative as well as entertaining. Complex issues were broken down so they were easier to understand. In a short amount of time a lot of areas were covered and due to the lively presentation style we all remained engaged and interested throughout.

And another said that they wished they had more next week!

I think it’s very important for us to be involved in talking to students, trainees, and early-career professionals. It’s good for them to hear from people who are actually working with the data that they’ll be creating. If nothing else, if we educate them about the need for good, interoperable data now, we’ll get better data from them later on! It’s also great to be able to tell them about the different sorts of jobs and opportunities there are for them, and hopefully give them some ideas about ‘alternative’ careers.

The next day saw me, Jane and Lisa heading down to London, for the inaugural ‘Hub in the Pub‘ on the Thursday evening, before a training session on the Friday. We joined forces with a large contingent of museum folk who were ‘Drinking about Museums’, and had a very enjoyable and useful couple of hours chatting about general information, data, and cultural heritage issues. We hope to have more ‘Hub in the Pub’ events in future, so watch our mailing list and twitter feed for details.

We made sure that the evening didn’t get too merry, so we were on top form for our contributors training day the next day. These training days are designed to help current and potential contributors use our EAD Editor, and are also a great chance to get to know our contributors and chat to them about any issues they might have. We have a few places left on our next training day in Glagsow in January – do sign up if you’d like to come along, or contact us if you’d like to know more.

If you can’t get along to a training session, we have online audio tutorials and a workbook designed to give you a step-by-step guide to using the Editor – and we’re always happy to answer any questions.

An evaluation of the use of archives and the Archives Hub

November 15, 2012 / Jane Stevenson

This blog is based upon a report written by colleagues at Mimas* presenting the results of the evaluation of our innovative Linked Data interface, ‘Linking Lives‘. The evaluation consisted of a survey and a focus group, with 10 participants including PhD students and MA students studying history, politics and social sciences. We asked participants a number of questions about the Archives Hub service, in order to provide context for their thoughts on the Linking Lives interface.

This blog post concentrates on their responses relating to the use of archives, methods of searching and interpretation of results. You can read more about their responses to the Linking Lives interface on our Linking Lives blog.

Use of Archives and Primary Source Materials

We felt that it was important to establish how important archives are to the participants in our survey and focus group. We found that “without exception, all of the respondents expressed a need for primary resources” (Evaluation report). One respondent said:

“I would not consider myself to be doing proper history if I wasn’t either reinterpreting primary sources others had written about, or looking at primary sources nobody has written about. It is generally expected for history to be based on primary sources, I think.” (Survey response)

One of the most important factors to the respondents was originality in research. Other responses included acknowledgement of how archives give structure to research, bringing out different angles and perspectives and also highlighting areas that have been neglected. Archives give substance to research and they enable researchers to distinguish their own work:

“Primary sources are very valuable for my research because they allow me to put together my own interpretation, rather than relying on published findings elsewhere.” (Survey response)

Understanding of Archives

It is often the case that people have different perceptions of what archives are, and with the Linking Lives evaluation work this was confirmed. Commonly there is a difference between social scientists and historians; the former concentrating on datasets (e.g. data from the Office of National Statistics) and the latter on materials created during a person’s life or the activities of an organisation and deemed worthy of permanently preserving. The evaluation report states:

“The participants that had a similar understanding of what an archive was to the Archive Hub’s definition had a more positive experience than those who didn’t share that definition.”

This is a valuable observation for the work of the Hub in a general sense, as well as the Linking Lives interface, because it demonstrates how initial perceptions and expectations can influence attitudes towards the service. In addition, the evaluation work highlighted another common fallacy: that an archive is essentially a library. Some of the participants in the survey expected the Archives Hub to provide them with information about published sources, such as research papers.

These findings highlight one of the issues when trying to evaluate the likely value of an innovative service: researchers do not think in the same language or with the same perspectives as information professionals. I wonder if we have a tendency to present services and interfaces modelled from our own standpoint rather than from the standpoint of the researcher.

Search Techniques and Habits

“Searches were often not particularly expansive, and participants searched for specific details which were unique to their line of enquiry” (Evaluation report). Examples include titles of women’s magazines, personal names or places. If the search returned nothing, participants might then broaden it out.

Participants said they would repeatedly return to archives or websites they were familiar with, often linked to quite niche research topics. This highlights how a positive experience with a service when it is first used may have a powerful effect over the longer term.

The survey found that online research was a priority:

“Due to conflicting pressures on time and economic resources, online searching was prevalent amongst the sample. Often research starts online and the majority is done online. Visits to see archives in person, although still seen as necessary, are carefully evaluated.” (Evaluation report)

The main resources participants used were Google and Google Scholar (the most ubiquitous search engines used) as well as The National Archives, Google Books and ESDS. Specialist archives were referred to relating to specific search areas (e.g. The People’s History Museum, the Wellcome Library, the Mass Observation Archive).

Thoughts and Comments About the Archives Hub

All participants found the Hub easy to navigate and most found locating resources intuitive. As part of the survey we asked the participants to find certain resources, and almost all of them provided the right answers with seemingly no difficulty.

“It is clear. The descent of folders and references at the top are good for referencing/orientating oneself. The descriptions are good – they obviously can’t contain everything that could be useful to everyone and still be a summary. It is similar to other archive searches so it is clear.” (Survey response, PhD history student)

The social scientists that took part in the evaluation were less positive about the Archives Hub than the historians. Clearly many social science students are looking for datasets, and these are generally not represented on the Hub. There was a feeling that contemporary sources are not well represented, and these are often more important to researchers in fields like politics and sociology. But overall comments were very positive:

“…if anyone ever asked about how to search archives online I’d definitely point them to the Archives Hub”.

“Useful. It will save me making specific searches at universities.”

Archives Hub Content

It was interesting to see the sorts of searches participants made. A search for ‘spatial ideas’ by one participant did not yield useful results. This would not surprise many archivists – collections are generally not catalogued to draw out such concepts (neither Unesco nor UKAT have a subject heading for this; LCSH has ‘spatial analysis’). However, there may well be collections that cover a subject like this, if the researcher is prepared to dig deep enough and think about different approaches to searching. Another participant commented that “you can’t just look for the big themes”. This is the type of search that might benefit from us drawing together archive collections around themes, but this is always a very flawed approach. This is one reason that we have Features, which showcase archives around subjects but do not try to provide a ‘comprehensive’ view onto a subject.

This kind of feedback from researchers helps us to think about how to more effectively present the Archives Hub. Expectations are such an important part of researchers’ experiences. It is not possible to completely mitigate against expectations that do not match reality, but we could, for example, have a page on ‘The Archives Hub for Social Scientists’ that would at least provide those who looked at it with a better sense of what the Hub may or may not provide for them (whether anyone would read it is another matter!).

This survey, along with previous surveys we have carried out, emphasises the importance of a comprehensive service and a clear scope (“it wasn’t clear to me what subjects or organisations are covered”). However, with the nature of archives, it is very difficult to give this kind of information with any accuracy, as the collections represented are diverse and sometimes unexpected. in the end you cannot entirely draw a clear line around the scope of the Archives Hub, just like you cannot draw a clear line around the subjects represented in any one archive. The Hub also changes continuously, with new descriptions added every week. Cataloguing is not a perfect art; it can draw out key people, places, subjects and events, but it cannot hope to reflect everything about a collection, and the knowledge a researcher brings with them may help to draw out information from a collection that was not explicitly provided in the description. If a researcher is prepared to spend a bit of time searching, there is always the chance that they may stumble across sources that are new to them and potentially important:

“…another student who was mainly focused on the use of the Kremlin Archives did point out that [the Archives Hub] brought up the Walls and Glasier papers, which were new to [them]”.

Even if you provide a list of subjects, what does that really mean? Archives will not cover a subject comprehensively; they were not written with that in mind; they were created for other purposes – that is their strength in many ways – it is what makes them a rich and exciting resource, but it does not make it easy to accurately describe them for researchers. Just one series of correspondence may refer to thousands of subjects, some in passing, some more substantially, but archivists generally don’t have time to go through an entire series and draw out every concept.

If the Archives Hub included a description for every archive held at an HE institution across the UK, or for every specialist repository, what would that signify? It would be comprehensive in one sense, but in a sense that may not mean much to researchers. It would be interesting to ask researchers what they see as ‘comprehensive resources’ as it is hard to see how these could really exist, particularly when talking about unpublished sources.

Relevance of Search Results

The difficulties some participants had with the relevance of results comes back to the problem of how to catalogue resources that often cover a myriad of subjects, maybe superficially, maybe in detail; maybe from a very biased perspective. If a researcher looks for ‘social housing manchester’ then the results they get will be accurate in a sense – the machine will do its job and find collections with these terms, and there will be weighting of different fields (eg. the title will be highly weighted), but they still may not get the results they expect, because collections may not explicitly be about social housing in Manchester. The researcher needs to do a bit more work to think about what might be in the collection and whether it might be relevant. However, cataloguers are at fault to some extent. We do get descriptions sent to the Hub where the subjects listed seem inadequate or they do not seem to reflect the scope and content that has been provided. Sometimes a subject is listed but there is no sense of why it is included in the rest of the description. Sometimes a person is included in the index terms but they are not described in the content. This does not help researchers to make sense of what they see.

I do think that there are lessons here for archivists, or those who catalogue archives. I don’t think that enough thought is gives to the needs of the researcher. The inconsistent use of subject terms, for example, and the need for a description of the archive to draw out key concepts a little more clearly. Some archivists don’t see the need to add index terms, and think in terms of technologies like Google being able to search by keyword, therefore that is enough. But it isn’t enough. Researchers need more than this. They need to know what the collection is substantially about, they need to search across other collections about similar subjects. Controlled vocabulary enables this kind of exploratory searching. There is a big difference between searching for ‘nuclear disarmament’ as a keyword, which means it might exist anywhere within the description, and searching for it as a subject – a significant topic within an archive.

*Linking Lives Evaluation: Final Report (October 2012) by Lisa Charnock, Frank Manista, Janine Rigby and Joy Palmer

Finding and accessing archives for voluntary action history

October 15, 2012 / Jane Stevenson / 1 Comment

Guest Blog by Georgina Brewis

It would not be an exaggeration to say that the history of voluntary, civic and cultural organisations has never been more popular as an academic subject in Britain. Leading historians like Brian Harrison have called attention to the importance of voluntarism as a theme in post-war British history while there has been a wave of PhD theses dealing with topics such as the voluntary hospitals, the role of disability charities in politics, the professionalization of the voluntary sector and the formation of humanitarian networks across empire. In 2011 no less than three edited collections presenting the latest research on voluntary action history were published and several further volumes appeared in 2012 or are in press. Such new research has been strengthened and sustained by the Voluntary Action History Society and particularly its active New Researchers group. Importantly, not all these studies are by historians, pointing to the importance of archival resources for students of political science, sociology, health studies and other disciplines. There is growing recognition that we cannot write British social history or social policy without looking at the considerable contributions of charities, voluntary groups, philanthropists, campaigners and volunteers.

So how do academic researchers track down the archives of the voluntary and community organisations they want to use? Any would-be researcher of charity needs to understand that those bodies with catalogued and accessible institutional archives – whether kept in-house or deposited elsewhere – represent only a very small minority of voluntary organisations. Unsurprisingly these tend to be the larger, better funded and longer-established groups such as the British Red Cross or the Children’s Society. The voluntary sector in Britain is often likened to a pyramid: a very small number of organisations at the top with paid staff, regular income and office space resting on a much larger base of groups run entirely by volunteers, subsisting on small grants and donations. Voluntary sector archives may reflect this pattern, but there is no guarantee that even the largest charity will have made provision for preservation and conservation of its records (aside from the limited financial data required by the Charity Commission) let alone for cataloguing or access.

Researchers and students are advised to start with the National Register of Archives. Another useful database is DANGO, which identifies the locations of the papers of several thousand non-governmental organisations, and was put together by a team at Birmingham University, although the end of project funding means its entries and website are no longer being updated. Searching the Archives Hub will find records of voluntary groups where these are deposited at an institution contained on its database; Hull History Centre, SOAS, Birmingham University Library or the Women’s Library have all built up specialisms in this area. Perhaps there would be a way of encouraging charities with in-house collections to make the catalogues available via Archives Hub?

Archives Hub has helped me search for materials relating to small or short-lived student-run charities that may be contained within a students’ union archive or an individual’s private papers. Although an organisation’s institutional archive may be lost or never have existed, its history can be reconstructed through accessing annual reports, correspondence and other papers held in many different repositories – as I have managed to do for the group International Student Service. It would be helpful for future researchers if it was possible to log this information somewhere.

It remains the case that many researchers will have to seek access to records by contacting an organisation or group founder directly, with variable results. This is likely to be increasingly the case given the increase in numbers of pressure groups, charities and other voluntary bodies since the 1960s. In my experience there is a range of practice from organisations which ignore or refuse requests for access with varying degrees of politeness to those that welcome you with open arms and let you sit unsupervised with the charity’s papers, free to copy, remove, deface or pour coffee all over the institutional record. Once you’ve had success accessing the records of one organisation, it may be easier to open communications with others in a related sector. Learning how to negotiate what we might call ‘informal archives’ will be a key challenge for future researchers of voluntary action. There is a need for better advice for academics, particularly students and new researchers, on the multiple ethical considerations and practical concerns that come with using informal archives. How do you track down such records? How do you reference sources? What do you do if you’re concerned about the physical state of records or what might happen to them when the group’s founder dies? How to reconcile your obligations as a historian with the fact that a particular organisation has trusted you to look at their materials?

It is also worth remembering that records relating to charitable activities can turn up in unexpected places, for example in the archives of private companies. The records of a charitable Trust or Foundation may well contain better sources about a particular charity than the organisation itself has preserved, although again there may be problems of access. There are good signs that this is changing not least through the positive examples of two funders involved with the new Campaign for Voluntary Sector Archives: the Barrow Cadbury Trust and the Diana, Princess of Wales Memorial Fund.

This new Campaign for Voluntary Sector Archives, which was launched at the House of Lords in October 2012, seeks to raise awareness of the importance of voluntary sector archives as strategic assets for governance, corporate identity, accountability and research. It maintains that caring for archives and records is actually an important aspect of the sector’s wider public benefit responsibility. Most significantly, the Campaign brings together academic researchers, custodians, creators of records and others in the voluntary sector to share expertise and resources. Together, we should be able to begin to address some of the issues and questions I’ve outlined above. Yet there is a long way to go before all voluntary organisations are convinced not only of the value of records to the current mission, but also of the value of making these accessible to researchers from a variety of disciplines. For more information contact info@voluntarysectorarchives.org.uk

The New Scholarly Record

October 4, 2012 / Jane Stevenson

I was lucky enough to attend the 2012 EmTACL conference in Trondheim, and this blog is based around the excellent keynote presentation by Herbert van de Sompel, which really made me think about temporal issues with the Web and how this can limit our understanding of the scholarly record.

Herbert believes that the current infrastructure for scholarly communication is not up to the job. We now have many non-traditional assets, which do not always have fixity and often have a wide range of dependencies; assets such as datasets, blogs, software, videos, slides which may form part of a scholarly resource. Everything is much more dynamic than it used to be. ‘Research objects’ often include assets that are interdependent with each other, so they need to be available all together for the object to be complete. But this is complicated by the fact that many of them are ‘in motion’ and updated over time.

This idea of dynamic resources that are in flux, constantly being updated, is very relevant for archivists, partly because we need to understand how archives are not static and fixed in time, and partly because we need to be aware of the challenges of archiving ever more complex and interconnected resources. It is useful to understand the research environment and the way technology influences outputs and influences what is possible for future research.

There are examples of innovative services that are responding to the opportunities of dynamic resources. One that Herbert mentioned was PLOS, which publishes open scholarly articles. It puts publications into Wikipedia as well as keeping the ‘static’ copy, so that the articles have a kind of second life where they continue to evolve as well as being kept as they were at the time of submission. For example, ‘Circular Permutation in Proteins‘.

The idea of executable papers is starting to become established – papers that are not just to read but to interact with. These contain access to the primary data with capabilities to re-execute algorithms and even capabilities to allow researchers to upload and use their own data. It produces a complex interdependency and produces a challenge for archiving because if something is not fixed in time, what does that mean for retaining access to it over time?

This all raises the issue of what the scholarly record actually is. Where does it start? Where does it end? We are no longer talking about a bunch of static files but a dynamic interconnected resource. In fact, there is an increasing sense that the article itself is not necessarily the key output, but rather it is the advertising for the actual scholarship.

Herbert concluded from this that it becomes very important to be able to view different points in time in the evolution of scholarly record, and this should be done in a way that works with the Web. The Web is the platform, the infrastructure for the scholarly record. Scholarly communication then becomes native to the Web. At the heart of this is the need to use HTTP URIs.

However, where are we at the moment? The current archival infrastructure for scholarly outputs deals with things with fixity and boundaries. It cannot deal with things in flux and with inter-dependencies. The Web exists in ‘now’ time; it does not have a built in notion of time. It assumes that you want the current version of something – you cannot use a URI to get to a prior version.

Diagram to show publication on the Web — Slide from Herbert van de Sompel’s presentation showing the publication context on the Web

We don’t really object to this limitation, something evidenced by the fact that we generally accept links that take us to 404 pages, as if it is just an inevitable inconvenience. Maybe many people just don’t think that there is any real interest in or requirement for ‘obsolete’ resources, and what is current is what is important on the Web.

Of course, there is the Internet Archive and other similar initiatives in Web archiving, but they are not integrated into the Web. You have to go somewhere completely different in order to search for older copies of resources.

If the research paper remains the same, but resources that are an integral part of it change over time, then we need to change archiving to reflect this. We need to think about how to reference assets over time and how to recreate older versions. Otherwise, we access the current version, but we are not getting the context that was there at the time of creation; we are getting something different.

Can we recreate a version of a scholarly record? Can we go back to certain point it time so we can see linked assets from a paper as they were at the time of publication? At the moment we are likely to get many 404s when we try to access links associated with a publication. Herbert showed one survey on the decay of URLs in Medline, which is about 10% per year, especially with links to thinks like related databases.

One solution to this is to be able to follow a URI in time – to be able to click on URI and say ‘I want to see this as was 2 years ago’. Herbert went on to talk about something he has created called Memento. Memento aims to better integrate the current and past Web. It allows you to select a day or time in the browser and effectively take the URI back in time. Currently, the team are looking at enabling people to browse past pages of Wikipedia. Memento has a fairly good success rate with going back to retrieve old versions, although it will not work for all resources. I tried it with the Archives Hub and found it easy to take the website back to how it looked right in the very early days.

Screen shot of the Archives Hub hompeage — Using Memento to take the Archives Hub back in time.

One issue is that the archived copies are not always created near the time of publication. But for those that are, they are created simply as part of the normal activity of the Web, by services like the Internet Archive or the British Library, so there is no extra work involved.

Herbert outlined some of the issues with using DOIs (digital object identifiers), which provide identifiers for resources that use a resolver to ensure that the identifier can remain the same over time. This is useful if, for example, a publisher is bought out – the identifier is still the same as the resolver redirects to the right location However, a DOI resolver exists in the perpetual now. It is not possible to travel back in time using HTTP URIs. This is maybe one illustration of the way some of the processes that we have implemented over the Web do not really fulfil our current needs, as things change and resources become more complex and dynamic.

With Memento, the same HTTP URI can function as the reference to temporally evolving resources. The importance of this type of functionality is becoming more recognised. There is a new experimental URI scheme, DURI , or Dated URI. The ideas is that a URI, such as http://www.ntnu.no, can be dated: 1997-06-17:http://www.ntnu.no (this is an example and is not actionable now). Herbert did raise another possibly of developing Websites that can deal with the TEL (telephone) protocol. The idea would be that the browser asks you whether the Website can use the TEL protocol, and if it can, you get this option offered to you. You can then use this and reference a resource and use Memento to go back in time.

Herbert concluded that the idea of ‘archiving’ should not be just a one-off event, but needs to happen continually. In fact, it could happen whenever there is an interaction. Also, when new materials are taken into a repository, you could scan for links and put them into an archive, so the links don’t die. If you archive the links at the time of publication or when materials submitted to a repository, then you protect against losing the context of the resource.

Herbert introduced us to SiteStory, which offers transactional archiving of a a web server. Usually a web archive sends out a robot, gathers and dumps the data. With SiteStory the web server takes an active part. Every time a user requests a page it is also pushed back into the archive, so you get a fine grained history of the resource. Something like this could be done by publishers/service providers, with the idea that they hold onto the hits, the impact, the audience. It certainly does seem to be a growing area of interest.

Herbert’s slides are available on Slideshare.

Archives and the Researchers of Tomorrow

August 16, 2012 / Jane Stevenson

“In 2009, the British Library and JISC commissioned the three-year Researchers of Tomorrow study, focusing on the information-seeking and research behaviour of doctoral students in ‘Generation Y’, born between 1982 and 1994 and not ‘digital natives’. Over 17,000 doctoral students from more than 70 higher education institutions participated in the three annual surveys, which were complemented by a longitudinal student cohort study.” (Taken from http://www.jisc.ac.uk/publications/reports/2012/researchers-of-tomorrow#exec_sum).

This post picks up on some aspects of the study, particularly those that are relevant to archives and archivists. I am assuming that archivists come into the category of libraries as being ‘library professionals’, at least to an extent, though our profession is not explicitly mentioned. I would recommend reading the report in full as it offers some useful insights into the research behaviour of an important group of researchers.

What is heartening about this study is that the findings confirm that generation Ydoctoral students are

blog-researchersOfTomorrow-Asian_student5 — Image from: http://www.freedigitalphotos.net

“sophisticated information-seekers and users of complex information sources“. The study does conclude that information seeking behaviour is becoming less reliant on the support of libraries and library staff, which may have implications for the role of libraries and archive professionals, but “library staff assistance with finding/retrieving difficult-to-access resources” was seen as one of the most valuable research support resources available to students, although it was a relatively small proportion of students that used this service. There was a preference for this kind of on-demand 1-2-1 support rather than formal training sessions. One of the students said ” the librarians are quite possibly the most enthusiastic and helpful people ever, and I certainly recommend finding a librarian who knows their stuff, because I have had tremendous amounts of help with my research so far, just simply by asking my librarian the right question.”

The survey concentrated on the most recent information-seeking activity of students, and found that most were not seeking primary sources, but rather secondary sources (largely journal articles and books).

“This apparent and striking dependence on published research resources implies that, as the basis for their own analytical and original research, relatively few doctoral students in social sciences and arts and humanities are using ‘primary’ materials such as newspapers, archival material and social data.”

This finding was true across all subject disciplines and all ages. The study found that about 80% of arts and humanities students were looking for any bibliographic references on their topic or specific publications, while only 7% were looking for non-published archival material. It seems that this reliance on published information is formed early on in their research, as students lay the groundwork for their PhD. Most students said they used academic libraries more in their first year of study – whether visiting or using the online services, so maybe this is the time to engage with students and encourage them to use more diverse sources in the longer-term.

A point that piqued my interest was that the arts and humanities students visiting other collections in order to use archival sources would do so even if “many of the resources they required had been digitised“, but this point was not explained further, which was frustrating. Is it because they are likely to use a mixture of digital and non-digital sources? Or maybe if they locate digital content it stimulates their interest to find out more about using primary sources?

Around 30% used Google or Google Scholar as their main information gathering tool, although arts and humanities students sourced their information from a wider spread of online and offline sources, including library catalogues.

One thing that concerned me, and that I was not aware of, was the assertion that “current citation-based assessment and authenticity criteria in doctoral and academic research discourage the citing of non-published or original material“. I would be interested to know why this is the case, as surely it should be actively encouraged rather than discouraged? How does this fit with the need to demonstrate originality in research?

Students rely heavily on help from their supervisors early on in their research, and supervisors have a great influence on their information and resource use. I wonder if they actively encourage the use of primary sources as much as we would like? I can’t help thinking that a supervisor enthusiastically extolling the importance and potential impact of using archives would be the best way to encourage use.

There continues to be a feeling amongst students, according to this study, that “using social media and online forums in research lacks legitimacy” and that these tools are more appropriate within a social context. The use of twitter, blogs and social bookmarking was low (2009 survey: 13% of arts and humanities students had used and valued Twitter) and use was more commonly passive than active. There was a feeling that new tools and applications would not transform the way that the students work, but should complement and enhance established research practices and behaviour. However, it should be noted that use of ‘Web 2.0’ tools increased over the 3 years of the study, so it may be that a similar study carried out in 5 years time would show significantly different behaviour.

Students want training in research geared towards their own subject area, preferably face-to-face. Online tutorials and packages were not well used. The implication is that training should be at a very local level and done in a fairly informal way. Generic research skills are less appealing. Research skills training days are valued, but if they are poor and badly taught, the student feels their time is wasted and may be put off trying again. Students were quite critical of the quality and utility of the training offered by their university mainly because (i) it was not pitched at the right level (ii) it was too generic or (iii) it was not available on demand. Library-led training sessions got a more positive response, but students were far less likely to take up training opportunities after the first year of their PhD. Training in the use of primary sources was not specifically covered in the report, though it must be supposed this would be (should be!) included in library-led training.

The study indicated that students dislike reading (as opposed to scanning) on screen. This suggests that it is important to provide the right information online, information that is easy to scan through, but worth providing a PDF for printout, especially for detailed descriptions.

One quote stood out for me, as it seems to sum up the potential danger of modern ways of working in terms of approaches to more in-depth analysis:

“The problem with the internet is that it’s so easy to drift between websites and to absorb information in short easy bites that at times you forget to turn off the computer, rest your eyes from screen glare and do some proper in-depth reading. The fragments and thoughts on the internet are compelling (addictive, even), and incredibly useful for breadth, but browsing (as its name suggests) isn’t really so good for depth, and at this level depth is what’s required.” (Arts and humanities)

We do often hear about the internet, or computers, tending to reduce levels of concentration. I think that this point is subtly different though – it’s more about the type of concentration required for in-depth work, something that could be seen as essential for effective primary source research.

Conclusions

We probably all agree that we can always do more to to promote the importance of archives to all potential users, including doctoral students. Certainly, we need to make it easier for them to discovery sources through the usual routes that they use, so for one thing ensuring we have a profile via Google and Google Scholar. Too many archives still resist this requirement, as if it is somehow demeaning or too populist, or maybe because they are too caught up in developing their own websites rather than thinking about search engine optimisation, or maybe it is just because archivists are not sure how to achieve good search engine rankings?

Are we actively promoting a low-barrier, welcome and open approach? I recall many archive institutions that routinely state that their archives are ‘open to bone fide researchers only’. Language like that seems to me to be somewhat off putting. Who are the ‘non-bone fide’ researchers that need to be kept out? This sort of language does not seem conducive to the principle of making archives available to all.

The applications we develop need to be relatively easy to absorb into existing research work practices, which will only change slowly over time. We should not get too caught up in social networks and Web 2.0 as if these are ‘where it’s at’ for this generation of students. Maybe the approaches to research are generally more traditional than we think.

The report itself concludes that the lack of use of primary sources is worrying and requires further investigation:

“There is a strong case for more in-depth research among doctoral students to determine whether the data signals a real shift away from doctoral research based on primary sources compared to, say, a decade ago. If this proves to be the case there may be significant implications for doctoral research quality related to what Park described as “widely articulated tensions between product (producing a thesis of adequate quality) and process (developing the researcher), and between timely completion and high quality research“.

Excel template

August 13, 2012 / Bethan Ruddock / 2 Comments

Update May 2015: Please Note we need to make some changes to the Excel template and we are not currently working with Excel data. We hope to be able to offer this service in the future.

As part of Project Headway we wanted to create an Excel template which archives could use to catalogue and create EAD. We know that some archives – especially smaller and under-resourced archives – are using spreadsheets or word processing software to catalogue, and often lack the time or resources to switch to using an archival management system. While users can catalogue directly on to the EAD Editor, this isn’t a perfect solution – it won’t work in some older browsers, or offline.

While we would have liked to offer a script that allowed users to convert their own Excel catalogues to EAD, it soon became apparent that this wasn’t an option. We would have needed to produce a script for each institution, and relied on the institution using Excel in a very consistent, systematic way – and a way that was ISAD(G) compliant, and could easily be mapped to EAD. So we decided to start off with a simple template, which we can adapt to individual user needs if required.

I’d never worked with XML in Excel before, and a lot of the process was simply trial-and-error, googling error messages, and sending forlorn messages to my programmer husband asking ‘what on earth is denormalised data and how do I stop it?’. I found the office.microsoft.com and msdn.microsoft.com sites useful for figuring out the basics of getting XML in and out of Excel – though I often turned to support elsewhere, too (eg Microsoft support will only tell you that denormalised data is not supported – not what it is or how to fix it).

To get started with using XML in Excel, you need to have the XML add-in installed (it says 2003, but will work with other versions) and then make sure you can see the ‘developer’ tab – if you can’t, it’s under options -> customize ribbon.

While it’s hard (in retrospect) to remember all of the stages I went through in the trial-and-error, I know I started by trying to create an XSD (XML schema file) from in-Excel data entry. It failed. I tried importing the EAD.xsd – which just failed, silently (no error messages- no messages at all).

I was also concerned that the official EAD.xsd was too complicated for my (and our users’) needs – for instance, this project didn’t require lists of enumeration values. I needed something a bit simpler – and I’d already figured out that Excel couldn’t handle multi-level descriptions – so I needed to start with something collection-level only, too.

I created a basic EAD collection-level description in the Archives Hub EAD Editor, saved it as XML, removed the DTD declaration (not allowed in Excel), and imported it (using developer -> xml -> import). Clicking on ‘source’ in the developer XML tab then shows you the XML fields.

You can then export this map as an XSD, creating your XML schema. Of course, it wasn’t that easy. This is where denormalised data cropped up – and stopped me from exporting. I have to admit, I’m still not entirely sure what exactly denormalised data is – and given definitions such as:

A denormalised data model is not the same as a data model that has not been normalised, and denormalisation should only take place after a satisfactory level of normalisation has taken place and that any required constraints and/or rules have been created to deal with the inherent anomalies in the design. For example, all the relations are in third normal form and any relations with join and multi-valued dependencies are handled appropriately.

(from the usually introductory-friendly Wikipedia)

I’m not sure I’ll ever find out (if you have a really good explanation, please do comment!). But what I did find out was what it meant for me in the context of this XML mapping: no repeated fields. EAD allows for repeated fields – for instance, multiple subjects would be encoded as:

<controlaccess> <subject>subject</subject><subject>subject 2</subject></controlaccess>

Try to import that into Excel, and you get, well, a mess. The whole description appears twice – once with subject, and once with subject 2. And if you try to export the schema, you get the error message that the map is not exportable because it contains denormalized data.

For this reason, Excel won’t support hierarchy. In EAD, the same fields are repeated at component level as at collection-level, just inside a different wrapper. If you thought it got messy when you add a single repeated field, just imaging having anything up to several thousand…

So, strip everything down to a single instance (which means separating collection and component level into different spreadsheets), and you have an XSD which will export (follow instructions in step 4 of that link – if you get a VBA error, debug instructions are in step 2). Hurrah! But how to make it useable?

Well, you have to put it back into Excel, and map the XML fields to Excel cells. This was tedious, but achievably tedious rather than crawling-through-help-forums tedious. Open up a new Excel document, click on ‘source’, and choose your shiny new XSD. This will give you a list of all the fields, in the right-hand pane. Mapping them to cells is simply a case of drag-and-drop – once you’ve mapped a field to a cell, that cell will be outlined in blue (as long as the source pane is showing). There’s an option to have Excel auto-label your fields with the content of the XML tag, but I decided that wouldn’t give the user-friendly interface I wanted, so I labelled them myself. Then colour-coded them. The result?

I had to tweak the exported XSD a little to allow for a field in which users can enter the reference codes of any components. This was my first experiences of hand-coding any of an XML schema, and it took a few tries to get right! But I managed to add and map the <dsc> and <c> elements:

<xsd:element minOccurs=”0″ nillable=”true” name=”dsc” form=”unqualified”>
<xsd:complexType>
<xsd:sequence minOccurs=”0″>
<xsd:element minOccurs=”0″ nillable=”true” type=”xsd:string” name=”c” form=”unqualified”/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>

(If I wanted to play with the XSD a bit more, I guess I could make mandatory fields really mandatory, by fiddling with the minOccurs and/or nillable attributes, but I haven’t worked up the courage yet…)

This allows users to enter the reference codes of parent/child descriptions. Each component needs its own spreadsheet, and its own XML export. These are then run through a script by our programmer, which will use these parent/child references to create a single, hierarchical description. Theoretically, anyway – we haven’t been able to do much testing on it yet, and we’re not sure how well it will cope with components that are more than a level or two deep.

Remember denormalised data, and how you can’t have repeated fields? Obviously we can’t tell contributors that they can only have a single subject for each description! So in repeatable fields, multiple entries are pipe | delimited, so we can split them, eg:

<controlaccess><subject>subject 1|subject2|subject3</subject></controlaccess>

<controlaccess><subject>subject1</subject><subject>subject2</subject><subject>subject3</subject></controlaccess>

If users enter their subject sources in the same order, they’ll be matched up as attributes to the correct subject. The script also removes any empty fields (valid XML, but they break the EAD Editor), and adds the special Archives Hub mark-up for access points (used to distinguish between eg surname and forename in a personal name, and handy for linked data).

And there we are: a description, created in Excel, that’s valid EAD. We’re still in the process of testing the template, and making sure that it’s robust and meets users’ needs. If you’d like to be involved with testing, please get in touch.

The modern archivist: working with people and technology

July 7, 2012 / Jane Stevenson / 5 Comments

I’ve recently read Kate Theimer’s very excellent post on Honest Tips for Wannabe Archivists Out There.

This is something that I’ve thought about quite a bit, as I work as the manager of an online service for Archives and I do training and teaching for archivists and archive students around creating online descriptions. I would like to direct this blog post to archive students or those considering becoming archivists. I think this applies equally to records managers, although sometimes they have a more defined role in terms of audience, so the perspective may be somewhat different.

It’s fine if you have ‘a love of history’, if you ‘feel a thrill when handling old documents’. That’s a good start. I’ve heard this kind of thing frequently as a motivation for becoming an archivist. But this is not enough. It is more important to have the desire to make those archives available to others; to provide a service for researchers. To become an archivist is to become a service provider, not an historian. It may not sound as romantic, but as far as I am concerned it is what we are, and we should be proud of the service we provide, which is extremely valuable to society. Understanding how researchers might use the archives is, of course, very important, so that you can help to support them in their work. Love of the materials, and love of the subject (especially in a specialist repository) should certainly help you with this core role. Indeed, you will build an understanding of your collections, and become more expert in them over time, which is one of the wonderful things about being an archivist.

Your core role is to make archives available to the community – for many of us, the community is potentially anyone, for some of us it may be more restricted in scope. So, you have an interest in the materials, you need to make them available. To do this you need to understand the vital importance of cataloguing. It is this that gives people a way in to the archives. Cataloguing is a real skill, not something to be dismissed as simply creating a list of what you have. It is something to really work on and think about. I have seen enough inconsistent catalogues over the last ten years to tell you that being rigorous, systematic and standards-based in cataloguing is incredibly important, and technology is our friend in this aim. Furthermore, the whole notion of ‘cataloguing’ is changing, a change led by the opportunities of the modern digital age and the perspectives and requirements of those who use technology in their every day life and work. We need to be aware of this, willing (even excited!) to embrace what this means for our profession and ready to adapt.

This brings me to the subject I am particularly interested in: the use of technology. Cataloguing *is* using technology, and dissemination *is* using technology. That is, it should be and it needs to be if you want to make an impact; if you want to effectively disseminate your descriptions and increase your audience. It is simply no good to see this profession as in any way apart from technology. I would say that technology is more central to being an archivist than to many professions, because we *deal in information*. It may be that you can find a position where you can keep technology at arm’s length, but these types of positions will become few and far between. How can you be someone who works professionally with information, and not be prepared to embrace the information environment? The Web, email, social networks, databases: these are what we need to use to do our jobs. We generally have limited resources, and technology can both help us make the most of the resources we have and, conversely, we may need to make informed choices about the technology we use and what sort of impact it will have. Should you use Flickr to disseminate content? What are the pros and cons? Is ‘augmented reality’ a reality for us? Should you be looking at Linked Data? What is is and why might it be important? What about Big Data? It may sound like the latest buzz phrase but it’s big business, and can potentially save time and money. Is your system fit for purpose? Does it create effective online catalogues? How interoperable is it? How adaptable?

Before I give the impression that you need to become some sort of technical whizz-kid, I should make clear that I am not talking about being an out-and-out techie – a software developer or programmer. I am talking about an understanding of technology and how to use it effectively. I am also talking about the ability to talk to technical colleagues in order to achieve this. Furthermore, I am talking about a willingness to embrace what technology offers and not be scared to try things out. It’s not always easy. Technology is fast-moving and sometimes bewildering. But it has to be seen as our ally, as something that can help us to bring archives to the public and to promote a greater understanding of what we do. We use it to catalogue, and I have written previously about how our choice of system has a great impact on our catalogues, and how important it is to be aware of this.

Our role in using technology is really *all about people*. I often think of myself as the middleman, between the technology (the developers) and the audience. My role is to understand technology well enough to work with it, and work with experts, to harness it in order to constantly evolve and use it to best advantage, but also to constantly communicate with archivists and with researchers. To have an understanding of requirements and make sure that we are relevant to end-users. Its a role, therefore, that is about working with people. For most archivists, this role will be within a record office or repository, but either way, working with people is the other side of the coin to working with technology. They are both central to the world of archives.

If you wonder how you can possibly think about everything that technology has to offer: well, you can’t. But that’s why it is even more vital now than it has ever been to think of yourself as being in a collaborative profession. You need to take advantage of the experience and knowledge of colleagues, both within the archives profession and further afield. It’s no good sitting in a bubble at your repository. We need to talk to each other and benefit from sharing our understanding. We need to be outgoing. If you are an introvert, if you are a little shy and quiet, that’s not a problem; but you may have to make a little more effort to engage and to reach out and be an active part of your profession.

They say ‘never work with children and animals’ in show business because both are unpredictable; but in our profession we should be aware that working with people and technology is our bread and butter. Understanding how to catalogue archives to make them available online, to use social networks to communicate our messages, to think about systems that will best meet the needs of archives management, to assess new technologies and tools that may help us in our work. These are vital to the role of a modern professional archivist.

Big Data: what’s it all about?

May 21, 2012 / Jane Stevenson

This blog is about ‘Big Data’. I think it’s worth understanding what’s happening within this space, and my aim is to give a (reasonably) short introduction to the concept and possible relevance for archives.

I attended the the Eduserv Symposium 2012 on Big Data , and this post is partly inspired by what I heard there, in particular the opening talk by Rob Anderson, Chief Technology Officer EMEA. Thanks also to Adrian Stevenson and Lukas Koster for their input during our discussions of this topic (over a beer!).

What is Big Data? In the book Planning for Big Data (Edd Dumbill, O’Reilly) it is described as:

“data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data you must choose an alternative way to process it.”

Big data is often associated with the massive and growing scale of data, and at the Eduserv Symposium many speakers emphasised just how big this increase in data is (which got somewhat repetitive). Many of them spoke about projects that involve huge, huge amounts of data, in particular medical and scientific data. For me tera- and peta- and whatever else -bytes don’t actually mean much. Suffice to say that this scale of data is way way beyond the sort of scale of data that I normally think about in terms of archives and archive descriptions.

We currently have more data than we can analyse, and 90% of the digital universe is unstructured, a fact that drives the move towards the big data approach. You may think big data is new; you may not have come across it before, but it has certainly arrived. Social media, electronic payments, retail analytics, video analysis of customers, medical imaging, utilities data, etc, etc., all are in the Big Data space, and the big players are there too – Google, Amazon, Walmart, Tesco, Facebook, etc., etc., – they all stand to gain a great deal from increasingly effective data analysis.

Very large scale unstructured data needs a different approach to structured data. With structured data there is a reasonable degree of routine. Data searches of a relational database, for example, are based around the principle of searching specific fields, the scope is already set out. But unstructured data is different, and requires a certain amount of experimentation with analysis.

The speakers throughout the Eduserv symposium emphasised many of the benefits that can come with the analysis of unstructured data. For example, Rob Anderson argued that we can raise the quality of patient care through analysis of various data sources – we can take into account treatments, social and economic factors, international perspective

, individual patient history. Another example he gave was the financial collapse: could we have averted this at least to some extent through being able to identify ‘at risk’ customers more effectively? It certainly seems convincing that analysing and understanding data more thoroughly could help us to improve public services and achieve big savings. In other words, data science principles could really bring public benefits and value for money.

Venn diagram for big data — The characteristics of big data

But the definition of big data as unstructured data is only part of the story. It is often characterised by three things, volume (scale), velocity (speed) and variety (heterogenous data).

It is reasonably easy to grasp the idea of volume – processing huge quantities of data. For this scale of data the traditional approaches, based on structured data, are often inadequate.

Velocity may relate to how quickly the data comes into the data centre and the speed of response. Real-time analysis is becoming increasingly effective, used by players like Google. It enables them to instantly identify and track trends – they are able to extract value from the data they are gathering. Similarly Facebook and Twitter are storing every message and monitoring every market trend. Amazon react instantly to purchase information – this can immediately affect price and they can adjust the supply chain. Tesco, for example, know when more of something is selling intensively and can divert supplies to meet demand. Big data approaches are undoubtedly providing great efficiencies for companies, although it raises the whole question of what these companies know about us and whether we are aware of how much data we are giving away.

The variety of data refers to diversity, maybe data from a variety of sources. It may be un-curated, have no schema, be inconsistent and changing. With these problems it can be hard to extract value. The question is, what is the potential value that can be extracted and how can that value be used to good effect? It may be that big data leads to the opti

mised organisation, but it takes time and skill to build what is required and it is important to have a clear idea of what you are wanting to achieve – what your value proposition is. Right now decisions are often made based on pretty poor information, and so they are often pretty poor decisions. Good data is often hard to get, so decisions may be based on little or no data at all. Many companies fail to detect shifts in consumer demand, but at the same time the Internet has made customers more segmented, so the picture is more complex. Companies need to adjust to this and respond to differing requirements. They need to take a more scientific approach because sophisticated analytics makes for better decisions and in the end better products.

At the Eduserv Symposium there were a number of speakers who provided some inspirational examples of what is possible with big data solutions. Dr Guy Coates from the Wellcome Trust Sanger Institute talked about the human genome. The ability to compare, correlate with other records and work towards finding genetic causes for diseases opens up exciting new opportunities. It is possible to work towards more personalised medicine, avoiding the time spent trying to work out which drugs work for individuals. Dr Coates talked about the rise of more agile systems, able to cope with this way of working, more modular design and an evolving incremental approach rather than the typical 3-year cycle of complete replacement of hardware.

Professor Anthony Brookes from the University of Leicester introduced the concept of ‘knowledge engineering’, thinking about this in the context of health. He stated that in many cases it may be that the rate of data generation is increasing, but it is often the same sort of data, so it may be that scale is not such an issue in all cases. This is an important point. It is easy to equate big data with scale, but it is not all about scale. The rise of Big Data is just as concerned with things like new tools that are making analysis of data more effective.

Prof Brookes described knowledge engineering as a discipline that involves integrating knowledge into computer systems in order to solve complex problems (see the i4health website for more information). He effectively conveyed how we have so much medical knowledge, but the knowledge is simply not used properly. Research and healthcare are separate – the data does not flow properly between them. We need to bring together bio-informatics and academics with medical informatics and companies, but at the moment there is a very definite gap, and this is a really really big problem. We need to build a bridge between the two, and for this you need an engineer – a knowledge engineer – someone with expertise to work through the issues involved in bridging the gap and getting the data flow right. The knowledge engineer needs to understand the knowledge potential, understand standards, understand who owns the data, understand the ethics, think about what is required to share data, such as having researcher IDs, open data discovery, remote pooled analysis of data, categories of risk for data. This type of role is essential in order to effect an integration of data with knowledge.

As well as hearing about the knowledge engineer we heard about the rise of the ‘data scientist‘, or, rather more facetiously, the “business analyst who lives in Califorina” (Adam Cooper’s blog). This concept of a data scientist was revisited throughout the Eduserv symposium. At one point it was referred to as “someone who likes telling stories around data”, an idea that immediately piqued my interest. It did seem to be a broad concept, encompassing both data analysis and data curation, although in reality these roles are really quite distinct, and there was acknowledgement that they need to be more clearly defined.

Big data has helped to create an atmosphere where expectations people have of the public sector are no longer met; expectations created by what is often provided in the private sector, for example, a more rapid response to enquiry. We expect the commercial sector to know about us and understand what we want and maybe we think public services should do the same? But organisational change is a big challenge in the public sector. Data analysis can actually be seen as opposed to the ‘human agenda’ as it moves away from the principle of human relationships. But data can drive public service innovation, and help to allocate resources efficiently, in a way that responds to need.

Big Data raises the question of the benefits of open, transparent and shared information. This message seems to come to the fore again and again, not just with Big Data, but with the whole open data agenda and Linked Data area. For example, advance warning for earthquakes requires real-time analytics – it is hard to extract this information from the diverse systems that are out there. But a Twitter-based Earthquake Detector provides substantial benefits. Simply by following #quake tweets it is possible to get a surprisingly accurate picture very quickly; apparently more #quake tweets has been shown to be an accurate indication of a bigger quake. Twitter users reacted extremely quickly to the quake and tsunami in Japan. In the US, the Government launched a cash for old cars initiative (Cash for Clunkers), to encourage people to move to greener vehicles. The Government were not sure whether the initiative was proving to be successful, but Google knew that it was, because they had the information about what people were searching for and they could see people searching for this initiative to find information about what the Government were offering. Google can instantly find out where in the world people are searching for information – for example, information about flu trends, because they can analyse which search terms are relevant, something called ‘nowcasting‘.

In the commercial sector big data is having a big impact, but it is less certain what its impact is within higher education and sectors such as cultural heritage. The question many people seem to be asking is ‘how will this be relevant to us?’

Big data may have implications for archivists in terms of what to keep and what to delete. We often log everything because we don’t know what questions we will want the answer to. But if we decide we can’t keep everything then what do we delete? We know that people only tend to criticise if you get it wrong. In the US the new National Science Foundation data retention requirements now mean you have to keep all data for 3 years after the research award conclusion, and you must produce a data management plan. But with more and more sophisticated means of processing data, should we be looking to keep data that we might otherwise dispose of? We might have considered it expensive to manage and hard to extract value from, but this is now changing. Should we always keep everything when we can? Many companies are simply storing more and more data ‘just in case’; partly because they don’t want to risk being accused of throwing it away if it turns out to be important. Our ideas of what we can extract from data are changing, and this may have implications for its value.

Does the archive community need to engage with big data? At the Eduserv Symposium one of the speakers referred to NARA making available documents for analysis. The analysis of data is something that should interest us:

“A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application.” (Planning for Big Data, Edd Dumbill).

For example, you might be looking to determine exactly what a name refers to: is this city London, England or London, Texas? Tasks such as crowd-sourcing, cleaning data, answering imprecise questions, these are relevant in the big data space and therefore potentially relevant within archives.

As already stated, it is about more than scale, and it can be very relevant to the end-user experience: “Big data is often about fast results, rather than simply crunching a large amount of information.” (Dumbill) For example, the ability to suggest on-the-fly which books someone might enjoy requires a system to provide an answer in the time it takes a page to load. It is possible to think about ways that this type of data processing could enhance the user experience within the archives space; helping the user to find what might be of relevance to their research. Again, expectations may start to demand that we provide this type of experience, as many other information sites already provide it.

The Eduserv Symposium concluded that there is a skills gap in both the public and commercial sector when it comes to big data. We need a new generation of “big data scientists” to meet a need. We also need to combine algorithms, machines and people holistically to meet the big data problem. There may be an issue around mindset, particularly in terms of worries about the use of data – something that arises in the whole open data agenda. In addition, one of the big problems we have at the moment is that we are often building on infrastructures that are not really suited to this type of unstructured data. It takes time, and knowledge of what is required, to move towards a new infrastructure, and the advance of big data may be held back by the organisational change required and skills needed to do this work.