Archives and the Researchers of Tomorrow

“In 2009, the British Library and JISC commissioned the three-year Researchers of Tomorrow study, focusing on the information-seeking and research behaviour of doctoral students in ‘Generation Y’, born between 1982 and 1994 and not ‘digital natives’. Over 17,000 doctoral students from more than 70 higher education institutions participated in the three annual surveys, which were complemented by a longitudinal student cohort study.” (Taken from http://www.jisc.ac.uk/publications/reports/2012/researchers-of-tomorrow#exec_sum).

This post picks up on some aspects of the study, particularly those that are relevant to archives and archivists. I am assuming that archivists fall, at least to an extent, into the category of ‘library professionals’, though our profession is not explicitly mentioned. I would recommend reading the report in full, as it offers some useful insights into the research behaviour of an important group of researchers.

What is heartening about this study is that the findings confirm that Generation Y doctoral students are “sophisticated information-seekers and users of complex information sources”. The study does conclude that information-seeking behaviour is becoming less reliant on the support of libraries and library staff, which may have implications for the role of library and archive professionals, but “library staff assistance with finding/retrieving difficult-to-access resources” was seen as one of the most valuable research support resources available to students, although only a relatively small proportion of students used this service. There was a preference for this kind of on-demand, one-to-one support rather than formal training sessions. One of the students said: “the librarians are quite possibly the most enthusiastic and helpful people ever, and I certainly recommend finding a librarian who knows their stuff, because I have had tremendous amounts of help with my research so far, just simply by asking my librarian the right question.”

The survey concentrated on the most recent information-seeking activity of students, and found that most were not seeking primary sources, but rather secondary sources (largely journal articles and books).

This apparent and striking dependence on published research resources implies that, as the basis for their own analytical and original research, relatively few doctoral students in social sciences and arts and humanities are using ‘primary’ materials such as newspapers, archival material and social data.

This finding held across all subject disciplines and all ages. The study found that about 80% of arts and humanities students were looking either for any bibliographic references on their topic or for specific publications, while only 7% were looking for non-published archival material. It seems that this reliance on published information is formed early on, as students lay the groundwork for their PhD. Most students said they used academic libraries more in their first year of study – whether visiting in person or using the online services – so maybe this is the time to engage with students and encourage them to use more diverse sources in the longer term.

A point that piqued my interest was that arts and humanities students visiting other collections in order to use archival sources would do so even if “many of the resources they required had been digitised”, but this point was not explained further, which was frustrating. Is it because they are likely to use a mixture of digital and non-digital sources? Or maybe locating digital content stimulates their interest in finding out more about using primary sources?

Around 30% used Google or Google Scholar as their main information-gathering tool, although arts and humanities students drew their information from a wider spread of online and offline sources, including library catalogues.

One thing that concerned me, and that I was not aware of, was the assertion that “current citation-based assessment and authenticity criteria in doctoral and academic research discourage the citing of non-published or original material”. I would be interested to know why this is the case; surely it should be actively encouraged rather than discouraged? How does this fit with the need to demonstrate originality in research?

Students rely heavily on help from their supervisors early on in their research, and supervisors have a great influence on their information and resource use. I wonder if they actively encourage the use of primary sources as much as we would like?  I can’t help thinking that a supervisor enthusiastically extolling the importance and potential impact of using archives would be the best way to encourage use.

There continues to be a feeling amongst students, according to this study, that “using social media and online forums in research lacks legitimacy” and that these tools are more appropriate within a social context. The use of Twitter, blogs and social bookmarking was low (in the 2009 survey, 13% of arts and humanities students had used and valued Twitter), and use was more commonly passive than active. There was a feeling that new tools and applications should not transform the way that students work, but rather complement and enhance established research practices and behaviour. However, use of ‘Web 2.0’ tools increased over the three years of the study, so a similar study carried out in five years’ time might show significantly different behaviour.

Students want training in research geared towards their own subject area, preferably face-to-face; online tutorials and packages were not well used. The implication is that training should be delivered at a very local level and in a fairly informal way, as generic research skills sessions are less appealing. Research skills training days are valued, but if a session is poorly taught, students feel their time has been wasted and may be put off trying again. Students were quite critical of the quality and utility of the training offered by their university, mainly because (i) it was not pitched at the right level, (ii) it was too generic, or (iii) it was not available on demand. Library-led training sessions got a more positive response, but students were far less likely to take up training opportunities after the first year of their PhD. Training in the use of primary sources was not specifically covered in the report, though presumably this would be (should be!) included in library-led training.

The study indicated that students dislike reading (as opposed to scanning) on screen. This suggests that it is important to provide online information that is easy to scan through, but also worth providing a PDF for printing, especially for detailed descriptions.

One quote stood out for me, as it seems to sum up the potential danger of modern ways of working in terms of approaches to more in-depth analysis:

“The problem with the internet is that it’s so easy to drift between websites and to absorb information in short easy bites that at times you forget to turn off the computer, rest your eyes from screen glare and do some proper in-depth reading. The fragments and thoughts on the internet are compelling (addictive, even), and incredibly useful for breadth, but browsing (as its name suggests) isn’t really so good for depth, and at this level depth is what’s required.” (Arts and humanities)

We do often hear about the internet, or computers, tending to reduce levels of concentration. I think that this point is subtly different though – it’s more about the type of concentration required for in-depth work, something that could be seen as essential for effective primary source research.

Conclusions

We probably all agree that we can always do more to promote the importance of archives to all potential users, including doctoral students. Certainly, we need to make it easier for them to discover sources through the routes they usually use, which means, for one thing, ensuring we have a profile via Google and Google Scholar. Too many archives still resist this, as if it were somehow demeaning or too populist; or maybe they are too caught up in developing their own websites to think about search engine optimisation; or maybe archivists are simply not sure how to achieve good search engine rankings.

Are we actively promoting a low-barrier, welcoming and open approach? I recall many archive institutions that routinely state that their archives are ‘open to bona fide researchers only’. Language like that seems to me to be somewhat off-putting. Who are the ‘non-bona fide’ researchers that need to be kept out? This sort of language does not seem conducive to the principle of making archives available to all.

The applications we develop need to be relatively easy to absorb into existing research work practices, which will only change slowly over time. We should not get too caught up in social networks and Web 2.0 as if these are ‘where it’s at’ for this generation of students. Maybe the approaches to research are generally more traditional than we think.

The report itself concludes that the lack of use of primary sources is worrying and requires further investigation:

There is a strong case for more in-depth research among doctoral students to determine whether the data signals a real shift away from doctoral research based on primary sources compared to, say, a decade ago. If this proves to be the case there may be significant implications for doctoral research quality related to what Park described as “widely articulated tensions between product (producing a thesis of adequate quality) and process (developing the researcher), and between timely completion and high quality research“.


The modern archivist: working with people and technology

I’ve recently read Kate Theimer’s excellent post on Honest Tips for Wannabe Archivists Out There.

This is something that I’ve thought about quite a bit, as I manage an online service for archives and I provide training and teaching for archivists and archive students around creating online descriptions. I would like to direct this blog post to archive students and those considering becoming archivists. I think it applies equally to records managers, although they sometimes have a more defined role in terms of audience, so the perspective may be somewhat different.

It’s fine if you have ‘a love of history’, if you ‘feel a thrill when handling old documents’. That’s a good start. I’ve heard this kind of thing frequently as a motivation for becoming an archivist. But this is not enough. It is more important to have the desire to make those archives available to others; to provide a service for researchers. To become an archivist is to become a service provider, not an historian. It may not sound as romantic, but as far as I am concerned it is what we are, and we should be proud of the service we provide, which is extremely valuable to society. Understanding how researchers might use the archives is, of course, very important, so that you can help to support them in their work. Love of the materials, and love of the subject (especially in a specialist repository) should certainly help you with this core role. Indeed, you will build an understanding of your collections, and become more expert in them over time, which is one of the wonderful things about being an archivist.

Your core role is to make archives available to the community – for many of us, the community is potentially anyone; for some of us it may be more restricted in scope. So, you have an interest in the materials, and you need to make them available. To do this you need to understand the vital importance of cataloguing. It is this that gives people a way into the archives. Cataloguing is a real skill, not something to be dismissed as simply creating a list of what you have. It is something to really work on and think about. I have seen enough inconsistent catalogues over the last ten years to tell you that being rigorous, systematic and standards-based in cataloguing is incredibly important, and technology is our friend in this aim. Furthermore, the whole notion of ‘cataloguing’ is changing, a change led by the opportunities of the modern digital age and the perspectives and requirements of those who use technology in their everyday life and work. We need to be aware of this, willing (even excited!) to embrace what it means for our profession, and ready to adapt.

This brings me to the subject I am particularly interested in: the use of technology. Cataloguing *is* using technology, and dissemination *is* using technology. That is, it should be and it needs to be if you want to make an impact; if you want to effectively disseminate your descriptions and increase your audience. It is simply no good to see this profession as in any way apart from technology. I would say that technology is more central to being an archivist than to many professions, because we *deal in information*. It may be that you can find a position where you can keep technology at arm’s length, but these types of positions will become few and far between. How can you be someone who works professionally with information, and not be prepared to embrace the information environment? The Web, email, social networks, databases: these are what we need to use to do our jobs. We generally have limited resources, and technology can help us make the most of the resources we have; conversely, we may need to make informed choices about the technology we use and what sort of impact it will have. Should you use Flickr to disseminate content? What are the pros and cons? Is ‘augmented reality’ a reality for us? Should you be looking at Linked Data? What is it and why might it be important? What about Big Data? It may sound like the latest buzz phrase, but it’s big business, and can potentially save time and money. Is your system fit for purpose? Does it create effective online catalogues? How interoperable is it? How adaptable?

Before I give the impression that you need to become some sort of technical whizz-kid, I should make clear that I am not talking about being an out-and-out techie – a software developer or programmer. I am talking about an understanding of technology and how to use it effectively. I am also talking about the ability to talk to technical colleagues in order to achieve this. Furthermore, I am talking about a willingness to embrace what technology offers and not be scared to try things out. It’s not always easy. Technology is fast-moving and sometimes bewildering. But it has to be seen as our ally, as something that can help us to bring archives to the public and to promote a greater understanding of what we do. We use it to catalogue, and I have written previously about how our choice of system has a great impact on our catalogues, and how important it is to be aware of this.

Our role in using technology is really *all about people*. I often think of myself as the middleman, between the technology (the developers) and the audience. My role is to understand technology well enough to work with it, and to work with experts, harnessing it to best advantage as it constantly evolves; but also to constantly communicate with archivists and with researchers, to understand their requirements and make sure that we remain relevant to end-users. It’s a role, therefore, that is about working with people. For most archivists, this role will be within a record office or repository, but either way, working with people is the other side of the coin to working with technology. They are both central to the world of archives.

If you wonder how you can possibly think about everything that technology has to offer: well, you can’t. But that’s why it is even more vital now than it has ever been to think of yourself as being in a collaborative profession. You need to take advantage of the experience and knowledge of colleagues, both within the archives profession and further afield. It’s no good sitting in a bubble at your repository. We need to talk to each other and benefit from sharing our understanding. We need to be outgoing. If you are an introvert, if you are a little shy and quiet, that’s not a problem; but you may have to make a little more effort to engage and to reach out and be an active part of your profession.

They say ‘never work with children and animals’ in show business because both are unpredictable; but in our profession we should be aware that working with people and technology is our bread and butter. Understanding how to catalogue archives to make them available online, to use social networks to communicate our messages, to think about systems that will best meet the needs of archives management, to assess new technologies and tools that may help us in our work. These are vital to the role of a modern professional archivist.

Big Data: what’s it all about?

This post is about ‘Big Data’. I think it’s worth understanding what’s happening within this space, and my aim is to give a (reasonably) short introduction to the concept and its possible relevance for archives.

I attended the Eduserv Symposium 2012 on Big Data, and this post is partly inspired by what I heard there, in particular the opening talk by Rob Anderson, Chief Technology Officer EMEA. Thanks also to Adrian Stevenson and Lukas Koster for their input during our discussions of this topic (over a beer!).

What is Big Data? In the book Planning for Big Data (Edd Dumbill, O’Reilly) it is described as:

“data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data you must choose an alternative way to process it.”

Big data is often associated with the massive and growing scale of data, and at the Eduserv Symposium many speakers emphasised just how big this increase in data is (which got somewhat repetitive). Many of them spoke about projects that involve huge, huge amounts of data, in particular medical and scientific data. For me tera- and peta- and whatever else -bytes don’t actually mean much. Suffice to say that this scale of data is way way beyond the sort of scale of data that I normally think about in terms of archives and archive descriptions.

We currently have more data than we can analyse, and 90% of the digital universe is unstructured, a fact that drives the move towards the big data approach. You may think big data is new, or you may not have come across it before, but it has certainly arrived. Social media, electronic payments, retail analytics, video analysis of customers, medical imaging, utilities data and more are all in the Big Data space, and the big players are there too – Google, Amazon, Walmart, Tesco, Facebook – all of whom stand to gain a great deal from increasingly effective data analysis.

Very large-scale unstructured data needs a different approach from structured data. With structured data there is a reasonable degree of routine: searches of a relational database, for example, are based around the principle of searching specific fields, so the scope is already set out. Unstructured data is different, and requires a certain amount of experimentation with analysis.
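To make that contrast concrete, here is a minimal sketch in Python; the table, fields and sample text are invented for illustration.

```python
import sqlite3
from collections import Counter

# Structured: the schema fixes the scope of any search in advance.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER, creator TEXT, year INTEGER)")
db.execute("INSERT INTO records VALUES (1, 'Beatrice Webb', 1905)")
rows = db.execute(
    "SELECT id FROM records WHERE creator = ? AND year < ?",
    ("Beatrice Webb", 1950),
).fetchall()

# Unstructured: no schema, so analysis is exploratory - here, simply
# counting terms to get a first idea of what the text is about.
text = "diary entry: met the trade union delegation in London today"
term_counts = Counter(text.lower().split())

print(rows)                        # [(1,)]
print(term_counts.most_common(3))  # most frequent terms in the free text
```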

The speakers throughout the Eduserv Symposium emphasised many of the benefits that can come with the analysis of unstructured data. For example, Rob Anderson argued that we can raise the quality of patient care through analysis of various data sources – we can take into account treatments, social and economic factors, international perspectives and individual patient history. Another example he gave was the financial collapse: could we have averted this, at least to some extent, by being able to identify ‘at risk’ customers more effectively? It certainly seems convincing that analysing and understanding data more thoroughly could help us to improve public services and achieve big savings. In other words, data science principles could really bring public benefits and value for money.

Venn diagram: the characteristics of big data

But the definition of big data as unstructured data is only part of the story. Big data is often characterised by three things: volume (scale), velocity (speed) and variety (heterogeneous data).

It is reasonably easy to grasp the idea of volume – processing huge quantities of data. For this scale of data the traditional approaches, based on structured data, are often inadequate.

Velocity relates to how quickly the data comes into the data centre and the speed of response. Real-time analysis is becoming increasingly effective, used by players like Google; it enables them to instantly identify and track trends, extracting value from the data they are gathering. Similarly, Facebook and Twitter are storing every message and monitoring every market trend. Amazon react instantly to purchase information – it can immediately affect price, and they can adjust the supply chain. Tesco, for example, know when more of something is selling intensively and can divert supplies to meet demand. Big data approaches are undoubtedly providing great efficiencies for companies, although they raise the whole question of what these companies know about us and whether we are aware of how much data we are giving away.

The variety of data refers to diversity – data, perhaps, from a variety of sources. It may be un-curated, have no schema, and be inconsistent and changing. With these problems it can be hard to extract value. The question is, what is the potential value that can be extracted, and how can that value be used to good effect? It may be that big data leads to the optimised organisation, but it takes time and skill to build what is required, and it is important to have a clear idea of what you want to achieve – what your value proposition is. Right now decisions are often made on pretty poor information, and so they are often pretty poor decisions. Good data is often hard to get, so decisions may be based on little or no data at all. Many companies fail to detect shifts in consumer demand; at the same time the Internet has made customers more segmented, so the picture is more complex. Companies need to adjust to this and respond to differing requirements. They need to take a more scientific approach, because sophisticated analytics makes for better decisions and, in the end, better products.

Andy Powell introduces the Eduserv Symposium: Big Data, big deal?

At the Eduserv Symposium a number of speakers provided some inspirational examples of what is possible with big data solutions. Dr Guy Coates from the Wellcome Trust Sanger Institute talked about the human genome. The ability to compare and correlate with other records, and to work towards finding genetic causes for diseases, opens up exciting new opportunities. It is possible to work towards more personalised medicine, avoiding the time spent trying to work out which drugs work for which individuals. Dr Coates talked about the rise of more agile systems able to cope with this way of working, with more modular design and an evolving, incremental approach rather than the typical three-year cycle of complete replacement of hardware.

Professor Anthony Brookes from the University of Leicester introduced the concept of ‘knowledge engineering’, thinking about this in the context of health. He stated that in many cases it may be that the rate of data generation is increasing, but it is often the same sort of data, so it may be that scale is not such an issue in all cases. This is an important point. It is easy to equate big data with scale, but it is not all about scale. The rise of Big Data is just as concerned with things like new tools that are making analysis of data more effective.

Prof Brookes described knowledge engineering as a discipline that involves integrating knowledge into computer systems in order to solve complex problems (see the i4health website for more information). He effectively conveyed how we have so much medical knowledge, but that knowledge is simply not used properly. Research and healthcare are separate – the data does not flow properly between them. We need to bring together bio-informatics and academics with medical informatics and companies, but at the moment there is a very definite gap, and this is a really big problem. We need to build a bridge between the two, and for this you need an engineer – a knowledge engineer – someone with the expertise to work through the issues involved in bridging the gap and getting the data flow right. The knowledge engineer needs to understand the knowledge potential, standards, who owns the data, and the ethics, and to think about what is required to share data: researcher IDs, open data discovery, remote pooled analysis of data, categories of risk for data. This type of role is essential in order to effect an integration of data with knowledge.

As well as hearing about the knowledge engineer, we heard about the rise of the ‘data scientist‘ or, rather more facetiously, the “business analyst who lives in California” (Adam Cooper’s blog). This concept of a data scientist was revisited throughout the Eduserv Symposium. At one point it was referred to as “someone who likes telling stories around data”, an idea that immediately piqued my interest. It did seem to be a broad concept, encompassing both data analysis and data curation, although in reality these roles are quite distinct, and there was acknowledgement that they need to be more clearly defined.

Big data has helped to create an atmosphere in which the expectations people have of the public sector are no longer met – expectations created by what is often provided in the private sector, for example a more rapid response to enquiries. We expect the commercial sector to know about us and understand what we want, so maybe we think public services should do the same? But organisational change is a big challenge in the public sector. Data analysis can actually be seen as opposed to the ‘human agenda’, as it moves away from the principle of human relationships. But data can drive public service innovation, and help to allocate resources efficiently, in a way that responds to need.

Big Data raises the question of the benefits of open, transparent and shared information. This message seems to come to the fore again and again, not just with Big Data, but with the whole open data agenda and the Linked Data area. For example, advance warning of earthquakes requires real-time analytics, and it is hard to extract this information from the diverse systems that are out there; but a Twitter-based earthquake detector provides substantial benefits. Simply by following #quake tweets it is possible to get a surprisingly accurate picture very quickly; apparently a higher volume of #quake tweets has been shown to be an accurate indication of a bigger quake. Twitter users reacted extremely quickly to the quake and tsunami in Japan. In the US, the Government launched a cash-for-old-cars initiative (Cash for Clunkers) to encourage people to move to greener vehicles. The Government were not sure whether the initiative was proving successful, but Google knew that it was, because they had the information about what people were searching for, and they could see people searching for the initiative to find out what the Government were offering. Google can instantly find out where in the world people are searching for particular information – for example, information about flu trends – because they can analyse which search terms are relevant, something called ‘nowcasting‘.

In the commercial sector big data is having a big impact, but it is less certain what its impact will be within higher education and sectors such as cultural heritage. The question many people seem to be asking is ‘how will this be relevant to us?’

Big data may have implications for archivists in terms of what to keep and what to delete. We often log everything because we don’t know what questions we will want the answer to. But if we decide we can’t keep everything then what do we delete? We know that people only tend to criticise if you get it wrong. In the US the new National Science Foundation data retention requirements now mean you have to keep all data for 3 years after the research award conclusion, and you must produce a data management plan. But with more and more sophisticated means of processing data, should we be looking to keep data that we might otherwise dispose of? We might have considered it expensive to manage and hard to extract value from, but this is now changing. Should we always keep everything when we can? Many companies are simply storing more and more data ‘just in case’; partly because they don’t want to risk being accused of throwing it away if it turns out to be important. Our ideas of what we can extract from data are changing, and this may have implications for its value.

Does the archive community need to engage with big data? At the Eduserv Symposium one of the speakers referred to NARA making documents available for analysis. The analysis of data is something that should interest us:

“A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application.” (Planning for Big Data, Edd Dumbill).

For example, you might be looking to determine exactly what a name refers to: is this city London, England or London, Texas? Tasks such as crowd-sourcing, cleaning data and answering imprecise questions are all relevant in the big data space, and therefore potentially relevant within archives.
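As a toy illustration of that kind of name disambiguation, the sketch below picks the candidate place whose typical context words overlap most with the words surrounding the name. The candidate lists and context words are invented; a real system would use gazetteers and statistical models.

```python
# Toy place-name disambiguation: score each candidate by how many of
# its typical context words appear around the ambiguous name.
CANDIDATES = {
    "London": [
        {"place": "London, England", "context": {"thames", "westminster", "uk"}},
        {"place": "London, Texas", "context": {"texas", "usa", "ranch"}},
    ]
}

def disambiguate(name: str, surrounding_words: set[str]) -> str:
    """Return the candidate whose context overlaps most with the text."""
    candidates = CANDIDATES.get(name, [])
    if not candidates:
        return name  # no information either way
    best = max(candidates, key=lambda c: len(c["context"] & surrounding_words))
    return best["place"]

print(disambiguate("London", {"letter", "posted", "westminster", "fog"}))
# -> London, England
```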

As already stated, it is about more than scale, and it can be very relevant to the end-user experience: “Big data is often about fast results, rather than simply crunching a large amount of information.” (Dumbill) For example, the ability to suggest on-the-fly which books someone might enjoy requires a system to provide an answer in the time it takes a page to load. It is possible to think about ways that this type of data processing could enhance the user experience within the archives space; helping the user to find what might be of relevance to their research. Again, expectations may start to demand that we provide this type of experience, as many other information sites already provide it.

The Eduserv Symposium concluded that there is a skills gap in both the public and commercial sectors when it comes to big data. We need a new generation of “big data scientists” to meet this need. We also need to combine algorithms, machines and people holistically to meet the big data problem. There may be an issue around mindset, particularly in terms of worries about the use of data – something that arises in the whole open data agenda. In addition, one of the big problems we have at the moment is that we are often building on infrastructures that are not really suited to this type of unstructured data. It takes time, and knowledge of what is required, to move towards a new infrastructure, and the advance of big data may be held back by the organisational change and skills needed to do this work.

Open Up!

Speaking at the recent World Wide Web (WWW2012) conference in Lyon, Neelie Kroes, Vice President of the European Commission, emphasised the benefits of an open Web:  “With a truly open, universal platform, we can deliver choice and competition; innovation and opportunity; freedom and democratic accountability”.


But what does ‘open’ really mean? Do we really have a clear idea about what it is, how we achieve it, and what the benefits might be? And do we have a sense of what is happening in this landscape and the implications for our own domain?

There is a tendency to think that the social web equates to some degree with the open web, but this is not the case. Social sites like Facebook often create walled gardens, and you could argue that whilst they are connecting people, they are working against the principle of free and linked information. Just because the aim is to enable people to be social and connected, that doesn’t mean that it is done in a way that ensures the data is open.

Another confusion may arise around openness and its relationship to privacy. Things can be open while privacy is still respected. It’s about clear and transparent choices – whether you want your data to be open or whether you want to include some restrictions. We are not always aware of what we are giving away:

“the Net enables individuals in many cases to compromise privacy more thoroughly than the government and commercial institutions traditionally targeted for scrutiny and regulation. The standard approaches that have been developed to analyze and limit institutional actors do not work well for this new breed of problem, which goes far beyond the compromise of sensitive information.” (Jonathan Zittrain, The Future of the Internet)

There is an irony that we often readily give away personal data – so we can be (unknowingly?) very open with our own information – but often we give it away to companies that are not, in their turn, open with the data that is provided. As Tim Berners-Lee has stated, “One of the issues of social networking silos is that they have the data and I don’t”. It is possible to paint quite a worrying picture of a trend towards control being in the hands of the few:

“A small elite of gifted (largely male) super-wealthy masters of the universe are creating systems through their own brilliance which very few others in government, regulation or the general public can understand.” (http://www.guardian.co.uk/commentisfree/2012/apr/16/threat-open-web-opaque-elite).

Google has often been seen as a good guy in this context, but recently a number of actions have indicated a move away from open web principles. We have started to become aware of the way Google collects and uses our data, and there have been some dubious practices around the ways that data has been gathered (e.g. personal data collected from UK Wi-Fi networks). Google appear very eager to get us all using Google+, and we are starting to see ‘social’ search results that appear to push people towards using Google+. I think we should be concerned by the announcement from Google that they are now starting to combine your Google search history with other data they collect through use of their products in order, they say, to deliver improved search results. Do we want Google to do this? Are we happy with our search history being used to push products? Is a ‘personalised’ service worth having if it means companies are collecting and holding all of the data we provide when we browse the web for their own commercial purposes?

It does seem that many of these companies will defend ‘open’ if it is in their interests, but find ways to bypass it when it is not. But maybe if we continue to apply pressure in support of an open approach, by implementing it ourselves and favouring tools and services that implement it, we can tip the balance increasingly towards a genuinely open approach, balanced by safeguards guaranteeing privacy.

It is worth thinking about some of these issues when we think about our own archival data and whether we are going to take an open approach. It is not just our own individual circumstances, within one repository, that we should think about, but the whole context of what we want the Web to be, because our choices help determine the direction that the Internet goes in.

We do not know how our archival data will be used, and we never have. Even when the data was on paper, we couldn’t control use. Fundamentally, we need to see this as a good thing, not a threat, but we need to be aware of the risks. The paradigm has shifted because data is much, much easier to mess with now that it is digital; and, of course, we have digital archives, and we are concerned to protect the integrity of these sources. Providing an API seems like a much more explicit invitation to play with the data than simply putting it on the Web. Furthermore, the mindset of the digital generation is entirely different. Cutting and pasting, mashing and inter-linking are all taken for granted, and sometimes the idea of the provenance of data sources seems to go out of the window.

This issue of ‘unpredictable use’ is interesting. When the Apple II was launched, Steve Jobs did not know how it would be used – as Zittrain states, it ‘invited people to tinker with it’, and whilst Apple might have had ideas about use, their ideas were not a constraint on reality. Contrast this with the iPhone: it comes out of the box programmed. Quoting Zittrain again: ‘Whereas the world would innovate for the Apple II, only Apple would innovate for the iPhone.’ Zittrain sees this trend towards fixed appliances as pointing to a rather bleak future of ‘sterile appliances tethered to a network of control.’

Maybe the challenge is partly around our understanding of the data we are ‘giving away’, why we are giving it away, and whether it’s worth giving away because of what we get in return. Think about the data gathered as we browse the Web – cookies have always had a major role in building up a profile of our Web behaviour: where we go and what we do. Cookies are generally interested in behaviour rather than identity. But still, from 26 May this year, every website operating in the UK will be required to inform its users that they are being tracked with cookies, and to ask users for their consent. This means justifying why cookies are required – and it may be for very laudable reasons: logging users’ preferences to help improve their experience, or generally getting a better understanding of how people use the site in order to improve it. It may be in order to target advertising – and you may feel it is hardly a problem if that is adversely affected, but maybe it will be smaller, potentially very useful sites that suffer if their advertising revenue is hit. Is it a good thing that sites will have to justify their use of cookies? Is it in our interests? Is this really where the main fight is around protecting your identity? At the same time, the Government is proposing to bring in powers to monitor the websites people are looking at, something which has very profound implications for our privacy.

On the side of the good guys, there are plenty of people out there espousing the importance of an altruistic approach. In The Digital Public Domain: Foundations for an Open Culture the authors argue that the Public Domain – that is, the informational works owned by all of us, be that literature, music, the output of scientific research, educational material or public sector information – is fundamental to a healthy society. Many other publications echo this belief, and you can see plenty of evidence of the movement towards open publication. Technology can increasingly help us realise the advantages of an open approach. For example, text mining is something that can potentially bring substantial economic and cultural benefits. It is often associated with the biomedical sciences, but it could have great potential across the humanities. A recent JISC report on the Value and Benefits of Text Mining concluded that current licensing arrangements are a key barrier to unlocking the potential value of text mining to society and enabling researchers to gain maximum benefit from the data. Text mining analyses text using computational methods, and provides the end user with what is most relevant to them. But copyright law works against this principle:

“Current copyright law, however, imposes serious restrictions on text mining, because it involves a range of computerised analytical processes that are not all readily permitted within UK intellectual property law. In order to be ‘mined’, text must be accessed, copied, analysed, annotated and related to existing information and understanding. Even if the user has access rights to the material, making annotated copies can currently be illegal without the permission of the copyright holder.” (JISC report: Value and Benefits of Text Mining, http://www.jisc.ac.uk/inform/inform33/TextMining.html)

It is important to think beyond types of data and formats, and engage with the debate about the sort of advances for researchers that we can gain by opening up all types of data; whilst the data may differ, many of the arguments about the benefits of open data apply across the board, in terms of encouraging innovation and the advancement of knowledge.

The battle between open and transparent and the walled garden scenario is played out in many ways. One area is the growth of native apps for mobile devices. It is easy to simply see mobile apps as providing new functionality, and opportunities for innovation, with so many people creating apps for every activity you can think of. But what are the implications of a scenario where you have to enter a ‘silo’ environment in order to do anything, from finding your way to a location to converting measurements from one unit to another, to reading the local news reports? Tim Berners-Lee sees this as one of the main threats to the principle of the Web as a platform with the potential to connect data.

Should we be promoting the open Web? Surely as information professionals we should. This does not mean that there is an imperative to make everything open – we are all aware of the Data Protection Act, and there are a whole range of very valid reasons to withhold information. But it is increasingly apparent that we need to be explicit in applying permissions – whether fully open licences, attribution licences or closure periods. Many of the archives described on the Archives Hub have ‘open for consultation’ under Access Conditions. But what does this really mean? What can people really do with the content? Is the content actually ‘open for reuse in any way the end-user deems useful to them’? And specifically, what can they do with digital content, available on the World Wide Web?

We need to be aware of the open data agenda and the threats and opportunities presented to us. We have the same kind of tensions within our own world in terms of what we need to protect, both for preservation reasons and for reasons of IPR and commercial advantage, but we want to make archives as open and available as possible and we need to understand what this means in the context of the digital, online world. The question is, whether we want to encourage innovation, what Zittrain calls a ‘generative’ approach. This means enabling programmers to take hold of what you can provide and working with it, as well as encouraging reuse by end users. But it also means taking a risk with security, because making things open inevitably makes them more open to abuse.

People’s experiences with the Internet are shaped by the devices they use to access it. And, similarly, their experiences of archives are shaped by the ways that we choose to enable access. In the same way that a revolution started when computers were made available that could potentially run software that was not available at the time they were built, so we should think about enabling innovation with our data that we may not envisage at present, not trying to determine user behaviour but relishing the idea of the unknown. This raises the question of whether we facilitate this by simply putting the data out as it is, even if it is somewhat unstructured, making it as open as possible, or whether we should try to clean up data, and make it more rigorous, more consistent and more structured, with the purpose of facilitating flexible use of the data, enabling people to use and present it in other ways. But maybe that is another story…

In a way, the early development of the Internet was an unexpected journey. Innovation was encouraged because a group of computer scientists did not want to control it or to ensure maximum commercial success. It was an altruistic approach, and it might easily not have turned out that way. Maybe the PC was an unlikely success story in business, where a stable, supported, predictable solution might have been favoured, one that did not require in-house skills. Both Microsoft and Apple made their computers’ operating systems open for third-party software development, so even if there was a sense of monopoly, there was still a basic ethos of anyone being able to write for the PC.

Are we still moving down that road where amateur innovation and ‘messing about’ is seen as a positive thing? Or are the problems of security, spam and viruses, along with the power of the big guys, taking us towards something that is far more fixed and controlled?

Linking Cultural Heritage Data

Last week I attended a meeting at the British Museum to talk with some museum folk about ways forward with Linked Data. It was a follow up to a meeting I organised on Archives and Linked Data,  and it was held under the auspices of the CIDOC Documentation Standards Working Group. The group consisted of me, Richard Light (Museum consultant), Rory McIllroy (ULCC), Jeremy Ottevanger (Imperial War Museum), Jonathan Whitson Cloud (British Museum), Julia Stribblehill (British Museum), and briefly able to join us was Pete Johnston (worked on the Locah project and now on the Linking Lives Linked Data project).

It proved to be a very pleasant day, with lots of really useful discussion. We spent some time simply talking about the differences – and similarities – in our perspectives and in our data. One of our aims was to start to create more links and dialogue between our sectors, and in this regard I think that the day was undoubtedly successful.

To start with, our conversation ranged around various issues that our domains deal with. For example, we talked a bit about definitions, and how important they are within a museum context. If you think about a collection of coins, defining types is key, and agreeing what those types are and what they should be called could be a very significant job in itself. We were thinking about this in the context of providing authoritative identifiers for these types, so that different data sources can use the same terms. Effectively identifying entities such as names and places is vital for museums, libraries and archives, of course, and within the archive community we could also provide authoritative identifiers for things like levels of description. Working together to provide authoritative and persistent URIs for these kinds of things could be really useful for our communities.
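As a rough sketch of what shared, persistent identifiers might look like in practice, the Python fragment below uses the rdflib library to mint SKOS concepts for a level of description and a coin type. The namespace and term URIs are invented for illustration – agreeing the real terms and keeping the URIs persistent is, of course, the hard part.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# A hypothetical shared namespace for cultural heritage terms.
TERMS = Namespace("http://example.org/heritage-terms/")

g = Graph()
g.bind("skos", SKOS)

# An authoritative identifier for an archival level of description.
fonds = TERMS["level/fonds"]
g.add((fonds, RDF.type, SKOS.Concept))
g.add((fonds, SKOS.prefLabel, Literal("Fonds", lang="en")))

# An authoritative identifier for a coin type, as in the museum example.
denarius = TERMS["coin-type/denarius"]
g.add((denarius, RDF.type, SKOS.Concept))
g.add((denarius, SKOS.prefLabel, Literal("Denarius", lang="en")))

# Any data source can now point at the same URI for the same concept.
print(g.serialize(format="turtle"))
```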

We talked about the value of promoting ‘storytelling’ and the limitations that may inhibit a more event-based approach. DBPedia (Wikipedia as Linked Data) may be at the centre of the Linked Data Cloud, but it may not be so useful in this context because it cannot chart data over time. For example, it can give you the population of Berlin, but it cannot give you the changing population over time. We agreed that it is important to have an emphasis on this kind of timeline approach.

We spent a little while looking at the British Museum’s departmental database, which includes some archives, but treats them more as objects (although the series they form a part of is provided, this contextual information is not at the fore – there is not a series description as such). The proposal is to find a way to join this system up with the central archive, maybe through the use of Linked Data.

We touched upon the whole issue of what a ‘collection’ is within the museum context, which is often more about single objects, and reflected on the challenge of how to define a collection: even something like a cup and saucer could be seen as a collection… or is it one object? … or is a full tea set a collection?

For archivists, quite detailed biographical information is often part of the description of a collection. We do this in order to place the collection within a context. These biographical histories often add significant value to our descriptions, and sometimes the information in them may be taken from the archive collection, so new information may be revealed. Museums don’t tend to provide this kind of detail, and are more likely to reference other sources for the researcher to use to find out about individuals or organisations. In fact, referencing external sources is something archives are doing more frequently, and Linked Data will encourage this kind of approach, and may save us time in duplicating effort creating new biographical entries for the same person. (There is also the move towards creating separate name authorities, but this also brings with it big challenges around sharing data and using the same authorities).

We moved on to talk about Linked Data more specifically, and thought a bit about whether the emphasis should be on discovery or the quality and utility of what you get when you are presented with the results. We generally felt that discovery was key because Linked Data is primarily about linking things together in order to make new discoveries and take new directions in research.

Image: book showing design patterns (Wellcome Library, London)

One of the main aims of the day was to discuss the use of design patterns to help the cultural heritage community create and use Linked Data. Doing things in common where possible would facilitate the process of querying different graphs and getting reasonably predictable information back. Richard has written up some thoughts about design patterns from a museum perspective, and there is a very useful Linked Data Patterns book by Leigh Dodds and Richard Davis, which we felt could form a template for the sort of thing we want to do. We were well aware that this work would really benefit from a cross-domain approach involving museums, archives, libraries and galleries, and this is what we hope to achieve.
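A small example of why shared patterns matter: if an archive and a museum both model ‘thing with a creator’ using the same Dublin Core property, a single SPARQL query returns predictable results from both graphs. The data and URIs below are invented for illustration; only dc:creator is a real, widely shared term.

```python
from rdflib import Graph

# Two hypothetical datasets that follow the same simple pattern.
archive_data = """
@prefix dc: <http://purl.org/dc/terms/> .
<http://example.org/archive/coll1> dc:creator "Beatrice Webb" .
"""
museum_data = """
@prefix dc: <http://purl.org/dc/terms/> .
<http://example.org/museum/object42> dc:creator "Beatrice Webb" .
"""

# Because both graphs use the same pattern, one query serves both.
query = """
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?thing WHERE { ?thing dc:creator "Beatrice Webb" . }
"""

for source in (archive_data, museum_data):
    g = Graph()
    g.parse(data=source, format="turtle")
    for row in g.query(query):
        print(row.thing)
```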

We spoke briefly about the value of something like OpenCalais and wondered whether a cultural heritage version of this kind of extraction tool would be useful. If it were more tailored to our own sectors, it might be more useful in creating authorities, so that we can refer to things in a common way, with a persistent URL provided for the people, subjects and concepts that we need to describe. We considered the scenario in which people go back to writing free text and intelligent tools then extract concepts for them.

We concluded that it would be worth setting up a Wiki to encourage the community to get involved in exploring the idea of Linked Data Patterns. We thought it would be a good idea to ask people to tell us what they want to know – we need the real life questions and then we can think about how our data can join up to answer those questions. Just a short set of typical real-life questions would enable us to look at ways to link up data that fit a need, because a key question is whether existing practices are a good fit for what researchers really want to know.

HubbuB: March 2012

New collections on the Hub

A special mention for the University of Worcester Research Collections – they have now been added to the Hub as collection level descriptions, thanks largely to their HLF ‘Skills for the Future’ trainee, Sarah.

We are delighted to have the Royal College of Psychiatrists as a new contributor, adding to a number of distinguished Royal Colleges already on the Hub.

Feature for March

This month we step into the world of augmented reality with a feature about the SCARLET project:

The feature tells us that “The SCARLET ‘app’ now enables students to study early editions of Dante’s Divine Comedy, for example, while simultaneously viewing catalogue data, digital images, webpages and online learning resources on their tablet devices and phones.” It all sounds very exciting, and something that archives can really play a very active part in.

EAD Editor

We’ve been busy testing the new instance of the EAD Editor, which will be released soon. We’ll be able to tell you more about that shortly.

We now have a page giving you information about the ‘right click’ menu that helps you with things like paragraphs, lists and links.

SRU and OAI-PMH

APIs are becoming increasingly important with the open data agenda. We have provided APIs for some years now. Recently we have updated the information on these to help developers who would like to use them to access Hub descriptions: http://archiveshub.ac.uk/sru/ and http://archiveshub.ac.uk/oaipmh/

The SRU interface is used to provide data to Genesis, the portal for Women’s Studies: http://www.londonmet.ac.uk/genesis/. It means that the data is only held in one place, but a different interface provides access to selected descriptions – in this case, descriptions relating to women.
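For developers wondering what using the SRU interface involves, here is a hedged sketch of a plain HTTP request with the standard SRU parameters (operation, version, query, maximumRecords). The version number and the form of the query are assumptions – check the documentation at the SRU page above for what the Hub actually supports.

```python
import requests

# Standard SRU searchRetrieve parameters; the version and the form of
# the CQL query are assumptions to be checked against the Hub docs.
params = {
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": "webb",          # a simple free-text CQL query
    "maximumRecords": 10,
}

response = requests.get("http://archiveshub.ac.uk/sru/", params=params)
response.raise_for_status()

# The response is XML; in practice you would parse it with an XML library.
print(response.text[:500])
```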

APIs may not mean a great deal to you, as they are primarily something developers use to create new interfaces, mash-ups and cross-data explorations, but do pass this on if you know of developers interested in working with our data. We want to ensure that archives are at the heart of innovations in opening up and exploring data connections.

Page about identifiers

Some of you may have read my recent blog post about issues with identifiers for archives and for archive descriptions. We now have a page on the Hub to help explain what a persistent unique identifier is and how you create it:

http://archiveshub.ac.uk/identifiers/
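For illustration, here is a sketch of building an identifier in the style seen elsewhere on the Hub (such as gb1069-12): country code, then repository code, then a normalised local reference. This pattern is inferred from the examples rather than being a definitive statement of the rules – the page above is the real guidance.

```python
import re

def make_identifier(country: str, repo_code: str, local_ref: str) -> str:
    """Combine country code, repository code and local reference,
    normalising characters that sit badly in a URL path."""
    ref = re.sub(r"[/\s]+", "-", local_ref.strip().lower())
    return f"{country.lower()}{repo_code}-{ref}"

print(make_identifier("GB", "1069", "12"))      # -> gb1069-12
print(make_identifier("GB", "1069", "12/3 a"))  # -> gb1069-12-3-a
```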

As ever, please ask us if you have any questions about this.

Former Reference

The Archives Hub now displays former reference with the label of ‘alternative ref’. This is because for some contributors the former reference is, in fact, the main reference, so we felt this was the best compromise. For example: http://archiveshub.ac.uk/data/gb1069-12 (see lower level entries).

The new EAD Editor will allow for descriptions with a former reference to be uploaded, edited and removed, but it will not provide the facility to create them from scratch.

Case Studies Wanted!

Finally, we have a case studies section – http://archiveshub.ac.uk/casestudies/. We’d love to hear from any researchers willing to provide us with a case study. It is a really useful way for us to convey the importance of the Hub to our funders.

Creating name authorities

We are currently evaluating ICA-AtoM, and it is throwing up some really useful ideas and some quite difficult issues, because it supports the creation of name authorities, and it automatically creates authority records for any creator name, and for any names within the access points, when you upload an archive description.

As far as I am concerned, the idea of name authorities is to create a biographical entry for a person or organisation, something that gives the researcher useful information about the entity: different names they use, significant events in their life, people they knew, etc. (all set out within the ISAAR(CPF) standard (PDF)). You can then link to the archive collections that relate to them, thus giving the archives more context. You can distinguish between archives that they ‘created’ (were responsible for writing or accumulating) and archives that they are a subject of (where they are referenced – maybe as an access point in the index terms).

The ideal vision is for one name authority record for an entity, and that authority record intellectually brings together the archives relating to the entity, which is a great advantage for researchers. However, this is not practically achievable. Just the creation and effective use of name authorities to bring archives together at any level is quite challenging. For the Hub we have a number of issues….

Often we have more than one archive collection for the same entity (let’s say a person, although it could be a corporate body or a family). I’ll take Martha Beatrice Webb as an example, because we have nine collections on the Archives Hub where she is a creator. She’s also topical, as the LSE have just launched a new Digital Library with the Beatrice Webb diaries!

Image: Beatrice Webb collections on the Archives Hub

There is nothing wrong with any of these entries. In fact, I picked Beatrice Webb as an example because we have some really good, detailed biographical history entries for her – exactly what we want in a collection description, so that researchers can get as much useful information as possible. We have always recommended that our contributors provide a biographical history (we know from researchers that this is often useful to them, although conversely it is true to say that some researchers see it as unnecessary).

We have nine collections for Beatrice Webb, and three repositories holding these collections.

If we were to try to create one authority record for these nine collections:

1. Which biographical history would we use? Would we use them all, which means a lot of information and quite a bit of duplication? Would we pick the longest one? What criteria could we use?

2. Two of these collections are attributed to Beatrice and Sidney Webb. Ideally, we would want to link the collections to two authority records – one for Beatrice and one for Sidney – but the biographical history covers both of them, so we would probably end up with an authority record for ‘Beatrice and Sidney Webb’ as a single entity.

3. The name of creator is entered differently in the descriptions – ‘Beatrice Webb’ and ‘Webb, Martha Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian’. This is very typical for the Hub. Sometimes the name of creator is entered simply as forename and surname; sometimes it uses the same elements as the index term entry, whether in the same (inverted) order or not. This makes it harder to automatically create one authority record, because you have to find ways to match the names and confirm they are the same person.

4. Maybe it would be easier to try to create one authority record per repository, so for Beatrice Webb we would have three records. But this would go against the idea of using name authorities to bring archives together intellectually, and it still leaves us with some of the same challenges.

5. How would we deal with new descriptions? Could we get contributors to link to the authority records that have been created? Would we ask them to stop creating biographical histories within descriptions?

6. The name as an access point is often even more variable than the name as creator, although it does tend to have more structure. How can we tell that ‘Martha Beatrice Webb’ is the same person as ‘Beatrice Webb’, and the same person as ‘Beatrice Potter Webb’? For this we’ll have to carry out analysis of the data and pattern matching (there is a sketch of the kind of thing I mean after this list). For some names this isn’t so difficult – ‘Clough Williams-Ellis’ and ‘Clough Williams Ellis’, for example. But what about ‘Edward Coles’ (author of a commonplace book c. 1730-40) and ‘Edward Coles fl. 1741’? With archives, there is a good deal of ‘fl’ and ‘c’ due to the relative obscurity of many people within archive collections. If we don’t identify matches, then we will end up with an authority record for every variation in a name.

7. We will have issues where a biographical history is cross-referenced, as in one example here. It makes perfect sense within the current context, and indeed, it follows the principle of using one biographical history for the same person, but it requires a distinct solution when introducing authority records.
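As promised above, here is a very simplified sketch of the kind of pattern matching involved: it normalises a name string by stripping dates and floruit/circa qualifiers and ignoring word order, then compares names as sets of tokens. The rules are assumptions for illustration – real matching would need far more care – and the last example shows how easily such normalisation can over-match:

```python
import re

def normalise_name(name: str) -> frozenset:
    """Reduce a name string to a comparable set of tokens.

    A deliberately simplified sketch: strip 'fl'/'c' qualifiers and
    life dates, fold hyphens into spaces, then compare names as
    unordered token sets so inverted and natural order both match.
    """
    name = name.lower()
    name = re.sub(r"\b(fl|c|circa)\.?\s*[\d\-x?]+", "", name)  # e.g. 'fl. 1741', 'c. 1730-40'
    name = re.sub(r"\b\d{3,4}(-\d{3,4})?\b", "", name)         # life dates, e.g. '1858-1943'
    name = name.replace("-", " ")
    return frozenset(re.findall(r"[a-z]+", name))

print(normalise_name("Webb, Martha Beatrice, 1858-1943")
      == normalise_name("Martha Beatrice Webb"))         # True
print(normalise_name("Clough Williams-Ellis")
      == normalise_name("Clough Williams Ellis"))        # True
print(normalise_name("Edward Coles fl. 1741")
      == normalise_name("Edward Coles"))                 # True -- but is it the same person?
```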

I would be really interested to know what archivists, especially Hub contributors, think about linking to authority files.  I am very excited about the potential, and I believe there is so much great information tied up in biographical and administrative history fields that would help to make really useful authority records, but the challenges are pretty substantial. Some questions to consider:

(1) Would you be happy to link to a ‘definitive’ authority record? What would ‘authority’ and ‘definitive’ mean for you?

(2) Would you like to be able to edit the ‘definitive’ record – maybe you have further information to add to it? Would this type of collective authorship work?

(3) Would you rather have one authority record for each repository?

(4) Would you rather have room within an archive description to create a biographical entry that reflected that particular archive?

(5) Would you like to see co-ordination of the creation and use of name authorities? Maybe it’s something The National Archives could lead, following their work in maintaining names through the National Register of Archives?

I’m sure you can think of other questions to ask, and maybe you have some questions for us about our review? Suffice it to say that we have no plans to change our software – this is currently just an assessment – but we are seriously thinking about how we can start to incorporate name authorities into the Archives Hub.

Finally, it’s worth mentioning that if you are interested in ways to bring biographical histories together, you might like to follow our ‘Linking Lives’ linked data project through the Linking Lives blog. For this project we are looking at providing an interface that gives the end user something a little like a name authority record, only it will include more external links to other content.

HuBBub: January 2012

Some good developments

We’ve started off 2012 with two new developers at Mimas, both of whom will be doing some work for the Hub. Neeta Patel will be working on the UKAD website, on some of the Hub interface developments, and on some challenging global edits to help us improve the consistency and utility of the data. We also have Lee Baylis, who is working on our Linked Data project, Linking Lives. He will be helping to design the interface, and is currently beavering away on some exciting ideas for how researchers could customise the display of our biographical interface.

Punctuation for Index Terms

Something that may seem small, but is mighty complicated to execute: currently we have a mixture of index terms with punctuation and without. This is because some descriptions came to us with punctuation, some without, and some through our EAD Editor – which adds punctuation (so those descriptions are all fine).

Just go to browse and search for ‘andrews’, for example, to see what I mean. You can see:

Andrewes, Lancelot, 1555-1626, Bishop of Winchester
and
Andrews Barbara B 1829 Nee Campbell

The second is a little confusing without punctuation. But it is not easy to find a way to include punctuation for so many different names, with titles, dates, epithets, kings, queens, floruits, circas, etc. So, we are going to attempt to write scripts that will do this for us, and we’ll see how we go!
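For the curious, here is a toy version of what such a script might do for one simple shape of name – surname, forenames, a date, then an epithet. I should stress that the parsing rule here is an assumption for illustration; the real data has far more shapes (titles, kings, queens, floruits, circas…) than this handles:

```python
import re

def punctuate_index_term(term: str) -> str:
    """Add punctuation to an unpunctuated index term.

    A toy sketch handling one simple shape only:
        'Andrews Barbara B 1829 Nee Campbell'
     -> 'Andrews, Barbara B, 1829, nee Campbell'
    Real Hub data would need many more rules than this.
    """
    match = re.match(r"^(\w+) ([A-Za-z .]+?) (\d{4}(?:-\d{4})?)(?: (.+))?$", term)
    if not match:
        return term  # shape not recognised: leave the term untouched
    surname, forenames, dates, epithet = match.groups()
    parts = [surname, forenames, dates]
    if epithet:
        parts.append(epithet[0].lower() + epithet[1:])  # 'Nee Campbell' -> 'nee Campbell'
    return ", ".join(parts)

print(punctuate_index_term("Andrews Barbara B 1829 Nee Campbell"))
# -> Andrews, Barbara B, 1829, nee Campbell
```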

Alternative and Former Reference

We’ve taken a while, but finally we are displaying ‘former reference’ with an appropriate field heading. It has been complicated partly because descriptions with these references often come from the CALM software. Some contributors want the former reference to be the current reference, because they don’t use the reference that CALM generates automatically; most want it to remain the former reference; and for some it is more of an alternative reference. Finding it impossible to attend to all of these needs, we are displaying any reference that is labelled as ‘former reference’ in the markup under the heading ‘Alt. Ref. Number’. This is a compromise, but at least it ensures that all references are displayed.
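As a rough illustration of that display rule (not our actual code), here is a sketch that walks an EAD file and relabels any unitid marked as a former reference. The exact attribute and value used in real exports vary, so treat the ‘former reference’ test as an assumption:

```python
import xml.etree.ElementTree as ET

def display_references(ead_file: str):
    """Yield (label, value) pairs for display.

    A sketch only: assumes former references arrive as
    <unitid type="former reference">...</unitid>, and ignores XML
    namespaces for brevity (namespaced EAD would need extra handling).
    """
    tree = ET.parse(ead_file)
    for unitid in tree.iter("unitid"):
        if (unitid.get("type") or "").lower() == "former reference":
            yield ("Alt. Ref. Number", unitid.text)
        else:
            yield ("Reference", unitid.text)

for label, value in display_references("example-ead.xml"):  # hypothetical file
    print(f"{label}: {value}")
```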

Assessment of ICA AtoM

The Archives Hub is undertaking a review of current software, and as part of this we are looking at ICA-AtoM (Access to Memory). We will be undertaking a fairly detailed assessment of the software, from installation through to upload, search, display and other aspects such as scalability, Google exposure and APIs. We feel that AtoM offers a number of advantages, as free open source software, conforming to archival standards and with the ability to incorporate name authority records and controlled vocabularies. We are also attracted by the lively international community that has built up around AtoM, and the ethos of sharing and working together to improve the functionality.

It will be interesting to see how it compares to our current software, Cheshire 3, which offers many advantages and sophisticated functionality, built up over 10 years to meet the needs of the Hub and archival researchers. Cheshire has served us very well, and provides stiff competition for any rivals, but it is important to assess where we are and what is best for us going forwards. Looking at other systems gives us the opportunity to think about functionality and assess exactly what we need.

Why Contribute?

We are constantly updating our pages, and adding new ones. Recently we’ve revamped the ‘Why Contribute?’ page as well as creating a new page, Becoming a Contributor. If you know of any archivists interested in the Hub, maybe you could point them to these pages as a means to provide some compelling reasons to be part of the Hub!

New Contributors

Our two latest contributors illustrate admirably the great diversity of Hub repositories. We have the Freshwater Biological Association with a collection about lakes and rivers in Cumbria and Scotland (if you ever wanted to know about bacteria counts, for example…), and also the National Meteorological Archive looking for a fair outlook by promoting their collections on the Hub.

Open Data

Some of you may have seen the announcement of the Open Data Strategy by the European Commission. This is very much in line with the increasing move towards open data: “The best way to get value from data is to give it away”.  The Archives Hub fully supports this ethos, and we will release all our data as open data unless any contributor wishes to opt out.

The Hub team wishes you all the best for 2012!

Online Survey Results (2011)

We would like to share some of the results of our annual online survey, which runs over a three-to-four-week period. We aim for about 100 responses (though obviously more would be very welcome!), and this year we got 92. We use a pop-up invitation to fill out the survey – something we do not like to do, but we feel it attracts more responses than a simple link.

Context

We have a number of questions that are replicated in surveys run for Zetoc and Copac, two bibliographic JISC-funded Mimas services, and this provides a means to help us (and our funders) look at all three services together and compare patterns of use and types of user.

This year we added four questions specifically designed to help us with understanding users of the Hub and to help us plan our priorities.

We aim to keep the number of questions down to about 12 at most, and to ensure that the survey takes no longer than 10 minutes to complete. But we also want to provide the opportunity for people to spend longer and give more feedback if they wish, so we combine tick boxes and radio buttons with free-text comment boxes.

We take the opportunity to ask whether participants would be willing to provide more feedback for us, and if they are potentially willing, they provide their email address. This gives us the opportunity to ask them to provide more feedback, maybe by being part of a focus group.

Results of the Survey

Profile

  • The vast majority of respondents (80%) are based in the UK for their study and/or work.
  • Most respondents are in the higher education sector (60%). A substantial number are in the Government sector and also the heritage/museum sector.
  • 20% of those using the Hub are students – maybe less than we would hope, but a significant number.
  • 10% are academics – again, less than we would hope, but it may be that academics are less willing to fill in a survey.
  • 50% are archivists or other information professionals. This is a high number, but it is important to note that it includes use of the Hub on behalf of researchers, to answer their enquiries, so it could be said to represent indirect use by researchers.
  • The majority of respondents use the service once or twice a month, although usage patterns were spread over all options, from daily to less than once a month, and it is difficult to draw conclusions from this, as just one visit to the Hub website may prove invaluable for research.

[Graph: the value of the Hub]

Use and Recommendation

  • A significant percentage – 26% – find the Hub ‘neither easy nor difficult’ to use, and 3% of respondents found it difficult to use, indicating that we still need to work on improving usability (although note that a number of comments were positive about ease of use).
  • 73% agree their work would take longer without the Hub, which is a very positive result and shows how important it is to be able to cross-search archives in this way.
  • A huge majority – 93% – would recommend the Hub to others, which is very important for us. We aim to achieve 90% positive in this response, as we believe that recommendations are a very important means for the Hub to become more widely known.

Subject Areas

We spent a significant amount of time creating a list of subjects that would give us a good indication of disciplines in which people might use the Hub. The results were:

    • History 47
    • Library & Archive Studies 33
    • English Literature 17
    • Creative & Performing Arts 16
    • Education & Research Methods 10
    • Predominantly Interdisciplinary 9
    • Geography & Environment 5
    • Political Studies & International Affairs 5
    • Modern Languages and Linguistics 4
    • Physical Sciences 4
    • Special Collections 4
    • Architecture & Planning 3
    • Biological & Natural Sciences 3
    • Communication & Media Studies 3
    • Medicine 3
    • Theology & Philosophy 3
    • Archaeology 2
    • Engineering 2
    • Psychology & Sociology 2
    • Agriculture 1
    • Law 1
    • Mathematics 1
    • Business & Management Studies 0
  • History is, not surprisingly, the most common discipline, but literature, the arts, education and also interdisciplinary work all feature highly.
  • There is a reasonable amount of use from the subjects that might be deemed to have less call for archives, showing that we should continue to promote the Hub in these areas and that archives are used in disciplines where they do not have a high profile. It would be very valuable to explore this further.

[Graph: use of other archival websites]

  • The Hub is often used along with other archival websites, particularly The National Archives and individual record office websites, but a significant number do not use the websites listed, so we cannot assume prior knowledge of archives.
  • It would be interesting to know more about patterns of use. Do researchers try different websites, and in what order do they visit them? Do they have a sense of what the different sites offer?
  • There is still low use of the European aggregators, Europeana and APEnet, although at present UK archives are not well represented on these services, and arguably they do not have a high profile amongst researchers (the Hub is not yet represented on these aggregators).

Subsequent activities

  • It is interesting to note that 32% visit a record office as a result of using the Hub, but 68% do not. It would be useful to explore this further, to understand whether the use of the Hub is in itself enough for some researchers. We do know that for some people, the description holds valuable information in and of itself, but we don’t know whether the need to visit a record office, maybe some distance away, prevents use of the archives when they might be of value to the researcher.

What is of most value?

[Graph: what is most valuable to researchers]

  • We asked about what is important to researchers, looking at key areas for us. The results show that comprehensive coverage still tops the polls, but detailed descriptions also continue to be very important to researchers, somewhat in opposition to the idea of the ‘quick and dirty’ approach. More sophisticated questioning might draw out how useful basic descriptions are compared with no description, and what level of detail is acceptable.
  • Links to digital content and information on related material are important, but not as important as adding more descriptions and providing a level of detail that enables researchers to effectively assess archives.
  • Searching across other cultural heritage resources at the same time is, maybe surprisingly, less of a priority than content and links. It is often assumed that researchers want as much diverse information as possible in a ‘one-stop shop’ approach, but maybe the usability of the search, navigation, the number of results and relevance ranking illustrate the main challenge: creating a site that holds descriptions of, and links to, very varied content whilst ensuring it remains easily understandable, so that researchers know what they are getting.
  • The regional search was not a high priority, but it was a significant medium priority. It might be argued that not all researchers would be interested in this, but some would find it particularly useful, and many archivists would certainly find it helpful in their work.
  • We provided a free-text box for participants to say what they most valued. The ability to search across descriptions, which is the most basic value proposition of the Hub, came out top; breadth of coverage was also popular, and could be said to be part of the same selling point.
  • It was interesting to see that some respondents cited the EAD Editor as the main strength for them, showing how important it is to provide ways for archivists to create descriptions (it may be thought that other means are at their disposal, but often this is not the case).
  • Six people referred to the importance of the Hub for providing an online presence, indicating that for some record offices, the Hub is still the only way that collections are surfaced on the Web.

What would most improve the Hub?

  • We had a diversity of responses to the question about what would most improve the Hub, maybe indicating that there are no very obvious weaknesses, which is a good thing. But this does make it difficult for us to take anything constructive from the answers, because we cannot tell whether there is a real need for a change to be made. However, a few answers focused on the interface design. Some of these issues should be addressed by our new ‘utility bar’, which will more clearly separate the description from the other functions that users can perform, and which should be implemented in the next six months.

Conclusions

The survey did not throw up anything unexpected, so it has not materially affected our plans for development of the Hub. But it is essentially an endorsement of what we are doing, which is very positive for us. It emphasised the importance of comprehensive coverage, which is something we are prioritising, and the value of detailed descriptions, which we facilitate through the EAD Editor and our training opportunities and online documentation. Please contact us if you would like to know more.

More Product, Less Processing?

I’ve been reading a fascinating article by Mark A. Greene and Dennis Meissner, ‘More Product, Less Process: Revamping Traditional Archival Processing‘ (PDF). I wanted to offer a summary of the article.

The essence of this article is that archivists spend too long processing collections (appraising, cataloguing and carrying out minor preservation). This approach is not working; the cataloguing backlog continues to increase. We are too conservative, cautious and set in our ways, and we need to think about a new approach to cataloguing that is more pragmatic and user-focussed. The article was written by archivists in the USA, but it would seem to apply to archives here in the UK, where we know that the backlog is a continuing problem.

I think the article makes the argument well and with a good deal of conviction. The bottom line is that we must rethink our approach unless we are to continue to accrue backlogs and deny researchers access to hugely valuable primary source material.

However, there are arguments in support of detailed cataloguing. For digital archives it is extremely useful to provide metadata at item level, enabling such useful resources as http://archiveshub.ac.uk/data/gb1837des-dca?page=3#id634580. With this detailed list, researchers can see digital resources described and then access them directly. It could be argued that if a collection is to be digitised, providing this level of metadata is appropriate, and in general it is the more valuable and highly used collections that are digitised. But for born-digital collections, this level of detail would be totally unsustainable.

Also, I wonder if the work that volunteers do should be taken into account – they may be able to help us catalogue in more detail, whilst trained archivists continue to create the main collection- or series-level descriptions. I remember a whole band of NADFAS volunteers cataloguing photographs where I used to work. Furthermore, I was speaking to an archivist recently who said that they had taken the time to weed out duplicates (something this report criticises)… and then sold them on eBay for a tidy profit, which helped fund their very under-resourced archive (they had the rights to do this!). So, maybe there are factors to take into consideration that support a detailed approach, but I think a bold approach to examining this whole area in UK archives would be very welcome.

Some of the points made in the report:

  • Archivists spend too much time cataloguing, and not necessarily doing only what is needed. We think in terms of an ideal that we have to reach, although we haven’t actually articulated what this ideal is, or really examined it.
  • We are too attached to old-fashioned ways of doing things, which worked when we had smaller collections to deal with, but are not appropriate for large 20th century collections.
  • We give a higher priority to serving the needs of our collections than to the needs of our users.
  • We need a new set of guidelines that focus on what we absolutely need to do.
  • We need to discuss, debate and examine our approach to cataloguing, and not be defensive about our roles.
  • We tend to arrange collections down to item level. In particular, we carry out preservation activities to this level. We accept the premise that basic preservation steps necessitate an item-level approach.
  • We often remove all metal fastenings and put materials into acid-free folders. So, even if we do not describe collections down to item level (maybe we just describe at collection or series level), we go down to this level of detail in our preservation activities. Yet, with good climate control, metal fasteners should not rust, and as yet we do not have strong evidence of a detrimental effect from standard manila folders if the material is stored in a controlled environment.
  • We often weed out duplicates throughout a collection, which requires processing down to item level. Is this really worth doing?
  • The various sources of advice about the level of detail we process archives to are inconsistent. Some sources advocate description to series level, but preservation activities to item level. NARA advocates preservation in accordance with intrinsic value and anticipated use, so, for example, new folders should only be used if current ones are damaged, and metal fasteners should be removed only if ‘appropriate’ – meaning where they are causing obvious damage.
  • We seem to believe that we need to aspire to ‘a substantial, multi-layered, descriptive finding aid,’ a reflection of ‘slow, careful scholarly research’.  But in reality, maybe we should adopt a more flexible approach, taking each collection in turn on its merits. Some may justify detailed cataloguing, but many do not.
  • We should take the position that users come to do research, and that we do not have to do this for them in advance.
  • We should ‘get beyond our absurd over-cautiousness’ about providing access to unprocessed collections, and make them available unless there are good legal or preservation reasons to restrict access or the collection is of extremely high value.
  • We have very inadequate processing metrics. Attempts to quantify processing expectations have resulted in wildly differing figures. Figures given in various studies include 3, 6.9, 8, 12.7 and 10.6 hours per cubic foot. Other studies have come up with between 3 and 5.5 days per foot.
  • One major study by an archive centre revealed that 15.1 hours were spent on each cubic foot, far more than the value placed upon what was accomplished. The study gave ‘an improved sense of the real and total costs involved’.
  • The Greene/Meissner study looked at various projects funded by NHPRC grants (National Historical Publications and Records Commission), and found an average productivity figure of 9 hours per foot, but with highs of around 67 hours per foot. It also conducted an email survey and found expectations of processing times averaging 14.8 hours, although there was a high of 250 hours!
  • Grant funding often encourages an item-level focus, rather than helping us to really tackle our substantial backlogs. There should be more of a requirement to justify meticulous processing – it should only be for exceptional collections.
  • The study recommends aiming for a processing rate of 4 hours per cubic foot for most large 20th-century collections, using a series-level approach for description and preservation (there is a quick back-of-the-envelope comparison of these rates after this list).
  • Studies show a lack of standardisation, not only in our definitions but also around the levels of arrangement, preservation and access that are useful and necessary. We do not have proper administrative controls over this work. We tend to argue that each of us has a unique situation that does not allow for comparison, and we do not have a common sense of acceptable policies and procedures.
  • Whilst we continue to process to item level, a substantial number of repositories do not make their catalogues available through OPACs or websites, arguably prioritising processing over user needs.
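As promised above, here is a quick back-of-the-envelope comparison of what the quoted processing rates mean at the scale of a backlog. The backlog size and working-year figures are illustrative assumptions of mine, not numbers from the article:

```python
# Back-of-the-envelope comparison of the processing rates quoted above.
backlog_cubic_feet = 500           # an illustrative backlog, not from the article

for label, hours_per_foot in [
    ("one major study's measured rate", 15.1),
    ("NHPRC project average", 9.0),
    ("Greene/Meissner target", 4.0),
]:
    total_hours = backlog_cubic_feet * hours_per_foot
    person_years = total_hours / (35 * 46)  # assume ~35-hour weeks, ~46 working weeks/year
    print(f"{label}: {total_hours:,.0f} hours, roughly {person_years:.1f} person-years")
```

At the measured rate of 15.1 hours per cubic foot, a 500-foot backlog is getting on for five person-years of work; at the recommended 4 hours it is nearer one and a quarter. The exact numbers matter less than the scale of the difference.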

The report concludes that maybe we should recognise that ‘the use of archival records…is the ultimate purpose of identification and administration’ (SAA, Planning for the Archival Profession, 1986). Maybe we should agree that a collection is catalogued if it ‘can be used productively for research’. And maybe we should be willing to take a different approach for each collection, making choices and setting priorities, rather than being too caught up in a ‘love of craftsmanship’ that could be seen as fastidiousness that does not truly serve the user.

The question seems to be how much would be lost by putting speed of processing before careful examination of every document in a collection. Maybe this requires us to define what good cataloguing is? Maybe we believe that our professional standing is tied more closely to detailed cataloguing than to tackling the ever-increasing backlogs of papers that remain entirely inaccessible to researchers?

Greene and Meissner state that there should be a ‘golden minimum’ for processing, where we adequately address user needs and only go beyond this where there are demonstrable business reasons. They also believe that arrangement, description and preservation should all occur at the same level of detail, again, unless there are good reasons to deviate from this.

What do you think…?