Archives Portal Europe Country Managers’ Meeting, 30 Nov 2016

This is a report of a meeting of the Archives Portal Europe Country Managers in Bratislava, Slovakia, on 30 November 2016, with some comments and views from the UK and Archives Hub perspective.

APE Country Managers meeting, Bratislava, 30 Nov 2016

Context

The APE Foundation (APEF), which was created following the completion of the APEx project (an EC-funded project, running from 2012 to 2015, to maintain and develop the portal), is now taking APE forward. It has a Governing Board and working groups for standards, technical issues and PR/communications. The APEF has a coordinator and three technical/systems staff, as well as an outreach officer. Institutions are invited to become associate members, to help support the portal and its aims.

Things are going well for APEF, with a profit recorded for 2016 and a growing associate membership. APEF continues to be busy with development of APE, and is encouraging cooperation and collaboration as a means to keep developing and to take advantage of EU funding opportunities.

Current Development

The APEF has the support of the Ministry of Culture in the Netherlands and has a close working relationship with the Netherlands national aggregation project, the ‘DTR’, which is key to the current APE development phase. The idea is to use the framework of APE for the DTR, benefitting both parties. Cooperation with the DTR involves three main areas:

•    building an API to open up the functionality of APE to third parties (and to enable the DTR to harvest the APE data from The Netherlands)
•    improving the uploading and processing of EAC-CPF
•    enabling the uploading and processing of ‘additional finding aids’

The API has been developed so that specific requests can be sent to fetch selected data. It is possible to do this for EAD (descriptions) and EAC-CPF (names). The API provides raw data as well as processed results. There have been issues around things like the relevance ranking of results, which is a substantial area of work that is being addressed.
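
To make this concrete, here is a minimal sketch of what fetching descriptions through an API of this kind might look like from a third-party script. The endpoint, parameters, header name and response shape are all illustrative assumptions, not the documented APE API; the point is simply that selected data can be requested programmatically, with a key identifying the caller.

```python
# Hypothetical sketch of a third-party call to an aggregator API of this
# kind. Endpoint, parameters, header name and response shape are all
# assumptions for illustration, not the documented APE API.
import requests

API_BASE = "https://api.example.org/ape"  # hypothetical base URL
API_KEY = "your-api-key"                  # issued key, used to monitor use

response = requests.get(
    f"{API_BASE}/ead/search",
    params={"q": "shipbuilding", "country": "NL"},
    headers={"X-API-Key": API_KEY},
)
response.raise_for_status()

# Results would only include data that contributors have permitted
# to be fully open (see below on CC0 permissions).
for hit in response.json().get("results", []):
    print(hit.get("title"), "-", hit.get("repository"))
```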

The API has implications for the data, as the Content Provider Agreement that APE institutions sign gives control of the data to the contributors. So, the API had to be implemented in a way that enables each contributor to give explicit permission for the data to be made available as CC0 (fully open data). This means that if a third party uses the API to grab data, they only get data from countries that have given this permission. APEF has introduced an API key, which is a little controversial, as it could be argued that it is a barrier to complete openness, but it does enable the Foundation to monitor use, which is useful for measuring impact, checking correct use, and blocking those who misuse the API. This information is not made open, but it is stored for impact and security purposes.

There was some discussion at the meeting around open data and the use of CC0. In countries such as Switzerland it is not permitted to open up data through a CC0 licence, and it may be true to say that CC0 is not the appropriate licence for archival descriptions anyway (the question of whether any copyright can exist in them is not clear), so a public domain mark may be more appropriate. When working across European countries there are variations in approaches to open data. The situation is complicated because the application of CC0 to APE data is not explicit, so any licence that a country has attached to its data will effectively be exported with the data, and you may get a kind of licence clash. But the feeling is that, for practical purposes, if the data is available through an API, developers will expect it to be fully open and will use it with that in mind.

There has been work to look at ways to take in EAC-CPF from a whole set of institutions more easily, which would be useful for the UK, where we have many EAC-CPF descriptions created by SNAC. Work to bring together more than one name description for the same person has not started, and is not scheduled for the current period of development, but the emphasis is likely to be on better connectivity between variations of a name rather than on having one description per name.

Additional finding aids offer the opportunity to add different types of information to APE. You may, for example, have a register of artists or ships’ logs, or you may have started out with a set of cards with names A-Z, relating to your archive in some way. You could describe these in one EAD description, and link this to the main description. In the current implementation of EAD2002 in APE this would have to go into a table in Scope & Content, and in-line tagging is not allowed to identify parts of the data. This leads to limitations in how to search by name. EAD3, however, gives the option to add more information on events and names. You can divide a name up into parts, which allows for better searching. Therefore APE is developing a new means to fetch and process EAD3 for the additional finding aids alongside EAD2002 for ‘standard’ finding aids. In conjunction with this, the interface needs to be changed to present the new names within the search.
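
To illustrate why this matters for searching, here is a small sketch of EAD3’s approach to names, where a name is divided into typed parts that can be indexed separately. The sample fragment and the localtype values are illustrative, not taken from a real finding aid.

```python
# Sketch of extracting typed name parts from an EAD3 <persname>.
# The sample fragment and localtype values are illustrative only.
from lxml import etree

EAD3_NS = {"ead": "http://ead3.archivists.org/schema/"}

fragment = b"""
<persname xmlns="http://ead3.archivists.org/schema/">
  <part localtype="surname">Smith</part>
  <part localtype="forename">Jane</part>
  <part localtype="dates">1855-1921</part>
</persname>
"""

name = etree.fromstring(fragment)
parts = {p.get("localtype"): p.text
         for p in name.findall("ead:part", namespaces=EAD3_NS)}
print(parts)  # {'surname': 'Smith', 'forename': 'Jane', 'dates': '1855-1921'}
```

With the name broken into typed parts like this, a search interface can index surnames, forenames and dates separately, rather than matching against one undifferentiated string.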

The work on additional finding aids may not be so relevant for the Archives Hub as a contributor to APE, as the Hub cannot look at taking on ‘other finding aids’, with all the potential variations that implies. However, institutions could potentially log into APE themselves and upload these different types of descriptions.

APE and Europeana

There was quite a bit to talk about concerning APE and Europeana. The APEF is a full partner of the Europeana Digital Services Infrastructure 2 (DSI2) project (currently running 2016/2017). The project involves work on the structure for Europeana, maintaining and running data and aggregation services, improving data quality, and optimising relations with data partners. The work APE is involved with includes improving the current workflow for harvest/ingest of data, and also evaluating what has already been ingested into Europeana.

Europeana seems to have ongoing problems dealing with multi-level EAD descriptions, compounded by the limitation that Europeana only represent digital materials. The approach is not a good fit for archives. Europeana have also introduced both a new publishing framework and different rights statements.

The new publishing framework is a four-tier approach, where you can think of Europeana as anything from a more basic tool for promoting your archives through to a platform for reuse. It classifies digital materials by pixel width: for example, 800 pixels wide for thumbnails (adding thumbnails means using Europeana as a ‘showcase’) and 1,200 pixels wide for high quality, reusable images (using Europeana as a distribution and reuse platform). The idea of trying to get ‘quality’ images seems good, but in practice I wonder if it simply raises the barrier too much.
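
As a rough illustration of what checking images against these tiers might involve, assuming the quoted widths are minimum requirements (the tier labels below are informal shorthand, not Europeana’s terminology):

```python
# Rough sketch: classify an image against the pixel widths quoted above,
# assuming they are minimums. Tier labels are informal shorthand.
from PIL import Image  # Pillow

def publishing_tier(path: str) -> str:
    width = Image.open(path).width
    if width >= 1200:
        return "high quality, reusable (distribution/reuse platform)"
    if width >= 800:
        return "thumbnail (showcase)"
    return "below the thumbnail threshold"

print(publishing_tier("example.jpg"))
```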

The new rights statements require institutions to be very clear about the rights they want to apply to digital content. The likely conclusion of all this, from the point of view of the Archives Hub, is that we cannot grapple with adding to Europeana on behalf of all of our contributors, and therefore individual contributors will have to take this on board themselves. It will be possible for contributors to log into the APE dashboard (once it has been changed to reflect Europeana’s new rights statements) and engage with this, selecting the finding aids and the preferred rights statements, and ensuring that thumbnail and reusable images meet the requirements. Once the descriptions are in APE they can then be supplied to Europeana. The resulting display in Europeana should be checked, to ensure that it is appropriate.

We discussed this approach, and concluded that APE contributors could perhaps see Europeana as something they might use to showcase their content – in other words, think of it on our terms, as archives, and how it might help us. There is no obligation to contribute, so it is a case of deciding whether it is worth representing the best visual archives through Europeana, or whether this approach takes more effort than the value we get out of it. After 10 years of working with Europeana, and not really getting proper representation of archives, the idea of finding a successful way of contributing archives is appealing, but it seems to me that the amount of effort required is going to be significant, and I’m not sure if the impact is enough to warrant it.

Europeana are working on a new approach to automated, real-time ingest from aggregators and content providers, but this may take another year or more to become fully operational.

Outreach and CM Reports

Towards the end of the day we had a presentation from the new PR/communications officer. Having someone to encourage, co-ordinate and develop ideas for dissemination should prove invaluable for APE. The Facebook page is full of APE activities and related news and events. You can tweet using the hashtag #archivesportaleurope if you would like to make APE aware of anything.

We ended the day with reports from country managers, which, as always, threw up many issues, challenges, solutions, questions and answers. Plenty to set up APEF for another busy year!

Exploring British Design: New Routes through Content

At the moment, the Archives Hub takes a largely traditional approach to the navigation and display of archive collections. The approach is predicated on hundreds of years of archival theory, expanded upon in numerous books, articles, conferences and standards. It is built upon “respect des fonds” and original order. Archival provenance tells us that it is essential to provide the context of a single item within the whole archive collection; this context is required in order to understand and interpret the item.

ISAD(G) reinforces the ‘top down’ approach. The hierarchy of an archive collection is usually visualised as a tree structure, often using folders. The connections show a top-down or bottom-up approach, linking each parent to its child(ren).

image of hierarchical folders
A folder structure is often used to represent archival hierarchy

This principle of archival hierarchy makes very good sense. The importance of this sort of context is clear: one individual letter, one photograph, one drawing, can only reveal so much on its own. But being able to see that it forms part of a series, and part of a larger collection, gives it a fuller story.

However, I wonder if our strong focus on this type of context has meant that archivists have sometimes forgotten that there are other types of context, other routes through content. With the digital environment that we now have, and the tools at our disposal, we can broaden out our ambitions with regards to how to display and navigate through archives, and how we think of them alongside other sources of information. This is not an ‘either or’ scenario; we can maintain the archival context whilst enabling other ways to explore, via other interfaces and applications. This is the beauty of machine processable data – the data remains unchanged, but there can be numerous interfaces to the data, for different audiences and different purposes.

Providing different routes into archives, showing different contexts, and enabling researchers to create their own narratives, can potentially be achieved through a focus on the ‘real things’ within an archive description; the people, organisations and places, and also the events surrounding them.

image of entities and links
Very simplified model of entities within archive descriptions and links between them

This is a very simplified image, intended to convey the idea of extracting people, organisations and places from the data within archive descriptions (at all levels of description). Ideally, these entities and connections can be brought together within events, which can be built upon the principle of relationships between entities (i.e. a person was at a place at a particular time).

Exploring British Design is a project seeking to probe this kind of approach. By treating these entities as an important part of the ‘networks of things’, and by finding connections between the entities, we give researchers new routes through the content and the potential to tell new stories and make new discoveries. The idea is to explore ways to help us become more fully a part of the Web, to ensure that archives are not resources in isolation, but a part of the story.

A diagram showing archives and other entities connected
An example of connected entities

For this project, we are focussing on a small selection of data, around British design, extracting entities from the Archives Hub data, and considering how the content within the descriptions can be opened up to help us put it into new contexts.

We are creating biographical records that can be used to include structured data around relationships, places and events.  We aim to extract people from the archive descriptions in which they are ‘embedded’ so that we can treat them as entities – they can connect not only to archive collections they created or are associated with, but they can also connect to other people, to organisations, to events, to places and subjects. For example, Joseph Emberton designed Simpsons in Piccadilly, London, in 1936. There, we have the person, the building, the location and the time.
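
To make that concrete, the Emberton example could be expressed as simple subject-predicate-object statements. This is only a toy sketch (the predicate names are invented; a real implementation would use an established vocabulary), but it shows how any entity can become a starting point for exploration:

```python
# The Emberton example as toy subject-predicate-object statements.
# Predicate names are invented for illustration.
triples = [
    ("Joseph Emberton", "designed", "Simpsons, Piccadilly"),
    ("Simpsons, Piccadilly", "located_in", "London"),
    ("Simpsons, Piccadilly", "designed_in", "1936"),
]

# A researcher starting from the place, rather than the person:
for s, p, o in triples:
    if "London" in (s, o):
        print(s, p, o)  # Simpsons, Piccadilly located_in London
```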

With this paradigm, the archive becomes one of the ‘nodes’ of the network, with the other entities equally to the fore, and the ability to connect them together shows how we can start to make connections between different archive collections. The idea is that a researcher could come into an archive from any type of starting point. The above diagram (created just as an example) includes everything from ‘1970s TV comedy’ through to the use of Portland stone, and it links the Brighton Design Archive, the V&A Theatre and Performance Archive and the University of the Arts London Archive. The long-term aim is that our endeavours to open up our data will ensure that it can be connected to other data sources (that have also been made open); sources outside of our own sphere (the Archives Hub data). The traditional interface has its merits; certainly we need to continue to provide archival context and navigation through collections; but we can be more imaginative in how we think about displaying content. We don’t need to have just one interface onto our data. We need to ensure that archives are part of the bigger story, that they can be seen in all sorts of contexts, and that they are not relegated to being a bit part, isolated from everything else.

In With the New: open, flexible, user-centered

The 2013 Eduserv Symposium was held in the impressive (and very much ‘keep in with the old’) surroundings of One Great George Street in Westminster, the home of the Institution of Civil Engineers.

‘In with the New’ covered new skills sets, new modes of engagement and new ways of working.  With such a wide topic area, the conference took quite a broad-brush approach. Andy Powell of Eduserv introduced the day and talked about dealing with change, change that may be imposed upon us from the outside, as well as being driven internally.

image from Digital Govt Service
David Cotterill from the Government Digital Service gave the opening keynote, which is what I want to focus on here. He said his talk was about ‘my exciting life as a civil servant’… the audience weren’t convinced about this at the outset, but maybe for those interested in open data, there was some shift of opinion by the end!

He talked about the old consensus, which was built around long-term contracts for IT in government; contracts that were consistently awarded to a limited number of suppliers and not to smaller and more innovative suppliers.  IT was not defined as a core function, so out-sourcing was considered appropriate. But in the 21st century things have changed. There is recognition that IT covers very diverse areas. For Government (and for many other organisations), it covers digital public services, mission IT systems (i.e. more niche or specialised systems for government departments), desktop, infrastructure, connectivity, etc. (the more general IT), and, within government, there are also ‘shared services’ (such as for financial systems). David talked about the need to structure mission IT systems and digital public services so that they can run on different desktops or infrastructures and not be tied down (as often used to be the case).

David went on to argue that the Government really has taken up the open agenda, and showed some quotes: “The latest step is the publications of this report on open standards. And once again the government has got it right.” (Wall Street Journal).  He argued that in order to have flexibility to progress, to upgrade, to move forwards, you need open and standards based systems. You also need to look at specific needs in specific areas and not think of IT as some kind of monolithic thing.

It was surprising to hear him say that “this is a great time to be a supplier”, but he said that many of the current deals within government come to an end over the next few years, so there is opportunity for new suppliers and creating a more diverse set-up.

gov.uk screenshot
What is 21st century government about? David said it’s about things like www.gov.uk/, built using a platform approach (rather than a CMS), which allows the Government Digital Service (GDS) to build products onto it that meet user needs; products that enable the government to engage with citizens. David gave a sense of how this approach is working across UK government, with multi-disciplinary teams including developers, designers, product and service managers, policy, communications, etc.

His core message was to start with the user need. Of course, this is something we can all agree with, although whether it always happens in reality is debatable, even if it is the intention. We need to shape things in terms of user requirements right from the start, and not bring them in once all the policy and development work is done. We should think about capturing requirements and developing alpha and then beta versions before going live. This may mean that what is initially developed is chucked out after the alpha stage, because it doesn’t meet needs, and there is a need to start again. I think one of the problems with this approach is that funders do not necessarily facilitate it. How easy would it be to get funding for a project where the iterative process may go on for quite some time, and where there is a risk of starting again several times in order to get it right? A further difficulty from a funding point of view is that it is much harder to specify what you are going to end up with, because you necessarily need to keep an open mind; you’ll end up (hopefully) with what users want, but it might be different from what was envisaged, and you’ll only know after the testing and refining process.

It makes me think about archival software systems, for example. Surely you should put user needs at the heart of the development of your system? Ideally you would start out by gathering user requirements for a system, maybe looking at other research done in this area. You’d end up with a specification, listing priorities for your system. Most archives can’t then build it themselves, so they would go out and look at what meets these needs. But would it be possible to test a system out with users, to see if it really does fulfill their needs, and if it doesn’t, go back and try something else? The problem here is that if you are buying a system, it’s hard to apply an iterative approach. However, it may be possible to move to a more user-centered approach. You should have clear evidence that the system does meet key user needs, and, in the absence of an ability to chop and change, you should ensure that the system does not tie you down and that it provides the flexibility to build and modify, so that changing priorities can be met.

It’s good to see Government leading the way. David showed previews of some services that are being developed, working towards a more transparent approach to things like transactional services and he highlighted a government manual about building services that people want to use.  There is now a ‘Standards Hub‘, to promote open standards and also to encourage wider participation in solving data challenges. It is amazing to see Government code onimage of keyboard 'save' key GitHub. Somehow that really brought home to me home how different things are now to 10-15 years ago. David, as well as other speakers at the conference, believes that open standards encourage a more efficient approach, so it becomes a cost-saving venture as well as encouraging public engagement and transparency.

Open Up!

Speaking at the recent World Wide Web (WWW2012) conference in Lyon, Neelie Kroes, Vice President of the European Commission, emphasised the benefits of an open Web:  “With a truly open, universal platform, we can deliver choice and competition; innovation and opportunity; freedom and democratic accountability”.

Image: '774:Neuron Connection Pattern' http://www.flickr.com/photos/60057912@N00/4743616313

But what does ‘open’ really mean? Do we really have a clear idea about what it is, how we achieve it, and what the benefits might be? And do we have a sense of what is happening in this landscape and the implications for our own domain?

There is a tendency to think that the social web equates to some degree with the open web, but this is not the case. Social sites like Facebook often create walled gardens, and you could argue that whilst they are connecting people, they are working against the principle of free and linked information. Just because the aim is to enable people to be social and connected, that doesn’t mean that it is done in a way that ensures the data is open.

Another confusion may arise around open and its relationship to privacy. Again, things can be open, but privacy can be respected. It’s about clear and transparent choices – whether you want your data to be open or whether you want to include some restrictions. We are not always aware of what we are giving away:

“the Net enables individuals in many cases to compromise privacy more thoroughly than the government and commercial institutions traditionally targeted for scrutiny and regulation. The standard approaches that have been developed to analyze and limit institutional actors do not work well for this new breed of problem, which goes far beyond the compromise of sensitive information.” (Jonathan Zittrain, The Future of the Internet).

There is an irony that we often readily give away personal data – so we can be (unknowingly?) very open with our own information – but often we give it away to companies that are not, in their turn, open with the data that is provided. As Tim Berners-Lee has stated, “One of the issues of social networking silos is that they have the data and I don’t”. It is possible to paint quite a worrying picture of a trend towards control being in the hands of the few:

“A small elite of gifted (largely male) super-wealthy masters of the universe are creating systems through their own brilliance which very few others in government, regulation or the general public can understand.” (http://www.guardian.co.uk/commentisfree/2012/apr/16/threat-open-web-opaque-elite).

Google has often been seen as a good guy in this context, but recently a number of actions have indicated a move away from open web principles. We have started to be aware of the way Google collects and uses our data, and there have been some dubious practices around the ways that data has been gathered (e.g. personal data collected from UK WiFi networks). Google appear to be very eager to get us all using Google+, and we are starting to see ‘social’ search results that appear to be pushing people towards using Google+. I think we should be concerned by the announcement from Google that they are now starting to combine your Google search history with other data they collect through use of their products in order, they say, to deliver improved search results. Do we want Google to do this? Are we happy with our search history being used to push products? Is a ‘personalised’ service worth having if it means companies are collecting and holding all of the data we provide when we browse the web for their own commercial purposes?

It does seem that many of these companies will defend ‘open’ if it is in their interests, but find ways to bypass it when it is not. But maybe if we continue to apply pressure in support of an open approach, by implementing it ourselves and favouring tools and services that implement it, we can tip the balance increasingly towards a genuinely open approach, with safeguards for guaranteeing privacy.

It is worth thinking about some of these issues when we think about our own archival data and whether we are going to take an open approach. It is not just our own individual circumstances, within one repository, that we should think about, but the whole context of what we want the Web to be, because our choices help determine the direction that the Internet goes in.

We do not know how our archival data will be used, and we never have. Even when the data was on paper, we couldn’t control use. Fundamentally, we need to see this as a good thing, not a threat, but we need to be aware of the risks. The paradigm has shifted because the data is much, much easier to mess with now that it is digital, and, of course, we have digital archives, and we are concerned to protect the integrity of these sources. Providing an API seems like a much more explicit invitation to play with the data than simply putting it on the Web. Furthermore, the mindset of the digital generation is entirely different. Cutting and pasting, mashing and inter-linking are all taken for granted, and sometimes the idea of the provenance of data sources seems to go out of the window.

This issue of ‘unpredictable use’ is interesting. When the Apple II was launched, Steve Jobs did not know how it would be used – as Zittrain states, it ‘invited people to tinker with it’, and whilst Apple might have had ideas about use, their ideas were not a constraint on reality. Contrast this with the iPhone: it comes out of the box programmed. Quoting Zittrain again: ‘Whereas the world would innovate for the Apple II, only Apple would innovate for the iPhone.’ Zittrain sees this trend towards fixed appliances as pointing to a rather bleak future of ‘sterile appliances tethered to a network of control.’

Maybe the challenge is partly around our understanding of the data we are ‘giving away’, why we are giving it away, and whether it’s worth giving away because of what we get in return. Think about the data gathered as we browse the Web – cookies have always had a major role in building up a profile of our Web behaviour – where we go and what we do. Cookies are generally interested in behaviour rather than identity. But still, from 26 May this year, every website operating in the UK will be required to inform its users that they are being tracked with cookies, and to ask users for their consent. This means justifying why cookies are required – it may be for very laudable reasons: logging users’ preferences to help improve their experience, and generally getting a better understanding of how people use the site in order to improve it. It may be in order to target advertising – and you may feel that it’s hardly a problem if this is adversely affected, but maybe it will be smaller, potentially very useful sites that suffer if their advertising revenue is affected. Is it a good thing that sites will have to justify the use of cookies? Is it in our interests? Is this really where the main fight is around protecting your identity? At the same time, the Government is proposing to bring in powers to monitor the websites people are looking at, something which has very profound implications for our privacy.

On the side of the good guys, there are plenty of people out there espousing the importance of an altruistic approach. In The Digital Public Domain: Foundations for an Open Culture the authors argue that the Public Domain — that is, the informational works owned by all of us, be that literature, music, the output of scientific research, educational material or public sector information — is fundamental to a healthy society. Many other publications echo this belief, and you can see plenty of evidence of the movement towards open publication. Technology can increasingly help us realise the advantages of an open approach. For example, text mining is something that can potentially bring substantial economic and cultural benefits. It is often associated with the biomedical sciences, but it could have great potential across the humanities. A recent JISC report on the Value and Benefits of Text Mining concluded that current licensing arrangements are a key barrier to unlocking the potential value of text mining to society and to enabling researchers to gain maximum benefit from the data. Text mining analyses text using computational methods, and provides the end user with what is most relevant to them. But copyright law works against this principle:

http://www.jisc.ac.uk/inform/inform33/TextMining.html

“Current copyright law, however, imposes serious restrictions on text mining, because it involves a range of computerised analytical processes that are not all readily permitted within UK intellectual property law. In order to be ‘mined’, text must be accessed, copied, analysed, annotated and related to existing information and understanding. Even if the user has access rights to the material, making annotated copies can currently be illegal without the permission of the copyright holder.” (JISC report: Value and Benefits of Text Mining)

It is important to think beyond types of data and formats, and engage with the debate about the sort of advances for researchers that we can gain by opening up all types of data; whilst the data may differ, many of the arguments about the benefits of open data apply across the board, in terms of encouraging innovation and the advancement of knowledge.

The battle between open and transparent and the walled garden scenario is played out in many ways. One area is the growth of native apps for mobile devices. It is easy to simply see mobile apps as providing new functionality, and opportunities for innovation, with so many people creating apps for every activity you can think of. But what are the implications of a scenario where you have to enter a ‘silo’ environment in order to do anything, from finding your way to a location to converting measurements from one unit to another, to reading the local news reports? Tim Berners-Lee sees this as one of the main threats to the principle of the Web as a platform with the potential to connect data.

Should we be promoting the open Web? Surely as information professionals we should. This does not mean that there is an imperative to make everything open – we are all aware of the Data Protection Act and there are a whole range of very valid reasons to withhold information. But it is increasingly apparent that we need to be explicit in applying permissions – whether fully open licences, attribution licences or closure periods. Many of the archives described on the Archives Hub have ‘open for consultation’ under Access Conditions. But what does this really mean? What can people really do with the content? Is the content actually ‘open for reuse in any way the end-user deems useful to them’? And specifically, what can they do with digital content, available on the World Wide Web?

We need to be aware of the open data agenda and the threats and opportunities presented to us. We have the same kind of tensions within our own world in terms of what we need to protect, both for preservation reasons and for reasons of IPR and commercial advantage, but we want to make archives as open and available as possible and we need to understand what this means in the context of the digital, online world. The question is, whether we want to encourage innovation, what Zittrain calls a ‘generative’ approach. This means enabling programmers to take hold of what you can provide and working with it, as well as encouraging reuse by end users. But it also means taking a risk with security, because making things open inevitably makes them more open to abuse.

People’s experiences with the Internet are shaped by the devices they use to access it. And, similarly, their experiences of archives are shaped by the ways that we choose to enable access. In the same way that a revolution started when computers were made available that could potentially run software that was not available at the time they were built, so we should think about enabling innovation with our data that we may not envisage at present, not trying to determine user behaviour but relishing the idea of the unknown. This raises the question of whether we facilitate this by simply putting the data out as it is, even if it is somewhat unstructured, making it as open as possible, or whether we should try to clean up data, and make it more rigorous, more consistent and more structured, with the purpose of facilitating flexible use of the data, enabling people to use and present it in other ways. But maybe that is another story…

In a way, the early development of the Internet has been an unexpected journey. Innovation has been encouraged because a group of computer scientists did not want to control it or to ensure maximum commercial success. It was an altruistic approach, and it might easily not have turned out that way. Maybe the PC was an unlikely success story in business, where a stable, supported, predictable solution might have been expected to win out, so that companies did not need skills in-house. Both Microsoft and Apple made their computers’ operating systems open for third-party software development, so even if there was a sense of monopoly, there was still a basic ethos of anyone being able to write for the PC.

Are we still moving down that road where amateur innovation and ‘messing about’ is seen as a positive thing? Or are the problems of security, spam and viruses, along with the power of the big guys, taking us towards something that is far more fixed and controlled?

HubbuB: January 2012

Some good developments

We’ve started off 2012 with two new developers for Mimas, both of whom will be doing some work for the Hub. Neeta Patel will be working on the UKAD website and some of the Hub interface developments and challenging global edits to help us improve the consistency and utility of the data. We also have Lee Baylis, who is working on our Linked Data project, Linking Lives. He will be helping to design the interface, and is currently beavering away on some exciting ideas for how researchers could customise their display for our biographical interface.

Punctuation for Index Terms

Something that may seem small, but is mighty complicated to execute: currently we have a mixture of index terms with punctuation and without. This is because some descriptions came to us with punctuation, some without, and some through our EAD Editor – which adds punctuation (so those descriptions are all fine).

Just go to browse and search for ‘andrews’, for example, to see what I mean.  You can see:

Andrewes, Lancelot, 1555-1626, Bishop of Winchester
and
Andrews Barbara B 1829 Nee Campbell

The second is a little confusing without punctuation. But it is not easy to find a way to include punctuation for so many different names, with titles, dates, epithets, kings, queens, floruits, circas, etc. So, we are going to attempt to write scripts that will do this for us, and we’ll see how we go!
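
As a flavour of what such a script might do, here is a deliberately simple sketch that adds a comma after the surname and around a recognisable year or date range. The real data has far more variation (titles, epithets, floruits, circas) than this handles, which is exactly why the job is complicated.

```python
# Deliberately simple sketch of punctuating an unpunctuated name index
# term. Real Hub terms have far more variation than this handles.
import re

def punctuate(term: str) -> str:
    parts = term.split()
    if len(parts) < 2:
        return term
    out = [parts[0] + ","]          # comma after the surname
    for p in parts[1:]:
        if re.fullmatch(r"\d{4}(-\d{4})?", p):
            out[-1] += ","          # comma before a year or date range
        out.append(p)
    return " ".join(out)

print(punctuate("Andrews Barbara B 1829 Nee Campbell"))
# -> 'Andrews, Barbara B, 1829 Nee Campbell'
```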

Alternative and Former Reference

We’ve taken a while, but finally we are displaying ‘former reference’ with an appropriate field heading. It has been complicated partly because  descriptions with these references often come from the CALM software, and some contributors want the former reference to be the current reference, because they don’t use the CALM automatically generated reference, whilst most want it to be the former reference, and for some it is more of an alternative reference. Finding it impossible to attend to all these needs, we are displaying any reference that is labelled as ‘former reference’ in the markup with the name of ‘Alt. Ref. Number’. This is a compromise, and at least ensures that all references are displayed.

Assessment of ICA-AtoM

The Archives Hub is undertaking a review of current software, and as part of this we are looking at ICA-AtoM (Access to Memory). We will be undertaking a fairly detailed assessment of the software, from installation through to upload, search, display, and other aspects such as scalability, Google exposure and APIs. We feel that AtoM offers a number of advantages, as free open source software, conforming to archival standards and with the ability to incorporate name authority records and controlled vocabularies. We are also attracted by the lively international community that has built up around AtoM, and the ethos of sharing and working together to improve the functionality.

It will be interesting to see how it compares to our current software, Cheshire 3, which offers many advantages and sophisticated functionality, built up over 10 years to meet the needs of the Hub and archival researchers. Cheshire has served us very well, and provides stiff competition for any rivals, but it is important to assess where we are and what is best for us going forwards. Looking at other systems offers us the opportunity to think about functionality and assess exactly what we need.

Why Contribute?

We are constantly updating our pages, and adding new ones. Recently we’ve revamped the ‘Why Contribute?’ page as well as creating a new page, Becoming a Contributor. If you know of any archivists interested in the Hub, maybe you could point them to these pages as a means to provide some compelling reasons to be part of the Hub!

New Contributors

Our two latest contributors illustrate admirably the great diversity of Hub repositories. We have the Freshwater Biological Association with a collection about lakes and rivers in Cumbria and Scotland (if you ever wanted to know about bacteria counts, for example…), and also the National Meteorological Archive looking for a fair outlook by promoting their collections on the Hub.

Open Data

Some of you may have seen the announcement of the Open Data Strategy by the European Commission. This is very much in line with the increasing move towards open data: “The best way to get value from data is to give it away”.  The Archives Hub fully supports this ethos, and we will release all our data as open data unless any contributor wishes to opt out.

The Hub team wishes you all the best for 2012!

HubbuB: November 2011

image showing a celebratory 200
I don’t think we made much of a fuss about reaching 200 contributors, but we’re really pleased to say that we’re now into the 200s, and new contributors are coming on board regularly, which makes the Hub even more useful to even more researchers.

We’re currently trying out a bit of a whizzy thing with the contributors’ map – go to http://archiveshub.ac.uk/contributorsmap/ and try a few clicks and you’ll see what I mean. We particularly like the jump from Aberdeen to Exeter, and are looking for archives from further afield in order to execute even bigger jumps!

Speaking of contributors, we’ve made a few changes to our contributor pages. We now have a link to browse each contributor’s descriptions, and also a link to simply show the list of collections. This link was largely introduced to help us with our quest to bring the Hub out loud and strong through Google. We’re doing pretty well on that front….we’ve found that page views have gone up radically over the last few months, and that can only be good for archives.  I think the list of descriptions can really look quite impressive – I tried Aberdeen and found collections from ‘favourite tunes’ to ‘a valuation of the Shire of Aberdeen’.

We’ve been busy on our new Linking Lives project, using Linked Data to create a Web front-end, and making the data available via an open licence. We’re really pleased that the vast majority of contributors have not asked us to exclude their descriptions, and many have emailed specifically to endorse what we are doing.  This is brilliant news, and I think it shows that most archivists are actually forward-thinking and understand that technology can really benefit our domain (flattery will get you everywhere!).  We want to ensure that archives are out there in the Web of Data, and part of the innovative work that is happening now. You may have seen a few blog posts to get going on Linking Lives: http://archiveshub.ac.uk/linkinglives/. Pete’s are rather more technical than mine, and brilliantly set out some of the difficult issues. I’m trying to think about what archivists are interested in and how we think about archival context. I hope our posts on licensing convey how much we are thinking about the best way to present and attribute the content.

Lastly for this month’s HubbuB, I’ve knocked up a fairly short Feature on the latest stuff that’s happening. I’m thinking of this as an annual feature – sometimes we are so busy we kind of forget to actually make a bit of noise about what we’ve achieved. You’ll see that we’re working on some record display improvements. I really hope I can show you these soon.

HubbuB: October 2011

Europeana and APEnet

Europeana logo
I have just come back from the Europeana Tech conference, a two-day event on various aspects of Europeana’s work and on related topics to do with data. The big theme was ‘open, open, open’, as well, of course, as the benefits of a European portal for cultural heritage. I was interested to hear about Europeana’s Linked Data output, but my understanding is that at present we cannot effectively link to their data, because they don’t provide URIs for concepts – in other words, identifiers for names, such as http://data.archiveshub.ac.uk/doc/agent/gb97/georgebernardshaw, so that we can say, for example, that our ‘George Bernard Shaw’ is the same as the ‘George Bernard Shaw’ represented on Europeana.
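
This is the kind of link that becomes possible once both sides mint URIs: a single statement asserting that two identifiers refer to the same person. Here is a sketch using rdflib, with a hypothetical Europeana URI standing in for the concept identifier they don’t yet provide:

```python
# Sketch of a sameAs link between two URIs for the same person.
# The Europeana URI here is hypothetical.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
hub_shaw = URIRef(
    "http://data.archiveshub.ac.uk/doc/agent/gb97/georgebernardshaw")
europeana_shaw = URIRef(
    "http://data.europeana.eu/agent/example/shaw")  # hypothetical

g.add((hub_shaw, OWL.sameAs, europeana_shaw))
print(g.serialize(format="turtle"))
```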

I am starting to think about the Hub being part of APEnet and Europeana. APEnet is the archival aggregator for Europe. I have been in touch with them about the possibility of contributing our data, and if the Hub were to contribute, we could probably start from next year. Europeana only provide metadata for digital content, so we could only supply descriptions where the user can link to the digital content, but this may well be worth doing, as a means to promote the collections of any Hub contributors who do link to digital materials.

If you are a contributor, or potential contributor, we would like to know what you think… we have a quick question for you at http://polldaddy.com/poll/5565396/. It simply asks if you think it’s a good idea to be part of these European initiatives. We’d love to get your views, and you only have to leave your name and a comment if you want to.

Flickr: an easy way to provide images online

You will be aware that contributors can now add images to descriptions and links to digital content of all kinds. The idea is that the digital content then forms an integral whole with the metadata, and it is also interoperable with other systems.

I’ve just seen an announcement by the University of Northampton, who have recently added materials to Flickr . I know that many contributors struggle to get server space to put their digital content online, so this is one possible option, and of course it does reach a huge number of people this way. There may be risks associated with the persistence of the URIs for the images, but then that is the case wherever you put them.

On the Hub we now have a number of images and links to content, for example: http://archiveshub.ac.uk/data/gb1089ukc-joh, http://archiveshub.ac.uk/data/gb1089ukc-bigwood, http://archiveshub.ac.uk/data/gb1089ukc-wea, http://archiveshub.ac.uk/data/gb141boda?page=7#boda.03.03.02.

Ideally, contributors would supply digital content at item level, so the metadata is directly about the image/digital content, but it is fine to provide it at any level that is appropriate.  The EAD Editor makes adding links easy (http://archiveshub.ac.uk/dao/). If you aren’t sure what to do, please do email us.

Preferred Citation

We never had a field for the preferred citation in our old template for the creation of EAD, and it has not been in the EAD Editor up till now. We were prompted to think about this after seeing the results of a survey on the use of EAD fields presented at the Society of American Archivists conference: around 80% of archive institutions do use it. We think it’s important to advise people how to cite the archive, so we are planning to provide this in the Editor, and we may be able to carry out global edits to add it to contributors’ data.
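
A global edit of this kind could be as simple as inserting the element where it is missing. EAD 2002 provides a prefercite element for this; the sketch below is illustrative, with an invented citation wording, rather than the actual script we would run.

```python
# Illustrative sketch of a global edit adding EAD 2002's <prefercite>
# where it is missing. The citation wording is invented.
from lxml import etree

archdesc = etree.fromstring(b"""
<archdesc level="collection">
  <did><unittitle>Example Papers</unittitle></did>
</archdesc>
""")

if archdesc.find("prefercite") is None:
    prefercite = etree.SubElement(archdesc, "prefercite")
    p = etree.SubElement(prefercite, "p")
    p.text = "Example Papers, Example University Archives"

print(etree.tostring(archdesc, pretty_print=True).decode())
```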

List of Contributors

Our list of contributors within the main search page has now been revised, and we hope it looks substantially more sensible and works better for researchers. This process really reminded us how hard it is to come up with one order for institutions that works for everyone! We are currently working on a regional search, which will act as an alternative way to limit searching. We hope to introduce this next year.

And finally…A very engaging Linked Data interface

This interface demonstration by Tim Sherratt shows how something driven by Linked Data can really be very effective. It also uses some of the Archives Hub vocabulary from our own Linked Data work, which is a nice indication of how people have taken notice of what we have been doing. There is a great blog post about it by Pete Johnston, Storytelling, archives and Linked Data. I agree with Pete that this sort of work is so exciting, and really shows the potential of the Linked Data Web for enabling individual and collective storytelling…something we, as archivists, really must be a part of.

Locah Linking Lives: an introduction

We are very pleased to announce that the Archives Hub will be working on a new Linked Data project over the next 11 months, following on from our first phase of Locah, and called Linking Lives. We’d like to introduce this with a brief overview of the benefits of the Linked Data approach, and an outline of what the project is looking to achieve. For more in-depth discussion of Linked Data, please see our Locah project blog, or take a look at the Linked Data website guides and tutorials.

Linked Open Data Cloud
Linked Data Cloud

The benefits of Linked Data

The W3C currently has a draft report, ‘Library Linked Data’, which covers archives and museums. In this they state that:

‘Linked data is shareable, extensible, and easily re-usable… These characteristics are inherent in the linked data standards and are supported by the use of web-friendly identifiers for data and concepts.’

Shareable

One of the exciting things about Linked Data is that it is about sharing data (certainly where you have Linked Open Data). I have found that this emphasis on sharing and data integration has had a positive effect aside from the practical reality of sharing: it engenders a mindset of collaboration, something that is of great value not just in the pursuit of the Linked Data vision but also, more broadly, for any kind of collaborative effort and for encouraging a supportive environment. Our previous Linked Data project, Locah, has been great for forging contacts and for putting archival data within this exciting space where people are talking about the future of the Web and the sorts of things that we might be able to do if we work together.

For the Archives Hub, our aim is to share the descriptions of archive collections as a means to raise the profile of archives, and show just how relevant archives are across numerous disciplines and for numerous purposes. In many ways, sharing the data gives us an opportunity to get away from the idea that archives are only of interest to a narrow group of people (i.e. family historians and academics purely within the History Faculty).

Extensible

The principle of allowing for future growth and development seems to me to be vital. The idea is to ensure that we can take a flexible approach, whereby enhancements can be made over time. This is vital for an exploratory area like Linked Data, where an iterative approach to development is the best way to go, and where we are looking at presenting data in new ways, looking to respond to user needs, and working with what technology offers.

Reusable

‘Reuse’ has become a real buzz word, and is seen as synonymous with efficiency and flexibility.  In this context it is about using data in different contexts, for different purposes. In a Linked Data environment what this can mean is providing the means for people to combine data from different sources to create something new, something that answers a certain need. To many archivists this will be fine, but some may question the implications in terms of whether the provenance of the data is lost and what this might mean. What about if information from an archive description is combined with information from Wikipedia? Does this have any implications for the idea of archive repositories being trusted and does it mean that pieces of the information will be out of their original context and therefore in some way open to misuse or misinterpretation?

Reuse may throw up issues, but it provides a great deal more benefits than risks. Whatever the caveats, it is an inevitable consequence of open data combined with technology, so archives either join in or exclude themselves from this type of free-flow of data. Nevertheless, it is certainly worth thinking about the issues involved in providing data within different contexts, and our project will consider issues around provenance.

Linking Lives

The basic idea of the Linking Lives project is to develop a new Web interface that presents useful resources relating to individual people, and potentially organisations as well.

It is about more than just looking at a name-based approach for archives. It is also about utilising external datasets in order to bring archives together with other data sources. Researchers are often interested in individual people or organisations, and will want to know what sort of resources are out there. They may not just be interested in archives. Indeed, they may not really have thought about using archives, but they may be very interested in biographical information, known connections, events during a person’s lifetime, etc. The idea is to show that various sources exist, including archives, and thus to work towards broadening the user-base of archives.

Our interface will bring different data sources together and we will link to all the archival collections relating to an individual. We have many ideas about what we can do, but with limited time and resources, we will have to prioritise, test out various options and see what works and what doesn’t and what each option requires to implement. We’ll be updating you via the blog, and we are very interested in any thoughts that you have about the work, so please do leave comments, or contact us directly.

In some ways, our approach may be an alternative to using EAC-CPF (Encoded Archival Context – Corporate Bodies, Persons and Families, an XML standard for marking up names associated with archive collections). But maybe in essence it will complement EAC-CPF, because eventually we could use EAC authority records and create Linked Data from them. We have followed the SNAC project with interest, and we recently met up with some of the SNAC project members at the Society of American Archivists’ Conference. We hope to take advantage of some of the exciting work that they are doing to match and normalise name records.

The W3C draft report on Library Linked Data states that ‘through rich linkages with complementary data from trusted sources, libraries [and archives] can increase the value of their own data beyond the sum of its sources taken individually’. This is one of the main principles that we would like to explore. By providing an interface that is designed for researchers, we will be able to test out the benefits of the Linked Data approach in a much more realistic way.

Maybe we are at a bit of a crossroads with Linked Data. A large number of data sets have been put out as RDF/XML, and some great work has been done by people like the BBC (e.g. the Wildlife Finder), the University of Southampton and the various JISC-funded projects. We have data.gov.uk, making Government data sets much more open. But there is still a need to make a convincing argument that Linked Data really will provide concrete benefits to the end user. Talking about SPARQL endpoints, JSON, Turtle, triples that connect entities, and the benefits of persistent URIs won’t convince people who are not really interested in process and principles, but just want to see the benefits for themselves.
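
For readers who haven’t met them, this is roughly what ‘talking to a SPARQL endpoint’ looks like in practice; the endpoint URL below is hypothetical, and the query simply asks for ten things that have a Dublin Core title:

```python
# Sketch of querying a (hypothetical) SPARQL endpoint for titles.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
sparql.setQuery("""
    SELECT ?thing ?title WHERE {
        ?thing <http://purl.org/dc/terms/title> ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["title"]["value"])
```

Useful and interesting to us, perhaps, but the point stands: none of this means much to a researcher until it is turned into something they can see and use.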

Has there been too much emphasis on the idea that if we output Linked Data then other people can (will) build tools? The much quoted adage is ‘The best thing that will be done with your data will be done by someone else’, but is there a risk in relying on this idea? In order to get buy-in to Linked Data, we need visible, concrete examples of benefit. Yes, people can build tools, they can combine data for their own purposes, and that’s great if and when it happens; but for a community like the archives community the problem may be that this won’t happen very rapidly, or on the sort of scale that we need to prove the worth of the investment in Linked Data. Maybe we are going to have to come up with some exemplars that show the sorts of benefits that end users can get. We hope that Linking Lives will be one step along this road; not so much an exemplar as a practical addition to the Archives Hub service that researchers will immediately be able to benefit from.

Of course, there has already been very good work within the library, archive and museum communities, and readers of this blog may be interested in the Use Cases provided at http://www.w3.org/2005/Incubator/lld/wiki/UseCaseReport.

Just to finish with a quote from the W3C draft report (I’ve taken the liberty of adding ‘archives’ as I think they can readily be substituted):

“Linked Data reaches a diverse community far broader than the library/archive community; moving to library/archival Linked Data requires libraries/archives to understand and interact with the entire information community. Much of this information community has been engendered by the capabilities provided by new technologies. The library/archive community has not fully engaged with these new information communities, yet the success of Linked Data will require libraries/archives to interact with them as fully as they interact with other libraries/archives today. This will be a huge cultural change that must be addressed.”

photo of paper chain dolls
Flickr: Icono SVDs photostream, http://www.flickr.com/photos/28860201@N05/with/3674610629/

A Web of Possibilities

image of spider from Wellcome images
“Will you browse around my website”, said the spider to the fly,
“’Tis the most attractive website that you ever did spy”

All of us want to provide attractive websites for our users. Of course, we’d like to think it’s not really the spider/fly kind of relationship! But we want to entice and draw people in, and often we will see our own website as our key web presence: a place for people to come to find out who we are, what we have and what we do, and to look at our wares, so to speak.

The recently released ‘Discovery’ vision is to provide UK researchers with “easy, flexible and ongoing access to content and services through a collaborative, aggregated and integrated resource discovery and delivery framework which is comprehensive, open and sustainable.”  Does this have any implications for the institutional or small-scale website, usually designed to provide access to the archives (or descriptions of archives) held at one particular location?

Over the years that I’ve been working in archives, announcements about new websites for searching the archives of a specific institution, or the outputs of a specific project have been commonplace.  A website is one of the obvious outputs from time-bound projects, where the aim is often to catalogue, digitise or exhibit certain groups of archives held in particular repositories. These websites are often great sources of in-depth information about archives. Institutional websites are particularly useful when a researcher really wants to gain a detailed understanding of what a particular repository holds.

However, such sites can present a view that is based more around the provider of the information than around the receiver. It could be argued that a researcher, convenience aside, is less likely to want to use archives because they are held at a particular location, and more likely to want archives around their subject area; and the archives relevant to them will probably be held in a whole range of archives, museums and libraries (and elsewhere). By only looking at the archives held at a particular location, even if that location is a specialist repository that represents the researcher’s key subject area, the researcher may not think about what they might be missing.

Project-based websites may group together archives in ways that benefit researchers more obviously, because they often aggregate around a specific subject area: for example, making available descriptions of, and links to, digital archives around a research topic. Value may be added through rich metadata, community engagement and functionality aimed at a particular audience. The downside here is often sustainability: projects necessarily have a limited life-span, and archives do not. Archives are ever-changing and growing, and descriptions need to be updated all the time.

So, what is the answer? Is this too much of a silo-type approach, creating a large number of websites, each dedicated to a small selection of archives?

Broader aggregation seems like one obvious answer. It allows for descriptions of archives (or other resources) to be brought together so that researchers have the benefit of searching across collections, bringing together archives by subject, place, person or event, regardless of where they are held (although there is going to be some kind of limit here, even if it is at the national level).

You might say that the Archives Hub is likely to be in favour of aggregation! But it’s definitely not all pros and no cons. Aggregations may offer powerful search functionality for intellectually bringing together archives based on a researcher’s interests, but in some ways there is a greater risk around what is omitted. When searching a website that represents one repository, a researcher is more likely to understand that other relevant archives may exist elsewhere. Aggregations tend to promote themselves as comprehensive – if not explicitly then implicitly – and this creates expectations that can never fully be met. They can also raise issues around measuring impact and around licensing. There is also the risk of a proliferation of aggregation services, further confusing the resource discovery landscape.

Is the ideal of broad inter-disciplinary cross-searching going to be impeded if we compete to create different aggregations? Yes, maybe to some extent, but I think that it is an inevitability, and it is valid for different gateways to serve different audiences’ needs. It is important to acknowledge that researchers in different disciplines and at different levels have their own specific requirements, and we cannot fulfil all of these needs by presenting data in only one way.

One thing I think is critical here is for all archive repositories to think about the benefits of employing recognised and widely-used standards, so that they can effectively interoperate and so that the data remains relevant and sustainable over time. This is the key to ensuring that data is agile, and can meet different needs by being used in different systems and contexts.

I do wonder if maybe there is a point at which aggregations become unwieldy, politically complicated and technically challenging. That point seems to be when they start to search across countries. I am still unsure about whether Europeana can overcome this kind of problem, although I can see why many people are so keen on making it work. But at present it is extremely patchy, and, for example, getting no results for texts held in Britain relating to Shakespeare is not really a good result. But then, maybe the point is that Europeana is there for those who want to use it, and it is doing ground-breaking work in its focus on European culture; the Archives Hub exists for those interested in UK archives and a more cross-disciplinary approach; Genesis exists for those interested in women’s studies; for those interested in the Co-operative movement, there is the National Co-operative Archive site; for those researching film, the British Film Institute website and archive is of enormous value.

So, is the important principle here that diversity is good because people are diverse and have diverse needs? Probably so. But at the same time, we need to remember that to get this landscape, we need to encourage data sharing and  avoid duplication of effort. Once you have created descriptions of your archive collections you should be able to put them onto your own website, contribute them to a project website, and provide them to an aggregator.

Ideally, we would be looking at one single store of descriptions, because as soon as you contribute to different systems, if they also store the data, you have version control issues. The ability to remotely search different data sources would seem to be the right solution here. However, there are substantial challenges. The Archives Hub has been designed to work in a distributed way, so that institutions can host their own data. The distributed searching does present challenges, but it certainly works pretty well. The problem is that running a server, operating system and software can actually be a challenge for institutions that do not have the requisite IT skills dedicated to the archives department.  Institutions that hold their own data have it in a great variety of formats. So, what we really need is the ability for the Archives Hub to seamlessly search CALM, AdLib, MODES, ICA AtoM, Access, Excel, Word, etc. and bring back meaningful results. Hmmm….
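To make that ‘Hmmm’ slightly more concrete, here is a minimal sketch (in Python) of the adapter pattern that this kind of cross-searching implies: each system gets a small translation layer that speaks its native language and returns results in one shared shape. All the class and field names are invented for illustration – this is not how the Hub actually works.

```python
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """One adapter per back end (CALM, AdLib, a spreadsheet...):
    it translates a common query into the system's native search
    and maps native records onto a shared minimal result format."""

    @abstractmethod
    def search(self, query: str) -> list[dict]:
        ...


class SpreadsheetAdapter(SourceAdapter):
    """Toy adapter over an in-memory 'spreadsheet' of descriptions."""

    def __init__(self, rows: list[dict]):
        self.rows = rows

    def search(self, query: str) -> list[dict]:
        q = query.lower()
        return [
            {"title": row["title"], "repository": row["repository"]}
            for row in self.rows
            if q in row["title"].lower()
        ]


def federated_search(query: str, adapters: list[SourceAdapter]) -> list[dict]:
    """Fan the query out to every source and merge the results."""
    results: list[dict] = []
    for adapter in adapters:
        results.extend(adapter.search(query))
    return results


rows = [{"title": "Franklin expedition papers", "repository": "Example Institute"}]
print(federated_search("franklin", [SpreadsheetAdapter(rows)]))
```

The hard part, of course, is not the pattern but writing and maintaining a reliable adapter for every one of those very different systems.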

The business case for opening up data seems clear. Projects like Open Bibliographic Data have helped progress the thinking in this arena and raised issues and solutions around barriers such as licensing. But it seems clear that we need to understand more about the benefits of aggregation, and the different approaches to aggregation, and we need to get more buy-in for this kind of approach. Does aggregation allow users to do things that they could not do otherwise? Does it save them time? Does it promote innovation? Does it skew the landscape? Does it create problems for institutions because of difficulties with branding and measuring impact? Furthermore, how can we actually measure these kinds of potential benefits and issues?

Websites that offer access to archives (or descriptions of archives) based on where they are located and on the body that administers them have an important role to play. But it seems to me that it is vital that these archives are also represented on a more national, and even international, stage. We need to bring our collections to where the users are. We need to ensure that Google and other search engines find our descriptions. We need to put archives at the heart of research, alongside other resources.

I remember once talking about the Archives Hub to an archivist who ran a specialist repository. She said that she didn’t think it was worth contributing to the Hub because they already had their own catalogue. That is, researchers could find what they wanted via the institute’s own catalogue on their own system, available in their reading room. She didn’t seem to be aware that this could only happen if they knew that the archive was there, and that this view rested on the idea that researchers would be happy to repeat that kind of search on a number of other systems. Archives are often about a whole wealth of different subjects – we all know how often there are unexpected and exciting finds. A specialist repository for any one discipline will have archives that reach way beyond that discipline into all sorts of fascinating areas.

It seems undeniable that data is going to become more open and that we should promote flexible access through a number of discovery routes, but this throws up challenges around version control, measuring impact, brand and identity. We always have to be cognisant of funding, and widely disseminated data does not always help us with a funding case because we lose control of the statistics around use and any kind of correlation between visits to our website and bums on seats. Maybe one of the challenges is therefore around persuading top-level managers and funders to look at this whole area with a new perspective?

Whose Data Is It?: a Linked Data perspective

A comment on the blog post announcing the release of the Hub Linked Data maybe sums up what many archivists will think: “the main thing that struck me is that the data is very much for someone else (like a developer) rather than for an archivist. It is both ‘our data’ and not our data at the same time.”

Interfaces to the data

Archives Hub search interface

In many ways, Linked Data provides the same advantages as other machine-based ways into the data. It gives you the ability to access data in a more unfiltered way. If you think about a standard Web interface search, what it does is to provide controlled ways into the data, and we present the data in a certain way. A user comes to a site, sees a keyword search box and enters a term, such as ‘antarctic exploration’. They have certain expectations of what they will get – some kind of list of results that are relevant to Antarctica and famous explorers and expeditions – and yet they may not think much about the process – will all records that have any/either/both of these terms be returned, for example? Will the results be comprehensive? Might there be more effective ways to search for what they want? As the people who run systems, we have to decide what a search is going to give the user. Will we look for these terms as adjacent terms and single terms? Will we return results from any field? How will we rank the results? We recently revised the relevance ranking on the Hub because although it was ‘pragmatically’ correct, it did not reflect what users expect to see. If a user enters ‘sir john franklin’ (with or without quotation marks) they would expect the Sir John Franklin Papers to come up first. This was not happening with the previous relevance ranking. The point here is that we (the service providers) decide – we have control over what the search returns and how it is displayed, and we do our best to provide something that will work for users.
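As an illustration of the kind of decision involved, here is a toy relevance scorer in Python – entirely hypothetical, and not the Hub’s actual algorithm – that boosts whole-phrase matches so that a search for ‘sir john franklin’ puts the Sir John Franklin Papers first:

```python
def score(title: str, query: str) -> int:
    """Toy scorer: one point per matching term, plus a large boost
    when the whole phrase appears intact, so an exact-title match
    outranks records that merely mention each word somewhere."""
    t, q = title.lower(), query.lower()
    points = sum(term in t for term in q.split())
    if q in t:
        points += 10
    return points


records = [
    "Sir John Franklin Papers",
    "Letters of John Smith",
    "Franklin expedition press cuttings",
]
for r in sorted(records, key=lambda r: score(r, "sir john franklin"), reverse=True):
    print(score(r, "sir john franklin"), r)
```

Real search engines do far more (field weighting, term frequency, and so on), but every one of those refinements is a choice that the service provider makes on the user’s behalf.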

Similarly, we decide how to display the results. We provide collection descriptions as the basic unit, sometimes with lower-level entries, but the user cannot display the information in different ways. The collection remains the indivisible unit.

With a Web interface we are providing (we hope) a user-friendly way to search for descriptions of archives – one that does not require prior knowledge. We know that users like a straightforward keyword search, as well as options for more advanced searching. We hide all of the mechanics of running the search and don’t really inform the user exactly what their search is doing in any kind of technical sense. When a user searches for a subject in the advanced subject search, they will expect to get all descriptions relating to that subject, but that is not necessarily what they will get. The reason is that the subject search looks for terms within the subject field, so the creator of the description must have entered the subject as an index term. In addition, the creator of the description may have entered a different term for the subject – say ‘drugs’ instead of ‘medicines’. The Archives Hub has a ‘subject finder’ that returns results for similar terms, so it would find both of these entries. However, maybe the subject finder makes a good point about searching: it provides a really useful way to find results, but it is quite hard to convey what it does quickly and obviously. It has never been widely used, even though evidence shows that users often want to search by subject – and entering the subject as a keyword is more likely to return less relevant results.

These are all examples of how we, as service providers, look to find ways to make the data searchable in ways that we think users want and try to convey the search options effectively. But it does give a sense that they are coming into our world, searching ‘our data’, because we control how they can search and what they see.

Linked Data is a different way of formatting data that is based upon a model of the entities in the data and relationships between them. To read more about the basics of Linked Data take a look at some of the earlier posts on the Locah blog (http://blogs.ukoln.ac.uk/locah/2010/08/).
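For readers who would like to see the shape of the thing, here is a minimal sketch using the Python rdflib library. The URIs are invented for illustration; the point is simply that entities (a collection, a person) and the relationships between them are expressed as triples:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF

g = Graph()
collection = URIRef("http://example.org/archives/franklin-papers")  # invented URI
person = URIRef("http://example.org/people/john-franklin")          # invented URI

# Each statement is a triple: subject, predicate (the relationship), object.
g.add((collection, DCTERMS.title, Literal("Sir John Franklin Papers")))
g.add((collection, DCTERMS.creator, person))
g.add((person, FOAF.name, Literal("Sir John Franklin")))

print(g.serialize(format="turtle"))
```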

Providing machine interfaces gives a number of benefits. However, I want to refer to two types of ‘user’ here: the ‘intermediate user’ and the ‘end user’. The intermediate user is the one that gets the data and creates the new ways of searching and accessing it – typically a developer working with the archivist, although as tools are developed to facilitate this kind of work, it should become easier to work with the data in this way. The end user is the person who actually wants to use the data.

1) Data is made available to be selected and used in different ways

We want to provide the ability for the data to be queried in different ways and for users to get results that are not necessarily based upon the collection description. For example, the intermediate user could select only data that relates to a particular theme, because they are representing end users who are interested in combining that data with other sources on the same theme. The combined data can be displayed to end users in ways that work for a particular community or particular scenario.

The display within a service like the Hub is for the most part unchanging, providing consistency, and it generally does the job. We make changes and enhancements from time to time to improve the service based on user needs, but we’re still essentially catering for one generic user as best we can. However, we want to provide the potential for users to display data in their own way for their own purposes. Linked Data encourages this. There are other ways to make this possible, of course, and we have an SRU interface that is being used by the Genesis portal for Women’s Studies. The important point is that we provide the potential for these kinds of innovations.

2) External links begin the process of interconnecting data

Machine interfaces provide flexible ways into the data, but I think that one of the main selling points of Linked Data is, well, linking data. To do this with the Hub data, we have put some links in to external datasets. I will be blogging about the process of linking to VIAF names (the Virtual International Authority File), but suffice it to say that if we can state within our data that ‘Sir Ernest Shackleton’ on the Hub is the same as ‘Sir Ernest Shackleton’ on VIAF, then we can benefit from anything that VIAF links to – DBPedia, for example (Wikipedia output as Linked Data). A user (or intermediate user) can potentially bring together information on Sir Ernest Shackleton from a wide range of sources. This provides a means to make data interconnected and bring people through to archives via a myriad of starting points.
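In Linked Data terms, that ‘same as’ statement is literally a single triple. Here is a hedged sketch with rdflib, using an invented Hub URI and a deliberate placeholder rather than Shackleton’s actual VIAF identifier:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
hub_person = URIRef("http://example.org/archiveshub/people/ernest-shackleton")  # invented
viaf_person = URIRef("http://viaf.org/viaf/...")  # placeholder for the matching VIAF record

# One triple asserts the equivalence; anything linked from the VIAF
# resource (DBPedia, for instance) then becomes reachable from the Hub entity.
g.add((hub_person, OWL.sameAs, viaf_person))

print(g.serialize(format="turtle"))
```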

3) Shared vocabularies provide common semantics

If we identify the title of a collection by using Dublin Core, then it shows that we mean the same thing by ‘title’ as others who use the Dublin Core title element. If we identify ‘English’ by using a commonly recognised URI (identifier) for English, from a common vocabulary (lexvo), then it shows that we mean the same thing as all the other datasets that use this vocabulary. The use of common vocabularies provides impetus towards more interoperability – again, connecting data more effectively. This brings the data out of the archival domain (where we share standards and terminology amongst our own community) and into a more global space.  It provides the potential for intermediate users to understand more about what our data is saying in order to provide services for end users. For example, they can create a cross-search of other data that includes titles, dates, extent, creator, etc. and have reasonable confidence that the cross-search will work because they are identifying the same type of content.
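As a small illustration of this point, here is how a description might assert its title and language using a Dublin Core property and the lexvo URI for English – again via rdflib, with an invented collection URI:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
collection = URIRef("http://example.org/archives/franklin-papers")  # invented URI

# dcterms:title means the same thing wherever Dublin Core is used;
# the lexvo URI identifies 'English' unambiguously across datasets.
g.add((collection, DCTERMS.title, Literal("Sir John Franklin Papers")))
g.add((collection, DCTERMS.language, URIRef("http://lexvo.org/id/iso639-3/eng")))

print(g.serialize(format="turtle"))
```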

For the Hub there are certain entities where we have had to create our own vocabulary, because those in existence do not define what we need, but then there is the potential for other datasets to use the same terms that we use.

4) URIs are provided for all entities

For Linked Data one of the key rules is that entities are identified with HTTP URIs. This means that names, places, subjects, repositories, etc. within the Hub data are now brought to the fore through having their own identifier – all the individuals, for example, within the index terms, have their own URI. This allows the potential to link from the person identified on the Hub to the same person identified in other datasets.

Who is the user?

So far so good. But I think that whilst in theory Linked Data does bring significant benefits, maybe there is a need to explain the limitations of where we currently are.

Hub Sparql endpoint

Our Linked Data cannot currently be accessed via a user-friendly Web-based search interface; it can, however, be accessed via a Sparql endpoint. Sparql is the language for querying RDF, the format used for Linked Data. It shares many similarities with SQL, a language typically used for querying the conventional relational databases that are the basis of many online services. (Our Sparql endpoint is at http://data.archiveshub.ac.uk/sparql .) What this means is that if you can write Sparql queries then you’re up and running. Most end users can’t, so they will not be able to pull out the data in this way. Even once you’ve got the data, then what? Most people wouldn’t know what to do with RDF output. In the main, therefore, fully utilising the data requires technical ability – it requires intermediate users to work with the data and create tools and services for end users.
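For the technically curious, here is a sketch of what an intermediate user might do with the endpoint from Python, using the SPARQLWrapper library. The query itself is deliberately generic – I am not documenting the Hub’s actual vocabulary here, just the shape of a Sparql request – and the endpoint’s availability can’t, of course, be guaranteed:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://data.archiveshub.ac.uk/sparql")
endpoint.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?thing ?title WHERE {
        ?thing dcterms:title ?title .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

# Each binding is one row of the result table.
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["thing"]["value"], "-", row["title"]["value"])
```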

For the Hub we have provided Linked Data views, but it is important not to misunderstand the role of these views: they are not any kind of definitive presentation, they are simply a means to show what the data consists of, and the user can then access that data as RDF/XML, JSON or Turtle (i.e. in a number of formats). It’s a human-friendly view of the Linked Data if you access a Hub entity’s web address via a web browser; if, however, you are a machine wanting machine-readable RDF and you visit the very same URI, you get the RDF view straight off. This is not to say that it wouldn’t be possible to provide all sorts of search interfaces onto the data – but this is not really the point of it for us at the moment. The point is to allow other people the potential to do what they want to do.
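That behaviour is standard HTTP content negotiation: the same URI returns different representations depending on the Accept header the client sends. A quick sketch with Python’s requests library, using an invented entity URI:

```python
import requests

uri = "http://data.archiveshub.ac.uk/id/example-entity"  # hypothetical URI

# A browser effectively asks for HTML; a harvester asks for RDF or Turtle.
html_view = requests.get(uri, headers={"Accept": "text/html"})
rdf_view = requests.get(uri, headers={"Accept": "application/rdf+xml"})
turtle_view = requests.get(uri, headers={"Accept": "text/turtle"})

# The Content-Type of each response shows which representation was served.
print(html_view.headers.get("Content-Type"))
print(rdf_view.headers.get("Content-Type"))
print(turtle_view.headers.get("Content-Type"))
```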

The realisation of the user benefit has always been the biggest question mark for me over Linked Data – not so much the potential benefits, as the way people perceive the benefits and the confidence that they can be realised. We cannot all go off and create cool visualisations (e.g. http://www.simile-widgets.org/timeline/). However, it is important to put this into perspective. The Hub data at Mimas sits in directories as EAD XML. Most users wouldn’t find that very useful. We provide an interface that enables users with no technical knowledge to access the data, but we control this and it only provides access to our dataset and to a collection-based view. In order to step beyond this and allow users to access the data in different ways, we necessarily need to output it in a way that provides this potential, but there is likely to be a lag before tools and services come along that take advantage of this. In other words, what we are essentially doing is unlocking more potential, but we are not necessarily working with that potential ourselves – we are simply putting it out there for others.

Having said that, I do think that it is really important for us to now look to demonstrate the benefits of Linked Data for our service more clearly by providing some ways into the Linked Data that take advantage of the flexible nature of the data and the external links – something that ‘ordinary’ users can benefit from. We are looking to work on some visualisations that do demonstrate some of the potential. There does seem to be an increasing consensus within cultural heritage that primary resources are too severed from the process of research – we have a universe of unrelated bits that hint at what is possible but do not allow it to be realised. Linked Data is attempting to resolve this, so it’s worth putting some time and effort into exploring what it can do.

We want our data to be available so that anyone can use it as they want. It may be true that the best thing done with the data will be thought of by someone else. (see Paul Walk’s blog post for a view on this).

However, this is problematic when trying to measure impact, and if we want to understand the benefits of Linked Data we could do with a way to measure them. Certainly, we can continue to work to realise benefits by actively working with the Linked Data community and encouraging a more constructive and effective relationship between developers and managers. It seems to me that things like Linked Data require us to encourage developers to innovate and experiment with the data, enabling users to realise its benefits by taking full advantage of the global interconnectivity that is the vision of the Linked Data Web. This is the aim of UKOLN’s DevCSI project – something I think we should be encouraging within our domain.

So, coming back to the starting point of this blog: the data maybe starts off as ‘our data’, but really we do indeed want it to be everyone’s data. A pick ’n’ mix environment to suit every information need.

Flickr: davidlocke's photostream