A European Journey: The Archives Portal Europe

March 17, 2014 / Jane Stevenson

In January 2013 the Archives Hub became the UK ‘Country Manager’ for the Archives Portal Europe.

The Archives Portal Europe (APE) is a European aggregator for archives. The website provides more information about the APE vision:

Borders between European countries have changed often during the course of history. States have merged and separated and it is these changing patterns that form the basis for a common ground as well as for differences in their development. It is this tension between their shared history and diversity that makes their respective histories even more interesting. By collating archival material that has been created during these historical and political evolutions, the Archives Portal Europe aims to provide the opportunity to compare national and regional developments and to understand their uniqueness while simultaneously placing them within the larger European context.

The portal will help visitors not only to dig deeper into their own fields of interest, but also to discover new sources by giving an overview of the jigsaw puzzle of archival holdings across Europe in all their diversity.

For many countries, the Country Manager role is taken on by the national archives. However, for the UK the Archives Hub was in a good position to work with APE. The Archives Hub is an aggregation of archival descriptions held across the UK. We work with and store content in Encoded Archival Description (EAD), which provides us with a head start in terms of contributing content.

Jane Stevenson, the Archives Hub Manager, attended an APE workshop in Pisa in January 2013, to learn more about the tools that the project provides to help Country Managers and contributors to provide their data. Since then, Jane has also attended a conference in Dublin, Building Infrastructures for Archives in a Digital World, where she talked about A Licence to Thrill: the benefits of open data. APE has provided a great opportunity to work with European colleagues; it is not just about creating a pan-European portal, it is also about sharing and learning together. At present, APE has a project called APEx, which is an initiative for “expanding, enriching, enhancing and sustaining” the portal.

How Content is Provided to APE

The way that APE normally works is through a Country Manager providing support to institutions wishing to contribute descriptions. However, for the UK, the Archives Hub takes on the role of providing the content directly, as it comes via the Hub and into APE. This is not to say that institutions cannot undertake to do this work themselves. The British Library, for example, will be working with their own data and submitting it to APE. But for many archives, the task of creating EAD and checking for validity would be beyond their resources. In addition, this model of working shows the benefits of using interoperable standards; the Archives Hub already processes and validates EAD, so we have a good understanding of what is required for the Archives Portal Europe.

All that Archives Hub institutions need to do to become part of APE is to create their own directory entry. These entries are created using Encoded Archival Guide (EAG), but the archivist does not need to be familiar with EAG, as they are simply presented with a form to fill in. The directory entry can be quite brief, or very detailed, including information on opening hours, accessibility, reprographic services, search room places, internet access and the history of the archive.

blog-ape-eag — Fig 1: EAG entry for the University of East London

Once the entry is created, we can upload the data. If the data is valid, this takes very little time to do, and immediately the archive is part of a national aggregation and a European aggregation.

APE Data Preparation Tool

The Data Preparation Tool allows us to upload EAD content and validate it. The screenshot below shows a list of EAD files from the Mills Archive that have been uploaded to the Tool, which will allow us to ‘convert and validate’ them. There are various options for checking against different flavours of EAD and there is also the option to upload EAC-CPF (which is not something the Hub is working with as yet) and EAG.

blog-ape-dpt1 — Fig 2: Data Preparation Tool

If all goes according to plan, the validation results in a whole batch of valid files, and you are ready to upload the data. Sometimes there will be an invalid file and you need to take a look at the validation message and figure out what you need to do (the error message in this screenshot relates to ‘example 2’ below).

blog-ape-dpt2 — Fig 3: Data Preparation Tool with invalid files

APE Dashboard

The Dashboard is an interface provided to an APE Country Manager to enable them to administer their landscape. The first job is to create the archival landscape. For the UK we decided to group the archives by type:

blog-ape-landscape — Fig 4: Archival Landscape

The landscape can be modified as we go, but it is good to keep the basic categories, so it’s worth thinking about this from the outset. We found that many other European countries divide their archives differently, reflecting their own landscape, particularly in terms of how local government is organised. We did have a discussion about the advantages of all using the same categories, but it seemed better for the end-user to be presented with categories suitable for the way UK archives are organised.

Within the Dashboard, the Country Manager creates logins for all of the archive repositories contributing to APE. The repositories can potentially use these logins to upload EAD to the dashboard, validate and correct if necessary and then publish. But at present, the Archives Hub is taking on this role for almost all repositories. One advantage of doing this is that we can identify issues that surface across the data, and work out how best to address these issues for all repositories, rather than each one having to take time to investigate their own data.

Working with the Data

When the Archives Hub started to work with APE, we began by undertaking a comparison of Hub EAD and APE EAD. Jane created a document setting out the similarities and differences between the two flavours of EAD. Whilst the Hub and APE both use EAD, this does not mean that the two will be totally compatible. EAD is quite permissive and so for services like aggregators choices have to be made about which fields to use and how to style the content using XSLT stylesheets. To try to cover all possible permutations of EAD use would be a huge task!

There have been two main scenarios when dealing with data issues for APE:

(1) the data is not valid EAD or it is in some way incorrect

(2) the data is valid EAD but the APE stylesheet cannot yet deal with it

We found that there was a combination of these scenarios. For the first, the onus is on the Archives Hub to deal with the data issues at source. This enables us to improve the data at the same time as ensuring that it can be ingested into APE. For the second, we explain the issue to the APE developer, so that the stylesheet can be modified.

Here are just a few examples of some of the issues we worked through.

Example 1: Digital Archival Objects

APE was omitting the <daodesc> content:

<dao href="http://www.tate.org.uk/art/images/work/P/P78/P78315_8.jpg" show="embed"><daodesc><p>'Gary Popstar' by Julian Opie</p></daodesc></dao>

Content of <daodesc><p> should be transferred to <dao@xlink:title>. It would then be displayed as mouse-over text to the icons used in the APE for highlighting digital content. Would that solution be ok?

In this instance the problem was due to the Hub using the DTD and APE using the schema, and a small transformation done by APE when they ingested the data sufficed to provide a solution.
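The kind of small transformation described above can be sketched in Python. This is only an illustration, not the transformation APE actually runs: the function name is hypothetical, the XLink namespace URI is the standard one, and the sketch assumes the description text sits directly inside <daodesc>.

```python
import xml.etree.ElementTree as ET

# Standard XLink namespace; the post only mentions the xlink:title attribute.
XLINK_NS = "http://www.w3.org/1999/xlink"


def move_daodesc_to_title(dao):
    """Copy the text of <daodesc> into xlink:title and remove <daodesc>.

    Hypothetical helper for illustration; not part of any APE tool.
    """
    daodesc = dao.find("daodesc")
    if daodesc is not None:
        # Gather all text inside <daodesc>, collapsing runs of whitespace.
        text = " ".join("".join(daodesc.itertext()).split())
        if text:
            dao.set("{%s}title" % XLINK_NS, text)
        dao.remove(daodesc)
    return dao


sample = (
    '<dao href="http://www.tate.org.uk/art/images/work/P/P78/P78315_8.jpg" '
    'show="embed"><daodesc><p>\'Gary Popstar\' by Julian Opie</p></daodesc></dao>'
)
dao = move_daodesc_to_title(ET.fromstring(sample))
print(dao.get("{%s}title" % XLINK_NS))
```

The description ends up as an attribute on the <dao> element, which is where the mouse-over text mentioned above would be read from.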

Example 2: EAD Level Attribute

Archivists are all familiar with the levels within archival descriptions. Unfortunately, ISAD(G), the standard for archival description, is not very helpful with enforcing controlled vocabulary here, simply suggesting terms like Fonds, Sub-fonds, Series, Sub-series. EAD has a more definite list of values:

collection
fonds
class
recordgrp
series
subfonds
subgrp
subseries
file
item
otherlevel

Inevitably this means that the Archives Hub has ended up with variations in these values. In addition, some descriptions use an attribute value called ‘otherlevel’ for values that are not, in fact, other levels, but are recognised levels.

We had to deal with quite a few variations: Subfonds, SubFonds, sub-fonds, Sub-fonds, sub fonds, for example. I needed to discuss these values with the APE developer and we decided that the Hub data should be modified to only use the EAD specified values.

For example, a level value such as ‘Sub-fonds’ needed to be changed to the EAD value ‘subfonds’.

At the same time the APE stylesheet also needed to be modified to deal with all recognised level values. Where the level was not a recognised EAD value, e.g. ‘piece’, then ‘otherlevel’ is valid, and the APE stylesheet was modified to recognise this.
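A clean-up of this kind could be sketched as follows. It is a minimal illustration, assuming the variations differ only in case, spaces, hyphens and underscores; the function name is hypothetical and this is not the Hub's actual script. In EAD 2002 a non-standard level is recorded as level="otherlevel" with the local term in a separate otherlevel attribute, which is what the second return value stands for.

```python
import re
from typing import Optional, Tuple

# The level values listed above, apart from 'otherlevel' itself.
RECOGNISED_LEVELS = {
    "collection", "fonds", "class", "recordgrp", "series",
    "subfonds", "subgrp", "subseries", "file", "item",
}


def normalise_level(raw: str) -> Tuple[str, Optional[str]]:
    """Return (level, otherlevel) for a level value found in the data.

    Hypothetical helper: recognised levels are mapped to the EAD spelling;
    anything else is kept as a local term under 'otherlevel'.
    """
    # Drop spaces, hyphens and underscores, then lower-case.
    key = re.sub(r"[\s_-]+", "", raw).lower()
    if key in RECOGNISED_LEVELS:
        return key, None
    return "otherlevel", raw.strip()


for variant in ["Subfonds", "SubFonds", "sub-fonds", "Sub-fonds", "sub fonds"]:
    print(variant, "->", normalise_level(variant))
print("piece ->", normalise_level("piece"))
```

All five sub-fonds variants map to ‘subfonds’, while ‘piece’ stays as a local term under ‘otherlevel’. Spelled-out forms such as ‘record group’ would not be caught by this sketch and would need an explicit mapping.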

Example 3: Data within <title> tag

We discovered that for certain fields, such as biographical history, any content within a <title> tag was being omitted from the APE display. This simply required a minor adjustment to the stylesheet.
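The fix itself was made in the APE XSLT stylesheet, which is not shown here. As a rough Python analogue of how this class of problem arises, the sketch below uses an invented biographical history paragraph (the sample text is an assumption, not Hub data): reading only the text directly under <p> stops at the nested <title>, whereas walking all descendant text keeps it.

```python
import xml.etree.ElementTree as ET

# Invented sample content, for illustration only.
bioghist = ET.fromstring(
    "<bioghist><p>Author of <title>An Example Work</title> and other books.</p></bioghist>"
)
p = bioghist.find("p")

# Text directly under <p>, up to the first child element: the title is lost.
direct_text = p.text

# Text of <p> and all of its descendants, in document order: the title is kept.
full_text = "".join(p.itertext())

print(repr(direct_text))
print(repr(full_text))
```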

Where are we Now?

The APE developers are constantly working to improve the stylesheets to work with EAD from across Europe. Most of the issues that we have had have now been dealt with. We will continue to check the UK data as we upload it, and go through the process described above, correcting data issues at source and reporting validation problems to the APE team.

The UK Archival Landscape in Europe

By being part of the Archives Portal Europe, UK archives benefit from more exposure, and researchers benefit from being able to connect archives in new and different ways. UK archives are now being featured on the APE homepage.

blog-ape-feature1 — Wiener Library: Antisemitic board game, 1936

APE and the APEx project provide a great community environment. The project provides news from archives across Europe: http://www.apex-project.eu/index.php/news; it has a section for people to contribute articles: http://www.apex-project.eu/index.php/articles; and it runs events and advertises events across Europe: http://www.apex-project.eu/index.php/events/cat.listevents/. Most importantly for the Archives Hub, it provides effective tools along with knowledgeable staff, so that there is a supportive environment to facilitate our role as Country Manager.

blog-ape-berlingroup — APEx Project meeting in Berlin, Nov 2013

See: http://www.flickr.com/photos/apex_project/10723988866/in/set-72157637409343664/
(copyright: Bundesarchiv)

Open Up!

April 24, 2012 / Jane Stevenson

Speaking at the recent World Wide Web (WWW2012) conference in Lyon, Neelie Kroes, Vice President of the European Commission, emphasised the benefits of an open Web: “With a truly open, universal platform, we can deliver choice and competition; innovation and opportunity; freedom and democratic accountability”.

blog-neron-network — Image: '774:Neuron Connection Pattern' http://www.flickr.com/photos/60057912@N00/4743616313

But what does ‘open’ really mean? Do we really have a clear idea about what it is, how we achieve it, and what the benefits might be? And do we have a sense of what is happening in this landscape and the implications for our own domain?

There is a tendency to think that the social web equates to some degree with the open web, but this is not the case. Social sites like Facebook often create walled gardens, and you could argue that whilst they are connecting people, they are working against the principle of free and linked information. Just because the aim is to enable people to be social and connected, that doesn’t mean that it is done in a way that ensures the data is open.

Another confusion may arise around open and its relationship to privacy. Again, things can be open, but privacy can be respected. It’s about clear and transparent choices – whether you want your data to be open or whether you want to include some restrictions. We are not always aware of what we are giving away:

“the Net enables individuals in many cases to compromise privacy more thoroughly than the government and commercial institutions traditionally targeted for scrutiny and regulation. The standard approaches that have been developed to analyze and limit institutional actors do not work well for this new breed of problem, which goes far beyond the compromise of sensitive information.” (Jonathan Zittrain, The Future of the Internet).

There is an irony that we often readily give away personal data – so we can be (unknowingly?) very open with our own information – but often we give it away to companies that are not, in their turn, open with the data that is provided. As Tim Berners-Lee has stated, “One of the issues of social networking silos is that they have the data and I don’t”. It is possible to paint a worrying picture of a trend towards control being in the hands of the few:

“A small elite of gifted (largely male) super-wealthy masters of the universe are creating systems through their own brilliance which very few others in government, regulation or the general public can understand.” (http://www.guardian.co.uk/commentisfree/2012/apr/16/threat-open-web-opaque-elite).

Google has often been seen as a good guy in this context, but recently a number of actions have indicated a move away from open web principles. We have started to be aware of the way Google collects and uses our data, and there have been some dubious practices around ways that data has been gathered (e.g. personal data collected from UK Wifi networks). Google appear to be very eager to get us all using Google+, and we are starting to see ‘social’ search results that appear to be pushing people towards using Google+. I think we should be concerned by the announcement from Google that they are now starting to combine your Google search history with other data they collect through use of their products in order, they say, to deliver improved search results. Do we want Google to do this? Are we happy with our search history being used to push products? Is a ‘personalised’ service worth having if it means companies are collecting and holding all of the data we provide when we browse the web for their own commercial purposes?

It does seem that many of these companies will defend ‘open’ if it is in their interests, but find ways to bypass it when it is not. But maybe if we continue to apply pressure in support of an open approach, by implementing it ourselves and favouring tools and services that implement it, we can tip the balance increasingly towards a genuine open approach that is balanced by safeguards for guaranteeing privacy.

It is worth thinking about some of these issues when we think about our own archival data and whether we are going to take an open approach. It is not just our own individual circumstances, within one repository, that we should think about, but the whole context of what we want the Web to be, because our choices help determine the direction that the Internet goes in.

We do not know how our archival data will be used and we never have. Even when the data was on paper, we couldn’t control use. Fundamentally, we need to see this as a good thing, not a threat, but we need to be aware of the risks. The paradigm has shifted because the data is much, much easier to mess with now that it is digital, and, of course, we have digital archives, and we are concerned to protect the integrity of these sources. Providing an API seems like a much more explicit invitation to play with the data than simply putting it on the Web. Furthermore, the mindset of the digital generation is entirely different. Cutting and pasting, mashing and inter-linking are all taken for granted and sometimes the idea of the provenance of data sources seems to go out of the window.

This issue of ‘unpredictable use’ is interesting. When the Apple II was launched, Steve Jobs did not know how it would be used – as Zittrain states, it ‘invited people to tinker with it’, and whilst Apple might have had ideas about use, their ideas were not a constraint on reality. Contrast this with the iPhone: it comes out of the box programmed. Quoting Zittrain again: ‘Whereas the world would innovate for the Apple II, only Apple would innovate for the iPhone.’ Zittrain sees this trend towards fixed appliances as pointing to a rather bleak future of ‘sterile appliances tethered to a network of control.’

Maybe the challenge is partly around our understanding of the data we are ‘giving away’, why we are giving it away, and if it’s worth giving away because of what we get in return. Think about the data gathered as we browse the Web – cookies have always had a major role in building up a profile of our Web behaviour – where we go and what we do. Cookies are generally interested in behaviour rather than identity. But still, from 26 May this year, every website operating in the UK will be required to inform its users that they are being tracked with cookies, and to ask users for their consent. This means justifying why cookies are required – it may be for very laudable reasons – logging users’ preferences to help improve their experience, and generally getting a better understanding of how people use the site in order to improve it. It may be in order to target advertising – and you may feel that it’s hardly a problem if this is adversely affected, but maybe it will be smaller, potentially very useful sites that will suffer if their advertising revenue is affected. Is it a good thing that sites will have to justify use of cookies? Is it in our interests? Is this really where the main fight is around protecting your identity? At the same time, the Government is proposing to bring in powers to monitor the websites people are looking at, something which has very profound implications for our privacy.

On the side of the good guys, there are plenty of people out there espousing the importance of an altruistic approach. In The Digital Public Domain: Foundations for an Open Culture the authors argue that the Public Domain — that is, the informational works owned by all of us, be that literature, music, the output of scientific research, educational material or public sector information — is fundamental to a healthy society. Many other publications echo this belief and you can see plenty of evidence of the movement towards open publication. Technology can increasingly help us realise the advantages of an open approach. For example, text mining is something that can potentially bring substantial economic and cultural benefits. It is often associated with biomedical sciences, but it could have great potential across the humanities. A recent JISC report on the Value and Benefits of Text Mining concluded that the current licensing arrangements are a key barrier to unlocking the potential value of text mining to society and enabling researchers to gain maximum benefit from the data. Text mining analyses text using computational methods, and provides the end user with what is most relevant to them. But copyright law works against this principle:

blog-jiscreport-textmining — http://www.jisc.ac.uk/inform/inform33/TextMining.html

“Current copyright law, however, imposes serious restrictions on text mining, because it involves a range of computerised analytical processes that are not all readily permitted within UK intellectual property law. In order to be ‘mined’, text must be accessed, copied, analysed, annotated and related to existing information and understanding. Even if the user has access rights to the material, making annotated copies can currently be illegal without the permission of the copyright holder.” (JISC report: Value and Benefits of Text Mining)

It is important to think beyond types of data and formats, and engage with the debate about the sort of advances for researchers that we can gain by opening up all types of data; whilst the data may differ, many of the arguments about the benefits of open data apply across the board, in terms of encouraging innovation and the advancement of knowledge.

The battle between open and transparent and the walled garden scenario is played out in many ways. One area is the growth of native apps for mobile devices. It is easy to simply see mobile apps as providing new functionality, and opportunities for innovation, with so many people creating apps for every activity you can think of. But what are the implications of a scenario where you have to enter a ‘silo’ environment in order to do anything, from finding your way to a location to converting measurements from one unit to another, to reading the local news reports? Tim Berners-Lee sees this as one of the main threats to the principle of the Web as a platform with the potential to connect data.

Should we be promoting the open Web? Surely as information professionals we should. This does not mean that there is an imperative to make everything open – we are all aware of the Data Protection Act and there is a whole range of very valid reasons to withhold information. But it is increasingly apparent that we need to be explicit in applying permissions – whether it be explicitly fully open licences, attribution licences or closure periods. Many of the archives described on the Archives Hub have ‘open for consultation’ under Access Conditions. But what does this really mean? What can people really do with the content? Is the content actually ‘open for reuse in any way the end-user deems useful to them’? And specifically, what can they do with digital content, available on the World Wide Web?

We need to be aware of the open data agenda and the threats and opportunities presented to us. We have the same kind of tensions within our own world in terms of what we need to protect, both for preservation reasons and for reasons of IPR and commercial advantage, but we want to make archives as open and available as possible and we need to understand what this means in the context of the digital, online world. The question is whether we want to encourage innovation, what Zittrain calls a ‘generative’ approach. This means enabling programmers to take hold of what you can provide and work with it, as well as encouraging reuse by end users. But it also means taking a risk with security, because making things open inevitably makes them more open to abuse.

People’s experiences with the Internet are shaped by the devices they use to access it. And, similarly, their experiences of archives are shaped by the ways that we choose to enable access. In the same way that a revolution started when computers were made available that could potentially run software that was not available at the time they were built, so we should think about enabling innovation with our data that we may not envisage at present, not trying to determine user behaviour but relishing the idea of the unknown. This raises the question of whether we facilitate this by simply putting the data out as it is, even if it is somewhat unstructured, making it as open as possible, or whether we should try to clean up data, and make it more rigorous, more consistent and more structured, with the purpose of facilitating flexible use of the data, enabling people to use and present it in other ways. But maybe that is another story…

In a way, the early development of the Internet has been an unexpected journey. Innovation has been encouraged because a group of computer scientists did not want to control it or to ensure maximum commercial success. It was an altruistic approach, and it might not have turned out that way. Maybe the PC was an unlikely success story in business, where a stable, supported, predictable solution might have been favoured, where companies did not need skills in-house. Both Microsoft and Apple made their computers’ operating systems open for third-party software development, so even if there was a sense of monopoly, there was still a basic ethos of anyone being able to write for the PC.

Are we still moving down that road where amateur innovation and ‘messing about’ is seen as a positive thing? Or are the problems of security, spam and viruses, along with the power of the big guys, taking us towards something that is far more fixed and controlled?

Opening up UK archives data (II)

May 11, 2010 / Jane Stevenson

This is the second post relating to the recent UKAD meeting, concentrating on the brainstorming that took place around digital and digitised archives.

The driving forces that were identified:

Crowd-sourcing – metadata generation
Attracts funding
Promotes access
Open up wealth of possibility
Remain relevant
Meet user expectations
Centres of excellence in digitisation – common approach
Collections already digitised are hidden – in silos – return on investment
Potential to capture richer information about users
Potential to draw people in
Increasing ‘digitisation on demand’ – needs to be harnessed effectively
Increasing amount of born-digital media need to be made accessible online – drive to discoverability of digital materials
Changing profession – becoming more confident in this area as a result of above
Web makes it much easier

The group felt that it all added up to a resounding “we have to do this!”.

The resistors included:

Systems don’t talk to each other
Insufficient metadata of legacy digitised material – retroconversion – cost*
Copyright/IPR – complex, lots of local specificity
Work needed to marry user generated content and standard metadata
Community resistance to UGC
Vast amounts of content – prioritisation is intellectually challenging
Bulk digitisation is happening commercially – restricted rights
Clashes with business models – or perception that it does (e.g. models based on commercial digitisation assume increasing return on investment; the opposite may occur if the most commercially enticing material is digitised first)
Fears – grounded in truth – could affect funding: diminish user/visitor numbers on site, diminishes value of on-site expertise
Challenges in bringing catalogue data and digital object systems together
Query: not ultimately cost effective
Cost
Web makes it easier – but it’s hard to keep up…

The group looked at actions that are required:

1. Accrue evidence of user demand and current behaviour

Identify user communities (family, academic, student researchers)
Secondary research of existing analysis
Market research
Produce cost-benefit analysis – impact on site visits?

2. Systems talking to each other

People talking to each other about systems!
Develop definitive list of systems in use – a picture of UK situation > crosswalks/maps between (see Library world)
Needs to cover both catalogue and digital object management systems
Discmap?

3. Copyright/IPR

Produce decision tree to help archivists make decisions – risk assessment but beware risk aversion
Encourage sharing of experience/lessons learned
Gathering what has already been done

4. Impact of digitised resources

Gather existing articles/research
Share practice in assessing impact in differing contexts

5. Metadata and costs

Establish costs of differing levels of metadata generation
Identify how much data needs to be converted into digital metadata (how much is not online?)

6. Identify quick wins!

Working together to create user cases and examples, sharing experience, getting involved in the Resource Discovery Task Force and linking projects to this

Of course, the gathering of such evidence can help us to see where we are and where we need to go, and also how to get there. But implementation is quite another thing. The UKAD Network is hoping to build upon this work to encourage collaborative initiatives and the sharing of expertise and experiences. We are considering events and training opportunities that might help. We do feel that it will be useful to create a stronger presence for UKAD, as a means to provide a focus for this work, and we are looking at low-cost options to do this.

Opening up UK archives data (i)

April 16, 2010 / Jane Stevenson

On 14th April the UK Archives Discovery Network (UKAD) met in Manchester to discuss challenges surrounding the opening up of archival data. We were looking to develop our understanding of the key issues driving or preventing these developments and to start pulling together an action plan. We also talked about digital and digitised archives, which I’ll blog about in a separate post.

We split into two groups to brainstorm driving and restraining factors. There was no chance of drying up – we all had plenty to say, and of course, the restraining influences grew rapidly, threatening to outstrip the drivers by quite some way. However, in the end we had a good balance, and we felt that the day had been very positive, although summing up the position is one thing and implementing actions is quite another. We hope to start putting some things into place that will help to take us along the road to promoting archival discovery.

We are looking to create a UKAD website, which will help us to promote UKAD to archivists and others, and we’ll let you know about that as soon as we can.

With thanks to Melinda Haunton from The National Archives, who, as the UKAD secretary, gallantly pulled together the large number of flip charts and made them into something coherent, here is a summary of the points.

Our driving forces included:

Perceived user demand
Time saving – easier to search, more effective customer service
Opportunities – for use of data and for benefiting from others’ use
Government policy drivers in this direction (data.gov.uk is evidence of Govt buy-in)
Rich data – think about opportunities to make the most of events, people, places, concepts within the finding aids
Serendipitous collaboration – working together is a big driver – a common way to hear about initiatives and experiences of others that could be of benefit to you
Potential to get new users – e.g. via GIS data connected to archives data – users who may not think of using archives
Standards exist to drive openness
Sustainability of resources – less tied to a single service if data is open
Enrichment and adding value – others can enrich our data
Archives making use of others’ open data – sector benefits from open data as well as contributing to it
Connecting archives – new narratives – data can coalesce around events, people, places, subjects
Exposure of holdings – especially for small repositories who have limited resources to promote themselves
Unlikely to be restrictions on opening up descriptive data (unlike digital/digitised archives)
Could glean evidence of impact – ways to gather usage statistics are increasingly effective – provide evidence of benefits
Opening up could reach out to excluded communities more effectively (different routes into archives)
Potential for wider impact – e.g. in demonstrating impact of academic research (RAE)

Our restraining forces included:

Lack of evidence of user demand – it may not be what we expect/assume
APIs – where they exist, are they used? (possibly not)
Users’ understanding what they’ll get – you won’t normally get direct access to archives through descriptions
Proprietary software providers – may not ‘play ball’
Archivists’ understanding of open data issues – need understanding to get buy-in
Access to developer expertise – archivists frequently find getting IT or developer support very difficult
Machine to machine – not visual, not easy to sell – need to understand the potential
Messy data – all the issues we are so aware of with different data sources; the balkanisation of data
Backlogs – if it’s not catalogued, we can’t open it up
Sustainability of resources
Data becoming out of date as it gets further from the original source – end up reusing out-of-date data
Contractual embargoes – e.g. involving commercial partners e.g. software providers
Dependencies – potentially data may be dependent on other things – e.g. attached to schema, source code, IPR
Evidence of impact – can be difficult to get this and prove the worth of open data
Branding – or lack of it on reused open data – may affect funding if funders can’t see direct benefits
Loss of control causes fear – once it’s open anything can happen
Lack of ‘archival developers’ – very few developers with some understanding of archives and archival issues

Our Actions included:

Working together – collaborative evidence gathering and sharing, not competing – use examples/evidence from others
Evidence – case studies, knowing what researchers are requesting, evidence for advantages of digitising
Understanding funders – shared understanding of funders can help with internal funding
Archives developer days – bringing developers together as has been done with Dev8D – collaborative approach to programming
Strategy for approaching software vendors to get buy-in – appeal to their commercial interests, a concerted approach from aggregators may be more effective
UK based evaluation of archival cataloguing systems – still know little about percentages using different systems and evaluation of systems
Conference/workshops to raise awareness of buy in including practical demonstrations – must be interactive and practical and encourage sharing of projects, experiences and ideas