Archives Portal Europe Country Managers’ Meeting, 30 Nov 2016

November 30, 2016 / Jane Stevenson

This is a report of the Archives Portal Europe Country Managers’ meeting in Slovakia on 30 November 2016, with some comments and views from the UK and Archives Hub perspective.

APE Country Managers meeting, Bratislava, 30 Nov 2016

Context

The APE Foundation (APEF), which was created following the completion of the APEx project (an EC funded project to maintain and develop the portal running from 2012 to 2015), is now taking APE forward. It has a Governing Board and working groups for standards, technical issues and PR/comms. The APEF has a coordinator and three technical/systems staff as well as an outreach officer. Institutions are invited to become associate members, to help support the portal and its aims.

Things are going well for APEF, with a profit recorded for 2016, and growing associate membership. APEF continues to be busy with development of APE, and is endeavouring to encourage cooperation and collaboration as a means to seize opportunities to keep developing and to take advantage of EU funding opportunities.

Current Development

The APEF has the support of the Ministry of Culture in the Netherlands and has a close working relationship with the Netherlands national aggregation project, the ‘DTR’, which is key to the current APE development phase. The idea is to use the framework of APE for the DTR, benefitting both parties. Cooperation with DTR involves three main areas:

•   building an API to open up the functionality of APE to third parties (and to enable the DTR to harvest the APE data from The Netherlands)
•   improving the uploading and processing of EAC-CPF
•   enabling the uploading and processing of ‘additional finding aids’

The API has been developed so that specific requests can be sent to fetch selected data. It is possible to do this for EAD (descriptions) and EAC-CPF (names). The API provides raw data as well as processed results. There have been issues around things like the relevance ordering of results, which is a substantial area of work that is being addressed.

The API raises implications in terms of the data, as the Content Provider Agreement that APE institutions sign gives control of the data to the contributors. So, the API had to be implemented in a way that enables each contributor to give explicit permission for the data to be available as CC0 (fully open data). This means that if a third party uses the API to grab data, they only get data from a country that has given this permission. APEF has introduced an API key, which is a little controversial, as it could be argued that it is a barrier to complete openness, but it does enable the Foundation to monitor use, which is useful for impact, for checking correct use, and blocking those who misuse the API. This information is not made open, but it is stored for impact and security purposes.
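As a rough illustration of the permission model described above, the following Python sketch returns only records from contributors who have opted in to CC0, and counts calls per API key. The record structure, the `cc0_permitted` flag, the key value and the function name are all hypothetical; this is not the APE API itself.

```python
from collections import Counter

# Hypothetical records: each carries the contributing country and whether
# that contributor has explicitly opted in to CC0 release via the API.
RECORDS = [
    {"id": "fa-1", "country": "NL", "cc0_permitted": True},
    {"id": "fa-2", "country": "CH", "cc0_permitted": False},
    {"id": "fa-3", "country": "SE", "cc0_permitted": True},
]

# Hypothetical set of issued API keys and a usage counter per key.
VALID_KEYS = {"demo-key"}
usage = Counter()


def fetch_open_records(api_key, records=RECORDS):
    """Return only records whose contributor has opted in, and log the call."""
    if api_key not in VALID_KEYS:
        raise PermissionError("unknown API key")
    usage[api_key] += 1  # kept for impact and security monitoring, not published
    return [r for r in records if r["cc0_permitted"]]


result = fetch_open_records("demo-key")
print([r["id"] for r in result])
```

The point of the sketch is only that the permission check sits on the server side, so a third party never has to know which contributors have opted out.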

There was some discussion at the meeting around open data and use of CC0. In countries such as Switzerland it is not permitted to open up data through a CC0 licence, and in fact, it may be true to say that CC0 is not the appropriate licence for archival descriptions (the question of whether any copyright can exist in them is not clear) and a public domain licence is more appropriate. When working across European countries there are variations in approaches to open data. The situation is complicated because the application of CC0 for APE data is not explicit, so any licence that a country has attached to their data will effectively be exported with the data and you may get a kind of licence clash. But the feeling is that for practical purposes if the data is available through an API, developers will expect it to be fully open and use it with that in mind.

There has been work to look at ways to take EAC-CPF from a whole set of institutions more easily, which would be useful for the UK, where we have many EAC-CPF descriptions created by SNAC. Work to bring together more than one name description for the same person has not started, and is not scheduled for the current period of development, but the emphasis is likely to be on better connectivity between variations of a name rather than having one description per name.

Additional finding aids offer the opportunity to add different types of information to APE. You may, for example, have a register of artists or ships’ logs, or you may have started out with a set of cards with names A-Z, relating to your archive in some way. You could describe these in one EAD description, and link this to the main description. In the current implementation of EAD2002 in APE this would have to go into a table in Scope & Content and in-line tagging is not allowed to identify parts of the data. This leads to limitations with how to search by name. But then EAD3 gives the option to add more information on events and names. You can divide a name up into parts, which allows for better searching. Therefore APE is developing a new means to fetch and process EAD3 for the additional finding aids alongside EAD2002 for ‘standard’ finding aids. In conjunction with this, the interface needs to be changed to present the new names within the search.
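To give a sense of what dividing a name into parts makes possible, here is a minimal Python sketch that reads a simplified name fragment in the style of EAD3. The fragment omits the EAD3 namespace, the `localtype` values are illustrative rather than a prescribed vocabulary, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment in the style of EAD3; the localtype
# values are illustrative only.
FRAGMENT = """
<persname>
  <part localtype="surname">Rawlinson</part>
  <part localtype="forename">Richard</part>
  <part localtype="dates">1690-1755</part>
</persname>
"""


def name_parts(xml_text):
    """Map each part's localtype to its text so parts can be searched separately."""
    root = ET.fromstring(xml_text)
    return {p.get("localtype"): (p.text or "").strip() for p in root.findall("part")}


parts = name_parts(FRAGMENT)
print(parts["surname"], parts["dates"])
```

With the name held this way, a search on surname alone does not have to rely on matching a single undivided string.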

The work on additional finding aids may not be so relevant for the Archives Hub as a contributor to APE, as the Hub cannot look at taking on ‘other finding aids’, with all the potential variations that implies. However, institutions could potentially log into APE themselves and upload these different types of descriptions.

APE and Europeana

There was quite a bit to talk about concerning APE and Europeana. The APEF is a full partner of the Europeana Digital Services Infrastructure 2 (DSI2) project (currently running 2016/2017). The project involves work on the structure for Europeana, maintaining and running data and aggregation services, improving data quality, and optimising relations with data partners. The work APE is involved with includes improving the current workflow for harvest/ingest of data, and also evaluating what has already been ingested into Europeana.

Europeana seems to have ongoing problems dealing with multi-level EAD descriptions, compounded by the limitation that they only represent digital materials. The approach is not a good fit for archives. Europeana have also introduced both a new publishing framework and different rights statements.

The new publishing framework is a 4 tier approach where you can think of Europeana as a more basic tool for promoting your archives, or something that is a platform for reuse. It refers to the digital materials in terms of whether they are a certain number of pixels, e.g. 800 pixels wide for thumbnails (adding thumbnails means using Europeana as a ‘showcase’) and 1,200 pixels wide (high quality and reusable, using Europeana as a distribution and reuse platform). The idea of trying to get ‘quality’ images seems good, but in practice I wonder if it simply raises the barrier too much.

The new Rights statements require institutions to be very clear about the rights they want to apply to digital content. The likely conclusion of all this from the point of view of the Archives Hub is that we cannot grapple with adding to Europeana on behalf of all of our contributors, and therefore individual contributors will have to take this on board themselves. It will be possible for contributors to log into the APE dashboard (when it has been changed to reflect the Europeana new rights) and engage with this, selecting the finding aids, the preferred rights statements, and ensuring that thumbnail and reusable images meet the requirements. Once the descriptions are in APE they can then be supplied to Europeana. The resulting display in Europeana should be checked, to ensure that it is appropriate.

We discussed this approach, and concluded that maybe APE contributors could see Europeana as something that they might use to showcase their content, so, think of it on our terms, as archives, and how it might help us. There is no obligation to contribute, so it is a case of making the decision whether it is worth representing the best visual archives through Europeana or whether this approach takes more effort than the value that we get out of it. After 10 years of working with Europeana, and not really getting proper representation of archives, the idea of finding a successful way of contributing archives is appealing, but it seems to me that the amount of effort required is going to be significant, and I’m not sure if the impact is enough to warrant it.

Europeana are working on a new way of automated and real time ingest from aggregators and content providers, but this may take another year or more to become fully operational.

Outreach and CM Reports

Towards the end of the day we had a presentation from the new PR/communications officer. Having someone to encourage, co-ordinate and develop ideas for dissemination should prove invaluable for APE. The Facebook page is full of APE activities and related news and events. You can tweet and use the hashtag #archivesportaleurope if you would like to make APE aware of anything.

We ended the day with reports from country managers, which, as always, threw up many issues, challenges, solutions, questions and answers. Plenty to set up APEF for another busy year!


Archives Portal Europe builds firm foundations

July 7, 2016 / Jane Stevenson

On 8th June 2016 I attended the first Country Managers’ meeting of the newly formed Foundation of the Archives Portal Europe (APEF) at the National Archives of the Netherlands (Nationaal Archief).

The Foundation has been formed on the basis of partnerships between European countries. The current Foundation partners are: Belgium, Denmark, Luxembourg, The Netherlands, Spain, Sweden, Switzerland, Estonia, France, Germany, Hungary, Italy, Latvia, Norway and Slovenia. All of these countries are members of the ‘Assembly of Associates’. Negotiations are proceeding with Bulgaria, Greece, Liechtenstein, Lithuania, Malta, Poland, Slovakia and the UK. Some countries are not yet in a position to become members, mainly due to financial and administrative issues, but the prospects currently look very positive, with a great willingness to take the Portal forwards and continue the valuable networking that has been built up over the past decade. Contributing to the Portal does not incur financial contribution; the Assembly of Associates is separate from this, and the idea is that countries (National Archives or bodies with an educational/research remit) sign up to the principles of APE and the APE Foundation – to collaborate and share experiences and ideas, and to make European archives as accessible as possible.

The Governing Board of the Foundation is working with potential partners to reach agreements on a combination of financial and in-kind contributions. It’s also working on long term strategy documents. It has established working groups for Standards and PR & Communications and it has set up cooperation with the Dutch DTR project (Digitale Taken Rijksarchieven / Digital Processes in State Archives) and with Europeana. The cooperation with the DTR project has been a major boost, as both projects are working towards similar goals, and therefore work effort can be shared, particularly development work.

Current tasks for the APEF:

•   Building an API to open up the functionality of the Archives Portal Europe to third parties and to implement the possibility for the content providers to switch this option on or off in the Archives Portal Europe’s back-end.
•   Improving the uploading and processing of EAC-CPF records in the Archives Portal Europe and improving the way in which records creators’ information can be searched and found via the Archives Portal Europe’s front-end and via the API.
•   Enabling the uploading/processing of “additional finding aids (indexes)” in the Archives Portal Europe and making this additional information available via the Archives Portal Europe’s front-end and the API.

This is in addition to the continuing work of getting more data into the Portal, supporting the country managers in working with repositories, and promoting the portal to researchers interested in using a Europe-wide search and discovery tool.

APEF will be a full partner in the Europeana DSI2 project, connecting the online collections of Europe’s cultural heritage institutions, which will start after the summer and will run for 16 months. Within this project APEF will focus on helping Europeana to develop the aggregation structure and provide quality data from the archives community to Europeana. A focus on quality will help to get archival data into Europeana in a way that works for all parties. There seems to be a focus from Europeana on the ‘treasures’ from the archives, and on images that ‘sell’ the archives more effectively. Whatever the rights and wrongs of this, it seems important to continue to work to expose archives through as many channels as we can, and for us in the UK, the advantages of contributing to the Archives Hub and thence seamlessly to APE and to Europeana, albeit selectively, are clear.

A substantial part of the meeting was dedicated to updates from countries, which gave us all a chance to find out what others are doing, from the building of a national archives portal in Slovakia to progress with OAI-PMH harvesting from various systems, such as ScopeArchiv, used in Switzerland and other countries. Many countries are also concerned with translations of various documents, such as the Content Provider Agreement, which is not something the UK has had to consider (although a Welsh translation would be a possibility).

We had a session looking at some of the more operational and functional tasks that need to be thought about in any complex system such as the APE system. We then had a general Q&A session. It was acknowledged that creating EAD from scratch is a barrier to contributing for many repositories. For the UK this is not really an issue, because we contribute Archives Hub descriptions. But of course it is an issue for the Hub: to find ways to help our contributors provide descriptions, especially if they are using a proprietary system. Our EAD Editor accounts for a large percentage of our data, and that creates the EAD without the requirement of understanding more than a few formatting tags.

The Archives Hub aims to set up harvesting of our contributors’ descriptions over the next year, thus ensuring that any descriptions contributed to us will automatically be uploaded to the Archives Portal Europe. (We currently have to upload on a per-contributor basis, which is not very efficient with over 300 contributors). We will soon be turning our attention to the selective digital content that can be provided by APE to Europeana. That will require an agreement from each institution in terms of the Europeana open data licence. As the Hub operates on the principles of open data, to encourage maximum exposure of our descriptions and promote UK archives, that should not be a problem.

With thanks to Wim van Dongen, APEF country manager coordinator / technical coordinator, who provided the minutes of the Country Managers’ meeting, which are partially reproduced here.

Archives Hub Search Analysis

February 29, 2016 / Jane Stevenson

Search logs can give us an insight into how people really search. Our current system provides ‘search logs’ that show the numbers based on the different search criteria and faceting that the Hub offers, including combined searches. We can use these to help us understand how our users search and to give us pointers to improve our interface.

The Archives Hub has a ‘default search’ on the homepage and on the main search page, so that the user can simply type a search into the box provided. This is described as a keyword search, as the user is entering their own significant search terms and the results returned include any archival description where the term(s) are used.

The researcher can also choose to narrow down their search by type. The figure below shows the main types the Archives Hub currently has. Within these types we also have boolean type options (all, exact, phrase), but we have not analysed these at this point other than for the main keyword search.

Archives Hub search box showing the types of searches available

There are caveats to this analysis.

1. Results will include spiders and spam

With our search logs, excluding bots is not straightforward, something which I refer to in a previous post: Archives, Logs and Google Analytics. We are shortly to migrate to an entirely new system, so for this analysis we decided to accept that the results may be slightly skewed by these types of searches. And, of course, these crawlers often perform a genuine service, exposing archive descriptions through different search engines and other systems.

2. There are a small number of unaccounted for searches

Unidentified searches only account for 0.5% of the total, and we could investigate the origins of these searches, but we felt the time it would take was not worth it at this point in time.

3. Figures will include searches from the browse list.

These figures include searches actioned by clicking on a browse list, e.g. a list of subjects or a list of creators.

4. Creator, Subject and Repository include faceted searching

The Archives Hub currently has faceted searching for these entities, so when a user clicks to filter down by a specific subject, that counts as a subject search.

Results for One Month (October 2015)

For October 2015 the total searches are 19,415. The keyword search dominates, with a smaller use of the ‘any’ and ‘phrase’ options within the keyword search. This is no surprise, but this ‘default search’ still forms only 36% of the whole, which does not necessarily support the idea that researchers always want a ‘google type’ search box.

We did not analyse these additional filters (‘any/phrase/exact’) for all of the searches, but looking at them for ‘keyword’ gives a general sense that they are useful, but not highly used.

A clear second is search by subject, with 17% of the total. The subject search was most commonly combined with other searches, such as a keyword and further subject search. Interestingly, subject is the only search where a combined subject + other search(es) is higher than a single subject search. If we look at the results over a year, the combined subject search is by far the highest number for the whole year, in fact it is over 50% of the total searches. This strongly suggests that bots are commonly responsible for combined subject searches.

These searches are often very long and complex, as can be seen from the search logs:

[2015-09-17 07:36:38] INFO: 94.212.216.52:: [+0.000 s] search:: [+0.044 s] Searching CQL query: (dc.subject exact “books of hours” and/cql.relevant/cql.proxinfo (dc.subject exact “protestantism” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “authors, classical” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “law” and/cql.relevant/cql.proxinfo (dc.subject exact “poetry” and/cql.relevant/cql.proxinfo (dc.subject exact “bible o.t. psalms” and/cql.relevant/cql.proxinfo (dc.subject exact “sermons” and/cql.relevant/cql.proxinfo bath.personalname exact “rawlinson richard 1690-1755 antiquary and nonjuror”))))))))):: [+0.050 s] 1 Hits:: Total time: 0.217 secs
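As a minimal sketch of how entries like this could be summarised, the Python below pulls out the IP address, the hit count and the number of index clauses in the query. The pattern is inferred from the single entry above, so it is an assumption about the log format rather than a description of it. The sample line is a shortened version of that entry, typed with straight quotes, and the function name is hypothetical.

```python
import re

# Shortened version of the log entry above (same structure, fewer clauses).
LINE = (
    '[2015-09-17 07:36:38] INFO: 94.212.216.52:: [+0.000 s] search:: [+0.044 s] '
    'Searching CQL query: (dc.subject exact "law" and/cql.relevant/cql.proxinfo '
    '(dc.subject exact "poetry" and/cql.relevant/cql.proxinfo '
    'bath.personalname exact "rawlinson richard 1690-1755 antiquary and nonjuror")):: '
    '[+0.050 s] 1 Hits:: Total time: 0.217 secs'
)

# Timestamp, IP address, the CQL query text and the number of hits.
ENTRY = re.compile(
    r'^\[(?P<when>[^\]]+)\] INFO: (?P<ip>[\d.]+)::.*?'
    r'Searching CQL query: (?P<query>.*?):: \[\+[\d.]+ s\] (?P<hits>\d+) Hits'
)
# An index name, the word "exact", then a quoted term (straight or curly quote).
CLAUSE = re.compile(r'([\w.]+) exact ["\u201c]')


def summarise(line):
    """Return IP, hits and clause counts for one search log line, or None."""
    m = ENTRY.match(line)
    if m is None:
        return None
    indexes = CLAUSE.findall(m.group("query"))
    return {
        "ip": m.group("ip"),
        "hits": int(m.group("hits")),
        "clauses": len(indexes),
        "subject_clauses": indexes.count("dc.subject"),
    }


summary = summarise(LINE)
print(summary)
```

A count of subject clauses per query would be one possible signal for separating long machine-generated searches from typical human ones, although where to draw that line is a judgment call.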

It is most likely that the bots are not nefarious; they may be search engine bots, or they may be indexing for the purposes of information services of some kind, such as bibliographic services, but they do make attempts to assess the value of the various searches on the Hub very difficult.

Of the remaining search categories available from the main search page, it is no surprise that ‘title’ is used a fair bit, at 6.5%, and then after that creator, name, and organisation and personal name. These are all fairly even. For October 2015 they are around 3% of the total each, and it seems to be similar for other months.

The repository filter is popular. Researchers can select a single repository to find all of their descriptions (157), select a single repository and also search terms (916), and also search for all the descriptions from a single repository from our map of contributors (125). This is a total of 1,198, which is 6.2% of the total. If we also add the faceted filter by repository, after a search has been carried out, the total is 2,019, and the percentage is 10.4%. Looking at the whole year, the various options to select repository become an even bigger percentage of the total, in particular the faceted filter by repository. This suggests that improvements to the ability to select repositories, for example, by allowing researchers to select more than one repository, or maybe type of repository, would be useful.

Google Map on the Hub showing the link to search by contributor

We have a search within multi-level descriptions, introduced a few years ago, and that clearly does get a reasonable amount of use, with 1,404 uses in this particular month, or 7.2% of the total. This is particularly striking as this is only available within multi-level descriptions. It is no surprise that this is valuable for lengthy descriptions that may span many pages.

The searches that get minimal use are identifier, genre, family name and epithet. This is hardly surprising, and illustrates nicely some of the issues around how to measure the value of something like this.

Identifier enables users to search by the archival reference. This may not seem all that useful, but it tends to be popular with archivists, who use the Hub as an administrative tool. However, the current Archives Hub reference search is poor, and the results are often confusing. It seems likely that our contributors would use this search more if the results were more appropriate. We believe it can fulfill this administrative function well if we adjust the search to give better quality results; it is never likely to be a highly popular search option for researchers as it requires knowledge of the reference numbers of particular descriptions.

Epithet is tucked away in the browse list, so a ‘search’ will only happen if someone browses by epithet and then clicks on a search result. Would it be more highly used if we had a ‘search by occupation or activity’? There seems little doubt of this. It is certainly worth considering making this a more prominent search option, or at least getting more user feedback about whether they would use a search like this. However, its efficacy may be compromised by the extremely permissive nature of epithet for archival descriptions – the information is not at all rigorous or consistent.

Family name is not provided as a main search option, and is only available by browsing for a family name and clicking on a result, as with epithet. The main ‘name’ search option enables users to search by family name. We did find the family name search was much higher for the whole year, maybe an indication of use by family historians and of the importance of family estate records.

Genre is in the main list of search options, but we have very few descriptions that provide the form or medium of the archive. However, users are not likely to know this, and so the low use may also be down to our use of ‘Media type’, which may not be clear, and a lack of clarity about what sort of media types people can search for. There is also, of course, the option that people don’t want to search on this facet. However, looking at the annual search figures, we have 1,204 searches by media type, which is much more significant, and maybe could be built up if we had something like radio buttons for ‘photographs’, ‘manuscripts’, ‘audio’ that were more inviting to users. But, with a lack of categorisation by genre within the descriptions that we have, a search on genre will mean that users filter out a substantial amount of relevant material. A collection of photographs may not be catalogued by genre at all, and so the user would only get ‘photographs’ through a keyword search.

Place name is an interesting area. We have always believed that users would find an effective ‘search by place’ useful. Our place search is in the main search options, but most archivists do not index their descriptions by place and because of this it does not seem appropriate to promote a place name search. We would be very keen to find ways to analyse our descriptions and consider whether place names could be added as index terms, but unless this happens, place name is rather like media type – if we promote it as a means to find descriptions on the Archives Hub, then a hit list would exclude all of those descriptions that do not include place names.

This is one of the most difficult areas for a service like the Archives Hub. We want to provide search options that meet our users’ needs, but we are aware of the varied nature of the data. If a researcher is interested in ‘Bath’ then they can search for it as a keyword, but they will get all references to bath, which is not at all the same as archives that are significantly about Bath in Somerset. But if they search for place name: bath, then they exclude any descriptions that are significantly about Bath, but not indexed by place. In addition, words like this, that have different meanings, can confuse the user in terms of the relevance of the results because ‘bath’ is less likely to appear in the title. It may simply be that somewhere in the description, there is a reference to a Dr Bath, for example.

This is one reason why we feel that encouraging the use of faceted search will be better for our users. A simpler initial search is likely to give plenty of results, and then the user can go from there to filter by various criteria.

It is worth mentioning ‘date’ search. We did have this at one point, but it did not give good results. This is partly due to many units of description not including normalised dates. But the feedback that we have received suggests that a date search would be popular, which is not surprising for an archives service. We are planning to provide a filter by date, as well as the ordering by date that we currently have.

Finally, I was particularly interested to see how popular our ‘search collection level only’ is. This enables users to only see ‘top level’ results, rather than all of the series and items as well. As it is a constant challenge to present hierarchical descriptions effectively, this would seem to be one means to simplify things. However, for October 2015 we had 17 uses of this function, and for the whole year only 148. This is almost negligible. It is curious that so few users chose to use this. Is it an indication that they don’t find it useful, or that they didn’t know what it means? We plan to have this as a faceted option in the future, and it will be interesting to see if that makes it more popular or not.

We are considering whether we should run this exercise using some sort of filtering to check for search engines, dubious IP addresses, spammers, etc., and therefore get a more accurate result in terms of human users. We would be very interested to hear from anyone who has undertaken this kind of exercise.

Archives, Logs and Google Analytics

January 21, 2016 / Jane Stevenson

It is vital to have a sense of the value of your service, and if you run a website, particularly a discovery website, you want to be sure that people are using it effectively. This is crucial for an online service like the Archives Hub, but it is important for all of us, as we invest time and effort in putting things online and we are aware of the potential the Web gives us for opening up our collections.

But measuring use of a website is no simple thing. You may hear people blithely talking about the number of ‘hits’ their website gets, but what does this really mean?

I wanted to share a few things that we’ve been doing to try to make better sense of our stats, and to understand more about the pitfalls of website use figures. There is still plenty we can do, and more familiarity with the tools at our disposal may yield other options to help us, but we do now have a better understanding of the dangers of taking stats at face value.

Logs

We are all likely to have usage logs of some kind if we have a website, even if it is just the basic Apache web logs. These are part of what the Apache web server offers. The format of these can be configured to suit, although I suspect many of us don’t look into this in too much detail. You may also have other logs – your system may generate these. Our current system provides a few different log files, where we can find out a bit more about use.

Apache access logs typically contain: the IP address of the requesting machine, the date of the access, the http method (usually ‘get’ or ‘post’), the requested resource (the URL of the page, image, pdf etc.), the size of what is returned, the referring site, if available, and the user agent. The last of these will sometimes provide information on the browser used to make the request, although this will often not be the case.

Example Apache access log entry:

54.72.xx.xx - - [14/Sep/2015:11:51:19 +0000] "GET /data/gb015-williamwagstaffe HTTP/1.1" 200 22244 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

So, with this you can find out some information about the source of the request from the IP addresses and what is being requested (URL of resource).
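The fields listed above can be pulled out of an entry with a regular expression. This is a minimal Python sketch that assumes the common Apache ‘combined’ log format, which the example entry matches; a server configured with a different format would need a different pattern. The function name is hypothetical.

```python
import re

# Pattern for the Apache "combined" log format, matching the example entry.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

ENTRY = (
    '54.72.xx.xx - - [14/Sep/2015:11:51:19 +0000] '
    '"GET /data/gb015-williamwagstaffe HTTP/1.1" 200 22244 "-" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"'
)


def parse_entry(line):
    """Return the named fields of one access log line, or None if it does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None


fields = parse_entry(ENTRY)
print(fields["ip"], fields["method"], fields["resource"], fields["status"])
```

All values come back as strings, so the status and size would need converting to integers before any counting or summing.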

Other logs such as our current system’s search logs may provide further information, often including more about the nature of the query and maybe the number of hits and the response time.

Google Analytics

Increasingly, we are turning to Google Analytics (GA) as a convenient method of collecting stats, and providing nice charts to show use of the service. Google Analytics requires you to add some specific code to the pages that you want tracked. GA provides for lots of customisation, but out of the box it does a pretty good job of providing information on pages accessed, number of accesses, routes, bounce rate, user agents (browsers), and so on.

Processing your stats

If you do choose to use your own logs and process your stats, then you have some decisions to make about how you are going to do this. One of the first things that I learnt when doing this is that ‘hits’ is a very misleading term. If you hear someone promoting their site on the basis of the number of hits, then beware. Hits actually refers to the number of files downloaded on your site. One page may include several photos, buttons and other graphics, and these all count as hits. So one page accessed may represent many hits. Therefore hits is largely meaningless as a measure of use. Page views is a more helpful term, as it means one page accessed counts as ‘one’.
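The difference between hits and page views can be shown with a small Python sketch. The list of requests and the set of file extensions treated as page furniture are both hypothetical; a real site would need its own list.

```python
import os

# Hypothetical list of file extensions treated as page furniture, not pages.
ASSET_EXTENSIONS = {".png", ".jpg", ".gif", ".css", ".js", ".ico"}

# Hypothetical requests generated by one visit to a single description.
requests = [
    "/data/gb015-williamwagstaffe",
    "/images/logo.png",
    "/css/site.css",
    "/js/search.js",
    "/images/button.gif",
]


def is_page(resource):
    """True if the requested resource looks like a page rather than an asset."""
    path = resource.split("?", 1)[0]  # ignore any query string
    return os.path.splitext(path)[1].lower() not in ASSET_EXTENSIONS


hits = len(requests)
page_views = sum(1 for r in requests if is_page(r))
print(hits, page_views)
```

Here one visit to one page produces five hits but only one page view.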

So, if you are going to count page views, do you then simply use the numbers the logs give you?

Robots

One of the most difficult problems with using logs is that they count bots and crawlers. These may access your site hundreds or thousands of times in a month. They are performing a useful role, crawling and gathering information that usually has a genuine use, but they inflate your page views, sometimes enormously. So, if someone tells you they have 10,000 page views a month, does this count all of the bots that access the pages? Should it? It may be that human use of the site is more like 2,000 page views per month.

Identifying and excluding robot accesses accurately and consistently throughout every reporting period is a frustrating and resource intensive task. Some of us may be lucky enough to have the expertise and resources to exclude robots as part of an automated process (more on that with GA), but for many of us, it is a process that requires regular review. If you see an IP address that has accessed thousands of pages, then you may be suspicious. Investigation may prove that it is a robot or crawler, or just that it is under suspicion. We recently investigated one particular IP address that gave high numbers of accesses. We used the ‘Honey Pot’ service to check it out. The service reported:

“This IP address has been seen by at least one Honey Pot. However, none of its visits have resulted in any bad events yet. It’s possible that this IP is just a harmless web spider or Internet user.”

The language used here shows that even a major initiative to identify dodgy IP addresses can find it hard to assess each one as they come and go with alarming speed. This project asks for community feedback in order to continually update the knowledge base.

We also checked out another individual IP address that showed thousands of accesses:

“The Project Honey Pot system has detected behavior from the IP address consistent with that of a rule breaker. Below we’ve reported some other data associated with this IP. This interrelated data helps map spammers’ networks and aids in law enforcement efforts.”

We found that this IP address is associated with a crawler called ‘megaindex.com/crawler’. We could choose to exclude this crawler in future. The trouble is that this is one of many. Very many. If you get one IP address that shows a huge number of accesses, then you might think it’s a bot, and worth investigating. But we’ve found bots that access our site 20 or 30 times a month. How do you identify these? The trouble is that bots change constantly with new ones appearing every day, and these may not be listed by services such as Honey Pot. We had one example of a bot that accessed the Hub 49,459 times in one month, and zero times the next month. We looked at our stats for one month and found three bots that we had not yet identified – MegaIndex, XoviBot and DotBot. The figures for these bots added up to about 120,000 page views just for one month.
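As a rough illustration of the two checks described above (excluding bots you already know about, and flagging IP addresses with unusually high numbers of accesses for investigation), here is a small Python sketch. The bot names come from our own findings; the threshold, the sample user agent strings and the IP addresses are invented for the example. As noted, a list like this needs regular review, and a simple threshold will not catch a bot that only visits 20 or 30 times a month.

```python
from collections import Counter

# Bots named in this post; any real list would need regular updating.
KNOWN_BOT_MARKERS = ("megaindex", "xovibot", "dotbot")


def is_known_bot(user_agent):
    """True if the user agent string mentions a bot we already know about."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in KNOWN_BOT_MARKERS)


def ips_to_investigate(requests, threshold):
    """Return IP addresses with at least `threshold` accesses that are not
    from a known bot.

    `requests` is an iterable of (ip_address, user_agent) pairs.
    """
    counts = Counter(ip for ip, agent in requests if not is_known_bot(agent))
    return sorted(ip for ip, total in counts.items() if total >= threshold)


# Invented sample data: one known bot, one heavy unknown visitor, one light user.
sample_requests = (
    [("198.51.100.7", "ExampleAgent (compatible; DotBot)")] * 50
    + [("203.0.113.5", "UnknownAgent/1.0")] * 40
    + [("192.0.2.10", "Mozilla/5.0")] * 3
)

suspects = ips_to_investigate(sample_requests, threshold=25)
```

Here the known bot is excluded outright, and only the heavy unknown visitor is returned as worth a closer look.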

404: Page Not Found

The standard web server HTTP response if a page does not exist is the infamous ‘404’. Most websites will typically generate a “404 Not Found” web page. Should these requests be taken out of your processed stats? It can be argued that these are genuine requests in terms of service use, as they do show activity and user intent, even if they do not result in a content page.

500: Server Error

The standard HTTP response if there’s been a system problem of some kind is the ‘500’ Server Error. As with the ‘404’ page, this may be genuine human activity, even if it does not lead to the user finding a content page. Should these requests be removed before you present your stats?
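Whether to count these responses is a policy decision rather than a technical one, so it may help to make the choice explicit and report it alongside the figures. The following Python sketch counts page views under different policies; the function name, its options and the sample status codes are all hypothetical.

```python
def count_page_views(status_codes, include_not_found=False, include_server_errors=False):
    """Count page views from a list of HTTP status codes, one per page request.

    By default 404 responses and 5xx server errors are left out; each can be
    switched back in so the effect of the decision is visible.
    """
    total = 0
    for status in status_codes:
        if status == 404 and not include_not_found:
            continue
        if 500 <= status <= 599 and not include_server_errors:
            continue
        total += 1
    return total


# Invented sample: three successful pages, two not found, one server error.
sample_statuses = [200, 200, 404, 500, 200, 404]

content_only = count_page_views(sample_statuses)
with_not_found = count_page_views(sample_statuses, include_not_found=True)
all_requests = count_page_views(
    sample_statuses, include_not_found=True, include_server_errors=True
)
```

The same six requests give three, five or six page views depending on the policy chosen, which is why it matters to say which one you used.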

Other formats

You may also have text pages (.txt), XML pages (.xml) and PDFs (.pdf). Should these be included or not? If they show high use, is that a sign of robots? It may be that people genuinely want to access them.

Google Analytics and Bots

As far as we can tell, GA appears to do a good job of not including bots by default, presumably because many bots do not run the GA tracking code that creates the GA page request log. We haven’t proved this, but our investigations do seem to bear this out. Therefore, you are likely to find that your logs show higher page accesses than your GA stats. And as a bot can really pummel your site, the differences can be huge. Interestingly, GA also now provides the option to enable bot filtering, but we haven’t found much evidence of GA logging our bot accesses.

But can GA be relied upon? We had a look in detail at some of the log accesses and compared them with GA. We found one IP address that showed high use but appeared to be genuine, and the user agents looked like they represented real human use. The pattern of searching and pages accessed also looked convincing. From this IP address we found one example of an Archives Hub description page with two accesses in the log: gb015-williamwagstaffe. The accesses appeared to come from a standard browser (Chrome). We looked at several other pages accessed from this IP address. There was no evidence to suggest these accesses are bots or not genuine, but they are not in the GA accesses.

Why might GA exclude some accesses? There could be several reasons:

GA uses javascript tracking code, and it may not know about the activity because the javascript doesn’t run when the page is requested even though it appears to be a legitimate browser according to the user agent log
The requester may be using ad-blocking, which can also block calls to GA
It may be a tracking call back failure to GA due to network issues
It may be that GA purposely excludes an IP address because it is believed to be a bot
It may not be a genuine browser, i.e. a bot, script or some other requesting agent that doesn’t run the GA tracking code

Dynamic single page applications

Modern systems increasingly use HTML5 and Ajax to load content dynamically. Whereas traditional systems load the analytics tracker on each page load, these ‘single page applications’ require a different approach in order to track activity. This requires using the new ‘Google Universal Analytics’ and doing a bit of technical work. It is not necessarily something we all have the resource and expertise to do. But it may mean that your page views appear to go down.

Conclusions

Web statistics are not straightforward. Google Analytics may be extremely useful, and is likely to be reasonably accurate, but it is worth understanding the pitfalls of relying on it completely. Our GA stats fell off a rather steep cliff a few years ago, and eventually we realised that the .xml and .txt pages had started being excluded. This was not something we had control over, and that is one of the downsides of using third party software – you don’t know exactly how they do what they do and you don’t have complete control.

A recent study of How Many Users Block Google Analytics by Jason Packer of Quantable suggests that GA may often be blocked at the same time as ads, using one of the increasing number of ad blocking tools, and the effect could be significant. He ran a fairly small survey of about 2,400 users of a fairly niche site, but found that 8.4% blocked GA, which is a substantial percentage.

Remember that statistics for ‘hits’ or ‘page views’ don’t mean so much by themselves – you need to understand exactly what is being measured. Are bots included? Are 404s included?

Stats are increasingly being used to show value, but we do this at our peril. Whilst they are important, they are open to interpretation and there are many variables that mean comparing different sites through their access stats is going to be problematic.

Cataloguing Matters: being part of the information landscape

November 9, 2015 / Jane Stevenson

Consider the following questions, which use the topic of design history, but could be for any topic area:

Where did this person work?
Who did they know?
What can I find out about furniture design in London in the early 20th century?
Who designed this early 20th century chair?
Did the designer feature in this exhibition?
Can I find photographs of this section of the exhibition?
Did these designers both feature in this design exhibition?
Who influenced this designer?

These are surely typical questions for researchers. They are the sort of questions we thought about when we were working on Exploring British Design. But these questions do not start with the archive. An archive collection may well hold many answers for these questions, but there remains the problem of connecting these questions to the archive: ‘I’m interested in this designer/in this chair that they designed/in this exhibition they designed for’. We need these questions to lead to archival sources when appropriate.

Which comes first, the research question or the archive? We tend to assume the archive, but for many researchers the question comes first and the archive is, at that point, not known to them. We become a little fish in a very big pond when we enter the Web, and so we need to find ways for researchers who may not be aware of us to hook our collections.

Wellcome Library, London: Fishing

But how do we achieve this? We often have to work within considerable constraints and we cannot simply transform our cataloguing practices. There are many technical solutions that we can potentially make use of, but I think the first challenge lies with the data itself.

Individual archive repositories may think that there is no way they can move into a Linked Data world of RDF and triples and persistent URIs. But I think there are steps that we could take to help our descriptive data better fit this kind of landscape and have the potential to be more ‘linked in’ to other sources and to researchers’ paths of discovery.

Our descriptive practices tend to treat collections as stand-alone entities, rather than integrated parts of a whole information landscape. We do not think enough about cataloguing in a way that facilitates an integrated approach where we can connect to other information resources and allow researchers to come to archives through searching for people, organisations, subjects, events and places, and to come to them as part of a whole network of resources. I think we need to try to think about providing potential ‘connectors’ within our descriptions that will allow them to be hooked into the landscape more effectively.

Here are some thoughts about how we can help to achieve this:

Be consistent when cataloguing

This may sound straightforward, but having looked at thousands of descriptions from well over two hundred archive repositories, I can announce that it doesn’t always happen. For instance, simply entering the name of the repository in the same way for every catalogue entry, and not adding the repository code as part of the repository name, for example, ensures that the repository is always correctly identified. This means that all of the collection descriptions can be clearly identified as being from the same institution.

When entering names, think about how you structure them. Many archive systems provide a means to store and link to names, and yet it is amazing how often one name varies. If nothing else, think about adding life dates to a name, which helps with unique identification.

Here on the Archives Hub we have to hold our hands up for a potentially unhelpful practice up till now of encouraging the creator name to be entered in one way under ‘name of creator’ and another way as an index term. This reflects a time when we were less focussed on machine processing. Of course, it is much better to enter the name consistently, so that the connection can clearly be made.

Try to use rules and standards

Whilst I have increasingly become somewhat frustrated by our standards, I still think there is a role for standards to encourage consistency, clarify meaning and help draw things together. If you enter subjects using UKAT or LCSH, then try to ensure that all your terms really do come from these thesauri. We get examples where the thesaurus is named, but the subject is not actually from the thesaurus.

When entering things like language codes (ISO standard 639), take a few minutes just to find out about them. They need to be lower case, to be consistent. It’s worth having an understanding of what you are doing and why.

Do the same for dates. Think about what a normalised date is and why it is important. It is really worth having a sense of why these things matter. ISO8601: “The purpose of this standard is to provide an unambiguous and well-defined method of representing dates and times”. That sounds perfect for archives, and well worth adopting.
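As a small illustration of what checking these values might look like, here is a Python sketch. It lower-cases a language code and checks that it has the shape of an ISO 639 code (two or three letters), and it checks that a date is a full calendar date written as YYYY-MM-DD, which is one of the forms ISO 8601 allows. The function names are my own, and the checks are deliberately limited: the sketch does not look codes up in the ISO 639 lists, and it does not handle date ranges or year-only dates.

```python
import re
from datetime import date

FULL_DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def normalise_language_code(code):
    """Lower-case a language code and check it is shaped like an ISO 639 code.

    Only the shape is checked (two or three ASCII letters); the code is not
    looked up in the ISO 639 lists.
    """
    cleaned = code.strip().lower()
    if len(cleaned) not in (2, 3) or not (cleaned.isascii() and cleaned.isalpha()):
        raise ValueError(f"Not shaped like an ISO 639 code: {code!r}")
    return cleaned


def is_full_iso_date(text):
    """True if text is a real calendar date written as YYYY-MM-DD."""
    if not FULL_DATE_PATTERN.match(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False  # right shape, but not a real date (e.g. month 13)
    return True
```

For example, ‘ENG’ would be normalised to ‘eng’, and ‘1864-12-06’ passes the date check where ‘6 December 1864’ does not.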

Think about the questions people ask

One of the advantages of indexing is that it gives you a chance to consider the terms you can best use to ‘advertise’ your collection. It is really good if the scope and content text is clearly reflected in the index terms. If your archive is about a designer and you have described things they have designed, then you might realise that you haven’t used ‘furniture designer’ in your text, but this is a good index term to use. Or maybe the archive is really useful for those looking at the history of design education, but you haven’t yet actually used the term ‘design education’ in your text. You can add it as an index term.

Try to think beyond the UK

This may again sound obvious, but unfortunately NCA Rules don’t exactly encourage this, even if they don’t prevent it. When indexing by place name, add the county and certainly add the country. Indexing really helps us think about this. Consider a typical biographical history entry:

Charles Edward Sayle was born in Cambridge on 6 December 1864. He entered New College, Oxford, in 1883, and St John’s College, Cambridge, in 1890.

It wouldn’t look quite right putting:

Charles Edward Sayle was born in Cambridge, England, on 6 December 1864. He entered New College, Oxford, England, in 1883, and St John’s College, Cambridge, England, in 1890 [etc]

And whilst it would help with identification, it is still unstructured. Much better to ensure you have:

Place name: Cambridge, England

in your structured index terms. True, there are plenty of ways to mark this up, some better than others, but at least having an index term entered in this way is a great help with uniquely identifying the Cambridge that you mean, as opposed to the one in the US or New Zealand, Australia or Jamaica. It means that a researcher who is researching a topic around ‘Cambridge, England’ is more likely to find your description.

If you put your place names into some kind of specific field, try to avoid something like ‘Cardiff, Merthyr Tydfil, Cambridge’ (an example from the Hub) or ‘Cambridge etc’ (another example). So if your cataloguing software provides just one box for place, repeat that box for each place, rather than putting a load of them into one entry. If it allows for something like place name and country, use that to your advantage.

Put index terms into the most appropriate categories

On the Hub we’ve had plenty of family names as personal names and personal names as corporate names. But most commonly we get genre as subject. If your archive contains photographs, press cuttings, preliminary sketches or parchment (animal material) and you want to make this known, you need to index by material type, not by subject ….unless the archive is about photographs or about parchment.

Think Reference!

Your references should give the ability to uniquely identify every part of a hierarchical archival description. Think carefully about adding them and make sure you have a unique reference at every level of description. Some archival software will automatically generate references, which does help because then they will be consistent and unique. But if you add them manually, it is very easy to make a mistake. On the Hub we found several hundred duplicated ‘unique references’ when we went through a major exercise to clean up references. And I do mean several hundred, not one or two.
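If you can export your references as a simple list, checking for duplicates is easy to automate. This Python sketch shows one way to do it; it is not the process we used on the Hub, and the reference values are invented.

```python
from collections import Counter


def find_duplicate_references(references):
    """Return a dict of each reference that appears more than once, with its count.

    Surrounding whitespace is ignored, so 'ABC/1 ' and 'ABC/1' count as the same.
    """
    counts = Counter(reference.strip() for reference in references)
    return {reference: total for reference, total in counts.items() if total > 1}


# Invented references for one hierarchical description.
sample_references = ["ABC", "ABC/1", "ABC/1/1", "ABC/1/2", "ABC/1/2 ", "ABC/2", "ABC/1"]

duplicates = find_duplicate_references(sample_references)
```

In this sample, ‘ABC/1’ and ‘ABC/1/2’ are each reported as appearing twice.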

Optimise titles for search engines

All of this structured data really needs the boost of descriptions that work well on the Web, simply in terms of being discoverable. SEO, or search engine optimisation, is a big topic, and your system may not allow you to make many changes, but in terms of the actual data, do think about appropriate titles, which are a really good hook. Include the most significant words, make sure they are not too long (so they are easy to scan for a researcher looking through various sources) and make titles at collection level self-explanatory. At lower levels this is not such a problem, as it is possible to append the collection title, to help with interpretation.

Look at how your data exports

This may not be possible or practical for everyone, but if you can do it I think it is a really good indicator of how interoperable your data might be to see how it looks when it is exported, because that is when you are removing it from the comfort of your own familiar system, and potentially unleashing it into the world.

The Event Horizon

I am particularly interested at the moment in events. History is made up of events. Who’d have thought it, when our standards barely mention them!

EAD (Encoded Archival Description), our XML standard for descriptions, doesn’t allow for events to be indexed at all. CIDOC CRM is “a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation” and it is event-based. It may be a complicated beast, not much used within our domain, and many of you won’t be aware of it, but there is certainly something to be said for putting events at the forefront of how we think.

The new EAC-CPF standard allows for chronologies in the biographical section of the description, but this is more about narrative (a list of events as part of one name authority description). It is based partly on ISAAR(CPF) which states “Record in narrative form or as a chronology the main life events, activities, achievements and/or roles of the entity being described.” Narrative form is fine for a researcher who is perusing the description, but structured form that is not necessarily so closely tied to narrative is what is required for more integration of data.

People, organisations, places and subjects are all linked by events – that is where the connections and the stories are. I think it is a major shortcoming of our standards that events are not given more prominence.

I mention this here really just to say that whilst we can improve our descriptions, and think about consistency and indexing, I do think that there are things around cataloguing that may require a more fundamental change.

Research Questions

Coming back to the initial questions I asked. Our cataloguing can help to ensure that archives are networked in to the broader information landscape. This is not about whether the archive can in principle answer these questions. It is about whether a researcher asking these questions (maybe initially through Google, maybe through other generic channels) can discover the archive as one source amongst many:

Where did this person work?
Uniquely identify the place(s).

Who did they know?
Make sure names are consistent and try to make them unambiguous.

What can I find out about furniture design in London in the early 20th century?
Use appropriate index terms and think about how to add dates consistently.

Who designed this early 20th century chair?
Adding index terms helps to draw out significant names and concepts

Did the designer feature in this exhibition?
Entering the exhibition name and designer name as structured data will help.

Can I find photographs of this section of the exhibition?
Make sure you have the exhibition name clearly stated and think about using index terms for formats – these are a great means for researchers to find types of material. But remember, this is ‘photographs’ as a genre or form, not as a subject!

Did these designers both feature in this design exhibition?
Not so easy, as we don’t index by event type

Who influenced this designer?
Not so easy, as we don’t tend to provide structured relationship information to connect people other than in the broadest possible way (these two people were ‘associated’). But that is another story…

To my mind, cataloguing is a skill, and it is really worth thinking about what you are cataloguing and how you catalogue it carefully. It is more important to think about this now than it was 30 years ago, because 30 years ago we were working with narrative descriptions and index cards. Now we want our data to be interconnected.

From Ivory Tower to People Power

October 5, 2015 / Jane Stevenson

Here is a presentation I gave at ELAG 2015 to introduce our innovation project, Exploring British Design. The presentation is entitled ‘From Ivory Tower to People Power’ (YouTube link) and emphasises the collaborative nature of the project and the focus on people as a topic, rather than on archival description, which is not always the best starting place for researchers. The presentation covers:

Aims of the project
Workshops with postgraduate students about how they research and analysis of their research paths
Workshops with postgraduates about websites: what students do and don’t like in terms of discovery
Traditional archival cataloguing ‘lock in’ of entities such as people, places and events.
Connectivity beyond single A to B connections; ‘anything can be a focus’ and can link to a myriad of other things
Use of EAC-CPF (XML standard for archival authority files)
Creating the data, handcrafting data, limitations of our approach, too many ideas not enough time!
Demonstration of the Website

Connecting through defining people and relationships

June 11, 2015 / Jane Stevenson / 1 Comment

If, as a researcher, you search for ‘Jane Drew’, the celebrated architect and town planner, on the Archives Hub, amongst other things, you might discover a single item, “Letter from Jane B Drew to John and Myfanwy Piper”, a letter in the “Papers of John and Myfanwy Piper”.

You can see that it’s a letter in a collection at the Tate Gallery Archive. The description of the collection is an example of a good quality traditional archival catalogue, giving a fairly detailed listing of the content of this particular collection. But as a researcher you are really interested in just this one letter. You may ask yourself a number of questions, possibly starting with (1) Is this the Jane Drew I’m interested in? and then (2) What is the relationship between Jane Drew and John and Myfanwy Piper? You may well be able to find answers by accessing the letter itself, but at this stage you may just want to place this connection in the broader context of Jane Drew’s life and work. As a researcher, understanding how these people are connected may shed light on your research interests.

In this blog I want to think about this question of relationships. The fact is that archivists rarely provide structured information about relationships; if there is information, it is usually in the biographical history, which might outline key events and people in someone’s life, referring to their parents, work colleagues, friends, etc. The nature of the relationship is sometimes explicitly given, but often it is not. Our standards don’t really say much about relationships between the entities (people, organisations, places, etc) that we describe in our catalogues.

Going back to the Papers of John and Myfanwy Piper as an example, the biographical history includes the following:

[John] Piper began writing reviews from the late 1920s making a name for himself as a critic writing for periodicals like ‘The Listener’ and the ‘Architectural Review’. From 1935-1937 he assisted Myfanwy Evans, with the production of a quarterly review of contemporary European abstract painting called ‘Axis’. In 1937 Piper was commissioned by his friend John Betjeman to write the ‘Shell Guide to Oxfordshire’. Piper went on to write and provide photographs for a number of the guides as well as edit the series. In the same year John Piper married the writer Myfanwy Evans.

This is typical of a biographical history – useful historical information about the individual or organisation. Within this there is information we can potentially use to create explicit relationship information:

John Piper ‘worked with’ Myfanwy Evans
John Piper ‘was friends with’ John Betjeman
John Piper ‘worked for’ John Betjeman
John Piper ‘was married to’ Myfanwy Evans

There are a number of issues to consider here:

How can we unambiguously identify the people?
How do we choose the vocabulary we use to define the relationships?
Do we try to include dates?
Is it reasonable for us to interpret relationships as ‘friendships’ or ‘collaborations’ if this is not actually explicit?
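To show what ‘explicit relationship information’ might look like as data, here is a minimal Python sketch that holds the four statements above as structured records and lets you ask for everything stated about one person. The class, the field names and the free-text date note are my own assumptions for illustration, not a feature of EAC-CPF or of any cataloguing system. The dates are those given in the biographical history, and plain name strings are used here, which leaves the question of unambiguous identification open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relationship:
    """One explicit statement: subject, relation, target, and an optional date note."""
    subject: str
    relation: str
    target: str
    date_note: Optional[str] = None


# The statements drawn from the biographical history of John Piper.
piper_relationships = [
    Relationship("John Piper", "worked with", "Myfanwy Evans", "1935-1937"),
    Relationship("John Piper", "was friends with", "John Betjeman"),
    Relationship("John Piper", "worked for", "John Betjeman", "1937"),
    Relationship("John Piper", "was married to", "Myfanwy Evans", "1937"),
]


def statements_about(relationships, name):
    """Return every statement, as a sentence, in which `name` appears."""
    return [
        f"{rel.subject} {rel.relation} {rel.target}"
        for rel in relationships
        if name in (rel.subject, rel.target)
    ]


betjeman_statements = statements_about(piper_relationships, "John Betjeman")
```

Starting from John Betjeman rather than from the Piper collection, this returns the two statements that connect him to John Piper, which is the kind of route into an archive that a narrative biographical history cannot easily support.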

We are looking at some of these issues through our AHRC project, Exploring British Design. They are all issues that archivists need to explore in a debate around relationship information, but the first issue to consider is simply whether we should be thinking more about including this kind of relationship information in our archival finding aids. Is it something that would be of real value to end users? This issue is coming more to the fore as we start to think about implementing ISAAR (CPF) and working with EAC-CPF, and also as Linked Open Data gains traction.

In a (well worth reading) recent article in the Journal of Contemporary Archival Studies, on the potential impact of EAC-CPF, K.M Wisser reports the findings of a survey about relationship information. The survey received 208 responses from archivists/archives in the US. Wisser wrote “The survey results indicate that the archival community has only just begun to consider relationships in the context of archival description and the role that explicit description of those relationships may play.”

As one respondent wrote:

“relationships are among the most important facets in a collection and deserve a high priority in description. One cannot understand the historical value of an event, person, or organization without knowing [the] relationship among and between them.”

One thing that really strikes me in Wisser’s findings is that archivists see relationships that are documented outside of the collection as almost as significant as those that are documented within the collection. Going back to our original topic of Jane Drew: who else did Jane Drew work with? Should we provide that information to our users, whether or not it is documented within the collection? Is our role to give as full an account as we can of Drew’s life and career? Is it to limit ourselves to what is within the collection?

Wisser’s survey asked respondents about the importance of relationship types. It is curious to me that archivists rated ‘collaborated with’ as a more important relationship than ‘studied with’; they rated a friendship as far more important when it was documented in the collection; and they rated ‘influenced by’ as generally not so important. I’m surprised that the respondents had such definite ideas about the relative importance of different types of relationships, especially when the majority appeared to agree with the importance of ‘objective cataloguing’.

In our Exploring British Design project, the work we did with researchers definitely confirmed to me the fairly self-evident observation that any relationship can be of major significance in research, even if it appears of minor significance within the archive, or indeed, within the literature in general. A brief collaboration may have been a crucial influence, a short friendship may have had hitherto unrealised impact, and anyway, the importance of the relationship depends upon the research you are doing. Researchers are not really aware of how challenging it is for us as information professionals to establish these kinds of relationships in ways that they can then access. But it is clear that this is the sort of connectivity they are after.

One of the challenges with documenting relationship types is that they can be hard to define. As Wisser notes:

“The concept of influence, however, proved the most problematic. Comments such as ‘influence is a squishy sort of relationship’ and ‘I think it would often be very difficult to prove that Entity A was influenced by Entity B’ indicate a notion of intangibility.”

The conclusion could be that we should leave well alone relationships that are hard to define. On the other hand, if we are in a position, as we research a collection, to highlight potential connections, that action could be of major value to a researcher, who may otherwise never know about a link that ends up being crucial to their particular research. The relationships that are easy to define are likely to have been defined already.

One thing that strikes me about the whole notion of introducing interpretation and opinion into cataloguing (a possible argument against defining relationships) is that the horse has pretty much bolted. I’ve looked at enough ‘objective’ descriptions to be aware that the names archivists choose to add as index terms are a choice; they inevitably have to be an opinion about the names significant enough to add as index terms. And subjects are a similar case – some collections are indexed thoroughly, some not at all.

Aside from indexing, each person would create a different scope and content entry, including and excluding different information, and whether you call that subjective or not, it is certainly always selective. You could also argue that the level of detailed hierarchical cataloguing might indicate the relative importance of the collection. On the Archives Hub there are some collections catalogued in huge detail, and it is inevitable that researchers will assume these collections are particularly important.

All of these choices have implications for discoverability.

In Wisser’s survey, a significant proportion of respondents felt that the importance of a relationship should be based upon the use of the collection. But this, again, raises the question: When thinking about relationships, is the cataloguer reflecting the scope of the collection, or are they trying to give as full a picture as they can of the person or organisation? Are we within the world of the collection; or is the collection within the world?

The reason that I believe that we should think beyond the bounds of the collection content is that I think it promises much richer rewards for our users and encourages archives to be a major player within a broader landscape of information resources. I base my thinking on the premise that the researcher is primarily interested in their research topic, which is not likely to be an archive collection per se, but rather an event, a person, an organisation, a subject, and the way things are connected. I think archivists are still tending to think in terms of a document that describes a collection, rather than how to link the collection into the cultural heritage landscape, and even more broadly beyond that. I wonder if archivists don’t always think beyond the catalogues they currently create because the researchers they have contact with (who visit the archive) are already fairly confident they want to use that repository, or a particular archive within that repository. In other words, the researcher is already in their space. When I worked in a specialist archive, I thought about researchers discovering our archive as a whole (having an online presence) and then I thought about them using our collections (individual collections each with their own description); I didn’t think about how our collections could be seen as part of a whole information landscape.

The loudest – and most convincing – argument I hear against this kind of approach is that it takes time, and archivists are short on time. But I wonder if that means we have to think fundamentally differently. Going back to Jane Drew, think about the value of relationships for research into her life and work…

If one archive collection description highlights just a few relationships, this could take us a long way (although relationship types are a whole different thing…). If the individuals and organisations are unambiguously identified, this can help with the process of creating links out to other data sources, so that information can be linked together; then we have the chance to benefit from finding out about relationships that have been defined elsewhere. In other words, the connections one person has throughout their life can only be fully realised through the pooling of information resources, very much a joint effort. If the data is structured it can potentially be brought together.

Traditional archival cataloguing focuses on the collection, and what is documented within the collection. It tends to think in terms of a self-contained document. Pursuing relationships breaks the bounds of any one information source. That seems like a good thing, but it raises questions around approaches to cataloguing. One obvious way to tackle this is to start to think more about archival authority records. These should enable us to move beyond a collection-centric description of the collection and towards a more entity based approach, because you describe an agent (entity) independently of any one archival collection. Another option is to think in a Linked Data way, where you are concentrating on entities and relationships.

There are so many questions raised by the whole area of entities and relationships. A few of my current conclusions are:

We should primarily be led by what benefits research. Researchers are far less likely to think in terms of individual archive collections, and far more likely to think in terms of research areas (topics). The Web gives us the opportunity to think in a broader context.

Maybe it is worth considering taking some of the time used to provide a really detailed biographical history as an unstructured narrative, or the time to provide a really detailed multi-level description, and taking more time to provide (or provide the potential for) connections between our descriptions and the larger information environment. This could allow researchers to bring together much more comprehensive information, even if what we provide about individual collections is less detailed. Just adding something like a VIAF identifier to a name would be a great big leap forwards (http://viaf.org/viaf/51792789).

There is great value in being a small fish in a big pond, because most researchers are fishing for data in the big pond. As Wisser’s article says, “relationships are…seen to free collections from the isolation of individual repositories.” If we aim to be part of the big pond, we can continue to tend our smaller ponds as well!
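To illustrate the earlier point about adding a VIAF identifier to a name, here is a minimal Python sketch. It assumes three hypothetical catalogue records that record the same person under different name forms, and it assumes, for illustration only, that the VIAF URI quoted above identifies Jane Drew. Records that carry the identifier can be grouped together without any attempt to match the name strings.

```python
from collections import defaultdict

# URI quoted in the post; assumed here (for illustration) to identify Jane Drew.
VIAF_URI = "http://viaf.org/viaf/51792789"

# Hypothetical records from three hypothetical sources.
RECORDS = [
    {"source": "archive A", "name": "Drew, Jane", "viaf": VIAF_URI},
    {"source": "library B", "name": "Jane Drew", "viaf": VIAF_URI},
    {"source": "museum C", "name": "Drew, J.", "viaf": None},
]


def group_by_identifier(records):
    """Group record sources by shared identifier; list those with none."""
    groups = defaultdict(list)
    unlinked = []
    for record in records:
        if record["viaf"]:
            groups[record["viaf"]].append(record["source"])
        else:
            unlinked.append(record["source"])
    return dict(groups), unlinked


groups, unlinked = group_by_identifier(RECORDS)
print(groups)    # sources that can be linked through the shared identifier
print(unlinked)  # sources that would still need name matching by hand
```

The record without an identifier is the one that stays in its own small pond: it may well describe the same person, but nothing in the data says so.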

To go back to the Piper Collection and Jane Drew… I used this as a random example, thinking of a researcher interested in one particular designer. But of course, the Tate Gallery Archive can’t be expected to define all the relationships within the description. It’s great that they have provided enough detail to find this one individual item – without that, we would not know about the connection with Jane Drew. I’m arguing for unambiguously identifying entities (people, organisations) because if we can potentially link this instance of ‘Jane Drew’ to other instances in other information sources, then it is very possible that we can find out more about this relationship; and if the relationship can’t be established through other sources, then maybe this archive provides unique evidence of a connection that could significantly benefit research.

Exploring British Design: Interface Design Principles

April 30, 2015 / Jane Stevenson

Britain Can Make It, exhibition poster

For our AHRC project, ‘Exploring British Design’, one of the questions we asked is:

How might a website co-designed by researchers, rather than a top-down collection-defined approach to archive content, enhance engagement with and understanding of British design?

The workshops that we have run were one of the key ways that we hoped to understand more about how postgraduates and others research their topics, what they liked and didn’t like about websites, and in a general sense how they think and understand resources, and how we can tune into that thinking.

In the blog posts that we have created so far, we set out one of our central ideas:

Providing different routes into archives, showing different contexts, and enabling researchers to create their own narratives, can potentially be achieved through a focus on the ‘real things’ within an archive description; the people, organisations and places, and also the events surrounding them.

The feedback from the workshops gave us plenty to work with, and here I wanted to draw out some of the key messages that we are using to help us design an interface.

Researchers often think visually

Several of the participants in our workshops were visual thinkers. Maybe we had a slightly biased group, in that they work within or study design, but it seems reasonable to conclude that a visual approach can be attractive and engaging. We want to find a way to represent information more visually, whilst providing a rich and detailed resource. Our belief is that the visual should not dominate or hide the textual, as often happens with cultural heritage resources, but that they should work better together.

Researchers often think in terms of creating a story or narrative

When we asked our participants to focus on an individual object, several of them thought in terms of its ‘story’. It seemed to me that most of the discussions that we had assumed a narrative type approach. It is hardly surprising, as when we talk about people, places and events we connect them together. It is a natural thing to do.

Different types of contexts provide value

When we asked workshop participants to think about how they would go about researching the object they were given, they tended to think of ways to contextualise it. They were interested in where it came from, in its physicality and its story. For example, we gave out photographs of an exhibition and they wanted to know where the photographs were taken, more about the exhibition and the designers involved in it, and what else was going on at that time. Our idea with Exploring British Design is that we can create records that allow these kinds of contexts to flourish. The participants did not concentrate on traditional archival context, as they did not tend to recognise this in the same way as archivists – it is one perspective amongst many.

We cannot provide a substitute for the value of handling the original object, and it was clear that researchers found this to be immensely valuable, but we can help to provide context that helps to scope reality.

Uncovering the obscure is a good thing

Not surprisingly, our workshop participants were keen that their research efforts should result in finding little-known information that they could utilise. They talked about the excitement of uncovering information and the benefits for their work.

Habits are part of the approach to research

The balance between being innovative and anchoring an interface in what people are familiar with seems to be important.

Trust is very important

The importance of trust was stressed at all of our workshops, and the need to know the context of information. We need to build something that researchers believe is a quality resource, with information they can rely on.

Serendipity is good…although it can lead you astray

It was clear that our participants wanted to explore, and liked the idea of coming across the unexpected. Several of them felt that the library bookshelves provide a good opportunity to browse and discover new sources (they talked about this more than the serendipity of the web). But there was also a note of caution about time wasted pursuing different avenues of information. It seems good to build in serendipity, whilst providing an interface that gives clear landmarks and signposts.

Search and Relevance

Our workshop participants were clear that choice of search terms has a big influence on what you find, and this can be a disadvantage. You may be presented with a search box, and you don’t really know what to search for to get what you want, especially if you don’t know what you want! Also, the relevance ranking can be a puzzle. Library databases often seem to give results that don’t make that much sense.

One thing that stood out to me was the contrast between the willingness to use Google, which is a simple search box, with no indication of how to search, that brings back huge amounts of results, and the criticisms of library databases, where choice of search term is crucial and where ‘too many results’ are seen as a problem. It seemed that the key here was effective relevance ranking, but our workshop participants did agree that relevance ranking can deceive: the first page of results may look good, but you don’t really know what you are missing. Google is good at providing a first page of useful looking results… and maybe that’s enough to stop most people wondering about what they might be missing!

Exploring British Design

As our project has progressed, I think it is fair to say that we have benefitted hugely from the input of the students and academics that we have talked to, not only for this project but also more generally. But it was not possible for us to manage to implement a directly co-designed website. The logistics of the project didn’t allow for this, as we wanted to gather input to inform the project, and then we had the complications of pulling together the data, designing the back end and the API. We would probably have needed at least another 6 months on the project to go back to the workshop participants and ask them about the website design as we went along.

But I think we have achieved a good deal in terms of engagement. Our Exploring British Design project has been about other ways through content, moving away from a search box and a list of search results, and thinking about immersing researchers in a ‘landscape’, where they can orientate themselves but also explore freely. So, we are thinking about engagement in terms of a more visually attractive and immersive experience, giving researchers the opportunity to follow connections in a way that gives them a sense of movement through the design landscape, hints at the unknown, and shows the relevancy of the entities that are featured in the website. We hope to show how this can potentially expand understanding because it allows for a wider context and more varied narratives.

In the next project post we hope to present our interface for this pilot project!

Europeana Tech 2015: focus on the journey

February 19, 2015 / Jane Stevenson

Last week I attended a very full and lively Europeana Tech conference. Here are some of the main initiatives and ideas I have taken away with me:

Think in terms of improvement, not perfection

Do the best you can with what you have. Incorrect data may not be as bad as we think, and maybe users’ expectations are changing, and they are increasingly willing to work with incomplete or imperfect data. Some of the speakers talked about successful crowd-sourcing – people are often happy to correct your metadata for you and a well thought-out crowd-sourcing project can give great results.

BL Georeferencer, showing an old map overlaying part of Manchester: http://www.bl.uk/maps/georeferencingmap.html

The British Library currently have an initiative to encourage tagging of their images on Flickr Commons and they also have a crowd-sourcing geo-referencer project.

The Cooper Hewitt Museum site takes a different and more informal approach to what we might usually expect from a cultural heritage site. The homepage goes for an honest approach:

“This is a kind of living document, meaning that development is ongoing — object research is being added, bugs are being fixed, and erroneous terms are being revised. In spite of the eccentricities of raw data, you can begin exploring the collection and discovering unexpected connections among objects and designers.”

The ‘here is some stuff’ and ‘show me more stuff’ type of approach was noticeable throughout the conference, with different speakers talking about their own websites. Seb Chan from the Cooper Hewitt Museum talked about the importance of putting information out there, even if you have very little, it is better than nothing (e.g. https://collection.cooperhewitt.org/objects/18446665).

The speaker from Google, Chris Welty, is best known for his work on ontologies in the Semantic Web and IBM’s Watson. He spoke about cognitive computing, and his message was ‘maybe it’s OK to be wrong’. Something may well still be useful, even if it is not perfectly precise. We are increasingly understanding that the Web is in a state of continuous improvement, and so we should focus on improvement, not perfection. What we want is for mistakes to decrease, and for new functionality not to break old functionality. Chris talked about the importance of having a metric – something that is believable – that you can use to measure improvement. He also spoke about what is ‘true’ and the need for a ‘ground truth’ in an environment where problems often don’t have a right or wrong answer. What is the truth about an image? If you show an image to a human and ask them to talk about it they could talk for a long time. What are the right things to say about it? What should a machine see? To know this, or to know it better, Chris said, Google needs data – more and more and more data. He made it clear that the data is key and it will help us on the road to continuous improvement. He used the example of searching for pictures of flowers using Google to find ‘paintings with flowers’. If you did this search 5 years ago you probably wouldn’t get just paintings with flowers. The search has improved, and it will continue to improve. A search for ‘paintings with tulips’ now is likely to show you just tulips. However, he gave the example of ‘paintings with flowers by french artists’ – a search where you start to see errors as the results are not all by French artists. A current problem Google are dealing with is mixed language queries, such as ‘paintings des fleurs’, which opens a whole can of worms. But Chris’ message was that metadata matters: it is the metadata that makes this kind of searching possible.

The Success of Failure

Related to the point about improvement, the message is that being ‘wrong’ or ‘failing’ should be seen in a much more positive light. Chris Welty told us that two thirds of his work doesn’t make it into a live environment, and he has no problem with that. Of course, it’s hard not to think that Google can afford to fail rather more than many of us! But I did have an interesting conversation with colleagues, via Twitter, around the importance of senior management and funders understanding that we can learn a great deal from what is perceived as failure, and we shouldn’t feel compelled to hide it away.

Photo from Europeana Tech — Europeana Tech panel session, with four continents represented

Think in terms of Entities

We had a small group conversation where this came up, and a colleague said to me ‘but surely that’s obvious’. But as archivists we have always been very centred on documents rather than things – on the archive collection, and the archive collection description. The trend that I was seeing reflected at Europeana Tech continued to be towards connections, narratives, pathways, utilising new tools for working with data, for improving data quality and linking data, for adding geo-coordinates and describing new entities, for making images more interoperable and contextualising information. The principle underlying this was that we should start from the real world – the real world entities – and go from there. Various data models were explored, such as the Europeana Data Model and CIDOC CRM, and speakers explained how entities can connect, and enable a richer landscape. Data models are a tricky one because they can help to focus on key entities and relationships, but they can be very complex and rather off-putting. The EDM seems to split the crowd somewhat, and there was some criticism that it is not event-based like CIDOC CRM, but the CRM is often criticised for being very complex and difficult to understand. Anyway, setting that aside, the overall message was that relationships are key, however we decide to model them.

Cataloguing will never capture everyone’s research interests

An obvious point, but I thought it was quite well conveyed in the conference. Do we catalogue with the assumption that people know what they need? What about researchers interested in how ‘sad’ is expressed throughout history, or fashions for facial hair, or a million other topics that simply don’t fit in with the sorts of keywords and subject terms we normally use? We’ll never be able to meet these needs, but putting out as much data as we can, and making it open, allows others to explore, tag and annotate and create infinite groups of resources. It can be amazing and moving, what people create: Every3Minutes.

There’s so much out there to explore….

There are so many great looking tools and initiatives worth looking at, so many places to go and experiment with open data, so many APIs enabling so much potential. I ended up with a very long list of interesting looking sites to check out. But I couldn’t help feeling that so few of us have the time or resource to actually take advantage of this busy world of technology. We heard about Europeana Labs, which has around 100 ‘hardcore’ users and 2,200 registered keys (required for API use). It is described as “a playground for remixing and using your cultural and scientific heritage. A place for inspiration, innovation and sharing.” I wondered if we would ever have the time to go and have a play. But then maybe we should shift focus away from not being able to do these things ourselves, and simply allow others to use the data, and to adopt the tools and techniques that are available – people can create all sorts of things. One example amongst many we heard about at the conference is a cultural collage: zenlan.com/collage. It comes back to what is now quite an old adage, ‘the best innovation may not be done by you’. APIs enable others to innovate, and what interests people can be a real surprise. Bill Thompson from the BBC referred to a huge interest in old listings from Radio Times, which are now available online.

The International Image Interoperability Framework

I list the IIIF because it jumped out at me as a framework that seems to be very popular – several speakers referred to it, and in very positive terms. I hadn’t heard of it before, but it seemed to be seen as a practical means to ensure that images are interoperable, and can be moved around different systems.
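To give a flavour of how that interoperability works, the IIIF Image API defines a common URL pattern: an identifier followed by region, size, rotation, and quality plus format. The Python below is a minimal sketch that builds such URLs. The server address and image identifier are hypothetical, and the default size keyword is an assumption that depends on the API version a server supports (earlier versions use ‘full’ where later ones use ‘max’).

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an image request URL following the IIIF Image API pattern:
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    encoded = quote(identifier, safe="")  # identifiers must be URL-encoded
    return f"{base_url.rstrip('/')}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base_url, identifier):
    """Build the URL of the image information document (info.json)."""
    return f"{base_url.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical server and identifier, for illustration only.
BASE = "https://images.example.org/iiif"
print(iiif_image_url(BASE, "poster/001"))
print(iiif_image_url(BASE, "poster/001", region="0,0,500,500", size="250,"))
print(iiif_info_url(BASE, "poster/001"))
```

Because every compliant server answers the same pattern, a viewer written against one institution’s images should be able to request a region or a thumbnail from another’s without any special handling.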

Think Little

One of my favourite thoughts from the conference, from the ever-inspirational Tim Sherratt, was that big ideas should enable little ideas. The little ideas are often what really makes the world go round. You don’t have to always think big. In fact, many sites have suffered from the tendency to try to do everything. Just because you can add tons of features to your applications, it doesn’t mean you should.

The Importance of Orientation

How would you present your collections if you didn’t have a search box? This is the question I asked myself after listening to George Oates, from Good Form and Spectacle. She is a User Interface expert, and has worked on Flickr and for the Internet Archive amongst other things. I thought her argument about the need to help orientate users was interesting, as so often we are told that the ‘Google search box’ is the key thing, and what users expect. She talked about some of her experiments with front end interfaces that allow users to look at things differently, such as the V&A Spelunker. She spoke in terms of landmarks and paths that users could follow. I wonder if this is easier said than done with archives without over-curating what you have or excluding material that is less well catalogued, or does not have a nice image to work with. But I certainly think it is an idea worth exploring.

View of V&A Spelunker: “The V&A Spelunker is a rough thing built by Good, Form & Spectacle to give a different view into the collection of the Victoria & Albert Museum”

Exploring British Design: Research Paths II

December 19, 2014 / Jane Stevenson

We recently ran a second workshop as part of our Exploring British Design project. The workshops aim to understand more about approaches to research, and researchers’ understanding and use of archives.

The second workshop was run largely on the same basis as the first workshop, using the same exercises.

Looking at what our researchers said and documented about their research paths over the two workshops, some points came out quite strongly:

•   Google is by far the most common starting point, but its shortcomings are clear and issues of trust come up frequently.
•   There is often a strong visual emphasis to research, including searching for images and the use of Pinterest; there seems to be a split between those who gravitate towards a more text-based approach and those who think visually (many of our participants were graphic designers though!).
•   It is common to utilise the references listed in Wikipedia articles.
•   The library as a source is seen as part of a diverse landscape – it is one place to go to, albeit an important one. It is not the first port of call for the majority.
•   Aggregators are not specifically referred to very often. But they may be seen as a place to go if other searches don’t yield useful results.
•   Talking to people is very important, be it lecturers, experts, colleagues or friends.
•   Online research is more immediate, and usually takes less effort, but there are issues of trust and it may not yield specific enough results, or uncover the more obscure sources.
•   There is a tendency to start from the general and work towards the more specific. With the research paths of most of the researchers, the library/archive was somewhere in the middle of this process.
•   Personal habits and past experience play a very large part, but there is a real interest in finding new routes through research, so habit is not a sticking point, but simply the dominant influence unless it is challenged.

For the second workshop, the first exercise asked participants to document their likely research paths around a topic.

flip chart showing research paths for a topic — Research paths of two researchers for the topic of Simpsons of Piccadilly

We had four pairs of researchers looking at different topics, and we left them to discuss their research paths for about 45 minutes. The discussions following the exercise picked up on a number of areas:

Online vs Offline

We kicked off by asking the researchers about online versus ‘offline’ research paths. One participant commented that she saw online as a route through to traditional research – maybe to locate a library or archive – ‘online is telling me where to look’ but in itself it is too general and not specific enough; whereas the person she was paired with tended to do more research online. He saw online as giving the benefit of immediacy – at any time of day or night he could access content. The issue of trust came up in this discussion, and one participant summed it up nicely: “If you do online research there is less effort but there is less trust; if you research offline there is more effort but there is more trust.”

Following on from the discussion about how people go about using online services, there was a comment that things found online are often the more obvious, the more used and cited resources. Visiting a library or archive may give more opportunity to uncover little known sources that help with original research. This seemed to be endorsed by most participants, one commenting that Pinterest tends to reflect what is trendy and popular. However, there was also a view that something like Pinterest can lead researchers to new sources, as they are benefiting from the efforts, and sometimes the quite obsessive enthusiasms, of a wide range of people.

There was agreement that online research can lead to ‘information dumping’, where you build up a formidable collection of resources, but are unlikely to get round to sorting them all out and using them.

Library Resources

The issue of effort came up later in the discussion when referring to a particular university library (probably typical of many university libraries), and the amount of effort involved in using its databases. There was a comment about how you need to ‘work yourself up to an afternoon in the library’ and there seemed to be a general agreement that the ‘search across all resources’ often produced quite meaningless results. When compared to Google, the issue seems to be that relevance ranking is not effective, so the top results often don’t match your requirements. There was also some discussion around the way that library resource discovery services often involve too many steps, and there is effort in understanding how the catalogue works. One participant, whose research centres on the Web and the online user experience, felt that printed sources were of little use to him, as they were out of date very quickly.

Curating your sources

One researcher talked about using Pinterest to organise findings visually. This was followed up by another researcher talking about how with online research you can organise and collect things yourself. It facilitates ‘curating’ your own collection of resources. It can also be easier to remember resources if they are visual. Comparing Pinterest to the Library – with the former you click to add the image to your board; with the Library you pay a visit, you find the book, you take it to the scanner, you pay to take a scan…although it is increasingly possible to take pictures of books using your own device. But the general feeling was that the Web was far quicker and more immediate.

Attitudes towards research

One participant felt that there might be a split between those more like him who see research as ‘a means to an end’ and those who enjoy the process itself. So maybe some are looking for the shortest route to the end goal, and others see research as a more exploratory activity and expect it to take time and effort. This may partly be a result of the nature and scope of the research. Short time scales preclude in-depth research.

Talking about serendipitous approaches, someone commented that browsing the library shelves can be constructive, as you can find books around your subject that you weren’t aware existed. This is replicated to some extent in something like Amazon, which suggests books you might be interested in. There was also some feeling that exploring too many avenues can take the researcher off topic and take up a great deal of time.

Trust and Citation

The issue of trust is important. A first-hand experience, whether of a place you are researching, or using physical archive sources, is the most trustworthy, because you are seeing with your own eyes, experiencing first hand or looking at primary sources first hand; a library provides the next level of trust, as a book is an interpretation, and you may feel it requires corroboration; the online world is the least trustworthy. You will have the least trust if you are looking at a website where you don’t know about who or what is behind it. There was agreement that trust can come through crowd sourced information, but also some discussion around how to cite this (for example, using the Harvard system to reference web pages and crowd sourced resources). This led on to a short discussion around the credibility of what is cited within research. Maybe attitudes to Wikipedia are slowly changing, but at present there is generally still a feeling that a researcher cannot cite it as a source. There are traditions within disciplines around how to cite and what are the ‘right’ things to cite.

[Further posts on Exploring British Design will follow, with reflections on our workshops and updates on the project generally]