Machine Learning: Training the Model

A recent OCLC paper by Thomas Padilla highlights the need for ‘Pilot collaborations between institutions with representative collections’ and working ‘to share source data and produce “gold standard” training data.

We think that the Archives Hub Labs project exemplifes Tom’s suggested approach by working with ten of our contributing institutions from across the UK, reflecting a variety of archives.

However, it is also surely true that cultural heritage will need to engage with the broader AI and ML communities to understand and benefit fully from the range of ML services such as translation, transcription, object identification and facial recognition:

‘Advances in all of these areas are being driven and guided by the government or commercial sectors, which are infinitely better funded  than cultural memory; for example, many nation-states and major corporations are intensively interested  in facial recognition. The key strategy for the cultural memory sector will be to exploit these advantages, adapting and tuning the technologies around the margins for its own needs.’ From a short blog post by Dr Clifford Lynch from the CNI which is well worth reading.

People often criticise Machine Learning for being biased. But bias and mis-representation is essentially due to embedded bias in the input training data. The algorithm learns with what it has. So one of the key tasks for us as an archives community is to think about training data. We need algorithms that are trained to work for us to give us useful outputs.

Gathering training data in order to create useful models is going to be a challenge. Machine Learning is not like anything else that we have done before – we don’t actually know what we’ll get – we just know that we need to give the algorithm data that educates it in the way that we want. A bit like a child in school, we can teach it the curriculum, but we don’t know if it will pass the exam.

It certainly seems a given that we will need to use well labelled archival material as training data, so that the model is tailored specifically to the material we have. We will need to work together to provide this scale of training data. We have many wonderfully catalogued collections, with detail down to item level; as well as many collections that are catalogued quite basically, maybe just at collection level. If we join together as a community and utilise the well-catalogued content to train algorithms, we may be able to achieve something really useful to help make all collections more discoverable.

If an algorithm is trained on a fairly narrow set of data, then it is questionable whether it will have broad applicability. For example, if we train an algorithm on letters written in the 18th century, but just authored by two or three people, then it is unlikely to learn enough to be of real use with transcription; but if we train it on the handwriting of fifty people or more, then it could be a really useful tool for recognising and transcribing 18th century letters To do this training, we will need to bring content together. We will need to share the Machine Learning journey. The benefits could be massive in terms of discoverability of archives; effective discovery for all those materials that we currently don’t have time to catalogue. The main danger is that the resulting identification, transcription, tagging or whatever, is not to the standard that we want. We can only experiment and see what happens if we trial ML with a set of data (which is what we are doing now with our Labs project). One benefit could actually be much more consistency across collections. As someone working on aggregating data from 350 organisations, I can testify that we are not consistent! – and this lack of consistency impairs discovery.

Archival content is likely to be distinct in terms of both quality and subject. Typescripts might be old and faded, manuscripts might be hard to read, photographs might be black and white and not as high resolution as modern prints. Photographs might be of historical artefacts that are not recognised by most algorithms. We have specific challenges with our material, and we need the algorithms to learn from our material, in order to then provide something useful as we input more content.

In terms of subject, the Lotus and Delta shoe shops are a good example of a specific topic. They are represented in the Joseph Emberton papers, at the University of Brighton Design Archives, with a series of photographs. Architecture is potentially an interesting area to focus on. ML could give us some outputs that provide information on architectural features. It could be that the design of Lotus and Delta shops can be connected to other shops with similar architectures and shop fronts. ML may pick out features that a cataloguer may not include. On the other hand, we may find that it is extremely hard to train an algorithm on old black and white and potentially low resolution photographs in order for it to learn what a shop is, and maybe what a shoe shop is.

In this collection a number of the photographs are of exteriors. Some are identified by location, and some are not yet identified.

photo of Emberton shoe shop, Harrogate
Harrogate
Photo of Edinburgh shoe shop exterior
Edinburgh
Photo of unidentified shoe shop
Unidentified shop

These photographs have been catalogued to item level, and so researchers will be able to find these when searching for ‘shops’ and particularly ‘shoe shops’ on the Hub, e.g. a search for ‘harrogate shoe shop‘ finds the exterior of a shop front in Harrogate. There may not be much more that could be provided for searching this collection, unless machine learning could label the type of shop front, the type of windows and signage for example. This seems very challenging with these old photographs, but presumably not impossible. With ML it is a matter of trying things out. You might think that if artificial intelligence can master self-driving cars it can master shop exteriors….but it is not a foregone conclusion.

If the model was trained with this set of photographs, then other shop fronts could potentially be identified in photographs that aren’t catalogued individually. We could potentially end up with collections from many different archives tagged with ‘shop front’ and potentially with ‘shoes’. Whether an unidentified shop front could be be identified is less certain, unless there are definite contextual features to work with.

interior of ladies department shoe shop
Interior of ladies’ dept.
photograph of shoe shop interior
Interior of men’s dept.

Shop interiors are likely to be even more of a challenge. But it will be exciting to try things like this out and see what we get.

Commercial providers offer black box solutions, and we can be sure they were not trained to work well with archives. They may be adapted to new situations, but it is unlikely they can ever work effectively for archival content. I explored this to an extent in my last blog post. However, it is worth considering that a model not trained on archival material may highlight objects or topics that we would not think of including in a catalogue entry.

The Archives Hub and Jisc could play a pivotal role in co-ordinating work to create better models for archival material. Aggregation allows for providing more training material, and thus creating more effective models.

To date, most ML projects in libraries have required bespoke data annotation to create sufficient training data. Reproducing this work for every ML project, however, risks wasting both time and labor, and there are ample opportunities for scholars to share and build upon each other’s work.’ (R. Cordell, LC Labs report)

We can have a role to play in ‘data gathering, sharing, annotation, ethics monitoring, and record-keeping processes‘ (Eun Seo Jo, Timnit Gebru, https://arxiv.org/abs/1912.10389). We will need to think about how to bring our contributors into the loop in order to check and feedback on the ML outputs. This is a non-trivial part of the process that we are considering at the moment. We need an interface that displays the results of our ML trials.

One of the interesting aspects of this is that collections that have been catalogued in detail will provide the training data for collections that are not. Will this prove to be a barrier, or will it bring us together as a community? In theory the resources that some archives have, which have enabled them to catalogue to item level, can benefit those with minimal resources. Would this be a free and open exchange, or would we start to see a commercial framework developing?

It is also important that we don’t ignore the catalogue entries from our 350 contributors. Catalogues could provide great fodder for ML – we could start to establish connections and commonalities and increase the utility of the catalogues considerably.

The issue of how to incorporate the results of ML into the end user discovery interface is yet another challenge. Is it fundamentally important that end users know what has been done through ML and what has been done by a human? I can’t help thinking that over time the lines will blur, as we become more comfortable with AI….or as AI simply becomes more integrated into our world. It is clear that many people don’t realise how much Artificial Intelligence sits behind so many systems and processes that we use on an everyday basis. But I think that for the time being, it would be useful to make that distinction within our end user interfaces, so that people know why something has been catalogued or described in a certain way and so that we can assess the effectiveness of the ML contribution.

In subsequent posts we aim to share some initial findings from doing work at scale. We will only be able to undertake some modest experiments, but we hope that we are contributing to the start of what will be a very big adventure for archives.

Images and Machine Learning Project

Under our new Labs umbrella, we have started a new project, ‘Images and Machine Learning’ it has three distinct and related strands.

screenshot with bullet points to describe the DAO store, IIIF and Machine Learning
The three themes of the project

We will be working on these themes with ten participants, who already contribute to the Archives Hub, and who have expressed an interest in one or more of these strands: Cardiff University, Bangor University, Brighton Design Archives at the University of Brighton, Queens University Belfast, the University of Hull, the Borthwick Institute for Archives at the University of York, the Geological Society, the Paul Mellon Centre, Lambeth Palace (Church of England) and Lloyds Bank.

This project is not about pre-selecting participants or content that meet any kind of criteria. The point is to work with a whole variety of descriptions and images, and not in any sense to ‘cherry pick’ descriptions or images in order to make our lives easier. We want a realistic sense of what is required to implement digital storage and IIIF display, and we want to see how machine learning tools work with a range of content. Some of the participants will be able to dedicate more time to the project, others will have very little time, some will have technical experience, others won’t. A successful implementation that runs beyond our project and into service will need to fit in with our contributors needs and limitations. It is problematic to run a project that asks for unrealistic amounts of time from people that will not be achievable long-term, as trying to turn a project into a service is not likely to work.

DAO Store

Over the years we have been asked a number of times about hosting content for our contributors. Whilst there are already options available for hosting, there are issues of cost, technical support, fit for purpose-ness, trust and security for archives that are not necessarily easily met.

Jisc can potentially provide a digital object store that is relatively inexpensive, integrated with the current Archives Hub tools and interfaces, and designed specifically to meet our own contributors’ requirements. In order to explore this proposal, we are going to invest some resource into modifying our current administrative interface, the CIIM, to enable the ingest of digital content.

We spent some time looking at the feasibility of integrating an archival digital object store with the current Jisc Preservation Service. However, for various reasons this did not prove to be a practical solution. One of the main issues is the particular nature of archives as hierarchical multi-level collections. Archival metadata has its own particular requirements. The CIIM is already set up to work with EAD descriptions and by using the CIIM we have full control over the metadata so that we can design it to meet the needs of archives. It also allows us to more easily think about enabling IIIF (see below).

The idea is that contributors use the CIIM to upload content and attach metadata. They can then organise and search their content, and publish it, in order to give it web address URIs that can be added to their archival descriptions – both in the Archives Hub and elsewhere.

It should be noted that this store is not designed to be a preservation solution. As said, Jisc already provides this service, and there are many other services available. This is a store for access and use, and for providing IIIF enabled content.

The metadata fields have not yet been finalised, but we have a working proposal and some thoughts about each field.

Titlemandatory? individual vs batch?
Datespreferably structured, options for approx. and not dated.
Licencepossibly a URI. option to add institution’s rights statement.
Resource typecontrolled list. values to be determined with participants. could upload a thesaurus. could try ML to identify type.
Keywordsfree text
Taggingenable digital objects to be grouped e.g by topic or e.g. ‘to do’ to indicate work is required
Statusunpublished/published. May refer to IIIF enabled.
URLunique URI of image (at individual level)
Proposed fields for the Digital Object Store

We need to think about the workflow and user interface. The images would be uploaded and not published by default, so that they would only be available to the DAO Store user at that point. On publication, they would be available at a designated URL. Would we then give the option to re-size? Would we set a maximum size? How would this fit in with IIIF and the preference for images of a higher resolution? We will certainly need to think about how to handle low resolution images.

International Image Interoperability Framework

IIIF is a framework that enables images to be viewed in any IIIF viewer. Typically, they can be sequenced, such as for a book, and they are zoomable to a very high resolution. At the heart of IIIF is the principle that organisations expose images over the web in a way that allows researchers to use images from anywhere, using any platform that speaks IIIF. This means a researcher can group images for their own research purposes, and very easily compare them. IIIF promotes the idea of fully open digital content, and works best with high resolution images.

There are a number of demos here: https://matienzo.org/iiif-archives-demo/

And here is a demo provided by Project Mirador: http://projectmirador.org/demo/

An example from the University of Cambridge: https://cudl.lib.cam.ac.uk/view/MS-RGO-00014-00051/358

And one from the University of Manchester: https://www.digitalcollections.manchester.ac.uk/collections/ruskin/1

There are very good reasons for the Archives Hub to get involved in IIIF, but there are challenges being an aggregator that individual institutions don’t face, or at least not to the same degree. We won’t know what digital content we will receive, so we have to think about how to work with images of varying resolutions. Our contributors will have different preferences for the interface and functionality. On the plus side, we are a large and established service, with technical expertise and good relationships with our contributors. We can potentially help smaller and less well-resourced institutions into this world. In addition, we are well positioned to establish a community of use, to share experiences and challenges.

One thing that we are very convinced by: IIIF is a really effective way to surface digital content and it is an enormous boon to researchers. So, it makes total sense for us to move into this area. With this in mind, Jisc has become a member of the IIIF Consortium, and we aim to take advantage of the knowledge and experience within the community – and to contribute to it.

Machine Learning

This is a huge area, and it can feel rather daunting. It is also very complicated, and we are under no illusions that it will be a long road, probably with plenty of blind alleys. It is very exciting, but not without big challenges.

It seems as if ML is getting a bad reputation lately, with the idea that algorithms make decisions that are often unfair or unjust, or that are clearly biased. But the main issue lies with the data. ML is about machines learning from data, and if the data is inadequate, biased, or suspect in some way, then the outcomes are not likely to be good. ML offers us a big opportunity to analyse our data. It can help us surface bias and problematic cataloguing.

We want to take the descriptions and images that our participants provide and see what we can do with ML tools. Obviously we won’t do anything that affects the data without consulting with our contributors. But it is best with ML to have a large amount of data, and so this is an area where an aggregator has an advantage.

This area is truly exploratory. We are not aiming for anything other than the broad idea of improved discoverability. We will see if ML can help identify entities, such as people, places and concepts. But we are also open to looking at the results of ML and thinking about how we might benefit from them. We may conclude that ML only has limited use for us – at least, as it stands now. But it is changing all the time, and becoming more sophisticated. It is something that will only grow and become more embedded within cultural heritage.

Over the next several months we will be blogging about the project, and we would be very pleased to receive feedback and thoughts. We will also be holding some webinar sessions. These will be advertised to contributors via our contributors list, and advertised on the JiscMail archives-nra list.