Images and Machine Learning Project

Under our new Labs umbrella, we have started a new project, ‘Images and Machine Learning’ it has three distinct and related strands.

screenshot with bullet points to describe the DAO store, IIIF and Machine Learning
The three themes of the project

We will be working on these themes with ten participants, who already contribute to the Archives Hub, and who have expressed an interest in one or more of these strands: Cardiff University, Bangor University, Brighton Design Archives at the University of Brighton, Queens University Belfast, the University of Hull, the Borthwick Institute for Archives at the University of York, the Geological Society, the Paul Mellon Centre, Lambeth Palace (Church of England) and Lloyds Bank.

This project is not about pre-selecting participants or content that meet any kind of criteria. The point is to work with a whole variety of descriptions and images, and not in any sense to ‘cherry pick’ descriptions or images in order to make our lives easier. We want a realistic sense of what is required to implement digital storage and IIIF display, and we want to see how machine learning tools work with a range of content. Some of the participants will be able to dedicate more time to the project, others will have very little time, some will have technical experience, others won’t. A successful implementation that runs beyond our project and into service will need to fit in with our contributors needs and limitations. It is problematic to run a project that asks for unrealistic amounts of time from people that will not be achievable long-term, as trying to turn a project into a service is not likely to work.

DAO Store

Over the years we have been asked a number of times about hosting content for our contributors. Whilst there are already options available for hosting, there are issues of cost, technical support, fit for purpose-ness, trust and security for archives that are not necessarily easily met.

Jisc can potentially provide a digital object store that is relatively inexpensive, integrated with the current Archives Hub tools and interfaces, and designed specifically to meet our own contributors’ requirements. In order to explore this proposal, we are going to invest some resource into modifying our current administrative interface, the CIIM, to enable the ingest of digital content.

We spent some time looking at the feasibility of integrating an archival digital object store with the current Jisc Preservation Service. However, for various reasons this did not prove to be a practical solution. One of the main issues is the particular nature of archives as hierarchical multi-level collections. Archival metadata has its own particular requirements. The CIIM is already set up to work with EAD descriptions and by using the CIIM we have full control over the metadata so that we can design it to meet the needs of archives. It also allows us to more easily think about enabling IIIF (see below).

The idea is that contributors use the CIIM to upload content and attach metadata. They can then organise and search their content, and publish it, in order to give it web address URIs that can be added to their archival descriptions – both in the Archives Hub and elsewhere.

It should be noted that this store is not designed to be a preservation solution. As said, Jisc already provides this service, and there are many other services available. This is a store for access and use, and for providing IIIF enabled content.

The metadata fields have not yet been finalised, but we have a working proposal and some thoughts about each field.

Titlemandatory? individual vs batch?
Datespreferably structured, options for approx. and not dated.
Licencepossibly a URI. option to add institution’s rights statement.
Resource typecontrolled list. values to be determined with participants. could upload a thesaurus. could try ML to identify type.
Keywordsfree text
Taggingenable digital objects to be grouped e.g by topic or e.g. ‘to do’ to indicate work is required
Statusunpublished/published. May refer to IIIF enabled.
URLunique URI of image (at individual level)
Proposed fields for the Digital Object Store

We need to think about the workflow and user interface. The images would be uploaded and not published by default, so that they would only be available to the DAO Store user at that point. On publication, they would be available at a designated URL. Would we then give the option to re-size? Would we set a maximum size? How would this fit in with IIIF and the preference for images of a higher resolution? We will certainly need to think about how to handle low resolution images.

International Image Interoperability Framework

IIIF is a framework that enables images to be viewed in any IIIF viewer. Typically, they can be sequenced, such as for a book, and they are zoomable to a very high resolution. At the heart of IIIF is the principle that organisations expose images over the web in a way that allows researchers to use images from anywhere, using any platform that speaks IIIF. This means a researcher can group images for their own research purposes, and very easily compare them. IIIF promotes the idea of fully open digital content, and works best with high resolution images.

There are a number of demos here: https://matienzo.org/iiif-archives-demo/

And here is a demo provided by Project Mirador: http://projectmirador.org/demo/

An example from the University of Cambridge: https://cudl.lib.cam.ac.uk/view/MS-RGO-00014-00051/358

And one from the University of Manchester: https://www.digitalcollections.manchester.ac.uk/collections/ruskin/1

There are very good reasons for the Archives Hub to get involved in IIIF, but there are challenges being an aggregator that individual institutions don’t face, or at least not to the same degree. We won’t know what digital content we will receive, so we have to think about how to work with images of varying resolutions. Our contributors will have different preferences for the interface and functionality. On the plus side, we are a large and established service, with technical expertise and good relationships with our contributors. We can potentially help smaller and less well-resourced institutions into this world. In addition, we are well positioned to establish a community of use, to share experiences and challenges.

One thing that we are very convinced by: IIIF is a really effective way to surface digital content and it is an enormous boon to researchers. So, it makes total sense for us to move into this area. With this in mind, Jisc has become a member of the IIIF Consortium, and we aim to take advantage of the knowledge and experience within the community – and to contribute to it.

Machine Learning

This is a huge area, and it can feel rather daunting. It is also very complicated, and we are under no illusions that it will be a long road, probably with plenty of blind alleys. It is very exciting, but not without big challenges.

It seems as if ML is getting a bad reputation lately, with the idea that algorithms make decisions that are often unfair or unjust, or that are clearly biased. But the main issue lies with the data. ML is about machines learning from data, and if the data is inadequate, biased, or suspect in some way, then the outcomes are not likely to be good. ML offers us a big opportunity to analyse our data. It can help us surface bias and problematic cataloguing.

We want to take the descriptions and images that our participants provide and see what we can do with ML tools. Obviously we won’t do anything that affects the data without consulting with our contributors. But it is best with ML to have a large amount of data, and so this is an area where an aggregator has an advantage.

This area is truly exploratory. We are not aiming for anything other than the broad idea of improved discoverability. We will see if ML can help identify entities, such as people, places and concepts. But we are also open to looking at the results of ML and thinking about how we might benefit from them. We may conclude that ML only has limited use for us – at least, as it stands now. But it is changing all the time, and becoming more sophisticated. It is something that will only grow and become more embedded within cultural heritage.

Over the next several months we will be blogging about the project, and we would be very pleased to receive feedback and thoughts. We will also be holding some webinar sessions. These will be advertised to contributors via our contributors list, and advertised on the JiscMail archives-nra list.

The Archives Hub and IIIF: supporting the true potential of images on the Web

IIIF is a model for presenting and annotating digital content on the Web, including images and audio/visual files. There is a very active global community that develops IIIF and promotes the principles of open, shareable content. One of the strengths of IIIF is the community, which is a diverse mix of people, including developers and information professionals.

IIIF map showing where there are known IIIF projects and implementations

Images are fundamental carriers of information. They provide a huge amount of value for researchers, helping us understand history and culture. We interact with huge amounts of images, and yet we do not always get as much value out of them as we might. Content may be digitised, but it is often within silos, where the end user has to go to a specific website to discover content and to view a specific image, it is not always easy or possible to discover, gather together, compare, analyse and manipulate images.

IIIF is a particularly useful solution for cultural heritage, where analysis of images is so important. A current ‘Towards a National Collection’ project has been looking at practical applications of IIIF.

The IIIF Solution

Exactly what IIIF enables depends upon a number of factors, but in general it enables:

Deep zoom: view and zoom in closely to see all the detail of an image

Sequencing: navigate through a book or sequence of archival materials

Comparisons: bring images together and put them side-by-side. This can enable researchers to bring together images from different collections, maybe material with the same provenance that has been separated over time.

Search within text: work with transcriptions and translations

Connections: connect to resources such as Wikidata

Use of different IIIF viewers: different viewers have their own features and facilities.

How It Works

The IIIF community tends to talk in terms of APIs. These can be thought of as agreed and structured ways to connect systems. If you have this kind of agreement then you can implement different systems, or parts of systems, to work with the same content, because you are sticking to an agreed structure. The basic principle is to store an image once (on a IIIF server) and be able to use it many times in many contexts.

IIIF is like a a layer above the data stores that host content. The images are accessed through that IIIF layer – or through the IIIF APIs. This enables different agents to create viewers and tools for the data held in all the stores.

Different repositories have their own data stores, but they can share content through the IIIF APIs.

There are a few different APIs that make up the IIIF standard.

Image API

This API delivers the content (or pixels). The image is delivered as a URL, and the URL is structured in an agreed way.

Presentation API

This delivers information on the presentation of the material, such as the sequence of a book, for example, or a bundle of letters, and metadata about the object.

This screenshot shows the Image API providing the zoomable image, and the presentation API providing basic information – the title and the sequence of the pages of this object.

Search API

Allows searching within the text of an object.

Authentication API

Allows materials to be restricted by audience. So, this is useful for sensitive images or images under copyright that may have restrictions.

IIIF viewer

As IIIF images are served in a standard way, any IIIF viewer can access them. Examples of IIIF viewers:

The Universal Viewer: https://universalviewer.io/
Mirador: https://mirador-dev.netlify.app/tests/integration/mirador/
Archival IIIF: https://archival-iiif.github.io/
Storiiies digital storytelling: https://storiiies.cogapp.com/#storiiies

There are a whole host of viewers available, with various functionality. Most will offer the basics of zooming and cropping. There does seem to be a question around why so many viewers are needed. It might be considered a better approach for the community to work on a limited group of viewers, but this may be a politically driven desire to own and brand a viewer. In the end, a IIIF viewer can display any IIIF content, and each viewer will have its own features and functionality.

To find out more about how researchers can benefit from IIIF, you may like to watch this presentation on YouTube (59m): Using IIIF for research 

Some Examples

In many projects, the aim is to digitise key materials, such as artworks of national importance and rare books and manuscripts, in order to provide a rich experience for end users. For instance, the Raphael Cartoons at the V&A are now available to explore different layers and detail, even enabling the infra-red view and surface view, to allow researchers to study the paintings in great depth. Images can easily be compared within your own workspace, by pulling in other IIIF images.

The V&A Raphael Cartoons can be viewed in ultra high resolution colour, exploring all of the layers

What is the Archives Hub planning to do with IIIF?

Hosting content: We are starting a 15 month project to explore options for hosting and delivering content. Integral to this project will be providing a IIIF Image API. As referenced above, this will mean that the digital content can be viewed in any IIIF viewer, because we will provide the necessary URLs to do so. One of the barriers for many archives is that images need to be on a IIIF server in order to utilise the Image API. It may be that Jisc can provide this service.

Creation of IIIF manifests: I’ll talk more about this in future blog posts, but the manifest is a part of the Presentation API. It contains a sequence (e.g. ordering of a book), as well as metadata such as a title, description, attribution, rights information, table of contents, and any other information about the objects that may be useful for presentation. We will be looking at how to create manifests efficiently and at scale, and the implications for representing hierarchical collections.

Providing an interface to manage content: This would be useful for any image store, so it does not relate specifically to IIIF. But it may have implications around the metadata provided and what we might put into a IIIF manifest.

Integrating a IIIF viewer into the Archives Hub: We will be providing a IIIF viewer so that the images that we host, and other IIIF images, can be viewed within the Archives Hub.

Assessing image quality: A key aim of this project is to assess the real-world situation of a typical archive repository in the UK, and how they can best engage with IIIF. Image resolution is one potential issue. Whilst any image can be served through the IIIF API, a lower resolution image will not give the end user the same sort of rich experience with zooming and analysing that a high resolution image provides. We will be considering the implications of the likely mix of different resolutions that many repositories will hold.

Looking at rights and IIIF: Rights are an important issue with archives, and we will be considering how to work with images at scale and ensure rights are respected.

Projects often have a finite goal of providing some kind of demonstrator showing what is possible, and they often pre-select material to work with. We are taking a different approach. We are working with a limited number of institutions, but we have not pre-selected ‘good’ material. We are simply going to try things out and see what works and what doesn’t, what the barriers are and how to overcome them. The process of ingest of the descriptive data and images will be part of the project. We are looking to consider both scalability and sustainability for the UK archive sector, including all different kinds of repositories with different resourcing and expertise, and with a whole variety of content and granularity of metadata.

Acknowledgement: This blog post cites the introductory video on IIIF which can be viewed within YouTube.