Given Hub Labs is looking at AI and we’re also writing blog posts, it seemed like a good opportunity to try out one of the AI blog post writing applications that keep appearing in my various streams. The one that seems to get the best recommendations and is also free to trial is Writesonic. I’ve been meaning to write a post on our initial steps to build our AWS cloud infrastructure for Labs so I thought I would use this as the basis for the experiment. You essentially provide a title and it will then provide some intros and outlines that you can accept before it goes on to produce some copy. The best copy I could produce given the options is pasted in below. You can see it’s very marketing orientated and it was no great surprise that it’s not proved to be of much use for my experiment at least. For some reason it wanted to assume we are using Google Cloud and didn’t offer any AWS copy. Interesting to try it though.
The generated AI copy including the image (italicised by me):
Getting Started with Machine Learning in the Cloud
Image Source: Freepik
The practice of machine learning involves training computer algorithms to recognize patterns in data and make predictions based on those patterns. It takes a lot of resources to implement machine learning in your business. There are many different datasets, tools, and software packages that you need to understand and choose from before you can put them to work analyzing your data and building your algorithm. This blog post discusses how you can use cloud services for machine learning with an example using Google Cloud Machine Learning (Cloud ML) and its APIs for Natural Language Processing (NLP).
What is Machine Learning?
There are many definitions of machine learning, including the IEEE’s “A computer program is said to learn from experience E with respect to some class of tasks T and some performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” Or, as silly as it sounds, you can think about machine learning as a computer program that takes a bunch of data and spits out a prediction. You’re the one that has to tell the program how to do it.
Cloud ML with its APIs for NLP
Google Cloud Machine Learning workflows let you build and manage machine learning models in Google Cloud. Cloud ML supports a variety of machine learning frameworks, including TensorFlow, Caffe 2, and Apache MXNet. Cloud ML also provides a set of Natural Language APIs to support natural language processing (NLP) tasks, including sentiment analysis and entity extraction. Cloud ML’s NLP APIs can be accessed programmatically through its REST APIs.
Create a dataset using BigQuery
When you’re collecting data for machine learning, it’s important that your data be as accurate as possible. Google’s BigQuery lets you create a dataset in the cloud, and then you can use Google’s Natural Language API to analyze that dataset and get a breakdown of the different parts of speech. BigQuery is Google’s fully managed, enterprise data warehouse for business analytics. With BigQuery, you can query large data sets using standard SQL and get results in seconds. It’s also fully integrated with Google’s other analytical and collaboration tools, such as Google Sheets, Google Data Studio, and Google Cloud Platform.
Create a model using TensorFlow
With your dataset collected and a model for the Natural Language API built, you can create the model with TensorFlow. In this example, you’ll be building a model for sentiment analysis, which means you’ll be using the API’s sentiment analysis functionality. If you’re new to TensorFlow, don’t worry. There are some libraries out there that make it super easy to get started. One of them is tf-python-client, which is a library that allows you to write python code that uses TensorFlow.
Run the model in a Cloud Dataflow job
At this point, you’ve built a model and have everything ready to go. All that’s left to do is plug everything into Cloud Dataflow. Cloud Dataflow is Google’s fully managed, distributed framework for executing streaming data pipelines. Cloud Dataflow lets you build scalable, easy-to-manage data processing pipelines that transform and enrich your data. With Dataflow, you can build data processing and machine learning pipelines in a single, unified programming model. This means that you can build a model that performs data analysis, trains the model, and then deploys it in one workflow instead of three different ones.
Run the model in an API server
Your model is ready to go. Now you’re just missing the code to plug into an API server that can handle your data and send it to BigQuery. The Google Cloud Natural Language API lets you understand the meaning of text. You can use the API to detect entities, sentiments, and other linguistic properties of text. The API supports 82 languages.
Now that you know what machine learning is and have gotten started with Cloud ML and its NLP APIs, you can put this knowledge to use. You can use the NLP APIs to perform sentiment analysis or entity extraction on a dataset. This can be helpful for creating articles, social media posts, or marketing materials that are more effective.
With Cloud ML and its NLP APIs, you can create machine learning models that are more accurate and make more informed predictions. It takes a lot more effort to create models without cloud services, and it’s often difficult to keep track of all of the resources needed. With cloud services, you can focus on the model and not worry about the infrastructure and resources needed to run it.
A recent OCLC paper by Thomas Padilla highlights the need for ‘Pilot collaborations between institutions with representative collections’ and working ‘to share source data and produce “gold standard” training data.’
However, it is also surely true that cultural heritage will need to engage with the broader AI and ML communities to understand and benefit fully from the range of ML services such as translation, transcription, object identification and facial recognition:
‘Advances in all of these areas are being driven and guided by the government or commercial sectors, which are infinitely better funded than cultural memory; for example, many nation-states and major corporations are intensively interested in facial recognition. The key strategy for the cultural memory sector will be to exploit these advantages, adapting and tuning the technologies around the margins for its own needs.’ From a short blog post by Dr Clifford Lynch from the CNI which is well worth reading.
People often criticise Machine Learning for being biased. But bias and misrepresentation are essentially due to embedded bias in the input training data. The algorithm learns with what it has. So one of the key tasks for us as an archives community is to think about training data. We need algorithms that are trained to work for us to give us useful outputs.
Gathering training data in order to create useful models is going to be a challenge. Machine Learning is not like anything else that we have done before – we don’t actually know what we’ll get – we just know that we need to give the algorithm data that educates it in the way that we want. A bit like a child in school, we can teach it the curriculum, but we don’t know if it will pass the exam.
It certainly seems a given that we will need to use well labelled archival material as training data, so that the model is tailored specifically to the material we have. We will need to work together to provide this scale of training data. We have many wonderfully catalogued collections, with detail down to item level; as well as many collections that are catalogued quite basically, maybe just at collection level. If we join together as a community and utilise the well-catalogued content to train algorithms, we may be able to achieve something really useful to help make all collections more discoverable.
If an algorithm is trained on a fairly narrow set of data, then it is questionable whether it will have broad applicability. For example, if we train an algorithm on letters written in the 18th century, but just authored by two or three people, then it is unlikely to learn enough to be of real use with transcription; but if we train it on the handwriting of fifty people or more, then it could be a really useful tool for recognising and transcribing 18th century letters. To do this training, we will need to bring content together. We will need to share the Machine Learning journey. The benefits could be massive in terms of discoverability of archives; effective discovery for all those materials that we currently don’t have time to catalogue. The main danger is that the resulting identification, transcription, tagging or whatever, is not to the standard that we want. We can only experiment and see what happens if we trial ML with a set of data (which is what we are doing now with our Labs project). One benefit could actually be much more consistency across collections. As someone working on aggregating data from 350 organisations, I can testify that we are not consistent! – and this lack of consistency impairs discovery.
Archival content is likely to be distinct in terms of both quality and subject. Typescripts might be old and faded, manuscripts might be hard to read, photographs might be black and white and not as high resolution as modern prints. Photographs might be of historical artefacts that are not recognised by most algorithms. We have specific challenges with our material, and we need the algorithms to learn from our material, in order to then provide something useful as we input more content.
In terms of subject, the Lotus and Delta shoe shops are a good example of a specific topic. They are represented in the Joseph Emberton papers, at the University of Brighton Design Archives, with a series of photographs. Architecture is potentially an interesting area to focus on. ML could give us some outputs that provide information on architectural features. It could be that the design of Lotus and Delta shops can be connected to other shops with similar architectures and shop fronts. ML may pick out features that a cataloguer may not include. On the other hand, we may find that it is extremely hard to train an algorithm on old black and white and potentially low resolution photographs in order for it to learn what a shop is, and maybe what a shoe shop is.
In this collection a number of the photographs are of exteriors. Some are identified by location, and some are not yet identified.
These photographs have been catalogued to item level, and so researchers will be able to find these when searching for ‘shops’ and particularly ‘shoe shops’ on the Hub, e.g. a search for ‘harrogate shoe shop‘ finds the exterior of a shop front in Harrogate. There may not be much more that could be provided for searching this collection, unless machine learning could label the type of shop front, the type of windows and signage for example. This seems very challenging with these old photographs, but presumably not impossible. With ML it is a matter of trying things out. You might think that if artificial intelligence can master self-driving cars it can master shop exteriors….but it is not a foregone conclusion.
If the model was trained with this set of photographs, then other shop fronts could potentially be identified in photographs that aren’t catalogued individually. We could potentially end up with collections from many different archives tagged with ‘shop front’ and potentially with ‘shoes’. Whether an unidentified shop front could be identified is less certain, unless there are definite contextual features to work with.
Shop interiors are likely to be even more of a challenge. But it will be exciting to try things like this out and see what we get.
Commercial providers offer black box solutions, and we can be sure they were not trained to work well with archives. They may be adapted to new situations, but it is unlikely they can ever work effectively for archival content. I explored this to an extent in my last blog post. However, it is worth considering that a model not trained on archival material may highlight objects or topics that we would not think of including in a catalogue entry.
The Archives Hub and Jisc could play a pivotal role in co-ordinating work to create better models for archival material. Aggregation allows for providing more training material, and thus creating more effective models.
‘To date, most ML projects in libraries have required bespoke data annotation to create sufficient training data. Reproducing this work for every ML project, however, risks wasting both time and labor, and there are ample opportunities for scholars to share and build upon each other’s work.’ (R. Cordell, LC Labs report)
We can have a role to play in ‘data gathering, sharing, annotation, ethics monitoring, and record-keeping processes’ (Eun Seo Jo, Timnit Gebru, https://arxiv.org/abs/1912.10389). We will need to think about how to bring our contributors into the loop in order to check and feed back on the ML outputs. This is a non-trivial part of the process that we are considering at the moment. We need an interface that displays the results of our ML trials.
One of the interesting aspects of this is that collections that have been catalogued in detail will provide the training data for collections that are not. Will this prove to be a barrier, or will it bring us together as a community? In theory the resources that some archives have, which have enabled them to catalogue to item level, can benefit those with minimal resources. Would this be a free and open exchange, or would we start to see a commercial framework developing?
It is also important that we don’t ignore the catalogue entries from our 350 contributors. Catalogues could provide great fodder for ML – we could start to establish connections and commonalities and increase the utility of the catalogues considerably.
The issue of how to incorporate the results of ML into the end user discovery interface is yet another challenge. Is it fundamentally important that end users know what has been done through ML and what has been done by a human? I can’t help thinking that over time the lines will blur, as we become more comfortable with AI….or as AI simply becomes more integrated into our world. It is clear that many people don’t realise how much Artificial Intelligence sits behind so many systems and processes that we use on an everyday basis. But I think that for the time being, it would be useful to make that distinction within our end user interfaces, so that people know why something has been catalogued or described in a certain way and so that we can assess the effectiveness of the ML contribution.
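One way to keep that distinction available to an interface is to record where each label came from. The following is only a sketch in Python: the Label structure, the "human" and "ml" values, and the display wording are all assumptions made for illustration, not part of any existing Archives Hub data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    term: str
    source: str  # assumed values: "human" or "ml"

def display_terms(labels):
    """Return display strings, flagging any term that was added by a machine."""
    shown = []
    for label in labels:
        if label.source == "ml":
            shown.append(f"{label.term} (machine-generated)")
        else:
            shown.append(label.term)
    return shown

# Hypothetical labels for a single photograph
labels = [Label("construction site", "human"), Label("people", "ml")]
print(display_terms(labels))
```

Keeping the source alongside the term also means the machine-generated labels can be counted and reviewed separately when assessing how useful they have been.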
In subsequent posts we aim to share some initial findings from doing work at scale. We will only be able to undertake some modest experiments, but we hope that we are contributing to the start of what will be a very big adventure for archives.
Machine Learning is a sub-set of Artificial Intelligence (AI). You might like to look at devopedia.org for a short introduction to Machine Learning (ML).
Machine Learning is a data-oriented technique that enables computers to learn from experience. Human experience comes from our interaction with the environment. For computers, experience is indirect. It’s based on data collected from the world, data about the world.
Definition of Machine Learning from devopedia.org
The idea of this and subsequent blog posts is to look at machine learning from a specifically archival point of view as well as update you on our Labs project, Images and Machine Learning. We hope that our blog posts help archivists and other information professionals within the archival or cultural heritage domain to better understand ML and how it might be used.
At the Archives Hub we are particularly focussed on looking at Machine Learning from the point of view of archival catalogues and digital content, to aid discoverability, and potentially to identify patterns and bias in cataloguing.
Machine Learning to aid discoverability can be carried out as supervised or unsupervised learning. Supervised learning may be the most reliable, producing the best results. It requires a set of data that contains both the inputs and the desired outputs. By ‘outputs’ we mean that the objective is provided by labelling some of the input data. This is often called training data. In a ‘traditional’ scenario, code is written to take input and create output; in machine learning, input and output are provided, and the part done by human-written code is instead done by machine algorithms to create a model. This model is then used to derive outputs from further inputs.
So, for example, taking the Vickers instruments collection from the Borthwick: https://dlib.york.ac.uk/yodl/app/collection/detail?id=york%3a796319&ref=browse. You may want to recognise optical instruments, for example, telescopes and microscopes. You could provide training data with a set of labelled images (output data) to create a model. You could then input additional images and see if the optical instruments are identified by the model.
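The supervised pattern can be sketched in a few lines of Python. This is a toy nearest-centroid classifier, not how an image recognition service actually works: the two numeric features and their values are invented for illustration, and real models learn from pixels rather than hand-picked measurements. The point is only the shape of the process: labelled inputs produce a model, and the model then labels new inputs.

```python
from collections import defaultdict
from math import dist

def train(examples):
    """Build a model from (features, label) pairs: one average point per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Return the label whose average point is closest to the new input."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical training data: (length-to-width ratio, number of lenses) -> label
training_data = [
    ((8.0, 2.0), "telescope"),
    ((9.0, 3.0), "telescope"),
    ((2.0, 4.0), "microscope"),
    ((2.5, 5.0), "microscope"),
]

model = train(training_data)
print(predict(model, (7.5, 2.0)))
```

With only four training examples the model is easily fooled, which is the same limitation, in miniature, that applies to small archival training sets.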
Of course, the Borthwick may have catalogued these photographs already (in fact, they have been catalogued), so we know which are telescopes and which are micrometers or lenses or eye pieces. If you have a specialist collection, essentially focused on a subject, and the photographs are already labelled, then there may be less scope for improving discoverability for that collection by using machine learning. If the Borthwick had only catalogued a few boxes of photographs, they might consider using machine learning to label the remaining photographs. However, a big advantage is that the enhanced telescope recognising model can now be used on all the images from the Archives Hub to discover and label images containing telescopes from other collections. This is one of the great advantages of applying ML across the aggregated data of the Archives Hub. The results of machine learning are always going to be better with more training data, so ideally you would provide a large collection of labelled photographs in order to teach the algorithm. Archive collections may not always be at the kind of scale where this process is optimised. Providing good training data is potentially a very substantial task, and does require that the content is labelled. It is possible to use models that are already available without doing this training step, but the results are likely to be far less useful.
Another scenario that could lend itself to ML is a more varied collection, such as Borthwick’s University photograph collection. These have been catalogued, but there is potential to recognise various additional elements within the photographs.
The above photograph has been labelled as a construction site. ML could recognise that there are people in the photograph, and this information could be added, so a researcher could then look for construction site with people. Recognising people in a photograph is something that many ML tools are able to do, having already been trained on this. However, archive collections are often composed of historic documents and old photographs that may not be as clear as modern documents. In addition, the models will probably have been trained with more current content. This is likely to be an issue for archives generally. For models to be effective, they need to have been trained with content that is similar to the content we want to catalogue.
The benefit of adding labels to photographs via ML to potentially enhance the catalogue and help with discoverability is going to depend upon a number of factors: how well the image is already catalogued, whether training data can be provided to improve the algorithm, and how well ML can then pick out features that might be of use.
The drawings of fossil fish at the Geological Society are another example of a very subject specific collection. We put a few of these through some out-of-the-box ML tools. These tools have been pre-trained on large diverse datasets, but we have not done any additional training ourselves yet, so you could see them as generalists in recognising entities rather than specialists with any particular material or topic.
In this case the drawing has been tagged with ‘fossil’, which could be useful if you wanted to identify fossil drawings from a varied collection of drawings. It has also tagged this with archaeology and art, both of which could potentially be useful, again depending upon the context. The label of soil is a bit more problematic, and yet it is the one that has been added with 99.5% certainty. However, a bit of training to tell the algorithm that ‘soil’ is not correct may remove this tag from subsequent drawings.
This example illustrates the above point that a subject specific collection may be tagged with labels that are already provided in the catalogue description. It also shows that machine learning is unlikely to ever be perfectly accurate (although there are many claims it outperforms humans in a number of areas). It is very likely to add labels that are not correct. Ideally we would train the model to make fewer mistakes – though it is unlikely that all mistakes will be eliminated – so that does mean some level of manual review.
Tagging an image using ML may draw out features that would not necessarily be added to the catalogue – maybe they are not relevant to the repository’s main theme, and in the end, it is too time-consuming for cataloguers themselves to describe each photo in great detail as part of the cataloguing process.
The above image is a simple one with not too much going on. It will be discoverable on the Queen’s website through a search for ‘china’ or ‘robert hart’ for example, but tagging could make it discoverable for those interested in plants or architectural features. Again, false positives could be a problem, so a key here is to think about levels of certainty and how to manage expectations.
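Thinking about levels of certainty could start with something as simple as sorting labels by the confidence score the tool reports. The sketch below assumes the tool returns a percentage for each label; the thresholds of 90 and 60 are arbitrary values chosen for illustration, and the example labels are invented.

```python
def triage_labels(labels, accept_at=90.0, review_at=60.0):
    """Split (label, confidence %) pairs into accepted, needs-review and rejected."""
    accepted, review, rejected = [], [], []
    for name, confidence in labels:
        if confidence >= accept_at:
            accepted.append(name)
        elif confidence >= review_at:
            review.append(name)
        else:
            rejected.append(name)
    return accepted, review, rejected

# Hypothetical output from an image tagging tool
labels = [("plant", 97.2), ("architecture", 74.8), ("vehicle", 31.0)]
accepted, review, rejected = triage_labels(labels)
print("accept:", accepted, "review:", review, "reject:", rejected)
```

A threshold is not a guarantee: the ‘soil’ tag on the fossil drawing was added with 99.5% certainty, so even the accepted group would still benefit from some manual spot checks.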
As mentioned above, archival images are often difficult to interpret. They may be old and faded, and they may also represent features or items that an algorithm will not recognise.
In the above example from Brighton Design Archives, the photograph is from a set made of an exhibition of 1947, Things In Their Home Setting. The AWS Rekognition image service has no problem with the chair, but it has confidently identified the oven as a refrigerator. This could probably be corrected by providing more training data, or giving feedback to improve the understanding of the algorithm and its knowledge of 1940s kitchen furniture. But by the time you have given enough training data for the model to tell a cooker from a fridge from a washing machine, it might have been easier simply to do the cataloguing manually.
Another option for machine learning is optical character recognition. This has been around for a while, but it has improved substantially as a result of the machine learning approach. Again, one of the challenges for archives is that many items within the collections are handwritten, faded, and generally not easily readable. So, can ML prove to be better with these items than previous OCR approaches?
A tool like Transkribus can potentially offer great benefits to archives, and is seen as a community-driven effort to create, gather and share training data. We hope to try out some experiments with it in the course of our project.
The above plan is from Lambeth Palace Library’s 19th century ecclesiastical maps. It can already be found searching for ‘clerkenwell’ or ‘st james parish’. But ML could potentially provide more searchable information.
The words here are fairly clear, so the character recognition using the Microsoft Azure ML service is quite good. Obviously the formatting is an issue in terms of word order. ‘James’ is recognised as ‘Iames’ due to the style of writing. ‘Church’ is recognised despite the style looking like ‘Chvrch’ – this will be something the algorithm has learnt. This analysis could potentially be useful to add to the catalogue because an end user could then search for ‘pentonville chapel’ or ‘northampton square’ and find this plan.
As well as looking at digital archives, we will be trying out examples with catalogue text. A great deal of archival cataloguing is legacy data, and archivists do not always have the time to catalogue to item level or to add index terms, which can substantially aid discoverability. So, it is tempting to look at ML as a means to substantially improve our catalogues. For example, to add to our index terms, which provide structured access points for end users searching for people, organisations, places and subjects.
In a traditional approach to adding subject terms to a catalogue, you might write rules. We have done this in our Names Project – we have written a whole load of rules in order to identify name, life dates, and additional data within index terms. We could have written even more rules – for example, to try to identify forename and surname. But it would be very difficult because the data does not present the elements of names consistently. We could potentially train an ML model with a load of names, tagging the parts of the name as forename, surname, dates, titles, epithets. But could an algorithm then successfully work out the parts of any subsequent names that we feed into it? It seems unlikely because there is no real consistency in how cataloguers input names. The algorithm might learn, for example, that a word, then a comma, then another word is surname, forename (Roberts, Elizabeth). But two words followed by a comma and another word could be surname + forename or forename + surname, (Vaughan Williams, Ralph; Gerald Finzi, composer). In this scenario, the best option may be to aim to use source data (e.g. the Virtual International Authority File) to compare our data to, rather than try to train a machine to learn patterns, when there really isn’t a model to provide the input.
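The difficulty with the rule-based approach is easy to reproduce. The following Python is a sketch of a single hypothetical rule, not one taken from the Names Project: it treats one or two words before a comma as the surname and the word after it as the forename.

```python
import re

# Hypothetical rule: "Surname, Forename", where the surname may be one or two words.
NAME_RULE = re.compile(r"^(?P<surname>\w+(?: \w+)?), (?P<forename>\w+)$")

def parse_name(entry):
    """Apply the rule; return the parts found, or None if the rule does not match."""
    match = NAME_RULE.match(entry)
    if match is None:
        return None
    return {"surname": match.group("surname"), "forename": match.group("forename")}

for entry in ["Roberts, Elizabeth", "Vaughan Williams, Ralph", "Gerald Finzi, composer"]:
    print(entry, "->", parse_name(entry))
```

The first two entries are split correctly, but the third comes back with ‘Gerald Finzi’ as the surname and ‘composer’ as the forename. The rule has no way of telling the two patterns apart, and a model trained on the same inconsistent data would face the same ambiguity.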
We may find that analysing text within a catalogue offers more promise.
Here is an example from an administrative history of the British Linen Group, a collection held by Lloyds Banking Group. The entity recognition is pretty good – people’s names, organisations, dates, places, occupations and other entities can be picked out fairly successfully from catalogues. Of course that is only the first step; it is how to then use that information that is the main issue. You would not necessarily want to apply the terms as index terms for example, as they may not be what the collection is substantially about. But from the above example you could easily imagine tagging all the place names with a ‘place’ tag, so that a place search could find them. So, a general search for Stranraer would obviously find this catalogue entry, but if you could identify it as a place name it could be included in the more specific place name search.
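As a rough sketch of the place tagging idea, the Python below builds a place name index from entity recognition output. The record IDs, the (text, entity type) format and the type names are all assumptions for illustration; real services return their results in their own structures.

```python
from collections import defaultdict

def build_place_index(records):
    """Map each lower-cased place name to the IDs of the records that mention it."""
    index = defaultdict(set)
    for record_id, entities in records.items():
        for text, entity_type in entities:
            if entity_type == "place":
                index[text.lower()].add(record_id)
    return index

# Hypothetical entity recognition output: record ID -> (text, entity type) pairs
records = {
    "rec-001": [("British Linen Group", "organisation"), ("Stranraer", "place")],
    "rec-002": [("Harrogate", "place")],
}

place_index = build_place_index(records)
print(sorted(place_index["stranraer"]))
```

A place search can then consult this index, while a general keyword search continues to run over the full text of the description.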
With machine learning it is very difficult and sometimes impossible to understand exactly what is happening and why. By definition, the machine learns and modifies its output. Whilst you can provide training data to give inputs and desired outputs, machine learning will always be just that….a machine learning as it goes along, and not simply working through a programme that a human has written. Supervised learning provides for the most control over the outputs. Unsupervised learning, and deep learning, are where you have much less control (we’ll come onto those in later posts).
It is only by understanding the algorithms and what they are doing that you can set up your environment for the best results. But that is where things can get very complicated. We are going to try to run some experiments where we do prepare the data, but learning how to do this is a non-trivial task. Hence one of the questions we are asking is ‘is Machine Learning worth the effort required in order to improve archival discoverability?’ We hope to get at least some way along the road to answering that question.
There are, of course, other pressing questions, not least the issue of bias, and concerns about energy use with machine learning as well as how to preserve the processes and outputs of ML and document the decision making. But there could be big wins in terms of saving time that can then be dedicated to other tasks. The increasing volumes of data that we have to process may make this a necessity. We hope to touch upon some of these areas, but this is a fairly small scale project and Machine Learning is one huge topic.
Under our new Labs umbrella, we have started a new project, ‘Images and Machine Learning’, which has three distinct and related strands.
We will be working on these themes with ten participants, who already contribute to the Archives Hub, and who have expressed an interest in one or more of these strands: Cardiff University, Bangor University, Brighton Design Archives at the University of Brighton, Queens University Belfast, the University of Hull, the Borthwick Institute for Archives at the University of York, the Geological Society, the Paul Mellon Centre, Lambeth Palace (Church of England) and Lloyds Bank.
This project is not about pre-selecting participants or content that meet any kind of criteria. The point is to work with a whole variety of descriptions and images, and not in any sense to ‘cherry pick’ descriptions or images in order to make our lives easier. We want a realistic sense of what is required to implement digital storage and IIIF display, and we want to see how machine learning tools work with a range of content. Some of the participants will be able to dedicate more time to the project, others will have very little time, some will have technical experience, others won’t. A successful implementation that runs beyond our project and into service will need to fit in with our contributors’ needs and limitations. It is problematic to run a project that asks for unrealistic amounts of time from people that will not be achievable long-term, as trying to turn a project into a service is not likely to work.
Over the years we have been asked a number of times about hosting content for our contributors. Whilst there are already options available for hosting, there are issues of cost, technical support, fit for purpose-ness, trust and security for archives that are not necessarily easily met.
Jisc can potentially provide a digital object store that is relatively inexpensive, integrated with the current Archives Hub tools and interfaces, and designed specifically to meet our own contributors’ requirements. In order to explore this proposal, we are going to invest some resource into modifying our current administrative interface, the CIIM, to enable the ingest of digital content.
We spent some time looking at the feasibility of integrating an archival digital object store with the current Jisc Preservation Service. However, for various reasons this did not prove to be a practical solution. One of the main issues is the particular nature of archives as hierarchical multi-level collections. Archival metadata has its own particular requirements. The CIIM is already set up to work with EAD descriptions and by using the CIIM we have full control over the metadata so that we can design it to meet the needs of archives. It also allows us to more easily think about enabling IIIF (see below).
The idea is that contributors use the CIIM to upload content and attach metadata. They can then organise and search their content, and publish it, in order to give it web address URIs that can be added to their archival descriptions – both in the Archives Hub and elsewhere.
It should be noted that this store is not designed to be a preservation solution. As said, Jisc already provides this service, and there are many other services available. This is a store for access and use, and for providing IIIF enabled content.
The metadata fields have not yet been finalised, but we have a working proposal and some thoughts about each field.
mandatory? individual vs batch?
preferably structured, options for approx. and not dated.
possibly a URI. option to add institution’s rights statement.
controlled list. values to be determined with participants. could upload a thesaurus. could try ML to identify type.
enable digital objects to be grouped e.g by topic or e.g. ‘to do’ to indicate work is required
unpublished/published. May refer to IIIF enabled.
unique URI of image (at individual level)
Proposed fields for the Digital Object Store
We need to think about the workflow and user interface. The images would be uploaded and not published by default, so that they would only be available to the DAO Store user at that point. On publication, they would be available at a designated URL. Would we then give the option to re-size? Would we set a maximum size? How would this fit in with IIIF and the preference for images of a higher resolution? We will certainly need to think about how to handle low resolution images.
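As a way of thinking through the workflow described above, here is a minimal sketch of the upload and publish steps. It is an illustration only, not the CIIM implementation: the class and field names, and the `https://example.org/dao` base address, are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical base address; the real designated URL pattern is not yet decided.
BASE_URL = "https://example.org/dao"


class Status(Enum):
    UNPUBLISHED = "unpublished"
    PUBLISHED = "published"


@dataclass
class DigitalObject:
    identifier: str
    filename: str
    # Uploads are not published by default, so only the store user can see them.
    status: Status = Status.UNPUBLISHED
    # The web address is only assigned at the point of publication.
    uri: Optional[str] = None

    def publish(self) -> str:
        """Mark the object as published and give it a web address."""
        self.status = Status.PUBLISHED
        self.uri = f"{BASE_URL}/{self.identifier}"
        return self.uri


def visible_to_public(obj: DigitalObject) -> bool:
    """Only published objects are available outside the store."""
    return obj.status is Status.PUBLISHED
```

In this sketch the URI returned by `publish()` is the value a contributor would add to an archival description. Questions such as re-sizing or a maximum size would sit on top of this basic model.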
International Image Interoperability Framework
IIIF is a framework that enables images to be viewed in any IIIF viewer. Typically, they can be sequenced, such as for a book, and they are zoomable to a very high resolution. At the heart of IIIF is the principle that organisations expose images over the web in a way that allows researchers to use images from anywhere, using any platform that speaks IIIF. This means a researcher can group images for their own research purposes, and very easily compare them. IIIF promotes the idea of fully open digital content, and works best with high resolution images.
There are very good reasons for the Archives Hub to get involved in IIIF, but there are challenges being an aggregator that individual institutions don’t face, or at least not to the same degree. We won’t know what digital content we will receive, so we have to think about how to work with images of varying resolutions. Our contributors will have different preferences for the interface and functionality. On the plus side, we are a large and established service, with technical expertise and good relationships with our contributors. We can potentially help smaller and less well-resourced institutions into this world. In addition, we are well positioned to establish a community of use, to share experiences and challenges.
One thing that we are very convinced by: IIIF is a really effective way to surface digital content and it is an enormous boon to researchers. So, it makes total sense for us to move into this area. With this in mind, Jisc has become a member of the IIIF Consortium, and we aim to take advantage of the knowledge and experience within the community – and to contribute to it.
This is a huge area, and it can feel rather daunting. It is also very complicated, and we are under no illusions: it will be a long road, probably with plenty of blind alleys. It is very exciting, but not without big challenges.
It seems as if ML is getting a bad reputation lately, with the idea that algorithms make decisions that are often unfair or unjust, or that are clearly biased. But the main issue lies with the data. ML is about machines learning from data, and if the data is inadequate, biased, or suspect in some way, then the outcomes are not likely to be good. ML offers us a big opportunity to analyse our data. It can help us surface bias and problematic cataloguing.
We want to take the descriptions and images that our participants provide and see what we can do with ML tools. Obviously we won’t do anything that affects the data without consulting with our contributors. But it is best with ML to have a large amount of data, and so this is an area where an aggregator has an advantage.
This area is truly exploratory. We are not aiming for anything other than the broad idea of improved discoverability. We will see if ML can help identify entities, such as people, places and concepts. But we are also open to looking at the results of ML and thinking about how we might benefit from them. We may conclude that ML only has limited use for us – at least, as it stands now. But it is changing all the time, and becoming more sophisticated. It is something that will only grow and become more embedded within cultural heritage.
Over the next several months we will be blogging about the project, and we would be very pleased to receive feedback and thoughts. We will also be holding some webinar sessions. These will be advertised to contributors via our contributors list, and advertised on the JiscMail archives-nra list.
IIIF is a model for presenting and annotating digital content on the Web, including images and audio/visual files. There is a very active global community that develops IIIF and promotes the principles of open, shareable content. One of the strengths of IIIF is the community, which is a diverse mix of people, including developers and information professionals.
Images are fundamental carriers of information. They provide a huge amount of value for researchers, helping us understand history and culture. We interact with huge numbers of images, and yet we do not always get as much value out of them as we might. Content may be digitised, but it is often within silos, where the end user has to go to a specific website to discover content and to view a specific image. It is not always easy or possible to discover, gather together, compare, analyse and manipulate images.
IIIF is a particularly useful solution for cultural heritage, where analysis of images is so important. A current ‘Towards a National Collection’ project has been looking at practical applications of IIIF.
The IIIF Solution
Exactly what IIIF enables depends upon a number of factors, but in general it enables:
Deep zoom: view and zoom in closely to see all the detail of an image
Sequencing: navigate through a book or sequence of archival materials
Comparisons: bring images together and put them side-by-side. This can enable researchers to bring together images from different collections, maybe material with the same provenance that has been separated over time.
Search within text: work with transcriptions and translations
Connections: connect to resources such as Wikidata
Use of different IIIF viewers: different viewers have their own features and facilities.
How It Works
The IIIF community tends to talk in terms of APIs. These can be thought of as agreed and structured ways to connect systems. If you have this kind of agreement then you can implement different systems, or parts of systems, to work with the same content, because you are sticking to an agreed structure. The basic principle is to store an image once (on a IIIF server) and be able to use it many times in many contexts.
IIIF is like a layer above the data stores that host content. The images are accessed through that IIIF layer – or through the IIIF APIs. This enables different agents to create viewers and tools for the data held in all the stores.
There are a few different APIs that make up the IIIF standard.
Image API: This API delivers the content (or pixels). The image is delivered as a URL, and the URL is structured in an agreed way.
Presentation API: This delivers information on the presentation of the material, such as the sequence of a book, for example, or a bundle of letters, and metadata about the object.
Content Search API: Allows searching within the text of an object.
Authentication API: Allows materials to be restricted by audience. So, this is useful for sensitive images or images under copyright that may have restrictions.
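To make the idea of an agreed URL structure more concrete, here is a small sketch of how an Image API address is put together. The pattern of identifier, region, size, rotation, quality and format comes from the IIIF Image API specification. The server address and image identifiers are made up for illustration, and the default size of `max` assumes version 3 of the API (version 2 used `full`).

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an Image API URL of the form
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # The identifier must be URL-encoded so that any slashes inside it
    # are not mistaken for path separators.
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base, identifier):
    """Address of the info.json document that describes the image."""
    return f"{base.rstrip('/')}/{quote(identifier, safe='')}/info.json"
```

For example, `iiif_image_url("https://example.org/iiif", "letter-001", region="0,0,500,500", size="250,")` gives `https://example.org/iiif/letter-001/0,0,500,500/250,/0/default.jpg`, which asks for the top left 500 by 500 pixels of the image, scaled to 250 pixels wide. Because every IIIF server answers the same pattern, a viewer does not need to know anything else about where the image is stored.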
As IIIF images are served in a standard way, any IIIF viewer can access them. Examples of IIIF viewers:
There are a whole host of viewers available, with various functionality. Most will offer the basics of zooming and cropping. There does seem to be a question around why so many viewers are needed. It might be considered a better approach for the community to work on a limited group of viewers, but there may be a politically driven desire to own and brand a viewer. In the end, a IIIF viewer can display any IIIF content, and each viewer will have its own features and functionality.
To find out more about how researchers can benefit from IIIF, you may like to watch this presentation on YouTube (59m): Using IIIF for research
In many projects, the aim is to digitise key materials, such as artworks of national importance and rare books and manuscripts, in order to provide a rich experience for end users. For instance, the Raphael Cartoons at the V&A are now available to explore different layers and detail, even enabling the infra-red view and surface view, to allow researchers to study the paintings in great depth. Images can easily be compared within your own workspace, by pulling in other IIIF images.
What is the Archives Hub planning to do with IIIF?
Hosting content: We are starting a 15-month project to explore options for hosting and delivering content. Integral to this project will be providing a IIIF Image API. As referenced above, this will mean that the digital content can be viewed in any IIIF viewer, because we will provide the necessary URLs to do so. One of the barriers for many archives is that images need to be on a IIIF server in order to utilise the Image API. It may be that Jisc can provide this service.
Creation of IIIF manifests: I’ll talk more about this in future blog posts, but the manifest is a part of the Presentation API. It contains a sequence (e.g. ordering of a book), as well as metadata such as a title, description, attribution, rights information, table of contents, and any other information about the objects that may be useful for presentation. We will be looking at how to create manifests efficiently and at scale, and the implications for representing hierarchical collections.
Providing an interface to manage content: This would be useful for any image store, so it does not relate specifically to IIIF. But it may have implications around the metadata provided and what we might put into a IIIF manifest.
Integrating a IIIF viewer into the Archives Hub: We will be providing a IIIF viewer so that the images that we host, and other IIIF images, can be viewed within the Archives Hub.
Assessing image quality: A key aim of this project is to assess the real-world situation of a typical archive repository in the UK, and how they can best engage with IIIF. Image resolution is one potential issue. Whilst any image can be served through the IIIF API, a lower resolution image will not give the end user the same sort of rich experience with zooming and analysing that a high resolution image provides. We will be considering the implications of the likely mix of different resolutions that many repositories will hold.
Looking at rights and IIIF: Rights are an important issue with archives, and we will be considering how to work with images at scale and ensure rights are respected.
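To give a feel for what a manifest looks like, here is a minimal sketch that builds one as a Python dictionary. It assumes version 3 of the Presentation API, where the order of the canvases in `items` plays the role of the sequence. The function name, the `example.org` addresses and the fixed image dimensions are assumptions for illustration, and this is not how we have decided to generate manifests.

```python
import json

PRESENTATION_CONTEXT = "http://iiif.io/api/presentation/3/context.json"


def build_manifest(manifest_id, title, image_urls, width, height, metadata=None):
    """Build a minimal manifest with one canvas per image, in order."""
    canvases = []
    for n, image_url in enumerate(image_urls, start=1):
        canvas_id = f"{manifest_id}/canvas/{n}"
        canvases.append({
            "id": canvas_id,
            "type": "Canvas",
            "width": width,
            "height": height,
            "items": [{
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/annotation",
                    "type": "Annotation",
                    # "painting" means the image is the content of the canvas.
                    "motivation": "painting",
                    "body": {
                        "id": image_url,
                        "type": "Image",
                        "format": "image/jpeg",
                        "width": width,
                        "height": height,
                    },
                    "target": canvas_id,
                }],
            }],
        })
    return {
        "@context": PRESENTATION_CONTEXT,
        "id": manifest_id,
        "type": "Manifest",
        "label": {"en": [title]},
        # Descriptive metadata is a list of label and value pairs.
        "metadata": [
            {"label": {"en": [key]}, "value": {"en": [value]}}
            for key, value in (metadata or {}).items()
        ],
        "items": canvases,
    }


def manifest_json(manifest):
    """Serialise the manifest for delivery over the web."""
    return json.dumps(manifest, indent=2)
```

Even this small example shows why hierarchical collections need some thought: a manifest describes one object with an ordered set of views, so we would need to decide which level of an archival hierarchy becomes a manifest.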
Projects often have a finite goal of providing some kind of demonstrator showing what is possible, and they often pre-select material to work with. We are taking a different approach. We are working with a limited number of institutions, but we have not pre-selected ‘good’ material. We are simply going to try things out and see what works and what doesn’t, what the barriers are and how to overcome them. The process of ingest of the descriptive data and images will be part of the project. We are looking to consider both scalability and sustainability for the UK archive sector, including all different kinds of repositories with different resourcing and expertise, and with a whole variety of content and granularity of metadata.
‘Visual AI and Printed Chapbook Illustrations at the National Library of Scotland’ – Dr Giles Bergel (University of Oxford / National Library of Scotland)
Giles’ team have been using machine learning (ML) on data from data.nls.uk. He outlined their three-part approach. First they find illustrations in manuscripts using Google’s EfficientDet object detection convolutional neural network seeded by manually pre-annotated images. They found the object detector worked extremely well after relatively few learning passes. There were a few false positives such as image ink showing through, marginalia and dog ears that would confuse the model.
Next they matched and grouped the illustrations using their “state of art” image search engine. Giles believes this shows that AI simplifies the task of finding things that are related in images. The final step was to apply classification algorithms with the VGG Image Classification Engine which uses Google as a source of labelled images. The lessons learned were:
AI requires well-curated data
Tools for annotating data are no less important than classifiers
Generic image models generalize well to printed books
‘Classical’ computer vision still works
AI software development benefits from end-to-end use-cases including data preparation, refinement, consulting with domain experts, public engagement etc.
‘Machine Learning and Cultural Heritage: What Is It Good Enough For?’ – John Stack (UK Science Museum)
John described how AI is being used as part of the Science Museum’s linked data work to collect data into a central knowledge graph. He noted that the Science Museum are doing a great deal of digitisation but currently they only have what John describes as ‘thin’ object data.
They are looking at using AI for name disambiguation as a first step before adding links to Wikidata and using entity recognition to enhance their own catalogue. It struck me that they, and we at the Hub, have been ‘doing AI’ for a while now with such technologies as entity recognition and OCR before the term AI was used. They are aiming to link through to Wikidata such that they can pull in the data and add it to their knowledge graph. This allows them to enhance their local data and apply ML to perform such things as clustering to draw out new insights.
John identified the main benefits of ML currently as suggesting possibilities and identifying trends and gaps. It’s also useful for visualisation and identifying related content as well as enhancing catalogues with new terminology. However there were ‘but’s. ML content needs framing and context. He noted that false positives are not always apparent and usually require specialist knowledge. It’s important to approach things critically and understand what can’t be done. John mentioned that they don’t have any ML driven features in production as yet.
This was followed by a Q&A where several issues came up. We need to consider how AI may drive new ways/modalities of browsing that we haven’t imagined yet. A major issue is the work needed to feed AI enhancements into user interfaces. Most work so far has been on backend data. AI tools need to integrate into day-to-day workflows for their benefits to be realised. More sector specific case-studies, training materials, tools and models are needed that are appropriate to cultural heritage. See the Heritage Connector blog for more information.
‘AI and the Photoarchive’ – John McQuaid (Frick Collection), Dr Vardan Papyan (University of Toronto), and X.Y. Han (Cornell University)
The Frick Collection have been using a PyTorch deep neural network to identify labels for their photo archive collection. They then compared the ML results as a validation exercise with internally crowdsourced data from their staff and curators captured by the Zooniverse software for the same photos.
They found that 67% of the ML labels matched with the crowdsourced validations which they considered a good result. They concluded that at present ML is most useful for ‘curatorial amplification’, but much human effort is still needed. This auto-generation of metadata was their main use case so far.
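A match rate like this is simple to calculate once both sets of labels are available for the same photos. The sketch below is only an illustration of that comparison, not the Frick team’s actual method, and the function name and any labels passed to it are assumptions.

```python
def agreement_rate(ml_labels, crowd_labels):
    """Share of photos where the ML label equals the crowdsourced label.

    Both arguments map a photo identifier to a single label. Only photos
    present in both sets are compared.
    """
    shared = ml_labels.keys() & crowd_labels.keys()
    if not shared:
        raise ValueError("no photos in common to compare")
    matches = sum(1 for photo in shared if ml_labels[photo] == crowd_labels[photo])
    return matches / len(shared)
```

A single figure hides which kinds of image the model gets wrong, so in practice it would be worth breaking the rate down by label before deciding where human effort is most needed.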
‘Keep True: Three Strategies to Guide AI Engagement’ – Thomas Padilla (Center for Research Libraries)
Thomas believes GLAMs have an opportunity to distinguish themselves in the AI space. He covered a number of themes, the first being the ‘Non-scalability imperative’. Scale is everywhere with AI. There’s a great deal of marketing language about scale, but we need to look at all the non-scalable processes that scale depends on. There’s a problematic dependency where scalability is made possible by non-scalable processes, resources and people. Heterogeneity and diversity can become a problem to be solved by ML. There’s little consideration that AI should be just and fair.
The second theme was ‘Neoliberal traps’ in AI. Who says ethical AI is ethical AI? GLAMs are trying to do the right thing with AI, but this is in the context of neoliberal moral regulation which is unfair and ineffective. He mentioned some of the good examples from the sector including from CILIP, Museums AI Network and his own ‘Responsible Operations’ paper.
He credited Melissa Terras for asking the question “How are you going to advocate for this with legislation?”. The US doesn’t have any regulations at the moment to get the private sector to get better. I mentioned the UK AI Council who are looking at this in the UK context, and the recent CogX event where the need for AI regulation was discussed in many of the sessions.
The final theme was ‘Maintenance as Innovation’. Information maintenance is a Practice of Care. There is an asserted dichotomy between maintenance and innovation that’s false. Maintenance is sustained innovation and we must value the importance of maintenance to innovation. He appealed to the origin of the word ‘innovation’ which derives from the Latin ‘innovare’ which means “to alter, renew, restore, return to a thing, introduce changes in the way something is done or made”. It’s not about creating from new. At the Hub we wholeheartedly endorse this view. We feel there’s far too much focus on the latest technology meme and we’ve had tensions within our own organisation along these lines. There may appear to be some irony here given the topic of this post, but we have been doing AI for a while as noted above. He referred us to https://themaintainers.org/ for more on this.
Roundtable discussion with the AEOLIAN Project Team
Dr Lise Jaillant, Dr Annalina Caputo, Glen Worthey (University of Illinois), Prof. Claire Warwick (Durham University), Prof. J. Stephen Downie (University of Illinois), Dr Paul Gooding (Glasgow University), and Ryan Dubnicek (University of Illinois).
Stephen Downie talked about the need for standardisation of ML extracted features so we can re-use these across GLAMs in a consistent way. The ‘Datasheets for Datasets’ paper was mentioned, which proposes “a short document to accompany public datasets, commercial APIs, and pretrained models”. This reminded me of Yves Bernaert’s talk about the related need for standardisation of carbon consumption measures. Both are critical issues and possible areas for Jisc to be involved in providing leadership. Another point that Stephen made is that researchers are finding they can’t afford the bill for ML processing. Finding hardware and resources is a big problem. As noted by ML guru Andrew Ng, we have a considerable data issue with AI and ML work. It may be that we need to work more on the data rather than wasting time, electricity and money re-creating expensive ML models. A related piece of work, ‘Lessons from Archives’, was also mentioned in this regard. There is a case for sharing model developments across the sector for efficiency and sustainability here.
I attended the ‘CogX Global Leadership Summit and Festival of AI’ last week, my first ‘in-person’ event in quite a while. The CogX Festival “gathers the brightest minds in business, government and technology to celebrate innovation, discuss global topics and share the latest trends shaping the defining decade ahead”. Although the event wasn’t orientated towards archives or cultural heritage specifically, we are doing work behind the scenes on AI and machine learning with the Archives Hub that we’ll say more about in due course. Most of what’s described below is relevant to all sectors as AI is a very generalised technology in its application.
My attention was drawn to the event by my niece Laura Stevenson who works at Faculty and was presenting on ‘How the NHS is using AI to predict demand for services’. Laura has led on Faculty’s AI driven ‘Early Warning System’ that forecasts covid patient admissions and bed usage for the NHS. The system can use data from one trust to help forecast care for a trust in another area, and can help with best and worst scenario planning with 95% confidence. It also incorporates expert knowledge into the modelling to forecast upticks more accurately than doubling rates can. Laura noted that embedding such a system into operational workflows is a considerable extra challenge on top of developing the technology.
The system includes an explainability feature showing various inputs and the degree to which they affect forecasting. To help users trust the tool, the interface has a model performance tab so users can see information on how accurate the tool has been with previous forecasts. The tool is continuing to help NHS operational managers make planning decisions with confidence and is expected to have lasting impact on NHS decision making.
Lila works at DeepMind who are looking to use AI to unlock whole new areas of science. Lila highlighted the role of the AI Council who are providing guidance to UK Government in regard to UK AI research. She talked about AlphaFold that has been addressing the 50-year-old challenge of protein folding. This is a critical issue as being able to predict protein folding unlocks many possibilities including disease control and using enzymes to break down industrial waste. DeepMind have already created an AI system that can help predict how protein folding occurs and have a peer reviewed article coming out soon. They are trying to get closer to the great challenge of general intelligence.
Yves focussed on company and corporate responsibility, starting his session with some striking statistics:
100 companies produce 70% of global carbon emissions.
40% of water consumption is by companies.
40% of deforestation is by companies.
There is 80 times more industrial waste than consumer waste.
20% of the acidification of the ocean is produced by 20 companies only.
Yves therefore believes that companies have a great responsibility, and technology can help to reduce climate impact. Data centres currently consume 2% of global electricity and this is growing exponentially, soon to be 8%. A single email produces on average 4g of carbon. Yves stressed that all companies have to accept that now is the time to come up with solutions and companies must urgently get on with solving this problem. IT energy consumption needs to be seen as something to be fixed. If we use IT more efficiently, emissions can be reduced by 20-30%. The solution starts with measurement which must be built into the IT design process.
We can also design software to be far more efficient. Yves gave the example of AI model accuracy. More accuracy requires more energy. If 96% accuracy is to be improved by just 2%, the cost will be 7 times more energy usage. To train a single neural network requires the equivalent of the full lifecycle energy consumption of five cars. These are massive considerations. Interpreted program code has much higher energy use than compiled code such as C++.
A positive note is that 80% of the global IT workload is expected to move to the cloud in the next 3 years. This will reduce carbon emissions by 84%. Savings can be made with cloud efficiency measures such as scaling systems down and outwards so as not to unnecessarily provision for occasional workload spikes. Cloud migration can save 60 million tons of carbon per year which is the equivalent of 20 million full lifecycle car emissions. We have to make this happen!
On where the big wins are, Yves said this is also in the IT area. Companies need to embed sustainability into their goals and strategy. We should go straight for the biggest spend. Make measurements and make changes that will have the most effect. Allow departments and people to know their carbon footprint.
* Update 28th June 2021 * – It was remiss of me not to mention that I’m working on a number of initiatives relating to green sustainable computing at Jisc. We’re looking at assessing the carbon footprint of the Archives Hub using the Cloud Carbon Footprint tool to help us make optimisations. I’m also leading on efforts within my directorate, Digital Resources, to optimise our overall cloud infrastructure using some of the measures mentioned above in conjunction with the Jisc Cloud Solutions team and our General Infrastructure team. Our Cloud CTO Andy Powell says more on this in his ‘AI, cloud and the environment’ blog post.
Ottoline believes that pushing the boundaries of how we support research needs to happen. Research is now more holistic. We draw in what we need to create value. The lone genius is a big problem for research culture and it has to go. Research is insecure and needs connectivity.
Ottoline believes AI will change everything about how research is done. It’s initially replacing mundane tasks but will move on to some more complex tasks such as spotting correlations. Eventually AI will be used as a tool to help understanding in a fundamental way. In terms of the existential risk of AI, we need to embed research as a collective endeavour and share effort to mitigate and distribute this risk. It requires culture change, joining up education and entrepreneurship.
We need to fund research in places that are not the usual places. Ottoline likes a football analogy where people are excited and engaged at all levels of the endeavour, whether in the local park or at the stadium. She suggests research at the moment is more like elitist polo, not football.
Ottoline mentioned that UKRI funding does allow for white spaces research. Anyone can apply. However, we need to create wider white spaces to allow research in areas not covered by the usual research categories. It will involve braided and micro careers, not just research careers. Funding is needed to support radical transitions. Ottoline agrees that the slow pace of publication and peer review is a big problem that undermines research. We need to broaden ways we evaluate research. Peer review is helpful but mustn’t slow things down.
Rob suggests we are in an era with AI where there are no clear rules of the road yet. The task for AI is to make it safe to ‘drive’ with regulations. We can’t stop facial recognition any more than we can stop gravity. We need datasets for governance so we can check accuracy against these for validation. Transparency is also required so we can validate algorithms. A big AI concern is the tribalism on social media.
Matt Hancock believes we are at a key moment with healthcare and AI technology where it’s now of vital importance. Data saves lives! The next thing is how to take things forward in the NHS. A clinical trials interoperability programme is starting that will agree standards to get more out of data use, and the Government will be updating its Data strategy soon. He suggests we need to remove silos and commercial incentives (sic). On the use of GP data he suggests we all agree on the use of data, but the question is how it’s used. The NHS technical architecture needs to improve for better use and building data into the way the NHS works. GPs don’t own patient data; the citizen does.
He said a data lake is being built across the NHS. Citizen interaction with health data is now greater than ever before and NHS data presents a great opportunity for research, and an enormous opportunity for the use of data to advance health care. He suggested we need to radically simplify the NHS information governance rules. On areas where not enough progress has been made, he mentioned the lack of separation of data layers is currently a problem. So many applications silo their data. There has also been a culture of individual data with personal curation. The UK is going for a TRE first approach: ‘Trusted Research Environment service for England’. Data is the preserve of the patient who will allow accredited researchers to use the data through the TRE. The clear preference of citizens is sharing data if they trust the sharing mechanism. Every person goes through a consent process for all data sharing. Acceptance requires motivating people with the lifesaving element of research. If there’s trust, the public will be on side. Researchers in this domain will have to abide by new rules to allow us to build on this data. He mentioned that Ben Goldacre will look at the line where open commons ends and NHS data ownership begins in the forthcoming Goldacre Review.