Employing Machine Learning and Artificial Intelligence in Cultural Institutions

As mentioned in my last post, we’re looking at the possibilities Artificial Intelligence and Machine Learning can offer the Archives Hub and the archives community in general. I also now have a wider role in Jisc as a ‘Technical Innovations Manager’, so my brief is to consider the wider technical and strategic possibilities of AI/ML for the Digital Resources directorate and Jisc as a whole. We continue to work behind the scenes, but we also keep a watch on cultural heritage and wider sector activities. As part of this I participated in the Aeolian Project’s ‘Online Workshop 1: Employing Machine Learning and Artificial Intelligence in Cultural Institutions’ yesterday.

‘Visual AI and Printed Chapbook Illustrations at the National Library of Scotland’ – Dr Giles Bergel (University of Oxford / National Library of Scotland)

Giles’ team have been using machine learning (ML) on data from data.nls.uk. He outlined their three part approach. First they find illustrations in manuscripts using Google’s EfficientDet object detection convolutional neural network seeded by manually pre-annotated images. They found the object detector worked extremely well after relatively few learning passes. There were a few false positives such as image ink showing through, marginalia and dog ears that would confuse the model.

Image showing false postive ml recognition
False positive ML recognition – ink showthrough

Next they matched and grouped the illustrations using their “state of art” image search engine. Giles believes this shows that AI simplifies the task of finding things that are related in images. The final step was to apply classification alogorithms with the VGG Image Classification Engine which uses Google as a source of labelled images. The lessons learned were:

  • AI requires well-curated data
  • Tools for annotating data are no less important than classifiers
  • Generic image models generalize well to printed books
  • ‘Classical’ computer vision still works
  • AI software development benefits from end-to-end use-cases including data preparation, refinement, consulting with domain experts, public engagement etc.

Machine Learning and Cultural Heritage: What Is It Good Enough For?’ – John Stack (UK Science Museum)

John described how AI is being used as part of the Science Museum’s linked data work to collect data into a central knowledge graph. He noted that the Science Museum are doing a great deal of digitisation but currently they only have what John describes as ‘thin’ object data.

They are looking at using AI for name disambiguation as a first step before adding links to wikidata and using entity recognition to enhance their own catalogue. It stuck me that they, and we at the Hub, have been ‘doing AI’ for a while now with such technologies as entity recognition and OCR before the term AI was used. They are aiming to link through to wikidata such that they can pull in the data and add it to their knowledge graph. This allows them to enhance their local data and apply ML to perform such things as clustering to draw out new insights.

John identified the main benefits of ML currently as suggesting possibilities and identifying trends and gaps. It’s also useful for visualisation and identifying related content as well as enhancing catalogues with new terminology. However there were ‘but’s. ML content needs framing and context. He noted that false positives are not always apparent and usually require specialist knowledge. It’s important to approach things critically and understand what can’t be done. John mentioned that they don’t have any ML driven features in production as yet.

Diagram showing the components of the Heritage Connector software

This was followed by a Q&A where several issues came up. We need to consider how AI may drive new ways/modalities of browsing that we haven’t imagined yet. A major issue is the work needed to feed AI enhancements into user interfaces. Most work so far has been on backend data. AI tools need to integrate into day-to-day workflows for their benefits to be realised. More sector specific case-studies, training materials, tools and models are needed that are appropriate to cultural heritage. See the Heritage Connector blog for more information.

AI and the Photoarchive‘ – John McQuaid (Frick Collection), Dr Vardan Papyan (University of Toronto), and X.Y. Han (Cornell University)

The Frick Collection have been using the PyTorch deep neural network to identify labels for their photo archive collection. They then compared the ML results as a validation exercise with internally crowdsourced data from their staff and curators captured by the Zooinverse software for the same photos.

Frick Collection workflow
Frick Collection ML workflow

They found that 67% of the ML labels matched with the crowdsource validations which they considered a good result. They concluded that at present ML is most useful for ‘curatorial amplification’, but much human effort is still needed. This auto-generation of metadata was their main use case so far.

Keep True: Three Strategies to Guide AI Engagement‘ – Thomas Padilla (Center for Research Libraries)

Thomas believes GLAMs have an opportunity to distinguish themselves in the AI space. He covered a number of themes, the first being the ’Non-scalability imperative’. Scale is everywhere with AI.  There’s a great deal of marketing language about scale, but we need to look at all the non-scalable processes that scale depends on. There’s a problematic dependency where scalability is made possible by non-scalable processes, resources and people. Heterogeneity and diversity can become a problem to be solved by ML. There’s little consideration that AI should be just and fair. 

The second theme was ‘Neoliberal traps’ in AI. Who says ethical AI is ethical AI? GLAMs are trying to do the right thing with AI, but this is in the context of neoliberal moral regulation which is unfair and ineffective. He mentioned some of the good examples from the sector including from CILIP, Museums AI Network and his own ‘Responsible Operations‘ paper.

He credited Melissa Terras for asking the question “How are you going to advocate for this with legislation?”. The US doesn’t have any regulations at the moment to get the private sector to get better. I mentioned the UK AI Council who are looking at this in the UK context, and the recent CogX event where the need for AI regulation was discussed in many of the sessions.

The final theme was ‘Maintenance as Innovation’. Information maintenance is a Practice of Care. There is an asserted dichotomy between maintenance and innovation that’s false. Maintenance is sustained innovation and we must value the importance of maintenance to innovation. He appealed to the origin of the word ‘innovation’ which derives from the latin ‘innovare’ which means “to alter, renew, restore, return to a thing, introduce changes in the way something is done or made”. It’s not about creating from new. At the Hub we wholeheartedly endorse this view. We feel there’s far too much focus on the latest technology meme and we’ve had tensions within our own organisation along these lines. There may appear to be some irony here given the topic of this post, but we have been doing AI for a while as noted above. He referred us to https://themaintainers.org/ for more on this.

Roundtable discussion with the AEOLIAN Project Team

Dr Lise Jaillant, Dr Annalina Caputo, Glen Worthey (University of Illinois), Prof. Claire Warwick (Durham University), Prof. J. Stephen Downie (University of Illinois), Dr Paul Gooding (Glasgow University), and Ryan Dubnicek (University of Illinois).

Stephen Downie talked about the need for standardisation of ML extracted features so we can re-use these across GLAMs in a consistent way. The ‘Datasheets for Datasets’ paper was mentioned that proposes “a short document to accompany public datasets, commercial APIs, and pretrained models”. This reminded me of Yves Bernaert’s talk about the related need for standardisation of carbon consumption measures. Both are critical issues and possible areas for Jisc to be involved in providing leadership. Another point that Stephen made is that researchers are finding they can’t afford the bill for ML processing. Finding hardware and resources is a big problem. As noted by ML guru Andrew Ng, we have a considerable data issue with AI and ML work . It may be that we need to work more on the data rather than wasting time, electricity and money re-creating expensive ML models. A related piece of work, ‘Lessons from Archives‘ was also mentioned in this regard. There is a case for sharing model developments across the sector for efficiency and sustainability here.

Artificial Intelligence – Getting the Next Ten Years Right

CogX poster with dates of the event

I attended the ‘CogX Global Leadership Summit and Festival of AI’ last week, my first ‘in-person’ event in quite a while. The CogX Festival “gathers the brightest minds in business, government and technology to celebrate innovation, discuss global topics and share the latest trends shaping the defining decade ahead”. Although the event wasn’t orientated towards archives or cultural heritage specifically, we are doing work behind the scenes on AI and machine learning with the Archives Hub that we’ll say more about in due course. Most of what’s described below is relevant to all sectors as AI is a very generalised technology in its application.

image of presenter

My attention was drawn to the event by my niece Laura Stevenson who works at Faculty and was presenting on ‘How the NHS is using AI to predict demand for services‘. Laura has led on Faculty’s AI driven ‘Early Warning System’ that forecasts covid patient admissions and bed usage for the NHS. The system can use data from one trust to help forecast care for a trust in another area, and can help with best and worst scenario planning with 95% confidence. It also incorporates expert knowledge into the modelling to forecast upticks more accurately than doubling rates can. Laura noted that embedding such a system into operational workflows is a considerable extra challenge to developing the technology.

Screenshot of Explainability Data
Example of AI explainability data from the Early Warning System (image ©Faculty.ai )

The system includes an explainability feature showing various inputs and the degree to which they affect forecasting. To help users trust the tool, the interface has a model performance tab so users can see information on how accurate the tool has been with previous forecasts. The tool is continuing to help NHS operational managers make planning decisions with confidence and is expected to have lasting impact on NHS decision making.

image of presenter

Responsible leadership: The risks and the rewards of advancing the state of the art in AI’ – Lila Ibrahim

Lila works at Deep Mind who are looking to use AI to unlock whole new areas of science. Lila highlighted the role of the AI Council who are providing guidance to UK Government in regard to UK AI research. She talked about Alphafold that has been addressing the 50 year old challenge of protein folding. This is a critical issue as being able to predict protein folding unlocks many possibilities including disease control and using enzymes to break down industrial waste. DeepMind have already created an AI system that can help predict how a protein folding occurs and have a peer reviewed article coming out soon. They are trying to get closer to the great challenge of general intelligence.

image of presenter

Sustainable Technologies, Green IT & Cloud‘ – Yves Bernaert, Senior Managing Director, Accenture

Yves focussed on company and corporate responsibility, starting his session with some striking statistics:

  • 100 companies produce 70% of global carbon emissions.
  • 40% of water consumption is by companies.
  • 40% of deforestation is by companies.
  • There is 80 times more industrial waste than consumer waste.
  • 20% of the acidification of the ocean is produced by 20 companies only.

Yves therefore believes that companies have a great responsibility, and technology can help to reduce climate impact. 2% of global electricity comes from data centres currently and is growing exponentially, soon to be 8%. A single email produces on average 4g of carbon. Yves stressed that all companies have to accept that now is the time to come up with solutions and companies must urgently get on with solving this problem. IT energy consumption needs to be seen as something to be fixed. If we use IT more efficiently, emissions can be reduced by 20-30%. The solution starts with measurement which must be built into the IT design process.

We can also design software to be far more efficient. Yves gave the example of AI model accuracy.  More accuracy requires more energy. If 96% accuracy is to be improved by just 2%, the cost will be 7 times more energy usage. To train a single neural network requires the equivalent of the full lifecycle energy consumption of five cars. These are massive considerations. Interpreted program code has much higher energy use than compiled code such as C++.

A positive note is that 80% of the global IT workload is expected to move to the cloud in the next 3 years. This will reduce carbon emissions by 84%. Savings can be made with cloud efficiency measures such as scaling systems down and outwards so as not to unneccessarily provision for occasional workload spikes. Cloud migration can save 60 million tons carbon per year which is the equivalent of 20 million full lifecycle car emissions. We have to make this happen!

On where are the big wins, Yves said this is also in the IT area. Companies need to embed sustainability into their goals and strategy. We should go straight for the biggest spend. Make measurements and make changes that will have the most effect. Allow departments and people to know their carbon footprint.

* Update 28th June 2021 * – It was remiss of me not to mention that I’m working on a number of initiatives relating to green sustainable computing at Jisc. We’re looking at assessing the carbon footprint of the Archives Hub using the Cloud Carbon Footprint tool to help us make optimisations. I’m also leading on efforts within my directorate, Digital Resources, to optimise our overall cloud infrastructure using some of the measures mentioned above in conjuction with the Jisc Cloud Solutions team and our General Infrastructure team. Our Cloud CTO Andy Powell says more on this in his ‘AI, cloud and the environment‘ blog post.

image of presenter

Future of Research’ – Prof. Dame Ottoline Leyser, CEO, UK Research and Innovation (UKRI)

Ottoline believes that pushing the boundaries of how we support research needs to happen. Research is now more holistic. We draw in what we need to create value. The lone genius is a big problem for research culture and it has to go. Research is insecure and needs connectivity.

Ottoline believes AI will change everything about how research is done. It’s initially replacing mundane tasks but will some more complex tasks such as spotting correlations. Eventually AI will be used as a tool to help understanding in a fundamental way. In terms of the existential risk of AI, we need to embed research as collective endeavour and share effort to mitigate and distribute this risk. It requires culture change, joining up education and entrepreneurship.

We need to fund research in places that are not the usual places. Ottoline likes a football analogy where people are excited and engaged at all levels of the endeavour, whether in the local park or at the stadium. She suggests research at the moment is more like elitist Polo not football.

Ottoline mentioned that UKRI funding does allow for white spaces research. Anyone can apply. However, we need to create wider white spaces to allow research in areas not covered by the usual research categories. It will involve braided and micro careers, not just research careers. Funding is needed to support radical transitions. Ottonline agrees that the slow pace of publication and peer review is a big problem that undermines research. We need to broaden ways we evaluate research. Peer Review is helpful but mustn’t slow things down.

image of presenter

Ethics and Bias in AI‘ – Rob Glaser, CEO & Founder of RealNetworks

Rob suggests we are in an era with AI where there are no clear rules of the road yet. The task for AI is to make it safe to ‘drive’ with regulations. We can’t stop facial recognition any more than we can stop gravity. We need datasets for governance so we can check accuracy against these for validation. Transparency is also required so we can validate algorithms.  A big AI concern is the tribalism on social media.

image of presenter

‘AI and Healthcare‘ – Rt. Hon. Matt Hancock

Matt Hancock believes we are at a key moment with healthcare and AI technology where it’s now of vital importance. Data saves lives! The next thing is how to take things forward in NHS. A clinical trials interoperability programme is starting that will agreed standards to get more out of data use, and the Government will be updating it’s Data strategy soon. He suggests we need to remove silos and commercial incentives (sic). On the use of GP data he suggests we all agree on the use of data, but the question is how it’s used. The NHS technical architecture needs to improve for better use and building data into the way the NHS works. GPs don’t own patient data, it is the citizen.

He said a data lake is being built across the NHS. Citizen interaction with health data is now greater than ever before and NHS data presents a great opportunity for research, and an enormous opportunity for the use of data to advance health care. He suggested we need to radically simplify the NHS information governance rules. On areas where not enough progress has been made, he mentioned the lack of separation of data layers is currently a problem. So many applications silo their data. There has also been a culture of Individual data with personal curation. The UK is going for a TRE first approach: ‘Trusted Research Environment service for England‘. Data is the preserve of the patient who will allow accredited researchers to use the data through the TRE. The clear preference of citizens is sharing data if they trust the sharing mechanism. Every person goes through a consent process for all data sharing. Acceptance requires motivating people with the lifesaving element of research. If there’s trust, the public will be on side. Researchers in this domain with have to abide by new rules to allow us to build on this data. He mentioned that Ben Goldacre will look at the line where open commons ends and NHS data ownership begins in the forthcoming Goldacre Review.

Is Linked Data an Appropriate Technology for Implementing an Archive’s Catalogue?

Here at the Archives Hub we’ve not been so focussed on Linked Data (LD) in recent years, as we’ve mainly been working on developing and embedding our new system and workflows. However, we have continued to remain interested in what’s going on and are still looking at making Linked Data available in a sustainable way. We did do a substantial amount of work a number of years back on the LOCAH project from which we provided a subset of archival linked data at data.archiveshub.ac.uk.  Our next step this time round is likely to be embedding schema.org markup within the Hub descriptions. We’ve been closely involved in the W3C Schema Architypes Group activities, with Archives Hub URIs forming the basis of the group’s proposals to extend the “Schema.org schema for the improved representation of digital and physical archives and their contents”.

We are also aiming to reconnect more closely with the LODLAM community generally, and to this end I attended a TNA ‘Big Ideas’ session ‘Is Linked Data an appropriate technology for implementing an archive’s catalogue?’ given by Jean-Luc Cochard of the Swiss Federal Archives. I took a few notes which I thought it might be useful to share here.

Why we looked at Linked Data?

This was initially inspired by the Stanford LD 2011 workshop and the 2014 Open data.swiss initiative. In 2014 they built their first ‘aLOD’ prototype – http://alod.ch/

The Swiss have many archive silos from which they transformed the content of some systems to LD and then were able to merge. They created basic LD views, Jean-Luc noting that the LD data is less structured than data in the main archival systems, an example of which is e.g. http://data.ge.alod.ch/id/archivalresource/adl-j-125

They also developed a new interface http://alod.ch/search/ with which they were trying for an innovative approach to presenting the data such as providing a histogram with dates.  It’s currently just a prototype interface running off SPARQL with only 16,000 entries so far.

They are also now currently implementing a new archival information system (AIS) and are considering LD technolgy for the new system, but may go with a more conventional database approach. The new system has to work with the overall technical architecture.

Linked data maturity?

Jean-Luc noted that they expect that in three years born digital will greatly expand by factor of ten, though 90% of the archive is currently analogue. The system needs to cope with 50M – 1.5B triples. They have implemented Stardog triple stores 5.0.5 and 5.2. The larger configuration is a 1 TB RAM, 56 CPU and 8 TB disk machine.

As part of performance testing they have tried loading the system with up to 10 Billion triples and running various insert, delete and query functions. The larger config machine allowed 50M triple inserts in 5 min. 100M plus triples took 20min to insert. With the update function things were found to be quite stable.  They then combined querying with triple insertions at the same time, and this highlighted some issues with slow insertions with a smaller machine. They also tried full text indexing with the larger config machine. They got very variable results with some very slow response times with the insertions, finding the latter was a bug in the system.

Is Linked Data adequate for the task?

A key weakness of their current archival system is that you can only assign records to one provenance/person. Also, their current system can’t connect records to other databases, so they have the usual silo problem. Linked data can solve some of these problems. As part of the project they looked at various specs and standards:

BIBFRAME v2.0 2016
Europeana EDM released 2014.
EGAD activities – RiC-CM -> RiC-O based on OWL (Record in context)
A local initiative- Matterhorn RDF Model.  Matterhorn uses existing technologies, RDA, BPMN, DC, PREMIS. There is a first draft available.

They also looked at relevant EU R&D projects: ‘Prelia’, on preservation of LD and ‘Diachron’ – managing evolution and preservation of LD.

Jean-Luc noted that the versatility of LD is appealing for several reasons –

  • It can be used at both the data and metadata levels.
  • It brings together multiple data models.
  • It allows data model evolution.
  • They believe it is adequate to publish archive catalogue on the web.
  • It can be used in closed environment.

Jean-Luc  mentioned a dilemma they have between RDF based Triple stores and graph databases. Graph databases tend to be proprietary solutions, but have some advantages. Graph databases tend to use ACID transactions intended to guarantee validity even in the event of errors, power failures, etc., but they are not sure how ACID reliable triple stores are.

Their next step is expert discussion of a common approach, with a common RDF model. Further investigation is needed regarding triple store weaknesses.

Exploring British Design at the Europeana AGM 2015

I’m just back from another enjoyable and useful Europeana Network Association event where I gave a four minute ‘Ignite Talk’ on our recently completed ‘Exploring British Design’ project that Pete and Jane worked on. As it was such a short talk, I wanted make sure I got the timing right, so actually wrote the talk out. I think it gives quite a good summary of the project, as well as mentioning our connection with Europeana, so I thought it would be worth posting it here along with a link to the slides:

“Hello, my name is Adrian Stevenson and I’m a Senior Technical Coordinator working for Jisc in the UK.

[Introduction slide]

Today I want to briefly outline a one year project we’ve recently completed called ‘Exploring British Design’ which was funded by the Arts and Humanities Research Council.

The technical work and front-end interface for Exploring British Design was developed by the Archives Hub based in the UK. The Hub aggregates archival descriptions from about 280 institutions in the UK, from the very large such as the British Library to the very small such as the Shakespeare’s Globe Theatre, making these archives available to be searched through our website, APIs and findable on Google. For some institutions, the Archives Hub provides their only web presence, so it’s an important service for the archives sector in the UK.

For ‘Exploring British Design’ we collaborated with one of our enthusiastic contributors, the Brighton Design Archive, based at the University of Brighton. We used the ‘Britain Can Make It’ exhibition from 1946 as a focal point because the Archive has rich collections relating to this exhibition.

So what’s the connection with Europeana? The Archives Hub is in the process of contributing data to the Archives Portal Europe. The plan is that the portal data will be available through Europeana at some point in the future.

[Home page slide]

So lets have a look. This is the home page of the website. You can see that we take people, i.e. the designers and architects, their organisations, and the events they were involved with, such as the exhibition as the starting points, i.e. not the archive records as such.

What’s unique about this project is that we’re going beyond the record as being about about one person, one organisation and having one focus. The reality is that archives are about the connections between all sorts of people, places, and events, such as exhibitions, and much of this information is effectively ‘locked in’ the archival records. This is what we’re trying to draw out.

The idea is that anything can be a primary focus:  people, organisations, places, events or archive collections. Some of you may recognise this as an idea relating to linked data, and indeed this is loosely the approach we took for the under the hood implementation. We also looked at an archival name authority standard called EAC-CPF to help with this.

[Designer slide]

You see here how we’ve tried to emphasise the relationship types, such as ‘friend of’, ‘collaborates with, ‘colleague of’ and so on. Researchers are most interested in people, events, etc. not in archives per se.

[Exhibition slide]

This is a view of the exhibition page, focussing in on it as an event in its own right with a location, related people, etc. This sort of information hasn’t historically been captured all that usefully in archival descriptions.

[Visualisation slide]

We included visualisations, but these actually fall far short of the complexity of the relationships. It’s quite hard to get these to work effectively, but they give a sense of the relationships between architect Jane Drew and Le Corbusier, or even Croydon High School for Girls.

So hopefully you can get a sense of how we’ve tried to present researchers with more flexible routes through the connections we created, helping to surface relationships between people, organisations and events that were effectively hidden in the more traditional document-based way of presenting information.”

There was an excellent reception in the evening at the Rijksmuseum where we were lucky enough to get a private view of the ‘Gallery of Honour’. It was a great opportunity to get a picture by Rembrandt’s ‘Night Watch’ so we made the most. Thanks again to Europeana!

In front of the 'Night Watch
Adrian Stevenson and others in front of Rembrandt’s ‘Night Watch’ at the Rijksmuseum, Amsterdam.