Artificial Intelligence and Archives Seminar

Municipal Archive of Girona, Thursday, April 27, 2023

Yesterday I checked out some of the ‘Artificial Intelligence and Archives Seminar’ hosted by the Municipal Archive of Girona “within the framework of the Faber-Llull Residency (Olot, Catalonia) and the project InterPARES Trust AI of the University of British Columbia (Vancouver, Canada), and with the collaboration of the Society of Catalan Archivists”. There were some useful things discussed in this still quite new area of AI, so I thought I’d share my notes.

Promo image for the seminar

InterPARES Trust AI Project (https://interparestrustai.org/), Muhammad Abdul-Mageed

The Trust AI goals are to:

  1. Identify specific AI technologies that can address critical records and archives challenges
  2. Determine the benefits and risks of using AI technologies on records and archives
  3. Ensure that archival concepts and principles inform the development of responsible AI
  4. Validate outcomes from Objective 3 through case studies and demonstrations

Muhammad focussed on trustworthiness as an issue for Archives. They are looking at using AI to assess and verify the authenticity of Archives through time. The essential research question: can we develop artificial intelligence for carrying out competently and efficiently all records and archives functions while respecting the nature and ensuring the continuing trustworthiness of the record?

He noted that a fundamental difference between analogue and digital records is the fact that analogue materials can be proven and verified on face value and rarely need extrinsic evidence. However, for digital materials, extrinsic elements such as metadata are needed. They rely on ‘circumstantial’ evidence such as the integrity of the hosting system as well as the politics, procedures and technology surrounding the digital record.

Muhammad suggests that off-the-shelf tools are not well suited to archives, so within the Archives profession we will have to develop the systems ourselves. We are the only ones who know what to do because we are the professionals. Developers need to talk to archives professionals to find out what they want and design appropriate AI tools for them. The tools need to respect the trustworthiness of the records. The project is looking to influence the development of responsible tools.

The project looks to provide a wealth of tools and code. A very important aspect of the project is training the community. Muhammad suggested that the Archives profession will have to do a great deal of training to engage with AI tools and their possibilities.

Linking AI to Archives and Records, Peter Sullivan

The aim of the talk was to look at combining archival concepts and principles with AI. Peter used the lens of Diplomatics to consider AI solutions and how AI may interact with different components of the record including the context, act, persons, procedure, form and archival bond. Which parts of the archival record are impacted by AI and how does this inform the design of AI tools that respect diplomatic theory?

The most important component is the ‘archival bond’ which covers how aspects of records are related to each other. AI may be poor at looking at records in context of other records, and may not be able to respect the archival bond. Also, AI may not respect the context of the creation of the records and may not be aware of different levels of appraisal used.

AI may be helpful where there are different variations of names and fuzzy matching can be used to reconcile names. This aligns with the Archives Hub Names project. Dealing with records in aggregate may be somewhere AI is able to help, using topic modelling and clustering techniques. This is a use case we have identified ourselves and something we are looking at with the Archives Hub Labs Project. Finally he mentioned the interesting question of how we will archive the artefacts of AI developments themselves.
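As a rough illustration of the kind of fuzzy matching mentioned above, here is a minimal sketch using Python's standard `difflib` module. This is not the approach used by the speaker or by the Archives Hub Names project; the name variants, the `normalise` and `reconcile` helpers and the 0.85 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name, replace punctuation with spaces and sort its parts,
    so that 'Surname, Forename' and 'Forename Surname' compare as equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two name strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def reconcile(names, threshold=0.85):
    """Greedily group names: each name joins the first group whose first
    member it resembles closely enough, otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical variants of the same names, as they might appear in catalogues.
variants = [
    "Darby, John Nelson",
    "John Nelson Darby",
    "Müller, George",
    "George Muller",
    "Newton, Benjamin Wills",
]
for group in reconcile(variants):
    print(group)
```

A simple string comparison like this will not cope with initials or very different forms of a name, which is one reason why richer context (dates, places, related records) matters for reconciliation.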

Model for an AI-Assisted Digitisation Project, Peter Sullivan

Peter talked about how AI is being used to help with the archiving of audio recordings, providing AI generated metadata enrichment. He noted this is very time-consuming to do by hand. Different types of recordings create very different challenges for AI to analyse. For UNESCO audio they are using four models, one for language translation and three for text extraction and text summarisation.

AI and Archives: Basic Requirements, Pilar Campos and Eloi Puertas

The project is aiming to provide a resource for archive professionals to assess AI solutions to help guide decision-making and create recommendations. They will provide a checklist to assess AI tool performance. The rationale behind this is that there is a huge amount of interest and concern regarding AI, but a scarcity of implementation examples, along with a lack of knowledge of AI solutions for the professionals in the archives domain. There is also a degree of mistrust of the results of AI.

The expected results of the project are to provide AI knowledge in the archive domain and a list of potential risks for archivists. A SWOT analysis about AI from the Archives viewpoint will be provided, along with an assessment of the balance between our expectations and the risks.

Automated Transcription: Palaeography and AI, Thiara Alves and Leonardo Fontes

The talk was essentially about using AI for automated transcription. The speakers talked about using Transkribus for transcription of text from images of documents. They found that most algorithms weren’t good at detecting old versions of Portuguese and Spanish words. The speakers felt that the context provided by the archivist was necessary for the transcriptions to be useful.

First Steps and Main Expectations from CRDI’s Experience of AI, David Iglésias

David talked about a project looking at being able to search images that haven’t been catalogued, so they don’t have metadata unless it is created by using AI. This ability is very useful for a photographic archive. They work with the Europeana Kaleidoscope project to attempt to provide archival context for images.

AI also allows for innovative new approaches to presenting photographs in addition to the standard historical ways of doing so. AI can be used for clustering photos that appear to be similar or related in some way. This could be something that the Archives Hub could also look at.

The full seminar is available on YouTube.

ChatGPT advice on Archival Research

I imagine many of you have tried asking ChatGPT questions like these, but we thought it was worth writing this blog post for those who haven’t tried it out. For anyone who has managed to avoid the headlines around this technology, ChatGPT is an artificial intelligence chatbot developed by OpenAI and released in November 2022.

First of all, we wanted to try a question that focussed on using one archive, so we asked:

Can you give me advice on how to write a dissertation on Stanley Kubrick using primary sources or archives?

ChatGPT answered:

Certainly, here are some tips for writing a dissertation on Stanley Kubrick using primary sources or archives:

  1. Choose a research question: The first step in writing a dissertation is to identify a research question that you want to explore. This could be anything from analyzing Kubrick’s use of color in his films to examining his working methods and creative process. Make sure that your research question is focused enough to be manageable within the scope of your dissertation.
  2. Identify relevant primary sources and archives: Once you have a research question, you need to identify primary sources and archives that are relevant to your topic. This could include Kubrick’s personal papers, scripts, storyboards, correspondence, and production notes. You can also look for archives and collections related to Kubrick’s collaborators and contemporaries, such as other film directors, writers, or actors.
  3. Conduct a literature review: Before diving into primary sources and archives, conduct a thorough literature review to see what has already been published on your topic. This will help you identify any gaps in the existing research and refine your research question further. It will also give you a sense of the debates and issues within the field that your dissertation will need to address.
  4. Plan your research visits: Before visiting archives, you need to plan ahead and make sure that you have enough time to access and review the materials you need. Some archives have strict access requirements or limited opening hours, so check this in advance. Make a list of the specific documents or materials you need to examine, and plan your research visits accordingly.
  5. Analyze and evaluate primary sources: Once you have access to primary sources, you will need to analyze and evaluate the materials carefully. Make notes of important details, such as dates, locations, and people involved. Look for patterns, themes, and connections in the materials, and identify any contradictions or inconsistencies. Keep in mind the broader research question you are trying to answer, and look for evidence that can support your argument.
  6. Write your dissertation: Once you have reviewed and analyzed your primary sources, you can start writing your dissertation. Make sure to frame your analysis within the broader context of existing scholarship, and use primary sources to support your arguments and interpretations. Use quotations and examples from primary sources to illustrate your points and make your argument more convincing.

Writing a dissertation on Stanley Kubrick using primary sources and archives can be a rewarding and challenging task. Make sure to plan ahead, conduct thorough research, and use primary sources effectively to support your arguments and interpretations.

*****

We then wanted to see how well ChatGPT would summarise the role and value of Archives Hub, so we asked:

How do I use Archives Hub to help with my research?

ChatGPT answered:

The Archives Hub can be a valuable resource for researchers who are looking to explore archival collections held in institutions across the UK. Here are some tips on how to use Archives Hub to help with your research:

  1. Search the Archives Hub database: Start by using the search bar on the Archives Hub homepage to search for keywords, names, or locations that are relevant to your research topic. This will bring up a list of archives and collections that match your search terms. You can then click on each archive or collection to view more detailed information, including a description of the materials held, contact information for the holding institution, and links to related collections.
  2. Refine your search: You can refine your search results using the menus on the left-hand side of the page. These menus allow you to narrow your search by institution, keyword, date range, or subject. Use these filters to focus your search on the most relevant materials.
  3. Contact the holding institution: Once you have identified an archive or collection that seems relevant to your research, contact the holding institution to confirm the availability of the materials and to arrange a research visit if necessary. The contact information for each institution is listed on the Archives Hub website.
  4. Plan your visit: Before visiting an archive, make sure to plan ahead and think about what materials you want to examine. Familiarize yourself with the archive’s policies on access and handling of materials, and bring any necessary equipment, such as a camera or laptop. Make sure to take detailed notes during your visit, as well as copies or photographs of any relevant documents or materials.
  5. Follow ethical guidelines: When conducting research in archives, it is important to follow ethical guidelines and practices. This means properly citing your sources, respecting the privacy and confidentiality of individuals mentioned in archival materials, and obtaining permission to use any copyrighted materials. This will help to maintain the trust and cooperation of archival repositories and their staff.

By using the Archives Hub to identify collections and institutions that are relevant to your research, you can gain access to a vast array of archival materials held across the UK. Be sure to plan ahead, follow ethical guidelines, and take detailed notes during your research visits to make the most of these valuable resources.

*****

Both answers are very convincing and it is hard to argue that we would have done better if we had written them ourselves, though we might want to add more information to what is there. The guide for using the Archives Hub seems to utilise text from different parts of our website, including our Guide for Beginners, access information and search help. But the text is not lifted directly – it is far more sophisticated than that. There are many examples of text that does not echo anything we have said on our website and I can’t find the text elsewhere. For example, ChatGPT refers to possible dissertation subjects for Kubrick: ‘analyzing Kubrick’s use of color in his films to examining his working methods and creative process’. There are sentences that widen the scope, for example: ‘Make sure to frame your analysis within the broader context of existing scholarship, and use primary sources to support your arguments and interpretations.’ There are sentiments that I don’t recall seeing stated in quite the same way that ChatGPT has done, for example: ‘This means properly citing your sources, respecting the privacy and confidentiality of individuals mentioned in archival materials, and obtaining permission to use any copyrighted materials. This will help to maintain the trust and cooperation of archival repositories and their staff.’

It is easy to see why ChatGPT is seen as a means to write effectively. Maybe there are questions around what is left out of the above answers, but I would certainly be happy to use them as a basis for our own guidelines.

The Christian Brethren Archive

Archives Hub feature for April 2023

About Us

The Christian Brethren Archive (CBA) is part of the John Rylands Research Institute and Library Special Collections at The University of Manchester, England.  The CBA is a world-class collection relating to the Brethren Movement and to congregations which have their roots in the Brethren tradition. This huge resource spans over 250 years and contains literature and records in many different languages in addition to English. The archive grew organically from a small collection of papers donated to the University in 1979 by the influential, evangelical scholar, F.F. Bruce (1910-90), Rylands Professor of Biblical Criticism and Exegesis at The University of Manchester. The CBA is a living archive, and today we receive communications, enquiries and gifts of material from all around the world. 

Woolpit Sunday School staff and pupils, Woolpit, Suffolk, England, c.1900.

The collection is managed by a full-time archivist whose work is overseen by an Advisory Group made up of historians and library professionals with Brethren interests and concerns. 

Many of the Archive costs are funded through the kindness of private and charitable donations; the remainder are met by The University of Manchester.

Who Are The Christian Brethren?

The Brethren movement was formed by a group of independent Christian congregations that emerged out of Protestant Ireland in the 1820s. Notable early members were John Nelson Darby (1800-82), Benjamin Wills Newton (1807-99) and George Müller (1805-98). Doctrinal differences caused a split in 1848 which led to the establishment of two distinct Brethren streams, the Exclusive Brethren who were led by John Nelson Darby and the Open Brethren who were led by George Müller. The Exclusive Brethren initially established themselves in Plymouth, Devon, England, giving rise to the group being known as the Plymouth Brethren. Both Brethren streams continue to flourish around the globe today.

Lantern slide, ‘Drawing Room meetings in Dublin’ Ireland, late nineteenth century.

What’s In The Archive?

The archive holds over 7,000 manuscripts, 18,000 rare books, pamphlets, and tracts, and some 400 series of periodicals, dating from the early nineteenth century to the present, as well as photographs, films, and audio recordings.

There are personal papers relating to personalities among the Brethren, such as founding members John Nelson Darby and Benjamin Wills Newton, as well as records relating to assemblies across the United Kingdom, such as the Church of God in Belfast, Northern Ireland (1897-2018). Important organisations and events are also represented, for example Echoes International (formerly Echoes of Service, missionary support agency) est. 1872, the Devonshire Conferences of 1906 and 1907 (which discussed the terms of fellowship between gatherings of Open and Exclusive Brethren), the Christian Brethren Research Fellowship for 1962–81, and the Swanwick conferences of the 1970s, ’80s and ’90s.

We regularly collect born-digital records, many of which are newsletters and periodicals. These are stored in our digital preservation system, Preservica, and made accessible via University of Manchester Collections.

The Green Field, published by the Brethren Church of the Egyptian Arab Republic, 1991.

All our catalogued Christian Brethren archives, some 60 collections, are available via the Archives Hub under The University of Manchester’s Special Collections. All catalogued CBA holdings can be searched via the University’s Library Search.

What is the Archive’s Research Potential?

The collection provides a rich resource for many disciplines including religion and theology, human geography, culture, history, politics and gender, as well as humanitarian studies and post colonialism. 

Recent research has focussed on topics such as women and the Brethren, for example the writer and evangelist Grace Grattan Guinness (1877-1967). Grace documented her travels as she accompanied her husband, Henry Grattan Guinness, a well-known orator, on a five-year preaching tour of the world whilst on their honeymoon!

Grace Grattan Guinness’s honeymoon schedule, page one, 1903.

We regularly hold events which aim to show the amazing breadth of material within the CBA. For example, using items from the George Müller archive we participated in Histories of Care (March 2023), a collections encounter and public roundtable which reflected on the social care and experience of children throughout history and sought to understand how these histories might inform the shape of future childcare. George Müller was the founder and director of the Ashley Down Orphan Homes in Bristol, England. In his lifetime, he cared for 10,000 orphaned children.

A. E. Booth, A Chart on the Course of Time from Eternity to Eternity, originally published in 1896 by the Loizeaux Brothers, Bible Truth Depot, New York, USA.

Signs of the Times. Maps and Charts of History and Prophecy was a public talk and collections encounter (March 2023) which looked at Brethren thinking about ‘End Times’ or the end of the world and discussed apocalyptic texts and imagery from the 8th century to the present day. Led by historians Professor Crawford Gribben, Queen’s University, Belfast, and Dr Andrew Crome, Manchester Metropolitan University, the event provided a unique opportunity to see first-hand some of the CBA’s mysterious maps and charts of prophecy and to get an overview of their history and purpose. A selection of the maps and charts is on display in the Rylands Gallery at The John Rylands Research Institute and Library until 11 November 2023.

Some of the maps of prophecy and history on display in the Rylands Gallery.

Jane Speller, Curator, Christian Brethren Archive
Contact uml.special-collections@manchester.ac.uk

Related

Browse all The University of Manchester’s Special Collections descriptions to date on Archives Hub

The Christian Brethren Archive is located at The University of Manchester Library

John Rylands Research Institute and Library Special Collections

Library Search – search all Christian Brethren Archive resources

University of Manchester Collections – view digitised Brethren collections

Rylands Blog – read about Brethren collections

Previous Archives Hub features on The University of Manchester collections

The Editorial Correspondence of C.P. Scott in the Guardian archive

A Spring in Your Step

James Phillips Kay-Shuttleworth – pioneering educational reformer

Bicentenary of the birth of Charles Dickens

Charles Wesley (1707-88)

Robert Donat

All images copyright The University of Manchester. Reproduced with the kind permission of the copyright holders.

The Wellington Archive: forty years on

Archives Hub feature for March 2023

In 1983 the government allocated the papers of Arthur Wellesley, first Duke of Wellington, the long-serving politician and the premier soldier of his generation, to the University of Southampton under national heritage legislation. The collection arrived on 17 March of that year. This brought to Southampton the University’s first major manuscript collection, leading to the creation of an Archives Department and the development of a major strand of activity within the University Library.

Official opening of the Wellington Suite Archives accommodation: Mr Naylor, University Librarian, Professor Smith, Department of History (hidden), Mr. Woolgar, Archivist, and the Duke of Wellington looking at display of papers, 14 May 1983. [MS1/Phot/39/ph3526]

Composed of around 100,000 items that cover the Duke’s career as a soldier, statesman and diplomat from 1790 to his death in 1852, the collection bears witness to great military, political and social events of the time. It is exceptional among the papers of nineteenth-century figures for its size and scope.

Plan of Seringapatam at the time of its capture by Wellington, early 19th century [MS61/WP15/6]

Wellington’s time in India, 1798-1805, when he made his fortune and his name as a military commander, is well represented by three series of letter books, the first two series arranged chronologically, with the third, covering 1802-5, arranged by correspondent. The sections for the Peninsular War (1808-14) and for the Waterloo campaign provide an unrivalled source for the history of British participation. The collection also includes Wellington’s correspondence and papers for the congresses at the end of the Napoleonic wars and the allied occupation of France, 1815-18, a period when Wellington was a major player on the European political scene.

Letter from Wellington to Lord Bathurst after the battle of Vitoria using the phrase ‘scum of the earth’, 1813 [MS61/WP1/373/6]

Wellington was involved in politics throughout his career, serving as an MP in the Irish parliament from 1790 and as Chief Secretary for Ireland, 1807-9. There is considerable material for his political career post 1818, including two terms as prime minister, as well as for his role as Lord Lieutenant of Hampshire and as Commander in Chief of the Army. Amongst the extensive number of cabinet papers drafted in the Duke’s own hand is the memorandum written by Wellington and Peel setting out the details of the Catholic Emancipation Act of 1829. Material from his tenure as Lord Lieutenant of Hampshire includes death threats from Captain Swing dating from the Swing riots around the southern counties of England.

Letter to Wellington, signed “Swing” threatening assassination, n.d. c. 8 November 1830 [MS61/WP1/1159/114]

As the archive is from the great age of government by correspondence, as well as coinciding with a wider revolution in communication, it contains material from a wide cross section of society. Everyone wrote to the Duke of Wellington, offering the national hero their views on a whole range of subjects, asking for patronage, promotion or assistance, wishing to dedicate their works to him, or asking him to be the godfather of their children or to be allowed to name them after him. In response to one letter, Wellington noted with his usual acerbic wit the inconvenience of calling all boys born on his birthday by the name “Arthur”.

Cataloguing the Wellington Archive in the 1980s using BBC microcomputers.

The arrival of the Wellington Archive in 1983 was significant in another way, in that it marked the beginning of Southampton’s long involvement in automated archive catalogues. The Wellington Papers Database could claim to be one of the earliest, if not the earliest, online archive catalogues in the UK. Investigations into a system to support this were already underway in December 1982, prior to the arrival of the papers. In July 1983 the University decided to develop a manuscript cataloguing system using STATUS software and it was in use for cataloguing material early the following year. The cataloguing was done “offline” by the archivists on BBC microcomputers equipped with rudimentary word-processing packages – but no memory – and all text was saved onto floppy discs. It was subsequently transferred to an ICL mainframe computer for incorporation into the database by batch programme. This being the days prior to the WWW, the initial database was made available via the Joint Academic Network (JANET) and the public switched telephone network. It was initially scheduled to be made available 156 hours a week, rising to 168. In 2023 the catalogue of the Wellington Archive can be accessed in the Epexio Archive Catalogue, a new system that we launched in November 2021.

The collection also came with a major conservation challenge – some ten percent of the collection was so badly damaged it was unfit to handle and in a parlous state. Considerable progress has been made in addressing this. Important material is now available for research, including for the Peninsular War, papers for 1822 (for the Congress of Verona) and for Wellington as Prime Minister in 1829.  The badly degraded and mould-damaged bundles from 1832, significant as the time of the First Reform Act, are available for the first time since 1940.

19th Wellington Lecture – Martin Carthy.

The last forty years have also seen a great deal of outreach and activity focused on the Wellington Archive. As well as research and teaching sessions, drop-in sessions, events and exhibitions, the Archives Department has arranged six international Wellington congresses. In 2015 and 2017 Karen Robson and Professor Chris Woolgar presented a MOOC they had co-created relating to Wellington and Waterloo. And since 1989 there has been the annual Wellington Lecture with speakers or presenters ranging from Elizabeth Longford to Martin Carthy.

To mark Wellington 40 this year we shall be running a number of events. Follow our social media campaign in March, join us for exhibitions in June/July and October and perhaps take part in the open day event in July: https://www.eventbrite.co.uk/e/wellington-papers-40-behind-the-scenes-at-the-archives-tickets-528206288227?aff=ebdssbdestsearch

Karen Robson
Head of Archives and Special Collections
University of Southampton Library

Related

Wellington papers, mid-late seventeenth century, 1790-1852

Browse the University of Southampton Special Collections descriptions to date on the Archives Hub

Previous features by University of Southampton

All images copyright University of Southampton Special Collections. Reproduced with the kind permission of the copyright holders.

Machine Learning: training a model by creating a labelled dataset

In this post I will go through the steps we took to create a human-labelled dataset (i.e. naming objects within images), applying the labels to bounding boxes (showing where the objects are in the image) in order to identify objects and train an ML model. Note that the other approach, and one we will talk about in another post, is to simply let a pre-trained tool do the work of labelling without any human intervention. But we thought that it would be worthwhile to try the human labelling out before seeing what the out-of-the-box results are.

I used the photographs in the Claude William Jamson archive, kindly provided by Hull University Archives. This is a collection with a variety of content that lends itself to this kind of experiment.

An image from the Jamson archive

I used Amazon SageMaker for this work. In SageMaker you can set up a labelling job using the Ground Truth service. You give the job a name and provide the location of the source material – in this case, the folder containing the Jamson photographs, held in an S3 bucket (S3 being the Amazon Simple Storage Service). Images have to be jpg or png, so if you have tif images, for example, they have to be converted.
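To make the input step a little more concrete, here is a minimal sketch of preparing a list of images for a labelling job. As I understand it, Ground Truth can take an input manifest in JSON Lines form with one `source-ref` entry per image; the bucket name, the file keys and the `build_manifest` helper are hypothetical, and accepting `.jpeg` alongside `.jpg` and `.png` is my assumption.

```python
import json
from pathlib import PurePosixPath

# File types accepted for labelling (".jpeg" is an assumed alias of ".jpg").
SUPPORTED = {".jpg", ".jpeg", ".png"}


def build_manifest(bucket: str, keys: list[str]) -> tuple[str, list[str]]:
    """Return manifest text (one JSON object per line) for supported images,
    plus the list of keys that would need converting first."""
    lines, skipped = [], []
    for key in keys:
        if PurePosixPath(key).suffix.lower() in SUPPORTED:
            lines.append(json.dumps({"source-ref": f"s3://{bucket}/{key}"}))
        else:
            skipped.append(key)
    return "\n".join(lines), skipped


# Hypothetical bucket and object keys, for illustration only.
manifest, needs_conversion = build_manifest(
    "example-archive-bucket",
    ["jamson/photo_001.jpg", "jamson/photo_002.tif", "jamson/photo_003.PNG"],
)
print(manifest)
print("Convert before labelling:", needs_conversion)
```

The tif file is reported rather than silently dropped, so it can be converted and added later.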

Location and output information are added – I have specified that we are working with images.

I then decided on my approach. I trained the algorithm on a random sample of images from this collection, so that the labelled sample is a subset of the full Jamson Archive dataset of images we are working with. We can then use the ML model created from the subset to make object detection predictions for the rest of the dataset.

Random sample is selected, and I can also specify the size of the sample, e.g. 25%.
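Ground Truth does the sampling for you, but the idea can be sketched in a few lines of Python. This is only an illustration of splitting a collection into a 25% subset for labelling and a remainder for later predictions; the `split_sample` helper, the file names and the fixed seed are assumptions.

```python
import random


def split_sample(items, fraction=0.25, seed=42):
    """Split items into a random sample (to be labelled by hand) and the
    remainder (to be predicted later). A fixed seed keeps the split repeatable."""
    rng = random.Random(seed)
    sample_size = round(len(items) * fraction)
    chosen = set(rng.sample(range(len(items)), sample_size))
    to_label = [item for i, item in enumerate(items) if i in chosen]
    remainder = [item for i, item in enumerate(items) if i not in chosen]
    return to_label, remainder


# Hypothetical file names standing in for the photographs.
images = [f"photo_{n:03d}.jpg" for n in range(200)]
to_label, remainder = split_sample(images, fraction=0.25)
print(len(to_label), "images to label;", len(remainder), "left for prediction")
```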

Once I had these settings completed, I started to create the labels for the ‘Ground Truth’ job. You have to provide the full list of labels first, and you then select individual labels from it for each image. You cannot create the labels as you go. This immediately seemed like a big constraint to me.

Interface for adding labels, and a description of the task

I went through the photographs and decided upon the labels – you can only add up to 50 labels. It is probably worth noting here that ‘label bias’ is a known issue within machine learning. This is where the set of labelled data is not fully representative of the entirety of potential labels, and so it can create bias. This might be something we come back to, in order to think about the implications.
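One simple way to start thinking about this is to check the label list against the limit and to count how often each label is actually used. The sketch below is a rough first look rather than a proper test for bias; the helper names and the example annotations are made up for illustration.

```python
from collections import Counter

MAX_LABELS = 50  # the limit on labels for a job, as noted above


def check_label_list(labels):
    """Return the de-duplicated, sorted label list, or raise if it is too long."""
    unique = sorted(set(labels))
    if len(unique) > MAX_LABELS:
        raise ValueError(f"{len(unique)} labels given, but only {MAX_LABELS} are allowed")
    return unique


def label_shares(annotations):
    """Given {image: [labels applied]}, return each label's share of all
    applied labels, most common first."""
    counts = Counter(label for labels in annotations.values() for label in labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical annotations for two photographs.
example = {
    "photo_001.jpg": ["boat", "person", "person"],
    "photo_002.jpg": ["person", "chimney"],
}
print(check_label_list(["boat", "church", "hat", "tree", "boat"]))
for label, share in label_shares(example).items():
    print(f"{label}: {share:.0%}")
```

If one or two labels account for most of the boxes, that is a prompt to look again at what the label list leaves out.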

Creating a list of labels that I can then apply to each individual image

I chose to add some fairly obvious labels, such as boat or church. But I also wanted to try adding labels for features that are often not described in the metadata for an image, but nonetheless might be of interest to researchers, so I added things like terraced house, telegraph pole, hat and tree, for example.

Once you have the labels, there are some other options. You can assign to a labelling team, and make the task time-bound, which might be useful for thinking about the resources involved in doing a job like this. You can also ask for automated data labelling, which does add to the cost, so it is worth considering this when deciding on your settings. The automated labelling uses ML to learn from the human labelling. As the task will be assigned to a work team, you need to ensure that you have the people you want in the team already added to Ground Truth.

Confirming the team and the task timeout

Those assigned to the labelling job will receive an email confirming this and giving a link to access the labelling job.

Workers assigned to the job can now start to work to create the bounding boxes and add the labels

You can now begin the job of identifying objects and applying labels.

The interface for adding labels to images

First up I have a photograph showing rowing boats. I didn’t add the label ‘rowing boat’ as I didn’t go through every single photograph to find all the objects that I might want to label, so not a good start! ‘Boat’ will have to do. As stated above, I had to work with the labels that I created; I can’t add more labels at this stage.

I added as many labels as I could to each photograph, which was a fairly time intensive exercise. For example, in the image below I added not only boat and person but also hat and chimney. I also added water, which could be optimistic, as it is not really an object that is bound within a box, and it is rather difficult to identify in many cases, but it’s worth a try.

Adding labels using bounding boxes

I can zoom in and out and play with exposure and contrast settings to help me identify objects.

Bounding boxes with labels
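To think about what the labelling produces, it may help to picture each box as a small record. The sketch below uses a hypothetical `LabelledBox` shape (it is not the Ground Truth output format, and the coordinates are made up) and counts how many boxes were drawn per label, which is one simple way to spot labels that are thinly represented.

```typescript
// Hypothetical record for one hand-drawn box. This is not the Ground Truth
// output format, just enough structure to reason about the labels.
interface LabelledBox {
  image: string;  // file name of the photograph
  label: string;  // one of the predefined labels
  left: number;   // pixels from the left edge of the image
  top: number;    // pixels from the top edge of the image
  width: number;  // box width in pixels
  height: number; // box height in pixels
}

// Count how many boxes were drawn for each label.
function countByLabel(boxes: LabelledBox[]): { [label: string]: number } {
  // Object.create(null) avoids clashes with built-in property names.
  const counts: { [label: string]: number } = Object.create(null);
  for (const box of boxes) {
    counts[box.label] = (counts[box.label] || 0) + 1;
  }
  return counts;
}

// Made-up example data for two photographs.
const sampleBoxes: LabelledBox[] = [
  { image: "photo-001.jpg", label: "boat", left: 40, top: 310, width: 220, height: 90 },
  { image: "photo-001.jpg", label: "boat", left: 300, top: 330, width: 180, height: 80 },
  { image: "photo-001.jpg", label: "person", left: 120, top: 280, width: 40, height: 70 },
  { image: "photo-001.jpg", label: "hat", left: 128, top: 276, width: 20, height: 14 },
  { image: "photo-002.jpg", label: "chimney", left: 510, top: 60, width: 30, height: 55 },
];

const boxCounts = countByLabel(sampleBoxes);
for (const name of Object.keys(boxCounts)) {
  console.log(name + ": " + boxCounts[name]);
}
```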

Here is another example where I experimented with some labels that seem quite ambitious – I tried shopfront and pavement, for example, though it is hard to classify a shop from another house front, and it is hard to pin-point a pavement.

The more I went through the images, drawing bounding boxes and adding labels, the more I could see the challenges and wondered how the out-of-the-box ML tools would fare identifying these things. My aim in doing the labelling work was partly to get my head into that space of identification, and what the characteristics are of various objects (especially objects in the historic images that are common in archive collections). But my aim was also to train the model to improve accuracy. For an object like a chimney, this labelling exercise looked like it might be fruitful. A chimney has certain characteristics and giving the algorithm lots of examples seems like it will improve the model and thus identify more chimneys. But giving the algorithm examples of shop fronts is harder to predict. If you try to identify the characteristics, it is often a bay window and you can see items displayed in it. It will usually have a sign above, though that is indistinct in many of these pictures. It seems very different training the model on clear, full view images of shops, as opposed to the reality of many photographs, where they are just part of the whole scene, and you get a partial view.

There were certainly some features I really wanted to label as I went along. Not being able to do this seemed to be a major shortcoming of the tool. For example, I thought flags might be good – something that has quite defined characteristics – and I might have added some more architectural features such as dome and statue, and even just building (I had house, terraced house, shop and pub). Having said that, I assume that identifying common features like buildings and people will work well out-of-the-box.

Running a labelling job is a very interesting form of classification. You have to decide how thorough you are going to be. It is more labour intensive than simply providing a description like ‘view of a street’ or ‘war memorial’. I found it elucidating as I felt that I was looking at images in a different way and thinking about how amazing the brain is to be able to pick out a rather blurred cart or a van or a bicycle with a trailer, or whatever it might be, and how we have all these classifications in our head. It took more time than it might have done because I was thinking about this project, and about writing blog posts! But, if you invest time in training a model well, then it may be able to add labels to unlabelled photographs, and thus save time down the line. So, investing time at this point could reap real rewards.

Part of a photograph with labels added

In the above example, I’ve outlined an object that I’ve identified as a telegraph pole. One question I had is whether I am right in all of my identifications, and I’m sure there will be times when things will be wrongly identified. But this is certainly the type of feature that isn’t normally described within an image, and there must be enthusiasts for telegraph poles out there! (Well, maybe more likely historians looking at communications or the history of the telephone). It also helps to provide examples from different periods of history, so that the algorithm learns more about the object. I’ve added a label for a cart and a van in this photo. These are not all that clear within the image, but maybe by labelling less distinct features, I will help with future automated identification in archival images.

I’ve added hat as a label, but it strikes me that my boxes also highlight heads or faces in many cases, as the people in these photos are small, and it is hard to distinguish hat from head. I also suspect that the algorithm might be quite good with hats, though I don’t yet know for sure.

person and child labels

I used ‘person’ as a label, and also ‘child’, and I tended not to use ‘person’ for ‘child’, which is obviously incorrect, but I thought that it made more sense to train the algorithm to identify children, as person is probably going to work quite well. But again, I imagine that person identification is going to be quite successful without my extra work – though identifying a child is a rather more challenging task. In the end, it may be that there is no real point in doing any work identifying people as that work has probably been done with millions of images, so adding my hundred odd is hardly going to matter!

I had church as a label, and then used it for anything that looked like a church, so that included Beverley Minster, for example. I couldn’t guarantee that every building I labelled as a church is a church, and I didn’t have more nuanced labels. I didn’t have church interior as a label, so I did wonder whether labelling an interior with the same label as the exterior was less than ideal.

I was interested in whether pubs and inns can be identified. Like shops, they are easy for us to identify, but it is not easy to define them for a machine.

Green Dragon at Welton

A pub is usually a larger building (but not always) with a sign on the facade (but not always) and maybe a hanging sign. But that could be said for a shop as well. It is the details such as the shape of the sign that help a human eye distinguish it. Even a lantern hanging over the door, or several people hanging around outside! In many of the photos the pub is indistinct, and I wondered whether it is better to identify it as a pub, or whether that could be misleading.

I found that things like street lamps and telegraph poles seemed to work well, as they have clear characteristics. I wanted to try to identify more indistinct things like street and pavement, and I added these labels in order to see if they yield any useful results.

I chose to label 10% of the images. That was 109 in total, and it took a few hours. I think if I did it again I would aim to label about 50 for an experiment like this. But then the more labels you provide, the more likely you will get results.
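The sample selection can be pictured with a short sketch. This is not how Ground Truth selects its sample internally; it is a minimal illustration of drawing a reproducible random sample of a given percentage. The collection size of 1,090 is an assumption, chosen only because 10% of it gives the 109 images mentioned above, and the function names are hypothetical.

```typescript
// Small deterministic generator (Park-Miller style) so that the same seed
// always produces the same sample. Fine for illustration, not for security.
function makeRandom(seed: number): () => number {
  let state = seed % 2147483647;
  if (state <= 0) {
    state += 2147483646;
  }
  return function (): number {
    state = (state * 16807) % 2147483647;
    return (state - 1) / 2147483646; // value in [0, 1)
  };
}

// Pick `percent` of the images at random, without repeats.
function sampleImages(keys: string[], percent: number, seed: number): string[] {
  const size = Math.round((keys.length * percent) / 100);
  const pool = keys.slice();
  const random = makeRandom(seed);
  // Partial Fisher-Yates shuffle: only the first `size` positions are needed.
  for (let i = 0; i < size; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const swap = pool[i];
    pool[i] = pool[j];
    pool[j] = swap;
  }
  return pool.slice(0, size);
}

// Assumed collection of 1,090 images with made-up file names.
const allImageKeys: string[] = [];
for (let i = 1; i <= 1090; i++) {
  allImageKeys.push("image-" + i + ".jpg");
}

const labellingSample = sampleImages(allImageKeys, 10, 42);
console.log(labellingSample.length + " images selected for labelling");
```

Halving the percentage to 5% would give a sample closer to the 50 or so images suggested above as a more manageable size for an experiment.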

The next step will be to compare the output from the out-of-the-box Rekognition service with the output from a model trained using these labels. I’m very interested to see how the two compare! We are very aware that we are using a very small labelled dataset for training, but we are using the transfer learning approach that builds upon existing models, so we are hopeful we may see some improvement in label predictions. We are also working on adding these labels to our front end interface and thinking about how they might enhance discoverability.

Thanks to Adrian Stevenson, one of the Hub Labs team, who took me through the technical processes outlined in this post.

Stand and Stare: The Forest of Dean Sculpture Trail Archive at the University of Gloucestershire

Archives Hub feature for February 2023

The University of Gloucestershire’s Special Collections and Archives joined Archives Hub in 2022. There are 14 collections at the university, charting everything from the history of the institution (founded as a teacher training college in 1847), to national collections including the Independent Television News (ITN) Image Archive and Local Heritage Initiative Archive. There is also an emphasis on local connections, with holdings on Gloucestershire poets, writers and artists. One such example is the Forest of Dean Sculpture Trail Archive.

Artists at launch of project 1986 (ref: ST-1-3-6). Copyright: Forest of Dean Sculpture Trust.

The collection charts the history of the Forest of Dean Sculpture Trail, located at Beechenhurst Lodge in the heart of the Forest of Dean. In 1983, following the establishment of a sculpture trail in Exeter Forest, Martin Orrom (Forestry and Environment Officer, Forestry Commission) wrote a brief for the establishment of a sculpture trail in the Forest of Dean. The Elephant Trust provided £2,500 towards the project and in Spring 1984 around 20 artists were invited to visit the site and submit proposals for sculptures. Martin worked alongside Jeremy Rees (Founding Director of The Arnolfini, Bristol) and Rupert Martin (Curator at The Arnolfini). Six artists were chosen and these founding commissions were collectively titled “Stand and Stare”:

Peter Appleton – Sound Sculptures
Kevin Atherton – Cathedral
Andrew Darke – Sliced Log Star (Inside Out Tree)
Magdalena Jetelová – Place
David Nash – Black Dome/ Fire and Water Boats
Keir Smith – The Iron Road

Installation of Cathedral 1986 (ref: ST-3-3-5). Copyright: Forest of Dean Sculpture Trust.

The trail was opened on 19 June 1986 by Sir David Montgomery, Chair of the Forestry Commission. By 1988, a second batch of sculptures had been installed including:

Bruce Allan – Observatory
Zadok Ben David – As There Is No Hunting Tomorrow
Miles Davies – House
Ian Hamilton Finlay – Grove of Silence
Tim Lees – The Heart of the Stone
Cornelia Parker – Hanging Fire
Peter Randall-Page – Cone and Vessel
Sophie Ryder – Crossing Place/ Deer/Searcher

Maquette of the sculpture House circa 1988 (ref: ST-3-9-7). Copyright: University of Gloucestershire.

Since 1986, over 30 sculptures both temporary and permanent have been sited on the Sculpture Trail. The Forest is a living place, and the sculptures have come and gone leaving a mark on visitors and locals alike. Magdalena Jetelová’s ‘Place’, locally known as ‘Giant’s Chair’, was a huge chair sculpted from oak beams looking out over the landscape. It was originally planned as a temporary sculpture to be charcoaled in-situ, but this was deemed too dangerous. ‘Place’ remained on the trail for nearly 30 years before being decommissioned in 2015. It was dismantled and the wood turned to charcoal, reflecting one of the past industries of the forest, with the charcoal creating new artwork.

Magdalena Jetelová sculpture Place circa 1986 (ref: ST-3-14). Copyright: Forest of Dean Sculpture Trust.

The Forest of Dean Sculpture Trust was established in 1988 as a registered charity overseeing the maintenance of the trail and commissioning new works. The trail is owned and managed by Forestry England. Since 2011, the Trust has deposited the archive of the trail with the University of Gloucestershire in Cheltenham. Material covers both the administrative and artistic processes involved. Formats encompass documents, books and publications, leaflets, drawings, videos, a maquette and other ephemera. There is even part of the original bark from ‘Place’ and some of the charcoaled sculpture. The collection has proved popular with arts students both at the university and further afield. Both the trail and the archive continue to grow as the landscape evolves.

Charcoal from sculpture Place 2015 (ref: ST-3-14-26). Copyright: University of Gloucestershire.

The recent Forest to Forest project celebrated the trail’s 35th anniversary. To find out more about the site visit https://www.forestofdean-sculpture.org.uk/.

Louise Hughes
Special Collections and University Archivist
University of Gloucestershire

Related

Forest of Dean Sculpture Trail Archive, c.1976-2019

Browse the University of Gloucestershire Special Collections and Archives descriptions to date on Archives Hub


All images copyright Forest of Dean Sculpture Trust and University of Gloucestershire Special Collections and Archives. Reproduced with the kind permission of the copyright holders.

Surveyors through the ages: a glimpse into the Lloyd’s Register archive

Archives Hub feature for January 2023

In March 2022 the Heritage and Education Centre (HEC) for Lloyd’s Register (LR) began an inventory of the archive holdings, as large parts of the collections currently remain uncatalogued. Part of this work has highlighted the individual experiences of some of the surveyors working for LR over a period of nearly 200 years, in the form of surveyor letter books, notebooks, and journals. Alongside the related survey reports, plans and correspondence, which make up a core part of both the archive and the historic work of LR, we can provide an insight into these unique individuals and their roles.

Some of the earliest surviving accounts of surveyors we hold come from two surveyors based in Scotland, Walter Paton (Leith and Firth of Forth ports) and John Bar Cumming (Clyde Ports). Their letter books and journals span the years 1834-1850. They generally contain information and notes from their time surveying vessels, which accompany the information that was to be captured on the ship survey reports and subsequently included in the Register Books. They also reflect the changes in shipping within a steadily growing global industrial world as well as showing the lives of individuals operating within shipping. On several occasions Walter Paton explains his dissatisfaction with the unwillingness of local ship owners and builders to pay the survey fees. In one letter John Bar Cummings puts forward a suggestion for a master of a ship to accept a job taking immigrants to Australia.

Page 163-164 of John Bar Cummings Letter book, 1838-1840.

The later survey report for David Clark shows that the person in question took this job.

Annual Surveys Report for David Clark 21st April 1841.

Both these surveyors reference meeting and contacting each other throughout their correspondence, demonstrating the networks of officials operating across the United Kingdom, and internationally, with a centralised contact with the London Office.

Surveyors were initially selected from positions as shipwrights and sea captains. Not only was this practical experience, with its understanding of ship construction and maintenance, relevant for the tasks at hand, but it also prepared the surveyors for the dangers of life at sea. This can be seen in Walter Paton assisting with a shipwreck off the coast of Leith, reflecting a theme that runs through the history of LR: safety at sea. The surveyors often worked long hours and in the early days had limited holidays. As a letter book for John Bar Cummings shows, the work-life balance of some of the surveyors from this time was fraught with difficulty. Usually working the Christmas period, Cummings luckily had one holiday on Saturday 1st of January 1848 due to closed offices!

Page 22-23 of Walter Paton Letter book, 1834-1838.

The early ship survey reports often included additional information on the ship owners, ship builders, the vessel themselves or comments on events and activities at the port of survey. These often go hand in hand with the letter books and journals. The survey report for Hecla states that the vessel was thoroughly overhauled and fitted out as a whaling ship, for the now famous Northern expedition under Captain William Edward Parry, but that originally, she had been built as a bomb vessel for the Royal Navy.

Report of Survey for Hecla 31 January 1835 (page 1).
Report of Survey for Hecla 31 January 1835 (page 2).

Some of the later accounts of surveyors we hold provide less of a personal insight, but they do reveal how technology and industry were changing and the specialised knowledge that was required to undertake their roles. N H Burgess’s surveyor notebooks from the late 19th and early 20th century include various detailed diagrams and calculations and list the safe number of staff that should be on any given ship at one time, again reflecting the continued theme of safety at sea within the work of the surveyors. Likewise, the surveyor notebooks for S Archer from 1942 onwards contain various tables, diagrams, and calculations. Together they show the engineering, scientific and mathematical capabilities required for surveying in the first half of the 20th century. More information on these surveyors can be found listed on the ‘List of Surveyors’ (1) available on our website and found in the Lloyd’s Register of Ships.

Page from N H Burgess surveyor notebook 1901.

The surveyor journal for Bill Blacklock for 1962-1964, around South Shields, Liverpool and Middlesbrough, reflects a great change in the global approaches towards energy and fuel. He was one of the surveyors to have worked on HMS Dreadnought, the UK’s first nuclear powered submarine.

Page 111 of Bill Blacklock surveyor journal, 1962-1964.

The notebooks of surveyor John Mansfield covering the 1980s contain multiple volumes from his tenure at the then Machinery Design and Plan Approval Department. In addition to working for plan approval in London he was also positioned in Hamburg. They are typically representative of surveyors’ work from the time and detail the activities on the ground for the surveying staff. The notebooks include detailed diagrams, calculations and notes on engineering and shipbuilding, including extracts from letters and discussions for rules and regulations. In the case of the letter below, comments extended to the inspection of onboard CO2 fire extinguisher canisters.

Page from John Mansfield surveyor notebook, 1980.

From surveying Arctic exploration vessels and shipwrecks, to inspecting nuclear powered vessels, these records offer unique insights into the working lives of the Lloyd’s Register Surveyor on sea and on land.

As noted above, not all our collections are currently catalogued and searchable online. As our inventory and archival site mapping work progresses, HEC aims to make as much of this material as possible freely and publicly accessible. HEC is currently closed for refurbishment. For further information on our archives, and to access the ship plans and survey reports, register books and list of surveyors, please visit our online catalogue.

Zach Schieferstein, Archive Officer
Heritage & Education Centre
Lloyd’s Register Foundation

Related

(1) Lloyd’s Register List of Surveyors 1942-1947: Lloyd’s Register Foundation, Heritage & Education Centre: Free Download, Borrow, and Streaming : Internet Archive

Browse the Lloyd’s Register Foundation, Heritage and Education Centre descriptions to date on Archives Hub

All images copyright LRF Heritage and Education Centre. Reproduced with the kind permission of the copyright holders.

A polymath’s archive: the Edward Heron-Allen Collection at West Sussex Record Office

Archives Hub feature for December 2022

The polymath Edward Heron-Allen (1861-1943) was a man of considerable talent and many interests. A solicitor by profession, his interests touched on a plethora of subjects: science, languages, literature and music. His pioneering treatise Violin-Making As It Was And Is, published in 1884, remains one of the key works on the instrument (the newest edition was published as recently as 2017). Heron-Allen’s study of Persian enabled him to publish a literal translation of The Rubaiyat of Omar Khayyam in 1898, and he lectured widely on the subject.

After moving to Selsey, a small fishing village on the Manhood peninsula in West Sussex, in 1911 (where he had built a house in 1904), Heron-Allen turned his attention to foraminifera (tiny, single-celled marine organisms) and published a large number of scientific papers, often in conjunction with the oceanographer Arthur Earland; as a result of this work, Heron-Allen was President of the Royal Microscopical Society from 1916 to 1917 and elected Fellow of the Royal Society in 1919. Anxious to serve in World War I, Heron-Allen enlisted in Selsey’s Home Guard (where he lined his uniform with silk, for comfort), but his linguistic abilities eventually saw him work for the intelligence services, where he devised propaganda before serving on the front line in France, on special duty.

Copies of Heron-Allen’s science and science fiction books.
Copies of Heron-Allen’s The Rubaiyat of Omar Khayyam, 1898 (EHA 1/4/1/9), Violin-Making As It Was And Is, 1884 (EHA 1/4/1/3), Codex Chirmontiae, 1883 (EHA 1/4/1/1) and A Manual of Cheirosophy, late 19th century (EHA 1/4/1/7).

Despite his grounding in the solid, traditional subjects of music, science, law and classical literature, Heron-Allen had a keen interest in the unseen and unknown world, and the occult. Under the pseudonym Christopher Blayre, Heron-Allen penned a number of early science fiction works, some of which are now regarded as classics, including The Cheetah Girl and The Purple Sapphire. This latter work was based on an amethyst which came into Heron-Allen’s own possession in 1890; supposedly cursed, Heron-Allen believed it wreaked havoc on his own life, to the extent that he eventually packed it in seven boxes and left it with his bankers, who were under strict instructions not to open it until 30 years after his death (it now forms part of the Natural History Museum’s collections). Years before he wrote his science fiction, Heron-Allen studied palmistry, publishing A Manual of Cheirosophy in 1885 and The Science of the Hand in 1886; such was his skill that he foretold the death of his younger daughter, Armorel, in a car crash in 1930, many years before the tragic event, merely by observing her hands.

Whilst Heron-Allen deposited many of his papers at institutions such as the Royal College of Music, the collection at West Sussex Record Office contains some of his more personal items, as well as copies of his many published works. Gathered for the most part by his grandson, Ivor Jones, the Edward Heron-Allen Collection (reference EHA) includes letters, visitors’ books, photographs, family material, and diaries, and it provides us with a clear sense of Heron-Allen’s personality.

The spines of three volumes of Heron-Allen’s bound travel journals (refs EHA 1/2/1/9-11).

The bulk of the collection is formed by Heron-Allen’s holiday journals, a meticulous, expansive set of 32 volumes, dating from 1885-1937, all written in Heron-Allen’s neat hand and featuring photographs and postcards alongside all manner of ephemera, from menus, wine cards, bills and receipts to train tickets and timetables, deck plans, and even laundry lists.

Pages from Volume XI of Heron-Allen’s travel journals, featuring Paris, where Heron-Allen and his wife, Edith, spent their honeymoon in November 1903. Reference EHA 1/2/1/11.

Locations range from British destinations such as Harrogate and Penzance, to European countries including Belgium, France and Italy, and further afield to Constantinople (now Istanbul), Egypt and Ceylon (now Sri Lanka). Whilst some journals record holidays, others record scientific trips, such as the survey of Clare Island (Ireland), alongside Arthur Earland, in 1911, and the British Association’s meeting at Dundee in 1912.

Pages from Volume X of Heron-Allen’s travel journals, featuring Egypt, which Heron-Allen visited in 1903. Heron-Allen can be seen in the photograph on the right of the page, on board the SS Tewfik. Reference EHA 1/2/1/10.
Part of a leaflet for a watchmaker and jewellers in Cairo, pasted into Volume X of Heron-Allen’s travel journals. Heron-Allen visited Egypt in 1903. Reference EHA 1/2/1/10.

All provide evidence of Heron-Allen’s personality, as well as a slice of social and local history, and can provide wonderful detail for researchers interested in a particular locality visited by Heron-Allen.

Photographs pasted into Volume IX of Heron-Allen’s travel journals, probably taken in Turkey, which Heron-Allen visited in 1902. Reference EHA 1/2/1/9.

Local history was another of Heron-Allen’s keen interests, and another subject in which he excelled. After moving to Selsey in 1911, Heron-Allen quickly published his epic work, Selsey Bill Historic and Prehistoric, providing an exhaustive geographical history of the area; a condensed, ‘popular’ version of this was delivered by lecture at Chichester High School for Girls in 1911, illustrated by lantern slides. Heron-Allen also produced numerous scholarly articles on the area, primarily for the Sussex Archaeological Collections.

Perhaps more accessible for many of us are his seven volumes of Selseyana, dating from 1901-1937, comprising press cuttings, reports of local associations, postcards, posters and handbills, and assorted other ephemera, all of which provide a compelling story of what it was like to live on the peninsula. As with his other work, Heron-Allen’s thirst for knowledge, eye for detail and compulsion to collect have provided us with a social as well as physical history of this small corner of West Sussex.

A handbill advertising Heron-Allen’s ‘Attractive Scientific Lecture’ on ‘Nature and History at Selsey Bill’, held at Chichester High School for Girls on February 4, 1911. Reference MP 118.

The Edward Heron-Allen Collection is freely available to view at West Sussex Record Office. For information on opening times, our location and access conditions, please see our website.

The Edward Heron-Allen Society was formed in 2000. It hosts regular symposia and publishes a series of Opuscula, which concern the symposia and biographical matters relating to Heron-Allen. More about the Heron-Allen Society.

You can find out more about the Purple Sapphire on the Natural History Museum’s website.

Nichola Court, Archivist
West Sussex Record Office

Related

The Edward Heron-Allen Collection, 1850 – ongoing

Browse the West Sussex Record Office descriptions to date on the Archives Hub

Exploring IIIF for the ‘Images and Machine Learning’ project

There are many ways of utilising the International Image Interoperability Framework (IIIF) in order to deliver high-quality, attributed digital objects online at scale. One of the exploratory areas focused on in Images and Machine Learning – a project which is part of Archives Hub Labs – is how to display the context of the archive hierarchy using IIIF alongside the digital media.

Two of the objectives for this project are:

  • to explore IIIF Manifest and IIIF Collection creation from archive descriptions.
  • to test IIIF viewers in the context of showing the structure of archival material whilst viewing the digitised collections.

We have been experimenting with two types of resource from the IIIF Presentation API. The IIIF Manifest added into the Mirador viewer on the collection page contains just the images, in order to easily access these through the viewer. This is in contrast to a IIIF Collection, which we have been experimenting with. The IIIF Collection includes not only the images from a collection but also metadata and item structure within the IIIF resource. It is defined as a set of manifests (or ‘child’ collections) that communicate hierarchy or gather related things (for example, a set of boxes that each have folders within them, and photographs within those folders). We have been testing whether this has the potential to represent the hierarchy of an archival structure within the IIIF structure.
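As a rough illustration of how a hierarchy might map onto a IIIF Collection, here is a minimal TypeScript sketch. It assumes IIIF Presentation API 3.0 conventions (`id`, `type`, `label`, `items`), treats every record with children as a child Collection and every record without children as a Manifest, and uses a placeholder base URL and made-up identifiers. The `ArchiveNode` type and function names are hypothetical and this is not the code used in the project.

```typescript
// Hypothetical, simplified view of an archival description.
interface ArchiveNode {
  id: string;
  title: string;
  children?: ArchiveNode[]; // absent or empty for the lowest-level records
}

interface LanguageMap {
  [language: string]: string[];
}

// A Collection or Manifest as it appears inside a Collection's `items`.
interface IiifItem {
  id: string;
  type: "Collection" | "Manifest";
  label: LanguageMap;
  items?: IiifItem[];
}

const IIIF_BASE = "https://example.org/iiif"; // placeholder, not a real service

// Records with children become child Collections; the rest become Manifests.
function toIiifItem(node: ArchiveNode): IiifItem {
  const label: LanguageMap = { en: [node.title] };
  const children = node.children || [];
  if (children.length === 0) {
    return { id: IIIF_BASE + "/manifest/" + node.id, type: "Manifest", label: label };
  }
  return {
    id: IIIF_BASE + "/collection/" + node.id,
    type: "Collection",
    label: label,
    items: children.map(function (child): IiifItem { return toIiifItem(child); }),
  };
}

// The top level is always a Collection and carries the JSON-LD context.
function toTopLevelCollection(root: ArchiveNode) {
  const children = root.children || [];
  return {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    id: IIIF_BASE + "/collection/" + root.id,
    type: "Collection",
    label: { en: [root.title] },
    items: children.map(function (child): IiifItem { return toIiifItem(child); }),
  };
}

// Made-up hierarchy: a box containing a folder containing two photographs.
const exampleHierarchy: ArchiveNode = {
  id: "coll-1",
  title: "Example collection",
  children: [
    {
      id: "box-1",
      title: "Box 1",
      children: [
        {
          id: "folder-1",
          title: "Folder 1",
          children: [
            { id: "photo-1", title: "Photograph 1" },
            { id: "photo-2", title: "Photograph 2" },
          ],
        },
      ],
    },
  ],
};

const exampleCollection = toTopLevelCollection(exampleHierarchy);
console.log(JSON.stringify(exampleCollection, null, 2));
```

Nesting the child Collections inline keeps the sketch short; a real implementation might instead reference each child Collection by `id` and publish it as a separate document.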

Creating a User Interface

Since joining the Archives Hub team, one of the areas I’ve been involved in is building a User Interface for this project that allows us to test out the different ways in which we can display the IIIF Images, Manifests and Collections using the IIIF Image API and the IIIF Presentation API. Below I will share some screenshots from my progress and talk about my process when building this User Interface.

The homepage for the UI showing the list of contributors for this project.
The collections from all of our contributors that are being displayed within the UI using IIIF manifests and collections.

This web application is currently a prototype and further development will be happening in the future. The programming language I am using is TypeScript. I began by creating a Next.js React application and I am also using Tailwind CSS for styling. My first task was to use the Mirador viewer to display IIIF Collections and Manifests, so I installed the mirador package into the codebase. I created dynamic pages for every contributor to display their collections.

This is the contributor page for the University of Brighton Design Archives.

I also created dynamic collection pages for each collection. Included on the left-hand side of a collection page is the Archives Hub record link and the metadata about the collection taken from the archival EAD data – these sections displaying the metadata can be extended or hidden. The right-hand side of a collection page features a Mirador viewer. A simple IIIF Manifest has been added for all of the images in each collection. This Manifest is used to help quickly navigate through and browse the images in the collection.

This is the collection page for the University of Brighton Design Archives ‘Britain Can Make It’ collection.

Mirador has the ability to display multiple windows within one workspace. This is really useful for comparison of images side-by-side. Therefore, I have also created a ‘Compare Collections’ page where two Manifests of collection images can be compared side-by-side. I have configured two windows to display within one Mirador viewer. Then, two collections can be chosen for comparison using the dropdown select boxes seen in the image below.

The ‘Compare Collections’ page.
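The two-window setup can be sketched as a plain configuration object. This assumes Mirador 3, where the viewer is given a `windows` array with one `manifestId` per window; the function name, element id and manifest URLs below are placeholders, and the actual call to the viewer is left as a comment so that the sketch stays self-contained.

```typescript
// Minimal shape of the configuration used here (Mirador 3 accepts many more options).
interface CompareConfig {
  id: string;                          // id of the DOM element that hosts the viewer
  windows: { manifestId: string }[];   // one entry per window in the workspace
  workspace: { type: string };
}

// Build a configuration with two windows, one per selected collection Manifest.
function buildCompareConfig(elementId: string, leftManifest: string, rightManifest: string): CompareConfig {
  return {
    id: elementId,
    windows: [{ manifestId: leftManifest }, { manifestId: rightManifest }],
    workspace: { type: "mosaic" },
  };
}

// Placeholder URLs standing in for the values chosen in the two dropdowns.
const compareConfig = buildCompareConfig(
  "mirador-compare",
  "https://example.org/iiif/manifest/collection-a",
  "https://example.org/iiif/manifest/collection-b"
);
console.log(JSON.stringify(compareConfig, null, 2));

// In the browser this object would then be passed to the viewer,
// for example: Mirador.viewer(compareConfig);
```

When either dropdown changes, rebuilding the configuration with the new Manifest URL and re-creating the viewer is probably the simplest approach, though not necessarily the most efficient one.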

Next steps

There are three key next steps for developing the User Interface:

  • We have experimented with the Mirador viewer, and now we will be looking at how the Universal Viewer handles IIIF Collections. 
  • From the workshop feedback and from our exploration with the display of images, we will be looking at how we can offer an alternative experience of these archival images – distinct from their cataloguing hierarchy – such as thematic digital exhibitions and linking to other IIIF Collections and Manifests that already exist.
  • As part of the Machine Learning aspect of this project, we will be utilising the additional option to add annotations within the IIIF resources, so that the ML outputs from each image can be added as annotations and displayed in a viewer.
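For the third step, one possible shape for the annotations is sketched below. It assumes the ML output looks like a Rekognition label (a `Name`, a `Confidence`, and optional `Instances` whose `BoundingBox` values are ratios of the image size) and turns each instance into a W3C Web Annotation with a `tagging` motivation that targets a region of a IIIF canvas. The canvas URL, canvas size and annotation `id` pattern are made up for illustration.

```typescript
// Subset of a Rekognition-style label, limited to what this sketch needs.
interface RatioBox { Left: number; Top: number; Width: number; Height: number; }
interface DetectedLabel {
  Name: string;
  Confidence: number;
  Instances?: { BoundingBox: RatioBox; Confidence: number }[];
}

interface TagAnnotation {
  id: string;
  type: "Annotation";
  motivation: "tagging";
  body: { type: "TextualBody"; value: string; format: "text/plain" };
  target: string;
}

// Convert one label into annotations. Labels without bounding boxes
// (whole-image labels) target the canvas itself.
function labelToAnnotations(
  label: DetectedLabel,
  canvasId: string,
  canvasWidth: number,
  canvasHeight: number
): TagAnnotation[] {
  const targets: string[] = [];
  const instances = label.Instances || [];
  if (instances.length === 0) {
    targets.push(canvasId);
  } else {
    for (const instance of instances) {
      const box = instance.BoundingBox;
      // Ratios are scaled to canvas pixels for the xywh fragment.
      const x = Math.round(box.Left * canvasWidth);
      const y = Math.round(box.Top * canvasHeight);
      const w = Math.round(box.Width * canvasWidth);
      const h = Math.round(box.Height * canvasHeight);
      targets.push(canvasId + "#xywh=" + [x, y, w, h].join(","));
    }
  }
  const slug = label.Name.toLowerCase().replace(/\s+/g, "-");
  return targets.map(function (target, index): TagAnnotation {
    return {
      id: canvasId + "/annotation/" + slug + "-" + index, // hypothetical id pattern
      type: "Annotation",
      motivation: "tagging",
      body: { type: "TextualBody", value: label.Name, format: "text/plain" },
      target: target,
    };
  });
}

// Made-up example: one boat detected on a 2000 x 1000 canvas.
const boatAnnotations = labelToAnnotations(
  {
    Name: "Boat",
    Confidence: 97.2,
    Instances: [{ BoundingBox: { Left: 0.1, Top: 0.5, Width: 0.25, Height: 0.2 }, Confidence: 95.0 }],
  },
  "https://example.org/iiif/canvas/1",
  2000,
  1000
);
console.log(JSON.stringify(boatAnnotations, null, 2));
```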

Labs IIIF Workshop

We recently held a workshop with the Archives Hub Labs project participants in order to get feedback on viewing the archive hierarchy through these IIIF Collections, displayed in a Mirador viewer. In preparation for this workshop, Ben created a sample of IIIF Collections using the images kindly provided by the project participants and the archival data related to these images that is on the Archives Hub. These were then loaded into the Mirador viewer so our workshop participants could see how the collection hierarchy is displayed within the viewer. The outcomes of this workshop will be explored in the next Archives Hub Labs blog post.

Thank you to Cardiff University, Bangor University, Brighton Design Archives at the University of Brighton, the University of Hull, the Borthwick Institute for Archives at the University of York, Lambeth Palace (Church of England) and Lloyds Bank for providing their digital collections and for participating in Archives Hub Labs.

Running Machine Learning in AWS

For our Machine Learning experiments we are using Amazon Web Services (AWS). We thought it would be useful to explain what we have been doing.

AWS, like most Cloud providers, gives you access to a huge range of infrastructure, services and tools. Typically, instead of having your own servers physically on your premises, you utilise virtual servers provided in the Cloud. The Cloud is a cost-effective solution, and in particular it allows for elasticity: dynamically allocating resources as required. It also provides a range of features, including a set of Machine Learning services and tools.

The AWS console lists the services available. For Machine Learning there is a range of options.

One of the services available is Amazon Rekognition. This is what we have used when writing our previous blog posts.

Amazon Rekognition enables you to analyse images

One of the things Rekognition does is object detection. We have written about using Rekognition in a previous post.

Our initial experiments involved uploading one image at a time and looking at the output. The next step is to work out how to submit a batch of images and get output from that. AWS doesn’t have an interface that allows you to upload a batch. We have batches of images stored in the Cloud (using the ‘S3’ service), and so we need to pass sets of images from S3 to the Rekognition service and store the resulting label predictions (outputs). We also need to figure out how to provide these predictions to our contributors in a user-friendly display.
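One simple way to picture the batch problem is a script that walks through a bucket and sends each image to Rekognition in turn. The sketch below is illustrative only and is not the approach described in the rest of this post. It uses the boto3 calls `list_objects_v2` (through a paginator) and `detect_labels`. The function names, the confidence threshold and the file extensions checked are assumptions. The clients are passed in as arguments so the functions can be tried out with stand-ins.

```python
def list_image_keys(s3_client, bucket, prefix="",
                    extensions=(".jpg", ".jpeg", ".png")):
    """Yield the keys of image objects in a bucket, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when nothing matches.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(extensions):
                yield obj["Key"]


def label_batch(s3_client, rekognition_client, bucket, prefix="",
                min_confidence=80.0):
    """Return {image key: [(label name, confidence), ...]} for a bucket."""
    results = {}
    for key in list_image_keys(s3_client, bucket, prefix):
        # Rekognition reads the image straight from S3; nothing is downloaded.
        response = rekognition_client.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MinConfidence=min_confidence,
        )
        results[key] = [
            (label["Name"], label["Confidence"]) for label in response["Labels"]
        ]
    return results
```

With boto3 installed and credentials configured, the two clients would come from `boto3.client("s3")` and `boto3.client("rekognition")`.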

S3 provides a storage facility (‘buckets’), where we have uploaded images from our Labs participants

After substantial research into approaches that we could take, we decided to use the AWS Lambda and DynamoDB services along with Rekognition and S3. Lambda is a service that allows you to run code without having to set up the virtual machine infrastructure (it is often referred to as a serverless approach). We used some ‘blueprint’ Lambda code (written in Python) as the basis, and extended it for our purposes.

One of the blueprints is for using Rekognition to detect faces

Using something like AWS does not mean that you get this type of facility out of the box. AWS provides the infrastructure, and the interfaces are reasonably user-friendly, but it does not provide a full-blown application for doing Machine Learning. We have to do some development work in order to use Rekognition, or other ML tools, for a set of images.

A slice of the code – the images are taken from the S3 bucket and Rekognition provides a response with levels of confidence.

Lambda is set up so the code will run every time an image is placed in the S3 bucket. It then passes the output (label prediction) to another AWS service, called DynamoDB, which is a ‘NoSQL’ database.
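The flow just described can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the blueprint code or our extended version of it. The S3 event layout (`Records`, `s3.bucket.name`, `s3.object.key`) and the `detect_labels` and `put_item` calls are standard. The table name `LabelPredictions`, the attribute names `image`, `label` and `confidence`, and the confidence threshold are hypothetical.

```python
import urllib.parse
from decimal import Decimal


def labels_from_event(event, rekognition_client, table, min_confidence=80.0):
    """Detect labels for each image named in an S3 event and store them."""
    stored = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = rekognition_client.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MinConfidence=min_confidence,
        )
        for label in response["Labels"]:
            item = {
                "image": key,            # hypothetical partition key
                "label": label["Name"],  # hypothetical sort key
                # boto3's DynamoDB resource does not accept Python floats,
                # so the confidence is stored as a Decimal.
                "confidence": Decimal(str(label["Confidence"])),
            }
            table.put_item(Item=item)
            stored.append(item)
    return stored


def lambda_handler(event, context):
    """Entry point that Lambda calls when an image lands in the bucket."""
    import boto3  # provided by the AWS Lambda Python runtime

    rekognition = boto3.client("rekognition")
    table = boto3.resource("dynamodb").Table("LabelPredictions")  # hypothetical
    return {"stored": len(labels_from_event(event, rekognition, table))}
```

Keeping the AWS clients out of `labels_from_event` is a design choice for the sketch: it means the logic can be exercised locally with stand-in objects before anything is deployed.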

DynamoDB output

In the above image you can see an excerpt from the output from running the Lambda code. This is for image U DX336-1-6.jpg (see below), for which it has predicted ‘tree’ with a confidence level of 94.51 percent. Ideally we wanted to add the ‘bounding box’, which provides the co-ordinates for where the object is within the image.

Image from the Royal Conservatoire of Scotland showing bounding boxes to identify person and chair

We spent quite a bit of time trying to figure out how to add bounding boxes, and eventually realised that they are only added for some objects. Amazon Rekognition Image and Amazon Rekognition Video can return the bounding box for common object labels such as cars, furniture, apparel or pets, but the information isn’t returned for less common object labels. Quite how things are classed as more or less common is not clear. At the moment we are working on passing the bounding box information (when there is any) to our database output.
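Handling the "when there is any" case might look something like the sketch below. In a `detect_labels` response each label carries an `Instances` list, and each instance has a `BoundingBox` with `Left`, `Top`, `Width` and `Height` given as ratios of the image size. For labels without location information the list is empty. The function name and the output field names are illustrative assumptions.

```python
def extract_boxes(label, image_width, image_height):
    """Return pixel bounding boxes for one label, or [] if none came back."""
    boxes = []
    # Labels without location information have an empty Instances list.
    for instance in label.get("Instances", []):
        box = instance.get("BoundingBox")
        if not box:
            continue
        # Rekognition gives ratios of the overall image size, so scale
        # them by the image dimensions to get pixel values.
        boxes.append({
            "left": round(box["Left"] * image_width),
            "top": round(box["Top"] * image_height),
            "width": round(box["Width"] * image_width),
            "height": round(box["Height"] * image_height),
        })
    return boxes
```

An empty result simply means the label applies to the image as a whole, so the database record can be written with or without box data.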

Image from Hull University Archives
Label predictions for the above image

Clearly for this image, it would be useful to have ‘memorial’ and ‘cross’ as label predictions, but these terms are absent. However, sometimes ML can provide terms that might not be used by the cataloguer, such as ‘tree’ or ‘monument’.

So we now have the ability to submit a batch of images, but currently the output is in JSON (the above output table is only provided if you upload the image individually). We are hoping to read the data and place the labels into our IIIF development interface.
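Reading the labels back out of the JSON could be as simple as the sketch below. It assumes the JSON has the same shape as a Rekognition `detect_labels` response (a top-level `Labels` list with `Name` and `Confidence` fields). If the data is exported from the database instead, the field names would differ. The function name and the threshold are illustrative.

```python
import json


def labels_from_json(raw_json, min_confidence=90.0):
    """Return (name, confidence) pairs above a threshold, highest first."""
    data = json.loads(raw_json)
    labels = [
        (label["Name"], label["Confidence"])
        for label in data.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]
    # Sort so the most confident predictions are listed first.
    return sorted(labels, key=lambda pair: pair[1], reverse=True)
```

A list like this, one per image, is the sort of thing that could then be fed into a display alongside the image itself.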

The next step is to create a model using a subset of the images that our participants have provided. A key thing to understand is that in order to train a model so that it makes better predictions you need to provide labelled images. Therefore, if you want to try using ML, it is likely that part of the ML journey will require you to undertake a substantial amount of labelling if you don’t already have labelled images. Providing labelled content is the way that the algorithm learns. If we provided the above image and a batch of others like it and included a label of ‘memorial’ then that would make it more likely that other non-labelled images we input would be identified correctly. We could also include the more specific label ‘war memorial’ – but it would seem like a tall order for ML to distinguish war memorials from other types. Having said that, the fascinating thing is that often machines learn to detect patterns in a way that surpasses what humans can achieve. We can only give it a go and see what we get.

Thanks to Adrian Stevenson, one of the Hub Labs team, who took me through the technical processes outlined in this post.