Machine Learning: training a model by creating a labelled dataset

In this post I will go through the steps we took to create a human-labelled dataset: naming objects within images and applying the labels to bounding boxes (which show where the objects are in the image), in order to train an ML model to identify those objects. The other approach, which we will cover in another post, is simply to let a pre-trained tool do the labelling without any human intervention. But we thought it would be worthwhile to try human labelling before seeing what the out-of-the-box results are.

I used the photographs in the Claude William Jamson archive, kindly provided by Hull University Archives. This is a collection with a variety of content that lends itself to this kind of experiment.

An image from the Jamson archive

I used Amazon SageMaker for this work. In SageMaker you can set up a labelling job using the Ground Truth service: you give the job a name and provide the location of the source material – in our case an S3 bucket (Amazon's Simple Storage Service) containing the folder of Jamson photographs. Images have to be JPG or PNG, so if you have TIF images, for example, they have to be converted first.
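As a rough sketch of that conversion step (assuming the Pillow library, and with hypothetical folder names), something like this would convert a folder of TIFs to JPEGs before uploading them to the S3 bucket:

```python
from pathlib import Path
from PIL import Image  # Pillow

SRC = Path("jamson_tifs")    # hypothetical folder of .tif scans
DEST = Path("jamson_jpgs")   # converted copies ready for the S3 bucket
DEST.mkdir(exist_ok=True)

for tif in SRC.glob("*.tif"):
    with Image.open(tif) as img:
        # Ground Truth expects JPG or PNG; convert and drop any alpha channel
        img.convert("RGB").save(DEST / (tif.stem + ".jpg"), "JPEG", quality=90)
```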

Location and output information are added – I have specified that we are working with images.

I then decided on my approach: to train the algorithm with a random sample of images from the collection, so that the sample would be a subset of the full Jamson Archive dataset we are working with. We can then use the ML model created from this subset to make object detection predictions for the rest of the dataset.

Random sample is selected, and I can also specify the size of the sample, e.g. 25%.

Once I had these settings completed, I started to create the labels for the ‘Ground Truth’ job. You have to provide the full list of labels up front, and then select from it for each image – you cannot create labels as you go. This immediately seemed like a big constraint to me.

Interface for adding labels, and a description of the task

I went through the photographs and decided upon the labels – you can only add up to 50 labels. It is probably worth noting here that ‘label bias’ is a known issue within machine learning. This is where the set of labelled data is not fully representative of the entirety of potential labels, and so it can create bias. This might be something we come back to, in order to think about the implications.

Creating a list of labels that I can then apply to each individual image

I chose to add some fairly obvious labels, such as boat or church. But I also wanted to try adding labels for features that are often not described in the metadata for an image, but nonetheless might be of interest to researchers, so I added things like terraced house, telegraph pole, hat and tree, for example.
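Behind the console, Ground Truth stores this label list as a small JSON 'label category' file in S3. A minimal sketch of generating one, using labels mentioned in this post (the document-version string follows AWS's published examples, and the file name is arbitrary):

```python
import json

labels = ["boat", "church", "terraced house", "telegraph pole", "hat", "tree",
          "person", "child", "chimney", "shopfront", "pavement", "water"]

category_config = {
    "document-version": "2018-11-28",
    "labels": [{"label": name} for name in labels],  # Ground Truth allows up to 50
}

with open("label_categories.json", "w") as f:
    json.dump(category_config, f, indent=2)
# Upload this file to S3 and point the labelling job at it.
```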

Once you have the labels, there are some other options. You can assign to a labelling team, and make the task time bound, which might be useful for thinking about the resources involved in doing a job like this. You can also ask for automated data labelling, which does add to the cost, so it is worth considering this when deciding on your settings. The automated labelling uses ML to learn from the human labelling. As the task will be assigned to a work team, you need to ensure that you have the people you want in the team already added to Ground Truth.

Confirming the team and the task timeout

Those assigned to the labelling job will receive an email confirming this and giving a link to access the labelling job.
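For completeness, the same kind of job can also be set up programmatically rather than through the console. This is a minimal, unverified sketch using the boto3 create_labeling_job call; every ARN and S3 path here is a placeholder, and the pre-annotation and consolidation Lambdas are the AWS-provided ones for the bounding box task type, which vary by region:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_labeling_job(
    LabelingJobName="jamson-bounding-boxes",
    LabelAttributeName="bounding-box",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-bucket/jamson/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-bucket/jamson/output/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthRole",  # placeholder
    LabelCategoryConfigS3Uri="s3://my-bucket/jamson/label_categories.json",
    HumanTaskConfig={
        # placeholder work team ARN - the people doing the labelling
        "WorkteamArn": "arn:aws:sagemaker:eu-west-1:123456789012:workteam/private-crowd/hub-labs",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/jamson/bounding-box.template"},
        "PreHumanTaskLambdaArn": "<AWS PRE-BoundingBox Lambda ARN for your region>",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "<AWS ACS-BoundingBox Lambda ARN for your region>"
        },
        "TaskTitle": "Draw bounding boxes around objects",
        "TaskDescription": "Draw a box around each object and choose a label",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 600,  # the task timeout confirmed above
    },
    # Automated data labelling is switched on via LabelingJobAlgorithmsConfig (optional, adds cost)
)
```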

Workers assigned to the job can now start to work to create the bounding boxes and add the labels

You can now begin the job of identifying objects and applying labels.

The interface for adding labels to images

First up I have a photograph showing rowing boats. I hadn’t added the label ‘rowing boat’, as I didn’t go through every single photograph beforehand to find all the objects I might want to label – so not a good start! ‘Boat’ will have to do. As stated above, I had to work with the labels I had created; I couldn’t add more labels at this stage.

I added as many labels as I could to each photograph, which was a fairly time intensive exercise. For example, in the image below I added not only boat and person but also hat and chimney. I also added water, which could be optimistic, as it is not really an object that is bound within a box, and it is rather difficult to identify in many cases, but it’s worth a try.

Adding labels using bounding boxes

I can zoom in and out and play with exposure and contrast settings to help me identify objects.

Bounding boxes with labels

Here is another example where I experimented with some labels that seem quite ambitious – I tried shopfront and pavement, for example, though it is hard to distinguish a shop from any other house front, and hard to pinpoint a pavement.

The more I went through the images, drawing bounding boxes and adding labels, the more I could see the challenges, and I wondered how the out-of-the-box ML tools would fare identifying these things. My aim in doing the labelling work was partly to get my head into that space of identification, and to think about the characteristics of various objects (especially objects in the historic images that are common in archive collections). But my aim was also to train the model to improve accuracy. For an object like a chimney, this labelling exercise looked like it might be fruitful: a chimney has certain characteristics, and giving the algorithm lots of examples seems likely to improve the model and thus identify more chimneys. The value of giving the algorithm examples of shopfronts is harder to predict. If you try to pin down the characteristics, a shopfront often has a bay window with items displayed in it, and usually a sign above, though that is indistinct in many of these pictures. Training the model on clear, full-view images of shops seems very different from the reality of many photographs, where the shop is just part of the whole scene and you only get a partial view.

There were certainly some features I really wanted to label as I went along. Not being able to do this seemed to be a major shortcoming of the tool. For example, I thought flags might be good – something that has quite defined characteristics – and I might have added some more architectural features such as dome and statue, and even just building (I had house, terraced house, shop and pub). Having said that, I assume that identifying common features like buildings and people will work well out-of-the-box.

Running a labelling job is a very interesting form of classification. You have to decide how thorough you are going to be. It is more labour intensive than simply providing a description like ‘view of a street’ or ‘war memorial’. I found it illuminating, as I felt I was looking at images in a different way, thinking about how amazing the brain is to be able to pick out a rather blurred cart or a van or a bicycle with a trailer, or whatever it might be, and how we have all these classifications in our head. It took more time than it might have done because I was thinking about this project, and about writing blog posts! But, if you invest time in training a model well, then it may be able to add labels to unlabelled photographs, and thus save time down the line. So, investing time at this point could reap real rewards.

Part of a photograph with labels added

In the above example, I’ve outlined an object that I’ve identified as a telegraph pole. One question I had is whether I am right in all of my identifications, and I’m sure there will be times when things are wrongly identified. But this is certainly the type of feature that isn’t normally described within an image, and there must be enthusiasts for telegraph poles out there! (Well, maybe more likely historians looking at communications or the history of the telephone.) It also helps to provide examples from different periods of history, so that the algorithm learns more about the object. I’ve added a label for a cart and a van in this photo. These are not all that clear within the image, but maybe by labelling less distinct features, I will help with future automated identification in archival images.

I’ve added hat as a label, but it strikes me that my boxes also highlight heads or faces in many cases, as the people in these photos are small, and it is hard to distinguish hat from head. I also suspect that the algorithm might be quite good with hats, though I don’t yet know for sure.

person and child labels

I used ‘person’ as a label, and also ‘child’, and I tended not to use ‘person’ for a child. That is obviously incorrect, but I thought it made more sense to train the algorithm to identify children, as ‘person’ is probably going to work quite well already. Then again, I imagine that person identification is going to be quite successful without my extra work, whereas identifying a child is a rather more challenging task. In the end, it may be that there is no real point in doing any work identifying people, as that work has probably been done with millions of images, so adding my hundred-odd is hardly going to matter!

I had church as a label, and then used it for anything that looked like a church, which included Beverley Minster, for example. I couldn’t guarantee that every building I labelled as a church is a church, and I didn’t have more nuanced labels. I didn’t have church interior as a label, so I did wonder whether labelling interiors with the same label as exteriors was less than ideal.

I was interested in whether pubs and inns can be identified. Like shops, they are easy for us to identify, but it is not easy to define them for a machine.

Green Dragon at Welton

A pub is usually a larger building (but not always) with a sign on the facade (but not always) and maybe a hanging sign. But that could be said for a shop as well. It is the details such as the shape of the sign that help a human eye distinguish it. Even a lantern hanging over the door, or several people hanging around outside! In many of the photos the pub is indistinct, and I wondered whether it is better to identify it as a pub, or whether that could be misleading.

I found that things like street lamps and telegraph poles seemed to work well, as they have clear characteristics. I wanted to try to identify more indistinct things like street and pavement, and I added these labels in order to see if they yield any useful results.

I chose to label 10% of the images. That was 109 in total, and it took a few hours. I think if I did it again I would aim to label about 50 for an experiment like this. But then the more labelled examples you provide, the more likely you are to get useful results.

The next step will be to compare the output from the out-of-the-box Rekognition service with a model trained using these labels. I’m very interested to see how the two compare! We are very aware that we are using a very small labelled dataset for training, but we are using a transfer learning approach that builds upon existing models, so we are hopeful we may see some improvement in label predictions. We are also working on adding these labels to our front-end interface and thinking about how they might enhance discoverability.

Thanks to Adrian Stevenson, one of the Hub Labs team, who took me through the technical processes outlined in this post.

Exploring IIIF for the ‘Images and Machine Learning’ project

There are many ways of utilising the International Image Interoperability Framework (IIIF) to deliver high-quality, attributed digital objects online at scale. One of the exploratory areas in Images and Machine Learning – a project which is part of Archives Hub Labs – is how to display the context of the archive hierarchy using IIIF alongside the digital media.

Two of the objectives for this project are:

  • to explore IIIF Manifest and IIIF Collection creation from archive descriptions.
  • to test IIIF viewers in the context of showing the structure of archival material whilst viewing the digitised collections.

We have been experimenting with two types of resource from the IIIF Presentation API. The IIIF Manifest added to the Mirador viewer on each collection page contains just the images, so that these can be accessed easily through the viewer. This is in contrast to a IIIF Collection, which includes not only the images from a collection but also metadata and item structure within the IIIF resource. A IIIF Collection is defined as a set of manifests (or ‘child’ collections) that communicate hierarchy or gather related things (for example, a set of boxes that each have folders within them, and photographs within those folders). We have been testing whether this has the potential to represent the hierarchy of an archival structure within the IIIF structure.
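To illustrate the difference, here is a minimal sketch of a IIIF Collection that nests a ‘child’ collection and a Manifest, following IIIF Presentation API 3.0 conventions, with entirely hypothetical ids:

```python
import json

collection = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": "https://example.org/iiif/collection/fonds",
    "type": "Collection",
    "label": {"en": ["Example fonds"]},
    "items": [
        {
            # a 'child' collection standing in for a series or box
            "id": "https://example.org/iiif/collection/series-1",
            "type": "Collection",
            "label": {"en": ["Series 1: Photographs"]},
            "items": [
                {
                    # each item-level description with images becomes a Manifest
                    "id": "https://example.org/iiif/manifest/item-1",
                    "type": "Manifest",
                    "label": {"en": ["Item 1"]},
                }
            ],
        }
    ],
}

print(json.dumps(collection, indent=2))
```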

Creating a User Interface

Since joining the Archives Hub team, one of the areas I’ve been involved in is building a User Interface for this project that allows us to test out the different ways in which we can display the IIIF Images, Manifests and Collections using the IIIF Image API and the IIIF Presentation API. Below I will share some screenshots from my progress and talk about my process when building this User Interface.

The homepage for the UI showing the list of contributors for this project.
The collections from all of our contributors that are being displayed within the UI using IIIF manifests and collections.

This web application is currently a prototype and further development will be happening in the future. The programming language I am using is TypeScript. I began by creating a Next.js React application, and I am also using Tailwind CSS for styling. My first task was to use the Mirador viewer to display IIIF Collections and Manifests, so I installed the mirador package into the codebase. I then created dynamic pages for every contributor to display their collections.

This is the contributor page for the University of Brighton Design Archives.

I also created dynamic collection pages for each collection. Included on the left-hand side of a collection page is the Archives Hub record link and the metadata about the collection taken from the archival EAD data – these metadata sections can be extended or hidden. The right-hand side of a collection page features a Mirador viewer, into which a simple IIIF Manifest of all the images in the collection has been loaded. This Manifest is used to help quickly navigate through and browse the images in the collection.

This is the collection page for the University of Brighton Design Archives ‘Britain Can Make It’ collection.

Mirador has the ability to display multiple windows within one workspace. This is really useful for comparison of images side-by-side. Therefore, I have also created a ‘Compare Collections’ page where two Manifests of collection images can be compared side-by-side. I have configured two windows to display within one Mirador viewer. Then, two collections can be chosen for comparison using the dropdown select boxes seen in the image below.

The ‘Compare Collections’ page.

Next steps

There are three key next steps for developing the User Interface –

  • We have experimented with the Mirador viewer, and now we will be looking at how the Universal Viewer handles IIIF Collections. 
  • From the workshop feedback and from our exploration with the display of images, we will be looking at how we can offer an alternative experience of these archival images – distinct from their cataloguing hierarchy – such as thematic digital exhibitions and linking to other IIIF Collections and Manifests that already exist.
  • As part of the Machine Learning aspect of this project, we will be utilising the additional option to add annotations within the IIIF resources, so that the ML outputs from each image can be added as annotations and displayed in a viewer.
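As a rough sketch of that last idea, a single ML label prediction (with a bounding box) could be expressed as a W3C Web Annotation targeting a region of an image's canvas; all ids here are hypothetical and the xywh values would come from the ML output:

```python
import json

def label_to_annotation(canvas_id, label, confidence, x, y, w, h):
    """Turn one label prediction into a IIIF/W3C annotation targeting a canvas region."""
    return {
        "id": f"{canvas_id}/annotations/{label}-{x}-{y}",
        "type": "Annotation",
        "motivation": "tagging",
        "body": {
            "type": "TextualBody",
            "value": f"{label} ({confidence:.1f}%)",
            "format": "text/plain",
        },
        # the xywh fragment selects the bounding-box region on the canvas
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }

print(json.dumps(label_to_annotation(
    "https://example.org/iiif/manifest/item-1/canvas/1", "tree", 94.5, 120, 40, 300, 260), indent=2))
```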

Labs IIIF Workshop

We recently held a workshop with the Archives Hub Labs project participants in order to get feedback on viewing the archive hierarchy through these IIIF Collections, displayed in a Mirador viewer. In preparation for this workshop, Ben created a sample of IIIF Collections using the images kindly provided by the project participants and the archival data related to these images that is on the Archives Hub. These were then loaded into the Mirador viewer so our workshop participants could see how the collection hierarchy is displayed within the viewer. The outcomes of this workshop will be explored in the next Archives Hub Labs blog post.

Thank you to Cardiff University, Bangor University, Brighton Design Archives at the University of Brighton, the University of Hull, the Borthwick Institute for Archives at the University of York, Lambeth Palace (Church of England) and Lloyds Bank for providing their digital collections and for participating in Archives Hub Labs.

Running Machine Learning in AWS

For our Machine Learning experiments we are using Amazon Web Services (AWS). We thought it would be useful to explain what we have been doing.

AWS, like most Cloud providers, gives you access to a huge range of infrastructure, services and tools. Typically, instead of having your own servers physically on your premises, you utilise the virtual servers provided in the Cloud. The Cloud is a cost-effective solution, and in particular it allows for elasticity: dynamically allocating resources as required. It also provides a range of features, including a set of Machine Learning services and tools.

The AWS console lists the services available. For Machine Learning there are a range of options.

One of the services available is Amazon Rekognition. This is what we have used when writing our previous blog posts.

Amazon Rekognition enables you to analyse images

One of the things Rekognition does is object detection. We have written about using Rekognition in a previous post.

Our initial experiments were done on the basis of uploading single images at a time and looking at the output. The next step is to work out how to submit a batch of images and get output from that. AWS doesn’t have an interface that allows you to upload a batch. We have batches of images stored in the Cloud (using the ‘S3’ service), and so we need to pass sets of images from S3 to the Rekognition service and store the resulting label predictions (outputs). We also need to figure out how to provide these predictions to our contributors in a user friendly display.

S3 provides a storage facility (‘buckets’), where we have uploaded images from our Labs participants

After substantial research into approaches that we could take, we decided to use the AWS Lambda and DynamoDB services along with Rekognition and S3. Lambda is a service that allows you to run code without having to set up the virtual machine infrastructure (it is often referred to as a serverless approach). We used some ‘blueprint’ Lambda code (written in Python) as the basis, and extended it for our purposes.

One of the blueprints is for using Rekognition to detect faces

Using something like AWS does not mean that you get this type of facility out of the box. AWS provides the infrastructure and the interfaces are reasonably user friendly, but it does not provide a full blown application for doing Machine Learning. We have to do some development work in order to use Rekognition, or other ML tools, for a set of images.

A slice of the code – the images are taken from the S3 bucket and Rekognition provides a response with levels of confidence.

Lambda is set up so the code will run every time an image is placed in the S3 bucket. It then passes the output (label prediction) to another AWS service, called DynamoDB, which is a ‘NoSQL’ database.
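A stripped-down sketch of such a Lambda handler is shown below. This is not our exact code – the table name is hypothetical and error handling is omitted – but it shows the shape of the flow from S3 event to Rekognition to DynamoDB:

```python
import boto3
from urllib.parse import unquote_plus

rekognition = boto3.client("rekognition")
table = boto3.resource("dynamodb").Table("image-labels")  # hypothetical table name

def lambda_handler(event, context):
    # Triggered by an S3 'object created' event for each image placed in the bucket
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # S3 event keys arrive URL-encoded

    # Ask Rekognition for label predictions for the new image
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=20,
        MinConfidence=80,
    )

    # Store each prediction and its confidence level against the image key
    for label in response["Labels"]:
        table.put_item(Item={
            "image_key": key,
            "label": label["Name"],
            "confidence": str(round(label["Confidence"], 2)),
        })

    return {"labels_written": len(response["Labels"])}
```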

DynamoDB output

In the above image you can see an excerpt from the output from running the Lambda code. This is for image U DX336-1-6.jpg (see below) and it has predicted ‘tree’ with a confidence level of 94.51 percent. Ideally we wanted to add the ‘bounding box’ which provides the co-ordinates for where the object is within the image.

Image from the Royal Conservatoire of Scotland showing bounding boxes to identify person and chair

We spent quite a bit of time trying to figure out how to add bounding boxes, and eventually realised that they are only added for some objects – Amazon Rekognition Image and Amazon Rekognition Video can return the bounding box for common object labels such as cars, furniture, apparel or pets, but the information isn’t returned for less common object labels. Quite how things are classed as more or less common is not clear. At the moment we are working on passing the bounding box information (when there is any) to our database output.
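Where Rekognition does return them, the boxes sit under each label's Instances list, expressed as ratios of the image's width and height. A small sketch of pulling them out of a detect_labels response and converting them to pixel coordinates (the image dimensions would need to be read separately, e.g. from the file itself):

```python
def extract_boxes(response, image_width, image_height):
    """Collect pixel bounding boxes for any label instances Rekognition returns."""
    boxes = []
    for label in response["Labels"]:
        for instance in label.get("Instances", []):  # empty for less common labels
            bb = instance["BoundingBox"]             # Left/Top/Width/Height as ratios (0-1)
            boxes.append({
                "label": label["Name"],
                "confidence": instance["Confidence"],
                "x": int(bb["Left"] * image_width),
                "y": int(bb["Top"] * image_height),
                "w": int(bb["Width"] * image_width),
                "h": int(bb["Height"] * image_height),
            })
    return boxes
```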

Image from Hull University Archives
Label predictions for the above image

Clearly for this image, it would be useful to have ‘memorial’ and ‘cross’ as label predictions, but these terms are absent. However, sometimes ML can provide terms that might not be used by the cataloguer, such as ‘tree’ or ‘monument’.

So we now have the ability to submit a batch of images, but currently the output is in JSON (the above output table is only provided if you upload the image individually). We are hoping to read the data and place the labels into our IIIF development interface.

The next step is to create a model using a subset of the images that our participants have provided. A key thing to understand is that in order to train a model so that it makes better predictions you need to provide labelled images. Therefore, if you want to try using ML, it is likely that part of the ML journey will require you to undertake a substantial amount of labelling if you don’t already have labelled images. Providing labelled content is the way that the algorithm learns. If we provided the above image and a batch of others like it and included a label of ‘memorial’ then that would make it more likely that other non-labelled images we input would be identified correctly. We could also include the more specific label ‘war memorial’ – but it would seem like a tall order for ML to distinguish war memorials from other types. Having said that, the fascinating thing is that often machines learn to detect patterns in a way that surpasses what humans can achieve. We can only give it a go and see what we get.

Thanks to Adrian Stevenson, one of the Hub Labs team, who took me through the technical processes outlined in this post.

Digital Content on Archives Hub

As part of the Archives Hub Labs ‘Images and Machine Learning’ project we are currently exploring the challenges around implementing IIIF image services for archival collections, and also for Archives Hub more specifically as an aggregator of archival descriptions. This work is motivated by our desire to encourage the inclusion of more digital content on Archives Hub, and to improve our users’ experience of that content, in terms of both display and associated functionality.

Before we start to report on our progress with IIIF, we thought it would be useful to capture some of our current ideas and objectives with regards to the presentation of digital content on Archives Hub. This will help us to assess at later stages of the project how well IIIF supports those objectives, since it can be easy to get caught up in the excitement of experimenting with new technologies and lose sight of one’s starting point. It will also help our audience to understand how we’re aiming to develop the Hub, and how the Labs project supports those aims.

Why more digital content?

  • We know it’s what our users want
  • Crucial part of modern research and engagement with collections, especially after the pandemic
  • Another route into archives for researchers
  • Contributes to making archives more accessible
  • Will enable us to create new experiences and entry points within Archives Hub
  • To support contributing archives which can’t host or display content themselves
The poet Edward Thomas, ‘Wearing hat, c.1904’

The Current Situation

At the moment our contributors can include digital content in their descriptions on Archives Hub. They add links to their descriptions prior to publication, and they can do this at any level, e.g. ‘item’ level for images of individually catalogued objects, or maybe ‘fonds’ or ‘collection’ level for a selection of sample images. If the links are to image files, these are displayed on the Hub as part of the description. If the links are to video or audio files, or documents, we just display a link.

There are a few disadvantages to this set up: it can be a labour-intensive process adding individual links to descriptions; links often go dead because content is moved, leading to disappointment for researchers; and it means contributing archives need to be able to host content themselves, which isn’t always possible.

Where images are included in descriptions, these are embedded in the page as part of the description itself. If there are multiple images they are arranged to best fit the size of the screen, which means their order isn’t preserved.

If a user clicks on an image it is opened in a pop out viewer, which has a zoom button, and arrows for browsing if there is more than one image.

The embedded image and the viewer are both quite small, so there is also a button to view the image in fullscreen.

The viewer and the fullscreen option both obscure all or part of the description itself, and there is no descriptive information included around the image other than a caption, if one has been provided.

As you can see the current interface is functional, but not ideal. Listed below are some of the key things we would like to look at and improve going forwards. The list is not intended to be exhaustive, but even so it’s pretty long, and we’re aware that we might not be able to fix everything, and certainly not in one go.

Documenting our aims though is an important part of steering our innovations work, even if those aims end up evolving as part of the exploration process.

Display and Viewing Experience

❐ The viewer needs updating so that users can play audio and video files in situ on the Hub, just as they can view images at the moment. It would be great if they could also read documents (PDF, Word etc).

❐ Large or high-resolution image files should load more quickly into the viewer.

❐ The viewer should also include tools for interacting with content, e.g. for images: zoom, rotate, greyscale, adjust brightness/contrast etc; for audio-visual files: play, pause, rewind, modify speed etc.

❐ When opened, any content viewer should expand to a more usable size than the current one.

❐ Should the viewer also support the display of descriptive information around the content, so that if the archive description itself is obscured, the user still has context for what they’re looking at? Any viewer should certainly display rights and licensing information clearly alongside content.

‘Now for a jolly ride at Bridlington’

Search and Navigation

❐ The Archives Hub search interface should offer users the option to filter by the type of digital content included in their search results (e.g. image, video, PDF etc).

❐ The search interface should also highlight the presence of digital content in search results more prominently, and maybe even include a preview?

❐ When viewing the top level of a multi-level description, users should be able to identify easily which levels include digital content.

❐ Users should also be able to jump to the digital content within a multi-level description quickly – possibly being able to browse through the digital content separately from the description itself?

❐ Users should be able to begin with digital content as a route into the material on Archives Hub, rather than only being able to search the text descriptions as their starting point.

Contributor Experience

❐ Perhaps Archives Hub should offer some form of hosting service, to support archives, improve availability of digital content on the Hub, and allow for the development of workflows around managing content?

❐ Ideally, we would also develop a user-friendly method for linking content to descriptions, to make publishing and updating digital content easy and time-efficient.

❐ Any workflows or interfaces for managing digital content should be straightforward and accessible for non-technical staff.

❐ The service could give contributors access to innovative but sustainable tools, which drive engagement by highlighting their collections.

❐ If possible, any resources created should be re-usable within an archive’s own sites or resources – making the most of both the material and the time invested.

Future Possibilities

❐ We could look at offering options for contributors to curate content in creative and inventive ways which aren’t tied to cataloguing alone, and which offer alternative ways of experiencing archival material for users.

❐ It would be exciting for users to be able to ‘collect’, customise or interact with content in more direct ways. Some examples might include:

  • Creating their own collections of content
  • Creating annotations or notes
  • Publicly tagging or commenting on content

❐ Develop the experience for users with things like: automated tagging of images for better search; providing searchable OCR scanned text for text within images; using the tagging or classification of content to provide links to information and resources elsewhere.

Image credits

Edward Thomas: Papers of Edward Thomas (GB 1239 424/8/1/1/10), Cardiff University Archives / Prifysgol Caerdydd.

Servants of the Queen (from Salome, c1900): by Dorothy Carleton Smyth. Art, Design and Architecture collection (GB 1694 NMC/0098F), Glasgow School of Art Archives and Collections.

‘Now for a jolly ride at Bridlington’: Claude William Jamson Archive (GB 50 U DX336/8/1), Hull University Archives.

‘Things for children’: Design Council Archives (GB 1837 DES/DCA/30/1/13/25), University of Brighton Design Archives.

Images and Machine Learning Project

Under our new Labs umbrella, we have started a new project, ‘Images and Machine Learning’, which has three distinct and related strands.

screenshot with bullet points to describe the DAO store, IIIF and Machine Learning
The three themes of the project

We will be working on these themes with ten participants, who already contribute to the Archives Hub, and who have expressed an interest in one or more of these strands: Cardiff University, Bangor University, Brighton Design Archives at the University of Brighton, Queen's University Belfast, the University of Hull, the Borthwick Institute for Archives at the University of York, the Geological Society, the Paul Mellon Centre, Lambeth Palace (Church of England) and Lloyds Bank.

This project is not about pre-selecting participants or content that meet any kind of criteria. The point is to work with a whole variety of descriptions and images, and not in any sense to ‘cherry pick’ descriptions or images in order to make our lives easier. We want a realistic sense of what is required to implement digital storage and IIIF display, and we want to see how machine learning tools work with a range of content. Some of the participants will be able to dedicate more time to the project, others will have very little time; some will have technical experience, others won’t. A successful implementation that runs beyond our project and into service will need to fit in with our contributors’ needs and limitations. It is problematic to run a project that asks people for amounts of time that will not be achievable long-term, because turning such a project into a service is then not likely to work.

DAO Store

Over the years we have been asked a number of times about hosting content for our contributors. Whilst there are already options available for hosting, they raise issues of cost, technical support, fitness for purpose, trust and security for archives that are not necessarily easily met.

Jisc can potentially provide a digital object store that is relatively inexpensive, integrated with the current Archives Hub tools and interfaces, and designed specifically to meet our own contributors’ requirements. In order to explore this proposal, we are going to invest some resource into modifying our current administrative interface, the CIIM, to enable the ingest of digital content.

We spent some time looking at the feasibility of integrating an archival digital object store with the current Jisc Preservation Service. However, for various reasons this did not prove to be a practical solution. One of the main issues is the particular nature of archives as hierarchical multi-level collections. Archival metadata has its own particular requirements. The CIIM is already set up to work with EAD descriptions and by using the CIIM we have full control over the metadata so that we can design it to meet the needs of archives. It also allows us to more easily think about enabling IIIF (see below).

The idea is that contributors use the CIIM to upload content and attach metadata. They can then organise and search their content, and publish it, in order to give it web address URIs that can be added to their archival descriptions – both in the Archives Hub and elsewhere.

It should be noted that this store is not designed to be a preservation solution. As said, Jisc already provides this service, and there are many other services available. This is a store for access and use, and for providing IIIF enabled content.

The metadata fields have not yet been finalised, but we have a working proposal and some thoughts about each field.

Title: mandatory? individual vs batch?
Dates: preferably structured, options for approx. and not dated.
Licence: possibly a URI. Option to add institution’s rights statement.
Resource type: controlled list. Values to be determined with participants. Could upload a thesaurus. Could try ML to identify type.
Keywords: free text.
Tagging: enable digital objects to be grouped, e.g. by topic, or e.g. ‘to do’ to indicate work is required.
Status: unpublished/published. May refer to IIIF enabled.
URL: unique URI of image (at individual level).
Proposed fields for the Digital Object Store

We need to think about the workflow and user interface. The images would be uploaded and not published by default, so that they would only be available to the DAO Store user at that point. On publication, they would be available at a designated URL. Would we then give the option to re-size? Would we set a maximum size? How would this fit in with IIIF and the preference for images of a higher resolution? We will certainly need to think about how to handle low resolution images.

International Image Interoperability Framework

IIIF is a framework that enables images to be viewed in any IIIF viewer. Typically, they can be sequenced, such as for a book, and they are zoomable to a very high resolution. At the heart of IIIF is the principle that organisations expose images over the web in a way that allows researchers to use images from anywhere, using any platform that speaks IIIF. This means a researcher can group images for their own research purposes, and very easily compare them. IIIF promotes the idea of fully open digital content, and works best with high resolution images.
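The deep zooming works because each image is served through the IIIF Image API, where the URL itself encodes the region, size, rotation, quality and format requested. A small sketch of that URL pattern, using Image API 3.0 conventions and a hypothetical image service:

```python
def iiif_image_url(base, identifier, region="full", size="max",
                   rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API request: {base}/{id}/{region}/{size}/{rotation}/{quality}.{format}"""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Full image at maximum size
print(iiif_image_url("https://iiif.example.org/images", "U-DX336-1-6"))

# Just a detail: a 600 x 400 pixel region starting at (1000, 500), scaled to 300 pixels wide
print(iiif_image_url("https://iiif.example.org/images", "U-DX336-1-6",
                     region="1000,500,600,400", size="300,"))
```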

There are a number of demos here: https://matienzo.org/iiif-archives-demo/

And here is a demo provided by Project Mirador: http://projectmirador.org/demo/

An example from the University of Cambridge: https://cudl.lib.cam.ac.uk/view/MS-RGO-00014-00051/358

And one from the University of Manchester: https://www.digitalcollections.manchester.ac.uk/collections/ruskin/1

There are very good reasons for the Archives Hub to get involved in IIIF, but there are challenges being an aggregator that individual institutions don’t face, or at least not to the same degree. We won’t know what digital content we will receive, so we have to think about how to work with images of varying resolutions. Our contributors will have different preferences for the interface and functionality. On the plus side, we are a large and established service, with technical expertise and good relationships with our contributors. We can potentially help smaller and less well-resourced institutions into this world. In addition, we are well positioned to establish a community of use, to share experiences and challenges.

One thing that we are very convinced by: IIIF is a really effective way to surface digital content and it is an enormous boon to researchers. So, it makes total sense for us to move into this area. With this in mind, Jisc has become a member of the IIIF Consortium, and we aim to take advantage of the knowledge and experience within the community – and to contribute to it.

Machine Learning

This is a huge area, and it can feel rather daunting. It is also very complicated, and we are under no illusions that it will be a long road, probably with plenty of blind alleys. It is very exciting, but not without big challenges.

It seems as if ML is getting a bad reputation lately, with the idea that algorithms make decisions that are often unfair or unjust, or that are clearly biased. But the main issue lies with the data. ML is about machines learning from data, and if the data is inadequate, biased, or suspect in some way, then the outcomes are not likely to be good. ML offers us a big opportunity to analyse our data. It can help us surface bias and problematic cataloguing.

We want to take the descriptions and images that our participants provide and see what we can do with ML tools. Obviously we won’t do anything that affects the data without consulting with our contributors. But it is best with ML to have a large amount of data, and so this is an area where an aggregator has an advantage.

This area is truly exploratory. We are not aiming for anything other than the broad idea of improved discoverability. We will see if ML can help identify entities, such as people, places and concepts. But we are also open to looking at the results of ML and thinking about how we might benefit from them. We may conclude that ML only has limited use for us – at least, as it stands now. But it is changing all the time, and becoming more sophisticated. It is something that will only grow and become more embedded within cultural heritage.

Over the next several months we will be blogging about the project, and we would be very pleased to receive feedback and thoughts. We will also be holding some webinar sessions. These will be advertised to contributors via our contributors list, and advertised on the JiscMail archives-nra list.

Thomas Baron Pitfield (1903-1999): a visual autobiography

Archives Hub feature for July 2015

Monstrous Monster drawing, 1979
TP1.12 Pen-and-wash drawing of the Monstrous Monster, the Duophonia, and landscapes, 1979

This month’s archival one true love is the Thomas Baron Pitfield Collection at the Royal Northern College of Music in Manchester. Pitfield was, to name a handful of epithets, a composer, teacher, poet, artist, engineer, furniture maker, calligrapher and engraver.

He studied and later taught at the Royal Manchester College of Music (RMCM). He is a well-loved composer. However, it is the rest of his creative life that I wish to draw attention to here in this feature. In particular, his sketchbooks.

A bit of context

Pitfield was born 5 April 1903 to a strict Church of England family in Bolton. His parents had him late in life and according to his memoirs he was an unwanted and unplanned for child.

Pitfield was not born into an environment of plentiful inspiration and artistic encouragement. His creative nature was exactly that: his nature. Nurture was not a feature. In his autobiographies he mentions that he was given no means to entertain himself as a child save for his own resourcefulness which he believed fostered innovation in his early years.

Painted minstrel, 1933
TP1.10 Painted minstrel, 1933

By age two he was notably good at drawing and in school his ability to learn music almost instantaneously by ear was remarked upon. Much, he assures us, to the unimpressed pillars of his parents who intended for him to be a joiner like his father. He strove on however, collecting scraps from his father’s workshop and working them into toys and other objects.

At age 14 he was pulled from school and enrolled in an apprenticeship in the millwrights’ department of a local engineering firm, which he despised. It took time away from his creative and musical endeavours which he sneakily developed when everyone else was asleep. He also abhorred the idea that the machines he was helping to maintain could one day severely harm or even kill someone, as the near misses he witnessed assured him could happen.

The artist

“The artist [it is said] should be able to find his inspiration in the objects and life about him. I could never wax poetic about the gasometers and industrial plant.” (Pitfield, A Song After Supper, 1990 p84). And so he haunted the Bolton moors at the weekends bringing sketchbooks with him. “The countryside is the backdrop of most of my creative thoughts.” (ibid 12)

Tree drawing, 1981
TP1.15 Pen-and-wash drawing of a contorted tree, Dunham Park, 1981

Here we witness the birth of his sketchbook obsession. By the end of his life he had filled over 6,000 pages with thoughts, ideas, paintings, music, teachings, prose, poetry and designs. He calls them “a visual autobiography… so that they have become an outline of my life’s activities.” (ibid, p95)

Calligraphy, 1960
TP1.16 Calligraphy swirls, 1960

In his books we see everything that influenced his life for over seven decades. From the many pen-and-wash sketches of churches, woodlands, creatures and characters, to the incredible astuteness of his calligraphy and furniture designs. This stream of creative consciousness follows him through his short time as a student at the RMCM after quitting engineering at 21; working as a teacher of woodwork for the unemployed from 23; his fruitful composition career; his fondly remembered time returning as a teacher to the RMCM and beyond.

Philosophy and themes

Pitfield was a complex mould breaker. He remarks that early on he “began to see that an almost rabid conformity in those about me was no assurance of their sanity.” (Pitfield, No Song, No Supper, 1986, p24) Themes of self-sufficiency and great personal motivation permeate his life, whether it be stepping away from his religious upbringing, becoming vegetarian at a young age, his pacifism or his love of John Ruskin and William Morris.

Dunbleton Church sketch, 1968
TP1.1 Dunbleton Church sketch with Aaron Copland quote, 1968

Nevertheless Christian iconography is very apparent in his notebooks and sits alongside furniture designs and the wild nature scenes which uproot the carefully penned calligraphy and drafts for lino prints, prose and poetry. The finished artworks crop up elsewhere in the archive but it is in the sketchbooks, the first manifestation for many of his creative outputs, where we find an absolute wonderland of inspiration.

Thanks for reading. If you would like to know more about his wonderful creations then do get in touch: archives@rncm.ac.uk.

Heather Roberts
College Archivist
Royal Northern College of Music

 

Related:

The Thomas Pitfield collections on the Archives Hub: http://archiveshub.ac.uk/data/gb1179-tp.

Browse the collections of the Royal Northern College of Music on the Archives Hub.

Artworks copyright: The Pitfield Trust.

HubbuB: October 2011

Europeana and APENet

Europeana logo

I have just come back from the Europeana Tech conference, a 2 day event on various aspects of Europeana’s work and on related topics to do with data. The big theme was ‘open, open, open’, as well, of course, as the benefits of a European portal for cultural heritage. I was interested to hear about Europeana’s Linked Data output, but my understanding is that at present, we cannot effectively link to their data, because they don’t provide URIs for concepts. In other words, identifiers for names such as http://data.archiveshub.ac.uk/doc/agent/gb97/georgebernardshaw, so that we can say, for example, that our ‘George Bernard Shaw’ is the same as ‘George Bernard Shaw’ represented on Europeana.

I am starting to think about the Hub being part of APENet and Europeana. APENet is the archival aggregator for Europe. I have been in touch with them about the possibility of contributing our data, and if the Hub was to contribute, we could probably start from next year. Europeana only provide metadata for digital content, so we could only supply descriptions where the user can link to the digital content, but this may well be worth doing, as a means to promote the collections of any Hub contributors who do link to digital materials.

If you are a contributor, or potential contributor, we would like to know what you think… we have a quick question for you at http://polldaddy.com/poll/5565396/. It simply asks if you think it’s a good idea to be part of these European initiatives. We’d love to get your views, and you only have to leave your name and a comment if you want to.

Flickr: an easy way to provide images online

You will be aware that contributors can now add images to descriptions and links to digital content of all kinds. The idea is that the digital content then forms an integral whole with the metadata, and it is also interoperable with other systems.

I’ve just seen an announcement by the University of Northampton, who have recently added materials to Flickr. I know that many contributors struggle to get server space to put their digital content online, so this is one possible option, and of course content reaches a huge number of people this way. There may be risks associated with the persistence of the URIs for the images, but then that is the case wherever you put them.

On the Hub we now have a number of images and links to content, for example: http://archiveshub.ac.uk/data/gb1089ukc-joh, http://archiveshub.ac.uk/data/gb1089ukc-bigwood, http://archiveshub.ac.uk/data/gb1089ukc-wea, http://archiveshub.ac.uk/data/gb141boda?page=7#boda.03.03.02.

Ideally, contributors would supply digital content at item level, so the metadata is directly about the image/digital content, but it is fine to provide it at any level that is appropriate.  The EAD Editor makes adding links easy (http://archiveshub.ac.uk/dao/). If you aren’t sure what to do, please do email us.

Preferred Citation

We never had the field for the preferred citation in our old template for the creation of EAD, and it has not been in the EAD Editor up till now. We were prompted to think about this after seeing the results of a survey on the use of EAD fields presented at the Society of American Archivists conference. Around 80% of archive institutions do use it. We think it’s important to advise people how to cite the archive, so we are planning to provide this in the Editor and may be able to carry out global edits to add this to contributors’ data.

List of Contributors

Our list of contributors within the main search page has now been revised, and we hope it looks substantially more sensible, and that it is better for researchers. This process really reminded us how hard it is to come up with one order for institutions that works for everyone!  We are currently working on a regional search, something that will act as an alternative way to limit searching. We hope to introduce this next year.

And finally…A very engaging Linked Data interface

This interface demonstration by Tim Sherratt shows how something driven by Linked Data can really be very effective. It also uses some of the Archives Hub vocabulary from our own Linked Data work, which is a nice indication of how people have taken notice of what we have been doing. There is a great blog post about it by Pete Johnston, Storytelling, archives and Linked Data. I agree with Pete that this sort of work is so exciting, and really shows the potential of the Linked Data Web for enabling individual and collective storytelling…something we, as archivists, really must be a part of.

HubbuB

Diary of the Archives Hub, June 2011

Design Council Archive poster
Design Council Archive: Festival of Britain poster

This is the first of our monthly diary entries, where we share news, ideas and thoughts about the Archives Hub and the wider world. This diary is aimed primarily at archives that contribute to the Hub, or are thinking about contributing, but we hope that it provides useful information for others about the sorts of developments going on at the Hub and how we are working to promote archives to researchers.

Hub Contributors’ Forum

At the Hub we are always looking to maintain an active and constructive relationship with our contributors. Our Contributors’ Forum provides one way to do this. It is informal, friendly, and just meets once or twice a year to give us a chance to talk directly to archivists. We think that archivists also value the opportunity to meet other contributors and think about issues around data discovery.

We have a Contributors’ Forum on 7th July at the University of Manchester and if any contributors out there would like to come we’d love to see you. It is a chance to think about where the Hub is going and to have input into what you think we should be doing, where our priorities should lie and how to make the service effective for users. Just in case you all jump in at once, we do have a limit on numbers….but please do get in touch if you are interested.

The session will be from 10.30 to 1.00 at the University of Manchester with lunch provided. It will be with some members of the Hub Steering Committee, so a chance for all to mix and mingle and get to know each other. And for you to talk to Steering Committee members directly.

Please email Lisa if you would like to attend: lisa.jeskins@manchester.ac.uk.

Contributor Audio Tutorials

Our audio tutorial is aimed at contributors who need some help with creating descriptions for the Hub. It takes you through the use of our EAD Editor, step-by-step. It is also useful in a general sense for creating archival descriptions, as it follows the principles of ISAD(G). The tutorial can be found at http://archiveshub.ac.uk/tutorials/. It is just a simple audio tutorial, split into convenient short modules, covering basic collection-level descriptions through to multi-level and indexing. Any feedback greatly appreciated – if you want any changes or more units added, just let us know.

Archives Hub Feature: 100 Objects

We are very pleased with our monthly features, founded by Paddy, now ably run by Lisa. They are a chance to show the wealth of archive collections and provide all contributors the opportunity to showcase their holdings.  They do quite well on Google searches as well!

Our monthly feature for June comes from Bradford Special Collections, one of our stalwart contributors, highlighting their current online exhibition: 100 Objects.  Some lovely images, including my favourite, ‘Is this man an anarchist?’ (No!! he’s just trying to look after his family): http://archiveshub.ac.uk/features/100objects/Nationalunionofrailwaymenposter.html

Relevance Ranking

Relevance ranking is a tricky beast, as our developer, John, will attest. How do you rank the results of a search in a way that users see as meaningful? Especially with archive descriptions, which range from a short description of a 100-box archive to a 10-page description of a 2-box archive!

John has recently worked on the algorithm used for relevance ranking, so that results now look more as most users would expect. For example, if you searched for ‘Sir John Franklin’ before, the ‘Sir John Franklin archive’ would not come up near the top of the results. It now appears first, rather than way down the list as it was previously. Result.

Images

Since last year we have provided the ability to add images to Hub descriptions. The images have to be stored elsewhere, but we will embed them into descriptions at any level (e.g. you can have an image to represent a whole collection, or an image at each item level description).

We’ve recently got some great images from the Design Council Archive: http://archiveshub.ac.uk/data/gb1837des-dca – take a look at the Festival of Britain entries, which have ‘digital objects’ linked at item level, enabling researchers to get a great idea of what this splendid archive holds.

Any contributors wishing to add images, or simple links to digital content, can easily do so using the EAD Editor: http://archiveshub.ac.uk/images/. You can also add links to documents and audio files. Let us know if you would like more information on this.

Linking to descriptions

Linking to Hub descriptions from elsewhere has become simpler, thanks to our use of ‘cool URIs’. See http://archiveshub.ac.uk/linkingtodescriptions/. You simply need to use the basic URI for the Hub, with the /data/ directory, e.g. http://archiveshub.ac.uk/data/gb029ms207.

Out and About

It would take up too much space to tell you about all of our wanderings, but recently Jane spent a very productive week in Prague at the European Libraries Automation Group (ELAG), a very friendly bunch of people, a good mix of librarians and developers, and a very useful conference centering on Linked Data.

Bethan is at the CILIP new professionals information day today, busy twittering about networking and sharing knowledge.

Lisa is organising our contributors’ workshops for this year (feels like our summer season of workshops) and has already run one in Manchester. More to follow in Glasgow, London and Cardiff. This is our first workshop in Wales, so please take advantage of this opportunity if you are in Wales or south west England. More information at http://archiveshub.ac.uk/contributortraining/

Joy is very busy with the exciting initiative, UKDiscovery. This is about promoting an open data agenda for archives, museums and libraries – something that we know you are all interested in. Take a look at the new website: http://discovery.ac.uk/.

With best wishes,
The Hub Team