Machine Learning with Archive Collections

Machine Learning is a sub-set of Artificial Intelligence (AI). You might like to look at devopedia.org for a short introduction to Machine Learning (ML).

Machine Learning is a data-oriented technique that enables computers to learn from experience. Human experience comes from our interaction with the environment. For computers, experience is indirect. It’s based on data collected from the world, data about the world.

Definition of Machine Learning from devopedia.org

The idea of this and subsequent blog posts is to look at machine learning from a specifically archival point of view as well as update you on our Labs project, Images and Machine Learning. We hope that our blog posts help archivists and other information professionals within the archival or cultural heritage domain to better understand ML and how it might be used.

AI can be used for many areas of learning and research. Chatbots have been trialled at some institutions; for example, ‘Ada’ at Bolton College has generally been well received. AI can be useful for aspects of website usability and accessibility, or helping students to choose the right university degree. The Jisc National Centre for AI site has more information on how AI can add value for education and learning.

At the Archives Hub we are particularly focussed on looking at Machine Learning from the point of view of archival catalogues and digital content, to aid discoverability, and potentially to identify patterns and bias in cataloguing.

Machine Learning to aid discoverability can be carried out as supervised or unsupervised learning. Supervised learning may be the most reliable, producing the best results. It requires a set of data that contains both the inputs and the desired outputs. By ‘outputs’ we mean that the objective is provided by labelling some of the input data. This is often called training data. In a ‘traditional’ scenario, code is written to take input and create output; in machine learning, input and output are provided, and the part done by human code is instead done by machine algorithms to create a model. This model is then used to derive outputs from further inputs.

The machine learning model, or program, is the outcome of learning from data (source: Advani 2020)

So, for example, taking the Vickers instruments collection from the Borthwick: https://dlib.york.ac.uk/yodl/app/collection/detail?id=york%3a796319&ref=browse. You may want to recognise optical instruments, for example, telescopes and microscopes. You could provide training data with a set of labelled images (output data) to create a model. You could then input additional images and see if the optical instruments are identified by the model.
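To make the idea of training data a little more concrete, here is a minimal sketch in Python of supervised learning. It is not what any real image recognition service does: we assume each image has already been reduced to a short list of numbers (the feature values below are invented for illustration), and the 'model' is simply the average of the labelled examples for each label. The function names `train` and `predict` are our own.

```python
from collections import defaultdict
from math import dist


def train(examples):
    """Learn one centroid (average feature vector) per label from labelled examples."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the new input."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training data: (feature vector, label). In practice the features
# would be derived from the image pixels; these numbers are made up.
training_data = [
    ([0.9, 0.1], "telescope"),
    ([0.8, 0.2], "telescope"),
    ([0.2, 0.9], "microscope"),
    ([0.1, 0.8], "microscope"),
]

model = train(training_data)          # inputs + desired outputs -> model
print(predict(model, [0.85, 0.15]))   # a further input -> derived output
```

The labelled examples play the role of the training data, and the model is then used on an input it has not seen before.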

Of course, the Borthwick may have catalogued these photographs already (in fact, they have been catalogued), so we know which are telescopes and which are micrometers or lenses or eyepieces. If you have a specialist collection, essentially focused on a subject, and the photographs are already labelled, then there may be less scope for improving discoverability for that collection by using machine learning. If the Borthwick had only catalogued a few boxes of photographs, they might consider using machine learning to label the remaining photographs. However, a big advantage is that the trained telescope-recognising model could then be used on all the images from the Archives Hub to discover and label images containing telescopes from other collections. This is one of the great advantages of applying ML across the aggregated data of the Archives Hub.

The results of machine learning are generally going to be better with more training data, so ideally you would provide a large collection of labelled photographs in order to teach the algorithm. Archive collections may not always be at the kind of scale where this process is optimised. Providing good training data is potentially a very substantial task, and does require that the content is labelled. It is possible to use models that are already available without doing this training step, but the results are likely to be far less useful.

Another scenario that could lend itself to ML is a more varied collection, such as Borthwick’s University photograph collection. These have been catalogued, but there is potential to recognise various additional elements within the photographs.

construction site with people
Construction of the J.B. Morrell Library, University of York

The above photograph has been labelled as a construction site. ML could recognise that there are people in the photograph, and this information could be added, so a researcher could then look for construction site with people. Recognising people in a photograph is something that many ML tools are able to do, having already been trained on this. However, archive collections are often composed of historic documents and old photographs that may not be as clear as modern documents. In addition, the models will probably have been trained with more current content. This is likely to be an issue for archives generally. For models to be effective, they need to have been trained with content that is similar to the content we want to catalogue.

The Amazon Web Services (AWS) Rekognition facial recognition tool finds three faces…
…the Microsoft Azure facial recognition tool doesn’t do so well.

The benefits of adding labels to photographs via ML to potentially enhance the catalogue and help with discoverability are going to depend upon a number of factors: how well the image is already catalogued, whether training data can be provided to improve the algorithm, and how well ML can then pick out features that might be of use.

The drawings of fossil fish at the Geological Society are another example of a very subject-specific collection. We put a few of these through some out-of-the-box ML tools. These tools have been pre-trained on large, diverse datasets, but we have not done any additional training ourselves yet, so you could see them as generalists in recognising entities rather than specialists with any particular material or topic.

drawing of a fossil tortoise
Fossil tortoise from Oeningen

In this case the drawing has been tagged with ‘fossil’, which could be useful if you wanted to identify fossil drawings from a varied collection of drawings. It has also tagged this with archaeology and art, both of which could potentially be useful, again depending upon the context. The label of soil is a bit more problematic, and yet it is the one that has been added with 99.5% certainty. However, a bit of training to tell the algorithm that ‘soil’ is not correct may remove this tag from subsequent drawings.

This example illustrates the above point that a subject-specific collection may be tagged with labels that are already provided in the catalogue description. It also shows that machine learning is unlikely to ever be perfectly accurate (although there are many claims it outperforms humans in a number of areas). It is very likely to add labels that are not correct. Ideally we would train the model to make fewer mistakes – though it is unlikely that all mistakes will be eliminated – so that does mean some level of manual review.
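As a rough sketch of what that manual review might look like in practice, the Python below sorts machine-generated labels into 'accept', 'review' and 'reject' piles. The function name `triage_labels`, the threshold of 90 and the idea of a reviewer-maintained rejection list are all our own assumptions. Only the 'soil' score of 99.5 comes from the example above; the other confidence scores are invented for illustration.

```python
def triage_labels(labels, threshold=90.0, rejected=frozenset()):
    """Split machine-generated labels into accepted, needs-review and rejected lists."""
    accepted, review, dropped = [], [], []
    for name, confidence in labels:
        if name.lower() in rejected:
            # A human reviewer has already said this label is wrong for this material.
            dropped.append(name)
        elif confidence >= threshold:
            accepted.append(name)
        else:
            # Below the threshold: hold back for a cataloguer to check.
            review.append(name)
    return accepted, review, dropped


# Hypothetical tool output as (label, confidence %) pairs.
# Only the 'Soil' score is taken from the post; the rest are made up.
labels = [("Soil", 99.5), ("Fossil", 95.0), ("Archaeology", 80.0), ("Art", 70.0)]

# A confidence threshold alone lets 'Soil' through...
print(triage_labels(labels))
# ...so a reviewed list of known-bad labels is also needed.
print(triage_labels(labels, rejected={"soil"}))
```

The point of the second call is that a high confidence score is not the same as a correct label, so a threshold on its own cannot replace human judgement.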

Tagging an image using ML may draw out features that would not necessarily be added to the catalogue – maybe they are not relevant to the repository’s main theme, and in the end, it is too time-consuming for cataloguers themselves to describe each photo in great detail as part of the cataloguing process.

Queen’s University Belfast: Hart Collection – China Photographs

The above image is a simple one with not too much going on. It will be discoverable on the Queen’s website through a search for ‘china’ or ‘robert hart’ for example, but tagging could make it discoverable for those interested in plants or architectural features. Again, false positives could be a problem, so a key here is to think about levels of certainty and how to manage expectations.

As mentioned above, archival images are often difficult to interpret. They may be old and faded, and they may also represent features or items that an algorithm will not recognise.

Design Council Archive: Things in their home setting – detail of a living room

In the above example from Brighton Design Archives, the photograph is from a set made of an exhibition of 1947, Things In Their Home Setting. The AWS image Rekognition service has no problem with the chair, but it has confidently identified the oven as a refrigerator. This could probably be corrected by providing more training data, or giving feedback to improve the understanding of the algorithm and its knowledge of 1940s kitchen furniture. But by the time you have given enough training data for the model to distinguish a cooker from a fridge or a washing machine, it might have been easier simply to do the cataloguing manually.

Another option for machine learning is optical character recognition. This has been around for a while, but it has improved substantially as a result of the machine learning approach. Again, one of the challenges for archives is that many items within the collections are handwritten, faded, and generally not easily readable. So, can ML prove to be better with these items than previous OCR approaches?

A tool like Transkribus can potentially offer great benefits to archives, and is seen as a community-driven effort to create, gather and share training data. We hope to try out some experiments with it in the course of our project.

Clerkenwell St James Parish, General Plan

The above plan is from Lambeth Palace Library’s 19th century ecclesiastical maps. It can already be found searching for ‘clerkenwell’ or ‘st james parish’. But ML could potentially provide more searchable information.

OCR using Azure

The words here are fairly clear, so the character recognition using the Microsoft Azure ML service is quite good. Obviously the formatting is an issue in terms of word order. ‘James’ is recognised as ‘Iames’ due to the style of writing. ‘Church’ is recognised despite the style looking like ‘Chvrch’ – this will be something the algorithm has learnt. This analysis could potentially be useful to add to the catalogue because an end user could then search for ‘pentonville chapel’ or ‘northampton square’ and find this plan.
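Here is a small Python sketch of how OCR output might be made searchable even when the word order is jumbled. The list of OCR lines is a hypothetical stand-in for what a service might return (it is not the actual Azure output for this plan), and the `tokens` and `matches` helpers are our own.

```python
import re


def tokens(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matches(query, ocr_lines):
    """True if every word of the query appears somewhere in the OCR output."""
    found = {tok for line in ocr_lines for tok in tokens(line)}
    return all(tok in found for tok in tokens(query))


# Hypothetical OCR output, with lines returned in a jumbled order.
ocr_lines = ["Northampton", "St Iames Church", "Square", "Pentonville Chapel"]

print(matches("pentonville chapel", ocr_lines))
print(matches("northampton square", ocr_lines))
print(matches("st james", ocr_lines))
```

Because the search ignores word order, 'northampton square' is found even though the two words were read on different lines. A search for 'st james' fails, though, because the OCR read 'Iames', which shows that misread characters still need some kind of correction or review.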

As well as looking at digital archives, we will be trying out examples with catalogue text. A great deal of archival cataloguing is legacy data, and archivists do not always have the time to catalogue to item level or to add index terms, which can substantially aid discoverability. So, it is tempting to look at ML as a means to substantially improve our catalogues. For example, to add to our index terms, which provide structured access points for end users searching for people, organisations, places and subjects.

In a traditional approach to adding subject terms to a catalogue, you might write rules. We have done this in our Names Project – we have written a whole load of rules in order to identify name, life dates, and additional data within index terms. We could have written even more rules – for example, to try to identify forename and surname. But it would be very difficult because the data does not present the elements of names consistently. We could potentially train an ML model with a load of names, tagging the parts of the name as forename, surname, dates, titles, epithets. But could an algorithm then successfully work out the parts of any subsequent names that we feed into it? It seems unlikely because there is no real consistency in how cataloguers input names. The algorithm might learn, for example, that a word, then a comma, then another word is surname, forename (Roberts, Elizabeth). But two words followed by a comma and another word could be surname + forename or forename + surname (Vaughan Williams, Ralph; Gerald Finzi, composer). In this scenario, the best option may be to aim to use source data (e.g. the Virtual International Authority File) to compare our data to, rather than try to train a machine to learn patterns, when there really isn’t a model to provide the input.
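The problem can be shown with a deliberately naive rule, sketched in Python. This is not one of the actual rules from our Names Project; `parse_name` is a hypothetical function that treats everything before the first comma as the surname and everything after it as the forename.

```python
def parse_name(term):
    """Naive rule: text before the first comma is the surname, text after it the forename."""
    before, sep, after = term.partition(",")
    if not sep:
        return None  # no comma, so the rule does not apply
    return {"surname": before.strip(), "forename": after.strip()}


for term in ["Roberts, Elizabeth", "Vaughan Williams, Ralph", "Gerald Finzi, composer"]:
    print(term, "->", parse_name(term))
```

The rule works for 'Roberts, Elizabeth' and 'Vaughan Williams, Ralph', but for 'Gerald Finzi, composer' it returns a surname of 'Gerald Finzi' and a forename of 'composer'. Nothing in the shape of the string tells the rule (or a model trained on such strings) which reading is right.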

We may find that analysing text within a catalogue offers more promise.

Part of the admin history for the British Linen Company archive at Lloyds

Here is an example from an administrative history of the British Linen Company, a collection held by Lloyds Banking Group. The entity recognition is pretty good – people’s names, organisations, dates, places, occupations and other entities can be picked out fairly successfully from catalogues. Of course that is only the first step; it is how to then use that information that is the main issue. You would not necessarily want to apply the terms as index terms, for example, as they may not be what the collection is substantially about. But from the above example you could easily imagine tagging all the place names with a ‘place’ tag, so that a place search could find them. So, a general search for Stranraer would obviously find this catalogue entry, but if you could identify it as a place name it could be included in the more specific place name search.
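A minimal sketch of that idea in Python follows. We assume an entity recognition tool has already returned (text, label) pairs for each catalogue record; the record identifiers, the 'PLACE' and 'ORGANISATION' label names and the two helper functions are all hypothetical.

```python
from collections import defaultdict


def build_place_index(records):
    """Map each recognised place name (lower-cased) to the records that mention it."""
    index = defaultdict(set)
    for record_id, entities in records.items():
        for text, label in entities:
            if label == "PLACE":
                index[text.lower()].add(record_id)
    return index


def place_search(index, place):
    """Return the identifiers of records in which the term was tagged as a place."""
    return sorted(index.get(place.lower(), set()))


# Hypothetical entity recognition output for two catalogue records.
records = {
    "record-1": [("Stranraer", "PLACE"), ("British Linen Company", "ORGANISATION")],
    "record-2": [("Lloyds Banking Group", "ORGANISATION")],
}

index = build_place_index(records)
print(place_search(index, "Stranraer"))
print(place_search(index, "British Linen Company"))
```

Only terms tagged as places end up in the place index, so 'Stranraer' finds the first record while the organisation name finds nothing, even though it would still be found by a general text search.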

With machine learning it is very difficult and sometimes impossible to understand exactly what is happening and why. By definition, the machine learns and modifies its output. Whilst you can provide training data to give inputs and desired outputs, machine learning will always be just that: a machine learning as it goes along, and not simply working through a program that a human has written. Supervised learning provides for the most control over the outputs. Unsupervised learning, and deep learning, are where you have much less control (we’ll come onto those in later posts).

It is only by understanding the algorithms and what they are doing that you can set up your environment for the best results. But that is where things can get very complicated. We are going to try to run some experiments where we do prepare the data, but learning how to do this is a non-trivial task. Hence one of the questions we are asking is ‘is Machine Learning worth the effort required in order to improve archival discoverability?’ We hope to get at least some way along the road to answering that question.

There are, of course, other pressing questions, not least the issue of bias, and concerns about energy use with machine learning as well as how to preserve the processes and outputs of ML and document the decision making. But there could be big wins in terms of saving time that can then be dedicated to other tasks. The increasing volumes of data that we have to process may make this a necessity. We hope to touch upon some of these areas, but this is a fairly small scale project and Machine Learning is one huge topic.

A Selection of Archives to mark International Women’s Day

To mark International Women’s Day on 8th March, here is a selection of archives featuring women who have excelled and been highly influential in many different fields.

Daphne Oram (1925-2003), composer and musician

The Daphne Oram Archive, held at Goldsmiths, University of London, comprises papers, personal research, correspondence and photographs documenting the life and work of a pioneering British composer and electronic musician.

Throughout her career she lectured on electronic music and studio techniques. In 1971 she wrote An Individual Note of Music, Sound and Electronics which investigated philosophical aspects of electronic music. Besides being a musical innovator her other significant achievements include being the first woman to direct an electronic music studio, the first woman to set up a personal studio and the first woman to design and construct an electronic musical instrument.

Delia Derbyshire (1937-2001), musician and composer

The University of Manchester holds the Papers of Delia Derbyshire, composer. After being rejected by Decca Records, who said that they did not employ women in the recording studio, in 1962 Derbyshire became a trainee studio manager at the BBC. She was soon seconded to work at the BBC’s Radiophonic Workshop, which had been set up to provide theme and incidental music and sound for BBC radio and television programmes. The following year, she produced her electronic ‘realisation’ of Ron Grainer’s theme tune for the hugely popular BBC series Doctor Who – which is still one of the most famous and instantly recognisable television themes. In the late 1990s there was renewed interest in her work and many younger musicians making electronic dance and ambient music (such as Aphex Twin and The Chemical Brothers) cited Derbyshire as an important influence.

The Anita White Foundation International Women and Sport Archive

Dr Anita White and Professor Celia Brackenridge were both associated with the University of Chichester, and they were both centrally involved in the leadership and development of the international women and sport movement since 1990. The International Women and Sport Archive is comprised primarily of papers brought together by them and other leaders in the movement, accumulated in the course of their research, study and work in the fields of the sociology of sport and sport science, and their involvement as activists and leaders in the global women and sport movement.

The International Women and Sport Movement is said to have been born out of a decade in which increasing globalisation brought together women from across the world in the practice of sport. It does not refer to any one organisation, body or country, but it is generally agreed that a landmark event and major catalyst in the movement was the first international conference on women and sport which took place on 5-8 May 1994.

Kaye Webb (1914-1996), editor and publisher

The Papers of Kaye Webb, covering her career as journalist, magazine editor, editor at Puffin and later literary agent, are held at the Seven Stories Archive. The collection provides a comprehensive record of Webb’s career, reflecting the wide variety of work undertaken by her, and documented through notes, correspondence, press cuttings, audio-visual material, memorabilia and ephemera. Webb was editor of Puffin Books between 1961 and 1979, and in 1967 founded the Puffin Club, which she ran until 1981. As a journalist she worked on publications including Picture Post, Lilliput and the News Chronicle.

Elizabeth Garrett Anderson (1836-1917), physician and suffragist

The Letters of Elizabeth Garrett Anderson are part of the Women’s Library Archives. An English physician and suffragist, she was the first woman to qualify in Britain as a physician and surgeon. She was the co-founder of the first hospital staffed by women, the first dean of a British medical school, the first woman in Britain to be elected to a school board and, as mayor of Aldeburgh, the first female mayor in Britain. The letters cover Anderson’s struggle to secure an entry into the medical profession.

Barbara Castle (1910-2002), politician and campaigner

The Barbara Castle Cabinet Diaries at the University of Bradford cover 1965-1971 and 1974-1976. In the 1945 General Election Barbara Castle was elected M.P. for Blackburn, a seat that she retained for 34 years. Following the Labour victory in 1964, Prime Minister Harold Wilson put Castle in charge of the newly-created Ministry of Overseas Development. “I decided on 26 January that I ought to start keeping a regular record of what was happening”, she said. Castle maintained this political diary throughout her periods in office. In 1974 Castle was made Secretary of State for Social Services, and in this post she introduced payment of child benefit to mothers and worked on the State Earnings Related Pensions Scheme. In 1979 she became a Member of the European Parliament and in 1990 she entered the House of Lords as Baroness Castle of Blackburn.

Alison Settle (1891-1980), fashion journalist and editor

In a career spanning from the early 1920s to the early 1970s, Alison Settle worked as a fashion journalist, and Brighton Design Archive hold the Alison Settle Archive which includes professional papers dating from the mid-1930s. She was a tireless champion of the interests of women, as well as campaigning for good quality, affordable design through her relationships with designers and manufacturers. Settle sought to improve design standards in all areas of manufacture and production, and contributed to the work of both the Council for Art & Industry and the Council of Industrial Design. She remained one of the best known fashion journalists in the country.

Elsie Edith Bowerman (1889-1973), lawyer and suffragette

Diaries, photographs and correspondence of Elsie Edith Bowerman are held at the Women’s Library. Bowerman followed her mother into the suffrage movement. They were both active members of the militant Women’s Social & Political Union. They were on the maiden voyage of the Titanic – both survived. She worked for Scottish Women’s Hospitals during the First World War, and she also worked for Emmeline and Christabel Pankhurst during their campaign for ‘industrial peace’ in support of the war effort. In 1924 or 1925 she went on to set up the Women’s Guild of Empire with Flora Drummond, with the aim of promoting co-operation between employers and workers. She was admitted to the Bar in the early twenties and practised until 1938, when she joined the Women’s Voluntary Services. In 1947 Bowerman went to the United States to help set up the United Nations Commission on the Status of Women.

Tessa Boffin (1960-1993), writer, photographer and performance artist

The Tessa Boffin Archive at the University for the Creative Arts includes lesbian, gay, bisexual, transsexual and other photography projects, including portrayal of AIDS, cross dressing and safe sex, as well as notes on the portrayal of feminism and AIDS in television and radio productions of the 1980s. Boffin was one of the leading lesbian artists in Great Britain during the AIDS Crisis, but her risqué performances were controversial, and frequently drew criticism, including from inside the LGBTQ community.

Gladys Aylward (1902-1970), missionary

Gladys May Aylward was an evangelical Christian missionary to China. She travelled to China in 1932 and in 1936 she became a Chinese citizen. In 1940, against the background of civil war between Nationalist government troops and the Communists, Japanese invasion, and the threat of bandits, she led a group of orphans on a perilous journey to Sian. Her story was told in the book The Small Woman, by Alan Burgess published in 1957, and made into the film The Inn of the Sixth Happiness starring Ingrid Bergman, in 1958. The Papers of Gladys Aylward, held at SOAS, provide a vivid portrait of Aylward, including her life in China, and the impact of World War Two.

Exploring New Worlds in the Archives Hub

This blog post forms part of History Day 2020, a day of online interactive events for students, researchers and history enthusiasts to explore library, museum, archive and history collections across the UK and beyond.

Use the Archives Hub, a free resource, to find unique sources for your research, both physical and digital. Search across descriptions of archives, held at over 350 institutions across the UK.

History Day 2020 coincides with the Being Human festival, the UK’s national festival of the humanities. Their theme this year is ‘New Worlds’, so taking this as our inspiration, we’re highlighting a range of archive collections – across Travel, Exploration, Space Exploration and Science Fiction.

Travel

Austen Henry Layard’s passport (1) (LAY/1/4/8). Image copyright: University of Newcastle.

Unearthing Family Treasures: The Layard and Blenkinsopp Coulson Archives
In 1839 a young lawyer left behind his London office for a post in the Ceylon (now Sri Lanka) Civil Service, thus beginning a series of travels, adventures and discoveries which would result in him achieving world renown for uncovering and shining a light on the ancient civilizations of Mesopotamia, in particular Assyrian culture. That young man was Austen Henry Layard. Read the feature, by University of Newcastle Special Collections.

Papers of Elizabeth Thomson, 1847-1918, teacher, missionary, traveller and suffragette, c1914
Throughout the 1890s and 1900s Thomson travelled the world with her sister, Agnes, working as teachers and missionaries. The countries they visited include India, Japan, the USA, Germany and Italy. In the summer of 1899 Thomson reports that she visited Faizabad in India to learn Urdu but could not stand the heat and left for Almora in 1902. In 1907 she sailed to Bombay to complete missionary work, before teaching English in Sangor for the winter. In 1909 she travelled back to the UK, via Vienna, Prague, Dresden and Berlin, to settle in Edinburgh. Material held by University of Glasgow Archive Services – see the full collection description.

Steel engraving, 1875. © Image is in the public domain.

Sentimental Journey: a focus on travel in the archives
The hundreds of collections relating to travel featured in the Archives Hub shed light on multiple aspects of travel, from royalty to the working classes, and encompassing touring, business, exploration and research, the work of missionaries and nomadic cultures. Read the feature.

An abstract of a voyage from England to the Mediteranian: the diary of an anonymous English naval victualler, 1694-1696
Contains the log of an anonymous English naval victualler on a voyage from Gravesend in England to Cadiz in the Mediterranean between 31 December 1694 and 29 October 1696. Material is in English, Spanish, Latin and Hebrew. Written in a single neat late seventeenth-century English hand with the text on each page set within faint ruled lines. There are many tables, diagrams, and quite finely-drawn illustrations of places en route, especially in Spain, and interesting objects, such as keys and seals. Material held by University of Leeds Special Collections – see the full collection description.

Bodiwan Papers, 1634-1923
The papers of Michael D. Jones and his family, which include numerous letters to Michael D. Jones from the Welsh settlers in Patagonia or relating to them, prior to the sailing of the Mimosa and after. Amongst them is a letter from Charles de Gaulle, the eminent Breton and Celticist, expressing his interest in the scheme to found a Welsh colony in Patagonia. Also, amongst the correspondents are L. Patagonia Humphreys, Rev. D. Lloyd Jones, Rhuthun and Mihangel ap Iwan and Llwyd ap Iwan. The papers reflect the hardship suffered by the new settlers as well as the investment made by Michael D. Jones in the venture. There are bills and receipts relating to the Mimosa, share certificates, statistics regarding population for 1879. Also, a bank pass book of the Welsh Colonising and General Trading Company Ltd, 1870-1883, and a register of the Welsh applicants to Patagonia, 1875-1876. The collection is held by Archifdy Prifysgol Bangor / Bangor University Archives – see the full collection description.

The London to Istanbul European Highway
Part of The National Motor Museum Trust Motoring Archive’s Bradley Collection, including striking illustrations by Margaret Bradley. Read the feature.

The handsome blue car, by Margaret Bradley. ‘With apologies…this being a rough sketch…made somewhere in the middle of no mild channel’. Sketch by Margaret Bradley, copyright the National Motor Museum Trust.

Exploration

Cambridge Svalbard Exploration Collection, 1933-1992
The collection documents many decades of scientific work undertaken by (mostly) Cambridge researchers from 1938 until the early 1990s. The expeditions were mostly led by Walter Brian Harland (1917-2003), who also became the collator of the materials collected in Spitsbergen. The documentary archive complements the physical collection of geological specimens collected during those expeditions. Svalbard is located in the north-western corner of the Barents Shelf, 650km north of Norway. The shelf is named after the Dutch captain Barents, who is credited with the modern discovery of the islands in 1596 and after whom the Barents Sea is also named. Collection held by Sedgwick Museum of Earth Sciences, University of Cambridge – see the full collection description.

Online Resource: Old Maps Online – provided by the Great Britain Historical GIS Project, Old Maps Online is a search portal that combines the historical map collections of several organisations around the world. Users can search across collections through a single interface and easily locate multiple maps of a geographical area. The interface is free and access is open to all users. A wide range of different types of map are available, including: land maps; sea charts; boundary and estate maps; military and political maps; and town plans. Historical maps of many countries are available – including South and Central America from the 16th to the 20th centuries; Britain and particularly London, up to 1860; North America in the 18th and 19th centuries; pre-1900 Dutch Maps; the North West of England; and Moscow. More details.

Challenger Expedition Photographs, 1870s-1885; 1981-1983
HMS Challenger set out to collect specimens from different depths of water across the globe. The voyage took place between 1872 and 1876. It is thought that this was the first expedition to routinely use photography to document the journey. There was a darkroom on board so photographs could be developed on the ship. Material held by National Museums Scotland – see the full collection description.

Shackleton’s Endurance Expedition Centenary
27th October 1915: Antarctic expedition ship Endurance was abandoned on the orders of Sir Ernest Shackleton and the expedition became a fight for survival. Read the feature by the Scott Polar Research Institute, University of Cambridge.

Space Exploration

John Herschel’s photograph of his father’s 40-foot telescope.
Herschel’s 40-foot telescope, circular glass plate photograph. The telescope’s wooden scaffolding is seen here on 9 September 1839, at Observatory House in Slough, England. It was photographed by the astronomer John Herschel (1792-1871) before its demolition. The telescope was designed by John’s father, the German-born British astronomer William Herschel (1738-1822). The tube was 40 feet (12 metres) long. The first observations with this telescope were carried out 50 years earlier on 28 August 1789, when two new moons of Saturn (Enceladus and Mimas) were discovered. 50 years later, by 1839, John Herschel and W H Fox Talbot had invented the process we now know as photography. This is one of the earliest surviving glass plate photographs. Image copyright: Royal Astronomical Society Archives

Russian Space Exploration, 1903
Drawings, documents, photographs, ephemeral objects and memorabilia relating to early Russian space exploration. Objects include domestic items such as cigarette cases, ashtrays, ornamental cigarette dispensers, desk thermometers, ornamental lamps and tea glass holders. Included in the collection are photo albums and a press cutting album made by a school child as well as stamp collections. The collection boasts rare drawings by Konstantin Tsiolkovsky in which he envisaged the exit from a spacecraft into the vacuum of space as well as a drawing of a reactive engine (rocket engine), one of the first designs of its kind, from c.1930. The collection is held by De Montfort University Archives and Special Collections – see the full collection description.

Jodrell Bank Observatory Archive, c.1924-1993
The Jodrell Bank Observatory is one of the world’s largest radio-telescope facilities. Originally known as the Jodrell Bank Experimental Station, it was renamed the Nuffield Radio Astronomy Laboratories in 1966, and changed to its current name in 1999. The first radar transmitter and receiver was installed by Bernard Lovell, then working as a physicist at the University of Manchester, at Jodrell Bank, Cheshire, in December 1945 (the University campus had proved unsuitable because of the high level of electrical interference). At this period Lovell was researching cosmic rays under the direction of Patrick Blackett, professor of physics at the University of Manchester. Lovell’s work involved studying radio echoes from large cosmic ray showers in the Earth’s atmosphere, using old military radars. As a result of this, Lovell went on to make important discoveries in meteoric astronomy. The collection is held by University of Manchester Library – see the full collection description.

The Herschel archive at the Royal Astronomical Society
The Royal Astronomical Society is the custodian of a significant collection of the astronomy-related papers of William, Caroline and John Herschel. Read the feature.

Caroline Herschel.
Caroline Lucretia Herschel (1750-1848), German-born British astronomer, in 1847, pointing at the orbit of a comet on a map of the solar system. The map shows all the planets out to Saturn. Uranus had been discovered in 1781 by William Herschel, but was at first thought to be a comet. Neptune was discovered in 1846. The map also shows the asteroids Ceres (discovered in 1801), Pallas (1802), Juno (1804) and Vesta (1807). Caroline was the sister of William Herschel, and worked with him in England. She discovered eight new comets between 1786 and 1797. After her brother’s death in 1822, Caroline returned to Hanover, where she died at the age of 98. This artwork shows Herschel in Hanover in 1847, the year before she died. Image copyright: Royal Astronomical Society Archives

Science Fiction

Papers of Douglas Noël Adams, 1952-2001 (Circa.)
Douglas Noël Adams was born in Cambridge in 1952. He was awarded an exhibition to read English at St John’s College, Cambridge, obtaining his BA in 1974. While at Cambridge, Adams occupied himself chiefly in writing, performing in, and producing comedy sketches and revues, establishing connections that were to be integral to his future work. His career took off with ‘The Hitchhiker’s Guide to the Galaxy’, a six-part comic science-fiction radio series commissioned by the BBC in 1977 and broadcast in 1978. Novelisation and a second series were followed by further books in what became billed as ‘the increasingly inaccurately named Hitchhiker’s Trilogy’. The ‘Hitchhiker’s Guide’ series has taken many forms, including audio recordings; stage adaptations; a television series; a computer game; publication of the original radio scripts; radio adaptations of the remaining novels, and a film. Adams’s other creative work included writing and script-editing for BBC Television’s ‘Doctor Who’. Material held by St John’s College Library Special Collections, University of Cambridge – see the full collection description.

Papers of Brian Aldiss, 1966-1995
Brian Aldiss was born in 1925 in Dereham, Norfolk. After war service in the Royal Corps of Signals he entered the bookselling trade, working at Sanders & Co. in Oxford. His first work as a writer was The Brightfount Diaries, a fictionalised diary of a bookseller first published as a column in The Bookseller during 1954 and 1955 and published as one volume by Faber & Faber in 1955. The following year he became a full-time writer, and in 1957 his first science fiction book, the short story collection Space, Time and Nathaniel, was published. His first science fiction novel, Non-Stop, was published in 1958. Since then Aldiss has been a prolific writer, best known for his science fiction novels, novellas and short stories, including the award-winning Helliconia trilogy. He has also been a historian and critic of the genre, and has edited many science fiction collections. In addition, his ‘mainstream’ writing has included the novels The Male Response, Forgotten Life and the semi-autobiographical Horatio Stubbs sequence. He was elected a Fellow of the Royal Society of Literature in 1989. In 1990 he published his autobiography, Bury my heart at W.H. Smith’s. The collection is held by the University of Reading Special Collections Services – see the full collection description.

Other ‘New Worlds’

Pan-African Congress 1945 and 1995 Archive
The Pan-African Congress was a series of meetings held throughout the world. In 1945 Manchester hosted the 5th Pan-African Congress. The Congress was successful in bringing attention to decolonization in Africa and in the West Indies. It gained a reputation as a peacemaker and made significant advances for the Pan-African cause. Its demands included an end to colonial rule, imperialism and racial discrimination, as well as human rights and equality of economic opportunity. The manifesto issued by the Congress set out its political and economic demands for a new world context of international cooperation. Material is held by the Ahmed Iqbal Ullah Race Relations Resource Centre – see the full collection description.

Records of the British Union for the Abolition of Vivisection, 1865-1996
The British Union for the Abolition of Vivisection (BUAV) was founded in 1898 by Miss Frances Power Cobbe (1822-1904). Concern for the welfare of animals was not a new phenomenon: the first wave of anti-vivisection feeling in England began around the middle of the nineteenth century. The Second World War appeared to foster greater ideas of cooperation within the animal welfare movement. The Conference of anti-vivisection Societies first met on 20 November 1942. Five societies were represented at the invitation of BUAV ‘for the purpose of discussing and making plans for a joint intensive campaign, after the war, to claim the total abolition of vivisection as a necessary step towards securing for animals their rightful place in the new world order, which it is generally believed will follow the peace’. The immediate post-war period began to see a rise in public demonstrations as a medium to spread the anti-vivisection message; in particular, these were held outside vivisection laboratories. The collection is held by Hull University Archives, Hull History Centre – see the full collection description.

The Percy Johnson-Marshall Collection, 1931-1993
Percy Edwin Alan Johnson-Marshall (1915-1993) was one of the most energetic of a generation of town-planners who began their careers in the 1930s and, after the Second World War, dedicated their lives to the creation of a new world of social equity through the radical transformation of the human environment. Material held by Edinburgh University Library Special Collections – see the full collection description.

Find out more

Archives Wales Catalogues Online: Working with the Archives Hub

Stacy Capner reflects on her first six months as Project Officer for the Archives Wales Catalogues Online project, a collaboration between the Archives and Records Council Wales and the Archives Hub to increase the discoverability of Welsh archives.

For a few years now there has been a strategic goal to get Wales’ archive collections more prominently ‘out there’ using the Archives Wales website. Collection level descriptions have been made available previously through the ‘Archives Network Wales’ project, but the aim now is to create a single portal to search and access multi-level descriptions from across services. The Archives Hub has an established, standards based way of doing this, so instead of re-inventing the wheel, Archives and Records Council Wales (ARCW) saw an opportunity to work with them to achieve these aims.

The work to take data from Welsh Archives into the Archives Hub started some time ago, but it became clear that getting exports from different systems and working with different cataloguing practices required more dedicated 1-2-1 liaison. I am the project officer on a defined project which began in April to provide dedicated support to archive services across Wales and to establish requirements for uploading their catalogue data to the Archives Hub (and subsequently to Archives Wales).

This project is supported by the Welsh Government through its Museums Archives and Libraries Division, with a grant to Swansea University, a member of ARCW and a long-standing contributor to the Hub. I’m on secondment from the University to the project, which means I’ve found myself back in my northern neck of the woods working alongside the Archives Hub team. This project has come at a time when the Archives Hub have been putting a lot of thought into their processes for uploading data straight from systems, which means that the requirements for Welsh services have started to define an approach which could be applied to archive services across Scotland, England and Northern Ireland.

Here are my reflections on the project so far:

  1. Wales has fantastic collections, holding internationally significant material. They deserve to be promoted, accessible and searchable to as wide an audience as possible. Some examples:

National Library of Wales, The Survey of the Manors of Crickhowell & Tretower (inscribed in the UNESCO Memory of the World Register, 2016) https://www.llgc.org.uk/blog/?p=11715

Swansea University, South Wales Coalfield Collection http://www.swansea.ac.uk/library/archive-and-research-collections/richard-burton-archives/ourcollections/southwalescoalfieldcollection/

West Glamorgan, Neath Abbey Ironworks collection (inscribed in the UNESCO Memory of the World register, 2014) http://www.southwales-eveningpost.co.uk/treasured-neath-port-talbot-history-recognised/story-26073633-detail/story.html

Bangor University, Penrhyn Estate papers (including material relating to the sugar plantations in Jamaica) https://www.bangor.ac.uk/archives/sugar_slate.php.en#project

Photograph of Ammanford colliers and workmen standing in front of anthracite truck, c 1900. From the South Wales Coalfield Collection. Source: Richard Burton Archives, Swansea University (Ref: SWCC/PHO/COL/11)

  1. Don’t be scared of EAD! I was. My knowledge of EAD (Encoded Archival Description) hadn’t been refreshed in 10 years, since Jane Stevenson got us to create brownie recipes using EAD tags on the archives course. So, whilst I started the task with confidence in cataloguing and cataloguing systems, my first month or so was spent learning about the Archives Hub EAD requirements. For contributors, one of the benefits of the Archives Hub is that they’ve created guidance, tools and processes so that archivists don’t have to become experts at creating or understanding EAD (though it is useful and interesting, if you get the chance!).
  1. The Archives Hub team are great! Their contributor numbers are growing (over 300 now) and their new website and editor are only going to make it easier for archive services to contribute and for researchers to search. What has struck me is that the team are all hot on data, standards and consistency, but it’s combined with a willingness to find solutions/processes which won’t put too much extra pressure on archive services wishing to contribute. It’s a balance that seems to work well and will be crucial for this project.
  1. The information gathering stage was interesting. And tiring. I visited every ARCW member archive service in Wales to introduce them to the project, find out what cataloguing systems they were using, and to review existing electronic catalogues. Most services in Wales are using Calm, though other systems currently being used include internally created databases, AtoM, Archivists Toolkit and Modes. It was really helpful to see how fields were being used, how services had adapted systems to suit them, and how all of this fitted in to Archives Hub requirements for interoperability.

    Photo of icecream
    Perks of working visits to beautiful parts of Wales.
  1. The support stage is set to be more interesting. And probably more tiring! The next 6 months will be spent providing practical support to services to help enable their catalogues to meet Archives Hub requirements. I’ll be able to address most of the smaller, service specific, tasks on site visits. The Hub team and I have identified a number of trickier ‘issues’ which we’ll hash out with further meetings and feedback from services. I can foresee further blog posts on these so briefly they are:
  • Multilingualism: most services catalogue Welsh items/collections in Welsh, English items/collections in English and multi-language items/collections bilingually. However, the method of doing this across services (and within services) isn’t consistent. We’re going to look at what can be done to ensure that descriptions in multiple languages are both human and machine readable.
  • Ref no/Alt ref: due to legacy issues with non-hierarchical catalogues, or just services’ personal preference, there are variations in the use of these fields. Some services use the ref no as the reference, others use the alt ref no as the reference. This isn’t a problem (as long as it’s consistent). Some services use ref no as the reference but not at series level, others use the alt ref no as the reference but not at series level. This will prove a little trickier for the Archives Hub to handle but hopefully workarounds for individual services will be found.
  • Extent fields missing: this is a mandatory field at collection level for the Archives Hub. It’s important to give researchers an idea of the size of the collection/series (it’s also an ISAD(G) required field). However, many services have hundreds of collection level descriptions which are missing extent. It’s not something I’ll practically be able to address on my support visits so the possibility of further work/funding will be looked into.
  • Indexing: this is understandably very important to the Archives Hub (they explain why here). For several archive services in Wales it seems to have been a step too far in the cataloguing process, mainly due to a lack of resource/time/training. Most have used imported terms from an old database or nothing at all. Although this will not prevent services from contributing catalogues to the Archives Hub, it does open up opportunities to think about partnership projects which might address this in the future (including looking at Welsh language index terms).
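As a rough illustration of the ‘extent fields missing’ issue, a check along the following lines could flag collection level descriptions that have no extent before they are submitted. This is only a sketch: it assumes a simplified, namespace-free EAD 2002 file in which extent sits at archdesc/did/physdesc/extent, and the sample records and the function name has_collection_extent are made up for the example. It is not the Archives Hub’s own validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD 2002 fragments (no namespace).
SAMPLE_WITH_EXTENT = """
<ead>
  <archdesc level="fonds">
    <did>
      <unitid>EXAMPLE/1</unitid>
      <unittitle>Example estate papers</unittitle>
      <physdesc><extent>12 boxes</extent></physdesc>
    </did>
  </archdesc>
</ead>
"""

SAMPLE_WITHOUT_EXTENT = """
<ead>
  <archdesc level="fonds">
    <did>
      <unitid>EXAMPLE/2</unitid>
      <unittitle>Example chapel records</unittitle>
    </did>
  </archdesc>
</ead>
"""


def has_collection_extent(ead_xml: str) -> bool:
    """Return True if the top-level description has a non-empty <extent>."""
    root = ET.fromstring(ead_xml.strip())
    # Look only at the <did> directly under <archdesc>, i.e. collection level;
    # extents recorded lower down (series, items) are deliberately ignored.
    extent = root.find("./archdesc/did/physdesc/extent")
    return extent is not None and bool((extent.text or "").strip())


for name, sample in [("with", SAMPLE_WITH_EXTENT), ("without", SAMPLE_WITHOUT_EXTENT)]:
    print(name, has_collection_extent(sample))
```

A report like this would only say which descriptions need attention; deciding what the extent actually is still needs someone who knows the collection.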

The project has made me think about how I’ve catalogued in the past. It’s made me much more aware that cataloguing shouldn’t just be an inward-facing, local task driven by intellectual control; we should be constantly aware of making our descriptions more discoverable to researchers. And it’s shown me the importance of standards and consistency in achieving this (I feel like I’ve referenced consistency a lot in this one blog post; consistency is important!). I hope that the project is also prompting Welsh archive services to reflect on the accessibility of their own cataloguing, something which might not have been looked at in many years.

There’s a lot of work to be done, both in this foundation work and in further funding/projects which might come off the back of it. But hopefully in the next few years you’ll be discovering much more of Wales’ archive collections online.

Stacy Capner
Project Officer
Archives Wales Catalogues Online

Related:

Archives Hub EAD Editor – http://archiveshub.ac.uk/eadeditor/

Archives Hub contributors – list and map

 

Making your digital collections easier to discover – Jisc workshops in November

Jisc is offering two one-day workshops to help you increase the reach of your digital collections, optimise them for discovery and evaluate their impact.

‘Exploiting digital collections in learning, teaching and research’ will be held on Tuesday 15 November.

‘Making google work for your digital collections’ will be held on Tuesday 22 November.

If your organisation has digital collections, or plans to develop them, our workshops will help you maximize the reach of those collections online, demonstrate the impact of their usage, and help you build for future sustainability. They will equip you with the knowledge and skills to:

• Increase the visibility of your digital collections for use in learning, teaching and research
• Encourage collaboration between curators and users of digital collections
• Strategically promote your digital collections in appropriate contexts, for a range of audiences
• Optimise your collection for discovery via Google and other search tools
• Use web analytics to track and monitor access and usage of your digital collections
• Evaluate impact and realise the benefits of investment in your digital collection

Who should attend?

Anyone working in education and research, who manages, supports and/or promotes digital collections for teaching, learning and research. Those working in similar roles in libraries, archives and museums would also benefit.

Both workshops will be held at the Jisc office, Brettenham House, London and will offer a mix of discussion, practical activities and post-workshop resources to support online resource discovery activities.

For more information and to book your place please visit http://www.jisc.ac.uk/advice/training/making-your-digital-collections-easier-to-discover.

Archives Portal Europe builds firm foundations

On 8th June 2016 I attended the first Country Managers’ meeting of the newly formed Foundation of the Archives Portal Europe (APEF) at the National Archives of the Netherlands (Nationaal Archief).

The Foundation has been formed on the basis of partnerships between European countries. The current Foundation partners are: Belgium, Denmark, Luxembourg, The Netherlands, Spain, Sweden, Switzerland, Estonia, France, Germany, Hungary, Italy, Latvia, Norway and Slovenia. All of these countries are members of the ‘Assembly of Associates’. Negotiations are proceeding with Bulgaria, Greece, Liechtenstein, Lithuania, Malta, Poland, Slovakia and the UK. Some countries are not yet in a position to become members, mainly due to financial and administrative issues, but the prospects currently look very positive, with a great willingness to take the Portal forwards and continue the valuable networking that has been built up over the past decade. Contributing to the Portal does not require a financial contribution; the Assembly of Associates is separate from this, and the idea is that countries (National Archives or bodies with an educational/research remit) sign up to the principles of APE and the APE Foundation – to collaborate and share experiences and ideas, and to make European archives as accessible as possible.

The Governing Board of the Foundation is working with potential partners to reach agreements on a combination of financial and in-kind contributions. It’s also working on long term strategy documents. It has established working groups for Standards and PR & Communications and it has set up cooperation with the Dutch DTR project (Digitale Taken Rijksarchieven / Digital Processes in State Archives) and with Europeana. The cooperation with the DTR project has been a major boost, as both projects are working towards similar goals, and therefore work effort can be shared, particularly development work.

Current tasks for the APEF:

  • Building an API to open up the functionality of the Archives Portal Europe to third parties and to implement the possibility for the content providers to switch this option on or off in the Archives Portal Europe’s back-end.
  • Improving the uploading and processing of EAC-CPF records in the Archives Portal Europe and improving the way in which records creators’ information can be searched and found via the Archives Portal Europe’s front-end and via the API.
  • Enabling the uploading/processing of “additional finding aids (indexes)” in the Archives Portal Europe and making this additional information available via the Archives Portal Europe’s front-end and the API.

All of the above is in addition to the continuing work of getting more data into the Portal, supporting the country managers in working with repositories, and promoting the Portal to researchers interested in using a Europe-wide search and discovery tool.

APEF will be a full partner in the Europeana DSI2 project, connecting the online collections of Europe’s cultural heritage institutions, which will start after the summer and will run for 16 months. Within this project APEF will focus on helping Europeana to develop the aggregation structure and provide quality data from the archives community to Europeana. A focus on quality will help to get archival data into Europeana in a way that works for all parties. There seems to be a focus from Europeana on the ‘treasures’ from the archives, and on images that ‘sell’ the archives more effectively. Whatever the rights and wrongs of this, it seems important to continue to work to expose archives through as many channels as we can, and for us in the UK, the advantages of contributing to the Archives Hub and thence seamlessly to APE and to Europeana, albeit selectively, are clear.

A substantial part of the meeting was dedicated to updates from countries, which gave us all a chance to find out what others are doing, from the building of a national archives portal in Slovakia to progress with OAI-PMH harvesting from various systems, such as ScopeArchiv, used in Switzerland and other countries. Many countries are also concerned with translations of various documents, such as the Content Provider Agreement, which is not something the UK has had to consider (although a Welsh translation would be a possibility).

We had a session looking at some of the more operational and functional tasks that need to be thought about in any complex system such as the APE system. We then had a general Q&A session. It was acknowledged that creating EAD from scratch is a barrier to contributing for many repositories. For the UK this is not really an issue, because we contribute Archives Hub descriptions. But of course it is an issue for the Hub: to find ways to help our contributors provide descriptions, especially if they are using a proprietary system. Our EAD Editor accounts for a large percentage of our data, and that creates the EAD without the requirement of understanding more than a few formatting tags.

The Archives Hub aims to set up harvesting of our contributors’ descriptions over the next year, thus ensuring that any descriptions contributed to us will automatically be uploaded to the Archives Portal Europe. (We currently have to upload on a per-contributor basis, which is not very efficient with over 300 contributors.) We will soon be turning our attention to the selective digital content that can be provided by APE to Europeana. That will require an agreement from each institution in terms of the Europeana open data licence. As the Hub operates on the principles of open data, to encourage maximum exposure of our descriptions and promote UK archives, that should not be a problem.
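To give a flavour of what automated harvesting can involve, here is a minimal sketch assuming the harvesting were done over OAI-PMH, the protocol mentioned earlier in connection with other countries’ systems. The base URL, the sample response and the function names are invented for illustration; only the ListRecords verb, the metadataPrefix and resumptionToken arguments and the XML namespace come from the OAI-PMH specification. No network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical endpoint, used only to show the shape of a request.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument: when continuing
    a partial list it is sent on its own, without metadataPrefix.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token from a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Made-up response fragment standing in for a real repository's reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL))
print(list_records_url(BASE_URL, resumption_token=next_token(SAMPLE_RESPONSE)))
```

A harvester would repeat the second request until no token comes back, which is what removes the need for a manual upload per contributor.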

With thanks to Wim van Dongen, APEF country manager coordinator / technical coordinator, who provided the minutes of the Country Managers’ meeting, which are partially reproduced here.

Archives Hub Search Analysis

Search logs can give us an insight into how people really search. Our current system provides ‘search logs’ that show the numbers based on the different search criteria and faceting that the Hub offers, including combined searches. We can use these to help us understand how our users search and to give us pointers to improve our interface.

The Archives Hub has a ‘default search’ on the homepage and on the main search page, so that the user can simply type a search into the box provided. This is described as a keyword search, as the user is entering their own significant search terms and the results returned include any archival description where the term(s) are used.

The researcher can also choose to narrow down their search by type. The figure below shows the main types the Archives Hub currently has. Within these types we also have boolean type options (all, exact, phrase), but we have not analysed these at this point other than for the main keyword search.

Archives Hub search box

Archives Hub search box showing the types of searches available

There are caveats to this analysis.

1. Results will include spiders and spam

With our search logs, excluding bots is not straightforward, something which I refer to in a previous post: Archives Logs and Google Analytics. We are shortly to migrate to an entirely new system, so for this analysis we decided to accept that the results may be slightly skewed by these types of searches. And, of course, these crawlers often perform a genuine service, exposing archive descriptions through different search engines and other systems.

2. There is a small number of unaccounted-for searches

Unidentified searches only account for 0.5% of the total, and we could investigate the origins of these searches, but we felt the time it would take was not worth it at this point in time.

3. Figures will include searches from the browse list

These figures include searches actioned by clicking on a browse list, e.g. a list of subjects or a list of creators.

4. Creator, Subject and Repository include faceted searching

The Archives Hub currently has faceted searching for these entities, so when a user clicks to filter down by a specific subject, that counts as a subject search.

Results for One Month (October 2015)

Monthly figures for searches

For October 2015 the total number of searches is 19,415. The keyword search dominates, with a smaller use of the ‘any’ and ‘phrase’ options within the keyword search. This is no surprise, but this ‘default search’ still forms only 36% of the whole, which does not necessarily support the idea that researchers always want a ‘Google-type’ search box.

We did not analyse these additional filters (‘any/phrase/exact’) for all of the searches, but looking at them for ‘keyword’ gives a general sense that they are useful, but not highly used.

A clear second is search by subject, with 17% of the total. The subject search was most commonly combined with other searches, such as a keyword and further subject search. Interestingly, subject is the only search where a combined subject + other search(es) is higher than a single subject search. If we look at the results over a year, the combined subject search is by far the highest number for the whole year; in fact, it is over 50% of the total searches. This strongly suggests that bots are commonly responsible for combined subject searches.

These searches are often very long and complex, as can be seen from the search logs:

[2015-09-17 07:36:38] INFO: 94.212.216.52:: [+0.000 s] search:: [+0.044 s] Searching CQL query: (dc.subject exact "books of hours" and/cql.relevant/cql.proxinfo (dc.subject exact "protestantism" and/cql.relevant/cql.proxinfo (dc.subject exact "bible o.t. psalms" and/cql.relevant/cql.proxinfo (dc.subject exact "authors, classical" and/cql.relevant/cql.proxinfo (dc.subject exact "bible o.t. psalms" and/cql.relevant/cql.proxinfo (dc.subject exact "law" and/cql.relevant/cql.proxinfo (dc.subject exact "poetry" and/cql.relevant/cql.proxinfo (dc.subject exact "bible o.t. psalms" and/cql.relevant/cql.proxinfo (dc.subject exact "sermons" and/cql.relevant/cql.proxinfo bath.personalname exact "rawlinson richard 1690-1755 antiquary and nonjuror"))))))))):: [+0.050 s] 1 Hits:: Total time: 0.217 secs

It is most likely that the bots are not nefarious; they may be search engine bots, or they may be indexing for information services of some kind, such as bibliographic services. But they do make it very difficult to assess the value of the various searches on the Hub.
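One way to get a handle on these long queries is to count how many clauses of each kind a logged query contains, and to flag those that chain several subject terms together. The sketch below does this with a regular expression over a shortened version of the log entry quoted above. The threshold of two subject clauses and the function names are our own assumptions for the example, and a high count is only a hint that a bot might be involved, not proof.

```python
import re
from collections import Counter

# Shortened form of the log entry shown above, for illustration only.
LOG_LINE = (
    '[2015-09-17 07:36:38] INFO: 94.212.216.52:: [+0.000 s] search:: '
    '[+0.044 s] Searching CQL query: (dc.subject exact "books of hours" '
    'and/cql.relevant/cql.proxinfo (dc.subject exact "protestantism" '
    'and/cql.relevant/cql.proxinfo bath.personalname exact '
    '"rawlinson richard 1690-1755 antiquary and nonjuror")):: '
    '[+0.050 s] 1 Hits:: Total time: 0.217 secs'
)

# Matches: index name, the word "exact", then a quoted term.
# Both straight and typographic quotation marks are accepted.
CLAUSE = re.compile(r'([a-z]+\.[a-z]+) exact ["“]([^"”]*)["”]')


def index_counts(line):
    """Count how many clauses in a logged query use each search index."""
    return Counter(index for index, _term in CLAUSE.findall(line))


def is_combined_subject_search(line, threshold=2):
    """Flag queries that chain at least `threshold` subject clauses."""
    return index_counts(line)["dc.subject"] >= threshold


print(index_counts(LOG_LINE))
print(is_combined_subject_search(LOG_LINE))
```

Run over a month of logs, counts like these would at least let the longest chained queries be reported separately from ordinary single-subject searches.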

Of the remaining search categories available from the main search page, it is no surprise that ‘title’ is used a fair bit, at 6.5%, and then after that creator, name, and organisation and personal name. These are all fairly even. For October 2015 they are around 3% of the total each, and it seems to be similar for other months.

The repository filter is popular. Researchers can select a single repository to find all of its descriptions (157), select a single repository and also enter search terms (916), and also search for all the descriptions from a single repository from our map of contributors (125). This is a total of 1,198, which is 6.1% of the total. If we also add the faceted filter by repository, after a search has been carried out, the total is 2,019, and the percentage is 10.4%. Looking at the whole year, the various options to select repository become an even bigger percentage of the total, in particular the faceted filter by repository. This suggests that improvements to the ability to select repositories, for example by allowing researchers to select more than one repository, or maybe a type of repository, would be useful.
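The arithmetic behind these repository figures can be laid out as a short calculation. The counts are the ones quoted above for October 2015; the labels and the share function are just for this example. Percentages here are rounded to one decimal place rather than truncated, so the last digit may differ slightly from a figure quoted in the text.

```python
TOTAL_SEARCHES = 19_415  # all searches in October 2015

# Repository searches available from the main search page and the map.
repository_searches = {
    "single repository, all descriptions": 157,
    "single repository plus search terms": 916,
    "repository chosen from the contributors map": 125,
}

# Total quoted above once the faceted repository filter is included.
WITH_FACET_TOTAL = 2_019


def share(count, total=TOTAL_SEARCHES):
    """Return count as a percentage of all searches."""
    return 100 * count / total


main_page_total = sum(repository_searches.values())
facet_only = WITH_FACET_TOTAL - main_page_total  # implied faceted-filter uses

print(f"main page and map: {main_page_total} ({share(main_page_total):.1f}%)")
print(f"faceted filter only: {facet_only} ({share(facet_only):.1f}%)")
print(f"all repository routes: {WITH_FACET_TOTAL} ({share(WITH_FACET_TOTAL):.1f}%)")
```

Keeping the counts in one place like this makes it easy to rerun the same breakdown for another month or for the whole year.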

Screen shot of Hub map

Google Map on the Hub showing the link to search by contributor

We have a search within multi-level descriptions, introduced a few years ago, and that clearly does get a reasonable amount of use, with 1,404 uses in this particular month, or 7.2% of the total. This is particularly striking as this is only available within multi-level descriptions. It is no surprise that this is valuable for lengthy descriptions that may span many pages.

The searches that get minimal use are identifier, genre, family name and epithet. This is hardly surprising, and illustrates nicely some of the issues around how to measure the value of something like this.

Identifier enables users to search by the archival reference. This may not seem all that useful, but it tends to be popular with archivists, who use the Hub as an administrative tool. However, the current Archives Hub reference search is poor, and the results are often confusing. It seems likely that our contributors would use this search more if the results were more appropriate. We believe it can fulfill this administrative function well if we adjust the search to give better quality results; it is never likely to be a highly popular search option for researchers as it requires knowledge of the reference numbers of particular descriptions.

Epithet is tucked away in the browse list, so a ‘search’ will only happen if someone browses by epithet and then clicks on a search result. Would it be more highly used if we had a ‘search by occupation or activity’? There seems little doubt of this. It is certainly worth considering making this a more prominent search option, or at least getting more user feedback about whether they would use a search like this. However, its efficacy may be compromised by the extremely permissive nature of epithet for archival descriptions – the information is not at all rigorous or consistent.

Family name is not provided as a main search option, and is only available by browsing for a family name and clicking on a result, as with epithet. The main ‘name’ search option enables users to search by family name. We did find the family name search was much higher for the whole year, maybe an indication of use by family historians and of the importance of family estate records.

Genre is in the main list of search options, but we have very few descriptions that provide the form or medium of the archive. However, users are not likely to know this, and so the low use may also be down to our use of ‘Media type’, which may not be clear, and a lack of clarity about what sort of media types people can search for. There is also, of course, the possibility that people don’t want to search on this facet. However, looking at the annual search figures, we have 1,204 searches by media type, which is much more significant, and maybe could be built up if we had something like radio buttons for ‘photographs’, ‘manuscripts’, ‘audio’ that were more inviting to users. But, with a lack of categorisation by genre within the descriptions that we have, a search on genre will mean that users filter out a substantial amount of relevant material. A collection of photographs may not be catalogued by genre at all, and so the user would only get ‘photographs’ through a keyword search.

Place name is an interesting area. We have always believed that users would find an effective ‘search by place’ useful. Our place search is in the main search options, but most archivists do not index their descriptions by place and because of this it does not seem appropriate to promote a place name search. We would be very keen to find ways to analyse our descriptions and consider whether place names could be added as index terms, but unless this happens, place name is rather like media type – if we promote it as a means to find descriptions on the Archives Hub, then a hit list would exclude all of those descriptions that do not include place names.

This is one of the most difficult areas for a service like the Archives Hub. We want to provide search options that meet our users’ needs, but we are aware of the varied nature of the data. If a researcher is interested in ‘Bath’ then they can search for it as a keyword, but they will get all references to bath, which is not at all the same as archives that are significantly about Bath in Somerset. But if they search for place name: bath, then they exclude any descriptions that are significantly about Bath, but not indexed by place. In addition, words like this, which have different meanings, can confuse the user in terms of the relevance of the results because ‘bath’ is less likely to appear in the title. It may simply be that somewhere in the description, there is a reference to a Dr Bath, for example.
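The trade-off between the two kinds of search can be shown with a small Python sketch. The four records below are invented for illustration and are not real Archives Hub descriptions; the field names `text` and `places` and both function names are assumptions, not how our search is implemented.

```python
import re

# Hypothetical records, invented for illustration only.
# Records 1 and 2 are both about the city of Bath, but only record 1
# has been indexed by place.
records = [
    {"id": 1, "text": "Papers relating to the city of Bath and its spa", "places": ["Bath"]},
    {"id": 2, "text": "Records of Bath Abbey parish", "places": []},
    {"id": 3, "text": "Correspondence of Dr Bath, physician", "places": []},
    {"id": 4, "text": "Plans for a public bath house in Leeds", "places": ["Leeds"]},
]

def keyword_search(term, records):
    """Return ids of records whose free text contains the term as a whole word."""
    term = term.lower()
    return [r["id"] for r in records if term in re.findall(r"\w+", r["text"].lower())]

def place_search(term, records):
    """Return ids of records that carry the term as a place index entry."""
    term = term.lower()
    return [r["id"] for r in records if term in (p.lower() for p in r["places"])]

keyword_hits = keyword_search("Bath", records)  # includes Dr Bath and the bath house
place_hits = place_search("Bath", records)      # misses the unindexed Bath Abbey record
```

In this toy data the keyword search returns every record, including the two that have nothing to do with the city, while the place search returns only the one indexed record and silently drops the Bath Abbey description.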

This is one reason why we feel that encouraging the use of faceted search will be better for our users. A simpler initial search is likely to give plenty of results, and then the user can go from there to filter by various criteria.

It is worth mentioning ‘date’ search. We did have this at one point, but it did not give good results. This is partly due to many units of description not including normalised dates. But the feedback that we have received suggests that a date search would be popular, which is not surprising for an archives service. We are planning to provide a filter by date, as well as the ordering by date that we currently have.

Finally, I was particularly interested to see how popular our ‘search collection level only’ is. This enables users to only see ‘top level’ results, rather than all of the series and items as well. As it is a constant challenge to present hierarchical descriptions effectively, this would seem to be one means to simplify things. However, for October 2015 we had 17 uses of this function, and for the whole year only 148. This is almost negligible. It is curious that so few users chose to use this. Is it an indication that they don’t find it useful, or that they don’t know what it means? We plan to have this as a faceted option in the future, and it will be interesting to see if that makes it more popular or not.

We are considering whether we should run this exercise using some sort of filtering to check for search engines, dubious IP addresses, spammers, etc., and therefore get a more accurate result in terms of human users. We would be very interested to hear from anyone who has undertaken this kind of exercise.
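As a rough idea of what such filtering might look like, here is a minimal Python sketch that drops log entries whose user agent looks like a crawler before counting searches by type. It is an illustration under assumptions rather than something we have run: the entry structure, the field names `search_type` and `user_agent`, and the list of markers are all hypothetical, and matching on user-agent strings alone would not catch spammers or dubious IP addresses.

```python
from collections import Counter

# Assumed substrings that commonly appear in crawler user agents.
# This list is illustrative, not exhaustive.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_probable_bot(user_agent):
    """Guess whether a user-agent string belongs to an automated crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def count_human_searches(entries):
    """Count searches per search type, skipping entries that look automated."""
    counts = Counter()
    for entry in entries:
        if not is_probable_bot(entry["user_agent"]):
            counts[entry["search_type"]] += 1
    return counts

# Hypothetical log entries, invented for illustration.
sample_entries = [
    {"search_type": "keyword", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/41.0"},
    {"search_type": "keyword", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"search_type": "collection_level", "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) Safari/601.1"},
    {"search_type": "place", "user_agent": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"},
]

human_counts = count_human_searches(sample_entries)
```

Comparing the filtered counts with the raw counts for each search type would show how much of the recorded use comes from automated traffic.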

 

Europeana Tech 2015: focus on the journey

Last week I attended a very full and lively Europeana Tech conference. Here are some of the main initiatives and ideas I have taken away with me:

Think in terms of improvement, not perfection

Do the best you can with what you have; incorrect data may not be as bad as we think and maybe users’ expectations are changing, and they are increasingly willing to work with incomplete or imperfect data. Some of the speakers talked about successful crowd-sourcing – people are often happy to correct your metadata for you and a well thought-out crowd-sourcing project can give great results.

BL Georeferencer, showing an old map overlaying part of Manchester: http://www.bl.uk/maps/georeferencingmap.html

The British Library currently have an initiative to encourage tagging of their images on Flickr Commons and they also have a crowd-sourcing geo-referencer project.

The Cooper Hewitt Museum site takes a different and more informal approach to what we might usually expect from a cultural heritage site. The homepage goes for an honest approach:

“This is a kind of living document, meaning that development is ongoing — object research is being added, bugs are being fixed, and erroneous terms are being revised. In spite of the eccentricities of raw data, you can begin exploring the collection and discovering unexpected connections among objects and designers.”

The ‘here is some stuff’ and ‘show me more stuff’ type of approach was noticeable throughout the conference, with different speakers talking about their own websites. Seb Chan from the Cooper Hewitt Museum talked about the importance of putting information out there: even if you have very little, it is better than nothing (e.g. https://collection.cooperhewitt.org/objects/18446665).

The speaker from Google, Chris Welty, is best known for his work on ontologies in the Semantic Web and IBM’s Watson. He spoke about cognitive computing, and his message was ‘maybe it’s OK to be wrong’. Something may well still be useful, even if it is not perfectly precise. We are increasingly understanding that the Web is in a state of continuous improvement, and so we should focus on improvement, not perfection. What we want is for mistakes to decrease, and for new functionality not to break old functionality. Chris talked about the importance of having a metric – something that is believable – that you can use to measure improvement. He also spoke about what is ‘true’ and the need for a ‘ground truth’ in an environment where problems often don’t have a right or wrong answer. What is the truth about an image? If you show an image to a human and ask them to talk about it they could talk for a long time. What are the right things to say about it? What should a machine see? To know this, or to know it better, Chris said, Google needs data – more and more and more data. He made it clear that the data is key and it will help us on the road to continuous improvement. He used the example of searching for pictures of flowers using Google to find ‘paintings with flowers’. If you did this search 5 years ago you probably wouldn’t get just paintings with flowers. The search has improved, and it will continue to improve. A search for ‘paintings with tulips’ now is likely to show you just tulips. However, he gave the example of ‘paintings with flowers by French artists’ – a search where you start to see errors as the results are not all by French artists. A current problem Google are dealing with is mixed language queries, such as ‘paintings des fleurs’, which opens a whole can of worms. But Chris’ message was that metadata matters: it is the metadata that makes this kind of searching possible.

The Success of Failure

Related to the point about improvement, the message is that being ‘wrong’ or ‘failing’ should be seen in a much more positive light. Chris Welty told us that two thirds of his work doesn’t make it into a live environment, and he has no problem with that. Of course, it’s hard not to think that Google can afford to fail rather more than many of us! But I did have an interesting conversation with colleagues, via Twitter, around the importance of senior management and funders understanding that we can learn a great deal from what is perceived as failure, and we shouldn’t feel compelled to hide it away.

Photo from Europeana Tech
Europeana Tech panel session, with four continents represented

Think in terms of Entities

We had a small group conversation where this came up, and a colleague said to me ‘but surely that’s obvious’. But as archivists we have always been very centred on documents rather than things – on the archive collection, and the archive collection description. The trend that I was seeing reflected at Europeana Tech continued to be towards connections, narratives, pathways, utilising new tools for working with data, for improving data quality and linking data, for adding geo-coordinates and describing new entities, for making images more interoperable and contextualising information. The principle underlying this was that we should start from the real world – the real world entities – and go from there. Various data models were explored, such as the Europeana Data Model and CIDOC CRM, and speakers explained how entities can connect, and enable a richer landscape. Data models are a tricky one because they can help to focus on key entities and relationships, but they can be very complex and rather off-putting. The EDM seems to split the crowd somewhat, and there was some criticism that it is not event-based like CIDOC CRM, but the CRM is often criticised for being very complex and difficult to understand. Anyway, setting that aside, the overall message was that relationships are key, however we decide to model them.

Cataloguing will never capture everyone’s research interests

An obvious point, but I thought it was quite well conveyed in the conference. Do we catalogue with the assumption that people know what they need? What about researchers interested in how ‘sad’ is expressed throughout history, or fashions for facial hair, or a million other topics that simply don’t fit in with the sorts of keywords and subject terms we normally use? We’ll never be able to meet these needs, but putting out as much data as we can, and making it open, allows others to explore, tag and annotate and create infinite groups of resources. It can be amazing and moving, what people create: Every3Minutes.

There’s so much out there to explore….

There are so many great looking tools and initiatives worth looking at, so many places to go and experiment with open data, so many APIs enabling so much potential. I ended up with a very long list of interesting looking sites to check out. But I couldn’t help feeling that so few of us have the time or resource to actually take advantage of this busy world of technology. We heard about Europeana Labs, which has around 100 ‘hardcore’ users and 2,200 registered keys (required for API use). It is described as “a playground for remixing and using your cultural and scientific heritage. A place for inspiration, innovation and sharing.” I wondered if we would ever have the time to go and have a play. But then maybe we should shift focus away from not being able to do these things ourselves, and simply allow others to use the data, and to adopt the tools and techniques that are available – people can create all sorts of things. One example amongst many we heard about at the conference is a cultural collage: zenlan.com/collage. It comes back to what is now quite an old adage, ‘the best innovation may not be done by you’. APIs enable others to innovate, and what interests people can be a real surprise. Bill Thompson from the BBC referred to a huge interest in old listings from Radio Times, which are now available online.

The International Image Interoperability Framework

I list the IIIF because it jumped out at me as a framework that seems to be very popular – several speakers referred to it, and in very positive terms. I hadn’t heard of it before, but it seemed to be seen as a practical means to ensure that images are interoperable, and can be moved around different systems.
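To give a flavour of how that interoperability works in practice, the IIIF Image API describes a common URL pattern for requesting an image: identifier, region, size, rotation, quality and format, in that order. The Python sketch below simply assembles such a URL. The server address and image identifier are made up for illustration, the function name is my own, and the default size of ‘max’ assumes a recent version of the specification (earlier versions used ‘full’ for size).

```python
from urllib.parse import quote

def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a URL following the IIIF Image API pattern:
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # The identifier is percent-encoded so that any slashes inside it
    # are not mistaken for path separators.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Hypothetical server and identifier, for illustration only.
full_image = iiif_image_url("https://example.org/iiif", "maps/manchester-1850")
thumbnail = iiif_image_url("https://example.org/iiif", "maps/manchester-1850", size="300,")
```

Because every compliant server understands the same pattern, a viewer written for one institution’s images can, in principle, request a region or a thumbnail from another’s without any special handling.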

Think Little

One of my favourite thoughts from the conference, from the ever-inspirational Tim Sherratt, was that big ideas should enable little ideas. The little ideas are often what really makes the world go round. You don’t have to always think big. In fact, many sites have suffered from the tendency to try to do everything. Just because you can add tons of features to your applications, it doesn’t mean you should.

The Importance of Orientation

How would you present your collections if you didn’t have a search box? This is the question I asked myself after listening to George Oates, from Good, Form & Spectacle. She is a User Interface expert, and has worked on Flickr and for the Internet Archive amongst other things. I thought her argument about the need to help orientate users was interesting, as so often we are told that the ‘Google search box’ is the key thing, and what users expect. She talked about some of her experiments with front end interfaces that allow users to look at things differently, such as the V&A Spelunker. She spoke in terms of landmarks and paths that users could follow. I wonder if this is easier said than done with archives without over-curating what you have or excluding material that is less well catalogued, or does not have a nice image to work with. But I certainly think it is an idea worth exploring.

View of V&A Spelunker
“The V&A Spelunker is a rough thing built by Good, Form & Spectacle to give a different view into the collection of the Victoria & Albert Museum”