Names (6): Deduplication at scale

September 7, 2020 / Jane Stevenson

Having written several blog posts setting out ideas and thoughts about the challenges with names, I want to use this post to set out some of our plans for creating name records for a national aggregator: something that can work at scale and in a sustainable way. The technical work is largely being undertaken by Knowledge Integration, our system suppliers, working closely with the Archives Hub team.

Consider one repository – one Hub contributor. They have multiple archives described on the Archives Hub, and maybe hundreds or thousands of agents (people and organisations) included in those descriptions. All of this information will be put into a ‘management index’. This will be done for all contributors. So, the management index will include all the content, from all levels, including all the names. A huge bucket of data to start us off.

A names authority source such as VIAF or any other names data that we would like to work with will not be treated any differently to Archives Hub data at this stage. In essence matching names is matching names, whatever the data source. So, matching Archives Hub names internally is the same as matching Archives Hub names to VIAF, or to Library Hub, for example. However, this ‘names authority’ data will not go into our big bucket of Archives Hub data, because, unless we create a match with a name on the Hub, the authority data is not relevant to us. Putting the whole of VIAF into our bucket of data would create something truly huge. It is only if we think that this external data source has a name that matches a person or organisation on the Hub that it becomes important. So data from external sources are stored in separate reference indexes (buckets) for the purposes of matching.

Tokenisation

Knowledge Integration are employing a method known as tokenisation, which allows us to group the data from the indexes into levels. (It is quite technical and I’m not qualified to go into it in detail, so I only refer briefly to the basic principles here. Wikipedia has quite a good description of tokenisation.) With this process, we can establish levels that we believe will suit our purposes in terms of confidence. Level 1 might be for what we think is a guaranteed match, such as where an identifier matches. So, for example, Wikidata might have the VIAF identifier included, so that the VIAF and Wikidata name can be matched. In some cases, the Archives Hub data includes VIAF IDs, so then the Hub data can be matched to VIAF. We also hope to work with and create matches to Library Hub data, as they also have VIAF IDs.

Image showing versions of a name all with the same ID. — If all versions of a name have the same ID then they can be matched.
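To make the Level 1 idea a little more concrete, here is a minimal Python sketch. It assumes each record is a simple dictionary with hypothetical `source`, `name` and `viaf_id` fields (the real index structure is not described in this post, and the identifier values are invented). It simply groups records that share an identifier.

```python
from collections import defaultdict

# Hypothetical records: field names and identifier values are invented for illustration.
records = [
    {"source": "hub", "name": "Webb, Beatrice, 1858-1943", "viaf_id": "V001"},
    {"source": "viaf", "name": "Webb, Beatrice Potter", "viaf_id": "V001"},
    {"source": "wikidata", "name": "Beatrice Webb", "viaf_id": "V001"},
    {"source": "hub", "name": "Webb, B.", "viaf_id": None},
]


def match_on_identifier(records, id_field="viaf_id"):
    """Group records that carry the same identifier; records without one are skipped."""
    groups = defaultdict(list)
    for record in records:
        identifier = record.get(id_field)
        if identifier:
            groups[identifier].append(record)
    # Only identifiers shared by two or more records count as a match.
    return {key: group for key, group in groups.items() if len(group) > 1}


level1_matches = match_on_identifier(records)
```

The record without an identifier is left alone at this level; it would only be picked up by the less certain levels.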

Level 2 might be a more configurable threshold based around the name. We might say that a match on name and date of birth, for example, is very likely an indication of a ‘same as’ relationship. We might say that ‘James T Kirk’ is the same person as ‘James Kirk’ if we have the same date of birth. This is where trial and error is inevitable, in order to test out degrees of confidence. Level 3 might bring in supporting information, such as biographical history or information about occupation or associated places. It is not useful by itself, but in conjunction with the name, it can add a degree of certainty.

Screenshot of part of a biographical history — Biographical information may be used to help match names
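The Level 2 rule described above (name plus date of birth) might be sketched as follows. This is only an illustration: the `name` and `birth` fields, the rule of comparing first and last name tokens, and the date itself are all assumptions made for the example, not the project's actual configuration.

```python
def name_key(name):
    """Reduce a name to (first token, last token), lower-cased, ignoring middle names and initials."""
    tokens = name.lower().replace(".", " ").split()
    return (tokens[0], tokens[-1])


def level2_match(a, b):
    """Treat two records as a likely 'same as' only if the name key and a known date of birth both agree."""
    return (
        a["birth"] is not None
        and a["birth"] == b["birth"]
        and name_key(a["name"]) == name_key(b["name"])
    )


# Invented example data; the date of birth is made up.
kirk_a = {"name": "James T Kirk", "birth": "1900-01-01"}
kirk_b = {"name": "James Kirk", "birth": "1900-01-01"}
kirk_c = {"name": "James Kirk", "birth": None}

same_person = level2_match(kirk_a, kirk_b)   # True: name key and date agree
no_evidence = level2_match(kirk_a, kirk_c)   # False: no date to compare
```

Requiring a known date of birth is a deliberately cautious choice here; loosening it is exactly the kind of threshold that would need trial and error.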

We are also thinking about a Level 4 for approaches that are Archives Hub specific. For example, if the same name is provided by the same repository, could we say it is more likely to be the same person?

This tokenisation process is all about creating a configurable process for deduplication. Tokens are created only for the purposes of matching. Once we have our levels decided, we can create a deduplication index and run the matching algorithm to see what we get.

Approaches to indexing

For deduplication indexing, the first thing to do is to convert to lower case and remove all of the non-alpha characters. (NB: For non-Latin scripts, there are challenges that we may not be able to tackle in this phase of the project.)
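As a rough illustration of that first step, here is a small Python sketch. It assumes that spaces between words are kept (the post does not say), and it only recognises the letters a to z, which is precisely why non-Latin scripts and accented characters are a challenge.

```python
import re


def normalise(name):
    """Lower-case a name and strip everything except a-z and spaces."""
    lowered = name.lower()
    letters_only = re.sub(r"[^a-z ]", "", lowered)
    # Collapse any runs of spaces left behind by the removed characters.
    return " ".join(letters_only.split())


normalised = normalise("Webb, Martha Beatrice, 1858-1943")
```

Here `normalised` is `webb martha beatrice`: the punctuation and dates have gone, so dates would need to be captured as separate tokens before this step if they are to be used in matching.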

The tokens within the record will be indexed in multiple ways within the deduplication index to facilitate matching. This includes indexing all words in order that they appear, and also individual word matches.

Then, particularly when considering using text such as biographies to help identify matches, we can use bigrams and trigrams. These essentially divide text into two- and three-word chunks. A search can then identify how many groups of two and three words have matched. Generally, this is a useful method of ascertaining whether documents are about the same thing. It may help us with identifying name matches based upon supporting information. This is very much an exploratory approach, and we don’t know if it will help substantially with this project, but certainly it will be worth trying out this approach, and also considering using it for future data analysis projects.
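A minimal sketch of word bigrams and trigrams in Python is given below. The two short biography snippets are invented for the example, and simply counting shared chunks is an assumption about how a score might be derived, not a description of the actual search.

```python
def word_ngrams(text, n):
    """Return the set of n-word chunks in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(text_a, text_b, n):
    """Return the n-word chunks that appear in both texts."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)


# Invented snippets standing in for two biographical histories.
bio_a = "social reformer and co-founder of the london school of economics"
bio_b = "economist and co-founder of the london school of economics"

shared_bigrams = shared_ngrams(bio_a, bio_b, 2)
shared_trigrams = shared_ngrams(bio_a, bio_b, 3)
```

In this example the two snippets share seven bigrams and six trigrams, which suggests they are describing the same thing even though they begin differently.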

Character trigrams break down individual words into groups of three characters and may be useful for the actual names. This should be useful for a more fuzzy matching approach, and it helps to deal with typos. It can also help with things like plurals, which is relevant for working with the supporting information.
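Character trigrams can be sketched in a few lines of Python. The similarity measure used here (the Jaccard ratio of shared to total trigrams) is an assumption chosen for simplicity; the post does not say which measure will be used.

```python
def char_trigrams(word):
    """Return the set of three-character chunks in a word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity: shared trigrams divided by all distinct trigrams."""
    trigrams_a, trigrams_b = char_trigrams(a), char_trigrams(b)
    if not trigrams_a or not trigrams_b:
        return 0.0
    return len(trigrams_a & trigrams_b) / len(trigrams_a | trigrams_b)


typo_score = trigram_similarity("Stevenson", "Stevensen")   # a one-letter typo
plural_score = trigram_similarity("archive", "archives")    # singular vs plural
unrelated_score = trigram_similarity("Stevenson", "Kirk")   # nothing in common
```

A typo or a plural still leaves most trigrams intact, so the score stays well above that of two unrelated words, which share none.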

We are also going to explore hypocorisms. This means trying out matches for names such as Jim, Jimmy and James or Ned, Ed, Ted and Edward. A hypocorism is often defined as a pet name or term of endearment, but for us it is more about forename variations. Obviously Jim Jones is not necessarily the same person as James Jones, but there is a possibility of it, so it is useful to make that kind of match on name synonyms.

Hypocorisms are pet names or terms of endearment
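One simple way to try out hypocorisms is a lookup table that maps each variation to a single forename. The sketch below uses only the examples mentioned above; the table and function names are hypothetical, and a real list would be far longer.

```python
# A tiny, hand-made lookup built from the examples in this post; not the project's actual list.
HYPOCORISMS = {
    "jim": "james",
    "jimmy": "james",
    "ned": "edward",
    "ed": "edward",
    "ted": "edward",
}


def canonical_forename(forename):
    """Map a forename variation to its fuller form, or return it unchanged if unknown."""
    key = forename.lower()
    return HYPOCORISMS.get(key, key)


def forenames_may_match(a, b):
    """True if two forenames could be variations of the same name."""
    return canonical_forename(a) == canonical_forename(b)


jim_james = forenames_may_match("Jim", "James")
ned_ted = forenames_may_match("Ned", "Ted")
jim_edward = forenames_may_match("Jim", "Edward")
```

A positive result here only says that a match is possible, so it would contribute to a lower confidence level rather than settle the question by itself.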

From this indexing approach we can try things out and see what works. There is little doubt that it will require an iterative and flexible approach. We can’t afford to set up a whole process that proves ineffective so that we have to start again. We need an approach that is basically sound and allows for infinite adjustments. This is particularly vital because this is about creating a framework that will be successful on an on-going basis, for a national-scale service. That is an entirely different challenge to creating a successful outcome for a finite project where you are not expecting to implement the process on an on-going basis. Apart from anything else, a project with a defined timescale and outcome gives you more leeway to have a bit of human intervention and tweak things manually to get a good result.

Group records

Using the tokenisers and matching methods we can try processing the data for matches. When records are matched with a degree of certainty, a group record is created in the deduplication index. It is allocated a group id and contains the ids of all of the linked records. This is used as the basis for the ‘master record’ creation.
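The creation of group records might look something like the following Python sketch. It assumes that matches are supplied as pairs of record ids and that matching is treated as transitive (if A matches B and B matches C, all three share a group). Both are assumptions for illustration, as are the id formats.

```python
def build_group_records(record_ids, matched_pairs):
    """Turn pairwise matches into group records holding the ids of all linked records."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the representative record, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    members = {}
    for rid in record_ids:
        members.setdefault(find(rid), []).append(rid)

    groups = []
    for linked in members.values():
        if len(linked) > 1:  # unmatched records do not get a group record
            groups.append({"group_id": f"group-{len(groups) + 1}", "record_ids": sorted(linked)})
    return groups


ids = ["r1", "r2", "r3", "r4", "r5"]
pairs = [("r1", "r2"), ("r2", "r3")]
group_records = build_group_records(ids, pairs)
```

Whether matching really should be transitive is a design question in its own right, since chains of weak matches can pull unrelated names together.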

Primary or master records

I have previously blogged some thoughts about the ‘master record’ idea. Our current proposal is that every Archives Hub name is a primary record, unless it is matched. So, if we start out with six variations of Martha Beatrice Webb, 1858-1943, then at that point they are all primary records and they would all display. If we match four of them, to a confidence threshold that we are happy with, then we have three primary records. One of the primary records covers four archives. We may be able to still link the other two instances of this name to the aggregated record, but we can assign a lower confidence threshold to this.

Diagram showing instances of the name Beatrice Webb and how they might match. — Deduplication for ‘Beatrice Webb’

In the above example (which is made up, but reflects some of the variations for this particular name) four of the instances of the name have been matched, and so that creates a new primary record, with child records. Two of the instances have not been matched. We might link them in some way, hence the dotted line, or they might end up as entirely separate primary records. The instance of Beatrix Potter, nee Webb, has not been matched (these two individuals are often confused, especially as they have the same death date). If we set levels of confidence wrongly, this name could easily be matched to ‘Beatrice Webb’.

The reasoning behind this approach is that we aggregate where we can, but we have a model that works comfortably with the impossibility of matching all names. Ideally we provide end users with one name record for one person – a record that links to archive collections and other related resources. But we have to balance this against levels of confidence, and we have to be careful about creating false matches. Where we do create a match, the records that were previously primary records become ‘child records’ and they no longer display in the end user interface. This means we reduce the likelihood of the end user searching for ‘william churchill’ and getting 25 results. We aim for one result, linking to all relevant archives, but we may end up with two or three results for names that have many variations, which is still a vast improvement.
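The arithmetic of the Beatrice Webb example can be checked with a short Python sketch. The record ids and the shape of the primary record are invented; the point is only that six instances with four matched leave three records to display.

```python
def primary_records(record_ids, matched_groups):
    """Return the displayable primary records: one per matched group plus every unmatched record."""
    children = {rid for group in matched_groups for rid in group}
    primaries = [
        {"id": f"primary-{n}", "children": sorted(group)}
        for n, group in enumerate(matched_groups, start=1)
    ]
    # Unmatched records stay as primary records in their own right.
    primaries += [{"id": rid, "children": []} for rid in record_ids if rid not in children]
    return primaries


webb_records = ["webb-1", "webb-2", "webb-3", "webb-4", "webb-5", "webb-6"]
matched = [{"webb-1", "webb-2", "webb-3", "webb-4"}]
displayed = primary_records(webb_records, matched)
```

The four matched instances become child records of one new primary record, and the two unmatched instances continue to display separately.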

If we have several primary records for the same person (due to name variations) then it may be that new data we receive will help us create a match. This cannot be a static process; it has to be an effective ongoing workflow.

Comic strips and seaside holidays: unexpected stories from the Save the Children Archive

September 7, 2020 / Jane Ronson

Archives Hub feature for September 2020

The Save the Children (SCF) archive, held at the Cadbury Research Library, University of Birmingham, charts the development of the charity from its creation in 1919. The collection includes a wealth of material relating to the charity’s founder, Eglantyne Jebb, and these papers provide a fascinating insight into how SCF operated during the 1920s. They also highlight the personal stories of individuals associated with SCF.

Concertina comic strips

***Illustrated concertina comic strip (ref: SCF/EJ/9/2).***

One fascinating item is a wonderful illustrated concertina comic strip created by Corinne de Candole, documenting her first week working at the SCF office in April 1925. She dedicated the strip to ‘Miss Jebb who showed me how the New World is being built at the Office of the Save the Children Fund’. The strip depicts Corinne’s interview with a Mrs Beach, as well as the making of blue cloaks and flags and ‘planning for the new world’.

***Travelling to Geneva (ref: SCF/EJ/9/2).***

Another two comic strips reveal how Corinne travelled to Geneva for the summer school in 1925. She also wrote two poems about this experience: ‘The Disobedient Lady who never got to the SCF Summer School’ and ‘The Obedient Lady who went to the SCF Summer School’. Through these documents we can sense the pride that Corinne felt in working for SCF and her thoughts on how it was helping change the world.

Thank you letters

The overseas country papers in the Eglantyne Jebb series highlight the personal stories of those affected by the crisis in Europe after the First World War. The Horak family, from Hungary, wrote a letter of appreciation to SCF, offering thanks and remembering their benefactors.

***The Horak family letter with typed translation, 1922 (ref: SCF/EJ/1/17/1).***

‘From the bottom of our hearts sending our Christmas Greetings and very best wishes [and] we are always thinking gratefully of those who helped to get homes for us poor war invalids and widows with our families. May you be as happy as you have made us […] The little cottage means also a new life to us, making us forget our sufferings and losses. We beg the Almighty to pour his blessing over you and your family and give long life and happiness to those who provided us with a home. This will be our prayer on this holy Christmas eve.’

***The letter is accompanied by a photograph of the Horak family (ref: SCF/EJ/1/17/1).***

In a letter to Miss Vulliamy, who was leading SCF funded projects in Poland, Vera Staack describes how she and her mother had to flee Russia due to the Bolsheviks: ‘But why are they frightened, why do I read such terror in their eyes? I shall explain you the reason. The red banner flashes, and on it the black words which make everybody tremble. “Death to the bourgeois.”…..The fathers or mothers are taken from their children, children are torn from their parents sides. And so everybody tries to hide quickly.’

‘The picture of the past rises involuntary before me. Christmas Eve! It was our last Christmas Eve in our native land-in far off Moscow. An enormous Christmas-tree made dazzlingly brilliant by quantities of electric lamps and brilliant ornaments and many, many presents…..And all this has been taken from me by the Bolsheviks. Dear Miss Vulliamy, and I shall have no more Christmas-trees or Christmas Eves, and mother is always very cross now, cries often, and wishes to speak to no one. She was quite different before.’

***Letter from Vera Staack, 1921 (ref: SCF/EJ/1/22/7).***

‘And now good-bye, my dear, dear English friend. I hug you very hard and remain your very respectful and unhappy little Domby friend’

She ends ‘P.S. Why are men so wicked, dear Miss Vulliamy.’

A seaside holiday

Another example can be found in a report entitled ‘A seaside holiday’, written by M. Brown, where we learn of the impact that a trip to the beach had on a group of young children: ‘“Who pushes the sea?” “Is water never still?” “Does sand bite?” […] even the Ukrainian student was among the unbelievers who doubted whether the sea was salt, and made a wild dash to stoop down and taste it to make quite sure that he was not being deceived.’

The children then share their stories of the horrors that they have been through: ‘that was a long time ago…my mama died in the truck on the way from Russia. She died of hunger my mama did not live long after my daddy was killed by the Bolshevists. I wouldn’t believe it at first when the doctor came round and bent down and listened to her heart and said that mam was dead.’

***‘A seaside holiday’ report, 1922 (ref: SCF/EJ/1/22/8).***

‘All the children have their own sad story, and all have lived through strange and dreadful times, and in all their young faces can be read the tragedy of the homeless and the outcast. It is to build up their energy for the life struggle before them that Miss Vulliamy inaugurated the Children’s Holiday Home at Danzig in 1922.’

These archives offer a glimpse into the traumatic events which children and families faced in the aftermath of the First World War, the attempts by SCF to help and the appreciation that this generated.

Matthew Goodwin
Save the Children Project Archivist
Cadbury Research Library, University of Birmingham

Browse all Cadbury Research Library, University of Birmingham descriptions available to date on the Archives Hub.

All images copyright Cadbury Research Library, University of Birmingham. Reproduced with the kind permission of the copyright holders.

Names (5): The Problem of anonymity

August 5, 2020 / Jane Stevenson

It is easy to focus on names that represent fairly well known people. But one of the challenges for archives is to work with little known people – names that represent someone who is referenced in a catalogue – maybe they are indexed because they are a correspondent for example – they appear in one of a series of letters – but there is no more information about them other than their name. They may be referenced in other sources, but we have little to go on in order to discover that, and often they won’t be represented – it may be that this is the only written source that includes them.

In a names service, we can add a name – let’s say ‘Louisa Jane Justamond’ – a name from https://archiveshub.jisc.ac.uk/data/gb12-ms.add.8556 (‘The Garland continued’, a collection of poems addressed to her). We only have that one instance of that name. It is not in VIAF, it is not in Wikidata. There is an instance listed in ‘A genealogical and heraldic dictionary of the landed gentry of Great Britain’ (a precursor to Burke’s peerage). But unless we decide to use that as an external source, write a name matching algorithm and decide, on levels of confidence, that it is indeed a match, that is not going to help us. We are left with a name attached to one archive collection and nothing else.

We can create a name record for Justamond, but if we display it on the Archives Hub it will simply show her name and a link back to the related description. It will be extremely minimal.

However, what we don’t know is whether new collections will be added to the Archives Hub, or new information added to Wikidata or another source that we use, such that this person becomes more identifiable. We simply don’t know what the value of a name might be. In the future, having a record of this person could prove to be immensely useful in making a connection.

Archives have what you might call a long tail of names. It is something that characterises our holdings. It is something that sets us apart from libraries and museums, at least to a degree. Most names represented in library holdings (or names they represent in their catalogues and other finding aids) represent identifiable people.

Graph showing the long tail of names — The long tail of names

In archives, we have collections that represent ordinary people, not published, not celebrated, not notorious, with no documented place in history. We also have collections that include people where it is hard to know whether an individual is more widely known, because the archive collection does not entirely identify them.

Either way, it leaves us with a question about how to deal with a name that has nothing else attached to it other than ‘this name is in this letter’.

Building an index of all names means that we have a store of data that can be used for further exploration. It could sit behind the scenes, but it can be used to try out tools, data manipulation and matching. In other words, the data is a separate thing from what you decide to display.

Having a name (maybe not knowing exactly who the name represents) and knowing that the name is in three different archives has value. We can say ‘in the absence of any other information, we assume these names represent the same person’, or we can simply present the information and not make any conclusions (although that begs the question of how you present it without encouraging assumptions). It is then up to researchers to explore further. We might find new data sources that help to clarify names. We might get new descriptions that help to do this.

Many archival descriptions include subjects and, to a lesser extent, places. If you have Stephen Merryweather in one, with an index term of botany, and S. Merryweather in another, with the same index term, then you could say it is more likely to be a match. There is a question of how you might then present that information. The use of algorithms raises the issue of how to convey levels of confidence. It feels as if we need to have a more sophisticated – and recognised – means of presenting levels of confidence.
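The Merryweather example could be expressed as a simple scoring rule, sketched below in Python. The record structure and the score values (0.5 for a compatible name, plus 0.25 for a shared index term) are entirely arbitrary assumptions, intended only to show how supporting information might raise confidence.

```python
def compatible_names(a, b):
    """Surnames must be equal; forenames must be equal or one must be an initial of the other."""
    forename_a, surname_a = a
    forename_b, surname_b = b
    if surname_a.lower() != surname_b.lower():
        return False
    fa = forename_a.lower().rstrip(".")
    fb = forename_b.lower().rstrip(".")
    return fa == fb or (len(fa) == 1 and fb.startswith(fa)) or (len(fb) == 1 and fa.startswith(fb))


def match_score(rec_a, rec_b):
    """Return an illustrative confidence score between 0 and 1."""
    if not compatible_names(rec_a["name"], rec_b["name"]):
        return 0.0
    score = 0.5  # arbitrary baseline for a compatible name
    if set(rec_a["subjects"]) & set(rec_b["subjects"]):
        score += 0.25  # arbitrary boost for a shared index term
    return score


full = {"name": ("Stephen", "Merryweather"), "subjects": ["botany"]}
initial = {"name": ("S.", "Merryweather"), "subjects": ["botany"]}
other = {"name": ("S.", "Merryweather"), "subjects": ["railways"]}

with_subject = match_score(full, initial)
without_subject = match_score(full, other)
```

The numbers themselves mean little; what matters is that the score is kept alongside the match, so that a level of confidence can be conveyed to the end user rather than a bare yes or no.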

This whole issue of confidence levels is more of a focus for archives, because of the anonymity I’ve talked about.

Diagram showing Relationships of data involved in creating name records — Relationships of data involved in creating name records

The ‘Name’ records shown above are the names within archival descriptions (EAD records on the Hub). These names can be pulled out from ‘origination’ (creator) and from ‘persname’ (usually in the controlaccess index section, but potentially elsewhere in the description). These names may represent ‘unknown’ people, and the EAD may not even indicate whether they are personal or corporate or family names. They may not include dates; they may just be ‘Mary Fleming’ or ‘Mary Fleming fl 1717’. They may also be ‘unknown’, ‘[unknown]’, or even ‘unknown unknown’ (keeping the surname, forename structure!). They may be ‘Name of author (various)’ or ‘Various health authority bodies’ or ‘Possibly Miss M. Lindsay’. All these are examples from our data. They illustrate the conflict between human readable data – where ‘unknown’ is useful – and machine processable data – where semantics are important, and a name is ideally just a name.
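Pulling names out of ‘origination’ and ‘persname’ can be sketched with Python’s standard XML library. The EAD fragment below is a much simplified, made-up example with no namespace, and the function name is hypothetical; real Hub descriptions are more varied, and an ‘origination’ that wraps a ‘persname’ would be picked up twice by this simple approach.

```python
import xml.etree.ElementTree as ET

# A simplified, invented EAD fragment for illustration only.
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did>
      <origination>Fleming, Mary, fl 1717</origination>
    </did>
    <controlaccess>
      <persname>Fleming, Mary</persname>
      <persname>unknown</persname>
    </controlaccess>
  </archdesc>
</ead>
"""


def extract_names(ead_xml):
    """Return (element name, text) pairs for every origination and persname element."""
    root = ET.fromstring(ead_xml)
    names = []
    for tag in ("origination", "persname"):
        for element in root.iter(tag):
            # Join any nested text and tidy up the whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                names.append((tag, text))
    return names


extracted = extract_names(SAMPLE_EAD)
```

Note that ‘unknown’ comes through as a name like any other, which is exactly the conflict between human readable and machine processable data.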

If we create ‘Name’ entries for all of these then we have a store of data to work with, something I’ve mentioned before in my Names Project blogs. We can then find out how many ‘Mary Fleming’ entries there are, or how many ‘M Fleming’ entries. How we then choose to display that information to end users is a separate question. But with the advances in machine learning, it is becoming an increasingly pertinent question.
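Counting entries in such a store is straightforward. The Python sketch below uses an invented list of names and sets aside the placeholder values mentioned earlier; treating names as equal when they differ only by case is an assumption.

```python
from collections import Counter

# Placeholder values taken from the examples in this post.
PLACEHOLDERS = {"unknown", "[unknown]", "unknown unknown"}

# Invented sample of extracted names.
names = ["Mary Fleming", "M Fleming", "Mary Fleming", "mary fleming", "unknown", "[unknown]"]


def count_names(names):
    """Count name entries case-insensitively, ignoring placeholder values."""
    return Counter(
        name.lower() for name in names if name.lower() not in PLACEHOLDERS
    )


counts = count_names(names)
mary_total = counts["mary fleming"]
m_total = counts["m fleming"]
```

Here there are three ‘Mary Fleming’ entries and one ‘M Fleming’ entry; whether the latter should be counted with the former is a matching question, not a counting one.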

We have an opportunity with archival metadata, with the way that archives represent ‘ordinary life’. But it is a challenge. Catalogues are still not really set up to identify entities (in a way that works for machine processing). We create what we refer to as ‘name authorities’ but we do not usually consider the importance of matching names outside of individual organisations. The Archives Hub has an opportunity to work on behalf of UK archives to try to draw out people and, in a sense, identify them, or at least, enable them to be more contextualised. But it will require a good deal of experimentation and expertise in working with disparate data. However, if we create a pool of names and provide an API, that would enable others to work with the data, and try different approaches. This is a big challenge, and it needs a concerted and collaborative approach.

Fish are jumpin’ in the Archives

July 31, 2020 / Jane Ronson

Archives Hub feature for August 2020

“Summertime and the livin’ is easy...” ¹. Well, it’s a rather wet summer in the UK but all the better for exploring collections on the theme of fish!

Plotosus lineatus (Catfish). Copyright: Alain Feulvarch (https://commons.wikimedia.org/wiki/File:Catfish_Plotosus_lineatus.jpg). Creative Commons 2.0 license: https://creativecommons.org/licenses/by/2.0/deed.en

We’ve trawled the Archives Hub (sorry, couldn’t resist!) to bring you a selection of the wonderful, and sometimes surprising, collections relating to fish, ranging across research, expeditions, fisheries, the fishing industry and river authorities – not forgetting a fish and chip shop, a theatre and several appropriately named individuals.

Research and Expeditions

Fishes Collected by Darwin, 1842. 300 pages of notes on the fish collected by Darwin on the Beagle, compiled by Leonard Jenyns (1800-1893), a clergyman and naturalist; Jenyns changed his name to Leonard Blomefield in 1871. Held by the Museum of Zoology Archives, University of Cambridge https://archiveshub.jisc.ac.uk/data/gb433-jenynsdarwin.

C Tate Regan collection, 1912-1913. Charles Tate Regan (born in 1878) was keeper of zoology at the British Museum. He worked on the scientific results of the Scottish National Antarctic Expedition, 1902-1904 (leader William Speirs Bruce) and the British Antarctic Expedition, 1910-1913 (leader Robert Falcon Scott). He died in 1948. Published work includes ‘Antarctic fishes of the Scottish National Antarctic Expedition’ in the Reports of the scientific results of the voyage of the steam yacht Scotia and ‘Fishes’ and ‘Larval and post larval fishes’ published in the zoology reports of the British Antarctic Expedition, 1910-1913. Held by the Scott Polar Research Institute Archives, University of Cambridge https://archiveshub.jisc.ac.uk/data/gb15-charlestateregan.

Cuthbertson drawing of an Atlantic lizardfish. Copyright the National Museums Scotland Library (adapted from the full image included in the William Speirs Bruce Archive feature, August 2017).

Winifred E. Frost collection, 1930s-1960s. Frost was an authority on the natural history of fish in the Lake District. Her research included work on euphausids with Professor James Johnstone at Liverpool University, and she worked for the fisheries branch at Dublin investigating trout in the River Liffey. She was appointed to the Freshwater Biological Association in 1938 and was awarded a D.Sc. by Liverpool University for her published papers. She wrote The Trout with Margaret E. Brown (Varley), published in 1967, which took 21 years to prepare. She was a member of the Council of the Salmon and Trout Association, and president of the Windermere and District Angling Association, and also travelled to international scientific meetings and undertook an investigation of eels in Africa. Held by the Freshwater Biological Association Archives https://archiveshub.jisc.ac.uk/data/gb986-frow.

Notes towards a dictionary of fish names, by Paul Barbier (C20th). Barbier was Professor of French Language and Literature at the University of Leeds, 1903-1938. The collection comprises 8 boxes of notes prepared in the course of research for an unpublished dictionary of names of fishes. Held by University of Leeds Special Collections https://archiveshub.jisc.ac.uk/data/gb206-ms125.

Solenostomus paradoxus – Harlequin Ghost Pipefish. © Steve Childs (https://commons.wikimedia.org/wiki/File:Solenostomus_paradoxus_-_Harlequin_Ghost_Pipefish.jpg). Creative Commons 2.0 license https://creativecommons.org/licenses/by-sa/2.0/deed.en.

Rosemary Lowe-McConnell Collection, 1934-1947. Lowe-McConnell was a pioneer in tropical fish ecology. She was born in Liverpool, and graduated from the university. She worked at the Freshwater Biological Association studying the migration of silver eels. In 1993 Michael N. Bruton interviewed Lowe-McConnell on the personal reasons behind her choice of work, her personal influences, and her experiences of being a woman in a male dominated world. Initially she wanted to be an explorer/naturalist, to which the reply was ‘never mind dear, perhaps you can teach’. When she applied to the colonial services in 1945 to be an entomologist, they would not employ a woman in that role, but the tropical fisheries department was new, and not considered as important. Although she was forced to resign in 1954, when the marriage bar was in place, she was more interested in pursuing her findings than concerned with job status, and she believed that being offered the directorship at the Joint Fisheries Research organisation in central Africa (which she rejected) showed that she was accepted despite being female. Held by Freshwater Biological Association Archives https://archiveshub.jisc.ac.uk/data/gb986-lowr.

Journal of John Walsh’s Visit to France in 1772. John Walsh (1726-1795) was elected to the Royal Society in 1770, and became known for his work on the electric ray, Torpedo marmorata. In 1769 Edward Bancroft proved that the electric eel emitted electric shocks, and Walsh set out to confirm that the ray had a similar power. In this he was encouraged by Benjamin Franklin, whose American colleagues were undertaking similar investigations. With his nephew Arthur Fowkes he spent the summer of 1772 at La Rochelle, where the ray was often captured. The fish could survive many hours out of water, and Walsh was able to conduct experiments ashore and successfully proved that the ray’s shocks were caused by electricity. His findings were published in the Royal Society’s Philosophical Transactions, vol. 63 (1773), pp. 461-77, and the Royal Society awarded him the Copley medal for his achievement. Held by University of Manchester Library https://archiveshub.jisc.ac.uk/data/gb133-engms724.

Fisheries and the Fishing industry

Records of Aberdeen Fish Curers and Merchants Association, 1888-1947. The association was established in May 1888, as Aberdeen Fish Trade Association, and was incorporated with its present title in 1944. It began in response to the introduction of sales by auction in the late nineteenth century, its first achievement being an agreement amongst fish sellers to provide discounts for cash sales to accredited buyers. Membership was open to wholesale fish merchants and fish curers carrying on a business in Aberdeen, and in 1980 stood at more than 200. Held by University of Aberdeen Special Collections https://archiveshub.jisc.ac.uk/data/gb231-ms3054.

Records of the Berwick Salmon Fisheries Co Ltd, salmon fishers, Berwick upon Tweed, England, 1562-1964 (predominant 1860-1964). The Old Shipping Co, shipping traders and salmon fishers, Berwick-upon-Tweed, Northumberland, England, was established at some point prior to 1766 by a group of local men, mainly coopers, who held shares in a small sailing fleet engaged in the London, coastal and foreign trade. As commodities included salmon, the company leased fishing rights on the river Tweed. The shipping vessels were sold off in 1869 as business had become unprofitable and the company’s name changed to Berwick Salmon Fisheries Co Ltd in 1872. Held by University of Glasgow Archive Services https://archiveshub.jisc.ac.uk/data/gb248-ugd245.

Volume containing two copies of a printed register relating to Netherlands herring fisheries, 1749: entitled Naamlyst der boekhouders, schepen, en stuurluiden van de haring-shepen, in’t Yaar 1749, van Enchisen en de Ryp, ter haring-shepen uitgevaren (Jan von Guissen, Enkhuisen, 1749), giving details of the ships, owners and captains of the fleets of Enkhuisen and De Rijp. Added in manuscript are details of the total catch for 1749, and the catch for individual ships on various voyages. Held by Senate House Library Archives, University of London https://archiveshub.jisc.ac.uk/data/gb96-ms115.

Women Fish Sellers – from Hamilton, Robert (1866) British Fishes, Part II, Naturalist’s Library, vol. 37, London: Chatto and Windus. Image in the public domain (photograph from the Freshwater and Marine Image Bank at the University of Washington).

Grimsby Steam and Diesel Fishing Vessels’ Engineers’ and Firemen’s Union, 1897-1987. The Grimsby Steam Fishing Vessels’ Engineers’ and Firemen’s Union was founded in 1896. It changed its name to the Grimsby Steam and Diesel Fishing Vessels’ Engineers’ and Firemen’s Union in 1961. In 1976 it transferred engagements to the Transport and General Workers’ Union, becoming 10/3c Branch. Held by Modern Records Centre, University of Warwick https://archiveshub.jisc.ac.uk/data/gb152-gsf.

The business records of Shippam’s Ltd, 1853-1995. The Shippam’s business first started in 1786, when Charles Shippam established a grocery store in Westgate, Chichester. In 1886 they began food manufacturing and in 1894 launched a wide range of potted meat and fish pastes, for which Shippam’s was to become internationally famous. Held by West Sussex Record Office https://archiveshub.jisc.ac.uk/data/gb182-shippam’s.

Fish and Chips

Records of Pesci Bros Fish and Chip Shop, 1920-1994. The Pesci family, originally from Bardi in Italy, came to Barking from Wales in 1934, and went on to open a fish and chip shop at 15 Broadway. Only a few years later the shop was compulsorily purchased by Barking Borough Council so that the site could be used for the building of the new Town Hall. After a long search for new premises, the family finally re-opened at 26 Ripple Road in 1939. The business flourished for nearly 60 years. Held by Barking and Dagenham Archive and Local Studies Centre https://archiveshub.jisc.ac.uk/data/gb350-bd76.

River authorities

Records of the Centre for Environment, Fisheries and Aquaculture Science, Benarth Road, Conwy, 1916-1994. In December 1999 the Conwy Laboratory closed after approximately ninety years of pioneering research and development into fish and shellfish aquaculture. The laboratory’s foundation came about following the building of mussel purification tanks by Conwy Corporation in 1913, in an attempt to improve the quality of Conwy mussels, which had been at the centre of several serious infections. The collection is of scientific importance in documenting experiments of international significance. Additionally, it reflects the traditional activities of the mussel fishermen themselves. Held by Gwasanaeth Archifau Conwy / Conwy Archive Service https://archiveshub.jisc.ac.uk/data/gb2008-cd3.

Environment Agency Collection, 1786-2010. The collection consists of reports, surveys, data records, maps, administrative records and other material relating to the work of the Environment Agency (and of its predecessor organisations: the various River Boards, River Authorities, Water Authorities and the National Rivers Authority). A few documents date back to the 19th century and earlier, but the majority span the 1930s to the 1990s. Most of the collection relates to the Agency’s monitoring and management of the area’s river and lake catchments, with an emphasis on fisheries, biodiversity, constructions such as fish passes, weirs and fish traps, fish diseases, water quality and pollution. Included are papers relating to the Agency’s corporate, strategy and public affairs, as well as information on regional and national byelaws, net limitation orders and historic fishery rights. Held by Freshwater Biological Association Archives https://archiveshub.jisc.ac.uk/data/gb986-enva.

A Different Kettle of Fish

Records relating to Ada Fish, First World War munitions worker at Pembrey, 1918-1919. Held by West Glamorgan Archive Service https://archiveshub.jisc.ac.uk/data/gb216-d/dz969.

Fisher Theatre, Bungay, 1790-1886. The Fisher theatre at Bungay, Suffolk, opened in February 1828. Built by David Fisher I, the theatre was one of a dozen serving the circuit of Fisher’s company, The Norfolk and Suffolk Company of Comedians and seasons of performances were produced on a two-year cycle. The theatre was sold by the Fishers in 1844 and was used subsequently as a corn hall, furniture store, steam laundry, cinema, and textile warehouse. In 2000 the building was acquired by the Bungay Arts Trust. After extensive renovations the building was re-opened in 2006 as a community theatre and arts centre which is also licensed for wedding and civil ceremonies. Held by the University of East Anglia Archives https://archiveshub.jisc.ac.uk/data/gb1187-ftb.

Papers of Robert Salmon Hutton, 1897-1970. Hutton was born in 1876 in London. His family owned a silversmiths in Sheffield. Hutton pursued his research interests in electro-metallurgy with Professor Arthur Schuster at Manchester and Henri Moissan in Paris. From 1900-1908 he was a lecturer in electro-chemistry at the University of Manchester, where he carried out pioneering work on electric furnace technology, seeing its value for commercial metallurgy. In 1903 he perfected a method for the mass production of fused silica. Hutton had a great interest in research and development, and he was aware of failings in this area by British metallurgical industries. A great believer in the value of technical libraries, he was a founder of the Association of Scientific Libraries Information Bureau (ASLIB) in 1924. Held by University of Manchester Library https://archiveshub.jisc.ac.uk/data/gb133-hut.

Engraving of Anthias Anthias, at that time called Anthias Sacer. The Author ran out of resources while issuing this book and therefore every engraving had its own sponsor. This one has been sponsored by Sigmund Zois Freiherr von Edelstein. Author: Bloch, Marcus Elieser, 1723-1799. Item/Page/Plate: Pl. 315, opp. p. 86. Image in the Public Domain (https://creativecommons.org/publicdomain/mark/1.0/deed.en; PD-US), courtesy of The New York Public Library, www.nypl.org.

Herring, Thomas (1693-1757). Papers of Thomas Herring, Archbishop of Canterbury 1747-57. 4 volumes, held by Lambeth Palace Library https://archiveshub.jisc.ac.uk/data/gb109-herring.

Papers of George Gordon Hake, 1891-1904. Hake was born in 1847. He spent thirteen years from 1891 working in South Africa, initially with the British South Africa Company and later with the Tanganyika Telegraph Service during 1889 and 1903 in the Mashonaland area. He died in 1903 and was buried at Port Herald. Hake was closely connected to the Rossetti family in their later years, acting as a ‘minder’ to Dante Gabriel Rossetti during one of their family holidays. Christina Rossetti was also godmother to his daughter Ursula. Held by School of Oriental and African Studies (SOAS) Archives, University of London https://archiveshub.jisc.ac.uk/data/gb102-ppms40.

Henry Guppy (1861-1948) was librarian of the John Rylands Library from 1900-1948. Held by University of Manchester Library https://archiveshub.jisc.ac.uk/data/gb133-tft/tft/1/459.

Declaration of Trust of Leasehold Property in Breams Buildings, Chancery Lane, London, 1888. Lease for the Breams Building, which was the main Birkbeck site from 1888-1952. The lease is in the form of a soft cover book, written over several vellum pages, with wax seals on the last page. Held by Birkbeck Library Archives and Special Collections, University of London https://archiveshub.jisc.ac.uk/data/gb1832-bbk/bbk/6/1.

John Whiting Archive, 1917-1963. Whiting, a playwright and actor, was born in 1917 in Salisbury, UK. He received his education at Taunton School and later trained as an actor at the Royal Academy of Dramatic Art. After his time in the army Whiting had some success as an actor and then went on to write numerous plays, short stories and plays for radio. Whiting also took up theatre criticism during the last few years of his life for ‘London Magazine’; some of his work can be found in ‘The Art of the Dramatist’ (1970). Held by V&A Theatre and Performance Collections https://archiveshub.jisc.ac.uk/data/gb71-thm/222.

Roe Manuscripts, 10th-17th century. Sir Thomas Roe was born in 1580 or 1581, and matriculated at Magdalen College, Oxford, in 1593, but took no degree. In 1605 he was knighted, and in 1614 began his official journeys to the East which made him famous. From that year to 1618 he was Ambassador to Jehángír, the Mogul emperor of Hindustan, and from 1621 to 1628 to the Turkish Court. In 1640 Roe was elected a burgess of the University in Parliament, and died in 1644. The manuscript collection comprises: 27 Greek, one Hebrew, one Arabic, and one Latin. Held by the Bodleian Library, University of Oxford https://archiveshub.jisc.ac.uk/data/gb161-mss.roe1-17,18a-b,19-29.

A “tornado” of schooling barracudas at Sanganeb Reef, Sudan. Copyright: Robin Hughes (https://commons.wikimedia.org/wiki/File:Barracuda_Tornado.jpg). Creative Commons 2.0 license: https://creativecommons.org/licenses/by-sa/2.0/deed.en.

Rocket assisted take off by a Barracuda, 1945 – on HMS Trumpeter. 2 photos, held by Gwasanaeth Archifau Conwy / Conwy Archive Service https://archiveshub.jisc.ac.uk/data/gb2008-cp1727/cp1727/4/1/40.

Previous features relating to Fish

Silt, sluices and smelt fishing – The Eau Brink Cut and the Bedford Level Corporation Archive

William Speirs Bruce Archive in the National Museums Scotland Library

1. George Gershwin – Summertime lyrics: https://www.stlyrics.com/songs/g/georgegershwin8836/summertime299720.html

Names (4): Ethics and identity

July 9, 2020 / Jane Stevenson / 1 Comment

As archivists, we deal with ethical issues a good deal. But the ability to link disparate and diverse data sources opens up new challenges in this area, and I wanted to explore this a bit.

If you do a general search for ethics and data, top of the list comes health. An interesting example of data join-up is the move to link health data to census data, which could potentially highlight where health needs are not being met:

“Health services are required to demonstrate that they are meeting the needs of ethnic minority populations. This is difficult, because routine data on health rarely include reliable data on ethnicity. But data on ethnicity are included in census returns, and if health and census data for the same individuals can be linked, the problem might be solved.” (Ethnicity and the ethics of data linkage)

However, individuals who stated their ethnicity in census returns were not told that this might subsequently be linked with their health data. Should explicit informed consent be given? Given the potential benefits, is this a reasonable ask? It is certainly getting into hazardous terrain to ignore the principle of informed consent. In their book ‘Rethinking Informed Consent in Bioethics‘, Manson and O’Neill argue that informed consent cannot be fully specific or fully explicit. They argue for a distinctive approach where rights can be waived or set aside in controlled and specific ways.

This leads to a wider question: is fully explicit and specific informed consent actually achievable within the joined-up online world? This is a world where data travels across connections and is blended, re-mixed and re-purposed; a world where APIs allow data to be accessed and utilised for all sorts of purposes, and ‘open data’ has become a rallying cry. Is there a need to engage the public more fully in order to gain public confidence in what open data really means, and in order to debate what ‘informed consent’ is, and where it is really required?

I am working on a project to create name records, and I am looking at bringing data sources together. Of course, this is hardly new. Wikipedia is the most well-known hub for biographical data. Anyone can add anything to a Wikipedia page (within some limits, and with some policing and editing by Wikipedia, but in essence it is an open database). Wikidata, which underlies Wikipedia, is about bringing sources together in an automated way. Projects within cultural heritage are also working on linked data approaches to create rich sources of information on people. SNAC has taken archival data from many different archive repositories and brought it together. A page for one person, such as Martin Luther King, provides a whole host of associations and links. These sources are not all individually checked and verified, because this kind of work has to be done algorithmically. However, there is a great deal of provenance information, so that all sources used are clear.

image of page from the face of white australia website — The Face of White Australia

There are some amazing projects working to reveal hidden histories. Tim Sherratt has done some brilliant work with Australian records, on projects such as Invisible Australians, which aims to reveal hidden lives using biographical information found in the records. He has helped to create some wonderful sites that reveal histories that have been marginalised. Tim talks about ‘hacking heritage’ and says: ‘By manipulating the contexts of cultural heritage collections we can start to see their limits and biases. By hacking heritage we can move beyond search interfaces and image galleries to develop an understanding of what’s missing.’ (Hacking heritage, blog post) He emphasises that access to indigenous cultural collections should be subject to community consultation and control. But what does community consultation and control really mean?

I have always been keen to work with the names in archival descriptions – archival creators and all the other people who are associated with a collection. They are listed in the catalogue (leastways the names that we can work with are listed – many names obviously aren’t included, but that’s another story), so they are already publicly declared. It is not a case of whether the name should be made public at all, or, at least, that decision has been made already by the cataloguer. But our plan is to take the names and bring them to the fore – to give them their own existence within our service. We are taking them out of the context of a single archive collection and putting them into a broader one. In so doing, we want to give the archive collections themselves more social context, we want to give more effective access to distributed historical records, and we also want to enable researchers to travel through connections to create their own narratives.

This may help to reveal things about our history and highlight the roles that people have played. It may bring to the fore people who have been marginalised. Of course, it does not address the problem of biases and subjective approaches to accessions and cataloguing. But a joined-up approach may help us to see those biases and gaps; to understand more about the silent spaces.

Creating persistent identifiers and linking data reveals knowledge. It is tempting to see that in simple terms as a good thing. But what about privacy and ethics? Even if someone is no longer living, there are still privacy issues, and many people represented in archives are alive.

Do individuals want to be persistently identified? What about if they change their identity? Do they want a pseudonym associated with their real name? They might have very good reasons for keeping their identity private. Persistent identification encourages openness and transparency, which can have real benefits, but it is not always benign. It is like any information – it can be used for good and bad purposes, and who is to say what is good and what is not? Obviously we have GDPR and the Data Protection Act, and these have a good deal to say about obligations, the value of historical research and the right to be forgotten. This is something we’ll need to take into account. But linked data principles are not so much about working with personal data as working with data that may not seem personal, but that can help to reveal things when linked with other sources of data.

GDPR supports the principle of transparency and the importance of people’s awareness and control over what happens to their personal data. Even if we are not creating and storing personal data, it seems important to engage with data protection and what this means. The challenge of how to think about data when it is part of an ever shifting and growing global data environment seems to me to be a huge one.

Certainly the horse has bolted to some degree with regards to joining up data. The Web lowered barriers considerably, and now we increasingly have structured data, so it is somewhat like one gigantic database. Finding things out about individuals is entirely feasible with or without something like a Names service created by the Archives Hub. We are not creating any new content, but creating this interface means we are consciously bringing data together, and obviously we want to be responsible, and respect people’s right to privacy. Clearly it is entirely impractical to try to get permission from all those living people who might be included. So, in the end, we are taking a degree of risk with privacy. Of course, we will un-publish on request, and engage with any feedback and concerns. But at present we are taking the view that the advantages and benefits outweigh the risks.

“Imagine being a sibling in a family that continually removes you from photos; tries its best to erase you…As you go through [the scrapbook] you see events where you know you were there, but you are still missing.” Lae’l Hughes-Watkins (University of Maryland) gave an impassioned and inspiring talk at DCDC 2019 about her experiences. She argued that archivists need to interrogate the reality that has been presented, and accept that our ideas of neutrality are misplaced. She wants a history that actively represents her – her history and culture, and experiences as a black woman in the USA. She related moving stories of people with amazing stories (and amazing archives) who distrust cultural institutions because they don’t feel included or represented.

This may seem a long way away from our small project to create name records, but in reality our project could be seen as one very small part of a move towards what Lae’l is talking about. Bringing descriptions together from across the UK maybe helps us to play a small role in this, aiming to move towards documenting the full breadth of human experience. The archives that we cover may retain the biases and gaps for some time to come (probably for ever, given that documentary evidence tends to represent the powerful and the elite much more strongly), but by aggregating and creating connections with other sources, we help to paint a bigger picture. By creating name records we help to contextualise people, making it much easier to bring other lives and events into the picture. It is a move towards recognising the limitation of what is actually in the archive, and reaching out to take advantage of what is on the Web. In doing this through explicitly identifying people we do leave ourselves more open to the dangers of not respecting privacy or anonymity. When we plug fully into the Web, we become a part of its infinite possibilities, which is always going to be a revealing, exciting, uncontrollable and risky business. By allowing others to use this data in different ways, we open it up to diverse perspectives and uses.

Here’s a riddle: how can you work in an Archive Centre when you can’t work in an Archive Centre?

July 1, 2020 / Jane Ronson / 1 Comment

Archives Hub feature for July 2020

It’s a dilemma in this strange and worrying time. The collections are there, you know this. You know they are safe. For the time being, for you to remain safe, for all of us to remain safe, you can’t go near them. But this is your job, and much more than that – a passion. We know that archives are stories, solidified memories of individuals, groups, institutions. Many have been around a lot longer than us, and will be there after we’re gone. But at this point of their long, interesting history, we are their gatekeepers, their tenders. Donors from all walks of life have entrusted us with their stories, letting go of the physical, holding only to the ephemeral, and yet now…now we too are distanced from the physical. So, again, how do we work in an Archive Centre when we can’t work in an Archive Centre?

Blythe Duff is a Scottish actress born in East Kilbride on 25 November 1962. She has worked continuously since her debut as part of the Scottish Youth Festival in 1984. Though she has gone on to ply her trade mainly in theatre, she is perhaps best known for her role as Detective Sergeant Jackie Reid in the long-running Glasgow-based crime series Taggart. In 2011 she was awarded an honorary doctorate from Glasgow Caledonian University for services to the performing arts and in 2012 was made a cultural fellow of GCU.

It was in this guise that in 2018 she generously donated her decades-worth of accumulated Taggart artefacts to GCU Archive Centre. It is a rich, fascinating and rewarding resource for fans of the show both die-hard and casual, for aspiring scriptwriters, those with an interest in television production, and indeed for anyone with even a passing interest in Glasgow through the lens of British popular culture.

I’ve been thinking about this collection in these fast and slow days, weeks, and months of lockdown, as I adjust to this new, remote set-up. Once the working day is done, the laptop shut for the evening, I find myself, like so many, at a loose end. With so much temporarily closed, the question has become not so much what do I do, as what do I watch?

Blythe Duff and John Michie standing side by side between shelves of archive boxes and materials. Each is looking into camera and holding several scripts. — Blythe Duff and fellow Taggart star John Michie in GCU Archive Centre at the launch of her papers on 24th October 2018.

With this in mind, the Blythe Duff Taggart papers are a fascinating insight into the televisual process of the late 20th century. As a scriptwriting graduate, I am particularly enthralled by the variety of artefacts on offer. There are 138 individual scripts contained in the collection, spanning from Blythe’s debut on the show in 1990 all the way to 2010. Researchers will find a mixture of rehearsal scripts and shooting scripts, a fantastic insight into the malleable nature of the production process. Particularly poignant are the two versions of 1994’s two-parter ‘Legends’. Mark McManus, the titular Taggart, tragically died before production had finished. The two versions, one featuring Detective Chief Inspector Jim Taggart and the other re-written without him, offer a glimpse into what could have been, as well as the embryonic steps of the show that Taggart was to become.

It is the little details in the collection that draw me back to it: the scribbled notes on the pages, the inside jokes of the cast. Though the collection is currently uncatalogued, researchers will find Blythe’s personalised chair cover and a monogrammed Taggart jacket, along with a photo of Blythe in character in full police uniform. There are books as well: 25 Years of Taggart and Taggart’s Glasgow. Other artefacts include Taggart wrap party flyers and postcards of different actors from the show, one signed by cast members. There’s even a Taggart Mystery Jigsaw Puzzle game!

Selection of photographs, artefacts, all from television show Taggart, artfully laid on black backdrop. — Selected Taggart treasures from the Blythe Duff papers.

Since becoming available to researchers, it is one of the collections at GCU Archive Centre that has proved most popular with a wide range of visitors. Almost as soon as it was publicised with a visit to the Archive Centre by Blythe and fellow cast member John Michie, we’ve had members of the public – some of whom had never been in an archive before – pop their head into the reading room and ask if they could read an episode. We’ve had a family of fanatics all the way from Australia, a couple from England where the husband surprised his super-fan wife for a special birthday, and many more besides.

It’s also a particularly relevant resource for the University’s learning and teaching as GCU has offered a Masters course in Television Fiction Writing since 2010, the first of its kind in the UK. One of the course leaders, Chris Dolan, was previously a writer for Taggart. Students of the course have examined the scripts, seeing how they’re structured, potentially being inspired in their own work.

Close up photo of cover page of script for episode of Taggart. ‘Blythe’ handwritten in top corner. — Cover page of one of Blythe’s scripts.

The frustration of not being able to go into the Archive Centre each day, not being able to see collections, or chat to team members with ease, is very real. Nonetheless, we have all adjusted to working from home. Team meetings still occur through the magic of MS Teams, projects are still ongoing, new challenges arise and are met. And in the thick of the unprecedented time we are in, if I think back to my initial question, I realise it is possible to work in an Archive Centre even if you can’t work there. For it is the collective knowledge we have, and our willingness to ensure collections are protected and available to as many as possible, that is the lifeblood of archival work. Archives are indeed stories, and at this juncture we’ve reached a twist worthy of Taggart himself. But the path we’re on, though long and difficult, will lead us all back to where we want to be. It’s too tragic a time to call it a happy ending, but we’ve certainly had enough of cliff-hangers and will take a bittersweet conclusion.

David Ward
Archive Assistant
Glasgow Caledonian University Archive Centre – Sir Alex Ferguson Library

Related

Browse all Glasgow Caledonian University Archives and Special Collections descriptions available to date on the Archives Hub

All images copyright Glasgow Caledonian University. Reproduced with the kind permission of the copyright holders.

Names (3): One name record to bind them

June 29, 2020 / Jane Stevenson / 2 Comments

It has been great to get comments and feedback around names, and I wanted to expand upon something that a few people have commented on: the ideal of one ‘authority record’ for one person or organisation.

model showing relationships of catalogues and name records — Model showing potential relationships between catalogues and name records

The above diagram is a proposal for the relationships we might have – note that it is a working model, and may well change over time. You can see the catalogues (the descriptions of archives) include people, some with biographical histories, and these people are either creators of archive collections or referenced in them. Each of these people then gets a name record (bottom left box), so we might have e.g. three name records for the same name (and the same name may potentially be the same person…or may not). We will work with the store of records that we have with the aim of creating matches, and ending up with a generic or main name record (green box, top left).

The ‘main record’ or ‘master record’ or whatever we might call it, for each individual person or organisation, is not an ‘archival record’. It is not intended simply to be a reflection of what is in our own data. It is intended to be a page dedicated to that person or organisation. Our current feeling is that this should not be seen as domain specific; in fact, we want to get away from the idea that data is domain specific. It is about an entity (a person or organisation), and what we know of that entity.

Keeping in mind the green box, and looking at the person page for Robin Day from Exploring British Design, a previous AHRC project we ran with Brighton Design Archive, you get a sense of the type of thing we mean.

Page for Robin Day, from the Exploring British Design website — Exploring British Design: Robin Day

This page presents as a general information page about a designer. It is not branded as a page about archives. It takes information in from different sources. Is it an ‘authority’ record? I’m really not sure; I wouldn’t call it that. The point is really that it enables researchers to put Robin Day into the context of other people, organisations, places and events, or at least it demonstrates how that can be done. It creates a network, and it intends to show the value of including archives in a network, rather than standing apart, in their ‘own world’.

Screenshot of an entity relationship diagram for Robin Day — Visualised relationships

The network can easily be visualised. There are tools out there to do this. The challenge is to create the data to feed into these visualisers. Again, this visualisation is not about archival name authority records, it is not domain specific.

In the Robin Day page, we have a section for related archives and museum resources.

screenshot showing archives related to Robin Day — Related archive and museum resources

This lists archives Robin Day is the ‘creator of’ or archives he is ‘associated with’. It links to the Archives Hub, but also to other sources. One of the options for end users is to go and find out more about the archival sources, but it is not prioritised above other options.

So, this is essentially the idea – a page for a person, a page for an organisation. An information resource that focuses on creating a network of connections. We think this is a good approach, but creating something along these lines that is automated, sustainable and effective within an ongoing national service is much harder.

Why not just use this one record, link to the archive catalogues, and dispense with the individual name records that we have created? There are three reasons to consider providing access to the individual name records: biographical history, uncertainty around matching, and the ingest of name authority records.

I have already written about biographical and administrative history in a separate post.

In this phase of the Names Project the individual records for Beatrice Webb (as a name example) will be created either from the creator name or from the index terms that we have in the Archives Hub catalogues.

The main problem is the wide variation in name entries.

Webb, Beatrice
Webb, Beatrice, née Potter
Webb (Martha) Beatrice, 1858-1943
Webb, Martha Beatrice, 1858-1943
Webb;[Martha] Beatrice [nee Potter] 1858-1943

These are all entries in the Archives Hub. We can match them all up, but can we say they are all the same? Names without dates should not be matched with certainty, but quite often they will be the same person. (Beatrix Potter also often ends up being linked with Beatrice Webb, née Potter).
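To make the dates point concrete, here is a minimal sketch of how entries like the ones above might be parsed and compared. It is not the method Knowledge Integration are using; the function names (`parse_entry`, `match_level`), the three labels and the rules themselves are assumptions made purely for illustration. It only treats two entries as a confident match when both carry the same life dates, and otherwise reports a possible match at best.

```python
import re
import unicodedata

# Hypothetical patterns for illustration: a four-digit life-date range,
# and a maiden name introduced by "née" or "nee".
DATE_RANGE = re.compile(r"\b(\d{4})\s*-\s*(\d{4})\b")
MAIDEN = re.compile(r"\bn[ée]e\s+(\w+)", re.IGNORECASE)


def parse_entry(entry):
    """Split a name entry into surname, forenames, maiden name and life dates."""
    found = DATE_RANGE.search(entry)
    dates = (int(found.group(1)), int(found.group(2))) if found else None
    text = DATE_RANGE.sub(" ", entry)

    found = MAIDEN.search(text)
    maiden = found.group(1).lower() if found else None
    text = MAIDEN.sub(" ", text)

    # Strip accents, then treat all punctuation and brackets as separators.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.sub(r"[^A-Za-z]+", " ", text).lower().split()

    # Assumption: the entry is in inverted order, so the first token is the surname.
    surname = tokens[0] if tokens else ""
    return {
        "surname": surname,
        "forenames": frozenset(tokens[1:]),
        "maiden": maiden,
        "dates": dates,
    }


def match_level(entry_a, entry_b):
    """Return 'confident', 'possible' or 'no match' (labels are illustrative)."""
    a, b = parse_entry(entry_a), parse_entry(entry_b)
    if a["surname"] != b["surname"]:
        return "no match"
    # Forenames are compatible when one entry's set contains the other's.
    if not (a["forenames"] <= b["forenames"] or b["forenames"] <= a["forenames"]):
        return "no match"
    if a["dates"] and b["dates"]:
        return "confident" if a["dates"] == b["dates"] else "no match"
    # At least one entry lacks dates, so the match cannot be certain.
    return "possible"


print(match_level("Webb (Martha) Beatrice, 1858-1943",
                  "Webb;[Martha] Beatrice [nee Potter] 1858-1943"))
print(match_level("Webb, Beatrice", "Webb, Martha Beatrice, 1858-1943"))
```

Under these assumed rules the two dated entries match confidently, while the undated ‘Webb, Beatrice’ can only ever be a possible match. A real matcher would also need to handle organisations, titles and entries that are not in inverted order.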

The decision we need to make is whether to provide links to these individual name records that we will have, or only use them as a source of data. It seems valuable to enable end users to see these names as a group, but it is another thing to risk integrating information from them all into one name record. There is no perfect answer to this, but it does seem important to clearly indicate the level of uncertainty. So many names that we have don’t have life dates, or have variations in structure. What we are looking to achieve is a clear provenance, giving end users the best understanding of what they are seeing.

What about name records that have been created by our contributors? The name records we create ourselves from catalogue descriptions will generally be no more than the name, dates, and biographical history. But, going forwards, we will want to work with much more detailed name records.

For Exploring British Design we created rich name records with an entity-relationship structure (essentially using the EAC-CPF structure and working in RDF), to demonstrate the power of connecting entities. For this purpose, we partially hand-crafted the name records, as well as carrying out some very complex processing to create various connections.

screenshot of part of the timeline for Robin Day — Part of the timeline for Robin Day

The example above shows events from the Robin Day timeline, with linked connections to related organisations. If we ingest EAC-CPF records we might get timelines like this.

Name records may also include relationships. The Borthwick Institute has good examples of name records with plenty of rich relationship information. e.g. Charles Lindley Wood, Viscount Halifax.

screenshot of part of the Viscount Wood record showing relationships to other people — An excerpt from a Borthwick entry for Charles Lindley Wood

If we took this record into the Archives Hub it might seem to make sense for it to become the main person record for Wood. But that would involve a process of making choices, preferencing one name record over another. Possible, but tricky to do in an automated way. Another record office might also have a splendid example of a name entry for this person, with some different data. Furthermore, this record has links to the Borthwick catalogue. We would potentially have to remove these links.

It would be very challenging to create one record from several source EAC-CPF records for the same person – to blend timelines, or sort out relationships listed in different records, bearing in mind that it needs to be done in an automated way, keeping version control and dealing with revisions and new data coming in that might add to the name record. How could we compare and blend two lists of relationships? Or two chronologies? We’d probably end up having to keep them all, and then potentially have similar but different relationships and chronologies, giving a slightly confused user experience.

If we do ingest records like the one above, we will have to figure out how these more detailed records will relate to what we have already created. If, as planned, we have one generic name record for a person, it makes the job easier, as we won’t be looking to make any one EAC-CPF record into the main name record, we will simply link to it from the main record. Bear in mind, our main record is intended to be a domain-neutral entry – linking to other sources beyond archives. EAC-CPF records might do this to some extent, but they are unlikely to link to the Jisc Library Hub, and probably won’t link to Wikidata, or other external sources. They are far more likely to provide internal links to the archive catalogue they relate to.

Arguably, it might be easier to forget about creating name records ourselves (from the catalogue entries) and just work with name records that have been created by our contributors (which are likely to be well-structured and include life dates). But if we do that, the pot of names will grow slowly, as only a small proportion of repositories create name records. We can’t realistically give the end user a few thousand name records covering maybe 1-2% of our names – they might search for ‘Winston Churchill’ as a name, and find that we don’t have him! It would not remove the problem of name matching, and it would make the whole idea of reaching out beyond the archive domain, by linking into other resources using our names as the hook, rather ineffectual.

Therefore, we propose to keep the separate name records in our system. We also propose to create a ‘generic record’, which is what would be prominent in the Archives Hub display. We would then have the potential to link the records together, to blend them, to try some text mining and analysis techniques. It gives us options. It would not be sensible to make those decisions now. It is better to lay the groundwork that enables us to be flexible. This approach allows us to link to an individual name record where we don’t feel able to confirm a ‘same as’ relationship. It presents the option to the end user – here is a name – we think this is the same person, so we’ve provided a link.
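To make the shape of this proposal a little more concrete, here is a minimal sketch in Python. It is an illustration only, not the structure used in our system: the class names, field names and record identifiers are all hypothetical. The point it shows is that the generic record only holds links, either confirmed or tentative, and the individual name records stay exactly as they were supplied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class NameRecord:
    """One individual name record, kept exactly as supplied by its source."""
    record_id: str                   # hypothetical identifier scheme
    source: str                      # e.g. a contributor or the Hub itself
    name: str
    biographical_history: str = ""


@dataclass
class GenericRecord:
    """The generic record: it links to name records, it does not merge them."""
    name: str
    same_as: List[NameRecord] = field(default_factory=list)
    possibly_same_as: List[NameRecord] = field(default_factory=list)

    def link(self, record: NameRecord, confirmed: bool) -> None:
        # Confirmed and tentative links are stored separately so that the
        # level of uncertainty can be shown to the end user.
        target = self.same_as if confirmed else self.possibly_same_as
        if record not in target:
            target.append(record)

    def biographical_histories(self) -> List[str]:
        # Only confirmed records contribute, and each history stays separate.
        return [r.biographical_history for r in self.same_as
                if r.biographical_history]


generic = GenericRecord("Wood, Charles Lindley, Viscount Halifax")
detailed = NameRecord("contrib-0001", "contributor EAC-CPF record",
                      "Wood, Charles Lindley, Viscount Halifax",
                      "Placeholder biographical text.")
brief = NameRecord("hub-0001", "Archives Hub index term", "Wood, Charles")

generic.link(detailed, confirmed=True)
generic.link(brief, confirmed=False)
```

Because nothing is blended, a later decision to merge, text mine or simply display links can be made without having lost any of the source data.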

The end user experience needs to make sense and not mislead or provide false information. Links to brief name records could seem confusing, but, as I have said, trying to bring together in one record all the information from several name records, with their biographies, relationships, aliases, events, related resources, is likely to be a nightmare. In the end, it will take a good deal more testing and working with researchers to work out what is best.

Archives Hub Names Project (2): Biographical History

June 17, 2020 / Jane Stevenson

It is a somewhat vexed question how to treat biographical and administrative history (in this post I’ll focus on biographical history). This is an ISAD(G) field and an EAD field. ISAD defines it as providing “an administrative history of, or biographical details on, the creator (or creators) of the unit of description to place the material in context and make it better understood”. It advises for personal names to include “full names and titles, dates of birth and death, place of birth, successive places of domicile, activities, occupation or offices, original and any other names, significant accomplishments, and place of death”.

On the Archives Hub we have a whole range of biographical histories – from very short to very comprehensive. I have had conversations with archivists who believe that ‘putting the collection in context’ means giving information that is particularly relevant for that archive rather than giving a general history. Conversely, many biographical history entries do give a very full biography, even if the collection only relates to one aspect of a person’s life and work. They may also include information that is not readily available elsewhere, as it may have been discovered as part of the cataloguing process.

The question is, if we create a generic name record for a person, how do we treat this biographical information? There are a number of alternatives.

(1) Add all biographical history entries to the record

If you look at a SNAC example: https://snaccooperative.org/view/54801840 you can see that this is the approach. It has merits – all of the biographical information is brought together. But it can mean a great deal of repetition, and the ordering of the entries can seem rather illogical, with short entries first and then longer comprehensive entries at the end.

Whilst most biographical history entries are pretty good, it also means a few not very helpful entries may be included, and may be top of the order. In addition, putting all the entries in together doesn’t always seem to make much sense. In the example below there are just three short entries for a major figure in women’s liberation. They are automatically brought in from the catalogue entry for individual collections. Sometimes the biographical entries in individual catalogues suffer from system migration and various data processing issues that mean you end up with field contents that are not ideal.

Millicent Garrett Fawcett biographical histories in SNAC

The question is whether this approach provides a useful and effective end user experience.

Where there is one entry for a creator, with one biographical history, there is no issue other than whether the entry makes sense as an overall biographical entry for that person or organisation. But we have to consider the common situation where there will be a dozen or more entries. Even if we start with one entry, others may be added over time. Generally, there will be repetition and information gaps, but in many cases this approach will provide a good deal of relevant information.

(2) Keep the biographical history entries with the individual name records

At the moment our plan is to create individual name records for each person, as well as a generic master record. We haven’t yet worked out the way this might be presented to the end user. But we could keep the biographical histories with the individual entries we have for names. The generic record would link to these entries, and to the information they contain. This makes sense, as it keeps the biographical histories separate, and within the entries they were written to accompany. Repetition is not an issue as it is clear why that might happen. But the end user has to go to each entry in turn to read this information.

(3) Keep biographical history entries with individual name records, but enable the information to be viewed in the generic master record

We have been thinking about giving the end user the option to ‘click to see all biographical histories created for this person’. That would help with expectations. Simply presenting a page with a dozen similar biographical histories is likely to confuse people, but enabling them to make a decision to view entries gives us more opportunity for explanation – the link could include a brief explanatory note.

(4) Select one biographical history to be in the generic record

We have discussed this idea, but it is really a non-starter. How do you select one entry? What would the criteria be if it is automated? The longest?

(5) Link to a generic biography if available

This is the idea of drawing in the Wikipedia entry for that person or organisation, or potentially using another source. There is a certain risk to pulling in data from an external source as the ‘definitive’ biographical information, but the source would always be cited, and it does start to move towards the principle of bringing different sources of information together. If we want to create a more generic resource, we are going to have to take risks with using external sources.

I would be interested in any comments on this.

Names Project (1): Creation of name records

June 9, 2020 / Jane Stevenson

The Archives Hub Names Project

The Archives Hub team and Knowledge Integration, our system suppliers, are embarking upon a short four-month project to start to lay the groundwork, define the challenges and test the approaches to presenting end users with a name-based means to search, and connect to a broad range of resources related to people and organisations. I will be blogging about the project as we go along.

Our key aims in the long-term are:

To provide the end user with a way to search for people and organisations and find a range of material relevant to their research
To enable connections to be made between resources within and external to Jisc, using names as the main focus
To bring archive collections together in an intellectual sense and provide different contexts to collections by creating networks across our data

This first project will not create an end-user interface, but will concentrate on processing, matching names and linking resources. We want to explore how this can be administered in order to be sustainable over time. In the end, the most challenging part of working with the names we have is identification, disambiguation and matching. The aim is to explore the space and start to formulate a longer-term plan for the full implementation of names as entities within the Archives Hub.

Creation of name records from EAD description records

NB: This blog often refers to personal names for convenience, but names include personal, family and corporate entities.

EAD descriptions include personal, family and corporate names. These ‘entities’ may be listed as archival creators and also associated with the collection as index terms. Archival creators may optionally be given biographical or administrative histories. The relationship of the collection with names in the index is not made explicit in the description (in a structural way), though it may often be gleaned from the descriptive information within the EAD record.

Creating name records for all names

We are proposing to begin by creating name records for all of these entries, no matter how thin the information for each entry may be.

Here is a random selection of names that are included in Archives Hub records:

Grote, Arthur
Gaskell, Arthur
Wilson, John
Thatcher, J. Wells, Barrister at Law
Barron, Margaret
Stanley, Catherine, 1792-1862
Roe, Alfred Charles
Rowlatt, Mary, b 1908
Milligan, Spike, 1918-2002
Fawcett, Margaret, d. 1987
Rolfe, Alan, 1908-2002 actor
Mayers, Frederick J (fl 1896-1937 : designer : Kidderminster, England)
Joan

Only a percentage of names have life dates. Some have birth or death dates, some floruit dates.

Of course, the life dates, occupations and outputs of many people are not known, or may be very difficult to find. Also, life dates will change when a birth date is joined by a death date. Epithets may also change over time (and they are not controlled vocabulary anyway).
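As a rough illustration of what pulling dates out of these strings involves, here is a minimal Python sketch. The patterns are assumptions based only on the forms in the sample list above; real Hub data will certainly contain forms they miss, and this is not the processing Knowledge Integration are building. Anything after the date, such as an epithet, is simply discarded here.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ParsedName(NamedTuple):
    name: str                            # the string before the date part
    birth: Optional[int]
    death: Optional[int]
    floruit: Optional[Tuple[int, int]]   # (start, end) for 'fl' dates


# Hypothetical patterns, tried in this order.
_FLORUIT = re.compile(r"\bfl\.?\s*(\d{4})\s*-\s*(\d{4})")
_RANGE = re.compile(r"\b(\d{4})\s*-\s*(\d{4})?")   # '1932-' leaves death open
_BORN = re.compile(r"\bb\.?\s*(\d{4})")
_DIED = re.compile(r"\bd\.?\s*(\d{4})")


def parse_name(raw: str) -> ParsedName:
    birth = death = floruit = None
    match = None
    for pattern in (_FLORUIT, _RANGE, _BORN, _DIED):
        match = pattern.search(raw)
        if match:
            break
    if match is None:
        # No recognisable date: keep the whole string as the name.
        return ParsedName(raw.strip(), None, None, None)
    if pattern is _FLORUIT:
        floruit = (int(match.group(1)), int(match.group(2)))
    elif pattern is _RANGE:
        birth = int(match.group(1))
        death = int(match.group(2)) if match.group(2) else None
    elif pattern is _BORN:
        birth = int(match.group(1))
    else:
        death = int(match.group(1))
    # Drop the separators that sat between the name and the date.
    name = raw[: match.start()].rstrip(" ,(")
    return ParsedName(name, birth, death, floruit)
```

Keeping birth, death and floruit as separate optional fields means a record can be updated later, for instance when a birth date is joined by a death date, without reparsing the display string.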

In addition, we have inverted and non-inverted names on the Archives Hub, names with punctuation in different places, names with and without brackets, etc. These issues create identification challenges.

Even taking names as creators and names as index terms within one single description, the match is often not exact:

Millicent Garrett Fawcett (creator name)
Fawcett, Dame Millicent. (1847-1929) nee Garrett, Feminist and Suffragist (index term)

Lingard, Joan (creator name)
Lingard, Joan Amelia, 1932- (index term)
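One simple way to see past differences of order and punctuation like these is to compare the words in each form rather than the strings themselves. The Python sketch below is an assumption for illustration, not our matching method: it reduces each name to a set of lower-case words and asks whether the shorter form is wholly contained in the longer one. It is only a filter for finding candidates, since it ignores dates, initials and word order.

```python
import re
import unicodedata


def name_tokens(raw: str) -> frozenset:
    """Lower-case alphabetic words, ignoring order, punctuation and digits."""
    # Fold accented characters to their base letters (so 'née' becomes 'nee').
    folded = unicodedata.normalize("NFKD", raw)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return frozenset(re.findall(r"[a-z]+", folded.lower()))


def could_be_same(a: str, b: str) -> bool:
    """True when every word of one form also appears in the other."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    if not tokens_a or not tokens_b:
        return False
    return tokens_a <= tokens_b or tokens_b <= tokens_a
```

Both pairs above pass this test, because the creator name is contained in the fuller index term. A pair such as ‘Wilson, John’ and ‘Grote, Arthur’ does not.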

The archival descriptions on the Archives Hub vary a great deal in terms of the structure, and different repositories have different approaches to cataloguing. Some do not add name of creator, some do not add index terms, some add them intermittently, and often the same name is added differently for different collections within the same repository. In many cases the cataloguer does not add life dates, even when they are known, or they are added to the name as creator but not in the index list, or vice versa. This sounds like a criticism, but the reality is that there are many reasons why catalogues have ended up as they are.

There has not been a strong tradition amongst archivists of adding names as unique identifiable entities, but of course, it has only been in the last few decades that we have had the potential, which is becoming increasingly sophisticated, of linking data through entity relationships, and creating so much more than stand-alone catalogue records. Many archivists still think primarily in terms of human readable descriptions. Some people feel that with the advent of Google and sophisticated text analysis, there is no need to add names in this structured way, and there is no need for index terms at all. But in reality search engines generally recommend structured data, and they are using it in sophisticated ways. Schema.org is for structured data on the web, an initiative started by Google, Microsoft, Yahoo, and Yandex. Explicit markup helps search engines understand content and it potentially helps with search engine optimisation (ensuring your content surfaces on search engines). Also, if we want to move down the Linked Data road, even if we are not thinking in terms of creating strict RDF Linked Data, we need to identify entities and provide unique identifiers for them (URLs on the web). Going back to Tim Berners-Lee’s seminal Linked Data article from 2006:

“The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.”

So, including names explicitly provides huge potential (as well as subjects, places and other entities) and it has become more important, not less important. Indeed, I would go so far as to say that structured data is more important than standards compliant data, especially as, in my experience, standards are often not strictly adhered to, and also, they need constant updating in order to be relevant and useful.
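To give a flavour of what explicit, structured name data can look like, here is a small Python sketch that builds a Schema.org ‘Person’ description as JSON-LD. The properties used (name, url, birthDate, deathDate, sameAs) are standard Schema.org terms, but the function and the example.org addresses are placeholders I have made up for illustration; this is not a description of what the Archives Hub currently outputs.

```python
import json


def person_jsonld(name, url, birth=None, death=None, same_as=()):
    """Build a minimal Schema.org Person description as a Python dict."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "url": url,                      # the unique identifier for the entity
    }
    if birth is not None:
        doc["birthDate"] = str(birth)    # year only, where that is all we know
    if death is not None:
        doc["deathDate"] = str(death)
    if same_as:
        doc["sameAs"] = list(same_as)    # links to the same entity elsewhere
    return doc


record = person_jsonld(
    "Beatrice Webb",
    "https://example.org/names/webb-beatrice",            # placeholder URL
    birth=1858,
    death=1943,
    same_as=["https://example.org/authority/webb-beatrice"],  # placeholder
)
print(json.dumps(record, indent=2))
```

The important parts are the identifier and the sameAs links: they are what turn a string of characters into an entity that other data can connect to.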

The idea with our project is that we start with name records for every entity – a pot of data we can work with. We may create Encoded Archival Context (Corporate Bodies, Persons and Families), otherwise known as EAC-CPF…but that is not important at this stage. EAC is important for data ingest and output, and we intend to use it for that purpose, so it will come into the picture at some point.

The power of the anonymous

There are benefits in creating name records for people who are essentially anonymous or not easily identifiable. Firstly, these records have unknown potential; they may become key to making a particular connection at some point, bearing in mind that the Archives Hub continually takes new records in. Secondly, we can use these records to help with identification, and the matching work that we undertake may help to put more flesh on the bones of a basic name record. If we have ‘Grote, Arthur’ and then we come across ‘Grote, Arthur, 1840-1912’, we can potentially use this information and create a match. Of course, the whole business of inference is a tricky thing – you need more than a matching surname and forename to create a ‘same as’ relationship (I won’t get into that now). But the point is that a seemingly ‘orphan’ name may turn out to have utility. It may, indeed, provide the key to unlocking our understanding of particular events – the relationships and connections between people and other entities are what enable us to understand more about our history.

Components of a name record

So, all names will have name records, some with just a name, some with life dates of different sorts, some with biographical or administrative histories. The exception to this may be names that are not identifiable as people or organisations. It is potentially possible to discover the type of entity from the context, but that is a whole separate piece of work. Hundreds of names on the Archives Hub are simply labelled as ‘creator’ or ‘name’. This is down to historical circumstance – partly errors the Archives Hub made in the past (our old cataloguing tool entered creators simply as EAD ‘origination’), partly the other systems we ingest data from. At the moment, for example, we are taking in descriptions from Axiell’s AdLib system, but the system does not mark up creator names as people or organisations (unless the cataloguer explicitly adds this), so we cannot get that information. This is probably a reflection of a time when semantically structured data was simply less important. If a human reads ‘Elizabeth Gaskell’ in a catalogue entry they are likely to understand what that string means; if undertaking large-scale automated processing, it is just a string of characters, unless it includes semantic information.

From the name records that we create, we intend to develop and run algorithms to match names. In many cases, we should be able to draw several names together, with a ‘same-as’ relationship. Some may be more doubtful, others more certain. I will talk about that as we get into the work.

At the moment, we have some ideas about how we will work with these individual records in terms of the workflow and the end user experience, but we have not made any final decisions, and we think that what is most important at this stage is the creation and experimentation with algorithms to see what we can get.

Master name records

We intend to create master records for people and organisations. The principle is to see these master records not as something within the archives domain, but as stand-alone records about a person or organisation that enable a range of resources to be drawn together.

So, we might have several name records for one person:

Example of master record, with various related information included:
Webb, Martha Beatrice, 1858-1943, social reformer and historian

Examples of additional name records that should link to the master record:
Webb, Beatrice, 1858-1943 (good match)
Webb, Martha Beatrice, 1858-1943, economist and reformer (good match)
Webb, Martha Beatrice, nee Potter, 1858-1943 (good match)
Webb, M.B. b. 1858 (possible match)
but…
Potter, Martha Beatrice, b 1858
…might well not be a match, in which case it would stand separately, and the archive connected to it would not benefit from the links being made.
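The Webb example can be turned into a small sketch of how ‘good’, ‘possible’ and ‘no’ matches might be told apart. The rules below are assumptions chosen to reproduce the examples above, not the algorithms we will end up with: the surname must agree, known dates must not conflict, and a match is only ‘good’ when full forenames and both life dates agree.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NameRecord:
    surname: str
    forenames: Tuple[str, ...]
    birth: Optional[int] = None
    death: Optional[int] = None


def _dates_conflict(a: Optional[int], b: Optional[int]) -> bool:
    # A missing date never conflicts; two different known dates do.
    return a is not None and b is not None and a != b


def _forename_fit(master: Tuple[str, ...], other: Tuple[str, ...]) -> str:
    """Return 'full', 'initials' or 'none' for how the forenames line up."""
    if not other:
        return "none"
    known = [f.lower() for f in master]
    fit = "full"
    for name in (f.lower() for f in other):
        if name in known:
            continue
        if len(name) == 1 and any(k.startswith(name) for k in known):
            fit = "initials"
            continue
        return "none"
    return fit


def classify(master: NameRecord, other: NameRecord) -> str:
    if master.surname.lower() != other.surname.lower():
        return "no match"
    if (_dates_conflict(master.birth, other.birth)
            or _dates_conflict(master.death, other.death)):
        return "no match"
    fit = _forename_fit(master.forenames, other.forenames)
    if fit == "none":
        return "no match"
    both_dates_agree = (other.birth is not None and other.birth == master.birth
                        and other.death is not None and other.death == master.death)
    if fit == "full" and both_dates_agree:
        return "good match"
    return "possible match"


master = NameRecord("Webb", ("Martha", "Beatrice"), 1858, 1943)
print(classify(master, NameRecord("Webb", ("Beatrice",), 1858, 1943)))
print(classify(master, NameRecord("Webb", ("M", "B"), 1858)))
print(classify(master, NameRecord("Potter", ("Martha", "Beatrice"), 1858)))
```

Under these rules ‘Potter, Martha Beatrice, b 1858’ stands separately, exactly as described above, because nothing in the name record itself tells us that Potter is a maiden name. Catching that would need extra evidence, such as the ‘nee Potter’ in one of the other forms.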

We have discussed the pros and cons of creating master records for all names. It makes sense to bring together all of the Beatrice Webb names into one master record – there is plenty that can be said about that individual; but does it make sense to have a master record for single orphaned names with no life dates and nothing (as yet) more to say about that individual? That is a question we have yet to answer.

Diagram showing link between archive, name records and master records. The archive is described through an EAD description held on our system (the CIIM). We take all the names from this to create a huge store of individual names. From this, we aim to create and update ‘definitive’ name records.

The principle is to have name records that enable us to create links to the Archives Hub entries and also to other Jisc services and resources beyond that – resources outside of the archives domain. Many of these resources may also help us with our own identification and matching processes. It is important to benefit from the work that has already been done in this area.

We are looking at various name resources and assessing where our priorities will be. This is a fairly short project, and we won’t have time to look at more than a handful of options. But we are currently thinking in terms of VIAF, ORCID and Wikidata. More on that to follow.

Personally, I’ve been thinking about working with names for several years. We have been asked about it quite a bit. But the challenge is so big and nebulous in many ways. It has not been feasible to embark upon this kind of work in the past, as our system has not supported the kind of systematic processing that is required. We are also able to benefit from the expertise K-Int can bring to data processing. It is one thing doing this as a stand-alone project; it is quite another to think about a live service, long term sustainability, version control and revisions, ingest from different systems, etc. And also, to break it down into logical phases of work. It is exciting, but it is going to involve a great deal of hard work and hard thinking.

Interconnected archives: cataloguing the Rossetti family letters at Leeds University Special Collections

June 1, 2020 / Jane Ronson

Archives Hub feature for June 2020

Special Collections holds over 700 letters written by members of the Rossetti family. The collection includes letters from nearly all members of this storied family, with the bulk written by Dante Gabriel (https://en.wikipedia.org/wiki/Dante_Gabriel_Rossetti) and William Michael (https://en.wikipedia.org/wiki/William_Michael_Rossetti), and a significant tranche from Christina Rossetti (https://en.wikipedia.org/wiki/Christina_Rossetti). The letters are only a fraction of the full Rossetti family correspondence, which can be found in libraries and archives across the world.

The Rossetti Family by Lewis Carroll, albumen print, 7 October 1863 (Christina Georgina Rossetti, Dante Gabriel Rossetti, Frances Mary Lavinia Rossetti (née Polidori) and William Michael Rossetti). NPG P56. © National Portrait Gallery, London. Creative Commons 3.0 licence (https://creativecommons.org/licenses/by-nc-nd/3.0/).

Many of the letters have been in Special Collections since the 1930s but were not catalogued in any detail. Some were represented by very brief index records, which did not convey the scope or context of the full collection; others were entirely uncatalogued. Although much of the Dante Gabriel and Christina Rossetti correspondence had been published in their respective Collected Letters (The Correspondence of Dante Gabriel Rossetti, ed. William E. Fredeman, 2015 and The Letters of Christina Rossetti, https://rotunda.upress.virginia.edu/crossetti/), the letters themselves remained inaccessible for research.

A 2019 project funded by the Strachey Trust enabled us to repackage and create item-level records for each letter in the collection. Catalogue records included basic ISAD(G) metadata, a brief synopsis of the letter’s contents, links to authority files for both sender and addressee and a reference for the published version of the letter, where one exists. The finished catalogue now describes the full extent of the Rossetti Collection at Leeds, ensuring that material is identifiable, accessible for research and secure in our holdings.

Cataloguing gave us fascinating insight into the lives of the Rossettis. The largest group of letters in the collection were written by Dante Gabriel Rossetti and cover both the beginning and end of his career. Early letters reveal a humorous correspondent. One, written from a deluged Kent, describes him sketching ‘with my umbrella tied over my head to my buttonhole – a position which you will oblige me by remembering, I expressly desired should be selected for my statue. (N.B. Trousers turned up.)’

These are in direct contrast to later letters to Theodore Watts-Dunton (https://en.wikipedia.org/wiki/Theodore_Watts-Dunton) who acted as Rossetti’s advisor. The volume and regularity of Rossetti’s letters to Watts-Dunton, their paranoia and requests for advice show Rossetti’s great dependence on his close friends in later years.

The collection includes 30 letters written by Christina Rossetti. Project work uncovered a previously unknown letter, written to her sister-in-law, Lucy Madox Brown Rossetti (https://en.wikipedia.org/wiki/Lucy_Madox_Brown). This brief letter gives Rossetti’s assessment of an unnamed poem: ‘The fact is I think it diabolical. Its degree of serene skill and finesse intensifies to me its horror…’

150 letters by William Michael Rossetti were also catalogued during this project, the majority of which are unpublished. His letters include a long series addressed to John Lucas Tupper (https://sculpture.gla.ac.uk/view/person.php?id=msib7_1220373335), a close associate and contributor to ‘The Germ’, the journal of the Pre-Raphaelite Brotherhood. The letters to Tupper, whose writing and career he promoted, highlight professional opportunities and networks of editors and journals available during this period. They give an interesting glimpse of the kind of life afforded to a literary Victorian gentleman employed by the Civil Service. During certain periods of his life, Rossetti travelled abroad, visiting the continent and even Australia. Having been robbed on one occasion in Italy, he discusses with Tupper, who travelled with him in 1869, the advisability of carrying a pistol. Other letters cover wide-ranging topics, from discussions of Ruskin and Browning to the politics of the day, spiritualism, and lycanthropy.

Alongside revealing individual letters, the catalogue records now allow researchers to explore Rossetti family networks in some detail. A good example of this is correspondence relating to the artist Frederic Shields (https://en.wikipedia.org/wiki/Frederic_Shields), who was a regular subject of Dante Gabriel Rossetti’s letters to Watts-Dunton. Later letters from William Michael Rossetti to Shields describe the hours before his brother’s death with great tenderness, passing on a last message to Shields. Subsequent letters from Christina Rossetti are concerned with Shields’ work on a memorial for Dante Gabriel Rossetti. These intertwined relationships would not be easily discoverable from published letters alone but can be usefully explored through this catalogue.

Cataloguing also gave us the chance to research the provenance of groups of letters in the collection. This revealed connections between material previously considered separate: the Swinburne manuscript collection (https://explore.library.leeds.ac.uk/special-collections-explore/8607) and substantial correspondence relating to Swinburne and Watts-Dunton (including Rossetti correspondence) were all acquired from the same source, Watts-Dunton’s estate. These letters and manuscripts had historically been treated as distinct collections, and the connections between them were not clear from catalogue records.

Image taken from one of the Rossetti family letters.

Cataloguing work on this small collection has emphasised the many levels of interconnectedness in which archives exist. Letters can show relationships between individuals, collections of letters show their wider networks, and collections themselves speak to other material both within a repository and in many other locations across the world.

The Rossetti family letters collection is now available for research (https://explore.library.leeds.ac.uk/special-collections-explore/7436). This project would not have been possible without the support of the Strachey Trust, and Special Collections is grateful to it for its generosity in funding work on this significant collection.

Sarah Prescott
Literary Archivist
University of Leeds Special Collections

Browse all University of Leeds Special Collections descriptions on the Archives Hub

Explore more collections relating to the Rossetti family on the Archives Hub

Previous features on University of Leeds Special Collections:

“Gather them in” – the musical treasures of W.T. Freemantle

Sentimental Journey: a focus on travel in the archives

Recipes through the ages

World War One

All images copyright University of Leeds Special Collections and National Portrait Gallery, London. Reproduced with the kind permission of the copyright holders.