Reinventing the wheel: the new Hub website

Image: promotional postcard

On 1st April 2010 the Archives Hub website changed. It was not just about a new look and feel, but a whole new site. The Hub team spent several months planning the new architecture, navigation and content. Most of the content was rewritten, and this gave us a great opportunity to think about a coherent approach where we could be consistent in our tone and terminology and really think about what each page should say. We wanted the site to be intuitive and for each page to be useful and attractive, without giving an overwhelming amount of information.

We decided to introduce plenty of images, to lift the site visually, and we wanted to keep plenty of whitespace, to make it easy on the eye. In addition, the website designers, True North, helped us to think about our identity and the importance of presenting the Archives Hub in a way that conveys confidence, self-belief, professionalism and warmth.

The Archives Hub has getting on for 200 contributors now, which is quite an achievement, and we are very appreciative of the effort that our contributors put into creating descriptions for the Hub. We want to continue to develop the site with a focus on archivists as well as on researchers, as we see both groups of users as vital to us, and in fact they often overlap. We hope that our ‘Archivists’ section is helpful and informative for contributors and other information professionals interested in what we do and in issues around online data and interoperability.

Our Features section takes over from the old ‘Collections of the Month’ idea, bringing the same message about the breadth and depth of Hub content and enabling us to showcase contributors and wonderful collections.

Our ‘Researchers’ section is going to be expanded, although we are keen to keep it focussed and easy to scan and digest. We are looking at ways that we can continue to support researchers in using the Hub to the greatest advantage. Of course, the main way is to provide an effective search interface and to continue to expand the content.  And this brings us on to the search – as well as a whole new information site, we have upgraded our software. We are now using ‘Cheshire 3’, which enables us to provide functionality that we could not provide before. We will be talking more about that in subsequent blogs. The new software is running on all-new hardware, so in fact we really have fundamentally changed the whole Archives Hub, but we hope that we have retained what is good about the site and about our service.

Linked Data: one thing leads to another

I attended another very successful Linked Data meetup in London on 24 February. This was the second meetup, and the buzz being created by Linked Data has clearly been generating a great deal of interest, as around 200 people signed up to attend.

All of the speakers were excellent, and they struck that very helpful balance between being expert and informative whilst getting their points across clearly and in a way that non-technical people could understand.

Tom Heath (Talis) took us back to the basic principles of Linked Data. It is about taking statements and expressing them in a way that meshes more closely with the architecture of the Web. This is done by assigning identifiers to things. A statement such as ‘Jane Stevenson works at Mimas’ can be broken down and each part can be given a URI identifier. I have a URI (http://www.archiveshub.ac.uk/janefoaf.rdf) and Mimas has a URI (http://www.mimas.ac.uk/). The predicate that describes the relationship ‘worksAt’ must also have a URI. By creating statements like this, we can start linking datasets together.
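To make this concrete, here is a minimal sketch of that statement in Python using the rdflib library. The ‘worksAt’ predicate namespace is a hypothetical example; in practice you would reuse a term from a published vocabulary such as FOAF or schema.org.

```python
# A minimal sketch of the 'Jane Stevenson works at Mimas' statement, using rdflib.
# The vocabulary namespace below is hypothetical -- in practice you would reuse an
# existing vocabulary term (for example a FOAF or schema.org property).
from rdflib import Graph, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary

g = Graph()
jane = URIRef("http://www.archiveshub.ac.uk/janefoaf.rdf")
mimas = URIRef("http://www.mimas.ac.uk/")

# subject - predicate - object: every part of the statement gets its own URI
g.add((jane, EX.worksAt, mimas))

print(g.serialize(format="turtle"))
```

Serialising the graph as Turtle shows the three URIs joined into a single machine-readable statement that another dataset could link to.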

Tom talked about how this Linked Data way of thinking challenges the existing metaphors that drive the Web, and this was emphasised throughout the sessions. The document metaphor is everywhere – we use it all the time when talking about the Web; we speak about the desktop, about our files, and about pages as documents. It is a bit like thinking about the Web as a library, but is this a useful metaphor? Maybe we should be moving towards the idea of an exploratory space, where we can reach out, touch and interact with things. Linked Data is not about looking for specific documents. If I take the Archives Hub as an example, Linked Data is not so much concerned with the fact that there is a page (document) about the Agatha Christie archive; what it is concerned with is the things/concepts within that page. Agatha Christie is one of the concepts, but there are many others – other people, places, subjects. You could say the description is about many things that are linked together in the text (in a way humans can understand), but it is presented as a page about Agatha Christie. This traditional way of thinking hides references within documents; they are not ‘first class citizens of the web’ in themselves. Of course, a researcher may be wanting information about Agatha Christie archives, and then this description will be very relevant. But they may be looking for information about other concepts within the page. If ‘Torquay’ and ‘novelist’ and ‘nursing’ and ‘Poirot’ and all the other concepts were brought to the fore as things in their own right, then the data could really be enriched. With Linked Data you can link out to other data about the same concepts and bring it all together.

Tom spoke very eloquently about how you can describe any aspect you like of any thing you like by giving identifiers to things – it means you can interact with them directly. If a researcher wants to know about the entity of Agatha Christie, the Linked Data web would allow them to gather information about that topic from many different sources; if concepts relating to her are linked in a structured way, then the researcher can undertake a voyage of discovery around their topic, utilising the power that machines have to link structured data, rather than doing all the linking up manually. So, it is not a case of gathering ‘documents’ about a subject, but of gathering information about a subject. However, if you have information on the source of the data that you gather (the provenance), then you can go to the source as well. Linked Data does not mean documents are unimportant, but it means that they are one of the things on the Web along with everything else.

Having a well-known data provider such as the BBC involved in Linked Data provides a great example of what can be done with the sort of information that we all use and understand. The BBC Wildlife Finder is about concepts and entities in the natural world. People may want to know about specific BBC programmes, but they are more likely to want to know about lions, or tigers, or habitats, or breeding, or other specific topics covered in the programmes. The BBC are enabling people to explore the natural world through using Linked Data. What underlies this is the importance of having URIs for all concepts. If you have these, then you are free to combine them as you wish. All resources, therefore, have HTTP URIs. If you want to talk about sounds that lions make, or just the programmes about lions, or just one aspect of a lion’s behaviour, then you need to make sure each of these concepts has an identifier.

Wildlife Finder has almost no data itself; it comes from elsewhere. They pull content onto the pages, whether it is data from the BBC or from elsewhere. DBPedia (Wikipedia output in RDF) is particularly important to the BBC as a source of information. The BBC actually go to Wikipedia and edit the text there, something that benefits Wikipedia and other users of Wikipedia. There is no point replicating data that is already available. DBPedia provides a big controlled vocabulary – you can use the URI from Wikipedia to clarify what you are talking about, and it provides a way to link stuff together.

Tom Scott from the BBC told us that the BBC have only just released all the raw data as RDF. If you go to the URL it content negotiates to give you what you want (though he pointed out that it is not quite perfect yet). Tom showed us the RDF data for an Eastern Gorilla, providing all of the data about concepts that go with Eastern Gorillas in a structured form, including links to programmes and other sources of information.
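As a rough illustration of what content negotiation looks like from the client side, here is a minimal Python sketch using the requests library. The resource URL is only indicative rather than an exact BBC address; the Accept header asks the server for RDF/XML instead of the default HTML page.

```python
# A minimal sketch of HTTP content negotiation with the requests library.
# The resource URL below is illustrative -- substitute a real Linked Data URI.
import requests

resource = "http://www.bbc.co.uk/nature/life/Eastern_gorilla"  # hypothetical example

# Ask the server for RDF/XML rather than the default HTML representation
response = requests.get(resource, headers={"Accept": "application/rdf+xml"})

print(response.status_code)
print(response.headers.get("Content-Type"))
print(response.text[:500])  # first few hundred characters of whatever came back
```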

Having two heavyweights such as the BBC and the UK Government involved in Linked Data certainly helps give it momentum. The Government appears to have understood that the potential for providing data as open Linked Data is tremendous, in terms of commercial exploitation, social capital and improving public service delivery. A number of times during the sessions the importance of doing things in a ‘web-centric’ way was emphasised. John Sheridan from The National Archives talked about data.gov.uk and the importance of having ‘data you can click on’. Fundamentally, Linked Data standards enable the publication of data in a very distributed way. People can gather the data in ways that are useful to them. For example, with data about schools, what is most useful is likely to be a combination of data, but rather than trying to combine the data internally before publishing it, the Government want all the data providers to publish their data and then others can combine it to suit their own needs – you don’t then have to second guess what those needs are.

Jeni Tennison, from data.gov.uk, talked about the necessity of working out core design patterns to allow Linked Data to be published fast and relatively cheaply. I felt that there was a very healthy emphasis on this need to be practical, to show benefits and to help people wanting to publish RDF. You can’t expect people to just start working with RDF and SPARQL (the query language for RDF). You have to make sure it is easy to query and process, which means creating nice friendly APIs for them to use.
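To give a flavour of what sits beneath those friendlier APIs, here is a minimal sketch of querying a public SPARQL endpoint from Python with the SPARQLWrapper library. The DBpedia endpoint and the Agatha Christie resource are used purely as examples, not as anything to do with data.gov.uk itself.

```python
# A minimal sketch of a SPARQL query from Python, using SPARQLWrapper.
# The endpoint and resource URI are public examples chosen for illustration.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?property ?value
    WHERE { <http://dbpedia.org/resource/Agatha_Christie> ?property ?value }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["property"]["value"], "->", row["value"]["value"])
```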

Jeni talked about laying tracks to start people off, helping people to publish their data in a way that can be consumed easily. She referred to ‘patterns’ for URIs for public sector things, definitions, classes and datasets, and to providing recommendations on how to make URIs persistent. The Government have initial URI sets for areas such as legislation, schools, geographies, etc. She also referred to the importance of versioning: with things having multiple sources and multiple versions over time, it is important to be able to relate back to previous states. They are looking at using named graphs in order to collect together information that has a particular source, which provides a way of getting time-sliced data. Finally, ensuring that provenance is recorded (where something originated, processing, validation, etc.) helps with building trust.
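As a small illustration of the named graphs idea, here is a minimal sketch using the rdflib library, with entirely hypothetical URIs: each named graph collects the triples from one source or snapshot, so the data can later be traced back to its source and time-sliced.

```python
# A minimal sketch of named graphs with rdflib; all URIs below are hypothetical.
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/def/")

ds = Dataset()
# One named graph per source/snapshot -- here, an imaginary February 2010 release
snapshot = ds.graph(URIRef("http://example.org/graph/schools-2010-02"))

school = URIRef("http://example.org/id/school/123")
snapshot.add((school, EX.name, Literal("Example Primary School")))
snapshot.add((school, EX.pupilCount, Literal(240)))

# TriG serialisation keeps the graph name alongside the triples,
# so the source of each statement is preserved
print(ds.serialize(format="trig"))
```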

There was some interesting discussion on responsibilities for minting URIs. Certain domains can be seen to have responsibilities for certain areas – for example, the government minting URIs for government departments and schools, and the World Health Organisation for health-related concepts. But should we trust DBPedia URIs? This is an area where we simply have to make our own judgements. The BBC reuse the DBPedia URI slugs (the string at the end of a URL that identifies the resource) in their own URLs for wildlife, so their URLs have the ‘bbc.co.uk’ part and the DBPedia part for the resource. This helps to create some cohesion across the Web.

There was also discussion about the risks of costs and monopolies – can you rely on data sources long-term? Might they start to charge? Speakers were asked about the use of http URIs – applications should not need to pick them apart in order to work out what they mean. They are opaque identifiers, but they are used by people, so it is useful for them to be usable by people, i.e. readable and understandable. As long as the information is made available in the metadata we can all use it. But we have got to be careful to avoid using URIs that are not persistent – if a page title in Wikipedia changes the URI changes, and if the BBC are using the Wikipedia URI slug then that is a problem. Tom Scott made the point that it is worth choosing persistence over usability.

The development of applications is probably one of the main barriers to uptake of Linked Data. It is very different from, and more challenging than, building applications on top of a known database under your control. With Linked Data, applications need to access multiple datasets.

The session ended by again stressing the importance of thinking differently about data on the Web: starting with things that people care about, not with a document-centric way of thinking. This is what the BBC have done with the Wildlife Finder. People care about lions, about the savannah, about hunting, about life-span, not about specific documents. It is essential to identify the real world things within your website. The modelling is one of the biggest challenges – thinking about what you are talking about and giving those things URIs. A modelled approach means you can start to let machines do the things they are best at, and leave people to do the things that they are best at.

Post by Jane Stevenson (jane.stevenson@manchester.ac.uk)

Image: Linked Data Meetup, February 2010, panel discussion.

It’s all about YOU: Manchester as an Open Data city

There are plans afoot to declare Manchester as an Open Data city. At the Manchester Social Media Cafe last week I attended a presentation by Julian Tait, a founder of the Social Media Cafe, who talked to us about why this would be a good thing.

The Open Data initiative emerged as a result of Future Everything 2009, a celebration of the digital future in art, music and ideas. But what is an Open Data city? It is based upon the principle that data is the life blood of a city; it allows cities to operate, function, develop and respond, to be dynamic and to evolve. Huge datasets are generally held in places that are inaccessible to many of the populace; they are largely hidden. If data is opened up then applications of the data can be hugely expanded and the possibilities become almost limitless.
There are currently moves by central government to open up datasets, to enable us to develop a greater awareness and understanding of society and of our environment. We now have data.gov.uk and we can go there and download data (currently around 2000 datasets) and use the data as we want to. But for data to have meaning to people within a city there has to be something at a city level; at a scale that feels more relevant to people in an everyday context.
Open data may be (should be?) seen as a part of the democratic process. It brings transparency, and helps to hold government to account. There are examples of the move towards transparency – sites such as They Work For You, which allows us all to keep tabs on our MP, and MySociety. In the US, the District of Columbia has an initiative known as Apps for Democracy, providing prizes for innovative apps as a way to engage the community in ‘digital democracy’.
The key here is that if data is thrown open it may be used for very surprising, unpredictable and valuable things: “The first edition of Apps for Democracy yielded 47 web, iPhone and Facebook apps in 30 days – a $2,300,000 value to the city at a cost of $50,000”.
Mapumental is a very new initiative where you can investigate areas of the UK, looking at house price indexes, public transport data, etc. If we have truly open data, we could really build on this idea. We might be able to work out the best places to live if we want a quiet area with certain local amenities, and need to be at work for a certain time but have certain constraints on travel. Defra has a noise map of England, but it is not open information – we can’t combine it with other information.
Julian felt that Open Data will only work if it benefits people in their everyday existence. This may be true on a city scale. On a national scale I think that people have to be more visionary. It may or may not have a discernible impact on everyday living, but it is very likely to facilitate research that will surely benefit us in the long term, be it medically, environmentally or economically.
The Open Data initiative is being sold on the idea of people becoming engaged, empowered and informed. But there are those that have their reservations. What will happen if we open up everything? Will complex issues be simplified? Is there a danger that transparent information will encourage people to draw simplistic inferences, or come to the ‘wrong’ conclusions? Maybe we will lose the subtleties that can be found within datasets; maybe we will encourage misinformation? Maybe we will condemn areas of our cities to become ghettos? With so much information at our fingertips about where we should live, the ‘better areas’ might continue to benefit at the expense of other areas.
The key question is whether society is better off with the information or without it. Certainly the UK Government is behind the initiative, and the recent ‘Smarter Government’ (PDF) document made a commitment to the opening up of datasets. The Government believes it can save money by opening up data, which, of course, is going to be a strong incentive.
For archivists the whole move towards numerous channels of information, open data, mashing up, recombining, reusing, keeping data fluid and dynamic is something of a nightmare from a professional point of view. In addition, if we start to see the benefits of providing everyone with access to all data, enabling them to do new and exciting things with it, then might we change our perspective on appraisal and selection? Does this make it more imperative that we keep everything?
Image: B of the Bang, Manchester

Digital Preservation the Planets Way


As a representative of the UK Society of Archivists, which is a member of the Digital Preservation Coalition, I attended the first day of this 3-day event on a partial DPC scholarship. It gave an overview of digital preservation and of the Planets project. Planets is a 4-year project funded by the European Community, with 16 partner organisations and a budget of 13.7 million euros, showing a high level of commitment from the EC. The programme is due to finish in May 2010. The Planets approach wraps preservation planning services, action services, characterisation services and a testbed within an interoperability framework. It seeks to respond to the OAIS reference model, and it became clear as the day went on that a knowledge of OAIS terminology was useful in following the talks, which often referred to SIPs, AIPs and DIPs (Submission, Archival and Dissemination Information Packages).

After a keynote by Sheila Anderson, Director of the Centre for e-Research at King's College London, which touched upon some of the important principles that underlie digital preservation and outlined some projects that the Centre is involved in, we got into a day that blended general information about digital preservation with quite detailed information about the Planets services and tools.

Ross King from the Austrian Institute of Technology gave a good overview, looking at the scale of the digital universe and the challenges and incentives to preserve. Between now and 2019, the volume of content that organisations need to hold will rise twenty-five-fold, from an average of 20TB to over 500TB (from Are You Ready? Assessing Whether Organisations are Prepared for Digital Preservation – PDF). We would need about 1 trillion CD-ROMs to hold all of the digital information produced in 2009. Importantly, we have now reached a point at which information creation is exceeding storage capacity, so the question of what to preserve is becoming increasingly important. I found this point interesting, as at the last talk that I attended on digital preservation we heard the old cry of ‘why not keep everything digital – storage is no problem’.

Digital preservation is about using standards, best practices and technologies to ensure access over time. With digital information, there are challenges around bit-stream preservation (bytes and hardware) and logical preservation (software and format). Expounding on the challenge of formats, King said that typically knowledge workers produce at least two thirds of their documents in proprietary formats. These formats have high preservation risks relating to limited long-term support and limited backwards-compatibility.

The preservation planning process is also vital, and Planets provides help and guidance on this. It is important to know what we want to preserve, profile the collections and identify the risks in order to mitigate them. Hans Hofman of the National Archives of the Netherlands gave an introduction to preservation planning. A preservation plan should define a series of preservation actions that need to be taken to address identified risks for a given set of digital objects or records. It is the translation of a preservation policy. He talked about the importance of looking at objects in context, and about how to prepare in order to create a preservation planning strategy: the need to understand the organisational context, the resources and skills that are available. Often small organisations simply do not have the resources, and so large organisations inevitably lead the way in this area. Hans went through the step-by-step process of defining requirements, evaluating alternatives, analysing results and ending up with recommendations with which to build a preservation plan, which then needs to be monitored over time.

Planets has developed a testbed to investigate how preservation services act on digital objects (now open to anyone, see https://testbed.planets-project.eu/testbed/). Edith Michaeler of the Austrian National Library explained that this provides a controlled environment for experimentation with your own data, as well as with structured test data that you can use (Corpora). It enables you to identify suitable tools and make informed decisions. The testbed is run centrally, so everything is made available to users and experiments can benefit the whole community. It is very much integrated within the whole Planets framework. Edith took us through the 6 steps to run an experiment: defining the properties, designing the experiment, running it, the results, the analysis and the evaluation. The testbed enables experiments in migration, load test migration and viewing in an emulator as well as characterisation and validation. So, you might use the testbed to answer a question such as which format to migrate a file to, or to see if a tool behaves in the way that you expected.

The incentives for digital preservation are many and for businesses those around things like legislative compliance and clarification of rights may be important incentives. But business decisions are generally made based on the short-term and they are made based on a calculated return on investment. So, maybe we need to place digital preservation in the area of risk management rather than investment. The risk needs to be quantified, which is not an easy task. How much is produced? What are the objects worth? How long do they retain their value? What does it cost to preserve? If we can estimate the financial risk, we can justify the preventative investment in digital preservation. (see MK Bergman, Untapped Assets – PDF).

During the panel discussion, the idea of ‘selling’ digital preservation on the basis of risk was discussed. Earlier in the day William Kilbride, director of the Digital Preservation Coalition, talked about digital preservation as sustaining opportunities over time, and for many delegates this was much more in tune with their sentiments. He outlined the work of the DPC, and emphasised the community-based and collaborative approach it takes to raising awareness of digital preservation.

Clive Billenness went through how Planets works with the whole life-cycle of digital preservation:

1. Identify the risks
2. Assess the risks (how severe they are, whether they are immediate or long-term)
3. Plan and evaluate
4. Implementation plan
5. Update and review

The cycle will be repeated if there is a new risk trigger, which might be anything that signifies a change in practice, whether it be a change in policy, a change in the business environment or a change in the technical environment. For the whole life-cycle, Planets has tools to help: the Plato Preservation Planning Tool, the Planets Characterisation Services, the Testbed and the Planets Core Registry, which is a file format registry based upon PRONOM, including preservation action tools and file formats, and taking a community-based approach to preservation.

Types of preservation action were explained by Sara van Bussel of the National Library of the Netherlands. She talked about logical preservation and accessing bit streams, and how interpretation may depend on obsolete operating systems, applications or formats. Sara summarised migration and emulation as preservation processes. Migration means changing the object over time to make it accessible in the current environment, whatever that may be. This risks introducing inconsistencies, functionality can be lost and quality assessment can be difficult. Migration can happen before something comes into the system or whilst it is in the system. It can also happen on access, so it is demand-led. Emulation means changing the environment over time, so no changes to the object are needed. But it is technically challenging and the user has to have knowledge about the original environment. An emulator emulates a hardware configuration. You need your original operating system and software, so they must be preserved. Emulation can be useful for viewing a website in a web archive, for opening old files, from WordPerfect files to databases, and for executing programs, such as games or scientific applications. It is also possible to use migration through emulation, which can get round the problem of a migration tool becoming obsolete.
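To give a flavour of what a single migration action might look like, here is a minimal, hypothetical sketch in Python using the Pillow imaging library: it converts a TIFF image to PNG while keeping the original. The filenames are made up, and this illustrates the general idea rather than any of the Planets tools.

```python
# A minimal sketch of one format migration step (illustrative, not a Planets tool):
# convert a TIFF image to PNG with Pillow, keeping the original file untouched.
from pathlib import Path

from PIL import Image

source = Path("scan_0001.tiff")      # hypothetical input file
target = source.with_suffix(".png")  # migrated copy in the target format

with Image.open(source) as img:
    img.save(target, format="PNG")

# A real preservation workflow would also record provenance (what was migrated,
# when, and with which tool and version) and run quality checks on the output.
print(f"Migrated {source} -> {target}")
```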

We were told about the Planets Gap Analysis (PDF), which looked at existing file formats, and found 137 different formats in 76 institutions. The most archived file formats in archives, libraries and museums are tiff, jpg, pdf and xml, but archives hardly archive the mp3 format, while libraries and museums frequently do. Only 22% of the archived file formats were found in four or more institutions, and only two file formats, tiff and jpg, were found in over half of all institutions. So most preservation action tools are for common file formats, which means that more obscure file formats may have a problem. However, Sara gave three examples where the environment is quite different. For DAISY, which is a format for audio books for the blind, there is a consortium of content providers who address issues arising with new versions of the format. For FITS, a format for astronomical data, digital preservation issues are often solved by the knowledgeable user-base. But with sheet music the community is quite fragmented and uncoordinated, so it is difficult to get a consensus to work together.

The Gap Analysis found that out of the top ten file formats, nine are covered by migration tools known or used by Planets partners. XML is not covered, but it is usually the output of a migration rather than the input, so maybe this is not surprising. Many tools are flexible, so they can address many types of format, but each organisation has specific demands that might not be fulfilled by available tools.

Manfred Thaller from the University of Cologne gave a detailed account of the Planets Characterisation Services. He drew attention to the most basic layer of digital information – the 1s and 0s that make up a bit-stream. He showed a very simple image and how this can be represented by 1s and 0s, with a ‘5, 6’ to indicate the rows and columns (…or is that columns and rows – the point being that information such as this is vital!). To act on a file you need to identify it, validate it, extract information and undertake comparison. If you do not know what kind of file you have – maybe you have a bit-stream but do not know what it represents – DROID can help to interpret the file, and it also assigns a permanent identifier to the file. DROID uses the PRONOM-based Planets File Format Registry.
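That rows-versus-columns aside is easy to demonstrate. The tiny Python sketch below uses an entirely made-up bit-stream and shows that the same bits produce two different ‘images’ depending on which interpretation you record, which is exactly why characterisation information is vital.

```python
# A tiny illustration with made-up data: the same bit-stream read as 5 rows of 6
# columns versus 6 rows of 5 columns gives two different pictures, so recording
# that interpretation ('5, 6' or '6, 5') is vital.
bits = "011110100001100001100001011110"  # 30 hypothetical bits

def as_grid(bitstring, rows, cols):
    """Lay a flat bit-stream out as a rows x cols grid of characters."""
    return [bitstring[r * cols:(r + 1) * cols] for r in range(rows)]

print("Read as 5 rows x 6 columns:")
for line in as_grid(bits, 5, 6):
    print(line)

print("Read as 6 rows x 5 columns:")
for line in as_grid(bits, 6, 5):
    print(line)
```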

Thaller emphasised that validation is often complex, and in real life we have to take files that are not necessarily valid. There is no validation service built by Planets itself, but it does provide tools like JHOVE. Extraction is easier to deal with – the examination of what is really in a file. Many services extract some characteristics from a file. The traditional approach is to build a tool for each file format. The Planets Extensible Characterisation Language (XCL) approach is to have one tool which can extract from many kinds of files. It provides a file format description language as well as a general container format for file characterisation.

Hannes Kulovitz from the Vienna University of Technology talked about how Plato, an interactive software tool provided by Planets, can help in preservation planning, and went through the process of defining requirements, evaluating alternatives, analysing results, making recommendations and building the preservation plan. In the ensuing discussion it became clear that the planning process is a major part of the preservation process, especially as each format requires its own plan. The plan should be seen as requiring a major investment of time and effort, and it will then facilitate more effective automation of the processes involved.


Ross King made a return to the podium to talk about integrating the components of digital preservation. There is no archive component being developed as part of Planets, so the assumption is that institutions already have this. His talk concentrated on workflows through Planets, with suggested templates for submission, migration and access. He then went on to give a case study of the British Library (tiff images of newspapers). The content is complex and it required the template to be changed to accommodate the requirements. He built up the workflow through the various stages and referred to options for using various Planets’ tools along the way. I would have liked this case study to be enlarged and followed through more clearly, as giving an example helps to clarify the way that the numerous tools available as part of Planets may be used.

We ended with a glimpse of the future of Planets. Most outputs will be freely available, under an Apache 2 licence. But to get take-up there must be a sustainability plan to maintain and develop the software, ensure continued access to Planets services (currently based at the University of Glasgow), support partners who are committed to the use of Planets, grow the community of users and promote further research and development. With this in mind, a decision has been taken to form something along the lines of an Open Planets Foundation (OPF), a not-for-profit organisation, limited by guarantee under UK law but with global membership. There has already been a commitment to this and there is financial support, but Billenness was naturally reserved about being explicit here because the OPF will be a new body. There will be different classes of membership and the terms of membership are currently being finalised. But most of the software will remain free to download.

Image shows Planets Interoperability Framework.

Charles Wesley (1707-88)

Image: Charles Wesley

This month we highlight the new catalogue for the personal papers of Anglican minister, Methodist preacher and religious poet Charles Wesley (1707-88).

There is an introduction by Dr. Gareth Lloyd, Methodist Archivist at the Methodist Archives and Research Centre, The University of Manchester, The John Rylands University Library.

A model to bring museums, libraries and archives together

I am attending a workshop on the Conceptual Reference Model created by the International Council of Museums Committee on Documentation (CIDOC) this week.
The CIDOC Conceptual Reference Model (CRM) was created as a means of enabling information interchange and integration in the museum community and beyond. It “provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation”.
It became an ISO standard in 2006 and a Special Interest Group continues to work to develop it and keep it in line with progress in conceptualisation for information integration.
The vision is to facilitate the harmonization of information across the cultural heritage sector, encompassing museums, libraries and archives, helping to create a global resource. The CRM is effectively an ontology describing concepts and relationships relevant to this kind of information. It is not in any sense a content standard, rather it takes what is available and looks at the underlying logic, analysing the structure in order to progress semantic interoperability.
I come to this as someone with a keen interest in interoperability, and I think that the Archives community should engage more actively in cross-sectoral initiatives that benefit resource discovery. I am interested to find out more about the practical application and adoption of the CRM. My concern is that in the attempt to cover all eventualities, it seems like quite a complex model. It seeks to ‘provide the level of detail and precision expected and required by museum professionals and researchers’. It covers detailed descriptions, contexts and relationships, which can often be very complex. The SIG is looking to harmonise the CRM with archival standards, which should take the cultural heritage sector a step further towards working together to share our resources.
I will be interested to learn more about the Model and I would like to consider how the CRM relates to what is going on in the wider environment, particularly with reference to Linked Data and, more basically, the increasing recognition of web architecture as the core means to disseminate information. Initiatives to bring data together, to interconnect, should move us closer to integrated information systems, but we want to make sure that we have complementary approaches.
You can read more about the Conceptual Reference Model on the CIDOC CRM website.

English Language — subjectless constructions*

This is (probably) a final blog post referring to the recent survey by the UK Archives Discovery Network (UKAD) Working Group on Indexing and Name Authorities. Here we look in particular at subject indexing.

We received 82 responses to the question asking whether descriptions are indexed by subject. Most (42) do so, and follow recognised rules (UKAT, Unesco, LCSH, etc.). A significant proportion (29) index using in-house rules and some do not index by subject (18). Comments on this question indicated that in-house rules often supplement recognised standards, sometimes providing specialised terms where standards are too general (although I wonder whether these respondents have looked at Library of Congress headings, which are sometimes really quite satisfyingly specific, from the behaviour of the great blue heron to the history of music criticism in 20th century Bavaria).

Reasons given for subject indexing include:
  • it is good practice
  • it is essential for resource discovery
  • users find it easier than full-text searching
  • it gives people an indication of the subject strengths of collections
  • it imposes consistency
  • it is essential for browsing (for users who prefer to navigate in this way)
  • it brings together references to specific events
  • it brings out subjects not made explicit in keyword searching
  • it enables people to find out about things and about concepts
  • it may provide a means to find out about a collection where it is not yet fully described
  • it maximises the utility of the catalogues
  • it helps users identify the most relevant sources
  • it can indicate useful material that may not otherwise be found
  • it enables themes to be drawn out that may be missed by free-text searching
  • it can aid teachers
  • it helps with answering enquiries
  • it facilitates access across the library and archive
  • it meets the needs of academic researchers
The lack of staff resources was a significant reason given where subject searching was not undertaken. Several respondents did not consider it to be necessary. Reasons given for this were:
  • the scope of the archive is tightly defined so subject indexing is less important
  • the benefits are not clear
  • the lack of a thesaurus that is specific enough to meet needs
  • a management decision that it is ‘faddy’
  • the collections are too extensive
  • the cataloguing backlog is the priority
Name indexing is considered more important than subject indexing only by a small margin, and some respondents did emphasise that they index by name but not by subject. Comments here included the observation that subject indexing is more problematic because it is more subjective, that subjects may more easily be pulled out via automated means and that it depends upon the particular archive (collection). As with name and place indexing, subject indexing happens at all levels of description, and not predominantly at collection-level. Comments suggest that subjects are only added at lower-levels if appropriate (and not appropriate to collection-level).
For subjects, the survey asked how many terms are on average applied to each record. According to the options we gave, the vast majority use between one and six. However, some respondents commented that it varies widely, and one said that they might use a few thousand for a directory, which seems a little generous (possibly there is a misunderstanding here?).
Sources used for subjects included the usual thesauri, with UKAT coming out strongest, followed by Unesco and Library of Congress. A few respondents also referred to the Getty Art and Architecture Thesaurus. However, as with other indexes, in-house lists and a combination approach also proved common. It was pointed out in one comment that in-house lists should not be seen as lesser sources; one respondent has sold their thesaurus to other local archives. There were two comments about UKAT not being maintained, and hopes that the UKAD Network might take this on. And, indeed, when asked about the choice of sources used for subject indexing, UKAT again came up as a good thesaurus in need of maintenance.
Reasons given for the diverse choice of sources used included:
  • being led by what is within the software used for cataloguing
  • the need to work cross-domain
  • the need to be interoperable
  • the need to apply very specific subject terms
  • the need to follow what the library does
  • the importance of an international perspective
  • the lack of forethought on how users might use indexes
  • the lack of a specialist thesaurus in the subject area the repository represents (e.g. religious orders)
  • following the recommendations of the Archives Hub and A2A
Image courtesy of Flickr Creative Commons licence, Luca Pedrotti’s photostream

* the title of this blog post is a Library of Congress approved subject heading

Liberty, Parity and Justice at the Hull History Centre

This month we are marking our 100th feature by highlighting descriptions for the records of pressure groups held by Hull University Archives at Hull History Centre.

The records of pressure groups and campaigns represent some of the most significant and substantial archives held by Hull University Archives. They number around 40 collections, from small, short-lived groups to major continuing organisations. The most significant and substantial archive is that of Liberty, which records the continuing development of civil rights in Britain over the past 75 years.

photo: British Union for the Abolition of Vivisection marchers, copyright © Hull History Centre.

Place names: we would be lost without them

According to the recent Indexing and Authority Records Survey (which I have been blogging about recently), archivists have a number of reasons why they think it is important to undertake place indexing:

  • to facilitate access
  • it is essential to resource discovery
  • users frequently request information about places
  • it is very important for local historians
  • it is good practice
  • to tackle inconsistencies in spelling and place name changes
  • to distinguish between places that have the same name
  • as a source of statistics (e.g. how many collections relate to individual countries)
  • it is an important part of the University’s diversity plan – many students are from other countries, and it shows that the collections are international
  • the records are arranged by place
  • it is a way to bring together disparate material in diverse collections
  • it helps identify and track boundary changes over time
  • it is used by national network sites (e.g. the Archives Hub)
The main reason not to index by place was given as a lack of staff resources, but some did also feel that it is not necessary. Other reasons were:
  • the search engine can pull out the place name
  • would need to index at item level for place entries to be useful and this is not practical to do
  • cataloguing and name indexing are the priorities
  • collections cover a small geographical area
  • collections are more thematic and name indexing works better than place indexing
  • not appropriate for the material (e.g. cartoons)
  • it has never been done
  • names are standardised to facilitate keyword searching
For those that do index by place, just as with names, the spread between collection-level, series-level, file and item-level indexing was pretty even, and the percentage of collections indexed by place varied enormously. The sources used for place names were varied, although most do seem to use the recognised gazetteers and guides. Others referred to the Library of Congress, local people and the documents themselves.
Many do use the NCA Rules, but there were some comments about the drawbacks of these – they do not recognise the three Yorkshire Ridings, and they were created by a previous generation of archivists and are outdated.
We did ask whether any repositories use a co-ordinates based system, and only 3 responses were in the affirmative, though a couple stated that they were going to look into this.
Finally, when asked about reasons for the choice of rules or sources for place names, there were some varied responses:
  • being part of a set-up with other contributors
  • familiarity
  • ease
  • internationally accepted [standard], widely known and used
  • indexing was done before standards were introduced
  • it appears that no real thought has been given to this
  • standards were not precise enough when the decision was made
Place name indexing: is it necessary? One respondent said: ‘To put it bluntly we would be lost without it.’
Image: Flickr Creative Commons JMC Photos