Linking Data in Sydney

 

By Geoff Browell, Head of Archives Services

I was fortunate to attend the biennial Linked Open Data in
Libraries, Archives and Museums (LODLAM) summit in early July in Sydney,
Australia. I played a very small role in setting it up, as a member of the
organising committee. The conference is an opportunity for archivists,
librarians, museum curators, information professionals and IT experts to meet
and discuss the latest developments in Linked Data among higher education,
heritage and ‘memory’ institutions worldwide. Delegates have the chance to hear
about successful (and unsuccessful) projects, take part in targeted discussions
on the future of the technology, and start new collaborations. The event
features the ‘Challenge’, an open competition for the best application of
Linked Data in a cultural setting. The summit adopts the ‘un-conference’
format: there are no pre-prepared papers, so relevant issues can be aired and
debated and sub-groups convened to address specific topics.

View this graph of attendees: https://graphcommons.com/graphs/0f874303-97c2-4e53-abc6-83a13a1a2030

What is Linked Data?

Linked Data is a way of structuring online and other data to
improve its accuracy, visibility and connectedness. The technology has been
available for more than a decade and has mainly been used by commercial
entities such as publishing and media organisations including the BBC and
Reuters.  For archives, libraries and
museums, Linked Data holds the prospect of providing a richer experience for
users, better connectivity between pools of data, new ways of cataloguing
collections, and improved access for researchers and the public.

Linked Data could, for example, provide the means to unlock research
data or mix it with other types of data such as maps, or to search digitised
content including books and image files and collection metadata. New, more
robust services are currently being developed by international initiatives
such as Europeana, which should make the adoption of Linked Data by libraries
and archives much easier. There remain many challenges, however, and this
conference provided the opportunity to explore these.
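
By way of illustration (a sketch, not something presented at the summit), Linked Data describes each thing as a set of subject, predicate, object statements, or ‘triples’, in which things are named by web addresses. In the minimal Python example below, the example.org addresses and the catalogue record are hypothetical placeholders; the Dublin Core and OWL property addresses and the DBpedia address are real, published identifiers.

```python
# A minimal sketch of Linked Data as subject-predicate-object "triples".
# All example.org identifiers are hypothetical placeholders.
DCTERMS = "http://purl.org/dc/terms/"
OWL = "http://www.w3.org/2002/07/owl#"

record = "http://example.org/archive/collection/churchill-papers"
person = "http://example.org/archive/person/winston-churchill"

triples = [
    (record, DCTERMS + "title", '"Papers of Winston Churchill"'),
    (record, DCTERMS + "creator", person),
    # This statement links the local person record to an external dataset.
    (person, OWL + "sameAs", "http://dbpedia.org/resource/Winston_Churchill"),
]


def to_ntriples(statements):
    """Serialise the statements in the simple N-Triples text format.

    Simplification: any object starting with a double quote is treated as
    a literal value, everything else as a web address.
    """
    lines = []
    for subject, predicate, obj in statements:
        rendered = obj if obj.startswith('"') else f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


def external_links(statements, local_prefix="http://example.org/"):
    """Return the objects that point outside the local dataset."""
    return [
        obj
        for _, _, obj in statements
        if obj.startswith("http") and not obj.startswith(local_prefix)
    ]


print(to_ntriples(triples))
print(external_links(triples))
```

The third statement is what makes the data ‘linked’: it connects a local record to a description held elsewhere, so that a researcher (or a program) can follow the link to further context.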

The conference comprised a mix of quick fire discussions,
parallel breakout sessions, 2-minute introductions to interesting projects, and
the Challenge entries.

[photo: Work in progress at the LODLAM summit]

Quick fire points from delegates

  • Need for improved visualisation of data (current
    visualisations are not scalable or require too much IT input for archivists and
    librarians to realistically use)
  • Need to build Linked Data creation and editing
    into vendor systems (the Step change model which we pursued at King’s Archives
    in a Jisc-funded project)
  • Exploring where text mining and Natural Language
    Processing overlap with LOD
  • World War One Linked Data: what next? (less of a
    theme this time around as the anniversary has already started)
  • LOD in archives: a particular challenge?
    (archives are lagging libraries and galleries in their implementation of Linked
    Data)
  • What is the next equivalent of the Getty vocabularies: a popular
    vocabulary that can encourage use of LOD?
  • LOD in Fedora 4 and similar open source or
    proprietary content management systems (how can Linked Data be used with these
    popular platforms?)
  • Linked Data is an off-putting term implying a
    data-centric set of skills (perhaps Linked Open Knowledge as an alternative?)
  • Building a directory of cultural heritage
    organisation LOD: how do we find available data sets? (such as Linked Open
    Vocabularies)
  • Implementing the Europeana Data Model: next steps
    (stressing the importance of Europeana in the Linked Data landscape)
  • Can we connect different entities across
    different vocabularies to create new knowledge? (a lot of vocabularies have
    been created, but how do they communicate?)

 

Day One sessions

OASIS Deep Image Indexing (http://www.synaptica.com/oasis/)

This talk showcased a new product called OASIS from
Synaptica, aimed at art galleries, which facilitates the identification,
annotation and linking of parts of images. These elements can be linked
semantically and described using externally-managed vocabularies such as the
Getty suite of vocabularies or classifications like Iconclass. This helps
curators do their job and gives end users an enriched appreciation of paintings
and other art. It is the latest example of annotation services that overlay useful
information and utilise agreed international standards like the Open Annotation
Data Model and the IIIF standard for image zoom.

We were shown two examples, Botticelli’s The Birth of Venus
and Holbein’s The Ambassadors, which demonstrated impressive zooming of
well-known paintings and detailed descriptions of features. Future development
will allow for crowdsourcing to identify key elements, and for image
recognition software to find these elements on the Web (‘find all examples of
images of dogs in 16th century public works of art embedded in the art but not
indexed in available metadata’).

This product mirrors the implementation of IIIF by an
international consortium that includes leading US universities, the Bodleian,
BL, Wellcome and others. Two services have evolved which give archives the
chance to offer their users deep zoom and interoperable images:
Mirador, and the Wellcome’s Universal Viewer (http://showcase.iiif.io/viewer/mirador/).
These get around the problem of having to create differently sized derivatives
of images for different uses, and of having to publish very large images on the
internet when download speeds might be slow.
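
To illustrate why separate derivatives are no longer needed: the IIIF Image API lets a viewer ask the image server for any region, at any size, simply by changing the web address, following the pattern identifier/region/size/rotation/quality.format. The Python sketch below builds such addresses. The server address and image identifier are hypothetical, and the helper function is only an illustration, not part of Mirador or the Universal Viewer.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, size, region="full", rotation=0,
                   quality="default", fmt="jpg"):
    """Build an IIIF Image API request address.

    Pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The identifier is percent-encoded so that characters such as "/" inside
    it do not break the path.
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/"
            f"{rotation}/{quality}.{fmt}")


# Hypothetical image server and identifier, for illustration only.
BASE = "https://iiif.example.org/images"

# The whole image, scaled down to 400 pixels wide (e.g. for a thumbnail).
thumbnail = iiif_image_url(BASE, "painting-001", size="400,")

# A 600 x 600 pixel detail starting at x=1000, y=1500, delivered 800 wide.
detail = iiif_image_url(BASE, "painting-001", size="800,",
                        region="1000,1500,600,600")

print(thumbnail)
print(detail)
```

Both requests are served from the same master image, so a zooming viewer only ever downloads the portion of the image currently on screen.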

Digital New Zealand

Chris McDowall of Digital New Zealand explored how best to
make LOD work for non-LOD people. Linked Open Data uses a lot of acronyms and
presupposes a fairly high level of technical knowledge of systems, which should
not be taken for granted. This is a particular bugbear of mine, which is why this talk
resonated. Chris’ advocacy of cross developer/user meetups also chimed with my
own thinking: LOD will never be properly adopted if it is assumed to be the
province of ‘techies’. Developers often don’t know what they are developing
because they don’t understand the content or its purpose: they are not
curators.

He stressed the importance of vocabulary cross-walks and the
need for good communication in organisations to make services stable and
sustainable. Again, this chimed with my own thinking: much work needs to be
done to ‘sell’ the benefits of Linked Data to sceptical senior management.
These benefits might include context building around archive collections,
gamification of data to encourage re-use, and serendipity searches and prompts
which can aid researchers. Linked Data offers truly targeted searching, in
contrast to the ‘faith based technology’ of existing search engines (a really
memorable expression).

He warned that the infrastructure demands of LOD should not
be underestimated, particularly from researchers making a lot of simultaneous
queries: he mooted a pared down type of LOD for wider adoption.

Chris finished by highlighting a number of interesting use
cases of LOD in libraries as part of the Linked Data for Libraries (LD4L) project,
a collaboration between Harvard, Cornell and Stanford (https://wiki.duraspace.org/pages/viewpage.action?pageId=41354028). See also
Richard Wallis’ presentation on the benefit of LOD for libraries: http://swib.org/swib13/slides/wallis_swib13_108.pdf

Schema.org

Richard Wallis of OCLC explored the potential of Schema.org,
a growing vocabulary of high level terms agreed by the main search engines to
make content more searchable. Schema.org helps power the result boxes one
sees at the top of Google search results pages. Richard suggested the creation
of an extension relevant to archives to add to the one for bibliographic
material. The advantage of schema.org is that it can easily be added to web
pages, resulting in appreciable improvement in ranking and the possibility of
generating user-centred suggestions in search results. For an archive, this
might mean that a Google user who searches for the papers of Winston Churchill
is offered related suggestions, such as booking tickets to a talk about the
papers, or viewing Google Maps information showing the opening times and
location of the archive.
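
As a rough sketch of what ‘adding schema.org to a web page’ involves, the Python example below generates a JSON-LD block of the kind that can be embedded in a page to describe an archive’s location and opening times. The organisation, address and web address are hypothetical. The ArchiveOrganization type is part of the schema.org vocabulary as it stands today and postdates the summit, so it is used here as an assumption about how such a description might look, not as what was proposed in the session.

```python
import json

# Hypothetical archive details, for illustration only.
archive = {
    "@context": "https://schema.org",
    "@type": "ArchiveOrganization",
    "name": "Example University Archives",
    "url": "https://archives.example.org/",
    "openingHours": "Mo-Fr 09:30-17:00",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "1 Example Street",
        "addressLocality": "London",
        "addressCountry": "GB",
    },
}


def as_script_tag(data):
    """Wrap the description in the script element used to embed JSON-LD."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(as_script_tag(archive))
```

The printed block can be pasted into the HTML of the archive’s home page; no change to the visible page is needed.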

The group discussion centred on the potential elements (would
the extension refer to theses, research data, and university systems that contain
archive data such as finance and student information?), and on the need for use
cases and for setting out potential benefits. I agreed to be part of an
international team, working through the W3C, to help set one up.

[photo: Shakespeare window at the State Library of New South Wales]

Dork shorts/Speedos: these are impromptu lightning talks lasting a few
minutes, which highlight a project, idea or proposal. View here:
http://summit2015.lodlam.net/about/speedos/

Highlights:

Cultuurlink (http://cultuurlink.beeldengeluid.nl/app/#/): Introduction by Johan Oomen

This Dutch service facilitates the linking of different
controlled vocabularies and thesauri and helps address the problem faced by
many cultural organisations: ‘which thesauri do I use?’ and ‘how do I avoid
reinventing the thesaurus wheel?’. The service allows users to upload a SKOS
vocabulary, link it with one of four supported vocabularies and visualise the
results.

The service helps different types of organisation to connect
their vocabularies, for example an audio-visual archive with a museum’s
collections. The approach also allows content from one repository to be
enhanced or deepened through contextual information from another. The example
of Vermeer’s Milkmaid was cited: enhancing the discoverability of information
on the painting held in the Rijksmuseum in Amsterdam by connecting the
collection data held on the local museum management system with DBpedia and
with the Getty Art and Architecture
Thesaurus. This sort of approach builds on the prototypes developed in the last
few years to align vocabularies (and to ‘Skosify’ data – turn it into Linked
Data) around shared Europeana initiatives (see http://semanticweb.cs.vu.nl/amalgame/).
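
The idea of linking vocabularies can be sketched in a few lines of Python. SKOS defines mapping properties such as skos:exactMatch, which states that a concept in one vocabulary means the same as a concept in another, and which applies in both directions and can be followed through a chain of vocabularies. The example below is an illustration under stated assumptions, not Cultuurlink’s own implementation: the three vocabularies and their concept addresses are hypothetical.

```python
# The real SKOS property used to say two concepts are interchangeable.
SKOS_EXACT = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Hypothetical concepts from an audio-visual archive, a museum and a
# shared thesaurus, linked by exactMatch statements.
AV_PAINTING = "http://example.org/av-archive/concept/painting"
MUSEUM_PAINTINGS = "http://example.org/museum/concept/paintings"
SHARED_PAINTINGS = "http://example.org/shared-thesaurus/concept/123"
AV_RADIO = "http://example.org/av-archive/concept/radio"
MUSEUM_BROADCASTING = "http://example.org/museum/concept/broadcasting"

mappings = [
    (AV_PAINTING, SKOS_EXACT, MUSEUM_PAINTINGS),
    (MUSEUM_PAINTINGS, SKOS_EXACT, SHARED_PAINTINGS),
    (AV_RADIO, SKOS_EXACT, MUSEUM_BROADCASTING),
]


def equivalents(concept, statements):
    """Find every concept reachable from `concept` via exactMatch links,
    following the links in both directions and through chains."""
    neighbours = {}
    for subject, predicate, obj in statements:
        if predicate == SKOS_EXACT:
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)

    found, queue = set(), [concept]
    while queue:
        current = queue.pop()
        for other in neighbours.get(current, ()):
            if other != concept and other not in found:
                found.add(other)
                queue.append(other)
    return found


print(sorted(equivalents(AV_PAINTING, mappings)))
```

A search for the audio-visual archive’s ‘painting’ concept can then be widened to include records indexed with the equivalent museum and shared thesaurus terms.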

Research Data Services project: Introduction by Ingrid Mason

This is a pan-Australian research data management project
focusing on the repackaging of cultural heritage data for academic re-use.
Linked Data will be used to describe a ‘meta-collection’ of the country’s
cultural data, one that brings together academic users of data and curators. It
will utilise the Australia-wide research data nodes for high speed retrieval (https://www.rds.edu.au/project-overview
and http://www.intersect.org.au/).

Tim Sherratt on historians using LOD

This fascinating short explained how historians have been
creating LOD for years – and haven’t even known they were doing it –
identifying links and narratives in text as part of the painstaking historical
process. How can Linked Data be used to mimic and speed up this historical
research process? Tim showed a working example. A step by step guide is
available at http://discontents.com.au/stories-for-machines-data-for-humans/
and the talk can be heard at http://summit2015.lodlam.net/2015/07/10/lod-book/

Jon Voss on Historypin

Jon explained how the popular historical mapping service,
historypin, is dealing with the problem of ‘roundtripping’ where heritage data
is enhanced or augmented through crowdsourcing and returned to its source. This
is of particular interest to Europeana, whose data might pass through many
hands. It highlights a potential difficulty of LOD: validating the authenticity
and quality of data that has been distributed and enriched.

Chris McDowall of Digital New Zealand

Chris explained how to search across different types of data
source in New Zealand, for example matching and searching for people by using
phonetic algorithms to generate sound-alike suggestions, combined with fuzzy
name matching: http://digitalnz.github.io/supplejack/.
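
The two techniques mentioned can be sketched briefly in Python. This is a generic illustration, not Supplejack’s own code: it assumes Soundex as the phonetic algorithm (the talk did not specify which was used) and uses the difflib module from the Python standard library for fuzzy matching. The list of names is made up.

```python
import difflib

# Soundex groups consonants that sound alike under the same digit.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        _CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex key for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = _CODES.get(letter, "")
        # Add a digit only when it differs from the one just before it.
        if digit and digit != previous:
            key += digit
        # H and W do not separate two letters that share a digit.
        if letter not in "HW":
            previous = digit
    return (key + "000")[:4]


def sound_alikes(query, names):
    """Names other than the query itself that share its Soundex key."""
    key = soundex(query)
    return [n for n in names if n != query and soundex(n) == key]


def fuzzy_matches(query, names, cutoff=0.75):
    """Names whose spelling is close to the query (catches typos)."""
    return difflib.get_close_matches(query, names, n=5, cutoff=cutoff)


names = ["Smith", "Smyth", "Schmidt", "Jones"]  # made-up index of people
print(sound_alikes("Smith", names))
print(fuzzy_matches("Smith", names))
```

Phonetic keys catch variant spellings that sound the same, while fuzzy matching catches transcription slips; used together they let a search for one form of a name suggest the others.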

Axes Project (http://www.axes-project.eu/): Introduction from Martijn Kleppe

This 6 million Euro EU-funded project aims to make
audio-visual material more accessible and has been trialled with thousands of
hours of video footage, and expert users, from the BBC. Its purpose is to help users
mine vast quantities of audio-visual material in the public domain as
accurately and quickly as possible. The team have developed tools using open
source frameworks that allow users to detect people, places, events and other
entities in speech and images and to annotate and refine these results. This
sophisticated tool set utilises face, speech and place recognition to zero-in
on precise fragments without the need for accompanying (longhand) metadata. The
results are undeniably impressive – with a speedy, clear, interface locating
the parts of each video with filtering and similarity options. The main use for
the toolset to date is with film studies and journalism students but it
unquestionably has wider application.

The Axes website also highlights a number of interesting
projects in this field (http://www.axes-project.eu/?page_id=25). Two stand out:
Cubrik (http://www.cubrikproject.eu/),
another FP7 multinational project which mixes crowd and machine analysis to
refine and improve searching of multimedia assets; and the PATHS prototype (http://www.paths-project.eu/), ‘an interactive personalised tour guide through
existing digital library collections. The system will offer suggestions about
items to look at and assist in their interpretation. Navigation will be based
around the metaphor of a path through the collection.’ The project created an
API and user interface, and launched a tested exemplar with Europeana to
demonstrate the potential of new discovery journeys to open access to
already-digitised collections.

Loom project (http://dxlab.sl.nsw.gov.au/making-loom/): Introduction from Paula Bray of State Library of New South Wales

The NSW State Library sought to find new ways of visualising
their collections by date and geography through their DX Lab, an experimental
data laboratory similar to BL Labs, which I have worked with in the UK. One
arresting visualisation shows the proportions of collections relevant
to particular geographical locations in the city of Sydney. Accompanied by
approving gasps from the audience, this showed an iceberg graphic superimposed
onto a map, indicating how much of the collections about a place had been
digitised and how much was yet to be digitised: a striking way of communicating
the fragility of some collections and the work still to be done to make them
accessible to the public.

LODLAM challenge

Nineteen entries were received: http://summit2015.lodlam.net/challenge/challenge-entries/

  1. Open Memory Project. This Italian entry
    won the main prize. It uses Linked Data to re-connect victims of the Holocaust
    in wartime Italy. The project was thought provoking and moving and has the
    potential to capture the public imagination.
  2. Polimedia is a service designed to
    answer questions from the media and journalists by querying multi-media
    libraries, identifying fragments of speech. It won second prize for its
    innovative solution to the challenge of searching video archives.
  3. LodView goes LAM is a new Italian
    software tool designed to make it easier for novices to publish data as Linked Data.
    A visually beautiful and engaging interface makes this a joy to look at.
  4. EEXCESS is a European project to
    augment books and other research and teaching materials with contextual
    information, and to develop sophisticated tools to measure usage. This is an
    exciting, ambitious, project to assemble different sources using Linked Data to
    enable a new kind of publication made up of a portfolio of assets.
  5. Preservation Planning Ontology is a
    proposal for using Linked Data in the planning of digital preservation by
    archives. It has been developed by Artefactual Systems, the Canadian company
    behind the AtoM and Archivematica software. This made the shortlist as it is a good
    example of a ‘behind the scenes’ management use of Linked Data to make
    preservation workflows easier.

A selection of other entries:

Public Domain City
extracts curious images from digitised content. This is similar to BL Labs’
Mechanical Curator, a way of mining digitised books for interesting images and
making them available to social media to improve the profile and use of a
collection.

Project Mosul uses
Linked Data to digitally recreate damaged archaeological heritage from Iraq. A
good example of using this technology to protect and recreate heritage damaged
in conflict and disaster.

The Muninn Project
combines 3D visualisations and printing using Linked Data taken from First
World War source material.

LOD Stories is a
way of creating story maps between different pots of data about art and
visualising the results. The project is a good example of the need to make
Linked Data more appealing and useful, in this case by building ‘family trees’
of information about subjects to create picture narratives.

Get your coins out of your pocket is a Linked Data engine
about Roman coinage and the stories it
has to tell – geographically and temporally. The project uses nodegoat as an
engine for volunteers to map useful information: http://nodegoat.net/.

Graphity is a
Danish project to improve access to historical Danish digitised newspapers and
to enhance them with maps and other content using Linked Data.

Dutch Ships and Sailors brings together multiple historical
data sources and uses Linked
Data to make them searchable.

Corbicula is a way
of automating the extraction of data from collection management systems and
publishing it as Linked Data.

[photo: delegates at the summit]

Day two sessions

Day two sessions focused on the future. A key session led by
Richard Wallis explained how Google is moving from a page ranking approach to a
triple confidence assertion approach to generating search results. The way in
which Google generates its results will therefore move closer to the LOD method
of attributing significance to results.

Highlights

  • Need for a vendor manifesto to encourage systems
    vendors, such as Ex Libris, to build LOD into their systems (Corey Harper of New
    York University proposed this and is working closely with Ex Libris to bring
    this about)
  • Depositing APIs/documentation for maximum re-use
    (APIs are often a weak link – adoption of LOD won’t happen if services break or
    are unreliable)
  • Uses identified (mining digitised newspaper
    archives was cited)
  • Potential piggy-backing from Big Pharma
    investment in Big Data (massive investment by drugs companies to crunch huge
    quantities of data – how far can the heritage sector utilise even a fraction of
    that?)
  • Need to validate LOD: the quality issue – need
    for an assertion testing service (LOD won’t be used if its quality is
    questionable. Do curators (traditional guardians of quality) manage this?)
  • Training in Linked Data needs to be addressed
  • Need to encourage fundraising and make LOD
    sustainable: what are we going to do with LOD in the next ten years? (Will the
    test of the success of Linked Open Data be if the term drops out of use when we
    are all doing it without noticing? Will 5 Star Linked Data be realised? http://5stardata.info/)

Summary

There were several key learning points from this conference:

  • The divide between technical experts and policy
    and decision makers remains significant: more work is needed to provide use
    cases and examples of improved efficiencies or innovative public engagement
    opportunities that the technology provides
  • The re-use and publication of Linked Data is
    becoming important and this brings challenges in terms of IPR, reliability of
    APIs and quality of data
  • Easy to use tools and widgets will help spread
    the use of Linked Data, and avoid complicated and unsustainable technical
    solutions that depend on project funding
  • Working with vendors to incorporate Linked Data
    tools in library and archive systems will speed its adoption
  • The Linked Data community ought to work towards
    the day Linked Data is business as usual and the term goes out of use