The widely accepted FAIR principles require that data be both human- and machine-readable. Using the web to provide access to structured data is now commonplace, and the mass-market web is beginning to adopt open standards for structured data, which are feeding back into the technical community through initiatives such as ‘Science on Schema.org’. To maximise semantic interoperability, particularly across different domains, we depend on shared terminology that can be understood by both humans and machines, used by multiple communities, and ideally available in multiple languages.
As more structured data goes online, the codelists and reference vocabularies that support it are proliferating. A lack of mechanisms for discovering existing vocabularies, and limited support for their shared development and governance, mean there is little motivation for vocabulary reuse. This leads to duplication of effort, resulting in multiple vocabularies with similar scope. The quality of these online vocabularies is highly variable. In this context, how does a user determine which vocabularies are trustworthy and fit for their purpose? Important considerations are:
- is the vocabulary developed by a recognised organisation?
- does it have scientifically valid definitions?
- are the definitions provided in a useful form?
- is the vocabulary aligned with related vocabularies?
- is there a plan for sustaining the vocabulary, covering both its semantics and its hosting arrangements, that will persist as long as the data that links to it?
Nevertheless, there are cases where it is necessary to manage a vocabulary locally, even if it has the same scope as existing vocabularies. Under these circumstances we need strategies and tools for vocabulary harmonization, in order to support interoperability between applications using them.
Guidelines are needed both for users, on which vocabularies are best to use, and for communities and terminology providers, on when to develop a new vocabulary and how to govern its development.
Note: For this session, we are using the word ‘vocabulary’ to mean any semantic asset containing terms and (usually) information about those terms. This includes value sets (aka: bag of terms or term list), concept sets, topics, vocabularies, glossaries, thesauri, concept maps, taxonomies, ontologies, and now of course knowledge graphs…
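To make the note above concrete, the sketch below shows a "vocabulary" in this minimal sense: a set of terms plus machine-readable information about each term, serialized as SKOS-style Turtle. The concept URIs, labels, and definitions are purely hypothetical examples; a real vocabulary would be built with an RDF library and published at stable URIs.

```python
# A minimal, hypothetical vocabulary: each term maps to
# (preferred label, definition, broader term URI or None).
concepts = {
    "http://example.org/vocab/basalt": (
        "basalt",
        "A fine-grained mafic igneous rock.",
        "http://example.org/vocab/igneous_rock",
    ),
    "http://example.org/vocab/igneous_rock": (
        "igneous rock",
        "Rock formed by the cooling of magma or lava.",
        None,
    ),
}

def to_skos_turtle(concepts):
    """Serialize the term dictionary as SKOS-style Turtle text."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
    for uri, (label, definition, broader) in concepts.items():
        lines.append(f"<{uri}> a skos:Concept ;")
        lines.append(f'    skos:prefLabel "{label}"@en ;')
        # End the definition with ';' if a broader-term triple follows.
        end = " ;" if broader else " ."
        lines.append(f'    skos:definition "{definition}"@en{end}')
        if broader:
            lines.append(f"    skos:broader <{broader}> .")
    return "\n".join(lines)

ttl = to_skos_turtle(concepts)
print(ttl)
```

Even this toy example carries the three ingredients that the session's quality questions probe: human-readable labels, scientifically meaningful definitions, and explicit links between terms.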
The Session will start with three short presentations to set the scene:
- Proliferation of “controlled” vocabularies: feature or bug: Simon Cox
- Vocabulary Pick'em: The Definitive List of Vocabulary Selection Criteria (Now What?): John Graybeal
- Perspectives from Vocabulary Services project: Adrian Burton
These presentations will be followed by breakout room discussions - pick the one you want, or suggest another.
- Can we develop a “5-star vocab” ranking similar to Tim Berners-Lee's Five Star Open Data scheme? (Lesley Wyborn)
- Can we develop guidelines for vocabulary mapping and harmonization? (Pier Luigi Buttigieg)
- Can we better utilise and extend the ESIP Community Ontology Repository? (Lewis McGibbney)
- Can we learn the essentials of vocabulary governance from the “Big Three” in Earth and environmental science: GCMD, CGI-IUGS and NERC? (Rowan Brownlee, Tyler Stevens, Natalia Atkins and Mark Rattenbury)
- Can we improve online vocab services? (Adrian Burton)
- Can we create a multi-disciplinary vocabulary space for physical samples? (Jens Klump, Kerstin Lehnert)
The breakouts will be followed by report back and determining any potential next steps within ESIP.
Desirable outcomes from this session:
- Greater awareness of the current proliferation of vocabularies.
- Better coordination of work on vocabularies/ontologies across ESIP and greater pull from the ESIP Community Ontology Repository.
- Communities forming to develop best practice guidelines.
Takeaways: TBD