RDF Cube Schema

Zazuko Document

More details about this document
History:
Commit history
Editors:
Thomas Bergwinkl
Adrian Gschwend
Bart van Leeuwen
Michael Luggen
Feedback:
GitHub zazuko/rdf-cube-schema (pull requests, new issue, open issues)

Abstract

A good amount of data in organizations is maintained in tables with multiple columns. Typically you can think of a multitude of Excel or CSV tables. This data often has a dimension which describes the time, sometimes other dimensions for classifying the data, and in most cases some actual observed values or counts.

Such tables hold the data in a structured form, but most of the time the information needed to understand the columns, and the metadata needed to create useful representations in charts and visualizations, is missing.

By creating Cubes, you as data provider and domain specialist can augment and annotate your data with everything necessary to understand the input data, directly in the dataset to be published. Fully annotated Cubes can also be used to visualize your data with proper tooling.

1. Issue Summary

There are no issues listed in this specification.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document.

This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

2. RDF Cube Schema: Core

This section describes the RDF Cube Schema model.

The RDF Cube Schema defines a minimal set of classes and properties necessary to represent multi-dimensional arrays of data in RDF, as defined in RDF 1.1 Concepts and Abstract Syntax.

We describe the model, an elaborate example, and scripts to validate observations based on the constraint (SHACL shape) provided.

2.1 Namespaces and Prefixes

2.1.1 RDF Cube Schema

PREFIX    IRI                          Description
cube      https://cube.link/           RDF Cube Schema.
meta      https://cube.link/meta/      RDF Cube Schema metadata extension.
relation  https://cube.link/relation/  RDF Cube Schema metadata extension.

2.1.2 External

PREFIX  IRI                                    Description
schema  http://schema.org/                     To describe basic properties.
sh      http://www.w3.org/ns/shacl#            Inherited from the RDF Cube Schema for constraints.
qudt    http://qudt.org/vocab/                 Describes scales of measure.
unit    http://qudt.org/vocab/unit/            Describes units of values.
time    http://www.w3.org/2006/time#           A time description ontology.
geo     http://www.opengis.net/ont/geosparql#  OGC GeoSPARQL 1.0.
xsd     http://www.w3.org/2001/XMLSchema#      XML Schema Datatypes.
skos    http://www.w3.org/2004/02/skos/core#   SKOS Simple Knowledge Organization System.

2.2 Core Schema

The core schema is very simple and almost unconstrained from an RDF perspective. This ensures it can be used in many ways and does not restrict its usefulness due to too rigorous definitions.

Basic RDF Cube Schema structure

2.2.1 Classes

Four classes are defined in the RDF Cube Schema.

2.2.1.1 cube:Cube

Represents the entry point for a collection of one or more observation sets, conforming to some common dimensional structure.

2.2.1.2 cube:Constraint

Specifies constraints that need to be met by the Cube. A Constraint is optional but recommended. It is used to:

  • Define how data (Observations) in a Cube can be validated.
  • Add Cube-specific metadata (custom labels, translations to other languages, etc.).

For more information, see RDF Cube Schema: Constraints.

2.2.1.3 cube:Observation

A single observation in the cube may have one or more associated dimensions. An Observation can appear in one or more ObservationSets.

2.2.1.4 cube:ObservationSet

An ObservationSet is a structure that acts as a container for multiple Observations. It can be used to group any set of Observations, as long as they use the same dimensions. No stronger semantics are attached to this set, on purpose, so that it can be used in almost any scenario. A cube can have one or more ObservationSets and an Observation can appear in multiple ObservationSets.

2.2.2 Datatypes

2.2.2.1 cube:Undefined

Marks an observed value that is not defined. For more information, see NULL and Empty values.

2.2.3 Properties

The resources described by the various classes are connected by a small set of properties.

Observations can be connected to an observer

2.2.3.1 cube:observationSet

Connects a cube with a set of observations.

2.2.3.2 cube:observationConstraint

Connects a cube with a constraint for metadata and validation.

2.2.3.3 cube:observation

Connects a set of observations with a single observation.

2.2.3.4 cube:observedBy

Connects an observation with the agent that created the observation. The agent can be a person, organization, device, or software. A description of the method to gather the data could be attached to the agent.

2.3 Dimensions

A dimension is a structure that categorizes facts and measures to enable users to answer business questions. Commonly used dimensions are people, products, place, and time (Source: Wikidata).

In RDF Cube Schema, facts, measures, and categories are all considered dimensions.

All Observations of a cube need to provide the same set of dimensions; dimensions cannot be optional. This ensures that cubes can be queried efficiently.
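The rule above can be checked mechanically. The following is a minimal sketch in plain Python, not part of the specification: triples are modelled as simple tuples, every property other than rdf:type is treated as a dimension (an assumption made for brevity), and all ex: names are hypothetical.

```python
# Triples as (subject, predicate, object) tuples; all ex: names are illustrative.
TRIPLES = [
    ("ex:obs1", "rdf:type", "cube:Observation"),
    ("ex:obs1", "ex:date", "2019-01-01"),
    ("ex:obs1", "ex:temperature", "20.5"),
    ("ex:obs2", "rdf:type", "cube:Observation"),
    ("ex:obs2", "ex:date", "2019-01-02"),
]


def dimensions_by_observation(triples):
    """Map each cube:Observation to the set of properties (dimensions) it uses."""
    observations = {s for s, p, o in triples
                    if p == "rdf:type" and o == "cube:Observation"}
    dims = {obs: set() for obs in observations}
    for s, p, o in triples:
        if s in observations and p != "rdf:type":
            dims[s].add(p)
    return dims


def missing_dimensions(triples):
    """Return {observation: dimensions it lacks} relative to the union of all dimensions."""
    dims = dimensions_by_observation(triples)
    all_dims = set().union(*dims.values())
    return {obs: all_dims - used for obs, used in dims.items() if all_dims - used}


# ex:obs2 does not provide ex:temperature, so it is reported.
print(missing_dimensions(TRIPLES))
```

An empty result means every observation provides the same set of dimensions.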

Unlike other RDF vocabularies in that domain, there is no specific class for a dimension. Creating a new RDF Cube Schema dimension would be the same as defining a new RDF Property.

This encourages re-use and makes it much easier to cherry-pick existing RDF properties and use them as dimensions. Obvious examples are temporal properties like schema:validFrom, dcterms:date, dcterms:temporal, etc.

In general, any RDF Property can be considered for describing a dimension, except for heavily constrained properties that might lead to unwanted conclusions (inference) by reasoners.

Note that choosing a particular dimension has implications. When querying cubes via SPARQL, spatial and temporal dimensions might only be filtered properly if the datatype used (e.g. xsd:date, geo:wktLiteral) is supported and optimized by the SPARQL endpoint. This is often not the case for fragment datatypes such as xsd:gYear.

It has to be ensured that properties are not attached at the wrong level. Spatial dimensions, for example, are most likely not attached to the observation directly but to an instance of a dimension referenced in the observation. This is the case when the observation refers to a static location where the observation is made, e.g. a sensor location. When, however, the actual location is the observation, it can be recorded as a dimension.

Instances of a dimension can be RDF literals with data types (sometimes called typed literals) or IRIs. Each dimension of an observation must have exactly one value: either a single literal or a single IRI.

Language-tagged literals and all other (meta) data should be modeled as terms/concepts. For this purpose, IRIs should be used and the literal(s) attached to that particular instance of a term/concept. This can be done e.g. by using SKOS (see the SKOS Simple Knowledge Organization System Primer, https://www.w3.org/TR/skos-primer/) or schema.org DefinedTerm. As shown in the following example, a typical cube structure is a combination of dimensions with typed literals attached to the "observation" itself and dimensions that refer to concept groups via IRIs.

An Observation often combines dimensions of typed literals with dimensions that point to IRIs

In RDF 1.1 Turtle syntax, the observation above looks like this:

room1 is an IRI that has labels attached as language-tagged strings.

Nesting of relations can be expressed in a machine-readable form as well but is not part of the core RDF Cube Schema.

2.3.1 NULL and Empty values

In RDF Cube Schema, all dimensions are mandatory for a cube. If a value could not be measured, it should be expressed as such.

There is no generic "built-in" way to solve this in RDF. For some numeric datatypes, XML and thus RDF defines "not a number" (NaN) as a value. According to the specs, this is only valid for xsd:float and xsd:double and not for xsd:decimal and xsd:integer.

To provide a generic solution that works for all numbers and IRIs, RDF Cube Schema provides cube:Undefined. The following example shows how to use it in cube:Observation and in the attached shape:

If it is necessary to state why the value is cube:Undefined, annotations should be used.
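For consumers, the practical consequence is that undefined values have to be skipped explicitly when computing with a dimension. The following is a minimal sketch in plain Python under stated assumptions: literals are modelled as (lexical form, datatype IRI) pairs, cube:Undefined is expanded from the cube prefix to https://cube.link/Undefined, an undefined value is assumed to carry an empty lexical form, and treating NaN like an undefined value is a choice made for this sketch only.

```python
import math

CUBE_UNDEFINED = "https://cube.link/Undefined"
XSD = "http://www.w3.org/2001/XMLSchema#"


def to_number(lexical, datatype):
    """Return a float for a numeric literal, or None when the value is undefined."""
    if datatype == CUBE_UNDEFINED:
        return None
    value = float(lexical)
    # NaN is only a legal value for xsd:float and xsd:double.
    if math.isnan(value):
        if datatype not in (XSD + "float", XSD + "double"):
            raise ValueError(f"NaN is not valid for {datatype}")
        return None
    return value


def mean_defined(literals):
    """Average the defined values of a dimension, ignoring undefined ones."""
    values = [v for v in (to_number(*lit) for lit in literals) if v is not None]
    return sum(values) / len(values) if values else None


# Hypothetical humidity values; the second one could not be measured.
humidity = [
    ("20.0", XSD + "decimal"),
    ("", CUBE_UNDEFINED),
    ("22.0", XSD + "decimal"),
]
print(mean_defined(humidity))
```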

2.4 Metadata

From a high-level point of view, the core classes and properties are enough to publish a valid cube. However, the absence of any metadata might make it hard for consumers to understand what the data is about.

2.4.1 Cube Description

To add the title and a short description of the cube, add the following properties directly on an instance of cube:Cube.

2.4.1.1 schema:name

A descriptive name of the Cube. The name can be multilingual.

2.4.1.2 schema:description

A description of the Cube. The description can be multilingual.

2.4.1.3 other properties

You are free to add other properties related to your Cube or the publication process of your Cube directly on the instance of the Cube.

3. RDF Cube Schema: Constraints

From a pure publishing point of view and according to the Open World Model, publishing observations using the core RDF Cube Schema is enough. However, in reality, one might want to use the data for other purposes as well, like visualizing it in web applications and other publications. To be useful, such tools might require additional metadata and cubes that adhere to certain constraints.

RDF Cube Schema supports attaching a "constraint" to a cube. The constraints themselves are expressed using the Shapes Constraint Language (SHACL).

Providing constraints for a cube facilitates the documentation, interpretation, and validation of the cube for tooling and libraries. Only one constraint is allowed per cube. This results in a very low overhead for the documentation: the cube and its observations do not need any documentation on the data level.

It is up to the consumer of the data to decide how the constraint is used. If provided, it should be possible to validate the observations in the cube with it.

3.1 Constraints

3.1.1 Cube Constraints

A cube:Constraint is also a sh:NodeShape. Shapes Constraint Language (SHACL) is used to restrict the cube to a particular structure. The Constraint presented in this section can be used to validate all Observations in the cube.

If additional restrictions are needed, they can be provided as further Shapes Constraint Language (SHACL) shapes.

The following snippet defines a new constraint, it applies to all cube:Observation:

3.1.2 Dimension Constraints

Dimension Constraints are linked to Cube Constraints through the sh:property property. A Dimension Constraint uses the Shapes Constraint Language (SHACL) vocabulary to express the constraints for the specific dimension.

3.1.2.1 sh:path

The dimension/property is referenced using sh:path. The value of the path must be a single value and not an RDF list.

3.1.2.2 additional constraints

Additional constraints can be added according to the SHACL specification, for example, data types, min/max values, etc. The following snippet defines a dimension (property) dc:date with a literal value (xsd:dateTime):

3.1.2.3 Usage of code lists

Dimensions that point to objects like code lists (e.g. taxonomies represented in vocabularies like SKOS, see the SKOS Simple Knowledge Organization System Primer) can be expressed as well:

Dimensions can have further types that are not defined by this vocabulary, for example to support other kinds of Dimensions (e.g. Precision, Statistical Measures) or additional Attributes to filter on that are not part of the key defined by the cube:KeyDimension dimensions.

4. RDF Cube Schema: User eXperience Extension

To facilitate the visualization of RDF Cubes or any other user-experience-related activity, it is possible to extend the Constraints with additional metadata that describes the characteristics of the cube and its dimensions. When this information is provided in the Constraints, tools used for displaying the data in the cube do not need to process and interpret the actual data in the cube to configure the visualization.

4.1 Dimensions

To be able to understand the nature of a dimension we can type the dimension in the constraints. In general, we have at least two mandatory types per cube, the cube:MeasureDimension and the cube:KeyDimension.

4.1.1 Classes

The following classes are used to define the various visualization elements of the RDF Cube Schema.

4.1.1.1 cube:KeyDimension

The KeyDimension tags one or multiple dimensions which together uniquely identify an observation. You can think of them as the key in a relational database.
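Because the key dimensions act like a relational key, a consumer can check that no two observations share the same combination of key values. The following is a minimal sketch in plain Python; the observations, the ex: names, and the choice of key dimensions are hypothetical.

```python
from collections import Counter

# Dimensions typed as cube:KeyDimension in a hypothetical constraint.
KEY_DIMENSIONS = ("ex:date", "ex:room")

# Hypothetical observations, each mapping a dimension to its single value.
OBSERVATIONS = {
    "ex:obs1": {"ex:date": "2019-01-01", "ex:room": "ex:room1", "ex:temperature": "20.5"},
    "ex:obs2": {"ex:date": "2019-01-01", "ex:room": "ex:room2", "ex:temperature": "19.0"},
    "ex:obs3": {"ex:date": "2019-01-01", "ex:room": "ex:room1", "ex:temperature": "21.0"},
}


def duplicate_keys(observations, key_dimensions):
    """Return the key tuples that are shared by more than one observation."""
    counts = Counter(
        tuple(values[dim] for dim in key_dimensions)
        for values in observations.values()
    )
    return sorted(key for key, n in counts.items() if n > 1)


# ex:obs1 and ex:obs3 share the same date and room, so that key is reported.
print(duplicate_keys(OBSERVATIONS, KEY_DIMENSIONS))
```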

4.1.1.2 cube:MeasureDimension

The MeasureDimension tags at least one dimension, and potentially several, that holds the actual measurement or statistical count attached to an observation.

4.1.1.3 cube:SharedDimension

To distinguish Dimensions that are defined inside a Cube from Dimensions that are used in multiple cubes, the type cube:SharedDimension is provided. Every dimension except the ones typed as a cube:MeasureDimension can be a cube:SharedDimension.

4.1.2 Properties

The following properties are used to define the various visualization elements of the RDF Cube Schema.

4.1.2.1 schema:name

A descriptive name of the Dimension. The name can be multilingual.

4.1.2.2 schema:description

A description of the Dimension. The description can be multilingual.

4.1.2.3 qudt:unit

To describe the unit of the values in a dimension the respective qudt:Unit instance can be attached to a Dimension Constraint with the qudt:unit property.

4.1.2.4 qudt:scaleType

To provide more information on the statistical property scale of measure it can be described by qudt:NominalScale, qudt:OrdinalScale, qudt:IntervalScale or qudt:RatioScale which is attached through qudt:scaleType to the Dimension Constraint.

The different scale types hint about features that can be used for visualization properties:

  • qudt:NominalScale: Expects the dimension to be either Resource (connected by a URI) or a value with any kind of dataType.
  • qudt:OrdinalScale: Expects the dimension to be either:
    • A Resource (connected by a URI): In this case, the Resource should provide schema:position on the elements, which provides a lexical ordering (best to use integer numbers). Furthermore, the values listed in sh:in of the constraint can be expected to be in the correct order.
    • Or a value where the lexical order has a meaning (e.g. 1st, 2nd, 3rd).
  • qudt:IntervalScale: Expects the dimension to be values with a numeric dataType and the unit not to contradict the correct Scale.
  • qudt:RatioScale: Expects the dimension to be values with a numeric dataType and the unit not to contradict the correct Scale.
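The expectations listed above lend themselves to a simple consistency check on the Dimension Constraints. The following is a minimal sketch in plain Python that only covers the interval and ratio scales. The set of datatypes treated as numeric is an assumption and not exhaustive, and the constraints and ex: names are hypothetical.

```python
# Datatypes treated as numeric here; this set is an assumption and not exhaustive.
NUMERIC_DATATYPES = {"xsd:integer", "xsd:decimal", "xsd:float", "xsd:double"}
NUMERIC_SCALES = {"qudt:IntervalScale", "qudt:RatioScale"}

# Hypothetical Dimension Constraints, keyed by their sh:path.
DIMENSION_CONSTRAINTS = {
    "ex:temperature": {"qudt:scaleType": "qudt:IntervalScale", "sh:datatype": "xsd:decimal"},
    "ex:room": {"qudt:scaleType": "qudt:NominalScale", "sh:nodeKind": "sh:IRI"},
    "ex:rank": {"qudt:scaleType": "qudt:RatioScale", "sh:datatype": "xsd:string"},
}


def scale_type_problems(constraints):
    """Return the paths whose interval or ratio scale lacks a numeric sh:datatype."""
    return sorted(
        path for path, c in constraints.items()
        if c.get("qudt:scaleType") in NUMERIC_SCALES
        and c.get("sh:datatype") not in NUMERIC_DATATYPES
    )


# ex:rank claims a ratio scale but uses xsd:string, so it is reported.
print(scale_type_problems(DIMENSION_CONSTRAINTS))
```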

4.1.2.5 sh:datatype

To describe the datatype used by the dimension, attach sh:datatype to the Dimension Constraint. Be aware that this implies the presence of a typed literal as the dimension value.

4.1.2.6 meta:dataKind (temporal / spatial)

To express that the dimension provides a specific kind of data which is necessary to select the correct visual representation you can add a meta:dataKind resource with the following possible structures:

  • schema:GeoCoordinates: To hint that the dimension does provide Resources with latitude and longitude which can be shown on a map.
  • schema:GeoShape: To hint that the dimension does provide Resources that have a shape that can be shown on a map.

  • time:GeneralDateTimeDescription: To hint that the dimension does provide Resources that can be shown on a timeline.

    It is further possible to add time:unitType to hint about the precision in which the dimension should be presented. A time:TemporalUnit is expected: time:unitYear, time:unitMonth, time:unitWeek, time:unitDay, time:unitHour, time:unitMinute and time:unitSecond.

4.1.2.7 sh:order

The sh:order can be used to indicate the relative order of the dimension, for use in visualizations. It should be used according to the specification by using ascending order, for example, so that properties with smaller order are placed above (or to the left) of properties with a larger order.
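A tool that renders a cube as a table could derive its column order as follows. This is a minimal sketch in plain Python; the constraints and ex: names are hypothetical, and placing dimensions without sh:order last is a choice made for this sketch, not something the specification prescribes.

```python
# Hypothetical Dimension Constraints, keyed by their sh:path.
DIMENSION_CONSTRAINTS = {
    "ex:temperature": {"sh:order": 3},
    "ex:date": {"sh:order": 1},
    "ex:comment": {},
    "ex:room": {"sh:order": 2},
}


def ordered_paths(constraints):
    """Sort dimension paths by ascending sh:order; unordered dimensions go last."""
    def sort_key(path):
        order = constraints[path].get("sh:order")
        return (order is None, 0 if order is None else order)
    return sorted(constraints, key=sort_key)


print(ordered_paths(DIMENSION_CONSTRAINTS))
```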

5. RDF Cube Schema: Advanced Topics

The previous sections described the properties of basic cubes. There are, however, situations where more complex observations or relations between observations need to be expressed. This section provides a set of best practices for various subjects.

5.1 Version History of Cubes

To be able to have a continuous history of a published cube there is a meta construct that can be put around a cube, describing a line of the history of a cube based on a schema:CreativeWork.

Each version is attached to the version history through schema:hasPart as a fully described cube that can be interpreted independently. It is expected that the cubes in the same history line do not change the count of dimensions. All the other descriptions can change.

A status of the cube, like Draft or Published, can be added to the cube through schema:creativeWorkStatus. The status is expected to be a schema:DefinedTerm.

To record a version, the schema:version property can be used.

A cube can be invalidated or unlisted by adding schema:expires with the expiry date to the cube itself.
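A consumer that wants to show only the current cube of a history line could combine schema:expires and schema:version as follows. This is a minimal sketch in plain Python under stated assumptions: the history is modelled as a list of dictionaries, versions are assumed to be integers, a cube is assumed to be unlisted from its expiry date onwards, and all ex: names are hypothetical.

```python
from datetime import date

# Hypothetical cubes attached to one version history through schema:hasPart.
HISTORY = [
    {"id": "ex:cube/1", "schema:version": 1, "schema:expires": date(2020, 1, 1)},
    {"id": "ex:cube/2", "schema:version": 2, "schema:expires": None},
]


def listed_cubes(history, today):
    """Return the cubes that have no expiry date or whose expiry date is still ahead."""
    return [
        cube for cube in history
        if cube.get("schema:expires") is None or cube["schema:expires"] > today
    ]


def latest_cube(history, today):
    """Return the listed cube with the highest schema:version, or None."""
    candidates = listed_cubes(history, today)
    return max(candidates, key=lambda cube: cube["schema:version"]) if candidates else None


print(latest_cube(HISTORY, date(2021, 1, 1))["id"])
```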

5.2 Relations between quantitative values

Observations may hold dimensions that are quantitatively related to each other. Expressing this on the observation with blank nodes creates properly structured RDF, but causes performance and complexity issues when querying the Cube.

To overcome this limitation, the relation can be expressed on the relevant Dimension Constraints. There is one sh:property definition per dimension, so the lookup only needs to be done once and is valid for all observations of that particular cube.

A relation between dimensions is described only with the cube and meta vocabularies. The relation classes themselves can be extended based on specific use cases. The controlled vocabulary introduced with the namespace PREFIX relation: <https://cube.link/relation/> provides the most common relation classes and is proposed as a guideline.

This is an advanced usage of the cube and increases its complexity. But it gives the expressiveness needed to describe the complex relationship between data in a machine-processable way.

5.2.1 Classes

5.2.1.1 meta:Relation

A meta:Relation resource is used to express the relation between different dimensions. The nature of the relationship is determined by the properties used. A meta:Relation is linked to a Dimension Constraint through a meta:relation property. See this example.

5.2.2 Properties

5.2.2.1 meta:relation

This property is used on a Dimension Constraint to express a relation with other properties through a meta:Relation instance, the nature of this relationship is determined by the properties used on the instance. See this example.

5.3 Hierarchies

Note

Observations can be structured in hierarchies inside cubes. It is possible to define hierarchies which reside inside one dimension (e.g. categories, classifications) or also hierarchies which span over multiple dimensions. It is also possible to have hierarchies using external concepts.

To allow the reuse of existing hierarchies described with e.g. schema:hasPart / schema:isPartOf, SKOS (see the SKOS Simple Knowledge Organization System Primer, https://www.w3.org/TR/skos-primer/) or similar ontologies, the following solution simply annotates one or multiple possible hierarchies.

Hierarchy definitions are always done top-down with one or multiple roots, following predicates of your choosing, and the leaves must be the final observations inside one dimension.

The hierarchy annotation is attached to a cube dimension as a meta:Hierarchy through meta:inHierarchy (similar to a meta:Relation). It holds a name describing the hierarchy as such with schema:name and at least one, but potentially multiple, meta:hierarchyRoot. Finally, the connection between the root nodes and the first level below in the hierarchy is attached through meta:nextInHierarchy.

The simplest example above puts the two concepts countries and cantons in relation.

5.3.1 sh:path (connecting levels of a hierarchy)

With the use of Property Paths (sh:path), the connection between two levels in the hierarchy is expressed.

As a guideline, we suggest supporting at least one-step Predicate Paths and inverse one-step Predicate Paths.

More complex paths will depend on the support of the used applications.
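To illustrate what supporting one-step and inverse one-step Predicate Paths means for a consumer, the following is a minimal sketch in plain Python. Triples are modelled as tuples, an inverse path is written as the pair ("inverse", predicate), which is a notation invented for this sketch, and the places and their relations are hypothetical sample data.

```python
# Hypothetical data: a country, two cantons, and one municipality.
TRIPLES = [
    ("ex:Switzerland", "schema:containsPlace", "ex:Bern"),
    ("ex:Switzerland", "schema:containsPlace", "ex:Zurich"),
    ("ex:Thun", "schema:containedInPlace", "ex:Bern"),
]


def follow(triples, node, path):
    """Follow a one-step path from node; path is a predicate or ("inverse", predicate)."""
    if isinstance(path, tuple) and path[0] == "inverse":
        return sorted(s for s, p, o in triples if p == path[1] and o == node)
    return sorted(o for s, p, o in triples if s == node and p == path)


def expand(triples, roots, level_paths):
    """Return the hierarchy levels: the roots, then the nodes reached by each path in turn."""
    levels = [sorted(roots)]
    for path in level_paths:
        reached = {n for node in levels[-1] for n in follow(triples, node, path)}
        levels.append(sorted(reached))
    return levels


# First level: forward path from the root; second level: inverse path from the cantons.
for level in expand(TRIPLES, ["ex:Switzerland"],
                    ["schema:containsPlace", ("inverse", "schema:containedInPlace")]):
    print(level)
```

Each entry in level_paths plays the role of the sh:path of one meta:nextInHierarchy step, applied to the nodes of the level above.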

5.3.2 sh:targetClass (differentiating concepts of a hierarchy level)

If the predicate used in sh:path is not distinct enough, it is possible to add sh:targetClass to additionally specify the class of the resources that the sh:path points to.

5.3.3 Nested Levels

With the use of meta:nextInHierarchy it is possible to extend the number of levels indefinitely. Once a path points to an instance of a concept which is attached by the defined dimension, the hierarchy for this element is complete. Therefore, the number of levels can differ between elements.

6. Tools and Samples

6.1 Generating Constraints

It is possible to generate a minimal constraint given a Cube and a set of Observations.

SPARQL CONSTRUCT queries will be provided in this repository.
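Until those queries are available, the idea can be illustrated in plain Python. The following is a minimal sketch and not the tooling of the repository: it derives one property shape per dimension from a set of hypothetical observations, modelling typed literals as (lexical form, datatype) pairs and IRIs as plain strings. Emitting sh:minCount and sh:maxCount of 1 reflects the rules that dimensions are mandatory and single-valued; representing shapes as dictionaries is a simplification.

```python
# Hypothetical observations; typed literals are (lexical, datatype) pairs, IRIs are strings.
OBSERVATIONS = {
    "ex:obs1": {"ex:date": ("2019-01-01", "xsd:date"), "ex:room": "ex:room1"},
    "ex:obs2": {"ex:date": ("2019-01-02", "xsd:date"), "ex:room": "ex:room2"},
}


def minimal_property_shapes(observations):
    """Derive one property shape (as a dict) per dimension used in the observations."""
    kinds = {}
    for values in observations.values():
        for path, value in values.items():
            if isinstance(value, tuple):
                kind = ("sh:datatype", value[1])
            else:
                kind = ("sh:nodeKind", "sh:IRI")
            kinds.setdefault(path, set()).add(kind)

    shapes = []
    for path in sorted(kinds):
        shape = {"sh:path": path, "sh:minCount": 1, "sh:maxCount": 1}
        # Only state a datatype or node kind when all observed values agree on it.
        if len(kinds[path]) == 1:
            key, val = next(iter(kinds[path]))
            shape[key] = val
        shapes.append(shape)
    return shapes


for shape in minimal_property_shapes(OBSERVATIONS):
    print(shape)
```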

6.2 Example Cube

An example Cube is specified in cube.ttl. The cube provides a constraint in constraint.ttl.

6.2.1 Validate the cube

Editor's note: Work in progress
This section is work in progress, the wording and terminology still need some thought.

The validation process of the cube can be divided into three different aspects:

  1. The cube structure and contents
  2. The structure of the observations
  3. The integrity of the constraints

We provide a tool to do the actual validations in the repository.

Install the package dependencies: npm i

6.2.1.1 The cube structure and contents

Even though RDF Cube Schema is a very lightweight vocabulary, its structure needs to conform to a minimal set of rules to be considered a valid Cube. We provide a very lightweight constraint that can be used to check this. The constraint can be found in the validation directory of the repository and is called basic-cube-constraint.ttl.

6.2.1.2 The structure of the observations

When a cube provides an optional observation constraint through the cube:observationConstraint property, this can be tested as well.

6.2.1.3 The integrity of the constraints

If constraints are used to provide guidance for interaction with the cube, e.g. visualizations, it is important that the constraints themselves conform to a structure that the interaction can deal with.

The standalone constraint can be extended to meet the specific requirements for the intended interaction.

A.1 RDF Data Cube Vocabulary

The RDF Data Cube Vocabulary is probably the oldest vocabulary in the domain of RDF for representing cubes. The authors of this document used RDF Data Cubes extensively in the past and ran into multiple issues with it.

The authors of this specification are grateful for the work done by the original authors of the RDF Data Cube Vocabulary specification. This work would not have been possible without it and some parts look pretty much like the RDF Data Cube Vocabulary.

It was considered to either clarify or update the RDF Data Cube Vocabulary specification. For the sake of simplicity, it was decided to start from scratch.

A.1.1 Issues with RDF Data Cube Vocabulary

  • The metadata model is overly complex.
    • Many additional nodes are introduced that make querying the data in the real world overly complex.
    • Generating proper metadata from basic cubes is not easy, which increases complexity for automated pipelines.
  • There is a mix of forward- and backward-linking within the metadata model.
  • Follow your nose is often not possible.
  • There is more than one way to do it. Different people interpret the spec differently, which makes it very hard to write libraries that consume generic RDF Data Cubes.
  • There is a clear focus on SDMX, which introduces too rigorous restrictions and/or examples for use-cases outside the statistical domain.
  • Re-use of dimensions is not very common in the RDF Data Cube vocabulary, which makes it much harder to compare data across data providers.

There are at least two efforts that extend the RDF Data Cube Vocabulary to address some of its limitations:

Both efforts could likely be addressed within the RDF Cube Schema approach; this needs to be validated by interested parties.

A.2 SSN

The Semantic Sensor Network Ontology defines a simplified model for describing observations from a sensor in RDF.

The Observation model is at least inspired by the RDF Data Cube Vocabulary but it is not very useful for use-cases outside of sensor networks.

It should be relatively easy to replace SSN observations with the RDF Cube Schema.

B. References

B.1 Informative references

[qb4st]
QB4ST: RDF Data Cube extensions for spatio-temporal components. Rob Atkinson. W3C. 28 September 2017. W3C Working Group Note. URL: https://www.w3.org/TR/qb4st/
[rdf11-concepts]
RDF 1.1 Concepts and Abstract Syntax. Richard Cyganiak; David Wood; Markus Lanthaler. W3C. 25 February 2014. W3C Recommendation. URL: https://www.w3.org/TR/rdf11-concepts/
[shacl]
Shapes Constraint Language (SHACL). Holger Knublauch; Dimitris Kontokostas. W3C. 20 July 2017. W3C Recommendation. URL: https://www.w3.org/TR/shacl/
[skos-primer]
SKOS Simple Knowledge Organization System Primer. Antoine Isaac; Ed Summers. W3C. 18 August 2009. W3C Working Group Note. URL: https://www.w3.org/TR/skos-primer/
[turtle]
RDF 1.1 Turtle. Eric Prud'hommeaux; Gavin Carothers. W3C. 25 February 2014. W3C Recommendation. URL: https://www.w3.org/TR/turtle/
[vocab-data-cube]
The RDF Data Cube Vocabulary. Richard Cyganiak; Dave Reynolds. W3C. 16 January 2014. W3C Recommendation. URL: https://www.w3.org/TR/vocab-data-cube/
[vocab-ssn]
Semantic Sensor Network Ontology. Armin Haller; Krzysztof Janowicz; Simon Cox; Danh Le Phuoc; Kerry Taylor; Maxime Lefrançois. W3C. 19 October 2017. W3C Recommendation. URL: https://www.w3.org/TR/vocab-ssn/