Metadata Quality Assessment

The Metadata Quality Assessment (MQA) is an extension of the Factory Data Catalogue. It assesses the quality of the metadata of every new or changed dataset.

The UI for this component is integrated into the catalogue UI. The MQA evaluates a set of quality indicators, each of which is explained in the tables below. The results of the checks are stored using the Data Quality Vocabulary (DQV).

DQV is a specification of the W3C that is used to describe the quality of a dataset.

As accessibility can be volatile, the accessURL and downloadURL must be checked repeatedly. For this reason, the MQA regularly checks the accessibility of all distributions. Unlike the checks for the other indicators, this one has a longer runtime, because the distributions are checked via HTTP and each requested URL may respond slowly. The MQA schedules these checks so that each URL is re-examined within a few weeks of its last check.
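The re-check rule can be sketched as follows. This is only an illustration, not the MQA's actual scheduler: the three-week interval, the helper names `is_due` and `urls_due`, and the in-memory mapping of URLs to timestamps are assumptions, since the text only states that a URL is re-examined within a few weeks.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Assumed interval; the documentation only says "within a few weeks".
RECHECK_INTERVAL = timedelta(weeks=3)


def is_due(last_checked: Optional[datetime], now: datetime,
           interval: timedelta = RECHECK_INTERVAL) -> bool:
    """Return True if a URL was never checked or its last check is too old."""
    if last_checked is None:
        return True
    return now - last_checked >= interval


def urls_due(last_checked_by_url: Dict[str, Optional[datetime]],
             now: datetime) -> List[str]:
    """Select the URLs to re-check: never-checked first, then oldest check first."""
    due = [url for url, ts in last_checked_by_url.items() if is_due(ts, now)]
    return sorted(
        due,
        key=lambda url: (last_checked_by_url[url] is not None,
                         last_checked_by_url[url] or now),
    )
```

Ordering the due URLs by age of the last check keeps long-running HTTP checks from starving URLs that have waited longest.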

These measurements and metrics are subject to change during the project phase.

Assumptions

The MQA is based on the following assumption:

We believe that filling the DCAT-AP mandatory fields alone is not sufficient to provide high-quality metadata. For this reason, the evaluation also checks fields that are not specified as mandatory according to DCAT-AP. The exact fields that are checked are listed below.

Dimensions

This section describes all dimensions that the MQA examines in order to determine quality. The dimensions are derived from the FAIR principles.

Findability

The following table describes the metrics that help people and machines in finding datasets.

Indicator	Description	Metrics	Computed on
Keyword usage	Keywords directly support the search and thus increase the findability of the dataset.	The system checks whether keywords are defined. The number of keywords has no impact on the score.	Dataset `dcat:keyword`
Categories	Categories help users to explore datasets thematically.	It is checked whether one or more categories are assigned to the dataset.	Dataset `dcat:theme`
Geo search	Spatial information enables users to find the dataset with a geo-faceted search.	It is checked whether the property is set or not.	Dataset `dct:spatial`
Time-based search	Temporal information enables users to find the dataset with a time-based faceted search.	It is checked whether the property is set or not.	Dataset `dct:temporal`
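The "property is set" checks in the table above could look roughly like this. The sketch assumes the dataset metadata is available as a plain dictionary keyed by prefixed property names; the MQA itself works on RDF metadata, and the function name `findability_checks` is hypothetical.

```python
# Indicator names and properties as listed in the findability table.
FINDABILITY_PROPERTIES = {
    "Keyword usage": "dcat:keyword",
    "Categories": "dcat:theme",
    "Geo search": "dct:spatial",
    "Time-based search": "dct:temporal",
}


def findability_checks(dataset):
    """Map each indicator to True if its property has at least one non-empty value."""
    results = {}
    for indicator, prop in FINDABILITY_PROPERTIES.items():
        values = dataset.get(prop) or []
        if isinstance(values, str):
            values = [values]  # accept a single value as well as a list
        results[indicator] = any(str(value).strip() for value in values)
    return results
```

Note that the keyword check only tests for presence, in line with the rule that the number of keywords does not affect the score.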

Accessibility

The following table describes which metrics are used to determine whether access to the data referenced by the distributions is guaranteed.

Indicator	Description	Metrics	Computed on
AccessURL accessibility	The `AccessURL` is not necessarily a direct link to the data, but may refer to a URL that provides access to it.	The specified URL is checked for accessibility via an HTTP HEAD request.	Distribution `dcat:accessURL`
DownloadURL	The `downloadURL` is a direct link to the referenced data.	It is checked whether the property is set or not.	Distribution `dcat:downloadURL`
DownloadURL accessibility	If a `downloadURL` exists, the accessibility is checked.	The specified URL is checked for accessibility via an HTTP HEAD request.	Distribution `dcat:downloadURL`
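A HEAD-based accessibility check can be sketched with the Python standard library. The timeout, the function names and the rule that any 2xx or 3xx status counts as accessible are assumptions for illustration; the documentation does not specify which status codes the MQA accepts.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Send an HTTP HEAD request; return the status code, or None if no answer."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status (4xx/5xx)
    except (OSError, ValueError):
        return None  # DNS failure, refused connection, timeout or malformed URL


def is_accessible(status):
    """Assumed rule: any 2xx or 3xx status counts as accessible."""
    return status is not None and 200 <= status < 400
```

For example, `is_accessible(head_status(url))` yields a single boolean per `accessURL` or `downloadURL`. The `opener` parameter exists only so the function can be exercised without network access.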

Interoperability

The following table describes the metrics used to determine whether a distribution is considered interoperable. Under the assumption that several distributions of a dataset carry identical content, only the distribution with the highest number of points is used to calculate the dataset's points.

Indicator	Description	Metrics	Computed on
Format	This field specifies the file format of the distribution.	It is checked whether the property is set or not.	Distribution `dct:format`
Media type	This field specifies the media type of the distribution.	It is checked whether the property is set or not.	Distribution `dcat:mediaType`
Format / Media type vocabulary	Checks whether format and media type belong to a controlled vocabulary.	The format vocabulary can be found in the `data.europa.eu` GitLab repository.	Distribution `dct:format`, `dcat:mediaType`
Non-proprietary	Checks if the format of the distribution is non-proprietary.	The distribution is considered non-proprietary if the specified format is contained in the vocabulary.	Distribution `dct:format`
Machine-readable	Checks if the format of the distribution is machine-readable.	The distribution is considered machine-readable if the specified format is in the vocabulary.	Distribution `dct:format`
DCAT-AP compliance	DCAT-AP compliance is calculated across all sources and datasets available on a catalogue.	The metadata is validated against a set of SHACL shapes.
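The rule that only the best-scoring distribution counts can be expressed in a few lines. The mapping of distribution IDs to points and the function names are hypothetical; how the points per distribution are computed is not shown here.

```python
def best_distribution(points_by_distribution):
    """Return the ID of the distribution with the most points, or None if empty."""
    if not points_by_distribution:
        return None
    return max(points_by_distribution, key=points_by_distribution.get)


def dataset_interoperability_points(points_by_distribution):
    """Only the highest-scoring distribution contributes to the dataset's points."""
    return max(points_by_distribution.values(), default=0)
```

With `{"dist1": 85, "dist2": 80}` as input, only `dist1` and its 85 points would count for the dataset.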

Reusability

The following table describes which metrics are used to check the reusability of the data.

Indicator	Description	Metrics	Computed on
License information	A license is valuable information for the reuse of data.	It is checked whether the property is set or not.	Distribution `dct:license`
License vocabulary	Limits incorrect license information (e.g. incomplete CC licenses).	The MQA credits the usage of controlled vocabularies.	Distribution `dct:license`
Access restrictions	Indicates whether access to the data is public or restricted.	It is checked whether the property is set or not.	Dataset `dct:accessRights`
Access restrictions vocab	Use of a controlled vocabulary increases reusability.	It is checked whether the controlled vocabulary for access rights is used.	Dataset `dct:accessRights`
Contact point	Contains information on whom to address in case of questions regarding the data.	It is checked whether the property is set or not.	Dataset `dcat:contactPoint`
Publisher	Indicates the publisher of the dataset.	It is checked whether the property is set or not.	Dataset `dct:publisher`

Contextuality

The following table shows some lightweight properties that provide more context to the user.

Indicator	Description	Metrics	Computed on
Rights	Specifies a reference to inform the user about rights related to the dataset.	It is checked whether the property is set or not.	Distribution `dct:rights`
File size	Specifies the size of the file in bytes.	It is checked whether the property is set or not.	Distribution `dcat:byteSize`
Date of issue	The date on which the dataset or distribution was released.	It is checked whether the property is set or not.	Dataset and Distribution `dct:issued`
Modification date	The date on which the dataset or distribution was last changed.	It is checked whether the property is set or not.	Dataset and Distribution `dct:modified`

API Endpoints for Metrics

The following endpoints can be used to retrieve metrics related to datasets and their distributions. These metrics cover dimensions such as findability, accessibility, interoperability, reusability, and contextuality. You can find the full OpenAPI description under https://{factory_name}.pistis-market.eu/srv/mqa/

1. Get Metrics for a Dataset

Endpoint: /datasets/{id}
Method: GET
Description: Retrieve the measurements for metadata referring to a specific dataset. The unique dataset ID should be provided as {id} in the URL.
Query Parameter:

locale (optional) - The language of the dataset that should be returned.

Example Request

curl -X GET "https://{factory_name}.pistis-market.eu/srv/mqa/datasets/12345?locale=en"

Example Response

{
  "success": true,
  "result": {
    "count": 1,
    "results": [
      {
        "findability": 80,
        "accessibility": 90,
        "interoperability": 85,
        "reusability": 70,
        "contextuality": 20
      }
    ]
  }
}

Responses:

200 OK: Returns the metrics for the dataset.
404 Not Found: The dataset with the specified ID could not be found.
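The same request can be made from Python using only the standard library. This is a minimal sketch: the function names and the timeout are assumptions, and the parsing follows the shape of the example response above rather than the full OpenAPI description.

```python
import json
import urllib.parse
import urllib.request

DIMENSIONS = ("findability", "accessibility", "interoperability",
              "reusability", "contextuality")


def dataset_metrics_url(factory_name, dataset_id, locale=None):
    """Build the URL of the dataset metrics endpoint."""
    base = f"https://{factory_name}.pistis-market.eu/srv/mqa"
    url = f"{base}/datasets/{urllib.parse.quote(dataset_id, safe='')}"
    if locale:
        url += "?" + urllib.parse.urlencode({"locale": locale})
    return url


def parse_metrics(payload):
    """Extract the dimension scores from a decoded response body."""
    if not payload.get("success"):
        raise ValueError("MQA response did not report success")
    return [{dim: entry.get(dim) for dim in DIMENSIONS}
            for entry in payload["result"]["results"]]


def fetch_dataset_metrics(factory_name, dataset_id, locale=None, timeout=10.0):
    """GET the metrics; urlopen raises urllib.error.HTTPError on a 404."""
    url = dataset_metrics_url(factory_name, dataset_id, locale)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(json.load(response))
```

A call such as `fetch_dataset_metrics("my-factory", "12345", locale="en")` would return a list with one dictionary of scores, where `my-factory` stands in for the real factory name.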

2. Get Metrics for All Distributions of a Dataset

Endpoint: /datasets/{id}/distributions
Method: GET
Description: Retrieve metrics for all distributions of a specific dataset. The unique dataset ID should be provided as {id} in the URL.
Query Parameter:

locale (optional) - The language of the distribution that should be returned.

Example Request

curl -X GET "https://{factory_name}.pistis-market.eu/srv/mqa/datasets/12345/distributions?locale=en"

Example Response

{
  "success": true,
  "result": {
    "count": 2,
    "results": [
      {
        "id": "dist1",
        "findability": 80,
        "accessibility": 90,
        "interoperability": 85,
        "reusability": 70,
        "contextuality": 20
      },
      {
        "id": "dist2",
        "findability": 75,
        "accessibility": 88,
        "interoperability": 80,
        "reusability": 65,
        "contextuality": 18
      }
    ]
  }
}

Responses:

200 OK: Returns the metrics for all distributions of the dataset.
404 Not Found: The dataset with the specified ID could not be found.
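Once the response is decoded, the per-distribution metrics are easy to index and compare. The sketch below assumes a payload shaped like the example response above; the helper names are hypothetical.

```python
DIMENSIONS = ("findability", "accessibility", "interoperability",
              "reusability", "contextuality")


def metrics_by_distribution(payload):
    """Index the results of a distributions response by distribution ID."""
    return {
        entry["id"]: {dim: entry[dim] for dim in DIMENSIONS if dim in entry}
        for entry in payload["result"]["results"]
    }


def weakest_dimension(scores):
    """Return the dimension with the lowest score (first one wins on a tie)."""
    return min(scores, key=scores.get)
```

This makes it straightforward to report, per distribution, which dimension would benefit most from better metadata.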
