Metadata Quality Assessment

The Metadata Quality Assessment (MQA) is an extension of the Factory Data Catalogue. It assesses the quality of the metadata of every new or changed dataset.

The UI for this component is integrated into the catalogue UI. The MQA evaluates metadata quality against a set of indicators, each of which is explained in the tables below. The results of the checks are stored using the Data Quality Vocabulary (DQV).

DQV is a specification of the W3C that is used to describe the quality of a dataset.

Because accessibility can be volatile, the accessURL and downloadURL must be checked repeatedly. For this reason, the MQA regularly checks the accessibility of all distributions. Unlike the checks for the other indicators, this one has a longer runtime, because the distributions are checked via HTTP and each requested URL may take a while to respond. The MQA schedules the checks so that each URL is re-examined within a few weeks of its last review.
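
The re-check schedule described above could be sketched as follows. This is only an illustration, not the MQA's actual scheduler: the three-week interval stands in for "a few weeks", and the names RECHECK_INTERVAL, is_due_for_recheck and select_due_urls are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the text only says "within a few weeks"; three weeks is a placeholder.
RECHECK_INTERVAL = timedelta(weeks=3)


def is_due_for_recheck(last_checked, now, interval=RECHECK_INTERVAL):
    """Return True if the URL was never checked or its last check is older than interval."""
    if last_checked is None:
        return True
    return now - last_checked >= interval


def select_due_urls(last_checks, now, interval=RECHECK_INTERVAL):
    """Return the URLs that are due for a check, never-checked and oldest first."""
    due = [url for url, checked in last_checks.items()
           if is_due_for_recheck(checked, now, interval)]
    # Never-checked URLs sort before everything else.
    earliest = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(due, key=lambda url: last_checks[url] or earliest)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_checks = {
        "https://example.org/a.csv": now - timedelta(weeks=4),
        "https://example.org/b.csv": now - timedelta(days=2),
        "https://example.org/c.csv": None,
    }
    print(select_due_urls(last_checks, now))
```

Processing the oldest checks first keeps every URL inside the interval even when a single run cannot cover all distributions.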

These measurements and metrics are subject to change during the project phase.

Assumptions

The MQA is based on the following assumption:

We believe that filling in the DCAT-AP mandatory fields alone is not sufficient to provide high-quality metadata. For this reason, the evaluation also checks fields that DCAT-AP does not specify as mandatory. The exact fields that are checked are listed below.


Dimensions

This section describes all dimensions that the MQA examines in order to determine quality. The dimensions are derived from the FAIR principles.

Findability

The following table describes the metrics that help people and machines in finding datasets.

| Indicator | Description | Metrics | Computed on |
| Keyword usage | Keywords directly support the search and thus increase the findability of the dataset. | The system checks whether keywords are defined. The number of keywords has no impact on the score. | Dataset dcat:keyword |
| Categories | Categories help users to explore datasets thematically. | It is checked whether one or more categories are assigned to the dataset. | Dataset dcat:theme |
| Geo search | Spatial information enables users to find the dataset with a geo-faceted search. | It is checked whether the property is set or not. | Dataset dct:spatial |
| Time-based search | Temporal information enables users to run a time-based faceted search. | It is checked whether the property is set or not. | Dataset dct:temporal |
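
The findability checks above all reduce to "is the property set?". A minimal Python sketch of that test follows. It assumes the dataset metadata is available as a plain dictionary keyed by the property names from the table; that representation, the indicator keys and the function names are hypothetical and not part of the MQA.

```python
# Indicator name -> property it is computed on (taken from the table above).
FINDABILITY_PROPERTIES = {
    "keyword_usage": "dcat:keyword",
    "categories": "dcat:theme",
    "geo_search": "dct:spatial",
    "time_based_search": "dct:temporal",
}


def is_set(value):
    """Treat missing, empty-string and empty-collection values as 'not set'."""
    return value not in (None, "", [], {})


def check_findability(dataset):
    """Return a pass/fail flag per findability indicator."""
    return {
        indicator: is_set(dataset.get(prop))
        for indicator, prop in FINDABILITY_PROPERTIES.items()
    }


if __name__ == "__main__":
    dataset = {
        "dcat:keyword": ["energy", "grid"],  # the number of keywords does not matter
        "dcat:theme": [],                    # present but empty: counts as not set
        "dct:spatial": "http://example.org/place/1",
    }
    print(check_findability(dataset))
```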

Accessibility

The following table describes which metrics are used to determine whether access to the data referenced by the distributions is guaranteed.

| Indicator | Description | Metrics | Computed on |
| AccessURL accessibility | The accessURL is not necessarily a direct link to the data, but may refer to a URL that provides access to it. | The specified URL is checked for accessibility via an HTTP HEAD request. | Distribution dcat:accessURL |
| DownloadURL | The downloadURL is a direct link to the referenced data. | It is checked whether the property is set or not. | Distribution dcat:downloadURL |
| DownloadURL accessibility | If a downloadURL exists, the accessibility is checked. | The specified URL is checked for accessibility via an HTTP HEAD request. | Distribution dcat:downloadURL |
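
A minimal sketch of an accessibility check via HTTP HEAD, using only the Python standard library. The source does not state which status codes the MQA counts as accessible, so the rule in is_accessible (any 2xx or 3xx code) and the ten-second timeout are assumptions, and both function names are hypothetical.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10.0):
    """Send an HTTP HEAD request; return the status code, or None if no response arrived."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None


def is_accessible(status):
    """Assumption: any 2xx or 3xx status counts as accessible."""
    return status is not None and 200 <= status < 400


if __name__ == "__main__":
    status = head_status("https://example.org/data.csv")
    print(status, is_accessible(status))
```

A HEAD request transfers no response body, which keeps each check cheap even when the distribution itself is large.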

Interoperability

The following table describes the metrics used to determine whether a distribution is considered interoperable. The MQA assumes that the distributions of a dataset offer identical content in different forms. Therefore, only the distribution with the highest number of points counts toward the dataset's score.

| Indicator | Description | Metrics | Computed on |
| Format | This field specifies the file format of the distribution. | It is checked whether the property is set or not. | Distribution dct:format |
| Media type | This field specifies the media type of the distribution. | It is checked whether the property is set or not. | Distribution dcat:mediaType |
| Format / Media type vocabulary | Checks whether format and media type belong to a controlled vocabulary. | The format vocabulary can be found in the data.europa.eu GitLab repository. | Distribution dct:format, dcat:mediaType |
| Non-proprietary | Checks if the format of the distribution is non-proprietary. | The distribution is considered non-proprietary if the specified format is contained in the vocabulary. | Distribution dct:format |
| Machine-readable | Checks if the format of the distribution is machine-readable. | The distribution is considered machine-readable if the specified format is in the vocabulary. | Distribution dct:format |
| DCAT-AP compliance | DCAT-AP compliance is calculated across all sources and datasets available on a catalogue. | The metadata is validated against a set of SHACL shapes. | |
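
The rule that only the highest-scoring distribution counts can be sketched as below. The point values are made up for illustration, since this page does not list the points per indicator, and best_distribution is a hypothetical helper name.

```python
def best_distribution(points_by_distribution):
    """Return (distribution_id, points) for the highest-scoring distribution.

    Returns (None, 0) when the dataset has no distributions.
    """
    if not points_by_distribution:
        return None, 0
    best_id = max(points_by_distribution, key=points_by_distribution.get)
    return best_id, points_by_distribution[best_id]


if __name__ == "__main__":
    # Hypothetical interoperability points per distribution of one dataset.
    points = {"dist1": 85, "dist2": 80, "dist3": 40}
    print(best_distribution(points))
```

With this rule, a dataset is not penalised for offering an extra distribution in a weaker format alongside a strong one.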

Reusability

The following table describes which metrics are used to check the reusability of the data.

| Indicator | Description | Metrics | Computed on |
| License information | A license is valuable information for the reuse of data. | It is checked whether the property is set or not. | Distribution dct:license |
| License vocabulary | Limits incorrect license information (e.g. incomplete CC licenses). | The MQA credits the usage of controlled vocabularies. | Distribution dct:license |
| Access restrictions | Indicates whether access to the data is public or restricted. | It is checked whether the property is set or not. | Dataset dct:accessRights |
| Access restrictions vocabulary | Use of a controlled vocabulary increases reusability. | It is checked whether the controlled vocabulary for access rights is used. | Dataset dct:accessRights |
| Contact point | Contains information on whom to address in case of questions regarding the data. | It is checked whether the property is set or not. | Dataset dcat:contactPoint |
| Publisher | Indicates the publisher of the dataset. | It is checked whether the property is set or not. | Dataset dct:publisher |

Contextuality

The following table shows some lightweight properties that provide more context to the user.

| Indicator | Description | Metrics | Computed on |
| Rights | Specifies a reference to inform the user about rights related to the dataset. | It is checked whether the property is set or not. | Distribution dct:rights |
| File size | Specifies the size of the file in bytes. | It is checked whether the property is set or not. | Distribution dcat:byteSize |
| Date of issue | The date on which the dataset or distribution was released. | It is checked whether the property is set or not. | Dataset and Distribution dct:issued |
| Modification date | The date on which the dataset or distribution was last changed. | It is checked whether the property is set or not. | Dataset and Distribution dct:modified |

API Endpoints for Metrics

The following endpoints can be used to retrieve metrics related to datasets and their distributions. These metrics cover dimensions such as findability, accessibility, interoperability, reusability, and contextuality. You can find the full OpenAPI description under https://{factory_name}.pistis-market.eu/srv/mqa/


1. Get Metrics for a Dataset

Endpoint: /datasets/{id}
Method: GET
Description: Retrieve the measurements for metadata referring to a specific dataset. The unique dataset ID should be provided as {id} in the URL.
Query Parameter:

  • locale (optional) - The language of the dataset that should be returned.

Example Request

curl -X GET "https://{factory_name}.pistis-market.eu/srv/mqa/datasets/12345?locale=en"

Example Response

{
  "success": true,
  "result": {
    "count": 1,
    "results": [
      {
        "findability": 80,
        "accessibility": 90,
        "interoperability": 85,
        "reusability": 70,
        "contextuality": 20
      }
    ]
  }
}

Responses:

  • 200 OK: Returns the metrics for the dataset.
  • 404 Not Found: The dataset with the specified ID could not be found.
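
The same request can be issued from Python with the standard library. This is a sketch under assumptions: base_url is the service root with {factory_name} already substituted, the response has the shape of the example above, and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def dataset_metrics_url(base_url, dataset_id, locale=None):
    """Build the /datasets/{id} URL, optionally with the locale query parameter."""
    url = f"{base_url.rstrip('/')}/datasets/{urllib.parse.quote(dataset_id, safe='')}"
    if locale:
        url += "?" + urllib.parse.urlencode({"locale": locale})
    return url


def parse_metrics(body):
    """Extract the list of metric objects from a response body shaped like the example."""
    return json.loads(body)["result"]["results"]


def get_dataset_metrics(base_url, dataset_id, locale=None):
    """Return the metrics for a dataset, or None if the service answers 404."""
    url = dataset_metrics_url(base_url, dataset_id, locale)
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return parse_metrics(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


if __name__ == "__main__":
    # Replace the placeholder host with your factory's MQA service root.
    print(dataset_metrics_url("https://example.invalid/srv/mqa", "12345", "en"))
```

Mapping 404 to None lets callers separate "dataset unknown" from transport errors, which still raise.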

2. Get Metrics for All Distributions of a Dataset

Endpoint: /datasets/{id}/distributions
Method: GET
Description: Retrieve metrics for all distributions of a specific dataset. The unique dataset ID should be provided as {id} in the URL.
Query Parameter:

  • locale (optional) - The language of the distribution that should be returned.

Example Request

curl -X GET "https://{factory_name}.pistis-market.eu/srv/mqa/datasets/12345/distributions?locale=en"

Example Response

{
  "success": true,
  "result": {
    "count": 2,
    "results": [
      {
        "id": "dist1",
        "findability": 80,
        "accessibility": 90,
        "interoperability": 85,
        "reusability": 70,
        "contextuality": 20
      },
      {
        "id": "dist2",
        "findability": 75,
        "accessibility": 88,
        "interoperability": 80,
        "reusability": 65,
        "contextuality": 18
      }
    ]
  }
}

Responses:

  • 200 OK: Returns the metrics for all distributions of the dataset.
  • 404 Not Found: The dataset with the specified ID could not be found.
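
To compare distributions, it can help to index the response by distribution ID. The sketch below assumes the response has the shape of the example above, including the "id" field on every entry. DIMENSIONS and metrics_by_distribution are hypothetical names.

```python
import json

DIMENSIONS = (
    "findability",
    "accessibility",
    "interoperability",
    "reusability",
    "contextuality",
)


def metrics_by_distribution(body):
    """Map each distribution ID to its per-dimension values (None if a value is missing)."""
    payload = json.loads(body)
    return {
        entry["id"]: {dimension: entry.get(dimension) for dimension in DIMENSIONS}
        for entry in payload["result"]["results"]
    }


if __name__ == "__main__":
    body = """
    {"success": true, "result": {"count": 2, "results": [
      {"id": "dist1", "findability": 80, "accessibility": 90,
       "interoperability": 85, "reusability": 70, "contextuality": 20},
      {"id": "dist2", "findability": 75, "accessibility": 88,
       "interoperability": 80, "reusability": 65, "contextuality": 18}
    ]}}
    """
    for dist_id, metrics in metrics_by_distribution(body).items():
        print(dist_id, metrics)
```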