Metadata Quality Assessment
The Metadata Quality Assessment (MQA) is an extension of the Factory Data Catalogue. It assesses the metadata quality of every new or changed dataset.
The UI for this component is integrated into the catalogue UI. The MQA measures quality using a set of indicators, each of which is explained in the tables below. The results of the checks are stored using the Data Quality Vocabulary (DQV).
DQV is a W3C specification for describing the quality of a dataset.
As accessibility can be volatile, the accessURL and downloadURL need to be checked repeatedly. For this reason, the MQA regularly checks the accessibility of all distributions. Unlike the checks for the other indicators, this one has a longer runtime, since the distributions are checked via HTTP and each requested URL may take a while to respond. The MQA schedules these checks so that each URL is re-examined for accessibility within a few weeks of its last check.
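
The re-check policy for URLs can be sketched as a simple due-date test. This is only an illustration under stated assumptions, not the MQA's actual scheduler: the three-week interval, the function names, and the dictionary of last-check timestamps are all hypothetical, since the text only says "within a few weeks".

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed interval; the documentation only says "within a few weeks".
RECHECK_INTERVAL = timedelta(weeks=3)


def is_due_for_recheck(last_checked: Optional[datetime],
                       now: datetime,
                       interval: timedelta = RECHECK_INTERVAL) -> bool:
    """Return True if a URL was never checked or its last check is too old."""
    if last_checked is None:
        return True
    return now - last_checked >= interval


def select_urls_to_check(last_checks: dict, now: datetime) -> list:
    """Pick the URLs whose accessibility check is due, oldest check first."""
    due = [url for url, ts in last_checks.items() if is_due_for_recheck(ts, now)]
    # Never-checked URLs (None) sort before any URL that has a timestamp.
    return sorted(due, key=lambda u: (last_checks[u] is not None, last_checks[u] or now))
```

Spreading the checks by age like this keeps each run short, which matters because every HTTP request may be slow.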
These measurements and metrics are subject to change during the project phase.
Assumptions
The MQA is based on the following assumption:
We believe that filling in only the DCAT-AP mandatory fields is not sufficient to provide high-quality metadata. For this reason, the evaluation also checks fields that DCAT-AP does not specify as mandatory. The exact fields that are checked are listed below.
Dimensions
This section describes all dimensions that the MQA examines in order to determine quality. The dimensions are derived from the FAIR principles.
Findability
The following table describes the metrics that help people and machines in finding datasets.
| Indicator | Description | Metrics | Computed on |
|---|---|---|---|
| Keyword usage | Keywords directly support the search and thus increase the findability of the dataset. | The system checks whether keywords are defined. The number of keywords has no impact on the score. | Dataset dcat:keyword |
| Categories | Categories help users to explore datasets thematically. | It is checked whether one or more categories are assigned to the dataset. | Dataset dcat:theme |
| Geo search | Usage of spatial information would enable users to find the dataset with a geo-faceted search. | It is checked whether the property is set or not. | Dataset dct:spatial |
| Time-based search | Usage of temporal information would enable users to find the dataset with a time-based faceted search. | It is checked whether the property is set or not. | Dataset dct:temporal |
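
To illustrate the "property is set" checks in the table, the sketch below evaluates the four findability indicators on a dataset represented as a plain dictionary. The dictionary layout and the function name `check_findability` are assumptions made for this example; the MQA itself works on RDF metadata.

```python
# Indicator name -> property it is computed on (taken from the table above).
FINDABILITY_PROPERTIES = {
    "Keyword usage": "dcat:keyword",
    "Categories": "dcat:theme",
    "Geo search": "dct:spatial",
    "Time-based search": "dct:temporal",
}


def check_findability(dataset: dict) -> dict:
    """Map each indicator to True if its property has a non-empty value."""
    results = {}
    for indicator, prop in FINDABILITY_PROPERTIES.items():
        value = dataset.get(prop)
        # None, "" and empty lists all count as "not set".
        # The number of values is irrelevant, matching the keyword rule.
        results[indicator] = bool(value)
    return results
```

For a dataset with two keywords, an empty theme list, and a spatial value, the result marks "Keyword usage" and "Geo search" as set and the other two as not set.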
Accessibility
The following table describes which metrics are used to determine whether access to the data referenced by the distributions is guaranteed.
| Indicator | Description | Metrics | Computed on |
|---|---|---|---|
| AccessURL accessibility | The AccessURL is not necessarily a direct link to the data, but may refer to a URL that provides access to it. | The specified URL is checked for accessibility via an HTTP HEAD request. | Distribution dcat:accessURL |
| DownloadURL | The downloadURL is a direct link to the referenced data. | It is checked whether the property is set or not. | Distribution dcat:downloadURL |
| DownloadURL accessibility | If a downloadURL exists, the accessibility is checked. | The specified URL is checked for accessibility via an HTTP HEAD request. | Distribution dcat:downloadURL |
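
An accessibility check via HTTP HEAD could look roughly like the following sketch, which uses only the Python standard library. The table does not say which status codes count as accessible, so treating 2xx and 3xx as accessible is an assumption here, as are the function names and the 10-second timeout.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def status_is_accessible(status: int) -> bool:
    """Assumed rule: 2xx and 3xx status codes count as accessible."""
    return 200 <= status < 400


def head_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send an HTTP HEAD request; return the status code, or None on failure."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def is_accessible(url: str) -> bool:
    """Combine the request and the assumed status-code rule."""
    status = head_status(url)
    return status is not None and status_is_accessible(status)
```

A HEAD request avoids downloading the data itself, which keeps the check cheap even for large distributions.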
Interoperability
The following table describes the metrics used to determine whether a distribution is considered interoperable. Based on the assumption that several distributions of a dataset provide identical content, only the distribution with the highest number of points is used to calculate the dataset's points.
| Indicator | Description | Metrics | Computed on |
|---|---|---|---|
| Format | This field specifies the file format of the distribution. | It is checked whether the property is set or not. | Distribution dct:format |
| Media type | This field specifies the media type of the distribution. | It is checked whether the property is set or not. | Distribution dcat:mediaType |
| Format / Media type vocabulary | Checks whether format and media type belong to a controlled vocabulary. | The format vocabulary can be found in the data.europa.eu GitLab repository. | Distribution dct:format, dcat:mediaType |
| Non-proprietary | Checks if the format of the distribution is non-proprietary. | The distribution is considered non-proprietary if the specified format is contained in the vocabulary. | Distribution dct:format |
| Machine-readable | Checks if the format of the distribution is machine-readable. | The distribution is considered machine-readable if the specified format is in the vocabulary. | Distribution dct:format |
| DCAT-AP compliance | DCAT-AP compliance is calculated across all sources and datasets available on a catalogue. | The metadata is validated against a set of SHACL shapes. | |
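
The rule that only the highest-scoring distribution counts towards the dataset can be sketched as follows. The point values passed in are placeholders, since this document does not list the points per indicator, and both function names are hypothetical.

```python
from typing import Optional


def dataset_interoperability_points(distribution_points: dict) -> int:
    """Return the points of the best-scoring distribution, or 0 if there are none."""
    return max(distribution_points.values(), default=0)


def best_distribution(distribution_points: dict) -> Optional[str]:
    """Return the ID of the distribution whose points are used, or None."""
    if not distribution_points:
        return None
    return max(distribution_points, key=distribution_points.get)
```

With placeholder points such as `{"csv": 110, "pdf": 20}`, the dataset is credited with the points of the `csv` distribution only, so adding a weaker distribution never lowers the result.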
Reusability
The following table describes which metrics are used to check the reusability of the data.
| Indicator | Description | Metrics | Computed on |
|---|---|---|---|
| License information | A license is valuable information for the reuse of data. | It is checked whether the property is set or not. | Distribution dct:license |
| License vocabulary | Limits incorrect license information (e.g. incomplete CC licenses). | The MQA credits the usage of controlled vocabularies. | Distribution dct:license |
| Access restrictions | Indicates whether access to the data is public or restricted. | It is checked whether the property is set or not. | Dataset dct:accessRights |
| Access restrictions vocab | Use of a controlled vocabulary increases reusability. | It is checked whether the controlled vocabulary for access rights is used. | Dataset dct:accessRights |
| Contact point | Contains information on whom to address in case of questions regarding the data. | It is checked whether the property is set or not. | Dataset dcat:contactPoint |
| Publisher | Indicates the publisher of the dataset. | It is checked whether the property is set or not. | Dataset dct:publisher |
Contextuality
The following table shows some lightweight properties that provide more context to the user.
| Indicator | Description | Metrics | Computed on |
|---|---|---|---|
| Rights | Specifies a reference to inform the user about rights related to the dataset. | It is checked whether the property is set or not. | Distribution dct:rights |
| File size | Specifies the size of the file in bytes. | It is checked whether the property is set or not. | Distribution dcat:byteSize |
| Date of issue | The date on which the dataset or distribution was released. | It is checked whether the property is set or not. | Dataset and Distribution dct:issued |
| Modification date | The date on which the dataset or distribution was last changed. | It is checked whether the property is set or not. | Dataset and Distribution dct:modified |
API Endpoints for Metrics
The following endpoints can be used to retrieve metrics related to datasets and their distributions. These metrics cover dimensions such as findability, accessibility, interoperability, reusability, and contextuality. You can find the full OpenAPI description under https://{factory_name}.pistis-market.eu/srv/mqa/
1. Get Metrics for a Dataset
Endpoint: /datasets/{id}
Method: GET
Description: Retrieve the measurements for metadata referring to a specific dataset. The unique dataset ID should be provided as {id} in the URL.
Query Parameter:
locale (optional) - The language of the dataset that should be returned.
Example Request
curl -X GET "https://{factory_name}.pistis-market.eu/srv/mqa/datasets/12345?locale=en"
Example Response
{
  "success": true,
  "result": {
    "count": 1,
    "results": [
      {
        "findability": 80,
        "accessibility": 90,
        "interoperability": 85,
        "reusability": 70,
        "contextuality": 20
      }
    ]
  }
}
Responses:
200 OK: Returns the metrics for the dataset.
404 Not Found: The dataset with the specified ID could not be found.
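
The same request can be made from Python with the standard library, as in the sketch below. The helper names and the factory name `acme` used in the usage note are hypothetical; the URL pattern and the `locale` parameter come from the endpoint description above.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def dataset_metrics_url(factory_name: str, dataset_id: str,
                        locale: Optional[str] = None) -> str:
    """Build the URL of the dataset metrics endpoint."""
    base = f"https://{factory_name}.pistis-market.eu/srv/mqa"
    # Percent-encode the ID so special characters cannot break the path.
    url = f"{base}/datasets/{urllib.parse.quote(dataset_id, safe='')}"
    if locale:
        url += "?" + urllib.parse.urlencode({"locale": locale})
    return url


def fetch_dataset_metrics(factory_name: str, dataset_id: str,
                          locale: Optional[str] = None,
                          timeout: float = 10.0) -> dict:
    """GET the metrics and decode the JSON body.

    urlopen raises urllib.error.HTTPError for a 404 Not Found response.
    """
    url = dataset_metrics_url(factory_name, dataset_id, locale)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `dataset_metrics_url("acme", "12345", "en")` produces the same URL shape as the curl request, with `acme` in place of `{factory_name}`.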
2. Get Metrics for All Distributions of a Dataset
Endpoint: /datasets/{id}/distributions
Method: GET
Description: Retrieve metrics for all distributions of a specific dataset. The unique dataset ID should be provided as {id} in the URL.
Query Parameter:
locale (optional) - The language of the distribution that should be returned.
Example Request
curl -X GET "https://{factory_name}.pistis-market.eu/srv/mqa/datasets/12345/distributions?locale=en"
Example Response
{
  "success": true,
  "result": {
    "count": 2,
    "results": [
      {
        "id": "dist1",
        "findability": 80,
        "accessibility": 90,
        "interoperability": 85,
        "reusability": 70,
        "contextuality": 20
      },
      {
        "id": "dist2",
        "findability": 75,
        "accessibility": 88,
        "interoperability": 80,
        "reusability": 65,
        "contextuality": 18
      }
    ]
  }
}
Responses:
200 OK: Returns the metrics for all distributions of the dataset.
404 Not Found: The dataset with the specified ID could not be found.
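
Once a response has been retrieved, it can be processed like any JSON document. The sketch below assumes the response has the shape of the example above (a `success` flag and a `result.results` list with an `id` and one value per dimension); the function names are hypothetical, and the authoritative schema is the OpenAPI description.

```python
import json

DIMENSIONS = (
    "findability",
    "accessibility",
    "interoperability",
    "reusability",
    "contextuality",
)


def weakest_dimension(entry: dict) -> str:
    """Return the name of the lowest-valued dimension in one result entry."""
    return min(DIMENSIONS, key=lambda name: entry[name])


def summarize(response_body: str) -> dict:
    """Map each distribution ID to its weakest dimension."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise ValueError("MQA response reported failure")
    results = payload["result"]["results"]
    return {item["id"]: weakest_dimension(item) for item in results}
```

Applied to the example response, both `dist1` and `dist2` map to `contextuality`, which points to the metadata fields most worth completing.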
