Data Quality Assessment

Metadata Quality

Clicking the Quality Assessment button on the dataset details page opens an overview of that dataset's metadata quality. The metrics software stack analyses metadata against the DCAT Application Profile for data portals in Europe (DCAT-AP) standard, which is based on the Data Catalog Vocabulary (DCAT) developed by the W3C. DCAT-AP is "... a specification for metadata records to meet the specific application needs of data portals in Europe while providing semantic interoperability with other applications on the basis of reuse of established controlled vocabularies (e.g. EuroVoc) and mappings to existing metadata vocabularies (e.g. Dublin Core, SDMX, INSPIRE metadata, etc.)." Multiple components analyse incoming metadata and derive measurements based on the Data Quality Vocabulary (DQV). These services are pipe modules and can therefore be orchestrated as a pipe.

To determine metadata quality, the following five aspects are considered:

Compliance with DCAT-AP and DCAT-AP derivatives
Disclosure of information not mandated by DCAT-AP
Accessibility of the data referenced in the metadata through the Access and Download URL
Machine readability of the referenced data
License usage

Based on these analyses, a score is calculated that serves as an easily comparable indicator of overall metadata quality. The individual measurements and the final score can be visualized in the web frontend.
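
As a rough illustration of how per-aspect measurements could be combined into one comparable score, the following sketch uses a weighted sum. The aspect keys and weights are hypothetical and chosen only for this example; the actual scoring scheme used by the metrics stack is not described here.

```python
# Hypothetical weights for the five aspects listed above (they sum to 1.0).
# These values are assumptions for illustration, not the platform's scheme.
ASPECT_WEIGHTS = {
    "dcat_ap_compliance": 0.3,
    "optional_disclosure": 0.2,
    "accessibility": 0.2,
    "machine_readability": 0.2,
    "license_usage": 0.1,
}


def metadata_score(measurements: dict) -> float:
    """Combine per-aspect measurements (each in [0, 1]) into a 0-100 score."""
    total = sum(
        weight * measurements.get(aspect, 0.0)
        for aspect, weight in ASPECT_WEIGHTS.items()
    )
    return round(100 * total, 1)


example_measurements = {
    "dcat_ap_compliance": 1.0,
    "optional_disclosure": 0.5,
    "accessibility": 1.0,
    "machine_readability": 0.0,
    "license_usage": 1.0,
}
example_score = metadata_score(example_measurements)
```

A missing aspect counts as zero in this sketch, so incomplete analyses lower the score rather than being ignored.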

Dataset Content Quality

When a distribution is created or its metadata is updated, the distribution content undergoes a content quality assessment. A baseline set of quality expectations is informed by the Insights Generator. The assessment uses the rule-based Great Expectations framework, requires no user input, and runs automatically behind the scenes.

The baseline assessment can be further refined for distributions that have undergone the Data Enrichment process. The Data Enrichment component creates a mapping between a specific data distribution and the PISTIS Data Model. Certain data types and concepts within the PISTIS Data Model have contextual quality expectations embedded within them. Any user whose dataset includes one of these enhanced data types inherits all of its embedded quality rules. This provides a way to apply context-informed and consistent quality standards throughout the platform. Currently, not all PISTIS Data Model data types have embedded quality rules.
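
The inheritance of embedded rules can be pictured with a small sketch. The data type name `Percentage`, the column names, and the `refine` helper are hypothetical and serve only to illustrate the idea; they are not taken from the PISTIS Data Model or its implementation.

```python
# Hypothetical data model types with embedded quality rules.
# Types without an entry have no embedded rules yet.
EMBEDDED_RULES = {
    "Percentage": [
        ("ExpectColumnValuesToBeBetween", {"min_value": 0, "max_value": 100}),
    ],
}


def refine(baseline: dict, column_to_type: dict) -> dict:
    """Return the baseline expectations plus the rules inherited from mapped types."""
    refined = {column: list(rules) for column, rules in baseline.items()}
    for column, type_name in column_to_type.items():
        inherited = EMBEDDED_RULES.get(type_name, [])
        refined.setdefault(column, []).extend(inherited)
    return refined


# Baseline rules as they might be proposed automatically for one column.
baseline_rules = {"humidity": [("ExpectColumnValuesToNotBeNull", {})]}

# Mapping produced by enrichment: column name -> data model type (assumed names).
enrichment_mapping = {"humidity": "Percentage", "station": "StationName"}

refined_rules = refine(baseline_rules, enrichment_mapping)
```

Here `humidity` keeps its baseline rule and gains the range rule from its mapped type, while `station` gains nothing because its type carries no embedded rules.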

Dimensions & Metrics

The PISTIS Data Quality Assessor uses the Great Expectations library to validate data quality. Each assessment is made up of a collection of quality expectations. Because PISTIS takes a prescriptive approach to assessing data quality (validating the data against preexisting assumptions of what good data looks like in each context), each expectation represents a specific rule to be checked. Typically these rules are bound to a specific feature, but more complex expectations span multiple features or even the entire table. A complete list of expectations can be found in the Expectation Gallery.
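
To make the idea of an expectation as a rule bound to a feature concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API; the `Expectation` class, its fields, and the result keys are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Expectation:
    """One prescriptive rule bound to a single feature (column)."""

    name: str
    column: str
    rule: Callable[[Any], bool]

    def validate(self, rows: list) -> dict:
        # Collect every value in the column that breaks the rule.
        unexpected = [
            row[self.column] for row in rows if not self.rule(row[self.column])
        ]
        return {
            "expectation": self.name,
            "column": self.column,
            "success": not unexpected,
            "unexpected_count": len(unexpected),
        }


sample_rows = [{"age": 34}, {"age": 51}, {"age": -2}]
age_in_range = Expectation("age_between_0_and_120", "age", lambda v: 0 <= v <= 120)
age_result = age_in_range.validate(sample_rows)
```

An assessment would then be a list of such expectations, each validated against the same rows.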

Dimensions

In PISTIS, each quality expectation is mapped to a Data Quality Dimension. Each dimension characterizes a different aspect of data quality. There are five PISTIS Data Quality Dimensions: Accuracy, Consistency, Completeness, Uniqueness, and Validity.
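
The mapping from expectation type to dimension can be sketched as a lookup table. The excerpt below takes one expectation type per dimension from the tables in this section; the `summarize` helper and the pass-ratio summary are assumptions for illustration, not a description of how PISTIS reports results.

```python
from collections import defaultdict

# Excerpt of the expectation-to-dimension mapping (one type per dimension).
DIMENSION_OF = {
    "ExpectColumnValuesToBeInSet": "Accuracy",
    "ExpectColumnPairValuesAToBeGreaterThanB": "Consistency",
    "ExpectColumnValuesToNotBeNull": "Completeness",
    "ExpectColumnValuesToBeUnique": "Uniqueness",
    "ExpectColumnToExist": "Validity",
}


def summarize(results: list) -> dict:
    """Return the share of passed expectations per dimension."""
    counts = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for expectation_type, success in results:
        dimension = DIMENSION_OF[expectation_type]
        counts[dimension][0] += int(success)
        counts[dimension][1] += 1
    return {dim: passed / total for dim, (passed, total) in counts.items()}


# Each entry is (expectation type, whether it passed).
example_results = [
    ("ExpectColumnValuesToBeInSet", True),
    ("ExpectColumnValuesToNotBeNull", True),
    ("ExpectColumnValuesToNotBeNull", False),
    ("ExpectColumnToExist", True),
]
dimension_summary = summarize(example_results)
```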

Accuracy

The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.
ISO-5259

Accuracy refers to the degree to which data correctly represents the real-world values or events it is intended to describe within a given context of use. High data accuracy means that the recorded data attributes closely match the true, factual values of the corresponding entities or occurrences. Inaccurate data can result from errors in measurement, entry, or processing, leading to misleading insights and poor decision-making. Ensuring accuracy often involves validation against reliable sources, implementing error-checking mechanisms, and maintaining clear data collection standards.

Expectation types in the accuracy dimension check that data serves as a faithful reflection of reality, supporting trustworthy analysis and outcomes.

Expectation Type	Description	Appropriate for
ExpectColumnDistinctValuesToBeInSet	Expect the set of distinct column values to be contained by a given set.	Categorical features
ExpectColumnDistinctValuesToContainSet	Expect the set of distinct column values to contain a given set.	Categorical features
ExpectColumnDistinctValuesToEqualSet	Expect the set of distinct column values to equal a given set.	Categorical features
ExpectColumnValuesToBeInSet	Expect each column value to be in a given set.	Categorical features
ExpectColumnValuesToBeBetween	Expect the column entries to be between a minimum value and a maximum value.	Numerical features
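
The semantics of the set and range checks above can be sketched in plain Python. This mirrors the descriptions in the table rather than calling Great Expectations; the function names and sample data are made up, and skipping null values is an assumption of this sketch.

```python
def unexpected_not_in_set(values, allowed):
    """Per-value check: return the values that are not in the allowed set."""
    allowed = set(allowed)
    return [v for v in values if v is not None and v not in allowed]


def distinct_values_in_set(values, allowed):
    """Column-level check: are all distinct values contained in the allowed set?"""
    distinct = {v for v in values if v is not None}
    return distinct <= set(allowed)


def unexpected_not_between(values, min_value, max_value):
    """Return the values outside the inclusive range [min_value, max_value]."""
    return [
        v for v in values if v is not None and not (min_value <= v <= max_value)
    ]


fuel_types = ["petrol", "diesel", "electric", "steam"]
speeds_kmh = [48, 130, -5, 90]

bad_fuel = unexpected_not_in_set(fuel_types, {"petrol", "diesel", "electric"})
bad_speed = unexpected_not_between(speeds_kmh, 0, 250)
fuel_ok = distinct_values_in_set(fuel_types, {"petrol", "diesel", "electric"})
```

The per-value variant reports each offending entry, whereas the distinct-values variant only answers whether the column as a whole stays within the set.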

Consistency

The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. It can be either or both among data regarding one entity and across similar data for comparable entities.
ISO-5259

Consistent data maintains internal uniformity within a single entity’s records, ensuring that related information aligns and does not conflict. Consistency expectations may evaluate feature-specific logic, such as adherence to defined minimum and maximum values or compliance with required string patterns. They may also assess relationships between features, such as verifying valid value combinations or enforcing inequality constraints. Together, these checks help ensure that the dataset remains coherent, logically sound, and suitable for reliable analysis.

Expectation Type	Description	Appropriate for
ExpectColumnPairValuesAToBeGreaterThanB	Expect the values in column A to be greater than column B.	Pairs of numerical features
ExpectColumnPairValuesToBeEqual	Expect the values in column A to be the same as column B.	Any pair of features
ExpectColumnPairValuesToBeInSet	Expect the paired values from columns A and B to belong to a set of valid pairs.	Any pair of features
ExpectMultiColumnSumToBeEqual	Expect that the sum of row values in a specified column list is the same for each row, and equal to a specified sum total.	Collection of numerical features
ExpectColumnMaxToBeBetween	Expect the column maximum to be between a minimum value and a maximum value.	Numerical features
ExpectColumnMeanToBeBetween	Expect the column mean to be between a minimum value and a maximum value (inclusive).	Numerical features
ExpectColumnMedianToBeBetween	Expect the column median to be between a minimum value and a maximum value.	Numerical features
ExpectColumnMinToBeBetween	Expect the column minimum to be between a minimum value and a maximum value.	Numerical features
ExpectColumnMostCommonValueToBeInSet	Expect the most common value to be within the designated value set.	Numerical features
ExpectColumnStdevToBeBetween	Expect the column standard deviation to be between a minimum value and a maximum value.	Numerical features
ExpectColumnSumToBeBetween	Expect the column sum to be between a minimum value and a maximum value.	Numerical features
ExpectColumnValueLengthsToBeBetween	Expect the column entries to be strings with length between a minimum value and a maximum value (inclusive).	String or text features
ExpectColumnValueLengthsToEqual	Expect the column entries to be strings with length equal to the provided value.	String or text features
ExpectColumnValueZScoresToBeLessThan	Expect the Z-scores of a column's values to be less than a given threshold.	Numerical features
ExpectColumnValuesToMatchRegex	Expect the column entries to be strings that match a given regular expression.	String or text features
ExpectColumnValuesToMatchRegexList	Expect the column entries to be strings that can be matched to either any of or all of a list of regular expressions.	String or text features
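
Two of the consistency checks above, a cross-feature inequality and a string pattern, can be sketched as follows. The code is a plain-Python approximation with made-up column names and data. It uses `re.search`, so the pattern is anchored explicitly to require a full match.

```python
import re


def pair_a_not_greater_than_b(rows, column_a, column_b):
    """Return the rows where column A is not strictly greater than column B."""
    return [row for row in rows if not row[column_a] > row[column_b]]


def not_matching_regex(values, pattern):
    """Return the string values that the pattern does not match."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.search(v) is None]


trips = [
    {"arrival": 1030, "departure": 900, "plate": "AB-123"},
    {"arrival": 800, "departure": 815, "plate": "ab123"},
]

# Arrival should come after departure for every trip.
bad_order = pair_a_not_greater_than_b(trips, "arrival", "departure")

# Plates should look like two capital letters, a hyphen, and three digits.
bad_plates = not_matching_regex([t["plate"] for t in trips], r"^[A-Z]{2}-\d{3}$")
```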

Completeness

The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.
ISO-5259

Completeness measures the extent to which all required data elements for a given entity are present and populated within a dataset. It assesses whether the expected attributes and related entity instances contain valid values in accordance with the intended context of use. High completeness indicates that no critical information is missing, enabling accurate analysis, decision-making, and reporting. Conversely, incomplete data can lead to gaps in understanding, misinterpretation of results, and reduced trust in data-driven outcomes. Ensuring completeness involves defining mandatory data fields, monitoring data entry and integration processes, and establishing controls to address missing or partial information.

Expectation types in the completeness dimension check that all expected data is present. This is done not only by checking for missing values, but also by checking that no placeholder or default values appear in their place. They are typically negative assertions.

Expectation Type	Description	Appropriate for
ExpectColumnValuesToNotBeInSet	Expect column entries to not be in the set.	Any feature
ExpectColumnValuesToNotMatchRegex	Expect the column entries to be strings that do NOT match a given regular expression.	String or text features
ExpectColumnValuesToNotMatchRegexList	Expect the column entries to be strings that do not match any of a list of regular expressions.	String or text features
ExpectColumnValuesToBeNull	Expect the column values to be null.	Any feature
ExpectColumnValuesToNotBeNull	Expect the column values to not be null.	Any feature
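
The difference between a missing value and a default value standing in for one can be shown with a short sketch. The placeholder values (-999, "N/A", empty string) and the sample readings are assumptions chosen for illustration; which defaults count as placeholders depends on the dataset.

```python
def count_nulls(values):
    """Count the entries that are missing outright."""
    return sum(v is None for v in values)


def unexpected_in_set(values, forbidden):
    """Return the entries that appear in the forbidden (placeholder) set."""
    forbidden = set(forbidden)
    return [v for v in values if v in forbidden]


# Hypothetical temperature readings: one null and two sentinel defaults.
readings = [21.5, None, -999, 19.0, -999]

missing_count = count_nulls(readings)
placeholder_values = unexpected_in_set(readings, {-999, "N/A", ""})
```

A null check alone would report one problem here, although three of the five readings carry no real measurement.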

Uniqueness

The degree to which all data instances provide new information. Namely, there are no instances of duplicate data.
Adapted from Askham et al. (2013)

Uniqueness measures the extent to which each data instance within a dataset is distinct and represents new information. It ensures that no duplicate records or redundant entries exist for the same entity or event. High uniqueness indicates that data is non-repetitive, consistent, and accurately reflects real-world entities without duplication. Maintaining uniqueness supports reliable analytics, prevents data inflation, and enhances the efficiency of data management processes. Achieving this dimension typically involves implementing robust data matching, deduplication, and validation controls to identify and resolve redundant or overlapping records.

Expectation Type	Description	Appropriate for
ExpectColumnProportionOfUniqueValuesToBeBetween	Expect the proportion of unique values to be between a minimum value and a maximum value. For example, in a column containing 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, there are 4 unique values and 10 total values for a proportion of 0.4.	Any feature
ExpectColumnUniqueValueCountToBeBetween	Expect the number of unique values to be between a minimum value and a maximum value.	Any feature
ExpectColumnValuesToBeUnique	Expect each column value to be unique. This expectation detects duplicates. All duplicated values are counted as exceptions.	Any feature
ExpectCompoundColumnsToBeUnique	Expect the compound columns to be unique.	Any collection of features
ExpectSelectColumnValuesToBeUniqueWithinRecord	Expect the values for each record to be unique across the columns listed. Note that records can be duplicated.	Any collection of features
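
The proportion example from the table can be reproduced in a few lines of plain Python. The helper names are made up, and the sketch only mirrors the descriptions above: distinct values divided by total values, and every occurrence of a repeated value counted as an exception.

```python
from collections import Counter


def proportion_unique(values):
    """Number of distinct values divided by the total number of values."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def duplicated_values(values):
    """Return every occurrence of a value that appears more than once."""
    counts = Counter(values)
    return [v for v in values if counts[v] > 1]


def compound_duplicates(rows, columns):
    """Return the column-value combinations that occur in more than one row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sorted({k for k in keys if counts[k] > 1})


column = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
unique_share = proportion_unique(column)   # 4 distinct values out of 10
exceptions = duplicated_values(column)     # every entry except the single 1

orders = [
    {"customer": "a", "day": 1},
    {"customer": "a", "day": 2},
    {"customer": "a", "day": 1},
]
repeated_keys = compound_duplicates(orders, ["customer", "day"])
```

In `orders`, neither column is unique on its own, and the compound check shows that the combination `("a", 1)` is repeated as well.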

Validity

The degree to which the subject data adheres to the expected formats and data structures. This includes feature existence, data typing, and row population.
Adapted from Askham et al. (2013)

Validity measures the extent to which data conforms to defined formats, structures, and business rules. It assesses whether data values meet expected standards for type, range, and structure, including correct feature existence, data typing, and appropriate row population. High validity ensures that data is syntactically and structurally sound, enabling reliable processing, integration, and analysis. Invalid data, such as improperly formatted values, incorrect data types, or missing structural elements, can lead to system errors, reporting inaccuracies, and compliance issues. Ensuring validity involves enforcing data standards, applying validation rules, and routinely monitoring data entry and transformation processes.

Expectation Type	Description	Appropriate for
ExpectColumnToExist	Checks for the existence of a specified column within a table.	Any feature
ExpectColumnValuesToBeInTypeList	Expect a column to contain values from a specified type list.	Any feature
ExpectColumnValuesToBeOfType	Expect a column to contain values of a specified data type.	Any feature
ExpectTableColumnCountToBeBetween	Expect the number of columns in a table to be between two values.	Entire table
ExpectTableColumnCountToEqual	Expect the number of columns in a table to equal a value.	Entire table
ExpectTableColumnsToMatchOrderedList	Expect the columns in a table to exactly match a specified list.	Entire table
ExpectTableColumnsToMatchSet	Expect the columns in a table to match a specified set, regardless of order.	Entire table
ExpectTableRowCountToBeBetween	Expect the number of rows to be between two values.	Entire table
ExpectTableRowCountToEqual	Expect the number of rows to equal a value.	Entire table
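
The table-level checks above concern structure rather than individual values. The sketch below illustrates this in plain Python, including the difference between matching an ordered list and matching a set of column names. The helper names and the sample table are assumptions for this example.

```python
def column_exists(columns, name):
    """Is the named column present in the table?"""
    return name in columns


def columns_match_ordered_list(columns, expected):
    """Same column names in exactly the same order."""
    return list(columns) == list(expected)


def columns_match_set(columns, expected):
    """Same column names, in any order."""
    return set(columns) == set(expected)


def row_count_between(rows, min_value, max_value):
    """Is the number of rows within the inclusive range?"""
    return min_value <= len(rows) <= max_value


def values_of_type(values, expected_type):
    """Do all non-null values have the expected Python type?"""
    return all(isinstance(v, expected_type) for v in values if v is not None)


table_columns = ["id", "name", "price"]
table_rows = [(1, "pen", 1.2), (2, "ink", 4.5)]

same_set = columns_match_set(table_columns, ["price", "id", "name"])
same_order = columns_match_ordered_list(table_columns, ["price", "id", "name"])
has_rows = row_count_between(table_rows, 1, 1000)
prices_are_float = values_of_type([row[2] for row in table_rows], float)
```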
