Component - Dataset Comparison

Table of contents
Features
Features for spark-job
Constraints
Concepts
- Schema alignment
- Provided schema

Features

Compares two datasets and produces a diff file together with metrics and metadata.
If keys are provided, the output is more specific: it gives the paths to the differing fields and shows the differing data.
A schema can be supplied for selective comparison; only the fields found in that schema are compared.
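
To illustrate what keyed, path-based output and selective comparison could look like, here is a minimal sketch in plain Python. It is not the tool's implementation: rows are modelled as nested dictionaries, and the helper names `diff_paths` and `keyed_diff`, the dotted path notation, and the sample data are all assumptions made for this example.

```python
def diff_paths(ref, new, prefix=""):
    """Yield (path, ref_value, new_value) for every differing leaf field."""
    for field in ref:
        path = f"{prefix}.{field}" if prefix else field
        ref_val, new_val = ref[field], new.get(field)
        if isinstance(ref_val, dict) and isinstance(new_val, dict):
            # Descend into nested structures so the path points at the leaf.
            yield from diff_paths(ref_val, new_val, path)
        elif ref_val != new_val:
            yield (path, ref_val, new_val)


def keyed_diff(ref_rows, new_rows, keys, fields=None):
    """Match rows on `keys`; compare only `fields` when a list is given."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    def select(row):
        if fields is None:
            return row
        return {f: row[f] for f in fields if f in row}

    new_by_key = {key_of(r): r for r in new_rows}
    result = {}
    for ref_row in ref_rows:
        # A referential row with no match is compared against an empty row.
        new_row = new_by_key.get(key_of(ref_row), {})
        diffs = list(diff_paths(select(ref_row), select(new_row)))
        if diffs:
            result[key_of(ref_row)] = diffs
    return result


# Hypothetical sample data.
ref_rows = [
    {"id": 1, "name": "a", "address": {"city": "Prague", "zip": "11000"}},
    {"id": 2, "name": "b", "address": {"city": "Brno", "zip": "60200"}},
]
new_rows = [
    {"id": 1, "name": "a", "address": {"city": "Prague", "zip": "11000"}},
    {"id": 2, "name": "B", "address": {"city": "Brno", "zip": "60300"}},
]

all_diffs = keyed_diff(ref_rows, new_rows, keys=["id"])
name_only = keyed_diff(ref_rows, new_rows, keys=["id"], fields=["name"])
print(all_diffs)
print(name_only)
```

With `fields=["name"]` the change in `address.zip` is ignored, which mirrors the idea of comparing only the fields found in a supplied schema.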

Features for spark-job

Can load any data format that Apache Spark is able to load. Some formats may require additional libraries on the classpath.
The referential data and the new (being tested) data can be in different formats.
The format of the diff file is configurable (default is Parquet).
Regardless of whether the comparison passes or fails, a metrics file called _METRICS is written to the destination.

Constraints

The referential schema must be a subset of the schema of the new (being verified) data.
Without provided keys, each row is compared as a whole. With keys, individual fields are compared and the diff shows the paths to the differences.
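
The two constraints can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the tool's code: schemas are modelled as lists of column names, rows as tuples, and the helpers `missing_fields` and `whole_row_diff` are hypothetical names introduced here.

```python
from collections import Counter


def missing_fields(ref_schema, new_schema):
    """Referential fields absent from the new data's schema (should be empty)."""
    return [f for f in ref_schema if f not in new_schema]


def whole_row_diff(ref_rows, new_rows):
    """Without keys, a row either matches entirely or counts as different."""
    ref_counts, new_counts = Counter(ref_rows), Counter(new_rows)
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    only_in_ref = list((ref_counts - new_counts).elements())
    only_in_new = list((new_counts - ref_counts).elements())
    return only_in_ref, only_in_new


# Hypothetical schemas: the new data may carry extra columns.
ref_schema = ["id", "name"]
new_schema = ["name", "id", "loaded_at"]
missing = missing_fields(ref_schema, new_schema)

# Hypothetical rows, already reduced to the referential columns.
ref_rows = [(1, "a"), (2, "b")]
new_rows = [(1, "a"), (2, "B")]
only_in_ref, only_in_new = whole_row_diff(ref_rows, new_rows)
print(missing, only_in_ref, only_in_new)
```

Note that the whole-row result only says that the row with `id` 2 differs; it cannot say which field changed. That is the information keys add.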

Concepts

Schema alignment

Dataset comparison takes the referential data and tries to align the new (being verified) data to the schema of the referential data. This means both reordering the columns to match the referential data and selecting only the columns that are present in it.
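
A minimal sketch of this alignment step in plain Python, assuming flat rows modelled as dictionaries. The function name `align_to_reference` and the sample columns are hypothetical; in Spark the same idea would roughly correspond to a `select` over the referential column names.

```python
def align_to_reference(ref_columns, new_rows):
    """Keep only the referential columns, in the referential order."""
    return [{col: row[col] for col in ref_columns} for row in new_rows]


# Hypothetical data: the new rows have a different column order and an extra column.
ref_columns = ["id", "name"]
new_rows = [{"name": "a", "loaded_at": "2020-01-01", "id": 1}]

aligned = align_to_reference(ref_columns, new_rows)
print(aligned)
```

A column listed in `ref_columns` but missing from a new row raises a `KeyError` in this sketch, which reflects the constraint that the referential schema must be a subset of the new data's schema.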

Provided schema