Features
- compares two datasets and produces a diff file together with metrics and metadata
- if keys are provided, the output is more specific: it gives the paths to the differences and shows the differing data
- a schema can be supplied for selective comparison; only the fields found in the schema are then compared
Features for spark-job
- can load any data type/format that Apache Spark is able to load. For some of the formats to be supported, additional libraries might need to be added to the classpath
- the referential and the new (being tested) input data can have different formats
- the format of the diff file can be configured as well (default is parquet)
- regardless of whether the data comparison passes or fails, a metrics file called _METRICS is written to the destination
Constraints
- the referential schema must be a subset of the schema of the new (being verified) data
- without provided keys, each row is compared as a whole. With keys, individual values are compared and the diff shows the paths to the differences
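
The difference between the two comparison modes can be illustrated with a small sketch in plain Python. This is not the tool's implementation: the function names `diff_whole_rows` and `diff_by_key`, the shape of the returned records, and the use of flat dictionaries as rows are all assumptions made for illustration.

```python
from collections import Counter


def diff_whole_rows(ref_rows, new_rows):
    """Without keys: every row is treated as one value (hypothetical helper).

    Returns the rows found only in the referential data and the rows found
    only in the new data. Rows must be flat dicts with hashable values.
    """
    ref_counts = Counter(tuple(sorted(r.items())) for r in ref_rows)
    new_counts = Counter(tuple(sorted(r.items())) for r in new_rows)
    only_in_ref = [dict(t) for t in (ref_counts - new_counts).elements()]
    only_in_new = [dict(t) for t in (new_counts - ref_counts).elements()]
    return only_in_ref, only_in_new


def diff_by_key(ref_rows, new_rows, keys):
    """With keys: rows are paired by key and compared field by field
    (hypothetical helper). Rows missing on either side are ignored here."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    new_by_key = {key_of(r): r for r in new_rows}
    diffs = []
    for ref in ref_rows:
        new = new_by_key.get(key_of(ref))
        if new is None:
            continue  # unmatched rows are out of scope for this sketch
        for field, ref_value in ref.items():
            if new.get(field) != ref_value:
                diffs.append({
                    "key": key_of(ref),
                    "path": field,
                    "expected": ref_value,
                    "actual": new.get(field),
                })
    return diffs
```

With whole-row comparison, a single changed value makes the entire row show up on both sides of the diff. With keys, the same change is reported as one entry naming the field that differs.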
Concepts
Schema alignment
Dataset comparison takes the referential data and tries to align the new (being verified) data to the schema of the referential data. This means both reordering the columns to match the referential data and selecting only the columns that are present in it.
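
A minimal sketch of this alignment, written in plain Python rather than Spark so it runs on its own. The function name `align_to_reference` and the representation of rows as dictionaries are assumptions for illustration; the tool itself may perform this step differently.

```python
def align_to_reference(ref_columns, new_rows):
    """Reorder and trim new rows to the referential column list.

    ref_columns: column names in referential order.
    new_rows: list of dicts representing the new (being verified) data.
    Raises ValueError if a referential column is missing from the new data,
    mirroring the constraint that the referential schema must be a subset.
    """
    missing = [c for c in ref_columns
               if any(c not in row for row in new_rows)]
    if missing:
        raise ValueError(f"new data lacks referential columns: {missing}")
    # Building each dict in ref_columns order both sorts and selects columns.
    return [{c: row[c] for c in ref_columns} for row in new_rows]
```

Columns that exist only in the new data are dropped, so they never appear in the comparison.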
Provided schema
If a schema is provided for a dataset comparison, then schema alignment is done against this provided schema for both referential and new data.
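
As a rough illustration, with a provided schema the same selection is applied to both sides before they are compared. The names `select_columns` and `equal_under_schema` are hypothetical, rows are modelled as plain dictionaries, and the equality check is order-sensitive for simplicity.

```python
def select_columns(rows, columns):
    """Keep only the given columns, in the given order (hypothetical helper)."""
    return [{c: row[c] for c in columns} for row in rows]


def equal_under_schema(ref_rows, new_rows, schema_columns):
    """Align both datasets to the provided schema, then compare them.

    Fields outside schema_columns are ignored on both sides.
    """
    ref_aligned = select_columns(ref_rows, schema_columns)
    new_aligned = select_columns(new_rows, schema_columns)
    return ref_aligned == new_aligned
```

For example, two datasets that differ only in a timestamp column compare as equal when that column is left out of the provided schema.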