Dataset Comparison
Dataset Comparison can be run as a Spark job or used as a library.
The following example runs it as a Spark job. Spark's own arguments are omitted.
# --ref-path points to a directory; the `data.csv` file inside it is picked up
spark-submit dataset-comparison.jar \
--ref-format csv \
--ref-path /path/to/csv-dir \
--ref-header true \
--new-format parquet \
--new-path /path/to/parquet \
--keys ID \
--out-path /path/to/results
This example produces a folder /path/to/results that holds the Parquet output with the differences, if there were any, and a _METRICS file with some metrics about the comparison.
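To make the role of `--keys` concrete, here is a minimal Python sketch of a key-based comparison. It is not the tool's implementation (which runs on Spark); the function names and the sample data are hypothetical and only illustrate the idea of matching rows on a key column and keeping the rows that differ.

```python
import csv
import io


def index_by_keys(rows, keys):
    """Map each row's key tuple to the full row (a dict)."""
    return {tuple(row[k] for k in keys): row for row in rows}


def compare_datasets(ref_rows, new_rows, keys):
    """Return the rows that differ between two datasets, matched on `keys`.

    A row that exists on only one side is reported with None on the other.
    """
    ref = index_by_keys(ref_rows, keys)
    new = index_by_keys(new_rows, keys)
    differences = []
    for key in sorted(ref.keys() | new.keys()):
        ref_row, new_row = ref.get(key), new.get(key)
        if ref_row != new_row:
            differences.append({"key": key, "ref": ref_row, "new": new_row})
    return differences


# Hypothetical sample data standing in for the reference and new datasets.
REF_CSV = "ID,name,amount\n1,alice,10\n2,bob,20\n3,carol,30\n"
NEW_CSV = "ID,name,amount\n1,alice,10\n2,bob,25\n4,dave,40\n"

ref_rows = list(csv.DictReader(io.StringIO(REF_CSV)))
new_rows = list(csv.DictReader(io.StringIO(NEW_CSV)))
diffs = compare_datasets(ref_rows, new_rows, keys=["ID"])

for diff in diffs:
    print(diff["key"], diff["ref"], diff["new"])
```

In this sketch an empty `diffs` list corresponds to the "no differences" case, where no difference output would be expected in the results folder.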
Info File Comparison
Compares Atum’s _INFO files. It runs as part of the E2E Runner, but it can also be run as a standalone jar file.
java -jar info-file-comparison.jar \
--ref-path /path/to/reference/data/_INFO \
--new-path /path/to/new/data/_INFO \
--out-path /path/to/results
For an _INFO file stored locally, use a path of the form file://path/to/_INFO.
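As a rough illustration of what comparing two JSON documents involves, the following Python sketch walks two parsed structures and reports every path at which they differ. It is an assumption-laden sketch, not the tool's logic: `diff_json` is a hypothetical helper, and the sample fields are made up for the example rather than taken from the actual _INFO schema.

```python
import json


def diff_json(ref, new, path=""):
    """Yield (path, ref_value, new_value) for every leaf that differs.

    A key missing on one side is reported as None on that side.
    Lists of different length are reported as a single difference.
    """
    if isinstance(ref, dict) and isinstance(new, dict):
        for key in sorted(ref.keys() | new.keys()):
            yield from diff_json(ref.get(key), new.get(key), f"{path}/{key}")
    elif isinstance(ref, list) and isinstance(new, list) and len(ref) == len(new):
        for i, (r, n) in enumerate(zip(ref, new)):
            yield from diff_json(r, n, f"{path}/{i}")
    elif ref != new:
        yield (path or "/", ref, new)


# Hypothetical documents; the field names are illustrative only.
REF_JSON = """{"metadata": {"sourceApplication": "app", "version": 1},
 "checkpoints": [{"name": "Raw",
                  "controls": [{"controlName": "recordCount", "controlValue": "100"}]}]}"""
NEW_JSON = """{"metadata": {"sourceApplication": "app", "version": 2},
 "checkpoints": [{"name": "Raw",
                  "controls": [{"controlName": "recordCount", "controlValue": "101"}]}]}"""

info_diffs = list(diff_json(json.loads(REF_JSON), json.loads(NEW_JSON)))
for path, ref_value, new_value in info_diffs:
    print(path, ref_value, new_value)
```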
E2E Runner
E2E Runner usage has changed the most since version 0.2.2. It can now run any test for which a plugin exists, and it can be run as a Spark job or as a plain jar. It accepts at most three arguments: the path to the test definitions, a jar path to additional plugins, and a fail-fast switch.
spark-submit e2e-runner.jar \
--test-definition-path /some/path/testDefinition.json \
--jar-path /extra/jar/folder \
--fail-fast true
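The fail-fast switch can be pictured with the small Python sketch below. It assumes the conventional meaning of fail-fast (stop at the first failing test), which this document does not spell out; `run_tests` and the test names are hypothetical and do not reflect the runner's internals.

```python
def run_tests(tests, fail_fast):
    """Run (name, callable) pairs; each callable returns True on success.

    With fail_fast=True the run stops right after the first failing test,
    so later tests are never executed and do not appear in the results.
    """
    results = {}
    for name, test in tests:
        try:
            passed = bool(test())
        except Exception:
            # An exception inside a test counts as a failure in this sketch.
            passed = False
        results[name] = passed
        if fail_fast and not passed:
            break
    return results


# Hypothetical tests: the second one fails.
tests = [
    ("first", lambda: True),
    ("second", lambda: False),
    ("third", lambda: True),
]

stopped_early = run_tests(tests, fail_fast=True)
ran_everything = run_tests(tests, fail_fast=False)
print(stopped_early)
print(ran_everything)
```

Under this assumption, fail-fast trades a complete report for a shorter run: the remaining tests are skipped once one of them fails.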
Plugins
There are three built-in plugins, all usable out of the box with the E2E Runner.
You can also create a plugin tailored to specific needs by following this guide.