Dataset Comparison

Dataset Comparison can be run as a Spark job or used as a library.

The following example runs Dataset Comparison as a Spark job. Spark's own arguments are omitted:

# --ref-path points to a directory; `data.csv` inside it will be picked up
spark-submit dataset-comparison.jar \
    --ref-format csv \
    --ref-path /path/to/csv-dir \
    --ref-header true \
    --new-format parquet \
    --new-path /path/to/parquet \
    --keys ID \
    --out-path /path/to/results

This example produces a folder /path/to/results containing the Parquet output with the differences, if there are any, and a _METRICS file with metrics about the comparison.
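To act on the outcome in a script, the results folder can be inspected after the job finishes. The following is a minimal sketch, not part of the tool: `summarize_results` is a hypothetical helper, and it assumes the results folder is on the local file system (or has been copied there) and that the difference files carry a `.parquet` extension.

```shell
# Hypothetical helper: classify a Dataset Comparison results folder.
# Assumptions (not confirmed by the tool's documentation):
#   - the folder is reachable on the local file system
#   - difference files end in .parquet
summarize_results() {
    out_dir="$1"

    # No folder at all: the job has not produced any output here.
    if [ ! -d "$out_dir" ]; then
        echo "missing"
        return 2
    fi

    # Look for at least one parquet file anywhere under the folder.
    first_diff=$(find "$out_dir" -type f -name '*.parquet' | head -n 1)

    if [ -n "$first_diff" ]; then
        echo "differences"
        return 1
    fi

    echo "clean"
    return 0
}
```

Calling `summarize_results /path/to/results` prints `differences`, `clean`, or `missing`, and the return code can drive a follow-up step such as printing the `_METRICS` file.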


Info File Comparison

Atum's Info file comparison. It runs as part of the E2E Runner, but it can also be run as a plain jar file:

java -jar info-file-comparison.jar \
    --ref-path /path/to/reference/data/_INFO \
    --new-path /path/to/new/data/_INFO \
    --out-path /path/to/results

For an _INFO file stored locally, use a path of the form file://path/to/_INFO.


E2E Runner

E2E Runner usage has changed the most since version 0.2.2. It can now run any test for which a plugin exists, either as a Spark job or as a plain jar. It accepts up to three arguments: the path to the test definitions, a jar path for additional plugins, and a fail-fast switch.

spark-submit e2e-runner.jar \
    --test-definition-path /some/path/testDefinition.json \
    --jar-path /extra/jar/folder \
    --fail-fast true
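When the runner is launched from a wrapper script, it can help to sanity-check the arguments before submitting. The sketch below is a hypothetical helper, not part of the E2E Runner: `e2e_command` only prints the command line and does not run it. It assumes the test definition is a local file, that `--jar-path` and `--fail-fast` may be omitted, and that `--fail-fast` takes `true` or `false`.

```shell
# Hypothetical helper: print the spark-submit command for the E2E Runner
# after basic checks. It echoes the command and never executes it.
# Assumptions: local test definition file; --jar-path and --fail-fast
# are optional; --fail-fast accepts only true or false.
e2e_command() {
    test_def="${1:-}"    # path to the test definition JSON
    jar_dir="${2:-}"     # optional folder with additional plugin jars
    fail_fast="${3:-}"   # optional: true or false

    if [ ! -f "$test_def" ]; then
        echo "error: test definition not found: $test_def" >&2
        return 1
    fi

    cmd="spark-submit e2e-runner.jar --test-definition-path $test_def"

    # Append the plugin folder only when one was given.
    if [ -n "$jar_dir" ]; then
        cmd="$cmd --jar-path $jar_dir"
    fi

    # Append the fail-fast switch, rejecting anything but true/false.
    case "$fail_fast" in
        true|false) cmd="$cmd --fail-fast $fail_fast" ;;
        "") ;;
        *) echo "error: fail-fast must be true or false" >&2
           return 1 ;;
    esac

    echo "$cmd"
}
```

The printed string is meant for review or logging; paths containing spaces would need quoting before the command is actually run.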



There are three built-in plugins, all usable with the E2E Runner out of the box.

You can also create a plugin tailored to specific needs by following this guide.