Usage > Configuration

Intro

This page describes the configuration of Standardization and Conformance. A number of default options are documented in the project's README; this page covers the configuration values stored in spark-jobs' reference.conf (link) or in an application.conf provided by the user. These values can be overridden using -D Java properties, as in:

```bash
spark-submit --conf "spark.driver.extraJavaOptions= -Dkey1=value1 -Dkey2=value2" ...
```
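
For example, the following command (with illustrative values; any key from the tables below can be substituted) pins the Spark session time zone and switches off record-id generation:

```bash
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dtimezone=UTC -Denceladus.recordId.generation.strategy=none" \
  ...
```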

General options

Configuration Options

| Config Path | Possible Value(s) | Description |
|-------------|-------------------|-------------|
| `conformance.allowOriginalColumnsMutability` | boolean | Allows columns of the original input to be modified or dropped (default is `false`) |
| `conformance.autoclean.standardized.hdfs.folder` | boolean | Automatically deletes the standardized data folder after a successful run of a Conformance job * |
| `control.info.validation` | `strict` | The job fails on failed `_INFO` file validation. |
| | `warning` (default) | A warning message is displayed on failed validation, but the job continues. |
| | `none` | No validation is done. |
| `enceladus.recordId.generation.strategy` | `uuid` (default) | An `enceladus_record_id` column is added, containing a UUID string for each row. |
| | `stableHashId` | An `enceladus_record_id` column is added, populated with an always-the-same Int hash (Murmur3-based; intended for testing). |
| | `none` | No column is added to the output. |
| `max.processing.partition.size` | non-negative long integer | Maximum size (in bytes) of a processing partition, which influences the size of the written Parquet files. NB! Experimental - sizes might still not fulfill the requested limits. |
| `enceladus.rest.uri` | string with URLs | Comma-separated list of URLs where the REST API is looked for, e.g. `http://example.com/rest_api1,http://domain.com:8080/rest_api2` |
| `enceladus.rest.retryCount` | non-negative integer | Each of the `enceladus.rest.uri` URLs can be tried multiple times for fault tolerance. |
| `enceladus.rest.availability.setup` | `roundrobin` (default) | Starts from a random URL in the `enceladus.rest.uri` list; if it fails, the next one is tried; when the last is reached, the search wraps around to the first until all have been tried. |
| | `fallback` | Always starts from the first URL; only if it fails is the second tried, and so on. |
| `min.processing.partition.size` | non-negative long integer | Minimum size (in bytes) of a processing partition, which influences the size of the written Parquet files. NB! Experimental - sizes might still not fulfill the requested limits. |
| `standardization.defaultTimestampTimeZone.default` | string with any valid time zone name | The time zone used to normalize timestamps that don't have their own time zone, either in the data itself or in metadata. If left empty, the system time zone is used. |
| `standardization.defaultTimestampTimeZone.[rawFormat]` | string with any valid time zone name | Same as `standardization.defaultTimestampTimeZone.default`, but applies only to the given input raw format, in which case it takes precedence over the default (see the sketch below). |
| `standardization.defaultDateTimeZone.default` | string with any valid time zone name | The time zone used to normalize dates that don't have their own time zone, either in the data itself or in metadata, in case they need one. Most probably this should be left undefined. |
| `standardization.defaultDateTimeZone.[rawFormat]` | string with any valid time zone name | Same as `standardization.defaultDateTimeZone.default`, but applies only to the given input raw format, in which case it takes precedence over the default. |
| `timezone` | string with any valid time zone name | The time zone the Spark application operates in. It is strongly recommended to keep the default, UTC. |

Note that

* When `conformance.autoclean.standardized.hdfs.folder` is set to `true` and the job writes to S3, a leftover empty file such as `conformance-output_$folder$` may remain after the autoclean. This is caused by the EMR committer and does not negatively impact other jobs, even when they use the same path.
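
As a sketch of how the per-format time zone override takes precedence over the `.default` key, the following run sets a global default timestamp time zone and a format-specific one (the `csv` raw format name and both time zone values are illustrative):

```bash
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dstandardization.defaultTimestampTimeZone.default=UTC -Dstandardization.defaultTimestampTimeZone.csv=Africa/Johannesburg" \
  ...
```

With this setup, timestamps read from CSV input would be normalized assuming Africa/Johannesburg, while all other raw formats would fall back to UTC.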

Selected plugin options:

Configuration Options

| Config Path | Possible Value(s) | Description |
|-------------|-------------------|-------------|
| `atum.hdfs.info.file.permissions` | string with FS permissions | Desired FS permissions for Atum's `_INFO` file. Default: `644`. |
| `spline.hdfs.file.permissions` | string with FS permissions | Desired FS permissions for Spline's `_LINEAGE` file. Default: `644`. |
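
For instance, to tighten both files to owner read/write and group read only (`640` here is an illustrative value):

```bash
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Datum.hdfs.info.file.permissions=640 -Dspline.hdfs.file.permissions=640" \
  ...
```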