## Intro

This page describes the configuration of Standardization and Conformance. A number of default options are documented in the project's README. This page covers the configuration values stored in spark-jobs's reference.conf (link) or in an application.conf provided by the user.

These values can be overridden using -D Java properties, as in:

```
spark-submit --conf "spark.driver.extraJavaOptions= -Dkey1=value1 -Dkey2=value2" ...
```
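Alternatively, several values can be grouped in a user-supplied application.conf. Below is a minimal sketch, assuming the standard Typesafe Config lookup (an application.conf on the driver classpath, or one pointed to via the standard -Dconfig.file property); the values shown are illustrative, not recommendations:

```
# application.conf (HOCON) -- illustrative values only
timezone = "UTC"                                 # keep the recommended default
control.info.validation = "warning"              # warn on failed _INFO validation, but continue
enceladus.recordId.generation.strategy = "uuid"  # add enceladus_record_id with a UUID per row
menas.rest.retryCount = 3                        # retry each Menas URL up to 3 times
```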
## General options
| Config Path | Possible Value(s) | Description |
|---|---|---|
| `conformance.allowOriginalColumnsMutability` | boolean | Allows modifying/dropping columns of the original input (default is `false`) |
| `conformance.autoclean.standardized.hdfs.folder` | boolean | Automatically delete the standardized data folder after a successful run of a Conformance job * |
| `control.info.validation` | `strict` | The job will fail on a failed _INFO file validation. |
| | `warning` | (default) A warning message is displayed on failed validation, but the job goes on. |
| | `none` | No validation is done. |
| `enceladus.recordId.generation.strategy` | `uuid` | (default) An `enceladus_record_id` column will be added, containing a UUID string for each row. |
| | `stableHashId` | An `enceladus_record_id` column will be added, populated with an always-the-same Int hash (Murmur3-based, intended for testing). |
| | `none` | No column will be added to the output. |
| `max.processing.partition.size` | non-negative long integer | Maximum size (in bytes) of a processing partition, which influences the size of the written Parquet files. NB! Experimental - sizes might still not fulfill the requested limits. |
| `menas.rest.uri` | string with URLs | Comma-separated list of URLs where Menas will be looked for, e.g.: `http://example.com/menas1,http://domain.com:8080/menas2` |
| `menas.rest.retryCount` | non-negative integer | Number of times each of the `menas.rest.uri` URLs may be retried, for fault tolerance. |
| `menas.rest.availability.setup` | `roundrobin` | (default) Starts from a random URL in the `menas.rest.uri` list; if it fails, the next one is tried, wrapping around to the first, until all have been tried. |
| | `fallback` | Always starts from the first URL; only if it fails is the second tried, and so on. |
| `min.processing.partition.size` | non-negative long integer | Minimum size (in bytes) of a processing partition, which influences the size of the written Parquet files. NB! Experimental - sizes might still not fulfill the requested limits. |
| `standardization.defaultTimestampTimeZone.default` | string with any valid time zone name | The time zone used for normalization of timestamps that don't carry their own time zone, either in the data itself or in metadata. If left empty, the system time zone is used. |
| `standardization.defaultTimestampTimeZone.[rawFormat]` | string with any valid time zone name | Same as `standardization.defaultTimestampTimeZone.default` above, but applies only to the specified input raw format, in which case it takes precedence over the `.default` key (see the sketch below the note). |
| `standardization.defaultDateTimeZone.default` | string with any valid time zone name | The time zone used for normalization of dates that don't carry their own time zone, either in the data itself or in metadata, in case they need one. Most probably this should be left undefined. |
| `standardization.defaultDateTimeZone.[rawFormat]` | string with any valid time zone name | Same as `standardization.defaultDateTimeZone.default` above, but applies only to the specified input raw format, in which case it takes precedence over the `.default` key. |
| `timezone` | string with any valid time zone name | The time zone the Spark application operates in. It is strongly recommended to keep the default, `UTC`. |
\* Note that when `conformance.autoclean.standardized.hdfs.folder` is set to `true` and the job writes to S3, a leftover empty file such as `conformance-output_$folder$` may remain after the autoclean. This is caused by the EMR committer and does not negatively impact the functionality of other jobs, even when they use the same path.
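To illustrate the `[rawFormat]` placeholder in the time zone options above: it stands for a concrete input raw format name. A minimal sketch (the `csv` format name and the zone values are illustrative assumptions):

```
# Illustrative only: default zone for timestamps lacking their own time zone
standardization.defaultTimestampTimeZone.default = "UTC"
# Per-format override ("csv" assumed here for illustration);
# it takes precedence over the .default key for that format
standardization.defaultTimestampTimeZone.csv = "Africa/Johannesburg"
```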
## Selected plugin options
| Config Path | Possible Value(s) | Description |
|---|---|---|
| `atum.hdfs.info.file.permissions` | string with FS permissions | Desired FS permissions for Atum's _INFO file. Default: `644`. |
| `spline.hdfs.file.permissions` | string with FS permissions | Desired FS permissions for Spline's _LINEAGE file. Default: `644`. |
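As a sketch, these plugin options can be set like any other key, e.g. in application.conf (the octal strings below simply make the documented defaults explicit):

```
# Illustrative: explicit FS permissions for the control and lineage files
atum.hdfs.info.file.permissions = "644"   # Atum's _INFO file
spline.hdfs.file.permissions = "644"      # Spline's _LINEAGE file
```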