Table Of Contents
- Description
- Validation
- Additional Information
Description
A file named `_INFO`, placed within the source directory together with the raw data, is a JSON file tracking control measures via Atum. An example of what the file should contain can be found in the code.
Validation
The `_INFO` file validation consists of checking that it has an array field named `checkpoints`. This array has to contain at least two objects: one named (field `name`) "Raw" and one named "Source". Each of them has to have an array field `controls`. This array has to contain a control of type count (`"controlType": "controlType.Count"`) whose control value (field `controlValue`) is a positive integer. A sketch of these checks in code follows the example below.
E.g.
{
...
"checkpoints": [
{
"name": "Source",
"processStartTime": "??? (timestamp)",
"processEndTime": "??? (timestamp)",
"workflowName": "Source",
"order": "??? (positive integer)",
"controls": [
{
"controlName": "recordCount",
"controlType": "controlType.Count",
"controlCol": "???",
"controlValue": "??? (positive integer)"
}
]
},
{
"name": "Raw",
"processStartTime": "??? (timestamp)",
"processEndTime": "??? (timestamp)",
"workflowName": "Raw",
"order": "???",
"controls": [
{
"controlName": "recordCount",
"controlType": "controlType.Count",
"controlCol": "???",
"controlValue": "??? (positive integer)"
}
]
}
]
}
For a fully expanded example go here.
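To make the rules concrete, here is a minimal validation sketch in Python. It is illustrative only (the `validate_info` helper is not part of Enceladus), but it performs exactly the checks listed above:

```python
import json

def validate_info(info: dict) -> list:
    """Check a parsed _INFO document against the rules above; return a list of problems."""
    problems = []
    checkpoints = info.get("checkpoints")
    if not isinstance(checkpoints, list):
        return ["missing array field 'checkpoints'"]
    for required in ("Raw", "Source"):
        named = [cp for cp in checkpoints if cp.get("name") == required]
        if not named:
            problems.append(f"no checkpoint named '{required}'")
            continue
        for cp in named:
            controls = cp.get("controls")
            if not isinstance(controls, list):
                problems.append(f"checkpoint '{required}' has no 'controls' array")
                continue
            counts = [c for c in controls if c.get("controlType") == "controlType.Count"]
            if not counts:
                problems.append(f"checkpoint '{required}' has no control of type count")
            for c in counts:
                try:
                    value = int(c.get("controlValue"))
                except (TypeError, ValueError):
                    value = 0  # non-numeric values fail the positivity check below
                if value <= 0:
                    problems.append(f"count controlValue in '{required}' is not a positive integer")
    return problems

# Usage: parse the _INFO file next to the raw data and report any violations.
with open("_INFO") as f:
    issues = validate_info(json.load(f))
print(issues if issues else "valid")
```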
Additional Information
Additional information regarding the processing is added to the `_INFO` file during Standardization and Conformance.
| Metadata Key | Description |
|---|---|
| `conform_driver_memory` | Memory requested for the driver of the Conformance job |
| `conform_enceladus_version` | Version of Enceladus used to run Conformance |
| `conform_errors_count` | Number of errors after Conformance |
| `conform_executor_memory` | Memory requested per executor for Conformance |
| `conform_executors_num` | Number of executors used for Conformance |
| `conform_input_data_size` | Size of the input data (without metadata) to Conformance; usually the same as the size of the standardized data, since Conformance is run after Standardization |
| `conform_output_data_size` | Size of the conformed/published data (without metadata such as lineage or the _INFO file) |
| `conform_output_dir_size` | Size of the published directory including metadata |
| `conform_records_failed` | Number of records that have at least one error after Conformance |
| `conform_size_ratio` | Size of the conformed/published folder in relation to the standardized folder |
| `conform_spark_master` | Spark master of the Conformance job (usually `yarn`) |
| `conform_username` | User account under which Conformance was performed |
| `csv_delimiter` | Delimiter used in the input file; present when the raw format is `csv` |
| `raw_format` | Format of the raw data, e.g. `csv`, `json`, `xml`, `cobol` |
| `source_record_count` | Number of records in the dataset when it was exported from the source system |
| `std_application_id` | Unique Spark application ID of the Standardization job |
| `std_errors_count` | Number of errors after Standardization |
| `std_executor_memory` | Memory requested per executor for Standardization |
| `std_executors_num` | Number of executors used for Standardization |
| `std_input_dir_size` | Size of the raw folder |
| `std_output_data_size` | Size of the output data after Standardization |
| `std_output_dir_size` | Size of the standardized folder |
| `std_records_failed` | Number of records that have at least one error after Standardization |
| `std_records_succeeded` | Number of records that have no errors after Standardization |
| `std_spark_master` | Spark master of the Standardization job (usually `yarn`) |
| `std_username` | User account under which Standardization was performed |
| `std_yarn_deploy_mode` | YARN deploy mode used (`client` or `cluster`) |
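For instance, assuming these keys are stored in the `metadata.additionalInfo` map of the `_INFO` file (as in Atum's control measure format), they can be listed with a few lines of Python:

```python
import json

# Print the extra processing metadata recorded by Standardization and Conformance.
# Assumes the keys live in the metadata.additionalInfo map of the _INFO file.
with open("_INFO") as f:
    info = json.load(f)

additional = info.get("metadata", {}).get("additionalInfo", {})
for key in sorted(additional):
    print(f"{key} = {additional[key]}")
```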