Usage > _INFO file

Description

The _INFO file is a JSON file, placed within the source directory together with the raw data, that tracks control measures via Atum. An example of what the file should contain can be found in the code.
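
For illustration, a source directory might look like the hypothetical layout below (the path and data file names are made up; only the _INFO name is fixed):

  /data/raw/mydataset/2020/01/01/v1/
  ├── part-00000.csv
  ├── part-00001.csv
  └── _INFO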

Validation

The _INFO file verification consists of checking that it has an array field named checkpoints. The array has to contain at least two objects: one named (field name) "Raw" and one named "Source". Each of them has to have an array field controls, and this array has to contain a control of type count ("controlType": "controlType.Count") whose control value (field controlValue) is a positive integer.

E.g.

{
  ...
  "checkpoints": [
    {
      "name": "Source",
      "processStartTime": "??? (timestamp)",
      "processEndTime": "??? (timestamp)",
      "workflowName": "Source",
      "order": "??? (positive integer)",
      "controls": [
        {
          "controlName": "recordCount",
          "controlType": "controlType.Count",
          "controlCol": "???",
          "controlValue": "??? (positive integer)"
        }
      ]
    },
    {
      "name": "Raw",
      "processStartTime": "??? (timestamp)",
      "processEndTime": "??? (timestamp)",
      "workflowName": "Raw",
      "order": "???",
      "controls": [
        {
          "controlName": "recordCount",
          "controlType": "controlType.Count",
          "controlCol": "???",
          "controlValue": "??? (positive integer)"
        }
      ]
    }
  ]
}

For a fully expanded example, go here.
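
The checks above can also be expressed in code. Below is a minimal validation sketch in Python (Enceladus itself is implemented in Scala; the field names are taken from the example above, while the function name and error messages are illustrative):

import json

REQUIRED_CHECKPOINTS = {"Raw", "Source"}

def validate_info(path):
    """Check an _INFO file against the rules described above."""
    with open(path) as f:
        info = json.load(f)

    # The file must have an array field named 'checkpoints'
    checkpoints = info.get("checkpoints")
    if not isinstance(checkpoints, list):
        raise ValueError("_INFO must contain an array field 'checkpoints'")

    # At least the 'Raw' and 'Source' checkpoints must be present
    names = {cp.get("name") for cp in checkpoints}
    missing = REQUIRED_CHECKPOINTS - names
    if missing:
        raise ValueError("Missing checkpoints: %s" % ", ".join(sorted(missing)))

    for cp in checkpoints:
        if cp.get("name") not in REQUIRED_CHECKPOINTS:
            continue
        # Each required checkpoint needs a count control
        counts = [c for c in cp.get("controls", [])
                  if c.get("controlType") == "controlType.Count"]
        # ... and its controlValue must be a positive integer
        if not any(str(c.get("controlValue", "")).isdigit()
                   and int(c["controlValue"]) > 0 for c in counts):
            raise ValueError("Checkpoint '%s' lacks a valid count control" % cp["name"])

validate_info("_INFO")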

Additional Information

Additional information regarding the processing of the data is added to the _INFO file during Standardization and Conformance.

Metadata Key - Description
conform_driver_memory - The amount of memory used by the driver to run Conformance
conform_enceladus_version - Which version of Enceladus was used to run Conformance
conform_errors_count - Number of errors after Conformance
conform_executor_memory - Memory requested per executor for Conformance
conform_executors_num - Number of executors used for Conformance
conform_input_data_size - The size of the input data (without metadata) to Conformance. Usually it is the same as the size of the standardized data, since Conformance is run after Standardization
conform_output_data_size - The size of the conformed/published data (without metadata such as lineage or the _INFO file)
conform_output_dir_size - The size of the published directory including metadata
conform_records_failed - Number of records that have at least one error after Conformance
conform_size_ratio - Size of the conformed/published folder in relation to the standardized folder
conform_spark_master - Spark master of the Conformance job (usually yarn)
conform_username - User account under which Conformance was performed
csv_delimiter - Delimiter of the raw data; dependent on the input file format, e.g. csv
raw_format - Format of the raw data, e.g. csv, json, xml, cobol
source_record_count - The number of records in the dataset when it was exported from the source system
std_application_id - Unique Spark application ID of the Standardization job
std_errors_count - Number of errors after Standardization
std_executor_memory - Memory requested per executor for Standardization
std_executors_num - Number of executors used for Standardization
std_input_dir_size - The size of the raw folder
std_output_data_size - Size of the output data after Standardization
std_output_dir_size - The size of the standardized folder
std_records_failed - Number of records that have at least one error after Standardization
std_records_succeeded - Number of records that have no errors after Standardization
std_spark_master - Spark master of the Standardization job (usually yarn)
std_username - User account under which Standardization was performed
std_yarn_deploy_mode - YARN deploy mode used (client or cluster)
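
These keys can be read back from the _INFO file after a run. Below is a short Python sketch; it assumes the keys are stored under the metadata.additionalInfo map of the _INFO file (as Atum writes its additional info), and the selection of keys printed is arbitrary:

import json

with open("_INFO") as f:  # illustrative path; the file sits next to the data
    info = json.load(f)

# Assumption: Atum keeps these keys in the metadata.additionalInfo map
additional_info = info.get("metadata", {}).get("additionalInfo", {})

for key in ("raw_format", "source_record_count",
            "std_records_succeeded", "std_records_failed",
            "conform_errors_count"):
    print(key, "=", additional_info.get(key, "<not present>"))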