Schema
Intro
Schema is the description of the fields in a dataset. All and only the fields defined in the schema will be in the output table; that means fields not mentioned in the schema won't be in the output. The exception are fields added automatically - see below. Fields are defined in the order they are to appear in the output table and have three basic common properties:
- name - the field (column) name
- type - the data type of the field
- nullable (optional) - flag indicating whether the data can contain the value null; if not specified, it is considered to be false
Furthermore, some types can have additional properties. The details of each supported type, its meaning and its additional properties are described in the following chapters.
Thanks to the StructType and ArrayType data types, fields can be nested - fields within fields.
You provide the schema to Standardization in a JSON file:
{
  "type": "struct",
  "fields": [{
      "name": "name",
      "type": "string",
      "nullable": false,
      "metadata": {}
    },
    {
      "name": "surname",
      "type": "string",
      "nullable": false,
      "metadata": {
        "default": "Unknown Surname"
      }
    },
    {
      "name": "hoursWorked",
      "type": {
        "type": "array",
        "elementType": "integer",
        "containsNull": false
      },
      "nullable": false,
      "metadata": {}
    },
    {
      "name": "employeeNumbers",
      "type": {
        "type": "array",
        "elementType": {
          "type": "struct",
          "fields": [{
              "name": "numberType",
              "type": "string",
              "nullable": true,
              "metadata": {}
            },
            {
              "name": "numbers",
              "type": {
                "type": "array",
                "elementType": "integer",
                "containsNull": true
              },
              "nullable": true,
              "metadata": {}
            }
          ]
        },
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "startDate",
      "type": "date",
      "nullable": false,
      "metadata": {
        "pattern": "yyyy-MM-dd"
      }
    },
    {
      "name": "updated",
      "type": "timestamp",
      "nullable": true,
      "metadata": {
        "pattern": "yyyyMMdd.HHmmss"
      }
    }
  ]
}
An example of data adhering to the above schema can be found here.
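Since the JSON above is the standard Spark schema representation, it can, for illustration, be loaded into a Spark StructType directly (e.g. in spark-shell). A minimal sketch; the file path is hypothetical:

```scala
import org.apache.spark.sql.types.{DataType, StructType}

import scala.io.Source

// Read the schema file from disk (hypothetical path)
val schemaJson = Source.fromFile("schemas/employees.json").mkString

// Spark understands this JSON representation natively
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

schema.printTreeString() // prints field names, types and nullability
```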
Automatically added columns
A column named errCol is automatically added to each Standardization output. It contains information on all the errors that occurred during the standardization of the particular row. If errCol is defined in the schema, its structure has to adhere exactly to the automatically added one. For more on this field, see the dedicated documentation.
Data types
String
The data type representing String values.
Metadata keys: sourcecolumn, default
Boolean
The data type representing Boolean values.
Metadata keys: sourcecolumn, default
Decimal
The data type representing BigDecimal values, a fixed-point numeric type. The type is specified by two additional parameters, precision and scale. Precision limits the number of digits and cannot be greater than 38. Scale specifies the number of digits after the decimal point and has to be less than or equal to the precision.
The type is specified as decimal(precision, scale), for example: decimal(15, 3)
Metadata keys: sourcecolumn, default, pattern, decimal_separator, grouping_separator, minus_sign
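For example, a decimal field could be declared in the schema like this (the field name salary is illustrative):

```json
{
  "name": "salary",
  "type": "decimal(15, 3)",
  "nullable": true,
  "metadata": {}
}
```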
Long
The data type representing Long values. That is a whole number between -9223372036854775808 and 9223372036854775807.
Metadata keys: sourcecolumn, default, pattern, decimal_separator, grouping_separator, minus_sign, radix
Integer
The data type representing Integer values. That is a whole number between -2147483648 and 2147483647.
Metadata keys: sourcecolumn, default, pattern, decimal_separator, grouping_separator, minus_sign, radix
Short
The data type representing Short values. That is a whole number between -32768 and 32767.
Metadata keys: sourcecolumn, default, pattern, decimal_separator, grouping_separator, minus_sign, radix
Byte
The data type representing Byte values. That is a whole number between -128 and 127.
Metadata keys: sourcecolumn, default, pattern, decimal_separator, grouping_separator, minus_sign, radix
Double
The data type representing Double values, 64-bit (IEEE 754) double-precision float.
Metadata keys: sourcecolumn, default, pattern, decimal_separator, grouping_separator, minus_sign, allow_infinity
Float
The data type representing Float values, 32-bit (IEEE 754) single-precision float.
Metadata keys: sourcecolumn, default, pattern, decimal_separator, grouping_separator, minus_sign, allow_infinity
Timestamp
The data type representing java.sql.Timestamp values. Upon entry they are normalized to UTC time zone.
Metadata keys: sourcecolumn, default, pattern, timezone
Date
The data type representing java.sql.Date values. If time zone is specified the date is adjusted to UTC.
Metadata keys: sourcecolumn, default, pattern, timezone
Struct
The data type representing a structure of one or more sub-fields.
The type is specified as a struct with the following properties:
- type - string value "struct"
- fields - array of fields
Metadata keys: sourcecolumn
Array
The data type representing an array of values of another type.
The type is specified as a struct with the following properties:
- type - string value "array"
- elementType - the type of the elements of the array; it can be any of the types, including struct and array
- containsNull - boolean value indicating whether the array can contain the value null
Metadata keys: sourcecolumn
Metadata
Standardization can be influenced by metadata in the schema of the data. Metadata entries are optional properties.
Here are the recognized ones, with a description of their purpose (detailed descriptions follow below):
Property | Target data type | Description | Example | Default* |
---|---|---|---|---|
sourcecolumn | any | The source column to provide data of the described column | id | - ** |
default | any atomic type | Default value to use in case data are missing | 0 | - ** |
pattern | timestamp & date | Pattern for the timestamp or date representation | dd.MM.yy | yyyy-MM-dd HH:mm:ss / yyyy-MM-dd |
timezone | timestamp (also date) | The time zone of the timestamp when that is not part of the pattern (NB! for date it can return unexpected results) | US/Pacific | UTC*** |
pattern | any numeric type | Pattern for the number representation | #,##0.# | - ** |
decimal_separator | any numeric type | The character separating the integer and the fractional parts of the number | , | . |
grouping_separator | any numeric type | Character to mark boundaries between orders of magnitude, usually to mark thousands, millions etc. | _ | , |
minus_sign | any numeric type | Character marking that the number is negative | N | - |
allow_infinity | float & double | Flag indicating if the column accepts infinity as a value (and positive/negative numbers which are too large are converted to infinity/-infinity) | true | false |
radix | long, integer, short, byte | The base of the numbers provided | hex | 10 |
NB! All values in metadata have to be entered as strings, even if they would conform to other types, like number or boolean.
* Value used if nothing is specified in metadata
** No default exists (as not needed)
*** Unless a different default time zone is specified via the defaultTimestampTimeZone and defaultDateTimeZone application settings
sourcecolumn
Supported by types: String, Boolean, Decimal, Long, Integer, Short, Byte, Double, Float, Timestamp, Date, Struct, Array
The name of the column to get the data from (so it only makes sense if it differs from the field name). The most common use case is when the original column name is not a valid Parquet field name. It can also be used in the rare cases when the same source column needs to be standardized into multiple fields and/or different types.
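For example, a source column whose name contains a space (not a valid Parquet field name) can be remapped to a valid one (the names are illustrative):

```json
{
  "name": "total_amount",
  "type": "string",
  "nullable": true,
  "metadata": {
    "sourcecolumn": "total amount"
  }
}
```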
default
Supported by types: String, Boolean, Decimal, Long, Integer, Short, Byte, Double, Float, Timestamp, Date
This is the value to be used in case the input is missing (and nulls are not allowed) or when the casting (standardization) fails. You can think of this as a fallback value.
It should be noted that this is the only metadata key which accepts the null value (written without quotes) in addition to string values (of course, such a field has to have "nullable": true).
For more about the topic see chapter Defaults.
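For instance, a nullable string field can explicitly use null as its fallback value (the field name is illustrative):

```json
{
  "name": "middleName",
  "type": "string",
  "nullable": true,
  "metadata": {
    "default": null
  }
}
```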
pattern
Supported by types: Decimal, Long, Integer, Short, Byte, Double, Float, Timestamp, Date
The format the input adheres to. Mostly used for timestamp and date entries but it can be leveraged for numeric types too. Details for valid patterns are in the chapter Parsing.
In case a default value is specified in metadata, it needs to adhere to the pattern.
If radix is specified and differs from the default 10, the pattern value will be ignored.
timezone
Supported by types: Timestamp, Date
Time zone of the timestamp or date (not recommended for the latter). For details see the chapter Parsing timestamps and dates.
In case the pattern already includes information to recognize the time zone, the timezone entry in metadata will be ignored; namely, if the pattern includes the "z", "Z" or "X" placeholder or the "epoch", "epochmilli", "epochmicro" or "epochnano" keywords.
NB! Due to a Spark limitation, only time zone IDs are accepted as valid values. To get the full list of supported time zone denominators, see the output of Java's TimeZone.getAvailableIDs() function.
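For example, a timestamp whose pattern carries no time zone information can be pinned to a specific zone (the field name is illustrative):

```json
{
  "name": "lastLogin",
  "type": "timestamp",
  "nullable": true,
  "metadata": {
    "pattern": "yyyy-MM-dd HH:mm:ss",
    "timezone": "US/Pacific"
  }
}
```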
decimal_separator
Supported by types: Decimal, Long, Integer, Short, Byte, Double, Float
The character separating the integer and the fractional parts of the number.
For whole numbers which use "." as the grouping_separator, the decimal_separator has to be redefined to avoid conflict.
grouping_separator
Supported by types: Decimal, Long, Integer, Short, Byte, Double, Float
Character to mark boundaries between orders of magnitude, usually to mark thousands, millions etc.
It has to be used in the pattern to be taken into consideration.
minus_sign
Supported by types: Decimal, Long, Integer, Short, Byte, Double, Float
Character marking that the number is negative. By default, it's the standard minus sign ("-").
allow_infinity
Supported by types: Double, Float
Flag indicating if the column accepts infinity as a value. When set to true, infinity/-infinity are recognized as valid values instead of failing with a casting error (see here). The strings representing infinity on input are "∞" and "-∞", respectively. Positive and negative numbers whose values are too large are converted to infinity and -infinity, respectively.
radix
Supported by types: Long, Integer, Short, Byte
The radix (base) of the numbers entered. Accepted values are numbers between 1 and 36, as well as the following keywords (case insensitive): “dec”, “decimal”, “hex”, “hexadecimal”, “bin”, “binary”, “oct”, “octal”.
For higher bases, letters (A, B, C etc.) are used for digits (case insensitive).
For hexadecimal values, entries in the form "0xFF" are accepted as well.
If radix is specified as anything other than the default 10, the pattern value will be ignored.
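For example, an integer column receiving hexadecimal input could be declared like this (the field name is illustrative):

```json
{
  "name": "statusFlags",
  "type": "integer",
  "nullable": true,
  "metadata": {
    "radix": "hex"
  }
}
```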
Parsing
Parsing timestamps and dates
Dates and especially timestamps (date + time) can be tricky. Currently Spark considers all time entries to be in the system’s time zone by default. (For more detailed explanation of possible issues with that see Consistent timestamp types in Hadoop SQL engines.)
To address this potential source of discrepancies the following has been implemented:
- All Enceladus components are set to run in UTC
- As part of Standardization all time related entries are normalized to UTC
- There are several methods to ensure that a timestamp entry is normalized as expected
- We urge users to include time zone information with all timestamp entries, in one of the supported ways
- While all of this is valid for date entries too, it should be noted that UTC normalization of a date can have unexpected consequences - namely, all dates west of UTC would be shifted to a day earlier
To enable processing of time entries from other systems, Standardization offers the possibility to convert string and even numeric values to timestamp or date types. This is done using Spark's ability to convert strings to timestamp/date, with some enhancements. The pattern placeholders and usage are described in Java's SimpleDateFormat class description, with the addition of recognizing some keywords (like epoch and epochmilli, case insensitive) to denote the number of seconds/milliseconds since epoch (1970/01/01 00:00:00.000 UTC), and some additional placeholders.
It should be noted explicitly that "epoch", "epochmilli", "epochmicro" and "epochnano" are considered a pattern including time zone.
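To get a feel for the placeholders, here is a minimal, self-contained sketch using plain java.text.SimpleDateFormat (this only illustrates the pattern syntax; Standardization itself performs the parsing through Spark):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// "Z" is the RFC 822 time zone placeholder
val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss Z")
format.setTimeZone(TimeZone.getTimeZone("UTC"))

val parsed = format.parse("2019-05-04 11:31:10 +0200")
println(format.format(parsed)) // 2019-05-04 09:31:10 +0000
```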
Summary:
Placeholder | Description | Example |
---|---|---|
G | Era designator | AD |
y | Year | 1996; 96 |
Y | Week year | 2009; 09 |
M | Month in year (context sensitive) | July; Jul; 07 |
L | Month in year (standalone form) | July; Jul; 07 |
w | Week in year | 27 |
W | Week in month | 2 |
D | Day in year | 189 |
d | Day in month | 10 |
F | Day of week in month | 2 |
E | Day name in week | Tuesday; Tue |
u | Day number of week (1 = Monday, ..., 7 = Sunday) | 1 |
a | Am/pm marker | PM |
H | Hour in day (0-23) | 0 |
k | Hour in day (1-24) | 24 |
K | Hour in am/pm (0-11) | 0 |
h | Hour in am/pm (1-12) | 12 |
m | Minute in hour | 30 |
s | Second in minute | 55 |
S | Millisecond | 978 |
z | General time zone | Pacific Standard Time; PST; GMT-08:00 |
Z | RFC 822 time zone | -0800 |
X | ISO 8601 time zone | -08; -0800; -08:00 |
_epoch_ | Seconds since 1970/01/01 00:00:00 | 1557136493, 1557136493.136 |
_epochmilli_ | Milliseconds since 1970/01/01 00:00:00.0000 | 1557136493128, 1557136493128.001 |
_epochmicro_ | Microseconds since 1970/01/01 00:00:00.0000 | 1557136493128789, 1557136493128789.999 |
_epochnano_ * | Nanoseconds since 1970/01/01 00:00:00.0000 | 1557136493128789101 |
i | Microsecond | 111, 321001 |
n * | Nanosecond | 999, 542113879 |
* While nanosecond designation is supported on input, it's not supported in storage or further usage, so any value beyond microsecond precision will be truncated.
NB! Spark uses the US locale and, because on-the-fly conversion would be complicated, at the moment we stick to this hardcoded locale as well. E.g. am/pm for the a placeholder, English names of days and months, etc.
NB! The keywords are case insensitive. Therefore, there is no difference between epoch and EpoCH.
Time Zone support
As mentioned, it is highly recommended to use timestamps with a time zone, but it's not unlikely that the source for standardization doesn't provide time zone information. On the other hand, these times are usually within one time zone. To ensure proper standardization, the schema's metadata can include the timezone value. When set, all timestamps in the column will be standardized as belonging to the particular time zone.
E.g. 2019-05-04 11:31:10 with timezone specified as CET will be standardized to 2019-05-04 10:31:10 (UTC, of course).
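Conceptually this is the same shift Spark's to_utc_timestamp function performs; a standalone sketch for spark-shell (the session and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_utc_timestamp}

val spark = SparkSession.builder().master("local[*]").appName("tz-demo").getOrCreate()
import spark.implicits._

val df = Seq("2019-05-04 11:31:10").toDF("updated")

// Interpret the naive timestamp as CET wall-clock time and shift it to UTC
// (the exact offset follows the zone's rules, including daylight saving time)
df.select(to_utc_timestamp(col("updated").cast("timestamp"), "CET").as("updated_utc"))
  .show(false)
```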
Parsing numbers
When converting a string to any of the numeric types, there are two standard formats accepted:
- the usual string of digits with an optional minus or plus sign in front and an optional decimal separator (e.g. 3.14)
- the scientific notation, where the number is expressed as the product of a mantissa and a power of ten (e.g. 1234 can be expressed as 1.234 x 10^3 = 1.234E3)
Note that for whole numbers (Long, Integer, Short and Byte), the decimal separator must not be present.
If the string is being parsed to a decimal type and the input has more decimal places than the scale of the decimal type allows, the result will be rounded to the number of decimal places allowed by the scale.
Radix usage
For whole numbers, the numbers can be entered using a different radix (base) than the usual 10. For radices smaller than 10, the appropriate subset of numeric digits is accepted. For radices above 10, letters are used for the digit representation. The letters are case insensitive, therefore 1Fa = 1fA.
To specify a non-standard radix (different from 10), use the radix metadata key. The radix has to be between 1 and 36.
Pattern parsing
When the number is formatted in some non-standard way, you can use a pattern. The parsing is executed using the Java class DecimalFormat, whose documentation provides the most comprehensive explanation of patterns and their usage.
A pattern contains a positive and a negative subpattern, for example #,##0.00;(#,##0.00). Each subpattern has a prefix, a numeric part, and a suffix. The negative subpattern is optional; if absent, the positive subpattern prefixed with the minus sign ('-' in most locales) is used as the negative subpattern. That is, 0.00 alone is equivalent to 0.00;-0.00.
If there is an explicit negative subpattern, it serves only to specify the negative prefix and suffix; the number of digits, minimal digits, and other characteristics are all the same as in the positive pattern. That means that #,##0.0#;(#) produces precisely the same behavior as #,##0.0#;(#,##0.0#).
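The subpattern behavior can be checked directly against java.text.DecimalFormat, the class doing the actual parsing; a minimal sketch:

```scala
import java.text.DecimalFormat

// Negative values are written in parentheses, per the negative subpattern
val format = new DecimalFormat("#,##0.00;(#,##0.00)")

println(format.parse("1,234.50"))   // 1234.5
println(format.parse("(1,234.50)")) // -1234.5
```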
The prefixes, suffixes, and various symbols used for infinity, digits, thousands separators, decimal separators, etc. may be set to arbitrary values. However, care must be taken that the symbols and strings do not conflict, or parsing will be unreliable. For example, either the positive and negative prefixes or the suffixes must be distinct for parsing to be able to distinguish positive from negative values. Another example is that the decimal separator and thousands separator should be distinct characters, or parsing will be impossible.
Symbol | Location | Meaning | Metadata to change |
---|---|---|---|
0 | Number | Digit | |
# | Number | Digit, zero shows as absent | |
. | Number | Decimal separator | decimal_separator NB!* |
- | Number | Minus sign | minus_sign NB!* |
, | Number | Grouping separator | grouping_separator NB!* |
E | Number | Separates mantissa and exponent in scientific notation; need not be quoted in prefix or suffix | |
; | Subpattern boundary | Separates positive and negative subpatterns | |
% | Prefix or suffix | Divide by 100 on parsing | |
‰ | Prefix or suffix | Divide by 1000 on parsing | |
' | Prefix or suffix | Used to quote special characters in a prefix or suffix; for example, the "'#'#" pattern allows the value #123 to be read in as the number 123. To create a single quote itself, use two in a row: "# o''clock" | |
∞ | Not part of the pattern | String to represent infinity | |
* While these can be changed, in the pattern itself the default symbols must be used, i.e. ".", "," and "-".
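A short sketch of redefining the symbols while keeping the defaults in the pattern itself, again with plain java.text classes:

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

// The input uses "," as the decimal separator and "_" as the grouping separator
val symbols = DecimalFormatSymbols.getInstance(Locale.US)
symbols.setDecimalSeparator(',')
symbols.setGroupingSeparator('_')

// The pattern itself still uses the default "." and ","
val format = new DecimalFormat("#,##0.#", symbols)

println(format.parse("1_234,5")) // 1234.5
```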
NB! If the input has no special format, it's advised to avoid patterns, as parsing with a pattern adds computational overhead. It's still possible to redefine the special symbols like the decimal separator or minus sign even without a pattern.
Number parsing peculiarities
- The number of # and 0 digits in the pattern is not enforced; the input is parsed regardless (e.g. "0.0" produces the same result as "##.#")
- Grouping separator placement is "flexible". If the grouping separator has been added to the pattern, the number is parsed regardless of the placement(s) of the grouping separator in the parsed strings (e.g. pattern #,##0.# will parse 1,000.0 and 1,1234.5, as well as 2000)
- The grouping separator is not accepted in decimal places
- "+" is not accepted when a pattern is specified, unless explicitly present in the pattern
- % and ‰ in fractional and decimal values and patterns divide the number by 100 and 1000, respectively (in integral values as well, but then the results are usually not whole numbers)
- Even if redefined, the standard grouping and decimal separators need to be used in the pattern
- If a pattern is used, e is not accepted, only E, in the exponential expression (without a pattern both are recognized)
Defaults
As described, when a field fails to standardize, either because of missing data in a non-nullable column or because the value cannot be cast to the expected type, the field is populated with a default value and an error is added to the error column.
Explicit default
The value optionally specified for the particular column in the schema of the dataset for Standardization, via the default metadata property.
Global default values
The value used when an explicit default was not defined in the schema:
- null for a nullable column
- 0 for a numeric column
- false for a Boolean column
- "" (empty string) for a string column
- 1970/01/01 for a date column
- 1970/01/01 00:00:00 for a timestamp column
- The default time zone for both timestamps and dates, if not specified, is UTC; this can be changed via the application settings defaultTimestampTimeZone and defaultDateTimeZone
- The default locale, and the decimal symbols based on it, is US. Therefore:
  - the minus sign is "-"
  - the decimal separator is "."
  - the grouping separator is ","
Explicit default values restrictions
- The value has to be a string convertible to the field's type and fitting within its size limitations (e.g. "200" cannot be a default for the type Byte, nor "∞" if allow_infinity is "false" for Double/Float)
- If it's a type supporting pattern and one is defined, the default value has to adhere to the pattern