Spline is a free and open-source tool for automated tracking data lineage and data pipeline structure in your organization.

Originally the project was created as a lineage tracking tool specifically for Apache Spark ™ (the name Spline stands for Spark Lineage). In 2018, the IEEE Paper has been published. Later though, the vision of the project was expanded, and the system design was generalized to accommodate other data technologies, not only Spark.

The goal of the project is to create a simple but capable cross-platform and cross-framework data-lineage tracking solution that could be used along, or serve as a foundation for building more advanced data governance solutions on top of it.

At a high-level, the project consists of three main parts:

Spline Server
Spline Agents:
- Apache Spark agent – for Scala, Python or Spark SQL dialect
- Python agent – for arbitrary functions written in Python 3.9+
- Open-Lineage adapter (PoC) – for importing metadata in OpenLineage API format.
Spline UI

The Spline Server is the heart of Spline. It receives the lineage data from agents via Producer API and stores it in the ArangoDB. On the other end It provides Consumer API for reading and querying the lineage data. Consumer API is used by Spline UI, but can also be used by 3rd party applications.

The agents capture the lineage meta-data form the data-transformation pipelines and send it to the Spline server in a standardized format via a cross-platform API (called Producer API) using HTTP (REST) or Kafka as transport.

Spline diagram

The lineage data is then processed and stored in a graph format, and is accessible via another REST API (called Consumer API).

Motivation
Getting started
Upgrade Spline database
- Upgrade from any Spline version greater than 0.4 to the latest version
- Upgrade from Spline 0.3

Motivation

Our main focus is to solve the following particular problems:

Regulatory requirements (BCBS 239, GDPR etc)
Documentation of business logic

Business analysts should get a chance to verify whether jobs were written according to the rules they provided. Moreover, it would be beneficial for them to have up-to-date documentation where they can refresh their knowledge of a project.
Identification of performance bottlenecks

Our focus is not only business-oriented; we also see Spline as a development tool that should be able to help developers with the performance optimization of their Spark jobs.

Getting started

TL;DR

We have created a Docker-compose config (see Spline getting started) to help you to get Spline up and running in just a few keystrokes. So if you only have 5 minutes to try Spline out then simply do the following:

wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/docker-compose.yml

wget https://raw.githubusercontent.com/AbsaOSS/spline-getting-started/main/docker/.env

docker-compose up

That will spin up a few Docker containers (ArangoDB, Spline Server and Spline UI), and run a set of Spark examples to pre-populate the database.

Open http://localhost:9090 in the browser to see the Spline UI and captured example lineage data.

The Spline server is running on http://localhost:8080

Now you can run your tracked job implemented in Python, or Apache Spark (spark-shell, pyspark or spark-submit) with lineage tracking enabled and immediately see the result in the Spline UI.

Python example

First, decorate you Python code that you want to track with the Spline Agent decorators. At minimum, you need to specify the entry point function (a top-level function that will be tracked) and the input and output datasource URIs (for the purpose of tracking lineage).

my-example-app.py:

import spline_agent
from spline_agent.enums import WriteMode

@spline_agent.track_lineage(name = "My Awesome Python Job")
def my_example_func():
    # read some files, do stuff, write the result
    _read_from("hdfs://..../input.csv")
    _write_to("s3://..../output.csv")


@spline_agent.inputs('{url}')
def _read_from(url: str) -> SomeData:
    # do actual read
    return data


@spline_agent.output('{url}', WriteMode.OVERWRITE)
def _write_to(url: str, payload: SomeData):
    # do actual write
    pass

then, install Spline Agent library, configure it, and execute your Python app as usual.

# install "spline-agent" library
pip install spline-agent
# configure the Spline Agent to send captured lineage metadata to the Spline Producer REST API 
export SPLINE_LINEAGE_DISPATCHER_TYPE=http
export SPLINE_LINEAGE_DISPATCHER_HTTP_BASE_URL=http://localhost:8080/producer
# execute the Python app. The lineage will captured and sent to the Spline server.
python my-example-app.py

Spark example

# Start the "pyspark" shell (or we can use "spark-shell" or "spark-submit" in the same way)
# with lineage tracking enabled automatically for any execute Spark job.
pyspark \
  --packages za.co.absa.spline.agent.spark:spark-3.0-spline-agent-bundle_2.12:2.2.0 \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.producer.url=http://localhost:8080/producer"

Note: By default, only persistent writes are tracked (like write to a file, Hive table or a database).

Note: In the example above we used Spline Agent bundle compiled for Spark 3.0 and Scala 2.12. For other Spark and Scala versions use corresponding bundles. See selecting agent artifacts for details.

Also see Spline getting started

Step-by-step instructions

Get Spline components

All Spline components are available as J2EE artifacts (JAR or WAR-files) as well as Docker containers:

Choose what suites you better.

Docker containers are hosted on Docker Hub:

absaoss/spline-admin:0.7.9
absaoss/spline-rest-server:0.7.9
absaoss/spline-kafka-server:0.7.9
absaoss/spline-web-ui:0.7.5
absaoss/spline-spark-agent:2.2.0 (contains examples)

J2EE artifacts can be found on Maven Central:

Python agent is on available on PyPI:

spline-python-agent

Create Spline Database

Spline server requires ArangoDB to run.

Please install ArangoDB version 3.11 or newer according to the instructions in ArangoDB documentation.

Example:

docker run -p 8529:8529 -e ARANGO_NO_AUTH=1 arangodb:3.11

Once the database server is running you should be able to see ArangoDB Web UI at http://localhost:8529

Next, create a Spline database using Spline Admin utility:

docker run -it --rm absaoss/spline-admin:0.7.9 db-init arangodb://172.17.0.1/spline

Detailed usage documentation can be seen by running the Admin tool with the --help parameter:

docker run -it --rm absaoss/spline-admin:0.7.9 --help

Start Spline Server

docker container run \
  -e SPLINE_DATABASE_CONNECTION_URL=arangodb://host.docker.internal/spline \
  -p 8080:8080 \
  absaoss/spline-rest-server:0.7.9

Optionally, you may also start Spline Kafka Gateway if you plan to use Kafka transport for your agents.

See Spline Kafka for details.

docker container run \
  -e SPLINE_DATABASE_CONNECTION_URL=arangodb://host.docker.internal/spline \
  -e spline.kafka.consumer.bootstrap.servers=localhost:9092 \
  -e spline.kafka.consumer.group.id=spline-group \
  -e spline.kafka.topic=spline-topic \
  -p 7070:8080 \
  absaoss/spline-kafka-server:0.7.9

Note for Linux users: If host.docker.internal does not resolve replace it with 172.17.0.1 (see Docker for-linux bug report)

Configuration

SPLINE_DATABASE_CONNECTION_URL

URL to the ArangoDB database. (See Admin tool help)
SPLINE_DATABASE_LOG_FULL_QUERY_ON_ERROR

Useful for development and debugging purposes. It controls the AQL log verbosity in case of errors. Set it to true if you want the full AQL query to be logged. (Default is false - only a part of the query is logged).
SPLINE_DATABASE_DISABLE_SSL_VALIDATION

Should validation of self-signed SSL certificates be disabled. true - SSL validation is disabled (Don’t use on production). false - enabled (Default).
spline.kafka.topic

A topic from which the gateway should consume messages.
spline.kafka.consumer.*

A prefix for standard kafka consumer properties. Kafka gateway uses standard Kafka consumer inside, so any of it’s confiuration properties can be used after this prefix.

For example spline.kafka.consumer.bootstrap.servers or spline.kafka.consumer.group.id.

Check Spline Server status

Open the server URL in the browser: http://localhost:8080

You should see a dashboard with the updating server status information, server version, exposed API and some other useful info.

The Spline REST server exposes the following REST APIs:

Producer API (/producer/*)
Consumer API (/consumer/*)

… and other useful URLs:

Running server version information: /about/version
Producer API Swagger documentation: /docs/producer.html
Consumer API Swagger documentation: /docs/consumer.html

Start Spline UI

docker container run \
      -e SPLINE_CONSUMER_URL=http://localhost:8080/consumer \
      -p 9090:8080 \
      absaoss/spline-web-ui:0.7.5

Open Spline Web UI in the browser: http://localhost:9090

See Spline UI Readme for details.

Capture some lineage

Refer to the Spline agent for Spark for detailed explanation how Spline agent is used to capture lineage from Spark jobs.

Also see Examples page, where you can find examples of generic (non-Spark) lineage capturing, by directly calling Spline Producer API.

Running Spline as Java application

Although Docker is the preferred and the most convenient way of running Spline, you can also run it on standard Java environment.

The admin tool:

java -jar admin-0.7.9.jar db-init arangodb://localhost/spline

The REST server:

Download the WAR-file using the link below, and deploy it onto any J2EE-compatible Web Container, e.g. Tomcat, Jetty, JBoss etc.

za.co.absa.spline:rest-gateway:0.7.9

Spline Web application is looking for configuration in the following sources (in order of precedence):

Naming and directory context (via JNDI) if available

(for example in a context.xml in the Tomcat server)

  <Environment type="java.lang.String"
               name="spline/database/connectionUrl"
               value="arangodb://localhost/spline" />

JVM system properties

  $JAVA_OPTS -Dspline.database.connectionUrl=arangodb://localhost/spline

System environment variables

  export SPLINE_DATABASE_CONNECTION_URL=arangodb://localhost/spline

Please adhere to the naming convention of each configuration source type. For example, the property foo.barBaz would be looked up as foo.barBaz in the JVM options, as foo/barBaz in the JNDI, and as FOO_BAR_BAZ in the environment variables.

The Web UI:

See Spline UI - Running a WAR-file

Upgrade Spline database

Upgrade from any Spline version greater than 0.4 to the latest version

docker run -it --rm absaoss/spline-admin:0.7.9 db-upgrade <SPLINE_DB_URL>

… or if you prefer Java

java -jar admin-0.7.9.jar db-upgrade <SPLINE_DB_URL>

Note: Depending on your database size, migration procedure (especially from 0.5 to 0.6) can consume significant amount of memory on the ArangoDB server. It’s recommended to preliminary increase RAM size (e.g. at least 128Gb for a database with average 500K records per collection). When migration is complete, RAM can be returned to its normal size.

Note: There is no automatic rollback for database migration! If migration fails for any reason you should start over from the clean database snapshot. So don’t forget to make backups before running the migration tool!

Upgrade from Spline 0.3

Spline 0.3 stored data in the MongoDB. Starting from version 0.4 further Spline uses ArangoDB instead. You need to use another tool for migrating data between those two databases - Spline Migrator

You also need to have a new Spline server running. The migrator tool will fetch the data directly from a MongoDB, convert it and send to a new Spline Producer REST endpoint.

java -jar migrator-tool-0.4.2.jar \
  --source=mongodb://localhost:27017/splinedb \
  --target=http://localhost:8080/producer

Also run java -jar migrator-tool-0.4.2.jar --help to read about usage and available options.

For more information you may take a look at the Migrator tool source code.

Copyright 2019 ABSA Group Limited

you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Data Lineage Tracking And Visualization Solution

Table of Contents