Data Lineage Tracking And Visualization Solution

This Spline version has reached End-of-Life and is no longer maintained.

Please use a recent Spline version.



The project consists of three main parts:

Spline diagram

There are several other tools. Check the examples to get a better idea of how to use Spline.

Other docs/readme files can be found at:

Spline is intended for use with Spark 2.3+, but also provides limited support for Spark 2.2.

Motivation

Spline aims to fill a big gap within the Apache Hadoop ecosystem. Spark jobs shouldn’t be treated only as magic black boxes; people should be able to understand what happens with their data. Our main focus is to solve the following particular problems:


Get Spline

To get started, you need to get a minimal set of Spline’s moving parts - a server, an admin tool and a client Web UI to see the captured lineage.

There are two ways to do it:

Download prebuilt Spline artifacts from the Maven repo

Alternatively, build Spline from the source code

Note: Skip this section unless you want to hack on Spline.

  1. Make sure you have JDK 8, Maven and NodeJS installed.

  2. Get and unzip the Spline source code:
    wget https://github.com/AbsaOSS/spline/archive/release/0.4.2.zip
    unzip 0.4.2.zip
    
  3. Change the directory:
    cd spline-release-0.4.2
    
  4. Run the Maven build:
    mvn install -DskipTests
    

Install ArangoDB

The Spline server requires ArangoDB to run. Please install ArangoDB 3.5+ according to the instructions in the ArangoDB documentation.

If you prefer a Docker image, there is a Docker repo as well.

docker pull arangodb:3.5.1

Create Spline Database

java \
  -jar admin/target/admin-0.4.2.jar \
  db-init arangodb://localhost/spline

Start Spline Server

The Spline server can be started in two different ways:

Docker
docker container run \
  -e spline.database.connectionUrl=arangodb://host.docker.internal/spline \
  -p 8080:8080 \
  absaoss/spline-rest-server

Note for Linux: If host.docker.internal does not resolve, replace it with 172.17.0.1 (see the Docker for-linux bug report).
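
The Linux workaround above can be scripted. The following is a minimal sketch, assuming the default Docker bridge network (where the host is reachable at 172.17.0.1, as noted above) and that a kernel name of Linux means a native Linux Docker engine. The variable names HOST_ADDR and SPLINE_DB_URL are illustrative only and are not defined by Spline.

```shell
# Choose the address a container uses to reach services on the Docker host.
# Assumption: on Linux the default bridge gateway 172.17.0.1 is used;
# elsewhere host.docker.internal is expected to resolve.
case "$(uname -s)" in
  Linux) HOST_ADDR=172.17.0.1 ;;
  *)     HOST_ADDR=host.docker.internal ;;
esac

# Connection URL to hand over to the server container.
SPLINE_DB_URL="arangodb://${HOST_ADDR}/spline"
echo "$SPLINE_DB_URL"
```

The resulting value can then be passed to the container as -e spline.database.connectionUrl="$SPLINE_DB_URL".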

Java-compatible web container (e.g. Tomcat)

You can find a WAR-file in the Maven repo here:

za.co.absa.spline:rest-gateway:0.4.2

Add the JVM argument for the ArangoDB connection string:

-Dspline.database.connectionUrl=arangodb://localhost/spline
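
With Tomcat, one common way to pass this argument is through CATALINA_OPTS in a setenv.sh script, which catalina.sh sources on startup if the file exists in the bin directory. A minimal sketch, assuming a standard Tomcat layout (the file location is an assumption about your installation, not something Spline prescribes):

```shell
# Sketch of $CATALINA_BASE/bin/setenv.sh (assumed location).
# Keep any options that are already set and append the Spline setting.
CATALINA_OPTS="${CATALINA_OPTS:-} -Dspline.database.connectionUrl=arangodb://localhost/spline"
export CATALINA_OPTS
```

Other web containers have their own mechanism for JVM options, so treat this as one possible setup.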

The server exposes the following REST API:

… and other useful URLs:

Start Spline UI

The Spline web client can be started in three different ways:

Docker
docker container run \
  -e spline.consumer.url=http://localhost:8080/consumer \
  -p 9090:8080 \
  absaoss/spline-web-client

Java-compatible web container (e.g. Tomcat)

You can find the WAR-file of the Web Client in the repo here:

za.co.absa.spline:client-web:0.4.2

Add the JVM argument for the consumer URL:

-Dspline.consumer.url=http://localhost:8080/consumer

Node.js application (for development purposes)

Download Node.js, then install @angular/cli to run the ng serve or ng build command.

To specify the consumer URL, edit the config.json file.

You can find the documentation of this module in ClientUI.

Check the result in the browser

http://localhost:9090
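
The server and the UI can take a moment to start, so the page may not load right away. As a convenience, the check can also be done from the command line. The helper below is a hypothetical sketch (the function name wait_for_url and its defaults are not part of Spline) that polls a URL with curl until it responds without an HTTP error status.

```shell
# Poll a URL until curl gets a response that is not an HTTP error.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_url() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP error statuses; the body is discarded.
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "up: $url"
      return 0
    fi
    # Pause only if another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not reachable: $url" >&2
  return 1
}
```

For example, wait_for_url http://localhost:9090 returns successfully once the web client answers, and returns a non-zero status if it never does.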

Use Spline in your application

Add a dependency on the Spark Agent.

<dependency>
    <groupId>za.co.absa.spline</groupId>
    <artifactId>spark-agent</artifactId>
    <version>0.4.2</version>
</dependency>

In your Spark job, you have to enable Spline.

import org.apache.spark.sql.SparkSession

// given a Spark session ...
val sparkSession: SparkSession = ???

// ... enable data lineage tracking with Spline
import za.co.absa.spline.harvester.SparkLineageInitializer._
sparkSession.enableLineageTracking()

// ... then run some Dataset computations as usual.
// Data lineage of the job will be captured and stored in the
// configured database for further visualization by Spline Web UI

Properties

You also need to set some configuration properties. Spline combines these properties from several sources:

  1. Hadoop config (core-site.xml)

  2. JVM system properties

  3. spline.properties file in the classpath

spline.mode

spline.producer.url

Example:

spline.mode=REQUIRED
spline.producer.url=http://localhost:8080/producer
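
For the third source, the file has to end up on the application's classpath. A minimal sketch, assuming a standard Maven project layout in which src/main/resources is packaged onto the classpath (the path is an assumption about your project; the values are the ones from the example above):

```shell
# Create the resources directory of a Maven-style project (assumed layout).
mkdir -p src/main/resources

# Write the two Spline settings from the example into spline.properties.
cat > src/main/resources/spline.properties <<'EOF'
spline.mode=REQUIRED
spline.producer.url=http://localhost:8080/producer
EOF
```

The same keys could instead be supplied as JVM system properties, for instance through the spark-submit option --driver-java-options "-Dspline.mode=REQUIRED -Dspline.producer.url=http://localhost:8080/producer".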

Run Spline Migration from 0.3 to 0.4+

Spline 0.3 used MongoDB as its database. In Spline 0.4 we switched to ArangoDB. Since MongoDB is no longer supported as a database, you may need to migrate your data from MongoDB to ArangoDB. To do that, simply run:

java -jar migrator-tool/target/migrator-tool.jar \
  --source=mongodb://localhost:27017/splinedb \
  --target=http://localhost:8080/spline/producer

For more information, please refer to the migrator tool documentation.


Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.