
Data Lineage Tracking and Visualization tool for Apache Spark™

The Spline (from Spark lineage) project helps people get insight into the data processing performed by Apache Spark.

The project consists of two parts:

  - a core library that sits on the Spark driver, captures data lineage from the executed jobs, and stores it in a database
  - a Web UI application that visualizes the stored data lineages
In your POM file:

    <!-- The artifact coordinates below are an assumption based on version 0.2.5; verify them for the version you use. -->
    <dependency>
        <groupId>za.co.absa.spline</groupId>
        <artifactId>spline-core</artifactId>
        <version>0.2.5</version>
    </dependency>
    <!-- You can use other types of persistence including your own. -->
    <!-- See below for details. -->

In your Spark job:

// given a Spark session ...
import org.apache.spark.sql.SparkSession
val sparkSession: SparkSession = ???

// ... enable data lineage tracking with Spline
// (the import path below is assumed from the Spline 0.2.x package layout)
import za.co.absa.spline.core.SparkLineageInitializer._
sparkSession.enableLineageTracking()

// ... then run some Dataset computations as usual.
// Data lineage of the job will be captured and stored in the
// configured Mongo database for further visualization by Spline Web UI

Download the Spline Web UI executable JAR and run:

java -Dspline.mongodb.url=... -jar spline-web-0.2.5-exec-war.jar

Open http://localhost:8080 in your browser and you will get the Spline lineage visualization UI.


Spline should fill a big gap within the Apache Hadoop ecosystem. Spark jobs shouldn't be treated as mere magic black boxes; people should have a chance to understand what happens with their data. Our main focus is to solve the following particular problems:

Getting started


Setup for your Spark job:
  1. Include the Spline core JAR in your Spark job's classpath (it's enough to have it on the driver only; executors don't need it)

  2. Configure the database connection properties (see the Configuration section)

  3. Enable data lineage tracking on the Spark session before calling any action method:
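
A minimal sketch (the import path below is assumed from the Spline 0.2.x package layout; check it against the version you use):

import za.co.absa.spline.core.SparkLineageInitializer._

sparkSession.enableLineageTracking()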

Web UI application:

There are two ways to run the Spline Web UI:

Standalone application (executable JAR)

Run java -Dspline.mongodb.url=... -jar spline-web-0.2.5-exec-war.jar and then point your browser to http://localhost:8080.

To change the port number from 8080 to, say, 1234, add -httpPort 1234 to the command line.
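
For example:

java -Dspline.mongodb.url=... -jar spline-web-0.2.5-exec-war.jar -httpPort 1234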

(For more details see the Generated executable jar/war section.)

Standard Java web application (WAR)
  1. In your Java web container (e.g. Tomcat) set up the Spline database connection properties, either via system environment variables or JVM system properties (see the example after this list)
  2. Deploy the Spline WAR file to your Java web container (tested on Tomcat 7, but other containers should also work)
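
For example, on Tomcat the properties can be passed as JVM system properties via CATALINA_OPTS (a sketch; the connection values are placeholders):

export CATALINA_OPTS="$CATALINA_OPTS -Dspline.mongodb.url=mongodb://localhost:27017 -Dspline.mongodb.name=spline"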

Build Spline from the source code

You will need:

  - a JDK
  - Apache Maven

Then, from the project root, run:

mvn install -DskipTests

Lineage persistence

Spline can persist harvested lineages in various ways. It uses a PersistenceFactory to obtain instances of DataLineageReader and DataLineageWriter that persist and access the data lineages. Out of the box, Spline supports three types of persistors:

There is also a ParallelCompositeFactory that works as a proxy and delegates work to other persistors, so you can, for example, store lineages to Mongo and Atlas simultaneously.
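
Such a composite setup could look like this (a sketch; the fully qualified factory class names are placeholders, use the ones shipped with your Spline version):

spline.persistence.factory=<fully qualified name of ParallelCompositeFactory>
spline.persistence.composition.factories=<fully qualified name of MongoPersistenceFactory>,<fully qualified name of AtlasPersistenceFactory>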


Configuration

When enabling data lineage tracking for a Spark session in your Spark job, a SplineConfigurer instance can be passed as an argument to the enableLineageTracking() method.

The method signature is the following:

def enableLineageTracking(configurer: SplineConfigurer = new DefaultSplineConfigurer(defaultSplineConfiguration)): SparkSession

DefaultSplineConfigurer looks up the configuration parameters in the given Configuration object.
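
A minimal sketch of passing an explicitly built configuration, assuming the Configuration type comes from Apache Commons Configuration (the Spline import for DefaultSplineConfigurer is omitted, and the property values are placeholders):

import org.apache.commons.configuration.BaseConfiguration

val splineConf = new BaseConfiguration
splineConf.setProperty("spline.mode", "BEST_EFFORT")
splineConf.setProperty("spline.mongodb.url", "mongodb://localhost:27017") // placeholder
splineConf.setProperty("spline.mongodb.name", "spline") // placeholder

sparkSession.enableLineageTracking(new DefaultSplineConfigurer(splineConf))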

The defaultSplineConfiguration object combines several configuration sources (ordered by priority):

  1. Hadoop configuration (core-site.xml)
  2. JVM system properties
  3. spline.properties file on the classpath

Configuration properties

| Property | Description | Example |
|----------|-------------|---------|
| spline.mode | DISABLED: lineage tracking is completely disabled and Spline is unhooked from Spark.<br>REQUIRED: if Spline fails to initialize itself (e.g. wrong configuration, no DB connection etc.) the Spark application aborts with an error.<br>BEST_EFFORT (default): Spline tries to initialize itself, but if it fails it switches to DISABLED mode, allowing the Spark application to proceed normally without lineage tracking. | BEST_EFFORT |
| spline.persistence.factory | Fully qualified name of the PersistenceFactory implementation to be used by Spline | |
| spline.mongodb.url | Mongo connection URL (MongoPersistenceFactory only) | mongodb://... |
| spline.mongodb.name | Mongo database name (MongoPersistenceFactory only) | |
| spline.persistence.composition.factories | Comma-separated list of factories to delegate to (ParallelCompositeFactory only) | |
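
For illustration, a minimal spline.properties sketch for Mongo persistence (the host and database name are placeholders, and the fully qualified factory class name depends on your Spline version):

spline.mode=BEST_EFFORT
spline.persistence.factory=<fully qualified name of MongoPersistenceFactory>
spline.mongodb.url=mongodb://localhost:27017
spline.mongodb.name=spline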


Sample jobs

The sample folder contains some sample Spline-enabled Spark jobs.

The sample jobs read data from the /sample/data/input/ folder and write the results into /sample/data/results/.

Once the lineage data is captured and stored into the database, it can be visualized and explored via the Spline Web UI application.

Sample job 1

import org.apache.spark.sql.{SaveMode, SparkSession}

val sparkBuilder = SparkSession.builder().appName("Sample Job 2")
val spark = sparkBuilder.getOrCreate()

// Enable data lineage tracking with Spline
// (the import path below is assumed from the Spline 0.2.x package layout)
import za.co.absa.spline.core.SparkLineageInitializer._
spark.enableLineageTracking()

// A business logic of a Spark job ...
import spark.implicits._

val sourceDS = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("sample/data/input/source.csv") // placeholder input path
  .filter($"total_response_size" > 1000)
  .filter($"count_views" > 10)

val domainMappingDS = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("sample/data/input/domain_mapping.csv") // placeholder input path

val joinedDS = sourceDS
  .join(domainMappingDS, $"domain_code" === $"d_code", "left_outer")
  .select($"page_title".as("page"), $"d_name".as("domain"), $"count_views")

// Write the result so that the whole lineage can be captured
joinedDS.write.mode(SaveMode.Overwrite).parquet("sample/data/results/job1_results") // placeholder output path





Copyright 2017 Barclays Africa Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.