Articles

Data Lineage Tracking And Visualization Solution

01 Apr 2022 - AWS Big Data Blog: Build data lineage for data lakes using AWS Glue, Amazon Neptune, and Spline

Data lineage is one of the most critical components of a data governance strategy for data lakes. It helps ensure that accurate, complete, and trustworthy data drives business decisions. While a data catalog provides metadata management and search capabilities, data lineage shows the full context of your data by capturing in greater detail the relationships between data sources: where the data originated and how it is transformed and converged. Different personas in the data lake benefit from data lineage:

Data scientists can view and track data flow as it moves from source to destination, which makes it easier to understand the quality and origin of a particular metric or dataset
Data platform engineers gain more insight into the data pipelines and the interdependencies between datasets
Engineers can apply and validate changes in data pipelines more easily because they can identify a job’s upstream dependencies and downstream usage and properly evaluate service impacts

As the data landscape grows more complex, customers face significant manageability challenges in capturing lineage in a cost-effective and consistent manner. In this post, we walk you through three steps in building an end-to-end automated data lineage solution for data lakes: lineage capture, modeling and storage, and visualization.

In this solution, we capture both coarse-grained and fine-grained data lineage. Coarse-grained data lineage, which often targets business users, focuses on capturing the high-level business processes and overall data workflows. Typically, it captures and visualizes the relationships between datasets and how they’re propagated across storage tiers, including extract, transform and load (ETL) jobs and operational information. Fine-grained data lineage gives access to column-level lineage and the data transformation steps in the processing and analytical pipelines.
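To make the distinction concrete, here is a minimal sketch of how the two granularities could relate. It is not the data model used by Spline or Amazon Neptune; the tuple layout, the function names, and every dataset, column, and job name are hypothetical. It assumes fine-grained lineage is stored as column-to-column edges, derives the coarse-grained view by collapsing those edges to the dataset level, and walks the edges backwards to find the upstream columns of a metric.

```python
# Fine-grained lineage as column-level edges (all names are hypothetical):
# (source dataset, source column, target dataset, target column, job)
COLUMN_EDGES = [
    ("raw.orders", "amount", "curated.sales", "revenue", "etl_sales"),
    ("raw.orders", "currency", "curated.sales", "revenue", "etl_sales"),
    ("raw.customers", "id", "curated.sales", "customer_id", "etl_sales"),
    ("curated.sales", "revenue", "report.kpi", "total_revenue", "agg_kpi"),
]


def coarse_edges(column_edges):
    """Collapse column-level edges into dataset-level (coarse-grained) edges."""
    return sorted(
        {(src_ds, dst_ds, job) for src_ds, _, dst_ds, _, job in column_edges}
    )


def upstream_columns(column_edges, dataset, column):
    """Return every (dataset, column) the given column transitively depends on."""
    # Index edges by target so we can walk from a column back to its sources.
    parents = {}
    for src_ds, src_col, dst_ds, dst_col, _ in column_edges:
        parents.setdefault((dst_ds, dst_col), set()).add((src_ds, src_col))

    seen = set()
    stack = [(dataset, column)]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    for edge in coarse_edges(COLUMN_EDGES):
        print("dataset-level:", edge)
    for col in sorted(upstream_columns(COLUMN_EDGES, "report.kpi", "total_revenue")):
        print("upstream column:", col)
```

In this sketch the coarse-grained view answers "which datasets feed which, through which job", while the column-level walk answers "which source columns contribute to this metric". A graph database would typically replace the in-memory lists and the manual traversal.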

01 Nov 2021 - Capturing & Displaying Data Transformations with Spline

04 Oct 2021 - Collecting and visualizing data lineage of Spark jobs

22 Mar 2021 - Best Data Lineage Tools

16 Mar 2021 - Data Lineage from Databricks to Azure Purview

28 Sep 2020 - How We Extract Data Lineage from Large Data Warehouses

22 Jan 2020 - Data lineage tracking using Spline 0.3 on Atlas via Event Hub

16 Dec 2019 - Spline 0.4 has arrived!

Updated vision and architecture

20 Nov 2019 - Spark Data Lineage on Databricks Notebook using Spline

14 Apr 2019 - Data Lineage In Azure Databricks With Spline

25 Mar 2019 - Spark job lineage in Azure Databricks with Spline and Azure Cosmos DB API for MongoDB

24 Jan 2019 - Atlas Support Is Back!

Spline Atlas Integration vs Hortonworks Spark Atlas Connector

How To Try Out Spline Atlas Integration

24 Dec 2018 - Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 2)

02 Dec 2018 - Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 1)

30 Nov 2018 - Spline 2: Vision And Architecture Overview

25 Oct 2018 - Spline 0.3 User Guide

04 Oct 2018 - Spline: Data Lineage For Spark Structured Streaming

18 Apr 2018 - Zeenea - Data lineage: How to map your data within your information system?

16 Apr 2018 - End to End Atlas Lineage with Nifi, Spark, Hive

19 Feb 2018 - Data Lineage on Apache Spark with Spline

05 Feb 2018 - Data Lineage Tracking and Visualization tool for Apache Spark

17 Jan 2018 - Spline: Spark Lineage, Not Only for the Banking Industry

24 Oct 2017 - Spline: Apache Spark Lineage, Not Only for the Banking Industry