Data lineage tracking is one of the significant problems that financial institutions face. Banks and other organizations in highly regulated industries must understand how data flows through their systems in order to comply with strict regulatory frameworks. Many of these organizations also use big data technologies such as Hadoop and Apache Spark. Spark has become one of the most popular engines for big data computation, but it lacks support for data lineage tracking.
This paper describes Spline, a data lineage tracking and visualization tool for Apache Spark. Spline captures lineage information from internal Spark execution plans and stores it in a lightweight, unobtrusive, and easy-to-use manner. Additionally, Spline offers a modern user interface that allows non-technical users to understand the logic of Apache Spark applications.

Keywords: Spline; Apache Spark; data lineage; Big data applications; Apache Hadoop; banking; BCBS