The work of a data engineer typically consists of taking code received from data analysts or data scientists and making it production-ready: it has to run on Apache Spark and it has to scale. Once the code is in production, the data engineers move on to other tasks. The code itself typically involves a lot of number-crunching, and as weeks or even months pass, the people analyzing the results start asking how the numbers were actually calculated. Answering those questions by manually retracing the transformations becomes a nearly impossible task for a data engineer, so tools are needed that can keep track of this automatically.
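To make the problem concrete, here is a sketch of the kind of number-crunching job that ends up in production. All names, columns, and paths below are hypothetical, chosen only to illustrate how quickly the origin of a figure like `total_spend` becomes opaque once the author has moved on:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SpendReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Monthly Spend Report")
      .getOrCreate()

    // Hypothetical inputs: raw transactions and a customer dimension table
    val transactions = spark.read.parquet("/data/raw/transactions")
    val customers    = spark.read.parquet("/data/raw/customers")

    // The actual number-crunching: filter, convert, join, aggregate.
    // Months later, nothing in the output table reveals that cancelled
    // transactions were dropped or that amounts were converted using fx_rate.
    val report = transactions
      .filter(col("status") =!= "CANCELLED")
      .withColumn("amount_eur", col("amount") * col("fx_rate"))
      .join(customers, Seq("customer_id"))
      .groupBy("country", "month")
      .agg(sum("amount_eur").as("total_spend"))

    report.write.mode("overwrite").parquet("/data/reports/monthly_spend")

    spark.stop()
  }
}
```

Anyone asked "how is total_spend derived?" has to dig this code back up and read it line by line, which is exactly the situation lineage tooling is meant to avoid.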
The process of keeping track of and visualizing how data is manipulated is called data lineage. Not only is this useful for the people working with the data, but with legal regulations such as the GDPR coming into effect in May 2018, it is becoming a must. For financial institutions, data lineage is one of the most significant problems when adopting big data tools, mainly because they are obliged to produce transparent reports and to show how their numbers are derived. To streamline this process, a team of developers at Barclays Africa Group Limited has developed a tool that integrates into Apache Spark: Spline (Spark Lineage). This is the tool we will discuss in this article.
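As a rough sketch of what that integration looks like, Spline is switched on for an existing SparkSession before the job runs, after which it captures the execution plans of the job's Spark SQL operations as lineage. The initializer package and method shown here follow the Spline documentation of the time, so treat them as assumptions that may differ between Spline versions:

```scala
import org.apache.spark.sql.SparkSession

// Assumed Spline initializer import; the exact package path
// may vary between Spline releases.
import za.co.absa.spline.core.SparkLineageInitializer._

object SpendReportWithLineage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Monthly Spend Report")
      .getOrCreate()

    // Registers Spline's listeners so that the execution plans of the
    // job's Spark SQL operations are captured and persisted as lineage,
    // without changing the transformation code itself.
    spark.enableLineageTracking()

    // ... the same number-crunching code as before ...

    spark.stop()
  }
}
```

The appeal of this approach is that the lineage is harvested from what Spark actually executes, so the recorded graph cannot drift away from the code the way hand-written documentation does.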