In the world of data analytics in 2019, keeping tabs on where bits of information came from, how they were processed and where they ended up is more important than ever. This concept is boiled down to two words: data lineage. Just as a dog breeder would want to know the lineage of a pooch they're paying for, folks in the business intelligence sector want to know the lineage of the data that shows up in a final work product. Let's look at the what, the why and the how of this process.
The simplest form of lineage for data is indexing items with unique keys that follow them everywhere. From the moment a piece of data is entered into a system, it should be tagged with a unique identifier that stays with it through every process it's subjected to. This ensures that all data points can be tracked across departments, systems and even data centers.
The concept can be extended significantly. Metadata about entries can include information regarding:
In other words, the lineage functions as a pedigree that allows anyone looking at a piece of data to evaluate where it came from and how it got where it is today.
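To make the idea of a key that follows the data everywhere a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Record` and `LineageEvent` classes, the `ingest` and `normalize` steps, and the metadata fields (step name, system and timestamp) are hypothetical and not a prescribed schema.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    step: str    # name of the process applied (hypothetical label)
    system: str  # system or department where the step ran
    at: str      # UTC timestamp in ISO 8601 format


@dataclass
class Record:
    value: dict
    # The unique key is assigned once, at entry, and never regenerated.
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    lineage: list = field(default_factory=list)

    def log(self, step, system):
        """Append one lineage event and return the record for chaining."""
        stamp = datetime.now(timezone.utc).isoformat()
        self.lineage.append(LineageEvent(step, system, stamp))
        return self


def ingest(value, source):
    """Tag a new piece of data with an identifier the moment it enters."""
    return Record(value=value).log("ingest", source)


def normalize(record):
    """Example transformation: lowercase the field names.

    The output carries the same record_id and a copy of the prior
    history, so the pedigree survives the transformation.
    """
    cleaned = {key.lower(): val for key, val in record.value.items()}
    out = Record(value=cleaned, record_id=record.record_id,
                 lineage=list(record.lineage))
    return out.log("normalize", "etl")
```

Calling `normalize(ingest({"Amount": 10}, "crm"))` yields a record whose `lineage` lists the ingest step followed by the normalize step, both under the same `record_id`.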
Within the context of business intelligence, there will always be questions about the inputs that went into a final product. Individual data points can be reviewed to discover problems with processes or to show how transformations occurred. This allows folks to:
When someone wants to pull a specific anecdote from the data, the lineage allows them to get very granular, too. In the NBA of 2019, for example, shot location data is used to study players, set defenses and even choose when and where to shoot. If a coach wants to cite an example, they can look through the lineage for a shot in order to find film to pull up.
The same logic applies in many business use cases. An insurance company may be trying to find ways to deal with specific kinds of claims. No amount of data in the world is going to have the narrative power of a particular anecdote. In presenting insights, data scientists can enhance their presentations by homing in on a handful of data points that really highlight the ideas they're trying to convey. This might include:
Data governance is also becoming a bigger deal with each passing year. Questions about privacy and anonymization can be answered based on the lineage of a company's data. Knowing what the entire life cycle of a piece of information is ultimately enhances trust both within an organization and with the larger public.
Cost savings may be discovered along the way, too. Verification can be sped up by having a good lineage already available. Errors like duplication are more likely to be discovered and to be found sooner, ultimately improving both the quality and speed of a process. If a data set is outdated, it will be more evident based on its lineage.
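As a rough illustration of how lineage can surface duplicates and outdated data, the sketch below scans a list of records for repeated identifiers and old acquisition dates. The field names `record_id` and `acquired_on`, and the one-year default cutoff, are assumptions chosen for the example rather than anything a particular tool requires.

```python
from collections import Counter
from datetime import date, timedelta


def find_duplicates(records):
    """Return the identifiers that appear more than once."""
    counts = Counter(r["record_id"] for r in records)
    return sorted(rid for rid, n in counts.items() if n > 1)


def find_stale(records, today, max_age_days=365):
    """Return identifiers whose acquisition date is older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted({r["record_id"] for r in records
                   if r["acquired_on"] < cutoff})


# Hypothetical sample data for illustration only.
sample = [
    {"record_id": "a1", "acquired_on": date(2019, 3, 1)},
    {"record_id": "a1", "acquired_on": date(2019, 3, 1)},
    {"record_id": "b2", "acquired_on": date(2016, 7, 15)},
]
```

With this sample, `find_duplicates(sample)` flags `a1`, and `find_stale(sample, today=date(2019, 6, 1))` flags `b2`.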
Talking about data lineage in the abstract is one thing. Implementing sensible and practical policies is another.
Just as data analytics demands a number of particular cultural changes within an organization, caring about lineage takes that one step further. It entails being able to:
At a technical level, databases have to be configured to make tracking lineage possible. Data architecture takes on new meaning under these circumstances, and systems have to be designed from the start with lineage in mind. This can often be a major undertaking when confronting banks of older data. If it's implemented in the acquisition and use of new data, though, it can save a ton of headaches.
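One possible way to design a database with lineage in mind is to keep a separate table of lineage events keyed to each record's identifier. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names and helper functions are all hypothetical, and a production system would likely need more than this.

```python
import sqlite3


def open_lineage_db():
    """Create an in-memory database with a records table and a lineage log."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE records (
            record_id TEXT PRIMARY KEY,
            payload   TEXT NOT NULL
        );
        CREATE TABLE lineage_events (
            event_id    INTEGER PRIMARY KEY AUTOINCREMENT,
            record_id   TEXT NOT NULL REFERENCES records(record_id),
            step        TEXT NOT NULL,
            source      TEXT NOT NULL,
            occurred_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
    """)
    return conn


def record_step(conn, record_id, step, source):
    """Log one processing step against a record's identifier."""
    conn.execute(
        "INSERT INTO lineage_events (record_id, step, source) VALUES (?, ?, ?)",
        (record_id, step, source),
    )


def history(conn, record_id):
    """Return the (step, source) pairs for a record, oldest first."""
    rows = conn.execute(
        "SELECT step, source FROM lineage_events "
        "WHERE record_id = ? ORDER BY event_id",
        (record_id,),
    )
    return rows.fetchall()
```

Because every step is appended rather than overwritten, asking how a record got where it is becomes a single query on its identifier.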
Tracking the lineage of a company's data allows it to handle a wide array of tasks more professionally and competently. This is especially the case when pulling data from outside sources, particularly when paying for third-party data. Not only is caring about lineage the right thing to do, but it also has a strong business case to back it up.