Pooling the many data sources a business or organization has into a data lake that everyone can access is an idea that inspires a lot of interest. The goal is to turn the data lake into a resource that drives innovation and insight by allowing clever team members to test ideas across many sources and variables. Unfortunately, without good data curation techniques, that lake can become a data swamp in no time at all.
Let's say you want to build a database that contains information about all the employees at your company. There are two data sources: the first includes an employee's name, salary, birthday and current address, while the second includes their name, current city of residence, hobbies listed on their application and salary. You want to bring these collections of information together. That's the data ingestion process.
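As a minimal sketch of that first step, here is what combining the two sources might look like with pandas. The file names and column names are assumptions invented for illustration, and joining on an employee's name is exactly the kind of shortcut the next paragraph picks apart:

    # A sketch of basic ingestion: read two hypothetical sources and
    # join them on the only field they share. Assumes pandas is installed.
    import pandas as pd

    # Source 1: name, salary, birthday, current address (hypothetical file).
    hr_records = pd.read_csv("hr_records.csv")

    # Source 2: name, city of residence, hobbies, salary (hypothetical file).
    applications = pd.read_csv("applications.csv")

    # Merge on name, keeping every employee from either source. Suffixes
    # disambiguate the salary column, which appears in both files.
    employees = hr_records.merge(
        applications,
        on="name",
        how="outer",
        suffixes=("_hr", "_application"),
    )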
There will be transformation needs, as you'll have to break down information like the address into its constituent pieces, such as street, city, state and ZIP code. Similarly, the street address itself may be one or two lines long, depending on things like whether there's an apartment number or a separate P.O. box. There may also be more advanced issues, such as differences in formatting across countries.

Schema issues present problems, too. For example, let's say you have an entry in your first source for "John Jones" and another for "John J. Jones" or something similar. How do you decide what constitutes a match? More importantly, what criteria can ensure that actual matches are obtained through the kinds of automated processes that are common during data ingestion? In the best-case scenario, good data curation practices are in place from the start: some sort of unique identifier, such as an employee ID number that is never reused, matches people across all your data tables. In the worst-case scenario, you simply have a pile of mush and you're left stabbing in the dark.
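To make those two problems concrete, here is a small sketch of both transforms. The address pattern assumes a single-line, US-style format, and the name matcher is a deliberately naive normalization; production entity resolution would lean on far more signals than this:

    # Illustrative examples of two common transforms. Both functions are
    # simplifying assumptions, not a complete solution.
    import re

    def split_us_address(address):
        """Split a 'street, city, ST 12345' string into its pieces.
        International formats would need per-country rules."""
        pattern = (r"^(?P<street>.+),\s*(?P<city>[^,]+),"
                   r"\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$")
        match = re.match(pattern, address.strip())
        return match.groupdict() if match else None

    def normalize_name(name):
        """Lowercase, strip punctuation and drop single-letter tokens,
        so 'John Jones' and 'John J. Jones' compare as equal."""
        tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
        return " ".join(t for t in tokens if len(t) > 1)

    split_us_address("152 Oak St Apt 4, Springfield, IL 62704")
    # -> {'street': '152 Oak St Apt 4', 'city': 'Springfield',
    #     'state': 'IL', 'zip': '62704'}

    normalize_name("John Jones") == normalize_name("John J. Jones")  # True

Notice that the name matcher would also merge two genuinely different people who happen to share a name, which is why unique identifiers beat string matching every time.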
Even if your organization employs best practices, such as unique IDs for entries, date stamps and identifiers that survive transforms, there will be curation needs in virtually every data set. Perhaps you get lucky and all the data lines up perfectly on those ID tags, too. Plenty of other things can still go wrong. What happens if there's a scrubbed or foreign character in an entry? Special characters are often escaped into HTML entities by security layers prior to database insertion to guard against injection attacks. Data sources can also introduce problems. Perhaps you've been importing information from a CSV file, and you don't notice one or two entries that throw the alignment off by a column or two. Worse, instead of getting a runtime error from your code or your analytics package, everything appears to be fine. Without a person scanning through the data, you won't notice the flaw until someone pulls one of the broken entries. In the absolute worst scenario, critical computational data gets passed along and ends up producing a flawed work product.
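That CSV failure mode is cheap to guard against. Here is a minimal sketch of a width check, assuming a hypothetical employees.csv; a few lines like this at ingestion time catch the silent off-by-one before it poisons downstream work:

    import csv

    def find_misaligned_rows(path, expected_columns):
        """Return (line_number, width) for every row whose column count
        differs from the schema. These rows load without any runtime
        error, so nothing else will flag them."""
        bad_rows = []
        with open(path, newline="") as f:
            for line_number, row in enumerate(csv.reader(f), start=1):
                if len(row) != expected_columns:
                    bad_rows.append((line_number, len(row)))
        return bad_rows

    # Hypothetical usage: the employee schema has 6 columns.
    print(find_misaligned_rows("employees.csv", expected_columns=6))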
Okay, you've gotten all that business straightened out. Curation superstar that you are, everything aligns beautifully, automated processes flag issues and humans double-check everything. Now you have to put usable information into your employees' hands. First, you need to know the technical limits of everyone you employ. If someone can't write an SQL query, you need to offer the data in additional formats, such as spreadsheets, that let them load it into their own analytics packages. Will you walk back those transforms in the output process? If so, how do you confirm the results will be accurate renderings of the original input? Likewise, the data needs to be highly browsable. That means ensuring servers are accessible and organized into folders with structures and names that make sense. For example, the top-level folders in a system might emphasize general categories, with names like "employees" and "customers" for easier reading.

Data curation is a larger cultural choice for an organization. By placing an emphasis on structure the whole way from ingestion to deployment, you can ensure that everyone has access and can quickly begin deriving insights from your data lake.
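As a closing sketch of that last mile, here is one way to hand a table to the spreadsheet crowd, assuming a hypothetical SQLite database and employees table; Python's standard library is enough:

    import csv
    import sqlite3

    # Export a table to CSV so non-SQL users can open it in a spreadsheet.
    # The database path and table name are assumptions for illustration.
    connection = sqlite3.connect("data_lake.db")
    cursor = connection.execute("SELECT * FROM employees")

    with open("employees_export.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(column[0] for column in cursor.description)  # header row
        writer.writerows(cursor)

    connection.close()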