4 Ways to Improve Data Quality with Curator 2

Recently we have had several customers ask a wide range of questions regarding Data Quality and Curator. While the specific questions vary, they generally center on one main theme: does Curator do Data Quality? Yes, Curator does Data Quality. In fact, with Curator, Data Quality is baked right into the platform.

Let’s look at a few ways Curator can make it easier to get clean, conforming data queued up and ready to publish. In this example we are using a data dump from the GitHub-based Openrecipes project. The file has been scraped together from a large number of sources using a variety of Python scripts, and contains over a hundred thousand recipes. Because the database was assembled from varied sources using different acquisition techniques, the consistency and quality of the data range from acceptable to very bad.

Even accounting for the bad rows, a hundred thousand recipes is still nothing to sneeze at! This is a very good example of using Curator to pull quality-controlled, valuable data from a challenging initial file.

Input data
170K+ lines like this.

Once downloaded, we imported the file into Curator, created a “Recipes” template and used the template to seed a new Curator project. The template and the project are the very best way we can ensure data quality: the template fields all have rules and types to which the output data will need to conform. Any individual observation (in this case, a recipe) can be investigated and edited. We have several tools in Curator to get data into the template cleanly and automatically.

1. Invert Your Filters and Vacuum Up Junk

For our finished Dataset, we want to set a few standards for output. Our recipes will need names and attribution sources in the form of correctly formed URLs. Additionally, we’d like to omit the records that obviously didn’t parse correctly into the original datastore.

We can inspect our initial data in the Groups view. The Groups view contains filter boxes at the top of each column, which allow a variety of inputs and operators keyed to the “type” of data in that column (for instance, a numerical range for floating-point numbers). This lets us get a handle on whatever set of recipes we want to work with. Here, we examine which recipes do not have web addresses:

inv-filter-2
No web addresses.

Grouping this way filters out about 70 items from the original file of more than 170 thousand recipes. Looking at the other columns where we want valid data, we might also notice that some of the recipe names actually contain web addresses. This usually means the original data scrapers have somehow misconstructed the source columns.

inv-filter-3
Incorrect names

Another issue with the data in its original format is that some of the ingredients have failed to parse correctly. Below, we have filtered (using empty delimiters) for the cases where that has occurred, and find nearly 5,000 recipes.

inv-filter-4
No ingredients.

When we enter filter conditions into multiple columns, they are combined with an “AND” conjunction, so each added condition builds an ever more restrictive filter. This makes a lot of sense when every column must meet its condition for observations to be grouped together.

What about cases like this, where we want to capture the observations containing “everything else”? These are cases where inverse filters really shine. By using the invert function, each query is not only negated, but the filters are also combined with the “OR” conjunction. This is a great way to gather up data problems that span the file.

inv-filter-5
Inverse filters

If we look at the number of filtered observations, we see that it is close to the sum of our previous three conditions, but not quite: some of our recipes in fact have more than one problem at once. Curator has gathered up all of these data problems into a single Group for which we can now issue a wide variety of actions.
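For readers who prefer to see the logic spelled out, here is a minimal pandas sketch of the same idea. The column names follow the Openrecipes dump and the variable names are ours; none of this is Curator’s actual implementation.

```python
import pandas as pd

# Load the Openrecipes dump (assumed here to be line-delimited JSON).
recipes = pd.read_json("openrecipes.json", lines=True)

# Normal filters: every condition must hold, so they combine with AND.
has_url = recipes["url"].str.startswith("http", na=False)
name_is_clean = ~recipes["name"].str.contains("http", na=False)
has_ingredients = recipes["ingredients"].fillna("").str.len() > 0

clean = recipes[has_url & name_is_clean & has_ingredients]

# Inverting the filters negates each condition and joins them with OR,
# which vacuums up every row that fails at least one check.
problems = recipes[~has_url | ~name_is_clean | ~has_ingredients]

# The two groups partition the file.
assert len(clean) + len(problems) == len(recipes)
```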

inv-filter-6
Groups are persistent bundles of observations you can use throughout the process of building and publishing a Dataset

With our Problem Recipes Group we have a number of options for proceeding. We can leave it as is and simply omit these recipes from the export when we decide to publish the data. We can annotate the entire Group at once, specifically overwriting one of the key fields as {Invalid} and expecting our downstream data consumers to prepare for and handle this annotation. We could also choose to run Transformations specific to the Group, but in this case we will stick with the Group annotation. We’ll touch on these options later in the post.

2. Unconditional (Data) Love with Conditional Transformations

In the recipes dataset, we have times for cook time, preparation time, and total time. The total time has been calculated as the sum of the cook time and the preparation time, but sometimes these values are mismatched. Not surprisingly, we also do not have all of these values for every single recipe. For our example we will extract the recipe times into minutes, fill in missing values, and sanity-check the totals.

The three “times” (cook time, prep time, and total time) are stored in a somewhat unusual way: as ISO 8601-style durations in the format PTxxHyyM, where xx is the number of hours and yy is the number of minutes. So, for example:

PT45M = 45 minutes
PT3H = 180 minutes
PT1H30M = 90 minutes

This format, while not unreasonable, illustrates one of the chores that often precedes Data Quality efforts: isolating true, calculable values into places where they can be examined. In the examples above, we show the number of minutes, since that is the actual measurement we would like to work with from the incoming data.

Curator Transformations handle this chore easily.

recipe-time-extract
Part of the Transformation Detail for extracting recipe times from the format PTxxHyyM. There are actually several different possible setups that can accomplish this.
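Outside of Curator, the same extraction can be sketched in a few lines of Python. The regular expression and function name below are ours, offered only to make the logic concrete; the actual Transformation setup is shown in the screenshot above.

```python
import re

def duration_to_minutes(value):
    """Convert a PTxxHyyM duration (e.g. 'PT1H30M') into minutes.

    Returns None for empty or unparseable strings, mirroring the
    "leave it blank" behavior we want for values that fail to parse.
    """
    if not value:
        return None
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", value.strip())
    if not match or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

assert duration_to_minutes("PT45M") == 45
assert duration_to_minutes("PT3H") == 180
assert duration_to_minutes("PT1H30M") == 90
```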

Running this Transformation for each of the timings gives us our extracted numerical value in minutes. We’ll also run simple Transformations to set our Time Units field to say “Minutes.” Using template fields that separate values and units is not only a good technique for data quality, but a general best practice when designing reusable Datasets. Among other things, the practice makes for easier downstream calculations, standardized units, and better internal “sanity checks” like the one we are about to do.

Once the minutes are extracted for each of the recipe timings, the Dataset is ready for some real QC investigation. Our key tool for this is Conditional Transformations: Transformations whose actions test conditions before they are executed.

A useful and common case for Conditional Transformations is checking for a missing value, and then assembling that missing value from other data. Curator excels at this type of operation, which can otherwise be quite cumbersome using simple manual file manipulation.

compute-if-missing
Simple Transformation checking for Empty values

Curator can run Transformation actions both on the incoming data columns from the file and on data that we have already created for the Dataset. In this example, the condition checks whether the Total Time Value field that we just extracted in the previous Transformation is empty. If it has a value already, the Transformation will leave it alone; if not, the Transformation will set the value to the sum of Cook Time Value and Prep Time Value (also previously set). The action will run on the entire Dataset.
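In plain Python, the logic of that Conditional Transformation amounts to something like the sketch below. The dictionary keys stand in for our template fields and are our own naming, not Curator’s.

```python
def fill_missing_total_time(recipe):
    """If Total Time Value is missing, compute it from Cook + Prep."""
    if recipe.get("total_time_value") is None:
        cook = recipe.get("cook_time_value") or 0
        prep = recipe.get("prep_time_value") or 0
        recipe["total_time_value"] = cook + prep
    return recipe

# Example: a recipe with no stated total gets one assembled from its parts.
recipe = {"cook_time_value": 30, "prep_time_value": 15, "total_time_value": None}
assert fill_missing_total_time(recipe)["total_time_value"] == 45
```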

Another case for Conditional Transformations can typically be described as “sanity checking” the data measurements by investigating the surrounding data. For the recipes Dataset, one example is to evaluate the Total Time as a function of the Prep Time and Cook Time. To do this, we again set up a Transformation adding the two components (Prep and Cook) together:

prep-plus-cook
Prep plus Cook Times.

Since the data products of the actions are themselves testable data that can be chained together to evaluate a result, a condition can be added to evaluate the summed value:

Sanity checking Total Time.

For this example, we have chosen to take any case where the listed Total Time does not accurately reflect the actual total (Prep Time plus Cook Time) and overwrite it with the corrected value. Of course, there are several other checks we could perform with the same values, and in any data QC operation, the steps taken when a data mismatch occurs will vary based on any number of project factors.
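As a companion to the fill-in sketch above, here is what that sanity check looks like as plain Python; again, the field names and the overwrite policy are ours and illustrate just one possible setup.

```python
def sanity_check_total_time(recipe):
    """Overwrite Total Time Value when it does not equal Prep + Cook.

    Overwriting is only one policy; a project might instead flag the
    mismatch or leave the original value untouched.
    """
    cook = recipe.get("cook_time_value") or 0
    prep = recipe.get("prep_time_value") or 0
    expected = cook + prep
    if recipe.get("total_time_value") != expected:
        recipe["total_time_value"] = expected
    return recipe

# Example: a listed total of 60 minutes is corrected to 45 (30 + 15).
recipe = {"cook_time_value": 30, "prep_time_value": 15, "total_time_value": 60}
assert sanity_check_total_time(recipe)["total_time_value"] == 45
```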

3. Save Your Work Using Transformation Sets

As we add our Transformations, we set them to process in a group called a Transformation Set:

qc-TS
Transformations belonging to the QC Transformation Set

The Transformation Set grows with each new Transformation added and can be saved to the project.

save-transformation-set

The Transformation Set contains all of the Transformations that we have set up, their names, and the order in which they need to be run. By saving the “Recipes Data QC” Transformation Set we get several major benefits for our future data quality efforts. First, anytime we return to work with this particular file, we can load the Transformation Set to modify or extend it with additional Transformations.

Second, we can use Transformation Sets to reprocess data that changes. For instance, we might choose to rerun the QC commands once we get a new data dump from the Openrecipes project and update our Curator file with it. Since Transformation Sets can be edited, we can go in and tweak individual Transformations that might need to be altered due to any change in format (or, very possibly, a change in the types of errors we are seeing in the file).

Finally, a potentially huge benefit of Transformation Sets is that they can be applied to any Dataset in the project. Imagine that the Openrecipes project were producing quarterly files, each in the same format but containing different scraped data. We can QC each of the files by loading them into Curator, adding them to the project, and loading our QC Transformation Set:

load-transformation-set
Selecting from available Transformation Sets.

Once we do this we can simply run the Transformations and enforce our periodic data quality check with minimal effort.
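Conceptually, a saved Transformation Set behaves like an ordered, named list of transformations that can be reapplied to any dataset in the project. The sketch below (our own naming, reusing the functions sketched earlier in this post) illustrates the idea; it is not Curator’s internal representation.

```python
# The transform functions are the ones sketched earlier in this post.
RECIPES_DATA_QC = [
    ("Fill missing Total Time", fill_missing_total_time),
    ("Sanity-check Total Time", sanity_check_total_time),
]

def run_transformation_set(recipes, transformation_set):
    """Apply each transformation, in order, to every recipe."""
    for _name, transform in transformation_set:
        recipes = [transform(dict(r)) for r in recipes]
    return recipes

# The same set can be rerun whenever a new quarterly dump arrives, e.g.:
# quarterly = run_transformation_set(load_recipes("openrecipes-q1.json"), RECIPES_DATA_QC)
```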

4. Take Good Care of Bad Data

In this post, we have mentioned several approaches to handling low-quality data in Curator. All of these approaches involve data design and a good understanding of the project objective. In fact, handling bad data well is the best indicator of a successful data project.

A key component of project design in Curator comes in template construction. When developing a template, we recommend considering where values may exceed thresholds or contain unexpected additional information. General approaches include:

Approach one: “Mark” the Observation as flawed

In the first section above, the example included adding an {Invalid} annotation to Observations with flawed URL values. This is often the easiest approach to handling data flaws, and it is especially appropriate in cases where a flawed record is, for all practical purposes, a useless one. While the approach does not require a change to the design of a Template, it can have some drawbacks: it can be difficult to enforce exactly which stored annotations indicate data problems, and data can be “destroyed” (or at least overwritten) when the {Invalid} tag is switched in.
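To make the trade-off concrete, here is a tiny sketch of the “mark it” approach; the {Invalid} marker comes from the example above, while the field and function names are ours.

```python
INVALID = "{Invalid}"

def mark_invalid(recipe, field="url"):
    """Overwrite a key field with the {Invalid} marker.

    Note the drawback: the original (flawed) value is lost, and every
    downstream consumer must know to check for the marker.
    """
    recipe[field] = INVALID
    return recipe

# Downstream consumers then filter the marker out before using the field:
# usable = [r for r in recipes if r["url"] != INVALID]
```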

Approach two: “Fix” broken data

We used this approach with the Conditional Transformations in the second section. In that case, in order to process the data, we converted text strings into numbers that could then be evaluated by the QC Transformation. The template was modified accordingly, both to capture the incoming text and to accommodate the numerical values needed. “Repaired” data adds real value to the Dataset. However, there is still a potential drawback: we do not communicate downstream what the fix was or whether it was performed.

Approach three: Add specific Quality Control Fields

There is a way we can correct our data and communicate quality in a completely lossless fashion, and that’s by using additional fields. For instance, we could simply add a Field called “Observation Valid?” to the template and switch it from true to false when appropriate. If we choose to, we can go even further and add more fields:

dataset-extra-qc-fields
Extra QC fields.
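As a rough illustration of what this buys us, an exported Observation can then carry its QC story alongside the corrected data. The record layout and the example values below are hypothetical; only the “Observation Valid?” and “Total Time QC Remarks” fields come from the discussion above.

```python
# A hypothetical exported Observation carrying its extra QC fields.
observation = {
    "name": "Example recipe",        # placeholder, not from the real data
    "total_time_value": 45,
    "time_units": "Minutes",
    "observation_valid": False,      # flipped when a QC check fails
    "total_time_qc_remarks": "Total Time overwritten; listed value did not equal Prep + Cook",
}

# A consumer that cannot handle the remarks field can simply drop it:
slim = {k: v for k, v in observation.items() if k != "total_time_qc_remarks"}
```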

Using this many fields for our recipes Dataset might seem like overkill for such an informal example. However, it does illustrate an important point. With Curator, it’s very easy to add fields to a template, and if they will typically be operated on using Groups or Transformations, they are inexpensive to support, even for a small project like this one. They can always be toggled on and off for downstream use; if a consuming analytics application cannot process data from a “Total Time QC Remarks” field, that field can simply be omitted. The majority of the time and effort of data Quality Control in Curator occurs where it should, in our opinion: during the up-front design and subsequent design iterations of the data project.

The Next Step: Curator Enterprise and QC Automation

In this post we have seen a number of ways Curator 2 can make data quality execution easier, less time-consuming, and more effective for individual Datasets. In an upcoming post, we will demonstrate stitching these individual operations together into a complete QC Automation workflow, using data tags and project attributes to trigger those operations. Stay tuned!