Elevada recently released Curator 2.1, an update with expanded capabilities that make data transformations even more powerful. Available today, it introduces Lookup Transformations, designed to enable automated augmentation and combination of datafiles.
Curator 2.1 gives users more freedom to establish references between datafiles using the Lookup Transformation. The ability to dynamically look up and retrieve information across multiple datafiles within the Transformation chain is a game-changer for many users.
Here’s how it works:
Say we are interested in combining data recently made available by the National Highway Traffic Safety Administration (NHTSA) with Census data to discover demographic patterns in traffic fatalities.
Looking at the data, an immediate question comes to mind: “do some areas have higher traffic fatality rates (that is, relative to local population levels) than others?” Since we only have raw frequency counts in the NHTSA data, we need to turn to our public Census files for population aggregates by United States county.
Augmenting source data by cross-referencing other datasets is the core functionality that Lookup Transformations provide. Lookups join Concept Schemes as a type of reference we can use in Curator to add meaning and control to our finished data. Lookup Transformations match every record in our source file against the selected reference or references, and can flexibly access any column within a reference; so, if we have a demographics table such as those provided by the Census Bureau, we can access multiple columns of that table in distinct Transformations to enrich our finished data.
Here we have just the Lookup file we need, loaded into Curator. To do this, we simply access the United States Census portal, download the file(s), and use the appropriate reference loader to upload them into Curator. Once a Lookup is added to Curator it becomes shareable for all users, unless the uploader chooses to restrict access, and it does not need to be added again for additional projects.
To combine the reference data with our traffic data, we’ll generate a Transformation that keys off the county codes.
In this example, we are keying off of our Census 2014 population Lookup file. As the Input we have selected a State-County code (technically a “FIPS” code) that we coalesced in an earlier Transformation step. We use that code as the search criterion, and the output can be taken from any column of the Lookup file–here we select the Census population estimates for 2014. The preview renders, giving us a quick sanity check that these are indeed the values we are looking for.
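Curator configures this matching through its interface, but conceptually a Lookup behaves like a keyed join against the reference file. Here is a minimal Python sketch of that idea; the FIPS codes and population figures are illustrative placeholders, not actual Census values or Curator internals:

```python
# Conceptual sketch of a Lookup Transformation: match traffic records
# to a Census reference file keyed on the State-County (FIPS) code.
# All values below are illustrative placeholders.

census_2014 = {
    # FIPS code -> 2014 population estimate (placeholder values)
    "01001": 55_000,
    "21001": 19_000,
}

traffic_records = [
    {"fips": "01001", "fatalities": 3},
    {"fips": "21001", "fatalities": 1},
]

for record in traffic_records:
    # Input: the coalesced FIPS code; Output: any column of the Lookup file
    record["population_2014"] = census_2014.get(record["fips"])

print(traffic_records[0]["population_2014"])  # -> 55000
```

In Curator the same selection happens in the Transformation editor, with the preview grid serving as the sanity check.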
Note that since Lookups are encapsulated within our overall Transformations Engine logic, we can perform interesting syntheses by chaining together other logic. For instance, we might decide that we want to use different Lookup files for different time periods. That’s easy enough:
The Lookup Transformation represented above conditionally evaluates the traffic data from January 2010 (01-10), performing a lookup using the Census file for 2000-2010. In our uploaded Lookup, State 1 is Alabama and State 21 is Kentucky. By using additional Lookups we can retrieve and assemble all of this information into the finished Dataset we plan to export. The processed data generated by the Transformation are displayed in the Editor grid, pictured below.
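The conditional branching is built visually in Curator, but the underlying logic resembles choosing a lookup table based on the record's date. A hedged Python sketch, with placeholder file contents and an assumed "MM-YY" date format:

```python
# Sketch of chained Lookup logic: pick the reference file based on the
# record's time period. Population figures are illustrative placeholders.

census_2000_2010 = {"01001": 43_000}
census_2010_2020 = {"01001": 55_000}

def population_for(fips, month_year):
    # month_year in "MM-YY" form, e.g. "01-10" for January 2010
    year = 2000 + int(month_year.split("-")[1])
    lookup = census_2000_2010 if year <= 2010 else census_2010_2020
    return lookup.get(fips)

print(population_for("01001", "01-10"))  # -> 43000
```

In Curator, each branch of the condition simply points at a different uploaded Lookup file.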
So far, we have augmented the NHTSA data with county populations from Census data. Using the same processing pipeline, we can also generate the accident and fatality rates keyed to county populations. We generate these by simply “bouncing” our assembled population data into calculations with our raw NHTSA data as follows.
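The “bounce” is just an arithmetic step that reuses the looked-up population column. A minimal sketch of the rate calculation, using a per-100,000 convention and placeholder numbers (the actual rate basis Curator uses here is whatever the Transformation specifies):

```python
# Sketch of the "bounce": combine the assembled population with the raw
# NHTSA count to compute a fatality rate. Numbers are placeholders.

record = {"fatalities": 3, "population_2014": 55_000}

record["fatality_rate_per_100k"] = (
    record["fatalities"] / record["population_2014"] * 100_000
)

print(round(record["fatality_rate_per_100k"], 2))  # -> 5.45
```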
Once we have created our Transformation Set, we can save it–Lookups and all–to our project. This means that our Lookups can be reused to augment all of our datafiles, additionally taking advantage of any automation we might want to apply at the project level. In this case, saving our Transformations and automatically processing them is very helpful; from NHTSA, we have dozens of years of datafiles from which to pull and assemble Datasets.
We can export our finished product to Excel files, which can then be loaded into a visualization tool such as Microsoft’s PowerBI:
This is an interesting initial look at our data, and we had it just minutes after downloading the data directly from the source! We assembled all this information, for dozens of files, with no programming and no delays, using Curator 2.1.
Global Field Definitions and Global Data Mappings
It is easier than ever to define and manage fields in Curator. Global fields allow the rapid creation and expansion of Datasets using well-known, controlled data definitions, and they also lend opportunities for automation. In the examples below, we have set up several new field definitions that we will reuse when we process our NHTSA Data.
In the example above, we’re using field definitions to precisely capture the latitude and longitude values we’ll get with every crash record in our traffic datafiles. Of course, we might eventually want to convert these values to actual geopositional values, but for now, we want to make sure we capture lat/long values in their native types–floating point numbers–in order to stay flexible for coalescing data later. We can add an official ID to the field (“Latitude_FP”) that will be globally unique to Curator, but we can also add a user-friendly label for casual users that does not need to be unique (here, we use “Latitude (Decimal)”). We might also choose to add data value restrictions at the point of defining the global field, and we can even add an export mapping value for a little extra customizability upon data output.
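Curator manages these definitions through its UI, but the pieces described above can be sketched as a simple structure. The dictionary keys and the validation helper below are illustrative assumptions, not Curator's actual schema:

```python
# Sketch of a global field definition with the pieces described above.
# The structure and key names are illustrative, not Curator's internals.

latitude_field = {
    "id": "Latitude_FP",            # official ID, globally unique in Curator
    "label": "Latitude (Decimal)",  # user-friendly label, need not be unique
    "type": float,                  # capture the native floating-point type
    "restrictions": {"min": -90.0, "max": 90.0},  # optional value limits
    "export_mapping": "lat",        # optional custom column name on output
}

def valid(field, value):
    # Check a value against the field's optional restrictions.
    r = field.get("restrictions", {})
    return r.get("min", float("-inf")) <= value <= r.get("max", float("inf"))

print(valid(latitude_field, 45.1))   # -> True
print(valid(latitude_field, 123.0))  # -> False
```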
One of the most useful additions to global fields is the Field Mapping tool. Field Mapping lets us not only choose but also prioritize the column names from source data that will automatically map into our newly-defined field.
In order to map columns, we can type in a search string that will process against all of the column names loaded in as Curator source data:
What we are seeing here is correct: some of our NHTSA files label longitude columns as LONGITUD. What’s more, this label changes when we look at historical data from previous decades. We want data to flow into our Longitude_FP field from any of these columns, and we set up a priority order so that we have a deterministic method of selecting the single column that will match–and thus map–first. We can even set up partial matches; for instance, “LONGIT” would match “LONGITUD” and would map if we choose to do so.
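The deterministic, priority-ordered matching described above can be sketched as a small loop. This is an assumption about the matching behavior (prefix-style partial matches, first pattern in priority order wins), not Curator's actual implementation:

```python
# Sketch of priority-ordered field mapping: try patterns in priority
# order; the first source column that matches wins and maps.

priority_patterns = ["LONGITUDE", "LONGITUD", "LONGIT"]  # highest first

def match_column(source_columns):
    for pattern in priority_patterns:
        for col in source_columns:
            # Partial match: the column name begins with the pattern,
            # so "LONGIT" would match "LONGITUD".
            if col.startswith(pattern):
                return col
    return None

print(match_column(["LATITUDE", "LONGITUD"]))  # -> "LONGITUD"
```

Because the patterns are tried in a fixed order, the same source file always maps to the same column, which is what makes the automation safe to run unattended.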
Once our definitions and mappings are in place, we can implement both in the project where we generate our finished Datasets. Below, we restrict modifications to the Datasets to use only the field definitions we have tagged for our NHTSA data (“FARS”), and we let the project know that we want automatic mappings to be executed.
Once data are added to the project, the Curator Jobs panel fills with notifications of our pre-processed datasets. We can map hundreds of files at once in this way, since our Field Mapping tool is fully-integrated with Curator’s Transformations Engine and the global job control mechanisms embedded in the Elevada Data Platform. Processing will continue automatically and end-users can access status updates in real time.
We have used field mappings to pre-populate much of our new Dataset. At this point we can run additional Transformations such as the Lookups described above, or we might even choose to export fully-mapped Datasets automatically upon the completion of our processing.
Other Updates in Curator 2.1
Of course, there are numerous other features shipping with Curator 2.1. Some of the important new items include:
- Many new Transformation capabilities for all data types, including some very powerful advanced Concept Scheme Transformations.
- More Transformations power: users can now combine multiple condition checks into a single action using “AND” or “OR” logic.
- You can also use a “Return” action in Transformations to cease processing on one logical branch while continuing to process another. If this makes no sense while reading, trust us–it can really come in handy sometimes.
- Outside of Transformations we have new import and export capabilities for Microsoft Excel files. Additionally, our bulk importer has been greatly improved.
- The Dataset exporters now have several additional options like “Ignore empty fields” and “Export just labels.”
- Pivot on Import! Easily transform rows to columns and vice-versa for source datafiles loaded into Curator. This can be a great convenience feature.
- Many additional convenience features for data and object filtering throughout the interface.
These and other new additions make Curator more powerful than ever–as well as a delight to apply to any data processing or preparation projects, large, small, or anywhere in between. Of course, we’re always looking forward to the future, and we can’t wait to show you the new features we’ve got cooking for upcoming releases.
*PowerBI is a Registered Trademark of Microsoft Corporation.
**Elevada would like to thank Dan Nguyen of Stanford University, whose GitHub mirror helped us round up a few missing files for this project.