In our previous blog post we introduced the new iteration of our Transformations Engine. We walked through building and executing transformations on a single large dataset. Now, what if we want to work with a lot of data?
A Lot of Data
Imagine the following problem: At some point one or more of your proprietary systems glitched and your team needs to determine the cause of a subtle but critical production issue. What kinds of systems? Imagine you have a custom engineering setup with machines working together, or a marketing system combining input from various business intelligence tools that is driving your day-to-day sales effort. Combining data from multiple systems is a common and tricky problem in a number of industries.
You have logs or output files from each of the systems, but each output is in a different format. One system tracks date and time to the millisecond; another tracks only to the minute. One system coalesces messages into single date-stamped fields; another reports messages in separate reports. A third system doesn't provide verbose descriptions at all. Each system has severity levels (e.g., High, Medium, and Low), but each uses different terms for them.
And so on. Across over 1000 files.
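To make the timestamp mismatch concrete, here's a minimal Python sketch of the kind of normalization involved. The formats and values are invented for illustration; Curator handles this through its transformation operations rather than hand-written code.

```python
from datetime import datetime

# Hypothetical examples of the same event as reported by two systems:
# one logs to the millisecond, the other only to the minute.
ts_system_a = "2023-04-02 14:07:31.204"   # millisecond precision
ts_system_b = "2023-04-02 14:07"          # minute precision

def normalize(ts: str) -> datetime:
    """Parse either format into a single datetime representation."""
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M"):
        try:
            return datetime.strptime(ts, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp: {ts!r}")

print(normalize(ts_system_a))  # 2023-04-02 14:07:31.204000
print(normalize(ts_system_b))  # 2023-04-02 14:07:00
```

Multiply this little headache by every field and every format, and the scale of the cleanup becomes clear.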
Three Formats In, One Format Out
In order to investigate the production issue, we need to unify the data. In Curator, it's easy to whip up a quick template that specifies exactly the single format we want, to help get a handle on the situation:
- Incident Date (Text)
- Incident Time (Text)
- Incident Days (Number)
- Incident Seconds (Number)
- Incident Address (Text)
- Incident Severity (Lookup)
- Incident Message (Text)
- Incident Comments (Text)
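Thought of as a record type, the template above might look like the following Python sketch. The field names and types mirror the template's (Text)/(Number)/(Lookup) hints; this is just an illustration, not Curator's internal representation.

```python
from dataclasses import dataclass

# Illustrative record type for the unified template defined above.
@dataclass
class Incident:
    incident_date: str       # Text
    incident_time: str       # Text
    incident_days: float     # Number
    incident_seconds: float  # Number
    incident_address: str    # Text
    incident_severity: str   # Lookup: constrained to a fixed set of levels
    incident_message: str    # Text
    incident_comments: str   # Text
```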
To transform the data into the template, the team sets up a Transformation Set. The Transformation Set specifies about a dozen operations to split timestamps, calculate exact times, parse error messages, validate addresses, and so on. Each of the systems generates messages about the severity levels of events, but each system uses different terms: one might use HIGH, MEDIUM, LOW; another CRITICAL, MODERATE, INFO; and so on. In order to analyze the problem, we'll want to assemble the events by severity. We can use a conditional transformation that sets Incident Severity by mapping all of the levels from the original files to a single set of warning levels.
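Conceptually, that conditional transformation is a lookup table. Here's a hedged Python sketch of the idea; in Curator this is configured in the interface, and the vocabularies below are just the examples from this post.

```python
# Hypothetical mapping from each system's severity vocabulary
# to one shared set of warning levels.
SEVERITY_MAP = {
    # System A
    "HIGH": "High", "MEDIUM": "Medium", "LOW": "Low",
    # System B
    "CRITICAL": "High", "MODERATE": "Medium", "INFO": "Low",
}

def map_severity(raw: str) -> str:
    """Return the unified severity level, flagging anything unrecognized."""
    return SEVERITY_MAP.get(raw.strip().upper(), "Unknown")

print(map_severity("CRITICAL"))  # High
print(map_severity(" low "))     # Low
```

Anything that doesn't match falls through to "Unknown" rather than being silently dropped, which makes stragglers easy to find later.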
We might add a few more finishing touches to clean spaces and remove punctuation, and then we run the Transformation Set. The dataset is completely cleaned up in a few seconds.
That’s one file down and 999(+) to go!
Since the Transformation Set is ready to operate, we save it to the cleanup project.
Handling the Next 250 Files with a Single Tag
As promised in the headline, we can automate data processing in Curator using data tags. Tags are simple labels applied to raw data loaded into Curator. They can be applied while loading or to data already available in the system. We can start by loading a few files into the system:
Clicking the “Add Tag” button, we actually set up and add two tags: Log and LogFormatA.
The heading of this section promised automation using a single tag. It turns out to be useful to apply multiple tags in this style (a global tag plus more detailed tags), since tags can be used to filter all data files. The “Log” tag will be convenient later on, when we increase the number of formats we are dealing with.
The “LogFormatA” tag is going to do all the rest of the work for us. We set up automation for the tag using the cleanup project’s setup tools.
When we automate with the tag that we just created, “LogFormatA,” any data with that tag is automatically added to the project. In fact, that’s exactly what happens when we save the setting:
In some cases, this might be all we want to do: route data into a project based on a given tag. However, in this case, we want to go a step further and actually trigger the Transformation Set we built and saved before. To do this, we assign the tag to the Transformation Set:
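The rule we've just configured amounts to: whenever a dataset carries a tag assigned to a Transformation Set, queue that dataset for processing. The following Python sketch captures that logic; the names (`TAG_RULES`, `on_dataset_tagged`, `job_queue`) are illustrative stand-ins, not Curator's API.

```python
# Illustrative tag -> Transformation Set assignments.
TAG_RULES = {"LogFormatA": "Log Format A Cleanup"}

job_queue = []

def on_dataset_tagged(dataset_name: str, tags: set) -> None:
    """Queue a processing job for each of the dataset's automated tags."""
    for tag in sorted(tags):
        if tag in TAG_RULES:
            job_queue.append((dataset_name, TAG_RULES[tag]))

on_dataset_tagged("machine-07.log", {"Log", "LogFormatA"})
print(job_queue)  # [('machine-07.log', 'Log Format A Cleanup')]
```

Note that the broad "Log" tag matches no rule and triggers nothing; only the format-specific tag does the routing.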
…and that is all we need to do to process our log files. Note that once we click save, the Transformations Engine begins working. The Jobs count in the lower right corner of the screen indicates that all five assigned datasets are being processed. (Note also: the Advanced… button near the Transformation Set control lets us assign post-processing actions. Depending on the edition of Curator we are using, we can choose to automatically download files or automatically transfer them to a filesystem, as desired.)
Processing continues until all the jobs are complete; we can click on any of the datasets to see the transformations in action.
After a minute (or a fraction of one, at least), all of the jobs are completed.
Automation in Bulk
For the sake of the example in this blog post, we loaded the files above manually to watch the individual interactions with the files. However, there is a short-cut that allows us to make even faster work of our automation: the bulk loader.
Curator’s bulk importer allows users to add a number of files to the system at once. The bulk import interface also contains its own tagging control. We can simply select all of the files from the directory that contains our log data and set each of the files to be tagged. When we click to confirm the bulk import, the jobs are all queued and begin to process.
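Conceptually, the bulk importer is doing something like the following sketch: gather every file in a directory, attach the same tags to each, and queue them all at once. Again, this is a hypothetical illustration of the behavior, not Curator's implementation.

```python
from pathlib import Path

def bulk_import(directory: str, tags: set) -> list:
    """Collect every .log file in the directory and tag each for processing."""
    jobs = []
    for path in sorted(Path(directory).glob("*.log")):
        jobs.append({"file": path.name, "tags": set(tags)})
    return jobs
```

With the tags applied up front, the same automation rule from the previous section fires for every file in the batch, no per-file clicking required.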
That’s all the work we need to do to handle processing for this format. In about half an hour we have handled everything we need to for our first 250 files to be converted to a usable state.
Quickly Adding Additional Formats to Automation
We’re on to our next 750+ files, which come from two additional formats. Normally, this would be the painful (and expensive) part of a data cleanup project. However, we’re already nearly done handling the additional data.
While it would likely be possible to set up one giant Transformation Set to handle all three formats at once, we can make things a lot easier by using the following strategy: a Transformation Set for every format, and a tag for every Transformation Set.
To build a new Transformation Set, we can start from scratch or load one in from the project, modify it, and save it with a different name. In the case of our second set of files, the format is significantly different, but the operations needed are similar enough that we can just modify “Log Format A Cleanup.” For instance, we’ll update our severity level remapping:
It takes just a few minutes to rewire the transformations, and we can always check the previews to make sure everything is working correctly:
Once we are done adjusting the transformations to our liking, we can test on a single dataset. If everything looks good we are ready to save to a new Transformation Set: “Log Format B Cleanup.”
At this point the rest of the work should be pretty clear: we need to tag our datafile(s) and map the project to the tag:
Once again automation takes over and processing runs for our next 500 or so files.
As is apparent in the screenshot above, we can move on to handle the next format with another tag (LogFormatC) and an adjusted Transformation Set.
All told, the process to unify and clean up the data takes under an hour of non-technical interaction with the Curator interface. Now that we have automation set up, we can keep monitoring files as they continue to be produced. We can also easily handle additional changes or integrate more data in minutes by repeating the processes we used in this post.
Tags are a brand new way to automate much of the work that Curator already made easier. We’re really excited about the functionality since it should open up new applications for Elevada’s tools that we hadn’t even thought about until now. If you’re curious about the details of these automation capabilities and would like to know how Elevada can apply them to your team’s work, we would love to hear from you.