Data transformation with Curator works best with Concept Schemes. Concept Schemes in Curator closely resemble taxonomies and ontologies, but are modified and simplified to suit the needs of users producing large volumes of data. A wealth of ontologies is publicly available online, and Curator can import these as Concept Schemes to give quick access to public knowledge.
The Public Champion
One of the best sources of public terminology in life sciences is the National Center for Biomedical Ontology’s BioPortal site.
The site contains hundreds of downloadable ontologies of various sizes and hundreds more partial versions in ontology “views.” In most cases the full ontologies or the views can be downloaded as comma-delimited text files (.csv) and the format of these files conforms to a rough but not entirely uniform pattern.
Curator’s Concept Scheme importer is just the trick for converting these ontologies to usable concepts. Users can select a file, get a preview of the ontology structure to be imported, and prune unwanted branches before importing. The typical BioPortal .csv format is not exactly the same as the Curator importer format. The files need to be tweaked a bit before they can be uploaded. This might sound easy enough, but it turns out that converting file after file manually as spreadsheets is labor intensive, and, for some of the larger ontologies, fairly problematic or downright impossible.
It just so happens, however, that Curator itself is terrific at transforming disparate incoming sources of data into a specific output format. So, at the risk of getting too meta, let’s take a look at how that can work for converting ontologies.
Note: As we walk through the example we have a number of screenshots attached.
Let’s start by downloading an ontology view from BioPortal and importing it to Curator. In this example, we have selected the EDAM bioinformatics operations main CSV view for download. We can upload that file directly into Curator.
In the image above we can see some of the options selected to import the file. The separator is of course the comma, we do have a header row, and, importantly, we are “dealing with” quotes (Curator will strip quotation marks where it makes sense). Identifiers and Names for the Observations will be autogenerated as simple integers, as our Observation is clearly the Concept itself. We could also select the class ID as the Observation ID if desired. Finally, we add a simple ID and name for the Dataset itself. That’s it: we can upload our file.
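For readers who want to see what those import options amount to, here is a minimal Python sketch of the same idea outside Curator. It is not Curator's importer: the function name `read_observations`, the `OBSERVATION_ID` key, and the sample row are all hypothetical, and the column headers are assumed to match a typical BioPortal download.

```python
import csv
import io

def read_observations(csv_text):
    """Parse comma-separated text with a header row and number rows 1, 2, 3, ..."""
    # csv.DictReader uses the first row as the header and strips the
    # quotation marks that wrap quoted fields.
    reader = csv.DictReader(io.StringIO(csv_text))
    # OBSERVATION_ID is a hypothetical key standing in for an autogenerated ID.
    return [{"OBSERVATION_ID": str(i), **row} for i, row in enumerate(reader, start=1)]

# Hypothetical sample in the assumed BioPortal column layout.
sample = (
    "Class ID,Preferred Label,Synonyms\n"
    '"http://example.org/op_1","Alignment, pairwise","Pairwise alignment|Align"\n'
)
rows = read_observations(sample)
print(rows[0]["Preferred Label"])  # Alignment, pairwise
```

Note that the comma inside the quoted label survives intact, which is the main reason to "deal with" quotes rather than splitting on commas blindly.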
In order to do work in Curator we of course need to assign our Datafiles to projects. Typically the first thing we need for a project is a template to define the project’s Datasets. As the template defines exactly what our output is going to look like, and in this case we have a precise definition of what we need to load into Curator, building this template is very straightforward.
The Dataset level of the template is very simple, although not completely empty: later on, we might want to export a number of files and need to know their sources in a controlled way.
The observation level has more fields, but it is still quite simple. Each of the fields here gives us one of the required columns for Curator Concept Scheme imports. Note that for all of our template fields we are simply using text field types, since that will work for both labels and URIs (unique identifiers) prevalent in ontology files.
While we might tinker with the template later on, we only need to create it once to handle all of our conversions. With the template created we can start a project and add our uploaded Datafiles that were obtained from BioPortal. Here, we have our EDAM file added as well as a few others in various stages of workflow:
We can click on EDAM-full to open our Dataset Editor.
Some Quick Transforms
On the Groups tab of the Dataset Editor we can quickly look through and sort the incoming data supplied by the Datafile. Everything is basically as expected: there are class IDs, preferred labels (think of these as “official names”), synonyms, definitions, and (not pictured) links to the broader or “parent” terms in the class hierarchy. Note in particular that the Synonyms column sometimes contains multiple values separated by the “|” (pipe) character. That will be important, since synonyms make Concept Schemes much more useful in Curator.
Often the first thing we do in the Dataset Editor is to begin to group Observations and apply annotations to the groups. In this case, however, we can do all of the work we need to on the Transforms tab.
Here we can see the familiar field names from the template designed for our Ontology Conversion project. We will set a transformation for all of these fields by clicking the Advanced… button for each. We can start with setting some basic values.
The name of our Concept Scheme is EDAM-full, and as we need that value in every row of our output, we’ll go ahead and just do a direct set of that value for the field. There are other ways we could do this using Curator, but this one is probably quickest! We’ll also set up transformations pulling data from specific columns in our input: we will pull the Class ID data into the IDENTIFIER field and the Preferred Label into the LABEL field. Remember, different loaded ontologies might label these input columns differently, but we can always adjust our transformations accordingly. Finally, we will set the Parent Label field to Not Applicable; this is not a required value for Concept Schemes, and we don’t have values in our input. (Parent Labels can be especially useful in “homemade” taxonomies built with spreadsheets, which is why they are part of the import specification for Curator.)
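As a rough illustration of these three kinds of basic transform (a direct set, a straight column copy, and a fixed Not Applicable value), here is a small Python sketch. It only mirrors the logic and is not how Curator implements it; the output keys `CONCEPT SCHEME` and `PARENT LABEL` and the function name are assumptions made for the example.

```python
NOT_APPLICABLE = "Not Applicable"

def basic_transforms(row, scheme_name="EDAM-full"):
    """Build the simple output fields for one input row (a dict keyed by column name)."""
    return {
        # Direct set: the same constant value goes into every output row.
        "CONCEPT SCHEME": scheme_name,   # hypothetical field name
        # Column copies: adjust the input keys if an ontology names them differently.
        "IDENTIFIER": row["Class ID"],
        "LABEL": row["Preferred Label"],
        # Optional field with no values in this input.
        "PARENT LABEL": NOT_APPLICABLE,  # hypothetical field name
    }

example = basic_transforms(
    {"Class ID": "http://example.org/op_1", "Preferred Label": "Alignment"}
)
print(example["LABEL"])  # Alignment
```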
When we hit Apply Transformations, progress bars appear. We can continue working on other fields, Datasets, or even projects while the transformations are in progress.
Now we need to take advantage of an advanced feature of Curator for the remaining fields. Each Concept can take a value for a parent identifier so that Concept Schemes can be organized hierarchically. In the case of our input EDAM file, we have a “Parents” column that contains these IDs, occasionally with multiple values, which Curator doesn’t need. We’ll split that field and take everything before the first separator (“|”).
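In plain Python, the "split and keep everything before the first pipe" rule might look like the sketch below. The function name `first_parent` and the example URIs are hypothetical; Curator's own split transform is configured through the Advanced dialog rather than written as code.

```python
def first_parent(parents, separator="|"):
    """Return the text before the first separator, or the whole value if there is none."""
    # maxsplit=1 stops after the first separator; element 0 is the first parent.
    return parents.split(separator, 1)[0].strip()

print(first_parent("http://example.org/op_2|http://example.org/op_3"))
# http://example.org/op_2
```

A value with no pipe passes through unchanged, and an empty Parents cell yields an empty string, so top-level concepts simply end up without a parent identifier.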
For our synonym fields, as mentioned before, we want to keep multiple values from the single Synonyms column in our input data. In effect, we want to split one input into multiple outputs using the rules we set for Curator. This is easy enough, as we can assign any input, with transforms, to multiple output fields using the Advanced transforms tool. So, for the first synonym field (SYNONYM 1), we run a split on Synonyms using the pipe character and take the first value found. For SYNONYM 2 we do the same, only taking the second output value (which looks like the picture below).
We proceed similarly for SYNONYM 3, SYNONYM 4, and SYNONYM 5, and Apply Transformations.
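The same one-input-to-many-outputs idea can be sketched in a few lines of Python. This is only an illustration of the rule described above: the function name is hypothetical, and filling unused synonym slots with empty strings (and dropping any sixth or later synonym) is an assumption, not documented Curator behavior.

```python
def split_synonyms(synonyms, count=5, separator="|"):
    """Spread a pipe-separated Synonyms value across SYNONYM 1 .. SYNONYM <count>."""
    parts = [p.strip() for p in synonyms.split(separator)] if synonyms else []
    # Keep at most `count` values and pad the remainder with empty strings (assumed).
    parts = parts[:count] + [""] * (count - len(parts))
    return {f"SYNONYM {i}": value for i, value in enumerate(parts, start=1)}

fields = split_synonyms("Pairwise alignment|Align")
print(fields["SYNONYM 2"])  # Align
```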
Let’s take a look at the fields for one of the processed concepts to make sure we are finished:
Sure enough, we have our fields filled in including our multiple synonyms distributed properly across fields. And, as it turns out, we’re almost done.
Heading back to the Project View, we can select our EDAM dataset and the Export Dataset button. This brings up the exporter dialog with different adapter options:
In order to export a file that the Concept Scheme importer can use, we select the Tabular Export Adapter, which is focused on generating output using rows corresponding to Observations (Concepts in this case) and columns corresponding to Fields. We have several options that we can set for the Adapter, but in this case we do not need to select any of them. Once “Export” is clicked, Curator generates a file and downloads it to the desktop.
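To make the "rows are Observations, columns are Fields" layout concrete, here is a small Python sketch that writes such a table as CSV text. It does not reproduce the Tabular Export Adapter or its options; the function name and the two-field example are hypothetical.

```python
import csv
import io

def export_tabular(concepts, fieldnames):
    """Write one header row of field names, then one row per concept."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(concepts)  # values containing commas are quoted automatically
    return buffer.getvalue()

text = export_tabular(
    [{"IDENTIFIER": "http://example.org/op_1", "LABEL": "Alignment, pairwise"}],
    ["IDENTIFIER", "LABEL"],
)
print(text)
```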
To summarize: we could have used just about any inputs and any outputs we wanted for this exercise. We can load in taxonomies or ontologies from sources other than BioPortal just as easily as we did here, so we could pull information from the W3C, for instance, or use spreadsheets we’ve extracted from in-house sources. In any of these cases, we could use the template created above just as it is to create uploadable Curator Concept Scheme files. For a separate knowledge domain (say business instead of life science), we might use the same template to create a separate project.
Similarly, we can also change our output in case we want to export ontology files that might work with other systems. A great way to do this would be to take our existing template, clone it: (Save as…)
…alter the outputs in the new template, and then start a new project. And voilà! We can export to just about any format we need to.
This is a great example of using Curator to set up a repeatable solution to a messy data problem. In real life, we’ve probably spent about 10 minutes–a fraction of which was spent setting up the template and project we’ll reuse later. Each new ontology or taxonomy conversion gets easier than the last.