Creating a Transformation

Transformations are applied to data when loading it into Spotfire, shaping the data to the desired form before analyzing it.

Overview

Data is often not formatted for immediate analysis, and it may contain errors. The transformations framework adds a layer between reading the data and creating a data table, providing the means to modify the data before it is imported. A data transformation applies a rule or a function to the retrieved data to prepare it for analysis.

Data is imported in pipe-like fashion, starting from a data source, optionally continuing through a set of transformations, ending up in a data table. This sequence is conceptually defined by a data flow, and implemented by a DataFlow, a container object which consists of a data source and an ordered set of one or more transformations. It defines the data source and the transformations to be performed, but since it inherits from the base class for data sources, it can be used as a normal data source.

Figure: DataFlow as DataSource

Since a DataFlow is immutable, the DataFlowBuilder class is provided to simplify the creation of data flows:

Figure: DataFlowBuilder

Extension Points

The extension point for the transformation framework is the CustomDataTransformation class; inherit from it to implement a transformation. The class is then registered, together with a CustomTypeIdentifier, in the RegisterDataTransformations method of the AddIn class.

Tutorials and Examples

The Data Flow

Within a data flow, the data source and the transformations make up the basic building blocks. The main difference is that while a data source can return data directly, the transformation is integrated in the chain, requiring input from the previous stage:

Figure: The data flow chain blow-up

In the basic case, rows are retrieved from a DataSource via a DataSourceConnection, which in turn provides the DataRowReader.

The DataTransformation stores transformation settings and serves as a factory for a connection. The DataSourceConnection performs prompting by returning prompting models, and serves as a factory for a DataRowReader.

The connection class will typically not be explicitly implemented since there are static factory methods on the DataTransformationConnection class. Only the DataTransformation object is serialized and persisted in the document.

The Reader

The reader component provides a way to retrieve data from a source and is used in the step preceding the creation of a data table. The reader is created by a factory method on either a DataSourceConnection or a DataTransformationConnection. It is then used as input for the next step in the flow.

A reader is not stored in the Spotfire Document and only exists during the transformation execution. The DataRowReader is a resettable iterator where the current values are accessed from cursors in the columns. The reader also provides metadata on the columns, and returns table-level metadata in the ResultProperties property.
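The cursor-based iteration described above can be illustrated with a small self-contained model. This is only a sketch of the pattern: SketchCursor, SketchRowReader and ReaderSketch are hypothetical stand-in types invented for this illustration, not the Spotfire DataRowReader or DataValueCursor classes, and the model assumes a single column of double values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a typed column cursor (not a Spotfire class).
public sealed class SketchCursor<T>
{
    public T CurrentValue { get; internal set; }
}

// Hypothetical stand-in for a row reader over one column of doubles.
public sealed class SketchRowReader
{
    private readonly IList<double> rows;
    private int position = -1;

    public SketchRowReader(IList<double> rows)
    {
        this.rows = rows;
        this.Cursor = new SketchCursor<double>();
    }

    // Consumers obtain the cursor once and read it after every MoveNext call.
    public SketchCursor<double> Cursor { get; }

    // Advances to the next row and updates the cursor; false when no rows remain.
    public bool MoveNext()
    {
        if (position + 1 >= rows.Count)
        {
            return false;
        }

        position++;
        Cursor.CurrentValue = rows[position];
        return true;
    }

    // Rewinds the reader so the rows can be iterated again.
    public void Reset()
    {
        position = -1;
    }
}

public static class ReaderSketch
{
    // Sums the column by iterating the reader from the start.
    public static double Sum(SketchRowReader reader)
    {
        reader.Reset();
        double sum = 0;
        while (reader.MoveNext())
        {
            sum += reader.Cursor.CurrentValue;
        }
        return sum;
    }
}
```

The point of the design is that the consumer never receives a row object: it holds on to the cursor and the reader updates the cursor's current value on each MoveNext call, so iterating again after Reset reuses the same cursor instance.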

One instance of the DataRowReaderColumn class is created for each output column and these columns are contained in the DataRowReaderColumnCollection.

The Context

The transformation will receive an ImportContext providing properties for commonly used services. The context is required by the transformation to execute.

The import context is available as a property on the AnalysisApplication object and is therefore available even when no document is loaded. Nevertheless, make sure to check that any service to be used is actually available, since not all services are available at all times.

Unknown Data Types

Some data sources and transformations may need to return data types not supported by Spotfire, such as blobs and byte[]. Prior to Spotfire 2.1.0, there was no support for returning unknown data values from a reader. Even if a column type could be specified as DataType.Undefined, there was no way to retrieve values.

From Spotfire 2.1.0, the transformation or data source is expected to return a cursor returning values of type object. The data source or transformation may also add custom metadata describing what kind of objects are stored in the column, but note that columns of data type Undefined are ignored when creating the table from a data flow. The unknown data types may be used in another context, such as a tool reading data from a data source and storing the result in an internal structure.

The MutableValueCursor

A transformation which performs some calculation or modification of the input values needs a way to set the return value in the cursor when the MoveNext method is called. The Transformation: Using Properties From a Previous Transform example shows a simple way to do this. A MutableValueCursor is created from the CreateMutableCursor factory method on the DataValueCursor. The requested data type of the result is provided as an argument.

For performance reasons, the mutable value cursor is a generic class typed on the actual representation type. To set the value, the DataValueCursor created by DataValueCursor.CreateMutableCursor must be cast to the actual type, for instance MutableValueCursor<double> or MutableValueCursor<string>. The output value can then be modified by setting the properties of the MutableValueCursor<T>.MutableDataValue property.

Example
output.MutableDataValue.IsValid = true;
output.MutableDataValue.Value = val;

Modifying Columns of Different Types

The outlined procedure results in complicated code when the transformation modifies input columns of many different types. Storing each MutableValueCursor as a DataValueCursor (the base class) and casting it to each supported mutable value cursor type in the MoveNextCore method, to find out which one it is, requires several casts per row and may degrade performance.

A solution to this problem is to use an object-oriented approach which performs the cast only during initialization. Consider a transformation handling both int and double values. The operation it performs is to add one to every value.

  1. Define a non-generic abstract base class that can be called without knowing which type the computer works on:
    private abstract class MyComputerBase
    {
      public abstract void ComputeNextValue();
    }
  2. Since we need to handle both int and double values, add two simple classes implementing the base class for the different types. For double:
    private sealed class DoubleCompute : MyComputerBase
    {
      private readonly DataValueCursor<double> inputCursor;
      private readonly MutableValueCursor<double> outputCursor;
      
      public DoubleCompute(DataValueCursor<double> inputCursor, MutableValueCursor<double> outputCursor)
      {
        this.inputCursor = inputCursor;
        this.outputCursor = outputCursor;
      }
    
      public override void ComputeNextValue()
      {
        // check if the input cursor is valid.
        if(inputCursor.IsCurrentValueValid)
        {
          outputCursor.MutableDataValue.IsValid = true;
          // output is input + 1
          outputCursor.MutableDataValue.Value  = inputCursor.CurrentValue + 1.0;
        }
        else
        {
          // no valid value, output is not valid either        
          outputCursor.MutableDataValue.IsValid = false;
          outputCursor.MutableDataValue.ErrorValue = inputCursor.CurrentDataValue.ErrorValue;
        }
      }
    }
  3. Duplicate the code to handle integers:
    This unfortunate duplication is due to the lack of an interface for specifying that a generic type should implement the + operator.
    private sealed class IntCompute : MyComputerBase
    {
      private readonly DataValueCursor<int> inputCursor;
      private readonly MutableValueCursor<int> outputCursor;
      
      public IntCompute(DataValueCursor<int> inputCursor, MutableValueCursor<int> outputCursor)
      {
        this.inputCursor = inputCursor;
        this.outputCursor = outputCursor;
      }
    
      public override void ComputeNextValue()
      {
        // check if the input cursor is valid.
        if(inputCursor.IsCurrentValueValid)
        {
          outputCursor.MutableDataValue.IsValid = true;
          // output is input + 1
          outputCursor.MutableDataValue.Value  = inputCursor.CurrentValue + 1;
        }
        else
        {
          // no valid value, output is not valid either        
          outputCursor.MutableDataValue.IsValid = false;
          outputCursor.MutableDataValue.ErrorValue = inputCursor.CurrentDataValue.ErrorValue;
        }
      }
    }
  4. These computers can now be created at initialization and stored in a list used by the MoveNextCore method.
    public sealed class MyTransformationReader : CustomDataRowReader
    {
    
        private readonly DataRowReader inputReader;
        private readonly IList<MyComputerBase> computers;
        private readonly List<DataRowReaderColumn> columns;
    
        public MyTransformationReader(DataRowReader inputReader)
        {
            // initialize the variables
            this.inputReader = inputReader;
            this.computers = new List<MyComputerBase>();
            this.columns = new List<DataRowReaderColumn>();
            
            // loop over all input columns to see if we should transform or just
            // pass through the values.
            foreach (DataRowReaderColumn col in inputReader.Columns)
            {
                // check if the data type is real or integer
                DataColumnProperties properties = col.Properties;
                if (col.DataType == DataType.Real)
                {
                  // it is real so we should transform it.
                  // create the output cursor and cast the input cursor.
                  DataValueCursor<double> inputCursor = (DataValueCursor<double>)col.Cursor;
                  MutableValueCursor<double> outputCursor =      
                       (MutableValueCursor<double>)DataValueCursor.CreateMutableCursor(DataType.Real);
                          
                  // add the output column to the columns collection         
                  this.columns.Add(
                      new DataRowReaderColumn(
                          col.Name,
                          col.DataType,
                          properties,
                          outputCursor));
                          
                  // create the correct computer specified earlier.
                  this.computers.Add(new DoubleCompute(inputCursor, outputCursor));
                }
                else if(col.DataType == DataType.Integer)
                {                
                  // it is int so we should transform it.
                  // create the output cursor and cast the input cursor.
                  DataValueCursor<int> inputCursor = (DataValueCursor<int>)col.Cursor;
                  MutableValueCursor<int> outputCursor =      
                       (MutableValueCursor<int>)DataValueCursor.CreateMutableCursor(DataType.Integer);
                          
                  // add the output column to the columns collection         
                  this.columns.Add(
                      new DataRowReaderColumn(
                          col.Name,
                          col.DataType,
                          properties,
                          outputCursor));
                          
                  // create the correct computer specified earlier.
                  this.computers.Add(new IntCompute(inputCursor, outputCursor));
                }           
                else 
                {
                    // not int or real, just propagate the value.
                    this.columns.Add(col);
                }
            }
        }
        ...
    }
  5. Since the actual work is performed in the constructor and the computer classes, the MoveNextCore method becomes very simple:
    protected override bool MoveNextCore()
    {
        // tell the input reader to set the values for the next row.
        if (!this.inputReader.MoveNext())
        {
            // there were no more rows, return.
            return false;                    
        }
        
        // iterate over all the computers and tell them to set the current value.
        foreach(MyComputerBase myComputerBase in computers)
        {
          myComputerBase.ComputeNextValue();
        }
        
        // a row was read and all output cursors have been updated.
        return true;
    }
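As an aside to the duplication noted in step 3, one possible way to avoid writing a class per type is to pass the type-specific arithmetic in as a delegate. The sketch below is self-contained and uses hypothetical stand-in types (InCursor, OutCursor, ComputerBase, DelegateCompute) rather than the Spotfire cursor classes, so it only illustrates the shape of the idea. Note that it adds a delegate call per value, which may cost more than the duplicated classes shown above.

```csharp
using System;

// Hypothetical stand-ins for the input and output cursors (not Spotfire classes).
public sealed class InCursor<T>
{
    public bool IsValid;
    public T Value;
}

public sealed class OutCursor<T>
{
    public bool IsValid;
    public T Value;
}

// Non-generic base class, so callers need not know the value type.
public abstract class ComputerBase
{
    public abstract void ComputeNextValue();
}

// One generic computer; the per-type operation is supplied as a delegate.
public sealed class DelegateCompute<T> : ComputerBase
{
    private readonly InCursor<T> input;
    private readonly OutCursor<T> output;
    private readonly Func<T, T> operation;

    public DelegateCompute(InCursor<T> input, OutCursor<T> output, Func<T, T> operation)
    {
        this.input = input;
        this.output = output;
        this.operation = operation;
    }

    public override void ComputeNextValue()
    {
        // Validity is propagated; the operation runs only on valid input.
        output.IsValid = input.IsValid;
        if (input.IsValid)
        {
            output.Value = operation(input.Value);
        }
    }
}
```

Under these assumptions, the int and double cases differ only in the lambda passed at initialization, for example `new DelegateCompute<int>(intIn, intOut, v => v + 1)` and `new DelegateCompute<double>(realIn, realOut, v => v + 1.0)`, where the cursor variables are instances of the stand-in types.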