Transformations are applied to data when loading it into Spotfire, shaping the data to the desired form before analyzing it.
Overview
Data is often not formatted for immediate analysis, and it may contain errors. The
transformations framework adds a layer between reading the data and creating
a data table, providing the means to modify the data before it is imported. A data
transformation applies a rule or a function to the retrieved data to prepare it
for analysis.
Data is imported in pipe-like fashion, starting from a data source, optionally continuing
through a set of transformations, ending up in a data table. This sequence is conceptually
defined by a data flow, and implemented by a
DataFlow,
a container object which consists of a data source and an ordered set of one or
more transformations. It defines the data source and the transformations to be performed,
but since it inherits from the base class for data sources, it can be used as a
normal data source.
Since the DataFlow is immutable, the
DataFlowBuilder
class is provided to simplify the creation of data flows.
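The relationship between an immutable flow and its builder can be illustrated with a minimal sketch. Every type below (SketchSource, SketchTransformation, SketchFlow, SketchFlowBuilder) is a hypothetical stand-in written for this illustration, not part of the Spotfire API; the sketch only assumes that a flow is a source followed by an ordered list of transformations and that it can itself be read like a source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins, NOT the Spotfire API: a "source" yields rows and a
// "transformation" maps one row sequence to another.
public delegate IEnumerable<double[]> SketchSource();
public delegate IEnumerable<double[]> SketchTransformation(IEnumerable<double[]> rows);

// Immutable once built: the source and the ordered transformations never change.
public sealed class SketchFlow
{
    private readonly SketchSource source;
    private readonly IReadOnlyList<SketchTransformation> transformations;

    internal SketchFlow(SketchSource source, IEnumerable<SketchTransformation> transformations)
    {
        this.source = source;
        // Defensive copy, so later changes to the builder do not affect this flow.
        this.transformations = transformations.ToList();
    }

    // The flow can be read like a source: rows pass through every stage in order.
    public IEnumerable<double[]> Read()
    {
        IEnumerable<double[]> rows = this.source();
        foreach (SketchTransformation transformation in this.transformations)
        {
            rows = transformation(rows);
        }
        return rows;
    }
}

// Mutable helper that collects the stages and then produces the immutable flow.
public sealed class SketchFlowBuilder
{
    private readonly SketchSource source;
    private readonly List<SketchTransformation> transformations =
        new List<SketchTransformation>();

    public SketchFlowBuilder(SketchSource source)
    {
        this.source = source;
    }

    public SketchFlowBuilder Add(SketchTransformation transformation)
    {
        this.transformations.Add(transformation);
        return this;
    }

    public SketchFlow Build()
    {
        return new SketchFlow(this.source, this.transformations);
    }
}

// Usage (for illustration):
//   SketchFlow flow = new SketchFlowBuilder(() => new[] { new[] { 1.0 }, new[] { 2.0 } })
//       .Add(rows => rows.Select(r => new[] { r[0] * 10 }))
//       .Build();
//   foreach (double[] row in flow.Read()) { /* consume the transformed rows */ }
```

The builder stays mutable while stages are collected, and the flow it produces takes a copy of the stage list, so a built flow cannot change afterwards.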
Extension Points
The extension point for the transformation framework is the
CustomDataTransformation
class, which can be inherited from to implement a transformation. This class is
then registered in the
RegisterDataTransformations
method on the AddIn class together with a
CustomTypeIdentifier.
Tutorials and Examples
The Data Flow
Within a data flow, the data source and the transformations make up the basic building
blocks. The main difference is that a data source can return data directly,
whereas a transformation is integrated in the chain and requires input from the previous
stage.
In the basic case, rows are retrieved from a
DataSource
via a
DataSourceConnection,
which in turn provides the
DataRowReader.
The
DataTransformation
stores transformation settings and serves as a factory for a connection. The DataSourceConnection
performs prompting by returning prompting models, and serves as a factory for a
DataRowReader.
The connection class will typically not be explicitly implemented since there are
static factory methods on the
DataTransformationConnection
class. Only the DataTransformation object is serialized and persisted
in the document.
The Reader
The reader component provides a way to retrieve data from a source and is used in
the step preceding the creation of a data table. The reader is created by
a factory method on either a DataSourceConnection or a DataTransformationConnection.
It is then used as input for the next step in the flow.
A reader is not stored in the
Spotfire Document
and exists only while the transformation executes. The DataRowReader
is a resettable iterator where the current values are accessed from cursors in the
columns. The reader also provides metadata on the columns, and it returns
table-level metadata in the
ResultProperties
property.
One instance of the
DataRowReaderColumn
class is created for each output column and these columns are contained in the
DataRowReaderColumnCollection.
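The cursor-based access pattern described above can be sketched as follows. SketchCursor and SketchRowReader are hypothetical stand-ins written for this illustration, not the Spotfire classes; the sketch assumes only that a consumer obtains a cursor per column once and that each call to MoveNext refreshes the values the cursors expose.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins, NOT the Spotfire API.
// A cursor exposes the value of one column for the row the reader is positioned on.
public sealed class SketchCursor
{
    public double CurrentValue { get; internal set; }
}

// A resettable, forward-only reader over in-memory rows.
public sealed class SketchRowReader
{
    private readonly double[][] rows;
    private int position = -1;

    public SketchRowReader(double[][] rows, int columnCount)
    {
        this.rows = rows;
        SketchCursor[] cursors = new SketchCursor[columnCount];
        for (int i = 0; i < columnCount; i++)
        {
            cursors[i] = new SketchCursor();
        }
        this.Cursors = cursors;
    }

    // One cursor per output column; consumers keep references to these.
    public IReadOnlyList<SketchCursor> Cursors { get; }

    // Advances to the next row and updates every cursor; false when no rows remain.
    public bool MoveNext()
    {
        if (this.position + 1 >= this.rows.Length)
        {
            return false;
        }
        this.position++;
        for (int i = 0; i < this.Cursors.Count; i++)
        {
            this.Cursors[i].CurrentValue = this.rows[this.position][i];
        }
        return true;
    }

    // Returns the reader to its initial state so the rows can be read again.
    public void Reset()
    {
        this.position = -1;
    }
}

// Usage (for illustration):
//   SketchCursor second = reader.Cursors[1];   // fetched once, before iterating
//   while (reader.MoveNext()) { total += second.CurrentValue; }
```

Because the consumer holds on to the cursor rather than asking the reader for a value on every row, the per-row work is reduced to reading a property.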
The Context
The transformation will receive an
ImportContext
providing properties for commonly used services. The context is required by the
transformation to execute.
The import context is available as a property on the
AnalysisApplication
object and therefore available even when no document is loaded. Nevertheless, make
sure to check that any service to be used is actually available, since not all services
are available at all times.
Unknown Data Types
Some data sources and transformations may need to return data types not supported
by Spotfire, such as blobs and byte[]. Prior to Spotfire 2.1.0, there
was no support for returning unknown data values from a reader. Even if a column
type could be specified as DataType.Undefined, there was no way to
retrieve values.
From Spotfire 2.1.0, the transformation or data source is expected to return a cursor
returning values of type object. The data source or transformation
may also add custom metadata describing what kind of objects are stored in the column,
but note that columns of data type Undefined are ignored when creating
the table from a data flow. The unknown data types may be used in another context,
such as a tool reading data from a data source and storing the result in an internal
structure.
The MutableValueCursor
A transformation which performs some calculation or modification of the input values
needs a way to set the return value in the cursor when the
MoveNext
method is called. The
Transformation: Using Properties From a Previous Transform
example shows a simple way to do this. A MutableValueCursor is created
by the
CreateMutableCursor
factory method on
DataValueCursor.
The requested data type of the result is provided as an argument.
For performance reasons the mutable value cursor is a generic class typed on the
actual representation type. To set the value, the DataValueCursor returned
by DataValueCursor.CreateMutableCursor must be cast to the actual
type, for instance MutableValueCursor<double> or MutableValueCursor<string>.
The output value can then be modified through the
MutableValueCursor<T>.MutableDataValue
property.
Example
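    MutableValueCursor<double> output =
        (MutableValueCursor<double>)DataValueCursor.CreateMutableCursor(DataType.Real);
    output.MutableDataValue.IsValid = true;
    output.MutableDataValue.Value = val; // val is the double computed for the current row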
Modifying Columns of Different Types
The outlined procedure results in complicated code when the transformation modifies
input columns of many different types. Storing the MutableValueCursor
as a
DataValueCursor
(the base class) and casting it to each supported mutable value
cursor type in the
MoveNextCore
method to find out which one it is requires several casts for each row and may degrade
performance.
A solution to this problem is to use an object-oriented approach that performs the
cast only during initialization. Consider a transformation handling both int
and double values. The operation it performs is to add one to every
value.
- Define a non-generic abstract base class that can be called without knowing which
type the computer works on:
    private abstract class MyComputerBase
    {
        public abstract void ComputeNextValue();
    }
- Since we need to handle both
int and double values, add
two simple classes implementing the base class for the different types. For double:
    private sealed class DoubleCompute : MyComputerBase
    {
        private readonly DataValueCursor<double> inputCursor;
        private readonly MutableValueCursor<double> outputCursor;

        public DoubleCompute(DataValueCursor<double> inputCursor, MutableValueCursor<double> outputCursor)
        {
            this.inputCursor = inputCursor;
            this.outputCursor = outputCursor;
        }

        public override void ComputeNextValue()
        {
            // check if the input cursor is valid.
            if (inputCursor.IsCurrentValueValid)
            {
                outputCursor.MutableDataValue.IsValid = true;
                // output is input + 1
                outputCursor.MutableDataValue.Value = inputCursor.CurrentValue + 1.0;
            }
            else
            {
                // no valid value, output is not valid either
                outputCursor.MutableDataValue.IsValid = false;
                outputCursor.MutableDataValue.ErrorValue = inputCursor.CurrentDataValue.ErrorValue;
            }
        }
    }
- Duplicate the code to handle integers:
    private sealed class IntCompute : MyComputerBase
    {
        private readonly DataValueCursor<int> inputCursor;
        private readonly MutableValueCursor<int> outputCursor;

        public IntCompute(DataValueCursor<int> inputCursor, MutableValueCursor<int> outputCursor)
        {
            this.inputCursor = inputCursor;
            this.outputCursor = outputCursor;
        }

        public override void ComputeNextValue()
        {
            // check if the input cursor is valid.
            if (inputCursor.IsCurrentValueValid)
            {
                outputCursor.MutableDataValue.IsValid = true;
                // output is input + 1
                outputCursor.MutableDataValue.Value = inputCursor.CurrentValue + 1;
            }
            else
            {
                // no valid value, output is not valid either
                outputCursor.MutableDataValue.IsValid = false;
                outputCursor.MutableDataValue.ErrorValue = inputCursor.CurrentDataValue.ErrorValue;
            }
        }
    }
- These computers can now be created at initialization and stored in a list used by
the
MoveNextCore method.
    public sealed class MyTransformationReader : CustomDataRowReader
    {
        private readonly DataRowReader inputReader;
        private readonly IList<MyComputerBase> computers;
        private readonly List<DataRowReaderColumn> columns;

        public MyTransformationReader(DataRowReader inputReader)
        {
            // initialize the variables
            this.inputReader = inputReader;
            this.computers = new List<MyComputerBase>();
            this.columns = new List<DataRowReaderColumn>();

            // loop over all input columns to see if we should transform or just
            // pass through the values.
            foreach (DataRowReaderColumn col in inputReader.Columns)
            {
                // get the column properties, then check if the data type is real or integer
                DataColumnProperties properties = col.Properties;
                if (col.DataType == DataType.Real)
                {
                    // it is real so we should transform it.
                    // create the output cursor and cast the input cursor.
                    DataValueCursor<double> inputCursor = (DataValueCursor<double>)col.Cursor;
                    MutableValueCursor<double> outputCursor =
                        (MutableValueCursor<double>)DataValueCursor.CreateMutableCursor(DataType.Real);

                    // add the output column to the columns collection
                    this.columns.Add(
                        new DataRowReaderColumn(
                            col.Name,
                            col.DataType,
                            properties,
                            outputCursor));

                    // create the correct computer specified earlier.
                    this.computers.Add(new DoubleCompute(inputCursor, outputCursor));
                }
                else if (col.DataType == DataType.Integer)
                {
                    // it is int so we should transform it.
                    // create the output cursor and cast the input cursor.
                    DataValueCursor<int> inputCursor = (DataValueCursor<int>)col.Cursor;
                    MutableValueCursor<int> outputCursor =
                        (MutableValueCursor<int>)DataValueCursor.CreateMutableCursor(DataType.Integer);

                    // add the output column to the columns collection
                    this.columns.Add(
                        new DataRowReaderColumn(
                            col.Name,
                            col.DataType,
                            properties,
                            outputCursor));

                    // create the correct computer specified earlier.
                    this.computers.Add(new IntCompute(inputCursor, outputCursor));
                }
                else
                {
                    // not int or real, just propagate the value.
                    this.columns.Add(col);
                }
            }
        }
        ...
    }
- Since the actual work is performed in the constructor and the computer classes, the
MoveNextCore method becomes very simple:
    protected override bool MoveNextCore()
    {
        // tell the input reader to set the values for the next row.
        if (!this.inputReader.MoveNext())
        {
            // there were no more rows, return.
            return false;
        }

        // iterate over all the computers and tell them to set the current value.
        foreach (MyComputerBase myComputerBase in computers)
        {
            myComputerBase.ComputeNextValue();
        }

        // a row was read and all output cursors have been updated.
        return true;
    }
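The int and double computers in the steps above differ only in the value type and in the literal they add. One possible variation, sketched here under stated assumptions, is a single generic computer that receives the operation as a delegate. SketchValue, SketchComputerBase and SketchComputer are hypothetical stand-ins written for this illustration and do not use the Spotfire cursor classes; the delegate call adds a small cost per value, so the duplicated classes may still be preferable where throughput matters.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a typed cursor, NOT the Spotfire API:
// it only holds the current value and a validity flag.
public sealed class SketchValue<T>
{
    public T Value;
    public bool IsValid;
}

// Non-generic base so that computers of different types fit in one list.
public abstract class SketchComputerBase
{
    public abstract void ComputeNextValue();
}

// One generic computer replaces the per-type classes; the operation is injected.
public sealed class SketchComputer<T> : SketchComputerBase
{
    private readonly SketchValue<T> input;
    private readonly SketchValue<T> output;
    private readonly Func<T, T> operation;

    public SketchComputer(SketchValue<T> input, SketchValue<T> output, Func<T, T> operation)
    {
        this.input = input;
        this.output = output;
        this.operation = operation;
    }

    public override void ComputeNextValue()
    {
        // An invalid input produces an invalid output; the operation is skipped.
        this.output.IsValid = this.input.IsValid;
        if (this.input.IsValid)
        {
            this.output.Value = this.operation(this.input.Value);
        }
    }
}

// Usage (for illustration): the type is fixed once, when the computer is created.
//   var computers = new List<SketchComputerBase>
//   {
//       new SketchComputer<int>(intIn, intOut, v => v + 1),
//       new SketchComputer<double>(doubleIn, doubleOut, v => v + 1.0),
//   };
//   foreach (SketchComputerBase computer in computers) { computer.ComputeNextValue(); }
```

As in the original approach, the type decision is made once at initialization, and the per-row loop only calls ComputeNextValue on the non-generic base class.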