Transformations are applied to data when loading it into Spotfire, shaping the data to the desired form before analyzing it.
Overview
Data is often not formatted for immediate analysis, and it may
contain errors. The transformation framework adds a layer between
reading the data and creating a data table, providing the
means to modify the data before it is imported. A data transformation
applies a rule or a function to the retrieved data to prepare
it for analysis.
Data is imported as a pipeline: it starts at a data source,
optionally continues through a set of transformations, and ends
up in a data table. This sequence is conceptually defined by
a data flow, and implemented by a
DataFlow,
a container object which consists of a data source and an ordered
set of one or more transformations. It defines the data source
and the transformations to be performed, and because it inherits
from the base class for data sources, it can also be used as a normal
data source.
Since the DataFlow is immutable, the
DataFlowBuilder
class is provided to simplify the creation of data flows.
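The relationship between an immutable flow and its builder can be sketched with stand-in types. The following is a minimal, self-contained illustration of the pattern only: IRowSource, ListSource, SketchFlow and SketchFlowBuilder are hypothetical names invented for this sketch and are not the Spotfire classes, and transformations are reduced to simple functions on double values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in types for illustration only; these are NOT the Spotfire classes.
public interface IRowSource
{
    IEnumerable<double> ReadValues();
}

public sealed class ListSource : IRowSource
{
    private readonly double[] values;
    public ListSource(params double[] values) { this.values = values; }
    public IEnumerable<double> ReadValues() { return values; }
}

// An immutable flow: a source plus an ordered list of transformations.
// It implements IRowSource itself, so it can be used wherever a source is expected.
public sealed class SketchFlow : IRowSource
{
    private readonly IRowSource source;
    private readonly IReadOnlyList<Func<double, double>> steps;

    internal SketchFlow(IRowSource source, IEnumerable<Func<double, double>> steps)
    {
        this.source = source;
        this.steps = steps.ToList(); // private copy, never modified afterwards
    }

    public IEnumerable<double> ReadValues()
    {
        foreach (double value in source.ReadValues())
        {
            double result = value;
            // apply the transformations in the order they were added
            foreach (Func<double, double> step in steps) { result = step(result); }
            yield return result;
        }
    }
}

// The mutable builder collects the steps and produces the immutable flow.
public sealed class SketchFlowBuilder
{
    private readonly IRowSource source;
    private readonly List<Func<double, double>> steps = new List<Func<double, double>>();

    public SketchFlowBuilder(IRowSource source) { this.source = source; }

    public SketchFlowBuilder AddTransformation(Func<double, double> step)
    {
        steps.Add(step);
        return this;
    }

    public SketchFlow Build() { return new SketchFlow(source, steps); }
}
```

With these stand-ins, building a flow from new ListSource(1.0, 2.0) with the steps v => v + 1 and v => v * 10 yields the values 20 and 30. Because the built flow is itself a source, it can be passed to another builder, which mirrors how a data flow can be used as a normal data source.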
Extension Points
The extension point for the transformation framework is the
CustomDataTransformation
class. To implement a transformation, inherit from this class
and register the derived class in the
RegisterDataTransformations
method on the AddIn class together with a
CustomTypeIdentifier.
Tutorials and Examples
The Data Flow
Within a data flow, the data source and the transformations make
up the basic building blocks. The main difference is that while
a data source can return data directly, the transformation is
integrated in the chain, requiring input from the previous stage.
In the basic case, rows are retrieved from a
DataSource
via a
DataSourceConnection,
which in turn provides the
DataRowReader.
The
DataTransformation
stores transformation settings and serves as a factory for a
connection. The DataSourceConnection performs prompting
by returning prompting models, and serves as a factory for a
DataRowReader.
The connection class will typically not be explicitly implemented
since there are static factory methods on the
DataTransformationConnection
class. Only the DataTransformation object is serialized
and persisted in the document.
The Reader
The reader component provides a way to retrieve data from a source
and is used in the step preceding the creation of a data table.
The reader is created by a factory method on either a DataSourceConnection
or a DataTransformationConnection. It is then used
as input for the next step in the flow.
A reader is not stored in the
Document
and exists only while the transformation executes. The DataRowReader
is a resettable iterator whose current values are accessed
through cursors in the columns. The reader also provides column
metadata, and it returns table-level metadata in the
ResultProperties
property.
One instance of the
DataRowReaderColumn
class is created for each output column and these columns are
contained in the
DataRowReaderColumnCollection.
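The cursor-based iteration described above can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the Spotfire implementation: SketchCursor, SketchColumn and SketchReader are hypothetical stand-ins, all columns hold nullable double values, and a null cell stands for an invalid value.

```csharp
using System;
using System.Collections.Generic;

// Stand-in types for illustration only; these are NOT the Spotfire classes.
public sealed class SketchCursor
{
    public bool IsValid { get; internal set; }
    public double Value { get; internal set; }
}

public sealed class SketchColumn
{
    public SketchColumn(string name) { Name = name; Cursor = new SketchCursor(); }
    public string Name { get; private set; }
    public SketchCursor Cursor { get; private set; }
}

public sealed class SketchReader
{
    private readonly double?[][] rows;
    private int position = -1;

    public SketchReader(string[] columnNames, double?[][] rows)
    {
        this.rows = rows;
        var list = new List<SketchColumn>();
        // one column object, and thus one cursor, per output column
        foreach (string name in columnNames) { list.Add(new SketchColumn(name)); }
        Columns = list.AsReadOnly();
    }

    public IList<SketchColumn> Columns { get; private set; }

    // Advances to the next row and updates every column cursor in place.
    public bool MoveNext()
    {
        if (position + 1 >= rows.Length) { return false; }
        position++;
        for (int i = 0; i < Columns.Count; i++)
        {
            double? cell = rows[position][i];
            Columns[i].Cursor.IsValid = cell.HasValue;
            Columns[i].Cursor.Value = cell.GetValueOrDefault();
        }
        return true;
    }

    // Makes the reader resettable: the next MoveNext starts from the first row again.
    public void Reset() { position = -1; }
}
```

A consumer fetches the cursor objects once, then calls MoveNext in a loop and reads the current values from those same cursor objects on each iteration. No per-row lookup or allocation is needed, which is the main reason for the cursor design in this sketch.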
The Context
The transformation will receive an
ImportContext
providing properties for commonly used services. The context
is required by the transformation to execute.
The import context is available as a property on the
AnalysisApplication
object and therefore available even when no document is loaded.
Nevertheless, make sure to check that any service to be used
is actually available, since not all services are available at
all times.
Unknown Data Types
Some data sources and transformations may need to return data
types not supported by Spotfire, such as blobs and byte[].
Prior to Spotfire 2.1.0, there was no support for returning
unknown data values from a reader. Although a column type could
be specified as DataType.Undefined, there was no
way to retrieve its values.
From Spotfire 2.1.0, the transformation or data source is expected
to provide a cursor that returns values of type object.
The data source or transformation may also add custom metadata
describing what kind of objects are stored in the column, but
note that columns of data type Undefined are ignored
when creating the table from a data flow. The unknown data types
may be used in another context, such as a tool reading data
from a data source and storing the result in an internal structure.
The MutableValueCursor
A transformation which performs some calculation or modification
of the input values needs a way to set the return value in the
cursor when the
MoveNext
method is called. The
Transformation: Using Properties From a Previous Transform
example shows a simple way to do this. A MutableValueCursor
is created from the
CreateMutableCursor
factory method on the
DataValueCursor.
The requested data type of the result is provided as an argument.
For performance reasons, the mutable value cursor is a generic
class typed on the actual representation type. To set the value,
the DataValueCursor returned by DataValueCursor.CreateMutableCursor
must be cast to the actual type, for instance MutableValueCursor<double>
or MutableValueCursor<string>. The output value can then
be modified by setting the properties of the
MutableValueCursor<T>.MutableDataValue
property.
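Example
// 'output' is the cursor after the cast, for instance a MutableValueCursor<double>,
// and 'val' is the computed result for the current row.
output.MutableDataValue.IsValid = true;
output.MutableDataValue.Value = val;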
Modifying Columns of Different Types
The outlined procedure results in complicated code when the transformation
modifies input columns of many different types. Storing the
MutableValueCursor as a
DataValueCursor
(the base class) and casting it to each supported
mutable value cursor type in the
MoveNextCore
method to find out which one it is requires several casts for each
row and may degrade performance.
A solution to this problem is to use an object-oriented approach
that performs the cast only during initialization. Consider
a transformation handling both int and double
values. The operation it performs is to add one to every value.
- Define a non-generic abstract base class that can be called
without knowing which type the computer works on:
private abstract class MyComputerBase
{
    public abstract void ComputeNextValue();
}
- Since we need to handle both
int and double
values, add two simple classes implementing the base class for
the different types. For double:
private sealed class DoubleCompute : MyComputerBase
{
    private readonly DataValueCursor<double> inputCursor;
    private readonly MutableValueCursor<double> outputCursor;

    public DoubleCompute(DataValueCursor<double> inputCursor, MutableValueCursor<double> outputCursor)
    {
        this.inputCursor = inputCursor;
        this.outputCursor = outputCursor;
    }

    public override void ComputeNextValue()
    {
        // check if the input cursor is valid.
        if (inputCursor.IsCurrentValueValid)
        {
            outputCursor.MutableDataValue.IsValid = true;
            // output is input + 1
            outputCursor.MutableDataValue.Value = inputCursor.CurrentValue + 1.0;
        }
        else
        {
            // no valid value, output is not valid either
            outputCursor.MutableDataValue.IsValid = false;
            outputCursor.MutableDataValue.ErrorValue = inputCursor.CurrentDataValue.ErrorValue;
        }
    }
}
- Duplicate the code to handle integers:
private sealed class IntCompute : MyComputerBase
{
    private readonly DataValueCursor<int> inputCursor;
    private readonly MutableValueCursor<int> outputCursor;

    public IntCompute(DataValueCursor<int> inputCursor, MutableValueCursor<int> outputCursor)
    {
        this.inputCursor = inputCursor;
        this.outputCursor = outputCursor;
    }

    public override void ComputeNextValue()
    {
        // check if the input cursor is valid.
        if (inputCursor.IsCurrentValueValid)
        {
            outputCursor.MutableDataValue.IsValid = true;
            // output is input + 1
            outputCursor.MutableDataValue.Value = inputCursor.CurrentValue + 1;
        }
        else
        {
            // no valid value, output is not valid either
            outputCursor.MutableDataValue.IsValid = false;
            outputCursor.MutableDataValue.ErrorValue = inputCursor.CurrentDataValue.ErrorValue;
        }
    }
}
- These computers can now be created at initialization and stored
in a list used by the
MoveNextCore method.
public sealed class MyTransformationReader : CustomDataRowReader
{
    private readonly DataRowReader inputReader;
    private readonly IList<MyComputerBase> computers;
    private readonly List<DataRowReaderColumn> columns;

    public MyTransformationReader(DataRowReader inputReader)
    {
        // initialize the variables
        this.inputReader = inputReader;
        this.computers = new List<MyComputerBase>();
        this.columns = new List<DataRowReaderColumn>();

        // loop over all input columns to see if we should transform or just
        // pass through the values.
        foreach (DataRowReaderColumn col in inputReader.Columns)
        {
            // check if the data type is real or integer
            DataColumnProperties properties = col.Properties;
            if (col.DataType == DataType.Real)
            {
                // it is real, so we should transform it.
                // create the output cursor and cast the input cursor.
                DataValueCursor<double> inputCursor = (DataValueCursor<double>)col.Cursor;
                MutableValueCursor<double> outputCursor =
                    (MutableValueCursor<double>)DataValueCursor.CreateMutableCursor(DataType.Real);

                // add the output column to the columns collection
                this.columns.Add(
                    new DataRowReaderColumn(
                        col.Name,
                        col.DataType,
                        properties,
                        outputCursor));

                // create the matching computer defined earlier.
                this.computers.Add(new DoubleCompute(inputCursor, outputCursor));
            }
            else if (col.DataType == DataType.Integer)
            {
                // it is int, so we should transform it.
                // create the output cursor and cast the input cursor.
                DataValueCursor<int> inputCursor = (DataValueCursor<int>)col.Cursor;
                MutableValueCursor<int> outputCursor =
                    (MutableValueCursor<int>)DataValueCursor.CreateMutableCursor(DataType.Integer);

                // add the output column to the columns collection
                this.columns.Add(
                    new DataRowReaderColumn(
                        col.Name,
                        col.DataType,
                        properties,
                        outputCursor));

                // create the matching computer defined earlier.
                this.computers.Add(new IntCompute(inputCursor, outputCursor));
            }
            else
            {
                // not int or real, just propagate the value.
                this.columns.Add(col);
            }
        }
    }
    ...
}
- Since the actual work is performed in the constructor and the
classes, the
MoveNextCore method becomes very simple:
protected override bool MoveNextCore()
{
    // tell the input reader to set the values for the next row.
    if (!this.inputReader.MoveNext())
    {
        // there were no more rows, return.
        return false;
    }

    // iterate over all the computers and tell them to set the current value.
    foreach (MyComputerBase myComputerBase in computers)
    {
        myComputerBase.ComputeNextValue();
    }

    // a row was read and all output cursors were updated.
    return true;
}