
Data Management

The Data Manager makes up a substantial part of the analysis document. It holds and handles data. It is the top node and entry point to data management.

Representing Data

Data hierarchy concepts

Data is represented by data tables. Each table in the table node collection has a unique, fixed id as well as a unique but editable name. A particular table node can be retrieved either by name or by id. The data tables are kept in memory, but if memory is scarce the least used data can be paged to disk.
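
The lookup rules above can be illustrated with a minimal sketch. It is a plain Python model, not the Spotfire API: the `Table` and `TableCollection` classes and all of their methods are hypothetical names chosen for this example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    # The id is generated once at creation and is never reassigned.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class TableCollection:
    """Hypothetical container: fixed unique ids, unique but editable names."""

    def __init__(self):
        self._by_id = {}
        self._id_by_name = {}

    def add(self, name):
        if name in self._id_by_name:
            raise ValueError(f"table name already in use: {name!r}")
        table = Table(name)
        self._by_id[table.id] = table
        self._id_by_name[name] = table.id
        return table

    def rename(self, table_id, new_name):
        table = self._by_id[table_id]
        if new_name != table.name and new_name in self._id_by_name:
            raise ValueError(f"table name already in use: {new_name!r}")
        # Only the name index changes; the id stays the same.
        del self._id_by_name[table.name]
        table.name = new_name
        self._id_by_name[new_name] = table_id

    def by_id(self, table_id):
        return self._by_id[table_id]

    def by_name(self, name):
        return self._by_id[self._id_by_name[name]]
```

Because the id never changes, a reference held by id stays valid after a rename, while a lookup by the old name fails.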

Tables come in several flavors:

  • Source tables are initiated and populated once by reading from a data source.
  • On demand tables are created on the fly based on an analytic action. They typically retrieve additional data from an external database using a query defined by, for instance, the selection in a visualization. The lifetime of the table is usually the lifetime of the artifact it supports, such as a details-on-demand view.
  • Calculated tables are derived from source tables for a specific purpose. Once created, a calculated table exists until it is removed. It may have to be manually updated to correctly reflect a change in the underlying data. A calculated table may contain data describing the relationships between columns of a source table.
  • Derived tables are created from source tables. They define dynamic views on the source data, typically an aggregation of a filtered subset of data. They are automatically updated, but are not accessible from the data manager and are usually not persisted.

A data column consists of metadata and row values. As indicated in the figure, there is a document node for each column. The data type of a column is Integer, Real, String, Date, Time, DateTime, Currency, or Binary. The data representation is column-based in the sense that data values are associated with columns. The actual data values, however, are stored in specialized internal data structures.
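
As a rough illustration of a column-based layout, the sketch below models a column as metadata plus a list of row values. The class names, the `properties` dictionary, and the use of plain Python lists are assumptions made for this example; the product stores values in its own internal structures.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataType(Enum):
    # The data types listed in the text above.
    INTEGER = "Integer"
    REAL = "Real"
    STRING = "String"
    DATE = "Date"
    TIME = "Time"
    DATETIME = "DateTime"
    CURRENCY = "Currency"
    BINARY = "Binary"


@dataclass
class Column:
    name: str
    data_type: DataType
    properties: dict = field(default_factory=dict)  # hypothetical metadata bag
    values: list = field(default_factory=list)      # one entry per row


@dataclass
class ColumnTable:
    columns: dict  # column name -> Column

    def row(self, index):
        # A row is assembled on demand from the per-column value lists.
        return {name: col.values[index] for name, col in self.columns.items()}


table = ColumnTable({
    "city": Column("city", DataType.STRING, values=["Oslo", "Lima"]),
    "population": Column("population", DataType.INTEGER, values=[700000, 10000000]),
})
second_row = table.row(1)
```

Values live with their column, so a row only exists as a view assembled across columns.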

See Spotfire Data for further class references and code samples.

Loading Data

When an analysis file is saved to disk, the data for a table can be stored directly in the file. This mode is known as embedded data. Alternatively, data can be reloaded from the external data source when the file is opened. This mode is known as linked data. Metadata about the table and its columns is stored in the file for both embedded and linked data.

  • Spotfire Text Data Format
    The Spotfire Text Data Format (STDF) is a tabular data format, the common file format for Spotfire products. It is strict, unforgiving, easy to parse efficiently, and particularly useful if data is both formatted and parsed by Spotfire products. Otherwise, a more flexible format might be preferable.

A data source is an object that contains a reference to external data, such as a file or database. There are many kinds of data sources, and custom data sources may be created. Furthermore, a data source may be included in a data flow, a pipeline that reads data from a data source and processes the data through a sequence of transformations. The output usually ends up in a data table. A data transformation can perform anything from simple tasks, like data cleaning, to complex operations, like pivoting. It is possible to extend the platform by creating custom transformations.
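
The data flow idea can be sketched as a source followed by a chain of transformations. This is an illustrative Python model only: the function names, the CSV input, and the list-of-dictionaries table are assumptions for the example and do not correspond to Spotfire classes.

```python
import csv
import io


def read_csv_source(text):
    """Hypothetical data source: parses CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_incomplete_rows(rows):
    """Simple cleaning transformation: remove rows with empty values."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


def parse_amount(rows):
    """Type-conversion transformation: turn the 'amount' strings into floats."""
    return [{**r, "amount": float(r["amount"])} for r in rows]


def run_data_flow(source, transformations):
    """Read from the source, then apply each transformation in order."""
    rows = source()
    for transform in transformations:
        rows = transform(rows)
    return rows


RAW = "region,amount\nNorth,10.5\nSouth,\nNorth,4.5\n"

table = run_data_flow(
    lambda: read_csv_source(RAW),
    [drop_incomplete_rows, parse_amount],
)
```

A custom transformation in this model is just another function from rows to rows, which is why the order of the list matters: `parse_amount` would fail on the empty value if it ran before the cleaning step.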

Data Flow

Handling Data

A data relation defines a connection between two tables, usually by declaring that a column in the first table corresponds to a column in the second table. Data relations are used for translating a row selection in one table to a row selection in the other table. The set of all data relations implicitly defines a grouping of the tables. All tables in such a table group are related to each other.
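
A minimal sketch of translating a row selection through a relation follows. It assumes that a row selection is a set of row indices and that the relation matches equal values in two columns; `translate_selection` and the sample data are hypothetical.

```python
def translate_selection(selected_rows, from_column, to_column):
    """Map selected row indices in one table to matching rows in another.

    The relation is assumed to say that values in from_column correspond
    to equal values in to_column.
    """
    keys = {from_column[i] for i in selected_rows}
    return {j for j, value in enumerate(to_column) if value in keys}


# Two related tables, shown only by their key columns.
customers_id = [1, 2, 3]           # customers table, column "id"
orders_customer_id = [1, 2, 2, 3]  # orders table, column "customer_id"

selected_customers = {1}           # row index 1, i.e. the customer with id 2
selected_orders = translate_selection(
    selected_customers, customers_id, orders_customer_id
)
```

The same function works in the opposite direction by swapping the two column arguments.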

A row selection defines a subset of the rows in a table. Multiple row selections for the same table can be combined using set operations such as taking the intersection.
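
If a row selection is modeled as a set of row indices (an assumption made for this sketch, not the product's internal representation), the set operations map directly onto Python's built-in set operators:

```python
row_count = 6  # hypothetical table with rows 0..5

high_value = {0, 2, 3}  # rows matching one criterion
recent = {2, 3, 5}      # rows matching another criterion

both = high_value & recent                   # intersection
either = high_value | recent                 # union
only_high = high_value - recent              # difference
not_recent = set(range(row_count)) - recent  # complement within the table
```

The complement needs the table's row count, since a selection is only meaningful relative to the table it belongs to.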

A data selection is conceptually similar to a row selection, but it is more complex since it is applicable to all the tables. You can set the selection relative to one table and then retrieve the selection translated to another table. There are two specific types of data selections. A marking selection defines a set of marked rows, while a filtering selection defines a set of filtered rows. The two types turn out to be more different than one might expect at first.

A data view is a table that is derived from another table, typically by aggregating the rows. Data views are used extensively by the visualization framework.
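
The following sketch shows one way to think about such a view: an aggregation computed over the rows that pass a filter. The function name, the row-dictionary layout, and the choice of a sum are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_view(rows, filtered_rows, group_by, value):
    """Sum `value` per `group_by` category over the filtered row indices."""
    totals = defaultdict(float)
    for i in sorted(filtered_rows):
        totals[rows[i][group_by]] += rows[i][value]
    return dict(totals)


rows = [
    {"region": "North", "amount": 10.0},
    {"region": "South", "amount": 7.0},
    {"region": "North", "amount": 4.5},
    {"region": "South", "amount": 1.0},
]

# The last row is filtered out, so it does not contribute to the view.
view = aggregate_view(rows, {0, 1, 2}, "region", "amount")
```

A visualization such as a bar chart would read from `view` rather than from the source rows, and the view would be recomputed whenever the filter changes.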