Spotfire Text Data Format

The Spotfire Text Data Format (STDF) is a tabular data format, the common file format for Spotfire products. It is strict, unforgiving, easy to parse efficiently, and particularly useful if data is both formatted and parsed by Spotfire products. Otherwise, a more flexible format might be preferable.

Overview

These are the key characteristics of Spotfire Text Data Format version 1.0:

  • STDF is a text format rather than binary, and all characters are Unicode.
  • There are two lines of metadata: The first line contains column names, the second contains column types.
  • The format is similar to the popular comma-separated values (CSV) format, but differs in both choice and use of value separator:
    • The values are separated by semicolons (;), not commas (,).
    • The semicolons are used as value terminators rather than separators – there is a semicolon after the last value on each line. A terminator is used because it is compatible with the probable future addition of list types.
  • Whitespace and special characters are significant and never trimmed.
    Note: A data file where the header, all comments, and all empty lines have been removed will be referred to as a trimmed data file.
  • There is a well-defined way to format null and invalid values.

Example:

<bom>\! filetype=Spotfire.DataFormat.Text; version=1.0;
\* ich bin ein berliner
Column A;Column #14B;Kolonn Ö;The n:th column;
Real;String;Blob;Date;
-123.45;i think there\r\nshall never be;\#aaXzD;2004-06-18;
1.0E-14;   a poem\r\nlovely as a tree;\#ADB12=;\?lost in time;
222.2;\?invalid text;\?;2004-06-19;
\?error11;\\förstår ej\\;\#aXzCV==;\?1979;
3.14;hej å hå\seller?;\?NIL;\?#ERROR;

Format specification

This section lists rules for the data format. When a parser is given a data file that breaks one or more of the following rules, it must always exit with an error message rather than trying to overcome the error.

Persistent storage

  1. The UTF-8 encoding should be used for persistent storage of all data files. The first three bytes of every file must be the byte order mark (BOM) for UTF-8, which is hexadecimal EF BB BF.
    Note: The BOM is mandatory. In the rest of this specification the BOM will be omitted or written as <bom>.
  2. Immediately following the BOM, all data files must contain a single header line in exactly the following form:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
  3. The filename extension .txt is recommended to make the data files open in a text editor. The .csv extension should not be used since this is not a CSV data format, despite the fact that there are similarities.
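The BOM and header rules above can be checked before any other parsing takes place. The following is a minimal Python sketch, not a reference implementation; the names `StdfError` and `check_preamble` are hypothetical, and it assumes that the header line, like every other line, ends with crlf.

```python
UTF8_BOM = b"\xef\xbb\xbf"
HEADER = "\\! filetype=Spotfire.DataFormat.Text; version=1.0;"


class StdfError(ValueError):
    """Hypothetical error type, raised when a file breaks a format rule."""


def check_preamble(raw):
    """Validate the BOM and header line; return the text after the header."""
    if not raw.startswith(UTF8_BOM):
        raise StdfError("no UTF-8 byte order mark is present")
    text = raw[len(UTF8_BOM):].decode("utf-8")
    first, crlf, rest = text.partition("\r\n")
    if first != HEADER:
        raise StdfError("the file header is missing or not the one expected")
    if not crlf:
        # Assumption: the header line is terminated like any other line.
        raise StdfError("the header line must end with crlf")
    return rest
```

A parser built this way never attempts to recover: a missing BOM or a header with a different version string stops processing immediately, as the specification requires.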

Special characters

There are three special character sequences in the data format:

  • semicolon is used as the column separator/terminator
  • crlf (carriage return immediately followed by line feed) is used as the row separator/terminator
  • backslash marks the beginning of an escape sequence

All two-character sequences beginning with a backslash are reserved for special use. In the first version of the data format the following escape sequences are supported:

\\   A backslash character in a string or name.

\s   A semicolon character in a string or name.

\n   A newline character in a string or name.

\!   Marks the beginning of the file header.

\?   Used for null and invalid values; see the Null and invalid values section.

\*   A comment; see the Comments section.

\#   Used for base64-encoded binary values; see the Binary objects section.

\r   The return character in a string or name.

\t   The tab character (used to enhance readability).

\[   Marks the beginning of a list value; see the List types section.

\]   Marks the end of a list value.

If the parser encounters an unrecognized escape sequence, that is, a backslash followed by a character not listed above, it must exit with an error message.
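As an illustration of the escape rules for strings and names, the sketch below decodes one raw field. The function name `unescape` is hypothetical, and the sketch assumes that the markers \!, \?, \*, \#, \[ and \] have already been handled by the caller, so meeting one of them inside a string is treated as an error.

```python
# Escape sequences that may occur inside a string or name.
SIMPLE_ESCAPES = {"\\": "\\", "s": ";", "n": "\n", "r": "\r", "t": "\t"}


def unescape(text):
    """Decode the escape sequences in a raw string value or column name."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("backslash at end of value")
        code = text[i + 1]
        if code not in SIMPLE_ESCAPES:
            # Unrecognized or misplaced escape: stop, never guess.
            raise ValueError("escape sequence \\" + code + " is not allowed here")
        out.append(SIMPLE_ESCAPES[code])
        i += 2
    return "".join(out)
```

For instance, the raw value `hej å hå\seller?` from the example file decodes to a string containing a literal semicolon.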

Basic tabular format

  1. A data file consists of a number of text lines. Each text line is a sequence of Unicode characters.
  2. All lines must have the same number of columns.
  3. Semicolon characters separate the columns: Every value must be followed by a semicolon, including the last value on a row. The number of unescaped semicolons on a line will always equal the number of columns.
  4. A semicolon character that is not used as a column separator must be escaped using the \s syntax.
  5. Every line, including the last, must end with crlf.
  6. crlf in the first position – an empty line – is interpreted as insignificant whitespace and should be skipped by the parser.
  7. A newline character that is not used as an end-of-line marker must be escaped using the \n syntax.
  8. A carriage return character that is not used as an end-of-line marker must be escaped using the \r syntax.
  9. A trimmed data file can be completely empty, containing zero rows and zero columns. Such a file contains zero characters.
  10. If a trimmed data file is not completely empty, it must begin with two lines of metadata.
  11. A trimmed data file for an empty data set – zero data rows but one or more columns – contains exactly two rows specifying the column names and types.
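The terminator rule makes splitting a line into fields straightforward: scan the line, keep every two-character escape sequence together, and cut at each unescaped semicolon. The sketch below is one way to do this in Python; `split_fields` is a hypothetical name, the input is a single line without its crlf, and list values (which contain unescaped semicolons between \[ and \]) are deliberately not handled here.

```python
def split_fields(line):
    """Split one line (without its crlf) into raw, still-escaped fields."""
    fields = []
    current = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\":
            if i + 1 >= len(line):
                raise ValueError("backslash at end of line")
            current.append(line[i:i + 2])  # keep the escape pair intact
            i += 2
        elif ch == ";":
            fields.append("".join(current))  # the semicolon terminates a value
            current = []
            i += 1
        elif ch in "\r\n":
            raise ValueError("unescaped carriage return or newline in a value")
        else:
            current.append(ch)
            i += 1
    if current:
        raise ValueError("missing semicolon after the last value")
    return fields
```

A caller would then compare `len(split_fields(line))` across all lines of the trimmed data file and exit with an error on the first mismatch.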

Metadata format

  1. The first metadata line – the first line of a trimmed data file – defines the column names.
  2. All column names must be unique and contain at least one non-whitespace character.
  3. Whitespace characters are significant in column names, and the parser must not trim the name strings.
  4. Newline, carriage return, backslash, and semicolon characters may be present in column names only if escaped.
  5. Column names are case sensitive.
  6. The second line of a trimmed data file defines the column types.
    Note: The supported column types are listed in the Type names section.
  7. The type names are case sensitive and contain no whitespace.
  8. If a parser finds an unrecognized column type, it must exit with an error message.
  9. The syntax for null and invalid values must not be used for column names or types.
  10. The syntax for base64-encoded values must not be used for column names or types.
  11. The syntax for list values must not be used for column names or types.
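The metadata rules lend themselves to a single validation pass over the two metadata lines. The following Python sketch assumes the lines have already been split into raw fields; `check_metadata` is a hypothetical name, the type names are taken from the Type names section, and the whitespace check is simplified (a name consisting only of escaped whitespace such as \n would slip through).

```python
BASE_TYPES = ("Integer", "Real", "String", "Date", "Time", "DateTime", "Blob")
# Each base type has a list-valued counterpart with a List suffix.
KNOWN_TYPES = set(BASE_TYPES) | {name + "List" for name in BASE_TYPES}


def check_metadata(names, types):
    """Validate raw column names and type names; raise ValueError on a breach."""
    if not names:
        raise ValueError("the file contains no metadata")
    if len(names) != len(types):
        raise ValueError("name and type lines have different numbers of columns")
    seen = set()
    for name in names:
        if name.startswith(("\\?", "\\#", "\\[")):
            raise ValueError("null, base64 and list syntax is not allowed in names")
        if not name.strip():  # simplified: escaped whitespace is not decoded
            raise ValueError("a column name needs a non-whitespace character")
        if name in seen:  # comparison is case sensitive
            raise ValueError('the column name "' + name + '" is used twice')
        seen.add(name)
    for type_name in types:
        if type_name not in KNOWN_TYPES:  # case sensitive, no trimming
            raise ValueError('the type name "' + type_name + '" is unrecognized')
```

Because no trimming or case folding is applied, `" String"` and `"string"` are both rejected, while the column names `a` and `A` are accepted as distinct.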

Comments

  1. Comments are allowed anywhere in a file after the header line, but not before.
  2. A comment starts with the \* escape sequence and extends to the end of the line.
  3. A comment must cover an entire line: The first two characters must be \*.
  4. A comment in any other position will cause the parser to exit with an error message.
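Producing the trimmed data file amounts to dropping comment lines and empty lines while rejecting comments in any other position. The sketch below shows one possible approach in Python; `trimmed_lines` is a hypothetical name, and its input is assumed to be the decoded text that follows the header line.

```python
def _has_misplaced_comment(line):
    # True if the comment marker occurs anywhere in the line as an escape pair.
    i = 0
    while i < len(line) - 1:
        if line[i] == "\\":
            if line[i + 1] == "*":
                return True
            i += 2  # skip the whole two-character escape sequence
        else:
            i += 1
    return False


def trimmed_lines(body):
    """Return the significant lines of the text that follows the header."""
    if body and not body.endswith("\r\n"):
        raise ValueError("crlf expected at the end of the last line")
    result = []
    for line in body.split("\r\n")[:-1]:
        if "\r" in line or "\n" in line:
            raise ValueError("crlf must be used as row delimiter")
        if line == "" or line.startswith("\\*"):
            continue  # empty lines and whole-line comments are skipped
        if _has_misplaced_comment(line):
            raise ValueError("a comment must start at the first position of a line")
        result.append(line)
    return result
```

The first two lines of the result, if any, are the metadata lines; the rest are data rows.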

Null and invalid values

  1. A null (missing) value is represented by the escape sequence \?.
  2. A non-null invalid value is represented by the escape sequence \? followed by a non-empty string of other characters. The string of characters following but not including the \? is referred to as the error code. The error code has no significance within the data format: There are no predefined error codes. For instance, the string \?ERROR is used to represent an invalid value with the error code ERROR.
  3. The error code may contain special characters if they are encoded as they would be in a string value.
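A parser can separate null and invalid values from ordinary values before any type-specific parsing. The following minimal sketch uses a hypothetical function `classify` that works on one raw field and leaves the error code in its escaped form.

```python
def classify(field):
    """Return ("null", None), ("invalid", error_code) or ("value", field)."""
    if field == "\\?":
        return ("null", None)
    if field.startswith("\\?"):
        # Everything after the marker is the error code (still escaped).
        return ("invalid", field[2:])
    return ("value", field)
```

Applied to the example file, `\?error11` yields the error code `error11` and `\?#ERROR` yields `#ERROR`, while an empty field stays an ordinary (empty) value.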

Type names

The following type names are supported in the text data format:

  • Integer
  • Real
  • String
  • Date
  • Time
  • DateTime
  • Blob

Type names are case sensitive and no synonyms are supported. A parser must exit with an error message when it encounters an unrecognized type name.

For each column type there is a corresponding list-valued type whose name is the name of the base type with an additional List suffix, for instance, StringList. List types may not be supported by all parsers, and a parser should fail early when encountering an unsupported type.

Additional type names, like a Boolean type or a DateTimeTz with time zone information, may be added to the text data format in the future.

Numeric values

  1. There must be no whitespace characters before or after a numeric value.
  2. The only base is 10. There is no support for octal or hexadecimal numbers.
  3. A decimal point should always be used for floating point values. The decimal point must be a point and no other character such as a comma.
  4. Scientific notation for Real values is supported. Examples: 1.34e+45, -5.670001E-12.
  5. Thousand separators are not supported.
  6. The standard minus sign must be used for negative numbers.
  7. The plus sign is implicit for positive numbers and should not be used except for the exponent in scientific notation, where a plus sign is allowed but optional.
  8. There is no support for special floating point values such as "not a number" or infinity. Nevertheless it is recommended that a formatter output the following invalid values where applicable: \?NaN, \?+Inf, and \?-Inf.
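The numeric rules can be approximated with two regular expressions. The Python sketch below takes the strict route and also rejects the "undefined" forms, such as a leading plus sign or a Real without a decimal point; the names `parse_integer` and `parse_real` are hypothetical, and null or invalid values are assumed to have been filtered out beforehand.

```python
import re

# No sign other than minus, no leading zeros, ASCII digits only.
INTEGER_RE = re.compile(r"-?(?:0|[1-9][0-9]*)")
# Mandatory decimal point; optional exponent with an optional sign.
REAL_RE = re.compile(r"-?[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?")


def parse_integer(field):
    if not INTEGER_RE.fullmatch(field):
        raise ValueError("not a valid Integer value: " + repr(field))
    return int(field)


def parse_real(field):
    if not REAL_RE.fullmatch(field):
        raise ValueError("not a valid Real value: " + repr(field))
    return float(field)
```

Using `fullmatch` rather than `match` ensures that trailing characters, such as a unit suffix or whitespace padding, are rejected as well.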

String values

  1. A String column may contain valid empty values - the empty string. Such a value is represented by zero characters, that is, the column is left empty except for the terminating semicolon.
  2. Single and double quote marks have no special meaning in the data format. They can be included in string values and need not be escaped. Accordingly "A" represents a string value with three characters.
  3. All whitespace characters at the beginning and end of a string value are significant and included in the string value.
  4. The special characters ; and \ and newline and carriage return can be included in string values if they are escaped.
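On the formatter side, the string rules reduce to escaping four characters and appending the terminator. The sketch below is a minimal illustration with a hypothetical function name, `format_string`; the backslash must be escaped first so that the backslashes introduced by the other replacements are not escaped again.

```python
def format_string(value):
    """Encode one string value as a field, including the terminating semicolon."""
    escaped = (
        value.replace("\\", "\\\\")  # must come first
        .replace(";", "\\s")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return escaped + ";"
```

Quote marks and surrounding whitespace pass through unchanged, and the empty string is written as a lone semicolon.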

Date and time values

  1. A Date value is formatted like this: YYYY-MM-DD, with exactly four digits for the year, exactly two digits for the month, and exactly two digits for the day of month.
  2. A Time value is formatted like this: HH:MM:SS, with exactly two digits for each part and 24-hour notation for the hours.
  3. A Time value may optionally include a milliseconds part by appending .XXX to the time format above. Exactly three digits should be used in this case. Example: 23:59:59.999
  4. The time zone is not included in time values. If time zones turn out to be important, extra value types will be added for this purpose.
  5. A DateTime value is formatted as a Date value followed by exactly one space character and a Time value.
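The fixed-width date and time formats can be validated with regular expressions before the values are handed to a date library. The Python sketch below uses hypothetical function names and chooses to reject out-of-range values such as month 13 or hour 24, which the format leaves undefined.

```python
import re
from datetime import date, datetime, time

DATE_RE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")
TIME_RE = re.compile(r"([0-9]{2}):([0-9]{2}):([0-9]{2})(?:\.([0-9]{3}))?")


def parse_date(field):
    match = DATE_RE.fullmatch(field)
    if not match:
        raise ValueError("not a valid Date value: " + repr(field))
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)  # raises ValueError if out of range


def parse_time(field):
    match = TIME_RE.fullmatch(field)
    if not match:
        raise ValueError("not a valid Time value: " + repr(field))
    millis = int(match.group(4)) if match.group(4) else 0
    return time(int(match.group(1)), int(match.group(2)),
                int(match.group(3)), millis * 1000)


def parse_datetime(field):
    # Exactly one space separates the Date part from the Time part.
    date_part, space, time_part = field.partition(" ")
    if not space:
        raise ValueError("not a valid DateTime value: " + repr(field))
    return datetime.combine(parse_date(date_part), parse_time(time_part))
```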

Binary objects

The type name Blob (binary large object) is used because it is standard, but small binary objects are clearly more suitable for a text file than large ones.

A Blob column is used for binary data types. The data format supports a single encoding for binary objects, the Base64 encoding; refer to RFC 2045, section 6.8, for details. The escape sequence \# marks the beginning of a base64-encoded object, like this: \#IAMB64. The use of an escape sequence for binary objects may seem unnecessary since base64 is the only supported encoding, but other encodings may be added in a future version of the format.

The Base64 encoding sets a maximum line length of 76 characters. To break an encoded object into lines, the \r\n escape sequence must be used. Example: \#abc...\r\nxCd...\r\nAD=

Note: The data format provides only a transport mechanism for binary objects and not complete type information. The object type metadata can either be encoded with the objects or provided through a separate channel.
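Encoding and decoding Blob values can be sketched with Python's standard `base64` module. The function names `encode_blob` and `decode_blob` are hypothetical; the decoder simply removes the escaped line breaks and does not check segment lengths, which is one of several behaviors a parser could choose.

```python
import base64
import binascii

LINE_BREAK = "\\r\\n"  # the escaped crlf, four characters in the file


def encode_blob(data):
    """Format bytes as a Blob field (without the terminating semicolon)."""
    encoded = base64.b64encode(data).decode("ascii")
    # Break the encoded text into segments of at most 76 characters.
    segments = [encoded[i:i + 76] for i in range(0, len(encoded), 76)]
    return "\\#" + LINE_BREAK.join(segments)


def decode_blob(field):
    """Decode a raw Blob field; raise ValueError if it is malformed."""
    if not field.startswith("\\#"):
        raise ValueError("missing \\# escape before base64 data")
    payload = field[2:].replace(LINE_BREAK, "")
    try:
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("invalid base64 data") from exc
```

Passing `validate=True` makes the decoder reject characters outside the base64 alphabet instead of silently discarding them.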

List types

Some applications use data sets where the values in a column are lists of strings. The type name of such a column is StringList. Lists of other types than strings are expressed in a similar fashion, but not all parsers will have support for all list types, or perhaps no list types at all. List types are an optional feature of the format.

Formal rules for list values:

  1. A valid list value must begin with \[ and end with \] followed by a terminating semicolon.
  2. The format does not explicitly encode the length of the lists, nor does it require that all lists in a column are of the same length.
  3. All list items are terminated by a semicolon, including the last item in a list.
  4. Null and invalid list items are represented using the standard escape sequences.
  5. Special characters can be included in list items if they are escaped.
  6. There is no support for nested lists.
  7. A list value cannot be broken into more than one line. If the parser encounters a newline character while parsing a list value, it should exit with an error message.

StringList examples:

  • A list of the three strings blue, white, and brown is represented with all list items terminated by a semicolon:
    \[blue;white;brown;\]
  • If a row contains an Integer column followed by a StringList column and another Integer column:
    17;\[blue;white;brown;\];19;
  • The empty list:
    \[\]
  • A list containing a single empty string:
    \[;\]
  • A list containing null and invalid items:
    \[apa;\?;bepa;\?que;\]
  • If the list itself, rather than its items, is null or invalid:
    \?   or   \?error11
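The list rules can be illustrated with a small parser for a single list value. In this Python sketch, `parse_list` is a hypothetical name; it receives the raw list value without the semicolon that terminates the cell, returns the items in their still-escaped form, and assumes that a null or invalid list (\? or \? followed by an error code) was recognized beforehand.

```python
def parse_list(field):
    """Split a raw list value such as \\[a;b;\\] into raw, still-escaped items."""
    if not (field.startswith("\\[") and field.endswith("\\]")):
        raise ValueError("a list value must begin with \\[ and end with \\]")
    inner = field[2:-2]
    items = []
    current = []
    i = 0
    while i < len(inner):
        ch = inner[i]
        if ch == "\\":
            if i + 1 >= len(inner):
                raise ValueError("backslash at end of list value")
            if inner[i + 1] in "[]":
                raise ValueError("nested lists are not supported")
            current.append(inner[i:i + 2])  # keep the escape pair intact
            i += 2
        elif ch == ";":
            items.append("".join(current))  # every item is terminated
            current = []
            i += 1
        elif ch in "\r\n":
            raise ValueError("a list value cannot span more than one line")
        else:
            current.append(ch)
            i += 1
    if current:
        raise ValueError("missing semicolon after the last list item")
    return items
```

Each returned item can then be processed like an ordinary field of the base type, including the null and invalid forms.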

Test cases

This section contains a number of test cases that should be used to implement unit tests for all parsers.
Note: The UTF-8 BOM is assumed to be present in all examples but is not shown.

Basic tabular format

BOM tests

  1. A test file with correctly formatted header and data, but with no BOM should give an error message stating that no BOM is present.
  2. A test file with correctly formatted header and data, but with a BOM other than UTF-8, should result in an error message stating that the wrong encoding is used.

File header tests

Test case: Missing header
File content:
    c1;c2;
    Integer;Real;
    1;2.0;
Result: An error message stating that the file header is missing.

Test case: Wrong file header
File content:
    \! filetype=Spotfire.CsvFormat; version=1.0;
    c1;c2;
    Integer;Real;
    1;2.0;
Result: An error message stating that the file header is not the one expected.

Test case: Unexpected version
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.1;
    c1;c2;
    Integer;Real;
    1;2.0;
Result: An error message stating that the format version number is too high and not supported by the parser.

Row and column separation tests

Test case: Empty data set
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
Result: A successfully parsed data set containing zero rows and zero columns.

Test case: Missing final semicolon
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    c1;c2;c3;
    String;String;String;
    a;b;c;
    d;e;f
Result: An error message stating that there is a missing semicolon on the last line.

Test case: Embedded semicolons and newlines
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    c1;c2;c3;
    String;String;String;
    \sa;b\sb;c\s;
    \nd;e\ne;f\n;
Result: A successfully parsed data set with two rows and three columns. All values on the first row contain an embedded semicolon character, and all values on the second row contain an embedded newline character.

Test case: Unequal number of columns
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    c1;c2;c3;
    String;String;String;
    a;b;
    1;2;3;
Result: An error message stating that the first data row contains too few columns.

Test case: Missing carriage return in row delimiter
Result: An error message stating that a crlf character sequence must be used as row delimiter.

Test case: crlf missing at the end of the last row
Result: An error message stating that a crlf character sequence was expected. It also indicates that the data file may have been truncated.

Metadata

Test case: Missing metadata
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    a;b;c;
    d;e;f;
Result: An error message stating that the file contains no metadata.

Test case: Whitespace in metadata
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    c1; c2; c3;
    String; String; String;
    a; b; c;
Result: An error message stating that the type name " String" - note the space before 'String' - is unrecognized.

Test case: Case sensitivity
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    c1;c2;c3;
    string;integer;float;
    a;1;2.0;
Result: An error message stating that the type name "string" - note the lowercase 's' - is unrecognized.

Test case: Unique column names
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    a;a;
    String;Integer;
    a;1;
Result: An error message stating that the column name "a" is used twice.

Test case: Column names, case sensitive
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    a;A;
    String;Integer;
    a;1;
Result: Parsed correctly.

Comments

Test case: Comment before header
File content:
    \* My latest data file.
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    c1;c2;
    Integer;Real;
    1;2.0;
Result: An error message stating that a comment was encountered when a file header was expected.

Test case: Valid use of comments and empty lines
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    \* File generated by tool XYZ.
    \* Metadata section.
    Column A;Column B;
    String;DateTime;
    \* Data section.
    a;\?;
    b;\?;
Result: A successfully parsed data set with two columns and two rows.

Test case: Comments must cover the entire line
File content:
    \! filetype=Spotfire.DataFormat.Text; version=1.0;
    Column;
    String;
    Value; \* Only one value.
Result: An error message stating that a comment must start at the first position of a line.

Numeric values

The following table lists examples of integer values. Those marked valid should pass through the parser without problems, while those marked invalid should cause the parser to exit with an error message. There is also a third category of values that are formally invalid but where common standard IO functions would succeed in parsing the values and it would be more costly to check for errors than to accept the values. The result of parsing the third category of values is undefined: A parser may issue an error or accept the value. A formatter, however, must not output such values.

Integer    Category    Result/note
1          valid
-1         valid
+1         undefined   Unnecessary + sign
  1        undefined   Avoid whitespace padding
\t1        undefined   Avoid whitespace padding
1.0        invalid     Decimal point in integer
1E5        invalid     Scientific notation not supported
$100       invalid     Prefix not allowed
1 SEK      invalid     Suffix not allowed
1,000      invalid     Thousand separators not allowed
100 000    invalid     Thousand separators not allowed
0xAAFF     invalid     Hexadecimal not allowed
0777       invalid     Octal notation not allowed
123L       invalid     Long syntax not allowed
\?         valid       Null integer value

The following table lists examples of floating point values:

Real       Category    Result/note
1.0        valid
-1.0       valid
+1.0       undefined   Unnecessary + sign
1          undefined   Missing decimal point
  1.0      undefined   Avoid whitespace padding
1.0d       invalid     Precision suffix not allowed
1.0E5      valid       Scientific notation
1.0e-5     valid       Lower case e supported
1.0E+5     undefined   Redundant + sign
1E5        undefined   Missing decimal point
12.0E3     undefined   Not normalized
.4         undefined   Does not begin with a digit
E-13       invalid     Exponent only not allowed
1,0        invalid     Decimal comma not allowed
1,000.0    invalid     Thousand separator not allowed
\?-Inf     valid       Valid representation of invalid real value

String values

The following table lists examples of valid and invalid string values, including the terminating semicolon. For the valid values, the resulting string is shown in Java syntax.

String       Category   Result/note
a;           valid      "a"
 a  ;        valid      " a  "
\ta\r\n;     valid      "\ta\r\n"
[a,b,c];     valid      "[a,b,c]"
\u221e;      invalid    No escape syntax for Unicode
4"10';       valid      "4\"10'"
a\";         invalid    No escape syntax for double quote
a\s;         valid      "a;"
;            valid      ""
ökentråk;    valid      "ökentråk"
\?\?;        invalid    Two null values in one cell

Date and time values

The following table lists examples of valid and invalid date values:

Date           Category    Result/note
2004-08-05     valid
04-08-05       invalid     4 year digits required
Aug 5, 2004    invalid     Wrong format
2004-13-01     undefined   Month number too high
2004-02-31     undefined   Day number too high

The following table lists examples of valid and invalid time values:

Time            Category    Result/note
10:42:56        valid
23:59:59.999    valid       Close to midnight
2:32pm          invalid     Wrong format
24:00:00        undefined   Hour 24 should be written as 00
00:00:00        valid
8:42            invalid     Missing seconds
8:42:32         undefined   Should be two digits for the hour
8:8:8           undefined   Should be two digits everywhere
13:14:15Z       undefined   Time zone not supported
13:14:15+02     undefined   Time zone not supported

DateTime examples are not provided, but any combination of a valid date and a valid time is a valid datetime. Both date and time are required.

Binary objects

Blob values are encoded using Base64. Base64 is based on a 6-bit alphabet where four 6-bit characters are used to represent three 8-bit bytes, and where the = character is used for padding when the number of bytes is not a multiple of three.

The following table lists examples of valid and invalid blob values:

Blob                    Category   Decoded value/note
\#aHVja2xlYnVjaw==      valid      hucklebuck
\#a==                   invalid    Not a full encoding quantum
\#apa!                  invalid    ! is not in the base64 alphabet
\#                      valid      Empty string
\#dHdvb\r\nGluZXI=      valid      twoliner
ZXJyb3I=                invalid    Missing escape

The parser must be able to handle escaped row delimiters (crlf) in the blob values, but the behavior for segments longer than 76 characters is undefined and depends on the base64 decoder: The values may be decoded correctly or an error message may be given. A formatter must always make sure that there are no segments longer than 76 characters.

List types

The following table lists examples of valid and invalid values for a column of type StringList to be used as test cases:

StringList         Category   Result/note
\[a;b;c;\]         valid
\[ a \s;\]         valid      List with one string, " a ; "
[a;]               invalid    Unescaped brackets
\[\]               valid      Empty list
\[;\]              valid      List with one empty string
\[a;\[a;\];\]      invalid    Nested lists not supported
\[\?;\?e11;\]      valid      List with two invalid items
\?                 valid      Null list value
\[a;b\]            invalid    Missing semicolon after second item

Similar test cases should be created for all other supported list types.