MultiCellXML: An open XML data standard for multicell agent models

MultiCellXML XML Specification

MultiCellXML is a human-readable, XML-based data format, which includes the random seed state, global variables, information on (and filenames of) microenvironmental field variables, and a list of each cell object and its current state. This structure allows us to easily parse the data using standardised XML parsers (such as Expat, xmlParser, and TinyXML) for use in data visualisation and post-processing. The list of cells in the XML file is very similar to the object-oriented Cell data structure in the simulator, making the format well-suited to resuming simulations from saved states. Simulation parameters can be modified partway through a simulation with simple plaintext search/replace operations on the saved XML files.
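As a minimal sketch of the plaintext search/replace workflow mentioned above, the following Python snippet changes one parameter in a saved state with a regular expression. The helper name `set_parameter` and the two-field fragment are illustrative assumptions, not part of the specification; the tag names are taken from the cell listing in Fig. 2.

```python
import re

def set_parameter(xml_text, tag, new_value):
    """Replace the text content of every <tag ...>value</tag> element.

    Hypothetical helper: a plain-text substitution, not a full XML parser.
    Attributes on the opening tag (e.g. units) are preserved.
    """
    name = re.escape(tag)
    pattern = re.compile(r"(<%s(?:\s[^>]*)?>)[^<]*(</%s>)" % (name, name))
    return pattern.sub(lambda m: m.group(1) + str(new_value) + m.group(2), xml_text)

# Illustrative fragment of a saved state (two fields only).
saved_state = (
    '<cell_properties>'
    '<mean_cell_cycle_time units="minutes">1080</mean_cell_cycle_time>'
    '<mean_G1_time units="minutes">540</mean_G1_time>'
    '</cell_properties>'
)

modified = set_parameter(saved_state, "mean_cell_cycle_time", 1200)
print(modified)
```

The substitution is applied to every matching element in the text, so in a full file it would change the parameter for all cells at once; restricting it to selected cells would require an XML parser instead.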

MultiCellXML Version 1.00

File header and other early elements

We begin with XML header information (?xml) for XML 1.0 standards compliance, followed by a "root" data_set tag. In the data_source section, we include information on the originating simulation software (simulator), the user (user), and any publication information that may assist the recipient of a data file in (1) locating the original source of the data, and (2) proper academic citation (reference). See Fig. 1. Future MultiCellXML versions may include reference and citation information for the simulation software.

<?xml version="1.0" encoding="UTF-8" ?>
<data_set MultiCellXML_version="1.0">
  <data_source>
    <filename>data/output00000117.xml</filename>
    <created>29 July 2010</created>
    <simulator>
      <program_name>DCIS_2D</program_name>
      <program_version>1.38</program_version>
      <compiled>1 July 2010</compiled>
      <author>Paul Macklin</author>
      <contact>Paul.Macklin@MathCancer.org</contact>
      <URL>http://www.MathCancer.org/</URL>
    </simulator>
    <user>
      <name>Paul Macklin</name>
      <contact>Paul.Macklin@MathCancer.org</contact>
    </user>
    <reference>
      <citation>Macklin et al. J. Theor. Biol. (2011) (in review)</citation>
      <URL>http://multicellxml.sourceforge.net</URL>
      <note>User notes may go here.</note>
    </reference>
  </data_source>
  <globals>
    <time units="minutes">7020</time>
    <next_output_time units="minutes">7020</next_output_time>
    <frame_number>117</frame_number>
    <random_seed_state>769969952</random_seed_state>
    <Domain_width_in_microns>1000</Domain_width_in_microns>
    <Domain_height_in_microns>340</Domain_height_in_microns>
  </globals>
  ...
Fig. 1: Start of a MultiCellXML file: The first tag is for XML 1.0 standards compliance. The data_source section indicates the source of the data, including the originating program, information on the user, and requested reference for citation (if any). The globals section gives information on program globals, including (in particular) the current simulation time and the random seed state.
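To illustrate how a data recipient might recover the provenance information in Fig. 1, here is a minimal Python sketch using the standard library's Expat-based ElementTree parser. The function name `describe_source` and the abridged input are assumptions for illustration; only the tag and attribute names come from the listing.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the file start shown in Fig. 1 (given as bytes, so the
# XML declaration with its encoding is handled by the parser).
FILE_START = b"""<?xml version="1.0" encoding="UTF-8" ?>
<data_set MultiCellXML_version="1.0">
  <data_source>
    <filename>data/output00000117.xml</filename>
    <created>29 July 2010</created>
    <simulator>
      <program_name>DCIS_2D</program_name>
      <program_version>1.38</program_version>
    </simulator>
    <reference>
      <citation>Macklin et al. J. Theor. Biol. (2011) (in review)</citation>
    </reference>
  </data_source>
</data_set>"""

def describe_source(xml_bytes):
    """Collect the provenance fields a recipient needs for citation."""
    root = ET.fromstring(xml_bytes)
    source = root.find("data_source")
    return {
        "format_version": root.get("MultiCellXML_version"),
        "filename": source.findtext("filename"),
        "program": source.findtext("simulator/program_name"),
        "program_version": source.findtext("simulator/program_version"),
        "citation": source.findtext("reference/citation"),
    }

info = describe_source(FILE_START)
print(info["program"], info["program_version"])
print(info["citation"])
```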

Global data

Following the data_source section, the globals section includes information such as the current simulation time and the random seed state; the latter is important for resuming saved simulation states without affecting the pseudorandom number generator. See Fig. 1. Where possible, we include information on physical units as XML tag attributes. Because this format was initially developed for internal use, we have not been entirely consistent in our conventions; improvements are planned in future drafts of the file specification. For dimensionless quantities, the scale should ideally be stated (e.g., as an additional XML attribute):

<local_oxygen units="dimensionless" scale="far-field">0.84</local_oxygen>

In future drafts, we may include a new scales section to facilitate this.
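The units and scale conventions above can be read back as follows. This is a minimal Python sketch, assuming each global is a single number; the helper names `to_number`, `read_quantity`, and `read_globals` are hypothetical, and the dimensionless example is written as `local_oxygen` because XML element names cannot contain spaces.

```python
import xml.etree.ElementTree as ET

GLOBALS = """<globals>
  <time units="minutes">7020</time>
  <next_output_time units="minutes">7020</next_output_time>
  <frame_number>117</frame_number>
  <random_seed_state>769969952</random_seed_state>
  <Domain_width_in_microns>1000</Domain_width_in_microns>
  <Domain_height_in_microns>340</Domain_height_in_microns>
</globals>"""

def to_number(text):
    """Prefer an exact integer (e.g. for the seed state), else a float."""
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)

def read_quantity(elem):
    """Return (value, units, scale); missing attributes come back as None."""
    return (to_number(elem.text), elem.get("units"), elem.get("scale"))

def read_globals(xml_text):
    """Map each tag in the globals section to its (value, units, scale)."""
    return {elem.tag: read_quantity(elem) for elem in ET.fromstring(xml_text)}

simulation_globals = read_globals(GLOBALS)
oxygen = read_quantity(ET.fromstring(
    '<local_oxygen units="dimensionless" scale="far-field">0.84</local_oxygen>'
))
print(simulation_globals["time"])
print(oxygen)
```

Fields without a units attribute, such as Domain_width_in_microns, carry their units only in the tag name, which is one of the inconsistencies noted above; a reader has to handle those cases by convention.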

Cell elements

The file format continues with a list structure of all the cells (cell_list), with essentially all internal cell variables (i.e., member data of the Cell class) listed clearly. We give each cell both a numeric type (cell_type_code) to assist comparing and classifying cells in software, and a human-readable type (cell_type_text) to assist data recipients with interpreting the data. See Fig. 2. Note that we have included "type" attributes to indicate boolean variables, rather than units. In future file version drafts, we may add both "type" and "units" attributes to all cell data fields. However, we can generally assume that the presence of units indicates a non-boolean variable, and the presence of a boolean type obviates "units."

  ...
  <cell_list>
    <cell>
      <cell_properties>
        <cell_type_code>0</cell_type_code>
        <cell_type_text>DCIS cell</cell_type_text>
        <radius units="microns">9.95299956207</radius>
        <nuclear_radius units="microns">5.295</nuclear_radius>
        <volume units="cubic microns">4130.00487398</volume>
        <mature_volume units="cubic microns">4130.00487398</mature_volume>
        <solid_volume units="cubic microns">413.000487398</solid_volume>
        <cell_adhesion_1_level units="dimensionless">1</cell_adhesion_1_level>
        <cell_adhesion_2_level units="dimensionless">0</cell_adhesion_2_level>
        <matrix_adhesion_level units="dimensionless">1</matrix_adhesion_level>
        <calcite_level units="dimensionless">0</calcite_level>
        <mean_cell_cycle_time units="minutes">1080</mean_cell_cycle_time>
        <mean_G1_time units="minutes">540</mean_G1_time>
        <mean_time_to_apoptosis units="minutes">47196.6</mean_time_to_apoptosis>
        <mean_time_to_mitosis units="minutes">115.27</mean_time_to_mitosis>
        <BM_adhesion_exponent units="dimensionless">1</BM_adhesion_exponent>
        <BM_repulsion_exponent units="dimensionless">1</BM_repulsion_exponent>
        <BM_adhesion_max_distance units="x radius">1.214</BM_adhesion_max_distance>
      </cell_properties>
      <cell_state>
        <is_cycling type="boolean">true</is_cycling>
        <is_quiescent type="boolean">false</is_quiescent>
        <is_apoptosing type="boolean">false</is_apoptosing>
        <is_anoxic type="boolean">false</is_anoxic>
        <is_necrosing type="boolean">false</is_necrosing>
        <is_debris type="boolean">false</is_debris>
        <apoptosis_time units="minutes">360.85</apoptosis_time>
        <necrosis_time units="minutes">0</necrosis_time>
        <cell_cycle_time units="minutes">0</cell_cycle_time>
        <Position units="microns">(86.5666,53.5001,0)</Position>
        <Velocity units="microns/minute">(-0.1084,0.2131,0)</Velocity>
      </cell_state>
    </cell>
    <cell>
      ...
    </cell>
    ...
  </cell_list>
  ...
Fig. 2: Main content of a MultiCellXML file: Within the cell_list section, we save each individual cell agent's data within a pair of <cell></cell> tags, including the cell_properties and cell_state sections. In future revisions, these sections may be merged because cell properties change in time. Note: The field list has been abridged relative to the actual published datasets to simplify the presentation.

For historical reasons stemming from code development, each cell is split into cell_properties and cell_state sections; future versions of the data standard will likely merge these into a single cell_state section, because many cell properties change over time as the cells are exposed to differing microenvironments.
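The attribute conventions for cell data can be applied mechanically when loading a file. The following Python sketch is one possible reading, not the authors' loader: it treats type="boolean" fields as booleans, parenthesised values such as Position as tuples of floats, and merges cell_properties and cell_state into one record per cell. The helper names `convert` and `read_cells` are hypothetical, and the input is an abridged version of Fig. 2.

```python
import xml.etree.ElementTree as ET

CELL_LIST = """<cell_list>
  <cell>
    <cell_properties>
      <cell_type_code>0</cell_type_code>
      <cell_type_text>DCIS cell</cell_type_text>
      <radius units="microns">9.95299956207</radius>
    </cell_properties>
    <cell_state>
      <is_cycling type="boolean">true</is_cycling>
      <is_quiescent type="boolean">false</is_quiescent>
      <Position units="microns">(86.5666,53.5001,0)</Position>
    </cell_state>
  </cell>
</cell_list>"""

def convert(elem):
    """Convert one field to bool, tuple of floats, int, float, or str."""
    text = (elem.text or "").strip()
    if elem.get("type") == "boolean":
        return text == "true"
    if text.startswith("(") and text.endswith(")"):
        return tuple(float(part) for part in text[1:-1].split(","))
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text  # e.g. the human-readable cell_type_text

def read_cells(xml_text):
    """Return one dict per <cell>, merging properties and state fields."""
    cells = []
    for cell in ET.fromstring(xml_text).iter("cell"):
        record = {}
        for section in ("cell_properties", "cell_state"):
            node = cell.find(section)
            if node is None:
                continue
            for field in node:
                record[field.tag] = convert(field)
        cells.append(record)
    return cells

cells = read_cells(CELL_LIST)
print(cells[0]["cell_type_text"], cells[0]["Position"])
```

Merging the two sections into a single record discards the distinction between properties and state, which anticipates the planned merge but would need revisiting if a field name ever appeared in both sections.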

Field variables and other final elements

After all cells have been listed, we include a global_variables section with a list of all saved field variables and their file format information. See Fig. 3. Note that we have included the full path of each data file; often all the files (including the XML file) are saved in the same directory, so postprocessing may need to strip part of the path by comparison to the filename field in the data_source section. Due to the large size of 2-D and 3-D double-precision data arrays, we opted for a binary data format. For increased compatibility, we chose the MATLAB .MAT (Level 4) file format, which is relatively simple to implement directly from the published file format specification, and is simple to read and write with common open source software (e.g., Octave) as well as MATLAB. In the source code to follow, we include C++ code to read and write these MATLAB data files.
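To indicate why the Level 4 format is simple to implement, here is a minimal Python sketch of a writer and reader for a single real, double-precision matrix. It is not the authors' C++ code. It assumes the Level 4 layout as we understand it from the published specification: a 20-byte header of five 4-byte integers (type, rows, columns, imaginary flag, name length including the terminating NUL), then the name, then the data stored column by column. Only the little-endian, full, real, double case (type 0) is handled, and the function names are hypothetical.

```python
import io
import struct

def write_mat4(stream, name, rows):
    """Write a real 2-D array (a non-empty list of equal-length rows)."""
    mrows, ncols = len(rows), len(rows[0])
    name_bytes = name.encode("ascii") + b"\x00"
    # type 0: little-endian, double precision, full numeric matrix
    stream.write(struct.pack("<5i", 0, mrows, ncols, 0, len(name_bytes)))
    stream.write(name_bytes)
    # Values are stored column by column.
    column_major = [rows[i][j] for j in range(ncols) for i in range(mrows)]
    stream.write(struct.pack("<%dd" % len(column_major), *column_major))

def read_mat4(stream):
    """Read one matrix written in the layout above; return (name, rows)."""
    mopt, mrows, ncols, imagf, namlen = struct.unpack("<5i", stream.read(20))
    if mopt != 0 or imagf != 0:
        raise ValueError("this sketch handles only real little-endian doubles")
    name = stream.read(namlen).rstrip(b"\x00").decode("ascii")
    count = mrows * ncols
    flat = struct.unpack("<%dd" % count, stream.read(8 * count))
    # Undo the column-major ordering.
    rows = [[flat[j * mrows + i] for j in range(ncols)] for i in range(mrows)]
    return name, rows

# Round trip through an in-memory buffer with a small made-up field.
buffer = io.BytesIO()
write_mat4(buffer, "oxygen", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
buffer.seek(0)
name, field = read_mat4(buffer)
print(name, field)
```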

  ...
  <global_variables>
    <variable>
      <name>oxygen</name>
      <format version="Level 4">MATLAB</format>
      <filename>data/oxygen_00000117.mat</filename>
    </variable>
    <variable>
      <name>Duct_Wall_Level_Set</name>
      <format version="Level 4">MATLAB</format>
      <filename>data/level_set.mat</filename>
    </variable>
  </global_variables>
</data_set>

Fig. 3: End of a MultiCellXML file: After the cell_list section, the global_variables section gives a list of all associated external field data (here saved in MATLAB format).
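The path handling for field variables can be sketched as follows. This Python example assumes one interpretation of the path-stripping step: the directory recorded in the data_source filename is removed from each field filename, and the remainder is resolved against the directory where the XML file now resides. The function name `list_field_files` and the directory `/archive/run1` are hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET

DOC = """<data_set MultiCellXML_version="1.0">
  <data_source>
    <filename>data/output00000117.xml</filename>
  </data_source>
  <global_variables>
    <variable>
      <name>oxygen</name>
      <format version="Level 4">MATLAB</format>
      <filename>data/oxygen_00000117.mat</filename>
    </variable>
    <variable>
      <name>Duct_Wall_Level_Set</name>
      <format version="Level 4">MATLAB</format>
      <filename>data/level_set.mat</filename>
    </variable>
  </global_variables>
</data_set>"""

def list_field_files(root, xml_directory):
    """Map each field name to (format, format version, resolved path)."""
    recorded_dir = posixpath.dirname(root.findtext("data_source/filename", default=""))
    fields = {}
    for variable in root.iter("variable"):
        recorded = variable.findtext("filename")
        # Strip the directory that the XML file itself was saved under.
        if recorded_dir and recorded.startswith(recorded_dir + "/"):
            relative = recorded[len(recorded_dir) + 1:]
        else:
            relative = recorded
        fmt = variable.find("format")
        fields[variable.findtext("name")] = (
            fmt.text,
            fmt.get("version"),
            posixpath.join(xml_directory, relative),
        )
    return fields

# "/archive/run1" stands in for wherever the XML file was found on disk.
fields = list_field_files(ET.fromstring(DOC), "/archive/run1")
for field_name, entry in fields.items():
    print(field_name, entry)
```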

Lastly, note that a primary goal of our specification is to make the format as human-readable as possible, rendering it (partially) "self-documenting". This will make it simpler to interpret archived data long after the originating software is out of use, thus eliminating the need for reverse engineering; hence our choice of human-readable, non-binary data. While this results in much larger files, we regard data compression as a separate software problem from the specification of content. Compression can readily be applied to the data files after creation with widespread open source software, such as gzip.
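As a minimal sketch of compression applied after file creation, the following Python example compresses a synthetic MultiCellXML-style document with the standard gzip module and parses it directly from the compressed stream. The document content is made up for illustration (1000 identical cells), so the size reduction it prints says nothing about real datasets.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Synthetic, highly repetitive document standing in for a saved state.
one_cell = (
    b'<cell><cell_state>'
    b'<is_cycling type="boolean">true</is_cycling>'
    b'</cell_state></cell>'
)
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b'<data_set MultiCellXML_version="1.0"><cell_list>'
    + one_cell * 1000
    + b'</cell_list></data_set>'
)

compressed = gzip.compress(xml_bytes)

# The parser can read straight from the decompressing file object,
# so no uncompressed copy needs to be written to disk.
with gzip.open(io.BytesIO(compressed), "rb") as handle:
    root = ET.parse(handle).getroot()

print(len(xml_bytes), "bytes uncompressed;", len(compressed), "bytes compressed")
print(len(root.findall("cell_list/cell")), "cells recovered")
```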