Supplementary Material

Proximity Measures for Clustering Gene Expression Microarray Data:
a Validation Methodology and a Comparative Analysis

Pablo A. Jaskowiak, Ricardo J. G. B. Campello, and Ivan G. Costa

 

Abstract

 

Clustering is the first step usually adopted to unveil information from gene expression data. Even though guidelines have been established concerning the choice of clustering algorithms in this application domain, little attention has been given to the proximity measures they apply. Employing Pearson turns out to be common practice, although no comprehensive study analyzing alternatives to it has been conducted to date. In this paper we compare 16 proximity measures for the clustering of gene expression data. Measures are evaluated considering 52 datasets from time-course and cancer experiments, w.r.t.: (i) their intrinsic separation ability and (ii) their robustness to noise. Results support performance differences among proximity measures. Moreover, measures rarely considered for gene expression analysis may exhibit competitive results when compared to commonly employed ones. We preprocess and compile 17 time-course datasets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA), to evaluate proximity measures regarding the clustering of time-course data. Both can be employed in future research to evaluate the effectiveness of new proximity measures for time-course data clustering.

 




Datasets

 

Original Data – Results shown in our paper

 

The collection of 17 datasets that we propose as a benchmark in our paper can be found here. Note that these datasets are publicly available from their original sources. In our version, however, we selected about 1000 genes from each original dataset (see below). In the zip file you will find 17 CSV files, one for each dataset (datasets are named according to Table 2 of our paper). The files are in the following format:

First row: file header containing, in this order:

  1. Gene ID (number) in the original dataset.
  2. Gene identifier (name).
  3. Time instants (in minutes) at which expression values were measured for each gene.

 

 

The remainder of the file contains gene expression values for each gene.
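The layout described above can be read with a few lines of Python. The following is only a minimal sketch, not code distributed with the datasets: the function name read_expression_csv, the parameter n_label_columns, the comma delimiter, the numeric time-point headers, and the treatment of empty cells as missing values are all assumptions, and the sample rows are made up for illustration.

```python
import csv
import io


def read_expression_csv(handle, n_label_columns=2):
    """Parse one dataset file into (time_points, genes).

    n_label_columns=2 corresponds to the layout described above
    (numeric gene ID, then gene name). A comma delimiter is assumed.
    """
    reader = csv.reader(handle)
    header = next(reader)
    # Header cells after the label columns hold the time instants (minutes).
    time_points = [float(cell) for cell in header[n_label_columns:]]

    genes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        labels = row[:n_label_columns]
        # Assumption: an empty cell denotes a missing measurement.
        values = [
            float(cell) if cell.strip() else float("nan")
            for cell in row[n_label_columns:]
        ]
        genes.append((labels, values))
    return time_points, genes


# Hypothetical sample content; real files come from the zip archive.
sample = (
    "id,name,0,10,20\n"
    "17,YAL001C,0.12,-0.40,0.85\n"
    "42,YBR002W,1.10,,0.05\n"
)
time_points, genes = read_expression_csv(io.StringIO(sample))
```

For a file on disk, the same function can be called with an open file handle, for example open(path, newline="").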

 

Additionally, in the same zip file, we provide a file called fold_change.xls. In this file we detail the values of l and c that were employed during the filtering stage (fold change). This procedure was employed in order to select about 1000 genes for each dataset. References to the original sources of each dataset are provided in Table 2 of our paper.
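As a rough illustration of how two parameters such as l and c can drive a fold-change filter, the sketch below keeps a gene when at least c of its measurements differ from the gene's mean by at least l-fold. This is an assumed, commonly used form of the filter and not necessarily the exact procedure of our paper, which should be consulted for the definition. The sketch further assumes log2-scale expression values (so an l-fold change is a difference of log2(l)); the function name passes_fold_change and the example profiles are hypothetical.

```python
import math


def passes_fold_change(values, l, c):
    """Return True if at least c values differ from the mean by >= log2(l).

    Assumes log2-scale expression values; NaN entries are ignored.
    """
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return False
    mean = sum(observed) / len(observed)
    threshold = math.log2(l)
    hits = sum(1 for v in observed if abs(v - mean) >= threshold)
    return hits >= c


# Made-up profiles for illustration only.
profiles = {
    "gene_a": [0.0, 2.0, -2.0, 0.0],  # two points at least 2-fold from the mean
    "gene_b": [0.1, 0.2, 0.1, 0.0],   # nearly flat profile
}
kept = [name for name, v in profiles.items() if passes_fold_change(v, l=2.0, c=2)]
```

Under this assumed rule, raising l or c makes the filter stricter, which is one way a per-dataset choice of the two values could bring each dataset to roughly the same number of genes.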

 

Normalized Data – Results shown in our supplementary pdf file

 

Following the suggestion from one of the reviewers, we also performed experiments on normalized versions of the aforementioned datasets. Since we were unable to locate the original array files for one of the datasets, experiments were performed on 16 datasets. Results for these experiments are available in our pdf supplementary file (see below). These datasets (also composed of about 1000 selected genes) were normalized with Multiple-Slide Normalization. They can be found here. In the zip file you will find 16 CSV files, one for each dataset. For further information on the normalization of these datasets, please refer to our pdf supplement. The dataset names follow the same convention adopted above. Files are in the following format:

 

First row: file header containing, in this order:

  1. Gene identifier (name).
  2. Time instants (in minutes) at which expression values were measured for each gene.

 

The remainder of the file contains gene expression values for each gene.

 

Additionally, in the same zip file, we provide a file called fold_change.xls. In this file we detail the values of l and c that were employed during the filtering stage (fold change). This procedure was employed in order to select about 1000 genes for each dataset. References to the original sources of each dataset are provided in Table 2 of our paper.

 

 

Supplementary Results

 

In this part of the supplement we provide additional results that, due to space restrictions, we could not present in our paper. We also provide further information concerning the normalization procedure employed to obtain the normalized versions of the datasets. The supplementary pdf file can be found here.