Center for Bio-Image Informatics

Engineering, Biology and Computer Science, working together.

Benchmark For Evaluating Biological Image Analysis Tools


Elisa Drelie Gelasca1,2, Jiyun Byun1,2, Boguslaw Obara1,2, B. S. Manjunath1,2

1Department of Electrical and Computer Engineering, 2Center for Bio-image Informatics, University of California, Santa Barbara, CA.

At UCSB we are creating a benchmarking and validation dataset for biological image segmentation (download dataset). While the primary target is biological images, we believe the dataset will also help researchers working on image segmentation and tracking in general. The motivation for this resource is the observation that, although the research literature offers a large number of effective segmentation methods, it is difficult for application scientists to make an informed choice about which methods would work for their particular problem. No single tool is effective across a diverse set of application contexts, and each method has its own strengths and limitations. We describe below three classes of data, ranging in scale from subcellular to cellular to tissue-level images, each of which poses its own challenges to image analysis. Of particular value to image processing researchers is that the data come with associated ground truth information that can be used to evaluate the effectiveness of different methods. The analysis and evaluation are also integrated into a database framework that is available online at


Benchmarks are useful for validating segmentation methods and must satisfy the following requirements:

  1. specify a clear problem;
  2. provide a representative dataset and ground truth;
  3. propose an evaluation methodology.

In the past there have been some successful benchmarks, such as handwritten digit recognition (MNIST), face recognition (FERET, Yale Face, PIE CMU), object recognition (Caltech), and object segmentation (Berkeley dataset). Recently, there have been some efforts to create microbiological image benchmarks, such as the cell center database ( and mouse retina database ( However, these benchmarks do not yet include data at different scales, ground truth and/or image analysis tools. This was highlighted at a recent panel on benchmarking and validation of computer vision methods in biology (

Figure 1. Example dataset provided in the benchmark.

In this work, we propose a benchmark for biological images to provide:
  1. collections with well-defined ground truth;
  2. image analysis tools;
  3. evaluation methods to compare and validate analysis tools.
We include a representative dataset of microbiological structures whose scales range from a subcellular level (nm) to a tissue level (µm), inheriting intrinsic challenges in the domain of biomedical image analysis (see Figure 1).
The dataset is acquired through two of the main microscopic imaging techniques: transmitted light microscopy and confocal laser scanning microscopy. The benchmark includes analysis tools designed to obtain different quantitative measures from the dataset, such as microtubule tracing, cell segmentation, and retinal layer segmentation. The analysis tools' description can be found here and the papers can be downloaded at Additionally, in the proposed benchmark, ground truth is manually created from part of each dataset. Evaluation methods are provided to assess the performance of the analysis tools against the ground truth. The benchmark includes standard evaluation measures as well as ad hoc methods designed for specific applications. Common measures such as precision and recall, F-measure, and receiver operating characteristics (ROC) are adopted in our evaluation framework. In the following, we briefly explain the evaluation measures used at each scale level.

Subcellular level

Tracing curvilinear structures is one of the fundamental problems in extracting information about structures such as blood vessels, microtubules, and similar entities. To understand microtubule dynamics, biologists study how microtubules grow and shorten by analyzing stacks of images acquired through a transmitted light microscope. Microtubule stacks and the corresponding ground truth are part of the benchmark and available at (dataset_label= microtubule_evaluation). An automatic method for extracting curvilinear structures from live cell fluorescence images is also integrated in the benchmark. To assess the performance of automated tracing algorithms, four evaluation measures are proposed to compare the tracing result to ground truth: 1) tip distance, 2) trace distance, 3) length difference, and 4) a combination of 1), 2) and 3). Tip distance error is the Euclidean distance between the tip position defined by ground truth and that detected by the algorithm. Trace distance error is the average distance between the points on the ground truth and the points on the traced microtubule. Length difference is simply the difference between the length of the ground truth and that of the traced microtubule. When these errors satisfy the following conditions, which are set by biologists, the tracing algorithm is considered acceptable: 1) tip distance is smaller than 0.792 µm, 2) length difference is smaller than 0.792 µm, and 3) trace distance (mean) is smaller than 0.396 µm. For more details see link.
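The three tracing errors and the acceptance test can be sketched in a few lines of Python. This is only an illustration and not the benchmark's own code: the function names are hypothetical, traces are assumed to be ordered lists of (x, y) points already expressed in µm, the tip is assumed to be the last point of each trace, and the trace distance is assumed to be the mean distance from each ground truth point to its nearest traced point.

```python
import math

# Acceptance thresholds quoted in the text, in micrometres.
MAX_TIP_DISTANCE_UM = 0.792
MAX_LENGTH_DIFF_UM = 0.792
MAX_TRACE_DISTANCE_UM = 0.396


def polyline_length(points):
    """Sum of Euclidean segment lengths along an ordered list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def tip_distance(truth, traced):
    """Euclidean distance between the tips (assumed: last point of each trace)."""
    return math.dist(truth[-1], traced[-1])


def trace_distance(truth, traced):
    """Mean distance from each ground truth point to its nearest traced point
    (an assumed reading of 'average distance between the points')."""
    return sum(min(math.dist(p, q) for q in traced) for p in truth) / len(truth)


def length_difference(truth, traced):
    """Absolute difference between the two polyline lengths."""
    return abs(polyline_length(truth) - polyline_length(traced))


def is_acceptable(truth, traced):
    """Combined measure: all three errors must fall below their thresholds."""
    return (
        tip_distance(truth, traced) < MAX_TIP_DISTANCE_UM
        and length_difference(truth, traced) < MAX_LENGTH_DIFF_UM
        and trace_distance(truth, traced) < MAX_TRACE_DISTANCE_UM
    )


# Hypothetical example traces (coordinates in µm).
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
good_trace = [(0.0, 0.1), (1.0, 0.1), (2.2, 0.1)]
bad_trace = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
```

With these made-up traces, `is_acceptable(truth, good_trace)` holds, while `bad_trace` fails because its tip lies 1 µm from the ground truth tip.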

Cellular level

Cell/nuclei segmentation is the first step of any further analysis of images at the cellular level, since the resulting counts of cells or nuclei provide quantitative information crucial to assessing cell viability. A nucleus detection method that counts cells, nuclei, or other objects in sectioned materials is supported in the benchmark framework and used to detect nuclei in the outer nuclear layer (ONL) within retinal images. Retinal images acquired through confocal microscopy and manually counted cells within the ONL are part of the benchmark. A simple evaluation method is integrated to evaluate the nucleus detection method. Datasets, method and evaluation tools are available at (dataset_label=cellCounting_ONL). The error in cell counting is computed as the percentage error between the manual counts obtained from three different experts and the count returned by the nucleus detector.
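As a rough illustration of the counting error, the sketch below compares a detector count against the mean of the manual counts. Using the mean of the three experts as the reference is an assumption made here for the example (the text does not say how the manual counts are combined), and the function name is hypothetical.

```python
def percentage_count_error(manual_counts, detected_count):
    """Percentage error of a detected count relative to the mean manual count.

    Assumption: the reference value is the arithmetic mean of the expert counts.
    """
    if not manual_counts:
        raise ValueError("at least one manual count is required")
    reference = sum(manual_counts) / len(manual_counts)
    if reference == 0:
        raise ValueError("mean manual count must be non-zero")
    return abs(detected_count - reference) / reference * 100.0


# Hypothetical counts from three experts and from the detector.
error = percentage_count_error([98, 100, 102], 95)
```

Here the mean manual count is 100, so a detector count of 95 gives an error of 5 percent.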

In addition, there are about 50 histopathology images used in breast cancer cell detection with associated ground truth data available. There are, however, no benchmark methods currently available for performance evaluation on these histopathology images. Datasets and ground truth are available at (dataset_label=BreastCancerCell).

Moreover, about 200 confocal microscopy images of COS1 cells with 5 associated ground truth data will soon be made publicly available. Datasets and ground truth will be downloadable at (dataset_label=COS1cell).

Tissue level

The retina consists of multiple layers of nerve cells and synapses. Since each layer has a different structure, consisting of groups of cell bodies or synaptic terminals, the intact architecture of the layers is crucial to retinal function. Layer segmentation simplifies the image analysis task of understanding the function of the retina before and after injury. Two retinal segmentation methods are included in the benchmark: GPAC and Layer Boundary Segmentation. Confocal retinal images and ground truth for the layers are supported in the benchmark and are available at (dataset_label=GPAC and dataset_label=boundary_layer_segmentation). Several evaluation measures are integrated to test the performance of the two automatic segmentation methods: 1) distancelayer: the averaged distance between the ground truth and segmented boundaries of a layer, obtained by Fast Marching; 2) precision: the ratio between the number of true positive pixels and the number of detected pixels; 3) recall (sensitivity): the ratio between the number of true positive pixels and the number of ground truth pixels; 4) 1-specificity: the fraction of false positive pixels; 5) F-measure: the harmonic mean of precision and recall for each layer; 6) weighted F-measure: a weighted sum of the F-measures for each layer. The weight of each layer is determined by its area in proportion to the total area of all layers.
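The pixel-based measures can be illustrated with a short Python sketch. It is not the benchmark's implementation: layers are represented as sets of (row, column) pixel coordinates, the function names are hypothetical, and the layer area used for weighting is assumed to be the ground truth area.

```python
def precision_recall_f(truth_pixels, detected_pixels):
    """Pixel-wise precision, recall and F-measure for one layer.

    Both arguments are sets of (row, col) pixel coordinates.
    """
    true_positives = len(truth_pixels & detected_pixels)
    precision = true_positives / len(detected_pixels) if detected_pixels else 0.0
    recall = true_positives / len(truth_pixels) if truth_pixels else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def weighted_f_measure(layers):
    """Area-weighted sum of per-layer F-measures.

    `layers` maps a layer name to a (truth_pixels, detected_pixels) pair.
    Assumption: each weight is the layer's ground truth area divided by the
    total ground truth area of all layers.
    """
    total_area = sum(len(truth) for truth, _ in layers.values())
    return sum(
        len(truth) / total_area * precision_recall_f(truth, detected)[2]
        for truth, detected in layers.values()
    )


# Hypothetical two-layer example.
layer_a_truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
layer_a_detected = {(0, 1), (1, 1), (2, 1)}
layer_b_truth = {(5, 0), (5, 1)}
layer_b_detected = {(5, 0), (5, 1)}

p, r, f = precision_recall_f(layer_a_truth, layer_a_detected)
wf = weighted_f_measure({
    "A": (layer_a_truth, layer_a_detected),
    "B": (layer_b_truth, layer_b_detected),
})
```

In this toy case layer A has precision 2/3 and recall 1/2, so its F-measure is 4/7; layer B is segmented perfectly, and the area weights 4/6 and 2/6 give a weighted F-measure of 5/7.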


The proposed benchmark provides unique, publicly available datasets as well as image analysis tools and evaluation methods for bioimages. The benchmark will help researchers validate, test and improve their algorithms, and give biologists guidance on the limitations and capabilities of those algorithms. The benchmark is integrated into the Bisque bioimage database infrastructure at UCSB, and all the tools described above can be applied to the proposed dataset. Users are encouraged to upload their bioimages and ground truth, test the analysis tools and perform the evaluation. Moreover, new (user-contributed) segmentation algorithms and evaluation methods can be integrated upon request. Dataset and ground truth can be downloaded from our website