This page gives an introduction to getting started with ImageDataExtractor. It assumes that ImageDataExtractor and its requirements are already installed.

>>> import imagedataextractor as ide

Extraction from Images and Documents

ImageDataExtractor can be used to extract information from images directly. Alternatively, microscopy images can be automatically identified in and extracted from HTML or XML documents, followed by particle extraction with ImageDataExtractor. The latter requires ChemDataExtractor to be installed.

You can view the example usage notebook here.


Provide as input a path to an image or a document, or a path to a directory of images and/or documents, together with an output directory where the results will be written. If the input image is a figure containing a panel of images, the panel is split and extraction is performed on each sub-image separately.

>>> data = ide.extract(input_path)

This will return a list of EMData objects, each of which contains the image, resulting segmentation, uncertainty, scalebar information and extracted quantitative data for each detected particle.

Extracted Data

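The resulting segmentation and its uncertainty can be accessed from an EMData object (below, data denotes a single EMData object, such as one element of the returned list):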

>>> seg = data.segmentation
>>> uncertainty = data.uncertainty
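
Since ide.extract is described above as returning a list of EMData objects, these attributes are read once per object. The following is a minimal sketch of that loop, assuming only the segmentation and uncertainty attributes named above. The helper collect_maps is hypothetical, and SimpleNamespace stand-ins with made-up values replace real EMData objects so that the snippet runs without the library installed.

```python
from types import SimpleNamespace

def collect_maps(results):
    """Return (segmentation, uncertainty) pairs from one result or a list of results."""
    # Accept a single object as well as a list or tuple of objects.
    if not isinstance(results, (list, tuple)):
        results = [results]
    return [(r.segmentation, r.uncertainty) for r in results]

# Stand-in objects (hypothetical values) exposing the two attributes used here.
fake_results = [
    SimpleNamespace(segmentation=[[0, 1], [1, 1]],
                    uncertainty=[[0.0, 0.01], [0.02, 0.0]]),
    SimpleNamespace(segmentation=[[2, 2], [0, 0]],
                    uncertainty=[[0.03, 0.0], [0.0, 0.0]]),
]

pairs = collect_maps(fake_results)
print(len(pairs))
```

With real output, the list returned by ide.extract would be passed in place of fake_results.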

You can obtain a pandas DataFrame containing all extracted data from an EMData object.

>>> df = data.to_pandas()

Extracted scalebar information can be accessed from the scalebar attribute of an EMData object.

>>> sb_text = data.scalebar.text
>>> conversion = data.scalebar.conversion
>>> units = data.scalebar.units
>>> sb_contours = data.scalebar.scalebar_contour
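
One common use of the scalebar information is converting pixel measurements into physical units. The sketch below assumes that conversion is the physical length represented by one pixel, expressed in the scalebar units. This interpretation is not stated on this page and should be checked against the library before use. Both function names and the numbers are hypothetical.

```python
def pixels_to_physical(length_px, conversion):
    """Convert a length in pixels to physical units.

    Assumes `conversion` is the physical length represented by one pixel.
    """
    if conversion is None or conversion <= 0:
        raise ValueError("a positive conversion factor is required")
    return length_px * conversion

def area_pixels_to_physical(area_px, conversion):
    """Convert an area in square pixels to squared physical units."""
    # Areas scale with the square of the per-pixel length.
    return area_px * conversion ** 2

# Hypothetical values: a 40 px diameter and a 100 px^2 area at 2.5 units per pixel.
diameter = pixels_to_physical(40, 2.5)
area = area_pixels_to_physical(100, 2.5)
print(diameter, area)
```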

And that's it!

ImageDataExtractor currently supports HTML documents from the Royal Society of Chemistry and XML files obtained using the Elsevier Developers Portal.

Segmentation Model Adjustment

The segmentation model can be adjusted using the seg_-prefixed keyword arguments of ide.extract:

>>> data = ide.extract(input_path, seg_bayesian=True, seg_tu=0.0125, seg_n_samples=30, seg_device='cpu')

a. Bayesian Particle Segmentation

For optimal performance, particle segmentation is performed using Bayesian inference by default. Segmentation can also be performed discriminatively, although this is not recommended because the Bayesian version offers significant gains in accuracy and precision. Setting the seg_bayesian argument to True runs the segmentation model in the recommended Bayesian mode. The default is True.

b. Uncertainty Filtering Threshold

False positives are filtered automatically using the uncertainties afforded by Bayesian inference. The threshold beyond which particles are filtered can be adjusted using the seg_tu parameter. The default is 0.0125.
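
The idea behind threshold-based filtering can be illustrated as follows. This is a conceptual sketch only: it assumes each candidate particle carries a single scalar uncertainty, which is not necessarily how the library computes or stores it. The function filter_by_uncertainty and the candidate values are hypothetical.

```python
def filter_by_uncertainty(particles, tu=0.0125):
    """Keep particles whose uncertainty does not exceed the threshold tu."""
    return [p for p in particles if p["uncertainty"] <= tu]

# Hypothetical candidates with made-up uncertainty values.
candidates = [
    {"id": 1, "uncertainty": 0.004},
    {"id": 2, "uncertainty": 0.020},   # above the threshold, so filtered out
    {"id": 3, "uncertainty": 0.0125},  # exactly at the threshold, so kept
]

kept = filter_by_uncertainty(candidates)
kept_ids = [p["id"] for p in kept]
print(kept_ids)
```

In this sketch, raising the threshold keeps more candidates at the risk of more false positives, while lowering it makes the filter stricter.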

c. Number of Monte Carlo Samples

Performing Bayesian inference by Monte Carlo sampling slows down the extraction process noticeably. The number of Monte Carlo samples used in inference can be set using the seg_n_samples argument. The default is 30.
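
To show why the sample count affects run time, the following is a generic sketch of Monte Carlo prediction: a stochastic model is evaluated n_samples times, the mean serves as the prediction, and the spread serves as an uncertainty estimate. It is not the library's implementation. The names mc_predict and toy_model are hypothetical, and the toy model simply adds Gaussian noise to its input.

```python
import random
import statistics

def mc_predict(stochastic_model, x, n_samples=30):
    """Run a stochastic model n_samples times; return the mean and variance of its outputs."""
    samples = [stochastic_model(x) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.pvariance(samples)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def toy_model(x):
    # Hypothetical stand-in for a network with stochastic forward passes:
    # returns a noisy foreground probability clipped to [0, 1].
    return min(1.0, max(0.0, x + rng.gauss(0.0, 0.05)))

mean, var = mc_predict(toy_model, 0.8, n_samples=30)
print(mean, var)
```

Each sample costs one model evaluation, so the cost in this sketch grows linearly with n_samples, while the estimates of the mean and variance become more stable.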

d. GPU-Accelerated Extraction

Extraction can be accelerated by using a Graphics Processing Unit (GPU). Setting the seg_device argument to 'cuda' allows particle segmentation to be performed on a GPU, if one is available. This can speed up extraction significantly, particularly when extraction is run in Bayesian mode. The default is 'cpu'.
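
A device string can be chosen at run time rather than hard-coded. The sketch below assumes that PyTorch is the backend behind the 'cuda' and 'cpu' device strings, which this page does not state explicitly. The helper pick_device is hypothetical and falls back to 'cpu' when PyTorch is missing or no GPU is visible.

```python
def pick_device():
    """Return 'cuda' if PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no way to query CUDA here, so use the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
# The result could then be passed on, for example:
# data = ide.extract(input_path, seg_device=device)
```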