HDBSCAN* is an easy-to-use, almost parameterless framework for unsupervised and semi-supervised descriptive data analysis based on hierarchical density estimates. Developed in collaboration with Prof. Jörg Sander, co-creator of DBSCAN, OPTICS, and LOF, this tool supports state-of-the-art hierarchical density-based clustering, noise modeling, unsupervised and semi-supervised cluster extraction (flat clustering from flexible, non-horizontal dendrogram cuts), outlier detection, and several forms of visualization.
CLICK HERE to download the complete Java Package with both source and executable files.
NOTE: please cite our ACM TKDD paper when referencing HDBSCAN*:
R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander, “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection”, ACM Trans. on Knowledge Discovery from Data, Vol. 10, No. 1 (July 2015), 1-51.
This repository, designed mainly for the evaluation of unsupervised outlier detection, contains: 1) a compiled collection of over twenty real-world basis datasets, as well as hundreds of variants of these datasets pre-processed for outlier detection according to different procedures (e.g., subsampling and normalization); 2) the complete set of raw results obtained by applying 12 well-established outlier detection methods to these datasets over a range of parameter values; and 3) aggregated results, statistics, and visualizations over all the methods and datasets.
CLICK HERE to access the repository.
NOTE: please cite our related paper when referencing this repository:
G.O. Campos, A. Zimek, J. Sander, R.J.G.B. Campello, B. Micenková, E. Schubert, I. Assent and M.E. Houle, “On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study”, Data Mining and Knowledge Discovery (DOI 10.1007/s10618-015-0444-8).
This repository contains two compiled sets of gene expression time-series data from microarray experiments on the organism Saccharomyces cerevisiae. Each set contains 17 datasets. These datasets were previously used in the biological evaluation of statistical proximity measures for clustering, using our methodology called Intrinsic Biological Separability Ability (IBSA). Here you will also find additional results that, due to space constraints, were not included in the original paper.
CLICK HERE to access the repository (supplementary material).
NOTE: please cite our related paper when referencing these datasets:
P.A. Jaskowiak, R.J.G.B. Campello and I.G. Costa Filho, “Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis”, IEEE/ACM Trans. Comput. Biol. Bioinformatics, Vol. 10, No. 4 (July 2013), 845-857.