Quantitative Taxonomy with Lyubishchev’s Methods

Background

Alexander Alexandrovich Lyubishchev (1890-1972) was a Russian biologist and entomologist who, in a 1943 manuscript titled Programma obshchey sistematiki (Program of General Systematics), set out a quantitative, multivariate approach to classification. He later presented his methods in English in Biometrics, where his name is transliterated as Lubischew (Lubischew, 1962).

Lyubishchev’s framework operates directly on continuous measurements, using means, variances and covariances to quantify how far apart groups are and whether they overlap. The 1943 manuscript predates the binary-character similarity coefficients of Sokal and Sneath (1963) that appear in other R packages, and it addresses a different kind of data: measurements rather than presence/absence characters. Because the original Russian manuscript was not widely cited in the Western numerical-taxonomy literature, this lineage is often overlooked.

This package provides four core functions: divergence_coefficient(), scatter_ellipse(), transgression() and classify(). We illustrate them on the familiar iris data set.

Divergence coefficient

The divergence coefficient D measures the standardised separation between two groups summed across features. Setosa is famously distinct from the other two species, so we expect a large value.

setosa <- iris[iris$Species == "setosa", 1:4]
versicolor <- iris[iris$Species == "versicolor", 1:4]

divergence_coefficient(setosa, versicolor)
#> [1] 58.42465

A large D confirms the two groups are easily separable on these features.
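For readers who want to see the arithmetic, the value above can be reproduced in base R by summing, over features, the squared difference in group means divided by the sum of the two group variances. This is only a sketch: the helper name divergence_sketch is hypothetical, and whether divergence_coefficient() computes D in exactly this way internally is an assumption based solely on the matching result.

```r
# Hypothetical helper for illustration; not part of the package.
# Assumed form: D = sum_j (mean1_j - mean2_j)^2 / (var1_j + var2_j)
divergence_sketch <- function(a, b) {
  mean_diff <- colMeans(a) - colMeans(b)        # per-feature difference in means
  var_sum   <- sapply(a, var) + sapply(b, var)  # per-feature sum of group variances
  sum(mean_diff^2 / var_sum)
}

setosa     <- iris[iris$Species == "setosa", 1:4]
versicolor <- iris[iris$Species == "versicolor", 1:4]

d_sketch <- divergence_sketch(setosa, versicolor)

# Per-feature contributions, to see which measurements drive the total
per_feature <- (colMeans(setosa) - colMeans(versicolor))^2 /
  (sapply(setosa, var) + sapply(versicolor, var))
```

Inspecting per_feature shows that the two petal measurements account for most of the total, while the sepal measurements contribute comparatively little.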

Scatter ellipses

scatter_ellipse() fits a covariance ellipse to every class, returning the centroid (mean vector), covariance matrix and sample size for each.

ellipses <- scatter_ellipse(iris[, 1:4], iris$Species)

ellipses[["setosa"]]$mean
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>        5.006        3.428        1.462        0.246
ellipses[["setosa"]]$cov
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length   0.12424898 0.099216327  0.016355102 0.010330612
#> Sepal.Width    0.09921633 0.143689796  0.011697959 0.009297959
#> Petal.Length   0.01635510 0.011697959  0.030159184 0.006069388
#> Petal.Width    0.01033061 0.009297959  0.006069388 0.011106122
ellipses[["setosa"]]$n_samples
#> [1] 50
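The per-class summaries shown above are ordinary sample means and sample covariances, so they can be checked against base R. The following is a minimal sketch under that assumption; ellipse_sketch is a hypothetical name, and the real scatter_ellipse() may return additional fields or differ in detail.

```r
# Hypothetical stand-in for illustration; not the package's implementation.
ellipse_sketch <- function(x, groups) {
  # Split the measurements by class, then summarise each class
  lapply(split(x, groups), function(d) {
    list(
      mean      = colMeans(d),  # centroid of the class
      cov       = cov(d),       # sample covariance matrix (n - 1 denominator)
      n_samples = nrow(d)       # number of specimens in the class
    )
  })
}

sketch <- ellipse_sketch(iris[, 1:4], iris$Species)
sketch[["setosa"]]$mean
sketch[["setosa"]]$n_samples
```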

Transgression

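transgression() checks whether two ellipses overlap by comparing the squared Mahalanobis distance between their centroids against a chi-squared threshold. In the output below the threshold is 9.487729, which matches the 0.95 quantile of the chi-squared distribution with 4 degrees of freedom (one per feature). Versicolor and virginica are the hard pair: individual specimens of the two species are known to overlap on these measurements, so this is the comparison where the margin matters most.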

transgression(ellipses, "versicolor", "virginica")
#> $mahalanobis_distance
#> [1] 14.21889
#> 
#> $threshold
#> [1] 9.487729
#> 
#> $transgression
#> [1] FALSE
#> 
#> $separation_ratio
#> [1] 1.498661

Contrast this with the easy pair, setosa versus virginica:

transgression(ellipses, "setosa", "virginica")
#> $mahalanobis_distance
#> [1] 195.1855
#> 
#> $threshold
#> [1] 9.487729
#> 
#> $transgression
#> [1] FALSE
#> 
#> $separation_ratio
#> [1] 20.57242

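separation_ratio is the distance divided by the threshold, so transgression is FALSE whenever the ratio exceeds 1. Both pairs pass this centroid-based test, but by very different margins: about 1.5 for versicolor versus virginica, against about 20.6 for setosa versus virginica. A ratio only slightly above 1 says the centroids are distinguishable; it does not rule out overlap between individual specimens.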
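The threshold and ratios in the two outputs can be reproduced with base R. This sketch takes the printed distances as given, because the text does not state which covariance matrix the package uses for the Mahalanobis distance. The 0.95 level is also an assumption, inferred from the fact that qchisq(0.95, df = 4) matches the printed threshold.

```r
# Sketch only: distances are copied from the printed output above.
p <- 4                              # number of features
threshold <- qchisq(0.95, df = p)   # assumed confidence level of 0.95

d2 <- c(
  versicolor_virginica = 14.21889,
  setosa_virginica     = 195.1855
)

separation_ratio <- d2 / threshold  # distance relative to the threshold
transgression    <- d2 < threshold  # TRUE would mean the ellipses overlap

round(separation_ratio, 4)
transgression
```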

Classification

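classify() assigns posterior probabilities to a new specimen using the multivariate Gaussian likelihood of each class. Here is a typical setosa specimen, with measurements given in the order Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (the values match the first row of iris).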

specimen <- c(5.1, 3.5, 1.4, 0.2)
result <- classify(specimen, ellipses)

sapply(result, function(r) r$posterior)
#>       setosa   versicolor    virginica 
#> 1.000000e+00 4.918517e-26 2.981541e-41

The posterior concentrates on setosa, as expected.
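To make the calculation concrete, here is a minimal base R sketch of Gaussian classification from per-class means and covariances. It assumes equal prior probabilities for the three species, which the text does not confirm, so it is not claimed to reproduce the exact posteriors printed above. The names classify_sketch and ellipses_sketch are hypothetical.

```r
# Hypothetical illustration; equal priors are an assumption.
classify_sketch <- function(specimen, ellipses) {
  p <- length(specimen)
  # Log of the multivariate Gaussian density under each class
  log_lik <- sapply(ellipses, function(e) {
    d2      <- mahalanobis(specimen, center = e$mean, cov = e$cov)
    log_det <- as.numeric(determinant(e$cov, logarithm = TRUE)$modulus)
    -0.5 * (p * log(2 * pi) + log_det + d2)
  })
  # Normalise in log space to avoid underflow for very unlikely classes
  w <- exp(log_lik - max(log_lik))
  w / sum(w)
}

# Per-class mean vectors and covariance matrices from iris
ellipses_sketch <- lapply(split(iris[, 1:4], iris$Species), function(d) {
  list(mean = colMeans(d), cov = cov(d))
})

posterior <- classify_sketch(c(5.1, 3.5, 1.4, 0.2), ellipses_sketch)
posterior
```

Subtracting the largest log-likelihood before exponentiating keeps the arithmetic stable when, as here, the competing classes have densities many orders of magnitude smaller than the winner.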

When to use this package

These methods assume continuous features that are roughly Gaussian within each class. Use them for measurement data such as morphometrics, spectra or sensor readings. They are not appropriate for purely categorical or binary character data, where Sokal-Sneath-style similarity coefficients are the right tool.

References

Lyubishchev, A.A. (1943). Programma obshchey sistematiki [Program of General Systematics]. Manuscript, 22 November 1943. Digitized by ZIN RAS Coleoptera Laboratory. https://www.zin.ru/animalia/coleoptera/rus/lyubis05.htm

Lubischew, A.A. (1962). On the use of discriminant functions in taxonomy. Biometrics, 18(4), 455-477.