cDiffusion

cDiffusion is a high-performance R package for non-linear dimensionality reduction and manifold learning via Diffusion Maps. Written using R C API, it provides a fast and memory-efficient way to compute diffusion maps for moderately large datasets. This package offers both a classic dense method for small-scale analytics and a highly optimized sparse implementation designed to scale with larger data.

Documentation

For API reference, performance benchmarks, and package overview, please visit the official package website:

cDiffusion Documentation Website

Package Features

Two Approaches: Provides a dense matrix solver for exact computation on smaller datasets, and a sparse solver for heavier workloads.
Optimized Memory Usage: The sparse implementation eliminates the need to allocate full $N \times N$ distance matrices. It computes an exact k-nearest neighbors graph on the fly using a max-heap structure and stores the transition graph in Compressed Sparse Row (CSR) format, reducing memory complexity to $\mathcal{O}(N \cdot kneighbors)$.
rSVD solver: Eigen-decompositions on matrices are done via custom Randomized Singular Value Decomposition (rSVD) solver. Most heavy computations are multithreaded with the use of OpenMP. Only then LAPACK finds eigenvectors for a small $m \times m$ matrix where $m$ is the requested number of dimensions.
Straightforwards Design: Zero external dependencies. The codebase relies strictly on the native R C API (including LAPACK/BLAS), OpenMP, and the base R stats package, ensuring straightforward compilation.

Installation

Install the development version directly from GitHub:

install.packages("remotes")
remotes::install_github("hzalewski/cDiffusion")

Usage

cDiffusion is designed to be easy in use. The API consists of 4 core functions:

run_diffusion(): The exact, dense solver. Perfect for capturing global topologies on small to medium datasets.
run_diffusion_sparse(): The sparse solver. Bypasses the need to allocate dense $N \times N$ euclidean distance matrix by taking only k - nearest neighbours for each point and converting those distances into Compressed Sparse Row (CSR) format.
estimate_sigma(): Computes the global Gaussian kernel width ($\sigma$) using the Median Heuristic by subsampling of the entry data.
cdiff_kmeans(): Spectral Clustering function. It applies K-means to the diffusion coordinates and features an Eigengap Heuristic for automatic detection of the optimal number of clusters.

1. Dense Method

library(cDiffusion)

data_matrix <- matrix(rnorm(6000), ncol = 3)

opt_sigma <- estimate_sigma(data_matrix, sample_size = 500)

model_dense <- run_diffusion(
  data_matrix, 
  dims = 3,
  sigma = opt_sigma, 
  n_iter = 50)

plot(model_dense$coordinates[, 1], model_dense$coordinates[, 2],
    pch = 16, col = "blue",
    main = "Diffusion map result",
    xlab = "Diffusion Coord 1", ylab = "Diffusion Coord 2")

2. Sparse Method

library(cDiffusion)

data_matrix <- matrix(rnorm(30000 * 100), nrow = 30000, ncol = 100)

# Here we don't need to be concerned for the sigma parameter 
# since sparse method uses local sigma

model_sparse <- run_diffusion_sparse(
  data_matrix, 
  dims = 3,
  k_neighbors = 35,   
  oversampling = 10,   
  n_iter = 1000
)

3. Spectral Clustering (diffusion map + kmeans)

# Provide the diffmap object directly
# centers = "auto" will automatically find the best 'k'
clusters <- cdiff_kmeans(model_sparse, centers = "auto")
#> Eigengap heuristic: selected k = 4 clusters.

# The result is a standard kmeans object
table(clusters$cluster)

Output

cDiffusion returns a standard R list with the S3 class diffmap. This allows for easy extraction of the results for tasks like visualization or passing it to cdiff_kmeans() function.

Here is the exact structure of the output depending on the chosen method:

Dense Method Output ($N = 2000$):

str(model_dense)
#> List of 5
#>  $ coordinates: num [1:2000, 1:2] -0.0691 -0.1416 0.0883 0.2579 -0.
#>  $ eigenvalues: num [1:2] 0.38 0.37
#>  $ sigma      : num 1.57
#>  $ data_dim   : int [1:2] 2000 3
#>  $ method     : chr "dense"
#>  - attr(*, "class")= chr "diffmap"

Sparse Method Output ($N = 30000$):

str(model_sparse)
#> List of 4
#>  $ coordinates: num [1:30000, 1:2] -0.00457 0.01001 -0.02034 0.02162 
#>  $ eigenvalues: num [1:2] 0.65 0.52
#>  $ data_dim   : int [1:2] 30000 100
#>  $ method     : chr "sparse"
#>  - attr(*, "class")= chr "diffmap"

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.github		.github
R		R
inst/tinytest		inst/tinytest
man		man
pkgdown/assets/images		pkgdown/assets/images
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
Makefile		Makefile
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
_pkgdown.yml		_pkgdown.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cDiffusion

Documentation

Package Features

Installation

Usage

1. Dense Method

2. Sparse Method

3. Spectral Clustering (diffusion map + kmeans)

Output

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

cDiffusion

Documentation

Package Features

Installation

Usage

1. Dense Method

2. Sparse Method

3. Spectral Clustering (diffusion map + kmeans)

Output

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages