Package 'ClustImpute'

Title: K-Means Clustering with Built-in Missing Data Imputation
Description: This k-means algorithm clusters data with missing values and, as a by-product, completes the data set. The implementation can deal with missing values in multiple variables and is computationally efficient since it iteratively uses the current cluster assignment to define a plausible distribution for missing value imputation. Weights are used to shrink early random draws for missing values (i.e., draws based on the cluster assignments after only a few iterations) towards the global mean of each feature. This shrinkage slowly fades out after a fixed number of iterations to reflect the increasing credibility of cluster assignments. See the vignette for details.
Authors: Oliver Pfaffel
Maintainer: Oliver Pfaffel <[email protected]>
License: GPL-3
Version: 0.2.4
Built: 2025-03-11 04:01:32 UTC
Source: https://github.com/o1iv3r/clustimpute


Check and replace duplicate (centroid) rows

Description

Internal function of ClustImpute: checks new centroids for duplicate rows and, if any are found, replaces them with random draws.

Usage

check_replace_dups(centroids, X, seed)

Arguments

centroids

Matrix of centroids

X

Underlying data matrix (without missing values)

seed

Seed used for random sampling

Value

Returns centroids where duplicate rows are replaced by random draws
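
The idea can be illustrated with a minimal base R sketch. It is not the package's internal code: the helper name replace_dup_rows and the choice to draw replacement rows directly from X are assumptions made for illustration only.

```r
# Hypothetical helper (not the package's internal code): replace duplicated
# centroid rows with rows sampled at random from the complete data matrix X.
replace_dup_rows <- function(centroids, X, seed) {
  set.seed(seed)
  dup <- duplicated(centroids)          # TRUE for every repeated row
  if (any(dup)) {
    idx <- sample(nrow(X), sum(dup))    # one random data row per duplicate
    centroids[dup, ] <- X[idx, , drop = FALSE]
  }
  centroids
}

# Toy data: rows 1 and 2 of the centroid matrix are identical
X <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), ncol = 2)
centroids <- rbind(c(0, 0), c(0, 0), c(1, 1))
new_centroids <- replace_dup_rows(centroids, X, seed = 1)
new_centroids
```

This sketch does a single pass; in principle a random draw could itself coincide with an existing centroid, which a more careful version would have to check again.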


K-means clustering with built-in missing data imputation

Description

Clustering algorithm that produces a missing value imputation on the go. The (local) imputation distribution is defined by the currently assigned cluster. The first draw is by random imputation.

Usage

ClustImpute(
  X,
  nr_cluster,
  nr_iter = 10,
  c_steps = 1,
  wf = default_wf,
  n_end = 10,
  seed_nr = 150519,
  assign_with_wf = TRUE,
  shrink_towards_global_mean = TRUE
)

Arguments

X

Data frame with only numeric values or NAs

nr_cluster

Number of clusters

nr_iter

Iterations of procedure

c_steps

Number of clustering steps per iteration

wf

Weight function. Linear up to n_end by default. Used to shrink X towards zero or towards the global mean (default). See shrink_towards_global_mean.

n_end

Number of steps until the weight applied to imputed values (one minus the value of the weight function) reaches 1

seed_nr

Number for set.seed()

assign_with_wf

Default is TRUE. If set to FALSE, the weight function is applied only in the centroid computation and ignored in the cluster assignment.

shrink_towards_global_mean

TRUE by default. The weight matrix w is applied to the difference of X from the global mean m, i.e., (x-m)*w+m

Value

complete_data

Completed data without NAs

clusters

For each row of complete_data, the associated cluster

centroids

For each cluster, the coordinates of the centroids in tidy format

centroids_matrix

For each cluster, the coordinates of the centroids in matrix format

imp_values_mean

Mean of the imputed variables per draw

imp_values_sd

Standard deviation of the imputed variables per draw

Examples

# Random Dataset
set.seed(739)
n <- 750 # number of points
nr_other_vars <- 2
mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars)
me<-4 # mean
x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1))
y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1))
dat <- cbind(mat,x,y)
dat<- as.data.frame(scale(dat)) # scaling

# Create NAs
dat_with_miss <- miss_sim(dat,p=.1,seed_nr=120)

# Run ClustImpute
res <- ClustImpute(dat_with_miss,nr_cluster=3)

# Plot complete data set and cluster assignment
ggplot2::ggplot(res$complete_data,ggplot2::aes(x,y,color=factor(res$clusters))) +
  ggplot2::geom_point()

# View centroids
res$centroids
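
The shrinkage formula (x-m)*w+m from the description of shrink_towards_global_mean can be worked through on a toy column in base R. This is only a sketch: the numbers are made up, and the assumptions that observed entries keep weight 1 and that m is the mean of the completed column are illustrative rather than taken from the package source.

```r
# Toy column with one imputed entry (position 3); values are made up.
x <- c(2.0, 4.0, 9.0, 6.0)
is_imputed <- c(FALSE, FALSE, TRUE, FALSE)
m <- mean(x)                       # global mean of the feature (5.25)

# Shrinkage as given in the argument description: (x - m) * w + m
shrink <- function(x, w, m) (x - m) * w + m

# Assumption: w = 1 for observed entries; for imputed entries w grows
# from small values in early iterations up to 1.
w_early <- ifelse(is_imputed, 0.2, 1)
w_late  <- ifelse(is_imputed, 1.0, 1)

x_early <- shrink(x, w_early, m)   # imputed value pulled towards m
x_late  <- shrink(x, w_late, m)    # no shrinkage left
rbind(x, x_early, x_late)
```

With a weight of 0.2 the imputed value 9 is pulled to (9 - 5.25) * 0.2 + 5.25 = 6, while a weight of 1 leaves it unchanged.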

Default weight function

Description

Default weight function. One minus the return value is multiplied with the missing (i.e., imputed) values. The return value starts at 1 and goes to 0 at n_end.

Usage

default_wf(n, n_end = 10)

Arguments

n

current step

n_end

steps until convergence of weight function to 0

Value

value between 0 and 1

Examples

x <- 0:20
plot(x,1-default_wf(x))
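
For readers who want to supply their own wf, a linear function with the behavior described above might look like the following base R sketch. The name linear_wf is hypothetical and the body is a guess based on the description (1 at step 0, linear decrease, 0 from n_end onward), not a copy of the package source.

```r
# Hypothetical stand-in for a linear weight function (not the package source):
# returns 1 at step 0 and decreases linearly to 0 at step n_end.
linear_wf <- function(n, n_end = 10) {
  pmax(1 - n / n_end, 0)
}

steps <- 0:12
# One minus the return value is what gets multiplied with imputed values,
# so this weight grows from 0 to 1 and then stays at 1.
weights_on_imputed <- 1 - linear_wf(steps)
weights_on_imputed
```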

Simulation of missings

Description

Simulates values missing at random, using a normal copula to create correlations between the missing values (type="MAR"). Missing values appear in each column of the provided data frame with the same ratio.

Usage

miss_sim(dat, p = 0.2, type = "MAR", seed_nr = 123)

Arguments

dat

Data frame with only numeric values

p

Fraction of missing values (for the entire data frame)

type

Type of missingness. Either MCAR (=missing completely at random) or MAR (=missing at random)

seed_nr

Number for set.seed()

Value

data frame with only numeric values and NAs

Examples

data(cars)
cars_with_missings <- miss_sim(cars,p = .2,seed_nr = 4)
summary(cars_with_missings)
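
To make the MCAR case concrete, here is a minimal base R sketch that masks entries independently with probability p. The helper mcar_sketch is hypothetical and is not how miss_sim is implemented; in particular it matches the requested ratio only in expectation, and it does not cover the copula-based MAR mechanism.

```r
# Hypothetical helper (not miss_sim itself): set each entry to NA independently
# with probability p, i.e. missing completely at random.
mcar_sketch <- function(dat, p = 0.2, seed_nr = 123) {
  set.seed(seed_nr)
  # Logical mask with the same shape as the data frame
  mask <- matrix(runif(nrow(dat) * ncol(dat)) < p, nrow(dat), ncol(dat))
  dat[mask] <- NA
  dat
}

data(cars)
cars_mcar <- mcar_sketch(cars, p = 0.2, seed_nr = 4)
mean(is.na(cars_mcar))   # realized fraction of missing values
```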

Plot showing marginal distribution by cluster assignment

Description

Returns a plot with the marginal distributions by cluster and feature. The plot shows histograms or boxplots and, as a ggplot object, can be modified further.

Usage

## S3 method for class 'kmeans_ClustImpute'
plot(
  x,
  type = "hist",
  vline = "centroids",
  hist_bins = 30,
  color_bins = "#56B4E9",
  color_vline = "#E69F00",
  size_vline = 2,
  ...
)

Arguments

x

an object returned from ClustImpute

type

either "hist" to plot a histogram or "box" for a boxplot

vline

for "hist" a vertical line is plotted showing either the centroid value or the mean of all data points grouped by cluster and feature

hist_bins

number of bins for histogram

color_bins

color for the histogram bins

color_vline

color for the vertical line

size_vline

size of the vertical line

...

currently unused

Value

Returns a ggplot object


Prediction method

Description

Prediction method: returns the cluster assignment for new data.

Usage

## S3 method for class 'kmeans_ClustImpute'
predict(object, newdata, ...)

Arguments

object

Object of class kmeans_ClustImpute

newdata

Data frame

...

additional arguments affecting the predictions produced - not currently used

Value

integer value (cluster assignment)

Examples

# Random Dataset
set.seed(739)
n <- 750 # number of points
nr_other_vars <- 2
mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars)
me<-4 # mean
x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1))
y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1))
dat <- cbind(mat,x,y)
dat<- as.data.frame(scale(dat)) # scaling

# Create NAs
dat_with_miss <- miss_sim(dat,p=.1,seed_nr=120)

res <- ClustImpute(dat_with_miss,nr_cluster=3)
predict(res,newdata=dat[1,])
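
A common way to assign new points in k-means is to pick the nearest centroid. The base R sketch below illustrates that rule on made-up centroids; whether predict follows exactly this rule, and how it treats NAs in newdata, is an assumption, and nearest_centroid is a hypothetical helper.

```r
# Hypothetical helper: index of the closest centroid (squared Euclidean
# distance) for each row of newdata. Assumes newdata contains no NAs.
nearest_centroid <- function(newdata, centroids) {
  newdata <- as.matrix(newdata)
  apply(newdata, 1, function(row) {
    # t(centroids) has one column per centroid; subtract the point from each
    which.min(colSums((t(centroids) - row)^2))
  })
}

# Made-up centroids (one per row) and three new points
centroids <- rbind(c(0, 0), c(5, 5), c(-5, 5))
newdata <- data.frame(x = c(0.5, 4, -6), y = c(-0.2, 6, 4))
assigned <- nearest_centroid(newdata, centroids)
assigned
```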

Print method for ClustImpute

Description

Print method for objects of class kmeans_ClustImpute.

Usage

## S3 method for class 'kmeans_ClustImpute'
print(x, ...)

Arguments

x

an object returned from ClustImpute

...

currently unused

Value

No return value (print function)


Reduction of variance

Description

Computes one minus the ratio of the sum of all within-cluster variances to the overall variance.

Usage

var_reduction(clusterObj)

Arguments

clusterObj

Object of class kmeans_ClustImpute

Value

numeric value typically between 0 and 1

Examples

# Random Dataset
set.seed(739)
n <- 750 # number of points
nr_other_vars <- 2
mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars)
me<-4 # mean
x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1))
y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1))
dat <- cbind(mat,x,y)
dat<- as.data.frame(scale(dat)) # scaling

# Create NAs
dat_with_miss <- miss_sim(dat,p=.1,seed_nr=120)

res <- ClustImpute(dat_with_miss,nr_cluster=3)
var_reduction(res)
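
The quantity described above can be sketched in base R on a completed data matrix with known cluster labels. The helper var_reduction_sketch is hypothetical, and using sums of squared deviations for both the within-cluster and the overall term is an assumption about the exact normalization, not the package's code.

```r
# Hypothetical helper: 1 - (within-cluster sum of squares / total sum of squares)
var_reduction_sketch <- function(data, clusters) {
  data <- as.matrix(data)
  # Squared deviations from the overall column means
  total_ss <- sum(scale(data, scale = FALSE)^2)
  # Squared deviations from each cluster's own column means, summed over clusters
  within_ss <- sum(sapply(split(seq_len(nrow(data)), clusters), function(idx) {
    sum(scale(data[idx, , drop = FALSE], scale = FALSE)^2)
  }))
  1 - within_ss / total_ss
}

# Two well separated toy clusters
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = -3), ncol = 2),
             matrix(rnorm(40, mean = 3), ncol = 2))
labels <- rep(1:2, each = 20)
vr <- var_reduction_sketch(toy, labels)
vr
```

A value near 1 means the clusters explain most of the variation; putting all points into a single cluster gives 0.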