Package 'FeatureImpCluster'

Title: Feature Importance for Partitional Clustering
Description: Implements a novel approach for measuring feature importance in k-means clustering. Importance of a feature is measured by the misclassification rate relative to the baseline cluster assignment due to a random permutation of feature values. An explanation of permutation feature importance in general can be found here: <https://christophm.github.io/interpretable-ml-book/feature-importance.html>.
Authors: Oliver Pfaffel [aut, cre]
Maintainer: Oliver Pfaffel <[email protected]>
License: GPL-3
Version: 0.1.5
Built: 2025-01-31 03:54:29 UTC
Source: https://github.com/o1iv3r/featureimpcluster

Help Index


Create random data set with 4 clusters

Description

Create random data set with 4 clusters in a 2 dimensional subspace of a nr_other_vars+2 dimensional space

Usage

create_random_data(n = 10000, nr_other_vars = 4)

Arguments

n

number of points

nr_other_vars

number of other variables / "noise" dimensions

Value

list containing the random data.table and a vector with the true underlying cluster assignments

Examples

create_random_data(n=1e3)

Feature importance for k-means clustering

Description

This function loops through PermMisClassRate for each variable of the data. The mean misclassification rate over all iterations is interpreted as variable importance.

Usage

FeatureImpCluster(
  clusterObj,
  data,
  basePred = NULL,
  predFUN = NULL,
  sub = 1,
  biter = 10
)

Arguments

clusterObj

a "typical" cluster object. The only requirement is that there must be a prediction function which maps the data to an integer

data

data.table with the same features as the data set used for clustering (or the simply the same data)

basePred

should be equal to results of predFUN(clusterObj,newdata=data); this option saves time when data is a very large data set

predFUN

predFUN(clusterObj,newdata=data) should provide the cluster assignment as a numeric vector; typically this is a wrapper around a build-in prediction function

sub

integer between 0 and 1(=default), indicates that only a subset of the data should be used if <1

biter

the permutation is iterated biter(=5, default) times

Value

A list of

misClassRate

A matrix of the permutation misclassification rate for each variable and each iteration

featureImp

For each row of complete_data, the associated cluster

Examples

set.seed(123)
dat <- create_random_data(n=1e3)$data # random data

library(flexclust)
res <- kcca(dat,k=4)
f <- FeatureImpCluster(res,dat)
plot(f)

Permutation misclassification rate for single variable

Description

Answers the following question: Using the current partion as a baseline, what is the misclassification rate if a given feature is permuted?

Usage

PermMisClassRate(
  clusterObj,
  data,
  varName,
  basePred = NULL,
  predFUN = NULL,
  sub = 1,
  biter = 5,
  seed = 123
)

Arguments

clusterObj

a "typical" cluster object. The only requirement is that there must be a prediction function which maps the data to an integer

data

data.table with the same features as the data set used for clustering (or the simply the same data)

varName

character; variable name

basePred

should be equal to results of predFUN(clusterObj,newdata=data); this option saves time when data is a very large data set

predFUN

predFUN(clusterObj,newdata=data) should provide the cluster assignment as a numeric vector; typically this is a wrapper around a build-in prediction function

sub

integer between 0 and 1(=default), indicates that only a subset of the data should be used if <1

biter

the permutation is iterated biter(=5, default) times

seed

value for random seed

Value

vector of length biter with the misclassification rate

Examples

set.seed(123)
dat <- create_random_data(n=1e3)$data # random data

library(flexclust)
res <- kcca(dat,k=4)
PermMisClassRate(res,dat,varName="x")

Feature importance box plot

Description

Feature importance box plot

Usage

## S3 method for class 'featImpCluster'
plot(x, dat = NULL, color = "none", showPoints = FALSE, ...)

Arguments

x

an object returned from FeatureImpCluster

dat

same data as used for the computation of the feature importance (only relevant for colored plots)

color

If set to "type", the plot will show different variable types with a different color.

showPoints

Show points (default is False)

...

arguments to be passed to base plot method

Value

Returns a ggplot2 object