Title: | Feature Importance for Partitional Clustering |
---|---|
Description: | Implements a novel approach for measuring feature importance in k-means clustering. Importance of a feature is measured by the misclassification rate relative to the baseline cluster assignment due to a random permutation of feature values. An explanation of permutation feature importance in general can be found here: <https://christophm.github.io/interpretable-ml-book/feature-importance.html>. |
Authors: | Oliver Pfaffel [aut, cre] |
Maintainer: | Oliver Pfaffel <[email protected]> |
License: | GPL-3 |
Version: | 0.1.5 |
Built: | 2025-01-31 03:54:29 UTC |
Source: | https://github.com/o1iv3r/featureimpcluster |
Creates a random data set with 4 clusters in a 2-dimensional subspace of a (nr_other_vars+2)-dimensional space
create_random_data(n = 10000, nr_other_vars = 4)
n | number of points |
nr_other_vars | number of other variables / "noise" dimensions |
list containing the random data.table and a vector with the true underlying cluster assignments
create_random_data(n=1e3)
This function loops through PermMisClassRate
for each variable of the data.
The mean misclassification rate over all iterations is interpreted as variable importance.
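The idea described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data, the `nearest_center` helper and the `perm_rate` helper are assumptions made up for this illustration, and `stats::kmeans` stands in for whichever clustering method is actually used.

```r
# Minimal sketch in base R; not the package's implementation.
set.seed(1)

# Toy data (assumed): x and y carry the cluster structure, noise does not.
n <- 400
centers <- rbind(c(-3, -3), c(-3, 3), c(3, -3), c(3, 3))
truth <- sample(1:4, n, replace = TRUE)
dat <- data.frame(
  x = centers[truth, 1] + rnorm(n, sd = 0.5),
  y = centers[truth, 2] + rnorm(n, sd = 0.5),
  noise = rnorm(n)
)

fit <- kmeans(dat, centers = 4, nstart = 10)

# Hypothetical helper: assign each row to its nearest cluster centre.
nearest_center <- function(fit, newdata) {
  m <- as.matrix(newdata)
  # Squared Euclidean distance of every row to every centre (n x k matrix)
  d <- sapply(seq_len(nrow(fit$centers)), function(k) {
    rowSums(sweep(m, 2, fit$centers[k, ])^2)
  })
  max.col(-d, ties.method = "first")
}

# Baseline assignment on the unpermuted data
base_pred <- nearest_center(fit, dat)

# Hypothetical helper: misclassification rate relative to the baseline
# after randomly permuting a single column, repeated several times.
perm_rate <- function(var, iterations = 5) {
  sapply(seq_len(iterations), function(i) {
    permuted <- dat
    permuted[[var]] <- sample(permuted[[var]])
    mean(nearest_center(fit, permuted) != base_pred)
  })
}

rates <- sapply(names(dat), perm_rate)  # iterations x variables
importance <- colMeans(rates)           # mean rate per variable
print(round(importance, 3))
```

In this toy setting, permuting `x` or `y` should move many points to a different cluster, while permuting `noise` should leave almost all assignments unchanged.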
FeatureImpCluster( clusterObj, data, basePred = NULL, predFUN = NULL, sub = 1, biter = 10 )
clusterObj | a "typical" cluster object. The only requirement is that there must be a prediction function which maps the data to an integer cluster label |
data | data.table with the same features as the data set used for clustering (or simply the same data) |
basePred | should be equal to the result of predFUN(clusterObj,newdata=data); supplying it saves time when data is a very large data set |
predFUN | predFUN(clusterObj,newdata=data) should provide the cluster assignment as a numeric vector; typically this is a wrapper around a built-in prediction function |
sub | number between 0 and 1 (default: 1); if <1, only a subset of the data is used |
biter | the permutation is iterated biter times (default: 10) |
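As an illustration of the predFUN argument, the following sketch shows what such a wrapper might look like for a `stats::kmeans` fit, which has no built-in prediction method. The helper name `predict_kmeans` and the toy data are assumptions for this example; the sketch only checks the helper against the fitted cluster labels and does not call the package itself.

```r
# Sketch of a custom predFUN for a stats::kmeans fit.
# predict_kmeans is a hypothetical helper name, not part of the package.
predict_kmeans <- function(clusterObj, newdata) {
  m <- as.matrix(newdata)
  # Squared Euclidean distance of every row to every cluster centre
  d <- sapply(seq_len(nrow(clusterObj$centers)), function(k) {
    rowSums(sweep(m, 2, clusterObj$centers[k, ])^2)
  })
  max.col(-d, ties.method = "first")  # index of the nearest centre
}

# Toy data (assumed): two groups separated along x
set.seed(123)
toy <- data.frame(x = c(rnorm(50, -3), rnorm(50, 3)), y = rnorm(100))
fit <- kmeans(toy, centers = 2, nstart = 5)

# The wrapper should reproduce the labels stored in the fitted object
agreement <- mean(predict_kmeans(fit, newdata = toy) == fit$cluster)
print(agreement)
```

A function of this shape could then be supplied as `predFUN = predict_kmeans`, and its output on the full data as `basePred`, following the argument descriptions above.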
A list of
A matrix of the permutation misclassification rate for each variable and each iteration
For each row of complete_data, the associated cluster
set.seed(123)
dat <- create_random_data(n=1e3)$data # random data
library(flexclust)
res <- kcca(dat,k=4)
f <- FeatureImpCluster(res,dat)
plot(f)
Answers the following question: Using the current partition as a baseline, what is the misclassification rate if a given feature is permuted?
PermMisClassRate( clusterObj, data, varName, basePred = NULL, predFUN = NULL, sub = 1, biter = 5, seed = 123 )
clusterObj | a "typical" cluster object. The only requirement is that there must be a prediction function which maps the data to an integer cluster label |
data | data.table with the same features as the data set used for clustering (or simply the same data) |
varName | character; variable name |
basePred | should be equal to the result of predFUN(clusterObj,newdata=data); supplying it saves time when data is a very large data set |
predFUN | predFUN(clusterObj,newdata=data) should provide the cluster assignment as a numeric vector; typically this is a wrapper around a built-in prediction function |
sub | number between 0 and 1 (default: 1); if <1, only a subset of the data is used |
biter | the permutation is iterated biter times (default: 5) |
seed | value for random seed |
vector of length biter with the misclassification rate
set.seed(123)
dat <- create_random_data(n=1e3)$data # random data
library(flexclust)
res <- kcca(dat,k=4)
PermMisClassRate(res,dat,varName="x")
Feature importance box plot
## S3 method for class 'featImpCluster'
plot(x, dat = NULL, color = "none", showPoints = FALSE, ...)
x | an object returned from FeatureImpCluster |
dat | same data as used for the computation of the feature importance (only relevant for colored plots) |
color | If set to "type", the plot will show different variable types in different colors. |
showPoints | Show points (default is FALSE) |
... | arguments to be passed to base plot method |
Returns a ggplot2 object