Clustering analysis in R, with factoextra and NbClust

I recently gave a short talk on some packages I like using for clustering analysis in R. Here’s a brief introduction to some features of factoextra and NbClust.

Categories: soundscape, PhD, machine learning

Published: June 18, 2020

Clustering Analysis Libraries

factoextra: extracts and visualises the results of multivariate analyses (PCA, clustering, and more), with ggplot2-based plotting functions.

NbClust: computes up to 30 indices for determining the optimal number of clusters and proposes the best choice by majority vote.

There are many libraries and functions in R for performing clustering analysis, so why look at these two? Because they address two important challenges in clustering: visualisation and determining the optimal number of clusters.

In general, cluster analysis is an unsupervised machine learning task, meaning we don’t predefine a target output for the learning. For clustering, this mainly means that we don’t know what the categories will be before we start the analysis. We also don’t know how many clusters are present.
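To make this concrete, here is a minimal sketch (hypothetical data, base R only) showing that k-means discovers groupings without ever being given labels:

```r
# Minimal sketch of unsupervised clustering (hypothetical data, base R only):
# we never tell kmeans() which group a point belongs to, only that groups exist
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),  # blob around (0, 0)
           matrix(rnorm(40, mean = 4), ncol = 2))  # blob around (4, 4)
fit <- kmeans(x, centers = 2, nstart = 25)
table(fit$cluster)  # cluster sizes discovered by the algorithm itself
```

Note that even here we had to supply `centers = 2`; choosing that number is exactly the problem the rest of this post tackles.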

Many indices have been created to help with determining the optimal number of clusters, however these each have their own advantages and disadvantages and often give conflicting results. We’ll demonstrate this by looking at the results from three popular graphical methods: elbow plot, silhouette, and gap statistic.

Dataset

We’ll be looking at data from the Soundscape Indices (SSID) Database. This dataset contains in-situ perceptual assessments of urban soundscapes, paired with acoustic and environmental data. For this post, we’ll be looking at the perceptual data only.

To collect the data, random members of the public were approached in urban public spaces and asked to complete a survey about how they perceive the sound environment. One section of the survey asks specifically about the perceived dominance of sound sources in the space. Sound sources are categorised as Traffic noise, Other noise, Human sounds, and Natural sounds, and each is rated from 1 (not at all) to 5 (dominates completely).

The data was collected across 27 locations in the UK, Italy, Spain, the Netherlands, and China. The goal here is to investigate whether these locations can be categorized based on their composition of sound sources.

Code
library(readxl)
library(dplyr)  # Data processing and piping

# Clustering libraries
library(factoextra) # Clustering and visualisation
library(NbClust)    # Optimal Number of Clusters
library(RCurl)      # (loaded for completeness; download.file() below is base R)

temp.file <- paste0(tempfile(), ".xlsx")
download.file("https://zenodo.org/record/5705908/files/SSID%20Lockdown%20Database%20VL0.2.2.xlsx", temp.file, mode="wb")
ssid.data <- read_excel(temp.file)

vars <- c("Traffic", "Other", "Natural", "Human")
# vars <- c("pleasant", "chaotic", "vibrant", "uneventful", "calm", "annoying", "eventful", "monotonous")

# Cut down the dataset to the columns of interest
ssid.data <- ssid.data[c("GroupID", "SessionID", "LocationID", vars)]

# ssid.data <- subset(ssid.data, Lockdown != 1)

# Set GroupID, SessionID, LocationID as factor type; source ratings as numeric
# (funs() is deprecated in current dplyr; across() is the modern equivalent)
ssid.data <- ssid.data %>%
    mutate(across(c(GroupID, SessionID, LocationID), as.factor)) %>%
    mutate(across(all_of(vars), as.numeric))

# Calculate the mean response for each GroupID
ssid.data <- ssid.data %>% 
    group_by(GroupID) %>%
    summarize(
              Traffic = mean(Traffic, na.rm=TRUE),
              Other = mean(Other, na.rm=TRUE),
              Natural = mean(Natural, na.rm=TRUE),
              Human = mean(Human, na.rm=TRUE),
              # pleasant = mean(pleasant, na.rm = TRUE),
              # chaotic = mean(chaotic, na.rm = TRUE),
              # vibrant = mean(vibrant, na.rm = TRUE),
              # uneventful = mean(uneventful, na.rm = TRUE),
              # calm = mean(calm, na.rm = TRUE),
              # annoying = mean(annoying, na.rm = TRUE),
              # eventful = mean(eventful, na.rm = TRUE),
              # monotonous = mean(monotonous, na.rm = TRUE),
              LocationID = LocationID[1])

# analysis.data$LocationID <- unique(ssid.data[c('GroupID', 'LocationID')])['LocationID']
ssid.data <- na.omit(ssid.data)

knitr::kable(head(ssid.data))
| GroupID | Traffic | Other | Natural | Human | LocationID |
|---|---|---|---|---|---|
| AM01 | 1 | 3.0 | 1.0 | 4.000000 | SanMarco |
| AM02 | 2 | 1.5 | 3.5 | 4.000000 | SanMarco |
| AM03 | 1 | 1.0 | 2.0 | 4.666667 | SanMarco |
| AM05 | 2 | 2.0 | 3.0 | 1.000000 | SanMarco |
| AM06 | 1 | 4.0 | 1.5 | 3.500000 | SanMarco |
| AM07 | 1 | 1.0 | 4.0 | 4.000000 | SanMarco |

Our locations are:

Code
print(levels(ssid.data$LocationID))
 [1] "CamdenTown"         "EustonTap"          "MarchmontGarden"   
 [4] "MonumentoGaribaldi" "PancrasLock"        "RegentsParkFields" 
 [7] "RegentsParkJapan"   "RussellSq"          "SanMarco"          
[10] "StPaulsCross"       "StPaulsRow"         "TateModern"        
[13] "TorringtonSq"      

Calculate the mean value for each Location

Note: If the data use different scales, they should always be standardised before clustering. In this case, all of the data are on the same scale, so we don’t need to worry.
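Had the variables been on different scales, base R's scale() would standardise each column to mean 0 and standard deviation 1 first. A sketch with hypothetical mixed-scale data:

```r
# Sketch: standardising mixed-scale variables before clustering.
# Hypothetical data: one variable in dB, one on a 1-5 rating scale.
df <- data.frame(loudness_db = c(55, 70, 62, 48),
                 rating      = c(2, 5, 3, 1))
df.std <- scale(df)   # centre each column to mean 0, rescale to sd 1
colMeans(df.std)      # effectively zero for both columns
apply(df.std, 2, sd)  # exactly 1 for both columns
```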

Code
means <- aggregate(ssid.data[c(vars)], by=list(ssid.data$LocationID), FUN=mean, na.rm=TRUE)
means <- data.frame(means[, -1], row.names = means[, 1])

knitr::kable(means)
| Location | Traffic | Other | Natural | Human |
|---|---|---|---|---|
| CamdenTown | 3.787745 | 2.652941 | 1.316667 | 3.219608 |
| EustonTap | 3.717857 | 2.858929 | 1.642262 | 2.515476 |
| MarchmontGarden | 2.663542 | 2.412500 | 2.564583 | 2.695833 |
| MonumentoGaribaldi | 1.903509 | 1.885965 | 2.982456 | 3.254386 |
| PancrasLock | 2.457500 | 3.144167 | 2.391667 | 2.453333 |
| RegentsParkFields | 2.422886 | 1.911692 | 3.148010 | 2.866915 |
| RegentsParkJapan | 1.867179 | 1.528718 | 3.974103 | 2.450000 |
| RussellSq | 2.668120 | 2.027132 | 3.334302 | 2.987403 |
| SanMarco | 1.431852 | 1.895926 | 2.230370 | 4.041852 |
| StPaulsCross | 2.503704 | 2.000000 | 2.266667 | 3.311111 |
| StPaulsRow | 2.527132 | 2.294574 | 1.744186 | 3.368217 |
| TateModern | 2.541667 | 2.150833 | 2.598333 | 3.660833 |
| TorringtonSq | 3.230676 | 2.848068 | 1.957005 | 3.269565 |

Clustering Analysis

Some standard, single indices

Elbow plot (within-cluster sum of squares), silhouette, and gap statistic

Code
set.seed(123)
fviz_nbclust(means, hcut, method="wss", ggtheme = theme_bw())

Code
fviz_nbclust(means, hcut, method="silhouette", ggtheme = theme_bw())

Code
fviz_nbclust(means, hcut, method="gap_stat", ggtheme=theme_bw())

As we can see, interpreting some of these can be less than obvious - where exactly is the ‘elbow’ in the elbow plot? The silhouette pretty clearly says k = 2, but the gap statistic gives k = 1, which isn’t very useful. How do we know which is right?

30 indices using NbClust


Code
# Unused here: NbClust is run with index = "alllong" (all available indices) below
indices <- c("kl", "ch", "ccc", "cindex", "db", "silhouette", "duda", "pseudot2",
             "ratkowsky", "ptbiserial", "gap", "mcclain", "gamma", "gplus", "tau",
             "sdindex", "sdbw")
res <- NbClust(data = means, distance = "euclidean", min.nc = 2, max.nc = 9,
               method = "ward.D2", index = "alllong")
Warning in pf(beale, pp, df2): NaNs produced

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 9 proposed 2 as the best number of clusters 
* 4 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 4 proposed 6 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 2 proposed 8 as the best number of clusters 
* 5 proposed 9 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 
Code
knitr::kable(res)
| k | KL | CH | Hartigan | CCC | Scott | Marriot | TrCovW | TraceW | Friedman | Rubin | Cindex | DB | Silhouette | Duda | Pseudot2 | Beale | Ratkowsky | Ball | Ptbiserial | Gap | Frey | McClain | Gamma | Gplus | Tau | Dunn | Hubert | SDindex | Dindex | SDbw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.0970 | 8.1449 | 6.0597 | 7.5851 | 102.6537 | 30.2615 | 13.4952 | 10.1742 | 351.3546 | 36.9377 | 0.4269 | 0.8174 | 0.3529 | 0.6156 | 4.9946 | 1.3398 | 0.3840 | 5.0871 | 0.5020 | -0.9326 | 1.0370 | 0.4096 | 0.5903 | 3.7821 | 10.8974 | 0.3539 | 0.0753 | 2.5622 | 0.8042 | 0.4459 |
| 3 | 1.1005 | 8.4961 | 5.2330 | 6.3990 | 119.6026 | 18.4865 | 5.5195 | 6.5603 | 404.1006 | 57.2860 | 0.3951 | 1.0948 | 0.2818 | 0.4299 | 5.3038 | 2.5609 | 0.4103 | 2.1868 | 0.4710 | -1.6230 | 0.2765 | 1.3913 | 0.6265 | 3.1026 | 10.4103 | 0.3626 | 0.0800 | 2.8643 | 0.6523 | 0.3387 |
| 4 | 1.6303 | 9.3352 | 3.5123 | 5.9285 | 136.6838 | 8.8328 | 2.0114 | 4.3066 | 514.5025 | 87.2635 | 0.5492 | 0.8335 | 0.3286 | 2.5977 | -1.2301 | -0.9899 | 0.4303 | 1.0767 | 0.4840 | -1.8404 | 0.2171 | 2.0924 | 0.7560 | 1.5513 | 9.6154 | 0.5689 | 0.0773 | 2.4025 | 0.5465 | 0.2559 |
| 5 | 0.8819 | 9.4327 | 4.1056 | 5.4008 | 150.1412 | 4.9017 | 0.9988 | 3.0977 | 615.3276 | 121.3182 | 0.5171 | 0.6492 | 0.4032 | 2.6324 | -1.2402 | -0.9981 | 0.4026 | 0.6195 | 0.4854 | -2.3510 | 0.1834 | 2.4726 | 0.8391 | 0.8718 | 9.0897 | 0.6139 | 0.0863 | 2.3482 | 0.4455 | 0.1689 |
| 6 | 2.4474 | 10.7100 | 1.9565 | 5.2900 | 170.5431 | 1.4694 | 0.3679 | 2.0471 | 969.8285 | 183.5792 | 0.5973 | 0.5852 | 0.4719 | 1.8668 | -0.4643 | -0.5605 | 0.3812 | 0.3412 | 0.4830 | -2.4797 | 0.5241 | 2.9403 | 0.9529 | 0.2051 | 8.3077 | 0.8559 | 0.1062 | 2.4432 | 0.3528 | 0.1185 |
| 7 | 0.9458 | 10.0676 | 1.9696 | 4.5460 | 183.1309 | 0.7595 | 0.2213 | 1.5999 | 1194.4928 | 234.8889 | 0.5305 | 0.5508 | 0.4589 | 3.4939 | -0.7138 | -0.8616 | 0.3578 | 0.2286 | 0.4476 | -2.9193 | 0.8670 | 3.6092 | 0.9643 | 0.1282 | 6.9231 | 0.8905 | 0.1153 | 3.0849 | 0.2980 | 0.0970 |
| 8 | 0.9554 | 9.7863 | 2.1072 | 4.5678 | 213.5276 | 0.0957 | 0.1542 | 1.2045 | 3652.8265 | 311.9964 | 0.4860 | 0.4826 | 0.5005 | 3.0804 | -0.6754 | -0.8152 | 0.3386 | 0.1506 | 0.3947 | -3.0908 | 0.5258 | 4.8153 | 0.9537 | 0.1282 | 5.2821 | 0.7320 | 0.1161 | 3.2357 | 0.2402 | 0.0697 |
| 9 | 0.9437 | 9.9481 | 2.5926 | 3.7773 | 235.4352 | 0.0225 | 0.0873 | 0.8474 | 6052.3637 | 443.4830 | 0.4147 | 0.4100 | 0.5619 | 4.7507 | 0.0000 | 0.0000 | 0.3219 | 0.0942 | 0.3339 | -3.4867 | 0.1666 | 7.0102 | 0.9662 | 0.0641 | 3.6667 | 0.7386 | 0.1163 | 3.4232 | 0.1907 | 0.0490 |
| k | CritValue_Duda | CritValue_PseudoT2 | Fvalue_Beale | CritValue_Gap |
|---|---|---|---|---|
| 2 | 0.2019 | 31.6230 | 0.2766 | 0.7534 |
| 3 | 0.0160 | 246.4030 | 0.0787 | 0.2949 |
| 4 | -0.1694 | -13.8056 | 1.0000 | 0.6048 |
| 5 | -0.1694 | -13.8056 | 1.0000 | 0.2346 |
| 6 | -0.3257 | -4.0703 | 1.0000 | 0.5672 |
| 7 | -0.3257 | -4.0703 | 1.0000 | 0.3163 |
| 8 | -0.3257 | -4.0703 | 1.0000 | 0.5663 |
| 9 | -0.5879 | 0.0000 | NaN | 0.3864 |
| Index | Number_clusters | Value_Index |
|---|---|---|
| KL | 6 | 2.4474 |
| CH | 6 | 10.71 |
| Hartigan | 6 | 2.1492 |
| CCC | 2 | 7.5851 |
| Scott | 8 | 30.3968 |
| Marriot | 4 | 5.7226 |
| TrCovW | 3 | 7.9756 |
| TraceW | 3 | 1.3603 |
| Friedman | 8 | 2458.334 |
| Rubin | 6 | -10.9512 |
| Cindex | 3 | 0.3951 |
| DB | 9 | 0.41 |
| Silhouette | 9 | 0.5619 |
| Duda | 2 | 0.6156 |
| PseudoT2 | 2 | 4.9946 |
| Beale | 2 | 1.3398 |
| Ratkowsky | 4 | 0.4303 |
| Ball | 3 | 2.9003 |
| PtBiserial | 2 | 0.502 |
| Gap | 2 | -0.9326 |
| Frey | 2 | 1.037 |
| McClain | 2 | 0.4096 |
| Gamma | 9 | 0.9662 |
| Gplus | 9 | 0.0641 |
| Tau | 2 | 10.8974 |
| Dunn | 7 | 0.8905 |
| Hubert | 0 | 0 |
| SDindex | 5 | 2.3482 |
| Dindex | 0 | 0 |
| SDbw | 9 | 0.049 |
| Location | Cluster |
|---|---|
| CamdenTown | 1 |
| EustonTap | 1 |
| MarchmontGarden | 2 |
| MonumentoGaribaldi | 2 |
| PancrasLock | 2 |
| RegentsParkFields | 2 |
| RegentsParkJapan | 2 |
| RussellSq | 2 |
| SanMarco | 2 |
| StPaulsCross | 2 |
| StPaulsRow | 2 |
| TateModern | 2 |
| TorringtonSq | 1 |
Code
fviz_nbclust(res)
Error in if (class(best_nc) == "numeric") print(best_nc) else if (class(best_nc) == : the condition has length > 1
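This error is not ours: since R 4.0 a matrix has class c("matrix", "array"), so factoextra’s `class(best_nc) == "numeric"` check yields a length-2 condition, which R 4.2+ treats as an error. Until that is patched, the vote tally can be rebuilt by hand from `res$Best.nc` (a matrix NbClust actually returns). A self-contained base-R sketch of the tally logic, using a small mock of that matrix with hypothetical values:

```r
# res$Best.nc is a 2-row matrix (Number_clusters, Value_Index) with one column
# per index. Mocked here with hypothetical values so the snippet runs standalone.
best.nc <- rbind(Number_clusters = c(KL = 2, CH = 3, Silhouette = 2, Gap = 2),
                 Value_Index     = c(1.10, 8.50, 0.35, -0.93))
votes <- table(best.nc["Number_clusters", ])
votes                           # how many indices voted for each k
names(votes)[which.max(votes)]  # the majority-rule winner ("2" here)
```

With the real object, `table(res$Best.nc["Number_clusters", ])` reproduces the vote counts NbClust printed above.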
Code
k = 2
k.fit <- kmeans(means, k)
# kable() cannot coerce a kmeans object directly; show the cluster centres instead
knitr::kable(k.fit$centers)
Code
fviz_cluster(k.fit, means, repel=TRUE, ggtheme = theme_bw())

Code
res.pca <- prcomp(means, scale=TRUE)
facto_summarize(res.pca, "var")
           name      Dim.1      Dim.2     coord      cos2  contrib
Traffic Traffic -0.8891690 -0.1531570 0.8140785 0.8140785 22.70701
Other     Other -0.9153206 -0.1088317 0.8496561 0.8496561 23.69937
Natural Natural  0.8316428 -0.5046107 0.9462617 0.9462617 26.39398
Human     Human  0.1897201  0.9690987 0.9751460 0.9751460 27.19965
Code
fviz_contrib(res.pca, "var")

Code
fit <- eclust(means, "kmeans", k=k)

Code
h.fit <- eclust(means, "hclust", k = k, stand = TRUE, hc_method = "ward.D2")
Code
fviz_dend(h.fit, labels_track_height = 2.5, horiz=TRUE, rect = TRUE, cex = 0.5)

Methods I haven’t really tested out thoroughly

Code
fviz_pca_biplot(res.pca, repel = TRUE, ggtheme = theme_bw())

fviz_silhouette()

Silhouette (Si) analysis is a cluster-validation approach that measures how well each observation is clustered and estimates the average distance between clusters.

Details

  • Observations with a large silhouette Si (almost 1) are very well clustered
  • A small Si (around 0) means that the observation lies between two clusters
  • Observations with a negative Si are probably placed in the wrong cluster
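These definitions are easy to verify by hand. A toy sketch (hypothetical 1-D points, base R only) computing Si for two well-separated pairs:

```r
# Hand-computed silhouette widths for a toy example (hypothetical data).
x  <- c(1, 2, 8, 9)   # two tight, well-separated groups
cl <- c(1, 1, 2, 2)   # a (given) cluster assignment
d  <- as.matrix(dist(x))
si <- sapply(seq_along(x), function(i) {
  a <- mean(d[i, cl == cl[i] & seq_along(x) != i])    # mean dist within own cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(d[i, cl == k])))   # mean dist to nearest other cluster
  (b - a) / max(a, b)
})
round(si, 3)  # all close to 1: every point is well clustered
```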
Code
fviz_silhouette(h.fit)
  cluster size ave.sil.width
1       1    5          0.32
2       2    8          0.32

get_clust_tendency

Before applying clustering methods, the first step is to check whether the data is clusterable at all, a step known as assessing clustering tendency. get_clust_tendency() does this using Hopkins’ statistic and a visual approach.

Details

Hopkins Statistic: If the value of Hopkins statistic is close to 1 (far above 0.5), then we can conclude that the dataset is significantly clusterable.
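For intuition, a simplified Hopkins statistic can be hand-rolled in base R (a sketch of the idea behind get_clust_tendency(), not factoextra’s exact implementation; the function and variable names are my own):

```r
# Simplified Hopkins statistic sketch (base R). Compares nearest-neighbour
# distances of uniform random points (u) against those of real data points (w);
# clustered data leaves empty space, so u >> w and H approaches 1.
hopkins.sketch <- function(X, m = nrow(X) - 1) {
  X <- as.matrix(X)
  # m uniform random points inside the data's bounding box
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  # nearest-neighbour distances for m sampled real points
  idx <- sample(nrow(X), m)
  w <- sapply(idx, function(i)
    min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2))))
  sum(u) / (sum(u) + sum(w))
}

set.seed(1)
blobs <- rbind(matrix(rnorm(20, 0, 0.1), ncol = 2),   # tight blob at (0, 0)
               matrix(rnorm(20, 10, 0.1), ncol = 2))  # tight blob at (10, 10)
hopkins.sketch(blobs)  # well above 0.5 for such strongly clustered data
```

The factoextra equivalent for our data would be along the lines of `get_clust_tendency(means, n = nrow(means) - 1)`, which returns the `$hopkins_stat` and `$plot` shown below.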

$hopkins_stat
[1] 0.6977892

$plot


Citation

BibTeX citation:
@online{mitchell2020,
  author = {Mitchell, Andrew},
  title = {Clustering Analysis in {R,} with `Factoextra` and
    {`NbClust`}},
  date = {2020-06-18},
  url = {https://drandrewmitchell.com//posts/2020-06-18_cluster-analysis-in-r},
  langid = {en}
}
For attribution, please cite this work as:
Mitchell, Andrew. 2020. “Clustering Analysis in R, with `Factoextra` and `NbClust`.” June 18, 2020. https://drandrewmitchell.com//posts/2020-06-18_cluster-analysis-in-r.