Ujjwal Karn's blog

deep learning, computer vision, nlp

Introduction to k-Means clustering in R

Posted on May 29, 2016 (updated August 6, 2016) by ujjwalkarn

k-means clustering partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean; that mean (the centroid) serves as the prototype of its cluster. The R code below is a starting point for k-means clustering in R. The dataset can be downloaded from here.
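To make the "nearest mean" idea concrete, here is a minimal sketch of the usual assign-then-update iteration (Lloyd's algorithm). The function `simple_kmeans` and the toy data are hypothetical and only for illustration; the rest of this post uses R's built-in `kmeans()`, which is more robust and, by default, uses a different algorithm (Hartigan-Wong).

```r
# Illustrative sketch only: simple_kmeans is a hypothetical helper, not part of base R.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # start from k rows chosen at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every row to every center (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # assignment step: each observation joins the cluster with the nearest center
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break   # assignments stable: converged
    cluster <- new_cluster
    # update step: each center moves to the mean of its members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      # keep the old center if a cluster happens to be empty
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# toy data (made up): two well-separated groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

The result depends on the random starting centers, which is why `kmeans()` offers the `nstart` argument to try several starts and keep the best one.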

# Topics Covered
#
# 1. Reading data and Summary Statistics
# 2. Determining the Optimal Number of Clusters
# 3. Running Clustering Algorithm and Visualisations

##############################################################################
#Reading data and Summary Statistics

#change the working directory
setwd("C:\\Users\\ujjwal.karn\\Desktop\\Classification & Clustering")

mydata <- read.csv("data/kmeans_data.csv")

head(mydata)
str(mydata)
summary(mydata)

plot(mydata[c("Sepal.Length", "Sepal.Width")], main="Raw Data")

#standardising the data
mydata <- scale(mydata)

##############################################################################
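#Determining the Optimal Number of Clusters
#http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters/

#k-means starts from random centers, so fix the seed for reproducible results
set.seed(123)

#within-cluster sum of squares for k = 1 (the total sum of squares)
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))

#within-cluster sum of squares for k = 2 to 25
for(i in 2:25){wss[i] <- sum(kmeans(mydata, centers=i)$withinss)}

#look for the "elbow" where adding clusters stops reducing wss sharply
plot(1:25, wss, type="b", xlab="No. of Clusters", ylab="wss")

wss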
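The elbow computation above can also be tried without the CSV file. The sketch below assumes the built-in `iris` measurements as a stand-in for `kmeans_data.csv` (the post's dataset has `Sepal.Length` and `Sepal.Width` columns, but whether it matches `iris` is an assumption), and it uses `nstart` so that each value of k gets several random starts.

```r
# Assumption: the four numeric iris columns stand in for the post's dataset.
x <- scale(iris[, 1:4])

set.seed(42)    # fix the random starts so the curve is reproducible
max_k <- 10     # arbitrary upper limit chosen for this sketch

# total within-cluster sum of squares for k = 1..max_k,
# keeping the best of 25 random starts for each k
wss <- sapply(1:max_k, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

plot(1:max_k, wss, type = "b",
     xlab = "No. of Clusters", ylab = "Total within-cluster SS")
```

`tot.withinss` is the same quantity as `sum(withinss)`; for standardised data its value at k = 1 equals (n - 1) times the number of columns.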

##############################################################################
#Running Clustering Algorithm

# trying with 4 clusters
clus4 <- kmeans(mydata, centers=4, nstart=30)

#check between_SS / total_SS
clus4

# get cluster means
aggregate(mydata, by=list(cluster=clus4$cluster), FUN=mean)

# append cluster assignment to a copy, so that mydata keeps only the scaled features
mydata4 <- data.frame(mydata, cluster=clus4$cluster)

#summary: number of observations in each cluster
table(clus4$cluster)

plot(mydata4[c("Sepal.Length", "Sepal.Width")], col=clus4$cluster)
points(clus4$centers[,c("Sepal.Length", "Sepal.Width")], col=1:4, pch=8, cex=2)

# trying with 3 clusters
# (mydata still holds only the scaled features, so the 4-cluster labels do not leak in)
clus3 <- kmeans(mydata, centers=3, nstart=20)
clus3

# get cluster means
aggregate(mydata, by=list(cluster=clus3$cluster), FUN=mean)

# append cluster assignment to a copy
mydata3 <- data.frame(mydata, cluster=clus3$cluster)

#summary: number of observations in each cluster
table(clus3$cluster)

plot(mydata3[c("Sepal.Length", "Sepal.Width")], col=clus3$cluster)
points(clus3$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=8, cex=2)
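The "check between_SS / total_SS" step above reads the ratio off the printed `kmeans` object. As a rough sketch, the same ratio can be computed directly from the `betweenss` and `totss` components to compare the two choices of k side by side. The built-in `iris` data is again an assumed stand-in for the post's CSV.

```r
# Assumption: iris stands in for kmeans_data.csv.
x <- scale(iris[, 1:4])

set.seed(42)
fit3 <- kmeans(x, centers = 3, nstart = 20)
fit4 <- kmeans(x, centers = 4, nstart = 20)

# share of the total sum of squares explained by each partition
ratio <- c(k3 = fit3$betweenss / fit3$totss,
           k4 = fit4$betweenss / fit4$totss)
round(ratio, 3)

# cluster sizes for each solution
table(fit3$cluster)
table(fit4$cluster)
```

The ratio generally rises as k grows, so a higher value for 4 clusters does not by itself make 4 the better choice; it is worth reading alongside the elbow plot.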
