Introduction

This post follows up on “MDS in theory” on this blog. Here we continue with a worked example: the Groceries dataset from the {arules} package.

Load Groceries dataset

The Groceries dataset records which products were bought together. It contains 9835 purchase transactions over 169 products. In this post we cluster the products using Jaccard dissimilarity and hierarchical clustering, then inspect the clustering results with MDS.
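To make the Jaccard idea concrete, here is a small sketch on made-up data (the `toy` matrix and the `jaccard_dissim()` helper are hypothetical, not part of the original analysis). For two products, the Jaccard dissimilarity is one minus the share of baskets containing both among the baskets containing at least one of them. Base R's `dist()` with `method = "binary"` computes the same quantity.

```r
# Toy data: 5 transactions (rows) x 3 products (columns), 1 = bought
toy <- matrix(c(1, 1, 0,
                1, 0, 0,
                1, 1, 1,
                0, 1, 1,
                0, 0, 1),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("milk", "bread", "beer")))

# Hypothetical helper: Jaccard dissimilarity of two 0/1 vectors
jaccard_dissim <- function(a, b) {
  both   <- sum(a == 1 & b == 1)  # baskets containing both products
  either <- sum(a == 1 | b == 1)  # baskets containing at least one
  1 - both / either
}

d_milk_bread <- jaccard_dissim(toy[, "milk"], toy[, "bread"])
d_milk_bread  # 2 shared baskets out of 4 with either product: 0.5

# Same measure for all product pairs, using base R
d_base <- dist(t(toy), method = "binary")
d_base
```

Products that never appear in the same basket get a dissimilarity of 1, and products that always appear together get 0.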

library(arules)

data(Groceries)
# convert the transactions to a 0/1 matrix (rows = transactions, columns = products)
Groceries <- as(Groceries, "matrix")*1

Hierarchical clustering on products

library(ggdendro)
library(dplyr)
library(ggplot2)

gr.jac <- dissimilarity(t(Groceries), method = "jaccard") # Jaccard dissimilarity between columns (products); only meaningful for binary data

hc <- hclust(gr.jac, method="ward.D2")
cut <- as.data.frame(cutree(hc, k=7))
names(cut) <- "cut"
cut$names <- rownames(cut)

hcdata <- dendro_data(hc, type="triangle")
hcdata$labels <- left_join(hcdata$labels, cut, by=c("label"="names"))
ggplot(hcdata$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = hcdata$labels, aes(x, y, label = label, colour=factor(cut)), 
  hjust = 1, size = 2.9) + scale_colour_discrete(name = "clusters") +
  labs(x="", y="") + coord_flip() + ylim(-1, 2) + xlim(0,170) + theme_bw()

[Plot: dendrogram of the products, labels coloured by cluster]
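The choice of k = 7 in cutree() is a judgment call. As a rough sketch of how one might sanity-check such a choice, the code below looks at cluster sizes for a few values of k and at the cophenetic correlation, which indicates how faithfully the tree preserves the input dissimilarities. It runs on simulated stand-in data (the `baskets` matrix is hypothetical), not on Groceries.

```r
set.seed(1)

# Hypothetical stand-in for the 0/1 transaction matrix: 200 baskets x 12 items
baskets <- matrix(rbinom(200 * 12, 1, 0.3), ncol = 12,
                  dimnames = list(NULL, paste0("item", 1:12)))

# Jaccard dissimilarity between items (columns), then Ward clustering
d      <- dist(t(baskets), method = "binary")
hc_toy <- hclust(d, method = "ward.D2")

# Cluster sizes for a few candidate values of k
sizes <- lapply(2:4, function(k) table(cutree(hc_toy, k = k)))
names(sizes) <- paste0("k=", 2:4)
sizes

# Cophenetic correlation: agreement between the original dissimilarities
# and the merge heights implied by the dendrogram
coph <- cor(as.vector(d), as.vector(cophenetic(hc_toy)))
coph
```

The same two checks should carry over to the real data by swapping in `gr.jac` and `hc`.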

MDS

MDS with cmdscale()

mds.cmdscale <- as.data.frame(cmdscale(as.matrix(gr.jac)))
mds.cmdscale$names <- rownames(mds.cmdscale)
mds.cmdscale$cut <- cut$cut
ggplot(mds.cmdscale, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text(aes(colour=factor(cut)), check_overlap = TRUE, size=2.2, 
  hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.005) + 
  scale_colour_discrete(name = "clusters") +
  labs(x="", y="", title="MDS by Jaccard and cmdscale()") + theme_bw()

[Plot: MDS by Jaccard and cmdscale(), points coloured by cluster]
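cmdscale() can also report how much of the dissimilarity structure the two plotted dimensions capture. A minimal sketch, using the built-in `eurodist` distances as stand-in data rather than the Groceries dissimilarities: with `eig = TRUE` the function returns a list whose `GOF` element holds two goodness-of-fit values based on the eigenvalues.

```r
# Stand-in data: eurodist, a built-in 'dist' object (not the Groceries data)
fit <- cmdscale(eurodist, k = 2, eig = TRUE)

# 2D coordinates, one row per object; the columns are named V1 and V2
coords <- as.data.frame(fit$points)
head(coords)

# Two goodness-of-fit values in [0, 1]; values near 1 mean the first two
# dimensions reproduce most of the structure in the dissimilarities
fit$GOF
```

On the real data the equivalent call would be `cmdscale(as.matrix(gr.jac), k = 2, eig = TRUE)`.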

MDS with smacofSym()

library(smacof)

mds.smacof <- smacofSym(as.matrix(gr.jac))

plotdata <- as.data.frame(mds.smacof$conf)
plotdata$names <- rownames(mds.smacof$conf)
plotdata$cut <- cut$cut
ggplot(plotdata, aes(D1, D2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text(aes(colour=factor(cut)), check_overlap = TRUE, size=2.2, 
  hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.015) + 
  scale_colour_discrete(name = "clusters") +
  labs(x="", y="", title="MDS by Jaccard and smacofSym()") + theme_bw()

[Plot: MDS by Jaccard and smacofSym(), points coloured by cluster]

MDS with isoMDS()

library(MASS)

mds.isomds <- isoMDS(as.matrix(gr.jac), k=2)
plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cut$cut
ggplot(plotdata, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2.3) +
  geom_text(aes(colour=factor(cut)), check_overlap = TRUE, size=2.2, 
  hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.05) + 
  scale_colour_discrete(name = "clusters") +
  labs(x="", y="", title="MDS by Jaccard and isoMDS()") + theme_bw()

[Plot: MDS by Jaccard and isoMDS(), points coloured by cluster]

A few points are not clearly separated on the plots, but overall this approach gives reasonable clusters. The clustering could of course be improved. Possible points of improvement:

  • similarity/dissimilarity measure (here: Jaccard)
  • clustering method (here: hierarchical clustering with the ward.D2 method)
  • MDS method (here: PCA-like classical MDS with the cmdscale() function, then the smacofSym() function, and then the isoMDS() function)
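One possible way to compare MDS variants is to check how well the 2D distances preserve the rank order of the original dissimilarities. The sketch below does this for cmdscale() and isoMDS() on the built-in `eurodist` data as a stand-in. The `rank_fit()` helper is hypothetical, and Spearman correlation is just one reasonable choice of score, not something taken from the analysis above.

```r
library(MASS)  # isoMDS()

# Stand-in data: built-in road distances between 21 European cities
d <- eurodist

fit_classical <- cmdscale(d, k = 2)                 # matrix of coordinates
fit_nonmetric <- isoMDS(d, k = 2, trace = FALSE)    # list with $points and $stress

# Hypothetical helper: rank agreement between the original dissimilarities
# and the Euclidean distances in the 2D configuration
rank_fit <- function(d, points) {
  cor(as.vector(d), as.vector(dist(points)), method = "spearman")
}

fit_scores <- c(cmdscale = rank_fit(d, fit_classical),
                isoMDS   = rank_fit(d, fit_nonmetric$points))
fit_scores

# Note: isoMDS() reports its stress in percent
fit_nonmetric$stress
```

A higher score means the configuration keeps the ordering of the dissimilarities more faithfully. Applied to the post's data, `d` would be `gr.jac`.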

Be happyR! 🙂
