Introduction

One of the most popular similarity indexes is the Jaccard similarity index (or coefficient), which measures the similarity between finite sample sets via the intersection and union of the samples:

J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A|+|B|-|A \cap B|}

The result is 0 \leq J(A,B) \leq 1.

The Jaccard distance is simply D(A,B) = 1 - J(A,B).

There are several implementations of Jaccard similarity/distance calculation in R (clusteval, proxy, prabclus, vegdist, ade4, etc.), and Jaccard similarity can also be built up with basic functions, just see this forum. But these work for binary datasets only.
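For the binary case, a minimal base-R sketch could look like the following (the function name jaccard_binary is only illustrative, it is not taken from any of the packages above):

#plain (binary) Jaccard with base R; x and y are 0/1 vectors of equal length
jaccard_binary <- function(x, y) {
  u <- sum(x | y)                 # size of the union
  if (u == 0) return(0)           # both sets empty
  sum(x & y) / u                  # size of the intersection / size of the union
}
jaccard_binary(c(1,0,1,1,0), c(1,1,0,1,0))
## [1] 0.5
1 - jaccard_binary(c(1,0,1,1,0), c(1,1,0,1,0))  # Jaccard distance
## [1] 0.5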

If one needs to measure the weighted Jaccard coefficient, it is harder to find implementations. The weighted Jaccard definition is simple:

J(x,y) = \frac{\sum\nolimits_{i}\min(x_i,y_i)}{\sum\nolimits_{i}\max(x_i,y_i)}

Fortunately, writing some code is simple as well.
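For a single pair of non-negative vectors the formula can be written directly with pmin() and pmax() (jaccard_weighted is just a name used for this sketch):

#weighted jaccard for one pair of non-negative numeric vectors
jaccard_weighted <- function(x, y) {
  den <- sum(pmax(x, y))
  if (den == 0) return(0)        # both vectors are all zero, see the note later
  sum(pmin(x, y)) / den
}
jaccard_weighted(c(2,5,0,1), c(1,5,3,0))
## [1] 0.5454545

The next section builds the full pairwise matrix for a whole dataset.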

Making a Jaccard similarity matrix

The dataset is a matrix.

A <- matrix(c(2,5,2,1,0,0,0,0,1,0,0,0,0,1,3,5,6,0,0,1,0,0,0,2,0,0,1,2,7,2,4,6,2,5,1,0,0,1,0,0,0,1,0,0,3,5,4,0,0,1,0,0,1,0,0,2,0,3,5,7,3,1,4,0,1,0,0,0,0,2,0,0,0,1,3,4,6,0,0,1), byrow=T, nrow=8, ncol=10)
colnames(A) <- letters[1:10]
rownames(A) <- LETTERS[1:8]
print(A)
## a b c d e f g h i j
## A 2 5 2 1 0 0 0 0 1 0
## B 0 0 0 1 3 5 6 0 0 1
## C 0 0 0 2 0 0 1 2 7 2
## D 4 6 2 5 1 0 0 1 0 0
## E 0 1 0 0 3 5 4 0 0 1
## F 0 0 1 0 0 2 0 3 5 7
## G 3 1 4 0 1 0 0 0 0 2
## H 0 0 0 1 3 4 6 0 0 1

The weighted Jaccard similarity of the rows of the matrix can be seen below. The following code chunk builds the similarity matrix for the rows.

#weighted jaccard similarity matrix setup
sim.jac <- matrix(0, nrow=nrow(A), ncol=nrow(A))
rownames(sim.jac) <- rownames(A)
colnames(sim.jac) <- rownames(A)

#weighted jaccard
pairs <- t(combn(1:nrow(A), 2))
for (i in 1:nrow(pairs)) {
  num <- sum(sapply(1:ncol(A), function(x) min(A[pairs[i,1],x], A[pairs[i,2],x])))
  den <- sum(sapply(1:ncol(A), function(x) max(A[pairs[i,1],x], A[pairs[i,2],x])))
  sim.jac[pairs[i,1],pairs[i,2]] <- num/den
  sim.jac[pairs[i,2],pairs[i,1]] <- num/den
}
sim.jac[which(is.na(sim.jac))] <- 0
diag(sim.jac) <- 1

The last two commands are corrections meaning that:

  • if max(x_i,y_i) = 0 for every i, then the denominator is 0 and the ratio is undefined (the definition of the Jaccard index suggests 1 in this case, but for real data I think it is better to keep 0);
  • every element is perfectly similar to itself.
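For reference, the same matrix can also be built a bit more compactly with pmin()/pmax() applied to whole rows instead of sapply() over columns; this is only an alternative sketch and it should give the same sim.jac for the matrix A above.

#alternative, more vectorized construction (same result)
sim.jac2 <- matrix(0, nrow=nrow(A), ncol=nrow(A),
                   dimnames=list(rownames(A), rownames(A)))
for (i in 1:nrow(A)) {
  for (j in 1:nrow(A)) {
    den <- sum(pmax(A[i,], A[j,]))
    sim.jac2[i,j] <- if (den == 0) 0 else sum(pmin(A[i,], A[j,])) / den
  }
}
all.equal(sim.jac, sim.jac2)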

Application of Jaccard similarity matrix

Hierarchical clustering

dist.jac <- as.dist(1-sim.jac)

hc <- hclust(dist.jac, method = "ward.D2")

cut <- as.data.frame(cutree(hc, k=3))
cut$names <- rownames(cut)
names(cut) <- c("cut", "names")
library(ggplot2)
library(ggdendro)
library(dplyr)

hcdata <- dendro_data(hc, type="triangle")

hcdata$labels <- left_join(hcdata$labels, cut, by=c("label"="names"))
ggplot(hcdata$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = hcdata$labels, aes(x, y, label = label, colour=factor(cut)), 
  hjust = 1, size = 4) +
  scale_color_manual(values=c("red", "blue", "green"), guide = guide_legend(title="clusters")) +
  labs(x="", y="") + coord_flip() + theme_bw()

(Figure: dendrogram of the hierarchical clustering based on the weighted Jaccard distance)

Clusters with MDS

Multi-dimensional scaling (MDS) is an often used dimension reduction method that is suitable for visualizing samples in two dimensions and for recognizing clusters.

mds <- as.data.frame(cmdscale(dist.jac))
mds$names <- rownames(mds)
mds$cut <- cutree(hc, k=3)
ggplot(mds, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2) +
  geom_text(aes(colour=factor(cut)), check_overlap = FALSE, size=2.3, 
  hjust = "center", vjust = "bottom", nudge_x = 0.005, nudge_y = 0.02) + 
  scale_color_manual(values=c("red", "blue", "green"), guide = guide_legend(title="clusters")) +
  xlab("") + ylab("") + theme_bw()

(Figure: MDS plot of the samples coloured by the hierarchical clusters)

As the figure shows, the clusters based on the weighted Jaccard similarity/dissimilarity match well with the groups that the MDS plot shows.


Be happyR! 🙂

See also:

Jaccard definition in Wikipedia
