Introduction

I had a survey data about a department of a big company. How to see themselves and how to see them co-workers of other departments. There were 91 questions about different kind of activities, quality of contacts, personality, processes etc. in 15 module. The possible answers for every question were a six-point Likert scale from completely unsatisfied to completely satisfied. There were 86 responders mainly middle managers, top leaders, and juniors from examined department.

I was interested in clustering of responders and question themes based on answers.

The similarity of responders and question was calculated with a modified Jaccard coefficient with every pair of responder vectors and the second step with question vectors. The essence of Jaccard coefficient is that the union of vectors goes to the denominator and intersection goes to the numerator. Then similarity scores were converted to distance scores. Hierarchical clustering with Ward algorithm was applied to generate clusters of responders and questions in two different steps. It was anonymous questionnaire but there were known leadership experience, department, leadership position.

Analysis of data was given deeper insights to results. Resulted clusters almost speak about attitudes of responders about topics of questions.

Setup

library(reshape2)
library(readxl)
library(ggplot2)
library(dplyr)

Load data

Survey data is in xls format, saved from LimeSurvey.

dat <- read_excel("...my_path...xls")

Some preparation is needed.

dat <- dat[!is.na(dat[,2]),]
data <- dat %>% select(contains("[")) #"[" contains question codes
data <- cbind(data, dat[,8:11]) #question data and responders proterties are binded
data <- as.matrix(data)
data[data==0] <- NA #in the questionaire was 0 answers that means "no information" 

Create a matrix of answers. There were a few responders, therefore, six-point Likert was converted to 3-point Likert.

resp <- data[,1:91]
resp[resp==1 | resp==2] <- -1
resp[resp==3 | resp==4] <- 0
resp[resp==5 | resp==6] <- 1
longData<-melt(resp)
longData<-longData[!is.na(longData$value),]
ggplot(longData, aes(x = Var2, y = Var1)) + 
  geom_raster(aes(fill=factor(value))) +
  labs(x="Questions", y="Responders") +
  scale_fill_manual(values=c("#4da6ff", "gray80", "#ffad33"),
                    name="Answers") +
  theme_bw() +
  theme(axis.text.x = element_text(angle=90, hjust=0, vjust=0, size=4))

plot of chunk survey_data_raw

Clusters of responders

Create similarity matrix

sim.jac <- matrix(0, nrow=nrow(resp), ncol=nrow(resp))
rownames(sim.jac) <- 1:nrow(resp)
colnames(sim.jac) <- 1:nrow(resp)

pairs <- t(combn(1:nrow(resp), 2))

for (i in 1:nrow(pairs)){
  num <- sum(resp[pairs[i,1],]==resp[pairs[i,2],], na.rm=T)
  den <- length(union(which(!is.na(resp[pairs[i,1],])), which(!is.na(resp[pairs[i,2],]))))
  sim.jac[pairs[i,1],pairs[i,2]] <- num/den
  sim.jac[pairs[i,2],pairs[i,1]] <- num/den  
}

sim.jac[which(is.na(sim.jac))] <- 0
diag(sim.jac) <- 1 

Hierarchical clustering

sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2*mx))
dist <- sim2dist(sim.jac)
hc <- hclust(dist, method = "ward.D2")
source("http://addictedtor.free.fr/packages/A2R/lastVersion/R/code.R")
# colored dendrogram
op = par(bg = "#EFEFEF")
A2Rplot(hc, k = 9, boxes = F, col.up = "gray50", col.down = c("#ff9900", 
    "#4ECDC4", "#556270", "#ff66ff", "#00cc00", "#cc0000", "#cccc00", "#ADFF2F", "#000080"), show.labels=F, main=NULL)

plot of chunk responder_clusters

Properties of clusters of responders

The aim of this blog post is to show technical solutions of clustering. The table below was used to evaluate of clusters.

cut <- cutree(hc, k=9)
f <- function(x) {k <- which(x!=0); mean(x[k])}
d <- function(x) {91-sum(is.na(x))-sum(x=0)}
x <- data.frame(cut,
                data[,92:95],
                mean=apply(resp, 1, f),
                nonzero=apply(resp, 1, d))

Clusters of questions

Create similarity matrix

resp <- t(resp)
sim.jac <- matrix(0, nrow=nrow(resp), ncol=nrow(resp))
rownames(sim.jac) <- rownames(resp)
colnames(sim.jac) <- rownames(resp)

pairs <- t(combn(1:nrow(resp), 2))

for (i in 1:nrow(pairs)){
  num <- sum(resp[pairs[i,1],]==resp[pairs[i,2],], na.rm=T)
  den <- length(union(which(!is.na(resp[pairs[i,1],])), which(!is.na(resp[pairs[i,2],]))))
  sim.jac[pairs[i,1],pairs[i,2]] <- num/den
  sim.jac[pairs[i,2],pairs[i,1]] <- num/den  
}

sim.jac[which(is.na(sim.jac))] <- 0
diag(sim.jac) <- 1

Hierarchical clustering

dist <- sim2dist(sim.jac)
hc2 <- hclust(dist, method = "ward.D2")
source("http://addictedtor.free.fr/packages/A2R/lastVersion/R/code.R")
# colored dendrogram
op = par(bg = "#EFEFEF")
A2Rplot(hc2, k = 7, boxes = F, col.up = "gray50", col.down = c("#ff9900", 
    "#4ECDC4", "#556270", "#ff66ff", "#00cc00", "#cc0000", "#cccc00", "#ADFF2F", "#000080"), show.labels=F, main=NULL)

plot of chunk question_clusters

Properties of question clusters

The aim of this blog post is to show technical solutions of clustering. The table below was used to evaluate of question clusters.

cut <- cutree(hc2, k=7)
f <- function(x) {k <- which(x!=0); mean(x[k])}
d2 <- function(x) {86-sum(is.na(x))-sum(x=0)}
x2 <- data.frame(cut,
                Q_number=colnames(data[,1:91]),
                atlag=apply(resp, 1, f),
                nonzero=apply(resp, 1, d2))

Ordered figure and summary

longData$Var1 <- factor(longData$Var1, levels=hc$labels[hc$order]) #ordering of responders
longData$Var2 <- factor(longData$Var2, levels=hc2$labels[hc2$order]) #ordering of questions

pal = c("#ff9900", "#4ECDC4", "#556270", "#ff66ff", "#00cc00", "#cc0000", "#cccc00", "#ADFF2F", "#000080") #color palette

col.pal.y <- data.frame(col=pal,
                        cluster=c(5,3,8,6,2,4,9,7,1))
col.pal.x <- data.frame(col=pal[1:7],
                        cluster=c(3,7,6,1,2,4,5))
clust.y <- data.frame(node.y=rownames(x), cut.y=x$cut)
clust.y <- left_join(clust.y, col.pal.y, by=c("cut.y"="cluster"))
clust.x <- data.frame(node.x=rownames(x2), cut.x=x2$cut)
clust.x <- left_join(clust.x, col.pal.x, by=c("cut.x"="cluster"))

axis.y.color <- as.character(clust.y$col[hc$order]) #colors of question by clusters
axis.x.color <- as.character(clust.x$col[hc2$order]) #colors of responders by clusters
ggplot(longData, aes(x = Var2, y = Var1)) + 
  geom_raster(aes(fill=factor(value))) +
  labs(x="Questions", y="Responders") +
  scale_fill_manual(values=c("#4da6ff", "gray80", "#ffad33"),
                    name="Answers") +
  theme_bw() +
  theme(axis.text.x = element_text(angle=90, hjust=0, vjust=0, size=4, color=axis.x.color),
        axis.text.y = element_text(size=4, color=axis.y.color)) + 
  annotate("segment", x = 17.5, xend = 17.5, y = 0, yend = 86.5, colour = "red", size = 0.2) +
  annotate("segment", x = 28.5, xend = 28.5, y = 0, yend = 86.5, colour = "red", size = 0.2) +
  annotate("segment", x = 43.5, xend = 43.5, y = 0, yend = 86.5, colour = "red", size = 0.2) +
  annotate("segment", x = 62.5, xend = 62.5, y = 0, yend = 86.5, colour = "red", size = 0.2) +
  annotate("segment", x = 80.5, xend = 80.5, y = 0, yend = 86.5, colour = "red", size = 0.2) +
  annotate("segment", x = 83.5, xend = 83.5, y = 0, yend = 86.5, colour = "red", size = 0.2) +
  annotate("segment", x = 0, xend = 91.5, y = 9,5, yend = 9.5, colour = "red", size = 0.2) +
  annotate("segment", x = 0, xend = 91.5, y = 20.5, yend = 20.5, colour = "red", size = 0.2) + 
  annotate("segment", x = 0, xend = 91.5, y = 34.5, yend = 34.5, colour = "red", size = 0.2) + 
  annotate("segment", x = 0, xend = 91.5, y = 43.5, yend = 43.5, colour = "red", size = 0.2) + 
  annotate("segment", x = 0, xend = 91.5, y = 54.5, yend = 54.5, colour = "red", size = 0.2) + 
  annotate("segment", x = 0, xend = 91.5, y = 68.5, yend = 68.5, colour = "red", size = 0.2) + 
  annotate("segment", x = 0, xend = 91.5, y = 77.5, yend = 77.5, colour = "red", size = 0.2) + 
  annotate("segment", x = 0, xend = 91.5, y = 81.5, yend = 81.5, colour = "red", size = 0.2)

plot of chunk survey_data_ordered

Some nontechnical statement about examinations:

  • Satisfied responders are at the bottom of the figure.
  • Responders are mostly unsatisfied with magenta question cluster even satisfied responders. These are about recruitment, resupply, training etc. in which must evolve most.
  • Members of ocher responder cluster are the most unsatisfied. It contains top leaders.
  • Members of Orange responder cluster are the most satisfied. It contains juniors of examined department. Their work was evaluated in this examination.
  • The fewest answers got questions of ocher cluster. All these are about measurement function of HR department.
  • Some responders that are in the green cluster but they are leaders more than 10 years. Their answers are unsatisfied.
  • Answers to questions in orange clusters were most satisfied. These are about workers personality of HR and quality of contacts.

See also:
Jaccard similarity on Wikipedia
Similar survey data examination with Python


Be happyR! 🙂

Advertisements