For novice: Network analysis using Community detection algorithms

I came across an interesting tutorial post by Dr Keith McNulty in LinkedIn.

https://drkeithmcnulty.com/2020/02/11/making-sense-of-the-game-of-thrones-universe-using-community-detection-algorithms/amp/

This article teaches a method of using community detection algorithms. Let's follow them step-by-step and see if I can understand correctly.

The aim of the analysis is to cluster the data based on their relationship and connection. In short, the data has certain connectivity to each other, forming a network, and we want to cluster them into different groups. Some of the data is more closely connected to each other and thus grouping into the same cluster/community. Then, we want to find out the most important data that connecting other data in each cluster.

The article uses a very interesting example, which is a list of characters in Games of Throne. Someone has already done a list of interaction scoring board based on the number of interactions between two characters at a time.

"There are five interaction types. Character A and Character B are connected whenever:
Character A speaks directly after Character B
Character A speaks about Character B
Character C speaks about Character A and Character B
Character A and Character B are mentioned in the same stage direction
Character A and Character B appear in a scene together"

The input of the analysis is a set of edgelist files. Each file contains a column of source character (one edge), corresponding to a target character (to the other edge), assigned with the number of interactions based on the frequency of the interaction stated above (weight column).



This article uses only three libraries:

library(tidyverse)   - a package to make data look nice

library(readr)         - a fast way to read rectangular data (e.g. csv or tsv file)

library(igraph)       - to create graphs and visualised networks


Here are the five main steps in the script:

1. Read and compile the edgefiles, please notice the very useful script about looping the file reading for season 2 - 8. The paste0 function is to glue the words in the bracket together, without leaving any space, in this case, it is to glue the website link with a different number of season-i .

# get s1 edgelist

edgefile_url <- "https://raw.githubusercontent.com/mathbeveridge/gameofthrones/master/data/got-s1-edges.csv"
edgelist <- readr::read_csv(edgefile_url)


# append edglistists for s2-8
for (i in 2:8) {
  edgefile_url <- paste0("https://raw.githubusercontent.com/mathbeveridge/gameofthrones/master/data//got-s", i, "-edges.csv")
  edges <- readr::read_csv(edgefile_url) 
  edgelist <- edgelist %>% 
    dplyr::bind_rows(edges)
}

seasons <- 1:8 # <- adjust if you want to focus on specific seasons

edgelist <- edgelist %>% 
  dplyr::filter(Season %in% seasons)


2. After compiling the edge file, make them a matrix, selectively pick only the columns 1 & 2 (source and target) to form a very rough draft of the node-edge network.

Graph_from_edgelist creates a graph based on a two-column matrix, each row defines one edge. The advantage of using this function is, it can assign a vertex ID to the character matrix, in this case, both the columns 1 & 2 are the names of the characters, which then convert into vertex ID in the graph.

directed refers to a directed graph (TRUE) or not (FALSE). A directed graph is taking into account the negative or positive sign of the values, which mean the edge with negative or positive numbers would not be grouped together. In this GOT case, there is no negative value at all, and thus need not the directed version of the graph.

layout_with_mds refers to layout of the matrix with multidimensional scaling in generating the coordinates in the graph, in short, it means creating a graph with a 3D shape.

# create igraph network with weighted edges

edgelist_matrix <- as.matrix(edgelist[ ,1:2])

got_graph <- igraph::graph_from_edgelist(edgelist_matrix, directed = FALSE) %>% 
  igraph::set.edge.attribute("weight", value = edgelist$Weight)

l <- igraph::layout_with_mds(got_graph)
plot(got_graph, vertex.label = NA, vertex.size = 5, rescale = F, layout = l*0.02)

This gives a look-fancy but meaningless network showing the connectivity of all the compiled edgefile data. Next, a "magical" statistical algorithm will be applied to make sense of network.

3. The "magical" tool is the use of Louvain community detection algorithm. The algorithm and mathematical concept of the Louvain method are quite complicated. A Quora Q&A has a very good explanation from Mr Shankar Iyer, he did a very good job of explaining the concept in a general way, both in short or in a longer version. From what I can digest, Louvain method is about the measurement of the modularity when adding a node into a particular community. The modularity is higher when the node is actually added into the right community, which indicates it is more densely connected with other nodes within the community. The node is sorted to a different group based on the hierarchy approach. Then, when all the nodes have been assigned in their own community/group, each community will be treated as a node/vertex on its own, and the calculation starts again to find and define the connectivity among the communities. The process will be repeated until there is no increase in modularity and no reassignment of communities are possible. 

# run louvain with edge weights

louvain_partition <- igraph::cluster_louvain(got_graph, weights = E(got_graph)$weight)

got_graph$community <- louvain_partition$membership

sizes(louvain_partition) %>% 
  knitr::kable()                kable shows the data in a simple table form

This step returns the community sizes and the members within each community. The characters within the same community interact almost entirely with each other (in particular season of GOT). The following code checks the characters within the community 6th and 7th, where we can see that all five dwarfs are in the community 6th while Black_Jack, KEGS and MULLY are in the 7th community. 

membership(louvain_partition)[which(membership(louvain_partition) %in% c(6,7))] 

##   BALON_DWARF    ROBB_DWARF   RENLY_DWARF STANNIS_DWARF JOFFREY_DWARF    BLACK_JACK          KEGS         MULLY 
##       6             6             6             6             6             7             7             7

4. While the step above has successfully clustered the nodes (the characters of GOT), we still want to further identify the important nodes in each community -- characters who are important in the overall connectedness of each community.

subgraph creates a subgraph of a graph containing only the specific vertices and all the edges among them. v refers to the numeric vector, in this case, all the vertices and the edges of each community will be kept in the resulting graph.

The article demonstrated two approaches to look for the most important node in each community. The first one is to find the vector containing the names of the nodes with the highest degree in each community.

high_degree_nodes <- c()

for (i in 1:8) {
  subgraph <- induced_subgraph(got_graph, v = which(got_graph$community == i))
  degree <-  igraph::degree(subgraph)
  high_degree_nodes[i] <- names(which(degree == max(degree)))
}

high_degree_nodes[]

[1] "DAENERYS"    "SANSA"       "CERSEI"      "JON"         "THEON"       "BALON_DWARF" "BLACK_JACK" 
[8] "ARYA

The results show the main characters who are the most important node in each community, and without much surprise, they are also the leading characters in the drama GOT. The second approach is to find the node that is the best connector of the other nodes by the betweenness centrality.

high_btwn_nodes <- c()

for (i in 1:8) {
  subgraph <- induced_subgraph(got_graph, v = which(got_graph$community == i))
  btwn <-  igraph::betweenness(subgraph)
  high_btwn_nodes[i] <- names(which(btwn == max(btwn)))
}

high_btwn_nodes[]
[1] "JORAH"       "SANSA"       "TYRION"      "SAM"         "ROBB"        "RENLY_DWARF" "BLACK_JACK" 
[8] "ARYA"   

This returns a slightly different output as the first approach. It is up to you to use which approach for the following visualisation step.

5. Visualising the communities
After obtaining the most important nodes according to the highest degree or better betweenness centrality, we can draw the network using the following script, to generate a pretty sphere network.

# give our nodes some properties, incl scaling them by degree and coloring them by community

V(got_graph)$size <- degree(got_graph)/10
V(got_graph)$frame.color <- "white"
V(got_graph)$color <- got_graph$community
V(got_graph)$label <- V(got_graph)$name

# also color edges according to their starting node
edge.start <- ends(got_graph, es = E(got_graph), names = F)[,1]
E(got_graph)$color <- V(got_graph)$color[edge.start]
E(got_graph)$arrow.mode <- 0

# only label central characters

v_labels <- which(V(got_graph)$name %in% high_degree_nodes[c(1:5, 8)])

for (i in 1:length(V(got_graph))) {
  if (!(i %in% v_labels)) {
    V(got_graph)$label[i] <- ""
  }
}

# plot network
l1 <- layout_on_sphere(got_graph)
plot(got_graph, rescale = F, layout = l1)


To make it better in separating the communities, the article applied multi-dimensional scaling to ensure that more distant nodes are less connected.

l2 <- layout_with_mds(got_graph)
plot(got_graph, rescale = F, layout = l2*0.02)


Bang! That's a creation of a pretty visualisation for the characters' interaction in Games of Throne. 

This community detection algorithms have been applied to some social network analysis including:
1. The very beginning idea of the published Louvain method, to study a large Belgian phone call network over a six-month period. The data generated 2.6 million nodes and 6.3 million links, lying within a hierarchy of six levels of communities. The output showed that there's only one very special community having more than 85% of members communicating using French or Dutch, instead of their primary language English. This is thus very interesting to further investigating the psychological or political perspective of the phenomena. 

2. This method is also applied in the real world applications: 
- analysis of online social networks such as Twitter, Flickr, Youtube, and LiveJournal   
- analysis of collaboration of retail transactions

3. And might be useful in medical and life science research
- the study of brain networks using the Louvain method   



References
1. Keith McNulty. Making Sense of the Game of Thrones Universe Using Community Detection Algorithms. https://drkeithmcnulty.com/2020/02/11/making-sense-of-the-game-of-thrones-universe-using-community-detection-algorithms/amp/ 
2. https://www.quora.com/Is-there-a-simple-explanation-of-the-Louvain-Method-of-community-detection
3. H.A. Simon. The Architecture of Complexity. Proceedings of the American Philosophical Society. 106-6: 467 (1962).
4. Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008 (2008).
4. D. Meunier et al. Hierarchical Modularity in Human Brain Functional Networks. Frontiers in Neuroinformatics. 3: 37 (2009).

Comments