Friday, April 27, 2012

Basic graph analytics using igraph

Social network sites such as Facebook and Twitter have become an integral part of people's lives. People interact with each other through different forms of activity, and a lot of information is captured in the social network.  Mining such a network can reveal very useful information that can help an organization gain a competitive advantage.

I recently came across a powerful tool called igraph that provides some very useful graph mining capabilities.  Following are some interesting things that I have found.

Create a Graph

A graph is composed of nodes and edges, both of which can carry a set of properties (name/value pairs). Furthermore, edges can be directed or undirected, and weights can be attached to them.
> library(igraph)
> # Create a directed graph
> g <- graph(c(0,1, 0,2, 1,3, 0,3), directed=T)
> g
Vertices: 4
Edges: 4
Directed: TRUE
Edges:

[0] 0 -> 1
[1] 0 -> 2
[2] 1 -> 3
[3] 0 -> 3
> # Create a directed graph using adjacency matrix
> m <- matrix(runif(4*4), nrow=4)
> m
          [,1]      [,2]      [,3]      [,4]
[1,] 0.4086389 0.2160924 0.1557989 0.2896239
[2,] 0.4669456 0.1071071 0.1290673 0.3715809
[3,] 0.2031678 0.3911691 0.5906273 0.7417764
[4,] 0.8808119 0.7687493 0.9734323 0.4487252
> g <- graph.adjacency(m > 0.5)
> g
Vertices: 4
Edges: 5
Directed: TRUE
Edges:

[0] 2 -> 2
[1] 2 -> 3
[2] 3 -> 0
[3] 3 -> 1
[4] 3 -> 2
> plot(g, layout=layout.fruchterman.reingold)
>
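To make the thresholding step concrete, here is a minimal base R sketch (no igraph required) of what `m > 0.5` selects. It reuses the matrix values printed above, rounded to two decimals, and lists each TRUE cell as a directed edge. The variable names are my own, and this only illustrates the idea; it is not how igraph builds the graph internally.

```r
# Matrix values copied from the output above, rounded to two decimals
m <- matrix(c(0.41, 0.22, 0.16, 0.29,
              0.47, 0.11, 0.13, 0.37,
              0.20, 0.39, 0.59, 0.74,
              0.88, 0.77, 0.97, 0.45),
            nrow = 4, byrow = TRUE)

# Each TRUE cell (i, j) of the logical matrix stands for an edge i -> j
edges <- which(m > 0.5, arr.ind = TRUE)

# Shift to 0-based ids to match the igraph printout in this post
edges0 <- edges - 1
colnames(edges0) <- c("from", "to")

# Sort by source node, then by target node, and print the edge list
print(edges0[order(edges0[, "from"], edges0[, "to"]), ])
```

The five rows correspond to the five edges igraph reports for the thresholded matrix.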
igraph also provides various convenient ways to create patterned graphs
> #Create a full graph
> g1 <- graph.full(4)
> g1
Vertices: 4
Edges: 6
Directed: FALSE
Edges:

[0] 0 -- 1
[1] 0 -- 2
[2] 0 -- 3
[3] 1 -- 2
[4] 1 -- 3
[5] 2 -- 3
> #Create a ring graph
> g2 <- graph.ring(3)
> g2
Vertices: 3
Edges: 3
Directed: FALSE
Edges:

[0] 0 -- 1
[1] 1 -- 2
[2] 0 -- 2
> #Combine 2 graphs
> g <- g1 %du% g2
> g
Vertices: 7
Edges: 9
Directed: FALSE
Edges:

[0] 0 -- 1
[1] 0 -- 2
[2] 0 -- 3
[3] 1 -- 2
[4] 1 -- 3
[5] 2 -- 3
[6] 4 -- 5
[7] 5 -- 6
[8] 4 -- 6
> # Remove the edges 0--1 and 0--2 from the combined graph
> graph.difference(g, graph(c(0,1,0,2), directed=F))
Vertices: 7
Edges: 7
Directed: FALSE
Edges:

[0] 0 -- 3
[1] 1 -- 3
[2] 1 -- 2
[3] 2 -- 3
[4] 4 -- 6
[5] 4 -- 5
[6] 5 -- 6
> # Create a lattice
> g1 = graph.lattice(c(3,4,2))
> # Create a tree
> g2 = graph.tree(12, children=2)
> plot(g1, layout=layout.fruchterman.reingold)
> plot(g2, layout=layout.reingold.tilford)
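The edge counts of the full graph, the ring and their disjoint union can be checked with a small base R sketch. The helper names `full_edges` and `ring_edges` are hypothetical, and node ids here are 1-based (R's convention) rather than the 0-based ids in the igraph printouts.

```r
# Full graph on n nodes: one edge for every unordered pair of nodes
full_edges <- function(n) t(combn(n, 2))

# Ring on n nodes: i -- i+1, plus a closing edge n -- 1
ring_edges <- function(n) cbind(seq_len(n), c(seq_len(n)[-1], 1))

g1 <- full_edges(4)   # 4 nodes, 6 edges
g2 <- ring_edges(3)   # 3 nodes, 3 edges

# Disjoint union: shift the ids of the second graph past the first one
g <- rbind(g1, g2 + 4)

nrow(g)   # number of edges in the union
max(g)    # number of nodes in the union (ids are 1-based)
```

This agrees with the 7 vertices and 9 edges igraph prints for `g1 %du% g2`.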
igraph also provides 2 graph generation mechanisms. A "random graph" generates an edge randomly between any two nodes. "Preferential attachment" assigns a higher probability of creating an edge to an existing node that already has a high in-degree (the "rich get richer" model).
# Generate random graph, fixed probability
> g <- erdos.renyi.game(20, 0.3)
> plot(g, layout=layout.fruchterman.reingold,
  vertex.label=NA, vertex.size=5)

# Generate random graph, fixed number of arcs
> g <- erdos.renyi.game(20, 15, type='gnm')

# Generate preferential attachment graph
> g <- barabasi.game(60, power=1, zero.appeal=1.3)
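Both generation mechanisms are simple enough to sketch in base R. This is a simplified illustration under my own assumptions, not igraph's implementation: `gnp` and `pref_attach` are hypothetical names, each new node adds exactly one edge, and the attachment probability is taken to be proportional to (in-degree + appeal), which is how I read `power=1` together with `zero.appeal`.

```r
set.seed(1)  # only so the sketch is reproducible

# G(n, p): flip a biased coin for every unordered pair of nodes
gnp <- function(n, p) {
  adj <- matrix(FALSE, n, n)
  adj[upper.tri(adj)] <- runif(n * (n - 1) / 2) < p
  adj | t(adj)                 # mirror to make the graph undirected
}

# Preferential attachment: every new node adds one edge to an existing
# node, picked with probability proportional to (in-degree + appeal)
pref_attach <- function(n, appeal = 1.3) {
  indeg  <- numeric(n)
  target <- rep(NA_integer_, n)   # target[i] is the node that i points to
  for (i in 2:n) {
    w <- indeg[seq_len(i - 1)] + appeal
    target[i] <- sample.int(i - 1, 1, prob = w)
    indeg[target[i]] <- indeg[target[i]] + 1
  }
  list(target = target, indeg = indeg)
}

a  <- gnp(20, 0.3)
pa <- pref_attach(60)

sum(a) / 2                                # number of undirected edges
head(sort(pa$indeg, decreasing = TRUE))   # a few nodes collect most edges
```

The appeal term gives nodes with in-degree 0 a non-zero chance of being picked; without it, only nodes that already have incoming edges could ever attract new ones.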

Basic Graph Algorithms

This section covers how to use igraph to perform some very basic graph algorithms.

The Minimum Spanning Tree algorithm finds a tree that connects all the nodes within a connected graph such that the sum of its edge weights is minimal.

# Create the graph and assign random edge weights
> g <- erdos.renyi.game(12, 0.35)
> E(g)$weight <- round(runif(length(E(g))),2) * 50
> plot(g, layout=layout.fruchterman.reingold, 
          edge.label=E(g)$weight)
# Compute the minimum spanning tree
> mst <- minimum.spanning.tree(g)
> plot(mst, layout=layout.reingold.tilford, 
          edge.label=E(mst)$weight)
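For readers who want to see the idea behind a minimum spanning tree, here is a minimal base R sketch of Kruskal's algorithm on a tiny hand-made graph. I am not claiming this is the algorithm igraph runs internally. The edge data and the helper `find_root` are made up for illustration.

```r
# Hypothetical example graph: 4 nodes, 5 weighted undirected edges
edges <- data.frame(
  from   = c(1, 1, 2, 2, 3),
  to     = c(2, 3, 3, 4, 4),
  weight = c(4, 1, 2, 5, 3)
)
n <- 4

# parent[v] points towards the representative of v's component
parent <- seq_len(n)
find_root <- function(v) {
  while (parent[v] != v) v <- parent[v]
  v
}

# Kruskal: scan edges from lightest to heaviest and keep an edge only
# if it joins two components that are still separate
mst <- edges[0, ]
for (k in order(edges$weight)) {
  ra <- find_root(edges$from[k])
  rb <- find_root(edges$to[k])
  if (ra != rb) {
    parent[ra] <- rb                 # merge the two components
    mst <- rbind(mst, edges[k, ])
  }
}

mst                # the chosen edges
sum(mst$weight)    # total weight of the spanning tree
```

A spanning tree of a connected graph with n nodes always has n - 1 edges, so the loop keeps 3 of the 5 edges here.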



The Connected Component algorithm finds the islands of nodes that are interconnected with each other; in other words, one can traverse from one node to another via a path.  Notice that connectivity is symmetric in an undirected graph, but this is not necessarily the case for a directed graph (ie: it is possible that nodeA can reach nodeB, but nodeB cannot reach nodeA).  Therefore in a directed graph there is a concept of "strong" connectivity, which means two nodes are considered connected only when each is reachable from the other.  A "weak" connectivity means nodes are connected when edge direction is ignored.

> g <- graph(c(0, 1, 1, 2, 2, 0, 1, 3, 3, 4, 
               4, 5, 5, 3, 4, 6, 6, 7, 7, 8, 
               8, 6, 9, 10, 10, 11, 11, 9))
# Nodes reachable from node4
> subcomponent(g, 4, mode="out")
[1] 4 5 6 3 7 8
# Nodes who can reach node4
> subcomponent(g, 4, mode="in")
[1] 4 3 1 5 0 2

> clusters(g, mode="weak")
$membership
 [1] 0 0 0 0 0 0 0 0 0 1 1 1
$csize
[1] 9 3
$no
[1] 2

> myc <- clusters(g, mode="strong")
> myc
$membership
 [1] 1 1 1 2 2 2 3 3 3 0 0 0
$csize
[1] 3 3 3 3
$no
[1] 4

> mycolor <- c('green', 'yellow', 'red', 'skyblue')
> V(g)$color <- mycolor[myc$membership + 1]
> plot(g, layout=layout.fruchterman.reingold)
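The difference between reachability, strong connectivity and weak connectivity can be reproduced with a small base R sketch on the same edge list. It uses a reachability matrix, which is fine for 12 nodes but not how a graph library would do it at scale. The helpers `reachability` and `label` are hypothetical names, and node ids are shifted by one because R indexes from 1.

```r
# Edge list from the example above, shifted to R's 1-based indexing
el <- matrix(c(0, 1, 1, 2, 2, 0, 1, 3, 3, 4,
               4, 5, 5, 3, 4, 6, 6, 7, 7, 8,
               8, 6, 9, 10, 10, 11, 11, 9),
             ncol = 2, byrow = TRUE) + 1
n <- 12
adj <- matrix(0, n, n)
adj[el] <- 1

# Reachability matrix: reach[i, j] is TRUE if j can be reached from i
reachability <- function(a) {
  r <- (a + diag(nrow(a))) > 0        # every node reaches itself
  repeat {
    r2 <- ((r * 1) %*% (r * 1)) > 0   # allow paths twice as long
    if (all(r2 == r)) return(r)
    r <- r2
  }
}

reach  <- reachability(adj)
strong <- reach & t(reach)            # reachable in both directions
weak   <- reachability(adj + t(adj))  # ignore edge direction

# Label each node by the smallest node id in its component
label <- function(same) apply(same, 1, function(row) which(row)[1])

which(reach[5, ]) - 1                 # nodes reachable from node 4 (0-based)
length(unique(label(weak)))           # number of weak components
length(unique(label(strong)))         # number of strong components
```

The counts agree with the `clusters` output above: 2 weak components and 4 strong components.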


Shortest Path is probably the most commonly used algorithm in many scenarios; it aims to find the shortest path from nodeA to nodeB.  igraph uses "breadth-first search" if the graph is unweighted (ie: every weight is 1), Dijkstra's algorithm if the weights are positive, and Bellman-Ford's algorithm if there are negatively weighted edges.

> g <- erdos.renyi.game(12, 0.25)
> plot(g, layout=layout.fruchterman.reingold)
> pa <- get.shortest.paths(g, 5, 9)[[1]]
> pa
[1] 5 0 4 9
> V(g)[pa]$color <- 'green'
> E(g)$color <- 'grey'
> E(g, path=pa)$color <- 'red'
> E(g, path=pa)$width <- 3
> plot(g, layout=layout.fruchterman.reingold)
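For the unweighted case, breadth-first search is short enough to sketch in base R. The adjacency list and the function name `bfs_path` are made up for illustration, and node ids are 1-based.

```r
# Hypothetical undirected graph as an adjacency list:
# 1-2, 1-3, 2-4, 3-4, 4-5
adj_list <- list(c(2, 3), c(1, 4), c(1, 4), c(2, 3, 5), c(4))

bfs_path <- function(adj_list, from, to) {
  n       <- length(adj_list)
  prev    <- rep(NA_real_, n)   # prev[w] = node from which w was discovered
  visited <- rep(FALSE, n)
  visited[from] <- TRUE
  queue <- from
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    if (v == to) break
    for (w in adj_list[[v]]) {
      if (!visited[w]) {
        visited[w] <- TRUE
        prev[w] <- v
        queue <- c(queue, w)
      }
    }
  }
  if (!visited[to]) return(numeric(0))   # no path exists
  # Walk the predecessor links back from the target to the source
  path <- to
  while (path[1] != from) path <- c(prev[path[1]], path)
  path
}

bfs_path(adj_list, 1, 5)
```

Because nodes are visited in order of increasing distance from the source, the first time the target is reached the path found has the fewest possible edges.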



Graph Statistics

There are many statistics that we can look at to get a general idea of the shape of the graph.  At the highest level, we can look at summary statistics of the graph. These include ...
  • Size of the graph (number of nodes and edges)
  • Density of the graph measures whether the graph is dense (|E| proportional to |V|^2) or sparse (|E| proportional to |V|)
  • Is the graph well connected (a large portion of nodes can reach each other), or is it disconnected (many islands)?
  • Diameter of the graph measures the longest shortest-path distance between any two nodes
  • Reciprocity measures, in a directed graph, how symmetric the relationships are
  • Distribution of in/out "degrees"
> # Create a random graph
> g <- erdos.renyi.game(200, 0.01)
> plot(g, layout=layout.fruchterman.reingold, 
       vertex.label=NA, vertex.size=3)
> # No of nodes
> length(V(g))
[1] 200
> # No of edges
> length(E(g))
[1] 197
> # Density (No of edges / possible edges)
> graph.density(g)
[1] 0.009899497
> # Number of islands
> clusters(g)$no
[1] 34
> # Global cluster coefficient:
> #(close triplets/all triplets)
> transitivity(g, type="global")
[1] 0.015
> # Edge connectivity, 0 since graph is disconnected
> edge.connectivity(g)
[1] 0
> # Same as graph adhesion
> graph.adhesion(g)
[1] 0
> # Diameter of the graph
> diameter(g)
[1] 18
> # Reciprocity of the graph
> reciprocity(g)
[1] 1
> degree.distribution(g)
[1] 0.135 0.280 0.315 0.110 0.095 0.050 0.005 0.010
> plot(degree.distribution(g), xlab="node degree")
> lines(degree.distribution(g))
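Several of these summary statistics are plain arithmetic on the adjacency matrix. The following base R sketch computes them on a tiny made-up graph. The formulas are the textbook definitions and may differ in details, such as loop handling, from what igraph does.

```r
# Small undirected example graph: edges 1-2, 1-3, 2-3, 3-4
adj <- matrix(0, 4, 4)
el  <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
adj[el] <- 1
adj <- adj + t(adj)             # store each undirected edge both ways

n       <- nrow(adj)
n_edges <- sum(adj) / 2
density <- n_edges / choose(n, 2)   # edges / possible edges

deg <- rowSums(adj)
# Share of nodes with degree 0, 1, 2, ... (first entry is degree 0)
deg_dist <- tabulate(deg + 1, nbins = max(deg) + 1) / n

# Reciprocity: share of directed edges whose reverse edge also exists.
# In an undirected graph every edge is its own reverse, so this is 1.
recip <- sum(adj * t(adj)) / sum(adj)

# The density printed above follows from the same formula:
# 197 edges out of choose(200, 2) = 19900 possible ones
197 / choose(200, 2)
```

Note that the first entry of the degree distribution refers to degree 0, which is also how the `degree.distribution` output above should be read.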



Drilling down a level, we can also look at statistics of each pair of nodes, such as ...
  • Connectivity between two nodes measures the number of distinct paths with no shared edges between them (ie: how many edges need to be removed to disconnect them)
  • Shortest path between two nodes
  • Trust between two nodes (a function of the number of distinct paths and the distance of each path)
> # Create a random graph
> g <- erdos.renyi.game(9, 0.5)
> plot(g, layout=layout.fruchterman.reingold)
> # Compute the shortest path matrix
> shortest.paths(g)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    0    1    3    1    2    2    1    3    2
 [2,]    1    0    2    2    3    2    2    2    1
 [3,]    3    2    0    2    1    2    2    2    1
 [4,]    1    2    2    0    3    1    2    2    1
 [5,]    2    3    1    3    0    3    1    3    2
 [6,]    2    2    2    1    3    0    2    1    1
 [7,]    1    2    2    2    1    2    0    2    1
 [8,]    3    2    2    2    3    1    2    0    1
 [9,]    2    1    1    1    2    1    1    1    0
> # Compute the connectivity matrix
> M <- matrix(rep(0, 81), nrow=9)
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    0    0    0    0    0    0    0    0    0
 [2,]    0    0    0    0    0    0    0    0    0
 [3,]    0    0    0    0    0    0    0    0    0
 [4,]    0    0    0    0    0    0    0    0    0
 [5,]    0    0    0    0    0    0    0    0    0
 [6,]    0    0    0    0    0    0    0    0    0
 [7,]    0    0    0    0    0    0    0    0    0
 [8,]    0    0    0    0    0    0    0    0    0
 [9,]    0    0    0    0    0    0    0    0    0
> for (i in 0:8) {
+   for (j in 0:8) {
+     if (i == j) {
+       M[i+1, j+1] <- -1
+     } else {
+       M[i+1, j+1] <- edge.connectivity(g, i, j)
+     }
+   }
+ }
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]   -1    2    2    3    2    3    3    2    3
 [2,]    2   -1    2    2    2    2    2    2    2
 [3,]    2    2   -1    2    2    2    2    2    2
 [4,]    3    2    2   -1    2    3    3    2    3
 [5,]    2    2    2    2   -1    2    2    2    2
 [6,]    3    2    2    3    2   -1    3    2    3
 [7,]    3    2    2    3    2    3   -1    2    3
 [8,]    2    2    2    2    2    2    2   -1    2
 [9,]    3    2    2    3    2    3    3    2   -1
> 

Centrality Measures

At the fine-grained level, we can look at statistics of individual nodes.  Centrality scores measure the social importance of a node in terms of how "central" it is, based on a number of measures ...
  • Degree centrality gives a higher score to a node that has a high in/out-degree
  • Closeness centrality gives a higher score to a node that has a short path distance to every other node
  • Betweenness centrality gives a higher score to a node that sits on many shortest paths between other node pairs
  • Eigenvector centrality gives a higher score to a node if it connects to many high-scoring nodes
  • Local cluster coefficient measures how strongly a node's neighbors are inter-connected with each other; the more they are, the less important the node becomes
> # Degree
> degree(g)
[1] 2 2 2 2 2 3 3 2 6
> # Closeness (inverse of average dist)
> closeness(g)
[1] 0.4444444 0.5333333 0.5333333 0.5000000
[5] 0.4444444 0.5333333 0.6153846 0.5000000
[9] 0.8000000
> # Betweenness
> betweenness(g)
[1]  0.8333333  2.3333333  2.3333333
[4]  0.0000000  0.8333333  0.5000000
[7]  6.3333333  0.0000000 18.8333333
> # Local cluster coefficient
> transitivity(g, type="local")
[1] 0.0000000 0.0000000 0.0000000 1.0000000
[5] 0.0000000 0.6666667 0.0000000 1.0000000
[9] 0.1333333
> # Eigenvector centrality
> evcent(g)$vector
[1] 0.3019857 0.4197153 0.4197153 0.5381294
[5] 0.3019857 0.6693142 0.5170651 0.5381294
[9] 1.0000000
> # Now rank them
> order(degree(g))
[1] 1 2 3 4 5 8 6 7 9
> order(closeness(g))
[1] 1 5 4 8 2 3 6 7 9
> order(betweenness(g))
[1] 4 8 6 1 5 2 3 7 9
> order(evcent(g)$vector)
[1] 1 5 2 3 7 4 8 6 9
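To show what some of these scores actually compute, here is a base R sketch on a tiny made-up graph (a triangle 1-2-3 with node 4 hanging off node 3). It covers degree, closeness as the inverse of the average distance, the local cluster coefficient and eigenvector centrality by power iteration. These are textbook definitions written under my own assumptions, not igraph's code, and betweenness is left out to keep the sketch short.

```r
# Undirected example graph: edges 1-2, 1-3, 2-3, 3-4
adj <- matrix(0, 4, 4)
adj[rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))] <- 1
adj <- adj + t(adj)
n <- nrow(adj)

deg <- rowSums(adj)

# All-pairs shortest path lengths (Floyd-Warshall)
d <- ifelse(adj == 1, 1, Inf)
diag(d) <- 0
for (k in seq_len(n)) d <- pmin(d, outer(d[, k], d[k, ], "+"))

# Closeness: inverse of the average distance to the other nodes
closeness <- (n - 1) / rowSums(d)

# Local clustering: triangles through a node / pairs of its neighbours
triangles <- diag(adj %*% adj %*% adj) / 2
local_cc  <- triangles / choose(deg, 2)    # NaN when degree < 2

# Eigenvector centrality by power iteration, scaled so the max is 1
ev <- rep(1, n)
for (iter in 1:200) {
  ev <- as.vector(adj %*% ev)
  ev <- ev / max(ev)
}

round(cbind(deg, closeness, local_cc, ev), 3)
```

Node 3 comes out on top for degree, closeness and eigenvector centrality, yet has the lowest cluster coefficient of the triangle nodes, because its neighbor 4 is not connected to the others.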

From his studies, Drew Conway has found that people with low Eigenvector centrality but high Betweenness centrality are important gatekeepers, while people with high Eigenvector centrality but low Betweenness centrality have direct contact with important persons.  So let's plot Eigenvector centrality against Betweenness centrality.
> # Create a graph
> g1 <- barabasi.game(100, directed=F)
> g2 <- barabasi.game(100, directed=F)
> g <- g1 %u% g2
> lay <- layout.fruchterman.reingold(g)
> # Plot the eigenvector and betweenness centrality
> plot(evcent(g)$vector, betweenness(g))
> # Label each point with its node id (0 to 99 for 100 nodes)
> text(evcent(g)$vector, betweenness(g), 0:99, 
       cex=0.6, pos=4)
> V(g)[12]$color <- 'red'
> V(g)[8]$color <- 'green'
> plot(g, layout=lay, vertex.size=8, 
       vertex.label.cex=0.6)



With these basics of graph mining covered, in future posts I will cover some specific examples of social network analysis.

6 comments:

Toni said...

Very interesting, in this moment i'm trying to use Igraph in my work. Your post is very useful, thank you!

Unknown said...

Thanks for the tutorial, is a very good start point for network analysis with i-graph

Francois said...

Thanks. Very insightful post.
"Drew Conway has found that people with low Eigenvector centrality but high Betweenness centrality are important gate keepers, while people with high Eigenvector centrality but low Betweenness centrality has direct contact to important persons."

This got confirmed to me when I had a node with all edges going to it but no edges going out of it... hence the eigenvector centrality was 1 and the betweeness was 0 :-)

Unknown said...

Hi,i have a problem, i want to plot the degree distribution of my adjacency matrix; i already coerced in object "graph" in order to work with "igraph". The matrix is correct, the graph (g) is undirected, with 88 nodes and 227 edges. I want just the degree distribution and its "plotting" but the function "degree.distribution(g)" return "NULL"..it doesn't work, neither the "plot" what can i do? i'm not an expert
thanks

Unknown said...

p.s.
not even with the "erdos-renyi graph", it doesn't work

Bzh4469 said...

Thanks this tutorial is very useful.
But how to create a plot of the average clustering coefficient of nodes with degree k depending on k, Ck = f(k)