I recently come across a powerful tools called igraph that provides some very powerful graph mining capabilities. Following are some interesting things that I have found.
Create a Graph
Graph is composed of Nodes and Edges, both of them can be attached with a set of properties (name/value pairs). Furthermore, edges can be directed or undirected and weights can be attached to it.> library(igraph)
> # Create a directed graph
> g <- graph(c(0,1, 0,2, 1,3, 0,3), directed=T)
> g
Vertices: 4
Edges: 4
Directed: TRUE
Edges:
[0] 0 -> 1
[1] 0 -> 2
[2] 1 -> 3
[3] 0 -> 3
> # Create a directed graph using adjacency matrix
> m <- matrix(runif(4*4), nrow=4)
> m
[,1] [,2] [,3] [,4]
[1,] 0.4086389 0.2160924 0.1557989 0.2896239
[2,] 0.4669456 0.1071071 0.1290673 0.3715809
[3,] 0.2031678 0.3911691 0.5906273 0.7417764
[4,] 0.8808119 0.7687493 0.9734323 0.4487252
> g <- graph.adjacency(m > 0.5)
> g
Vertices: 4
Edges: 5
Directed: TRUE
Edges:
[0] 2 -> 2
[1] 2 -> 3
[2] 3 -> 0
[3] 3 -> 1
[4] 3 -> 2
> plot(g, layout=layout.fruchterman.reingold)
>
iGraph also provide various convenient ways to create patterned graphs> #Create a full graph
> g1 <- graph.full(4)
> g1
Vertices: 4
Edges: 6
Directed: FALSE
Edges:
[0] 0 -- 1
[1] 0 -- 2
[2] 0 -- 3
[3] 1 -- 2
[4] 1 -- 3
[5] 2 -- 3
> #Create a ring graph
> g2 <- graph.ring(3)
> g2
Vertices: 3
Edges: 3
Directed: FALSE
Edges:
[0] 0 -- 1
[1] 1 -- 2
[2] 0 -- 2
> #Combine 2 graphs
> g <- g1 %du% g2
> g
Vertices: 7
Edges: 9
Directed: FALSE
Edges:
[0] 0 -- 1
[1] 0 -- 2
[2] 0 -- 3
[3] 1 -- 2
[4] 1 -- 3
[5] 2 -- 3
[6] 4 -- 5
[7] 5 -- 6
[8] 4 -- 6
> graph.difference(g, graph(c(0,1,0,2), directed=F))
Vertices: 7
Edges: 7
Directed: FALSE
Edges:
[0] 0 -- 3
[1] 1 -- 3
[2] 1 -- 2
[3] 2 -- 3
[4] 4 -- 6
[5] 4 -- 5
[6] 5 -- 6
> # Create a lattice
> g1 = graph.lattice(c(3,4,2))
> # Create a tree
> g2 = graph.tree(12, children=2)
> plot(g1, layout=layout.fruchterman.reingold)
> plot(g2, layout=layout.reingold.tilford)
iGraph also provides 2 graph generation mechanism. "Random graph" is to generate an edge randomly between any two nodes. "Preferential attachment" is to assign a higher probably to create an edge to an existing node which has a high in-degree already (the rich gets richer model).# Generate random graph, fixed probability
> g <- erdos.renyi.game(20, 0.3)
> plot(g, layout=layout.fruchterman.reingold,
vertex.label=NA, vertex.size=5)
# Generate random graph, fixed number of arcs
> g <- erdos.renyi.game(20, 15, type='gnm')
# Generate preferential attachment graph
> g <- barabasi.game(60, power=1, zero.appeal=1.3)
Basic Graph Algorithms
This section will cover how to use iGraph to perform some very basic graph algorithm.Minimum Spanning Tree algorithm is to find a Tree that connect all the nodes within a connected graph while the sum of edges weight is minimum.
# Create the graph and assign random edge weights
> g <- erdos.renyi.game(12, 0.35)
> E(g)$weight <- round(runif(length(E(g))),2) * 50
> plot(g, layout=layout.fruchterman.reingold,
edge.label=E(g)$weight)
# Compute the minimum spanning tree
> mst <- minimum.spanning.tree(g)
> plot(mst, layout=layout.reingold.tilford,
edge.label=E(mst)$weight)
Connected Component algorithms is to find the island of nodes that are interconnected with each other, in other words, one can traverse from one node to another one via a path. Notice that connectivity is symmetric in undirected graph, it is not the necessary the case for directed graph (ie: it is possible that nodeA can reach nodeB, then nodeB cannot reach nodeA). Therefore in directed graph, there is a concept of "strong" connectivity which means both nodes are considered connected only when it is reachable in both direction. A "weak" connectivity means nodes are connected
> g <- graph(c(0, 1, 1, 2, 2, 0, 1, 3, 3, 4,
4, 5, 5, 3, 4, 6, 6, 7, 7, 8,
8, 6, 9, 10, 10, 11, 11, 9))
# Nodes reachable from node4
> subcomponent(g, 4, mode="out")
[1] 4 5 6 3 7 8
# Nodes who can reach node4
> subcomponent(g, 4, mode="in")
[1] 4 3 1 5 0 2
> clusters(g, mode="weak")
$membership
[1] 0 0 0 0 0 0 0 0 0 1 1 1
$csize
[1] 9 3
$no
[1] 2
> myc <- clusters(g, mode="strong")
> myc
$membership
[1] 1 1 1 2 2 2 3 3 3 0 0 0
$csize
[1] 3 3 3 3
$no
[1] 4
> mycolor <- c('green', 'yellow', 'red', 'skyblue')
> V(g)$color <- mycolor[myc$membership + 1]
> plot(g, layout=layout.fruchterman.reingold)
Shortest Path is almost the most commonly used algorithm in many scenarios, it aims to find the shortest path from nodeA to nodeB. In iGraph, it use "breath-first search" if the graph is unweighted (ie: weight is 1) and use Dijkstra's algo if the weights are positive, otherwise it will use Bellman-Ford's algorithm for negatively weighted edges.
> g <- erdos.renyi.game(12, 0.25)
> plot(g, layout=layout.fruchterman.reingold)
> pa <- get.shortest.paths(g, 5, 9)[[1]]
> pa
[1] 5 0 4 9
> V(g)[pa]$color <- 'green'
> E(g)$color <- 'grey'
> E(g, path=pa)$color <- 'red'
> E(g, path=pa)$width <- 3
> plot(g, layout=layout.fruchterman.reingold)
Graph Statistics
There are many statistics that we can look to get a general ideas of the shape of the graph. At the highest level, we can look at summarized statistics of the graph. This includes ...- Size of the graph (number of nodes and edges)
- Density of the graph measure weither the graph dense (|E| proportional to |V|^2) or sparse (|E| proportional to |V|) ?
- Is the graph very connected (large portion of nodes can reach each other), or is it disconnected (many islands) ?
- Diameter of the graph measure the longest distance between any two nodes
- Reciprocity measures in a directed graph, how symmetric the relationships are
- Distribution of in/out "degrees"
> # Create a random graph
> g <- erdos.renyi.game(200, 0.01)
> plot(g, layout=layout.fruchterman.reingold,
vertex.label=NA, vertex.size=3)
> # No of nodes
> length(V(g))
[1] 200
> # No of edges
> length(E(g))
[1] 197
> # Density (No of edges / possible edges)
> graph.density(g)
[1] 0.009899497
> # Number of islands
> clusters(g)$no
[1] 34
> # Global cluster coefficient:
> #(close triplets/all triplets)
> transitivity(g, type="global")
[1] 0.015
> # Edge connectivity, 0 since graph is disconnected
> edge.connectivity(g)
[1] 0
> # Same as graph adhesion
> graph.adhesion(g)
[1] 0
> # Diameter of the graph
> diameter(g)
[1] 18
> # Reciprocity of the graph
> reciprocity(g)
[1] 1
> # Diameter of the graph
> diameter(g)
[1] 18
> # Reciprocity of the graph
> reciprocity(g)
[1] 1
> degree.distribution(g)
[1] 0.135 0.280 0.315 0.110 0.095 0.050 0.005 0.010
> plot(degree.distribution(g), xlab="node degree")
> lines(degree.distribution(g))
Drill down a level, we can also look at statistics of each pair of nodes, such as ...
- Connectivity between two nodes measure the distinct paths with no shared edges between two nodes. (ie: how much edges need to be removed to disconnect them)
- Shortest path between two nodes
- Trust between two nodes (a function of number of distinct path and distance of each path)
> # Create a random graph
> g <- erdos.renyi.game(9, 0.5)
> plot(g, layout=layout.fruchterman.reingold)
> # Compute the shortest path matrix
> shortest.paths(g)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 1 3 1 2 2 1 3 2
[2,] 1 0 2 2 3 2 2 2 1
[3,] 3 2 0 2 1 2 2 2 1
[4,] 1 2 2 0 3 1 2 2 1
[5,] 2 3 1 3 0 3 1 3 2
[6,] 2 2 2 1 3 0 2 1 1
[7,] 1 2 2 2 1 2 0 2 1
[8,] 3 2 2 2 3 1 2 0 1
[9,] 2 1 1 1 2 1 1 1 0
> # Compute the connectivity matrix
> M <- matrix(rep(0, 81), nrow=9)
> M
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 0 0
[7,] 0 0 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 0 0 0 0
[9,] 0 0 0 0 0 0 0 0 0
> for (i in 0:8) {
+ for (j in 0:8) {
+ if (i == j) {
+ M[i+1, j+1] <- -1
+ } else {
+ M[i+1, j+1] <- edge.connectivity(g, i, j)
+ }
+ }
+ }
> M
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] -1 2 2 3 2 3 3 2 3
[2,] 2 -1 2 2 2 2 2 2 2
[3,] 2 2 -1 2 2 2 2 2 2
[4,] 3 2 2 -1 2 3 3 2 3
[5,] 2 2 2 2 -1 2 2 2 2
[6,] 3 2 2 3 2 -1 3 2 3
[7,] 3 2 2 3 2 3 -1 2 3
[8,] 2 2 2 2 2 2 2 -1 2
[9,] 3 2 2 3 2 3 3 2 -1
>
Centrality Measures
At the fine grain level, we can look at statistics of individual nodes. Centrality score measure the social importance of a node in terms of how "central" it is based on a number of measures ...- Degree centrality gives a higher score to a node that has a high in/out-degree
- Closeness centrality gives a higher score to a node that has short path distance to every other nodes
- Betweenness centrality gives a higher score to a node that sits on many shortest path of other node pairs
- Eigenvector centrality gives a higher score to a node if it connects to many high score nodes
- Local cluster coefficient measures how my neighbors are inter-connected with each other, which means the node becomes less important.
> # Degree
> degree(g)
[1] 2 2 2 2 2 3 3 2 6
> # Closeness (inverse of average dist)
> closeness(g)
[1] 0.4444444 0.5333333 0.5333333 0.5000000
[5] 0.4444444 0.5333333 0.6153846 0.5000000
[9] 0.8000000
> # Betweenness
> betweenness(g)
[1] 0.8333333 2.3333333 2.3333333
[4] 0.0000000 0.8333333 0.5000000
[7] 6.3333333 0.0000000 18.8333333
> # Local cluster coefficient
> transitivity(g, type="local")
[1] 0.0000000 0.0000000 0.0000000 1.0000000
[5] 0.0000000 0.6666667 0.0000000 1.0000000
[9] 0.1333333
> # Eigenvector centrality
> evcent(g)$vector
[1] 0.3019857 0.4197153 0.4197153 0.5381294
[5] 0.3019857 0.6693142 0.5170651 0.5381294
[9] 1.0000000
> # Now rank them
> order(degree(g))
[1] 1 2 3 4 5 8 6 7 9
> order(closeness(g))
[1] 1 5 4 8 2 3 6 7 9
> order(betweenness(g))
[1] 4 8 6 1 5 2 3 7 9
> order(evcent(g)$vector)
[1] 1 5 2 3 7 4 8 6 9
From his studies, Drew Conway has found that people with low Eigenvector centrality but high Betweenness centrality are important gate keepers, while people with high Eigenvector centrality but low Betweenness centrality has direct contact to important persons. So lets plot Eigenvector centrality against Betweenness centrality.
> # Create a graph
> g1 <- barabasi.game(100, directed=F)
> g2 <- barabasi.game(100, directed=F)
> g <- g1 %u% g2
> lay <- layout.fruchterman.reingold(g)
> # Plot the eigevector and betweenness centrality
> plot(evcent(g)$vector, betweenness(g))
> text(evcent(g)$vector, betweenness(g), 0:100,
cex=0.6, pos=4)
> V(g)[12]$color <- 'red'
> V(g)[8]$color <- 'green'
> plot(g, layout=lay, vertex.size=8,
vertex.label.cex=0.6)
With this basic of graph mining, in future posts I will cover some specific examples of social network analysis.