Take-home Exercise 6

Visualising and Analysing Community Network

Raveena Chakrapani (https://www.linkedin.com/in/raveena-chakrapani-444a60174/)
School of Computing and Information Systems, Singapore Management University (https://scis.smu.edu.sg/master-it-business)
2022-06-05

1. Overview

In this Take-home Exercise 6, I explore the patterns of community interaction among residents of the city of Engagement, Ohio, USA using a social network analysis approach.

2. Task

3. Getting Started

3.1 Installing and loading the required libraries

In this exercise, four network data modelling and visualisation packages will be installed and launched: igraph, tidygraph, ggraph and visNetwork. Besides these four packages, tidyverse and lubridate, an R package specially designed for handling and wrangling time data, will also be installed and launched.

The code chunk below is used to install and load the required packages onto RStudio.

packages = c('igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock',
             'tidyverse', 'graphlayouts',
             'patchwork', 'DT', 'knitr',
             'zoo', 'ggdist')
for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Importing data

As the social network data file is large, it was processed beforehand and saved to file. There are two data sets: one contains the nodes data and the other contains the edges (also known as links) data. The edges data records the social interactions between participants, while the nodes data holds the participants' other attributes. Let's import the social network data into the RStudio environment using read_csv() of the readr package; the aggregated edges prepared later are saved and reloaded as rds files with saveRDS() and readRDS().

data <- read_csv("data/SocialNetwork.csv")
data <- data %>%
  mutate(Date = as_date(timestamp)) %>%
  mutate(Weekday=weekdays(Date),Month=zoo::as.yearmon(Date,"%Y %m"))
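As a quick sanity check of the derived date fields, here is a self-contained sketch using a hypothetical timestamp (not the actual data):

```r
library(lubridate)

# Hypothetical timestamp illustrating the fields derived above
ts <- ymd_hms("2022-07-15 09:30:00")

as_date(ts)                   # Date: "2022-07-15"
weekdays(as_date(ts))         # day name (locale-dependent), e.g. "Friday"
zoo::as.yearmon(as_date(ts))  # year-month, e.g. "Jul 2022"
```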

Which months had the most and the fewest interactions?

count_interactions <- data %>%
  group_by(Month) %>%
  tally()
count_interactions[order(-count_interactions$n),][1:2,]
count_interactions[order(count_interactions$n),][1:2,]
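The same lookup can be expressed with dplyr's slice_max() and slice_min(), which avoids the base-R indexing above. A minimal sketch with toy counts (the column names Month and n match those created by tally(); the values are illustrative only):

```r
library(dplyr)

# Toy counts standing in for count_interactions
count_interactions <- tibble(
  Month = c("Mar 2022", "Apr 2022", "Jul 2022"),
  n     = c(120, 250, 400)
)

slice_max(count_interactions, n, n = 2)  # months with the most interactions
slice_min(count_interactions, n, n = 2)  # months with the fewest interactions
```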

Revealing patterns of the most interactive month

It is clear that July had the most interactions and March the fewest. Let's zoom in on July and examine its attributes. The edges data contains individual interaction records, which are not directly useful for analysis or visualisation, so let us aggregate them by sender, receiver and day of the week. Four functions from the dplyr package are used: filter(), group_by(), summarise() and ungroup(). A new field called Weight is added to data_edges_aggregated, which is then saved as an rds file.

data_edges_aggregated <- data %>%
  filter(timestamp >= "2022-07-01" & timestamp <= "2022-07-31") %>%
  group_by(participantIdFrom , participantIdTo, Weekday) %>%
    dplyr::summarise(Weight = n()) %>%
  filter(participantIdFrom!=participantIdTo) %>%
  filter(Weight > 1) %>%
  ungroup()
saveRDS(data_edges_aggregated,'data/july2022_social_data.rds')
data_edges <- readRDS("data/july2022_social_data.rds")
data_nodes <- read_csv("data/participants.csv")
data_nodes$ageGroup <- cut(data_nodes$age,
                           breaks = c(17, 35, 60),
                           labels = c('Young Adults', 'Old Adults'),
                           include.lowest = TRUE)

Analytics

Before proceeding with the network analysis, let us first examine the data. The weights calculated here are point estimates: a point estimate is a single number, such as a mean. The uncertainty around it can be expressed as a standard error, a confidence interval or a credible interval.
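As a minimal base-R illustration of these ideas (hypothetical weights, not the actual data; note this sketch uses the textbook sd/sqrt(n) form, whereas the summary chunk below divides by sqrt(n-1)):

```r
# Hypothetical interaction weights
w <- c(2, 3, 2, 4, 3, 2, 5)

m  <- mean(w)                  # point estimate
se <- sd(w) / sqrt(length(w))  # standard error of the mean

# An approximate 95% confidence interval around the mean
c(mean = m, lower = m - 1.96 * se, upper = m + 1.96 * se)
```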

data_edges$Weekday <- factor(data_edges$Weekday,
                             levels = c('Sunday','Monday','Tuesday','Wednesday',
                                        'Thursday','Friday','Saturday'))
my_sum <-  data_edges %>%
  group_by(Weekday) %>%
  dplyr::summarise(
    n=n(),
    mean=mean(Weight),
    sd=sd(Weight)
    ) %>%
  mutate(se=sd/sqrt(n-1))

knitr::kable(head(my_sum), format = 'html')
Weekday        n      mean         sd         se
Sunday     23504  2.729238  0.7992271  0.0052132
Monday     26640  2.718243  0.7881383  0.0048288
Tuesday    26428  2.670425  0.7917778  0.0048706
Wednesday  25204  2.625218  0.7940233  0.0050016
Thursday   23954  2.588211  0.7889700  0.0050978
Friday     28372  2.747286  1.0521967  0.0062468

The code chunk below is used to reveal the standard error of the mean weight score by weekday.

ggplot(my_sum) +
  geom_errorbar(
    aes(x=Weekday, 
        ymin=mean-se, 
        ymax=mean+se,
        y=mean), 
    width=0.2, 
    colour="black", 
    alpha=0.9, 
    size=0.5) +
  geom_point(aes
           (x=Weekday, 
            y=mean), 
           stat="identity", 
           color="red",
           size = 1.5,
           alpha=1) +
  xlab('Weekday')+
  ylab('Mean')+
  labs(title="Standard error of Mean weight by Weekdays")+
  theme(panel.background = element_rect(fill = "#BFD5E3", colour = "#6D9EC1",
                                size = 2, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white"),
  axis.line = element_line(size = 1.5, colour = "black"),
  axis.ticks = element_line(colour = "black", size = rel(2)),
  plot.title = element_text(size=15, hjust = 0.5),
  axis.title.y.left = element_text(vjust = 0.5),
  axis.title.y=element_text(angle=0))

Observation

From the table and plot, the mean weight is lowest on Wednesday and Thursday, while Friday shows both the highest mean weight and the largest standard error.

In the code chunk below, stat_gradientinterval() of ggdist is used to build a visual for displaying distribution of weights by weekdays.

data_edges %>%
  ggplot(aes(x=Weekday, 
             y = Weight)) +
  stat_gradientinterval(
    fill = "skyblue",
    show.legend = TRUE
  ) +
  labs(
    title = "Confidence intervals of mean Weight",
    subtitle = "Gradient + interval plot")+
  theme(plot.title = element_text(size=15, hjust = 0.5),
        plot.subtitle = element_text(size=15, hjust = 0.5),
        axis.title.y.left = element_text(vjust = 0.5),
        axis.title.y=element_text(angle=0))

Observation

The graph above encodes a range of confidence levels through varying colour intensity, with darker bands indicating intervals closer to the mean.

Reviewing the data

glimpse(data_edges)
Rows: 181,836
Columns: 4
$ participantIdFrom <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ participantIdTo   <dbl> 282, 282, 282, 282, 282, 282, 282, 300, 30~
$ Weekday           <fct> Friday, Monday, Saturday, Sunday, Thursday~
$ Weight            <int> 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~

Firstly, let's visualise the social interactions of participants whose joviality index is above 0.95. Then, let's create an edge file containing only those participants.

jov0.9 <- data_nodes %>%
  filter(joviality > 0.95)

jovEdge <- data_edges %>%
  filter(participantIdFrom %in% jov0.9$participantId & participantIdTo %in% jov0.9$participantId)

jovNode <-jov0.9 %>%
  filter(participantId %in% jovEdge$participantIdFrom & participantId %in% jovEdge$participantIdTo)
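The %in% filters above are equivalent to semi-joins. A self-contained sketch with toy ids (the column names participantId, participantIdFrom and participantIdTo match those used above; the data are illustrative only):

```r
library(dplyr)

# Toy participants and interactions
nodes <- tibble(participantId = 1:4,
                joviality     = c(0.99, 0.50, 0.97, 0.20))
edges <- tibble(participantIdFrom = c(1, 1, 2, 3),
                participantIdTo   = c(3, 2, 3, 1))

hi <- filter(nodes, joviality > 0.95)

# Keep only edges whose endpoints are both high-joviality participants
edges %>%
  semi_join(hi, by = c("participantIdFrom" = "participantId")) %>%
  semi_join(hi, by = c("participantIdTo"   = "participantId"))
```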

Building data model

The code chunk below builds a tidygraph network graph data frame using graph_from_data_frame() of the igraph package and as_tbl_graph() of the tidygraph package.

social_graph <- graph_from_data_frame(jovEdge, vertices = jovNode) %>%
  as_tbl_graph() %>%
  activate(edges) %>%
  arrange(desc(Weight))
social_graph
# A tbl_graph: 46 nodes and 1506 edges
#
# A directed multigraph with 1 component
#
# Edge Data: 1,506 x 4 (active)
   from    to Weekday  Weight
  <int> <int> <chr>     <int>
1     4     5 Friday        5
2     4     5 Saturday      5
3     5     4 Friday        5
4     5     4 Saturday      5
5     6    13 Friday        5
6     8    11 Friday        5
# ... with 1,500 more rows
#
# Node Data: 46 x 8
  name  householdSize haveKids   age educationLevel interestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 7                 3 TRUE        27 Bachelors      A            
2 82                2 FALSE       40 Graduate       H            
3 104               2 FALSE       47 HighSchoolOrC~ F            
# ... with 43 more rows, and 2 more variables: joviality <dbl>,
#   ageGroup <chr>
g <- ggraph(social_graph, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 3)) +
  geom_node_point(aes(colour = educationLevel, 
                      size = ageGroup))
g + theme_graph() +
  labs(colour = "Education Level",
       title = "Do birds of the same feather flock together?",
       subtitle = "During July 2022 (most interactive month)") +
  theme(plot.title = element_text(size=15, hjust = 0.5),
        plot.subtitle = element_text(size = 10, hjust = 0.5),
        text = element_text(size=12, face="bold"))

Observations:

set_graph_style()
g <- ggraph(social_graph, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = educationLevel), 
                  size = 2)
g + facet_nodes(~interestGroup)+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  labs(title="Why don't people in the same interest groups (B, C, D) interact with each other?") +
  theme(plot.title = element_text(size=15, hjust = 0.5),
  legend.position = 'bottom',
  text = element_text(size=12,face="bold"))

Observation

Revealing patterns of the least interactive month

It is understood that March had the fewest interactions. Let's zoom in on March and examine its attributes. As before, the individual interaction records in the edges data are aggregated by sender, receiver and day of the week using filter(), group_by(), summarise() and ungroup() of the dplyr package. A new field called Weight is added to data_edges_aggregated, which is then saved as an rds file.

data_edges_aggregated <- data %>%
  filter(timestamp >= "2022-03-01" & timestamp <= "2022-03-31") %>%
  group_by(participantIdFrom , participantIdTo, Weekday) %>%
    dplyr::summarise(Weight = n()) %>%
  filter(participantIdFrom!=participantIdTo) %>%
  filter(Weight > 1) %>%
  ungroup()
saveRDS(data_edges_aggregated,'data/march2022_social_data.rds')
data_edges <- readRDS("data/march2022_social_data.rds")
data_nodes <- read_csv("data/participants.csv")
data_nodes$ageGroup <- cut(data_nodes$age,
                           breaks = c(17, 35, 60),
                           labels = c('Young Adults', 'Old Adults'),
                           include.lowest = TRUE)

Reviewing the data

glimpse(data_edges)
Rows: 49,284
Columns: 4
$ participantIdFrom <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ participantIdTo   <dbl> 226, 226, 226, 226, 226, 226, 226, 644, 64~
$ Weekday           <chr> "Friday", "Monday", "Saturday", "Sunday", ~
$ Weight            <int> 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, ~

Firstly, let's visualise the social interactions of participants whose joviality index is above 0.95. Then, let's create an edge file containing only those participants.

jov0.9 <- data_nodes %>%
  filter(joviality > 0.95)

jovEdge <- data_edges %>%
  filter(participantIdFrom %in% jov0.9$participantId & participantIdTo %in% jov0.9$participantId)

jovNode <-jov0.9 %>%
  filter(participantId %in% jovEdge$participantIdFrom & participantId %in% jovEdge$participantIdTo)

Building data model

The code chunk below builds a tidygraph network graph data frame using graph_from_data_frame() of the igraph package and as_tbl_graph() of the tidygraph package.

social_graph <- graph_from_data_frame(jovEdge, vertices = jovNode) %>%
  as_tbl_graph() %>%
  activate(edges) %>%
  arrange(desc(Weight))
social_graph
# A tbl_graph: 32 nodes and 416 edges
#
# A directed multigraph with 5 components
#
# Edge Data: 416 x 4 (active)
   from    to Weekday   Weight
  <int> <int> <chr>      <int>
1     4     5 Wednesday      5
2     5     4 Wednesday      5
3     6     8 Wednesday      5
4     8     6 Wednesday      5
5    11    16 Wednesday      5
6    12    15 Wednesday      5
# ... with 410 more rows
#
# Node Data: 32 x 8
  name  householdSize haveKids   age educationLevel interestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 104               2 FALSE       47 HighSchoolOrC~ F            
2 113               2 FALSE       51 Graduate       G            
3 165               2 FALSE       30 Graduate       G            
# ... with 29 more rows, and 2 more variables: joviality <dbl>,
#   ageGroup <chr>
g <- ggraph(social_graph, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 3)) +
  geom_node_point(aes(colour = educationLevel, 
                      size = ageGroup))
g + theme_graph() +
  labs(colour = "Education Level",
       title = "Do birds of the same feather flock together?",
       subtitle = "During March 2022 (least interactive month)") +
  theme(plot.title = element_text(size=15, hjust = 0.5),
        plot.subtitle = element_text(size = 10, hjust = 0.5),
        text = element_text(size=12, face="bold"))

Observations:

set_graph_style()
g <- ggraph(social_graph, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = educationLevel), 
                  size = 2)
g + facet_nodes(~interestGroup)+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  labs(title="Why don't people in the same interest groups interact with each other?") +
  theme(plot.title = element_text(size=15, hjust = 0.5), 
  legend.position = 'bottom',
  text = element_text(size=12,face="bold"))

Observation