Visualising and Analysing Community Network
In this Take-home Exercise 6, I have explored the patterns of community interactions of the city of Engagement, Ohio USA residents by using social network analysis approach.
In this exercise, four network data modelling and visualisation packages will be installed and launched. They are igraph , tidygraph, ggraph and visNetwork. Beside these four packages, tidyverse and lubridate, an R package specially designed to handle and wrangling time data will be installed and launched too.
The code chunk below is used to install and load the required packages onto RStudio.
packages = c('igraph', 'tidygraph',
'ggraph', 'visNetwork',
'lubridate', 'clock',
'tidyverse', 'graphlayouts',
'patchwork','DT','knitr',
'zoo','ggdist')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
As Social network data file is huge in size, it is processed and
saved as rds file. Let’s read the rds file. There are two data sets. One
contains the nodes data and the other contains the edges (also know as
link) data. The edges data consists of social interactions between the
participants and the nodes data consists of other details of
participants. Let’s import tidy_social_data.rds and
GAStech_email_edges.csv into RStudio environment by using
readRDS() and read_csv() of readr package.
data <- read_csv("data/SocialNetwork.csv")
data <- data %>%
mutate(Date = as_date(timestamp)) %>%
mutate(Weekday=weekdays(Date),Month=zoo::as.yearmon(Date,"%Y %m"))
count_interactions <- data %>%
group_by(Month)%>%
tally()
count_interactions[order(-count_interactions$n),][1:2,])
count_interactions[order(count_interactions$n),][1:2,])It is understood that July month had a lot of
interactions and March month had less no. of
interactions. Let’s zoom in and view the attributes.The edges data
contains individual interactions records which is not useful for
analysis or visualisation. So, let us aggregate the individuals by date,
senders, receivers, month and day of the week. Four functions from dplyr
package are used. They are filter(), group(),
summarise(), and ungroup().A new field called
Weight has been added in data_edges_aggregated and saved it in a rds
file.
data_edges_aggregated <- data %>%
filter(timestamp >= "2022-07-01" & timestamp <= "2022-07-31") %>%
group_by(participantIdFrom , participantIdTo, Weekday) %>%
dplyr::summarise(Weight = n()) %>%
filter(participantIdFrom!=participantIdTo) %>%
filter(Weight > 1) %>%
ungroup()
saveRDS(data_edges_aggregated,'data/july2022_social_data.rds')
Before proceeding with the network analysis, first let us examine the data. Here the weights calculated are single numbers. A point estimate is a single number, such as a mean. Uncertainty is expressed as standard error, confidence interval, or credible interval.
data_edges$Weekday <- factor(data_edges$Weekday, level = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))
my_sum <- data_edges %>%
group_by(Weekday) %>%
dplyr::summarise(
n=n(),
mean=mean(Weight),
sd=sd(Weight)
) %>%
mutate(se=sd/sqrt(n-1))
knitr::kable(head(my_sum), format = 'html')
| Weekday | n | mean | sd | se |
|---|---|---|---|---|
| Sunday | 23504 | 2.729238 | 0.7992271 | 0.0052132 |
| Monday | 26640 | 2.718243 | 0.7881383 | 0.0048288 |
| Tuesday | 26428 | 2.670425 | 0.7917778 | 0.0048706 |
| Wednesday | 25204 | 2.625218 | 0.7940233 | 0.0050016 |
| Thursday | 23954 | 2.588211 | 0.7889700 | 0.0050978 |
| Friday | 28372 | 2.747286 | 1.0521967 | 0.0062468 |
The code chunk below is used to reveal the standard error of mean weights score by weekdays
ggplot(my_sum) +
geom_errorbar(
aes(x=Weekday,
ymin=mean-se,
ymax=mean+se,
y=mean),
width=0.2,
colour="black",
alpha=0.9,
size=0.5) +
geom_point(aes
(x=Weekday,
y=mean),
stat="identity",
color="red",
size = 1.5,
alpha=1) +
xlab('Weekday')+
ylab('Mean')+
labs(title="Standard error of Mean weight by Weekdays")+
theme(panel.background = element_rect(fill = "#BFD5E3", colour = "#6D9EC1",
size = 2, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
colour = "white"),
axis.line = element_line(size = 1.5, colour = "black"),
axis.ticks = element_line(colour = "black", size = rel(2)),
plot.title = element_text(size=15, hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5),
axis.title.y=element_text(angle=0))
It is seen that standard error of mean weight is low on Wednesday and Thursday.It is comparatively high on other weekdays.
In the code chunk below, stat_gradientinterval() of ggdist is used to build a visual for displaying distribution of weights by weekdays.
data_edges %>%
ggplot(aes(x=Weekday,
y = Weight)) +
stat_gradientinterval(
fill = "skyblue",
show.legend = TRUE
) +
labs(
title = "Confidence intervals of mean Weight",
subtitle = "Gradient + interval plot")+
theme(plot.title = element_text(size=15, hjust = 0.5),
plot.subtitle = element_text(size=15, hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5),
axis.title.y=element_text(angle=0))
The above graph shows various confidence levels from 70 - 90 with varying colour intensity.
glimpse(data_edges)
Rows: 181,836
Columns: 4
$ participantIdFrom <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ participantIdTo <dbl> 282, 282, 282, 282, 282, 282, 282, 300, 30~
$ Weekday <fct> Friday, Monday, Saturday, Sunday, Thursday~
$ Weight <int> 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
Firstly, let’s visualise the social interactions of participants who has joviality index more than 0.9. Then, lets create an edge file with only those participants.
The below code chunk is to build an tidygraph’s network graph
data.frame using tbl_graph() of tinygraph package.
social_graph <- graph_from_data_frame(jovEdge, vertices = jovNode) %>%
as_tbl_graph() %>%
activate(edges) %>%
arrange(desc(Weight))
social_graph
# A tbl_graph: 46 nodes and 1506 edges
#
# A directed multigraph with 1 component
#
# Edge Data: 1,506 x 4 (active)
from to Weekday Weight
<int> <int> <chr> <int>
1 4 5 Friday 5
2 4 5 Saturday 5
3 5 4 Friday 5
4 5 4 Saturday 5
5 6 13 Friday 5
6 8 11 Friday 5
# ... with 1,500 more rows
#
# Node Data: 46 x 8
name householdSize haveKids age educationLevel interestGroup
<chr> <dbl> <lgl> <dbl> <chr> <chr>
1 7 3 TRUE 27 Bachelors A
2 82 2 FALSE 40 Graduate H
3 104 2 FALSE 47 HighSchoolOrC~ F
# ... with 43 more rows, and 2 more variables: joviality <dbl>,
# ageGroup <chr>
g <- ggraph(social_graph,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 3)) +
geom_node_point(aes(colour = educationLevel,
size = ageGroup))
g + theme_graph() +
labs(fill = "Education Level",title="Do birds of same feather flock together ?",
subtitle = "During July 2022 (Most interactive season)") +
theme(plot.title = element_text(size=15, hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
text = element_text(size=12,face="bold"))
It is observed that participants of High School or College interacts among them than with others. They form a cluster.
Also, Graduate students are those who interact with everyone irrespective of educational qualification or age group.
Though participants of Bachelors interact with other participants, their intensity is too low. They don’t form a strong bonding with anyone.
set_graph_style()
g <- ggraph(social_graph,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = educationLevel),
size = 2)
g + facet_nodes(~interestGroup)+
th_foreground(foreground = "grey80",
border = TRUE) +
labs(title="Why people of same interest groups (B,C,D) don't interact with each other?") +
theme(plot.title = element_text(size=15, hjust = 0.5),
legend.position = 'bottom',
text = element_text(size=12,face="bold"))
It is observed that participants of interest groups B, C and D don’t interact much may be in this timeline despite July 2022 having maximum interactions overall.
There may be a chance of happening any events in this month for the interest groups A,F,G & I and hence people of those interest groups are highly active.
It is also observed that Interest Group I comprises of participants of different educational background.
It is understood that July month had a lot of interactions. Let’s
zoom in and view the attributes.The edges data contains individual
interactions records which is not useful for analysis or visualisation.
So, let us aggregate the individuals by date, senders, receivers, month
and day of the week. Four functions from dplyr package are used. They
are filter(), group(),
summarise(), and ungroup().A new field called
Weight has been added in data_edges_aggregated and saved it in a rds
file.
data_edges_aggregated <- data %>%
filter(timestamp >= "2022-03-01" & timestamp <= "2022-03-31") %>%
group_by(participantIdFrom , participantIdTo, Weekday) %>%
dplyr::summarise(Weight = n()) %>%
filter(participantIdFrom!=participantIdTo) %>%
filter(Weight > 1) %>%
ungroup()
saveRDS(data_edges_aggregated,'data/march2022_social_data.rds')
glimpse(data_edges)
Rows: 49,284
Columns: 4
$ participantIdFrom <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ participantIdTo <dbl> 226, 226, 226, 226, 226, 226, 226, 644, 64~
$ Weekday <chr> "Friday", "Monday", "Saturday", "Sunday", ~
$ Weight <int> 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, ~
Firstly, let’s visualise the social interactions of participants who has joviality index more than 0.9. Then, lets create an edge file with only those participants.
The below code chunk is to build an tidygraph’s network graph
data.frame using *tbl_graph()* of tinygraph package.
social_graph <- graph_from_data_frame(jovEdge, vertices = jovNode) %>%
as_tbl_graph() %>%
activate(edges) %>%
arrange(desc(Weight))
social_graph
# A tbl_graph: 32 nodes and 416 edges
#
# A directed multigraph with 5 components
#
# Edge Data: 416 x 4 (active)
from to Weekday Weight
<int> <int> <chr> <int>
1 4 5 Wednesday 5
2 5 4 Wednesday 5
3 6 8 Wednesday 5
4 8 6 Wednesday 5
5 11 16 Wednesday 5
6 12 15 Wednesday 5
# ... with 410 more rows
#
# Node Data: 32 x 8
name householdSize haveKids age educationLevel interestGroup
<chr> <dbl> <lgl> <dbl> <chr> <chr>
1 104 2 FALSE 47 HighSchoolOrC~ F
2 113 2 FALSE 51 Graduate G
3 165 2 FALSE 30 Graduate G
# ... with 29 more rows, and 2 more variables: joviality <dbl>,
# ageGroup <chr>
g <- ggraph(social_graph,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 3)) +
geom_node_point(aes(colour = educationLevel,
size = ageGroup))
g + theme_graph() +
labs(fill = "Education Level",title="Do birds of same feather flock together ?",
subtitle = "During March 2022 (Least interactive season)") +
theme(plot.title = element_text(size=15, hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
text = element_text(size=12,face="bold"))
It can be seen that people are highly detatched among themselves and the intensity of interaction is also too low at most cases.
Participants of High School or College doesn’t interact with others and they form their own group.
set_graph_style()
g <- ggraph(social_graph,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = educationLevel),
size = 2)
g + facet_nodes(~interestGroup)+
th_foreground(foreground = "grey80",
border = TRUE) +
labs(title="Why people of same interest groups don't interact with each other?") +
theme(plot.title = element_text(size=15, hjust = 0.5),
legend.position = 'bottom',
text = element_text(size=12,face="bold"))
It is observed that though participants from same interest group doesn’t interact with each other.
It can be due to absence of any events happening in the city that makes the people to spend time all alone without much interactions with others.