Programming Elegant DataVis with tidyverse and ggplot2 R packages
In this Take-home Exercise 1, I explore the demographics of the city of Engagement, Ohio, USA, using appropriate statistical graphics methods in R. The data are processed with the tidyverse family of packages, and the statistical graphics are prepared with ggplot2 and its extensions.
Let's break the demographic analysis down into individual components
and try to answer the following questions:
1. Which categories of residents are more jovial?
2. Do residents with a specific educational qualification share
similar interest groups?
3. What is the most common educational qualification among the
residents?
The dataset used in this exercise is the Participants.csv file, which contains information about the residents such as age, education level and household size. The link to download the dataset is found below.
Before we get started, we need to ensure that the required R packages are installed. If they are, we simply load them; if not, we install them first and then load them into the R environment. The required packages are tidyverse, ggplot2, dplyr, plotrix, plyr, patchwork and ggthemes.
The code chunk below installs (where necessary) and loads the required packages.
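packages <- c('tidyverse', 'ggplot2', 'dplyr', 'plotrix', 'plyr', 'patchwork', 'ggthemes')
for (p in packages) {
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
# Note: plyr is attached after dplyr, so plyr's mutate(), summarise() and
# arrange() mask the dplyr versions unless the dplyr:: prefix is used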
The code chunk below imports Participants.csv from the data
folder into R by using read_csv() of the readr package
and saves it as a tibble data frame called part_data.
Note that the later code chunks refer to this data frame as data.
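The import chunk itself does not appear in the text. A minimal sketch of what it might look like follows. The path data/Participants.csv is an assumption based on the description above, and a small temporary stand-in file (three rows copied from the printed output) is used so that the sketch runs on its own. The alias `data` is also an assumption, added because later chunks use that name.

```r
library(readr)

# Assumed location of the real file: "data/Participants.csv".
# A temporary stand-in with three rows is written here so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "participantId,householdSize,haveKids,age,educationLevel,interestGroup,joviality",
  "0,3,TRUE,36,HighSchoolOrCollege,H,0.001626703",
  "1,3,TRUE,25,HighSchoolOrCollege,B,0.328086500",
  "2,3,TRUE,35,HighSchoolOrCollege,A,0.393469590"
), csv_path)

# read_csv() returns a tibble and guesses the column types from the file
part_data <- read_csv(csv_path, show_col_types = FALSE)

# Later chunks use the name `data`, so keep an alias (assumption)
data <- part_data
head(data)
```

With the real file, only `csv_path` would need to change.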
  participantId householdSize haveKids age      educationLevel interestGroup   joviality
1             0             3     TRUE  36 HighSchoolOrCollege             H 0.001626703
2             1             3     TRUE  25 HighSchoolOrCollege             B 0.328086500
3             2             3     TRUE  35 HighSchoolOrCollege             A 0.393469590
4             3             3     TRUE  21 HighSchoolOrCollege             I 0.138063446
5             4             3     TRUE  43           Bachelors             H 0.857396691
6             5             3     TRUE  32 HighSchoolOrCollege             D 0.772957791
The figure below shows the proposed sketch.
The joviality values, which lie in the range [0, 1] and indicate each
participant's overall happiness level at the start of the study, are
recoded into four levels, 'Not too Happy', 'Fairly Happy', 'Happy' and
'Very Happy', using the code chunk below. This is done with cut(),
which converts numeric values into a factor by binning them into intervals.
data$jovialityGroup <- cut(data$joviality,
                           breaks = c(-Inf, 0.2, 0.5, 0.8, 1),
                           labels = c("Not too Happy", "Fairly Happy", "Happy", "Very Happy"))
table(data$jovialityGroup)
Not too Happy  Fairly Happy         Happy    Very Happy
          207           320           278           206
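To see how cut() treats values that fall exactly on a break, here is a small sketch with hand-picked values (not taken from the dataset). By default the intervals are closed on the right, so 0.2 still counts as 'Not too Happy' while 0.21 is already 'Fairly Happy'.

```r
# Same breaks and labels as in the chunk above, applied to hand-picked values
x <- c(0, 0.2, 0.21, 0.5, 0.8, 0.81, 1)
grp <- cut(x,
           breaks = c(-Inf, 0.2, 0.5, 0.8, 1),
           labels = c("Not too Happy", "Fairly Happy", "Happy", "Very Happy"))

# Show each value next to the group it was assigned to
data.frame(joviality = x, jovialityGroup = grp)
```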
Similarly to jovialityGroup, age groups
('Young Adult', 'Middle Age' and 'Older Adult') are also created
using cut().
data$ageGroup <- cut(data$age,
                     breaks = c(18, 35, 55, Inf),
                     labels = c("Young Adult", "Middle Age", "Older Adult"),
                     include.lowest = TRUE)
table(data$ageGroup)
Young Adult  Middle Age Older Adult
        427         470         114
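The include.lowest = TRUE argument matters here because the first break (18) is itself a valid age. The sketch below uses hypothetical ages to compare the result with and without it: without the argument, an age of exactly 18 falls outside the first interval and becomes NA.

```r
# Hypothetical ages sitting on or next to the breaks
ages <- c(18, 35, 36, 55, 56)
age_labels <- c("Young Adult", "Middle Age", "Older Adult")

# include.lowest = TRUE closes the first interval on the left: [18, 35]
with_lowest <- cut(ages, breaks = c(18, 35, 55, Inf),
                   labels = age_labels, include.lowest = TRUE)

# Default behaviour: the first interval is (18, 35], so 18 is not covered
without_lowest <- cut(ages, breaks = c(18, 35, 55, Inf),
                      labels = age_labels)

data.frame(age = ages, with_lowest, without_lowest)
```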
To answer the question of which categories of residents are more jovial, let's create three individual charts showing joviality across household size, whether the resident has kids, and age category, and then draw a final conclusion.
g1 <- ggplot(data, aes(householdSize, fill = jovialityGroup)) +
  geom_bar() +
  theme(legend.position = "none") +
  xlab("Household Size") + ylab("No. of \n residents") +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5),
        axis.text = element_text(face = "bold"))
g2 <- ggplot(data, aes(haveKids, fill = jovialityGroup)) +
  geom_bar() +
  theme(legend.position = "none") +
  theme(axis.title.y = element_blank(),
        axis.text = element_text(face = "bold")) +
  xlab("Having Kids") + ylab("No. of residents")
g3 <- ggplot(data, aes(ageGroup, fill = jovialityGroup)) +
  geom_bar() +
  theme(axis.title.y = element_blank(),
        axis.text = element_text(face = "bold")) +
  xlab("Age Category") + ylab("No. of residents")
g1 + g2 + g3
The order of the x-axis labels in a chart can be rearranged using factor().
In ggplot2, both the legend order and the stacking order follow the order of
the factor levels, so they can be reversed by reversing the levels with fct_rev().
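The final chunk further down refers to a variable called lvl_kids whose definition is not shown. The sketch below illustrates how such a reordered factor might be built with factor(), and what fct_rev() does to the level order. The toy vectors, the "Yes"/"No" labels and their order are assumptions for illustration only.

```r
library(forcats)

# Toy stand-in for data$haveKids
haveKids <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# factor() with explicit levels fixes the order of the x-axis categories;
# the "Yes"/"No" labels and their order are assumptions, not from the original
lvl_kids <- factor(haveKids, levels = c(TRUE, FALSE), labels = c("Yes", "No"))
levels(lvl_kids)

# fct_rev() reverses the level order of a factor
happiness <- factor(c("Happy", "Very Happy", "Not too Happy"),
                    levels = c("Not too Happy", "Fairly Happy", "Happy", "Very Happy"))
levels(fct_rev(happiness))
```

To use this in a chart, the same call would be applied to the real column, for example by storing the result as data$lvl_kids.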
The three individual charts are combined using
patchwork (https://cran.r-project.org/web/packages/patchwork/vignettes/patchwork.html).
A main title for the combined chart is added using plot_annotation(),
and it is centred using theme().
The code chunk below applies this formatting.
patchwork <- g1 + g2 + g3 +
  plot_annotation(title = "Residents of young age and those who are not having kids are much happier comparatively") &
  theme(plot.title = element_text(hjust = 0.5))
The final chart with the necessary formatting is created using the code chunk below.
# fct_rev() reverses the joviality levels so that the stack and legend read in the desired order
g1 <- ggplot(data, aes(householdSize, fill = forcats::fct_rev(jovialityGroup))) +
  geom_bar() +
  theme(legend.position = "none") +
  xlab("Household Size") + ylab("No. of residents") +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5),
        axis.text = element_text(face = "bold"))
# lvl_kids is the reordered version of haveKids; its definition is not shown in this write-up
g2 <- ggplot(data, aes(lvl_kids, fill = forcats::fct_rev(jovialityGroup))) +
  geom_bar() +
  theme(legend.position = "none") +
  theme(axis.title.y = element_blank(),
        axis.text = element_text(face = "bold")) +
  xlab("Having Kids") + ylab("No. of residents")
g3 <- ggplot(data, aes(ageGroup, fill = forcats::fct_rev(jovialityGroup))) +
  geom_bar() +
  theme(axis.title.y = element_blank(),
        axis.text = element_text(face = "bold")) +
  xlab("Age Category") + ylab("No. of \n residents") +
  labs(fill = "Happiness Level")
# combine the three panels, add a shared title and centre it
g1 + g2 + g3 +
  plot_annotation(title = "Residents of young age and those who are not having kids are much happier comparatively") &
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_text(face = "bold"))
The chart suggests that young residents and residents without kids are happier than older adults and residents with kids.
The figure below shows the proposed sketch.
The proportion of each interest group within each age category is computed
using the code chunk below.
group_by() groups the data frame by
multiple columns, here age group and interest group, and summarise() with n()
counts the number of residents in each combination. mutate() then adds the
age-group total (ageGroup.count) and the proportion (prop).
df <- data %>%
  group_by(ageGroup, interestGroup) %>%
  dplyr::summarise(count = n()) %>%
  # plyr is attached after dplyr and plyr's mutate() ignores grouping,
  # so dplyr::mutate() is called explicitly to get totals per age group
  dplyr::mutate(ageGroup.count = sum(count),
                prop = count / sum(count)) %>%
  ungroup()
df
# A tibble: 30 x 5
   ageGroup    interestGroup count ageGroup.count   prop
   <fct>       <chr>         <int>          <int>  <dbl>
 1 Young Adult A                43            427 0.101
 2 Young Adult B                39            427 0.0913
 3 Young Adult C                47            427 0.110
 4 Young Adult D                39            427 0.0913
 5 Young Adult E                40            427 0.0937
 6 Young Adult F                38            427 0.0890
 7 Young Adult G                45            427 0.105
 8 Young Adult H                49            427 0.115
 9 Young Adult I                39            427 0.0913
10 Young Adult J                48            427 0.112
# ... with 20 more rows
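As a sanity check on this kind of computation, here is a small sketch on a hypothetical toy data frame (not the Participants data). It calls the dplyr verbs with an explicit dplyr:: prefix, because plyr is attached after dplyr in this exercise and plyr's mutate() does not respect grouping. Within each age group the proportions should sum to 1.

```r
library(dplyr)

# Hypothetical toy data: three young adults and two older adults
toy <- tibble::tibble(
  ageGroup      = c("Young Adult", "Young Adult", "Young Adult", "Older Adult", "Older Adult"),
  interestGroup = c("A", "A", "B", "A", "B")
)

toy_prop <- toy %>%
  # one row per (ageGroup, interestGroup) combination with its frequency
  dplyr::count(ageGroup, interestGroup, name = "count") %>%
  dplyr::group_by(ageGroup) %>%
  # totals and shares are computed within each age group
  dplyr::mutate(ageGroup.count = sum(count),
                prop = count / ageGroup.count) %>%
  dplyr::ungroup()

toy_prop

# Check: proportions add up to 1 inside every age group
tapply(toy_prop$prop, toy_prop$ageGroup, sum)
```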
The basic ggplot chart is created using the code chunk below.
ggplot(df,
       aes(x = ageGroup, y = prop, width = ageGroup.count, fill = interestGroup)) +
  geom_bar(stat = "identity", position = "fill", colour = "black") +
  facet_grid(~ageGroup, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "Set3") +
  # theme_void() is a complete theme, so it comes first; otherwise it
  # would overwrite the panel spacing set by theme()
  theme_void() +
  theme(panel.spacing.x = unit(0, "npc"))
The proportion calculated previously is shown on the bars for
comparison purposes. geom_text()
is used to label the bars, and scales::percent()
displays the value as a percentage rounded to one
decimal place.
This is accomplished by the code chunk below.
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)), position = position_stack(vjust = 0.5))
mapping: label = ~scales::percent(prop, accuracy = 0.1)
geom_text: parse = FALSE, check_overlap = FALSE, na.rm = FALSE
stat_identity: na.rm = FALSE
position_stack
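To illustrate what the accuracy argument does, here is a short sketch that formats a few hand-picked proportions (not values from the chart) with scales::percent().

```r
library(scales)

# accuracy = 0.1 rounds to one decimal place after converting to percent
pct_labels <- percent(c(0.1007, 0.5, 0.0386), accuracy = 0.1)
pct_labels
```

The result is a character vector, which is why it can be mapped directly to the label aesthetic of geom_text().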
A main title for the chart is added using ggtitle(),
and a legend title is added using labs().
The final chart with the necessary formatting is created using the code chunk below.
ggplot(df,
       aes(ageGroup, y = prop, width = ageGroup.count, fill = interestGroup)) +
  geom_bar(stat = "identity", position = "fill", colour = "black") +
  #geom_text(aes(label = scales::percent(prop, accuracy = 0.1)), hjust = 0.5, position = position_stack(vjust = 0.5)) +
  facet_grid(~ageGroup, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "Set3") +
  theme_void() +  # complete theme first, so the tweaks below are not overwritten
  labs(fill = "Interest Group") +
  xlab("Age Category") +
  ggtitle("Interest Group F is least preferred by Young Adults and most preferred by Older Adults") +
  theme(panel.spacing.x = unit(0, "npc"),  # if no spacing preferred between bars
        plot.title = element_text(hjust = 0.5))
Interest Group F is quite common among older adults and is the least preferred group among young adults.
The figure below shows the proposed sketch.
The number of residents at each education level is computed using
the code chunk below.
group_by() groups the data frame by
education level, and summarise() with n()
counts the residents in each group.
ed_level_data <- data %>%
  group_by(educationLevel) %>%
  dplyr::summarise(count = n())
ed_level_data
# A tibble: 4 x 2
  educationLevel      count
  <chr>               <int>
1 Bachelors             232
2 Graduate              170
3 HighSchoolOrCollege   525
4 Low                    84
The percentage of residents at each education level (prop) is then
calculated, and the cumsum() function is used to compute the cumulative
sum from which the mid-point of each slice (ypos) is derived.
ed_level_data <- ed_level_data %>%
  arrange(desc(educationLevel)) %>%
  mutate(prop = count / sum(ed_level_data$count) * 100) %>%
  mutate(ypos = cumsum(prop) - 0.5 * prop)
ed_level_data
# A tibble: 4 x 4
  educationLevel      count  prop  ypos
  <chr>               <int> <dbl> <dbl>
1 Low                    84  8.31  4.15
2 HighSchoolOrCollege   525 51.9  34.3
3 Graduate              170 16.8  68.6
4 Bachelors             232 22.9  88.5
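The ypos formula can be checked by hand on a small example. The sketch below uses three hypothetical shares (not the education data): the cumulative sum gives the upper edge of each slice, and subtracting half of the slice's own share moves the position back to its middle.

```r
# Hypothetical shares (in percent) of three slices, in stacking order
prop <- c(10, 50, 40)

# cumsum(prop) is the upper edge of each slice: 10, 60, 100
upper_edge <- cumsum(prop)

# subtracting half of each slice's height gives its mid-point
ypos <- upper_edge - 0.5 * prop
ypos
```

The first slice spans 0 to 10, so its mid-point is 5; the second spans 10 to 60, giving 35; the third spans 60 to 100, giving 80.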
A pie chart in ggplot2 is a stacked bar plot drawn in polar coordinates. Hence,
coord_polar()
is used to create a circular chart. The basic ggplot chart is created
using the code chunk below.
ggplot(ed_level_data, aes(x = "", y = prop, fill = educationLevel)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  theme_void()
The default legend order is alphabetical, which is not meaningful here. It can be rearranged using
factor().
A main title for the chart is added using ggtitle(),
and a legend title is added using labs().
Also, since this pie chart is built on a bar chart, the conventional axis
labels and tick marks should be made invisible. This can be done
using theme().
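The final chunk maps fill to a variable called edu_order, whose definition is not shown. A minimal sketch of how such a variable might be created with factor() follows. The level order is taken from the legend order used in the final chunk, the counts come from the summary table above, and the definition itself is an assumption.

```r
# Counts taken from the summary table above
ed_level_data <- data.frame(
  educationLevel = c("Bachelors", "Graduate", "HighSchoolOrCollege", "Low"),
  count = c(232, 170, 525, 84)
)

# Order the levels from highest to lowest qualification; the column name
# edu_order matches the final chunk, but this definition is an assumption
ed_level_data$edu_order <- factor(
  ed_level_data$educationLevel,
  levels = c("Graduate", "Bachelors", "HighSchoolOrCollege", "Low")
)

levels(ed_level_data$edu_order)
```

Because ggplot2 orders legend entries by factor level, mapping fill to this column lists Graduate first and Low last.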
The final chart is obtained using the code chunk below.
# edu_order is educationLevel as a reordered factor; its definition is not shown in this write-up
ggplot(ed_level_data, aes(x = "", y = prop, fill = edu_order)) +
  geom_col(color = "black") +
  coord_polar(theta = "y") +
  # geom_label() draws the percentage in a box at the middle of each slice
  # (the original also drew the same text with geom_text(), which the label covers)
  geom_label(aes(label = scales::percent(prop / 100)),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  ggtitle("Majority of the residents are High School or College Graduate") +
  labs(fill = "Education Level") +
  # a plot can hold only one fill scale, so colours and legend order are set together;
  # a second scale_fill_*() call would replace the first and drop the colours
  scale_fill_manual(values = c("#FF5733", "#75FF33", "#33DBFF", "#BD33FF"),
                    limits = c("Graduate", "Bachelors", "HighSchoolOrCollege", "Low")) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        panel.background = element_rect(fill = "#ebf2ff"),
        plot.background = element_rect(fill = "#ebf2ff"),
        plot.title = element_text(hjust = 0.01, vjust = 0.9),
        legend.background = element_rect(fill = "#ebf2ff"))
The chart shows that the education level of the majority of the residents is up to High School or College.