Take-home Exercise 1

Programming Elegant DataVis with tidyverse and ggplot2 R packages

Raveena Chakrapani (https://www.linkedin.com/in/raveena-chakrapani-444a60174/), School of Computing and Information Systems, Singapore Management University (https://scis.smu.edu.sg/master-it-business)
2022-05-07

1. Overview

In this Take-home Exercise 1, I explore the demographics of the city of Engagement, Ohio, USA using appropriate statistical graphics in R. The data is processed with the tidyverse family of packages, and the statistical graphics are prepared with ggplot2 and its extensions.

2. Task

Let's break the demographic analysis down into individual components and try to answer the following questions:
1. Which category of residents is more jovial?
2. Is there a specific interest group that attracts young adults or older adults?
3. What’s the most common educational qualification among the residents?

3. Getting Started

3.1 Data

The dataset used in this exercise is the Participants.csv file, which contains information about the residents such as age, education level and household size. The link to download the dataset is given below.

Download Participants.csv

3.2 Installing and loading the required libraries

Before we get started, it is important to ensure that the required R packages have been installed. If they are installed, we load them; if not, we install them first and then load them into the R environment. The required packages are tidyverse, ggplot2, dplyr, plotrix, plyr, patchwork and ggthemes.

The code chunk below installs (where necessary) and loads the required packages in RStudio.

# Packages used in this exercise
packages <- c('tidyverse','ggplot2','dplyr','plotrix','plyr','patchwork','ggthemes')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3.3 Importing Data

The code chunk below imports Participants.csv from the data folder into R by using base R's read.csv() and saves it as a data frame called data.

# read csv file
data <- read.csv("data/Participants.csv")
head(data)
  participantId householdSize haveKids age      educationLevel
1             0             3     TRUE  36 HighSchoolOrCollege
2             1             3     TRUE  25 HighSchoolOrCollege
3             2             3     TRUE  35 HighSchoolOrCollege
4             3             3     TRUE  21 HighSchoolOrCollege
5             4             3     TRUE  43           Bachelors
6             5             3     TRUE  32 HighSchoolOrCollege
  interestGroup   joviality
1             H 0.001626703
2             B 0.328086500
3             A 0.393469590
4             I 0.138063446
5             H 0.857396691
6             D 0.772957791

4. Which category of residents is more jovial?

4.1 Sketch of Proposed Design

The figure below shows the proposed sketch.

4.2 Data Wrangling

Recoding Joviality group

The joviality values, which range over [0, 1] and indicate each participant's overall happiness level at the start of the study, are recoded into four levels ('Not too Happy', 'Fairly Happy', 'Happy' and 'Very Happy') using the code chunk below. The recoding is done with cut(), which converts numeric values into a factor.

data$jovialityGroup <- cut(data$joviality, breaks =c(-Inf,0.2,0.5,0.8,1),labels=c("Not too Happy","Fairly Happy","Happy","Very Happy"))
table(data$jovialityGroup)

Not too Happy  Fairly Happy         Happy    Very Happy 
          207           320           278           206 
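As a quick check of how cut() treats values that fall exactly on a break, the sketch below applies the same breaks and labels to a few made-up scores. The vector scores is hypothetical and is not taken from Participants.csv. Because cut() builds right-closed intervals by default, a score of exactly 0.2 is assigned to 'Not too Happy' rather than 'Fairly Happy'.

```r
# Hypothetical joviality scores chosen to sit on or near the break points
scores <- c(0, 0.2, 0.21, 0.5, 0.8, 1)

# Same breaks and labels as above; intervals are right-closed by default,
# i.e. (-Inf, 0.2], (0.2, 0.5], (0.5, 0.8], (0.8, 1]
groups <- cut(scores,
              breaks = c(-Inf, 0.2, 0.5, 0.8, 1),
              labels = c("Not too Happy", "Fairly Happy", "Happy", "Very Happy"))

# Show each score next to the group it was assigned to
data.frame(scores, groups)
```

If the upper boundary should belong to the next group instead, cut() also accepts right = FALSE, which makes the intervals left-closed.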

Recoding age group

Similarly, age is recoded into three groups ('Young Adult', 'Middle Age' and 'Older Adult') using cut().

data$ageGroup <- cut(data$age,breaks=c(18,35,55,Inf),labels=c("Young Adult","Middle Age","Older Adult"),
include.lowest = TRUE)
table(data$ageGroup)

Young Adult  Middle Age Older Adult 
        427         470         114 
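The include.lowest = TRUE argument matters here because the first break is 18. The sketch below uses a few made-up ages (the vector ages is hypothetical) to show that, without it, a resident aged exactly 18 would fall outside the first interval and be recoded as NA.

```r
# Hypothetical ages, including the lower boundary of 18
ages <- c(18, 35, 36, 55, 56)
age_breaks <- c(18, 35, 55, Inf)
age_labels <- c("Young Adult", "Middle Age", "Older Adult")

# Without include.lowest, the first interval is (18, 35], so 18 becomes NA
without_lowest <- cut(ages, breaks = age_breaks, labels = age_labels)

# With include.lowest = TRUE, the first interval becomes [18, 35]
with_lowest <- cut(ages, breaks = age_breaks, labels = age_labels,
                   include.lowest = TRUE)

# Compare the two recodings side by side
data.frame(ages, without_lowest, with_lowest)
```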

4.3 Creating Basic Chart

To answer the question of which category of residents is more jovial, let's create three individual charts showing joviality across household size, whether residents have kids, and age category, and then draw a final conclusion.

g1 <- ggplot(data,aes(householdSize, fill = jovialityGroup))+
  geom_bar()+
  theme(legend.position = "none")+
  xlab("Household Size") + ylab("No. of \n residents") +
  theme(axis.title.y=element_text(angle=0,vjust = 0.5),
        axis.text = element_text(face="bold"))
g2 <- ggplot(data,aes(haveKids, fill = jovialityGroup))+
  geom_bar() +
  theme(legend.position = "none")+
  theme(axis.title.y  = element_blank(),
        axis.text = element_text(face="bold"))+
  xlab("Having Kids") + ylab("No. of residents")
g3 <- ggplot(data, aes(ageGroup, fill= jovialityGroup))+
  geom_bar() +
  theme(axis.title.y  = element_blank(),
        axis.text = element_text(face="bold"))+
  xlab("Age Category") + ylab("No. of residents")
g1+g2+g3

4.4 Formatting

Changing the order of x labels of Having Kids bar plot

The order of the x-axis labels can be rearranged by setting the factor levels explicitly with factor().

lvl_kids <- factor(data$haveKids, levels = c('TRUE', 'FALSE'))
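To see what the explicit levels change, the sketch below builds the factor twice from a small made-up logical vector (have_kids is hypothetical). By default the levels are sorted, so FALSE comes first; passing levels puts TRUE first, and that level order is what ggplot2 follows on a discrete axis.

```r
# Hypothetical haveKids values
have_kids <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Default: the unique values are sorted, so FALSE comes before TRUE
default_order <- factor(have_kids)

# Explicit levels put TRUE first
kids_first <- factor(have_kids, levels = c("TRUE", "FALSE"))

levels(default_order)
levels(kids_first)

# The counts are unchanged; only their order differs
table(kids_first)
```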

Changing the order of stacking and legend

In ggplot2, both the stacking order of the bars and the order of the legend entries follow the order of the factor levels. Both can therefore be changed by reversing the levels with fct_rev() from the forcats package.

g1<-ggplot(data,aes(householdSize, fill = forcats::fct_rev(jovialityGroup)))
g2<-ggplot(data,aes(lvl_kids, fill = forcats::fct_rev(jovialityGroup))) 
g3<-ggplot(data, aes(ageGroup, fill=forcats::fct_rev(jovialityGroup)))
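For readers who want to see what reversing the levels does without plotting, the sketch below uses a small made-up factor (happiness is hypothetical) and a base R expression that, for this purpose, should behave like forcats::fct_rev(): the values stay the same and only the level order is flipped.

```r
# Hypothetical happiness levels, ordered from least to most happy
happiness <- factor(c("Happy", "Very Happy", "Not too Happy", "Happy"),
                    levels = c("Not too Happy", "Fairly Happy", "Happy", "Very Happy"))

# Base R counterpart of forcats::fct_rev(): same values, level order reversed
happiness_rev <- factor(happiness, levels = rev(levels(happiness)))

levels(happiness)
levels(happiness_rev)
```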

Annotate the final patchwork

The three individual charts are combined using patchwork (https://cran.r-project.org/web/packages/patchwork/vignettes/patchwork.html). The main title of the combined chart is created using plot_annotation(), and its alignment is adjusted using theme(). The code chunk below accomplishes all of this formatting.

patchwork <- g1 + g2 + g3 +
  plot_annotation(title = "Residents of young age and those who are not having kids are much happier comparatively") & 
  theme(plot.title = element_text(hjust = 0.5))

4.5 Final Chart

The final chart with the necessary formatting is created using the code chunk below.

g1 <- ggplot(data,aes(householdSize, fill = forcats::fct_rev(jovialityGroup)))+
  geom_bar()+
  theme(legend.position = "none")+
  xlab("Household Size") + ylab("No. of residents") +
  theme(axis.title.y=element_text(angle=0,vjust = 0.5),
        axis.text = element_text(face="bold"))
g2 <- ggplot(data,aes(lvl_kids, fill = forcats::fct_rev(jovialityGroup)))+
  geom_bar() +
  theme(legend.position = "none")+
  theme(axis.title.y  = element_blank(),
        axis.text = element_text(face="bold"))+
  xlab("Having Kids") + ylab("No. of residents")
g3 <- ggplot(data, aes(ageGroup, fill=forcats::fct_rev(jovialityGroup)))+
  geom_bar() +
  theme(axis.title.y  = element_blank(),
        axis.text = element_text(face="bold"))+
  xlab("Age Category") + ylab("No. of \n residents")+
  labs(fill = "Happiness Level")
g1 + g2 + g3 +
  plot_annotation(title = "Residents of young age and those who are not having kids are much happier comparatively") & 
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_text(face="bold"))

4.6 Insights from the Visualisation

The chart suggests that young residents and residents without kids tend to be happier than older adults and residents with kids.

5. Is there a specific interest group that attracts young adults or older adults?

5.1 Sketch of Proposed Design

The figure below shows the proposed sketch.

5.2 Data Wrangling

Compute proportion

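The proportion of each interest group is computed using the code chunk below. group_by() groups the data frame by age group and interest group, and summarise() with n() counts the residents in each combination. Note that because plyr is loaded after dplyr, the unqualified mutate() call resolves to plyr's mutate(), which ignores the grouping. This is why ageGroup.count is 1011 for every row and prop is each count's share of all residents rather than of its own age group. The charts that follow use position = "fill", which rescales the bars within each age group.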

df <- data %>%
  group_by(ageGroup,interestGroup) %>%
  dplyr::summarise(count=n()) %>%
  mutate(ageGroup.count = sum(count),
         prop = count/sum(count)) %>%
  ungroup()
df
# A tibble: 30 x 5
   ageGroup    interestGroup count ageGroup.count   prop
   <fct>       <chr>         <int>          <int>  <dbl>
 1 Young Adult A                43           1011 0.0425
 2 Young Adult B                39           1011 0.0386
 3 Young Adult C                47           1011 0.0465
 4 Young Adult D                39           1011 0.0386
 5 Young Adult E                40           1011 0.0396
 6 Young Adult F                38           1011 0.0376
 7 Young Adult G                45           1011 0.0445
 8 Young Adult H                49           1011 0.0485
 9 Young Adult I                39           1011 0.0386
10 Young Adult J                48           1011 0.0475
# ... with 20 more rows
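For comparison, proportions within each age group can be obtained by dividing each count by the total of its own group. The sketch below does this in base R with ave() on a small made-up table (toy and its counts are hypothetical). With dplyr, calling dplyr::mutate() explicitly on the grouped data frame should give the same per-group behaviour.

```r
# Hypothetical counts of residents by age group and interest group
toy <- data.frame(
  ageGroup      = c("Young Adult", "Young Adult", "Older Adult", "Older Adult"),
  interestGroup = c("A", "B", "A", "B"),
  count         = c(30, 10, 5, 15)
)

# ave() applies sum() within each ageGroup and repeats the result on every row
toy$ageGroup.count <- ave(toy$count, toy$ageGroup, FUN = sum)

# Share of each interest group within its own age group
toy$prop <- toy$count / toy$ageGroup.count

toy
```

Computed this way, the proportions sum to 1 within each age group, and ageGroup.count differs between groups, which is what a width aesthetic needs in order to vary the bar widths.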

5.3 Creating Basic Chart

The basic ggplot2 chart is created using the code chunk below.

ggplot(df,
       aes(x = ageGroup, y = prop, width = ageGroup.count, fill = interestGroup)) +
  geom_bar(stat = "identity", position = "fill", colour = "black") +
  facet_grid(~ageGroup, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "Set3") +
  theme_void() +
  # theme() must follow theme_void(), otherwise the complete theme overrides it
  theme(panel.spacing.x = unit(0, "npc"))

5.4 Formatting

Adding percentage values

The proportion calculated previously is shown on the bars for comparison. geom_text() is used to label the bars, and scales::percent() formats each value as a percentage rounded to one decimal place.

This is accomplished by the code chunk below.

geom_text(aes(label = scales::percent(prop, accuracy = 0.1)), position = position_stack(vjust = 0.5))
mapping: label = ~scales::percent(prop, accuracy = 0.1) 
geom_text: parse = FALSE, check_overlap = FALSE, na.rm = FALSE
stat_identity: na.rm = FALSE
position_stack 

Adding appropriate plot and legend title

The main title of the chart is added using ggtitle() and the legend title is set using labs().

labs(fill = "Interest Group")
ggtitle("Interest Group F is least preferred by Young Adults and most preferred by Older Adults")

5.5 Final Chart

The final chart with the necessary formatting is created using the code chunk below.

ggplot(df,
       aes(ageGroup, y = prop, width = ageGroup.count, fill = interestGroup)) +
  geom_bar(stat = "identity", position = "fill", colour = "black") +
  # geom_text(aes(label = scales::percent(prop, accuracy = 0.1)), hjust = 0.5, position = position_stack(vjust = 0.5)) +
  facet_grid(~ageGroup, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "Set3") +
  theme_void() +
  labs(fill = "Interest Group") +
  xlab("Age Category") +
  ggtitle("Interest Group F is least preferred by Young Adults and most preferred by Older Adults") +
  # theme() tweaks come after theme_void() so the complete theme does not override them
  theme(panel.spacing.x = unit(0, "npc"), # no spacing between bars
        plot.title = element_text(hjust = 0.5))

5.6 Insights from the Visualisation

Interest Group F is quite common among older adults and is the least preferred group among young adults.

6. What’s the most common educational qualification among the residents?

6.1 Sketch of Proposed Design

The figure below shows the proposed sketch.

6.2 Data Wrangling

Computing frequency by education level

The number of residents at each education level is computed using the code chunk below. group_by() groups the data frame by education level, and summarise() with n() counts the residents in each group.

ed_level_data <- data %>%
  group_by(educationLevel) %>%
  dplyr::summarise(count = n())
ed_level_data
# A tibble: 4 x 2
  educationLevel      count
  <chr>               <int>
1 Bachelors             232
2 Graduate              170
3 HighSchoolOrCollege   525
4 Low                    84

Computing cumulative sum

The share of residents at each education level is computed as a percentage (prop). cumsum() is then used to derive the midpoint of each slice (ypos), which can serve as the label position.

ed_level_data <- ed_level_data %>% 
  arrange(desc(educationLevel)) %>%
  mutate(prop = count / sum(count) * 100) %>%
  mutate(ypos = cumsum(prop) - 0.5 * prop)
ed_level_data
# A tibble: 4 x 4
  educationLevel      count  prop  ypos
  <chr>               <int> <dbl> <dbl>
1 Low                    84  8.31  4.15
2 HighSchoolOrCollege   525 51.9  34.3 
3 Graduate              170 16.8  68.6 
4 Bachelors             232 22.9  88.5 
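The arithmetic behind ypos can be traced step by step with the counts from the table above. The sketch below is a base R restatement of the same calculation; the intermediate vector upper is introduced here only for illustration and is not part of the original code.

```r
# Counts from the summary table above, in the arranged (descending) order
count <- c(Low = 84, HighSchoolOrCollege = 525, Graduate = 170, Bachelors = 232)

prop  <- count / sum(count) * 100  # percentage share of each level
upper <- cumsum(prop)              # upper edge of each slice
ypos  <- upper - 0.5 * prop        # midpoint = upper edge minus half the slice

round(cbind(prop, upper, ypos), 2)
```

The first midpoint is simply half of the first share, and the last upper edge is 100, which is a quick way to confirm that the slices cover the whole pie.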

6.3 Creating Basic Chart

A pie chart in ggplot2 is a stacked bar plot drawn in polar coordinates. Hence, coord_polar() is used to turn the bar into a circular chart. The basic chart is created using the code chunk below.

ggplot(ed_level_data, aes(x="", y=prop, fill=educationLevel)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() 

6.4 Formatting

Changing legend order

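By default the legend is ordered alphabetically, which does not reflect the natural ordering of the education levels. It can be rearranged by setting the factor levels explicitly with factor().

edu_order <- factor(ed_level_data$educationLevel, levels = c('Low', 'HighSchoolOrCollege','Bachelors','Graduate'))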

Improving the aesthetics of the chart

The main title of the chart is added using ggtitle() and the legend title is set using labs(). Also, since this pie chart is built on a bar chart, the conventional axis labels and tick elements should be hidden. This is done using theme().

The final chart is obtained using the code chunk below.

ggplot(ed_level_data, aes(x = "", y = prop, fill = edu_order)) +
  geom_col(color = "black") +
  coord_polar(theta = "y") +
  # geom_label() alone draws the percentage labels; an extra geom_text() layer
  # with the same labels is redundant
  geom_label(aes(label = scales::percent(prop/100)),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE)+
  ggtitle("Majority of the residents are High School or College Graduate")+
  labs(fill = "Education Level")+
  # A plot keeps only one fill scale, so colours and legend order are set together
  scale_fill_manual(values = c("#FF5733", "#75FF33", "#33DBFF", "#BD33FF"),
                    limits = c("Graduate", "Bachelors", "HighSchoolOrCollege","Low"))+
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        panel.background = element_rect(fill = "#ebf2ff"),
        plot.background = element_rect(fill = "#ebf2ff"),
        plot.title=element_text(hjust = 0.01,vjust=0.9),
        legend.background = element_rect(fill = "#ebf2ff")) 

6.5 Insights from the Visualisation

The chart shows that the majority of the residents (about 52%) have an education level of high school or college.