Exploratory Analysis of my Goodreads data

1 Introduction

Hello everyone, and welcome to an analysis of my reading journey. A little bit of background: I started reading again in 2018, after a long time of not touching books. The last book I had read before that was back in 8th grade :D.

Since then, I’ve been keeping track of the books I read, along with my ratings and reviews, on Goodreads. After two years, I had a small data set available. I came across it the other day, when Goodreads showed me some stats on their website.

Moreover, I recently started learning data science on my own, so I am thrilled to analyze this data. The purpose of this analysis is to answer some simple exploratory and descriptive questions:
1. How many books have I read over the years? Do I read more every year?
2. I read books in both Vietnamese and English. How many are in each language?
3. Is there a correlation between the time of year and the amount I read?
4. Do I give higher ratings than other people do?
5. How fast do I read?

I am new to this, so if you have any comments or suggestions, please let me know!

LET’S DIVE IN!

2 The Analysis

2.1 The Goodreads Data

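First, let’s grab our data.

# packages used throughout this analysis
library(tidyverse)   # dplyr, tidyr, tibble, ggplot2
library(lubridate)   # year(), month(), floor_date()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box()
# encoding UTF-8 since there are Vietnamese characters
data <- read.csv("data/goodreads_library_export.csv", encoding = "UTF-8", header = TRUE, stringsAsFactors = FALSE)
# dimensions of the data: rows (books) and columns (variables)
dim(data)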
## [1] 57 31

There are 57 books in this data set. This includes books I have read, am currently reading, and plan to read. It is a small data set, but I believe it holds some interesting insights. Each book is described by 31 variables.

Here is a look at the data set. The column My.Review is left out because it makes the rows too tall. There are also a lot of missing values.

#create a table
kable(data[,-which(names(data) %in% "My.Review")]) %>% 
  kable_styling() %>%
  scroll_box(width ="800px",height = "350px")
Book.Id Title Author Author.l.f Additional.Authors ISBN ISBN13 My.Rating Average.Rating Publisher Binding Number.of.Pages Year.Published Original.Publication.Year Date.Read Date.Added Bookshelves Bookshelves.with.positions Exclusive.Shelf Spoiler Private.Notes Read.Count Recommended.For Recommended.By Owned.Copies Original.Purchase.Date Original.Purchase.Location Condition Condition.Description BCID
32498468 Before the Fall Noah Hawley Hawley, Noah =“1455561797” =“9781455561797” 5 3.72 Grand Central Publishing Paperback 416 2017 2016 2020/01/25 2020/01/25 read NA NA 1 NA NA 0 NA NA NA NA NA
40513711 Financial Freedom: A Proven Path to All the Money You Will Ever Need Grant Sabatier Sabatier, Grant Vicki Robin =“0525540881” =“9780525540885” 5 4.00 Avery Publishing Group Hardcover 352 2019 NA 2020/08/02 2020/06/18 read NA NA 1 NA NA 0 NA NA NA NA NA
12701065 The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career Reid Hoffman Hoffman, Reid Ben Casnocha =“0307888908” =“9780307888907” 4 3.85 Crown Business Hardcover 260 2012 2012 2020/12/11 2020/11/27 read NA NA 1 NA NA 0 NA NA NA NA NA
32989564 The Comprehensive INFP Survival Guide Heidi Priebe Priebe, Heidi =“1945796154” =“9781945796159” 4 4.24 Thought Catalog Books Paperback 274 2016 2016 2020/04/20 2020/04/14 read NA NA 1 NA NA 0 NA NA NA NA NA
29875487 The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life Mark Manson Manson, Mark =“0062641549” =“9780062641540” 5 3.94 Harper Paperback 206 2016 2016 2020/03/05 2020/03/01 read NA NA 1 NA NA 0 NA NA NA NA NA
52061818 Bạn Có Phải Là Đứa Trẻ Sợ Hãi Ẩn Sau Lớp Vỏ Trưởng Thành? Beth Evans Evans, Beth Trịnh Thu Hằng =“9786047761” ="" 4 3.73 Công Ty TNHH Văn Hoá & Truyền Thông Skybooks Paperback 191 2018 2018 2019/12/23 2019/12/22 read NA NA 1 NA NA 0 NA NA NA NA NA
20697410 Golden Son (Red Rising, #2) Pierce Brown Brown, Pierce =“0345539818” =“9780345539816” 5 4.44 Del Rey Hardcover 442 2015 2015 2020/11/10 2020/10/29 read NA NA 1 NA NA 0 NA NA NA NA NA
25574881 Dòng Thời Gian Michael Crichton Crichton, Michael ="" ="" 4 3.85 NXB Lao Động Paperback 494 2015 1999 2020/12/15 2020/10/23 read NA NA 1 NA NA 0 NA NA NA NA NA
18966806 Morning Star (Red Rising Saga, #3) Pierce Brown Brown, Pierce =“0345539842” =“9780345539847” 5 4.48 Del Rey Hardcover 524 2016 2016 2020/11/26 2020/11/17 read NA NA 1 NA NA 0 NA NA NA NA NA
15839976 Red Rising (Red Rising Saga, #1) Pierce Brown Brown, Pierce =“0345539788” =“9780345539786” 5 4.24 Del Rey (Random House) Hardcover 382 2014 2014 2020/10/15 2020/09/11 read NA NA 1 NA NA 0 NA NA NA NA NA
38746485 Becoming Michelle Obama Obama, Michelle ="" ="" 0 4.53 Crown Hardcover 426 2018 2018 2020/09/17 currently-reading currently-reading (#1) currently-reading NA NA 1 NA NA 0 NA NA NA NA NA
8520610 Quiet: The Power of Introverts in a World That Can’t Stop Talking Susan Cain Cain, Susan =“0307352145” =“9780307352149” 5 4.06 Crown Publishing Group/Random House, Inc.  Hardcover 333 2012 2012 2020/09/09 2020/04/05 read NA NA 1 NA NA 0 NA NA NA NA NA
23398763 Everything I Never Told You Celeste Ng Ng, Celeste =“0143127551” =“9780143127550” 5 3.86 Penguin Books Paperback 292 2015 2014 2020/08/18 2020/08/06 read NA NA 1 NA NA 0 NA NA NA NA NA
86154 The Mystery of Capital: Why Capitalism Triumphs in the West and Fails Everywhere Else Hernando de Soto Soto, Hernando de =“0465016154” =“9780465016150” 4 4.00 Basic Books Paperback 288 2003 2000 2018/05/22 2018/06/06 read NA NA 1 NA NA 0 NA NA NA NA NA
28800977 Lời Nói Dối Vĩ Đại Của Não Kelly McGonigal McGonigal, Kelly Khánh Thủy ="" =“9786045929889” 4 4.15 Thái Hà Books & NXB Lao Động Paperback 228 2015 2011 2018/01/14 2018/06/06 read NA NA 1 NA NA 0 NA NA NA NA NA
60931 Kindred Octavia E. Butler Butler, Octavia E. =“0807083690” =“9780807083697” 5 4.26 Beacon Press Paperback 287 2004 1979 2019/02/17 2019/02/17 read NA NA 1 NA NA 0 NA NA NA NA NA
28818221 Kindred: A Graphic Novel Adaptation Damian Duffy Duffy, Damian Octavia E. Butler, John Jennings, Nnedi Okorafor =“141970947X” =“9781419709470” 4 4.19 Harry N. Abrams Hardcover 240 2017 2017 2019/03/29 2019/04/02 read NA NA 1 NA NA 0 NA NA NA NA NA
36045222 Nhân Tố Enzyme: Minh Họa Hiromi Shinya Shinya, Hiromi Như Nữ ="" =“9786047735303” 4 4.16 Nxb. Thế Giới & Thái Hà Books Paperback 79 2017 NA 2019/04/11 2019/04/11 read NA NA 1 NA NA 0 NA NA NA NA NA
45420940 Gửi bạn, người đã trưởng thành mà chưa tìm thấy tài năng Shinzi Kamioka Kamioka, Shinzi Nguyễn Quốc Vượng ="" ="" 3 3.57 Nhà Xuất Bản Phụ Nữ Bìa mềm 245 2019 2017 2019/12/29 2019/12/29 read NA NA 1 NA NA 0 NA NA NA NA NA
49506165 Những điều giữ tôi còn sống Matt Haig Haig, Matt N.D.T. Anh ="" ="" 5 4.16 Paperback 224 2019 2015 2020/01/08 2020/01/08 read NA NA 1 NA NA 0 NA NA NA NA NA
40198460 Ăn Ít Để Khỏe – 1 Bữa Là Đủ Sao Cần Phải 3 Yoshinori Nagumo Nagumo, Yoshinori Minh Yến ="" ="" 3 3.66 NXB Lao Động, Thái Hà Paperback 200 2018 2012 2020/02/04 2020/02/04 read NA NA 1 NA NA 0 NA NA NA NA NA
6486483 Emotional Intelligence 2.0 Travis Bradberry Bradberry, Travis Jean Greaves, Patrick Lencioni =“0974320625” =“9780974320625” 4 3.84 Talentsmart Hardcover 255 2009 2003 2020/07/08 2020/06/18 read NA NA 1 NA NA 0 NA NA NA NA NA
76865 Good to Great: Why Some Companies Make the Leap… and Others Don’t James C. Collins Collins, James C. =“0066620996” =“9780066620992” 0 4.11 Harper Business Hardcover 300 2001 2001 2020/06/15 to-read to-read (#5) to-read NA NA 0 NA NA 0 NA NA NA NA NA
13596297 Phi Lý Trí: Khám Phá Những Động Lực Vô Hình Ẩn Sau Những Quyết Định Của Con Người Dan Ariely Ariely, Dan Hồng Lê ="" ="" 4 4.13 Lao động Xã Hội & Alphabooks 268 2009 2008 2020/06/14 2020/04/21 read NA NA 1 NA NA 0 NA NA NA NA NA
97642 No More Mr. Nice Guy Robert A. Glover Glover, Robert A. =“0762415339” =“9780762415335” 4 4.06 Running Press Adult Hardcover 208 2003 2000 2020/06/11 2020/05/21 read NA NA 1 NA NA 0 NA NA NA NA NA
35133922 Educated Tara Westover Westover, Tara ="" ="" 5 4.46 Random House Hardcover 334 2018 2018 2020/05/19 2020/05/03 read NA NA 1 NA NA 0 NA NA NA NA NA
38447 The Handmaid’s Tale (The Handmaid’s Tale, #1) Margaret Atwood Atwood, Margaret ="" ="" 4 4.11 Anchor Books Paperback 314 1998 1985 2020/04/05 2020/03/23 read NA NA 1 NA NA 0 NA NA NA NA NA
6708 The Power of Now: A Guide to Spiritual Enlightenment Eckhart Tolle Tolle, Eckhart =“1577314808” =“9781577314806” 0 4.13 New World Library Paperback 229 2004 1997 2020/04/05 to-read to-read (#4) to-read NA NA 0 NA NA 0 NA NA NA NA NA
3450744 Nudge: Improving Decisions About Health, Wealth, and Happiness Richard H. Thaler Thaler, Richard H. Cass R. Sunstein =“014311526X” =“9780143115267” 0 3.84 Penguin Books Paperback 314 2009 2008 2020/04/05 to-read to-read (#3) to-read NA NA 0 NA NA 0 NA NA NA NA NA
26530355 Misbehaving: The Making of Behavioral Economics Richard H. Thaler Thaler, Richard H. =“039335279X” =“9780393352795” 0 4.19 W. W. Norton Company Paperback 432 2016 2016 2020/04/05 to-read to-read (#2) to-read NA NA 0 NA NA 0 NA NA NA NA NA
52539809 Mắt Biếc Nguyễn Nhật Ánh Ánh, Nguyễn Nhật ="" =“9786041140783” 5 4.04 NXB Trẻ Paperback 298 2019 1990 2020/01/02 2020/01/02 read NA NA 1 NA NA 0 NA NA NA NA NA
29780253 Born a Crime: Stories From a South African Childhood Trevor Noah Noah, Trevor =“0385689225” =“9780385689229” 5 4.45 Doubleday Canada Hardcover 289 2016 2016 2020/03/19 2020/03/14 read NA NA 1 NA NA 0 NA NA NA NA NA
34273236 Little Fires Everywhere Celeste Ng Ng, Celeste =“0735224293” =“9780735224292” 5 4.09 Penguin Press Hardcover 338 2017 2017 2020/03/13 2020/03/09 read NA NA 1 NA NA 0 NA NA NA NA NA
1922929 The Black Swan: The Impact of the Highly Improbable Nassim Nicholas Taleb Taleb, Nassim Nicholas =“081297381X” =“9780812973815” 4 3.94 Random House Trade Paperbacks Paperback 444 2010 2007 2020/03/01 2018/06/20 read NA NA 1 NA NA 0 NA NA NA NA NA
13612914 Óc sáng suốt Nguyễn Duy Cần Cần, Nguyễn Duy ="" ="" 0 4.21 NXB Trẻ Paperback 180 2011 1952 2020/01/16 to-read to-read (#1) to-read NA NA 0 NA NA 0 NA NA NA NA NA
25802987 Chuỗi Án Mạng A.B.C Agatha Christie Christie, Agatha Võ Thị Hương Lan ="" ="" 4 4.02 Nxb. Trẻ Paperback 300 2015 1936 2020/01/10 2020/01/10 read NA NA 1 NA NA 0 NA NA NA NA NA
762462 One Up On Wall Street: How to Use What You Already Know to Make Money in the Market Peter Lynch Lynch, Peter John Rothchild =“0743200403” =“9780743200400” 4 4.23 Simon Schuster Paperback 304 2000 1988 2019/11/26 2019/10/20 read NA NA 1 NA NA 0 NA NA NA NA NA
21558662 Nhà Giả Kim Paulo Coelho Coelho, Paulo Lê Chu Cầu ="" ="" 4 3.88 Nxb. Văn Học Paperback 228 2013 1988 2019/08/13 2019/08/13 read NA NA 1 NA NA 0 NA NA NA NA NA
36000433 The INFJ Personality Guide: Understand yourself, reach your potential, and live a life of purpose. Bo Miller Miller, Bo ="" ="" 5 4.23 Kindle Edition 102 2017 NA 2019/10/20 2019/10/19 read NA NA 1 NA NA 0 NA NA NA NA NA
22015713 Thi vương Tương Tây Tianxia Bachang Bachang, Tianxia Nguyễn Thanh Tân ="" ="" 4 3.69 Nhã Nam & NXB Văn học Bìa mềm 532 NA NA 2019/10/16 2019/09/16 read NA NA 1 NA NA 0 NA NA NA NA NA
35517485 Gieo Trồng Hạnh Phúc Thich Nhat Hanh Hanh, Thich Nhat ="" ="" 5 4.33 NXB Lao Động Xã Hội Paperback 350 2014 NA 2019/09/06 2019/08/23 read NA NA 1 NA NA 0 NA NA NA NA NA
35561949 Hẹn với thần chết Agatha Christie Christie, Agatha Trần Hữu Kham ="" =“9786041076648” 4 3.88 NXB Trẻ Paperback NA 2015 1937 2019/08/15 2019/08/15 read NA NA 1 NA NA 0 NA NA NA NA NA
39803056 Nếu Biết Trăm Năm Là Hữu Hạn Phạm Lữ Ân Ân, Phạm Lữ ="" =“9786045329610” 5 4.15 NXB Hội Nhà Văn Paperback 187 2018 2011 2019/07/10 2019/07/10 read NA NA 1 NA NA 0 NA NA NA NA NA
13857191 Nam Hải Quy Khư (Ma Thổi Đèn II, #2) Tianxia Bachang Bachang, Tianxia Lục Hương ="" ="" 4 3.88 Văn Học Paperback 660 2012 2012 2019/05/24 2019/04/12 read NA NA 2 NA NA 0 NA NA NA NA NA
36045198 Nhân Tố Enzyme: Trẻ Hóa Hiromi Shinya Shinya, Hiromi Như Nữ ="" =“9786047735297” 4 4.05 Nxb. Thế Giới & Thái Hà Books Paperback 175 2017 NA 2019/04/11 2019/04/03 read NA NA 1 NA NA 0 NA NA NA NA NA
35506630 Nhân Tố Enzyme Hiromi Shinya Shinya, Hiromi Như Nữ ="" =“9786047728152” 4 4.10 Thái Hà Books & Nxb. Thế Giới Paperback 223 2016 2005 2019/03/01 2019/02/03 read NA NA 1 NA NA 0 NA NA NA NA NA
35680030 Nhân Tố Enzyme: Thực Hành Hiromi Shinya Shinya, Hiromi Như Nữ ="" =“9786047734429” 5 4.11 Nxb. Thế Giới & Thái Hà Books Paperback 300 2017 NA 2019/04/02 2019/03/05 read NA NA 1 NA NA 0 NA NA NA NA NA
136531 The Laramie Project Moisés Kaufman Kaufman, Moisés Tectonic Theater Project =“0375727191” =“9780375727191” 3 4.18 Vintage Paperback 110 2001 2001 2019/03/16 2019/03/16 read NA NA 1 NA NA 0 NA NA NA NA NA
42865890 Giá Trị Cuộc Đời Trương Chi Chi, Trương ="" ="" 5 5.00 Hồng Đức Paperback 262 2015 NA 2018/11/28 2018/11/19 read NA NA 1 NA NA 0 NA NA NA NA NA
13635670 Thần Cung Côn Luân (Ma thổi đèn, #4) Tianxia Bachang Bachang, Tianxia ="" ="" 4 3.73 Văn Học, Nhã Nam Paperback 608 2010 2008 2018/07/04 2018/06/21 read NA NA 1 NA NA 0 NA NA NA NA NA
13635674 Mộ hoàng bì tử (Ma thổi đèn II, #1) Tianxia Bachang Bachang, Tianxia Lục Hương ="" ="" 4 3.90 Văn Học, Nhã Nam Paperback 640 2012 2012 2019/01/20 2019/01/20 read NA NA 1 NA NA 0 NA NA NA NA NA
27040147 The Richest Man in Babylon George S. Clason Clason, George S. =“1939438330” =“9781939438331” 5 4.27 Dauphin Publications Paperback 118 2015 1926 2019/02/03 2018/06/20 read NA NA 1 NA NA 0 NA NA NA NA NA
22755371 Đứa Trẻ Thứ 44 Tom Rob Smith Smith, Tom Rob Võ Hồng Long ="" =“9786045393772” 4 4.09 Nhã Nam & NXB Thời Đại Paperback 362 2014 2008 2018/01/01 2018/06/06 read NA NA 1 NA NA 0 NA NA NA NA NA
30273181 Án Mạng Trên Sông Nile (Hercule Poirot, #17) Agatha Christie Christie, Agatha Lan Phương ="" =“9786041015692” 5 4.10 NXB Trẻ Paperback 336 2015 1937 2018/10/06 2018/09/30 read NA NA 1 NA NA 0 NA NA NA NA NA
13118173 The Intelligent Investor (Collins Business Essentials) Benjamin Graham Graham, Benjamin =“9780060555” ="" 4 4.23 Harper Business Paperback 623 2006 1949 2018/12/22 2018/06/20 read NA NA 1 NA NA 0 NA NA NA NA NA
32521178 Tuổi Trẻ Đáng Giá Bao Nhiêu Rosie Nguyễn Nguyễn, Rosie ="" =“9786045370193” 5 4.30 Nxb. Hội Nhà Văn & Nhã Nam Paperback 292 2016 2017 2018/06/09 2018/06/06 read NA NA 1 NA NA 0 NA NA NA NA NA
23839097 Người Giỏi Không Bởi Học Nhiều Alphabooks Alphabooks, Alphabooks ="" ="" 4 3.75 Alphabooks & NXB Lao Động - Xã Hội Paperback 208 2012 2012 2018/06/21 2018/06/09 read NA NA 1 NA NA 0 NA NA NA NA NA
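To put a number on the missing values mentioned above, one option is to count them per column. The sketch below uses a small made-up data frame (`toy` and its values are illustrative, not the real export) and treats both NA and empty strings as missing, since read.csv leaves empty text fields as "" rather than NA.

```r
# toy stand-in for the Goodreads export (values are made up for illustration)
toy <- data.frame(
  Title = c("Book A", "Book B", "Book C"),
  ISBN = c("0123456789", "", ""),
  Number.of.Pages = c(416, NA, 228),
  Date.Read = c("2020/01/25", "", "2018/01/14"),
  stringsAsFactors = FALSE
)

# a cell counts as missing if it is NA or an empty string
is_missing <- function(x) is.na(x) | (is.character(x) & x == "")

# number of missing cells in each column
missing_per_column <- sapply(toy, function(col) sum(is_missing(col)))
print(missing_per_column)
```

The same `sapply` call should work on the real `data` object, assuming it was read with stringsAsFactors = FALSE as above.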

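I will convert some variables to their proper data types.

# turn the date columns into Date objects; the export uses the YYYY/MM/DD format
data$Date.Read <- as.Date(data$Date.Read, format = "%Y/%m/%d")
data$Date.Added <- as.Date(data$Date.Added, format = "%Y/%m/%d")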
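Books that are not finished yet have an empty Date.Read in the export. Here is a minimal sketch of how as.Date treats such values when the format is spelled out. The vector `raw` is made up for illustration.

```r
# made-up sample of the date strings found in the export; "" means no date yet
raw <- c("2020/01/25", "", "2018/06/06")

# with an explicit format, strings that do not match become NA
parsed <- as.Date(raw, format = "%Y/%m/%d")
print(parsed)
print(is.na(parsed))
```

Those NA dates are worth keeping in mind later, because functions such as sum() and mean() need na.rm = TRUE to skip them.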

2.2 Descriptive and Exploratory Analysis

2.2.1 Books

shelf1 <- as.data.frame(table(data$Exclusive.Shelf))
colnames(shelf1)[1] <- "Shelf"
kable(shelf1) %>%
  kable_styling(latex_options = "striped")
Shelf Freq
currently-reading 1
read 51
to-read 5

I have read 51 books so far.

We are only interested in the books I have read, so we will keep only those.

I will also shorten one long variable name.

# keep only the books on the "read" shelf
rdata <- data %>% filter(Exclusive.Shelf == "read")
# a tibble prints more compactly
rdata <- as_tibble(rdata)

# shorten this variable name
names(rdata)[names(rdata) == "Original.Publication.Year"] <- "opy"

I am interested in seeing the books I read by year.

# shelf2: number of books read per year, with the year-over-year percent change
shelf2 <- rdata %>% count(Year = year(Date.Read), name = "Freq")
shelf2 <- mutate(shelf2, pct_change = round((Freq / lag(Freq) - 1) * 100, 2))
colnames(shelf2) <- c("Year", "Books Read", "Percent Change")
#create a table
kable(shelf2)%>%
    kable_styling(latex_options = "striped")
Year Books Read Percent Change
2018 9 NA
2019 19 111.11
2020 23 21.05

There was a big jump in books read between 2018 and 2019 (+111%). The increase from 2019 to 2020 was much smaller (+21%). I’m reading more books each year, but the growth is slowing down.
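For reference, the percent change is (this year / previous year - 1) * 100. The sketch below redoes the arithmetic in base R with the counts from the table, shifting the vector by one position instead of using dplyr's lag(). The names `books_read` and `previous` are only illustrative.

```r
# books finished per year, taken from the table above
books_read <- c(`2018` = 9, `2019` = 19, `2020` = 23)

# previous year's count for each year (none for the first year)
previous <- c(NA, head(books_read, -1))

# year-over-year percent change
pct_change <- round((books_read / previous - 1) * 100, 2)
print(pct_change)
```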

Here are the oldest and newest books.

# newest books: titles whose original publication year equals the latest year (NAs ignored)
newest_year <- max(rdata$opy, na.rm = TRUE)
newest <- tibble(`Newest books` = rdata$Title[which(rdata$opy == newest_year)], Year = newest_year)
kable(newest) %>%
    kable_styling(latex_options = "striped")


# oldest books: same idea with the earliest year
oldest_year <- min(rdata$opy, na.rm = TRUE)
oldest <- tibble(`Oldest books` = rdata$Title[which(rdata$opy == oldest_year)], Year = oldest_year)
kable(oldest) %>%
    kable_styling(latex_options = "striped")
Newest books Year
Bạn Có Phải Là Đứa Trẻ Sợ Hãi Ẩn Sau Lớp Vỏ Trưởng Thành? 2018
Educated 2018
Oldest books Year
The Richest Man in Babylon 1926

Here are the thickest and thinnest books.

table1 <- tibble(rdata$Title[which.max(rdata$Number.of.Pages)])
table1[,2] <- max(rdata$Number.of.Pages,na.rm = TRUE)
table1[2,1] <- rdata$Title[which.min(rdata$Number.of.Pages)]
table1[2,2] <-min(rdata$Number.of.Pages,na.rm=TRUE)
colnames(table1) <- c("Books", "No. of Pages")
kable(table1)%>%
    kable_styling(latex_options = "striped")
Books No. of Pages
Nam Hải Quy Khư (Ma Thổi Đèn II, #2) 660
Nhân Tố Enzyme: Minh Họa 79

Lots of these books are Vietnamese, since I pick up some Viet books every time I visit home. That said, I did not expect the thickest book to be a Viet one. Now I want to know how often I read Viet books.

2.2.2 Viet books

To detect which books are in Vietnamese, we need an extra package. I will use stringi to check the encoding of each title.

Thanks to this StackOverflow post, I was able to find this package and its functions.

#call stringi packages
library(stringi)

#create a dataframe with title and their encoding
title <- tibble(Title=rdata$Title)
title$Encoding <- stri_enc_mark(title$Title)
kable(head(subset(title,title$Encoding=="UTF-8"),3)) %>%
    kable_styling(latex_options = "striped")
Title Encoding
Bạn Có Phải Là Đứa Trẻ Sợ Hãi Ẩn Sau Lớp Vỏ Trưởng Thành? UTF-8
Dòng Thời Gian UTF-8
Lời Nói Dối Vĩ Đại Của Não UTF-8
kable(head(subset(title,title$Encoding=="ASCII"),3))%>%
    kable_styling(latex_options = "striped")
Title Encoding
Before the Fall ASCII
Financial Freedom: A Proven Path to All the Money You Will Ever Need ASCII
The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career ASCII

So we can assume that Viet titles are marked as UTF-8, since they contain non-ASCII characters. Fortunately, the data set is small, so I was able to check by hand that no book was left out. The assumption would fail for a Viet title that has no non-ASCII characters, and that would be hard to spot in a big data set.
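The same idea can be sketched without stringi by checking whether any character in a title has a code point above 127. This is a swapped-in base R technique, not what the analysis above uses. The last title is a hypothetical Viet title typed without diacritics, included to show the blind spot.

```r
# titles written with \u escapes so the example does not depend on file encoding
titles <- c(
  "Before the Fall",
  "D\u00f2ng Th\u1eddi Gian",  # Viet title with diacritics
  "Nh\u00e0 Gi\u1ea3 Kim",     # Viet title with diacritics
  "Tam Cam"                     # hypothetical Viet title typed without diacritics
)

# TRUE when at least one character lies outside the ASCII range (code point > 127)
has_non_ascii <- vapply(
  titles,
  function(t) any(utf8ToInt(t) > 127L),
  logical(1),
  USE.NAMES = FALSE
)
print(data.frame(title = titles, has_non_ascii))
```

The last row comes out FALSE even though the title is meant to be Vietnamese, which is exactly the case that needs a manual check.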

# Number of books in Viet
title %>% filter(Encoding=="UTF-8")%>% nrow()
## [1] 26
# Number of books in English
title %>% filter(Encoding!="UTF-8")%>% nrow()
## [1] 25

I read 26 books in Viet and 25 books in English. This is quite different from what I expected: I thought I read more books in English than in Viet.

# only find viet books
vietdata <- merge(rdata,title,by="Title")
vietdata <- vietdata %>% filter(Encoding=="UTF-8") %>% mutate(month=month.abb[month(Date.Read)])
vietdata <- as.data.frame(table(vietdata$month))
colnames(vietdata)[1] <- "Month"

#plot viet books by month
ggplot(vietdata,aes(x=Month,y=Freq))+
  geom_bar(stat = "identity")+
  scale_x_discrete(limits = month.abb)+
  labs(x="Months",y="Number of books",title="Viet books read by months")+
  theme(plot.title = element_text(hjust = 0.5))

January is when I finished the most Viet books, which makes sense: I went home for Christmas and New Year and grabbed some books to read on my flight.

April, June and December are tied for the second-highest number of Viet books finished. I read some Viet books in December so that I could bring them home and put them on the shelves. I have no explanation for April and June.
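Months with no finished books are easy to lose when counting. One way to keep them, sketched here in base R with made-up finish dates, is to turn the month labels into a factor whose levels are all twelve months before tabulating.

```r
# made-up finish dates, for illustration only
finished <- as.Date(c("2020-01-08", "2020-01-02", "2019-12-29", "2019-04-11"))

# month abbreviation per book, as a factor that knows about all 12 months
month_label <- factor(month.abb[as.integer(format(finished, "%m"))], levels = month.abb)

# table() reports every level, so empty months show up with a count of 0
per_month <- table(month_label)
print(per_month)
```

In the plot above, scale_x_discrete(limits = month.abb) plays a similar role by forcing all twelve months onto the axis in calendar order.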

2.2.3 Number of Pages

Next, we will look at the number of pages I’ve read.

datagraph <- rdata %>% mutate(month=floor_date(Date.Read,unit="month")) %>% group_by(month) %>% mutate(sumpages=sum(Number.of.Pages,na.rm = TRUE)) 
datagraph %>%
  ggplot(aes(x=month,y=sumpages))+
  geom_bar(position="dodge",
           stat="identity",
           fill="#00AFBB",width = 20) +
  geom_text(aes(label=sumpages),vjust=-.5,size=2.5)+
  labs(x="Months",y="Pages Read",title="Pages Read through time")+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,1500)+
  scale_x_date(breaks="month",
               labels = function(x) ifelse(year(x)%in%c(2017,max(year(datagraph$month))+1),"", format(x,"%b %Y")), 
               limits=as.Date(c('2017-12-15', '2020-12-15')))+
  theme(axis.text.x = element_text(angle=60, hjust = 1))

As you can see, I’m not a consistent reader :D.

Additionally, this chart shows the total pages of the books I finished in a given month, rather than how much I actually read that month. For example, if I started a 500-page book in January and finished it in May, all 500 pages are counted in May.

I haven’t figured out a way to reflect the true number of pages read. I hope to do this in a future analysis.
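One possible approach, offered only as a rough sketch, is to spread each book's pages evenly over the days between its start and finish dates and then total the daily shares by month. This rests on two strong assumptions: that a reliable start date is available (the export does not provide one) and that I read the same number of pages every day. The `books` data frame below is hypothetical.

```r
# hypothetical books: start date, finish date, page count
books <- data.frame(
  start = as.Date(c("2020-01-22", "2020-03-01")),
  end   = as.Date(c("2020-02-10", "2020-03-05")),
  pages = c(400, 200)
)

# one row per reading day, holding that day's even share of the book's pages
per_day <- lapply(seq_len(nrow(books)), function(i) {
  days <- seq(books$start[i], books$end[i], by = "day")
  data.frame(month = format(days, "%Y-%m"), pages = books$pages[i] / length(days))
})
daily <- do.call(rbind, per_day)

# total the daily shares by calendar month
monthly <- aggregate(pages ~ month, data = daily, FUN = sum)
print(monthly)
```

In this toy case the 400-page book is split between January and February instead of landing entirely in the month it was finished.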

# create a year vector from the data
years <- as.integer(unique(year(data$Date.Read)))
years <- na.omit(years)
years <- sort(years, decreasing = FALSE)
# create a color vector for graph colors
colors <- c("deepskyblue","dodgerblue4","steelblue")
# create a vector for total number of pages read in a year
totalpages <- rep(0,length(years))
for (i in c(1:length(years))){
  totalpages[i] <- sum(data$Number.of.Pages[year(data$Date.Read)==years[i]],na.rm=TRUE)
}

#loop through years to create chart of pages read in each year
for (i in c(1:length(years))){
    print(datagraph %>% filter(year(Date.Read)==as.integer(years[i])) %>% 
    ggplot(aes(x=month,y=sumpages))+
    geom_bar(position="dodge",
             stat="identity",
             fill=colors[i]) +
    geom_text(aes(label=sumpages),vjust=-.5)+
    labs(x="Months",y="Pages Read",title=paste(totalpages[i]," pages read in ",years[i],sep=""))  +
    theme(plot.title = element_text(hjust = 0.5))+
    ylim(0,1500)+
    scale_x_date(breaks="month",
                 labels = function(x) ifelse(year(x)!=years[i],"" , months(x, TRUE)),
                 limits=as.Date(c(paste(toString(years[i]-1),"-12-15",sep=""),paste(toString(years[i]),"-12-15",sep="")))
    )
    )
    }

In 2018, I didn’t finish any book in February, March, or April. Again, this only reflects the fact that I didn’t finish any book, not that I was not reading.

In 2019 and 2020, there are few empty months, since I read more frequently and thus finished books more often. I wonder if my speed also plays a role here.

2.2.4 Speed

Another thing I want to know is how fast I finish a book. I will create a variable in rdata to hold the speed, which equals the number of pages divided by the number of days taken (datediff).

datediff is calculated as Date Read minus Date Added.

# days taken to finish each book (Date Read minus Date Added)
rdata$datediff <- difftime(rdata$Date.Read,rdata$Date.Added,units = "days")
rdata$datediff
## Time differences in days
##  [1]    0   45   14    6    4    1   12   53    9   34  157   12  -15 -143    0   -4    0    0
## [19]    0    0   20   54   21   16   13    0    5    4  620    0   37    0    1   30   14    0
## [37]    0   42    8   26   28    0    9   13    0  228 -156    6  185    3   12

The disadvantage of the Goodreads export is that it only includes Date Added and Date Read. Date Added reflects when I added the book to my shelf, and Date Read reflects when I marked it as finished. Normally, if I add a book on the day I start reading it, Date Read falls after Date Added, which results in a positive datediff.

There are instances where I only added a book some time after finishing it, so Date Read falls before Date Added. The negative values represent those cases.

The 0s are the cases where I added the book on the very day I finished it.

There is another field called Date Started on Goodreads, but it is not included in the export.

#count datediff which smaller than 1
aggregate(rdata$datediff,list(rdata$datediff<1), FUN=length)
##   Group.1  x
## 1   FALSE 34
## 2    TRUE 17

I will remove the zeros and the negatives in order to calculate speed, which drops 17 books.

rdata$speed <- rdata$Number.of.Pages / as.numeric(rdata$datediff)
# keep only positive, finite speeds (drops NA, Inf from 0 days, and negative durations)
speedat <- rdata %>% filter(is.finite(speed), speed > 0)

#average speed
mean(speedat$speed)
## [1] 33.31033

My average reading speed is 33.31 pages/day
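To make the filtering concrete, here is a small base R sketch on made-up numbers. It computes the per-book speed, drops the unusable cases, and averages the rest. It also shows a pooled alternative (total pages divided by total days), which is not what the analysis above uses but is less sensitive to a few books finished in a very short window.

```r
# made-up books: pages and days between Date Added and Date Read
pages <- c(300, 200, 400, 150)
days  <- c(10, 0, -5, 3)   # 0 = added on the finish date, negative = added afterwards

# per-book speed in pages per day: 30, Inf, -80, 50
speed <- pages / days

# usable books have a positive, finite speed
usable <- is.finite(speed) & speed > 0

# mean of the per-book speeds (the approach used in this analysis)
mean_speed <- mean(speed[usable])

# pooled speed: all usable pages over all usable days
pooled_speed <- sum(pages[usable]) / sum(days[usable])

print(c(mean_speed = mean_speed, pooled_speed = pooled_speed))
```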

Let’s look at it by year.

kable(speedat %>% group_by(year=year(Date.Read))%>%summarise(avgspeed = mean(speed,na.rm=TRUE))) %>%
      kable_styling(latex_options = "striped")
year avgspeed
2018 41.65243
2019 40.13476
2020 26.73828

Average speed in 2018 somehow is the highest, despite the low number of books. Is it because of the number of pages per book?

kable(speedat %>% group_by(year(Date.Read))%>%summarise(avgspeed = mean(speed,na.rm=TRUE),avgpages = mean(Number.of.Pages,na.rm=TRUE),n=n()) %>% mutate(true_n=shelf2$`Books Read`) %>% mutate(removed=true_n-n)) %>%
      kable_styling(latex_options = "striped")
year(Date.Read) avgspeed avgpages n true_n removed
2018 41.65243 388.1667 6 9 3
2019 40.13476 295.5000 10 19 9
2020 26.73828 333.8333 18 23 5

The average number of pages per book in 2018 is actually higher than in 2019 and 2020. As I understand it, I read fewer books in 2018, but each within a short window. In 2020, I read more books, more frequently, but at a slower pace.

The removed data also play a part here. There are 3 books removed for 2018, 9 for 2019 and 5 for 2020. Even though 2020 has more than double the books of 2018, it might be that the books removed in 2020 have a really high number of pages.

Or I just read super fast in a short window in 2018.

2.2.5 Ratings & Reviews

2.2.5.1 Rating

I want to compare my ratings with the average book rating.

# My rating; note that data still includes the 6 unread books, whose My.Rating is stored as 0
mean(data$My.Rating)
## [1] 3.912281
#Average rating
mean(data$Average.Rating)
## [1] 4.084035
# average Goodreads rating for each of my rating levels
ratedata <- aggregate(rdata$Average.Rating, list(myrating = rdata$My.Rating), FUN = "mean")
# keep a copy of my rating as the x-axis class before reshaping
ratedata$class <- ratedata$myrating
colnames(ratedata)[2] <- "avg"
# long format: one row per (class, variable) pair
ratedata <- gather(ratedata, var, value, myrating, avg)
#plot it
ggplot(ratedata, aes(x=class,y=value,fill=var))+
  geom_bar(stat='identity',position='dodge')+
  geom_text(aes(label=round(value,2)),vjust=-.5,position=position_dodge(width=0.9))+
  labs(y="Rating",title="Books Rating: Phan's versus Average")+
  ylim(0,6)+
  theme(plot.title = element_text(hjust = 0.5),legend.title=element_blank(),axis.title.x = element_blank(),axis.text.x = element_blank(),axis.ticks.x = element_blank()) +
  scale_fill_discrete(labels = c("Average Rating", "My Rating"))

The 3-star books, which are “meh” in my opinion, seem to be liked more by others. On the other hand, the books I found excellent (5 stars) are rated much lower by other readers.
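One caveat about the overall mean of My.Rating: in the export, books I have not rated (the to-read and currently-reading ones) carry a My.Rating of 0, so averaging over every row pulls my mean down. The toy vectors below are made up to show the effect.

```r
# made-up ratings and shelves; 0 means "not rated yet"
my_rating <- c(5, 4, 0, 3, 0)
shelf <- c("read", "read", "to-read", "read", "currently-reading")

# mean over every row, unrated books included as 0
mean_all <- mean(my_rating)

# mean over the books actually read
mean_read <- mean(my_rating[shelf == "read"])

print(c(mean_all = mean_all, mean_read = mean_read))
```

On the real data, the equivalent would be to take the means from rdata instead of data.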

2.2.5.2 Reviews

Do thicker books have higher ratings? Let’s find out. First, let’s build a data frame with Number.of.Pages and Average.Rating.

# data frame of number of pages and average rating
ratereview <- tibble(numpages = rdata$Number.of.Pages,rating = rdata$Average.Rating,title=rdata$Title)
ggplot(ratereview,aes(x=rating,y=numpages))+geom_point()

At first glance, there is no obvious correlation here. Most books sit at an average rating between roughly 3.5 and 4.5 regardless of length, with one outlier at 5.00.

cor.test(ratereview$numpages,ratereview$rating)
model <- lm(rating~numpages,data=ratereview)
summary(model)
## 
##  Pearson's product-moment correlation
## 
## data:  ratereview$numpages and ratereview$rating
## t = -0.78319, df = 48, p-value = 0.4374
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3788314  0.1713777
## sample estimates:
##        cor 
## -0.1123284 
## 
## 
## Call:
## lm(formula = rating ~ numpages, data = ratereview)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52216 -0.15153  0.01987  0.13962  0.91140 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.1435724  0.0912948  45.387   <2e-16 ***
## numpages    -0.0002098  0.0002679  -0.783    0.437    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2574 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01262,    Adjusted R-squared:  -0.007953 
## F-statistic: 0.6134 on 1 and 48 DF,  p-value: 0.4374

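It seems that numpages is not correlated with rating. The correlation coefficient is around -0.11, and the p-value (0.44) is large, so there is no evidence of a linear relationship between average rating and book thickness in this sample. That is not the same as proving the two are independent, especially with only 50 books.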
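The two outputs above tell the same story because, with a single predictor, the t statistic of the slope equals the t statistic of the Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), and R-squared equals r squared. The sketch below checks this on simulated data (the numbers are random, not my books) and then applies the formula to the values printed above.

```r
set.seed(42)  # simulated data, for illustration only
numpages <- round(runif(50, 80, 660))
rating   <- rnorm(50, mean = 4.1, sd = 0.25)

ct  <- cor.test(numpages, rating)
fit <- summary(lm(rating ~ numpages))

# t statistic rebuilt by hand from the correlation coefficient
r <- unname(ct$estimate)
n <- length(numpages)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# the three t values agree, and R-squared is r squared
print(c(cor_test = unname(ct$statistic), manual = t_manual,
        lm_slope = fit$coefficients["numpages", "t value"]))
print(c(r_squared = fit$r.squared, r_sq_manual = r^2))

# same formula with the values reported above (r = -0.1123284, df = 48)
t_post <- -0.1123284 * sqrt(48) / sqrt(1 - (-0.1123284)^2)
print(t_post)
```

The last value should come out close to the t = -0.78319 reported by both cor.test and the numpages row of the regression.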

3 Conclusion

Let’s summarize answers to the questions I asked at the beginning of this analysis:
1. I have read 51 books so far, and I do read more every year.
2. About half of the books I read are Vietnamese (26 of 51).
3. There is not much correlation between the time of year and the amount of reading.
4. Books I found ‘meh’ got higher ratings from other people, and books I loved got lower ratings.
5. I read around 33 pages/day.

Even though this analysis is mostly exploratory, I found it to be a great exercise as a beginner. I also think there is a lot more to be done in the future.

From this project, I learned that I need to put more time into exploring the data before doing any type of analysis. I spent a lot of time making graphs and charts, because I did not know the syntax and did not understand the data structures. The write-ups were quite short for the time I put into each graph.

I could also learn more by reading other analyses on Goodreads data.

This is one of the first projects I did on something I like, and I was surprised by the amount of time I put into it. It felt great. I’m gonna do this at the end of 2021 as well.