Exploratory Analysis of my Goodreads data
24 Dec 2020
1 Introduction
Hello everyone, and welcome to an analysis of my reading journey. A little background: I started reading again in 2018, after a long time of not touching books. Before that, the last time I read a book was in 8th grade :D.
Since then, I’ve been tracking the books I read, along with my ratings and reviews, on Goodreads, and after two years I had this small data set available. I came across it the other day, when Goodreads showed me some stats on their website.
I also recently started learning data science on my own, so I am thrilled to analyze this data. The purpose of this analysis is to answer some simple exploratory and descriptive questions:
1. How many books have I read over the years? Do I read more every year?
2. I read books in both Vietnamese and English. How many are Viet and how many are English?
3. Is there a correlation between time of year and the amount of reading?
4. Do I give better ratings than other people do?
5. How fast do I read?
I am new to this, so if you have any comments or suggestions, please let me know!
LET’S DIVE IN!
2 The Analysis
2.1 The Goodreads Data
First, let’s grab our data.
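library(dplyr); library(tidyr); library(lubridate); library(ggplot2); library(knitr); library(kableExtra)  # packages used below
# encoding UTF-8 since there are Vietnamese characters
data <- read.csv("data/goodreads_library_export.csv", encoding = "UTF-8", header = TRUE, stringsAsFactors = FALSE)
dim(data)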
## [1] 57 31
There are 57 books in this data set. This includes books I have read, am currently reading, and plan to read. It is a small data set, but I believe it holds some interesting insights. Each book is described by 31 variables.
This is a sample of the data set. The column My.Review is left out because it made the row height too large. There are also a lot of missing values.
#create a table
kable(data[,-which(names(data) %in% "My.Review")]) %>%
kable_styling() %>%
scroll_box(width ="800px",height = "350px")
Book.Id | Title | Author | Author.l.f | Additional.Authors | ISBN | ISBN13 | My.Rating | Average.Rating | Publisher | Binding | Number.of.Pages | Year.Published | Original.Publication.Year | Date.Read | Date.Added | Bookshelves | Bookshelves.with.positions | Exclusive.Shelf | Spoiler | Private.Notes | Read.Count | Recommended.For | Recommended.By | Owned.Copies | Original.Purchase.Date | Original.Purchase.Location | Condition | Condition.Description | BCID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32498468 | Before the Fall | Noah Hawley | Hawley, Noah | =“1455561797” | =“9781455561797” | 5 | 3.72 | Grand Central Publishing | Paperback | 416 | 2017 | 2016 | 2020/01/25 | 2020/01/25 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
40513711 | Financial Freedom: A Proven Path to All the Money You Will Ever Need | Grant Sabatier | Sabatier, Grant | Vicki Robin | =“0525540881” | =“9780525540885” | 5 | 4.00 | Avery Publishing Group | Hardcover | 352 | 2019 | NA | 2020/08/02 | 2020/06/18 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
12701065 | The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career | Reid Hoffman | Hoffman, Reid | Ben Casnocha | =“0307888908” | =“9780307888907” | 4 | 3.85 | Crown Business | Hardcover | 260 | 2012 | 2012 | 2020/12/11 | 2020/11/27 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
32989564 | The Comprehensive INFP Survival Guide | Heidi Priebe | Priebe, Heidi | =“1945796154” | =“9781945796159” | 4 | 4.24 | Thought Catalog Books | Paperback | 274 | 2016 | 2016 | 2020/04/20 | 2020/04/14 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
29875487 | The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life | Mark Manson | Manson, Mark | =“0062641549” | =“9780062641540” | 5 | 3.94 | Harper | Paperback | 206 | 2016 | 2016 | 2020/03/05 | 2020/03/01 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
52061818 | Bạn Có Phải Là Đứa Trẻ Sợ Hãi Ẩn Sau Lớp Vỏ Trưởng Thành? | Beth Evans | Evans, Beth | Trịnh Thu Hằng | =“9786047761” | ="" | 4 | 3.73 | Công Ty TNHH Văn Hoá & Truyền Thông Skybooks | Paperback | 191 | 2018 | 2018 | 2019/12/23 | 2019/12/22 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
20697410 | Golden Son (Red Rising, #2) | Pierce Brown | Brown, Pierce | =“0345539818” | =“9780345539816” | 5 | 4.44 | Del Rey | Hardcover | 442 | 2015 | 2015 | 2020/11/10 | 2020/10/29 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
25574881 | Dòng Thời Gian | Michael Crichton | Crichton, Michael | ="" | ="" | 4 | 3.85 | NXB Lao Động | Paperback | 494 | 2015 | 1999 | 2020/12/15 | 2020/10/23 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
18966806 | Morning Star (Red Rising Saga, #3) | Pierce Brown | Brown, Pierce | =“0345539842” | =“9780345539847” | 5 | 4.48 | Del Rey | Hardcover | 524 | 2016 | 2016 | 2020/11/26 | 2020/11/17 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
15839976 | Red Rising (Red Rising Saga, #1) | Pierce Brown | Brown, Pierce | =“0345539788” | =“9780345539786” | 5 | 4.24 | Del Rey (Random House) | Hardcover | 382 | 2014 | 2014 | 2020/10/15 | 2020/09/11 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
38746485 | Becoming | Michelle Obama | Obama, Michelle | ="" | ="" | 0 | 4.53 | Crown | Hardcover | 426 | 2018 | 2018 | 2020/09/17 | currently-reading | currently-reading (#1) | currently-reading | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
8520610 | Quiet: The Power of Introverts in a World That Can’t Stop Talking | Susan Cain | Cain, Susan | =“0307352145” | =“9780307352149” | 5 | 4.06 | Crown Publishing Group/Random House, Inc. | Hardcover | 333 | 2012 | 2012 | 2020/09/09 | 2020/04/05 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
23398763 | Everything I Never Told You | Celeste Ng | Ng, Celeste | =“0143127551” | =“9780143127550” | 5 | 3.86 | Penguin Books | Paperback | 292 | 2015 | 2014 | 2020/08/18 | 2020/08/06 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
86154 | The Mystery of Capital: Why Capitalism Triumphs in the West and Fails Everywhere Else | Hernando de Soto | Soto, Hernando de | =“0465016154” | =“9780465016150” | 4 | 4.00 | Basic Books | Paperback | 288 | 2003 | 2000 | 2018/05/22 | 2018/06/06 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
28800977 | Lời Nói Dối Vĩ Đại Của Não | Kelly McGonigal | McGonigal, Kelly | Khánh Thủy | ="" | =“9786045929889” | 4 | 4.15 | Thái Hà Books & NXB Lao Động | Paperback | 228 | 2015 | 2011 | 2018/01/14 | 2018/06/06 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
60931 | Kindred | Octavia E. Butler | Butler, Octavia E. | =“0807083690” | =“9780807083697” | 5 | 4.26 | Beacon Press | Paperback | 287 | 2004 | 1979 | 2019/02/17 | 2019/02/17 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
28818221 | Kindred: A Graphic Novel Adaptation | Damian Duffy | Duffy, Damian | Octavia E. Butler, John Jennings, Nnedi Okorafor | =“141970947X” | =“9781419709470” | 4 | 4.19 | Harry N. Abrams | Hardcover | 240 | 2017 | 2017 | 2019/03/29 | 2019/04/02 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
36045222 | Nhân Tố Enzyme: Minh Họa | Hiromi Shinya | Shinya, Hiromi | Như Nữ | ="" | =“9786047735303” | 4 | 4.16 | Nxb. Thế Giới & Thái Hà Books | Paperback | 79 | 2017 | NA | 2019/04/11 | 2019/04/11 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
45420940 | Gửi bạn, người đã trưởng thành mà chưa tìm thấy tài năng | Shinzi Kamioka | Kamioka, Shinzi | Nguyễn Quốc Vượng | ="" | ="" | 3 | 3.57 | Nhà Xuất Bản Phụ Nữ | Bìa mềm | 245 | 2019 | 2017 | 2019/12/29 | 2019/12/29 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
49506165 | Những điều giữ tôi còn sống | Matt Haig | Haig, Matt | N.D.T. Anh | ="" | ="" | 5 | 4.16 | Paperback | 224 | 2019 | 2015 | 2020/01/08 | 2020/01/08 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
40198460 | Ăn Ít Để Khỏe – 1 Bữa Là Đủ Sao Cần Phải 3 | Yoshinori Nagumo | Nagumo, Yoshinori | Minh Yến | ="" | ="" | 3 | 3.66 | NXB Lao Động, Thái Hà | Paperback | 200 | 2018 | 2012 | 2020/02/04 | 2020/02/04 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
6486483 | Emotional Intelligence 2.0 | Travis Bradberry | Bradberry, Travis | Jean Greaves, Patrick Lencioni | =“0974320625” | =“9780974320625” | 4 | 3.84 | Talentsmart | Hardcover | 255 | 2009 | 2003 | 2020/07/08 | 2020/06/18 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
76865 | Good to Great: Why Some Companies Make the Leap… and Others Don’t | James C. Collins | Collins, James C. | =“0066620996” | =“9780066620992” | 0 | 4.11 | Harper Business | Hardcover | 300 | 2001 | 2001 | 2020/06/15 | to-read | to-read (#5) | to-read | NA | NA | 0 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
13596297 | Phi Lý Trí: Khám Phá Những Động Lực Vô Hình Ẩn Sau Những Quyết Định Của Con Người | Dan Ariely | Ariely, Dan | Hồng Lê | ="" | ="" | 4 | 4.13 | Lao động Xã Hội & Alphabooks | 268 | 2009 | 2008 | 2020/06/14 | 2020/04/21 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
97642 | No More Mr. Nice Guy | Robert A. Glover | Glover, Robert A. | =“0762415339” | =“9780762415335” | 4 | 4.06 | Running Press Adult | Hardcover | 208 | 2003 | 2000 | 2020/06/11 | 2020/05/21 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
35133922 | Educated | Tara Westover | Westover, Tara | ="" | ="" | 5 | 4.46 | Random House | Hardcover | 334 | 2018 | 2018 | 2020/05/19 | 2020/05/03 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
38447 | The Handmaid’s Tale (The Handmaid’s Tale, #1) | Margaret Atwood | Atwood, Margaret | ="" | ="" | 4 | 4.11 | Anchor Books | Paperback | 314 | 1998 | 1985 | 2020/04/05 | 2020/03/23 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
6708 | The Power of Now: A Guide to Spiritual Enlightenment | Eckhart Tolle | Tolle, Eckhart | =“1577314808” | =“9781577314806” | 0 | 4.13 | New World Library | Paperback | 229 | 2004 | 1997 | 2020/04/05 | to-read | to-read (#4) | to-read | NA | NA | 0 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
3450744 | Nudge: Improving Decisions About Health, Wealth, and Happiness | Richard H. Thaler | Thaler, Richard H. | Cass R. Sunstein | =“014311526X” | =“9780143115267” | 0 | 3.84 | Penguin Books | Paperback | 314 | 2009 | 2008 | 2020/04/05 | to-read | to-read (#3) | to-read | NA | NA | 0 | NA | NA | 0 | NA | NA | NA | NA | NA | |
26530355 | Misbehaving: The Making of Behavioral Economics | Richard H. Thaler | Thaler, Richard H. | =“039335279X” | =“9780393352795” | 0 | 4.19 | W. W. Norton Company | Paperback | 432 | 2016 | 2016 | 2020/04/05 | to-read | to-read (#2) | to-read | NA | NA | 0 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
52539809 | Mắt Biếc | Nguyễn Nhật Ánh | Ánh, Nguyễn Nhật | ="" | =“9786041140783” | 5 | 4.04 | NXB Trẻ | Paperback | 298 | 2019 | 1990 | 2020/01/02 | 2020/01/02 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
29780253 | Born a Crime: Stories From a South African Childhood | Trevor Noah | Noah, Trevor | =“0385689225” | =“9780385689229” | 5 | 4.45 | Doubleday Canada | Hardcover | 289 | 2016 | 2016 | 2020/03/19 | 2020/03/14 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
34273236 | Little Fires Everywhere | Celeste Ng | Ng, Celeste | =“0735224293” | =“9780735224292” | 5 | 4.09 | Penguin Press | Hardcover | 338 | 2017 | 2017 | 2020/03/13 | 2020/03/09 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
1922929 | The Black Swan: The Impact of the Highly Improbable | Nassim Nicholas Taleb | Taleb, Nassim Nicholas | =“081297381X” | =“9780812973815” | 4 | 3.94 | Random House Trade Paperbacks | Paperback | 444 | 2010 | 2007 | 2020/03/01 | 2018/06/20 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
13612914 | Óc sáng suốt | Nguyễn Duy Cần | Cần, Nguyễn Duy | ="" | ="" | 0 | 4.21 | NXB Trẻ | Paperback | 180 | 2011 | 1952 | 2020/01/16 | to-read | to-read (#1) | to-read | NA | NA | 0 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
25802987 | Chuỗi Án Mạng A.B.C | Agatha Christie | Christie, Agatha | Võ Thị Hương Lan | ="" | ="" | 4 | 4.02 | Nxb. Trẻ | Paperback | 300 | 2015 | 1936 | 2020/01/10 | 2020/01/10 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
762462 | One Up On Wall Street: How to Use What You Already Know to Make Money in the Market | Peter Lynch | Lynch, Peter | John Rothchild | =“0743200403” | =“9780743200400” | 4 | 4.23 | Simon Schuster | Paperback | 304 | 2000 | 1988 | 2019/11/26 | 2019/10/20 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
21558662 | Nhà Giả Kim | Paulo Coelho | Coelho, Paulo | Lê Chu Cầu | ="" | ="" | 4 | 3.88 | Nxb. Văn Học | Paperback | 228 | 2013 | 1988 | 2019/08/13 | 2019/08/13 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
36000433 | The INFJ Personality Guide: Understand yourself, reach your potential, and live a life of purpose. | Bo Miller | Miller, Bo | ="" | ="" | 5 | 4.23 | Kindle Edition | 102 | 2017 | NA | 2019/10/20 | 2019/10/19 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||||
22015713 | Thi vương Tương Tây | Tianxia Bachang | Bachang, Tianxia | Nguyễn Thanh Tân | ="" | ="" | 4 | 3.69 | Nhã Nam & NXB Văn học | Bìa mềm | 532 | NA | NA | 2019/10/16 | 2019/09/16 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
35517485 | Gieo Trồng Hạnh Phúc | Thich Nhat Hanh | Hanh, Thich Nhat | ="" | ="" | 5 | 4.33 | NXB Lao Động Xã Hội | Paperback | 350 | 2014 | NA | 2019/09/06 | 2019/08/23 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
35561949 | Hẹn với thần chết | Agatha Christie | Christie, Agatha | Trần Hữu Kham | ="" | =“9786041076648” | 4 | 3.88 | NXB Trẻ | Paperback | NA | 2015 | 1937 | 2019/08/15 | 2019/08/15 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
39803056 | Nếu Biết Trăm Năm Là Hữu Hạn | Phạm Lữ Ân | Ân, Phạm Lữ | ="" | =“9786045329610” | 5 | 4.15 | NXB Hội Nhà Văn | Paperback | 187 | 2018 | 2011 | 2019/07/10 | 2019/07/10 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
13857191 | Nam Hải Quy Khư (Ma Thổi Đèn II, #2) | Tianxia Bachang | Bachang, Tianxia | Lục Hương | ="" | ="" | 4 | 3.88 | Văn Học | Paperback | 660 | 2012 | 2012 | 2019/05/24 | 2019/04/12 | read | NA | NA | 2 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
36045198 | Nhân Tố Enzyme: Trẻ Hóa | Hiromi Shinya | Shinya, Hiromi | Như Nữ | ="" | =“9786047735297” | 4 | 4.05 | Nxb. Thế Giới & Thái Hà Books | Paperback | 175 | 2017 | NA | 2019/04/11 | 2019/04/03 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
35506630 | Nhân Tố Enzyme | Hiromi Shinya | Shinya, Hiromi | Như Nữ | ="" | =“9786047728152” | 4 | 4.10 | Thái Hà Books & Nxb. Thế Giới | Paperback | 223 | 2016 | 2005 | 2019/03/01 | 2019/02/03 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
35680030 | Nhân Tố Enzyme: Thực Hành | Hiromi Shinya | Shinya, Hiromi | Như Nữ | ="" | =“9786047734429” | 5 | 4.11 | Nxb. Thế Giới & Thái Hà Books | Paperback | 300 | 2017 | NA | 2019/04/02 | 2019/03/05 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
136531 | The Laramie Project | Moisés Kaufman | Kaufman, Moisés | Tectonic Theater Project | =“0375727191” | =“9780375727191” | 3 | 4.18 | Vintage | Paperback | 110 | 2001 | 2001 | 2019/03/16 | 2019/03/16 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
42865890 | Giá Trị Cuộc Đời | Trương Chi | Chi, Trương | ="" | ="" | 5 | 5.00 | Hồng Đức | Paperback | 262 | 2015 | NA | 2018/11/28 | 2018/11/19 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
13635670 | Thần Cung Côn Luân (Ma thổi đèn, #4) | Tianxia Bachang | Bachang, Tianxia | ="" | ="" | 4 | 3.73 | Văn Học, Nhã Nam | Paperback | 608 | 2010 | 2008 | 2018/07/04 | 2018/06/21 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
13635674 | Mộ hoàng bì tử (Ma thổi đèn II, #1) | Tianxia Bachang | Bachang, Tianxia | Lục Hương | ="" | ="" | 4 | 3.90 | Văn Học, Nhã Nam | Paperback | 640 | 2012 | 2012 | 2019/01/20 | 2019/01/20 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
27040147 | The Richest Man in Babylon | George S. Clason | Clason, George S. | =“1939438330” | =“9781939438331” | 5 | 4.27 | Dauphin Publications | Paperback | 118 | 2015 | 1926 | 2019/02/03 | 2018/06/20 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
22755371 | Đứa Trẻ Thứ 44 | Tom Rob Smith | Smith, Tom Rob | Võ Hồng Long | ="" | =“9786045393772” | 4 | 4.09 | Nhã Nam & NXB Thời Đại | Paperback | 362 | 2014 | 2008 | 2018/01/01 | 2018/06/06 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
30273181 | Án Mạng Trên Sông Nile (Hercule Poirot, #17) | Agatha Christie | Christie, Agatha | Lan Phương | ="" | =“9786041015692” | 5 | 4.10 | NXB Trẻ | Paperback | 336 | 2015 | 1937 | 2018/10/06 | 2018/09/30 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | ||
13118173 | The Intelligent Investor (Collins Business Essentials) | Benjamin Graham | Graham, Benjamin | =“9780060555” | ="" | 4 | 4.23 | Harper Business | Paperback | 623 | 2006 | 1949 | 2018/12/22 | 2018/06/20 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
32521178 | Tuổi Trẻ Đáng Giá Bao Nhiêu | Rosie Nguyễn | Nguyễn, Rosie | ="" | =“9786045370193” | 5 | 4.30 | Nxb. Hội Nhà Văn & Nhã Nam | Paperback | 292 | 2016 | 2017 | 2018/06/09 | 2018/06/06 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA | |||
23839097 | Người Giỏi Không Bởi Học Nhiều | Alphabooks | Alphabooks, Alphabooks | ="" | ="" | 4 | 3.75 | Alphabooks & NXB Lao Động - Xã Hội | Paperback | 208 | 2012 | 2012 | 2018/06/21 | 2018/06/09 | read | NA | NA | 1 | NA | NA | 0 | NA | NA | NA | NA | NA |
I will convert some variables to their proper data types.
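The conversion code itself is not shown here. A minimal sketch of what it could look like, assuming the export stores dates as `YYYY/MM/DD` strings (as in the table above) and leaves the date blank for unread books. The toy `books` data frame and the helper `to_date` are hypothetical names used only for illustration:

```r
# Toy stand-in for the Goodreads export (hypothetical values)
books <- data.frame(
  Date.Read  = c("2020/01/25", "", "2018/05/22"),
  Date.Added = c("2020/01/25", "2020/09/17", "2018/06/06"),
  stringsAsFactors = FALSE
)

# Turn blank strings into NA, then parse the remaining values as dates
to_date <- function(x) {
  x[which(x == "")] <- NA_character_
  as.Date(x, format = "%Y/%m/%d")
}

books$Date.Read  <- to_date(books$Date.Read)
books$Date.Added <- to_date(books$Date.Added)

str(books)
```

With real `Date` columns in place, functions such as `year()`, `month()` and `difftime()` used later in this post work directly on `Date.Read` and `Date.Added`.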
2.2 Descriptive and Exploratory Analysis
2.2.1 Books
shelf1 <- as.data.frame(table(data$Exclusive.Shelf))
colnames(shelf1)[1] <- "Shelf"
kable(shelf1) %>%
kable_styling(latex_options = "striped")
Shelf | Freq |
---|---|
currently-reading | 1 |
read | 51 |
to-read | 5 |
I have read 51 books so far.
We are only interested in the books I have read, so we will filter for those.
We will also give some variables shorter names.
# filter read books
rdata <- data %>% filter(Exclusive.Shelf == "read")
# a tibble prints more compactly
rdata <- as_tibble(rdata)
# give this variable a shorter name
names(rdata)[names(rdata) == "Original.Publication.Year"] <- "opy"
Next, I want to see how many books I read each year.
# create shelf2, a data frame of books read per year
shelf2 <- rdata %>% group_by(year(Date.Read)) %>% select('year(Date.Read)')
shelf2 <- as.data.frame(table(shelf2[,1]))
shelf2 <- mutate(shelf2,pct_change=round((Freq/lag(Freq)-1)*100,2))
colnames(shelf2) <- c("Year","Books Read","Percent Change")
Year | Books Read | Percent Change |
---|---|---|
2018 | 9 | NA |
2019 | 19 | 111.11 |
2020 | 23 | 21.05 |
There was a big jump in books read between 2018 and 2019 (111%). Growth from 2019 to 2020 was much smaller (21%). I’m reading more books each year, but the rate of increase is slowing.
Here are the oldest and newest books.
#create a table with row as the title of the newest books, remove missing values
table <- tibble(rdata$Title[rdata$opy==max(na.omit(rdata$opy))&is.na(rdata$opy)==FALSE])
table[,2] <- rep(max(na.omit(rdata$opy)), nrow(table))
colnames(table) <- c("Newest books","Year")
kable(table) %>%
kable_styling(latex_options = "striped")
#create a table with row as the title of the oldest books, remove missing values
table <- tibble(rdata$Title[rdata$opy==min(na.omit(rdata$opy))&is.na(rdata$opy)==FALSE])
table [,2] <- rep(min(na.omit(rdata$opy)),nrow(table))
colnames(table) <- c("Oldest books","Year")
kable(table) %>%
kable_styling(latex_options = "striped")
Newest books | Year |
---|---|
Bạn Có Phải Là Đứa Trẻ Sợ Hãi Ẩn Sau Lớp Vỏ Trưởng Thành? | 2018 |
Educated | 2018 |
Oldest books | Year |
---|---|
The Richest Man in Babylon | 1926 |
Here are the thickest and thinnest books.
table1 <- tibble(rdata$Title[which.max(rdata$Number.of.Pages)])
table1[,2] <- max(rdata$Number.of.Pages,na.rm = TRUE)
table1[2,1] <- rdata$Title[which.min(rdata$Number.of.Pages)]
table1[2,2] <-min(rdata$Number.of.Pages,na.rm=TRUE)
colnames(table1) <- c("Books", "No. of Pages")
kable(table1)%>%
kable_styling(latex_options = "striped")
Books | No. of Pages |
---|---|
Nam Hải Quy Khư (Ma Thổi Đèn II, #2) | 660 |
Nhân Tố Enzyme: Minh Họa | 79 |
Lots of these books are Vietnamese, since I pick up some Viet books every time I visit home. That said, I did not expect the thickest book to be a Viet one. Now I want to know how often I read Viet books.
2.2.2 Viet books
To detect which books are in Vietnamese, we need an extra package. I will use stringi to check the encoding of each title.
Thanks to this StackOverflow post, I was able to find this package and its functions.
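# load the stringi package
library(stringi)
# create a data frame with each title and its encoding
title <- tibble(Title=rdata$Title)
title$Encoding <- stri_enc_mark(title$Title)
# first three titles marked UTF-8
kable(head(subset(title,title$Encoding=="UTF-8"),3)) %>%
kable_styling(latex_options = "striped")
# first three titles marked ASCII
kable(head(subset(title,title$Encoding=="ASCII"),3)) %>%
kable_styling(latex_options = "striped")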
Title | Encoding |
---|---|
Bạn Có Phải Là Đứa Trẻ Sợ Hãi Ẩn Sau Lớp Vỏ Trưởng Thành? | UTF-8 |
Dòng Thời Gian | UTF-8 |
Lời Nói Dối Vĩ Đại Của Não | UTF-8 |
Title | Encoding |
---|---|
Before the Fall | ASCII |
Financial Freedom: A Proven Path to All the Money You Will Ever Need | ASCII |
The Start-up of You: Adapt to the Future, Invest in Yourself, and Transform Your Career | ASCII |
So we can assume that Viet titles are the ones marked UTF-8, since they contain non-ASCII characters. Fortunately, the data set is small, so I was able to check by hand that no books were missed. The assumption would fail for a Viet title with no non-ASCII characters, or for an English title that happens to contain one, and that would be hard to catch in a big data set.
## [1] 26
## [1] 25
I read 26 books in Viet and 25 books in English. This is quite different from what I expected, which was that I read more books in English than in Viet.
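The two counts above are printed without their code. A base-R sketch of the same idea that does not need stringi: flag a title as Viet when it contains any character outside the ASCII range. The sample `titles` vector and the helper `has_non_ascii` are hypothetical, and the heuristic shares the limitation described above:

```r
# Hypothetical sample of titles; the second one is "Mắt Biếc" written with \u escapes
titles <- c("Before the Fall",
            "M\u1eaft Bi\u1ebfc",
            "Educated")

# TRUE when any code point in the string lies outside the ASCII range (0-127)
has_non_ascii <- function(s) any(utf8ToInt(s) > 127L)

is_viet <- vapply(titles, has_non_ascii, logical(1), USE.NAMES = FALSE)

# Count titles per guessed language
table(ifelse(is_viet, "Viet", "English"))
```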
# only find viet books
vietdata <- merge(rdata,title,by="Title")
vietdata <- vietdata %>% filter(Encoding=="UTF-8") %>% mutate(month=month.abb[month(Date.Read)])
vietdata <- as.data.frame(table(vietdata$month))
colnames(vietdata)[1] <- "Month"
#plot viet books by month
ggplot(vietdata,aes(x=Month,y=Freq))+
geom_bar(stat = "identity")+
scale_x_discrete(limits = month.abb)+
labs(x="Months",y="Number of books",title="Viet books read by months")+
theme(plot.title = element_text(hjust = 0.5))
This graph makes sense, since January is when I finished the most Viet books. This might be because I went home for Christmas and New Year and grabbed some books to read on my flight.
April, June and December come next, with the second-highest number of Viet books finished. I read some Viet books in December so I can bring them home and put them on the shelves. I have no explanation for April and June.
2.2.3 Number of Pages
Next, we will look at the number of pages I’ve read.
datagraph <- rdata %>% mutate(month=floor_date(Date.Read,unit="month")) %>% group_by(month) %>% mutate(sumpages=sum(Number.of.Pages,na.rm = TRUE))
datagraph %>%
ggplot(aes(x=month,y=sumpages))+
geom_bar(position="dodge",
stat="identity",
fill="#00AFBB",width = 20) +
geom_text(aes(label=sumpages),vjust=-.5,size=2.5)+
labs(x="Months",y="Pages Read",title="Pages Read through time")+
theme(plot.title = element_text(hjust = 0.5))+
ylim(0,1500)+
scale_x_date(breaks="month",
labels = function(x) ifelse(year(x)%in%c(2017,max(year(datagraph$month))+1),"", format(x,"%b %Y")),
limits=as.Date(c('2017-12-15', '2020-12-15')))+
theme(axis.text.x = element_text(angle=60, hjust = 1))
As you can see, I’m not a consistent reader :D.
Note that this shows how many pages were in the books I finished in a given month, rather than how much I actually read that month. For example, if I started a 500-page book in January and finished it in May, all 500 pages are counted in May.
I haven’t figured out a way to reflect the true number of pages read. I hope to do this in a future analysis.
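One possible approach, sketched under strong assumptions: treat a known start date (for example Date Added, when the book was added on the day it was started) as the first reading day, and assume an even reading pace until the finish date. Each book's pages are then spread over its reading days and summed per month. The `books` data frame, its column names, and its values are hypothetical:

```r
# Hypothetical example: one 500-page book read from mid-January to mid-May
books <- data.frame(
  title = "A long book",
  start = as.Date("2020-01-15"),  # assumed start date
  end   = as.Date("2020-05-14"),  # finish date
  pages = 500
)

# Spread each book's pages evenly over every day from start to end (inclusive)
pieces <- lapply(seq_len(nrow(books)), function(i) {
  days <- seq(books$start[i], books$end[i], by = "day")
  data.frame(month = format(days, "%Y-%m"),
             pages = books$pages[i] / length(days))
})
daily <- do.call(rbind, pieces)

# Sum the daily shares per calendar month
monthly <- aggregate(pages ~ month, data = daily, FUN = sum)
monthly
```

The even-pace assumption is crude, but it moves the 500 pages out of a single May bar and into the five months in which the book was actually open.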
# create a year vector from the data
years <- as.integer(unique(year(data$Date.Read)))
years <- na.omit(years)
years <- sort(years, decreasing = FALSE)
# create a color vector for graph colors
colors <- c("deepskyblue","dodgerblue4","steelblue")
# create a vector for total number of pages read in a year
totalpages <- rep(0,length(years))
for (i in seq_along(years)){
totalpages[i] <- sum(data$Number.of.Pages[year(data$Date.Read)==years[i]],na.rm=TRUE)
}
#loop through years to create chart of pages read in each year
for (i in seq_along(years)){
print(datagraph %>% filter(year(Date.Read)==as.integer(years[i])) %>%
ggplot(aes(x=month,y=sumpages))+
geom_bar(position="dodge",
stat="identity",
fill=colors[i]) +
geom_text(aes(label=sumpages),vjust=-.5)+
labs(x="Months",y="Pages Read",title=paste(totalpages[i]," pages read in ",years[i],sep="")) +
theme(plot.title = element_text(hjust = 0.5))+
ylim(0,1500)+
scale_x_date(breaks="month",
labels = function(x) ifelse(year(x)!=years[i],"" , months(x, TRUE)),
limits=as.Date(c(paste(toString(years[i]-1),"-12-15",sep=""),paste(toString(years[i]),"-12-15",sep="")))
)
)
}
In 2018, I didn’t finish any books in February, March or April. Again, this only reflects the fact that I didn’t finish any books, not that I was not reading.
In 2019 and 2020, there are few empty months, since I read more frequently and thus finished books more often. I wonder if my speed also plays a role here.
2.2.4 Speed
Another thing I want to know is how fast I finish a book. I will create a variable in rdata to hold the speed, defined as the number of pages divided by the difference in days, datediff.
datediff is calculated as Date Read minus Date Added.
# days to finish a book
rdata$datediff <- difftime(rdata$Date.Read,rdata$Date.Added,units = "days")
rdata$datediff
## Time differences in days
## [1] 0 45 14 6 4 1 12 53 9 34 157 12 -15 -143 0 -4 0 0
## [19] 0 0 20 54 21 16 13 0 5 4 620 0 37 0 1 30 14 0
## [37] 0 42 8 26 28 0 9 13 0 228 -156 6 185 3 12
A drawback of the Goodreads export is that it only includes Date Added and Date Read. Date Added reflects when I added the book to my shelf, and Date Read reflects when I marked it as finished. Normally, if I added a book on the day I started reading it, Date Read falls after Date Added, which gives a positive datediff.
There are cases where I added a book only after I had finished it, so Date Read falls before Date Added. The negative values represent those cases.
The 0s are the cases where I added the book on the very day I finished it.
Goodreads also has a Date Started field, but it is not included in the export.
## Group.1 x
## 1 FALSE 34
## 2 TRUE 17
I will remove the zeros and the negative values in order to calculate speed. This removes 17 books.
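The tally above is printed without its code. A sketch that arrives at the same counts (not necessarily the original call), using the day differences copied from the output above:

```r
# Day differences copied from the datediff output above
datediff <- c(0, 45, 14, 6, 4, 1, 12, 53, 9, 34, 157, 12, -15, -143, 0, -4, 0, 0,
              0, 0, 20, 54, 21, 16, 13, 0, 5, 4, 620, 0, 37, 0, 1, 30, 14, 0,
              0, 42, 8, 26, 28, 0, 9, 13, 0, 228, -156, 6, 185, 3, 12)

# TRUE marks books whose datediff is zero or negative,
# which cannot be used for a speed estimate
unusable <- datediff <= 0
table(unusable)
```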
rdata$speed <- rdata$Number.of.Pages/as.integer(rdata$datediff)
# keep only books with a positive, finite speed
speedat <- rdata %>% filter(!is.na(speed), is.finite(speed), speed > 0)
#average speed
mean(speedat$speed)
## [1] 33.31033
My average reading speed is 33.31 pages/day.
Let’s look at it by year.
kable(speedat %>% group_by(year=year(Date.Read))%>%summarise(avgspeed = mean(speed,na.rm=TRUE))) %>%
kable_styling(latex_options = "striped")
year | avgspeed |
---|---|
2018 | 41.65243 |
2019 | 40.13476 |
2020 | 26.73828 |
Average speed in 2018 somehow is the highest, despite the low number of books. Is it because of the number of pages per book?
kable(speedat %>% group_by(year(Date.Read))%>%summarise(avgspeed = mean(speed,na.rm=TRUE),avgpages = mean(Number.of.Pages,na.rm=TRUE),n=n()) %>% mutate(true_n=shelf2$`Books Read`) %>% mutate(removed=true_n-n)) %>%
kable_styling(latex_options = "striped")
year(Date.Read) | avgspeed | avgpages | n | true_n | removed |
---|---|---|---|---|---|
2018 | 41.65243 | 388.1667 | 6 | 9 | 3 |
2019 | 40.13476 | 295.5000 | 10 | 19 | 9 |
2020 | 26.73828 | 333.8333 | 18 | 23 | 5 |
The average number of pages per book in 2018 is actually higher than in 2019 and 2020. As I understand it, I read fewer books in 2018, but each in a shorter window. In 2020, I read more books, more frequently, but at a slower pace.
The removed data also plays a part here. There are 3 books removed for 2018, 9 for 2019 and 5 for 2020. Even though 2020 has more than double the books of 2018, it might be that the books removed in 2020 have a really high number of pages.
Or I just read super fast in a short window in 2018.
2.2.5 Ratings & Reviews
2.2.5.1 Rating
I want to compare my ratings with the average Goodreads ratings.
## [1] 3.912281
## [1] 4.084035
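These two means are printed without their code. Summing the My.Rating column of the table above gives 223, and 223/57 ≈ 3.912, so the first number appears to average over all 57 rows, including the six unread books, which the export stores with a rating of 0. A sketch of the difference that makes, on hypothetical toy data:

```r
# Toy data (hypothetical): unrated books carry My.Rating = 0
ratings <- data.frame(
  My.Rating      = c(5, 4, 0, 3, 0),
  Average.Rating = c(3.72, 4.00, 4.53, 3.57, 4.11)
)

# Mean over every row treats the zeros as real ratings
mean_all <- mean(ratings$My.Rating)

# Mean over rated books only
rated <- ratings[ratings$My.Rating > 0, ]
mean_rated <- mean(rated$My.Rating)

c(all_rows = mean_all, rated_only = mean_rated)
```

Applied to the real data, restricting to rated books (for example by using rdata instead of data) would give 223/51, about 4.37, for my own ratings.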
#creating a dataset of my rating versus average rating
ratedata<-aggregate(rdata$Average.Rating,list(myrating=rdata$My.Rating),FUN="mean")
# keep my rating as the grouping class for the x-axis
ratedata$class <- ratedata$myrating
colnames(ratedata)[2] <-"avg"
ratedata <- gather(ratedata,var,value,myrating,avg)
#plot it
ggplot(ratedata, aes(x=class,y=value,fill=var))+
geom_bar(stat='identity',position='dodge')+
geom_text(aes(label=round(value,2)),vjust=-.5,position=position_dodge(width=0.9))+
labs(y="Rating",title="Books Rating: Phan's versus Average")+
ylim(0,6)+
theme(plot.title = element_text(hjust = 0.5),legend.title=element_blank(),axis.title.x = element_blank(),axis.text.x = element_blank(),axis.ticks.x = element_blank()) +
scale_fill_discrete(labels = c("Average Rating", "My Rating"))
The 3-star books, which are “meh” in my opinion, seem to be liked more by others. On the other hand, books I found excellent (5 stars) are rated much lower on average.
2.2.5.2 Reviews
Do thicker books have higher ratings? Let’s find out. First, let’s build a data frame with Number.of.Pages and Average.Rating.
# data frame of number of pages and average rating
ratereview <- tibble(numpages = rdata$Number.of.Pages,rating = rdata$Average.Rating,title=rdata$Title)
ggplot(ratereview,aes(x=rating,y=numpages))+geom_point()
At first glance, there is no correlation here. Books of very different lengths get similar average ratings, apart from a few outliers.
cor.test(ratereview$numpages,ratereview$rating)
model <- lm(rating~numpages,data=ratereview)
summary(model)
##
## Pearson's product-moment correlation
##
## data: ratereview$numpages and ratereview$rating
## t = -0.78319, df = 48, p-value = 0.4374
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3788314 0.1713777
## sample estimates:
## cor
## -0.1123284
##
##
## Call:
## lm(formula = rating ~ numpages, data = ratereview)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52216 -0.15153 0.01987 0.13962 0.91140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1435724 0.0912948 45.387 <2e-16 ***
## numpages -0.0002098 0.0002679 -0.783 0.437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2574 on 48 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01262, Adjusted R-squared: -0.007953
## F-statistic: 0.6134 on 1 and 48 DF, p-value: 0.4374
It seems that numpages is not correlated with rating. The correlation coefficient is around -0.11, and the p-value for the predictor is large (0.44). So this data shows no evidence of a linear relationship between a book’s thickness and its average rating.
3 Conclusion
Let’s summarize answers to the questions I asked at the beginning of this analysis:
1. I have read 51 books so far, and I do read more every year.
2. About half of the books I read are Vietnamese (26 of 51).
3. There is not much correlation between time of year and amount of reading.
4. Books I found ‘meh’ got higher ratings from other people, and books I loved got lower ratings.
5. I read around 33 pages/day.
Even though this analysis is mostly exploratory, I found it to be a great exercise as a beginner. I also think there is a lot more to be done in the future.
From this project, I learned that I need to put more time into exploring the data before doing any type of analysis. I spent a lot of time making graphs and charts because I did not know the syntax or understand the data structures. The write-ups were quite short for the time I put into each graph.
I could also learn more by reading other analyses of Goodreads data.
This is one of the first projects I have done on something I like, and I was surprised by the amount of time I put into it. It felt great. I’m gonna do this at the end of 2021 as well.