I started reading when I was fairly young, not at child-prodigy levels, but early enough that I was that kid with my nose in a book at adult parties. It was by necessity. As a quiet only child, I didn’t have many options for entertainment.
I graduated from the beautifully illustrated Ladybird books to every Indian child’s favorite British babysitter and author, Enid Blyton. My collection of books started growing, and I was immensely proud of my little home library.
I started charging kids money to borrow books from me, and decided I needed a personalized stamp to legitimize my shady business (see above). In any case, I was fortunate enough to have a study that I shared with my mother for my books. Over time, I graduated from Enid Blyton to a rapidly changing assortment of genres, with a proclivity for young adult fiction to carry me through my pained adolescent years.
Now, as a fully functioning adult with no permanent home, I bring back a suitcase load of books every time I visit home in India. I can’t bring myself to give my books away; it feels like giving away a piece of my heart, so I hoard them. Admittedly, since living in San Francisco, I’ve been taking advantage of the excellent San Francisco Public Library system to borrow books, both physical and digital. I definitely buy fewer books as a result.
Cataloging my books, all 550 of them
Since I fortuitously got stuck at home in India during COVID-19, I decided this would be a great time to start cataloging all the books I have hoarded over the years. For the purpose of this exercise, I decided to exclude everything I read prior to middle school, even though I still have those books.
I needed a way to enter all my books into some type of database. I found that Goodreads allows you to scan books via ISBN, but it was so god-awfully slow that I quickly gave up. After some research, I landed on a free tool called Libib.com, a home library management system. This allowed for almost instantaneous scanning. Even so, given the number of books I have, scanning took me several hours spread out over a week or so.
Once I finished, here’s what my library of 539 books looked like.
Metadata associated with each book included
authors | title | publisher | pages | isbn | description |
The data wasn’t super clean and I was also really curious about the ratings of the books I was reading. So I downloaded the CSV file from Libib and fed it to Goodreads. This way, I got a fairly similar, but cleaner dataset with ratings to boot. I also added the books I had left in my apartment in San Francisco to the list. This gave me a total of about 550 books.
Some interesting findings from looking at the data (thank you Excel)
Longest books
Unsurprisingly, all the fantasy books are pretty dense.
Book | Author | Pages |
The Rise and Fall of the Third Reich | William L. Shirer | 1264 |
Oathbringer, The Stormlight Archive | Brandon Sanderson | 1243 |
The Lord of the Rings, #1-3 | J.R.R. Tolkien | 1178 |
Words of Radiance, The Stormlight Archive | Brandon Sanderson | 1087 |
Atlas Shrugged | Ayn Rand | 1080 |
Gone with the Wind | Margaret Mitchell | 1011 |
The Way of Kings, The Stormlight Archive | Brandon Sanderson | 1007 |
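This ranking came out of Excel, but the same thing can be sketched in a few lines of plain Python. The snippet below is only an illustration under some assumptions: `longest_books` is a hypothetical helper, and the column names (`Title`, `Number of Pages`) are my guess at what the exported CSV uses, so adjust them to match your file.

```python
import csv
import io


def longest_books(csv_text, n=7, pages_col="Number of Pages"):
    """Return the n rows with the highest page count.

    Column names are assumptions about the export format.
    Rows with a blank or non-numeric page count are skipped.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    with_pages = [r for r in rows if (r.get(pages_col) or "").strip().isdigit()]
    return sorted(with_pages, key=lambda r: int(r[pages_col]), reverse=True)[:n]


# Tiny sample using two rows from the table above plus one row with no page count.
sample_csv = (
    "Title,Author,Number of Pages\n"
    "Gone with the Wind,Margaret Mitchell,1011\n"
    "Atlas Shrugged,Ayn Rand,1080\n"
    "No Page Count,Unknown,\n"
)

for row in longest_books(sample_csv, n=2):
    print(row["Title"], row["Number of Pages"])
```

Skipping rows without a page count matters here because the scanned metadata wasn’t perfectly clean.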
Most popular authors
Considering I bought most of my books between the ages of 12 and 16, I’m not surprised by this list either. These authors were also really prolific, churning out books quickly and regularly. The table only includes authors with at least six books in my collection.
Author | Book Count |
Meg Cabot | 25 |
Jeffrey Archer | 14 |
Agatha Christie | 11 |
Sophie Kinsella | 10 |
Eoin Colfer | 8 |
Megan McCafferty | 7 |
Sidney Sheldon | 7 |
P.C. Cast | 7 |
J.K. Rowling | 7 |
Salman Rushdie | 6 |
Michael Crichton | 6 |
Harlan Coben | 6 |
Danielle Steel | 6 |
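For anyone who would rather not use Excel, a count like this with a six-book cutoff can be sketched with the standard library. This is a minimal illustration, not what I actually ran: `prolific_authors` is a hypothetical helper, and the `author` key assumes each book has been loaded as a dictionary with a single author name.

```python
from collections import Counter


def prolific_authors(books, min_books=6):
    """Count books per author and keep authors with at least min_books titles.

    `books` is assumed to be a list of dicts with an "author" key.
    Results are sorted by count, highest first.
    """
    counts = Counter(book["author"] for book in books)
    return [(author, n) for author, n in counts.most_common() if n >= min_books]


# Made-up sample: one author clears the cutoff, one does not.
sample_books = [{"author": "Author A"}] * 6 + [{"author": "Author B"}] * 2
print(prolific_authors(sample_books))
```

Books with co-authors would need their author field split before counting, which this sketch does not attempt.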
Distribution of Goodreads Ratings
This is roughly what you would expect, with most books falling in the 3.5 to 4.5 range with a few outliers. I guess I should congratulate myself for reading very average books.
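A distribution like this is just a histogram of the average rating column. As a rough sketch, assuming the ratings have already been pulled out into a list of floats (the `rating_histogram` helper and the 0.5 bucket width are my own choices for illustration):

```python
import math
from collections import Counter


def rating_histogram(ratings, width=0.5):
    """Group ratings into buckets of the given width.

    Each key is the lower edge of a bucket, so 3.7 lands in the 3.5 bucket.
    """
    buckets = Counter(math.floor(r / width) * width for r in ratings)
    return dict(sorted(buckets.items()))


# Made-up ratings, purely to show the shape of the output.
sample_ratings = [3.7, 3.9, 4.2, 2.1]
for lower_edge, count in rating_histogram(sample_ratings).items():
    print(f"{lower_edge:.1f}+ : {count}")
```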
Book Categories
This is the bit that took me the longest since no single API gave me this out of the box. The Goodreads UI does have a field called “Genre” which they very conveniently don’t make available through their public API. I finally decided to use the Google Books API, which wasn’t perfect but came the closest. If anyone knows of anything better, let me know!
I had to code a little for this, which made me realize how rusty I am. I used Python and a nifty library called pandas, which was probably overkill for this. Here’s the code I hacked together (use at your own peril).
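If you want to try something similar, a category lookup by ISBN against the Google Books volumes endpoint might look roughly like this. To be clear, this is a minimal standard-library sketch and not the script mentioned above: the helper names `extract_categories` and `fetch_categories` are hypothetical, it only looks at the first match, and it does no rate limiting or retrying.

```python
import json
import urllib.parse
import urllib.request

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"


def extract_categories(payload):
    """Return the categories of the first matching volume, or [] if none.

    Covers both failure cases: no volume found for the ISBN, and a volume
    that has no "categories" field at all.
    """
    items = payload.get("items") or []
    if not items:
        return []
    return items[0].get("volumeInfo", {}).get("categories", [])


def fetch_categories(isbn, timeout=10):
    """Query the volumes endpoint with an isbn: search and parse the result."""
    query = urllib.parse.urlencode({"q": f"isbn:{isbn}"})
    with urllib.request.urlopen(f"{GOOGLE_BOOKS_URL}?{query}", timeout=timeout) as resp:
        return extract_categories(json.load(resp))


# Usage (makes a network request, so it is left commented out):
# print(fetch_categories("<an ISBN from your export>"))
```

Keeping the parsing separate from the network call makes it easy to check the missing-category and missing-ISBN cases without hitting the API.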
Sadly, this only gave me results for 317 of the roughly 550 books. Some books didn’t have a category assigned to them, and some had ISBNs that didn’t exist in the Google inventory. While this was disappointing, I was still happy I didn’t have to manually categorize any of them, although there was still a significant amount of data cleanup. Obviously categorization can be difficult and subjective, but it seems like the Google Books API could use some improvement in this area, especially in standardizing categories across a wide range of books.
The results were again as expected. I definitely preferred fiction to non-fiction when I was younger. To this day, I also love Indian fiction; we have some of the best writers in the world.
Category | Count |
Fiction | 125 |
Young Adult Fiction | 45 |
Indian Fiction | 34 |
Fantasy fiction | 33 |
Detective and mystery stories | 23 |
Business & Economics | 20 |
Biography & Autobiography | 13 |
History | 11 |
Comics & Graphic Novels | 4 |
Poetry | 3 |
Science | 3 |
Philosophy | 3 |
Total | 317 |
Overall, this was a fun exercise for a Sunday afternoon. While none of the trends were particularly surprising, I’m happy I now have a full catalog of my books. I think there are some interesting things I could do with book categorizations and generating recommendations, but that’s a project for another day.
I’m not attaching the dataset here, but if you have any fun ideas/projects, let me know and I’d be happy to share!