TDM 40100: Project 4 — 2023

Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3 database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, sqlite3, with the opportunity to use other skills you’ve built throughout the previous years.

Context: In TDM 20100 you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3 — a powerful database engine. In this project (and following projects), we will branch into SQL and sqlite3-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.

Scope: sqlite3, lmod, SQL

Learning Objectives
  • Create your own sqlite3 database file.

  • Analyze a large dataset and formulate CREATE TABLE statements designed to store the data.

  • Run one or more queries to test out the end result.

  • Demonstrate the ability to normalize a series of database tables.

  • Wrangle and insert data into database.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json

Questions

Question 1 (1 pt)

  1. Create a junction table between books and series called books_series.

  2. Write a list of each column in the books_series table, their data type, and a brief description of what they contain.

  3. Write an SQL query to find every book in a series with series_id 12345.

Assume that a series can have many books and a book can be a part of many series. Following a similar process as in the previous project, create another junction table called books_series.

In a markdown cell, list the columns in the books_series table, their data type, and a brief description of what they represent.

Finally, write an SQL query to find every book in a series with series id 12345.

Items to submit
  • CREATE TABLE statement for books_series.

  • List of columns in books_series with descriptions.

  • SQL query to find every book in a series with series id 12345.

Question 2 (2 pts)

  1. Create a new database called my_goodreads.db that is the same as our my.db database (after normalization), but with primary and foreign key constraints set.

As you may have noticed, we have determined the primary and foreign keys for our tables, but haven’t actually set them in the database yet. Unfortunately, sqlite3 has no way of setting these constraints after creating the able, so we will be rewriting our CREATE TABLE statements inside a new database labeled my_goodreads.db.

Going through your Jupyter notebooks from the previous project (or the beginning of this project) recreate the existing database we have made, but this time with primary and foreign key constraints set.

While doing this, remember to stick to the naming conventions and normalization we established in this project, and don’t forget to create your junction tables. It is particularly important to set primary and foreign key constraints in junction tables, as this gives us an added layer of data integrity when we are trying to modify data in our database.

If you are struggling or need a sanity check on your work, below is the number of constraints you will need to set for each table.

  • books : 1 primary key

  • reviews: 1 primary key

  • authors: 1 primary key

  • series : 1 primary key

  • authors_books : 1 primary key, 2 foreign keys

  • books_series : 1 primary key, 2 foreign keys

You can check the constraints on a table by running .schema table_name in a sqlite3 session.

Items to submit
  • Modified CREATE TABLE statements for our new database, with proper constraints set.

Question 3 (2 pts)

  1. Create a new junction table to handle similar_books.

  2. Create a new table and accompanying junction table to handle popular_shelves.

  3. Write 2+ sentences detailing an alternative approach to handling popular_shelves, and any benefits your approach has over the one provided.

  4. Write two queries that use all of the new tables created during this question.

The remaining two fields that need to be dealt with are similar_books and popular_shelves. These fields are a bit more complicated than the others, so you will need to think a bit more before implementing a solution. Remember to keep normalization in mind.

Firstly, for similar_books, I would recommend another junction table. As there is nothing to stop you from linking a table to itself, you can simply create a junction table that links books to each other. Write (and run) a CREATE TABLE statement that creates a junction table between books and books called similar_books. Paste your CREATE TABLE statement into your Jupyter notebook for reference.

Next, for popular_shelves, let’s create two more tables. First, create a shelves table with id and name columns. Choose data types appropriately. Next, create a junction table between shelves and books called books_shelves with the appropriate column names and types. Be sure to normalize everything, as we have been doing throughout this project.

Now that we’ve created some rather straightforward approaches to handling these two data fields, do some more thinking on your own. Write at least two sentences in a markdown cell detaining a different approach to handling popular_shelves, and any benefits your approach has over this one. (Hint: composite keys!)

Finally, write two queries that use all of the new tables we created (one for similar_books and one for popular_shelves).

Ensure that all of your work is visible in your Jupyter notebook prior to submitting!

Items to submit
  • 3 CREATE TABLE statements for the new tables and junction tables.

  • 2+ sentences in a markdown cell detailing your alternative approach and its benefits over the provided approach.

  • 2 queries that use all of the new tables created during this question.

Submitting your Work

Good work, you’ve made it to the end of your fourth project for TDM 401 and finished preparing your database for data entry! In the next project, we will begin populating our database with data! As always, ensure that all your work is visible as you expect in your submission to ensure you get the full points you deserve.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or it does not render properly in Gradescope. Please ask a TA if you need help with this.

Items to submit
  • firstname-lastname-project04.ipynb.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.