Harvard CS 50x — Week 7 (SQL)

Janine L
4 min readNov 16, 2020

(This is a summary of week 7 from the Harvard CS 50x series. Visit this page to see more from this series! )

This Monday is a little special. Not only is it the week for SQL, but today marks the first day of the Women in Technology Online Festival 2020! As Diversity and Inclusivity Representative, it makes me feel so proud to announce that Avast will be a Bronze sponsor this year. Avast also gave 15 Avastians the opportunity to get premium passes to the week-long conference!

So after taking Week 7 of the Harvard University CS50 course, I also took some time to navigate around the Women in Technology Online Festival platform and got to network with some very inspiring women in tech.

For now, let’s just get back to CS50…

I started this week feeling fresh and excited to learn more about data manipulation. Our focus this week was Structured Query Language (SQL) and how to integrate it with Python. I’ve had previous experience with SQL but don’t use it often at work, so this lecture acted as a great refresh class for my brain.

Summary

  • Data
  • Pros of spreadsheets:
    quick sort
    rows + columns
    can include attributes in columns
    can determine the layout of schema ahead of time thanks to columns
  • Limitations of spreadsheets:
    a limited number of rows + column
  • Flat-file DB (e.g. csv, tsb)
  • Canonicalisation → converting data that has more than one possible representation into a “standard”, “normal”, or canonical form
  • set() function
  • Massaging data → Cleaning up and doing some mutations to canonicalise it without changing its meaning
  • dict() function or {}
  • Relational database
  • SQL → Structured Query Language
  • SQLite → A light version of SQL, more user-friendly
  • sqlite3 → uses a binary file to store all of the data
    .mode
    .import
    .schema
  • For any relational database:
    CRUD: create, read, update, delete
  • For SQL: create, insert, select, update, delete
    e.g.
    CREATE TABLE table (column type, …);
    SELECT columns FROM table;
    SELECT * FROM table;
  • Functions in SQL:
    AVG
    COUNT
    DISTINCT
    LOWER
    MAX
    MIN
    UPPER
  • Queries in SQL:
    WHERE
    LIKE
    ORDER BY
    LIMIT
    GROUP BY

    e.g.
    SELECT title FROM shows WHERE title LIKE “%office%”;
    SELECT DISTINCT(UPPER(title)) FROM shows ORDER BY UPPER(title);
    SELECT UPPER(TRIM(title)), COUNT(title) FROM shows GROUP BY UPPER(TRIM(title)) ORDER BY COUNT(title) DESC LIMIT 10;
    INSERT INTO table (column, …) VALUES (value, …);
    INSERT INTO shows (Timestamp, title, genres) VALUES (“now”, “The Muppet Show”, “Comedy, Musical”);
    SELECT * FROM shows WHERE title LIKE “%Muppet%”;
    UPDATE shows SET genres = “Comedy, Drama, Musical” WHERE title = “The Muppet Show”;
  • Save SQL file by running .save
  • Normalising data → Eliminate redundancies
  • 5 main datatypes (oversimplified) in SQLite:
    1. Blob
    2. Integer
    3. Numeric
    4. Real
    5. Text
  • Primary + foreign keys
Use of primary and foreign keys
  • Nested SQL queries:
    e.g.
    SELECT title FROM shows WHERE id IN (SELECT show_id FROM genres WHERE genre = “Musical”);
    SELECT DISTINCT(genre) FROM genres WHERE show_id IN (SELECT id FROM shows WHERE title = “THE OFFICE”) ORDER BY genre;
  • Many-to-many relationship: join tables
Use of join tables to minimise repetition
  • Exploring IMDb
An example of how IMDb’s data schema structure might look like
  • B-trees → A data structure that tries to keep all of the leaf nodes as close to the root as possible (more wide than tall)
    CREATE INDEX title_index ON shows (title);
    Saves a lot more run time with index
  • JOIN → allows us to join tables together to get a bigger table for us to use
    SELECT title FROM shows WERE id IN (SELECT show_id FROM stars WHERE people_id = (SELECT id FROM people WHERE name = “Steve Carell”)));
  • SELECT title FROM people
    > JOIN stars ON people.id = stars.person_id
    > JOIN shows ON stars.show_id = shows.id
    > WHERE name = “Steve Carell”;
  • Joining table might take a bit longer but thanks to indexes it can help save us some time:
    e.g.
    CREATE INDEX person_index ON stars (person_id);
    CREATE INDEX show_index ON stars (show_id);
    CREATE INDEX name_index ON people(name);
  • By using indexes, we remove bottlenecks where linear search might have taken place
  • Commonly introduced issues related to SQL:
    1. SQL Injection Attacks
    2. Race conditions → multi-threaded environments
    transactions allow us to lock a row in a table to avoid losing track of data
No matter how stressful life gets, don’t forget to smell the roses (and more importantly, don’t forget to ‘BEGIN’ and ‘COMMIT’ your TRANSACTIONS!)

Thoughts

  • This week was easy to follow as it’s something that I’ve come across before
  • SQL is super straightforward
  • Love how David J. Malan showed us run time in order for us to understand how we can improve efficiency and minimise the time required to run queries via indexing
  • Shout-out to the memes being shown during the lecture, even though it was virtual and no one could hear any laughter from the audience 😂

Links to Assignments

For this week’s assignment, we were asked to complete Problem Set 7.

--

--