Harvard CS 50x — Week 7 (SQL)

4 min readNov 16, 2020

(This is a summary of week 7 from the Harvard CS 50x series. Visit this page to see more from this series! )

This Monday is a little special. Not only is it the week for SQL, but today marks the first day of the Women in Technology Online Festival 2020! As Diversity and Inclusivity Representative, it makes me feel so proud to announce that Avast will be a Bronze sponsor this year. Avast also gave 15 Avastians the opportunity to get premium passes to the week-long conference!

So after taking Week 7 of the Harvard University CS50 course, I also took some time to navigate around the Women in Technology Online Festival platform and got to network with some very inspiring women in tech.

For now, let’s just get back to CS50…

I started this week feeling fresh and excited to learn more about data manipulation. Our focus this week was Structured Query Language (SQL) and how to integrate it with Python. I’ve had previous experience with SQL but don’t use it often at work, so this lecture acted as a great refresh class for my brain.

Summary

Data
Pros of spreadsheets:
quick sort
rows + columns
can include attributes in columns
can determine the layout of schema ahead of time thanks to columns
Limitations of spreadsheets:
a limited number of rows + column
Flat-file DB (e.g. csv, tsb)

Canonicalisation → converting data that has more than one possible representation into a “standard”, “normal”, or canonical form
set() function

Massaging data → Cleaning up and doing some mutations to canonicalise it without changing its meaning
dict() function or {}

Relational database
SQL → Structured Query Language
SQLite → A light version of SQL, more user-friendly
sqlite3 → uses a binary file to store all of the data
.mode
.import
.schema
For any relational database:
CRUD: create, read, update, delete
For SQL: create, insert, select, update, delete
e.g.
CREATE TABLE table (column type, …);
SELECT columns FROM table;
SELECT * FROM table;
Functions in SQL:
AVG
COUNT
DISTINCT
LOWER
MAX
MIN
UPPER
…
Queries in SQL:
WHERE
LIKE
ORDER BY
LIMIT
GROUP BY
…
e.g.
SELECT title FROM shows WHERE title LIKE “%office%”;
SELECT DISTINCT(UPPER(title)) FROM shows ORDER BY UPPER(title);
SELECT UPPER(TRIM(title)), COUNT(title) FROM shows GROUP BY UPPER(TRIM(title)) ORDER BY COUNT(title) DESC LIMIT 10;
INSERT INTO table (column, …) VALUES (value, …);
INSERT INTO shows (Timestamp, title, genres) VALUES (“now”, “The Muppet Show”, “Comedy, Musical”);
SELECT * FROM shows WHERE title LIKE “%Muppet%”;
UPDATE shows SET genres = “Comedy, Drama, Musical” WHERE title = “The Muppet Show”;

Save SQL file by running .save
Normalising data → Eliminate redundancies
5 main datatypes (oversimplified) in SQLite:
1. Blob
2. Integer
3. Numeric
4. Real
5. Text
Primary + foreign keys

Nested SQL queries:
e.g.
SELECT title FROM shows WHERE id IN (SELECT show_id FROM genres WHERE genre = “Musical”);
SELECT DISTINCT(genre) FROM genres WHERE show_id IN (SELECT id FROM shows WHERE title = “THE OFFICE”) ORDER BY genre;
Many-to-many relationship: join tables

Use of join tables to minimise repetition

Exploring IMDb

An example of how IMDb’s data schema structure might look like

B-trees → A data structure that tries to keep all of the leaf nodes as close to the root as possible (more wide than tall)
CREATE INDEX title_index ON shows (title);
Saves a lot more run time with index
JOIN → allows us to join tables together to get a bigger table for us to use
SELECT title FROM shows WERE id IN (SELECT show_id FROM stars WHERE people_id = (SELECT id FROM people WHERE name = “Steve Carell”)));
SELECT title FROM people
> JOIN stars ON people.id = stars.person_id
> JOIN shows ON stars.show_id = shows.id
> WHERE name = “Steve Carell”;
Joining table might take a bit longer but thanks to indexes it can help save us some time:
e.g.
CREATE INDEX person_index ON stars (person_id);
CREATE INDEX show_index ON stars (show_id);
CREATE INDEX name_index ON people(name);
By using indexes, we remove bottlenecks where linear search might have taken place
Commonly introduced issues related to SQL:
1. SQL Injection Attacks
2. Race conditions → multi-threaded environments
transactions allow us to lock a row in a table to avoid losing track of data

No matter how stressful life gets, don’t forget to smell the roses (and more importantly, don’t forget to ‘BEGIN’ and ‘COMMIT’ your TRANSACTIONS!)

Thoughts

This week was easy to follow as it’s something that I’ve come across before
SQL is super straightforward
Love how David J. Malan showed us run time in order for us to understand how we can improve efficiency and minimise the time required to run queries via indexing
Shout-out to the memes being shown during the lecture, even though it was virtual and no one could hear any laughter from the audience 😂

Links to Assignments

For this week’s assignment, we were asked to complete Problem Set 7.

Harvard CS 50x — Week 7 (SQL)

Summary

Thoughts

Links to Assignments

Written by Janine L