Using BigQuery

BigQuery is a Google Cloud web service that lets us run SQL queries against huge datasets.

Initialize

# *** Setup
from google.cloud import bigquery
 
# create a client object
client = bigquery.Client()
 
# *** Fetch dataset
# Construct a dataset reference first, then fetch the dataset with get_dataset()
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
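Note: depending on the google-cloud-bigquery version, get_dataset() also accepts a fully qualified string, which skips the explicit reference step. A minimal sketch, assuming a reasonably recent client library:

# Same fetch, but passing "project.dataset" as a string (version-dependent)
dataset = client.get_dataset("bigquery-public-data.hacker_news")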

Fetch table and review table structure

  • list_tables(): list the tables in a dataset
  • get_table(): construct a table reference and then fetch the table from that reference, similar to how we fetched the dataset.
# print the names of all tables
tables = list(client.list_tables(dataset))
for table in tables:
    print(table.table_id)
# print the number of tables
print(len(tables))
 
# Fetch table
table_ref = dataset_ref.table("full")
table = client.get_table(table_ref)

Alternative:

tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
print(list_of_tables)
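Similarly, get_table() accepts a fully qualified "project.dataset.table" string, so a table can be fetched in one line (a sketch, assuming a recent client-library version):

# Fetch the same table directly by its fully qualified ID
table = client.get_table("bigquery-public-data.hacker_news.full")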

Explore data

  • table.schema: the table schema describes the structure of a table; printing it shows information about every column in the table
  • list_rows(): preview first n rows
# *** Table schema
table.schema
 
# *** Preview first n rows
# Preview first 5 rows
client.list_rows(table, max_results=5).to_dataframe()
 
# View the first 3 columns only
client.list_rows(table, max_results=5, selected_fields=table.schema[:3]).to_dataframe()
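Each entry in table.schema is a SchemaField; looping over it gives a quick plain-text summary of the columns (a small sketch using the name, field_type, and mode attributes from the client library):

# Print one line per column: name, type, and mode (NULLABLE / REQUIRED / REPEATED)
for field in table.schema:
    print(field.name, field.field_type, field.mode)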

Select, From, Where

  • FROM specifies the table
  • SELECT applies to columns.
    • My note: SELECT is different from select() in R. It does not mean that only columns xyz get loaded; in SQL it means xyz are the columns that appear in the final result. In addition, when GROUP BY is used, every column in SELECT must either appear in GROUP BY or sit inside an aggregate such as COUNT().
  • WHERE applies to rows (filtering)
  • WHERE ... LIKE: filter on a text pattern. The wildcard % matches any number of characters.
query = """
        SELECT Name, Type
        FROM `bigquery-public-data.pet_records.pets`
        WHERE Animal = "Cat"
        """
 
-- Example for WHERE LIKE
WHERE Name LIKE "Ripley"
WHERE Name LIKE '%ipl%'
 
-- Select all columns
SELECT *
 
-- Select distinct value
SELECT DISTINCT Name
 
-- Select as alias
SELECT `by` AS author
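A complete query string that combines these pieces against the real hacker_news dataset might look like the sketch below (the `by` and `title` columns come from the public hacker_news.full table); it can be run with client.query() as shown in the next section.

query = """
        SELECT DISTINCT `by` AS author
        FROM `bigquery-public-data.hacker_news.full`
        WHERE title LIKE '%bigquery%'
        """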


Set up query in BigQuery

# Run query and convert to pandas df
query_job = client.query(query)
results = query_job.to_dataframe()
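If a DataFrame isn't needed, the same job can be consumed row by row via result(), which returns an iterator of Row objects. A minimal sketch, assuming the pets query from the SELECT/WHERE example above:

# Iterate over the result rows directly instead of building a DataFrame
for row in query_job.result(max_results=5):
    print(row["Name"], row["Type"])  # field names match the SELECT list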
 

Tips to manage query limits in BigQuery

Dry run to estimate size and cost

  1. Create a query job config to estimate the query size ahead
  2. Dry run to estimate cost
query = """
        SELECT Name, Type
        FROM `bigquery-public-data.pet_records.pet`
        """
# Create a query job config to estimate size of query
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query_job = client.query(query, job_config=dry_run_config)
print("This query will process {} bytes".format(dry_run_query_job.total_bytes_processed))

Limit the size

# limit to 1MB or lower
one_mb = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=one_mb)
safe_query_job = client.query(query, job_config=safe_config)  # runs only if the query would process under 1 MB
safe_query_job.to_dataframe()
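When the estimate is above maximum_bytes_billed, the job fails instead of running. The exact exception class depends on the client-library version, so this sketch just catches broadly to illustrate the behaviour:

# The job is rejected if it would bill more than the configured limit
try:
    results = safe_query_job.to_dataframe()
except Exception as err:
    print("Query blocked by maximum_bytes_billed:", err)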

Group by, Having, Count

  • COUNT()
    • the default column name for the aggregate is f0_, which can be changed by aliasing with AS.
    • Use COUNT(1) to count rows when no particular column matters; it can also be a bit faster.
  • GROUP BY: similar to group_by() in R (dplyr); the aggregate functions then play the role of summarize()
  • GROUP BY ... HAVING: keeps only the groups that satisfy the HAVING condition
  • Note that every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function such as COUNT(), so SELECT * followed by GROUP BY won't work.
SELECT `by` AS author, COUNT(id) AS NumPosts
FROM `bigquery-public-data.hacker_news.full`
GROUP BY author
HAVING COUNT(id) > 10
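A sketch of running this query from Python, reusing the size-limit pattern from the previous section (the 10 GB cap is an arbitrary choice for this example, not a recommendation):

query = """
        SELECT `by` AS author, COUNT(id) AS NumPosts
        FROM `bigquery-public-data.hacker_news.full`
        GROUP BY author
        HAVING COUNT(id) > 10
        """
ten_gb = 10 * 1000**3  # arbitrary safety cap for this example
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ten_gb)
popular_authors = client.query(query, job_config=safe_config).to_dataframe()
print(popular_authors.head())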

Order by

  • ORDER BY: usually placed at the end of the query
SELECT ID, Name, Animal
FROM `bigquery-public-data.pet_records.pets`
ORDER BY Animal DESC
  • EXTRACT(): pull a component such as DAY, WEEK, or YEAR out of a DATE or TIMESTAMP column (a full query is sketched below)
    -- Extract day value from Date column
    SELECT Name, EXTRACT(DAY from Date) AS Day
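A full EXTRACT example against the real hacker_news table, assuming its timestamp column (a sketch; swap in whatever DATE or TIMESTAMP column you actually have):

query = """
        SELECT EXTRACT(YEAR FROM timestamp) AS year,
               COUNT(1) AS posts
        FROM `bigquery-public-data.hacker_news.full`
        GROUP BY year
        ORDER BY year DESC
        """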

As, with

  • AS: alias a column
  • WITH ... AS: creates a common table expression (CTE), a temporary table that the rest of the query can select from.
WITH time AS
(
    SELECT DATE(block_timestamp) AS trans_date
    FROM `bigquery-public-data.crypto_bitcoin.transactions`
)
SELECT COUNT(1) AS transactions,
       trans_date
FROM time
GROUP BY trans_date
ORDER BY trans_date
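A sketch of running the CTE from Python; the crypto_bitcoin table is large, so the (arbitrary) 100 GB cap below is only a guard for this example:

query = """
        WITH time AS
        (
            SELECT DATE(block_timestamp) AS trans_date
            FROM `bigquery-public-data.crypto_bitcoin.transactions`
        )
        SELECT COUNT(1) AS transactions,
               trans_date
        FROM time
        GROUP BY trans_date
        ORDER BY trans_date
        """
hundred_gb = 100 * 1000**3  # arbitrary safety cap for this example
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=hundred_gb)
transactions_by_date = client.query(query, job_config=safe_config).to_dataframe()
print(transactions_by_date.head())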

Join

  • INNER JOIN...ON

pets table

ID  Name  Animal
1   Moon  Dog
2   Tom   Cat

owners table

ID  Name    Pet_ID
1   Magnus  2
2   Jules   1

SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
FROM `bigquery-public-data.pet_records.pets` AS p
INNER JOIN `bigquery-public-data.pet_records.owners` AS o
           ON p.ID = o.Pet_ID
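
Matching p.ID with o.Pet_ID, the join above would return one row per matched pair:

Pet_Name  Owner_Name
Moon      Jules
Tom       Magnus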