SQL, SQLite, MySQL, PostgreSQL, SQL Server
SQL is the generic term for querying language.
Other names are different flavours of SQL languages for differnet Relational database systems. E.g. PostgreSQL is for querying from Postgre.
Using BigQuery
BigQuery is a web service that can let us apply SQL to huge datasets.
Initialize
# *** Setup
from google.cloud import bigquery
# create a client object
client = bigquery.Client()
# *** Fetch dataset
# Construct a reference first, and then fetch data using ~get_dataset~
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
Fetch table and review table structure
list_tables()
: list tables in datasetsget_table()
: construct a table reference and then fetch table from the refernece, similar to how we fetched a dataset.
# print the names of all tables
tables = list(client.list_tables(dataset))
for table in tables:
print(table.table_id)
# print the number of tables
print(len(tables))
# Fetch table
table_ref = dataset_ref.table("full")
table = client.get_table(table_ref)
Alternative:
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
print(list_of_tables)
Explore data
table.schema
: Table schema is the jargon of the structure of a table, this will print info of all columns in tables within the datasetlist_rows()
: preview first n rows
# *** Table schema
table.schema
# *** Preview first n rows
# Preview first 5 rows
client.list_rows(table, max_results=5).to_dataframe()
# View the first 3 column only
client.list_rows(table, max_results=5, selected_fields=table.schema[:3]).to_dataframe()
Select, From, Where
FROM
specify the tableSELECT
apply to columns.- My note:
SELECT
is different fromSelect
in R. It does not mean selecting column xyz and only xyz will be loaded. In SQL it means xyz are the columns that would appear in the final dataset! In addition, all columns included inSELECT
must be passed through Group by, Having, Count clause if it exists.
- My note:
WHERE
apply to rows (filter)WHERE...LIKE
: filter. We can also use wildcard%
for any number of characters.
query = """
SELECT Name, Type
FROM `bigquery-public-data.pet_records.pet`
WHERE Animal = "Cat"
"""
-- Example for WHERE LIKE
WHERE Name LIKE "Ripley"
WHERE Name like '%ipl%'
-- Select all columns
SELECT *
-- Select distinct value
SELECT DISTINCT Name
-- Select as alias
SELECT `by` AS author
*
Set up query in BigQuery
# Run query and convert to pandas df
query_job = client.query(query)
us_cities = query_job.to_dataframe()
Tips to manage limit in BigQuery
Dry run to estimate size and cost
- Create a query job config to estimate the query size ahead
- Dry run to estimate cost
query = """
SELECT Name, Type
FROM `bigquery-public-data.pet_records.pet`
"""
# Create a query job config to estimate size of query
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query_job = client.query(query, job_config=dry_run_config)
print("This query will process {} bytes".format(dry_run_query_job.total_bytes_processed))
Limit the size
# limit to 1MB or lower
one_mb = 1000*1000
safe_config= bigquery.QueryJobConfig(maximum_bytes_billed=one_mb)
safe_query_job = client.query(query, job_config=safe_config) # It only runs when it's < 1MB
safe_query_job.to_dataframe()
Group by, Having, Count
COUNT()
- the default col name is
f0___
which can be changed by aliasing usingAS
. - Use
COUNT(1)
if the col is not important. It’s also quicker
- the default col name is
GROUP BY
: similar tosummarize()
in RGROUP BY ... HAVING
: It only includes the filtered group- Note that all SELECT variables has to pass either
GROUP BY
orCOUNT()
because it’s an aggregate function. soSELECT *
follow by above won’t work.
SELECT `by` AS author, count(id) AS NumPosts
FROM `bigquery-public-data.hacer-news.full`
GROUP BY author
HAVING COUNT(id) > 10
Order by
ORDER BY
: usually put it at the end of query
SELECT ID, Name, Animal
FROM `bigquery-public-data-pet_records.pets`
ORDER BY Animal DESC
EXTRACT()
from Dates-- Extract day value from Date column SELECT Name, EXTRACT(DAY from Date) AS Day
As, with
AS
: aliasing columnWITH...AS
: to be used to produce common table expression (CTE), a temporary table.
WITH time AS
(
SELECT DATE(block_timestamp) AS trans_date
FROM `bigquery-public-data.crypto_bitcoin.transaction`
)
SELECT COUNT(1) AS transactions,
trans_date
FROM time
GROUP BY trans_date
ORDER BY trans_date
Join
INNER JOIN...ON
pets table
ID | Name | Animal |
---|---|---|
1 | Moon | Dog |
2 | Tom | Cat |
owners table
ID | Name | Pet_ID |
---|---|---|
1 | Magnus | 2 |
2 | Jules | 1 |
SELECT p.Name AS Pet_Name, o.Name as Owner_Name
FROM `bigquery-public-data.pet_records.pets` AS p
INNER JOIN `bigquery-public-data.pet_records.owners` AS o
ON p.ID = o.Pet_ID