Using BigQuery

BigQuery is a Google Cloud web service that lets us run SQL queries against huge datasets.

Initialize

# *** Setup
from google.cloud import bigquery
 
# create a client object
client = bigquery.Client()
 
# *** Fetch dataset
# Construct a dataset reference first, then fetch the dataset with get_dataset()
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
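Note: depending on the google-cloud-bigquery version, get_dataset() also accepts a fully qualified string, which skips the explicit reference step. A minimal sketch, assuming a reasonably recent client library:

# Same fetch, but passing "project.dataset" as a string (version-dependent)
dataset = client.get_dataset("bigquery-public-data.hacker_news")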

Fetch table and review table structure

  • list_tables(): list the tables in a dataset
  • get_table(): construct a table reference and then fetch the table from that reference, similar to how we fetched the dataset.
# print the names of all tables
tables = list(client.list_tables(dataset))
for table in tables:
    print(table.table_id)
# print the number of tables
print(len(tables))
 
# Fetch table
table_ref = dataset_ref.table("full")
table = client.get_table(table_ref)

Alternative:

tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables]
print(list_of_tables)
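Similarly, get_table() accepts a fully qualified "project.dataset.table" string, so a table can be fetched in one line (a sketch, assuming a recent client-library version):

# Fetch the same table directly by its fully qualified ID
table = client.get_table("bigquery-public-data.hacker_news.full")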

Explore data

  • table.schema: the table schema describes the structure of a table; printing it shows information about every column in the table
  • list_rows(): preview first n rows
# *** Table schema
table.schema
 
# *** Preview first n rows
# Preview first 5 rows
client.list_rows(table, max_results=5).to_dataframe()
 
# View the first 3 columns only
client.list_rows(table, max_results=5, selected_fields=table.schema[:3]).to_dataframe()
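Each entry in table.schema is a SchemaField; looping over it gives a quick plain-text summary of the columns (a small sketch using the name, field_type, and mode attributes from the client library):

# Print one line per column: name, type, and mode (NULLABLE / REQUIRED / REPEATED)
for field in table.schema:
    print(field.name, field.field_type, field.mode)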

Select, From, Where

  • FROM specifies the table
  • SELECT applies to columns.
    • My note: SELECT is different from select() in R. It does not mean that only columns xyz get loaded; in SQL it means xyz are the columns that appear in the final result. In addition, when GROUP BY is used, every column in SELECT must either appear in GROUP BY or sit inside an aggregate such as COUNT().
  • WHERE applies to rows (filtering)
  • WHERE ... LIKE: filter on a text pattern. The wildcard % matches any number of characters.
query = """
        SELECT Name, Type
        FROM `bigquery-public-data.pet_records.pets`
        WHERE Animal = "Cat"
        """
 
-- Example for WHERE LIKE
WHERE Name LIKE "Ripley"
WHERE Name LIKE '%ipl%'
 
-- Select all columns
SELECT *
 
-- Select distinct value
SELECT DISTINCT Name
 
-- Select as alias
SELECT `by` AS author
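A complete query string that combines these pieces against the real hacker_news dataset might look like the sketch below (the `by` and `title` columns come from the public hacker_news.full table); it can be run with client.query() as shown in the next section.

query = """
        SELECT DISTINCT `by` AS author
        FROM `bigquery-public-data.hacker_news.full`
        WHERE title LIKE '%bigquery%'
        """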


Set up query in BigQuery

# Run query and convert to pandas df
query_job = client.query(query)
results = query_job.to_dataframe()
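If a DataFrame isn't needed, the same job can be consumed row by row via result(), which returns an iterator of Row objects. A minimal sketch, assuming the pets query from the SELECT/WHERE example above:

# Iterate over the result rows directly instead of building a DataFrame
for row in query_job.result(max_results=5):
    print(row["Name"], row["Type"])  # field names match the SELECT list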
 

Tips to manage query limits in BigQuery

Dry run to estimate size and cost

  1. Create a query job config to estimate the query size ahead
  2. Dry run to estimate cost
query = """
        SELECT Name, Type
        FROM `bigquery-public-data.pet_records.pet`
        """
# Create a query job config to estimate size of query
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query_job = client.query(query, job_config=dry_run_config)
print("This query will process {} bytes".format(dry_run_query_job.total_bytes_processed))

Limit the size

# limit to 1MB or lower
one_mb = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=one_mb)
safe_query_job = client.query(query, job_config=safe_config)  # runs only if the query would process under 1 MB
safe_query_job.to_dataframe()
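When the estimate is above maximum_bytes_billed, the job fails instead of running. The exact exception class depends on the client-library version, so this sketch just catches broadly to illustrate the behaviour:

# The job is rejected if it would bill more than the configured limit
try:
    results = safe_query_job.to_dataframe()
except Exception as err:
    print("Query blocked by maximum_bytes_billed:", err)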

Group by, Having, Count

  • COUNT()
    • the default column name for the aggregate is f0_, which can be changed by aliasing with AS.
    • Use COUNT(1) to count rows when no particular column matters; it can also be a bit faster.
  • GROUP BY: similar to group_by() in R (dplyr); the aggregate functions then play the role of summarize()
  • GROUP BY ... HAVING: keeps only the groups that satisfy the HAVING condition
  • Note that every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function such as COUNT(), so SELECT * followed by GROUP BY won't work.
SELECT `by` AS author, COUNT(id) AS NumPosts
FROM `bigquery-public-data.hacker_news.full`
GROUP BY author
HAVING COUNT(id) > 10
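A sketch of running this query from Python, reusing the size-limit pattern from the previous section (the 10 GB cap is an arbitrary choice for this example, not a recommendation):

query = """
        SELECT `by` AS author, COUNT(id) AS NumPosts
        FROM `bigquery-public-data.hacker_news.full`
        GROUP BY author
        HAVING COUNT(id) > 10
        """
ten_gb = 10 * 1000**3  # arbitrary safety cap for this example
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ten_gb)
popular_authors = client.query(query, job_config=safe_config).to_dataframe()
print(popular_authors.head())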

Order by

  • ORDER BY: usually placed at the end of the query
SELECT ID, Name, Animal
FROM `bigquery-public-data.pet_records.pets`
ORDER BY Animal DESC
  • EXTRACT(): pull a component such as DAY, WEEK, or YEAR out of a DATE or TIMESTAMP column (a full query is sketched below)
    -- Extract day value from Date column
    SELECT Name, EXTRACT(DAY from Date) AS Day
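A full EXTRACT example against the real hacker_news table, assuming its timestamp column (a sketch; swap in whatever DATE or TIMESTAMP column you actually have):

query = """
        SELECT EXTRACT(YEAR FROM timestamp) AS year,
               COUNT(1) AS posts
        FROM `bigquery-public-data.hacker_news.full`
        GROUP BY year
        ORDER BY year DESC
        """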

As, with

  • AS: alias a column
  • WITH ... AS: creates a common table expression (CTE), a temporary table that the rest of the query can select from.
WITH time AS
(
    SELECT DATE(block_timestamp) AS trans_date
    FROM `bigquery-public-data.crypto_bitcoin.transactions`
)
SELECT COUNT(1) AS transactions,
       trans_date
FROM time
GROUP BY trans_date
ORDER BY trans_date
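A sketch of running the CTE from Python; the crypto_bitcoin table is large, so the (arbitrary) 100 GB cap below is only a guard for this example:

query = """
        WITH time AS
        (
            SELECT DATE(block_timestamp) AS trans_date
            FROM `bigquery-public-data.crypto_bitcoin.transactions`
        )
        SELECT COUNT(1) AS transactions,
               trans_date
        FROM time
        GROUP BY trans_date
        ORDER BY trans_date
        """
hundred_gb = 100 * 1000**3  # arbitrary safety cap for this example
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=hundred_gb)
transactions_by_date = client.query(query, job_config=safe_config).to_dataframe()
print(transactions_by_date.head())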

Join

  • INNER JOIN...ON

pets table

ID  Name  Animal
1   Moon  Dog
2   Tom   Cat

owners table

ID  Name    Pet_ID
1   Magnus  2
2   Jules   1

SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
FROM `bigquery-public-data.pet_records.pets` AS p
INNER JOIN `bigquery-public-data.pet_records.owners` AS o
           ON p.ID = o.Pet_ID
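
Matching p.ID with o.Pet_ID, the join above would return one row per matched pair:

Pet_Name  Owner_Name
Moon      Jules
Tom       Magnus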