Showing the effect of measuring BigQuery cost per query and optimising accordingly
BigQuery

Monitoring your BigQuery costs per Query and per User

aka – cutting costs without cutting value, while upskilling your team

The Problem

We love BigQuery. It’s easy to set up and maintain, meaning anyone with basic SQL skills (or the ability to copy and paste from StackOverflow) can get value out of it very quickly.

I love seeing the effect it has on businesses that become liberated once they migrate from another platform where they were constrained by query speed (MySQL/Postgres), query language (Mongo) or maintenance and tooling (Redshift). Access to their data suddenly becomes democratised, and their staff start putting together DataStudio reports and running queries in the console themselves, without having to get a developer or analyst involved.

However, this comes with a downside: costs.

BigQuery costs for a client before optimising. They were getting massive value out of it; however, the costs were skyrocketing.
Before optimisation: the business was seeing massive growth thanks to the value BigQuery provided, but staff were running ad hoc queries and DataStudio reports with no optimisation.

BigQuery’s on-demand pricing charges for the amount of data a query processes, not for the server it runs on or how long the query takes.

So by optimising how the data is stored (partitioning and clustering), as well as the SQL of the queries themselves, we can save large amounts of money.
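For example, as a sketch only (the project, dataset and column names below are hypothetical), an events table can be rebuilt so that it is partitioned by day and clustered on a commonly filtered column; date-filtered queries then only scan the partitions they need:

-- Hypothetical example: rebuild an events table partitioned by day and
-- clustered by user_id, so queries filtering on date and user_id scan far fewer bytes.
CREATE TABLE `your-project.analytics.events_optimised`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id
AS
SELECT * FROM `your-project.analytics.events`;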

Costs after query and table optimisations. They were using BigQuery even more; however, the costs had dropped dramatically.
After table optimisations were put in place and staff were trained on how to optimise their queries.

Part 1: You can only change what you can measure

How to store in BigQuery the cost of every query, who ran it, and the SQL that was run.

1. Set up a sink inside GCP Logging for the BigQuery "query_job_completed" event.

2. Open GCP Logging for your project.

3. Paste in this query and run it:

resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.eventName="query_job_completed"

4. You should now see a list of all the events from the last hour (run a simple query in BigQuery if you don’t see any, or extend the timeframe).

GCP Logging Query for BigQuery job complete events

5. Click "Actions" -> "Create Sink".

How to create a sink in Google Cloud Logging

6. Create the sink and point it to a new dataset. Be sure to use "Partitioned tables".

Pointing a Google Cloud Logging sink to BigQuery

7. The dataset will be created immediately; however, the tables inside it will take a while to appear, so run some queries, grab a coffee and check back in a while.

8. Review the new tables:

The BigQuery job completed events and costs, saved back into BigQuery
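Before building anything on top of it, a quick sanity check (using the same table name as the query in the next step) can confirm events are flowing in:

-- Count the completed-query events captured per day, newest first.
SELECT DATE(timestamp) AS day, COUNT(*) AS events
FROM `YOUR-PROJECT.YOUR-DATASET-NAME.cloudaudit_googleapis_com_data_access`
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = "query_job_completed"
GROUP BY day
ORDER BY day DESC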

9. Copy this query into the console, changing YOUR-PROJECT and YOUR-DATASET-NAME to your own.

This query will return the user, cost (USD), SQL and timestamp for every successful query in the last 30 days.

SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user
  -- Cost assumes on-demand pricing of USD $5 per TiB (2^40 bytes) processed
  , CAST(5 * (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes / POWER(2, 40)) AS NUMERIC) AS cost
  , protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS sql
  , timestamp
FROM `YOUR-PROJECT.YOUR-DATASET-NAME.cloudaudit_googleapis_com_data_access`
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = "query_job_completed"
  -- Skip jobs that did not report any processed bytes
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes IS NOT NULL
  AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)

10. Taking Action!

Now that we can see how much each query costs, we can put the data into DataStudio to visualise the costs by date and by user (a sketch of a suitable query follows below). The important thing to remember is that we’re not out to shame anyone; we want people to get more value and insights for the business, not less, so don’t scare them off.
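As a starting point for that DataStudio report (a sketch, reusing the same table and the $5 per TiB assumption from the query above), the per-query costs can be rolled up into daily totals per user:

-- Daily BigQuery spend per user over the last 30 days,
-- assuming on-demand pricing of USD $5 per TiB processed.
SELECT
  DATE(timestamp) AS day
  , protopayload_auditlog.authenticationInfo.principalEmail AS user
  , ROUND(SUM(5 * protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes / POWER(2, 40)), 2) AS cost_usd
FROM `YOUR-PROJECT.YOUR-DATASET-NAME.cloudaudit_googleapis_com_data_access`
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = "query_job_completed"
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes IS NOT NULL
  AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY day, user
ORDER BY day DESC, cost_usd DESC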

Rather than calling anyone out, review the SQL and look for common mistakes that have a real impact. For instance, I could see that one colleague had always used date filters like this:

WHERE FORMAT_DATETIME ("%F %X", DATETIME(timestamp, "Australia/Sydney")) > "2020-12-01"

Despite the table being partitioned on "timestamp", this filter didn’t optimise anything: wrapping the partition column in functions like this prevents partition pruning, so the whole table is scanned and the cost goes up. Simply teaching them to write the filter differently saved terabytes of data being queried:

WHERE DATE(timestamp) > "2020-12-01"
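If the Sydney-time cutoff genuinely matters for the report, one alternative (a sketch, not part of the original fix) is to compare the raw timestamp against a timezone-aware constant, which still lets BigQuery prune partitions:

-- Midnight on 2020-12-01 in Australia/Sydney as a TIMESTAMP constant,
-- so the partition column is compared directly and pruning still applies.
WHERE timestamp > TIMESTAMP("2020-12-01", "Australia/Sydney")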

I hope that has helped. If you have any issues following the steps, please comment or email me directly.

We’ll be writing more articles on how to get the best out of BigQuery and Google DataStudio, so subscribe and bookmark. All the best, Aden