Funnel Analysis with Google Analytics Data in BigQuery

Funnel Analysis with Google Analytics Data in BigQuery

Conversion funnels are a basic concept in web analytics, and if you’ve worked with them enough, you may have gotten to a point where you want to perform a deeper analysis than your tools will allow.

“Which steps in my funnel are being skipped? What was going on in this funnel before I defined it? Which user-characteristics determine sets of segments across which progression through my funnel differs?” These questions can be answered using the solution described in this article. In particular, I’m going to talk about how to use BigQuery (BQ) to analyze Google Analytics (GA) page-hit data, though the principles could be applied to any page-hit data stored in a relational database.

The Google Analytics Funnel Visualization report (see below) makes certain abstractions and has certain limitations, and advanced users can benefit through the use of Google BigQuery (BQ) – an infrastructure-as-a-service offering which allows for SQL-like queries over massive datasets.

Funnel Analysis

In this article, we’ll discuss the benefits of using BigQuery for funnel analysis as opposed to the Google Analytics user interface. In order to make the solution clear I will go over the basic structure of an SQL query for funnel analysis and explain how to use Funneler, a simple Windows application to automate query-writing. The source code of Funneler is also provided as a Python 3 script. Please note that in order to use the specific examples provided here you will need a Google Analytics Premium account linked to BigQuery (learn more about the BigQuery Export feature).

Funnel Analysis – Google Analytics UI vs. BigQuery

The solution I propose below works as follows: using a Windows application (or Python script) a BigQuery-dialect SQL query is generated which tracks user-sessions through a set of web properties, and optionally segmenting and/or filtering the sessions based on session characteristics. BigQuery’s output is a table with two columns per funnel stage: one for session-counts, and one for exit-counts.

Below is a list of the most significant differences between GA Funnel Visualization and the solution I will be discussing.

  1. Loopbacks: If a user goes from steps 1 -> 2 -> 1, GA will register two sessions: one which goes to step 1, one which goes to step 2, and an exit from step 2 to step 1. Our query will only count one session in the above instance: a session which goes from step 1 to step 2. Furthermore, since progress through the funnel is measured by the “deepest” page reached, the above scenario will not be distinguished from a session which goes from step 1 -> 2 -> 1.
  2. Backfilling funnel steps: GA will backfill any skipped steps between the entrance and the exit. This solution will only register actual page-hits, so you’ll get real numbers of page-hits.
  3. Historical Information: GA Funnels cannot show historical data on a new funnel, whereas this workflow can be used on any date range during which GA was tracking page-hits on the selected funnel-stage pages.
  4. Advanced Segmentation: GA Funnels don’t support advanced segmentation, whereas with Group By clauses in BigQuery, you can segment the funnel on any column.
  5. Sampling: GA Funnel Visualization shows up to 50,000 unique paths, whereas BQ will contain all the page-hits that GA recorded, and allow you to query them all.

The Query

For Google Analytics data, the basis of a funnel query is a list of URLs or Regular Expressions (regex), each representing a stage in the conversion funnel.

If you have a pre-existing funnel in GA, follow the steps below to find your funnel settings:

  1. Go to Admin in GA
  2. Select the correct Account, Property, and View
  3. Go to Goals
  4. Select a Goal
  5. Click Goal Details

In this screen you will find a regex or URL for each step of the funnel. They may look like this: “/job/apply/”.

The basic process of writing the query, given the list of regexes or URLs, is as follows:

1. Create a base-level subquery for each regex

For each row which has a regex-satisfying value in the URL column, pull out fullVisitorId and visitId (this works as a unique session ID), and the smallest hit-number. The smallest hit-number just serves as a non-null value which will be counted later. The result sets of these subqueries have one row per session.


SELECT
fullVisitorId,
visitId,
MIN(hits.hitNumber) AS firstHit
FROM
TABLE_DATE_RANGE([.ga_sessions_], TIMESTAMP('YYYY-MM-DD'),
TIMESTAMP('YYYY-MM-DD'))
WHERE
REGEXP_MATCH(hits.page.pagePath, '')
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId

2. Join the first subquery to the second on session ID

Select session ID, hit-number from the first subquery, and hit-number from the second subquery. When we use full outer joins, we’re saying sessions can enter the funnel at any step. To count sessions at each stage that have only hit a previous stage, use a left join.


SELECT
s0.fullVisitorId,
s0.visitId,
s0.firstHit,
s1.firstHit
FROM (
# Begin Subquery #1 aka s0
SELECT
fullVisitorId,
visitId,
MIN(hits.hitNumber) AS firstHit
FROM
TABLE_DATE_RANGE([.ga_sessions_], TIMESTAMP('2015-11-01'),
TIMESTAMP('2015-11-04'))
WHERE
REGEXP_MATCH(hits.page.pagePath, '')
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) s0
# End Subquery #1 aka s0
FULL OUTER JOIN EACH (
# Begin Subquery #2 aka s1
SELECT
fullVisitorId,
visitId,
MIN(hits.hitNumber) AS firstHit
FROM
TABLE_DATE_RANGE([.ga_sessions_], TIMESTAMP('2015-11-01'),
TIMESTAMP('2015-11-04'))
WHERE
REGEXP_MATCH(hits.page.pagePath, '')
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) s1
# End Subquery #2 aka s1

ON
s0.fullVisitorId = s1.fullVisitorId
AND s0.visitId = s1.visitId

3. Join the third subquery to the result of the above join on session ID

Select session ID, hit-number from the first subquery, hit-number from the second subquery, and hit-number from the third subquery.

4. Join the fourth subquery to the result of the above join on session ID

Select session ID, hit-number from the first subquery, hit-number from the second subquery, hit-number from the third subquery, and hit-number from the fourth subquery.

5. Continue until all subqueries are joined in this way

6. Aggregate results

Instead of a row for each session, we want one row with counts of non-null hit-numbers per funnel-step. Take the query so far, and wrap it with this:


SELECT
COUNT(s0.firstHit) AS _job_details_,
COUNT(s1.firstHit) AS _job_apply_
FROM (
(query from 2. goes here if the funnel has two steps))

The query has a recursive structure, which means that we could use a recursive program to generate the query mechanically. This is a major advantage, because for longer funnels, the query can grow quite large (500+ lines for a 13-step funnel). By automating the process, we can save lots of development time. We’ll now go over how to use Funneler to generate the query.

Funneler

Funneler is an executable Python script (no need to have Python installed) which, when fed a json containing a list of regexes or URLs, generates the SQL query in the BigQuery dialect to build that funnel. It manipulates and combines strings of SQL code recursively. It extends the functionality of the query described in section 2 and it allows for segmenting and filtering of sessions based on any column in the BigQuery table.

Funneler and funneler.py can be found on my Github page (https://github.com/douug).

The input to Funneler is a json document with the following name/value pairs:

  • Table name, with the following format: [(Dataset ID).ga_sessions_]
  • Start date: ‘YYYY-MM-DD’
  • End date: ‘YYYY-MM-DD’
  • List of regexes: one regex per funnel-step
  • Segmode: True for segmenting, False otherwise
  • Segment: The column to segment on
  • Filtermode: True for filtering, False otherwise
  • Filtercol: The column to filter on
  • Filterval: The value to filter on in the above-mentioned column

Here is an example of an input json:


{
"table": "[123456789.ga_sessions_]",
"start": "'2015-11-01'",
"end": "'2015-11-04'",
"regex_list": ["'/job/details/'",
"'/job/apply/'",
"'/job/apply/upload-resume/'",
"'/job/apply/basic-profile/'",
"'/job/apply/full-profile/'",
"'/job/apply/(assessment/external|thank-you)'"],
"segmode": "True",
"segment": "device.deviceCategory",
"filtermode": "False",
"filtercol" : "hits.customDimensions.index",
"filterval" : "23"
}

Please note the quoted quotes (e.g. in the elements of the value of the key "regex_list" above). These are included because after the json is ingested into a Python dictionary, the Python strings may contain SQL strings, which themselves require quotes. But, the value of the key "filterval" has no inside quotes because 23 is of type int in SQL and wouldn’t be quoted.

To run Funneler, go to \dist_funneler\data. Open input.json and modify the contents, then go back to \dist_funneler and run funneler.exe. Three files should appear – std_error.log, std_out.log (which contains feedback about whether Segmode or Filtermode are engaged, and where the generated query can be found), and your query. Copy and paste your query into BigQuery. Try starting with a short funnel, as it may take a few tries to format the input correctly.

Alternatively, if you are running funneler.py, it can be executed from the command line with the following:

python funneler.py input.json

In this case, the contents of the above mentioned std_error.log and std_out.log files will appear in-console. This query can then be copied into your BQ instance. The resulting table should have two columns per regex/funnel-step – one for hits, and one for exits – and one row . If segmode is set to True, then there will be a row per value in the segment column.

Hopefully these tools help you to quickly create complex queries and meet analysis objectives to perform deeper analysis of GA page-hit data.