BigQuery 是 Google 完全代管、無需維運、低成本的分析數據庫。 使用 BigQuery,您可以查詢大量數據,而無需管理任何基礎架構或需要數據庫管理員。 BigQuery 使用 SQL 並可以利用按用量付費模式。 BigQuery 讓您可以專注於分析數據以找到有意義的見解。
BigQuery Machine Learning(BigQuery ML)是 BigQuery 中的一項功能,數據分析師可以使用SQL語句創建、訓練、評估和預測機器學習模型。
在這次實作將使用一個電子商務(Google 商品商店)數據集,其中包含數百萬條已加載到 BigQuery 的 Google Analytics 記錄。 將使用這些數據來運行一些典型的查詢和模型訓練和預測,以讓企業更了解其客戶的購買習慣的。
/*--
使用 visitors 子查詢計算總訪客數,該數據來自於 data-to-insights.ecommerce.web_analytics 資料集中所有的 fullVisitorId 的計數。
--*/
WITH visitors AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `data-to-insights.ecommerce.web_analytics`
),
/*--
使用 purchasers 子查詢計算總購買者數,該數據同樣來自於 data-to-insights.ecommerce.web_analytics 資料集,但僅計算有交易(totals.transactions)的 fullVisitorId 的計數。
--*/
purchasers AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_purchasers
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.transactions IS NOT NULL
)
/*--
主查詢部分選擇顯示總訪客數(total_visitors)、總購買者數(total_purchasers)以及轉換率(conversion_rate)。轉換率的計算方式是總購買者數除以總訪客數。
--*/
SELECT
total_visitors,
total_purchasers,
total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers
轉換率:2.69%
SELECT
p.v2ProductName,
p.v2ProductCategory,
SUM(p.productQuantity) AS units_sold,
ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5;
/*--
選擇顯示商品名稱(v2ProductName)、商品類別(v2ProductCategory)、已售出的單位數量(units_sold)和總收入(revenue)。
使用 UNNEST 函數展平(flatten)資料結構,將 hits 中的每個元素以及其中的 product 屬性展平,以便提取商品相關的數據。
使用 SUM 函數計算每個商品的已售出的單位數量和總收入。已售出的單位數量來自 productQuantity,而總收入則是使用 localProductRevenue,並且將其轉換為百萬單位(除以 1000000)並四捨五入到兩位小數。
使用 GROUP BY 子句將結果按商品名稱和商品類別進行分組。
使用 ORDER BY 子句將結果按照總收入(revenue)降序排序。
使用 LIMIT 5 限制結果集的大小,只顯示前五個結果。
--*/
# visitors who bought on a return visit (could have bought on first as well
WITH all_visitor_stats AS (
SELECT
fullvisitorid, # 741,721 unique visitors
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
COUNT(DISTINCT fullvisitorid) AS total_visitors,
will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visit
/*--
使用 all_visitor_stats 子查詢,對每個 fullVisitorId 計算相應的統計信息。
fullvisitorid 表示每個獨特的訪客 ID。
IF 函數用於判斷該訪客是否在返回訪問時購買。條件是該訪客的 totals.transactions 大於 0(表示有購買),並且 totals.newVisits 為空(表示不是新訪問)。
will_buy_on_return_visit 列的值為 1(如果滿足條件)或 0(否則)。
在主查詢中,使用 COUNT(DISTINCT fullvisitorid) 計算總訪客數,並根據 will_buy_on_return_visit 分別計算購買和不購買的訪客數。
使用 GROUP BY will_buy_on_return_visit 將結果按照是否在返回訪問時購買進行分組。
--*/
分析結果表示,可以看到 (11873 / 741721) = 1.6% 的訪客總數會再次返回網站進行購買。(包括在第一次會話已經購買的訪客子集)
Google Analytics(分析)可擷取有關使用者造訪該本電子商務網站資料集的各種維度和衡量標準。瀏覽BigQuery Export schema文件中的各種維度列表,查找有用的特徵,這些特徵將幫助機器學習模型了解訪問者首次訪問您的網站的數據與他們是否會返回購買之間的關係。
測試這兩個欄位是否適合您的分類模型:
totals.bounces
(訪客是否立即離開網站)totals.timeOnSite
(訪客在我們網站上停留的時間)SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1)
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10;
/*--
在子查詢中,選擇 data-to-insights.ecommerce.web_analytics 資料集中的 totals.newVisits = 1 的訪問數據,同時選擇 totals.bounces(彈跳次數)和 totals.timeOnSite(在站點上花費的時間),並使用 IFNULL 函數處理可能的空值。
在另一個子查詢中,計算每個訪客是否在返回訪問時購買,並使用 COUNTIF 函數進行判斷。將結果以 will_buy_on_return_visit 的形式表示。
在主查詢中,使用 JOIN 子句將這兩個子查詢的結果根據 fullVisitorId 進行合併。
使用 EXCEPT(fullVisitorId) 排除掉原始資料中的 fullVisitorId 列,只顯示其他特徵。
使用 ORDER BY time_on_site DESC 將結果按照在站點上花費的時間降序排序。
使用 LIMIT 10 限制結果集的大小,只顯示前十個結果。
--*/
我們可以得知:
BigQuery ML目前支援的模型
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS
#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
;
/*--
使用 CREATE OR REPLACE MODEL 語句創建一個新的模型或替換現有的模型。模型的名稱為 ecommerce.classification_model。
使用 OPTIONS 子句指定模型的選項,包括模型類型為邏輯回歸('logistic_reg')和標籤為 will_buy_on_return_visit。
在 AS 子句之後,使用 SELECT 語句選擇模型的輸入特徵。這裡選擇了 totals.bounces 和 totals.timeOnSite,同時使用 IFNULL 函數處理可能的空值。
這些特徵是基於 data-to-insights.ecommerce.web_analytics 資料集中 totals.newVisits = 1 以及日期範圍為 '20160801' 到 '20170430'(前9個月)的數據。
使用 JOIN 子句將這些特徵和目標變數 will_buy_on_return_visit 進行合併。目標變數的值是根據每個訪客是否在返回訪問時購買的統計信息而得到的。
--*/
ML.EVALUATE
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model, (
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
));
classification_model_2
:CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS
(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
IFNULL(totals.pageviews, 0) AS pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_2, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY
predicted_will_buy_on_return_visit DESC;
原始資料來源:Predict Visitor Purchases with a Classification Model in BigQuery ML
如果你喜歡這篇文章歡迎幫我按愛心鼓勵一下喔!~閱讀愉快!~