Preface: It was not successful.
I would like to implement a small recommendation engine using Google BigQuery ML. We have anonymized order data in Google BigQuery and the idea is:
Find all combinations of product 1 and product 2 of all orders with two or more products purchased. Let BigQuery ML learn. If we later have a given product 1 let BigQuery suggest some related products 2 to our customers.
Build a BigQuery select statement to find orders with 2 order lines / 2 products purchased.
WITH
TableFirstProduct AS (
SELECT
*
FROM (
SELECT
ol1.IncrementId AS IncrementId1,
ol1.Sku AS Sku1,
pc1.ParentSku AS ParentSku1,
pc1.ParentId AS ParentEntityId1,
pc1.Price,
pc1.MyCustomProperty01,
pc1.MyCustomProperty02,
pc1.MyCustomProperty03,
pc1.MyCustomProperty04,
pc1.Color,
ROW_NUMBER() OVER (PARTITION BY ol1.IncrementId) AS RowNumber1
FROM
`myproject01.mydataset01.order_lines` ol1
JOIN
`myproject01.mydataset01.product_childs` pc1
ON
ol1.ProductId = pc1.EntityId )
WHERE
RowNumber1 = 1),
TableSecondProduct AS (
SELECT
*
FROM (
SELECT
ol2.IncrementId AS IncrementId2,
ol2.Sku AS Sku2,
pc2.ParentSku AS ParentSku2,
pc2.ParentId AS ParentEntityId2,
ROW_NUMBER() OVER (PARTITION BY ol2.IncrementId) AS RowNumber2
FROM
`myproject01.mydataset01.order_lines` ol2
JOIN
`myproject01.mydataset01.product_childs` pc2
ON
ol2.ProductId = pc2.EntityId )
WHERE
RowNumber2 = 2 )
SELECT
t1.*,
t2.*
FROM
TableFirstProduct t1
JOIN
TableSecondProduct t2
ON
t1.IncrementId1 = t2.IncrementId2
ORDER BY
t1.IncrementId1
Saved as view „cart_2_products“.
Create model following this documentation
https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start
#standardSQL
CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2`
OPTIONS(model_type='logistic_reg') AS
SELECT
ParentSku2 as label,
ParentSku1,
Price,
MyCustomProperty01,
MyCustomProperty02,
MyCustomProperty03,
MyCustomProperty04,
Color
FROM
`myproject01.recommandations.cart_2_products`
Error:
Logistic regression currently only supports binary classification and the label column had 43305 unique labels instead of 2.
Ok, changed to
(model_type='linear_reg')
Error:
'label' column should be numerical.
Fixed using EntityId instead of SKU, I am not sure if this is the correct way to do it.
#standardSQL
CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2`
OPTIONS
(model_type='linear_reg') AS
SELECT
ParentEntityId2 AS label,
ParentSku1,
Price,
MyCustomProperty01,
MyCustomProperty02,
MyCustomProperty03,
MyCustomProperty04,
Color
FROM
`myproject01.recommandations.cart_2_products`
Error:
Label column 'label' has NULL values.
Fixing:
IF(ParentEntityId2 IS NULL,
0,
ParentEntityId2) AS label,
or full query
#standardSQL
CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2`
OPTIONS
(model_type='linear_reg') AS
SELECT
IF(ParentEntityId2 IS NULL,
0,
ParentEntityId2) AS label,
ParentSku1,
Price,
MyCustomProperty01,
MyCustomProperty02,
MyCustomProperty03,
MyCustomProperty04,
Color
FROM
`myproject01.recommandations.cart_2_products`
running over 6 mins…
Testing the model via ML.EVALUATE
SELECT
*
FROM
ML.EVALUATE(MODEL `myproject01.recommandations.recom_product2`,
(
SELECT
IF(ParentEntityId2 IS NULL,
0,
ParentEntityId2) AS label,
ParentSku1,
Price,
MyCustomProperty01,
MyCustomProperty02,
MyCustomProperty03,
MyCustomProperty04,
Color
FROM
`myproject01.recommandations.cart_2_products`
LIMIT
10 ))
results to
Row |
mean_absolute_error |
mean_squared_error |
mean_squared_log_error |
median_absolute_error |
r2_score |
explained_variance |
|
1 |
23886.669450492096 |
1.0217489316058102E9 |
0.5414029727864474 |
18347.787328148363 |
0.1333862788413296 |
0.17761877811782734 |
To be honest: I have no clue 🙂
Use the model to predict via ML.PREDICT
SELECT
*
FROM
ML.PREDICT(MODEL `myproject01.recommandations.recom_product2`,
(
SELECT
IF(ParentEntityId2 IS NULL,
0,
ParentEntityId2) AS label,
ParentSku1,
Price,
MyCustomProperty01,
MyCustomProperty02,
MyCustomProperty03,
MyCustomProperty04,
Color
FROM
`myproject01.recommandations.cart_2_products`
LIMIT
10 ))
Row |
predicted_label |
label |
ParentSku1 |
Price |
MyCustomProperty01 |
MyCustomProperty02 |
MyCustomProperty03 |
MyCustomProperty04 |
Color |
|
1 |
75350.00227889014 |
36182 |
SKU01 |
64.99 |
Prop0101 |
Prop0201 |
null |
null |
Schwarz |
|
2 |
35302.296034777406 |
24184 |
SKU02 |
9.99 |
Prop0102 |
Prop0201 |
null |
null |
Schwarz |
|
3 |
54069.06814550898 |
36395 |
SKU03 |
49.99 |
Prop0103 |
Prop0201 |
null |
null |
Grau |
|
4 |
80172.38105557486 |
37277 |
SKU04 |
26.99 |
Prop0104 |
Prop0201 |
null |
null |
Schwarz |
|
5 |
50016.606966097504 |
25384 |
SKU05 |
9.99 |
Prop0105 |
Prop0201 |
null |
null |
Schwarz |
|
6 |
47691.50187015592 |
35497 |
SKU06 |
18.99 |
Prop0106 |
Prop0201 |
null |
null |
Schwarz |
|
7 |
34641.00410877989 |
32896 |
SKU07 |
69.99 |
Prop0107 |
Prop0201 |
null |
null |
Grau |
|
8 |
35884.611568531276 |
34268 |
SKU08 |
59.99 |
Prop0108 |
Prop0201 |
null |
null |
Blau |
|
9 |
46193.88571743077 |
38027 |
SKU09 |
24.99 |
Prop0109 |
Prop0201 |
null |
null |
Blau |
|
10 |
45592.61311424247 |
35621 |
SKU10 |
64.99 |
Prop0110 |
Prop0210 |
null |
null |
Schwarz |
And I see: I does not make sense. It is the wrong way!
I need exact EntityIds and not probabilities and linear regression model is not aimed for this case.