Preface: It was not successful.
I would like to implement a small recommendation engine using Google BigQuery ML. We have anonymized order data in Google BigQuery and the idea is:
Find all combinations of product 1 and product 2 of all orders with two or more products purchased. Let BigQuery ML learn. If we later have a given product 1 let BigQuery suggest some related products 2 to our customers.
Build a BigQuery select statement to find orders with 2 order lines / 2 products purchased.
WITH TableFirstProduct AS ( SELECT * FROM ( SELECT ol1.IncrementId AS IncrementId1, ol1.Sku AS Sku1, pc1.ParentSku AS ParentSku1, pc1.ParentId AS ParentEntityId1, pc1.Price, pc1.MyCustomProperty01, pc1.MyCustomProperty02, pc1.MyCustomProperty03, pc1.MyCustomProperty04, pc1.Color, ROW_NUMBER() OVER (PARTITION BY ol1.IncrementId) AS RowNumber1 FROM `myproject01.mydataset01.order_lines` ol1 JOIN `myproject01.mydataset01.product_childs` pc1 ON ol1.ProductId = pc1.EntityId ) WHERE RowNumber1 = 1), TableSecondProduct AS ( SELECT * FROM ( SELECT ol2.IncrementId AS IncrementId2, ol2.Sku AS Sku2, pc2.ParentSku AS ParentSku2, pc2.ParentId AS ParentEntityId2, ROW_NUMBER() OVER (PARTITION BY ol2.IncrementId) AS RowNumber2 FROM `myproject01.mydataset01.order_lines` ol2 JOIN `myproject01.mydataset01.product_childs` pc2 ON ol2.ProductId = pc2.EntityId ) WHERE RowNumber2 = 2 ) SELECT t1.*, t2.* FROM TableFirstProduct t1 JOIN TableSecondProduct t2 ON t1.IncrementId1 = t2.IncrementId2 ORDER BY t1.IncrementId1
Saved as view „cart_2_products“.
Create model following this documentation
https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start
#standardSQL CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2` OPTIONS(model_type='logistic_reg') AS SELECT ParentSku2 as label, ParentSku1, Price, MyCustomProperty01, MyCustomProperty02, MyCustomProperty03, MyCustomProperty04, Color FROM `myproject01.recommandations.cart_2_products`
Error:
Logistic regression currently only supports binary classification and the label column had 43305 unique labels instead of 2.
Ok, changed to
(model_type='linear_reg')
Error:
'label' column should be numerical.
Fixed using EntityId instead of SKU, I am not sure if this is the correct way to do it.
#standardSQL CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2` OPTIONS (model_type='linear_reg') AS SELECT ParentEntityId2 AS label, ParentSku1, Price, MyCustomProperty01, MyCustomProperty02, MyCustomProperty03, MyCustomProperty04, Color FROM `myproject01.recommandations.cart_2_products`
Error:
Label column 'label' has NULL values.
Fixing:
IF(ParentEntityId2 IS NULL, 0, ParentEntityId2) AS label,
or full query
#standardSQL CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2` OPTIONS (model_type='linear_reg') AS SELECT IF(ParentEntityId2 IS NULL, 0, ParentEntityId2) AS label, ParentSku1, Price, MyCustomProperty01, MyCustomProperty02, MyCustomProperty03, MyCustomProperty04, Color FROM `myproject01.recommandations.cart_2_products`
running over 6 mins…
Testing the model via ML.EVALUATE
SELECT * FROM ML.EVALUATE(MODEL `myproject01.recommandations.recom_product2`, ( SELECT IF(ParentEntityId2 IS NULL, 0, ParentEntityId2) AS label, ParentSku1, Price, MyCustomProperty01, MyCustomProperty02, MyCustomProperty03, MyCustomProperty04, Color FROM `myproject01.recommandations.cart_2_products` LIMIT 10 ))
results to
Row | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance | |
---|---|---|---|---|---|---|---|
1 | 23886.669450492096 | 1.0217489316058102E9 | 0.5414029727864474 | 18347.787328148363 | 0.1333862788413296 | 0.17761877811782734 |
To be honest: I have no clue 🙂
Use the model to predict via ML.PREDICT
SELECT * FROM ML.PREDICT(MODEL `myproject01.recommandations.recom_product2`, ( SELECT IF(ParentEntityId2 IS NULL, 0, ParentEntityId2) AS label, ParentSku1, Price, MyCustomProperty01, MyCustomProperty02, MyCustomProperty03, MyCustomProperty04, Color FROM `myproject01.recommandations.cart_2_products` LIMIT 10 ))
Row | predicted_label | label | ParentSku1 | Price | MyCustomProperty01 | MyCustomProperty02 | MyCustomProperty03 | MyCustomProperty04 | Color | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 75350.00227889014 | 36182 | SKU01 | 64.99 | Prop0101 | Prop0201 | null | null | Schwarz | |
2 | 35302.296034777406 | 24184 | SKU02 | 9.99 | Prop0102 | Prop0201 | null | null | Schwarz | |
3 | 54069.06814550898 | 36395 | SKU03 | 49.99 | Prop0103 | Prop0201 | null | null | Grau | |
4 | 80172.38105557486 | 37277 | SKU04 | 26.99 | Prop0104 | Prop0201 | null | null | Schwarz | |
5 | 50016.606966097504 | 25384 | SKU05 | 9.99 | Prop0105 | Prop0201 | null | null | Schwarz | |
6 | 47691.50187015592 | 35497 | SKU06 | 18.99 | Prop0106 | Prop0201 | null | null | Schwarz | |
7 | 34641.00410877989 | 32896 | SKU07 | 69.99 | Prop0107 | Prop0201 | null | null | Grau | |
8 | 35884.611568531276 | 34268 | SKU08 | 59.99 | Prop0108 | Prop0201 | null | null | Blau | |
9 | 46193.88571743077 | 38027 | SKU09 | 24.99 | Prop0109 | Prop0201 | null | null | Blau | |
10 | 45592.61311424247 | 35621 | SKU10 | 64.99 | Prop0110 | Prop0210 | null | null | Schwarz |
And I see: I does not make sense. It is the wrong way!
I need exact EntityIds and not probabilities and linear regression model is not aimed for this case.