Recommendation engine with Google BigQuery ML Machine Learning

Preface: It was not successful. 

I would like to implement a small recommendation engine using Google BigQuery ML. We have anonymized order data in Google BigQuery and the idea is:

Find all combinations of product 1 and product 2 of all orders with two or more products purchased.  Let BigQuery ML learn. If we later have a given product 1 let BigQuery suggest some related products 2 to our customers.

Build a BigQuery select statement to find orders with 2 order lines / 2 products purchased.

WITH
  TableFirstProduct AS (
  SELECT
    *
  FROM (
    SELECT
      ol1.IncrementId AS IncrementId1,
      ol1.Sku AS Sku1,
      pc1.ParentSku AS ParentSku1,
      pc1.ParentId AS ParentEntityId1,
      pc1.Price,
      pc1.MyCustomProperty01,
      pc1.MyCustomProperty02,
      pc1.MyCustomProperty03,
      pc1.MyCustomProperty04,
      pc1.Color,
      ROW_NUMBER() OVER (PARTITION BY ol1.IncrementId) AS RowNumber1
    FROM
      `myproject01.mydataset01.order_lines` ol1
    JOIN
      `myproject01.mydataset01.product_childs` pc1
    ON
      ol1.ProductId = pc1.EntityId )
  WHERE
    RowNumber1 = 1),
  TableSecondProduct AS (
  SELECT
    *
  FROM (
    SELECT
      ol2.IncrementId AS IncrementId2,
      ol2.Sku AS Sku2,
      pc2.ParentSku AS ParentSku2,
      pc2.ParentId AS ParentEntityId2,
      ROW_NUMBER() OVER (PARTITION BY ol2.IncrementId) AS RowNumber2
    FROM
      `myproject01.mydataset01.order_lines` ol2
    JOIN
      `myproject01.mydataset01.product_childs` pc2
    ON
      ol2.ProductId = pc2.EntityId )
  WHERE
    RowNumber2 = 2 )
SELECT
  t1.*,
  t2.*
FROM
  TableFirstProduct t1
JOIN
  TableSecondProduct t2
ON
  t1.IncrementId1 = t2.IncrementId2
ORDER BY
  t1.IncrementId1

Saved as view „cart_2_products“.

Create model following this documentation

https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start

#standardSQL
CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2`
OPTIONS(model_type='logistic_reg') AS
SELECT
  ParentSku2 as label,
  ParentSku1,
  Price,
  MyCustomProperty01,
  MyCustomProperty02,
  MyCustomProperty03,
  MyCustomProperty04,
  Color
FROM
  `myproject01.recommandations.cart_2_products`

Error:

Logistic regression currently only supports binary classification and the label column had 43305 unique labels instead of 2.

Ok, changed to

(model_type='linear_reg')

Error:

'label' column should be numerical.

Fixed using EntityId instead of SKU, I am not sure if this is the correct way to do it.

#standardSQL
CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2`
OPTIONS
  (model_type='linear_reg') AS
SELECT
  ParentEntityId2 AS label,
  ParentSku1,
  Price,
  MyCustomProperty01,
  MyCustomProperty02,
  MyCustomProperty03,
  MyCustomProperty04,
  Color
FROM
  `myproject01.recommandations.cart_2_products`

Error:

Label column 'label' has NULL values.

Fixing:

IF(ParentEntityId2 IS NULL,
    0,
    ParentEntityId2) AS label,

or full query

#standardSQL
CREATE OR REPLACE MODEL `myproject01.recommandations.recom_product2`
OPTIONS
  (model_type='linear_reg') AS
SELECT
  IF(ParentEntityId2 IS NULL,
    0,
    ParentEntityId2) AS label,
  ParentSku1,
  Price,
  MyCustomProperty01,
  MyCustomProperty02,
  MyCustomProperty03,
  MyCustomProperty04,
  Color
FROM
  `myproject01.recommandations.cart_2_products`

running over 6 mins…

Testing the model via ML.EVALUATE

SELECT
  *
FROM
  ML.EVALUATE(MODEL `myproject01.recommandations.recom_product2`,
    (
    SELECT
      IF(ParentEntityId2 IS NULL,
        0,
        ParentEntityId2) AS label,
      ParentSku1,
      Price,
      MyCustomProperty01,
      MyCustomProperty02,
      MyCustomProperty03,
      MyCustomProperty04,
      Color
    FROM
      `myproject01.recommandations.cart_2_products`
    LIMIT
      10 ))

results to

Row mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance
1 23886.669450492096 1.0217489316058102E9 0.5414029727864474 18347.787328148363 0.1333862788413296 0.17761877811782734

To be honest:  I have no clue 🙂

Use the model to predict via ML.PREDICT

SELECT
  *
FROM
  ML.PREDICT(MODEL `myproject01.recommandations.recom_product2`,
    (
    SELECT
      IF(ParentEntityId2 IS NULL,
        0,
        ParentEntityId2) AS label,
      ParentSku1,
      Price,
      MyCustomProperty01,
      MyCustomProperty02,
      MyCustomProperty03,
      MyCustomProperty04,
      Color
    FROM
      `myproject01.recommandations.cart_2_products`
    LIMIT
      10 ))

 

Row predicted_label label ParentSku1 Price MyCustomProperty01 MyCustomProperty02 MyCustomProperty03 MyCustomProperty04 Color
1 75350.00227889014 36182 SKU01 64.99 Prop0101 Prop0201 null null Schwarz
2 35302.296034777406 24184 SKU02 9.99 Prop0102 Prop0201 null null Schwarz
3 54069.06814550898 36395 SKU03 49.99 Prop0103 Prop0201 null null Grau
4 80172.38105557486 37277 SKU04 26.99 Prop0104 Prop0201 null null Schwarz
5 50016.606966097504 25384 SKU05 9.99 Prop0105 Prop0201 null null Schwarz
6 47691.50187015592 35497 SKU06 18.99 Prop0106 Prop0201 null null Schwarz
7 34641.00410877989 32896 SKU07 69.99 Prop0107 Prop0201 null null Grau
8 35884.611568531276 34268 SKU08 59.99 Prop0108 Prop0201 null null Blau
9 46193.88571743077 38027 SKU09 24.99 Prop0109 Prop0201 null null Blau
10 45592.61311424247 35621 SKU10 64.99 Prop0110 Prop0210 null null Schwarz

And I see: I does not make sense.  It is the wrong way!

I need exact EntityIds and not probabilities and linear regression model is not aimed for this case. 

 

 

 

 

 

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert