Competition summary

19 May 2019 - will Xu | 许士亭

2015 TaoBao Challenge of User Classification

Pengyang Wang and I took part in a big data challenge held by Alibaba at our school, Beijing University of Posts and Telecommunications, this autumn. The challenge asked participants to classify customers of TaoBao, the Chinese counterpart of Amazon, into 12 categories based on their online behavior. Our solution achieved a precision rate of 34%. This article summarizes the challenge and our method.

code

Problem Description

Data Description

This is a multi-class classification problem. We are given three files: log_train.csv, info_train.csv, and log_test.csv.


The file info_train.csv contains the ID and class of every user who appears in log_train.csv. Its structure is shown in the table below.

Example
user_id class
254 1
456 3

Data Description
Col Name   Type   Description              Comment
user_id    int    Identifier of the user
class      int    Class of the user        Ranges from 1 to 12

log_train.csv and log_test.csv share the same structure. Both files record users' online shopping behavior over a year. The name and meaning of each column are given below.

Example
user_id item_id cat_id seller_id brand_id time_stamp action_type
254 1034 1206 1003 34 173 0
254 1498 1085 1243 4241 173 1

Data Description
Col Name      Type   Description                 Comment
user_id       int    Identifier of the user
item_id       int    Identifier of the item
cat_id        int    Category ID of the item
seller_id     int    Seller ID of the item
brand_id      int    Brand ID of the item
time_stamp    int    When the action occurred
action_type   int    Type ID of the action       0-click, 1-collect, 2-buy, 3-delete

TASK

The challenge required participants to classify every user who appears in log_test.csv into one of 12 categories based on the user's online shopping behavior.


Evaluation

Submissions are evaluated by the precision rate of the classification. The Python evaluation code can be downloaded here.
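The official scoring script is not reproduced in this article. As a rough sketch only, assuming "precision rate" means the share of users whose predicted class equals the true class, the metric could be computed in R as follows. The function name `precision_rate` and the toy data frames `truth` and `pred` are hypothetical and not part of the competition code.

```r
# Hypothetical helper (not the official scorer): share of users whose
# predicted class equals the true class. Both inputs are assumed to be
# data frames with the columns user_id and class.
precision_rate <- function(truth, pred) {
  # Align predictions with true labels by user_id; keep every true user
  both <- merge(truth, pred, by = "user_id",
                suffixes = c("_true", "_pred"), all.x = TRUE)
  # A user without a prediction counts as a wrong answer
  correct <- !is.na(both$class_pred) & both$class_true == both$class_pred
  mean(correct)
}

# Toy data, not taken from the competition
truth <- data.frame(user_id = c(254, 456, 789, 1000), class = c(1, 3, 2, 12))
pred  <- data.frame(user_id = c(456, 254, 1000, 789), class = c(3, 2, 12, 2))
score <- precision_rate(truth, pred)  # 3 of 4 users are classified correctly
```

Joining on user_id rather than comparing the two vectors position by position keeps the score correct even when the prediction file lists users in a different order.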


Solution

Understanding the Problem

The number of customers in each of the 12 classes in log_train.csv is given below.

Number of users in each of the 12 classes
class 1 2 3 4 5 6 7 8 9 10 11 12
number 12166 26830 18836 10066 8981 1999 6578 11915 8416 4163 3536 973

Class 2 has the most users and class 12 has the fewest, only 973. The total number of users is 114459.
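The class sizes are clearly unbalanced, so it helps to know what a trivial classifier would score. The sketch below derives the class shares from the table above and the precision of a rule that assigns every user to the largest class. This baseline is our illustration and was not part of the original analysis; the variable names are hypothetical.

```r
# Class sizes copied from the table above
class_counts <- c(12166, 26830, 18836, 10066, 8981, 1999,
                  6578, 11915, 8416, 4163, 3536, 973)
names(class_counts) <- 1:12

total_users   <- sum(class_counts)            # 114459
class_share   <- class_counts / total_users   # share of each class
largest_class <- names(which.max(class_counts))

# Precision of a naive rule that puts every user in the largest class,
# assuming the test set has the same class mix as the training set
baseline <- max(class_share)
```

On these counts the baseline is roughly 23%, which gives a reference point for reading the precision rates reported in this article.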

Major differences between the online behaviors of the 12 customer classes

It is important to understand the key factors that can influence the classification precision rate. Based on our own experience of online shopping and the data provided in this competition, we identified some potentially meaningful features of user behavior:

\[buying\_clicking\_ratio = \frac{clicking\_frequency}{buying\_frequency}\]

\[C_i\_brand_t = \frac{Numbers\_of\_Class_i\_Buying\_Brand_t}{\sum_{j=1}^{12} Numbers\_of\_Class_j\_Buying\_Brand_t}\]
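The two features above can be sketched in R on a toy log. This is a minimal illustration under our own assumptions, not the competition code: the toy values, the `user_class` table, and the choice to return NA for users with no purchase are all hypothetical.

```r
# Toy log in the same layout as log_train.csv (values are made up)
log <- data.frame(
  user_id     = c(1, 1, 1, 1, 2, 2, 2),
  brand_id    = c(34, 34, 4241, 34, 34, 34, 4241),
  action_type = c(0, 0, 0, 2, 0, 2, 2)   # 0 = click, 2 = buy
)
# Toy class labels in the same layout as info_train.csv
user_class <- data.frame(user_id = c(1, 2), class = c(1, 3))

# Per-user click and buy counts
clicks <- tapply(log$action_type == 0, log$user_id, sum)
buys   <- tapply(log$action_type == 2, log$user_id, sum)

# Ratio as defined above; users with no purchase get NA instead of Inf
buying_clicking_ratio <- ifelse(buys > 0, clicks / buys, NA)

# Share of each class among the purchases of every brand
buys_log <- merge(log[log$action_type == 2, ], user_class, by = "user_id")
# Rows are classes, columns are brands; every column sums to 1
class_brand <- prop.table(table(buys_log$class, buys_log$brand_id), margin = 2)
```

In this toy log user 1 clicks three times and buys once, so the ratio is 3, while brand 34 is bought once by each class, so both classes get a share of 0.5 for that brand.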

After constructing this data frame of per-brand class shares, we run K-means on it to cluster the brands. The R code for this step is given below.


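getKmeans_brand<-function(log,k)
{
  # Share of each class among the log rows of every brand (rows: brand_id, columns: class).
  # Assumes log already carries a class column, e.g. after joining info_train.csv.
  brand_counts<-table(log$brand_id,log$class)
  brand_class<-brand_counts/rowSums(brand_counts)

  # Cluster the brands by their class-share profile
  brand_kmeans_k<-kmeans(brand_class,centers = k)

  # Return one row per brand together with its cluster label
  brand_id<-rownames(brand_class)
  brand_kmeans_k<-data.frame(brand_id,brand_kmeans_k$cluster)
  return (brand_kmeans_k)
}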

After clustering the brands into 14 classes, we replace each brand_id with its cluster ID.
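The replacement step is not shown in this article. One possible way to do it, sketched on toy data, is a lookup with `match`. The toy tables and the column name `cluster` are hypothetical; the sketch only assumes that the cluster table holds the brand IDs in its first column and the cluster labels in its second, as in the data frame returned by getKmeans_brand.

```r
# Toy cluster table: brand IDs in column 1, cluster labels in column 2
brand_kmeans <- data.frame(brand_id = c("34", "4241", "77"),
                           cluster  = c(2, 1, 2))
# Toy log with raw brand IDs
log <- data.frame(user_id = c(254, 254, 300), brand_id = c(34, 4241, 77))

# Find the row of the cluster table that belongs to each log row.
# Both sides are compared as character because rownames() yields strings.
idx <- match(as.character(log$brand_id), as.character(brand_kmeans[[1]]))

# Overwrite brand_id with the cluster label; unknown brands become NA
log$brand_id <- brand_kmeans[[2]][idx]
```

Brands that occur in log_test.csv but not in the training log have no cluster, so they end up as NA here and would need separate handling.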


Data processing

All the R code for the data cleaning process can be found here.

clean_data<-function(log,time=1:185,kmeans_k=6,user.all,brand_kmeans,cat_kmeans)

The time argument of clean_data selects which days of log_train.csv to use, so the log can be split into parts by any time interval you wish. An example call is given below.


train_t1<-clean_data(log_train,time = 1:10,user.all = get_user(log_train),brand_kmeans = brand_kmeans,cat_kmeans = cat_kmeans)
train_t2<-clean_data(log_train,time = 11:40,user.all = get_user(log_train),brand_kmeans = brand_kmeans,cat_kmeans = cat_kmeans)
train_t3<-clean_data(log_train,time = 40:100,user.all = get_user(log_train),brand_kmeans = brand_kmeans,cat_kmeans = cat_kmeans)
train_t4<-clean_data(log_train,time = 100:185,user.all = get_user(log_train),brand_kmeans = brand_kmeans,cat_kmeans = cat_kmeans)
final_train<-merge_data(train_t1,train_t2,train_t3,train_t4)

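In the code above, we split the whole 185 days into 4 parts and then combine the features from all 4 parts with the merge_data function. As written, the windows share their boundary days: day 40 falls in both train_t2 and train_t3, and day 100 in both train_t3 and train_t4.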

merge_data<-function(...)
{
  # Accepts any number of feature tables, so it works for the four
  # time windows in the example call as well as for up to seven.
  tables<-list(...)
  # Inner-join the tables one after another on user_id
  final_log<-Reduce(function(x,y) merge(x,y,by="user_id"),tables)
  return(final_log)
}
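Since the body of clean_data is not shown here, the windowing idea can be illustrated with a much simpler stand-in. The sketch below counts each user's actions inside one time window and then joins two windows on user_id. The helper `window_counts`, its `suffix` argument, and the toy log are hypothetical and do not reproduce the features that clean_data actually builds.

```r
# Hypothetical stand-in for clean_data: number of actions per user in a window
window_counts <- function(log, time, user.all, suffix) {
  # Keep only the rows whose time_stamp falls inside the window
  part <- log[log$time_stamp %in% time, ]
  # Using user.all as factor levels keeps users with no activity (count 0)
  n <- table(factor(part$user_id, levels = user.all))
  out <- data.frame(user_id = user.all, n = as.vector(n))
  # Give the feature a window-specific name so merged columns do not clash
  names(out)[2] <- paste0("n_actions_", suffix)
  out
}

# Toy log, not taken from the competition
log <- data.frame(user_id    = c(1, 1, 2, 2, 2, 3),
                  time_stamp = c(3, 50, 5, 7, 120, 150))
users <- sort(unique(log$user_id))

w1 <- window_counts(log, 1:10,   users, "t1")
w2 <- window_counts(log, 11:185, users, "t2")
features <- merge(w1, w2, by = "user_id")
```

Passing the full user list into every window matters because the join is an inner join: a user who is silent in one window would otherwise vanish from the merged table. That is presumably also why clean_data takes a user.all argument.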

This gives us our final data frame, final_train. We then start to build a classification model.


Modeling

We tried many classification algorithms, including Naive Bayes, Decision Tree, SVM, Random Forests, KNN, and AdaBoost. We found that AdaBoost with 5 decision trees gave the highest precision rate, 34%. The modeling code can be found here.