Avito Duplicate Ads Detection

luoq08@gmail.com OR hzluoqiang@corp.netease.com

data

size

  • train pairs: 2991396
  • test pairs: 1044196
  • train info: 3344613
  • test info: 1315205
  • images: 11380670
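
A minimal loading sketch; the file names follow the Kaggle data page and are an assumption, since the original loading code is not shown:

import pandas as pd

# competition CSVs as distributed on Kaggle (names assumed)
pairs_train = pd.read_csv('ItemPairs_train.csv')  # 2,991,396 pairs
pairs_test = pd.read_csv('ItemPairs_test.csv')    # 1,044,196 pairs
info_train = pd.read_csv('ItemInfo_train.csv')    # 3,344,613 ads
info_test = pd.read_csv('ItemInfo_test.csv')      # 1,315,205 ads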

solution

detail

result analysis

parameter tuning

  • grid search locally, validated against the leaderboard (final settings sketched below)
  • max_depth: 5 -> 10 -> 15 (AUC increased locally but decreased on the leaderboard, i.e. deeper trees overfit)
  • subsample=0.8, colsample_bytree=0.8
  • min_child_weight=1
  • learning_rate=0.05, n_estimators=1000
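
A sketch of the resulting configuration; the sklearn-style wrapper and variable names are assumptions, not the original code:

import xgboost as xgb

model = xgb.XGBClassifier(
    max_depth=10,            # tried 5 -> 10 -> 15; 15 overfit the leaderboard
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=1,
    learning_rate=0.05,
    n_estimators=1000,
    nthread=32,              # matches the 32-thread training noted below
)
# model.fit(features_train, labels_train)  # hypothetical label variable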

feature size

# concatenate all feature groups column-wise into the final training matrix
features_train = pd.concat((
        simple_features_train,
        aggregation_features_train,
        title_features_train, description_features_train, ncd_features_train,
        image_features_train,
        corpus_based_features_train,
        dummy_features_train,
        categoryID_shuffle_features_train,
    ), axis=1)
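
As a sanity check, the width of the concatenated matrix should equal the total in the table below:

print(features_train.shape[1])  # 290, the 'total' row of the table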
feature group        size
simple                 26
aggregation            20
text simple            18
image                  44
text vector space     110
dummy                  51
total                 290

time

  • simple, aggregation, text simple: hours
  • image histogram, hash: about 1 day each with 8 (16?) cores
  • image mxnet: about 5 days on a GPU (Titan)
  • text vector space: overnight
  • model training, 32 threads:
    CPU times: user 3d 10h 19min 40s, sys: 2min 51s, total: 3d 10h 22min 32s
    Wall time: 3h 4min 33s

feature importance

computed on training data, so not necessarily representative of leaderboard performance

  • by weight (the number of times a feature is used to split the data across all trees)
    title_word_1_2gram_dtm_0_predict_log_price__1     0.015352
    title_word_1_2gram_dtm_0_predict_log_price__2     0.015343
    mxnet_bn_batch_mean_sim                           0.014696
    description_length_max                            0.014154
    description_length_min                            0.013829
    price_diff                                        0.013002
    title_description_dtm_0_predict_log_price__1      0.012954
    title_description_dtm_0_predict_log_price__2      0.012500
    mxnet_bn_batch_max_sim                            0.011103
    price_min                                         0.010932
    locationID_1_freq                                 0.010878
    price_max                                         0.010721
  • by gain (the average gain of the feature when it is used in trees)
    image_phash_hamming_0_min                         2875.218537
    mxnet_bn_batch_max_sim                            1869.406305
    categoryID_112                                    1714.816033
    title_word_dtm_0_1__binary_tfidf__cosine          1053.358729
    image_dhash_hamming_0_min                          811.860253
    title_word_dtm_0_1__tfidf__cosine                  359.690265
    categoryID_9                                       311.754781
    categoryID_33                                      275.383379
    categoryID_111                                     254.657611
    price_diff_ratio                                   254.372805
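
Both views can be read off the trained booster directly; a sketch assuming the model object from the tuning section:

booster = model.get_booster()
by_weight = booster.get_score(importance_type='weight')  # split counts per feature
by_gain = booster.get_score(importance_type='gain')      # average gain per split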

auc by categoryID

  • category 112 has few images, which likely explains its low AUC
categoryID    AUC
112    0.858464
85     0.903141
33     0.913381
111    0.913763
101    0.927559
105    0.930247
31     0.932722
99     0.934950
10     0.940427
23     0.942282
97     0.944655
19     0.947516
26     0.948816
98     0.949131
34     0.949929
25     0.950278
32     0.951069
115    0.951840
42     0.951955
86     0.954261
21     0.957213
40     0.958290
24     0.958903
87     0.959818
84     0.960464
39     0.960946
96     0.963323
102    0.966269
29     0.966634
38     0.966994
36     0.967149
90     0.967512
83     0.968200
27     0.969401
28     0.971950
94     0.974298
114    0.974501
20     0.974637
89     0.976800
9      0.978176
106    0.980815
88     0.981223
30     0.982920
93     0.983322
81     0.984331
14     0.985742
82     0.987740
92     0.988143
11     0.989218
91     0.989520
116    0.993231
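
A sketch of how such a per-category breakdown can be computed; the validation frame and its column names are hypothetical:

import pandas as pd
from sklearn.metrics import roc_auc_score

# toy stand-in for a held-out frame with labels and predicted probabilities
valid_df = pd.DataFrame({
    'categoryID': [112, 112, 112, 9, 9, 9],
    'isDuplicate': [0, 1, 0, 1, 0, 1],
    'proba': [0.2, 0.7, 0.4, 0.9, 0.1, 0.8],
})
auc_by_category = (valid_df.groupby('categoryID')
                   .apply(lambda g: roc_auc_score(g['isDuplicate'], g['proba']))
                   .sort_values())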

feature contribution

impact on the result of leaving out feature groups; numbers are from an earlier iteration (auc.old.csv), before the imagehash and mxnet features were added

In [2]:
pd.read_csv('auc.old.csv')
Out[2]:
   model                features                                    test set  public leaderboard  gap
0  xgboost.26.weighted  all                                         0.956060  0.91957             0.036490
1  xgboost.27.weighted  -image                                      0.929910  0.86827             0.061640
2  xgboost.28.weighted  -image, -corpus                             0.922269  0.85488             0.067389
3  xgboost.29.weighted  -corpus                                     0.952553  0.91441             0.038143
4  lr_text.1            title_description_dtm_0                     0.793570  0.64333             0.150240
5  xgboost.31.weighted  +description_sentence__binary__agg_cosine   0.956172  0.91963             0.036542
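
The gap column is simply local test AUC minus public-leaderboard AUC; a quick recomputation, with column names assumed from the printed header:

auc = pd.read_csv('auc.old.csv')
auc['gap'] = auc['test set'] - auc['public leaderboard']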

leaderboard

progress

In [3]:
leaderboard = pd.read_csv('avito-duplicate-ads-detection_public_leaderboard.csv',
                          parse_dates=['SubmissionDate'])
# remove bad data
leaderboard = leaderboard[leaderboard['TeamId']!=334028]
leaderboard[leaderboard.TeamName=='luoq']
Out[3]:
TeamId TeamName SubmissionDate Score
1116 332723 luoq 2016-05-31 02:57:14 0.76210
1117 332723 luoq 2016-05-31 03:33:38 0.76784
1128 332723 luoq 2016-05-31 10:53:39 0.76784
1144 332723 luoq 2016-06-01 01:26:02 0.78804
1145 332723 luoq 2016-06-01 05:03:52 0.80068
1146 332723 luoq 2016-06-01 10:45:47 0.80763
1167 332723 luoq 2016-06-02 02:31:37 0.82881
1174 332723 luoq 2016-06-02 06:01:23 0.83649
1206 332723 luoq 2016-06-03 10:21:27 0.83654
1309 332723 luoq 2016-06-06 11:23:05 0.83714
1332 332723 luoq 2016-06-07 03:35:36 0.89558
1366 332723 luoq 2016-06-08 06:11:24 0.90249
1412 332723 luoq 2016-06-09 05:52:11 0.90954
1530 332723 luoq 2016-06-12 08:54:48 0.91841
1701 332723 luoq 2016-06-15 07:18:48 0.91903
1704 332723 luoq 2016-06-15 08:34:40 0.91916
1713 332723 luoq 2016-06-15 10:17:36 0.91941
2071 332723 luoq 2016-06-23 10:59:09 0.92034
2818 332723 luoq 2016-07-04 08:07:50 0.93212
3223 332723 luoq 2016-07-08 01:30:57 0.93778
3233 332723 luoq 2016-07-08 06:29:38 0.93791
3322 332723 luoq 2016-07-09 04:08:20 0.93873
3430 332723 luoq 2016-07-10 01:54:56 0.93890
3566 332723 luoq 2016-07-11 01:24:38 0.93964

top 20

In [4]:
leaderboard.groupby('TeamName')['Score'].max().sort_values(ascending=False).iloc[:20]
Out[4]:
TeamName
Devil Team                    0.95839
TheQuants                     0.95317
Native Russian Speakers :P    0.95118
otivA                         0.95101
ADAD                          0.94991
8 + 9 = 11                    0.94732
ololobhi                      0.94627
DataMinders                   0.94456
frist                         0.94456
Li-Der                        0.94302
Pavel Blinov                  0.94299
TeamYK                        0.94137
amsqr_run2                    0.94107
luoq                          0.93964
theFuture                     0.93907
ZigZag                        0.93801
Igor Pasechnik                0.93679
leventis_vamvakas             0.93648
Sameh & Javier                0.93603
x0x0w1                        0.93579
Name: Score, dtype: float64
In [10]:
top20_team = leaderboard.groupby('TeamName')['Score'].max().sort_values(ascending=False).iloc[:20].index.tolist()
from bokeh.charts import TimeSeries  # bokeh.charts API, current at the time
from bokeh.io import show

p = TimeSeries(leaderboard[leaderboard.TeamName.isin(top20_team)],
               x='SubmissionDate', y='Score', color='TeamName',
               plot_width=1000, legend=False)
show(p)
Out[10]:
(interactive Bokeh time series of the top-20 teams' scores; plot not rendered here)

In [11]:
p = TimeSeries(leaderboard[leaderboard.TeamName.isin(top20_team) & (leaderboard.Score >= 0.9)],
               x='SubmissionDate', y='Score', color='TeamName',
               plot_width=1000, legend=False)
show(p)
Out[11]:
(same plot restricted to Score >= 0.9; plot not rendered here)

interesting findings

  • gap between local test AUC and the public leaderboard
    • the data is not i.i.d.
    • pairs are ordered by time
    • text features show the largest gap (see lr_text.1 above)
  • bad results for some categories (e.g. 112)
  • bad-case analysis: not useful