Avito Duplicate Ads Detection

luoq08@gmail.com OR hzluoqiang@corp.netease.com

data

size

  • train pairs: 2991396
  • test pairs: 1044196
  • train info: 3344613
  • test info: 1315205
  • images: 11380670
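
A minimal loading sketch; the file names follow the Kaggle data page and are an assumption, since the original loading code is not shown:

import pandas as pd

# competition CSVs as distributed on Kaggle (names assumed)
pairs_train = pd.read_csv('ItemPairs_train.csv')  # 2,991,396 pairs
pairs_test = pd.read_csv('ItemPairs_test.csv')    # 1,044,196 pairs
info_train = pd.read_csv('ItemInfo_train.csv')    # 3,344,613 ads
info_test = pd.read_csv('ItemInfo_test.csv')      # 1,315,205 ads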

solution

detail

result analysis

parameter tuning

  • grid search locally, validated against the leaderboard (final settings sketched below)
  • max_depth: 5 -> 10 -> 15 (AUC increased locally but decreased on the leaderboard, i.e. deeper trees overfit)
  • subsample=0.8, colsample_bytree=0.8
  • min_child_weight=1
  • learning_rate=0.05, n_estimators=1000
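
A sketch of the resulting configuration; the sklearn-style wrapper and variable names are assumptions, not the original code:

import xgboost as xgb

model = xgb.XGBClassifier(
    max_depth=10,            # tried 5 -> 10 -> 15; 15 overfit the leaderboard
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=1,
    learning_rate=0.05,
    n_estimators=1000,
    nthread=32,              # matches the 32-thread training noted below
)
# model.fit(features_train, labels_train)  # hypothetical label variable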

feature size

# concatenate all feature groups column-wise into the final training matrix
features_train = pd.concat((
        simple_features_train,
        aggregation_features_train,
        title_features_train, description_features_train, ncd_features_train,
        image_features_train,
        corpus_based_features_train,
        dummy_features_train,
        categoryID_shuffle_features_train,
    ), axis=1)
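
As a sanity check, the width of the concatenated matrix should equal the total in the table below:

print(features_train.shape[1])  # 290, the 'total' row of the table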
feature group        size
simple                 26
aggregation            20
text simple            18
image                  44
text vector space     110
dummy                  51
total                 290

time

  • simple, aggregation, text simple: hours
  • image histogram, hash: about 1 day each with 8 (16?) cores
  • image mxnet: about 5 days on a GPU (Titan)
  • text vector space: overnight
  • model training, 32 threads:
    CPU times: user 3d 10h 19min 40s, sys: 2min 51s, total: 3d 10h 22min 32s
    Wall time: 3h 4min 33s

feature importance

computed on training data, so not necessarily representative of leaderboard performance

  • by weight (the number of times a feature is used to split the data across all trees)
    title_word_1_2gram_dtm_0_predict_log_price__1     0.015352
    title_word_1_2gram_dtm_0_predict_log_price__2     0.015343
    mxnet_bn_batch_mean_sim                           0.014696
    description_length_max                            0.014154
    description_length_min                            0.013829
    price_diff                                        0.013002
    title_description_dtm_0_predict_log_price__1      0.012954
    title_description_dtm_0_predict_log_price__2      0.012500
    mxnet_bn_batch_max_sim                            0.011103
    price_min                                         0.010932
    locationID_1_freq                                 0.010878
    price_max                                         0.010721
  • by gain (the average gain of the feature when it is used in trees)
    image_phash_hamming_0_min                         2875.218537
    mxnet_bn_batch_max_sim                            1869.406305
    categoryID_112                                    1714.816033
    title_word_dtm_0_1__binary_tfidf__cosine          1053.358729
    image_dhash_hamming_0_min                          811.860253
    title_word_dtm_0_1__tfidf__cosine                  359.690265
    categoryID_9                                       311.754781
    categoryID_33                                      275.383379
    categoryID_111                                     254.657611
    price_diff_ratio                                   254.372805
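
Both views can be read off the trained booster directly; a sketch assuming the model object from the tuning section:

booster = model.get_booster()
by_weight = booster.get_score(importance_type='weight')  # split counts per feature
by_gain = booster.get_score(importance_type='gain')      # average gain per split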

auc by categoryID

  • category 112 has few images, which likely explains its low AUC
categoryID    AUC
112    0.858464
85     0.903141
33     0.913381
111    0.913763
101    0.927559
105    0.930247
31     0.932722
99     0.934950
10     0.940427
23     0.942282
97     0.944655
19     0.947516
26     0.948816
98     0.949131
34     0.949929
25     0.950278
32     0.951069
115    0.951840
42     0.951955
86     0.954261
21     0.957213
40     0.958290
24     0.958903
87     0.959818
84     0.960464
39     0.960946
96     0.963323
102    0.966269
29     0.966634
38     0.966994
36     0.967149
90     0.967512
83     0.968200
27     0.969401
28     0.971950
94     0.974298
114    0.974501
20     0.974637
89     0.976800
9      0.978176
106    0.980815
88     0.981223
30     0.982920
93     0.983322
81     0.984331
14     0.985742
82     0.987740
92     0.988143
11     0.989218
91     0.989520
116    0.993231
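
A sketch of how such a per-category breakdown can be computed; the validation frame and its column names are hypothetical:

import pandas as pd
from sklearn.metrics import roc_auc_score

# toy stand-in for a held-out frame with labels and predicted probabilities
valid_df = pd.DataFrame({
    'categoryID': [112, 112, 112, 9, 9, 9],
    'isDuplicate': [0, 1, 0, 1, 0, 1],
    'proba': [0.2, 0.7, 0.4, 0.9, 0.1, 0.8],
})
auc_by_category = (valid_df.groupby('categoryID')
                   .apply(lambda g: roc_auc_score(g['isDuplicate'], g['proba']))
                   .sort_values())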

feature contribution

impact on the result of leaving out feature groups; numbers are from an earlier iteration (auc.old.csv), before the imagehash and mxnet features were added

In [2]:
pd.read_csv('auc.old.csv')
Out[2]:
   model                features                                    test set  public leaderboard  gap
0  xgboost.26.weighted  all                                         0.956060  0.91957             0.036490
1  xgboost.27.weighted  -image                                      0.929910  0.86827             0.061640
2  xgboost.28.weighted  -image, -corpus                             0.922269  0.85488             0.067389
3  xgboost.29.weighted  -corpus                                     0.952553  0.91441             0.038143
4  lr_text.1            title_description_dtm_0                     0.793570  0.64333             0.150240
5  xgboost.31.weighted  +description_sentence__binary__agg_cosine   0.956172  0.91963             0.036542
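
The gap column is simply local test AUC minus public-leaderboard AUC; a quick recomputation, with column names assumed from the printed header:

auc = pd.read_csv('auc.old.csv')
auc['gap'] = auc['test set'] - auc['public leaderboard']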

leaderboard

progress

In [3]:
leaderboard = pd.read_csv('avito-duplicate-ads-detection_public_leaderboard.csv',
                          parse_dates=['SubmissionDate'])
# remove bad data
leaderboard = leaderboard[leaderboard['TeamId']!=334028]
leaderboard[leaderboard.TeamName=='luoq']
Out[3]:
TeamId TeamName SubmissionDate Score
1116 332723 luoq 2016-05-31 02:57:14 0.76210
1117 332723 luoq 2016-05-31 03:33:38 0.76784
1128 332723 luoq 2016-05-31 10:53:39 0.76784
1144 332723 luoq 2016-06-01 01:26:02 0.78804
1145 332723 luoq 2016-06-01 05:03:52 0.80068
1146 332723 luoq 2016-06-01 10:45:47 0.80763
1167 332723 luoq 2016-06-02 02:31:37 0.82881
1174 332723 luoq 2016-06-02 06:01:23 0.83649
1206 332723 luoq 2016-06-03 10:21:27 0.83654
1309 332723 luoq 2016-06-06 11:23:05 0.83714
1332 332723 luoq 2016-06-07 03:35:36 0.89558
1366 332723 luoq 2016-06-08 06:11:24 0.90249
1412 332723 luoq 2016-06-09 05:52:11 0.90954
1530 332723 luoq 2016-06-12 08:54:48 0.91841
1701 332723 luoq 2016-06-15 07:18:48 0.91903
1704 332723 luoq 2016-06-15 08:34:40 0.91916
1713 332723 luoq 2016-06-15 10:17:36 0.91941
2071 332723 luoq 2016-06-23 10:59:09 0.92034
2818 332723 luoq 2016-07-04 08:07:50 0.93212
3223 332723 luoq 2016-07-08 01:30:57 0.93778
3233 332723 luoq 2016-07-08 06:29:38 0.93791
3322 332723 luoq 2016-07-09 04:08:20 0.93873
3430 332723 luoq 2016-07-10 01:54:56 0.93890
3566 332723 luoq 2016-07-11 01:24:38 0.93964

top 20

In [4]:
leaderboard.groupby('TeamName')['Score'].max().sort_values(ascending=False).iloc[:20]
Out[4]:
TeamName
Devil Team                    0.95839
TheQuants                     0.95317
Native Russian Speakers :P    0.95118
otivA                         0.95101
ADAD                          0.94991
8 + 9 = 11                    0.94732
ololobhi                      0.94627
DataMinders                   0.94456
frist                         0.94456
Li-Der                        0.94302
Pavel Blinov                  0.94299
TeamYK                        0.94137
amsqr_run2                    0.94107
luoq                          0.93964
theFuture                     0.93907
ZigZag                        0.93801
Igor Pasechnik                0.93679
leventis_vamvakas             0.93648
Sameh & Javier                0.93603
x0x0w1                        0.93579
Name: Score, dtype: float64
In [10]:
top20_team = leaderboard.groupby('TeamName')['Score'].max().sort_values(ascending=False).iloc[:20].index.tolist()
from bokeh.charts import TimeSeries  # bokeh.charts API, current at the time
from bokeh.io import show

p = TimeSeries(leaderboard[leaderboard.TeamName.isin(top20_team)],
               x='SubmissionDate', y='Score', color='TeamName',
               plot_width=1000, legend=False)
show(p)
Out[10]:
(interactive Bokeh time series of the top-20 teams' scores; plot not rendered here)

In [11]:
p = TimeSeries(leaderboard[leaderboard.TeamName.isin(top20_team) & (leaderboard.Score >= 0.9)],
               x='SubmissionDate', y='Score', color='TeamName',
               plot_width=1000, legend=False)
show(p)
Out[11]:
(same plot restricted to Score >= 0.9; plot not rendered here)

interesting findings

  • gap between local test AUC and the public leaderboard
    • the data is not i.i.d.
    • pairs are ordered by time
    • text features show the largest gap (see lr_text.1 above)
  • bad results for some categories (e.g. 112)
  • bad-case analysis: not useful