luoq08@gmail.com OR hzluoqiang@corp.netease.com
# Combine all feature groups column-wise; rows are aligned on the shared index.
features_train = pd.concat((
    simple_features_train,
    aggregation_features_train,
    title_features_train, description_features_train, ncd_features_train,
    image_features_train,
    corpus_based_features_train,
    dummy_features_train,
    categoryID_shuffle_features_train,
), axis=1)
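Column-wise concatenation with `axis=1` matches rows by index label, so every feature group needs the same row index. A minimal sketch with two made-up feature groups (the frame names and values below are hypothetical, not the actual competition features):

```python
import pandas as pd

# Hypothetical stand-ins for two feature groups sharing the same row index.
simple = pd.DataFrame({"price_diff": [0.0, 5.0, 1.5]}, index=[0, 1, 2])
image = pd.DataFrame({"phash_min": [3, 12, 7]}, index=[0, 1, 2])

# axis=1 places the groups side by side; rows are matched by index label.
features = pd.concat((simple, image), axis=1)

# Mismatched indexes would silently introduce NaN rows, so check for them.
has_missing = bool(features.isna().any().any())
print(features.shape, has_missing)
```

Checking the shape and missing values right after the concat is a cheap way to catch a feature group that was built on a different index.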
feature group | size |
---|---|
simple | 26 |
aggregation | 20 |
text simple | 18 |
image | 44 |
text vector space | 110 |
dummy | 51 |
total | 290 |
CPU times: user 3d 10h 19min 40s, sys: 2min 51s, total: 3d 10h 22min 32s
Wall time: 3h 4min 33s
not representative of the leaderboard
180 title_word_1_2gram_dtm_0_predict_log_price__1 0.015352
181 title_word_1_2gram_dtm_0_predict_log_price__2 0.015343
107 mxnet_bn_batch_mean_sim 0.014696
4 description_length_max 0.014154
5 description_length_min 0.013829
11 price_diff 0.013002
184 title_description_dtm_0_predict_log_price__1 0.012954
185 title_description_dtm_0_predict_log_price__2 0.012500
105 mxnet_bn_batch_max_sim 0.011103
14 price_min 0.010932
30 locationID_1_freq 0.010878
13 price_max 0.010721
image_phash_hamming_0_min 2875.218537
mxnet_bn_batch_max_sim 1869.406305
categoryID_112 1714.816033
title_word_dtm_0_1__binary_tfidf__cosine 1053.358729
image_dhash_hamming_0_min 811.860253
title_word_dtm_0_1__tfidf__cosine 359.690265
categoryID_9 311.754781
categoryID_33 275.383379
categoryID_111 254.657611
price_diff_ratio 254.372805
112 0.858464
85 0.903141
33 0.913381
111 0.913763
101 0.927559
105 0.930247
31 0.932722
99 0.934950
10 0.940427
23 0.942282
97 0.944655
19 0.947516
26 0.948816
98 0.949131
34 0.949929
25 0.950278
32 0.951069
115 0.951840
42 0.951955
86 0.954261
21 0.957213
40 0.958290
24 0.958903
87 0.959818
84 0.960464
39 0.960946
96 0.963323
102 0.966269
29 0.966634
38 0.966994
36 0.967149
90 0.967512
83 0.968200
27 0.969401
28 0.971950
94 0.974298
114 0.974501
20 0.974637
89 0.976800
9 0.978176
106 0.980815
88 0.981223
30 0.982920
93 0.983322
81 0.984331
14 0.985742
82 0.987740
92 0.988143
11 0.989218
91 0.989520
116 0.993231
impact on the final result of removing the imagehash and mxnet features
pd.read_csv('auc.old.csv')
index | model | features | test set | public leaderboard | gap |
---|---|---|---|---|---|
0 | xgboost.26.weighted | all | 0.956060 | 0.91957 | 0.036490 |
1 | xgboost.27.weighted | -image | 0.929910 | 0.86827 | 0.061640 |
2 | xgboost.28.weighted | -image, -corpus | 0.922269 | 0.85488 | 0.067389 |
3 | xgboost.29.weighted | -corpus | 0.952553 | 0.91441 | 0.038143 |
4 | lr_text.1 | title_description_dtm_0 | 0.793570 | 0.64333 | 0.150240 |
5 | xgboost.31.weighted | +description_sentence__binary__agg_cosine | 0.956172 | 0.91963 | 0.036542 |
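The gap column in the table above is the local test-set score minus the public leaderboard score. A small sketch that recomputes it for the first two rows (the shortened column names `test_set` and `public_lb` are chosen here for illustration and may differ from those in `auc.old.csv`):

```python
import pandas as pd

# First two rows copied from the table above; column names are illustrative.
auc = pd.DataFrame({
    "model": ["xgboost.26.weighted", "xgboost.27.weighted"],
    "features": ["all", "-image"],
    "test_set": [0.956060, 0.929910],
    "public_lb": [0.91957, 0.86827],
})

# Gap between the local test-set score and the public leaderboard score.
auc["gap"] = (auc["test_set"] - auc["public_lb"]).round(6)
print(auc)
```

A larger gap for a feature subset suggests the local test set is more optimistic for that model than the leaderboard is.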
leaderboard = pd.read_csv('avito-duplicate-ads-detection_public_leaderboard.csv',
                          parse_dates=['SubmissionDate'])
# remove bad data
leaderboard = leaderboard[leaderboard['TeamId']!=334028]
leaderboard[leaderboard.TeamName=='luoq']
index | TeamId | TeamName | SubmissionDate | Score |
---|---|---|---|---|
1116 | 332723 | luoq | 2016-05-31 02:57:14 | 0.76210 |
1117 | 332723 | luoq | 2016-05-31 03:33:38 | 0.76784 |
1128 | 332723 | luoq | 2016-05-31 10:53:39 | 0.76784 |
1144 | 332723 | luoq | 2016-06-01 01:26:02 | 0.78804 |
1145 | 332723 | luoq | 2016-06-01 05:03:52 | 0.80068 |
1146 | 332723 | luoq | 2016-06-01 10:45:47 | 0.80763 |
1167 | 332723 | luoq | 2016-06-02 02:31:37 | 0.82881 |
1174 | 332723 | luoq | 2016-06-02 06:01:23 | 0.83649 |
1206 | 332723 | luoq | 2016-06-03 10:21:27 | 0.83654 |
1309 | 332723 | luoq | 2016-06-06 11:23:05 | 0.83714 |
1332 | 332723 | luoq | 2016-06-07 03:35:36 | 0.89558 |
1366 | 332723 | luoq | 2016-06-08 06:11:24 | 0.90249 |
1412 | 332723 | luoq | 2016-06-09 05:52:11 | 0.90954 |
1530 | 332723 | luoq | 2016-06-12 08:54:48 | 0.91841 |
1701 | 332723 | luoq | 2016-06-15 07:18:48 | 0.91903 |
1704 | 332723 | luoq | 2016-06-15 08:34:40 | 0.91916 |
1713 | 332723 | luoq | 2016-06-15 10:17:36 | 0.91941 |
2071 | 332723 | luoq | 2016-06-23 10:59:09 | 0.92034 |
2818 | 332723 | luoq | 2016-07-04 08:07:50 | 0.93212 |
3223 | 332723 | luoq | 2016-07-08 01:30:57 | 0.93778 |
3233 | 332723 | luoq | 2016-07-08 06:29:38 | 0.93791 |
3322 | 332723 | luoq | 2016-07-09 04:08:20 | 0.93873 |
3430 | 332723 | luoq | 2016-07-10 01:54:56 | 0.93890 |
3566 | 332723 | luoq | 2016-07-11 01:24:38 | 0.93964 |
leaderboard.groupby('TeamName')['Score'].max().sort_values(ascending=False).iloc[:20]
TeamName
Devil Team                    0.95839
TheQuants                     0.95317
Native Russian Speakers :P    0.95118
otivA                         0.95101
ADAD                          0.94991
8 + 9 = 11                    0.94732
ololobhi                      0.94627
DataMinders                   0.94456
frist                         0.94456
Li-Der                        0.94302
Pavel Blinov                  0.94299
TeamYK                        0.94137
amsqr_run2                    0.94107
luoq                          0.93964
theFuture                     0.93907
ZigZag                        0.93801
Igor Pasechnik                0.93679
leventis_vamvakas             0.93648
Sameh & Javier                0.93603
x0x0w1                        0.93579
Name: Score, dtype: float64
top20_team = leaderboard.groupby('TeamName')['Score'].max().sort_values(ascending=False).iloc[:20].index.tolist()
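The ranking logic can be tried on a toy leaderboard: take each team's best score, sort descending, keep the first few names, then filter the submissions to those teams. The teams and scores below are made up, and `isin` is used here as one way to do the membership filter:

```python
import pandas as pd

# Toy leaderboard with made-up teams and scores.
lb = pd.DataFrame({
    "TeamName": ["a", "a", "b", "b", "c"],
    "Score": [0.90, 0.93, 0.95, 0.94, 0.80],
})

# Best score per team, highest first.
best = lb.groupby("TeamName")["Score"].max().sort_values(ascending=False)

# Names of the top 2 teams, in rank order.
top2 = best.iloc[:2].index.tolist()

# Keep only the submissions made by those teams.
top_rows = lb[lb["TeamName"].isin(top2)]
print(top2, len(top_rows))
```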
p = TimeSeries(leaderboard[leaderboard.TeamName.isin(top20_team)],
               x='SubmissionDate', y='Score', color='TeamName',
               plot_width=1000, legend=False)
show(p)
[Figure: Score over SubmissionDate for the top 20 teams, one line per team]
p = TimeSeries(leaderboard[leaderboard.TeamName.isin(top20_team) & (leaderboard.Score >= 0.9)],
               x='SubmissionDate', y='Score', color='TeamName',
               plot_width=1000, legend=False)
show(p)
[Figure: Score over SubmissionDate for the top 20 teams, restricted to Score >= 0.9]