Repository: chizhu/yiguan_sex_age_predict_1st_solution
Branch: master
Commit: 6ffeda04746e
Files: 71
Total size: 678.0 KB
Directory structure:
gitextract_j9hppgj3/
├── 2018易观A10大数据应用峰会-RNG_终极版.pptx
├── README.md
├── THLUO/
│ ├── 1.w2c_model_start.py
│ ├── 10.age_bin_prob_oof.py
│ ├── 11.hcc_device_brand_age_sex.py
│ ├── 12.device_age_regression_prob_oof.py
│ ├── 13.device_start_GRU_pred.py
│ ├── 14.device_start_GRU_pred_age.py
│ ├── 15.device_all_GRU_pred.py
│ ├── 16.device_start_capsule_pred.py
│ ├── 17.device_start_textcnn_pred.py
│ ├── 18.device_start_text_dpcnn_pred.py
│ ├── 19.device_start_lstm_pred.py
│ ├── 2.w2c_model_close.py
│ ├── 20.lgb_sex_age_prob_oof.py
│ ├── 21.tfidf_lr_sex_age_prob_oof.py
│ ├── 22.base_feat.py
│ ├── 23.ATT_v6.py
│ ├── 24.thluo_22_lgb.py
│ ├── 25.thluo_22_xgb.py
│ ├── 26.thluo_nb_lgb.py
│ ├── 27.thluo_nb_xgb.py
│ ├── 28.final.py
│ ├── 3.device_quchong_start_app_w2c.py
│ ├── 3.w2c_all_emb.py
│ ├── 3.w2c_model_all.py
│ ├── 4.device_age_prob_oof.py
│ ├── 5.device_sex_prob_oof.py
│ ├── 6.start_close_age_prob_oof.py
│ ├── 7.start_close_sex_prob_oof.py
│ ├── 9.sex_age_bin_prob_oof.py
│ ├── TextModel.py
│ ├── readme.md
│ ├── util.py
│ └── 代码运行.bat
├── chizhu/
│ ├── readme.txt
│ ├── single_model/
│ │ ├── cnn.py
│ │ ├── config.py
│ │ ├── deepnn.py
│ │ ├── get_nn_feat.py
│ │ ├── lgb.py
│ │ ├── user_behavior.py
│ │ ├── xgb.py
│ │ ├── xgb_nb.py
│ │ └── yg_best_nn.py
│ ├── stacking/
│ │ ├── all_feat/
│ │ │ └── xgb__nurbs_nb.ipynb
│ │ └── nurbs_feat/
│ │ ├── xgb_22.py
│ │ └── xgb__nurbs_nb.py
│ └── util/
│ ├── bagging.py
│ └── get_nn_res.py
├── linwangli/
│ ├── code/
│ │ ├── lgb_allfeat_22.py
│ │ ├── lgb_allfeat_condProb.py
│ │ └── utils.py
│ ├── readme.txt
│ ├── yg-1st-lgb.py
│ └── 融合思路.pptx
├── nb_cz_lwl_wcm/
│ ├── 10_lgb.py
│ ├── 11_cnn.py
│ ├── 12_get_feature_lwl.py
│ ├── 13_last_get_all_feature.py
│ ├── 1_get_age_reg.py
│ ├── 2_get_feature_brand.py
│ ├── 3_get_feature_device_package.py
│ ├── 4_get_feature_device_start_close_tfidf_1_2.py
│ ├── 5_get_feature_device_start_close_tfidf.py
│ ├── 6_get_feature_device_start_close.py
│ ├── 7_get_feature_w2v.py
│ ├── 8_get_feature_lwl.py
│ ├── 9_yg_best_nn.py
│ └── 运行说明.txt
└── wangcanming/
└── deepnet_v33.py
================================================
FILE CONTENTS
================================================
================================================
FILE: README.md
================================================
# yiguan_sex_age_predict_1st_solution
First-place solution for the Yiguan sex & age prediction competition
##### [Competition link](https://www.tinymind.cn/competitions/43)
--------
The team members worked independently and merged their results afterwards, so some feature files overlap across the folders. The main approach is stacking different models: because the engineered features are very high-dimensional, stacking the outputs of different models reduces dimensionality without losing too much information.
Run the code in the following order:
* 1. Generate the feature files
> Following the run instructions in the nb_cz_lwl_wcm folder, run all files under nb_cz_lwl_wcm to produce the feature file feature_one.csv
> Following the run instructions in the thluo folder, run the code under thluo to generate thluo_train_best_feat.csv
* 2. Weighted model ensembling
Note: the model results are in the linwangli folder
> Running all the code under the thluo folder produces thluo_prob
> The models under linwangli/code, together with the feature files above, produce the corresponding probability files; see the fusion-strategy PPT in the linwangli folder for the weighting scheme of these probability files
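
The weighted-averaging step can be sketched as below. This is a minimal illustration of blending per-model class-probability matrices; the toy matrices, the `weighted_blend` helper, and the weights are placeholders, not the team's actual files or values:

```python
import numpy as np

def weighted_blend(prob_list, weights):
    """Weighted average of per-model class-probability matrices
    (rows = devices, columns = classes), renormalized so each row sums to 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    blended = sum(w * p for w, p in zip(weights, prob_list))
    return blended / blended.sum(axis=1, keepdims=True)

# toy 3-class matrices standing in for the real 22-class probability files
p_model_a = np.array([[0.7, 0.2, 0.1],
                      [0.3, 0.4, 0.3]])
p_model_b = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3]])
blended = weighted_blend([p_model_a, p_model_b], weights=[0.6, 0.4])
```

In the actual pipeline the weights per probability file are the ones documented in the fusion-strategy PPT.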
<br>
<br>
CONTRIBUTORS:[THLUO](https://github.com/THLUO) [WangliLin](https://github.com/WangliLin) [Puck Wang](https://github.com/PuckWong) [chizhu](https://github.com/chizhu) [NURBS](https://github.com/suncostanx)
================================================
FILE: THLUO/1.w2c_model_start.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
# In[2]:
path='input/'
data=pd.DataFrame()
print ('1.w2c_model_start.py')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
# encode as integer ids
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# encode as integer ids
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
# In[4]:
df_sorted = deviceid_package_start_close.sort_values(by='start_time')
# In[20]:
df_results = df_sorted.groupby('device_id')['app_id'].apply(lambda x:' '.join(x)).reset_index().rename(columns = {'app_id' : 'app_list'})
df_results.to_csv('01.device_click_app_sorted_by_start.csv', index=None)
del df_results
# In[5]:
df_device_start_app_list = df_sorted.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
# In[7]:
app_list = list(df_device_start_app_list.app_list.values)
# In[9]:
model = Word2Vec(app_list, size=10, window=10, min_count=2, workers=4)
model.save("word2vec.model")
# In[10]:
vocab = list(model.wv.vocab.keys())
w2c_arr = []
for v in vocab:
    w2c_arr.append(list(model.wv[v]))
# In[11]:
df_w2c_start = pd.DataFrame()
df_w2c_start['app_id'] = vocab
df_w2c_start = pd.concat([df_w2c_start, pd.DataFrame(w2c_arr)], axis=1)
df_w2c_start.columns = ['app_id'] + ['w2c_start_app_' + str(i) for i in range(10)]
# In[13]:
w2c_nums = 10
agg = {}
for l in ['w2c_start_app_' + str(i) for i in range(w2c_nums)]:
    agg[l] = ['mean', 'std', 'max', 'min']
# In[14]:
deviceid_package_start_close = deviceid_package_start_close.merge(df_w2c_start, on='app_id', how='left')
# In[15]:
df_agg = deviceid_package_start_close.groupby('device_id').agg(agg)
df_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
df_agg.to_csv('device_start_app_w2c.csv', index=None)
# In[16]:
df_results = deviceid_package_start_close.groupby(['device_id', 'app_id'])['start_time'].mean().reset_index()
df_results = df_results.merge(df_w2c_start, on='app_id', how='left')
# In[18]:
df_agg = df_results.groupby('device_id').agg(agg)
df_agg.columns = pd.Index(['device_app_unique_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
# In[24]:
df_agg.to_csv('device_app_unique_start_app_w2c.csv', index=None)
print ('success.....')
================================================
FILE: THLUO/10.age_bin_prob_oof.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
# In[2]:
print ('10.age_bin_prob_oof.py')
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
# encode as integer ids
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# encode as integer ids
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
# In[4]:
import time
# Convert a millisecond epoch timestamp to a formatted local-time string
def timeStamp(timeNum):
    timeStamp = float(timeNum/1000)
    timeArray = time.localtime(timeStamp)
    otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
    return otherStyleTime
# parse out concrete datetime fields
deviceid_package_start_close['start_date'] = pd.to_datetime(deviceid_package_start_close.start_time.apply(timeStamp))
deviceid_package_start_close['end_date'] = pd.to_datetime(deviceid_package_start_close.close_time.apply(timeStamp))
deviceid_package_start_close['start_hour'] = deviceid_package_start_close.start_date.dt.hour
deviceid_package_start_close['end_hour'] = deviceid_package_start_close.end_date.dt.hour
deviceid_package_start_close['time_gap'] = (deviceid_package_start_close['end_date'] - deviceid_package_start_close['start_date']).astype('timedelta64[s]')
deviceid_package_start_close = deviceid_package_start_close.merge(package_label, on='app_id', how='left')
deviceid_package_start_close.app_parent_type.fillna(-1, inplace=True)
deviceid_package_start_close.app_child_type.fillna(-1, inplace=True)
deviceid_package_start_close['start_year'] = deviceid_package_start_close.start_date.dt.year
deviceid_package_start_close['end_year'] = deviceid_package_start_close.end_date.dt.year
deviceid_package_start_close['year_gap'] = deviceid_package_start_close['end_year'] - deviceid_package_start_close['start_year']
# In[5]:
deviceid_train=pd.concat([deviceid_train,deviceid_test])
# In[6]:
deviceid_packages['apps']=deviceid_packages['apps'].apply(lambda x:x.split(','))
deviceid_packages['app_lenghth']=deviceid_packages['apps'].apply(lambda x:len(x))
apps=deviceid_packages['apps'].apply(lambda x:' '.join(x)).tolist()
vectorizer=CountVectorizer()
transformer=TfidfTransformer()
cntTf = vectorizer.fit_transform(apps)
tfidf=transformer.fit_transform(cntTf)
word=vectorizer.get_feature_names()
weight=tfidf.toarray()
df_weight=pd.DataFrame(weight)
feature=df_weight.columns
df_weight['sum']=0
for f in tqdm(feature):
    df_weight['sum'] += df_weight[f]
deviceid_packages['tfidf_sum']=df_weight['sum']
# In[10]:
lda = LatentDirichletAllocation(n_components=5,
                                learning_offset=50.,
                                random_state=666)
docres = lda.fit_transform(cntTf)
# In[11]:
deviceid_packages = pd.concat([deviceid_packages,pd.DataFrame(docres)],axis=1)
# In[12]:
temp=deviceid_packages.drop('apps',axis=1)
deviceid_train=pd.merge(deviceid_train,temp,on='device_id',how='left')
# In[13]:
# expand deviceid_packages into all (device_id, app_id) pairs
device_id_arr = []
app_arr = []
df_device_app_pair = pd.DataFrame()
for row in deviceid_packages.values:
    device_id = row[0]
    app_list = row[1]
    for app in app_list:
        device_id_arr.append(device_id)
        app_arr.append(app)
# build the pair dataframe
df_device_app_pair['device_id'] = device_id_arr
df_device_app_pair['app_id'] = app_arr
df_device_app_pair = df_device_app_pair.merge(package_label, how='left', on='app_id')
# feature engineering
def open_app_timegap_in_hour():
    df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['time_gap'].mean().reset_index().rename(columns={'time_gap': 'mean_time_gap'})
    df_mean_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='mean_time_gap').reset_index()
    df_mean_temp.columns = ['device_id'] + ['open_app_timegap_in_' + str(i) + '_mean_hour' for i in range(0, 24)]
    df_mean_temp.fillna(0, inplace=True)
    return df_mean_temp
# In[8]:
def device_start_end_app_timegap():
    # gaps between a device's consecutive app-start times
    df_ = deviceid_package_start_close.sort_values(by=['device_id', 'start_date'], ascending=False)
    df_['prev_start_date'] = df_.groupby('device_id')['start_date'].shift(-1)
    df_['start_date_gap'] = (df_['start_date'] - df_['prev_start_date']).astype('timedelta64[s]')
    agg_dic = {'start_date_gap': ['min', 'max', 'mean', 'median', 'std']}
    df_start_gap_agg = df_.groupby('device_id').agg(agg_dic)
    df_start_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_start_gap_agg.columns.tolist()])
    df_start_gap_agg = df_start_gap_agg.reset_index()
    #del df_
    gc.collect()
    # gaps between consecutive app-close times
    df_ = deviceid_package_start_close.sort_values(by=['device_id', 'end_date'], ascending=False)
    df_['prev_end_date'] = df_.groupby('device_id')['end_date'].shift(-1)
    df_['end_date_gap'] = (df_['end_date'] - df_['prev_end_date']).astype('timedelta64[s]')
    agg_dic = {'end_date_gap': ['min', 'max', 'mean', 'median', 'std']}
    df_end_gap_agg = df_.groupby('device_id').agg(agg_dic)
    df_end_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_end_gap_agg.columns.tolist()])
    df_end_gap_agg = df_end_gap_agg.reset_index()
    #del df_
    gc.collect()
    df_agg = df_start_gap_agg.merge(df_end_gap_agg, on='device_id', how='left')
    #df_agg = df_agg.merge(df_app_start_gap_agg, on='device_id', how='left')
    #df_agg = df_agg.merge(df_app_end_gap_agg, on='device_id', how='left')
    return df_agg
def open_app_counts_in_hour():
    df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['app_id'].count().reset_index().rename(columns={'app_id': 'app_counts'})
    df_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='app_counts').reset_index()
    df_temp.columns = ['device_id'] + ['open_app_counts_in' + str(i) + '_hour' for i in range(0, 24)]
    df_temp.fillna(0, inplace=True)
    return df_temp
def close_app_counts_in_hour():
    df_temp = deviceid_package_start_close.groupby(['device_id', 'end_hour'])['app_id'].count().reset_index().rename(columns={'app_id': 'app_counts'})
    df_temp = pd.pivot_table(df_temp, index='device_id', columns='end_hour', values='app_counts').reset_index()
    df_temp.columns = ['device_id'] + ['close_app_counts_in' + str(i) + '_hour' for i in range(0, 24)]
    df_temp.fillna(0, inplace=True)
    return df_temp
def app_type_mean_time_gap_one_hot():
    df_temp = deviceid_package_start_close.groupby(['device_id', 'app_parent_type'])['time_gap'].mean().reset_index()
    df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='time_gap').reset_index()
    df_temp.columns = ['device_id'] + ['app_parent_type_mean_time_gap' + str(i) for i in range(-1, 45)]
    df_temp.fillna(-1, inplace=True)
    return df_temp
def device_active_hour():
    aggregations = {
        'start_hour': ['std', 'mean', 'max', 'min'],
        'end_hour': ['std', 'mean', 'max', 'min']
    }
    df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
    df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
    df_agg = df_agg.reset_index()
    return df_agg
def device_brand_encoding():
    df_temp = deviceid_brand.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
    aggregations = {
        'age': ['std', 'mean'],
        'sex': ['mean'],
    }
    df_device_brand = df_temp.groupby('device_brand').agg(aggregations)
    df_device_brand.columns = pd.Index(['device_brand_' + e[0] + "_" + e[1].upper() for e in df_device_brand.columns.tolist()])
    df_device_brand = df_device_brand.reset_index()
    df_device_type = df_temp.groupby('device_type').agg(aggregations)
    df_device_type.columns = pd.Index(['device_type_' + e[0] + "_" + e[1].upper() for e in df_device_type.columns.tolist()])
    df_device_type = df_device_type.reset_index()
    df_temp = df_temp.merge(df_device_brand, on='device_brand', how='left')
    df_temp = df_temp.merge(df_device_type, on='device_type', how='left')
    aggregations = {
        'device_brand_age_STD': ['mean'],
        'device_brand_age_MEAN': ['mean'],
        'device_brand_sex_MEAN': ['mean'],
        #'device_type_age_STD': ['mean'],
        #'device_type_age_MEAN': ['mean'],
        #'device_type_sex_MEAN': ['mean']
    }
    df_agg = df_temp.groupby('device_id').agg(aggregations)
    df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
    df_agg = df_agg.reset_index()
    return df_agg
# statistics of each device's app usage
def device_active_time_time_stat():
    # duration of each app session on the device
    deviceid_package_start_close['active_time'] = deviceid_package_start_close['close_time'] - deviceid_package_start_close['start_time']
    # how many times the device opened apps,
    # and how many distinct apps it opened
    aggregations = {
        'app_id': ['count', 'nunique'],
        'active_time': ['mean', 'std', 'max', 'min'],
    }
    df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
    df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
    df_agg = df_agg.reset_index()
    aggregations = {
        'active_time': ['mean', 'std', 'max', 'min', 'count'],
    }
    df_da_agg = deviceid_package_start_close.groupby(['device_id', 'app_id']).agg(aggregations)
    df_da_agg.columns = pd.Index(['device_app_grouped_' + e[0] + "_" + e[1].upper() for e in df_da_agg.columns.tolist()])
    df_da_agg = df_da_agg.reset_index()
    # per-device stats over the per-app session durations
    aggregations = {
        'device_app_grouped_active_time_MEAN': ['mean', 'std', 'max', 'min'],
        'device_app_grouped_active_time_STD': ['mean', 'std', 'max', 'min'],
        'device_app_grouped_active_time_MAX': ['mean', 'std', 'max', 'min'],
        'device_app_grouped_active_time_MIN': ['mean', 'std', 'max', 'min'],
        'device_app_grouped_active_time_COUNT': ['mean', 'std', 'max', 'min'],
    }
    df_temp = df_da_agg.groupby(['device_id']).agg(aggregations)
    df_temp.columns = pd.Index([e[0] + "_" + e[1].upper() for e in df_temp.columns.tolist()])
    df_temp = df_temp.reset_index()
    df_agg = df_agg.merge(df_temp, on='device_id', how='left')
    return df_agg
def app_type_encoding():
    df_temp = df_device_app_pair.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
    aggregations = {
        'age': ['std', 'mean'],
        'sex': ['mean'],
    }
    df_agg_app_parent_type = df_temp.groupby('app_parent_type').agg(aggregations)
    df_agg_app_parent_type.columns = pd.Index(['app_parent_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_parent_type.columns.tolist()])
    df_agg_app_parent_type = df_agg_app_parent_type.reset_index()
    df_agg_app_child_type = df_temp.groupby('app_child_type').agg(aggregations)
    df_agg_app_child_type.columns = pd.Index(['app_child_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_child_type.columns.tolist()])
    df_agg_app_child_type = df_agg_app_child_type.reset_index()
    df_temp = df_temp.merge(df_agg_app_parent_type, on='app_parent_type', how='left')
    df_temp = df_temp.merge(df_agg_app_child_type, on='app_child_type', how='left')
    aggregations = {
        'app_parent_type_age_STD': ['mean'],
        'app_parent_type_age_MEAN': ['mean'],
        'app_parent_type_sex_MEAN': ['mean'],
        'app_child_type_age_STD': ['mean'],
        'app_child_type_age_MEAN': ['mean'],
        'app_child_type_sex_MEAN': ['mean']
    }
    df_agg = df_temp.groupby('device_id').agg(aggregations)
    df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
    df_agg = df_agg.reset_index()
    return df_agg
# per-device counts of each app_parent_type
def app_type_onehot_in_device(df):
    df_copy = df.fillna(-1)
    df_temp = df_copy.groupby(['device_id', 'app_parent_type'])['app_id'].size().reset_index()
    df_temp.rename(columns={'app_id': 'app_parent_type_counts'}, inplace=True)
    df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='app_parent_type_counts').reset_index()
    df_temp.columns = ['device_id'] + ['app_parent_type' + str(i) for i in range(-1, 45)]
    df_temp.fillna(0, inplace=True)
    return df_temp
# In[15]:
# build the feature table
df_train = deviceid_train.merge(device_active_time_time_stat(), on='device_id', how='left')
df_train = df_train.merge(deviceid_brand, on='device_id', how='left')
df_train = df_train.merge(app_type_onehot_in_device(df_device_app_pair), on='device_id', how='left')
df_train = df_train.merge(app_type_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_active_hour(), on='device_id', how='left')
df_train = df_train.merge(app_type_mean_time_gap_one_hot(), on='device_id', how='left')
df_train = df_train.merge(open_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(close_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(device_brand_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_start_end_app_timegap(), on='device_id', how='left')
df_train = df_train.merge(open_app_timegap_in_hour(), on='device_id', how='left')
# In[16]:
df_w2c_start = pd.read_csv('device_start_app_w2c.csv')
df_w2c_close = pd.read_csv('device_close_app_w2c.csv')
df_w2c_all = pd.read_csv('device_all_app_w2c.csv')
df_device_quchong_start_app_w2c = pd.read_csv('device_quchong_start_app_w2c.csv')
df_device_app_unique_start_app_w2c = pd.read_csv('device_app_unique_start_app_w2c.csv')
df_device_app_unique_close_app_w2c = pd.read_csv('device_app_unique_close_app_w2c.csv')
df_device_app_unique_all_app_w2c = pd.read_csv('device_app_unique_all_app_w2c.csv')
# In[17]:
df_train_w2v = df_train.merge(df_w2c_start, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_close, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_all, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_quchong_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_close_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_all_app_w2c, on='device_id', how='left')
# In[19]:
df_train_w2v['sex'] = df_train_w2v['sex'].apply(lambda x:str(x))
df_train_w2v['age'] = df_train_w2v['age'].apply(lambda x:str(x))
def tool(x):
    if x == 'nan':
        return x
    else:
        return str(int(float(x)))
df_train_w2v['sex']=df_train_w2v['sex'].apply(tool)
df_train_w2v['age']=df_train_w2v['age'].apply(tool)
df_train_w2v['sex_age']=df_train_w2v['sex']+'-'+df_train_w2v['age']
df_train_w2v = df_train_w2v.replace({'nan':np.NaN,'nan-nan':np.NaN})
# In[31]:
train = df_train_w2v[df_train_w2v['sex_age'].notnull()]
test = df_train_w2v[df_train_w2v['sex_age'].isnull()]
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
# In[32]:
Y = train['age']
train['label'] = Y
# In[35]:
from sklearn.model_selection import KFold, StratifiedKFold
label_set = train.label.unique()
lgb_round = {'3': 363,
             '5': 273,
             '4': 328,
             '7': 228,
             '6': 361,
             '9': 181,
             '10': 338,
             '2': 312,
             '8': 234,
             '1': 220,
             '0': 200}
for sex_age in label_set:
    print(sex_age)
    X = train.drop(['sex', 'age', 'sex_age', 'label', 'device_id'], axis=1)
    Y = train.label.apply(lambda x: 1 if x == sex_age else 0)
    print(Y.value_counts())
    seed = 2018
    num_folds = 5
    folds = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)
    sub_list = []
    oof_preds = np.zeros(train.shape[0])
    sub_preds = np.zeros(test.shape[0])
    params = {
        'boosting_type': 'gbdt',
        'learning_rate': 0.02,
        #'max_depth': 5,
        'num_leaves': 2 ** 5,
        'metric': {'binary_logloss'},
        #'num_class': 22,
        'objective': 'binary',
        'random_state': 6666,
        'bagging_freq': 5,
        'feature_fraction': 0.7,
        'bagging_fraction': 0.7,
        'min_split_gain': 0.0970905919552776,
        'min_child_weight': 9.42012323936088,
    }
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
        train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
        valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
        lgb_train = lgb.Dataset(train_x, label=train_y)
        lgb_eval = lgb.Dataset(valid_x, valid_y, reference=lgb_train)
        gbm = lgb.train(params, lgb_train, num_boost_round=lgb_round[sex_age], valid_sets=[lgb_train, lgb_eval], verbose_eval=50)
        oof_preds[valid_idx] = gbm.predict(valid_x[X.columns.values])
    train['age_bin_prob_oof_' + str(sex_age)] = oof_preds
    # refit on the full training set to predict the test set
    lgb_train = lgb.Dataset(X, label=Y)
    gbm = lgb.train(params, lgb_train, num_boost_round=lgb_round[sex_age], valid_sets=lgb_train, verbose_eval=50)
    test['age_bin_prob_oof_' + str(sex_age)] = gbm.predict(test[X.columns.values])
# In[36]:
columns = ['device_id'] + ['age_bin_prob_oof_' + str(i) for i in range(11)]
# In[38]:
pd.concat([train[columns], test[columns]]).to_csv('age_bin_prob_oof.csv', index=None)
================================================
FILE: THLUO/11.hcc_device_brand_age_sex.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
from sklearn.model_selection import StratifiedKFold
# In[2]:
print ('11.hcc_device_brand_age_sex.py')
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
#deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
# encode as integer ids
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# encode as integer ids
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
# In[4]:
df_train = deviceid_train.merge(deviceid_brand, how='left', on='device_id')
df_train.fillna(-1, inplace=True)
df_test = deviceid_test.merge(deviceid_brand, how='left', on='device_id')
df_test.fillna(-1, inplace=True)
# In[5]:
df_train['sex'] = df_train.sex.apply(lambda x : 1 if x == 1 else 0)
df_train = df_train.join(pd.get_dummies(df_train["age"], prefix="age").astype(int))
df_train['sex_age'] = df_train['sex'].map(str) + '_' + df_train['age'].map(str)
Y = df_train['sex_age']
Y_CAT = pd.Categorical(Y)
df_train['sex_age'] = pd.Series(Y_CAT.codes)
df_train = df_train.join(pd.get_dummies(df_train["sex_age"], prefix="sex_age").astype(int))
# In[6]:
sex_age_columns = ['sex_age_' + str(i) for i in range(22)]
sex_age_prior_set = df_train[sex_age_columns].mean().values
age_columns = ['age_' + str(i) for i in range(11)]
age_prior_set = df_train[age_columns].mean().values
sex_prior_prob= df_train.sex.mean()
sex_prior_prob
# In[7]:
def hcc_encode(train_df, test_df, variable, target, prior_prob, k=5, f=1, g=1, update_df=None):
    """
    See "A Preprocessing Scheme for High-Cardinality Categorical Attributes in
    Classification and Prediction Problems" by Daniele Micci-Barreca
    """
    hcc_name = "_".join(["hcc", variable, target])
    grouped = train_df.groupby(variable)[target].agg(["size", "mean"])
    grouped["lambda"] = 1 / (g + np.exp((k - grouped["size"]) / f))
    grouped[hcc_name] = grouped["lambda"] * grouped["mean"] + (1 - grouped["lambda"]) * prior_prob
    df = test_df[[variable]].join(grouped, on=variable, how="left")[hcc_name].fillna(prior_prob)
    if update_df is None: update_df = test_df
    if hcc_name not in update_df.columns: update_df[hcc_name] = np.nan
    update_df.update(df)
    return
# In[8]:
# encode the age one-hot targets
# fit the test set
# High-Cardinality Categorical encoding
skf = StratifiedKFold(5)
nums = 11
for variable in ['device_brand', 'device_type']:
    for i in range(nums):
        target = age_columns[i]
        age_prior_prob = age_prior_set[i]
        print(variable, target, age_prior_prob)
        hcc_encode(df_train, df_test, variable, target, age_prior_prob, k=5, f=1, g=1, update_df=None)
        # out-of-fold encoding for the training set
        for train, test in skf.split(np.zeros(len(df_train)), df_train['age']):
            hcc_encode(df_train.iloc[train], df_train.iloc[test], variable, target, age_prior_prob, k=5, update_df=df_train)
# In[9]:
#拟合性别
#拟合测试集
# High-Cardinality Categorical encoding
skf = StratifiedKFold(5)
for variable in ['device_brand', 'device_type'] :
target = 'sex'
print (variable, target, sex_prior_prob)
hcc_encode(df_train, df_test, variable, target, sex_prior_prob, k=5, f=1, g=1, update_df=None)
    # Fit the validation folds (out-of-fold)
for train, test in skf.split(np.zeros(len(df_train)), df_train['age']):
hcc_encode(df_train.iloc[train], df_train.iloc[test], variable, target, sex_prior_prob, k=5, f=1, g=1, update_df=df_train)
# In[10]:
# Encode the joint sex_age target
# Fit on the test set
# High-Cardinality Categorical encoding
skf = StratifiedKFold(5)
nums = 22
for variable in ['device_brand', 'device_type'] :
for i in range(nums) :
target = sex_age_columns[i]
sex_age_prior_prob = sex_age_prior_set[i]
print (variable, target, sex_age_prior_prob)
hcc_encode(df_train, df_test, variable, target, sex_age_prior_prob, k=5, f=1, g=1, update_df=None)
        # Fit the validation folds (out-of-fold)
for train, test in skf.split(np.zeros(len(df_train)), df_train['sex_age']):
hcc_encode(df_train.iloc[train], df_train.iloc[test], variable, target, sex_age_prior_prob, k=5, update_df=df_train)
# In[14]:
hcc_columns = ['device_id'] + ['hcc_device_brand_age_' + str(i) for i in range(11)] + ['hcc_device_brand_sex'] + ['hcc_device_type_age_' + str(i) for i in range(11)] + ['hcc_device_type_sex'] + ['hcc_device_type_sex_age_' + str(i) for i in range(22)]
df_total = pd.concat([df_train[hcc_columns], df_test[hcc_columns]])
# In[15]:
df_total.to_csv('hcc_device_brand_age_sex.csv', index=None)
================================================
FILE: THLUO/12.device_age_regression_prob_oof.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
from feat_util import *
# In[2]:
print ('12.device_age_regression_prob_oof.py')
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
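The three collapsing passes above (counts of 1, 2, 3 mapped to `other`, `other_2`, `other_3`) can be written as a single frequency lookup. A sketch on made-up brand values — not byte-identical to the sequential version when an `other*` bucket itself ends up with a rare count, but the same idea:

```python
import pandas as pd

brands = pd.Series(["huawei"] * 4 + ["oppo"] * 3 + ["niche2"] * 2 + ["niche1"])
counts = brands.value_counts()
rare_map = {1: "other", 2: "other_2", 3: "other_3"}
# keep frequent brands; bucket those seen at most 3 times by their frequency
collapsed = brands.map(lambda b: rare_map.get(counts[b], b))
```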
# Label-encode brand and type to integer ids
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# Label-encode the app categories to integer ids
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
# In[4]:
import time
# Convert a millisecond epoch timestamp into a formatted local-time string
def timeStamp(timeNum):
timeStamp = float(timeNum/1000)
timeArray = time.localtime(timeStamp)
otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
return otherStyleTime
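`time.localtime` makes the conversion above timezone-dependent and is applied row by row. An equivalent vectorised, UTC-pinned form with pandas, as a sketch:

```python
import pandas as pd

ms = pd.Series([1520000000000])          # milliseconds since the epoch
dates = pd.to_datetime(ms, unit="ms")    # vectorised; interpreted as UTC
stamp = dates.dt.strftime("%Y-%m-%d %H:%M:%S").iloc[0]
```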
# Parse concrete datetime fields from the raw timestamps
deviceid_package_start_close['start_date'] = pd.to_datetime(deviceid_package_start_close.start_time.apply(timeStamp))
deviceid_package_start_close['end_date'] = pd.to_datetime(deviceid_package_start_close.close_time.apply(timeStamp))
deviceid_package_start_close['start_hour'] = deviceid_package_start_close.start_date.dt.hour
deviceid_package_start_close['end_hour'] = deviceid_package_start_close.end_date.dt.hour
deviceid_package_start_close['time_gap'] = (deviceid_package_start_close['end_date'] - deviceid_package_start_close['start_date']).astype('timedelta64[s]')
deviceid_package_start_close = deviceid_package_start_close.merge(package_label, on='app_id', how='left')
deviceid_package_start_close.app_parent_type.fillna(-1, inplace=True)
deviceid_package_start_close.app_child_type.fillna(-1, inplace=True)
deviceid_package_start_close['start_year'] = deviceid_package_start_close.start_date.dt.year
deviceid_package_start_close['end_year'] = deviceid_package_start_close.end_date.dt.year
deviceid_package_start_close['year_gap'] = deviceid_package_start_close['end_year'] - deviceid_package_start_close['start_year']
# In[5]:
deviceid_train=pd.concat([deviceid_train,deviceid_test])
# In[6]:
deviceid_packages['apps']=deviceid_packages['apps'].apply(lambda x:x.split(','))
deviceid_packages['app_lenghth']=deviceid_packages['apps'].apply(lambda x:len(x))
# Feature engineering
def open_app_timegap_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['time_gap'].mean().reset_index().rename(columns = {'time_gap': 'mean_time_gap'})
df_mean_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='mean_time_gap').reset_index()
df_mean_temp.columns = ['device_id'] + ['open_app_timegap_in_'+str(i) + '_mean_hour' for i in range(0,24)]
df_mean_temp.fillna(0, inplace=True)
return df_mean_temp
# In[8]:
def device_start_end_app_timegap() :
    # time gaps between consecutive app starts per device
df_ = deviceid_package_start_close.sort_values(by=['device_id', 'start_date'], ascending=False)
df_['prev_start_date'] = df_.groupby('device_id')['start_date'].shift(-1)
df_['start_date_gap'] = (df_['start_date'] - df_['prev_start_date']).astype('timedelta64[s]')
agg_dic = {'start_date_gap' : ['min', 'max', 'mean', 'median', 'std']}
df_start_gap_agg = df_.groupby('device_id').agg(agg_dic)
df_start_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_start_gap_agg.columns.tolist()])
df_start_gap_agg = df_start_gap_agg.reset_index()
#del df_
gc.collect()
    # time gaps between consecutive app closes per device
df_ = deviceid_package_start_close.sort_values(by=['device_id', 'end_date'], ascending=False)
df_['prev_end_date'] = df_.groupby('device_id')['end_date'].shift(-1)
df_['end_date_gap'] = (df_['end_date'] - df_['prev_end_date']).astype('timedelta64[s]')
agg_dic = {'end_date_gap' : ['min', 'max', 'mean', 'median', 'std']}
df_end_gap_agg = df_.groupby('device_id').agg(agg_dic)
df_end_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_end_gap_agg.columns.tolist()])
df_end_gap_agg = df_end_gap_agg.reset_index()
#del df_
gc.collect()
df_agg = df_start_gap_agg.merge(df_end_gap_agg, on='device_id', how='left')
#df_agg = df_agg.merge(df_app_start_gap_agg, on='device_id', how='left')
#df_agg = df_agg.merge(df_app_end_gap_agg, on='device_id', how='left')
return df_agg
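The sort-then-`shift` pattern used above yields the gap between consecutive events within each group; a toy sketch with made-up timestamps:

```python
import pandas as pd

df = pd.DataFrame({
    "device_id": ["d1", "d1", "d1", "d2"],
    "start":     pd.to_datetime(["2018-01-01 10:00", "2018-01-01 10:05",
                                 "2018-01-01 10:20", "2018-01-01 09:00"]),
})
df = df.sort_values(["device_id", "start"], ascending=False)
# within each device, shift(-1) pulls up the previous (earlier) start
df["prev_start"] = df.groupby("device_id")["start"].shift(-1)
df["gap_s"] = (df["start"] - df["prev_start"]).dt.total_seconds()
gaps_d1 = df.loc[df.device_id == "d1", "gap_s"].dropna().tolist()
```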
def open_app_counts_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['app_id'].count().reset_index().rename(columns = {'app_id': 'app_counts'})
df_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='app_counts').reset_index()
df_temp.columns = ['device_id'] + ['open_app_counts_in'+str(i) + '_hour' for i in range(0,24)]
df_temp.fillna(0, inplace=True)
return df_temp
def close_app_counts_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'end_hour'])['app_id'].count().reset_index().rename(columns = {'app_id': 'app_counts'})
df_temp = pd.pivot_table(df_temp, index='device_id', columns='end_hour', values='app_counts').reset_index()
df_temp.columns = ['device_id'] + ['close_app_counts_in'+str(i) + '_hour' for i in range(0,24)]
df_temp.fillna(0, inplace=True)
return df_temp
def app_type_mean_time_gap_one_hot () :
df_temp = deviceid_package_start_close.groupby(['device_id', 'app_parent_type'])['time_gap'].mean().reset_index()
df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='time_gap').reset_index()
df_temp.columns = ['device_id'] + ['app_parent_type_mean_time_gap'+str(i) for i in range(-1,45)]
df_temp.fillna(-1, inplace=True)
return df_temp
def device_active_hour() :
aggregations = {
'start_hour' : ['std','mean','max','min'],
'end_hour' : ['std','mean','max','min']
}
df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
def device_brand_encoding() :
df_temp = deviceid_brand.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
aggregations = {
'age' : ['std','mean'],
'sex' : ['mean'],
}
df_device_brand = df_temp.groupby('device_brand').agg(aggregations)
df_device_brand.columns = pd.Index(['device_brand_' + e[0] + "_" + e[1].upper() for e in df_device_brand.columns.tolist()])
df_device_brand = df_device_brand.reset_index()
df_device_type = df_temp.groupby('device_type').agg(aggregations)
df_device_type.columns = pd.Index(['device_type_' + e[0] + "_" + e[1].upper() for e in df_device_type.columns.tolist()])
df_device_type = df_device_type.reset_index()
df_temp = df_temp.merge(df_device_brand, on='device_brand', how='left')
df_temp = df_temp.merge(df_device_type, on='device_type', how='left')
aggregations = {
'device_brand_age_STD' : ['mean'],
'device_brand_age_MEAN' : ['mean'],
'device_brand_sex_MEAN' : ['mean'],
#'device_type_age_STD' : ['mean'],
#'device_type_age_MEAN' : ['mean'],
#'device_type_sex_MEAN' : ['mean']
}
df_agg = df_temp.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
# Per-device statistics of app usage
def device_active_time_time_stat() :
    # statistics of per-session active time
deviceid_package_start_close['active_time'] = deviceid_package_start_close['close_time'] - deviceid_package_start_close['start_time']
    # how many times the device opened an app
    # how many distinct apps the device opened
aggregations = {
'app_id' : ['count', 'nunique'],
'active_time' : ['mean', 'std', 'max', 'min'],
}
df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
aggregations = {
'active_time' : ['mean', 'std', 'max', 'min', 'count'],
}
df_da_agg = deviceid_package_start_close.groupby(['device_id', 'app_id']).agg(aggregations)
df_da_agg.columns = pd.Index(['device_app_grouped_' + e[0] + "_" + e[1].upper() for e in df_da_agg.columns.tolist()])
df_da_agg = df_da_agg.reset_index()
    # per-app active-time statistics, aggregated back to device level
aggregations = {
'device_app_grouped_active_time_MEAN' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_STD' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_MAX' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_MIN' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_COUNT' : ['mean', 'std', 'max', 'min'],
}
df_temp = df_da_agg.groupby(['device_id']).agg(aggregations)
df_temp.columns = pd.Index([e[0] + "_" + e[1].upper() for e in df_temp.columns.tolist()])
df_temp = df_temp.reset_index()
df_agg = df_agg.merge(df_temp, on='device_id', how='left')
return df_agg
def app_type_encoding() :
df_temp = df_device_app_pair.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
aggregations = {
'age' : ['std','mean'],
'sex' : ['mean'],
}
df_agg_app_parent_type = df_temp.groupby('app_parent_type').agg(aggregations)
df_agg_app_parent_type.columns = pd.Index(['app_parent_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_parent_type.columns.tolist()])
df_agg_app_parent_type = df_agg_app_parent_type.reset_index()
df_agg_app_child_type = df_temp.groupby('app_child_type').agg(aggregations)
df_agg_app_child_type.columns = pd.Index(['app_child_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_child_type.columns.tolist()])
df_agg_app_child_type = df_agg_app_child_type.reset_index()
df_temp = df_temp.merge(df_agg_app_parent_type, on='app_parent_type', how='left')
df_temp = df_temp.merge(df_agg_app_child_type, on='app_child_type', how='left')
aggregations = {
'app_parent_type_age_STD' : ['mean'],
'app_parent_type_age_MEAN' : ['mean'],
'app_parent_type_sex_MEAN' : ['mean'],
'app_child_type_age_STD' : ['mean'],
'app_child_type_age_MEAN' : ['mean'],
'app_child_type_sex_MEAN' : ['mean']
}
df_agg = df_temp.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
# count of each app_parent_type per device
def app_type_onehot_in_device(df) :
df_copy = df.fillna(-1)
df_temp = df_copy.groupby(['device_id', 'app_parent_type'])['app_id'].size().reset_index()
df_temp.rename(columns = {'app_id' : 'app_parent_type_counts'}, inplace=True)
df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='app_parent_type_counts').reset_index()
df_temp.columns = ['device_id'] + ['app_parent_type'+str(i) for i in range(-1,45)]
df_temp.fillna(0, inplace=True)
return df_temp
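The count-then-pivot idiom used by `app_type_onehot_in_device`, shown on toy data (made-up ids and types):

```python
import pandas as pd

pairs = pd.DataFrame({
    "device_id": ["d1", "d1", "d1", "d2"],
    "app_parent_type": [0, 0, 1, 1],
    "app_id": ["a", "b", "c", "d"],
})
# count pairs, then pivot the type to columns; missing cells mean "no such app type"
counts = (pairs.groupby(["device_id", "app_parent_type"])["app_id"]
               .size().reset_index(name="n"))
onehot = pd.pivot_table(counts, index="device_id",
                        columns="app_parent_type", values="n").fillna(0)
```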
apps=deviceid_packages['apps'].apply(lambda x:' '.join(x)).tolist()
vectorizer=CountVectorizer()
transformer=TfidfTransformer()
cntTf = vectorizer.fit_transform(apps)
tfidf=transformer.fit_transform(cntTf)
word=vectorizer.get_feature_names()
weight=tfidf.toarray()
# total tf-idf mass per device: row sum over the dense tf-idf matrix
deviceid_packages['tfidf_sum'] = weight.sum(axis=1)
# In[10]:
lda = LatentDirichletAllocation(n_components=5,  # renamed from 'n_topics' in scikit-learn >= 0.19
                                learning_offset=50.,
                                random_state=666)
docres = lda.fit_transform(cntTf)
# In[11]:
deviceid_packages = pd.concat([deviceid_packages,pd.DataFrame(docres)],axis=1)
# In[12]:
temp=deviceid_packages.drop('apps',axis=1)
deviceid_train=pd.merge(deviceid_train,temp,on='device_id',how='left')
# In[13]:
# Expand deviceid_packages into (device_id, app_id) pairs
device_id_arr = []
app_arr = []
df_device_app_pair = pd.DataFrame()
for row in deviceid_packages.values :
device_id = row[0]
app_list = row[1]
for app in app_list :
device_id_arr.append(device_id)
app_arr.append(app)
# build the pair DataFrame
df_device_app_pair['device_id'] = device_id_arr
df_device_app_pair['app_id'] = app_arr
df_device_app_pair = df_device_app_pair.merge(package_label, how='left', on='app_id')
# In[15]:
# Assemble the feature matrix
df_train = deviceid_train.merge(device_active_time_time_stat(), on='device_id', how='left')
df_train = df_train.merge(deviceid_brand, on='device_id', how='left')
df_train = df_train.merge(app_type_onehot_in_device(df_device_app_pair), on='device_id', how='left')
df_train = df_train.merge(app_type_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_active_hour(), on='device_id', how='left')
df_train = df_train.merge(app_type_mean_time_gap_one_hot(), on='device_id', how='left')
df_train = df_train.merge(open_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(close_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(device_brand_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_start_end_app_timegap(), on='device_id', how='left')
df_train = df_train.merge(open_app_timegap_in_hour(), on='device_id', how='left')
# In[16]:
df_w2c_start = pd.read_csv('device_start_app_w2c.csv')
df_w2c_close = pd.read_csv('device_close_app_w2c.csv')
df_w2c_all = pd.read_csv('device_all_app_w2c.csv')
df_device_quchong_start_app_w2c = pd.read_csv('device_quchong_start_app_w2c.csv')
df_device_app_unique_start_app_w2c = pd.read_csv('device_app_unique_start_app_w2c.csv')
df_device_app_unique_close_app_w2c = pd.read_csv('device_app_unique_close_app_w2c.csv')
df_device_app_unique_all_app_w2c = pd.read_csv('device_app_unique_all_app_w2c.csv')
df_hcc_device_brand_age_sex = pd.read_csv('hcc_device_brand_age_sex.csv')
df_train_w2v = df_train.merge(df_w2c_start, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_close, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_all, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_quchong_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_close_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_all_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_hcc_device_brand_age_sex, on='device_id', how='left')
# In[22]:
train = df_train_w2v[df_train_w2v['age'].notnull()]
test = df_train_w2v[df_train_w2v['age'].isnull()]
# In[23]:
X = train.drop(['sex', 'age', 'device_id'],axis=1)
Y = train['age']
# In[24]:
from sklearn.model_selection import KFold, StratifiedKFold
seed = 2018
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
sub_list = []
oof_preds = np.zeros(train.shape[0])
cate_feat = ['device_type','device_brand']
params = {
'boosting_type': 'gbdt',
'learning_rate' : 0.02,
'num_leaves' : 2 ** 5,
'objective' : 'regression',
'metric' : 'rmse',
'random_state' : 6666,
'bagging_freq' : 5,
'feature_fraction' : 0.7,
'bagging_fraction' : 0.7,
'min_split_gain' : 0.0970905919552776,
'min_child_weight' : 9.42012323936088,
}
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
lgb_train=lgb.Dataset(train_x,label=train_y)
lgb_eval = lgb.Dataset(valid_x, valid_y, reference=lgb_train)
gbm = lgb.train(params, lgb_train, num_boost_round=800, valid_sets=[lgb_train, lgb_eval], verbose_eval=50)
oof_preds[valid_idx] = gbm.predict(valid_x[X.columns.values])
train['age_regression_prob_oof'] = oof_preds
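The loop above follows the usual out-of-fold scheme: each training row is predicted only by a model that never saw it. Sketched here with a trivial fold-mean "model" standing in for LightGBM:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
y = rng.rand(20)
oof = np.zeros_like(y)
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(y):
    oof[va_idx] = y[tr_idx].mean()   # "model" fitted on the training fold only
```

Each validation row receives a prediction derived exclusively from the other four folds, so `oof` can be used as a leak-free stacking feature.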
# In[26]:
# retrain on the full training set to predict the test set
lgb_train = lgb.Dataset(X,label=Y)
gbm = lgb.train(params, lgb_train, num_boost_round=800, valid_sets=lgb_train, verbose_eval=50)
test = test.reset_index(drop=True)
test_preds = gbm.predict(test[X.columns.values])
# In[27]:
test['age_regression_prob_oof'] = test_preds
# In[30]:
df_age_prob_oof = pd.concat([train[['device_id', 'age_regression_prob_oof']],
test[['device_id', 'age_regression_prob_oof']]])
df_age_prob_oof.to_csv('device_age_regression_prob_oof.csv', index=None)
================================================
FILE: THLUO/13.device_start_GRU_pred.py
================================================
# coding: utf-8
# In[1]:
import feather
import os
import re
import sys
import gc
import random
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from scipy import stats
import tensorflow as tf
import keras
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.callbacks import *
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from keras.engine.topology import Layer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.utils.training_utils import multi_gpu_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import warnings
from TextModel import *
warnings.filterwarnings('ignore')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
# In[2]:
print ('13.device_start_GRU_pred.py')
df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
df_total = pd.concat([deviceid_train, deviceid_test])
df_doc = df_doc.merge(df_total, on='device_id', how='left')
df_wv2_all = pd.read_csv('w2c_all_emb.csv')
dic_w2c_all = {}
for row in df_wv2_all.values :
app_id = row[0]
vector = row[1:]
dic_w2c_all[app_id] = vector
# In[3]:
df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
def tool(x):
    # normalise label strings: '1.0' -> '1'; leave 'nan' untouched
    if x == 'nan':
        return x
    else:
        return str(int(float(x)))
df_doc['sex']=df_doc['sex'].apply(tool)
df_doc['age']=df_doc['age'].apply(tool)
df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
train = df_doc[df_doc['sex_age'].notnull()]
test = df_doc[df_doc['sex_age'].isnull()]
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
lb = LabelEncoder()
train_label = lb.fit_transform(train['sex_age'].values)
train['class'] = train_label
# In[5]:
column_name="app_list"
word_seq_len = 900
victor_size = 200
num_words = 35000
batch_size = 64
classification = 22
kfold=10
# In[6]:
from sklearn.metrics import log_loss
def get_mut_label(y_label):
    # recover integer class labels from one-hot rows (row-wise argmax)
    return [ele.argmax() for ele in y_label]
class RocAucEvaluation(Callback):  # despite the name, this callback reports multiclass log loss
def __init__(self, validation_data=(), interval=1):
super(Callback, self).__init__()
self.interval = interval
self.X_val, self.y_val = validation_data
def on_epoch_end(self, epoch, logs={}):
if epoch % self.interval == 0:
y_pred = self.model.predict(self.X_val, verbose=0)
val_y = get_mut_label(self.y_val)
score = log_loss(val_y, y_pred)
print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
# In[7]:
# word vectors
def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
word_index = tokenizer.word_index
count = 0
nb_words = len(word_index)
print(nb_words)
all_data=pd.concat([df_train[col],df_test[col]])
file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
if not os.path.exists(file_name):
model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
model.save(file_name)
else:
model = Word2Vec.load(file_name)
print("add word2vec finished....")
embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
for word, i in word_index.items():
        embedding_vector = model.wv[word] if word in model.wv else None  # access via .wv (gensim >= 1.0)
if embedding_vector is not None:
count += 1
embedding_word2vec_matrix[i] = embedding_vector
else:
unk_vec = np.random.random(victor_size) * 0.5
unk_vec = unk_vec - unk_vec.mean()
embedding_word2vec_matrix[i] = unk_vec
embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
for word, i in word_index.items():
embedding_vector = dic_w2c_all[word]
embedding_w2c_all[i] = embedding_vector
#embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
embedding_matrix = embedding_word2vec_matrix
return train_, test_, word_index, embedding_matrix
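The embedding-matrix construction inside `w2v_pad` — known words get their trained vector, unknown words a small zero-mean random vector, row 0 reserved for padding — shown in isolation with a stand-in for the Word2Vec model:

```python
import numpy as np

victor_size = 4
word_index = {"app_a": 1, "app_b": 2}        # keras-style: indices start at 1
trained = {"app_a": np.ones(victor_size)}    # stand-in for model.wv lookups

np.random.seed(0)
emb = np.zeros((len(word_index) + 1, victor_size))
for word, i in word_index.items():
    vec = trained.get(word)
    if vec is None:                          # OOV: small random vector, centred at zero
        vec = np.random.random(victor_size) * 0.5
        vec = vec - vec.mean()
    emb[i] = vec
```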
# In[8]:
train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
# In[11]:
my_opt="bi_gru_model"
# parameters
Y = train['class'].values
if not os.path.exists("cache/"+my_opt):
    os.makedirs("cache/"+my_opt)  # makedirs also creates the missing parent 'cache/' directory
# In[12]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2006
num_folds = 10
kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
# In[13]:
epochs = 4
my_opt=eval(my_opt)  # resolve the model-builder function named above (defined in TextModel)
train_model_pred = np.zeros((train_.shape[0], classification))
test_model_pred = np.zeros((test_.shape[0], classification))
for i, (train_fold, val_fold) in enumerate(kf):
X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
y_train, y_valid = Y[train_fold], Y[val_fold]
y_tra = to_categorical(y_train)
y_val = to_categorical(y_valid)
    # build the model
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
callbacks=[RocAuc])
train_model_pred[val_fold, :] = model.predict(X_valid)
# In[26]:
# retrain on all labelled data to predict the test set
train_label = to_categorical(Y)
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
callbacks=[RocAuc])
test_model_pred = model.predict(test_)
# In[27]:
df_train_pred = pd.DataFrame(train_model_pred)
df_test_pred = pd.DataFrame(test_model_pred)
df_train_pred.columns = ['device_start_GRU_pred_' + str(i) for i in range(22)]
df_test_pred.columns = ['device_start_GRU_pred_' + str(i) for i in range(22)]
# In[35]:
df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
# In[37]:
df_results = pd.concat([df_train_pred, df_test_pred])
df_results.to_csv('device_start_GRU_pred.csv', index=None)
================================================
FILE: THLUO/14.device_start_GRU_pred_age.py
================================================
# coding: utf-8
# In[1]:
import feather
import os
import re
import sys
import gc
import random
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from scipy import stats
import tensorflow as tf
import keras
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.callbacks import *
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from keras.engine.topology import Layer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.utils.training_utils import multi_gpu_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from TextModel import *
import warnings
warnings.filterwarnings('ignore')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
# In[2]:
print('14.device_start_GRU_pred_age.py')
df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
df_total = pd.concat([deviceid_train, deviceid_test])
df_doc = df_doc.merge(df_total, on='device_id', how='left')
df_wv2_all = pd.read_csv('w2c_all_emb.csv')
dic_w2c_all = {}
for row in df_wv2_all.values :
app_id = row[0]
vector = row[1:]
dic_w2c_all[app_id] = vector
# In[3]:
train = df_doc[df_doc['age'].notnull()]
test = df_doc[df_doc['age'].isnull()]
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
lb = LabelEncoder()
train_label = lb.fit_transform(train['age'].values)
train['class'] = train_label
# In[5]:
column_name="app_list"
word_seq_len = 900
victor_size = 200
num_words = 35000
batch_size = 64
classification = 11
kfold=10
# In[6]:
from sklearn.metrics import log_loss
def get_mut_label(y_label):
    # recover integer class labels from one-hot rows (row-wise argmax)
    return [ele.argmax() for ele in y_label]
class RocAucEvaluation(Callback):  # despite the name, this callback reports multiclass log loss
def __init__(self, validation_data=(), interval=1):
super(Callback, self).__init__()
self.interval = interval
self.X_val, self.y_val = validation_data
def on_epoch_end(self, epoch, logs={}):
if epoch % self.interval == 0:
y_pred = self.model.predict(self.X_val, verbose=0)
val_y = get_mut_label(self.y_val)
score = log_loss(val_y, y_pred)
print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
# In[7]:
# word vectors
def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
word_index = tokenizer.word_index
count = 0
nb_words = len(word_index)
print(nb_words)
all_data=pd.concat([df_train[col],df_test[col]])
file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
if not os.path.exists(file_name):
model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
model.save(file_name)
else:
model = Word2Vec.load(file_name)
print("add word2vec finished....")
embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
for word, i in word_index.items():
        embedding_vector = model.wv[word] if word in model.wv else None  # access via .wv (gensim >= 1.0)
if embedding_vector is not None:
count += 1
embedding_word2vec_matrix[i] = embedding_vector
else:
unk_vec = np.random.random(victor_size) * 0.5
unk_vec = unk_vec - unk_vec.mean()
embedding_word2vec_matrix[i] = unk_vec
embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
for word, i in word_index.items():
embedding_vector = dic_w2c_all[word]
embedding_w2c_all[i] = embedding_vector
#embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
embedding_matrix = embedding_word2vec_matrix
return train_, test_, word_index, embedding_matrix
# In[8]:
train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
# In[11]:
my_opt="bi_gru_model"
# parameters
Y = train['class'].values
if not os.path.exists("cache/"+my_opt):
    os.makedirs("cache/"+my_opt)  # makedirs also creates the missing parent 'cache/' directory
# In[17]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2006
num_folds = 10
kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
epochs = 4
my_opt=eval(my_opt)  # resolve the model-builder function named above (defined in TextModel)
train_model_pred = np.zeros((train_.shape[0], classification))
test_model_pred = np.zeros((test_.shape[0], classification))
for i, (train_fold, val_fold) in enumerate(kf):
X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
y_train, y_valid = Y[train_fold], Y[val_fold]
y_tra = to_categorical(y_train)
y_val = to_categorical(y_valid)
    # build the model
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
callbacks=[RocAuc])
train_model_pred[val_fold, :] = model.predict(X_valid)
# In[21]:
# retrain on the full training data and predict the test set
train_label = to_categorical(Y)
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
callbacks=[RocAuc])
test_model_pred = model.predict(test_)
# In[22]:
df_train_pred = pd.DataFrame(train_model_pred)
df_test_pred = pd.DataFrame(test_model_pred)
df_train_pred.columns = ['device_start_GRU_pred_age_' + str(i) for i in range(11)]
df_test_pred.columns = ['device_start_GRU_pred_age_' + str(i) for i in range(11)]
# In[23]:
df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
# In[24]:
df_results = pd.concat([df_train_pred, df_test_pred])
df_results.to_csv('device_start_GRU_pred_age.csv', index=None)
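These neural scripts all share the same stacking recipe: split the labelled devices into stratified folds, fill each held-out fold's rows of `train_model_pred` with that fold's model, then retrain on all labelled data to score the test set. A minimal stdlib sketch of the out-of-fold bookkeeping, using a hypothetical class-frequency "model" in place of the GRU (`oof_predictions` and the striped fold split are illustrative, not from the repository):

```python
# Minimal sketch of the out-of-fold (OOF) scheme used above. The GRU is
# replaced by a class-frequency "model" so only the mechanics remain.
from collections import Counter

def oof_predictions(y, n_classes, n_folds=5):
    n = len(y)
    oof = [[0.0] * n_classes for _ in range(n)]
    folds = [set(range(f, n, n_folds)) for f in range(n_folds)]  # simple striped split
    for val_idx in folds:
        # "fit": class frequencies over the training folds only
        counts = Counter(y[i] for i in range(n) if i not in val_idx)
        total = float(sum(counts.values()))
        probs = [counts.get(c, 0) / total for c in range(n_classes)]
        # "predict": fill exactly the held-out rows, like train_model_pred[val_fold, :]
        for i in val_idx:
            oof[i] = list(probs)
    return oof
```

Because every row is predicted only by a model that never saw it, the saved CSV can safely be fed as a feature to the downstream LGB/XGB stackers.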
================================================
FILE: THLUO/15.device_all_GRU_pred.py
================================================
# coding: utf-8
# In[1]:
import feather
import os
import re
import sys
import gc
import random
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from scipy import stats
import tensorflow as tf
import keras
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.callbacks import *
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from keras.engine.topology import Layer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.utils.training_utils import multi_gpu_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from TextModel import *
import warnings
warnings.filterwarnings('ignore')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
# In[2]:
print('15.device_all_GRU_pred.py')
df_doc = pd.read_csv('03.device_click_app_sorted_by_all.csv')
deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
df_total = pd.concat([deviceid_train, deviceid_test])
df_doc = df_doc.merge(df_total, on='device_id', how='left')
df_wv2_all = pd.read_csv('w2c_all_emb.csv')
dic_w2c_all = {}
for row in df_wv2_all.values:
    app_id = row[0]
    vector = row[1:]
    dic_w2c_all[app_id] = vector
# In[3]:
df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
def tool(x):
    if x == 'nan':
        return x
    else:
        return str(int(float(x)))
df_doc['sex']=df_doc['sex'].apply(tool)
df_doc['age']=df_doc['age'].apply(tool)
df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
train = df_doc[df_doc['sex_age'].notnull()]
test = df_doc[df_doc['sex_age'].isnull()]
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
lb = LabelEncoder()
train_label = lb.fit_transform(train['sex_age'].values)
train['class'] = train_label
# In[6]:
column_name="app_list"
word_seq_len = 1800
victor_size = 200
num_words = 35000
batch_size = 64
classification = 22
kfold=10
# In[7]:
from sklearn.metrics import log_loss

def get_mut_label(y_label):
    # collapse one-hot rows back to integer class labels
    return [ele.argmax() for ele in y_label]

class RocAucEvaluation(Callback):
    # despite the name, this callback reports multiclass log loss on the
    # validation set at the end of each epoch
    def __init__(self, validation_data=(), interval=1):
        super(RocAucEvaluation, self).__init__()
        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            val_y = get_mut_label(self.y_val)
            score = log_loss(val_y, y_pred)
            print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch + 1, score))
# In[14]:
# word vectors: tokenize, pad to fixed length, and build the embedding matrix
def w2v_pad(df_train, df_test, col, maxlen_, victor_size, num_words):
    tokenizer = text.Tokenizer(num_words=num_words, lower=False, filters="")
    tokenizer.fit_on_texts(list(df_train[col].values) + list(df_test[col].values))
    train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
    test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
    word_index = tokenizer.word_index
    count = 0
    nb_words = len(word_index)
    print(nb_words)
    all_data = pd.concat([df_train[col], df_test[col]])
    file_name = 'embedding/' + 'Word2Vec_all' + col + "_" + str(victor_size) + '.model'
    if not os.path.exists(file_name):
        model = Word2Vec([document.split(' ') for document in all_data.values],
                         size=victor_size, window=30, iter=10, workers=11, seed=2018, min_count=2)
        model.save(file_name)
    else:
        model = Word2Vec.load(file_name)
    print("add word2vec finished....")
    embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_vector = model[word] if word in model else None
        if embedding_vector is not None:
            count += 1
            embedding_word2vec_matrix[i] = embedding_vector
        else:
            # out-of-vocabulary word: small zero-mean random vector
            unk_vec = np.random.random(victor_size) * 0.5
            unk_vec = unk_vec - unk_vec.mean()
            embedding_word2vec_matrix[i] = unk_vec
    embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_w2c_all[i] = dic_w2c_all[word]
    #embedding_matrix = np.concatenate((embedding_word2vec_matrix, embedding_w2c_all), axis=1)
    embedding_matrix = embedding_word2vec_matrix
    return train_, test_, word_index, embedding_matrix
# In[15]:
train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
# In[21]:
my_opt = "bi_gru_model"
# parameters
Y = train['class'].values
if not os.path.exists("cache/" + my_opt):
    os.mkdir("cache/" + my_opt)
# In[22]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2006
num_folds = 10
kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
# In[23]:
epochs = 4
my_opt=eval(my_opt)
train_model_pred = np.zeros((train_.shape[0], classification))
test_model_pred = np.zeros((test_.shape[0], classification))
for i, (train_fold, val_fold) in enumerate(kf):
    X_train, X_valid = train_[train_fold, :], train_[val_fold, :]
    y_train, y_valid = Y[train_fold], Y[val_fold]
    y_tra = to_categorical(y_train)
    y_val = to_categorical(y_valid)
    # build and train the model for this fold
    name = str(my_opt.__name__)
    model = my_opt(word_seq_len, word_embedding, classification)
    RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
    hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs,
                     validation_data=(X_valid, y_val), callbacks=[RocAuc])
    train_model_pred[val_fold, :] = model.predict(X_valid)
    # free memory before the next fold
    del model
    del hist
    gc.collect()
# In[27]:
# retrain on the full training data and predict the test set
train_label = to_categorical(Y)
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
callbacks=[RocAuc])
test_model_pred = model.predict(test_)
# In[28]:
df_train_pred = pd.DataFrame(train_model_pred)
df_test_pred = pd.DataFrame(test_model_pred)
df_train_pred.columns = ['device_all_GRU_pred_' + str(i) for i in range(22)]
df_test_pred.columns = ['device_all_GRU_pred_' + str(i) for i in range(22)]
# In[29]:
df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
# In[30]:
df_results = pd.concat([df_train_pred, df_test_pred])
df_results.to_csv('device_all_GRU_pred.csv', index=None)
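Every `w2v_pad` helper in these scripts builds the Keras embedding matrix the same way: row 0 stays zero for the padding index, words known to the Word2Vec model copy their pretrained vector, and out-of-vocabulary words get a small zero-mean random vector. A stdlib-only sketch of that construction (the function name and list-based matrix are illustrative assumptions, not repository code):

```python
# Sketch of the embedding-matrix construction in w2v_pad: row 0 is the padding
# row, known words copy their pretrained vector, unknown words get a small
# zero-mean random vector (the unk_vec trick used above).
import random

def build_embedding_matrix(word_index, pretrained, dim, seed=2018):
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vec = pretrained.get(word)
        if vec is None:
            vec = [rng.random() * 0.5 for _ in range(dim)]
            mean = sum(vec) / dim
            vec = [v - mean for v in vec]  # centre the random unknown-word vector
        matrix[i] = list(vec)
    return matrix
```

Centering the random vectors keeps unknown words from introducing a systematic positive bias relative to the roughly zero-mean pretrained embeddings.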
================================================
FILE: THLUO/16.device_start_capsule_pred.py
================================================
# coding: utf-8
# In[1]:
import feather
import os
import re
import sys
import gc
import random
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from scipy import stats
import tensorflow as tf
import keras
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.callbacks import *
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from keras.engine.topology import Layer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.utils.training_utils import multi_gpu_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
# In[2]:
print ('16.device_start_capsule_pred.py')
df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
df_total = pd.concat([deviceid_train, deviceid_test])
df_doc = df_doc.merge(df_total, on='device_id', how='left')
df_wv2_all = pd.read_csv('w2c_all_emb.csv')
dic_w2c_all = {}
for row in df_wv2_all.values:
    app_id = row[0]
    vector = row[1:]
    dic_w2c_all[app_id] = vector
# In[3]:
df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
def tool(x):
    if x == 'nan':
        return x
    else:
        return str(int(float(x)))
df_doc['sex']=df_doc['sex'].apply(tool)
df_doc['age']=df_doc['age'].apply(tool)
df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
train = df_doc[df_doc['sex_age'].notnull()]
test = df_doc[df_doc['sex_age'].isnull()]
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
lb = LabelEncoder()
train_label = lb.fit_transform(train['sex_age'].values)
train['class'] = train_label
# In[5]:
column_name="app_list"
word_seq_len = 900
victor_size = 200
num_words = 35000
batch_size = 64
classification = 22
kfold=10
# In[6]:
from sklearn.metrics import log_loss

def get_mut_label(y_label):
    # collapse one-hot rows back to integer class labels
    return [ele.argmax() for ele in y_label]

class RocAucEvaluation(Callback):
    # despite the name, this callback reports multiclass log loss on the
    # validation set at the end of each epoch
    def __init__(self, validation_data=(), interval=1):
        super(RocAucEvaluation, self).__init__()
        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            val_y = get_mut_label(self.y_val)
            score = log_loss(val_y, y_pred)
            print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch + 1, score))
# In[7]:
# word vectors: tokenize, pad to fixed length, and build the embedding matrix
def w2v_pad(df_train, df_test, col, maxlen_, victor_size, num_words):
    tokenizer = text.Tokenizer(num_words=num_words, lower=False, filters="")
    tokenizer.fit_on_texts(list(df_train[col].values) + list(df_test[col].values))
    train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
    test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
    word_index = tokenizer.word_index
    count = 0
    nb_words = len(word_index)
    print(nb_words)
    all_data = pd.concat([df_train[col], df_test[col]])
    file_name = 'embedding/' + 'Word2Vec_start_' + col + "_" + str(victor_size) + '.model'
    if not os.path.exists(file_name):
        model = Word2Vec([document.split(' ') for document in all_data.values],
                         size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
        model.save(file_name)
    else:
        model = Word2Vec.load(file_name)
    print("add word2vec finished....")
    embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_vector = model[word] if word in model else None
        if embedding_vector is not None:
            count += 1
            embedding_word2vec_matrix[i] = embedding_vector
        else:
            # out-of-vocabulary word: small zero-mean random vector
            unk_vec = np.random.random(victor_size) * 0.5
            unk_vec = unk_vec - unk_vec.mean()
            embedding_word2vec_matrix[i] = unk_vec
    embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_w2c_all[i] = dic_w2c_all[word]
    #embedding_matrix = np.concatenate((embedding_word2vec_matrix, embedding_w2c_all), axis=1)
    embedding_matrix = embedding_word2vec_matrix
    return train_, test_, word_index, embedding_matrix
# In[8]:
train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
# In[10]:
from TextModel import *
# In[18]:
my_opt = "get_text_capsule"
# parameters
Y = train['class'].values
if not os.path.exists("cache/" + my_opt):
    os.mkdir("cache/" + my_opt)
# In[19]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2006
num_folds = 5
kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
# In[20]:
epochs = 10
my_opt=eval(my_opt)
train_model_pred = np.zeros((train_.shape[0], classification))
test_model_pred = np.zeros((test_.shape[0], classification))
for i, (train_fold, val_fold) in enumerate(kf):
    X_train, X_valid = train_[train_fold, :], train_[val_fold, :]
    y_train, y_valid = Y[train_fold], Y[val_fold]
    y_tra = to_categorical(y_train)
    y_val = to_categorical(y_valid)
    # build and train the model for this fold
    name = str(my_opt.__name__)
    model = my_opt(word_seq_len, word_embedding, classification)
    RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
    hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs,
                     validation_data=(X_valid, y_val), callbacks=[RocAuc])
    train_model_pred[val_fold, :] = model.predict(X_valid)
# In[24]:
# retrain on the full training data and predict the test set
train_label = to_categorical(Y)
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
callbacks=[RocAuc])
test_model_pred = model.predict(test_)
# In[25]:
df_train_pred = pd.DataFrame(train_model_pred)
df_test_pred = pd.DataFrame(test_model_pred)
df_train_pred.columns = ['device_start_capsule_pred_' + str(i) for i in range(22)]
df_test_pred.columns = ['device_start_capsule_pred_' + str(i) for i in range(22)]
# In[26]:
df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
# In[27]:
df_results = pd.concat([df_train_pred, df_test_pred])
df_results.to_csv('device_start_capsule_pred.csv', index=None)
================================================
FILE: THLUO/17.device_start_textcnn_pred.py
================================================
# coding: utf-8
# In[1]:
import feather
import os
import re
import sys
import gc
import random
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from scipy import stats
import tensorflow as tf
import keras
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.callbacks import *
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from keras.engine.topology import Layer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.utils.training_utils import multi_gpu_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
# In[2]:
print ('17.device_start_textcnn_pred.py')
df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
df_total = pd.concat([deviceid_train, deviceid_test])
df_doc = df_doc.merge(df_total, on='device_id', how='left')
df_wv2_all = pd.read_csv('w2c_all_emb.csv')
dic_w2c_all = {}
for row in df_wv2_all.values:
    app_id = row[0]
    vector = row[1:]
    dic_w2c_all[app_id] = vector
# In[3]:
df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
def tool(x):
    if x == 'nan':
        return x
    else:
        return str(int(float(x)))
df_doc['sex']=df_doc['sex'].apply(tool)
df_doc['age']=df_doc['age'].apply(tool)
df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
train = df_doc[df_doc['sex_age'].notnull()]
test = df_doc[df_doc['sex_age'].isnull()]
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
lb = LabelEncoder()
train_label = lb.fit_transform(train['sex_age'].values)
train['class'] = train_label
# In[5]:
column_name="app_list"
word_seq_len = 900
victor_size = 200
num_words = 35000
batch_size = 64
classification = 22
kfold=10
# In[6]:
from sklearn.metrics import log_loss

def get_mut_label(y_label):
    # collapse one-hot rows back to integer class labels
    return [ele.argmax() for ele in y_label]

class RocAucEvaluation(Callback):
    # despite the name, this callback reports multiclass log loss on the
    # validation set at the end of each epoch
    def __init__(self, validation_data=(), interval=1):
        super(RocAucEvaluation, self).__init__()
        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            val_y = get_mut_label(self.y_val)
            score = log_loss(val_y, y_pred)
            print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch + 1, score))
# In[7]:
# word vectors: tokenize, pad to fixed length, and build the embedding matrix
def w2v_pad(df_train, df_test, col, maxlen_, victor_size, num_words):
    tokenizer = text.Tokenizer(num_words=num_words, lower=False, filters="")
    tokenizer.fit_on_texts(list(df_train[col].values) + list(df_test[col].values))
    train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
    test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
    word_index = tokenizer.word_index
    count = 0
    nb_words = len(word_index)
    print(nb_words)
    all_data = pd.concat([df_train[col], df_test[col]])
    file_name = 'embedding/' + 'Word2Vec_start_' + col + "_" + str(victor_size) + '.model'
    if not os.path.exists(file_name):
        model = Word2Vec([document.split(' ') for document in all_data.values],
                         size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
        model.save(file_name)
    else:
        model = Word2Vec.load(file_name)
    print("add word2vec finished....")
    embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_vector = model[word] if word in model else None
        if embedding_vector is not None:
            count += 1
            embedding_word2vec_matrix[i] = embedding_vector
        else:
            # out-of-vocabulary word: small zero-mean random vector
            unk_vec = np.random.random(victor_size) * 0.5
            unk_vec = unk_vec - unk_vec.mean()
            embedding_word2vec_matrix[i] = unk_vec
    embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_w2c_all[i] = dic_w2c_all[word]
    #embedding_matrix = np.concatenate((embedding_word2vec_matrix, embedding_w2c_all), axis=1)
    embedding_matrix = embedding_word2vec_matrix
    return train_, test_, word_index, embedding_matrix
# In[8]:
train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
# In[10]:
from TextModel import *
# In[19]:
my_opt = "get_text_cnn2"
# parameters
Y = train['class'].values
if not os.path.exists("cache/" + my_opt):
    os.mkdir("cache/" + my_opt)
# In[20]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2006
num_folds = 5
kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
# In[21]:
epochs = 6
my_opt=eval(my_opt)
train_model_pred = np.zeros((train_.shape[0], classification))
test_model_pred = np.zeros((test_.shape[0], classification))
for i, (train_fold, val_fold) in enumerate(kf):
    X_train, X_valid = train_[train_fold, :], train_[val_fold, :]
    y_train, y_valid = Y[train_fold], Y[val_fold]
    y_tra = to_categorical(y_train)
    y_val = to_categorical(y_valid)
    # build and train the model for this fold
    name = str(my_opt.__name__)
    model = my_opt(word_seq_len, word_embedding, classification)
    RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
    hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs,
                     validation_data=(X_valid, y_val), callbacks=[RocAuc])
    train_model_pred[val_fold, :] = model.predict(X_valid)
# In[25]:
# retrain on the full training data and predict the test set
train_label = to_categorical(Y)
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
callbacks=[RocAuc])
test_model_pred = model.predict(test_)
# In[26]:
df_train_pred = pd.DataFrame(train_model_pred)
df_test_pred = pd.DataFrame(test_model_pred)
df_train_pred.columns = ['device_start_textcnn_pred_' + str(i) for i in range(22)]
df_test_pred.columns = ['device_start_textcnn_pred_' + str(i) for i in range(22)]
# In[27]:
df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
# In[28]:
df_results = pd.concat([df_train_pred, df_test_pred])
df_results.to_csv('device_start_textcnn_pred.csv', index=None)
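The 22-class target used throughout these scripts comes from joining sex (1 or 2) and age bucket (0 to 10) into a `sex-age` string and label-encoding it. One subtlety worth keeping in mind when reading the prediction columns: like sklearn's `LabelEncoder`, classes are ordered by *string* sort, so `'1-10'` precedes `'1-2'`. A stdlib sketch (the function name is illustrative, not from the repository):

```python
# Sketch of the 22-class "sex-age" target construction. Classes are sorted
# as strings, mirroring sklearn's LabelEncoder, so '1-10' sorts before '1-2'.
def encode_sex_age(pairs):
    labels = sorted({'%d-%d' % (s, a) for s, a in pairs})
    index = {lab: i for i, lab in enumerate(labels)}
    return [index['%d-%d' % (s, a)] for s, a in pairs], labels
```

This lexicographic ordering is why the 22 output columns of each model do not run in numeric age order.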
================================================
FILE: THLUO/18.device_start_text_dpcnn_pred.py
================================================
# coding: utf-8
# In[1]:
import feather
import os
import re
import sys
import gc
import random
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from scipy import stats
import tensorflow as tf
import keras
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.callbacks import *
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from keras.engine.topology import Layer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.utils.training_utils import multi_gpu_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
# In[2]:
print ('18.device_start_text_dpcnn_pred.py')
df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
df_total = pd.concat([deviceid_train, deviceid_test])
df_doc = df_doc.merge(df_total, on='device_id', how='left')
df_wv2_all = pd.read_csv('w2c_all_emb.csv')
dic_w2c_all = {}
for row in df_wv2_all.values:
    app_id = row[0]
    vector = row[1:]
    dic_w2c_all[app_id] = vector
# In[3]:
df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
def tool(x):
    if x == 'nan':
        return x
    else:
        return str(int(float(x)))
df_doc['sex']=df_doc['sex'].apply(tool)
df_doc['age']=df_doc['age'].apply(tool)
df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
train = df_doc[df_doc['sex_age'].notnull()]
test = df_doc[df_doc['sex_age'].isnull()]
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
lb = LabelEncoder()
train_label = lb.fit_transform(train['sex_age'].values)
train['class'] = train_label
# In[5]:
column_name="app_list"
word_seq_len = 900
victor_size = 200
num_words = 35000
batch_size = 64
classification = 22
kfold=10
# In[6]:
from sklearn.metrics import log_loss

def get_mut_label(y_label):
    # collapse one-hot rows back to integer class labels
    return [ele.argmax() for ele in y_label]

class RocAucEvaluation(Callback):
    # despite the name, this callback reports multiclass log loss on the
    # validation set at the end of each epoch
    def __init__(self, validation_data=(), interval=1):
        super(RocAucEvaluation, self).__init__()
        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            val_y = get_mut_label(self.y_val)
            score = log_loss(val_y, y_pred)
            print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch + 1, score))
# In[7]:
# word vectors: tokenize, pad to fixed length, and build the embedding matrix
def w2v_pad(df_train, df_test, col, maxlen_, victor_size, num_words):
    tokenizer = text.Tokenizer(num_words=num_words, lower=False, filters="")
    tokenizer.fit_on_texts(list(df_train[col].values) + list(df_test[col].values))
    train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
    test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
    word_index = tokenizer.word_index
    count = 0
    nb_words = len(word_index)
    print(nb_words)
    all_data = pd.concat([df_train[col], df_test[col]])
    file_name = 'embedding/' + 'Word2Vec_start_' + col + "_" + str(victor_size) + '.model'
    if not os.path.exists(file_name):
        model = Word2Vec([document.split(' ') for document in all_data.values],
                         size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
        model.save(file_name)
    else:
        model = Word2Vec.load(file_name)
    print("add word2vec finished....")
    embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_vector = model[word] if word in model else None
        if embedding_vector is not None:
            count += 1
            embedding_word2vec_matrix[i] = embedding_vector
        else:
            # out-of-vocabulary word: small zero-mean random vector
            unk_vec = np.random.random(victor_size) * 0.5
            unk_vec = unk_vec - unk_vec.mean()
            embedding_word2vec_matrix[i] = unk_vec
    embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        embedding_w2c_all[i] = dic_w2c_all[word]
    #embedding_matrix = np.concatenate((embedding_word2vec_matrix, embedding_w2c_all), axis=1)
    embedding_matrix = embedding_word2vec_matrix
    return train_, test_, word_index, embedding_matrix
# In[8]:
train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
# In[10]:
from TextModel import *
# In[12]:
my_opt = "get_text_dpcnn"
# parameters
Y = train['class'].values
if not os.path.exists("cache/" + my_opt):
    os.mkdir("cache/" + my_opt)
# In[13]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2006
num_folds = 5
kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
# In[14]:
from keras import backend as K
epochs = 6
my_opt=eval(my_opt)
train_model_pred = np.zeros((train_.shape[0], classification))
test_model_pred = np.zeros((test_.shape[0], classification))
for i, (train_fold, val_fold) in enumerate(kf):
    X_train, X_valid = train_[train_fold, :], train_[val_fold, :]
    y_train, y_valid = Y[train_fold], Y[val_fold]
    y_tra = to_categorical(y_train)
    y_val = to_categorical(y_valid)
    # build and train the model for this fold
    name = str(my_opt.__name__)
    model = my_opt(word_seq_len, word_embedding, classification)
    RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
    hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs,
                     validation_data=(X_valid, y_val), callbacks=[RocAuc])
    train_model_pred[val_fold, :] = model.predict(X_valid)
    # free GPU memory before the next fold
    del model
    del hist
    gc.collect()
    K.clear_session()
    tf.reset_default_graph()
# In[15]:
# retrain on the full training data and predict the test set
train_label = to_categorical(Y)
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
callbacks=[RocAuc])
test_model_pred = model.predict(test_)
# In[16]:
df_train_pred = pd.DataFrame(train_model_pred)
df_test_pred = pd.DataFrame(test_model_pred)
df_train_pred.columns = ['device_start_text_dpcnn_pred_' + str(i) for i in range(22)]
df_test_pred.columns = ['device_start_text_dpcnn_pred_' + str(i) for i in range(22)]
# In[17]:
df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
# In[18]:
df_results = pd.concat([df_train_pred, df_test_pred])
df_results.to_csv('device_start_text_dpcnn_pred.csv', index=None)
================================================
FILE: THLUO/19.device_start_lstm_pred.py
================================================
import feather
import os
import re
import sys
import gc
import random
import pandas as pd
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from scipy import stats
import tensorflow as tf
import keras
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.callbacks import *
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from keras.engine.topology import Layer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.utils.training_utils import multi_gpu_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
print('19.device_start_lstm_pred.py')
# In[2]:
df_doc = pd.read_csv('01.device_click_app_sorted_by_start.csv')
deviceid_test=pd.read_csv('input/deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv('input/deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
df_total = pd.concat([deviceid_train, deviceid_test])
df_doc = df_doc.merge(df_total, on='device_id', how='left')
df_wv2_all = pd.read_csv('w2c_all_emb.csv')
dic_w2c_all = {}
for row in df_wv2_all.values:
    app_id = row[0]
    vector = row[1:]
    dic_w2c_all[app_id] = vector
# In[3]:
df_doc['sex'] = df_doc['sex'].apply(lambda x:str(x))
df_doc['age'] = df_doc['age'].apply(lambda x:str(x))
def tool(x):
    if x == 'nan':
        return x
    else:
        return str(int(float(x)))
df_doc['sex']=df_doc['sex'].apply(tool)
df_doc['age']=df_doc['age'].apply(tool)
df_doc['sex_age']=df_doc['sex']+'-'+df_doc['age']
df_doc = df_doc.replace({'nan':np.NaN,'nan-nan':np.NaN})
train = df_doc[df_doc['sex_age'].notnull()]
test = df_doc[df_doc['sex_age'].isnull()]
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
lb = LabelEncoder()
train_label = lb.fit_transform(train['sex_age'].values)
train['class'] = train_label
# In[5]:
column_name="app_list"
word_seq_len = 900
victor_size = 200
num_words = 35000
batch_size = 64
classification = 22
kfold=10
# In[6]:
from sklearn.metrics import log_loss
def get_mut_label(y_label) :
results = []
for ele in y_label :
results.append(ele.argmax())
return results
class RocAucEvaluation(Callback):
def __init__(self, validation_data=(), interval=1):
super(Callback, self).__init__()
self.interval = interval
self.X_val, self.y_val = validation_data
def on_epoch_end(self, epoch, logs={}):
if epoch % self.interval == 0:
y_pred = self.model.predict(self.X_val, verbose=0)
val_y = get_mut_label(self.y_val)
score = log_loss(val_y, y_pred)
print("\n mlogloss - epoch: %d - score: %.6f \n" % (epoch+1, score))
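For reference, a minimal sketch (toy data, outside the training loop) of what the callback computes each epoch: recover integer labels from one-hot vectors via argmax, then score the predicted probabilities with multiclass log-loss.

```python
import numpy as np
from sklearn.metrics import log_loss

# one-hot validation labels for 3 samples over 3 classes
y_val = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
# predicted class probabilities from the model
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7]])

labels = [row.argmax() for row in y_val]   # same idea as get_mut_label
score = log_loss(labels, y_pred)
print(round(score, 4))
```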
# In[7]:
# word vectors: tokenize, pad, and build the embedding matrix
def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
tokenizer = text.Tokenizer(num_words=num_words, lower=False,filters="")
tokenizer.fit_on_texts(list(df_train[col].values)+list(df_test[col].values))
train_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_train[col].values), maxlen=maxlen_)
test_ = sequence.pad_sequences(tokenizer.texts_to_sequences(df_test[col].values), maxlen=maxlen_)
word_index = tokenizer.word_index
count = 0
nb_words = len(word_index)
print(nb_words)
all_data=pd.concat([df_train[col],df_test[col]])
file_name = 'embedding/' + 'Word2Vec_start_' + col +"_"+ str(victor_size) + '.model'
if not os.path.exists(file_name):
model = Word2Vec([[word for word in document.split(' ')] for document in all_data.values],
size=victor_size, window=5, iter=10, workers=11, seed=2018, min_count=2)
model.save(file_name)
else:
model = Word2Vec.load(file_name)
print("add word2vec finished....")
embedding_word2vec_matrix = np.zeros((nb_words + 1, victor_size))
for word, i in word_index.items():
        embedding_vector = model.wv[word] if word in model.wv else None
if embedding_vector is not None:
count += 1
embedding_word2vec_matrix[i] = embedding_vector
else:
unk_vec = np.random.random(victor_size) * 0.5
unk_vec = unk_vec - unk_vec.mean()
embedding_word2vec_matrix[i] = unk_vec
embedding_w2c_all = np.zeros((nb_words + 1, victor_size))
    for word, i in word_index.items():
        if word in dic_w2c_all:
            embedding_w2c_all[i] = dic_w2c_all[word]
#embedding_matrix = np.concatenate((embedding_word2vec_matrix,embedding_w2c_all),axis=1)
embedding_matrix = embedding_word2vec_matrix
return train_, test_, word_index, embedding_matrix
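The embedding-matrix construction inside `w2v_pad` can be sketched in isolation: known words get their word2vec vector, unknown words get a small zero-mean random vector, and row 0 is reserved for padding. All names here are illustrative toys, not from the repo.

```python
import numpy as np

rng = np.random.RandomState(2018)
victor_size = 4
# pretend word2vec vocabulary: app id -> vector
w2v = {'app_a': np.ones(victor_size), 'app_b': np.full(victor_size, 2.0)}
# tokenizer-style word index (1-based; row 0 stays all-zero for padding)
word_index = {'app_a': 1, 'app_b': 2, 'app_unk': 3}

embedding_matrix = np.zeros((len(word_index) + 1, victor_size))
for word, i in word_index.items():
    if word in w2v:
        embedding_matrix[i] = w2v[word]
    else:
        unk = rng.random_sample(victor_size) * 0.5
        embedding_matrix[i] = unk - unk.mean()   # zero-mean random vector for OOV

print(embedding_matrix.shape)
```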
# In[8]:
train_, test_,word2idx, word_embedding = w2v_pad(train,test,column_name, word_seq_len,victor_size, num_words)
# In[10]:
from TextModel import *
# In[13]:
my_opt="get_text_lstm1"
# parameters
Y = train['class'].values
if not os.path.exists("cache/"+my_opt):
    os.makedirs("cache/"+my_opt)
# In[14]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2006
num_folds = 5
kf = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed).split(train_, Y)
# In[15]:
from keras import backend as K
epochs = 6
my_opt = eval(my_opt)  # look up the model-builder function (imported from TextModel) by name
train_model_pred = np.zeros((train_.shape[0], classification))
test_model_pred = np.zeros((test_.shape[0], classification))
for i, (train_fold, val_fold) in enumerate(kf):
X_train, X_valid, = train_[train_fold, :], train_[val_fold, :]
y_train, y_valid = Y[train_fold], Y[val_fold]
y_tra = to_categorical(y_train)
y_val = to_categorical(y_valid)
    # build the model for this fold
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(X_valid, y_val), interval=1)
hist = model.fit(X_train, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_valid, y_val),
callbacks=[RocAuc])
train_model_pred[val_fold, :] = model.predict(X_valid)
del model
del hist
gc.collect()
K.clear_session()
tf.reset_default_graph()
# In[19]:
# final model: retrain on all training data to predict the test set
train_label = to_categorical(Y)
name = str(my_opt.__name__)
model = my_opt(word_seq_len, word_embedding, classification)
RocAuc = RocAucEvaluation(validation_data=(train_, train_label), interval=1)
hist = model.fit(train_, train_label, batch_size=batch_size, epochs=epochs, validation_data=(train_, train_label),
callbacks=[RocAuc])
test_model_pred = model.predict(test_)
# In[20]:
df_train_pred = pd.DataFrame(train_model_pred)
df_test_pred = pd.DataFrame(test_model_pred)
df_train_pred.columns = ['device_start_lstm_pred_' + str(i) for i in range(22)]
df_test_pred.columns = ['device_start_lstm_pred_' + str(i) for i in range(22)]
# In[21]:
df_train_pred = pd.concat([train[['device_id']], df_train_pred], axis=1)
df_test_pred = pd.concat([test[['device_id']], df_test_pred], axis=1)
# In[22]:
df_results = pd.concat([df_train_pred, df_test_pred])
df_results.to_csv('device_start_lstm_pred.csv', index=None)
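The fold loop above follows the standard out-of-fold (OOF) stacking pattern: each fold's model predicts only its held-out rows, so the stacked train-set probabilities never come from a model that saw those rows. A minimal sketch with a plain scikit-learn classifier standing in for the Keras model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_classes=3, n_informative=6, random_state=0)
n_classes = 3
oof = np.zeros((X.shape[0], n_classes))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2006)
for train_idx, val_idx in kf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof[val_idx] = clf.predict_proba(X[val_idx])   # fill only the held-out rows

# every row was predicted exactly once, so each probability row sums to 1
print(oof.shape)
```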
================================================
FILE: THLUO/2.w2c_model_close.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
# In[2]:
print ('2.w2c_model_close.py')
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
# encode brand/type strings as integer ids
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# encode app-type labels as integer ids
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
# In[4]:
df_sorted = deviceid_package_start_close.sort_values(by='close_time')
# In[6]:
df_results = df_sorted.groupby('device_id')['app_id'].apply(lambda x:' '.join(x)).reset_index().rename(columns = {'app_id' : 'app_list'})
# In[7]:
df_results.to_csv('02.device_click_app_sorted_by_close.csv', index=None)
# In[6]:
df_device_start_app_list = df_sorted.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
# In[7]:
app_list = list(df_device_start_app_list.app_list.values)
# In[8]:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
# In[9]:
model = Word2Vec(app_list, size=10, window=10, min_count=2, workers=4)
model.save("word2vec.model")
# In[11]:
vocab = list(model.wv.vocab.keys())
w2c_arr = []
for v in vocab :
w2c_arr.append(list(model.wv[v]))
# In[12]:
df_w2c_start = pd.DataFrame()
df_w2c_start['app_id'] = vocab
df_w2c_start = pd.concat([df_w2c_start, pd.DataFrame(w2c_arr)], axis=1)
df_w2c_start.columns = ['app_id'] + ['w2c_close_app_' + str(i) for i in range(10)]
# In[ ]:
w2c_nums = 10
agg = {}
for l in ['w2c_close_app_' + str(i) for i in range(w2c_nums)] :
agg[l] = ['mean', 'std', 'max', 'min']
# In[14]:
deviceid_package_start_close = deviceid_package_start_close.merge(df_w2c_start, on='app_id', how='left')
# In[ ]:
df_agg = deviceid_package_start_close.groupby('device_id').agg(agg)
df_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
df_agg.to_csv('device_close_app_w2c.csv', index=None)
# In[14]:
df_results = deviceid_package_start_close.groupby(['device_id', 'app_id'])['start_time'].mean().reset_index()
df_results = df_results.merge(df_w2c_start, on='app_id', how='left')
# In[17]:
df_agg = df_results.groupby('device_id').agg(agg)
df_agg.columns = pd.Index(['device_app_unique_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
# In[18]:
df_agg.to_csv('device_app_unique_close_app_w2c.csv', index=None)
================================================
FILE: THLUO/20.lgb_sex_age_prob_oof.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
# In[2]:
print ('20.lgb_sex_age_prob_oof.py')
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
# encode brand/type strings as integer ids
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# encode app-type labels as integer ids
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
# In[4]:
import time
# convert a millisecond epoch timestamp into a formatted time string
def timeStamp(timeNum):
timeStamp = float(timeNum/1000)
timeArray = time.localtime(timeStamp)
otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
return otherStyleTime
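Equivalently (and faster, since it avoids a Python-level `apply`), pandas can convert the millisecond epochs directly. One caveat: `pd.to_datetime(..., unit='ms')` yields UTC-naive timestamps, whereas `time.localtime` uses the machine's local timezone, so the two only agree on a UTC machine.

```python
import pandas as pd

ms = pd.Series([1_500_000_000_000, 1_500_000_123_456])
dates = pd.to_datetime(ms, unit='ms')   # vectorised, UTC-naive
print(dates.dt.hour.tolist())
```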
# parse concrete datetime fields
deviceid_package_start_close['start_date'] = pd.to_datetime(deviceid_package_start_close.start_time.apply(timeStamp))
deviceid_package_start_close['end_date'] = pd.to_datetime(deviceid_package_start_close.close_time.apply(timeStamp))
deviceid_package_start_close['start_hour'] = deviceid_package_start_close.start_date.dt.hour
deviceid_package_start_close['end_hour'] = deviceid_package_start_close.end_date.dt.hour
deviceid_package_start_close['time_gap'] = (deviceid_package_start_close['end_date'] - deviceid_package_start_close['start_date']).astype('timedelta64[s]')
deviceid_package_start_close = deviceid_package_start_close.merge(package_label, on='app_id', how='left')
deviceid_package_start_close.app_parent_type.fillna(-1, inplace=True)
deviceid_package_start_close.app_child_type.fillna(-1, inplace=True)
deviceid_package_start_close['start_year'] = deviceid_package_start_close.start_date.dt.year
deviceid_package_start_close['end_year'] = deviceid_package_start_close.end_date.dt.year
deviceid_package_start_close['year_gap'] = deviceid_package_start_close['end_year'] - deviceid_package_start_close['start_year']
# In[5]:
deviceid_train=pd.concat([deviceid_train,deviceid_test])
# In[6]:
deviceid_packages['apps']=deviceid_packages['apps'].apply(lambda x:x.split(','))
deviceid_packages['app_lenghth']=deviceid_packages['apps'].apply(lambda x:len(x))
# feature engineering
def open_app_timegap_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['time_gap'].mean().reset_index().rename(columns = {'time_gap': 'mean_time_gap'})
df_mean_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='mean_time_gap').reset_index()
df_mean_temp.columns = ['device_id'] + ['open_app_timegap_in_'+str(i) + '_mean_hour' for i in range(0,24)]
df_mean_temp.fillna(0, inplace=True)
return df_mean_temp
# In[8]:
def device_start_end_app_timegap() :
    # time gaps between a device's consecutive app open/close events
df_ = deviceid_package_start_close.sort_values(by=['device_id', 'start_date'], ascending=False)
df_['prev_start_date'] = df_.groupby('device_id')['start_date'].shift(-1)
df_['start_date_gap'] = (df_['start_date'] - df_['prev_start_date']).astype('timedelta64[s]')
agg_dic = {'start_date_gap' : ['min', 'max', 'mean', 'median', 'std']}
df_start_gap_agg = df_.groupby('device_id').agg(agg_dic)
df_start_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_start_gap_agg.columns.tolist()])
df_start_gap_agg = df_start_gap_agg.reset_index()
#del df_
gc.collect()
    # gaps between consecutive close times
df_ = deviceid_package_start_close.sort_values(by=['device_id', 'end_date'], ascending=False)
df_['prev_end_date'] = df_.groupby('device_id')['end_date'].shift(-1)
df_['end_date_gap'] = (df_['end_date'] - df_['prev_end_date']).astype('timedelta64[s]')
agg_dic = {'end_date_gap' : ['min', 'max', 'mean', 'median', 'std']}
df_end_gap_agg = df_.groupby('device_id').agg(agg_dic)
df_end_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_end_gap_agg.columns.tolist()])
df_end_gap_agg = df_end_gap_agg.reset_index()
#del df_
gc.collect()
df_agg = df_start_gap_agg.merge(df_end_gap_agg, on='device_id', how='left')
#df_agg = df_agg.merge(df_app_start_gap_agg, on='device_id', how='left')
#df_agg = df_agg.merge(df_app_end_gap_agg, on='device_id', how='left')
return df_agg
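The shift trick above computes gaps between consecutive events: sort descending, `shift(-1)` within each device to fetch the previous event, then subtract. In miniature (toy data; `.dt.total_seconds()` here is equivalent to the `.astype('timedelta64[s]')` used above):

```python
import pandas as pd

df = pd.DataFrame({
    'device_id': ['d1', 'd1', 'd1', 'd2', 'd2'],
    'start_date': pd.to_datetime(['2018-01-01 10:00', '2018-01-01 09:00',
                                  '2018-01-01 08:00', '2018-01-01 12:00',
                                  '2018-01-01 11:30']),
})
df = df.sort_values(['device_id', 'start_date'], ascending=False)
# previous event within each device (last event per device gets NaT)
df['prev_start_date'] = df.groupby('device_id')['start_date'].shift(-1)
df['start_date_gap'] = (df['start_date'] - df['prev_start_date']).dt.total_seconds()
print(df['start_date_gap'].tolist())
```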
def open_app_counts_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['app_id'].count().reset_index().rename(columns = {'app_id': 'app_counts'})
df_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='app_counts').reset_index()
df_temp.columns = ['device_id'] + ['open_app_counts_in'+str(i) + '_hour' for i in range(0,24)]
df_temp.fillna(0, inplace=True)
return df_temp
def close_app_counts_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'end_hour'])['app_id'].count().reset_index().rename(columns = {'app_id': 'app_counts'})
df_temp = pd.pivot_table(df_temp, index='device_id', columns='end_hour', values='app_counts').reset_index()
df_temp.columns = ['device_id'] + ['close_app_counts_in'+str(i) + '_hour' for i in range(0,24)]
df_temp.fillna(0, inplace=True)
return df_temp
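The hour-of-day features above all follow the same groupby-then-pivot recipe: count events per (device, hour), then spread the hours into columns and fill missing hours with 0. In miniature, with toy data:

```python
import pandas as pd

events = pd.DataFrame({
    'device_id': ['d1', 'd1', 'd1', 'd2'],
    'start_hour': [9, 9, 21, 21],
    'app_id': ['a', 'b', 'a', 'c'],
})
counts = events.groupby(['device_id', 'start_hour'])['app_id'].count().reset_index()
wide = pd.pivot_table(counts, index='device_id', columns='start_hour',
                      values='app_id').fillna(0).reset_index()
print(wide)
```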
def app_type_mean_time_gap_one_hot () :
df_temp = deviceid_package_start_close.groupby(['device_id', 'app_parent_type'])['time_gap'].mean().reset_index()
df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='time_gap').reset_index()
df_temp.columns = ['device_id'] + ['app_parent_type_mean_time_gap'+str(i) for i in range(-1,45)]
df_temp.fillna(-1, inplace=True)
return df_temp
def device_active_hour() :
aggregations = {
'start_hour' : ['std','mean','max','min'],
'end_hour' : ['std','mean','max','min']
}
df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
def device_brand_encoding() :
df_temp = deviceid_brand.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
aggregations = {
'age' : ['std','mean'],
'sex' : ['mean'],
}
df_device_brand = df_temp.groupby('device_brand').agg(aggregations)
df_device_brand.columns = pd.Index(['device_brand_' + e[0] + "_" + e[1].upper() for e in df_device_brand.columns.tolist()])
df_device_brand = df_device_brand.reset_index()
df_device_type = df_temp.groupby('device_type').agg(aggregations)
df_device_type.columns = pd.Index(['device_type_' + e[0] + "_" + e[1].upper() for e in df_device_type.columns.tolist()])
df_device_type = df_device_type.reset_index()
df_temp = df_temp.merge(df_device_brand, on='device_brand', how='left')
df_temp = df_temp.merge(df_device_type, on='device_type', how='left')
aggregations = {
'device_brand_age_STD' : ['mean'],
'device_brand_age_MEAN' : ['mean'],
'device_brand_sex_MEAN' : ['mean'],
#'device_type_age_STD' : ['mean'],
#'device_type_age_MEAN' : ['mean'],
#'device_type_sex_MEAN' : ['mean']
}
df_agg = df_temp.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
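`device_brand_encoding` is a form of target (mean) encoding: per-brand statistics of the labels are joined back onto devices. A minimal sketch of the idea with toy column names; note that test rows carry NaN labels, so they simply don't contribute to the group statistics:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'device_id': ['d1', 'd2', 'd3', 'd4'],
    'device_brand': ['A', 'A', 'B', 'B'],
    'age': [3.0, 5.0, 7.0, np.nan],   # NaN = unlabeled test device
})
# per-brand label statistics, merged back onto every device of that brand
brand_stats = df.groupby('device_brand')['age'].agg(['mean', 'std']).add_prefix('brand_age_')
df = df.merge(brand_stats, left_on='device_brand', right_index=True, how='left')
print(df['brand_age_mean'].tolist())
```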
# statistics on how each device runs apps
def device_active_time_time_stat() :
    # time statistics of each device's app sessions
deviceid_package_start_close['active_time'] = deviceid_package_start_close['close_time'] - deviceid_package_start_close['start_time']
    # how many times, and how many distinct apps, each device opened
aggregations = {
'app_id' : ['count', 'nunique'],
'active_time' : ['mean', 'std', 'max', 'min'],
}
df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
aggregations = {
'active_time' : ['mean', 'std', 'max', 'min', 'count'],
}
df_da_agg = deviceid_package_start_close.groupby(['device_id', 'app_id']).agg(aggregations)
df_da_agg.columns = pd.Index(['device_app_grouped_' + e[0] + "_" + e[1].upper() for e in df_da_agg.columns.tolist()])
df_da_agg = df_da_agg.reset_index()
    # per-device statistics over the per-(device, app) session aggregates
aggregations = {
'device_app_grouped_active_time_MEAN' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_STD' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_MAX' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_MIN' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_COUNT' : ['mean', 'std', 'max', 'min'],
}
df_temp = df_da_agg.groupby(['device_id']).agg(aggregations)
df_temp.columns = pd.Index([e[0] + "_" + e[1].upper() for e in df_temp.columns.tolist()])
df_temp = df_temp.reset_index()
df_agg = df_agg.merge(df_temp, on='device_id', how='left')
return df_agg
def app_type_encoding() :
df_temp = df_device_app_pair.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
aggregations = {
'age' : ['std','mean'],
'sex' : ['mean'],
}
df_agg_app_parent_type = df_temp.groupby('app_parent_type').agg(aggregations)
df_agg_app_parent_type.columns = pd.Index(['app_parent_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_parent_type.columns.tolist()])
df_agg_app_parent_type = df_agg_app_parent_type.reset_index()
df_agg_app_child_type = df_temp.groupby('app_child_type').agg(aggregations)
df_agg_app_child_type.columns = pd.Index(['app_child_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_child_type.columns.tolist()])
df_agg_app_child_type = df_agg_app_child_type.reset_index()
df_temp = df_temp.merge(df_agg_app_parent_type, on='app_parent_type', how='left')
df_temp = df_temp.merge(df_agg_app_child_type, on='app_child_type', how='left')
aggregations = {
'app_parent_type_age_STD' : ['mean'],
'app_parent_type_age_MEAN' : ['mean'],
'app_parent_type_sex_MEAN' : ['mean'],
'app_child_type_age_STD' : ['mean'],
'app_child_type_age_MEAN' : ['mean'],
'app_child_type_sex_MEAN' : ['mean']
}
df_agg = df_temp.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
# count of each app_parent_type per device
def app_type_onehot_in_device(df) :
df_copy = df.fillna(-1)
df_temp = df_copy.groupby(['device_id', 'app_parent_type'])['app_id'].size().reset_index()
df_temp.rename(columns = {'app_id' : 'app_parent_type_counts'}, inplace=True)
df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='app_parent_type_counts').reset_index()
df_temp.columns = ['device_id'] + ['app_parent_type'+str(i) for i in range(-1,45)]
df_temp.fillna(0, inplace=True)
return df_temp
apps=deviceid_packages['apps'].apply(lambda x:' '.join(x)).tolist()
vectorizer=CountVectorizer()
transformer=TfidfTransformer()
cntTf = vectorizer.fit_transform(apps)
tfidf=transformer.fit_transform(cntTf)
word=vectorizer.get_feature_names()
weight=tfidf.toarray()
df_weight=pd.DataFrame(weight)
feature=df_weight.columns
df_weight['sum']=0
for f in tqdm(feature):
df_weight['sum']+=df_weight[f]
deviceid_packages['tfidf_sum']=df_weight['sum']
# In[10]:
lda = LatentDirichletAllocation(n_components=5,
                                learning_offset=50.,
                                random_state=666)
docres = lda.fit_transform(cntTf)
# In[11]:
deviceid_packages = pd.concat([deviceid_packages,pd.DataFrame(docres)],axis=1)
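The TF-IDF sum and LDA topic features above can be sketched on toy app lists (tiny corpus, 2 topics; `n_components` is the current scikit-learn name for the topic count):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['app1 app2 app2', 'app2 app3', 'app1 app1 app3']
cnt = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(cnt)
tfidf_sum = tfidf.toarray().sum(axis=1)          # per-device TF-IDF mass

lda = LatentDirichletAllocation(n_components=2, random_state=666)
topics = lda.fit_transform(cnt)                  # per-device topic mixture
print(topics.shape)
```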
# In[12]:
temp=deviceid_packages.drop('apps',axis=1)
deviceid_train=pd.merge(deviceid_train,temp,on='device_id',how='left')
# In[13]:
# expand every (device_id, app_id) pair
device_id_arr = []
app_arr = []
df_device_app_pair = pd.DataFrame()
for row in deviceid_packages.values :
device_id = row[0]
app_list = row[1]
for app in app_list :
device_id_arr.append(device_id)
app_arr.append(app)
# build the pair dataframe
df_device_app_pair['device_id'] = device_id_arr
df_device_app_pair['app_id'] = app_arr
df_device_app_pair = df_device_app_pair.merge(package_label, how='left', on='app_id')
# In[15]:
# extract features
df_train = deviceid_train.merge(device_active_time_time_stat(), on='device_id', how='left')
df_train = df_train.merge(deviceid_brand, on='device_id', how='left')
df_train = df_train.merge(app_type_onehot_in_device(df_device_app_pair), on='device_id', how='left')
df_train = df_train.merge(app_type_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_active_hour(), on='device_id', how='left')
df_train = df_train.merge(app_type_mean_time_gap_one_hot(), on='device_id', how='left')
df_train = df_train.merge(open_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(close_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(device_brand_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_start_end_app_timegap(), on='device_id', how='left')
df_train = df_train.merge(open_app_timegap_in_hour(), on='device_id', how='left')
# In[16]:
df_w2c_start = pd.read_csv('device_start_app_w2c.csv')
df_w2c_close = pd.read_csv('device_close_app_w2c.csv')
df_w2c_all = pd.read_csv('device_all_app_w2c.csv')
df_device_quchong_start_app_w2c = pd.read_csv('device_quchong_start_app_w2c.csv')
df_device_app_unique_start_app_w2c = pd.read_csv('device_app_unique_start_app_w2c.csv')
df_device_app_unique_close_app_w2c = pd.read_csv('device_app_unique_close_app_w2c.csv')
df_device_app_unique_all_app_w2c = pd.read_csv('device_app_unique_all_app_w2c.csv')
# In[17]:
df_train_w2v = df_train.merge(df_w2c_start, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_close, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_all, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_quchong_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_close_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_all_app_w2c, on='device_id', how='left')
# In[19]:
df_train_w2v['sex'] = df_train_w2v['sex'].apply(lambda x:str(x))
df_train_w2v['age'] = df_train_w2v['age'].apply(lambda x:str(x))
def tool(x):
if x=='nan':
return x
else:
return str(int(float(x)))
df_train_w2v['sex']=df_train_w2v['sex'].apply(tool)
df_train_w2v['age']=df_train_w2v['age'].apply(tool)
df_train_w2v['sex_age']=df_train_w2v['sex']+'-'+df_train_w2v['age']
df_train_w2v = df_train_w2v.replace({'nan':np.NaN,'nan-nan':np.NaN})
# In[33]:
train = df_train_w2v[df_train_w2v['sex'].notnull()]
test = df_train_w2v[df_train_w2v['sex'].isnull()]
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
X = train.drop(['sex','age','sex_age','device_id'],axis=1)
Y = train['sex_age']
Y_CAT = pd.Categorical(Y)
Y = pd.Series(Y_CAT.codes)
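`pd.Categorical(...).codes` is a compact way to turn the 22 `sex-age` strings into integer class ids; the codes follow the lexicographically sorted category order. A toy illustration:

```python
import pandas as pd

y = pd.Series(['1-0', '2-3', '1-0', '2-10'])
cat = pd.Categorical(y)
codes = pd.Series(cat.codes)   # integer id per row, in category order
print(list(cat.categories), codes.tolist())
```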
# In[36]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 2018
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
oof_preds = np.zeros([train.shape[0], 22])
sub_list = []
cate_feat = ['device_type','device_brand']
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
lgb_train=lgb.Dataset(train_x,label=train_y)
lgb_eval = lgb.Dataset(valid_x, valid_y, reference=lgb_train)
params = {
'boosting_type': 'gbdt',
        'learning_rate' : 0.02,
'max_depth':5,
'num_leaves' : 2 ** 4,
'metric': {'multi_logloss'},
'num_class' : 22,
'objective' : 'multiclass',
'random_state' : 2018,
'bagging_freq' : 5,
'feature_fraction' : 0.7,
'bagging_fraction' : 0.7,
'min_split_gain' : 0.0970905919552776,
'min_child_weight' : 9.42012323936088,
}
gbm = lgb.train(params,
lgb_train,
num_boost_round=600,
valid_sets=[lgb_train, lgb_eval],
#early_stopping_rounds=200,
verbose_eval=100)
oof_preds[valid_idx] = gbm.predict(valid_x[X.columns.values])
oof_train = pd.DataFrame(oof_preds)
oof_train.columns = ['lgb_sex_age_prob_oof_' + str(i) for i in range(22)]
train = pd.concat([train, oof_train], axis=1)
# In[37]:
# retrain on the full training set to predict the test set
lgb_train = lgb.Dataset(X,label=Y)
gbm = lgb.train(params, lgb_train, num_boost_round=600, valid_sets=lgb_train, verbose_eval=100)
test = test.reset_index(drop=True)
test_preds = gbm.predict(test[X.columns.values])
oof_test = pd.DataFrame(test_preds)
oof_test.columns = ['lgb_sex_age_prob_oof_' + str(i) for i in range(22)]
test = pd.concat([test, oof_test], axis=1)
# In[39]:
df_sex_age_prob_oof = pd.concat([train[['device_id'] + ['lgb_sex_age_prob_oof_' + str(i) for i in range(22)] ],
test[['device_id'] + ['lgb_sex_age_prob_oof_' + str(i) for i in range(22)] ]])
df_sex_age_prob_oof.to_csv('lgb_sex_age_prob_oof.csv', index=None)
================================================
FILE: THLUO/21.tfidf_lr_sex_age_prob_oof.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import gc
from sklearn.metrics import log_loss
# In[2]:
print('21.tfidf_lr.py')
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
# encode brand/type strings as integer ids
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# encode app-type labels as integer ids
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
deviceid_train = pd.concat([deviceid_train, deviceid_test])
# In[4]:
deviceid_package_start = deviceid_package_start_close[['device_id', 'app_id', 'start_time']].copy()
deviceid_package_start.columns = ['device_id', 'app_id', 'all_time']
deviceid_package_close = deviceid_package_start_close[['device_id', 'app_id', 'close_time']].copy()
deviceid_package_close.columns = ['device_id', 'app_id', 'all_time']
deviceid_package_all = pd.concat([deviceid_package_start, deviceid_package_close])
deviceid_package_all = deviceid_package_all.sort_values(by='all_time')
#deviceid_package_all = deviceid_package_all.merge(deviceid_train, on='device_id', how='left')
# In[5]:
df = deviceid_package_all.groupby('device_id').apply(lambda x : list(x.app_id)).reset_index().rename(columns = {0 : 'app_list'})
# In[6]:
df_sex_prob_oof = pd.read_csv('device_sex_prob_oof.csv')
df_age_prob_oof = pd.read_csv('device_age_prob_oof.csv')
df_start_close_sex_prob_oof = pd.read_csv('start_close_sex_prob_oof.csv')
df_start_close_age_prob_oof = pd.read_csv('start_close_age_prob_oof.csv')
df_start_close_sex_age_prob_oof = pd.read_csv('start_close_sex_age_prob_oof.csv')
gc.collect()
df = df.merge(df_sex_prob_oof, on='device_id', how='left')
df = df.merge(df_age_prob_oof, on='device_id', how='left')
df = df.merge(df_start_close_sex_prob_oof, on='device_id', how='left')
df = df.merge(df_start_close_age_prob_oof, on='device_id', how='left')
df = df.merge(df_start_close_sex_age_prob_oof, on='device_id', how='left')
df.fillna(0, inplace=True)
apps = df['app_list'].apply(lambda x:' '.join(x)).tolist()
del df['app_list']
df = df.merge(deviceid_train, on='device_id', how='left')
# In[8]:
vectorizer=CountVectorizer()
transformer=TfidfTransformer()
cntTf = vectorizer.fit_transform(apps)
tfidf=transformer.fit_transform(cntTf)
word=vectorizer.get_feature_names()
weight=tfidf.toarray()
df_weight=pd.DataFrame(weight)
feature=df_weight.columns
# In[9]:
for i in df.columns.values:
    df_weight[i] = df[i]
# In[11]:
df_weight['sex'] = df_weight['sex'].apply(lambda x:str(x))
df_weight['age'] = df_weight['age'].apply(lambda x:str(x))
def tool(x):
if x == 'nan':
return x
else:
return str(int(float(x)))
df_weight['sex'] = df_weight['sex'].apply(tool)
df_weight['age'] = df_weight['age'].apply(tool)
df_weight['sex_age'] = df_weight['sex']+'-'+df_weight['age']
df_weight['sex_age'] = df_weight.sex_age.replace({'nan':np.NaN,'nan-nan':np.NaN})
# In[12]:
train = df_weight[df_weight.sex_age.notnull()]
train.reset_index(drop=True, inplace=True)
test = df_weight[df_weight.sex_age.isnull()]
test.reset_index(drop=True, inplace=True)
gc.collect()
# In[16]:
X = train.drop(['sex','age','sex_age','device_id'],axis=1)
Y = train['sex_age']
Y_CAT = pd.Categorical(Y)
Y = pd.Series(Y_CAT.codes)
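`pd.Categorical` assigns each distinct `sex-age` string a stable integer code, with categories sorted lexicographically. A tiny illustration with hypothetical labels:

```python
import pandas as pd

cat = pd.Categorical(['1-0', '2-5', '1-0', '2-10'])
codes = pd.Series(cat.codes)
# categories are sorted lexicographically: ['1-0', '2-10', '2-5']
# so the codes come out as [0, 2, 0, 1]
assert list(codes) == [0, 2, 0, 1]
```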
# In[18]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 666
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
oof_preds = np.zeros([train.shape[0], 22])
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
clf = LogisticRegression(C=4)
clf.fit(train_x, train_y)
valid_preds=clf.predict_proba(valid_x)
train_preds=clf.predict_proba(train_x)
oof_preds[valid_idx] = valid_preds
print (log_loss(train_y.values, train_preds), log_loss(valid_y.values, valid_preds))
oof_train = pd.DataFrame(oof_preds)
oof_train.columns = ['tfidf_lr_sex_age_prob_oof_' + str(i) for i in range(22)]
train_temp = pd.concat([train[['device_id']], oof_train], axis=1)
# In[20]:
# Refit on all labeled data to predict the test set
clf = LogisticRegression(C=4)
clf.fit(X, Y)
train_preds=clf.predict_proba(X)
test_preds=clf.predict_proba(test[X.columns])
print (log_loss(Y.values, train_preds))
oof_test = pd.DataFrame(test_preds)
oof_test.columns = ['tfidf_lr_sex_age_prob_oof_' + str(i) for i in range(22)]
# In[24]:
oof_test
# In[25]:
test_temp = pd.concat([test[['device_id']], oof_test], axis=1)
test_temp
# In[26]:
sex_age_oof = pd.concat([train_temp, test_temp])
sex_age_oof
# In[29]:
sex_age_oof.to_csv('tfidf_lr_sex_age_prob_oof.csv', index=None)
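The script above follows the standard out-of-fold stacking pattern: each train row is scored only by the fold model that never saw it, while the test set is scored by a full refit. A minimal self-contained sketch of that pattern on toy data (hypothetical shapes, not the competition data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(666)
X = pd.DataFrame(rng.randn(100, 5))
y = pd.Series(rng.randint(0, 3, 100))

oof = np.zeros((len(X), 3))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=666)
for trn_idx, val_idx in folds.split(X, y):
    clf = LogisticRegression(C=4, max_iter=1000)
    clf.fit(X.iloc[trn_idx], y.iloc[trn_idx])
    # each train row is predicted only by the model that never saw it
    oof[val_idx] = clf.predict_proba(X.iloc[val_idx])

# every row received exactly one set of fold probabilities
assert np.allclose(oof.sum(axis=1), 1.0)
```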
================================================
FILE: THLUO/22.base_feat.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
# In[2]:
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
deviceid_brand = pd.read_csv(path+'deviceid_brand.tsv',sep='\t', names=['device_id','device_brand', 'device_type'])
deviceid_package_start_close = pd.read_csv(path+'deviceid_package_start_close.tsv',sep='\t', names=['device_id','app_id','start_time','close_time'])
package_label = pd.read_csv(path+'package_label.tsv',sep='\t',names=['app_id','app_parent_type', 'app_child_type'])
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : str(x).split(' ')[0])
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 1].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 2].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_2' if x in one_time_brand else x)
df_temp = deviceid_brand.groupby('device_brand')['device_id'].count().reset_index().rename(columns={'device_id':'brand_counts'})
one_time_brand = df_temp[df_temp.brand_counts == 3].device_brand.values
deviceid_brand['device_brand'] = deviceid_brand.device_brand.apply(lambda x : 'other_3' if x in one_time_brand else x)
# Encode categorical values as integers
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_brand.values))
deviceid_brand['device_brand'] = lbl.transform(list(deviceid_brand.device_brand.values))
lbl = LabelEncoder()
lbl.fit(list(deviceid_brand.device_type.values))
deviceid_brand['device_type'] = lbl.transform(list(deviceid_brand.device_type.values))
# Encode categorical values as integers
lbl = LabelEncoder()
lbl.fit(list(package_label.app_parent_type.values))
package_label['app_parent_type'] = lbl.transform(list(package_label.app_parent_type.values))
lbl = LabelEncoder()
lbl.fit(list(package_label.app_child_type.values))
package_label['app_child_type'] = lbl.transform(list(package_label.app_child_type.values))
# In[4]:
import time
# Convert a millisecond epoch timestamp to a "%Y-%m-%d %H:%M:%S" string
def timeStamp(timeNum):
timeStamp = float(timeNum/1000)
timeArray = time.localtime(timeStamp)
otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
return otherStyleTime
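Note that `timeStamp` goes through `time.localtime`, so the derived hour features depend on the machine's timezone. pandas can do the same conversion vectorized (tz-naive UTC) with `unit='ms'`; a small sketch:

```python
import pandas as pd

ms = pd.Series([1514736000000])          # milliseconds since the epoch
dates = pd.to_datetime(ms, unit='ms')    # vectorized, tz-naive UTC
print(dates.dt.hour.iloc[0])             # -> 16 (2017-12-31 16:00:00 UTC)
```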
# Parse concrete datetime fields out of the raw timestamps
deviceid_package_start_close['start_date'] = pd.to_datetime(deviceid_package_start_close.start_time.apply(timeStamp))
deviceid_package_start_close['end_date'] = pd.to_datetime(deviceid_package_start_close.close_time.apply(timeStamp))
deviceid_package_start_close['start_hour'] = deviceid_package_start_close.start_date.dt.hour
deviceid_package_start_close['end_hour'] = deviceid_package_start_close.end_date.dt.hour
deviceid_package_start_close['time_gap'] = (deviceid_package_start_close['end_date'] - deviceid_package_start_close['start_date']).astype('timedelta64[s]')
deviceid_package_start_close = deviceid_package_start_close.merge(package_label, on='app_id', how='left')
deviceid_package_start_close.app_parent_type.fillna(-1, inplace=True)
deviceid_package_start_close.app_child_type.fillna(-1, inplace=True)
deviceid_package_start_close['start_year'] = deviceid_package_start_close.start_date.dt.year
deviceid_package_start_close['end_year'] = deviceid_package_start_close.end_date.dt.year
deviceid_package_start_close['year_gap'] = deviceid_package_start_close['end_year'] - deviceid_package_start_close['start_year']
# In[5]:
deviceid_train=pd.concat([deviceid_train,deviceid_test])
# In[6]:
deviceid_packages['apps']=deviceid_packages['apps'].apply(lambda x:x.split(','))
deviceid_packages['app_lenghth']=deviceid_packages['apps'].apply(lambda x:len(x))
# Feature engineering
def open_app_timegap_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['time_gap'].mean().reset_index().rename(columns = {'time_gap': 'mean_time_gap'})
df_mean_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='mean_time_gap').reset_index()
df_mean_temp.columns = ['device_id'] + ['open_app_timegap_in_'+str(i) + '_mean_hour' for i in range(0,24)]
df_mean_temp.fillna(0, inplace=True)
return df_mean_temp
# In[8]:
def device_start_end_app_timegap() :
    # Gaps between a device's consecutive app start times
df_ = deviceid_package_start_close.sort_values(by=['device_id', 'start_date'], ascending=False)
df_['prev_start_date'] = df_.groupby('device_id')['start_date'].shift(-1)
df_['start_date_gap'] = (df_['start_date'] - df_['prev_start_date']).astype('timedelta64[s]')
agg_dic = {'start_date_gap' : ['min', 'max', 'mean', 'median', 'std']}
df_start_gap_agg = df_.groupby('device_id').agg(agg_dic)
df_start_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_start_gap_agg.columns.tolist()])
df_start_gap_agg = df_start_gap_agg.reset_index()
#del df_
gc.collect()
    # Gaps between consecutive app close times
df_ = deviceid_package_start_close.sort_values(by=['device_id', 'end_date'], ascending=False)
df_['prev_end_date'] = df_.groupby('device_id')['end_date'].shift(-1)
df_['end_date_gap'] = (df_['end_date'] - df_['prev_end_date']).astype('timedelta64[s]')
agg_dic = {'end_date_gap' : ['min', 'max', 'mean', 'median', 'std']}
df_end_gap_agg = df_.groupby('device_id').agg(agg_dic)
df_end_gap_agg.columns = pd.Index(['device_' + e[0] + "_" + e[1].upper() for e in df_end_gap_agg.columns.tolist()])
df_end_gap_agg = df_end_gap_agg.reset_index()
#del df_
gc.collect()
df_agg = df_start_gap_agg.merge(df_end_gap_agg, on='device_id', how='left')
#df_agg = df_agg.merge(df_app_start_gap_agg, on='device_id', how='left')
#df_agg = df_agg.merge(df_app_end_gap_agg, on='device_id', how='left')
return df_agg
def open_app_counts_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'start_hour'])['app_id'].count().reset_index().rename(columns = {'app_id': 'app_counts'})
df_temp = pd.pivot_table(df_temp, index='device_id', columns='start_hour', values='app_counts').reset_index()
df_temp.columns = ['device_id'] + ['open_app_counts_in'+str(i) + '_hour' for i in range(0,24)]
df_temp.fillna(0, inplace=True)
return df_temp
def close_app_counts_in_hour() :
df_temp = deviceid_package_start_close.groupby(['device_id', 'end_hour'])['app_id'].count().reset_index().rename(columns = {'app_id': 'app_counts'})
df_temp = pd.pivot_table(df_temp, index='device_id', columns='end_hour', values='app_counts').reset_index()
df_temp.columns = ['device_id'] + ['close_app_counts_in'+str(i) + '_hour' for i in range(0,24)]
df_temp.fillna(0, inplace=True)
return df_temp
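Both hourly-count helpers rely on the same groupby-then-pivot idiom: count events per `(device_id, hour)`, then pivot the hour into 24 wide columns. A toy illustration with hypothetical data:

```python
import pandas as pd

logs = pd.DataFrame({'device_id': ['d1', 'd1', 'd2'],
                     'start_hour': [9, 9, 23],
                     'app_id': ['a', 'b', 'a']})
counts = logs.groupby(['device_id', 'start_hour'])['app_id'].count().reset_index()
wide = pd.pivot_table(counts, index='device_id', columns='start_hour',
                      values='app_id').reset_index().fillna(0)
# d1 opened 2 apps at hour 9; d2 opened 1 app at hour 23; missing hours become 0
```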
def app_type_mean_time_gap_one_hot () :
df_temp = deviceid_package_start_close.groupby(['device_id', 'app_parent_type'])['time_gap'].mean().reset_index()
df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='time_gap').reset_index()
df_temp.columns = ['device_id'] + ['app_parent_type_mean_time_gap'+str(i) for i in range(-1,45)]
df_temp.fillna(-1, inplace=True)
return df_temp
def device_active_hour() :
aggregations = {
'start_hour' : ['std','mean','max','min'],
'end_hour' : ['std','mean','max','min']
}
df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
def device_brand_encoding() :
df_temp = deviceid_brand.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
aggregations = {
'age' : ['std','mean'],
'sex' : ['mean'],
}
df_device_brand = df_temp.groupby('device_brand').agg(aggregations)
df_device_brand.columns = pd.Index(['device_brand_' + e[0] + "_" + e[1].upper() for e in df_device_brand.columns.tolist()])
df_device_brand = df_device_brand.reset_index()
df_device_type = df_temp.groupby('device_type').agg(aggregations)
df_device_type.columns = pd.Index(['device_type_' + e[0] + "_" + e[1].upper() for e in df_device_type.columns.tolist()])
df_device_type = df_device_type.reset_index()
df_temp = df_temp.merge(df_device_brand, on='device_brand', how='left')
df_temp = df_temp.merge(df_device_type, on='device_type', how='left')
aggregations = {
'device_brand_age_STD' : ['mean'],
'device_brand_age_MEAN' : ['mean'],
'device_brand_sex_MEAN' : ['mean'],
#'device_type_age_STD' : ['mean'],
#'device_type_age_MEAN' : ['mean'],
#'device_type_sex_MEAN' : ['mean']
}
df_agg = df_temp.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
# Statistics of each device's app usage
def device_active_time_time_stat() :
    # Stats of app session durations per device
deviceid_package_start_close['active_time'] = deviceid_package_start_close['close_time'] - deviceid_package_start_close['start_time']
    # How many times the device opened an app, and how many distinct apps
aggregations = {
'app_id' : ['count', 'nunique'],
'active_time' : ['mean', 'std', 'max', 'min'],
}
df_agg = deviceid_package_start_close.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
aggregations = {
'active_time' : ['mean', 'std', 'max', 'min', 'count'],
}
df_da_agg = deviceid_package_start_close.groupby(['device_id', 'app_id']).agg(aggregations)
df_da_agg.columns = pd.Index(['device_app_grouped_' + e[0] + "_" + e[1].upper() for e in df_da_agg.columns.tolist()])
df_da_agg = df_da_agg.reset_index()
    # Device-level stats over the per-app session-duration aggregates
aggregations = {
'device_app_grouped_active_time_MEAN' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_STD' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_MAX' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_MIN' : ['mean', 'std', 'max', 'min'],
'device_app_grouped_active_time_COUNT' : ['mean', 'std', 'max', 'min'],
}
df_temp = df_da_agg.groupby(['device_id']).agg(aggregations)
df_temp.columns = pd.Index([e[0] + "_" + e[1].upper() for e in df_temp.columns.tolist()])
df_temp = df_temp.reset_index()
df_agg = df_agg.merge(df_temp, on='device_id', how='left')
return df_agg
def app_type_encoding() :
df_temp = df_device_app_pair.merge(deviceid_train[['device_id', 'age', 'sex']], on='device_id', how='left')
aggregations = {
'age' : ['std','mean'],
'sex' : ['mean'],
}
df_agg_app_parent_type = df_temp.groupby('app_parent_type').agg(aggregations)
df_agg_app_parent_type.columns = pd.Index(['app_parent_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_parent_type.columns.tolist()])
df_agg_app_parent_type = df_agg_app_parent_type.reset_index()
df_agg_app_child_type = df_temp.groupby('app_child_type').agg(aggregations)
df_agg_app_child_type.columns = pd.Index(['app_child_type_' + e[0] + "_" + e[1].upper() for e in df_agg_app_child_type.columns.tolist()])
df_agg_app_child_type = df_agg_app_child_type.reset_index()
df_temp = df_temp.merge(df_agg_app_parent_type, on='app_parent_type', how='left')
df_temp = df_temp.merge(df_agg_app_child_type, on='app_child_type', how='left')
aggregations = {
'app_parent_type_age_STD' : ['mean'],
'app_parent_type_age_MEAN' : ['mean'],
'app_parent_type_sex_MEAN' : ['mean'],
'app_child_type_age_STD' : ['mean'],
'app_child_type_age_MEAN' : ['mean'],
'app_child_type_sex_MEAN' : ['mean']
}
df_agg = df_temp.groupby('device_id').agg(aggregations)
df_agg.columns = pd.Index(['device_grouped_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
df_agg = df_agg.reset_index()
return df_agg
# Per-device counts of each app_parent_type, pivoted to one-hot columns
def app_type_onehot_in_device(df) :
df_copy = df.fillna(-1)
df_temp = df_copy.groupby(['device_id', 'app_parent_type'])['app_id'].size().reset_index()
df_temp.rename(columns = {'app_id' : 'app_parent_type_counts'}, inplace=True)
df_temp = pd.pivot_table(df_temp, index='device_id', columns='app_parent_type', values='app_parent_type_counts').reset_index()
df_temp.columns = ['device_id'] + ['app_parent_type'+str(i) for i in range(-1,45)]
df_temp.fillna(0, inplace=True)
return df_temp
apps=deviceid_packages['apps'].apply(lambda x:' '.join(x)).tolist()
vectorizer=CountVectorizer()
transformer=TfidfTransformer()
cntTf = vectorizer.fit_transform(apps)
tfidf=transformer.fit_transform(cntTf)
word=vectorizer.get_feature_names()
weight=tfidf.toarray()
df_weight=pd.DataFrame(weight)
feature=df_weight.columns
df_weight['sum']=0
for f in tqdm(feature):
df_weight['sum']+=df_weight[f]
deviceid_packages['tfidf_sum']=df_weight['sum']
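The column-by-column accumulation above is equivalent to a single row-wise sum; a quick check of that equivalence on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.0], 'b': [0.2, 0.5]})
looped = pd.Series(0.0, index=df.index)
for col in df.columns:
    looped += df[col]                         # column-by-column, as above
assert np.allclose(looped, df.sum(axis=1))    # same result, one vectorized call
```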
# In[10]:
lda = LatentDirichletAllocation(n_components=5,  # n_topics was renamed n_components in sklearn 0.19
learning_offset=50.,
random_state=666)
docres = lda.fit_transform(cntTf)
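`fit_transform` returns each device's topic-proportion vector, so every row of `docres` sums to 1 and can be used directly as dense features. A tiny sketch on a toy count matrix:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 0]])
lda = LatentDirichletAllocation(n_components=2, random_state=666)
topics = lda.fit_transform(counts)
assert topics.shape == (3, 2)                   # one topic mixture per document
assert np.allclose(topics.sum(axis=1), 1.0)     # proportions sum to 1
```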
# In[11]:
deviceid_packages = pd.concat([deviceid_packages,pd.DataFrame(docres)],axis=1)
# In[12]:
temp=deviceid_packages.drop('apps',axis=1)
deviceid_train=pd.merge(deviceid_train,temp,on='device_id',how='left')
# In[13]:
# Expand each device's app list into (device_id, app_id) pairs
device_id_arr = []
app_arr = []
df_device_app_pair = pd.DataFrame()
for row in deviceid_packages.values :
device_id = row[0]
app_list = row[1]
for app in app_list :
device_id_arr.append(device_id)
app_arr.append(app)
# Build the pair DataFrame
df_device_app_pair['device_id'] = device_id_arr
df_device_app_pair['app_id'] = app_arr
df_device_app_pair = df_device_app_pair.merge(package_label, how='left', on='app_id')
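In pandas >= 0.25 the explicit pair-building loop above can be replaced by `DataFrame.explode`; a sketch of the equivalence on toy data:

```python
import pandas as pd

packs = pd.DataFrame({'device_id': ['d1', 'd2'],
                      'apps': [['a', 'b'], ['c']]})
# one row per (device_id, app_id) pair, same as the loop above
pairs = (packs.explode('apps')
              .rename(columns={'apps': 'app_id'})
              .reset_index(drop=True))
```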
# In[15]:
# Assemble the features
df_train = deviceid_train.merge(device_active_time_time_stat(), on='device_id', how='left')
df_train = df_train.merge(deviceid_brand, on='device_id', how='left')
df_train = df_train.merge(app_type_onehot_in_device(df_device_app_pair), on='device_id', how='left')
df_train = df_train.merge(app_type_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_active_hour(), on='device_id', how='left')
df_train = df_train.merge(app_type_mean_time_gap_one_hot(), on='device_id', how='left')
df_train = df_train.merge(open_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(close_app_counts_in_hour(), on='device_id', how='left')
df_train = df_train.merge(device_brand_encoding(), on='device_id', how='left')
df_train = df_train.merge(device_start_end_app_timegap(), on='device_id', how='left')
df_train = df_train.merge(open_app_timegap_in_hour(), on='device_id', how='left')
# In[33]:
df_w2c_start = pd.read_csv('device_start_app_w2c.csv')
df_sex_prob_oof = pd.read_csv('device_sex_prob_oof.csv')
df_age_prob_oof = pd.read_csv('device_age_prob_oof.csv')
df_w2c_close = pd.read_csv('device_close_app_w2c.csv')
df_w2c_all = pd.read_csv('device_all_app_w2c.csv')
df_start_close_sex_prob_oof = pd.read_csv('start_close_sex_prob_oof.csv')
# The next two overfit: offline CV did not match the leaderboard
df_start_close_age_prob_oof = pd.read_csv('start_close_age_prob_oof.csv')
df_device_quchong_start_app_w2c = pd.read_csv('device_quchong_start_app_w2c.csv')
df_tfidf_lr_sex_age_prob_oof = pd.read_csv('tfidf_lr_sex_age_prob_oof.csv')
df_device_app_unique_start_app_w2c = pd.read_csv('device_app_unique_start_app_w2c.csv')
df_device_app_unique_close_app_w2c = pd.read_csv('device_app_unique_close_app_w2c.csv')
df_device_app_unique_all_app_w2c = pd.read_csv('device_app_unique_all_app_w2c.csv')
# Earlier features that proved useful
df_sex_age_bin_prob_oof = pd.read_csv('sex_age_bin_prob_oof.csv')
df_age_bin_prob_oof = pd.read_csv('age_bin_prob_oof.csv')
df_hcc_device_brand_age_sex = pd.read_csv('hcc_device_brand_age_sex.csv')
df_device_age_regression_prob_oof = pd.read_csv('device_age_regression_prob_oof.csv')
df_device_start_GRU_pred = pd.read_csv('device_start_GRU_pred.csv')
df_device_start_GRU_pred_age = pd.read_csv('device_start_GRU_pred_age.csv')
df_device_all_GRU_pred = pd.read_csv('device_all_GRU_pred.csv')
df_device_start_capsule_pred = pd.read_csv('device_start_capsule_pred.csv')
df_lgb_sex_age_prob_oof = pd.read_csv('lgb_sex_age_prob_oof.csv')
df_device_start_textcnn_pred = pd.read_csv('device_start_textcnn_pred.csv')
df_device_start_text_dpcnn_pred = pd.read_csv('device_start_text_dpcnn_pred.csv')
df_device_start_lstm_pred = pd.read_csv('device_start_lstm_pred.csv')
# Drop overfitting feature columns
del df_start_close_age_prob_oof['device_app_groupedstart_close_age_prob_oof_4_MEAN']
del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MIN']
del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MAX']
# In[35]:
df_train_w2v = df_train.merge(df_w2c_start, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_sex_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_close, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_w2c_all, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_start_close_sex_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_start_close_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_quchong_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_tfidf_lr_sex_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_start_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_close_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_app_unique_all_app_w2c, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_sex_age_bin_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_age_bin_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_hcc_device_brand_age_sex, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_age_regression_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred_age, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_all_GRU_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_capsule_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_lgb_sex_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_textcnn_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_text_dpcnn_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_lstm_pred, on='device_id', how='left')
# In[24]:
df_train_w2v['sex'] = df_train_w2v['sex'].apply(lambda x:str(x))
df_train_w2v['age'] = df_train_w2v['age'].apply(lambda x:str(x))
def tool(x):
if x=='nan':
return x
else:
return str(int(float(x)))
df_train_w2v['sex']=df_train_w2v['sex'].apply(tool)
df_train_w2v['age']=df_train_w2v['age'].apply(tool)
df_train_w2v['sex_age']=df_train_w2v['sex']+'-'+df_train_w2v['age']
df_train_w2v = df_train_w2v.replace({'nan':np.NaN,'nan-nan':np.NaN})
# In[ ]:
df_train_w2v.to_csv('thluo_train_best_feat.csv', index=None)
================================================
FILE: THLUO/23.ATT_v6.py
================================================
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from gensim.models import FastText, Word2Vec
import re
from keras.layers import *
from keras.models import *
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing import text, sequence
from keras.callbacks import *
from keras.layers.advanced_activations import LeakyReLU, PReLU
import keras.backend as K
from keras.optimizers import *
from keras.utils import to_categorical
from keras.utils import multi_gpu_model
import tensorflow as tf
#from keras.backend.tensorflow_backend import set_session
#config = tf.ConfigProto()
#config.gpu_options.per_process_gpu_memory_fraction = 0.9
#set_session(tf.Session(config=config))
print ('23.ATT_V6.py')
path="input/"
np.random.seed(1337)
packages = pd.read_csv(path+'deviceid_packages.tsv', sep='\t', names=['device_id', 'apps'])
test = pd.read_csv(path+'deviceid_test.tsv', sep='\t', names=['device_id'])
train = pd.read_csv(path+'deviceid_train.tsv', sep='\t', names=['device_id', 'sex', 'age'])
brand = pd.read_table(path+'deviceid_brand.tsv', names=['device_id', 'vendor', 'version'])
data = pd.read_csv('thluo_train_best_feat.csv')
data.head()
train = pd.merge(train, data, on='device_id', how='left')
test = pd.merge(test, data, on='device_id', how='left')
train.head()
X_h = train.drop(['device_id', 'sex', 'age'], axis=1).values
X_h_test = test.drop(['device_id'], axis=1).values
packages['app_lenghth'] = packages['apps'].apply(lambda x:x.split(',')).apply(lambda x:len(x))
packages['app_list'] = packages['apps'].apply(lambda x:x.split(','))
train = pd.merge(train, packages, on='device_id', how='left')
test = pd.merge(test, packages, on='device_id', how='left')
embed_size = 128
fastmodel = FastText(list(packages['app_list']), size=embed_size, window=4, min_count=3, negative=2,
sg=1, sample=0.002, hs=1, workers=4)
embedding_fast = pd.DataFrame([fastmodel[word] for word in (fastmodel.wv.vocab)])
embedding_fast['app'] = list(fastmodel.wv.vocab)
embedding_fast.columns= ["fdim_%s" % str(i) for i in range(embed_size)]+["app"]
tokenizer = Tokenizer(lower=False, char_level=False, split=',')
tokenizer.fit_on_texts(list(packages['apps']))
X_seq = tokenizer.texts_to_sequences(train['apps'])
X_test_seq = tokenizer.texts_to_sequences(test['apps'])
maxlen = 50
X = pad_sequences(X_seq, maxlen=maxlen, value=0)
X_test = pad_sequences(X_test_seq, maxlen=maxlen, value=0)
Y_sex = train['sex']-1
max_features = 35001
embedding_matrix = np.zeros((max_features, embed_size))
for word in tokenizer.word_index:
if word not in fastmodel.wv.vocab:
continue
embedding_matrix[tokenizer.word_index[word]] = fastmodel[word]
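The loop above maps each tokenizer index to its FastText vector, leaving all-zero rows for out-of-vocabulary tokens (row 0 is also reserved for padding). A framework-free sketch with a hypothetical word index:

```python
import numpy as np

word_index = {'app_a': 1, 'app_b': 2, 'app_c': 3}          # hypothetical tokenizer output
vectors = {'app_a': np.ones(4), 'app_c': np.full(4, 2.0)}  # 'app_b' is OOV

matrix = np.zeros((len(word_index) + 1, 4))                # row 0 reserved for padding
for word, idx in word_index.items():
    if word in vectors:
        matrix[idx] = vectors[word]

assert np.allclose(matrix[2], 0)    # OOV token keeps a zero row
assert np.allclose(matrix[1], 1)    # known token got its vector
```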
def dot_product(x, kernel):
    """
    Wrapper for the dot product operation, to stay compatible with both
    Theano and TensorFlow backends.
    Args:
        x: input tensor
        kernel: weight vector
    Returns:
        The dot product of x and kernel (squeezed on the last axis under TensorFlow).
    """
if K.backend() == 'tensorflow':
return K.squeeze(K.dot(x, K.expand_dims(kernel)), axis=-1)
else:
return K.dot(x, kernel)
class AttentionWithContext(Layer):
"""
Attention operation, with a context/query vector, for temporal data.
Supports Masking.
Follows the work of Yang et al. [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf]
"Hierarchical Attention Networks for Document Classification"
by using a context vector to assist the attention
# Input shape
3D tensor with shape: `(samples, steps, features)`.
# Output shape
2D tensor with shape: `(samples, features)`.
How to use:
Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
The dimensions are inferred based on the output shape of the RNN.
Note: The layer has been tested with Keras 2.0.6
Example:
model.add(LSTM(64, return_sequences=True))
model.add(AttentionWithContext())
# next add a Dense layer (for classification/regression) or whatever...
"""
def __init__(self,
W_regularizer=None, u_regularizer=None, b_regularizer=None,
W_constraint=None, u_constraint=None, b_constraint=None,
bias=True, **kwargs):
self.supports_masking = True
self.init = initializers.get('glorot_uniform')
self.W_regularizer = regularizers.get(W_regularizer)
self.u_regularizer = regularizers.get(u_regularizer)
self.b_regularizer = regularizers.get(b_regularizer)
self.W_constraint = constraints.get(W_constraint)
self.u_constraint = constraints.get(u_constraint)
self.b_constraint = constraints.get(b_constraint)
self.bias = bias
super(AttentionWithContext, self).__init__(**kwargs)
def build(self, input_shape):
assert len(input_shape) == 3
self.W = self.add_weight((input_shape[-1], input_shape[-1],),
initializer=self.init,
name='{}_W'.format(self.name),
regularizer=self.W_regularizer,
constraint=self.W_constraint)
if self.bias:
self.b = self.add_weight((input_shape[-1],),
initializer='zero',
name='{}_b'.format(self.name),
regularizer=self.b_regularizer,
constraint=self.b_constraint)
self.u = self.add_weight((input_shape[-1],),
initializer=self.init,
name='{}_u'.format(self.name),
regularizer=self.u_regularizer,
constraint=self.u_constraint)
super(AttentionWithContext, self).build(input_shape)
def compute_mask(self, input, input_mask=None):
# do not pass the mask to the next layers
return None
def call(self, x, mask=None):
uit = dot_product(x, self.W)
if self.bias:
uit += self.b
uit = K.tanh(uit)
ait = dot_product(uit, self.u)
a = K.exp(ait)
# apply mask after the exp. will be re-normalized next
if mask is not None:
# Cast the mask to floatX to avoid float64 upcasting in theano
a *= K.cast(mask, K.floatx())
# in some cases especially in the early stages of training the sum may be almost zero
# and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
# a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
a = K.expand_dims(a)
weighted_input = x * a
return K.sum(weighted_input, axis=1)
def compute_output_shape(self, input_shape):
return input_shape[0], input_shape[-1]
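Numerically, the layer computes a softmax over timesteps of `u . tanh(W x + b)` and returns the attention-weighted sum of the inputs. A numpy sketch of the forward pass for one sample, with random weights standing in for the learned ones:

```python
import numpy as np

rng = np.random.RandomState(0)
steps, features = 5, 8
x = rng.randn(steps, features)                 # one sample: (steps, features)
W = rng.randn(features, features)
b = rng.randn(features)
u = rng.randn(features)

uit = np.tanh(x @ W + b)                       # (steps, features)
ait = uit @ u                                  # (steps,) attention logits
a = np.exp(ait) / (np.exp(ait).sum() + 1e-7)   # softmax over timesteps (with epsilon)
out = (x * a[:, None]).sum(axis=0)             # (features,) weighted sum

assert out.shape == (features,)
assert np.isclose(a.sum(), 1.0, atol=1e-5)
```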
class AdamW(Optimizer):
def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e-4, # decoupled weight decay (1/4)
epsilon=1e-8, decay=0., **kwargs):
super(AdamW, self).__init__(**kwargs)
with K.name_scope(self.__class__.__name__):
self.iterations = K.variable(0, dtype='int64', name='iterations')
self.lr = K.variable(lr, name='lr')
self.beta_1 = K.variable(beta_1, name='beta_1')
self.beta_2 = K.variable(beta_2, name='beta_2')
self.decay = K.variable(decay, name='decay')
self.wd = K.variable(weight_decay, name='weight_decay') # decoupled weight decay (2/4)
self.epsilon = epsilon
self.initial_decay = decay
@interfaces.legacy_get_updates_support
def get_updates(self, loss, params):
grads = self.get_gradients(loss, params)
self.updates = [K.update_add(self.iterations, 1)]
wd = self.wd # decoupled weight decay (3/4)
lr = self.lr
if self.initial_decay > 0:
lr *= (1. / (1. + self.decay * K.cast(self.iterations,
K.dtype(self.decay))))
t = K.cast(self.iterations, K.floatx()) + 1
lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
(1. - K.pow(self.beta_1, t)))
ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
self.weights = [self.iterations] + ms + vs
for p, g, m, v in zip(params, grads, ms, vs):
m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
p_t = p - lr_t * m_t / (K.sqrt(v_t) + self.epsilon) - lr * wd * p # decoupled weight decay (4/4)
self.updates.append(K.update(m, m_t))
self.updates.append(K.update(v, v_t))
new_p = p_t
# Apply constraints.
if getattr(p, 'constraint', None) is not None:
new_p = p.constraint(new_p)
self.updates.append(K.update(p, new_p))
return self.updates
def get_config(self):
config = {'lr': float(K.get_value(self.lr)),
'beta_1': float(K.get_value(self.beta_1)),
'beta_2': float(K.get_value(self.beta_2)),
'decay': float(K.get_value(self.decay)),
'weight_decay': float(K.get_value(self.wd)),
'epsilon': self.epsilon}
base_config = super(AdamW, self).get_config()
return dict(list(base_config.items()) + list(config.items()))
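The update rule above decouples weight decay from the gradient (the AdamW idea): the term `lr * wd * p` is subtracted from the parameter directly instead of being folded into `g`. A minimal pure-Python sketch of one such step for a scalar parameter (illustrative only; same symbols as `get_updates`):

```python
import math

def adamw_step(p, g, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999,
               epsilon=1e-8, wd=1e-4):
    """One AdamW update for a scalar parameter, mirroring get_updates above."""
    m_t = beta_1 * m + (1.0 - beta_1) * g
    v_t = beta_2 * v + (1.0 - beta_2) * g * g
    # bias-corrected step size, as in the Keras Adam implementation
    lr_t = lr * math.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
    # decoupled weight decay: the decay acts on p directly, not through g
    p_t = p - lr_t * m_t / (math.sqrt(v_t) + epsilon) - lr * wd * p
    return p_t, m_t, v_t

p1, m1, v1 = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
```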
def model_conv1D(embedding_matrix):
K.clear_session()
# The embedding layer containing the word vectors
emb_layer = Embedding(
input_dim=embedding_matrix.shape[0],
output_dim=embedding_matrix.shape[1],
weights=[embedding_matrix],
input_length=maxlen,
trainable=False)
lstm_layer = Bidirectional(GRU(128, recurrent_dropout=0.15, dropout=0.15, return_sequences=True))
att = AttentionWithContext()
# Define inputs
seq = Input(shape=(maxlen,))
# Run inputs through embedding
emb = emb_layer(seq)
lstm = lstm_layer(emb)
# Attention pooling over the BiGRU outputs
att1 = att(lstm)
hin = Input(shape=(X_h.shape[1], ))
htime = Dense(64, activation='relu')(hin)
merge1 = concatenate([att1, htime])
# The MLP that determines the outcome
x = Dropout(0.3)(merge1)
x = BatchNormalization()(x)
x = Dense(200, activation='relu',)(x)
x = Dropout(0.22)(x)
x = BatchNormalization()(x)
x = Dense(200, activation='relu',)(x)
x = Dropout(0.22)(x)
x = BatchNormalization()(x)
pred = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[seq, hin], outputs=pred)
model.compile(loss='binary_crossentropy', optimizer=AdamW(weight_decay=0.08))
return model
kfold = StratifiedKFold(n_splits=5, random_state=20, shuffle=True)
sub1 = np.zeros((X_test.shape[0], ))
oof_pref1 = np.zeros((X.shape[0], 1))
score = []
count=0
for i, (train_index, test_index) in enumerate(kfold.split(X, Y_sex)):
print("FOLD | ",count+1)
filepath="sex_weights_best.h5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.8, patience=2, min_lr=0.0001, verbose=1)
earlystopping = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=6, verbose=1, mode='auto')
callbacks = [checkpoint, reduce_lr, earlystopping]
model_sex = model_conv1D(embedding_matrix)
X_tr, X_vl, X_tr2, X_vl2, y_tr, y_vl = X[train_index], X[test_index], X_h[train_index], X_h[test_index], Y_sex[train_index], Y_sex[test_index]
hist = model_sex.fit([X_tr, X_tr2], y_tr, batch_size=256, epochs=50, validation_data=([X_vl, X_vl2], y_vl),
callbacks=callbacks, verbose=2, shuffle=True)
model_sex.load_weights(filepath)
sub1 += np.squeeze(model_sex.predict([X_test, X_h_test]))/kfold.n_splits
oof_pref1[test_index] = model_sex.predict([X_vl, X_vl2])
score.append(np.min(hist.history['val_loss']))
count+=1
print('log loss:',np.mean(score))
oof_pref1 = pd.DataFrame(oof_pref1, columns=['sex2'])
sub1 = pd.DataFrame(sub1, columns=['sex2'])
res1 = pd.concat([oof_pref1, sub1])
res1['sex1'] = 1-res1['sex2']
res1.to_csv("res1.csv", index=False)
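The fold loop above assembles out-of-fold predictions: each row of `oof_pref1` is filled exactly once, by the one model whose training folds excluded that row. A plain-Python sketch of that bookkeeping (contiguous folds for brevity; the script uses `StratifiedKFold`):

```python
def kfold_indices(n, n_splits):
    """Simple contiguous K-fold split; sklearn's StratifiedKFold is used above."""
    fold_sizes = [n // n_splits + (1 if i < n % n_splits else 0)
                  for i in range(n_splits)]
    folds, start = [], 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, valid))
        start += size
    return folds

n = 10
oof = [None] * n
for train_idx, valid_idx in kfold_indices(n, 5):
    for i in valid_idx:  # each row is predicted by the model that never saw it
        oof[i] = "pred_from_fold_excluding_%d" % i
```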
def model_age_conv(embedding_matrix):
# The embedding layer containing the word vectors
K.clear_session()
emb_layer = Embedding(
input_dim=embedding_matrix.shape[0],
output_dim=embedding_matrix.shape[1],
weights=[embedding_matrix],
input_length=maxlen,
trainable=False
)
lstm_layer = Bidirectional(GRU(128, recurrent_dropout=0.15, dropout=0.15, return_sequences=True))
att = AttentionWithContext()
# Define inputs
seq = Input(shape=(maxlen,))
# Run inputs through embedding
emb = emb_layer(seq)
lstm = lstm_layer(emb)
# Attention pooling over the BiGRU outputs
att1 = att(lstm)
hin = Input(shape=(X_h.shape[1], ))
htime = Dense(64, activation='relu')(hin)
merge1 = concatenate([att1, htime])
# The MLP that determines the outcome
x = Dropout(0.3)(merge1)
x = BatchNormalization()(x)
x = Dense(200, activation='relu',)(x)
x = Dropout(0.22)(x)
x = BatchNormalization()(x)
x = Dense(200, activation='relu',)(x)
x = Dropout(0.22)(x)
x = BatchNormalization()(x)
pred = Dense(11, activation='softmax')(x)
model = Model(inputs=[seq, hin], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer=AdamW(weight_decay=0.08,))
return model
Y_age = to_categorical(train['age'])
X_h = np.hstack([X_h, train['sex'].values.reshape((-1, 1))])
X_h_test1 = np.hstack([X_h_test, np.ones((X_h_test.shape[0], 1))])
X_h_test2 = np.hstack([X_h_test, np.ones((X_h_test.shape[0], 1))*2])
sub2_1 = np.zeros((X_test.shape[0], 11))
sub2_2 = np.zeros((X_test.shape[0], 11))
oof_pref2 = np.zeros((X.shape[0], 11))
score = []
count=0
for i, (train_index, test_index) in enumerate(kfold.split(X, train['age'])):
print("FOLD | ",count+1)
filepath2="age_weights_best_%d.h5"%count
checkpoint2 = ModelCheckpoint(filepath2, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
reduce_lr2 = ReduceLROnPlateau(monitor='val_loss', factor=0.8, patience=2, min_lr=0.0001, verbose=1)
earlystopping2 = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=8, verbose=1, mode='auto')
callbacks2 = [checkpoint2, reduce_lr2, earlystopping2]
model_age = model_age_conv(embedding_matrix)
X_tr, X_vl, X_tr2, X_vl2, y_tr, y_vl = X[train_index], X[test_index], X_h[train_index], X_h[test_index], Y_age[train_index], Y_age[test_index]
hist = model_age.fit([X_tr, X_tr2], y_tr, batch_size=256, epochs=50, validation_data=([X_vl, X_vl2], y_vl),
callbacks=callbacks2, verbose=2, shuffle=True)
model_age.load_weights(filepath2)
oof_pref2[test_index] = model_age.predict([X_vl, X_vl2])
sub2_1 += model_age.predict([X_test, X_h_test1])/kfold.n_splits
sub2_2 += model_age.predict([X_test, X_h_test2])/kfold.n_splits
score.append(np.min(hist.history['val_loss']))
count+=1
print('log loss:',np.mean(score))
res2_1 = np.vstack((oof_pref2, sub2_1))
res2_1 = pd.DataFrame(res2_1)
res2_1.to_csv("res2_1.csv",index=False)
res2_2 = np.vstack((oof_pref2, sub2_2))
res2_2 = pd.DataFrame(res2_2)
res2_2.to_csv("res2_2.csv",index=False)
res1.index=range(len(res1))
res2_1.index=range(len(res2_1))
res2_2.index=range(len(res2_2))
final_1 = res2_1.copy()
final_2 = res2_2.copy()
for i in range(11):
final_1[i] = res1['sex1']*res2_1[i]
final_2[i] = res1['sex2']*res2_2[i]
id_list = pd.concat([train[['device_id']],test[['device_id']]])
final = id_list
final.index = range(len(final))
final.columns= ['device_id']
final_pred = pd.concat([final_1, final_2], axis=1)
final = pd.concat([final, final_pred], axis=1)
final.columns = ['device_id', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
'1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
'2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
final.to_csv('att_nn_feat_v6.csv', index=False)
sub = pd.merge(test[['device_id']], final, on='device_id', how='left')
sub.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
'1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
'2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
sub.to_csv('Att_v6.csv', index=False)
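The 22 output columns are joint probabilities, `P(sex, age) = P(sex) * P(age | sex)`, computed in the fusion loop above. A tiny sketch with hypothetical numbers showing that the 22 products still form a probability distribution:

```python
p_sex = {1: 0.6, 2: 0.4}            # P(sex), from the sex model
p_age_given_sex = {                 # P(age | sex), 11 age bins per sex
    1: [1.0 / 11] * 11,
    2: [1.0 / 11] * 11,
}
# joint probability for each of the 22 sex-age classes, e.g. '1-0' ... '2-10'
joint = {"%d-%d" % (s, a): p_sex[s] * p_age_given_sex[s][a]
         for s in (1, 2) for a in range(11)}
```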
================================================
FILE: THLUO/24.thluo_22_lgb.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
# In[24]:
df_train_w2v = pd.read_csv('thluo_train_best_feat.csv')
df_att_nn_feat_v6 = pd.read_csv('att_nn_feat_v6.csv')
df_att_nn_feat_v6.columns = ['device_id'] + ['att_nn_feat_' + str(i) for i in range(22)]
df_train_w2v = df_train_w2v.merge(df_att_nn_feat_v6, on='device_id', how='left')
# In[ ]:
df_train_w2v.to_csv('thluo_train_best_feat.csv', index=None)
# In[26]:
train = df_train_w2v[df_train_w2v['sex'].notnull()]
test = df_train_w2v[df_train_w2v['sex'].isnull()]
X = train.drop(['sex','age','sex_age','device_id'],axis=1)
Y = train['sex_age']
Y_CAT = pd.Categorical(Y)
Y = pd.Series(Y_CAT.codes)
# In[28]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 666
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
sub_list = []
cate_feat = ['device_type','device_brand']
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
lgb_train=lgb.Dataset(train_x,label=train_y)
lgb_eval = lgb.Dataset(valid_x, valid_y, reference=lgb_train)
params = {
'boosting_type': 'gbdt',
#'learning_rate' : 0.02,
'learning_rate' : 0.01,
'max_depth':5,
'num_leaves' : 2 ** 4,
'metric': {'multi_logloss'},
'num_class' : 22,
'objective' : 'multiclass',
'random_state' : 2018,
'bagging_freq' : 5,
'feature_fraction' : 0.7,
'bagging_fraction' : 0.7,
'min_split_gain' : 0.0970905919552776,
'min_child_weight' : 9.42012323936088,
}
gbm = lgb.train(params,
lgb_train,
num_boost_round=1000,
valid_sets=lgb_eval,
early_stopping_rounds=200, verbose_eval=100)
sub = pd.DataFrame(gbm.predict(test[X.columns.values],num_iteration=gbm.best_iteration))
sub_list.append(sub)
# In[29]:
sub = (sub_list[0] + sub_list[1] + sub_list[2] + sub_list[3] + sub_list[4]) / num_folds
# In[31]:
sub.columns=Y_CAT.categories
sub['DeviceID']=test['device_id'].values
sub=sub[['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6', '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4', '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']]
# In[32]:
sub.to_csv('th_22_results_lgb.csv',index=False)
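The script maps the `sex_age` strings to integer codes with `pd.Categorical` and restores the column names from `Y_CAT.categories` after averaging the fold predictions. A plain-Python sketch of that encode/decode round trip (pandas derives the categories from the sorted unique labels in the same way):

```python
labels = ['1-0', '2-3', '1-0', '2-10', '1-5']
categories = sorted(set(labels))              # what pd.Categorical(...).categories holds
codes = [categories.index(x) for x in labels]  # what .codes holds

# decode: integer class index -> original 'sex-age' label
decoded = [categories[c] for c in codes]
```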
================================================
FILE: THLUO/25.thluo_22_xgb.py
================================================
# coding: utf-8
# In[1]:
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.metrics import accuracy_score
import lightgbm as lgb
import xgboost as xgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import gc
from feat_util import *
# In[2]:
print ('25.thluo_22_xgb.py')
path='input/'
data=pd.DataFrame()
#sex_age=pd.read_excel('./data/性别年龄对照表.xlsx')
# In[3]:
deviceid_packages=pd.read_csv(path+'deviceid_packages.tsv',sep='\t',names=['device_id','apps'])
deviceid_test=pd.read_csv(path+'deviceid_test.tsv',sep='\t',names=['device_id'])
deviceid_train=pd.read_csv(path+'deviceid_train.tsv',sep='\t',names=['device_id','sex','age'])
# In[4]:
df_train = pd.concat([deviceid_train, deviceid_test])
# In[5]:
df_train
# In[6]:
df_sex_prob_oof = pd.read_csv('device_sex_prob_oof.csv')
df_age_prob_oof = pd.read_csv('device_age_prob_oof.csv')
df_start_close_sex_prob_oof = pd.read_csv('start_close_sex_prob_oof.csv')
# The next two do not track the leaderboard; they overfit offline
df_start_close_age_prob_oof = pd.read_csv('start_close_age_prob_oof.csv')
df_tfidf_lr_sex_age_prob_oof = pd.read_csv('tfidf_lr_sex_age_prob_oof.csv')
# Earlier features that proved useful
df_sex_age_bin_prob_oof = pd.read_csv('sex_age_bin_prob_oof.csv')
df_age_bin_prob_oof = pd.read_csv('age_bin_prob_oof.csv')
df_hcc_device_brand_age_sex = pd.read_csv('hcc_device_brand_age_sex.csv')
df_device_age_regression_prob_oof = pd.read_csv('device_age_regression_prob_oof.csv')
df_device_start_GRU_pred = pd.read_csv('device_start_GRU_pred.csv')
df_device_start_GRU_pred_age = pd.read_csv('device_start_GRU_pred_age.csv')
df_device_all_GRU_pred = pd.read_csv('device_all_GRU_pred.csv')
df_lgb_sex_age_prob_oof = pd.read_csv('lgb_sex_age_prob_oof.csv')
df_device_start_capsule_pred = pd.read_csv('device_start_capsule_pred.csv')
df_device_start_textcnn_pred = pd.read_csv('device_start_textcnn_pred.csv')
df_device_start_text_dpcnn_pred = pd.read_csv('device_start_text_dpcnn_pred.csv')
df_device_start_lstm_pred = pd.read_csv('device_start_lstm_pred.csv')
df_att_nn_feat_v6 = pd.read_csv('att_nn_feat_v6.csv')
df_att_nn_feat_v6.columns = ['device_id'] + ['att_nn_feat_' + str(i) for i in range(22)]
# Drop features that overfit
del df_start_close_age_prob_oof['device_app_groupedstart_close_age_prob_oof_4_MEAN']
del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MIN']
del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MAX']
# In[7]:
df_train_w2v = df_train.merge(df_sex_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_start_close_sex_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_start_close_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_sex_age_bin_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_age_bin_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_hcc_device_brand_age_sex, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_age_regression_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred_age, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_all_GRU_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_lgb_sex_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_capsule_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_textcnn_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_text_dpcnn_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_lstm_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_att_nn_feat_v6, on='device_id', how='left')
# In[9]:
df_train_w2v['sex'] = df_train_w2v['sex'].apply(lambda x:str(x))
df_train_w2v['age'] = df_train_w2v['age'].apply(lambda x:str(x))
def tool(x):
if x=='nan':
return x
else:
return str(int(float(x)))
df_train_w2v['sex']=df_train_w2v['sex'].apply(tool)
df_train_w2v['age']=df_train_w2v['age'].apply(tool)
df_train_w2v['sex_age']=df_train_w2v['sex']+'-'+df_train_w2v['age']
df_train_w2v = df_train_w2v.replace({'nan':np.NaN,'nan-nan':np.NaN})
# In[11]:
train = df_train_w2v[df_train_w2v['sex'].notnull()]
test = df_train_w2v[df_train_w2v['sex'].isnull()]
X = train.drop(['sex','age','sex_age','device_id'],axis=1)
Y = train['sex_age']
Y_CAT = pd.Categorical(Y)
Y = pd.Series(Y_CAT.codes)
# In[14]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
#seed = 2048
seed = 666
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
sub_list = []
cate_feat = ['device_type','device_brand']
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, Y)):
train_x, train_y = X.iloc[train_idx], Y.iloc[train_idx]
valid_x, valid_y = X.iloc[valid_idx], Y.iloc[valid_idx]
xg_train = xgb.DMatrix(train_x, label=train_y)
xg_val = xgb.DMatrix(valid_x, label=valid_y)
param = {
'objective' : 'multi:softprob',
'eta' : 0.03,
'max_depth' : 3,
'num_class' : 22,
'eval_metric' : 'mlogloss',
'min_child_weight' : 3,
'subsample' : 0.7,
'colsample_bytree' : 0.7,
'seed' : 2006,
'nthread' : 5
}
num_rounds = 1000
watchlist = [ (xg_train,'train'), (xg_val, 'val') ]
model = xgb.train(param, xg_train, num_rounds, watchlist, early_stopping_rounds=100, verbose_eval=50)
test_matrix = xgb.DMatrix(test[X.columns.values])
sub = pd.DataFrame(model.predict(test_matrix))
sub_list.append(sub)
# In[15]:
sub = (sub_list[0] + sub_list[1] + sub_list[2] + sub_list[3] + sub_list[4]) / num_folds
sub
# In[16]:
sub.columns=Y_CAT.categories
sub['DeviceID']=test['device_id'].values
sub=sub[['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6', '1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4', '2-5', '2-6', '2-7', '2-8', '2-9', '2-10']]
sub.to_csv('th_22_results_xgb.csv',index=False)
================================================
FILE: THLUO/26.thluo_nb_lgb.py
================================================
# coding: utf-8
# In[1]:
# coding: utf-8
# In[1]:
from sklearn.metrics import log_loss
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
# get_ipython().run_line_magic('matplotlib', 'inline')  # notebook magic; fails when run as a plain script
#add
import gc
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, vstack
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# from skopt.space import Integer, Categorical, Real, Log10
# from skopt.utils import use_named_args
# from skopt import gp_minimize
from gensim.models import Word2Vec, FastText
import gensim
import re
import os
path="./"
os.listdir(path)
# In[2]:
print ('26.thluo_nb_lgb.py')
train_id=pd.read_csv("input/deviceid_train.tsv",sep="\t",names=['device_id','sex','age'])
test_id=pd.read_csv("input/deviceid_test.tsv",sep="\t",names=['device_id'])
all_id=pd.concat([train_id[['device_id']],test_id[['device_id']]])
#nurbs=pd.read_csv("nurbs_feature_all.csv")
#nurbs.columns=["nurbs_"+str(i) for i in nurbs.columns]
thluo = pd.read_csv("thluo_train_best_feat.csv")
del thluo['age']
del thluo['sex']
del thluo['sex_age']
# In[7]:
feat = thluo.copy()
# In[8]:
train=pd.merge(train_id,feat,on="device_id",how="left")
test=pd.merge(test_id,feat,on="device_id",how="left")
# In[11]:
features = [x for x in train.columns if x not in ['device_id', 'sex',"age",]]
Y = train['sex'] - 1
# In[12]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 1024
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
# In[13]:
params = {
'boosting_type': 'gbdt',
'learning_rate' : 0.02,
#'max_depth':5,
'num_leaves' : 2 ** 5,
'metric': {'binary_logloss'},
#'num_class' : 22,
'objective' : 'binary',
'random_state' : 6666,
'bagging_freq' : 5,
'feature_fraction' : 0.7,
'bagging_fraction' : 0.7,
'min_split_gain' : 0.0970905919552776,
'min_child_weight' : 9.42012323936088,
}
# In[14]:
# Predict sex
aus = []
sub1 = np.zeros((len(test), ))
pred_oob1=np.zeros((len(train),))
for i,(train_index,test_index) in enumerate(folds.split(train[features], Y)):
tr_x = train[features].reindex(index=train_index, copy=False)
tr_y = Y[train_index]
te_x = train[features].reindex(index=test_index, copy=False)
te_y = Y[test_index]
lgb_train=lgb.Dataset(tr_x,label=tr_y)
lgb_eval = lgb.Dataset(te_x, te_y, reference=lgb_train)
gbm = lgb.train(params, lgb_train, num_boost_round=300,
valid_sets=[lgb_train, lgb_eval], verbose_eval=100)
pred = gbm.predict(te_x[tr_x.columns.values])
pred_oob1[test_index] =pred
# te_y=te_y.apply(lambda x:1 if x>0 else 0)
a = log_loss(te_y, pred)
print ("idx: ", i)
print (" loss: %.5f" % a)
# print " gini: %.5f" % g
aus.append(a)
print ("mean")
print ("log loss: %s" % (sum(aus) / 5.0))
# In[15]:
# Train one LightGBM on all training data
# and use it to predict the test set
lgb_train = lgb.Dataset(train[features],label=Y)
gbm = lgb.train(params, lgb_train, num_boost_round=300, valid_sets=lgb_train, verbose_eval=100)
sub1 = gbm.predict(test[features])
# In[16]:
pred_oob1 = pd.DataFrame(pred_oob1, columns=['sex2'])
sub1 = pd.DataFrame(sub1, columns=['sex2'])
res1=pd.concat([pred_oob1,sub1])
res1['sex1'] = 1-res1['sex2']
# In[18]:
# In[50]:
features = [x for x in train.columns if x not in ['device_id',"age"]]
Y = train['age']
# In[51]:
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import auc, log_loss, roc_auc_score,f1_score,recall_score,precision_score
# In[19]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 1024
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
# In[20]:
params = {
'boosting_type': 'gbdt',
'learning_rate' : 0.02,
#'max_depth':5,
'num_leaves' : 2 ** 5,
'metric': {'multi_logloss'},
'num_class' : 11,
'objective' : 'multiclass',
'random_state' : 6666,
'bagging_freq' : 5,
'feature_fraction' : 0.7,
'bagging_fraction' : 0.7,
'min_split_gain' : 0.0970905919552776,
'min_child_weight' : 9.42012323936088,
}
# In[22]:
# Predict age (11 classes)
aus = []
sub2 = np.zeros((len(test),11 ))
pred_oob2=np.zeros((len(train),11))
models=[]
iters=[]
for i,(train_index,test_index) in enumerate(folds.split(train[features], Y)):
tr_x = train[features].reindex(index=train_index, copy=False)
tr_y = Y[train_index]
te_x = train[features].reindex(index=test_index, copy=False)
te_y = Y[test_index]
lgb_train=lgb.Dataset(tr_x,label=tr_y)
lgb_eval = lgb.Dataset(te_x, te_y, reference=lgb_train)
gbm = lgb.train(params, lgb_train, num_boost_round=430,
valid_sets=[lgb_train, lgb_eval], verbose_eval=100)
pred = gbm.predict(te_x[tr_x.columns.values])
pred_oob2[test_index] = pred
# te_y=te_y.apply(lambda x:1 if x>0 else 0)
a = log_loss(te_y, pred)
#sub2 += gbm.predict(test[features], num_iteration=gbm.best_iteration) / 5
models.append(gbm)
iters.append(gbm.best_iteration)
print ("idx: ", i)
print (" loss: %.5f" % a)
# print " gini: %.5f" % g
aus.append(a)
print ("mean")
print ("log loss: %s" % (sum(aus) / 5.0))
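The score averaged across folds here is the mean multiclass log loss. A small stdlib sketch of the metric, assuming integer true labels and per-row probability vectors (it mirrors what `sklearn.metrics.log_loss` computes in this setting):

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip away exact 0/1
        total += -math.log(p)
    return total / len(y_true)

loss = multiclass_log_loss([0, 2], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
```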
# In[23]:
# Predict the conditional probability P(age | sex)
#### sex = 1
test['sex']=1
# Train one LightGBM on all training data
# and use it to predict the test set
lgb_train = lgb.Dataset(train[features],label=Y)
gbm = lgb.train(params, lgb_train, num_boost_round=430, valid_sets=lgb_train, verbose_eval=100)
sub2 = gbm.predict(test[features])
res2_1=np.vstack((pred_oob2,sub2))
res2_1 = pd.DataFrame(res2_1)
# In[24]:
### sex = 2
# Predict the conditional probability with the sex feature set to 2
test['sex']=2
sub2 = np.zeros((len(test),11))
sub2 = gbm.predict(test[features], num_iteration = gbm.best_iteration)
res2_2=np.vstack((pred_oob2,sub2))
res2_2 = pd.DataFrame(res2_2)
# In[27]:
res1.index=range(len(res1))
res2_1.index=range(len(res2_1))
res2_2.index=range(len(res2_2))
final_1=res2_1.copy()
final_2=res2_2.copy()
# In[28]:
for i in range(11):
final_1[i]=res1['sex1'] * res2_1[i]
final_2[i]=res1['sex2'] * res2_2[i]
id_list = pd.concat([train[['device_id']],test[['device_id']]])
final = id_list
final.index = range(len(final))
final.columns = ['DeviceID']
final_pred = pd.concat([final_1, final_2], axis=1)
final = pd.concat([final, final_pred], axis=1)
final.columns = ['DeviceID', '1-0', '1-1', '1-2', '1-3', '1-4', '1-5', '1-6',
'1-7','1-8', '1-9', '1-10', '2-0', '2-1', '2-2', '2-3', '2-4',
'2-5', '2-6', '2-7', '2-8', '2-9', '2-10']
# In[30]:
test['DeviceID']=test['device_id']
sub=pd.merge(test[['DeviceID']],final,on="DeviceID",how="left")
sub.to_csv("th_lgb_nb.csv",index=False)
================================================
FILE: THLUO/27.thluo_nb_xgb.py
================================================
# coding: utf-8
# In[1]:
# coding: utf-8
# In[1]:
from sklearn.metrics import log_loss
import pandas as pd
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from datetime import datetime,timedelta
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
# get_ipython().run_line_magic('matplotlib', 'inline')  # notebook magic; fails when run as a plain script
#add
import gc
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, vstack
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# from skopt.space import Integer, Categorical, Real, Log10
# from skopt.utils import use_named_args
# from skopt import gp_minimize
from gensim.models import Word2Vec, FastText
import gensim
import re
import os
import xgboost as xgb
path="./"
os.listdir(path)
# In[2]:
print ('27.thluo_nb_xgb.py')
train_id=pd.read_csv("input/deviceid_train.tsv",sep="\t",names=['device_id','sex','age'])
test_id=pd.read_csv("input/deviceid_test.tsv",sep="\t",names=['device_id'])
all_id=pd.concat([train_id[['device_id']],test_id[['device_id']]])
df_sex_prob_oof = pd.read_csv('device_sex_prob_oof.csv')
df_age_prob_oof = pd.read_csv('device_age_prob_oof.csv')
df_start_close_sex_prob_oof = pd.read_csv('start_close_sex_prob_oof.csv')
# The next two do not track the leaderboard; they overfit offline
df_start_close_age_prob_oof = pd.read_csv('start_close_age_prob_oof.csv')
#df_start_close_sex_age_prob_oof = pd.read_csv('start_close_sex_age_prob_oof.csv')
df_tfidf_lr_sex_age_prob_oof = pd.read_csv('tfidf_lr_sex_age_prob_oof.csv')
# Earlier features that proved useful
df_sex_age_bin_prob_oof = pd.read_csv('sex_age_bin_prob_oof.csv')
df_age_bin_prob_oof = pd.read_csv('age_bin_prob_oof.csv')
df_hcc_device_brand_age_sex = pd.read_csv('hcc_device_brand_age_sex.csv')
df_device_age_regression_prob_oof = pd.read_csv('device_age_regression_prob_oof.csv')
df_device_start_GRU_pred = pd.read_csv('device_start_GRU_pred.csv')
df_device_start_GRU_pred_age = pd.read_csv('device_start_GRU_pred_age.csv')
df_device_all_GRU_pred = pd.read_csv('device_all_GRU_pred.csv')
#df_boost_sex_age_prob_oof = pd.read_csv('boost_sex_age_prob_oof.csv')
df_lgb_sex_age_prob_oof = pd.read_csv('lgb_sex_age_prob_oof.csv')
df_device_start_capsule_pred = pd.read_csv('device_start_capsule_pred.csv')
df_device_start_textcnn_pred = pd.read_csv('device_start_textcnn_pred.csv')
df_device_start_text_dpcnn_pred = pd.read_csv('device_start_text_dpcnn_pred.csv')
df_device_start_lstm_pred = pd.read_csv('device_start_lstm_pred.csv')
df_att_nn_feat_v6 = pd.read_csv('att_nn_feat_v6.csv')
df_att_nn_feat_v6.columns = ['device_id'] + ['att_nn_feat_' + str(i) for i in range(22)]
# Drop features that overfit
del df_start_close_age_prob_oof['device_app_groupedstart_close_age_prob_oof_4_MEAN']
del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MIN']
del df_start_close_sex_prob_oof['device_app_groupedstart_close_sex_prob_oof_MAX']
# In[3]:
df_train_w2v = all_id.merge(df_sex_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_start_close_sex_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_start_close_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_sex_age_bin_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_age_bin_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_hcc_device_brand_age_sex, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_age_regression_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_GRU_pred_age, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_all_GRU_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_lgb_sex_age_prob_oof, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_capsule_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_textcnn_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_text_dpcnn_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_device_start_lstm_pred, on='device_id', how='left')
df_train_w2v = df_train_w2v.merge(df_att_nn_feat_v6, on='device_id', how='left')
# In[5]:
feat = df_train_w2v.copy()
# In[6]:
train=pd.merge(train_id,feat,on="device_id",how="left")
test=pd.merge(test_id,feat,on="device_id",how="left")
# In[8]:
features = [x for x in train.columns if x not in ['device_id', 'sex',"age",]]
Y = train['sex'] - 1
# In[9]:
from sklearn.model_selection import KFold, StratifiedKFold
gc.collect()
seed = 1024
num_folds = 5
folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=seed)
# In[10]:
params={
'booster':'gbtree',
'objective': 'binary:logistic',
# 'is_unbalance':'True',
# 'scale_pos_weight': 1500.0/13458.0,
'eval_metric': "logloss",
'gamma':0.2,#0.2 is ok
'max_depth':6,
# 'lambda':20,
# "alpha":5,
'subsample':0.7,
'colsample_bytree':0.4 ,
# 'min_child_weight':2.5,
'eta': 0.01,
# 'learning_rate':0.01,
"silent":1,
'seed':1024,
'nthread':5,
}
num_round = 3500
early_stopping_rounds = 100
# In[11]:
# Predict sex
aus = []
sub1 = np.zeros((len(test), ))
pred_oob1=np.zeros((len(train),))
for i,(train_index,test_index) in enumerate(folds.split(train[features], Y)):
tr_x = train[features].reindex(index=train_index, copy=False)
tr_y = Y[train_index]
te_x = train[features].reindex(index=test_index, copy=False)
te_y = Y[test_index]
d_tr = xgb.DMatrix(tr_x, label=tr_y)
d_te = xgb.DMatrix(te_x, label=te_y)
watchlist = [(d_tr,'train'),
(d_te,'val')
]
model = xgb.train(params, d_tr, num_round, watchlist, early_stopping_rounds=early_stopping_rounds, verbose_eval=50)
SYMBOL INDEX (249 symbols across 39 files)
FILE: THLUO/10.age_bin_prob_oof.py
function timeStamp (line 84) | def timeStamp(timeNum):
function open_app_timegap_in_hour (line 177) | def open_app_timegap_in_hour() :
function device_start_end_app_timegap (line 191) | def device_start_end_app_timegap() :
function open_app_counts_in_hour (line 220) | def open_app_counts_in_hour() :
function close_app_counts_in_hour (line 227) | def close_app_counts_in_hour() :
function app_type_mean_time_gap_one_hot (line 234) | def app_type_mean_time_gap_one_hot () :
function device_active_hour (line 241) | def device_active_hour() :
function device_brand_encoding (line 253) | def device_brand_encoding() :
function device_active_time_time_stat (line 288) | def device_active_time_time_stat() :
function app_type_encoding (line 325) | def app_type_encoding() :
function app_type_onehot_in_device (line 359) | def app_type_onehot_in_device(df) :
function tool (line 416) | def tool(x):
FILE: THLUO/11.hcc_device_brand_age_sex.py
function hcc_encode (line 101) | def hcc_encode(train_df, test_df, variable, target, prior_prob, k=5, f=1...
FILE: THLUO/12.device_age_regression_prob_oof.py
function timeStamp (line 84) | def timeStamp(timeNum):
function open_app_timegap_in_hour (line 120) | def open_app_timegap_in_hour() :
function device_start_end_app_timegap (line 134) | def device_start_end_app_timegap() :
function open_app_counts_in_hour (line 163) | def open_app_counts_in_hour() :
function close_app_counts_in_hour (line 170) | def close_app_counts_in_hour() :
function app_type_mean_time_gap_one_hot (line 177) | def app_type_mean_time_gap_one_hot () :
function device_active_hour (line 184) | def device_active_hour() :
function device_brand_encoding (line 196) | def device_brand_encoding() :
function device_active_time_time_stat (line 231) | def device_active_time_time_stat() :
function app_type_encoding (line 268) | def app_type_encoding() :
function app_type_onehot_in_device (line 302) | def app_type_onehot_in_device(df) :
FILE: THLUO/13.device_start_GRU_pred.py
function tool (line 70) | def tool(x):
function get_mut_label (line 106) | def get_mut_label(y_label) :
class RocAucEvaluation (line 112) | class RocAucEvaluation(Callback):
method __init__ (line 113) | def __init__(self, validation_data=(), interval=1):
method on_epoch_end (line 119) | def on_epoch_end(self, epoch, logs={}):
function w2v_pad (line 131) | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
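The recurring `w2v_pad(df_train, df_test, col, maxlen_, victor_size, num_words)` helper evidently indexes the app-id sequences, pads them to a fixed length, and builds an embedding matrix from the pre-trained Word2Vec vectors (scripts 1–3). A dependency-free sketch of that preparation step (hypothetical names; `w2v` stands in for a trained gensim model's word-to-vector lookup):

```python
import numpy as np

def w2v_pad_sketch(texts, w2v, maxlen, vector_size):
    """Index tokens, left-pad sequences to maxlen, and build an embedding
    matrix from a {word: vector} lookup such as a trained Word2Vec model."""
    vocab = {w: i + 1 for i, w in
             enumerate(sorted({t for s in texts for t in s.split()}))}
    padded = np.zeros((len(texts), maxlen), dtype=np.int64)
    for r, s in enumerate(texts):
        ids = [vocab[t] for t in s.split()][:maxlen]
        padded[r, maxlen - len(ids):] = ids          # pre-padding, the Keras default
    emb = np.zeros((len(vocab) + 1, vector_size))    # row 0 is the padding vector
    for w, i in vocab.items():
        if w in w2v:                                 # out-of-vocabulary rows stay zero
            emb[i] = w2v[w]
    return padded, emb, vocab

w2v = {"app1": np.ones(4), "app2": np.full(4, 2.0)}
padded, emb, vocab = w2v_pad_sketch(["app1 app2", "app2"], w2v, maxlen=3, vector_size=4)
```

The `emb` matrix is then passed as frozen initial weights to the GRU/LSTM/TextCNN models listed below (`embeddings_weight` in TextModel.py).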
FILE: THLUO/14.device_start_GRU_pred_age.py
function get_mut_label (line 95) | def get_mut_label(y_label) :
class RocAucEvaluation (line 101) | class RocAucEvaluation(Callback):
method __init__ (line 102) | def __init__(self, validation_data=(), interval=1):
method on_epoch_end (line 108) | def on_epoch_end(self, epoch, logs={}):
function w2v_pad (line 120) | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
FILE: THLUO/15.device_all_GRU_pred.py
function tool (line 69) | def tool(x):
function get_mut_label (line 105) | def get_mut_label(y_label) :
class RocAucEvaluation (line 111) | class RocAucEvaluation(Callback):
method __init__ (line 112) | def __init__(self, validation_data=(), interval=1):
method on_epoch_end (line 118) | def on_epoch_end(self, epoch, logs={}):
function w2v_pad (line 130) | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
FILE: THLUO/16.device_start_capsule_pred.py
function tool (line 69) | def tool(x):
function get_mut_label (line 105) | def get_mut_label(y_label) :
class RocAucEvaluation (line 111) | class RocAucEvaluation(Callback):
method __init__ (line 112) | def __init__(self, validation_data=(), interval=1):
method on_epoch_end (line 118) | def on_epoch_end(self, epoch, logs={}):
function w2v_pad (line 130) | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
FILE: THLUO/17.device_start_textcnn_pred.py
function tool (line 69) | def tool(x):
function get_mut_label (line 105) | def get_mut_label(y_label) :
class RocAucEvaluation (line 111) | class RocAucEvaluation(Callback):
method __init__ (line 112) | def __init__(self, validation_data=(), interval=1):
method on_epoch_end (line 118) | def on_epoch_end(self, epoch, logs={}):
function w2v_pad (line 130) | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
FILE: THLUO/18.device_start_text_dpcnn_pred.py
function tool (line 69) | def tool(x):
function get_mut_label (line 105) | def get_mut_label(y_label) :
class RocAucEvaluation (line 111) | class RocAucEvaluation(Callback):
method __init__ (line 112) | def __init__(self, validation_data=(), interval=1):
method on_epoch_end (line 118) | def on_epoch_end(self, epoch, logs={}):
function w2v_pad (line 130) | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
FILE: THLUO/19.device_start_lstm_pred.py
function tool (line 61) | def tool(x):
function get_mut_label (line 97) | def get_mut_label(y_label) :
class RocAucEvaluation (line 103) | class RocAucEvaluation(Callback):
method __init__ (line 104) | def __init__(self, validation_data=(), interval=1):
method on_epoch_end (line 110) | def on_epoch_end(self, epoch, logs={}):
function w2v_pad (line 122) | def w2v_pad(df_train,df_test,col, maxlen_,victor_size, num_words):
FILE: THLUO/20.lgb_sex_age_prob_oof.py
function timeStamp (line 84) | def timeStamp(timeNum):
function open_app_timegap_in_hour (line 119) | def open_app_timegap_in_hour() :
function device_start_end_app_timegap (line 133) | def device_start_end_app_timegap() :
function open_app_counts_in_hour (line 162) | def open_app_counts_in_hour() :
function close_app_counts_in_hour (line 169) | def close_app_counts_in_hour() :
function app_type_mean_time_gap_one_hot (line 176) | def app_type_mean_time_gap_one_hot () :
function device_active_hour (line 183) | def device_active_hour() :
function device_brand_encoding (line 195) | def device_brand_encoding() :
function device_active_time_time_stat (line 230) | def device_active_time_time_stat() :
function app_type_encoding (line 267) | def app_type_encoding() :
function app_type_onehot_in_device (line 301) | def app_type_onehot_in_device(df) :
function tool (line 416) | def tool(x):
FILE: THLUO/21.tfidf_lr_sex_age_prob_oof.py
function tool (line 150) | def tool(x):
FILE: THLUO/22.base_feat.py
function timeStamp (line 83) | def timeStamp(timeNum):
function open_app_timegap_in_hour (line 119) | def open_app_timegap_in_hour() :
function device_start_end_app_timegap (line 133) | def device_start_end_app_timegap() :
function open_app_counts_in_hour (line 162) | def open_app_counts_in_hour() :
function close_app_counts_in_hour (line 169) | def close_app_counts_in_hour() :
function app_type_mean_time_gap_one_hot (line 176) | def app_type_mean_time_gap_one_hot () :
function device_active_hour (line 183) | def device_active_hour() :
function device_brand_encoding (line 195) | def device_brand_encoding() :
function device_active_time_time_stat (line 230) | def device_active_time_time_stat() :
function app_type_encoding (line 267) | def app_type_encoding() :
function app_type_onehot_in_device (line 301) | def app_type_onehot_in_device(df) :
function tool (line 457) | def tool(x):
FILE: THLUO/23.ATT_v6.py
function dot_product (line 90) | def dot_product(x, kernel):
class AttentionWithContext (line 104) | class AttentionWithContext(Layer):
method __init__ (line 125) | def __init__(self,
method build (line 144) | def build(self, input_shape):
method compute_mask (line 167) | def compute_mask(self, input, input_mask=None):
method call (line 171) | def call(self, x, mask=None):
method compute_output_shape (line 196) | def compute_output_shape(self, input_shape):
class AdamW (line 199) | class AdamW(Optimizer):
method __init__ (line 200) | def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...
method get_updates (line 214) | def get_updates(self, loss, params):
method get_config (line 248) | def get_config(self):
function model_conv1D (line 258) | def model_conv1D(embedding_matrix):
function model_age_conv (line 327) | def model_age_conv(embedding_matrix):
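The custom `AdamW(lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...)` optimizer that recurs across the Keras scripts implements decoupled weight decay: the decay is applied directly to the weights instead of being folded into the gradient. One update step, sketched in NumPy:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999,
               weight_decay=1e-4, eps=1e-8):
    """One AdamW update: a standard Adam step plus weight decay applied
    directly to the weights, decoupled from the gradient moments."""
    m = beta_1 * m + (1 - beta_1) * g          # first moment (EMA of gradients)
    v = beta_2 * v + (1 - beta_2) * g * g      # second moment (EMA of squared grads)
    m_hat = m / (1 - beta_1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta_2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

w = np.array([1.0, -1.0])
m = np.zeros(2)
v = np.zeros(2)
w, m, v = adamw_step(w, np.array([0.5, -0.5]), m, v, t=1)
```

With plain Adam, L2 regularization interacts with the adaptive per-parameter scaling; decoupling the decay term keeps its strength independent of gradient magnitudes.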
FILE: THLUO/25.thluo_22_xgb.py
function tool (line 112) | def tool(x):
FILE: THLUO/3.device_quchong_start_app_w2c.py
function timeStamp (line 84) | def timeStamp(timeNum):
FILE: THLUO/4.device_age_prob_oof.py
function timeStamp (line 85) | def timeStamp(timeNum):
function open_app_timegap_in_hour (line 129) | def open_app_timegap_in_hour() :
function device_start_end_app_timegap (line 143) | def device_start_end_app_timegap() :
function open_app_counts_in_hour (line 172) | def open_app_counts_in_hour() :
function close_app_counts_in_hour (line 179) | def close_app_counts_in_hour() :
function app_type_mean_time_gap_one_hot (line 186) | def app_type_mean_time_gap_one_hot () :
function device_active_hour (line 193) | def device_active_hour() :
function device_brand_encoding (line 205) | def device_brand_encoding() :
function device_active_time_time_stat (line 240) | def device_active_time_time_stat() :
function app_type_encoding (line 277) | def app_type_encoding() :
function app_type_onehot_in_device (line 311) | def app_type_onehot_in_device(df) :
FILE: THLUO/5.device_sex_prob_oof.py
function timeStamp (line 84) | def timeStamp(timeNum):
function open_app_timegap_in_hour (line 127) | def open_app_timegap_in_hour() :
function device_start_end_app_timegap (line 141) | def device_start_end_app_timegap() :
function open_app_counts_in_hour (line 170) | def open_app_counts_in_hour() :
function close_app_counts_in_hour (line 177) | def close_app_counts_in_hour() :
function app_type_mean_time_gap_one_hot (line 184) | def app_type_mean_time_gap_one_hot () :
function device_active_hour (line 191) | def device_active_hour() :
function device_brand_encoding (line 203) | def device_brand_encoding() :
function device_active_time_time_stat (line 238) | def device_active_time_time_stat() :
function app_type_encoding (line 275) | def app_type_encoding() :
function app_type_onehot_in_device (line 309) | def app_type_onehot_in_device(df) :
FILE: THLUO/6.start_close_age_prob_oof.py
function timeStamp (line 93) | def timeStamp(timeNum):
FILE: THLUO/7.start_close_sex_prob_oof.py
function timeStamp (line 92) | def timeStamp(timeNum):
FILE: THLUO/9.sex_age_bin_prob_oof.py
function timeStamp (line 84) | def timeStamp(timeNum):
function open_app_timegap_in_hour (line 119) | def open_app_timegap_in_hour() :
function device_start_end_app_timegap (line 133) | def device_start_end_app_timegap() :
function open_app_counts_in_hour (line 162) | def open_app_counts_in_hour() :
function close_app_counts_in_hour (line 169) | def close_app_counts_in_hour() :
function app_type_mean_time_gap_one_hot (line 176) | def app_type_mean_time_gap_one_hot () :
function device_active_hour (line 183) | def device_active_hour() :
function device_brand_encoding (line 195) | def device_brand_encoding() :
function device_active_time_time_stat (line 230) | def device_active_time_time_stat() :
function app_type_encoding (line 267) | def app_type_encoding() :
function app_type_onehot_in_device (line 301) | def app_type_onehot_in_device(df) :
function tool (line 415) | def tool(x):
FILE: THLUO/TextModel.py
function capsule_lstm (line 30) | def capsule_lstm(sent_length, embeddings_weight,class_num):
function get_text_capsule (line 60) | def get_text_capsule(sent_length, embeddings_weight,class_num):
function get_text_cnn1 (line 85) | def get_text_cnn1(sent_length, embeddings_weight,class_num):
function get_text_cnn2 (line 130) | def get_text_cnn2(sent_length, embeddings_weight,class_num):
function get_text_cnn3 (line 176) | def get_text_cnn3(sent_length, embeddings_weight,class_num):
function get_text_gru1 (line 217) | def get_text_gru1(sent_length, embeddings_weight,class_num):
function get_text_gru2 (line 246) | def get_text_gru2(sent_length, embeddings_weight,class_num):
function get_text_gru4 (line 276) | def get_text_gru4(sent_length, embeddings_weight,class_num):
function get_text_gru5 (line 305) | def get_text_gru5(sent_length, embeddings_weight,class_num):
function get_text_gru6 (line 336) | def get_text_gru6(sent_length, embeddings_weight,class_num):
function get_text_rcnn1 (line 371) | def get_text_rcnn1(sent_length, embeddings_weight,class_num):
function get_text_rcnn2 (line 399) | def get_text_rcnn2(sent_length, embeddings_weight,class_num):
function get_text_rcnn3 (line 428) | def get_text_rcnn3(sent_length, embeddings_weight,class_num):
function get_text_rcnn4 (line 462) | def get_text_rcnn4(sent_length, embeddings_weight,class_num):
function get_text_rcnn5 (line 493) | def get_text_rcnn5(sent_length, embeddings_weight,class_num):
function get_text_lstm1 (line 526) | def get_text_lstm1(sent_length, embeddings_weight,class_num):
function get_text_lstm2 (line 551) | def get_text_lstm2(sent_length, embeddings_weight,class_num):
function get_text_lstm3 (line 577) | def get_text_lstm3(sent_length, embeddings_weight,class_num):
function get_text_lstm_attention (line 605) | def get_text_lstm_attention(sent_length, embeddings_weight,class_num):
function get_text_dpcnn (line 631) | def get_text_dpcnn(sent_length, embeddings_weight,class_num):
function bi_gru_model (line 694) | def bi_gru_model(sent_length, embeddings_weight,class_num):
function bi_gru_model_binary (line 723) | def bi_gru_model_binary(sent_length, embeddings_weight,class_num):
FILE: THLUO/util.py
class Attention (line 44) | class Attention(Layer):
method __init__ (line 45) | def __init__(self, step_dim,
method build (line 79) | def build(self, input_shape):
method compute_mask (line 100) | def compute_mask(self, input, input_mask=None):
method call (line 104) | def call(self, x, mask=None):
method compute_output_shape (line 135) | def compute_output_shape(self, input_shape):
function squash (line 140) | def squash(x, axis=-1):
class Capsule (line 146) | class Capsule(Layer):
method __init__ (line 147) | def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(...
method build (line 160) | def build(self, input_shape):
method call (line 179) | def call(self, u_vecs):
method compute_output_shape (line 204) | def compute_output_shape(self, input_shape):
class AttentionWeightedAverage (line 208) | class AttentionWeightedAverage(Layer):
method __init__ (line 214) | def __init__(self, return_attention=False, **kwargs):
method build (line 220) | def build(self, input_shape):
method call (line 230) | def call(self, x, mask=None):
method get_output_shape_for (line 251) | def get_output_shape_for(self, input_shape):
method compute_output_shape (line 254) | def compute_output_shape(self, input_shape):
method compute_mask (line 260) | def compute_mask(self, input, input_mask=None):
class KMaxPooling (line 267) | class KMaxPooling(Layer):
method __init__ (line 273) | def __init__(self, k=1, **kwargs):
method compute_output_shape (line 278) | def compute_output_shape(self, input_shape):
method call (line 281) | def call(self, inputs):
function performance (line 292) | def performance(f): # decorator: wraps the passed-in function and returns the wrapped version
function f1 (line 303) | def f1(y_true, y_pred):
function f1 (line 336) | def f1(y_true, y_pred):
function evalation_score (line 368) | def evalation_score(y_true,y_pred):
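The `squash` helper above is the capsule-network nonlinearity used by the `Capsule` layer: it shrinks each capsule vector's length into [0, 1) while preserving its direction, so length can be read as an activation probability. A NumPy sketch in the Sabour et al. form (the repo's Keras version may use a slightly different scaling constant):

```python
import numpy as np

def squash(x, axis=-1, eps=1e-7):
    """Capsule squash: scale each vector by ||x||^2 / (1 + ||x||^2) / ||x||,
    keeping direction but bounding length below 1."""
    s = np.sum(np.square(x), axis=axis, keepdims=True)  # squared norm per capsule
    scale = s / (1.0 + s) / np.sqrt(s + eps)
    return scale * x

v = squash(np.array([[3.0, 4.0]]))   # input length 5 -> output length 25/26
```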
FILE: chizhu/single_model/cnn.py
class AdamW (line 157) | class AdamW(Optimizer):
method __init__ (line 158) | def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...
method get_updates (line 173) | def get_updates(self, loss, params):
method get_config (line 208) | def get_config(self):
function model_conv1D (line 222) | def model_conv1D(embedding_matrix):
function model_age_conv (line 343) | def model_age_conv(embedding_matrix):
FILE: chizhu/single_model/deepnn.py
class AdamW (line 82) | class AdamW(Optimizer):
method __init__ (line 83) | def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...
method get_updates (line 98) | def get_updates(self, loss, params):
method get_config (line 133) | def get_config(self):
function model_conv1D_sex (line 144) | def model_conv1D_sex(embedding_matrix):
function model_age_conv (line 282) | def model_age_conv(embedding_matrix):
FILE: chizhu/single_model/lgb.py
function get_str (line 53) | def get_str(df):
FILE: chizhu/single_model/yg_best_nn.py
class AdamW (line 95) | class AdamW(Optimizer):
method __init__ (line 96) | def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...
method get_updates (line 111) | def get_updates(self, loss, params):
method get_config (line 146) | def get_config(self):
function model_conv1D (line 157) | def model_conv1D(embedding_matrix):
function model_age_conv (line 267) | def model_age_conv(embedding_matrix):
FILE: linwangli/code/utils.py
function weights_ensemble (line 4) | def weights_ensemble(results, weights):
function result_corr (line 33) | def result_corr(path1, path2):
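Per its docstring, `weights_ensemble` blends several submission files by weighted averaging of their class probabilities (`result_corr` checks how correlated two results are before blending). A hypothetical re-implementation, assuming each frame has an id column followed by per-class probability columns (the column names here are illustrative):

```python
import pandas as pd

def weights_ensemble_sketch(results, weights):
    """Weighted average of submission probability frames; the first column
    is assumed to be the device id, the rest per-class probabilities."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    blended = results[0].copy()
    prob_cols = blended.columns[1:]
    blended[prob_cols] = sum(w * r[prob_cols].to_numpy()
                             for w, r in zip(weights, results))
    return blended

a = pd.DataFrame({"DeviceID": [1, 2], "1-0": [0.2, 0.6], "1-1": [0.8, 0.4]})
b = pd.DataFrame({"DeviceID": [1, 2], "1-0": [0.4, 0.8], "1-1": [0.6, 0.2]})
out = weights_ensemble_sketch([a, b], [0.5, 0.5])
```

Blending lowly-correlated results is what gives the final `28.final.py`-style merges their gain; near-identical results add little.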
FILE: nb_cz_lwl_wcm/10_lgb.py
function get_str (line 53) | def get_str(df):
FILE: nb_cz_lwl_wcm/11_cnn.py
class AdamW (line 157) | class AdamW(Optimizer):
method __init__ (line 158) | def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...
method get_updates (line 173) | def get_updates(self, loss, params):
method get_config (line 208) | def get_config(self):
function model_conv1D (line 222) | def model_conv1D(embedding_matrix):
function model_age_conv (line 343) | def model_age_conv(embedding_matrix):
FILE: nb_cz_lwl_wcm/12_get_feature_lwl.py
function get_top100_statis_feat (line 31) | def get_top100_statis_feat(start_close):
function get_brand_feat (line 45) | def get_brand_feat(brand):
function get_start_close_tfidf_feat (line 62) | def get_start_close_tfidf_feat(data_all, start_close):
function get_packages_w2c_feat (line 82) | def get_packages_w2c_feat(packages):
function get_week_statis_feat (line 109) | def get_week_statis_feat(start_close):
function get_user_behaviour_feat (line 125) | def get_user_behaviour_feat(start_close):
FILE: nb_cz_lwl_wcm/1_get_age_reg.py
function get_label (line 17) | def get_label(row):
function get_more_information (line 43) | def get_more_information(row):
function xx_mse_s (line 92) | def xx_mse_s(y_true,y_pre):
FILE: nb_cz_lwl_wcm/3_get_feature_device_package.py
function get_label (line 14) | def get_label(row):
function get_more_information (line 44) | def get_more_information(row):
function get_cluster (line 263) | def get_cluster(num_clusters):
FILE: nb_cz_lwl_wcm/4_get_feature_device_start_close_tfidf_1_2.py
function dealed_row (line 33) | def dealed_row(row):
function get_label (line 58) | def get_label(row):
FILE: nb_cz_lwl_wcm/5_get_feature_device_start_close_tfidf.py
function dealed_row (line 31) | def dealed_row(row):
function get_label (line 56) | def get_label(row):
function get_cluster (line 247) | def get_cluster(num_clusters):
FILE: nb_cz_lwl_wcm/6_get_feature_device_start_close.py
function get_max_label (line 36) | def get_max_label(row):
FILE: nb_cz_lwl_wcm/7_get_feature_w2v.py
function get_w2v_avg (line 23) | def get_w2v_avg(text, w2v_out_path, word2vec_Path):
FILE: nb_cz_lwl_wcm/9_yg_best_nn.py
class AdamW (line 95) | class AdamW(Optimizer):
method __init__ (line 96) | def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...
method get_updates (line 111) | def get_updates(self, loss, params):
method get_config (line 146) | def get_config(self):
function model_conv1D (line 157) | def model_conv1D(embedding_matrix):
function model_age_conv (line 267) | def model_age_conv(embedding_matrix):
FILE: wangcanming/deepnet_v33.py
class AdamW (line 72) | class AdamW(Optimizer):
method __init__ (line 73) | def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999, weight_decay=1e...
method get_updates (line 87) | def get_updates(self, loss, params):
method get_config (line 121) | def get_config(self):
function model_conv1D_sex (line 131) | def model_conv1D_sex(embedding_matrix):
function model_age_conv (line 200) | def model_age_conv(embedding_matrix):
Condensed preview — 71 files, each showing path, character count, and a content snippet (full structured content: 734K chars).
[
{
"path": "README.md",
"chars": 764,
"preview": "# yiguan_sex_age_predict_1st_solution \n易观性别年龄预测第一名解决方案\n\n##### [比赛链接](https://www.tinymind.cn/competitions/43)\n--------\n\n"
},
{
"path": "THLUO/1.w2c_model_start.py",
"chars": 5131,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/10.age_bin_prob_oof.py",
"chars": 20069,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/11.hcc_device_brand_age_sex.py",
"chars": 6167,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/12.device_age_regression_prob_oof.py",
"chars": 19152,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/13.device_start_GRU_pred.py",
"chars": 7484,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\nimport feather\nimport os\nimport re\nimport sys \nimport gc\nimport random\nimp"
},
{
"path": "THLUO/14.device_start_GRU_pred_age.py",
"chars": 7090,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\nimport feather\nimport os\nimport re\nimport sys \nimport gc\nimport random\nimp"
},
{
"path": "THLUO/15.device_all_GRU_pred.py",
"chars": 7518,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\nimport feather\nimport os\nimport re\nimport sys \nimport gc\nimport random\nimp"
},
{
"path": "THLUO/16.device_start_capsule_pred.py",
"chars": 7527,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\nimport feather\nimport os\nimport re\nimport sys \nimport gc\nimport random\nimp"
},
{
"path": "THLUO/17.device_start_textcnn_pred.py",
"chars": 7523,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\nimport feather\nimport os\nimport re\nimport sys \nimport gc\nimport random\nimp"
},
{
"path": "THLUO/18.device_start_text_dpcnn_pred.py",
"chars": 7680,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\nimport feather\nimport os\nimport re\nimport sys \nimport gc\nimport random\nimp"
},
{
"path": "THLUO/19.device_start_lstm_pred.py",
"chars": 7602,
"preview": "import feather\nimport os\nimport re\nimport sys \nimport gc\nimport random\nimport pandas as pd\nimport numpy as np\nimport ge"
},
{
"path": "THLUO/2.w2c_model_close.py",
"chars": 5165,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/20.lgb_sex_age_prob_oof.py",
"chars": 20145,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/21.tfidf_lr_sex_age_prob_oof.py",
"chars": 7502,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/22.base_feat.py",
"chars": 20551,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/23.ATT_v6.py",
"chars": 17222,
"preview": "import pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom sklearn.decomposition import Lat"
},
{
"path": "THLUO/24.thluo_22_lgb.py",
"chars": 2961,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/25.thluo_22_xgb.py",
"chars": 6557,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/26.thluo_nb_lgb.py",
"chars": 7129,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\n\n# In[1]:\n\nfrom sklearn.metrics import log_loss\nimport pandas as pd\nimport "
},
{
"path": "THLUO/27.thluo_nb_xgb.py",
"chars": 10233,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\n# coding: utf-8\n\n# In[1]:\n\nfrom sklearn.metrics import log_loss\nimport pandas as pd\nimport "
},
{
"path": "THLUO/28.final.py",
"chars": 1040,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport numpy as np\nimport pandas as pd\n\n\n# In[2]:\n\n\nth_22_results_lgb = pd.read_csv('th_22_"
},
{
"path": "THLUO/3.device_quchong_start_app_w2c.py",
"chars": 5925,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/3.w2c_all_emb.py",
"chars": 4420,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/3.w2c_model_all.py",
"chars": 5553,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/4.device_age_prob_oof.py",
"chars": 19542,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/5.device_sex_prob_oof.py",
"chars": 19244,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/6.start_close_age_prob_oof.py",
"chars": 11780,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/7.start_close_sex_prob_oof.py",
"chars": 10894,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/9.sex_age_bin_prob_oof.py",
"chars": 20372,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "THLUO/TextModel.py",
"chars": 29400,
"preview": "import os\nimport re\nimport sys\nimport pandas as pd\nimport numpy as np\nimport gensim\nfrom gensim.models import Word2Vec\nf"
},
{
"path": "THLUO/readme.md",
"chars": 2515,
"preview": "本代码运行在windows10, 48G内存, 1070ti显卡上, 由于运行的py文件比较多, 所以需要比较长的时间才能跑完\r\n\r\n文件夹说明:\r\n> cache文件夹是存放输出模型的文件夹\r\n> embedding是存放w2c词嵌入的文"
},
{
"path": "THLUO/util.py",
"chars": 13690,
"preview": "import os\nimport re\nimport sys\nimport pandas as pd\nimport numpy as np\nimport gensim\nfrom gensim.models import Word2Vec\nf"
},
{
"path": "THLUO/代码运行.bat",
"chars": 995,
"preview": "python 1.w2c_model_start.py\t\r\npython 2.w2c_model_close.py\t\r\npython 3.w2c_model_all.py\t\r\npython 3.device_quchong_start_ap"
},
{
"path": "chizhu/readme.txt",
"chars": 807,
"preview": "|-single_model/\n |-data/ 处理后的特征和数据存放位置\n |-model/ 模型文件\n |-submit 模型概率文件,可用作stacking材料\n |-config.py 配置原始文件路径\n "
},
{
"path": "chizhu/single_model/cnn.py",
"chars": 17934,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "chizhu/single_model/config.py",
"chars": 49,
"preview": "path = \"/Users/chizhu/data/competition_data/易观/\"\n"
},
{
"path": "chizhu/single_model/deepnn.py",
"chars": 16043,
"preview": "import pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom sklearn.decomposition import Lat"
},
{
"path": "chizhu/single_model/get_nn_feat.py",
"chars": 6172,
"preview": "import pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom sklearn.decomposition import Lat"
},
{
"path": "chizhu/single_model/lgb.py",
"chars": 47125,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skle"
},
{
"path": "chizhu/single_model/user_behavior.py",
"chars": 4244,
"preview": "\n# coding: utf-8\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom sklearn.decomp"
},
{
"path": "chizhu/single_model/xgb.py",
"chars": 7007,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "chizhu/single_model/xgb_nb.py",
"chars": 8404,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "chizhu/single_model/yg_best_nn.py",
"chars": 14491,
"preview": "import pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom sklearn.decomposition import Lat"
},
{
"path": "chizhu/stacking/all_feat/xgb__nurbs_nb.ipynb",
"chars": 19702,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [\n {\n \"data\":"
},
{
"path": "chizhu/stacking/nurbs_feat/xgb_22.py",
"chars": 4715,
"preview": "\n# coding: utf-8\n\n# In[2]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "chizhu/stacking/nurbs_feat/xgb__nurbs_nb.py",
"chars": 8062,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "chizhu/util/bagging.py",
"chars": 1301,
"preview": "import os\nimport pandas as pd\npath = \"/Users/chizhu/data/competition_data/易观/\"\nos.listdir(path)\n\ntrain = pd.read_csv(pat"
},
{
"path": "chizhu/util/get_nn_res.py",
"chars": 1237,
"preview": "import pandas as pd \npath = \"/Users/chizhu/data/competition_data/易观/\"\nres1 = pd.read_csv(path+\"res1.csv\")\nres2_1 = pd.re"
},
{
"path": "linwangli/code/lgb_allfeat_22.py",
"chars": 3056,
"preview": "#!/usr/bin/env python\n# coding: utf-8\n\nfrom catboost import Pool, CatBoostClassifier, cv\nimport pandas as pd\nimport seab"
},
{
"path": "linwangli/code/lgb_allfeat_condProb.py",
"chars": 7160,
"preview": "#!/usr/bin/env python\n# coding: utf-8\nfrom catboost import Pool, CatBoostClassifier, cv\nimport pandas as pd\nimport seabo"
},
{
"path": "linwangli/code/utils.py",
"chars": 1545,
"preview": "import pandas as pd\r\nimport numpy as np\r\n\r\ndef weights_ensemble(results, weights):\r\n\t'''\r\n\t针对此次比赛的按权重进行模型融合的函数脚本\r\n\tresul"
},
{
"path": "linwangli/readme.txt",
"chars": 253,
"preview": "|—— code\r\n |—— lgb_allfeat_22.py:基于【全部特征】训练得到lgb结果\r\n |—— lgb_allfeat_condProb.py:基于【全部特征+条件概率】训练得到lgb结果\r\n |—— u"
},
{
"path": "linwangli/yg-1st-lgb.py",
"chars": 15689,
"preview": "\n# coding: utf-8\n\n# In[ ]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "nb_cz_lwl_wcm/10_lgb.py",
"chars": 47126,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skle"
},
{
"path": "nb_cz_lwl_wcm/11_cnn.py",
"chars": 17935,
"preview": "\n# coding: utf-8\n\n# In[1]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "nb_cz_lwl_wcm/12_get_feature_lwl.py",
"chars": 10227,
"preview": "#!/usr/bin/env python\r\n# coding: utf-8\r\n\r\nimport pandas as pd\r\nimport seaborn as sns\r\nimport numpy as np\r\nfrom sklearn.d"
},
{
"path": "nb_cz_lwl_wcm/13_last_get_all_feature.py",
"chars": 2356,
"preview": "# -*- coding:utf-8 -*-\n\nimport pandas as pd\n\ndf_brand = pd.read_csv('feature/deviceid_brand_feature.csv')\ndf_lr = pd.rea"
},
{
"path": "nb_cz_lwl_wcm/1_get_age_reg.py",
"chars": 6396,
"preview": "# -*- coding:utf-8 -*-\n\n\n####### 尝试骚操作,单独针对这个表\nimport pandas as pd\nfrom sklearn.cluster import KMeans\nfrom sklearn.line"
},
{
"path": "nb_cz_lwl_wcm/2_get_feature_brand.py",
"chars": 855,
"preview": "# -*- coding:utf-8 -*-\n\nimport pandas as pd\nimport numpy as np\nfrom sklearn import preprocessing\n\ntrain = pd.read_csv('D"
},
{
"path": "nb_cz_lwl_wcm/3_get_feature_device_package.py",
"chars": 11707,
"preview": "# -*- coding:utf-8 -*-\n\n\n####### 尝试骚操作,单独针对这个表\nimport pandas as pd\nfrom sklearn.cluster import KMeans\nfrom sklearn.line"
},
{
"path": "nb_cz_lwl_wcm/4_get_feature_device_start_close_tfidf_1_2.py",
"chars": 8892,
"preview": "# -*- coding:utf-8 -*-\n\nimport pandas as pd\nimport scipy.sparse\nimport numpy as np\nfrom sklearn import preprocessing\nfro"
},
{
"path": "nb_cz_lwl_wcm/5_get_feature_device_start_close_tfidf.py",
"chars": 11207,
"preview": "# -*- coding:utf-8 -*-\n\nimport pandas as pd\nimport scipy.sparse\nimport numpy as np\nfrom sklearn import preprocessing\nfro"
},
{
"path": "nb_cz_lwl_wcm/6_get_feature_device_start_close.py",
"chars": 1886,
"preview": "# -*- coding:utf-8 -*-\n\nimport pandas as pd\nimport numpy as np\nfrom sklearn import preprocessing\n\ntrain = pd.read_csv('D"
},
{
"path": "nb_cz_lwl_wcm/7_get_feature_w2v.py",
"chars": 1954,
"preview": "from gensim.models import Word2Vec\nimport pandas as pd\npath=\"Demo/\"\npackages = pd.read_csv(path+\"deviceid_packages.tsv\","
},
{
"path": "nb_cz_lwl_wcm/8_get_feature_lwl.py",
"chars": 15689,
"preview": "\n# coding: utf-8\n\n# In[ ]:\n\n\nimport pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom skl"
},
{
"path": "nb_cz_lwl_wcm/9_yg_best_nn.py",
"chars": 14492,
"preview": "import pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom sklearn.decomposition import Lat"
},
{
"path": "nb_cz_lwl_wcm/运行说明.txt",
"chars": 67,
"preview": "Demo文件夹下存放原始数据集\n按照1、2、3... 顺序运行,最后在feature文件夹下面生成feature_nurbs.csv\n"
},
{
"path": "wangcanming/deepnet_v33.py",
"chars": 11958,
"preview": "import pandas as pd\nimport seaborn as sns\nimport numpy as np\nfrom tqdm import tqdm\nfrom sklearn.decomposition import Lat"
}
]
// ... and 2 more files
About this extraction
This page contains the full source code of the chizhu/yiguan_sex_age_predict_1st_solution GitHub repository, extracted and formatted as plain text: 71 files (678.0 KB, approximately 200.6k tokens) plus a symbol index of 249 extracted functions, classes, methods, constants, and types. Extracted by GitExtract, built by Nikandr Surkov.