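Repository: luoda888/2018-KUAISHOU-TSINGHUA-Top13-Solutions
Branch: master
Commit: f7e871e063bd
Files: 7
Total size: 93.2 KB

Directory structure:
gitextract_z0_s53rx/
├── README.md
├── feature_engineer/
│   ├── feature_engineer1
│   ├── feature_engineer2.py
│   └── get_feature.py
└── model/
    ├── ffm.py
    ├── lgb_model
    └── xgb_model.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# 2018-KUAISHOU-TSINGHUA-Top13-Solutions

2018 China Collegiate Computing Contest, Big Data Challenge: Top 13 Solutions

#### Preliminary round A: Top 2, preliminary round B: Top 5, second round final rank: 13

Every encounter is a reunion after a long separation. Who knows in what year and month we will next stand on the defense stage; we did right by ourselves and by our youth.

Teammate's RNN: https://github.com/totoruo/KuaiShou2018-RANK13-RNN

#### Task summary: given users' historical behavior on the Kuaishou app for days 1-30, predict which users are active in the following 7 days (31-37). A user is active if they appear in any of the log tables; cold start is not considered.

##### Framework

1. Equal-length sliding windows: [1,16] [2,17] ... [8,23] ---> Predict [15,30]
2. Variable-length sliding windows: [1,16] [1,17] ... [1,24] ---> Predict [1,30]

##### Feature engineering

Tips: the features below are fairly generic, but in the variable-length framework 2 every distance-based computation has to be smoothed, i.e. divided by the length of the time window.

For the APP-LAUNCH, ACTION and VIDEO tables:

1. Encode the time series. Two encodings were considered:
   - a. Treat the user's active days as a binary number and weight the active days in binary, so that days closer to the prediction date get a higher weight. For example, in a 6-day window with activity on days 1, 3, 5 the pattern is 101010; reversed it is 010101, which is 21 in decimal. `ans += binary_day[i]*(2**i)`
   - b. Weight directly by the distance to the prediction date. `ans += binary_day[i]*(1/(end_date-i))`
2. Descriptive statistics of the time series:
   - First-level statistics: mean, std, median, max, min, (max-min), mode, count, nunique (in the APP table, count == nunique)
   - Second-level statistics: skew, kurt, mad
   - Frequency-domain periodicity of logins: Var(fft(X['day']))
   - Day-of-week periodicity of logins: Get_Mode(X['week'])
3. Day-distance interactions between the time series and the prediction date:
   - Distance from the user's last login to the prediction date: end_date+1-x['day'].max()
   - Distance from the user's K-th most recent login to the prediction date: end_date+1-get_second_day(x,k)
   - Distance between the last and the second-to-last login: x['day'].max()-get_second_day(x,2)
4. First-order difference of the time series:
   - Maximum/minimum number of days between logins: Diff Max/Min
   - Average login interval: Diff Mean
   - Stability of the login interval: Diff Var
   - Periodicity of the login interval: Var(fft(X['day']))

Special handling for the ACTION table:

1. Decay-coefficient encoding of the time series: number of actions divided by the distance between the current day and the prediction date. sigma_ans += np.log(ans[i]/(window_len-i))
2. Action-count interactions between the time series and the prediction date:
   - Distribution of the Pages and Actions viewed on the user's last / second-to-last login, e.g. 3 views of Page 0, 4 views of Page 1, 2 occurrences of Action 1; behaviors that did not occur are filled with 0.
   - Split the groups of behaviors into subsets, which makes it possible to compute the user's distribution over pages, e.g. descriptive statistics (mean, std, etc.) of the behavior that happened only on Page 0 (likewise for Action).
   - On the day of the user's last / K-th most recent action: Count VIDEOID/AuthorID
   - In the interval between the last and the second-to-last action: Count VIDEOID/AuthorID
3. Expand Page and Action globally per user:
   - The share of each behavior in the user's whole behavior sequence, e.g. 5 clicks on Page 0 out of the user's 10 clicks in total gives Ratio1 = 0.5.
   - The share of each behavior among all behaviors of that day, e.g. the user has 900 clicks on Page 0 and there were 1000 clicks that day, so Ratio2 = 0.9.
   - These two statistics help guard against click-farming. A normal behavior sequence should be a fairly smooth click sequence; once a peak appears (e.g. mostly report actions, or watching the same video over and over), it can be treated as a click-farming signal.
4. Compute the contribution sequence of the videos the user watched:

        def GongXianDu(df):
            d11 = df.set_index('video_id')
            d11['gongxian_rate'] = df.groupby('video_id').size()
            d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times']
            meand = d11['gongxian_rate'].mean()
            sumd = d11['gongxian_rate'].sum()
            stdd = d11['gongxian_rate'].std()
            skeww = d11['gongxian_rate'].skew()
            kurtt = d11['gongxian_rate'].kurt()
            return sumd,meand,stdd,skeww,kurtt

        temp = train_act.groupby('user_id').apply(GongXianDu)

5. Check whether the user follows a favorite author: whether the favorite author has published new videos, and how many other people like the videos the user watched (to separate niche from mainstream taste):

        def FavAuthorCreate(df):
            most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0]
            create_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
            watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
            watch_other_video = 1 if watch_other_video_num>1 else 0
            return create_video_num , watch_other_video

Mining the REGISTER table:

1. Registration periodicity, e.g. weekend promotions; the most intuitive feature is Week.
2. Periodicity interactions, e.g. specific Type combinations on weekends: register['week'] * register['device_type'], register['week'] * register['register_type']
3. Interactions between categorical features, e.g. register['device_type'] * register['register_type']
4. Number of users per category, e.g. register_log.groupby(['register_type'])['user_id'].transform('count').values; the same can be computed for device_type, week_rt, week_dt, rt_dt.
5. Conversion rate per category (has to be computed with sliding windows): groupby(['count_label_ratio'])['register_type','device_type'].transform('count').values

##### Model selection

Snake's feature-selection framework from the diabetes competition: https://github.com/luoda888/tianchi-diabetes-top12

After keeping the mandatory features (Encoder/FFT), thresholds were set to produce two feature sets.

- Tree models: LGB on framework 1, XGB on framework 2
- FFM: Xlearn, computed on the TopK features selected by feature importance
- xDeepFM: sequence features / Category features as input
- CNN/RNN: sequence features as input

##### Model ensembling

Ensembling paid off a great deal in this task. We tried 3 ways of combining models, ranked here by effect:

- Top 3: Stacking. Useless; with the same feature set it even lowers the score.
- Top 2: Weighted averaging. Model similarity was around 0.97 to 0.98; at 0.97 the gain is about 1 thousandth, at 0.98 roughly 7-8 ten-thousandths.
- Top 1: Half-and-half Blending, e.g. test with the first sliding window, test with the second sliding window, then blend the two folds. The gain is about 1.2 thousandths.

##### Reflections

First of all, thanks to my teammates for working hard all along and not giving up until the end. I had planned to write a lot about the approach to this task, but on reflection I am still not good enough, so I will leave it.

In the recent competitions, Ping An 11th, Tencent 12th, Kuaishou 13th: truly the king of those outside the top 10.

I have also reflected on myself. Often my approach to mining features was off: I liked going all in at once too much and ignored the concrete data background and the concrete changes. Data mining should be more about one's own thinking and less about routine tricks.

When everything follows a fixed routine, a lot of the fun is lost too.

The atmosphere in the Kuaishou group was good, and the attitude of the Kesci platform is indeed among the best I have seen so far.

Perhaps if that LGB model had finished running at noon the day before yesterday, we would surely be in the Top 10 now.

There are regrets too; swallow them.

Empires and grand ambitions pass in talk and laughter; none of it beats getting drunk on life once.

Let's go, and drink!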
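
As a small illustration of the two time-series encodings from the feature engineering notes (1a and 1b), here is a minimal sketch. It assumes days have already been re-based to 1..window_len (so that `end_date` equals the window length), and it follows the day order that reproduces the worked example above (active days 1, 3, 5 in a 6-day window give 21). The function names `active_day_bits`, `binary_encode` and `distance_encode` are hypothetical and are not part of the repository.

```python
def active_day_bits(active_days, window_len):
    """Return a 0/1 list where index i stands for day i+1 of the window."""
    active = set(active_days)
    return [1 if day in active else 0 for day in range(1, window_len + 1)]


def binary_encode(active_days, window_len):
    """Encoding 1a: day d is weighted by 2**(d-1), so later days weigh more."""
    bits = active_day_bits(active_days, window_len)
    return sum(bit * 2 ** i for i, bit in enumerate(bits))


def distance_encode(active_days, window_len):
    """Encoding 1b: day d is weighted by 1 / (window_len - d + 1).

    The last day of the window gets weight 1, the first day 1/window_len.
    """
    bits = active_day_bits(active_days, window_len)
    return sum(bit / (window_len - i) for i, bit in enumerate(bits))


# Worked example from the text: 6-day window, active on days 1, 3 and 5.
print(binary_encode([1, 3, 5], 6))    # 1 + 4 + 16 = 21
print(distance_encode([1, 3, 5], 6))  # 1/6 + 1/4 + 1/2
```

In the variable-length framework the window length differs between samples, so, as the tip above says, such values would presumably need to be normalized by the window length before being compared across windows.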
================================================
FILE: feature_engineer/feature_engineer1
================================================
# Feature engineering: sliding-window step of 1 day, full data, no partial average weighting
from multiprocessing import Pool
import numpy as np
import pandas as pd
from pandas import DataFrame as DF
import gc
from sklearn.preprocessing import LabelEncoder
import time
from scipy import stats

def get_transform(now,start_date,end_date):
    # Keep only the rows whose day lies in [start_date, end_date]
    get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)]
    return get_trans

def get_label(start_date,end_date):
    # Label 1: users seen in any log table inside the label window;
    # label 0: users registered before the window who were not seen in it
    merge_name = ['user_id','day']
    all_log = pd.concat([action_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0)
    train_label = get_transform(all_log,start_date,end_date)
    train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'})
    train_1['label'] = 1
    reg_temp = get_transform(register_log,1,start_date-1)
    train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])]
    train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'})
    train_0['label'] = 0
    del train_label
    gc.collect()
    return pd.concat([train_1,train_0],axis=0)

def check_id(uid,now):
    return now[now['user_id'].isin(uid)]

def get_mode(now):
    return stats.mode(now)[0][0]

def get_binary_seq(now,start_date,end_date):
    # 0/1 activity sequence over the window, most recent day first
    day = list(range(1,end_date-start_date+2))
    day.reverse()
    ans1 = 0
    binary_day = []
    now_uni = now.unique()
    for i in day:
        if i in now_uni:
            binary_day.append(1)
        else:
            binary_day.append(0)
    return binary_day

def get_binary1(now,start_date,end_date):
    # Boss Feature
    ans = 0
    binary_day = get_binary_seq(now,start_date,end_date)
    for i in range(len(binary_day)):
        ans += binary_day[i]*(2**i)
    return ans

def get_binary2(now,start_date,end_date):
    # Boss Feature
    ans = 0
    binary_day = get_binary_seq(now,start_date,end_date)
    for i in range(len(binary_day)):
        ans += binary_day[i]*(1/(end_date-i))
    return ans

def get_time_log_weight_sigma(now,start_date,end_date):
    window_len = end_date+1-start_date
    ans = 
np.zeros(window_len) sigma_ans = 0 for i in now: ans[(i-1)%window_len] += 1 for i in range(window_len): if ans[i]!=0: sigma_ans += np.log(ans[i]/(window_len-i)) return sigma_ans def get_max_count(x,x_max): x_max = int(x_max) if x_max>0: return x['day'].value_counts()[x_max] else: return 0 def get_max_movie(x,x_max): x_max = int(x_max) if x_max>0: x = x[x['day']==x_max] return x['video_id'].nunique() else: return 0 def get_type_feature(control,name,now,train_data,start_date,end_date,gap,gap_name): now = get_transform(now,start_date,end_date) train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,np.nan_to_num(x['day'].max()))).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_count_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_movie(x,np.nan_to_num(x['day'].max()))).reset_index().rename(columns={0:'nunique_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:get_max_movie(x,get_second_day(x[name],2))).reset_index().rename(columns={0:'nunique2_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-get_second_day(x[name],2))).reset_index().rename(columns={0:'max_distance_12_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left') return train_data def get_encoder_feature(control,name,now,train_data,start_date,end_date): now = get_transform(now,start_date,end_date) train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') return train_data def get_time_feature(control,name,now,train_data,start_date,end_date): now = get_transform(now,start_date,end_date) # 描述性统计特征 6 t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_all_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data['nunique / count' + control + name +str(end_date-start_date)] = train_data['nunique_all_'+control+name+str(end_date-start_date)] / 
train_data['count_'+control+name+str(end_date-start_date)] train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].min()-start_date)).reset_index().rename(columns={0:'min-start_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].min())).reset_index().rename(columns={0:'end-min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].kurt()).reset_index().rename(columns={0:'kurt_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].skew()).reset_index().rename(columns={0:'skew_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.84)).reset_index().rename(columns={0:'q4_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.92)).reset_index().rename(columns={0:'q5_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.97)).reset_index().rename(columns={0:'q6_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') t2 = time.time() print(name,' Describe Finished... 
',t2-t1,' Shape: ',train_data.shape) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_var_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_var_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') t2 = time.time() print(control,' FFT Finished...',t2-t1,' Shape: ',train_data.shape) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:np.array(get_binary_seq(x[name],start_date,end_date)).std()).reset_index().rename(columns={0:'01seq_std_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:np.array(get_binary_seq(x[name],start_date,end_date)).mean()).reset_index().rename(columns={0:'01seq_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') t2 = time.time() print(control,' 01seq Describe Finished... 
',t2-t1,' Shape: ',train_data.shape) # 时间衰减 4 t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') t2 = time.time() print(control,' Sigma Finished... ',t2-t1,' Shape: ',train_data.shape) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],3))).reset_index().rename(columns={0:'max3_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-get_second_day(x[name],2))).reset_index().rename(columns={0:'max_distance_12_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left') t2 = time.time() print(control,' Max Finished... 
',t2-t1,' Shape: ',train_data.shape) return train_data def get_second_day(now,seq): now = list(now.unique()) for i in range(seq-1): if len(now)>1: now.remove(max(now)) else: return 0 return max(now) def get_id_feature(control,name,now,train_data,start_date,end_date): now = get_transform(now,start_date,end_date) train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'var_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left') return train_data def get_diff_feature(control,name,now,train_data,start_date,end_date): now = get_transform(now,start_date,end_date) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].std()).reset_index().rename(columns={0:'var_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'mean_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].max()).reset_index().rename(columns={0:'max_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left') t2 = time.time() print(control,' Get Diff Feature Finished... Used: ',t2-t1,' Shape: ',train_data.shape) return train_data def HowManyPeopleWatch(df): num_people = len(df['user_id'].unique()) return num_people def MostHandle(df): most_handle = df.groupby('video_id').size().max() return most_handle def FavAuthorCreate(df): most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0] create_video_num = len(df[df['author_id']==most_author]['video_id'].unique()) watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique()) watch_other_video = 1 if watch_other_video_num>1 else 0 return create_video_num , watch_other_video def GongXianDu(df): d11 = df.set_index('video_id') d11['gongxian_rate'] = df.groupby('video_id').size() d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times'] meand = d11['gongxian_rate'].mean() sumd = d11['gongxian_rate'].sum() stdd = d11['gongxian_rate'].std() skeww = d11['gongxian_rate'].skew() kurtt = d11['gongxian_rate'].kurt() return sumd,meand,stdd,skeww,kurtt def get_category_count(name,deal_now,train_data,start_date,end_date): count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'})) count_size = deal_now.groupby([name]).size().shape[0] sum_data = 0 for i in range(0,count_size): new_name = 'see_'+name+'_'+str(i) temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name}) train_have = 
pd.merge(train_data,temp[['user_id',new_name]],on=['user_id']) train_have = train_have[['user_id',new_name]] not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values)) train_not_have = DF() train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id'] train_not_have['see_'+name+'_'+str(i)] = 0 temp = pd.concat([train_have,train_not_have],axis=0) train_data = pd.merge(train_data,temp,on=['user_id'],how='left') sum_data += train_data[new_name].values for i in range(0,count_size): new_name = 'see_'+name+'_'+str(i) train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data return train_data def get_last_window(now): if now.min()>0: return 1 else: return 0 def parallelize_df_func(df, func, start, end, num_partitions=21, n_jobs=7): df_split = np.array_split(df, num_partitions) start_date = [start] * num_partitions end_date = [end] * num_partitions param_info = zip(df_split, start_date, end_date) pool = Pool(n_jobs) gc.collect() df = pd.concat(pool.map(func, param_info)) pool.close() pool.join() gc.collect() return df def get_train(param_info): uid = param_info[0] start_date= param_info[1] end_date= param_info[2] t_start = time.time() t1 = time.time() train_act = check_id(uid,get_transform(action_log,start_date,end_date)) train_video = check_id(uid,get_transform(video_log,start_date,end_date)) train_app = check_id(uid,get_transform(app_log,start_date,end_date)) train_reg = register_log[register_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'}) # Get Week train_act['week'] = (train_act['day'].values) % 7 train_video['week'] = (train_video['day'].values) % 7 train_app['week'] = (train_app['day'].values) % 7 # Modify Day train_reg['reg_day'] = train_reg['reg_day'] - start_date + 1 train_act['day'] = train_act['day'] - start_date + 1 train_video['day'] = train_video['day'] - start_date + 1 train_app['day'] = train_app['day'] - start_date + 1 end_date = end_date-start_date+1 true_start = 
start_date
    start_date = 1
    t2 = time.time()
    print(start_date,' To ',end_date,' Have User: ',len(uid))
    print('Data Prepare Use...',t2-t1)
    # Build
    train_data = DF()
    train_data['user_id'] = uid  # 1 feature
    # Called once: repeating the identical call would merge the same column names
    # again (pandas adds _x/_y suffixes) and break the column lookups inside
    # get_time_feature
    train_data = get_time_feature('act_','day',train_act,train_data,start_date,end_date)
    print('Act Encoder Finished')
    for i in range(5):
        page_temp = train_act[train_act['page']==i]
        train_data = get_type_feature('act_page'+str(i)+'_','day',page_temp,train_data,start_date,end_date,end_date-start_date,'_all')
    for i in range(6):
        act_temp = train_act[train_act['action_type']==i]
        train_data = get_type_feature('act_action'+str(i)+'_','day',act_temp,train_data,start_date,end_date,end_date-start_date,'_all')
    train_data = get_diff_feature('act_','diff_day',train_act,train_data,start_date,end_date)
    train_data = get_encoder_feature('act_','diff_day',train_act,train_data,start_date,end_date)
    train_data = pd.merge(train_data,train_act.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'act_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
    for i in ['page','action_type','video_id','author_id']:  # 4*3 12 feature
        train_data = get_id_feature('id_act_',i,train_act,train_data,start_date,end_date)
    print(train_data.shape,' Aci Finished')
    train_data = get_category_count('page',train_act,train_data,start_date,end_date)
    train_data = get_category_count('action_type',train_act,train_data,start_date,end_date)
    print(train_data.shape,' Category Finished')
    train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date)
    train_data = get_diff_feature('video_','diff_day',train_video,train_data,start_date,end_date)
    print(train_data.shape,' Video Finished')
    train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date)
    train_data = 
get_diff_feature('app_','diff_day',train_app,train_data,start_date,end_date) train_data = get_encoder_feature('app_','diff_day',train_app,train_data,start_date,end_date) t1 = time.time() train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'app_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left') print(train_data.shape,' APP Finished') train_app['diff2'] = train_app['day']-train_app.groupby(['user_id'])['day'].shift(2).values app_diff = train_app.dropna() app_diff = app_diff.groupby(['user_id'],as_index=False).agg({'diff2':['max','min','mean','std']}) app_diff.columns = ['user_id','app_diff2_max','app_diff2_min','app_diff2_mean','app_diff2_std'] train_data = pd.merge(train_data,app_diff,on=['user_id'],how='left') t2 = time.time() print('APP DIFF ',t2-t1) t1 = time.time() user_feature = DF(train_data['user_id'].unique()) user_feature.columns = ['user_id'] user_feature = user_feature.set_index('user_id') user_feature['HowManyPeople_Watch'] = train_act.groupby('author_id').apply(HowManyPeopleWatch) user_feature['Most_Handle'] = train_act.groupby('user_id').apply(MostHandle) #计算视频被观看总次数 video_size = train_act.groupby('video_id').size().reset_index() video_size.columns = ['video_id','video_watched_times'] train_act = pd.merge(train_act,video_size,on=['video_id'],how='left') #分别计算每个用户的贡献度和、均、方 temp = train_act.groupby('user_id').apply(GongXianDu) user_feature['GongXianSum'] = temp.apply(lambda x:x[0]) user_feature['GongXianMean'] = temp.apply(lambda x:x[1]) user_feature['GongXianStd'] = temp.apply(lambda x:x[2]) user_feature['GongXianSkeww'] = temp.apply(lambda x:x[3]) user_feature['GongXianKurtt'] = temp.apply(lambda x:x[4]) fav_author = train_act.groupby('user_id').apply(FavAuthorCreate) user_feature['FavAuthorCreate'] = fav_author.apply(lambda x:x[0]) user_feature['WatchOtherVideo'] = fav_author.apply(lambda x:x[1]) train_data = 
pd.merge(train_data,user_feature.reset_index(),on=['user_id'],how='left') t2 = time.time() print('Use Time: ',t2-t1,' User-Author Finish... ','Shape:',train_data.shape) train_data = pd.merge(train_data,train_reg[['user_id','register_type','device_type','week','reg_day']],on=['user_id'],how='left').rename(columns={'week':'reg_week'}) # 2 t_end = time.time() print('Get Feature Use All Time: ',t_end-t_start,' Shape: ',train_data.shape) gc.collect() return train_data def data_prepare(read_path=None): register_log = pd.read_csv(read_path+'user_register_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'}) action_log = pd.read_csv(read_path+'user_activity_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'}) app_log = pd.read_csv(read_path+'app_launch_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'}) video_log = pd.read_csv(read_path+'video_create_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'}) # Sort By User register_log = register_log.sort_values(by=['user_id','day'],ascending=True) action_log = action_log.sort_values(by=['user_id','day'],ascending=True) app_log = app_log.sort_values(by=['user_id','day'],ascending=True) video_log = video_log.sort_values(by=['user_id','day'],ascending=True) # Diff Day t1 = time.time() app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8) video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8) action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8) t2 = time.time() print('Diff Day Finished... 
',t2-t1)
    # Prepare REGISTER
    register_log['week'] = register_log['day'] % 7
    register_log['rt_dt'] = (register_log['register_type']+1)*(register_log['device_type']+1)
    # The day-of-week column is named 'week' on register_log ('reg_week' does not exist here)
    register_log['week_rt'] = (register_log['register_type']+1)*(register_log['week']+1)
    register_log['week_dt'] = (register_log['device_type']+1)*(register_log['week']+1)
    register_log['use_reg_people'] = register_log.groupby(['register_type'])['user_id'].transform('count').values
    register_log['use_dev_people'] = register_log.groupby(['device_type'])['user_id'].transform('count').values
    register_log['week_rt_use_people'] = register_log.groupby(['week_rt'])['user_id'].transform('count').values
    register_log['week_dt_use_people'] = register_log.groupby(['week_dt'])['user_id'].transform('count').values
    register_log['rt_dt_use_people'] = register_log.groupby(['rt_dt'])['user_id'].transform('count').values
    return register_log,action_log,app_log,video_log

read_path = '/mnt/datasets/fusai/'
register_log,action_log,app_log,video_log = data_prepare(read_path)
train_set = []
# Label window [i, i+6]; feature window is the 16 days before it, [i-16, i-1]
for i in range(17,25):
    train_label = get_label(i,i+6)
    train_data_part1 = parallelize_df_func(train_label['user_id'], get_train, i-16, i-1, 1, 1)
    train_data = pd.merge(train_data_part1,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',
                                                         'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',
                                                         'rt_dt_use_people']],on=['user_id'],how='left')
    train_data = pd.merge(train_data,train_label,on=['user_id'],how='left')
    train_set.append(train_data)
    del train_data_part1
    gc.collect()
# The last window is held out for validation
train_data = pd.concat(train_set[0:-1],axis=0).reset_index(drop=True)
valid_data = train_set[-1]
online_data = parallelize_df_func(register_log['user_id'].unique(), get_train, 15, 30, 1,1)
online_data = pd.merge(online_data,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',
                                                 'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',
'rt_dt_use_people']],on=['user_id'],how='left') write_path = '/home/kesci/' train_data.to_csv(write_path+'train_data.csv',index=False) valid_data.to_csv(write_path+'valid_data.csv',index=False) online_data.to_csv(write_path+'online_data.csv',index=False) print('Style 1 Feature Engineer Finished...') ================================================ FILE: feature_engineer/feature_engineer2.py ================================================ # 该特征工程为不等长滑窗,滑窗的step为1天,全量数据 import numpy as np import pandas as pd from pandas import DataFrame as DF import gc from multiprocessing import Pool from sklearn.preprocessing import LabelEncoder import time from scipy import stats register_log = pd.read_csv('/mnt/datasets/fusai/user_register_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'}) action_log = pd.read_csv('/mnt/datasets/fusai/user_activity_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'}) app_log = pd.read_csv('/mnt/datasets/fusai/app_launch_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'}) video_log = pd.read_csv('/mnt/datasets/fusai/video_create_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'}) register_log = register_log.sort_values(by=['user_id','day'],ascending=True) action_log = action_log.sort_values(by=['user_id','day'],ascending=True) app_log = app_log.sort_values(by=['user_id','day'],ascending=True) video_log = video_log.sort_values(by=['user_id','day'],ascending=True) register_log['week'] = register_log['day'] % 7 t1 = time.time() app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1) app_log['diff_day'] = app_log['diff_day'].astype(np.int8) t2 = time.time() print('Diff APP Finished... 
',t2-t1) t1 = time.time() video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1) video_log['diff_day'] = video_log['diff_day'].astype(np.int8) t2 = time.time() print('Diff Video Finished... ',t2-t1) t1 = time.time() action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1) action_log['diff_day'] = action_log['diff_day'].astype(np.int8) t2 = time.time() print('Diff Act Finished... ',t2-t1) def reduce_mem_usage(props): # 计算当前内存 start_mem_usg = props.memory_usage().sum() / 1024 ** 2 print("Memory usage of the dataframe is :", start_mem_usg, "MB") NAlist = [] for col in props.columns: if (props[col].dtypes != object): isInt = False mmax = props[col].max() mmin = props[col].min() if not np.isfinite(props[col]).all(): NAlist.append(col) props[col].fillna(-1, inplace=True) props[col].replace(np.inf,-1,inplace=True) asint = props[col].fillna(-1).astype(np.int64) result = np.fabs(props[col] - asint) result = result.sum() if result < 0.01: isInt = True if isInt: if mmin >= 0: if mmax <= 255: props[col] = props[col].astype(np.uint8) elif mmax <= 65535: props[col] = props[col].astype(np.uint16) elif mmax <= 4294967295: props[col] = props[col].astype(np.uint32) else: props[col] = props[col].astype(np.uint64) else: if mmin > np.iinfo(np.int8).min and mmax < np.iinfo(np.int8).max: props[col] = props[col].astype(np.int8) elif mmin > np.iinfo(np.int16).min and mmax < np.iinfo(np.int16).max: props[col] = props[col].astype(np.int16) elif mmin > np.iinfo(np.int32).min and mmax < np.iinfo(np.int32).max: props[col] = props[col].astype(np.int32) elif mmin > np.iinfo(np.int64).min and mmax < np.iinfo(np.int64).max: props[col] = props[col].astype(np.int64) else: props[col] = props[col].astype(np.float16) mem_usg = props.memory_usage().sum() / 1024**2 print("This is ",100*mem_usg/start_mem_usg,"% of the initial size") return props def get_transform(now,start_date,end_date): get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)] 
return get_trans def get_label(start_date,end_date): merge_name = ['user_id','day'] all_log = pd.concat([action_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0) train_label = get_transform(all_log,start_date,end_date) train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'}) train_1['label'] = 1 reg_temp = get_transform(register_log,1,start_date-1) train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])] train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'}) train_0['label'] = 0 del train_label gc.collect() return pd.concat([train_1,train_0],axis=0) def check_id(uid,now): return now[now['user_id'].isin(uid)] def get_mode(now): return stats.mode(now)[0][0] def get_binary_seq(now,start_date,end_date): day = list(range(1,end_date-start_date+2)) ans1 = 0 binary_day = [] now_uni = now.unique() for i in day: if i in now_uni: binary_day.append(1) else: binary_day.append(0) return binary_day def get_binary1(now,start_date,end_date): # Boss Feature ans = 0 binary_day = get_binary_seq(now,start_date,end_date) for i in range(len(binary_day)): ans += binary_day[i]*(2**i) return ans def get_binary2(now,start_date,end_date): # Boss Feature ans = 0 binary_day = get_binary_seq(now,start_date,end_date) for i in range(len(binary_day)): ans += binary_day[i]*(1/(end_date-i)) return ans def get_time_log_weight_sigma(now,start_date,end_date): window_len = end_date+1-start_date ans = np.zeros(window_len) sigma_ans = 0 for i in now: ans[(i-1)%window_len] += 1 for i in range(window_len): if ans[i]!=0: sigma_ans += np.log(ans[i]/(window_len-i)) return sigma_ans def get_max_count(x,name): x_max = x[name].max() if x_max>0: return x[name].value_counts(x_max) else: return np.nan def get_max_movie(x): x_max = x['day'].max() if x_max>0: x = x[x['day']==x_max] return x['video_id'].nunique() else: return np.nan def get_type_feature(control,name,now,train_data,start_date,end_date,gap,gap_name): now = 
get_transform(now,start_date,end_date) train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,name)).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_movie(x)).reset_index().rename(columns={0:'nunique_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') return train_data def get_time_feature(control,name,now,train_data,start_date,end_date,gap,gap_name): now = get_transform(now,start_date,end_date) # Descriptive statistics features (6) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_all_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()/gap).reset_index().rename(columns={0:'count_'+control+name+gap_name}),on=['user_id'],how='left') train_data['nunique / count' + control + name + gap_name] = train_data['nunique_all_'+control+name+gap_name] / train_data['count_'+control+name+gap_name] train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].min()-start_date)).reset_index().rename(columns={0:'min-start_'+control+name+gap_name}),on=['user_id'],how='left') train_data 
= pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].min())).reset_index().rename(columns={0:'end-min_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,name)).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.75)/gap).reset_index().rename(columns={0:'q2_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.84)/gap).reset_index().rename(columns={0:'q3_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.96)/gap).reset_index().rename(columns={0:'q4_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+gap_name}),on=['user_id'],how='left') t2 = time.time() print(name,' Describe Finished... 
',t2-t1,' Shape: ',train_data.shape) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.std(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_var_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_mean_'+control+name+gap_name}),on=['user_id'],how='left') t2 = time.time() print(control,' Sigma Finished... ',t2-t1,' Shape: ',train_data.shape) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],3))).reset_index().rename(columns={0:'max3_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left') t2 = time.time() print(control,' Max Finished... 
',t2-t1,' Shape: ',train_data.shape) return train_data def get_second_day(now,seq): now = list(now.unique()) for i in range(seq-1): if len(now)>1: now.remove(max(now)) else: return np.nan return max(now) def get_id_feature(control,name,now,train_data,start_date,end_date,gap,gap_name): now = get_transform(now,start_date,end_date) train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()/gap).reset_index().rename(columns={0:'count_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'var_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left') return train_data def get_diff_feature(control,name,now,train_data,start_date,end_date,gap,gap_name): now = get_transform(now,start_date,end_date) t1 = time.time() train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].std()).reset_index().rename(columns={0:'var_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'mean_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].max()).reset_index().rename(columns={0:'max_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+gap_name}),on=['user_id'],how='left') train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'min_'+control+name+gap_name}),on=['user_id'],how='left') t2 = time.time() # print(control,' Get Diff Feature Finished... Used: ',t2-t1,' Shape: ',train_data.shape) return train_data def get_category_count(name,deal_now,train_data,start_date,end_date): count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'})) count_size = deal_now.groupby([name]).size().shape[0] sum_data = 0 for i in range(0,count_size): new_name = 'see_'+name+'_'+str(i) temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name}) train_have = pd.merge(train_data,temp[['user_id',new_name]],on=['user_id']) train_have = train_have[['user_id',new_name]] not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values)) train_not_have = DF() train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id'] train_not_have['see_'+name+'_'+str(i)] = 0 temp = pd.concat([train_have,train_not_have],axis=0) train_data = pd.merge(train_data,temp,on=['user_id'],how='left') sum_data += train_data[new_name].values for i in range(0,count_size): new_name = 'see_'+name+'_'+str(i) train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data return train_data from multiprocessing import Pool def parallelize_df_func(df, func, start, end, num_partitions=40, n_jobs=4): df_split = np.array_split(df, num_partitions) start_date = [start] * num_partitions end_date = [end] * num_partitions param_info = zip(df_split, start_date, end_date) pool = Pool(n_jobs) gc.collect() df = 
pd.concat(pool.map(func, param_info)) pool.close() pool.join() return df def get_train(param_info): uid = param_info[0] start_date= param_info[1] end_date= param_info[2] t_start = time.time() t1 = time.time() train_act = check_id(uid,get_transform(action_log,start_date,end_date)) train_video = check_id(uid,get_transform(video_log,start_date,end_date)) train_app = check_id(uid,get_transform(app_log,start_date,end_date)) # Get Week train_act['week'] = (train_act['day'].values) % 7 train_video['week'] = (train_video['day'].values) % 7 train_app['week'] = (train_app['day'].values) % 7 # Modify Day train_act['day'] = train_act['day'] - start_date + 1 train_video['day'] = train_video['day'] - start_date + 1 train_app['day'] = train_app['day'] - start_date + 1 end_date = end_date-start_date+1 true_start = start_date start_date = 1 train_reg = register_log[register_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'}) train_act = pd.merge(train_act,train_reg[['user_id','reg_day']],on=['user_id'],how='left') train_video = pd.merge(train_video,train_reg[['user_id','reg_day']],on=['user_id'],how='left') train_app = pd.merge(train_app,train_reg[['user_id','reg_day']],on=['user_id'],how='left') del train_act['reg_day'] del train_video['reg_day'] del train_app['reg_day'] gc.collect() t2 = time.time() print(start_date,' To ',end_date,' Have User: ',len(uid)) print('Data Prepare Use...',t2-t1) # Build train_data = DF() train_data['user_id'] = uid # 1 feature train_data = pd.merge(train_data,train_act.groupby(['user_id']).size().reset_index().rename(columns={0:'action_all_times'}),on=['user_id'],how='left').fillna(0) for i in range(5): page_temp = train_act[train_act['page']==i] train_data = get_type_feature('act_page'+str(i),'day',page_temp,train_data,start_date,end_date,end_date-start_date,'_all') train_data = get_time_feature('act_','day',train_act,train_data,start_date,end_date,end_date-start_date,'_all') train_data = 
get_diff_feature('act_','diff_day',train_act,train_data,start_date,end_date,end_date-start_date,'_all') train_data = pd.merge(train_data,train_act.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'act_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left') for i in ['page','action_type','video_id','author_id']: # 4*3 12 feature train_data = get_id_feature('id_act_',i,train_act,train_data,start_date,end_date,end_date-start_date,'_all') train_data = get_category_count('page',train_act,train_data,start_date,end_date) train_data = get_category_count('action_type',train_act,train_data,start_date,end_date) train_data = get_category_count('page',train_act,train_data,end_date-3,end_date) train_data = get_category_count('action_type',train_act,train_data,end_date-3,end_date) train_data = reduce_mem_usage(train_data) train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date,end_date-start_date,'_all') train_data = get_diff_feature('video_','diff_day',train_video,train_data,start_date,end_date,end_date-start_date,'_all') train_data = reduce_mem_usage(train_data) train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date,end_date-start_date,'_all') train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date,7,'_7') train_data = get_diff_feature('app_','diff_day',train_app,train_data,start_date,end_date,end_date-start_date,'_all') train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'app_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left') train_data = reduce_mem_usage(train_data) train_data = pd.merge(train_data,register_log[['user_id','register_type','device_type','week','day']],on=['user_id'],how='left').rename(columns={'week':'reg_week','day':'reg_day'}) # 2 train_data['rt_dt'] = (train_data['register_type']+1)*(train_data['device_type']+1) 
train_data['week_rt'] = (train_data['register_type']+1)*(train_data['reg_week']+1) train_data['week_dt'] = (train_data['device_type']+1)*(train_data['reg_week']+1) t_end = time.time() print('Get Feature Use All Time: ',t_end-t_start) train_data = reduce_mem_usage(train_data) return train_data def data_prepare(read_path=None): register_log = pd.read_csv(read_path+'user_register_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'}) action_log = pd.read_csv(read_path+'user_activity_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'}) app_log = pd.read_csv(read_path+'app_launch_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'}) video_log = pd.read_csv(read_path+'video_create_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'}) # Sort By User register_log = register_log.sort_values(by=['user_id','day'],ascending=True) action_log = action_log.sort_values(by=['user_id','day'],ascending=True) app_log = app_log.sort_values(by=['user_id','day'],ascending=True) video_log = video_log.sort_values(by=['user_id','day'],ascending=True) # Diff Day t1 = time.time() app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8) video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8) action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8) t2 = time.time() print('Diff Day Finished... 
',t2-t1) # Prepare REGISTER register_log['week'] = register_log['day'] % 7 register_log['rt_dt'] = (register_log['register_type']+1)*(register_log['device_type']+1) register_log['week_rt'] = (register_log['register_type']+1)*(register_log['week']+1) register_log['week_dt'] = (register_log['device_type']+1)*(register_log['week']+1) register_log['use_reg_people'] = register_log.groupby(['register_type'])['user_id'].transform('count').values register_log['use_dev_people'] = register_log.groupby(['device_type'])['user_id'].transform('count').values register_log['week_rt_use_people'] = register_log.groupby(['week_rt'])['user_id'].transform('count').values register_log['week_dt_use_people'] = register_log.groupby(['week_dt'])['user_id'].transform('count').values register_log['rt_dt_use_people'] = register_log.groupby(['rt_dt'])['user_id'].transform('count').values return register_log,action_log,app_log,video_log read_path = '/mnt/datasets/fusai/' register_log,action_log,app_log,video_log = data_prepare(read_path) train_set = [] for i in range(17,25): train_label = get_label(i,i+6) train_data_part1 = parallelize_df_func(train_label['user_id'], get_train, 1, i-1, 1, 1) train_data = pd.merge(train_data_part1,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt', 'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people', 'rt_dt_use_people']],on=['user_id'],how='left') train_data = pd.merge(train_data,train_label,on=['user_id'],how='left') train_set.append(train_data) del train_data_part1 gc.collect() train_data = pd.concat(train_set[0:-1],axis=0).reset_index(drop=True) valid_data = train_set[-1] online_data = parallelize_df_func(register_log['user_id'].unique(), get_train, 1, 30, 1,1) online_data = pd.merge(online_data,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt', 'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people', 
'rt_dt_use_people']],on=['user_id'],how='left') write_path = '/home/kesci/' train_data.to_csv(write_path+'train_data.csv',index=False) valid_data.to_csv(write_path+'valid_data.csv',index=False) online_data.to_csv(write_path+'online_data.csv',index=False) print('Style 2 Feature Engineer Finished...') ================================================ FILE: feature_engineer/get_feature.py ================================================ import numpy as np import pandas as pd import lightgbm as lgb import xgboost as xgb import catboost as cbt from pandas import DataFrame as DF import gc from sklearn.preprocessing import LabelEncoder import time import networkx as nx from sklearn.cluster import MeanShift,KMeans # 1. Train on the leaderboard-A and leaderboard-B data separately to get models A and B; then use model A to predict on all leaderboard-B features, append that prediction (Model-A) as a new column of Feature-B (so the B feature dimension grows by one), and retrain Model-B on the augmented data # 2. Try re-encoding User-id and merging the A and B data into Merge-AB, then extract features and train on that (but it is unknown whether Video-id and Author-id are shuffled) reg_log_a = pd.read_csv('data/a/user_register_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'}) aci_log_a = pd.read_csv('data/a/user_activity_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'}) app_log_a = pd.read_csv('data/a/app_launch_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'}) video_log_a = pd.read_csv('data/a/video_create_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'}) reg_log_b = pd.read_csv('data/b/user_register_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'}) aci_log_b = pd.read_csv('data/b/user_activity_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'}) app_log_b = pd.read_csv('data/b/app_launch_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'}) video_log_b = pd.read_csv('data/b/video_create_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'}) reg_log = 
pd.concat([reg_log_a,reg_log_b],axis=0).reset_index(drop=True) aci_log = pd.concat([aci_log_a,aci_log_b],axis=0).reset_index(drop=True) app_log = pd.concat([app_log_a,app_log_b],axis=0).reset_index(drop=True) video_log = pd.concat([video_log_a,video_log_b],axis=0).reset_index(drop=True) reg_log.to_csv('data/user_register_log.txt',sep=' ',header=False,index=False) aci_log.to_csv('data/user_activity_log.txt',sep=' ',header=False,index=False) app_log.to_csv('data/app_launch_log.txt',sep=' ',header=False,index=False) video_log.to_csv('data/video_create_log.txt',sep=' ',header=False,index=False) print('Persistence Finished...') # Get Week reg_log = reg_log[reg_log['device_type']!=1] reg_log['week'] = reg_log['day'] % 7 def get_transform(now,start_date,end_date): get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)] return get_trans def get_label(start_date,end_date): merge_name = ['user_id','day'] all_log = pd.concat([aci_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0) train_label = get_transform(all_log,start_date,end_date) train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'}) train_1['label'] = 1 reg_temp = get_transform(reg_log,start_date-16,start_date-1) train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])] train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'}) train_0['label'] = 0 del train_label gc.collect() return pd.concat([train_1,train_0],axis=0) def check_id(uid,now): return now[now['user_id'].isin(uid)] def get_category_count(name,deal_now,train_data,start_date,end_date): count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'})) count_size = aci_log.groupby([name]).size().shape[0] sum_data = 0 for i in range(0,count_size): new_name = 'see_'+name+'_'+str(i) temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name}) train_have = 
pd.merge(train_data,temp[['user_id',new_name]],on=['user_id']) train_have = train_have[['user_id',new_name]] not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values)) train_not_have = DF() train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id'] train_not_have['see_'+name+'_'+str(i)] = 0 temp = pd.concat([train_have,train_not_have],axis=0) train_data = pd.merge(train_data,temp,on=['user_id'],how='left') train_data = pd.merge(train_data,deal_now[deal_now[name]==i].groupby(['user_id']).apply(lambda x:get_binary(x['day'],start_date,end_date)).reset_index().rename(columns={0:'binary_'+str(i)+'_'+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') train_data = pd.merge(train_data,deal_now[deal_now[name]==i].groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x['day'],start_date,end_date)).reset_index().rename(columns={0:'get_log_sigma_'+str(i)+'_'+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left') sum_data += train_data[new_name].values for i in range(0,count_size): new_name = 'see_'+name+'_'+str(i) train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data return train_data def get_binary_seq(now,start_date,end_date): day = list(range(start_date,end_date+1)) ans1 = 0 binary_day = [] for i in day: if i in now.unique(): binary_day.append(1) else: binary_day.append(0) return binary_day def get_binary(now,start_date,end_date): # Boss Feature ans = 0 binary_day = get_binary_seq(now,start_date,end_date) for i in range(len(binary_day)): ans += binary_day[i]*(2**i) return ans def get_binary_mol7(now,start_date,end_date): ans = 0 binary_day = get_binary_seq(now,start_date,end_date) for i in range(len(binary_day)): ans += binary_day[i]*(2**(i%7)) return ans def get_time_log_weight_sigma(now,start_date,end_date): window_len = end_date+1-start_date ans = np.zeros(window_len) sigma_ans = 0 for i in now: ans[(i-1)%window_len] += 1 for i in range(window_len): if ans[i]!=0: 
sigma_ans += np.log(ans[i]/(window_len-i)) return sigma_ans def get_time_weight_sigma(now,start_date,end_date): window_len = end_date+1-start_date ans = np.zeros(window_len) sigma_ans = 0 for i in now: ans[(i-1)%window_len] += 1 for i in range(window_len): sigma_ans += ans[i]*(i+1) return sigma_ans def get_id_feature(control,name,now,train_data,start_date,end_date): if end_date1 else 0 return create_video_num , watch_other_video def GongXianDu(df): d11 = df.set_index('video_id') d11['gongxian_rate'] = df.groupby('video_id').size() d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times'] meand = d11['gongxian_rate'].mean() sumd = d11['gongxian_rate'].sum() stdd = d11['gongxian_rate'].std() skeww = d11['gongxian_rate'].skew() kurtt = d11['gongxian_rate'].kurt() return sumd,meand,stdd,skeww,kurtt def get_lx_day(now): k1 = np.array(now) k2 = np.where(np.diff(k1)==1)[0] i = 0 ans = [] while i=start_date else 0 # Need To Add: the difference between the user's active time and registration time t2 = time.time() print('Use Time: ',t2-t1,' Reg Finish... 
','Shape: ',train_data.shape) # Get business-logic features (3 features) t1 = time.time() user_feature = DF(train_data['user_id'].unique()) user_feature.columns = ['user_id'] user_feature = user_feature.set_index('user_id') user_feature['HowManyPeople_Watch'] = train_act.groupby('author_id').apply(HowManyPeopleWatch) user_feature['Most_Handle'] = train_act.groupby('user_id').apply(MostHandle) # Compute the total number of times each video was watched video_size = train_act.groupby('video_id').size().reset_index() video_size.columns = ['video_id','video_watched_times'] train_act = pd.merge(train_act,video_size,on=['video_id'],how='left') # Compute each user's contribution statistics: sum, mean, std, skew, kurtosis temp = train_act.groupby('user_id').apply(GongXianDu) user_feature['GongXianSum'] = temp.apply(lambda x:x[0]) user_feature['GongXianMean'] = temp.apply(lambda x:x[1]) user_feature['GongXianStd'] = temp.apply(lambda x:x[2]) user_feature['GongXianSkeww'] = temp.apply(lambda x:x[3]) user_feature['GongXianKurtt'] = temp.apply(lambda x:x[4]) fav_author = train_act.groupby('user_id').apply(FavAuthorCreate) user_feature['FavAuthorCreate'] = fav_author.apply(lambda x:x[0]) user_feature['WatchOtherVideo'] = fav_author.apply(lambda x:x[1]) train_data = pd.merge(train_data,user_feature.reset_index(),on=['user_id'],how='left') t2 = time.time() print('Use Time: ',t2-t1,' User-Author Finish... ','Shape:',train_data.shape) # Get clustering features t1 = time.time() train_data = get_only_user_author_graph_feature(train_act,'',train_data) kmean = KMeans(n_clusters=20,n_jobs=20) train_data['cluster_graph'] = kmean.fit_predict(train_data[['G_degree','G_indegree','G_pagerank']].fillna(0)) train_data = get_only_user_author_graph_feature(train_act[train_act['page']==0],'page0',train_data) t2 = time.time() print('Use Time: ',t2-t1,' Cluster Finish... 
','Shape:',train_data.shape) # Feature End t_end = time.time() print('Get Feature Use All Time: ',t_end-t_start) return train_data # offline # 0 10day step # 1 14day step # 2 16day step style = 2 print('The Style is ',style,'...') print('Dealing Offline...') if style == 0: t1 = time.time() train_label = [] train_data = [] for i in range(1,8): try_label = get_label(i+10,i+16) train_label.append(try_label) train_data.append(get_train(try_label['user_id'],i,i+9)) t2 = time.time() print('Deal Train Feature: ',t2-t1) t1 = time.time() valid_label = get_label(24,30) valid_data = get_train(valid_label['user_id'],14,23) t2 = time.time() print('Deal Valid Feature: ',t2-t1) elif style == 1: t1 = time.time() train_label = [] train_data = [] for i in range(1,4): try_label = get_label(i+14,i+20) train_label.append(try_label) train_data.append(get_train(try_label['user_id'],i,i+13)) t2 = time.time() print('Deal Train Feature: ',t2-t1) t1 = time.time() valid_label = get_label(24,30) valid_data = get_train(valid_label['user_id'],10,23) t2 = time.time() print('Deal Valid Feature: ',t2-t1) elif style == 2: t1 = time.time() train_label = get_label(17,23) train_data = get_train(train_label['user_id'],1,16) valid_label = get_label(24,30) valid_data = get_train(valid_label['user_id'],8,23) t2 = time.time() print('Deal Train Feature: ',t2-t1) # online print('Dealing Online...') if style == 0: t1 = time.time() online_label = [] online_data = [] for i in range(8,15): try_label = get_label(i+10,i+16) online_label.append(try_label) online_data.append(get_train(try_label['user_id'],i,i+9)) # model online_test = get_train(reg_log['user_id'].unique(),21,30) t2 = time.time() print('Deal Online: ',t2-t1) elif style == 1: t1 = time.time() online_label = [] online_data = [] for i in range(4,11): try_label = get_label(i+14,i+20) online_label.append(try_label) online_data.append(get_train(try_label['user_id'],i,i+13)) # model online_test = get_train(reg_log['user_id'].unique(),17,30) t2 = time.time() 
print('Deal Online: ',t2-t1) elif style == 2: # online online_label = pd.concat([train_label,valid_label],axis=0).reset_index(drop=True) online_data = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True) # model online_test_b = get_train(reg_log_b['user_id'].unique(),15,30) def merge_data(df): return pd.concat(df,axis=0).reset_index(drop=True) if style!=2 : train_data = merge_data(train_data) train_label = merge_data(train_label) online_data = merge_data(online_data) online_label = merge_data(online_label) online_data = pd.concat([train_data,online_data],axis=0).reset_index(drop=True) online_label = pd.concat([train_label,online_label],axis=0).reset_index(drop=True) path_name = 'pre_data/style_'+str(style) train_data.to_csv(path_name+'/train_data.csv',index=False) train_label.to_csv(path_name+'/train_label.csv',index=False) valid_data.to_csv(path_name+'/valid_data.csv',index=False) valid_label.to_csv(path_name+'/valid_label.csv',index=False) online_data.to_csv(path_name+'/online_data.csv',index=False) online_label.to_csv(path_name+'/online_label.csv',index=False) online_test.to_csv(path_name+'/online_test.csv',index=False) # Need to Add : # a. User-Author-Video Interfacing # 1. Node2Vec # 2. User-Video/Author Embedding # 2 # 3. User-Author-Video Embedding # 1 # 4. User-Video/Author Tf-idf/Word2Vec # 2*2 # 5. User-Video/Author Cluster (By Tf-idf/Word2Vec) # 2*2 # 6. User-Author-Video Embedding/Tf-idf/Word2Vec + Cluster # b. Know More Author # 1. Define "An Active User", For Example, You can choose "All of the Positive Sample" using their Mean Value # (Tips: Mean Value is the Times of Watching Video,Looking Author,See Page,Action Click Count) # 2. Get Diff Value For Active User # 3. Calc The UV metric , Fuv or Iuv. # (Tips: Find top100 the most active User,Get their favourite Author/Video Union Set, Ex https://zhuanlan.zhihu.com/p/20943978) # 4. Node Centrality/Influence (Wiki) # 5. 
See Author Delay ================================================ FILE: model/ffm.py ================================================ import hashlib, math, os, subprocess from multiprocessing import Process import xlearn as xl import numpy as np import pandas as pd from pandas import DataFrame as DF def hashstr(str, nr_bins=1e+6): return int(hashlib.md5(str.encode('utf8')).hexdigest(), 16) % (int(nr_bins) - 1) + 1 class FfmEncoder(): def __init__(self, field_names, label_name, nthread=1): self.field_names = field_names self.nthread = nthread self.label = label_name def gen_feats(self, row): feats = [] for field in self.field_names: value = row[field] key = field + '-' + str(value) feats.append(key) return feats def gen_hashed_fm_feats(self, feats): feats = ['{0}:{1}:1'.format(field, hashstr(feat, 1e+6)) for (field, feat) in feats] return feats def convert(self, df, path, i): lines_per_thread = math.ceil(float(df.shape[0]) / self.nthread) sub_df = df.iloc[i * lines_per_thread: (i + 1) * lines_per_thread] tmp_path = path + '_tmp_{0}'.format(i) with open(tmp_path, 'w') as f: for index,row in sub_df.iterrows(): feats = [] for i, feat in enumerate(self.gen_feats(row)): feats.append((i, feat)) feats = self.gen_hashed_fm_feats(feats) f.write(str(int(row[self.label])) + ' ' + ' '.join(feats) + '\n') def parallel_convert(self, df, path): processes = [] for i in range(self.nthread): p = Process(target=self.convert, args=(df, path, i)) p.start() processes.append(p) for p in processes: p.join() def delete(self, path): for i in range(self.nthread): os.remove(path + '_tmp_{0}'.format(i)) def cat(self, path): if os.path.exists(path): os.remove(path) for i in range(self.nthread): cmd = 'cat {svm}_tmp_{idx} >> {svm}'.format(svm=path, idx=i) p = subprocess.Popen(cmd, shell=True) p.communicate() def transform(self, df, path): print('converting data......') self.parallel_convert(df, path) self.cat(path) self.delete(path) write_path = '/home/kesci/' ffm_train = train_data.copy() ffm_valid = 
valid_data.copy() ffm_online_train = online_train.copy() ffm_online_test = online_data.copy() ffm_online_test['label'] = 0 # filed_names = list(fi.sort_values(by=['score'],ascending=False).head(50)['name'].values) filed_names = [i for i in ffm_train.columns if i not in ['user_id','label']] print(filed_names) fe = FfmEncoder(filed_names,label_name='label',nthread=8) fe.transform(ffm_train, write_path+'train.ffm') print('Train FFM Finished...') fe.transform(ffm_valid, write_path+'valid.ffm') print('Valid FFM Finished...') fe.transform(ffm_online_train,write_path+'train_online.ffm') print('Train Online FFM Finished...') fe.transform(ffm_online_test, write_path+'test_online.ffm') print('Test Online FFM Finished') # Training task ffm_model = xl.create_ffm() # Use field-aware factorization machine ffm_model.setTrain("/home/kesci/train.ffm") # Training data ffm_model.setValidate("/home/kesci/valid.ffm") # Validation data # param: # 0. binary classification # 1. learning rate: 0.2 # 2. regular lambda: 0.002 # 3. 
#     evaluation metric: auc
param = {'task': 'binary', 'lr': 0.1, 'lambda': 0.01, 'metric': 'auc', 'epoch': 20, 'opt': 'ftrl'}
# Start to train
# The trained model will be stored in model.out
ffm_model.fit(param, write_path + 'model.out')
print('Offline Train Finished...')

# Prediction task
param = {'task': 'binary', 'lr': 0.05, 'lambda': 0.003, 'metric': 'auc'}
ffm_online_model = xl.create_ffm()
ffm_online_model.setTrain(write_path + 'train_online.ffm')
ffm_online_model.fit(param, write_path + 'online_model.out')
# Predict with the online model (trained on train + valid).
ffm_online_model.setTest(write_path + 'test_online.ffm')  # Test data
ffm_online_model.setSigmoid()  # Convert output to 0-1
# Start to predict
# The output result will be stored in output.txt
ffm_online_model.predict(write_path + 'online_model.out', write_path + 'output.txt')

================================================
FILE: model/lgb_model
================================================
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
import catboost as cbt
from pandas import DataFrame as DF
import gc
import time
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, f1_score

# Read
write_path = '/home/kesci/'
train_data = pd.read_csv(write_path + 'train_data.csv')
valid_data = pd.read_csv(write_path + 'valid_data.csv')
online_data = pd.read_csv(write_path + 'online_data.csv')
online_train = pd.concat([train_data, valid_data], axis=0).reset_index(drop=True)

# LGB Model
feature_name = [i for i in train_data.columns if i not in ['user_id', 'label']]
print(len(feature_name))
dtrain = lgb.Dataset(train_data[feature_name], label=train_data['label'].values)
dval = lgb.Dataset(valid_data[feature_name], label=valid_data['label'].values)
params = {
    'learning_rate': 0.05,
    'metric': ['auc', 'binary_logloss'],
    'objective': 'binary',
    'nthread': 8,
    'num_leaves': 8,
    'colsample_bytree': 0.7,
    'bagging_fraction': 0.8,
    'bagging_freq': 10,
    'seed': 2018,
}
# Note: LightGBM >= 4.0 no longer accepts verbose_eval / early_stopping_rounds here;
# pass callbacks=[lgb.early_stopping(100), lgb.log_evaluation(50)] instead.
lgb_model = lgb.train(params, dtrain, 2500, dval, verbose_eval=50, early_stopping_rounds=100)

# Metrics on the training set (an optimistic fit check, not a validation score).
pred = lgb_model.predict(train_data[feature_name])
print('TRAIN SET auc ', roc_auc_score(train_data['label'], pred))
f1_ans = (pred >= 0.5).astype(int)
print('TRAIN SET F1 ', f1_score(train_data['label'], f1_ans))

fi = DF()
fi['name'] = feature_name
fi['score'] = lgb_model.feature_importance()
print(fi.sort_values(by=['score'], ascending=False))
lgb.plot_importance(lgb_model, max_num_features=40, figsize=(10, 8))
plt.show()

# Retrain on train + valid for the online submission.
online_train = pd.concat([train_data, valid_data], axis=0).reset_index(drop=True)
online_lgb_set = lgb.Dataset(online_train[feature_name], label=online_train['label'])
online_lgb_model = lgb.train(params, online_lgb_set, num_boost_round=lgb_model.best_iteration - 50)
ans = online_lgb_model.predict(online_data[feature_name])
submit = DF()
submit['id'] = online_data['user_id']
submit['score'] = ans
print(submit.head(10))
print(submit['score'].describe())
submit.to_csv('Submit.txt', index=False, header=False)

================================================
FILE: model/xgb_model.py
================================================
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
import catboost as cbt
from pandas import DataFrame as DF
import gc
import time
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Read
write_path = '/home/kesci/'
train_data = pd.read_csv(write_path + 'train_data.csv')
valid_data = pd.read_csv(write_path + 'valid_data.csv')
online_data = pd.read_csv(write_path + 'online_data.csv')
online_train = pd.concat([train_data, valid_data], axis=0).reset_index(drop=True)

# XGB Model
feature_name = [i for i in train_data.columns if i not in ['user_id', 'label']]
print(len(feature_name))
xgb_train = xgb.DMatrix(train_data[feature_name], train_data['label'].values)
xgb_valid = xgb.DMatrix(valid_data[feature_name], valid_data['label'].values)
watch_list = [(xgb_train, 'dtrain'), (xgb_valid, 'dvalid')]
params = {
    'booster': 'gbtree',
    'objective': 'rank:pairwise',  # 'binary:logistic',
    'eta': 0.05,
    'seed': 2018,
    'max_depth': 5,
    'subsample': 0.9,
    'colsample_bytree': 0.8,
    'colsample_bylevel': 0.8,
    'eval_metric': ['auc'],  # Need TO Logloss
    'nthread': 8,
    'gamma': 2,
}
xgb_model = xgb.train(params, xgb_train, 2000, watch_list, early_stopping_rounds=40, verbose_eval=10)

pred = xgb_model.predict(xgb.DMatrix(valid_data[feature_name]))
from sklearn.metrics import roc_auc_score, f1_score
print('auc ', roc_auc_score(valid_data['label'], pred))
# rank:pairwise outputs ranking scores, not probabilities,
# so the 0.5 cut-off below is only a rough check.
f1_ans = (pred >= 0.5).astype(int)
print('f1 ', f1_score(valid_data['label'], f1_ans))


def create_feature_map(features):
    # Write an XGBoost feature map: "<index>\t<name>\tq" per line ('q' = quantitative).
    with open('xgb.fmap', 'w') as outfile:
        for i, feat in enumerate(features):
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))


create_feature_map(feature_name)
import operator
xgb_importance = xgb_model.get_fscore(fmap='xgb.fmap')
xgb_importance = sorted(xgb_importance.items(), key=operator.itemgetter(1))
xgb_importance = DF(xgb_importance, columns=['name', 'fscore'])
print(xgb_importance)

# Retrain on train + valid for the online submission.
online_xgb_set = xgb.DMatrix(online_train[feature_name], label=online_train['label'])
online_xgb_model = xgb.train(params, online_xgb_set, num_boost_round=xgb_model.best_iteration)
ans_xgb = online_xgb_model.predict(xgb.DMatrix(online_data[feature_name]))
submit_xgb = DF()
submit_xgb['id'] = online_data['user_id']
from sklearn.preprocessing import MinMaxScaler
st = MinMaxScaler()
# Ranking scores are unbounded, so rescale them into [0, 1] before submitting.
submit_xgb['score'] = st.fit_transform(ans_xgb.reshape(-1, 1)).ravel()  # RANK
# submit_xgb['score'] = ans_xgb  # Binary
print(submit_xgb.head(10))
print(submit_xgb['score'].describe())
submit_xgb.to_csv('Submit XGB.txt', index=False, header=False)