Repository: luoda888/2018-KUAISHOU-TSINGHUA-Top13-Solutions
Branch: master
Commit: f7e871e063bd
Files: 7
Total size: 93.2 KB
Directory structure:
gitextract_z0_s53rx/
├── README.md
├── feature_engineer/
│ ├── feature_engineer1
│ ├── feature_engineer2.py
│ └── get_feature.py
└── model/
├── ffm.py
├── lgb_model
└── xgb_model.py
================================================
FILE CONTENTS
================================================
================================================
FILE: README.md
================================================
# 2018-KUAISHOU-TSINGHUA-Top13-Solutions
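2018 China Collegiate Computing Contest, Big Data Challenge: Top 13 Solutions
#### Preliminary round A: Top 2; preliminary round B: Top 5
Second round, final rank: 13
Every encounter is a reunion after a long parting. Who knows in which year and month we will next stand on the defense stage; we did right by ourselves and by our youth.
My teammate's RNN: https://github.com/totoruo/KuaiShou2018-RANK13-RNN
#### Task summary: given users' historical behavior in the Kuaishou app on days 1-30, predict which users will be active in the following 7 days (days 31-37). A user counts as active if they appear in any of the log tables; cold start is not considered.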
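The label definition in the task statement can be sketched in plain Python as follows. This is a minimal illustration, not the repository's code: the function name `build_labels` and the toy data are assumptions, and the logic simply mirrors the idea that a user is positive if any log row falls inside the label period, while users registered on or after the label start are left out.

```python
def build_labels(registered, logs, label_start, label_end):
    """Return {user_id: 0 or 1} for users registered before label_start.

    registered: dict mapping user_id -> registration day
    logs: iterable of (user_id, day) rows pooled from all activity tables
    """
    # A user is active if at least one row falls inside the label period.
    active = {uid for uid, day in logs if label_start <= day <= label_end}
    return {
        uid: int(uid in active)
        for uid, reg_day in registered.items()
        if reg_day < label_start  # users not yet registered are excluded
    }


# Toy data (hypothetical): user 3 registers inside the label period.
registered = {1: 3, 2: 10, 3: 20}
logs = [(1, 18), (2, 12), (3, 21)]
labels = build_labels(registered, logs, label_start=17, label_end=23)
print(labels)
```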
##### Framework design
1. Equal-length sliding windows: [1,16] [2,17] ... [8,23] ---> Predict [15,30]
2. Unequal-length (expanding) windows: [1,16] [1,17] ... [1,24] ---> Predict [1,30]
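The two windowing schemes can be enumerated with a few lines of Python. This is a sketch under assumptions: the helper names `feature_windows` and `label_window` are hypothetical, and the 7-day label horizon is taken from the task statement.

```python
def feature_windows(first_end, last_end, length=None):
    """List (start, end) feature windows whose end day runs first_end..last_end.

    With length set, every window has that many days (equal-length scheme).
    With length=None, every window starts on day 1 (expanding scheme).
    """
    windows = []
    for end in range(first_end, last_end + 1):
        start = 1 if length is None else end - length + 1
        windows.append((start, end))
    return windows


def label_window(feature_end, horizon=7):
    """The label period is the `horizon` days right after the feature window."""
    return (feature_end + 1, feature_end + horizon)


equal = feature_windows(16, 23, length=16)   # [1,16] ... [8,23]
expanding = feature_windows(16, 24)          # [1,16] ... [1,24]
print(equal[0], equal[-1], expanding[-1], label_window(16))
```

In the equal-length scheme every sample sees the same amount of history, so features are directly comparable across windows; the expanding scheme keeps more history at the cost of window-length-dependent feature scales.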
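##### Feature engineering:
Tip: the features below are fairly generic, but in the unequal-length framework 2 every distance-based computation must be smoothed, i.e. divided by the length of the time window.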
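A minimal sketch of the smoothing described in the tip, assuming it means a plain division by the number of days in the window. The function name `smooth_by_window` is hypothetical.

```python
def smooth_by_window(value, start_date, end_date):
    """Divide a distance-like feature by the window length in days."""
    window_len = end_date - start_date + 1
    return value / window_len


# A raw recency of 3 days means less in a 24-day window than in a 16-day one.
recency_short = smooth_by_window(3, 1, 16)
recency_long = smooth_by_window(3, 1, 24)
print(recency_short, recency_long)
```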
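For the APP-LAUNCH, ACTION and VIDEO tables
1. Encode the time series
Two encodings were considered:
a. Treat the user's active days as a binary number and weight the active days in binary, so that days closer to the prediction date get a higher weight. For example, with a 6-day window and activity on days 1, 3 and 5, the sequence is 101010; reversing it gives 010101, which is 21 in decimal.
`ans += binary_day[i]*(2**i)`
b. Weight directly by distance from the prediction date
`ans += binary_day[i]*(1/(end_date-i))`
2. Descriptive statistics of the time series
First-order statistics: mean, std, median, max, min, (max-min), mode, count, nunique (in the APP table, count == nunique)
Second-order statistics: skew, kurt, mad
Frequency-domain periodicity of user logins: Var(fft(X['day']))
Weekly periodicity of user logins: Get_Mode(X['week'])
3. Day-count interactions between the time series and the prediction date
Distance from the user's last login to the prediction date: end_date+1-x['day'].max()
Distance from the user's K-th most recent login to the prediction date: end_date+1-get_second_day(x,k)
Distance between the last and second-to-last login: x['day'].max()-get_second_day(x,2)
4. First-order difference of the time series
Maximum/minimum number of days between logins: Diff Max/Min
Average login interval: Diff Mean
Stability of login intervals: Diff Var
Periodicity of login intervals: Var(fft(X['day']))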
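The two encodings from item 1 can be sketched as below. This follows the README's worked example (days 1, 3, 5 in a 6-day window give 21), with the sequence stored oldest day first. The helper names are hypothetical, and the distance in `encode_distance` is assumed to be measured to the day after the window so that it is never zero.

```python
def binary_sequence(active_days, start_date, end_date):
    """0/1 activity flags for each day of the window, oldest day first."""
    active = set(active_days)
    return [1 if day in active else 0 for day in range(start_date, end_date + 1)]


def encode_binary(seq):
    """Weight position i by 2**i, so later days contribute more."""
    return sum(bit * 2 ** i for i, bit in enumerate(seq))


def encode_distance(seq):
    """Weight each active day by 1 / (days until the first prediction day)."""
    n = len(seq)
    return sum(bit / (n - i) for i, bit in enumerate(seq))


seq = binary_sequence([1, 3, 5], 1, 6)   # [1, 0, 1, 0, 1, 0]
code = encode_binary(seq)                # 1 + 4 + 16
decay = encode_distance(seq)             # 1/6 + 1/4 + 1/2
print(seq, code, decay)
```

The binary code is a lossless summary of the whole activity pattern, which a tree model can split on; the distance-weighted sum is a smoother recency score in which different patterns may share similar values.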
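Special handling of the ACTION table
1. Decay-coefficient encoding of the time series
Number of actions on a day divided by that day's distance to the prediction date: sigma_ans += np.log(ans[i]/(window_len-i))
2. Action-count interactions between the time series and the prediction date
Distribution of Pages and Actions on the user's last / second-to-last login day, e.g. 3 views of Page 0, 4 of Page 1 and 2 of Action 1; behaviors that did not occur are filled with 0
Splitting the behaviors into sub-graphs gives the user's distribution over pages, e.g. descriptive statistics (mean, std, etc.) of the behavior that occurred only on Page 0 (likewise for Action)
On the day of the user's last / K-th most recent activity: Count VIDEOID/AuthorID
Between the last and second-to-last activity: Count VIDEOID/AuthorID
3. Global expansion of Page and Action per user
The share of each behavior within the user's whole behavior sequence, e.g. 5 clicks on Page 0 out of the user's 10 clicks on Page 0 in total gives Ratio1 = 0.5
The share of each behavior among all behavior on that day, e.g. the user clicked Page 0 900 times and there were 1000 clicks that day in total, giving Ratio2 = 0.9
These two statistics help guard against click-farming: a normal behavior sequence should be a fairly smooth click sequence, so a spike, such as mostly report actions or repeated views of the same video, can be treated as a click-farming signal
4. Compute the user's contribution sequence over watched VIDEOs
    def GongXianDu(df):
        d11 = df.set_index('video_id')
        d11['gongxian_rate'] = df.groupby('video_id').size()
        d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times']
        meand = d11['gongxian_rate'].mean()
        sumd = d11['gongxian_rate'].sum()
        stdd = d11['gongxian_rate'].std()
        skeww = d11['gongxian_rate'].skew()
        kurtt = d11['gongxian_rate'].kurt()
        return sumd,meand,stdd,skeww,kurtt
    temp = train_act.groupby('user_id').apply(GongXianDu)
5. Check whether the user shows fan ("star-chasing") behavior
Whether the user's favorite author has posted updates
How many other people also watch the videos the user watched (distinguishing niche from mainstream)
    def FavAuthorCreate(df):
        most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0]
        create_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
        # NOTE: as written this repeats the create_video_num expression, so both counts are always equal
        watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
        watch_other_video = 1 if watch_other_video_num>1 else 0
        return create_video_num , watch_other_video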
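The zero-filled Page distribution and per-user ratio from items 2 and 3 can be illustrated without pandas. This is a sketch, not the repository's `get_category_count`: the function name `page_distribution` is hypothetical, and the page count of 5 is taken from the `range(5)` loop in the feature code.

```python
from collections import Counter


def page_distribution(pages, n_pages=5):
    """Count a user's actions per page and each page's share of the total.

    pages: list of page ids, one entry per logged action of a single user.
    Pages the user never touched are filled with 0.
    """
    counts = Counter(pages)
    total = sum(counts.values())
    features = {}
    for page in range(n_pages):
        seen = counts.get(page, 0)
        features[f"see_page_{page}"] = seen
        features[f"see_page_{page}_ratio"] = seen / total if total else 0.0
    return features


# Toy log (hypothetical): 3 actions on page 0, 4 on page 1, 3 on page 3.
features = page_distribution([0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
print(features)
```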
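Mining the REGISTER table
1. Registration periodicity, e.g. weekend promotions; the most direct feature is Week
2. Periodicity interactions, e.g. specific Type combinations on weekends
register['week'] * register['device_type']
register['week'] * register['register_type']
3. Interactions between categorical features, e.g. register['device_type'] * register['register_type']
4. Count the number of users in each category, e.g. register_log.groupby(['register_type'])['user_id'].transform('count').values
The same can be computed for device_type, week_rt, week_dt, rt_dt
5. Compute the conversion rate of each category (requires sliding-window computation), groupby(['count_label_ratio'])['register_type','device_type'].transform('count').values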
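The REGISTER interactions and category counts above can be sketched with plain dictionaries. The `+1` offsets follow `data_prepare` in the feature code; the function name `add_register_features` and the toy rows are assumptions made for illustration.

```python
from collections import Counter


def add_register_features(rows):
    """Add week, interaction and category-count features to register rows.

    rows: list of dicts with keys user_id, day, register_type, device_type.
    """
    for r in rows:
        r["week"] = r["day"] % 7
        # The +1 keeps category 0 from collapsing every product to 0.
        r["rt_dt"] = (r["register_type"] + 1) * (r["device_type"] + 1)
        r["week_rt"] = (r["register_type"] + 1) * (r["week"] + 1)
    # Number of users sharing the same register_type.
    rt_counts = Counter(r["register_type"] for r in rows)
    for r in rows:
        r["use_reg_people"] = rt_counts[r["register_type"]]
    return rows


rows = add_register_features([
    {"user_id": 1, "day": 6, "register_type": 0, "device_type": 2},
    {"user_id": 2, "day": 13, "register_type": 0, "device_type": 5},
    {"user_id": 3, "day": 7, "register_type": 1, "device_type": 2},
])
print([(r["week"], r["rt_dt"], r["week_rt"], r["use_reg_people"]) for r in rows])
```

A product is a compact interaction key, but distinct pairs can collide (users 2 and 3 above both get `rt_dt` 6); a tuple or string key would avoid that if exact category identity matters.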
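##### Model selection
Snake's feature-selection framework from the diabetes competition https://github.com/luoda888/tianchi-diabetes-top12
After keeping the mandatory features (Encoder/FFT), set thresholds to produce two feature sets
Tree models: LGB on framework 1
XGB on framework 2
FFM: Xlearn, computed after selecting the Top-K features by feature importance
xDeepFM: takes sequence features / Category features as input
CNN/RNN: takes sequence features as input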
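One way to read "keep the mandatory features, then threshold" is sketched below. This is an assumption about the procedure, not the linked framework's code: the function name `select_features`, the name-matching rule for mandatory features, and the importance values are all hypothetical.

```python
def select_features(importance, threshold, must_keep=("encoder", "fft")):
    """Keep mandatory features plus any feature whose importance reaches threshold.

    importance: dict mapping feature name -> importance score.
    must_keep: substrings marking features that are always kept.
    """
    selected = []
    for name, score in importance.items():
        mandatory = any(tag in name.lower() for tag in must_keep)
        if mandatory or score >= threshold:
            selected.append(name)
    return selected


# Hypothetical importances, e.g. as reported by a gradient-boosted tree model.
importance = {
    "encoder1_01seqapp_day_15": 3.0,
    "fft_var_app_day15": 1.0,
    "count_act_day15": 120.0,
    "skew_video_day15": 8.0,
    "q4_app_day15": 40.0,
}
wide_set = select_features(importance, threshold=5.0)
narrow_set = select_features(importance, threshold=50.0)
print(len(wide_set), narrow_set)
```

Two thresholds give two overlapping feature sets, which is one way to obtain models that are different enough to be worth ensembling.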
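##### Model ensembling
Model ensembling paid off very well on this task. We tried three ensembling methods, ranked here by effect:
Top 3 Stacking: no real gain; with the same feature set it even lowers the score
Top 2 Weighted averaging: the similarity between model predictions is around 0.97-0.98; at 0.97 the gain is about 0.001, at 0.98 about 0.0007-0.0008
Top 1 Half-split Blending: e.g. validating on the first sliding window and on the second sliding window; this two-fold Blending gains about 0.0012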
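Weighted averaging and the similarity check can be sketched as follows. The README does not say how similarity was measured; Pearson correlation between prediction vectors is assumed here, and the prediction values and equal weights are made up for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long prediction lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def weighted_blend(preds, weights):
    """Weighted average of several models' predictions, row by row."""
    total = sum(weights)
    n_rows = len(preds[0])
    return [
        sum(w * p[i] for p, w in zip(preds, weights)) / total
        for i in range(n_rows)
    ]


# Hypothetical probabilities from two models on four users.
lgb_pred = [0.9, 0.2, 0.6, 0.1]
xgb_pred = [0.8, 0.3, 0.7, 0.1]
similarity = pearson(lgb_pred, xgb_pred)
blend = weighted_blend([lgb_pred, xgb_pred], [0.5, 0.5])
print(round(similarity, 4), blend)
```

The lower the correlation between two reasonably strong models, the more an average can correct their individual errors, which matches the README's observation that a 0.97 pair gained more than a 0.98 pair.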
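##### Closing thoughts
First, thanks to my teammates for working hard all along and not giving up to the very end. I had planned to write a lot about the approach to this task, but on reflection I am still not good enough, so I will leave it.
In the recent competitions (Ping An 11th, Tencent 12th, Kuaishou 13th) I really have been the king of "just outside the top 10".
I have also reflected on myself: my way of mining features is often off. I like to go all-in with one approach and ignore the specific data background and the specific changes.
Data mining should be more about your own thinking and less about formulaic routines.
When everything follows a fixed routine, much of the fun is lost too.
The atmosphere in the Kuaishou group was good, and the Kesci platform's attitude is indeed among the best I have seen so far.
If that LGB model had finished running at noon the day before yesterday, we would surely be in the Top 10 now.
There are regrets too; I swallow them.
Empire and glory pass in talk and laughter; none of it matches one good drunken night in life.
Let's go, and drink!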
================================================
FILE: feature_engineer/feature_engineer1
================================================
# This feature engineering uses a sliding window with a step of 1 day, on the full data set, with no partial average weighting
from multiprocessing import Pool
import numpy as np
import pandas as pd
from pandas import DataFrame as DF
import gc
from sklearn.preprocessing import LabelEncoder
import time
from scipy import stats
def get_transform(now,start_date,end_date):
    get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)]
    return get_trans
def get_label(start_date,end_date):
    merge_name = ['user_id','day']
    all_log = pd.concat([action_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0)
    train_label = get_transform(all_log,start_date,end_date)
    train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'})
    train_1['label'] = 1
    reg_temp = get_transform(register_log,1,start_date-1)
    train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])]
    train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'})
    train_0['label'] = 0
    del train_label
    gc.collect()
    return pd.concat([train_1,train_0],axis=0)
def check_id(uid,now):
    return now[now['user_id'].isin(uid)]
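def get_mode(now):
    # stats.mode returns an array in older SciPy and a scalar in newer SciPy; np.atleast_1d handles both
    return np.atleast_1d(stats.mode(now)[0])[0]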
def get_binary_seq(now,start_date,end_date):
    day = list(range(1,end_date-start_date+2))
    day.reverse()
    ans1 = 0
    binary_day = []
    now_uni = now.unique()
    for i in day:
        if i in now_uni:
            binary_day.append(1)
        else:
            binary_day.append(0)
    return binary_day
def get_binary1(now,start_date,end_date): # Boss Feature
    ans = 0
    binary_day = get_binary_seq(now,start_date,end_date)
    for i in range(len(binary_day)):
        ans += binary_day[i]*(2**i)
    return ans
def get_binary2(now,start_date,end_date): # Boss Feature
    ans = 0
    binary_day = get_binary_seq(now,start_date,end_date)
    for i in range(len(binary_day)):
        ans += binary_day[i]*(1/(end_date-i))
    return ans
def get_time_log_weight_sigma(now,start_date,end_date):
    window_len = end_date+1-start_date
    ans = np.zeros(window_len)
    sigma_ans = 0
    for i in now:
        ans[(i-1)%window_len] += 1
    for i in range(window_len):
        if ans[i]!=0:
            sigma_ans += np.log(ans[i]/(window_len-i))
    return sigma_ans
def get_max_count(x,x_max):
    x_max = int(x_max)
    if x_max>0:
        return x['day'].value_counts()[x_max]
    else:
        return 0
def get_max_movie(x,x_max):
    x_max = int(x_max)
    if x_max>0:
        x = x[x['day']==x_max]
        return x['video_id'].nunique()
    else:
        return 0
def get_type_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):
    now = get_transform(now,start_date,end_date)
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,np.nan_to_num(x['day'].max()))).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_count_'+control+name+gap_name}),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_movie(x,np.nan_to_num(x['day'].max()))).reset_index().rename(columns={0:'nunique_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_movie(x,get_second_day(x[name],2))).reset_index().rename(columns={0:'nunique2_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-get_second_day(x[name],2))).reset_index().rename(columns={0:'max_distance_12_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')
    return train_data
def get_encoder_feature(control,name,now,train_data,start_date,end_date):
    now = get_transform(now,start_date,end_date)
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
    return train_data
def get_time_feature(control,name,now,train_data,start_date,end_date):
    now = get_transform(now,start_date,end_date)
    # Descriptive statistics features
    t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_all_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data['nunique / count' + control + name +str(end_date-start_date)] = train_data['nunique_all_'+control+name+str(end_date-start_date)] / train_data['count_'+control+name+str(end_date-start_date)]
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].min()-start_date)).reset_index().rename(columns={0:'min-start_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].min())).reset_index().rename(columns={0:'end-min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].kurt()).reset_index().rename(columns={0:'kurt_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].skew()).reset_index().rename(columns={0:'skew_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.84)).reset_index().rename(columns={0:'q4_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.92)).reset_index().rename(columns={0:'q5_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.97)).reset_index().rename(columns={0:'q6_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print(name,' Describe Finished... ',t2-t1,' Shape: ',train_data.shape)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_var_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_var_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print(control,' FFT Finished...',t2-t1,' Shape: ',train_data.shape)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:np.array(get_binary_seq(x[name],start_date,end_date)).std()).reset_index().rename(columns={0:'01seq_std_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:np.array(get_binary_seq(x[name],start_date,end_date)).mean()).reset_index().rename(columns={0:'01seq_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print(control,' 01seq Describe Finished... ',t2-t1,' Shape: ',train_data.shape)
    # Time-decay features
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print(control,' Sigma Finished... ',t2-t1,' Shape: ',train_data.shape)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],3))).reset_index().rename(columns={0:'max3_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-get_second_day(x[name],2))).reset_index().rename(columns={0:'max_distance_12_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')
t2 = time.time()
print(control,' Max Finished... ',t2-t1,' Shape: ',train_data.shape)
return train_data
def get_second_day(now,seq):
    now = list(now.unique())
    for i in range(seq-1):
        if len(now)>1:
            now.remove(max(now))
        else:
            return 0
    return max(now)
def get_id_feature(control,name,now,train_data,start_date,end_date):
now = get_transform(now,start_date,end_date)
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'var_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
return train_data
def get_diff_feature(control,name,now,train_data,start_date,end_date):
now = get_transform(now,start_date,end_date)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].std()).reset_index().rename(columns={0:'var_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'mean_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].max()).reset_index().rename(columns={0:'max_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print(control,' Get Diff Feature Finished... Used: ',t2-t1,' Shape: ',train_data.shape)
return train_data
def HowManyPeopleWatch(df):
    num_people = len(df['user_id'].unique())
    return num_people
def MostHandle(df):
    most_handle = df.groupby('video_id').size().max()
    return most_handle
def FavAuthorCreate(df):
    most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0]
    create_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
    # NOTE: as written this repeats the create_video_num expression, so both counts are always equal
    watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
    watch_other_video = 1 if watch_other_video_num>1 else 0
    return create_video_num , watch_other_video
def GongXianDu(df):
    d11 = df.set_index('video_id')
    d11['gongxian_rate'] = df.groupby('video_id').size()
    d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times']
    meand = d11['gongxian_rate'].mean()
    sumd = d11['gongxian_rate'].sum()
    stdd = d11['gongxian_rate'].std()
    skeww = d11['gongxian_rate'].skew()
    kurtt = d11['gongxian_rate'].kurt()
    return sumd,meand,stdd,skeww,kurtt
def get_category_count(name,deal_now,train_data,start_date,end_date):
    count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'}))
    count_size = deal_now.groupby([name]).size().shape[0]
    sum_data = 0
    for i in range(0,count_size):
        new_name = 'see_'+name+'_'+str(i)
        temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name})
        train_have = pd.merge(train_data,temp[['user_id',new_name]],on=['user_id'])
        train_have = train_have[['user_id',new_name]]
        not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values))
        train_not_have = DF()
        train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id']
        train_not_have['see_'+name+'_'+str(i)] = 0
        temp = pd.concat([train_have,train_not_have],axis=0)
        train_data = pd.merge(train_data,temp,on=['user_id'],how='left')
        sum_data += train_data[new_name].values
    for i in range(0,count_size):
        new_name = 'see_'+name+'_'+str(i)
        train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data
    return train_data
def get_last_window(now):
    if now.min()>0:
        return 1
    else:
        return 0
def parallelize_df_func(df, func, start, end, num_partitions=21, n_jobs=7):
    df_split = np.array_split(df, num_partitions)
    start_date = [start] * num_partitions
    end_date = [end] * num_partitions
    param_info = zip(df_split, start_date, end_date)
    pool = Pool(n_jobs)
    gc.collect()
    df = pd.concat(pool.map(func, param_info))
    pool.close()
    pool.join()
    gc.collect()
    return df
def get_train(param_info):
    uid = param_info[0]
    start_date= param_info[1]
    end_date= param_info[2]
    t_start = time.time()
    t1 = time.time()
    train_act = check_id(uid,get_transform(action_log,start_date,end_date))
    train_video = check_id(uid,get_transform(video_log,start_date,end_date))
    train_app = check_id(uid,get_transform(app_log,start_date,end_date))
    train_reg = register_log[register_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'})
    # Get Week
    train_act['week'] = (train_act['day'].values) % 7
    train_video['week'] = (train_video['day'].values) % 7
    train_app['week'] = (train_app['day'].values) % 7
    # Modify Day
    train_reg['reg_day'] = train_reg['reg_day'] - start_date + 1
    train_act['day'] = train_act['day'] - start_date + 1
    train_video['day'] = train_video['day'] - start_date + 1
    train_app['day'] = train_app['day'] - start_date + 1
    end_date = end_date-start_date+1
    true_start = start_date
    start_date = 1
    t2 = time.time()
    print(start_date,' To ',end_date,' Have User: ',len(uid))
    print('Data Prepare Use...',t2-t1)
    # Build
    train_data = DF()
    train_data['user_id'] = uid # 1 feature
    train_data = get_time_feature('act_','day',train_act,train_data,start_date,end_date)
    print('Act Encoder Finished')
    for i in range(5):
        page_temp = train_act[train_act['page']==i]
        train_data = get_type_feature('act_page'+str(i)+'_','day',page_temp,train_data,start_date,end_date,end_date-start_date,'_all')
    for i in range(6):
        act_temp = train_act[train_act['action_type']==i]
        train_data = get_type_feature('act_action'+str(i)+'_','day',act_temp,train_data,start_date,end_date,end_date-start_date,'_all')
    train_data = get_diff_feature('act_','diff_day',train_act,train_data,start_date,end_date)
    train_data = get_encoder_feature('act_','diff_day',train_act,train_data,start_date,end_date)
    train_data = pd.merge(train_data,train_act.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'act_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
    for i in ['page','action_type','video_id','author_id']: # 4*3 12 feature
        train_data = get_id_feature('id_act_',i,train_act,train_data,start_date,end_date)
    print(train_data.shape,' Aci Finished')
    train_data = get_category_count('page',train_act,train_data,start_date,end_date)
    train_data = get_category_count('action_type',train_act,train_data,start_date,end_date)
    print(train_data.shape,' Category Finished')
train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date)
train_data = get_diff_feature('video_','diff_day',train_video,train_data,start_date,end_date)
print(train_data.shape,' Video Finished')
train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date)
train_data = get_diff_feature('app_','diff_day',train_app,train_data,start_date,end_date)
train_data = get_encoder_feature('app_','diff_day',train_app,train_data,start_date,end_date)
t1 = time.time()
train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'app_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
print(train_data.shape,' APP Finished')
train_app['diff2'] = train_app['day']-train_app.groupby(['user_id'])['day'].shift(2).values
app_diff = train_app.dropna()
app_diff = app_diff.groupby(['user_id'],as_index=False).agg({'diff2':['max','min','mean','std']})
app_diff.columns = ['user_id','app_diff2_max','app_diff2_min','app_diff2_mean','app_diff2_std']
train_data = pd.merge(train_data,app_diff,on=['user_id'],how='left')
t2 = time.time()
print('APP DIFF ',t2-t1)
t1 = time.time()
user_feature = DF(train_data['user_id'].unique())
user_feature.columns = ['user_id']
user_feature = user_feature.set_index('user_id')
user_feature['HowManyPeople_Watch'] = train_act.groupby('author_id').apply(HowManyPeopleWatch)
user_feature['Most_Handle'] = train_act.groupby('user_id').apply(MostHandle)
    # Total number of times each video was watched
    video_size = train_act.groupby('video_id').size().reset_index()
    video_size.columns = ['video_id','video_watched_times']
    train_act = pd.merge(train_act,video_size,on=['video_id'],how='left')
    # Per-user sum, mean and std of the contribution rate
    temp = train_act.groupby('user_id').apply(GongXianDu)
user_feature['GongXianSum'] = temp.apply(lambda x:x[0])
user_feature['GongXianMean'] = temp.apply(lambda x:x[1])
user_feature['GongXianStd'] = temp.apply(lambda x:x[2])
user_feature['GongXianSkeww'] = temp.apply(lambda x:x[3])
user_feature['GongXianKurtt'] = temp.apply(lambda x:x[4])
fav_author = train_act.groupby('user_id').apply(FavAuthorCreate)
user_feature['FavAuthorCreate'] = fav_author.apply(lambda x:x[0])
user_feature['WatchOtherVideo'] = fav_author.apply(lambda x:x[1])
train_data = pd.merge(train_data,user_feature.reset_index(),on=['user_id'],how='left')
t2 = time.time()
print('Use Time: ',t2-t1,' User-Author Finish... ','Shape:',train_data.shape)
train_data = pd.merge(train_data,train_reg[['user_id','register_type','device_type','week','reg_day']],on=['user_id'],how='left').rename(columns={'week':'reg_week'}) # 2
t_end = time.time()
print('Get Feature Use All Time: ',t_end-t_start,' Shape: ',train_data.shape)
gc.collect()
return train_data
def data_prepare(read_path=None):
    register_log = pd.read_csv(read_path+'user_register_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})
    action_log = pd.read_csv(read_path+'user_activity_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})
    app_log = pd.read_csv(read_path+'app_launch_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})
    video_log = pd.read_csv(read_path+'video_create_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})
    # Sort By User
    register_log = register_log.sort_values(by=['user_id','day'],ascending=True)
    action_log = action_log.sort_values(by=['user_id','day'],ascending=True)
    app_log = app_log.sort_values(by=['user_id','day'],ascending=True)
    video_log = video_log.sort_values(by=['user_id','day'],ascending=True)
    # Diff Day
    t1 = time.time()
    app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)
    video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)
    action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)
    t2 = time.time()
    print('Diff Day Finished... ',t2-t1)
    # Prepare REGISTER
    register_log['week'] = register_log['day'] % 7
    register_log['rt_dt'] = (register_log['register_type']+1)*(register_log['device_type']+1)
    # The weekday column is named 'week' here; 'reg_week' only exists after the rename in get_train
    register_log['week_rt'] = (register_log['register_type']+1)*(register_log['week']+1)
    register_log['week_dt'] = (register_log['device_type']+1)*(register_log['week']+1)
    register_log['use_reg_people'] = register_log.groupby(['register_type'])['user_id'].transform('count').values
    register_log['use_dev_people'] = register_log.groupby(['device_type'])['user_id'].transform('count').values
    register_log['week_rt_use_people'] = register_log.groupby(['week_rt'])['user_id'].transform('count').values
    register_log['week_dt_use_people'] = register_log.groupby(['week_dt'])['user_id'].transform('count').values
    register_log['rt_dt_use_people'] = register_log.groupby(['rt_dt'])['user_id'].transform('count').values
    return register_log,action_log,app_log,video_log
read_path = '/mnt/datasets/fusai/'
register_log,action_log,app_log,video_log = data_prepare(read_path)
train_set = []
for i in range(17,25):
    train_label = get_label(i,i+6)
    train_data_part1 = parallelize_df_func(train_label['user_id'], get_train, i-16, i-1, 1, 1)
    train_data = pd.merge(train_data_part1,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',
                                                         'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',
                                                         'rt_dt_use_people']],on=['user_id'],how='left')
    train_data = pd.merge(train_data,train_label,on=['user_id'],how='left')
    train_set.append(train_data)
    del train_data_part1
    gc.collect()
train_data = pd.concat(train_set[0:-1],axis=0).reset_index(drop=True)
valid_data = train_set[-1]
online_data = parallelize_df_func(register_log['user_id'].unique(), get_train, 15, 30, 1,1)
online_data = pd.merge(online_data,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',
'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',
'rt_dt_use_people']],on=['user_id'],how='left')
write_path = '/home/kesci/'
train_data.to_csv(write_path+'train_data.csv',index=False)
valid_data.to_csv(write_path+'valid_data.csv',index=False)
online_data.to_csv(write_path+'online_data.csv',index=False)
print('Style 1 Feature Engineer Finished...')
================================================
FILE: feature_engineer/feature_engineer2.py
================================================
# This feature engineering uses unequal-length sliding windows with a step of 1 day, on the full data set
import numpy as np
import pandas as pd
from pandas import DataFrame as DF
import gc
from multiprocessing import Pool
from sklearn.preprocessing import LabelEncoder
import time
from scipy import stats
register_log = pd.read_csv('/mnt/datasets/fusai/user_register_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})
action_log = pd.read_csv('/mnt/datasets/fusai/user_activity_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})
app_log = pd.read_csv('/mnt/datasets/fusai/app_launch_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})
video_log = pd.read_csv('/mnt/datasets/fusai/video_create_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})
register_log = register_log.sort_values(by=['user_id','day'],ascending=True)
action_log = action_log.sort_values(by=['user_id','day'],ascending=True)
app_log = app_log.sort_values(by=['user_id','day'],ascending=True)
video_log = video_log.sort_values(by=['user_id','day'],ascending=True)
register_log['week'] = register_log['day'] % 7
t1 = time.time()
app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1)
app_log['diff_day'] = app_log['diff_day'].astype(np.int8)
t2 = time.time()
print('Diff APP Finished... ',t2-t1)
t1 = time.time()
video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1)
video_log['diff_day'] = video_log['diff_day'].astype(np.int8)
t2 = time.time()
print('Diff Video Finished... ',t2-t1)
t1 = time.time()
action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1)
action_log['diff_day'] = action_log['diff_day'].astype(np.int8)
t2 = time.time()
print('Diff Act Finished... ',t2-t1)
def reduce_mem_usage(props):
    # Memory footprint before downcasting
    start_mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage of the dataframe is :", start_mem_usg, "MB")
    NAlist = []  # columns whose NaN/inf values were replaced by -1
    for col in props.columns:
        if (props[col].dtypes != object):
            isInt = False
            # Integer dtypes cannot hold NaN/inf, so replace them with -1 first
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col] = props[col].replace([np.inf, -np.inf], -1).fillna(-1)
            # Take the range after filling so the -1 placeholder is covered
            mmax = props[col].max()
            mmin = props[col].min()
            # The column is integer-valued if casting to int64 loses nothing
            asint = props[col].astype(np.int64)
            result = np.fabs(props[col] - asint)
            result = result.sum()
            if result < 0.01:
                isInt = True
            if isInt:
                if mmin >= 0:
                    if mmax <= 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mmax <= 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mmax <= 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mmin > np.iinfo(np.int8).min and mmax < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mmin > np.iinfo(np.int16).min and mmax < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mmin > np.iinfo(np.int32).min and mmax < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mmin > np.iinfo(np.int64).min and mmax < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)
            else:
                # float16 overflows to inf above 65504 (e.g. the binary encoder features), so keep 32-bit floats
                props[col] = props[col].astype(np.float32)
    mem_usg = props.memory_usage().sum() / 1024**2
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props
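# Illustrative sketch (not part of the original pipeline): the integer branch of
# reduce_mem_usage picks the narrowest dtype whose range covers [min, max] of a column.
# `demo_smallest_int_dtype` is a hypothetical helper that returns the dtype name only.
# Assumption: it uses inclusive bounds, whereas the signed branch above compares strictly.
def demo_smallest_int_dtype(lo, hi):
    if lo >= 0:
        # Non-negative columns can use unsigned types
        for dtype_name, top in (('uint8', 2**8 - 1), ('uint16', 2**16 - 1), ('uint32', 2**32 - 1)):
            if hi <= top:
                return dtype_name
        return 'uint64'
    # Columns with negative values (such as the -1 placeholder) need signed types
    for dtype_name, bits in (('int8', 8), ('int16', 16), ('int32', 32)):
        if lo >= -2**(bits - 1) and hi <= 2**(bits - 1) - 1:
            return dtype_name
    return 'int64'
# A day column spanning 1..30 fits in uint8; a diff_day column holding -1 needs a signed type.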
def get_transform(now,start_date,end_date):
get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)]
return get_trans
def get_label(start_date,end_date):
merge_name = ['user_id','day']
all_log = pd.concat([action_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0)
train_label = get_transform(all_log,start_date,end_date)
train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'})
train_1['label'] = 1
reg_temp = get_transform(register_log,1,start_date-1)
train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])]
train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'})
train_0['label'] = 0
del train_label
gc.collect()
return pd.concat([train_1,train_0],axis=0)
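# Illustrative sketch (not part of the original pipeline) of the labelling rule used by
# get_label: every user registered before the label window gets label 1 if they appear in
# any log during the window and 0 otherwise. Users who are active but were not registered
# before the window are dropped (no cold start). `demo_make_labels` and its plain-Python
# arguments are hypothetical stand-ins for the DataFrame logic above.
def demo_make_labels(registered_before, active_in_window):
    active = set(active_in_window)
    return {uid: int(uid in active) for uid in registered_before}
# Example: demo_make_labels([1, 2, 3], [2, 9]) labels user 2 as active and ignores user 9.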
def check_id(uid,now):
return now[now['user_id'].isin(uid)]
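def get_mode(now):
    # Series.mode() returns the modes in ascending order; take the smallest one,
    # as scipy.stats.mode does, without relying on its version-dependent return shape
    return now.mode().iloc[0]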
def get_binary_seq(now,start_date,end_date):
    # 0/1 activity flag for each day of the window (days are re-based to start at 1)
    day = list(range(1,end_date-start_date+2))
    binary_day = []
    now_uni = now.unique()
    for i in day:
        if i in now_uni:
            binary_day.append(1)
        else:
            binary_day.append(0)
    return binary_day
def get_binary1(now,start_date,end_date): # Boss Feature
    # Read the flags as a binary number: position i weighs 2**i, so recent days dominate
    ans = 0
    binary_day = get_binary_seq(now,start_date,end_date)
    for i in range(len(binary_day)):
        ans += binary_day[i]*(2**i)
    return ans
def get_binary2(now,start_date,end_date): # Boss Feature
    # Weight each active day by the inverse of its distance to the end of the window
    ans = 0
    binary_day = get_binary_seq(now,start_date,end_date)
    for i in range(len(binary_day)):
        ans += binary_day[i]*(1/(end_date-i))
    return ans
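# Worked example for the two encodings above (illustrative sketch, not part of the
# original pipeline; `demo_activity_encodings` is a hypothetical name). It assumes days
# are already re-based so the window runs from 1 to window_len, as get_train does.
def demo_activity_encodings(active_days, window_len):
    active = set(active_days)
    flags = [1 if d in active else 0 for d in range(1, window_len + 1)]
    # Same weighting as get_binary1: position i contributes 2**i
    binary_code = sum(f * (2 ** i) for i, f in enumerate(flags))
    # Same weighting as get_binary2 with end_date == window_len
    distance_code = sum(f / (window_len - i) for i, f in enumerate(flags))
    return binary_code, distance_code
# demo_activity_encodings([1, 3, 5], 6) gives binary_code 21, the example from the README.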
def get_time_log_weight_sigma(now,start_date,end_date):
window_len = end_date+1-start_date
ans = np.zeros(window_len)
sigma_ans = 0
for i in now:
ans[(i-1)%window_len] += 1
for i in range(window_len):
if ans[i]!=0:
sigma_ans += np.log(ans[i]/(window_len-i))
return sigma_ans
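# Illustrative sketch (not part of the original pipeline) of the log-decay weighting in
# get_time_log_weight_sigma: count the operations per day, divide each count by its
# distance to the end of the window, and sum the logs over the active days.
# `demo_log_decay` is a hypothetical name; days are assumed to be re-based to 1..window_len.
import math
def demo_log_decay(days, window_len):
    counts = [0] * window_len
    for d in days:
        counts[(d - 1) % window_len] += 1
    return sum(math.log(c / (window_len - i)) for i, c in enumerate(counts) if c)
# With days [1, 1, 3] and window_len 3 the counts are [2, 0, 1], so the result is
# log(2/3) + log(1/1): an early burst is discounted, activity on the last day is not.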
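def get_max_count(x,name):
    x_max = x[name].max()
    if x_max>0:
        # Number of rows logged on the most recent day
        # (value_counts(x_max) would treat x_max as the `normalize` flag and return a Series)
        return (x[name]==x_max).sum()
    else:
        return np.nan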
def get_max_movie(x):
x_max = x['day'].max()
if x_max>0:
x = x[x['day']==x_max]
return x['video_id'].nunique()
else:
return np.nan
def get_type_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):
now = get_transform(now,start_date,end_date)
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,name)).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_movie(x)).reset_index().rename(columns={0:'nunique_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
return train_data
def get_time_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):
now = get_transform(now,start_date,end_date)
# Descriptive statistics features (6)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_all_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()/gap).reset_index().rename(columns={0:'count_'+control+name+gap_name}),on=['user_id'],how='left')
train_data['nunique / count' + control + name + gap_name] = train_data['nunique_all_'+control+name+gap_name] / train_data['count_'+control+name+gap_name]
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].min()-start_date)).reset_index().rename(columns={0:'min-start_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].min())).reset_index().rename(columns={0:'end-min_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,name)).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.75)/gap).reset_index().rename(columns={0:'q2_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.84)/gap).reset_index().rename(columns={0:'q3_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.96)/gap).reset_index().rename(columns={0:'q4_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+gap_name}),on=['user_id'],how='left')
t2 = time.time()
print(name,' Describe Finished... ',t2-t1,' Shape: ',train_data.shape)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.std(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_var_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_mean_'+control+name+gap_name}),on=['user_id'],how='left')
t2 = time.time()
print(control,' Sigma Finished... ',t2-t1,' Shape: ',train_data.shape)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],3))).reset_index().rename(columns={0:'max3_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')
t2 = time.time()
print(control,' Max Finished... ',t2-t1,' Shape: ',train_data.shape)
return train_data
def get_second_day(now,seq):
    # Return the seq-th most recent distinct day, or NaN if there are fewer than seq distinct days
    now = list(now.unique())
    for i in range(seq-1):
        if len(now)>1:
            now.remove(max(now))
        else:
            return np.nan
    return max(now)
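# Illustrative sketch (not part of the original pipeline): the same "k-th most recent
# distinct day" lookup written with a sorted list instead of repeated max()/remove().
# `demo_kth_latest_day` is a hypothetical name.
def demo_kth_latest_day(days, k):
    distinct = sorted(set(days), reverse=True)
    return distinct[k - 1] if len(distinct) >= k else float('nan')
# demo_kth_latest_day([2, 5, 5, 7], 2) is 5; with a single distinct day and k=2 it is NaN,
# which the callers then fill with the window length.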
def get_id_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):
now = get_transform(now,start_date,end_date)
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()/gap).reset_index().rename(columns={0:'count_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'var_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')
return train_data
def get_diff_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):
now = get_transform(now,start_date,end_date)
t1 = time.time()
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].std()).reset_index().rename(columns={0:'var_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'mean_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].max()).reset_index().rename(columns={0:'max_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+gap_name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'min_'+control+name+gap_name}),on=['user_id'],how='left')
t2 = time.time()
# print(control,' Get Diff Feature Finished... Used: ',t2-t1,' Shape: ',train_data.shape)
return train_data
def get_category_count(name,deal_now,train_data,start_date,end_date):
count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'}))
count_size = deal_now.groupby([name]).size().shape[0]
sum_data = 0
for i in range(0,count_size):
new_name = 'see_'+name+'_'+str(i)
temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name})
train_have = pd.merge(train_data,temp[['user_id',new_name]],on=['user_id'])
train_have = train_have[['user_id',new_name]]
not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values))
train_not_have = DF()
train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id']
train_not_have['see_'+name+'_'+str(i)] = 0
temp = pd.concat([train_have,train_not_have],axis=0)
train_data = pd.merge(train_data,temp,on=['user_id'],how='left')
sum_data += train_data[new_name].values
for i in range(0,count_size):
new_name = 'see_'+name+'_'+str(i)
train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data
return train_data
from multiprocessing import Pool
def parallelize_df_func(df, func, start, end, num_partitions=40, n_jobs=4):
df_split = np.array_split(df, num_partitions)
start_date = [start] * num_partitions
end_date = [end] * num_partitions
param_info = zip(df_split, start_date, end_date)
pool = Pool(n_jobs)
gc.collect()
df = pd.concat(pool.map(func, param_info))
pool.close()
pool.join()
return df
def get_train(param_info):
uid = param_info[0]
start_date= param_info[1]
end_date= param_info[2]
t_start = time.time()
t1 = time.time()
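# Copy the slices so the column assignments below do not write into views of the global logs
train_act = check_id(uid,get_transform(action_log,start_date,end_date)).copy()
train_video = check_id(uid,get_transform(video_log,start_date,end_date)).copy()
train_app = check_id(uid,get_transform(app_log,start_date,end_date)).copy()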
# Get Week
train_act['week'] = (train_act['day'].values) % 7
train_video['week'] = (train_video['day'].values) % 7
train_app['week'] = (train_app['day'].values) % 7
# Modify Day
train_act['day'] = train_act['day'] - start_date + 1
train_video['day'] = train_video['day'] - start_date + 1
train_app['day'] = train_app['day'] - start_date + 1
end_date = end_date-start_date+1
true_start = start_date
start_date = 1
train_reg = register_log[register_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'})
train_act = pd.merge(train_act,train_reg[['user_id','reg_day']],on=['user_id'],how='left')
train_video = pd.merge(train_video,train_reg[['user_id','reg_day']],on=['user_id'],how='left')
train_app = pd.merge(train_app,train_reg[['user_id','reg_day']],on=['user_id'],how='left')
del train_act['reg_day']
del train_video['reg_day']
del train_app['reg_day']
gc.collect()
t2 = time.time()
print(start_date,' To ',end_date,' Have User: ',len(uid))
print('Data Prepare Use...',t2-t1)
# Build
train_data = DF()
train_data['user_id'] = uid # 1 feature
train_data = pd.merge(train_data,train_act.groupby(['user_id']).size().reset_index().rename(columns={0:'action_all_times'}),on=['user_id'],how='left').fillna(0)
for i in range(5):
page_temp = train_act[train_act['page']==i]
train_data = get_type_feature('act_page'+str(i),'day',page_temp,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = get_time_feature('act_','day',train_act,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = get_diff_feature('act_','diff_day',train_act,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = pd.merge(train_data,train_act.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'act_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
for i in ['page','action_type','video_id','author_id']: # 4*3 12 feature
train_data = get_id_feature('id_act_',i,train_act,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = get_category_count('page',train_act,train_data,start_date,end_date)
train_data = get_category_count('action_type',train_act,train_data,start_date,end_date)
train_data = get_category_count('page',train_act,train_data,end_date-3,end_date)
train_data = get_category_count('action_type',train_act,train_data,end_date-3,end_date)
train_data = reduce_mem_usage(train_data)
train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = get_diff_feature('video_','diff_day',train_video,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = reduce_mem_usage(train_data)
train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date,7,'_7')
train_data = get_diff_feature('app_','diff_day',train_app,train_data,start_date,end_date,end_date-start_date,'_all')
train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'app_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = reduce_mem_usage(train_data)
train_data = pd.merge(train_data,register_log[['user_id','register_type','device_type','week','day']],on=['user_id'],how='left').rename(columns={'week':'reg_week','day':'reg_day'}) # 2
train_data['rt_dt'] = (train_data['register_type']+1)*(train_data['device_type']+1)
train_data['week_rt'] = (train_data['register_type']+1)*(train_data['reg_week']+1)
train_data['week_dt'] = (train_data['device_type']+1)*(train_data['reg_week']+1)
t_end = time.time()
print('Get Feature Use All Time: ',t_end-t_start)
train_data = reduce_mem_usage(train_data)
return train_data
def data_prepare(read_path=None):
register_log = pd.read_csv(read_path+'user_register_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})
action_log = pd.read_csv(read_path+'user_activity_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})
app_log = pd.read_csv(read_path+'app_launch_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})
video_log = pd.read_csv(read_path+'video_create_log.txt',sep='\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})
# Sort By User
register_log = register_log.sort_values(by=['user_id','day'],ascending=True)
action_log = action_log.sort_values(by=['user_id','day'],ascending=True)
app_log = app_log.sort_values(by=['user_id','day'],ascending=True)
video_log = video_log.sort_values(by=['user_id','day'],ascending=True)
# Diff Day
t1 = time.time()
app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)
video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)
action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)
t2 = time.time()
print('Diff Day Finished... ',t2-t1)
# Prepare REGISTER
register_log['week'] = register_log['day'] % 7
register_log['rt_dt'] = (register_log['register_type']+1)*(register_log['device_type']+1)
register_log['week_rt'] = (register_log['register_type']+1)*(register_log['week']+1)
register_log['week_dt'] = (register_log['device_type']+1)*(register_log['week']+1)
register_log['use_reg_people'] = register_log.groupby(['register_type'])['user_id'].transform('count').values
register_log['use_dev_people'] = register_log.groupby(['device_type'])['user_id'].transform('count').values
register_log['week_rt_use_people'] = register_log.groupby(['week_rt'])['user_id'].transform('count').values
register_log['week_dt_use_people'] = register_log.groupby(['week_dt'])['user_id'].transform('count').values
register_log['rt_dt_use_people'] = register_log.groupby(['rt_dt'])['user_id'].transform('count').values
return register_log,action_log,app_log,video_log
read_path = '/mnt/datasets/fusai/'
register_log,action_log,app_log,video_log = data_prepare(read_path)
train_set = []
for i in range(17,25):
train_label = get_label(i,i+6)
train_data_part1 = parallelize_df_func(train_label['user_id'], get_train, 1, i-1, 1, 1)
train_data = pd.merge(train_data_part1,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',
'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',
'rt_dt_use_people']],on=['user_id'],how='left')
train_data = pd.merge(train_data,train_label,on=['user_id'],how='left')
train_set.append(train_data)
del train_data_part1
gc.collect()
train_data = pd.concat(train_set[0:-1],axis=0).reset_index(drop=True)
valid_data = train_set[-1]
online_data = parallelize_df_func(register_log['user_id'].unique(), get_train, 1, 30, 1,1)
online_data = pd.merge(online_data,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',
'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',
'rt_dt_use_people']],on=['user_id'],how='left')
write_path = '/home/kesci/'
train_data.to_csv(write_path+'train_data.csv',index=False)
valid_data.to_csv(write_path+'valid_data.csv',index=False)
online_data.to_csv(write_path+'online_data.csv',index=False)
print('Style 2 Feature Engineer Finished...')
================================================
FILE: feature_engineer/get_feature.py
================================================
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
import catboost as cbt
from pandas import DataFrame as DF
import gc
from sklearn.preprocessing import LabelEncoder
import time
import networkx as nx
from sklearn.cluster import MeanShift,KMeans
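# 1. Train on the A-board and B-board data separately to get models A and B. Then use model A to predict on all B-board features, append that prediction (Model-A) to Feature-B as an extra column (B gains one dimension), and retrain Model-B on the augmented data.
# 2. Try re-encoding User-id and merging the A and B data into Merge-AB, then extract features and train on that (it is unknown whether Video-id and Author-id are shuffled between the two sets).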
reg_log_a = pd.read_csv('data/a/user_register_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})
aci_log_a = pd.read_csv('data/a/user_activity_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})
app_log_a = pd.read_csv('data/a/app_launch_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'})
video_log_a = pd.read_csv('data/a/video_create_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'})
reg_log_b = pd.read_csv('data/b/user_register_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})
aci_log_b = pd.read_csv('data/b/user_activity_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})
app_log_b = pd.read_csv('data/b/app_launch_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'})
video_log_b = pd.read_csv('data/b/video_create_log.txt',sep='\t',header=None).rename(columns={0:'user_id',1:'day'})
reg_log = pd.concat([reg_log_a,reg_log_b],axis=0).reset_index(drop=True)
aci_log = pd.concat([aci_log_a,aci_log_b],axis=0).reset_index(drop=True)
app_log = pd.concat([app_log_a,app_log_b],axis=0).reset_index(drop=True)
video_log = pd.concat([video_log_a,video_log_b],axis=0).reset_index(drop=True)
# Write the merged logs with the same tab separator the raw files are read with
reg_log.to_csv('data/user_register_log.txt',sep='\t',header=False,index=False)
aci_log.to_csv('data/user_activity_log.txt',sep='\t',header=False,index=False)
app_log.to_csv('data/app_launch_log.txt',sep='\t',header=False,index=False)
video_log.to_csv('data/video_create_log.txt',sep='\t',header=False,index=False)
print('Persistence Finished...')
# Get Week
# Copy the filtered frame so adding the 'week' column does not write into a view
reg_log = reg_log[reg_log['device_type']!=1].copy()
reg_log['week'] = reg_log['day'] % 7
def get_transform(now,start_date,end_date):
get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)]
return get_trans
def get_label(start_date,end_date):
merge_name = ['user_id','day']
all_log = pd.concat([aci_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0)
train_label = get_transform(all_log,start_date,end_date)
train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'})
train_1['label'] = 1
reg_temp = get_transform(reg_log,start_date-16,start_date-1)
train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])]
train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'})
train_0['label'] = 0
del train_label
gc.collect()
return pd.concat([train_1,train_0],axis=0)
def check_id(uid,now):
return now[now['user_id'].isin(uid)]
def get_category_count(name,deal_now,train_data,start_date,end_date):
count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'}))
count_size = aci_log.groupby([name]).size().shape[0]
sum_data = 0
for i in range(0,count_size):
new_name = 'see_'+name+'_'+str(i)
temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name})
train_have = pd.merge(train_data,temp[['user_id',new_name]],on=['user_id'])
train_have = train_have[['user_id',new_name]]
not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values))
train_not_have = DF()
train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id']
train_not_have['see_'+name+'_'+str(i)] = 0
temp = pd.concat([train_have,train_not_have],axis=0)
train_data = pd.merge(train_data,temp,on=['user_id'],how='left')
train_data = pd.merge(train_data,deal_now[deal_now[name]==i].groupby(['user_id']).apply(lambda x:get_binary(x['day'],start_date,end_date)).reset_index().rename(columns={0:'binary_'+str(i)+'_'+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,deal_now[deal_now[name]==i].groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x['day'],start_date,end_date)).reset_index().rename(columns={0:'get_log_sigma_'+str(i)+'_'+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
sum_data += train_data[new_name].values
for i in range(0,count_size):
new_name = 'see_'+name+'_'+str(i)
train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data
return train_data
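def get_binary_seq(now,start_date,end_date):
    # 0/1 activity flag for each day in [start_date, end_date]
    day = list(range(start_date,end_date+1))
    binary_day = []
    now_uni = now.unique()  # computed once instead of on every iteration
    for i in day:
        if i in now_uni:
            binary_day.append(1)
        else:
            binary_day.append(0)
    return binary_day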
def get_binary(now,start_date,end_date): # Boss Feature
ans = 0
binary_day = get_binary_seq(now,start_date,end_date)
for i in range(len(binary_day)):
ans += binary_day[i]*(2**i)
return ans
def get_binary_mol7(now,start_date,end_date):
ans = 0
binary_day = get_binary_seq(now,start_date,end_date)
for i in range(len(binary_day)):
ans += binary_day[i]*(2**(i%7))
return ans
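# Illustrative sketch (not part of the original pipeline) of the weekly fold used by
# get_binary_mol7: positions one week apart share the weight 2**(i % 7), so activity in
# the same weekly slot accumulates instead of growing with the window length.
# `demo_weekly_fold` is a hypothetical name and takes the 0/1 flags directly.
def demo_weekly_fold(flags):
    return sum(f * (2 ** (i % 7)) for i, f in enumerate(flags))
# Flags set at positions 0 and 7 both weigh 2**0, so the folded code is 2.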
def get_time_log_weight_sigma(now,start_date,end_date):
window_len = end_date+1-start_date
ans = np.zeros(window_len)
sigma_ans = 0
for i in now:
ans[(i-1)%window_len] += 1
for i in range(window_len):
if ans[i]!=0:
sigma_ans += np.log(ans[i]/(window_len-i))
return sigma_ans
def get_time_weight_sigma(now,start_date,end_date):
window_len = end_date+1-start_date
ans = np.zeros(window_len)
sigma_ans = 0
for i in now:
ans[(i-1)%window_len] += 1
for i in range(window_len):
sigma_ans += ans[i]*(i+1)
return sigma_ans
def get_id_feature(control,name,now,train_data,start_date,end_date):
if end_date<start_date:
return train_data
now = get_transform(now,start_date,end_date)
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'count_all_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'see_var_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'see_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].median()).reset_index().rename(columns={0:'see_median_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mad()).reset_index().rename(columns={0:'see_mad_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].skew()).reset_index().rename(columns={0:'see_skew_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
return train_data
def get_time_feature(control,name,now,train_data,start_date,end_date):
if end_date<start_date:
return train_data
now = get_transform(now,start_date,end_date)
t1 = time.time()
# Descriptive statistics features (10)
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'count_all_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'see_var_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'see_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].median()).reset_index().rename(columns={0:'see_median_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mad()).reset_index().rename(columns={0:'see_mad_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].skew()).reset_index().rename(columns={0:'see_skew_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].kurt()).reset_index().rename(columns={0:'see_kurt_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'see_min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].max()).reset_index().rename(columns={0:'see_max_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-x[name].min())).reset_index().rename(columns={0:'see_max-min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print('Describe Feature Finished... Used: ',t2-t1)
t1 = time.time()
# First- and second-order difference features (11)
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().var()).reset_index().rename(columns={0:'diff_seq_var_'+control+name}).fillna(0),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().mean()).reset_index().rename(columns={0:'diff_seq_mean_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().median()).reset_index().rename(columns={0:'diff_seq_median_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().mad()).reset_index().rename(columns={0:'diff_seq_mad_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().skew()).reset_index().rename(columns={0:'diff_seq_skew_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().kurt()).reset_index().rename(columns={0:'diff_seq_kurt_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().min()).reset_index().rename(columns={0:'diff_seq_min_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().max()).reset_index().rename(columns={0:'diff_seq_max_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().max()-x[name].sort_values().diff().min()).reset_index().rename(columns={0:'diff_seq_max_gap_min_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().diff().min()).reset_index().rename(columns={0:'diff2_min_gap'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().diff().max()).reset_index().rename(columns={0:'diff2_max_gap'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print('Diff Feature Finished... Used: ',t2-t1)
t1 = time.time()
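# FFT of the raw sequence: 5 features. Note: the 'fft_seq_mad_' and 'fft_seq_skew_' columns below actually hold the max and min of the spectrum.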
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_var_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_mean_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.median(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_median_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.max(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_mad_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.min(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_skew_'+control+name}),on=['user_id'],how='left')
t2 = time.time()
print('FFT LAST ABS Feature Finished... Used: ',t2-t1)
t1 = time.time()
# FFT of the 0/1 activity sequence: 5 features ('fft_01seq_mad_' holds the max, 'fft_01seq_skew_' the min)
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_var_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_mean_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.median(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_median_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.max(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_mad_'+control+name}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.min(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_skew_'+control+name}),on=['user_id'],how='left')
t2 = time.time()
print('FFT FIRST 01 ABS Feature Finished... Used: ',t2-t1)
t1 = time.time()
# Time decay: 4 features
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary(x[name],start_date,end_date)).reset_index().rename(columns={0:'binary_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary_mol7(x[name],start_date,end_date)).reset_index().rename(columns={0:'binary_mol7'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'get_log_sigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'get_sigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print('SIGMA Feature Finished... Used: ',t2-t1)
return train_data
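# The time-decay block above relies on helpers (get_binary, get_time_weight_sigma) defined
# earlier in this file. As a minimal standalone sketch of the two weightings described in
# the README, the functions below are hypothetical stand-ins: binary_encode and
# inverse_distance_weight are illustrative names, not the repository's helpers, and the
# "+ 1" distance offset is an assumption.

```python
def binary_encode(days, start_date, end_date):
    """Read the 0/1 activity sequence as a binary number whose most
    significant bit is the last day of the window (closest to prediction)."""
    active = set(days)
    total = 0
    for day in range(start_date, end_date + 1):
        if day in active:
            # Later days get exponentially larger weights.
            total += 2 ** (day - start_date)
    return total


def inverse_distance_weight(days, start_date, end_date):
    """Sum 1 / distance-to-prediction-day over the active days in the window.
    Assumption: the prediction period starts on end_date + 1."""
    active = {d for d in days if start_date <= d <= end_date}
    return sum(1.0 / (end_date + 1 - d) for d in active)


# README example: a 6-day window with activity on days 1, 3, 5.
print(binary_encode([1, 3, 5], 1, 6))
print(inverse_distance_weight([5, 6], 1, 6))
```

# For the unequal-length windows, the README suggests dividing distance-based values by
# the window length so that features from different windows stay comparable.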
def HowManyPeopleWatch(df):
num_people = len(df['user_id'].unique())
return num_people
def MostHandle(df):
most_handle = df.groupby('video_id').size().max()
return most_handle
def FavAuthorCreate(df):
most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0]
create_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique())
watch_other_video = 1 if watch_other_video_num>1 else 0
return create_video_num , watch_other_video
def GongXianDu(df):
d11 = df.set_index('video_id')
d11['gongxian_rate'] = df.groupby('video_id').size()
d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times']
meand = d11['gongxian_rate'].mean()
sumd = d11['gongxian_rate'].sum()
stdd = d11['gongxian_rate'].std()
skeww = d11['gongxian_rate'].skew()
kurtt = d11['gongxian_rate'].kurt()
return sumd,meand,stdd,skeww,kurtt
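# GongXianDu ("contribution degree") is applied once per user through groupby.apply, which
# can be slow on large logs. The sketch below shows one way the same row-level rate
# (the user's views of a video divided by all views of that video) might be computed in a
# vectorized form. contribution_stats is a hypothetical name, and the equivalence with
# GongXianDu is an assumption based on reading the code above, not a tested replacement.

```python
import pandas as pd


def contribution_stats(act):
    """Per-user sum / mean / std of the contribution rate, computed row by row."""
    act = act[['user_id', 'video_id']].copy()
    act['one'] = 1
    # How many times each video appears in the whole log.
    act['video_total'] = act.groupby('video_id')['one'].transform('sum')
    # How many times this user touched this video.
    act['user_video'] = act.groupby(['user_id', 'video_id'])['one'].transform('sum')
    act['rate'] = act['user_video'] / act['video_total']
    return act.groupby('user_id')['rate'].agg(['sum', 'mean', 'std'])


demo = pd.DataFrame({'user_id': [1, 1, 2, 2],
                     'video_id': ['A', 'A', 'A', 'B']})
print(contribution_stats(demo))
```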
def get_lx_day(now):
k1 = np.array(now)
k2 = np.where(np.diff(k1)==1)[0]
i = 0
ans = []
while i<len(k2)-1:
l1 = 1
while k2[i+1]-k2[i]==1:
l1 += 1
i += 1
if i == len(k2)-1:
break
if l1 == 1:
i += 1
ans.append(2)
else:
ans.append(l1+1)
if len(k2)==1:
ans.append(2)
return ans
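# get_lx_day collects the lengths of runs of consecutive active days ("lx" presumably
# abbreviates the pinyin for "consecutive"). Its index arithmetic is hard to follow, so
# here is a minimal alternative sketch of the same idea. consecutive_run_lengths is a
# hypothetical name; it returns only runs of at least two days and may differ from the
# original on edge cases.

```python
def consecutive_run_lengths(days):
    """Return the length of every run of >= 2 consecutive days, in order."""
    days = sorted(set(days))
    runs = []
    run = 1
    for prev, cur in zip(days, days[1:]):
        if cur - prev == 1:
            run += 1          # still inside the current streak
        else:
            if run >= 2:
                runs.append(run)
            run = 1           # streak broken, start counting again
    if run >= 2:
        runs.append(run)      # close the final streak
    return runs


print(consecutive_run_lengths([1, 2, 3, 7, 8, 9]))
```

# Note that an empty result needs explicit handling before calling np.max / np.min on it,
# since both raise on empty input.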
def get_lx_day_feature(name,train_data,now):
lx_now = now.groupby(['user_id']).apply(lambda x:get_lx_day(x['day'].sort_values(ascending=True).unique())).reset_index().rename(columns={0:'lx_day'})
lx_collect = {
name+'lx_count_len' : [],
name+'lx_max' : [],
name+'lx_min' : [],
name+'lx_var' : []
}
if (lx_now.shape[0]==0) | (lx_now.shape[1]==0):
lx_collect[name+'lx_count_len'].append(0)
lx_collect[name+'lx_max'].append(0)
lx_collect[name+'lx_min'].append(0)
lx_collect[name+'lx_var'].append(-1)
else:
for i in lx_now['lx_day'].values:
lx_collect[name+'lx_count_len'].append(len(i))
lx_collect[name+'lx_max'].append(np.max(i))
lx_collect[name+'lx_min'].append(np.min(i))
lx_collect[name+'lx_var'].append(np.var(i))
lx_collect = DF(lx_collect)
lx_collect['user_id'] = lx_now['user_id']
train_data = pd.merge(train_data,lx_collect,on=['user_id'],how='left')
return train_data
def get_only_user_author_graph_feature(now,name,train_data):
G = nx.DiGraph()
need_to = ['user_id','author_id']
to_make = now[need_to].drop_duplicates()
for edge in to_make[need_to].values:
G.add_edge(edge[0],edge[1])
pr = nx.pagerank(G,alpha=0.8)
G_degree1 = DF(dict(G.degree),index=['G_degree'+name]).T.reset_index().rename(columns={'index':'user_id'}).fillna(0)
G_indegree1 = DF(dict(G.in_degree),index=['G_indegree'+name]).T.reset_index().rename(columns={'index':'user_id'}).fillna(0)
G_pr1 = DF(pr,index=['G_pagerank'+name]).T.reset_index().rename(columns={'index':'user_id'}).sort_values(by=['G_pagerank'+name]).fillna(0)
train_data = pd.merge(train_data,G_degree1,on=['user_id'],how='left')
train_data = pd.merge(train_data,G_indegree1,on=['user_id'],how='left')
train_data = pd.merge(train_data,G_pr1,on=['user_id'],how='left')
return train_data
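# The graph above adds user_id and author_id values as nodes of one DiGraph. If the two id
# spaces overlap (an assumption; the data description is not part of this file), a user and
# an author with the same integer would share a single node. The sketch below shows the
# degree part of the feature with tagged nodes, using only the standard library.
# user_author_degrees is a hypothetical helper, not part of the repository.

```python
from collections import defaultdict


def user_author_degrees(pairs):
    """Count distinct authors per user (out-degree) and distinct users per
    author (in-degree), keeping the two id spaces apart with a tag."""
    edges = {(('u', user), ('a', author)) for user, author in pairs}
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for user_node, author_node in edges:
        out_degree[user_node] += 1
        in_degree[author_node] += 1
    return dict(out_degree), dict(in_degree)


out_deg, in_deg = user_author_degrees([(1, 1), (1, 2), (2, 1), (1, 1)])
print(out_deg, in_deg)
```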
def get_train(uid,start_date,end_date):
t_start = time.time()
t1 = time.time()
train_act = check_id(uid,get_transform(aci_log,start_date,end_date))
train_video = check_id(uid,get_transform(video_log,start_date,end_date))
train_app = check_id(uid,get_transform(app_log,start_date,end_date))
# Get Week
train_act['week'] = (train_act['day'].values) % 7
train_video['week'] = (train_video['day'].values) % 7
train_app['week'] = (train_app['day'].values) % 7
train_reg = reg_log[reg_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'})
train_act = pd.merge(train_act,train_reg[['user_id','reg_day']],on=['user_id'],how='left')
train_video = pd.merge(train_video,train_reg[['user_id','reg_day']],on=['user_id'],how='left')
train_app = pd.merge(train_app,train_reg[['user_id','reg_day']],on=['user_id'],how='left')
train_act['aci_distance_from_reg'] = train_act['day'] - train_act['reg_day']
train_video['video_distance_from_reg'] = train_video['day'] - train_video['reg_day']
train_app['app_distance_from_reg'] = train_app['day'] - train_app['reg_day']
del train_act['reg_day']
del train_video['reg_day']
del train_app['reg_day']
gc.collect()
train_act = train_act.sort_values(by=['user_id','day'],ascending=True)
train_app = train_app.sort_values(by=['user_id','day'],ascending=True)
train_video = train_video.sort_values(by=['user_id','day'],ascending=True)
t2 = time.time()
print(start_date,' To ',end_date,' Have User: ',len(uid))
print('Data Prepare Use...',t2-t1)
# Build
train_data = DF()
train_data['user_id'] = uid # 1 feature
# Total clicks / action-type counts per user # 36 features
t1 = time.time()
train_act['at_page'] = (train_act['action_type']+100)*(train_act['page']+1)
train_data = pd.merge(train_data,train_act.groupby(['user_id']).size().reset_index().rename(columns={0:'action_all_times'}),on=['user_id'],how='left').fillna(0)
# Time Window 4 + 30 feature
train_data = get_lx_day_feature('aci_',train_data,train_act)
train_data = get_time_feature('aci_','day',train_act,train_data,start_date,end_date)
train_data = get_time_feature('aci_','aci_distance_from_reg',train_act,train_data,start_date,end_date) # 30
for i in ['page','action_type','at_page']: # 4*5*6 120feature
train_data = get_id_feature('aci_',i,train_act,train_data,start_date,end_date)
t2 = time.time()
print('Use Time: ',t2-t1,' Aci Finish... ','Shape: ',train_data.shape)
# Distribution of each user's clicks across categories: 17 features
t1 = time.time()
train_data = get_category_count('page',train_act,train_data,start_date,end_date)
train_data = get_category_count('action_type',train_act,train_data,start_date,end_date)
t2 = time.time()
print('Use Time: ',t2-t1,' Category Finish... ','Shape: ',train_data.shape)
# Features from video_log: 27 features
t1 = time.time()
train_data = get_lx_day_feature('video_',train_data,train_video)
train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date)
train_data = pd.merge(train_data,train_video.groupby(['user_id']).apply(lambda x:((x['day'].mode().values[0])%7)).reset_index().rename(columns={0:'mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print('Use Time: ',t2-t1,' Video Finish... ','Shape: ',train_data.shape)
# Features from app_log: 27 features
t1 = time.time()
train_data = get_lx_day_feature('app_',train_data,train_app)
train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date)
train_data = get_time_feature('app_','app_distance_from_reg',train_app,train_data,start_date,end_date)
train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:((x['day'].mode().values[0])%7)).reset_index().rename(columns={0:'mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')
t2 = time.time()
print('Use Time: ',t2-t1,' App Finish... ','Shape: ',train_data.shape)
# Registration-type features # 3 features
t1 = time.time()
train_data = pd.merge(train_data,reg_log[['user_id','register_type','device_type','week','day']],on=['user_id'],how='left').rename(columns={'week':'reg_week','day':'reg_day'}) # 2
train_data['rt_dt'] = (train_data['register_type']+1)*(train_data['device_type'])
train_data['week_rt'] = (train_data['register_type']+1)*(train_data['reg_week']+1)
train_data['week_dt'] = (train_data['device_type']+1)*(train_data['reg_week']+1)
train_data['distance_reg_end_window'] = end_date-train_data['reg_day']+1
train_data['distance_reg_start_window'] = train_data['reg_day']-start_date+1
train_data['is_window_reg'] = (train_data['reg_day']>=start_date).astype(int)
# TODO: add the gap between the user's active days and the registration day
t2 = time.time()
print('Use Time: ',t2-t1,' Reg Finish... ','Shape: ',train_data.shape)
# Business-logic features: 3 features
t1 = time.time()
user_feature = DF(train_data['user_id'].unique())
user_feature.columns = ['user_id']
user_feature = user_feature.set_index('user_id')
user_feature['HowManyPeople_Watch'] = train_act.groupby('author_id').apply(HowManyPeopleWatch)
user_feature['Most_Handle'] = train_act.groupby('user_id').apply(MostHandle)
# Total number of times each video was watched
video_size = train_act.groupby('video_id').size().reset_index()
video_size.columns = ['video_id','video_watched_times']
train_act = pd.merge(train_act,video_size,on=['video_id'],how='left')
# Per-user contribution rate: sum, mean, and std (plus skew and kurtosis)
temp = train_act.groupby('user_id').apply(GongXianDu)
user_feature['GongXianSum'] = temp.apply(lambda x:x[0])
user_feature['GongXianMean'] = temp.apply(lambda x:x[1])
user_feature['GongXianStd'] = temp.apply(lambda x:x[2])
user_feature['GongXianSkeww'] = temp.apply(lambda x:x[3])
user_feature['GongXianKurtt'] = temp.apply(lambda x:x[4])
fav_author = train_act.groupby('user_id').apply(FavAuthorCreate)
user_feature['FavAuthorCreate'] = fav_author.apply(lambda x:x[0])
user_feature['WatchOtherVideo'] = fav_author.apply(lambda x:x[1])
train_data = pd.merge(train_data,user_feature.reset_index(),on=['user_id'],how='left')
t2 = time.time()
print('Use Time: ',t2-t1,' User-Author Finish... ','Shape:',train_data.shape)
# Clustering features
t1 = time.time()
train_data = get_only_user_author_graph_feature(train_act,'',train_data)
kmean = KMeans(n_clusters=20,n_jobs=20)
train_data['cluster_graph'] = kmean.fit_predict(train_data[['G_degree','G_indegree','G_pagerank']].fillna(0))
train_data = get_only_user_author_graph_feature(train_act[train_act['page']==0],'page0',train_data)
t2 = time.time()
print('Use Time: ',t2-t1,' Cluster Finish... ','Shape:',train_data.shape)
# Feature End
t_end = time.time()
print('Get Feature Use All Time: ',t_end-t_start)
return train_data
# offline
# style 0: 10-day feature window
# style 1: 14-day feature window
# style 2: 16-day feature window
style = 2
print('The Style is ',style,'...')
print('Dealing Offline...')
if style == 0:
t1 = time.time()
train_label = []
train_data = []
for i in range(1,8):
try_label = get_label(i+10,i+16)
train_label.append(try_label)
train_data.append(get_train(try_label['user_id'],i,i+9))
t2 = time.time()
print('Deal Train Feature: ',t2-t1)
t1 = time.time()
valid_label = get_label(24,30)
valid_data = get_train(valid_label['user_id'],14,23)
t2 = time.time()
print('Deal Valid Feature: ',t2-t1)
elif style == 1:
t1 = time.time()
train_label = []
train_data = []
for i in range(1,4):
try_label = get_label(i+14,i+20)
train_label.append(try_label)
train_data.append(get_train(try_label['user_id'],i,i+13))
t2 = time.time()
print('Deal Train Feature: ',t2-t1)
t1 = time.time()
valid_label = get_label(24,30)
valid_data = get_train(valid_label['user_id'],10,23)
t2 = time.time()
print('Deal Valid Feature: ',t2-t1)
elif style == 2:
t1 = time.time()
train_label = get_label(17,23)
train_data = get_train(train_label['user_id'],1,16)
valid_label = get_label(24,30)
valid_data = get_train(valid_label['user_id'],8,23)
t2 = time.time()
print('Deal Train Feature: ',t2-t1)
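# Each style above pairs a feature window with the 7-day label window that follows it.
# As a small sketch of that bookkeeping, the hypothetical helper below (sliding_windows is
# not a function in this repository) enumerates the (feature_start, feature_end,
# label_start, label_end) tuples for equal-length windows that slide one day at a time.

```python
def sliding_windows(first_start, last_start, feature_len, label_len=7):
    """List (feat_start, feat_end, label_start, label_end) for every start day."""
    windows = []
    for start in range(first_start, last_start + 1):
        feat_end = start + feature_len - 1
        # The label window begins the day after the feature window ends.
        windows.append((start, feat_end, feat_end + 1, feat_end + label_len))
    return windows


# Style 0 training windows: features on [i, i+9], labels on [i+10, i+16] for i = 1..7.
for window in sliding_windows(1, 7, 10):
    print(window)
```

# With 30 days of data, the last label day must not exceed 30, which is why the validation
# window for a 10-day feature window starts on day 14.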
# online
print('Dealing Online...')
if style == 0:
t1 = time.time()
online_label = []
online_data = []
for i in range(8,15):
try_label = get_label(i+10,i+16)
online_label.append(try_label)
online_data.append(get_train(try_label['user_id'],i,i+9))
# model
online_test = get_train(reg_log['user_id'].unique(),21,30)
t2 = time.time()
print('Deal Online: ',t2-t1)
elif style == 1:
t1 = time.time()
online_label = []
online_data = []
for i in range(4,11):
try_label = get_label(i+14,i+20)
online_label.append(try_label)
online_data.append(get_train(try_label['user_id'],i,i+13))
# model
online_test = get_train(reg_log['user_id'].unique(),17,30)
t2 = time.time()
print('Deal Online: ',t2-t1)
elif style == 2:
# online
online_label = pd.concat([train_label,valid_label],axis=0).reset_index(drop=True)
online_data = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)
# model
online_test = get_train(reg_log_b['user_id'].unique(),15,30)
def merge_data(df):
return pd.concat(df,axis=0).reset_index(drop=True)
if style!=2 :
train_data = merge_data(train_data)
train_label = merge_data(train_label)
online_data = merge_data(online_data)
online_label = merge_data(online_label)
online_data = pd.concat([train_data,online_data],axis=0).reset_index(drop=True)
online_label = pd.concat([train_label,online_label],axis=0).reset_index(drop=True)
path_name = 'pre_data/style_'+str(style)
train_data.to_csv(path_name+'/train_data.csv',index=False)
train_label.to_csv(path_name+'/train_label.csv',index=False)
valid_data.to_csv(path_name+'/valid_data.csv',index=False)
valid_label.to_csv(path_name+'/valid_label.csv',index=False)
online_data.to_csv(path_name+'/online_data.csv',index=False)
online_label.to_csv(path_name+'/online_label.csv',index=False)
online_test.to_csv(path_name+'/online_test.csv',index=False)
# TODO (features still to add):
# a. User-Author-Video interaction
# 1. Node2Vec
# 2. User-Video / User-Author embedding # 2
# 3. User-Author-Video embedding # 1
# 4. User-Video / User-Author TF-IDF / Word2Vec # 2*2
# 5. User-Video / User-Author clustering (on TF-IDF / Word2Vec) # 2*2
# 6. User-Author-Video embedding / TF-IDF / Word2Vec + clustering
# b. Learn more about authors
# 1. Define "an active user"; for example, take all positive samples and use their mean values
# (Tip: the mean values are the counts of videos watched, authors viewed, pages seen, and action clicks)
# 2. Get the diff value relative to the active user
# 3. Compute the UV metric, Fuv or Iuv.
# (Tip: find the top-100 most active users and take the union of their favourite authors/videos, e.g. https://zhuanlan.zhihu.com/p/20943978)
# 4. Node centrality / influence (Wiki)
# 5. See Author Delay
================================================
FILE: model/ffm.py
================================================
import hashlib, math, os, subprocess
from multiprocessing import Process
import xlearn as xl
import numpy as np
import pandas as pd
from pandas import DataFrame as DF
def hashstr(s, nr_bins=1e+6):
    # Map a string to a stable integer bucket in [1, nr_bins - 1]
    return int(hashlib.md5(s.encode('utf8')).hexdigest(), 16) % (int(nr_bins) - 1) + 1

class FfmEncoder():
    def __init__(self, field_names, label_name, nthread=1):
        self.field_names = field_names
        self.nthread = nthread
        self.label = label_name

    def gen_feats(self, row):
        # Build one "field-value" key per field
        feats = []
        for field in self.field_names:
            value = row[field]
            key = field + '-' + str(value)
            feats.append(key)
        return feats

    def gen_hashed_fm_feats(self, feats):
        # libffm token format: field_index:hashed_feature:value
        feats = ['{0}:{1}:1'.format(field, hashstr(feat, 1e+6)) for (field, feat) in feats]
        return feats

    def convert(self, df, path, i):
        # Worker i converts its own slice of the frame into a temporary file
        lines_per_thread = math.ceil(float(df.shape[0]) / self.nthread)
        sub_df = df.iloc[i * lines_per_thread: (i + 1) * lines_per_thread]
        tmp_path = path + '_tmp_{0}'.format(i)
        with open(tmp_path, 'w') as f:
            for index, row in sub_df.iterrows():
                feats = []
                # field_idx avoids shadowing the worker index i
                for field_idx, feat in enumerate(self.gen_feats(row)):
                    feats.append((field_idx, feat))
                feats = self.gen_hashed_fm_feats(feats)
                f.write(str(int(row[self.label])) + ' ' + ' '.join(feats) + '\n')

    def parallel_convert(self, df, path):
        processes = []
        for i in range(self.nthread):
            p = Process(target=self.convert, args=(df, path, i))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

    def delete(self, path):
        for i in range(self.nthread):
            os.remove(path + '_tmp_{0}'.format(i))

    def cat(self, path):
        # Concatenate the temporary files in worker order
        if os.path.exists(path):
            os.remove(path)
        for i in range(self.nthread):
            cmd = 'cat {svm}_tmp_{idx} >> {svm}'.format(svm=path, idx=i)
            p = subprocess.Popen(cmd, shell=True)
            p.communicate()

    def transform(self, df, path):
        print('converting data......')
        self.parallel_convert(df, path)
        self.cat(path)
        self.delete(path)
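# For reference, each line written by the encoder has the libffm shape
# "label field:feature:value". The standalone sketch below builds one such line from a plain
# dict. to_ffm_line and hash_feature are hypothetical names used only for illustration;
# they mirror the hashing above but are not part of the class.

```python
import hashlib


def hash_feature(text, nr_bins=10 ** 6):
    """Hash a 'field-value' string into a bucket in [1, nr_bins - 1]."""
    return int(hashlib.md5(text.encode('utf8')).hexdigest(), 16) % (nr_bins - 1) + 1


def to_ffm_line(label, row, field_names):
    """Format one sample as 'label field_idx:hashed_feature:1 ...'."""
    tokens = []
    for field_idx, field in enumerate(field_names):
        key = '{0}-{1}'.format(field, row[field])
        tokens.append('{0}:{1}:1'.format(field_idx, hash_feature(key)))
    return '{0} {1}'.format(int(label), ' '.join(tokens))


print(to_ffm_line(1, {'page': 0, 'device_type': 3}, ['page', 'device_type']))
```

# Because every distinct value is hashed as its own category, continuous features
# may be worth binning before encoding; that step is a suggestion, not something this file does.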
write_path = '/home/kesci/'
ffm_train = train_data.copy()
ffm_valid = valid_data.copy()
ffm_online_train = online_train.copy()
ffm_online_test = online_data.copy()
ffm_online_test['label'] = 0
# field_names = list(fi.sort_values(by=['score'],ascending=False).head(50)['name'].values)
field_names = [i for i in ffm_train.columns if i not in ['user_id','label']]
print(field_names)
fe = FfmEncoder(field_names,label_name='label',nthread=8)
fe.transform(ffm_train, write_path+'train.ffm')
print('Train FFM Finished...')
fe.transform(ffm_valid, write_path+'valid.ffm')
print('Valid FFM Finished...')
fe.transform(ffm_online_train,write_path+'train_online.ffm')
print('Train Online FFM Finished...')
fe.transform(ffm_online_test, write_path+'test_online.ffm')
print('Test Online FFM Finished')
# Training task
ffm_model = xl.create_ffm() # Use field-aware factorization machine
ffm_model.setTrain("/home/kesci/train.ffm") # Training data
ffm_model.setValidate("/home/kesci/valid.ffm") # Validation data
# param:
# 0. binary classification
# 1. learning rate: 0.1
# 2. regularization lambda: 0.01
# 3. evaluation metric: AUC
# 4. 20 epochs with the FTRL optimizer
param = {'task':'binary', 'lr':0.1,
'lambda':0.01, 'metric':'auc', 'epoch' : 20,'opt':'ftrl'}
# Start to train
# The trained model will be stored in model.out
ffm_model.fit(param, write_path+'model.out')
print('Offline Train Finished...')
# Prediction task
param = {'task':'binary', 'lr':0.05,
'lambda':0.003, 'metric':'auc'}
ffm_online_model = xl.create_ffm()
ffm_online_model.setTrain(write_path+'train_online.ffm')
ffm_online_model.fit(param,write_path+'online_model.out')
ffm_online_model.setTest(write_path+'test_online.ffm') # Test data
ffm_online_model.setSigmoid() # Convert output to 0-1
# Start to predict with the model trained on train + valid
# The output result will be stored in output.txt
ffm_online_model.predict(write_path+'online_model.out', write_path+'output.txt')
================================================
FILE: model/lgb_model
================================================
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
import catboost as cbt
from pandas import DataFrame as DF
import gc
import time
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
# Read
write_path = '/home/kesci/'
train_data = pd.read_csv(write_path+'train_data.csv')
valid_data = pd.read_csv(write_path+'valid_data.csv')
online_data = pd.read_csv(write_path+'online_data.csv')
online_train = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)
# LGB Model
feature_name = [i for i in train_data.columns if i not in ['user_id','label']]
print(len(feature_name))
dtrain = lgb.Dataset(train_data[feature_name], label=train_data['label'].values)
dval = lgb.Dataset(valid_data[feature_name], label=valid_data['label'].values)
params = {'learning_rate': 0.05,
'metric': ['auc','binary_logloss'],
'objective': 'binary',
'nthread': 8,
'num_leaves': 8,
'colsample_bytree': 0.7,
'bagging_fraction' : 0.8,
'bagging_freq' : 10,
'seed' : 2018,
}
lgb_model = lgb.train(params, dtrain, 2500, dval, verbose_eval=50,early_stopping_rounds=100,)
pred = lgb_model.predict(train_data[feature_name])
from sklearn.metrics import roc_auc_score,f1_score
print('TRAIN SET auc ',roc_auc_score(train_data['label'],pred))
f1_ans = []
for i in pred:
if i>=0.5:
f1_ans.append(1)
else:
f1_ans.append(0)
print('TRAIN SET F1 ',f1_score(train_data['label'],f1_ans))
fi = DF()
fi['name'] = feature_name
fi['score'] = lgb_model.feature_importance()
print(fi.sort_values(by=['score'],ascending=False))
lgb.plot_importance(lgb_model,max_num_features=40,figsize=(10,8))
plt.show()
online_train = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)
online_lgb_set = lgb.Dataset(online_train[feature_name],label=online_train['label'])
online_lgb_model = lgb.train(params,online_lgb_set,num_boost_round=lgb_model.best_iteration-50)
ans = online_lgb_model.predict(online_data[feature_name])
submit = DF()
submit['id'] = online_data['user_id']
submit['score'] = ans
print(submit.head(10))
print(submit['score'].describe())
submit.to_csv('Submit.txt',index=False,header=False)
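# The script above reports F1 at a fixed 0.5 cut-off. As a hedged sketch of how the cut-off
# itself could be examined, the helpers below compute F1 for a given threshold and pick the
# best of a candidate list. f1_at_threshold and best_f1_threshold are hypothetical names
# written in plain Python; they are not part of this repository or of scikit-learn.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class when scores >= threshold are predicted as 1."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores, thresholds):
    """Return the candidate threshold with the highest F1."""
    return max(thresholds, key=lambda t: f1_at_threshold(y_true, scores, t))


labels = [1, 1, 0, 0]
probs = [0.9, 0.4, 0.3, 0.1]
print(best_f1_threshold(labels, probs, [0.5, 0.35, 0.05]))
```

# Any threshold chosen this way should come from validation data rather than the training set.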
================================================
FILE: model/xgb_model.py
================================================
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
import catboost as cbt
from pandas import DataFrame as DF
import gc
import time
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
# Read
write_path = '/home/kesci/'
train_data = pd.read_csv(write_path+'train_data.csv')
valid_data = pd.read_csv(write_path+'valid_data.csv')
online_data = pd.read_csv(write_path+'online_data.csv')
online_train = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)
# XGB Model
feature_name = [i for i in train_data.columns if i not in ['user_id','label']]
print(len(feature_name))
xgb_train = xgb.DMatrix(train_data[feature_name],train_data['label'].values)
xgb_valid = xgb.DMatrix(valid_data[feature_name],valid_data['label'].values)
watch_list = [(xgb_train,'dtrain'),(xgb_valid,'dvalid')]
params = {
'booster': 'gbtree',
'objective': 'rank:pairwise', #'binary:logistic',
'eta': 0.05,
'seed' : 2018,
'max_depth': 5,
'subsample': 0.9,
'colsample_bytree': 0.8,
'colsample_bylevel' : 0.8,
'eval_metric': ['auc'], # Need TO Logloss
'nthread' : 8,
'gamma': 2,
}
xgb_model = xgb.train(params,xgb_train,2000,watch_list,early_stopping_rounds=40,verbose_eval=10)
pred = xgb_model.predict(xgb.DMatrix(valid_data[feature_name]))
from sklearn.metrics import roc_auc_score,f1_score
print('auc ',roc_auc_score(valid_data['label'],pred))
f1_ans = []
for i in pred:
if i>=0.5:
f1_ans.append(1)
else:
f1_ans.append(0)
print('f1 ',f1_score(valid_data['label'],f1_ans))
def create_feature_map(features):
outfile = open('xgb.fmap', 'w')
i = 0
for feat in features:
outfile.write('{0}\t{1}\tq\n'.format(i, feat))
i = i + 1
outfile.close()
create_feature_map(feature_name)
import operator
xgb_importance = xgb_model.get_fscore(fmap='xgb.fmap')
xgb_importance = sorted(xgb_importance.items(), key=operator.itemgetter(1))
xgb_importance = DF(xgb_importance, columns=['name', 'fscore'])
print(xgb_importance)
online_xgb_set = xgb.DMatrix(online_train[feature_name],label=online_train['label'])
online_xgb_model = xgb.train(params,online_xgb_set,num_boost_round=xgb_model.best_iteration)
ans_xgb = online_xgb_model.predict(xgb.DMatrix(online_data[feature_name]))
submit_xgb = DF()
submit_xgb['id'] = online_data['user_id']
from sklearn.preprocessing import MinMaxScaler
st = MinMaxScaler()
submit_xgb['score'] = st.fit_transform(ans_xgb.reshape(-1,1)) # RANK
# submit_xgb['score'] = ans_xgb # Binary
print(submit_xgb.head(10))
print(submit_xgb['score'].describe())
submit_xgb.to_csv('Submit XGB.txt',index=False,header=False)
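# With the 'rank:pairwise' objective the raw predictions are unbounded ranking scores, which
# is why the script rescales them with MinMaxScaler before writing the submission. A minimal
# plain-Python sketch of that rescaling follows; min_max_scale is a hypothetical helper shown
# only to make the transformation explicit.

```python
def min_max_scale(scores):
    """Linearly map scores onto [0, 1]; constant input maps to all zeros."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


print(min_max_scale([-2.0, 0.0, 2.0]))
```

# The mapping is monotonic, so the ordering of users (and therefore AUC) is unchanged, but
# the scaled values are not calibrated probabilities.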