[
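  {
    "path": "README.md",
    "content": "# 2018-KUAISHOU-TSINGHUA-Top13-Solutions\n2018 China Collegiate Computing Contest, Big Data Challenge: Top 13 Solutions\n\n#### Preliminary round A: Top 2, preliminary round B: Top 5\n\t Final round: rank 13 \n\t Every encounter is a reunion after a long parting. Who knows when we will next stand on the defense stage; stay true to yourself and to your youth.\n\t Teammate's RNN: https://github.com/totoruo/KuaiShou2018-RANK13-RNN\n\n#### Task summary: given each user's historical behavior in the Kuaishou app on days 1-30, predict which users will be active over the next 7 days (31-37). A user counts as active if they appear in any of the tables; cold start is not considered.\n\n##### Framework design\n\t1. Equal-length sliding windows  [1,16] [2,17] ... [8,23]  ---> Predict [15,30]\n\t2. Variable-length sliding windows [1,16] [1,17] ... [1,24] ---> Predict [1,30]\n\n##### Feature engineering:\n\tTips: the features below are fairly generic, but in the variable-length framework 2 every distance-based computation must be smoothed, i.e. divided by the length of the time window \n\n\tFor the APP-LAUNCH, ACTION and VIDEO tables\n\t\t1. Encode the time series\n\t\t\tTwo encodings are considered:\n\t\t\t\ta. Treat the user's active days as a binary number and weight the active days in binary fashion, so that days closer to the prediction date get higher weights. For example, with a 6-day window and activity on days 1, 3, 5 the sequence is 101010; reversed it is 010101, which is 21 in decimal \n\t\t\t\t`ans += binary_day[i]*(2**i)`\n\t\t\t\tb. Weight each day directly by its distance to the prediction date\n\t\t\t\t`ans += binary_day[i]*(1/(end_date-i))`\n\n\t\t2. Descriptive statistics of the time series\n\t\t\tFirst-order statistics: mean,std,median,max,min,(max-min),mode,count,nunique (in the APP table, count==nunique)\n\t\t\tSecond-order statistics: skew,kurt,mad\n\t\t\tFrequency-domain periodicity of logins: Var(fft(X['day']))\n\t\t\tDay-of-week periodicity of logins: Get_Mode(X['week']) \n\n\t\t3. Day-distance interactions between the time series and the prediction date\n\t\t\tDays from the user's last login to the prediction date   end_date+1-x['day'].max()\n\t\t\tDays from the user's K-th most recent login to the prediction date  end_date+1-get_second_day(x,k)\n\t\t\tGap between the last and the second-to-last login x['day'].max()-get_second_day(x,2)\n\n\t\t4. First-order differenced time series\n\t\t\tLargest/smallest gap in days between logins Diff Max/Min\n\t\t\tMean login gap Diff Mean\n\t\t\tStability of the login gap Diff Var\n\t\t\tPeriodicity of the login gap Var(fft(X['day'])) \n\n\tSpecial handling of the ACTION table\n\t\t1. Encode the time series with a decay coefficient\n\t\t\tAction count divided by the distance from that day to the prediction date: sigma_ans += np.log(ans[i]/(window_len-i))\n\n\t\t2. Action-count interactions between the time series and the prediction date\n\t\t\tDistribution over Page and Action on the user's last / second-to-last login, e.g. 3 views of Page 0, 4 of Page 1, 2 of Action 1; behaviors that did not occur are filled with 0\n\t\t\tSplitting the behaviors into sub-groups gives the user's distribution across pages, e.g. descriptive statistics (mean, std, etc.) of the behaviors that occurred only on Page 0 (likewise for Action)\n\t\t\tOn the day of the user's last / K-th most recent action, Count VIDEOID/AuthorID\n\t\t\tWithin the gap between the last and the second-to-last action, Count VIDEOID/AuthorID\n\n\t\t3. Expand Page and Action globally per user\n\t\t\tThe share of each behavior within the user's whole behavior sequence, e.g. 5 clicks on Page 0 when the user has 10 Page 0 clicks in total gives Ratio1 = 0.5\n\t\t\tThe share of each behavior among all behaviors on that day, e.g. the user clicks Page 0 900 times and there are 1000 clicks in total that day, so Ratio2 
is 0.9\n\t\t\tThese two statistics help guard against click-farming: a normal behavior sequence should be a fairly smooth click sequence, so once a spike appears, e.g. mostly report actions or repeated views of the same video, it can be treated as a click-farming signal\n\n\t\t4. Compute the user's contribution sequence over watched VIDEOs\n\t\t\tdef GongXianDu(df):\n\t\t\t    d11 = df.set_index('video_id')\n\t\t\t    d11['gongxian_rate'] = df.groupby('video_id').size() \n\t\t\t    d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times']\n\t\t\t    meand = d11['gongxian_rate'].mean()\n\t\t\t    sumd = d11['gongxian_rate'].sum()\n\t\t\t    stdd = d11['gongxian_rate'].std()\n\t\t\t    skeww = d11['gongxian_rate'].skew()\n\t\t\t    kurtt = d11['gongxian_rate'].kurt()\n\t\t\t    return sumd,meand,stdd,skeww,kurtt\n\t\t\ttemp = train_act.groupby('user_id').apply(GongXianDu)\n\n\t    5. Check whether the user follows favorite creators (fan behavior)\n\t    \tWhether the user's favorite author has posted updates\n\t    \tHow many other people also watch the videos this user watched (niche vs. mainstream)\n\t\tdef FavAuthorCreate(df):\n\t\t    most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0]\n\t\t    create_video_num = len(df[df['author_id']==most_author]['video_id'].unique())\n\t\t    watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique())\n\t\t    watch_other_video = 1 if watch_other_video_num>1 else 0\n\t\t    return create_video_num , watch_other_video\n\n\tMining the REGISTER table\n\t\t1. Registration periodicity, e.g. weekend promotions; the most direct feature is Week\n\t\t2. Periodicity interactions, e.g. particular Type combinations on weekends\n\t\t\tregister['week'] * register['device_type']\n\t\t\tregister['week'] * register['register_type']\n\t\t3. Interactions between categorical features, e.g. register['device_type'] * register['register_type']\n\t\t4. Number of users per category, e.g. register_log.groupby(['register_type'])['user_id'].transform('count').values\n\t\t\tThe same can be computed for device_type, week_rt, week_dt, rt_dt\n\t\t5. Conversion rate per category (must be computed over sliding windows), groupby(['count_label_ratio'])['register_type','device_type'].transform('count').values\n\n##### Model selection\n\tSnake's feature-selection framework from the diabetes competition https://github.com/luoda888/tianchi-diabetes-top12\n\tAfter keeping the mandatory features (Encoder/FFT), thresholds are set to produce two feature sets\n\tTree models: LGB on framework 1 \n\t\t   XGB on framework 2\n\tFFM: Xlearn, computed after selecting the TopK features by feature importance\n\txDeepFM takes sequence features / Category features as input\n\tCNN/RNN take sequence features as input\n\n##### Model ensembling\n\tEnsembling paid off heavily on this task. We tried 3 ensembling approaches, ranked here by effect\n\t\tTop 3 Stacking 
was useless, and with a single shared feature set it even lowered the score\n\t\tTop 2 Weighted averaging: similarity between models was around 0.98 and 0.97; at 0.97 the gain was about 1 thousandth (0.001), at 0.98 roughly 7-8 ten-thousandths (0.0007-0.0008)\n\t\tTop 1 Half-and-half blending, e.g. testing on the first sliding window and on the second sliding window; this two-fold blending gained 1.2 thousandths (0.0012)\n\t\t\n##### Closing thoughts\n\tFirst, thanks to my teammates for working hard throughout and not giving up at the end. I had planned to write much more about our approach to this task, but on reflection I am still not good enough, so I will leave it here.\n\tIn my recent competitions, Ping An 11th, Tencent 12th, Kuaishou 13th, I really am the king of just outside the top 10.\n\tI have also reflected on myself: my approach to mining features was often off. I was too fond of going all in at once, and I ignored the specific data background and the specific changes\n\tData mining should be more about your own thinking and less about formulaic tricks.\n\tWhen everything follows a fixed routine, much of the fun is lost as well.\n\tThe atmosphere in the Kuaishou group was good, and the Kesci platform's attitude is indeed among the best I have seen so far.\n\tPerhaps if that LGB model had finished running at noon the day before yesterday, we would surely be in the Top 10 now.\n\tThere are regrets too; I swallow them.\n\n\tEmpires and grand ambitions pass amid talk and laughter; none of it matches one good drunken night in life.\n\tLet's go and drink!\n"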
  },
  {
    "path": "feature_engineer/feature_engineer1",
    "content": "# Feature engineering with a sliding-window step of 1 day, full data, no partial average weighting\n\nfrom multiprocessing import Pool\nimport numpy as np\nimport pandas as pd\nfrom pandas import DataFrame as DF\nimport gc\nfrom sklearn.preprocessing import LabelEncoder\nimport time\nfrom scipy import stats\n\ndef get_transform(now,start_date,end_date):\n    get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)]\n    return get_trans\n\ndef get_label(start_date,end_date):\n    merge_name = ['user_id','day']\n    all_log = pd.concat([action_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0)\n    train_label = get_transform(all_log,start_date,end_date)\n    train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'})\n    train_1['label'] = 1\n    reg_temp = get_transform(register_log,1,start_date-1)\n    train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])]\n    train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'})\n    train_0['label'] = 0\n    del train_label\n    gc.collect()\n    return pd.concat([train_1,train_0],axis=0) \n\ndef check_id(uid,now):\n    return now[now['user_id'].isin(uid)]\n    \ndef get_mode(now):\n    return stats.mode(now)[0][0]\n    \ndef get_binary_seq(now,start_date,end_date): \n    day = list(range(1,end_date-start_date+2))\n    day.reverse()\n    binary_day = []\n    now_uni = now.unique()\n    for i in day:\n        if i in now_uni:\n            binary_day.append(1)\n        else:\n            binary_day.append(0)\n    return binary_day\n    \ndef get_binary1(now,start_date,end_date): # Boss Feature\n    ans = 0\n    binary_day = get_binary_seq(now,start_date,end_date)\n    for i in range(len(binary_day)):\n        ans += binary_day[i]*(2**i)\n    return ans\n\ndef get_binary2(now,start_date,end_date): # Boss Feature\n    ans = 0\n    binary_day = get_binary_seq(now,start_date,end_date)\n    for i in range(len(binary_day)):\n        ans += 
binary_day[i]*(1/(end_date-i))\n    return ans\n\ndef get_time_log_weight_sigma(now,start_date,end_date):\n    window_len = end_date+1-start_date\n    ans = np.zeros(window_len)\n    sigma_ans = 0\n    for i in now:\n        ans[(i-1)%window_len] += 1\n    for i in range(window_len):\n        if ans[i]!=0:\n            sigma_ans += np.log(ans[i]/(window_len-i))\n    return sigma_ans\n\ndef get_max_count(x,x_max):\n    x_max = int(x_max)\n    if x_max>0:\n        return x['day'].value_counts()[x_max]\n    else:\n        return 0\n\ndef get_max_movie(x,x_max):\n    x_max = int(x_max)\n    if x_max>0:\n        x = x[x['day']==x_max]\n        return x['video_id'].nunique()\n    else:\n        return 0\n\ndef get_type_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):\n\n    now = get_transform(now,start_date,end_date)\n    \n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,np.nan_to_num(x['day'].max()))).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_count_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    train_data = 
pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_movie(x,np.nan_to_num(x['day'].max()))).reset_index().rename(columns={0:'nunique_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_movie(x,get_second_day(x[name],2))).reset_index().rename(columns={0:'nunique2_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-get_second_day(x[name],2))).reset_index().rename(columns={0:'max_distance_12_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')\n    \n    return train_data\n    \ndef get_encoder_feature(control,name,now,train_data,start_date,end_date):\n    \n    now = get_transform(now,start_date,end_date)\n    \n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    \n    return train_data\n        \ndef get_time_feature(control,name,now,train_data,start_date,end_date):\n    \n    now = get_transform(now,start_date,end_date)\n\n    # Descriptive statistics features 6\n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].nunique()).reset_index().rename(columns={0:'nunique_all_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data['nunique / count' + control + name +str(end_date-start_date)] = train_data['nunique_all_'+control+name+str(end_date-start_date)] / train_data['count_'+control+name+str(end_date-start_date)]\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].min()-start_date)).reset_index().rename(columns={0:'min-start_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].min())).reset_index().rename(columns={0:'end-min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].kurt()).reset_index().rename(columns={0:'kurt_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].skew()).reset_index().rename(columns={0:'skew_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.84)).reset_index().rename(columns={0:'q4_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.92)).reset_index().rename(columns={0:'q5_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].quantile(q=0.97)).reset_index().rename(columns={0:'q6_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    \n    t2 = time.time()\n    print(name,' Describe Finished... ',t2-t1,' Shape: ',train_data.shape)\n    \n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_var_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_var_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    t2 = time.time()\n    print(control,' FFT Finished...',t2-t1,' Shape: ',train_data.shape)\n    \n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:np.array(get_binary_seq(x[name],start_date,end_date)).std()).reset_index().rename(columns={0:'01seq_std_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:np.array(get_binary_seq(x[name],start_date,end_date)).mean()).reset_index().rename(columns={0:'01seq_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    t2 = time.time()\n    print(control,' 01seq Describe Finished... ',t2-t1,' Shape: ',train_data.shape)\n    \n    # Time decay 4\n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    t2 = time.time()\n    print(control,' Sigma Finished... 
',t2-t1,' Shape: ',train_data.shape)\n    \n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],3))).reset_index().rename(columns={0:'max3_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-get_second_day(x[name],2))).reset_index().rename(columns={0:'max_distance_12_'+control+name+str(end_date-start_date)}).fillna(end_date-start_date),on=['user_id'],how='left')\n    \n    t2 = time.time()\n    \n    print(control,' Max Finished... 
',t2-t1,' Shape: ',train_data.shape)\n    \n    return train_data\n    \ndef get_second_day(now,seq):\n    now = list(now.unique())\n    for i in range(seq-1):\n        if len(now)>1:\n            now.remove(max(now))\n        else:\n            return 0\n    return max(now)\n\ndef get_id_feature(control,name,now,train_data,start_date,end_date):\n    \n    now = get_transform(now,start_date,end_date)\n\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'var_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    \n    return train_data\n\ndef get_diff_feature(control,name,now,train_data,start_date,end_date):\n    \n    now = get_transform(now,start_date,end_date)\n\n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].std()).reset_index().rename(columns={0:'var_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].mean()).reset_index().rename(columns={0:'mean_'+control+name+str(end_date-start_date)}).fillna(-1),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].max()).reset_index().rename(columns={0:'max_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    t2 = time.time()\n    print(control,' Get Diff Feature Finished... Used: ',t2-t1,' Shape: ',train_data.shape)\n    return train_data\n    \ndef HowManyPeopleWatch(df):\n    num_people = len(df['user_id'].unique())\n    return num_people\n\ndef MostHandle(df):\n    most_handle = df.groupby('video_id').size().max()\n    return most_handle\n\ndef FavAuthorCreate(df):\n    most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0]\n    create_video_num = len(df[df['author_id']==most_author]['video_id'].unique())\n    watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique())\n    watch_other_video = 1 if watch_other_video_num>1 else 0\n    return create_video_num , watch_other_video\n\ndef GongXianDu(df):\n    d11 = df.set_index('video_id')\n    d11['gongxian_rate'] = df.groupby('video_id').size() \n    d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times']\n    meand = d11['gongxian_rate'].mean()\n    sumd = d11['gongxian_rate'].sum()\n    stdd = d11['gongxian_rate'].std()\n    skeww = d11['gongxian_rate'].skew()\n    kurtt = d11['gongxian_rate'].kurt()\n    return sumd,meand,stdd,skeww,kurtt\n\ndef get_category_count(name,deal_now,train_data,start_date,end_date):\n    count 
= DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'}))\n    count_size = deal_now.groupby([name]).size().shape[0]\n    sum_data = 0\n    for i in range(0,count_size):\n        new_name = 'see_'+name+'_'+str(i)\n        temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name})\n        train_have = pd.merge(train_data,temp[['user_id',new_name]],on=['user_id'])\n        train_have = train_have[['user_id',new_name]]\n        not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values))\n        train_not_have = DF()\n        train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id']\n        train_not_have['see_'+name+'_'+str(i)] = 0\n        temp = pd.concat([train_have,train_not_have],axis=0)\n        train_data = pd.merge(train_data,temp,on=['user_id'],how='left')\n        sum_data += train_data[new_name].values\n\n    for i in range(0,count_size):\n        new_name = 'see_'+name+'_'+str(i)\n        train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data\n\n    return train_data\n    \ndef get_last_window(now):\n    if now.min()>0:\n        return 1\n    else:\n        return 0\n\ndef parallelize_df_func(df, func, start, end, num_partitions=21, n_jobs=7):\n    df_split = np.array_split(df, num_partitions)\n    start_date = [start] * num_partitions\n    end_date = [end] * num_partitions\n    param_info = zip(df_split, start_date, end_date)\n    pool = Pool(n_jobs)\n    gc.collect()\n    df = pd.concat(pool.map(func, param_info))\n    pool.close()\n    pool.join()\n    gc.collect()\n    return df\n    \ndef get_train(param_info):\n    uid = param_info[0]\n    start_date= param_info[1]\n    end_date= param_info[2]\n    t_start = time.time()\n    \n    t1 = time.time()\n    \n    train_act = check_id(uid,get_transform(action_log,start_date,end_date))\n    train_video = 
check_id(uid,get_transform(video_log,start_date,end_date))\n    train_app = check_id(uid,get_transform(app_log,start_date,end_date))\n    train_reg = register_log[register_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'})\n    \n    # Get Week\n    train_act['week'] = (train_act['day'].values) % 7\n    train_video['week'] = (train_video['day'].values) % 7\n    train_app['week'] = (train_app['day'].values) % 7\n    \n    # Modify Day\n    train_reg['reg_day'] = train_reg['reg_day'] - start_date + 1\n    train_act['day'] = train_act['day'] - start_date + 1\n    train_video['day'] = train_video['day'] - start_date + 1\n    train_app['day'] = train_app['day'] - start_date + 1\n    \n    end_date = end_date-start_date+1\n    true_start = start_date\n    start_date = 1\n    t2 = time.time()\n    \n    print(start_date,' To ',end_date,' Have User: ',len(uid))\n    print('Data Prepare Use...',t2-t1)\n    \n    # Build\n    train_data = DF()\n    train_data['user_id'] = uid # 1 feature\n    \n    # Call once only: repeating this merge would create suffixed duplicate columns and break later lookups\n    train_data = get_time_feature('act_','day',train_act,train_data,start_date,end_date)\n    print('Act Encoder Finished')\n    for i in range(5):\n        page_temp = train_act[train_act['page']==i]\n        train_data = get_type_feature('act_page'+str(i)+'_','day',page_temp,train_data,start_date,end_date,end_date-start_date,'_all')\n    for i in range(6):\n        act_temp = train_act[train_act['action_type']==i]\n        train_data = get_type_feature('act_action'+str(i)+'_','day',act_temp,train_data,start_date,end_date,end_date-start_date,'_all')\n    train_data = get_diff_feature('act_','diff_day',train_act,train_data,start_date,end_date)\n    train_data = get_encoder_feature('act_','diff_day',train_act,train_data,start_date,end_date)\n    train_data = 
pd.merge(train_data,train_act.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'act_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    \n    for i in ['page','action_type','video_id','author_id']: # 4*3 12 feature\n        train_data = get_id_feature('id_act_',i,train_act,train_data,start_date,end_date)\n    print(train_data.shape,' Aci Finished')\n    \n    train_data = get_category_count('page',train_act,train_data,start_date,end_date)\n    train_data = get_category_count('action_type',train_act,train_data,start_date,end_date)\n    print(train_data.shape,' Category Finished')\n    \n    train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date) \n    train_data = get_diff_feature('video_','diff_day',train_video,train_data,start_date,end_date) \n    print(train_data.shape,' Video Finished')\n    \n    train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date) \n    train_data = get_diff_feature('app_','diff_day',train_app,train_data,start_date,end_date)\n    train_data = get_encoder_feature('app_','diff_day',train_app,train_data,start_date,end_date)\n    t1 = time.time()\n    train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'app_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    print(train_data.shape,' APP Finished')\n    train_app['diff2'] = train_app['day']-train_app.groupby(['user_id'])['day'].shift(2).values\n    app_diff = train_app.dropna()\n    app_diff = app_diff.groupby(['user_id'],as_index=False).agg({'diff2':['max','min','mean','std']})\n    app_diff.columns = ['user_id','app_diff2_max','app_diff2_min','app_diff2_mean','app_diff2_std']\n    train_data = pd.merge(train_data,app_diff,on=['user_id'],how='left')\n    t2 = time.time()\n    print('APP DIFF ',t2-t1)\n\n    t1 = time.time()\n    user_feature = 
DF(train_data['user_id'].unique())\n    user_feature.columns = ['user_id']\n    user_feature = user_feature.set_index('user_id')\n    user_feature['HowManyPeople_Watch'] = train_act.groupby('author_id').apply(HowManyPeopleWatch)\n    user_feature['Most_Handle'] = train_act.groupby('user_id').apply(MostHandle)\n    \n    # Total number of times each video was watched\n    video_size = train_act.groupby('video_id').size().reset_index()\n    video_size.columns = ['video_id','video_watched_times']\n    train_act = pd.merge(train_act,video_size,on=['video_id'],how='left')\n\n    # Per-user sum, mean, std, skew and kurtosis of the contribution rate\n    temp = train_act.groupby('user_id').apply(GongXianDu)\n    user_feature['GongXianSum'] = temp.apply(lambda x:x[0])\n    user_feature['GongXianMean'] = temp.apply(lambda x:x[1])\n    user_feature['GongXianStd'] = temp.apply(lambda x:x[2])\n    user_feature['GongXianSkeww'] = temp.apply(lambda x:x[3])\n    user_feature['GongXianKurtt'] = temp.apply(lambda x:x[4])\n    \n    fav_author = train_act.groupby('user_id').apply(FavAuthorCreate)\n    user_feature['FavAuthorCreate'] = fav_author.apply(lambda x:x[0])\n    user_feature['WatchOtherVideo'] = fav_author.apply(lambda x:x[1])\n    train_data = pd.merge(train_data,user_feature.reset_index(),on=['user_id'],how='left')\n    \n    t2 = time.time()\n    print('Use Time: ',t2-t1,' User-Author Finish... 
','Shape:',train_data.shape)\n    \n    train_data = pd.merge(train_data,train_reg[['user_id','register_type','device_type','week','reg_day']],on=['user_id'],how='left').rename(columns={'week':'reg_week'}) # 2\n    t_end = time.time()\n    print('Get Feature Use All Time: ',t_end-t_start,' Shape: ',train_data.shape)\n    gc.collect()\n    \n    return train_data\n\ndef data_prepare(read_path=None):\n    \n    register_log = pd.read_csv(read_path+'user_register_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})\n    action_log = pd.read_csv(read_path+'user_activity_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})\n    app_log = pd.read_csv(read_path+'app_launch_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})\n    video_log = pd.read_csv(read_path+'video_create_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})\n    \n    # Sort By User\n    register_log = register_log.sort_values(by=['user_id','day'],ascending=True)\n    action_log = action_log.sort_values(by=['user_id','day'],ascending=True)\n    app_log = app_log.sort_values(by=['user_id','day'],ascending=True)\n    video_log = video_log.sort_values(by=['user_id','day'],ascending=True)\n\n    # Diff Day\n    t1 = time.time()\n    app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)\n    video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)\n    action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)\n    t2 = time.time()\n    print('Diff Day Finished... 
',t2-t1)\n\n    # Prepare REGISTER\n    register_log['week'] = register_log['day'] % 7\n    register_log['rt_dt'] = (register_log['register_type']+1)*(register_log['device_type']+1)\n    # The weekday column is named 'week' here; 'reg_week' does not exist yet and would raise a KeyError\n    register_log['week_rt'] = (register_log['register_type']+1)*(register_log['week']+1)\n    register_log['week_dt'] = (register_log['device_type']+1)*(register_log['week']+1)\n    register_log['use_reg_people'] = register_log.groupby(['register_type'])['user_id'].transform('count').values\n    register_log['use_dev_people'] = register_log.groupby(['device_type'])['user_id'].transform('count').values\n    register_log['week_rt_use_people'] = register_log.groupby(['week_rt'])['user_id'].transform('count').values\n    register_log['week_dt_use_people'] = register_log.groupby(['week_dt'])['user_id'].transform('count').values\n    register_log['rt_dt_use_people'] = register_log.groupby(['rt_dt'])['user_id'].transform('count').values\n\n    return register_log,action_log,app_log,video_log\n\nread_path = '/mnt/datasets/fusai/'\nregister_log,action_log,app_log,video_log = data_prepare(read_path)\ntrain_set = []\nfor i in range(17,25):\n    train_label = get_label(i,i+6)\n    train_data_part1 = parallelize_df_func(train_label['user_id'], get_train, i-16, i-1, 1, 1)\n    train_data = pd.merge(train_data_part1,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',\n                                                    'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',\n                                                    'rt_dt_use_people']],on=['user_id'],how='left')\n    train_data = pd.merge(train_data,train_label,on=['user_id'],how='left')\n    train_set.append(train_data)\n    del train_data_part1\n    gc.collect()\n\ntrain_data = pd.concat(train_set[0:-1],axis=0).reset_index(drop=True)\nvalid_data = train_set[-1]\nonline_data = parallelize_df_func(register_log['user_id'].unique(), get_train, 15, 30, 1,1)\nonline_data = 
pd.merge(online_data,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',\n                                            'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',\n                                            'rt_dt_use_people']],on=['user_id'],how='left')\n\nwrite_path = '/home/kesci/'\ntrain_data.to_csv(write_path+'train_data.csv',index=False)\nvalid_data.to_csv(write_path+'valid_data.csv',index=False)\nonline_data.to_csv(write_path+'online_data.csv',index=False)\nprint('Style 1 Feature Engineer Finished...')\n\n"
  },
  {
    "path": "feature_engineer/feature_engineer2.py",
    "content": "# Variable-length sliding-window feature engineering: the window step is 1 day, computed on the full dataset\n\nimport numpy as np\nimport pandas as pd\nfrom pandas import DataFrame as DF\nimport gc\nfrom multiprocessing import Pool\nfrom sklearn.preprocessing import LabelEncoder\nimport time\nfrom scipy import stats\n\nregister_log = pd.read_csv('/mnt/datasets/fusai/user_register_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})\naction_log = pd.read_csv('/mnt/datasets/fusai/user_activity_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})\napp_log = pd.read_csv('/mnt/datasets/fusai/app_launch_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})\nvideo_log = pd.read_csv('/mnt/datasets/fusai/video_create_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})\nregister_log = register_log.sort_values(by=['user_id','day'],ascending=True)\naction_log = action_log.sort_values(by=['user_id','day'],ascending=True)\napp_log = app_log.sort_values(by=['user_id','day'],ascending=True)\nvideo_log = video_log.sort_values(by=['user_id','day'],ascending=True)\nregister_log['week'] = register_log['day'] % 7\nt1 = time.time()\napp_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1)\napp_log['diff_day'] = app_log['diff_day'].astype(np.int8)\nt2 = time.time()\nprint('Diff APP Finished... ',t2-t1)\nt1 = time.time()\nvideo_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1)\nvideo_log['diff_day'] = video_log['diff_day'].astype(np.int8)\nt2 = time.time()\nprint('Diff Video Finished... 
',t2-t1)\nt1 = time.time()\naction_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1)\naction_log['diff_day'] = action_log['diff_day'].astype(np.int8)\nt2 = time.time()\nprint('Diff Act Finished... ',t2-t1)\n\ndef reduce_mem_usage(props):\n    # Compute the current memory usage\n    start_mem_usg = props.memory_usage().sum() / 1024 ** 2\n    print(\"Memory usage of the dataframe is :\", start_mem_usg, \"MB\")\n    \n    NAlist = []\n    for col in props.columns:\n        if (props[col].dtypes != object):\n            isInt = False\n            mmax = props[col].max()\n            mmin = props[col].min()\n            if not np.isfinite(props[col]).all():\n                NAlist.append(col)\n                props[col].fillna(-1, inplace=True) \n                props[col].replace(np.inf,-1,inplace=True)\n            asint = props[col].fillna(-1).astype(np.int64)\n            result = np.fabs(props[col] - asint)\n            result = result.sum()\n            if result < 0.01: \n                isInt = True\n            if isInt:\n                if mmin >= 0: \n                    if mmax <= 255:\n                        props[col] = props[col].astype(np.uint8)\n                    elif mmax <= 65535:\n                        props[col] = props[col].astype(np.uint16)\n                    elif mmax <= 4294967295:\n                        props[col] = props[col].astype(np.uint32)\n                    else:\n                        props[col] = props[col].astype(np.uint64)\n                else: \n                    if mmin > np.iinfo(np.int8).min and mmax < np.iinfo(np.int8).max:\n                        props[col] = props[col].astype(np.int8)\n                    elif mmin > np.iinfo(np.int16).min and mmax < np.iinfo(np.int16).max:\n                        props[col] = props[col].astype(np.int16)\n                    elif mmin > np.iinfo(np.int32).min and mmax < np.iinfo(np.int32).max:\n                        props[col] = props[col].astype(np.int32)\n                   
 elif mmin > np.iinfo(np.int64).min and mmax < np.iinfo(np.int64).max:\n                        props[col] = props[col].astype(np.int64)  \n            else: \n                props[col] = props[col].astype(np.float16)        \n    mem_usg = props.memory_usage().sum() / 1024**2 \n    print(\"This is \",100*mem_usg/start_mem_usg,\"% of the initial size\")\n    return props\n\ndef get_transform(now,start_date,end_date):\n    get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)]\n    return get_trans\n\ndef get_label(start_date,end_date):\n    merge_name = ['user_id','day']\n    all_log = pd.concat([action_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0)\n    train_label = get_transform(all_log,start_date,end_date)\n    train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'})\n    train_1['label'] = 1\n    reg_temp = get_transform(register_log,1,start_date-1)\n    train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])]\n    train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'})\n    train_0['label'] = 0\n    del train_label\n    gc.collect()\n    return pd.concat([train_1,train_0],axis=0) \n\ndef check_id(uid,now):\n    return now[now['user_id'].isin(uid)]\n    \ndef get_mode(now):\n    # Series.mode() returns the modes in ascending order; take the smallest one, which matches scipy.stats.mode\n    return now.mode().iloc[0]\n\ndef get_binary_seq(now,start_date,end_date): \n    day = list(range(1,end_date-start_date+2))\n    binary_day = []\n    now_uni = now.unique()\n    for i in day:\n        if i in now_uni:\n            binary_day.append(1)\n        else:\n            binary_day.append(0)\n    return binary_day\n    \ndef get_binary1(now,start_date,end_date): # Boss Feature\n    ans = 0\n    binary_day = get_binary_seq(now,start_date,end_date)\n    for i in range(len(binary_day)):\n        ans += binary_day[i]*(2**i)\n    return ans\n\ndef get_binary2(now,start_date,end_date): # Boss Feature\n    ans = 0\n    binary_day = 
get_binary_seq(now,start_date,end_date)\n    for i in range(len(binary_day)):\n        ans += binary_day[i]*(1/(end_date-i))\n    return ans\n\ndef get_time_log_weight_sigma(now,start_date,end_date):\n    window_len = end_date+1-start_date\n    ans = np.zeros(window_len)\n    sigma_ans = 0\n    for i in now:\n        ans[(i-1)%window_len] += 1\n    for i in range(window_len):\n        if ans[i]!=0:\n            sigma_ans += np.log(ans[i]/(window_len-i))\n    return sigma_ans\n\ndef get_max_count(x,name):\n    x_max = x[name].max()\n    if x_max>0:\n        # Number of records whose value equals the maximum (e.g. records on the last active day)\n        return x[name].value_counts()[x_max]\n    else:\n        return np.nan\n\ndef get_max_movie(x):\n    x_max = x['day'].max()\n    if x_max>0:\n        x = x[x['day']==x_max]\n        return x['video_id'].nunique()\n    else:\n        return np.nan\n\ndef get_type_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):\n\n    now = get_transform(now,start_date,end_date)\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,name)).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:get_max_movie(x)).reset_index().rename(columns={0:'nunique_video_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    \n    return train_data\n\ndef get_time_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):\n    \n    now = get_transform(now,start_date,end_date)\n\n    # Descriptive statistics features (6)\n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_all_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()/gap).reset_index().rename(columns={0:'count_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data['nunique / count' + control + name + gap_name] = train_data['nunique_all_'+control+name+gap_name] / train_data['count_'+control+name+gap_name]\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].min()-start_date)).reset_index().rename(columns={0:'min-start_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].min())).reset_index().rename(columns={0:'end-min_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_max_count(x,name)).reset_index().rename(columns={0:'max_count_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.75)/gap).reset_index().rename(columns={0:'q2_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].quantile(q=0.84)/gap).reset_index().rename(columns={0:'q3_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].quantile(q=0.96)/gap).reset_index().rename(columns={0:'q4_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+gap_name}),on=['user_id'],how='left')\n    \n    t2 = time.time()\n    print(name,' Describe Finished... ',t2-t1,' Shape: ',train_data.shape)\n    \n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary1(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'encoder1_01seq'+control+name+'_'+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary2(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'encoder2_01seq'+control+name+'_'+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)/gap).reset_index().rename(columns={0:'LogSigma_'+control+name+'_'+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.std(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_var_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_mean_'+control+name+gap_name}),on=['user_id'],how='left')\n    t2 = time.time()\n    print(control,' Sigma Finished... 
',t2-t1,' Shape: ',train_data.shape)\n    \n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-x[name].max())).reset_index().rename(columns={0:'max1_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],2))).reset_index().rename(columns={0:'max2_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(end_date-get_second_day(x[name],3))).reset_index().rename(columns={0:'max3_'+control+name+gap_name}).fillna(end_date-start_date),on=['user_id'],how='left')\n    \n    t2 = time.time()\n    \n    print(control,' Max Finished... ',t2-t1,' Shape: ',train_data.shape)\n    \n    return train_data\n\ndef get_second_day(now,seq):\n    now = list(now.unique())\n    for i in range(seq-1):\n        if len(now)>1:\n            now.remove(max(now))\n        else:\n            return np.nan\n    return max(now)\n\ndef get_id_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):\n    \n    now = get_transform(now,start_date,end_date)\n\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()/gap).reset_index().rename(columns={0:'count_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'var_'+control+name+gap_name}).fillna(0),on=['user_id'],how='left')\n    \n    return train_data\n\ndef get_diff_feature(control,name,now,train_data,start_date,end_date,gap,gap_name):\n    \n    now = 
get_transform(now,start_date,end_date)\n\n    t1 = time.time()\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].count()).reset_index().rename(columns={0:'count_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'nunique_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].std()).reset_index().rename(columns={0:'var_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'mean_'+control+name+gap_name}).fillna(-1),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].max()).reset_index().rename(columns={0:'max_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_mode(x[name])).reset_index().rename(columns={0:'mode_'+control+name+gap_name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'min_'+control+name+gap_name}),on=['user_id'],how='left')\n    t2 = time.time()\n    # print(control,' Get Diff Feature Finished... 
Used: ',t2-t1,' Shape: ',train_data.shape)\n    return train_data\n\ndef get_category_count(name,deal_now,train_data,start_date,end_date):\n    count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'}))\n    count_size = deal_now.groupby([name]).size().shape[0]\n    sum_data = 0\n    for i in range(0,count_size):\n        new_name = 'see_'+name+'_'+str(i)\n        temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name})\n        train_have = pd.merge(train_data,temp[['user_id',new_name]],on=['user_id'])\n        train_have = train_have[['user_id',new_name]]\n        not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values))\n        train_not_have = DF()\n        train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id']\n        train_not_have['see_'+name+'_'+str(i)] = 0\n        temp = pd.concat([train_have,train_not_have],axis=0)\n        train_data = pd.merge(train_data,temp,on=['user_id'],how='left')\n        sum_data += train_data[new_name].values\n\n    for i in range(0,count_size):\n        new_name = 'see_'+name+'_'+str(i)\n        train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data\n\n    return train_data\n\nfrom multiprocessing import Pool\ndef parallelize_df_func(df, func, start, end, num_partitions=40, n_jobs=4):\n    df_split = np.array_split(df, num_partitions)\n    start_date = [start] * num_partitions\n    end_date = [end] * num_partitions\n    param_info = zip(df_split, start_date, end_date)\n    pool = Pool(n_jobs)\n    gc.collect()\n    df = pd.concat(pool.map(func, param_info))\n    pool.close()\n    pool.join()\n    return df\n    \ndef get_train(param_info):\n    uid = param_info[0]\n    start_date= param_info[1]\n    end_date= param_info[2]\n    t_start = time.time()\n    \n    t1 = time.time()\n    \n    train_act = check_id(uid,get_transform(action_log,start_date,end_date))\n    
train_video = check_id(uid,get_transform(video_log,start_date,end_date))\n    train_app = check_id(uid,get_transform(app_log,start_date,end_date))\n    \n    # Get Week\n    train_act['week'] = (train_act['day'].values) % 7\n    train_video['week'] = (train_video['day'].values) % 7\n    train_app['week'] = (train_app['day'].values) % 7\n    \n    # Modify Day\n    train_act['day'] = train_act['day'] - start_date + 1\n    train_video['day'] = train_video['day'] - start_date + 1\n    train_app['day'] = train_app['day'] - start_date + 1\n    \n    end_date = end_date-start_date+1\n    true_start = start_date\n    start_date = 1\n    \n    train_reg = register_log[register_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'})\n    train_act = pd.merge(train_act,train_reg[['user_id','reg_day']],on=['user_id'],how='left')\n    train_video = pd.merge(train_video,train_reg[['user_id','reg_day']],on=['user_id'],how='left')\n    train_app = pd.merge(train_app,train_reg[['user_id','reg_day']],on=['user_id'],how='left')\n    \n    del train_act['reg_day']\n    del train_video['reg_day']\n    del train_app['reg_day']\n    gc.collect()\n    \n    t2 = time.time()\n    \n    print(start_date,' To ',end_date,' Have User: ',len(uid))\n    print('Data Prepare Use...',t2-t1)\n    \n    # Build\n    train_data = DF()\n    train_data['user_id'] = uid # 1 feature\n    \n    train_data = pd.merge(train_data,train_act.groupby(['user_id']).size().reset_index().rename(columns={0:'action_all_times'}),on=['user_id'],how='left').fillna(0)\n    for i in range(5):\n        page_temp = train_act[train_act['page']==i]\n        train_data = get_type_feature('act_page'+str(i),'day',page_temp,train_data,start_date,end_date,end_date-start_date,'_all')\n    \n    train_data = get_time_feature('act_','day',train_act,train_data,start_date,end_date,end_date-start_date,'_all')\n    train_data = get_diff_feature('act_','diff_day',train_act,train_data,start_date,end_date,end_date-start_date,'_all')\n   
 train_data = pd.merge(train_data,train_act.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'act_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    \n    for i in ['page','action_type','video_id','author_id']: # 4*3 12 feature\n        train_data = get_id_feature('id_act_',i,train_act,train_data,start_date,end_date,end_date-start_date,'_all')\n        \n    train_data = get_category_count('page',train_act,train_data,start_date,end_date)\n    train_data = get_category_count('action_type',train_act,train_data,start_date,end_date)\n    train_data = get_category_count('page',train_act,train_data,end_date-3,end_date)\n    train_data = get_category_count('action_type',train_act,train_data,end_date-3,end_date)\n\n    train_data = reduce_mem_usage(train_data)\n    train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date,end_date-start_date,'_all')  \n    train_data = get_diff_feature('video_','diff_day',train_video,train_data,start_date,end_date,end_date-start_date,'_all') \n    train_data = reduce_mem_usage(train_data)\n    train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date,end_date-start_date,'_all') \n    train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date,7,'_7') \n    train_data = get_diff_feature('app_','diff_day',train_app,train_data,start_date,end_date,end_date-start_date,'_all')\n    train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:get_mode(x['week'])).reset_index().rename(columns={0:'app_mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    \n    train_data = reduce_mem_usage(train_data)\n    train_data = pd.merge(train_data,register_log[['user_id','register_type','device_type','week','day']],on=['user_id'],how='left').rename(columns={'week':'reg_week','day':'reg_day'}) # 2\n    train_data['rt_dt'] = 
(train_data['register_type']+1)*(train_data['device_type']+1)\n    train_data['week_rt'] = (train_data['register_type']+1)*(train_data['reg_week']+1)\n    train_data['week_dt'] = (train_data['device_type']+1)*(train_data['reg_week']+1)\n    t_end = time.time()\n    print('Get Feature Use All Time: ',t_end-t_start)\n    train_data = reduce_mem_usage(train_data)\n    \n    return train_data\n\ndef data_prepare(read_path=None):\n    \n    register_log = pd.read_csv(read_path+'user_register_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint16,3:np.uint16}).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})\n    action_log = pd.read_csv(read_path+'user_activity_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8,2:np.uint8,3:np.uint32,4:np.uint32,5:np.uint8}).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})\n    app_log = pd.read_csv(read_path+'app_launch_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})\n    video_log = pd.read_csv(read_path+'video_create_log.txt',sep='\\t',header=None,dtype={0:np.uint32,1:np.uint8}).rename(columns={0:'user_id',1:'day'})\n    \n    # Sort By User\n    register_log = register_log.sort_values(by=['user_id','day'],ascending=True)\n    action_log = action_log.sort_values(by=['user_id','day'],ascending=True)\n    app_log = app_log.sort_values(by=['user_id','day'],ascending=True)\n    video_log = video_log.sort_values(by=['user_id','day'],ascending=True)\n\n    # Diff Day\n    t1 = time.time()\n    app_log['diff_day'] = app_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)\n    video_log['diff_day'] = video_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)\n    action_log['diff_day'] = action_log.groupby(['user_id'])['day'].diff().fillna(-1).astype(np.int8)\n    t2 = time.time()\n    print('Diff Day Finished... 
',t2-t1)\n\n    # Prepare REGISTER\n    register_log['week'] = register_log['day'] % 7\n    register_log['rt_dt'] = (register_log['register_type']+1)*(register_log['device_type']+1)\n    register_log['week_rt'] = (register_log['register_type']+1)*(register_log['week']+1)\n    register_log['week_dt'] = (register_log['device_type']+1)*(register_log['week']+1)\n    register_log['use_reg_people'] = register_log.groupby(['register_type'])['user_id'].transform('count').values\n    register_log['use_dev_people'] = register_log.groupby(['device_type'])['user_id'].transform('count').values\n    register_log['week_rt_use_people'] = register_log.groupby(['week_rt'])['user_id'].transform('count').values\n    register_log['week_dt_use_people'] = register_log.groupby(['week_dt'])['user_id'].transform('count').values\n    register_log['rt_dt_use_people'] = register_log.groupby(['rt_dt'])['user_id'].transform('count').values\n\n    return register_log,action_log,app_log,video_log\n\nread_path = '/mnt/datasets/fusai/'\nregister_log,action_log,app_log,video_log = data_prepare(read_path)\ntrain_set = []\nfor i in range(17,25):\n    train_label = get_label(i,i+6)\n    train_data_part1 = parallelize_df_func(train_label['user_id'], get_train, 1, i-1, 1, 1)\n    train_data = pd.merge(train_data_part1,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',\n                                                    'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',\n                                                    'rt_dt_use_people']],on=['user_id'],how='left')\n    train_data = pd.merge(train_data,train_label,on=['user_id'],how='left')\n    train_set.append(train_data)\n    del train_data_part1\n    gc.collect()\n\ntrain_data = pd.concat(train_set[0:-1],axis=0).reset_index(drop=True)\nvalid_data = train_set[-1]\nonline_data = parallelize_df_func(register_log['user_id'].unique(), get_train, 1, 30, 1,1)\nonline_data = 
pd.merge(online_data,register_log[['user_id','use_reg_people','week','register_type','device_type','rt_dt',\n                                            'week_rt','week_dt','use_dev_people','week_rt_use_people','week_dt_use_people',\n                                            'rt_dt_use_people']],on=['user_id'],how='left')\n\nwrite_path = '/home/kesci/'\ntrain_data.to_csv(write_path+'train_data.csv',index=False)\nvalid_data.to_csv(write_path+'valid_data.csv',index=False)\nonline_data.to_csv(write_path+'online_data.csv',index=False)\nprint('Style 2 Feature Engineer Finished...')\n"
  },
  {
    "path": "feature_engineer/get_feature.py",
    "content": "import numpy as np\nimport pandas as pd \nimport lightgbm as lgb\nimport xgboost as xgb\nimport catboost as cbt\nfrom pandas import DataFrame as DF\nimport gc\nfrom sklearn.preprocessing import LabelEncoder\nimport time\nimport networkx as nx\nfrom sklearn.cluster import MeanShift,KMeans\n# 1. Train on the leaderboard A and B data separately to get models A and B. Then use model A to predict on all of the B features, append that prediction (Model-A) to Feature-B as one extra column, and retrain Model-B on the augmented data.\n# 2. Try re-encoding User-id and merging the A and B data into Merge-AB, then extract features and train on that (it is unknown whether Video-id and Author-id are shuffled between the two sets).\n\nreg_log_a = pd.read_csv('data/a/user_register_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})\naci_log_a = pd.read_csv('data/a/user_activity_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})\napp_log_a = pd.read_csv('data/a/app_launch_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day'})\nvideo_log_a = pd.read_csv('data/a/video_create_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day'})\n\nreg_log_b = pd.read_csv('data/b/user_register_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day',2:'register_type',3:'device_type'})\naci_log_b = pd.read_csv('data/b/user_activity_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day',2:'page',3:'video_id',4:'author_id',5:'action_type'})\napp_log_b = pd.read_csv('data/b/app_launch_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day'})\nvideo_log_b = pd.read_csv('data/b/video_create_log.txt',sep='\\t',header=None).rename(columns={0:'user_id',1:'day'})\n\nreg_log = pd.concat([reg_log_a,reg_log_b],axis=0).reset_index(drop=True)\naci_log = pd.concat([aci_log_a,aci_log_b],axis=0).reset_index(drop=True)\napp_log = pd.concat([app_log_a,app_log_b],axis=0).reset_index(drop=True)\nvideo_log = pd.concat([video_log_a,video_log_b],axis=0).reset_index(drop=True)\n\nreg_log.to_csv('data/user_register_log.txt',sep=' 
',header=False,index=False)\naci_log.to_csv('data/user_activity_log.txt',sep=' ',header=False,index=False)\napp_log.to_csv('data/app_launch_log.txt',sep=' ',header=False,index=False)\nvideo_log.to_csv('data/video_create_log.txt',sep=' ',header=False,index=False)\n\nprint('Persistence Finished...')\n\n# Get Week\nreg_log = reg_log[reg_log['device_type']!=1]\nreg_log['week'] = reg_log['day'] % 7\n\ndef get_transform(now,start_date,end_date):\n    get_trans = now[(now['day']>=start_date) & (now['day']<=end_date)]\n    return get_trans\n\ndef get_label(start_date,end_date):\n    merge_name = ['user_id','day']\n    all_log = pd.concat([aci_log[merge_name],app_log[merge_name],video_log[merge_name]],axis=0)\n    train_label = get_transform(all_log,start_date,end_date)\n    train_1 = DF(list(set(train_label['user_id']))).rename(columns={0:'user_id'})\n    train_1['label'] = 1\n    reg_temp = get_transform(reg_log,start_date-16,start_date-1)\n    train_1 = train_1[train_1['user_id'].isin(reg_temp['user_id'])]\n    train_0 = DF(list(set(reg_temp['user_id'])-set(train_1['user_id']))).rename(columns={0:'user_id'})\n    train_0['label'] = 0\n    del train_label\n    gc.collect()\n    return pd.concat([train_1,train_0],axis=0) \n\ndef check_id(uid,now):\n    return now[now['user_id'].isin(uid)]\n\ndef get_category_count(name,deal_now,train_data,start_date,end_date):\n    count = DF(deal_now.groupby(['user_id',name]).size().reset_index().rename(columns={0:'times'}))\n    count_size = aci_log.groupby([name]).size().shape[0]\n    sum_data = 0\n    for i in range(0,count_size):\n        new_name = 'see_'+name+'_'+str(i)\n        temp = pd.merge(train_data,count[count[name]==i],on=['user_id']).rename(columns={'times':new_name})\n        train_have = pd.merge(train_data,temp[['user_id',new_name]],on=['user_id'])\n        train_have = train_have[['user_id',new_name]]\n        not_have_name = list(set(train_data['user_id'].values)-set(train_have['user_id'].values))\n        train_not_have = 
DF()\n        train_not_have['user_id'] = train_data[train_data['user_id'].isin(not_have_name)]['user_id']\n        train_not_have['see_'+name+'_'+str(i)] = 0\n        temp = pd.concat([train_have,train_not_have],axis=0)\n        train_data = pd.merge(train_data,temp,on=['user_id'],how='left')\n        train_data = pd.merge(train_data,deal_now[deal_now[name]==i].groupby(['user_id']).apply(lambda x:get_binary(x['day'],start_date,end_date)).reset_index().rename(columns={0:'binary_'+str(i)+'_'+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n        train_data = pd.merge(train_data,deal_now[deal_now[name]==i].groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x['day'],start_date,end_date)).reset_index().rename(columns={0:'get_log_sigma_'+str(i)+'_'+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n        \n        sum_data += train_data[new_name].values\n\n    for i in range(0,count_size):\n        new_name = 'see_'+name+'_'+str(i)\n        train_data[new_name+'_ratio'] = train_data[new_name].values/sum_data\n\n    return train_data\n\ndef get_binary_seq(now,start_date,end_date): \n    day = list(range(start_date,end_date+1))\n    ans1 = 0\n    binary_day = []\n    for i in day:\n        if i in now.unique():\n            binary_day.append(1)\n        else:\n            binary_day.append(0)\n    return binary_day\n\ndef get_binary(now,start_date,end_date): # Boss Feature\n    ans = 0\n    binary_day = get_binary_seq(now,start_date,end_date)\n    for i in range(len(binary_day)):\n        ans += binary_day[i]*(2**i)\n    return ans\n\ndef get_binary_mol7(now,start_date,end_date):\n    ans = 0\n    binary_day = get_binary_seq(now,start_date,end_date)\n    for i in range(len(binary_day)):\n        ans += binary_day[i]*(2**(i%7))\n    return ans\n\ndef get_time_log_weight_sigma(now,start_date,end_date):\n    window_len = end_date+1-start_date\n    ans = np.zeros(window_len)\n    sigma_ans = 0\n    for i in now:\n        
ans[(i-1)%window_len] += 1\n    for i in range(window_len):\n        if ans[i]!=0:\n            sigma_ans += np.log(ans[i]/(window_len-i))\n    return sigma_ans\n\ndef get_time_weight_sigma(now,start_date,end_date):\n    window_len = end_date+1-start_date\n    ans = np.zeros(window_len)\n    sigma_ans = 0\n    for i in now:\n        ans[(i-1)%window_len] += 1\n    for i in range(window_len):\n        sigma_ans += ans[i]*(i+1)\n    return sigma_ans\n\ndef get_id_feature(control,name,now,train_data,start_date,end_date):\n    if end_date<start_date:\n        return train_data\n    \n    now = get_transform(now,start_date,end_date)\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'count_all_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'see_var_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'see_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].median()).reset_index().rename(columns={0:'see_median_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mad()).reset_index().rename(columns={0:'see_mad_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].skew()).reset_index().rename(columns={0:'see_skew_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n   \n    return train_data\n\ndef get_time_feature(control,name,now,train_data,start_date,end_date):\n    if 
end_date<start_date:\n        return train_data\n    \n    now = get_transform(now,start_date,end_date)\n\n    t1 = time.time()\n    # Descriptive statistics features: 10\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].nunique()).reset_index().rename(columns={0:'count_all_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].var()).reset_index().rename(columns={0:'see_var_'+control+name+str(end_date-start_date)}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mean()).reset_index().rename(columns={0:'see_mean_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].median()).reset_index().rename(columns={0:'see_median_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].mad()).reset_index().rename(columns={0:'see_mad_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].skew()).reset_index().rename(columns={0:'see_skew_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].kurt()).reset_index().rename(columns={0:'see_kurt_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].min()).reset_index().rename(columns={0:'see_min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].max()).reset_index().rename(columns={0:'see_max_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:(x[name].max()-x[name].min())).reset_index().rename(columns={0:'see_max-min_'+control+name+str(end_date-start_date)}),on=['user_id'],how='left')\n    t2 = time.time()\n    print('Describe Feature Finished... Used: ',t2-t1)\n    \n    t1 = time.time()\n    # First-order and second-order differences: 11\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().var()).reset_index().rename(columns={0:'diff_seq_var_'+control+name}).fillna(0),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().mean()).reset_index().rename(columns={0:'diff_seq_mean_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().median()).reset_index().rename(columns={0:'diff_seq_median_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().mad()).reset_index().rename(columns={0:'diff_seq_mad_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().skew()).reset_index().rename(columns={0:'diff_seq_skew_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().kurt()).reset_index().rename(columns={0:'diff_seq_kurt_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().min()).reset_index().rename(columns={0:'diff_seq_min_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda 
x:x[name].diff().max()).reset_index().rename(columns={0:'diff_seq_max_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().max()-x[name].sort_values().diff().min()).reset_index().rename(columns={0:'diff_seq_max_gap_min_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().diff().min()).reset_index().rename(columns={0:'diff2_min_gap'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:x[name].diff().diff().max()).reset_index().rename(columns={0:'diff2_max_gap'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    t2 = time.time()\n    print('Diff Feature Finished... Used: ',t2-t1)\n\n    t1 = time.time()\n    # FFT ori 5\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_var_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_mean_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.median(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_median_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.max(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_mad_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.min(np.fft.fft(x[name])))).reset_index().rename(columns={0:'fft_seq_skew_'+control+name}),on=['user_id'],how='left') \n    t2 = time.time()\n    print('FFT LAST ABS Feature Finished... 
Used: ',t2-t1)\n\n    t1 = time.time()\n    # FFT 01\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.var(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_var_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.mean(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_mean_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.median(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_median_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.max(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_mad_'+control+name}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:abs(np.min(np.fft.fft(get_binary_seq(x[name],start_date,end_date))))).reset_index().rename(columns={0:'fft_01seq_skew_'+control+name}),on=['user_id'],how='left') \n    t2 = time.time()\n    print('FFT FIRST 01 ABS Feature Finished... 
Used: ',t2-t1)\n\n    t1 = time.time()\n    # Time decay: 4\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary(x[name],start_date,end_date)).reset_index().rename(columns={0:'binary_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_binary_mol7(x[name],start_date,end_date)).reset_index().rename(columns={0:'binary_mol7'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_log_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'get_log_sigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    train_data = pd.merge(train_data,now.groupby(['user_id']).apply(lambda x:get_time_weight_sigma(x[name],start_date,end_date)).reset_index().rename(columns={0:'get_sigma_'+control+name+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    t2 = time.time()\n    print('SIGMA Feature Finished... 
Used: ',t2-t1)\n\n    return train_data\n    \ndef HowManyPeopleWatch(df):\n    num_people = len(df['user_id'].unique())\n    return num_people\n\ndef MostHandle(df):\n    most_handle = df.groupby('video_id').size().max()\n    return most_handle\n\ndef FavAuthorCreate(df):\n    most_author = df.groupby('author_id').size().sort_values(ascending=False).index[0]\n    create_video_num = len(df[df['author_id']==most_author]['video_id'].unique())\n    watch_other_video_num = len(df[df['author_id']==most_author]['video_id'].unique())\n    watch_other_video = 1 if watch_other_video_num>1 else 0\n    return create_video_num , watch_other_video\n\ndef GongXianDu(df):\n    d11 = df.set_index('video_id')\n    d11['gongxian_rate'] = df.groupby('video_id').size() \n    d11['gongxian_rate'] = d11['gongxian_rate'] / d11['video_watched_times']\n    meand = d11['gongxian_rate'].mean()\n    sumd = d11['gongxian_rate'].sum()\n    stdd = d11['gongxian_rate'].std()\n    skeww = d11['gongxian_rate'].skew()\n    kurtt = d11['gongxian_rate'].kurt()\n    return sumd,meand,stdd,skeww,kurtt\n\ndef get_lx_day(now):\n    k1 = np.array(now)\n    k2 = np.where(np.diff(k1)==1)[0]\n    i = 0\n    ans = []\n    while i<len(k2)-1:\n        l1 = 1\n        while k2[i+1]-k2[i]==1:\n            l1 += 1\n            i += 1\n            if i == len(k2)-1:\n                break\n        if l1 == 1:\n            i += 1\n            ans.append(2)\n        else:\n            ans.append(l1+1)\n    if len(k2)==1:\n        ans.append(2)\n    return ans\n\ndef get_lx_day_feature(name,train_data,now):\n    lx_now = now.groupby(['user_id']).apply(lambda x:get_lx_day(x['day'].sort_values(ascending=True).unique())).reset_index().rename(columns={0:'lx_day'})\n    lx_collect = {\n        name+'lx_count_len' : [],\n        name+'lx_max' : [],\n        name+'lx_min' : [],\n        name+'lx_var' : []\n    }\n\n    if (lx_now.shape[0]==0) | (lx_now.shape[1]==0):\n        
lx_collect[name+'lx_count_len'].append(0)\n        lx_collect[name+'lx_max'].append(0)\n        lx_collect[name+'lx_min'].append(0)\n        lx_collect[name+'lx_var'].append(-1)\n    else:\n        for i in lx_now['lx_day'].values:\n            lx_collect[name+'lx_count_len'].append(len(i))\n            lx_collect[name+'lx_max'].append(np.max(i))\n            lx_collect[name+'lx_min'].append(np.min(i))\n            lx_collect[name+'lx_var'].append(np.var(i))\n    lx_collect = DF(lx_collect)\n    lx_collect['user_id'] = lx_now['user_id']\n    train_data = pd.merge(train_data,lx_collect,on=['user_id'],how='left')\n    return train_data\n\ndef get_only_user_author_graph_feature(now,name,train_data):\n    G = nx.DiGraph()\n    need_to = ['user_id','author_id']\n    to_make = now[need_to].drop_duplicates()\n    for edge in to_make[need_to].values:\n        G.add_edge(edge[0],edge[1])\n    \n    pr = nx.pagerank(G,alpha=0.8)\n    G_degree1 = DF(dict(G.degree),index=['G_degree'+name]).T.reset_index().rename(columns={'index':'user_id'}).fillna(0)\n    G_indegree1 = DF(dict(G.in_degree),index=['G_indegree'+name]).T.reset_index().rename(columns={'index':'user_id'}).fillna(0)\n    G_pr1 = DF(pr,index=['G_pagerank'+name]).T.reset_index().rename(columns={'index':'user_id'}).sort_values(by=['G_pagerank'+name]).fillna(0)\n   \n    train_data = pd.merge(train_data,G_degree1,on=['user_id'],how='left')\n    train_data = pd.merge(train_data,G_indegree1,on=['user_id'],how='left')\n    train_data = pd.merge(train_data,G_pr1,on=['user_id'],how='left')\n    return train_data\n\ndef get_train(uid,start_date,end_date):\n    \n    t_start = time.time()\n    \n    t1 = time.time()\n    \n    train_act = check_id(uid,get_transform(aci_log,start_date,end_date))\n    train_video = check_id(uid,get_transform(video_log,start_date,end_date))\n    train_app = check_id(uid,get_transform(app_log,start_date,end_date))\n    \n    # Get Week\n    train_act['week'] = (train_act['day'].values) % 7\n    
train_video['week'] = (train_video['day'].values) % 7\n    train_app['week'] = (train_app['day'].values) % 7\n    \n    train_reg = reg_log[reg_log['user_id'].isin(uid)].rename(columns={'day':'reg_day'})\n    train_act = pd.merge(train_act,train_reg[['user_id','reg_day']],on=['user_id'],how='left')\n    train_video = pd.merge(train_video,train_reg[['user_id','reg_day']],on=['user_id'],how='left')\n    train_app = pd.merge(train_app,train_reg[['user_id','reg_day']],on=['user_id'],how='left')\n    \n    train_act['aci_distance_from_reg'] = train_act['day'] - train_act['reg_day']\n    train_video['video_distance_from_reg'] = train_video['day'] - train_video['reg_day']\n    train_app['app_distance_from_reg'] = train_app['day'] - train_app['reg_day']\n\n    del train_act['reg_day']\n    del train_video['reg_day']\n    del train_app['reg_day']\n    gc.collect()\n    \n    train_act = train_act.sort_values(by=['user_id','day'],ascending=True)\n    train_app = train_app.sort_values(by=['user_id','day'],ascending=True)\n    train_video = train_video.sort_values(by=['user_id','day'],ascending=True)\n    t2 = time.time()\n    \n    print(start_date,' To ',end_date,' Have User: ',len(uid))\n    print('Data Prepare Use...',t2-t1)\n    \n    # Build\n    train_data = DF()\n    train_data['user_id'] = uid # 1 feature\n    \n    # Total clicks / total action types per user # 36 feature \n    t1 = time.time()\n    \n    train_act['at_page'] = (train_act['action_type']+100)*(train_act['page']+1)\n    train_data = pd.merge(train_data,train_act.groupby(['user_id']).size().reset_index().rename(columns={0:'action_all_times'}),on=['user_id'],how='left').fillna(0)\n    # Time Window  4 + 30 feature\n    train_data = get_lx_day_feature('aci_',train_data,train_act)\n    train_data = get_time_feature('aci_','day',train_act,train_data,start_date,end_date) \n    train_data = get_time_feature('aci_','aci_distance_from_reg',train_act,train_data,start_date,end_date) # 30\n    for i in ['page','action_type','at_page']: # 
4*5*6 120feature\n        train_data = get_id_feature('aci_',i,train_act,train_data,start_date,end_date)\n\n    t2 = time.time()\n    print('Use Time: ',t2-t1,' Aci Finish... ','Shape: ',train_data.shape)\n    \n    # Per-user distribution of clicks across each Category: 17 feature\n    t1 = time.time() \n    train_data = get_category_count('page',train_act,train_data,start_date,end_date)\n    train_data = get_category_count('action_type',train_act,train_data,start_date,end_date)\n    t2 = time.time()\n    print('Use Time: ',t2-t1,' Category Finish... ','Shape: ',train_data.shape)\n    \n    # Features from Video_log  27 feature\n    t1 = time.time()\n    \n    train_data = get_lx_day_feature('video_',train_data,train_video)\n    train_data = get_time_feature('video_','day',train_video,train_data,start_date,end_date) \n    train_data = pd.merge(train_data,train_video.groupby(['user_id']).apply(lambda x:((x['day'].mode().values[0])%7)).reset_index().rename(columns={0:'mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    \n    t2 = time.time()\n    print('Use Time: ',t2-t1,' Video Finish... ','Shape: ',train_data.shape)\n    \n    # Features from App_log  27 feature\n    t1 = time.time()\n    \n    train_data = get_lx_day_feature('app_',train_data,train_app)\n    train_data = get_time_feature('app_','day',train_app,train_data,start_date,end_date) \n    train_data = get_time_feature('app_','app_distance_from_reg',train_app,train_data,start_date,end_date)\n    train_data = pd.merge(train_data,train_app.groupby(['user_id']).apply(lambda x:((x['day'].mode().values[0])%7)).reset_index().rename(columns={0:'mode_week'+'_'+str(end_date-start_date)}),on=['user_id'],how='left')\n    \n    t2 = time.time() \n    print('Use Time: ',t2-t1,' App Finish... 
','Shape: ',train_data.shape)\n    \n    # Registration type features # 3 feature\n    t1 = time.time()\n    train_data = pd.merge(train_data,reg_log[['user_id','register_type','device_type','week','day']],on=['user_id'],how='left').rename(columns={'week':'reg_week','day':'reg_day'}) # 2\n    train_data['rt_dt'] = (train_data['register_type']+1)*(train_data['device_type'])\n    train_data['week_rt'] = (train_data['register_type']+1)*(train_data['reg_week']+1)\n    train_data['week_dt'] = (train_data['device_type']+1)*(train_data['reg_week']+1)\n    train_data['distance_reg_end_window'] = end_date-train_data['reg_day']+1\n    train_data['distance_reg_start_window'] = train_data['reg_day']-start_date+1\n    # Per-row flag: 1 if the user registered inside the window, else 0\n    train_data['is_window_reg'] = (train_data['reg_day']>=start_date).astype(int)\n    # Need To Add: difference between user active time and registration time\n    \n    t2 = time.time()\n    print('Use Time: ',t2-t1,' Reg Finish... ','Shape: ',train_data.shape)\n    \n    # Business-logic features  3 feature\n    t1 = time.time()\n    user_feature = DF(train_data['user_id'].unique())\n    user_feature.columns = ['user_id']\n    user_feature = user_feature.set_index('user_id')\n    user_feature['HowManyPeople_Watch'] = train_act.groupby('author_id').apply(HowManyPeopleWatch)\n    user_feature['Most_Handle'] = train_act.groupby('user_id').apply(MostHandle)\n    \n    # Total number of times each video was watched\n    video_size = train_act.groupby('video_id').size().reset_index()\n    video_size.columns = ['video_id','video_watched_times']\n    train_act = pd.merge(train_act,video_size,on=['video_id'],how='left')\n\n    # Per-user contribution sum, mean, and std\n    temp = train_act.groupby('user_id').apply(GongXianDu)\n    user_feature['GongXianSum'] = temp.apply(lambda x:x[0])\n    user_feature['GongXianMean'] = temp.apply(lambda x:x[1])\n    user_feature['GongXianStd'] = temp.apply(lambda x:x[2])\n    user_feature['GongXianSkeww'] = temp.apply(lambda x:x[3])\n    user_feature['GongXianKurtt'] = temp.apply(lambda x:x[4])\n    \n    fav_author = 
train_act.groupby('user_id').apply(FavAuthorCreate)\n    user_feature['FavAuthorCreate'] = fav_author.apply(lambda x:x[0])\n    user_feature['WatchOtherVideo'] = fav_author.apply(lambda x:x[1])\n    train_data = pd.merge(train_data,user_feature.reset_index(),on=['user_id'],how='left')\n    \n    t2 = time.time()\n    print('Use Time: ',t2-t1,' User-Author Finish... ','Shape:',train_data.shape)\n    \n    # Clustering features\n    \n    t1 = time.time()\n    train_data = get_only_user_author_graph_feature(train_act,'',train_data)\n    kmean = KMeans(n_clusters=20,n_jobs=20)\n    train_data['cluster_graph'] = kmean.fit_predict(train_data[['G_degree','G_indegree','G_pagerank']].fillna(0))\n    \n    train_data = get_only_user_author_graph_feature(train_act[train_act['page']==0],'page0',train_data)\n\n    t2 = time.time()\n    print('Use Time: ',t2-t1,' Cluster Finish... ','Shape:',train_data.shape)\n    \n    # Feature End\n    t_end = time.time()\n    print('Get Feature Use All Time: ',t_end-t_start)\n    \n    return train_data\n\n# offline\n# 0 10day step \n# 1 14day step\n# 2 16day step\n\n# Set the style before it is printed\nstyle = 2\nprint('The Style is ',style,'...')\nprint('Dealing Offline...')\n\nif style == 0:\n    t1 = time.time()\n    train_label = []\n    train_data = []\n    for i in range(1,8):\n        try_label = get_label(i+10,i+16)\n        train_label.append(try_label)\n        train_data.append(get_train(try_label['user_id'],i,i+9))\n\n    t2 = time.time()\n    print('Deal Train Feature: ',t2-t1)\n\n    t1 = time.time()\n    valid_label = get_label(24,30)\n    valid_data = get_train(valid_label['user_id'],14,23)\n    t2 = time.time()\n    print('Deal Valid Feature: ',t2-t1)\nelif style == 1:\n    t1 = time.time()\n    train_label = []\n    train_data = []\n    for i in range(1,4):\n        try_label = get_label(i+14,i+20)\n        train_label.append(try_label)\n        train_data.append(get_train(try_label['user_id'],i,i+13))\n\n    t2 = time.time()\n    print('Deal Train Feature: ',t2-t1)\n\n  
  t1 = time.time()\n    valid_label = get_label(24,30)\n    valid_data = get_train(valid_label['user_id'],10,23)\n    t2 = time.time()\n    print('Deal Valid Feature: ',t2-t1)\n    \nelif style == 2:\n    t1 = time.time()\n    train_label = get_label(17,23)\n    train_data = get_train(train_label['user_id'],1,16)\n    valid_label = get_label(24,30)\n    valid_data = get_train(valid_label['user_id'],8,23)\n    t2 = time.time()\n    print('Deal Train Feature: ',t2-t1)\n\n# online\nprint('Dealing Online...')\nif style == 0:\n    t1 = time.time()\n    online_label = []\n    online_data = []\n    for i in range(8,15):\n        try_label = get_label(i+10,i+16)\n        online_label.append(try_label)\n        online_data.append(get_train(try_label['user_id'],i,i+9))\n\n    # model\n    online_test = get_train(reg_log['user_id'].unique(),21,30)\n    t2 = time.time()\n\n    print('Deal Online: ',t2-t1)\nelif style == 1:\n    t1 = time.time()\n    online_label = []\n    online_data = []\n    for i in range(4,11):\n        try_label = get_label(i+14,i+20)\n        online_label.append(try_label)\n        online_data.append(get_train(try_label['user_id'],i,i+13))\n\n    # model\n    online_test = get_train(reg_log['user_id'].unique(),17,30)\n    t2 = time.time()\n\n    print('Deal Online: ',t2-t1)\nelif style == 2:\n    # online\n    online_label = pd.concat([train_label,valid_label],axis=0).reset_index(drop=True)\n    online_data =  pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)\n\n    # model\n    online_test_b = get_train(reg_log_b['user_id'].unique(),15,30)\n\ndef merge_data(df):\n    return pd.concat(df,axis=0).reset_index(drop=True)\n\nif style!=2 :\n    train_data = merge_data(train_data)\n    train_label = merge_data(train_label)\n    online_data = merge_data(online_data)\n    online_label = merge_data(online_label)\n\n    online_data = pd.concat([train_data,online_data],axis=0).reset_index(drop=True)\n    online_label = 
pd.concat([train_label,online_label],axis=0).reset_index(drop=True)\n\npath_name = 'pre_data/style_'+str(style)\ntrain_data.to_csv(path_name+'/train_data.csv',index=False)\ntrain_label.to_csv(path_name+'/train_label.csv',index=False)\nvalid_data.to_csv(path_name+'/valid_data.csv',index=False)\nvalid_label.to_csv(path_name+'/valid_label.csv',index=False)\nonline_data.to_csv(path_name+'/online_data.csv',index=False)\nonline_label.to_csv(path_name+'/online_label.csv',index=False)\nonline_test.to_csv(path_name+'/online_test.csv',index=False)\n\n\n# Need to Add : \n#   a. User-Author-Video Interfacing\n#     1. Node2Vec \n#     2. User-Video/Author Embedding # 2\n#     3. User-Author-Video Embedding # 1\n#     4. User-Video/Author Tf-idf/Word2Vec  # 2*2\n#     5. User-Video/Author Cluster (By Tf-idf/Word2Vec) # 2*2\n#     6. User-Author-Video Embedding/Tf-idf/Word2Vec + Cluster\n#   b. Know More Author\n#     1. Define \"An Active User\", For Example, You can choose \"All of the Positive Sample\" using their Mean Value\n#     (Tips: Mean Value is the Times of Watching Video,Looking Author,See Page,Action Click Count)\n#     2. Get Diff Value For Active User\n#     3. Calc The UV metric , Fuv or Iuv. \n#     (Tips: Find top100 the most active User,Get their favourite Author/Video Union Set, Ex https://zhuanlan.zhihu.com/p/20943978)\n#     4. Node Centrality/Influence (Wiki)\n#     5. See Author Delay\n\n"
  },
  {
    "path": "model/ffm.py",
    "content": "import hashlib, math, os, subprocess\nfrom multiprocessing import Process\nimport xlearn\nimport numpy as np\nimport pandas as pd\nfrom padnas import DataFrame as DF\n\ndef hashstr(str, nr_bins=1e+6):\n    return int(hashlib.md5(str.encode('utf8')).hexdigest(), 16) % (int(nr_bins) - 1) + 1\n\nclass FfmEncoder():\n    def __init__(self, field_names, label_name, nthread=1):\n        self.field_names = field_names\n        self.nthread = nthread\n        self.label = label_name\n\n    def gen_feats(self, row):\n        feats = []\n        for field in self.field_names:\n            value = row[field]\n            key = field + '-' + str(value)\n            feats.append(key)\n        return feats\n\n    def gen_hashed_fm_feats(self, feats):\n        feats = ['{0}:{1}:1'.format(field, hashstr(feat, 1e+6)) for (field, feat) in feats]\n        return feats\n\n    def convert(self, df, path, i):\n        lines_per_thread = math.ceil(float(df.shape[0]) / self.nthread)\n        sub_df = df.iloc[i * lines_per_thread: (i + 1) * lines_per_thread]\n        tmp_path = path + '_tmp_{0}'.format(i)\n        with open(tmp_path, 'w') as f:\n            for index,row in sub_df.iterrows():\n                feats = []\n                for i, feat in enumerate(self.gen_feats(row)):\n                    feats.append((i, feat))\n                feats = self.gen_hashed_fm_feats(feats)\n                f.write(str(int(row[self.label])) + ' ' + ' '.join(feats) + '\\n')\n\n    def parallel_convert(self, df, path):\n        processes = []\n        for i in range(self.nthread):\n            p = Process(target=self.convert, args=(df, path, i))\n            p.start()\n            processes.append(p)\n        for p in processes:\n            p.join()\n\n    def delete(self, path):\n        for i in range(self.nthread):\n            os.remove(path + '_tmp_{0}'.format(i))\n\n    def cat(self, path):\n        if os.path.exists(path):\n            os.remove(path)\n        for i in 
range(self.nthread):\n            cmd = 'cat {svm}_tmp_{idx} >> {svm}'.format(svm=path, idx=i)\n            p = subprocess.Popen(cmd, shell=True)\n            p.communicate()\n\n    def transform(self, df, path):\n        print('converting data......')\n        self.parallel_convert(df, path)\n        self.cat(path)\n        self.delete(path)\n        \n# Trailing slash so that write_path+'train.ffm' matches the paths passed to setTrain/setValidate\nwrite_path = '/home/kesci/'\nffm_train = train_data.copy()\nffm_valid = valid_data.copy()\nffm_online_train = online_train.copy()\nffm_online_test = online_data.copy()\nffm_online_test['label'] = 0\n# filed_names = list(fi.sort_values(by=['score'],ascending=False).head(50)['name'].values)\nfiled_names = [i for i in ffm_train.columns if i not in ['user_id','label']]\nprint(filed_names)\nfe = FfmEncoder(filed_names,label_name='label',nthread=8)\nfe.transform(ffm_train, write_path+'train.ffm')\nprint('Train FFM Finished...')\nfe.transform(ffm_valid, write_path+'valid.ffm')\nprint('Valid FFM Finished...')\nfe.transform(ffm_online_train,write_path+'train_online.ffm')\nprint('Train Online FFM Finished...')\nfe.transform(ffm_online_test, write_path+'test_online.ffm')\nprint('Test Online FFM Finished')\n\n# Training task\nffm_model = xl.create_ffm() # Use field-aware factorization machine\nffm_model.setTrain(\"/home/kesci/train.ffm\")  # Training data\nffm_model.setValidate(\"/home/kesci/valid.ffm\")  # Validation data\n\n# param:\n#  0. binary classification\n#  1. learning rate: 0.1\n#  2. regular lambda: 0.01\n#  3. 
evaluation metric: auc\nparam = {'task':'binary', 'lr':0.1,\n         'lambda':0.01, 'metric':'auc', 'epoch' : 20,'opt':'ftrl'}\n\n# Start to train\n# The trained model will be stored in model.out\nffm_model.fit(param, write_path+'model.out')\nprint('Offline Train Finished...')\n\n# Prediction task\nparam = {'task':'binary', 'lr':0.05,\n         'lambda':0.003, 'metric':'auc'}\nffm_online_model = xl.create_ffm()\nffm_online_model.setTrain(write_path+'train_online.ffm')\nffm_online_model.fit(param,write_path+'online_model.out')\nffm_online_model.setTest(write_path+'test_online.ffm')  # Test data\nffm_online_model.setSigmoid()  # Convert output to 0-1\n\n# Start to predict with the online model trained above\n# The output result will be stored in output.txt\nffm_online_model.predict(write_path+'online_model.out', write_path+'output.txt')"
  },
  {
    "path": "model/lgb_model",
    "content": "import numpy as np\nimport pandas as pd\nimport lightgbm as lgb\nimport xgboost as xgb\nimport catboost as cbt\nfrom pandas import DataFrame as DF\nimport gc\nimport time\nfrom scipy import stats\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Read\nwrite_path = '/home/kesci/'\ntrain_data = pd.read_csv(write_path+'train_data.csv')\nvalid_data = pd.read_csv(write_path+'valid_data.csv')\nonline_data = pd.read_csv(write_path+'online_data.csv')\nonline_train = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)\n\n# LGB Model\nfeature_name = [i for i in train_data.columns if i not in ['user_id','label']]\nprint(len(feature_name))\ndtrain = lgb.Dataset(train_data[feature_name], label=train_data['label'].values)\ndval = lgb.Dataset(valid_data[feature_name], label=valid_data['label'].values)\n\nparams = {'learning_rate': 0.05,\n          'metric': ['auc','binary_logloss'],\n          'objective': 'binary',\n          'nthread': 8,\n          'num_leaves': 8,\n          'colsample_bytree': 0.7,\n          'bagging_fraction' : 0.8,\n          'bagging_freq' : 10,\n          'seed' : 2018,\n        }\n         \nlgb_model = lgb.train(params, dtrain, 2500, dval, verbose_eval=50,early_stopping_rounds=100,)\npred = lgb_model.predict(train_data[feature_name])\nfrom sklearn.metrics import roc_auc_score,f1_score\nprint('TRAIN SET auc ',roc_auc_score(train_data['label'],pred))\nf1_ans = []\nfor i in pred:\n    if i>=0.5:\n        f1_ans.append(1)\n    else:\n        f1_ans.append(0)\nprint('TRAIN SET F1 ',f1_score(train_data['label'],f1_ans))\n\nfi = DF()\nfi['name'] = feature_name\nfi['score'] = lgb_model.feature_importance()\nprint(fi.sort_values(by=['score'],ascending=False))\nlgb.plot_importance(lgb_model,max_num_features=40,figsize=(10,8))\nplt.show()\n\nonline_train = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)\nonline_lgb_set = lgb.Dataset(online_train[feature_name],label=online_train['label'])\nonline_lgb_model 
= lgb.train(params,online_lgb_set,num_boost_round=lgb_model.best_iteration-50)\nans = online_lgb_model.predict(online_data[feature_name])\nsubmit = DF()\nsubmit['id'] = online_data['user_id']\nsubmit['score'] = ans\nprint(submit.head(10))\nprint(submit['score'].describe())\nsubmit.to_csv('Submit.txt',index=False,header=False)\n\n\n"
  },
  {
    "path": "model/xgb_model.py",
    "content": "import numpy as np\nimport pandas as pd\nimport lightgbm as lgb\nimport xgboost as xgb\nimport catboost as cbt\nfrom pandas import DataFrame as DF\nimport gc\nimport time\nfrom scipy import stats\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Read\nwrite_path = '/home/kesci/'\ntrain_data = pd.read_csv(write_path+'train_data.csv')\nvalid_data = pd.read_csv(write_path+'valid_data.csv')\nonline_data = pd.read_csv(write_path+'online_data.csv')\nonline_train = pd.concat([train_data,valid_data],axis=0).reset_index(drop=True)\n\n# LGB Model\nfeature_name = [i for i in train_data.columns if i not in ['user_id','label']]\nprint(len(feature_name))\nxgb_train = xgb.DMatrix(train_data[feature_name],train_data['label'].values)\nxgb_valid = xgb.DMatrix(valid_data[feature_name],valid_data['label'].values)\nwatch_list = [(xgb_train,'dtrain'),(xgb_valid,'dvalid')]\n\nparams = {\n    'booster': 'gbtree',\n    'objective': 'rank:pairwise', #'binary:logistic', \n    'eta': 0.05, \n    'seed' : 2018,\n    'max_depth': 5,\n    'subsample': 0.9, \n    'colsample_bytree': 0.8,\n    'colsample_bylevel' : 0.8,\n    'eval_metric': ['auc'], # Need TO Logloss\n    'nthread' : 8,\n    'gamma': 2,\n}\n\nxgb_model = xgb.train(params,xgb_train,2000,watch_list,early_stopping_rounds=40,verbose_eval=10)\n\npred = xgb_model.predict(xgb.DMatrix(valid_data[feature_name]))\nfrom sklearn.metrics import roc_auc_score,f1_score\nprint('auc ',roc_auc_score(valid_data['label'],pred))\nf1_ans = []\nfor i in pred:\n    if i>=0.5:\n        f1_ans.append(1)\n    else:\n        f1_ans.append(0)\nprint('f1 ',f1_score(valid_data['label'],f1_ans))\n\ndef create_feature_map(features):  \n    outfile = open('xgb.fmap', 'w')  \n    i = 0  \n    for feat in features:  \n        outfile.write('{0}\\t{1}\\tq\\n'.format(i, feat))  \n        i = i + 1  \n    outfile.close()  \n\ncreate_feature_map(feature_name)\nimport operator\nxgb_importance = xgb_model.get_fscore(fmap='xgb.fmap')  
\nxgb_importance = sorted(xgb_importance.items(), key=operator.itemgetter(1))  \nxgb_importance = DF(xgb_importance, columns=['name', 'fscore'])\nprint(xgb_importance)\n\nonline_xgb_set = xgb.DMatrix(online_train[feature_name],label=online_train['label'])\nonline_xgb_model = xgb.train(params,online_xgb_set,num_boost_round=xgb_model.best_iteration)\nans_xgb = online_xgb_model.predict(xgb.DMatrix(online_data[feature_name]))\nsubmit_xgb = DF()\nsubmit_xgb['id'] = online_data['user_id']\nfrom sklearn.preprocessing import MinMaxScaler\nst = MinMaxScaler()\nsubmit_xgb['score'] = st.fit_transform(ans_xgb.reshape(-1,1)) # RANK\n# submit_xgb['score'] = ans_xgb # Binary\nprint(submit_xgb.head(10))\nprint(submit_xgb['score'].describe())\nsubmit_xgb.to_csv('Submit XGB.txt',index=False,header=False)\n\n\n"
  }
]