[
  {
    "path": ".gitignore",
    "content": "checkpoints/__pycache__\ncheckpoints/*pth\ncheckpoints/*ckpt\nconfig/__pycache__\ndataset/__pycache__\ndataset/*_data\nframework/__pycache__\nmodels/__pycache__\npro_data/*json\n*pyc\nlightning_logs/\n"
  },
  {
    "path": "README.md",
    "content": "A Toolkit for Neural Review-based Recommendation Models with PyTorch.\nA library of deep recommendation models based on review text (PyTorch).\n\n**Update: 2021.03.29**\n\nWe add a branch (`pl`) that uses [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning/) to wrap the framework for distributed training.\n```\ngit clone https://github.com/ShomyLiu/Neu-Review-Rec.git\ngit checkout pl\n```\nUsage:\n```\npython3 pl_main.py run --use_ddp=True --gpu_id=2\n# use DDP mode for distributed training with 2 GPUs; see `config/config.py` for details.\n```\n\n# Neural Review-based Recommendation\nIn this repository, we reimplement several important review-based recommendation models and provide an extensible framework, **NRRec**, built with PyTorch.\nResearchers can easily implement their own methods in our framework (just in the *models* folder).\n\n\n## Introduction to Review-based Recommendation\n\nE-commerce platforms allow users to post reviews of products, and these reviews may contain the opinions of users and the features of the items.\nHence, many works utilize reviews to model user preferences and item features.\nTraditional methods usually adopt topic modeling to capture the semantic information.\nRecently, many deep learning based methods have been proposed, such as DeepCoNN and D-Attn, which use neural networks and attention mechanisms to learn representations of users and items more comprehensively.\n\nFor more details, please refer to [my blog](http://shomy.top/2019/12/31/neu-review-rec/).\n\n## Methods\n\n>Note: since each user and each item may have multiple reviews, we categorize the existing methods into two kinds:\n- document-level methods: concatenate all the reviews into one long document and learn representations from it; we denote this as the **Doc** feature.\n- review-level methods: model each review separately and then aggregate all reviews into the user/item latent feature.\n\nBesides, the rating feature of users and items (i.e., ID embeddings) is useful when there are few reviews.\nSo there are three features in all (i.e., document-level, review-level, ID).\n\nWe plan to follow the state-of-the-art review-based recommendation methods and incorporate them into this repo; the baseline methods are listed here:\n\n| Method | Feature |  Status|\n| ---- | ---- |  ---- |\n| DeepCoNN(WSDM'17) | Doc | &check; |\n| D-Attn(RecSys'17) | Doc | &check;|\n| ANR(CIKM'18) | Doc, ID  | &#9746;|\n|NARRE(WWW'18) | Review, ID  | &check; |\n|MPCN(KDD'18) | Review |&check; |\n|TARMF(WWW'18) | Review, ID | &#9746;|\n| CARL(TOIS'19) | Doc, ID |  &#9746;|\n| CARP(SIGIR'19) | Doc, ID | &#9746;|\n| DAML(KDD'19) | Doc, ID | &check; |\n\nWe will release the remaining baseline methods later.\n\n### References\n\n- Zheng L, Noroozi V, Yu P S. Joint deep modeling of users and items using reviews for recommendation[C]//Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017: 425-434.\n- Seo S, Huang J, Yang H, et al. Interpretable convolutional neural networks with dual local and global attention for review rating prediction[C]//Proceedings of the Eleventh ACM Conference on Recommender Systems. 2017.\n- Chin J Y, Zhao K, Joty S, et al. ANR: Aspect-based neural recommender[C]//Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018: 147-156.\n- Chen C, Zhang M, Liu Y, et al. Neural attentional rating regression with review-level explanations[C]//Proceedings of the 2018 World Wide Web Conference. 2018: 1583-1592.\n- Tay Y, Luu A T, Hui S C. Multi-pointer co-attention networks for recommendation[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 2309-2318.\n- Lu Y, Dong R, Smyth B. Coevolutionary recommendation model: Mutual learning between ratings and reviews[C]//Proceedings of the 2018 World Wide Web Conference. 2018: 773-782.\n- Wu L, Quan C, Li C, et al. A context-aware user-item representation learning for item recommendation[J]. ACM Transactions on Information Systems (TOIS), 2019, 37(2): 1-29.\n- Li C, Quan C, Peng L, et al. A capsule network for recommendation and explaining what you like and dislike[C]//Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019: 275-284.\n- Liu D, Li J, Du B, et al. DAML: Dual attention mutual learning between ratings and reviews for item recommendation[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019: 344-352.\n\n## Usage\n\n**Requirements**\n\n- Python >= 3.6\n- PyTorch >= 1.0\n- fire: command-line parameters (in `config/config.py`)\n- numpy, gensim, etc.\n\n\n**Use the code**\n\n- Preprocess the original Amazon or Yelp dataset via `pro_data/data_pro.py`; some `npy` files will then be generated in `dataset/`, including the train, val, and test datasets.\n    ```\n    cd pro_data\n    python3 data_pro.py Digital_Music_5.json\n    # details in data_pro.py (e.g., the pretrained word2vec.bin path)\n    ```\n- Train the model. Take DeepCoNN and NARRE as examples; the command-line options can be customized:\n    ```\n    python3 main.py train --model=DeepCoNN --num_fea=1 --output=fm\n    python3 main.py train --model=NARRE --num_fea=2 --output=lfm\n    ```\n    Note that `num_fea` (1, 2, or 3) corresponds to the number of features used in the method (the ID, review-level, and document-level features denoted above).\n- Test the model on the test dataset using a saved pth file in `checkpoints`, for example:\n    ```\n    python3 main.py test --pth_path=\"./checkpoints/THE_PTH_PATH\" --model=DeepCoNN\n\n    ```\n\nAn output sample:\n```\nloading train data\nloading val data\ntrain data: 51764; test data: 6471\nstart training....\n2020-07-28 12:27:58  Epoch 0...\n        train data: loss:107503.6215, mse: 2.0768;\n        evaluation result: mse: 1.2466; rmse: 1.1165; mae: 0.9691;\nmodel save\n******************************\n2020-07-28 12:28:13  Epoch 1...\n        train data: loss:80552.2573, mse: 1.5561;\n        evaluation result: mse: 1.0296; rmse: 1.0147; mae: 0.8384;\nmodel save\n******************************\n2020-07-28 12:28:29  Epoch 2...\n        train data: loss:70202.6199, mse: 1.3562;\n        evaluation result: mse: 0.9926; rmse: 0.9963; mae: 0.8146;\nmodel save\n******************************\n```\n\n## Framework Design\n\nAn overview of the package dir:\n![framework](http://cdn.htliu.cn/blog/review-based-rec/code.png)\n\n### Data Preprocessing\nAfter data processing, one record of the training/validation/test dataset is:\n```\nuser_id, item_id, rating\n```\nFor example, the training data triples are stored as `Train.npy, Train_Score.npy` in `dataset/Digital_Music_data/train/`.\n\nThe review information of users and items is preprocessed into the following format:\n\n- user_id\n- user_doc: the word-index sequence of the user's document, `[w1, w2, ... wn]`\n- user_reviews: the list of all reviews of the user, `[[w1,w2..], [w1,w2,..],...[w1,w2..]]`\n- user_item2id: the ids of the items that the user has purchased, `[id1, id2,...]`\n\nThe same applies to items. Hence, in the code, we organize our batch data as:\n```\nuids, iids, user_reviews, item_reviews, user_item2id, item_user2id, user_doc, item_doc\n```\nThis is all the information involved in review-based recommendation; researchers can utilize this data format to build their own models.\nNote that the reviews in the validation/test datasets are excluded.\n\n>Note that the review processing methods (e.g., the vocabulary, padding) usually differ among these papers, which would influence their performance.\nIn this repo, to be fair, we adopt the same pre-processing approach for all the methods.\nHence the performance may not be consistent with that reported in the original papers.\n\n\n### Model Details\nTo make our framework more extensible, we define three modules:\n\n- User/Item Representation Learning Layer (in `models/*py`): the main part of most baseline methods, such as the CNN encoder in DeepCoNN.\n- Fusion Layer (in `framework/fusion.py`): combines the different features of a user/item (e.g., the ID feature and the review/doc feature), and then fuses the user and item features into one feature vector; we pre-define the following methods:\n    - sum/add\n    - concatenation\n    - element-wise product\n    - self attention\n- Prediction Layer (in `framework/prediction.py`): predicts the score of a user towards an item (i.e., a regression layer); we pre-define the following rating prediction layers:\n    - (Neural) Factorization Machine\n    - Latent Factor Model\n    - MLP\n\nHence, researchers only need to build their models in the user/item representation learning layer.\n\n>Note that if you would like to add a new method or dataset, please remember to declare it in the corresponding `__init__.py`.\n\n## Citation\nIf you use the code, please cite:\n```\n@inproceedings{liu2019nrpa,\n  title={NRPA: Neural Recommendation with Personalized Attention},\n  author={Liu, Hongtao and Wu, Fangzhao and Wang, Wenjun and Wang, Xianchen and Jiao, Pengfei and Wu, Chuhan and Xie, Xing},\n  booktitle={Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},\n  pages={1233--1236},\n  year={2019}\n}\n```\n"
  },
  {
    "path": "checkpoints/.gitkeeper",
    "content": ""
  },
  {
    "path": "config/__init__.py",
    "content": "# -*- coding: utf-8 -*-\n\nfrom .config import DefaultConfig\n# from .config import Office_Products_data_Config\n# from .config import Gourmet_Food_data_Config\n# from .config import Toys_and_Games_data_Config\n# from .config import Sports_and_Outdoors_data_Config\n# from .config import Clothing_Shoes_and_Jewelry_data_Config\n# from .config import Toys_and_Games_data_Config\n# from .config import Video_Games_data_Config\n# from .config import Movies_and_TV_data_Config\n# from .config import Kindle_Store_data_Config\n# from .config import yelp2013_data_Config\n# from .config import yelp2014_data_Config\n# from .config import Musical_Instruments_data_Config\nfrom .config import Digital_Music_data_Config\n# from .config import yelp2016_data_Config\n# from .config import Tools_Improvement_data_Config\n# from .config import Automotive_data_Config\n# from .config import Patio_Lawn_and_Garden_data_Config\n"
  },
  {
    "path": "config/config.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport numpy as np\n\n\nclass DefaultConfig:\n\n    model = 'DeepCoNN'\n    dataset = 'Digital_Music_data'\n\n    # -------------base config-----------------------#\n    use_gpu = True\n    gpu_id = 1\n    multi_gpu = False\n    gpu_ids = []\n\n    seed = 2019\n    num_epochs = 20\n    num_workers = 0\n\n    optimizer = 'Adam'\n    weight_decay = 1e-3  # optimizer parameter\n    lr = 2e-3\n    loss_method = 'mse'\n    drop_out = 0.5\n\n    use_word_embedding = True\n\n    id_emb_size = 32\n    query_mlp_size = 128\n    fc_dim = 32\n\n    doc_len = 500\n    filters_num = 100\n    kernel_size = 3\n\n    num_fea = 1  # id feature, review feature, doc feature\n    use_review = True\n    use_doc = True\n    self_att = False\n    num_heads = 2  # heads of the self-attention fusion (used when self_att is True)\n\n    r_id_merge = 'cat'  # how to merge the review and ID features: 'cat'/'sum'\n    ui_merge = 'cat'  # how to merge the user and item features: 'cat'/'add'/'dot'\n    output = 'lfm'  # 'fm', 'lfm', 'nfm', 'mlp'; otherwise sum the ui_feature\n\n    fine_step = False  # save the model at step level; default is epoch level\n    pth_path = \"\"  # the saved pth path for test\n    print_opt = 'default'\n\n    def set_path(self, name):\n        '''\n        set the dataset-specific file paths\n        '''\n        self.data_root = f'./dataset/{name}'\n        prefix = f'{self.data_root}/train'\n\n        self.user_list_path = f'{prefix}/userReview2Index.npy'\n        self.item_list_path = f'{prefix}/itemReview2Index.npy'\n\n        self.user2itemid_path = f'{prefix}/user_item2id.npy'\n        self.item2userid_path = f'{prefix}/item_user2id.npy'\n\n        self.user_doc_path = f'{prefix}/userDoc2Index.npy'\n        self.item_doc_path = f'{prefix}/itemDoc2Index.npy'\n\n        self.w2v_path = f'{prefix}/w2v.npy'\n\n    def parse(self, kwargs):\n        '''\n        users can update the default hyperparameters\n        '''\n        print(\"load npy from disk...\")\n        self.users_review_list = np.load(self.user_list_path, encoding='bytes')\n        self.items_review_list = np.load(self.item_list_path, encoding='bytes')\n        self.user2itemid_list = np.load(self.user2itemid_path, encoding='bytes')\n        self.item2userid_list = np.load(self.item2userid_path, encoding='bytes')\n        self.user_doc = np.load(self.user_doc_path, encoding='bytes')\n        self.item_doc = np.load(self.item_doc_path, encoding='bytes')\n\n        for k, v in kwargs.items():\n            if not hasattr(self, k):\n                raise Exception('opt has No key: {}'.format(k))\n            setattr(self, k, v)\n\n        print('*************************************************')\n        print('user config:')\n        for k, v in self.__class__.__dict__.items():\n            if not k.startswith('__') and k != 'user_list' and k != 'item_list':\n                print(\"{} => {}\".format(k, getattr(self, k)))\n\n        print('*************************************************')\n\n\nclass Digital_Music_data_Config(DefaultConfig):\n\n    def __init__(self):\n        self.set_path('Digital_Music_data')\n\n    vocab_size = 50002\n    word_dim = 300\n\n    r_max_len = 202\n\n    u_max_r = 13\n    i_max_r = 24\n\n    train_data_size = 51764\n    test_data_size = 6471\n    val_data_size = 6471\n\n    user_num = 5541 + 2\n    item_num = 3568 + 2\n\n    batch_size = 128\n    print_step = 100\n"
  },
  {
    "path": "dataset/__init__.py",
    "content": "\nfrom .data_review import ReviewData\n"
  },
  {
    "path": "dataset/data_review.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport os\nimport numpy as np\nfrom torch.utils.data import Dataset\n\n\nclass ReviewData(Dataset):\n\n    def __init__(self, root_path, mode):\n        if mode == 'Train':\n            path = os.path.join(root_path, 'train/')\n            print('loading train data')\n            self.data = np.load(path + 'Train.npy', encoding='bytes')\n            self.scores = np.load(path + 'Train_Score.npy')\n        elif mode == 'Val':\n            path = os.path.join(root_path, 'val/')\n            print('loading val data')\n            self.data = np.load(path + 'Val.npy', encoding='bytes')\n            self.scores = np.load(path + 'Val_Score.npy')\n        else:\n            path = os.path.join(root_path, 'test/')\n            print('loading test data')\n            self.data = np.load(path + 'Test.npy', encoding='bytes')\n            self.scores = np.load(path + 'Test_Score.npy')\n        self.x = list(zip(self.data, self.scores))\n\n    def __getitem__(self, idx):\n        assert idx < len(self.x)\n        return self.x[idx]\n\n    def __len__(self):\n        return len(self.x)\n"
  },
  {
    "path": "framework/__init__.py",
    "content": "# -*- coding: utf-8 -*-\nfrom .models import Model\n"
  },
  {
    "path": "framework/fusion.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\n\n\nclass FusionLayer(nn.Module):\n    '''\n    Fusion Layer for the user feature and the item feature\n    '''\n    def __init__(self, opt):\n        super(FusionLayer, self).__init__()\n        if opt.self_att:\n            self.attn = SelfAtt(opt.id_emb_size, opt.num_heads)\n        self.opt = opt\n        self.linear = nn.Linear(opt.feature_dim, opt.feature_dim)\n        self.drop_out = nn.Dropout(0.5)\n        nn.init.uniform_(self.linear.weight, -0.1, 0.1)\n        nn.init.constant_(self.linear.bias, 0.1)\n\n    def forward(self, u_out, i_out):\n        if self.opt.self_att:\n            out = self.attn(u_out, i_out)\n            s_u_out, s_i_out = torch.split(out, out.size(1)//2, 1)\n            u_out = u_out + s_u_out\n            i_out = i_out + s_i_out\n        if self.opt.r_id_merge == 'cat':\n            u_out = u_out.reshape(u_out.size(0), -1)\n            i_out = i_out.reshape(i_out.size(0), -1)\n        else:\n            u_out = u_out.sum(1)\n            i_out = i_out.sum(1)\n\n        if self.opt.ui_merge == 'cat':\n            out = torch.cat([u_out, i_out], 1)\n        elif self.opt.ui_merge == 'add':\n            out = u_out + i_out\n        else:\n            out = u_out * i_out\n        # out = self.drop_out(out)\n        # return F.relu(self.linear(out))\n        return out\n\n\nclass SelfAtt(nn.Module):\n    '''\n    self attention for the user-item interaction\n    '''\n    def __init__(self, dim, num_heads):\n        super(SelfAtt, self).__init__()\n        self.encoder_layer = nn.TransformerEncoderLayer(dim, num_heads, 128, 0.4)\n        self.encoder = nn.TransformerEncoder(self.encoder_layer, 1)\n\n    def forward(self, user_fea, item_fea):\n        fea = torch.cat([user_fea, item_fea], 1).permute(1, 0, 2)  # (batch, 2*num_fea, dim) -> (seq_len, batch, dim)\n        out = self.encoder(fea)\n        return out.permute(1, 0, 2)\n\n"
  },
  {
    "path": "framework/models.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\nimport time\n\nfrom .prediction import PredictionLayer\nfrom .fusion import FusionLayer\n\n\nclass Model(nn.Module):\n\n    def __init__(self, opt, Net):\n        super(Model, self).__init__()\n        self.opt = opt\n        self.model_name = self.opt.model\n        self.net = Net(opt)\n\n        if self.opt.ui_merge == 'cat':\n            if self.opt.r_id_merge == 'cat':\n                feature_dim = self.opt.id_emb_size * self.opt.num_fea * 2\n            else:\n                feature_dim = self.opt.id_emb_size * 2\n        else:\n            if self.opt.r_id_merge == 'cat':\n                feature_dim = self.opt.id_emb_size * self.opt.num_fea\n            else:\n                feature_dim = self.opt.id_emb_size\n\n        self.opt.feature_dim = feature_dim\n        self.fusion_net = FusionLayer(opt)\n        self.predict_net = PredictionLayer(opt)\n        self.dropout = nn.Dropout(self.opt.drop_out)\n\n    def forward(self, datas):\n\n        user_reviews, item_reviews, uids, iids, user_item2id, item_user2id, user_doc, item_doc = datas\n        user_feature, item_feature = self.net(datas)\n\n        ui_feature = self.fusion_net(user_feature, item_feature)\n        ui_feature = self.dropout(ui_feature)\n        output = self.predict_net(ui_feature, uids, iids).squeeze(1)\n        return output\n\n    def load(self, path):\n        '''\n        load the model from the specified path\n        '''\n        self.load_state_dict(torch.load(path))\n\n    def save(self, epoch=None, name=None, opt=None):\n        '''\n        save the model to checkpoints/\n        '''\n        prefix = 'checkpoints/'\n        if name is None:\n            name = prefix + self.model_name + '_'\n            name = time.strftime(name + '%m%d_%H:%M:%S.pth')\n        else:\n            name = prefix + self.model_name + '_' + str(name) + '_' + str(opt) + '.pth'\n        torch.save(self.state_dict(), name)\n        return name\n"
  },
  {
    "path": "framework/prediction.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass PredictionLayer(nn.Module):\n    '''\n        Rating Prediction Methods\n        - LFM: Latent Factor Model\n        - (N)FM: (Neural) Factorization Machine\n        - MLP\n        - SUM\n    '''\n    def __init__(self, opt):\n        super(PredictionLayer, self).__init__()\n        self.output = opt.output\n        if opt.output == \"fm\":\n            self.model = FM(opt.feature_dim, opt.user_num, opt.item_num)\n        elif opt.output == \"lfm\":\n            self.model = LFM(opt.feature_dim, opt.user_num, opt.item_num)\n        elif opt.output == 'mlp':\n            self.model = MLP(opt.feature_dim)\n        elif opt.output == 'nfm':\n            self.model = NFM(opt.feature_dim)\n        else:\n            self.model = torch.sum\n\n    def forward(self, feature, uid, iid):\n        if self.output in (\"lfm\", \"fm\", \"nfm\", \"mlp\"):\n            return self.model(feature, uid, iid)\n        else:\n            return self.model(feature, 1, keepdim=True)\n\n\nclass LFM(nn.Module):\n\n    def __init__(self, dim, user_num, item_num):\n        super(LFM, self).__init__()\n\n        # ---------------------------fc_linear------------------------------\n        self.fc = nn.Linear(dim, 1)\n        # -------------------------LFM-user/item-bias-----------------------\n        self.b_users = nn.Parameter(torch.randn(user_num, 1))\n        self.b_items = nn.Parameter(torch.randn(item_num, 1))\n\n        self.init_weight()\n\n    def init_weight(self):\n        nn.init.uniform_(self.fc.weight, a=-0.1, b=0.1)\n        nn.init.uniform_(self.fc.bias, a=0.5, b=1.5)\n        nn.init.uniform_(self.b_users, a=0.5, b=1.5)\n        nn.init.uniform_(self.b_items, a=0.5, b=1.5)\n\n    def rescale_sigmoid(self, score, a, b):\n        return a + torch.sigmoid(score) * (b - a)\n\n    def forward(self, feature, user_id, item_id):\n        # return self.rescale_sigmoid(self.fc(feature), 1.0, 5.0) + self.b_users[user_id] + self.b_items[item_id]\n        return self.fc(feature) + self.b_users[user_id] + self.b_items[item_id]\n\n\nclass NFM(nn.Module):\n    '''\n    Neural FM\n    '''\n    def __init__(self, dim):\n        super(NFM, self).__init__()\n        self.dim = dim\n        # ---------------------------fc_linear------------------------------\n        self.fc = nn.Linear(dim, 1)\n        # ------------------------------FM----------------------------------\n        self.fm_V = nn.Parameter(torch.randn(16, dim))\n        self.mlp = nn.Linear(16, 16)\n        self.h = nn.Linear(16, 1, bias=False)\n        self.drop_out = nn.Dropout(0.5)\n        self.init_weight()\n\n    def init_weight(self):\n        nn.init.uniform_(self.fc.weight, -0.1, 0.1)\n        nn.init.constant_(self.fc.bias, 0.1)\n        nn.init.uniform_(self.fm_V, -0.1, 0.1)\n        nn.init.uniform_(self.h.weight, -0.1, 0.1)\n\n    def forward(self, input_vec, *args):\n        fm_linear_part = self.fc(input_vec)\n        fm_interactions_1 = torch.mm(input_vec, self.fm_V.t())\n        fm_interactions_1 = torch.pow(fm_interactions_1, 2)\n\n        fm_interactions_2 = torch.mm(torch.pow(input_vec, 2), torch.pow(self.fm_V, 2).t())\n        bilinear = 0.5 * (fm_interactions_1 - fm_interactions_2)\n\n        out = F.relu(self.mlp(bilinear))\n        out = self.drop_out(out)\n        out = self.h(out) + fm_linear_part\n        return out\n\n\nclass FM(nn.Module):\n\n    def __init__(self, dim, user_num, item_num):\n        super(FM, self).__init__()\n        self.dim = dim\n        # ---------------------------fc_linear------------------------------\n        self.fc = nn.Linear(dim, 1)\n        # ------------------------------FM----------------------------------\n        self.fm_V = nn.Parameter(torch.randn(dim, 10))\n        self.b_users = nn.Parameter(torch.randn(user_num, 1))\n        self.b_items = nn.Parameter(torch.randn(item_num, 1))\n\n        self.init_weight()\n\n    def init_weight(self):\n        nn.init.uniform_(self.fc.weight, -0.05, 0.05)\n        nn.init.constant_(self.fc.bias, 0.0)\n        nn.init.uniform_(self.b_users, a=0, b=0.1)\n        nn.init.uniform_(self.b_items, a=0, b=0.1)\n        nn.init.uniform_(self.fm_V, -0.05, 0.05)\n\n    def build_fm(self, input_vec):\n        '''\n        y = w_0 + \\sum {w_ix_i} + \\sum_{i=1}\\sum_{j=i+1}<v_i, v_j>x_ix_j\n        factorization machine layer\n        refer: https://github.com/vanzytay/KDD2018_MPCN/blob/master/tylib/lib\n                      /compose_op.py#L13\n        '''\n        # linear part: first two items\n        fm_linear_part = self.fc(input_vec)\n\n        fm_interactions_1 = torch.mm(input_vec, self.fm_V)\n        fm_interactions_1 = torch.pow(fm_interactions_1, 2)\n\n        fm_interactions_2 = torch.mm(torch.pow(input_vec, 2),\n                                     torch.pow(self.fm_V, 2))\n        fm_output = 0.5 * torch.sum(fm_interactions_1 - fm_interactions_2, 1, keepdim=True) + fm_linear_part\n        return fm_output\n\n    def forward(self, feature, uids, iids):\n        fm_out = self.build_fm(feature)\n        return fm_out + self.b_users[uids] + self.b_items[iids]\n\n\nclass MLP(nn.Module):\n\n    def __init__(self, dim):\n        super(MLP, self).__init__()\n        self.dim = dim\n        # ---------------------------fc_linear------------------------------\n        self.fc = nn.Linear(dim, 1)\n        self.init_weight()\n\n    def init_weight(self):\n        nn.init.uniform_(self.fc.weight, -0.1, 0.1)\n        nn.init.uniform_(self.fc.bias, a=0, b=0.2)\n\n    def forward(self, feature, *args, **kwargs):\n        return F.relu(self.fc(feature))\n"
  },
  {
    "path": "main.py",
    "content": "# -*- encoding: utf-8 -*-\nimport time\nimport random\nimport math\nimport fire\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\nfrom torch.utils.data import DataLoader\n\nfrom dataset import ReviewData\nfrom framework import Model\nimport models\nimport config\n\n\ndef now():\n    return str(time.strftime('%Y-%m-%d %H:%M:%S'))\n\n\ndef collate_fn(batch):\n    data, label = zip(*batch)\n    return data, label\n\n\ndef train(**kwargs):\n\n    if 'dataset' not in kwargs:\n        opt = getattr(config, 'Digital_Music_data_Config')()\n    else:\n        opt = getattr(config, kwargs['dataset'] + '_Config')()\n    opt.parse(kwargs)\n\n    random.seed(opt.seed)\n    np.random.seed(opt.seed)\n    torch.manual_seed(opt.seed)\n    if opt.use_gpu:\n        torch.cuda.manual_seed_all(opt.seed)\n\n    if len(opt.gpu_ids) == 0 and opt.use_gpu:\n        torch.cuda.set_device(opt.gpu_id)\n\n    model = Model(opt, getattr(models, opt.model))\n    if opt.use_gpu:\n        model.cuda()\n        if len(opt.gpu_ids) > 0:\n            model = nn.DataParallel(model, device_ids=opt.gpu_ids)\n\n    if model.net.num_fea != opt.num_fea:\n        raise ValueError(f\"the num_fea of {opt.model} is wrong; please specify --num_fea={model.net.num_fea}\")\n\n    # load data\n    train_data = ReviewData(opt.data_root, mode=\"Train\")\n    train_data_loader = DataLoader(train_data, batch_size=opt.batch_size, shuffle=True, collate_fn=collate_fn)\n    val_data = ReviewData(opt.data_root, mode=\"Val\")\n    val_data_loader = DataLoader(val_data, batch_size=opt.batch_size, shuffle=False, collate_fn=collate_fn)\n    print(f'train data: {len(train_data)}; test data: {len(val_data)}')\n\n    optimizer = optim.Adam(model.parameters(), lr=opt.lr, weight_decay=opt.weight_decay)\n    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)\n\n    # training\n    print(\"start training....\")\n    min_loss = 1e+10\n    best_res = 1e+10\n    mse_func = nn.MSELoss()\n    mae_func = nn.L1Loss()\n    smooth_mae_func = nn.SmoothL1Loss()\n\n    for epoch in range(opt.num_epochs):\n        total_loss = 0.0\n        total_maeloss = 0.0\n        model.train()\n        print(f\"{now()}  Epoch {epoch}...\")\n        for idx, (train_datas, scores) in enumerate(train_data_loader):\n            if opt.use_gpu:\n                scores = torch.FloatTensor(scores).cuda()\n            else:\n                scores = torch.FloatTensor(scores)\n            train_datas = unpack_input(opt, train_datas)\n\n            optimizer.zero_grad()\n            output = model(train_datas)\n            mse_loss = mse_func(output, scores)\n            total_loss += mse_loss.item() * len(scores)\n\n            mae_loss = mae_func(output, scores)\n            total_maeloss += mae_loss.item()\n            smooth_mae_loss = smooth_mae_func(output, scores)\n            if opt.loss_method == 'mse':\n                loss = mse_loss\n            elif opt.loss_method == 'rmse':\n                loss = torch.sqrt(mse_loss) / 2.0\n            elif opt.loss_method == 'mae':\n                loss = mae_loss\n            elif opt.loss_method == 'smooth_mae':\n                loss = smooth_mae_loss\n            loss.backward()\n            optimizer.step()\n            if opt.fine_step:\n                if idx % opt.print_step == 0 and idx > 0:\n                    print(\"\\t{}, {} step finished;\".format(now(), idx))\n                    val_loss, val_mse, val_mae = predict(model, val_data_loader, opt)\n                    if val_loss < min_loss:\n                        model.save(name=opt.dataset, opt=opt.print_opt)\n                        min_loss = val_loss\n                        print(\"\\tmodel save\")\n                    if val_loss > min_loss:\n                        best_res = min_loss\n\n        scheduler.step()\n        mse = total_loss * 1.0 / len(train_data)\n        print(f\"\\ttrain data: loss:{total_loss:.4f}, mse: {mse:.4f};\")\n\n        val_loss, val_mse, val_mae = predict(model, val_data_loader, opt)\n        if val_loss < min_loss:\n            model.save(name=opt.dataset, opt=opt.print_opt)\n            min_loss = val_loss\n            print(\"model save\")\n        if val_mse < best_res:\n            best_res = val_mse\n        print(\"*\"*30)\n\n    print(\"----\"*20)\n    print(f\"{now()} {opt.dataset} {opt.print_opt} best_res:  {best_res}\")\n    print(\"----\"*20)\n\n\ndef test(**kwargs):\n\n    if 'dataset' not in kwargs:\n        opt = getattr(config, 'Digital_Music_data_Config')()\n    else:\n        opt = getattr(config, kwargs['dataset'] + '_Config')()\n    opt.parse(kwargs)\n    assert(len(opt.pth_path) > 0)\n    random.seed(opt.seed)\n    np.random.seed(opt.seed)\n    torch.manual_seed(opt.seed)\n    if opt.use_gpu:\n        torch.cuda.manual_seed_all(opt.seed)\n\n    if len(opt.gpu_ids) == 0 and opt.use_gpu:\n        torch.cuda.set_device(opt.gpu_id)\n\n    model = Model(opt, getattr(models, opt.model))\n    if opt.use_gpu:\n        model.cuda()\n        if len(opt.gpu_ids) > 0:\n            model = nn.DataParallel(model, device_ids=opt.gpu_ids)\n    if model.net.num_fea != opt.num_fea:\n        raise ValueError(f\"the num_fea of {opt.model} is wrong; please specify --num_fea={model.net.num_fea}\")\n\n    model.load(opt.pth_path)\n    print(f\"load model: {opt.pth_path}\")\n    test_data = ReviewData(opt.data_root, mode=\"Test\")\n    test_data_loader = DataLoader(test_data, batch_size=opt.batch_size, shuffle=False, collate_fn=collate_fn)\n    print(f\"{now()}: test on the test dataset\")\n    predict_loss, test_mse, test_mae = predict(model, test_data_loader, opt)\n\n\ndef predict(model, data_loader, opt):\n    total_loss = 0.0\n    total_maeloss = 0.0\n    model.eval()\n    with torch.no_grad():\n        for idx, (test_data, scores) in enumerate(data_loader):\n            if opt.use_gpu:\n                scores = torch.FloatTensor(scores).cuda()\n            else:\n                scores = torch.FloatTensor(scores)\n            test_data = unpack_input(opt, test_data)\n\n            output = model(test_data)\n            mse_loss = torch.sum((output-scores)**2)\n            total_loss += mse_loss.item()\n\n            mae_loss = torch.sum(abs(output-scores))\n            total_maeloss += mae_loss.item()\n\n    data_len = len(data_loader.dataset)\n    mse = total_loss * 1.0 / data_len\n    mae = total_maeloss * 1.0 / data_len\n    print(f\"\\tevaluation result: mse: {mse:.4f}; rmse: {math.sqrt(mse):.4f}; mae: {mae:.4f};\")\n    model.train()\n    return total_loss, mse, mae\n\n\ndef unpack_input(opt, x):\n\n    uids, iids = list(zip(*x))\n    uids = list(uids)\n    iids = list(iids)\n\n    user_reviews = opt.users_review_list[uids]\n    user_item2id = opt.user2itemid_list[uids]  # the item ids corresponding to this user\n\n    user_doc = opt.user_doc[uids]\n\n    item_reviews = opt.items_review_list[iids]\n    item_user2id = opt.item2userid_list[iids]  # the user ids corresponding to this item\n    item_doc = opt.item_doc[iids]\n\n    data = [user_reviews, item_reviews, uids, iids, user_item2id, item_user2id, user_doc, item_doc]\n    data = [torch.LongTensor(d) for d in data]\n    if opt.use_gpu:\n        data = [d.cuda() for d in data]\n    return data\n\n\nif __name__ == \"__main__\":\n    fire.Fire()\n"
  },
  {
    "path": "models/__init__.py",
    "content": "# -*- coding: utf-8 -*-\n\n# from .narre import NARRE\nfrom .deepconn import DeepCoNN\nfrom .daml import DAML\nfrom .narre import NARRE\nfrom .d_attn import D_ATTN\nfrom .mpcn import MPCN\n"
  },
  {
    "path": "models/d_attn.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass D_ATTN(nn.Module):\n    '''\n    Interpretable Convolutional Neural Networks with Dual Local and Global Attention for Review Rating Prediction\n    Rescys 2017\n    '''\n    def __init__(self, opt):\n        super(D_ATTN, self).__init__()\n        self.opt = opt\n        self.num_fea = 1   # Document\n\n        self.user_net = Net(opt, 'user')\n        self.item_net = Net(opt, 'item')\n\n    def forward(self, datas):\n        user_reviews, item_reviews, uids, iids, user_item2id, item_user2id, user_doc, item_doc = datas\n        u_fea = self.user_net(user_doc)\n        i_fea = self.item_net(item_doc)\n        return u_fea, i_fea\n\n\nclass Net(nn.Module):\n    def __init__(self, opt, uori='user'):\n        super(Net, self).__init__()\n        self.opt = opt\n\n        self.word_embs = nn.Embedding(self.opt.vocab_size, self.opt.word_dim)  # vocab_size * 300\n        self.local_att = LocalAttention(opt.doc_len, win_size=5, emb_size=opt.word_dim, filters_num=opt.filters_num)\n        self.global_att = GlobalAttention(opt.doc_len, emb_size=opt.word_dim, filters_num=opt.filters_num)\n\n        fea_dim = opt.filters_num * 4\n        self.fc = nn.Sequential(\n            nn.Linear(fea_dim, fea_dim),\n            nn.Dropout(0.5),\n            nn.ReLU(),\n            nn.Linear(fea_dim, opt.id_emb_size),\n        )\n        self.dropout = nn.Dropout(self.opt.drop_out)\n        self.reset_para()\n\n    def forward(self, docs):\n        docs = self.word_embs(docs)  # size * 300\n        local_fea = self.local_att(docs)\n        global_fea = self.global_att(docs)\n        r_fea = torch.cat([local_fea]+global_fea, 1)\n        r_fea = self.dropout(r_fea)\n        r_fea = self.fc(r_fea)\n\n        return torch.stack([r_fea], 1)\n\n    def reset_para(self):\n        cnns = [self.local_att.cnn, self.local_att.att_conv[0]]\n        for cnn in 
cnns:\n            nn.init.xavier_uniform_(cnn.weight, gain=1)\n            nn.init.uniform_(cnn.bias, -0.1, 0.1)\n        for cnn in self.global_att.convs:\n            nn.init.xavier_uniform_(cnn.weight, gain=1)\n            nn.init.uniform_(cnn.bias, -0.1, 0.1)\n        nn.init.uniform_(self.fc[0].weight, -0.1, 0.1)\n        nn.init.uniform_(self.fc[-1].weight, -0.1, 0.1)\n        if self.opt.use_word_embedding:\n            w2v = torch.from_numpy(np.load(self.opt.w2v_path))\n            if self.opt.use_gpu:\n                self.word_embs.weight.data.copy_(w2v.cuda())\n            else:\n                self.word_embs.weight.data.copy_(w2v)\n        else:\n            nn.init.xavier_normal_(self.word_embs.weight)\n\n\nclass LocalAttention(nn.Module):\n    def __init__(self, seq_len, win_size, emb_size, filters_num):\n        super(LocalAttention, self).__init__()\n        self.att_conv = nn.Sequential(\n            nn.Conv2d(1, 1, kernel_size=(win_size, emb_size), padding=((win_size-1)//2, 0)),\n            nn.Sigmoid()\n        )\n        self.cnn = nn.Conv2d(1, filters_num, kernel_size=(1, emb_size))\n\n    def forward(self, x):\n        score = self.att_conv(x.unsqueeze(1)).squeeze(1)\n        out = x.mul(score)\n        out = out.unsqueeze(1)\n        out = torch.tanh(self.cnn(out)).squeeze(3)\n        out = F.max_pool1d(out, out.size(2)).squeeze(2)\n        return out\n\n\nclass GlobalAttention(nn.Module):\n    def __init__(self, seq_len, emb_size, filters_size=[2, 3, 4], filters_num=100):\n        super(GlobalAttention, self).__init__()\n        self.att_conv = nn.Sequential(\n            nn.Conv2d(1, 1, kernel_size=(seq_len, emb_size)),\n            nn.Sigmoid()\n        )\n        self.convs = nn.ModuleList([nn.Conv2d(1, filters_num, (k, emb_size)) for k in filters_size])\n\n    def forward(self, x):\n        x = x.unsqueeze(1)\n        score = self.att_conv(x)\n        x = x.mul(score)\n        conv_outs = [torch.tanh(cnn(x).squeeze(3)) for cnn in 
self.convs]\n        conv_outs = [F.max_pool1d(out, out.size(2)).squeeze(2) for out in conv_outs]\n        return conv_outs\n"
  },
  {
    "path": "models/daml.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\nimport numpy as np\nimport torch.nn.functional as F\n\n\nclass DAML(nn.Module):\n    '''\n    KDD 2019 DAML\n    '''\n    def __init__(self, opt):\n        super(DAML, self).__init__()\n\n        self.opt = opt\n        self.num_fea = 2  # ID + DOC\n        self.user_word_embs = nn.Embedding(opt.vocab_size, opt.word_dim)  # vocab_size * 300\n        self.item_word_embs = nn.Embedding(opt.vocab_size, opt.word_dim)  # vocab_size * 300\n\n        # share\n        self.word_cnn = nn.Conv2d(1, 1, (5, opt.word_dim), padding=(2, 0))\n        # document-level cnn\n        self.user_doc_cnn = nn.Conv2d(1, opt.filters_num, (opt.kernel_size, opt.word_dim), padding=(1, 0))\n        self.item_doc_cnn = nn.Conv2d(1, opt.filters_num, (opt.kernel_size, opt.word_dim), padding=(1, 0))\n        # abstract-level cnn\n        self.user_abs_cnn = nn.Conv2d(1, opt.filters_num, (opt.kernel_size, opt.filters_num))\n        self.item_abs_cnn = nn.Conv2d(1, opt.filters_num, (opt.kernel_size, opt.filters_num))\n\n        self.unfold = nn.Unfold((3, opt.filters_num), padding=(1, 0))\n\n        # fc layer\n        self.user_fc = nn.Linear(opt.filters_num, opt.id_emb_size)\n        self.item_fc = nn.Linear(opt.filters_num, opt.id_emb_size)\n\n        self.uid_embedding = nn.Embedding(opt.user_num + 2, opt.id_emb_size)\n        self.iid_embedding = nn.Embedding(opt.item_num + 2, opt.id_emb_size)\n\n        self.reset_para()\n\n    def forward(self, datas):\n        '''\n        user_reviews, item_reviews, uids, iids, \\\n        user_item2id, item_user2id, user_doc, item_doc = datas\n        '''\n        _, _, uids, iids, _, _, user_doc, item_doc = datas\n\n        # ------------------ review encoder ---------------------------------\n        user_word_embs = self.user_word_embs(user_doc)\n        item_word_embs = self.item_word_embs(item_doc)\n        # (BS, 100, DOC_LEN, 1)\n        user_local_fea = 
self.local_attention_cnn(user_word_embs, self.user_doc_cnn)\n        item_local_fea = self.local_attention_cnn(item_word_embs, self.item_doc_cnn)\n\n        # DOC_LEN * DOC_LEN\n        euclidean = (user_local_fea - item_local_fea.permute(0, 1, 3, 2)).pow(2).sum(1).sqrt()\n        attention_matrix = 1.0 / (1 + euclidean)\n        # (?, DOC_LEN)\n        user_attention = attention_matrix.sum(2)\n        item_attention = attention_matrix.sum(1)\n\n        # (?, 32)\n        user_doc_fea = self.local_pooling_cnn(user_local_fea, user_attention, self.user_abs_cnn, self.user_fc)\n        item_doc_fea = self.local_pooling_cnn(item_local_fea, item_attention, self.item_abs_cnn, self.item_fc)\n\n        # ------------------ id embedding ---------------------------------\n        uid_emb = self.uid_embedding(uids)\n        iid_emb = self.iid_embedding(iids)\n\n        use_fea = torch.stack([user_doc_fea, uid_emb], 1)\n        item_fea = torch.stack([item_doc_fea,  iid_emb], 1)\n\n        return use_fea, item_fea\n\n
    def local_attention_cnn(self, word_embs, doc_cnn):\n        '''\n        :Eq1 - Eq7\n        '''\n        local_att_words = self.word_cnn(word_embs.unsqueeze(1))\n        local_word_weight = torch.sigmoid(local_att_words.squeeze(1))\n        word_embs = word_embs * local_word_weight\n        d_fea = doc_cnn(word_embs.unsqueeze(1))\n        return d_fea\n\n    def local_pooling_cnn(self, feature, attention, cnn, fc):\n        '''\n        :Eq11 - Eq13\n        feature: (?, 100, DOC_LEN ,1)\n        attention: (?, DOC_LEN)\n        '''\n        bs, n_filters, doc_len, _ = feature.shape\n        feature = feature.permute(0, 3, 2, 1)  # bs * 1 * doc_len * embed\n        attention = attention.reshape(bs, 1, doc_len, 1)  # bs * doc\n        pools = feature * attention\n        pools = self.unfold(pools)\n        pools = pools.reshape(bs, 3, n_filters, doc_len)\n        pools = pools.sum(dim=1, keepdim=True)  # bs * 1 * n_filters * doc_len\n        pools = pools.transpose(2, 3)  # bs * 1 * doc_len * n_filters\n\n        abs_fea = cnn(pools).squeeze(3)  # ? (DOC_LEN-2), 100\n        abs_fea = F.avg_pool1d(abs_fea, abs_fea.size(2))  # ? 100\n        abs_fea = F.relu(fc(abs_fea.squeeze(2)))  # ? 32\n\n        return abs_fea\n\n
    def reset_para(self):\n\n        cnns = [self.word_cnn, self.user_doc_cnn, self.item_doc_cnn, self.user_abs_cnn, self.item_abs_cnn]\n        for cnn in cnns:\n            nn.init.xavier_normal_(cnn.weight)\n            nn.init.uniform_(cnn.bias, -0.1, 0.1)\n\n        fcs = [self.user_fc, self.item_fc]\n        for fc in fcs:\n            nn.init.uniform_(fc.weight, -0.1, 0.1)\n            nn.init.constant_(fc.bias, 0.1)\n\n        nn.init.uniform_(self.uid_embedding.weight, -0.1, 0.1)\n        nn.init.uniform_(self.iid_embedding.weight, -0.1, 0.1)\n\n        if self.opt.use_word_embedding:\n            w2v = torch.from_numpy(np.load(self.opt.w2v_path))\n            if self.opt.use_gpu:\n                self.user_word_embs.weight.data.copy_(w2v.cuda())\n                self.item_word_embs.weight.data.copy_(w2v.cuda())\n            else:\n                self.user_word_embs.weight.data.copy_(w2v)\n                self.item_word_embs.weight.data.copy_(w2v)\n        else:\n            nn.init.uniform_(self.user_word_embs.weight, -0.1, 0.1)\n            nn.init.uniform_(self.item_word_embs.weight, -0.1, 0.1)\n"
  },
  {
    "path": "models/deepconn.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\nimport numpy as np\nimport torch.nn.functional as F\n\n\nclass DeepCoNN(nn.Module):\n    '''\n    deep conn 2017\n    '''\n    def __init__(self, opt, uori='user'):\n        super(DeepCoNN, self).__init__()\n        self.opt = opt\n        self.num_fea = 1 # DOC\n\n        self.user_word_embs = nn.Embedding(opt.vocab_size, opt.word_dim)  # vocab_size * 300\n        self.item_word_embs = nn.Embedding(opt.vocab_size, opt.word_dim)  # vocab_size * 300\n\n        self.user_cnn = nn.Conv2d(1, opt.filters_num, (opt.kernel_size, opt.word_dim))\n        self.item_cnn = nn.Conv2d(1, opt.filters_num, (opt.kernel_size, opt.word_dim))\n\n        self.user_fc_linear = nn.Linear(opt.filters_num, opt.fc_dim)\n        self.item_fc_linear = nn.Linear(opt.filters_num, opt.fc_dim)\n        self.dropout = nn.Dropout(self.opt.drop_out)\n\n        self.reset_para()\n\n    def forward(self, datas):\n        _, _, uids, iids, _, _, user_doc, item_doc = datas\n\n        user_doc = self.user_word_embs(user_doc)\n        item_doc = self.item_word_embs(item_doc)\n\n        u_fea = F.relu(self.user_cnn(user_doc.unsqueeze(1))).squeeze(3)  # .permute(0, 2, 1)\n        i_fea = F.relu(self.item_cnn(item_doc.unsqueeze(1))).squeeze(3)  # .permute(0, 2, 1)\n        u_fea = F.max_pool1d(u_fea, u_fea.size(2)).squeeze(2)\n        i_fea = F.max_pool1d(i_fea, i_fea.size(2)).squeeze(2)\n        u_fea = self.dropout(self.user_fc_linear(u_fea))\n        i_fea = self.dropout(self.item_fc_linear(i_fea))\n\n        return torch.stack([u_fea], 1), torch.stack([i_fea], 1)\n\n    def reset_para(self):\n\n        for cnn in [self.user_cnn, self.item_cnn]:\n            nn.init.xavier_normal_(cnn.weight)\n            nn.init.constant_(cnn.bias, 0.1)\n\n        for fc in [self.user_fc_linear, self.item_fc_linear]:\n            nn.init.uniform_(fc.weight, -0.1, 0.1)\n            nn.init.constant_(fc.bias, 0.1)\n\n        if 
self.opt.use_word_embedding:\n            w2v = torch.from_numpy(np.load(self.opt.w2v_path))\n            if self.opt.use_gpu:\n                self.user_word_embs.weight.data.copy_(w2v.cuda())\n                self.item_word_embs.weight.data.copy_(w2v.cuda())\n            else:\n                self.user_word_embs.weight.data.copy_(w2v)\n                self.item_word_embs.weight.data.copy_(w2v)\n        else:\n            nn.init.uniform_(self.user_word_embs.weight, -0.1, 0.1)\n            nn.init.uniform_(self.item_word_embs.weight, -0.1, 0.1)\n"
  },
  {
    "path": "models/mpcn.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\nimport numpy as np\nimport torch.nn.functional as F\n\n\nclass MPCN(nn.Module):\n    '''\n    Multi-Pointer Co-Attention Network for Recommendation\n    WWW 2018\n    '''\n    def __init__(self, opt, head=3):\n        '''\n        head: the number of pointers\n        '''\n        super(MPCN, self).__init__()\n\n        self.opt = opt\n        self.num_fea = 1  # ID + DOC\n        self.head = head\n        self.user_word_embs = nn.Embedding(opt.vocab_size, opt.word_dim)  # vocab_size * 300\n        self.item_word_embs = nn.Embedding(opt.vocab_size, opt.word_dim)  # vocab_size * 300\n\n        # review gate\n        self.fc_g1 = nn.Linear(opt.word_dim, opt.word_dim)\n        self.fc_g2 = nn.Linear(opt.word_dim, opt.word_dim)\n\n        # multi points\n        self.review_coatt = nn.ModuleList([Co_Attention(opt.word_dim, gumbel=True, pooling='max') for _ in range(head)])\n        self.word_coatt = nn.ModuleList([Co_Attention(opt.word_dim, gumbel=False, pooling='avg') for _ in range(head)])\n\n        # final fc\n        self.u_fc = self.fc_layer()\n        self.i_fc = self.fc_layer()\n\n        self.drop_out = nn.Dropout(opt.drop_out)\n        self.reset_para()\n\n    def fc_layer(self):\n        return nn.Sequential(\n            nn.Linear(self.opt.word_dim * self.head, self.opt.word_dim),\n            nn.ReLU(),\n            nn.Linear(self.opt.word_dim, self.opt.id_emb_size)\n        )\n\n    def forward(self, datas):\n        '''\n        user_reviews, item_reviews, uids, iids, \\\n        user_item2id, item_user2id, user_doc, item_doc = datas\n        :user_reviews: B * L1 * N\n        :item_reviews: B * L2 * N\n        '''\n        user_reviews, item_reviews, _, _, _, _, _, _ = datas\n\n        # ------------------review-level co-attention ---------------------------------\n        u_word_embs = self.user_word_embs(user_reviews)\n        i_word_embs = 
self.item_word_embs(item_reviews)\n        u_reviews = self.review_gate(u_word_embs)\n        i_reviews = self.review_gate(i_word_embs)\n        u_fea = []\n        i_fea = []\n        for i in range(self.head):\n            r_coatt = self.review_coatt[i]\n            w_coatt = self.word_coatt[i]\n\n            # ------------------review-level co-attention ---------------------------------\n            p_u, p_i = r_coatt(u_reviews, i_reviews)             # B * L1/2 * 1\n            # ------------------word-level co-attention ---------------------------------\n            u_r_words = user_reviews.permute(0, 2, 1).float().bmm(p_u)   # (B * N * L1) X (B * L1 * 1)\n            i_r_words = item_reviews.permute(0, 2, 1).float().bmm(p_i)   # (B * N * L2) X (B * L2 * 1)\n            u_words = self.user_word_embs(u_r_words.squeeze(2).long())  # B * N * d\n            i_words = self.item_word_embs(i_r_words.squeeze(2).long())  # B * N * d\n            p_u, p_i = w_coatt(u_words, i_words)                 # B * N * 1\n            u_w_fea = u_words.permute(0, 2, 1).bmm(p_u).squeeze(2)\n            i_w_fea = u_words.permute(0, 2, 1).bmm(p_i).squeeze(2)\n            u_fea.append(u_w_fea)\n            i_fea.append(i_w_fea)\n\n        u_fea = torch.cat(u_fea, 1)\n        i_fea = torch.cat(i_fea, 1)\n\n        u_fea = self.drop_out(self.u_fc(u_fea))\n        i_fea = self.drop_out(self.i_fc(i_fea))\n\n        return torch.stack([u_fea], 1), torch.stack([i_fea], 1)\n\n    def review_gate(self, reviews):\n        # Eq 1\n        reviews = reviews.sum(2)\n        return torch.sigmoid(self.fc_g1(reviews)) * torch.tanh(self.fc_g2(reviews))\n\n    def reset_para(self):\n        for fc in [self.fc_g1, self.fc_g2, self.u_fc[0], self.u_fc[-1], self.i_fc[0], self.i_fc[-1]]:\n            nn.init.uniform_(fc.weight, -0.1, 0.1)\n            nn.init.uniform_(fc.bias, -0.1, 0.1)\n\n        if self.opt.use_word_embedding:\n            w2v = torch.from_numpy(np.load(self.opt.w2v_path))\n            
if self.opt.use_gpu:\n                self.user_word_embs.weight.data.copy_(w2v.cuda())\n                self.item_word_embs.weight.data.copy_(w2v.cuda())\n            else:\n                self.user_word_embs.weight.data.copy_(w2v)\n                self.item_word_embs.weight.data.copy_(w2v)\n        else:\n            nn.init.uniform_(self.user_word_embs.weight, -0.1, 0.1)\n            nn.init.uniform_(self.item_word_embs.weight, -0.1, 0.1)\n\n\nclass Co_Attention(nn.Module):\n    '''\n    review-level and word-level co-attention module\n    Eq (2,3, 10,11)\n    '''\n    def __init__(self, dim, gumbel, pooling):\n        super(Co_Attention, self).__init__()\n        self.gumbel = gumbel\n        self.pooling = pooling\n        self.M = nn.Parameter(torch.randn(dim, dim))\n        self.fc_u = nn.Linear(dim, dim)\n        self.fc_i = nn.Linear(dim, dim)\n\n        self.reset_para()\n\n    def reset_para(self):\n        nn.init.xavier_uniform_(self.M, gain=1)\n        nn.init.uniform_(self.fc_u.weight, -0.1, 0.1)\n        nn.init.uniform_(self.fc_u.bias, -0.1, 0.1)\n        nn.init.uniform_(self.fc_i.weight, -0.1, 0.1)\n        nn.init.uniform_(self.fc_i.bias, -0.1, 0.1)\n\n    def forward(self, u_fea, i_fea):\n        '''\n        u_fea: B * L1 * d\n        i_fea: B * L2 * d\n        return:\n        B * L1 * 1\n        B * L2 * 1\n        '''\n        u = self.fc_u(u_fea)\n        i = self.fc_i(i_fea)\n        S = u.matmul(self.M).bmm(i.permute(0, 2, 1))  # B * L1 * L2 Eq(2/10), we transport item instead user\n        if self.pooling == 'max':\n            u_score = S.max(2)[0]  # B * L1\n            i_score = S.max(1)[0]  # B * L2\n        else:\n            u_score = S.mean(2)  # B * L1\n            i_score = S.mean(1)  # B * L2\n        if self.gumbel:\n            p_u = F.gumbel_softmax(u_score, hard=True, dim=1)\n            p_i = F.gumbel_softmax(i_score, hard=True, dim=1)\n        else:\n            p_u = F.softmax(u_score, dim=1)\n            p_i = 
F.softmax(i_score, dim=1)\n        return p_u.unsqueeze(2), p_i.unsqueeze(2)\n"
  },
  {
    "path": "models/narre.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\nimport numpy as np\nimport torch.nn.functional as F\n\n\nclass NARRE(nn.Module):\n    '''\n    NARRE: WWW 2018\n    '''\n    def __init__(self, opt):\n        super(NARRE, self).__init__()\n        self.opt = opt\n        self.num_fea = 2  # ID + Review\n\n        self.user_net = Net(opt, 'user')\n        self.item_net = Net(opt, 'item')\n\n    def forward(self, datas):\n        user_reviews, item_reviews, uids, iids, user_item2id, item_user2id, user_doc, item_doc = datas\n        u_fea = self.user_net(user_reviews, uids, user_item2id)\n        i_fea = self.item_net(item_reviews, iids, item_user2id)\n        return u_fea, i_fea\n\n\nclass Net(nn.Module):\n    def __init__(self, opt, uori='user'):\n        super(Net, self).__init__()\n        self.opt = opt\n\n        if uori == 'user':\n            id_num = self.opt.user_num\n            ui_id_num = self.opt.item_num\n        else:\n            id_num = self.opt.item_num\n            ui_id_num = self.opt.user_num\n\n        self.id_embedding = nn.Embedding(id_num, self.opt.id_emb_size)  # user/item num * 32\n        self.word_embs = nn.Embedding(self.opt.vocab_size, self.opt.word_dim)  # vocab_size * 300\n        self.u_i_id_embedding = nn.Embedding(ui_id_num, self.opt.id_emb_size)\n\n        self.review_linear = nn.Linear(self.opt.filters_num, self.opt.id_emb_size)\n        self.id_linear = nn.Linear(self.opt.id_emb_size, self.opt.id_emb_size, bias=False)\n        self.attention_linear = nn.Linear(self.opt.id_emb_size, 1)\n        self.fc_layer = nn.Linear(self.opt.filters_num, self.opt.id_emb_size)\n\n        self.cnn = nn.Conv2d(1, opt.filters_num, (opt.kernel_size, opt.word_dim))\n\n        self.dropout = nn.Dropout(self.opt.drop_out)\n        self.reset_para()\n\n    def forward(self, reviews, ids, ids_list):\n        # --------------- word embedding ----------------------------------\n        reviews = self.word_embs(reviews)  # 
size * 300\n        bs, r_num, r_len, wd = reviews.size()\n        reviews = reviews.view(-1, r_len, wd)\n\n        id_emb = self.id_embedding(ids)\n\n        u_i_id_emb = self.u_i_id_embedding(ids_list)\n\n        # --------cnn for review--------------------\n        fea = F.relu(self.cnn(reviews.unsqueeze(1))).squeeze(3)  # .permute(0, 2, 1)\n        fea = F.max_pool1d(fea, fea.size(2)).squeeze(2)\n        fea = fea.view(-1, r_num, fea.size(1))\n\n        # ------------------linear attention-------------------------------\n        rs_mix = F.relu(self.review_linear(fea) + self.id_linear(F.relu(u_i_id_emb)))\n        att_score = self.attention_linear(rs_mix)\n        att_weight = F.softmax(att_score, 1)\n        r_fea = fea * att_weight\n        r_fea = r_fea.sum(1)\n        r_fea = self.dropout(r_fea)\n\n        return torch.stack([id_emb, self.fc_layer(r_fea)], 1)\n\n    def reset_para(self):\n        nn.init.xavier_normal_(self.cnn.weight)\n        nn.init.constant_(self.cnn.bias, 0.1)\n\n        nn.init.uniform_(self.id_linear.weight, -0.1, 0.1)\n\n        nn.init.uniform_(self.review_linear.weight, -0.1, 0.1)\n        nn.init.constant_(self.review_linear.bias, 0.1)\n\n        nn.init.uniform_(self.attention_linear.weight, -0.1, 0.1)\n        nn.init.constant_(self.attention_linear.bias, 0.1)\n\n        nn.init.uniform_(self.fc_layer.weight, -0.1, 0.1)\n        nn.init.constant_(self.fc_layer.bias, 0.1)\n        if self.opt.use_word_embedding:\n            w2v = torch.from_numpy(np.load(self.opt.w2v_path))\n            if self.opt.use_gpu:\n                self.word_embs.weight.data.copy_(w2v.cuda())\n            else:\n                self.word_embs.weight.data.copy_(w2v)\n        else:\n            nn.init.xavier_normal_(self.word_embs.weight)\n        nn.init.uniform_(self.id_embedding.weight, a=-0.1, b=0.1)\n        nn.init.uniform_(self.u_i_id_embedding.weight, a=-0.1, b=0.1)\n"
  },
  {
    "path": "pro_data/data_pro.py",
    "content": "# -*- coding: utf-8 -*-\nimport json\nimport pandas as pd\nimport re\nimport sys\nimport os\nimport numpy as np\nimport time\nfrom sklearn.model_selection import train_test_split\nfrom operator import itemgetter\nimport gensim\nfrom collections import defaultdict\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\nP_REVIEW = 0.85\nMAX_DF = 0.7\nMAX_VOCAB = 50000\nDOC_LEN = 500\nPRE_W2V_BIN_PATH = \"\"  # the pre-trained word2vec files\n\n\ndef now():\n    return str(time.strftime('%Y-%m-%d %H:%M:%S'))\n\n\ndef get_count(data, id):\n    ids = set(data[id].tolist())\n    return ids\n\n\ndef numerize(data):\n    uid = list(map(lambda x: user2id[x], data['user_id']))\n    iid = list(map(lambda x: item2id[x], data['item_id']))\n    data['user_id'] = uid\n    data['item_id'] = iid\n    return data\n\n\ndef clean_str(string):\n    string = re.sub(r\"[^A-Za-z0-9]\", \" \", string)\n    string = re.sub(r\"\\'s\", \" \\'s\", string)\n    string = re.sub(r\"\\'ve\", \" \\'ve\", string)\n    string = re.sub(r\"n\\'t\", \" n\\'t\", string)\n    string = re.sub(r\"\\'re\", \" \\'re\", string)\n    string = re.sub(r\"\\'d\", \" \\'d\", string)\n    string = re.sub(r\"\\'ll\", \" \\'ll\", string)\n    string = re.sub(r\",\", \" , \", string)\n    string = re.sub(r\"!\", \" ! \", string)\n    string = re.sub(r\"\\(\", \" \\( \", string)\n    string = re.sub(r\"\\)\", \" \\) \", string)\n    string = re.sub(r\"\\?\", \" \\? \", string)\n    string = re.sub(r\"\\s{2,}\", \" \", string)\n    string = re.sub(r\"\\s{2,}\", \" \", string)\n    string = re.sub(r\"sssss \", \" \", string)\n    return string.strip().lower()\n\n\ndef bulid_vocbulary(xDict):\n    rawReviews = []\n    for (id, text) in xDict.items():\n        rawReviews.append(' '.join(text))\n    return rawReviews\n\n\ndef build_doc(u_reviews_dict, i_reviews_dict):\n    '''\n    1. extract the vocab\n    2. 
filter the reviews and documents of users and items\n    '''\n    u_reviews = []\n    for ind in range(len(u_reviews_dict)):\n        u_reviews.append(' <SEP> '.join(u_reviews_dict[ind]))\n\n    i_reviews = []\n    for ind in range(len(i_reviews_dict)):\n        i_reviews.append(' <SEP> '.join(i_reviews_dict[ind]))\n\n    vectorizer = TfidfVectorizer(max_df=MAX_DF, max_features=MAX_VOCAB)\n    vectorizer.fit(u_reviews)\n    vocab = vectorizer.vocabulary_\n    vocab['<SEP>'] = MAX_VOCAB\n\n    def clean_review(rDict):\n        new_dict = {}\n        for k, text in rDict.items():\n            new_reviews = []\n            for r in text:\n                words = ' '.join([w for w in r.split() if w in vocab])\n                new_reviews.append(words)\n            new_dict[k] = new_reviews\n        return new_dict\n\n    def clean_doc(raw):\n        new_raw = []\n        for line in raw:\n            review = [word for word in line.split() if word in vocab]\n            if len(review) > DOC_LEN:\n                review = review[:DOC_LEN]\n            new_raw.append(review)\n        return new_raw\n\n    u_reviews_dict = clean_review(u_reviews_dict)\n    i_reviews_dict = clean_review(i_reviews_dict)\n\n    u_doc = clean_doc(u_reviews)\n    i_doc = clean_doc(i_reviews)\n\n    return vocab, u_doc, i_doc, u_reviews_dict, i_reviews_dict\n\n\ndef countNum(xDict):\n    minNum = 100\n    maxNum = 0\n    sumNum = 0\n    maxSent = 0\n    minSent = 3000\n    ReviewLenList = []\n    SentLenList = []\n    for (i, text) in xDict.items():\n        sumNum = sumNum + len(text)\n        if len(text) < minNum:\n            minNum = len(text)\n        if len(text) > maxNum:\n            maxNum = len(text)\n        ReviewLenList.append(len(text))\n        for sent in text:\n            if sent == \"\":\n                continue\n            wordTokens = sent.split()\n            if len(wordTokens) > maxSent:\n                maxSent = 
len(wordTokens)\n            if len(wordTokens) < minSent:\n                minSent = len(wordTokens)\n            SentLenList.append(len(wordTokens))\n    averageNum = sumNum // (len(xDict))\n\n    x = np.sort(SentLenList)\n    xLen = len(x)\n    pSentLen = x[int(P_REVIEW * xLen) - 1]\n    x = np.sort(ReviewLenList)\n    xLen = len(x)\n    pReviewLen = x[int(P_REVIEW * xLen) - 1]\n\n    return minNum, maxNum, averageNum, maxSent, minSent, pReviewLen, pSentLen\n\n\nif __name__ == '__main__':\n\n    start_time = time.time()\n    assert(len(sys.argv) >= 2)\n    filename = sys.argv[1]\n\n    yelp_data = False\n    if len(sys.argv) > 2 and sys.argv[2] == 'yelp':\n        # yelp dataset\n        yelp_data = True\n        save_folder = '../dataset/' + filename[:-3]+\"_data\"\n    else:\n        # amazon dataset\n        save_folder = '../dataset/' + filename[:-7]+\"_data\"\n    print(f\"dataset name: {save_folder}\")\n\n    if not os.path.exists(save_folder + '/train'):\n        os.makedirs(save_folder + '/train')\n    if not os.path.exists(save_folder + '/val'):\n        os.makedirs(save_folder + '/val')\n    if not os.path.exists(save_folder + '/test'):\n        os.makedirs(save_folder + '/test')\n\n    if len(PRE_W2V_BIN_PATH) == 0:\n        print(\"Warning: the word embedding file is not provided, embeddings will be initialized randomly\")\n    file = open(filename, errors='ignore')\n\n    print(f\"{now()}: Step1: loading raw review datasets...\")\n\n    users_id = []\n    items_id = []\n    ratings = []\n    reviews = []\n\n    if yelp_data:\n        for line in file:\n            value = line.split('\\t')\n            reviews.append(value[2])\n            users_id.append(value[0])\n            items_id.append(value[1])\n            ratings.append(value[3])\n    else:\n        for line in file:\n            js = json.loads(line)\n            if str(js['reviewerID']) == 'unknown':\n                print(\"unknown user id\")\n                continue\n            if str(js['asin']) == 
'unknown':\n                print(\"unknown item id\")\n                continue\n            try:\n                reviews.append(js['reviewText'])\n                users_id.append(str(js['reviewerID']))\n                items_id.append(str(js['asin']))\n                ratings.append(str(js['overall']))\n            except KeyError:\n                continue\n\n    data_frame = {'user_id': pd.Series(users_id), 'item_id': pd.Series(items_id),\n                  'ratings': pd.Series(ratings), 'reviews': pd.Series(reviews)}\n    data = pd.DataFrame(data_frame)     # [['user_id', 'item_id', 'ratings', 'reviews']]\n    del users_id, items_id, ratings, reviews\n\n    uidList, iidList = get_count(data, 'user_id'), get_count(data, 'item_id')\n    userNum_all = len(uidList)\n    itemNum_all = len(iidList)\n    print(\"===============Start: all rawData size======================\")\n    print(f\"dataNum: {data.shape[0]}\")\n    print(f\"userNum: {userNum_all}\")\n    print(f\"itemNum: {itemNum_all}\")\n    print(f\"data density: {data.shape[0]/float(userNum_all * itemNum_all):.4f}\")\n    print(\"===============End: rawData size========================\")\n\n    user2id = dict((uid, i) for(i, uid) in enumerate(uidList))\n    item2id = dict((iid, i) for(i, iid) in enumerate(iidList))\n    data = numerize(data)\n\n    print(f\"-\"*60)\n    print(f\"{now()} Step2: split datasets into train/val/test, save into npy data\")\n    data_train, data_test = train_test_split(data, test_size=0.2, random_state=1234)\n    uids_train, iids_train = get_count(data_train, 'user_id'), get_count(data_train, 'item_id')\n    userNum = len(uids_train)\n    itemNum = len(iids_train)\n    print(\"===============Start: no-preprocess: trainData size======================\")\n    print(\"dataNum: {}\".format(data_train.shape[0]))\n    print(\"userNum: {}\".format(userNum))\n    print(\"itemNum: {}\".format(itemNum))\n    print(\"===============End: no-preprocess: trainData size========================\")\n\n    
uidMiss = []\n    iidMiss = []\n    if userNum != userNum_all or itemNum != itemNum_all:\n        for uid in range(userNum_all):\n            if uid not in uids_train:\n                uidMiss.append(uid)\n        for iid in range(itemNum_all):\n            if iid not in iids_train:\n                iidMiss.append(iid)\n    uid_index = []\n    for uid in uidMiss:\n        index = data_test.index[data_test['user_id'] == uid].tolist()\n        uid_index.extend(index)\n    data_train = pd.concat([data_train, data_test.loc[uid_index]])\n\n    iid_index = []\n    for iid in iidMiss:\n        index = data_test.index[data_test['item_id'] == iid].tolist()\n        iid_index.extend(index)\n    data_train = pd.concat([data_train, data_test.loc[iid_index]])\n\n    all_index = list(set().union(uid_index, iid_index))\n    data_test = data_test.drop(all_index)\n\n    # split validation set and test set\n    data_test, data_val = train_test_split(data_test, test_size=0.5, random_state=1234)\n    uidList_train, iidList_train = get_count(data_train, 'user_id'), get_count(data_train, 'item_id')\n    userNum = len(uidList_train)\n    itemNum = len(iidList_train)\n    print(\"===============Start--process finished: trainData size======================\")\n    print(\"dataNum: {}\".format(data_train.shape[0]))\n    print(\"userNum: {}\".format(userNum))\n    print(\"itemNum: {}\".format(itemNum))\n    print(\"===============End-process finished: trainData size========================\")\n\n    def extract(data_dict):\n        x = []\n        y = []\n        for i in data_dict.values:\n            uid = i[0]\n            iid = i[1]\n            x.append([uid, iid])\n            y.append(float(i[2]))\n        return x, y\n\n    x_train, y_train = extract(data_train)\n    x_val, y_val = extract(data_val)\n    x_test, y_test = extract(data_test)\n\n    np.save(f\"{save_folder}/train/Train.npy\", x_train)\n    np.save(f\"{save_folder}/train/Train_Score.npy\", y_train)\n    
np.save(f\"{save_folder}/val/Val.npy\", x_val)\n    np.save(f\"{save_folder}/val/Val_Score.npy\", y_val)\n    np.save(f\"{save_folder}/test/Test.npy\", x_test)\n    np.save(f\"{save_folder}/test/Test_Score.npy\", y_test)\n\n    print(now())\n    print(f\"Train data size: {len(x_train)}\")\n    print(f\"Val data size: {len(x_val)}\")\n    print(f\"Test data size: {len(x_test)}\")\n\n    print(f\"-\"*60)\n    print(f\"{now()} Step3: Construct the vocab and user/item reviews from training set.\")\n    # 2: build vocabulary only with train dataset\n    user_reviews_dict = {}\n    item_reviews_dict = {}\n    user_iid_dict = {}\n    item_uid_dict = {}\n    user_len = defaultdict(int)\n    item_len = defaultdict(int)\n\n    for i in data_train.values:\n        str_review = clean_str(i[3].encode('ascii', 'ignore').decode('ascii'))\n\n        if len(str_review.strip()) == 0:\n            str_review = \"<unk>\"\n\n        if i[0] in user_reviews_dict:\n            user_reviews_dict[i[0]].append(str_review)\n            user_iid_dict[i[0]].append(i[1])\n        else:\n            user_reviews_dict[i[0]] = [str_review]\n            user_iid_dict[i[0]] = [i[1]]\n\n        if i[1] in item_reviews_dict:\n            item_reviews_dict[i[1]].append(str_review)\n            item_uid_dict[i[1]].append(i[0])\n        else:\n            item_reviews_dict[i[1]] = [str_review]\n            item_uid_dict[i[1]] = [i[0]]\n\n    vocab, user_review2doc, item_review2doc, user_reviews_dict, item_reviews_dict = build_doc(user_reviews_dict, item_reviews_dict)\n    word_index = {}\n    word_index['<unk>'] = 0\n    for i, w in enumerate(vocab.keys(), 1):\n        word_index[w] = i\n    print(f\"The vocab size: {len(word_index)}\")\n    print(f\"Average user document length: {sum([len(i) for i in user_review2doc])/len(user_review2doc)}\")\n    print(f\"Average item document length: {sum([len(i) for i in item_review2doc])/len(item_review2doc)}\")\n\n    print(now())\n    u_minNum, u_maxNum, 
u_averageNum, u_maxSent, u_minSent, u_pReviewLen, u_pSentLen = countNum(user_reviews_dict)\n    print(\"Users have at least {} and at most {} reviews ({} on average); \" \\\n          \"max sentence length {}, min sentence length {}; \" \\\n          \"review count per user set to {}, max sentence length set to {}\".format(u_minNum, u_maxNum, u_averageNum, u_maxSent, u_minSent, u_pReviewLen, u_pSentLen))\n    i_minNum, i_maxNum, i_averageNum, i_maxSent, i_minSent, i_pReviewLen, i_pSentLen = countNum(item_reviews_dict)\n    print(\"Items have at least {} and at most {} reviews ({} on average); \" \\\n          \"max sentence length {}, min sentence length {}; \" \\\n          \"review count per item set to {}, max sentence length set to {}\".format(i_minNum, i_maxNum, i_averageNum, i_maxSent, i_minSent, i_pReviewLen, i_pSentLen))\n    print(\"Final max sentence length (the larger of the two): {}\".format(max(u_pSentLen, i_pSentLen)))\n    # ########################################################################################################\n    maxSentLen = max(u_pSentLen, i_pSentLen)\n    minSentlen = 1\n\n    userReview2Index = []\n    userDoc2Index = []\n    user_iid_list = []\n\n    print(\"-\" * 60)\n    print(f\"{now()} Step4: pad all the text and id lists and save them into npy.\")\n\n    def padding_text(textList, num):\n        # truncate or zero-pad the review list to exactly `num` reviews\n        new_textList = []\n        if len(textList) >= num:\n            new_textList = textList[:num]\n        else:\n            padding = [[0] * len(textList[0]) for _ in range(num - len(textList))]\n            new_textList = textList + padding\n        return new_textList\n\n    def padding_ids(iids, num, pad_id):\n        # truncate or pad the id list to exactly `num` ids\n        if len(iids) >= num:\n            new_iids = iids[:num]\n        else:\n            new_iids = iids + [pad_id] * (num - len(iids))\n        return new_iids\n\n    def padding_doc(doc):\n        # truncate or zero-pad every document to DOC_LEN tokens\n        pDocLen = DOC_LEN\n        new_doc = []\n        for d in doc:\n            if len(d) < pDocLen:\n                d = d + [0] * (pDocLen - len(d))\n            else:\n                d = d[:pDocLen]\n            new_doc.append(d)\n\n        return new_doc, pDocLen\n\n    for i in range(userNum):\n        count_user = 0\n        dataList = []\n        a_count 
= 0\n\n        textList = user_reviews_dict[i]\n        u_iids = user_iid_dict[i]\n        u_reviewList = []\n\n        user_iid_list.append(padding_ids(u_iids, u_pReviewLen, itemNum+1))\n        doc2index = [word_index[w] for w in user_review2doc[i]]\n\n        for text in textList:\n            wordTokens = text.strip().split()\n            if len(wordTokens) == 0:\n                wordTokens = ['<unk>']\n            text2index = [word_index[w] for w in wordTokens]\n            if len(text2index) < maxSentLen:\n                text2index = text2index + [0] * (maxSentLen - len(text2index))\n            else:\n                text2index = text2index[:maxSentLen]\n            u_reviewList.append(text2index)\n\n        userReview2Index.append(padding_text(u_reviewList, u_pReviewLen))\n        userDoc2Index.append(doc2index)\n\n    userDoc2Index, userDocLen = padding_doc(userDoc2Index)\n    print(f\"user document length: {userDocLen}\")\n\n    itemReview2Index = []\n    itemDoc2Index = []\n    item_uid_list = []\n    for i in range(itemNum):\n        count_item = 0\n        dataList = []\n        textList = item_reviews_dict[i]\n        i_uids = item_uid_dict[i]\n        i_reviewList = []  # to be filled below\n        i_reviewLen = []  # to be filled below\n        item_uid_list.append(padding_ids(i_uids, i_pReviewLen, userNum+1))\n        doc2index = [word_index[w] for w in item_review2doc[i]]\n\n        for text in textList:\n            wordTokens = text.strip().split()\n            if len(wordTokens) == 0:\n                wordTokens = ['<unk>']\n            text2index = [word_index[w] for w in wordTokens]\n            if len(text2index) < maxSentLen:\n                text2index = text2index + [0] * (maxSentLen - len(text2index))\n            else:\n                text2index = text2index[:maxSentLen]\n            i_reviewList.append(text2index)\n        itemReview2Index.append(padding_text(i_reviewList, i_pReviewLen))\n        itemDoc2Index.append(doc2index)\n\n    itemDoc2Index, itemDocLen = padding_doc(itemDoc2Index)\n    print(f\"item document length: {itemDocLen}\")\n\n    print(\"-\"*60)\n    print(f\"{now()} start writing npy...\")\n    np.save(f\"{save_folder}/train/userReview2Index.npy\", userReview2Index)\n    np.save(f\"{save_folder}/train/user_item2id.npy\", user_iid_list)\n    np.save(f\"{save_folder}/train/userDoc2Index.npy\", userDoc2Index)\n\n    np.save(f\"{save_folder}/train/itemReview2Index.npy\", itemReview2Index)\n    np.save(f\"{save_folder}/train/item_user2id.npy\", item_uid_list)\n    np.save(f\"{save_folder}/train/itemDoc2Index.npy\", itemDoc2Index)\n\n    print(f\"{now()} write finished\")\n\n    # ############################## generate w2v ##############################\n    print(\"-\"*60)\n    print(f\"{now()} Step5: start word embedding mapping...\")\n    vocab_item = sorted(word_index.items(), key=itemgetter(1))\n    w2v = []\n    out = 0\n    if PRE_W2V_BIN_PATH:\n        pre_word2v = gensim.models.KeyedVectors.load_word2vec_format(PRE_W2V_BIN_PATH, binary=True)\n    else:\n        pre_word2v = {}\n    print(f\"{now()} start extracting embeddings\")\n    for word, key in vocab_item:\n        if word in pre_word2v:\n            w2v.append(pre_word2v[word])\n        else:\n            # out-of-vocabulary word: fall back to a random embedding\n            out += 1\n            w2v.append(np.random.uniform(-1.0, 1.0, (300,)))\n    print(\"############################\")\n    print(f\"out of vocab: {out}\")\n    print(f\"w2v size: {len(w2v)}\")\n    print(\"############################\")\n    w2vArray = np.array(w2v)\n    print(w2vArray.shape)\n    np.save(f\"{save_folder}/train/w2v.npy\", w2vArray)\n    end_time = time.time()\n    print(f\"{now()} all steps finished, cost time: {end_time-start_time:.4f}s\")\n"
  }
]