Repository: YZHANG1270/Aspect-Based-Sentiment-Analysis
Branch: master
Commit: e5505ab3fdcf
Files: 30
Total size: 20.9 MB

Directory structure:
gitextract_rxbagi6o/
├── README.md
├── ai_challenge_sentiment/
│   ├── code/
│   │   └── sentiment_analysis2018_baseline/
│   │       ├── README.md
│   │       ├── __init__.py
│   │       ├── config.py
│   │       ├── data_process.py
│   │       ├── main_predict.py
│   │       ├── main_train.py
│   │       ├── model.py
│   │       └── requirements.txt
│   ├── model.py
│   └── train.py
├── aspect_predict.py
├── config.json
├── data/
│   ├── aspect/
│   │   ├── aspect_svc_test.xlsx
│   │   └── aspect_svc_train.xlsx
│   ├── chinese/
│   │   ├── CH_CAME_SB1_TEST.xlsx
│   │   ├── CH_PHNS_SB1_TEST.xlsx
│   │   ├── Chinese_phones_training.xlsx
│   │   └── camera_training.xlsx
│   └── polarity/
│       └── polarity_docu.xlsx
├── polarity_predict.py
├── train/
│   ├── aspect_classifier.py
│   ├── model/
│   │   ├── bilstm.py
│   │   └── model.py
│   └── polarity_classifier.py
└── utils/
    ├── __init__.py
    ├── baidu_tagging.py
    ├── data_process.py
    ├── grammar.py
    └── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# ABSA Aspect Based Sentiment Analysis

Although this is aspect (opinion)-level analysis, it also operates at the sentence level, because the analysis is performed sentence by sentence.

![](https://github.com/YZHANG1270/Aspect-Based-Sentiment-Analysis/blob/master/img/absa.png?raw=true)

##### Concept References
- ABSA reference presentation [[ppt](https://www.iaria.org/conferences2016/filesHUSO16/OrpheeDeClercq_Keynote_ABSA.pdf)]
- Alibaba Cloud's product review analysis [[link](https://help.aliyun.com/document_detail/64231.html?spm=5176.12095382.1232858.4.739e3b24xUnvbZ)]

| Parameter      | Meaning                                                       |
| -------------- | ------------------------------------------------------------ |
| textPolarity   | Overall sentiment polarity of the text: positive, neutral, or negative; returns -100 when the text field is invalid |
| textIntensity  | Overall sentiment intensity (range [-1, 1]; larger is more positive, smaller is more negative, near 0 is neutral) |
| aspectItem     | List of aspect sentiments; each element is a JSON object      |
| aspectCategory | Aspect category                                               |
| aspectIndex    | Start and end positions of the aspect term                    |
| aspectTerm     | Aspect term                                                   |
| opinionTerm    | Opinion (sentiment) term                                      |
| aspectPolarity | Polarity of the aspect span (positive, neutral, negative)     |
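To make the schema concrete, here is a hypothetical response in the shape the table describes (field names from the table; the values are invented for illustration):

```python
result = {
    "textPolarity": "negative",      # overall polarity: positive / neutral / negative
    "textIntensity": -0.62,          # in [-1, 1]; near 0 means neutral
    "aspectItem": [{
        "aspectCategory": "BATTERY#QUALITY",
        "aspectIndex": [3, 5],       # start and end position of the aspect term
        "aspectTerm": "电池",
        "opinionTerm": "不耐用",
        "aspectPolarity": "negative",
    }],
}
```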
##### Task Process
1. Extract aspect terms, sentence by sentence
2. Extract opinion terms, sentence by sentence
3. Locate the start and end positions of each aspect term
4. Aspect term -> E#A category classification
5. Opinion term -> polarity classification
6. Overall text sentiment polarity (positive / negative / neutral) with its probability

##### Done Tasks
Tasks actually completed, given the available datasets:
- [x] Per-sentence E#A classification
- [x] Per-sentence sentiment polarity analysis

##### To Do
- [ ] Opinion filtering: textual noise, fake reviews, paid posters, ads, opinion-free or meaningless text
- [ ] Negation handling

##### SemEval ABSA
- SemEval paper collection for NLP [[ACL](https://www.aclweb.org/anthology/)]
- SemEval - 2014 - ABSA [[competition](http://alt.qcri.org/semeval2014/task4/)] [[data](http://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools)]
- SemEval - 2015 - ABSA [[competition](http://alt.qcri.org/semeval2015/task12/)] [[data](http://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools)] [[paper](https://www.aclweb.org/anthology/S15-2082)]
- SemEval - 2016 - ABSA [[competition](http://alt.qcri.org/semeval2016/task5/)] [[data](http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools)] [[guideline](http://alt.qcri.org/semeval2016/task5/data/uploads/absa2016_annotationguidelines.pdf)] [[paper](https://www.aclweb.org/anthology/S16-1002)]
- bonus: CodaLab Competitions [[intro](https://www.hse.ru/data/2017/05/31/1171931089/CodaLabCompetitions.pdf)]

##### Reference GitHub Projects
The datasets are mostly from the 2014-2016 SemEval competitions.
- [data: self data] [Unsupervised-Aspect-Extraction](https://github.com/ruidan/Unsupervised-Aspect-Extraction)
- [data: SemEval-2016] [aspect-extraction](https://github.com/soujanyaporia/aspect-extraction)
- [data: SemEval-2015] [AspectBasedSentimentAnalysis](https://github.com/yardstick17/AspectBasedSentimentAnalysis). I ran this project: it combines grammatical analysis with machine learning, extracting aspect terms via grammar rules. The code is deeply nested, so reusing it directly is not recommended.
- [data: SemEval-2016] [Review_aspect_extraction](https://github.com/yafangy/Review_aspect_extraction)
- [data: SemEval-2014, 2016] [DE-CNN](https://github.com/howardhsu/DE-CNN)
- [data: SemEval-2015] [Coupled-Multi-layer-Attentions](https://github.com/happywwy/Coupled-Multi-layer-Attentions)
- [data: SemEval-2016 laptop] [mem_absa](https://github.com/ganeshjawahar/mem_absa)
- [data: SemEval-2014] [ABSA-PyTorch](https://github.com/songyouwei/ABSA-PyTorch)
- [data: SemEval-2014, 2016] [Attention_Based_LSTM_AspectBased_SA](https://github.com/gangeshwark/Attention_Based_LSTM_AspectBased_SA)
- [data: SemEval-2014] [ABSA_Keras](https://github.com/AlexYangLi/ABSA_Keras). Uses TensorFlow Hub; version problems with hub kept it from running.
- [data: SemEval-2016] [ABSA](https://github.com/LingxB/ABSA/tree/master/Data/SemEval)

##### Papers
- Deep Learning for Aspect-Based Sentiment Analysis [[paper](https://cs224d.stanford.edu/reports/WangBo.pdf)]
- Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings [[paper](https://www.aclweb.org/anthology/D15-1168)]
- Encoding Conversation Context for Neural Keyphrase Extraction from Microblog Posts [[paper](https://ai.tencent.com/ailab/media/publications/naacl2018/Encoding_Conversation_Context_for_Neural_Keyphrase_Extraction_from_Microblog_Posts.pdf)]
- End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF [[paper](https://arxiv.org/pdf/1603.01354.pdf)]
- [2012] Tag extraction and ranking in user reviews [[paper](http://lipiji.com/docs/li2011opinion.pdf)]

##### Datasets
###### Chinese
- AI-Challenge [[data](https://drive.google.com/file/d/1OInXRx_OmIJgK3ZdoFZnmqUi0rGfOaQo/view)]
- SemEval ABSA 2016 [[data](http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools)]

###### English
- Amazon product data [[data](http://jmcauley.ucsd.edu/data/amazon/)]
- Web data: Amazon reviews [[data](https://snap.stanford.edu/data/web-Amazon.html)]
- Amazon Fine Food Reviews [[kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews)]
- SemEval ABSA

#### Directions for Improvement
##### Character/Word/Sentence Text Embeddings
###### Chinese
- Chinese Word Vectors [[github](https://github.com/Embedding/Chinese-Word-Vectors)]
- nlp_chinese_corpus [[github](https://github.com/brightmart/nlp_chinese_corpus)]
- Open question: when vectorizing, should general-domain and domain-specific corpora be combined, or should each be embedded independently?
The table of contents of an ABSA book, useful for studying how the field is organized:

#### ABSA Book Outline
1. Introduction
2. Aspect-Based Sentiment Analysis (ABSA)
   - 2.1. The three tasks of ABSA
   - 2.2. Domain and benchmark datasets
   - 2.3. Previous approaches to ABSA tasks
   - 2.4. Evaluation measures of ABSA tasks
3. Deep Learning for ABSA
   - 3.1. Multiple layers of DNN
   - 3.2. Initialization of input vectors
     - 3.2.1. Word embeddings vectors
     - 3.2.2. Featuring vectors
     - 3.2.3. Part-Of-Speech (POS) and chunk tags
     - 3.2.4. Commonsense knowledge
   - 3.3. Training process of DNNs
   - 3.4. Convolutional Neural Network Model (CNN)
     - 3.4.1. Architecture
     - 3.4.2. Application in consumer review domain
   - 3.5. Recurrent Neural Network Models (RNN)
     - 3.5.1. Computation of RNN models
     - 3.5.2. Bidirectional RNN
     - 3.5.3. Attention mechanism and memory networks
     - 3.5.4. Application in the consumer review domain
     - 3.5.5. Application in targeted sentiment analysis
   - 3.6. Recursive Neural Network Model (RecNN)
     - 3.6.1. Architecture
     - 3.6.2. Application
   - 3.7. Hybrid models
4. Comparison of performance on benchmark datasets
   - 4.1. Opinion target extraction
   - 4.2. Aspect category detection
   - 4.3. Sentiment polarity of aspect-based consumer reviews
   - 4.4. Sentiment polarity of targeted text
5. Challenges
   - 5.1. Domain adaptation
   - 5.2. Multilingual application
   - 5.3. Technical requirements
   - 5.4. Linguistic complications
6. Conclusion
7. Appendix: List of Abbreviations
8. References

================================================
FILE: ai_challenge_sentiment/code/sentiment_analysis2018_baseline/README.md
================================================
AI Challenger Sentiment Analysis Baseline
=========================================

Overview
---
This project provides a baseline so contestants can get started quickly. It covers the full competition workflow: data loading, word segmentation, feature extraction, model definition and wrapping, model training, validation, persistence, and prediction. The baseline is only a simple reference; contestants are encouraged to use their imagination and build stronger models for this task.

Environment
---
* The main dependencies and their versions are listed in requirements.txt

Project Structure
---
* src/config.py: project configuration, mainly file read/save paths
* src/data_process.py: data processing, mainly data loading and preprocessing
* src/model.py: model definition and wrapping
* src/main_train.py: model training; the workflow covers data loading, segmentation, feature extraction, training, validation, and persistence
* src/main_predict.py: model prediction; the workflow covers loading the data and model, segmentation, prediction, and saving the results

Usage
---
* Configuration: set the file storage paths in config.py
* Training: run `nohup python main_train.py -mn your_model_name &` to train and save the model; the validation-set F1_score is reported in the log
* Prediction: run `nohup python main_predict.py -mn your_model_name &` to load the model from the previous step and predict on the test set

================================================
FILE: ai_challenge_sentiment/code/sentiment_analysis2018_baseline/__init__.py
================================================
#!/user/bin/env python
# -*- coding:utf-8 -*-

================================================
FILE: ai_challenge_sentiment/code/sentiment_analysis2018_baseline/data_process.py
================================================
#!/user/bin/env python
# -*- coding:utf-8 -*-
import pandas as pd
import jieba


# load data
def load_data_from_csv(file_name, header=0, encoding="utf-8"):
    data_df = pd.read_csv(file_name, header=header, encoding=encoding)
    return data_df


# segment words with jieba
def seg_words(contents):
    contents_segs = list()
    for content in contents:
        segs = jieba.lcut(content)
        contents_segs.append(" ".join(segs))
    return contents_segs
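As a quick illustration of what seg_words produces (sample sentences invented; the exact segmentation depends on jieba's dictionary), the space-joined output is exactly the format sklearn's TfidfVectorizer expects downstream:

from data_process import seg_words

docs = seg_words(["这家店的服务很好", "环境一般"])
print(docs)  # e.g. ['这家店 的 服务 很 好', '环境 一般']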
================================================
FILE: ai_challenge_sentiment/code/sentiment_analysis2018_baseline/main_predict.py
================================================
#!/user/bin/env python
# -*- coding:utf-8 -*-
from data_process import seg_words, load_data_from_csv
import config
import logging
import argparse
from sklearn.externals import joblib

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s [%(levelname)s] <%(processName)s> (%(threadName)s) %(message)s')
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-mn', '--model_name', type=str, nargs='?',
                        help='the name of model')
    args = parser.parse_args()
    model_name = args.model_name
    if not model_name:
        model_name = "model_dict.pkl"

    # load data
    logger.info("start load data")
    test_data_df = load_data_from_csv(config.test_data_path)

    # load model
    logger.info("start load model")
    classifier_dict = joblib.load(config.model_save_path + model_name)

    columns = test_data_df.columns.tolist()
    # seg words
    logger.info("start seg test data")
    content_test = test_data_df.iloc[:, 1]
    content_test = seg_words(content_test)
    logger.info("complete seg test data")

    # model predict
    logger.info("start predict test data")
    for column in columns[2:]:
        test_data_df[column] = classifier_dict[column].predict(content_test)
        logger.info("complete %s predict" % column)

    test_data_df.to_csv(config.test_data_predict_out_path, encoding="utf_8_sig", index=False)
    logger.info("complete predict test data")

================================================
FILE: ai_challenge_sentiment/code/sentiment_analysis2018_baseline/main_train.py
================================================
#!/user/bin/env python
# -*- coding:utf-8 -*-
from data_process import load_data_from_csv, seg_words
from model import TextClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import config
import logging
import numpy as np
from sklearn.externals import joblib
import os
import argparse

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s [%(levelname)s] <%(processName)s> (%(threadName)s) %(message)s')
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-mn', '--model_name', type=str, nargs='?',
                        help='the name of model')
    args = parser.parse_args()
    model_name = args.model_name
    if not model_name:
        model_name = "model_dict.pkl"

    # load train data
    logger.info("start load data")
    train_data_df = load_data_from_csv(config.train_data_path)
    validate_data_df = load_data_from_csv(config.validate_data_path)

    content_train = train_data_df.iloc[:, 1]
    logger.info("start seg train data")
    content_train = seg_words(content_train)
    logger.info("complete seg train data")

    columns = train_data_df.columns.values.tolist()
    logger.info("start train feature extraction")
    vectorizer_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 5), min_df=5, norm='l2')
    vectorizer_tfidf.fit(content_train)
    logger.info("complete train feature extraction models")
    logger.info("vocab shape: %s" % np.shape(vectorizer_tfidf.vocabulary_.keys()))

    # model train: one classifier per label column
    logger.info("start train model")
    classifier_dict = dict()
    for column in columns[2:]:
        label_train = train_data_df[column]
        text_classifier = TextClassifier(vectorizer=vectorizer_tfidf)
        logger.info("start train %s model" % column)
        text_classifier.fit(content_train, label_train)
        logger.info("complete train %s model" % column)
        classifier_dict[column] = text_classifier
    logger.info("complete train model")

    # validate model
    content_validate = validate_data_df.iloc[:, 1]
    logger.info("start seg validate data")
    content_validate = seg_words(content_validate)
    logger.info("complete seg validate data")

    logger.info("start validate model")
    f1_score_dict = dict()
    for column in columns[2:]:
        label_validate = validate_data_df[column]
        text_classifier = classifier_dict[column]
        f1_score = text_classifier.get_f1_score(content_validate, label_validate)
        f1_score_dict[column] = f1_score

    f1_score = np.mean(list(f1_score_dict.values()))
    str_score = "\n"
    for column in columns[2:]:
        str_score = str_score + column + ":" + str(f1_score_dict[column]) + "\n"

    logger.info("f1_scores: %s\n" % str_score)
    logger.info("f1_score: %s" % f1_score)
    logger.info("complete validate model")

    # save model
    logger.info("start save model")
    model_save_path = config.model_save_path
    if not os.path.exists(model_save_path):
        os.makedirs(model_save_path)

    joblib.dump(classifier_dict, model_save_path + model_name)
    logger.info("complete save model")
================================================
FILE: ai_challenge_sentiment/code/sentiment_analysis2018_baseline/model.py
================================================
#!/user/bin/env python
# -*- coding:utf-8 -*-
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB  # alternative classifier, currently unused
from sklearn.metrics import f1_score
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s [%(levelname)s] <%(processName)s> (%(threadName)s) %(message)s')
logger = logging.getLogger(__name__)


class TextClassifier():

    def __init__(self, vectorizer, classifier=None):
        # Note: the original signature defaulted `classifier` to MultinomialNB()
        # but then unconditionally overwrote it with an RBF SVC, so the argument
        # was ignored. Honor the argument and keep the effective default (SVC).
        if classifier is None:
            classifier = SVC(kernel="rbf")
            # classifier = SVC(kernel="linear")
        self.classifier = classifier
        self.vectorizer = vectorizer

    def features(self, x):
        return self.vectorizer.transform(x)

    def fit(self, x, y):
        self.classifier.fit(self.features(x), y)

    def predict(self, x):
        return self.classifier.predict(self.features(x))

    def score(self, x, y):
        return self.classifier.score(self.features(x), y)

    def get_f1_score(self, x, y):
        return f1_score(y, self.predict(x), average='macro')

================================================
FILE: ai_challenge_sentiment/code/sentiment_analysis2018_baseline/requirements.txt
================================================
# python==2.7.13 (interpreter version, not a pip-installable package)
numpy==1.13.1
pandas==0.20.3
jieba==0.39
scikit-learn==0.19.2

================================================
FILE: ai_challenge_sentiment/model.py
================================================
#!/user/bin/env python
# -*- coding:utf-8 -*-
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB  # alternative classifier, currently unused
from sklearn.metrics import f1_score
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s [%(levelname)s] <%(processName)s> (%(threadName)s) %(message)s')
logger = logging.getLogger(__name__)


class TextClassifier():

    def __init__(self, vectorizer, classifier=None):
        # same fix as in the baseline model.py: honor the argument,
        # default to an RBF SVC
        if classifier is None:
            classifier = SVC(kernel="rbf")
            # classifier = SVC(kernel="linear")
        self.classifier = classifier
        self.vectorizer = vectorizer

    def features(self, x):
        return self.vectorizer.transform(x)

    def fit(self, x, y):
        self.classifier.fit(self.features(x), y)

    def predict(self, x):
        return self.classifier.predict(self.features(x))

    def score(self, x, y):
        return self.classifier.score(self.features(x), y)

    def get_f1_score(self, x, y):
        return f1_score(y, self.predict(x), average='macro')
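A minimal usage sketch for the TextClassifier wrapper above (toy, already-segmented data invented for illustration; min_df is lowered to 1 so the tiny corpus survives the vectorizer, whereas the real pipeline uses min_df=5):

from sklearn.feature_extraction.text import TfidfVectorizer
from model import TextClassifier

# toy, space-joined (pre-segmented) documents and binary labels
texts = ["服务 很 好", "环境 一般", "味道 不错"]
labels = [1, 0, 1]

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 5), min_df=1)
vectorizer.fit(texts)

clf = TextClassifier(vectorizer=vectorizer)   # defaults to SVC(kernel="rbf")
clf.fit(texts, labels)
print(clf.predict(["味道 很 好"]))            # e.g. [1]
print(clf.get_f1_score(texts, labels))        # macro-averaged F1 on the toy data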
""" import os os.chdir("C:/Users/LUMI/Desktop/sentiment") import pandas as pd import jieba from model import TextClassifier from sklearn.feature_extraction.text import TfidfVectorizer import numpy as np from sklearn.externals import joblib def seg_words(contents): contents_segs = list() for content in contents: segs = jieba.lcut(content) contents_segs.append(" ".join(segs)) return contents_segs # load train data train_data_df = pd.read_csv('data/train/train.csv') validate_data_df = pd.read_csv('data/validation/validation.csv') content_train = train_data_df.iloc[:, 1] content_train = seg_words(content_train) columns = train_data_df.columns.values.tolist() vectorizer_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 5), min_df=5, norm='l2') vectorizer_tfidf.fit(content_train) # model train classifier_dict = dict() for column in columns[2:]: label_train = train_data_df[column] text_classifier = TextClassifier(vectorizer=vectorizer_tfidf) text_classifier.fit(content_train, label_train) classifier_dict[column] = text_classifier # validate model content_validate = validate_data_df.iloc[:, 1] content_validate = seg_words(content_validate) f1_score_dict = dict() for column in columns[2:]: label_validate = validate_data_df[column] text_classifier = classifier_dict[column] f1_score = text_classifier.get_f1_score(content_validate, label_validate) f1_score_dict[column] = f1_score f1_score = np.mean(list(f1_score_dict.values())) str_score = "\n" for column in columns[2:]: str_score = str_score + column + ":" + str(f1_score_dict[column]) + "\n" # save model joblib.dump(classifier_dict, model_save_path + model_name) ================================================ FILE: aspect_predict.py ================================================ # -*- coding: utf-8 -*- __author__ = 'ZhangYi' import os from sklearn.externals import joblib from utils.utils import delimiter from utils.data_process import seg_words,load_aspect_list class AspectPredict(object): def __init__(self): path_delimiter = delimiter() path_absa = os.path.abspath('.') # config model_name = 'aspect_svc' # todo: add to config path_config = path_absa + path_delimiter + 'config.json' # load model path_model = path_absa + path_delimiter + 'model' + path_delimiter + '{}.mdl'.format(model_name) self.model = joblib.load(path_model) # load aspect list self.aspect_list = load_aspect_list(path_config) def predict(self, text): # 1. generate result result = dict() result['text'] = text result['aspectCategory'] = [] # 2. seg words content_test = seg_words([text]) # 3. 
================================================
FILE: aspect_predict.py
================================================
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
import os
from sklearn.externals import joblib
from utils.utils import delimiter
from utils.data_process import seg_words, load_aspect_list


class AspectPredict(object):

    def __init__(self):
        path_delimiter = delimiter()
        path_absa = os.path.abspath('.')
        # config
        model_name = 'aspect_svc'  # todo: add to config
        path_config = path_absa + path_delimiter + 'config.json'
        # load model
        path_model = path_absa + path_delimiter + 'model' + path_delimiter + '{}.mdl'.format(model_name)
        self.model = joblib.load(path_model)
        # load aspect list
        self.aspect_list = load_aspect_list(path_config)

    def predict(self, text):
        # 1. generate result
        result = dict()
        result['text'] = text
        result['aspectCategory'] = []
        # 2. seg words
        content_test = seg_words([text])
        # 3. predict
        all_result = dict()
        for column in self.aspect_list:
            all_result[column] = self.model[column].predict(content_test)[0]
            if all_result[column] > 0.5:
                result['aspectCategory'].append(column)
        result['all_result'] = all_result
        print('PREDICT RESULT:', result)
        print('PREDICT ASPECT:', result['aspectCategory'])
        return result


if __name__ == "__main__":
    aspect = AspectPredict()
    aspect.predict('这块屏幕不错')
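For orientation, the dict returned by AspectPredict.predict has the following shape (a hypothetical run; values invented, category names drawn from aspect_list in config.json below):

result = {
    'text': '这块屏幕不错',
    'aspectCategory': ['DISPLAY#QUALITY'],
    'all_result': {'DISPLAY#QUALITY': 1, 'BATTERY#QUALITY': 0}
    # ...plus one 0/1 entry per remaining category in aspect_list
}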
================================================
FILE: config.json
================================================
{"aspect_list": ["HARDWARE#USABILITY", "BATTERY#USABILITY", "HARDWARE#QUALITY", "MEMORY#GENERAL", "OS#PRICE", "MULTIMEDIA_DEVICES#QUALITY", "MULTIMEDIA_DEVICES#OPERATION_PERFORMANCE", "PORTS#DESIGN_FEATURES", "MULTIMEDIA_DEVICES#USABILITY", "OS#GENERAL", "SUPPORT#MISCELLANEOUS", "KEYBOARD#GENERAL", "POWER_SUPPLY#OPERATION_PERFORMANCE", "PHONE#QUALITY", "MEMORY#DESIGN_FEATURES", "CPU#USABILITY", "OS#CONNECTIVITY", "SOFTWARE#MISCELLANEOUS", "CPU#OPERATION_PERFORMANCE", "KEYBOARD#USABILITY", "PORTS#USABILITY", "KEYBOARD#QUALITY", "HARD_DISC#QUALITY", "MULTIMEDIA_DEVICES#CONNECTIVITY", "SOFTWARE#OPERATION_PERFORMANCE", "MEMORY#USABILITY", "PHONE#CONNECTIVITY", "DISPLAY#OPERATION_PERFORMANCE", "PHONE#DESIGN_FEATURES", "KEYBOARD#OPERATION_PERFORMANCE", "HARDWARE#OPERATION_PERFORMANCE", "POWER_SUPPLY#CONNECTIVITY", "PHONE#USABILITY", "OS#QUALITY", "BATTERY#OPERATION_PERFORMANCE", "HARDWARE#CONNECTIVITY", "POWER_SUPPLY#QUALITY", "HARD_DISC#OPERATION_PERFORMANCE", "SUPPORT#QUALITY", "PHONE#OPERATION_PERFORMANCE", "CPU#GENERAL", "SUPPORT#USABILITY", "DISPLAY#QUALITY", "OS#DESIGN_FEATURES", "POWER_SUPPLY#USABILITY", "HARDWARE#DESIGN_FEATURES", "CPU#QUALITY", "PHONE#MISCELLANEOUS", "SOFTWARE#QUALITY", "OS#OPERATION_PERFORMANCE", "WARRANTY#OPERATION_PERFORMANCE", "PHONE#GENERAL", "PHONE#PRICE", "MULTIMEDIA_DEVICES#GENERAL", "PORTS#OPERATION_PERFORMANCE", "POWER_SUPPLY#GENERAL", "KEYBOARD#DESIGN_FEATURES", "MEMORY#QUALITY", "SOFTWARE#USABILITY", "DISPLAY#DESIGN_FEATURES", "BATTERY#QUALITY", "PORTS#CONNECTIVITY", "PORTS#QUALITY", "HARDWARE#GENERAL", "OS#USABILITY", "SOFTWARE#GENERAL", "DISPLAY#USABILITY", "DISPLAY#GENERAL", "MULTIMEDIA_DEVICES#DESIGN_FEATURES", "BATTERY#DESIGN_FEATURES", "OTHERS", "SOFTWARE#CONNECTIVITY", "SOFTWARE#DESIGN_FEATURES"]}

================================================
FILE: data/polarity/polarity_docu.xlsx
================================================
[File too large to display: 20.9 MB]

================================================
FILE: polarity_predict.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
import os
from sklearn.externals import joblib
from utils.utils import delimiter
from utils.grammar import chinese_only
from utils.data_process import seg_words, gen_text_vec


class PolarityClassifier(object):
    """
    text classification
    """

    def __init__(self):
        # config
        model_name = 'polarity_doc'  # doc-based
        # path
        path_delimiter = delimiter()
        if 'absa' in os.path.abspath('.').split(path_delimiter):
            path_absa = os.path.abspath('.')
        else:
            # caller's working directory == path_comment
            path_absa = os.path.abspath('.') + path_delimiter + 'train' \
                        + path_delimiter + 'sentiment' + path_delimiter + 'absa'
        # model path
        path_model_dir = path_absa + path_delimiter + 'model'
        # load tokenizer
        path_tokenizer = path_model_dir + path_delimiter + '{}.tk'.format(model_name)
        self.tokenizer = joblib.load(path_tokenizer)
        # load model
        path_model = path_model_dir + path_delimiter + '{}.mdl'.format(model_name)
        self.model = joblib.load(path_model)
        self.model._make_predict_function()

    def predict(self, comment):
        # 1. keep Chinese characters only
        cmt = chinese_only([comment])
        # 2. jieba tokenization
        cmt = seg_words(cmt)[0]
        # 3. generate the padded word-index vector
        # (fixed: wrap the single comment in a list; gen_text_vec expects a list
        # of texts, and a bare string would be iterated character by character)
        _cmt = gen_text_vec(self.tokenizer, [cmt], maxlen=200)

        # token observation
        # split_tokens = []
        # for token in str(_cmt).split(" "):
        #     if token.isdigit():
        #         split_tokens.append(token)
        # print("len(split_tokens):{}".format(len(split_tokens)))

        # 4. predict
        neg_prob = self.model.predict(_cmt)[0][0]
        # neg_prob = (neg_prob > 0.5)
        # 5. JSON result output
        result = {'items': [{'negative_prob': 0, 'sentiment': 0}], 'log_id': '', 'text': ''}
        result['items'][0]['negative_prob'] = neg_prob
        result['items'][0]['sentiment'] = int(round(neg_prob))  # 1 = negative review; 0 = positive review
        result['text'] = comment
        print("SENTIMENT RESULT: ", result)
        return result


if __name__ == "__main__":
    t = PolarityClassifier()
    t.predict('这块电池好看')
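The returned dict follows the item shape of the Baidu sentiment API response parsed in utils/baidu_tagging.py (presumably by design). A hypothetical result, with an invented probability:

result = {
    'items': [{'negative_prob': 0.12, 'sentiment': 0}],  # sentiment: 1 = negative, 0 = positive
    'log_id': '',
    'text': '这块电池好看'
}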
================================================
FILE: train/aspect_classifier.py
================================================
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
import os
import json
import pandas as pd
import numpy as np
from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from train.model.model import TextClassifier
from utils.utils import delimiter
from utils.data_process import nan_to_others, category_transpose, seg_words, load_aspect_list


class AspectClassifier(object):
    """
    Aspect (= E#A) classifier, training part
    """

    def __init__(self):
        path_delimiter = delimiter()
        path_absa = os.path.abspath('..')
        # config
        task_tag = 'aspect_'
        model_name = task_tag + 'svc'
        # config path
        self.path_config = path_absa + path_delimiter + 'config.json'
        # model path
        self.model_path = path_absa + path_delimiter + 'model' + path_delimiter + '{}.mdl'.format(model_name)
        # data path
        self.path_data = path_absa + path_delimiter + 'data'
        self.path_data_ch = path_absa + path_delimiter + 'data' + path_delimiter + 'chinese' + path_delimiter
        self.path_train_df = self.path_data + path_delimiter + 'aspect' + path_delimiter + '{}_train.xlsx'.format(model_name)
        self.path_test_df = self.path_data + path_delimiter + 'aspect' + path_delimiter + '{}_test.xlsx'.format(model_name)

    def data_process(self):
        if os.path.isfile(self.path_train_df) \
                and os.path.isfile(self.path_test_df) \
                and os.path.isfile(self.path_config):
            train_df = pd.read_excel(self.path_train_df)
            test_df = pd.read_excel(self.path_test_df)
            self.category_list = load_aspect_list(self.path_config)
        else:
            # 1. load data
            train = pd.read_excel(self.path_data_ch + 'Chinese_phones_training.xlsx')
            test = pd.read_excel(self.path_data_ch + 'CH_PHNS_SB1_TEST.xlsx')
            # 2. mark NaN as 'OTHERS'
            _data = []
            for data in [train, test]:
                df = nan_to_others(data)
                _data.append(df)
            # 3. generate category list
            self.category_list = list(set(_data[0]['category']))  # len = 73
            # 4. save category list to config
            cate_dict = {'aspect_list': self.category_list}
            with open(self.path_config, "w") as f:
                f.write(json.dumps(cate_dict))
            # 5. generate df by category transpose (one 0/1 column per category)
            all_data = []
            for d in _data:
                df = category_transpose(d, self.category_list)
                all_data.append(df)
            # 6. save data
            train_df, test_df = all_data[0], all_data[1]
            train_df.to_excel(self.path_train_df, index=False)
            test_df.to_excel(self.path_test_df, index=False)
        return train_df, test_df

    def train(self, train_df):
        content_train = seg_words(train_df['text'])
        vectorizer_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 5), min_df=5, norm='l2')
        vectorizer_tfidf.fit(content_train)
        # model train: one classifier per category
        classifier_dict = dict()
        for column in self.category_list:
            print(column)
            label_train = train_df[column]
            text_classifier = TextClassifier(vectorizer=vectorizer_tfidf)
            text_classifier.fit(content_train, label_train)
            classifier_dict[column] = text_classifier
        # save model (skip if a model file already exists)
        if not os.path.isfile(self.model_path):
            joblib.dump(classifier_dict, self.model_path)

    def test(self, test_df):
        classifier = joblib.load(self.model_path)
        content_test = seg_words(test_df['text'])
        f1_score_dict = dict()
        for column in self.category_list:
            label_validate = test_df[column]
            text_classifier = classifier[column]
            f1_score = text_classifier.get_f1_score(content_test, label_validate)
            f1_score_dict[column] = f1_score
        f1_score = np.mean(list(f1_score_dict.values()))
        print('F1-SCORE-DICT: ', f1_score_dict)
        print('MEAN OF F1-SCORE-DICT: ', f1_score)
        return f1_score_dict


if __name__ == "__main__":
    aspect = AspectClassifier()
    train_df, test_df = aspect.data_process()
    aspect.train(train_df)
    aspect.test(test_df)

================================================
FILE: train/model/bilstm.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Dropout, Bidirectional, GlobalMaxPool1D


class BiLSTM():

    def __init__(self, max_features, embed_size):
        model = Sequential()
        model.add(Embedding(max_features, embed_size))
        model.add(Bidirectional(LSTM(32, return_sequences=True)))
        model.add(GlobalMaxPool1D())
        model.add(Dense(20, activation="relu"))
        model.add(Dropout(0.05))
        model.add(Dense(1, activation="sigmoid"))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        self.classifier = model

    def fit(self, x, y, batch_size, epochs, validation_split):
        # (fixed: the original hard-coded validation_split=0.2, ignoring the argument)
        self.classifier.fit(x, y, batch_size=batch_size, epochs=epochs,
                            validation_split=validation_split)

    def predict(self, x):
        return self.classifier.predict(x)

    def evaluate(self, y_true, y_pred):
        # (fixed: sklearn metrics take (y_true, y_pred); the original passed them
        # swapped, which flips precision and recall in the F1 score)
        acc = accuracy_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred)
        cfs_matrix = confusion_matrix(y_true, y_pred)
        print('Accuracy Score:', acc)
        print('F1-score: {0}'.format(f1))
        print('Confusion matrix:\n', cfs_matrix)
        return acc, f1, cfs_matrix

    def _make_predict_function(self):
        self.classifier._make_predict_function()
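A shape-flow sketch for this architecture, assuming the hyperparameters used by train/polarity_classifier.py (maxlen=200, embed_size=256, max_features=66000) and toy random inputs:

import numpy as np
from train.model.bilstm import BiLSTM

# shape flow: (batch, 200) word indices -> Embedding -> (batch, 200, 256)
#             -> Bidirectional(LSTM(32, return_sequences)) -> (batch, 200, 64)
#             -> GlobalMaxPool1D -> (batch, 64) -> Dense -> (batch, 1) sigmoid
x = np.random.randint(1, 66000, size=(8, 200))  # 8 fake padded comments
y = np.random.randint(0, 2, size=(8,))          # fake 0/1 polarity labels
clf = BiLSTM(max_features=66000, embed_size=256)
clf.fit(x, y, batch_size=4, epochs=1, validation_split=0.25)
print(clf.predict(x).shape)  # (8, 1), each entry = P(negative)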
================================================
FILE: train/model/model.py
================================================
#!/user/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'ZhangYi'
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB  # alternative classifier, currently unused
from sklearn.metrics import f1_score
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s [%(levelname)s] <%(processName)s> (%(threadName)s) %(message)s')
logger = logging.getLogger(__name__)


class TextClassifier():

    def __init__(self, vectorizer, classifier=None):
        # same fix as in the baseline copies of this file: honor the argument,
        # default to an RBF SVC instead of silently overwriting it
        if classifier is None:
            classifier = SVC(kernel="rbf")
            # classifier = SVC(kernel="linear")
        self.classifier = classifier
        self.vectorizer = vectorizer

    def features(self, x):
        return self.vectorizer.transform(x)

    def fit(self, x, y):
        self.classifier.fit(self.features(x), y)

    def predict(self, x):
        return self.classifier.predict(self.features(x))

    def score(self, x, y):
        return self.classifier.score(self.features(x), y)

    def get_f1_score(self, x, y):
        return f1_score(y, self.predict(x), average='macro')
================================================
FILE: train/polarity_classifier.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
import os
import pandas as pd
from sklearn.externals import joblib
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from train.model.bilstm import BiLSTM
from utils.utils import delimiter
from utils.grammar import chinese_only
from utils.data_process import merge_excel, seg_words, remove_empty_row, gen_text_vec


class PolarityClassifier(object):
    """
    train sentiment model and generate model file
    """

    def __init__(self):
        path_delimiter = delimiter()
        path_absa = os.path.abspath('..')
        # config
        self.maxlen = 200  # document word length
        task_tag = 'polarity_'
        model_name = task_tag + 'docu'
        # model path
        path_model = path_absa + path_delimiter + 'model'
        self.model_path = path_model + path_delimiter + '{}.mdl'.format(model_name)
        self.path_tokenizer = path_model + path_delimiter + '{}.tk'.format(model_name)
        # data path
        path_data_doc_level = path_delimiter.join(path_absa.split(path_delimiter)[:-2]) + path_delimiter + "data" \
                              + path_delimiter + 'sentiment' + path_delimiter + 'document_level'
        self.path_train_data = path_data_doc_level + path_delimiter + 'train_data'
        self.path_data = path_absa + path_delimiter + 'data'
        self.path_corpus = self.path_data + path_delimiter + 'polarity' + path_delimiter + '{}.xlsx'.format(model_name)
        # generate tokenizer
        self.data = self.data_process()
        self.tokenizer = self.gen_tokenizer(self.data['cmt_split'])

    def data_process(self):
        if os.path.isfile(self.path_corpus):
            data = pd.read_excel(self.path_corpus)
        else:
            # 1. merge data
            data = merge_excel(self.path_train_data)
            # 2. keep Chinese characters only
            data['cmt_zh'] = chinese_only(data['comment_content'])
            # 3. jieba tokenization, used to build the dictionary
            data['cmt_split'] = seg_words(data['cmt_zh'])
            # 4. remove empty comments
            data = remove_empty_row(data, 'cmt_split')
            # 5. save data
            data.to_excel(self.path_corpus)
        return data

    def gen_tokenizer(self, cut_corpus_list):
        if os.path.isfile(self.path_tokenizer):
            tokenizer = joblib.load(self.path_tokenizer)
        else:
            tokenizer = Tokenizer()
            tokenizer.fit_on_texts(cut_corpus_list.astype(str))
            joblib.dump(tokenizer, self.path_tokenizer)
        return tokenizer

    def gen_train_test(self, x, y):
        X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
        return X_train, X_test, y_train, y_test

    def train(self, X_train, y_train):
        embed_size = 256
        max_features = 66000  # dictionary size
        classifier = BiLSTM(max_features, embed_size)
        epochs = 2
        batch_size = 100
        X_tr = gen_text_vec(self.tokenizer, X_train, self.maxlen)
        classifier.fit(X_tr, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)
        # save model
        if os.path.isfile(self.model_path):
            print('model exists already')
        else:
            joblib.dump(classifier, self.model_path)

    def test(self, X_test, y_test):
        X_te = gen_text_vec(self.tokenizer, X_test, self.maxlen)
        # load model
        model = joblib.load(self.model_path)
        pred_prob = model.predict(X_te)
        # pred = (pred_prob > 0.65)
        pred = [int(round(i[0])) for i in pred_prob]
        y_test = [int(i) for i in y_test]
        # evaluate
        eval = model.evaluate(y_test, pred)
        return eval

    def batch_predict(self, batch_cmt_df):
        # 1. keep Chinese characters only
        batch_cmt_df['cmt_zh'] = chinese_only(batch_cmt_df['comment_content'])
        # 2. tokenization (no remove-empty step here for now)
        batch_cmt_df['cmt_split'] = seg_words(batch_cmt_df['cmt_zh'])
        # 3. predict
        self.test(batch_cmt_df['cmt_split'], batch_cmt_df['label'])
        # # save result
        # result = pd.DataFrame(np.array([self.X_test, self.y_test, pred]).T,
        #                       columns=['comment_zh', 'GroundTruth', 'bilstm'])
        # result.to_excel('data/sentiment/result_.xlsx')


if __name__ == "__main__":
    pc = PolarityClassifier()
    data = pc.data
    X_train, X_test, y_train, y_test = pc.gen_train_test(data['cmt_split'], data['label'])
    pc.train(X_train, y_train)
    pc.test(X_test, y_test)

================================================
FILE: utils/__init__.py
================================================
================================================
FILE: utils/baidu_tagging.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from aip import AipNlp
import pandas as pd
import time

""" your APPID / AK / SK """
APP_ID = '155934'
API_KEY = 'PBW2w1dveS7x3YcKSZW0V7'
SECRET_KEY = 'AOE75EWZqeI6kM7Kesq8i6FzQruDI'

client = AipNlp(APP_ID, API_KEY, SECRET_KEY)

# source file
source_file = "path to the request file"
source_df = pd.read_excel(source_file)

comments = []
neg_probs = []
pos_probs = []
confidences = []
sentiments = []
complete_count = 0

# request error statistics
err_count = 0
err_comment = []

start_time = time.time()

# request loop
i = 0
while i < len(source_df):
    comment = source_df["comment_content"][i]
    try:
        query_result = client.sentimentClassify(comment[:1024])
    except Exception as e:
        # (fixed: the original printed query_result here, which is unbound
        # when the request itself raises)
        print("request error: {}".format(e))
        print("####### request failed #######")
        err_count += 1
        err_comment.append(comment)
        i += 1
        continue
    try:
        result = query_result['items'][0]
        neg_prob = result['negative_prob']
        pos_prob = result['positive_prob']
        confidence = result['confidence']
        sentiment = result['sentiment']
    except KeyError as e:
        # hit the QPS limit: retry the same row
        print("####### QPS limit reached #######")
        print("i={}".format(i))
        continue
    i += 1
    comments.append(comment)
    neg_probs.append(neg_prob)
    pos_probs.append(pos_prob)
    confidences.append(confidence)
    sentiments.append(sentiment)
    complete_count += 1
    print("total: {} rows".format(len(source_df)))
    print("completed: {} rows".format(complete_count))
    print("progress: {}%".format(round(complete_count / len(source_df) * 100, 2)))
    cost_mins = (time.time() - start_time) / 60
    print("elapsed: {} minutes".format(round(cost_mins, 2)))
    avg_query_time = complete_count / cost_mins
    # print("average time per request: {}".format(avg_query_time))
    left_mins = (len(source_df) - complete_count - err_count) / avg_query_time
    print("estimated remaining: {} minutes".format(round(left_mins, 2)))
    print("\n")

print("all requests done!")
print("total requests: {}".format(len(source_df)))
print("failed requests: {}".format(err_count))

# save results
# successful requests
desti_df = pd.DataFrame()
desti_df["comment"] = comments
desti_df["neg_probs"] = neg_probs
desti_df["pos_probs"] = pos_probs
desti_df["confidences"] = confidences
desti_df["sentiments"] = sentiments
desti_file = "path to save the results"
desti_df.to_excel(desti_file, engine='xlsxwriter')

# failed requests
err_df = pd.DataFrame()
err_file = "path to save the failed requests"
err_df["comment"] = err_comment
err_df.to_excel(err_file, engine='xlsxwriter')
# if the responses contain odd characters, save with engine='xlsxwriter'

================================================
FILE: utils/data_process.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
import ast
import jieba
import itertools
import pandas as pd
import numpy as np
from keras.preprocessing.sequence import pad_sequences


# mark NaN as 'OTHERS'
def nan_to_others(df):
    new_cate = []
    new_polarity = []
    # the dataframe must contain the columns ['text', 'category', 'polarity']
    for idx, i in enumerate(df['polarity']):
        if i in ['negative', 'positive', 'neutral', 'conflict']:
            new_cate.append(df['category'][idx])
            new_polarity.append(i)
        else:
            new_cate.append('OTHERS')
            new_polarity.append('OTHERS')
    _df = pd.DataFrame(np.array([df['text'], new_cate, new_polarity]).T,
                       columns=['text', 'category', 'polarity'])
    return _df


# tokenize
def seg_words(contents):
    contents_segs = list()
    for content in contents:
        segs = jieba.lcut(content)
        contents_segs.append(" ".join(segs))
    return contents_segs


# get text vector: word indices padded to maxlen
def gen_text_vec(tokenizer, cut_corpus_list, maxlen):
    text_vec = tokenizer.texts_to_sequences(cut_corpus_list)
    t_vec = pad_sequences(text_vec, maxlen=maxlen)
    return t_vec


# category transpose: one 0/1 indicator column per category
def category_transpose(df, category_list):
    for i in category_list:
        l_ist = []
        # the dataframe must contain the column ['category']
        for cate in df['category']:
            if cate == i:
                l_ist.append(1)
            else:
                l_ist.append(0)
        df[i] = l_ist
    return df


# load config: aspect_list
def load_aspect_list(path_config):
    # the config holds a single JSON object with one key: aspect_list
    # (fixed: the original looped over the file with a manual counter and a
    # redundant close; reading the first line directly is equivalent)
    with open(path_config, "r", encoding='utf-8') as f:
        category_list = ast.literal_eval(f.readline())['aspect_list']
    return category_list


# merge excel files
def merge_excel(path_data_dir):
    cmt_l = []
    scr_l = []
    # every merged df must contain ['comment_content', 'label']
    data_source = ['/2019-04-12_lock_comment_jd_spider_baidu_sentiment.xlsx',
                   '/20190329_train_lock_comments_document_level_with_label.xls',
                   '/all_comments_document_level_without_lock_comments.xls',
                   '/bad_comments_in_forum_mi.com_youpin.xls']
    for i in data_source:
        path_data = path_data_dir + i
        _data = pd.read_excel(path_data)
        cmt_l.append(_data['comment_content'])
        scr_l.append(_data['label'])
    comment = list(itertools.chain.from_iterable(cmt_l))
    score = list(itertools.chain.from_iterable(scr_l))
    data = pd.DataFrame(np.array([comment, score]).T, columns=['comment_content', 'label'])
    return data


# remove rows where the given column is empty
def remove_empty_row(df, column_name):
    row_to_delete = []
    for idx, i in enumerate(df[column_name]):
        if not bool(i):
            row_to_delete.append(idx)
    df = df.drop(df.index[row_to_delete])
    return df.reset_index(drop=True)
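A small sketch of what category_transpose does (toy rows invented); this 0/1-indicator layout is what the per-category SVC loop in train/aspect_classifier.py trains on:

import pandas as pd
from utils.data_process import category_transpose

df = pd.DataFrame({'text': ['屏幕不错', '电池不耐用'],
                   'category': ['DISPLAY#QUALITY', 'BATTERY#QUALITY'],
                   'polarity': ['positive', 'negative']})
out = category_transpose(df, ['DISPLAY#QUALITY', 'BATTERY#QUALITY'])
print(out[['DISPLAY#QUALITY', 'BATTERY#QUALITY']].values.tolist())
# -> [[1, 0], [0, 1]]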
================================================
FILE: utils/grammar.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
import re


def chinese_only(txt_list):
    # strip every non-Chinese character ([^\u4e00-\u9fa5]); each run of removed
    # characters collapses into a single comma
    cmt_zh = []
    for cmt in txt_list:
        line = cmt.strip()
        p2 = re.compile(u'[^\u4e00-\u9fa5]')
        zh = " ".join(p2.split(line)).strip()
        cmt_zh.append(",".join(zh.split()))
    return cmt_zh

================================================
FILE: utils/utils.py
================================================
# -*- coding: utf-8 -*-
__author__ = 'ZhangYi'
import sys


def delimiter():
    # path separator by platform (os.sep would be the standard alternative)
    # (fixed: the original tested `'win' in sys.platform`, which also matches
    # 'darwin' on macOS and would wrongly return a backslash there)
    path_delimiter = '/'
    if sys.platform.startswith('win'):
        path_delimiter = '\\'
    return path_delimiter
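A quick illustration of chinese_only's behavior (input string invented): non-Chinese characters are removed, and the gaps they leave are joined with commas:

from utils.grammar import chinese_only

print(chinese_only(['屏幕very好, 性价比123高!']))
# -> ['屏幕,好,性价比,高']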