[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nenv/\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\n*.egg-info/\n.installed.cfg\n*.egg\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*,cover\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n*.yaml\n\n# virtualenv\nvenv/\nENV/\n\n# Pycharm settings\n*.idea/"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License (MIT)\n\nCopyright (c) by Windfarer\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "include README.md LICENSE\nrecursive-include src *.yaml"
  },
  {
    "path": "README.md",
    "content": "# py-mysql-elasticsearch-sync\nSimple and fast MySQL to Elasticsearch sync tool, written in Python.\n\n[中文文档](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/README_CN.md)\n\n## Introduction\nThis tool helps you to initialize MySQL dump table to Elasticsearch by parsing mysqldump, then incremental sync MySQL table to Elasticsearch by processing MySQL Binlog.\nAlso, during the binlog syncing, this tool will save the binlog sync position, so that it is easy to recover after this tool being shutdown for any reason.\n\n## Installation\nBy following these steps.\n\n##### 1. ibxml2 and libxslt\nThis tool depends on python lxml package, so that you should install  the lxml's dependecies correctly, the libxml2 and libxslt are required.\n\nFor example, in CentOS:\n\n```\nsudo yum install libxml2 libxml2-devel libxslt libxslt-devel\n```\n\nOr in Debian/Ubuntu:\n\n```\nsudo apt-get install libxml2-dev libxslt-dev python-dev\n```\n\nSee [lxml Installation](http://lxml.de/installation.html) for more infomation.\n##### 2. mysqldump\nAnd then, mysqldump is required in the machine where this tool will be run on it.(and the mysql server must enable binlog)\n\n\n##### 3. this tool\nThen install this tool\n\n```\npip install py-mysql-elasticsearch-sync\n```\n\n## Configuration\nThere is a [sample config](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/es_sync/sample.yaml) file in repo, you can start by editing it.\n\n## Running\nSimply run command\n\n```\nes-sync path/to/your/config.yaml\n```\nand the tool will dump your data as stream to sync, when dump is over, it will start to sync binlog.\n\nThe latest synced binlog file and position are recorded in your info file which is configured in your config file. You can restart dump step by remove it, or you can change sync position by edit it.\n\nOr if you  but want to load it from your own dumpfile. 
You should dump your table in XML format first (by adding the ```-X``` option to your mysqldump command), then run\n\n```\nes-sync path/to/your/config.yaml --fromfile\n```\n\nto start syncing; when the XML import is over, it will also start binlog sync.\n\n## Deployment\nWe provide an [upstart script](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/upstart.conf) to help you deploy this tool. You can adapt it to your environment, or deploy the tool in your own way.\n\n## Multi-table Support\nMulti-table sync is supported by setting `tables` in the config file; by default the first table is the master and the others are slaves.\n\nThe master table and the slave tables must use the same primary key, which is defined via `_id`.\n\n`table` takes priority over `tables`.\n\n## TODO\n- [ ] Multi-index support\n"
  },
  {
    "path": "README_CN.md",
    "content": "# py-mysql-elasticsearch-sync\n一个从MySQL向Elasticsearch同步数据的工具，使用Python实现。\n\n## 简介\n在第一次初始化数据时，本工具解析mysqldump导出的数据，并导入ES中，在后续增量更新中，解析binlog的数据，对ES中的数据进行同步。在binlog同步阶段，支持断点恢复，因此无需担心意外中断的问题。\n\n## 安装\n\n##### 1. ibxml2 和 libxslt\n本工具基于lxml库，因此需要安装它的依赖的libxml2和libxslt\n\n在CentOS中:\n\n```\nsudo yum install libxml2 libxml2-devel libxslt libxslt-devel\n```\n\n在Debian/Ubuntu中:\n\n```\nsudo apt-get install libxml2-dev libxslt-dev python-dev\n```\n\n查看[lxml Installation](http://lxml.de/installation.html)来获取更多相关信息\n\n##### 2. mysqldump\n在运行本工具的机器上需要有mysqldump，并且mysql服务器需要开启binlog功能。\n\n\n##### 3. 本工具\n安装本工具\n\n```\npip install py-mysql-elasticsearch-sync\n```\n\n## 配置\n你可以通过修改[配置文件示例](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/es_sync/sample.yaml)来编写自己的配置文件\n\n## 运行\n运行命令\n\n```\nes-sync path/to/your/config.yaml\n```\n工具将开始执行mysqldump并解析流进行同步，当dump结束后，将启动binlog同步\n\n最近一次binlog同步位置记录在一个文件中，这个文件的路径在config文件中配置过。\n\n你可以删除记录文件来从头进行binlog同步，或者修改文件里的内容，来从特定位置开始同步。\n\n\n你也可以把自己从mysql导出的xml文件同步进ES中(在mysqldump的命令中加上参数```-X```即可导出xml) \n\n然后执行\n\n```\nes-sync path/to/your/config.yaml --fromfile\n```\n启动从xml导入，当从xml导入完毕后，它会开始同步binlog\n\n## 服务管理\n我们写了一个[upstart脚本](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/upstart.conf)来管理本工具的运行，你也可以用你自己的方式进行部署运行\n\n## 多表支持\n你可以在config文件中配置tables以支持多表，默认tables中第一张表为主表，其余表为从表。\n\n主表和从表主键必须相同，均为_id字段。\n\n当同时设置table和tables时，table优先级较高。\n\n## TODO\n- [ ] 多索引支持\n"
  },
  {
    "path": "es_sync/__init__.py",
    "content": "from __future__ import print_function, unicode_literals\nfrom future.builtins import str, range\nimport sys\n\nPY2 = sys.version_info[0] == 2\n\nif PY2:\n    import os\n    DEVNULL = open(os.devnull, 'wb')\nelse:\n    from subprocess import DEVNULL\n\n\ndef encode_in_py2(s):\n    if PY2:\n        return s.encode('utf-8')\n    return s\n\nimport os.path\nimport yaml\nimport signal\nimport requests\nimport subprocess\nimport json\nimport logging\nimport shlex\nimport datetime\nimport decimal\nfrom lxml.etree import iterparse\nfrom functools import reduce\nfrom pymysqlreplication import BinLogStreamReader\nfrom pymysqlreplication.row_event import DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent\nfrom pymysqlreplication.event import RotateEvent, XidEvent\n\n__version__ = '0.4.2'\n\n\n# The magic spell for removing invalid characters in xml stream.\nREMOVE_INVALID_PIPE = r'tr -d \"\\00\\01\\02\\03\\04\\05\\06\\07\\10\\13\\14\\16\\17\\20\\21\\22\\23\\24\\25\\26\\27\\30\\31\\32\\33\\34\\35\\36\\37\"'\n\nDEFAULT_BULKSIZE = 100\nDEFAULT_BINLOG_BULKSIZE = 1\n\n\nclass ElasticSync(object):\n    table_structure = {}\n    log_file = None\n    log_pos = None\n\n    @property\n    def is_binlog_sync(self):\n        rv = bool(self.log_file and self.log_pos)\n        return rv\n\n    def __init__(self):\n        try:\n            self.config = yaml.load(open(sys.argv[1]))\n        except IndexError:\n            print('Error: not specify config file')\n            exit(1)\n\n        mysql = self.config.get('mysql')\n        if mysql.get('table'):\n            self.tables = [mysql.get('table')]\n            self.dump_cmd = 'mysqldump -h {host} -P {port} -u {user} --password={password} {db} {table} ' \\\n                        '--default-character-set=utf8 -X --opt --quick'.format(**mysql)\n        elif mysql.get('tables'):\n            self.tables = mysql.get('tables')\n            mysql.update({\n                'tables': ' '.join(mysql.get('tables'))\n            
})\n            self.dump_cmd = 'mysqldump -h {host} -P {port} -u {user} --password={password} --databases {db} --tables {tables} ' \\\n                        '--default-character-set=utf8 -X --opt --quick'.format(**mysql)\n        else:\n            print('Error: must specify either table or tables')\n            sys.exit(1)\n        self.master = self.tables[0]  # use the first table as master\n        self.current_table = None\n\n        self.binlog_conf = dict(\n            [(key, self.config['mysql'][key]) for key in ['host', 'port', 'user', 'password', 'db']]\n        )\n\n        self.endpoint = 'http://{host}:{port}/{index}/{type}/_bulk'.format(\n            host=self.config['elastic']['host'],\n            port=self.config['elastic']['port'],\n            index=self.config['elastic']['index'],\n            type=self.config['elastic']['type']\n        )  # todo: supporting multi-index\n\n        self.mapping = self.config.get('mapping') or {}\n        if self.mapping.get('_id'):\n            self.id_key = self.mapping.pop('_id')\n        else:\n            self.id_key = None\n\n        self.ignoring = self.config.get('ignoring') or []\n\n        record_path = self.config['binlog_sync']['record_file']\n        if os.path.isfile(record_path):\n            with open(record_path, 'r') as f:\n                record = yaml.safe_load(f)\n                self.log_file = record.get('log_file')\n                self.log_pos = record.get('log_pos')\n\n        self.bulk_size = self.config.get('elastic').get('bulk_size') or DEFAULT_BULKSIZE\n        self.binlog_bulk_size = self.config.get('elastic').get('binlog_bulk_size') or DEFAULT_BINLOG_BULKSIZE\n\n        self._init_logging()\n        self._force_commit = False\n\n    def _init_logging(self):\n        logging.basicConfig(filename=self.config['logging']['file'],\n                            level=logging.INFO,\n                            format='[%(levelname)s] - %(filename)s[line:%(lineno)d] - %(asctime)s %(message)s')\n   
     self.logger = logging.getLogger(__name__)\n        logging.getLogger(\"requests\").setLevel(logging.WARNING)  # disable requests info logging\n\n        def cleanup(*args):\n            self.logger.info('Received stop signal')\n            self.logger.info('Shutdown')\n            sys.exit(0)\n\n        signal.signal(signal.SIGINT, cleanup)\n        signal.signal(signal.SIGTERM, cleanup)\n\n    def _post_to_es(self, data):\n        \"\"\"\n        send bulk POST request to the es restful api\n        \"\"\"\n        resp = requests.post(self.endpoint, data=data)\n        result = resp.json()\n        if result.get('errors'):  # a boolean flag indicating whether any item failed\n            for item in result['items']:\n                if list(item.values())[0].get('error'):\n                    logging.error(item)\n        else:\n            self._save_binlog_record()\n\n    def _bulker(self, bulk_size):\n        \"\"\"\n        Example:\n            u = bulker()\n            u.send(None)  # initialize the generator\n            u.send(json_str)  # input json item\n            u.send(another_json_str)  # input json item\n            ...\n            u.send(None)  # force finish bulk and post\n        \"\"\"\n        while True:\n            data = \"\"\n            for i in range(bulk_size):\n                item = yield\n                if item:\n                    data = data + item + \"\\n\"\n                else:\n                    break\n                if self._force_commit:\n                    break\n            if data:\n                self._post_to_es(data)\n\n            self._force_commit = False\n\n    def _updater(self, data):\n        \"\"\"\n        encapsulation of bulker\n        \"\"\"\n        if self.is_binlog_sync:\n            u = self._bulker(bulk_size=self.binlog_bulk_size)\n        else:\n            u = self._bulker(bulk_size=self.bulk_size)\n\n        u.send(None)  # push the generator to first yield\n        for item in 
data:\n            u.send(item)\n        u.send(None)  # tell the generator it's the end\n\n    def _json_serializer(self, obj):\n        \"\"\"\n        format the object which json not supported\n        \"\"\"\n        if isinstance(obj, datetime.datetime) or isinstance(obj, datetime.date):\n            return obj.isoformat()\n        elif isinstance(obj, decimal.Decimal):\n            return str(obj)\n        raise TypeError('Type not serializable for obj {obj}'.format(obj=obj))\n\n    def _processor(self, data):\n        \"\"\"\n        The action must be one of the following:\n        create\n            Create a document only if the document does not already exist.\n        index\n            Create a new document or replace an existing document.\n        update\n            Do a partial update on a document.\n        delete\n            Delete a document.\n        \"\"\"\n        for item in data:\n            if self.id_key:\n                action_content = {'_id': item['doc'][self.id_key]}\n            else:\n                action_content = {}\n            for field in self.ignoring:\n                try:\n                    item['doc'].pop(field)\n                except KeyError:\n                    pass\n            meta = json.dumps({item['action']: action_content})\n            if item['action'] == 'index':\n                body = json.dumps(item['doc'], default=self._json_serializer)\n                rv = meta + '\\n' + body\n            elif item['action'] == 'update':\n                body = json.dumps({'doc': item['doc']}, default=self._json_serializer)\n                rv = meta + '\\n' + body\n            elif item['action'] == 'delete':\n                rv = meta + '\\n'\n            elif item['action'] == 'create':\n                body = json.dumps(item['doc'], default=self._json_serializer)\n                rv = meta + '\\n' + body\n            else:\n                logging.error('unknown action type in doc')\n                raise 
TypeError('unknown action type in doc')\n            yield rv\n\n    def _mapper(self, data):\n        \"\"\"\n        mapping old key to new key\n        \"\"\"\n        for item in data:\n            if self.mapping:\n                for k, v in self.mapping.items():\n                    try:\n                        item['doc'][k] = item['doc'][v]\n                        del item['doc'][v]\n                    except KeyError:\n                        continue\n            # print(doc)\n            yield item\n\n    def _formatter(self, data):\n        \"\"\"\n        format every field from xml, according to parsed table structure\n        \"\"\"\n        for item in data:\n            for field, serializer in self.table_structure.items():\n                if field in item['doc'] and item['doc'][field]:\n                    try:\n                        item['doc'][field] = serializer(item['doc'][field])\n                    except ValueError as e:\n                        self.logger.error(\n                            \"Error occurred during format, ErrorMessage:{msg}, ErrorItem:{item}\".format(\n                            msg=str(e),\n                            item=str(item)))\n                        item['doc'][field] = None\n                    except TypeError as e:\n                        item['doc'][field] = None\n            # print(item)\n            yield item\n\n    def _binlog_loader(self):\n        \"\"\"\n        read row from binlog\n        \"\"\"\n        if self.is_binlog_sync:\n            resume_stream = True\n            logging.info(\"Resume from binlog_file: {file}  binlog_pos: {pos}\".format(file=self.log_file,\n                                                                                     pos=self.log_pos))\n        else:\n            resume_stream = False\n\n        stream = BinLogStreamReader(connection_settings=self.binlog_conf,\n                                    server_id=self.config['mysql']['server_id'],\n           
                         only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent, RotateEvent, XidEvent],\n                                    only_tables=self.tables,\n                                    resume_stream=resume_stream,\n                                    blocking=True,\n                                    log_file=self.log_file,\n                                    log_pos=self.log_pos)\n        for binlogevent in stream:\n            self.log_file = stream.log_file\n            self.log_pos = stream.log_pos\n\n            # RotateEvent to update binlog record when no related table changed\n            if isinstance(binlogevent, RotateEvent):\n                self._save_binlog_record()\n                continue\n\n            if isinstance(binlogevent, XidEvent):  # event_type == 16\n                self._force_commit = True\n                continue\n\n            for row in binlogevent.rows:\n                if isinstance(binlogevent, DeleteRowsEvent):\n                    if binlogevent.table == self.master:\n                        rv = {\n                            'action': 'delete',\n                            'doc': row['values']\n                        }\n                    else:\n                        rv = {\n                            'action': 'update',\n                            'doc': {k: row['values'][k] if self.id_key and self.id_key == k else None for k in row['values']}\n                        }\n                elif isinstance(binlogevent, UpdateRowsEvent):\n                    rv = {\n                        'action': 'update',\n                        'doc': row['after_values']\n                    }\n                elif isinstance(binlogevent, WriteRowsEvent):\n                    if binlogevent.table == self.master:\n                        rv = {\n                                'action': 'create',\n                                'doc': row['values']\n                            }\n                    
else:\n                        rv = {\n                                'action': 'update',\n                                'doc': row['values']\n                            }\n                else:\n                    logging.error('unknown action type in binlog')\n                    raise TypeError('unknown action type in binlog')\n                yield rv\n                # print(rv)\n        stream.close()\n        raise IOError('mysql connection closed')\n\n    def _parse_table_structure(self, data):\n        \"\"\"\n        parse the table structure\n        \"\"\"\n        for item in data.iter():\n            if item.tag == 'field':\n                field = item.attrib.get('Field')\n                type = item.attrib.get('Type')\n                if 'int' in type:\n                    serializer = int\n                elif 'float' in type:\n                    serializer = float\n                elif 'datetime' in type:\n                    if '(' in type:\n                        serializer = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')\n                    else:\n                        serializer = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')\n                elif 'char' in type:\n                    serializer = str\n                elif 'text' in type:\n                    serializer = str\n                else:\n                    serializer = str\n                self.table_structure[field] = serializer\n\n    def _parse_and_remove(self, f, path):\n        \"\"\"\n        snippet from python cookbook, for parsing large xml file\n        \"\"\"\n        path_parts = path.split('/')\n        doc = iterparse(f, ('start', 'end'), recover=False, encoding='utf-8', huge_tree=True)\n        # Skip the root element\n        next(doc)\n        tag_stack = []\n        elem_stack = []\n        for event, elem in doc:\n            if event == 'start':\n                if elem.tag == 'table_data':\n                    
self.current_table = elem.attrib['name']\n                tag_stack.append(elem.tag)\n                elem_stack.append(elem)\n            elif event == 'end':\n                if tag_stack == ['database', 'table_data']:\n                    self.current_table = None\n                if tag_stack == path_parts:\n                    yield elem\n                    elem_stack[-2].remove(elem)\n                if tag_stack == ['database', 'table_structure']:\n                # dirty hack for getting the tables structure\n                    self._parse_table_structure(elem)\n                    elem_stack[-2].remove(elem)\n                try:\n                    tag_stack.pop()\n                    elem_stack.pop()\n                except IndexError:\n                    pass\n\n    def _xml_parser(self, f_obj):\n        \"\"\"\n        parse mysqldump XML streaming, convert every item to dict object.\n        'database/table_data/row'\n        \"\"\"\n        for row in self._parse_and_remove(f_obj, 'database/table_data/row'):\n            doc = {}\n            for field in row.iter(tag='field'):\n                k = field.attrib.get('name')\n                v = field.text\n                doc[k] = v\n            if not self.current_table or self.current_table == self.master:\n                yield {'action': 'create', 'doc': doc}\n            else:\n                yield {'action': 'update', 'doc': doc}\n\n    def _save_binlog_record(self):\n        if self.is_binlog_sync:\n            with open(self.config['binlog_sync']['record_file'], 'w') as f:\n                logging.info(\"Sync binlog_file: {file}  binlog_pos: {pos}\".format(\n                    file=self.log_file,\n                    pos=self.log_pos)\n                )\n                yaml.safe_dump({\"log_file\": self.log_file,\n                                \"log_pos\": self.log_pos},\n                               f,\n                               default_flow_style=False)\n\n    def 
_xml_dump_loader(self):\n        mysqldump = subprocess.Popen(\n            shlex.split(encode_in_py2(self.dump_cmd)),\n            stdout=subprocess.PIPE,\n            stderr=DEVNULL,\n            close_fds=True)\n\n        remove_invalid_pipe = subprocess.Popen(\n            shlex.split(encode_in_py2(REMOVE_INVALID_PIPE)),\n            stdin=mysqldump.stdout,\n            stdout=subprocess.PIPE,\n            stderr=DEVNULL,\n            close_fds=True)\n\n        return remove_invalid_pipe.stdout\n\n    def _xml_file_loader(self, filename):\n        f = open(filename, 'rb')  # bytes required\n\n        remove_invalid_pipe = subprocess.Popen(\n            shlex.split(encode_in_py2(REMOVE_INVALID_PIPE)),\n            stdin=f,\n            stdout=subprocess.PIPE,\n            stderr=DEVNULL,\n            close_fds=True)\n        return remove_invalid_pipe.stdout\n\n    def _send_email(self, title, content):\n        \"\"\"\n        send notification email\n        \"\"\"\n        if not self.config.get('email'):\n            return\n\n        import smtplib\n        from email.mime.text import MIMEText\n\n        msg = MIMEText(content)\n        msg['Subject'] = title\n        msg['From'] = self.config['email']['from']['username']\n        msg['To'] = ', '.join(self.config['email']['to'])\n\n        # Send the message via our own SMTP server.\n        s = smtplib.SMTP()\n        s.connect(self.config['email']['from']['host'])\n        s.login(user=self.config['email']['from']['username'],\n                password=self.config['email']['from']['password'])\n        # sendmail expects a list of recipient addresses, not the joined header string\n        s.sendmail(msg['From'], self.config['email']['to'], msg.as_string())\n        s.quit()\n\n    def _sync_from_stream(self):\n        logging.info(\"Start to dump from stream\")\n        docs = reduce(lambda x, y: y(x), [self._xml_parser, \n                                          self._formatter,\n                                          self._mapper, \n                                          self._processor], \n    
                  self._xml_dump_loader())\n        self._updater(docs)\n        logging.info(\"Dump success\")\n\n    def _sync_from_file(self):\n        logging.info(\"Start to dump from xml file\")\n        logging.info(\"Filename: {}\".format(self.config['xml_file']['filename']))\n        docs = reduce(lambda x, y: y(x), [self._xml_parser, \n                                          self._formatter, \n                                          self._mapper, \n                                          self._processor],\n                      self._xml_file_loader(self.config['xml_file']['filename']))\n        self._updater(docs)\n        logging.info(\"Dump success\")\n\n    def _sync_from_binlog(self):\n        logging.info(\"Start to sync binlog\")\n        docs = reduce(lambda x, y: y(x), [self._mapper,\n                                          self._processor],\n                      self._binlog_loader())\n        self._updater(docs)\n\n    def run(self):\n        \"\"\"\n        workflow:\n        1. sync dump data\n        2. sync binlog\n        \"\"\"\n        try:\n            if not self.is_binlog_sync:\n                if len(sys.argv) > 2 and sys.argv[2] == '--fromfile':\n                    self._sync_from_file()\n                else:\n                    self._sync_from_stream()\n            self._sync_from_binlog()\n        except Exception:\n            import traceback\n            logging.error(traceback.format_exc())\n            self._send_email('es sync error', traceback.format_exc())\n            raise\n\n\ndef start():\n    instance = ElasticSync()\n    instance.run()\n\nif __name__ == '__main__':\n    start()\n"
  },
  {
    "path": "requirements.txt",
    "content": "PyMySQL==0.6.7\nmysql-replication>=0.8\nrequests>=2.9.1\nPyYAML>=3.11\nlxml>=3.5.0\nfuture>=0.15.2 #for py2 compat\n"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup, find_packages\nimport es_sync\n\nsetup(\n    name='py-mysql-elasticsearch-sync',\n    version=es_sync.__version__,\n    packages=find_packages(),\n    url='https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync',\n    license='MIT',\n    author='Windfarer',\n    author_email='windfarer@gmail.com',\n    description='MySQL to Elasticsearch sync tool',\n    install_requires=[\n        'PyMySQL==0.6.7',\n        'mysql-replication==0.9',\n        'requests==2.9.1',\n        'PyYAML==3.11',\n        'lxml==3.5.0',\n        'future==0.15.2'\n    ],\n    entry_points={\n        'console_scripts': [\n            'es-sync=es_sync:start',\n        ]\n    },\n    include_package_data=True\n)"
  },
  {
    "path": "src/__init__.py",
    "content": "from __future__ import print_function, unicode_literals\nfrom future.builtins import str, range\nimport sys\nPY2 = sys.version_info[0] == 2\n\nif PY2:\n    import os\n    DEVNULL = open(os.devnull, 'wb')\nelse:\n    from subprocess import DEVNULL\ndef encode_in_py2(s):\n    if PY2:\n        return s.encode('utf-8')\n    return s\n\nimport os.path\nimport yaml\nimport signal\nimport requests\nimport subprocess\nimport json\nimport logging\nimport shlex\nfrom datetime import datetime\nfrom lxml.etree import iterparse\nfrom functools import reduce\nfrom pymysqlreplication import BinLogStreamReader\nfrom pymysqlreplication.row_event import DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent\n\n__version__ = '0.3.3.1'\n\n\n# The magic spell for removing invalid characters in xml stream.\nREMOVE_INVALID_PIPE = r'tr -d \"\\00\\01\\02\\03\\04\\05\\06\\07\\10\\13\\14\\16\\17\\20\\21\\22\\23\\24\\25\\26\\27\\30\\31\\32\\33\\34\\35\\36\\37\"'\n\nDEFAULT_BULKSIZE = 100\nDEFAULT_BINLOG_BULKSIZE = 1\n\n\nclass ElasticSync(object):\n    table_structure = {}\n    log_file = None\n    log_pos = None\n\n    @property\n    def is_binlog_sync(self):\n        rv = bool(self.log_file and self.log_pos)\n        return rv\n\n    def __init__(self):\n        try:\n            self.config = yaml.load(open(sys.argv[1]))\n        except IndexError:\n            print('Error: not specify config file')\n            exit(1)\n\n        self.dump_cmd = 'mysqldump -h {host} -P {port} -u {user} --password={password} {db} {table} ' \\\n                        '--default-character-set=utf8 -X'.format(**self.config['mysql'])\n\n        self.binlog_conf = dict(\n            [(key, self.config['mysql'][key]) for key in ['host', 'port', 'user', 'password', 'db']]\n        )\n\n        self.endpoint = 'http://{host}:{port}/{index}/{type}/_bulk'.format(\n            host=self.config['elastic']['host'],\n            port=self.config['elastic']['port'],\n            
index=self.config['elastic']['index'],\n            type=self.config['elastic']['type']\n        )  # todo: supporting multi-index\n\n        self.mapping = self.config.get('mapping') or {}\n        if self.mapping.get('_id'):\n            self.id_key = self.mapping.pop('_id')\n        else:\n            self.id_key = None\n\n        record_path = self.config['binlog_sync']['record_file']\n        if os.path.isfile(record_path):\n            with open(record_path, 'r') as f:\n                record = yaml.load(f)\n                self.log_file = record.get('log_file')\n                self.log_pos = record.get('log_pos')\n\n        self.bulk_size = self.config.get('elastic').get('bulk_size') or DEFAULT_BULKSIZE\n        self.binlog_bulk_size = self.config.get('elastic').get('binlog_bulk_size') or DEFAULT_BINLOG_BULKSIZE\n\n        self._init_logging()\n\n    def _init_logging(self):\n        logging.basicConfig(filename=self.config['logging']['file'],\n                            level=logging.INFO,\n                            format='[%(levelname)s] %(asctime)s %(message)s')\n        self.logger = logging.getLogger(__name__)\n        logging.getLogger(\"requests\").setLevel(logging.WARNING)  # disable requests info logging\n\n        def cleanup(*args):\n            self.logger.info('Received stop signal')\n            self.logger.info('Shutdown')\n            sys.exit(0)\n\n        signal.signal(signal.SIGINT, cleanup)\n        signal.signal(signal.SIGTERM, cleanup)\n\n    def _post_to_es(self, data):\n        \"\"\"\n        send post requests to es restful api\n        \"\"\"\n        resp = requests.post(self.endpoint, data=data)\n        if resp.json().get('errors'):  # a boolean to figure error occurs\n            for item in resp.json()['items']:\n                if list(item.values())[0].get('error'):\n                    logging.error(item)\n        else:\n            self._save_binlog_record()\n\n    def _bulker(self, bulk_size):\n        \"\"\"\n       
 Example:\n            u = bulker()\n            u.send(None)  #for generator initialize\n            u.send(json_str)  # input json item\n            u.send(another_json_str)  # input json item\n            ...\n            u.send(None) force finish bulk and post\n        \"\"\"\n        while True:\n            data = \"\"\n            for i in range(bulk_size):\n                item = yield\n                if item:\n                    data = data + item + \"\\n\"\n                else:\n                    break\n            # print(data)\n            print('-'*10)\n            if data:\n                self._post_to_es(data)\n\n    def _updater(self, data):\n        \"\"\"\n        encapsulation of bulker\n        \"\"\"\n        if self.is_binlog_sync:\n                u = self._bulker(bulk_size=self.binlog_bulk_size)\n        else:\n                u = self._bulker(bulk_size=self.bulk_size)\n\n        u.send(None)  # push the generator to first yield\n        for item in data:\n            u.send(item)\n        u.send(None)  # tell the generator it's the end\n\n    def _json_serializer(self, obj):\n        \"\"\"\n        format the object which json not supported\n        \"\"\"\n        if isinstance(obj, datetime):\n            return obj.isoformat()\n        raise TypeError('Type not serializable')\n\n    def _processor(self, data):\n        \"\"\"\n        The action must be one of the following:\n        create\n            Create a document only if the document does not already exist.\n        index\n            Create a new document or replace an existing document.\n        update\n            Do a partial update on a document.\n        delete\n            Delete a document.\n        \"\"\"\n        for item in data:\n            if self.id_key:\n                action_content = {'_id': item['doc'][self.id_key]}\n            else:\n                action_content = {}\n            meta = json.dumps({item['action']: action_content})\n            if 
item['action'] == 'index':\n                body = json.dumps(item['doc'], default=self._json_serializer)\n                rv = meta + '\\n' + body\n            elif item['action'] == 'update':\n                body = json.dumps({'doc': item['doc']}, default=self._json_serializer)\n                rv = meta + '\\n' + body\n            elif item['action'] == 'delete':\n                rv = meta + '\\n'\n            elif item['action'] == 'create':\n                body = json.dumps(item['doc'], default=self._json_serializer)\n                rv = meta + '\\n' + body\n            else:\n                self.logger.error('unknown action type in doc')\n                raise TypeError('unknown action type in doc')\n            yield rv\n\n    def _mapper(self, data):\n        \"\"\"\n        rename fields according to the configured mapping (new key -> old key)\n        \"\"\"\n        for item in data:\n            if self.mapping:\n                for k, v in self.mapping.items():\n                    item['doc'][k] = item['doc'][v]\n                    del item['doc'][v]\n            yield item\n\n    def _formatter(self, data):\n        \"\"\"\n        format every field from xml according to the parsed table structure\n        \"\"\"\n        for item in data:\n            for field, serializer in self.table_structure.items():\n                if item['doc'].get(field):\n                    try:\n                        item['doc'][field] = serializer(item['doc'][field])\n                    except ValueError as e:\n                        self.logger.error(\"Error occurred during format, ErrorMessage:{msg}, ErrorItem:{item}\".format(\n                            msg=str(e),\n                            item=str(item)))\n                        item['doc'][field] = None\n            yield item\n\n    def _binlog_loader(self):\n        \"\"\"\n        read rows from the binlog\n        \"\"\"\n        if self.is_binlog_sync:\n            resume_stream = True\n            
self.logger.info(\"Resume from binlog_file: {file}  binlog_pos: {pos}\".format(file=self.log_file,\n                                                                                          pos=self.log_pos))\n        else:\n            resume_stream = False\n\n        stream = BinLogStreamReader(connection_settings=self.binlog_conf,\n                                    server_id=self.config['mysql']['server_id'],\n                                    only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent],\n                                    only_tables=[self.config['mysql']['table']],\n                                    resume_stream=resume_stream,\n                                    blocking=True,\n                                    log_file=self.log_file,\n                                    log_pos=self.log_pos)\n        for binlogevent in stream:\n            self.log_file = stream.log_file\n            self.log_pos = stream.log_pos\n            for row in binlogevent.rows:\n                if isinstance(binlogevent, DeleteRowsEvent):\n                    rv = {\n                        'action': 'delete',\n                        'doc': row['values']\n                    }\n                elif isinstance(binlogevent, UpdateRowsEvent):\n                    rv = {\n                        'action': 'update',\n                        'doc': row['after_values']\n                    }\n                elif isinstance(binlogevent, WriteRowsEvent):\n                    rv = {\n                        'action': 'index',\n                        'doc': row['values']\n                    }\n                else:\n                    self.logger.error('unknown action type in binlog')\n                    raise TypeError('unknown action type in binlog')\n                yield rv\n        stream.close()\n        raise IOError('mysql connection closed')\n\n    def _parse_table_structure(self, data):\n        \"\"\"\n        parse the table 
structure\n        \"\"\"\n        for item in data.iter():\n            if item.tag == 'field':\n                field = item.attrib.get('Field')\n                field_type = item.attrib.get('Type')  # avoid shadowing the builtin `type`\n                if 'int' in field_type:\n                    serializer = int\n                elif 'float' in field_type:\n                    serializer = float\n                elif 'datetime' in field_type:\n                    if '(' in field_type:\n                        serializer = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')\n                    else:\n                        serializer = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')\n                elif 'char' in field_type:\n                    serializer = str\n                elif 'text' in field_type:\n                    serializer = str\n                else:\n                    serializer = str\n                self.table_structure[field] = serializer\n\n    def _parse_and_remove(self, f, path):\n        \"\"\"\n        snippet from python cookbook, for parsing large xml files\n        \"\"\"\n        path_parts = path.split('/')\n        doc = iterparse(f, ('start', 'end'), recover=False, encoding='utf-8', huge_tree=True)\n        # Skip the root element\n        next(doc)\n        tag_stack = []\n        elem_stack = []\n        for event, elem in doc:\n            if event == 'start':\n                tag_stack.append(elem.tag)\n                elem_stack.append(elem)\n            elif event == 'end':\n                if tag_stack == path_parts:\n                    yield elem\n                    elem_stack[-2].remove(elem)\n                if tag_stack == ['database', 'table_structure']:  # dirty hack for getting the table structure\n                    self._parse_table_structure(elem)\n                    elem_stack[-2].remove(elem)\n                try:\n                    tag_stack.pop()\n                    elem_stack.pop()\n                except IndexError:\n                    pass\n\n    def 
_xml_parser(self, f_obj):\n        \"\"\"\n        parse the mysqldump XML stream, converting every 'database/table_data/row' into a dict\n        \"\"\"\n        for row in self._parse_and_remove(f_obj, 'database/table_data/row'):\n            doc = {}\n            for field in row.iter(tag='field'):\n                k = field.attrib.get('name')\n                v = field.text\n                doc[k] = v\n            yield {'action': 'index', 'doc': doc}\n\n    def _save_binlog_record(self):\n        if self.is_binlog_sync:\n            with open(self.config['binlog_sync']['record_file'], 'w') as f:\n                self.logger.info(\"Sync binlog_file: {file}  binlog_pos: {pos}\".format(\n                    file=self.log_file,\n                    pos=self.log_pos)\n                )\n                yaml.safe_dump({\"log_file\": self.log_file, \"log_pos\": self.log_pos}, f, default_flow_style=False)\n\n    def _xml_dump_loader(self):\n        mysqldump = subprocess.Popen(\n            shlex.split(encode_in_py2(self.dump_cmd)),\n            stdout=subprocess.PIPE,\n            stderr=DEVNULL,\n            close_fds=True)\n\n        remove_invalid_pipe = subprocess.Popen(\n            shlex.split(encode_in_py2(REMOVE_INVALID_PIPE)),\n            stdin=mysqldump.stdout,\n            stdout=subprocess.PIPE,\n            stderr=DEVNULL,\n            close_fds=True)\n\n        return remove_invalid_pipe.stdout\n\n    def _xml_file_loader(self, filename):\n        f = open(filename, 'rb')  # bytes required\n        return f\n\n    def _send_email(self, title, content):\n        \"\"\"\n        send a notification email\n        \"\"\"\n        if not self.config.get('email'):\n            return\n\n        import smtplib\n        from email.mime.text import MIMEText\n\n        msg = MIMEText(content)\n        msg['Subject'] = title\n        msg['From'] = self.config['email']['from']['username']\n        msg['To'] = ', '.join(self.config['email']['to'])\n\n        # Send 
the message via our own SMTP server.\n        s = smtplib.SMTP()\n        s.connect(self.config['email']['from']['host'])\n        s.login(user=self.config['email']['from']['username'],\n                password=self.config['email']['from']['password'])\n        # sendmail() needs a list of recipients; msg['To'] is only the comma-joined display header\n        s.sendmail(msg['From'], self.config['email']['to'], msg.as_string())\n        s.quit()\n\n    def _sync_from_stream(self):\n        self.logger.info(\"Start to dump from stream\")\n        docs = reduce(lambda x, y: y(x), [self._xml_parser,\n                                          self._formatter,\n                                          self._mapper,\n                                          self._processor],\n                      self._xml_dump_loader())\n        self._updater(docs)\n        self.logger.info(\"Dump finished\")\n\n    def _sync_from_file(self):\n        self.logger.info(\"Start to dump from xml file\")\n        self.logger.info(\"Filename: {}\".format(self.config['xml_file']['filename']))\n        docs = reduce(lambda x, y: y(x), [self._xml_parser,\n                                          self._formatter,\n                                          self._mapper,\n                                          self._processor],\n                      self._xml_file_loader(self.config['xml_file']['filename']))\n        self._updater(docs)\n        self.logger.info(\"Dump finished\")\n\n    def _sync_from_binlog(self):\n        self.logger.info(\"Start to sync binlog\")\n        docs = reduce(lambda x, y: y(x), [self._mapper, self._processor], self._binlog_loader())\n        self._updater(docs)\n\n    def run(self):\n        \"\"\"\n        workflow:\n        1. sync dump data\n        2. 
sync binlog\n        \"\"\"\n        try:\n            if not self.is_binlog_sync:\n                if len(sys.argv) > 2 and sys.argv[2] == '--fromfile':\n                    self._sync_from_file()\n                else:\n                    self._sync_from_stream()\n            self._sync_from_binlog()\n        except Exception:\n            import traceback\n            self.logger.error(traceback.format_exc())\n            self._send_email('es sync error', traceback.format_exc())\n            raise\n\n\ndef start():\n    instance = ElasticSync()\n    instance.run()\n\n\nif __name__ == '__main__':\n    start()\n"
  },
  {
    "path": "upstart.conf",
    "content": "description 'es sync'\nstart on runlevel [2345]\nstop on runlevel [06]\n\nrespawn\nnormal exit 0\n\nchdir <PATH_TO_CONFIG>\n\nscript\n    es-sync config.yaml\nend script"
  }
]