Full Code of zhongbiaodev/py-mysql-elasticsearch-sync for AI

master a2a076e00a73 cached

10 files

40.5 KB

9.3k tokens

50 symbols

1 requests

Download .txt

Repository: zhongbiaodev/py-mysql-elasticsearch-sync
Branch: master
Commit: a2a076e00a73
Files: 10
Total size: 40.5 KB

Directory structure:
gitextract_re_gm3op/

├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── README_CN.md
├── es_sync/
│   └── __init__.py
├── requirements.txt
├── setup.py
├── src/
│   └── __init__.py
└── upstart.conf

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

*.yaml

# virtualenv
venv/
ENV/

# Pycharm settings
*.idea/

================================================
FILE: LICENSE
================================================
The MIT License (MIT)

Copyright (c) by Windfarer

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.



================================================
FILE: MANIFEST.in
================================================
include README.md LICENSE
recursive-include src *.yaml

================================================
FILE: README.md
================================================
# py-mysql-elasticsearch-sync
Simple and fast MySQL to Elasticsearch sync tool, written in Python.

[中文文档](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/README_CN.md)

## Introduction
This tool helps you to initialize MySQL dump table to Elasticsearch by parsing mysqldump, then incremental sync MySQL table to Elasticsearch by processing MySQL Binlog.
Also, during the binlog syncing, this tool will save the binlog sync position, so that it is easy to recover after this tool being shutdown for any reason.

## Installation
By following these steps.

##### 1. ibxml2 and libxslt
This tool depends on python lxml package, so that you should install  the lxml's dependecies correctly, the libxml2 and libxslt are required.

For example, in CentOS:

```
sudo yum install libxml2 libxml2-devel libxslt libxslt-devel
```

Or in Debian/Ubuntu:

```
sudo apt-get install libxml2-dev libxslt-dev python-dev
```

See [lxml Installation](http://lxml.de/installation.html) for more infomation.
##### 2. mysqldump
And then, mysqldump is required in the machine where this tool will be run on it.(and the mysql server must enable binlog)


##### 3. this tool
Then install this tool

```
pip install py-mysql-elasticsearch-sync
```

## Configuration
There is a [sample config](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/es_sync/sample.yaml) file in repo, you can start by editing it.

## Running
Simply run command

```
es-sync path/to/your/config.yaml
```
and the tool will dump your data as stream to sync, when dump is over, it will start to sync binlog.

The latest synced binlog file and position are recorded in your info file which is configured in your config file. You can restart dump step by remove it, or you can change sync position by edit it.

Or if you  but want to load it from your own dumpfile. You should dump your table first as xml format(by adding ```-X```option to your mysqldump command) 

then

```
es-sync path/to/your/config.yaml --fromfile
```
to start sync, when xml sync is over, it will also start binlog sync.

## Deployment
We provide an [upstart script]((https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/upstart.conf)) to help you deploy this tool, you can edit it for your own condition, besides, you can deploy it in your own way.

## MultiTable Supporting
Now Multi-table is supported through setting tables in config file, the first table is master as default and the others are slave.

Master table and slave tables must use the same primary key, which is defined via _id.

Table has higher priority than tables.

## TODO
- [ ] MultiIndex Supporting


================================================
FILE: README_CN.md
================================================
# py-mysql-elasticsearch-sync
一个从MySQL向Elasticsearch同步数据的工具，使用Python实现。

## 简介
在第一次初始化数据时，本工具解析mysqldump导出的数据，并导入ES中，在后续增量更新中，解析binlog的数据，对ES中的数据进行同步。在binlog同步阶段，支持断点恢复，因此无需担心意外中断的问题。

## 安装

##### 1. ibxml2 和 libxslt
本工具基于lxml库，因此需要安装它的依赖的libxml2和libxslt

在CentOS中:

```
sudo yum install libxml2 libxml2-devel libxslt libxslt-devel
```

在Debian/Ubuntu中:

```
sudo apt-get install libxml2-dev libxslt-dev python-dev
```

查看[lxml Installation](http://lxml.de/installation.html)来获取更多相关信息

##### 2. mysqldump
在运行本工具的机器上需要有mysqldump，并且mysql服务器需要开启binlog功能。


##### 3. 本工具
安装本工具

```
pip install py-mysql-elasticsearch-sync
```

## 配置
你可以通过修改[配置文件示例](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/es_sync/sample.yaml)来编写自己的配置文件

## 运行
运行命令

```
es-sync path/to/your/config.yaml
```
工具将开始执行mysqldump并解析流进行同步，当dump结束后，将启动binlog同步

最近一次binlog同步位置记录在一个文件中，这个文件的路径在config文件中配置过。

你可以删除记录文件来从头进行binlog同步，或者修改文件里的内容，来从特定位置开始同步。


你也可以把自己从mysql导出的xml文件同步进ES中(在mysqldump的命令中加上参数```-X```即可导出xml) 

然后执行

```
es-sync path/to/your/config.yaml --fromfile
```
启动从xml导入，当从xml导入完毕后，它会开始同步binlog

## 服务管理
我们写了一个[upstart脚本](https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync/blob/master/upstart.conf)来管理本工具的运行，你也可以用你自己的方式进行部署运行

## 多表支持
你可以在config文件中配置tables以支持多表，默认tables中第一张表为主表，其余表为从表。

主表和从表主键必须相同，均为_id字段。

当同时设置table和tables时，table优先级较高。

## TODO
- [ ] 多索引支持


================================================
FILE: es_sync/__init__.py
================================================
from __future__ import print_function, unicode_literals
from future.builtins import str, range
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    import os
    DEVNULL = open(os.devnull, 'wb')
else:
    from subprocess import DEVNULL


def encode_in_py2(s):
    if PY2:
        return s.encode('utf-8')
    return s

import os.path
import yaml
import signal
import requests
import subprocess
import json
import logging
import shlex
import datetime
import decimal
from lxml.etree import iterparse
from functools import reduce
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent
from pymysqlreplication.event import RotateEvent, XidEvent

__version__ = '0.4.2'


# The magic spell for removing invalid characters in xml stream.
REMOVE_INVALID_PIPE = r'tr -d "\00\01\02\03\04\05\06\07\10\13\14\16\17\20\21\22\23\24\25\26\27\30\31\32\33\34\35\36\37"'

DEFAULT_BULKSIZE = 100
DEFAULT_BINLOG_BULKSIZE = 1


class ElasticSync(object):
    table_structure = {}
    log_file = None
    log_pos = None

    @property
    def is_binlog_sync(self):
        rv = bool(self.log_file and self.log_pos)
        return rv

    def __init__(self):
        try:
            self.config = yaml.load(open(sys.argv[1]))
        except IndexError:
            print('Error: not specify config file')
            exit(1)

        mysql = self.config.get('mysql')
        if mysql.get('table'):
            self.tables = [mysql.get('table')]
            self.dump_cmd = 'mysqldump -h {host} -P {port} -u {user} --password={password} {db} {table} ' \
                        '--default-character-set=utf8 -X --opt --quick'.format(**mysql)
        elif mysql.get('tables'):
            self.tables = mysql.get('tables')
            mysql.update({
                'tables': ' '.join(mysql.get('tables'))
            })
            self.dump_cmd = 'mysqldump -h {host} -P {port} -u {user} --password={password} --database {db} --tables {tables} ' \
                        '--default-character-set=utf8 -X --opt --quick'.format(**mysql)
        else:
            print('Error: must specify either table or tables')
            exit(1)
        self.master = self.tables[0]  # use the first table as master
        self.current_table = None

        self.binlog_conf = dict(
            [(key, self.config['mysql'][key]) for key in ['host', 'port', 'user', 'password', 'db']]
        )

        self.endpoint = 'http://{host}:{port}/{index}/{type}/_bulk'.format(
            host=self.config['elastic']['host'],
            port=self.config['elastic']['port'],
            index=self.config['elastic']['index'],
            type=self.config['elastic']['type']
        )  # todo: supporting multi-index

        self.mapping = self.config.get('mapping') or {}
        if self.mapping.get('_id'):
            self.id_key = self.mapping.pop('_id')
        else:
            self.id_key = None

        self.ignoring = self.config.get('ignoring') or []

        record_path = self.config['binlog_sync']['record_file']
        if os.path.isfile(record_path):
            with open(record_path, 'r') as f:
                record = yaml.load(f)
                self.log_file = record.get('log_file')
                self.log_pos = record.get('log_pos')

        self.bulk_size = self.config.get('elastic').get('bulk_size') or DEFAULT_BULKSIZE
        self.binlog_bulk_size = self.config.get('elastic').get('binlog_bulk_size') or DEFAULT_BINLOG_BULKSIZE

        self._init_logging()
        self._force_commit = False

    def _init_logging(self):
        logging.basicConfig(filename=self.config['logging']['file'],
                            level=logging.INFO,
                            format='[%(levelname)s] - %(filename)s[line:%(lineno)d] - %(asctime)s %(message)s')
        self.logger = logging.getLogger(__name__)
        logging.getLogger("requests").setLevel(logging.WARNING)  # disable requests info logging

        def cleanup(*args):
            self.logger.info('Received stop signal')
            self.logger.info('Shutdown')
            sys.exit(0)

        signal.signal(signal.SIGINT, cleanup)
        signal.signal(signal.SIGTERM, cleanup)

    def _post_to_es(self, data):
        """
        send post requests to es restful api
        """
        resp = requests.post(self.endpoint, data=data)
        if resp.json().get('errors'):  # a boolean to figure error occurs
            for item in resp.json()['items']:
                if list(item.values())[0].get('error'):
                    logging.error(item)
        else:
            self._save_binlog_record()

    def _bulker(self, bulk_size):
        """
        Example:
            u = bulker()
            u.send(None)  #for generator initialize
            u.send(json_str)  # input json item
            u.send(another_json_str)  # input json item
            ...
            u.send(None) force finish bulk and post
        """
        while True:
            data = ""
            for i in range(bulk_size):
                item = yield
                if item:
                    data = data + item + "\n"
                else:
                    break
                if self._force_commit:
                    break
            # print(data)
            print('-'*10)
            if data:
                self._post_to_es(data)

            self._force_commit = False

    def _updater(self, data):
        """
        encapsulation of bulker
        """
        if self.is_binlog_sync:
            u = self._bulker(bulk_size=self.binlog_bulk_size)
        else:
            u = self._bulker(bulk_size=self.bulk_size)

        u.send(None)  # push the generator to first yield
        for item in data:
            u.send(item)
        u.send(None)  # tell the generator it's the end

    def _json_serializer(self, obj):
        """
        format the object which json not supported
        """
        if isinstance(obj, datetime.datetime) or isinstance(obj, datetime.date):
            return obj.isoformat()
        elif isinstance(obj, decimal.Decimal):
            return str(obj)
        raise TypeError('Type not serializable for obj {obj}'.format(obj=obj))

    def _processor(self, data):
        """
        The action must be one of the following:
        create
            Create a document only if the document does not already exist.
        index
            Create a new document or replace an existing document.
        update
            Do a partial update on a document.
        delete
            Delete a document.
        """
        for item in data:
            if self.id_key:
                action_content = {'_id': item['doc'][self.id_key]}
            else:
                action_content = {}
            for field in self.ignoring:
                try:
                    item['doc'].pop(field)
                except KeyError:
                    pass
            meta = json.dumps({item['action']: action_content})
            if item['action'] == 'index':
                body = json.dumps(item['doc'], default=self._json_serializer)
                rv = meta + '\n' + body
            elif item['action'] == 'update':
                body = json.dumps({'doc': item['doc']}, default=self._json_serializer)
                rv = meta + '\n' + body
            elif item['action'] == 'delete':
                rv = meta + '\n'
            elif item['action'] == 'create':
                body = json.dumps(item['doc'], default=self._json_serializer)
                rv = meta + '\n' + body
            else:
                logging.error('unknown action type in doc')
                raise TypeError('unknown action type in doc')
            yield rv

    def _mapper(self, data):
        """
        mapping old key to new key
        """
        for item in data:
            if self.mapping:
                for k, v in self.mapping.items():
                    try:
                        item['doc'][k] = item['doc'][v]
                        del item['doc'][v]
                    except KeyError:
                        continue
            # print(doc)
            yield item

    def _formatter(self, data):
        """
        format every field from xml, according to parsed table structure
        """
        for item in data:
            for field, serializer in self.table_structure.items():
                if field in item['doc'] and item['doc'][field]:
                    try:
                        item['doc'][field] = serializer(item['doc'][field])
                    except ValueError as e:
                        self.logger.error(
                            "Error occurred during format, ErrorMessage:{msg}, ErrorItem:{item}".format(
                            msg=str(e),
                            item=str(item)))
                        item['doc'][field] = None
                    except TypeError as e:
                        item['doc'][field] = None
            # print(item)
            yield item

    def _binlog_loader(self):
        """
        read row from binlog
        """
        if self.is_binlog_sync:
            resume_stream = True
            logging.info("Resume from binlog_file: {file}  binlog_pos: {pos}".format(file=self.log_file,
                                                                                     pos=self.log_pos))
        else:
            resume_stream = False

        stream = BinLogStreamReader(connection_settings=self.binlog_conf,
                                    server_id=self.config['mysql']['server_id'],
                                    only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent, RotateEvent, XidEvent],
                                    only_tables=self.tables,
                                    resume_stream=resume_stream,
                                    blocking=True,
                                    log_file=self.log_file,
                                    log_pos=self.log_pos)
        for binlogevent in stream:
            self.log_file = stream.log_file
            self.log_pos = stream.log_pos

            # RotateEvent to update binlog record when no related table changed
            if isinstance(binlogevent, RotateEvent):
                self._save_binlog_record()
                continue

            if isinstance(binlogevent, XidEvent):  # event_type == 16
                self._force_commit = True
                continue

            for row in binlogevent.rows:
                if isinstance(binlogevent, DeleteRowsEvent):
                    if binlogevent.table == self.master:
                        rv = {
                            'action': 'delete',
                            'doc': row['values']
                        }
                    else:
                        rv = {
                            'action': 'update',
                            'doc': {k: row['values'][k] if self.id_key and self.id_key == k else None for k in row['values']}
                        }
                elif isinstance(binlogevent, UpdateRowsEvent):
                    rv = {
                        'action': 'update',
                        'doc': row['after_values']
                    }
                elif isinstance(binlogevent, WriteRowsEvent):
                    if binlogevent.table == self.master:
                        rv = {
                                'action': 'create',
                                'doc': row['values']
                            }
                    else:
                        rv = {
                                'action': 'update',
                                'doc': row['values']
                            }
                else:
                    logging.error('unknown action type in binlog')
                    raise TypeError('unknown action type in binlog')
                yield rv
                # print(rv)
        stream.close()
        raise IOError('mysql connection closed')

    def _parse_table_structure(self, data):
        """
        parse the table structure
        """
        for item in data.iter():
            if item.tag == 'field':
                field = item.attrib.get('Field')
                type = item.attrib.get('Type')
                if 'int' in type:
                    serializer = int
                elif 'float' in type:
                    serializer = float
                elif 'datetime' in type:
                    if '(' in type:
                        serializer = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')
                    else:
                        serializer = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
                elif 'char' in type:
                    serializer = str
                elif 'text' in type:
                    serializer = str
                else:
                    serializer = str
                self.table_structure[field] = serializer

    def _parse_and_remove(self, f, path):
        """
        snippet from python cookbook, for parsing large xml file
        """
        path_parts = path.split('/')
        doc = iterparse(f, ('start', 'end'), recover=False, encoding='utf-8', huge_tree=True)
        # Skip the root element
        next(doc)
        tag_stack = []
        elem_stack = []
        for event, elem in doc:
            if event == 'start':
                if elem.tag == 'table_data':
                    self.current_table = elem.attrib['name']
                tag_stack.append(elem.tag)
                elem_stack.append(elem)
            elif event == 'end':
                if tag_stack == ['database', 'table_data']:
                    self.current_table = None
                if tag_stack == path_parts:
                    yield elem
                    elem_stack[-2].remove(elem)
                if tag_stack == ['database', 'table_structure']:
                # dirty hack for getting the tables structure
                    self._parse_table_structure(elem)
                    elem_stack[-2].remove(elem)
                try:
                    tag_stack.pop()
                    elem_stack.pop()
                except IndexError:
                    pass

    def _xml_parser(self, f_obj):
        """
        parse mysqldump XML streaming, convert every item to dict object.
        'database/table_data/row'
        """
        for row in self._parse_and_remove(f_obj, 'database/table_data/row'):
            doc = {}
            for field in row.iter(tag='field'):
                k = field.attrib.get('name')
                v = field.text
                doc[k] = v
            if not self.current_table or self.current_table == self.master:
                yield {'action': 'create', 'doc': doc}
            else:
                yield {'action': 'update', 'doc': doc}

    def _save_binlog_record(self):
        if self.is_binlog_sync:
            with open(self.config['binlog_sync']['record_file'], 'w') as f:
                logging.info("Sync binlog_file: {file}  binlog_pos: {pos}".format(
                    file=self.log_file,
                    pos=self.log_pos)
                )
                yaml.safe_dump({"log_file": self.log_file,
                                "log_pos": self.log_pos},
                               f,
                               default_flow_style=False)

    def _xml_dump_loader(self):
        mysqldump = subprocess.Popen(
            shlex.split(encode_in_py2(self.dump_cmd)),
            stdout=subprocess.PIPE,
            stderr=DEVNULL,
            close_fds=True)

        remove_invalid_pipe = subprocess.Popen(
            shlex.split(encode_in_py2(REMOVE_INVALID_PIPE)),
            stdin=mysqldump.stdout,
            stdout=subprocess.PIPE,
            stderr=DEVNULL,
            close_fds=True)

        return remove_invalid_pipe.stdout

    def _xml_file_loader(self, filename):
        f = open(filename, 'rb')  # bytes required

        remove_invalid_pipe = subprocess.Popen(
            shlex.split(encode_in_py2(REMOVE_INVALID_PIPE)),
            stdin=f,
            stdout=subprocess.PIPE,
            stderr=DEVNULL,
            close_fds=True)
        return remove_invalid_pipe.stdout

    def _send_email(self, title, content):
        """
        send notification email
        """
        if not self.config.get('email'):
            return

        import smtplib
        from email.mime.text import MIMEText

        msg = MIMEText(content)
        msg['Subject'] = title
        msg['From'] = self.config['email']['from']['username']
        msg['To'] = ', '.join(self.config['email']['to'])

        # Send the message via our own SMTP server.
        s = smtplib.SMTP()
        s.connect(self.config['email']['from']['host'])
        s.login(user=self.config['email']['from']['username'],
                password=self.config['email']['from']['password'])
        s.sendmail(msg['From'], msg['To'], msg=msg.as_string())
        s.quit()

    def _sync_from_stream(self):
        logging.info("Start to dump from stream")
        docs = reduce(lambda x, y: y(x), [self._xml_parser, 
                                          self._formatter,
                                          self._mapper, 
                                          self._processor], 
                      self._xml_dump_loader())
        self._updater(docs)
        logging.info("Dump success")

    def _sync_from_file(self):
        logging.info("Start to dump from xml file")
        logging.info("Filename: {}".format(self.config['xml_file']['filename']))
        docs = reduce(lambda x, y: y(x), [self._xml_parser, 
                                          self._formatter, 
                                          self._mapper, 
                                          self._processor],
                      self._xml_file_loader(self.config['xml_file']['filename']))
        self._updater(docs)
        logging.info("Dump success")

    def _sync_from_binlog(self):
        logging.info("Start to sync binlog")
        docs = reduce(lambda x, y: y(x), [self._mapper,
                                          self._processor],
                      self._binlog_loader())
        self._updater(docs)

    def run(self):
        """
        workflow:
        1. sync dump data
        2. sync binlog
        """
        try:
            if not self.is_binlog_sync:
                if len(sys.argv) > 2 and sys.argv[2] == '--fromfile':
                    self._sync_from_file()
                else:
                    self._sync_from_stream()
            self._sync_from_binlog()
        except Exception:
            import traceback
            logging.error(traceback.format_exc())
            self._send_email('es sync error', traceback.format_exc())
            raise


def start():
    instance = ElasticSync()
    instance.run()

if __name__ == '__main__':
    start()


================================================
FILE: requirements.txt
================================================
PyMySQL==0.6.7
mysql-replication>=0.8
requests>=2.9.1
PyYAML>=3.11
lxml>=3.5.0
future>=0.15.2 #for py2 compat


================================================
FILE: setup.py
================================================
from setuptools import setup, find_packages
import es_sync

setup(
    name='py-mysql-elasticsearch-sync',
    version=es_sync.__version__,
    packages=find_packages(),
    url='https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync',
    license='MIT',
    author='Windfarer',
    author_email='windfarer@gmail.com',
    description='MySQL to Elasticsearch sync tool',
    install_requires=[
        'PyMySQL==0.6.7',
        'mysql-replication==0.9',
        'requests==2.9.1',
        'PyYAML==3.11',
        'lxml==3.5.0',
        'future==0.15.2'
    ],
    entry_points={
        'console_scripts': [
            'es-sync=es_sync:start',
        ]
    },
    include_package_data=True
)

================================================
FILE: src/__init__.py
================================================
from __future__ import print_function, unicode_literals
from future.builtins import str, range
import sys
PY2 = sys.version_info[0] == 2

if PY2:
    import os
    DEVNULL = open(os.devnull, 'wb')
else:
    from subprocess import DEVNULL
def encode_in_py2(s):
    if PY2:
        return s.encode('utf-8')
    return s

import os.path
import yaml
import signal
import requests
import subprocess
import json
import logging
import shlex
from datetime import datetime
from lxml.etree import iterparse
from functools import reduce
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent

__version__ = '0.3.3.1'


# The magic spell for removing invalid characters in xml stream.
REMOVE_INVALID_PIPE = r'tr -d "\00\01\02\03\04\05\06\07\10\13\14\16\17\20\21\22\23\24\25\26\27\30\31\32\33\34\35\36\37"'

DEFAULT_BULKSIZE = 100
DEFAULT_BINLOG_BULKSIZE = 1


class ElasticSync(object):
    table_structure = {}
    log_file = None
    log_pos = None

    @property
    def is_binlog_sync(self):
        rv = bool(self.log_file and self.log_pos)
        return rv

    def __init__(self):
        try:
            self.config = yaml.load(open(sys.argv[1]))
        except IndexError:
            print('Error: not specify config file')
            exit(1)

        self.dump_cmd = 'mysqldump -h {host} -P {port} -u {user} --password={password} {db} {table} ' \
                        '--default-character-set=utf8 -X'.format(**self.config['mysql'])

        self.binlog_conf = dict(
            [(key, self.config['mysql'][key]) for key in ['host', 'port', 'user', 'password', 'db']]
        )

        self.endpoint = 'http://{host}:{port}/{index}/{type}/_bulk'.format(
            host=self.config['elastic']['host'],
            port=self.config['elastic']['port'],
            index=self.config['elastic']['index'],
            type=self.config['elastic']['type']
        )  # todo: supporting multi-index

        self.mapping = self.config.get('mapping') or {}
        if self.mapping.get('_id'):
            self.id_key = self.mapping.pop('_id')
        else:
            self.id_key = None

        record_path = self.config['binlog_sync']['record_file']
        if os.path.isfile(record_path):
            with open(record_path, 'r') as f:
                record = yaml.load(f)
                self.log_file = record.get('log_file')
                self.log_pos = record.get('log_pos')

        self.bulk_size = self.config.get('elastic').get('bulk_size') or DEFAULT_BULKSIZE
        self.binlog_bulk_size = self.config.get('elastic').get('binlog_bulk_size') or DEFAULT_BINLOG_BULKSIZE

        self._init_logging()

    def _init_logging(self):
        logging.basicConfig(filename=self.config['logging']['file'],
                            level=logging.INFO,
                            format='[%(levelname)s] %(asctime)s %(message)s')
        self.logger = logging.getLogger(__name__)
        logging.getLogger("requests").setLevel(logging.WARNING)  # disable requests info logging

        def cleanup(*args):
            self.logger.info('Received stop signal')
            self.logger.info('Shutdown')
            sys.exit(0)

        signal.signal(signal.SIGINT, cleanup)
        signal.signal(signal.SIGTERM, cleanup)

    def _post_to_es(self, data):
        """
        send post requests to es restful api
        """
        resp = requests.post(self.endpoint, data=data)
        if resp.json().get('errors'):  # a boolean to figure error occurs
            for item in resp.json()['items']:
                if list(item.values())[0].get('error'):
                    logging.error(item)
        else:
            self._save_binlog_record()

    def _bulker(self, bulk_size):
        """
        Example:
            u = bulker()
            u.send(None)  #for generator initialize
            u.send(json_str)  # input json item
            u.send(another_json_str)  # input json item
            ...
            u.send(None) force finish bulk and post
        """
        while True:
            data = ""
            for i in range(bulk_size):
                item = yield
                if item:
                    data = data + item + "\n"
                else:
                    break
            # print(data)
            print('-'*10)
            if data:
                self._post_to_es(data)

    def _updater(self, data):
        """
        encapsulation of bulker
        """
        if self.is_binlog_sync:
                u = self._bulker(bulk_size=self.binlog_bulk_size)
        else:
                u = self._bulker(bulk_size=self.bulk_size)

        u.send(None)  # push the generator to first yield
        for item in data:
            u.send(item)
        u.send(None)  # tell the generator it's the end

    def _json_serializer(self, obj):
        """
        format the object which json not supported
        """
        if isinstance(obj, datetime):
            return obj.isoformat()
        raise TypeError('Type not serializable')

    def _processor(self, data):
        """
        The action must be one of the following:
        create
            Create a document only if the document does not already exist.
        index
            Create a new document or replace an existing document.
        update
            Do a partial update on a document.
        delete
            Delete a document.
        """
        for item in data:
            if self.id_key:
                action_content = {'_id': item['doc'][self.id_key]}
            else:
                action_content = {}
            meta = json.dumps({item['action']: action_content})
            if item['action'] == 'index':
                body = json.dumps(item['doc'], default=self._json_serializer)
                rv = meta + '\n' + body
            elif item['action'] == 'update':
                body = json.dumps({'doc': item['doc']}, default=self._json_serializer)
                rv = meta + '\n' + body
            elif item['action'] == 'delete':
                rv = meta + '\n'
            elif item['action'] == 'create':
                body = json.dumps(item['doc'], default=self._json_serializer)
                rv = meta + '\n' + body
            else:
                logging.error('unknown action type in doc')
                raise TypeError('unknown action type in doc')
            yield rv

    def _mapper(self, data):
        """
        mapping old key to new key
        """
        for item in data:
            if self.mapping:
                for k, v in self.mapping.items():
                    item['doc'][k] = item['doc'][v]
                    del item['doc'][v]
            # print(doc)
            yield item

    def _formatter(self, data):
        """
        format every field from xml, according to parsed table structure
        """
        for item in data:
            for field, serializer in self.table_structure.items():
                if item['doc'][field]:
                    try:
                        item['doc'][field] = serializer(item['doc'][field])
                    except ValueError as e:
                        self.logger.error("Error occurred during format, ErrorMessage:{msg}, ErrorItem:{item}".format(
                            msg=str(e),
                            item=str(item)))
                        item['doc'][field] = None
            # print(item)
            yield item

    def _binlog_loader(self):
        """
        read row from binlog
        """
        if self.is_binlog_sync:
            resume_stream = True
            logging.info("Resume from binlog_file: {file}  binlog_pos: {pos}".format(file=self.log_file,
                                                                                     pos=self.log_pos))
        else:
            resume_stream = False

        stream = BinLogStreamReader(connection_settings=self.binlog_conf,
                                    server_id=self.config['mysql']['server_id'],
                                    only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent],
                                    only_tables=[self.config['mysql']['table']],
                                    resume_stream=resume_stream,
                                    blocking=True,
                                    log_file=self.log_file,
                                    log_pos=self.log_pos)
        for binlogevent in stream:
            self.log_file = stream.log_file
            self.log_pos = stream.log_pos
            for row in binlogevent.rows:
                if isinstance(binlogevent, DeleteRowsEvent):
                    rv = {
                        'action': 'delete',
                        'doc': row['values']
                    }
                elif isinstance(binlogevent, UpdateRowsEvent):
                    rv = {
                        'action': 'update',
                        'doc': row['after_values']
                    }
                elif isinstance(binlogevent, WriteRowsEvent):
                    rv = {
                        'action': 'index',
                        'doc': row['values']
                    }
                else:
                    logging.error('unknown action type in binlog')
                    raise TypeError('unknown action type in binlog')
                yield rv
                # print(rv)
        stream.close()
        raise IOError('mysql connection closed')

    def _parse_table_structure(self, data):
        """
        parse the table structure
        """
        for item in data.iter():
            if item.tag == 'field':
                field = item.attrib.get('Field')
                type = item.attrib.get('Type')
                if 'int' in type:
                    serializer = int
                elif 'float' in type:
                    serializer = float
                elif 'datetime' in type:
                    if '(' in type:
                        serializer = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')
                    else:
                        serializer = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
                elif 'char' in type:
                    serializer = str
                elif 'text' in type:
                    serializer = str
                else:
                    serializer = str
                self.table_structure[field] = serializer

    def _parse_and_remove(self, f, path):
        """
        snippet from python cookbook, for parsing large xml file
        """
        path_parts = path.split('/')
        doc = iterparse(f, ('start', 'end'), recover=False, encoding='utf-8', huge_tree=True)
        # Skip the root element
        next(doc)
        tag_stack = []
        elem_stack = []
        for event, elem in doc:
            if event == 'start':
                tag_stack.append(elem.tag)
                elem_stack.append(elem)
            elif event == 'end':
                if tag_stack == path_parts:
                    yield elem
                    elem_stack[-2].remove(elem)
                if tag_stack == ['database', 'table_structure']:  # dirty hack for getting the tables structure
                    self._parse_table_structure(elem)
                    elem_stack[-2].remove(elem)
                try:
                    tag_stack.pop()
                    elem_stack.pop()
                except IndexError:
                    pass

    def _xml_parser(self, f_obj):
        """
        parse mysqldump XML streaming, convert every item to dict object. 'database/table_data/row'
        """
        for row in self._parse_and_remove(f_obj, 'database/table_data/row'):
            doc = {}
            for field in row.iter(tag='field'):
                k = field.attrib.get('name')
                v = field.text
                doc[k] = v
            yield {'action': 'index', 'doc': doc}

    def _save_binlog_record(self):
        if self.is_binlog_sync:
            with open(self.config['binlog_sync']['record_file'], 'w') as f:
                logging.info("Sync binlog_file: {file}  binlog_pos: {pos}".format(
                    file=self.log_file,
                    pos=self.log_pos)
                )
                yaml.safe_dump({"log_file": self.log_file, "log_pos": self.log_pos}, f, default_flow_style=False)

    def _xml_dump_loader(self):
        mysqldump = subprocess.Popen(
            shlex.split(encode_in_py2(self.dump_cmd)),
            stdout=subprocess.PIPE,
            stderr=DEVNULL,
            close_fds=True)

        remove_invalid_pipe = subprocess.Popen(
            shlex.split(encode_in_py2(REMOVE_INVALID_PIPE)),
            stdin=mysqldump.stdout,
            stdout=subprocess.PIPE,
            stderr=DEVNULL,
            close_fds=True)

        return remove_invalid_pipe.stdout

    def _xml_file_loader(self, filename):
        f = open(filename, 'rb')  # bytes required
        return f

    def _send_email(self, title, content):
        """
        send notification email
        """
        if not self.config.get('email'):
            return

        import smtplib
        from email.mime.text import MIMEText

        msg = MIMEText(content)
        msg['Subject'] = title
        msg['From'] = self.config['email']['from']['username']
        msg['To'] = ', '.join(self.config['email']['to'])

        # Send the message via our own SMTP server.
        s = smtplib.SMTP()
        s.connect(self.config['email']['from']['host'])
        s.login(user=self.config['email']['from']['username'],
                password=self.config['email']['from']['password'])
        s.sendmail(msg['From'], msg['To'], msg=msg.as_string())
        s.quit()

    def _sync_from_stream(self):
        logging.info("Start to dump from stream")
        docs = reduce(lambda x, y: y(x), [self._xml_parser, 
                                          self._formatter,
                                          self._mapper, 
                                          self._processor], 
                      self._xml_dump_loader())
        self._updater(docs)
        logging.info("Dump success")

    def _sync_from_file(self):
        logging.info("Start to dump from xml file")
        logging.info("Filename: {}".format(self.config['xml_file']['filename']))
        docs = reduce(lambda x, y: y(x), [self._xml_parser, 
                                          self._formatter, 
                                          self._mapper, 
                                          self._processor],
                      self._xml_file_loader(self.config['xml_file']['filename']))
        self._updater(docs)
        logging.info("Dump success")

    def _sync_from_binlog(self):
        logging.info("Start to sync binlog")
        docs = reduce(lambda x, y: y(x), [self._mapper, self._processor], self._binlog_loader())
        self._updater(docs)

    def run(self):
        """
        workflow:
        1. sync dump data
        2. sync binlog
        """
        try:
            if not self.is_binlog_sync:
                if len(sys.argv) > 2 and sys.argv[2] == '--fromfile':
                    self._sync_from_file()
                else:
                    self._sync_from_stream()
            self._sync_from_binlog()
        except Exception:
            import traceback
            logging.error(traceback.format_exc())
            self._send_email('es sync error', traceback.format_exc())
            raise


def start():
    instance = ElasticSync()
    instance.run()

if __name__ == '__main__':
    start()


================================================
FILE: upstart.conf
================================================
description 'es sync'
start on runlevel [2345]
stop on runlevel [06]

respawn
normal exit 0

chdir <PATH_TO_CONFIG>

script
    es-sync config.yaml
end script

Download .txt

gitextract_re_gm3op/

├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── README_CN.md
├── es_sync/
│   └── __init__.py
├── requirements.txt
├── setup.py
├── src/
│   └── __init__.py
└── upstart.conf

Download .txt

SYMBOL INDEX (50 symbols across 2 files)

FILE: es_sync/__init__.py
  function encode_in_py2 (line 14) | def encode_in_py2(s):
  class ElasticSync (line 45) | class ElasticSync(object):
    method is_binlog_sync (line 51) | def is_binlog_sync(self):
    method __init__ (line 55) | def __init__(self):
    method _init_logging (line 112) | def _init_logging(self):
    method _post_to_es (line 127) | def _post_to_es(self, data):
    method _bulker (line 139) | def _bulker(self, bulk_size):
    method _updater (line 166) | def _updater(self, data):
    method _json_serializer (line 180) | def _json_serializer(self, obj):
    method _processor (line 190) | def _processor(self, data):
    method _mapper (line 229) | def _mapper(self, data):
    method _formatter (line 244) | def _formatter(self, data):
    method _binlog_loader (line 264) | def _binlog_loader(self):
    method _parse_table_structure (line 332) | def _parse_table_structure(self, data):
    method _parse_and_remove (line 357) | def _parse_and_remove(self, f, path):
    method _xml_parser (line 389) | def _xml_parser(self, f_obj):
    method _save_binlog_record (line 405) | def _save_binlog_record(self):
    method _xml_dump_loader (line 417) | def _xml_dump_loader(self):
    method _xml_file_loader (line 433) | def _xml_file_loader(self, filename):
    method _send_email (line 444) | def _send_email(self, title, content):
    method _sync_from_stream (line 467) | def _sync_from_stream(self):
    method _sync_from_file (line 477) | def _sync_from_file(self):
    method _sync_from_binlog (line 488) | def _sync_from_binlog(self):
    method run (line 495) | def run(self):
  function start (line 515) | def start():

FILE: src/__init__.py
  function encode_in_py2 (line 11) | def encode_in_py2(s):
  class ElasticSync (line 40) | class ElasticSync(object):
    method is_binlog_sync (line 46) | def is_binlog_sync(self):
    method __init__ (line 50) | def __init__(self):
    method _init_logging (line 89) | def _init_logging(self):
    method _post_to_es (line 104) | def _post_to_es(self, data):
    method _bulker (line 116) | def _bulker(self, bulk_size):
    method _updater (line 139) | def _updater(self, data):
    method _json_serializer (line 153) | def _json_serializer(self, obj):
    method _processor (line 161) | def _processor(self, data):
    method _mapper (line 195) | def _mapper(self, data):
    method _formatter (line 207) | def _formatter(self, data):
    method _binlog_loader (line 224) | def _binlog_loader(self):
    method _parse_table_structure (line 270) | def _parse_table_structure(self, data):
    method _parse_and_remove (line 295) | def _parse_and_remove(self, f, path):
    method _xml_parser (line 322) | def _xml_parser(self, f_obj):
    method _save_binlog_record (line 334) | def _save_binlog_record(self):
    method _xml_dump_loader (line 343) | def _xml_dump_loader(self):
    method _xml_file_loader (line 359) | def _xml_file_loader(self, filename):
    method _send_email (line 363) | def _send_email(self, title, content):
    method _sync_from_stream (line 386) | def _sync_from_stream(self):
    method _sync_from_file (line 396) | def _sync_from_file(self):
    method _sync_from_binlog (line 407) | def _sync_from_binlog(self):
    method run (line 412) | def run(self):
  function start (line 432) | def start():

Download .json

Condensed preview — 10 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (44K chars).

[
  {
    "path": ".gitignore",
    "chars": 762,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\n"
  },
  {
    "path": "LICENSE",
    "chars": 1075,
    "preview": "The MIT License (MIT)\n\nCopyright (c) by Windfarer\n\nPermission is hereby granted, free of charge, to any person obtaining"
  },
  {
    "path": "MANIFEST.in",
    "chars": 54,
    "preview": "include README.md LICENSE\nrecursive-include src *.yaml"
  },
  {
    "path": "README.md",
    "chars": 2656,
    "preview": "# py-mysql-elasticsearch-sync\nSimple and fast MySQL to Elasticsearch sync tool, written in Python.\n\n[中文文档](https://githu"
  },
  {
    "path": "README_CN.md",
    "chars": 1376,
    "preview": "# py-mysql-elasticsearch-sync\n一个从MySQL向Elasticsearch同步数据的工具，使用Python实现。\n\n## 简介\n在第一次初始化数据时，本工具解析mysqldump导出的数据，并导入ES中，在后续"
  },
  {
    "path": "es_sync/__init__.py",
    "chars": 18949,
    "preview": "from __future__ import print_function, unicode_literals\nfrom future.builtins import str, range\nimport sys\n\nPY2 = sys.ver"
  },
  {
    "path": "requirements.txt",
    "chars": 110,
    "preview": "PyMySQL==0.6.7\nmysql-replication>=0.8\nrequests>=2.9.1\nPyYAML>=3.11\nlxml>=3.5.0\nfuture>=0.15.2 #for py2 compat\n"
  },
  {
    "path": "setup.py",
    "chars": 698,
    "preview": "from setuptools import setup, find_packages\nimport es_sync\n\nsetup(\n    name='py-mysql-elasticsearch-sync',\n    version=e"
  },
  {
    "path": "src/__init__.py",
    "chars": 15619,
    "preview": "from __future__ import print_function, unicode_literals\nfrom future.builtins import str, range\nimport sys\nPY2 = sys.vers"
  },
  {
    "path": "upstart.conf",
    "chars": 158,
    "preview": "description 'es sync'\nstart on runlevel [2345]\nstop on runlevel [06]\n\nrespawn\nnormal exit 0\n\nchdir <PATH_TO_CONFIG>\n\nscr"
  }
]

About this extraction

This page contains the full source code of the zhongbiaodev/py-mysql-elasticsearch-sync GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 10 files (40.5 KB), approximately 9.3k tokens, and a symbol index with 50 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo