[
  {
    "path": ".gitignore",
    "content": ".DS_Store\n.vscode\n__pycache__\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2019 Ryan\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE."
  },
  {
    "path": "README.md",
    "content": "# Image Similarity\n\nThis is an efficient utility of image similarity using [MobileNet](https://arxiv.org/abs/1704.04861) deep neural network.\n\nImage similarity is a task mostly about feature selection of the image. Here, the Convolutional Neural Network (CNN) is used to extract features of these images. It is a better way for computer to understand them effectively.\n\nThis repository use a light-weight model, the MobileNet, to extract image features, then calculate their cosine distances as matrixes. The distance of two features will lie in `[-1, 1]`, where `-1` denotes the features are the most unlike, and `1` denotes they are the most similar. Choose a proper threshold `[-1, 1]`, the most similar images will be matched.\n\n## Usage\n\nThe code is written to match the similar images in a huge amount as efficiently as possible.\n\nTo use it, two `.csv` source files should be prepared before running. Here is an example of one source file. By default, the `.csv` file should at least include one field that place the urls [[1]](#notice).\n\n```text\nid,url\n1,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/1.jpg\n2,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/2.jpg\n3,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/3.jpg\n4,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/4.jpg\n5,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/5.jpg\n6,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/6.jpg\n```\n\nAfter that, we can setup the number of processes that are used to request images from the urls parallelly. For example, we use 2 processes with this tiny demo.\n\n```python\nsimilarity.num_processes = 2\n```\n\nFor feature extraction, a data generator is used to predict images with model batch by batch. By default, GPU will be used if it satisfy the conditions of [Tensorflow](https://www.tensorflow.org/install/gpu). 
Now we can set a proper batch size based on the memory of our computer or server. In this demo, we set it to 16.\n\n```python\nsimilarity.batch_size = 16\n```\n\nAfter invoking `save_data()` twice, four generated files will be saved into the `__generated__` directory, with file names `_*_feature.h5` and `_*_fields.csv`. We can then calculate the similarities by calling `iteration()`, or load the generated files at any time afterward.\n\nAltogether, the full example looks like:\n\n```python\nsimilarity = ImageSimilarity()\n\n'''Setup'''\nsimilarity.batch_size = 16\nsimilarity.num_processes = 2\n\n'''Load source data'''\ntest1 = similarity.load_data_csv('./demo/test1.csv', delimiter=',')\ntest2 = similarity.load_data_csv('./demo/test2.csv', delimiter=',', cols=['id', 'url'])\n\n'''Save features and fields'''\nsimilarity.save_data('test1', test1)\nsimilarity.save_data('test2', test2)\n\n'''Calculate similarities'''\nresult = similarity.iteration(['test1_id', 'test1_url', 'test2_id', 'test2_url'], thresh=0.845)\nprint('Row for source file 1, and column for source file 2.')\nprint(result)\n```\n\nor, if the files have been generated before:\n\n```python\nsimilarity = ImageSimilarity()\nsimilarity.iteration(['test1_id', 'test1_url', 'test2_id', 'test2_url'], thresh=0.845, title1='test1', title2='test2')\n```\n\nFor practical usage, the `thresh` argument of `iteration()` is recommended to be in `[0.84, 1)`. A balanced value is `0.845`.\n\nFor any other details, please check the usage of each function in `main_multi.py`.\n\n## Requirements and Installation\n\n**NOTE**: TensorFlow is not included in `requirements.txt` due to platform differences; please install and configure it yourself based on your computer or server. 
Also note that `Python 3` is required.\n\n```shell\n$   git clone https://github.com/ryanfwy/image-similarity.git\n$   cd image-similarity\n$   pip3 install -r requirements.txt\n```\n\nThe requirements are also listed below.\n\n- tensorflow: the newest version for CPU, or the version that matches your GPU and CUDA.\n- h5py~=2.6.0\n- numpy~=1.14.5\n- requests~=2.21.0\n\n## Experiment\n\nIn the demo, two sources of 6 and 3 images are matched against each other.\n\n### Accuracy\n\nThe cosine distances are shown in the table.\n\n| | <img width=\"100\" src=\"./demo/3.jpg\"/> | <img width=\"100\" src=\"./demo/4.jpg\"/> | <img width=\"100\" src=\"./demo/5.jpg\"/> |\n| --- | :---: | :---: | :---: |\n| <img width=\"100\" src=\"./demo/1.jpg\"/> | **0.9229318** | 0.5577963 | 0.5826051 |\n| <img width=\"100\" src=\"./demo/2.jpg\"/> | **0.84877944** | 0.538753 | 0.5624183 |\n| <img width=\"100\" src=\"./demo/3.jpg\"/> | **1.** | 0.5512465 | 0.57025677 |\n| <img width=\"100\" src=\"./demo/4.jpg\"/> | 0.5512465 | **0.99999994** | 0.54037786 |\n| <img width=\"100\" src=\"./demo/5.jpg\"/> | 0.57025677 | 0.54037786 | **0.9999998** |\n| <img width=\"100\" src=\"./demo/6.jpg\"/> | 0.5575757 | 0.5238174 | **0.91234696** |\n\nAs shown, image similarity with a deep neural network works well. The distances of matched images are roughly greater than `0.84`.\n\n### Efficiency\n\nFor running efficiency, multiprocessing and batch-wise prediction are used in the feature extraction procedure, so image requesting and preprocessing on the CPU run simultaneously with model prediction on the GPU. In the similarity analysis, a matrix-wise calculation avoids iterating over all n*m pairs one by one, which helps a lot given the low efficiency of Python-level iteration, especially at large scale.\n\nThe table below shows the time consumed running with 8 processes in a practical case. 
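
The matrix-wise method mentioned above can be sketched as follows. This is a minimal NumPy illustration of pairwise cosine similarity, not the exact body of `DeepModel.cosine_distance`:

```python
import numpy as np

def cosine_matrix(a, b):
    '''a: (n, d) features, b: (m, d) features -> (n, m) pairwise similarities.'''
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # One matrix product replaces n*m Python-level iterations.
    return a @ b.T
```
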
The results are for reference only; they may vary a lot depending on the number of processes used, the quality of the network, the sizes of the online images, and so on.\n\n|  | Source 1 | Source 2 | Iteration |\n| :---: | :---: | :---: | :---: |\n| Amount | 13501 | 21221 | 13501 * 21221 |\n| Time Consumption | 0:35:53 | 0:17:50 | 0:00:03.913282 |\n\n## Notice\n\n[1] By default, the program fetches online images from the urls prepared in the `.csv` file. To run the code with a list of offline images instead, override the `_sub_process()` static method. For a demo and details, please check [demo_override](./demo_override).\n\n## Thanks\n\nDemo images come from [ImageSimilarity](https://github.com/nivance/image-similarity) by [nivance](https://github.com/nivance). It is another image similarity implementation in Java, based on the pHash algorithm.\n"
  },
  {
    "path": "demo/test1.csv",
    "content": "id,url\n1,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/1.jpg\n2,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/2.jpg\n3,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/3.jpg\n4,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/4.jpg\n5,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/5.jpg\n6,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/6.jpg"
  },
  {
    "path": "demo/test2.csv",
    "content": "id,url\n3,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/3.jpg\n4,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/4.jpg\n5,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/5.jpg"
  },
  {
    "path": "demo_override/README.md",
    "content": "# Implement Your Own `_sub_process()`\n\nBy default, the `.csv` source file should at least include one field that place the **urls**. In other words, the programme have to get the online images from urls. However, if we want to run the code with a list of offline images, we need to override the `_sub_process()` class method by ourselves.\n\n## Implement the Subclass\n\nThe implementation should look like:\n\n```python\nclass NewImageSimilarity(ImageSimilarity):\n    @staticmethod\n    def _sub_process(para):\n        # Override the method from the base class\n        path, fields = para['path'], para['fields']\n        try:\n            feature = DeepModel.preprocess_image(path)\n            return feature, fields\n\n        except Exception as e:\n            print('Error file %s: %s' % (fields[0], e))\n\n        return None, None\n```\n\nAs it is shown, the method `_sub_process()` just simply remove one line `request.get(path)` and pass the `path` argument to `DeepModel.preprocess_image()` directly.\n\nIn here, the `.csv` source file should at least include a field, such as `path`, to place all the local image paths. For example, it can be prepared like this.\n\n```\nid,path\n3,../demo/3.jpg\n4,../demo/4.jpg\n5,../demo/5.jpg\n```\n\nThe full example is also given in [main_override.py](./main_override.py). Please read it for more details about how to implement your own `_sub_process()` and run.\n\n## Quick Preparation\n\nIf we want to load a batch of offline image paths from the local directory which are prepared for `.csv` source file, the [image_util_cli.py](../image_util_cli.py) quick preparation script can easily do this job.\n\nTo run this script, you should first put a batch of images into a directory, such as `source1`. 
The directory tree will look like this.\n\n```\n./source1\n |- image1.jpg\n |- image2.jpg\n |- ...\n |_ image100.jpg\n```\n\nAfter that, open a terminal (e.g. `Terminal.app` on macOS), `cd` to the directory containing `image_util_cli.py`, and run the script with the required arguments.\n\n```\n$   cd image-similarity\n$   python3 image_util_cli.py ./source1 -d '\\t' -o ./images.csv\n```\n\nThe usage of `image_util_cli.py` is given below. We can also check it at any time by passing the `-h` argument.\n\n```\nusage: image_util_cli [-h] [-d DELIMITER] [-o OUT_PATH] source\n\npositional arguments:\nsource                directory of the source images\n\noptional arguments:\n-h, --help            show this help message and exit\n-d DELIMITER, --delimiter DELIMITER\n                      delimiter to the output file, default: ','\n-o OUT_PATH, --out-path OUT_PATH\n                      path to the output file, default: name of the source directory\n```\n"
  },
  {
    "path": "demo_override/main_override.py",
    "content": "import sys\nsys.path.append('..')\n\nfrom main_multi import ImageSimilarity, DeepModel\n\nclass NewImageSimilarity(ImageSimilarity):\n    @staticmethod\n    def _sub_process(para):\n        # Override the method from the base class\n        path, fields = para['path'], para['fields']\n        try:\n            feature = DeepModel.preprocess_image(path)\n            return feature, fields\n\n        except Exception as e:\n            print('Error file %s: %s' % (fields[0], e))\n\n        return None, None\n\n\nif __name__ == \"__main__\":\n    similarity = NewImageSimilarity()\n\n    '''Setup'''\n    similarity.batch_size = 16\n    similarity.num_processes = 2\n\n    '''Load source data'''\n    test1 = similarity.load_data_csv('./test1.csv', delimiter=',')\n    test2 = similarity.load_data_csv('./test2.csv', delimiter=',', cols=['id', 'path'])\n\n    '''Save features and fields'''\n    similarity.save_data('test1', test1)\n    similarity.save_data('test2', test2)\n"
  },
  {
    "path": "demo_override/test1.csv",
    "content": "id,path\n1,../demo/1.jpg\n2,../demo/2.jpg\n3,../demo/3.jpg\n4,../demo/4.jpg\n5,../demo/5.jpg\n6,../demo/6.jpg"
  },
  {
    "path": "demo_override/test2.csv",
    "content": "id,path\n3,../demo/3.jpg\n4,../demo/4.jpg\n5,../demo/5.jpg"
  },
  {
    "path": "image_util_cli.py",
    "content": "'''CLI utility for image preparation.'''\n\nimport os\nimport argparse\nimport numpy as np\n\n\ndef process(input_dir, delimiter=',', output_path=None):\n    '''Generate a `.csv` file with image paths.'''\n    result = [['name', 'path']]\n    file_names = os.listdir(input_dir)\n    file_names.sort()\n    for file_name in file_names:\n        file_path = os.path.join(input_dir, file_name)\n        result.append([os.path.splitext(file_name)[0], os.path.abspath(file_path)])\n\n    if output_path is None:\n        parent_dir = list(filter(lambda x: not x == '', input_dir.split('/')))[-1]\n        output_path = parent_dir + '.csv'\n\n    np.savetxt(output_path, result, delimiter=delimiter, fmt='%s', encoding='utf-8')\n\n    print('File saved to `%s`.' % output_path)\n\ndef main():\n    '''CLI entrance.'''\n    parser = argparse.ArgumentParser(prog='image_util_cli')\n    parser.add_argument('source', action='store', type=str, help='directory of the source images')\n    parser.add_argument('-d', '--delimiter', required=False, type=str, default=',', help=\"delimiter to the output file, default: ','\")\n    parser.add_argument('-o', '--out-path', required=False, type=str, help='path to the output file, default: name of the source directory')\n\n    args = parser.parse_args()\n    if args.source:\n        if os.path.isdir(args.source) is False:\n            exit('No directory `%s`.' % args.source)\n\n        process(args.source, delimiter=args.delimiter, output_path=args.out_path)\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "main_multi.py",
    "content": "'''Image similarity using deep features.\r\n\r\nRecommendation: the threshold of the `DeepModel.cosine_distance` can be set as the following values.\r\n    0.84 = greater matches amount\r\n    0.845 = balance, default\r\n    0.85 = better accuracy\r\n'''\r\n\r\nfrom io import BytesIO\r\nfrom multiprocessing import Pool\r\n\r\nimport os\r\nimport datetime\r\nimport numpy as np\r\nimport requests\r\nimport h5py\r\n\r\nfrom model_util import DeepModel, DataSequence\r\n\r\n\r\nclass ImageSimilarity():\r\n    '''Image similarity.'''\r\n    def __init__(self):\r\n        self._tmp_dir = './__generated__'\r\n        self._batch_size = 64\r\n        self._num_processes = 4\r\n        self._model = None\r\n        self._title = []\r\n\r\n    @property\r\n    def batch_size(self):\r\n        '''Batch size of model prediction.'''\r\n        return self._batch_size\r\n\r\n    @property\r\n    def num_processes(self):\r\n        '''Number of processes using `Multiprocessing.Pool`.'''\r\n        return self._num_processes\r\n\r\n    @batch_size.setter\r\n    def batch_size(self, batch_size):\r\n        self._batch_size = batch_size\r\n\r\n    @num_processes.setter\r\n    def num_processes(self, num_processes):\r\n        self._num_processes = num_processes\r\n\r\n    def _data_generation(self, args):\r\n        '''Generate input batches for predict generator.\r\n\r\n        Args:\r\n            args: parameters that pass to `sub_process`.\r\n                - path: path of the image, online url by default.\r\n                - fields: all other fields.\r\n\r\n        Returns:\r\n            batch_x: a batch of predict samples.\r\n            batch_fields: a batch of fields that matches the samples.\r\n        '''\r\n        # Multiprocessing\r\n        pool = Pool(self._num_processes)\r\n        res = pool.map(self._sub_process, args)\r\n        pool.close()\r\n        pool.join()\r\n\r\n        batch_x, batch_fields = [], []\r\n        for x, fields in res:\r\n  
          if x is not None:\r\n                batch_x.append(x)\r\n                batch_fields.append(fields)\r\n\r\n        return batch_x, batch_fields\r\n\r\n    def _predict_generator(self, paras):\r\n        '''Build a predict generator.\r\n\r\n        Args:\r\n            paras: input parameters of all samples.\r\n                - path: path of the image, online url by default.\r\n                - fields: all other fields.\r\n\r\n        Returns:\r\n            The predict generator.\r\n        '''\r\n        return DataSequence(paras, self._data_generation, batch_size=self._batch_size)\r\n\r\n    @staticmethod\r\n    def _sub_process(para):\r\n        '''A sub-process function of `multiprocessing`.\r\n\r\n        Download image from url and process it into a numpy array.\r\n\r\n        Args:\r\n            para: input parameters of one image.\r\n                - path: path of the image, online url by default.\r\n                - fields: all other fields.\r\n\r\n        Returns:\r\n            feature: feature array of one image.\r\n            fields: all other fields  of one image that passed from `para`.\r\n\r\n        Note: If error happens, `None` will be returned.\r\n        '''\r\n        path, fields = para['path'], para['fields']\r\n        try:\r\n            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}\r\n            res = requests.get(path, headers=headers)\r\n            feature = DeepModel.preprocess_image(BytesIO(res.content))\r\n            return feature, fields\r\n\r\n        except Exception as e:\r\n            print('Error downloading %s: %s' % (fields[0], e))\r\n\r\n        return None, None\r\n\r\n    @staticmethod\r\n    def load_data_csv(fname, delimiter=None, include_header=True, cols=None):\r\n        '''Load `.csv` file. 
Mostly it should be a file that lists all fields to match.\r\n\r\n        Args:\r\n            fname: name or path to the file.\r\n            delimiter: delimiter to split the content.\r\n            include_header: whether the source file includes a header row or not.\r\n            cols: a list of columns to read. Pass `None` to read all columns.\r\n\r\n        Returns:\r\n            A list of data.\r\n        '''\r\n        assert delimiter is not None, 'Delimiter is required.'\r\n\r\n        if include_header:\r\n            usecols = None\r\n            skip_header = 1\r\n            if cols:\r\n                with open(fname, 'r', encoding='utf-8') as f:\r\n                    csv_head = f.readline().strip().split(delimiter)\r\n\r\n                usecols = [csv_head.index(col) for col in cols]\r\n\r\n        else:\r\n            usecols = None\r\n            skip_header = 0\r\n\r\n        data = np.genfromtxt(\r\n            fname,\r\n            dtype=str,\r\n            comments=None,\r\n            delimiter=delimiter,\r\n            encoding='utf-8',\r\n            invalid_raise=False,\r\n            usecols=usecols,\r\n            skip_header=skip_header\r\n        )\r\n\r\n        return data if len(data.shape) > 1 else data.reshape(1, -1)\r\n\r\n    @staticmethod\r\n    def load_data_h5(fname):\r\n        '''Load `.h5` file. Mostly it should be a file with features that were extracted by the model.\r\n\r\n        Args:\r\n            fname: name or path to the file.\r\n\r\n        Returns:\r\n            A list of data.\r\n        '''\r\n        with h5py.File(fname, 'r') as h:\r\n            data = np.array(h['data'])\r\n        return data\r\n\r\n    def save_data(self, title, lines):\r\n        '''Load images from `url`, extract features and fields, save them as `.h5` and `.csv` files.\r\n\r\n        Args:\r\n            title: title to save the results.\r\n            lines: lines of the source data. 
`url` should be placed at the end of all the fields.\r\n\r\n        Returns:\r\n            None. `.h5` and `.csv` files will be saved instead.\r\n        '''\r\n        # Load model\r\n        if self._model is None:\r\n            self._model = DeepModel()\r\n\r\n        print('%s: download starts.' % title)\r\n        start = datetime.datetime.now()\r\n\r\n        args = [{'path': line[-1], 'fields': line} for line in lines]\r\n\r\n        # Prediction\r\n        generator = self._predict_generator(args)\r\n        features = self._model.extract_feature(generator)\r\n\r\n        # Save files\r\n        if len(self._title) == 2:\r\n            self._title = []\r\n        self._title.append(title)\r\n\r\n        if not os.path.isdir(self._tmp_dir):\r\n            os.mkdir(self._tmp_dir)\r\n\r\n        fname_feature = os.path.join(self._tmp_dir, '_' + title + '_feature.h5')\r\n        with h5py.File(fname_feature, mode='w') as h:\r\n            h.create_dataset('data', data=features)\r\n        print('%s: feature saved to `%s`.' % (title, fname_feature))\r\n\r\n        fname_fields = os.path.join(self._tmp_dir, '_' + title + '_fields.csv')\r\n        np.savetxt(fname_fields, generator.list_of_label_fields, delimiter='\\t', fmt='%s', encoding='utf-8')\r\n        print('%s: fields saved to `%s`.' % (title, fname_fields))\r\n\r\n        print('%s: download succeeded.' % title)\r\n        print('Amount:', len(generator.list_of_label_fields))\r\n        print('Time consumed:', datetime.datetime.now()-start)\r\n        print()\r\n\r\n    def iteration(self, save_header, thresh=0.845, title1=None, title2=None):\r\n        '''Calculate the cosine distance of two inputs, save the matched fields to `.csv` file.\r\n\r\n        Args:\r\n            save_header: header of the result `.csv` file.\r\n            thresh: threshold of the similarity.\r\n            title1, title2: Optional. 
If `save_data()` is not invoked, the titles of the two inputs should be passed.\r\n\r\n        Returns:\r\n            A matrix of pairwise cosine distances.\r\n\r\n        Note:\r\n            1. The threshold can be set as the following values.\r\n                0.84 = greater matches amount\r\n                0.845 = balance, default\r\n                0.85 = better accuracy\r\n\r\n            2. If the generated files already exist, set `title1` and `title2` to the titles of their source files.\r\n                For example, passing the title `test` to `save_data()` will generate the `_test_feature.h5` and `_test_fields.csv` files,\r\n                so set `title1` or `title2` to `test` and `save_data()` will not need to be invoked again.\r\n        '''\r\n        if title1 and title2:\r\n            self._title = [title1, title2]\r\n\r\n        assert len(self._title) == 2, 'Two inputs are required.'\r\n\r\n        feature1 = self.load_data_h5(os.path.join(self._tmp_dir, '_' + self._title[0] + '_feature.h5'))\r\n        feature2 = self.load_data_h5(os.path.join(self._tmp_dir, '_' + self._title[1] + '_feature.h5'))\r\n\r\n        fields1 = self.load_data_csv(os.path.join(self._tmp_dir, '_' + self._title[0] + '_fields.csv'), delimiter='\\t', include_header=False)\r\n        fields2 = self.load_data_csv(os.path.join(self._tmp_dir, '_' + self._title[1] + '_fields.csv'), delimiter='\\t', include_header=False)\r\n\r\n        print('%s: feature loaded, shape' % self._title[0], feature1.shape)\r\n        print('%s: fields loaded, length' % self._title[0], len(fields1))\r\n\r\n        print('%s: feature loaded, shape' % self._title[1], feature2.shape)\r\n        print('%s: fields loaded, length' % self._title[1], len(fields2))\r\n\r\n        print('Iteration starts.')\r\n        start = datetime.datetime.now()\r\n\r\n        distances = DeepModel.cosine_distance(feature1, feature2)\r\n        indexes = np.argmax(distances, axis=1)\r\n\r\n        result = [save_header + 
['similarity']]\r\n\r\n        for x, y in enumerate(indexes):\r\n            dis = distances[x][y]\r\n            if dis >= thresh:\r\n                result.append(np.concatenate((fields1[x], fields2[y], np.array(['%.5f' % dis])), axis=0))\r\n\r\n        if len(result) > 1:  # `result` always contains the header row, so require at least one match\r\n            np.savetxt('result_similarity.csv', result, fmt='%s', delimiter='\\t', encoding='utf-8')\r\n\r\n        print('Iteration finished: results saved to `result_similarity.csv`.')\r\n        print('Amount: %d (%d * %d)' % (len(fields1)*len(fields2), len(fields1), len(fields2)))\r\n        print('Time consumed:', datetime.datetime.now()-start)\r\n        print()\r\n\r\n        return distances\r\n\r\n\r\nif __name__ == '__main__':\r\n    similarity = ImageSimilarity()\r\n\r\n    '''Setup'''\r\n    similarity.batch_size = 16\r\n    similarity.num_processes = 2\r\n\r\n    '''Load source data'''\r\n    test1 = similarity.load_data_csv('./demo/test1.csv', delimiter=',')\r\n    test2 = similarity.load_data_csv('./demo/test2.csv', delimiter=',', cols=['id', 'url'])\r\n\r\n    '''Save features and fields'''\r\n    similarity.save_data('test1', test1)\r\n    similarity.save_data('test2', test2)\r\n\r\n    '''Calculate similarities'''\r\n    result = similarity.iteration(['test1_id', 'test1_url', 'test2_id', 'test2_url'], thresh=0.845)\r\n    print('Row for source file 1, and column for source file 2.')\r\n    print(result)\r\n"
  },
  {
    "path": "model_util.py",
    "content": "import os\nos.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'\n\nimport numpy as np\n\nfrom tensorflow.python.keras.applications.mobilenet import MobileNet, preprocess_input\nfrom tensorflow.python.keras.preprocessing import image as process_image\nfrom tensorflow.python.keras.utils import Sequence\nfrom tensorflow.python.keras.layers import GlobalAveragePooling2D\nfrom tensorflow.python.keras import Model\n\n\nclass DeepModel():\n    '''MobileNet deep model.'''\n    def __init__(self):\n        self._model = self._define_model()\n\n        print('Loading MobileNet.')\n        print()\n\n    @staticmethod\n    def _define_model(output_layer=-1):\n        '''Define a pre-trained MobileNet model.\n\n        Args:\n            output_layer: the number of layer that output.\n\n        Returns:\n            Class of keras model with weights.\n        '''\n        base_model = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))\n        output = base_model.layers[output_layer].output\n        output = GlobalAveragePooling2D()(output)\n        model = Model(inputs=base_model.input, outputs=output)\n        return model\n\n    @staticmethod\n    def preprocess_image(path):\n        '''Process an image to numpy array.\n\n        Args:\n            path: the path of the image.\n\n        Returns:\n            Numpy array of the image.\n        '''\n        img = process_image.load_img(path, target_size=(224, 224))\n        x = process_image.img_to_array(img)\n        # x = np.expand_dims(x, axis=0)\n        x = preprocess_input(x)\n        return x\n\n    @staticmethod\n    def cosine_distance(input1, input2):\n        '''Calculating the distance of two inputs.\n\n        The return values lies in [-1, 1]. 
`-1` denotes the two features are completely dissimilar,\n        `1` denotes they are the most similar.\n\n        Args:\n            input1, input2: two input numpy arrays.\n\n        Returns:\n            A matrix of pairwise cosine distances of the two inputs.\n        '''\n        # return np.dot(input1, input2) / (np.linalg.norm(input1) * np.linalg.norm(input2))\n        return np.dot(input1, input2.T) / \\\n                np.dot(np.linalg.norm(input1, axis=1, keepdims=True), \\\n                        np.linalg.norm(input2.T, axis=0, keepdims=True))\n\n    def extract_feature(self, generator):\n        '''Extract deep features using the MobileNet model.\n\n        Args:\n            generator: a predict generator inheriting from `keras.utils.Sequence`.\n\n        Returns:\n            The output features of all inputs.\n        '''\n        features = self._model.predict_generator(generator)\n        return features\n\n\nclass DataSequence(Sequence):\n    '''Predict generator inheriting from `keras.utils.Sequence`.'''\n    def __init__(self, paras, generation, batch_size=32):\n        self.list_of_label_fields = []\n        self.list_of_paras = paras\n        self.data_generation = generation\n        self.batch_size = batch_size\n        self.__idx = 0\n\n    def __len__(self):\n        '''The number of batches per epoch.'''\n        return int(np.ceil(len(self.list_of_paras) / self.batch_size))\n\n    def __getitem__(self, idx):\n        '''Generate one batch of data.'''\n        paras = self.list_of_paras[idx * self.batch_size : (idx+1) * self.batch_size]\n        batch_x, batch_fields = self.data_generation(paras)\n\n        if idx == self.__idx:\n            self.list_of_label_fields += batch_fields\n            self.__idx += 1\n\n        return np.array(batch_x)\n"
  },
  {
    "path": "requirements.txt",
    "content": "h5py~=2.6.0\nnumpy~=1.14.5\nrequests~=2.21.0\n"
  }
]