Repository: ryanfwy/image-similarity
Branch: master
Commit: fcf4856c1ea4
Files: 13
Total size: 26.8 KB
Directory structure:
gitextract_bc4qvi6_/
├── .gitignore
├── LICENSE
├── README.md
├── demo/
│ ├── test1.csv
│ └── test2.csv
├── demo_override/
│ ├── README.md
│ ├── main_override.py
│ ├── test1.csv
│ └── test2.csv
├── image_util_cli.py
├── main_multi.py
├── model_util.py
└── requirements.txt
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.DS_Store
.vscode
__pycache__
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2019 Ryan
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Image Similarity
This is an efficient utility for computing image similarity using the [MobileNet](https://arxiv.org/abs/1704.04861) deep neural network.
Image similarity is mostly a matter of choosing good image features. Here, a Convolutional Neural Network (CNN) is used to extract features from the images, which lets the computer compare them effectively.
This repository uses a lightweight model, MobileNet, to extract image features, then calculates their pairwise cosine distances as a matrix. The distance between two features lies in `[-1, 1]`, where `-1` means the features are the most dissimilar and `1` means they are the most similar. By choosing a proper threshold in `[-1, 1]`, the most similar images can be matched.
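For intuition, the distance for a single pair of feature vectors can be sketched in a few lines of NumPy (a toy illustration, not the repository's batched implementation in `model_util.py`):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors:
    # 1 means identical direction (most similar), -1 means opposite (most dissimilar).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
print(cosine_similarity(a, a))   # 1.0 (identical features)
print(cosine_similarity(a, -a))  # -1.0 (opposite features)
```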
## Usage
The code is written to match similar images among a huge number of candidates as efficiently as possible.
To use it, two `.csv` source files should be prepared before running. Here is an example of one source file. By default, the `.csv` file should include at least one field that contains the URLs [[1]](#notice).
```text
id,url
1,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/1.jpg
2,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/2.jpg
3,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/3.jpg
4,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/4.jpg
5,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/5.jpg
6,https://raw.githubusercontent.com/ryanfwy/image-similarity/master/demo/6.jpg
```
After that, we can set up the number of processes used to request images from the URLs in parallel. For example, we use 2 processes for this tiny demo.
```python
similarity.num_processes = 2
```
For feature extraction, a data generator is used to feed images to the model batch by batch. By default, the GPU will be used if it satisfies the requirements of [Tensorflow](https://www.tensorflow.org/install/gpu). Now we can set a proper batch size based on the memory size of our computer or server. In this demo, we set it to 16.
```python
similarity.batch_size = 16
```
After invoking `save_data()` twice, four generated files will be saved into the `__generated__` directory, named `_*_feature.h5` and `_*_fields.csv`. We can then calculate the similarities by calling `iteration()`, or load the generated files at any time afterward.
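These `.h5` feature files can also be inspected directly with `h5py`; `save_data()` stores the feature matrix under the `data` key. A minimal sketch (it writes a small dummy file first so it runs standalone; the feature width `1024` is just an illustrative value):

```python
import h5py
import numpy as np

# Write a dummy feature file in the same layout `save_data()` uses
# (dataset key 'data'); real files live in `__generated__/_<title>_feature.h5`.
with h5py.File('_demo_feature.h5', 'w') as h:
    h.create_dataset('data', data=np.zeros((6, 1024), dtype='float32'))

# Load it back, exactly like `load_data_h5()` does.
with h5py.File('_demo_feature.h5', 'r') as h:
    features = np.array(h['data'])
print(features.shape)  # (6, 1024)
```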
Putting it all together, the full example looks like:
```python
similarity = ImageSimilarity()
'''Setup'''
similarity.batch_size = 16
similarity.num_processes = 2
'''Load source data'''
test1 = similarity.load_data_csv('./demo/test1.csv', delimiter=',')
test2 = similarity.load_data_csv('./demo/test2.csv', delimiter=',', cols=['id', 'url'])
'''Save features and fields'''
similarity.save_data('test1', test1)
similarity.save_data('test2', test2)
'''Calculate similarities'''
result = similarity.iteration(['test1_id', 'test1_url', 'test2_id', 'test2_url'], thresh=0.845)
print('Row for source file 1, and column for source file 2.')
print(result)
```
or if the files have been generated before:
```python
similarity = ImageSimilarity()
similarity.iteration(['test1_id', 'test1_url', 'test2_id', 'test2_url'], thresh=0.845, title1='test1', title2='test2')
```
For practical usage, the `thresh` argument of `iteration()` is recommended to be in `[0.84, 1)`. A balanced value is `0.845`.
For any other details, please check the usage of each function in `main_multi.py`.
## Requirements and Installation
**NOTE**: Tensorflow is not included in `requirements.txt` due to platform differences; please install and configure it yourself based on your computer or server. Also note that `Python 3` is required.
```shell
$ git clone https://github.com/ryanfwy/image-similarity.git
$ cd image-similarity
$ pip3 install -r requirements.txt
```
The requirements are also listed below.
- tensorflow: the newest version for CPU, or the version that matches your GPU and CUDA.
- h5py~=2.6.0
- numpy~=1.14.5
- requests~=2.21.0
## Experiment
In the demo, a source of 6 images is matched against a source of 3 images.
### Accuracy
The cosine distances are shown in the table.
| | <img width="100" src="./demo/3.jpg"/> | <img width="100" src="./demo/4.jpg"/> | <img width="100" src="./demo/5.jpg"/> |
| --- | :---: | :---: | :---: |
| <img width="100" src="./demo/1.jpg"/> | **0.9229318** | 0.5577963 | 0.5826051 |
| <img width="100" src="./demo/2.jpg"/> | **0.84877944** | 0.538753 | 0.5624183 |
| <img width="100" src="./demo/3.jpg"/> | **1.** | 0.5512465 | 0.57025677 |
| <img width="100" src="./demo/4.jpg"/> | 0.5512465 | **0.99999994** | 0.54037786 |
| <img width="100" src="./demo/5.jpg"/> | 0.57025677 | 0.54037786 | **0.9999998** |
| <img width="100" src="./demo/6.jpg"/> | 0.5575757 | 0.5238174 | **0.91234696** |
As shown, image similarity using a deep neural network works well. The distances of matched images are roughly greater than `0.84`.
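Selecting matches from such a distance matrix with the recommended threshold can be sketched as follows (toy numbers taken from the first two rows of the table above; the logic mirrors what `iteration()` does internally):

```python
import numpy as np

# Distance matrix excerpt: rows = images from source 1, cols = images from source 2.
distances = np.array([
    [0.9229318,  0.5577963, 0.5826051],
    [0.84877944, 0.538753,  0.5624183],
])

thresh = 0.845
best = np.argmax(distances, axis=1)           # best candidate column per row
matched = [(row, col) for row, col in enumerate(best)
           if distances[row, col] >= thresh]  # keep only confident matches
print(matched)  # [(0, 0), (1, 0)]
```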
### Efficiency
For running efficiency, multi-processing and batch-wise prediction are used in the feature extraction procedure. Image requesting and preprocessing on the CPU thus run simultaneously with model prediction on the GPU. In the similarity analysis procedure, a matrix-wise mathematical method is used instead of iterating over all n*m pairs one by one. This helps a lot given the low efficiency of Python iteration, especially with a huge amount of data.
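The matrix-wise trick can be sketched with plain NumPy (toy sizes; a simplified version of what `DeepModel.cosine_distance` computes):

```python
import numpy as np

rng = np.random.RandomState(0)
f1 = rng.normal(size=(4, 8))  # n=4 features from source 1 (toy sizes)
f2 = rng.normal(size=(3, 8))  # m=3 features from source 2

# All n*m cosine distances in one matrix product instead of a Python double loop.
norms1 = np.linalg.norm(f1, axis=1, keepdims=True)       # (n, 1)
norms2 = np.linalg.norm(f2, axis=1, keepdims=True)       # (m, 1)
distances = np.dot(f1, f2.T) / np.dot(norms1, norms2.T)  # (n, m)

# The loop version produces the same numbers, only much slower at scale.
loop = np.array([[np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
                  for b in f2] for a in f1])
print(np.allclose(distances, loop))  # True
```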
The table below shows the time consumption running with 8 processes in a practical case. The results are only for reference; they may vary considerably with the number of processes used, the quality of the network, the size of the online images, and so on.
| | Source 1 | Source 2 | Iteration |
| :---: | :---: | :---: | :---: |
| Amount | 13501 | 21221 | 13501 * 21221 |
| Time Consumption | 0:35:53 | 0:17:50 | 0:00:03.913282 |
## Notice
[1] By default, the program has to fetch the online images from the URLs prepared in the `.csv` file. If we want to run the code with a list of offline images, we need to override the `_sub_process()` class method ourselves. For a demo and details, please check [demo_override](./demo_override).
## Thanks
Demo images come from [ImageSimilarity](https://github.com/nivance/image-similarity) by [nivance](https://github.com/nivance), another image similarity implementation, written in Java using the pHash algorithm.
================================================
FILE: demo/test1.csv
================================================
id,url
1,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/1.jpg
2,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/2.jpg
3,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/3.jpg
4,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/4.jpg
5,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/5.jpg
6,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/6.jpg
================================================
FILE: demo/test2.csv
================================================
id,url
3,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/3.jpg
4,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/4.jpg
5,https://raw.githubusercontent.com/ryanfwy/image_similarity/master/demo/5.jpg
================================================
FILE: demo_override/README.md
================================================
# Implement Your Own `_sub_process()`
By default, the `.csv` source file should include at least one field that contains the **urls**. In other words, the program has to fetch the online images from URLs. However, if we want to run the code with a list of offline images, we need to override the `_sub_process()` class method ourselves.
## Implement the Subclass
The implementation should look like:
```python
class NewImageSimilarity(ImageSimilarity):
    @staticmethod
    def _sub_process(para):
        # Override the method from the base class
        path, fields = para['path'], para['fields']
        try:
            feature = DeepModel.preprocess_image(path)
            return feature, fields
        except Exception as e:
            print('Error file %s: %s' % (fields[0], e))
            return None, None
```
As shown, the overriding `_sub_process()` simply removes the `requests.get(path)` call and passes the `path` argument to `DeepModel.preprocess_image()` directly.
Here, the `.csv` source file should include at least one field, such as `path`, that holds the local image paths. For example, it can be prepared like this.
```
id,path
3,../demo/3.jpg
4,../demo/4.jpg
5,../demo/5.jpg
```
The full example is also given in [main_override.py](./main_override.py). Please read it for more details about how to implement your own `_sub_process()` and run.
## Quick Preparation
If we want to collect a batch of offline image paths from a local directory into a `.csv` source file, the [image_util_cli.py](../image_util_cli.py) quick-preparation script can easily do the job.
To run this script, first put a batch of images into a directory, such as `source1`. The directory tree will look like this.
```
./source1
|- image1.jpg
|- image2.jpg
|- ...
|_ image100.jpg
```
After that, open `Terminal.app` (macOS), `cd` to the directory containing `image_util_cli.py`, and run it with the required arguments.
```
$ cd image-similarity
$ python3 image_util_cli.py ./source1 -d '\t' -o ./images.csv
```
The usage of `image_util_cli.py` is given below. We can also check it at any time by passing the `-h` argument.
```
usage: image_util_cli [-h] [-d DELIMITER] [-o OUT_PATH] source
positional arguments:
source directory of the source images
optional arguments:
-h, --help show this help message and exit
-d DELIMITER, --delimiter DELIMITER
delimiter to the output file, default: ','
-o OUT_PATH, --out-path OUT_PATH
path to the output file, default: name of the source directory
```
================================================
FILE: demo_override/main_override.py
================================================
import sys
sys.path.append('..')
from main_multi import ImageSimilarity, DeepModel
class NewImageSimilarity(ImageSimilarity):
    @staticmethod
    def _sub_process(para):
        # Override the method from the base class
        path, fields = para['path'], para['fields']
        try:
            feature = DeepModel.preprocess_image(path)
            return feature, fields
        except Exception as e:
            print('Error file %s: %s' % (fields[0], e))
            return None, None


if __name__ == "__main__":
    similarity = NewImageSimilarity()

    '''Setup'''
    similarity.batch_size = 16
    similarity.num_processes = 2

    '''Load source data'''
    test1 = similarity.load_data_csv('./test1.csv', delimiter=',')
    test2 = similarity.load_data_csv('./test2.csv', delimiter=',', cols=['id', 'path'])

    '''Save features and fields'''
    similarity.save_data('test1', test1)
    similarity.save_data('test2', test2)
================================================
FILE: demo_override/test1.csv
================================================
id,path
1,../demo/1.jpg
2,../demo/2.jpg
3,../demo/3.jpg
4,../demo/4.jpg
5,../demo/5.jpg
6,../demo/6.jpg
================================================
FILE: demo_override/test2.csv
================================================
id,path
3,../demo/3.jpg
4,../demo/4.jpg
5,../demo/5.jpg
================================================
FILE: image_util_cli.py
================================================
'''CLI utility for image preparation.'''
import os
import argparse
import numpy as np
def process(input_dir, delimiter=',', output_path=None):
    '''Generate a `.csv` file with image paths.'''
    result = [['name', 'path']]
    file_names = os.listdir(input_dir)
    file_names.sort()
    for file_name in file_names:
        file_path = os.path.join(input_dir, file_name)
        result.append([os.path.splitext(file_name)[0], os.path.abspath(file_path)])

    if output_path is None:
        parent_dir = list(filter(lambda x: not x == '', input_dir.split('/')))[-1]
        output_path = parent_dir + '.csv'
    np.savetxt(output_path, result, delimiter=delimiter, fmt='%s', encoding='utf-8')
    print('File saved to `%s`.' % output_path)


def main():
    '''CLI entrance.'''
    parser = argparse.ArgumentParser(prog='image_util_cli')
    parser.add_argument('source', action='store', type=str, help='directory of the source images')
    parser.add_argument('-d', '--delimiter', required=False, type=str, default=',', help="delimiter to the output file, default: ','")
    parser.add_argument('-o', '--out-path', required=False, type=str, help='path to the output file, default: name of the source directory')
    args = parser.parse_args()

    if args.source:
        if os.path.isdir(args.source) is False:
            exit('No directory `%s`.' % args.source)
        process(args.source, delimiter=args.delimiter, output_path=args.out_path)


if __name__ == '__main__':
    main()
================================================
FILE: main_multi.py
================================================
'''Image similarity using deep features.

Recommendation: the threshold of the `DeepModel.cosine_distance` can be set as the following values.
    0.84  = greater matches amount
    0.845 = balance, default
    0.85  = better accuracy
'''
from io import BytesIO
from multiprocessing import Pool
import os
import datetime
import numpy as np
import requests
import h5py
from model_util import DeepModel, DataSequence
class ImageSimilarity():
    '''Image similarity.'''

    def __init__(self):
        self._tmp_dir = './__generated__'
        self._batch_size = 64
        self._num_processes = 4
        self._model = None
        self._title = []

    @property
    def batch_size(self):
        '''Batch size of model prediction.'''
        return self._batch_size

    @property
    def num_processes(self):
        '''Number of processes using `multiprocessing.Pool`.'''
        return self._num_processes

    @batch_size.setter
    def batch_size(self, batch_size):
        self._batch_size = batch_size

    @num_processes.setter
    def num_processes(self, num_processes):
        self._num_processes = num_processes
    def _data_generation(self, args):
        '''Generate input batches for the predict generator.

        Args:
            args: parameters that pass to `_sub_process`.
                - path: path of the image, online url by default.
                - fields: all other fields.

        Returns:
            batch_x: a batch of predict samples.
            batch_fields: a batch of fields that matches the samples.
        '''
        # Multiprocessing
        pool = Pool(self._num_processes)
        res = pool.map(self._sub_process, args)
        pool.close()
        pool.join()

        batch_x, batch_fields = [], []
        for x, fields in res:
            if x is not None:
                batch_x.append(x)
                batch_fields.append(fields)
        return batch_x, batch_fields

    def _predict_generator(self, paras):
        '''Build a predict generator.

        Args:
            paras: input parameters of all samples.
                - path: path of the image, online url by default.
                - fields: all other fields.

        Returns:
            The predict generator.
        '''
        return DataSequence(paras, self._data_generation, batch_size=self._batch_size)
    @staticmethod
    def _sub_process(para):
        '''A sub-process function of `multiprocessing`.
        Download an image from a url and process it into a numpy array.

        Args:
            para: input parameters of one image.
                - path: path of the image, online url by default.
                - fields: all other fields.

        Returns:
            feature: feature array of one image.
            fields: all other fields of one image that passed from `para`.

        Note: If an error happens, `None` will be returned.
        '''
        path, fields = para['path'], para['fields']
        try:
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
            res = requests.get(path, headers=headers)
            feature = DeepModel.preprocess_image(BytesIO(res.content))
            return feature, fields
        except Exception as e:
            print('Error downloading %s: %s' % (fields[0], e))
            return None, None
    @staticmethod
    def load_data_csv(fname, delimiter=None, include_header=True, cols=None):
        '''Load a `.csv` file. Mostly it should be a file that lists all fields to match.

        Args:
            fname: name or path to the file.
            delimiter: delimiter to split the content.
            include_header: whether the source file includes a header or not.
            cols: a list of columns to read. Pass `None` to read all columns.

        Returns:
            A list of data.
        '''
        assert delimiter is not None, 'Delimiter is required.'

        if include_header:
            usecols = None
            skip_header = 1
            if cols:
                with open(fname, 'r', encoding='utf-8') as f:
                    csv_head = f.readline().strip().split(delimiter)
                usecols = [csv_head.index(col) for col in cols]
        else:
            usecols = None
            skip_header = 0

        data = np.genfromtxt(
            fname,
            dtype=str,
            comments=None,
            delimiter=delimiter,
            encoding='utf-8',
            invalid_raise=False,
            usecols=usecols,
            skip_header=skip_header
        )
        return data if len(data.shape) > 1 else data.reshape(1, -1)

    @staticmethod
    def load_data_h5(fname):
        '''Load a `.h5` file. Mostly it should be a file with features extracted by the model.

        Args:
            fname: name or path to the file.

        Returns:
            A list of data.
        '''
        with h5py.File(fname, 'r') as h:
            data = np.array(h['data'])
        return data
    def save_data(self, title, lines):
        '''Load images from `url`, extract features and fields, save as `.h5` and `.csv` files.

        Args:
            title: title to save the results.
            lines: lines of the source data. `url` should be placed at the end of all the fields.

        Returns:
            None. `.h5` and `.csv` files will be saved instead.
        '''
        # Load model
        if self._model is None:
            self._model = DeepModel()

        print('%s: download starts.' % title)
        start = datetime.datetime.now()
        args = [{'path': line[-1], 'fields': line} for line in lines]

        # Prediction
        generator = self._predict_generator(args)
        features = self._model.extract_feature(generator)

        # Save files
        if len(self._title) == 2:
            self._title = []
        self._title.append(title)

        if not os.path.isdir(self._tmp_dir):
            os.mkdir(self._tmp_dir)

        fname_feature = os.path.join(self._tmp_dir, '_' + title + '_feature.h5')
        with h5py.File(fname_feature, mode='w') as h:
            h.create_dataset('data', data=features)
        print('%s: feature saved to `%s`.' % (title, fname_feature))

        fname_fields = os.path.join(self._tmp_dir, '_' + title + '_fields.csv')
        np.savetxt(fname_fields, generator.list_of_label_fields, delimiter='\t', fmt='%s', encoding='utf-8')
        print('%s: fields saved to `%s`.' % (title, fname_fields))

        print('%s: download succeeded.' % title)
        print('Amount:', len(generator.list_of_label_fields))
        print('Time consumed:', datetime.datetime.now()-start)
        print()
    def iteration(self, save_header, thresh=0.845, title1=None, title2=None):
        '''Calculate the cosine distances of two inputs, save the matched fields to a `.csv` file.

        Args:
            save_header: header of the result `.csv` file.
            thresh: threshold of the similarity.
            title1, title2: Optional. If `save_data()` is not invoked, titles of the two inputs should be passed.

        Returns:
            A matrix of element-wise cosine distances.

        Note:
            1. The threshold can be set as the following values.
                0.84 = greater matches amount
                0.845 = balance, default
                0.85 = better accuracy
            2. If the generated files already exist, set `title1` and `title2` to the titles passed
               to `save_data()`. For example, calling `save_data('test', ...)` generates the
               `_test_feature.h5` and `_test_fields.csv` files, so set `title1` or `title2` to
               `test` and `save_data()` will not need to be invoked again.
        '''
        if title1 and title2:
            self._title = [title1, title2]
        assert len(self._title) == 2, 'Two inputs are required.'

        feature1 = self.load_data_h5(os.path.join(self._tmp_dir, '_' + self._title[0] + '_feature.h5'))
        feature2 = self.load_data_h5(os.path.join(self._tmp_dir, '_' + self._title[1] + '_feature.h5'))
        fields1 = self.load_data_csv(os.path.join(self._tmp_dir, '_' + self._title[0] + '_fields.csv'), delimiter='\t', include_header=False)
        fields2 = self.load_data_csv(os.path.join(self._tmp_dir, '_' + self._title[1] + '_fields.csv'), delimiter='\t', include_header=False)

        print('%s: feature loaded, shape' % self._title[0], feature1.shape)
        print('%s: fields loaded, length' % self._title[0], len(fields1))
        print('%s: feature loaded, shape' % self._title[1], feature2.shape)
        print('%s: fields loaded, length' % self._title[1], len(fields2))

        print('Iteration starts.')
        start = datetime.datetime.now()

        distances = DeepModel.cosine_distance(feature1, feature2)
        indexes = np.argmax(distances, axis=1)

        result = [save_header + ['similarity']]
        for x, y in enumerate(indexes):
            dis = distances[x][y]
            if dis >= thresh:
                result.append(np.concatenate((fields1[x], fields2[y], np.array(['%.5f' % dis])), axis=0))

        if len(result) > 1:  # more than just the header row
            np.savetxt('result_similarity.csv', result, fmt='%s', delimiter='\t', encoding='utf-8')
            print('Iteration finished: results saved to `result_similarity.csv`.')

        print('Amount: %d (%d * %d)' % (len(fields1)*len(fields2), len(fields1), len(fields2)))
        print('Time consumed:', datetime.datetime.now()-start)
        print()
        return distances
if __name__ == '__main__':
    similarity = ImageSimilarity()

    '''Setup'''
    similarity.batch_size = 16
    similarity.num_processes = 2

    '''Load source data'''
    test1 = similarity.load_data_csv('./demo/test1.csv', delimiter=',')
    test2 = similarity.load_data_csv('./demo/test2.csv', delimiter=',', cols=['id', 'url'])

    '''Save features and fields'''
    similarity.save_data('test1', test1)
    similarity.save_data('test2', test2)

    '''Calculate similarities'''
    result = similarity.iteration(['test1_id', 'test1_url', 'test2_id', 'test2_url'], thresh=0.845)
    print('Row for source file 1, and column for source file 2.')
    print(result)
================================================
FILE: model_util.py
================================================
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
import numpy as np
from tensorflow.python.keras.applications.mobilenet import MobileNet, preprocess_input
from tensorflow.python.keras.preprocessing import image as process_image
from tensorflow.python.keras.utils import Sequence
from tensorflow.python.keras.layers import GlobalAveragePooling2D
from tensorflow.python.keras import Model
class DeepModel():
    '''MobileNet deep model.'''

    def __init__(self):
        print('Loading MobileNet.')
        self._model = self._define_model()
        print()

    @staticmethod
    def _define_model(output_layer=-1):
        '''Define a pre-trained MobileNet model.

        Args:
            output_layer: the number of the layer that outputs.

        Returns:
            Class of keras model with weights.
        '''
        base_model = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
        output = base_model.layers[output_layer].output
        output = GlobalAveragePooling2D()(output)
        model = Model(inputs=base_model.input, outputs=output)
        return model
    @staticmethod
    def preprocess_image(path):
        '''Process an image to a numpy array.

        Args:
            path: the path of the image.

        Returns:
            Numpy array of the image.
        '''
        img = process_image.load_img(path, target_size=(224, 224))
        x = process_image.img_to_array(img)
        # x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)
        return x
    @staticmethod
    def cosine_distance(input1, input2):
        '''Calculate the cosine distances of two inputs.
        The return values lie in [-1, 1]. `-1` denotes two features are the most dissimilar,
        `1` denotes they are the most similar.

        Args:
            input1, input2: two input numpy arrays.

        Returns:
            Element-wise cosine distances of the two inputs.
        '''
        # Single-vector version: np.dot(input1, input2) / (np.linalg.norm(input1) * np.linalg.norm(input2))
        return np.dot(input1, input2.T) / \
               np.dot(np.linalg.norm(input1, axis=1, keepdims=True),
                      np.linalg.norm(input2.T, axis=0, keepdims=True))
    def extract_feature(self, generator):
        '''Extract deep features using the MobileNet model.

        Args:
            generator: a predict generator inherited from `keras.utils.Sequence`.

        Returns:
            The output features of all inputs.
        '''
        features = self._model.predict_generator(generator)
        return features


class DataSequence(Sequence):
    '''Predict generator inherited from `keras.utils.Sequence`.'''

    def __init__(self, paras, generation, batch_size=32):
        self.list_of_label_fields = []
        self.list_of_paras = paras
        self.data_generation = generation
        self.batch_size = batch_size
        self.__idx = 0

    def __len__(self):
        '''The number of batches per epoch.'''
        return int(np.ceil(len(self.list_of_paras) / self.batch_size))

    def __getitem__(self, idx):
        '''Generate one batch of data.'''
        paras = self.list_of_paras[idx * self.batch_size : (idx+1) * self.batch_size]
        batch_x, batch_fields = self.data_generation(paras)
        # Record fields only the first time each batch index is seen, and in order,
        # so `list_of_label_fields` stays aligned with the predicted features.
        if idx == self.__idx:
            self.list_of_label_fields += batch_fields
            self.__idx += 1
        return np.array(batch_x)
================================================
FILE: requirements.txt
================================================
h5py~=2.6.0
numpy~=1.14.5
requests~=2.21.0