Repository: deliton/idt Branch: master Commit: 050d82a51dfd Files: 26 Total size: 61.4 KB Directory structure: gitextract_x8iz4hcn/ ├── .github/ │ └── ISSUE_TEMPLATE/ │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── idt/ │ ├── __init__.py │ ├── __main__.py │ ├── bing.py │ ├── bing_api.py │ ├── duckgo.py │ ├── factories.py │ ├── flickr_api.py │ ├── resizers/ │ │ ├── __init__.py │ │ ├── get_resizer.py │ │ ├── longer_side.py │ │ ├── shorter_side.py │ │ └── smartcrop.py │ └── utils/ │ ├── __init__.py │ ├── create_dataset_csv.py │ ├── download_images.py │ ├── remove_corrupt.py │ └── split_dataset.py ├── requirements.txt └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/ISSUE_TEMPLATE/bug_report.md ================================================ --- name: Bug report about: Create a report to help us improve title: '' labels: '' assignees: '' --- **Describe the bug** A clear and concise description of what the bug is. **To Reproduce** Steps to reproduce the behavior: 1. Go to '...' 2. Click on '....' 3. Scroll down to '....' 4. See error **Expected behavior** A clear and concise description of what you expected to happen. **Screenshots** If applicable, add screenshots to help explain your problem. **Desktop (please complete the following information):** - OS: [e.g. iOS] - Browser [e.g. chrome, safari] - Version [e.g. 22] **Smartphone (please complete the following information):** - Device: [e.g. iPhone6] - OS: [e.g. iOS8.1] - Browser [e.g. stock browser, safari] - Version [e.g. 22] **Additional context** Add any other context about the problem here. 
================================================ FILE: .github/ISSUE_TEMPLATE/feature_request.md ================================================ --- name: Feature request about: Suggest an idea for this project title: '' labels: '' assignees: '' --- **Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] **Describe the solution you'd like** A clear and concise description of what you want to happen. **Describe alternatives you've considered** A clear and concise description of any alternative solutions or features you've considered. **Additional context** Add any other context or screenshots about the feature request here. ================================================ FILE: .gitignore ================================================ !**/__pycache__/ __pycache__ idt.egg-info build/ dist/ .README.md.kate-swp /.vscode ================================================ FILE: CODE_OF_CONDUCT.md ================================================ # Contributor Covenant Code of Conduct ## Our Pledge In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. 
## Our Standards Examples of behavior that contributes to creating a positive environment include: * Using welcoming and inclusive language * Being respectful of differing viewpoints and experiences * Gracefully accepting constructive criticism * Focusing on what is best for the community * Showing empathy towards other community members Examples of unacceptable behavior by participants include: * The use of sexualized language or imagery and unwelcome sexual attention or advances * Trolling, insulting/derogatory comments, and personal or political attacks * Public or private harassment * Publishing others' private information, such as a physical or electronic address, without explicit permission * Other conduct which could reasonably be considered inappropriate in a professional setting ## Our Responsibilities Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. ## Scope This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. ## Enforcement Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at deliton.m@hotmail.com. 
All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership. ## Attribution This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html [homepage]: https://www.contributor-covenant.org For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq ================================================ FILE: CONTRIBUTING.md ================================================ ![idt-contrib](https://user-images.githubusercontent.com/47995046/96387698-74e85e80-117a-11eb-8b35-d65b336fd1df.png) 🎉 Thanks for taking the time to contribute to this project! 🎉 ## Code of Conduct This project and everyone participating in it is governed by the [IDT Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to deliton.m@hotmail.com. ## How Can I Contribute? ### Reporting Bugs This section guides you through submitting a bug report for IDT. Following these guidelines helps maintainers and the community understand your report :pencil:, reproduce the behavior :computer: :computer:, and find related reports :mag_right:. When you are creating a bug report, please include as many details as possible. > **Note:** If you find a **Closed** issue that seems like it is the same thing that you're experiencing, open a new issue and include a link to the original issue in the body of your new one. #### How Do I Submit A (Good) Bug Report? 
Bugs are tracked as [GitHub issues](https://guides.github.com/features/issues/). Create an issue in this repository and provide the following information by filling in bug report template Explain the problem and include additional details to help maintainers reproduce the problem: * **Use a clear and descriptive title** for the issue to identify the problem. * **Describe the exact steps which reproduce the problem** in as many details as possible. For example, start by explaining how you started IDT, e.g. which command exactly you used in the terminal, or how you started IDT otherwise. When listing steps, **don't just say what you did, but explain how you did it**. For example, if you moved the cursor to the end of a line, explain if you used the mouse, or a keyboard shortcut or an IDT command, and if so which one? * **Provide specific examples to demonstrate the steps**. Include links to files or GitHub projects, or copy/pasteable snippets, which you use in those examples. If you're providing snippets in the issue, use [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines). * **Describe the behavior you observed after following the steps** and point out what exactly is the problem with that behavior. * **Explain which behavior you expected to see instead and why.** * **Include screenshots and animated GIFs** which show you following the described steps and clearly demonstrate the problem. If you use the keyboard while following the steps, **record the GIF with the Keybinding Resolver shown**. You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux. * **If you're reporting that IDT crashed**, include a crash report with a stack trace from the operating system. 
On macOS, the crash report will be available in `Console.app` under "Diagnostic and usage information" > "User diagnostic reports". Include the crash report in the issue in a [code block](https://help.github.com/articles/markdown-basics/#multiple-lines), a [file attachment](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/), or put it in a [gist](https://gist.github.com/) and provide link to that gist. * **If the problem is related to performance or memory**, include a CPU profile capture with your report. * **If the problem wasn't triggered by a specific action**, describe what you were doing before the problem happened and share more information using the guidelines below. ### Suggesting Enhancements This section guides you through submitting an enhancement suggestion for IDT, including completely new features and minor improvements to existing functionality. Following these guidelines helps maintainers and the community understand your suggestion :pencil: and find related suggestions :mag_right:. #### How Do I Submit A (Good) Enhancement Suggestion? Enhancement suggestions are tracked as [GitHub issues](https://guides.github.com/features/issues/). Create an issue on that repository and provide the following information: * **Use a clear and descriptive title** for the issue to identify the suggestion. * **Provide a step-by-step description of the suggested enhancement** in as many details as possible. * **Provide specific examples to demonstrate the steps**. Include copy/pasteable snippets which you use in those examples, as [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines). * **Describe the current behavior** and **explain which behavior you expected to see instead** and why. * **Include screenshots and animated GIFs** which help you demonstrate the steps or point out the part of IDT which the suggestion is related to. 
You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux. * **Explain why this enhancement would be useful** to most IDT users and isn't something that can or should be implemented as a community package. * **List some other text editors or applications where this enhancement exists.** * **Specify which version of IDT you're using.** * **Specify the name and version of the OS you're using.** ### Your First Code Contribution Unsure where to begin contributing to IDT? You can start by looking through `help-wanted` issues: * Help wanted issues - issues related to program problems, feature suggestion and implementation of wanted features. ### Pull Requests The process described here has several goals: - Maintain IDT's quality - Fix problems that are important to users - Engage the community in working toward the best possible IDT - Enable a sustainable system for IDT's maintainers to review contributions Please follow these steps to have your contribution considered by the maintainers: 1. Follow all instructions in [the template](PULL_REQUEST_TEMPLATE.md) 2. Follow the [styleguides](#styleguides) 3. After you submit your pull request, verify that all [status checks](https://help.github.com/articles/about-status-checks/) are passing
What if the status checks are failing?If a status check is failing, and you believe that the failure is unrelated to your change, please leave a comment on the pull request explaining why you believe the failure is unrelated. A maintainer will re-run the status check for you. If we conclude that the failure was a false positive, then we will open an issue to track that problem with our status check suite.
IDT is a volunteer effort. We encourage you to pitch in and join the team! Thanks! <3 IDT Team ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2020 Deliton Junior Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # IDT - Image Dataset Tool ## Version 0.0.6 beta ![idt-logo](https://user-images.githubusercontent.com/47995046/96403078-d675f080-11ad-11eb-8435-c8ce69a6c871.png) ## Description The image dataset tool (IDT) is a CLI app developed to make it easier and faster to create image datasets for deep learning. The tool achieves this by scraping images from several search engines, such as duckgo, bing and flickr. IDT can also optionally optimize the dataset: the user can downscale and compress the images for optimal file size and dimensions.
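The downscale-and-compress optimization described above can be sketched with Pillow. This is a minimal illustration under assumed parameters (a 512-pixel target and JPEG quality 80), not the package's actual implementation:

```python
from io import BytesIO

from PIL import Image  # Pillow


def optimize(img, max_side=512, quality=80):
    """Downscale so the longer side is at most max_side (aspect ratio
    preserved), then re-encode as JPEG to shrink the file size."""
    img = img.copy()
    img.thumbnail((max_side, max_side))  # resizes in place; never upscales
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return img.size, buf.getvalue()


# A 1024x768 source becomes 512x384: the longer side is capped at 512
# and the shorter side follows proportionally.
size, jpeg_bytes = optimize(Image.new("RGB", (1024, 768)))
print(size)  # (512, 384)
```

Applying something like this to every downloaded file is roughly what the optional optimization pass does for the dataset as a whole.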
A sample dataset created using **idt** containing a total of 23,688 image files weighs only 559.2 megabytes. ## NEW UPDATE! I am proud to announce our newest version! 🎉🎉 **What changed** * Added an automatic duplicate image remover * Added the longer side resize method. With this option, the image is resized based on its longer side. * Added the shorter side resize method. With this option, the image is resized based on its shorter side. * Added Smart Crop. This method tries to crop and resize around the main subject of the image. The algorithm is based on SmartCrop.js and SmartCrop.py. * Removed verbose mode. It was useful in earlier stages of development but no longer adds value to the experience. * The official documentation is almost ready. A link will be available soon. ## Installing You can install it via pip or by cloning this repository. ```console user@admin:~$ pip3 install idt ``` **OR** ```console user@admin:~$ git clone https://github.com/deliton/idt.git && cd idt user@admin:~/idt$ sudo python3 setup.py install ``` ## Getting Started ![idt-gif](https://user-images.githubusercontent.com/47995046/96406740-6d46ab00-11b6-11eb-980b-a40968ed38b4.gif) The quickest way to get started with IDT is running the simple "run" command. Just type something like this in your favorite console: ```console user@admin:~$ idt run -i apples ``` This will quickly download 50 images of apples. By default, it uses the duckgo search engine. The run command accepts the following options: | Option | Description | | ----------- | ----------- | | **-i** or **--input** | The keyword used to find the desired images. | | **-s** or **--size** | The number of images to be downloaded. | | **-e** or **--engine** | The desired search engine (options: duckgo, bing, bing_api and flickr_api). | | **--resize-method** | The resize method to apply (options: longer_side, shorter_side and smartcrop). | | **-is** or **--image-size** | Sets the desired image size ratio.
(default: 512) | | **-ak** or **--api-key** | Required when using a search engine that needs an API key. | ## Usage IDT requires a config file that tells it how your dataset should be organized. You can create it using the following command: ```console user@admin:~$ idt init ``` This command will trigger the config file creator and ask for the desired dataset parameters. In this example, let's create a dataset containing images of your favorite cars. The first parameter this command asks for is the dataset name. In this example, let's name our dataset "My favorite cars": ```console Insert a name for your dataset: : My favorite cars ``` Then the tool will ask how many samples per search are required to mount your dataset. Building a good dataset for deep learning requires many images, and since we're using a search engine to scrape them, several searches with different keywords are needed to mount a good-sized dataset. This value corresponds to how many images are downloaded per search. In this example we need a dataset with 250 images in each class, and we'll use 5 keywords to mount each class. So if we type 50 here, IDT will download 50 images for every keyword provided; with 5 keywords we get the required 250 images. ```console How many samples per search will be necessary? : 50 ``` The tool will now ask for an image size ratio. Since training neural networks on large images is rarely practical, we can optionally choose one of the following image size ratios and scale our images down to that size. In this example, we'll go for 512x512, although 256x256 would be an even better option for this task.
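Because IDT preserves the aspect ratio when resizing (as the prompts below note), the chosen value applies to one side and the other side follows proportionally. A rough sketch of the arithmetic behind the longer_side and shorter_side methods (a hypothetical helper, not IDT's actual code):

```python
def resize_dims(width, height, target, method="longer_side"):
    """Compute output dimensions when scaling one side to `target`
    while preserving the aspect ratio."""
    anchor = max(width, height) if method == "longer_side" else min(width, height)
    scale = target / anchor
    return round(width * scale), round(height * scale)


# For a 1024x768 photo with a 512-pixel target:
print(resize_dims(1024, 768, 512, "longer_side"))   # (512, 384)
print(resize_dims(1024, 768, 512, "shorter_side"))  # (683, 512)
```

This is why the downloaded images can have slightly different sizes: only one dimension is guaranteed to match the target.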
```console Choose images resolution: [1] 512 pixels / 512 pixels (recommended) [2] 1024 pixels / 1024 pixels [3] 256 pixels / 256 pixels [4] 128 pixels / 128 pixels [5] Keep original image size ps: note that the aspect ratio of the image will not be changed, so possibly the images received will have slightly different size What is the desired image size ratio: 1 ``` Then choose "longer_side" as the resize method. ```console [1] Resize image based on longer side [2] Resize image based on shorter side [3] Smartcrop ps: note that the aspect ratio of the image will not be changed, so possibly the images received will have slightly different size Desired Image resize method: : longer_side ``` Now you must choose how many classes/folders your dataset should have. This part is very personal, but my favorite cars are: Chevrolet Impala, Range Rover Evoque, Tesla Model X and (why not) AvtoVAZ Lada. So in this case we have 4 classes, one for each favorite. ```console How many image classes are required? : 4 ``` Afterwards, you'll be asked to choose one of the available search engines. In this example, we'll use DuckGO to search images for us. ```console Choose a search engine: [1] Duck GO (recommended) [2] Bing [3] Bing API [4] Flickr API Select option:: 1 ``` Now we have to do some repetitive form filling: we must name each class and list all the keywords that will be used to find the images. Note that you can also edit the generated config file later to add more classes and keywords. ```console Class 1 name: : Chevrolet Impala ``` After typing the first class name, we'll be asked to provide all the keywords used to find images for that class. Remember that we told the program to download 50 images per keyword, so we must provide 5 keywords in this case to get all 250 images.
Keywords MUST be separated by commas (,): ```console In order to achieve better results, choose several keywords that will be provided to the search engine to find your class in different settings. Example: Class Name: Pineapple keywords: pineapple, pineapple fruit, ananas, abacaxi, pineapple drawing Type in all keywords used to find your desired class, separated by commas: Chevrolet Impala 1967 car photos, chevrolet impala on the road, chevrolet impala vintage car, chevrolet impala convertible 1961, chevrolet impala 1964 lowrider ``` Then repeat the process of filling in a class name and its keywords until all 4 required classes are done. ```console Dataset YAML file has been created successfully. Now run idt build to mount your dataset! ``` Your dataset configuration file has been created. Now just run the following command and see the magic happen: ```console user@admin:~$ idt build ``` And wait while the dataset is being mounted: ```console Creating Chevrolet Impala class Downloading Chevrolet Impala 1967 car photos [#########################-----------] 72% 00:00:12 ``` At the end, all your images will be available in a folder with the dataset name. A csv file with the dataset stats is also included in the dataset's root folder. ![idt-results](https://user-images.githubusercontent.com/47995046/93012667-808fa680-f578-11ea-82fc-7ebcb8ce3c41.png) ## Split image dataset for Deep Learning Since deep learning often requires you to split your dataset into training/validation folders, IDT can also do this for you! Just run: ```console user@admin:~$ idt split ``` Now you must choose a train/valid proportion. In this example I've chosen 70% of the images to be reserved for training, while the rest is reserved for validation: ```console Choose the desired proportion of images of each class to be distributed in train/valid folders. What percentage of images should be distributed towards training?
(0-100): 70 70 percent of the images will be moved to a train folder, while 30 percent of the remaining images will be stored in a validation folder. Is that ok? [Y/n]: y ``` And that's it! The split dataset should now be available in the corresponding train/valid subdirectories. ## Issues This project is being developed in my spare time and still needs a lot of work to be free of bugs. Pull requests and contributors are really appreciated; feel free to contribute in any way you can! ================================================ FILE: idt/__init__.py ================================================ ================================================ FILE: idt/__main__.py ================================================ import os import click import yaml import rich from rich.console import Console from idt.factories import SearchEngineFactory from idt.utils.remove_corrupt import remove_corrupt from idt.utils.create_dataset_csv import create_dataset_csv from idt.utils.split_dataset import split_dataset BANNER = """ [bold blue]===================================================================== 8888888 8888888b.
88888888888 888 888 "Y88b 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 .d88P 888 8888888 8888888P" 888 [italic]IMAGE DATASET TOOL V0.6[/italic] =====================================================================[/bold blue] """ #@click.command() @click.group() def main(): """ Image Dataset Builder CLI to create amazing datasets """ pass @main.command() def version(): """ Shows what version idt is currently on """ click.clear() rich.print("[bold magenta]Image Dataset Tool (IDT)[/bold magenta] version 0.0.6 beta") @main.command() def authors(): """ Shows who are the creators of IDT """ click.clear() rich.print("[bold]IDT[/bold] was initially made by [bold magenta]Deliton Junior[/bold magenta] and [bold red]Misael Kelviny[/bold red]") @main.command() @click.option('--input', '-i','--i', help="The name of the thing you want to download") @click.option('--size', '-s','--s', default=50, help="The number of images you want to download.") @click.option('--engine', '-e','--e', default="duckgo", help="What search engine will be used to find your images") @click.option('--resize-method', '-rs','--rs', default="longer_side", help="Resize method adopted. 
Options: shorter_side, longer_side and smartcrop") @click.option('--imagesize', '-is','--is', default=512, help="What image size ratio should be applied to your dataset") @click.option('--api-key', '-ak','--ak', default=None, help="Provide an api-key for the engines that require one") def run(input, size, engine, resize_method, imagesize, api_key): """ This command executes a single search and downloads it """ engine_list = ['duckgo', 'bing', 'bing_api', 'flickr_api'] click.clear() if input and engine in engine_list: factory = SearchEngineFactory(input,size,input,resize_method,"dataset",imagesize, engine, api_key) # Remove corrupt files remove_corrupt("dataset") else: rich.print("Please provide a valid name") @main.command() @click.option('--default', '-d','--d', is_flag=True,default=False, help="Generate a default config file") def init(default): """ This command initializes idt and creates a dataset config file """ console = Console() console.clear() if default: document_dict = { "DATASET_NAME": "dataset", "API_KEY": "", "SAMPLES_PER_SEARCH": 50, "IMAGE_SIZE": 512, "ENGINE": "duckgo", "RESIZE_METHOD": "longer_side", "CLASSES": [{"CLASS_NAME": "Test", "SEARCH_KEYWORDS": "images of cats"}]} if not os.path.exists("dataset.yaml"): console.print("[bold]Creating a dataset configuration file...[/bold]") f = open("dataset.yaml", "w") f.write(yaml.dump(document_dict)) if f: console.clear() console.print("Dataset YAML file has been created successfully. Now run [bold blue]idt build[/bold blue] to mount your dataset!") exit(0) else: console.print("[red]A dataset.yaml is already created. To use another one, delete the current dataset.yaml file[/red]") exit(0) console.print(BANNER) dataset_name = click.prompt("Insert a name for your dataset: ") console.clear() samples = click.prompt("How many samples per search will be necessary?
",type=int) console.clear() console.print("[bold]Choose image resolution[/bold]", justify="center") console.print(""" [1] 512 pixels / 512 pixels [bold blue](recommended)[/bold blue] [2] 1024 pixels / 1024 pixels [3] 256 pixels / 256 pixels [4] 128 pixels / 128 pixels [5] Keep original image size [italic]ps: note that the aspect ratio of the image will [bold]not[/bold] be changed, so possibly the images received will have slightly different size[/italic] """) image_size_ratio = click.prompt("What is the desired image size ratio", type=int) while image_size_ratio < 1 or image_size_ratio > 5: console.print("[italic red]Invalid option, please choose between 1 and 5. [/italic red]") image_size_ratio= click.prompt("\nOption: ",type=int) if image_size_ratio == 1: image_size_ratio= 512 elif image_size_ratio == 2: image_size_ratio = 1024 elif image_size_ratio == 3: image_size_ratio = 256 elif image_size_ratio == 4: image_size_ratio= 128 elif image_size_ratio == 5: image_size_ratio = 0 console.clear() console.print("[bold]Choose a resize method[/bold]", justify="center") console.print(""" [1] Resize image based on longer side [2] Resize image based on shorter side [3] Smartcrop [italic]ps: note that the aspect ratio of the image will [bold]not[/bold] be changed, so possibly the images received will have slightly different size[/italic] """) resize_method = click.prompt("Desired Image resize method: ", type=int) while resize_method < 1 or resize_method > 3: console.print("[red]Invalid option[/red]") resize_method = click.prompt("Choose method [1-3]: ") resize_method_options = ['','longer_side','shorter_side','smartcrop'] console.clear() number_of_classes = click.prompt("How many image classes are required? 
",type=int) document_dict = { "DATASET_NAME": dataset_name, "SAMPLES_PER_SEARCH": samples, "IMAGE_SIZE": image_size_ratio, "RESIZE_METHOD": resize_method_options[resize_method], "CLASSES": [] } console.clear() console.print("[bold]Choose a search engine[/bold]", justify="center") console.print(""" [1] Duck GO [bold blue](recommended)[/bold blue] [2] Bing [3] Bing API [italic yellow](Requires API key)[/italic yellow] [4] Flickr API [italic yellow](Requires API key)[/italic yellow] """) search_engine= click.prompt("Select option:", type=int) while search_engine < 0 or search_engine > 4: console.print("[italic red]Invalid option, please choose between 1 and 4.[/italic red]") search_engine = click.prompt("\nOption: ", type=int) search_options = ['none', 'duckgo', 'bing', 'bing_api', 'flickr_api'] document_dict['ENGINE'] = search_options[search_engine] if search_engine > 2: console.clear() console.print(f'Insert your [bold blue]{search_options[search_engine]}[/bold blue] API key') engine_api_key = click.prompt("API key: ", type=str) document_dict['API_KEY'] = engine_api_key else: document_dict['API_KEY'] = "NONE" search_engine = search_options[search_engine] for x in range(number_of_classes): console.clear() class_name = click.prompt("Class {x} name: ".format(x=x+1)) console.clear() console.print("""In order to achieve better results, choose several keywords that will be provided to the search engine to find your class in different settings. 
[bold blue]Example: [/bold blue] Class Name: [bold yellow]Pineapple[/bold yellow] [italic]keywords[/italic]: [underline]pineapple, pineapple fruit, ananas, abacaxi, pineapple drawing[/underline] """) keywords = click.prompt("Type in all keywords used to find your desired class, separated by commas: ") document_dict['CLASSES'].append({'CLASS_NAME': class_name, 'SEARCH_KEYWORDS': keywords}) if not os.path.exists("dataset.yaml"): console.print("[bold]Creating a dataset configuration file...[/bold]") try: f = open("dataset.yaml", "w") f.write(yaml.dump(document_dict)) if f: console.clear() console.print("Dataset YAML file has been created successfully. Now run [bold blue]idt build[/bold blue] to mount your dataset!") except OSError: console.print("[red]Unable to create file. Please check permission[/red]") else: console.print("[red]A dataset.yaml is already created. To use another one, delete the current dataset.yaml file[/red]") @main.command() def build(): """ This command mounts the dataset """ console = Console() console.clear() console.print(BANNER) if not os.path.exists("dataset.yaml"): click.clear() console.print("Dataset config file not found\nRun - idt init\n") exit(0) with open('dataset.yaml') as f: data = yaml.load(f, Loader=yaml.FullLoader) click.clear() console.print("Building [bold blue]{dataset_name}[/bold blue] dataset...\n".format(dataset_name=data['DATASET_NAME'])) for classes in data['CLASSES']: click.clear() console.print('Creating [bold blue]{name} class[/bold blue] \n'.format(name=classes['CLASS_NAME'])) search_list = classes['SEARCH_KEYWORDS'].split(",") for keywords in search_list: factory = SearchEngineFactory(keywords,data['SAMPLES_PER_SEARCH'],classes['CLASS_NAME'],data['RESIZE_METHOD'], data['DATASET_NAME'],data['IMAGE_SIZE'], data['ENGINE'],data['API_KEY']) # Remove corrupt files remove_corrupt(data['DATASET_NAME']) # Create a CSV with dataset info create_dataset_csv(data['DATASET_NAME']) click.clear() console.print("Dataset READY!") @main.command()
def split(): """ Split dataset into train/valid folders """ console = Console() while True: click.clear() console.print(BANNER) console.print("Choose the desired proportion of images of each class to be distributed in train/valid folders. [bold]What percentage of images should be distributed towards training?[/bold] ") train_proportion = click.prompt("(0-100)", type=int) validation_proportion = 100 - train_proportion if train_proportion < 0 or train_proportion > 100: click.clear() console.print("[red]Please provide a valid amount. Choose a number between 0 and 100 to be assigned to training.[/red]") continue else: click.clear() console.print("[bold blue]{train} percent[/bold blue] of the images will be moved to a [bold yellow]train[/bold yellow] folder, while [bold blue]{valid} percent [/bold blue] of the remaining images will be stored in a [bold yellow]validation[/bold yellow] folder.".format(train=train_proportion, valid=validation_proportion)) c= click.prompt("Is that ok? [Y/n]") if c.lower() == 'y': if not os.path.exists("dataset.yaml"): click.clear() console.print("Dataset config file not found\nRun - [bold blue]idt init[/bold blue]") exit(0) with open('dataset.yaml') as f: click.clear() console.print("[italic]Copying files to the train/valid folders. 
Please wait...[/italic]") data = yaml.load(f, Loader=yaml.FullLoader) split_dataset(data['DATASET_NAME'], train_proportion) console.clear() console.print("[bold blue]Done[/bold blue]") break else: continue if __name__ == "__main__": main()
================================================ FILE: idt/bing.py ================================================
import os import requests import re from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates from rich.progress import Progress class BingSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.page = 0 self.search() def search(self): BING_IMAGE = 'https://www.bing.com/images/async?q=' USER_AGENT = { 'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'} data = self.data.replace(" ", "-") if data[0] == "-": data = data[1:] with Progress() as progress: task1 = progress.add_task(f"Downloading [blue]{self.data}[/blue] class...", total=self.n_images) while self.downloaded_images < self.n_images: searchurl = BING_IMAGE + data + '&first=' + str(self.page) + '&count=100' # Request the URL; without a user agent the request is denied response = requests.get(searchurl, headers=USER_AGENT) html = response.text self.page += 100 results = re.findall('murl":"(.*?)"', html) if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) for link in results: try: if self.downloaded_images < self.n_images: download(link, self.size, self.root_folder, self.folder, self.resize_method) self.downloaded_images += 1 progress.update(task1, advance=1) else: break except Exception: continue
self.downloaded_images -= erase_duplicates(target_folder) print('Done')
================================================ FILE: idt/bing_api.py ================================================
import os import requests from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates from idt.utils.create_dataset_csv import generate_class_info from rich.progress import Progress class BingApiSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size, api_key): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.dataset_info = [] self.page = 0 self.api_key = api_key self.search() def search(self): BING_IMAGE = 'https://api.cognitive.microsoft.com/bing/v7.0/images/search' headers = {"Ocp-Apim-Subscription-Key": self.api_key} params = {"q": self.data, "count": 100, "offset": self.page} with Progress() as progress: task1 = progress.add_task(f"Downloading [blue]{self.data}[/blue] class...", total=self.n_images) while self.downloaded_images < self.n_images: response = requests.get(BING_IMAGE, headers=headers, params=params) response.raise_for_status() results = response.json() self.page += 100 params["offset"] = self.page if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) for result in results['value']: try: if self.downloaded_images < self.n_images: download(result['contentUrl'], self.size, self.root_folder, self.folder, self.resize_method) self.dataset_info.append({ 'name': result['name'], 'origin': result['hostPageDisplayUrl'].split('/')[2], 'date': result['datePublished'], 'original_size': result['contentSize'], 'original_width': result['width'], 'original_height': result['height']}) self.downloaded_images += 1
progress.update(task1, advance=1) else: break except Exception: continue self.downloaded_images -= erase_duplicates(target_folder) generate_class_info(self.dataset_info, self.root_folder, self.folder)
================================================ FILE: idt/duckgo.py ================================================
import requests import re import json import time import os from rich.progress import Progress from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates class DuckGoSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.search() def search(self): URL = 'https://duckduckgo.com/' PARAMS = {'q': self.data} HEADERS = { 'authority': 'duckduckgo.com', 'accept': 'application/json, text/javascript, */*; q=0.01', 'sec-fetch-dest': 'empty', 'x-requested-with': 'XMLHttpRequest', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'cors', 'referer': 'https://duckduckgo.com/', 'accept-language': 'en-US,en;q=0.9'} res = requests.post(URL, data=PARAMS, timeout=3.000) search_object = re.search(r'vqd=([\d-]+)\&', res.text, re.M|re.I) if not search_object: return -1 PARAMS = ( ('l', 'us-en'), ('o', 'json'), ('q', self.data), ('vqd', search_object.group(1)), ('f', ',,,'), ('p', '1'), ('v7exp', 'a')) request_url = URL + "i.js" with Progress() as progress: task1 = progress.add_task("[blue]Downloading {x} class...".format(x=self.data), total=self.n_images) while self.downloaded_images < self.n_images: while True: try: res = requests.get(request_url, headers=HEADERS, params=PARAMS, timeout=3.000) data = json.loads(res.text) break
except ValueError: time.sleep(5) continue if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) # Trim the extra results to the number of images still needed if len(data["results"]) > self.n_images - self.downloaded_images: data["results"] = data["results"][:self.n_images - self.downloaded_images] for results in data["results"]: try: download(results["image"], self.size, self.root_folder, self.folder, self.resize_method) self.downloaded_images += 1 progress.update(task1, advance=1) except Exception: continue self.downloaded_images -= erase_duplicates(target_folder) if "next" not in data: return 0 request_url = URL + data["next"]
================================================ FILE: idt/factories.py ================================================
from idt.duckgo import DuckGoSearchEngine from idt.bing import BingSearchEngine from idt.bing_api import BingApiSearchEngine from idt.flickr_api import FlickrApiSearchEngine class SearchEngineFactory: def __init__(self, data, n_images, folder, resize_method, root_folder, size, engine, api_key): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.engine = engine self.api_key = api_key self.get_search_engine() def get_search_engine(self): if self.engine == "duckgo": return DuckGoSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size) elif self.engine == "bing": return BingSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size) elif self.engine == "bing_api": return BingApiSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size, self.api_key) elif self.engine == "flickr_api": return FlickrApiSearchEngine(self.data, self.n_images,
self.folder, self.resize_method, self.root_folder, self.size, self.api_key) else: return None
================================================ FILE: idt/flickr_api.py ================================================
import os import requests from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates from rich.progress import Progress class FlickrApiSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size, api_key): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.dataset_info = [] self.page = 1 self.api_key = api_key self.search() def search(self): FLICKR_LINK = 'https://www.flickr.com/services/rest/' data = self.data.replace(" ", "+") if data[0] == "+": data = data[1:] params = { "method": "flickr.photos.search", "api_key": self.api_key, "tags": data, "format": "json", "page": self.page, "nojsoncallback": 1 } with Progress() as progress: task1 = progress.add_task(f"Downloading [blue]{self.data}[/blue] class...", total=self.n_images) while self.downloaded_images < self.n_images: response = requests.get(FLICKR_LINK, params=params) response.raise_for_status() results = response.json() results = results['photos'] if results['total'] == 0: progress.update(task1, advance=self.n_images) return 0 self.page += 1 params["page"] = self.page if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) for result in results['photo']: try: if self.downloaded_images < self.n_images: link = f"https://farm{result['farm']}.staticflickr.com/{result['server']}/{result['id']}_{result['secret']}.jpg" download(link, self.size, self.root_folder, self.folder, self.resize_method)
self.downloaded_images += 1 progress.update(task1, advance=1) else: break except Exception: continue self.downloaded_images -= erase_duplicates(target_folder)
================================================ FILE: idt/resizers/__init__.py ================================================
================================================ FILE: idt/resizers/get_resizer.py ================================================
from .smartcrop import SmartCrop from .longer_side import crop_longer_side from .shorter_side import crop_shorter_side def get_resizer(img, target_size, resizer): if target_size == 0: return img if resizer == "smartcrop": sc = SmartCrop() return sc.run_crop(img, target_size) elif resizer == 'shorter_side': return crop_shorter_side(img, target_size) elif resizer == 'longer_side': return crop_longer_side(img, target_size) else: # Unknown resize method: fall back to the original image return img
================================================ FILE: idt/resizers/longer_side.py ================================================
from PIL import Image def crop_longer_side(img, size): IMG_SIZE = size, size img.thumbnail(IMG_SIZE, Image.ANTIALIAS) return img
================================================ FILE: idt/resizers/shorter_side.py ================================================
from PIL import Image def crop_shorter_side(img, size): width, height = img.size if width < size or height < size: # Image is already smaller than the target; thumbnail cannot upscale return img elif width > height: ratio = float(width) / float(height) new_width = int(size * ratio) return img.resize((new_width, size), Image.ANTIALIAS) else: ratio = float(height) / float(width) new_height = int(size * ratio) return img.resize((size, new_height), Image.ANTIALIAS)
================================================ FILE: idt/resizers/smartcrop.py ================================================
import math import sys import numpy as np from PIL import Image from PIL.ImageFilter import Kernel def saturation(image): r, g, b = image.split() r, g, b = np.array(r, float), np.array(g, float), np.array(b,
float) maximum = np.maximum(np.maximum(r, g), b) # [0; 255] minimum = np.minimum(np.minimum(r, g), b) # [0; 255] s = (maximum + minimum) / 255 # [0.0; 1.0] d = (maximum - minimum) / 255 # [0.0; 1.0] d[maximum == minimum] = 0 # if maximum == minimum: s[maximum == minimum] = 1 # -> saturation = 0 / 1 = 0 mask = s > 1 s[mask] = 2 - d[mask] return d / s # [0.0; 1.0] def thirds(x): """gets value in the range of [0, 1] where 0 is the center of the pictures returns weight of rule of thirds [0, 1]""" x = ((x + 2 / 3) % 2 * 0.5 - 0.5) * 16 return max(1 - x * x, 0) class SmartCrop(object): DEFAULT_SKIN_COLOR = [0.78, 0.57, 0.44] def __init__( self, detail_weight=0.2, edge_radius=0.4, edge_weight=-20, outside_importance=-0.5, rule_of_thirds=True, saturation_bias=0.2, saturation_brightness_max=0.9, saturation_brightness_min=0.05, saturation_threshold=0.4, saturation_weight=0.3, score_down_sample=8, skin_bias=0.01, skin_brightness_max=1, skin_brightness_min=0.2, skin_color=None, skin_threshold=0.8, skin_weight=1.8 ): self.detail_weight = detail_weight self.edge_radius = edge_radius self.edge_weight = edge_weight self.outside_importance = outside_importance self.rule_of_thirds = rule_of_thirds self.saturation_bias = saturation_bias self.saturation_brightness_max = saturation_brightness_max self.saturation_brightness_min = saturation_brightness_min self.saturation_threshold = saturation_threshold self.saturation_weight = saturation_weight self.score_down_sample = score_down_sample self.skin_bias = skin_bias self.skin_brightness_max = skin_brightness_max self.skin_brightness_min = skin_brightness_min self.skin_color = skin_color or self.DEFAULT_SKIN_COLOR self.skin_threshold = skin_threshold self.skin_weight = skin_weight def analyse( self, image, crop_width, crop_height, max_scale=1, min_scale=0.9, scale_step=0.1, step=8 ): """ Analyze image and return some suggestions of crops (coordinates). This implementation / algorithm is really slow for large images. 
Use `crop()` which is pre-scaling the image before analyzing it. """ cie_image = image.convert('L', (0.2126, 0.7152, 0.0722, 0)) cie_array = np.array(cie_image) # [0; 255] # R=skin G=edge B=saturation edge_image = self.detect_edge(cie_image) skin_image = self.detect_skin(cie_array, image) saturation_image = self.detect_saturation(cie_array, image) analyse_image = Image.merge('RGB', [skin_image, edge_image, saturation_image]) del edge_image del skin_image del saturation_image score_image = analyse_image.copy() score_image.thumbnail( ( int(math.ceil(image.size[0] / self.score_down_sample)), int(math.ceil(image.size[1] / self.score_down_sample)) ), Image.ANTIALIAS) top_crop = None top_score = -sys.maxsize crops = self.crops( image, crop_width, crop_height, max_scale=max_scale, min_scale=min_scale, scale_step=scale_step, step=step) for crop in crops: crop['score'] = self.score(score_image, crop) if crop['score']['total'] > top_score: top_crop = crop top_score = crop['score']['total'] return {'analyse_image': analyse_image, 'crops': crops, 'top_crop': top_crop} def crop( self, image, width, height, prescale=True, max_scale=1, min_scale=0.9, scale_step=0.1, step=8 ): scale = min(image.size[0] / width, image.size[1] / height) crop_width = int(math.floor(width * scale)) crop_height = int(math.floor(height * scale)) min_scale = min(max_scale, max(1 / scale, min_scale)) prescale_size = 1 if prescale: prescale_size = 1 / scale / min_scale if prescale_size < 1: image = image.copy() image.thumbnail( (int(image.size[0] * prescale_size), int(image.size[1] * prescale_size)), Image.ANTIALIAS) crop_width = int(math.floor(crop_width * prescale_size)) crop_height = int(math.floor(crop_height * prescale_size)) else: prescale_size = 1 result = self.analyse( image, crop_width=crop_width, crop_height=crop_height, min_scale=min_scale, max_scale=max_scale, scale_step=scale_step, step=step) for i in range(len(result['crops'])): crop = result['crops'][i] crop['x'] = int(math.floor(crop['x'] / 
prescale_size)) crop['y'] = int(math.floor(crop['y'] / prescale_size)) crop['width'] = int(math.floor(crop['width'] / prescale_size)) crop['height'] = int(math.floor(crop['height'] / prescale_size)) result['crops'][i] = crop return result def run_crop(self, image, target_size): if image.mode != 'RGB' and image.mode != 'RGBA': new_image = Image.new('RGB', image.size) new_image.paste(image) image = new_image result = self.crop(image, width=100, height=int(target_size / target_size * 100)) box = ( result['top_crop']['x'], result['top_crop']['y'], result['top_crop']['width'] + result['top_crop']['x'], result['top_crop']['height'] + result['top_crop']['y'] ) cropped_image = image.crop(box) cropped_image.thumbnail((target_size,target_size), Image.ANTIALIAS) return cropped_image def crops( self, image, crop_width, crop_height, max_scale=1, min_scale=0.9, scale_step=0.1, step=8 ): image_width, image_height = image.size crops = [] for scale in ( i / 100 for i in range( int(max_scale * 100), int((min_scale - scale_step) * 100), -int(scale_step * 100)) ): for y in range(0, image_height, step): if not (y + crop_height * scale <= image_height): break for x in range(0, image_width, step): if not (x + crop_width * scale <= image_width): break crops.append({ 'x': x, 'y': y, 'width': crop_width * scale, 'height': crop_height * scale, }) if not crops: raise ValueError(locals()) return crops def detect_edge(self, cie_image): return cie_image.filter(Kernel((3, 3), (0, -1, 0, -1, 4, -1, 0, -1, 0), 1, 1)) def detect_saturation(self, cie_array, source_image): threshold = self.saturation_threshold saturation_data = saturation(source_image) mask = ( (saturation_data > threshold) & (cie_array >= self.saturation_brightness_min * 255) & (cie_array <= self.saturation_brightness_max * 255)) saturation_data[~mask] = 0 saturation_data[mask] = (saturation_data[mask] - threshold) * (255 / (1 - threshold)) return Image.fromarray(saturation_data.astype('uint8')) def detect_skin(self, cie_array, 
source_image): r, g, b = source_image.split() r, g, b = np.array(r, float), np.array(g, float), np.array(b, float) rd = np.ones_like(r) * -self.skin_color[0] gd = np.ones_like(g) * -self.skin_color[1] bd = np.ones_like(b) * -self.skin_color[2] mag = np.sqrt(r * r + g * g + b * b) mask = ~(abs(mag) < 1e-6) rd[mask] = r[mask] / mag[mask] - self.skin_color[0] gd[mask] = g[mask] / mag[mask] - self.skin_color[1] bd[mask] = b[mask] / mag[mask] - self.skin_color[2] skin = 1 - np.sqrt(rd * rd + gd * gd + bd * bd) mask = ( (skin > self.skin_threshold) & (cie_array >= self.skin_brightness_min * 255) & (cie_array <= self.skin_brightness_max * 255)) skin_data = (skin - self.skin_threshold) * (255 / (1 - self.skin_threshold)) skin_data[~mask] = 0 return Image.fromarray(skin_data.astype('uint8')) def importance(self, crop, x, y): if ( crop['x'] > x or x >= crop['x'] + crop['width'] or crop['y'] > y or y >= crop['y'] + crop['height'] ): return self.outside_importance x = (x - crop['x']) / crop['width'] y = (y - crop['y']) / crop['height'] px, py = abs(0.5 - x) * 2, abs(0.5 - y) * 2 # distance from edge dx = max(px - 1 + self.edge_radius, 0) dy = max(py - 1 + self.edge_radius, 0) d = (dx * dx + dy * dy) * self.edge_weight s = 1.41 - math.sqrt(px * px + py * py) if self.rule_of_thirds: s += (max(0, s + d + 0.5) * 1.2) * (thirds(px) + thirds(py)) return s + d def score(self, target_image, crop): score = { 'detail': 0, 'saturation': 0, 'skin': 0, 'total': 0, } target_data = target_image.getdata() target_width, target_height = target_image.size down_sample = self.score_down_sample inv_down_sample = 1 / down_sample target_width_down_sample = target_width * down_sample target_height_down_sample = target_height * down_sample for y in range(0, target_height_down_sample, down_sample): for x in range(0, target_width_down_sample, down_sample): p = int( math.floor(y * inv_down_sample) * target_width + math.floor(x * inv_down_sample) ) importance = self.importance(crop, x, y) detail = 
target_data[p][1] / 255 score['skin'] += ( target_data[p][0] / 255 * (detail + self.skin_bias) * importance ) score['detail'] += detail * importance score['saturation'] += ( target_data[p][2] / 255 * (detail + self.saturation_bias) * importance ) score['total'] = ( score['detail'] * self.detail_weight + score['skin'] * self.skin_weight + score['saturation'] * self.saturation_weight ) / (crop['width'] * crop['height']) return score
================================================ FILE: idt/utils/__init__.py ================================================
================================================ FILE: idt/utils/create_dataset_csv.py ================================================
import os import re import yaml def create_dataset_csv(path): number_of_dirs = 0 number_of_files = 0 csv_dict = {'DATASET': path, 'NUMBER_OF_CLASSES': 0, 'TOTAL_NUMBER_OF_FILES': 0} for base, dirs, files in os.walk(path): for directories in dirs: number_of_dirs += 1 dir_path = os.path.join(path, directories) count = len(os.listdir(dir_path)) csv_dict[str(directories)] = count for Files in files: number_of_files += 1 csv_dict['NUMBER_OF_CLASSES'] = number_of_dirs csv_dict['TOTAL_NUMBER_OF_FILES'] = number_of_files with open('{path}/{path}.csv'.format(path=path), 'w') as f: for key in csv_dict.keys(): f.write("%s,%s\n" % (key, csv_dict[key])) #TODO implement natural sort to classes def atoi(text): return int(text) if text.isdigit() else text def natural_keys(text): return [ atoi(c) for c in re.split(r'(\d+)', text) ] def generate_class_info(info, root_folder, folder): with open(f"./{root_folder}/{folder}.yaml", "w") as f: f.write(yaml.dump(info))
================================================ FILE: idt/utils/download_images.py ================================================
import uuid import requests import os from PIL import Image from io import BytesIO from idt.resizers.get_resizer import get_resizer def download(link,
size, root_folder, class_name, resize_method): response = requests.get(link, timeout=3.000) file = BytesIO(response.content) raw_img = Image.open(file) # resize or crop image according to provided resize method img = get_resizer(raw_img, size, resize_method) # Split last part of url to get image name and its extension img_name = link.rsplit('/', 1)[1] img_type = img_name.split('.')[-1] if img_type.lower() != "jpg": raise Exception("Cannot download this type of file") else: # Use a unique id so files with the same name do not collide img_id = uuid.uuid1() img.save(f"./{root_folder}/{class_name}/{class_name}-{img_id.hex}.jpg", "JPEG")
================================================ FILE: idt/utils/remove_corrupt.py ================================================
import os, hashlib def remove_corrupt(path): visited_dir = 0 print("Removing corrupt files") for base, dirs, files in os.walk(path): for directories in dirs: visited_dir += 1 for Files in files: file = os.path.join(base, Files) if os.stat(file).st_size == 0: os.remove(file) def erase_duplicates(folder): duplicates = [] hash_keys = dict() file_list = os.listdir(folder) for index, file_name in enumerate(file_list): if os.path.isfile(os.path.join(folder, file_name)): with open(os.path.join(folder, file_name), 'rb') as f: filehash = hashlib.md5(f.read()).hexdigest() if filehash not in hash_keys: hash_keys[filehash] = index else: duplicates.append((index, hash_keys[filehash])) for index in duplicates: os.remove(os.path.join(folder, file_list[index[0]])) return len(duplicates)
================================================ FILE: idt/utils/split_dataset.py ================================================
import os import random from shutil import copyfile def split_dataset(img_source_dir, train_size): train_size = train_size / 100 print("Creating a dataset split into train/validation
folders...") if not os.path.exists(img_source_dir): raise OSError(f"The source folder {img_source_dir} does not exist. Check that the dataset folder name is correct.") # Create folders if not os.path.exists('split-dataset'): os.makedirs('split-dataset') if not os.path.exists('split-dataset/train'): os.makedirs('split-dataset/train') if not os.path.exists('split-dataset/validation'): os.makedirs('split-dataset/validation') # Get the subdirectories in the main image folder subdirs = [subdir for subdir in os.listdir(img_source_dir) if os.path.isdir(os.path.join(img_source_dir, subdir))] for subdir in subdirs: subdir_fullpath = os.path.join(img_source_dir, subdir) if len(os.listdir(subdir_fullpath)) == 0: print(subdir_fullpath + ' is empty, skipping') continue train_subdir = os.path.join('split-dataset/train', subdir) validation_subdir = os.path.join('split-dataset/validation', subdir) # Create subdirectories in train and validation folders if not os.path.exists(train_subdir): os.makedirs(train_subdir) if not os.path.exists(validation_subdir): os.makedirs(validation_subdir) train_counter = 0 validation_counter = 0 # Randomly assign an image to train or validation folder for filename in os.listdir(subdir_fullpath): if filename.endswith(".jpg") or filename.endswith(".png"): fileparts = filename.split('.') if random.uniform(0, 1) <= train_size: copyfile(os.path.join(subdir_fullpath, filename), os.path.join(train_subdir, str(train_counter) + '.' + fileparts[1])) train_counter += 1 else: copyfile(os.path.join(subdir_fullpath, filename), os.path.join(validation_subdir, str(validation_counter) + '.'
+ fileparts[1])) validation_counter += 1
================================================ FILE: requirements.txt ================================================
click==7.1.2 PyYAML==5.3.1 requests==2.22.0 Pillow==7.0.0 rich==6.1.2 numpy==1.19.1
================================================ FILE: setup.py ================================================
from setuptools import setup, find_packages from os import path import pathlib # The directory containing this file HERE = pathlib.Path(__file__).parent # The text of the README file README = (HERE / 'README.md').read_text() # Automatically capture required modules from requirements.txt for install_requires and configure dependency links with open(path.join(HERE, 'requirements.txt'), encoding='utf-8') as f: all_reqs = f.read().split('\n') install_requires = [x.strip() for x in all_reqs if 'git+' not in x and not x.startswith('#') and not x.startswith('-')] dependency_links = [x.strip().replace('git+', '') for x in all_reqs if 'git+' in x] setup( # The package uses f-strings throughout, so Python 3.6+ is required name='idt', description='A cli tool that quickly generates ready-to-use image datasets', version='0.0.6', packages=find_packages(), install_requires=install_requires, python_requires='>=3.6', entry_points=''' [console_scripts] idt=idt.__main__:main ''', author='Deliton Junior', keywords='idt, image datasets, generators, dataset generator, image scraper', long_description=README, long_description_content_type='text/markdown', license='MIT', url='https://github.com/deliton/idt', download_url='https://github.com/deliton/idt/archive/master.zip', dependency_links=dependency_links, author_email='deliton.m@hotmail.com', classifiers=['License :: OSI Approved :: MIT License', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.7'], )
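The engines, resizers, and utilities above are all driven by the dataset.yaml file that `idt init` writes and `idt build` reads. As a reference, the sketch below builds and dumps a configuration using the keys consumed in idt/__main__.py and idt/factories.py; the dataset name, class, and keyword values are illustrative examples, not part of the repository.

```python
import yaml

# Sketch of the dataset.yaml structure consumed by `idt build`.
# Keys mirror those read in idt/__main__.py; the values are examples.
document_dict = {
    "DATASET_NAME": "fruits",          # root folder the engines download into
    "SAMPLES_PER_SEARCH": 50,          # images downloaded per keyword
    "IMAGE_SIZE": 512,                 # target size passed to the resizers
    "RESIZE_METHOD": "longer_side",    # smartcrop | shorter_side | longer_side
    "ENGINE": "duckgo",                # duckgo | bing | bing_api | flickr_api
    "API_KEY": "NONE",                 # only used by bing_api and flickr_api
    "CLASSES": [
        {
            "CLASS_NAME": "Pineapple",
            "SEARCH_KEYWORDS": "pineapple, pineapple fruit, ananas",
        }
    ],
}

config_yaml = yaml.dump(document_dict)
print(config_yaml)
```

With a file like this in place, `idt build` downloads each keyword's results into fruits/Pineapple/, then deduplicates the folder and writes the dataset CSV summary.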