Repository: deliton/idt Branch: master Commit: 050d82a51dfd Files: 26 Total size: 61.4 KB Directory structure: gitextract_x8iz4hcn/ ├── .github/ │ └── ISSUE_TEMPLATE/ │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── idt/ │ ├── __init__.py │ ├── __main__.py │ ├── bing.py │ ├── bing_api.py │ ├── duckgo.py │ ├── factories.py │ ├── flickr_api.py │ ├── resizers/ │ │ ├── __init__.py │ │ ├── get_resizer.py │ │ ├── longer_side.py │ │ ├── shorter_side.py │ │ └── smartcrop.py │ └── utils/ │ ├── __init__.py │ ├── create_dataset_csv.py │ ├── download_images.py │ ├── remove_corrupt.py │ └── split_dataset.py ├── requirements.txt └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/ISSUE_TEMPLATE/bug_report.md ================================================ --- name: Bug report about: Create a report to help us improve title: '' labels: '' assignees: '' --- **Describe the bug** A clear and concise description of what the bug is. **To Reproduce** Steps to reproduce the behavior: 1. Go to '...' 2. Click on '....' 3. Scroll down to '....' 4. See error **Expected behavior** A clear and concise description of what you expected to happen. **Screenshots** If applicable, add screenshots to help explain your problem. **Desktop (please complete the following information):** - OS: [e.g. iOS] - Browser [e.g. chrome, safari] - Version [e.g. 22] **Smartphone (please complete the following information):** - Device: [e.g. iPhone6] - OS: [e.g. iOS8.1] - Browser [e.g. stock browser, safari] - Version [e.g. 22] **Additional context** Add any other context about the problem here. 
================================================ FILE: .github/ISSUE_TEMPLATE/feature_request.md ================================================ --- name: Feature request about: Suggest an idea for this project title: '' labels: '' assignees: '' --- **Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] **Describe the solution you'd like** A clear and concise description of what you want to happen. **Describe alternatives you've considered** A clear and concise description of any alternative solutions or features you've considered. **Additional context** Add any other context or screenshots about the feature request here. ================================================ FILE: .gitignore ================================================ !**/__pycache__/ __pycache__ idt.egg-info build/ dist/ .README.md.kate-swp /.vscode ================================================ FILE: CODE_OF_CONDUCT.md ================================================ # Contributor Covenant Code of Conduct ## Our Pledge In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. 
## Our Standards Examples of behavior that contributes to creating a positive environment include: * Using welcoming and inclusive language * Being respectful of differing viewpoints and experiences * Gracefully accepting constructive criticism * Focusing on what is best for the community * Showing empathy towards other community members Examples of unacceptable behavior by participants include: * The use of sexualized language or imagery and unwelcome sexual attention or advances * Trolling, insulting/derogatory comments, and personal or political attacks * Public or private harassment * Publishing others' private information, such as a physical or electronic address, without explicit permission * Other conduct which could reasonably be considered inappropriate in a professional setting ## Our Responsibilities Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. ## Scope This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. ## Enforcement Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at deliton.m@hotmail.com. 
All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership. ## Attribution This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html [homepage]: https://www.contributor-covenant.org For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq ================================================ FILE: CONTRIBUTING.md ================================================ ![idt-contrib](https://user-images.githubusercontent.com/47995046/96387698-74e85e80-117a-11eb-8b35-d65b336fd1df.png) 🎉 Thanks for taking the time to contribute to this project! 🎉 ## Code of Conduct This project and everyone participating in it is governed by the [IDT Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to deliton.m@hotmail.com. ## How Can I Contribute? ### Reporting Bugs This section guides you through submitting a bug report for IDT. Following these guidelines helps maintainers and the community understand your report :pencil:, reproduce the behavior :computer: :computer:, and find related reports :mag_right:. When you are creating a bug report, please include as many details as possible. > **Note:** If you find a **Closed** issue that seems like it is the same thing that you're experiencing, open a new issue and include a link to the original issue in the body of your new one. #### How Do I Submit A (Good) Bug Report? 
Bugs are tracked as [GitHub issues](https://guides.github.com/features/issues/). Create an issue in this repository and provide the following information by filling in bug report template Explain the problem and include additional details to help maintainers reproduce the problem: * **Use a clear and descriptive title** for the issue to identify the problem. * **Describe the exact steps which reproduce the problem** in as many details as possible. For example, start by explaining how you started IDT, e.g. which command exactly you used in the terminal, or how you started IDT otherwise. When listing steps, **don't just say what you did, but explain how you did it**. For example, if you moved the cursor to the end of a line, explain if you used the mouse, or a keyboard shortcut or an IDT command, and if so which one? * **Provide specific examples to demonstrate the steps**. Include links to files or GitHub projects, or copy/pasteable snippets, which you use in those examples. If you're providing snippets in the issue, use [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines). * **Describe the behavior you observed after following the steps** and point out what exactly is the problem with that behavior. * **Explain which behavior you expected to see instead and why.** * **Include screenshots and animated GIFs** which show you following the described steps and clearly demonstrate the problem. If you use the keyboard while following the steps, **record the GIF with the Keybinding Resolver shown**. You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux. * **If you're reporting that IDT crashed**, include a crash report with a stack trace from the operating system. 
On macOS, the crash report will be available in `Console.app` under "Diagnostic and usage information" > "User diagnostic reports". Include the crash report in the issue in a [code block](https://help.github.com/articles/markdown-basics/#multiple-lines), a [file attachment](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/), or put it in a [gist](https://gist.github.com/) and provide link to that gist. * **If the problem is related to performance or memory**, include a CPU profile capture with your report. * **If the problem wasn't triggered by a specific action**, describe what you were doing before the problem happened and share more information using the guidelines below. ### Suggesting Enhancements This section guides you through submitting an enhancement suggestion for IDT, including completely new features and minor improvements to existing functionality. Following these guidelines helps maintainers and the community understand your suggestion :pencil: and find related suggestions :mag_right:. #### How Do I Submit A (Good) Enhancement Suggestion? Enhancement suggestions are tracked as [GitHub issues](https://guides.github.com/features/issues/). Create an issue on that repository and provide the following information: * **Use a clear and descriptive title** for the issue to identify the suggestion. * **Provide a step-by-step description of the suggested enhancement** in as many details as possible. * **Provide specific examples to demonstrate the steps**. Include copy/pasteable snippets which you use in those examples, as [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines). * **Describe the current behavior** and **explain which behavior you expected to see instead** and why. * **Include screenshots and animated GIFs** which help you demonstrate the steps or point out the part of IDT which the suggestion is related to. 
You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux. * **Explain why this enhancement would be useful** to most IDT users and isn't something that can or should be implemented as a community package. * **List some other text editors or applications where this enhancement exists.** * **Specify which version of IDT you're using.** * **Specify the name and version of the OS you're using.** ### Your First Code Contribution Unsure where to begin contributing to IDT? You can start by looking through `help-wanted` issues: * Help wanted issues - issues related to program problems, feature suggestion and implementation of wanted features. ### Pull Requests The process described here has several goals: - Maintain IDT's quality - Fix problems that are important to users - Engage the community in working toward the best possible IDT - Enable a sustainable system for IDT's maintainers to review contributions Please follow these steps to have your contribution considered by the maintainers: 1. Follow all instructions in [the template](PULL_REQUEST_TEMPLATE.md) 2. Follow the [styleguides](#styleguides) 3. After you submit your pull request, verify that all [status checks](https://help.github.com/articles/about-status-checks/) are passing
What if the status checks are failing?If a status check is failing, and you believe that the failure is unrelated to your change, please leave a comment on the pull request explaining why you believe the failure is unrelated. A maintainer will re-run the status check for you. If we conclude that the failure was a false positive, then we will open an issue to track that problem with our status check suite.
IDT is a volunteer effort. We encourage you to pitch in and join the team! Thanks! <3 IDT Team ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2020 Deliton Junior Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # IDT - Image Dataset Tool ## Version 0.0.6 beta ![idt-logo](https://user-images.githubusercontent.com/47995046/96403078-d675f080-11ad-11eb-8435-c8ce69a6c871.png) ## Description The image dataset tool (IDT) is a CLI app developed to make it easier and faster to create image datasets for deep learning. The tool achieves this by scraping images from several search engines, such as duckgo, bing and flickr. IDT can also optionally optimize the dataset: the user can downscale and compress the images for optimal file size and dimensions.
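The downscale-and-compress optimization described above can be sketched with Pillow. This is a minimal illustration under assumed parameters (a 512-pixel target and JPEG quality 80), not the package's actual implementation:

```python
from io import BytesIO

from PIL import Image  # Pillow


def optimize(img, max_side=512, quality=80):
    """Downscale so the longer side is at most max_side (aspect ratio
    preserved), then re-encode as JPEG to shrink the file size."""
    img = img.copy()
    img.thumbnail((max_side, max_side))  # resizes in place; never upscales
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return img.size, buf.getvalue()


# A 1024x768 source becomes 512x384: the longer side is capped at 512
# and the shorter side follows proportionally.
size, jpeg_bytes = optimize(Image.new("RGB", (1024, 768)))
print(size)  # (512, 384)
```

Applying something like this to every downloaded file is roughly what the optional optimization pass does for the dataset as a whole.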
A sample dataset created using **idt** containing a total of 23,688 image files weighs only 559.2 megabytes. ## NEW UPDATE! I am proud to announce our newest version! 🎉🎉 **What changed** * Added an automatic duplicate image remover * Added the longer side resize method. With this option, the image is resized based on its longer side. * Added the shorter side resize method. With this option, the image is resized based on its shorter side. * Added Smart Crop. This method tries to crop and resize around the main subject of the image. The algorithm is based on SmartCrop.js and SmartCrop.py. * Removed verbose mode. It was useful in earlier stages of development but no longer adds value to the experience. * The official documentation is almost ready. A link will be available soon. ## Installing You can install it via pip or by cloning this repository. ```console user@admin:~$ pip3 install idt ``` **OR** ```console user@admin:~$ git clone https://github.com/deliton/idt.git && cd idt user@admin:~/idt$ sudo python3 setup.py install ``` ## Getting Started ![idt-gif](https://user-images.githubusercontent.com/47995046/96406740-6d46ab00-11b6-11eb-980b-a40968ed38b4.gif) The quickest way to get started with IDT is running the simple "run" command. Just type something like this in your favorite console: ```console user@admin:~$ idt run -i apples ``` This will quickly download 50 images of apples. By default, it uses the duckgo search engine. The run command accepts the following options: | Option | Description | | ----------- | ----------- | | **-i** or **--input** | The keyword used to find the desired images. | | **-s** or **--size** | The number of images to be downloaded. | | **-e** or **--engine** | The desired search engine (options: duckgo, bing, bing_api and flickr_api). | | **--resize-method** | The resize method to apply (options: longer_side, shorter_side and smartcrop). | | **-is** or **--image-size** | Sets the desired image size ratio.
(default: 512) | | **-ak** or **--api-key** | Required when using a search engine that needs an API key. | ## Usage IDT requires a config file that tells it how your dataset should be organized. You can create it using the following command: ```console user@admin:~$ idt init ``` This command will trigger the config file creator and ask for the desired dataset parameters. In this example, let's create a dataset containing images of your favorite cars. The first parameter this command asks for is the dataset name. In this example, let's name our dataset "My favorite cars": ```console Insert a name for your dataset: : My favorite cars ``` Then the tool will ask how many samples per search are required to mount your dataset. Building a good dataset for deep learning requires many images, and since we're using a search engine to scrape them, several searches with different keywords are needed to mount a good-sized dataset. This value corresponds to how many images are downloaded per search. In this example we need a dataset with 250 images in each class, and we'll use 5 keywords to mount each class. So if we type 50 here, IDT will download 50 images for every keyword provided; with 5 keywords we get the required 250 images. ```console How many samples per search will be necessary? : 50 ``` The tool will now ask for an image size ratio. Since training neural networks on large images is rarely practical, we can optionally choose one of the following image size ratios and scale our images down to that size. In this example, we'll go for 512x512, although 256x256 would be an even better option for this task.
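Because IDT preserves the aspect ratio when resizing (as the prompts below note), the chosen value applies to one side and the other side follows proportionally. A rough sketch of the arithmetic behind the longer_side and shorter_side methods (a hypothetical helper, not IDT's actual code):

```python
def resize_dims(width, height, target, method="longer_side"):
    """Compute output dimensions when scaling one side to `target`
    while preserving the aspect ratio."""
    anchor = max(width, height) if method == "longer_side" else min(width, height)
    scale = target / anchor
    return round(width * scale), round(height * scale)


# For a 1024x768 photo with a 512-pixel target:
print(resize_dims(1024, 768, 512, "longer_side"))   # (512, 384)
print(resize_dims(1024, 768, 512, "shorter_side"))  # (683, 512)
```

This is why the downloaded images can have slightly different sizes: only one dimension is guaranteed to match the target.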
```console Choose images resolution: [1] 512 pixels / 512 pixels (recommended) [2] 1024 pixels / 1024 pixels [3] 256 pixels / 256 pixels [4] 128 pixels / 128 pixels [5] Keep original image size ps: note that the aspect ratio of the image will not be changed, so possibly the images received will have slightly different size What is the desired image size ratio: 1 ``` Then choose "longer_side" as the resize method. ```console [1] Resize image based on longer side [2] Resize image based on shorter side [3] Smartcrop ps: note that the aspect ratio of the image will not be changed, so possibly the images received will have slightly different size Desired Image resize method: : longer_side ``` Now you must choose how many classes/folders your dataset should have. This part is very personal, but my favorite cars are: Chevrolet Impala, Range Rover Evoque, Tesla Model X and (why not) AvtoVAZ Lada. So in this case we have 4 classes, one for each favorite. ```console How many image classes are required? : 4 ``` Afterwards, you'll be asked to choose one of the available search engines. In this example, we'll use DuckGO to search images for us. ```console Choose a search engine: [1] Duck GO (recommended) [2] Bing [3] Bing API [4] Flickr API Select option:: 1 ``` Now we have to do some repetitive form filling: we must name each class and list all the keywords that will be used to find the images. Note that you can also edit the generated config file later to add more classes and keywords. ```console Class 1 name: : Chevrolet Impala ``` After typing the first class name, we'll be asked to provide all the keywords used to find images for that class. Remember that we told the program to download 50 images per keyword, so we must provide 5 keywords in this case to get all 250 images.
Keywords MUST be separated by commas (,): ```console In order to achieve better results, choose several keywords that will be provided to the search engine to find your class in different settings. Example: Class Name: Pineapple keywords: pineapple, pineapple fruit, ananas, abacaxi, pineapple drawing Type in all keywords used to find your desired class, separated by commas: Chevrolet Impala 1967 car photos, chevrolet impala on the road, chevrolet impala vintage car, chevrolet impala convertible 1961, chevrolet impala 1964 lowrider ``` Then repeat the process of filling in a class name and its keywords until all 4 required classes are done. ```console Dataset YAML file has been created successfully. Now run idt build to mount your dataset! ``` Your dataset configuration file has been created. Now just run the following command and see the magic happen: ```console user@admin:~$ idt build ``` And wait while the dataset is being mounted: ```console Creating Chevrolet Impala class Downloading Chevrolet Impala 1967 car photos [#########################-----------] 72% 00:00:12 ``` At the end, all your images will be available in a folder with the dataset name. A csv file with the dataset stats is also included in the dataset's root folder. ![idt-results](https://user-images.githubusercontent.com/47995046/93012667-808fa680-f578-11ea-82fc-7ebcb8ce3c41.png) ## Split image dataset for Deep Learning Since deep learning often requires you to split your dataset into training/validation folders, IDT can also do this for you! Just run: ```console user@admin:~$ idt split ``` Now you must choose a train/valid proportion. In this example I've chosen 70% of the images to be reserved for training, while the rest is reserved for validation: ```console Choose the desired proportion of images of each class to be distributed in train/valid folders. What percentage of images should be distributed towards training?
(0-100): 70 70 percent of the images will be moved to a train folder, while 30 percent of the remaining images will be stored in a validation folder. Is that ok? [Y/n]: y ``` And that's it! The split dataset should now be available in the corresponding train/valid subdirectories. ## Issues This project is being developed in my spare time and still needs a lot of work to be free of bugs. Pull requests and contributors are really appreciated; feel free to contribute in any way you can! ================================================ FILE: idt/__init__.py ================================================ ================================================ FILE: idt/__main__.py ================================================ import os import click import yaml import rich from rich.console import Console from idt.factories import SearchEngineFactory from idt.utils.remove_corrupt import remove_corrupt from idt.utils.create_dataset_csv import create_dataset_csv from idt.utils.split_dataset import split_dataset BANNER = """ [bold blue]===================================================================== 8888888 8888888b.
88888888888 888 888 "Y88b 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 .d88P 888 8888888 8888888P" 888 [italic]IMAGE DATASET TOOL V0.6[/italic] =====================================================================[/bold blue] """ #@click.command() @click.group() def main(): """ Image Dataset Builder CLI to create amazing datasets """ pass @main.command() def version(): """ Shows what version idt is currently on """ click.clear() rich.print("[bold magenta]Image Dataset Tool (IDT)[/bold magenta] version 0.0.6 beta") @main.command() def authors(): """ Shows who are the creators of IDT """ click.clear() rich.print("[bold]IDT[/bold] was initially made by [bold magenta]Deliton Junior[/bold magenta] and [bold red]Misael Kelviny[/bold red]") @main.command() @click.option('--input', '-i','--i', help="The name of the thing you want to download") @click.option('--size', '-s','--s', default=50, help="The number of images you want to download.") @click.option('--engine', '-e','--e', default="duckgo", help="What search engine will be used to find your images") @click.option('--resize-method', '-rs','--rs', default="longer_side", help="Resize method adopted. 
Options: shorter_side, longer_side and smartcrop") @click.option('--imagesize', '-is','--is', default=512, help="What image size ratio should be applied to your dataset") @click.option('--api-key', '-ak','--ak', default=None, help="Provide an api-key for the engines that require one") def run(input, size, engine, resize_method, imagesize, api_key): """ This command executes a single search and downloads it """ engine_list = ['duckgo', 'bing', 'bing_api', 'flickr_api'] click.clear() if input and engine in engine_list: factory = SearchEngineFactory(input,size,input,resize_method,"dataset",imagesize, engine, api_key) # Remove corrupt files remove_corrupt("dataset") else: rich.print("Please provide a valid name") @main.command() @click.option('--default', '-d','--d', is_flag=True,default=False, help="Generate a default config file") def init(default): """ This command initializes idt and creates a dataset config file """ console = Console() console.clear() if default: document_dict = { "DATASET_NAME": "dataset", "API_KEY": "", "SAMPLES_PER_SEARCH": 50, "IMAGE_SIZE": 512, "ENGINE": "duckgo", "RESIZE_METHOD": "longer_side", "CLASSES": [{"CLASS_NAME": "Test", "SEARCH_KEYWORDS": "images of cats"}]} if not os.path.exists("dataset.yaml"): console.print("[bold]Creating a dataset configuration file...[/bold]") f = open("dataset.yaml", "w") f.write(yaml.dump(document_dict)) if f: console.clear() console.print("Dataset YAML file has been created successfully. Now run [bold blue]idt build[/bold blue] to mount your dataset!") exit(0) else: console.print("[red]A dataset.yaml is already created. To use another one, delete the current dataset.yaml file[/red]") exit(0) console.print(BANNER) dataset_name = click.prompt("Insert a name for your dataset: ") console.clear() samples = click.prompt("How many samples per search will be necessary?
",type=int) console.clear() console.print("[bold]Choose image resolution[/bold]", justify="center") console.print(""" [1] 512 pixels / 512 pixels [bold blue](recommended)[/bold blue] [2] 1024 pixels / 1024 pixels [3] 256 pixels / 256 pixels [4] 128 pixels / 128 pixels [5] Keep original image size [italic]ps: note that the aspect ratio of the image will [bold]not[/bold] be changed, so possibly the images received will have slightly different size[/italic] """) image_size_ratio = click.prompt("What is the desired image size ratio", type=int) while image_size_ratio < 1 or image_size_ratio > 5: console.print("[italic red]Invalid option, please choose between 1 and 5. [/italic red]") image_size_ratio= click.prompt("\nOption: ",type=int) if image_size_ratio == 1: image_size_ratio= 512 elif image_size_ratio == 2: image_size_ratio = 1024 elif image_size_ratio == 3: image_size_ratio = 256 elif image_size_ratio == 4: image_size_ratio= 128 elif image_size_ratio == 5: image_size_ratio = 0 console.clear() console.print("[bold]Choose a resize method[/bold]", justify="center") console.print(""" [1] Resize image based on longer side [2] Resize image based on shorter side [3] Smartcrop [italic]ps: note that the aspect ratio of the image will [bold]not[/bold] be changed, so possibly the images received will have slightly different size[/italic] """) resize_method = click.prompt("Desired Image resize method: ", type=int) while resize_method < 1 or resize_method > 3: console.print("[red]Invalid option[/red]") resize_method = click.prompt("Choose method [1-3]: ") resize_method_options = ['','longer_side','shorter_side','smartcrop'] console.clear() number_of_classes = click.prompt("How many image classes are required? 
",type=int) document_dict = { "DATASET_NAME": dataset_name, "SAMPLES_PER_SEARCH": samples, "IMAGE_SIZE": image_size_ratio, "RESIZE_METHOD": resize_method_options[resize_method], "CLASSES": [] } console.clear() console.print("[bold]Choose a search engine[/bold]", justify="center") console.print(""" [1] Duck GO [bold blue](recommended)[/bold blue] [2] Bing [3] Bing API [italic yellow](Requires API key)[/italic yellow] [4] Flickr API [italic yellow](Requires API key)[/italic yellow] """) search_engine= click.prompt("Select option:", type=int) while search_engine < 0 or search_engine > 4: console.print("[italic red]Invalid option, please choose between 1 and 4.[/italic red]") search_engine = click.prompt("\nOption: ", type=int) search_options = ['none', 'duckgo', 'bing', 'bing_api', 'flickr_api'] document_dict['ENGINE'] = search_options[search_engine] if search_engine > 2: console.clear() console.print(f'Insert your [bold blue]{search_options[search_engine]}[/bold blue] API key') engine_api_key = click.prompt("API key: ", type=str) document_dict['API_KEY'] = engine_api_key else: document_dict['API_KEY'] = "NONE" search_engine = search_options[search_engine] for x in range(number_of_classes): console.clear() class_name = click.prompt("Class {x} name: ".format(x=x+1)) console.clear() console.print("""In order to achieve better results, choose several keywords that will be provided to the search engine to find your class in different settings. 
[bold blue]Example: [/bold blue] Class Name: [bold yellow]Pineapple[/bold yellow] [italic]keywords[/italic]: [underline]pineapple, pineapple fruit, ananas, abacaxi, pineapple drawing[/underline] """) keywords = click.prompt("Type in all keywords used to find your desired class, separated by commas: ") document_dict['CLASSES'].append({'CLASS_NAME': class_name, 'SEARCH_KEYWORDS': keywords}) if not os.path.exists("dataset.yaml"): console.print("[bold]Creating a dataset configuration file...[/bold]") try: f = open("dataset.yaml", "w") f.write(yaml.dump(document_dict)) if f: console.clear() console.print("Dataset YAML file has been created successfully. Now run [bold blue]idt build[/bold blue] to mount your dataset!") except OSError: console.print("[red]Unable to create file. Please check permission[/red]") else: console.print("[red]A dataset.yaml is already created. To use another one, delete the current dataset.yaml file[/red]") @main.command() def build(): """ This command mounts the dataset """ console = Console() console.clear() console.print(BANNER) if not os.path.exists("dataset.yaml"): click.clear() console.print("Dataset config file not found\nRun - idt init\n") exit(0) with open('dataset.yaml') as f: data = yaml.load(f, Loader=yaml.FullLoader) click.clear() console.print("Building [bold blue]{dataset_name}[/bold blue] dataset...\n".format(dataset_name=data['DATASET_NAME'])) for classes in data['CLASSES']: click.clear() console.print('Creating [bold blue]{name} class[/bold blue] \n'.format(name=classes['CLASS_NAME'])) search_list = classes['SEARCH_KEYWORDS'].split(",") for keywords in search_list: factory = SearchEngineFactory(keywords,data['SAMPLES_PER_SEARCH'],classes['CLASS_NAME'],data['RESIZE_METHOD'], data['DATASET_NAME'],data['IMAGE_SIZE'], data['ENGINE'],data['API_KEY']) # Remove corrupt files remove_corrupt(data['DATASET_NAME']) # Create a CSV with dataset info create_dataset_csv(data['DATASET_NAME']) click.clear() console.print("Dataset READY!") @main.command()
def split(): """ Split dataset into train/valid folders """ console = Console() while True: click.clear() console.print(BANNER) console.print("Choose the desired proportion of images of each class to be distributed in train/valid folders. [bold]What percentage of images should be distributed towards training?[/bold] ") train_proportion = click.prompt("(0-100)", type=int) validation_proportion = 100 - train_proportion if train_proportion < 0 or train_proportion > 100: click.clear() console.print("[red]Please provide a valid amount. Choose a number between 0 and 100 to be assigned to training.[/red]") continue else: click.clear() console.print("[bold blue]{train} percent[/bold blue] of the images will be moved to a [bold yellow]train[/bold yellow] folder, while [bold blue]{valid} percent [/bold blue] of the remaining images will be stored in a [bold yellow]validation[/bold yellow] folder.".format(train=train_proportion, valid=validation_proportion)) c= click.prompt("Is that ok? [Y/n]") if c.lower() == 'y': if not os.path.exists("dataset.yaml"): click.clear() console.print("Dataset config file not found\nRun - [bold blue]idt init[/bold blue]") exit(0) with open('dataset.yaml') as f: click.clear() console.print("[italic]Copying files to the train/valid folders. 
Please wait...[/italic]") data = yaml.load(f, Loader=yaml.FullLoader) split_dataset(data['DATASET_NAME'], train_proportion) console.clear() console.print("[bold blue]Done[/bold blue]") break else: continue if __name__ == "__main__": main()
================================================ FILE: idt/bing.py ================================================
import os import requests import re from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates from rich.progress import Progress class BingSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.page = 0 self.search() def search(self): BING_IMAGE = 'https://www.bing.com/images/async?q=' USER_AGENT = { 'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'} data = self.data.replace(" ", "-") if data[0] == "-": data = data[1:] with Progress() as progress: task1 = progress.add_task(f"Downloading [blue]{self.data}[/blue] class...", total=self.n_images) while self.downloaded_images < self.n_images: searchurl = BING_IMAGE + data + '&first=' + str(self.page) + '&count=100' # Request the URL; without a user agent the request is denied response = requests.get(searchurl, headers=USER_AGENT) html = response.text self.page += 100 results = re.findall('murl":"(.*?)"', html) if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) for link in results: try: if self.downloaded_images < self.n_images: download(link, self.size, self.root_folder, self.folder, self.resize_method) self.downloaded_images += 1 progress.update(task1, advance=1) else: break except Exception: continue
self.downloaded_images -= erase_duplicates(target_folder) print('Done')
================================================ FILE: idt/bing_api.py ================================================
import os import requests from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates from idt.utils.create_dataset_csv import generate_class_info from rich.progress import Progress class BingApiSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size, api_key): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.dataset_info = [] self.page = 0 self.api_key = api_key self.search() def search(self): BING_IMAGE = 'https://api.cognitive.microsoft.com/bing/v7.0/images/search' headers = {"Ocp-Apim-Subscription-Key": self.api_key} params = {"q": self.data, "count": 100, "offset": self.page} with Progress() as progress: task1 = progress.add_task(f"Downloading [blue]{self.data}[/blue] class...", total=self.n_images) while self.downloaded_images < self.n_images: response = requests.get(BING_IMAGE, headers=headers, params=params) response.raise_for_status() results = response.json() self.page += 100 params["offset"] = self.page if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) for result in results['value']: try: if self.downloaded_images < self.n_images: download(result['contentUrl'], self.size, self.root_folder, self.folder, self.resize_method) self.dataset_info.append({ 'name': result['name'], 'origin': result['hostPageDisplayUrl'].split('/')[2], 'date': result['datePublished'], 'original_size': result['contentSize'], 'original_width': result['width'], 'original_height': result['height']}) self.downloaded_images += 1
progress.update(task1, advance=1) else: break except Exception: continue self.downloaded_images -= erase_duplicates(target_folder) generate_class_info(self.dataset_info, self.root_folder, self.folder)
================================================ FILE: idt/duckgo.py ================================================
import requests import re import json import time import os from rich.progress import Progress from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates class DuckGoSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.search() def search(self): URL = 'https://duckduckgo.com/' PARAMS = {'q': self.data} HEADERS = { 'authority': 'duckduckgo.com', 'accept': 'application/json, text/javascript, */*; q=0.01', 'sec-fetch-dest': 'empty', 'x-requested-with': 'XMLHttpRequest', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'cors', 'referer': 'https://duckduckgo.com/', 'accept-language': 'en-US,en;q=0.9'} res = requests.post(URL, data=PARAMS, timeout=3.000) search_object = re.search(r'vqd=([\d-]+)\&', res.text, re.M|re.I) if not search_object: return -1 PARAMS = ( ('l', 'us-en'), ('o', 'json'), ('q', self.data), ('vqd', search_object.group(1)), ('f', ',,,'), ('p', '1'), ('v7exp', 'a')) request_url = URL + "i.js" with Progress() as progress: task1 = progress.add_task("[blue]Downloading {x} class...".format(x=self.data), total=self.n_images) while self.downloaded_images < self.n_images: while True: try: res = requests.get(request_url, headers=HEADERS, params=PARAMS, timeout=3.000) data = json.loads(res.text) break
except ValueError: time.sleep(5) continue if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) # Trim the extra results to the number of images still needed if len(data["results"]) > self.n_images - self.downloaded_images: data["results"] = data["results"][:self.n_images - self.downloaded_images] for results in data["results"]: try: download(results["image"], self.size, self.root_folder, self.folder, self.resize_method) self.downloaded_images += 1 progress.update(task1, advance=1) except Exception: continue self.downloaded_images -= erase_duplicates(target_folder) if "next" not in data: return 0 request_url = URL + data["next"]
================================================ FILE: idt/factories.py ================================================
from idt.duckgo import DuckGoSearchEngine from idt.bing import BingSearchEngine from idt.bing_api import BingApiSearchEngine from idt.flickr_api import FlickrApiSearchEngine class SearchEngineFactory: def __init__(self, data, n_images, folder, resize_method, root_folder, size, engine, api_key): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.engine = engine self.api_key = api_key self.get_search_engine() def get_search_engine(self): if self.engine == "duckgo": return DuckGoSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size) elif self.engine == "bing": return BingSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size) elif self.engine == "bing_api": return BingApiSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size, self.api_key) elif self.engine == "flickr_api": return FlickrApiSearchEngine(self.data, self.n_images,
self.folder, self.resize_method, self.root_folder, self.size, self.api_key) else: return None
================================================ FILE: idt/flickr_api.py ================================================
import os import requests from idt.utils.download_images import download from idt.utils.remove_corrupt import erase_duplicates from rich.progress import Progress class FlickrApiSearchEngine: def __init__(self, data, n_images, folder, resize_method, root_folder, size, api_key): self.data = data self.n_images = n_images self.folder = folder self.resize_method = resize_method self.root_folder = root_folder self.size = size self.downloaded_images = 0 self.dataset_info = [] self.page = 1 self.api_key = api_key self.search() def search(self): FLICKR_LINK = 'https://www.flickr.com/services/rest/' data = self.data.replace(" ", "+") if data[0] == "+": data = data[1:] params = { "method": "flickr.photos.search", "api_key": self.api_key, "tags": data, "format": "json", "page": self.page, "nojsoncallback": 1 } with Progress() as progress: task1 = progress.add_task(f"Downloading [blue]{self.data}[/blue] class...", total=self.n_images) while self.downloaded_images < self.n_images: response = requests.get(FLICKR_LINK, params=params) response.raise_for_status() results = response.json() results = results['photos'] if results['total'] == 0: progress.update(task1, advance=self.n_images) return 0 self.page += 1 params["page"] = self.page if not os.path.exists(self.root_folder): os.mkdir(self.root_folder) target_folder = os.path.join(self.root_folder, self.folder) if not os.path.exists(target_folder): os.mkdir(target_folder) for result in results['photo']: try: if self.downloaded_images < self.n_images: link = f"https://farm{result['farm']}.staticflickr.com/{result['server']}/{result['id']}_{result['secret']}.jpg" download(link, self.size, self.root_folder, self.folder, self.resize_method)
self.downloaded_images += 1 progress.update(task1, advance=1) else: break except Exception: continue self.downloaded_images -= erase_duplicates(target_folder)
================================================ FILE: idt/resizers/__init__.py ================================================
================================================ FILE: idt/resizers/get_resizer.py ================================================
from .smartcrop import SmartCrop from .longer_side import crop_longer_side from .shorter_side import crop_shorter_side def get_resizer(img, target_size, resizer): if target_size == 0: return img if resizer == "smartcrop": sc = SmartCrop() return sc.run_crop(img, target_size) elif resizer == 'shorter_side': return crop_shorter_side(img, target_size) elif resizer == 'longer_side': return crop_longer_side(img, target_size) else: # Unknown resize method: fall back to the original image return img
================================================ FILE: idt/resizers/longer_side.py ================================================
from PIL import Image def crop_longer_side(img, size): IMG_SIZE = size, size img.thumbnail(IMG_SIZE, Image.ANTIALIAS) return img
================================================ FILE: idt/resizers/shorter_side.py ================================================
from PIL import Image def crop_shorter_side(img, size): width, height = img.size if width < size or height < size: # Image is already smaller than the target; thumbnail cannot upscale return img elif width > height: ratio = float(width) / float(height) new_width = int(size * ratio) return img.resize((new_width, size), Image.ANTIALIAS) else: ratio = float(height) / float(width) new_height = int(size * ratio) return img.resize((size, new_height), Image.ANTIALIAS)
================================================ FILE: idt/resizers/smartcrop.py ================================================
import math import sys import numpy as np from PIL import Image from PIL.ImageFilter import Kernel def saturation(image): r, g, b = image.split() r, g, b = np.array(r, float), np.array(g, float), np.array(b,
float) maximum = np.maximum(np.maximum(r, g), b) # [0; 255] minimum = np.minimum(np.minimum(r, g), b) # [0; 255] s = (maximum + minimum) / 255 # [0.0; 1.0] d = (maximum - minimum) / 255 # [0.0; 1.0] d[maximum == minimum] = 0 # if maximum == minimum: s[maximum == minimum] = 1 # -> saturation = 0 / 1 = 0 mask = s > 1 s[mask] = 2 - d[mask] return d / s # [0.0; 1.0] def thirds(x): """gets value in the range of [0, 1] where 0 is the center of the pictures returns weight of rule of thirds [0, 1]""" x = ((x + 2 / 3) % 2 * 0.5 - 0.5) * 16 return max(1 - x * x, 0) class SmartCrop(object): DEFAULT_SKIN_COLOR = [0.78, 0.57, 0.44] def __init__( self, detail_weight=0.2, edge_radius=0.4, edge_weight=-20, outside_importance=-0.5, rule_of_thirds=True, saturation_bias=0.2, saturation_brightness_max=0.9, saturation_brightness_min=0.05, saturation_threshold=0.4, saturation_weight=0.3, score_down_sample=8, skin_bias=0.01, skin_brightness_max=1, skin_brightness_min=0.2, skin_color=None, skin_threshold=0.8, skin_weight=1.8 ): self.detail_weight = detail_weight self.edge_radius = edge_radius self.edge_weight = edge_weight self.outside_importance = outside_importance self.rule_of_thirds = rule_of_thirds self.saturation_bias = saturation_bias self.saturation_brightness_max = saturation_brightness_max self.saturation_brightness_min = saturation_brightness_min self.saturation_threshold = saturation_threshold self.saturation_weight = saturation_weight self.score_down_sample = score_down_sample self.skin_bias = skin_bias self.skin_brightness_max = skin_brightness_max self.skin_brightness_min = skin_brightness_min self.skin_color = skin_color or self.DEFAULT_SKIN_COLOR self.skin_threshold = skin_threshold self.skin_weight = skin_weight def analyse( self, image, crop_width, crop_height, max_scale=1, min_scale=0.9, scale_step=0.1, step=8 ): """ Analyze image and return some suggestions of crops (coordinates). This implementation / algorithm is really slow for large images. 
Use `crop()` which is pre-scaling the image before analyzing it. """ cie_image = image.convert('L', (0.2126, 0.7152, 0.0722, 0)) cie_array = np.array(cie_image) # [0; 255] # R=skin G=edge B=saturation edge_image = self.detect_edge(cie_image) skin_image = self.detect_skin(cie_array, image) saturation_image = self.detect_saturation(cie_array, image) analyse_image = Image.merge('RGB', [skin_image, edge_image, saturation_image]) del edge_image del skin_image del saturation_image score_image = analyse_image.copy() score_image.thumbnail( ( int(math.ceil(image.size[0] / self.score_down_sample)), int(math.ceil(image.size[1] / self.score_down_sample)) ), Image.ANTIALIAS) top_crop = None top_score = -sys.maxsize crops = self.crops( image, crop_width, crop_height, max_scale=max_scale, min_scale=min_scale, scale_step=scale_step, step=step) for crop in crops: crop['score'] = self.score(score_image, crop) if crop['score']['total'] > top_score: top_crop = crop top_score = crop['score']['total'] return {'analyse_image': analyse_image, 'crops': crops, 'top_crop': top_crop} def crop( self, image, width, height, prescale=True, max_scale=1, min_scale=0.9, scale_step=0.1, step=8 ): scale = min(image.size[0] / width, image.size[1] / height) crop_width = int(math.floor(width * scale)) crop_height = int(math.floor(height * scale)) min_scale = min(max_scale, max(1 / scale, min_scale)) prescale_size = 1 if prescale: prescale_size = 1 / scale / min_scale if prescale_size < 1: image = image.copy() image.thumbnail( (int(image.size[0] * prescale_size), int(image.size[1] * prescale_size)), Image.ANTIALIAS) crop_width = int(math.floor(crop_width * prescale_size)) crop_height = int(math.floor(crop_height * prescale_size)) else: prescale_size = 1 result = self.analyse( image, crop_width=crop_width, crop_height=crop_height, min_scale=min_scale, max_scale=max_scale, scale_step=scale_step, step=step) for i in range(len(result['crops'])): crop = result['crops'][i] crop['x'] = int(math.floor(crop['x'] / 
prescale_size)) crop['y'] = int(math.floor(crop['y'] / prescale_size)) crop['width'] = int(math.floor(crop['width'] / prescale_size)) crop['height'] = int(math.floor(crop['height'] / prescale_size)) result['crops'][i] = crop return result def run_crop(self, image, target_size): if image.mode != 'RGB' and image.mode != 'RGBA': new_image = Image.new('RGB', image.size) new_image.paste(image) image = new_image result = self.crop(image, width=100, height=int(target_size / target_size * 100)) box = ( result['top_crop']['x'], result['top_crop']['y'], result['top_crop']['width'] + result['top_crop']['x'], result['top_crop']['height'] + result['top_crop']['y'] ) cropped_image = image.crop(box) cropped_image.thumbnail((target_size,target_size), Image.ANTIALIAS) return cropped_image def crops( self, image, crop_width, crop_height, max_scale=1, min_scale=0.9, scale_step=0.1, step=8 ): image_width, image_height = image.size crops = [] for scale in ( i / 100 for i in range( int(max_scale * 100), int((min_scale - scale_step) * 100), -int(scale_step * 100)) ): for y in range(0, image_height, step): if not (y + crop_height * scale <= image_height): break for x in range(0, image_width, step): if not (x + crop_width * scale <= image_width): break crops.append({ 'x': x, 'y': y, 'width': crop_width * scale, 'height': crop_height * scale, }) if not crops: raise ValueError(locals()) return crops def detect_edge(self, cie_image): return cie_image.filter(Kernel((3, 3), (0, -1, 0, -1, 4, -1, 0, -1, 0), 1, 1)) def detect_saturation(self, cie_array, source_image): threshold = self.saturation_threshold saturation_data = saturation(source_image) mask = ( (saturation_data > threshold) & (cie_array >= self.saturation_brightness_min * 255) & (cie_array <= self.saturation_brightness_max * 255)) saturation_data[~mask] = 0 saturation_data[mask] = (saturation_data[mask] - threshold) * (255 / (1 - threshold)) return Image.fromarray(saturation_data.astype('uint8')) def detect_skin(self, cie_array, 
source_image): r, g, b = source_image.split() r, g, b = np.array(r, float), np.array(g, float), np.array(b, float) rd = np.ones_like(r) * -self.skin_color[0] gd = np.ones_like(g) * -self.skin_color[1] bd = np.ones_like(b) * -self.skin_color[2] mag = np.sqrt(r * r + g * g + b * b) mask = ~(abs(mag) < 1e-6) rd[mask] = r[mask] / mag[mask] - self.skin_color[0] gd[mask] = g[mask] / mag[mask] - self.skin_color[1] bd[mask] = b[mask] / mag[mask] - self.skin_color[2] skin = 1 - np.sqrt(rd * rd + gd * gd + bd * bd) mask = ( (skin > self.skin_threshold) & (cie_array >= self.skin_brightness_min * 255) & (cie_array <= self.skin_brightness_max * 255)) skin_data = (skin - self.skin_threshold) * (255 / (1 - self.skin_threshold)) skin_data[~mask] = 0 return Image.fromarray(skin_data.astype('uint8')) def importance(self, crop, x, y): if ( crop['x'] > x or x >= crop['x'] + crop['width'] or crop['y'] > y or y >= crop['y'] + crop['height'] ): return self.outside_importance x = (x - crop['x']) / crop['width'] y = (y - crop['y']) / crop['height'] px, py = abs(0.5 - x) * 2, abs(0.5 - y) * 2 # distance from edge dx = max(px - 1 + self.edge_radius, 0) dy = max(py - 1 + self.edge_radius, 0) d = (dx * dx + dy * dy) * self.edge_weight s = 1.41 - math.sqrt(px * px + py * py) if self.rule_of_thirds: s += (max(0, s + d + 0.5) * 1.2) * (thirds(px) + thirds(py)) return s + d def score(self, target_image, crop): score = { 'detail': 0, 'saturation': 0, 'skin': 0, 'total': 0, } target_data = target_image.getdata() target_width, target_height = target_image.size down_sample = self.score_down_sample inv_down_sample = 1 / down_sample target_width_down_sample = target_width * down_sample target_height_down_sample = target_height * down_sample for y in range(0, target_height_down_sample, down_sample): for x in range(0, target_width_down_sample, down_sample): p = int( math.floor(y * inv_down_sample) * target_width + math.floor(x * inv_down_sample) ) importance = self.importance(crop, x, y) detail = 
target_data[p][1] / 255 score['skin'] += ( target_data[p][0] / 255 * (detail + self.skin_bias) * importance ) score['detail'] += detail * importance score['saturation'] += ( target_data[p][2] / 255 * (detail + self.saturation_bias) * importance ) score['total'] = ( score['detail'] * self.detail_weight + score['skin'] * self.skin_weight + score['saturation'] * self.saturation_weight ) / (crop['width'] * crop['height']) return score
================================================ FILE: idt/utils/__init__.py ================================================
================================================ FILE: idt/utils/create_dataset_csv.py ================================================
import os import re import yaml def create_dataset_csv(path): number_of_dirs = 0 number_of_files = 0 csv_dict = {'DATASET': path, 'NUMBER_OF_CLASSES': 0, 'TOTAL_NUMBER_OF_FILES': 0} for base, dirs, files in os.walk(path): for directories in dirs: number_of_dirs += 1 dir_path = os.path.join(path, directories) count = len(os.listdir(dir_path)) csv_dict[str(directories)] = count for Files in files: number_of_files += 1 csv_dict['NUMBER_OF_CLASSES'] = number_of_dirs csv_dict['TOTAL_NUMBER_OF_FILES'] = number_of_files with open('{path}/{path}.csv'.format(path=path), 'w') as f: for key in csv_dict.keys(): f.write("%s,%s\n" % (key, csv_dict[key])) #TODO implement natural sort to classes def atoi(text): return int(text) if text.isdigit() else text def natural_keys(text): return [ atoi(c) for c in re.split(r'(\d+)', text) ] def generate_class_info(info, root_folder, folder): with open(f"./{root_folder}/{folder}.yaml", "w") as f: f.write(yaml.dump(info))
================================================ FILE: idt/utils/download_images.py ================================================
import uuid import requests import os from PIL import Image from io import BytesIO from idt.resizers.get_resizer import get_resizer def download(link,
size, root_folder, class_name, resize_method): response = requests.get(link, timeout=3.000) file = BytesIO(response.content) raw_img = Image.open(file) # resize or crop image according to provided resize method img = get_resizer(raw_img, size, resize_method) # Split last part of url to get image name and its extension img_name = link.rsplit('/', 1)[1] img_type = img_name.split('.')[-1] if img_type.lower() != "jpg": raise Exception("Cannot download this type of file") else: # Use a unique id so files with the same name do not collide img_id = uuid.uuid1() img.save(f"./{root_folder}/{class_name}/{class_name}-{img_id.hex}.jpg", "JPEG")
================================================ FILE: idt/utils/remove_corrupt.py ================================================
import os, hashlib def remove_corrupt(path): visited_dir = 0 print("Removing corrupt files") for base, dirs, files in os.walk(path): for directories in dirs: visited_dir += 1 for Files in files: file = os.path.join(base, Files) if os.stat(file).st_size == 0: os.remove(file) def erase_duplicates(folder): duplicates = [] hash_keys = dict() file_list = os.listdir(folder) for index, file_name in enumerate(file_list): if os.path.isfile(os.path.join(folder, file_name)): with open(os.path.join(folder, file_name), 'rb') as f: filehash = hashlib.md5(f.read()).hexdigest() if filehash not in hash_keys: hash_keys[filehash] = index else: duplicates.append((index, hash_keys[filehash])) for index in duplicates: os.remove(os.path.join(folder, file_list[index[0]])) return len(duplicates)
================================================ FILE: idt/utils/split_dataset.py ================================================
import os import random from shutil import copyfile def split_dataset(img_source_dir, train_size): train_size = train_size / 100 print("Creating a dataset split into train/validation
folders...") if not os.path.exists(img_source_dir): raise OSError(f"The source folder {img_source_dir} does not exist. Check that the dataset folder name is correct.") # Create folders if not os.path.exists('split-dataset'): os.makedirs('split-dataset') if not os.path.exists('split-dataset/train'): os.makedirs('split-dataset/train') if not os.path.exists('split-dataset/validation'): os.makedirs('split-dataset/validation') # Get the subdirectories in the main image folder subdirs = [subdir for subdir in os.listdir(img_source_dir) if os.path.isdir(os.path.join(img_source_dir, subdir))] for subdir in subdirs: subdir_fullpath = os.path.join(img_source_dir, subdir) if len(os.listdir(subdir_fullpath)) == 0: print(subdir_fullpath + ' is empty, skipping') continue train_subdir = os.path.join('split-dataset/train', subdir) validation_subdir = os.path.join('split-dataset/validation', subdir) # Create subdirectories in train and validation folders if not os.path.exists(train_subdir): os.makedirs(train_subdir) if not os.path.exists(validation_subdir): os.makedirs(validation_subdir) train_counter = 0 validation_counter = 0 # Randomly assign an image to train or validation folder for filename in os.listdir(subdir_fullpath): if filename.endswith(".jpg") or filename.endswith(".png"): fileparts = filename.split('.') if random.uniform(0, 1) <= train_size: copyfile(os.path.join(subdir_fullpath, filename), os.path.join(train_subdir, str(train_counter) + '.' + fileparts[1])) train_counter += 1 else: copyfile(os.path.join(subdir_fullpath, filename), os.path.join(validation_subdir, str(validation_counter) + '.'
+ fileparts[1])) validation_counter += 1
================================================ FILE: requirements.txt ================================================
click==7.1.2 PyYAML==5.3.1 requests==2.22.0 Pillow==7.0.0 rich==6.1.2 numpy==1.19.1
================================================ FILE: setup.py ================================================
from setuptools import setup, find_packages from os import path import pathlib # The directory containing this file HERE = pathlib.Path(__file__).parent # The text of the README file README = (HERE / 'README.md').read_text() # Automatically capture required modules from requirements.txt for install_requires and configure dependency links with open(path.join(HERE, 'requirements.txt'), encoding='utf-8') as f: all_reqs = f.read().split('\n') install_requires = [x.strip() for x in all_reqs if 'git+' not in x and not x.startswith('#') and not x.startswith('-')] dependency_links = [x.strip().replace('git+', '') for x in all_reqs if 'git+' in x] setup( # The package uses f-strings throughout, so Python 3.6+ is required name='idt', description='A cli tool that quickly generates ready-to-use image datasets', version='0.0.6', packages=find_packages(), install_requires=install_requires, python_requires='>=3.6', entry_points=''' [console_scripts] idt=idt.__main__:main ''', author='Deliton Junior', keywords='idt, image datasets, generators, dataset generator, image scraper', long_description=README, long_description_content_type='text/markdown', license='MIT', url='https://github.com/deliton/idt', download_url='https://github.com/deliton/idt/archive/master.zip', dependency_links=dependency_links, author_email='deliton.m@hotmail.com', classifiers=['License :: OSI Approved :: MIT License', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.7'], )
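The engines, resizers, and utilities above are all driven by the dataset.yaml file that `idt init` writes and `idt build` reads. As a reference, the sketch below builds and dumps a configuration using the keys consumed in idt/__main__.py and idt/factories.py; the dataset name, class, and keyword values are illustrative examples, not part of the repository.

```python
import yaml

# Sketch of the dataset.yaml structure consumed by `idt build`.
# Keys mirror those read in idt/__main__.py; the values are examples.
document_dict = {
    "DATASET_NAME": "fruits",          # root folder the engines download into
    "SAMPLES_PER_SEARCH": 50,          # images downloaded per keyword
    "IMAGE_SIZE": 512,                 # target size passed to the resizers
    "RESIZE_METHOD": "longer_side",    # smartcrop | shorter_side | longer_side
    "ENGINE": "duckgo",                # duckgo | bing | bing_api | flickr_api
    "API_KEY": "NONE",                 # only used by bing_api and flickr_api
    "CLASSES": [
        {
            "CLASS_NAME": "Pineapple",
            "SEARCH_KEYWORDS": "pineapple, pineapple fruit, ananas",
        }
    ],
}

config_yaml = yaml.dump(document_dict)
print(config_yaml)
```

With a file like this in place, `idt build` downloads each keyword's results into fruits/Pineapple/, then deduplicates the folder and writes the dataset CSV summary.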