[
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n**Describe the bug**\nA clear and concise description of what the bug is.\n\n**To Reproduce**\nSteps to reproduce the behavior:\n1. Go to '...'\n2. Click on '....'\n3. Scroll down to '....'\n4. See error\n\n**Expected behavior**\nA clear and concise description of what you expected to happen.\n\n**Screenshots**\nIf applicable, add screenshots to help explain your problem.\n\n**Desktop (please complete the following information):**\n - OS: [e.g. iOS]\n - Browser [e.g. chrome, safari]\n - Version [e.g. 22]\n\n**Smartphone (please complete the following information):**\n - Device: [e.g. iPhone6]\n - OS: [e.g. iOS8.1]\n - Browser [e.g. stock browser, safari]\n - Version [e.g. 22]\n\n**Additional context**\nAdd any other context about the problem here.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n**Is your feature request related to a problem? Please describe.**\nA clear and concise description of what the problem is. Ex. I'm always frustrated when [...]\n\n**Describe the solution you'd like**\nA clear and concise description of what you want to happen.\n\n**Describe alternatives you've considered**\nA clear and concise description of any alternative solutions or features you've considered.\n\n**Additional context**\nAdd any other context or screenshots about the feature request here.\n"
  },
  {
    "path": ".gitignore",
    "content": "!**/__pycache__/\n__pycache__\nidt.egg-info\nbuild/\ndist/\n.README.md.kate-swp\n\n\n/.vscode"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "content": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, we as\ncontributors and maintainers pledge to making participation in our project and\nour community a harassment-free experience for everyone, regardless of age, body\nsize, disability, ethnicity, sex characteristics, gender identity and expression,\nlevel of experience, education, socio-economic status, nationality, personal\nappearance, race, religion, or sexual identity and orientation.\n\n## Our Standards\n\nExamples of behavior that contributes to creating a positive environment\ninclude:\n\n* Using welcoming and inclusive language\n* Being respectful of differing viewpoints and experiences\n* Gracefully accepting constructive criticism\n* Focusing on what is best for the community\n* Showing empathy towards other community members\n\nExamples of unacceptable behavior by participants include:\n\n* The use of sexualized language or imagery and unwelcome sexual attention or\n advances\n* Trolling, insulting/derogatory comments, and personal or political attacks\n* Public or private harassment\n* Publishing others' private information, such as a physical or electronic\n address, without explicit permission\n* Other conduct which could reasonably be considered inappropriate in a\n professional setting\n\n## Our Responsibilities\n\nProject maintainers are responsible for clarifying the standards of acceptable\nbehavior and are expected to take appropriate and fair corrective action in\nresponse to any instances of unacceptable behavior.\n\nProject maintainers have the right and responsibility to remove, edit, or\nreject comments, commits, code, wiki edits, issues, and other contributions\nthat are not aligned to this Code of Conduct, or to ban temporarily or\npermanently any contributor for other behaviors that they deem inappropriate,\nthreatening, offensive, or harmful.\n\n## Scope\n\nThis Code of Conduct applies both within 
project spaces and in public spaces\nwhen an individual is representing the project or its community. Examples of\nrepresenting a project or community include using an official project e-mail\naddress, posting via an official social media account, or acting as an appointed\nrepresentative at an online or offline event. Representation of a project may be\nfurther defined and clarified by project maintainers.\n\n## Enforcement\n\nInstances of abusive, harassing, or otherwise unacceptable behavior may be\nreported by contacting the project team at deliton.m@hotmail.com. All\ncomplaints will be reviewed and investigated and will result in a response that\nis deemed necessary and appropriate to the circumstances. The project team is\nobligated to maintain confidentiality with regard to the reporter of an incident.\nFurther details of specific enforcement policies may be posted separately.\n\nProject maintainers who do not follow or enforce the Code of Conduct in good\nfaith may face temporary or permanent repercussions as determined by other\nmembers of the project's leadership.\n\n## Attribution\n\nThis Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,\navailable at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html\n\n[homepage]: https://www.contributor-covenant.org\n\nFor answers to common questions about this code of conduct, see\nhttps://www.contributor-covenant.org/faq\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "![idt-contrib](https://user-images.githubusercontent.com/47995046/96387698-74e85e80-117a-11eb-8b35-d65b336fd1df.png)\n\n🎉 Thanks for taking the time to contribute to this project! 🎉\n\n## Code of Conduct\nThis project and everyone participating in it is governed by the [IDT Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to deliton.m@hotmail.com.\n\n## How Can I Contribute?\n\n### Reporting Bugs\n\nThis section guides you through submitting a bug report for IDT. Following these guidelines helps maintainers and the community understand your report :pencil:, reproduce the behavior :computer: :computer:, and find related reports :mag_right:.\n\nWhen you are creating a bug report, please include as many details as possible.\n\n> **Note:** If you find a **Closed** issue that seems like it is the same thing that you're experiencing, open a new issue and include a link to the original issue in the body of your new one.\n\n#### How Do I Submit A (Good) Bug Report?\n\nBugs are tracked as [GitHub issues](https://guides.github.com/features/issues/). Create an issue in this repository and provide the following information by filling in bug report template\n\nExplain the problem and include additional details to help maintainers reproduce the problem:\n\n* **Use a clear and descriptive title** for the issue to identify the problem.\n* **Describe the exact steps which reproduce the problem** in as many details as possible. For example, start by explaining how you started IDT, e.g. which command exactly you used in the terminal, or how you started IDT otherwise. When listing steps, **don't just say what you did, but explain how you did it**. For example, if you moved the cursor to the end of a line, explain if you used the mouse, or a keyboard shortcut or an IDT command, and if so which one?\n* **Provide specific examples to demonstrate the steps**. 
Include links to files or GitHub projects, or copy/pasteable snippets, which you use in those examples. If you're providing snippets in the issue, use [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines).\n* **Describe the behavior you observed after following the steps** and point out what exactly is the problem with that behavior.\n* **Explain which behavior you expected to see instead and why.**\n* **Include screenshots and animated GIFs** which show you following the described steps and clearly demonstrate the problem. If you use the keyboard while following the steps, **record the GIF with the Keybinding Resolver shown**. You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux.\n* **If you're reporting that IDT crashed**, include a crash report with a stack trace from the operating system. On macOS, the crash report will be available in `Console.app` under \"Diagnostic and usage information\" > \"User diagnostic reports\". Include the crash report in the issue in a [code block](https://help.github.com/articles/markdown-basics/#multiple-lines), a [file attachment](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/), or put it in a [gist](https://gist.github.com/) and provide link to that gist.\n* **If the problem is related to performance or memory**, include a CPU profile capture with your report.\n* **If the problem wasn't triggered by a specific action**, describe what you were doing before the problem happened and share more information using the guidelines below.\n\n### Suggesting Enhancements\n\nThis section guides you through submitting an enhancement suggestion for IDT, including completely new features and minor improvements to existing functionality. 
Following these guidelines helps maintainers and the community understand your suggestion :pencil: and find related suggestions :mag_right:.\n\n#### How Do I Submit A (Good) Enhancement Suggestion?\n\nEnhancement suggestions are tracked as [GitHub issues](https://guides.github.com/features/issues/). Create an issue on that repository and provide the following information:\n\n* **Use a clear and descriptive title** for the issue to identify the suggestion.\n* **Provide a step-by-step description of the suggested enhancement** in as many details as possible.\n* **Provide specific examples to demonstrate the steps**. Include copy/pasteable snippets which you use in those examples, as [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines).\n* **Describe the current behavior** and **explain which behavior you expected to see instead** and why.\n* **Include screenshots and animated GIFs** which help you demonstrate the steps or point out the part of IDT which the suggestion is related to. You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux.\n* **Explain why this enhancement would be useful** to most IDT users and isn't something that can or should be implemented as a community package.\n* **List some other text editors or applications where this enhancement exists.**\n* **Specify which version of IDT you're using.** \n* **Specify the name and version of the OS you're using.**\n\n### Your First Code Contribution\n\nUnsure where to begin contributing to IDT? 
You can start by looking through `help-wanted` issues:\n\n* Help wanted issues - issues related to program problems, feature suggestion and implementation of wanted features.\n\n\n### Pull Requests\n\nThe process described here has several goals:\n\n- Maintain IDT's quality\n- Fix problems that are important to users\n- Engage the community in working toward the best possible IDT\n- Enable a sustainable system for IDT's maintainers to review contributions\n\nPlease follow these steps to have your contribution considered by the maintainers:\n\n1. Follow all instructions in [the template](PULL_REQUEST_TEMPLATE.md)\n2. Follow the [styleguides](#styleguides)\n3. After you submit your pull request, verify that all [status checks](https://help.github.com/articles/about-status-checks/) are passing <details><summary>What if the status checks are failing?</summary>If a status check is failing, and you believe that the failure is unrelated to your change, please leave a comment on the pull request explaining why you believe the failure is unrelated. A maintainer will re-run the status check for you. If we conclude that the failure was a false positive, then we will open an issue to track that problem with our status check suite.</details>\n\nIDT is a volunteer effort. We encourage you to pitch in and join the team!\n\nThanks! <3\n\nIDT Team\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2020 Deliton Junior\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# IDT - Image Dataset Tool\n\n## Version 0.0.6 beta\n\n![idt-logo](https://user-images.githubusercontent.com/47995046/96403078-d675f080-11ad-11eb-8435-c8ce69a6c871.png)\n\n\n## Description\n\nThe image dataset tool (IDT) is a CLI app developed to make it easier and faster to create image datasets to be used for deep learning. The tool achieves this by scraping images from several search engines such as duckgo, bing and deviantart. IDT also optimizes the image dataset, although this feature is optional, the user can downscale and compress the images for optimal file size and dimensions. A sample dataset created using **idt** that contains  a total amount of 23.688 image files weights only 559,2 megabytes.\n\n## NEW UPDATE!\nI am proud to announce our newest version! 🎉🎉\n\n**What changed**\n* Added auto duplicate images remover\n* Added longer side resize method. With this option, the image is resized to its longer side.\n* Added shorter side resize method. With this option, the image is resized to its shorter side.\n* Added Smart Crop. This method tries to crop and resize exactly the main subject of the image. The algorithm is based on SmartCrop.js and SmartCrop.py.\n* Removed verbose mode. This was used in earlier stages of development but now don't add value to the experience.\n* The official documentation is almost ready. A link will be available soon\n\n## Installing\n\nYou can install it via pip or cloning this repository.\n\n```console\nuser@admin:~$ pip3 install idt\n\n```\n\n**OR**\n\n\n```console\nuser@admin:~$ git clone https://github.com/deliton/idt.git && cd idt\nuser@admin:~/idt$ sudo python3 setup.py install\n\n```\n\n\n## Getting Started\n\n![idt-gif](https://user-images.githubusercontent.com/47995046/96406740-6d46ab00-11b6-11eb-980b-a40968ed38b4.gif)\n\nThe quickest way to get started with IDT is running the simple \"run\" command. 
Just write in your favorite console something like:\n\n```console\nuser@admin:~$ idt run -i apples \n```\n\nThis will quickly download 50 images of apples. By default it uses the duckgo search engine to do so. \nThe run command accepts the following options:\n\n| Option | Description |\n| ----------- | ----------- |\n| **-i** or **--input** | the keyword to find the desired images. | \n| **-s** or **--size** | the amount of images to be downloaded. |\n| **-e** or **--engine** | the desired search engine (options: duckgo, bing, bing_api and flickr_api) |\n| **--resize-method** | choose a resize method of images. (options: longer_side, shorter_side and smartcrop) |\n| **-is** or **--image-size** | option to set the desired image size ratio. default=512 |\n| **-ak** or **--api-key** | If you are using a search engine that requires an API key, this option is required |\n\n\n## Usage\n\nIDT requires a config file that tells it how your dataset should be organized. You can create it using the following command:\n\n```console\nuser@admin:~$ idt init\n```\n\nThis command will trigger the config file creator and will ask for the desired dataset parameters. In this example let's create a dataset containing images of your favorite cars. The first parameters this command will ask is what name should your dataset have? In this example, let's name our dataset \"My favorite cars\"\n\n```console\nInsert a name  for your dataset: : My favorite cars\n```\n\nThen the tool will ask how many samples per search are required to mount your dataset. In order to build a good dataset for deep learning, many images are required and since we're using a search engine to scrape images, many searches with different keywords are required to mount a good sized dataset. This value will correspond to how many images should be downloaded at every search. In this example we need a dataset with 250 images in each class, and we'll use 5 keywords to mount each class. 
So if we type the number 50 here, IDT will download 50 images of every keyword provided. If we provide 5 keywords we should get the required 250 images.\n\n```console\nHow many samples per search will be necessary?  : 50\n```\n\nThe tool will now ask for and image size ratio. Since using large images to train neural networks is not a viable thing, we can optionally choose one of the following image size ratios and scale down our images to that size. In this example, we'll go for 512x512, although 256x256 would be an even better option for this task.\n\n```console\nChoose images resolution:\n\n[1] 512 pixels / 512 pixels (recommended)\n[2] 1024 pixels / 1024 pixels\n[3] 256 pixels / 256 pixels\n[4] 128 pixels / 128 pixels\n[5] Keep original image size\n\nps: note that the aspect ratio of the image will not be changed, \nso possibly the images received will have slightly different size\n\nWhat is the desired image size ratio: 1\n```\n\nAnd then choose \"longer_side\" for resize method.\n\n```console\n[1] Resize image based on longer side\n[2] Resize image based on shorter side\n[3] Smartcrop\n\nps: note that the aspect ratio of the image will not be changed,\nso possibly the images received will have slightly different size\n\nDesired Image resize method: : longer_side\n\n```\n\nNow you must choose how many classes/folders your dataset should have. In this example, this part can be very personal, but my favorite cars are: Chevrolet Impala, Range Rover Evoque, Tesla Model X and (why not) AvtoVAZ Lada. So in this case we have 4 classes, one for each favorite.\n\n```console\nHow many image classes are required? : 4\n```\n\nAfterwards, you'll be asked to choose between one of the search engines available. In this example, we'll use DuckGO to search images for us.\n\n```console\nChoose a search engine:\n\n[1] Duck GO (recommended)\n[2] Bing\n[3] Bing API \n[4] Flickr API\n\nSelect option:: 1\n```\n\nNow we have to do some repetitive form filling. 
We must name each class and all the keywords that will be used to find the images. Note that this part can be later changed by your own code, to generate more classes and keywords.\n\n```console\nClass 1 name: : Chevrolet Impala\n```\n\nAfter typing the first class name, we'll be asked to provide all the keywords to find the dataset. Remember that we told the program to download 50 images of each keyword so we must provide 5 keywords in this case to get all 250 images. Each keyword MUST be separated by commas(,)\n\n```console\nIn order to achieve better results, choose several keywords that will\nbe provided to the search engine to find your class in different settings.\n\nExample: \n\nClass Name: Pineapple\nkeywords: pineapple, pineapple fruit, ananas, abacaxi, pineapple drawing\n\nType in all keywords used to find your desired class, separated by commas: Chevrolet Impala 1967 car photos,\nchevrolet impala on the road, chevrolet impala vintage car, chevrolet impala convertible 1961, chevrolet impala 1964 lowrider\n\n```\n\nThen repeat the process of filling class name and its keywords until you fill all the 4 classes required.\n\n```console\nDataset YAML file has been created successfully. Now run idt build to mount your dataset!\n```\n\nYour dataset configuration file has been created. Now just rust the following command and see the magic happen:\n\n```console\nuser@admin:~$ idt build\n```\n\nAnd wait while the dataset is being mounted:\n\n```console\nCreating Chevrolet Impala class\nDownloading Chevrolet Impala 1967 car photos  [#########################-----------]   72%  00:00:12\n\n```\n\nAt the end, all your images will be available in a folder with the dataset name. 
Also, a csv file with the dataset stats are also included in the dataset's root folder.\n\n![idt-results](https://user-images.githubusercontent.com/47995046/93012667-808fa680-f578-11ea-82fc-7ebcb8ce3c41.png)\n\n\n## Split image dataset for Deep Learning\n\nSince deep learning often requires you to split your dataset into a subset of training/validation folders, this project can also do this for you! Just run:\n\n```console\nuser@admin:~$ idt split\n```\n\nNow you must choose a train/valid proportion. In this example I've chosen that 70% of the images will be reserved for training, while the rest will be reserved for validation: \n\n```console\nChoose the desired proportion of images of each class to be distributed in train/valid folders.\nWhat percentage of images should be distributed towards training? \n(0-100): 70\n\n70 percent of the images will be moved to a train folder, while 30 percent of the remaining images\nwill be stored in a validation folder.\nIs that ok? [Y/n]: y\n```\n\nAnd that's it! The dataset-split should now be found with the corresponding train/valid subdirectories.\n\n## Issues\n\nThis project is being developed in my spare time and it still needs a lot of effort to be free of bugs. Pull requests and contributors are really appreciated, feel free to contribute in any way you can!\n\n"
  },
  {
    "path": "idt/__init__.py",
    "content": ""
  },
  {
    "path": "idt/__main__.py",
    "content": "import os\nimport click\nimport yaml\nimport rich\nfrom rich.console import Console\n\nfrom idt.factories import SearchEngineFactory\nfrom idt.utils.remove_corrupt import remove_corrupt\nfrom idt.utils.create_dataset_csv import create_dataset_csv\nfrom idt.utils.split_dataset import split_dataset\n\nBANNER = \"\"\"\n[bold blue]=====================================================================\n\n\n                             \n                8888888 8888888b. 88888888888 \n                  888   888  \"Y88b    888     \n                  888   888    888    888     \n                  888   888    888    888     \n                  888   888    888    888     \n                  888   888    888    888     \n                  888   888  .d88P    888     \n                8888888 8888888P\"     888  \n                                           \n          \t\t[italic]IMAGE DATASET TOOL V0.6[/italic]                                                                                    \n                                                                                                                                 \n=====================================================================[/bold blue]                                                                                                                                \n\t\t\"\"\"\n\n#@click.command()\n@click.group()\ndef main():\n    \n    \"\"\"\n    Image Dataset Builder CLI to create amazing datasets\n    \"\"\"\n    pass\n\n@main.command()\ndef version():\n\t\"\"\"\n\tShows what version idt is currently on\n\t\"\"\"\n\tclick.clear()\n\trich.print(\"[bold magenta]Image Dataset Tool (IDT)[/bold magenta] version 0.0.6 beta\")\n\n@main.command()\ndef authors():\n\t\"\"\"\n\tShows who are the creators of IDT\n\t\"\"\"\n\tclick.clear()\n\trich.print(\"[bold]IDT[/bold] was initially made by [bold magenta]Deliton Junior[/bold magenta] and [bold red]Misael Kelviny[/bold 
red]\")\n\n@main.command()\n@click.option('--input', '-i','--i', help=\"The name of the thing you want to download\")\n@click.option('--size', '-s','--s', default=50, help=\"The number of images you want to download.\")\n@click.option('--engine', '-e','--e', default=\"duckgo\", help=\"What search engine will be used to find your images\")\n@click.option('--resize-method', '-rs','--rs', default=\"longer_side\", help=\"Resize method adopted. Options: shorter_side, longer_side and smartcrop\")\n@click.option('--imagesize', '-is','--is', default=512, help=\"What image size ratio should be applied to your dataset\")\n@click.option('--api-key', '-ak','--ak', default=None, help=\"Provide an api-key for the engines that require one\")\ndef run(input, size, engine, resize_method, imagesize, api_key):\n\t\"\"\"\n\tThis command executes a single search and downloads it\n\t\"\"\"\n\tengine_list = ['duckgo', 'bing', 'bing_api', 'flickr_api']\n\tclick.clear()\n\n\tif input and engine in engine_list:\n\t\tfactory = SearchEngineFactory(input,size,input,resize_method,\"dataset\",imagesize, engine, api_key)\n\t\t# Remove corrupt files\n\t\tremove_corrupt(\"dataset\")\n\n\telse:\n\t\trich.print(\"Please provide a valid name\")\n\n@main.command()\n@click.option('--default', '-d','--d', is_flag=True,default=False, help=\"Generate a default config file\")\ndef init(default):\n\t\"\"\"\n\tThis command initialyzes idt and creates a dataset config file\n\t\"\"\"\n\tconsole = Console()\n\tconsole.clear()\n\n\tif default:\n\t\tdocument_dict = {\n\t\t\t\"DATASET_NAME\": \"dataset\",\n\t\t\t\"API_KEY\": \"\",\n\t\t\t\"SAMPLES_PER_SEARCH\": 50,\n\t\t\t\"IMAGE_SIZE\": 512,\n\t\t\t\"ENGINE\": \"duckgo\",\n\t\t\t\"RESIZE_METHOD\": \"longer_side\",\n\t\t\t\"CLASSES\": [{\"CLASS_NAME\": \"Test\", \"SEARCH_KEYWORDS\": \"images of cats\"}]}\n\n\t\tif not os.path.exists(\"dataset.yaml\"):\n\t\t\tconsole.print(\"[bold]Creating a dataset configuration file...[/bold]\")\n\t\t\t\n\t\t\tf = 
open(\"dataset.yaml\", \"w\")\n\t\t\tf.write(yaml.dump(document_dict))\n\t\t\tif f:\n\t\t\t\tconsole.clear()\n\t\t\t\tconsole.print(\"Dataset YAML file has been created sucessfully. Now run [bold blue]idt build[/bold blue] to mount your dataset!\")\n\t\t\t\texit(0)\n\t\t\t\n\t\t\n\t\telse:\n\t\t\tconsole.print(\"[red]A dataset.yaml is already created. To use another one, delete the current dataset.yaml file[/red]\")\n\t\t\texit(0)\n\n\tconsole.print(BANNER)\n\tdataset_name = click.prompt(\"Insert a name to your dataset: \")\n\n\tconsole.clear()\n\tsamples = click.prompt(\"How many samples per seach will be necessary?  \",type=int)\n\n\tconsole.clear()\n\tconsole.print(\"[bold]Choose image resolution[/bold]\", justify=\"center\")\n\tconsole.print(\"\"\"\n\n[1] 512 pixels / 512 pixels [bold blue](recommended)[/bold blue]\n[2] 1024 pixels / 1024 pixels\n[3] 256 pixels / 256 pixels\n[4] 128 pixels / 128 pixels\n[5] Keep original image size\n\n[italic]ps: note that the aspect ratio of the image will [bold]not[/bold] be changed, so possibly the images received will have slightly different size[/italic]\n\t\t\n\t\t\"\"\")\n\n\n\timage_size_ratio = click.prompt(\"What is the desired image size ratio\", type=int)\n\twhile image_size_ratio < 1 or image_size_ratio > 5:\n\t\tconsole.print(\"[italic red]Invalid option, please choose between 1 and 5. 
[/italic red]\")\n\t\timage_size_ratio= click.prompt(\"\\nOption: \",type=int)\n\n\tif image_size_ratio == 1:\n\t\timage_size_ratio= 512\n\telif image_size_ratio == 2:\n\t\timage_size_ratio = 1024\n\telif image_size_ratio == 3:\n\t\timage_size_ratio = 256\n\telif image_size_ratio == 4:\n\t\timage_size_ratio= 128\n\telif image_size_ratio == 5:\n\t\timage_size_ratio = 0\n\n\tconsole.clear()\n\tconsole.print(\"[bold]Choose a resize method[/bold]\", justify=\"center\")\n\tconsole.print(\"\"\"\n\n[1] Resize image based on longer side\n[2] Resize image based on shorter side\n[3] Smartcrop\n\n[italic]ps: note that the aspect ratio of the image will [bold]not[/bold] be changed, so possibly the images received will have slightly different size[/italic]\n\t\t\n\t\t\"\"\")\n\tresize_method = click.prompt(\"Desired Image resize method: \", type=int)\n\twhile resize_method < 1 or resize_method > 3:\n\t\tconsole.print(\"[red]Invalid option[/red]\")\n\t\tresize_method = click.prompt(\"Choose method [1-3]: \")\n\n\tresize_method_options = ['','longer_side','shorter_side','smartcrop']\n\n\n\tconsole.clear()\n\tnumber_of_classes = click.prompt(\"How many image classes are required? 
\",type=int)\n\n\tdocument_dict = {\n  \n    \"DATASET_NAME\": dataset_name,\n  \n    \"SAMPLES_PER_SEARCH\": samples,\n \n    \"IMAGE_SIZE\": image_size_ratio,\n  \n    \"RESIZE_METHOD\": resize_method_options[resize_method],\n  \n    \"CLASSES\": []\n  \n}\n\n\tconsole.clear()\n\tconsole.print(\"[bold]Choose a search engine[/bold]\", justify=\"center\")\n\tconsole.print(\"\"\"\n\n[1] Duck GO [bold blue](recommended)[/bold blue]\n[2] Bing\n[3] Bing API [italic yellow](Requires API key)[/italic yellow]\n[4] Flickr API [italic yellow](Requires API key)[/italic yellow]\n\n\t\t\"\"\")\n\tsearch_engine= click.prompt(\"Select option:\", type=int)\n\twhile search_engine < 0 or search_engine > 4:\n\t\tconsole.print(\"[italic red]Invalid option, please choose between 1 and 4.[/italic red]\")\n\t\tsearch_engine = click.prompt(\"\\nOption: \", type=int)\n\n\tsearch_options = ['none', 'duckgo', 'bing', 'bing_api', 'flickr_api']\n\tdocument_dict['ENGINE'] = search_options[search_engine]\n\n\tif search_engine > 2:\n\t\tconsole.clear()\n\t\tconsole.print(f'Insert your [bold blue]{search_options[search_engine]}[/bold blue] API key')\n\t\tengine_api_key = click.prompt(\"API key: \", type=str)\n\t\tdocument_dict['API_KEY'] = engine_api_key\n\telse:\n\t\tdocument_dict['API_KEY'] = \"NONE\"\n\n\tsearch_engine = search_options[search_engine]\n\n\tfor x in range(number_of_classes):\n\t\tconsole.clear()\n\t\tclass_name = click.prompt(\"Class {x} name: \".format(x=x+1))\n\t\tconsole.clear()\n\n\t\tconsole.print(\"\"\"In order to achieve better results, choose several keywords that will be provided to the search engine to find your class in different settings.\n\t\n[bold blue]Example: [/bold blue]\n\nClass Name: [bold yellow]Pineapple[/bold yellow]\n[italic]keywords[/italic]: [underline]pineapple, pineapple fruit, ananas, abacaxi, pineapple drawing[/underline]\n\n\t\t\t\"\"\")\n\t\tkeywords = click.prompt(\"Type in all keywords used to find your desired class, separated by commas: 
\")\n\t\tdocument_dict['CLASSES'].append({'CLASS_NAME': class_name, 'SEARCH_KEYWORDS': keywords})\n    \n\tif not os.path.exists(\"dataset.yaml\"):\n\t\tconsole.print(\"[bold]Creating a dataset configuration file...[/bold]\")\n\t\ttry:\n\t\t\tf = open(\"dataset.yaml\", \"w\")\n\t\t\tf.write(yaml.dump(document_dict))\n\t\t\tif f:\n\t\t\t\tconsole.clear()\n\t\t\t\tconsole.print(\"Dataset YAML file has been created sucessfully. Now run [bold blue]idt build[/bold blue] to mount your dataset!\")\n\t\texcept:\n\t\t\tconsole.print(\"[red]Unable to create file. Please check permission[/red]\")\n\t\t\n\telse:\n\t\tconsole.print(\"[red]A dataset.yaml is already created. To use another one, delete the current dataset.yaml file[/red]\")\n\n@main.command()\ndef build():\n\t\"\"\"\n\tThis command mounts the dataset\n\t\"\"\"\n\tconsole = Console()\n\tconsole.clear()\n\tconsole.print(BANNER)\n\tif not os.path.exists(\"dataset.yaml\"):\n\t\tclick.clear()\n\t\tconsole.print(\"Dataset config file not found\\nRun - idt init\\n\")\n\t\texit(0)\n\n\twith open('dataset.yaml') as f:\n\t\tdata = yaml.load(f, Loader=yaml.FullLoader)\n\t\n\tclick.clear()\n\tconsole.print(\"Building [bold blue]{dataset_name}[/bold blue] dataset...\\n\".format(dataset_name=data['DATASET_NAME']))\n\tfor classes in data['CLASSES']:\n\t\tclick.clear()\n\t\tconsole.print('Creating [bold blue]{name} class[/bold blue] \\n'.format(name=classes['CLASS_NAME']))\n\t\tsearch_list = classes['SEARCH_KEYWORDS'].split(\",\")\n\t\tfor keywords in search_list:\n\t\t\tfactory = SearchEngineFactory(keywords,data['SAMPLES_PER_SEARCH'],classes['CLASS_NAME'],data['RESIZE_METHOD'], data['DATASET_NAME'],data['IMAGE_SIZE'], data['ENGINE'],data['API_KEY'])\n\t# Remove corrupt files\n\tremove_corrupt(data['DATASET_NAME'])\n\n\t# Create a CSV with dataset info\n\tcreate_dataset_csv(data['DATASET_NAME'])\n\tclick.clear()\n\tconsole.print(\"Dataset READY!\")\n\n@main.command()\ndef split():\n\t\"\"\"\n\tSplit dataset into train/valid 
folders\n\t\"\"\"\n\tconsole = Console()\n\twhile True:\n\t\tclick.clear()\n\t\tconsole.print(BANNER)\n\t\tconsole.print(\"Choose the desired proportion of images of each class to be distributed in train/valid folders. [bold]What percentage of images should be distributed towards training?[/bold] \")\n\t\ttrain_proportion = click.prompt(\"(0-100)\", type=int)\n\t\tvalidation_proportion = 100 - train_proportion\n\t\tif train_proportion < 0 or train_proportion > 100:\n\t\t\tclick.clear()\n\t\t\tconsole.print(\"[red]Please provide a valid amount. Choose a number between 0 and 100 to be assigned to training.[/red]\")\n\t\t\tcontinue\n\t\telse:\n\t\t\tclick.clear()\n\t\t\tconsole.print(\"[bold blue]{train} percent[/bold blue] of the images will be moved to a [bold yellow]train[/bold yellow] folder, while the remaining [bold blue]{valid} percent[/bold blue] will be stored in a [bold yellow]validation[/bold yellow] folder.\".format(train=train_proportion, valid=validation_proportion))\n\t\t\tc = click.prompt(\"Is that ok? [Y/n]\", default=\"y\", show_default=False)\n\t\t\tif c.lower() == 'y':\n\t\t\t\tif not os.path.exists(\"dataset.yaml\"):\n\t\t\t\t\tclick.clear()\n\t\t\t\t\tconsole.print(\"Dataset config file not found\\nRun - [bold blue]idt init[/bold blue]\")\n\t\t\t\t\texit(0)\n\n\t\t\t\twith open('dataset.yaml') as f:\n\t\t\t\t\tclick.clear()\n\t\t\t\t\tconsole.print(\"[italic]Copying files to the train/valid folders. Please wait...[/italic]\")\n\t\t\t\t\tdata = yaml.load(f, Loader=yaml.FullLoader)\n\t\t\t\t\tsplit_dataset(data['DATASET_NAME'], train_proportion)\n\t\t\t\tconsole.clear()\n\t\t\t\tconsole.print(\"[bold blue]Done[/bold blue]\")\n\t\t\t\tbreak\n\t\t\telse:\n\t\t\t\tcontinue\n\n\nif __name__ == \"__main__\":\n\tmain()\n
  },
  {
    "path": "idt/bing.py",
    "content": "import os\nimport requests\nimport re\n\nfrom idt.utils.download_images import download\nfrom idt.utils.remove_corrupt import erase_duplicates\n\nfrom rich.progress import Progress\n\n__name__ = \"bing\"\n\nclass BingSearchEngine:\n\tdef __init__(self,data,n_images,folder,resize_method,root_folder,size):\n\t\tself.data = data\n\t\tself.n_images = n_images\n\t\tself.folder = folder\n\t\tself.resize_method = resize_method\n\t\tself.root_folder = root_folder\n\t\tself.size = size\n\t\tself.downloaded_images = 0\n\t\tself.page = 0\n\t\tself.search()\n\n\tdef search(self):\n\t\tBING_IMAGE = 'https://www.bing.com/images/async?q='\n\n\t\tUSER_AGENT = {\n\t\t'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'}\n\n\t\tdata = self.data.replace(\" \", \"-\")\n\n\t\tif data[0] == \"-\":\n\t\t\tdata = data[1:]\n\n\t\twith Progress() as progress:\n\t\t\ttask1 = progress.add_task(f\"Downloading [blue]{self.data}[/blue] class...\",total=self.n_images)\n\t\t\twhile self.downloaded_images < self.n_images:\n\t\t\t\tsearchurl = BING_IMAGE + data + '&first=' + str(self.page) + '&count=100'\n\n\t\t\t\t# Request the URL; without a user agent the request gets denied\n\t\t\t\tresponse = requests.get(searchurl, headers=USER_AGENT)\n\t\t\t\thtml = response.text\n\t\t\t\tself.page += 100\n\t\t\t\tresults = re.findall('murl&quot;:&quot;(.*?)&quot;', html)\n\n\t\t\t\t# Stop once the search engine runs out of results\n\t\t\t\tif not results:\n\t\t\t\t\tbreak\n\n\t\t\t\tif not os.path.exists(self.root_folder):\n\t\t\t\t\tos.mkdir(self.root_folder)\n\n\t\t\t\ttarget_folder = os.path.join(self.root_folder, self.folder)\n\t\t\t\tif not os.path.exists(target_folder):\n\t\t\t\t\tos.mkdir(target_folder)\n\n\t\t\t\tfor link in results:\n\t\t\t\t\ttry:\n\t\t\t\t\t\tif self.downloaded_images < self.n_images:\n\t\t\t\t\t\t\tdownload(link,self.size,self.root_folder,self.folder, self.resize_method)\n\t\t\t\t\t\t\tself.downloaded_images += 1\n\t\t\t\t\t\t\tprogress.update(task1, advance=1)\n\t\t\t\t\t\telse:\n\t\t\t\t\t\t\tbreak\n\t\t\t\t\texcept Exception:\n\t\t\t\t\t\tcontinue\n\t\t\t\tself.downloaded_images -= erase_duplicates(target_folder)\n\t\tprint('Done')\n"
  },
  {
    "path": "idt/bing_api.py",
    "content": "import os\nimport requests\n\nfrom idt.utils.download_images import download\nfrom idt.utils.remove_corrupt import erase_duplicates\nfrom idt.utils.create_dataset_csv import generate_class_info\nfrom rich.progress import Progress\n\n__name__ = \"bing_api\"\n\nclass BingApiSearchEngine:\n\tdef __init__(self,data,n_images,folder,resize_method,root_folder,size,api_key):\n\t\tself.data = data\n\t\tself.n_images = n_images\n\t\tself.folder = folder\n\t\tself.resize_method = resize_method\n\t\tself.root_folder = root_folder\n\t\tself.size = size\n\t\tself.downloaded_images = 0\n\t\tself.dataset_info = []\n\t\tself.page = 0\n\t\tself.api_key = api_key\n\t\tself.search()\n\n\tdef search(self):\n\t\tBING_IMAGE = 'https://api.cognitive.microsoft.com/bing/v7.0/images/search'\n\n\t\theaders = {\"Ocp-Apim-Subscription-Key\" : self.api_key}\n\t\tparams = {\"q\": self.data, \"count\": 100, \"offset\": self.page}\n\n\t\twith Progress() as progress:\n\t\t\ttask1 = progress.add_task(f\"Downloading [blue]{self.data}[/blue] class...\",total=self.n_images)\n\t\t\twhile self.downloaded_images < self.n_images:\n\t\t\t\t# Keep the request offset in sync with the current page, otherwise every request returns the same results\n\t\t\t\tparams[\"offset\"] = self.page\n\t\t\t\tresponse = requests.get(BING_IMAGE, headers=headers, params=params)\n\t\t\t\tresponse.raise_for_status()\n\t\t\t\tresults = response.json()\n\t\t\t\tself.page += 100\n\n\t\t\t\tif not os.path.exists(self.root_folder):\n\t\t\t\t\tos.mkdir(self.root_folder)\n\n\t\t\t\ttarget_folder = os.path.join(self.root_folder, self.folder)\n\t\t\t\tif not os.path.exists(target_folder):\n\t\t\t\t\tos.mkdir(target_folder)\n\n\t\t\t\tfor result in results['value']:\n\t\t\t\t\ttry:\n\t\t\t\t\t\tif self.downloaded_images < self.n_images:\n\t\t\t\t\t\t\tdownload(result['contentUrl'],self.size,self.root_folder,self.folder, self.resize_method)\n\t\t\t\t\t\t\tself.dataset_info.append({\n\t\t\t\t\t\t\t\t'name': result['name'],\n\t\t\t\t\t\t\t\t'origin': result['hostPageDisplayUrl'].split('/')[2],\n\t\t\t\t\t\t\t\t'date': result['datePublished'],\n\t\t\t\t\t\t\t\t'original_size': result['contentSize'],\n\t\t\t\t\t\t\t\t'original_width': result['width'],\n\t\t\t\t\t\t\t\t'original_height': result['height']})\n\n\t\t\t\t\t\t\tself.downloaded_images += 1\n\t\t\t\t\t\t\tprogress.update(task1, advance=1)\n\t\t\t\t\t\telse:\n\t\t\t\t\t\t\tbreak\n\t\t\t\t\texcept Exception:\n\t\t\t\t\t\tcontinue\n\t\t\t\tself.downloaded_images -= erase_duplicates(target_folder)\n\t\tgenerate_class_info(self.dataset_info,self.root_folder, self.folder)\n"
  },
  {
    "path": "idt/duckgo.py",
    "content": "import requests\nimport re\nimport json\nimport time\nimport os\nfrom rich.progress import Progress\n\nfrom idt.utils.download_images import download\nfrom idt.utils.remove_corrupt import erase_duplicates\n\n__name__ = \"duckgo\"\n\nclass DuckGoSearchEngine:\n    def __init__(self, data, n_images, folder, resize_method, root_folder, size):\n        self.data = data\n        self.n_images = n_images\n        self.folder = folder\n        self.resize_method = resize_method\n        self.root_folder = root_folder\n        self.size = size\n        self.downloaded_images = 0\n        self.search()\n\n    def search(self):\n        URL = 'https://duckduckgo.com/'\n        PARAMS = {'q': self.data}\n        HEADERS = {\n        'authority': 'duckduckgo.com',\n        'accept': 'application/json, text/javascript, */*; q=0.01',\n        'sec-fetch-dest': 'empty',\n        'x-requested-with': 'XMLHttpRequest',\n        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',\n        'sec-fetch-site': 'same-origin',\n        'sec-fetch-mode': 'cors',\n        'referer': 'https://duckduckgo.com/',\n        'accept-language': 'en-US,en;q=0.9'}\n\n        # The image endpoint requires a vqd search token scraped from the HTML page\n        res = requests.post(URL, data=PARAMS, timeout=3.000)\n        search_object = re.search(r'vqd=([\\d-]+)\\&', res.text, re.M|re.I)\n\n        if not search_object:\n            return -1\n\n        PARAMS = (\n        ('l', 'us-en'),\n        ('o', 'json'),\n        ('q', self.data),\n        ('vqd', search_object.group(1)),\n        ('f', ',,,'),\n        ('p', '1'),\n        ('v7exp', 'a'))\n\n        request_url = URL + \"i.js\"\n        with Progress() as progress:\n\n            task1 = progress.add_task(\"[blue]Downloading {x} class...\".format(x=self.data), total=self.n_images)\n            while self.downloaded_images < self.n_images:\n                while True:\n                    try:\n                        res = requests.get(request_url, headers=HEADERS, params=PARAMS, timeout=3.000)\n                        data = json.loads(res.text)\n                        break\n                    except ValueError:\n                        time.sleep(5)\n                        continue\n\n                if not os.path.exists(self.root_folder):\n                    os.mkdir(self.root_folder)\n\n                target_folder = os.path.join(self.root_folder, self.folder)\n                if not os.path.exists(target_folder):\n                    os.mkdir(target_folder)\n\n                # Cut the extra results by the amount that still needs to be downloaded\n                if len(data[\"results\"]) > self.n_images - self.downloaded_images:\n                    data[\"results\"] = data[\"results\"][:self.n_images - self.downloaded_images]\n\n                for results in data[\"results\"]:\n                    try:\n                        download(results[\"image\"], self.size, self.root_folder, self.folder, self.resize_method)\n                        self.downloaded_images += 1\n                        progress.update(task1, advance=1)\n                    except Exception:\n                        continue\n\n                self.downloaded_images -= erase_duplicates(target_folder)\n\n                if \"next\" not in data:\n                    return 0\n                request_url = URL + data[\"next\"]\n"
  },
  {
    "path": "idt/factories.py",
    "content": "from idt.duckgo import DuckGoSearchEngine\nfrom idt.bing import BingSearchEngine\nfrom idt.bing_api import BingApiSearchEngine \nfrom idt.flickr_api import FlickrApiSearchEngine\n\n__name__ = \"factories\"\n\nclass SearchEngineFactory:\n\tdef __init__(self,data,n_images,folder,resize_method,root_folder,size,engine,api_key):\n\t\tself.data = data\n\t\tself.n_images = n_images\n\t\tself.folder = folder\n\t\tself.resize_method = resize_method\n\t\tself.root_folder = root_folder\n\t\tself.size = size\n\t\tself.engine = engine\n\t\tself.api_key = api_key\n\t\tself.getSearchEngine() \n\n\tdef getSearchEngine(self):\n\t\tif self.engine == \"duckgo\":\n\t\t\treturn DuckGoSearchEngine(self.data, self.n_images, self.folder,self.resize_method,self.root_folder, self.size)\n\t\telif self.engine == \"bing\":\n\t\t\treturn BingSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size)\n\t\telif self.engine == \"bing_api\":\n\t\t\treturn BingApiSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size, self.api_key)\n\t\telif self.engine == \"flickr_api\":\n\t\t\treturn FlickrApiSearchEngine(self.data, self.n_images, self.folder, self.resize_method, self.root_folder, self.size, self.api_key)\n\t\telse:\n\t\t\treturn None\n"
  },
  {
    "path": "idt/flickr_api.py",
    "content": "import os\nimport requests\n\nfrom idt.utils.download_images import download\nfrom idt.utils.remove_corrupt import erase_duplicates\nfrom rich.progress import Progress\n\n__name__ = \"flickr_api\"\n\nclass FlickrApiSearchEngine:\n\tdef __init__(self,data,n_images,folder,resize_method,root_folder,size,api_key):\n\t\tself.data = data\n\t\tself.n_images = n_images\n\t\tself.folder = folder\n\t\tself.resize_method = resize_method\n\t\tself.root_folder = root_folder\n\t\tself.size = size\n\t\tself.downloaded_images = 0\n\t\tself.dataset_info = []\n\t\tself.page = 1\n\t\tself.api_key = api_key\n\t\tself.search()\n\n\tdef search(self):\n\t\tFLICKR_LINK = 'https://www.flickr.com/services/rest/'\n\n\t\tdata = self.data.replace(\" \", \"+\")\n\n\t\tif data[0] == \"+\":\n\t\t\tdata = data[1:]\n\n\t\tparams = {\n\t\t\"method\": \"flickr.photos.search\",\n\t\t\"api_key\": self.api_key,\n\t\t\"tags\": data,\n\t\t\"format\": \"json\",\n\t\t\"page\": self.page,\n\t\t\"nojsoncallback\": 1\n\t\t}\n\t\twith Progress() as progress:\n\t\t\ttask1 = progress.add_task(f\"Downloading [blue]{self.data}[/blue] class...\",total=self.n_images)\n\t\t\twhile self.downloaded_images < self.n_images:\n\t\t\t\t# Keep the request page in sync with the page counter, otherwise every request fetches page 1\n\t\t\t\tparams[\"page\"] = self.page\n\t\t\t\tresponse = requests.get(FLICKR_LINK, params=params)\n\t\t\t\tresponse.raise_for_status()\n\t\t\t\tresults = response.json()\n\t\t\t\tresults = results['photos']\n\t\t\t\t# The API may return 'total' as a string, so normalize before comparing\n\t\t\t\tif int(results['total']) == 0:\n\t\t\t\t\tprogress.update(task1, advance=self.n_images)\n\t\t\t\t\treturn 0\n\n\t\t\t\tself.page += 1\n\n\t\t\t\tif not os.path.exists(self.root_folder):\n\t\t\t\t\tos.mkdir(self.root_folder)\n\n\t\t\t\ttarget_folder = os.path.join(self.root_folder, self.folder)\n\t\t\t\tif not os.path.exists(target_folder):\n\t\t\t\t\tos.mkdir(target_folder)\n\n\t\t\t\tfor result in results['photo']:\n\t\t\t\t\ttry:\n\t\t\t\t\t\tif self.downloaded_images < self.n_images:\n\t\t\t\t\t\t\tlink = f\"https://farm{result['farm']}.staticflickr.com/{result['server']}/{result['id']}_{result['secret']}.jpg\"\n\t\t\t\t\t\t\tdownload(link, self.size,self.root_folder,self.folder, self.resize_method)\n\t\t\t\t\t\t\tself.downloaded_images += 1\n\t\t\t\t\t\t\tprogress.update(task1, advance=1)\n\t\t\t\t\t\telse:\n\t\t\t\t\t\t\tbreak\n\t\t\t\t\texcept Exception:\n\t\t\t\t\t\tcontinue\n\t\t\t\tself.downloaded_images -= erase_duplicates(target_folder)\n"
  },
  {
    "path": "idt/resizers/__init__.py",
    "content": ""
  },
  {
    "path": "idt/resizers/get_resizer.py",
    "content": "from .smartcrop import SmartCrop\nfrom .longer_side import crop_longer_side\nfrom .shorter_side import crop_shorter_side\n\ndef get_resizer(img, target_size, resizer):\n\tif target_size == 0:\n\t\treturn img\n\n\tif resizer == \"smartcrop\":\n\t\tsc = SmartCrop()\n\t\treturn sc.run_crop(img, target_size)\n\telif resizer == 'shorter_side':\n\t\treturn crop_shorter_side(img, target_size)\n\telif resizer == 'longer_side':\n\t\treturn crop_longer_side(img, target_size)\n\telse:\n\t\t# Unknown resize method: return the image unchanged instead of None\n\t\treturn img\n"
  },
  {
    "path": "idt/resizers/longer_side.py",
    "content": "from PIL import Image\n\ndef crop_longer_side(img, size):\n\tIMG_SIZE = size, size\n\timg.thumbnail(IMG_SIZE, Image.ANTIALIAS)\n\treturn img\n"
  },
  {
    "path": "idt/resizers/shorter_side.py",
    "content": "from PIL import Image\n\ndef crop_shorter_side(img, size):\n\twidth, height = img.size\n\tif width < size or height < size:\n\t\t# Image.thumbnail expects a (width, height) tuple, resizes in place and returns None,\n\t\t# so return the image explicitly\n\t\timg.thumbnail((size, size), Image.ANTIALIAS)\n\t\treturn img\n\telif width > height:\n\t\tratio = float(width) / float(height)\n\t\tnew_width = int(size * ratio)\n\t\treturn img.resize((new_width, size), Image.ANTIALIAS)\n\telse:\n\t\tratio = float(height) / float(width)\n\t\tnew_height = int(size * ratio)\n\t\treturn img.resize((size, new_height), Image.ANTIALIAS)\n"
  },
  {
    "path": "idt/resizers/smartcrop.py",
    "content": "import math\nimport sys\n\nimport numpy as np\nfrom PIL import Image\nfrom PIL.ImageFilter import Kernel\n\n\ndef saturation(image):\n    r, g, b = image.split()\n    r, g, b = np.array(r, float), np.array(g, float), np.array(b, float)\n    maximum = np.maximum(np.maximum(r, g), b)  # [0; 255]\n    minimum = np.minimum(np.minimum(r, g), b)  # [0; 255]\n    s = (maximum + minimum) / 255  # [0.0; 1.0]\n    d = (maximum - minimum) / 255  # [0.0; 1.0]\n    d[maximum == minimum] = 0  # if maximum == minimum:\n    s[maximum == minimum] = 1  # -> saturation = 0 / 1 = 0\n    mask = s > 1\n    s[mask] = 2 - d[mask]\n    return d / s  # [0.0; 1.0]\n\n\ndef thirds(x):\n    \"\"\"gets value in the range of [0, 1] where 0 is the center of the pictures\n    returns weight of rule of thirds [0, 1]\"\"\"\n    x = ((x + 2 / 3) % 2 * 0.5 - 0.5) * 16\n    return max(1 - x * x, 0)\n\n\nclass SmartCrop(object):\n\n    DEFAULT_SKIN_COLOR = [0.78, 0.57, 0.44]\n\n    def __init__(\n        self,\n        detail_weight=0.2,\n        edge_radius=0.4,\n        edge_weight=-20,\n        outside_importance=-0.5,\n        rule_of_thirds=True,\n        saturation_bias=0.2,\n        saturation_brightness_max=0.9,\n        saturation_brightness_min=0.05,\n        saturation_threshold=0.4,\n        saturation_weight=0.3,\n        score_down_sample=8,\n        skin_bias=0.01,\n        skin_brightness_max=1,\n        skin_brightness_min=0.2,\n        skin_color=None,\n        skin_threshold=0.8,\n        skin_weight=1.8\n    ):\n        self.detail_weight = detail_weight\n        self.edge_radius = edge_radius\n        self.edge_weight = edge_weight\n        self.outside_importance = outside_importance\n        self.rule_of_thirds = rule_of_thirds\n        self.saturation_bias = saturation_bias\n        self.saturation_brightness_max = saturation_brightness_max\n        self.saturation_brightness_min = saturation_brightness_min\n        self.saturation_threshold = 
saturation_threshold\n        self.saturation_weight = saturation_weight\n        self.score_down_sample = score_down_sample\n        self.skin_bias = skin_bias\n        self.skin_brightness_max = skin_brightness_max\n        self.skin_brightness_min = skin_brightness_min\n        self.skin_color = skin_color or self.DEFAULT_SKIN_COLOR\n        self.skin_threshold = skin_threshold\n        self.skin_weight = skin_weight\n\n    def analyse(\n        self,\n        image,\n        crop_width,\n        crop_height,\n        max_scale=1,\n        min_scale=0.9,\n        scale_step=0.1,\n        step=8\n    ):\n        \"\"\"\n        Analyze image and return some suggestions of crops (coordinates).\n        This implementation / algorithm is really slow for large images.\n        Use `crop()` which is pre-scaling the image before analyzing it.\n        \"\"\"\n        cie_image = image.convert('L', (0.2126, 0.7152, 0.0722, 0))\n        cie_array = np.array(cie_image)  # [0; 255]\n\n        # R=skin G=edge B=saturation\n        edge_image = self.detect_edge(cie_image)\n        skin_image = self.detect_skin(cie_array, image)\n        saturation_image = self.detect_saturation(cie_array, image)\n        analyse_image = Image.merge('RGB', [skin_image, edge_image, saturation_image])\n\n        del edge_image\n        del skin_image\n        del saturation_image\n\n        score_image = analyse_image.copy()\n        score_image.thumbnail(\n            (\n                int(math.ceil(image.size[0] / self.score_down_sample)),\n                int(math.ceil(image.size[1] / self.score_down_sample))\n            ),\n            Image.ANTIALIAS)\n\n        top_crop = None\n        top_score = -sys.maxsize\n\n        crops = self.crops(\n            image,\n            crop_width,\n            crop_height,\n            max_scale=max_scale,\n            min_scale=min_scale,\n            scale_step=scale_step,\n            step=step)\n\n        for crop in crops:\n            
crop['score'] = self.score(score_image, crop)\n            if crop['score']['total'] > top_score:\n                top_crop = crop\n                top_score = crop['score']['total']\n\n        return {'analyse_image': analyse_image, 'crops': crops, 'top_crop': top_crop}\n\n    def crop(\n        self,\n        image,\n        width,\n        height,\n        prescale=True,\n        max_scale=1,\n        min_scale=0.9,\n        scale_step=0.1,\n        step=8\n    ):\n        scale = min(image.size[0] / width, image.size[1] / height)\n        crop_width = int(math.floor(width * scale))\n        crop_height = int(math.floor(height * scale))\n        min_scale = min(max_scale, max(1 / scale, min_scale))\n\n        prescale_size = 1\n        if prescale:\n            prescale_size = 1 / scale / min_scale\n            if prescale_size < 1:\n                image = image.copy()\n                image.thumbnail(\n                    (int(image.size[0] * prescale_size), int(image.size[1] * prescale_size)),\n                    Image.ANTIALIAS)\n                crop_width = int(math.floor(crop_width * prescale_size))\n                crop_height = int(math.floor(crop_height * prescale_size))\n            else:\n                prescale_size = 1\n\n        result = self.analyse(\n            image,\n            crop_width=crop_width,\n            crop_height=crop_height,\n            min_scale=min_scale,\n            max_scale=max_scale,\n            scale_step=scale_step,\n            step=step)\n\n        for i in range(len(result['crops'])):\n            crop = result['crops'][i]\n            crop['x'] = int(math.floor(crop['x'] / prescale_size))\n            crop['y'] = int(math.floor(crop['y'] / prescale_size))\n            crop['width'] = int(math.floor(crop['width'] / prescale_size))\n            crop['height'] = int(math.floor(crop['height'] / prescale_size))\n            result['crops'][i] = crop\n        return result\n\n    def run_crop(self, image, 
target_size):\n        if image.mode != 'RGB' and image.mode != 'RGBA':\n            new_image = Image.new('RGB', image.size)\n            new_image.paste(image)\n            image = new_image\n\n        # Find the best square (1:1) crop, then thumbnail it down to target_size\n        result = self.crop(image, width=100, height=100)\n\n        box = (\n            result['top_crop']['x'],\n            result['top_crop']['y'],\n            result['top_crop']['width'] + result['top_crop']['x'],\n            result['top_crop']['height'] + result['top_crop']['y']\n        )\n\n        cropped_image = image.crop(box)\n        cropped_image.thumbnail((target_size, target_size), Image.ANTIALIAS)\n        return cropped_image\n\n    def crops(\n        self,\n        image,\n        crop_width,\n        crop_height,\n        max_scale=1,\n        min_scale=0.9,\n        scale_step=0.1,\n        step=8\n    ):\n        image_width, image_height = image.size\n        crops = []\n        for scale in (\n            i / 100 for i in range(\n                int(max_scale * 100),\n                int((min_scale - scale_step) * 100),\n                -int(scale_step * 100))\n        ):\n            for y in range(0, image_height, step):\n                if not (y + crop_height * scale <= image_height):\n                    break\n                for x in range(0, image_width, step):\n                    if not (x + crop_width * scale <= image_width):\n                        break\n                    crops.append({\n                        'x': x,\n                        'y': y,\n                        'width': crop_width * scale,\n                        'height': crop_height * scale,\n                    })\n        if not crops:\n            raise ValueError(locals())\n        return crops\n\n    def detect_edge(self, cie_image):\n        return cie_image.filter(Kernel((3, 3), (0, -1, 0, -1, 4, -1, 0, -1, 0), 1, 1))\n\n    def detect_saturation(self, cie_array, source_image):\n        threshold = 
self.saturation_threshold\n        saturation_data = saturation(source_image)\n        mask = (\n            (saturation_data > threshold) &\n            (cie_array >= self.saturation_brightness_min * 255) &\n            (cie_array <= self.saturation_brightness_max * 255))\n\n        saturation_data[~mask] = 0\n        saturation_data[mask] = (saturation_data[mask] - threshold) * (255 / (1 - threshold))\n\n        return Image.fromarray(saturation_data.astype('uint8'))\n\n    def detect_skin(self, cie_array, source_image):\n        r, g, b = source_image.split()\n        r, g, b = np.array(r, float), np.array(g, float), np.array(b, float)\n        rd = np.ones_like(r) * -self.skin_color[0]\n        gd = np.ones_like(g) * -self.skin_color[1]\n        bd = np.ones_like(b) * -self.skin_color[2]\n\n        mag = np.sqrt(r * r + g * g + b * b)\n        mask = ~(abs(mag) < 1e-6)\n        rd[mask] = r[mask] / mag[mask] - self.skin_color[0]\n        gd[mask] = g[mask] / mag[mask] - self.skin_color[1]\n        bd[mask] = b[mask] / mag[mask] - self.skin_color[2]\n\n        skin = 1 - np.sqrt(rd * rd + gd * gd + bd * bd)\n        mask = (\n            (skin > self.skin_threshold) &\n            (cie_array >= self.skin_brightness_min * 255) &\n            (cie_array <= self.skin_brightness_max * 255))\n\n        skin_data = (skin - self.skin_threshold) * (255 / (1 - self.skin_threshold))\n        skin_data[~mask] = 0\n\n        return Image.fromarray(skin_data.astype('uint8'))\n\n    def importance(self, crop, x, y):\n        if (\n            crop['x'] > x or x >= crop['x'] + crop['width'] or\n            crop['y'] > y or y >= crop['y'] + crop['height']\n        ):\n            return self.outside_importance\n\n        x = (x - crop['x']) / crop['width']\n        y = (y - crop['y']) / crop['height']\n        px, py = abs(0.5 - x) * 2, abs(0.5 - y) * 2\n\n        # distance from edge\n        dx = max(px - 1 + self.edge_radius, 0)\n        dy = max(py - 1 + self.edge_radius, 
0)\n        d = (dx * dx + dy * dy) * self.edge_weight\n        s = 1.41 - math.sqrt(px * px + py * py)\n\n        if self.rule_of_thirds:\n            s += (max(0, s + d + 0.5) * 1.2) * (thirds(px) + thirds(py))\n\n        return s + d\n\n    def score(self, target_image, crop):\n        score = {\n            'detail': 0,\n            'saturation': 0,\n            'skin': 0,\n            'total': 0,\n        }\n        target_data = target_image.getdata()\n        target_width, target_height = target_image.size\n\n        down_sample = self.score_down_sample\n        inv_down_sample = 1 / down_sample\n        target_width_down_sample = target_width * down_sample\n        target_height_down_sample = target_height * down_sample\n\n        for y in range(0, target_height_down_sample, down_sample):\n            for x in range(0, target_width_down_sample, down_sample):\n                p = int(\n                    math.floor(y * inv_down_sample) * target_width +\n                    math.floor(x * inv_down_sample)\n                )\n                importance = self.importance(crop, x, y)\n                detail = target_data[p][1] / 255\n                score['skin'] += (\n                    target_data[p][0] / 255 *\n                    (detail + self.skin_bias) *\n                    importance\n                )\n                score['detail'] += detail * importance\n                score['saturation'] += (\n                    target_data[p][2] / 255 *\n                    (detail + self.saturation_bias) *\n                    importance\n                )\n        score['total'] = (\n            score['detail'] * self.detail_weight +\n            score['skin'] * self.skin_weight +\n            score['saturation'] * self.saturation_weight\n        ) / (crop['width'] * crop['height'])\n        return score"
  },
  {
    "path": "idt/utils/__init__.py",
    "content": ""
  },
  {
    "path": "idt/utils/create_dataset_csv.py",
    "content": "import os\nimport re\nimport yaml\n\n__name__ = \"create_dataset_csv\"\n\ndef create_dataset_csv(path):\n\tnumber_of_dirs = 0\n\tnumber_of_files = 0\n\tcsv_dict = {'DATASET': path, 'NUMBER_OF_CLASSES': 0, 'TOTAL_NUMBER_OF_FILES': 0}\n\n\tfor base, dirs, files in os.walk(path):\n\t\tfor directories in dirs:\n\t\t\tnumber_of_dirs += 1\n\t\t\t# Join with the directory being walked, not the dataset root\n\t\t\tdir_path = os.path.join(base, directories)\n\t\t\tcount = len(os.listdir(dir_path))\n\t\t\tcsv_dict[str(directories)] = count\n\t\tnumber_of_files += len(files)\n\n\tcsv_dict['NUMBER_OF_CLASSES'] = number_of_dirs\n\tcsv_dict['TOTAL_NUMBER_OF_FILES'] = number_of_files\n\n\twith open('{path}/{path}.csv'.format(path=path), 'w') as f:\n\t\tfor key in csv_dict.keys():\n\t\t\tf.write(\"%s,%s\\n\" % (key, csv_dict[key]))\n\n#TODO implement natural sort to classes\ndef atoi(text):\n    return int(text) if text.isdigit() else text\n\ndef natural_keys(text):\n    return [atoi(c) for c in re.split(r'(\\d+)', text)]\n\ndef generate_class_info(class_info, root_folder, folder):\n\twith open(f\"./{root_folder}/{folder}.yaml\", \"w\") as f:\n\t\tf.write(yaml.dump(class_info))"
  },
  {
    "path": "idt/utils/download_images.py",
    "content": "import uuid\nimport requests\nfrom PIL import Image\nfrom io import BytesIO\nfrom idt.resizers.get_resizer import get_resizer\n\n__name__ = \"download_images\"\n\ndef download(link, size, root_folder, class_name, resize_method):\n    response = requests.get(link, timeout=3.000)\n    file = BytesIO(response.content)\n    raw_img = Image.open(file)\n\n    # Resize or crop the image according to the provided resize method\n    img = get_resizer(raw_img, size, resize_method)\n\n    # Split the last part of the URL to get the image name and its extension\n    img_name = link.rsplit('/', 1)[1]\n    img_type = img_name.split('.')[-1]\n\n    if img_type.lower() != \"jpg\":\n        raise Exception(\"Cannot download this type of file\")\n    else:\n        # Use a uuid so files with the same name do not overwrite each other\n        img_id = uuid.uuid1()\n        img.save(f\"./{root_folder}/{class_name}/{class_name}-{img_id.hex}.jpg\", \"JPEG\")"
  },
  {
    "path": "idt/utils/remove_corrupt.py",
    "content": "import os, hashlib\n\n__name__ = \"remove_corrupt\"\n\ndef remove_corrupt(path):\n\tprint(\"Removing corrupt files\")\n\n\tfor base, dirs, files in os.walk(path):\n\t\tfor Files in files:\n\t\t\tfile = os.path.join(base, Files)\n\t\t\t# Zero-byte files are failed downloads\n\t\t\tif os.stat(file).st_size == 0:\n\t\t\t\tos.remove(file)\n\ndef erase_duplicates(folder):\n\tduplicates = []\n\thash_keys = dict()\n\tfile_list = os.listdir(folder)\n\n\tfor index, file_name in enumerate(file_list):\n\t\tif os.path.isfile(os.path.join(folder, file_name)):\n\t\t\twith open(os.path.join(folder, file_name), 'rb') as f:\n\t\t\t\tfilehash = hashlib.md5(f.read()).hexdigest()\n\t\t\tif filehash not in hash_keys:\n\t\t\t\thash_keys[filehash] = index\n\t\t\telse:\n\t\t\t\tduplicates.append((index, hash_keys[filehash]))\n\n\tfor index in duplicates:\n\t\tos.remove(os.path.join(folder, file_list[index[0]]))\n\n\treturn len(duplicates)"
  },
  {
    "path": "idt/utils/split_dataset.py",
    "content": "import os\nimport random\nfrom shutil import copyfile\n\n__name__ = \"split_dataset\"\n\ndef split_dataset(img_source_dir, train_size):\n    train_size = float(train_size / 100)\n\n    print(\"Creating a dataset split into train/validation folders...\")\n\n    if not os.path.exists(img_source_dir):\n        raise OSError('The source folder does not exist. Are you sure the dataset folder is ' + img_source_dir + '?')\n\n    # Create the output folders unconditionally (exist_ok avoids failing on re-runs)\n    os.makedirs('split-dataset/train', exist_ok=True)\n    os.makedirs('split-dataset/validation', exist_ok=True)\n\n    # Get the subdirectories in the main image folder\n    subdirs = [subdir for subdir in os.listdir(img_source_dir) if os.path.isdir(os.path.join(img_source_dir, subdir))]\n\n    for subdir in subdirs:\n        subdir_fullpath = os.path.join(img_source_dir, subdir)\n        if len(os.listdir(subdir_fullpath)) == 0:\n            print(subdir_fullpath + ' is empty')\n            continue\n\n        train_subdir = os.path.join('split-dataset/train', subdir)\n        validation_subdir = os.path.join('split-dataset/validation', subdir)\n\n        # Create subdirectories in train and validation folders\n        if not os.path.exists(train_subdir):\n            os.makedirs(train_subdir)\n\n        if not os.path.exists(validation_subdir):\n            os.makedirs(validation_subdir)\n\n        train_counter = 0\n        validation_counter = 0\n\n        # Randomly assign each image to the train or validation folder\n        for filename in os.listdir(subdir_fullpath):\n            if filename.endswith(\".jpg\") or filename.endswith(\".png\"):\n                extension = filename.rsplit('.', 1)[1]\n\n                if random.uniform(0, 1) <= train_size:\n                    copyfile(os.path.join(subdir_fullpath, filename), os.path.join(train_subdir, str(train_counter) + '.' + extension))\n                    train_counter += 1\n                else:\n                    copyfile(os.path.join(subdir_fullpath, filename), os.path.join(validation_subdir, str(validation_counter) + '.' + extension))\n                    validation_counter += 1"
  },
  {
    "path": "requirements.txt",
    "content": "click==7.1.2\nPyYAML==5.3.1\nrequests==2.22.0\nPillow==7.0.0\nrich==6.1.2\nnumpy==1.19.1\n\n"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup, find_packages\nfrom io import open\nfrom os import path\nimport pathlib\n\n# The directory containing this file\n\nHERE = pathlib.Path(__file__).parent  # The text of the README file\nREADME = (HERE / 'README.md').read_text()  # automatically captured required modules for install_requires in requirements.txt and as well as configure dependency links\nwith open(path.join(HERE, 'requirements.txt'), encoding='utf-8') as f:\n    all_reqs = f.read().split('\\n')\n\ninstall_requires = [x.strip() for x in all_reqs if 'git+' not in x\n                    and not x.startswith('#') and not x.startswith('-')]\ndependency_links = [x.strip().replace('git+', '') for x in all_reqs\n                    if 'git+' not in x]\n\nsetup(  # list of all packages\n        # any python greater than 2.7\n    name='idt',\n    description='A cli tool that quickly generates ready-to-use image datasets'\n        ,\n    version='0.0.6',\n    packages=find_packages(),\n    install_requires=install_requires,\n    python_requires='>=2.7',\n    entry_points='''\n        [console_scripts]\n        idt=idt.__main__:main\n    ''',\n    author='Deliton Junior',\n    keyword='idt, image datasets, generators, dataset generator, image scraper'\n        ,\n    long_description=README,\n    long_description_content_type='text/markdown',\n    license='MIT',\n    url='https://github.com/deliton/idt',\n    download_url='https://github.com/deliton/idt/archive/master.zip',\n    dependency_links=dependency_links,\n    author_email='deliton.m@hotmail.com',\n    classifiers=['License :: OSI Approved :: MIT License',\n                 'Programming Language :: Python :: 2.7',\n                 'Programming Language :: Python :: 3',\n                 'Programming Language :: Python :: 3.7'],\n    )\n"
  }
]