[
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2022 Enrico Cambiaso\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# cookidump\n\nEasily dump Cookidoo recipes from the official website\n\n### Description ###\n\nThis program allows you to dump all recipes on [Cookidoo](https://cookidoo.co.uk) websites (available for different countries) for offline reading at a later time.\nThese recipes are intended in particular for [Thermomix/Bimby](https://en.wikipedia.org/wiki/Thermomix) devices.\nIn order to dump the recipes, a valid subscription is needed.\n\nThe initial concept of this program was based on [jakubszalaty/cookidoo-parser](https://github.com/jakubszalaty/cookidoo-parser).\n\n### Mentioning ###\n\nIf you intend to scientifically investigate or extend cookidump, please consider citing the following paper.\n\n```\n@article{cambiaso2022cookidump,\ntitle = {Web security and data dumping: The Cookidump case},\njournal = {Software Impacts},\nvolume = {14},\npages = {100426},\nyear = {2022},\nissn = {2665-9638},\ndoi = {https://doi.org/10.1016/j.simpa.2022.100426},\nurl = {https://www.sciencedirect.com/science/article/pii/S2665963822001105},\nauthor = {Enrico Cambiaso and Maurizio Aiello},\nkeywords = {Cyber-security, Data dump, Database security, Browser automation},\nabstract = {In the web security field, data dumping activities are often related to a malicious exploitation. In this paper, we focus on data dumping activities executed legitimately by scraping/storing data shown on the browser. We evaluate such operation by proposing Cookidump, a tool able to dump all recipes available on the Cookidoo© website portal. 
While such scenario is not relevant, in terms of security and privacy, we discuss the impact of such kind of activity for other scenarios including web applications hosting sensitive information.}\n}\n```\n\nFurther information can be found at [https://www.sciencedirect.com/science/article/pii/S2665963822001105](https://www.sciencedirect.com/science/article/pii/S2665963822001105).\n\n### Features ###\n\n* Easy to run\n* Easy-to-open HTML output\n* Output includes a browsable list of dumped recipes\n* Customizable searches\n\n### Installation ###\n\n#### nix ####\n\n```\nnix run github:auino/cookidump -- <outputdir> [--separate-json]\n```\n\nNix provisions `google-chrome` together with `chromedriver`. Only the \n`<outputdir>` and `[--separate-json]` arguments are expected.\n\n#### manual ####\n\n1. Clone the repository:\n\n```\ngit clone https://github.com/auino/cookidump.git\n```\n\n2. `cd` into the cloned folder\n\n3. Install the [Python](https://www.python.org) requirements:\n\n```\npip install -r requirements.txt\n```\n\n4. Install the [Google Chrome](https://chrome.google.com) browser, if not already installed\n\n5. Download the [Chrome WebDriver](https://sites.google.com/chromium.org/driver/) and save it in the `cookidump` folder\n\n6. You are ready to dump your recipes\n\n### Usage ###\n\nSimply run the following command to start the program. 
The program is interactive to simplify its usage.\n\n```\npython cookidump.py [--separate-json] <webdriverfile> <outputdir>\n```\n\nwhere:\n* `webdriverfile` identifies the path to the downloaded [Chrome WebDriver](https://sites.google.com/chromium.org/driver/) (for instance, `chromedriver.exe` for Windows hosts, `./chromedriver` for Linux and macOS hosts)\n* `outputdir` identifies the path of the output directory (it will be created if it does not already exist)\n* `--separate-json` generates a separate JSON file for each recipe, instead of one aggregate file including all recipes\n\nThe program will open a [Google Chrome](https://chrome.google.com) window and wait until you are logged in to your [Cookidoo](https://cookidoo.co.uk) account (different countries are supported).\n\nAfter that, follow the instructions provided by the script itself to proceed with the dump.\n\n#### Considerations ####\n\nBy following the script instructions, it is also possible to apply custom filters to export selected recipes (for instance, based on the dish, title and ingredients, Thermomix/Bimby version, etc.).\n\nThe output consists of an `index.html` file, included in `outputdir`, plus a set of recipes inside structured folders.\nBy opening the generated `index.html` file in your browser, you can browse the list of downloaded recipes and navigate to the desired recipe.\n\nThe number of exported recipes is limited to around `1000` for each execution.\nHence, using filters may help to reduce the number of recipes exported in a single run.\n\n### Other approaches ###\n\nA different approach, previously adopted, is based on the retrieval of structured data on recipes.\nMore information can be found on the [datastructure branch](https://github.com/auino/cookidump/tree/datastructure).\nIn this case, the output uses a different (structured) format and therefore has to be interpreted. 
Such interpretation is not implemented in the linked previous commit.\n\nAnother community-driven approach, supported by some AI, has been released in the [community branch](https://github.com/auino/cookidump/tree/community).\n\n### TODO ###\n\n* Bypass the limited number of exported recipes\n* Parse downloaded recipes to store them in a database, or to generate a single linked PDF\n* Make Chrome run headless for better speeds\n* Set up a dedicated container for the program\n\n### Supporters ###\n\n* [@vikramsoni2](https://github.com/vikramsoni2), regarding JSON saves plus minor enhancements\n* [@mrwogu](https://github.com/mrwogu), regarding additional information to be extracted into the generated JSON file, plus suggestions on the possibility to save recipes in dedicated JSON files\n* [@nilskrause](https://github.com/NilsKrause), regarding argument parsing and updates on the link to download the Chrome WebDriver\n* [@NightProgramming](https://github.com/NightProgramming), regarding the use of Selenium version 3\n* [@morela](https://github.com/morela), regarding the update of the tool to support a newer version of Selenium\n* [@ndjc](https://github.com/ndjc), fixing some deprecation warnings\n* [@paoloaq](https://github.com/paoloaq), fixing page scrolling\n\n### Extensions ###\n\n* [GAC27/ReceitasDaAvoGenerator](https://github.com/GAC27/ReceitasDaAvoGenerator) implements a tool generating a PDF \"book\" file from the output provided by [cookidump](https://github.com/auino/cookidump)\n\n### Disclaimer ###\n\nThe authors of this program are not responsible for its use.\nThis program is released only for research and dissemination purposes.\nAlso, the program provides users the ability to locally and temporarily store recipes accessible through a legitimate subscription.\nBefore using this program, check the Cookidoo subscription terms of service for the country of the subscription in use. 
\nSharing the obtained recipes is not a legitimate activity, and the authors of this program are not responsible for any illicit activity or sharing carried out by users.\n\n### Contacts ###\n\nYou can find me on Twitter as [@auino](https://twitter.com/auino).\n"
  },
  {
    "path": "cookidump.py",
    "content": "#!/usr/bin/python3\n\n# cookidump\n# Original GitHub project:\n# https://github.com/auino/cookidump\n\nimport os\nimport io\nimport re\nimport time\nimport json\nimport pathlib\nimport argparse\nimport platform\nfrom bs4 import BeautifulSoup\nfrom selenium import webdriver\nfrom urllib.parse import urlparse\nfrom urllib.request import urlretrieve\nfrom selenium.webdriver.common.by import By\nfrom selenium.webdriver.common.keys import Keys\nfrom selenium.webdriver.chrome.options import Options\nfrom selenium.webdriver.chrome.service import Service\nfrom selenium.webdriver.common.action_chains import ActionChains\n\nPAGELOAD_TO = 3\nSCROLL_TO = 1\nMAX_SCROLL_RETRIES = 5\n\ndef startBrowser(chrome_driver_path):\n    \"\"\"Starts browser with predefined parameters\"\"\"\n    chrome_options = Options()\n    if \"GOOGLE_CHROME_PATH\" in os.environ:\n        chrome_options.binary_location = os.getenv('GOOGLE_CHROME_PATH')\n    #chrome_options.add_argument('--headless')\n    chrome_service = Service(chrome_driver_path)\n    driver = webdriver.Chrome(service=chrome_service, options=chrome_options)\n    return driver\n\ndef listToFile(browser, baseDir):\n    \"\"\"Gets html from search list and saves in html file\"\"\"\n    filename = '{}index.html'.format(baseDir)\n    # creating directories, if needed\n    path = pathlib.Path(filename)\n    path.parent.mkdir(parents=True, exist_ok=True)\n    # getting web page source\n    #html = browser.page_source\n    html = browser.execute_script(\"return document.documentElement.outerHTML\")\n    # saving the page\n    with io.open(filename, 'w', encoding='utf-8') as f: f.write(html)\n\ndef imgToFile(outputdir, recipeID, img_url):\n    img_path = '{}images/{}.jpg'.format(outputdir, recipeID)\n    path = pathlib.Path(img_path)\n    path.parent.mkdir(parents=True, exist_ok=True)\n    urlretrieve(img_url, img_path)\n    return '../images/{}.jpg'.format(recipeID)\n\ndef recipeToFile(browser, filename):\n    \"\"\"Gets html 
of the recipe and saves in html file\"\"\"\n    # creating directories, if needed\n    path = pathlib.Path(filename)\n    path.parent.mkdir(parents=True, exist_ok=True)\n    # getting web page source\n    html = browser.page_source\n    # saving the page\n    with io.open(filename, 'w', encoding='utf-8') as f: f.write(html)\n\ndef recipeToJSON(browser, recipeID):\n    \"\"\"Extracts recipe details from the current page into a dict\"\"\"\n    html = browser.page_source\n    soup = BeautifulSoup(html, 'html.parser')\n\n    recipe = {}\n    recipe['id'] = recipeID\n    recipe['language'] = soup.select_one('html').attrs['lang']\n    recipe['title'] = soup.select_one(\".recipe-card__title\").text\n    recipe['rating_count'] = re.sub(r'\\D', '', soup.select_one(\".core-rating__label\").text, flags=re.IGNORECASE)\n    recipe['rating_score'] = soup.select_one(\".core-rating__counter\").text\n    recipe['tm-versions'] = [v.text.replace('\\n','').strip().lower() for v in soup.select(\".recipe-card__tm-version core-badge\")]\n    recipe.update({ l.text : l.next_sibling.strip() for l in soup.select(\"core-feature-icons label span\") })\n    recipe['ingredients'] = [re.sub(' +', ' ', li.text).replace('\\n','').strip() for li in soup.select(\"#ingredients li\")]\n    recipe['nutritions'] = {}\n    for item in list(zip(soup.select(\".nutritions dl\")[0].find_all(\"dt\"), soup.select(\".nutritions dl\")[0].find_all(\"dd\"))):\n        dt, dd = item\n        recipe['nutritions'].update({ dt.string.replace('\\n','').strip().lower(): re.sub(r'\\s{2,}', ' ', dd.string.replace('\\n','').strip().lower()) })\n    recipe['steps'] = [re.sub(' +', ' ', li.text).replace('\\n','').strip() for li in soup.select(\"#preparation-steps li\")]\n    recipe['tags'] = [a.text.replace('#','').replace('\\n','').strip().lower() for a in soup.select(\".core-tags-wrapper__tags-container a\")]\n\n    return recipe\n\ndef run(webdriverfile, outputdir, separate_json):\n    \"\"\"Scrapes all recipes and stores them in html\"\"\"\n    print('[CD] Welcome to cookidump, starting things 
off...')\n    # fixing the outputdir parameter, if needed\n    if outputdir[-1:][0] != '/': outputdir += '/'\n    locale = str(input('[CD] Complete the website domain: https://cookidoo.'))\n    baseURL = 'https://cookidoo.{}/'.format(locale)\n    brw = startBrowser(webdriverfile)\n    # opening the home page\n    brw.get(baseURL)\n    time.sleep(PAGELOAD_TO)\n    reply = input('[CD] Please log in to your account and then enter y to continue: ')\n    # recipes base url\n    rbURL = 'https://cookidoo.{}/search/'.format(locale)\n    brw.get(rbURL)\n    time.sleep(PAGELOAD_TO)\n    # possible filters done here\n    reply = input('[CD] Set your filters, if any, and then enter y to continue: ')\n    # asking for additional details for output organization\n    custom_output_dir = input(\"[CD] Enter the directory name to store the results (e.g. vegetarian): \")\n    if custom_output_dir : outputdir += '{}/'.format(custom_output_dir)\n    # proceeding\n    print('[CD] Proceeding with scraping')\n    # removing the name\n    brw.execute_script(\"var element = arguments[0];element.parentNode.removeChild(element);\", brw.find_element(By.TAG_NAME, 'core-user-profile'))\n    # clicking on cookie accept\n    try: brw.find_element(By.CLASS_NAME, 'accept-cookie-container').click()\n    except: pass\n    # showing all recipes\n    elementsToBeFound = int(brw.find_element(By.CLASS_NAME, 'items-start').text.split('\\n')[-1].split(' ')[0])\n    previousElements = 0\n    # number of consecutive scroll attempts that loaded no new elements\n    count = 0\n    while True:\n        # checking if ended or not\n        currentElements = len(brw.find_elements(By.CLASS_NAME, 'link--alt'))\n        if currentElements >= elementsToBeFound: break\n        # scrolling to the end\n        brw.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n        time.sleep(SCROLL_TO)\n        # clicking on the \"load more recipes\" button\n        try:\n            brw.find_element(By.XPATH, \"//button[@data-cy='load-more-button']\").click()\n            time.sleep(PAGELOAD_TO)\n   
     except: pass\n        print('Scrolling [{}/{}]'.format(currentElements, elementsToBeFound))\n        # checking if I can't load more elements\n        count = count + 1 if previousElements == currentElements else 0\n        if count >= MAX_SCROLL_RETRIES: break\n        previousElements = currentElements\n\n    print('Scrolling [{}/{}]'.format(currentElements, elementsToBeFound))\n\n    # saving all recipes urls\n    els = brw.find_elements(By.CLASS_NAME, 'link--alt')\n    recipesURLs = []\n    for el in els:\n        recipeURL = el.get_attribute('href')\n        recipesURLs.append(recipeURL)\n        recipeID = recipeURL.split('/')[-1:][0]\n        brw.execute_script(\"arguments[0].setAttribute(arguments[1], arguments[2]);\", el, 'href', './recipes/{}.html'.format(recipeID))\n\n    # removing search bar\n    try: brw.execute_script(\"var element = arguments[0];element.parentNode.removeChild(element);\", brw.find_element(By.TAG_NAME, 'core-search-bar'))\n    except: pass\n\n    # removing scripts\n    for s in brw.find_elements(By.TAG_NAME, 'script'):\n        try: brw.execute_script(\"var element = arguments[0];element.parentNode.removeChild(element);\", s)\n        except: pass\n\n    # saving the list to file\n    listToFile(brw, outputdir)\n\n    # filter recipe Url list because it contains terms-of-use, privacy, disclaimer links too\n    recipesURLs = [l for l in recipesURLs if 'recipe' in l]\n\n    # getting all recipes\n    print(\"Getting all recipes...\")\n    c = 0\n    recipeData = []\n    for recipeURL in recipesURLs:\n        try:\n            # building urls\n            u = str(urlparse(recipeURL).path)\n            if u[0] == '/': u = '.'+u\n            recipeID = u.split('/')[-1:][0]\n            # opening recipe url\n            brw.get(recipeURL)\n            time.sleep(PAGELOAD_TO)\n            # removing the base href header\n            try: brw.execute_script(\"var element = arguments[0];element.parentNode.removeChild(element);\", 
brw.find_element(By.TAG_NAME, 'base'))\n            except: pass\n            # removing the name\n            brw.execute_script(\"var element = arguments[0];element.parentNode.removeChild(element);\", brw.find_element(By.TAG_NAME, 'core-user-profile'))\n            # changing the top url\n            brw.execute_script(\"arguments[0].setAttribute(arguments[1], arguments[2]);\", brw.find_element(By.CLASS_NAME, 'page-header__home'), 'href', '../../index.html')\n            # saving recipe image\n            img_url = brw.find_element(By.ID, 'recipe-card__image-loader').find_element(By.TAG_NAME, 'img').get_attribute('src')\n            local_img_path = imgToFile(outputdir, recipeID, img_url)\n            # change the image url to local\n            brw.execute_script(\"arguments[0].setAttribute(arguments[1], arguments[2]);\", brw.find_element(By.CLASS_NAME, 'core-tile__image'), 'srcset', '')\n            brw.execute_script(\"arguments[0].setAttribute(arguments[1], arguments[2]);\", brw.find_element(By.CLASS_NAME, 'core-tile__image'), 'src', local_img_path)\n            # saving the file\n            recipeToFile(brw, '{}recipes/{}.html'.format(outputdir, recipeID))\n            # extracting JSON info\n            recipe = recipeToJSON(brw, recipeID)\n            # saving JSON file, if needed\n            if separate_json:\n                print('[CD] Writing recipe to JSON file')\n                with open('{}recipes/{}.json'.format(outputdir, recipeID), 'w') as outfile: json.dump(recipe, outfile)\n            else:\n                recipeData.append(recipe)\n            # printing information\n            c += 1\n            if c % 10 == 0: print('Dumped recipes: {}/{}'.format(c, len(recipesURLs)))\n        except: pass\n\n    # save JSON file, if needed\n    if not separate_json:\n        print('[CD] Writing recipes to JSON file')\n        with open('{}data.json'.format(outputdir), 'w') as outfile: json.dump(recipeData, outfile)\n\n    # logging out\n    logoutURL 
= 'https://cookidoo.{}/profile/logout'.format(locale)\n    brw.get(logoutURL)\n    time.sleep(PAGELOAD_TO)\n\n    # closing session\n    print('[CD] Closing session\\n[CD] Goodbye!')\n    brw.close()\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser(description='Dump Cookidoo recipes from a valid account')\n    parser.add_argument('webdriverfile', type=str, help='the path to the Chrome WebDriver file')\n    parser.add_argument('outputdir', type=str, help='the output directory')\n    parser.add_argument('-s', '--separate-json', action='store_true', help='Create a separate JSON file for each recipe; otherwise, a single data file will be generated')\n    args = parser.parse_args()\n    run(args.webdriverfile, args.outputdir, args.separate_json)\n"
  },
  {
    "path": "deploy.sh",
    "content": "#!/bin/bash\n\nNAME=\"cookidump\"\n\nif [ \"$#\" != \"1\" ]; then\n\techo \"Usage: $0 <comment>\"\n\texit 1\nfi\n\ngit add .\ngit commit -m \"$1\"\ngit push -f \"$NAME\" master\n"
  },
  {
    "path": "requirements.txt",
    "content": "beautifulsoup4\nselenium>=4.8.0\n"
  }
]