[
  {
    "path": "exercises/collecting_data/exercise.js",
    "content": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , execute       = require('workshopper-exercise/execute')\n  , comparestdout = require('workshopper-exercise/comparestdout')\n\n// checks that the submission file actually exists\nexercise = filecheck(exercise)\n\n// execute the solution and submission in parallel with spawn()\nexercise = execute(exercise)\n\n// compare stdout of solution and submission\nexercise = comparestdout(exercise)\n\nmodule.exports = exercise\n"
  },
  {
    "path": "exercises/collecting_data/problem.md",
    "content": "Now, we will go through all of the links on the page and create a data table (csv file) out of it.\n\nYou might want to get a div of interest first, and then `find` items inside of it.\n\nFor example, let's get the `a` tag and the `date` in this html:\n\n```html\n<div class=\"title\">\n  <a href=\"http://nodeschool.io/#workshoppers\">Workshoppers</a>\n  <div class=\"date\">\n    Date: February 16, 2012\n  </div>\n</div>\n```\n\nGet the div:\n```js\nvar div = $('div.title')\n```\n\nUse `find` to search the div (more on `find` here: https://github.com/cheeriojs/cheerio#traversing)\n\n```js\nvar a = div.find('a')\nvar date = div.find('.date')\n```\n\nNow create the row:\n```js\nvar row = {\n  href: a.attr('href'),\n  date: date.text()\n}\n\nconsole.log(row)\n```\n\nYou'll need to use this technique in this exercise.\n\n# Exercise\n\nLet's get the top links on the \"science\" subreddit from February 16, 2012:\n\nGo to `http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/` and look at the page source by right-clicking a link and clicking `Inspect Element`\n\nParse the html and create a row for each `a` link on the page. You'll need to use `map` and `console.log`.\n\nEach row should include 3 fields: the visible `score`, the link's `href`, and `contents` (text contents of the `a` tag).\n\nExample:\n```\n{\"score\": \"15\", \"href\": \"/web/20120216223019/http://www.bbc.co.uk/news/science-environment-17015559\", \"content\": \"\\'New frontier\\' of Antarctic lake exploration - What\\'s behind the drive to explore Antarctica\\'s lakes?bbc.co.ukDrJulianBashircommentsharecancel\"}\n{\"score\": \"3\", \"href\": \"/web/20120216223019/http://www.reddit.com/other-link-here\", \"content\": \"Some other link\"}\n```\n\n\n\n\n\n\n"
  },
  {
    "path": "exercises/collecting_data/solution/solution.js",
    "content": "var cheerio = require('cheerio')\nvar got = require('got')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'\n\ngot(URL, function (err, html) {\n  $ = cheerio.load(html)\n  $('.link').map(function (i, el) {\n    el = $(el)\n    var score = el.find('.score.unvoted')\n    var a = el.find('a')\n    var row = {\n      score: score.text(),\n      href: a.attr('href'),\n      content: a.text()\n    }\n    console.log(row)\n  })\n})\n"
  },
  {
    "path": "exercises/menu.json",
    "content": "[\n  \"Requesting Pages\",\n  \"Parsing HTML\",\n  \"Traversing Dom\",\n  \"Collecting Data\",\n  \"Outputting CSV\"\n]\n"
  },
  {
    "path": "exercises/outputting_csv/exercise.js",
    "content": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , execute       = require('workshopper-exercise/execute')\n  , comparestdout = require('workshopper-exercise/comparestdout')\n\n// checks that the submission file actually exists\nexercise = filecheck(exercise)\n\n// execute the solution and submission in parallel with spawn()\nexercise = execute(exercise)\n\n// compare stdout of solution and submission\nexercise = comparestdout(exercise)\n\nmodule.exports = exercise\n"
  },
  {
    "path": "exercises/outputting_csv/problem.md",
    "content": "Say we wanted to put the table into a csv file. To output csv, we can use a format stream:\n\n```js\nvar writer = require('format-data')('csv')\n```\n`writer` is now ready to accept rows to write.\n\nYou can now `write` each row, like so:\n\n```\nwriter.write({ header1: value, header2: value })\n```\n\nAnd then, you can say where the data should go. In this case, we're writing to `process.stdout`:\n```\nwriter.pipe(process.stdout)\n```\n\nYou could also write to a file, like this:\n\n```\nvar fs = require('fs')\nvar file = fs.createWriteStream('output.csv')\nwriter.pipe(file)\n```\n\nFor this exercise, use `process.stdout` for testing.\n\n# Exercise\n\nTransform your file from the last exercise so that instead of `console.log`, it uses a `writer.write`.\n"
  },
  {
    "path": "exercises/outputting_csv/solution/solution.js",
    "content": "var cheerio = require('cheerio')\nvar got = require('got')\nvar writer = require('format-data')('csv')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'\n\ngot(URL, function (err, html) {\n  $ = cheerio.load(html)\n  $('.link').map(function (i, el) {\n    el = $(el)\n    var score = el.find('.score.unvoted')\n    var a = el.find('a')\n    var row = {\n      score: score.text(),\n      href: a.attr('href'),\n      content: a.text()\n    }\n    writer.write(row)\n  })\n})\n\nwriter.pipe(process.stdout)"
  },
  {
    "path": "exercises/parsing_html/exercise.js",
    "content": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , execute       = require('workshopper-exercise/execute')\n  , comparestdout = require('workshopper-exercise/comparestdout')\n\n// checks that the submission file actually exists\nexercise = filecheck(exercise)\n\n// execute the solution and submission in parallel with spawn()\nexercise = execute(exercise)\n\n// compare stdout of solution and submission\nexercise = comparestdout(exercise)\n\nmodule.exports = exercise"
  },
  {
    "path": "exercises/parsing_html/problem.md",
    "content": "# Traversing the dom!\n\n`cheerio` is a library that allows you to use a jQuery-like syntax right here in your terminal. If you know anything about jQuery this one should be easy for you.\n\n# Overview\nLet's start with a simple example. Say we want to access the 'h1' tag in the following html:\n\n```\n<html>\n  <body>\n    <h1>There is no cow level.</h1>\n  </body>\n</html>\n```\n\nWe can use `cheerio` to prepare the html like this:\n\n```\nvar cheerio = require('cheerio')\nvar html = '<html><body><h1>There is no cow level.</h1></body></html>''\nvar $ = cheerio.load(html)\n```\n\nWe can use the new `$` variable to query the html -- to get the `h1` tag, you can use `$('h1')`.\n\nThere are a variety of functions you can use with the object you get (See `https://www.npmjs.com/package/cheerio`):\n\nExample:\n\n  * `$('h1').text()` will return `There is no cow level`.\n  * `$('body').html()` will return `<h1>There is no cow level</h1>`\n\n\n\n# Exercise\n\nGo to the following link in your browser:\n\n`http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/`\n\nTake your file from the last tutorial. Use `got`, `cheerio`, and `console.log` to print out the **readable text** of the website using the `text()` function.\n\n**hint: all of the content is in the `body` tag**\n"
  },
  {
    "path": "exercises/parsing_html/solution/solution.js",
    "content": "var got = require('got')\nvar cheerio = require('cheerio')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'\n\ngot(URL, function (err, html) {\n  var $ = cheerio.load(html)\n  var content = $('body')\n  console.log(content.text())\n})"
  },
  {
    "path": "exercises/requesting_pages/exercise.js",
    "content": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , execute       = require('workshopper-exercise/execute')\n  , comparestdout = require('workshopper-exercise/comparestdout')\n\n// checks that the submission file actually exists\nexercise = filecheck(exercise)\n\n// execute the solution and submission in parallel with spawn()\nexercise = execute(exercise)\n\n// compare stdout of solution and submission\nexercise = comparestdout(exercise)\n\nmodule.exports = exercise\n"
  },
  {
    "path": "exercises/requesting_pages/problem.md",
    "content": "# The project\n\nFirst, we need to know how to grab the contents of the webpage as html. Create a file called 'index.js'. You can use the `got` module to easily retrieve the contents of the webpage. You'll want to `require` the `got` module, like so.\n\n```js\nvar got = require('got')\n```\n\n`got` will go get the webpage, and when it's done, it will call the function (err, html). `err` will be an error object (which we can look at) and `html` will have the page contents.\n\nHere is an example that prints the html from webpage `http://nodeschool.io`.\n\n```js\nvar got = require('got')\n\ngot('http://nodeschool.io', function (err, html) {\n  console.log(html)\n})\n```\n\n# Exercise\n\nLet's look at reddit.com's science subreddit in February, 2012.\n\n`http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/`\n\nUse `got`, and `console.log` to print out the contents of the page.\n\n"
  },
  {
    "path": "exercises/requesting_pages/solution/solution.js",
    "content": "var got = require('got')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'\ngot(URL, function (err, html) {\n  console.log(html)\n})"
  },
  {
    "path": "exercises/traversing_dom/exercise.js",
    "content": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , execute       = require('workshopper-exercise/execute')\n  , comparestdout = require('workshopper-exercise/comparestdout')\n\n// checks that the submission file actually exists\nexercise = filecheck(exercise)\n\n// execute the solution and submission in parallel with spawn()\nexercise = execute(exercise)\n\n// compare stdout of solution and submission\nexercise = comparestdout(exercise)\n\nmodule.exports = exercise\n"
  },
  {
    "path": "exercises/traversing_dom/problem.md",
    "content": "# Loops\n\nOkay, now we want to get all of the links in the page. A link looks like this:\n\n```html\n<a href=\"http://some-other-url.com\">Click me!</a>\n```\n\nIf we want to select an `a` tag and get it's contents, we can do\n\n```js\n$('a').text()\n```\n\nThis might select multiple tags, though, so we might need to go through each of them if you want to do something with them. Here's one way to go through each item:\n\n```js\n$('a').map(function (i, el) {\n  // you can use either 'el' or 'this'\n  $(this).text()\n})\n```\n\n## Exercise\n\nGet all of the `a` elements on the page and `console.log` the `href` attribute. You can grab their `href` attribute using `attr`.\n\nMore info here: `https://github.com/cheeriojs/cheerio#attributes`\n\n\n"
  },
  {
    "path": "exercises/traversing_dom/solution/solution.js",
    "content": "var got = require('got')\nvar cheerio = require('cheerio')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'\n\ngot(URL, function (err, html) {\n  var $ = cheerio.load(html)\n  $('a').map(function (i, el) {\n    console.log($(el).attr('href'))\n  })\n})\n"
  },
  {
    "path": "index.js",
    "content": "#!/usr/bin/env node\n\nvar workshopper = require('workshopper'),\n      path        = require('path')\n\nfunction fpath (f) {\n    return path.join(__dirname, f)\n}\n\nworkshopper({\n    name        : 'nutella-scrape',\n    title       : 'Nutella Scraper',\n    subtitle    : 'Learn how to scrape webpages with Node.js',\n    appDir      : __dirname,\n    menuItems   : [],\n    exerciseDir : fpath('./exercises/')\n})"
  },
  {
    "path": "package.json",
    "content": "{\n  \"name\": \"nutella-scrape\",\n  \"version\": \"1.1.1\",\n  \"description\": \"a nodeschool workshop to teach scraping\",\n  \"main\": \"index.js\",\n  \"scripts\": {\n    \"test\": \"echo \\\"Error: no test specified\\\" && exit 1\"\n  },\n  \"repository\": {\n    \"type\": \"git\",\n    \"url\": \"https://github.com/karissa/nutella-scrape.git\"\n  },\n  \"bin\": \"./index.js\",\n  \"keywords\": [\n    \"nodeschool\",\n    \"scraping\",\n    \"tutorial\",\n    \"school\",\n    \"exercise\",\n    \"help\",\n    \"scrape\"\n  ],\n  \"author\": \"Karissa McKelvey <karissa@karissamck.com> (http://karissamck.com/)\",\n  \"license\": \"BSD-2-Clause\",\n  \"bugs\": {\n    \"url\": \"https://github.com/karissa/nutella-scrape/issues\"\n  },\n  \"preferGlobal\": true,\n  \"homepage\": \"https://github.com/karissa/nutella-scrape\",\n  \"dependencies\": {\n    \"cheerio\": \"^0.19.0\",\n    \"format-data\": \"^2.1.2\",\n    \"got\": \"^4.1.1\",\n    \"workshopper\": \"^2.7.0\",\n    \"workshopper-exercise\": \"^2.4.0\"\n  }\n}\n"
  },
  {
    "path": "readme.md",
    "content": "# nutella-scrape\n\n[![NPM](https://nodei.co/npm/nutella-scrape.png?downloads=true&stars=true&global=true)](https://nodei.co/npm/nutella-scrape/)\n\n![nutella](https://github.com/karissa/nutella-scrape/blob/master/nutella.png)\n\n  1. Run `sudo npm install nutella-scrape -g`\n  2. Run `nutella-scrape`\n  3. ???\n  4. LEARN!!\n\nIn this tutorial, we will work through how to scrape websites using Node.js for the primary purpose of using it in other programs -- in servers, frontends (yes, Node works in the browser!), or just writing a table to disk for analysis elsewhere.\n\nThe DOM (Document Object Model) is an abstract concept describing how we can interact with HTML. JavaScript is GREAT for traversing HTML (i.e., the DOM) because it was made to work with HTML in the first place.\n\n## TODO\n\n* parallel\n* spoofing\n* cookies/login walls\n* electron-microscope"
  }
]