Full Code of karissa/nutella-scrape for AI

master f66b938309ef cached
19 files
11.6 KB
3.8k tokens
1 symbols
1 requests
Download .txt
Repository: karissa/nutella-scrape
Branch: master
Commit: f66b938309ef
Files: 19
Total size: 11.6 KB

Directory structure:
gitextract_esphwsff/

├── exercises/
│   ├── collecting_data/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   ├── menu.json
│   ├── outputting_csv/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   ├── parsing_html/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   ├── requesting_pages/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   └── traversing_dom/
│       ├── exercise.js
│       ├── problem.md
│       └── solution/
│           └── solution.js
├── index.js
├── package.json
└── readme.md

================================================
FILE CONTENTS
================================================

================================================
FILE: exercises/collecting_data/exercise.js
================================================
// Standard workshopper verification pipeline for this exercise.
var exercise = require('workshopper-exercise')()
var filecheck = require('workshopper-exercise/filecheck')
var execute = require('workshopper-exercise/execute')
var comparestdout = require('workshopper-exercise/comparestdout')

// Verify the submission file actually exists before running anything.
exercise = filecheck(exercise)

// Run the official solution and the submission side by side with spawn().
exercise = execute(exercise)

// The exercise passes when both programs print identical stdout.
exercise = comparestdout(exercise)

module.exports = exercise


================================================
FILE: exercises/collecting_data/problem.md
================================================
Now, we will go through all of the links on the page and create a data table (csv file) out of it.

You might want to get a div of interest first, and then `find` items inside of it.

For example, let's get the `a` tag and the `date` in this html:

```html
<div class="title">
  <a href="http://nodeschool.io/#workshoppers">Workshoppers</a>
  <div class="date">
    Date: February 16, 2012
  </div>
</div>
```

Get the div:
```js
var div = $('div.title')
```

Use `find` to search the div (more on `find` here: https://github.com/cheeriojs/cheerio#traversing)

```js
var a = div.find('a')
var date = div.find('.date')
```

Now create the row:
```js
var row = {
  href: a.attr('href'),
  date: date.text()
}

console.log(row)
```

You'll need to use this technique in this exercise.

# Exercise

Let's get the top links on the "science" subreddit from February 16, 2012:

Go to `http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/` and look at the page source by right-clicking a link and clicking `Inspect Element`

Parse the html and create a row for each `a` link on the page. You'll need to use `map` and `console.log`.

Each row should include 3 fields: the visible `score`, the link's `href`, and `content` (the text contents of the `a` tag).

Example:
```
{"score": "15", "href": "/web/20120216223019/http://www.bbc.co.uk/news/science-environment-17015559", "content": "\'New frontier\' of Antarctic lake exploration - What\'s behind the drive to explore Antarctica\'s lakes?bbc.co.ukDrJulianBashircommentsharecancel"}
{"score": "3", "href": "/web/20120216223019/http://www.reddit.com/other-link-here", "content": "Some other link"}
```








================================================
FILE: exercises/collecting_data/solution/solution.js
================================================
var cheerio = require('cheerio')
var got = require('got')

// Archived snapshot of the r/science front page from February 16, 2012.
var URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'

// Fetch the page, then print one row object per reddit ".link" entry.
got(URL, function (err, html) {
  if (err) throw err // fail loudly instead of handing undefined html to cheerio

  var $ = cheerio.load(html) // declared with var — was an accidental implicit global
  // .each (not .map) — we only want the side effect of printing each row.
  $('.link').each(function (i, el) {
    el = $(el)
    var score = el.find('.score.unvoted')
    var a = el.find('a')
    // One row per link: visible score, target href, and the link's text.
    var row = {
      score: score.text(),
      href: a.attr('href'),
      content: a.text()
    }
    console.log(row)
  })
})


================================================
FILE: exercises/menu.json
================================================
[
  "Requesting Pages",
  "Parsing HTML",
  "Traversing Dom",
  "Collecting Data",
  "Outputting CSV"
]


================================================
FILE: exercises/outputting_csv/exercise.js
================================================
// Standard workshopper verification pipeline for this exercise.
var exercise = require('workshopper-exercise')()
var filecheck = require('workshopper-exercise/filecheck')
var execute = require('workshopper-exercise/execute')
var comparestdout = require('workshopper-exercise/comparestdout')

// Verify the submission file actually exists before running anything.
exercise = filecheck(exercise)

// Run the official solution and the submission side by side with spawn().
exercise = execute(exercise)

// The exercise passes when both programs print identical stdout.
exercise = comparestdout(exercise)

module.exports = exercise


================================================
FILE: exercises/outputting_csv/problem.md
================================================
Say we wanted to put the table into a csv file. To output csv, we can use a format stream:

```js
var writer = require('format-data')('csv')
```
`writer` is now ready to accept rows to write.

You can now `write` each row, like so:

```
writer.write({ header1: value, header2: value })
```

And then, you can say where the data should go. In this case, we're writing to `process.stdout`:
```
writer.pipe(process.stdout)
```

You could also write to a file, like this:

```
var fs = require('fs')
var file = fs.createWriteStream('output.csv')
writer.pipe(file)
```

For this exercise, use `process.stdout` for testing.

# Exercise

Transform your file from the last exercise so that instead of `console.log`, it uses a `writer.write`.


================================================
FILE: exercises/outputting_csv/solution/solution.js
================================================
var cheerio = require('cheerio')
var got = require('got')
var writer = require('format-data')('csv') // stream that serializes written rows as CSV

// Archived snapshot of the r/science front page from February 16, 2012.
var URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'

// Attach the CSV stream to stdout before any rows are written.
writer.pipe(process.stdout)

got(URL, function (err, html) {
  if (err) throw err // fail loudly instead of handing undefined html to cheerio

  var $ = cheerio.load(html) // declared with var — was an accidental implicit global
  // .each (not .map) — we only want the side effect of writing each row.
  $('.link').each(function (i, el) {
    el = $(el)
    var score = el.find('.score.unvoted')
    var a = el.find('a')
    var row = {
      score: score.text(),
      href: a.attr('href'),
      content: a.text()
    }
    writer.write(row)
  })
  writer.end() // signal end-of-data so the CSV stream can flush and finish
})

================================================
FILE: exercises/parsing_html/exercise.js
================================================
// Standard workshopper verification pipeline for this exercise.
var exercise = require('workshopper-exercise')()
var filecheck = require('workshopper-exercise/filecheck')
var execute = require('workshopper-exercise/execute')
var comparestdout = require('workshopper-exercise/comparestdout')

// Verify the submission file actually exists before running anything.
exercise = filecheck(exercise)

// Run the official solution and the submission side by side with spawn().
exercise = execute(exercise)

// The exercise passes when both programs print identical stdout.
exercise = comparestdout(exercise)

module.exports = exercise

================================================
FILE: exercises/parsing_html/problem.md
================================================
# Parsing HTML!

`cheerio` is a library that allows you to use a jQuery-like syntax right here in your terminal. If you know anything about jQuery this one should be easy for you.

# Overview
Let's start with a simple example. Say we want to access the 'h1' tag in the following html:

```
<html>
  <body>
    <h1>There is no cow level.</h1>
  </body>
</html>
```

We can use `cheerio` to prepare the html like this:

```
var cheerio = require('cheerio')
var html = '<html><body><h1>There is no cow level.</h1></body></html>'
var $ = cheerio.load(html)
```

We can use the new `$` variable to query the html -- to get the `h1` tag, you can use `$('h1')`.

There are a variety of functions you can use with the object you get (See `https://www.npmjs.com/package/cheerio`):

Example:

  * `$('h1').text()` will return `There is no cow level.`
  * `$('body').html()` will return `<h1>There is no cow level.</h1>`



# Exercise

Go to the following link in your browser:

`http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/`

Take your file from the last tutorial. Use `got`, `cheerio`, and `console.log` to print out the **readable text** of the website using the `text()` function.

**hint: all of the content is in the `body` tag**


================================================
FILE: exercises/parsing_html/solution/solution.js
================================================
var got = require('got')
var cheerio = require('cheerio')

// Archived snapshot of the r/science front page from February 16, 2012.
var URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'

// Fetch the page and print the readable text of its <body>.
got(URL, function (err, html) {
  if (err) throw err // surface request failures instead of silently ignoring them

  var $ = cheerio.load(html)
  var content = $('body')
  console.log(content.text())
})

================================================
FILE: exercises/requesting_pages/exercise.js
================================================
// Standard workshopper verification pipeline for this exercise.
var exercise = require('workshopper-exercise')()
var filecheck = require('workshopper-exercise/filecheck')
var execute = require('workshopper-exercise/execute')
var comparestdout = require('workshopper-exercise/comparestdout')

// Verify the submission file actually exists before running anything.
exercise = filecheck(exercise)

// Run the official solution and the submission side by side with spawn().
exercise = execute(exercise)

// The exercise passes when both programs print identical stdout.
exercise = comparestdout(exercise)

module.exports = exercise


================================================
FILE: exercises/requesting_pages/problem.md
================================================
# The project

First, we need to know how to grab the contents of the webpage as html. Create a file called 'index.js'. You can use the `got` module to easily retrieve the contents of the webpage. You'll want to `require` the `got` module, like so.

```js
var got = require('got')
```

`got` will go get the webpage, and when it's done, it will call your callback with `(err, html)`. `err` will be an error object (or `null` on success) and `html` will have the page contents.

Here is an example that prints the html from webpage `http://nodeschool.io`.

```js
var got = require('got')

got('http://nodeschool.io', function (err, html) {
  console.log(html)
})
```

# Exercise

Let's look at reddit.com's science subreddit in February, 2012.

`http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/`

Use `got`, and `console.log` to print out the contents of the page.



================================================
FILE: exercises/requesting_pages/solution/solution.js
================================================
var got = require('got')

// Archived snapshot of the r/science front page from February 16, 2012.
var URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'

// Fetch the page and print its raw HTML.
got(URL, function (err, html) {
  if (err) throw err // surface request failures instead of printing undefined

  console.log(html)
})

================================================
FILE: exercises/traversing_dom/exercise.js
================================================
// Standard workshopper verification pipeline for this exercise.
var exercise = require('workshopper-exercise')()
var filecheck = require('workshopper-exercise/filecheck')
var execute = require('workshopper-exercise/execute')
var comparestdout = require('workshopper-exercise/comparestdout')

// Verify the submission file actually exists before running anything.
exercise = filecheck(exercise)

// Run the official solution and the submission side by side with spawn().
exercise = execute(exercise)

// The exercise passes when both programs print identical stdout.
exercise = comparestdout(exercise)

module.exports = exercise


================================================
FILE: exercises/traversing_dom/problem.md
================================================
# Loops

Okay, now we want to get all of the links in the page. A link looks like this:

```html
<a href="http://some-other-url.com">Click me!</a>
```

If we want to select an `a` tag and get its contents, we can do

```js
$('a').text()
```

This might select multiple tags, though, so you might need to go through each of them if you want to do something with them. Here's one way to go through each item:

```js
$('a').map(function (i, el) {
  // you can use either 'el' or 'this'
  $(this).text()
})
```

## Exercise

Get all of the `a` elements on the page and `console.log` the `href` attribute. You can grab their `href` attribute using `attr`.

More info here: `https://github.com/cheeriojs/cheerio#attributes`




================================================
FILE: exercises/traversing_dom/solution/solution.js
================================================
var got = require('got')
var cheerio = require('cheerio')

// Archived snapshot of the r/science front page from February 16, 2012.
var URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'

// Fetch the page and print the href attribute of every <a> tag.
got(URL, function (err, html) {
  if (err) throw err // surface request failures instead of handing undefined to cheerio

  var $ = cheerio.load(html)
  // .each (not .map) — we only want the side effect of printing each href.
  $('a').each(function (i, el) {
    console.log($(el).attr('href'))
  })
})


================================================
FILE: index.js
================================================
#!/usr/bin/env node

// Entry point: launches the nutella-scrape workshopper menu.
var workshopper = require('workshopper')
var path = require('path')

// Resolve a path relative to this file's directory.
function fpath (f) {
  return path.join(__dirname, f)
}

workshopper({
  name: 'nutella-scrape',
  title: 'Nutella Scraper',
  subtitle: 'Learn how to scrape webpages with Node.js',
  appDir: __dirname,
  menuItems: [],
  exerciseDir: fpath('./exercises/')
})

================================================
FILE: package.json
================================================
{
  "name": "nutella-scrape",
  "version": "1.1.1",
  "description": "a nodeschool workshop to teach scraping",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "repository": {
    "type": "git",
    "url": "https://github.com/karissa/nutella-scrape.git"
  },
  "bin": "./index.js",
  "keywords": [
    "nodeschool",
    "scraping",
    "tutorial",
    "school",
    "exercise",
    "help",
    "scrape"
  ],
  "author": "Karissa McKelvey <karissa@karissamck.com> (http://karissamck.com/)",
  "license": "BSD-2-Clause",
  "bugs": {
    "url": "https://github.com/karissa/nutella-scrape/issues"
  },
  "preferGlobal": true,
  "homepage": "https://github.com/karissa/nutella-scrape",
  "dependencies": {
    "cheerio": "^0.19.0",
    "format-data": "^2.1.2",
    "got": "^4.1.1",
    "workshopper": "^2.7.0",
    "workshopper-exercise": "^2.4.0"
  }
}


================================================
FILE: readme.md
================================================
# nutella-scrape

[![NPM](https://nodei.co/npm/nutella-scrape.png?downloads=true&stars=true&global=true)](https://nodei.co/npm/nutella-scrape/)

![nutella](https://github.com/karissa/nutella-scrape/blob/master/nutella.png)

  1. Run `sudo npm install nutella-scrape -g`
  2. Run `nutella-scrape`
  3. ???
  4. LEARN!!

In this tutorial, we will work through how to scrape websites using Node.js for the primary purpose of using it in other programs -- in servers, frontends (yes, Node works in the browser!), or just writing a table to disk for analysis elsewhere.

The DOM (Document Object Model) is an abstract concept describing how we can interact with HTML. JavaScript is GREAT for traversing HTML (i.e., the DOM) because it was made to work with HTML in the first place.

## TODO

* parallel
* spoofing
* cookies/login walls
* electron-microscope
Download .txt
gitextract_esphwsff/

├── exercises/
│   ├── collecting_data/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   ├── menu.json
│   ├── outputting_csv/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   ├── parsing_html/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   ├── requesting_pages/
│   │   ├── exercise.js
│   │   ├── problem.md
│   │   └── solution/
│   │       └── solution.js
│   └── traversing_dom/
│       ├── exercise.js
│       ├── problem.md
│       └── solution/
│           └── solution.js
├── index.js
├── package.json
└── readme.md
Download .txt
SYMBOL INDEX (1 symbols across 1 files)

FILE: index.js
  function fpath (line 6) | function fpath (f) {
Condensed preview — 19 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (14K chars).
[
  {
    "path": "exercises/collecting_data/exercise.js",
    "chars": 527,
    "preview": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , "
  },
  {
    "path": "exercises/collecting_data/problem.md",
    "chars": 1669,
    "preview": "Now, we will go through all of the links on the page and create a data table (csv file) out of it.\n\nYou might want to ge"
  },
  {
    "path": "exercises/collecting_data/solution/solution.js",
    "chars": 452,
    "preview": "var cheerio = require('cheerio')\nvar got = require('got')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://w"
  },
  {
    "path": "exercises/menu.json",
    "chars": 104,
    "preview": "[\n  \"Requesting Pages\",\n  \"Parsing HTML\",\n  \"Traversing Dom\",\n  \"Collecting Data\",\n  \"Outputting CSV\"\n]\n"
  },
  {
    "path": "exercises/outputting_csv/exercise.js",
    "chars": 527,
    "preview": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , "
  },
  {
    "path": "exercises/outputting_csv/problem.md",
    "chars": 734,
    "preview": "Say we wanted to put the table into a csv file. To output csv, we can use a format stream:\n\n```js\nvar writer = require('"
  },
  {
    "path": "exercises/outputting_csv/solution/solution.js",
    "chars": 524,
    "preview": "var cheerio = require('cheerio')\nvar got = require('got')\nvar writer = require('format-data')('csv')\n\nvar URL = 'http://"
  },
  {
    "path": "exercises/parsing_html/exercise.js",
    "chars": 526,
    "preview": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , "
  },
  {
    "path": "exercises/parsing_html/problem.md",
    "chars": 1261,
    "preview": "# Traversing the dom!\n\n`cheerio` is a library that allows you to use a jQuery-like syntax right here in your terminal. I"
  },
  {
    "path": "exercises/parsing_html/solution/solution.js",
    "chars": 266,
    "preview": "var got = require('got')\nvar cheerio = require('cheerio')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://w"
  },
  {
    "path": "exercises/requesting_pages/exercise.js",
    "chars": 527,
    "preview": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , "
  },
  {
    "path": "exercises/requesting_pages/problem.md",
    "chars": 883,
    "preview": "# The project\n\nFirst, we need to know how to grab the contents of the webpage as html. Create a file called 'index.js'. "
  },
  {
    "path": "exercises/requesting_pages/solution/solution.js",
    "chars": 167,
    "preview": "var got = require('got')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://www.reddit.com/r/science/'\ngot(URL"
  },
  {
    "path": "exercises/traversing_dom/exercise.js",
    "chars": 527,
    "preview": "var exercise      = require('workshopper-exercise')()\n  , filecheck     = require('workshopper-exercise/filecheck')\n  , "
  },
  {
    "path": "exercises/traversing_dom/problem.md",
    "chars": 720,
    "preview": "# Loops\n\nOkay, now we want to get all of the links in the page. A link looks like this:\n\n```html\n<a href=\"http://some-ot"
  },
  {
    "path": "exercises/traversing_dom/solution/solution.js",
    "chars": 284,
    "preview": "var got = require('got')\nvar cheerio = require('cheerio')\n\nvar URL = 'http://web.archive.org/web/20120216223019/http://w"
  },
  {
    "path": "index.js",
    "chars": 402,
    "preview": "#!/usr/bin/env node\n\nvar workshopper = require('workshopper'),\n      path        = require('path')\n\nfunction fpath (f) {"
  },
  {
    "path": "package.json",
    "chars": 903,
    "preview": "{\n  \"name\": \"nutella-scrape\",\n  \"version\": \"1.1.1\",\n  \"description\": \"a nodeschool workshop to teach scraping\",\n  \"main\""
  },
  {
    "path": "readme.md",
    "chars": 852,
    "preview": "# nutella-scrape\n\n[![NPM](https://nodei.co/npm/nutella-scrape.png?downloads=true&stars=true&global=true)](https://nodei."
  }
]

About this extraction

This page contains the full source code of the karissa/nutella-scrape GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 19 files (11.6 KB), approximately 3.8k tokens, and a symbol index with 1 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Copied to clipboard!