Repository: gahabeen/jsonframe-cheerio Branch: master Commit: 81a33fa84ad2 Files: 7 Total size: 45.9 KB Directory structure: gitextract_pll6rnwq/ ├── .travis.yml ├── LICENSE ├── README.md ├── index.js ├── package.json └── tests/ ├── index.test.js └── playground/ └── html/ └── company.html ================================================ FILE CONTENTS ================================================ ================================================ FILE: .travis.yml ================================================ language: node_js node_js: - "7.2" before_script: - npm install cheerio ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2017 Gabin Desserprit Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # This repository is deprecated. Use it at your own risk! 
--- [![NPM](https://nodei.co/npm/jsonframe-cheerio.png?downloads=true&downloadRank=true&stars=true)](https://nodei.co/npm/jsonframe-cheerio/)

jsonframe

simple multi-level scraper json input/output

npm jsonframe-cheerio a Cheerio Plugin

## **2.0.5x** features

😍 **JSON Syntax**: input json, output the same structured json, filled with the scraped data

🌈 **Simple patterns**: simple inline `selectors`, `extractors`, `filters` and `parser`.

💪 **Reliable & fast**: used in production within crawlers

[See the full changelog](#changelog)

## Example

```js
let cheerio = require('cheerio')
let $ = cheerio.load(`
<body>
  <h1>I love jsonframe!</h1>
  <span itemprop="email"> Email: gabin@datascraper.pro </span>
</body>`)

let jsonframe = require('jsonframe-cheerio')
jsonframe($) // initializing the plugin

let frame = {
  "title": "h1", // this is an inline selector
  "email": "span[itemprop=email] < email" // output an extracted email
}

console.log( $('body').scrape(frame, { string: true } ))
/*=>
{
  "title": "I love jsonframe!",
  "email": "gabin@datascraper.pro"
}
*/
```

## Use

**Install the plugin** in your Node.js app through **NPM**

```
npm i jsonframe-cheerio --save
```

## API

### Loading

Start by `loading Cheerio`.

```js
let cheerio = require('cheerio')
let $ = cheerio.load("HTML DOM to load") // See Cheerio API
```

Then `load the jsonframe plugin`.

```js
let jsonframe = require('jsonframe-cheerio') // require from npm package
jsonframe($) // apply the plugin to the current Cheerio instance
```

### Scraper

Once the plugin is loaded, you first have to set the **frame** of your data.

Let's take the following `HTML example`:

```html

Pricing

A Link We are the 04/02/2017
Phone USA: (912) 148-456
Phone FR: +332 38 30 37 90
Email: lspurcell@suddenlink.net
```

#### $( selector ).scrape( frame , {options})

`selector` is defined in [Cheerio's documentation](https://github.com/cheeriojs/cheerio#-selector-context-root-)

`frame` is a JSON or Javascript Object

`{options}` are detailed [later in their own section](#options)

```js
let frame = {
  "title": "h2" // CSS selector
}
```

We then pass the frame to the function:

```js
let result = $('body').scrape(frame, { string: true })
console.log( result )
//=> {"title": "Pricing"}
```

### Frame

#### Inline Selector

The most common form is the `inline` selector: specify nothing more than the data property name as the key and the selector as its value.

```js
...
let frame = { "title": "h2" }

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  { "title": "Pricing" }
*/
...
```

#### New : Inline attribute / extractor / parser

You can now declare everything inline. Just be careful to always use the markers in the following order when combining them: `@` (attribute), `<` (extractor), `|` (filter), `||` (parse).

_See examples for each of them below._

#### Attribute

`_a: "attributeName"` allows you to retrieve `any attribute data`

`@` inside the selector `_s` allows you to do it inline

```js
...
let frame = {
  "proPrice": ".planName:contains('Pro') + span@price"
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  { "proPrice": "39.00" }
*/
...
```

#### Extractor

`<` inside the selector `_s` allows you to do it inline

It currently supports `email` (also `mail`), `telephone` (also `phone`), `date`, `fullName` (or `firstName`, `lastName`, `initials`, `suffix`, `salutation`) and `html` (to get the inner html). By default (no declaration), we get the `inner text`.

```js
...
let frame = {
  "email": "[itemprop=email] < email",
  "frphone": "[itemprop=frphone] < phone"
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  {
    "email": "lspurcell@suddenlink.net",
    "frphone": "33238303790"
  }
*/
...
```

#### Filter

`|` inside the selector `_s` allows you to do it inline

It currently supports `trim` (remove spaces at beginning and end), `lowercase or lcase`, `uppercase or ucase`, `capitalize or cap`, `words or w`, `noescapchar or nec`, `compact or cmp` and `number or nb`.

```js
...
let frame = {
  "email1": "[itemprop=email] < email | uppercase",
  "email2": "[itemprop=email] < email | capitalize"
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  {
    "email1": "LSPURCELL@SUDDENLINK.NET",
    "email2": "Lspurcell Suddenlink Net"
  }
*/
...
```

#### Parse / Regex

`||` inside the selector `_s` allows you to use regexes inline

`_p: /regex/` allows you to extract data based on **regular expressions**

```js
...
let frame = {
  "date": ".date || \\d{1,2}/\\d{1,2}/\\d{2,4}"
}

// or use the longer version for proper regex entry
frame = {
  "date": {
    _s: ".date",
    _p: /\d{1,2}\/\d{1,2}\/\d{2,4}/ // n[n]/n[n]/nn[nn] format here
  }
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  { "date": "04/02/2017" }
*/
...
```

#### List / Array

`_d: [{ }]` allows you to get an `array / list of data`

`_d: ["selector"]` retrieves a list based on the selector between the quotes.

`_d: ["firstSelector", "secondSelector"]` works too and merges the results into one array

You can shorten it even more by listing right from the selector as follows:
`"selectorName": [".selector"]`, which returns an array of strings

```js
...
let frame = {
  "pricing": {
    _s: "#pricing .item",
    _d: [{
      "name": ".planName",
      "price": ".planPrice"
    }]
  }
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  {
    "pricing": [
      { "name": "Hacker", "price": "Free" },
      { "name": "Pro", "price": "$39" }
    ]
  }
*/

// Or a shorter way which works for simple string arrays
frame = {
  "pricingNames": ["#pricing .item .planName"]
}

result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  { "pricingNames": ["Hacker", "Pro"] }
*/
...
```

#### Grouped

`"_g": { _s: "", _d: {} }` allows you to group some data selectors under a parent selector without naming the parent.

You can also extend the group property name to add some meaning, or simply to have several groups at the same level.
The group property name must be `_g` or `_group`, optionally followed by `_` and whatever string you want.
ex: `_g_head : {}` or `_g_body : {}`

```js
...
let frame = {
  _g: {
    _s: "#pricing .item",
    _d: {
      "name": ".planName",
      "price": ".planPrice"
    }
  },
  _g_second: {
    _s: "#pricing .item",
    _d: {
      "secondName": ".planName",
      "secondPrice": ".planPrice"
    }
  }
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  {
    "name": "Hacker",
    "price": "Free",
    "secondName": "Hacker",
    "secondPrice": "Free"
  }
*/
...
```

#### Nested

`"parent": { _s: "parentSelector", _d: {} }` allows you to segment your data by `setting a parent section` from which the child data will be scraped.

You can also use `"parent": { }` when you only want to nest data into objects without setting a parent selector.

```js
...
let frame = {
  "pricing": {
    _s: "#pricing .item",
    _d: {
      "name": ".planName",
      "price": ".planPrice"
    }
  }
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  {
    "pricing": {
      "name": "Hacker",
      "price": "Free"
    }
  }
*/
...
```

> Note here that we only get the first node matching the parent selector (#pricing .item).

#### Example

See how you can properly `structure your data`, ready for the output!
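
Before the full frame, here is a small standalone sketch of how the inline markers used throughout this section (`@`, `<`, `|`, `||`) can be split off a selector string. It is loosely modelled on the splitting done in `index.js`, but `splitFromEnd` and `parseInlineSelector` are hypothetical names chosen for illustration and are not part of the plugin's API. The sketch is naive: it assumes the markers do not also appear inside the CSS selector itself.

```javascript
// Hypothetical helper, not part of jsonframe-cheerio.
// Splits `str` at the LAST occurrence of `marker` and trims both halves.
function splitFromEnd(str, marker) {
  const at = str.lastIndexOf(marker)
  return [str.slice(0, at).trim(), str.slice(at + marker.length).trim()]
}

// Naive inline-selector parser: peels `||`, `|`, `<` and `@` off the end
// of the string, in that order, and keeps what is left as the CSS selector.
function parseInlineSelector(input) {
  const parts = {
    selector: input.trim(),
    attribute: null,
    extractor: null,
    filter: null,
    parser: null
  }

  // Removes the trailing `marker` section from the selector, if present.
  const take = function (marker) {
    if (!parts.selector.includes(marker)) return null
    const pieces = splitFromEnd(parts.selector, marker)
    parts.selector = pieces[0]
    return pieces[1]
  }

  parts.parser = take('||')
  const filter = take('|')
  const extractor = take('<')
  parts.attribute = take('@')

  // Filters and extractors may be chained, separated by whitespace.
  parts.filter = filter ? filter.split(/\s+/) : null
  parts.extractor = extractor ? extractor.split(/\s+/) : null
  return parts
}

console.log(parseInlineSelector(".planPrice @ price < numbers | trim compact || \\d+"))
```

Peeling from the right is what makes the documented order (`@`, then `<`, then `|`, then `||`) matter: a marker written out of order would end up inside the wrong part. The full frame for the pricing block: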
```js
...
let frame = {
  "pricing": {
    _s: "#pricing .item",
    _d: [{
      "name": ".planName",
      "price": ".planPrice @ price",
      "image": {
        "url": "img @ src",
        "link": "a @ href"
      }
    }]
  }
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
  {
    "pricing": [
      {
        "name": "Hacker",
        "price": "0",
        "image": {
          "url": "./img/hacker.png",
          "link": "/hacker"
        }
      },
      {
        "name": "Pro",
        "price": "39.00",
        "image": {
          "url": "./img/pro.png",
          "link": "/pro"
        }
      }
    ]
  }
*/
...
```

> Note here that, because `_d` is an array, we get every node matching the parent selector (#pricing .item), not only the first one.

### Options

```js
...
let frame = {
  "proPrice": {
    _s: ".planName:contains('Pro') + span",
    _a: "price"
  }
}

let result = $('body')
  .scrape(frame, {
    timestats: true, // default: false
    string: true // default: false
  })
console.log(result)

/* output =>
  {
    "proPrice": {
      "value": "39.00",
      "_timestats": "1" // ms
    }
  }
*/
...
```

## Tests

One shot tests

```bash
npm run test
```

Watching tests on updates

```bash
npm run test-watch
```

## Changelog

⚠ Careful if you've been using **jsonframe** since **version 1.x.x**: some things changed to make it more **flexible**, **faster to use (inline parameters)** and **more meaningful in the syntax**.

**2.0.52** (28/02/2017)
- Update the email regex
- Update the website regex
- Fix array into array results
- Improving script efficiency getting data from node(s)
- Fix date extractor when no date to extract

**2.0.51** (27/02/2017)
- Fix a fatal error (argh) which was just a typo about the new chained extractors

**2.0.50** (27/02/2017)
- Extractors chaining is now possible.
For ex: `.selector < html email` would work

**2.0.49** (27/02/2017)
- Fixing issue when attribute doesn't exist (@ attributeName)
- Improving array of object management (still need to find a way to avoid empty objects)

**2.0.48** (27/02/2017)
- Add Filter `split(char)` to split string based on character (default to whitespace)
- Add Extractor `numbers or nb` (return potentially an array)
- Update Filter `numbers or nb` (simply filter the string to output only numbers)
- Add Filter `between(string1&&string2)` to filter data by starting and finishing string
- Add Filter `before(string)` to get data before a string
- Add Filter `after(string)` to get data after a string
- Add array support to Filter `left(nb)` and `right(nb)` (slice the array elements)
- Add Filter `fromto(startNb,endNb)` to either slice an array or a string from index to index
- Add Filter `get(nb)` to extract either an array item or a character from a string

**2.0.46** (26/02/2017)
- Rebuild of the Unstructured scraper with breaks (_b) - Works like a charm now!

**2.0.45** (25/02/2017)
- Fix weird fullName parsing in some cases
- Update Handle of Regex - Is now able to capture a group with a regex

**2.0.44** (24/02/2017)
- Inline array for extractors like `"mails": [".parentSelector < email"]`
- Adds french words: `prenom` and `nom` to humanname extractor
- Add filters: `right(number)`, `left(number)`
- Set a stricter regex for email extractor `/([a-zA-Z0-9._-]{0,30}@[a-zA-Z0-9._-]{0,15}\.[a-zA-Z0-9._-]{0,15})/gmi`

**2.0.3** (23/02/2017)
- Possibility to scrape unstructured data with breaks (`_b`). More about this soon in the readme.
- New filters: `words or w`, `noescapchar or nec` and `compact or cmp`
- Multi-filters are available now. Ex: `.selector | words compact`. Simply separated by spaces.
- Disabling google libphonenumber for now

**2.0.2** (15/02/2017)
- String option to get a stringified output right away
- Multi-groups possibility at same level (several _g wouldn't work as same property name) in frame like _g_head and _g_body for example
- Joined arrays/lists with ["firstlist.selector", "secondlist.selector", "..."] when inline
- Better handling of img node - automatic src attribute is output (if nothing else set)

**2.0.1** (14/02/2017)
- Fixed the non-passing tests and added all the new ones for 2.x.x updates
- Refactoring the way data is processed for future multiple occurrences

**2.0.0** (12/02/2017)
- ⚠ Changing ~~`Type`~~ for `Extractor` with shortcode `<` instead of `|`
- ⚠ `filters` with the shortcode `|`
- Inline parameters support for `"attribute"`, `"extractor"` and `"parse"`
- Simple string arrays from inline selector
- Group property to group data selectors without naming the group (children take the place of the group property `"_g"` or `"_group"` )

**1.1.1** (05/02/2017)
- Short & functional parameters ( `_s`, `_t`, `_a`) instead of `"selector"`, `"extractor"`, `"attr"`. The idea is to easily differentiate **retrieved data names** from **functional data**.
- Automatic handler for `img` selected element (automatically retrieve the img src link)
- `_parent_` selector to target the **parent content**
- A **regex parser** with the functional parameter **parse**: `_p` (`_parse` works too)
- **Extractor** `_t: "html"` feature to get back **inner html of a selector**
- Added **timestats** to measure time spent on each node via `.scrape(frame, {timestats: true})`
- Refactoring of the whole code to make it easier to evolve (DRY)
- Update of the test cases accordingly

**1.0.0** (27/01/2017)
- Stable version release with basic features

## Contributing 🤝

> Feel free to follow the procedure to make it even more awesome!

1. Create an `issue` so we `get the discussion started`
2. Fork it!
3. 
Create your feature branch: `git checkout -b my-new-feature` 4. Commit your changes: `git commit -am 'Add some feature'` 5. Push to the branch: `git push origin my-new-feature` 6. Submit a pull request :D ## License [Gabin Desserprit](mailto:gabin@datascraper.pro) - [datascraper.pro](datascraper.pro) Released under MIT License ================================================ FILE: index.js ================================================ 'use strict' const _ = require('lodash') const chrono = require('chrono-node') const humanname = require('humanname') const addressit = require('addressit') // const phoneUtil = require('google-libphonenumber').PhoneNumberUtil.getInstance() let parseData = function (data, regex, { multiple = false } = {}) { let result = data let extracted if (regex) { try { let rgx = regex if (_.isString(regex)) { rgx = new RegExp(regex, 'gim') } extracted = rgx.exec(data) if (multiple) { result = extracted // result = data.match(rgx) } else { if (extracted[1]) { result = extracted[1] } else { result = extracted[0] } // result = data.match(rgx)[0] } } catch (error) { // console.log("Regex error: ", error) } } return result } let filterData = function (data, filter) { let paranthethisRegex = /(?:\()(.+)(?:\))/gim let result = data if (["raw"].includes(filter)) { // let the raw data } else if (filter && filter.includes("split")) { let splitValue = paranthethisRegex.exec(filter) if (splitValue && splitValue[1]) { result = result.split(splitValue[1]) } else { result = result.split(" ") } result = result.filter(function (x) { return x !== "" }) result = result.map(function (x) { return x.trim() }) } else if (filter && filter.includes("between")) { let betweenValues = paranthethisRegex.exec(filter) if (betweenValues && betweenValues[1]) { betweenValues = betweenValues[1].split("&&") if (betweenValues.length > 1) { result = result.split(betweenValues[0].replace(/_/gm, " ").trim()).pop().split(betweenValues[1].replace(/_/gm, " ").trim()).shift().trim() || 
"" } } } else if (filter && filter.includes("after")) { let afterValue = paranthethisRegex.exec(filter) if (afterValue && afterValue[1]) { result = result.split(afterValue[1].replace(/_/gm, " ").trim()).pop().trim() || "" } } else if (filter && filter.includes("before")) { let beforeValue = paranthethisRegex.exec(filter) if (beforeValue && beforeValue[1]) { result = result.split(beforeValue[1].replace(/_/gm, " ").trim()).shift().trim() || "" } } else if (filter && filter.includes("css")) { // let cssValue = paranthethisRegex.exec(filter) // if(cssValue && cssValue[1]){ // result = result.split(cssValue[1].trim()).pop().split(",",1).shift().trim() || "" // } } else if (["trim"].includes(filter)) { result = result.trim() } else if (filter && filter.includes("join") && _.isArray(result)) { let joinChar = paranthethisRegex.exec(filter) if (joinChar && joinChar[1]) { result = result.join(joinChar[1].replace(/_/gm, " ")) } else { result = result.join(" ") } } else if (["lowercase", "lcase"].includes(filter)) { result = result.toLowerCase() } else if (["uppercase", "ucase"].includes(filter)) { result = result.toUpperCase() } else if (["capitalize", "cap"].includes(filter)) { result = _.startCase(result) } else if (["number", "nb"].includes(filter)) { result = result.match(/\d+/gm) result = result.join(" ") } else if (["words", "w"].includes(filter)) { result = result.replace(/\W/gm, " ") } else if (["noescapchar", "nec"].includes(filter)) { result = result.replace(/\t+|\n+|\r+/gm, " ") } else if (filter && filter.includes("right")) { let regexified = filter.match(/\d+/g) if (regexified && regexified[0]) { let nb = regexified[0] if (_.isArray(result)) { result = result.slice(result.length - nb, result.length) } else { result = result.substr(result.length - nb) } } } else if (filter && filter.includes("left")) { let regexified = filter.match(/\d+/g) if (regexified && regexified[0]) { let nb = regexified[0] if (_.isArray(result)) { result = result.slice(0, nb) } else { 
result = result.substr(0, nb) } } } else if (filter && filter.includes("fromto")) { let regexified = paranthethisRegex.exec(filter) if (regexified && regexified[1]) { let nbs = regexified[1].split(/[,-]/gim) let start, end if (nbs.length > 1) { start = parseInt(nbs[0].trim()) end = parseInt(nbs[1].trim()) if (_.isArray(result)) { result = result.slice(start, end + 1) } else { result = result.substr(start, end) } } } } else if (filter && filter.includes("get")) { let regexified = filter.match(/\d+/g) if (regexified && regexified[0]) { let nb = regexified[0] if (_.isArray(result)) { result = result[nb] } else { result = result.charAt(nb) } } //default } else if (["compact", "cmp"].includes(filter) || !filter) { result = result.replace(/\s+/gm, " ").trim() } return result } let extractByExtractor = function (data, extractor, { multiple = false } = {}) { let result = data let emailRegex = /([a-zA-Z0-9._-]{1,30}@[a-zA-Z0-9._-]{2,15}\.[a-zA-Z0-9._-]{2,15})/gmi let phoneRegex = /\+?\(?\d*\)? ?\(?\d+\)?\d*([\s./-]\d{2,})+/gmi let websiteRegex = /(?:[\s\W])((https?:\/\/)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b[-a-zA-Z0-9@:%_\+.~#?&/=]*)/gmi if (["phone", "telephone"].includes(extractor)) { if (multiple) { result = data.match(phoneRegex) || "" } else { result = data.match(phoneRegex) !== null ? data.match(phoneRegex)[0] : "" } } else if (["numbers", "nb"].includes(extractor)) { if (multiple) { result = result.match(/\d+/gm) || "" } else { result = result.match(/\d+/gm) !== null ? result.match(/\d+/gm)[0] : "" } } else if (["website"].includes(extractor)) { let websites = data.match(websiteRegex) if (websites && websites.length > 0) { websites = websites.map(function (x) { return x.substr(1, x.length) // remove first character }) if (multiple) { result = websites || "" } else { result = websites !== null ? 
websites[0] : "" } } } else if (["address", "add"].includes(extractor)) { result = addressit(data) } else if (["email", "mail", "@"].includes(extractor)) { if (multiple) { result = data.match(emailRegex) || data if (_.isArray(result) && result.length === 1) { result = result[0] } } else { result = data.match(emailRegex) !== null ? data.match(emailRegex)[0] : "" } } else if (["date", "d"].includes(extractor)) { let date = chrono.casual.parseDate(data) if (date) { result = date.toString() } else { result = "" } } else if (["fullName", "prenom", "firstName", "nom", "lastName", "initials", "suffix", "salutation"].includes(extractor)) { // compact data before to parse it result = humanname.parse(filterData(data, "cmp")) if ("fullName".includes(extractor)) { // return the object } else if (["firstName", "prenom"].includes(extractor)) { result = result.firstName } else if (["lastName", "nom"].includes(extractor)) { result = result.lastName } else if ("initials".includes(extractor)) { result = result.initials } else if ("suffix".includes(extractor)) { result = result.suffix } else if ("salutation".includes(extractor)) { result = result.salutation } } return result } let isAGroupKey = function (groupKey) { let groupProperties = ['_g', '_group', '_groupe'] let isAGroup = false groupProperties.forEach(function (value) { if (value === groupKey || groupKey.startsWith(value + '_')) { isAGroup = true return } }) return isAGroup } let getPropertyFromObj = function (obj, propertyName) { let properties = { 'selector': ['_s', '_selector', '_selecteur', 'selector'], 'attribute': ['_a', '_attr', '_attribut', 'attr', 'attribute'], 'filter': ['_filter', '_f', '_filtre', 'filter'], 'extractor': ['_e', '_extracteur', 'extractor', 'type', '_t'], //keep temporary old types 'data': ['_d', '_data', '_donnee', 'data'], 'parser': ['_p', '_parser', '_parseur', 'parser'], 'break': ['_b', '_break', '_cassure'] } let ob = this let res = null if (properties[propertyName]) { 
properties[propertyName].forEach(function (property, i) { if (obj[property]) { res = obj[property] return } }) } return res } let timeSpent = function (lastTime) { return new Date().getTime() - lastTime } String.prototype.oneSplitFromEnd = function (char) { let arr = this.split(char), res = [] res[1] = arr[arr.length - 1] arr.pop() res[0] = arr.join(char) return res } module.exports = function ($) { let getNodesFromSmartSelector = function (node, selector) { if (selector === "_parent_") { return node } else { return $(node).find(selector) } } let getFunctionalParameters = function (obj) { let result = { selector: getPropertyFromObj(obj, 'selector'), attribute: getPropertyFromObj(obj, 'attribute'), filter: getPropertyFromObj(obj, 'filter'), extractor: getPropertyFromObj(obj, 'extractor'), data: getPropertyFromObj(obj, 'data'), parser: getPropertyFromObj(obj, 'parser'), break: getPropertyFromObj(obj, 'break') } return result } let updateFunctionalParametersFromSelector = function (g, selector, node) { let gUpdate = extractSmartSelector({ selector: selector, node: $(node) }) g.selector = gUpdate.selector g.parser = g.parser ? g.parser : gUpdate.parser g.filter = g.filter ? g.filter : gUpdate.filter g.attribute = g.attribute ? g.attribute : gUpdate.attribute g.extractor = g.extractor ? 
g.extractor : gUpdate.extractor return g } let getDataFromNodes = function (nodes, g, { timestats = false, multiple = true } = {}) { let result = [] if (timestats) { result = {} result['_value'] = [] } // Getting data $(nodes).each(function (i, n) { let r = getTheRightData($(n), { extractor: g.extractor, filter: g.filter, attr: g.attribute, parser: g.parser, multiple: multiple }) if (_.isArray(r) && r.length === 1) { r = r[0] } if (r) { if (result['_value']) { if (_.isArray(r) && r.length > 1) { result['_value'] = r } else { result['_value'].push(r) } } else { if (_.isArray(r) && r.length > 1) { result = r } else { result.push(r) } } } // not multiple wanted, stop at the first one if (!multiple) { return false } }) if (result['_value']) { result['_timestat'] = timeSpent(gTime) } // avoid listing if ((!g.filter || !g.filter.join("").includes("split")) && !multiple && result[0]) { result = result[0] } if (g.filter && g.filter.join("").includes("join") && result.length === 1) { result = result[0] } if (result.length === 0) { result = null } return result } let extractSmartSelector = function ({ selector, node = null, attribute = null, filter = null, extractor = null, parser = null }) { let res = { "selector": selector, "attribute": attribute, "filter": filter, "extractor": extractor, "parser": parser } if (res.selector.includes('||')) { res.parser = res.selector.oneSplitFromEnd('||')[1].trim() res.selector = res.selector.oneSplitFromEnd('||')[0].trim() } if (res.selector.includes('|')) { res.filter = res.selector.oneSplitFromEnd('|')[1].trim() res.filter = res.filter.split(/\s+/) res.selector = res.selector.oneSplitFromEnd('|')[0].trim() } if (res.selector.includes('<')) { res.extractor = res.selector.oneSplitFromEnd('<')[1].trim() res.extractor = res.extractor.split(/\s+/) res.selector = res.selector.oneSplitFromEnd('<')[0].trim() } if (res.selector.includes('@')) { res.attribute = res.selector.oneSplitFromEnd('@')[1].trim() res.selector = 
res.selector.oneSplitFromEnd('@')[0].trim() } if (!res.extractor && !res.attribute && $(node).find(res.selector)['0'] && $(node).find(res.selector)['0'].name.toLowerCase() === "img") { res.attribute = "src" } return res } let getTheRightData = function (node, { attr = null, extractor = null, filter = null, parser = null, multiple = false } = {}) { //assuming we handle only one node from getDataFromNodes let result = null let localNode = node[0] || node // in case of many, shouldn't happen if (attr) { result = $(localNode).attr(attr) || "" } else { result = $(localNode).text() } let extractors = [] // build an array of extractors anyway if (!_.isArray(extractor)) { extractors.push(extractor) } else { extractors = extractor } if (extractors[0] && extractors[0] === "html") { result = $(localNode).html() } if (_.isObject(result)) { _.forOwn(result, function (value, key) { extractors.forEach(function (ext, index) { result[key] = extractByExtractor(result[key], ext, { multiple }) }) }) } else { extractors.forEach(function (ext, index) { result = extractByExtractor(result, ext, { multiple }) }) } if (_.isObject(result)) { _.forOwn(result, function (value, key) { if (_.isArray(filter)) { filter.forEach(function (f, index) { result[key] = filterData(result[key], f) }) } else { // handle type of child if (_.isString(result[key])) { result[key] = filterData(result[key], filter) } } }) } else { if (_.isArray(filter)) { filter.forEach(function (f, index) { result = filterData(result, f) }) } else { result = filterData(result, filter) } } if (parser) { result = parseData(result, parser, { multiple }) } // if(!multiple && _.isArray(result)){ // result = result[0] // } return result } // real prototype $.prototype.scrape = function (frame, { debug = false, timestats = false, string = false } = {}) { let output = {} let mainNode = $(this) let iterateThrough = function (obj, elem, node) { let gTime = new Date().getTime() if (_.isObject(obj)) { _.forOwn(obj, function (currentValue, 
key) { // Security for jsonpath in "_to" > "_frame" if (key === "_frame" || key === "_from") { elem[key] = currentValue // If it's a group key } else if (isAGroupKey(key)) { let selector = getPropertyFromObj(currentValue, 'selector') let data = getPropertyFromObj(currentValue, 'data') let n = getNodesFromSmartSelector($(node), selector) iterateThrough(data, elem, $(n)) } else { try { let g = {} if (_.isObject(currentValue) && !_.isArray(currentValue)) { g = getFunctionalParameters(currentValue) if (g.selector && _.isString(g.selector)) { g = updateFunctionalParametersFromSelector(g, g.selector, $(node)) if (g.data && _.isObject(g.data)) { if (_.isArray(g.data)) { // Check if break included if (g.break && _.isString(g.break)) { let parent = getNodesFromSmartSelector($(node), g.selector) // Clone the parent to leave the initial DOM in place :) let tempParent = $(parent).clone() // Get the number of blocks to create let l = $(tempParent).children(g.break).length // Random name to set the list var breaklist = "#breaklist1234" // Add the list after the parent in the DOM $(parent).after('
') // Moving the dom elements to blocks for (var index = 0; index < l; index++) { $(breaklist).append('
') // console.log("Appending: ",$(parent).children(g.break).first().text()) // Move the break element to the .break block $(breaklist).children().last().append($(tempParent).children(g.break).first()) // Move the next blocks to the .break block $(tempParent).children().first().nextUntil(g.break).each(function (i, e) { // console.log("nextItem", $(e).text()); $(breaklist).children().last().append($(e)) }) } elem[key] = [] // Iterating in this list $(breaklist).children(".break").each(function (i, e) { elem[key][i] = {} iterateThrough(g.data[0], elem[key][i], $(e)) }) } // Check if object in array else if (_.isObject(g.data[0]) && _.size(g.data[0]) > 0) { elem[key] = [] let nn = getNodesFromSmartSelector($(node), g.selector) if ($(nn).length > 0) { $(nn).each(function (i, n) { elem[key][i] = {} iterateThrough(g.data[0], elem[key][i], $(n)) }) } // If no object, taking the single string } else if (_.isString(g.data[0])) { let n = getNodesFromSmartSelector($(node), g.selector) let dataResp = getDataFromNodes($(n), g) if (dataResp) { elem[key] = dataResp } } // Simple data object to use parent selector as base } else { if (_.size(g.data) > 0) { elem[key] = {} let n = $(node).find(g.selector).first() iterateThrough(g.data, elem[key], $(n)) } } } else { let n = getNodesFromSmartSelector($(node), g.selector) let dataResp = getDataFromNodes($(n), g, { multiple: false }) if (dataResp) { // push data as unit of array elem[key] = dataResp } } } // There is no Selector but still an Object for organization else { elem[key] = {} iterateThrough(currentValue, elem[key], node) } } else if (_.isArray(currentValue)) { elem[key] = [] // For each unique string currentValue.forEach(function (arrSelector, h) { if (_.isString(arrSelector)) { g = updateFunctionalParametersFromSelector(g, arrSelector, $(node)) let n = getNodesFromSmartSelector($(node), g.selector) let dataResp = getDataFromNodes($(n), g) if (dataResp) { // push data as unit of array elem[key].push(...dataResp) } } }) } // 
The Parameter is a single string === selector > directly scraped else { g = updateFunctionalParametersFromSelector(g, currentValue, $(node)) let n = getNodesFromSmartSelector($(node), g.selector) let dataResp = getDataFromNodes($(n), g, { multiple: false }) if (dataResp) { // push data as unit of array elem[key] = dataResp } } } catch (error) { // console.log("obj key", key); console.log(error) } } }) } } iterateThrough(frame, output, mainNode) if (string) { output = JSON.stringify(output, null, 2) } return output } } ================================================ FILE: package.json ================================================ { "name": "jsonframe-cheerio", "version": "2.0.52", "description": "simple multi-level scraper json input/output", "main": "index.js", "scripts": { "test": "mocha tests/**/*.test.js", "test-watch": "nodemon --exec \"npm run test\"" }, "author": { "name": "Gabin Desserprit", "email": "gabin@datascraper.pro", "url": "http://datascraper.pro" }, "keywords": [ "cheerio", "scraper", "frame", "json", "parser", "template" ], "repository": { "type": "git", "url": "https://github.com/gahabeen/jsonframe-cheerio" }, "bugs": { "url": "https://github.com/gahabeen/jsonframe-cheerio/issues" }, "license": "ISC", "devDependencies": { "cheerio": "^0.22.0", "expect": "^1.20.2", "lodash": "^4.17.4", "mocha": "^3.2.0", "nodemon": "^1.11.0", "unfluff": "^1.1.0", "xmldom": "^0.1.27", "xpath": "0.0.23" }, "dependencies": { "addressit": "^1.4.0", "chrono-node": "^1.2.5", "google-libphonenumber": "^2.0.9", "humanname": "^0.2.2", "lodash": "^4.17.4" } } ================================================ FILE: tests/index.test.js ================================================ const expect = require('expect') const cheerio = require('cheerio') let _ = require('lodash') let jsonframe = require('./../index.js') let html = `

Pricing

A Link We are the 04/02/2017
Phone USA: (912) 148-456
Phone FR: +332 38 30 37 90 Email: lspurcell@suddenlink.net
`

let $ = cheerio.load(html)
jsonframe($)

describe('JsonFrame Tests', () => {

  describe('Get Data from Inline Selector', () => {

    it('should get simple text', () => {
      let frame = {
        "title": "h2"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "title": "Pricing"
        })
    })

    it('should get img src automatically', () => {
      let frame = {
        "picture": ".picture" // even without mentioning the img tag
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "picture": "somepath/to/image.png"
        })
    })
  })

  describe('Get Attribute Data from Object {selector, attribute}', () => {

    it('should get the price attribute value', () => {
      let frame = {
        "proPrice": {
          _s: ".planName:contains('Pro') + span",
          _a: "price"
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "proPrice": "39.00"
        })
    })

    it('should get the price attribute value (inline)', () => {
      let frame = {
        "proPrice": ".planName:contains('Pro') + span @ price"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "proPrice": "39.00"
        })
    })

    it('should get the link (href) attribute value', () => {
      let frame = {
        "link": {
          _s: ".mainLink",
          _a: "href"
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "link": "some/url/to/somewhere"
        })
    })

    it('should get the link (href) attribute value (inline)', () => {
      let frame = {
        "link": ".mainLink @ href"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "link": "some/url/to/somewhere"
        })
    })
  })

  describe('Get Data with Type {selector, type[, attribute,]}', () => {

    it('should get the USA telephone value', () => {
      let frame = {
        "telephone": {
          _s: "[itemprop=usaphone]",
          _t: "telephone"
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "telephone": "(912) 148-456"
        })
    })

    it('should get the USA telephone value (inline)', () => {
      let frame = {
        "telephone": "[itemprop=usaphone] < telephone"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "telephone": "(912) 148-456"
        })
    })

    it('should get the FR telephone value', () => {
      let frame = {
        "telephone": {
          _s: "[itemprop=frphone]",
          _t: "telephone"
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "telephone": "+332 38 30 37 90"
        })
    })

    it('should get the FR telephone value (inline)', () => {
      let frame = {
        "telephone": "[itemprop=frphone] < telephone"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "telephone": "+332 38 30 37 90"
        })
    })

    it('should get the email value', () => {
      let frame = {
        "email": {
          _s: "[itemprop=email]",
          _t: "email"
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "email": "lspurcell@suddenlink.net"
        })
    })

    it('should get the email value (inline)', () => {
      let frame = {
        "email": "[itemprop=email] < email"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "email": "lspurcell@suddenlink.net"
        })
    })

    it('should get the inner html value', () => {
      let frame = {
        "inner": {
          _s: ".popup",
          _t: "html"
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "inner": "Some inner content"
        })
    })

    it('should get the inner html value (inline)', () => {
      let frame = {
        "inner": ".popup < html"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "inner": "Some inner content"
        })
    })
  })

  describe('Get Parsed Data thanks to Regex {selector, parse[, type, attribute]}', () => {

    it('should get the parsed date dd/mm/yyyy from regex', () => {
      let frame = {
        "data": {
          _s: ".date",
          _p: /\d{1,2}\/\d{1,2}\/\d{2,4}/
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "data": "04/02/2017"
        })
    })

    it('should get the parsed date dd/mm/yyyy from regex (inline)', () => {
      let frame = {
        "data": ".date || \\d{1,2}/\\d{1,2}/\\d{2,4}"
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "data": "04/02/2017"
        })
    })
  })

  describe('Get Child Obj Data {selector, data: {}}', () => {

    it('should get json object with parent > child', () => {
      let frame = {
        "pricing": {
          _s: "#pricing .item",
          _d: {
            "name": ".planName",
            "price": ".planPrice"
          }
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "pricing": {
            "name": "Hacker",
            "price": "Free"
          }
        })
    })
  })

  describe('Get Array / List of Data {selector, data: [{}]}', () => {

    it('should get json object with parent > childs []', () => {
      let frame = {
        "pricing": {
          _s: "#pricing .item",
          _d: [{
            "name": ".planName",
            "price": ".planPrice"
          }]
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "pricing": [{
              "name": "Hacker",
              "price": "Free"
            },
            {
              "name": "Pro",
              "price": "$39"
            }
          ]
        })
    })
  })

  describe('Get child elements grouped by a selector with _g', () => {

    it('should get the data within the first li item', () => {
      let frame = {
        _g: {
          _s: "#pricing .item",
          _d: {
            "name": ".planName",
            "price": ".planPrice @ price",
            "image": {
              "url": "img",
              "link": "a @ href"
            }
          }
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "name": "Hacker",
          "price": "0",
          "image": {
            "url": "./img/hacker.png",
            "link": "/hacker"
          }
        })
    })
  })

  describe('Full examples', () => {

    it('should get the pricing list + details', () => {
      let frame = {
        "pricing": {
          _s: "#pricing .item",
          _d: [{
            "name": ".planName",
            "price": {
              _s: ".planPrice",
              _a: "price"
            },
            "image": {
              "url": {
                _s: "img",
                _a: "src"
              },
              "link": {
                _s: "a",
                _a: "href"
              }
            }
          }]
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "pricing": [{
              "name": "Hacker",
              "price": "0",
              "image": {
                "url": "./img/hacker.png",
                "link": "/hacker"
              }
            },
            {
              "name": "Pro",
              "price": "39.00",
              "image": {
                "url": "./img/pro.png",
                "link": "/pro"
              }
            }
          ]
        })
    })

    it('should get the pricing list + details (inline)', () => {
      let frame = {
        "pricing": {
          _s: "#pricing .item",
          _d: [{
            "name": ".planName",
            "price": ".planPrice @ price",
            "image": {
              "url": "img",
              "link": "a @ href"
            }
          }]
        }
      }
      let output = $('body').scrape(frame)
      expect(output)
        .toContain({
          "pricing": [{
              "name": "Hacker",
              "price": "0",
              "image": {
                "url": "./img/hacker.png",
                "link": "/hacker"
              }
            },
            {
              "name": "Pro",
              "price": "39.00",
              "image": {
                "url": "./img/pro.png",
                "link": "/pro"
              }
            }
          ]
        })
    })
  })
})

================================================
FILE: tests/playground/html/company.html
================================================
Company

Bonjour

Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iusto, inventore, nihil! Itaque aspernatur tenetur repellendus ipsam iste non accusamus similique, ab minus. Sed saepe nesciunt debitis, sit asperiores optio corporis.

The company name 815 684 9704
A company +18749284
My Company 104 78794 15
Hire Me 849 0445 667