Repository: ericchiang/pup
Branch: master
Commit: 5a57cf111366
Files: 18
Total size: 228.4 KB
Directory structure:
gitextract_hf53ko4u/
├── .github/
│ └── workflows/
│ └── test.yaml
├── .gitignore
├── LICENSE
├── README.md
├── display.go
├── go.mod
├── go.sum
├── parse.go
├── parse_test.go
├── pup.go
├── pup.rb
├── selector.go
└── tests/
├── README.md
├── cmds.txt
├── expected_output.txt
├── index.html
├── run.py
└── test
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/test.yaml
================================================
name: test
on:
push:
branches:
- master
pull_request:
branches:
- master
jobs:
test:
strategy:
matrix:
os: [ubuntu-latest]
go-version: [1.17.x, 1.16.x]
runs-on: ${{ matrix.os }}
steps:
- name: Install Go
uses: actions/setup-go@v2
with:
go-version: ${{ matrix.go-version }}
- name: Checkout code
uses: actions/checkout@v2
- name: Build
run: go build ./...
- name: Test
run: go test -v ./...
================================================
FILE: .gitignore
================================================
dist/
testpages/*
tests/test_results.txt
robots.html
================================================
FILE: LICENSE
================================================
Copyright (c) 2014: Eric Chiang
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
================================================
FILE: README.md
================================================
# pup
pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
[CSS selectors](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors).
Inspired by [jq](http://stedolan.github.io/jq/), pup aims to be a
fast and flexible way of exploring HTML from the terminal.
## Install
Direct downloads are available through the [releases page](https://github.com/EricChiang/pup/releases/latest).
If you have Go installed on your computer just run `go get`.
go get github.com/ericchiang/pup
If you're on OS X, use [Homebrew](http://brew.sh/) to install (no Go required).
brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb
## Quick start
```bash
$ curl -s https://news.ycombinator.com/
```
Ew, HTML. Let's run that through some pup selectors:
```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'
```
Okay, how about only the links?
```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'
```
Even better, let's grab the titles too:
```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'
```
## Basic Usage
```bash
$ cat index.html | pup [flags] '[selectors] [display function]'
```
## Examples
Download a webpage with wget.
```bash
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
```
#### Clean and indent
By default pup will fill in missing tags and properly indent the page.
```bash
$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
```
#### Filter by tag
```bash
$ cat robots.html | pup 'title'
Robots exclusion standard - Wikipedia, the free encyclopedia
```
#### Filter by id
```bash
$ cat robots.html | pup 'span#See_also'
See also
```
#### Filter by attribute
```bash
$ cat robots.html | pup 'th[scope="row"]'
Exclusion standards
Related marketing topics
Search marketing related topics
Search engine spam
Linking
People
Other
```
#### Pseudo Classes
CSS selectors have a group of specifiers called ["pseudo classes"](
https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes) which are pretty
cool. pup implements a majority of the relevant ones them.
Here are some examples.
```bash
$ cat robots.html | pup 'a[rel]:empty'
```
```bash
$ cat robots.html | pup ':contains("History")'
History
History
```
```bash
$ cat robots.html | pup ':parent-of([action="edit"])'
Edit links
```
For a complete list, view the [implemented selectors](#implemented-selectors)
section.
#### `+`, `>`, and `,`
These are intermediate characters that declare special instructions. For
instance, a comma `,` allows pup to specify multiple groups of selectors.
```bash
$ cat robots.html | pup 'title, h1 span[dir="auto"]'
Robots exclusion standard - Wikipedia, the free encyclopedia
Robots exclusion standard
```
#### Chain selectors together
When combining selectors, the HTML nodes selected by the previous selector will
be passed to the next ones.
```bash
$ cat robots.html | pup 'h1#firstHeading'
Robots exclusion standard
```
```bash
$ cat robots.html | pup 'h1#firstHeading span'
Robots exclusion standard
```
## Implemented Selectors
For further examples of these selectors head over to [MDN](
https://developer.mozilla.org/en-US/docs/Web/CSS/Reference).
```bash
pup '.class'
pup '#id'
pup 'element'
pup 'selector + selector'
pup 'selector > selector'
pup '[attribute]'
pup '[attribute="value"]'
pup '[attribute*="value"]'
pup '[attribute~="value"]'
pup '[attribute^="value"]'
pup '[attribute$="value"]'
pup ':empty'
pup ':first-child'
pup ':first-of-type'
pup ':last-child'
pup ':last-of-type'
pup ':only-child'
pup ':only-of-type'
pup ':contains("text")'
pup ':nth-child(n)'
pup ':nth-of-type(n)'
pup ':nth-last-child(n)'
pup ':nth-last-of-type(n)'
pup ':not(selector)'
pup ':parent-of(selector)'
```
You can mix and match selectors as you wish.
```bash
cat index.html | pup 'element#id[attribute="value"]:first-of-type'
```
## Display Functions
Non-HTML selectors which effect the output type are implemented as functions
which can be provided as a final argument.
#### `text{}`
Print all text from selected nodes and children in depth first order.
```bash
$ cat robots.html | pup '.mw-headline text{}'
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
```
#### `attr{attrkey}`
Print the values of all attributes with a given key from all selected nodes.
```bash
$ cat robots.html | pup '.catlinks div attr{id}'
mw-normal-catlinks
mw-hidden-catlinks
```
#### `json{}`
Print HTML as JSON.
```bash
$ cat robots.html | pup 'div#p-namespaces a'
Article
Talk
```
```bash
$ cat robots.html | pup 'div#p-namespaces a json{}'
[
{
"accesskey": "c",
"href": "/wiki/Robots_exclusion_standard",
"tag": "a",
"text": "Article",
"title": "View the content page [c]"
},
{
"accesskey": "t",
"href": "/wiki/Talk:Robots_exclusion_standard",
"tag": "a",
"text": "Talk",
"title": "Discussion about the content page [t]"
}
]
```
Use the `-i` / `--indent` flag to control the intent level.
```bash
$ cat robots.html | pup -i 4 'div#p-namespaces a json{}'
[
{
"accesskey": "c",
"href": "/wiki/Robots_exclusion_standard",
"tag": "a",
"text": "Article",
"title": "View the content page [c]"
},
{
"accesskey": "t",
"href": "/wiki/Talk:Robots_exclusion_standard",
"tag": "a",
"text": "Talk",
"title": "Discussion about the content page [t]"
}
]
```
If the selectors only return one element the results will be printed as a JSON
object, not a list.
```bash
$ cat robots.html | pup --indent 4 'title json{}'
{
"tag": "title",
"text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}
```
Because there is no universal standard for converting HTML/XML to JSON, a
method has been chosen which hopefully fits. The goal is simply to get the
output of pup into a more consumable format.
## Flags
Run `pup --help` for a list of further options
================================================
FILE: display.go
================================================
package main
import (
"encoding/json"
"fmt"
"regexp"
"strings"
"github.com/fatih/color"
colorable "github.com/mattn/go-colorable"
"golang.org/x/net/html"
"golang.org/x/net/html/atom"
)
func init() {
color.Output = colorable.NewColorableStdout()
}
type Displayer interface {
Display([]*html.Node)
}
func ParseDisplayer(cmd string) error {
attrRe := regexp.MustCompile(`attr\{([a-zA-Z\-]+)\}`)
if cmd == "text{}" {
pupDisplayer = TextDisplayer{}
} else if cmd == "json{}" {
pupDisplayer = JSONDisplayer{}
} else if match := attrRe.FindAllStringSubmatch(cmd, -1); len(match) == 1 {
pupDisplayer = AttrDisplayer{
Attr: match[0][1],
}
} else {
return fmt.Errorf("Unknown displayer")
}
return nil
}
// Is this node a tag with no end tag such as or ?
// http://www.w3.org/TR/html-markup/syntax.html#syntax-elements
func isVoidElement(n *html.Node) bool {
switch n.DataAtom {
case atom.Area, atom.Base, atom.Br, atom.Col, atom.Command, atom.Embed,
atom.Hr, atom.Img, atom.Input, atom.Keygen, atom.Link,
atom.Meta, atom.Param, atom.Source, atom.Track, atom.Wbr:
return true
}
return false
}
var (
// Colors
tagColor *color.Color = color.New(color.FgCyan)
tokenColor = color.New(color.FgCyan)
attrKeyColor = color.New(color.FgMagenta)
quoteColor = color.New(color.FgBlue)
commentColor = color.New(color.FgYellow)
)
type TreeDisplayer struct {
}
func (t TreeDisplayer) Display(nodes []*html.Node) {
for _, node := range nodes {
t.printNode(node, 0)
}
}
// The
tag indicates that the text within it should always be formatted
// as is. See https://github.com/ericchiang/pup/issues/33
func (t TreeDisplayer) printPre(n *html.Node) {
switch n.Type {
case html.TextNode:
s := n.Data
if pupEscapeHTML {
// don't escape javascript
if n.Parent == nil || n.Parent.DataAtom != atom.Script {
s = html.EscapeString(s)
}
}
fmt.Print(s)
for c := n.FirstChild; c != nil; c = c.NextSibling {
t.printPre(c)
}
case html.ElementNode:
fmt.Printf("<%s", n.Data)
for _, a := range n.Attr {
val := a.Val
if pupEscapeHTML {
val = html.EscapeString(val)
}
fmt.Printf(` %s="%s"`, a.Key, val)
}
fmt.Print(">")
if !isVoidElement(n) {
for c := n.FirstChild; c != nil; c = c.NextSibling {
t.printPre(c)
}
fmt.Printf("%s>", n.Data)
}
case html.CommentNode:
data := n.Data
if pupEscapeHTML {
data = html.EscapeString(data)
}
fmt.Printf("\n", data)
for c := n.FirstChild; c != nil; c = c.NextSibling {
t.printPre(c)
}
case html.DoctypeNode, html.DocumentNode:
for c := n.FirstChild; c != nil; c = c.NextSibling {
t.printPre(c)
}
}
}
// Print a node and all of it's children to `maxlevel`.
func (t TreeDisplayer) printNode(n *html.Node, level int) {
switch n.Type {
case html.TextNode:
s := n.Data
if pupEscapeHTML {
// don't escape javascript
if n.Parent == nil || n.Parent.DataAtom != atom.Script {
s = html.EscapeString(s)
}
}
s = strings.TrimSpace(s)
if s != "" {
t.printIndent(level)
fmt.Println(s)
}
case html.ElementNode:
t.printIndent(level)
// TODO: allow pre with color
if n.DataAtom == atom.Pre && !pupPrintColor && pupPreformatted {
t.printPre(n)
fmt.Println()
return
}
if pupPrintColor {
tokenColor.Print("<")
tagColor.Printf("%s", n.Data)
} else {
fmt.Printf("<%s", n.Data)
}
for _, a := range n.Attr {
val := a.Val
if pupEscapeHTML {
val = html.EscapeString(val)
}
if pupPrintColor {
fmt.Print(" ")
attrKeyColor.Printf("%s", a.Key)
tokenColor.Print("=")
quoteColor.Printf(`"%s"`, val)
} else {
fmt.Printf(` %s="%s"`, a.Key, val)
}
}
if pupPrintColor {
tokenColor.Println(">")
} else {
fmt.Println(">")
}
if !isVoidElement(n) {
t.printChildren(n, level+1)
t.printIndent(level)
if pupPrintColor {
tokenColor.Print("")
tagColor.Printf("%s", n.Data)
tokenColor.Println(">")
} else {
fmt.Printf("%s>\n", n.Data)
}
}
case html.CommentNode:
t.printIndent(level)
data := n.Data
if pupEscapeHTML {
data = html.EscapeString(data)
}
if pupPrintColor {
commentColor.Printf("\n", data)
} else {
fmt.Printf("\n", data)
}
t.printChildren(n, level)
case html.DoctypeNode, html.DocumentNode:
t.printChildren(n, level)
}
}
func (t TreeDisplayer) printChildren(n *html.Node, level int) {
if pupMaxPrintLevel > -1 {
if level >= pupMaxPrintLevel {
t.printIndent(level)
fmt.Println("...")
return
}
}
child := n.FirstChild
for child != nil {
t.printNode(child, level)
child = child.NextSibling
}
}
func (t TreeDisplayer) printIndent(level int) {
for ; level > 0; level-- {
fmt.Print(pupIndentString)
}
}
// Print the text of a node
type TextDisplayer struct{}
func (t TextDisplayer) Display(nodes []*html.Node) {
for _, node := range nodes {
if node.Type == html.TextNode {
data := node.Data
if pupEscapeHTML {
// don't escape javascript
if node.Parent == nil || node.Parent.DataAtom != atom.Script {
data = html.EscapeString(data)
}
}
fmt.Println(data)
}
children := []*html.Node{}
child := node.FirstChild
for child != nil {
children = append(children, child)
child = child.NextSibling
}
t.Display(children)
}
}
// Print the attribute of a node
type AttrDisplayer struct {
Attr string
}
func (a AttrDisplayer) Display(nodes []*html.Node) {
for _, node := range nodes {
attributes := node.Attr
for _, attr := range attributes {
if attr.Key == a.Attr {
val := attr.Val
if pupEscapeHTML {
val = html.EscapeString(val)
}
fmt.Printf("%s\n", val)
}
}
}
}
// Print nodes as a JSON list
type JSONDisplayer struct{}
// returns a jsonifiable struct
func jsonify(node *html.Node) map[string]interface{} {
vals := map[string]interface{}{}
if len(node.Attr) > 0 {
for _, attr := range node.Attr {
if pupEscapeHTML {
vals[attr.Key] = html.EscapeString(attr.Val)
} else {
vals[attr.Key] = attr.Val
}
}
}
vals["tag"] = node.DataAtom.String()
children := []interface{}{}
for child := node.FirstChild; child != nil; child = child.NextSibling {
switch child.Type {
case html.ElementNode:
children = append(children, jsonify(child))
case html.TextNode:
text := strings.TrimSpace(child.Data)
if text != "" {
if pupEscapeHTML {
// don't escape javascript
if node.DataAtom != atom.Script {
text = html.EscapeString(text)
}
}
// if there is already text we'll append it
currText, ok := vals["text"]
if ok {
text = fmt.Sprintf("%s %s", currText, text)
}
vals["text"] = text
}
case html.CommentNode:
comment := strings.TrimSpace(child.Data)
if pupEscapeHTML {
comment = html.EscapeString(comment)
}
currComment, ok := vals["comment"]
if ok {
comment = fmt.Sprintf("%s %s", currComment, comment)
}
vals["comment"] = comment
}
}
if len(children) > 0 {
vals["children"] = children
}
return vals
}
func (j JSONDisplayer) Display(nodes []*html.Node) {
var data []byte
var err error
jsonNodes := []map[string]interface{}{}
for _, node := range nodes {
jsonNodes = append(jsonNodes, jsonify(node))
}
data, err = json.MarshalIndent(&jsonNodes, "", pupIndentString)
if err != nil {
panic("Could not jsonify nodes")
}
fmt.Printf("%s\n", data)
}
// Print the number of features returned
type NumDisplayer struct{}
func (d NumDisplayer) Display(nodes []*html.Node) {
fmt.Println(len(nodes))
}
================================================
FILE: go.mod
================================================
module github.com/ericchiang/pup
go 1.13
require (
github.com/fatih/color v1.0.0
github.com/mattn/go-colorable v0.0.5
github.com/mattn/go-isatty v0.0.0-20151211000621-56b76bdf51f7 // indirect
golang.org/x/net v0.0.0-20160720084139-4d38db76854b
golang.org/x/sys v0.0.0-20160717071931-a646d33e2ee3 // indirect
golang.org/x/text v0.0.0-20160719205907-0a5a09ee4409
)
================================================
FILE: go.sum
================================================
github.com/fatih/color v1.0.0 h1:4zdNjpoprR9fed2QRCPb2VTPU4UFXEtJc9Vc+sgXkaQ=
github.com/fatih/color v1.0.0/go.mod h1:Zm6kSWBoL9eyXnKyktHP6abPY2pDugNf5KwzbycvMj4=
github.com/mattn/go-colorable v0.0.5 h1:X1IeP+MaFWC+vpbhw3y426rQftzXSj+N7eJFnBEMBfE=
github.com/mattn/go-colorable v0.0.5/go.mod h1:9vuHe8Xs5qXnSaW/c/ABM9alt+Vo+STaOChaDxuIBZU=
github.com/mattn/go-isatty v0.0.0-20151211000621-56b76bdf51f7 h1:owMyzMR4QR+jSdlfkX9jPU3rsby4++j99BfbtgVr6ZY=
github.com/mattn/go-isatty v0.0.0-20151211000621-56b76bdf51f7/go.mod h1:M+lRXTBqGeGNdLjl/ufCoiOlB5xdOkqRJdNxMWT7Zi4=
golang.org/x/net v0.0.0-20160720084139-4d38db76854b h1:2lHDZItrxmjk3OXnITVKcHWo6qQYJSm4q2pmvciVkxo=
golang.org/x/net v0.0.0-20160720084139-4d38db76854b/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/sys v0.0.0-20160717071931-a646d33e2ee3 h1:ZLExsLvnoqWSw6JB6k6RjWobIHGR3NG9dzVANJ7SVKc=
golang.org/x/sys v0.0.0-20160717071931-a646d33e2ee3/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/text v0.0.0-20160719205907-0a5a09ee4409 h1:ImTDOALQ1AOSGXgapb9Q1tOcHlxpQXZCPSIMKLce0JU=
golang.org/x/text v0.0.0-20160719205907-0a5a09ee4409/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
================================================
FILE: parse.go
================================================
package main
import (
"fmt"
"io"
"os"
"strconv"
"strings"
"golang.org/x/net/html"
"golang.org/x/net/html/charset"
"golang.org/x/text/transform"
)
var (
pupIn io.ReadCloser = os.Stdin
pupCharset string = ""
pupMaxPrintLevel int = -1
pupPreformatted bool = false
pupPrintColor bool = false
pupEscapeHTML bool = true
pupIndentString string = " "
pupDisplayer Displayer = TreeDisplayer{}
)
// Parse the html while handling the charset
func ParseHTML(r io.Reader, cs string) (*html.Node, error) {
var err error
if cs == "" {
// attempt to guess the charset of the HTML document
r, err = charset.NewReader(r, "")
if err != nil {
return nil, err
}
} else {
// let the user specify the charset
e, name := charset.Lookup(cs)
if name == "" {
return nil, fmt.Errorf("'%s' is not a valid charset", cs)
}
r = transform.NewReader(r, e.NewDecoder())
}
return html.Parse(r)
}
func PrintHelp(w io.Writer, exitCode int) {
helpString := `Usage
pup [flags] [selectors] [optional display function]
Version
%s
Flags
-c --color print result with color
-f --file file to read from
-h --help display this help
-i --indent number of spaces to use for indent or character
-n --number print number of elements selected
-l --limit restrict number of levels printed
-p --plain don't escape html
--pre preserve preformatted text
--charset specify the charset for pup to use
--version display version
`
fmt.Fprintf(w, helpString, VERSION)
os.Exit(exitCode)
}
func ParseArgs() ([]string, error) {
cmds, err := ProcessFlags(os.Args[1:])
if err != nil {
return []string{}, err
}
return ParseCommands(strings.Join(cmds, " "))
}
// Process command arguments and return all non-flags.
func ProcessFlags(cmds []string) (nonFlagCmds []string, err error) {
var i int
defer func() {
if r := recover(); r != nil {
err = fmt.Errorf("Option '%s' requires an argument", cmds[i])
}
}()
nonFlagCmds = make([]string, len(cmds))
n := 0
for i = 0; i < len(cmds); i++ {
cmd := cmds[i]
switch cmd {
case "-c", "--color":
pupPrintColor = true
case "-p", "--plain":
pupEscapeHTML = false
case "--pre":
pupPreformatted = true
case "-f", "--file":
filename := cmds[i+1]
pupIn, err = os.Open(filename)
if err != nil {
fmt.Fprintf(os.Stderr, "%s\n", err.Error())
os.Exit(2)
}
i++
case "-h", "--help":
PrintHelp(os.Stdout, 0)
case "-i", "--indent":
indentLevel, err := strconv.Atoi(cmds[i+1])
if err == nil {
pupIndentString = strings.Repeat(" ", indentLevel)
} else {
pupIndentString = cmds[i+1]
}
i++
case "-l", "--limit":
pupMaxPrintLevel, err = strconv.Atoi(cmds[i+1])
if err != nil {
return []string{}, fmt.Errorf("Argument for '%s' must be numeric", cmd)
}
i++
case "--charset":
pupCharset = cmds[i+1]
i++
case "--version":
fmt.Println(VERSION)
os.Exit(0)
case "-n", "--number":
pupDisplayer = NumDisplayer{}
default:
if cmd[0] == '-' {
return []string{}, fmt.Errorf("Unrecognized flag '%s'", cmd)
}
nonFlagCmds[n] = cmds[i]
n++
}
}
return nonFlagCmds[:n], nil
}
// Split a string with awareness for quoted text and commas
func ParseCommands(cmdString string) ([]string, error) {
cmds := []string{}
last, next, max := 0, 0, len(cmdString)
for {
// if we're at the end of the string, return
if next == max {
if next > last {
cmds = append(cmds, cmdString[last:next])
}
return cmds, nil
}
// evaluate a rune
c := cmdString[next]
switch c {
case ' ':
if next > last {
cmds = append(cmds, cmdString[last:next])
}
last = next + 1
case ',':
if next > last {
cmds = append(cmds, cmdString[last:next])
}
cmds = append(cmds, ",")
last = next + 1
case '\'', '"':
// for quotes, consume runes until the quote has ended
quoteChar := c
for {
next++
if next == max {
return []string{}, fmt.Errorf("Unmatched open quote (%c)", quoteChar)
}
if cmdString[next] == '\\' {
next++
if next == max {
return []string{}, fmt.Errorf("Unmatched open quote (%c)", quoteChar)
}
} else if cmdString[next] == quoteChar {
break
}
}
}
next++
}
}
================================================
FILE: parse_test.go
================================================
package main
import (
"testing"
)
type parseCmdTest struct {
input string
split []string
ok bool
}
var parseCmdTests = []parseCmdTest{
parseCmdTest{`w1 w2`, []string{`w1`, `w2`}, true},
parseCmdTest{`w1 w2 w3`, []string{`w1`, `w2`, `w3`}, true},
parseCmdTest{`w1 'w2 w3'`, []string{`w1`, `'w2 w3'`}, true},
parseCmdTest{`w1 "w2 w3"`, []string{`w1`, `"w2 w3"`}, true},
parseCmdTest{`w1 "w2 w3"`, []string{`w1`, `"w2 w3"`}, true},
parseCmdTest{`w1 'w2 w3'`, []string{`w1`, `'w2 w3'`}, true},
parseCmdTest{`w1"w2 w3"`, []string{`w1"w2 w3"`}, true},
parseCmdTest{`w1'w2 w3'`, []string{`w1'w2 w3'`}, true},
parseCmdTest{`w1"w2 'w3"`, []string{`w1"w2 'w3"`}, true},
parseCmdTest{`w1'w2 "w3'`, []string{`w1'w2 "w3'`}, true},
parseCmdTest{`"w1 w2" "w3"`, []string{`"w1 w2"`, `"w3"`}, true},
parseCmdTest{`'w1 w2' "w3"`, []string{`'w1 w2'`, `"w3"`}, true},
parseCmdTest{`'w1 \'w2' "w3"`, []string{`'w1 \'w2'`, `"w3"`}, true},
parseCmdTest{`'w1 \'w2 "w3"`, []string{}, false},
parseCmdTest{`w1 'w2 w3'"`, []string{}, false},
parseCmdTest{`w1 "w2 w3"'`, []string{}, false},
parseCmdTest{`w1 ' "w2 w3"`, []string{}, false},
parseCmdTest{`w1 " 'w2 w3'`, []string{}, false},
parseCmdTest{`w1"w2 w3""`, []string{}, false},
parseCmdTest{`w1'w2 w3''`, []string{}, false},
parseCmdTest{`w1"w2 'w3""`, []string{}, false},
parseCmdTest{`w1'w2 "w3''`, []string{}, false},
parseCmdTest{`"w1 w2" "w3"'`, []string{}, false},
parseCmdTest{`'w1 w2' "w3"'`, []string{}, false},
parseCmdTest{`w1,"w2 w3"`, []string{`w1`, `,`, `"w2 w3"`}, true},
parseCmdTest{`w1,'w2 w3'`, []string{`w1`, `,`, `'w2 w3'`}, true},
parseCmdTest{`w1 , "w2 w3"`, []string{`w1`, `,`, `"w2 w3"`}, true},
parseCmdTest{`w1 , 'w2 w3'`, []string{`w1`, `,`, `'w2 w3'`}, true},
parseCmdTest{`w1, "w2 w3"`, []string{`w1`, `,`, `"w2 w3"`}, true},
parseCmdTest{`w1, 'w2 w3'`, []string{`w1`, `,`, `'w2 w3'`}, true},
parseCmdTest{`w1 ,"w2 w3"`, []string{`w1`, `,`, `"w2 w3"`}, true},
parseCmdTest{`w1 ,'w2 w3'`, []string{`w1`, `,`, `'w2 w3'`}, true},
parseCmdTest{`w1"w2, w3"`, []string{`w1"w2, w3"`}, true},
parseCmdTest{`w1'w2, w3'`, []string{`w1'w2, w3'`}, true},
parseCmdTest{`w1"w2, 'w3"`, []string{`w1"w2, 'w3"`}, true},
parseCmdTest{`w1'w2, "w3'`, []string{`w1'w2, "w3'`}, true},
parseCmdTest{`"w1, w2" "w3"`, []string{`"w1, w2"`, `"w3"`}, true},
parseCmdTest{`'w1, w2' "w3"`, []string{`'w1, w2'`, `"w3"`}, true},
parseCmdTest{`'w1, \'w2' "w3"`, []string{`'w1, \'w2'`, `"w3"`}, true},
parseCmdTest{`h1, .article-teaser, .article-content`, []string{
`h1`, `,`, `.article-teaser`, `,`, `.article-content`,
}, true},
parseCmdTest{`h1 ,.article-teaser ,.article-content`, []string{
`h1`, `,`, `.article-teaser`, `,`, `.article-content`,
}, true},
parseCmdTest{`h1 , .article-teaser , .article-content`, []string{
`h1`, `,`, `.article-teaser`, `,`, `.article-content`,
}, true},
}
func sliceEq(s1, s2 []string) bool {
if len(s1) != len(s2) {
return false
}
for i := range s1 {
if s1[i] != s2[i] {
return false
}
}
return true
}
func TestParseCommands(t *testing.T) {
for _, test := range parseCmdTests {
parsed, err := ParseCommands(test.input)
if test.ok != (err == nil) {
t.Errorf("`%s`: should have cause error? %v", test.input, !test.ok)
} else if !sliceEq(test.split, parsed) {
t.Errorf("`%s`: `%s`: `%s`", test.input, test.split, parsed)
}
}
}
================================================
FILE: pup.go
================================================
package main
import (
"fmt"
"os"
"golang.org/x/net/html"
)
// _=,_
// o_/6 /#\
// \__ |##/
// ='|--\
// / #'-.
// \#|_ _'-. /
// |/ \_( # |"
// C/ ,--___/
var VERSION string = "0.4.0"
func main() {
// process flags and arguments
cmds, err := ParseArgs()
if err != nil {
fmt.Fprintf(os.Stderr, "%s\n", err.Error())
os.Exit(2)
}
// Parse the input and get the root node
root, err := ParseHTML(pupIn, pupCharset)
if err != nil {
fmt.Fprintf(os.Stderr, "%s\n", err.Error())
os.Exit(2)
}
pupIn.Close()
// Parse the selectors
selectorFuncs := []SelectorFunc{}
funcGenerator := Select
var cmd string
for len(cmds) > 0 {
cmd, cmds = cmds[0], cmds[1:]
if len(cmds) == 0 {
if err := ParseDisplayer(cmd); err == nil {
continue
}
}
switch cmd {
case "*": // select all
continue
case ">":
funcGenerator = SelectFromChildren
case "+":
funcGenerator = SelectNextSibling
case ",": // nil will signify a comma
selectorFuncs = append(selectorFuncs, nil)
default:
selector, err := ParseSelector(cmd)
if err != nil {
fmt.Fprintf(os.Stderr, "Selector parsing error: %s\n", err.Error())
os.Exit(2)
}
selectorFuncs = append(selectorFuncs, funcGenerator(selector))
funcGenerator = Select
}
}
selectedNodes := []*html.Node{}
currNodes := []*html.Node{root}
for _, selectorFunc := range selectorFuncs {
if selectorFunc == nil { // hit a comma
selectedNodes = append(selectedNodes, currNodes...)
currNodes = []*html.Node{root}
} else {
currNodes = selectorFunc(currNodes)
}
}
selectedNodes = append(selectedNodes, currNodes...)
pupDisplayer.Display(selectedNodes)
}
================================================
FILE: pup.rb
================================================
# This file was generated by release.sh
require 'formula'
class Pup < Formula
homepage 'https://github.com/ericchiang/pup'
version '0.4.0'
if Hardware::CPU.is_64_bit?
url 'https://github.com/ericchiang/pup/releases/download/v0.4.0/pup_v0.4.0_darwin_amd64.zip'
sha256 'c539a697efee2f8e56614a54cb3b215338e00de1f6a7c2fa93144ab6e1db8ebe'
else
url 'https://github.com/ericchiang/pup/releases/download/v0.4.0/pup_v0.4.0_darwin_386.zip'
sha256 '75c27caa0008a9cc639beb7506077ad9f32facbffcc4e815e999eaf9588a527e'
end
def install
bin.install 'pup'
end
end
================================================
FILE: selector.go
================================================
package main
import (
"bytes"
"fmt"
"regexp"
"strconv"
"strings"
"text/scanner"
"golang.org/x/net/html"
)
type Selector interface {
Match(node *html.Node) bool
}
type SelectorFunc func(nodes []*html.Node) []*html.Node
func Select(s Selector) SelectorFunc {
// have to define first to be able to do recursion
var selectChildren func(node *html.Node) []*html.Node
selectChildren = func(node *html.Node) []*html.Node {
selected := []*html.Node{}
for child := node.FirstChild; child != nil; child = child.NextSibling {
if s.Match(child) {
selected = append(selected, child)
} else {
selected = append(selected, selectChildren(child)...)
}
}
return selected
}
return func(nodes []*html.Node) []*html.Node {
selected := []*html.Node{}
for _, node := range nodes {
selected = append(selected, selectChildren(node)...)
}
return selected
}
}
// Defined for the '>' selector
func SelectNextSibling(s Selector) SelectorFunc {
return func(nodes []*html.Node) []*html.Node {
selected := []*html.Node{}
for _, node := range nodes {
for ns := node.NextSibling; ns != nil; ns = ns.NextSibling {
if ns.Type == html.ElementNode {
if s.Match(ns) {
selected = append(selected, ns)
}
break
}
}
}
return selected
}
}
// Defined for the '+' selector
func SelectFromChildren(s Selector) SelectorFunc {
return func(nodes []*html.Node) []*html.Node {
selected := []*html.Node{}
for _, node := range nodes {
for c := node.FirstChild; c != nil; c = c.NextSibling {
if s.Match(c) {
selected = append(selected, c)
}
}
}
return selected
}
}
type PseudoClass func(*html.Node) bool
type CSSSelector struct {
Tag string
Attrs map[string]*regexp.Regexp
Pseudo PseudoClass
}
func (s CSSSelector) Match(node *html.Node) bool {
if node.Type != html.ElementNode {
return false
}
if s.Tag != "" {
if s.Tag != node.DataAtom.String() {
return false
}
}
for attrKey, matcher := range s.Attrs {
matched := false
for _, attr := range node.Attr {
if attrKey == attr.Key {
if !matcher.MatchString(attr.Val) {
return false
}
matched = true
break
}
}
if !matched {
return false
}
}
if s.Pseudo == nil {
return true
}
return s.Pseudo(node)
}
// Parse a selector
// e.g. `div#my-button.btn[href^="http"]`
func ParseSelector(cmd string) (selector CSSSelector, err error) {
selector = CSSSelector{
Tag: "",
Attrs: map[string]*regexp.Regexp{},
Pseudo: nil,
}
var s scanner.Scanner
s.Init(strings.NewReader(cmd))
err = ParseTagMatcher(&selector, s)
return
}
// Parse the initial tag
// e.g. `div`
func ParseTagMatcher(selector *CSSSelector, s scanner.Scanner) error {
tag := bytes.NewBuffer([]byte{})
defer func() {
selector.Tag = tag.String()
}()
for {
c := s.Next()
switch c {
case scanner.EOF:
return nil
case '.':
return ParseClassMatcher(selector, s)
case '#':
return ParseIdMatcher(selector, s)
case '[':
return ParseAttrMatcher(selector, s)
case ':':
return ParsePseudo(selector, s)
default:
if _, err := tag.WriteRune(c); err != nil {
return err
}
}
}
}
// Parse a class matcher
// e.g. `.btn`
func ParseClassMatcher(selector *CSSSelector, s scanner.Scanner) error {
var class bytes.Buffer
defer func() {
regexpStr := `(\A|\s)` + regexp.QuoteMeta(class.String()) + `(\s|\z)`
selector.Attrs["class"] = regexp.MustCompile(regexpStr)
}()
for {
c := s.Next()
switch c {
case scanner.EOF:
return nil
case '.':
return ParseClassMatcher(selector, s)
case '#':
return ParseIdMatcher(selector, s)
case '[':
return ParseAttrMatcher(selector, s)
case ':':
return ParsePseudo(selector, s)
default:
if _, err := class.WriteRune(c); err != nil {
return err
}
}
}
}
// Parse an id matcher
// e.g. `#my-picture`
func ParseIdMatcher(selector *CSSSelector, s scanner.Scanner) error {
var id bytes.Buffer
defer func() {
regexpStr := `^` + regexp.QuoteMeta(id.String()) + `$`
selector.Attrs["id"] = regexp.MustCompile(regexpStr)
}()
for {
c := s.Next()
switch c {
case scanner.EOF:
return nil
case '.':
return ParseClassMatcher(selector, s)
case '#':
return ParseIdMatcher(selector, s)
case '[':
return ParseAttrMatcher(selector, s)
case ':':
return ParsePseudo(selector, s)
default:
if _, err := id.WriteRune(c); err != nil {
return err
}
}
}
}
// Parse an attribute matcher
// e.g. `[attr^="http"]`
func ParseAttrMatcher(selector *CSSSelector, s scanner.Scanner) error {
var attrKey bytes.Buffer
var attrVal bytes.Buffer
hasMatchVal := false
matchType := '='
defer func() {
if hasMatchVal {
var regexpStr string
switch matchType {
case '=':
regexpStr = `^` + regexp.QuoteMeta(attrVal.String()) + `$`
case '*':
regexpStr = regexp.QuoteMeta(attrVal.String())
case '$':
regexpStr = regexp.QuoteMeta(attrVal.String()) + `$`
case '^':
regexpStr = `^` + regexp.QuoteMeta(attrVal.String())
case '~':
regexpStr = `(\A|\s)` + regexp.QuoteMeta(attrVal.String()) + `(\s|\z)`
}
selector.Attrs[attrKey.String()] = regexp.MustCompile(regexpStr)
} else {
selector.Attrs[attrKey.String()] = regexp.MustCompile(`^.*$`)
}
}()
// After reaching ']' proceed
proceed := func() error {
switch s.Next() {
case scanner.EOF:
return nil
case '.':
return ParseClassMatcher(selector, s)
case '#':
return ParseIdMatcher(selector, s)
case '[':
return ParseAttrMatcher(selector, s)
case ':':
return ParsePseudo(selector, s)
default:
return fmt.Errorf("Expected selector indicator after ']'")
}
}
// Parse the attribute key matcher
for !hasMatchVal {
c := s.Next()
switch c {
case scanner.EOF:
return fmt.Errorf("Unmatched open brace '['")
case ']':
// No attribute value matcher, proceed!
return proceed()
case '$', '^', '~', '*':
matchType = c
hasMatchVal = true
if s.Next() != '=' {
return fmt.Errorf("'%c' must be followed by a '='", matchType)
}
case '=':
matchType = c
hasMatchVal = true
default:
if _, err := attrKey.WriteRune(c); err != nil {
return err
}
}
}
// figure out if the value is quoted
c := s.Next()
inQuote := false
switch c {
case scanner.EOF:
return fmt.Errorf("Unmatched open brace '['")
case ']':
return proceed()
case '"':
inQuote = true
default:
if _, err := attrVal.WriteRune(c); err != nil {
return err
}
}
if inQuote {
for {
c := s.Next()
switch c {
case '\\':
// consume another character
if c = s.Next(); c == scanner.EOF {
return fmt.Errorf("Unmatched open brace '['")
}
case '"':
switch s.Next() {
case ']':
return proceed()
default:
return fmt.Errorf("Quote must end at ']'")
}
}
if _, err := attrVal.WriteRune(c); err != nil {
return err
}
}
} else {
for {
c := s.Next()
switch c {
case scanner.EOF:
return fmt.Errorf("Unmatched open brace '['")
case ']':
// No attribute value matcher, proceed!
return proceed()
}
if _, err := attrVal.WriteRune(c); err != nil {
return err
}
}
}
}
// Parse the selector after ':'
func ParsePseudo(selector *CSSSelector, s scanner.Scanner) error {
if selector.Pseudo != nil {
return fmt.Errorf("Combined multiple pseudo classes")
}
var b bytes.Buffer
for s.Peek() != scanner.EOF {
if _, err := b.WriteRune(s.Next()); err != nil {
return err
}
}
cmd := b.String()
var err error
switch {
case cmd == "empty":
selector.Pseudo = func(n *html.Node) bool {
return n.FirstChild == nil
}
case cmd == "first-child":
selector.Pseudo = firstChildPseudo
case cmd == "last-child":
selector.Pseudo = lastChildPseudo
case cmd == "only-child":
selector.Pseudo = func(n *html.Node) bool {
return firstChildPseudo(n) && lastChildPseudo(n)
}
case cmd == "first-of-type":
selector.Pseudo = firstOfTypePseudo
case cmd == "last-of-type":
selector.Pseudo = lastOfTypePseudo
case cmd == "only-of-type":
selector.Pseudo = func(n *html.Node) bool {
return firstOfTypePseudo(n) && lastOfTypePseudo(n)
}
case strings.HasPrefix(cmd, "contains("):
selector.Pseudo, err = parseContainsPseudo(cmd[len("contains("):])
if err != nil {
return err
}
case strings.HasPrefix(cmd, "nth-child("),
strings.HasPrefix(cmd, "nth-last-child("),
strings.HasPrefix(cmd, "nth-last-of-type("),
strings.HasPrefix(cmd, "nth-of-type("):
if selector.Pseudo, err = parseNthPseudo(cmd); err != nil {
return err
}
case strings.HasPrefix(cmd, "not("):
if selector.Pseudo, err = parseNotPseudo(cmd[len("not("):]); err != nil {
return err
}
case strings.HasPrefix(cmd, "parent-of("):
if selector.Pseudo, err = parseParentOfPseudo(cmd[len("parent-of("):]); err != nil {
return err
}
default:
return fmt.Errorf("%s not a valid pseudo class", cmd)
}
return nil
}
// :first-of-child
func firstChildPseudo(n *html.Node) bool {
for c := n.PrevSibling; c != nil; c = c.PrevSibling {
if c.Type == html.ElementNode {
return false
}
}
return true
}
// :last-of-child
func lastChildPseudo(n *html.Node) bool {
for c := n.NextSibling; c != nil; c = c.NextSibling {
if c.Type == html.ElementNode {
return false
}
}
return true
}
// :first-of-type
func firstOfTypePseudo(node *html.Node) bool {
if node.Type != html.ElementNode {
return false
}
for n := node.PrevSibling; n != nil; n = n.PrevSibling {
if n.DataAtom == node.DataAtom {
return false
}
}
return true
}
// :last-of-type
func lastOfTypePseudo(node *html.Node) bool {
if node.Type != html.ElementNode {
return false
}
for n := node.NextSibling; n != nil; n = n.NextSibling {
if n.DataAtom == node.DataAtom {
return false
}
}
return true
}
func parseNthPseudo(cmd string) (PseudoClass, error) {
i := strings.IndexRune(cmd, '(')
if i < 0 {
// really, we should never get here
return nil, fmt.Errorf("Fatal error, '%s' does not contain a '('", cmd)
}
pseudoName := cmd[:i]
// Figure out how the counting function works
var countNth func(*html.Node) int
switch pseudoName {
case "nth-child":
countNth = func(n *html.Node) int {
nth := 1
for sib := n.PrevSibling; sib != nil; sib = sib.PrevSibling {
if sib.Type == html.ElementNode {
nth++
}
}
return nth
}
case "nth-of-type":
countNth = func(n *html.Node) int {
nth := 1
for sib := n.PrevSibling; sib != nil; sib = sib.PrevSibling {
if sib.Type == html.ElementNode && sib.DataAtom == n.DataAtom {
nth++
}
}
return nth
}
case "nth-last-child":
countNth = func(n *html.Node) int {
nth := 1
for sib := n.NextSibling; sib != nil; sib = sib.NextSibling {
if sib.Type == html.ElementNode {
nth++
}
}
return nth
}
case "nth-last-of-type":
countNth = func(n *html.Node) int {
nth := 1
for sib := n.NextSibling; sib != nil; sib = sib.NextSibling {
if sib.Type == html.ElementNode && sib.DataAtom == n.DataAtom {
nth++
}
}
return nth
}
default:
return nil, fmt.Errorf("Unrecognized pseudo '%s'", pseudoName)
}
nthString := cmd[i+1:]
i = strings.IndexRune(nthString, ')')
if i < 0 {
return nil, fmt.Errorf("Unmatched '(' for pseudo class %s", pseudoName)
} else if i != len(nthString)-1 {
return nil, fmt.Errorf("%s(n) must end selector", pseudoName)
}
number := nthString[:i]
// Check if the number is 'odd' or 'even'
oddOrEven := -1
switch number {
case "odd":
oddOrEven = 1
case "even":
oddOrEven = 0
}
if oddOrEven > -1 {
return func(n *html.Node) bool {
return n.Type == html.ElementNode && countNth(n)%2 == oddOrEven
}, nil
}
// Check against '3n+4' pattern
r := regexp.MustCompile(`([0-9]+)n[ ]?\+[ ]?([0-9])`)
subMatch := r.FindAllStringSubmatch(number, -1)
if len(subMatch) == 1 && len(subMatch[0]) == 3 {
cycle, _ := strconv.Atoi(subMatch[0][1])
offset, _ := strconv.Atoi(subMatch[0][2])
return func(n *html.Node) bool {
return n.Type == html.ElementNode && countNth(n)%cycle == offset
}, nil
}
// check against 'n+2' pattern
r = regexp.MustCompile(`n[ ]?\+[ ]?([0-9])`)
subMatch = r.FindAllStringSubmatch(number, -1)
if len(subMatch) == 1 && len(subMatch[0]) == 2 {
offset, _ := strconv.Atoi(subMatch[0][1])
return func(n *html.Node) bool {
return n.Type == html.ElementNode && countNth(n) >= offset
}, nil
}
// the only other option is a numeric value
nth, err := strconv.Atoi(nthString[:i])
if err != nil {
return nil, err
} else if nth <= 0 {
return nil, fmt.Errorf("Argument to '%s' must be greater than 0", pseudoName)
}
return func(n *html.Node) bool {
return n.Type == html.ElementNode && countNth(n) == nth
}, nil
}
// Parse a :contains("") selector
// expects the input to be everything after the open parenthesis
// e.g. for `contains("Help")` the argument would be `"Help")`
func parseContainsPseudo(cmd string) (PseudoClass, error) {
var s scanner.Scanner
s.Init(strings.NewReader(cmd))
switch s.Next() {
case '"':
default:
return nil, fmt.Errorf("Malformed 'contains(\"\")' selector")
}
textToContain := bytes.NewBuffer([]byte{})
for {
r := s.Next()
switch r {
case '"':
// ')' then EOF must follow '"'
if s.Next() != ')' {
return nil, fmt.Errorf("Malformed 'contains(\"\")' selector")
}
if s.Next() != scanner.EOF {
return nil, fmt.Errorf("'contains(\"\")' must end selector")
}
text := textToContain.String()
contains := func(node *html.Node) bool {
for c := node.FirstChild; c != nil; c = c.NextSibling {
if c.Type == html.TextNode {
if strings.Contains(c.Data, text) {
return true
}
}
}
return false
}
return contains, nil
case '\\':
s.Next()
case scanner.EOF:
return nil, fmt.Errorf("Malformed 'contains(\"\")' selector")
default:
if _, err := textToContain.WriteRune(r); err != nil {
return nil, err
}
}
}
}
// Parse a :not(selector) selector
// expects the input to be everything after the open parenthesis
// e.g. for `not(div#id)` the argument would be `div#id)`
func parseNotPseudo(cmd string) (PseudoClass, error) {
if len(cmd) < 2 {
return nil, fmt.Errorf("malformed ':not' selector")
}
endQuote, cmd := cmd[len(cmd)-1], cmd[:len(cmd)-1]
selector, err := ParseSelector(cmd)
if err != nil {
return nil, err
}
if endQuote != ')' {
return nil, fmt.Errorf("unmatched '('")
}
return func(n *html.Node) bool {
return !selector.Match(n)
}, nil
}
// Parse a :parent-of(selector) selector
// expects the input to be everything after the open parenthesis
// e.g. for `parent-of(div#id)` the argument would be `div#id)`
func parseParentOfPseudo(cmd string) (PseudoClass, error) {
if len(cmd) < 2 {
return nil, fmt.Errorf("malformed ':parent-of' selector")
}
endQuote, cmd := cmd[len(cmd)-1], cmd[:len(cmd)-1]
selector, err := ParseSelector(cmd)
if err != nil {
return nil, err
}
if endQuote != ')' {
return nil, fmt.Errorf("unmatched '('")
}
return func(n *html.Node) bool {
for c := n.FirstChild; c != nil; c = c.NextSibling {
if c.Type == html.ElementNode && selector.Match(c) {
return true
}
}
return false
}, nil
}
================================================
FILE: tests/README.md
================================================
# Tests
A simple set of tests to help maintain sanity.
These tests don't actually test functionality they only make sure pup behaves
the same after code changes.
`cmds.txt` holds a list of commands to perform on `index.html`.
The output of each of these commands produces a specific sha1sum. The expected
sha1sum of each command is in `expected_output.txt`.
Running the `test` file (just a bash script) will run the tests and diff the
output. If pup has changed at all since the last version, you'll see the sha1sums
that changed and the commands that produced that change.
To overwrite the current sha1sums, just run `python run.py > expected_output.txt`
================================================
FILE: tests/cmds.txt
================================================
#footer
#footer li
#footer li + a
#footer li + a attr{title}
#footer li > li
table li
table li:first-child
table li:first-of-type
table li:last-child
table li:last-of-type
table a[title="The Practice of Programming"]
table a[title="The Practice of Programming"] text{}
json{}
text{}
.after-portlet
.after
:empty
td:empty
.navbox-list li:nth-child(1)
.navbox-list li:nth-child(2)
.navbox-list li:nth-child(3)
.navbox-list li:nth-last-child(1)
.navbox-list li:nth-last-child(2)
.navbox-list li:nth-last-child(3)
.navbox-list li:nth-child(n+1)
.navbox-list li:nth-child(3n+1)
.navbox-list li:nth-last-child(n+1)
.navbox-list li:nth-last-child(3n+1)
:only-child
.navbox-list li:only-child
.summary
[class=summary]
[class="summary"]
#toc
#toc li + a
#toc li + a text{}
#toc li + a json{}
#toc li + a + span
#toc li + span
#toc li > li
li a:not([rel])
link, a
link ,a
link , a
link , a sup
link , a:parent-of(sup)
link , a:parent-of(sup) sup
li --number
li -n
================================================
FILE: tests/expected_output.txt
================================================
c00fef10d36c1166cb5ac886f9d25201b720e37e #footer
a7bb8dbfdd638bacad0aa9dc3674126d396b74e2 #footer li
da39a3ee5e6b4b0d3255bfef95601890afd80709 #footer li + a
da39a3ee5e6b4b0d3255bfef95601890afd80709 #footer li + a attr{title}
da39a3ee5e6b4b0d3255bfef95601890afd80709 #footer li > li
a92e50c09cd56970625ac3b74efbddb83b2731bb table li
505c04a42e0084cd95560c233bd3a81b2c59352d table li:first-child
505c04a42e0084cd95560c233bd3a81b2c59352d table li:first-of-type
66950e746590d7f4e9cfe3d1adef42cd0addcf1d table li:last-child
66950e746590d7f4e9cfe3d1adef42cd0addcf1d table li:last-of-type
0a37d612cd4c67a42bd147b1edc5a1128456b017 table a[title="The Practice of Programming"]
0d3918d54f868f13110262ffbb88cbb0b083057d table a[title="The Practice of Programming"] text{}
ecb542a30fc75c71a0c6380692cbbc4266ccbce4 json{}
95ef88ded9dab22ee3206cca47b9c3a376274bda text{}
e4f7358fbb7bb1748a296fa2a7e815fa7de0a08b .after-portlet
da39a3ee5e6b4b0d3255bfef95601890afd80709 .after
5b3020ba03fb43f7cdbcb3924546532b6ec9bd71 :empty
3406ca0f548d66a7351af5411ce945cf67a2f849 td:empty
30fff0af0b1209f216d6e9124e7396c0adfa0758 .navbox-list li:nth-child(1)
a38e26949f047faab5ea7ba2acabff899349ce03 .navbox-list li:nth-child(2)
d954831229a76b888e85149564727776e5a2b37a .navbox-list li:nth-child(3)
d314e83b059bb876b0e5ee76aa92d54987961f9a .navbox-list li:nth-last-child(1)
1f19496e239bca61a1109dbbb8b5e0ab3e302b50 .navbox-list li:nth-last-child(2)
1ec9ebf14fc28c7d2b13e81241a6d2e1608589e8 .navbox-list li:nth-last-child(3)
52e726f0993d2660f0fb3ea85156f6fbcc1cfeee .navbox-list li:nth-child(n+1)
0b20c98650efa5df39d380fea8d5b43f3a08cb66 .navbox-list li:nth-child(3n+1)
52e726f0993d2660f0fb3ea85156f6fbcc1cfeee .navbox-list li:nth-last-child(n+1)
972973fe1e8f63e4481c8641d6169c638a528a6e .navbox-list li:nth-last-child(3n+1)
6c45ee6bca361b8a9baee50a15f575fc6ac73adc :only-child
44c99f6ad37b65dc0893cdcb1c60235d827ee73e .navbox-list li:only-child
641037814e358487d1938fc080e08f72a3846ef8 .summary
641037814e358487d1938fc080e08f72a3846ef8 [class=summary]
641037814e358487d1938fc080e08f72a3846ef8 [class="summary"]
613bf65ac4042b6ee0a7a47f08732fdbe1b5b06b #toc
da39a3ee5e6b4b0d3255bfef95601890afd80709 #toc li + a
da39a3ee5e6b4b0d3255bfef95601890afd80709 #toc li + a text{}
97d170e1550eee4afc0af065b78cda302a97674c #toc li + a json{}
da39a3ee5e6b4b0d3255bfef95601890afd80709 #toc li + a + span
da39a3ee5e6b4b0d3255bfef95601890afd80709 #toc li + span
da39a3ee5e6b4b0d3255bfef95601890afd80709 #toc li > li
87eee1189dd5296d6c010a1ad329fc53c6099d72 li a:not([rel])
055f3c98e9160beb13f72f1009ad66b6252a9bba link, a
055f3c98e9160beb13f72f1009ad66b6252a9bba link ,a
055f3c98e9160beb13f72f1009ad66b6252a9bba link , a
0d1f66765d1632c70f8608947890524e78459362 link , a sup
b6a3d6cccd305fcc3e8bf2743c443743bdaaa02b link , a:parent-of(sup)
0d1f66765d1632c70f8608947890524e78459362 link , a:parent-of(sup) sup
da39a3ee5e6b4b0d3255bfef95601890afd80709 li --number
da39a3ee5e6b4b0d3255bfef95601890afd80709 li -n
================================================
FILE: tests/index.html
================================================
Go (programming language) - Wikipedia, the free encyclopedia
Ken Thompson states that, initially, Go was purely an experimental project. Referring to himself along with the other original authors of Go, he states:[11]
When the three of us [Thompson, Rob Pike, and Robert Griesemer] got started, it was pure research. The three of us got together and decided that we hated C++. [laughter] ... [Returning to Go,] we started off with the idea that all three of us had to be talked into every feature in the language, so there was no extraneous garbage put into the language for any reason.
The history of the language before its first release, back to 2007, is covered in the language's FAQ.[12]
Go is recognizably in the tradition of C, but makes many changes aimed at conciseness, simplicity, and safety. The following is a brief overview of the features which define Go (for more information see the language specification):
A toolchain that, by default, produces statically linked native binaries without external dependencies.
A desire to keep the language specification simple enough to hold in a programmer's head,[17][18] in part by omitting features common to similar languages:
Go's syntax includes changes from C aimed at keeping code concise and readable. The programmer needn't specify the types of expressions, allowing just i := 3 or s := "some words" to replace C's int i = 3; or char* s = "some words";. Semicolons at the end of lines aren't required. Functions may return multiple, named values, and returning a result, err pair is the conventional way a function indicates an error to its caller in Go.[a] Go adds literal syntaxes for initializing struct parameters by name, and for initializing maps and slices. As an alternative to C's three-statement for loop, Go's range expressions allow concise iteration over arrays, slices, strings, and maps.
Go adds some basic types not present in C for safety and convenience:
Slices (written []type) point into an array of objects in memory, storing a pointer to the start of the slice, a length, and a capacity specifying when new memory needs to be allocated to expand the array. Slice contents are passed by reference, and their contents are always mutable.
Go's immutable string type typically holds UTF-8 text (though it can hold arbitrary bytes as well).
map[keytype]valtype provides a hashtable.
Go also adds channel types, which support concurrency and are discussed below, and interfaces, which replace virtual inheritance and are discussed in Interface system section.
Structurally, Go's type system has a few differences from C and most C derivatives. Unlike C typedefs, Go's named types are not aliases for each other, and rules limit when different types can be assigned to each other without explicit conversion.[19] Unlike in C, conversions between number types are explicit; to ensure that doesn't create verbose conversion-heavy code, numeric constants in Go represent abstract, untyped numbers.[20] Finally, in place of non-virtual inheritance, Go has a feature called type embedding in which one object can contain others and pick up their methods.
In Go's package system, each package has a path (e.g., "compress/bzip2" or "code.google.com/p/go.net/html") and a name (e.g., bzip2 or html). References to other packages' definitions must always be prefixed with the other package's name, and only the capitalized names from other modules are accessible: io.Reader is public but bzip2.reader is not.[21] The go get command can retrieve packages stored in a remote repository such as Github or Google Code, and package paths often look like partial URLs for compatibility.[22]
Concurrency: goroutines, channels, and select[edit]
Go provides facilities for writing concurrent programs that share state by communicating.[23][24][25] Concurrency refers not only to multithreading and CPU parallelism, which Go supports, but also to asynchrony: letting slow operations like a database or network-read run while the program does other work, as is common in event-based servers.[26]
Go's concurrency-related syntax and types include:
The go statement, go func(), starts a function in a new light-weight process, or goroutine
Channel types, chan type, provide a type-safe, synchronized, optionally buffered channels between goroutines, and are useful mostly with two other facilities:
The send statement, ch <- x sends x over ch
The receive operator, <- ch receives a value from ch
Both operations block until the other goroutine is ready to communicate
The select statement uses a switch-like syntax to wait for communication on any of a set of channels[27]
From these tools one can build concurrent constructs like worker pools, pipelines (in which, say, a file is decompressed and parsed as it downloads), background calls with timeout, "fan-out" parallel calls to a set of services, and others.[28] Channels have also found uses further from the usual notion of interprocess communication, like serving as a concurrency-safe list of recycled buffers,[29] implementing coroutines (which helped inspire the name goroutine),[30] and implementing iterators.[31]
While the communicating-processes model is favored in Go, it isn't the only one: memory can be shared across goroutines (see below), and the standard sync module provides locks and other primitives.[32]
There are no restrictions on how goroutines access shared data, making race conditions possible. Specifically, unless a program explicitly synchronizes via channels or other means, writes from one goroutine might be partly, entirely, or not at all visible to another, often with no guarantees about ordering of writes.[33] Furthermore, Go's internal data structures like interface values, slice headers, and string headers are not immune to race conditions, so type and memory safety can be violated in multithreaded programs that modify shared instances of those types without synchronization.[34][35]
Idiomatic Go minimizes sharing of data (and thus potential race conditions) by communicating over channels, and a race-condition tester is included in the standard distribution to help catch unsafe behavior. Still, it is important to realize that while Go provides building blocks that can be used to write correct, comprehensible concurrent code, arbitrary code isn't guaranteed to be safe.
Some concurrency-related structural conventions of Go (channels and alternative channel inputs) are derived from Tony Hoare'scommunicating sequential processes model. Unlike previous concurrent programming languages such as occam or Limbo (a language on which Go co-designer Rob Pike worked[36]), Go does not provide any built-in notion of safe or verifiable concurrency.[33]
In place of virtual inheritance, Go uses interfaces. An interface declaration is nothing but a list of required methods: for example, implementing io.Reader requires a Read method that takes a []byte and returns a count of bytes read and any error.[37] Code calling Read needn't know whether it's reading from an HTTP connection, a file, an in-memory buffer, or any other source.
Go types don't declare which interfaces they implement: having the required methods is implementing the interface. In formal language, Go's interface system provides structural rather than nominal typing.
The example below uses the io.Reader and io.Writer interfaces to test Go's implementation of SHA-256 on a standard test input, 1,000,000 repeats of the character "a". RepeatByte implements an io.Reader yielding an infinite stream of repeats of a byte, similar to Unix /dev/zero. The main() function uses RepeatByte to stream a million repeats of "a" into the hash function, then prints the result, which matches the expected value published online.[38] Even though both reader and writer interfaces are needed to make this work, the code needn't mention either; the compiler infers what types implement what interfaces:
Also note type RepeatByte is defined as a byte, not a struct. Named types in Go needn't be structs, and any named type can have methods defined, satisfy interfaces, and act, for practical purposes, as objects; the standard library, for example, stores IP addresses in byte slices.[39]
Besides calling methods via interfaces, Go allows converting interface values to other types with a run-time type check. The language constructs to do so are the type assertion,[40] which checks against a single potential type, and the type switch,[41] which checks against multiple types.
interface{}, the empty interface, is an important corner case because it can refer to an item of any concrete type, including primitive types like string. Code using the empty interface can't simply call methods (or built-in operators) on the referred-to object, but it can store the interface{} value, try to convert it to a more useful type via a type assertion or type switch, or inspect it with Go's reflect package.[42] Because interface{} can refer to any value, it's a limited way to escape the restrictions of static typing, like void* in C but with additional run-time type checks.
Interface values are stored in memory as a pointer to data and a second pointer to run-time type information.[43] Like other pointers in Go, interface values are nil if uninitialized.[44] Unlike in environments like Java's virtual machine, there is no object header; the run-time type information is only attached to interface values. So, the system imposes no per-object memory overhead for objects not accessed via interface, similar to C structs or C# ValueTypes.
Go does not have interface inheritance, but one interface type can embed another; then the embedding interface requires all of the methods required by the embedded interface.[45]
Go deliberately omits certain features common in other languages, including generic programming, assertions, pointer arithmetic, and inheritance. After initially omitting exceptions, the language added the panic/recover mechanism, but it is only meant for rare circumstances.[46][47][48]
The Go authors express an openness to generic programming, explicitly argue against assertions and pointer arithmetic, while defending the choice to omit type inheritance as giving a more useful language, encouraging heavy use of interfaces instead.[2]
The Go authors and community put substantial effort into molding the style and design of Go programs:
Indentation, spacing, and other surface-level details of code are automatically standardized by the go fmt tool. go vet and golint do additional checking automatically.
Tools and libraries distributed with Go suggest standard approaches to things like API documentation (godoc[49]), testing (go test), building (go build), package management (go get), and so on.
Syntax rules require things that are optional in other languages, for example by banning cyclic dependencies, unused variables or imports, and implicit type conversions.
The omission of certain features (for example, functional-programming shortcuts like map and C++-style try/finally blocks) tends to encourage a particular explicit, concrete, and imperative programming style.
When adapting to the Go ecosystem after working in other languages, differences in style and approach can be as important as low-level language and library differences.
Go includes the same sort of debugging, testing, and code-vetting tools as many language distributions. The Go distribution includes go vet, which analyzes code searching for common stylistic problems and mistakes. A profiler, unit testing tool, gdb debugging support, and a race condition tester are also in the distribution. The Go distribution includes its own build system, which requires only information in the Go files themselves, no separate build files.
There is an ecosystem of third-party tools that add to the standard distribution, such as gocode, which enables code autocompletion in many text editors, goimports (by a Go team member), which automatically adds/removes package imports as needed, errcheck, which detects code that might unintentionally ignore errors, and more. Plugins exist to add language support in widely used text editors, and at least one IDE, LiteIDE, targets Go in particular.
package main
import("flag""fmt""strings")func main(){var omitNewline bool
flag.BoolVar(&omitNewline,"n",false,"don't print final newline")
flag.Parse()// Scans the arg list and sets up flags.
str := strings.Join(flag.Args()," ")if omitNewline {
fmt.Print(str)}else{
fmt.Println(str)}}
// Reading and writing files are basic tasks needed for// many Go programs. First we'll look at some examples of// reading files.package main
import("bufio""fmt""io""io/ioutil""os")// Reading files requires checking most calls for errors.// This helper will streamline our error checks below.func check(e error){if e !=nil{panic(e)}}func main(){// Perhaps the most basic file reading task is// slurping a file's entire contents into memory.
dat, err := ioutil.ReadFile("/tmp/dat")
check(err)
fmt.Print(string(dat))// You'll often want more control over how and what// parts of a file are read. For these tasks, start// by `Open`ing a file to obtain an `os.File` value.
f, err := os.Open("/tmp/dat")
check(err)// Read some bytes from the beginning of the file.// Allow up to 5 to be read but also note how many// actually were read.
b1 :=make([]byte,5)
n1, err := f.Read(b1)
check(err)
fmt.Printf("%d bytes: %s\n", n1,string(b1))// You can also `Seek` to a known location in the file// and `Read` from there.
o2, err := f.Seek(6,0)
check(err)
b2 :=make([]byte,2)
n2, err := f.Read(b2)
check(err)
fmt.Printf("%d bytes @ %d: %s\n", n2, o2,string(b2))// The `io` package provides some functions that may// be helpful for file reading. For example, reads// like the ones above can be more robustly// implemented with `ReadAtLeast`.
o3, err := f.Seek(6,0)
check(err)
b3 :=make([]byte,2)
n3, err := io.ReadAtLeast(f, b3,2)
check(err)
fmt.Printf("%d bytes @ %d: %s\n", n3, o3,string(b3))// There is no built-in rewind, but `Seek(0, 0)`// accomplishes this.
_, err = f.Seek(0,0)
check(err)// The `bufio` package implements a buffered// reader that may be useful both for its efficiency// with many small reads and because of the additional// reading methods it provides.
r4 := bufio.NewReader(f)
b4, err := r4.Peek(5)
check(err)
fmt.Printf("5 bytes: %s\n",string(b4))// Close the file when you're done (usually this would// be scheduled immediately after `Open`ing with// `defer`).
f.Close()}
Doozer, a lock service by managed hosting provider Heroku
Other companies and sites using Go (generally together with other languages, not exclusively) include:[53][54]
Google, for many projects, notably including download server dl.google.com[55][56][57]
Dropbox, migrated some of their critical components from Python to Go[58]
CloudFlare, for their delta-coding proxy Railgun, their distributed DNS service, as well as tools for cryptography, logging, stream processing, and accessing SPDY sites.[59][60]
Michele Simionato wrote in an article for artima.com:[62]
Here I just wanted to point out the design choices about interfaces and inheritance. Such ideas are not new and it is a shame that no popular language has followed such particular route in the design space. I hope Go will become popular; if not, I hope such ideas will finally enter in a popular language, we are already 10 or 20 years too late :-(
Go is extremely easy to dive into. There are a minimal number of fundamental language concepts and the syntax is clean and designed to be clear and unambiguous. Go is still experimental and still a little rough around the edges.
Ars Technica interviewed Rob Pike, one of the authors of Go, and asked why a new language was needed. He replied that:[64]
It wasn't enough to just add features to existing programming languages, because sometimes you can get more in the long run by taking things away. They wanted to start from scratch and rethink everything. ... [But they did not want] to deviate too much from what developers already knew because they wanted to avoid alienating Go's target audience.
Go was named Programming Language of the Year by the TIOBE Programming Community Index in its first year, 2009, for having a larger 12-month increase in popularity (in only 2 months, after its introduction in November) than any other language that year, and reached 13th place by January 2010,[65] surpassing established languages like Pascal. As of August 2014[update], its ranking had dropped to 38th in the index, placing it lower than COBOL and Fortran.[66] Go is already in commercial use by several large organizations.[67]
The complexity of C++ (even more complexity has been added in the new C++), and the resulting impact on productivity, is no longer justified. All the hoops that the C++ programmer had to jump through in order to use a C-compatible language make no sense anymore -- they're just a waste of time and effort. Now, Go makes much more sense for the class of problems that C++ was originally intended to solve.
On the day of the general release of the language, Francis McCabe, developer of the Go! programming language (note the exclamation point), requested a name change of Google's language to prevent confusion with his language.[70] The issue was closed by a Google developer on 12 October 2010 with the custom status "Unfortunate" and with the following comment: "there are many computing products and services named Go. In the 11 months since our release, there has been minimal confusion of the two languages."[71]
^Usually, exactly one of the result and error values has a value other than the type's zero value; sometimes both do, as when a read or write can only be partially completed, and sometimes neither, as when a read returns 0 bytes. See Semipredicate problem: Multivalued return.
^"A Tutorial for the Go Programming Language". The Go Programming Language. Google. Retrieved 10 March 2013. "In Go the rule about visibility of information is simple: if a name (of a top-level type, function, method, constant or variable, or of a structure field or method) is capitalized, users of the package may see it. Otherwise, the name and hence the thing being named is visible only inside the package in which it is declared."
^The Go Programming Language Specification. This deliberately glosses over some details in the spec: close, channel range expressions, the two-argument form of the receive operator, unidrectional channel types, and so on.