Full Code of learnbyexample/Command-line-text-processing for AI

Repository: learnbyexample/Command-line-text-processing
Branch: master
Commit: ce56c851f078
Files: 80
Total size: 519.3 KB

Directory structure:
gitextract_wr_ra6a8/

├── README.md
├── exercises/
│   ├── GNU_grep/
│   │   ├── .ref_solutions/
│   │   │   ├── ex01_basic_match.txt
│   │   │   ├── ex02_basic_options.txt
│   │   │   ├── ex03_multiple_string_match.txt
│   │   │   ├── ex04_filenames.txt
│   │   │   ├── ex05_word_line_matching.txt
│   │   │   ├── ex06_ABC_context_matching.txt
│   │   │   ├── ex07_recursive_search.txt
│   │   │   ├── ex08_search_pattern_from_file.txt
│   │   │   ├── ex09_regex_anchors.txt
│   │   │   ├── ex10_regex_this_or_that.txt
│   │   │   ├── ex11_regex_quantifiers.txt
│   │   │   ├── ex12_regex_character_class_part1.txt
│   │   │   ├── ex13_regex_character_class_part2.txt
│   │   │   ├── ex14_regex_grouping_and_backreference.txt
│   │   │   ├── ex15_regex_PCRE.txt
│   │   │   └── ex16_misc_and_extras.txt
│   │   ├── ex01_basic_match/
│   │   │   └── sample.txt
│   │   ├── ex01_basic_match.txt
│   │   ├── ex02_basic_options/
│   │   │   └── sample.txt
│   │   ├── ex02_basic_options.txt
│   │   ├── ex03_multiple_string_match/
│   │   │   └── sample.txt
│   │   ├── ex03_multiple_string_match.txt
│   │   ├── ex04_filenames/
│   │   │   ├── greeting.txt
│   │   │   ├── poem.txt
│   │   │   └── sample.txt
│   │   ├── ex04_filenames.txt
│   │   ├── ex05_word_line_matching/
│   │   │   ├── greeting.txt
│   │   │   ├── sample.txt
│   │   │   └── words.txt
│   │   ├── ex05_word_line_matching.txt
│   │   ├── ex06_ABC_context_matching/
│   │   │   └── sample.txt
│   │   ├── ex06_ABC_context_matching.txt
│   │   ├── ex07_recursive_search/
│   │   │   ├── msg/
│   │   │   │   ├── greeting.txt
│   │   │   │   └── sample.txt
│   │   │   ├── poem.txt
│   │   │   ├── progs/
│   │   │   │   ├── hello.py
│   │   │   │   └── hello.sh
│   │   │   └── words.txt
│   │   ├── ex07_recursive_search.txt
│   │   ├── ex08_search_pattern_from_file/
│   │   │   ├── baz.txt
│   │   │   ├── foo.txt
│   │   │   └── words.txt
│   │   ├── ex08_search_pattern_from_file.txt
│   │   ├── ex09_regex_anchors/
│   │   │   └── sample.txt
│   │   ├── ex09_regex_anchors.txt
│   │   ├── ex10_regex_this_or_that/
│   │   │   └── sample.txt
│   │   ├── ex10_regex_this_or_that.txt
│   │   ├── ex11_regex_quantifiers/
│   │   │   └── garbled.txt
│   │   ├── ex11_regex_quantifiers.txt
│   │   ├── ex12_regex_character_class_part1/
│   │   │   └── sample_words.txt
│   │   ├── ex12_regex_character_class_part1.txt
│   │   ├── ex13_regex_character_class_part2/
│   │   │   └── sample.txt
│   │   ├── ex13_regex_character_class_part2.txt
│   │   ├── ex14_regex_grouping_and_backreference/
│   │   │   └── sample.txt
│   │   ├── ex14_regex_grouping_and_backreference.txt
│   │   ├── ex15_regex_PCRE/
│   │   │   └── sample.txt
│   │   ├── ex15_regex_PCRE.txt
│   │   ├── ex16_misc_and_extras/
│   │   │   ├── garbled.txt
│   │   │   ├── poem.txt
│   │   │   └── sample.txt
│   │   ├── ex16_misc_and_extras.txt
│   │   └── solve
│   └── README.md
├── file_attributes.md
├── gnu_awk.md
├── gnu_grep.md
├── gnu_sed.md
├── miscellaneous.md
├── overview_presentation/
│   ├── baz.json
│   ├── foo.xml
│   ├── greeting.txt
│   └── sample.txt
├── perl_the_swiss_knife.md
├── restructure_text.md
├── ruby_one_liners.md
├── sorting_stuff.md
├── tail_less_cat_head.md
├── whats_the_difference.md
└── wheres_my_file.md

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Command Line Text Processing

Learn about various commands available for common and exotic text processing needs. Examples have been tested on GNU/Linux; syntax and features may vary with other distributions, so consult their respective `man` pages for details.

---

:warning: :warning: I'm no longer actively working on this repo. Instead, I've converted existing chapters into ebooks (see [ebook section](#ebooks) below for links), available under the same license. These ebooks are better formatted, updated for newer versions of the software, and include exercises, solutions, etc. Since all the chapters have been converted, I'm archiving this repo.

---

<br>

## Ebooks

Individual online ebooks with better formatting, explanations, exercises, solutions, etc:

* [CLI text processing with GNU grep and ripgrep](https://learnbyexample.github.io/learn_gnugrep_ripgrep/)
* [CLI text processing with GNU sed](https://learnbyexample.github.io/learn_gnused/)
* [CLI text processing with GNU awk](https://learnbyexample.github.io/learn_gnuawk/)
* [Ruby One-Liners Guide](https://learnbyexample.github.io/learn_ruby_oneliners/)
* [Perl One-Liners Guide](https://learnbyexample.github.io/learn_perl_oneliners/)
* [CLI text processing with GNU Coreutils](https://learnbyexample.github.io/cli_text_processing_coreutils/)
* [Linux Command Line Computing](https://learnbyexample.github.io/cli-computing/)

See https://learnbyexample.github.io/books/ for links to PDF/EPUB versions and other ebooks.

<br>

## Chapters

As mentioned earlier, I'm no longer actively working on these chapters:

* [Cat, Less, Tail and Head](./tail_less_cat_head.md)
    * cat, less, tail, head, Text Editors
* [GNU grep](./gnu_grep.md)
* [GNU sed](./gnu_sed.md)
* [GNU awk](./gnu_awk.md)
* [Perl the swiss knife](./perl_the_swiss_knife.md)
* [Ruby one liners](./ruby_one_liners.md)
* [Sorting stuff](./sorting_stuff.md)
    * sort, uniq, comm, shuf
* [Restructure text](./restructure_text.md)
    * paste, column, pr, fold
* [Whats the difference](./whats_the_difference.md)
    * cmp, diff
* [Wheres my file](./wheres_my_file.md)
* [File attributes](./file_attributes.md)
    * wc, du, df, touch, file
* [Miscellaneous](./miscellaneous.md)
    * cut, tr, basename, dirname, xargs, seq

<br>

## Webinar recordings

I recorded a couple of videos based on content in the chapters; not sure if I'll do more:

* [Using the sort command](https://www.youtube.com/watch?v=qLfAwwb5vGs)
* [Using uniq and comm](https://www.youtube.com/watch?v=uAb2kxA2TyQ)

See also my short videos on [Linux command line tips](https://www.youtube.com/watch?v=p0KCLusMd5Q&list=PLTv2U3HnAL4PNTmRqZBSUgKaiHbRL2zeY)

<br>

## Exercises

Check out the [exercises](./exercises) directory to solve practice questions on `grep`, right from the command line itself.

See also my [TUI-apps](https://github.com/learnbyexample/TUI-apps) repo for interactive CLI text processing exercises.

<br>

## Contributing

* Please [open an issue](https://github.com/learnbyexample/Command-line-text-processing/issues) for typos or bugs
    * As this repo is no longer actively worked upon, **please do not submit pull requests**
* Share the repo with friends/colleagues, on social media, etc to help reach other learners
* In case you need to reach me, mail me at `echo 'yrneaolrknzcyr.arg@tznvy.pbz' | tr 'a-z' 'n-za-m'` or send a DM via [twitter](https://twitter.com/learn_byexample)
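
The `tr` command in the contact line above is a ROT13 substitution: each letter is replaced by the one 13 places further along the alphabet, so applying it twice gives back the original text. A minimal sketch of the same idea, assuming a made-up encoded string (not the author's address) and a hypothetical helper named `rot13` that also handles uppercase:

```shell
# ROT13 via tr: a-m maps to n-z and n-z maps to a-m (same for uppercase)
rot13() {
    tr 'a-zA-Z' 'n-za-mN-ZA-M'
}

# hypothetical encoded string used only for illustration
encoded='uryyb.jbeyq@rknzcyr.pbz'
decoded=$(printf '%s\n' "$encoded" | rot13)
printf '%s\n' "$decoded"

# running the decoded text through rot13 again restores the encoded form
printf '%s\n' "$decoded" | rot13
```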

<br>

## Acknowledgements

* [unix.stackexchange](https://unix.stackexchange.com/) and [stackoverflow](https://stackoverflow.com/) - for getting answers to pertinent questions as well as sharpening skills by understanding and answering questions
* Forums like [Linux users](https://www.linkedin.com/groups/65688), [/r/commandline/](https://www.reddit.com/r/commandline/), [/r/linux/](https://www.reddit.com/r/linux/), [/r/ruby/](https://www.reddit.com/r/ruby/), [news.ycombinator](https://news.ycombinator.com/news), [devup](http://devup.in/) and others for valuable feedback (especially spotting mistakes) and encouragement
* See [wikipedia entry 'Roses Are Red'](https://en.wikipedia.org/wiki/Roses_Are_Red) for `poem.txt` used as sample text input file

<br>

## License

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/)


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex01_basic_match.txt
================================================
1) Match lines containing the string: day
Solution: grep 'day' sample.txt

2) Match lines containing the string: it
Solution: grep 'it' sample.txt

3) Match lines containing the string: do you
Solution: grep 'do you' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex02_basic_options.txt
================================================
1) Match lines containing the string irrespective of lower/upper case: no
Solution: grep -i 'no' sample.txt

2) Match lines not containing the string: o
Solution: grep -v 'o' sample.txt

3) Match lines with line numbers containing the string: it
Solution: grep -n 'it' sample.txt

4) Output only number of matching lines containing the string: a
Solution: grep -c 'a' sample.txt

5) Match first two lines containing the string: do
Solution: grep -m2 'do' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex03_multiple_string_match.txt
================================================
1) Match lines containing either of these three strings
        String1: Not
        String2: he
        String3: sun
Solution: grep -e 'Not' -e 'he' -e 'sun' sample.txt

2) Match lines containing both these strings
        String1: He
        String2: or
Solution: grep 'He' sample.txt | grep 'or'

3) Match lines containing either of these two strings
        String1: a
        String2: i
   and contains this as well
        String3: do
Solution: grep -e 'a' -e 'i' sample.txt | grep 'do'

4) Match lines containing the string
        String1: it
   but not these strings
        String2: No
        String3: no
Solution: grep 'it' sample.txt | grep -vi 'no'
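
The solutions above combine patterns in three ways: OR with repeated `-e`, AND by piping one `grep` into another, and AND NOT by adding `-v` to the second `grep`. Here is a small sketch of all three, assuming a shortened copy of `sample.txt` written to a temporary directory (the variable names are only for illustration):

```shell
# throwaway directory holding a shortened copy of sample.txt
dir=$(mktemp -d)
printf '%s\n' 'Hello World!' 'Just do it' 'Believe it!' 'Today is sunny' \
    'Not a bit funny' 'No doubt you like it too' 'He he he' > "$dir/sample.txt"

# OR: a line is selected if any -e pattern matches
either=$(grep -e 'Not' -e 'sun' "$dir/sample.txt")

# AND: the second grep filters the lines selected by the first
both=$(grep 'He' "$dir/sample.txt" | grep 'or')

# AND NOT: -v inverts the second match, -i makes it case insensitive
it_without_no=$(grep 'it' "$dir/sample.txt" | grep -vi 'no')

printf '%s\n' "$either" '---' "$both" '---' "$it_without_no"
rm -r "$dir"
```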


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex04_filenames.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Show only filenames containing the string: are
Solution: grep -l 'are' *

2) Show only filenames NOT containing the string: two
Solution: grep -L 'two' *

3) Match all lines containing the string: are
Solution: grep 'are' *

4) Match maximum of two matching lines along with filenames containing the character: a
Solution: grep -m2 'a' *

5) Match all lines without prefixing filename containing the string: to
Solution: grep -h 'to' *


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex05_word_line_matching.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Match lines containing whole word: do
Solution: grep -w 'do' *

2) Match whole lines containing the string: Hello World
Solution: grep -x 'Hello World' *

3) Match lines containing these whole words:
        Word1: He
        Word2: far
Solution: grep -w -e 'far' -e 'He' *

4) Match lines containing the whole word: you
    and NOT containing the case insensitive string: How
Solution: grep -w 'you' * | grep -vi 'how'


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex06_ABC_context_matching.txt
================================================
1) Get lines and 3 following it containing the string: you
Solution: grep -A3 'you' sample.txt

2) Get lines and 2 preceding it containing the string: is
Solution: grep -B2 'is' sample.txt

3) Get lines and 1 following/preceding containing the string: Not
Solution: grep -C1 'Not' sample.txt

4) Get lines and 1 following and 4 preceding containing the string: Not
Solution: grep -A1 -B4 'Not' sample.txt

5) Get lines and 1 preceding it containing the string: you
        there should be no separator between the matches
Solution: grep --no-group-separator -B1 'you' sample.txt

6) Get lines and 1 preceding it containing the string: you
        the separator between the matches should be: #####
Solution: grep --group-separator='#####' -B1 'you' sample.txt
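
When context options produce groups of lines that are not adjacent in the input, GNU grep prints `--` between the groups by default. The sketch below shows the default, suppressed and custom separators side by side, assuming eight stand-in lines rather than the exercise file:

```shell
# hypothetical input: eight short lines
input=$(printf '%s\n' one two three four five six seven eight)

# default: "--" separates the two non-adjacent groups
default_sep=$(printf '%s\n' "$input" | grep -A1 -e 'two' -e 'six')

# no separator between groups
no_sep=$(printf '%s\n' "$input" | grep --no-group-separator -A1 -e 'two' -e 'six')

# custom separator between groups
custom_sep=$(printf '%s\n' "$input" | grep --group-separator='#####' -A1 -e 'two' -e 'six')

printf '%s\n' "$custom_sep"
```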


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex07_recursive_search.txt
================================================
Note: Every file in this directory and sub-directories is input for grep, unless otherwise specified

1) Match all lines containing the string: you
Solution: grep -r 'you'

2) Show only filenames matching the string: Hello
    filenames should only end with .txt 
Solution: grep -rl --include='*.txt' 'Hello'

3) Show only filenames matching the string: Hello
    filenames should NOT end with .txt 
Solution: grep -rl --exclude='*.txt' 'Hello'

4) Show only filenames matching the string: are
    should not include the directory: progs
Solution: grep -rl --exclude-dir='progs' 'are'

5) Show only filenames matching the string: are
    should NOT include these directories
            dir1: progs
            dir2: msg
Solution: grep -rl --exclude-dir='progs' --exclude-dir='msg' 'are'

6) Show only filenames matching the string: are
    should include files only from sub-directories
    hint: use shell glob pattern to specify directories to search
Solution: grep -rl 'are' */
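
To try the recursive options without touching the exercise directory, a similar tree can be built in a temporary location. This is only a sketch: the three files are reduced to one line each, and the output is sorted because the order in which `grep -r` visits files is not guaranteed.

```shell
# miniature version of the exercise tree in a throwaway directory
dir=$(mktemp -d)
mkdir "$dir/msg" "$dir/progs"
printf 'Hello World\n' > "$dir/msg/greeting.txt"
printf 'print("Hello World")\n' > "$dir/progs/hello.py"
printf 'Roses are red,\n' > "$dir/poem.txt"

# only files whose names end with .txt
txt_only=$(cd "$dir" && grep -rl --include='*.txt' 'Hello' | sort)

# everything except files whose names end with .txt
not_txt=$(cd "$dir" && grep -rl --exclude='*.txt' 'Hello' | sort)

# skip the progs directory entirely
no_progs=$(cd "$dir" && grep -rl --exclude-dir='progs' 'Hello' | sort)

printf '%s\n' "$txt_only" "$not_txt" "$no_progs"
rm -r "$dir"
```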


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex08_search_pattern_from_file.txt
================================================
Note: words.txt has only whole words per line, use it as file input when task is to match whole words

1) Match all strings from file words.txt in file baz.txt
Solution: grep -f words.txt baz.txt 

2) Match all words from file words.txt in file foo.txt
    should only match whole words
    should print only matching words, not entire line
Solution: grep -owf words.txt foo.txt

3) Show common lines between foo.txt and baz.txt
Solution: grep -Fxf foo.txt baz.txt

4) Show lines present in baz.txt but not in foo.txt
Solution: grep -Fxvf foo.txt baz.txt

5) Show lines present in foo.txt but not in baz.txt
Solution: grep -Fxvf baz.txt foo.txt

6) Find all words common between all three files in the directory
    should only match whole words
    should print only matching words, not entire line
Solution: grep -owf words.txt foo.txt | grep -owf- baz.txt
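
The `-F` in solutions 3 to 5 matters because lines such as `a[5] = 'good';` contain regular expression metacharacters. Without `-F`, `[5]` is read as a bracket expression and the line no longer matches itself. A sketch of the difference, assuming shortened copies of `foo.txt` and `baz.txt` in a temporary directory:

```shell
dir=$(mktemp -d)
printf '%s\n' 'part' "a[5] = 'good';" 'I saw a few red cars going that way' > "$dir/foo.txt"
printf '%s\n' 'I saw a few red cars going that way' 'To the end!' "a[5] = 'good';" > "$dir/baz.txt"

# common lines: fixed strings (-F), whole line (-x), patterns from file (-f)
common=$(grep -Fxf "$dir/foo.txt" "$dir/baz.txt")

# lines present only in baz.txt
only_baz=$(grep -Fxvf "$dir/foo.txt" "$dir/baz.txt")

# same as the first command but without -F: the a[5] line is missed
regex_common=$(grep -xf "$dir/foo.txt" "$dir/baz.txt")

printf '%s\n' "$common" '---' "$only_baz" '---' "$regex_common"
rm -r "$dir"
```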


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex09_regex_anchors.txt
================================================
1) Match all lines starting with: no
Solution: grep '^no' sample.txt

2) Match all lines ending with: it
Solution: grep 'it$' sample.txt

3) Match all lines containing whole word: do
Solution: grep -w 'do' sample.txt

4) Match all lines containing words starting with: do
Solution: grep '\<do' sample.txt

5) Match all lines containing words ending with: do
Solution: grep 'do\>' sample.txt

6) Match all lines starting with: ^
Solution: grep '^^' sample.txt

7) Match all lines ending with: $
Solution: grep '$$' sample.txt

8) Match all lines containing the string: in
    not surrounded by word boundaries, for ex: mint but not tin or ink
Solution: grep '\Bin\B' sample.txt
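
The word anchors are easier to compare on a tiny input. The sketch below uses four made-up words to show `\<`, `\>` and `\B`, plus a `^` that is literal because it is not the first character of the pattern:

```shell
# hypothetical word list, one word per line
words=$(printf '%s\n' mint tin ink in)

# "in" with word characters on both sides
inside=$(printf '%s\n' "$words" | grep '\Bin\B')

# "in" at the start of a word
starts=$(printf '%s\n' "$words" | grep '\<in')

# "in" at the end of a word
ends=$(printf '%s\n' "$words" | grep 'in\>')

# first ^ is the line anchor, second ^ is a literal character
caret=$(printf '%s\n' '^ comes first' 'no caret here' | grep '^^')

printf '%s\n' "$inside" '---' "$starts" '---' "$ends" '---' "$caret"
```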


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex10_regex_this_or_that.txt
================================================
1) Match all lines containing any of these strings:
        String1: day
        String2: not
Solution: grep -E 'day|not' sample.txt

2) Match all lines containing any of these whole words:
        String1: he
        String2: in
Solution: grep -wE 'he|in' sample.txt

3) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he
Solution: grep -E 'he|be|to|you' sample.txt

4) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he
    but NOT these strings:
        String1: it
        String2: do
Solution: grep -E 'he|be|to|you' sample.txt | grep -vE 'do|it'

5) Match all lines starting with any of these strings:
        String1: no
        String2: to
Solution: grep -E '^no|^to' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex11_regex_quantifiers.txt
================================================
1) Extract all 3 character strings surrounded by word boundaries
Solution: grep -ow '...' garbled.txt

2) Extract largest string from each line
        starting with character: d
        ending with character  : g
Solution: grep -o 'd.*g' garbled.txt

3) Extract all strings from each line
        starting with character: d
        followed by zero or one: o
        ending with character  : g
Solution: grep -oE 'do?g' garbled.txt

4) Extract all strings from each line
        starting with character: d
        followed by zero or one of any character
        ending with character  : g
Solution: grep -oE 'd.?g' garbled.txt

5) Extract all strings from each line
        starting with character : g
        followed by at least one: o
        ending with character   : d
Solution: grep -oE 'go+d' garbled.txt

6) Extract all strings from each line
        starting with character: g
        followed by exactly six: o
        ending with character  : d
Solution: grep -oE 'go{6}d' garbled.txt

7) Extract all strings from each line
        starting with character         : g
        followed by min two and max four: o
        ending with character           : d
Solution: grep -oE 'go{2,4}d' garbled.txt

8) Extract all strings from each line
        starting with character: d
        followed by max of two : o
        ending with character  : g
Solution: grep -oE 'do{,2}g' garbled.txt

9) Extract all strings from each line
        starting with character : g
        followed by min of three: o
        ending with character   : d
Solution: grep -oE 'go{3,}d' garbled.txt
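
The quantifiers used above can be compared on a single line of input. This sketch assumes a made-up string instead of `garbled.txt`; note that `{,2}` (no lower bound) is a GNU extension.

```shell
# hypothetical input with zero to four o characters between g and d
input='gd god good goood gooood'

zero_or_one=$(printf '%s\n' "$input" | grep -oE 'go?d')
one_or_more=$(printf '%s\n' "$input" | grep -oE 'go+d')
two_to_four=$(printf '%s\n' "$input" | grep -oE 'go{2,4}d')
up_to_two=$(printf '%s\n' "$input" | grep -oE 'go{,2}d')
at_least_three=$(printf '%s\n' "$input" | grep -oE 'go{3,}d')

# -o prints each match on its own line
printf '%s\n' "$two_to_four"
```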


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex12_regex_character_class_part1.txt
================================================
1) Match all lines containing any of these characters:
        character1: q
        character2: x
        character3: z
Solution: grep '[qzx]' sample_words.txt

2) Match all lines containing any of these characters:
        character1: c
        character2: f
    followed by any character
    followed by   : t
Solution: grep '[cf].t' sample_words.txt

3) Extract all words starting with character: s
    ignore case
    should contain only alphabets
    minimum two letters
    should be surrounded by word boundaries
Solution: grep -iowE 's[a-z]+' sample_words.txt

4) Extract all words made up of these characters:
        character1: a
        character2: c
        character3: e
        character4: r
        character5: s
    ignore case
    should contain only alphabets
    should be surrounded by word boundaries
Solution: grep -iowE '[acers]+' sample_words.txt

5) Extract all numbers surrounded by word boundaries
Solution: grep -ow '[0-9]*' sample_words.txt

6) Extract all numbers surrounded by word boundaries matching the condition
    30 <= number <= 70
Solution: grep -owE '[3-6][0-9]|70' sample_words.txt

7) Extract all words made up of non-vowel characters
    ignore case
    should contain only alphabets and at least two
    should be surrounded by word boundaries
Solution: grep -iowE '[b-df-hj-np-tv-z]{2,}' sample_words.txt

8) Extract all sequence of strings consisting of character: -
    surrounded on either side by zero or more case insensitive alphabets    
Solution: grep -io '[a-z]*-[a-z]*' sample_words.txt
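
Bracket expressions match characters, not numeric values, so a range such as 30 to 70 has to be spelled out digit by digit. A sketch with made-up inputs, showing how `-w` keeps `130` and `71` out of the result and how ranges can skip the vowels:

```shell
# numbers from 30 to 70: tens digit 3-6 with any units digit, or exactly 70
in_range=$(printf '%s\n' 29 30 45 70 71 130 | grep -owE '[3-6][0-9]|70')

# whole words of two or more letters with no vowels, ignoring case
consonants=$(printf '%s\n' rhythm tree Sky | grep -iowE '[b-df-hj-np-tv-z]{2,}')

printf '%s\n' "$in_range" '---' "$consonants"
```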


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex13_regex_character_class_part2.txt
================================================
1) Extract all characters before first occurrence of =
Solution: grep -o '^[^=]*' sample.txt

2) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the underscore character
Solution: grep -o '^\w*' sample.txt

3) Match all lines containing the sequence
        String1: there
        any number of whitespace
        String2: have
Solution: grep 'there\s*have' sample.txt

4) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the characters [ and ]
        ending with ]
Solution: grep -oi '^[]a-z0-9[]*]' sample.txt

5) Extract all punctuation characters from first line
Solution: grep -om1 '[[:punct:]]' sample.txt
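
Two details in these solutions are easy to miss: a negated set such as `[^=]` stops at the first `=`, and a literal `]` must be the first character inside a bracket expression. A sketch with made-up one-line inputs (`\s` is a GNU extension):

```shell
# everything before the first =
before_eq=$(printf '%s\n' 'foo=bar=baz' | grep -o '^[^=]*')

# ] placed first in the set is literal; the pattern must end with ]
bracketed=$(printf '%s\n' 'arr[10] = 5' | grep -oi '^[]a-z0-9[]*]')

# named character class for punctuation
puncts=$(printf '%s\n' 'a,b;c!' | grep -o '[[:punct:]]')

# \s* allows any amount of whitespace, including none
spaced=$(printf '%s\n' 'there   have' 'therehave' 'there is' | grep -c 'there\s*have')

printf '%s\n' "$before_eq" "$bracketed" "$puncts" "$spaced"
```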


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex14_regex_grouping_and_backreference.txt
================================================
1) Match lines containing these strings
        String1: scare
        String2: spore
Solution: grep -E 's(po|ca)re' sample.txt

2) Extract these words
        Word1: handy
        Word2: hand
        Word3: hands
        Word4: handful
Solution: grep -oE 'hand([sy]|ful)?' sample.txt

3) Extract all whole words with at least one letter occurring twice in the word
    ignore case
    only alphabets
    the letter occurring twice need not be placed next to each other
Solution: grep -ioE '[a-z]*([a-z])[a-z]*\1[a-z]*' sample.txt

4) Match lines where same sequence of three consecutive alphabets is matched another time in the same line
    ignore case
Solution: grep -iE '([a-z]{3}).*\1' sample.txt
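
Grouping serves two purposes here: limiting the scope of alternation and capturing text for a backreference. A sketch with made-up inputs for each case:

```shell
# alternation applies only inside the group
grouped=$(printf '%s\n' scare spore share | grep -E 's(po|ca)re')

# optional group: the longest possible suffix is matched
optional=$(printf '%s\n' 'handy hands handful hand' | grep -oE 'hand([sy]|ful)?')

# \1 must repeat exactly what the first group matched
repeated=$(printf '%s\n' abcxyzabc abcdef | grep -E '([a-z]{3}).*\1')

printf '%s\n' "$grouped" '---' "$optional" '---' "$repeated"
```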


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex15_regex_PCRE.txt
================================================
1) Extract all strings to the right of =
    provided characters from start of line until = do not include [ or ]
Solution: grep -oP '^[^][=]+=\K.*' sample.txt

2) Match all lines containing the string: Hi
    but shouldn't be followed afterwards in the line by: are
Solution: grep -P 'Hi(?!.*are)' sample.txt

3) Extract from start of line up to the string: Hi
    provided it is followed afterwards in the line by: you
Solution: grep -oP '.*Hi(?=.*you)' sample.txt

4) Extract all sequence of characters surrounded on both sides by space character
    the space character should not be part of output
Solution: grep -oP ' \K[^ ]+(?= )' sample.txt

5) Extract all words
    made of upper or lower case alphabets
    at least two letters in length
    surrounded by word boundaries
    should not contain consecutive repeated alphabets
Solution: grep -iowP '[a-z]*([a-z])\1[a-z]*(*SKIP)(*F)|[a-z]{2,}' sample.txt
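
In the PCRE solutions, `\K` discards whatever was matched so far from the output, while `(?=...)` and `(?!...)` check what follows without consuming it. A sketch with made-up inputs, assuming a `grep` built with PCRE support so that `-P` is available:

```shell
# \K: "key=" must match but is not part of the output
value=$(printf '%s\n' 'key=value' | grep -oP '^[^=]+=\K.*')

# negative lookahead: Hi must not be followed later in the line by "are"
no_are=$(printf '%s\n' 'Hi there' 'Hi, how are you' | grep -P 'Hi(?!.*are)')

# positive lookahead: "you" must follow, but is not part of the output
up_to_hi=$(printf '%s\n' 'Oh Hi, see you' 'Hi there' | grep -oP '.*Hi(?=.*you)')

printf '%s\n' "$value" "$no_are" "$up_to_hi"
```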



================================================
FILE: exercises/GNU_grep/.ref_solutions/ex16_misc_and_extras.txt
================================================
Note: all files in directory are input to grep, unless otherwise specified

1) Extract all negative numbers
    starts with - followed by one or more digits
    do not output filenames
Solution: grep -hoE -- '-[0-9]+' *

2) Display only filenames containing these two strings anywhere in the file
        String1: day
        String2: and
Solution: grep -zlE 'day.*and|and.*day' *

3) The below command
        grep -c '^Solution:' ../.ref_solutions/*
    will give number of questions in each exercise. Change it, using another command and pipe if needed, so that only overall total is printed
Solution: cat ../.ref_solutions/* | grep -c '^Solution:'
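
Three ideas appear in these solutions: `--` stops option parsing so that a pattern may start with `-`, `-z` treats the whole file as one record so that a match can span lines, and concatenating files before `grep -c` gives a single total. A sketch with made-up files in a temporary directory (file names and contents are assumptions):

```shell
dir=$(mktemp -d)
printf '%s\n' 'good day' 'night and fog' > "$dir/both.txt"
printf '%s\n' 'good day' > "$dir/one.txt"

# without -- the pattern would be parsed as an option
negatives=$(printf '%s\n' 'a -5 b 7 c -12' | grep -oE -- '-[0-9]+')

# -z: "day" and "and" are on different lines of both.txt
across_lines=$(cd "$dir" && grep -zlE 'day.*and|and.*day' *)

# one overall count instead of one count per file
total=$(cat "$dir"/*.txt | grep -c 'day')

printf '%s\n' "$negatives" "$across_lines" "$total"
rm -r "$dir"
```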



================================================
FILE: exercises/GNU_grep/ex01_basic_match/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex01_basic_match.txt
================================================
1) Match lines containing the string: day


2) Match lines containing the string: it


3) Match lines containing the string: do you



================================================
FILE: exercises/GNU_grep/ex02_basic_options/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex02_basic_options.txt
================================================
1) Match lines containing the string irrespective of lower/upper case: no


2) Match lines not containing the string: o


3) Match lines with line numbers containing the string: it


4) Output only number of matching lines containing the string: a


5) Match first two lines containing the string: do



================================================
FILE: exercises/GNU_grep/ex03_multiple_string_match/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex03_multiple_string_match.txt
================================================
1) Match lines containing either of these three strings
        String1: Not
        String2: he
        String3: sun


2) Match lines containing both these strings
        String1: He
        String2: or


3) Match lines containing either of these two strings
        String1: a
        String2: i
   and contains this as well
        String3: do


4) Match lines containing the string
        String1: it
   but not these strings
        String2: No
        String3: no



================================================
FILE: exercises/GNU_grep/ex04_filenames/greeting.txt
================================================
Hi, how are you?

Hola :)

Hello world

Good day

Rock on


================================================
FILE: exercises/GNU_grep/ex04_filenames/poem.txt
================================================
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.


================================================
FILE: exercises/GNU_grep/ex04_filenames/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex04_filenames.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Show only filenames containing the string: are


2) Show only filenames NOT containing the string: two


3) Match all lines containing the string: are


4) Match maximum of two matching lines along with filenames containing the character: a


5) Match all lines without prefixing filename containing the string: to



================================================
FILE: exercises/GNU_grep/ex05_word_line_matching/greeting.txt
================================================
Hi, how are you?

Hola :)

Hello World

Good day

Rock on


================================================
FILE: exercises/GNU_grep/ex05_word_line_matching/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex05_word_line_matching/words.txt
================================================
afar
far
carfare
farce
faraway
airfare


================================================
FILE: exercises/GNU_grep/ex05_word_line_matching.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Match lines containing whole word: do


2) Match whole lines containing the string: Hello World


3) Match lines containing these whole words:
        Word1: He
        Word2: far


4) Match lines containing the whole word: you
    and NOT containing the case insensitive string: How



================================================
FILE: exercises/GNU_grep/ex06_ABC_context_matching/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex06_ABC_context_matching.txt
================================================
1) Get lines and 3 following it containing the string: you


2) Get lines and 2 preceding it containing the string: is


3) Get lines and 1 following/preceding containing the string: Not


4) Get lines and 1 following and 4 preceding containing the string: Not


5) Get lines and 1 preceding it containing the string: you
        there should be no separator between the matches


6) Get lines and 1 preceding it containing the string: you
        the separator between the matches should be: #####



================================================
FILE: exercises/GNU_grep/ex07_recursive_search/msg/greeting.txt
================================================
Hi, how are you?

Hola :)

Hello World

Good day

Rock on


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/msg/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/poem.txt
================================================
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/progs/hello.py
================================================
#!/usr/bin/python3

print("Hello World")


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/progs/hello.sh
================================================
#!/bin/bash

echo "Hello $USER"
echo "Today is $(date -u +%A)"
echo 'Hope you are having a nice day'


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/words.txt
================================================
afar
far
carfare
farce
faraway
airfare


================================================
FILE: exercises/GNU_grep/ex07_recursive_search.txt
================================================
Note: Every file in this directory and sub-directories is input for grep, unless otherwise specified

1) Match all lines containing the string: you


2) Show only filenames matching the string: Hello
    filenames should only end with .txt 


3) Show only filenames matching the string: Hello
    filenames should NOT end with .txt 


4) Show only filenames matching the string: are
    should not include the directory: progs


5) Show only filenames matching the string: are
    should NOT include these directories
            dir1: progs
            dir2: msg


6) Show only filenames matching the string: are
    should include files only from sub-directories
    hint: use shell glob pattern to specify directories to search



================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file/baz.txt
================================================
I saw a few red cars going that way
To the end!
Are you coming today to the party?
a[5] = 'good';
Have you read the Harry Potter series?


================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file/foo.txt
================================================
part
a[5] = 'good';
I saw a few red cars going that way
Believe it!
to do list


================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file/words.txt
================================================
car
part
to
read


================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file.txt
================================================
Note: words.txt has only whole words per line, use it as file input when task is to match whole words

1) Match all strings from file words.txt in file baz.txt


2) Match all words from file words.txt in file foo.txt
    should only match whole words
    should print only matching words, not entire line


3) Show common lines between foo.txt and baz.txt


4) Show lines present in baz.txt but not in foo.txt


5) Show lines present in foo.txt but not in baz.txt


6) Find all words common between all three files in the directory
    should only match whole words
    should print only matching words, not entire line



================================================
FILE: exercises/GNU_grep/ex09_regex_anchors/sample.txt
================================================
hello world!

good day
how do you do?

just do it
believe it!

today is sunny
not a bit funny
no doubt you like it too

much ado about nothing
he he he

^ could be exponentiation or xor operator
scalar variables in perl start with $


================================================
FILE: exercises/GNU_grep/ex09_regex_anchors.txt
================================================
1) Match all lines starting with: no


2) Match all lines ending with: it


3) Match all lines containing whole word: do


4) Match all lines containing words starting with: do


5) Match all lines containing words ending with: do


6) Match all lines starting with: ^


7) Match all lines ending with: $


8) Match all lines containing the string: in
    not surrounded by word boundaries, for ex: mint but not tin or ink



================================================
FILE: exercises/GNU_grep/ex10_regex_this_or_that/sample.txt
================================================
hello world!

good day
how do you do?

just do it
believe it!

today is sunny
not a bit funny
no doubt you like it too

much ado about nothing
he he he

^ could be exponentiation or xor operator
scalar variables in perl start with $


================================================
FILE: exercises/GNU_grep/ex10_regex_this_or_that.txt
================================================
1) Match all lines containing any of these strings:
        String1: day
        String2: not


2) Match all lines containing any of these whole words:
        String1: he
        String2: in


3) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he


4) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he
    but NOT these strings:
        String1: it
        String2: do


5) Match all lines starting with any of these strings:
        String1: no
        String2: to



================================================
FILE: exercises/GNU_grep/ex11_regex_quantifiers/garbled.txt
================================================
gd
god
goood
oh gold
goooooodyyyy
dog
dg
dig good gold
doogoodog
c@t made forty justify
dodging a toy


================================================
FILE: exercises/GNU_grep/ex11_regex_quantifiers.txt
================================================
1) Extract all 3 character strings surrounded by word boundaries


2) Extract largest string from each line
        starting with character: d
        ending with character  : g


3) Extract all strings from each line
        starting with character: d
        followed by zero or one: o
        ending with character  : g


4) Extract all strings from each line
        starting with character: d
        followed by zero or one of any character
        ending with character  : g


5) Extract all strings from each line
        starting with character : g
        followed by at least one: o
        ending with character   : d


6) Extract all strings from each line
        starting with character: g
        followed by exactly six: o
        ending with character  : d


7) Extract all strings from each line
        starting with character         : g
        followed by min two and max four: o
        ending with character           : d


8) Extract all strings from each line
        starting with character: d
        followed by max of two : o
        ending with character  : g


9) Extract all strings from each line
        starting with character : g
        followed by min of three: o
        ending with character   : d



================================================
FILE: exercises/GNU_grep/ex12_regex_character_class_part1/sample_words.txt
================================================
far 30 scarce f@$t 42 fit
Cute 34 quite pry far-fetched Sure
70 cast-away 12 good hue he
cry just Nymph race Peace. 67
foo;bar;baz;p@t
ARE 72 cut copy paste
p1ate rest 512 Sync


================================================
FILE: exercises/GNU_grep/ex12_regex_character_class_part1.txt
================================================
1) Match all lines containing any of these characters:
        character1: q
        character2: x
        character3: z


2) Match all lines containing any of these characters:
        character1: c
        character2: f
    followed by any character
    followed by   : t


3) Extract all words starting with character: s
    ignore case
    should contain only alphabets
    minimum two letters
    should be surrounded by word boundaries


4) Extract all words made up of these characters:
        character1: a
        character2: c
        character3: e
        character4: r
        character5: s
    ignore case
    should contain only alphabets
    should be surrounded by word boundaries


5) Extract all numbers surrounded by word boundaries


6) Extract all numbers surrounded by word boundaries matching the condition
    30 <= number <= 70


7) Extract all words made up of non-vowel characters
    ignore case
    should contain only alphabets and at least two
    should be surrounded by word boundaries


8) Extract all strings consisting of the character: -
    surrounded on either side by zero or more case insensitive alphabets



================================================
FILE: exercises/GNU_grep/ex13_regex_character_class_part2/sample.txt
================================================
a[2]='sample string'
foo_bar=4232
appx_pi=3.14
greeting="Hi  there		have a nice   day"
food[4]="dosa"
b[0][1]=42


================================================
FILE: exercises/GNU_grep/ex13_regex_character_class_part2.txt
================================================
1) Extract all characters before first occurrence of =


2) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the underscore character


3) Match all lines containing the sequence
        String1: there
        any number of whitespace
        String2: have


4) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the characters [ and ]
        ending with ]


5) Extract all punctuation characters from first line



================================================
FILE: exercises/GNU_grep/ex14_regex_grouping_and_backreference/sample.txt
================================================
hands hand library scare handy handful
scared too big time eel candy
spare food regulate circuit spore stare
tire tempt cold malady


================================================
FILE: exercises/GNU_grep/ex14_regex_grouping_and_backreference.txt
================================================
1) Match lines containing these strings
        String1: scare
        String2: spore


2) Extract these words
        Word1: handy
        Word2: hand
        Word3: hands
        Word4: handful


3) Extract all whole words with at least one letter occurring twice in the word
    ignore case
    only alphabets
    the letter occurring twice need not be placed next to each other


4) Match lines where same sequence of three consecutive alphabets is matched another time in the same line
    ignore case



================================================
FILE: exercises/GNU_grep/ex15_regex_PCRE/sample.txt
================================================
a[2]='Hi, how are you?'
foo_bar=4232
appx_pi=3.14
greeting="Hi there have a nice day"
food[4]="dosa"
b[0][1]=42


================================================
FILE: exercises/GNU_grep/ex15_regex_PCRE.txt
================================================
1) Extract all strings to the right of =
    provided characters from start of line until = do not include [ or ]


2) Match all lines containing the string: Hi
    but shouldn't be followed afterwards in the line by: are


3) Extract from start of line up to the string: Hi
    provided it is followed afterwards in the line by: you


4) Extract all sequences of characters surrounded on both sides by a space character
    the space character should not be part of output


5) Extract all words
    made of upper or lower case alphabets
    at least two letters in length
    surrounded by word boundaries
    should not contain consecutive repeated alphabets




================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras/garbled.txt
================================================
day and night
-43 and 99 and 12


================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras/poem.txt
================================================
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

Good day to you :)


================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras/sample.txt
================================================
account balance: -2300
good day
foo and bar and baz


================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras.txt
================================================
Note: all files in directory are input to grep, unless otherwise specified

1) Extract all negative numbers
    starts with - followed by one or more digits
    do not output filenames


2) Display only filenames containing these two strings anywhere in the file
        String1: day
        String2: and


3) The below command
        grep -c '^Solution:' ../.ref_solutions/*
    will give number of questions in each exercise. Change it, using another command and pipe if needed, so that only overall total is printed




================================================
FILE: exercises/GNU_grep/solve
================================================
dir_name=$(basename "$PWD")
ref_file="../.ref_solutions/$dir_name.txt"
sol_file="../$dir_name.txt"
tmp_file='../.tmp.txt'

# color output
tcolors=$(tput colors)
if [[ -n $tcolors && $tcolors -ge 8 ]]; then
    red=$(tput setaf 1)
    green=$(tput setaf 2)
    blue=$(tput setaf 4)
    clr_color=$(tput sgr0)
else
    red=''
    green=''
    blue=''
    clr_color=''
fi

sub_sol=0
if [[ $1 == -s ]]; then
    prev_cmd=$(fc -ln -2 | sed 's/^[ \t]*//;q')
    sub_sol=1
elif [[ $1 == -q ]]; then
    # highlight the question to be solved next
    # or show only the (unanswered)? question to be solved next
    cat "$sol_file"
    return
elif [[ -n $1 ]]; then
    echo -e 'Unknown option...Exiting script'
    return
fi

count=0
sol_count=0
err_count=0
while IFS= read -u3 -r ref_line && read -u4 -r sol_line; do
    if [[ "${ref_line:0:9}" == Solution: ]]; then
        (( count++ ))

        if [[ $sub_sol == 1 && -z $sol_line ]]; then
            sol_line="$prev_cmd"
            sub_sol=0
        fi

        if [[ "$(eval "command ${ref_line:10}")" == "$(eval "command $sol_line")" ]]; then
            (( sol_count++ ))
            # use color if terminal supports
            echo '---------------------------------------------'
            echo "Match for question $count:"
            echo "${red}Submitted solution:${clr_color} $sol_line"
            echo "${green}Reference solution:${clr_color} ${ref_line:10}"
            echo '---------------------------------------------'
        else
            (( err_count++ ))
            if [[ $err_count == 1 && -n $sol_line ]]; then
                echo '---------------------------------------------'
                echo "Mismatch for question $count:"
                echo "$(tput bold)${red}Expected output is:${clr_color}$(tput sgr0)"
                eval "command ${ref_line:10}"
                echo '---------------------------------------------'
            fi
            sol_line=''
        fi
    fi

    echo "$sol_line" >> "$tmp_file"

done 3<"$ref_file" 4<"$sol_file"

((count==sol_count)) && printf "\t\t$(tput bold)${blue}All Pass${clr_color}$(tput sgr0)\t\t\n"

mv "$tmp_file" "$sol_file"

# vim: syntax=bash


================================================
FILE: exercises/README.md
================================================
# <a name="exercises"></a>Exercises

The instructions and shell script here assume the `bash` shell. Tested on *GNU bash, version 4.3.46*

<br>

* For example, the first exercise for **GNU_grep**
    * directory: `ex01_basic_match`
    * question file: `ex01_basic_match.txt`
    * solution reference: `.ref_solutions/ex01_basic_match.txt`
* Each exercise contains one or more questions to be solved
* The script `solve` will assist in checking solutions

```bash
$ git clone https://github.com/learnbyexample/Command-line-text-processing.git
$ cd Command-line-text-processing/exercises/GNU_grep/
$ ls
ex01_basic_match      ex02_basic_options      ex03_multiple_string_match      solve
ex01_basic_match.txt  ex02_basic_options.txt  ex03_multiple_string_match.txt

$ find -name 'ex01*'
./.ref_solutions/ex01_basic_match.txt
./ex01_basic_match
./ex01_basic_match.txt
```

<br>

* Solving the questions
    * Go to the exercise folder
    * Use `ls` to see input file(s)
    * To see the problems for that exercise, follow the steps below

```bash
$ cd ex01_basic_match
$ ls
sample.txt

$ # to see the questions
$ source ../solve -q
1) Match lines containing the string: day


2) Match lines containing the string: it


3) Match lines containing the string: do you


$ # or open the questions file with your fav editor
$ gvim ../$(basename "$PWD").txt
$ # create an alias to use from any ex* directory
$ alias oq='gvim ../$(basename "$PWD").txt'
$ oq
```

<br>

* Submitting solutions one by one
    * immediately after executing the command that answers a question, call the `solve` script with the `-s` option

```bash
$ grep 'day' sample.txt 
Good day
Today is sunny
$ source ../solve -s
---------------------------------------------
Match for question 1:
Submitted solution: grep 'day' sample.txt 
Reference solution: grep 'day' sample.txt
---------------------------------------------
```

<br>

* Submit all at once
    * by editing the `../$(basename "$PWD").txt` file directly
    * the answer should replace the empty line immediately following the question
* **Note**
    * there are different ways to solve the same question
    * but for a tool-specific exercise set like **GNU_grep**, try to solve using `grep` only
    * also, remember that `eval` is used to run both the submitted and the reference commands to compare their output, so be sure of the commands you submit

```bash
$ cat ../$(basename "$PWD").txt
1) Match lines containing the string: day
grep 'day' sample.txt

2) Match lines containing the string: it
sed -n '/it/p' sample.txt

3) Match lines containing the string: do you
echo 'How do you do?'

$ source ../solve
---------------------------------------------
Match for question 1:
Submitted solution: grep 'day' sample.txt
Reference solution: grep 'day' sample.txt
---------------------------------------------
---------------------------------------------
Match for question 2:
Submitted solution: sed -n '/it/p' sample.txt
Reference solution: grep 'it' sample.txt
---------------------------------------------
---------------------------------------------
Match for question 3:
Submitted solution: echo 'How do you do?'
Reference solution: grep 'do you' sample.txt
---------------------------------------------
		All Pass		
```
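
As the output above shows, a submission passes when its output is identical to the output of the reference command, whichever tool produced it. A simplified sketch of that comparison, where the function name `same_output` is hypothetical and not part of `solve`:

```bash
# Hypothetical helper: succeeds when both command strings print identical output.
# Both strings are passed to eval, so only use this with commands you trust.
same_output() {
    [ "$(eval "$1")" = "$(eval "$2")" ]
}

# different commands, same output, so this counts as a match
if same_output "printf 'a\nb\n' | grep 'a'" "printf 'a\nb\n' | sed -n '/a/p'"; then
    echo 'match'
else
    echo 'mismatch'
fi
```

Because the strings are executed as given, a destructive command submitted as a solution would really be run.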

<br>

* Then move on to next exercise directory
* Create aliases for frequently used commands, after first checking that the alias names are not already in use

```bash
$ type cs cq ca nq pq
bash: type: cs: not found
bash: type: cq: not found
bash: type: ca: not found
bash: type: nq: not found
bash: type: pq: not found

$ alias cs='source ../solve -s'
$ alias cq='source ../solve -q'
$ alias ca='source ../solve'
$ # to go to directory of next question
$ # 10# forces base 10, otherwise 08 and 09 are rejected as invalid octal numbers
$ nq() { d=$(basename "$PWD"); nd=$(printf "../ex%02d*/" $((10#${d:2:2}+1))); cd $nd ; }
$ # to go to directory of previous question
$ pq() { d=$(basename "$PWD"); pd=$(printf "../ex%02d*/" $((10#${d:2:2}-1))); cd $pd ; }
```
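
One pitfall when computing the next directory number from a name like `ex08_...`: shell arithmetic treats a number with a leading zero as octal, so `08` and `09` are invalid. A minimal sketch of one way around it, stripping the leading zero before the arithmetic (`next_ex_prefix` is a hypothetical helper name used only for illustration):

```bash
# Hypothetical helper: print the prefix of the next exercise directory,
# for example ex08_search_pattern_from_file -> ex09
next_ex_prefix() {
    num=${1#ex}       # drop the leading "ex"
    num=${num%%_*}    # keep only the two-digit number
    num=${num#0}      # strip a leading zero so 08 and 09 are not parsed as octal
    printf 'ex%02d\n' "$((num + 1))"
}

next_ex_prefix ex08_search_pattern_from_file
next_ex_prefix ex15_regex_PCRE
```

In `bash`, prefixing the number with `10#` inside `$(( ))` is another way to force base 10.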

<br>

If a wrong solution is submitted, the expected output is shown. This also helps in understanding the question, as I found it difficult to convey the intent of a question clearly with words alone.

```bash
$ source ../solve -q
1) Match lines containing the string: day


2) Match lines containing the string: it


3) Match lines containing the string: do you

$ grep 'do' sample.txt 
How do you do?
Just do it
No doubt you like it too
Much ado about nothing
$ source ../solve -s
---------------------------------------------
Mismatch for question 1:
Expected output is:
Good day
Today is sunny
---------------------------------------------
```


================================================
FILE: file_attributes.md
================================================
# <a name="file-attributes"></a>File attributes

**Table of Contents**

* [wc](#wc)
    * [Various counts](#various-counts)
    * [subtle differences](#subtle-differences)
    * [Further reading for wc](#further-reading-for-wc)
* [du](#du)
    * [Default size](#default-size)
    * [Various size formats](#various-size-formats)
    * [Dereferencing links](#dereferencing-links)
    * [Filtering options](#filtering-options)
    * [Further reading for du](#further-reading-for-du)
* [df](#df)
    * [Examples](#examples)
    * [Further reading for df](#further-reading-for-df)
* [touch](#touch)
    * [Creating empty file](#creating-empty-file)
    * [Updating timestamps](#updating-timestamps)
    * [Preserving timestamp](#preserving-timestamp)
    * [Further reading for touch](#further-reading-for-touch)
* [file](#file)
    * [File type examples](#file-type-examples)
    * [Further reading for file](#further-reading-for-file)

<br>

## <a name="wc"></a>wc

```bash
$ wc --version | head -n1
wc (GNU coreutils) 8.25

$ man wc
WC(1)                            User Commands                           WC(1)

NAME
       wc - print newline, word, and byte counts for each file

SYNOPSIS
       wc [OPTION]... [FILE]...
       wc [OPTION]... --files0-from=F

DESCRIPTION
       Print newline, word, and byte counts for each FILE, and a total line if
       more than one FILE is specified.  A word is a non-zero-length  sequence
       of characters delimited by white space.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="various-counts"></a>Various counts

```bash
$ cat sample.txt
Hello World
Good day
No doubt you like it too
Much ado about nothing
He he he

$ # by default, gives newline/word/byte count (in that order)
$ wc sample.txt
 5 17 78 sample.txt

$ # options to get individual numbers
$ wc -l sample.txt
5 sample.txt
$ wc -w sample.txt
17 sample.txt
$ wc -c sample.txt
78 sample.txt

$ # use shell input redirection if filename is not needed
$ wc -l < sample.txt
5
```

* multiple file input
* automatically displays total at end

```bash
$ cat greeting.txt
Hello there
Have a safe journey
$ cat fruits.txt
Fruit   Price
apple   42
banana  31
fig     90
guava   6

$ wc *.txt
  5  10  57 fruits.txt
  2   6  32 greeting.txt
  5  17  78 sample.txt
 12  33 167 total
```

* use `-L` to get length of longest line

```bash
$ wc -L < sample.txt
24

$ echo 'foo bar baz' | wc -L
11
$ echo 'hi there!' | wc -L
9

$ # last line will show max value, not sum of all input
$ wc -L *.txt
 13 fruits.txt
 19 greeting.txt
 24 sample.txt
 24 total
```

<br>

#### <a name="subtle-differences"></a>subtle differences

* byte count vs character count

```bash
$ # when input is ASCII
$ printf 'hi there' | wc -c
8
$ printf 'hi there' | wc -m
8

$ # when input has multi-byte characters
$ printf 'hi👍' | od -x
0000000 6968 9ff0 8d91
0000006

$ printf 'hi👍' | wc -m
3

$ printf 'hi👍' | wc -c
6
```
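
The character count from `-m` depends on the current locale, while the byte count from `-c` does not. A small sketch, assuming GNU `wc`: the emoji is written as octal escapes so that the script itself stays ASCII, and forcing the `C` locale is expected to make `-m` treat every byte as one character.

```bash
# the same six bytes as 'hi👍', written with octal escapes
bytes=$(printf 'hi\360\237\221\215' | wc -c)
echo "bytes: $bytes"

# character count with the locale forced to C, where multibyte sequences are not decoded
chars_c=$(printf 'hi\360\237\221\215' | LC_ALL=C wc -m)
echo "characters in C locale: $chars_c"
```

If `wc -m` reports the same number as `wc -c` for input known to contain multibyte characters, checking the output of `locale` is a reasonable first step.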

* the `-l` option counts newline characters only, so a last line without a trailing newline is not counted

```bash
$ printf 'hi there\ngood day' | wc -l
1
$ printf 'hi there\ngood day\n' | wc -l
2
$ printf 'hi there\n\n\nfoo\n' | wc -l
4
```
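
When the last line may lack a trailing newline and it should still be counted, one possible alternative is `grep -c ''`, which counts lines rather than newline characters. A short sketch comparing the two:

```bash
# wc -l counts newline characters, so the unterminated last line is missed
newlines=$(printf 'hi there\ngood day' | wc -l)
echo "wc -l: $newlines"

# grep -c '' counts every line, including an unterminated last line
lines=$(printf 'hi there\ngood day' | grep -c '')
echo "grep -c: $lines"
```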

* From `man wc` "A word is a non-zero-length sequence of characters delimited by white space"

```bash
$ echo 'foo        bar ;-*' | wc -w
3

$ # use other text processing as needed
$ echo 'foo        bar ;-*' | grep -iowE '[a-z]+'
foo
bar
$ echo 'foo        bar ;-*' | grep -iowE '[a-z]+' | wc -l
2
```

* `-L` doesn't count non-printable characters, and tabs are counted as the equivalent number of spaces

```bash
$ printf 'food\tgood' | wc -L
12
$ printf 'food\tgood' | wc -m
9
$ printf 'food\tgood' | awk '{print length()}'
9

$ printf 'foo\0bar\0baz' | wc -L
9
$ printf 'foo\0bar\0baz' | wc -m
11
$ printf 'foo\0bar\0baz' | awk '{print length()}'
11
```
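
The value 12 for the tab example can be reproduced by expanding the tab to spaces first. A sketch using `expand`, which pads to the next tab stop (every 8 columns by default):

```bash
# 'food' occupies 4 columns, the tab pads to column 8, then 'good' adds 4 more
width=$(printf 'food\tgood\n' | expand | awk '{print length()}')
echo "$width"
```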

<br>

#### <a name="further-reading-for-wc"></a>Further reading for wc

* `man wc` and `info wc` for more options and detailed documentation
* [wc Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/wc?sort=votes&pageSize=15)
* [wc Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/wc?sort=votes&pageSize=15)

<br>

## <a name="du"></a>du

```bash
$ du --version | head -n1
du (GNU coreutils) 8.25

$ man du
DU(1)                            User Commands                           DU(1)

NAME
       du - estimate file space usage

SYNOPSIS
       du [OPTION]... [FILE]...
       du [OPTION]... --files0-from=F

DESCRIPTION
       Summarize disk usage of the set of FILEs, recursively for directories.
...
```

<br>

<br>

#### <a name="default-size"></a>Default size

* By default, size is reported in units of **1024 bytes**
* Files are not listed individually; sizes of all directories and sub-directories are reported recursively

```bash
$ ls -F
projs/  py_learn@  words.txt

$ du
17920   ./projs/full_addr
14316   ./projs/half_addr
32952   ./projs
33880   .
```

* use `-a` to recursively show both files and directories
* use `-s` to show total directory size without descending into its sub-directories

```bash
$ du -a
712     ./projs/report.log
17916   ./projs/full_addr/faddr.v
17920   ./projs/full_addr
14312   ./projs/half_addr/haddr.v
14316   ./projs/half_addr
32952   ./projs
0       ./py_learn
924     ./words.txt
33880   .

$ du -s
33880   .

$ du -s projs words.txt
32952   projs
924     words.txt
```

* use `-S` to show directory size without taking into account size of its sub-directories

```bash
$ du -S
17920   ./projs/full_addr
14316   ./projs/half_addr
716     ./projs
928     .
```

<br>

<br>

#### <a name="various-size-formats"></a>Various size formats

```bash
$ # number of bytes
$ stat -c %s words.txt
938848
$ du -b words.txt
938848  words.txt

$ # kilobytes = 1024 bytes
$ du -sk projs
32952   projs
$ # megabytes = 1024 kilobytes
$ du -sm projs
33      projs

$ # -B to specify custom byte scale size
$ du -sB 5000 projs
6749    projs
$ du -sB 1048576 projs
33      projs
```

* human readable and si units

```bash
$ # in terms of powers of 1024
$ # M = 1048576 bytes and so on
$ du -sh projs/* words.txt
18M     projs/full_addr
14M     projs/half_addr
712K    projs/report.log
924K    words.txt

$ # in terms of powers of 1000
$ # M = 1000000 bytes and so on
$ du -s --si projs/* words.txt
19M     projs/full_addr
15M     projs/half_addr
730k    projs/report.log
947k    words.txt
```
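
The figures shown for the different units appear consistent with `du` rounding up to the next whole unit rather than rounding to nearest. A sketch of that arithmetic using the sizes from the examples above (`ceil_div` is a hypothetical helper, not a `du` feature):

```bash
# Hypothetical helper: integer division rounded up
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

bytes=$((32952 * 1024))          # size of projs from du -sk, converted to bytes
ceil_div "$bytes" 5000           # compare with: du -sB 5000 projs
ceil_div "$bytes" 1048576        # compare with: du -sm projs
ceil_div $((924 * 1024)) 1000    # words.txt in units of 1000 bytes, compare with: du --si
```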

* sorting

```bash
$ du -sh projs/* words.txt | sort -h
712K    projs/report.log
924K    words.txt
14M     projs/half_addr
18M     projs/full_addr

$ du -sk projs/* | sort -nr
17920   projs/full_addr
14316   projs/half_addr
712     projs/report.log
```

* to get size based on number of bytes in the file rather than disk space allotted

```bash
$ du -b words.txt
938848  words.txt

$ du -h words.txt
924K    words.txt

$ # 938848/1024 = 916.84
$ du --apparent-size -h words.txt
917K    words.txt
```

<br>

#### <a name="dereferencing-links"></a>Dereferencing links

* See `man` and `info` pages for other related options

```bash
$ # -D to dereference command line argument
$ du py_learn
0       py_learn
$ du -shD py_learn
503M    py_learn

$ # -L to dereference links found by du
$ du -sh
34M     .
$ du -shL
536M    .
```

<br>

#### <a name="filtering-options"></a>Filtering options

* `-d` to specify maximum depth

```bash
$ du -ah projs
712K    projs/report.log
18M     projs/full_addr/faddr.v
18M     projs/full_addr
14M     projs/half_addr/haddr.v
14M     projs/half_addr
33M     projs

$ du -ah -d1 projs
712K    projs/report.log
18M     projs/full_addr
14M     projs/half_addr
33M     projs
```

* `-c` to also show total size at end

```bash
$ du -cshD projs py_learn
33M     projs
503M    py_learn
535M    total
```

* `-t` to provide a threshold comparison

```bash
$ # >= 15M
$ du -Sh -t 15M
18M     ./projs/full_addr

$ # <= 1M
$ du -ah -t -1M
712K    ./projs/report.log
0       ./py_learn
924K    ./words.txt
```

* excluding files/directories based on **glob** pattern
* see also `--exclude-from=FILE` and `--files0-from=FILE` options

```bash
$ # note that excluded files affect directory size reported
$ du -ah --exclude='*addr*' projs
712K    projs/report.log
716K    projs

$ # depending on shell, brace expansion can be used
$ du -ah --exclude='*.'{v,log} projs
4.0K    projs/full_addr
4.0K    projs/half_addr
12K     projs
```

<br>

#### <a name="further-reading-for-du"></a>Further reading for du

* `man du` and `info du` for more options and detailed documentation
* [du Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/disk-usage?sort=votes&pageSize=15)
* [du Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/du?sort=votes&pageSize=15)

<br>

## <a name="df"></a>df

```bash
$ df --version | head -n1
df (GNU coreutils) 8.25

$ man df
DF(1)                            User Commands                           DF(1)

NAME
       df - report file system disk space usage

SYNOPSIS
       df [OPTION]... [FILE]...

DESCRIPTION
       This  manual  page  documents  the  GNU version of df.  df displays the
       amount of disk space available on the file system containing each  file
       name  argument.   If  no file name is given, the space available on all
       currently mounted file systems is shown.
...
```

<br>

#### <a name="examples"></a>Examples

```bash
$ # use df without arguments to get information on all currently mounted file systems
$ df .
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda1       98298500 58563816  34734748  63% /

$ # use -B option for custom size
$ # use --si for size in powers of 1000 instead of 1024
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        94G   56G   34G  63% /
```

* Use `--output` to report only specific fields of interest

```bash
$ df -h --output=size,used,file / /media/learnbyexample/projs
 Size  Used File
  94G   56G /
  92G   35G /media/learnbyexample/projs

$ df -h --output=pcent .
Use%
 63%

$ df -h --output=pcent,fstype | awk -F'%' 'NR>2 && $1>=40'
 63% ext3
 40% ext4
 51% ext4
```
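
To see how the `awk` filter on the percentage column works without depending on the local mounts, here is a sketch that runs it on a hypothetical sample in the same layout (the rows are made up for illustration). This variant uses `NR>1` to skip only the header line and `$1+0` to force a numeric comparison.

```bash
# Hypothetical sample in the layout printed by: df --output=pcent,fstype
sample='Use% Type
  1% tmpfs
 63% ext3
 40% ext4
 12% vfat'

# -F'%' makes the number before % the first field
# NR>1 skips the header line, $1+0 forces a numeric comparison
filtered=$(printf '%s\n' "$sample" | awk -F'%' 'NR>1 && $1+0 >= 40')
printf '%s\n' "$filtered"
```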

<br>

#### <a name="further-reading-for-df"></a>Further reading for df

* `man df` and `info df` for more options and detailed documentation
* [df Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/df?sort=votes&pageSize=15)
* [Parsing df command output with awk](https://unix.stackexchange.com/questions/360865/parsing-df-command-output-with-awk)
* [processing df output](https://www.reddit.com/r/bash/comments/68dbml/using_an_array_variable_in_an_awk_command/)

<br>

## <a name="touch"></a>touch

```bash
$ touch --version | head -n1
touch (GNU coreutils) 8.25

$ man touch
TOUCH(1)                         User Commands                        TOUCH(1)

NAME
       touch - change file timestamps

SYNOPSIS
       touch [OPTION]... FILE...

DESCRIPTION
       Update  the  access  and modification times of each FILE to the current
       time.

       A FILE argument that does not exist is created empty, unless -c  or  -h
       is supplied.
...
```

<br>

#### <a name="creating-empty-file"></a>Creating empty file

```bash
$ ls foo.txt
ls: cannot access 'foo.txt': No such file or directory
$ touch foo.txt
$ ls foo.txt
foo.txt

$ # use -c if new file shouldn't be created
$ rm foo.txt
$ touch -c foo.txt
$ ls foo.txt
ls: cannot access 'foo.txt': No such file or directory
```

<br>

#### <a name="updating-timestamps"></a>Updating timestamps

* Updating both access and modification timestamp to current time

```bash
$ # last access time
$ stat -c %x fruits.txt
2017-07-19 17:06:01.523308599 +0530
$ # last modification time
$ stat -c %y fruits.txt
2017-07-13 13:54:03.576055933 +0530

$ touch fruits.txt
$ stat -c %x fruits.txt
2017-07-21 10:11:44.241921229 +0530
$ stat -c %y fruits.txt
2017-07-21 10:11:44.241921229 +0530
```

* Updating only access or modification timestamp

```bash
$ touch -a greeting.txt
$ stat -c %x greeting.txt
2017-07-21 10:14:08.457268564 +0530
$ stat -c %y greeting.txt
2017-07-13 13:54:26.004499660 +0530

$ touch -m sample.txt
$ stat -c %x sample.txt
2017-07-13 13:48:24.945450646 +0530
$ stat -c %y sample.txt
2017-07-21 10:14:40.770006144 +0530
```

* Using timestamp from another file to update

```bash
$ stat -c $'%x\n%y' power.log report.log
2017-07-19 10:48:03.978295434 +0530
2017-07-14 20:50:42.850887578 +0530
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530

$ # copy both access and modification timestamp from power.log to report.log
$ touch -r power.log report.log
$ stat -c $'%x\n%y' report.log
2017-07-19 10:48:03.978295434 +0530
2017-07-14 20:50:42.850887578 +0530

$ # add -a or -m options to limit to only access or modification timestamp
```

* Using date string to update
* See also `-t` option

```bash
$ # add -a or -m as needed
$ touch -d '2010-03-17 17:04:23' report.log
$ stat -c $'%x\n%y' report.log
2010-03-17 17:04:23.000000000 +0530
2010-03-17 17:04:23.000000000 +0530
```

<br>

#### <a name="preserving-timestamp"></a>Preserving timestamp

* In-place text processing on a file updates its timestamps

```bash
$ stat -c $'%x\n%y' power.log
2017-07-21 11:11:42.862874240 +0530
2017-07-13 21:31:53.496323704 +0530

$ sed -i 's/foo/bar/g' power.log
$ stat -c $'%x\n%y' power.log
2017-07-21 11:12:20.303504336 +0530
2017-07-21 11:12:20.303504336 +0530
```

* `touch` can be used to restore timestamps after processing

```bash
$ # first copy the timestamps using touch -r
$ stat -c $'%x\n%y' story.txt
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530
$ # tmp.txt is temporary empty file
$ touch -r story.txt tmp.txt
$ stat -c $'%x\n%y' tmp.txt
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530

$ # after text processing, copy back the timestamps and remove temporary file
$ sed -i 's/cat/dog/g' story.txt
$ touch -r tmp.txt story.txt && rm tmp.txt
$ stat -c $'%x\n%y' story.txt
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530
```
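
The save, edit and restore steps above can be wrapped in a small function. This is only a sketch: `keep_times` is a hypothetical name, and it assumes the command takes the file as its last argument.

```bash
# Hypothetical helper: run a command that edits a file in place,
# then restore the file's original access and modification timestamps
keep_times() {
    local file ref status
    file=$1
    shift
    ref=$(mktemp) || return 1
    touch -r "$file" "$ref"     # save timestamps on a temporary file
    "$@" "$file"                # run the command with the file as its last argument
    status=$?
    touch -r "$ref" "$file"     # copy the saved timestamps back
    rm -f -- "$ref"
    return "$status"
}

# usage sketch: keep_times story.txt sed -i 's/cat/dog/g'
```

The exit status of the wrapped command is returned, so a failed edit can still be detected by the caller.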

<br>

#### <a name="further-reading-for-touch"></a>Further reading for touch

* `man touch` and `info touch` for more options and detailed documentation
* [touch Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/touch?sort=votes&pageSize=15)

<br>

## <a name="file"></a>file

```bash
$ file --version | head -n1
file-5.25

$ man file
FILE(1)                   BSD General Commands Manual                  FILE(1)

NAME
     file — determine file type

SYNOPSIS
     file [-bcEhiklLNnprsvzZ0] [--apple] [--extension] [--mime-encoding]
          [--mime-type] [-e testname] [-F separator] [-f namefile]
          [-m magicfiles] [-P name=value] file ...
     file -C [-m magicfiles]
     file [--help]

DESCRIPTION
     This manual page documents version 5.25 of the file command.

     file tests each argument in an attempt to classify it.  There are three
     sets of tests, performed in this order: filesystem tests, magic tests,
     and language tests.  The first test that succeeds causes the file type to
     be printed.
...
```

<br>

<br>

#### <a name="file-type-examples"></a>File type examples

```bash
$ file sample.txt
sample.txt: ASCII text
$ # without file name in output
$ file -b sample.txt
ASCII text

$ printf 'hi👍\n' | file -
/dev/stdin: UTF-8 Unicode text
$ printf 'hi👍\n' | file -i -
/dev/stdin: text/plain; charset=utf-8

$ file ch
ch:  Bourne-Again shell script, ASCII text executable

$ file sunset.jpg moon.png
sunset.jpg: JPEG image data
moon.png: PNG image data, 32 x 32, 8-bit/color RGBA, non-interlaced
```

* different line terminators

```bash
$ printf 'hi' | file -
/dev/stdin: ASCII text, with no line terminators

$ printf 'hi\r' | file -
/dev/stdin: ASCII text, with CR line terminators

$ printf 'hi\r\n' | file -
/dev/stdin: ASCII text, with CRLF line terminators

$ printf 'hi\n' | file -
/dev/stdin: ASCII text
```

* find all files of particular type in current directory, for example `image` files

```bash
$ find -type f -exec bash -c '(file -b "$0" | grep -wq "image data") && echo "$0"' {} \;
./sunset.jpg
./moon.png

$ # if filenames do not contain : or newline characters
$ find -type f -exec file {} + | awk -F: '/\<image data\>/{print $1}'
./sunset.jpg
./moon.png
```
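
Another way to sketch the same search is to test the MIME type reported by the `--mime-type` option (listed in the synopsis above) instead of the description text. `list_by_mime` is a hypothetical helper name; since file names are never parsed out of the `file` output, a `:` in a file name is not a problem, though names containing newlines would still make the listing ambiguous.

```bash
# hypothetical helper: list regular files under DIR whose MIME type
# starts with PREFIX, for example: list_by_mime . image/
list_by_mime() {
    find "$1" -type f -exec sh -c '
        prefix=$1
        shift
        for f in "$@"; do
            # -b omits the file name, so only the MIME type is compared
            case $(file -b --mime-type "$f") in
                "$prefix"*) printf "%s\n" "$f" ;;
            esac
        done' sh "$2" {} +
}

list_by_mime . image/
```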

<br>

#### <a name="further-reading-for-file"></a>Further reading for file

* `man file` and `info file` for more options and detailed documentation
* See also `identify` command which `describes the format and characteristics of one or more image files`


================================================
FILE: gnu_awk.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnuawk/. The ebook also has content updated for newer version of the commands, includes a chapter on regular expressions, has exercises, solutions, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnuawk

---

<br> <br> <br>

## <a name="gnu-awk"></a>GNU awk

**Table of Contents**

* [Field processing](#field-processing)
    * [Default field separation](#default-field-separation)
    * [Specifying different input field separator](#specifying-different-input-field-separator)
    * [Specifying different output field separator](#specifying-different-output-field-separator)
* [Filtering](#filtering)
    * [Idiomatic print usage](#idiomatic-print-usage)
    * [Field comparison](#field-comparison)
    * [Regular expressions based filtering](#regular-expressions-based-filtering)
    * [Fixed string matching](#fixed-string-matching)
    * [Line number based filtering](#line-number-based-filtering)
* [Case Insensitive filtering](#case-insensitive-filtering)
* [Changing record separators](#changing-record-separators)
    * [Paragraph mode](#paragraph-mode)
    * [Multicharacter RS](#multicharacter-rs)
* [Substitute functions](#substitute-functions)
* [Inplace file editing](#inplace-file-editing)
* [Using shell variables](#using-shell-variables)
* [Multiple file input](#multiple-file-input)
* [Control Structures](#control-structures)
    * [if-else and loops](#if-else-and-loops)
    * [next and nextfile](#next-and-nextfile)
* [Multiline processing](#multiline-processing)
* [Two file processing](#two-file-processing)
    * [Comparing whole lines](#comparing-whole-lines)
    * [Comparing specific fields](#comparing-specific-fields)
    * [getline](#getline)
* [Creating new fields](#creating-new-fields)
* [Dealing with duplicates](#dealing-with-duplicates)
* [Lines between two REGEXPs](#lines-between-two-regexps)
    * [All unbroken blocks](#all-unbroken-blocks)
    * [Specific blocks](#specific-blocks)
    * [Broken blocks](#broken-blocks)
* [Arrays](#arrays)
* [awk scripts](#awk-scripts)
* [Miscellaneous](#miscellaneous)
    * [FPAT and FIELDWIDTHS](#fpat-and-fieldwidths)
    * [String functions](#string-functions)
    * [Executing external commands](#executing-external-commands)
    * [printf formatting](#printf-formatting)
    * [Redirecting print output](#redirecting-print-output)
* [Gotchas and Tips](#gotchas-and-tips)
* [Further Reading](#further-reading)

<br>

```bash
$ awk --version | head -n1
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

$ man awk
GAWK(1)                        Utility Commands                        GAWK(1)

NAME
       gawk - pattern scanning and processing language

SYNOPSIS
       gawk [ POSIX or GNU style options ] -f program-file [ -- ] file ...
       gawk [ POSIX or GNU style options ] [ -- ] program-text file ...

DESCRIPTION
       Gawk  is  the  GNU Project's implementation of the AWK programming lan‐
       guage.  It conforms to the definition of  the  language  in  the  POSIX
       1003.1  Standard.   This version in turn is based on the description in
       The AWK Programming Language, by Aho, Kernighan, and Weinberger.   Gawk
       provides  the additional features found in the current version of Brian
       Kernighan's awk and a number of GNU-specific extensions.
...
```

**Prerequisites and notes**

* familiarity with programming concepts like variables, printing, control structures, arrays, etc
* familiarity with regular expressions
    * if not, check out **ERE** portion of [GNU sed regular expressions](./gnu_sed.md#regular-expressions) which is close enough to features available in `gawk`
* this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using `grep`, `sed`, etc
* see [Gawk: Effective AWK Programming](https://www.gnu.org/software/gawk/manual/) manual for complete reference, has information on other `awk` versions as well as notes on POSIX standard

<br>

## <a name="field-processing"></a>Field processing

<br>

#### <a name="default-field-separation"></a>Default field separation

* `$0` contains the entire input record
    * default input record separator is newline character
* `$1` contains the first field text
    * default input field separator is one or more of continuous space, tab or newline characters
* `$2` contains the second field text and so on
* `$(2+3)` result of expressions can be used, this one evaluates to `$5` and hence gives fifth field
    * similarly if variable `i` has value `2`, then `$(i+3)` will give fifth field
    * See also [gawk manual - Expressions](https://www.gnu.org/software/gawk/manual/html_node/Expressions.html)
* `NF` is a built-in variable which contains number of fields in the current record
    * so, `$NF` will give last field
    * `$(NF-1)` will give second last field and so on

```bash
$ cat fruits.txt
fruit   qty
apple   42
banana  31
fig     90
guava   6

$ # print only first field
$ awk '{print $1}' fruits.txt
fruit
apple
banana
fig
guava

$ # print only second field
$ awk '{print $2}' fruits.txt
qty
42
31
90
6
```
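
The expression based field access and `NF` described in the list above can be tried with a short one-line input. A minimal sketch, where the input string and the variable name `i` are chosen only for illustration:

```bash
# $(expression): 2+3 evaluates to 5, so this prints the fifth field
echo 'a b c d e' | awk '{print $(2+3)}'

# same idea with a variable supplying part of the field number
echo 'a b c d e' | awk -v i=2 '{print $(i+3)}'

# NF is the number of fields, $NF and $(NF-1) are the last two fields
echo 'a b c d e' | awk '{print NF, $NF, $(NF-1)}'
```

The three commands print `e`, `e` and `5 e d` respectively.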

<br>

#### <a name="specifying-different-input-field-separator"></a>Specifying different input field separator

* by using `-F` command line option
* by setting `FS` variable
* See [FPAT and FIELDWIDTHS](#fpat-and-fieldwidths) section for other ways of defining input fields

```bash
$ # second field where input field separator is :
$ echo 'foo:123:bar:789' | awk -F: '{print $2}'
123

$ # last field
$ echo 'foo:123:bar:789' | awk -F: '{print $NF}'
789

$ # first and last field
$ # note the use of , and space between output fields
$ echo 'foo:123:bar:789' | awk -F: '{print $1, $NF}'
foo 789

$ # second last field
$ echo 'foo:123:bar:789' | awk -F: '{print $(NF-1)}'
bar

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | awk -F';' '{print $3}'
three
```

* Regular expressions based input field separator

```bash
$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{print $2}'
string

$ # first field will be empty as there is nothing before '{'
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $1}'

$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $2}'
foo
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $3}'
bar
```

* default input field separator is one or more of continuous space, tab or newline characters (will be termed as whitespace here on)
    * exact same behavior if `FS` is assigned single space character
* in addition, leading and trailing whitespaces won't be considered when splitting the input record

```bash
$ printf ' a    ate b\tc   \n'
 a    ate b     c
$ printf ' a    ate b\tc   \n' | awk '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk '{print NF}'
4
$ # same behavior if FS is assigned to single space character
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print NF}'
4

$ # for anything else, leading/trailing whitespaces will be considered
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print $2}'
a
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print NF}'
6
```

* assigning empty string to FS will split the input record character wise
* note the use of command line option `-v` to set FS

```bash
$ echo 'apple' | awk -v FS= '{print $1}'
a
$ echo 'apple' | awk -v FS= '{print $2}'
p
$ echo 'apple' | awk -v FS= '{print $NF}'
e

$ # detecting multibyte characters depends on locale
$ printf 'hi👍 how are you?' | awk -v FS= '{print $3}'
👍
```

**Further Reading**

* [gawk manual - Field Splitting Summary](https://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html#Field-Splitting-Summary)
* [stackoverflow - explanation on default FS](https://stackoverflow.com/questions/30405694/default-field-separator-for-awk)
* [unix.stackexchange - filter lines if it contains a particular character only once](https://unix.stackexchange.com/questions/362550/how-to-remove-line-if-it-contains-a-character-exactly-once)
* [stackoverflow - Processing 2 files with different field separators](https://stackoverflow.com/questions/24516141/awk-processing-2-files-with-different-field-separators)

<br>

#### <a name="specifying-different-output-field-separator"></a>Specifying different output field separator

* by setting `OFS` variable
* also gets added between every argument to `print` statement
    * use [printf](#printf-formatting) to avoid this
* default is single space

```bash
$ # statements inside BEGIN are executed before processing any input text
$ echo 'foo:123:bar:789' | awk 'BEGIN{FS=OFS=":"} {print $1, $NF}'
foo:789
$ # can also be set using command line option -v
$ echo 'foo:123:bar:789' | awk -F: -v OFS=':' '{print $1, $NF}'
foo:789

$ # changing a field will re-build contents of $0
$ echo ' a      ate b   ' | awk '{$2 = "foo"; print $0}' | cat -A
a foo b$

$ # $1=$1 is an idiomatic way to re-build when there is nothing else to change
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{print $0}'
foo:123:bar:789
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{$1=$1; print $0}'
foo-123-bar-789

$ # OFS is used to separate different arguments given to print
$ echo 'foo:123:bar:789' | awk -F: -v OFS='\t' '{print $1, $3}'
foo     bar

$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{$1=$1; print $0}'
Sample string with numbers
```
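
`OFS` is inserted only between comma separated arguments of `print`. A short sketch contrasting that with string concatenation and with `printf`, using a made-up input string:

```bash
# comma separated arguments are joined using OFS
echo 'foo:123' | awk -F: -v OFS='-' '{print $1, $2}'

# arguments separated only by space are concatenated, OFS is not used
echo 'foo:123' | awk -F: -v OFS='-' '{print $1 $2}'

# printf ignores OFS and ORS, the format string decides the output
echo 'foo:123' | awk -F: -v OFS='-' '{printf "%s|%s\n", $1, $2}'
```

The outputs are `foo-123`, `foo123` and `foo|123` respectively.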

<br>

## <a name="filtering"></a>Filtering

<br>

#### <a name="idiomatic-print-usage"></a>Idiomatic print usage

* `print` statement with no arguments will print contents of `$0`
* if condition is specified without corresponding statements, contents of `$0` is printed if condition evaluates to true
* `1` is typically used to represent always true condition and thus print contents of `$0`

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # displaying contents of input file(s) similar to 'cat' command
$ # equivalent to using awk '{print $0}' and awk '1'
$ awk '{print}' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
```

<br>

#### <a name="field-comparison"></a>Field comparison

* Each block of statements within `{}` can be prefixed by an optional condition so that those statements will execute only if condition evaluates to true
* Condition specified without corresponding statements will lead to printing contents of `$0` if condition evaluates to true

```bash
$ # if first field exactly matches the string 'apple'
$ awk '$1=="apple"{print $2}' fruits.txt
42

$ # print first field if second field > 35
$ # NR>1 to avoid the header line
$ # NR built-in variable contains record number
$ awk 'NR>1 && $2>35{print $1}' fruits.txt
apple
fig

$ # print header and lines with qty < 35
$ awk 'NR==1 || $2<35' fruits.txt
fruit   qty
banana  31
guava   6
```

* If the above examples are too confusing, think of the compact form as syntactic sugar
* Statements are grouped within `{}`
    * inside `{}`, we have an `if` control structure
    * As in `C`, braces are not needed for a single statement within `if`, but assume `{}` is used for clarity
    * From this explicit syntax, remove the outer `{}`, `if` and `()` used for `if`
* As we'll see later, this allows a few lines of program to be written compactly on the command line itself
    * Of course, for medium to large programs, it is better to put the code in separate file. See [awk scripts](#awk-scripts) section

```bash
$ # awk '$1=="apple"{print $2}' fruits.txt
$ awk '{
         if($1 == "apple"){
            print $2
         }
       }' fruits.txt
42

$ # awk 'NR==1 || $2<35' fruits.txt
$ awk '{
         if(NR==1 || $2<35){
            print $0
         }
       }' fruits.txt
fruit   qty
banana  31
guava   6
```

**Further Reading**

* [gawk manual - Truth Values and Conditions](https://www.gnu.org/software/gawk/manual/html_node/Truth-Values-and-Conditions.html)
* [gawk manual - Operator Precedence](https://www.gnu.org/software/gawk/manual/html_node/Precedence.html)
* [unix.stackexchange - filtering columns by header name](https://unix.stackexchange.com/questions/359697/print-columns-in-awk-by-header-name)

<br>

#### <a name="regular-expressions-based-filtering"></a>Regular expressions based filtering

* the *REGEXP* is specified within `//` and by default acts upon `$0`
* See also [stackoverflow - lines around matching regexp](https://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern)

```bash
$ # all lines containing the string 'are'
$ # same as: grep 'are' poem.txt
$ awk '/are/' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ # negating REGEXP, same as: grep -v 'are' poem.txt
$ awk '!/are/' poem.txt
Sugar is sweet,

$ # same as: grep 'are' poem.txt | grep -v 'so'
$ awk '/are/ && !/so/' poem.txt
Roses are red,
Violets are blue,

$ # lines starting with 'a' or 'b'
$ awk '/^[ab]/' fruits.txt
apple   42
banana  31

$ # print last field of all lines containing 'are'
$ awk '/are/{print $NF}' poem.txt
red,
blue,
you.
```

* strings can be used as well, which will be interpreted as *REGEXP* if necessary
* Allows [using shell variables](#using-shell-variables) instead of hardcoded *REGEXP*
    * that section also notes difference between using `//` and string

```bash
$ awk '$0 !~ "are"' poem.txt
Sugar is sweet,

$ awk '$0 ~ "^[ab]"' fruits.txt
apple   42
banana  31

$ # also helpful if search strings have the / delimiter character
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
$ awk '/\/foo\/a\//' paths.txt
/foo/a/report.log
$ awk '$0 ~ "/foo/a/"' paths.txt
/foo/a/report.log
```

* *REGEXP* matching against specific field

```bash
$ # if first field contains 'a'
$ awk '$1 ~ /a/' fruits.txt
apple   42
banana  31
guava   6

$ # if first field contains 'a' and qty > 20
$ awk '$1 ~ /a/ && $2 > 20' fruits.txt
apple   42
banana  31

$ # if first field does NOT contain 'a'
$ awk '$1 !~ /a/' fruits.txt
fruit   qty
fig     90
```

<br>

#### <a name="fixed-string-matching"></a>Fixed string matching

* to search a string literally, `index` function can be used instead of *REGEXP*
    * similar to `grep -F`
* the function returns the starting position of the first match, or `0` if no match is found

```bash
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # no output since '+' is meta character, would need '/a\+b/'
$ awk '/a+b/' eqns.txt
$ # same as: grep -F 'a+b' eqns.txt
$ awk 'index($0,"a+b")' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # much easier than '/i\*\(t\+9-g\)/'
$ awk 'index($0,"i*(t+9-g)")' eqns.txt
i*(t+9-g)/8,4-a+b

$ # check only last field
$ awk -F, 'index($NF,"a+b")' eqns.txt
i*(t+9-g)/8,4-a+b
$ # index not needed if entire field/line is being compared
$ awk -F, '$1=="a+b"' eqns.txt
a+b,pi=3.14,5e12
```

* the return value is useful to match at a specific position
* for example, at the start or end of a line

```bash
$ # start of line
$ awk 'index($0,"a+b")==1' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # length function returns number of characters, by default acts on $0
$ awk 'index($0,"a+b")==length()-length("a+b")+1' eqns.txt
i*(t+9-g)/8,4-a+b
$ # to avoid repetitions, save the search string in variable
$ awk -v s="a+b" 'index($0,s)==length()-length(s)+1' eqns.txt
i*(t+9-g)/8,4-a+b
```
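
The `index` based end of line check above compares the position of the *first* occurrence, so a line where the search string also occurs earlier would not be matched. A possible alternative, sketched here with a made-up input line, is to compare the trailing characters using `substr`:

```bash
# search string occurs twice, index() reports only the first position
# so this end of line check prints nothing
echo 'a+b,4-a+b' | awk -v s='a+b' 'index($0,s)==length()-length(s)+1'

# compare the last length(s) characters of the line literally instead
echo 'a+b,4-a+b' | awk -v s='a+b' 'substr($0, length()-length(s)+1) == s'
```

The first command gives no output, the second prints `a+b,4-a+b`.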

<br>

#### <a name="line-number-based-filtering"></a>Line number based filtering

* Built-in variable `NR` contains the total number of records read so far
* Use `FNR` if you need line numbers separately for each file, see [multiple file input](#multiple-file-input)

```bash
$ # same as: head -n2 poem.txt | tail -n1
$ awk 'NR==2' poem.txt
Violets are blue,

$ # print 2nd and 4th line
$ awk 'NR==2 || NR==4' poem.txt
Violets are blue,
And so are you.

$ # same as: tail -n1 poem.txt
$ # statements inside END are executed after processing all input text
$ awk 'END{print}' poem.txt
And so are you.

$ awk 'NR==4{print $2}' fruits.txt
90
```

* for large input, use `exit` to avoid unnecessary record processing

```bash
$ seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

$ # sample time comparison
$ time seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

real    0m0.004s
user    0m0.004s
sys     0m0.000s
$ time seq 14323 14563435 | awk 'NR==234{print}'
14556

real    0m2.167s
user    0m2.280s
sys     0m0.092s
```
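
The same idea applies to a range of lines: quit as soon as the record number goes past the range. A minimal sketch with made-up line numbers:

```bash
# print lines 2 to 3, then stop reading further input
seq 10 | awk 'NR>3{exit} NR>=2'
```

This prints `2` and `3` on separate lines.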

* See also [unix.stackexchange - filtering list of lines from every X number of lines](https://unix.stackexchange.com/questions/325985/how-to-print-lines-number-15-and-25-out-of-each-50-lines)

<br>

## <a name="case-insensitive-filtering"></a>Case Insensitive filtering

```bash
$ # same as: grep -i 'rose' poem.txt
$ awk -v IGNORECASE=1 '/rose/' poem.txt
Roses are red,

$ # for small enough set, can also use REGEXP character class
$ awk '/[rR]ose/' poem.txt
Roses are red,

$ # another way is to use built-in string function 'tolower'
$ awk 'tolower($0) ~ /rose/' poem.txt
Roses are red,
```

<br>

## <a name="changing-record-separators"></a>Changing record separators

* `RS` to change input record separator
* default is newline character

```bash
$ s='this is a sample string'

$ # space as input record separator, printing all records
$ printf "$s" | awk -v RS=' ' '{print NR, $0}'
1 this
2 is
3 a
4 sample
5 string

$ # print all records containing 'a'
$ printf "$s" | awk -v RS=' ' '/a/'
a
sample
```

* `ORS` to change output record separator
* gets added to every `print` statement
    * use [printf](#printf-formatting) to avoid this
* default is newline character

```bash
$ seq 3 | awk '{print $0}'
1
2
3
$ # note that there is empty line after last record
$ seq 3 | awk -v ORS='\n\n' '{print $0}'
1

2

3

$ # dynamically changing ORS
$ # ?: ternary operator to select between two expressions based on a condition
$ # can also use: seq 6 | awk '{ORS = NR%2 ? " " : RS} 1'
$ seq 6 | awk '{ORS = NR%2 ? " " : "\n"} 1'
1 2
3 4
5 6
$ seq 6 | awk '{ORS = NR%3 ? "-" : "\n"} 1'
1-2-3
4-5-6
```

<br>

#### <a name="paragraph-mode"></a>Paragraph mode

* When `RS` is set to empty string, one or more consecutive empty lines are used as the input record separator
* Can also use regular expression `RS=\n\n+` but there are subtle differences, see [gawk manual - multiline records](https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html). Important points from that link quoted below

>However, there is an important difference between ‘RS = ""’ and ‘RS = "\n\n+"’. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done

>Now that the input is separated into records, the second step is to separate the fields in the records. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature. When RS is set to the empty string and FS is set to a single character, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS

>When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’

Consider the below sample file

```bash
$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* Filtering paragraphs

```bash
$ # print all paragraphs containing 'it'
$ # if extra newline at end is undesirable, can use
$ # awk -v RS= '/it/{print c++ ? "\n" $0 : $0}' sample.txt
$ awk -v RS= -v ORS='\n\n' '/it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==1' sample.txt
Hello World

$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==2 && /do/' sample.txt
Just do-it
Believe it

Much ado about nothing
He he he

```

* Re-structuring paragraphs

```bash
$ # default FS is one or more of continuous space, tab or newline characters
$ # default OFS is single space
$ # so, $1=$1 will change it uniformly to single space between fields
$ awk -v RS= '{$1=$1} 1' sample.txt
Hello World
Good day How are you
Just do-it Believe it
Today is sunny Not a bit funny No doubt you like it too
Much ado about nothing He he he

$ # a better usecase
$ awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he

```

**Further Reading**

* [unix.stackexchange - filtering line surrounded by empty lines](https://unix.stackexchange.com/questions/359717/select-line-with-empty-line-above-and-under)
* [stackoverflow - excellent example and explanation of RS and FS](https://stackoverflow.com/questions/46142118/converting-regex-to-sed-or-grep-regex)

<br>

#### <a name="multicharacter-rs"></a>Multicharacter RS

* `RS` can be a string of multiple characters, for example a marker like `Error:` or `Warning:`

```bash
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ awk -v RS='Error:' 'END{print NR-1}' report.log
2
$ awk -v RS='Error:' 'NR==1' report.log
blah blah

$ # filter 'Error:' block matching particular string
$ # to preserve formatting, use: '/whatever/{print RS $0}'
$ awk -v RS='Error:' '/whatever/' report.log
 something went wrong
more blah
whatever

$ # blocks with more than 3 lines
$ # splitting string with 3 newlines will yield 4 fields
$ awk -F'\n' -v RS='Error:' 'NF>4{print RS $0}' report.log
Error: something surely went wrong
some text
some more text
blah blah blah

```

* Regular expression based `RS`
    * the `RT` variable will contain string matched by `RS`
* Note that entire input is treated as single string, so `^` and `$` anchors will apply only once - not every line

```bash
$ s='Sample123string54with908numbers'
$ printf "$s" | awk -v RS='[0-9]+' 'NR==1'
Sample

$ # note the relationship between record and separators
$ printf "$s" | awk -v RS='[0-9]+' '{print NR " : " $0 " - " RT}'
1 : Sample - 123
2 : string - 54
3 : with - 908
4 : numbers - 

$ # need to be careful of empty records
$ printf '123string54with908' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with
$ # and newline at end of input
$ printf '123string54with908\n' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with
4 : 

```

* Joining lines based on specific end of line condition

```bash
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

$ # join lines ending with - to next line
$ # by manipulating RS and ORS
$ awk -v RS='-\n' -v ORS= '1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.

$ # by manipulating ORS alone, sub function covered in later sections
$ awk '{ORS = sub(/-$/,"") ? "" : "\n"} 1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.
$ # easier: perl -pe 's/-\n//' msg.txt as newline is still part of input line
```

* processing null terminated input

```bash
$ printf 'foo\0bar\0' | cat -A
foo^@bar^@
$ printf 'foo\0bar\0' | awk -v RS='\0' '{print}'
foo
bar
```

**Further Reading**

* [gawk manual - Records](https://www.gnu.org/software/gawk/manual/html_node/Records.html#Records)
* [unix.stackexchange - Slurp-mode in awk](https://unix.stackexchange.com/questions/304457/slurp-mode-in-awk)
* [stackoverflow - using RS to count number of occurrences of a given string](https://stackoverflow.com/questions/45102651/how-to-grep-double-quote-followed-by-a-string-at-same-time/45102962#45102962)

<br>

## <a name="substitute-functions"></a>Substitute functions

* Use `sub` string function for replacing first occurrence
* Use `gsub` for replacing all occurrences
* By default, `$0` which contains input record is modified, can specify any other field or variable as needed

```bash
$ # replacing first occurrence
$ echo '1-2-3-4-5' | awk '{sub("-", ":")} 1'
1:2-3-4-5

$ # replacing all occurrences
$ echo '1-2-3-4-5' | awk '{gsub("-", ":")} 1'
1:2:3:4:5

$ # return value for sub/gsub is number of replacements made
$ echo '1-2-3-4-5' | awk '{n=gsub("-", ":"); print n} 1'
4
1:2:3:4:5

$ # // format is better suited to specify search REGEXP
$ echo '1-2-3-4-5' | awk '{gsub(/[^-]+/, "abc")} 1'
abc-abc-abc-abc-abc

$ # replacing all occurrences only for third field
$ echo 'one;two;three;four' | awk -F';' '{gsub("e", "E", $3)} 1'
one two thrEE four
```

* Use `gensub` to return the modified string, unlike `sub` or `gsub` which modify the target in place
* it also supports back-references and ability to modify specific match
* acts upon `$0` if target is not specified

```bash
$ # replace second occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(":", "-", 2)} 1'
foo:123-bar:baz
$ # use REGEXP as needed
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 2)} 1'
foo:XYZ:bar:baz

$ # or print the returned string directly
$ echo 'foo:123:bar:baz' | awk '{print gensub(":", "-", 2)}'
foo:123-bar:baz

$ # replace third occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 3)} 1'
foo:123:XYZ:baz

$ # replace all occurrences, similar to gsub
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", "g")} 1'
XYZ:XYZ:XYZ:XYZ

$ # target other than $0
$ echo 'foo:123:bar:baz' | awk -F: -v OFS=: '{$1=gensub(/o/, "b", 2, $1)} 1'
fob:123:bar:baz
```

* back-reference examples
* use `\"` within double-quotes to represent `"` character in replacement string
* use `\\1` to represent `\1` - the first captured group and so on
* `&` or `\0` will back-reference entire matched string

```bash
$ # replacing last occurrence without knowing how many occurrences are there
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/(.*):/, "\\1-", 1)} 1'
foo:123:bar-baz
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)and/, "\\1XYZ", 1)} 1'
foo and bar and baz lXYZ good

$ # use word boundaries as necessary
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)\<and\>/, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # replacing last but one
$ echo '456:foo:123:bar:789:baz' | awk '{$0=gensub(/(.*):(.*:)/, "\\1-\\2", 1)} 1'
456:foo:123:bar-789:baz

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
```
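
`&` can be used in the replacement string of `sub` and `gsub` as well, whereas numbered back-references like `\\1` need `gensub`. A short sketch with made-up input strings:

```bash
# & stands for the entire matched text
echo 'foo:123:bar' | awk '{gsub(/[0-9]+/, "(&)")} 1'

# use \\& to get a literal & character in the replacement
echo 'salt and pepper' | awk '{sub(/and/, "\\&")} 1'
```

The outputs are `foo:(123):bar` and `salt & pepper` respectively.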

* saving quotes in variables - to avoid escaping double quotes or having to use octal code for single quotes

```bash
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'
$ echo 'foo:123:bar:baz' | awk -v sq="'" '{$0=gensub(/[^:]+/, sq"&"sq, "g")} 1'
'foo':'123':'bar':'baz'

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
$ echo 'foo:123:bar:baz' | awk -v dq='"' '{$0=gensub(/[^:]+/, dq"&"dq, "g")} 1'
"foo":"123":"bar":"baz"
```

**Further Reading**

* [gawk manual - String-Manipulation Functions](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html)
* [gawk manual - escape processing](https://www.gnu.org/software/gawk/manual/html_node/Gory-Details.html)

<br>

## <a name="inplace-file-editing"></a>Inplace file editing

* The `-i inplace` option writes the changes back to the input file itself, so use it with caution, preferably after testing that the `awk` code is working as intended

```bash
$ cat greeting.txt
Hi there
Have a nice day

$ awk -i inplace '{gsub("e", "E")} 1' greeting.txt
$ cat greeting.txt
Hi thErE
HavE a nicE day
```

* Multiple input files are treated individually and changes are written back to respective files

```bash
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes

$ awk -i inplace '{gsub("3", "three")} 1' f1 f2
$ cat f1
I ate three apples
$ cat f2
I bought two bananas and three mangoes
```

* to create backups of original file, set `INPLACE_SUFFIX` variable
* **Note** that in newer versions, you have to use `inplace::suffix` instead of `INPLACE_SUFFIX`

```bash
$ awk -i inplace -v INPLACE_SUFFIX='.bkp' '{gsub("three", "3")} 1' f1
$ cat f1
I ate 3 apples
$ cat f1.bkp
I ate three apples
```
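
Where the `inplace` extension is not available (older `gawk` versions or other `awk` implementations), a common fallback is to write the output to a temporary file and copy it back. This is only a sketch of that workaround, not something the extension itself does; `fruit_note.txt` is a hypothetical file name and a `mktemp` command is assumed.

```bash
# sample input (hypothetical file name)
printf 'I ate 3 apples\n' > fruit_note.txt

# write the result to a temporary file, copy it back only if awk succeeded
# cat with redirection keeps the permissions of the original file
tmp=$(mktemp) &&
    awk '{gsub("3", "three")} 1' fruit_note.txt > "$tmp" &&
    cat "$tmp" > fruit_note.txt
rm -f "$tmp"

cat fruit_note.txt
```

The file now contains `I ate three apples`.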

* See [gawk manual - Enabling In-Place File Editing](https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html) for implementation details

<br>

## <a name="using-shell-variables"></a>Using shell variables

* when `awk` code is part of shell program and shell variable needs to be passed as input to `awk` code
* for example:
    * command line argument passed to shell script, which is in turn passed on to `awk`
    * control structures in shell script calling `awk` with different search strings
* See also [stackoverflow - How do I use shell variables in an awk script?](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script)

```bash
$ # examples tested with bash shell

$ f='apple'
$ awk -v word="$f" '$1==word' fruits.txt
apple   42
$ f='fig'
$ awk -v word="$f" '$1==word' fruits.txt
fig     90

$ q='20'
$ awk -v threshold="$q" 'NR==1 || $2>threshold' fruits.txt
fruit   qty
apple   42
banana  31
fig     90
```
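
The two use cases listed at the start of this section, an argument passed on to `awk` and a shell control structure calling `awk` with different search strings, might look like the sketch below. `qty_of` and `fruits_demo.txt` are hypothetical names; the file has the same content as `fruits.txt`, with single spaces between fields.

```bash
# sample input, same content as fruits.txt used in this chapter
printf 'fruit qty\napple 42\nbanana 31\nfig 90\nguava 6\n' > fruits_demo.txt

# hypothetical wrapper: first argument is the fruit name to look up
qty_of() {
    awk -v word="$1" '$1==word{print $2}' fruits_demo.txt
}

# shell loop calling awk with a different search string each time
for f in apple fig guava; do
    printf '%s: %s\n' "$f" "$(qty_of "$f")"
done
```

The loop prints `apple: 42`, `fig: 90` and `guava: 6` on separate lines.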

* accessing shell environment variables

```bash
$ # existing environment variable
$ awk 'BEGIN{print ENVIRON["PWD"]}'
/home/learnbyexample
$ awk 'BEGIN{print ENVIRON["SHELL"]}'
/bin/bash

$ # defined along with awk code
$ word='hello world' awk 'BEGIN{print ENVIRON["word"]}'
hello world

$ # using ENVIRON also prevents awk's interpretation of escape sequences
$ s='a\n=c'
$ foo="$s" awk 'BEGIN{print ENVIRON["foo"]}'
a\n=c
$ awk -v foo="$s" 'BEGIN{print foo}'
a
=c
```
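
One detail worth keeping in mind: `ENVIRON` only contains variables that are part of the environment, so a plain shell variable has to be exported (or defined along with the command, as shown above) before `awk` can see it. A small sketch, with `demo_var` as a made-up variable name:

```bash
unset -v demo_var
demo_var='hello'
# not exported yet, so ENVIRON has no such key
awk 'BEGIN{print "[" ENVIRON["demo_var"] "]"}'    # prints: []

export demo_var
awk 'BEGIN{print "[" ENVIRON["demo_var"] "]"}'    # prints: [hello]
```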

* passing *REGEXP*
* See also [gawk manual - Using Dynamic Regexps](https://www.gnu.org/software/gawk/manual/html_node/Computed-Regexps.html)

```bash
$ s='are'
$ # for: awk '!/are/' poem.txt
$ awk -v s="$s" '$0 !~ s' poem.txt
Sugar is sweet,
$ # for: awk '/are/ && !/so/' poem.txt
$ awk -v s="$s" '$0 ~ s && !/so/' poem.txt
Roses are red,
Violets are blue,

$ r='[^-]+'
$ echo '1-2-3-4-5' | awk -v r="$r" '{gsub(r, "abc")} 1'
abc-abc-abc-abc-abc

$ # escape sequence has to be doubled when string is interpreted as REGEXP
$ s='foo and bar and baz land good'
$ echo "$s" | awk '{$0=gensub("(.*)\\<and\\>", "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
$ # hence passing as variable should be
$ r='(.*)\\<and\\>'
$ echo "$s" | awk -v r="$r" '{$0=gensub(r, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # or use ENVIRON
$ r='(.*)\<and\>'
$ echo "$s" | r="$r" awk '{$0=gensub(ENVIRON["r"], "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
```
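
If the shell variable should be matched as a literal string rather than as a *REGEXP*, one option is the `index` function, which returns a non-zero position when the substring is found. A minimal sketch with made-up input; `ENVIRON` is used so that any backslashes in the value are left untouched:

```bash
s='a.b'

# as REGEXP, '.' matches any character, so both lines are printed
printf 'a.b\naxb\n' | awk -v s="$s" '$0 ~ s'

# as a literal string, only the first line is printed
printf 'a.b\naxb\n' | s="$s" awk 'index($0, ENVIRON["s"])'
```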

<br>

## <a name="multiple-file-input"></a>Multiple file input

* Example to show difference between `NR` and `FNR`

```bash
$ # NR for overall record number
$ awk 'NR==1' poem.txt greeting.txt
Roses are red,

$ # FNR for individual file's record number
$ # same as: head -q -n1 poem.txt greeting.txt
$ awk 'FNR==1' poem.txt greeting.txt
Roses are red,
Hi thErE
```
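
To see how the two counters evolve, it can help to print them side by side. The sketch below uses two small made-up files (`nr_demo_1.txt` and `nr_demo_2.txt`) instead of the sample files:

```bash
printf 'a\nb\n' > nr_demo_1.txt
printf 'c\nd\ne\n' > nr_demo_2.txt

# columns: file name, NR (keeps counting), FNR (restarts for each file), line
awk '{print FILENAME, NR, FNR, $0}' nr_demo_1.txt nr_demo_2.txt
# nr_demo_1.txt 1 1 a
# nr_demo_1.txt 2 2 b
# nr_demo_2.txt 3 1 c
# nr_demo_2.txt 4 2 d
# nr_demo_2.txt 5 3 e
```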

* Constructs to do some processing before starting each file as well as at the end
* `BEGINFILE` - to add code to be executed before start of each input file
* `ENDFILE` - to add code to be executed after processing each input file
* `FILENAME` - file name of current input file being processed

```bash
$ # similar to: tail -n1 poem.txt greeting.txt
$ awk 'BEGINFILE{print "file: "FILENAME}
       ENDFILE{print $0"\n------"}' poem.txt greeting.txt
file: poem.txt
And so are you.
------
file: greeting.txt
HavE a nicE day
------
```

* And of course, there can be usual `awk` code

```bash
$ awk 'BEGINFILE{print "file: "FILENAME}
       FNR==1;
       ENDFILE{print "------"}' poem.txt greeting.txt
file: poem.txt
Roses are red,
------
file: greeting.txt
Hi thErE
------

$ awk 'BEGINFILE{c++; print "file: "FILENAME}
       FNR==2;
       END{print "\nTotal input files: "c}' poem.txt greeting.txt
file: poem.txt
Violets are blue,
file: greeting.txt
HavE a nicE day

Total input files: 2
```

**Further Reading**

* [gawk manual - Using ARGC and ARGV](https://www.gnu.org/software/gawk/manual/html_node/ARGC-and-ARGV.html)
* [gawk manual - ARGIND](https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#index-ARGIND-variable)
* [gawk manual - ERRNO](https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#index-ERRNO-variable)
* [stackoverflow - Finding common value across multiple files](https://stackoverflow.com/a/43473385/4082052)

<br>

## <a name="control-structures"></a>Control Structures

* Syntax is similar to the `C` language, and a single statement inside a control structure doesn't need to be grouped within `{}`
* See [gawk manual - Control Statements](https://www.gnu.org/software/gawk/manual/html_node/Statements.html) for details

Remember that by default there is a loop that goes over all input records and constructs like `BEGIN` and `END` fall outside that loop

```bash
$ cat nums.txt
42
-2
10101
-3.14
-75
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # uninitialized variables will have empty string
$ printf '' | awk '{sum += $1} END{print sum}'

$ # so either add '0' or use unary '+' operator to convert to number
$ printf '' | awk '{sum += $1} END{print +sum}'
0
$ awk '{sum += $1} END{print sum+0}' /dev/null
0
```

* See also [unix.stackexchange - change in behavior of unary + with gawk version 4.2.0](https://unix.stackexchange.com/questions/421904/regression-with-unary-plus)

<br>

#### <a name="if-else-and-loops"></a>if-else and loops

* We have already seen simple `if` examples in [Filtering](#filtering) section
* See also [gawk manual - Switch](https://www.gnu.org/software/gawk/manual/html_node/Switch-Statement.html)

```bash
$ # same as: sed -n '/are/ s/so/SO/p' poem.txt
$ # remember that sub/gsub returns number of substitutions made
$ awk '/are/{if(sub("so", "SO")) print}' poem.txt
And SO are you.
$ # of course, can also use
$ awk '/are/ && sub("so", "SO")' poem.txt
And SO are you.

$ # if-else example
$ awk 'NR>1{if($2>40) $0="+"$0; else $0="-"$0} 1' fruits.txt
fruit   qty
+apple   42
-banana  31
+fig     90
-guava   6
```

* ternary operator
* See also [stackoverflow - finding min and max value of a column](https://stackoverflow.com/a/29784278/4082052)

```bash
$ cat nums.txt
42
-2
10101
-3.14
-75

$ # changing -ve to +ve and vice versa
$ # same as: awk '{if($0 ~ /^-/) sub(/^-/,""); else sub(/^/,"-")} 1' nums.txt
$ awk '{$0 ~ /^-/ ? sub(/^-/,"") : sub(/^/,"-")} 1' nums.txt
-42
2
-10101
3.14
75
$ # can also use: awk '!sub(/^-/,""){sub(/^/,"-")} 1' nums.txt
```
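
The ternary operator is also handy for tracking a running minimum and maximum, which is the topic of the link above. The following is only a sketch and not necessarily the approach used in that answer; the input is the same set of numbers as `nums.txt`, and `$1+0` is used to force numeric values:

```bash
printf '42\n-2\n10101\n-3.14\n-75\n' |
    awk 'NR==1{min=max=$1+0}
         {min = $1<min ? $1+0 : min; max = $1>max ? $1+0 : max}
         END{print min, max}'
# prints: -75 10101
```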

* for loop
* similar to `C` language, `break` and `continue` statements are also available
* See also [stackoverflow - find missing numbers from sequential list](https://stackoverflow.com/questions/38491676/how-can-i-find-the-missing-integers-in-a-unique-and-sequential-list-one-per-lin)

```bash
$ awk 'BEGIN{for(i=2; i<11; i+=2) print i}'
2
4
6
8
10

$ # looping each field
$ s='scat:cat:no cat:abdicate:cater'
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) if($i=="cat") $i="CAT"} 1'
scat:CAT:no cat:abdicate:cater
$ # can also use sub function
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) sub(/^cat$/,"CAT",$i)} 1'
scat:CAT:no cat:abdicate:cater
```
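
Since `break` and `continue` are mentioned above but not shown, here is a short sketch with made-up input:

```bash
# break: report the position of the first field equal to 'cat' and stop looping
echo 'scat:cat:no cat:cat' | awk -F: '{for(i=1;i<=NF;i++) if($i=="cat"){print i; break}}'
# prints: 2

# continue: skip empty fields while counting
echo '1::3:4' | awk -F: '{c=0; for(i=1;i<=NF;i++){if($i=="") continue; c++} print c}'
# prints: 3
```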

* while loop
* do-while is also available

```bash
$ awk 'BEGIN{i=2; while(i<11){print i; i+=2}}'
2
4
6
8
10

$ # recursive substitution
$ # here again return value of sub/gsub is useful
$ echo 'titillate' | awk '{while( gsub(/til/, "") ) print}'
tilate
ate
```
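
A do-while loop differs from a while loop in that the body is executed at least once, because the condition is checked only after the body. A minimal sketch:

```bash
# the body runs once even though the condition is false from the start
awk 'BEGIN{i=12; do{print i; i+=2} while(i<11)}'
# prints: 12

# the equivalent while loop prints nothing
awk 'BEGIN{i=12; while(i<11){print i; i+=2}}'
```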

<br>

#### <a name="next-and-nextfile"></a>next and nextfile

* `next` skips the remaining statements for the current record and moves on to the next input record
    * there is a loop by default which goes over all input records, `next` is applicable for that
    * it is similar to `continue` statement within loops
* it is often used in [Two file processing](#two-file-processing)

```bash
$ # here 'next' is used to skip processing header line
$ awk 'NR==1{print; next} /a.*a/{$0="*"$0} /[eiou]/{$0="-"$0} 1' fruits.txt
fruit   qty
-apple   42
*banana  31
-fig     90
-*guava   6
```

* `nextfile` is useful to skip remaining lines from current file being processed and move on to next file

```bash
$ # same as: head -q -n1 poem.txt greeting.txt fruits.txt
$ awk 'FNR>1{nextfile} 1' poem.txt greeting.txt fruits.txt
Roses are red,
Hi thErE
fruit   qty

$ # specific field
$ awk 'FNR>2{nextfile} {print $1}' poem.txt greeting.txt fruits.txt
Roses
Violets
Hi
HavE
fruit
apple

$ # similar to 'grep -il'
$ awk -v IGNORECASE=1 '/red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt
poem.txt
$ awk -v IGNORECASE=1 '$1 ~ /red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt
```

<br>

## <a name="multiline-processing"></a>Multiline processing

* Processing consecutive lines

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # match two consecutive lines
$ awk 'p~/are/ && /is/{print p ORS $0} {p=$0}' poem.txt
Violets are blue,
Sugar is sweet,
$ # if only the second line is needed
$ awk 'p~/are/ && /is/; {p=$0}' poem.txt
Sugar is sweet,

$ # match three consecutive lines
$ awk 'p2~/red/ && p1~/blue/ && /is/{print p2} {p2=p1; p1=$0}' poem.txt
Roses are red,

$ # common mistake
$ sed -n '/are/{N;/is/p}' poem.txt
$ # would need something like this and not practical to extend for other cases
$ sed '$!N; /are.*\n.*is/p; D' poem.txt
Violets are blue,
Sugar is sweet,
```

Consider this sample input file

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* extracting lines around matching line
* See also [stackoverflow - lines around matching regexp](https://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern)
* how `n && n--` works:
    * need to note that right hand side of `&&` is processed only if left hand side is `true`
    * so for example, if initially `n=2`, then we get
        * `2 && 2; n=1` - evaluates to `true`
        * `1 && 1; n=0` - evaluates to `true`
        * `0 && ` - evaluates to `false` ... no decrementing `n` and hence will be `false` until `n` is re-assigned non-zero value

```bash
$ # similar to: grep --no-group-separator -A1 'BEGIN' range.txt
$ awk '/BEGIN/{n=2} n && n--' range.txt
BEGIN
1234
BEGIN
a

$ # only print the line after matching line
$ # can also use: awk '/BEGIN/{n=1; next} n && n--' range.txt
$ awk 'n && n--; /BEGIN/{n=1}' range.txt
1234
a
$ # generic case: print nth line after match
$ awk 'n && !--n; /BEGIN/{n=3}' range.txt
END
c

$ # print second line prior to matched line
$ awk '/END/{print p2} {p2=p1; p1=$0}' range.txt
1234
b
$ # save all lines in an array for generic case
$ # NR>n is checked to avoid printing empty line if there is a match
$ # within first n lines
$ awk -v n=3 '/BEGIN/ && NR>n{print a[NR-n]} {a[NR]=$0}' range.txt
6789
$ # or, use the reversing trick
$ tac range.txt | awk 'n && !--n; /END/{n=3}' | tac
BEGIN
a
```
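
The `n && n--` walk-through given earlier can also be traced directly. In the sketch below, `before` and `result` are made-up helper variables: `before` holds the value of `n` prior to the test and `result` is `1` when the line would have been printed.

```bash
printf 'foo\nBEGIN\n1234\n6789\n' |
    awk '/BEGIN/{n=2} {before=n+0; result=(n && n--); print NR, before, result}'
# 1 0 0
# 2 2 1
# 3 1 1
# 4 0 0
```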

* Checking if multiple strings are present at least once in entire input file
* If there are lots of strings to check, use arrays

```bash
$ # can also use BEGINFILE instead of FNR==1
$ awk 'FNR==1{s1=s2=0} /is/{s1=1} /are/{s2=1} s1&&s2{print FILENAME; nextfile}' *
poem.txt
sample.txt

$ awk 'FNR==1{s1=s2=0} /foo/{s1=1} /report/{s2=1} s1&&s2{print FILENAME; nextfile}' *
paths.txt
```
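
For a longer list of strings, the flags can be kept in an array, as suggested above. The following is one possible sketch: the search terms are passed as a space separated list in the made-up variable `list`, the file names are hypothetical, and a counter is used instead of `nextfile` so that each matching file name is printed only once.

```bash
printf 'this is\nfine\n' > multi_demo_1.txt              # has 'is' only
printf 'we are\nthis is it\nmore\n' > multi_demo_2.txt   # has both 'is' and 'are'

awk -v list='is are' '
    BEGIN{n = split(list, w, " ")}
    # reset bookkeeping at the start of every file
    FNR==1{for(k in seen) delete seen[k]; c=0}
    # keep checking only while some term is still missing
    c<n{for(i=1; i<=n; i++)
            if(!(i in seen) && $0 ~ w[i]){seen[i]; c++}
        if(c==n) print FILENAME}
' multi_demo_1.txt multi_demo_2.txt
# prints: multi_demo_2.txt
```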

**Further Reading**

* [stackoverflow - delete line based on content of previous/next lines](https://stackoverflow.com/questions/49112877/delete-line-if-line-matches-foo-line-above-matches-bar-and-line-below-match)
* [softwareengineering - FSM examples](https://softwareengineering.stackexchange.com/questions/47806/examples-of-finite-state-machines)
* [wikipedia - FSM](https://en.wikipedia.org/wiki/Finite-state_machine)

<br>

## <a name="two-file-processing"></a>Two file processing

* We'll use awk's associative arrays (key-value pairs) here
    * key can be number or string
    * See also [gawk manual - Arrays](https://www.gnu.org/software/gawk/manual/html_node/Arrays.html)
* Unlike [comm](./sorting_stuff.md#comm) the input files need not be sorted and comparison can be done based on certain field(s) as well

<br>

#### <a name="comparing-whole-lines"></a>Comparing whole lines

Consider the following test files

```bash
$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow

$ cat colors_2.txt
Black
Blue
Green
Red
White
```

* common lines and lines unique to one of the files
* For two files as input, `NR==FNR` will be true only when first file is being processed
* Using `next` will skip rest of code when first file is processed
* `a[$0]` will create unique keys (here entire line content is used as key) in array `a`
    * just referencing a key will create it if it doesn't already exist, with value as empty string (will also act as zero in numeric context)
* `$0 in a` will be true if key already exists in array `a`

```bash
$ # common lines
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
Blue
Red

$ # lines from colors_2.txt not present in colors_1.txt
$ # same as: grep -vFxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt
Black
Green
White

$ # reversing the order of input files gives
$ # lines from colors_1.txt not present in colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_2.txt colors_1.txt
Brown
Purple
Teal
Yellow
```
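
A corner case to be aware of: if the first file is empty, `NR==FNR` is also true for every line of the second file. Comparing `FILENAME` against `ARGV[1]` is one way around this. A sketch with made-up file names:

```bash
: > first_demo.txt                        # empty first file
printf 'Blue\nRed\n' > second_demo.txt

# NR==FNR stays true for second_demo.txt, so nothing is printed
awk 'NR==FNR{a[$0]; next} !($0 in a)' first_demo.txt second_demo.txt

# checking the file name instead gives the expected result
awk 'FILENAME==ARGV[1]{a[$0]; next} !($0 in a)' first_demo.txt second_demo.txt
# Blue
# Red
```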

<br>

#### <a name="comparing-specific-fields"></a>Comparing specific fields

Consider the sample input file

```bash
$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
```

* single field
* For ex: compare only the first field by using `$1` instead of `$0` as key

```bash
$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

$ # if header is needed as well
$ awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' list1 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
```

* multiple fields
* create a string by adding some character between the fields to act as key
    * for ex: without a separator, the two field values `abc` and `123` would form the same key as the two field values `ab` and `c123`
    * by adding character, say `_`, the key would be `abc_123` for first case and `ab_c123` for second case
    * this can still lead to false match if input data has `_`
    * there is also a built-in way to do this using [gawk manual - Multidimensional Arrays](https://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional)

```bash
$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # extract only lines matching both fields specified in list2
$ awk 'NR==FNR{a[$1"_"$2]; next} $1"_"$2 in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67

$ # uses SUBSEP as separator, whose default value is non-printing character \034
$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
```
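
The false match that motivates the separator can be reproduced with the `abc`/`123` and `ab`/`c123` values mentioned above. The file names in this sketch are made up:

```bash
printf 'ab c123\n' > keys_demo.txt
printf 'abc 123 x\n' > data_demo.txt

# plain concatenation: both lines produce the key 'abc123', a false match
awk 'NR==FNR{a[$1 $2]; next} ($1 $2) in a' keys_demo.txt data_demo.txt
# abc 123 x

# with SUBSEP between the fields, the keys differ and nothing is printed
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' keys_demo.txt data_demo.txt
```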

* field and value comparison

```bash
$ cat list3
ECE 70
EEE 65
CSE 80

$ # extract line matching Dept and minimum marks specified in list3
$ awk 'NR==FNR{d[$1]=$2; next} $1 in d && $3 >= d[$1]' list3 marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
```

<br>

#### <a name="getline"></a>getline

* `getline` is an alternative way to read from a file and could be faster than `NR==FNR` method for some cases
* But use it with caution
    * [gawk manual - getline](https://www.gnu.org/software/gawk/manual/html_node/Getline.html) for details, especially about corner cases, errors, etc
    * [getline caveats](https://web.archive.org/web/20170524214527/http://awk.freeshell.org/AllAboutGetline)
    * [gawk manual - Closing Input and Output Redirections](https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html) if you have to start from beginning of file again
* `getline` return value: `1` if record is found, `0` if end of file, `-1` for errors such as file not found (use `ERRNO` variable to get details)

```bash
$ # replace mth line in poem.txt with nth line from nums.txt
$ # return value handling is not shown here, but should be done ideally
$ awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "nums.txt"}
                     FNR==m{$0=s} 1' poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # without getline, but slower due to NR==FNR check for every line processed
$ awk -v m=3 -v n=2 'NR==FNR{if(FNR==n){s=$0; nextfile} next}
                     FNR==m{$0=s} 1' nums.txt poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # Note that if nums.txt has fewer than n lines:
$ # getline version will use last line of nums.txt if any
$ # NR==FNR version will give empty string as 's' would be uninitialized
```
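
The return value handling that was skipped above could be added along these lines. This is only a sketch: the file names are made up, the wording of the error message is arbitrary, and writing to `/dev/stderr` assumes a system or `awk` implementation that provides it.

```bash
printf 'Roses are red,\nViolets are blue,\nSugar is sweet,\n' > getline_poem.txt
printf '42\n' > getline_nums.txt      # only one line, but n=2 is requested

awk -v m=3 -v n=2 -v file='getline_nums.txt' '
    BEGIN{for(i=1; i<=n; i++)
              if((getline s < file) != 1){
                  print "cannot read line " i " of " file > "/dev/stderr"
                  exit 1
              }}
    FNR==m{$0=s} 1' getline_poem.txt
echo "exit status: $?"
# prints: exit status: 1
```

Here the program stops with a non-zero exit status instead of silently reusing the last line that was read.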

* Another use case is if two files are to be processed simultaneously

```bash
$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ # the return value check ensures corresponding line number comparison
$ awk -v file='nums.txt' '(getline num < file)==1 && num>0' fruits.txt
fruit   qty
banana  31

$ # without getline, but has to save entire file in array
$ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' nums.txt fruits.txt
fruit   qty
banana  31
```

* error handling

```bash
$ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' xyz.txt fruits.txt
awk: fatal: cannot open file 'xyz.txt' for reading (No such file or directory)

$ awk -v file='xyz.txt' '{ e=(getline num < file);
                           if(e<0){print file ": " ERRNO; exit} }
                         e==1 && num>0' fruits.txt
xyz.txt: No such file or directory
```

**Further Reading**

* [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash)
* [unix.stackexchange - filter lines based on line numbers specified in another file](https://unix.stackexchange.com/questions/320651/read-numbers-from-control-file-and-extract-matching-line-numbers-from-the-data-f)
* [stackoverflow - three file processing to extract a matrix subset](https://stackoverflow.com/questions/45036019/how-to-filter-the-values-from-selected-columns-and-rows)
* [unix.stackexchange - column wise merging](https://unix.stackexchange.com/questions/294145/merging-two-files-one-column-at-a-time)
* [stackoverflow - extract specific rows from a text file using an index file](https://stackoverflow.com/questions/40595990/print-many-specific-rows-from-a-text-file-using-an-index-file)

<br>

## <a name="creating-new-fields"></a>Creating new fields

* Number of fields in input record can be changed by simply manipulating `NF`

```bash
$ # reducing fields
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=2} 1'
foo,bar

$ # creating new empty field(s)
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=5} 1'
foo,bar,123,baz,

$ # assigning to field greater than NF will create empty fields as needed
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{$7=42} 1'
foo,bar,123,baz,,,42
```

* adding a field based on existing fields

```bash
$ # adding a new 'Grade' field
$ awk 'BEGIN{OFS="\t"; g[9]="S"; g[8]="A"; g[7]="B"; g[6]="C"; g[5]="D"}
      {NF++; $NF = NR==1 ? "Grade" : g[int($(NF-1)/10)]} 1' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

$ # can also use split (covered in a later section)
$ # array assignment: split("DCBAS",g,//)
$ # index adjustment: g[int($(NF-1)/10)-4]
```

* two file example

```bash
$ cat list4
Raj class_rep
Amy sports_rep
Tia placement_rep

$ awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep
```

<br>

## <a name="dealing-with-duplicates"></a>Dealing with duplicates

* default value of uninitialized variable is `0` in numeric context and empty string in text context
    * and evaluates to `false` when used conditionally

*Illustration to show default numeric value and array in action*

```bash
$ printf 'mad\n42\n42\ndam\n42\n'
mad
42
42
dam
42

$ printf 'mad\n42\n42\ndam\n42\n' | awk '{print $0 "\t" int(a[$0]); a[$0]++}'
mad     0
42      0
42      1
dam     0
42      2
$ # only those entries with second column value zero will be retained
$ printf 'mad\n42\n42\ndam\n42\n' | awk '!a[$0]++'
mad
42
dam
```

* first, examples that retain only first copy of duplicates
* See also [iridakos: remove duplicates](https://iridakos.com/how-to/2019/05/16/remove-duplicate-lines-preserving-order-linux.html) for a detailed explanation
* See also [stackoverflow - add a letter to duplicate entries](https://stackoverflow.com/questions/47774779/add-letter-to-second-third-fourth-occurrence-of-a-string)

```bash
$ cat duplicates.txt
abc  7   4
food toy ****
abc  7   4
test toy 123
good toy ****

$ # whole line
$ awk '!seen[$0]++' duplicates.txt
abc  7   4
food toy ****
test toy 123
good toy ****

$ # particular column
$ awk '!seen[$2]++' duplicates.txt
abc  7   4
food toy ****

$ # total count
$ awk '!seen[$2]++{c++} END{print +c}' duplicates.txt
2
```

* if input is so large that integer numbers can overflow
* See also [gawk manual - Arbitrary-Precision Integer Arithmetic](https://www.gnu.org/software/gawk/manual/html_node/Arbitrary-Precision-Integers.html)

```bash
$ # avoid unnecessary counting altogether
$ awk '!($2 in seen); {seen[$2]}' duplicates.txt
abc  7   4
food toy ****

$ # use arbitrary-precision integers, limited only by available memory
$ awk -M '!($2 in seen){c++} {seen[$2]} END{print +c}' duplicates.txt
2
```

* For multiple fields, separate them using `,` or form a string with some character in between
    * choose a character unlikely to appear in input data, else there can be false matches
    * `FS` is a good choice as fields wouldn't contain separator character(s)

```bash
$ awk '!seen[$2 FS $3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123

$ # can also use simulated multidimensional array
$ # SUBSEP, whose default is \034 non-printing character, is used as separator
$ awk '!seen[$2,$3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123
```

* retaining specific numbered copy

```bash
$ # second occurrence of duplicate
$ awk '++seen[$2]==2' duplicates.txt
abc  7   4
test toy 123

$ # third occurrence of duplicate
$ awk '++seen[$2]==3' duplicates.txt
good toy ****
```

* retaining only last copy of duplicate

```bash
$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | awk '!seen[$2]++' | tac
abc  7   4
good toy ****
```

* filtering based on duplicate count
* this allows emulating the [uniq](./sorting_stuff.md#uniq) command for specific fields
* See also [unix.stackexchange - retain only parent directory paths](https://unix.stackexchange.com/questions/362571/filter-out-paths-from-a-text-file-that-are-deeper-than-their-immediate-predecces)

```bash
$ # all duplicates based on 1st column
$ awk 'NR==FNR{a[$1]++; next} a[$1]>1' duplicates.txt duplicates.txt
abc  7   4
abc  7   4
$ # all duplicates based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]>1' duplicates.txt duplicates.txt
abc  7   4
food toy ****
abc  7   4
good toy ****

$ # more than 2 duplicates based on 2nd column
$ awk 'NR==FNR{a[$2]++; next} a[$2]>2' duplicates.txt duplicates.txt
food toy ****
test toy 123
good toy ****

$ # only unique lines based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt
test toy 123
```
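
Along the same lines, a count per key (in the spirit of `uniq -c`, but without needing sorted input) can be produced in a single pass. In this sketch, `order` is a made-up helper array that remembers the order in which keys were first seen; the input is the same as `duplicates.txt`:

```bash
printf 'abc  7   4\nfood toy ****\nabc  7   4\ntest toy 123\ngood toy ****\n' |
    awk '!($2 in c){order[++n]=$2} {c[$2]++}
         END{for(i=1; i<=n; i++) print c[order[i]], order[i]}'
# 2 7
# 3 toy
```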

<br>

## <a name="lines-between-two-regexps"></a>Lines between two REGEXPs

* This section deals with filtering lines bound by two *REGEXP*s (referred to as blocks)
* For simplicity the two *REGEXP*s usually used in below examples are the strings **BEGIN** and **END**

<br>

#### <a name="all-unbroken-blocks"></a>All unbroken blocks

Consider the below sample input file, which doesn't have any broken blocks (i.e. **BEGIN** and **END** are always present in pairs)

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* Extracting lines between starting and ending *REGEXP*

```bash
$ # include both starting/ending REGEXP
$ # can also use: awk '/BEGIN/,/END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
$ awk '/BEGIN/{f=1} f; /END/{f=0}' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # exclude both starting/ending REGEXP
$ # can also use: awk '/BEGIN/{f=1; next} /END/{f=0} f' range.txt
$ awk '/END/{f=0} f; /BEGIN/{f=1}' range.txt
1234
6789
a
b
c
```

* Include only start or end *REGEXP*

```bash
$ # include only starting REGEXP
$ awk '/BEGIN/{f=1} /END/{f=0} f' range.txt
BEGIN
1234
6789
BEGIN
a
b
c

$ # include only ending REGEXP
$ awk 'f; /END/{f=0} /BEGIN/{f=1}' range.txt
1234
6789
END
a
b
c
END
```

* Extracting lines other than lines between the two *REGEXP*s

```bash
$ awk '/BEGIN/{f=1} !f; /END/{f=0}' range.txt
foo
bar
baz

$ # the other three cases would be
$ awk '/END/{f=0} !f; /BEGIN/{f=1}' range.txt
$ awk '!f; /BEGIN/{f=1} /END/{f=0}' range.txt
$ awk '/BEGIN/{f=1} /END/{f=0} !f' range.txt
```
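
To compare the three remaining variations at a glance, the sketch below recreates the sample input under a made-up file name and joins each result onto a single line with `paste`:

```bash
printf '%s\n' foo BEGIN 1234 6789 END bar BEGIN a b c END baz > range_demo.txt

# lines outside the blocks, plus both the starting and ending lines
awk '/END/{f=0} !f; /BEGIN/{f=1}' range_demo.txt | paste -sd' ' -
# foo BEGIN END bar BEGIN END baz

# lines outside the blocks, plus only the starting lines
awk '!f; /BEGIN/{f=1} /END/{f=0}' range_demo.txt | paste -sd' ' -
# foo BEGIN bar BEGIN baz

# lines outside the blocks, plus only the ending lines
awk '/BEGIN/{f=1} /END/{f=0} !f' range_demo.txt | paste -sd' ' -
# foo END bar END baz
```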

<br>

#### <a name="specific-blocks"></a>Specific blocks

* Getting first block

```bash
$ awk '/BEGIN/{f=1} f; /END/{exit}' range.txt
BEGIN
1234
6789
END

$ # use other tricks discussed in previous section as needed
$ awk '/END/{exit} f; /BEGIN/{f=1}' range.txt
1234
6789
```

* Getting last block

```bash
$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac
BEGIN
a
b
c
END

$ # or, save the blocks in a buffer and print the last one alone
$ # ORS contains output record separator, which is newline by default
$ seq 30 | awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}'
24
25
26
```

* Getting blocks based on a counter

```bash
$ # all blocks
$ seq 30 | sed -n '/4/,/6/p'
4
5
6
14
15
16
24
25
26

$ # get only 2nd block
$ # can also use: seq 30 | awk -v b=2 '/4/{c++} c==b{print; if(/6/) exit}'
$ seq 30 | awk -v b=2 '/4/{c++} c==b; /6/ && c==b{exit}'
14
15
16

$ # to get all blocks after the first 'b' blocks
$ seq 30 | awk -v b=1 '/4/{f=1; c++} f && c>b; /6/{f=0}'
14
15
16
24
25
26
```

* excluding a particular block

```bash
$ # excludes 2nd block
$ seq 30 | awk -v b=2 '/4/{f=1; c++} f && c!=b; /6/{f=0}'
4
5
6
24
25
26
```

<br>

#### <a name="broken-blocks"></a>Broken blocks

* If there are blocks with ending *REGEXP* but without corresponding start, `awk '/BEGIN/{f=1} f; /END/{f=0}'` will suffice
* Consider the modified input file where starting *REGEXP* doesn't have corresponding ending

```bash
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz

$ # the file reversing trick comes in handy here as well
$ tac broken_range.txt | awk '/END/{f=1} f; /BEGIN/{f=0}' | tac
BEGIN
1234
6789
END
```

* But if both kinds of broken blocks are present, accumulate the records and print accordingly

```bash
$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
xyzabc

$ awk '/BEGIN/{f=1; buf=$0; next}
       f{buf=buf ORS $0}
       /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END
```

**Further Reading**

* [stackoverflow - select lines between two regexps](https://stackoverflow.com/questions/38972736/how-to-select-lines-between-two-patterns)
* [unix.stackexchange - print only blocks with lines > n](https://unix.stackexchange.com/questions/295600/deleting-lines-between-rows-in-a-text-file-using-awk-or-sed)
* [unix.stackexchange - print a block only if it contains matching string](https://unix.stackexchange.com/a/335523/109046)
* [unix.stackexchange - print a block matching two different strings](https://unix.stackexchange.com/questions/347368/grep-with-range-and-pass-three-filters)
* [unix.stackexchange - extract block up to 2nd occurrence of ending REGEXP](https://unix.stackexchange.com/questions/404175/using-awk-to-print-lines-from-one-match-through-a-second-instance-of-a-separate)

<br>

## <a name="arrays"></a>Arrays

We've already seen examples using arrays; some more are discussed in this section

* array looping

```bash
$ # average marks for each department
$ awk 'NR>1{d[$1]+=$3; c[$1]++} END{for(i in d)print i, d[i]/c[i]}' marks.txt
ECE 72.3333
EEE 63.5
CSE 74
```

* Sorting
* See [gawk manual - Predefined Array Scanning Orders](https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html#Controlling-Scanning) for more details

```bash
$ # by default, the order in which keys are traversed is not specified
$ awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
x 12
z 1
b 42

$ # index sorted ascending order as strings
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
b 42
x 12
z 1

$ # value sorted ascending order as numbers
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
z 1
x 12
b 42
```
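
`PROCINFO["sorted_in"]` is specific to GNU awk. Where it isn't available, a possible alternative (an assumption about the workflow, not something covered above) is to leave the traversal order alone and pipe the output through `sort`:

```bash
# value sorted ascending order as numbers
awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}' | sort -k2,2n
# z 1
# x 12
# b 42

# index sorted ascending order as strings
awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}' | sort -k1,1
# b 42
# x 12
# z 1
```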

* deleting array elements

```bash
$ cat list5
CSE     Surya   75
EEE     Jai     69
ECE     Kal     83

$ # update entry if a match is found
$ # else append the new entries
$ awk '{ky=$1"_"$2} NR==FNR{upd[ky]=$0; next}
        ky in upd{$0=upd[ky]; delete upd[ky]} 1;
        END{for(i in upd)print upd[i]}' list5 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   75
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
ECE     Kal     83
EEE     Jai     69
```

* true multidimensional arrays
* length of sub-arrays need not be same. See [gawk manual - Arrays of Arrays](https://www.gnu.org/software/gawk/manual/html_node/Arrays-of-Arrays.html#Arrays-of-Arrays) for details

```bash
$ awk 'NR>1{d[$1][$2]=$3} END{for(i in d["ECE"])print i}' marks.txt
Joel
Raj
Om

$ awk -v f='CSE' 'NR>1{d[$1][$2]=$3} END{for(i in d[f])print i, d[f][i]}' marks.txt
Surya 81
Amy 67
```

**Further Reading**

* [gawk manual - all array topics](https://www.gnu.org/software/gawk/manual/html_node/Arrays.html)
* [unix.stackexchange - count words based on length](https://unix.stackexchange.com/questions/396855/is-there-an-easy-way-to-count-characters-in-words-in-file-from-terminal)
* [unix.stackexchange - filtering specific lines](https://unix.stackexchange.com/a/326215/109046)

<br>

## <a name="awk-scripts"></a>awk scripts

* For larger programs, save the code in a file and use `-f` command line option
* `;` is not needed to terminate a statement when each statement is on its own line
* See also [gawk manual - Command-Line Options](https://www.gnu.org/software/gawk/manual/html_node/Options.html#Options) for other related options

```bash
$ cat buf.awk
/BEGIN/{
    f=1
    buf=$0
    next
}

f{
    buf=buf ORS $0
}

/END/{
    f=0
    if(buf)
        print buf
    buf=""
}

$ awk -f buf.awk multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END
```

* Another advantage is that single quotes can be freely used

```bash
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'

$ cat quotes.awk
{
    $0 = gensub(/[^:]+/, "'&'", "g")
}

1

$ echo 'foo:123:bar:baz' | awk -f quotes.awk
'foo':'123':'bar':'baz'
```

* If the code has been first tried out on command line, add `-o` option to get a pretty printed version

```bash
$ awk -o -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep
```

A file name can be passed along with the `-o` option, otherwise `awkprof.out` will be used by default

```bash
$ cat awkprof.out
        # gawk profile, created Mon Mar 16 10:11:11 2020

        # Rule(s)

        NR == FNR {
                r[$1] = $2
                next
        }

        {
                $(NF + 1) = (FNR == 1 ? "Role" : r[$2])
        }

        1 {
                print $0
        }

$ # note that other command line options have to be provided as usual
$ # for ex: awk -v OFS='\t' -f awkprof.out list4 marks.txt
```

<br>

## <a name="miscellaneous"></a>Miscellaneous

<br>

#### <a name="fpat-and-fieldwidths"></a>FPAT and FIELDWIDTHS

* `FS` defines the field separator
* In contrast, `FPAT` defines what the fields themselves are made up of
* See also [gawk manual - Defining Fields by Content](https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html)

```bash
$ s='Sample123string54with908numbers'
$ # define fields to be one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $1, $2, $3}'
123 54 908
$ # define fields to be one or more consecutive alphabets
$ echo "$s" | awk -v FPAT='[a-zA-Z]+' '{print $1, $2, $3, $4}'
Sample string with numbers
```

* For simpler **csv** input, where fields containing `,` are enclosed in double quotes, using `FPAT` is a reasonable approach
* Use a proper parser if input can have other cases like newlines in fields
    * See [unix.stackexchange - using csv parser](https://unix.stackexchange.com/a/238192) for a sample program in `perl`

```bash
$ s='foo,"bar,123",baz,abc'
$ echo "$s" | awk -F, '{print $2}'
"bar
$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"bar,123"
```

* if input has well-defined fields based on number of characters, `FIELDWIDTHS` can be used to specify the width of each field

```bash
$ awk -v FIELDWIDTHS='8 3' -v OFS= '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig     35
guava   6

$ # without FIELDWIDTHS
$ awk '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig 35
guava   6
```

**Further Reading**

* [gawk manual - Processing Fixed-Width Data](https://www.gnu.org/software/gawk/manual/html_node/Fixed-width-data.html)
* [unix.stackexchange - Modify records in fixed-width files](https://unix.stackexchange.com/questions/368574/modify-records-in-fixed-width-files)
* [unix.stackexchange - detecting empty fields in fixed width files](https://unix.stackexchange.com/questions/321559/extracting-data-with-awk-when-some-lines-have-empty-missing-values)
* [stackoverflow - count number of times value is repeated each line](https://stackoverflow.com/questions/37450880/how-do-i-filter-tab-separated-input-by-the-count-of-fields-with-a-given-value)
* [stackoverflow - skip characters with FIELDWIDTHS in GNU Awk 4.2](https://stackoverflow.com/questions/46932189/how-do-you-skip-characters-with-fieldwidths-in-gnu-awk-4-2)

<br>

#### <a name="string-functions"></a>String functions

* `length` function - returns length of string, by default acts on `$0`

```bash
$ seq 8 13 | awk 'length()==1'
8
9

$ awk 'NR==1 || length($1)>4' fruits.txt
fruit   qty
apple   42
banana  31
guava   6

$ # character count and not byte count is calculated, similar to 'wc -m'
$ printf 'hi👍' | awk '{print length()}'
3

$ # use -b option if number of bytes is needed
$ printf 'hi👍' | awk -b '{print length()}'
6
```

* `split` function - splits a string into an array, similar to how `FS` splits the input record into fields
* use `patsplit` function to get results similar to `FPAT`
* See also [gawk manual - Split function](https://www.gnu.org/software/gawk/manual/gawk.html#index-split_0028_0029-function)
* See also [unix.stackexchange - delimit second column](https://unix.stackexchange.com/questions/372253/awk-command-to-delimit-the-second-column)

```bash
$ # 1st argument is string to be split
$ # 2nd argument is array to save results, indexed from 1
$ # 3rd argument is separator, default is FS
$ s='foo,1996-10-25,hello,good'
$ echo "$s" | awk -F, '{split($2,d,"-"); print "Month is: " d[2]}'
Month is: 10

$ # using regular expression to define separator
$ # return value is number of fields after splitting
$ s='Sample123string54with908numbers'
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/); for(i=1;i<=n;i++)print s[i]}'
Sample
string
with
numbers
$ # use 4th argument if separators are needed as well
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/,seps); for(i=1;i<n;i++)print seps[i]}'
123
54
908

$ # single row to multiple rows based on splitting last field
$ s='foo,baz,12:42:3'
$ echo "$s" | awk -F, '{n=split($NF,a,":"); NF--; for(i=1;i<=n;i++) print $0,a[i]}'
foo baz 12
foo baz 42
foo baz 3
```
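
The return value of `split` can also serve as a sanity check before the pieces are used. A minimal sketch, assuming a made-up `build 01:02:03` input and illustrative names `t` and `secs`; only the three-argument form is used, so it should not depend on `gawk` extensions.

```bash
# hypothetical input: a label followed by a duration in HH:MM:SS format
secs=$(echo 'build 01:02:03' | awk '{
    # split returns the number of pieces, saved in t[1], t[2], ...
    n = split($2, t, ":")
    # convert to seconds only if exactly three pieces were found
    if (n == 3) print $1, t[1]*3600 + t[2]*60 + t[3]
}')
echo "$secs"
# build 3723
```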

* `substr` function extracts a specified number of characters from the given string
    * indexing starts with `1`
* See [gawk manual - substr function](https://www.gnu.org/software/gawk/manual/gawk.html#index-substr_0028_0029-function) for corner cases and details

```bash
$ # 1st argument is string to be worked on
$ # 2nd argument is starting position
$ # 3rd argument is number of characters to be extracted
$ echo 'abcdefghij' | awk '{print substr($0,1,5)}'
abcde
$ echo 'abcdefghij' | awk '{print substr($0,4,3)}'
def
$ # if 3rd argument is not given, string is extracted until end
$ echo 'abcdefghij' | awk '{print substr($0,6)}'
fghij

$ echo 'abcdefghij' | awk -v OFS=':' '{print substr($0,2,3), substr($0,6,3)}'
bcd:fgh

$ # if only a few characters are needed from the input line, an empty FS can be used
$ echo 'abcdefghij' | awk -v FS= '{print $3}'
c
$ echo 'abcdefghij' | awk -v FS= '{print $3, $5}'
c e
```
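
As a sketch of `substr` on fixed-width data: assuming records laid out like `fruits.txt` (name in the first 8 characters, quantity after that), fields can be cut out by position. The two sample records and the doubling of the quantity are purely illustrative.

```bash
# hypothetical fixed-width records: 8 characters of name, then the quantity
out=$(printf 'apple   42\nfig     90\n' | awk '{
    name = substr($0, 1, 8)     # first 8 characters, padding included
    qty  = substr($0, 9) + 0    # rest of the line, forced to a number
    sub(/ +$/, "", name)        # strip the trailing padding
    print name ":" (qty * 2)
}')
echo "$out"
# apple:84
# fig:180
```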

<br>

#### <a name="executing-external-commands"></a>Executing external commands

* External commands can be issued using `system` function
* Output of the command goes to `stdout` as usual, unless redirected as part of the command
* Return value of `system` depends on the exit status of the executed command, see [gawk manual - Input/Output Functions](https://www.gnu.org/software/gawk/manual/html_node/I_002fO-Functions.html) for details

```bash
$ awk 'BEGIN{system("echo Hello World")}'
Hello World

$ wc poem.txt
 4 13 65 poem.txt
$ awk 'BEGIN{system("wc poem.txt")}'
 4 13 65 poem.txt

$ awk 'BEGIN{system("seq 10 | paste -sd, > out.txt")}'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2
$ awk 'BEGIN{s=system("ls xyz.txt"); print "Status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Status: 2

$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | awk -F, '{system("cat " $2)}'
I bought two bananas and three mangoes
```
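
The return value can be used to branch inside the `awk` program. A minimal sketch, assuming the standard `true` and `false` utilities as stand-ins for real commands; the array name `cmds` is illustrative.

```bash
# 'true' exits with status 0, 'false' exits with a non-zero status
out=$(awk 'BEGIN{
    cmds[1] = "true"; cmds[2] = "false"
    for (i = 1; i <= 2; i++) {
        status = system(cmds[i])
        print cmds[i] ": " (status == 0 ? "success" : "failure")
    }
}')
echo "$out"
# true: success
# false: failure
```

Since the command string is handed to a shell, field contents with spaces or shell metacharacters would need quoting before being used this way.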

<br>

#### <a name="printf-formatting"></a>printf formatting

* Similar to `printf` function in `C` and shell built-in command
* use `sprintf` function to save result in variable instead of printing
* See also [gawk manual - printf](https://www.gnu.org/software/gawk/manual/html_node/Printf.html)

```bash
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # note that ORS is not appended and has to be added manually
$ awk '{sum += $1} END{printf "%.2f\n", sum}' nums.txt
10062.86

$ awk '{sum += $1} END{printf "%10.2f\n", sum}' nums.txt
  10062.86

$ awk '{sum += $1} END{printf "%010.2f\n", sum}' nums.txt
0010062.86

$ awk '{sum += $1} END{printf "%d\n", sum}' nums.txt
10062

$ awk '{sum += $1} END{printf "%+d\n", sum}' nums.txt
+10062

$ awk '{sum += $1} END{printf "%e\n", sum}' nums.txt
1.006286e+04
```
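
A small sketch of `sprintf`, which takes the same format string as `printf` but returns the result instead of printing it. The sample records and the variable name `label` are made up for illustration.

```bash
out=$(printf 'fig 90\napple 42\n' | awk '{
    # build the formatted string first, then reuse it
    label = sprintf("%-6s|%05.1f", $1, $2)
    print label, length(label)
}')
echo "$out"
# fig   |090.0 12
# apple |042.0 12
```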

* to refer to an argument by its position (starting from 1), use `<num>$` in the format specifier

```bash
$ # can also use: awk 'BEGIN{printf "hex=%x\noct=%o\ndec=%d\n", 15, 15, 15}'
$ awk 'BEGIN{printf "hex=%1$x\noct=%1$o\ndec=%1$d\n", 15}'
hex=f
oct=17
dec=15

$ # adding prefix to hex/oct numbers
$ awk 'BEGIN{printf "hex=%1$#x\noct=%1$#o\ndec=%1$d\n", 15}'
hex=0xf
oct=017
dec=15
```

* strings

```bash
$ # prefix remaining width with spaces
$ awk 'BEGIN{printf "%6s:%5s\n", "foo", "bar"}'
   foo:  bar

$ # suffix remaining width with spaces
$ awk 'BEGIN{printf "%-6s:%-5s\n", "foo", "bar"}'
foo   :bar  

$ # truncate
$ awk 'BEGIN{printf "%.2s\n", "foobar"}'
fo
```

* avoid using `printf` without a format specifier, as any `%` in the text to be printed gets treated as the start of a format specifier

```bash
$ awk 'BEGIN{s="solve: 5 % x = 1"; printf s}'
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
    `solve: 5 % x = 1'
               ^ ran out for this one

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf "%s\n", s}'
solve: 5 % x = 1
```

* See also [stackoverflow - concatenating columns in middle](https://stackoverflow.com/questions/49135518/linux-csv-file-concatenate-columns-into-one-column)

<br>

#### <a name="redirecting-print-output"></a>Redirecting print output

* redirecting to a file instead of stdout using `>`
* similar to behavior in shell, if the file already exists it is overwritten
    * use `>>` to append to an existing file without deleting its content
* however, unlike shell, subsequent redirections to the same file from the same `awk` program will append to it
* See also [gawk manual - Closing Input and Output Redirections](https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html) if you have too many redirections

```bash
$ seq 6 | awk 'NR%2{print > "odd.txt"; next} {print > "even.txt"}'
$ cat odd.txt
1
3
5
$ cat even.txt
2
4
6

$ awk 'NR==1{col1=$1".txt"; col2=$2".txt"; next}
       {print $1 > col1; print $2 > col2}' fruits.txt
$ cat fruit.txt
apple
banana
fig
guava
$ cat qty.txt
42
31
90
6
```
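
When the number of distinct output files can be large, each file can be closed right after writing. A minimal sketch, assuming a made-up two-column input where the first column decides the output file and a scratch directory from `mktemp`; `>>` is used because re-opening a closed file with `>` would truncate it.

```bash
dir=$(mktemp -d)    # scratch directory so that no existing file is touched
printf 'a 1\nb 2\na 3\n' | awk -v d="$dir" '{
    f = d "/" $1 ".txt"
    # >> appends, so lines written before close() are not lost
    print $2 >> f
    close(f)
}'
a=$(cat "$dir/a.txt")
b=$(cat "$dir/b.txt")
echo "$a"
# 1
# 3
echo "$b"
# 2
```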

* redirecting to shell command
* this is useful if you have different things to redirect to different commands, otherwise it can be done as usual in shell by acting on `awk`'s output
* all redirections to the same command get combined as a single input to that command

```bash
$ # same as: echo 'foo good 123' | awk '{print $2}' | wc -c
$ echo 'foo good 123' | awk '{print $2 | "wc -c"}'
5
$ # to avoid newline character being added to print
$ echo 'foo good 123' | awk -v ORS= '{print $2 | "wc -c"}'
4
$ # assuming no format specifiers in input
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"}'
4

$ # same as: echo 'foo good 123' | awk '{printf $2 $3 | "wc -c"}'
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"; printf $3 | "wc -c"}'
7
```
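
A sketch of sending different fields to different commands from a single pass over the input. The sample records, the scratch directory and the file names `names.txt` and `qty.txt` are assumptions for illustration; the command strings are parenthesized so that the concatenation is unambiguous.

```bash
dir=$(mktemp -d)    # scratch directory, assumed to have no spaces in its path
printf 'fig 90\napple 42\nguava 6\n' | awk -v d="$dir" '{
    # first column goes to an alphabetic sort, second to a numeric sort
    print $1 | ("sort > " d "/names.txt")
    print $2 | ("sort -n > " d "/qty.txt")
}'
names=$(cat "$dir/names.txt")
qty=$(cat "$dir/qty.txt")
echo "$names"
# apple
# fig
# guava
echo "$qty"
# 6
# 42
# 90
```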

**Further Reading**

* [gawk manual - Input/Output Functions](https://www.gnu.org/software/gawk/manual/html_node/I_002fO-Functions.html)
* [gawk manual - Redirecting Output of print and printf](https://www.gnu.org/software/gawk/manual/html_node/Redirection.html)
* [gawk manual - Two-Way Communications with Another Process](https://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html)
* [unix.stackexchange - inplace editing as well as stdout](https://unix.stackexchange.com/questions/321679/gawk-inplace-and-stdout)
* [stackoverflow - redirect blocks to separate files](https://stackoverflow.com/questions/45098279/write-blocks-in-a-text-file-to-multiple-new-files)

<br>

## <a name="gotchas-and-tips"></a>Gotchas and Tips

* using `$` for variables: `$word` refers to the field whose number is the value of `word`, not the variable itself
* only input record `$0` and field contents `$1`, `$2` etc need `$`
* See also [unix.stackexchange - Why does awk print the whole line when I want it to print a variable?](https://unix.stackexchange.com/questions/291126/why-does-awk-print-the-whole-line-when-i-want-it-to-print-a-variable)

```bash
$ # wrong
$ awk -v word="apple" '$1==$word' fruits.txt

$ # right
$ awk -v word="apple" '$1==word' fruits.txt
apple   42
```

* dos style line endings
* See also [unix.stackexchange - filtering when last column has \r](https://unix.stackexchange.com/questions/399560/using-awk-to-select-rows-with-specific-value-in-specific-column)

```bash
$ # no issue with unix style line ending
$ printf 'foo bar\n123 789\n' | awk '{print $2, $1}'
bar foo
789 123

$ # dos style line ending causes trouble
$ printf 'foo bar\r\n123 789\r\n' | awk '{print $2, $1}'
 foo
 123

$ # easy to deal by simply setting appropriate RS
$ # note that ORS would still be newline character only
$ printf 'foo bar\r\n123 789\r\n' | awk -v RS='\r\n' '{print $2, $1}'
bar foo
789 123
```
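
POSIX leaves the behavior of a multi-character `RS` unspecified, so here is a sketch of an alternative that removes the trailing carriage return with `sub` instead. The variable names `fix`, `dos` and `unix` are illustrative; the same program is applied to both kinds of input to show that unix style input is left alone.

```bash
# strip a trailing carriage return, if any, before the fields are used
fix='{sub(/\r$/, "")} {print $2, $1}'
dos=$(printf 'foo bar\r\n123 789\r\n' | awk "$fix")
unix=$(printf 'foo bar\n123 789\n' | awk "$fix")
echo "$dos"
# bar foo
# 789 123
```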

* relying on default initial value

```bash
$ # step 1 - works for single file
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # step 2 - change to work for multiple files
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt
nums.txt 10062.9

$ # step 3 - check with multiple files as input
$ # oops, default numerical value '0' for sum works only once
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 10068.9

$ # step 4 - correctly initialize variables
$ awk '{sum += $1} ENDFILE{print FILENAME, sum; sum=0}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 6
```

* use unary operator `+` to force numeric conversion

```bash
$ awk '{sum += $1} END{print FILENAME, sum}' nums.txt
nums.txt 10062.9

$ awk '{sum += $1} END{print FILENAME, sum}' /dev/null
/dev/null 

$ awk '{sum += $1} END{print FILENAME, +sum}' /dev/null
/dev/null 0
```

* concatenate empty string to force string comparison

```bash
$ echo '5 5.0' | awk '{print $1==$2 ? "same" : "different", "string"}'
same string

$ echo '5 5.0' | awk '{print $1""==$2 ? "same" : "different", "string"}'
different string
```
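
The same trick matters for ordering comparisons too. A minimal sketch, assuming a made-up input `10 9`: fields that look like numbers are compared numerically, whereas concatenating an empty string forces a character-by-character comparison. The variable names `num` and `str` are illustrative.

```bash
out=$(echo '10 9' | awk '{
    num = ($1 < $2)         # numeric comparison: 10 < 9 is false
    str = ($1 "" < $2 "")   # string comparison: "10" < "9" is true
    print num, str
}')
echo "$out"
# 0 1
```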

* beware of expressions going negative in field calculations

```bash
$ cat misc.txt
foo
good bad ugly
123 xyz
a b c d

$ # trying to delete last two fields
$ awk '{NF -= 2} 1' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: NF set to negative value
$ # dynamically change it depending on number of fields
$ awk '{NF = (NF<=2) ? 0 : NF-2} 1' misc.txt

good

a b

$ # similarly, trying to access 3rd field from end
$ awk '{print $(NF-2)}' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: attempt to access field -1
$ awk 'NF>2{print $(NF-2)}' misc.txt
good
b
```

* If input is ASCII alone, setting `LC_ALL=C` is a simple trick to improve speed
* For simple non-regex based column filtering, using [cut](./miscellaneous.md#cut) command might give faster results
    * See [stackoverflow - how to split columns faster](https://stackoverflow.com/questions/46882557/how-to-split-columns-faster-in-python/46883120#46883120) for example

```bash
$ # all words containing exactly 3 lowercase a
$ time awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019

real    0m0.075s

$ time LC_ALL=C awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019

real    0m0.045s
```

<br>

## <a name="further-reading"></a>Further Reading

* Manual and related
    * `man awk` and `info awk` for quick reference from command line
    * [gawk manual](https://www.gnu.org/software/gawk/manual/gawk.html#SEC_Contents) for complete reference, extensions and more
    * [awk FAQ](http://www.faqs.org/faqs/computer-lang/awk/faq/) - from 2002, but plenty of information, especially about all the various `awk` implementations
* this tutorial has also been [converted to an ebook](https://github.com/learnbyexample/learn_gnuawk) with additional descriptions, examples, a chapter on regular expressions, etc.
* What's up with different `awk` versions?
    * [unix.stackexchange - brief explanation](https://unix.stackexchange.com/questions/29576/difference-between-gawk-vs-awk)
    * [Differences between gawk, nawk, mawk, and POSIX awk](https://archive.is/btGky)
    * [cheat sheet for awk/nawk/gawk](https://catonmat.net/ftp/awk.cheat.sheet.txt)
* Tutorials and Q&A
    * [code.snipcademy - gentle intro](https://code.snipcademy.com/tutorials/shell-scripting/awk/introduction)
    * [funtoo - using examples](https://www.funtoo.org/Awk_by_Example,_Part_1)
    * [grymoire - detailed tutorial](https://www.grymoire.com/Unix/Awk.html) - covers information about different `awk` versions as well
    * [catonmat - one liners explained](https://catonmat.net/awk-one-liners-explained-part-one)
    * [Why Learn AWK?](https://blog.jpalardy.com/posts/why-learn-awk/)
    * [awk Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/awk?sort=votes&pageSize=15)
    * [awk Q&A on unix.stackexchange](https://unix.stackexchange.com/questions/tagged/awk?sort=votes&pageSize=15)
* Alternatives
    * [GNU datamash](https://www.gnu.org/software/datamash/alternatives/)
    * [bioawk](https://github.com/lh3/bioawk)
    * [hawk](https://github.com/gelisam/hawk/blob/master/doc/README.md) - based on Haskell
    * [miller](https://github.com/johnkerl/miller) - similar to awk/sed/cut/join/sort for name-indexed data such as CSV, TSV, and tabular JSON
        * See this [ycombinator news](https://news.ycombinator.com/item?id=10066742) for other tools like this
* miscellaneous
    * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed)
    * [awk-libs](https://github.com/e36freak/awk-libs) - lots of useful functions
    * [awkaster](https://github.com/TheMozg/awk-raycaster) - Pseudo-3D shooter written completely in awk using raycasting technique
    * [awk REPL](https://awk.js.org/) - live editor on browser
* examples for some of the stuff not covered in this tutorial
    * [unix.stackexchange - rand/srand](https://unix.stackexchange.com/questions/372816/awk-get-random-lines-of-file-satisfying-a-condition)
    * [unix.stackexchange - strftime](https://unix.stackexchange.com/questions/224969/current-date-in-awk)
    * [unix.stackexchange - ARGC and ARGV](https://unix.stackexchange.com/questions/222146/awk-does-not-end/222150#222150)
    * [stackoverflow - arbitrary precision integer extension](https://stackoverflow.com/questions/46904447/strange-output-while-comparing-engineering-numbers-in-awk)
    * [stackoverflow - recognizing hexadecimal numbers](https://stackoverflow.com/questions/3683110/how-to-make-calculations-on-hexadecimal-numbers-with-awk)
    * [unix.stackexchange - sprintf and close](https://unix.stackexchange.com/questions/223727/splitting-file-for-every-10000-numbers-not-lines/223739#223739)
    * [unix.stackexchange - user defined functions and array passing](https://unix.stackexchange.com/questions/72469/gawk-passing-arrays-to-functions)
    * [unix.stackexchange - rename csv files based on number of fields in header row](https://unix.stackexchange.com/questions/408742/count-number-of-columns-in-csv-files-and-rename-if-less-than-11-columns)


================================================
FILE: gnu_grep.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnugrep_ripgrep/. The ebook also has content updated for newer version of the commands, includes exercises, solutions, has a separate chapter for popular alternative `ripgrep`, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnugrep_ripgrep

---

<br> <br> <br>

# <a name="gnu-grep"></a>GNU grep

**Table of Contents**

* [Simple string search](#simple-string-search)
* [Case insensitive search](#case-insensitive-search)
* [Invert matching lines](#invert-matching-lines)
* [Line number, count and limiting output lines](#line-number-count-and-limiting-output-lines)
* [Multiple search strings](#multiple-search-strings)
* [File names in output](#file-names-in-output)
* [Match whole word or line](#match-whole-word-or-line)
* [Colored output](#colored-output)
* [Get only matching portion](#get-only-matching-portion)
* [Context matching](#context-matching)
* [Recursive search](#recursive-search)
    * [Basic recursive search](#basic-recursive-search)
    * [Exclude/Include specific files/directories](#excludeinclude-specific-filesdirectories)
    * [Recursive search with bash options](#recursive-search-with-bash-options)
    * [Recursive search using find command](#recursive-search-using-find-command)
    * [Passing file names to other commands](#passing-file-names-to-other-commands)
* [Search strings from file](#search-strings-from-file)
* [Options for scripting purposes](#options-for-scripting-purposes)
* [Regular Expressions - BRE/ERE](#regular-expressions-breere)
    * [Line Anchors](#line-anchors)
    * [Word Anchors](#word-anchors)
    * [Alternation](#alternation)
    * [The dot meta character](#the-dot-meta-character)
    * [Quantifiers](#quantifiers)
    * [Character classes](#character-classes)
    * [Grouping](#grouping)
    * [Back reference](#back-reference)
* [Multiline matching](#multiline-matching)
* [Perl Compatible Regular Expressions](#perl-compatible-regular-expressions)
    * [Backslash sequences](#backslash-sequences)
    * [Non-greedy matching](#non-greedy-matching)
    * [Lookarounds](#lookarounds)
    * [Ignoring specific matches](#ignoring-specific-matches)
    * [Re-using regular expression pattern](#re-using-regular-expression-pattern)
* [Gotchas and Tips](#gotchas-and-tips)
* [Regular Expressions Reference (ERE)](#regular-expressions-reference-ere)
    * [Anchors](#anchors)
    * [Character Quantifiers](#character-quantifiers)
    * [Character classes and backslash sequences](#character-classes-and-backslash-sequences)
    * [Pattern groups](#pattern-groups)
    * [Basic vs Extended Regular Expressions](#basic-vs-extended-regular-expressions)
* [Further Reading](#further-reading)

<br>

```bash
$ grep -V | head -1
grep (GNU grep) 2.25

$ man grep
GREP(1)                     General Commands Manual                    GREP(1)

NAME
       grep, egrep, fgrep, rgrep - print lines matching a pattern

SYNOPSIS
       grep [OPTIONS] PATTERN [FILE...]
       grep [OPTIONS] [-e PATTERN]...  [-f FILE]...  [FILE...]

DESCRIPTION
       grep searches the named input FILEs for lines containing a match to the
       given PATTERN.  If no files are specified, or if the file “-” is given,
       grep  searches  standard  input.   By default, grep prints the matching
       lines.

       In addition, the variant programs egrep, fgrep and rgrep are  the  same
       as  grep -E,  grep -F,  and  grep -r, respectively.  These variants are
       deprecated, but are provided for backward compatibility.
...
```

**Note** For more detailed documentation and examples, use `info grep`

<br>

## <a name="simple-string-search"></a>Simple string search

* First specify the search pattern (usually enclosed in single quotes) and then the file input
* More than one file can be specified or input given from stdin

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ grep 'are' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ grep 'so are' poem.txt
And so are you.
```

* If search string contains any regular expression meta characters like `^$\.*[]` (covered later), use the `-F` option or `fgrep` if available

```bash
$ echo 'int a[5]' | grep 'a[5]'
$ echo 'int a[5]' | grep -F 'a[5]'
int a[5]
$ echo 'int a[5]' | fgrep 'a[5]'
int a[5]
```
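
A sketch contrasting the two modes with `-c`, using two made-up input lines. The search string `a.b*` is treated as a regular expression in the first command and as literal text in the second; the variable names are illustrative.

```bash
# hypothetical sample lines; only the first contains the literal text 'a.b*'
sample=$(printf 'x = a.b*c\nx = aXbbc\n')
regex_count=$(printf '%s\n' "$sample" | grep -c 'a.b*')
fixed_count=$(printf '%s\n' "$sample" | grep -cF 'a.b*')
echo "$regex_count $fixed_count"
# 2 1
```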

* See [Gotchas and Tips](#gotchas-and-tips) section if you get strange issues

<br>

## <a name="case-insensitive-search"></a>Case insensitive search

```bash
$ grep -i 'rose' poem.txt
Roses are red,

$ grep -i 'and' poem.txt
And so are you.
```

<br>

## <a name="invert-matching-lines"></a>Invert matching lines

* Use the `-v` option to get lines other than those matching the search string
* Tip: Look out for other opposite pairs like `-l -L`, `-h -H`, opposites in regular expression, etc

```bash
$ grep -v 'are' poem.txt
Sugar is sweet,

$ # example for input from stdin
$ seq 5 | grep -v '3'
1
2
4
5
```

<br>

## <a name="line-number-count-and-limiting-output-lines"></a>Line number, count and limiting output lines

* Show line number of matching lines

```bash
$ grep -n 'sweet' poem.txt
3:Sugar is sweet,
```

* Count number of matching lines

```bash
$ grep -c 'are' poem.txt
3
```

* Limit number of matching lines

```bash
$ grep -m2 'are' poem.txt
Roses are red,
Violets are blue,
```

<br>

## <a name="multiple-search-strings"></a>Multiple search strings

* Match any

```bash
$ # search blue or you
$ grep -e 'blue' -e 'you' poem.txt
Violets are blue,
And so are you.
```

If there are a lot of search strings, use a file as input

**Note** Be careful to avoid empty lines in the file, as an empty line would result in matching all the lines

```bash
$ printf 'rose\nsugar\n' > search_strings.txt
$ cat search_strings.txt
rose
sugar

$ # -f option accepts file input with search terms in separate lines
$ grep -if search_strings.txt poem.txt
Roses are red,
Sugar is sweet,
```
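
The effect of an empty line in the search file can be seen with `-c`. A minimal sketch, assuming a scratch directory and made-up file names (`good.txt`, `bad.txt`, `clean.txt`); the last step shows one possible way to drop empty lines before using the file.

```bash
dir=$(mktemp -d)
printf 'Roses are red,\nSugar is sweet,\nAnd so are you.\n' > "$dir/poem.txt"
printf 'rose\nsugar\n'   > "$dir/good.txt"   # two search terms
printf 'rose\n\nsugar\n' > "$dir/bad.txt"    # same terms plus an empty line

good=$(grep -ci -f "$dir/good.txt" "$dir/poem.txt")
bad=$(grep -ci -f "$dir/bad.txt" "$dir/poem.txt")

# remove empty lines from the search file and try again
grep -v '^$' "$dir/bad.txt" > "$dir/clean.txt"
fixed=$(grep -ci -f "$dir/clean.txt" "$dir/poem.txt")
echo "$good $bad $fixed"
# 2 3 2
```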

* Match all

```bash
$ # match line containing both are & And
$ grep 'are' poem.txt | grep 'And'
And so are you.
```

<br>

## <a name="file-names-in-output"></a>File names in output

* `-l` to get files matching the search
* `-L` to get files not matching the search
* With these options, `grep` skips the rest of a file once a match is found

```bash
$ grep -l 'Rose' poem.txt
poem.txt

$ grep -L 'are' poem.txt search_strings.txt
search_strings.txt
```

* Prefix file name to search results
* `-h` is default for single file input, no file name prefix in output
* `-H` is default for multiple file input, file name prefix in output

```bash
$ grep -h 'Rose' poem.txt
Roses are red,
$ grep -H 'Rose' poem.txt
poem.txt:Roses are red,

$ # -H is default for multiple file input
$ grep -i 'sugar' poem.txt search_strings.txt
poem.txt:Sugar is sweet,
search_strings.txt:sugar
$ grep -ih 'sugar' poem.txt search_strings.txt
Sugar is sweet,
sugar
```

<br>

## <a name="match-whole-word-or-line"></a>Match whole word or line

* Word search using `-w` option
    * a word consists of letters, digits and the underscore character
* This will ensure that given patterns are not surrounded by other word characters
    * this is slightly different from using word boundaries in regular expressions
* For example, this helps to distinguish `par` from `spar`, `part`, etc

```bash
$ printf 'par value\nheir apparent\n' | grep 'par'
par value
heir apparent

$ printf 'par value\nheir apparent\n' | grep -w 'par'
par value

$ printf 'scare\ncart\ncar\nmacaroni\n' | grep -w 'car'
car
```

* Another useful option is `-x` to match only complete line, not anywhere in the line

```bash
$ printf 'see my book list\nmy book\n' | grep 'my book'
see my book list
my book

$ printf 'see my book list\nmy book\n' | grep -x 'my book'
my book

$ printf 'scare\ncart\ncar\nmacaroni\n' | grep -x 'car'
car
```

<br>

## <a name="colored-output"></a>Colored output

* Highlight search strings, line numbers, file name, etc in different colors
    * Depends on color support in terminal being used
* options to `--color` are
    * `auto` color information is not passed on when output is redirected (another command, file, etc)
    * `always` color information is passed on even when output is redirected (another command, file, etc)
    * `never` explicitly specify no highlighting

```bash
$ # can also use grep --color 'blue' as auto is default
$ grep --color=auto 'blue' poem.txt
Violets are blue,
```

* Sample screenshot

![grep color output](./images/color_option.png)

* Example to show difference between `auto` and `always`

```bash
$ grep --color=auto 'blue' poem.txt > saved_output.txt
$ cat -v saved_output.txt
Violets are blue,
$ grep --color=always 'blue' poem.txt > saved_output.txt
$ cat -v saved_output.txt
Violets are ^[[01;31m^[[Kblue^[[m^[[K,

$ # some commands like 'less' are capable of using the color information
$ grep --color=always 'are' poem.txt | less -R
$ # highlight multiple matching patterns
$ grep --color=always 'are' poem.txt | grep --color 'd'
Roses are red,
And so are you.
```

<br>

## <a name="get-only-matching-portion"></a>Get only matching portion

* The `-o` option to get only matched portion is more useful with regular expressions
* It also comes in handy when the total number of matches is required, instead of the number of matching lines

```bash
$ grep -o 'are' poem.txt
are
are
are

$ # -c only gives count of matching lines
$ grep -c 'e' poem.txt
4
$ grep -co 'e' poem.txt
4
$ # so need another command to get count of all matches
$ grep -o 'e' poem.txt | wc -l
9
```

<br>

## <a name="context-matching"></a>Context matching

* The `-A`, `-B` and `-C` options are useful to get lines after/before/around matching line respectively

```bash
$ grep -A1 'blue' poem.txt
Violets are blue,
Sugar is sweet,
$ grep -B1 'blue' poem.txt
Roses are red,
Violets are blue,
$ grep -C1 'blue' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
```

* If there are multiple non-adjacent matching segments, by default `grep` adds a line `--` to separate them
    * non-adjacent here implies that segments are separated by at least one line in input data

```bash
$ seq 29 | grep -A1 '3'
3
4
--
13
14
--
23
24
```

* Use `--no-group-separator` option if the separator line is a hindrance, for example feeding the output of `grep` to another program

```bash
$ seq 29 | grep --no-group-separator -A1 '3'
3
4
13
14
23
24
```

* Use `--group-separator` to customize the separator

```bash
$ seq 29 | grep --group-separator='*****' -A1 '3'
3
4
*****
13
14
*****
23
24
```

<br>

## <a name="recursive-search"></a>Recursive search

First let's create some more test files

```bash
$ mkdir -p test_files/hidden_files
$ printf 'Red\nGreen\nBlue\nBlack\nWhite\n' > test_files/colors.txt
$ printf 'Violet\nIndigo\nBlue\nGreen\nYellow\nOrange\nRed\n' > test_files/vibgyor.txt
$ printf '#!/usr/bin/python3\n\nprint("Hello World")\n' > test_files/hello.py
$ printf 'I like yellow\nWhat about you\n' > test_files/hidden_files/.fav_color.info
```

From `man grep`

```bash
       -r, --recursive
              Read all files  under  each  directory,  recursively,  following
              symbolic  links only if they are on the command line.  Note that
              if  no  file  operand  is  given,  grep  searches  the   working
              directory.  This is equivalent to the -d recurse option.

       -R, --dereference-recursive
              Read  all  files  under each directory, recursively.  Follow all
              symbolic links, unlike -r.
```

<br>

#### <a name="basic-recursive-search"></a>Basic recursive search

* Note that `-H` option automatically activates for multiple file input

```bash
$ # by default, current working directory is searched
$ grep -r 'red'
poem.txt:Roses are red,

$ grep -ri 'red'
poem.txt:Roses are red,
test_files/colors.txt:Red
test_files/vibgyor.txt:Red

$ grep -rin 'red'
poem.txt:1:Roses are red,
test_files/colors.txt:1:Red
test_files/vibgyor.txt:7:Red

$ grep -ril 'red'
poem.txt
test_files/colors.txt
test_files/vibgyor.txt
```

<br>

#### <a name="excludeinclude-specific-filesdirectories"></a>Exclude/Include specific files/directories

* By default, recursive search includes hidden files as well
* They can be excluded by file name or directory name
    * [glob](https://github.com/learnbyexample/Linux_command_line/blob/master/Shell.md#wildcards) patterns can be used
    * for example: `*.[ch]` to specify all files ending with `.c` or `.h`
* The exclusion options can be used multiple times
    * for example: `--exclude='*.txt' --exclude='*.log'` or specified from a file using `--exclude-from=FILE`
* To search only files with specific pattern in their names, use `--include=GLOB`
* **Note:** exclusion/inclusion applies only to basename of file/directory, not the entire path
* To follow all symbolic links (not directly specified as arguments, but found on recursive search), use `-R` instead of `-r`

```bash
$ grep -ri 'you'
poem.txt:And so are you.
test_files/hidden_files/.fav_color.info:What about you

$ # exclude file names starting with `.` i.e. hidden files
$ grep -ri --exclude='.*' 'you'
poem.txt:And so are you.

$ # include only file names ending with `.info`
$ grep -ri --include='*.info' 'you'
test_files/hidden_files/.fav_color.info:What about you

$ # exclude a directory
$ grep -ri --exclude-dir='hidden_files' 'you'
poem.txt:And so are you.

$ # If you are using git (or similar), this would be handy
$ # grep --exclude-dir='.git' -rl 'search pattern'
```
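
The options can be combined. A minimal sketch, assuming a made-up directory layout created in a scratch directory: only files matching `*.[ch]` are searched, and anything under a directory named `logs` is skipped.

```bash
dir=$(mktemp -d)
mkdir -p "$dir/src" "$dir/logs"
printf 'todo: fix\n' > "$dir/src/main.c"
printf 'todo: doc\n' > "$dir/src/util.h"
printf 'todo: old\n' > "$dir/src/notes.txt"
printf 'todo: log\n' > "$dir/logs/run.c"

# quote the globs so that the shell passes them to grep unchanged
found=$(cd "$dir" && grep -rl --include='*.[ch]' --exclude-dir='logs' 'todo' . | sort)
echo "$found"
# ./src/main.c
# ./src/util.h
```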

<br>

#### <a name="recursive-search-with-bash-options"></a>Recursive search with bash options

* The `bash` option `globstar` allows `**` to match files recursively
    * Other options like `extglob` and `dotglob` come in handy too
    * See [glob](https://github.com/learnbyexample/Linux_command_line/blob/master/Shell.md#wildcards) for more info on these options
* The `-d skip` option tells `grep` to skip directories instead of trying to treat them as text files to be searched

```bash
$ grep -ril 'yellow'
test_files/hidden_files/.fav_color.info
test_files/vibgyor.txt

$ # recursive search
$ shopt -s globstar
$ grep -d skip -il 'yellow' **/*
test_files/vibgyor.txt

$ # include hidden files as well
$ shopt -s dotglob
$ grep -d skip -il 'yellow' **/*
test_files/hidden_files/.fav_color.info
test_files/vibgyor.txt

$ # use extended glob patterns
$ shopt -s extglob
$ # other than poem.txt
$ grep -d skip -il 'red' **/!(poem.txt)
test_files/colors.txt
test_files/vibgyor.txt
$ # other than poem.txt or colors.txt
$ grep -d skip -il 'red' **/!(poem|colors).txt
test_files/vibgyor.txt
```

<br>

#### <a name="recursive-search-using-find-command"></a>Recursive search using find command

* `find` offers more versatile file selection than the recursive options of `grep`
* See also [this guide](./wheres_my_file.md#find) for more examples/tutorials on using `find`

```bash
$ # all files, including hidden ones
$ find -type f -exec grep -il 'red' {} +
./poem.txt
./test_files/colors.txt
./test_files/vibgyor.txt

$ # all files ending with .txt
$ find -type f -name '*.txt' -exec grep -in 'you' {} +
./poem.txt:4:And so are you.

$ # all files not ending with .txt
$ find -type f -not -name '*.txt' -exec grep -in 'you' {} +
./test_files/hidden_files/.fav_color.info:2:What about you
```
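
One thing `find` can do that `grep -r` cannot is limit the search depth. The sketch below assumes a `find` implementation that supports `-maxdepth` (GNU and BSD versions do). The sample files are hypothetical.

```bash
# sample tree: one matching file at top level, one in a sub-directory
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/sub"
echo 'Roses are Red' > "$tmpdir/top.txt"
echo 'red alert'     > "$tmpdir/sub/deep.txt"
echo 'red'           > "$tmpdir/notes.md"

# search only .txt files in the starting directory, not in sub-directories
found=$(cd "$tmpdir" && find . -maxdepth 1 -type f -name '*.txt' -exec grep -il 'red' {} +)
printf '%s\n' "$found"

rm -rf "$tmpdir"
```

Only `./top.txt` is reported here, since `sub/deep.txt` is too deep and `notes.md` doesn't end with `.txt`.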

<br>

#### <a name="passing-file-names-to-other-commands"></a>Passing file names to other commands

* To pass the filtered file names to another command, check whether the receiving command accepts file names separated by the ASCII NUL character
* If so, use the `-Z` option so that each file name in the `grep` output is terminated with the NUL character; commands like `xargs` have the `-0` option to read such input
* This helps when file names contain characters like space, newline, etc
* Typical use case: search and replace something in all files matching some pattern, for ex: `grep -rlZ 'PAT1' | xargs -0 sed -i 's/PAT2/REPLACE/g'`

```bash
$ # prompt at end of line not shown for simplicity
$ # ^@ here indicates the NUL character
$ grep -rlZ 'you' | cat -A
poem.txt^@test_files/hidden_files/.fav_color.info^@

$ # print first column from all lines of all files
$ grep -rlZ 'you' | xargs -0 awk '{print $1}'
Roses
Violets
Sugar
And
I
What
```
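
The search and replace use case mentioned above could look like the following sketch. It assumes GNU `sed` (for `-i`) and GNU `xargs` (for `-r`). The sample files are created only for illustration.

```bash
tmpdir=$(mktemp -d)
echo 'hi how are you'  > "$tmpdir/abc xyz.txt"
echo 'And so are you.' > "$tmpdir/poem.txt"
echo 'nothing here'    > "$tmpdir/other.txt"

# replace 'you' with 'YOU', but only in files that contain 'you'
# -r avoids running sed at all when grep finds no matching file
(cd "$tmpdir" && grep -rlZ 'you' . | xargs -0 -r sed -i 's/you/YOU/g')

spaced=$(cat "$tmpdir/abc xyz.txt")
poem=$(cat "$tmpdir/poem.txt")
other=$(cat "$tmpdir/other.txt")
printf '%s\n' "$spaced" "$poem" "$other"

rm -rf "$tmpdir"
```

The file with a space in its name is edited correctly because the names are NUL separated, and `other.txt` is never passed to `sed`.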

* simple example to show filenames with space causing issue if `-Z` is not used

```bash
$ # 'abc xyz.txt' is a file with space in its name
$ grep -ri 'are'
abc xyz.txt:hi how are you
poem.txt:Roses are red,
poem.txt:Violets are blue,
poem.txt:And so are you.
saved_output.txt:Violets are blue,

$ # problem when -Z is not used
$ grep -ril 'are' | xargs grep 'you'
grep: abc: No such file or directory
grep: xyz.txt: No such file or directory
poem.txt:And so are you.

$ # no issues if -Z is used
$ grep -rilZ 'are' | xargs -0 grep 'you'
abc xyz.txt:hi how are you
poem.txt:And so are you.
```

* Example for matching more than one search string anywhere in file

```bash
$ # files containing 'you'
$ grep -rl 'you'
poem.txt
test_files/hidden_files/.fav_color.info

$ # files containing 'you' as well as 'are'
$ grep -rlZ 'you' | xargs -0 grep -l 'are'
poem.txt

$ # files containing 'you' but NOT 'are'
$ grep -rlZ 'you' | xargs -0 grep -L 'are'
test_files/hidden_files/.fav_color.info
```

* another example

```bash
$ grep -rilZ 'red' | xargs -0 grep -il 'blue'
poem.txt
test_files/colors.txt
test_files/vibgyor.txt

$ # note the use of `-Z` for middle command
$ grep -rilZ 'red' | xargs -0 grep -ilZ 'blue' | xargs -0 grep -il 'violet'
poem.txt
test_files/vibgyor.txt
```

<br>

## <a name="search-strings-from-file"></a>Search strings from file

* using file input to specify search terms
* `-F` option will force matching strings literally (no regular expressions)
* See also [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash) - read all answers

```bash
$ grep -if test_files/colors.txt poem.txt
Roses are red,
Violets are blue,

$ # get common lines between two files
$ grep -Fxf test_files/colors.txt test_files/vibgyor.txt
Blue
Green
Red

$ # get lines present in vibgyor.txt but not in colors.txt
$ grep -Fvxf test_files/colors.txt test_files/vibgyor.txt
Violet
Indigo
Yellow
Orange
```
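
To see why `-F` matters with `-f`, here is a small sketch using made-up files. Without `-F`, every line of the pattern file is treated as a regular expression, so `.` matches any character.

```bash
tmpdir=$(mktemp -d)
printf 'a.c\n'           > "$tmpdir/terms.txt"
printf 'abc\na.c\naxc\n' > "$tmpdir/input.txt"

# each line of terms.txt is a regular expression here
regex_hits=$(grep -f "$tmpdir/terms.txt" "$tmpdir/input.txt")
# each line of terms.txt is matched literally here
literal_hits=$(grep -Ff "$tmpdir/terms.txt" "$tmpdir/input.txt")

printf 'regex:\n%s\n' "$regex_hits"
printf 'literal:\n%s\n' "$literal_hits"

rm -rf "$tmpdir"
```

The first command selects all three lines, while the `-F` version selects only `a.c`.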

<br>

## <a name="options-for-scripting-purposes"></a>Options for scripting purposes

* In scripts, it is often enough to know whether a pattern matches or not
* The `-q` option doesn't print anything on stdout; the exit status is `0` if a match is found and `1` otherwise
    * Check out [this practical script](https://github.com/learnbyexample/command_help/blob/master/ch) using the `-q` option

```bash
$ grep -qi 'rose' poem.txt
$ echo $?
0
$ grep -qi 'lily' poem.txt
$ echo $?
1

$ if grep -qi 'rose' poem.txt; then echo 'match found!'; else echo 'match not found'; fi
match found!
$ if grep -qi 'lily' poem.txt; then echo 'match found!'; else echo 'match not found'; fi
match not found
```

* The `-s` option will suppress error messages as well

```bash
$ grep 'rose' file_xyz.txt
grep: file_xyz.txt: No such file or directory
$ grep -s 'rose' file_xyz.txt
$ echo $?
2

$ touch foo.txt
$ chmod -r foo.txt
$ grep 'rose' foo.txt
grep: foo.txt: Permission denied
$ grep -s 'rose' foo.txt
$ echo $?
2
```
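
The three exit status values seen above (`0`, `1` and `2`) can be told apart in a script. The function name `check_pattern` below is hypothetical; it is just one possible way to wrap `-q` and `-s`.

```bash
# report match, no match, or error for a pattern and a file
check_pattern() {
    status=0
    # -- guards against patterns starting with -
    grep -qs -- "$1" "$2" || status=$?
    case $status in
        0) echo 'match found' ;;
        1) echo 'match not found' ;;
        *) echo 'error reading input' ;;
    esac
}

tmpfile=$(mktemp)
printf 'Roses are red,\n' > "$tmpfile"

r1=$(check_pattern 'Roses' "$tmpfile")
r2=$(check_pattern 'lily' "$tmpfile")
r3=$(check_pattern 'Roses' "$tmpfile.missing")
printf '%s\n' "$r1" "$r2" "$r3"

rm -f "$tmpfile"
```

Storing the status with `|| status=$?` keeps the function usable in scripts that run with `set -e`.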

<br>

## <a name="regular-expressions-breere"></a>Regular Expressions - BRE/ERE

Before diving into regular expressions, here are a few examples to show default `grep` behavior vs `-F`

```bash
$ # oops, why did it not match?
$ echo 'int a[5]' | grep 'a[5]'

$ # where did that error come from??
$ echo 'int a[5]' | grep 'a['
grep: Invalid regular expression

$ # what is going on???
$ echo 'int a[5]' | grep 'a[5'
grep: Unmatched [ or [^

$ # phew, -F is a life saver
$ echo 'int a[5]' | grep -F 'a[5]'
int a[5]

$ # [ and ] are meta characters, details in following sections
$ echo 'int a[5]' | grep 'a\[5]'
int a[5]
```

* By default, `grep` treats the search pattern as BRE (Basic Regular Expression)
    * `-G` option can be used to specify explicitly that BRE is used
* The `-E` option enables ERE (Extended Regular Expression), which in GNU grep's case differs from BRE only in how meta characters are written; there is no difference in regular expression functionality
* If `-F` option is used, the search string is treated literally
* If available, one can also use `-P` which indicates PCRE (Perl Compatible Regular Expression)
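
As a quick illustration of these modes, the same search string `a+` gives different results depending on the option used. This is a sketch with a made-up input string; `+` is covered in the quantifiers section.

```bash
# BRE: + is a literal character
bre=$(echo 'aa+b' | grep -o 'a+')
# ERE: + is a quantifier, one or more of a
ere=$(echo 'aa+b' | grep -oE 'a+')
# fixed string: everything is literal
fixed=$(echo 'aa+b' | grep -oF 'a+')

printf 'BRE: %s\nERE: %s\nfixed: %s\n' "$bre" "$ere" "$fixed"
```

BRE and `-F` both extract `a+` here, whereas ERE extracts `aa`.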

<br>

#### <a name="line-anchors"></a>Line Anchors

* Often, search must match from beginning of line or towards end of line
* For example, an integer variable declaration in `C` will start with optional white-space, the keyword `int`, white-space and then variable(s)
    * This way one can avoid matching declarations inside single line comments as well.
* Similarly, one might want to match a variable at end of statement
* The meta characters for line anchoring are `^` for beginning of line and `$` for end of line

```bash
$ echo 'Fantasy is my favorite genre' > fav.txt
$ echo 'My favorite genre is Fantasy' >> fav.txt
$ cat fav.txt
Fantasy is my favorite genre
My favorite genre is Fantasy

$ # start of line
$ grep '^Fantasy' fav.txt
Fantasy is my favorite genre

$ # end of line
$ grep 'Fantasy$' fav.txt
My favorite genre is Fantasy

$ # without anchors
$ grep 'Fantasy' fav.txt
Fantasy is my favorite genre
My favorite genre is Fantasy
```
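
The `C` declaration use case mentioned above could be sketched as follows. It uses `[[:space:]]` and quantifiers, which are explained in later sections, and the input lines are made up for illustration.

```bash
# optional white-space, the keyword int, then at least one white-space
decls=$(printf 'int a;\n    int b = 2;\n// int c;\nprint(x);\n' |
        grep -E '^[[:space:]]*int[[:space:]]+')
printf '%s\n' "$decls"
```

The commented declaration and the `print(x);` line are not selected because `int` is not the first non-blank text on those lines.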

* As the meta characters have special meaning (assuming `-F` option is not used), they have to be escaped using `\` to match literally
* The `\` itself is meta character, so to match it literally, use `\\`
* The line anchors `^` and `$` have special meaning only when they are present at start/end of regular expression

```bash
$ echo '^foo bar$' | grep '^foo'
$ echo '^foo bar$' | grep '\^foo'
^foo bar$
$ echo '^foo bar$' | grep '^^foo'
^foo bar$

$ echo '^foo bar$' | grep 'bar$'
$ echo '^foo bar$' | grep 'bar\$'
^foo bar$
$ echo '^foo bar$' | grep 'bar$$'
^foo bar$

$ echo 'foo $ bar' | grep ' $ '
foo $ bar

$ printf 'foo\cbar' | grep -o '\c'
c
$ printf 'foo\cbar' | grep -o '\\c'
\c
```

<br>

#### <a name="word-anchors"></a>Word Anchors

* The `-w` option works well to match whole words. But what about matching only start or end of words?
* Anchors `\<` and `\>` will match start/end positions of a word
* `\b` can also be used instead of `\<` and `\>` which matches both edges of a word

```bash
$ printf 'spar\npar\npart\napparent\n'
spar
par
part
apparent

$ # words ending with par
$ printf 'spar\npar\npart\napparent\n' | grep 'par\>'
spar
par

$ # words starting with par
$ printf 'spar\npar\npart\napparent\n' | grep '\<par'
par
part
```

* `-w` option is same as specifying both start and end word boundaries

```bash
$ printf 'spar\npar\npart\napparent\n' | grep '\<par\>'
par

$ printf 'spar\npar\npart\napparent\n' | grep '\bpar\b'
par

$ printf 'spar\npar\npart\napparent\n' | grep -w 'par'
par
```

* `\b` has an opposite `\B` which is quite useful too

```bash
$ # string not surrounded by word boundary either side
$ printf 'spar\npar\npart\napparent\n' | grep '\Bpar\B'
apparent

$ # word containing par but not as start of word
$ printf 'spar\npar\npart\napparent\n' | grep '\Bpar'
spar
apparent

$ # word containing par but not as end of word
$ printf 'spar\npar\npart\napparent\n' | grep 'par\B'
part
apparent
```

* the word boundary escape sequences differ slightly from `-w` option

```bash
$ # this fails because there is no word boundary between space and +
$ echo '2 +3 = 5' | grep '\b+3\b'
$ # this works as -w only ensures that there are no surrounding word characters
$ echo '2 +3 = 5' | grep -w '+3'
2 +3 = 5

$ # doesn't work as , isn't at start of word boundary
$ echo 'hi, 2 one' | grep '\<, 2\>'
$ # won't match as there are word characters before ,
$ echo 'hi, 2 one' | grep -w ', 2'
$ # works as \b matches both edges and , is at end of word after i
$ echo 'hi, 2 one' | grep '\b, 2\b'
hi, 2 one
```

<br>

#### <a name="alternation"></a>Alternation

* The `|` meta character is similar to using multiple `-e` options
* Each side of `|` is a complete regular expression with its own start/end anchors
* How each part of alternation is handled and order of evaluation/output is beyond the scope of this tutorial
    * See [this](https://www.regular-expressions.info/alternation.html) for more info on this topic.
* `|` is one of the meta characters that requires different syntax between BRE/ERE

```bash
$ grep 'blue\|you' poem.txt
Violets are blue,
And so are you.
$ grep -E 'blue|you' poem.txt
Violets are blue,
And so are you.

$ # extract case-insensitive e or f from anywhere in line
$ echo 'Fantasy is my favorite genre' | grep -Eio 'e|f'
F
f
e
e
e

$ # extract case-insensitive e at end of line, f at start of line
$ echo 'Fantasy is my favorite genre' | grep -Eio 'e$|^f'
F
e
```
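
For simple cases, alternation and multiple `-e` options select the same lines. A small sketch, with the poem text from earlier examples stored in a variable so that the snippet is self-contained:

```bash
poem='Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.'

# two separate patterns
with_e=$(printf '%s\n' "$poem" | grep -e 'blue' -e 'you')
# single ERE pattern with alternation
with_alt=$(printf '%s\n' "$poem" | grep -E 'blue|you')

printf '%s\n' "$with_e"
[ "$with_e" = "$with_alt" ] && echo 'same lines selected'
```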

* A cool usecase of alternation is using `^` or `$` anchors to highlight searched term as well as display rest of unmatched lines
    * the line anchors will match every input line, even empty lines as they are position markers

```bash
$ grep --color=auto -E '^|are' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ grep --color=auto -E 'is|$' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
```

Screenshot for above example:

![highlighting string](./images/highlight_string_whole_file_op.png)

See also

* [stackoverflow - Grep output with multiple Colors](https://stackoverflow.com/questions/17236005/grep-output-with-multiple-colors)
* [unix.stackexchange - Multicolored Grep](https://unix.stackexchange.com/questions/104350/multicolored-grep)

<br>

#### <a name="the-dot-meta-character"></a>The dot meta character

The `.` meta character is used to match any single character

```bash
$ # any two characters surrounded by word boundaries
$ echo 'I have 12, he has 132!' | grep -ow '..'
12
he

$ # match three characters from start of line
$ # \t (TAB) is single character here
$ printf 'a\tbcd\n' | grep -o '^...'
a       b

$ # all three character word starting with c
$ echo 'car bat cod cope scat dot abacus' | grep -ow 'c..'
car
cod

$ echo '1 & 2' | grep -o '.'
1
 
&
 
2
```

<br>

#### <a name="quantifiers"></a>Greedy Quantifiers

Defines how many times a character (simplified for now) should be matched

* `?` will try to match 0 or 1 time
* For BRE, use `\?`

```bash
$ printf 'late\npale\nfactor\nrare\nact\n'
late
pale
factor
rare
act

$ # match a followed by t, with or without c in between
$ printf 'late\npale\nfactor\nrare\nact\n' | grep -E 'ac?t'
late
factor
act

$ # same as using this alternation
$ printf 'late\npale\nfactor\nrare\nact\n' | grep -E 'at|act'
late
factor
act
```

* `*` will try to match 0 or more times
* There is no upper limit and `*` will try to match as many times as possible
    * if matching maximum times results in overall regex failing, then next best count is chosen until overall regex passes
    * if there are multiple quantifiers, left-most quantifier gets precedence

```bash
$ echo 'abbbc' | grep -o 'b*'
bbb

$ # matches 0 or more b only if surrounded by a and c
$ echo 'abc ac adc abbc bbb bc' | grep -o 'ab*c'
abc
ac
abbc

$ # see how it matched everything
$ echo 'car bat cod map scat dot abacus' | grep -o '.*'
car bat cod map scat dot abacus

$ # but here it stops at m
$ echo 'car bat cod map scat dot abacus' | grep -o '.*m'
car bat cod m

$ # stopped at dot, not bat or scat - match as much as possible
$ echo 'car bat cod map scat dot abacus' | grep -o 'c.*t'
car bat cod map scat dot

$ # matching overall expression gets preference
$ echo 'car bat cod map scat dot abacus' | grep -o 'c.*at'
car bat cod map scat

$ # precedence is left to right in case of multiple matches
$ echo 'car bat cod map scat dot abacus' | grep -o 'b.*m'
bat cod m
$ echo 'car bat cod map scat dot abacus' | grep -o 'b.*m*'
bat cod map scat dot abacus
```
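
Since `*` can match zero times, a pattern made up of only a quantified character matches the empty string and hence selects every line. A short sketch with made-up input:

```bash
# z* can match nothing at all, so both lines are counted
all_lines=$(printf 'abc\nxyz\n' | grep -c 'z*')
# zz* needs at least one z, so only xyz is counted
with_z=$(printf 'abc\nxyz\n' | grep -c 'zz*')

printf 'z* : %s\nzz*: %s\n' "$all_lines" "$with_z"
```

This is worth remembering when a search unexpectedly returns the whole file.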

* `+` will try to match 1 or more times
* Another meta character that differs in syntax between BRE/ERE

```bash
$ echo 'abbbc' | grep -o 'b\+'
bbb
$ echo 'abbbc' | grep -oE 'b+'
bbb

$ echo 'abc ac adc abbc bbb bc' | grep -oE 'ab+c'
abc
abbc
$ echo 'abc ac adc abbc bbb bc' | grep -o 'ab*c'
abc
ac
abbc
```

* For more precise control on number of times to match, `{}` is useful
    * use `\{\}` for BRE
* It can take one of four forms, `{m,n}`, `{,n}`, `{m,}` and `{n}`

```bash
$ # {m,n} - m to n, including both m and n
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{1,2}c'
abc
abbc

$ # {,n} - 0 to n times
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{,2}c'
ac
abc
abbc

$ # {m,} - at least m times
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{2,}c'
abbc
abbbc

$ # {n} - exactly n times
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{2}c'
abbc
```
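
The examples above use ERE. For comparison, here is a sketch of the BRE form, where the braces have to be escaped:

```bash
# BRE needs \{ and \}
bre=$(echo 'ac abc abbc abbbc' | grep -o 'ab\{1,2\}c')
# ERE uses { and } directly
ere=$(echo 'ac abc abbc abbbc' | grep -oE 'ab{1,2}c')

printf '%s\n' "$bre"
[ "$bre" = "$ere" ] && echo 'BRE and ERE results are same'
```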

<br>

#### <a name="character-classes"></a>Character classes

* The meta character pair `[]` matches any one of the characters listed within `[]`
* Meta characters like `^`, `$` have different meaning inside and outside of `[]`
* Simple example first, matching any of the characters within `[]`

```bash
$ echo 'do so in to no on' | grep -ow '[nt]o'
to
no

$ echo 'do so in to no on' | grep -ow '[sot][on]'
so
to
on
```

* Adding a quantifier
* Check out [unix words](https://en.wikipedia.org/wiki/Words_(Unix)) and [sample words file](https://users.cs.duke.edu/~ola/ap/linuxwords)

```bash
$ # words made up of letters o and n, at least 2 letters
$ grep -xE '[on]{2,}' /usr/share/dict/words
no
non
noon
on

$ # lines containing only digits
$ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xE '[0123456789]+'
123
42
```

* Character ranges
* Matching any alphabet, number, hexadecimal number etc becomes cumbersome if every character has to be individually specified
* So, there's a shortcut, using `-` to construct a range (has to be specified in ascending order)
* See [ascii codes table](https://ascii.cl/) for reference
    * Note that behavior of range will differ for other character encodings
    * See **Character Classes and Bracket Expressions** as well as **LC_COLLATE under Environment Variables** sections in `info grep` for more detail
* [Matching Numeric Ranges with a Regular Expression](https://www.regular-expressions.info/numericranges.html)

```bash
$ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xE '[0-9]+'
123
42

$ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xiE '[a-z]+'
cat
foo
baz

$ # only valid decimal numbers
$ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xE '[0-9]+'
128
34

$ # only valid octal numbers
$ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xE '[0-7]+'
34

$ # only valid hexadecimal numbers
$ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xiE '[0-9a-f]+'
128
34
fe32

$ # numbers between 10-29
$ echo '23 54 12 92' | grep -owE '[12][0-9]'
23
12
```
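
Because range behavior depends on the locale, one option for ASCII input is to set `LC_ALL=C` for the command, so that `[a-z]` covers only the 26 lowercase ASCII letters. A minimal sketch with made-up input:

```bash
# uppercase F and B are not part of [a-z] in the C locale
lower=$(echo 'Foo Bar baz' | LC_ALL=C grep -oE '[a-z]+')
printf '%s\n' "$lower"
```

This extracts `oo`, `ar` and `baz`.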

* Negating character class
* By using `^` as first character inside `[]`, we get inverted character class
    * As pointed out earlier, some meta characters behave differently inside and outside of `[]`

```bash
$ # alphabetic words not starting with c
$ echo '123 core not sink code finish' | grep -owE '[^c][a-z]+'
not
sink
finish

$ # excluding numbers 2,3,4,9
$ # note that 200a 200; etc will also match, usage depends on knowing input
$ echo '2001 2004 2005 2008 2009' | grep -ow '200[^2-49]'
2001
2005
2008

$ # get characters from start of line up to (not including) known identifier
$ echo 'foo=bar; baz=123' | grep -oE '^[^=]+'
foo

$ # get characters at end of line from(not including) known identifier
$ echo 'foo=bar; baz=123' | grep -oE '[^=]+$'
123

$ # get all sequence of characters surrounded by unique identifier
$ echo 'I like "mango" and "guava"' | grep -oE '"[^"]+"'
"mango"
"guava"
```

* Matching meta characters inside `[]`
* Most meta characters like `( ) . + { } | $` don't have special meaning inside `[]` and hence do not require special treatment
* Some combinations like `[.` or `=]` cannot be used in this order, as they have special meaning within `[]`
    * See **Character Classes and Bracket Expressions** section in `info grep` for more detail

```bash
$ # to match - it should be first or last character within []
$ echo 'Foo-bar 123-456 42 Co-operate' | grep -oiwE '[a-z-]+'
Foo-bar
Co-operate

$ # to match ] it should be first character within []
$ printf 'int a[5]\nfoo=bar\n' | grep '[]=]'
int a[5]
foo=bar

$ # to match [ use [ anywhere in the character list
$ # [][] will match both [ and ]
$ printf 'int a[5]\nfoo=bar\n' | grep '[[]'
int a[5]

$ # to match ^ it should be other than first in the list
$ echo '(a+b)^2 = a^2 + b^2 + 2ab' | grep -owE '[a-z^0-9]{3,}'
a^2
b^2
2ab
```

* Named character classes
* Equivalent class shown is for C locale and ASCII character encoding
    * See [ascii codes table](https://ascii.cl/) for reference
* See **Character Classes and Bracket Expressions** section in `info grep` for more detail

| Character classes | Description |
| ------------- | ----------- |
| `[:digit:]` | Same as `[0-9]` |
| `[:lower:]` | Same as `[a-z]` |
| `[:upper:]` | Same as `[A-Z]` |
| `[:alpha:]` | Same as `[a-zA-Z]` |
| `[:alnum:]` | Same as `[0-9a-zA-Z]` |
| `[:xdigit:]` | Same as `[0-9a-fA-F]` |
| `[:cntrl:]` | Control characters - first 32 ASCII characters and 127th (DEL) |
| `[:punct:]` | All the punctuation characters |
| `[:graph:]` | `[:alnum:]` and `[:punct:]` |
| `[:print:]` | `[:alnum:]`, `[:punct:]` and space |
| `[:blank:]` | Space and tab characters |
| `[:space:]` | white-space characters: tab, newline, vertical tab, form feed, carriage return and space |

```bash
$ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:alnum:]]*'
128
34
AB32
Foo
bar

$ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:lower:]]*'
bar

$ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:lower:]0-9]*'
128
34
bar
```
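
Note that a named class such as `[:punct:]` is valid only inside an outer `[]`, which is why the examples above use `[[:alnum:]]`. Several named classes can be combined in one `[]` and negated as well. A small sketch with made-up input:

```bash
# sequences of characters that are neither punctuation nor white-space
tokens=$(echo 'hi, there! ok' | grep -oE '[^[:punct:][:space:]]+')
printf '%s\n' "$tokens"
```

This extracts `hi`, `there` and `ok`.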

* backslash character classes

| Character classes | Description |
| ------------- | ----------- |
| `\w` | Same as `[0-9a-zA-Z_]` or `[[:alnum:]_]` |
| `\W` | Same as `[^0-9a-zA-Z_]` or `[^[:alnum:]_]` |
| `\s` | Same as `[[:space:]]` |
| `\S` | Same as `[^[:space:]]` |

```bash
$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '\w*'
123
cmp_str
Foo_bar
$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '[[:alnum:]_]*'
123
cmp_str
Foo_bar

$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '\W*'
$#
$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '[^[:alnum:]_]*'
$#
```

<br>

#### <a name="grouping"></a>Grouping

* Character classes allow matching one character out of a list of characters, with a quantifier added if needed
* One of the uses of grouping is analogous: it allows choosing between whole regular expressions, instead of just a list of characters
* The meta characters `()` are used for grouping
    * requires `\(\)` for BRE
* Similar to `a(b+c)d = abd+acd` in maths, you get `a(b|c)d = abd|acd` in regular expressions

```bash
$ # 5 letter words starting with c and ending with ty or ly
$ grep -xE 'c..(ty|ly)' /usr/share/dict/words
catty
coyly
curly

$ # 7 letter words starting with e and ending with rged or sted
$ grep -xE 'e..(rg|st)ed' /usr/share/dict/words
emerged
existed

$ # repeat a pattern 3 times
$ grep -xE '([a-d][r-z]){3}' /usr/share/dict/words
avatar
awards
cravat

$ # nesting of () is allowed
$ grep -E '([as](p|c)[r-t]){2}' /usr/share/dict/words
scraps

$ # can be used to match specific columns in well defined tables
$ echo 'foo:123:bar:baz' | grep -E '^([^:]+:){2}bar'
foo:123:bar:baz
```
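
The examples above use ERE. As a sketch of the BRE form with GNU grep, both the grouping and the alternation meta characters need to be escaped. The word list is supplied via `printf` here instead of the dictionary file.

```bash
words='catty
coyly
curly
cabby'

# BRE: \( \) for grouping and \| for alternation
bre=$(printf '%s\n' "$words" | grep -x 'c..\(ty\|ly\)')
# ERE: no escaping needed
ere=$(printf '%s\n' "$words" | grep -xE 'c..(ty|ly)')

printf '%s\n' "$bre"
[ "$bre" = "$ere" ] && echo 'BRE and ERE results are same'
```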

* See also [stackoverflow - matching character exactly n times in a line](https://stackoverflow.com/questions/40187643/grep-search-with-regex)

<br>

#### <a name="back-reference"></a>Back reference

* The string matched within `()` can be matched again later in the same regular expression by back referencing the captured group
* `\1` denotes the first matched group, `\2` the second one and so on
    * Order is leftmost `(` is `\1`, next one is `\2` and so on
* Note that the matched string, not the regular expression itself is referenced
    * for ex: if `([0-9][a-f])` matches `3b`, then back referencing will be `3b` not any other valid match of the regular expression like `8f`, `0a` etc
    * Other regular expression flavors like PCRE do allow referencing the regular expression itself

```bash
$ # note how first three and last three letters are same
$ grep -xE '([a-d]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
$ # note how adding quantifier is not same as back-referencing
$ grep -m4 -xE '([a-d]..){2}' /usr/share/dict/words
abacus
abided
abides
ablaze

$ # words with consecutive repeated letters
$ echo 'eel flee all pat ilk seen' | grep -iowE '[a-z]*(.)\1[a-z]*'
eel
flee
all
seen

$ # 17 letter words with first and last as same letter
$ grep -xE '(.)[a-z]{15}\1' /usr/share/dict/words
semiprofessionals
transcendentalist
```

* Spotting repeated words

```bash
$ cat story.txt
singing tin in the rain
walking for for a cause
have a nice day
day and night

$ grep -wE '(\w+)\W+\1' story.txt
walking for for a cause
```

* **Note** that there is an [issue for certain usage of back-reference and quantifier](https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864)

```bash
$ # no output
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
$ # works when nesting is unrolled
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ # no problem if PCRE is used instead of ERE
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
```

<br>

## <a name="multiline-matching"></a>Multiline matching

* If input is small enough to meet memory requirements, the `-z` option comes in handy to match across multiple lines
* Instead of newline being line separator, the ASCII NUL character is used
    * So, multiline matching depends on whether or not input file itself contains the NUL character
    * Usually text files won't have occasion to use the NUL character and presence of it marks it as binary file for `grep`

```bash
$ # \0 for ASCII NUL character
$ printf 'red\nblue\n\0green\n' | cat -e
red$
blue$
^@green$

$ # see --binary-files=TYPE option in info grep for binary details
$ printf 'red\nblue\n\0green\n' | grep -a 'red'
red

$ # with -z, \0 marks the different 'lines'
$ printf 'red\nblue\n\0green\n' | grep -z 'red'
red
blue

$ # if no \0 in input, entire input read as single string
$ printf 'red\nblue\ngreen\n' | grep -z 'red'
red
blue
green
```

* `\n` is not defined in BRE/ERE
    * see [unix.stackexchange - How to specify characters using hexadecimal codes](https://unix.stackexchange.com/questions/19491/how-to-specify-characters-using-hexadecimal-codes-in-grep) for a workaround
* if some characteristics of input are known, `[[:space:]]` can be used as a workaround, which matches all white-space characters

```bash
$ grep -oz 'Roses.*blue,[[:space:]]' poem.txt
Roses are red,
Violets are blue,
```
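
With `-z`, the output records are terminated with the NUL character too. When `-o` is used to extract several multiline matches, one possible approach is to convert the NUL characters to newlines with `tr`, as in this sketch:

```bash
# extract two adjacent lines as a single match
pair=$(printf 'red\nblue\ngreen\n' | grep -oz 'blue[[:space:]]green' | tr '\0' '\n')
printf '%s\n' "$pair"
```

The output is `blue` and `green` on separate lines, without any NUL character left in it.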

<br>

## <a name="perl-compatible-regular-expressions"></a>Perl Compatible Regular Expressions

```bash
$ # see also: https://github.com/learnbyexample/command_help
$ man grep | sed -n '/^\s*-P/,/^$/p'
       -P, --perl-regexp
              Interpret the pattern as a  Perl-compatible  regular  expression
              (PCRE).   This  is  highly  experimental and grep -P may warn of
              unimplemented features.

```

* The man page informs that `-P` is *highly experimental*. So far, I haven't faced any issues. But do keep this in mind.
    * newer versions of `GNU grep` have fixes for some `-P` bugs, see [release notes](https://savannah.gnu.org/news/?group_id=67) for an overview of changes between versions
* Only a few highlights are presented here
* For more info
    * `man pcrepattern` or [read it online](https://www.pcre.org/original/doc/html/pcrepattern.html)
    * [perldoc - re](https://perldoc.perl.org/perlre.html) - Perl regular expression syntax, also links to other related tutorials
    * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)
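
Since `-P` is not available on every installation, a script may want to check for it before relying on PCRE features. The following is only a sketch of one way to do that; the variable name `pcre_status` is made up.

```bash
# try a tiny PCRE-only pattern and hide any error message
if echo 'a1' | grep -qP '\d' 2>/dev/null; then
    pcre_status='available'
else
    pcre_status='not available'
fi
echo "PCRE support: $pcre_status"
```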

<br>

#### <a name="backslash-sequences"></a>Backslash sequences

Some of the backslash constructs available in PCRE, in addition to the ones already seen in ERE

* `\d` for `[0-9]`
* `\s` for `[ \t\r\n\f\v]`
* `\h` for `[ \t]`
* `\n` for newline character
* `\D`, `\S`, `\H`, `\N` etc for their opposites

```bash
$ # example for [0-9] in ERE and \d in PCRE
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oE '[0-9]+'
5
3
83
120
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '\d+'
5
3
83
120

$ # (?s) allows newlines to be also matched when using . meta character
$ grep -ozP '(?s)Roses.*blue,\n' poem.txt
Roses are red,
Violets are blue,
```

* See **INTERNAL OPTION SETTING** in `man pcrepattern` for more info on `(?s)`, `(?m)` etc
* [Specifying Modes Inside The Regular Expression](https://www.regular-expressions.info/modifiers.html) also has some detail on such options

<br>

#### <a name="non-greedy-matching"></a>Non-greedy matching

* Both BRE/ERE support only greedy matching quantifiers
    * match as much as possible
* PCRE supports non-greedy version by adding `?` after quantifiers
    * match as little as possible
* See [this Python notebook](https://nbviewer.jupyter.org/url/norvig.com/ipython/pal3.ipynb) for an interesting project on palindrome sentences

```bash
$ echo 'foo and bar and baz went shopping bytes' | grep -oi '\w.*and'
foo and bar and

$ echo 'foo and bar and baz went shopping bytes' | grep -oiP '\w.*?and'
foo and
bar and

$ # recall that matching overall expression gets preference
$ echo 'foo and bar and baz went shopping bytes' | grep -oi '\w.*and baz'
foo and bar and baz
$ echo 'foo and bar and baz went shopping bytes' | grep -oiP '\w.*?and baz'
foo and bar and baz

$ # minimal matching with single character has simple workaround
$ echo 'A man, a plan, a canal, Panama' | grep -oi 'a.*,'
A man, a plan, a canal,
$ echo 'A man, a plan, a canal, Panama' | grep -oi 'a[^,]*,'
A man,
a plan,
a canal,
```

<br>

#### <a name="lookarounds"></a>Lookarounds

* Ability to add conditions to match before/after required pattern
* There are four types
    * positive lookahead `(?=`
    * negative lookahead `(?!`
    * positive lookbehind `(?<=`
    * negative lookbehind `(?<!`
* One way to remember is that **behind** uses `<` and **negative** uses `!` instead of `=`
* When used with `-o` option, the lookaround portion won't be part of the output

Fixed and variable length *lookbehind*

```bash
$ # extract digits preceded by single lowercase letter and =
$ # this is fixed length lookbehind because length is known
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '(?<=\b[a-z]=)\d+'
83
120

$ # error because {2,} induces variable length matching
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '(?<=\b[a-z]{2,}=)\d+'
grep: lookbehind assertion is not fixed length

$ # use \K for such cases
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '\b[a-z]{2,}=\K\d+'
5
3
```

* Examples for lookarounds

```bash
$ # extract digits that follow =
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '=\K\d+'
5
3
83
120

$ # digits that follow = and are followed by ,
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '=\K\d+(?=,)'
5
83

$ # extract words, but not those at start of line
$ echo 'car bat cod map' | grep -owP '(?<!^)\w+'
bat
cod
map

$ # extract words, but not those at start of line or end of line
$ echo 'car bat cod map' | grep -owP '(?<!^)\w+(?!$)'
bat
cod

$ # matching multiple search patterns in any order
$ grep -P '(?=.*are)(?=.*s).*d' poem.txt
Roses are red,
And so are you.
```

<br>

#### <a name="ignoring-specific-matches"></a>Ignoring specific matches

* A useful construct is `(*SKIP)(*F)` which allows to discard matches not needed
* A simple way to use it: write the regular expression to be discarded first, append `(*SKIP)(*F)` and then add the required regular expression after `|`
* See [Excluding Unwanted Matches](https://www.rexegg.com/backtracking-control-verbs.html#skipfail) for more info

```bash
$ # all words except bat and map
$ echo 'car bat cod map' | grep -oP '(bat|map)(*SKIP)(*F)|\w+'
car
cod

$ # all words except those surrounded by double quotes
$ echo 'I like "mango" and "guava"' | grep -oP '"[^"]+"(*SKIP)(*F)|\w+'
I
like
and
```

<br>

#### <a name="re-using-regular-expression-pattern"></a>Re-using regular expression pattern

* `\1`, `\2` etc match only the exact string captured by the group
* `(?1)`, `(?2)` etc re-uses the regular expression itself

```bash
$ # (?1) refers to first group \d{4}-\d{2}-\d{2}
$ echo '2008-03-24 and 2012-08-12 foo' | grep -oP '(\d{4}-\d{2}-\d{2})\D+(?1)'
2008-03-24 and 2012-08-12
```
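
The difference between the two is easier to see side by side. This sketch assumes `grep` was built with `-P` support and uses a made-up input string:

```bash
# \1 needs the same digits again after the dot
with_backref=$(echo '12.12 12.34' | grep -oP '(\d+)\.\1')
# (?1) re-runs the \d+ pattern, so any digits are accepted after the dot
with_subroutine=$(echo '12.12 12.34' | grep -oP '(\d+)\.(?1)')

printf 'backref:\n%s\n' "$with_backref"
printf 'subroutine:\n%s\n' "$with_subroutine"
```

Only `12.12` is extracted with `\1`, while `(?1)` extracts both `12.12` and `12.34`.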

<br>

## <a name="gotchas-and-tips"></a>Gotchas and Tips

* Always quote the search string (unless you know what you are doing :P)

```bash
$ # spaces are special
$ grep so are poem.txt
grep: are: No such file or directory
poem.txt:And so are you.
$ grep 'so are' poem.txt
And so are you.

$ # use of # indicates start of comment
$ printf 'foo\na#2\nb#3\n' | grep #2
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
$ printf 'foo\na#2\nb#3\n' | grep '#2'
a#2
```

* Another common problem is that an unquoted search string is subject to the shell's own globbing rules

```bash
$ # sample output on bash shell, might vary for different shells
$ echo '*.txt' | grep -F *.txt
$ echo '*.txt' | grep -F '*.txt'
*.txt
```

* Use double quotes for variable expansion, command substitution, etc (Note: could vary based on shell used)
* See [mywiki.wooledge Quotes](https://mywiki.wooledge.org/Quotes) for detailed discussion of quoting in `bash` shell

```bash
$ # sample output on bash shell, might vary for different shells
$ color='blue'
$ grep "$color" poem.txt
Violets are blue,
```
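
When the search term comes from a variable, its content is still interpreted as a regular expression unless `-F` is used. A small sketch with made-up values, also using `--` in case the value starts with `-`:

```bash
term='a.b'
input='a.b
axb'

# . in the variable's value acts as a meta character
as_regex=$(printf '%s\n' "$input" | grep -- "$term")
# -F matches the value literally
as_literal=$(printf '%s\n' "$input" | grep -F -- "$term")

printf 'regex:\n%s\n' "$as_regex"
printf 'literal:\n%s\n' "$as_literal"
```

The first command selects both lines, the second one only `a.b`.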

* Pattern starting with `-`

```bash
$ # this issue is not specific to grep alone
$ # the command assumes -2 is an option and hence the error
$ echo '5*3-2=13' | grep '-2'
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.

$ # workaround by using \-
$ echo '5*3-2=13' | grep '\-2'
5*3-2=13

$ # or use -- to indicate no further options to process
$ echo '5*3-2=13' | grep -- '-2'
5*3-2=13

$ # same issue with printf
$ printf '-1+2=1\n'
bash: printf: -1: invalid option
printf: usage: printf [-v var] format [arguments]
$ printf -- '-1+2=1\n'
-1+2=1
```
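
For `grep` specifically, the `-e` option is one more way to handle this, since the argument following `-e` is always taken as the search pattern. A minimal sketch:

```bash
# -e marks the next argument as the pattern, even if it starts with -
res=$(echo '5*3-2=13' | grep -e '-2')
printf '%s\n' "$res"
```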

* Tip: Options can be specified at end of command as well, useful if an option was forgotten and has to be quickly added to the previous command from history

```bash
$ grep 'are' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ # use previous command from history, for ex up arrow key in bash
$ # then simply add the option at end
$ grep 'are' poem.txt -n
1:Roses are red,
2:Violets are blue,
4:And so are you.
```

* Speed boost if input file is ASCII: set `LC_ALL=C` so that `grep` works in the C locale
* See also [unix.stackexchange - Counting the number of lines having a number > 100](https://unix.stackexchange.com/questions/312297/counting-the-number-of-lines-having-a-number-greater-than-100/312330#312330) - where `grep` is blazing fast compared to other solutions

```bash
$ time grep -xE '([a-d][r-z]){3}' /usr/share/dict/words
avatar
awards
cravat

real    0m0.145s

$ time LC_ALL=C grep -xE '([a-d][r-z]){3}' /usr/share/dict/words
avatar
awards
cravat

real    0m0.011s
```

* Speed boost by using PCRE for back-references
* might be faster when using quantifiers as well

```bash
$ time LC_ALL=C grep -xE '([a-z]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
murmur
muumuu
pawpaw
pompom
tartar
testes

real    0m0.174s
$ time grep -xP '([a-z]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
murmur
muumuu
pawpaw
pompom
tartar
testes

real    0m0.008s
```

<br>

## <a name="regular-expressions-reference-ere"></a>Regular Expressions Reference (ERE)

<br>

#### <a name="anchors"></a>Anchors

* `^` match from start of line
* `$` match end of line
* `\<` match beginning of word
* `\>` match end of word
* `\b` match edge of word
* `\B` match other than edge of word
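
The anchors above can be tried out with a small sketch like the one below. The word list and the `words` helper are made up for illustration, and GNU `grep` is assumed.

```bash
# made-up word list for illustration
words() { printf 'par\nspar\napparent\npart\n'; }

words | grep '^par'       # lines starting with 'par'
# par
# part

words | grep 'par$'       # lines ending with 'par'
# par
# spar

words | grep '\<par\>'    # 'par' as a whole word
# par

words | grep '\Bpar\B'    # 'par' with word characters on both sides
# apparent
```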

<br>

#### <a name="character-quantifiers"></a>Character Quantifiers

* `.` match any single character
* `*` match preceding character/group 0 or more times
* `+` match preceding character/group 1 or more times
* `?` match preceding character/group 0 or 1 times
* `{m,n}` match preceding character/group m to n times, including m and n
* `{m,}` match preceding character/group m or more times
* `{,n}` match preceding character/group 0 to n times
* `{n}` match preceding character/group exactly n times
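
A quick sketch of the quantifiers in action, using `-E` and a made-up input where only the number of `b` characters varies (the `input` helper is hypothetical):

```bash
# made-up input: 'a', then zero to three 'b', then 'c'
input() { printf 'ac\nabc\nabbc\nabbbc\n'; }

input | grep -E 'ab?c'       # 'b' zero or one time
# ac
# abc

input | grep -E 'ab+c'       # 'b' one or more times
# abc
# abbc
# abbbc

input | grep -E 'ab{1,2}c'   # 'b' one to two times
# abc
# abbc

input | grep -E 'ab{2,}c'    # 'b' two or more times
# abbc
# abbbc
```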

<br>

#### <a name="character-classes-and-backslash-sequences"></a>Character classes and backslash sequences

* `[aeiou]` match any of these characters
* `[^aeiou]` match any character other than these characters
* `[a-z]` match any lowercase letter
* `[0-9]` match any digit character
* `\w` match letters, digits and underscore character, shortcut for `[a-zA-Z0-9_]`
* `\W` opposite of `\w`, shortcut for `[^a-zA-Z0-9_]`
* `\s` match white-space characters: tab, newline, vertical tab, form feed, carriage return, and space
* `\S` match other than white-space characters
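
As a hedged illustration, the sketch below extracts pieces of a made-up string with `-o`, which prints only the matching portions, one per line. GNU `grep` is assumed for `\w`.

```bash
# made-up input string for illustration
s='foo_bar=42; x=3.14'

echo "$s" | grep -oE '\w+'            # runs of word characters
# foo_bar
# 42
# x
# 3
# 14

echo "$s" | grep -oE '[0-9]+'         # runs of digits
# 42
# 3
# 14

echo "$s" | grep -oE '[^a-z_=; ]+'    # runs of anything except the listed characters
# 42
# 3.14
```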

<br>

#### <a name="pattern-groups"></a>Pattern groups

* `|` matches either of the given patterns
* `()` patterns within `()` are grouped and treated as one pattern, useful in conjunction with `|`
* `\1` backreference to first grouped pattern within `()`
* `\2` backreference to second grouped pattern within `()` and so on
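
A small sketch combining grouping, alternation and a backreference. The word list is made up for illustration; the second pattern is the one used in the speed comparison earlier in this chapter.

```bash
# made-up word list for illustration
words() { printf 'scare\nspore\nspare\nmurmur\n'; }

# grouping with alternation: 's', then 'po' or 'ca', then 're'
words | grep -E 's(po|ca)re'
# scare
# spore

# backreference: whole line is a three character sequence repeated twice
words | grep -xE '([a-z]..)\1'
# murmur
```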

<br>

#### <a name="basic-vs-extended-regular-expressions"></a>Basic vs Extended Regular Expressions

By default, the pattern passed to `grep` is treated as a Basic Regular Expression (BRE). This can be overridden with options such as `-E` for Extended Regular Expressions (ERE) and `-P` for Perl Compatible Regular Expressions (PCRE). Paraphrasing from `info grep`:

>In Basic Regular Expressions the meta-characters `? + { | ( )` lose their special meaning, instead use the backslashed versions `\? \+ \{ \| \( \)`
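
The difference can be seen with a short sketch like the one below, assuming GNU `grep` and a made-up input (the `input` helper is hypothetical):

```bash
# made-up input for illustration
input() { printf 'a+b\naab\ncat\ndog\n'; }

# BRE: + is a literal character
input | grep 'a+b'
# a+b

# ERE: + is a quantifier, one or more 'a' followed by 'b'
input | grep -E 'a+b'
# aab

# BRE: the backslashed version acts as the quantifier
input | grep 'a\+b'
# aab

# alternation in BRE and ERE, both commands print the same two lines
input | grep 'cat\|dog'
input | grep -E 'cat|dog'
# cat
# dog
```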

<br>

## <a name="further-reading"></a>Further Reading

* `man grep` and `info grep`
    * At least go through all options ;)
    * **Usage section** in `info grep` has good examples as well
* This chapter has also been [converted to a book](https://github.com/learnbyexample/learn_gnugrep_ripgrep) with additional examples and exercises; it also covers the popular alternative `ripgrep`
* A bit of history
    * [Brian Kernighan remembers the origins of grep](https://thenewstack.io/brian-kernighan-remembers-the-origins-of-grep/)
    * [how grep command was born](https://medium.com/@rualthanzauva/grep-was-a-private-command-of-mine-for-quite-a-while-before-i-made-it-public-ken-thompson-a40e24a5ef48)
    * [why GNU grep is fast](https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)
    * [unix.stackexchange - Difference between grep, egrep and fgrep](https://unix.stackexchange.com/questions/17949/what-is-the-difference-between-grep-egrep-and-fgrep)
* Q&A on stackoverflow/stackexchange is a good source of learning material, and good for practice exercises as well
    * [grep Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/grep?sort=votes&pageSize=15)
    * [grep Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/grep?sort=votes&pageSize=15)
* Learn Regular Expressions (has information on flavors other than BRE/ERE/PCRE too)
    * [Regular Expressions Tutorial](https://www.regular-expressions.info/tutorial.html)
    * [rexegg](https://www.rexegg.com/) - tutorials, tricks and more
    * [regexcrossword](https://regexcrossword.com/)
    * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)
    * [online regex tester and debugger](https://regex101.com/) - by default `pcre` flavor
* Alternatives
    * [ripgrep](https://github.com/BurntSushi/ripgrep)
    * [pcregrep](https://www.pcre.org/original/doc/html/pcregrep.html)
    * [ag - silver searcher](https://github.com/ggreer/the_silver_searcher)
* [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed)



================================================
FILE: gnu_sed.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnused/. The ebook also has content updated for newer version of the commands, includes exercises, solutions, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnused

---

<br> <br> <br>

# <a name="gnu-sed"></a>GNU sed

**Table of Contents**

* [Simple search and replace](#simple-search-and-replace)
    * [editing stdin](#editing-stdin)
    * [editing file input](#editing-file-input)
* [Inplace file editing](#inplace-file-editing)
    * [With backup](#with-backup)
    * [Without backup](#without-backup)
    * [Multiple files](#multiple-files)
    * [Prefix backup name](#prefix-backup-name)
    * [Place backups in directory](#place-backups-in-directory)
* [Line filtering options](#line-filtering-options)
    * [Print command](#print-command)
    * [Delete command](#delete-command)
    * [Quit commands](#quit-commands)
    * [Negating REGEXP address](#negating-regexp-address)
    * [Combining multiple REGEXP](#combining-multiple-regexp)
    * [Filtering by line number](#filtering-by-line-number)
    * [Print only line number](#print-only-line-number)
    * [Address range](#address-range)
    * [Relative addressing](#relative-addressing)
* [Using different delimiter for REGEXP](#using-different-delimiter-for-regexp)
* [Regular Expressions](#regular-expressions)
    * [Line Anchors](#line-anchors)
    * [Word Anchors](#word-anchors)
    * [Matching the meta characters](#matching-the-meta-characters)
    * [Alternation](#alternation)
    * [The dot meta character](#the-dot-meta-character)
    * [Quantifiers](#quantifiers)
    * [Character classes](#character-classes)
    * [Escape sequences](#escape-sequences)
    * [Grouping](#grouping)
    * [Back reference](#back-reference)
    * [Changing case](#changing-case)
* [Substitute command modifiers](#substitute-command-modifiers)
    * [g modifier](#g-modifier)
    * [Replace specific occurrence](#replace-specific-occurrence)
    * [Ignoring case](#ignoring-case)
    * [p modifier](#p-modifier)
    * [w modifier](#w-modifier)
    * [e modifier](#e-modifier)
    * [m modifier](#m-modifier)
* [Shell substitutions](#shell-substitutions)
    * [Variable substitution](#variable-substitution)
    * [Command substitution](#command-substitution)
* [z and s command line options](#z-and-s-command-line-options)
* [change command](#change-command)
* [insert command](#insert-command)
* [append command](#append-command)
* [adding contents of file](#adding-contents-of-file)
    * [r for entire file](#r-for-entire-file)
    * [R for line by line](#r-for-line-by-line)
* [n and N commands](#n-and-n-commands)
* [Control structures](#control-structures)
    * [if then else](#if-then-else)
    * [replacing in specific column](#replacing-in-specific-column)
    * [overlapping substitutions](#overlapping-substitutions)
* [Lines between two REGEXPs](#lines-between-two-regexps)
    * [Include or Exclude matching REGEXPs](#include-or-exclude-matching-regexps)
    * [First or Last block](#first-or-last-block)
    * [Broken blocks](#broken-blocks)
* [sed scripts](#sed-scripts)
* [Gotchas and Tips](#gotchas-and-tips)
* [Further Reading](#further-reading)

<br>

```bash
$ sed --version | head -n1
sed (GNU sed) 4.2.2

$ man sed
SED(1)                           User Commands                          SED(1)

NAME
       sed - stream editor for filtering and transforming text

SYNOPSIS
       sed [OPTION]... {script-only-if-no-other-script} [input-file]...

DESCRIPTION
       Sed  is a stream editor.  A stream editor is used to perform basic text
       transformations on an input stream (a file or input from  a  pipeline).
       While  in  some  ways similar to an editor which permits scripted edits
       (such as ed), sed works by making only one pass over the input(s),  and
       is consequently more efficient.  But it is sed's ability to filter text
       in a pipeline which particularly distinguishes it from other  types  of
       editors.
...
```

**Note:** [Multiline and manipulating pattern space](https://www.gnu.org/software/sed/manual/sed.html#Multiline-techniques) with `h`, `x`, `D`, `G`, `H`, `P`, etc. is not covered in this chapter. Also, the examples and information are based on ASCII encoded text input only

<br>

## <a name="simple-search-and-replace"></a>Simple search and replace

Detailed examples for the **substitute** command are covered in later sections. The syntax is:

```
s/REGEXP/REPLACEMENT/FLAGS
```

The `/` character is idiomatically used as delimiter character. See also [Using different delimiter for REGEXP](#using-different-delimiter-for-regexp)

<br>

#### <a name="editing-stdin"></a>editing stdin

```bash
$ # sample command output to be edited
$ seq 10 | paste -sd,
1,2,3,4,5,6,7,8,9,10

$ # change only first ',' to ' : '
$ seq 10 | paste -sd, | sed 's/,/ : /'
1 : 2,3,4,5,6,7,8,9,10

$ # change all ',' to ' : ' by using 'g' modifier
$ seq 10 | paste -sd, | sed 's/,/ : /g'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
```

**Note:** As a good practice, all examples use single quotes around arguments to prevent shell interpretation. See [Shell substitutions](#shell-substitutions) section on use of double quotes

<br>

#### <a name="editing-file-input"></a>editing file input

* By default newline character is the line separator
* See [Regular Expressions](#regular-expressions) section for qualifying search terms, for example
    * word boundaries to distinguish between 'hi', 'this', 'his', 'history', etc
    * multiple search terms, specific set of character, etc

```bash
$ cat greeting.txt
Hi there
Have a nice day

$ # change first 'e' in each line to 'E'
$ sed 's/e/E/' greeting.txt
Hi thEre
HavE a nice day

$ # change first 'nice day' in each line to 'safe journey'
$ sed 's/nice day/safe journey/' greeting.txt
Hi there
Have a safe journey

$ # change all 'e' to 'E' and save changed text to another file
$ sed 's/e/E/g' greeting.txt > out.txt
$ cat out.txt
Hi thErE
HavE a nicE day
```
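
The point about word boundaries can be sketched as follows, assuming GNU `sed` (which supports the `\b` word boundary) and a made-up input string:

```bash
# without qualification, 'hi' is replaced wherever it occurs
echo 'hi this his history' | sed 's/hi/HI/g'
# HI tHIs HIs HIstory

# \b on both sides restricts the match to 'hi' as a whole word
echo 'hi this his history' | sed 's/\bhi\b/HI/g'
# HI this his history
```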

<br>

## <a name="inplace-file-editing"></a>Inplace file editing

* In previous section, the output from `sed` was displayed on stdout or saved to another file
* To write the changes back to original file, use `-i` option

**Note**:

* Refer to `man sed` for details of how to use the `-i` option. It varies with different `sed` implementations. As mentioned at start of this chapter, `sed (GNU sed) 4.2.2` is being used here
* See also [unix.stackexchange - working with symlinks](https://unix.stackexchange.com/questions/348693/sed-update-etc-grub-conf-in-spite-this-link-file)
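
A minimal sketch of `-i` in its plain form, assuming GNU `sed`. To keep it safe to run, it works on a throwaway file created in a temporary directory; the `tmpdir` variable and the file created inside it are part of this illustration only.

```bash
# create a throwaway file in a temporary directory
tmpdir=$(mktemp -d)
printf 'Hi there\nHave a nice day\n' > "$tmpdir/greeting.txt"

# nothing is printed on stdout, the file itself is modified
sed -i 's/nice day/safe journey/' "$tmpdir/greeting.txt"

cat "$tmpdir/greeting.txt"
# Hi there
# Have a safe journey

# clean up
rm -r "$tmpdir"
```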

<br>

#### <a name="with-backup"></a>With backup

* When an extension is given, a copy of the original input file is preserved, named according to the extension provided

```bash
$ # '.bkp' is extension provide
```
