Repository: learnbyexample/Command-line-text-processing
Branch: master
Commit: ce56c851f078
Files: 80
Total size: 519.3 KB

Directory structure:
gitextract_wr_ra6a8/

├── README.md
├── exercises/
│   ├── GNU_grep/
│   │   ├── .ref_solutions/
│   │   │   ├── ex01_basic_match.txt
│   │   │   ├── ex02_basic_options.txt
│   │   │   ├── ex03_multiple_string_match.txt
│   │   │   ├── ex04_filenames.txt
│   │   │   ├── ex05_word_line_matching.txt
│   │   │   ├── ex06_ABC_context_matching.txt
│   │   │   ├── ex07_recursive_search.txt
│   │   │   ├── ex08_search_pattern_from_file.txt
│   │   │   ├── ex09_regex_anchors.txt
│   │   │   ├── ex10_regex_this_or_that.txt
│   │   │   ├── ex11_regex_quantifiers.txt
│   │   │   ├── ex12_regex_character_class_part1.txt
│   │   │   ├── ex13_regex_character_class_part2.txt
│   │   │   ├── ex14_regex_grouping_and_backreference.txt
│   │   │   ├── ex15_regex_PCRE.txt
│   │   │   └── ex16_misc_and_extras.txt
│   │   ├── ex01_basic_match/
│   │   │   └── sample.txt
│   │   ├── ex01_basic_match.txt
│   │   ├── ex02_basic_options/
│   │   │   └── sample.txt
│   │   ├── ex02_basic_options.txt
│   │   ├── ex03_multiple_string_match/
│   │   │   └── sample.txt
│   │   ├── ex03_multiple_string_match.txt
│   │   ├── ex04_filenames/
│   │   │   ├── greeting.txt
│   │   │   ├── poem.txt
│   │   │   └── sample.txt
│   │   ├── ex04_filenames.txt
│   │   ├── ex05_word_line_matching/
│   │   │   ├── greeting.txt
│   │   │   ├── sample.txt
│   │   │   └── words.txt
│   │   ├── ex05_word_line_matching.txt
│   │   ├── ex06_ABC_context_matching/
│   │   │   └── sample.txt
│   │   ├── ex06_ABC_context_matching.txt
│   │   ├── ex07_recursive_search/
│   │   │   ├── msg/
│   │   │   │   ├── greeting.txt
│   │   │   │   └── sample.txt
│   │   │   ├── poem.txt
│   │   │   ├── progs/
│   │   │   │   ├── hello.py
│   │   │   │   └── hello.sh
│   │   │   └── words.txt
│   │   ├── ex07_recursive_search.txt
│   │   ├── ex08_search_pattern_from_file/
│   │   │   ├── baz.txt
│   │   │   ├── foo.txt
│   │   │   └── words.txt
│   │   ├── ex08_search_pattern_from_file.txt
│   │   ├── ex09_regex_anchors/
│   │   │   └── sample.txt
│   │   ├── ex09_regex_anchors.txt
│   │   ├── ex10_regex_this_or_that/
│   │   │   └── sample.txt
│   │   ├── ex10_regex_this_or_that.txt
│   │   ├── ex11_regex_quantifiers/
│   │   │   └── garbled.txt
│   │   ├── ex11_regex_quantifiers.txt
│   │   ├── ex12_regex_character_class_part1/
│   │   │   └── sample_words.txt
│   │   ├── ex12_regex_character_class_part1.txt
│   │   ├── ex13_regex_character_class_part2/
│   │   │   └── sample.txt
│   │   ├── ex13_regex_character_class_part2.txt
│   │   ├── ex14_regex_grouping_and_backreference/
│   │   │   └── sample.txt
│   │   ├── ex14_regex_grouping_and_backreference.txt
│   │   ├── ex15_regex_PCRE/
│   │   │   └── sample.txt
│   │   ├── ex15_regex_PCRE.txt
│   │   ├── ex16_misc_and_extras/
│   │   │   ├── garbled.txt
│   │   │   ├── poem.txt
│   │   │   └── sample.txt
│   │   ├── ex16_misc_and_extras.txt
│   │   └── solve
│   └── README.md
├── file_attributes.md
├── gnu_awk.md
├── gnu_grep.md
├── gnu_sed.md
├── miscellaneous.md
├── overview_presentation/
│   ├── baz.json
│   ├── foo.xml
│   ├── greeting.txt
│   └── sample.txt
├── perl_the_swiss_knife.md
├── restructure_text.md
├── ruby_one_liners.md
├── sorting_stuff.md
├── tail_less_cat_head.md
├── whats_the_difference.md
└── wheres_my_file.md

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Command Line Text Processing

Learn about various commands available for common and exotic text processing needs. Examples have been tested on GNU/Linux; syntax and features may vary on other distributions, so consult their respective `man` pages for details.

---

:warning: :warning: I'm no longer actively working on this repo. Instead, I've converted existing chapters into ebooks (see [ebook section](#ebooks) below for links), available under the same license. These ebooks are better formatted, updated for newer versions of the software, and include exercises, solutions, etc. Since all the chapters have been converted, I'm archiving this repo.

---

<br>

## Ebooks

Individual online ebooks with better formatting, explanations, exercises, solutions, etc:

* [CLI text processing with GNU grep and ripgrep](https://learnbyexample.github.io/learn_gnugrep_ripgrep/)
* [CLI text processing with GNU sed](https://learnbyexample.github.io/learn_gnused/)
* [CLI text processing with GNU awk](https://learnbyexample.github.io/learn_gnuawk/)
* [Ruby One-Liners Guide](https://learnbyexample.github.io/learn_ruby_oneliners/)
* [Perl One-Liners Guide](https://learnbyexample.github.io/learn_perl_oneliners/)
* [CLI text processing with GNU Coreutils](https://learnbyexample.github.io/cli_text_processing_coreutils/)
* [Linux Command Line Computing](https://learnbyexample.github.io/cli-computing/)

See https://learnbyexample.github.io/books/ for links to PDF/EPUB versions and other ebooks.

<br>

## Chapters

As mentioned earlier, I'm no longer actively working on these chapters:

* [Cat, Less, Tail and Head](./tail_less_cat_head.md)
    * cat, less, tail, head, Text Editors
* [GNU grep](./gnu_grep.md)
* [GNU sed](./gnu_sed.md)
* [GNU awk](./gnu_awk.md)
* [Perl the swiss knife](./perl_the_swiss_knife.md)
* [Ruby one liners](./ruby_one_liners.md)
* [Sorting stuff](./sorting_stuff.md)
    * sort, uniq, comm, shuf
* [Restructure text](./restructure_text.md)
    * paste, column, pr, fold
* [Whats the difference](./whats_the_difference.md)
    * cmp, diff
* [Wheres my file](./wheres_my_file.md)
* [File attributes](./file_attributes.md)
    * wc, du, df, touch, file
* [Miscellaneous](./miscellaneous.md)
    * cut, tr, basename, dirname, xargs, seq

<br>

## Webinar recordings

I recorded a couple of videos based on content in the chapters; not sure if I'll do more:

* [Using the sort command](https://www.youtube.com/watch?v=qLfAwwb5vGs)
* [Using uniq and comm](https://www.youtube.com/watch?v=uAb2kxA2TyQ)

See also my short videos on [Linux command line tips](https://www.youtube.com/watch?v=p0KCLusMd5Q&list=PLTv2U3HnAL4PNTmRqZBSUgKaiHbRL2zeY)

<br>

## Exercises

Check out the [exercises](./exercises) directory to solve practice questions on `grep` right from the command line.

See also my [TUI-apps](https://github.com/learnbyexample/TUI-apps) repo for interactive CLI text processing exercises.
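
As a rough sketch of what a single exercise involves, the first question of `ex01_basic_match` can be tried by hand. This does not use the `solve` helper in `exercises/GNU_grep`, whose interface isn't described here. The snippet recreates `sample.txt` in a scratch directory (the `workdir` variable is an assumption made only so the example runs anywhere); inside the repo you would `cd exercises/GNU_grep/ex01_basic_match` instead.

```shell
# Recreate ex01_basic_match/sample.txt in a scratch directory
# (workdir is an illustration aid, not part of the repo)
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" << 'EOF'
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
EOF

# Question 1: match lines containing the string: day
grep 'day' "$workdir/sample.txt"
# prints:
# Good day
# Today is sunny
```

Note that `Today is sunny` matches as well, since `grep` looks for the string anywhere in the line unless options like `-w` are used.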

<br>

## Contributing

* Please [open an issue](https://github.com/learnbyexample/Command-line-text-processing/issues) for typos or bugs
    * As this repo is no longer actively worked upon, **please do not submit pull requests**
* Share the repo with friends/colleagues, on social media, etc. to help reach other learners
* In case you need to reach me, mail me at `echo 'yrneaolrknzcyr.arg@tznvy.pbz' | tr 'a-z' 'n-za-m'` or send a DM via [twitter](https://twitter.com/learn_byexample)
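
The `tr 'a-z' 'n-za-m'` command in the contact line above is a ROT13 substitution: each lowercase letter is replaced by the one 13 places further along the alphabet, wrapping around at `z`. A minimal sketch with a placeholder string (the `encoded` variable and the sample text are illustrative only):

```shell
# ROT13 with tr: a-m maps onto n-z, and n-z maps onto a-m
encoded=$(echo 'hello world' | tr 'a-z' 'n-za-m')
echo "$encoded"
# prints: uryyb jbeyq

# applying the same mapping a second time restores the original text
echo "$encoded" | tr 'a-z' 'n-za-m'
# prints: hello world
```

Characters outside `a-z`, such as `.` and `@`, pass through unchanged, since they aren't part of the first set given to `tr`.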

<br>

## Acknowledgements

* [unix.stackexchange](https://unix.stackexchange.com/) and [stackoverflow](https://stackoverflow.com/) - for getting answers to pertinent questions as well as sharpening skills by understanding and answering questions
* Forums like [Linux users](https://www.linkedin.com/groups/65688), [/r/commandline/](https://www.reddit.com/r/commandline/), [/r/linux/](https://www.reddit.com/r/linux/), [/r/ruby/](https://www.reddit.com/r/ruby/), [news.ycombinator](https://news.ycombinator.com/news), [devup](http://devup.in/) and others for valuable feedback (especially spotting mistakes) and encouragement
* See [wikipedia entry 'Roses Are Red'](https://en.wikipedia.org/wiki/Roses_Are_Red) for `poem.txt` used as sample text input file

<br>

## License

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/)


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex01_basic_match.txt
================================================
1) Match lines containing the string: day
Solution: grep 'day' sample.txt

2) Match lines containing the string: it
Solution: grep 'it' sample.txt

3) Match lines containing the string: do you
Solution: grep 'do you' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex02_basic_options.txt
================================================
1) Match lines containing the string irrespective of lower/upper case: no
Solution: grep -i 'no' sample.txt

2) Match lines not containing the string: o
Solution: grep -v 'o' sample.txt

3) Match lines with line numbers containing the string: it
Solution: grep -n 'it' sample.txt

4) Output only number of matching lines containing the string: a
Solution: grep -c 'a' sample.txt

5) Match first two lines containing the string: do
Solution: grep -m2 'do' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex03_multiple_string_match.txt
================================================
1) Match lines containing either of these three strings
        String1: Not
        String2: he
        String3: sun
Solution: grep -e 'Not' -e 'he' -e 'sun' sample.txt

2) Match lines containing both these strings
        String1: He
        String2: or
Solution: grep 'He' sample.txt | grep 'or'

3) Match lines containing either of these two strings
        String1: a
        String2: i
   and contains this as well
        String3: do
Solution: grep -e 'a' -e 'i' sample.txt | grep 'do'

4) Match lines containing the string
        String1: it
   but not these strings
        String2: No
        String3: no
Solution: grep 'it' sample.txt | grep -vi 'no'


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex04_filenames.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Show only filenames containing the string: are
Solution: grep -l 'are' *

2) Show only filenames NOT containing the string: two
Solution: grep -L 'two' *

3) Match all lines containing the string: are
Solution: grep 'are' *

4) Match maximum of two matching lines along with filenames containing the character: a
Solution: grep -m2 'a' *

5) Match all lines without prefixing filename containing the string: to
Solution: grep -h 'to' *


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex05_word_line_matching.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Match lines containing whole word: do
Solution: grep -w 'do' *

2) Match whole lines containing the string: Hello World
Solution: grep -x 'Hello World' *

3) Match lines containing these whole words:
        Word1: He
        Word2: far
Solution: grep -w -e 'far' -e 'He' *

4) Match lines containing the whole word: you
    and NOT containing the case insensitive string: How
Solution: grep -w 'you' * | grep -vi 'how'


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex06_ABC_context_matching.txt
================================================
1) Get lines and 3 following it containing the string: you
Solution: grep -A3 'you' sample.txt

2) Get lines and 2 preceding it containing the string: is
Solution: grep -B2 'is' sample.txt

3) Get lines and 1 following/preceding containing the string: Not
Solution: grep -C1 'Not' sample.txt

4) Get lines and 1 following and 4 preceding containing the string: Not
Solution: grep -A1 -B4 'Not' sample.txt

5) Get lines and 1 preceding it containing the string: you
        there should be no separator between the matches
Solution: grep --no-group-separator -B1 'you' sample.txt

6) Get lines and 1 preceding it containing the string: you
        the separator between the matches should be: #####
Solution: grep --group-separator='#####' -B1 'you' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex07_recursive_search.txt
================================================
Note: Every file in this directory and sub-directories is input for grep, unless otherwise specified

1) Match all lines containing the string: you
Solution: grep -r 'you'

2) Show only filenames matching the string: Hello
    filenames should only end with .txt 
Solution: grep -rl --include='*.txt' 'Hello'

3) Show only filenames matching the string: Hello
    filenames should NOT end with .txt 
Solution: grep -rl --exclude='*.txt' 'Hello'

4) Show only filenames matching the string: are
    should not include the directory: progs
Solution: grep -rl --exclude-dir='progs' 'are'

5) Show only filenames matching the string: are
    should NOT include these directories
            dir1: progs
            dir2: msg
Solution: grep -rl --exclude-dir='progs' --exclude-dir='msg' 'are'

6) Show only filenames matching the string: are
    should include files only from sub-directories
    hint: use shell glob pattern to specify directories to search
Solution: grep -rl 'are' */


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex08_search_pattern_from_file.txt
================================================
Note: words.txt has only whole words per line, use it as file input when task is to match whole words

1) Match all strings from file words.txt in file baz.txt
Solution: grep -f words.txt baz.txt 

2) Match all words from file words.txt in file foo.txt
    should only match whole words
    should print only matching words, not entire line
Solution: grep -owf words.txt foo.txt

3) Show common lines between foo.txt and baz.txt
Solution: grep -Fxf foo.txt baz.txt

4) Show lines present in baz.txt but not in foo.txt
Solution: grep -Fxvf foo.txt baz.txt

5) Show lines present in foo.txt but not in baz.txt
Solution: grep -Fxvf baz.txt foo.txt

6) Find all words common between all three files in the directory
    should only match whole words
    should print only matching words, not entire line
Solution: grep -owf words.txt foo.txt | grep -owf- baz.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex09_regex_anchors.txt
================================================
1) Match all lines starting with: no
Solution: grep '^no' sample.txt

2) Match all lines ending with: it
Solution: grep 'it$' sample.txt

3) Match all lines containing whole word: do
Solution: grep -w 'do' sample.txt

4) Match all lines containing words starting with: do
Solution: grep '\<do' sample.txt

5) Match all lines containing words ending with: do
Solution: grep 'do\>' sample.txt

6) Match all lines starting with: ^
Solution: grep '^^' sample.txt

7) Match all lines ending with: $
Solution: grep '$$' sample.txt

8) Match all lines containing the string: in
    not surrounded by word boundaries, for ex: mint but not tin or ink
Solution: grep '\Bin\B' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex10_regex_this_or_that.txt
================================================
1) Match all lines containing any of these strings:
        String1: day
        String2: not
Solution: grep -E 'day|not' sample.txt

2) Match all lines containing any of these whole words:
        String1: he
        String2: in
Solution: grep -wE 'he|in' sample.txt

3) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he
Solution: grep -E 'he|be|to|you' sample.txt

4) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he
    but NOT these strings:
        String1: it
        String2: do
Solution: grep -E 'he|be|to|you' sample.txt | grep -vE 'do|it'

5) Match all lines starting with any of these strings:
        String1: no
        String2: to
Solution: grep -E '^no|^to' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex11_regex_quantifiers.txt
================================================
1) Extract all 3 character strings surrounded by word boundaries
Solution: grep -ow '...' garbled.txt

2) Extract largest string from each line
        starting with character: d
        ending with character  : g
Solution: grep -o 'd.*g' garbled.txt

3) Extract all strings from each line
        starting with character: d
        followed by zero or one: o
        ending with character  : g
Solution: grep -oE 'do?g' garbled.txt

4) Extract all strings from each line
        starting with character: d
        followed by zero or one of any character
        ending with character  : g
Solution: grep -oE 'd.?g' garbled.txt

5) Extract all strings from each line
        starting with character : g
        followed by at least one: o
        ending with character   : d
Solution: grep -oE 'go+d' garbled.txt

6) Extract all strings from each line
        starting with character: g
        followed by exactly six: o
        ending with character  : d
Solution: grep -oE 'go{6}d' garbled.txt

7) Extract all strings from each line
        starting with character         : g
        followed by min two and max four: o
        ending with character           : d
Solution: grep -oE 'go{2,4}d' garbled.txt

8) Extract all strings from each line
        starting with character: d
        followed by max of two : o
        ending with character  : g
Solution: grep -oE 'do{,2}g' garbled.txt

9) Extract all strings from each line
        starting with character : g
        followed by min of three: o
        ending with character   : d
Solution: grep -oE 'go{3,}d' garbled.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex12_regex_character_class_part1.txt
================================================
1) Match all lines containing any of these characters:
        character1: q
        character2: x
        character3: z
Solution: grep '[qzx]' sample_words.txt

2) Match all lines containing any of these characters:
        character1: c
        character2: f
    followed by any character
    followed by   : t
Solution: grep '[cf].t' sample_words.txt

3) Extract all words starting with character: s
    ignore case
    should contain only alphabets
    minimum two letters
    should be surrounded by word boundaries
Solution: grep -iowE 's[a-z]+' sample_words.txt

4) Extract all words made up of these characters:
        character1: a
        character2: c
        character3: e
        character4: r
        character5: s
    ignore case
    should contain only alphabets
    should be surrounded by word boundaries
Solution: grep -iowE '[acers]+' sample_words.txt

5) Extract all numbers surrounded by word boundaries
Solution: grep -ow '[0-9]*' sample_words.txt

6) Extract all numbers surrounded by word boundaries matching the condition
    30 <= number <= 70
Solution: grep -owE '[3-6][0-9]|70' sample_words.txt

7) Extract all words made up of non-vowel characters
    ignore case
    should contain only alphabets and at least two
    should be surrounded by word boundaries
Solution: grep -iowE '[b-df-hj-np-tv-z]{2,}' sample_words.txt

8) Extract all sequence of strings consisting of character: -
    surrounded on either side by zero or more case insensitive alphabets    
Solution: grep -io '[a-z]*-[a-z]*' sample_words.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex13_regex_character_class_part2.txt
================================================
1) Extract all characters before first occurrence of =
Solution: grep -o '^[^=]*' sample.txt

2) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the underscore character
Solution: grep -o '^\w*' sample.txt

3) Match all lines containing the sequence
        String1: there
        any number of whitespace
        String2: have
Solution: grep 'there\s*have' sample.txt

4) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the characters [ and ]
        ending with ]
Solution: grep -oi '^[]a-z0-9[]*]' sample.txt

5) Extract all punctuation characters from first line
Solution: grep -om1 '[[:punct:]]' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex14_regex_grouping_and_backreference.txt
================================================
1) Match lines containing these strings
        String1: scare
        String2: spore
Solution: grep -E 's(po|ca)re' sample.txt

2) Extract these words
        Word1: handy
        Word2: hand
        Word3: hands
        Word4: handful
Solution: grep -oE 'hand([sy]|ful)?' sample.txt

3) Extract all whole words with at least one letter occurring twice in the word
    ignore case
    only alphabets
    the letter occurring twice need not be placed next to each other
Solution: grep -ioE '[a-z]*([a-z])[a-z]*\1[a-z]*' sample.txt

4) Match lines where same sequence of three consecutive alphabets is matched another time in the same line
    ignore case
Solution: grep -iE '([a-z]{3}).*\1' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex15_regex_PCRE.txt
================================================
1) Extract all strings to the right of =
    provided characters from start of line until = do not include [ or ]
Solution: grep -oP '^[^][=]+=\K.*' sample.txt

2) Match all lines containing the string: Hi
    but shouldn't be followed afterwards in the line by: are
Solution: grep -P 'Hi(?!.*are)' sample.txt

3) Extract from start of line up to the string: Hi
    provided it is followed afterwards in the line by: you
Solution: grep -oP '.*Hi(?=.*you)' sample.txt

4) Extract all sequence of characters surrounded on both sides by space character
    the space character should not be part of output
Solution: grep -oP ' \K[^ ]+(?= )' sample.txt

5) Extract all words
    made of upper or lower case alphabets
    at least two letters in length
    surrounded by word boundaries
    should not contain consecutive repeated alphabets
Solution: grep -iowP '[a-z]*([a-z])\1[a-z]*(*SKIP)(*F)|[a-z]{2,}' sample.txt


================================================
FILE: exercises/GNU_grep/.ref_solutions/ex16_misc_and_extras.txt
================================================
Note: all files in directory are input to grep, unless otherwise specified

1) Extract all negative numbers
    starts with - followed by one or more digits
    do not output filenames
Solution: grep -hoE -- '-[0-9]+' *

2) Display only filenames containing these two strings anywhere in the file
        String1: day
        String2: and
Solution: grep -zlE 'day.*and|and.*day' *

3) The below command
        grep -c '^Solution:' ../.ref_solutions/*
    will give number of questions in each exercise. Change it, using another command and pipe if needed, so that only overall total is printed
Solution: cat ../.ref_solutions/* | grep -c '^Solution:'


================================================
FILE: exercises/GNU_grep/ex01_basic_match/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex01_basic_match.txt
================================================
1) Match lines containing the string: day


2) Match lines containing the string: it


3) Match lines containing the string: do you


================================================
FILE: exercises/GNU_grep/ex02_basic_options/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex02_basic_options.txt
================================================
1) Match lines containing the string irrespective of lower/upper case: no


2) Match lines not containing the string: o


3) Match lines with line numbers containing the string: it


4) Output only number of matching lines containing the string: a


5) Match first two lines containing the string: do


================================================
FILE: exercises/GNU_grep/ex03_multiple_string_match/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex03_multiple_string_match.txt
================================================
1) Match lines containing either of these three strings
        String1: Not
        String2: he
        String3: sun


2) Match lines containing both these strings
        String1: He
        String2: or


3) Match lines containing either of these two strings
        String1: a
        String2: i
   and contains this as well
        String3: do


4) Match lines containing the string
        String1: it
   but not these strings
        String2: No
        String3: no


================================================
FILE: exercises/GNU_grep/ex04_filenames/greeting.txt
================================================
Hi, how are you?

Hola :)

Hello world

Good day

Rock on


================================================
FILE: exercises/GNU_grep/ex04_filenames/poem.txt
================================================
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.


================================================
FILE: exercises/GNU_grep/ex04_filenames/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex04_filenames.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Show only filenames containing the string: are


2) Show only filenames NOT containing the string: two


3) Match all lines containing the string: are


4) Match maximum of two matching lines along with filenames containing the character: a


5) Match all lines without prefixing filename containing the string: to


================================================
FILE: exercises/GNU_grep/ex05_word_line_matching/greeting.txt
================================================
Hi, how are you?

Hola :)

Hello World

Good day

Rock on


================================================
FILE: exercises/GNU_grep/ex05_word_line_matching/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex05_word_line_matching/words.txt
================================================
afar
far
carfare
farce
faraway
airfare


================================================
FILE: exercises/GNU_grep/ex05_word_line_matching.txt
================================================
Note: All files present in the directory should be given as file inputs to grep

1) Match lines containing whole word: do


2) Match whole lines containing the string: Hello World


3) Match lines containing these whole words:
        Word1: He
        Word2: far


4) Match lines containing the whole word: you
    and NOT containing the case insensitive string: How


================================================
FILE: exercises/GNU_grep/ex06_ABC_context_matching/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex06_ABC_context_matching.txt
================================================
1) Get lines and 3 following it containing the string: you


2) Get lines and 2 preceding it containing the string: is


3) Get lines and 1 following/preceding containing the string: Not


4) Get lines and 1 following and 4 preceding containing the string: Not


5) Get lines and 1 preceding it containing the string: you
        there should be no separator between the matches


6) Get lines and 1 preceding it containing the string: you
        the separator between the matches should be: #####


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/msg/greeting.txt
================================================
Hi, how are you?

Hola :)

Hello World

Good day

Rock on


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/msg/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/poem.txt
================================================
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/progs/hello.py
================================================
#!/usr/bin/python3

print("Hello World")


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/progs/hello.sh
================================================
#!/bin/bash

echo "Hello $USER"
echo "Today is $(date -u +%A)"
echo 'Hope you are having a nice day'


================================================
FILE: exercises/GNU_grep/ex07_recursive_search/words.txt
================================================
afar
far
carfare
farce
faraway
airfare


================================================
FILE: exercises/GNU_grep/ex07_recursive_search.txt
================================================
Note: Every file in this directory and sub-directories is input for grep, unless otherwise specified

1) Match all lines containing the string: you


2) Show only filenames matching the string: Hello
    filenames should only end with .txt 


3) Show only filenames matching the string: Hello
    filenames should NOT end with .txt 


4) Show only filenames matching the string: are
    should not include the directory: progs


5) Show only filenames matching the string: are
    should NOT include these directories
            dir1: progs
            dir2: msg


6) Show only filenames matching the string: are
    should include files only from sub-directories
    hint: use shell glob pattern to specify directories to search


================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file/baz.txt
================================================
I saw a few red cars going that way
To the end!
Are you coming today to the party?
a[5] = 'good';
Have you read the Harry Potter series?


================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file/foo.txt
================================================
part
a[5] = 'good';
I saw a few red cars going that way
Believe it!
to do list


================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file/words.txt
================================================
car
part
to
read


================================================
FILE: exercises/GNU_grep/ex08_search_pattern_from_file.txt
================================================
Note: words.txt has only whole words per line, use it as file input when task is to match whole words

1) Match all strings from file words.txt in file baz.txt


2) Match all words from file words.txt in file foo.txt
    should only match whole words
    should print only matching words, not entire line


3) Show common lines between foo.txt and baz.txt


4) Show lines present in baz.txt but not in foo.txt


5) Show lines present in foo.txt but not in baz.txt


6) Find all words common between all three files in the directory
    should only match whole words
    should print only matching words, not entire line


================================================
FILE: exercises/GNU_grep/ex09_regex_anchors/sample.txt
================================================
hello world!

good day
how do you do?

just do it
believe it!

today is sunny
not a bit funny
no doubt you like it too

much ado about nothing
he he he

^ could be exponentiation or xor operator
scalar variables in perl start with $


================================================
FILE: exercises/GNU_grep/ex09_regex_anchors.txt
================================================
1) Match all lines starting with: no


2) Match all lines ending with: it


3) Match all lines containing whole word: do


4) Match all lines containing words starting with: do


5) Match all lines containing words ending with: do


6) Match all lines starting with: ^


7) Match all lines ending with: $


8) Match all lines containing the string: in
    not surrounded by word boundaries, for ex: mint but not tin or ink


================================================
FILE: exercises/GNU_grep/ex10_regex_this_or_that/sample.txt
================================================
hello world!

good day
how do you do?

just do it
believe it!

today is sunny
not a bit funny
no doubt you like it too

much ado about nothing
he he he

^ could be exponentiation or xor operator
scalar variables in perl start with $


================================================
FILE: exercises/GNU_grep/ex10_regex_this_or_that.txt
================================================
1) Match all lines containing any of these strings:
        String1: day
        String2: not


2) Match all lines containing any of these whole words:
        String1: he
        String2: in


3) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he


4) Match all lines containing any of these strings:
        String1: you
        String2: be
        String3: to
        String4: he
    but NOT these strings:
        String1: it
        String2: do


5) Match all lines starting with any of these strings:
        String1: no
        String2: to


================================================
FILE: exercises/GNU_grep/ex11_regex_quantifiers/garbled.txt
================================================
gd
god
goood
oh gold
goooooodyyyy
dog
dg
dig good gold
doogoodog
c@t made forty justify
dodging a toy


================================================
FILE: exercises/GNU_grep/ex11_regex_quantifiers.txt
================================================
1) Extract all 3 character strings surrounded by word boundaries


2) Extract largest string from each line
        starting with character: d
        ending with character  : g


3) Extract all strings from each line
        starting with character: d
        followed by zero or one: o
        ending with character  : g


4) Extract all strings from each line
        starting with character: d
        followed by zero or one of any character
        ending with character  : g


5) Extract all strings from each line
        starting with character : g
        followed by at least one: o
        ending with character   : d


6) Extract all strings from each line
        starting with character: g
        followed by exactly six: o
        ending with character  : d


7) Extract all strings from each line
        starting with character         : g
        followed by min two and max four: o
        ending with character           : d


8) Extract all strings from each line
        starting with character: d
        followed by max of two : o
        ending with character  : g


9) Extract all strings from each line
        starting with character : g
        followed by min of three: o
        ending with character   : d


================================================
FILE: exercises/GNU_grep/ex12_regex_character_class_part1/sample_words.txt
================================================
far 30 scarce f@$t 42 fit
Cute 34 quite pry far-fetched Sure
70 cast-away 12 good hue he
cry just Nymph race Peace. 67
foo;bar;baz;p@t
ARE 72 cut copy paste
p1ate rest 512 Sync


================================================
FILE: exercises/GNU_grep/ex12_regex_character_class_part1.txt
================================================
1) Match all lines containing any of these characters:
        character1: q
        character2: x
        character3: z


2) Match all lines containing any of these characters:
        character1: c
        character2: f
    followed by any character
    followed by   : t


3) Extract all words starting with character: s
    ignore case
    should contain only alphabets
    minimum two letters
    should be surrounded by word boundaries


4) Extract all words made up of these characters:
        character1: a
        character2: c
        character3: e
        character4: r
        character5: s
    ignore case
    should contain only alphabets
    should be surrounded by word boundaries


5) Extract all numbers surrounded by word boundaries


6) Extract all numbers surrounded by word boundaries matching the condition
    30 <= number <= 70


7) Extract all words made up of non-vowel characters
    ignore case
    should contain only alphabets and at least two
    should be surrounded by word boundaries


8) Extract all sequences of strings consisting of character: -
    surrounded on either side by zero or more case insensitive alphabets


================================================
FILE: exercises/GNU_grep/ex13_regex_character_class_part2/sample.txt
================================================
a[2]='sample string'
foo_bar=4232
appx_pi=3.14
greeting="Hi  there		have a nice   day"
food[4]="dosa"
b[0][1]=42


================================================
FILE: exercises/GNU_grep/ex13_regex_character_class_part2.txt
================================================
1) Extract all characters before first occurrence of =


2) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the underscore character


3) Match all lines containing the sequence
        String1: there
        any number of whitespace
        String2: have


4) Extract all characters from start of line made up of these characters
        upper or lower case alphabets
        all digits
        the characters [ and ]
        ending with ]


5) Extract all punctuation characters from first line


================================================
FILE: exercises/GNU_grep/ex14_regex_grouping_and_backreference/sample.txt
================================================
hands hand library scare handy handful
scared too big time eel candy
spare food regulate circuit spore stare
tire tempt cold malady


================================================
FILE: exercises/GNU_grep/ex14_regex_grouping_and_backreference.txt
================================================
1) Match lines containing these strings
        String1: scare
        String2: spore


2) Extract these words
        Word1: handy
        Word2: hand
        Word3: hands
        Word4: handful


3) Extract all whole words with at least one letter occurring twice in the word
    ignore case
    only alphabets
    the letter occurring twice need not be placed next to each other


4) Match lines where same sequence of three consecutive alphabets is matched another time in the same line
    ignore case


================================================
FILE: exercises/GNU_grep/ex15_regex_PCRE/sample.txt
================================================
a[2]='Hi, how are you?'
foo_bar=4232
appx_pi=3.14
greeting="Hi there have a nice day"
food[4]="dosa"
b[0][1]=42


================================================
FILE: exercises/GNU_grep/ex15_regex_PCRE.txt
================================================
1) Extract all strings to the right of =
    provided characters from start of line until = do not include [ or ]


2) Match all lines containing the string: Hi
    but shouldn't be followed afterwards in the line by: are


3) Extract from start of line up to the string: Hi
    provided it is followed afterwards in the line by: you


4) Extract all sequences of characters surrounded on both sides by space character
    the space character should not be part of output


5) Extract all words
    made of upper or lower case alphabets
    at least two letters in length
    surrounded by word boundaries
    should not contain consecutive repeated alphabets


================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras/garbled.txt
================================================
day and night
-43 and 99 and 12


================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras/poem.txt
================================================
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

Good day to you :)


================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras/sample.txt
================================================
account balance: -2300
good day
foo and bar and baz


================================================
FILE: exercises/GNU_grep/ex16_misc_and_extras.txt
================================================
Note: all files in directory are input to grep, unless otherwise specified

1) Extract all negative numbers
    starts with - followed by one or more digits
    do not output filenames


2) Display only filenames containing these two strings anywhere in the file
        String1: day
        String2: and


3) The below command
        grep -c '^Solution:' ../.ref_solutions/*
    will give number of questions in each exercise. Change it, using another command and pipe if needed, so that only overall total is printed


================================================
FILE: exercises/GNU_grep/solve
================================================
dir_name=$(basename "$PWD")
ref_file="../.ref_solutions/$dir_name.txt"
sol_file="../$dir_name.txt"
tmp_file='../.tmp.txt'

# color output
tcolors=$(tput colors)
if [[ -n $tcolors && $tcolors -ge 8 ]]; then
    red=$(tput setaf 1)
    green=$(tput setaf 2)
    blue=$(tput setaf 4)
    clr_color=$(tput sgr0)
else
    red=''
    green=''
    blue=''
    clr_color=''
fi

sub_sol=0
if [[ $1 == -s ]]; then
    prev_cmd=$(fc -ln -2 | sed 's/^[ \t]*//;q')
    sub_sol=1
elif [[ $1 == -q ]]; then
    # highlight the question to be solved next
    # or show only the (unanswered)? question to be solved next
    cat "$sol_file"
    return
elif [[ -n $1 ]]; then
    echo -e 'Unknown option...Exiting script'
    return
fi

count=0
sol_count=0
err_count=0
while IFS= read -u3 -r ref_line && read -u4 -r sol_line; do
    if [[ "${ref_line:0:9}" == Solution: ]]; then
        (( count++ ))

        if [[ $sub_sol == 1 && -z $sol_line ]]; then
            sol_line="$prev_cmd"
            sub_sol=0
        fi

        if [[ "$(eval "command ${ref_line:10}")" == "$(eval "command $sol_line")" ]]; then
            (( sol_count++ ))
            # use color if terminal supports
            echo '---------------------------------------------'
            echo "Match for question $count:"
            echo "${red}Submitted solution:${clr_color} $sol_line"
            echo "${green}Reference solution:${clr_color} ${ref_line:10}"
            echo '---------------------------------------------'
        else
            (( err_count++ ))
            if [[ $err_count == 1 && -n $sol_line ]]; then
                echo '---------------------------------------------'
                echo "Mismatch for question $count:"
                echo "$(tput bold)${red}Expected output is:${clr_color}$(tput rmso)"
                eval "command ${ref_line:10}"
                echo '---------------------------------------------'
            fi
            sol_line=''
        fi
    fi

    echo "$sol_line" >> "$tmp_file"

done 3<"$ref_file" 4<"$sol_file"

((count==sol_count)) && printf "\t\t$(tput bold)${blue}All Pass${clr_color}$(tput rmso)\t\t\n"

mv "$tmp_file" "$sol_file"

# vim: syntax=bash


================================================
FILE: exercises/README.md
================================================
# <a name="exercises"></a>Exercises

The instructions and shell script here assume the `bash` shell. Tested on *GNU bash, version 4.3.46*

<br>

* For example, the first exercise for **GNU_grep**
    * directory: `ex01_basic_match`
    * question file: `ex01_basic_match.txt`
    * solution reference: `.ref_solutions/ex01_basic_match.txt`
* Each exercise contains one or more questions to be solved
* The script `solve` will assist in checking solutions

```bash
$ git clone https://github.com/learnbyexample/Command-line-text-processing.git
$ cd Command-line-text-processing/exercises/GNU_grep/
$ ls
ex01_basic_match      ex02_basic_options      ex03_multiple_string_match      solve
ex01_basic_match.txt  ex02_basic_options.txt  ex03_multiple_string_match.txt

$ find -name 'ex01*'
./.ref_solutions/ex01_basic_match.txt
./ex01_basic_match
./ex01_basic_match.txt
```

<br>

* Solving the questions
    * Go to the exercise folder
    * Use `ls` to see input file(s)
    * To see the problems for that exercise, follow the steps below

```bash
$ cd ex01_basic_match
$ ls
sample.txt

$ # to see the questions
$ source ../solve -q
1) Match lines containing the string: day


2) Match lines containing the string: it


3) Match lines containing the string: do you


$ # or open the questions file with your fav editor
$ gvim ../$(basename "$PWD").txt
$ # create an alias to use from any ex* directory
$ alias oq='gvim ../$(basename "$PWD").txt'
$ oq
```

<br>

* Submitting solutions one by one
    * immediately after executing a command that answers a question, call the `solve` script

```bash
$ grep 'day' sample.txt 
Good day
Today is sunny
$ source ../solve -s
---------------------------------------------
Match for question 1:
Submitted solution: grep 'day' sample.txt 
Reference solution: grep 'day' sample.txt
---------------------------------------------
```

<br>

* Submit all at once
    * by editing the `../$(basename "$PWD").txt` file directly
    * the answer should replace the empty line immediately following the question
* **Note**
    * there are different ways to solve the same question
    * but for a specific exercise like **GNU_grep**, try to solve using `grep` only
    * also, remember that `eval` is used to run both the submitted and the reference command and compare their output, so submit only commands you are sure are safe to execute

```bash
$ cat ../$(basename "$PWD").txt
1) Match lines containing the string: day
grep 'day' sample.txt

2) Match lines containing the string: it
sed -n '/it/p' sample.txt

3) Match lines containing the string: do you
echo 'How do you do?'

$ source ../solve
---------------------------------------------
Match for question 1:
Submitted solution: grep 'day' sample.txt
Reference solution: grep 'day' sample.txt
---------------------------------------------
---------------------------------------------
Match for question 2:
Submitted solution: sed -n '/it/p' sample.txt
Reference solution: grep 'it' sample.txt
---------------------------------------------
---------------------------------------------
Match for question 3:
Submitted solution: echo 'How do you do?'
Reference solution: grep 'do you' sample.txt
---------------------------------------------
		All Pass		
```
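
The check behind these matches is an output comparison: the submitted command and the reference command are both run through `eval`, and their outputs are compared as strings. A minimal sketch of that idea, assuming a hypothetical helper named `same_output` and a throwaway input file (neither is part of the repository):

```bash
# hypothetical helper, not part of the solve script:
# succeeds when both command strings print identical output
same_output() {
    [ "$(eval "command $1")" = "$(eval "command $2")" ]
}

# throwaway input file for the demonstration
sample=$(mktemp)
printf 'Good day\nJust do it\nToday is sunny\n' > "$sample"

# different commands with the same output count as a match
if same_output "grep 'day' $sample" "sed -n '/day/p' $sample"; then
    verdict='match'
else
    verdict='mismatch'
fi
echo "$verdict"
```

Since the command strings are handed to `eval`, whatever is written on the solution line gets executed as-is.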

<br>

* Then move on to the next exercise directory
* Create aliases for the different commands for easy use, after first checking that the alias names are not already in use

```bash
$ type cs cq ca nq pq
bash: type: cs: not found
bash: type: cq: not found
bash: type: ca: not found
bash: type: nq: not found
bash: type: pq: not found

$ alias cs='source ../solve -s'
$ alias cq='source ../solve -q'
$ alias ca='source ../solve'
$ # to go to directory of next question
$ # 10# forces base 10 so that 08 and 09 are not rejected as invalid octal
$ nq() { d=$(basename "$PWD"); nd=$(printf "../ex%02d*/" $((10#${d:2:2}+1))); cd $nd ; }
$ # to go to directory of previous question
$ pq() { d=$(basename "$PWD"); pd=$(printf "../ex%02d*/" $((10#${d:2:2}-1))); cd $pd ; }
```
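
Exercise directories are numbered with a leading zero, and bash arithmetic reads a number such as `08` as invalid octal. A small sketch of computing the neighbouring directory prefix with an explicit base, assuming the `exNN_name` naming used above (the variable names are only illustrative):

```bash
# directory name as it would come from: basename "$PWD"
d='ex08_search_pattern_from_file'

# characters 3-4 hold the two-digit exercise number
num=${d:2:2}

# 10# forces base 10, so 08 and 09 are not rejected as octal
next=$(printf 'ex%02d' $((10#$num + 1)))
prev=$(printf 'ex%02d' $((10#$num - 1)))

echo "$prev $next"
```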

<br>

If a wrong solution is submitted, the expected output is shown. This also helps to better understand the question, as I found it difficult to convey the intent of a question clearly with words alone.

```bash
$ source ../solve -q
1) Match lines containing the string: day


2) Match lines containing the string: it


3) Match lines containing the string: do you

$ grep 'do' sample.txt 
How do you do?
Just do it
No doubt you like it too
Much ado about nothing
$ source ../solve -s
---------------------------------------------
Mismatch for question 1:
Expected output is:
Good day
Today is sunny
---------------------------------------------
```


================================================
FILE: file_attributes.md
================================================
# <a name="file-attributes"></a>File attributes

**Table of Contents**

* [wc](#wc)
    * [Various counts](#various-counts)
    * [subtle differences](#subtle-differences)
    * [Further reading for wc](#further-reading-for-wc)
* [du](#du)
    * [Default size](#default-size)
    * [Various size formats](#various-size-formats)
    * [Dereferencing links](#dereferencing-links)
    * [Filtering options](#filtering-options)
    * [Further reading for du](#further-reading-for-du)
* [df](#df)
    * [Examples](#examples)
    * [Further reading for df](#further-reading-for-df)
* [touch](#touch)
    * [Creating empty file](#creating-empty-file)
    * [Updating timestamps](#updating-timestamps)
    * [Preserving timestamp](#preserving-timestamp)
    * [Further reading for touch](#further-reading-for-touch)
* [file](#file)
    * [File type examples](#file-type-examples)
    * [Further reading for file](#further-reading-for-file)

<br>

## <a name="wc"></a>wc

```bash
$ wc --version | head -n1
wc (GNU coreutils) 8.25

$ man wc
WC(1)                            User Commands                           WC(1)

NAME
       wc - print newline, word, and byte counts for each file

SYNOPSIS
       wc [OPTION]... [FILE]...
       wc [OPTION]... --files0-from=F

DESCRIPTION
       Print newline, word, and byte counts for each FILE, and a total line if
       more than one FILE is specified.  A word is a non-zero-length  sequence
       of characters delimited by white space.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="various-counts"></a>Various counts

```bash
$ cat sample.txt
Hello World
Good day
No doubt you like it too
Much ado about nothing
He he he

$ # by default, gives newline/word/byte count (in that order)
$ wc sample.txt
 5 17 78 sample.txt

$ # options to get individual numbers
$ wc -l sample.txt
5 sample.txt
$ wc -w sample.txt
17 sample.txt
$ wc -c sample.txt
78 sample.txt

$ # use shell input redirection if filename is not needed
$ wc -l < sample.txt
5
```

* with multiple input files, a total line is automatically displayed at the end

```bash
$ cat greeting.txt
Hello there
Have a safe journey
$ cat fruits.txt
Fruit   Price
apple   42
banana  31
fig     90
guava   6

$ wc *.txt
  5  10  57 fruits.txt
  2   6  32 greeting.txt
  5  17  78 sample.txt
 12  33 167 total
```
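
When only the combined count is wanted, one option is to concatenate the files and let `wc` read standard input, which also drops the filenames. A sketch using throwaway files in a temporary directory (the file names and contents are made up for illustration):

```bash
tmp_dir=$(mktemp -d)
printf 'Hello there\nHave a safe journey\n' > "$tmp_dir/greeting.txt"
printf 'Hello World\nGood day\nHe he he\n' > "$tmp_dir/sample.txt"

# per-file counts plus a final 'total' line
wc -l "$tmp_dir"/*.txt

# single number for all files combined: 2 + 3 lines
total=$(cat "$tmp_dir"/*.txt | wc -l)
echo "$total"
```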

* use `-L` to get length of longest line

```bash
$ wc -L < sample.txt
24

$ echo 'foo bar baz' | wc -L
11
$ echo 'hi there!' | wc -L
9

$ # last line will show max value, not sum of all input
$ wc -L *.txt
 13 fruits.txt
 19 greeting.txt
 24 sample.txt
 24 total
```

<br>

#### <a name="subtle-differences"></a>subtle differences

* byte count vs character count

```bash
$ # when input is ASCII
$ printf 'hi there' | wc -c
8
$ printf 'hi there' | wc -m
8

$ # when input has multi-byte characters
$ printf 'hi👍' | od -x
0000000 6968 9ff0 8d91
0000006

$ printf 'hi👍' | wc -m
3

$ printf 'hi👍' | wc -c
6
```
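
The character count from `-m` depends on the current locale; the result of 3 above assumes a UTF-8 locale. As a sketch, forcing the `C` locale makes GNU `wc` treat every byte as one character:

```bash
# the same six bytes as 'hi👍', written with octal escapes
count_c=$(printf 'hi\360\237\221\215' | LC_ALL=C wc -m)
count_bytes=$(printf 'hi\360\237\221\215' | wc -c)

# in the C locale the character count equals the byte count
echo "$count_c $count_bytes"
```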

* the `-l` option counts only newline characters, so a final line without a trailing newline is not counted

```bash
$ printf 'hi there\ngood day' | wc -l
1
$ printf 'hi there\ngood day\n' | wc -l
2
$ printf 'hi there\n\n\nfoo\n' | wc -l
4
```
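
If the last line may lack a trailing newline and should still be counted, one alternative is `grep -c ''`, which counts lines rather than newline characters. A brief sketch of the difference:

```bash
# the final line 'good day' has no trailing newline
newlines=$(printf 'hi there\ngood day' | wc -l)
lines=$(printf 'hi there\ngood day' | grep -c '')

# wc counts one newline, grep counts two lines
echo "$newlines $lines"
```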

* From `man wc` "A word is a non-zero-length sequence of characters delimited by white space"

```bash
$ echo 'foo        bar ;-*' | wc -w
3

$ # use other text processing as needed
$ echo 'foo        bar ;-*' | grep -iowE '[a-z]+'
foo
bar
$ echo 'foo        bar ;-*' | grep -iowE '[a-z]+' | wc -l
2
```

* `-L` does not count non-printable characters, and each tab is counted as the equivalent number of spaces

```bash
$ printf 'food\tgood' | wc -L
12
$ printf 'food\tgood' | wc -m
9
$ printf 'food\tgood' | awk '{print length()}'
9

$ printf 'foo\0bar\0baz' | wc -L
9
$ printf 'foo\0bar\0baz' | wc -m
11
$ printf 'foo\0bar\0baz' | awk '{print length()}'
11
```
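
The tab result can be reproduced by converting tabs to spaces first with `expand`, which uses tab stops every 8 columns by default. A short sketch:

```bash
# 'food' fills columns 1-4, the tab pads up to column 8, 'good' takes 9-12
width=$(printf 'food\tgood' | expand | wc -m)
longest=$(printf 'food\tgood' | wc -L)

echo "$width $longest"
```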

<br>

#### <a name="further-reading-for-wc"></a>Further reading for wc

* `man wc` and `info wc` for more options and detailed documentation
* [wc Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/wc?sort=votes&pageSize=15)
* [wc Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/wc?sort=votes&pageSize=15)

<br>

## <a name="du"></a>du

```bash
$ du --version | head -n1
du (GNU coreutils) 8.25

$ man du
DU(1)                            User Commands                           DU(1)

NAME
       du - estimate file space usage

SYNOPSIS
       du [OPTION]... [FILE]...
       du [OPTION]... --files0-from=F

DESCRIPTION
       Summarize disk usage of the set of FILEs, recursively for directories.
...
```

<br>

#### <a name="default-size"></a>Default size

* By default, size is reported in units of **1024 bytes**
* Individual files are not listed; only directories and sub-directories are reported recursively, with file sizes included in their totals

```bash
$ ls -F
projs/  py_learn@  words.txt

$ du
17920   ./projs/full_addr
14316   ./projs/half_addr
32952   ./projs
33880   .
```

* use `-a` to recursively show both files and directories
* use `-s` to show only the total size for each argument, without listing its sub-directories

```bash
$ du -a
712     ./projs/report.log
17916   ./projs/full_addr/faddr.v
17920   ./projs/full_addr
14312   ./projs/half_addr/haddr.v
14316   ./projs/half_addr
32952   ./projs
0       ./py_learn
924     ./words.txt
33880   .

$ du -s
33880   .

$ du -s projs words.txt
32952   projs
924     words.txt
```

* use `-S` to show directory size without taking into account size of its sub-directories

```bash
$ du -S
17920   ./projs/full_addr
14316   ./projs/half_addr
716     ./projs
928     .
```

<br>

#### <a name="various-size-formats"></a>Various size formats

```bash
$ # number of bytes
$ stat -c %s words.txt
938848
$ du -b words.txt
938848  words.txt

$ # kilobytes = 1024 bytes
$ du -sk projs
32952   projs
$ # megabytes = 1024 kilobytes
$ du -sm projs
33      projs

$ # -B to specify custom byte scale size
$ du -sB 5000 projs
6749    projs
$ du -sB 1048576 projs
33      projs
```
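
The `-m` and `-B` figures are the size divided by the block size, rounded up. A worked sketch using the 32952 KiB reported for `projs` above (the rounding direction is inferred from the sample output, which matches the default behaviour of GNU `du`):

```bash
kib=32952                      # from: du -sk projs
bytes=$((kib * 1024))          # 33742848

# integer ceiling division: (n + d - 1) / d
blocks_5000=$(( (bytes + 5000 - 1) / 5000 ))
blocks_mib=$(( (bytes + 1048576 - 1) / 1048576 ))

echo "$blocks_5000 $blocks_mib"
```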

* human readable and si units

```bash
$ # in terms of powers of 1024
$ # M = 1048576 bytes and so on
$ du -sh projs/* words.txt
18M     projs/full_addr
14M     projs/half_addr
712K    projs/report.log
924K    words.txt

$ # in terms of powers of 1000
$ # M = 1000000 bytes and so on
$ du -s --si projs/* words.txt
19M     projs/full_addr
15M     projs/half_addr
730k    projs/report.log
947k    words.txt
```

* sorting

```bash
$ du -sh projs/* words.txt | sort -h
712K    projs/report.log
924K    words.txt
14M     projs/half_addr
18M     projs/full_addr

$ du -sk projs/* | sort -nr
17920   projs/full_addr
14316   projs/half_addr
712     projs/report.log
```

* to get size based on the number of bytes in the file (apparent size) rather than the disk space allotted

```bash
$ du -b words.txt
938848  words.txt

$ du -h words.txt
924K    words.txt

$ # 938848/1024 = 916.84
$ du --apparent-size -h words.txt
917K    words.txt
```
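
`-b` is shorthand for `--apparent-size --block-size=1`, so for a regular file it reports the byte count rather than the allocated blocks. A sketch with a throwaway file of known size:

```bash
tmp_file=$(mktemp)
# write exactly 5000 bytes
head -c 5000 /dev/zero > "$tmp_file"

apparent=$(du -b "$tmp_file" | cut -f1)
byte_count=$(wc -c < "$tmp_file")

# both report the same number, whatever the filesystem block size is
echo "$apparent $byte_count"
rm "$tmp_file"
```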

<br>

#### <a name="dereferencing-links"></a>Dereferencing links

* See `man` and `info` pages for other related options

```bash
$ # -D to dereference command line argument
$ du py_learn
0       py_learn
$ du -shD py_learn
503M    py_learn

$ # -L to dereference links found by du
$ du -sh
34M     .
$ du -shL
536M    .
```

<br>

#### <a name="filtering-options"></a>Filtering options

* `-d` to specify maximum depth

```bash
$ du -ah projs
712K    projs/report.log
18M     projs/full_addr/faddr.v
18M     projs/full_addr
14M     projs/half_addr/haddr.v
14M     projs/half_addr
33M     projs

$ du -ah -d1 projs
712K    projs/report.log
18M     projs/full_addr
14M     projs/half_addr
33M     projs
```

* `-c` to also show total size at end

```bash
$ du -cshD projs py_learn
33M     projs
503M    py_learn
535M    total
```

* `-t` to provide a threshold comparison

```bash
$ # >= 15M
$ du -Sh -t 15M
18M     ./projs/full_addr

$ # <= 1M
$ du -ah -t -1M
712K    ./projs/report.log
0       ./py_learn
924K    ./words.txt
```

* excluding files/directories based on **glob** pattern
* see also `--exclude-from=FILE` and `--files0-from=FILE` options

```bash
$ # note that excluded files affect directory size reported
$ du -ah --exclude='*addr*' projs
712K    projs/report.log
716K    projs

$ # depending on shell, brace expansion can be used
$ du -ah --exclude='*.'{v,log} projs
4.0K    projs/full_addr
4.0K    projs/half_addr
12K     projs
```

<br>

#### <a name="further-reading-for-du"></a>Further reading for du

* `man du` and `info du` for more options and detailed documentation
* [du Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/disk-usage?sort=votes&pageSize=15)
* [du Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/du?sort=votes&pageSize=15)

<br>

## <a name="df"></a>df

```bash
$ df --version | head -n1
df (GNU coreutils) 8.25

$ man df
DF(1)                            User Commands                           DF(1)

NAME
       df - report file system disk space usage

SYNOPSIS
       df [OPTION]... [FILE]...

DESCRIPTION
       This  manual  page  documents  the  GNU version of df.  df displays the
       amount of disk space available on the file system containing each  file
       name  argument.   If  no file name is given, the space available on all
       currently mounted file systems is shown.
...
```

<br>

#### <a name="examples"></a>Examples

```bash
$ # df without arguments reports all currently mounted file systems
$ # pass a path to report only the file system containing it
$ df .
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda1       98298500 58563816  34734748  63% /

$ # use -B option for custom size
$ # use --si for size in powers of 1000 instead of 1024
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        94G   56G   34G  63% /
```

* Use `--output` to report only specific fields of interest

```bash
$ df -h --output=size,used,file / /media/learnbyexample/projs
 Size  Used File
  94G   56G /
  92G   35G /media/learnbyexample/projs

$ df -h --output=pcent .
Use%
 63%

$ df -h --output=pcent,fstype | awk -F'%' 'NR>2 && $1>=40'
 63% ext3
 40% ext4
 51% ext4
```
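
The `awk` filter works because splitting on `%` leaves the numeric percentage in the first field. A sketch that applies the same idea to canned text instead of live `df` output, so that the result is reproducible (the sample rows are invented for illustration):

```bash
# invented rows in the shape of: df --output=pcent,fstype
sample='Use% Type
  0% devtmpfs
 63% ext3
 12% tmpfs
 40% ext4'

# NR>1 skips the header line; $1 is the text before the % sign
busy=$(printf '%s\n' "$sample" | awk -F'%' 'NR>1 && $1>=40')
printf '%s\n' "$busy"
```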

<br>

#### <a name="further-reading-for-df"></a>Further reading for df

* `man df` and `info df` for more options and detailed documentation
* [df Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/df?sort=votes&pageSize=15)
* [Parsing df command output with awk](https://unix.stackexchange.com/questions/360865/parsing-df-command-output-with-awk)
* [processing df output](https://www.reddit.com/r/bash/comments/68dbml/using_an_array_variable_in_an_awk_command/)

<br>

## <a name="touch"></a>touch

```bash
$ touch --version | head -n1
touch (GNU coreutils) 8.25

$ man touch
TOUCH(1)                         User Commands                        TOUCH(1)

NAME
       touch - change file timestamps

SYNOPSIS
       touch [OPTION]... FILE...

DESCRIPTION
       Update  the  access  and modification times of each FILE to the current
       time.

       A FILE argument that does not exist is created empty, unless -c  or  -h
       is supplied.
...
```

<br>

#### <a name="creating-empty-file"></a>Creating empty file

```bash
$ ls foo.txt
ls: cannot access 'foo.txt': No such file or directory
$ touch foo.txt
$ ls foo.txt
foo.txt

$ # use -c if new file shouldn't be created
$ rm foo.txt
$ touch -c foo.txt
$ ls foo.txt
ls: cannot access 'foo.txt': No such file or directory
```

<br>

#### <a name="updating-timestamps"></a>Updating timestamps

* Updating both access and modification timestamp to current time

```bash
$ # last access time
$ stat -c %x fruits.txt
2017-07-19 17:06:01.523308599 +0530
$ # last modification time
$ stat -c %y fruits.txt
2017-07-13 13:54:03.576055933 +0530

$ touch fruits.txt
$ stat -c %x fruits.txt
2017-07-21 10:11:44.241921229 +0530
$ stat -c %y fruits.txt
2017-07-21 10:11:44.241921229 +0530
```

* Updating only access or modification timestamp

```bash
$ touch -a greeting.txt
$ stat -c %x greeting.txt
2017-07-21 10:14:08.457268564 +0530
$ stat -c %y greeting.txt
2017-07-13 13:54:26.004499660 +0530

$ touch -m sample.txt
$ stat -c %x sample.txt
2017-07-13 13:48:24.945450646 +0530
$ stat -c %y sample.txt
2017-07-21 10:14:40.770006144 +0530
```

* Using timestamp from another file to update

```bash
$ stat -c $'%x\n%y' power.log report.log
2017-07-19 10:48:03.978295434 +0530
2017-07-14 20:50:42.850887578 +0530
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530

$ # copy both access and modification timestamp from power.log to report.log
$ touch -r power.log report.log
$ stat -c $'%x\n%y' report.log
2017-07-19 10:48:03.978295434 +0530
2017-07-14 20:50:42.850887578 +0530

$ # add -a or -m options to limit to only access or modification timestamp
```

* Using date string to update
* See also `-t` option

```bash
$ # add -a or -m as needed
$ touch -d '2010-03-17 17:04:23' report.log
$ stat -c $'%x\n%y' report.log
2010-03-17 17:04:23.000000000 +0530
2010-03-17 17:04:23.000000000 +0530
```
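
The `-t` option takes a compact `[[CC]YY]MMDDhhmm[.ss]` stamp instead of a free-form date string. A sketch showing that both forms set the same modification time, using throwaway files:

```bash
tmp_dir=$(mktemp -d)

# the same instant given as a date string and as a -t stamp
touch -d '2010-03-17 17:04:23' "$tmp_dir/a.log"
touch -t 201003171704.23 "$tmp_dir/b.log"

# %Y prints the modification time as seconds since the epoch
mtime_a=$(stat -c %Y "$tmp_dir/a.log")
mtime_b=$(stat -c %Y "$tmp_dir/b.log")
echo "$mtime_a $mtime_b"
```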

<br>

#### <a name="preserving-timestamp"></a>Preserving timestamp

* In-place text processing, such as `sed -i`, updates the file's timestamps

```bash
$ stat -c $'%x\n%y' power.log
2017-07-21 11:11:42.862874240 +0530
2017-07-13 21:31:53.496323704 +0530

$ sed -i 's/foo/bar/g' power.log
$ stat -c $'%x\n%y' power.log
2017-07-21 11:12:20.303504336 +0530
2017-07-21 11:12:20.303504336 +0530
```

* `touch` can be used to restore timestamps after processing

```bash
$ # first copy the timestamps using touch -r
$ stat -c $'%x\n%y' story.txt
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530
$ # tmp.txt is temporary empty file
$ touch -r story.txt tmp.txt
$ stat -c $'%x\n%y' tmp.txt
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530

$ # after text processing, copy back the timestamps and remove temporary file
$ sed -i 's/cat/dog/g' story.txt
$ touch -r tmp.txt story.txt && rm tmp.txt
$ stat -c $'%x\n%y' story.txt
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530
```
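
The save-and-restore steps can be wrapped in a small function. A sketch, assuming a hypothetical helper named `keep_times` that takes a file followed by the command to run on it:

```bash
# hypothetical helper: run a command, then restore the file's timestamps
keep_times() {
    local file=$1 ref status
    shift
    ref=$(mktemp)               # temporary file that holds the timestamps
    touch -r "$file" "$ref"
    "$@"
    status=$?
    touch -r "$ref" "$file"
    rm -f "$ref"
    return "$status"
}

# throwaway file with a known modification time
tmp_file=$(mktemp)
echo 'cat and mouse' > "$tmp_file"
touch -d '2017-06-24 12:59:53' "$tmp_file"
before=$(stat -c %Y "$tmp_file")

keep_times "$tmp_file" sed -i 's/cat/dog/g' "$tmp_file"
after=$(stat -c %Y "$tmp_file")
content=$(cat "$tmp_file")
echo "$before $after $content"
```

The helper returns the exit status of the wrapped command, so it can still be chained with `&&`.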

<br>

#### <a name="further-reading-for-touch"></a>Further reading for touch

* `man touch` and `info touch` for more options and detailed documentation
* [touch Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/touch?sort=votes&pageSize=15)

<br>

## <a name="file"></a>file

```bash
$ file --version | head -n1
file-5.25

$ man file
FILE(1)                   BSD General Commands Manual                  FILE(1)

NAME
     file — determine file type

SYNOPSIS
     file [-bcEhiklLNnprsvzZ0] [--apple] [--extension] [--mime-encoding]
          [--mime-type] [-e testname] [-F separator] [-f namefile]
          [-m magicfiles] [-P name=value] file ...
     file -C [-m magicfiles]
     file [--help]

DESCRIPTION
     This manual page documents version 5.25 of the file command.

     file tests each argument in an attempt to classify it.  There are three
     sets of tests, performed in this order: filesystem tests, magic tests,
     and language tests.  The first test that succeeds causes the file type to
     be printed.
...
```

<br>


#### <a name="file-type-examples"></a>File type examples

```bash
$ file sample.txt
sample.txt: ASCII text
$ # without file name in output
$ file -b sample.txt
ASCII text

$ printf 'hi👍\n' | file -
/dev/stdin: UTF-8 Unicode text
$ printf 'hi👍\n' | file -i -
/dev/stdin: text/plain; charset=utf-8

$ file ch
ch:  Bourne-Again shell script, ASCII text executable

$ file sunset.jpg moon.png
sunset.jpg: JPEG image data
moon.png: PNG image data, 32 x 32, 8-bit/color RGBA, non-interlaced
```

* different line terminators

```bash
$ printf 'hi' | file -
/dev/stdin: ASCII text, with no line terminators

$ printf 'hi\r' | file -
/dev/stdin: ASCII text, with CR line terminators

$ printf 'hi\r\n' | file -
/dev/stdin: ASCII text, with CRLF line terminators

$ printf 'hi\n' | file -
/dev/stdin: ASCII text
```

* find all files of particular type in current directory, for example `image` files

```bash
$ find -type f -exec bash -c '(file -b "$0" | grep -wq "image data") && echo "$0"' {} \;
./sunset.jpg
./moon.png

$ # if filenames do not contain : or newline characters
$ find -type f -exec file {} + | awk -F: '/\<image data\>/{print $1}'
./sunset.jpg
./moon.png
```

<br>

#### <a name="further-reading-for-file"></a>Further reading for file

* `man file` and `info file` for more options and detailed documentation
* See also `identify` command which `describes the format and characteristics of one or more image files`


================================================
FILE: gnu_awk.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnuawk/. The ebook also has content updated for newer version of the commands, includes a chapter on regular expressions, has exercises, solutions, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnuawk

---

<br> <br> <br>

## <a name="gnu-awk"></a>GNU awk

**Table of Contents**

* [Field processing](#field-processing)
    * [Default field separation](#default-field-separation)
    * [Specifying different input field separator](#specifying-different-input-field-separator)
    * [Specifying different output field separator](#specifying-different-output-field-separator)
* [Filtering](#filtering)
    * [Idiomatic print usage](#idiomatic-print-usage)
    * [Field comparison](#field-comparison)
    * [Regular expressions based filtering](#regular-expressions-based-filtering)
    * [Fixed string matching](#fixed-string-matching)
    * [Line number based filtering](#line-number-based-filtering)
* [Case Insensitive filtering](#case-insensitive-filtering)
* [Changing record separators](#changing-record-separators)
    * [Paragraph mode](#paragraph-mode)
    * [Multicharacter RS](#multicharacter-rs)
* [Substitute functions](#substitute-functions)
* [Inplace file editing](#inplace-file-editing)
* [Using shell variables](#using-shell-variables)
* [Multiple file input](#multiple-file-input)
* [Control Structures](#control-structures)
    * [if-else and loops](#if-else-and-loops)
    * [next and nextfile](#next-and-nextfile)
* [Multiline processing](#multiline-processing)
* [Two file processing](#two-file-processing)
    * [Comparing whole lines](#comparing-whole-lines)
    * [Comparing specific fields](#comparing-specific-fields)
    * [getline](#getline)
* [Creating new fields](#creating-new-fields)
* [Dealing with duplicates](#dealing-with-duplicates)
* [Lines between two REGEXPs](#lines-between-two-regexps)
    * [All unbroken blocks](#all-unbroken-blocks)
    * [Specific blocks](#specific-blocks)
    * [Broken blocks](#broken-blocks)
* [Arrays](#arrays)
* [awk scripts](#awk-scripts)
* [Miscellaneous](#miscellaneous)
    * [FPAT and FIELDWIDTHS](#fpat-and-fieldwidths)
    * [String functions](#string-functions)
    * [Executing external commands](#executing-external-commands)
    * [printf formatting](#printf-formatting)
    * [Redirecting print output](#redirecting-print-output)
* [Gotchas and Tips](#gotchas-and-tips)
* [Further Reading](#further-reading)

<br>

```bash
$ awk --version | head -n1
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

$ man awk
GAWK(1)                        Utility Commands                        GAWK(1)

NAME
       gawk - pattern scanning and processing language

SYNOPSIS
       gawk [ POSIX or GNU style options ] -f program-file [ -- ] file ...
       gawk [ POSIX or GNU style options ] [ -- ] program-text file ...

DESCRIPTION
       Gawk  is  the  GNU Project's implementation of the AWK programming lan‐
       guage.  It conforms to the definition of  the  language  in  the  POSIX
       1003.1  Standard.   This version in turn is based on the description in
       The AWK Programming Language, by Aho, Kernighan, and Weinberger.   Gawk
       provides  the additional features found in the current version of Brian
       Kernighan's awk and a number of GNU-specific extensions.
...
```

**Prerequisites and notes**

* familiarity with programming concepts like variables, printing, control structures, arrays, etc
* familiarity with regular expressions
    * if not, check out **ERE** portion of [GNU sed regular expressions](./gnu_sed.md#regular-expressions) which is close enough to features available in `gawk`
* this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using `grep`, `sed`, etc
* see [Gawk: Effective AWK Programming](https://www.gnu.org/software/gawk/manual/) manual for complete reference; it also has information on other `awk` versions as well as notes on the POSIX standard

<br>

## <a name="field-processing"></a>Field processing

<br>

#### <a name="default-field-separation"></a>Default field separation

* `$0` contains the entire input record
    * default input record separator is newline character
* `$1` contains the first field text
    * default input field separator is one or more contiguous space, tab or newline characters
* `$2` contains the second field text and so on
* the result of an expression can be used after `$`, for example `$(2+3)` evaluates to `$5` and hence gives the fifth field
    * similarly if variable `i` has value `2`, then `$(i+3)` will give fifth field
    * See also [gawk manual - Expressions](https://www.gnu.org/software/gawk/manual/html_node/Expressions.html)
* `NF` is a built-in variable which contains number of fields in the current record
    * so, `$NF` will give last field
    * `$(NF-1)` will give second last field and so on

```bash
$ cat fruits.txt
fruit   qty
apple   42
banana  31
fig     90
guava   6

$ # print only first field
$ awk '{print $1}' fruits.txt
fruit
apple
banana
fig
guava

$ # print only second field
$ awk '{print $2}' fruits.txt
qty
42
31
90
6
```
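
The expression-based and `NF`-based forms from the list above are not exercised by `fruits.txt`, which has only two fields. Here is a small sketch on a made-up five-field record. The sample line and the variable `i` are assumptions for illustration, and `-v` is a command line option for assigning an `awk` variable.

```bash
line='apple 42 red sweet crisp'

# number of fields in the record
nf=$(echo "$line" | awk '{print NF}')
# $(2+3) is evaluated first, so this is the same as $5
fifth=$(echo "$line" | awk '{print $(2+3)}')
# same result with a variable, i is set using -v
fifth_i=$(echo "$line" | awk -v i=2 '{print $(i+3)}')
# second last and last fields
last_two=$(echo "$line" | awk '{print $(NF-1), $NF}')

printf '%s\n' "$nf" "$fifth" "$fifth_i" "$last_two"
```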

<br>

#### <a name="specifying-different-input-field-separator"></a>Specifying different input field separator

* by using `-F` command line option
* by setting `FS` variable
* See [FPAT and FIELDWIDTHS](#fpat-and-fieldwidths) section for other ways of defining input fields

```bash
$ # second field where input field separator is :
$ echo 'foo:123:bar:789' | awk -F: '{print $2}'
123

$ # last field
$ echo 'foo:123:bar:789' | awk -F: '{print $NF}'
789

$ # first and last field
$ # note the use of , and space between output fields
$ echo 'foo:123:bar:789' | awk -F: '{print $1, $NF}'
foo 789

$ # second last field
$ echo 'foo:123:bar:789' | awk -F: '{print $(NF-1)}'
bar

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | awk -F';' '{print $3}'
three
```
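
The list above also mentions setting the `FS` variable. That can be done with the `-v` command line option or inside a `BEGIN` block, whose statements run before any input is read. A minimal sketch showing that all three forms give the same result for this input:

```bash
s='foo:123:bar:789'

via_F=$(echo "$s" | awk -F: '{print $2}')
via_v=$(echo "$s" | awk -v FS=: '{print $2}')
via_begin=$(echo "$s" | awk 'BEGIN{FS=":"} {print $2}')

# all three print 123
printf '%s\n' "$via_F" "$via_v" "$via_begin"
```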

* Regular expressions based input field separator

```bash
$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{print $2}'
string

$ # first field will be empty as there is nothing before '{'
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $1}'

$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $2}'
foo
$ echo '{foo}   bar=baz' | awk -F'[{}= ]+' '{print $3}'
bar
```

* default input field separator is one or more contiguous space, tab or newline characters (termed whitespace from here on)
    * exact same behavior if `FS` is assigned a single space character
* in addition, leading and trailing whitespace is ignored when splitting the input record

```bash
$ printf ' a    ate b\tc   \n'
 a    ate b     c
$ printf ' a    ate b\tc   \n' | awk '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk '{print NF}'
4
$ # same behavior if FS is assigned to single space character
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print $1}'
a
$ printf ' a    ate b\tc   \n' | awk -F' ' '{print NF}'
4

$ # for anything else, leading/trailing whitespaces will be considered
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print $2}'
a
$ printf ' a    ate b\tc   \n' | awk -F'[ \t]+' '{print NF}'
6
```

* assigning empty string to `FS` will split the input record character wise
* note the use of command line option `-v` to set `FS`

```bash
$ echo 'apple' | awk -v FS= '{print $1}'
a
$ echo 'apple' | awk -v FS= '{print $2}'
p
$ echo 'apple' | awk -v FS= '{print $NF}'
e

$ # detecting multibyte characters depends on locale
$ printf 'hi👍 how are you?' | awk -v FS= '{print $3}'
👍
```

**Further Reading**

* [gawk manual - Field Splitting Summary](https://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html#Field-Splitting-Summary)
* [stackoverflow - explanation on default FS](https://stackoverflow.com/questions/30405694/default-field-separator-for-awk)
* [unix.stackexchange - filter lines if it contains a particular character only once](https://unix.stackexchange.com/questions/362550/how-to-remove-line-if-it-contains-a-character-exactly-once)
* [stackoverflow - Processing 2 files with different field separators](https://stackoverflow.com/questions/24516141/awk-processing-2-files-with-different-field-separators)

<br>

#### <a name="specifying-different-output-field-separator"></a>Specifying different output field separator

* by setting `OFS` variable
* also gets added between every argument to `print` statement
    * use [printf](#printf-formatting) to avoid this
* default is single space

```bash
$ # statements inside BEGIN are executed before processing any input text
$ echo 'foo:123:bar:789' | awk 'BEGIN{FS=OFS=":"} {print $1, $NF}'
foo:789
$ # can also be set using command line option -v
$ echo 'foo:123:bar:789' | awk -F: -v OFS=':' '{print $1, $NF}'
foo:789

$ # changing a field will re-build contents of $0
$ echo ' a      ate b   ' | awk '{$2 = "foo"; print $0}' | cat -A
a foo b$

$ # $1=$1 is an idiomatic way to re-build when there is nothing else to change
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{print $0}'
foo:123:bar:789
$ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{$1=$1; print $0}'
foo-123-bar-789

$ # OFS is used to separate different arguments given to print
$ echo 'foo:123:bar:789' | awk -F: -v OFS='\t' '{print $1, $3}'
foo     bar

$ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{$1=$1; print $0}'
Sample string with numbers
```
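
To make the role of the comma explicit, the sketch below contrasts three ways of printing the same two fields. Only comma separated arguments to `print` get `OFS` between them. Expressions placed next to each other are concatenated, and `printf` uses just its format string.

```bash
s='foo:123:bar:789'

# comma separated arguments: OFS is inserted between them
with_comma=$(echo "$s" | awk -F: -v OFS='-' '{print $1, $NF}')
# no comma: strings are concatenated, OFS is not used
concat=$(echo "$s" | awk -F: -v OFS='-' '{print $1 $NF}')
# printf does not add OFS or ORS, so newline has to be given explicitly
formatted=$(echo "$s" | awk -F: -v OFS='-' '{printf "%s=%s\n", $1, $NF}')

printf '%s\n' "$with_comma" "$concat" "$formatted"
```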

<br>

## <a name="filtering"></a>Filtering

<br>

#### <a name="idiomatic-print-usage"></a>Idiomatic print usage

* `print` statement with no arguments will print contents of `$0`
* if a condition is specified without corresponding statements, contents of `$0` are printed if the condition evaluates to true
* `1` is typically used to represent an always true condition and thus print contents of `$0`

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # displaying contents of input file(s) similar to 'cat' command
$ # equivalent to using awk '{print $0}' and awk '1'
$ awk '{print}' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
```

<br>

#### <a name="field-comparison"></a>Field comparison

* Each block of statements within `{}` can be prefixed by an optional condition so that those statements will execute only if condition evaluates to true
* Condition specified without corresponding statements will lead to printing contents of `$0` if condition evaluates to true

```bash
$ # if first field exactly matches the string 'apple'
$ awk '$1=="apple"{print $2}' fruits.txt
42

$ # print first field if second field > 35
$ # NR>1 to avoid the header line
$ # NR built-in variable contains record number
$ awk 'NR>1 && $2>35{print $1}' fruits.txt
apple
fig

$ # print header and lines with qty < 35
$ awk 'NR==1 || $2<35' fruits.txt
fruit   qty
banana  31
guava   6
```

* If the above examples are confusing, think of the compact form as syntactic sugar
* Statements are grouped within `{}`
    * inside `{}`, we have an `if` control structure
    * As in the `C` language, braces are not needed for a single statement within `if`, but `{}` is used here for clarity
    * To get the compact form from this explicit syntax, remove the outer `{}`, the `if` keyword and the `()` around the condition
* As we'll see later, this allows a few lines of program to be written compactly on the command line itself
    * Of course, for medium to large programs, it is better to put the code in a separate file. See [awk scripts](#awk-scripts) section

```bash
$ # awk '$1=="apple"{print $2}' fruits.txt
$ awk '{
         if($1 == "apple"){
            print $2
         }
       }' fruits.txt
42

$ # awk 'NR==1 || $2<35' fruits.txt
$ awk '{
         if(NR==1 || $2<35){
            print $0
         }
       }' fruits.txt
fruit   qty
banana  31
guava   6
```

**Further Reading**

* [gawk manual - Truth Values and Conditions](https://www.gnu.org/software/gawk/manual/html_node/Truth-Values-and-Conditions.html)
* [gawk manual - Operator Precedence](https://www.gnu.org/software/gawk/manual/html_node/Precedence.html)
* [unix.stackexchange - filtering columns by header name](https://unix.stackexchange.com/questions/359697/print-columns-in-awk-by-header-name)

<br>

#### <a name="regular-expressions-based-filtering"></a>Regular expressions based filtering

* the *REGEXP* is specified within `//` and by default acts upon `$0`
* See also [stackoverflow - lines around matching regexp](https://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern)

```bash
$ # all lines containing the string 'are'
$ # same as: grep 'are' poem.txt
$ awk '/are/' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ # negating REGEXP, same as: grep -v 'are' poem.txt
$ awk '!/are/' poem.txt
Sugar is sweet,

$ # same as: grep 'are' poem.txt | grep -v 'so'
$ awk '/are/ && !/so/' poem.txt
Roses are red,
Violets are blue,

$ # lines starting with 'a' or 'b'
$ awk '/^[ab]/' fruits.txt
apple   42
banana  31

$ # print last field of all lines containing 'are'
$ awk '/are/{print $NF}' poem.txt
red,
blue,
you.
```

* strings can be used as well, which will be interpreted as *REGEXP* where one is expected, for example on the right hand side of `~` and `!~`
* This allows [using shell variables](#using-shell-variables) instead of hardcoded *REGEXP*
    * that section also notes the difference between using `//` and string

```bash
$ awk '$0 !~ "are"' poem.txt
Sugar is sweet,

$ awk '$0 ~ "^[ab]"' fruits.txt
apple   42
banana  31

$ # also helpful if search strings have the / delimiter character
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
$ awk '/\/foo\/a\//' paths.txt
/foo/a/report.log
$ awk '$0 ~ "/foo/a/"' paths.txt
/foo/a/report.log
```

* *REGEXP* matching against specific field

```bash
$ # if first field contains 'a'
$ awk '$1 ~ /a/' fruits.txt
apple   42
banana  31
guava   6

$ # if first field contains 'a' and qty > 20
$ awk '$1 ~ /a/ && $2 > 20' fruits.txt
apple   42
banana  31

$ # if first field does NOT contain 'a'
$ awk '$1 !~ /a/' fruits.txt
fruit   qty
fig     90
```

<br>

#### <a name="fixed-string-matching"></a>Fixed string matching

* to search a string literally, `index` function can be used instead of *REGEXP*
    * similar to `grep -F`
* the function returns the starting position and `0` if no match found

```bash
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # no output since '+' is meta character, would need '/a\+b/'
$ awk '/a+b/' eqns.txt
$ # same as: grep -F 'a+b' eqns.txt
$ awk 'index($0,"a+b")' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # much easier than '/i\*\(t\+9-g\)/'
$ awk 'index($0,"i*(t+9-g)")' eqns.txt
i*(t+9-g)/8,4-a+b

$ # check only last field
$ awk -F, 'index($NF,"a+b")' eqns.txt
i*(t+9-g)/8,4-a+b
$ # index not needed if entire field/line is being compared
$ awk -F, '$1=="a+b"' eqns.txt
a+b,pi=3.14,5e12
```

* return value is useful to match at specific position
* for ex: at start/end of line

```bash
$ # start of line
$ awk 'index($0,"a+b")==1' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # length function returns number of characters, by default acts on $0
$ awk 'index($0,"a+b")==length()-length("a+b")+1' eqns.txt
i*(t+9-g)/8,4-a+b
$ # to avoid repetitions, save the search string in variable
$ awk -v s="a+b" 'index($0,s)==length()-length(s)+1' eqns.txt
i*(t+9-g)/8,4-a+b
```
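
One caveat with the end of line check above: `index` reports only the first occurrence, so a line that contains the search string both earlier and at the end would be missed. A possible alternative is to compare the tail of the line using `substr`. This is a sketch, and the shell function name `ends_with` and the sample input are made up for illustration.

```bash
# print lines from stdin that end with the literal string given as argument
# note: -v interprets escape sequences, so this assumes no backslashes in $1
ends_with() {
    awk -v s="$1" 'substr($0, length() - length(s) + 1) == s'
}

# first line has 'a+b' at the start as well as at the end
matched=$(printf 'a+b,x,a+b\na+b,pi=3.14\n4-a+b\n' | ends_with 'a+b')
# the index based check gives no output for such a line
missed=$(printf 'a+b,x,a+b\n' | awk -v s='a+b' 'index($0,s)==length()-length(s)+1')

echo "$matched"
```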

<br>

#### <a name="line-number-based-filtering"></a>Line number based filtering

* Built-in variable `NR` contains the total number of records read so far
* Use `FNR` if you need line numbers separately for each file, see [multiple file input](#multiple-file-input)

```bash
$ # same as: head -n2 poem.txt | tail -n1
$ awk 'NR==2' poem.txt
Violets are blue,

$ # print 2nd and 4th line
$ awk 'NR==2 || NR==4' poem.txt
Violets are blue,
And so are you.

$ # same as: tail -n1 poem.txt
$ # statements inside END are executed after processing all input text
$ awk 'END{print}' poem.txt
And so are you.

$ awk 'NR==4{print $2}' fruits.txt
90
```

* for large input, use `exit` to avoid unnecessary record processing

```bash
$ seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

$ # sample time comparison
$ time seq 14323 14563435 | awk 'NR==234{print; exit}'
14556

real    0m0.004s
user    0m0.004s
sys     0m0.000s
$ time seq 14323 14563435 | awk 'NR==234{print}'
14556

real    0m2.167s
user    0m2.280s
sys     0m0.092s
```

* See also [unix.stackexchange - filtering list of lines from every X number of lines](https://unix.stackexchange.com/questions/325985/how-to-print-lines-number-15-and-25-out-of-each-50-lines)
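
The same ideas extend to a range of line numbers. Here is a short sketch, using `seq` for sample input, that prints the 3rd to 5th lines. The second command also exits as soon as the last wanted line has been printed.

```bash
# same as: seq 10 | sed -n '3,5p'
range=$(seq 10 | awk 'NR>=3 && NR<=5')

# with exit, records after the 5th are not processed
range_exit=$(seq 10 | awk 'NR>=3 && NR<=5{print} NR==5{exit}')

printf '%s\n' "$range" "$range_exit"
```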

<br>

## <a name="case-insensitive-filtering"></a>Case Insensitive filtering

```bash
$ # same as: grep -i 'rose' poem.txt
$ awk -v IGNORECASE=1 '/rose/' poem.txt
Roses are red,

$ # for small enough set, can also use REGEXP character class
$ awk '/[rR]ose/' poem.txt
Roses are red,

$ # another way is to use built-in string function 'tolower'
$ awk 'tolower($0) ~ /rose/' poem.txt
Roses are red,
```

<br>

## <a name="changing-record-separators"></a>Changing record separators

* `RS` to change input record separator
* default is newline character

```bash
$ s='this is a sample string'

$ # space as input record separator, printing all records
$ printf "$s" | awk -v RS=' ' '{print NR, $0}'
1 this
2 is
3 a
4 sample
5 string

$ # print all records containing 'a'
$ printf "$s" | awk -v RS=' ' '/a/'
a
sample
```

* `ORS` to change output record separator
* gets added to every `print` statement
    * use [printf](#printf-formatting) to avoid this
* default is newline character

```bash
$ seq 3 | awk '{print $0}'
1
2
3
$ # note that there is empty line after last record
$ seq 3 | awk -v ORS='\n\n' '{print $0}'
1

2

3

$ # dynamically changing ORS
$ # ?: ternary operator to select between two expressions based on a condition
$ # can also use: seq 6 | awk '{ORS = NR%2 ? " " : RS} 1'
$ seq 6 | awk '{ORS = NR%2 ? " " : "\n"} 1'
1 2
3 4
5 6
$ seq 6 | awk '{ORS = NR%3 ? "-" : "\n"} 1'
1-2-3
4-5-6
```

<br>

#### <a name="paragraph-mode"></a>Paragraph mode

* When `RS` is set to empty string, one or more consecutive empty lines are used as input record separator
* Can also use regular expression `RS=\n\n+` but there are subtle differences, see [gawk manual - multiline records](https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html). Important points from that link quoted below

>However, there is an important difference between ‘RS = ""’ and ‘RS = "\n\n+"’. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done

>Now that the input is separated into records, the second step is to separate the fields in the records. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature. When RS is set to the empty string and FS is set to a single character, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS

>When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’

Consider the below sample file

```bash
$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* Filtering paragraphs

```bash
$ # print all paragraphs containing 'it'
$ # if extra newline at end is undesirable, can use
$ # awk -v RS= '/it/{print c++ ? "\n" $0 : $0}' sample.txt
$ awk -v RS= -v ORS='\n\n' '/it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==1' sample.txt
Hello World

$ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==2 && /do/' sample.txt
Just do-it
Believe it

Much ado about nothing
He he he

```

* Re-structuring paragraphs

```bash
$ # default FS is one or more of continuous space, tab or newline characters
$ # default OFS is single space
$ # so, $1=$1 will change it uniformly to single space between fields
$ awk -v RS= '{$1=$1} 1' sample.txt
Hello World
Good day How are you
Just do-it Believe it
Today is sunny Not a bit funny No doubt you like it too
Much ado about nothing He he he

$ # a better usecase
$ awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he

```

**Further Reading**

* [unix.stackexchange - filtering line surrounded by empty lines](https://unix.stackexchange.com/questions/359717/select-line-with-empty-line-above-and-under)
* [stackoverflow - excellent example and explanation of RS and FS](https://stackoverflow.com/questions/46142118/converting-regex-to-sed-or-grep-regex)

<br>

#### <a name="multicharacter-rs"></a>Multicharacter RS

* `RS` can be a string of more than one character, for example a marker like `Error:` or `Warning:`

```bash
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ awk -v RS='Error:' 'END{print NR-1}' report.log
2
$ awk -v RS='Error:' 'NR==1' report.log
blah blah

$ # filter 'Error:' block matching particular string
$ # to preserve formatting, use: '/whatever/{print RS $0}'
$ awk -v RS='Error:' '/whatever/' report.log
 something went wrong
more blah
whatever

$ # blocks with more than 3 lines
$ # splitting string with 3 newlines will yield 4 fields
$ awk -F'\n' -v RS='Error:' 'NF>4{print RS $0}' report.log
Error: something surely went wrong
some text
some more text
blah blah blah

```

* Regular expression based `RS`
    * the `RT` variable will contain string matched by `RS`
* Note that entire input is treated as single string, so `^` and `$` anchors will apply only once - not every line

```bash
$ s='Sample123string54with908numbers'
$ printf "$s" | awk -v RS='[0-9]+' 'NR==1'
Sample

$ # note the relationship between record and separators
$ printf "$s" | awk -v RS='[0-9]+' '{print NR " : " $0 " - " RT}'
1 : Sample - 123
2 : string - 54
3 : with - 908
4 : numbers - 

$ # need to be careful of empty records
$ printf '123string54with908' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with
$ # and newline at end of input
$ printf '123string54with908\n' | awk -v RS='[0-9]+' '{print NR " : " $0}'
1 : 
2 : string
3 : with
4 : 

```

* Joining lines based on specific end of line condition

```bash
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

$ # join lines ending with - to next line
$ # by manipulating RS and ORS
$ awk -v RS='-\n' -v ORS= '1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.

$ # by manipulating ORS alone, sub function covered in later sections
$ awk '{ORS = sub(/-$/,"") ? "" : "\n"} 1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.
$ # easier: perl -pe 's/-\n//' msg.txt as newline is still part of input line
```

* processing null terminated input

```bash
$ printf 'foo\0bar\0' | cat -A
foo^@bar^@$
$ printf 'foo\0bar\0' | awk -v RS='\0' '{print}'
foo
bar
```

**Further Reading**

* [gawk manual - Records](https://www.gnu.org/software/gawk/manual/html_node/Records.html#Records)
* [unix.stackexchange - Slurp-mode in awk](https://unix.stackexchange.com/questions/304457/slurp-mode-in-awk)
* [stackoverflow - using RS to count number of occurrences of a given string](https://stackoverflow.com/questions/45102651/how-to-grep-double-quote-followed-by-a-string-at-same-time/45102962#45102962)

<br>

## <a name="substitute-functions"></a>Substitute functions

* Use `sub` string function for replacing first occurrence
* Use `gsub` for replacing all occurrences
* By default, `$0` (the input record) is modified; a field or variable can be specified as the third argument

```bash
$ # replacing first occurrence
$ echo '1-2-3-4-5' | awk '{sub("-", ":")} 1'
1:2-3-4-5

$ # replacing all occurrences
$ echo '1-2-3-4-5' | awk '{gsub("-", ":")} 1'
1:2:3:4:5

$ # return value for sub/gsub is number of replacements made
$ echo '1-2-3-4-5' | awk '{n=gsub("-", ":"); print n} 1'
4
1:2:3:4:5

$ # // format is better suited to specify search REGEXP
$ echo '1-2-3-4-5' | awk '{gsub(/[^-]+/, "abc")} 1'
abc-abc-abc-abc-abc

$ # replacing all occurrences only for third field
$ echo 'one;two;three;four' | awk -F';' '{gsub("e", "E", $3)} 1'
one two thrEE four
```
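
The third argument need not be a field. As a sketch, a copy of the field can be saved in a variable (named `s` here only for illustration) and modified, which leaves the input record untouched.

```bash
# s gets modified, $3 and $0 stay as they were
# n holds the number of replacements made
out=$(echo 'one;two;three;four' | awk -F';' '{s = $3; n = gsub(/e/, "E", s); print n, s, $0}')
echo "$out"
```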

* `gensub` returns the modified string, unlike `sub` and `gsub` which modify the target in place
* it also supports back-references and the ability to modify a specific match
* acts upon `$0` if target is not specified

```bash
$ # replace second occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(":", "-", 2)} 1'
foo:123-bar:baz
$ # use REGEXP as needed
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 2)} 1'
foo:XYZ:bar:baz

$ # or print the returned string directly
$ echo 'foo:123:bar:baz' | awk '{print gensub(":", "-", 2)}'
foo:123-bar:baz

$ # replace third occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 3)} 1'
foo:123:XYZ:baz

$ # replace all occurrences, similar to gsub
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", "g")} 1'
XYZ:XYZ:XYZ:XYZ

$ # target other than $0
$ echo 'foo:123:bar:baz' | awk -F: -v OFS=: '{$1=gensub(/o/, "b", 2, $1)} 1'
fob:123:bar:baz
```

* back-reference examples
* use `\"` within double-quotes to represent `"` character in replacement string
* use `\\1` to represent `\1` - the first captured group and so on
* `&` or `\\0` will back-reference the entire matched string

```bash
$ # replacing last occurrence without knowing how many occurrences are there
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/(.*):/, "\\1-", 1)} 1'
foo:123:bar-baz
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)and/, "\\1XYZ", 1)} 1'
foo and bar and baz lXYZ good

$ # use word boundaries as necessary
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)\<and\>/, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # replacing last but one
$ echo '456:foo:123:bar:789:baz' | awk '{$0=gensub(/(.*):(.*:)/, "\\1-\\2", 1)} 1'
456:foo:123:bar-789:baz

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
```

* saving quotes in variables - to avoid escaping double quotes or having to use octal code for single quotes

```bash
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'
$ echo 'foo:123:bar:baz' | awk -v sq="'" '{$0=gensub(/[^:]+/, sq"&"sq, "g")} 1'
'foo':'123':'bar':'baz'

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
$ echo 'foo:123:bar:baz' | awk -v dq='"' '{$0=gensub(/[^:]+/, dq"&"dq, "g")} 1'
"foo":"123":"bar":"baz"
```

**Further Reading**

* [gawk manual - String-Manipulation Functions](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html)
* [gawk manual - escape processing](https://www.gnu.org/software/gawk/manual/html_node/Gory-Details.html)

<br>

## <a name="inplace-file-editing"></a>Inplace file editing

* Use the `-i inplace` option with caution, preferably after testing that the `awk` code is working as intended

```bash
$ cat greeting.txt
Hi there
Have a nice day

$ awk -i inplace '{gsub("e", "E")} 1' greeting.txt
$ cat greeting.txt
Hi thErE
HavE a nicE day
```

* Multiple input files are treated individually and changes are written back to respective files

```bash
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes

$ awk -i inplace '{gsub("3", "three")} 1' f1 f2
$ cat f1
I ate three apples
$ cat f2
I bought two bananas and three mangoes
```

* to create backups of original file, set `INPLACE_SUFFIX` variable
* **Note** that in newer versions, you have to use `inplace::suffix` instead of `INPLACE_SUFFIX`

```bash
$ awk -i inplace -v INPLACE_SUFFIX='.bkp' '{gsub("three", "3")} 1' f1
$ cat f1
I ate 3 apples
$ cat f1.bkp
I ate three apples
```

* See [gawk manual - Enabling In-Place File Editing](https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html) for implementation details

<br>

## <a name="using-shell-variables"></a>Using shell variables

* useful when `awk` code is part of a shell program and a shell variable needs to be passed as input to the `awk` code
* for example:
    * command line argument passed to shell script, which is in turn passed on to `awk`
    * control structures in shell script calling `awk` with different search strings
* See also [stackoverflow - How do I use shell variables in an awk script?](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script)

```bash
$ # examples tested with bash shell

$ f='apple'
$ awk -v word="$f" '$1==word' fruits.txt
apple   42
$ f='fig'
$ awk -v word="$f" '$1==word' fruits.txt
fig     90

$ q='20'
$ awk -v threshold="$q" 'NR==1 || $2>threshold' fruits.txt
fruit   qty
apple   42
banana  31
fig     90
```
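
The second use case listed above, a shell control structure calling `awk` with different search strings, could look like the sketch below. The helper name `qty_of` and the temporary copy of `fruits.txt` are assumptions for illustration.

```bash
# temporary file with the same content as fruits.txt
fruits=$(mktemp)
printf 'fruit   qty\napple   42\nbanana  31\nfig     90\nguava   6\n' > "$fruits"

# hypothetical helper: print the qty for the fruit name passed as argument
qty_of() {
    awk -v word="$1" '$1==word{print $2}' "$fruits"
}

# loop in the shell, each iteration passes a different search string
report=$(for f in apple fig guava; do echo "$f: $(qty_of "$f")"; done)
echo "$report"

rm -f "$fruits"
```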

* accessing shell environment variables

```bash
$ # existing environment variable
$ awk 'BEGIN{print ENVIRON["PWD"]}'
/home/learnbyexample
$ awk 'BEGIN{print ENVIRON["SHELL"]}'
/bin/bash

$ # defined along with awk code
$ word='hello world' awk 'BEGIN{print ENVIRON["word"]}'
hello world

$ # using ENVIRON also prevents awk's interpretation of escape sequences
$ s='a\n=c'
$ foo="$s" awk 'BEGIN{print ENVIRON["foo"]}'
a\n=c
$ awk -v foo="$s" 'BEGIN{print foo}'
a
=c
```

* passing *REGEXP*
* See also [gawk manual - Using Dynamic Regexps](https://www.gnu.org/software/gawk/manual/html_node/Computed-Regexps.html)

```bash
$ s='are'
$ # for: awk '!/are/' poem.txt
$ awk -v s="$s" '$0 !~ s' poem.txt
Sugar is sweet,
$ # for: awk '/are/ && !/so/' poem.txt
$ awk -v s="$s" '$0 ~ s && !/so/' poem.txt
Roses are red,
Violets are blue,

$ r='[^-]+'
$ echo '1-2-3-4-5' | awk -v r="$r" '{gsub(r, "abc")} 1'
abc-abc-abc-abc-abc

$ # escape sequence has to be doubled when string is interpreted as REGEXP
$ s='foo and bar and baz land good'
$ echo "$s" | awk '{$0=gensub("(.*)\\<and\\>", "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
$ # hence passing as variable should be
$ r='(.*)\\<and\\>'
$ echo "$s" | awk -v r="$r" '{$0=gensub(r, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # or use ENVIRON
$ r='(.*)\<and\>'
$ echo "$s" | r="$r" awk '{$0=gensub(ENVIRON["r"], "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
```
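
* A hedged sketch for the opposite situation, where the shell variable should be matched literally instead of as a *REGEXP*. It uses the standard `index` function, which returns `0` when the string is not found. The sample input and variable names are assumptions made for illustration

```bash
s='a.c'

# treated as REGEXP: '.' matches any character, so both lines match
as_regexp=$(printf 'abc\na.c\n' | awk -v s="$s" '$0 ~ s')

# treated as a plain string: only the line containing a.c literally
as_string=$(printf 'abc\na.c\n' | awk -v s="$s" 'index($0, s)')

printf '%s\n' "$as_regexp"
printf '%s\n' "$as_string"
```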

<br>

## <a name="multiple-file-input"></a>Multiple file input

* Example to show difference between `NR` and `FNR`

```bash
$ # NR for overall record number
$ awk 'NR==1' poem.txt greeting.txt
Roses are red,

$ # FNR for individual file's record number
$ # same as: head -q -n1 poem.txt greeting.txt
$ awk 'FNR==1' poem.txt greeting.txt
Roses are red,
Hi thErE
```
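
* A small sketch printing `NR` and `FNR` side by side makes the difference visible. The file names `one.txt` and `two.txt` are hypothetical and are created in a temporary directory

```bash
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\nd\ne\n' > "$tmpdir/two.txt"

# NR keeps counting across files, FNR restarts at 1 for every file
out=$(cd "$tmpdir" && awk '{print FILENAME, NR, FNR}' one.txt two.txt)
printf '%s\n' "$out"
# one.txt 1 1
# one.txt 2 2
# two.txt 3 1
# two.txt 4 2
# two.txt 5 3

rm -r "$tmpdir"
```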

* Constructs to do some processing before starting each file as well as at the end
* `BEGINFILE` - to add code to be executed before start of each input file
* `ENDFILE` - to add code to be executed after processing each input file
* `FILENAME` - file name of current input file being processed

```bash
$ # similar to: tail -n1 poem.txt greeting.txt
$ awk 'BEGINFILE{print "file: "FILENAME}
       ENDFILE{print $0"\n------"}' poem.txt greeting.txt
file: poem.txt
And so are you.
------
file: greeting.txt
HavE a nicE day
------
```

* And of course, there can be usual `awk` code

```bash
$ awk 'BEGINFILE{print "file: "FILENAME}
       FNR==1;
       ENDFILE{print "------"}' poem.txt greeting.txt
file: poem.txt
Roses are red,
------
file: greeting.txt
Hi thErE
------

$ awk 'BEGINFILE{c++; print "file: "FILENAME}
       FNR==2;
       END{print "\nTotal input files: "c}' poem.txt greeting.txt
file: poem.txt
Violets are blue,
file: greeting.txt
HavE a nicE day

Total input files: 2
```

**Further Reading**

* [gawk manual - Using ARGC and ARGV](https://www.gnu.org/software/gawk/manual/html_node/ARGC-and-ARGV.html)
* [gawk manual - ARGIND](https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#index-ARGIND-variable)
* [gawk manual - ERRNO](https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#index-ERRNO-variable)
* [stackoverflow - Finding common value across multiple files](https://stackoverflow.com/a/43473385/4082052)

<br>

## <a name="control-structures"></a>Control Structures

* Syntax is similar to the `C` language, and a single statement inside a control structure doesn't need to be grouped within `{}`
* See [gawk manual - Control Statements](https://www.gnu.org/software/gawk/manual/html_node/Statements.html) for details

Remember that by default there is a loop that goes over all input records and constructs like `BEGIN` and `END` fall outside that loop

```bash
$ cat nums.txt
42
-2
10101
-3.14
-75
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # uninitialized variables will have empty string
$ printf '' | awk '{sum += $1} END{print sum}'

$ # so either add '0' or use unary '+' operator to convert to number
$ printf '' | awk '{sum += $1} END{print +sum}'
0
$ awk '{sum += $1} END{print sum+0}' /dev/null
0
```

* See also [unix.stackexchange - change in behavior of unary + with gawk version 4.2.0](https://unix.stackexchange.com/questions/421904/regression-with-unary-plus)

<br>

#### <a name="if-else-and-loops"></a>if-else and loops

* We have already seen simple `if` examples in [Filtering](#filtering) section
* See also [gawk manual - Switch](https://www.gnu.org/software/gawk/manual/html_node/Switch-Statement.html)

```bash
$ # same as: sed -n '/are/ s/so/SO/p' poem.txt
$ # remember that sub/gsub returns number of substitutions made
$ awk '/are/{if(sub("so", "SO")) print}' poem.txt
And SO are you.
$ # of course, can also use
$ awk '/are/ && sub("so", "SO")' poem.txt
And SO are you.

$ # if-else example
$ awk 'NR>1{if($2>40) $0="+"$0; else $0="-"$0} 1' fruits.txt
fruit   qty
+apple   42
-banana  31
+fig     90
-guava   6
```

* ternary operator
* See also [stackoverflow - finding min and max value of a column](https://stackoverflow.com/a/29784278/4082052)

```bash
$ cat nums.txt
42
-2
10101
-3.14
-75

$ # changing -ve to +ve and vice versa
$ # same as: awk '{if($0 ~ /^-/) sub(/^-/,""); else sub(/^/,"-")} 1' nums.txt
$ awk '{$0 ~ /^-/ ? sub(/^-/,"") : sub(/^/,"-")} 1' nums.txt
-42
2
-10101
3.14
75
$ # can also use: awk '!sub(/^-/,""){sub(/^/,"-")} 1' nums.txt
```
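
* Two more ternary sketches, where the result of the operator is used as a value instead of for its side effect. The inline data mirrors `nums.txt`, and `$1+0` is used only to force numeric comparison

```bash
nums='42
-2
10101
-3.14
-75'

# absolute value of each number
abs=$(printf '%s\n' "$nums" | awk '{print ($1+0 < 0 ? -$1 : $1)}')
printf '%s\n' "$abs"

# maximum value, the first record initializes max
max=$(printf '%s\n' "$nums" |
    awk '{max = (NR==1 || $1+0 > max) ? $1+0 : max} END{print max}')
printf '%s\n' "$max"
```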

* for loop
* similar to `C` language, `break` and `continue` statements are also available
* See also [stackoverflow - find missing numbers from sequential list](https://stackoverflow.com/questions/38491676/how-can-i-find-the-missing-integers-in-a-unique-and-sequential-list-one-per-lin)

```bash
$ awk 'BEGIN{for(i=2; i<11; i+=2) print i}'
2
4
6
8
10

$ # looping each field
$ s='scat:cat:no cat:abdicate:cater'
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) if($i=="cat") $i="CAT"} 1'
scat:CAT:no cat:abdicate:cater
$ # can also use sub function
$ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) sub(/^cat$/,"CAT",$i)} 1'
scat:CAT:no cat:abdicate:cater
```
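
* `break` and `continue` are mentioned above without an example, so here is a minimal sketch reusing the same sample string. The variable names `first` and `upper` are illustrative

```bash
s='scat:cat:no cat:abdicate:cater'

# break: stop looping at the first field that starts with 'cat'
first=$(printf '%s\n' "$s" |
    awk -F: '{for(i=1;i<=NF;i++) if($i ~ /^cat/){print i, $i; break}}')
printf '%s\n' "$first"

# continue: skip fields containing a space, print the rest in uppercase
upper=$(printf '%s\n' "$s" |
    awk -F: '{for(i=1;i<=NF;i++){if($i ~ / /) continue; print toupper($i)}}')
printf '%s\n' "$upper"
```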

* while loop
* do-while is also available

```bash
$ awk 'BEGIN{i=2; while(i<11){print i; i+=2}}'
2
4
6
8
10

$ # recursive substitution
$ # here again return value of sub/gsub is useful
$ echo 'titillate' | awk '{while( gsub(/til/, "") ) print}'
tilate
ate
```
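
* A short do-while sketch for comparison. The body is executed before the condition is checked, so it always runs at least once

```bash
# same output as the while loop version: 2 4 6 8 10, one per line
evens=$(awk 'BEGIN{i=2; do{print i; i+=2} while(i<11)}')
printf '%s\n' "$evens"

# condition is false from the start, yet the body runs once and prints 20
once=$(awk 'BEGIN{i=20; do{print i; i+=2} while(i<11)}')
printf '%s\n' "$once"
```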

<br>

#### <a name="next-and-nextfile"></a>next and nextfile

* `next` will skip rest of statements and start processing next line of current file being processed
    * there is a loop by default which goes over all input records, `next` is applicable for that
    * it is similar to `continue` statement within loops
* it is often used in [Two file processing](#two-file-processing)

```bash
$ # here 'next' is used to skip processing header line
$ awk 'NR==1{print; next} /a.*a/{$0="*"$0} /[eiou]/{$0="-"$0} 1' fruits.txt
fruit   qty
-apple   42
*banana  31
-fig     90
-*guava   6
```

* `nextfile` is useful to skip remaining lines from current file being processed and move on to next file

```bash
$ # same as: head -q -n1 poem.txt greeting.txt fruits.txt
$ awk 'FNR>1{nextfile} 1' poem.txt greeting.txt fruits.txt
Roses are red,
Hi thErE
fruit   qty

$ # specific field
$ awk 'FNR>2{nextfile} {print $1}' poem.txt greeting.txt fruits.txt
Roses
Violets
Hi
HavE
fruit
apple

$ # similar to 'grep -il'
$ awk -v IGNORECASE=1 '/red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt
poem.txt
$ awk -v IGNORECASE=1 '$1 ~ /red/{print FILENAME; nextfile}' *
colors_1.txt
colors_2.txt
```

<br>

## <a name="multiline-processing"></a>Multiline processing

* Processing consecutive lines

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # match two consecutive lines
$ awk 'p~/are/ && /is/{print p ORS $0} {p=$0}' poem.txt
Violets are blue,
Sugar is sweet,
$ # if only the second line is needed
$ awk 'p~/are/ && /is/; {p=$0}' poem.txt
Sugar is sweet,

$ # match three consecutive lines
$ awk 'p2~/red/ && p1~/blue/ && /is/{print p2} {p2=p1; p1=$0}' poem.txt
Roses are red,

$ # common mistake: N consumes the next line, so the pair starting at that line is never checked
$ sed -n '/are/{N;/is/p}' poem.txt
$ # something like this is needed instead, which isn't practical to extend for other cases
$ sed '$!N; /are.*\n.*is/p; D' poem.txt
Violets are blue,
Sugar is sweet,
```

Consider this sample input file

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* extracting lines around matching line
* See also [stackoverflow - lines around matching regexp](https://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern)
* how `n && n--` works:
    * note that the right hand side of `&&` is evaluated only if the left hand side is `true`
    * so for example, if initially `n=2`, then we get
        * `2 && 2; n=1` - evaluates to `true`
        * `1 && 1; n=0` - evaluates to `true`
        * `0 && ` - evaluates to `false` ... `n` is not decremented and hence the condition stays `false` until `n` is re-assigned a non-zero value

```bash
$ # similar to: grep --no-group-separator -A1 'BEGIN' range.txt
$ awk '/BEGIN/{n=2} n && n--' range.txt
BEGIN
1234
BEGIN
a

$ # only print the line after matching line
$ # can also use: awk '/BEGIN/{n=1; next} n && n--' range.txt
$ awk 'n && n--; /BEGIN/{n=1}' range.txt
1234
a
$ # generic case: print nth line after match
$ awk 'n && !--n; /BEGIN/{n=3}' range.txt
END
c

$ # print second line prior to matched line
$ awk '/END/{print p2} {p2=p1; p1=$0}' range.txt
1234
b
$ # save all lines in an array for generic case
$ # NR>n is checked to avoid printing empty line if there is a match
$ # within first n lines
$ awk -v n=3 '/BEGIN/ && NR>n{print a[NR-n]} {a[NR]=$0}' range.txt
6789
$ # or, use the reversing trick
$ tac range.txt | awk 'n && !--n; /END/{n=3}' | tac
BEGIN
a
```
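
* The short-circuit behavior of `n && n--` described earlier can be traced with a small sketch. For each line it prints the input, the result of the condition and the value of `n` afterwards. The variable `c` and the shortened input are assumptions for illustration

```bash
# columns: input line, result of (n && n--), value of n after evaluation
trace=$(printf 'foo\nBEGIN\n1234\n6789\n' |
    awk '/BEGIN/{n=2} {c = (n && n--); print $0, c, n+0}')
printf '%s\n' "$trace"
# foo 0 0
# BEGIN 1 1
# 1234 1 0
# 6789 0 0
```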

* Checking if multiple strings are present at least once in entire input file
* If there are lots of strings to check, use arrays

```bash
$ # can also use BEGINFILE instead of FNR==1
$ awk 'FNR==1{s1=s2=0} /is/{s1=1} /are/{s2=1} s1&&s2{print FILENAME; nextfile}' *
poem.txt
sample.txt

$ awk 'FNR==1{s1=s2=0} /foo/{s1=1} /report/{s2=1} s1&&s2{print FILENAME; nextfile}' *
paths.txt
```
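
* A sketch of the array based approach for a longer list of strings. Assumptions: the strings are passed as a space separated list, they are matched literally using `index` instead of as *REGEXP*, the list has no repeated entries, and a single input is checked. The helper name `check` is hypothetical

```bash
poem='Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.'

# $1 is a space separated list of strings to look for
check() {
    printf '%s\n' "$poem" | awk -v list="$1" '
        BEGIN{n = split(list, w, " ")}
        # note down every string from the list found in the current line
        {for(i=1; i<=n; i++) if(index($0, w[i])) found[w[i]]}
        END{c=0; for(k in found) c++
            print (c==n ? "all present" : "some missing")}'
}

r1=$(check 'is are sweet')
r2=$(check 'is are sour')
printf '%s\n' "$r1" "$r2"
```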

**Further Reading**

* [stackoverflow - delete line based on content of previous/next lines](https://stackoverflow.com/questions/49112877/delete-line-if-line-matches-foo-line-above-matches-bar-and-line-below-match)
* [softwareengineering - FSM examples](https://softwareengineering.stackexchange.com/questions/47806/examples-of-finite-state-machines)
* [wikipedia - FSM](https://en.wikipedia.org/wiki/Finite-state_machine)

<br>

## <a name="two-file-processing"></a>Two file processing

* We'll use awk's associative arrays (key-value pairs) here
    * key can be number or string
    * See also [gawk manual - Arrays](https://www.gnu.org/software/gawk/manual/html_node/Arrays.html)
* Unlike [comm](./sorting_stuff.md#comm) the input files need not be sorted and comparison can be done based on certain field(s) as well

<br>

#### <a name="comparing-whole-lines"></a>Comparing whole lines

Consider the following test files

```bash
$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow

$ cat colors_2.txt
Black
Blue
Green
Red
White
```

* common lines and lines unique to one of the files
* For two files as input, `NR==FNR` will be true only when first file is being processed
* Using `next` will skip rest of code when first file is processed
* `a[$0]` will create unique keys (here entire line content is used as key) in array `a`
    * just referencing a key will create it if it doesn't already exist, with value as empty string (will also act as zero in numeric context)
* `$0 in a` will be true if key already exists in array `a`

```bash
$ # common lines
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
Blue
Red

$ # lines from colors_2.txt not present in colors_1.txt
$ # same as: grep -vFxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt
Black
Green
White

$ # reversing the order of input files gives
$ # lines from colors_1.txt not present in colors_2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_2.txt colors_1.txt
Brown
Purple
Teal
Yellow
```
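
* As noted in the points above, merely referencing a key creates it. A minimal sketch showing why `$0 in a` is the safer membership test compared to checking the value (the key `x` is just an example)

```bash
out=$(awk 'BEGIN{
    # the "in" operator does not create the key
    print (("x" in a) ? "present" : "absent")
    # comparing the value references a["x"] and thus creates it
    if(a["x"] == "") print "value is empty"
    print (("x" in a) ? "present" : "absent")
}')
printf '%s\n' "$out"
# absent
# value is empty
# present
```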

<br>

#### <a name="comparing-specific-fields"></a>Comparing specific fields

Consider the sample input file

```bash
$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
```

* single field
* For ex: only first field comparison by using `$1` instead of `$0` as key

```bash
$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

$ # if header is needed as well
$ awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' list1 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
```

* multiple fields
* create a string by adding some character between the fields to act as key
    * for ex: without a separator, the two field values `abc` and `123` would produce the same key as the field values `ab` and `c123`
    * by adding character, say `_`, the key would be `abc_123` for first case and `ab_c123` for second case
    * this can still lead to false match if input data has `_`
    * there is also a built-in way to do this using [gawk manual - Multidimensional Arrays](https://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional)

```bash
$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # extract only lines matching both fields specified in list2
$ awk 'NR==FNR{a[$1"_"$2]; next} $1"_"$2 in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67

$ # uses SUBSEP as separator, whose default value is non-printing character \034
$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
```
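
* A sketch of the false match mentioned above when the input data itself contains `_`. The two input lines are made up for illustration, and the number of distinct keys is counted for both approaches

```bash
# two different field pairs that both collapse to the key a_b_c
input='a_b c
a b_c'

joined=$(printf '%s\n' "$input" |
    awk '{k[$1"_"$2]} END{n=0; for(i in k) n++; print n}')
subsep=$(printf '%s\n' "$input" |
    awk '{k[$1,$2]} END{n=0; for(i in k) n++; print n}')

# 1 key with "_" as joining character, 2 keys with SUBSEP
printf '%s\n' "$joined" "$subsep"
```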

* field and value comparison

```bash
$ cat list3
ECE 70
EEE 65
CSE 80

$ # extract line matching Dept and minimum marks specified in list3
$ awk 'NR==FNR{d[$1]=$2; next} $1 in d && $3 >= d[$1]' list3 marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
```

<br>

#### <a name="getline"></a>getline

* `getline` is an alternative way to read from a file and could be faster than `NR==FNR` method for some cases
* But use it with caution
    * [gawk manual - getline](https://www.gnu.org/software/gawk/manual/html_node/Getline.html) for details, especially about corner cases, errors, etc
    * [getline caveats](https://web.archive.org/web/20170524214527/http://awk.freeshell.org/AllAboutGetline)
    * [gawk manual - Closing Input and Output Redirections](https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html) if you have to start from beginning of file again
* `getline` return value: `1` if record is found, `0` if end of file, `-1` for errors such as file not found (use `ERRNO` variable to get details)

```bash
$ # replace mth line in poem.txt with nth line from nums.txt
$ # return value handling is not shown here, but should be done ideally
$ awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "nums.txt"}
                     FNR==m{$0=s} 1' poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # without getline, but slower due to NR==FNR check for every line processed
$ awk -v m=3 -v n=2 'NR==FNR{if(FNR==n){s=$0; nextfile} next}
                     FNR==m{$0=s} 1' nums.txt poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # Note that if nums.txt has fewer than n lines:
$ # getline version will use last line of nums.txt if any
$ # NR==FNR version will give empty string as 's' would be uninitialized
```
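
* A hedged sketch of the same task with the `getline` return value checked, so that a missing file or a file that is too short is reported instead of being silently ignored. The helper name `replace`, the error message and the temporary sample files are assumptions for illustration

```bash
tmpdir=$(mktemp -d)
printf '42\n-2\n10101\n' > "$tmpdir/nums.txt"
printf 'Roses are red,\nViolets are blue,\nSugar is sweet,\n' > "$tmpdir/poem.txt"

# $1 is m (line to replace in poem.txt), $2 is n (line to pick from nums.txt)
replace() {
    (cd "$tmpdir" && awk -v m="$1" -v n="$2" -v file='nums.txt' '
        BEGIN{
            for(i=1; i<=n; i++)
                # anything other than 1 means end of file or an error
                if((getline s < file) != 1){
                    print "cannot read line " i " of " file > "/dev/stderr"
                    exit 1
                }
        }
        FNR==m{$0=s} 1' poem.txt)
}

ok=$(replace 3 2)
printf '%s\n' "$ok"

# nums.txt has only 3 lines, so asking for the 5th fails with exit status 1
rc=0
bad=$(replace 3 5) || rc=$?
printf 'exit status: %s\n' "$rc"

rm -r "$tmpdir"
```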

* Another use case is if two files are to be processed simultaneously

```bash
$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ # the return value check ensures corresponding line number comparison
$ awk -v file='nums.txt' '(getline num < file)==1 && num>0' fruits.txt
fruit   qty
banana  31

$ # without getline, but has to save entire file in array
$ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' nums.txt fruits.txt
fruit   qty
banana  31
```

* error handling

```bash
$ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' xyz.txt fruits.txt
awk: fatal: cannot open file 'xyz.txt' for reading (No such file or directory)

$ awk -v file='xyz.txt' '{ e=(getline num < file);
                           if(e<0){print file ": " ERRNO; exit} }
                         e==1 && num>0' fruits.txt
xyz.txt: No such file or directory
```

**Further Reading**

* [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash)
* [unix.stackexchange - filter lines based on line numbers specified in another file](https://unix.stackexchange.com/questions/320651/read-numbers-from-control-file-and-extract-matching-line-numbers-from-the-data-f)
* [stackoverflow - three file processing to extract a matrix subset](https://stackoverflow.com/questions/45036019/how-to-filter-the-values-from-selected-columns-and-rows)
* [unix.stackexchange - column wise merging](https://unix.stackexchange.com/questions/294145/merging-two-files-one-column-at-a-time)
* [stackoverflow - extract specific rows from a text file using an index file](https://stackoverflow.com/questions/40595990/print-many-specific-rows-from-a-text-file-using-an-index-file)

<br>

## <a name="creating-new-fields"></a>Creating new fields

* Number of fields in input record can be changed by simply manipulating `NF`

```bash
$ # reducing fields
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=2} 1'
foo,bar

$ # creating new empty field(s)
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=5} 1'
foo,bar,123,baz,

$ # assigning to field greater than NF will create empty fields as needed
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{$7=42} 1'
foo,bar,123,baz,,,42
```

* adding a field based on existing fields

```bash
$ # adding a new 'Grade' field
$ awk 'BEGIN{OFS="\t"; g[9]="S"; g[8]="A"; g[7]="B"; g[6]="C"; g[5]="D"}
      {NF++; $NF = NR==1 ? "Grade" : g[int($(NF-1)/10)]} 1' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

$ # can also use split (covered in a later section)
$ # array assignment: split("DCBAS",g,//)
$ # index adjustment: g[int($(NF-1)/10)-4]
```

* two file example

```bash
$ cat list4
Raj class_rep
Amy sports_rep
Tia placement_rep

$ awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep
```

<br>

## <a name="dealing-with-duplicates"></a>Dealing with duplicates

* default value of uninitialized variable is `0` in numeric context and empty string in text context
    * and evaluates to `false` when used conditionally

*Illustration to show default numeric value and array in action*

```bash
$ printf 'mad\n42\n42\ndam\n42\n'
mad
42
42
dam
42

$ printf 'mad\n42\n42\ndam\n42\n' | awk '{print $0 "\t" int(a[$0]); a[$0]++}'
mad     0
42      0
42      1
dam     0
42      2
$ # only those entries with second column value zero will be retained
$ printf 'mad\n42\n42\ndam\n42\n' | awk '!a[$0]++'
mad
42
dam
```

* first, examples that retain only first copy of duplicates
* See also [iridakos: remove duplicates](https://iridakos.com/how-to/2019/05/16/remove-duplicate-lines-preserving-order-linux.html) for a detailed explanation
* See also [stackoverflow - add a letter to duplicate entries](https://stackoverflow.com/questions/47774779/add-letter-to-second-third-fourth-occurrence-of-a-string)

```bash
$ cat duplicates.txt
abc  7   4
food toy ****
abc  7   4
test toy 123
good toy ****

$ # whole line
$ awk '!seen[$0]++' duplicates.txt
abc  7   4
food toy ****
test toy 123
good toy ****

$ # particular column
$ awk '!seen[$2]++' duplicates.txt
abc  7   4
food toy ****

$ # total count
$ awk '!seen[$2]++{c++} END{print +c}' duplicates.txt
2
```

* if input is so large that integer numbers can overflow
* See also [gawk manual - Arbitrary-Precision Integer Arithmetic](https://www.gnu.org/software/gawk/manual/html_node/Arbitrary-Precision-Integers.html)

```bash
$ # avoid unnecessary counting altogether
$ awk '!($2 in seen); {seen[$2]}' duplicates.txt
abc  7   4
food toy ****

$ # use arbitrary-precision integers, limited only by available memory
$ awk -M '!($2 in seen){c++} {seen[$2]} END{print +c}' duplicates.txt
2
```

* For multiple fields, separate them using `,` or form a string with some character in between
    * choose a character unlikely to appear in input data, else there can be false matches
    * `FS` is a good choice as fields wouldn't contain separator character(s)

```bash
$ awk '!seen[$2 FS $3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123

$ # can also use simulated multidimensional array
$ # SUBSEP, whose default is \034 non-printing character, is used as separator
$ awk '!seen[$2,$3]++' duplicates.txt
abc  7   4
food toy ****
test toy 123
```

* retaining specific numbered copy

```bash
$ # second occurrence of duplicate
$ awk '++seen[$2]==2' duplicates.txt
abc  7   4
test toy 123

$ # third occurrence of duplicate
$ awk '++seen[$2]==3' duplicates.txt
good toy ****
```

* retaining only last copy of duplicate

```bash
$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | awk '!seen[$2]++' | tac
abc  7   4
good toy ****
```
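
* If `tac` is not available, a single `awk` command can do the same by remembering the last line number seen for each key. This is an alternative sketch, not the method shown above, and it holds the entire input in memory. The array names are illustrative and the inline data mirrors `duplicates.txt`

```bash
dups='abc  7   4
food toy ****
abc  7   4
test toy 123
good toy ****'

# pos[] has the line number of the last occurrence of each key
last=$(printf '%s\n' "$dups" |
    awk '{line[NR]=$0; key[NR]=$2; pos[$2]=NR}
         END{for(i=1; i<=NR; i++) if(pos[key[i]]==i) print line[i]}')
printf '%s\n' "$last"
```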

* filtering based on duplicate count
* allows emulating the [uniq](./sorting_stuff.md#uniq) command for specific fields
* See also [unix.stackexchange - retain only parent directory paths](https://unix.stackexchange.com/questions/362571/filter-out-paths-from-a-text-file-that-are-deeper-than-their-immediate-predecces)

```bash
$ # all duplicates based on 1st column
$ awk 'NR==FNR{a[$1]++; next} a[$1]>1' duplicates.txt duplicates.txt
abc  7   4
abc  7   4
$ # all duplicates based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]>1' duplicates.txt duplicates.txt
abc  7   4
food toy ****
abc  7   4
good toy ****

$ # more than 2 duplicates based on 2nd column
$ awk 'NR==FNR{a[$2]++; next} a[$2]>2' duplicates.txt duplicates.txt
food toy ****
test toy 123
good toy ****

$ # only unique lines based on 3rd column
$ awk 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt
test toy 123
```

<br>

## <a name="lines-between-two-regexps"></a>Lines between two REGEXPs

* This section deals with filtering lines bound by two *REGEXP*s (referred to as blocks)
* For simplicity the two *REGEXP*s usually used in below examples are the strings **BEGIN** and **END**

<br>

#### <a name="all-unbroken-blocks"></a>All unbroken blocks

Consider the sample input file below, which doesn't have any broken blocks (i.e. **BEGIN** and **END** are always present in pairs)

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* Extracting lines between starting and ending *REGEXP*

```bash
$ # include both starting/ending REGEXP
$ # can also use: awk '/BEGIN/,/END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
$ awk '/BEGIN/{f=1} f; /END/{f=0}' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # exclude both starting/ending REGEXP
$ # can also use: awk '/BEGIN/{f=1; next} /END/{f=0} f' range.txt
$ awk '/END/{f=0} f; /BEGIN/{f=1}' range.txt
1234
6789
a
b
c
```

* Include only start or end *REGEXP*

```bash
$ # include only starting REGEXP
$ awk '/BEGIN/{f=1} /END/{f=0} f' range.txt
BEGIN
1234
6789
BEGIN
a
b
c

$ # include only ending REGEXP
$ awk 'f; /END/{f=0} /BEGIN/{f=1}' range.txt
1234
6789
END
a
b
c
END
```

* Extracting lines other than lines between the two *REGEXP*s

```bash
$ awk '/BEGIN/{f=1} !f; /END/{f=0}' range.txt
foo
bar
baz

$ # the other three cases would be
$ awk '/END/{f=0} !f; /BEGIN/{f=1}' range.txt
$ awk '!f; /BEGIN/{f=1} /END/{f=0}' range.txt
$ awk '/BEGIN/{f=1} /END/{f=0} !f' range.txt
```
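
* A sketch showing what the other three cases print, with the input supplied inline (same content as `range.txt`). The variable names are illustrative

```bash
range='foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz'

# both BEGIN and END lines are retained
both=$(printf '%s\n' "$range" | awk '/END/{f=0} !f; /BEGIN/{f=1}')
# only BEGIN lines are retained
only_begin=$(printf '%s\n' "$range" | awk '!f; /BEGIN/{f=1} /END/{f=0}')
# only END lines are retained
only_end=$(printf '%s\n' "$range" | awk '/BEGIN/{f=1} /END/{f=0} !f')

printf '%s\n' "$both" '---' "$only_begin" '---' "$only_end"
```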

<br>

#### <a name="specific-blocks"></a>Specific blocks

* Getting first block

```bash
$ awk '/BEGIN/{f=1} f; /END/{exit}' range.txt
BEGIN
1234
6789
END

$ # use other tricks discussed in previous section as needed
$ awk '/END/{exit} f; /BEGIN/{f=1}' range.txt
1234
6789
```

* Getting last block

```bash
$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac
BEGIN
a
b
c
END

$ # or, save the blocks in a buffer and print the last one alone
$ # ORS contains output record separator, which is newline by default
$ seq 30 | awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}'
24
25
26
```

* Getting blocks based on a counter

```bash
$ # all blocks
$ seq 30 | sed -n '/4/,/6/p'
4
5
6
14
15
16
24
25
26

$ # get only 2nd block
$ # can also use: seq 30 | awk -v b=2 '/4/{c++} c==b{print; if(/6/) exit}'
$ seq 30 | awk -v b=2 '/4/{c++} c==b; /6/ && c==b{exit}'
14
15
16

$ # to get all blocks greater than 'b' blocks
$ seq 30 | awk -v b=1 '/4/{f=1; c++} f && c>b; /6/{f=0}'
14
15
16
24
25
26
```

* excluding a particular block

```bash
$ # excludes 2nd block
$ seq 30 | awk -v b=2 '/4/{f=1; c++} f && c!=b; /6/{f=0}'
4
5
6
24
25
26
```

<br>

#### <a name="broken-blocks"></a>Broken blocks

* If there are blocks with ending *REGEXP* but without corresponding start, `awk '/BEGIN/{f=1} f; /END/{f=0}'` will suffice
* Consider the modified input file where starting *REGEXP* doesn't have corresponding ending

```bash
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz

$ # the file reversing trick comes in handy here as well
$ tac broken_range.txt | awk '/END/{f=1} f; /BEGIN/{f=0}' | tac
BEGIN
1234
6789
END
```

* But if both kinds of broken blocks are present, accumulate the records and print accordingly

```bash
$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
xyzabc

$ awk '/BEGIN/{f=1; buf=$0; next}
       f{buf=buf ORS $0}
       /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END
```

**Further Reading**

* [stackoverflow - select lines between two regexps](https://stackoverflow.com/questions/38972736/how-to-select-lines-between-two-patterns)
* [unix.stackexchange - print only blocks with lines > n](https://unix.stackexchange.com/questions/295600/deleting-lines-between-rows-in-a-text-file-using-awk-or-sed)
* [unix.stackexchange - print a block only if it contains matching string](https://unix.stackexchange.com/a/335523/109046)
* [unix.stackexchange - print a block matching two different strings](https://unix.stackexchange.com/questions/347368/grep-with-range-and-pass-three-filters)
* [unix.stackexchange - extract block up to 2nd occurrence of ending REGEXP](https://unix.stackexchange.com/questions/404175/using-awk-to-print-lines-from-one-match-through-a-second-instance-of-a-separate)

<br>

## <a name="arrays"></a>Arrays

We've already seen examples using arrays; some more are discussed in this section

* array looping

```bash
$ # average marks for each department
$ awk 'NR>1{d[$1]+=$3; c[$1]++} END{for(i in d)print i, d[i]/c[i]}' marks.txt
ECE 72.3333
EEE 63.5
CSE 74
```

* Sorting
* See [gawk manual - Predefined Array Scanning Orders](https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html#Controlling-Scanning) for more details

```bash
$ # by default, the order in which keys are traversed is unspecified
$ awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
x 12
z 1
b 42

$ # index sorted ascending order as strings
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
b 42
x 12
z 1

$ # value sorted ascending order as numbers
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}'
z 1
x 12
b 42
```
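
* `PROCINFO["sorted_in"]` is specific to `gawk`. Where it is not available, a common workaround (an alternative sketch, not the method above) is to pipe the unordered output to the `sort` command

```bash
prog='BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'

# sorted by key as strings
by_key=$(awk "$prog" | sort)
printf '%s\n' "$by_key"

# sorted by value as numbers
by_val=$(awk "$prog" | sort -k2,2n)
printf '%s\n' "$by_val"
```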

* deleting array elements

```bash
$ cat list5
CSE     Surya   75
EEE     Jai     69
ECE     Kal     83

$ # update entry if a match is found
$ # else append the new entries
$ awk '{ky=$1"_"$2} NR==FNR{upd[ky]=$0; next}
        ky in upd{$0=upd[ky]; delete upd[ky]} 1;
        END{for(i in upd)print upd[i]}' list5 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   75
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
ECE     Kal     83
EEE     Jai     69
```

* true multidimensional arrays
* length of sub-arrays need not be same. See [gawk manual - Arrays of Arrays](https://www.gnu.org/software/gawk/manual/html_node/Arrays-of-Arrays.html#Arrays-of-Arrays) for details

```bash
$ awk 'NR>1{d[$1][$2]=$3} END{for(i in d["ECE"])print i}' marks.txt
Joel
Raj
Om

$ awk -v f='CSE' 'NR>1{d[$1][$2]=$3} END{for(i in d[f])print i, d[f][i]}' marks.txt
Surya 81
Amy 67
```
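
* Arrays of arrays need `gawk`. A rough equivalent using the `SUBSEP` based keys seen earlier is sketched below, where `split` recovers the individual parts of each key. The inline data mirrors `marks.txt`, and the output is piped to `sort` since the traversal order is unspecified

```bash
marks='Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67'

# p[1] is the department and p[2] is the name for each combined key
cse=$(printf '%s\n' "$marks" |
    awk -v f='CSE' 'NR>1{d[$1,$2]=$3}
        END{for(k in d){split(k, p, SUBSEP); if(p[1]==f) print p[2], d[k]}}' |
    sort)
printf '%s\n' "$cse"
```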

**Further Reading**

* [gawk manual - all array topics](https://www.gnu.org/software/gawk/manual/html_node/Arrays.html)
* [unix.stackexchange - count words based on length](https://unix.stackexchange.com/questions/396855/is-there-an-easy-way-to-count-characters-in-words-in-file-from-terminal)
* [unix.stackexchange - filtering specific lines](https://unix.stackexchange.com/a/326215/109046)

<br>

## <a name="awk-scripts"></a>awk scripts

* For larger programs, save the code in a file and use `-f` command line option
* `;` is not needed to terminate a statement when each statement is on its own line
* See also [gawk manual - Command-Line Options](https://www.gnu.org/software/gawk/manual/html_node/Options.html#Options) for other related options

```bash
$ cat buf.awk
/BEGIN/{
    f=1
    buf=$0
    next
}

f{
    buf=buf ORS $0
}

/END/{
    f=0
    if(buf)
        print buf
    buf=""
}

$ awk -f buf.awk multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END
```

* Another advantage is that single quotes can be freely used

```bash
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'

$ cat quotes.awk
{
    $0 = gensub(/[^:]+/, "'&'", "g")
}

1

$ echo 'foo:123:bar:baz' | awk -f quotes.awk
'foo':'123':'bar':'baz'
```

* If the code has first been tried out on the command line, add the `-o` option to get a pretty-printed version

```bash
$ awk -o -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep
```

A file name can be passed along with the `-o` option (with no space in between), otherwise `awkprof.out` is used by default

```bash
$ cat awkprof.out
        # gawk profile, created Mon Mar 16 10:11:11 2020

        # Rule(s)

        NR == FNR {
                r[$1] = $2
                next
        }

        {
                $(NF + 1) = (FNR == 1 ? "Role" : r[$2])
        }

        1 {
                print $0
        }

$ # note that other command line options have to be provided as usual
$ # for ex: awk -v OFS='\t' -f awkprof.out list4 marks.txt
```

<br>

## <a name="miscellaneous"></a>Miscellaneous

<br>

#### <a name="fpat-and-fieldwidths"></a>FPAT and FIELDWIDTHS

* `FS` defines what separates the fields
* In contrast, `FPAT` defines what the fields themselves are made up of
* See also [gawk manual - Defining Fields by Content](https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html)

```bash
$ s='Sample123string54with908numbers'
$ # define fields to be one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $1, $2, $3}'
123 54 908
$ # define fields to be one or more consecutive alphabets
$ echo "$s" | awk -v FPAT='[a-zA-Z]+' '{print $1, $2, $3, $4}'
Sample string with numbers
```

* For simple **csv** input, where fields containing `,` are enclosed in double quotes, using `FPAT` is a reasonable approach
* Use a proper parser if input can have other cases like newlines in fields
    * See [unix.stackexchange - using csv parser](https://unix.stackexchange.com/a/238192) for a sample program in `perl`

```bash
$ s='foo,"bar,123",baz,abc'
$ echo "$s" | awk -F, '{print $2}'
"bar
$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"bar,123"
```

* if input has well defined fields based on number of characters, `FIELDWIDTHS` can be used to specify the width of each field

```bash
$ awk -v FIELDWIDTHS='8 3' -v OFS= '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig     35
guava   6

$ # without FIELDWIDTHS
$ awk '/fig/{$2=35} 1' fruits.txt
fruit   qty
apple   42
banana  31
fig 35
guava   6
```

**Further Reading**

* [gawk manual - Processing Fixed-Width Data](https://www.gnu.org/software/gawk/manual/html_node/Fixed-width-data.html)
* [unix.stackexchange - Modify records in fixed-width files](https://unix.stackexchange.com/questions/368574/modify-records-in-fixed-width-files)
* [unix.stackexchange - detecting empty fields in fixed width files](https://unix.stackexchange.com/questions/321559/extracting-data-with-awk-when-some-lines-have-empty-missing-values)
* [stackoverflow - count number of times value is repeated each line](https://stackoverflow.com/questions/37450880/how-do-i-filter-tab-separated-input-by-the-count-of-fields-with-a-given-value)
* [stackoverflow - skip characters with FIELDWIDTHS in GNU Awk 4.2](https://stackoverflow.com/questions/46932189/how-do-you-skip-characters-with-fieldwidths-in-gnu-awk-4-2)

<br>

#### <a name="string-functions"></a>String functions

* `length` function - returns length of string, by default acts on `$0`

```bash
$ seq 8 13 | awk 'length()==1'
8
9

$ awk 'NR==1 || length($1)>4' fruits.txt
fruit   qty
apple   42
banana  31
guava   6

$ # character count and not byte count is calculated, similar to 'wc -m'
$ printf 'hi👍' | awk '{print length()}'
3

$ # use -b option if number of bytes is needed
$ printf 'hi👍' | awk -b '{print length()}'
6
```

* `split` function - splits a string into array elements, similar to how `FS` splits the input record into fields
* use `patsplit` function to get results similar to `FPAT`
* See also [gawk manual - Split function](https://www.gnu.org/software/gawk/manual/gawk.html#index-split_0028_0029-function)
* See also [unix.stackexchange - delimit second column](https://unix.stackexchange.com/questions/372253/awk-command-to-delimit-the-second-column)

```bash
$ # 1st argument is string to be split
$ # 2nd argument is array to save results, indexed from 1
$ # 3rd argument is separator, default is FS
$ s='foo,1996-10-25,hello,good'
$ echo "$s" | awk -F, '{split($2,d,"-"); print "Month is: " d[2]}'
Month is: 10

$ # using regular expression to define separator
$ # return value is number of fields after splitting
$ s='Sample123string54with908numbers'
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/); for(i=1;i<=n;i++)print s[i]}'
Sample
string
with
numbers
$ # use 4th argument if separators are needed as well
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/,seps); for(i=1;i<n;i++)print seps[i]}'
123
54
908

$ # single row to multiple rows based on splitting last field
$ s='foo,baz,12:42:3'
$ echo "$s" | awk -F, '{n=split($NF,a,":"); NF--; for(i=1;i<=n;i++) print $0,a[i]}'
foo baz 12
foo baz 42
foo baz 3
```

* `substr` function extracts the specified number of characters from the given string
    * indexing starts with `1`
* See [gawk manual - substr function](https://www.gnu.org/software/gawk/manual/gawk.html#index-substr_0028_0029-function) for corner cases and details

```bash
$ # 1st argument is string to be worked on
$ # 2nd argument is starting position
$ # 3rd argument is number of characters to be extracted
$ echo 'abcdefghij' | awk '{print substr($0,1,5)}'
abcde
$ echo 'abcdefghij' | awk '{print substr($0,4,3)}'
def
$ # if 3rd argument is not given, string is extracted until end
$ echo 'abcdefghij' | awk '{print substr($0,6)}'
fghij

$ echo 'abcdefghij' | awk -v OFS=':' '{print substr($0,2,3), substr($0,6,3)}'
bcd:fgh

$ # if only a few characters are needed from the input line, empty FS can be used
$ echo 'abcdefghij' | awk -v FS= '{print $3}'
c
$ echo 'abcdefghij' | awk -v FS= '{print $3, $5}'
c e
```

<br>

#### <a name="executing-external-commands"></a>Executing external commands

* External commands can be issued using `system` function
* Output would be as usual on `stdout` unless redirected while calling the command
* Return value of `system` depends on `exit` status of executed command, see [gawk manual - Input/Output Functions](https://www.gnu.org/software/gawk/manual/html_node/I_002fO-Functions.html) for details

```bash
$ awk 'BEGIN{system("echo Hello World")}'
Hello World

$ wc poem.txt
 4 13 65 poem.txt
$ awk 'BEGIN{system("wc poem.txt")}'
 4 13 65 poem.txt

$ awk 'BEGIN{system("seq 10 | paste -sd, > out.txt")}'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2
$ awk 'BEGIN{s=system("ls xyz.txt"); print "Status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Status: 2

$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | awk -F, '{system("cat " $2)}'
I bought two bananas and three mangoes
```
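
* A rough sketch of using the return value of `system` for decision making; the path below is a made-up name assumed not to exist, and `test -e` is used only because its exit status is easy to predict

```bash
# hypothetical path, assumed to be absent on the system
f='/no/such/dir/xyz_missing.txt'

# system returns 0 here if the command succeeded, non-zero exit status otherwise
msg=$(awk -v f="$f" 'BEGIN{ if(system("test -e " f) == 0) print "found"; else print "missing" }')
echo "$msg"
```

* Note that a file name containing spaces or other shell special characters would need quoting inside the command string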

<br>

#### <a name="printf-formatting"></a>printf formatting

* Similar to the `printf` function in `C` and the shell built-in command
* use `sprintf` function to save the result in a variable instead of printing
* See also [gawk manual - printf](https://www.gnu.org/software/gawk/manual/html_node/Printf.html)

```bash
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # note that ORS is not appended and has to be added manually
$ awk '{sum += $1} END{printf "%.2f\n", sum}' nums.txt
10062.86

$ awk '{sum += $1} END{printf "%10.2f\n", sum}' nums.txt
  10062.86

$ awk '{sum += $1} END{printf "%010.2f\n", sum}' nums.txt
0010062.86

$ awk '{sum += $1} END{printf "%d\n", sum}' nums.txt
10062

$ awk '{sum += $1} END{printf "%+d\n", sum}' nums.txt
+10062

$ awk '{sum += $1} END{printf "%e\n", sum}' nums.txt
1.006286e+04
```

* to refer to an argument by its position (starts with 1), use `<num>$`

```bash
$ # can also use: awk 'BEGIN{printf "hex=%x\noct=%o\ndec=%d\n", 15, 15, 15}'
$ awk 'BEGIN{printf "hex=%1$x\noct=%1$o\ndec=%1$d\n", 15}'
hex=f
oct=17
dec=15

$ # adding prefix to hex/oct numbers
$ awk 'BEGIN{printf "hex=%1$#x\noct=%1$#o\ndec=%1$d\n", 15}'
hex=0xf
oct=017
dec=15
```

* strings

```bash
$ # prefix remaining width with spaces
$ awk 'BEGIN{printf "%6s:%5s\n", "foo", "bar"}'
   foo:  bar

$ # suffix remaining width with spaces
$ awk 'BEGIN{printf "%-6s:%-5s\n", "foo", "bar"}'
foo   :bar  

$ # truncate
$ awk 'BEGIN{printf "%.2s\n", "foobar"}'
fo
```

* avoid passing the data itself as the format string to `printf`, any `%` in it would be treated as a format specifier

```bash
$ awk 'BEGIN{s="solve: 5 % x = 1"; printf s}'
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
    `solve: 5 % x = 1'
               ^ ran out for this one

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf "%s\n", s}'
solve: 5 % x = 1
```

* See also [stackoverflow - concatenating columns in middle](https://stackoverflow.com/questions/49135518/linux-csv-file-concatenate-columns-into-one-column)
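
* A minimal sketch of `sprintf`, which takes the same arguments as `printf` but returns the formatted string; the variable name `fname` and the file name pattern are only illustrative

```bash
# build a zero padded file name and save it in a variable instead of printing
name=$(awk 'BEGIN{fname = sprintf("%s_%03d.txt", "part", 7); print fname}')
echo "$name"
```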

<br>

#### <a name="redirecting-print-output"></a>Redirecting print output

* redirecting to a file instead of stdout using `>`
* similar to behavior in shell, if the file already exists it is overwritten
    * use `>>` to append to an existing file without deleting its content
* however, unlike shell, subsequent redirections to the same file within the same `awk` run will append to it
* See also [gawk manual - Closing Input and Output Redirections](https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html) if you have too many redirections

```bash
$ seq 6 | awk 'NR%2{print > "odd.txt"; next} {print > "even.txt"}'
$ cat odd.txt
1
3
5
$ cat even.txt
2
4
6

$ awk 'NR==1{col1=$1".txt"; col2=$2".txt"; next}
       {print $1 > col1; print $2 > col2}' fruits.txt
$ cat fruit.txt
apple
banana
fig
guava
$ cat qty.txt
42
31
90
6
```
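
* A small sketch contrasting `>` and `>>` on a file that already has content; the temporary directory and the name `out.txt` are assumptions made for illustration

```bash
tmp=$(mktemp -d)
printf 'old content\n' > "$tmp/out.txt"

# '>' empties the file once, later print statements in the same run append
seq 3 | awk -v f="$tmp/out.txt" '{print > f}'
after_overwrite=$(paste -sd, "$tmp/out.txt")

# '>>' keeps the existing content
seq 4 5 | awk -v f="$tmp/out.txt" '{print >> f}'
after_append=$(paste -sd, "$tmp/out.txt")

echo "$after_overwrite"
echo "$after_append"
rm -r "$tmp"
```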

* redirecting to shell command
* this is useful if you have different things to redirect to different commands, otherwise it can be done as usual in shell acting on `awk`'s output
* all redirections to the same command get combined as a single input to that command

```bash
$ # same as: echo 'foo good 123' | awk '{print $2}' | wc -c
$ echo 'foo good 123' | awk '{print $2 | "wc -c"}'
5
$ # to avoid newline character being added to print
$ echo 'foo good 123' | awk -v ORS= '{print $2 | "wc -c"}'
4
$ # assuming no format specifiers in input
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"}'
4

$ # same as: echo 'foo good 123' | awk '{printf $2 $3 | "wc -c"}'
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"; printf $3 | "wc -c"}'
7
```
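
* A command such as `sort` can produce its output only after its input ends, which happens when `awk` exits or when the pipe is closed
* A minimal sketch, assuming `sort` is available, that uses `close` so that the sorted lines appear before later output; the variable name `cmd` is only illustrative

```bash
out=$(awk 'BEGIN{
    cmd = "sort"
    print "banana" | cmd
    print "apple"  | cmd
    # wait for sort to finish before printing anything else
    close(cmd)
    print "done"
}')
echo "$out"
```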

**Further Reading**

* [gawk manual - Input/Output Functions](https://www.gnu.org/software/gawk/manual/html_node/I_002fO-Functions.html)
* [gawk manual - Redirecting Output of print and printf](https://www.gnu.org/software/gawk/manual/html_node/Redirection.html)
* [gawk manual - Two-Way Communications with Another Process](https://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html)
* [unix.stackexchange - inplace editing as well as stdout](https://unix.stackexchange.com/questions/321679/gawk-inplace-and-stdout)
* [stackoverflow - redirect blocks to separate files](https://stackoverflow.com/questions/45098279/write-blocks-in-a-text-file-to-multiple-new-files)

<br>

## <a name="gotchas-and-tips"></a>Gotchas and Tips

* using `$` prefix for user defined variables is a common mistake
* only input record `$0` and field contents `$1`, `$2` etc need `$`
* See also [unix.stackexchange - Why does awk print the whole line when I want it to print a variable?](https://unix.stackexchange.com/questions/291126/why-does-awk-print-the-whole-line-when-i-want-it-to-print-a-variable)

```bash
$ # wrong
$ awk -v word="apple" '$1==$word' fruits.txt

$ # right
$ awk -v word="apple" '$1==word' fruits.txt
apple   42
```
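
* In the wrong example above, the string `apple` has the numeric value `0`, so `$word` is the same as `$0`
* On the other hand, `$` applied to a variable holding a number picks that field; a small sketch, where the names `n` and `word` are just for illustration

```bash
# $n is the field whose number is stored in n, here the 2nd field
picked=$(echo 'foo good 123' | awk -v n=2 '{print $n}')
echo "$picked"

# non-numeric string is treated as 0, so this prints the entire record
whole=$(echo 'foo good 123' | awk -v word=foo '{print $word}')
echo "$whole"
```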

* dos style line endings
* See also [unix.stackexchange - filtering when last column has \r](https://unix.stackexchange.com/questions/399560/using-awk-to-select-rows-with-specific-value-in-specific-column)

```bash
$ # no issue with unix style line ending
$ printf 'foo bar\n123 789\n' | awk '{print $2, $1}'
bar foo
789 123

$ # dos style line ending causes trouble
$ printf 'foo bar\r\n123 789\r\n' | awk '{print $2, $1}'
 foo
 123

$ # easy to deal with by setting appropriate RS
$ # note that ORS would still be newline character only
$ printf 'foo bar\r\n123 789\r\n' | awk -v RS='\r\n' '{print $2, $1}'
bar foo
789 123
```
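
* An alternative sketch that does not depend on multi-character `RS` support: remove the carriage return from the end of each record using `sub`

```bash
# modifying $0 causes the fields to be re-split automatically
fixed=$(printf 'foo bar\r\n123 789\r\n' | awk '{sub(/\r$/, "")} {print $2, $1}')
echo "$fixed"
```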

* relying on default initial value

```bash
$ # step 1 - works for single file
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9

$ # step 2 - change to work for multiple files
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt
nums.txt 10062.9

$ # step 3 - check with multiple file input
$ # oops, default numerical value '0' for sum works only once
$ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 10068.9

$ # step 4 - correctly initialize variables
$ awk '{sum += $1} ENDFILE{print FILENAME, sum; sum=0}' nums.txt <(seq 3)
nums.txt 10062.9
/dev/fd/63 6
```

* use unary operator `+` to force numeric conversion

```bash
$ awk '{sum += $1} END{print FILENAME, sum}' nums.txt
nums.txt 10062.9

$ awk '{sum += $1} END{print FILENAME, sum}' /dev/null
/dev/null 

$ awk '{sum += $1} END{print FILENAME, +sum}' /dev/null
/dev/null 0
```

* concatenate empty string to force string comparison

```bash
$ echo '5 5.0' | awk '{print $1==$2 ? "same" : "different", "string"}'
same string

$ echo '5 5.0' | awk '{print $1""==$2 ? "same" : "different", "string"}'
different string
```

* beware of expressions evaluating to a negative value in field calculations

```bash
$ cat misc.txt
foo
good bad ugly
123 xyz
a b c d

$ # trying to delete last two fields
$ awk '{NF -= 2} 1' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: NF set to negative value
$ # dynamically change it depending on number of fields
$ awk '{NF = (NF<=2) ? 0 : NF-2} 1' misc.txt

good

a b

$ # similarly, trying to access 3rd field from end
$ awk '{print $(NF-2)}' misc.txt
awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: attempt to access field -1
$ awk 'NF>2{print $(NF-2)}' misc.txt
good
b
```

* If input is ASCII alone, setting the locale to `C` using `LC_ALL=C` is a simple trick to improve speed
* For simple non-regex based column filtering, using [cut](./miscellaneous.md#cut) command might give faster results
    * See [stackoverflow - how to split columns faster](https://stackoverflow.com/questions/46882557/how-to-split-columns-faster-in-python/46883120#46883120) for example

```bash
$ # all words containing exactly 3 lowercase a
$ time awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019

real    0m0.075s

$ time LC_ALL=C awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words
1019

real    0m0.045s
```

<br>

## <a name="further-reading"></a>Further Reading

* Manual and related
    * `man awk` and `info awk` for quick reference from command line
    * [gawk manual](https://www.gnu.org/software/gawk/manual/gawk.html#SEC_Contents) for complete reference, extensions and more
    * [awk FAQ](http://www.faqs.org/faqs/computer-lang/awk/faq/) - from 2002, but plenty of information, especially about all the various `awk` implementations
* this tutorial has also been [converted to an ebook](https://github.com/learnbyexample/learn_gnuawk) with additional descriptions, examples, a chapter on regular expressions, etc.
* What's up with different `awk` versions?
    * [unix.stackexchange - brief explanation](https://unix.stackexchange.com/questions/29576/difference-between-gawk-vs-awk)
    * [Differences between gawk, nawk, mawk, and POSIX awk](https://archive.is/btGky)
    * [cheat sheet for awk/nawk/gawk](https://catonmat.net/ftp/awk.cheat.sheet.txt)
* Tutorials and Q&A
    * [code.snipcademy - gentle intro](https://code.snipcademy.com/tutorials/shell-scripting/awk/introduction)
    * [funtoo - using examples](https://www.funtoo.org/Awk_by_Example,_Part_1)
    * [grymoire - detailed tutorial](https://www.grymoire.com/Unix/Awk.html) - covers information about different `awk` versions as well
    * [catonmat - one liners explained](https://catonmat.net/awk-one-liners-explained-part-one)
    * [Why Learn AWK?](https://blog.jpalardy.com/posts/why-learn-awk/)
    * [awk Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/awk?sort=votes&pageSize=15)
    * [awk Q&A on unix.stackexchange](https://unix.stackexchange.com/questions/tagged/awk?sort=votes&pageSize=15)
* Alternatives
    * [GNU datamash](https://www.gnu.org/software/datamash/alternatives/)
    * [bioawk](https://github.com/lh3/bioawk)
    * [hawk](https://github.com/gelisam/hawk/blob/master/doc/README.md) - based on Haskell
    * [miller](https://github.com/johnkerl/miller) - similar to awk/sed/cut/join/sort for name-indexed data such as CSV, TSV, and tabular JSON
        * See this [ycombinator news](https://news.ycombinator.com/item?id=10066742) for other tools like this
* miscellaneous
    * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed)
    * [awk-libs](https://github.com/e36freak/awk-libs) - lots of useful functions
    * [awkaster](https://github.com/TheMozg/awk-raycaster) - Pseudo-3D shooter written completely in awk using raycasting technique
    * [awk REPL](https://awk.js.org/) - live editor on browser
* examples for some of the stuff not covered in this tutorial
    * [unix.stackexchange - rand/srand](https://unix.stackexchange.com/questions/372816/awk-get-random-lines-of-file-satisfying-a-condition)
    * [unix.stackexchange - strftime](https://unix.stackexchange.com/questions/224969/current-date-in-awk)
    * [unix.stackexchange - ARGC and ARGV](https://unix.stackexchange.com/questions/222146/awk-does-not-end/222150#222150)
    * [stackoverflow - arbitrary precision integer extension](https://stackoverflow.com/questions/46904447/strange-output-while-comparing-engineering-numbers-in-awk)
    * [stackoverflow - recognizing hexadecimal numbers](https://stackoverflow.com/questions/3683110/how-to-make-calculations-on-hexadecimal-numbers-with-awk)
    * [unix.stackexchange - sprintf and close](https://unix.stackexchange.com/questions/223727/splitting-file-for-every-10000-numbers-not-lines/223739#223739)
    * [unix.stackexchange - user defined functions and array passing](https://unix.stackexchange.com/questions/72469/gawk-passing-arrays-to-functions)
    * [unix.stackexchange - rename csv files based on number of fields in header row](https://unix.stackexchange.com/questions/408742/count-number-of-columns-in-csv-files-and-rename-if-less-than-11-columns)


================================================
FILE: gnu_grep.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnugrep_ripgrep/. The ebook also has content updated for newer version of the commands, includes exercises, solutions, has a separate chapter for popular alternative `ripgrep`, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnugrep_ripgrep

---

<br> <br> <br>

# <a name="gnu-grep"></a>GNU grep

**Table of Contents**

* [Simple string search](#simple-string-search)
* [Case insensitive search](#case-insensitive-search)
* [Invert matching lines](#invert-matching-lines)
* [Line number, count and limiting output lines](#line-number-count-and-limiting-output-lines)
* [Multiple search strings](#multiple-search-strings)
* [File names in output](#file-names-in-output)
* [Match whole word or line](#match-whole-word-or-line)
* [Colored output](#colored-output)
* [Get only matching portion](#get-only-matching-portion)
* [Context matching](#context-matching)
* [Recursive search](#recursive-search)
    * [Basic recursive search](#basic-recursive-search)
    * [Exclude/Include specific files/directories](#excludeinclude-specific-filesdirectories)
    * [Recursive search with bash options](#recursive-search-with-bash-options)
    * [Recursive search using find command](#recursive-search-using-find-command)
    * [Passing file names to other commands](#passing-file-names-to-other-commands)
* [Search strings from file](#search-strings-from-file)
* [Options for scripting purposes](#options-for-scripting-purposes)
* [Regular Expressions - BRE/ERE](#regular-expressions-breere)
    * [Line Anchors](#line-anchors)
    * [Word Anchors](#word-anchors)
    * [Alternation](#alternation)
    * [The dot meta character](#the-dot-meta-character)
    * [Quantifiers](#quantifiers)
    * [Character classes](#character-classes)
    * [Grouping](#grouping)
    * [Back reference](#back-reference)
* [Multiline matching](#multiline-matching)
* [Perl Compatible Regular Expressions](#perl-compatible-regular-expressions)
    * [Backslash sequences](#backslash-sequences)
    * [Non-greedy matching](#non-greedy-matching)
    * [Lookarounds](#lookarounds)
    * [Ignoring specific matches](#ignoring-specific-matches)
    * [Re-using regular expression pattern](#re-using-regular-expression-pattern)
* [Gotchas and Tips](#gotchas-and-tips)
* [Regular Expressions Reference (ERE)](#regular-expressions-reference-ere)
    * [Anchors](#anchors)
    * [Character Quantifiers](#character-quantifiers)
    * [Character classes and backslash sequences](#character-classes-and-backslash-sequences)
    * [Pattern groups](#pattern-groups)
    * [Basic vs Extended Regular Expressions](#basic-vs-extended-regular-expressions)
* [Further Reading](#further-reading)

<br>

```bash
$ grep -V | head -1
grep (GNU grep) 2.25

$ man grep
GREP(1)                     General Commands Manual                    GREP(1)

NAME
       grep, egrep, fgrep, rgrep - print lines matching a pattern

SYNOPSIS
       grep [OPTIONS] PATTERN [FILE...]
       grep [OPTIONS] [-e PATTERN]...  [-f FILE]...  [FILE...]

DESCRIPTION
       grep searches the named input FILEs for lines containing a match to the
       given PATTERN.  If no files are specified, or if the file “-” is given,
       grep  searches  standard  input.   By default, grep prints the matching
       lines.

       In addition, the variant programs egrep, fgrep and rgrep are  the  same
       as  grep -E,  grep -F,  and  grep -r, respectively.  These variants are
       deprecated, but are provided for backward compatibility.
...
```

**Note** For more detailed documentation and examples, use `info grep`

<br>

## <a name="simple-string-search"></a>Simple string search

* First specify the search pattern (usually enclosed in single quotes) and then the file input
* More than one file can be specified or input given from stdin

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ grep 'are' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ grep 'so are' poem.txt
And so are you.
```

* If search string contains any regular expression meta characters like `^$\.*[]` (covered later), use the `-F` option or `fgrep` if available

```bash
$ echo 'int a[5]' | grep 'a[5]'
$ echo 'int a[5]' | grep -F 'a[5]'
int a[5]
$ echo 'int a[5]' | fgrep 'a[5]'
int a[5]
```

* See [Gotchas and Tips](#gotchas-and-tips) section if you get strange issues

<br>

## <a name="case-insensitive-search"></a>Case insensitive search

```bash
$ grep -i 'rose' poem.txt
Roses are red,

$ grep -i 'and' poem.txt
And so are you.
```

<br>

## <a name="invert-matching-lines"></a>Invert matching lines

* Use the `-v` option to get lines other than those matching the search string
* Tip: Look out for other opposite pairs like `-l -L`, `-h -H`, opposites in regular expression, etc

```bash
$ grep -v 'are' poem.txt
Sugar is sweet,

$ # example for input from stdin
$ seq 5 | grep -v '3'
1
2
4
5
```

<br>

## <a name="line-number-count-and-limiting-output-lines"></a>Line number, count and limiting output lines

* Show line number of matching lines

```bash
$ grep -n 'sweet' poem.txt
3:Sugar is sweet,
```

* Count number of matching lines

```bash
$ grep -c 'are' poem.txt
3
```

* Limit number of matching lines

```bash
$ grep -m2 'are' poem.txt
Roses are red,
Violets are blue,
```

<br>

## <a name="multiple-search-strings"></a>Multiple search strings

* Match any

```bash
$ # search blue or you
$ grep -e 'blue' -e 'you' poem.txt
Violets are blue,
And so are you.
```

If there are a lot of search strings, use a file input

**Note** Be careful to avoid empty lines in the file, as an empty line would result in matching all the lines

```bash
$ printf 'rose\nsugar\n' > search_strings.txt
$ cat search_strings.txt
rose
sugar

$ # -f option accepts file input with search terms in separate lines
$ grep -if search_strings.txt poem.txt
Roses are red,
Sugar is sweet,
```
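
* A small sketch of the empty line issue noted above; the temporary directory and the file names `lines.txt`, `terms.txt` and `clean_terms.txt` are made up for illustration

```bash
tmp=$(mktemp -d)
printf 'Roses are red,\nViolets are blue,\nSugar is sweet,\n' > "$tmp/lines.txt"
# note the empty line between the two search terms
printf 'rose\n\nsugar\n' > "$tmp/terms.txt"

# empty line acts as an empty pattern, which matches every line
with_empty=$(grep -ic -f "$tmp/terms.txt" "$tmp/lines.txt")

# remove empty lines from the search terms first
grep -v '^$' "$tmp/terms.txt" > "$tmp/clean_terms.txt"
without_empty=$(grep -ic -f "$tmp/clean_terms.txt" "$tmp/lines.txt")

echo "$with_empty $without_empty"
rm -r "$tmp"
```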

* Match all

```bash
$ # match line containing both are & And
$ grep 'are' poem.txt | grep 'And'
And so are you.
```

<br>

## <a name="file-names-in-output"></a>File names in output

* `-l` to get files matching the search
* `-L` to get files not matching the search
* In both cases, `grep` skips the rest of the file once a match is found

```bash
$ grep -l 'Rose' poem.txt
poem.txt

$ grep -L 'are' poem.txt search_strings.txt
search_strings.txt
```

* Prefix file name to search results
* `-h` is default for single file input, no file name prefix in output
* `-H` is default for multiple file input, file name prefix in output

```bash
$ grep -h 'Rose' poem.txt
Roses are red,
$ grep -H 'Rose' poem.txt
poem.txt:Roses are red,

$ # -H is default for multiple file input
$ grep -i 'sugar' poem.txt search_strings.txt
poem.txt:Sugar is sweet,
search_strings.txt:sugar
$ grep -ih 'sugar' poem.txt search_strings.txt
Sugar is sweet,
sugar
```

<br>

## <a name="match-whole-word-or-line"></a>Match whole word or line

* Word search using `-w` option
    * a word consists of letters, digits and the underscore character
* This will ensure that given patterns are not surrounded by other word characters
    * this is slightly different than using word boundaries in regular expressions
* For example, this helps to distinguish `par` from `spar`, `part`, etc

```bash
$ printf 'par value\nheir apparent\n' | grep 'par'
par value
heir apparent

$ printf 'par value\nheir apparent\n' | grep -w 'par'
par value

$ printf 'scare\ncart\ncar\nmacaroni\n' | grep -w 'car'
car
```
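
* A sketch of how `-w` differs from word boundaries, assuming GNU grep (`\b` is a GNU extension); `-c` is used here only to make the result easy to compare

```bash
# no match, as there is no word boundary between the space and '+'
with_boundary=$(echo '2 +3=5' | grep -c '\b+3\b' || true)

# -w only checks that the match is not surrounded by word characters
with_w=$(echo '2 +3=5' | grep -cw '+3' || true)

echo "$with_boundary $with_w"
```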

* Another useful option is `-x` to match only complete line, not anywhere in the line

```bash
$ printf 'see my book list\nmy book\n' | grep 'my book'
see my book list
my book

$ printf 'see my book list\nmy book\n' | grep -x 'my book'
my book

$ printf 'scare\ncart\ncar\nmacaroni\n' | grep -x 'car'
car
```

<br>

## <a name="colored-output"></a>Colored output

* Highlight search strings, line numbers, file name, etc in different colors
    * Depends on color support in terminal being used
* options to `--color` are
    * `auto` when output is redirected (another command, file, etc) the color information won't be passed
    * `always` when output is redirected (another command, file, etc) the color information will also be passed
    * `never` explicitly specify no highlighting

```bash
$ # can also use grep --color 'blue' as auto is default
$ grep --color=auto 'blue' poem.txt
Violets are blue,
```

* Sample screenshot

![grep color output](./images/color_option.png)

* Example to show difference between `auto` and `always`

```bash
$ grep --color=auto 'blue' poem.txt > saved_output.txt
$ cat -v saved_output.txt
Violets are blue,
$ grep --color=always 'blue' poem.txt > saved_output.txt
$ cat -v saved_output.txt
Violets are ^[[01;31m^[[Kblue^[[m^[[K,

$ # some commands like 'less' are capable of using the color information
$ grep --color=always 'are' poem.txt | less -R
$ # highlight multiple matching patterns
$ grep --color=always 'are' poem.txt | grep --color 'd'
Roses are red,
And so are you.
```

<br>

## <a name="get-only-matching-portion"></a>Get only matching portion

* The `-o` option to get only matched portion is more useful with regular expressions
* Comes in handy if overall number of matches is required, instead of only line wise

```bash
$ grep -o 'are' poem.txt
are
are
are

$ # -c only gives count of matching lines
$ grep -c 'e' poem.txt
4
$ grep -co 'e' poem.txt
4
$ # so need another command to get count of all matches
$ grep -o 'e' poem.txt | wc -l
9
```

<br>

## <a name="context-matching"></a>Context matching

* The `-A`, `-B` and `-C` options are useful to get lines after/before/around matching line respectively

```bash
$ grep -A1 'blue' poem.txt
Violets are blue,
Sugar is sweet,
$ grep -B1 'blue' poem.txt
Roses are red,
Violets are blue,
$ grep -C1 'blue' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
```

* If there are multiple non-adjacent matching segments, by default `grep` adds a line `--` to separate them
    * non-adjacent here implies that segments are separated by at least one line in input data

```bash
$ seq 29 | grep -A1 '3'
3
4
--
13
14
--
23
24
```

* Use `--no-group-separator` option if the separator line is a hindrance, for example feeding the output of `grep` to another program

```bash
$ seq 29 | grep --no-group-separator -A1 '3'
3
4
13
14
23
24
```

* Use `--group-separator` to customize the separator

```bash
$ seq 29 | grep --group-separator='*****' -A1 '3'
3
4
*****
13
14
*****
23
24
```

<br>

## <a name="recursive-search"></a>Recursive search

First let's create some more test files

```bash
$ mkdir -p test_files/hidden_files
$ printf 'Red\nGreen\nBlue\nBlack\nWhite\n' > test_files/colors.txt
$ printf 'Violet\nIndigo\nBlue\nGreen\nYellow\nOrange\nRed\n' > test_files/vibgyor.txt
$ printf '#!/usr/bin/python3\n\nprint("Hello World")\n' > test_files/hello.py
$ printf 'I like yellow\nWhat about you\n' > test_files/hidden_files/.fav_color.info
```

From `man grep`

```bash
       -r, --recursive
              Read all files  under  each  directory,  recursively,  following
              symbolic  links only if they are on the command line.  Note that
              if  no  file  operand  is  given,  grep  searches  the   working
              directory.  This is equivalent to the -d recurse option.

       -R, --dereference-recursive
              Read  all  files  under each directory, recursively.  Follow all
              symbolic links, unlike -r.
```

<br>

#### <a name="basic-recursive-search"></a>Basic recursive search

* Note that `-H` option automatically activates for multiple file input

```bash
$ # by default, current working directory is searched
$ grep -r 'red'
poem.txt:Roses are red,

$ grep -ri 'red'
poem.txt:Roses are red,
test_files/colors.txt:Red
test_files/vibgyor.txt:Red

$ grep -rin 'red'
poem.txt:1:Roses are red,
test_files/colors.txt:1:Red
test_files/vibgyor.txt:7:Red

$ grep -ril 'red'
poem.txt
test_files/colors.txt
test_files/vibgyor.txt
```

<br>

#### <a name="excludeinclude-specific-filesdirectories"></a>Exclude/Include specific files/directories

* By default, recursive search includes hidden files as well
* They can be excluded by file name or directory name
    * [glob](https://github.com/learnbyexample/Linux_command_line/blob/master/Shell.md#wildcards) patterns can be used
    * for example: `*.[ch]` to specify all files ending with `.c` or `.h`
* The exclusion options can be used multiple times
    * for example: `--exclude='*.txt' --exclude='*.log'` or specified from a file using `--exclude-from=FILE`
* To search only files with specific pattern in their names, use `--include=GLOB`
* **Note:** exclusion/inclusion applies only to basename of file/directory, not the entire path
* To follow all symbolic links (not directly specified as arguments, but found on recursive search), use `-R` instead of `-r`

```bash
$ grep -ri 'you'
poem.txt:And so are you.
test_files/hidden_files/.fav_color.info:What about you

$ # exclude file names starting with `.` i.e hidden files
$ grep -ri --exclude='.*' 'you'
poem.txt:And so are you.

$ # include only file names ending with `.info`
$ grep -ri --include='*.info' 'you'
test_files/hidden_files/.fav_color.info:What about you

$ # exclude a directory
$ grep -ri --exclude-dir='hidden_files' 'you'
poem.txt:And so are you.

$ # If you are using git(or similar), this would be handy
$ # grep --exclude-dir='.git' -rl 'search pattern'
```
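
* A self-contained sketch of exclusion/inclusion by base name; the directory layout (`src`, `logs`) and the search string `needle` are assumptions made for illustration

```bash
tmp=$(mktemp -d)
mkdir -p "$tmp/src/logs"
echo 'needle' > "$tmp/src/a.txt"
echo 'needle' > "$tmp/src/logs/b.txt"
echo 'needle' > "$tmp/src/logs/c.log"

# directory is excluded using its base name 'logs'
no_logs=$(cd "$tmp" && grep -rl --exclude-dir='logs' 'needle')

# only file names ending with .log are searched
only_log=$(cd "$tmp" && grep -rl --include='*.log' 'needle')

echo "$no_logs"
echo "$only_log"
rm -r "$tmp"
```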

<br>

#### <a name="recursive-search-with-bash-options"></a>Recursive search with bash options

* Using `bash` options `globstar` (for recursion)
    * Other options like `extglob` and `dotglob` come in handy too
    * See [glob](https://github.com/learnbyexample/Linux_command_line/blob/master/Shell.md#wildcards) for more info on these options
* The `-d skip` option tells grep to skip directories instead of trying to treat them as text files to be searched

```bash
$ grep -ril 'yellow'
test_files/hidden_files/.fav_color.info
test_files/vibgyor.txt

$ # recursive search
$ shopt -s globstar
$ grep -d skip -il 'yellow' **/*
test_files/vibgyor.txt

$ # include hidden files as well
$ shopt -s dotglob
$ grep -d skip -il 'yellow' **/*
test_files/hidden_files/.fav_color.info
test_files/vibgyor.txt

$ # use extended glob patterns
$ shopt -s extglob
$ # other than poem.txt
$ grep -d skip -il 'red' **/!(poem.txt)
test_files/colors.txt
test_files/vibgyor.txt
$ # other than poem.txt or colors.txt
$ grep -d skip -il 'red' **/!(poem|colors).txt
test_files/vibgyor.txt
```

<br>

#### <a name="recursive-search-using-find-command"></a>Recursive search using find command

* `find` offers more control over which files get searched (type, name, size, etc) than the `grep` options seen so far
* See also [this guide](./wheres_my_file.md#find) for more examples/tutorials on using `find`

```bash
$ # all files, including hidden ones
$ find -type f -exec grep -il 'red' {} +
./poem.txt
./test_files/colors.txt
./test_files/vibgyor.txt

$ # all files ending with .txt
$ find -type f -name '*.txt' -exec grep -in 'you' {} +
./poem.txt:4:And so are you.

$ # all files not ending with .txt
$ find -type f -not -name '*.txt' -exec grep -in 'you' {} +
./test_files/hidden_files/.fav_color.info:2:What about you
```

<br>

#### <a name="passing-file-names-to-other-commands"></a>Passing file names to other commands

* To pass the filtered file names to another command, see if the receiving command can differentiate file names by ASCII NUL character
* If so, use the `-Z` option so that each file name in `grep` output is terminated with NUL character; commands like `xargs` have the `-0` option to understand it
* This helps when file names can have characters like space, newline, etc
* Typical use case: Search and replace something in all files matching some pattern, for ex: `grep -rlZ 'PAT1' | xargs -0 sed -i 's/PAT2/REPLACE/g'`

```bash
$ # prompt at end of line not shown for simplicity
$ # ^@ here indicates the NUL character
$ grep -rlZ 'you' | cat -A
poem.txt^@test_files/hidden_files/.fav_color.info^@

$ # print first column from all lines of all files
$ grep -rlZ 'you' | xargs -0 awk '{print $1}'
Roses
Violets
Sugar
And
I
What
```

* Simple example to show file names with spaces causing issues if `-Z` is not used

```bash
$ # 'abc xyz.txt' is a file with space in its name
$ grep -ri 'are'
abc xyz.txt:hi how are you
poem.txt:Roses are red,
poem.txt:Violets are blue,
poem.txt:And so are you.
saved_output.txt:Violets are blue,

$ # problem when -Z is not used
$ grep -ril 'are' | xargs grep 'you'
grep: abc: No such file or directory
grep: xyz.txt: No such file or directory
poem.txt:And so are you.

$ # no issues if -Z is used
$ grep -rilZ 'are' | xargs -0 grep 'you'
abc xyz.txt:hi how are you
poem.txt:And so are you.
```

* Example for matching more than one search string anywhere in file

```bash
$ # files containing 'you'
$ grep -rl 'you'
poem.txt
test_files/hidden_files/.fav_color.info

$ # files containing 'you' as well as 'are'
$ grep -rlZ 'you' | xargs -0 grep -l 'are'
poem.txt

$ # files containing 'you' but NOT 'are'
$ grep -rlZ 'you' | xargs -0 grep -L 'are'
test_files/hidden_files/.fav_color.info
```

* another example

```bash
$ grep -rilZ 'red' | xargs -0 grep -il 'blue'
poem.txt
test_files/colors.txt
test_files/vibgyor.txt

$ # note the use of `-Z` for middle command
$ grep -rilZ 'red' | xargs -0 grep -ilZ 'blue' | xargs -0 grep -il 'violet'
poem.txt
test_files/vibgyor.txt
```
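The search-and-replace use case mentioned at the start of this section can be sketched as a small script. This assumes GNU `sed` (for `-i` without an argument); the directory and file contents are made up for illustration.

```bash
# scratch directory, includes a file name containing a space
dir=$(mktemp -d)
printf 'hi how are you\n'  > "$dir/abc xyz.txt"
printf 'And so are you.\n' > "$dir/poem.txt"
printf 'thank you\n'       > "$dir/other.txt"

# replace 'you' with 'YOU', but only in files that contain 'are'
# -Z and -0 keep 'abc xyz.txt' intact as a single argument
grep -rlZ 'are' "$dir" | xargs -0 sed -i 's/you/YOU/g'

replaced=$(cat "$dir/abc xyz.txt")
untouched=$(cat "$dir/other.txt")
printf '%s\n' "$replaced" "$untouched"
rm -r "$dir"
```

`other.txt` contains `you` but not `are`, so it is never passed to `sed` and stays as it was.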

<br>

## <a name="search-strings-from-file"></a>Search strings from file

* Search terms can be read from a file using the `-f` option, one term per line
* The `-F` option forces the terms to be matched literally (no regular expressions)
* See also [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash) - read all answers

```bash
$ grep -if test_files/colors.txt poem.txt
Roses are red,
Violets are blue,

$ # get common lines between two files
$ grep -Fxf test_files/colors.txt test_files/vibgyor.txt
Blue
Green
Red

$ # get lines present in vibgyor.txt but not in colors.txt
$ grep -Fvxf test_files/colors.txt test_files/vibgyor.txt
Violet
Indigo
Yellow
Orange
```
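To see why `-F` matters when the search terms come from a file, here is a minimal sketch with a made-up terms file whose only entry contains the `.` meta character.

```bash
dir=$(mktemp -d)
# one search term per line; the dot is a regex meta character
printf 'a.c\n' > "$dir/terms.txt"
printf 'a.c\nabc\naxc\n' > "$dir/input.txt"

# treated as regular expressions, a.c also matches abc and axc
regex_count=$(grep -c -f "$dir/terms.txt" "$dir/input.txt")
# with -F, only the literal string a.c matches
fixed_count=$(grep -c -F -f "$dir/terms.txt" "$dir/input.txt")

echo "regex: $regex_count, fixed: $fixed_count"
rm -r "$dir"
```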

<br>

## <a name="options-for-scripting-purposes"></a>Options for scripting purposes

* In scripts, it is often enough to know whether a pattern matches or not
* The `-q` option doesn't print anything on stdout; exit status is `0` if a match is found and `1` if not
    * Check out [this practical script](https://github.com/learnbyexample/command_help/blob/master/ch) using the `-q` option

```bash
$ grep -qi 'rose' poem.txt
$ echo $?
0
$ grep -qi 'lily' poem.txt
$ echo $?
1

$ if grep -qi 'rose' poem.txt; then echo 'match found!'; else echo 'match not found'; fi
match found!
$ if grep -qi 'lily' poem.txt; then echo 'match found!'; else echo 'match not found'; fi
match not found
```

* The `-s` option suppresses error messages about nonexistent or unreadable files; the exit status is still `2` in such cases

```bash
$ grep 'rose' file_xyz.txt
grep: file_xyz.txt: No such file or directory
$ grep -s 'rose' file_xyz.txt
$ echo $?
2

$ touch foo.txt
$ chmod -r foo.txt
$ grep 'rose' foo.txt
grep: foo.txt: Permission denied
$ grep -s 'rose' foo.txt
$ echo $?
2
```
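Putting `-q` and `-s` together, a script can branch on all three exit status values. The sketch below uses a hypothetical helper named `check_pattern` and a temporary file; neither is part of `grep` itself.

```bash
# hypothetical helper: report whether a pattern is present in a file
check_pattern() {
    grep -qs -- "$1" "$2"
    case $? in
        0) echo 'found' ;;
        1) echo 'not found' ;;
        *) echo 'error' ;;
    esac
}

tmp=$(mktemp)
printf 'Roses are red,\n' > "$tmp"

r_match=$(check_pattern 'Roses' "$tmp")
r_nomatch=$(check_pattern 'lily' "$tmp")
# file that does not exist, -s keeps the error message off the terminal
r_missing=$(check_pattern 'Roses' "$tmp.missing")

printf '%s\n' "$r_match" "$r_nomatch" "$r_missing"
rm "$tmp"
```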

<br>

## <a name="regular-expressions-breere"></a>Regular Expressions - BRE/ERE

Before diving into regular expressions, here are a few examples to show the default `grep` behavior vs `-F`

```bash
$ # oops, why did it not match?
$ echo 'int a[5]' | grep 'a[5]'

$ # where did that error come from??
$ echo 'int a[5]' | grep 'a['
grep: Invalid regular expression

$ # what is going on???
$ echo 'int a[5]' | grep 'a[5'
grep: Unmatched [ or [^

$ # phew, -F is a life saver
$ echo 'int a[5]' | grep -F 'a[5]'
int a[5]

$ # [ and ] are meta characters, details in following sections
$ echo 'int a[5]' | grep 'a\[5]'
int a[5]
```

* By default, `grep` treats the search pattern as BRE (Basic Regular Expression)
    * `-G` option can be used to specify explicitly that BRE is used
* The `-E` option enables ERE (Extended Regular Expression), which in GNU grep's case differs from BRE only in how meta characters are written; there is no difference in regular expression functionality
* If `-F` option is used, the search string is treated literally
* If available, one can also use `-P` which indicates PCRE (Perl Compatible Regular Expression)
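As a quick sketch of how the same pattern is interpreted under `-G`, `-E` and `-F`, the script below counts matching lines for the pattern `a+` on made-up input.

```bash
input='a+b
aab
b'

# BRE: + is a literal character, only the line a+b matches
bre_count=$(printf '%s\n' "$input" | grep -cG 'a+')
# ERE: + is a quantifier, any line with one or more a matches
ere_count=$(printf '%s\n' "$input" | grep -cE 'a+')
# fixed string: the two characters a+ are searched literally
fixed_count=$(printf '%s\n' "$input" | grep -cF 'a+')

echo "BRE: $bre_count, ERE: $ere_count, fixed: $fixed_count"
```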

<br>

#### <a name="line-anchors"></a>Line Anchors

* Often, search must match from beginning of line or towards end of line
* For example, an integer variable declaration in `C` will start with optional white-space, the keyword `int`, white-space and then variable(s)
    * This way one can avoid matching declarations inside single line comments as well.
* Similarly, one might want to match a variable at end of statement
* The meta characters for line anchoring are `^` for beginning of line and `$` for end of line

```bash
$ echo 'Fantasy is my favorite genre' > fav.txt
$ echo 'My favorite genre is Fantasy' >> fav.txt
$ cat fav.txt
Fantasy is my favorite genre
My favorite genre is Fantasy

$ # start of line
$ grep '^Fantasy' fav.txt
Fantasy is my favorite genre

$ # end of line
$ grep 'Fantasy$' fav.txt
My favorite genre is Fantasy

$ # without anchors
$ grep 'Fantasy' fav.txt
Fantasy is my favorite genre
My favorite genre is Fantasy
```

* As the meta characters have special meaning (assuming the `-F` option is not used), they have to be escaped using `\` to match them literally
* The `\` itself is a meta character, so to match it literally, use `\\`
* In BRE (the default), the line anchors `^` and `$` have special meaning only when they are present at the start/end of the regular expression; with `-E`, they act as anchors wherever they appear

```bash
$ echo '^foo bar$' | grep '^foo'
$ echo '^foo bar$' | grep '\^foo'
^foo bar$
$ echo '^foo bar$' | grep '^^foo'
^foo bar$

$ echo '^foo bar$' | grep 'bar$'
$ echo '^foo bar$' | grep 'bar\$'
^foo bar$
$ echo '^foo bar$' | grep 'bar$$'
^foo bar$

$ echo 'foo $ bar' | grep ' $ '
foo $ bar

$ printf 'foo\cbar' | grep -o '\c'
c
$ printf 'foo\cbar' | grep -o '\\c'
\c
```
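Two common uses of combining both anchors, as a minimal sketch on made-up input: `^$` matches empty lines, and `^foo$` restricts the match to the whole line, which is what the `-x` option does as well.

```bash
# number of empty lines
empty_count=$(printf 'foo\n\nfoo bar\n\nbar\n' | grep -c '^$')

# whole line must be foo, with line number prefix
anchored=$(printf 'foo\n\nfoo bar\n\nbar\n' | grep -n '^foo$')
# same result using the -x option
with_x=$(printf 'foo\n\nfoo bar\n\nbar\n' | grep -nx 'foo')

printf '%s\n' "$empty_count" "$anchored" "$with_x"
```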

<br>

#### <a name="word-anchors"></a>Word Anchors

* The `-w` option works well to match whole words. But what about matching only start or end of words?
* Anchors `\<` and `\>` will match start/end positions of a word
* `\b` matches either edge of a word, so it can be used in place of both `\<` and `\>`

```bash
$ printf 'spar\npar\npart\napparent\n'
spar
par
part
apparent

$ # words ending with par
$ printf 'spar\npar\npart\napparent\n' | grep 'par\>'
spar
par

$ # words starting with par
$ printf 'spar\npar\npart\napparent\n' | grep '\<par'
par
part
```

* `-w` option is same as specifying both start and end word boundaries

```bash
$ printf 'spar\npar\npart\napparent\n' | grep '\<par\>'
par

$ printf 'spar\npar\npart\napparent\n' | grep '\bpar\b'
par

$ printf 'spar\npar\npart\napparent\n' | grep -w 'par'
par
```

* `\b` has an opposite `\B` which is quite useful too

```bash
$ # string not surrounded by word boundary either side
$ printf 'spar\npar\npart\napparent\n' | grep '\Bpar\B'
apparent

$ # word containing par but not as start of word
$ printf 'spar\npar\npart\napparent\n' | grep '\Bpar'
spar
apparent

$ # word containing par but not as end of word
$ printf 'spar\npar\npart\napparent\n' | grep 'par\B'
part
apparent
```

* the word boundary escape sequences differ slightly from `-w` option

```bash
$ # this fails because there is no word boundary between space and +
$ echo '2 +3 = 5' | grep '\b+3\b'
$ # this works as -w only ensures that there are no surrounding word characters
$ echo '2 +3 = 5' | grep -w '+3'
2 +3 = 5

$ # doesn't work as , isn't at start of word boundary
$ echo 'hi, 2 one' | grep '\<, 2\>'
$ # won't match as there are word characters before ,
$ echo 'hi, 2 one' | grep -w ', 2'
$ # works as \b matches both edges and , is at end of word after i
$ echo 'hi, 2 one' | grep '\b, 2\b'
hi, 2 one
```

<br>

#### <a name="alternation"></a>Alternation

* The `|` meta character is similar to using multiple `-e` options
* Each side of `|` is a complete regular expression with its own start/end anchors
* How each part of the alternation is handled and the order of evaluation/output are beyond the scope of this tutorial
    * See [this](https://www.regular-expressions.info/alternation.html) for more info on this topic.
* `|` is one of the meta characters that require different syntax between BRE/ERE

```bash
$ grep 'blue\|you' poem.txt
Violets are blue,
And so are you.
$ grep -E 'blue|you' poem.txt
Violets are blue,
And so are you.

$ # extract case-insensitive e or f from anywhere in line
$ echo 'Fantasy is my favorite genre' | grep -Eio 'e|f'
F
f
e
e
e

$ # extract case-insensitive e at end of line, f at start of line
$ echo 'Fantasy is my favorite genre' | grep -Eio 'e$|^f'
F
e
```
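The equivalence with multiple `-e` options can be checked with a short sketch. The poem is recreated in a shell variable here so that the script doesn't depend on `poem.txt` being present.

```bash
poem='Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.'

# one pattern per -e option
with_e=$(printf '%s\n' "$poem" | grep -e 'blue' -e 'you')
# single ERE pattern using alternation
with_alt=$(printf '%s\n' "$poem" | grep -E 'blue|you')

[ "$with_e" = "$with_alt" ] && echo 'same output'
printf '%s\n' "$with_alt"
```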

* A cool use case of alternation is using the `^` or `$` anchors to highlight the searched term while still displaying the lines that don't contain it
    * the line anchors match every input line, even empty ones, as they are position markers

```bash
$ grep --color=auto -E '^|are' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ grep --color=auto -E 'is|$' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
```

Screenshot for above example:

![highlighting string](./images/highlight_string_whole_file_op.png)

See also

* [stackoverflow - Grep output with multiple Colors](https://stackoverflow.com/questions/17236005/grep-output-with-multiple-colors)
* [unix.stackexchange - Multicolored Grep](https://unix.stackexchange.com/questions/104350/multicolored-grep)

<br>

#### <a name="the-dot-meta-character"></a>The dot meta character

The `.` meta character is used to match any single character

```bash
$ # any two characters surrounded by word boundaries
$ echo 'I have 12, he has 132!' | grep -ow '..'
12
he

$ # match three characters from start of line
$ # \t (TAB) is single character here
$ printf 'a\tbcd\n' | grep -o '^...'
a       b

$ # all three character word starting with c
$ echo 'car bat cod cope scat dot abacus' | grep -ow 'c..'
car
cod

$ echo '1 & 2' | grep -o '.'
1
 
&
 
2
```

<br>

#### <a name="quantifiers"></a>Greedy Quantifiers

Quantifiers define how many times a character (simplified for now) should be matched

* `?` will try to match 0 or 1 time
* For BRE, use `\?`

```bash
$ printf 'late\npale\nfactor\nrare\nact\n'
late
pale
factor
rare
act

$ # match a followed by t, with or without c in between
$ printf 'late\npale\nfactor\nrare\nact\n' | grep -E 'ac?t'
late
factor
act

$ # same as using this alternation
$ printf 'late\npale\nfactor\nrare\nact\n' | grep -E 'at|act'
late
factor
act
```

* `*` will try to match 0 or more times
* There is no upper limit and `*` will try to match as many times as possible
    * if matching the maximum number of times causes the overall regex to fail, the next best count is tried until the overall regex passes
    * if there are multiple quantifiers, the left-most quantifier gets precedence

```bash
$ echo 'abbbc' | grep -o 'b*'
bbb

$ # matches 0 or more b only if surrounded by a and c
$ echo 'abc ac adc abbc bbb bc' | grep -o 'ab*c'
abc
ac
abbc

$ # see how it matched everything
$ echo 'car bat cod map scat dot abacus' | grep -o '.*'
car bat cod map scat dot abacus

$ # but here it stops at m
$ echo 'car bat cod map scat dot abacus' | grep -o '.*m'
car bat cod m

$ # stopped at dot, not bat or scat - match as much as possible
$ echo 'car bat cod map scat dot abacus' | grep -o 'c.*t'
car bat cod map scat dot

$ # matching overall expression gets preference
$ echo 'car bat cod map scat dot abacus' | grep -o 'c.*at'
car bat cod map scat

$ # precedence is left to right in case of multiple matches
$ echo 'car bat cod map scat dot abacus' | grep -o 'b.*m'
bat cod m
$ echo 'car bat cod map scat dot abacus' | grep -o 'b.*m*'
bat cod map scat dot abacus
```

* `+` will try to match 1 or more times
* Another meta character that differs in syntax between BRE/ERE

```bash
$ echo 'abbbc' | grep -o 'b\+'
bbb
$ echo 'abbbc' | grep -oE 'b+'
bbb

$ echo 'abc ac adc abbc bbb bc' | grep -oE 'ab+c'
abc
abbc
$ echo 'abc ac adc abbc bbb bc' | grep -o 'ab*c'
abc
ac
abbc
```

* For more precise control on number of times to match, `{}` is useful
    * use `\{\}` for BRE
* It can take one of four forms, `{m,n}`, `{,n}`, `{m,}` and `{n}`

```bash
$ # {m,n} - m to n, including both m and n
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{1,2}c'
abc
abbc

$ # {,n} - 0 to n times
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{,2}c'
ac
abc
abbc

$ # {m,} - at least m times
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{2,}c'
abbc
abbbc

$ # {n} - exactly n times
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{2}c'
abbc
```
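A typical use of `{n}` is checking that a field has a fixed shape. The sketch below filters made-up input for lines shaped like `YYYY-MM-DD`, in both ERE and BRE syntax. It only checks the number of digits, not whether the date is valid.

```bash
input='2024-01-15
24-1-5
2024-1-15
12345-01-15'

# ERE syntax
dates_ere=$(printf '%s\n' "$input" | grep -xE '[0-9]{4}-[0-9]{2}-[0-9]{2}')
# BRE syntax needs \{ and \}
dates_bre=$(printf '%s\n' "$input" | grep -x '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}')

printf '%s\n' "$dates_ere" "$dates_bre"
```

The `-x` option is what rejects `12345-01-15`; without it, the pattern would match from the second character onwards.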

<br>

#### <a name="character-classes"></a>Character classes

* The meta character pair `[]` matches any one of the characters listed within `[]`
* Meta characters like `^`, `$` have different meaning inside and outside of `[]`
* Simple example first, matching any of the characters within `[]`

```bash
$ echo 'do so in to no on' | grep -ow '[nt]o'
to
no

$ echo 'do so in to no on' | grep -ow '[sot][on]'
so
to
on
```

* Adding a quantifier
* Check out [unix words](https://en.wikipedia.org/wiki/Words_(Unix)) and [sample words file](https://users.cs.duke.edu/~ola/ap/linuxwords)

```bash
$ # words made up of letters o and n, at least 2 letters
$ grep -xE '[on]{2,}' /usr/share/dict/words
no
non
noon
on

$ # lines containing only digits
$ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xE '[0123456789]+'
123
42
```

* Character ranges
* Matching any alphabet, number, hexadecimal number etc becomes cumbersome if every character has to be individually specified
* So, there's a shortcut, using `-` to construct a range (has to be specified in ascending order)
* See [ascii codes table](https://ascii.cl/) for reference
    * Note that behavior of range will differ for other character encodings
    * See **Character Classes and Bracket Expressions** as well as **LC_COLLATE under Environment Variables** sections in `info grep` for more detail
* [Matching Numeric Ranges with a Regular Expression](https://www.regular-expressions.info/numericranges.html)

```bash
$ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xE '[0-9]+'
123
42

$ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xiE '[a-z]+'
cat
foo
baz

$ # only valid decimal numbers
$ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xE '[0-9]+'
128
34

$ # only valid octal numbers
$ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xE '[0-7]+'
34

$ # only valid hexadecimal numbers
$ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xiE '[0-9a-f]+'
128
34
fe32

$ # numbers between 10-29
$ echo '23 54 12 92' | grep -owE '[12][0-9]'
23
12
```
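Ranges work on single characters, so a multi-digit numeric range has to be built from alternatives, as discussed in the numeric ranges link above. A sketch for whole lines holding an integer from 0 to 255 without leading zeros, on made-up input:

```bash
# alternatives: 250-255, 200-249, 100-199, 0-99
octets=$(printf '0\n7\n42\n199\n255\n256\n300\n007\n' |
    LC_ALL=C grep -xE '25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]')

printf '%s\n' "$octets"
```

`256`, `300` and `007` are rejected because `-x` requires one of the alternatives to match the entire line.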

* Negating character class
* By using `^` as first character inside `[]`, we get inverted character class
    * As pointed out earlier, some meta characters behave differently inside and outside of `[]`

```bash
$ # alphabetic words not starting with c
$ echo '123 core not sink code finish' | grep -owE '[^c][a-z]+'
not
sink
finish

$ # excluding numbers 2,3,4,9
$ # note that 200a 200; etc will also match, usage depends on knowing input
$ echo '2001 2004 2005 2008 2009' | grep -ow '200[^2-49]'
2001
2005
2008

$ # get characters from start of line up to (not including) known identifier
$ echo 'foo=bar; baz=123' | grep -oE '^[^=]+'
foo

$ # get characters at end of line from (not including) known identifier
$ echo 'foo=bar; baz=123' | grep -oE '[^=]+$'
123

$ # get all sequence of characters surrounded by unique identifier
$ echo 'I like "mango" and "guava"' | grep -oE '"[^"]+"'
"mango"
"guava"
```

* Matching meta characters inside `[]`
* Most meta characters like `( ) . + { } | $` don't have special meaning inside `[]` and hence do not require special treatment
* Some combination like `[.` or `=]` cannot be used in this order, as they have special meaning within `[]`
    * See **Character Classes and Bracket Expressions** section in `info grep` for more detail

```bash
$ # to match - it should be first or last character within []
$ echo 'Foo-bar 123-456 42 Co-operate' | grep -oiwE '[a-z-]+'
Foo-bar
Co-operate

$ # to match ] it should be first character within []
$ printf 'int a[5]\nfoo=bar\n' | grep '[]=]'
int a[5]
foo=bar

$ # to match [ use [ anywhere in the character list
$ # [][] will match both [ and ]
$ printf 'int a[5]\nfoo=bar\n' | grep '[[]'
int a[5]

$ # to match ^ it should be other than first in the list
$ echo '(a+b)^2 = a^2 + b^2 + 2ab' | grep -owE '[a-z^0-9]{3,}'
a^2
b^2
2ab
```

* Named character classes
* Equivalent class shown is for C locale and ASCII character encoding
    * See [ascii codes table](https://ascii.cl/) for reference
* See **Character Classes and Bracket Expressions** section in `info grep` for more detail

| Character classes | Description |
| ------------- | ----------- |
| `[:digit:]` | Same as `[0-9]` |
| `[:lower:]` | Same as `[a-z]` |
| `[:upper:]` | Same as `[A-Z]` |
| `[:alpha:]` | Same as `[a-zA-Z]` |
| `[:alnum:]` | Same as `[0-9a-zA-Z]` |
| `[:xdigit:]` | Same as `[0-9a-fA-F]` |
| `[:cntrl:]` | Control characters - first 32 ASCII characters and 127th (DEL) |
| `[:punct:]` | All the punctuation characters |
| `[:graph:]` | `[:alnum:]` and `[:punct:]` |
| `[:print:]` | `[:alnum:]`, `[:punct:]` and space |
| `[:blank:]` | Space and tab characters |
| `[:space:]` | white-space characters: tab, newline, vertical tab, form feed, carriage return and space |

```bash
$ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:alnum:]]*'
128
34
AB32
Foo
bar

$ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:lower:]]*'
bar

$ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:lower:]0-9]*'
128
34
bar
```
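One practical use of named classes, sketched on made-up input, is reporting lines that end with a space or tab character.

```bash
# note the double brackets: [:blank:] is only valid inside a bracket expression
lines=$(printf 'foo \nbar\nbaz\t\nqux\n' | grep -n '[[:blank:]]$' | cut -d: -f1)

# line numbers having trailing space/tab
printf '%s\n' "$lines"
```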

* Backslash character classes

| Character classes | Description |
| ------------- | ----------- |
| `\w` | Same as `[0-9a-zA-Z_]` or `[[:alnum:]_]` |
| `\W` | Same as `[^0-9a-zA-Z_]` or `[^[:alnum:]_]` |
| `\s` | Same as `[[:space:]]` |
| `\S` | Same as `[^[:space:]]` |

```bash
$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '\w*'
123
cmp_str
Foo_bar
$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '[[:alnum:]_]*'
123
cmp_str
Foo_bar

$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '\W*'
$#
$ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '[^[:alnum:]_]*'
$#
```

<br>

#### <a name="grouping"></a>Grouping

* Character classes match one character out of a list of choices, with a quantifier added if needed
* Grouping extends this idea to whole regular expressions instead of just a list of characters
* The meta characters `()` are used for grouping
    * requires `\(\)` for BRE
* Similar to `a(b+c)d = abd+acd` in maths, you get `a(b|c)d = abd|acd` in regular expressions

```bash
$ # 5 letter words starting with c and ending with ty or ly
$ grep -xE 'c..(ty|ly)' /usr/share/dict/words
catty
coyly
curly

$ # 7 letter words starting with e and ending with rged or sted
$ grep -xE 'e..(rg|st)ed' /usr/share/dict/words
emerged
existed

$ # repeat a pattern 3 times
$ grep -xE '([a-d][r-z]){3}' /usr/share/dict/words
avatar
awards
cravat

$ # nesting of () is allowed
$ grep -E '([as](p|c)[r-t]){2}' /usr/share/dict/words
scraps

$ # can be used to match specific columns in well defined tables
$ echo 'foo:123:bar:baz' | grep -E '^([^:]+:){2}bar'
foo:123:bar:baz
```

* See also [stackoverflow - matching character exactly n times in a line](https://stackoverflow.com/questions/40187643/grep-search-with-regex)
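For comparison, here is a sketch of the same grouped alternation written in ERE and in BRE syntax. A small made-up word list is used instead of `/usr/share/dict/words`, and `\|` in BRE is a GNU extension.

```bash
words='emerged
existed
entered
elected'

# ERE: plain ( ) and |
ere=$(printf '%s\n' "$words" | grep -xE 'e..(rg|st)ed')
# BRE: the same meta characters need a backslash
bre=$(printf '%s\n' "$words" | grep -x 'e..\(rg\|st\)ed')

printf '%s\n' "$ere"
```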

<br>

#### <a name="back-reference"></a>Back reference

* The string matched within `()` can be matched again by back referencing the captured group
* `\1` denotes the first matched group, `\2` the second one and so on
    * Order is leftmost `(` is `\1`, next one is `\2` and so on
* Note that the matched string, not the regular expression itself, is referenced
    * for ex: if `([0-9][a-f])` matches `3b`, then back referencing will be `3b` not any other valid match of the regular expression like `8f`, `0a` etc
    * Other regular expression flavors like PCRE do allow referencing the regular expression itself

```bash
$ # note how first three and last three letters are same
$ grep -xE '([a-d]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
$ # note how adding quantifier is not same as back-referencing
$ grep -m4 -xE '([a-d]..){2}' /usr/share/dict/words
abacus
abided
abides
ablaze

$ # words with consecutive repeated letters
$ echo 'eel flee all pat ilk seen' | grep -iowE '[a-z]*(.)\1[a-z]*'
eel
flee
all
seen

$ # 17 letter words with first and last as same letter
$ grep -xE '(.)[a-z]{15}\1' /usr/share/dict/words
semiprofessionals
transcendentalist
```

* Spotting repeated words

```bash
$ cat story.txt
singing tin in the rain
walking for for a cause
have a nice day
day and night

$ grep -wE '(\w+)\W+\1' story.txt
walking for for a cause
```
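Back references are also handy with delimited data. The sketch below, on made-up input, keeps lines whose first and last `:` separated fields are identical (at least three fields are needed for this pattern).

```bash
# \1 must repeat exactly what ([^:]+) matched at the start of the line
same_ends=$(printf 'a:b:a\nab:c:b\nx::x\nfoo:1:2:foo\n' |
    grep -xE '([^:]+):.*:\1')

printf '%s\n' "$same_ends"
```

`ab:c:b` is rejected since the captured string is `ab`, not `b`.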

* **Note** that there is an [issue for certain usage of back-reference and quantifier](https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864)

```bash
$ # no output
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
$ # works when nesting is unrolled
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ # no problem if PCRE is used instead of ERE
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
```

<br>

## <a name="multiline-matching"></a>Multiline matching

* If the input is small enough to fit in memory, the `-z` option comes in handy to match across multiple lines
* Instead of newline, the ASCII NUL character is used as the line separator
    * So, multiline matching depends on whether or not the input itself contains the NUL character
    * Text files usually don't contain the NUL character and `grep` treats input containing it as binary

```bash
$ # \0 for ASCII NUL character
$ printf 'red\nblue\n\0green\n' | cat -e
red$
blue$
^@green$

$ # see --binary-files=TYPE option in info grep for binary details
$ printf 'red\nblue\n\0green\n' | grep -a 'red'
red

$ # with -z, \0 marks the different 'lines'
$ printf 'red\nblue\n\0green\n' | grep -z 'red'
red
blue

$ # if no \0 in input, entire input read as single string
$ printf 'red\nblue\ngreen\n' | grep -z 'red'
red
blue
green
```

* `\n` is not defined in BRE/ERE
    * see [unix.stackexchange - How to specify characters using hexadecimal codes](https://unix.stackexchange.com/questions/19491/how-to-specify-characters-using-hexadecimal-codes-in-grep) for a workaround
* if some characteristics of the input are known, `[[:space:]]` can be used as a workaround, as it matches all white-space characters including newline

```bash
$ grep -oz 'Roses.*blue,[[:space:]]' poem.txt
Roses are red,
Violets are blue,
```
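With `-z`, the output records are terminated by the NUL character as well, not by newline. When the result is to be stored or processed further, `tr` can convert it, as in this sketch on made-up input.

```bash
# match spans the newline between first and second lines
multi=$(printf 'Roses are red,\nViolets are blue,\nSugar is sweet,\n' |
    grep -oz 'red,[[:space:]]Violets' | tr '\0' '\n')

printf '%s\n' "$multi"
```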

<br>

## <a name="perl-compatible-regular-expressions"></a>Perl Compatible Regular Expressions

```bash
$ # see also: https://github.com/learnbyexample/command_help
$ man grep | sed -n '/^\s*-P/,/^$/p'
       -P, --perl-regexp
              Interpret the pattern as a  Perl-compatible  regular  expression
              (PCRE).   This  is  highly  experimental and grep -P may warn of
              unimplemented features.

```

* The man page notes that `-P` is *highly experimental*. I haven't faced any issues so far, but do keep this in mind.
    * newer versions of `GNU grep` have fixes for some `-P` bugs, see [release notes](https://savannah.gnu.org/news/?group_id=67) for an overview of changes between versions
* Only a few highlights are presented here
* For more info
    * `man pcrepattern` or [read it online](https://www.pcre.org/original/doc/html/pcrepattern.html)
    * [perldoc - re](https://perldoc.perl.org/perlre.html) - Perl regular expression syntax, also links to other related tutorials
    * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)

<br>

#### <a name="backslash-sequences"></a>Backslash sequences

Some of the backslash constructs available in PCRE in addition to the ones already seen in ERE

* `\d` for `[0-9]`
* `\s` for `[ \t\r\n\f\v]`
* `\h` for `[ \t]`
* `\n` for newline character
* `\D`, `\S`, `\H`, `\N` etc for their opposites

```bash
$ # example for [0-9] in ERE and \d in PCRE
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oE '[0-9]+'
5
3
83
120
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '\d+'
5
3
83
120

$ # (?s) allows newlines to be matched as well when using the . meta character
$ grep -ozP '(?s)Roses.*blue,\n' poem.txt
Roses are red,
Violets are blue,
```

* See **INTERNAL OPTION SETTING** in `man pcrepattern` for more info on `(?s)`, `(?m)` etc
* [Specifying Modes Inside The Regular Expression](https://www.regular-expressions.info/modifiers.html) also has some detail on such options
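Since `-P` is not available in every `grep` build, a script may want to probe for it first. This is only a sketch of one way to do so, relying on the exit status of a trivial `-P` search and falling back to an ERE equivalent.

```bash
# exit status is 0 only if -P is accepted and the pattern matches
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    pcre=yes
    digits=$(echo 'foo=5, bar=3' | grep -oP '\d+')
else
    pcre=no
    # ERE fallback for \d+
    digits=$(echo 'foo=5, bar=3' | grep -oE '[0-9]+')
fi

echo "PCRE available: $pcre"
printf '%s\n' "$digits"
```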

<br>

#### <a name="non-greedy-matching"></a>Non-greedy matching

* Both BRE/ERE support only greedy quantifiers
    * match as much as possible
* PCRE supports a non-greedy version by adding `?` after the quantifier
    * match as little as possible
* See [this Python notebook](https://nbviewer.jupyter.org/url/norvig.com/ipython/pal3.ipynb) for an interesting project on palindrome sentences

```bash
$ echo 'foo and bar and baz went shopping bytes' | grep -oi '\w.*and'
foo and bar and

$ echo 'foo and bar and baz went shopping bytes' | grep -oiP '\w.*?and'
foo and
bar and

$ # recall that matching overall expression gets preference
$ echo 'foo and bar and baz went shopping bytes' | grep -oi '\w.*and baz'
foo and bar and baz
$ echo 'foo and bar and baz went shopping bytes' | grep -oiP '\w.*?and baz'
foo and bar and baz

$ # minimal matching with single character has simple workaround
$ echo 'A man, a plan, a canal, Panama' | grep -oi 'a.*,'
A man, a plan, a canal,
$ echo 'A man, a plan, a canal, Panama' | grep -oi 'a[^,]*,'
A man,
a plan,
a canal,
```

<br>

#### <a name="lookarounds"></a>Lookarounds

* Ability to add conditions to match before/after required pattern
* There are four types
    * positive lookahead `(?=`
    * negative lookahead `(?!`
    * positive lookbehind `(?<=`
    * negative lookbehind `(?<!`
* One way to remember is that **behind** uses `<` and **negative** uses `!` instead of `=`
* When used with the `-o` option, the lookaround portion won't be part of the output

Fixed and variable length *lookbehind*

```bash
$ # extract digits preceded by single lowercase letter and =
$ # this is fixed length lookbehind because length is known
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '(?<=\b[a-z]=)\d+'
83
120

$ # error because {2,} induces variable length matching
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '(?<=\b[a-z]{2,}=)\d+'
grep: lookbehind assertion is not fixed length

$ # use \K for such cases
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '\b[a-z]{2,}=\K\d+'
5
3
```

* Examples for lookarounds

```bash
$ # extract digits that follow =
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '=\K\d+'
5
3
83
120

$ # digits that follow = and has , after
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '=\K\d+(?=,)'
5
83

$ # extract words, but not those at start of line
$ echo 'car bat cod map' | grep -owP '(?<!^)\w+'
bat
cod
map

$ # extract words, but not those at start of line or end of line
$ echo 'car bat cod map' | grep -owP '(?<!^)\w+(?!$)'
bat
cod

$ # matching multiple search patterns in any order
$ grep -P '(?=.*are)(?=.*s).*d' poem.txt
Roses are red,
And so are you.
```
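Where `-P` is not available, the any-order filtering from the last example can be approximated by chaining one `grep` per search term. A sketch, with the poem recreated in a shell variable so that it doesn't depend on `poem.txt`:

```bash
poem='Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.'

# each stage keeps only the lines that also contain the next term
all_three=$(printf '%s\n' "$poem" | grep 'are' | grep 's' | grep 'd')

printf '%s\n' "$all_three"
```

This works for line filtering only; unlike lookarounds, it cannot be combined with `-o` to extract just a portion of the line.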

<br>

#### <a name="ignoring-specific-matches"></a>Ignoring specific matches

* A useful construct is `(*SKIP)(*F)`, which allows discarding matches that are not needed
* A simple way to use it: write the regular expression to be discarded first, append `(*SKIP)(*F)` and then add whichever pattern is required after `|`
* See [Excluding Unwanted Matches](https://www.rexegg.com/backtracking-control-verbs.html#skipfail) for more info

```bash
$ # all words except bat and map
$ echo 'car bat cod map' | grep -oP '(bat|map)(*SKIP)(*F)|\w+'
car
cod

$ # all words except those surrounded by double quotes
$ echo 'I like "mango" and "guava"' | grep -oP '"[^"]+"(*SKIP)(*F)|\w+'
I
like
and
```

<br>

#### <a name="re-using-regular-expression-pattern"></a>Re-using regular expression pattern

* `\1`, `\2` etc match only the exact string captured by the group
* `(?1)`, `(?2)` etc re-use the regular expression of the group itself

```bash
$ # (?1) refers to first group \d{4}-\d{2}-\d{2}
$ echo '2008-03-24 and 2012-08-12 foo' | grep -oP '(\d{4}-\d{2}-\d{2})\D+(?1)'
2008-03-24 and 2012-08-12
```

<br>

## <a name="gotchas-and-tips"></a>Gotchas and Tips

* Always quote the search string (unless you know what you are doing :P)

```bash
$ # spaces are special
$ grep so are poem.txt
grep: are: No such file or directory
poem.txt:And so are you.
$ grep 'so are' poem.txt
And so are you.

$ # use of # indicates start of comment
$ printf 'foo\na#2\nb#3\n' | grep #2
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
$ printf 'foo\na#2\nb#3\n' | grep '#2'
a#2
```

* Another common problem is that an unquoted search string is open to the shell's own globbing rules

```bash
$ # sample output on bash shell, might vary for different shells
$ echo '*.txt' | grep -F *.txt
$ echo '*.txt' | grep -F '*.txt'
*.txt
```

* Use double quotes for variable expansion, command substitution, etc (Note: could vary based on shell used)
* See [mywiki.wooledge Quotes](https://mywiki.wooledge.org/Quotes) for detailed discussion of quoting in `bash` shell

```bash
$ # sample output on bash shell, might vary for different shells
$ color='blue'
$ grep "$color" poem.txt
Violets are blue,
```

* Pattern starting with `-`

```bash
$ # this issue is not specific to grep alone
$ # the command assumes -2 is an option and hence the error
$ echo '5*3-2=13' | grep '-2'
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.

$ # workaround by using \-
$ echo '5*3-2=13' | grep '\-2'
5*3-2=13

$ # or use -- to indicate no further options to process
$ echo '5*3-2=13' | grep -- '-2'
5*3-2=13

$ # same issue with printf
$ printf '-1+2=1\n'
bash: printf: -1: invalid option
printf: usage: printf [-v var] format [arguments]
$ printf -- '-1+2=1\n'
-1+2=1
```
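Another option for patterns starting with `-` is to pass them via `-e`. The sketch below also shows the case where the pattern arrives in a shell variable (named `pat` here purely for illustration), where `--` guards against whatever the variable happens to contain.

```bash
# -e marks the next argument as the pattern, even if it starts with -
with_e=$(echo '5*3-2=13' | grep -e '-2')

# pattern coming from a variable, matched literally
pat='-2'
from_var=$(echo '5*3-2=13' | grep -F -- "$pat")

printf '%s\n' "$with_e" "$from_var"
```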

* Tip: Options can be specified at the end of the command as well, useful if an option was forgotten and has to be quickly added to the previous command from history

```bash
$ grep 'are' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ # use previous command from history, for ex up arrow key in bash
$ # then simply add the option at end
$ grep 'are' poem.txt -n
1:Roses are red,
2:Violets are blue,
4:And so are you.
```

* Speed boost with `LC_ALL=C` if the input file is ASCII
* See also [unix.stackexchange - Counting the number of lines having a number > 100](https://unix.stackexchange.com/questions/312297/counting-the-number-of-lines-having-a-number-greater-than-100/312330#312330) - where `grep` is blazing fast compared to other solutions

```bash
$ time grep -xE '([a-d][r-z]){3}' /usr/share/dict/words
avatar
awards
cravat

real    0m0.145s

$ time LC_ALL=C grep -xE '([a-d][r-z]){3}' /usr/share/dict/words
avatar
awards
cravat

real    0m0.011s
```

* Speed boost by using PCRE for back-references
* might be faster when using quantifiers as well

```bash
$ time LC_ALL=C grep -xE '([a-z]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
murmur
muumuu
pawpaw
pompom
tartar
testes

real    0m0.174s
$ time grep -xP '([a-z]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
murmur
muumuu
pawpaw
pompom
tartar
testes

real    0m0.008s
```

<br>

## <a name="regular-expressions-reference-ere"></a>Regular Expressions Reference (ERE)

<br>

#### <a name="anchors"></a>Anchors

* `^` match from start of line
* `$` match end of line
* `\<` match beginning of word
* `\>` match end of word
* `\b` match edge of word
* `\B` match other than edge of word
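
* A minimal sketch of the anchors listed above, written as a script. The sample words are made up for illustration

```bash
# hypothetical sample input, one entry per line
input='par
spar
party
apparent'

# whole line is exactly 'par'
line=$(echo "$input" | grep '^par$')
# 'par' at start of a word
start=$(echo "$input" | grep '\<par')
# 'par' at end of a word
end=$(echo "$input" | grep 'par\>')
# 'par' as a whole word
whole=$(echo "$input" | grep '\bpar\b')
# 'par' with word characters on both sides
inner=$(echo "$input" | grep '\Bpar\B')

printf '%s\n' "start: $start" "end: $end" "inner: $inner"
```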

<br>

#### <a name="character-quantifiers"></a>Character Quantifiers

* `.` match any single character
* `*` match preceding character/group 0 or more times
* `+` match preceding character/group 1 or more times
* `?` match preceding character/group 0 or 1 times
* `{m,n}` match preceding character/group m to n times, including m and n
* `{m,}` match preceding character/group m or more times
* `{,n}` match preceding character/group 0 to n times
* `{n}` match preceding character/group exactly n times
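
* A small sketch of how the quantifiers above behave, using made up input and `-x` so that the whole line has to match

```bash
# hypothetical sample input: 'a', then zero to three 'b', then 'c'
input='ac
abc
abbc
abbbc'

# 0 or more 'b': matches all four lines
star=$(echo "$input" | grep -xE 'ab*c')
# 1 or more 'b': ac is excluded
plus=$(echo "$input" | grep -xE 'ab+c')
# 0 or 1 'b': ac and abc
opt=$(echo "$input" | grep -xE 'ab?c')
# 1 to 2 'b': abc and abbc
range=$(echo "$input" | grep -xE 'ab{1,2}c')
# 2 or more 'b': abbc and abbbc
atleast=$(echo "$input" | grep -xE 'ab{2,}c')
# exactly 3 'b': abbbc
exact=$(echo "$input" | grep -xE 'ab{3}c')

printf '%s\n' "$range"
```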

<br>

#### <a name="character-classes-and-backslash-sequences"></a>Character classes and backslash sequences

* `[aeiou]` match any of these characters
* `[^aeiou]` do not match any of these characters
* `[a-z]` match any lowercase alphabet
* `[0-9]` match any digit character
* `\w` match alphabets, digits and underscore character, short cut for `[a-zA-Z0-9_]`
* `\W` opposite of `\w` , short cut for `[^a-zA-Z0-9_]`
* `\s` match white-space characters: tab, newline, vertical tab, form feed, carriage return, and space
* `\S` match other than white-space characters

<br>

#### <a name="pattern-groups"></a>Pattern groups

* `|` matches either of the given patterns
* `()` patterns within `()` are grouped and treated as one pattern, useful in conjunction with `|`
* `\1` backreference to first grouped pattern within `()`
* `\2` backreference to second grouped pattern within `()` and so on
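
* A short sketch of grouping, alternation and backreference working together. The word list is made up for illustration

```bash
# hypothetical sample input
input='bonbon
banana
tartar
target'

# lines made up of a 3 character sequence repeated twice
repeated=$(echo "$input" | grep -xE '(...)\1')
# lines starting with 'bon' or 'tar', followed by 'g'
grouped=$(echo "$input" | grep -E '^(bon|tar)g')
# lines containing either 'ana' or 'get'
either=$(echo "$input" | grep -E 'ana|get')

printf '%s\n' "$repeated"
```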

<br>

#### <a name="basic-vs-extended-regular-expressions"></a>Basic vs Extended Regular Expressions

By default, the pattern passed to `grep` is treated as Basic Regular Expressions (BRE), which can be overridden using options like `-E` for ERE and `-P` for Perl Compatible Regular Expressions (PCRE). Paraphrasing from `info grep`

>In Basic Regular Expressions the meta-characters `? + { | ( )` lose their special meaning, instead use the backslashed versions `\? \+ \{ \| \( \)`
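
A minimal sketch of the difference, using `+` as the example meta character. The two input lines are made up for illustration

```bash
# hypothetical sample input
input='a+b
aab'

# BRE: '+' is a literal character, '\+' is the quantifier
bre_literal=$(echo "$input" | grep 'a+b')
bre_quantifier=$(echo "$input" | grep 'a\+b')

# ERE: '+' is the quantifier, '\+' is a literal character
ere_quantifier=$(echo "$input" | grep -E 'a+b')
ere_literal=$(echo "$input" | grep -E 'a\+b')

printf '%s\n' "$bre_literal" "$bre_quantifier"
```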

<br>

## <a name="further-reading"></a>Further Reading

* `man grep` and `info grep`
    * At least go through all options ;)
    * **Usage section** in `info grep` has good examples as well
* This chapter has also been [converted to a book](https://github.com/learnbyexample/learn_gnugrep_ripgrep) with additional examples, exercises and covers popular alternative `ripgrep`
* A bit of history
    * [Brian Kernighan remembers the origins of grep](https://thenewstack.io/brian-kernighan-remembers-the-origins-of-grep/)
    * [how grep command was born](https://medium.com/@rualthanzauva/grep-was-a-private-command-of-mine-for-quite-a-while-before-i-made-it-public-ken-thompson-a40e24a5ef48)
    * [why GNU grep is fast](https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)
    * [unix.stackexchange - Difference between grep, egrep and fgrep](https://unix.stackexchange.com/questions/17949/what-is-the-difference-between-grep-egrep-and-fgrep)
* Q&A on stackoverflow/stackexchange are good source of learning material, good for practice exercises as well
    * [grep Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/grep?sort=votes&pageSize=15)
    * [grep Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/grep?sort=votes&pageSize=15)
* Learn Regular Expressions (has information on flavors other than BRE/ERE/PCRE too)
    * [Regular Expressions Tutorial](https://www.regular-expressions.info/tutorial.html)
    * [rexegg](https://www.rexegg.com/) - tutorials, tricks and more
    * [regexcrossword](https://regexcrossword.com/)
    * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)
    * [online regex tester and debugger](https://regex101.com/) - by default `pcre` flavor
* Alternatives
    * [ripgrep](https://github.com/BurntSushi/ripgrep)
    * [pcregrep](https://www.pcre.org/original/doc/html/pcregrep.html)
    * [ag - silver searcher](https://github.com/ggreer/the_silver_searcher)
* [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed)


================================================
FILE: gnu_sed.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnused/. The ebook also has content updated for newer version of the commands, includes exercises, solutions, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnused

---

<br> <br> <br>

# <a name="gnu-sed"></a>GNU sed

**Table of Contents**

* [Simple search and replace](#simple-search-and-replace)
    * [editing stdin](#editing-stdin)
    * [editing file input](#editing-file-input)
* [Inplace file editing](#inplace-file-editing)
    * [With backup](#with-backup)
    * [Without backup](#without-backup)
    * [Multiple files](#multiple-files)
    * [Prefix backup name](#prefix-backup-name)
    * [Place backups in directory](#place-backups-in-directory)
* [Line filtering options](#line-filtering-options)
    * [Print command](#print-command)
    * [Delete command](#delete-command)
    * [Quit commands](#quit-commands)
    * [Negating REGEXP address](#negating-regexp-address)
    * [Combining multiple REGEXP](#combining-multiple-regexp)
    * [Filtering by line number](#filtering-by-line-number)
    * [Print only line number](#print-only-line-number)
    * [Address range](#address-range)
    * [Relative addressing](#relative-addressing)
* [Using different delimiter for REGEXP](#using-different-delimiter-for-regexp)
* [Regular Expressions](#regular-expressions)
    * [Line Anchors](#line-anchors)
    * [Word Anchors](#word-anchors)
    * [Matching the meta characters](#matching-the-meta-characters)
    * [Alternation](#alternation)
    * [The dot meta character](#the-dot-meta-character)
    * [Quantifiers](#quantifiers)
    * [Character classes](#character-classes)
    * [Escape sequences](#escape-sequences)
    * [Grouping](#grouping)
    * [Back reference](#back-reference)
    * [Changing case](#changing-case)
* [Substitute command modifiers](#substitute-command-modifiers)
    * [g modifier](#g-modifier)
    * [Replace specific occurrence](#replace-specific-occurrence)
    * [Ignoring case](#ignoring-case)
    * [p modifier](#p-modifier)
    * [w modifier](#w-modifier)
    * [e modifier](#e-modifier)
    * [m modifier](#m-modifier)
* [Shell substitutions](#shell-substitutions)
    * [Variable substitution](#variable-substitution)
    * [Command substitution](#command-substitution)
* [z and s command line options](#z-and-s-command-line-options)
* [change command](#change-command)
* [insert command](#insert-command)
* [append command](#append-command)
* [adding contents of file](#adding-contents-of-file)
    * [r for entire file](#r-for-entire-file)
    * [R for line by line](#r-for-line-by-line)
* [n and N commands](#n-and-n-commands)
* [Control structures](#control-structures)
    * [if then else](#if-then-else)
    * [replacing in specific column](#replacing-in-specific-column)
    * [overlapping substitutions](#overlapping-substitutions)
* [Lines between two REGEXPs](#lines-between-two-regexps)
    * [Include or Exclude matching REGEXPs](#include-or-exclude-matching-regexps)
    * [First or Last block](#first-or-last-block)
    * [Broken blocks](#broken-blocks)
* [sed scripts](#sed-scripts)
* [Gotchas and Tips](#gotchas-and-tips)
* [Further Reading](#further-reading)

<br>

```bash
$ sed --version | head -n1
sed (GNU sed) 4.2.2

$ man sed
SED(1)                           User Commands                          SED(1)

NAME
       sed - stream editor for filtering and transforming text

SYNOPSIS
       sed [OPTION]... {script-only-if-no-other-script} [input-file]...

DESCRIPTION
       Sed  is a stream editor.  A stream editor is used to perform basic text
       transformations on an input stream (a file or input from  a  pipeline).
       While  in  some  ways similar to an editor which permits scripted edits
       (such as ed), sed works by making only one pass over the input(s),  and
       is consequently more efficient.  But it is sed's ability to filter text
       in a pipeline which particularly distinguishes it from other  types  of
       editors.
...
```

**Note:** [Multiline and manipulating pattern space](https://www.gnu.org/software/sed/manual/sed.html#Multiline-techniques) with h,x,D,G,H,P etc are not covered in this chapter. Examples and information are based on ASCII encoded text input only

<br>

## <a name="simple-search-and-replace"></a>Simple search and replace

Detailed examples for the **substitute** command will be covered in later sections. The syntax is

```
s/REGEXP/REPLACEMENT/FLAGS
```

The `/` character is idiomatically used as delimiter character. See also [Using different delimiter for REGEXP](#using-different-delimiter-for-regexp)

<br>

#### <a name="editing-stdin"></a>editing stdin

```bash
$ # sample command output to be edited
$ seq 10 | paste -sd,
1,2,3,4,5,6,7,8,9,10

$ # change only first ',' to ' : '
$ seq 10 | paste -sd, | sed 's/,/ : /'
1 : 2,3,4,5,6,7,8,9,10

$ # change all ',' to ' : ' by using 'g' modifier
$ seq 10 | paste -sd, | sed 's/,/ : /g'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
```

**Note:** As a good practice, all examples use single quotes around arguments to prevent shell interpretation. See [Shell substitutions](#shell-substitutions) section on use of double quotes

<br>

#### <a name="editing-file-input"></a>editing file input

* By default newline character is the line separator
* See [Regular Expressions](#regular-expressions) section for qualifying search terms, for ex
    * word boundaries to distinguish between 'hi', 'this', 'his', 'history', etc
    * multiple search terms, specific set of character, etc

```bash
$ cat greeting.txt
Hi there
Have a nice day

$ # change first 'e' in each line to 'E'
$ sed 's/e/E/' greeting.txt
Hi thEre
HavE a nice day

$ # change first 'nice day' in each line to 'safe journey'
$ sed 's/nice day/safe journey/' greeting.txt
Hi there
Have a safe journey

$ # change all 'e' to 'E' and save changed text to another file
$ sed 's/e/E/g' greeting.txt > out.txt
$ cat out.txt
Hi thErE
HavE a nicE day
```

<br>

## <a name="inplace-file-editing"></a>Inplace file editing

* In previous section, the output from `sed` was displayed on stdout or saved to another file
* To write the changes back to original file, use `-i` option

**Note**:

* Refer to `man sed` for details of how to use the `-i` option. It varies with different `sed` implementations. As mentioned at start of this chapter, `sed (GNU sed) 4.2.2` is being used here
* See also [unix.stackexchange - working with symlinks](https://unix.stackexchange.com/questions/348693/sed-update-etc-grub-conf-in-spite-this-link-file)

<br>

#### <a name="with-backup"></a>With backup

* When extension is given, the original input file is preserved with name changed according to extension provided

```bash
$ # '.bkp' is extension provided
$ sed -i.bkp 's/Hi/Hello/' greeting.txt
$ # output from sed is written back to 'greeting.txt'
$ cat greeting.txt
Hello there
Have a nice day

$ # original file is preserved in 'greeting.txt.bkp'
$ cat greeting.txt.bkp
Hi there
Have a nice day
```

<br>

#### <a name="without-backup"></a>Without backup

* Use this option with caution, changes made cannot be undone

```bash
$ sed -i 's/nice day/safe journey/' greeting.txt

$ # note, 'Hi' was already changed to 'Hello' in previous example
$ cat greeting.txt
Hello there
Have a safe journey
```

<br>

#### <a name="multiple-files"></a>Multiple files

* Multiple input files are treated individually and changes are written back to respective files

```bash
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes

$ # -i can be used with or without backup
$ sed -i 's/3/three/' f1 f2
$ cat f1
I ate three apples
$ cat f2
I bought two bananas and three mangoes
```

<br>

#### <a name="prefix-backup-name"></a>Prefix backup name

* A `*` in argument given to `-i` will get expanded to input filename
* This way, one can add prefix instead of suffix for backup

```bash
$ cat var.txt
foo
bar
baz

$ sed -i'bkp.*' 's/foo/hello/' var.txt
$ cat var.txt
hello
bar
baz

$ cat bkp.var.txt
foo
bar
baz
```

<br>

#### <a name="place-backups-in-directory"></a>Place backups in directory

* `*` also allows specifying an existing directory to place the backups in, instead of the current working directory

```bash
$ mkdir bkp_dir
$ sed -i'bkp_dir/*' 's/bar/hi/' var.txt
$ cat var.txt
hello
hi
baz

$ cat bkp_dir/var.txt
hello
bar
baz

$ # extensions can be added as well
$ # bkp_dir/*.bkp for suffix
$ # bkp_dir/bkp.* for prefix
$ # bkp_dir/bkp.*.2017 for both and so on
```
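
* A sketch of the suffix form mentioned in the comments above, written as a script that runs in a scratch directory created with `mktemp`. The file contents are made up

```bash
# work in a scratch directory so that no existing file is touched
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'foo\nbar\n' > var.txt
mkdir bkp_dir

# '*' is replaced by the input filename, so backup goes to bkp_dir/var.txt.bkp
sed -i'bkp_dir/*.bkp' 's/foo/hello/' var.txt

edited=$(cat var.txt)
backup=$(cat bkp_dir/var.txt.bkp)
printf '%s\n' "$edited" '---' "$backup"
```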

<br>

## <a name="line-filtering-options"></a>Line filtering options

* By default, `sed` acts on entire file. Often, one needs to extract or change only specific lines based on text search, line numbers, lines between two patterns, etc
* This filtering is much like using `grep`, `head` and `tail` commands in many ways and there are even more features
    * Use `sed` when inplace editing is needed, when the filtered lines have to be transformed, etc. It is not meant as a substitute for those commands

<br>

#### <a name="print-command"></a>Print command

* It is usually used in conjunction with `-n` option
* By default, `sed` prints every input line, including any changes made by commands like substitution
    * printing here refers to line being part of `sed` output which may be shown on terminal, redirected to file, etc
* Using `-n` option and `p` command together, only specific lines needed can be filtered
* Examples below use the `/REGEXP/` addressing, other forms will be seen in sections to follow

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # all lines containing the string 'are'
$ # same as: grep 'are' poem.txt
$ sed -n '/are/p' poem.txt
Roses are red,
Violets are blue,
And so are you.

$ # all lines containing the string 'so are'
$ # same as: grep 'so are' poem.txt
$ sed -n '/so are/p' poem.txt
And so are you.
```

* Using print and substitution together

```bash
$ # print only lines on which substitution happens
$ sed -n 's/are/ARE/p' poem.txt
Roses ARE red,
Violets ARE blue,
And so ARE you.

$ # if line contains 'are', perform given command
$ # print only if substitution succeeds
$ sed -n '/are/ s/so/SO/p' poem.txt
And SO are you.
```

* Duplicating every input line

```bash
$ # note, -n is not used and no filtering applied
$ seq 3 | sed 'p'
1
1
2
2
3
3
```

<br>

#### <a name="delete-command"></a>Delete command

* By default, `sed` prints every input line, including any changes like substitution
* Using the `d` command, those specific lines will NOT be printed

```bash
$ # same as: grep -v 'are' poem.txt
$ sed '/are/d' poem.txt
Sugar is sweet,

$ # same as: seq 5 | grep -v '3'
$ seq 5 | sed '/3/d'
1
2
4
5
```

* The `I` modifier allows filtering lines in a case-insensitive way
* See [Regular Expressions](#regular-expressions) section for more details

```bash
$ # /rose/I means match the string 'rose' irrespective of case
$ sed '/rose/Id' poem.txt
Violets are blue,
Sugar is sweet,
And so are you.
```

<br>

#### <a name="quit-commands"></a>Quit commands

* Exit `sed` without processing further input

```bash
$ # same as: seq 23 45 | head -n5
$ # remember that printing is default action if -n is not used
$ # here, 5 is line number based addressing
$ seq 23 45 | sed '5q'
23
24
25
26
27
```

* `Q` is similar to `q` but won't print the matching line

```bash
$ seq 23 45 | sed '5Q'
23
24
25
26

$ # useful to print from beginning of file up to but not including line matching REGEXP
$ sed '/is/Q' poem.txt
Roses are red,
Violets are blue,
```

* Use `tac` to get all lines starting from last occurrence of search string

```bash
$ # all lines from last occurrence of '7'
$ seq 50 | tac | sed '/7/q' | tac
47
48
49
50

$ # all lines from last occurrence of '7' excluding line with '7'
$ seq 50 | tac | sed '/7/Q' | tac
48
49
50
```

**Note**

* This way of using quit commands won't work for inplace editing with multiple file input
* See also [unix.stackexchange - applying changes to multiple files](https://unix.stackexchange.com/questions/309514/sed-apply-changes-in-multiple-files)

<br>

#### <a name="negating-regexp-address"></a>Negating REGEXP address

* Use `!` to invert the specified address

```bash
$ # same as: sed -n '/so are/p' poem.txt
$ sed '/so are/!d' poem.txt
And so are you.

$ # same as: sed '/are/d' poem.txt
$ sed -n '/are/!p' poem.txt
Sugar is sweet,
```

<br>

#### <a name="combining-multiple-regexp"></a>Combining multiple REGEXP

* See also [sed manual - Multiple commands syntax](https://www.gnu.org/software/sed/manual/sed.html#Multiple-commands-syntax) for more details
* See also [sed scripts](#sed-scripts) section for an alternate way

```bash
$ # each command as argument to -e option
$ sed -n -e '/blue/p' -e '/you/p' poem.txt
Violets are blue,
And so are you.

$ # each command separated by ;
$ # not all commands can be specified so
$ sed -n '/blue/p; /you/p' poem.txt
Violets are blue,
And so are you.

$ # each command separated by literal newline character
$ # might depend on whether the shell allows such multiline command
$ sed -n '
/blue/p
/you/p
' poem.txt
Violets are blue,
And so are you.
```

* Use `{}` command grouping for logical AND

```bash
$ # same as: grep 'are' poem.txt | grep 'And'
$ # space between /REGEXP/ and {} is optional
$ sed -n '/are/ {/And/p}' poem.txt
And so are you.

$ # same as: grep 'are' poem.txt | grep -v 'so'
$ sed -n '/are/ {/so/!p}' poem.txt
Roses are red,
Violets are blue,

$ # same as: grep -v 'red' poem.txt | grep -v 'blue'
$ sed -n '/red/!{/blue/!p}' poem.txt
Sugar is sweet,
And so are you.
$ # many ways to do it, use whatever feels easier to construct
$ # sed -e '/red/d' -e '/blue/d' poem.txt
$ # grep -v -e 'red' -e 'blue' poem.txt
```

* Different ways to do same things. See also [Alternation](#alternation) and [Control structures](#control-structures)

```bash
$ # multiple commands can lead to duplication
$ sed -n '/blue/p; /t/p' poem.txt
Violets are blue,
Violets are blue,
Sugar is sweet,
$ # in such cases, use regular expressions instead
$ sed -nE '/blue|t/p;' poem.txt
Violets are blue,
Sugar is sweet,

$ sed -nE '/red|blue/!p' poem.txt
Sugar is sweet,
And so are you.

$ sed -n '/so/b; /are/p' poem.txt
Roses are red,
Violets are blue,
```

<br>

#### <a name="filtering-by-line-number"></a>Filtering by line number

* Exact line number can be specified to be acted upon
* As a special case, `$` indicates last line of file
* See also [sed manual - Multiple commands syntax](https://www.gnu.org/software/sed/manual/sed.html#Multiple-commands-syntax)

```bash
$ # here, 2 represents the address for print command, similar to /REGEXP/p
$ # same as: head -n2 poem.txt | tail -n1
$ sed -n '2p' poem.txt
Violets are blue,

$ # print 2nd and 4th line
$ sed -n '2p; 4p' poem.txt
Violets are blue,
And so are you.

$ # same as: tail -n1 poem.txt
$ sed -n '$p' poem.txt
And so are you.

$ # delete except 3rd line
$ sed '3!d' poem.txt
Sugar is sweet,

$ # substitution only on 2nd line
$ sed '2 s/are/ARE/' poem.txt
Roses are red,
Violets ARE blue,
Sugar is sweet,
And so are you.
```

* For large input files, combine `p` with `q` for speedy exit
* `sed` would immediately quit without processing further input lines when `q` is used

```bash
$ seq 3542 4623452 | sed -n '2452{p;q}'
5993

$ seq 3542 4623452 | sed -n '250p; 2452{p;q}'
3791
5993

$ # here is a sample time comparison
$ time seq 3542 4623452 | sed -n '2452{p;q}' > /dev/null

real    0m0.003s
user    0m0.000s
sys     0m0.000s
$ time seq 3542 4623452 | sed -n '2452p' > /dev/null

real    0m0.334s
user    0m0.396s
sys     0m0.024s
```

* mimicking `head` command using `q`

```bash
$ # same as: seq 23 45 | head -n5
$ # remember that printing is default action if -n is not used
$ seq 23 45 | sed '5q'
23
24
25
26
27
```

<br>

#### <a name="print-only-line-number"></a>Print only line number

```bash
$ # gives both line number and matching line
$ grep -n 'blue' poem.txt
2:Violets are blue,

$ # gives only line number of matching line
$ sed -n '/blue/=' poem.txt
2

$ sed -n '/are/=' poem.txt
1
2
4
```

* If needed, matching line can also be printed. But there will be newline separation

```bash
$ sed -n '/blue/{=;p}' poem.txt
2
Violets are blue,

$ # or
$ sed -n '/blue/{p;=}' poem.txt
Violets are blue,
2
```
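
* If the line number and the line are needed together, one option is to post-process with `paste`, which here joins every two lines of `sed` output using `:`. The poem is embedded in the script so that the sketch is self-contained

```bash
# same content as poem.txt
poem='Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.'

# sed emits number and line on alternate lines, paste joins each pair
res=$(echo "$poem" | sed -n '/are/{=;p}' | paste -d: - -)
echo "$res"
# 1:Roses are red,
# 2:Violets are blue,
# 4:And so are you.
```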

<br>

#### <a name="address-range"></a>Address range

* So far, we've seen how to filter specific line based on *REGEXP* and line numbers
* `sed` also allows combining them to select a range of lines
* Consider the sample input file for this section

```bash
$ cat addr_range.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* Range defined by start and end *REGEXP*
* For other cases like getting lines without the line matching start and/or end, unbalanced start/end, when end *REGEXP* doesn't match, etc see [Lines between two REGEXPs](#lines-between-two-regexps) section

```bash
$ sed -n '/is/,/like/p' addr_range.txt
Today is sunny
Not a bit funny
No doubt you like it too

$ sed -n '/just/I,/believe/Ip' addr_range.txt
Just do-it
Believe it

$ # the second REGEXP will always be checked after the line matching first address
$ sed -n '/No/,/No/p' addr_range.txt
Not a bit funny
No doubt you like it too

$ # all the matching ranges will be printed
$ sed -n '/you/,/do/p' addr_range.txt
How are you

Just do-it
No doubt you like it too

Much ado about nothing
```

* Range defined by start and end line numbers

```bash
$ # print lines numbered 3 to 7
$ sed -n '3,7p' addr_range.txt
Good day
How are you

Just do-it
Believe it

$ # print lines from line number 13 to last line
$ sed -n '13,$p' addr_range.txt
Much ado about nothing
He he he

$ # delete lines numbered 2 to 13
$ sed '2,13d' addr_range.txt
Hello World
He he he
```

* Range defined by mix of line number and *REGEXP*

```bash
$ sed -n '3,/do/p' addr_range.txt
Good day
How are you

Just do-it

$ sed -n '/Today/,$p' addr_range.txt
Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* Negating address range, just add `!` to end of address range

```bash
$ # same as: seq 10 | sed '3,7d'
$ seq 10 | sed -n '3,7!p'
1
2
8
9
10

$ # same as: sed '/Today/,$d' addr_range.txt
$ sed -n '/Today/,$!p' addr_range.txt
Hello World

Good day
How are you

Just do-it
Believe it

```

<br>

#### <a name="relative-addressing"></a>Relative addressing

* Prefixing `+` to a number for second address gives relative filtering
* Similar to using `grep -A<num> --no-group-separator 'REGEXP'` but `grep` merges adjacent groups while `sed` does not

```bash
$ # line matching 'is' and 2 lines after
$ sed -n '/is/,+2p' addr_range.txt
Today is sunny
Not a bit funny
No doubt you like it too

$ # note that all matching ranges will be filtered
$ sed -n '/do/,+2p' addr_range.txt
Just do-it
Believe it

No doubt you like it too

Much ado about nothing
```
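
* A sketch of the difference from `grep -A` mentioned at the start of this section. A line matching inside an active `sed` range does not start a new range, whereas `grep` extends the context for every matching line

```bash
# line 3 starts the range 3 to 5, the match on line 5 is inside that range
sed_out=$(seq 10 | sed -n '/[35]/,+2p' | paste -sd,)
echo "$sed_out"
# 3,4,5

# grep prints 2 lines of context after each match, including the match on line 5
grep_out=$(seq 10 | grep -A2 --no-group-separator '[35]' | paste -sd,)
echo "$grep_out"
# 3,4,5,6,7
```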

* The first address could be number too
* Useful when using [Shell substitutions](#shell-substitutions)

```bash
$ sed -n '3,+4p' addr_range.txt
Good day
How are you

Just do-it
Believe it
```
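
* A minimal sketch of the shell substitution use case. The variable names `start` and `count` are hypothetical, and double quotes are used so that the shell expands them

```bash
# hypothetical variables: first line to print and number of lines after it
start=3
count=4

# expands to: sed -n '3,+4p'
res=$(seq 10 | sed -n "${start},+${count}p" | paste -sd,)
echo "$res"
# 3,4,5,6,7
```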

* Another relative format is `i~j` which acts on ith line and i+j, i+2j, i+3j, etc
    * `1~2` means 1st, 3rd, 5th, 7th, etc (i.e odd numbered lines)
    * `5~3` means 5th, 8th, 11th, etc

```bash
$ # match odd numbered lines
$ # for even, use 2~2
$ seq 10 | sed -n '1~2p'
1
3
5
7
9

$ # match line numbers: 2, 2+1*4, 2+2*4, etc
$ seq 10 | sed -n '2~4p'
2
6
10
```

* If `~j` is specified after `,` then meaning changes completely
* After the matching line based on number or *REGEXP* of start address, the closest line number multiple of `j` will mark end address

```bash
$ # 2nd line is start address
$ # closest multiple of 4 is 4th line
$ seq 10 | sed -n '2,~4p'
2
3
4
$ # closest multiple of 4 is 8th line
$ seq 10 | sed -n '5,~4p'
5
6
7
8

$ # line matching on `Just` is 6th line, so ending is 10th line
$ sed -n '/Just/,~5p' addr_range.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
```

<br>

## <a name="using-different-delimiter-for-regexp"></a>Using different delimiter for REGEXP

* `/` is idiomatically used as the *REGEXP* delimiter
    * See also [a bit of history on why / is commonly used as delimiter](https://www.reddit.com/r/commandline/comments/3lhgwh/why_did_people_standardize_on_using_forward/cvgie7j/)
* But any character other than `\` and newline character can be used instead
* This helps to avoid/reduce use of `\`

```bash
$ # instead of this
$ echo '/home/learnbyexample/reports' | sed 's/\/home\/learnbyexample\//~\//'
~/reports

$ # use a different delimiter
$ echo '/home/learnbyexample/reports' | sed 's#/home/learnbyexample/#~/#'
~/reports
```

* For *REGEXP* used in address matching, syntax is a bit different `\<char>REGEXP<char>`

```bash
$ printf '/foo/bar/1\n/foo/baz/1\n'
/foo/bar/1
/foo/baz/1

$ printf '/foo/bar/1\n/foo/baz/1\n' | sed -n '\;/foo/bar/;p'
/foo/bar/1
```

<br>

## <a name="regular-expressions"></a>Regular Expressions

* By default, `sed` treats *REGEXP* as BRE (Basic Regular Expression)
* The `-E` option enables ERE (Extended Regular Expression). In GNU sed, ERE differs from BRE only in how meta characters are written, there is no difference in functionality
    * Initially GNU sed only had `-r` option to enable ERE and `man sed` doesn't even mention `-E`
    * Other `sed` versions use `-E` and `grep` uses `-E` as well. So `-r` won't be used in examples in this tutorial
    * See also [sed manual - BRE-vs-ERE](https://www.gnu.org/software/sed/manual/sed.html#BRE-vs-ERE)
* See [sed manual - Regular Expressions](https://www.gnu.org/software/sed/manual/sed.html#sed-regular-expressions) for more details
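
* A small sketch of the same substitution written in both flavors, using `+` as the example meta character. The input string is made up

```bash
# one or more 'a' followed by 'b': quantifier is '\+' in BRE and '+' in ERE
bre_quantifier=$(echo 'aaab+c' | sed 's/a\+b/X/')
ere_quantifier=$(echo 'aaab+c' | sed -E 's/a+b/X/')
echo "$bre_quantifier" "$ere_quantifier"
# X+c X+c

# literal '+': written as '+' in BRE and '\+' in ERE
bre_literal=$(echo 'aaab+c' | sed 's/b+c/X/')
ere_literal=$(echo 'aaab+c' | sed -E 's/b\+c/X/')
echo "$bre_literal" "$ere_literal"
# aaaX aaaX
```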

<br>

#### <a name="line-anchors"></a>Line Anchors

* Often, search must match from beginning of line or towards end of line
* For example, an integer variable declaration in `C` will start with optional white-space, the keyword `int`, white-space and then variable(s)
    * This way one can avoid matching declarations inside single line comments as well
* Similarly, one might want to match a variable at end of statement

Consider the input file and sample substitution without using any anchoring

```bash
$ cat anchors.txt
cat and dog
too many cats around here
to concatenate, use the cmd cat
catapults laid waste to the village
just scat and quit bothering me
that is quite a fabricated tale
try the grape variety muscat

$ # without anchors, substitution will replace wherever the string is found
$ sed 's/cat/XXX/g' anchors.txt
XXX and dog
too many XXXs around here
to conXXXenate, use the cmd XXX
XXXapults laid waste to the village
just sXXX and quit bothering me
that is quite a fabriXXXed tale
try the grape variety musXXX
```

* The meta character `^` forces *REGEXP* to match only at start of line

```bash
$ # filtering lines starting with 'cat'
$ sed -n '/^cat/p' anchors.txt
cat and dog
catapults laid waste to the village

$ # replace only at start of line
$ # g modifier not needed as there can only be single match at start of line
$ sed 's/^cat/XXX/' anchors.txt
XXX and dog
too many cats around here
to concatenate, use the cmd cat
XXXapults laid waste to the village
just scat and quit bothering me
that is quite a fabricated tale
try the grape variety muscat

$ # add something to start of line
$ echo 'Have a good day' | sed 's/^/Hi! /'
Hi! Have a good day
```

* The meta character `$` forces *REGEXP* to match only at end of line

```bash
$ # filtering lines ending with 'cat'
$ sed -n '/cat$/p' anchors.txt
to concatenate, use the cmd cat
try the grape variety muscat

$ # replace only at end of line
$ sed 's/cat$/YYY/' anchors.txt
cat and dog
too many cats around here
to concatenate, use the cmd YYY
catapults laid waste to the village
just scat and quit bothering me
that is quite a fabricated tale
try the grape variety musYYY

$ # add something to end of line
$ echo 'Have a good day' | sed 's/$/. Cya later/'
Have a good day. Cya later
```
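
* The `C` declaration case mentioned at the start of this section could be sketched as below. This is a simplified pattern that assumes the line starts with optional white-space followed by `int`, so qualifiers such as `unsigned` or `const` are not handled. The code snippet is made up

```bash
# hypothetical C source lines
code='int count = 0;
    int total;
// int unused;
print_int(count);
float ratio;'

# optional white-space at start of line, then 'int', then white-space
res=$(printf '%s\n' "$code" | sed -n '/^[[:space:]]*int[[:space:]]/p')
printf '%s\n' "$res"
```

The commented declaration and the `print_int` call are not matched since `int` is not the first non-blank text on those lines.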

<br>

#### <a name="word-anchors"></a>Word Anchors

* A **word** character is any alphabet (irrespective of case) or any digit or the underscore character
* The word anchors help in matching or not matching boundaries of a word
    * For example, to distinguish between `par`, `spar` and `apparent`
* `\b` matches word boundary
    * `\` is meta character and certain combinations like `\b` and `\B` have special meaning

```bash
$ # words ending with 'cat'
$ sed -n 's/cat\b/XXX/p' anchors.txt
XXX and dog
to concatenate, use the cmd XXX
just sXXX and quit bothering me
try the grape variety musXXX

$ # words starting with 'cat'
$ sed -n 's/\bcat/YYY/p' anchors.txt
YYY and dog
too many YYYs around here
to concatenate, use the cmd YYY
YYYapults laid waste to the village

$ # only whole words
$ sed -n 's/\bcat\b/ZZZ/p' anchors.txt
ZZZ and dog
to concatenate, use the cmd ZZZ

$ # word is made up of alphabets, numbers and _
$ echo 'foo, foo_bar and foo1' | sed 's/\bfoo\b/baz/g'
baz, foo_bar and foo1
```

* `\B` is opposite of `\b`, i.e it doesn't match word boundaries

```bash
$ # substitute only if 'cat' is surrounded by word characters
$ sed -n 's/\Bcat\B/QQQ/p' anchors.txt
to conQQQenate, use the cmd cat
that is quite a fabriQQQed tale

$ # substitute only if 'cat' is not start of word
$ sed -n 's/\Bcat/RRR/p' anchors.txt
to conRRRenate, use the cmd cat
just sRRR and quit bothering me
that is quite a fabriRRRed tale
try the grape variety musRRR

$ # substitute only if 'cat' is not end of word
$ sed -n 's/cat\B/SSS/p' anchors.txt
too many SSSs around here
to conSSSenate, use the cmd cat
SSSapults laid waste to the village
that is quite a fabriSSSed tale
```

* One can also use these alternatives for `\b`
    * `\<` for start of word
    * `\>` for end of word

```bash
$ # same as: sed 's/\bcat\b/X/g'
$ echo 'concatenate cat scat cater' | sed 's/\<cat\>/X/g'
concatenate X scat cater

$ # add something to both start/end of word
$ echo 'hi foo_baz 3b' | sed 's/\b/:/g'
:hi: :foo_baz: :3b:

$ # add something only at start of word
$ echo 'hi foo_baz 3b' | sed 's/\</:/g'
:hi :foo_baz :3b

$ # add something only at end of word
$ echo 'hi foo_baz 3b' | sed 's/\>/:/g'
hi: foo_baz: 3b:
```

<br>

#### <a name="matching-the-meta-characters"></a>Matching the meta characters

* Since meta characters like `^`, `$`, `\` etc have special meaning in *REGEXP*, they have to be escaped using `\` to match them literally

```bash
$ # here, '^' will match only start of line
$ echo '(a+b)^2 = a^2 + b^2 + 2ab' | sed 's/^/**/g'
**(a+b)^2 = a^2 + b^2 + 2ab

$ # '\' before '^' will match '^' literally
$ echo '(a+b)^2 = a^2 + b^2 + 2ab' | sed 's/\^/**/g'
(a+b)**2 = a**2 + b**2 + 2ab

$ # to match '\' use '\\'
$ echo 'foo\bar' | sed 's/\\/ /'
foo bar

$ echo 'pa$$' | sed 's/$/s/g'
pa$$s
$ echo 'pa$$' | sed 's/\$/s/g'
pass

$ # '^' has special meaning only at start of REGEXP
$ # similarly, '$' has special meaning only at end of REGEXP
$ echo '(a+b)^2 = a^2 + b^2 + 2ab' | sed 's/a^2/A^2/g'
(a+b)^2 = A^2 + b^2 + 2ab
```

* Certain characters like `&` and `\` have special meaning in *REPLACEMENT* section of substitute as well. They too have to be escaped using `\`
* And the delimiter character has to be escaped of course
* See [back reference](#back-reference) section for use of `&` in *REPLACEMENT* section

```bash
$ # & will refer to entire matched string of REGEXP section
$ echo 'foo and bar' | sed 's/and/"&"/'
foo "and" bar
$ echo 'foo and bar' | sed 's/and/"\&"/'
foo "&" bar

$ # use different delimiter where required
$ echo 'a b' | sed 's/ /\//'
a/b
$ echo 'a b' | sed 's# #/#'
a/b

$ # use \\ to represent literal \
$ echo '/foo/bar/baz' | sed 's#/#\\#g'
\foo\bar\baz
```

<br>

#### <a name="alternation"></a>Alternation

* Two or more *REGEXP* can be combined as logical OR using the `|` meta character
    * syntax is `\|` for BRE and `|` for ERE
* Each side of `|` is a complete regular expression with its own start/end anchors
* How each part of alternation is handled and order of evaluation/output is beyond the scope of this tutorial
    * See [this](https://www.regular-expressions.info/alternation.html) for more info on this topic.

```bash
$ # BRE
$ sed -n '/red\|blue/p' poem.txt
Roses are red,
Violets are blue,

$ # ERE
$ sed -nE '/red|blue/p' poem.txt
Roses are red,
Violets are blue,

$ # filter lines starting or ending with 'cat'
$ sed -nE '/^cat|cat$/p' anchors.txt
cat and dog
to concatenate, use the cmd cat
catapults laid waste to the village
try the grape variety muscat

$ # g modifier is needed for more than one replacement
$ echo 'foo and temp and baz' | sed -E 's/foo|temp|baz/XYZ/'
XYZ and temp and baz
$ echo 'foo and temp and baz' | sed -E 's/foo|temp|baz/XYZ/g'
XYZ and XYZ and XYZ
```
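
The following sketch hints at how alternatives are chosen, assuming GNU `sed` and its POSIX style leftmost-longest matching: when more than one alternative can match at the same position, the longest one is used, whatever the order in which they are written. The sample string is made up for illustration.

```bash
# both alternatives can match at the start of 'spared'
# the longer alternative wins, irrespective of the order used
short_first=$(printf '%s\n' 'spared party' | sed -E 's/spa|spared/X/')
long_first=$(printf '%s\n' 'spared party' | sed -E 's/spared|spa/X/')
printf '%s\n' "$short_first" "$long_first"
# both lines print: X party
```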

<br>

#### <a name="the-dot-meta-character"></a>The dot meta character

* The `.` meta character matches any character once, including newline

```bash
$ # replace all sequence of 3 characters starting with 'c' and ending with 't'
$ echo 'coat cut fit c#t' | sed 's/c.t/XYZ/g'
coat XYZ fit XYZ

$ # replace all sequence of 4 characters starting with 'c' and ending with 't'
$ echo 'coat cut fit c#t' | sed 's/c..t/ABCD/g'
ABCD cut fit c#t

$ # space, tab etc are also characters which will be matched by '.'
$ echo 'coat cut fit c#t' | sed 's/t.f/IJK/g'
coat cuIJKit c#t
```

<br>

#### <a name="quantifiers"></a>Quantifiers

All quantifiers in `sed` are greedy, i.e. the longest match wins as long as the overall *REGEXP* is satisfied, and when there is more than one quantifier, precedence goes from left to right. In this section, we'll cover usage of quantifiers on characters

* `?` will try to match 0 or 1 time
* For BRE, use `\?`

```bash
$ printf 'late\npale\nfactor\nrare\nact\n'
late
pale
factor
rare
act

$ # same as using: sed -nE '/at|act/p'
$ printf 'late\npale\nfactor\nrare\nact\n' | sed -nE '/ac?t/p'
late
factor
act

$ # greediness comes in handy in some cases
$ # problem: '<' has to be replaced with '\<' only if not preceded by '\'
$ echo 'blah \< foo bar < blah baz <'
blah \< foo bar < blah baz <
$ # this won't work as '\<' gets replaced with '\\<'
$ echo 'blah \< foo bar < blah baz <' | sed -E 's/</\\</g'
blah \\< foo bar \< blah baz \<
$ # by using '\\?<' both '\<' and '<' gets replaced by '\<'
$ echo 'blah \< foo bar < blah baz <' | sed -E 's/\\?</\\</g'
blah \< foo bar \< blah baz \<
```

* `*` will try to match 0 or more times

```bash
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n'
abc
ac
adc
abbc
bbb
bc
abbbbbc

$ # match 'a' and 'c' with any number of 'b' in between
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -n '/ab*c/p'
abc
ac
abbc
abbbbbc

$ # delete from start of line to 'te'
$ echo 'that is quite a fabricated tale' | sed 's/.*te//'
d tale
$ # delete from start of line to 'te '
$ echo 'that is quite a fabricated tale' | sed 's/.*te //'
a fabricated tale
$ # delete from first 'f' in the line to end of line
$ echo 'that is quite a fabricated tale' | sed 's/f.*//'
that is quite a 
```
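
A small sketch of the left to right precedence mentioned at the start of this section, using a made-up input. Both `.*` are greedy, but the leftmost one is satisfied first. Capture groups (covered in a later section) are used here only to show what each part matched.

```bash
# the first group extends as far as it can
# while still leaving a 'c' for the second group to match
greedy=$(printf '%s\n' 'abcabc' | sed -E 's/(a.*)(c.*)/[\1][\2]/')
printf '%s\n' "$greedy"
# prints: [abcab][c]
```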

* `+` will try to match 1 or more times
* For BRE, use `\+`

```bash
$ # match 'a' and 'c' with at least one 'b' in between
$ # BRE
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -n '/ab\+c/p'
abc
abbc
abbbbbc

$ # ERE
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab+c/p'
abc
abbc
abbbbbc
```

* For more precise control on number of times to match, use `{}`

```bash
$ # exactly 5 times
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab{5}c/p'
abbbbbc

$ # between 1 to 3 times, inclusive of 1 and 3
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab{1,3}c/p'
abc
abbc

$ # maximum of 2 times, including 0 times
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab{,2}c/p'
abc
ac
abbc

$ # minimum of 2 times
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab{2,}c/p'
abbc
abbbbbc

$ # BRE
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -n '/ab\{2,\}c/p'
abbc
abbbbbc
```

<br>

#### <a name="character-classes"></a>Character classes

* The `.` meta character provides a way to match any character
* Character class provides a way to match any character among a specified set of characters enclosed within `[]`

```bash
$ # same as: sed -nE '/lane|late/p'
$ printf 'late\nlane\nfate\nfete\n' | sed -n '/la[nt]e/p'
late
lane

$ printf 'late\nlane\nfate\nfete\n' | sed -n '/[fl]a[nt]e/p'
late
lane
fate

$ # quantifiers can be added similar to using for any other character
$ # filter lines made up entirely of digits, containing at least one digit
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[0123456789]+$/p'
123
42
$ # filter lines made up entirely of digits, containing at least three digits
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[0123456789]{3,}$/p'
123
```

Character ranges

* Matching any alphabet, number, hexadecimal number etc becomes cumbersome if every character has to be individually specified
* So, there's a shortcut, using `-` to construct a range (has to be specified in ascending order)
* See [ascii codes table](https://ascii.cl/) for reference
    * Note that behavior of range will depend on locale settings
    * [arch wiki - locale](https://wiki.archlinux.org/index.php/locale)
    * [Linux: Define Locale and Language Settings](https://www.shellhacks.com/linux-define-locale-language-settings/)

```bash
$ # filter lines made up entirely of digits, at least one
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[0-9]+$/p'
123
42

$ # filter lines made up entirely of lower case alphabets, at least one
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[a-z]+$/p'
foo

$ # filter lines made up entirely of lower case alphabets and digits, at least one
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[a-z0-9]+$/p'
cat5
foo
123
42
```

* Numeric ranges are easy to construct for certain cases, but regular expressions are not always suitable. Use `awk` or `perl` for arithmetic computation
* See also [Matching Numeric Ranges with a Regular Expression](https://www.regular-expressions.info/numericranges.html)

```bash
$ # numbers between 10 to 29
$ printf '23\n154\n12\n26\n98234\n' | sed -n '/^[12][0-9]$/p'
23
12
26

$ # numbers >= 100
$ printf '23\n154\n12\n26\n98234\n' | sed -nE '/^[0-9]{3,}$/p'
154
98234

$ # numbers >= 100 if there are leading zeros
$ printf '0501\n035\n154\n12\n26\n98234\n' | sed -nE '/^0*[1-9][0-9]{2,}$/p'
0501
154
98234
```
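
For ranges that don't map neatly onto digit patterns, arithmetic comparison tends to be simpler. A minimal sketch with `awk`, where the bounds 20 and 150 are arbitrary values chosen for illustration:

```bash
# numbers from 20 to 150 (inclusive), compared arithmetically
in_range=$(printf '23\n154\n12\n26\n98234\n' | awk '$1 >= 20 && $1 <= 150')
printf '%s\n' "$in_range"
# prints 23 and 26 on separate lines
```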

Negating character class

* Meta characters inside and outside of `[]` are completely different
* For example, `^` as first character inside `[]` matches characters other than those specified inside character class

```bash
$ # delete zero or more characters before first =
$ echo 'foo=bar; baz=123' | sed 's/^[^=]*//'
=bar; baz=123

$ # delete zero or more characters after last =
$ echo 'foo=bar; baz=123' | sed 's/[^=]*$//'
foo=bar; baz=

$ # same as: sed -n '/[aeiou]/!p'
$ printf 'tryst\nglyph\npity\nwhy\n' | sed -n '/^[^aeiou]*$/p'
tryst
glyph
why
```

Matching meta characters inside `[]`

* Characters like `^`, `]`, `-`, etc need special attention to be part of list
* Also, sequences like `[.` or `=]` have special meaning within `[]`
    * See [sed manual - Character-Classes-and-Bracket-Expressions](https://www.gnu.org/software/sed/manual/sed.html#Character-Classes-and-Bracket-Expressions) for complete list

```bash
$ # to match - it should be first or last character within []
$ printf 'Foo-bar\nabc-456\n42\nCo-operate\n' | sed -nE '/^[a-z-]+$/Ip'
Foo-bar
Co-operate

$ # to match ] it should be first character within []
$ printf 'int foo\nint a[5]\nfoo=bar\n' | sed -n '/[]=]/p'
int a[5]
foo=bar

$ # to match [ use [ anywhere in the character list
$ # [][] will match both [ and ]
$ printf 'int foo\nint a[5]\nfoo=bar\n' | sed -n '/[[]/p'
int a[5]

$ # to match ^ it should be other than first in the list
$ printf 'c=a^b\nd=f*h+e\nz=x-y\n' | sed -n '/[*^]/p'
c=a^b
d=f*h+e
```
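
The `[][]` form mentioned in the comments above can be tried out as follows (sample input reused from the earlier example):

```bash
# ] is placed first and [ comes after it, so [][] matches both brackets
no_brackets=$(printf '%s\n' 'int a[5]' | sed 's/[][]//g')
printf '%s\n' "$no_brackets"
# prints: int a5
```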

Named character classes

* Equivalent class shown is for C locale and ASCII character encoding
    * See [ascii codes table](https://ascii.cl/) for reference
* See [sed manual - Character Classes and Bracket Expressions](https://www.gnu.org/software/sed/manual/sed.html#Character-Classes-and-Bracket-Expressions) for more details

| Character classes | Description |
| ------------- | ----------- |
| `[:digit:]` | Same as `[0-9]` |
| `[:lower:]` | Same as `[a-z]` |
| `[:upper:]` | Same as `[A-Z]` |
| `[:alpha:]` | Same as `[a-zA-Z]` |
| `[:alnum:]` | Same as `[0-9a-zA-Z]` |
| `[:xdigit:]` | Same as `[0-9a-fA-F]` |
| `[:cntrl:]` | Control characters - first 32 ASCII characters and 127th (DEL) |
| `[:punct:]` | All the punctuation characters |
| `[:graph:]` | `[:alnum:]` and `[:punct:]` |
| `[:print:]` | `[:alnum:]`, `[:punct:]` and space |
| `[:blank:]` | Space and tab characters |
| `[:space:]` | white-space characters: tab, newline, vertical tab, form feed, carriage return and space |

```bash
$ # lines containing only hexadecimal characters
$ printf '128\n34\nfe32\nfoo1\nbar\n' | sed -nE '/^[[:xdigit:]]+$/p'
128
34
fe32

$ # lines containing at least one non-hexadecimal character
$ printf '128\n34\nfe32\nfoo1\nbar\n' | sed -n '/[^[:xdigit:]]/p'
foo1
bar

$ # same as: sed -nE '/^[a-z-]+$/Ip'
$ printf 'Foo-bar\nabc-456\n42\nCo-operate\n' | sed -nE '/^[[:alpha:]-]+$/p'
Foo-bar
Co-operate

$ # remove all punctuation characters
$ sed 's/[[:punct:]]//g' poem.txt
Roses are red
Violets are blue
Sugar is sweet
And so are you
```

Backslash character classes

* Equivalent class shown is for C locale and ASCII character encoding
    * See [ascii codes table](https://ascii.cl/) for reference
* See [sed manual - regular expression extensions](https://www.gnu.org/software/sed/manual/sed.html#regexp-extensions) for more details

| Character classes | Description |
| ------------- | ----------- |
| `\w` | Same as `[0-9a-zA-Z_]` or `[[:alnum:]_]` |
| `\W` | Same as `[^0-9a-zA-Z_]` or `[^[:alnum:]_]` |
| `\s` | Same as `[[:space:]]` |
| `\S` | Same as `[^[:space:]]` |

```bash
$ # lines containing only word characters
$ printf '123\na=b+c\ncmp_str\nFoo_bar\n' | sed -nE '/^\w+$/p'
123
cmp_str
Foo_bar

$ # backslash character classes cannot be used inside [] unlike perl
$ # inside [], \ and w are both matched literally
$ echo 'w=y-x+9*3' | sed 's/[\w=]//g'
y-x+9*3
$ echo 'w=y-x+9*3' | perl -pe 's/[\w=]//g'
-+*
```
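
Where a word character is needed as part of a list, the named class can be used instead. A sketch that reproduces the `perl` result above with `sed`, assuming C locale and ASCII input:

```bash
# [[:alnum:]_] is the bracket-friendly spelling of \w
# adding = to the list gives the equivalent of perl's [\w=]
ops=$(printf '%s\n' 'w=y-x+9*3' | sed 's/[[:alnum:]_=]//g')
printf '%s\n' "$ops"
# prints: -+*
```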

<br>

#### <a name="escape-sequences"></a>Escape sequences

* Certain ASCII characters like tab, carriage return, newline, etc have escape sequence to represent them
    * Unlike backslash character classes, these can be used within `[]` as well
* Any ASCII character can also be represented using its decimal, octal or hexadecimal value
    * See [ascii codes table](https://ascii.cl/) for reference
* See [sed manual - Escapes](https://www.gnu.org/software/sed/manual/sed.html#Escapes) for more details

```bash
$ # example for representing tab character
$ printf 'foo\tbar\tbaz\n'
foo     bar     baz
$ printf 'foo\tbar\tbaz\n' | sed 's/\t/ /g'
foo bar baz
$ echo 'a b c' | sed 's/ /\t/g'
a       b       c

$ # using escape sequence inside character class
$ printf 'a\tb\vc\n'
a       b
         c
$ printf 'a\tb\vc\n' | cat -vT
a^Ib^Kc
$ printf 'a\tb\vc\n' | sed 's/[\t\v]/ /g'
a b c

$ # most common use case for hex escape sequence is to represent single quotes
$ # equivalent is '\d039' and '\o047' for decimal and octal respectively
$ echo "foo: '34'"
foo: '34'
$ echo "foo: '34'" | sed 's/\x27/"/g'
foo: "34"
$ echo 'foo: "34"' | sed 's/"/\x27/g'
foo: '34'
```
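
The decimal and octal forms mentioned in the comment above can be checked the same way. A quick sketch, assuming GNU `sed`:

```bash
# \d039 (decimal) and \o047 (octal) both stand for a single quote
dec=$(printf '%s\n' "foo: '34'" | sed 's/\d039/"/g')
oct=$(printf '%s\n' "foo: '34'" | sed 's/\o047/"/g')
printf '%s\n' "$dec" "$oct"
# both lines print: foo: "34"
```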

<br>

#### <a name="grouping"></a>Grouping

* Character classes allow matching one character out of a list of choices, with a quantifier added if needed
* One of the uses of grouping is analogous: it offers a choice among whole regular expressions, instead of just a list of characters
* The meta characters `()` are used for grouping
    * requires `\(\)` for BRE
* Similar to maths `ab + ac = a(b+c)`, think of regular expression `a(b|c) = ab|ac`

```bash
$ # four letter words with 'on' or 'no' in middle
$ printf 'known\nmood\nknow\npony\ninns\n' | sed -nE '/\b[a-z](on|no)[a-z]\b/p'
know
pony
$ # common mistake to use character class, will match 'oo' and 'nn' as well
$ printf 'known\nmood\nknow\npony\ninns\n' | sed -nE '/\b[a-z][on]{2}[a-z]\b/p'
mood
know
pony
inns

$ # quantifier example
$ printf 'handed\nhand\nhandy\nhands\nhandle\n' | sed -nE '/^hand([sy]|le)?$/p'
hand
handy
hands
handle

$ # remove first two columns where : is delimiter
$ echo 'foo:123:bar:baz' | sed -E 's/^([^:]+:){2}//'
bar:baz

$ # can be nested as required
$ printf 'spade\nscore\nscare\nspare\nsphere\n' | sed -nE '/^s([cp](he|a)[rd])e$/p'
spade
scare
spare
sphere
```

<br>

#### <a name="back-reference"></a>Back reference

* The matched string within `()` can also be used to be matched again by back referencing the captured groups
* `\1` denotes the first matched group, `\2` the second one and so on
    * Order is leftmost `(` is `\1`, next one is `\2` and so on
    * Can be used both in *REGEXP* as well as in *REPLACEMENT* sections
* `&` or `\0` represents entire matched string in *REPLACEMENT* section
* Note that the matched string, not the regular expression itself is referenced
    * for ex: if `([0-9][a-f])` matches `3b`, then back referencing will be `3b` not any other valid match of the regular expression like `8f`, `0a` etc
* As `\` and `&` are special characters in *REPLACEMENT* section, use `\\` and `\&` respectively for literal representation

```bash
$ # filter lines with consecutive repeated alphabets
$ printf 'eel\nflee\nall\npat\nilk\nseen\n' | sed -nE '/([a-z])\1/p'
eel
flee
all
seen

$ # reduce \\ to single \ and delete if only single \
$ echo '\[\] and \\w and \[a-zA-Z0-9\_\]' | sed -E 's/(\\?)\\/\1/g'
[] and \w and [a-zA-Z0-9_]

$ # remove two or more duplicate words separated by space
$ # word boundaries prevent false matches like 'the theatre' 'sand and stone' etc
$ echo 'a a a walking for for a cause' | sed -E 's/\b(\w+)( \1)+\b/\1/g'
a walking for a cause

$ # surround only third column with double quotes
$ # note the nested capture groups and numbers used in REPLACEMENT section
$ echo 'foo:123:bar:baz' | sed -E 's/^(([^:]+:){2})([^:]+)/\1"\3"/'
foo:123:"bar":baz

$ # add first column data to end of line as well
$ echo 'foo:123:bar:baz' | sed -E 's/^([^:]+).*/& \1/'
foo:123:bar:baz foo

$ # surround entire line with double quotes
$ echo 'hello world' | sed 's/.*/"&"/'
"hello world"
$ # add something at start as well as end of line
$ echo 'hello world' | sed 's/.*/Hi. &. Have a nice day/'
Hi. hello world. Have a nice day
```

<br>

#### <a name="changing-case"></a>Changing case

* Applies only to *REPLACEMENT* section, unlike `perl` where these can be used in *REGEXP* portion as well
* See [sed manual - The s Command](https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command) for more details and corner cases

```bash
$ # UPPERCASE all alphabets, will be stopped on \L or \E
$ echo 'HeLlO WoRLD' | sed 's/.*/\U&/'
HELLO WORLD

$ # lowercase all alphabets, will be stopped on \U or \E
$ echo 'HeLlO WoRLD' | sed 's/.*/\L&/'
hello world

$ # Uppercase only next character
$ echo 'foo bar' | sed 's/\w*/\u&/g'
Foo Bar
$ echo 'foo_bar next_line' | sed -E 's/_([a-z])/\u\1/g'
fooBar nextLine

$ # lowercase only next character
$ echo 'FOO BAR' | sed 's/\w*/\l&/g'
fOO bAR
$ echo 'fooBar nextLine Baz' | sed -E 's/([a-z])([A-Z])/\1_\l\2/g'
foo_bar next_line Baz

$ # titlecase if input has mixed case
$ echo 'HeLlO WoRLD' | sed 's/.*/\L&/; s/\w*/\u&/g'
Hello World
$ # sed 's/.*/\L\u&/' also works, but not sure if it is defined behavior
$ echo 'HeLlO WoRLD' | sed 's/.*/\L&/; s/./\u&/'
Hello world

$ # \E will stop conversion started by \U or \L
$ echo 'foo_bar next_line baz' | sed -E 's/([a-z]+)(_[a-z]+)/\U\1\E\2/g'
FOO_bar NEXT_line baz
```

<br>

## <a name="substitute-command-modifiers"></a>Substitute command modifiers

The `s` command syntax:

```
s/REGEXP/REPLACEMENT/FLAGS
```

* Modifiers (or FLAGS) like `g`, `p` and `I` have already been seen. For completeness, they will be discussed again along with the rest of the modifiers
* See [sed manual - The s Command](https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command) for more details and corner cases

<br>

#### <a name="g-modifier"></a>g modifier

By default, substitute command will replace only first occurrence of match. `g` modifier is needed to replace all occurrences

```bash
$ # replace only first : with -
$ echo 'foo:123:bar:baz' | sed 's/:/-/'
foo-123:bar:baz

$ # replace all : with -
$ echo 'foo:123:bar:baz' | sed 's/:/-/g'
foo-123-bar-baz
```

<br>

#### <a name="replace-specific-occurrence"></a>Replace specific occurrence

* A number can be used to specify *N*th match to be replaced

```bash
$ # replace first occurrence
$ echo 'foo:123:bar:baz' | sed 's/:/-/'
foo-123:bar:baz
$ echo 'foo:123:bar:baz' | sed -E 's/[^:]+/XYZ/'
XYZ:123:bar:baz

$ # replace second occurrence
$ echo 'foo:123:bar:baz' | sed 's/:/-/2'
foo:123-bar:baz
$ echo 'foo:123:bar:baz' | sed -E 's/[^:]+/XYZ/2'
foo:XYZ:bar:baz

$ # replace third occurrence
$ echo 'foo:123:bar:baz' | sed 's/:/-/3'
foo:123:bar-baz
$ echo 'foo:123:bar:baz' | sed -E 's/[^:]+/XYZ/3'
foo:123:XYZ:baz

$ # choice of quantifier depends on knowing input
$ echo ':123:bar:baz' | sed 's/[^:]*/XYZ/2'
:XYZ:bar:baz
$ echo ':123:bar:baz' | sed -E 's/[^:]+/XYZ/2'
:123:XYZ:baz
```

* Replacing *N*th match from end of line when number of matches is unknown
* Makes use of greediness of quantifiers

```bash
$ # replacing last occurrence
$ # can also use sed -E 's/:([^:]*)$/-\1/'
$ echo 'foo:123:bar:baz' | sed -E 's/(.*):/\1-/'
foo:123:bar-baz
$ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):/\1-/'
456:foo:123:bar:789-baz
$ echo 'foo and bar and baz land good' | sed -E 's/(.*)and/\1XYZ/'
foo and bar and baz lXYZ good
$ # use word boundaries as necessary
$ echo 'foo and bar and baz land good' | sed -E 's/(.*)\band\b/\1XYZ/'
foo and bar XYZ baz land good

$ # replacing last but one
$ echo 'foo:123:bar:baz' | sed -E 's/(.*):(.*:)/\1-\2/'
foo:123-bar:baz
$ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):(.*:)/\1-\2/'
456:foo:123:bar-789:baz

$ # replacing last but two
$ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):((.*:){2})/\1-\2/'
456:foo:123-bar:789:baz
$ # replacing last but three
$ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):((.*:){3})/\1-\2/'
456:foo-123:bar:789:baz
```

* Replacing all but first *N* occurrences by combining with `g` modifier

```bash
$ # replace all : with - except first two
$ echo '456:foo:123:bar:789:baz' | sed -E 's/:/-/3g'
456:foo:123-bar-789-baz

$ # replace all : with - except first three
$ echo '456:foo:123:bar:789:baz' | sed -E 's/:/-/4g'
456:foo:123:bar-789-baz
```

* Replacing multiple *N*th occurrences

```bash
$ # replace first two occurrences of : with -
$ echo '456:foo:123:bar:789:baz' | sed 's/:/-/; s/:/-/'
456-foo-123:bar:789:baz

$ # replace second and third occurrences of : with -
$ # note the changes in number to be used for subsequent replacement
$ echo '456:foo:123:bar:789:baz' | sed 's/:/-/2; s/:/-/2'
456:foo-123-bar:789:baz

$ # better way is to use descending order
$ echo '456:foo:123:bar:789:baz' | sed 's/:/-/3; s/:/-/2'
456:foo-123-bar:789:baz
$ # replace second, third and fifth occurrences of : with -
$ echo '456:foo:123:bar:789:baz' | sed 's/:/-/5; s/:/-/3; s/:/-/2'
456:foo-123-bar:789-baz
```

<br>

#### <a name="ignoring-case"></a>Ignoring case

* Either `i` or `I` can be used for replacing in case-insensitive manner
* Since only `I` can be used for address filtering (for ex: `sed '/rose/Id' poem.txt`), use `I` for substitute command as well for consistency

```bash
$ echo 'hello Hello HELLO HeLlO' | sed 's/hello/hi/g'
hi Hello HELLO HeLlO

$ echo 'hello Hello HELLO HeLlO' | sed 's/hello/hi/Ig'
hi hi hi hi
```
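
A short sketch of the `I` flag used in both places, with made-up input lines instead of `poem.txt` so that it can be run as is:

```bash
# I flag on the address: delete lines containing 'rose' in any case
kept=$(printf 'Roses are red\nPrimrose path\nViolets are blue\n' | sed '/rose/Id')
printf '%s\n' "$kept"
# prints: Violets are blue

# same flag on the substitute command
subbed=$(printf '%s\n' 'Rose rose ROSE' | sed 's/rose/tulip/Ig')
printf '%s\n' "$subbed"
# prints: tulip tulip tulip
```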

<br>

#### <a name="p-modifier"></a>p modifier

* Usually used in conjunction with `-n` option to output only modified lines

```bash
$ # no output if no substitution
$ echo 'hi there. have a nice day' | sed -n 's/xyz/XYZ/p'
$ # modified line if there is substitution
$ echo 'hi there. have a nice day' | sed -n 's/\bh/H/pg'
Hi there. Have a nice day

$ # only lines containing 'are'
$ sed -n 's/are/ARE/p' poem.txt
Roses ARE red,
Violets ARE blue,
And so ARE you.

$ # only lines containing 'are' as well as 'so'
$ sed -n '/are/ s/so/SO/p' poem.txt
And SO are you.
```

<br>

#### <a name="w-modifier"></a>w modifier

* Allows writing only the changed lines to the specified file name instead of the default **stdout**

```bash
$ # space between w and filename is optional
$ # same as: sed -n 's/3/three/p' > 3.txt
$ seq 20 | sed -n 's/3/three/w 3.txt'
$ cat 3.txt
three
1three

$ # do not use -n if output should be displayed as well as written to file
$ echo '456:foo:123:bar:789:baz' | sed -E 's/(:[^:]*){2}$//w col.txt'
456:foo:123:bar
$ cat col.txt
456:foo:123:bar
```

* For multiple output files, use `-e` for each file

```bash
$ seq 20 | sed -n -e 's/5/five/w 5.txt' -e 's/7/seven/w 7.txt'
$ cat 5.txt
five
1five
$ cat 7.txt
seven
1seven
```

* There are two predefined filenames
    * `/dev/stdout` to write to **stdout**
    * `/dev/stderr` to write to **stderr**

```bash
$ # inplace editing as well as display changes on terminal
$ sed -i 's/three/3/w /dev/stdout' 3.txt
3
13
$ cat 3.txt
3
13
```
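
A similar sketch for `/dev/stderr`, assuming GNU `sed`. Here `-n` suppresses the regular output and `2>&1` is used only to capture what was written to **stderr** in a variable:

```bash
# only the changed lines are written, and they go to stderr
changes=$(seq 13 | sed -n 's/3/three/w /dev/stderr' 2>&1)
printf '%s\n' "$changes"
# prints three and 1three on separate lines
```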

<br>

#### <a name="e-modifier"></a>e modifier

* If a substitution was made, the resulting line is executed as a shell command and replaced with the command's output
* Trailing newline from command output is suppressed

```bash
$ # replacing a line with output of shell command
$ printf 'Date:\nreplace this line\n'
Date:
replace this line
$ printf 'Date:\nreplace this line\n' | sed 's/^replace.*/date/e'
Date:
Thu May 25 10:19:46 IST 2017

$ # when using p modifier with e, order is important
$ printf 'Date:\nreplace this line\n' | sed -n 's/^replace.*/date/ep'
Thu May 25 10:19:46 IST 2017
$ printf 'Date:\nreplace this line\n' | sed -n 's/^replace.*/date/pe'
date

$ # entire modified line is executed as shell command
$ echo 'xyz 5' | sed 's/xyz/seq/e'
1
2
3
4
5
```

<br>

#### <a name="m-modifier"></a>m modifier

* Either `m` or `M` can be used
* So far, we've seen only line based operations (newline character being used to distinguish lines)
* There are various ways (see [sed manual - How sed Works](https://www.gnu.org/software/sed/manual/sed.html#Execution-Cycle)) to get more than one line into pattern space, and in such cases the `m` modifier can be used
* See also [unix.stackexchange - usage of multi-line modifier](https://unix.stackexchange.com/questions/298670/simple-significant-usage-of-m-multi-line-address-suffix) for more examples

Before seeing example with `m` modifier, let's see a simple example to get two lines in pattern space

```bash
$ # line matching 'blue' and next line in pattern space
$ sed -n '/blue/{N;p}' poem.txt
Violets are blue,
Sugar is sweet,

$ # applying substitution, remember that . matches newline as well
$ sed -n '/blue/{N;s/are.*is//p}' poem.txt
Violets  sweet,
```

* When `m` modifier is used, it affects the behavior of `^`, `$` and `.` meta characters

```bash
$ # without m modifier, ^ will anchor only beginning of entire pattern space
$ sed -n '/blue/{N;s/^/:: /pg}' poem.txt
:: Violets are blue,
Sugar is sweet,
$ # with m modifier, ^ will anchor each individual line within pattern space
$ sed -n '/blue/{N;s/^/:: /pgm}' poem.txt
:: Violets are blue,
:: Sugar is sweet,

$ # same applies to $ as well
$ sed -n '/blue/{N;s/$/ ::/pg}' poem.txt
Violets are blue,
Sugar is sweet, ::
$ sed -n '/blue/{N;s/$/ ::/pgm}' poem.txt
Violets are blue, ::
Sugar is sweet, ::

$ # with m modifier, . will not match newline character
$ sed -n '/blue/{N;s/are.*//p}' poem.txt
Violets 
$ sed -n '/blue/{N;s/are.*//pm}' poem.txt
Violets 
Sugar is sweet,
```

<br>

## <a name="shell-substitutions"></a>Shell substitutions

* Examples presented work with the `bash` shell, behavior might differ for other shells
* See also [stackoverflow - Difference between single and double quotes in Bash](https://stackoverflow.com/questions/6697753/difference-between-single-and-double-quotes-in-bash)
* For robust substitutions taking care of meta characters in *REGEXP* and *REPLACEMENT* sections, see
    * [unix.stackexchange - How to ensure that string interpolated into sed substitution escapes all metachars](https://unix.stackexchange.com/questions/129059/how-to-ensure-that-string-interpolated-into-sed-substitution-escapes-all-metac)
    * [unix.stackexchange - What characters do I need to escape when using sed in a sh script?](https://unix.stackexchange.com/questions/32907/what-characters-do-i-need-to-escape-when-using-sed-in-a-sh-script)
    * [stackoverflow - Is it possible to escape regex metacharacters reliably with sed](https://stackoverflow.com/questions/29613304/is-it-possible-to-escape-regex-metacharacters-reliably-with-sed)

<br>

#### <a name="variable-substitution"></a>Variable substitution

* Entire command in double quotes can be used for simple use cases

```bash
$ word='are'
$ sed -n "/$word/p" poem.txt
Roses are red,
Violets are blue,
And so are you.

$ replace='ARE'
$ sed "s/$word/$replace/g" poem.txt
Roses ARE red,
Violets ARE blue,
Sugar is sweet,
And so ARE you.

$ # need to use delimiter as suitable
$ echo 'home path is:' | sed "s/$/ $HOME/"
sed: -e expression #1, char 7: unknown option to `s'
$ echo 'home path is:' | sed "s|$| $HOME|"
home path is: /home/learnbyexample
```

* If command has characters like `\`, backtick, `!` etc, double quote only the variable

```bash
$ # if history expansion is enabled, ! is special
$ word='are'
$ sed "/$word/!d" poem.txt
sed "/$word/date +%A" poem.txt
sed: -e expression #1, char 7: extra characters after command

$ # so double quote only the variable
$ # the command is concatenation of '/' and "$word" and '/!d'
$ sed '/'"$word"'/!d' poem.txt
Roses are red,
Violets are blue,
And so are you.
```
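
When the search or replacement text comes from a variable and can contain meta characters, it has to be escaped first. A minimal sketch along the lines of the links given at the start of this section. `escape_regexp` and `escape_replacement` are hypothetical helper names, and the approach assumes BRE, `/` as the delimiter and strings without newline characters:

```bash
# hypothetical helpers, not part of sed
# escape characters special in BRE, plus the / delimiter
escape_regexp() { printf '%s\n' "$1" | sed 's#[[\\/.*^$]#\\&#g'; }
# escape characters special in REPLACEMENT, plus the / delimiter
escape_replacement() { printf '%s\n' "$1" | sed 's#[\\/&]#\\&#g'; }

search='a^2 + b.c*'
replace='x/y & \z'
s=$(escape_regexp "$search")
r=$(escape_replacement "$replace")

# both strings are now treated literally
result=$(printf '%s\n' '(a^2 + b.c*)' | sed "s/$s/$r/")
printf '%s\n' "$result"
# prints: (x/y & \z)
```

A different delimiter (`#`) is used inside the helpers so that `/` can be listed in the character class without any escaping.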

<br>

#### <a name="command-substitution"></a>Command substitution

* Much more flexible than the `e` modifier, since the command output can be used to replace just a part of the line

```bash
$ echo 'today is date' | sed 's/date/'"$(date +%A)"'/'
today is Tuesday

$ # need to use delimiter as suitable
$ echo 'current working dir is: ' | sed 's/$/'"$(pwd)"'/'
sed: -e expression #1, char 6: unknown option to `s'
$ echo 'current working dir is: ' | sed 's|$|'"$(pwd)"'|'
current working dir is: /home/learnbyexample/command_line_text_processing

$ # multiline output cannot be substituted in this manner
$ echo 'foo' | sed 's/foo/'"$(seq 5)"'/'
sed: -e expression #1, char 7: unterminated `s' command
```
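
One possible workaround, shown here as a sketch rather than a robust solution: a literal newline is accepted in *REPLACEMENT* section if it is preceded by `\`. This assumes the command output has no other characters special to *REPLACEMENT* section (`\`, `&` or the delimiter).

```bash
# add a backslash at the end of every line of the command output except the last
repl=$(seq 3 | sed '$!s/$/\\/')
multi=$(printf '%s\n' 'foo' | sed 's/foo/'"$repl"'/')
printf '%s\n' "$multi"
# prints 1, 2 and 3 on separate lines
```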

<br>

## <a name="z-and-s-command-line-options"></a>z and s command line options

* We have already seen a few options like `-n`, `-e`, `-i` and `-E`
* This section will cover `-z` and `-s` options
* See [sed manual - Command line options](https://www.gnu.org/software/sed/manual/sed.html#Command_002dLine-Options) for other options and more details

The `-z` option will cause `sed` to separate input based on ASCII NUL character instead of newlines

```bash
$ # useful to process null separated data
$ # for ex: output of grep -Z, find -print0, etc
$ printf 'teal\0red\nblue\n\0green\n' | sed -nz '/red/p' | cat -A
red$
blue$
^@

$ # also useful to process whole file(not having NUL characters) as a single string
$ # adds ; to previous line if current line starts with c
$ printf 'cat\ndog\ncoat\ncut\nmat\n' | sed -z 's/\nc/;&/g'
cat
dog;
coat;
cut
mat
```
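
A sketch of editing NUL separated file names, the kind of data produced by `find -print0`. The names are made up, and `tr` is used only to make the result readable:

```bash
# with -z, $ anchors the end of each NUL separated record
names=$(printf 'a.txt\0b c.log\0d e.txt\0' | sed -z 's/\.txt$/.md/' | tr '\0' '\n')
printf '%s\n' "$names"
# prints a.md, b c.log and d e.md on separate lines
```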

The `-s` option will cause `sed` to treat multiple input files separately instead of treating them as single concatenated input. If `-i` is being used, `-s` is implied

```bash
$ # without -s, there is only one first line
$ # F command prints file name of current file
$ sed '1F' f1 f2
f1
I ate three apples
I bought two bananas and three mangoes

$ # with -s, each file has its own address
$ sed -s '1F' f1 f2
f1
I ate three apples
f2
I bought two bananas and three mangoes
```

<br>

## <a name="change-command"></a>change command

The change command `c` will delete line(s) represented by address or address range and replace it with given string

**Note** the string used cannot have an unescaped literal newline character, use the `\n` escape sequence instead

```bash
$ # white-space between c and replacement string is ignored
$ seq 3 | sed '2c foo bar'
1
foo bar
3

$ # note how all lines in address range are replaced
$ seq 8 | sed '3,7cfoo bar'
1
2
foo bar
8

$ # escape sequences are allowed in string to be replaced
$ sed '/red/,/is/chello\nhi there' poem.txt
hello
hi there
And so are you.
```

* command will apply for all matching addresses

```bash
$ seq 5 | sed '/[24]/cfoo'
1
foo
3
foo
5
```

* `\` is special immediately after `c`, see [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for details
* If escape sequence is needed at beginning of replacement string, use an additional `\`

```bash
$ # \ helps to add leading spaces
$ seq 3 | sed '2c  a'
1
a
3
$ seq 3 | sed '2c\ a'
1
 a
3

$ seq 3 | sed '2c\tgood day'
1
tgood day
3
$ seq 3 | sed '2c\\tgood day'
1
        good day
3
```
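
One consequence of `\` being special: in the classic form of the command, `c\` is followed by a newline and then the text, and a multiline string can be given by ending every line except the last with `\`. A minimal sketch:

```bash
# each line of the text except the last ends with a backslash
changed=$(seq 3 | sed '2c\
foo\
bar')
printf '%s\n' "$changed"
# prints 1, foo, bar and 3 on separate lines
```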

* Since `;` cannot be used to distinguish between string and end of command, use `-e` for multiple commands

```bash
$ sed -e '/are/cHi;s/is/IS/' poem.txt
Hi;s/is/IS/
Hi;s/is/IS/
Sugar is sweet,
Hi;s/is/IS/

$ sed -e '/are/cHi' -e 's/is/IS/' poem.txt
Hi
Hi
Sugar IS sweet,
Hi
```

* Using shell substitution

```bash
$ text='good day'
$ seq 3 | sed '2c'"$text"
1
good day
3

$ text='good day\nfoo bar'
$ seq 3 | sed '2c'"$text"
1
good day
foo bar
3

$ seq 3 | sed '2c'"$(date +%A)"
1
Thursday
3

$ # multiline command output will lead to error
$ seq 3 | sed '2c'"$(seq 2)"
sed: -e expression #1, char 5: missing command
```

<br>

## <a name="insert-command"></a>insert command

The insert command `i` allows adding a string before a line matching the given address

**Note** the string used cannot have an unescaped literal newline character, use the `\n` escape sequence instead

```bash
$ # white-space between i and string is ignored
$ # same as: sed '2s/^/hello\n/'
$ seq 3 | sed '2i hello'
1
hello
2
3

$ # escape sequences can be used
$ seq 3 | sed '2ihello\nhi'
1
hello
hi
2
3
```

* command will apply for all matching addresses

```bash
$ seq 5 | sed '/[24]/ifoo'
1
foo
2
3
foo
4
5
```

* `\` is special immediately after `i`, see [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for details
* If escape sequence is needed at beginning of the string to be inserted, use an additional `\`

```bash
$ seq 3 | sed '2i  foo'
1
foo
2
3
$ seq 3 | sed '2i\ foo'
1
 foo
2
3

$ seq 3 | sed '2i\tbar'
1
tbar
2
3
$ seq 3 | sed '2i\\tbar'
1
        bar
2
3
```

* Since `;` cannot be used to distinguish between string and end of command, use `-e` for multiple commands

```bash
$ sed -e '/is/ifoobar;s/are/ARE/' poem.txt
Roses are red,
Violets are blue,
foobar;s/are/ARE/
Sugar is sweet,
And so are you.

$ sed -e '/is/ifoobar' -e 's/are/ARE/' poem.txt
Roses ARE red,
Violets ARE blue,
foobar
Sugar is sweet,
And so ARE you.
```

* Using shell substitution

```bash
$ text='good day'
$ seq 3 | sed '2i'"$text"
1
good day
2
3

$ text='good day\nfoo bar'
$ seq 3 | sed '2i'"$text"
1
good day
foo bar
2
3

$ seq 3 | sed '2iToday is '"$(date +%A)"
1
Today is Thursday
2
3

$ # multiline command output will lead to error
$ seq 3 | sed '2i'"$(seq 2)"
sed: -e expression #1, char 5: missing command
```

<br>

## <a name="append-command"></a>append command

The append command `a` allows adding a string after a line matching the given address

**Note** the string used cannot have an unescaped literal newline character, use the `\n` escape sequence instead

```bash
$ # white-space between a and string is ignored
$ # same as: sed '2s/$/\nhello/'
$ seq 3 | sed '2a hello'
1
2
hello
3

$ # escape sequences can be used
$ seq 3 | sed '2ahello\nhi'
1
2
hello
hi
3
```

* command will apply for all matching addresses

```bash
$ seq 5 | sed '/[24]/afoo'
1
2
foo
3
4
foo
5
```

* `\` is special immediately after `a`, see [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for details
* If an escape sequence is needed at the beginning of the string to be appended, use an additional `\`

```bash
$ seq 3 | sed '2a  foo'
1
2
foo
3
$ seq 3 | sed '2a\ foo'
1
2
 foo
3

$ seq 3 | sed '2a\tbar'
1
2
tbar
3
$ seq 3 | sed '2a\\tbar'
1
2
        bar
3
```

* Since `;` is treated as part of the string instead of as a command separator, use `-e` to specify multiple commands

```bash
$ sed -e '/is/afoobar;s/are/ARE/' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
foobar;s/are/ARE/
And so are you.

$ sed -e '/is/afoobar' -e 's/are/ARE/' poem.txt
Roses ARE red,
Violets ARE blue,
Sugar is sweet,
foobar
And so ARE you.
```

* Using shell substitution

```bash
$ text='good day'
$ seq 3 | sed '2a'"$text"
1
2
good day
3

$ text='good day\nfoo bar'
$ seq 3 | sed '2a'"$text"
1
2
good day
foo bar
3

$ seq 3 | sed '2aToday is '"$(date +%A)"
1
2
Today is Thursday
3

$ # multiline command output will lead to error
$ seq 3 | sed '2a'"$(seq 2)"
sed: -e expression #1, char 5: missing command
```

* See also [stackoverflow - add newline character if last line of input doesn't have one](https://stackoverflow.com/questions/41343062/what-does-this-mean-in-linux-sed-a-a-txt)
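* The idea behind that link can be sketched as follows, assuming GNU sed: `$a\` appends an empty string after the last line, and the missing newline gets added as a side effect. The variable names are only for illustration

```bash
# the last line of this input has no newline character, so wc counts only one
before_count=$(printf 'a\nb' | wc -l)
# '$a\' appends nothing visible, but the output now ends with a newline
fixed_count=$(printf 'a\nb' | sed '$a\' | wc -l)
# input that already ends with a newline is left as it is
same_count=$(printf 'a\nb\n' | sed '$a\' | wc -l)
echo "$before_count $fixed_count $same_count"
```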

<br>

## <a name="adding-contents-of-file"></a>adding contents of file

<br>

#### <a name="r-for-entire-file"></a>r for entire file

* The `r` command adds the contents of a file after every line matching the given address
* It is a robust way to add multiline content, or content with characters that would otherwise be interpreted (escape sequences, for example)
* Special name `/dev/stdin` allows to read from **stdin** instead of file input
* First, a simple example to add contents of one file into another at specified address

```bash
$ cat 5.txt
five
1five

$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # space between r and filename is optional
$ sed '2r 5.txt' poem.txt
Roses are red,
Violets are blue,
five
1five
Sugar is sweet,
And so are you.

$ # content cannot be added before first line
$ sed '0r 5.txt' poem.txt
sed: -e expression #1, char 2: invalid usage of line address 0
$ # but that is trivial to solve: cat 5.txt poem.txt
```

* command will apply for all matching addresses

```bash
$ seq 5 | sed '/[24]/r 5.txt'
1
2
five
1five
3
4
five
1five
5
```

* adding content of variable as it is without any interpretation
* also shows example for using `/dev/stdin`

```bash
$ text='Good day\nfoo bar baz\n'
$ # escape sequence like \n will be interpreted when 'a' command is used
$ sed '/is/a'"$text" poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
Good day
foo bar baz

And so are you.

$ # \ is just another character, won't be treated as special with 'r' command
$ echo "$text" | sed '/is/r /dev/stdin' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
Good day\nfoo bar baz\n
And so are you.
```
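
* `echo` expands escape sequences like `\n` in some shells, so `printf '%s\n'` is a safer way to pass the variable through unchanged. A sketch with a temporary two line copy of `poem.txt`; the names `tmp` and `out` are only for illustration

```bash
tmp=$(mktemp -d)
printf 'Sugar is sweet,\nAnd so are you.\n' > "$tmp/poem.txt"

text='Good day\nfoo bar baz\n'
# %s prints the value as it is, backslashes included
out=$(printf '%s\n' "$text" | sed '/is/r /dev/stdin' "$tmp/poem.txt")
printf '%s\n' "$out"
# prints:
# Sugar is sweet,
# Good day\nfoo bar baz\n
# And so are you.

rm -r "$tmp"
```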

* adding multiline command output is simple as well

```bash
$ seq 3 | sed '/is/r /dev/stdin' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
1
2
3
And so are you.
```

* replacing a line or range of lines with contents of file
* See also [unix.stackexchange - various ways to replace line M in file1 with line N in file2](https://unix.stackexchange.com/a/396450)

```bash
$ # replacing range of lines
$ # order is important, first 'r' and then 'd'
$ sed -e '/is/r 5.txt' -e '1,/is/d' poem.txt
five
1five
And so are you.

$ # replacing a line
$ seq 3 | sed -e '3r /dev/stdin' -e '3d' poem.txt
Roses are red,
Violets are blue,
1
2
3
And so are you.

$ # can also use {} grouping to avoid repeating the address
$ seq 3 | sed -e '/blue/{r /dev/stdin' -e 'd}' poem.txt
Roses are red,
1
2
3
Sugar is sweet,
And so are you.
```

<br>

#### <a name="r-for-line-by-line"></a>R for line by line

* add a line for every address match
* Special name `/dev/stdin` allows to read from **stdin** instead of file input

```bash
$ # space between R and filename is optional
$ seq 3 | sed '/are/R /dev/stdin' poem.txt
Roses are red,
1
Violets are blue,
2
Sugar is sweet,
And so are you.
3
$ # to replace matching line
$ seq 3 | sed -e '/are/{R /dev/stdin' -e 'd}' poem.txt
1
2
Sugar is sweet,
3

$ sed '2,3R 5.txt' poem.txt
Roses are red,
Violets are blue,
five
Sugar is sweet,
1five
And so are you.
```

* the number of lines in the file being read can differ from the number of lines matching the address

```bash
$ # file has more lines than there are lines matching the address
$ # 2 lines in 5.txt but only 1 line matching 'is'
$ sed '/is/R 5.txt' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
five
And so are you.

$ # more lines match the address than there are lines to be read
$ # 3 lines matching 'are' but only 2 lines from stdin
$ seq 2 | sed '/are/R /dev/stdin' poem.txt
Roses are red,
1
Violets are blue,
2
Sugar is sweet,
And so are you.
```
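
* Without an address, `R` adds one line for every input line, which interleaves two inputs. A minimal sketch, assuming GNU sed; `letters.txt`, `tmp` and `out` are hypothetical names

```bash
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/letters.txt"

# each line of letters.txt is followed by the next line read from stdin
out=$(seq 3 | sed 'R /dev/stdin' "$tmp/letters.txt")
printf '%s\n' "$out"
# prints a, 1, b, 2, c, 3 (one item per line)

rm -r "$tmp"
```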

<br>

## <a name="n-and-n-commands"></a>n and N commands

* These two commands will fetch next line (newline or NUL character separated, depending on options)

Quoting from [sed manual - common commands](https://www.gnu.org/software/sed/manual/sed.html#Common-Commands) for `n` command

>If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with the next line of input. If there is no more input then sed exits without processing any more commands.

```bash
$ # if line contains 'blue', replace 'e' with 'E' only for following line
$ sed '/blue/{n;s/e/E/g}' poem.txt
Roses are red,
Violets are blue,
Sugar is swEEt,
And so are you.

$ # better illustrated with -n option
$ sed -n '/blue/{n;s/e/E/pg}' poem.txt
Sugar is swEEt,

$ # if line contains 'blue', replace 'e' with 'E' only for next to next line
$ sed -n '/blue/{n;n;s/e/E/pg}' poem.txt
And so arE you.
```

Quoting from [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for `N` command

>Add a newline to the pattern space, then append the next line of input to the pattern space. If there is no more input then sed exits without processing any more commands

>When -z is used, a zero byte (the ascii ‘NUL’ character) is added between the lines (instead of a new line)

* See also [stackoverflow - apply substitution every 4 lines but excluding the 4th line](https://stackoverflow.com/questions/40229578/how-to-insert-a-line-feed-into-a-sed-line-concatenation)

```bash
$ # if line contains 'blue', replace 'e' with 'E' both in current line and next
$ sed '/blue/{N;s/e/E/g}' poem.txt
Roses are red,
ViolEts arE bluE,
Sugar is swEEt,
And so are you.

$ # better illustrated with -n option
$ sed -n '/blue/{N;s/e/E/pg}' poem.txt
ViolEts arE bluE,
Sugar is swEEt,

$ sed -n '/blue/{N;N;s/e/E/pg}' poem.txt
ViolEts arE bluE,
Sugar is swEEt,
And so arE you.
```

* Combination

```bash
$ # n will fetch next line, current line is out of pattern space
$ # N will then add another line
$ sed -n '/blue/{n;N;s/e/E/pg}' poem.txt
Sugar is swEEt,
And so arE you.
```

* not necessary to qualify with an address

```bash
$ seq 6 | sed 'n;cXYZ'
1
XYZ
3
XYZ
5
XYZ

$ seq 6 | sed 'N;s/\n/ /'
1 2
3 4
5 6
```
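
* As per the quoted manual text, commands after `N` are not processed when there is no next line. A common idiom is `$!N`, which skips `N` on the last line so that the remaining commands still run. A small sketch; `out` is a name chosen for illustration

```bash
# odd number of input lines, so the last line has no partner to join with
# $!N applies N to every line except the last one
# the final s command therefore gets applied to the lone last line as well
out=$(seq 3 | sed '$!N;s/\n/-/;s/$/./')
printf '%s\n' "$out"
# prints:
# 1-2.
# 3.
```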

<br>

## <a name="control-structures"></a>Control structures

* Using `:label` one can mark a command location to branch to conditionally or unconditionally
* See [sed manual - Commands for sed gurus](https://www.gnu.org/software/sed/manual/sed.html#Programming-Commands) for more details

<br>

#### <a name="if-then-else"></a>if then else

* Simple if-then-else can be simulated using `b` command
* `b` command will unconditionally branch to specified label
* Without label, `b` will skip rest of commands and start next cycle
* See [unix.stackexchange - processing only lines between REGEXPs](https://unix.stackexchange.com/questions/292819/remove-commented-lines-except-one-comment-using-sed) for interesting use case

```bash
$ # changing -ve to +ve and vice versa
$ cat nums.txt
42
-2
10101
-3.14
-75
$ # same as: perl -pe '/^-/ ? s/// : s/^/-/'
$ # empty REGEXP section will reuse previous REGEXP, in this case /^-/
$ sed '/^-/{s///;b}; s/^/-/' nums.txt
-42
2
-10101
3.14
75

$ # same as: perl -pe '/are/ ? s/e/*/g : s/e/#/g'
$ # if line contains 'are' replace 'e' with '*' else replace 'e' with '#'
$ sed '/are/{s/e/*/g;b}; s/e/#/g' poem.txt
Ros*s ar* r*d,
Viol*ts ar* blu*,
Sugar is sw##t,
And so ar* you.
```
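
* With a label, commands placed after the if-then-else can be shared by both branches. A sketch using a hypothetical label `done`, with each command given as a separate `-e` so that the end of the label name is unambiguous

```bash
# if line contains 'are': replace 'e' with '*' and jump to the label
# else: replace 'e' with '#'
# commands after the label run in both cases, here the trailing comma is removed
out=$(printf 'Violets are blue,\nSugar is sweet,\n' |
      sed -e '/are/{' -e 's/e/*/g' -e 'b done' -e '}' \
          -e 's/e/#/g' -e ':done' -e 's/,$//')
printf '%s\n' "$out"
# prints:
# Viol*ts ar* blu*
# Sugar is sw##t
```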

<br>

#### <a name="replacing-in-specific-column"></a>replacing in specific column

* `t` command will branch to specified label on successful substitution
* Without label, `t` will skip rest of commands and start next cycle
* More examples
    * [stackoverflow - replace data after last delimiter](https://stackoverflow.com/questions/39907133/replace-data-after-last-delimiter-of-every-line-using-sed-or-awk/39908523#39908523)
    * [stackoverflow - replace multiple occurrences in specific column](https://stackoverflow.com/questions/42886531/replace-mutliple-occurances-in-delimited-columns/42886919#42886919)

```bash
$ # replace space with underscore only in 3rd column
$ # ([^|]+\|){2} anchored with ^ matches the first two columns
$ # [^|]* zero or more non-column separator characters
$ # as long as match is found, command will be repeated on same input line
$ echo 'foo bar|a b c|1 2 3|xyz abc' | sed -E ':a s/^(([^|]+\|){2}[^|]*) /\1_/; ta'
foo bar|a b c|1_2_3|xyz abc

$ # use awk/perl for simpler syntax
$ # for ex: awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"_",$3); print}'
```

* example to show difference between `b` and `t`

```bash
$ # whether or not 'R' is found on lines containing 'are', branch will happen
$ sed '/are/{s/R/*/g;b}; s/e/#/g' poem.txt
*oses are red,
Violets are blue,
Sugar is sw##t,
And so are you.

$ # branch only if line contains 'are' and substitution of 'R' succeeds
$ sed '/are/{s/R/*/g;t}; s/e/#/g' poem.txt
*oses are red,
Viol#ts ar# blu#,
Sugar is sw##t,
And so ar# you.
```

<br>

#### <a name="overlapping-substitutions"></a>overlapping substitutions

* `t` command looping with label comes in handy for overlapping substitutions as well
* Note that in general this method will work recursively, see [stackoverflow - substitute recursively](https://stackoverflow.com/questions/9983646/sed-substitute-recursively) for example

```bash
$ # consider the problem of replacing empty columns with something
$ # case1: no consecutive empty columns - no problem
$ echo 'foo::bar::baz' | sed 's/::/:0:/g'
foo:0:bar:0:baz
$ # case2: consecutive empty columns are present - problematic
$ echo 'foo:::bar::baz' | sed 's/::/:0:/g'
foo:0::bar:0:baz

$ # t command looping will handle both cases
$ echo 'foo::bar::baz' | sed ':a s/::/:0:/; ta'
foo:0:bar:0:baz
$ echo 'foo:::bar::baz' | sed ':a s/::/:0:/; ta'
foo:0:0:bar:0:baz
```
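
* Another instance of the same looping technique, as a sketch: adding a thousands separator, where every pass inserts one comma and the loop stops once the substitution fails. The input numbers and variable names are only for illustration

```bash
# first pass:  1234567   -> 1234,567
# second pass: 1234,567  -> 1,234,567
# third pass:  no match, so 't' does not branch and the loop ends
big=$(echo '1234567' | sed -E -e ':a' -e 's/^([0-9]+)([0-9]{3})/\1,\2/' -e 'ta')
small=$(echo '42' | sed -E -e ':a' -e 's/^([0-9]+)([0-9]{3})/\1,\2/' -e 'ta')
echo "$big $small"
# prints: 1,234,567 42
```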

<br>

## <a name="lines-between-two-regexps"></a>Lines between two REGEXPs

* Simple cases were seen in [address range](#address-range) section
* This section will deal with more cases and some corner cases

<br>

#### <a name="include-or-exclude-matching-regexps"></a>Include or Exclude matching REGEXPs

Consider the sample input file, for simplicity the two REGEXPs are **BEGIN** and **END** strings instead of regular expressions

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

First, lines between the two *REGEXP*s are to be printed

* Case 1: both starting and ending *REGEXP* part of output

```bash
$ sed -n '/BEGIN/,/END/p' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END
```

* Case 2: both starting and ending *REGEXP* not part of output

```bash
$ # remember that empty REGEXP section will reuse previously matched REGEXP
$ sed -n '/BEGIN/,/END/{//!p}' range.txt
1234
6789
a
b
c
```

* Case 3: only starting *REGEXP* part of output

```bash
$ sed -n '/BEGIN/,/END/{/END/!p}' range.txt
BEGIN
1234
6789
BEGIN
a
b
c
```

* Case 4: only ending *REGEXP* part of output

```bash
$ sed -n '/BEGIN/,/END/{/BEGIN/!p}' range.txt
1234
6789
END
a
b
c
END
```

Second, lines between the two *REGEXP*s are to be deleted

* Case 5: both starting and ending *REGEXP* not part of output

```bash
$ sed '/BEGIN/,/END/d' range.txt
foo
bar
baz
```

* Case 6: both starting and ending *REGEXP* part of output

```bash
$ # remember that empty REGEXP section will reuse previously matched REGEXP
$ sed '/BEGIN/,/END/{//!d}' range.txt
foo
BEGIN
END
bar
BEGIN
END
baz
```

* Case 7: only starting *REGEXP* part of output

```bash
$ sed '/BEGIN/,/END/{/BEGIN/!d}' range.txt
foo
BEGIN
bar
BEGIN
baz
```

* Case 8: only ending *REGEXP* part of output

```bash
$ sed '/BEGIN/,/END/{/END/!d}' range.txt
foo
END
bar
END
baz
```
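
* The same `//!` trick works with commands other than `p` and `d`. A sketch that modifies only the lines strictly between the two *REGEXP*s, using a shortened version of the sample input; `out` is a name chosen for illustration

```bash
# inside the range, // reuses the REGEXP that was checked last
# so the substitution is skipped for the BEGIN and END lines themselves
out=$(printf 'foo\nBEGIN\n1234\nEND\nbar\n' | sed '/BEGIN/,/END/{//!s/^/> /}')
printf '%s\n' "$out"
# prints:
# foo
# BEGIN
# > 1234
# END
# bar
```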

<br>

#### <a name="first-or-last-block"></a>First or Last block

* Getting first block is very simple by using `q` command

```bash
$ sed -n '/BEGIN/,/END/{p;/END/q}' range.txt
BEGIN
1234
6789
END

$ # use other tricks discussed in previous section as needed
$ sed -n '/BEGIN/,/END/{//!p;/END/q}' range.txt
1234
6789
```

* To get the last block, reverse the input linewise with `tac`, swap the order of the *REGEXP*s and finally reverse the result again

```bash
$ tac range.txt | sed -n '/END/,/BEGIN/{p;/BEGIN/q}' | tac
BEGIN
a
b
c
END

$ # use other tricks discussed in previous section as needed
$ tac range.txt | sed -n '/END/,/BEGIN/{//!p;/BEGIN/q}' | tac
a
b
c
```

* To get a specific block, say 3rd one, `awk` or `perl` would be a better choice
    * See [Specific blocks](./gnu_awk.md#specific-blocks) for `awk` examples

<br>

#### <a name="broken-blocks"></a>Broken blocks

* If there are blocks with ending *REGEXP* but without corresponding starting *REGEXP*, `sed -n '/BEGIN/,/END/p'` will suffice
* Consider the modified input file where final starting *REGEXP* doesn't have corresponding ending

```bash
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz
```

* All lines till the end of the file get printed with simple use of `sed -n '/BEGIN/,/END/p'`
* The file reversing trick comes in handy here as well
* But if both kinds of broken blocks are present, further processing will be required. Better to use `awk` or `perl` in such cases
    * See [Broken blocks](./gnu_awk.md#broken-blocks) for `awk` examples

```bash
$ sed -n '/BEGIN/,/END/p' broken_range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
baz

$ tac broken_range.txt | sed -n '/END/,/BEGIN/p' | tac
BEGIN
1234
6789
END
```

* If there are multiple starting *REGEXP*s but a single ending *REGEXP*, the reversing trick comes in handy again

```bash
$ cat uneven_range.txt
foo
BEGIN
1234
BEGIN
42
6789
END
bar
BEGIN
a
BEGIN
b
BEGIN
c
BEGIN
d
BEGIN
e
END
baz

$ tac uneven_range.txt | sed -n '/END/,/BEGIN/p' | tac
BEGIN
42
6789
END
BEGIN
e
END
```

<br>

## <a name="sed-scripts"></a>sed scripts

* `sed` commands can be placed in a file and called using `-f` option or directly executed using [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix))
* See [sed manual - Some Sample Scripts](https://www.gnu.org/software/sed/manual/sed.html#Examples) for more examples
* See [sed manual - Often-Used Commands](https://www.gnu.org/software/sed/manual/sed.html#Common-Commands) for more details on using comments

```bash
$ cat script.sed
# each line is a command
/is/cfoo bar
/you/r 3.txt
/you/d
# single quotes can be used freely
s/are/'are'/g

$ sed -f script.sed poem.txt
Roses 'are' red,
Violets 'are' blue,
foo bar
3
13

$ # command line options are specified as usual
$ sed -nf script.sed poem.txt
foo bar
3
13
```

* command line options can be specified along with shebang as well as added at time of invocation
* See also [stackoverflow - usage of options along with shebang depends on a lot of factors](https://stackoverflow.com/questions/4303128/how-to-use-multiple-arguments-with-a-shebang-i-e)

```bash
$ type sed
sed is /bin/sed

$ cat executable.sed
#!/bin/sed -f
/is/cfoo bar
/you/r 3.txt
/you/d
s/are/'are'/g

$ chmod +x executable.sed

$ ./executable.sed poem.txt
Roses 'are' red,
Violets 'are' blue,
foo bar
3
13

$ ./executable.sed -n poem.txt
foo bar
3
13
```

<br>

## <a name="gotchas-and-tips"></a>Gotchas and Tips

* dos style line endings

```bash
$ # no issue with unix style line ending
$ printf 'foo bar\n123 789\n' | sed -E 's/\w+$/xyz/'
foo xyz
123 xyz

$ # dos style line ending causes trouble
$ printf 'foo bar\r\n123 789\r\n' | sed -E 's/\w+$/xyz/'
foo bar
123 789

$ # can be corrected by adding \r as well to match
$ # if needed, add \r in replacement section as well
$ printf 'foo bar\r\n123 789\r\n' | sed -E 's/\w+\r$/xyz/'
foo xyz
123 xyz
```

* changing dos to unix style line ending and vice versa

```bash
$ # bash functions
$ unix2dos() { sed -i 's/$/\r/' "$@" ; }
$ dos2unix() { sed -i 's/\r$//' "$@" ; }

$ cat -A 5.txt
five$
1five$

$ unix2dos 5.txt
$ cat -A 5.txt
five^M$
1five^M$

$ dos2unix 5.txt
$ cat -A 5.txt
five$
1five$
```
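
* The `unix2dos` function above adds one more `\r` every time it is run on the same file. A possible variation, named `unix2dos_safe` here only for illustration, replaces any carriage returns already present at the end of a line

```bash
# \r*$ also matches carriage returns that are already there
# so running the function repeatedly leaves exactly one \r per line
unix2dos_safe() { sed -i 's/\r*$/\r/' "$@" ; }

tmp=$(mktemp -d)
# sample file with mixed line endings
printf 'a\nb\r\n' > "$tmp/mixed.txt"
unix2dos_safe "$tmp/mixed.txt"
unix2dos_safe "$tmp/mixed.txt"
out=$(cat -A "$tmp/mixed.txt")
printf '%s\n' "$out"
# prints:
# a^M$
# b^M$

rm -r "$tmp"
```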

* variable/command substitution
* See also [stackoverflow - Is it possible to escape regex metacharacters reliably with sed](https://stackoverflow.com/questions/29613304/is-it-possible-to-escape-regex-metacharacters-reliably-with-sed)

```bash
$ # variables don't get expanded within single quotes
$ printf 'user\nhome\n' | sed '/user/ s/$/: $USER/'
user: $USER
home
$ printf 'user\nhome\n' | sed '/user/ s/$/: '"$USER"'/'
user: learnbyexample
home

$ # variable being substituted cannot have the delimiter character
$ printf 'user\nhome\n' | sed '/home/ s/$/: '"$HOME"'/'
sed: -e expression #1, char 15: unknown option to `s'
$ printf 'user\nhome\n' | sed '/home/ s#$#: '"$HOME"'#'
user
home: /home/learnbyexample

$ # use r command for robust insertion from file/command-output
$ sed '1a'"$(seq 2)" 5.txt
sed: -e expression #1, char 5: missing command
$ seq 2 | sed '1r /dev/stdin' 5.txt
five
1
2
1five
```
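
* When the delimiter cannot be chosen freely, the value can be escaped before it is substituted. A sketch along the lines of the linked answer for the replacement section only, assuming the value has no newline characters; `value`, `repl` and `out` are hypothetical names

```bash
value='/home/a&b'
# add a backslash before \, & and the / delimiter so they are taken literally
# '#' is used as the delimiter here so that / can be part of the character class
repl=$(printf '%s\n' "$value" | sed 's#[&/\]#\\&#g')
printf '%s\n' "$repl"
# prints: \/home\/a\&b

out=$(printf 'user\nhome\n' | sed '/home/ s/$/: '"$repl"'/')
printf '%s\n' "$out"
# prints:
# user
# home: /home/a&b
```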

* common regular expression mistakes #1 - greediness

```bash
$ s='foo and bar and baz land good'
$ echo "$s" | sed 's/foo.*ba/123 789/'
123 789z land good

$ # use a more restrictive version
$ echo "$s" | sed -E 's/foo \w+ ba/123 789/'
123 789r and baz land good

$ # or use a tool with non-greedy feature available
$ echo "$s" | perl -pe 's/foo.*?ba/123 789/'
123 789r and baz land good

$ # for single characters, use negated character class
$ echo 'foo=123,baz=789,xyz=42' | sed 's/foo=.*,//'
xyz=42
$ echo 'foo=123,baz=789,xyz=42' | sed 's/foo=[^,]*,//'
baz=789,xyz=42
```

* common regular expression mistakes #2 - BRE vs ERE syntax

```bash
$ # + needs to be escaped with BRE or enable ERE
$ echo 'like 42 and 37' | sed 's/[0-9]+/xxx/g'
like 42 and 37
$ echo 'like 42 and 37' | sed -E 's/[0-9]+/xxx/g'
like xxx and xxx

$ # or escaping when not required
$ echo 'get {} and let' | sed 's/\{\}/[]/'
sed: -e expression #1, char 10: Invalid preceding regular expression
$ echo 'get {} and let' | sed 's/{}/[]/'
get [] and let
```
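
* For reference, a sketch of the BRE ways to write the `+` quantifier; the first needs GNU sed, the other two should work with any POSIX `sed`. The variable names are only for illustration

```bash
s='like 42 and 37'
# GNU sed extension: escaped + in BRE
gnu=$(echo "$s" | sed 's/[0-9]\+/xxx/g')
# portable: one digit followed by zero or more digits
star=$(echo "$s" | sed 's/[0-9][0-9]*/xxx/g')
# portable: interval expression, braces are escaped in BRE
interval=$(echo "$s" | sed 's/[0-9]\{1,\}/xxx/g')
printf '%s\n' "$gnu" "$star" "$interval"
# prints 'like xxx and xxx' three times
```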

* common regular expression mistakes #3 - using PCRE syntax/features
    * especially by trying out solution on online sites like [regex101](https://regex101.com/) and expecting it to work with `sed` as well

```bash
$ # \d is not available as backslash character class, will match 'd' instead
$ echo 'like 42 and 37' | sed -E 's/\d+/xxx/g'
like 42 anxxx 37
$ echo 'like 42 and 37' | sed -E 's/[0-9]+/xxx/g'
like xxx and xxx

$ # features like lookarounds/non-greedy/etc not available
$ echo 'foo,baz,,xyz,,,123' | sed -E 's/,\K(?=,)/NaN/g'
sed: -e expression #1, char 16: Invalid preceding regular expression
$ echo 'foo,baz,,xyz,,,123' | perl -pe 's/,\K(?=,)/NaN/g'
foo,baz,NaN,xyz,NaN,NaN,123
```
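
* For this particular input, the `t` command looping seen in the overlapping substitutions section is one way to get the same result with `sed` alone. A sketch, with `out` as a name chosen for illustration

```bash
# each pass fills one empty field, loop continues until no ',,' is left
out=$(echo 'foo,baz,,xyz,,,123' | sed -e ':a' -e 's/,,/,NaN,/' -e 'ta')
printf '%s\n' "$out"
# prints: foo,baz,NaN,xyz,NaN,NaN,123
```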

* common regular expression mistakes #4 - end of line white-space

```bash
$ printf 'foo bar \n123 789\t\n' | sed -E 's/\w+$/xyz/'
foo bar 
123 789 

$ printf 'foo bar \n123 789\t\n' | sed -E 's/\w+\s*$/xyz/'
foo xyz
123 xyz
```

* and many more... see also
    * [unix.stackexchange - Why does my regular expression work in X but not in Y?](https://unix.stackexchange.com/questions/119905/why-does-my-regular-expression-work-in-x-but-not-in-y)
    * [stackoverflow - Greedy vs. Reluctant vs. Possessive Quantifiers](https://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers)
    * [stackoverflow - How to replace everything between but only until the first occurrence of the end string?](https://stackoverflow.com/questions/45168607/how-to-replace-everything-between-but-only-until-the-first-occurrence-of-the-end)
    * [stackoverflow - How to match a specified pattern with multiple possibilities](https://stackoverflow.com/questions/43650926/how-to-match-a-specified-pattern-with-multiple-possibilities)
    * [stackoverflow - mixing different regex syntax](https://stackoverflow.com/questions/45389684/cant-comment-a-line-in-my-cnf/45389833#45389833)
    * [sed manual - BRE-vs-ERE](https://www.gnu.org/software/sed/manual/sed.html#BRE-vs-ERE)

* Speed boost for ASCII encoded input

```bash
$ time sed -nE '/^([a-d][r-z]){3}$/p' /usr/share/dict/words
avatar
awards
cravat

real    0m0.058s
$ time LC_ALL=C sed -nE '/^([a-d][r-z]){3}$/p' /usr/share/dict/words
avatar
awards
cravat

real    0m0.038s

$ time sed -nE '/^([a-z]..)\1$/p' /usr/share/dict/words > /dev/null

real    0m0.111s
$ time LC_ALL=C sed -nE '/^([a-z]..)\1$/p' /usr/share/dict/words > /dev/null

real    0m0.073s
```

<br>

## <a name="further-reading"></a>Further Reading

* Manual and related
    * `man sed` and `info sed` for more details, known issues/limitations as well as options/commands not covered in this tutorial
    * [GNU sed manual](https://www.gnu.org/software/sed/manual/sed.html) has even more detailed information and examples
    * [sed FAQ](http://sed.sourceforge.net/sedfaq.html), last modified '10 March 2003'
    * [stackoverflow - BSD/macOS Sed vs GNU Sed vs the POSIX Sed specification](https://stackoverflow.com/questions/24275070/sed-not-giving-me-correct-substitute-operation-for-newline-with-mac-difference/24276470#24276470)
    * [unix.stackexchange - Differences between sed on Mac OSX and other standard sed](https://unix.stackexchange.com/questions/13711/differences-between-sed-on-mac-osx-and-other-standard-sed)
* This chapter has also been [converted to a book](https://github.com/learnbyexample/learn_gnused) with additional description, examples and exercises.
* Tutorials and Q&A
    * [sed basics](https://code.snipcademy.com/tutorials/shell-scripting/sed/introduction)
    * [sed detailed tutorial](https://www.grymoire.com/Unix/Sed.html) - has details on differences between various `sed` versions as well
    * [sed one-liners explained](https://catonmat.net/sed-one-liners-explained-part-one)
    * [cheat sheet](https://catonmat.net/ftp/sed.stream.editor.cheat.sheet.txt)
    * [unix.stackexchange - common search and replace examples](https://unix.stackexchange.com/questions/112023/how-can-i-replace-a-string-in-a-files)
    * [sed Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/sed?sort=votes&pageSize=15)
    * [sed Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/sed?sort=votes&pageSize=15)
* Selected examples - portable solutions, commands not covered in this tutorial, same problem solved using different tools, etc
    * [unix.stackexchange - replace multiline string](https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string)
    * [stackoverflow - deleting empty lines with optional white spaces](https://stackoverflow.com/questions/16414410/delete-empty-lines-using-sed)
    * [unix.stackexchange - print only line above the matching line](https://unix.stackexchange.com/questions/264489/find-each-line-matching-a-pattern-but-print-only-the-line-above-it)
    * [stackoverflow - How to select lines between two patterns?](https://stackoverflow.com/questions/38972736/how-to-select-lines-between-two-patterns)
    * [stackoverflow - get lines between two patterns only if there is third pattern between them](https://stackoverflow.com/questions/39960075/bash-how-to-get-lines-between-patterns-only-if-there-is-pattern2-between-them)
        * [unix.stackexchange - similar example](https://unix.stackexchange.com/questions/228699/sed-print-lines-matched-by-a-pattern-range-if-one-line-matches-a-condition)
* Learn Regular Expressions (has information on flavors other than BRE/ERE too)
    * [Regular Expressions Tutorial](https://www.regular-expressions.info/tutorial.html)
    * [regexcrossword](https://regexcrossword.com/)
    * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)
* Related tools
    * [rpl](https://unix.stackexchange.com/questions/112023/how-can-i-replace-a-string-in-a-files/251742#251742) - search and replace tool, has interesting options like interactive mode and recursive mode
    * [sd](https://github.com/chmln/sd) - simple search and replace, implemented in Rust
    * [sedsed](https://github.com/aureliojargas/sedsed) - Debugger, indenter and HTMLizer for sed scripts
    * [xo](https://github.com/ezekg/xo) - composes regular expression match groups
* [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed)


================================================
FILE: miscellaneous.md
================================================
# <a name="miscellaneous"></a>Miscellaneous

**Table of Contents**

* [cut](#cut)
    * [select specific fields](#select-specific-fields)
    * [suppressing lines without delimiter](#suppressing-lines-without-delimiter)
    * [specifying delimiters](#specifying-delimiters)
    * [complement](#complement)
    * [select specific characters](#select-specific-characters)
    * [Further reading for cut](#further-reading-for-cut)
* [tr](#tr)
    * [translation](#translation)
    * [escape sequences and character classes](#escape-sequences-and-character-classes)
    * [deletion](#deletion)
    * [squeeze](#squeeze)
    * [Further reading for tr](#further-reading-for-tr)
* [basename](#basename)
* [dirname](#dirname)
* [xargs](#xargs)
* [seq](#seq)
    * [integer sequences](#integer-sequences)
    * [specifying separator](#specifying-separator)
    * [floating point sequences](#floating-point-sequences)
    * [Further reading for seq](#further-reading-for-seq)

<br>

## <a name="cut"></a>cut

```bash
$ cut --version | head -n1
cut (GNU coreutils) 8.25

$ man cut
CUT(1)                           User Commands                          CUT(1)

NAME
       cut - remove sections from each line of files

SYNOPSIS
       cut OPTION... [FILE]...

DESCRIPTION
       Print selected parts of lines from each FILE to standard output.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="select-specific-fields"></a>select specific fields

* Default delimiter is **tab** character
* `-f` option prints the specified field(s) from each input line

```bash
$ printf 'foo\tbar\t123\tbaz\n'
foo     bar     123     baz

$ # single field
$ printf 'foo\tbar\t123\tbaz\n' | cut -f2
bar

$ # multiple fields can be specified by using ,
$ printf 'foo\tbar\t123\tbaz\n' | cut -f2,4
bar     baz

$ # output is always ascending order of field numbers
$ printf 'foo\tbar\t123\tbaz\n' | cut -f3,1
foo     123

$ # range can be specified using -
$ printf 'foo\tbar\t123\tbaz\n' | cut -f1-3
foo     bar     123
$ # if ending number is omitted, select till last field
$ printf 'foo\tbar\t123\tbaz\n' | cut -f3-
123     baz
```
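
* Since `cut` cannot change the order of fields, a tool like `awk` is needed for that. A minimal sketch; `out` is a name chosen for illustration

```bash
# FS and OFS set the input and output field separators to tab
# fields are printed in the order given
out=$(printf 'foo\tbar\t123\tbaz\n' | awk 'BEGIN{FS=OFS="\t"} {print $3, $1}')
printf '%s\n' "$out"
# prints 123 and foo separated by a tab
```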

<br>

#### <a name="suppressing-lines-without-delimiter"></a>suppressing lines without delimiter

```bash
$ cat marks.txt
jan 2017
foobar  12      45      23
feb 2017
foobar  18      38      19

$ # by default lines without delimiter will be printed
$ cut -f2- marks.txt
jan 2017
12      45      23
feb 2017
18      38      19

$ # use -s option to suppress such lines
$ cut -s -f2- marks.txt
12      45      23
18      38      19
```

<br>

#### <a name="specifying-delimiters"></a>specifying delimiters

* use `-d` option to specify input delimiter other than default **tab** character
* only single character can be used, for multi-character/regex based delimiter use `awk` or `perl`

```bash
$ echo 'foo:bar:123:baz' | cut -d: -f3
123

$ # by default output delimiter is same as input
$ echo 'foo:bar:123:baz' | cut -d: -f1,4
foo:baz

$ # quote the delimiter character if it clashes with shell special characters
$ echo 'one;two;three;four' | cut -d; -f3
cut: option requires an argument -- 'd'
Try 'cut --help' for more information.
-f3: command not found
$ echo 'one;two;three;four' | cut -d';' -f3
three
```
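
* For the multi-character or regex based delimiter case mentioned above, a short `awk` sketch; the sample inputs and variable names are only for illustration

```bash
# multi-character delimiter
second=$(echo 'foo::bar::baz' | awk -F'::' '{print $2}')
# regex based delimiter: one or more digits
third=$(echo 'foo1bar22baz' | awk -F'[0-9]+' '{print $3}')
echo "$second $third"
# prints: bar baz
```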

* use `--output-delimiter` option to specify different output delimiter
* since this option accepts a string, more than one character can be specified
* See also [using $ prefixed string](https://unix.stackexchange.com/questions/48106/what-does-it-mean-to-have-a-dollarsign-prefixed-string-in-a-script)

```bash
$ printf 'foo\tbar\t123\tbaz\n' | cut --output-delimiter=: -f1-3
foo:bar:123

$ echo 'one;two;three;four' | cut -d';' --output-delimiter=' ' -f1,3-
one three four

$ # tested on bash, might differ with other shells
$ echo 'one;two;three;four' | cut -d';' --output-delimiter=$'\t' -f1,3-
one     three   four

$ echo 'one;two;three;four' | cut -d';' --output-delimiter=' - ' -f1,3-
one - three - four
```

<br>

#### <a name="complement"></a>complement

```bash
$ echo 'one;two;three;four' | cut -d';' -f1,3-
one;three;four

$ # to print other than specified fields
$ echo 'one;two;three;four' | cut -d';' --complement -f2
one;three;four
```

<br>

#### <a name="select-specific-characters"></a>select specific characters

* similar to `-f` for field selection, use `-c` for character selection
* See manual for what defines a character and differences between `-b` and `-c`

```bash
$ echo 'foo:bar:123:baz' | cut -c4
:

$ printf 'foo\tbar\t123\tbaz\n' | cut -c1,4,7
f       r

$ echo 'foo:bar:123:baz' | cut -c8-
:123:baz

$ echo 'foo:bar:123:baz' | cut --complement -c8-
foo:bar

$ echo 'foo:bar:123:baz' | cut -c1,6,7 --output-delimiter=' '
f a r

$ echo 'abcdefghij' | cut --output-delimiter='-' -c1-3,4-7,8-
abc-defg-hij

$ cut -c1-3 marks.txt
jan
foo
feb
foo
```

<br>

#### <a name="further-reading-for-cut"></a>Further reading for cut

* `man cut` and `info cut` for more options and detailed documentation
* [cut Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/cut?sort=votes&pageSize=15)

<br>

## <a name="tr"></a>tr

```bash
$ tr --version | head -n1
tr (GNU coreutils) 8.25

$ man tr
TR(1)                            User Commands                           TR(1)

NAME
       tr - translate or delete characters

SYNOPSIS
       tr [OPTION]... SET1 [SET2]

DESCRIPTION
       Translate, squeeze, and/or delete characters from standard input, writ‐
       ing to standard output.
...
```

<br>

#### <a name="translation"></a>translation

* one-to-one mapping of characters, all occurrences are translated
* as good practice, enclose the arguments in single quotes to avoid issues due to shell interpretation

```bash
$ echo 'foo bar cat baz' | tr 'abc' '123'
foo 21r 31t 21z

$ # use - to represent a range in ascending order
$ echo 'foo bar cat baz' | tr 'a-f' '1-6'
6oo 21r 31t 21z

$ # changing case
$ echo 'foo bar cat baz' | tr 'a-z' 'A-Z'
FOO BAR CAT BAZ
$ echo 'Hello World' | tr 'a-zA-Z' 'A-Za-z'
hELLO wORLD

$ echo 'foo;bar;baz' | tr ; :
tr: missing operand
Try 'tr --help' for more information.
$ echo 'foo;bar;baz' | tr ';' ':'
foo:bar:baz
```

* rot13 example

```bash
$ echo 'foo bar cat baz' | tr 'a-z' 'n-za-m'
sbb one png onm
$ echo 'sbb one png onm' | tr 'a-z' 'n-za-m'
foo bar cat baz

$ echo 'Hello World' | tr 'a-zA-Z' 'n-za-mN-ZA-M'
Uryyb Jbeyq
$ echo 'Uryyb Jbeyq' | tr 'a-zA-Z' 'n-za-mN-ZA-M'
Hello World
```

* use shell input redirection for file input

```bash
$ cat marks.txt
jan 2017
foobar  12      45      23
feb 2017
foobar  18      38      19

$ tr 'a-z' 'A-Z' < marks.txt
JAN 2017
FOOBAR  12      45      23
FEB 2017
FOOBAR  18      38      19
```

* if arguments are of different lengths

```bash
$ # when second argument is longer, the extra characters are ignored
$ echo 'foo bar cat baz' | tr 'abc' '1-9'
foo 21r 31t 21z

$ # when first argument is longer
$ # the last character of second argument gets re-used
$ echo 'foo bar cat baz' | tr 'a-z' '123'
333 213 313 213

$ # use -t option to truncate first argument to same length as second
$ echo 'foo bar cat baz' | tr -t 'a-z' '123'
foo 21r 31t 21z
```

<br>

#### <a name="escape-sequences-and-character-classes"></a>escape sequences and character classes

* Certain characters like newline, tab, etc. can be represented using escape sequences or octal notation
* Commonly used groups of characters like alphabets, digits, punctuation, etc. have character classes as shortcuts
* See [gnu tr manual](http://www.gnu.org/software/coreutils/manual/html_node/Character-sets.html#Character-sets) for all escape sequences and character classes

```bash
$ printf 'foo\tbar\t123\tbaz\n' | tr '\t' ':'
foo:bar:123:baz

$ echo 'foo:bar:123:baz' | tr ':' '\n'
foo
bar
123
baz
$ # makes it easier to transform
$ echo 'foo:bar:123:baz' | tr ':' '\n' | pr -2ats'-'
foo-bar
123-baz

$ echo 'foo bar cat baz' | tr '[:lower:]' '[:upper:]'
FOO BAR CAT BAZ
```

* since `-` is used for character ranges, place it at the end to represent it literally
    * cannot be used at start of argument as it would get treated as option
    * or use `--` to indicate end of option processing
* similarly, to represent `\` literally, use `\\`

```bash
$ echo '/foo-bar/baz/report' | tr '-a-z' '_A-Z'
tr: invalid option -- 'a'
Try 'tr --help' for more information.

$ echo '/foo-bar/baz/report' | tr 'a-z-' 'A-Z_'
/FOO_BAR/BAZ/REPORT

$ echo '/foo-bar/baz/report' | tr -- '-a-z' '_A-Z'
/FOO_BAR/BAZ/REPORT

$ echo '/foo-bar/baz/report' | tr '/-' '\\_'
\foo_bar\baz\report
```

<br>

#### <a name="deletion"></a>deletion

* use `-d` option to specify characters to be deleted
* add complement option `-c` if it is easier to define which characters are to be retained

```bash
$ echo '2017-03-21' | tr -d '-'
20170321

$ echo 'Hi123 there. How a32re you' | tr -d '1-9'
Hi there. How are you

$ # delete all punctuation characters
$ echo '"Foo1!", "Bar.", ":Baz:"' | tr -d '[:punct:]'
Foo1 Bar Baz

$ # deleting carriage return character
$ cat -v greeting.txt
Hi there^M
How are you^M
$ tr -d '\r' < greeting.txt | cat -v
Hi there
How are you

$ # retain only alphabets, comma and newline characters
$ echo '"Foo1!", "Bar.", ":Baz:"' | tr -cd '[:alpha:],\n'
Foo,Bar,Baz
```

<br>

#### <a name="squeeze"></a>squeeze

* to change consecutive repeated characters to a single copy of that character

```bash
$ # only lower case alphabets
$ echo 'FFoo seed 11233' | tr -s 'a-z'
FFo sed 11233

$ # alphabets and digits
$ echo 'FFoo seed 11233' | tr -s '[:alnum:]'
Fo sed 123

$ # squeeze other than alphabets
$ echo 'FFoo seed 11233' | tr -sc '[:alpha:]'
FFoo seed 123

$ # only characters present in second argument are used for squeeze
$ echo 'FFoo seed 11233' | tr -s 'A-Z' 'a-z'
fo sed 11233

$ # multiple consecutive horizontal spaces to single space
$ printf 'foo\t\tbar \t123     baz\n'
foo             bar     123     baz
$ printf 'foo\t\tbar \t123     baz\n' | tr -s '[:blank:]' ' '
foo bar 123 baz
```
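Squeeze also combines with translation and the `-c` option. As a minimal sketch (the function name `split_words` is made up for illustration), the snippet below converts every run of non-alphabetic characters into a single newline, giving one word per line. It is written as a script rather than an interactive session.

```bash
# -c selects everything other than alphabets, which gets translated to newline
# -s then squeezes the repeated newlines into a single newline
split_words() {
    tr -cs '[:alpha:]' '\n'
}

echo 'Hi, there! 42 times' | split_words
```

For the sample input, this prints `Hi`, `there` and `times` on separate lines.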

<br>

#### <a name="further-reading-for-tr"></a>Further reading for tr

* `man tr` and `info tr` for more options and detailed documentation
* [tr Q&A on unix stackexchange](http://unix.stackexchange.com/questions/tagged/tr?sort=votes&pageSize=15)

<br>

## <a name="basename"></a>basename

```bash
$ basename --version | head -n1
basename (GNU coreutils) 8.25

$ man basename
BASENAME(1)                      User Commands                     BASENAME(1)

NAME
       basename - strip directory and suffix from filenames

SYNOPSIS
       basename NAME [SUFFIX]
       basename OPTION... NAME...

DESCRIPTION
       Print  NAME  with  any leading directory components removed.  If speci‐
       fied, also remove a trailing SUFFIX.
...
```

<br>

**Examples**

```bash
$ # same as using pwd command
$ echo "$PWD"
/home/learnbyexample

$ basename "$PWD"
learnbyexample

$ # use -a option if there are multiple arguments
$ basename -a foo/a/report.log bar/y/power.log
report.log
power.log

$ # use single quotes if arguments contain space and other special shell characters
$ # use suffix option -s to strip file extension from filename
$ basename -s '.log' '/home/learnbyexample/proj adder/power.log'
power
$ # -a is implied when using -s option
$ basename -s'.log' foo/a/report.log bar/y/power.log
report
power
```

* Can also use [Parameter expansion](http://mywiki.wooledge.org/BashFAQ/073) if working on file paths saved in variables
    * assumes `bash` shell and similar that support this feature

```bash
$ # remove from start of string up to last /
$ file='/home/learnbyexample/proj adder/power.log'
$ basename "$file"
power.log
$ echo "${file##*/}"
power.log

$ t="${file##*/}"
$ # remove .log from end of string
$ echo "${t%.log}"
power
```
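The two expansions shown above can be wrapped in a small helper. This is only a sketch: `strip_dir_ext` is a hypothetical name, and `${t%.*}` is used instead of `${t%.log}` so that whatever follows the last `.` is removed.

```bash
# remove directory portion first, then the last extension
# function body within ( ) runs in a subshell, so variable t doesn't leak
strip_dir_ext() (
    t="${1##*/}"
    printf '%s\n' "${t%.*}"
)

strip_dir_ext '/home/learnbyexample/proj adder/power.log'
strip_dir_ext 'foo/a/report.tar.gz'
```

Only the last extension is removed, so the second call gives `report.tar`. A name without any `.` character is printed unchanged.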

* See `man basename` and `info basename` for detailed documentation

<br>

## <a name="dirname"></a>dirname

```bash
$ dirname --version | head -n1
dirname (GNU coreutils) 8.25

$ man dirname
DIRNAME(1)                       User Commands                      DIRNAME(1)

NAME
       dirname - strip last component from file name

SYNOPSIS
       dirname [OPTION] NAME...

DESCRIPTION
       Output each NAME with its last non-slash component and trailing slashes
       removed; if NAME contains no  /'s,  output  '.'  (meaning  the  current
       directory).
...
```

<br>

**Examples**

```bash
$ echo "$PWD"
/home/learnbyexample

$ dirname "$PWD"
/home

$ # use single quotes if arguments contain space and other special shell characters
$ dirname '/home/learnbyexample/proj adder/power.log'
/home/learnbyexample/proj adder

$ # unlike basename, by default dirname handles multiple arguments
$ dirname foo/a/report.log bar/y/power.log
foo/a
bar/y

$ # if no / in argument, output is . to indicate current directory
$ dirname power.log
.
```

* Use `$()` command substitution to further process output as needed

```bash
$ dirname '/home/learnbyexample/proj adder/power.log'
/home/learnbyexample/proj adder

$ dirname "$(dirname '/home/learnbyexample/proj adder/power.log')"
/home/learnbyexample

$ basename "$(dirname '/home/learnbyexample/proj adder/power.log')"
proj adder
```

* Can also use [Parameter expansion](http://mywiki.wooledge.org/BashFAQ/073) if working on file paths saved in variables
    * assumes `bash` shell and similar that support this feature

```bash
$ # remove from last / in the string to end of string
$ file='/home/learnbyexample/proj adder/power.log'
$ dirname "$file"
/home/learnbyexample/proj adder
$ echo "${file%/*}"
/home/learnbyexample/proj adder

$ # remove from second last / to end of string
$ echo "${file%/*/*}"
/home/learnbyexample

$ # apply basename trick to get just directory name instead of full path
$ t="${file%/*}"
$ echo "${t##*/}"
proj adder
```
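One difference to keep in mind: `dirname` outputs `.` when the argument has no `/`, whereas `${file%/*}` leaves such a string unchanged. A possible way to handle that case is sketched below. The function name `dir_of` is hypothetical, and the sketch assumes the path has no trailing `/` characters.

```bash
# mimic dirname for paths without trailing slashes
dir_of() (
    case "$1" in
        # remove from last / to end of string, use / if nothing is left
        */*) d="${1%/*}"; printf '%s\n' "${d:-/}" ;;
        # no / in the argument, so the file is in the current directory
        *)   printf '.\n' ;;
    esac
)

dir_of '/home/learnbyexample/proj adder/power.log'
dir_of power.log
```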

* See `man dirname` and `info dirname` for detailed documentation

<br>

## <a name="xargs"></a>xargs

```bash
$ xargs --version | head -n1
xargs (GNU findutils) 4.7.0-git

$ whatis xargs
xargs (1)            - build and execute command lines from standard input

$ # from 'man xargs'
       This manual page documents the GNU version of xargs.  xargs reads items
       from  the  standard  input, delimited by blanks (which can be protected
       with double or single quotes or a backslash) or newlines, and  executes
       the  command (default is /bin/echo) one or more times with any initial-
       arguments followed by items read from standard input.  Blank  lines  on
       the standard input are ignored.
```

While `xargs` is [primarily used](https://unix.stackexchange.com/questions/24954/when-is-xargs-needed) for passing the output of a command or the contents of a file to another command as input arguments and/or for parallel processing, it can also be quite handy for certain text processing tasks with the default `echo` command

```bash
$ printf ' foo\t\tbar \t123     baz \n' | cat -e
 foo		bar 	123     baz $
$ # tr helps to change consecutive blanks to single space
$ # but what if blanks at start and end have to be removed as well?
$ printf ' foo\t\tbar \t123     baz \n' | tr -s '[:blank:]' ' ' | cat -e
 foo bar 123 baz $
$ # xargs does this by default
$ printf ' foo\t\tbar \t123     baz \n' | xargs | cat -e
foo bar 123 baz$

$ # -n option limits number of arguments per line
$ printf ' foo\t\tbar \t123     baz \n' | xargs -n2
foo bar
123 baz

$ # same as using: paste -d' ' - - -
$ # or: pr -3ats' '
$ seq 6 | xargs -n3
1 2 3
4 5 6
```

* use `-a` option to specify file input instead of stdin

```bash
$ cat marks.txt
jan 2017
foobar  12      45      23
feb 2017
foobar  18      38      19

$ xargs -a marks.txt
jan 2017 foobar 12 45 23 feb 2017 foobar 18 38 19

$ # use -L option to limit max number of lines per command line
$ xargs -L2 -a marks.txt
jan 2017 foobar 12 45 23
feb 2017 foobar 18 38 19
```

* **Note** since `echo` is the command being executed by default, input items that look like `echo` options (for ex: `-e`) get interpreted as options instead of being printed

```bash
$ printf ' -e foo\t\tbar \t123     baz \n' | xargs -n2
foo
bar 123
baz

$ # use -t option to see what is happening (verbose output)
$ printf ' -e foo\t\tbar \t123     baz \n' | xargs -n2 -t
echo -e foo 
foo
echo bar 123 
bar 123
echo baz 
baz
```
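One possible workaround, shown here only as a sketch, is to replace the default `echo` with a `printf` based command that never interprets its arguments as options. The function name `echo_args` is hypothetical.

```bash
# $1 is the maximum number of arguments per output line
# "$*" joins the arguments received by the inner shell with a space
echo_args() {
    xargs -n"$1" sh -c 'printf "%s\n" "$*"' sh
}

printf ' -e foo\t\tbar \t123     baz \n' | echo_args 2
```

With this, the first output line is `-e foo` instead of just `foo`.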

* See `man xargs` and `info xargs` for detailed documentation

<br>

## <a name="seq"></a>seq

```bash
$ seq --version | head -n1
seq (GNU coreutils) 8.25

$ man seq
SEQ(1)                           User Commands                          SEQ(1)

NAME
       seq - print a sequence of numbers

SYNOPSIS
       seq [OPTION]... LAST
       seq [OPTION]... FIRST LAST
       seq [OPTION]... FIRST INCREMENT LAST

DESCRIPTION
       Print numbers from FIRST to LAST, in steps of INCREMENT.
...
```

<br>

#### <a name="integer-sequences"></a>integer sequences

* see `info seq` for details of how large numbers are handled
    * for ex: `seq 50000000000000000000 2 50000000000000000004` may not work

```bash
$ # default start=1 and increment=1
$ seq 3
1
2
3

$ # default increment=1
$ seq 25434 25437
25434
25435
25436
25437
$ seq -5 -3
-5
-4
-3

$ # different increment value
$ seq 1000 5 1011
1000
1005
1010

$ # use negative increment for descending order
$ seq 10 -5 -7
10
5
0
-5
```

* use `-w` option for leading zeros
* largest length of start/end value is used to determine padding

```bash
$ seq 008 010
8
9
10

$ # or: seq -w 8 010
$ seq -w 008 010
008
009
010

$ seq -w 0003
0001
0002
0003
```

<br>

#### <a name="specifying-separator"></a>specifying separator

* As seen already, default is newline separator between numbers
* `-s` option allows a custom string to be used between numbers
* A newline is always added at end

```bash
$ seq -s: 4
1:2:3:4

$ seq -s' ' 4
1 2 3 4

$ seq -s' - ' 4
1 - 2 - 3 - 4
```
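As a small illustration of what a custom separator makes possible, the sketch below uses `+` as the separator and hands the resulting expression to shell arithmetic. The function name `sum_to` is made up for this example.

```bash
# seq -s+ 4 gives 1+2+3+4, which $(( )) then evaluates
sum_to() {
    echo "$(( $(seq -s+ "$1") ))"
}

sum_to 4
sum_to 10
```

The two calls print `10` and `55` respectively.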

<br>

#### <a name="floating-point-sequences"></a>floating point sequences

```bash
$ # default increment=1
$ seq 0.5 2.5
0.5
1.5
2.5

$ seq -s':' -2 0.75 3
-2.00:-1.25:-0.50:0.25:1.00:1.75:2.50

$ # Scientific notation is supported
$ seq 1.2e2 1.22e2
120
121
122
```

* formatting numbers, see `info seq` for details

```bash
$ seq -f'%.3f' -s':' -2 0.75 3
-2.000:-1.250:-0.500:0.250:1.000:1.750:2.500

$ seq -f'%.3e' 1.2e2 1.22e2
1.200e+02
1.210e+02
1.220e+02
```
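The format string can also carry literal text around the number directive, which is handy for generating numbered names. A minimal sketch, where the `report_` prefix and the function name `log_names` are purely illustrative:

```bash
# %02g pads the number with zeros to a minimum width of two
log_names() {
    seq -f 'report_%02g.log' "$1"
}

log_names 3
```

This prints `report_01.log`, `report_02.log` and `report_03.log` on separate lines.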

<br>

#### <a name="further-reading-for-seq"></a>Further reading for seq

* `man seq` and `info seq` for more options, corner cases and detailed documentation
* [seq Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/seq?sort=votes&pageSize=15)


================================================
FILE: overview_presentation/baz.json
================================================
{
   "abc": {
      "@attr": "good",
      "text": "Hi there"
   },
   "xyz": {
      "@attr": "bad",
      "text": "I am good. How are you?"
   }
}


================================================
FILE: overview_presentation/foo.xml
================================================
<foo>
    <abc attr="good">Hi there</abc>
    <xyz attr="bad">I am good. How are you?</xyz>
</foo>


================================================
FILE: overview_presentation/greeting.txt
================================================
Hi there
Have a nice day


================================================
FILE: overview_presentation/sample.txt
================================================
Hello World!

Good day
How do you do?

Just do it
Believe 42 it!

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he 123 he he


================================================
FILE: perl_the_swiss_knife.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook - https://learnbyexample.github.io/learn_perl_oneliners/. The ebook also has content updated for newer version of `perl`, includes exercises, solutions, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_perl_oneliners

---

<br> <br> <br>

# <a name="perl-one-liners"></a>Perl one liners

**Table of Contents**

* [Executing Perl code](#executing-perl-code)
* [Simple search and replace](#simple-search-and-replace)
    * [inplace editing](#inplace-editing)
* [Line filtering](#line-filtering)
    * [Regular expressions based filtering](#regular-expressions-based-filtering)
    * [Fixed string matching](#fixed-string-matching)
    * [Line number based filtering](#line-number-based-filtering)
* [Field processing](#field-processing)
    * [Field comparison](#field-comparison)
    * [Specifying different input field separator](#specifying-different-input-field-separator)
    * [Specifying different output field separator](#specifying-different-output-field-separator)
* [Changing record separators](#changing-record-separators)
    * [Input record separator](#input-record-separator)
    * [Output record separator](#output-record-separator)
* [Multiline processing](#multiline-processing)
* [Perl regular expressions](#perl-regular-expressions)
    * [sed vs perl subtle differences](#sed-vs-perl-subtle-differences)
    * [Backslash sequences](#backslash-sequences)
    * [Non-greedy quantifier](#non-greedy-quantifier)
    * [Lookarounds](#lookarounds)
    * [Ignoring specific matches](#ignoring-specific-matches)
    * [Special capture groups](#special-capture-groups)
    * [Modifiers](#modifiers)
    * [Quoting metacharacters](#quoting-metacharacters)
    * [Matching position](#matching-position)
* [Using modules](#using-modules)
* [Two file processing](#two-file-processing)
    * [Comparing whole lines](#comparing-whole-lines)
    * [Comparing specific fields](#comparing-specific-fields)
    * [Line number matching](#line-number-matching)
* [Creating new fields](#creating-new-fields)
* [Multiple file input](#multiple-file-input)
* [Dealing with duplicates](#dealing-with-duplicates)
* [Lines between two REGEXPs](#lines-between-two-regexps)
    * [All unbroken blocks](#all-unbroken-blocks)
    * [Specific blocks](#specific-blocks)
    * [Broken blocks](#broken-blocks)
* [Array operations](#array-operations)
    * [Iteration and filtering](#iteration-and-filtering)
    * [Sorting](#sorting)
    * [Transforming](#transforming)
* [Miscellaneous](#miscellaneous)
    * [split](#split)
    * [Fixed width processing](#fixed-width-processing)
    * [String and file replication](#string-and-file-replication)
    * [transliteration](#transliteration)
    * [Executing external commands](#executing-external-commands)
* [Further Reading](#further-reading)

<br>

```bash
$ perl -le 'print $^V'
v5.22.1

$ man perl
PERL(1)                Perl Programmers Reference Guide                PERL(1)

NAME
       perl - The Perl 5 language interpreter

SYNOPSIS
       perl [ -sTtuUWX ]      [ -hv ] [ -V[:configvar] ]
            [ -cw ] [ -d[t][:debugger] ] [ -D[number/list] ]
            [ -pna ] [ -Fpattern ] [ -l[octal] ] [ -0[octal/hexadecimal] ]
            [ -Idir ] [ -m[-]module ] [ -M[-]'module...' ] [ -f ]
            [ -C [number/list] ]      [ -S ]      [ -x[dir] ]
            [ -i[extension] ]
            [ [-e|-E] 'command' ] [ -- ] [ programfile ] [ argument ]...

       For more information on these options, you can run "perldoc perlrun".
...
```

**Prerequisites and notes**

* familiarity with programming concepts like variables, printing, control structures, arrays, etc
* Perl borrows syntax/features from **C, shell scripting, awk, sed** etc. Prior experience working with them would help a lot
* familiarity with regular expression basics
    * if not, check out **ERE** portion of [GNU sed regular expressions](./gnu_sed.md#regular-expressions)
    * examples for non-greedy, lookarounds, etc will be covered here
* this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using `grep`, `sed`, `awk` etc
    * do NOT use style/syntax presented here when writing full fledged Perl programs which should use **strict, warnings** etc
    * see [perldoc - perlintro](https://perldoc.perl.org/perlintro.html) and [learnxinyminutes - perl](https://learnxinyminutes.com/docs/perl/) for quick intro to using Perl for full fledged programs
* links to Perl documentation will be added as necessary
* unless otherwise specified, consider input as ASCII encoded text only
    * see also [stackoverflow - why UTF-8 is not default](https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default)

<br>

## <a name="executing-perl-code"></a>Executing Perl code

* One way is to put code in a file and use `perl` command with filename as argument
* Another is to use [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) at beginning of script, make the file executable and directly run it

```bash
$ cat code.pl
print "Hello Perl\n"
$ perl code.pl
Hello Perl

$ # similar to bash
$ cat code.sh
echo 'Hello Bash'
$ bash code.sh
Hello Bash
```

* For short programs, one can use `-e` commandline option to provide code from command line itself
    * Use `-E` option to use newer features like `say`. See [perldoc - new features](https://perldoc.perl.org/feature.html)
* This entire chapter is about using `perl` this way from commandline

```bash
$ perl -e 'print "Hello Perl\n"'
Hello Perl

$ # say automatically adds newline character
$ perl -E 'say "Hello Perl"'
Hello Perl

$ # similar to
$ bash -c 'echo "Hello Bash"'
Hello Bash

$ # multiple commands can be issued separated by ;
$ # -l will be covered later, here used to append newline to print
$ perl -le '$x=25; $y=12; print $x**$y'
59604644775390625
```

* Perl is (in)famous for being able to do things in more than one way
* examples in this chapter will mostly try to use the syntax that avoids `(){}`

```bash
$ # shows different syntax usage of if/say/print
$ perl -e 'if(2<3){print("2 is less than 3\n")}'
2 is less than 3
$ perl -E 'say "2 is less than 3" if 2<3'
2 is less than 3

$ # string comparison uses eq for ==, lt for < and so on
$ perl -e 'if("a" lt "b"){$x=5; $y=10} print "x=$x; y=$y\n"'
x=5; y=10
$ # x/y assignment will happen only if condition evaluates to true
$ perl -E 'say "x=$x; y=$y" if "a" lt "b" and $x=5,$y=10'
x=5; y=10

$ # variables will be interpolated within double quotes
$ # so, use q operator if single quoting is needed
$ # as single quote is already being used to group perl code for -e option
$ perl -le 'print "ab $x 123"'
ab  123
$ perl -le 'print q/ab $x 123/'
ab $x 123
```

**Further Reading**

* `perl -h` for summary of options
* [perldoc - Command Switches](https://perldoc.perl.org/perlrun.html#Command-Switches)
* [perldoc - Perl operators and precedence](https://perldoc.perl.org/perlop.html)
* [explainshell](https://explainshell.com/explain?cmd=perl+-F+-l+-anpeE+-i+-0+-M) - to quickly get information without having to traverse through the docs
* See [Changing record separators](#changing-record-separators) section for more details on `-l` option

<br>

## <a name="simple-search-and-replace"></a>Simple search and replace

* **substitution** command syntax is very similar to `sed` for search and replace
    * syntax is `variable =~ s/REGEXP/REPLACEMENT/FLAGS` and by default acts on `$_` if variable is not specified
    * see [perldoc - SPECIAL VARIABLES](https://perldoc.perl.org/perlvar.html#SPECIAL-VARIABLES) for explanation on `$_` and other such special variables
    * more detailed examples will be covered in later sections
* Just like other text processing commands, `perl` will automatically loop over input line by line when `-n` or `-p` option is used
    * like `sed`, the `-n` option won't print the record
    * `-p` will print the record, including any changes made
    * newline character being default record separator
    * `$_` will contain the input record content, including the record separator (unlike `sed` and `awk`)
    * any directory name appearing in file arguments passed will be automatically ignored
* and similar to other commands, `perl` will work with both stdin and file input
    * See other chapters for examples of [seq](./miscellaneous.md#seq), [paste](./restructure_text.md#paste), etc

```bash
$ # sample stdin data
$ seq 10 | paste -sd,
1,2,3,4,5,6,7,8,9,10

$ # change only first ',' to ' : '
$ # same as: sed 's/,/ : /'
$ seq 10 | paste -sd, | perl -pe 's/,/ : /'
1 : 2,3,4,5,6,7,8,9,10

$ # change all ',' to ' : ' by using 'g' modifier
$ # same as: sed 's/,/ : /g'
$ seq 10 | paste -sd, | perl -pe 's/,/ : /g'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10

$ cat greeting.txt
Hi there
Have a nice day
$ # same as: sed 's/nice day/safe journey/' greeting.txt
$ perl -pe 's/nice day/safe journey/' greeting.txt
Hi there
Have a safe journey
```

<br>

#### <a name="inplace-editing"></a>inplace editing

* similar to [GNU sed - using * with inplace option](./gnu_sed.md#prefix-backup-name), one can also use `*` to either prefix the backup name or place the backup files in another existing directory
* See also [effectiveperlprogramming - caveats of using -i option](https://www.effectiveperlprogramming.com/2017/12/in-place-editing-gets-safer-in-v5-28/)

```bash
$ # same as: sed -i.bkp 's/Hi/Hello/' greeting.txt
$ perl -i.bkp -pe 's/Hi/Hello/' greeting.txt
$ # original file gets preserved in 'greeting.txt.bkp'
$ cat greeting.txt
Hello there
Have a nice day

$ # using -i'bkp.*' will save backup file as 'bkp.greeting.txt'

$ # use empty argument to -i with caution, changes made cannot be undone
$ perl -i -pe 's/nice day/safe journey/' greeting.txt
$ cat greeting.txt
Hello there
Have a safe journey
```

* Multiple input files are treated individually and changes are written back to respective files

```bash
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes

$ perl -i.bkp -pe 's/3/three/' f1 f2
$ cat f1
I ate three apples
$ cat f2
I bought two bananas and three mangoes
```

<br>

## <a name="line-filtering"></a>Line filtering

<br>

#### <a name="regular-expressions-based-filtering"></a>Regular expressions based filtering

* syntax is `variable =~ m/REGEXP/FLAGS` to check for a match
    * `variable !~ m/REGEXP/FLAGS` for negated match
    * by default acts on `$_` if variable is not specified
* as we need to print only selective lines, use `-n` option
    * by default, contents of `$_` will be printed if no argument is passed to `print`

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # same as: grep '^[RS]' or sed -n '/^[RS]/p' or awk '/^[RS]/'
$ # /^[RS]/ is shortcut for $_ =~ m/^[RS]/
$ perl -ne 'print if /^[RS]/' poem.txt
Roses are red,
Sugar is sweet,

$ # same as: grep -i 'and' poem.txt
$ perl -ne 'print if /and/i' poem.txt
And so are you.

$ # same as: grep -v 'are' poem.txt
$ # !/are/ is shortcut for $_ !~ m/are/
$ perl -ne 'print if !/are/' poem.txt
Sugar is sweet,

$ # same as: awk '/are/ && !/so/' poem.txt
$ perl -ne 'print if /are/ && !/so/' poem.txt
Roses are red,
Violets are blue,
```

* using different delimiter
* quoting from [perldoc - Regexp Quote-Like Operators](https://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators)

> With the m you can use any pair of non-alphanumeric, non-whitespace characters as delimiters

```bash
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log

$ perl -ne 'print if /\/foo\/a\//' paths.txt
/foo/a/report.log

$ perl -ne 'print if m#/foo/a/#' paths.txt
/foo/a/report.log

$ perl -ne 'print if !m#/foo/a/#' paths.txt
/foo/y/power.log
/foo/abc/errors.log
```

<br>

#### <a name="fixed-string-matching"></a>Fixed string matching

* similar to `grep -F` and `awk index`
* See also
    * [perldoc - index function](https://perldoc.perl.org/functions/index.html)
    * [perldoc - Quote and Quote-like Operators](https://perldoc.perl.org/5.8.8/perlop.html#Quote-and-Quote-like-Operators)
    * [Quoting metacharacters](#quoting-metacharacters) section

```bash
$ # same as: grep -F 'a[5]' or awk 'index($0, "a[5]")'
$ # index returns matching position(starts at 0) and -1 if not found
$ echo 'int a[5]' | perl -ne 'print if index($_, "a[5]") != -1'
int a[5]

$ # however, string within double quotes gets interpolated, for ex
$ x='123'; echo "$x"
123
$ perl -e '$x=123; print "$x\n"'
123

$ # so, for commandline usage, better to pass string as environment variable
$ # they are accessible via the %ENV hash variable
$ perl -le 'print $ENV{PWD}'
/home/learnbyexample
$ perl -le 'print $ENV{SHELL}'
/bin/bash

$ # no output here as $% (a special variable) gets interpolated within double quotes
$ echo 'a#$%d' | perl -ne 'print if index($_, "#$%") != -1'
$ echo 'a#$%d' | s='#$%' perl -ne 'print if index($_, $ENV{s}) != -1'
a#$%d
```

* return value is useful to match at specific position
* for ex: at start/end of line

```bash
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # start of line
$ # same as: s='a+b' awk 'index($0, ENVIRON["s"])==1' eqns.txt
$ s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # length function returns number of characters, by default acts on $_
$ s='a+b' perl -ne '$pos = length() - length($ENV{s}) - 1;
                    print if index($_, $ENV{s}) == $pos' eqns.txt
i*(t+9-g)/8,4-a+b
```

<br>

#### <a name="line-number-based-filtering"></a>Line number based filtering

* special variable `$.` contains total records read so far, similar to `NR` in `awk`
    * But no equivalent of awk's `FNR`, [see this stackoverflow Q&A for workaround](https://stackoverflow.com/questions/12384692/line-number-of-a-file-in-perl)
* See also [perldoc - eof](https://perldoc.perl.org/perlfunc.html#eof)

```bash
$ # same as: head -n2 poem.txt | tail -n1
$ # or sed -n '2p' or awk 'NR==2'
$ perl -ne 'print if $.==2' poem.txt
Violets are blue,

$ # print 2nd and 4th line
$ # same as: sed -n '2p; 4p' or awk 'NR==2 || NR==4'
$ perl -ne 'print if $.==2 || $.==4' poem.txt
Violets are blue,
And so are you.

$ # same as: tail -n1 poem.txt
$ # or sed -n '$p' or awk 'END{print}'
$ perl -ne 'print if eof' poem.txt
And so are you.
```

* for large input, use `exit` to avoid unnecessary record processing

```bash
$ # can also use: perl -ne 'print and exit if $.==234'
$ seq 14323 14563435 | perl -ne 'if($.==234){print; exit}'
14556

$ # sample time comparison
$ time seq 14323 14563435 | perl -ne 'if($.==234){print; exit}' > /dev/null
real    0m0.005s
$ time seq 14323 14563435 | perl -ne 'print if $.==234' > /dev/null
real    0m2.439s

$ # mimicking head command, same as: head -n3 or sed '3q'
$ seq 14 25 | perl -pe 'exit if $.>3'
14
15
16

$ # same as: sed '3Q'
$ seq 14 25 | perl -pe 'exit if $.==3'
14
15
```

* selecting range of lines
* `..` is [perldoc - range operator](https://perldoc.perl.org/perlop.html#Range-Operators)

```bash
$ # same as: sed -n '3,5p' or awk 'NR>=3 && NR<=5'
$ # in this context, the range is compared against $.
$ seq 14 25 | perl -ne 'print if 3..5'
16
17
18

$ # selecting from particular line number to end of input
$ # same as: sed -n '10,$p' or awk 'NR>=10'
$ seq 14 25 | perl -ne 'print if $.>=10'
23
24
25
```

<br>

## <a name="field-processing"></a>Field processing

* `-a` option will auto-split each input record based on one or more continuous white-space, similar to default behavior in `awk`
    * See also [split](#split) section
* Special variable array `@F` will contain all the elements, indexing starts from 0
    * negative indexing is also supported, `-1` gives last element, `-2` gives last-but-one and so on
    * see [Array operations](#array-operations) section for examples on array usage

```bash
$ cat fruits.txt
fruit   qty
apple   42
banana  31
fig     90
guava   6

$ # print only first field, indexing starts from 0
$ # same as: awk '{print $1}' fruits.txt
$ perl -lane 'print $F[0]' fruits.txt
fruit
apple
banana
fig
guava

$ # print only second field
$ # same as: awk '{print $2}' fruits.txt
$ perl -lane 'print $F[1]' fruits.txt
qty
42
31
90
6
```

* by default, leading and trailing whitespaces won't be considered when splitting the input record
    * mimicking `awk`'s default behavior

```bash
$ printf ' a    ate b\tc   \n'
 a    ate b     c
$ printf ' a    ate b\tc   \n' | perl -lane 'print $F[0]'
a
$ printf ' a    ate b\tc   \n' | perl -lane 'print $F[-1]'
c

$ # number of fields, $#F gives index of last element - so add 1
$ echo '1 a 7' | perl -lane 'print $#F+1'
3
$ printf ' a    ate b\tc   \n' | perl -lane 'print $#F+1'
4
$ # or use scalar context
$ echo '1 a 7' | perl -lane 'print scalar @F'
3
```

<br>

#### <a name="field-comparison"></a>Field comparison

* for numeric context, Perl automatically tries to convert the string to number, ignoring white-space
* for string comparison, use `eq` for `==`, `ne` for `!=` and so on

```bash
$ # if first field exactly matches the string 'apple'
$ # same as: awk '$1=="apple"{print $2}' fruits.txt
$ perl -lane 'print $F[1] if $F[0] eq "apple"' fruits.txt
42

$ # print first field if second field > 35 (excluding header)
$ # same as: awk 'NR>1 && $2>35{print $1}' fruits.txt
$ perl -lane 'print $F[0] if $F[1]>35 && $.>1' fruits.txt
apple
fig

$ # print header and lines with qty < 35
$ # same as: awk 'NR==1 || $2<35' fruits.txt
$ perl -ane 'print if $F[1]<35 || $.==1' fruits.txt
fruit   qty
banana  31
guava   6

$ # if first field does NOT contain 'a'
$ # same as: awk '$1 !~ /a/' fruits.txt
$ perl -ane 'print if $F[0] !~ /a/' fruits.txt
fruit   qty
fig     90
```

<br>

#### <a name="specifying-different-input-field-separator"></a>Specifying different input field separator

* by using `-F` command line option
    * See also [split](#split) section, which covers details about trailing empty fields

```bash
$ # second field where input field separator is :
$ # same as: awk -F: '{print $2}'
$ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[1]'
123

$ # last field, same as: awk -F: '{print $NF}'
$ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[-1]'
789
$ # second last field, same as: awk -F: '{print $(NF-1)}'
$ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[-2]'
bar

$ # second and last field
$ # other ways to print more than 1 element will be covered later
$ echo 'foo:123:bar:789' | perl -F: -lane 'print "$F[1] $F[-1]"'
123 789

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | perl -F';' -lane 'print $F[2]'
three
```

* Regular expressions based input field separator

```bash
$ # same as: awk -F'[0-9]+' '{print $2}'
$ echo 'Sample123string54with908numbers' | perl -F'\d+' -lane 'print $F[1]'
string

$ # first field will be empty as there is nothing before '{'
$ # same as: awk -F'[{}= ]+' '{print $1}'
$ # \x20 is space character, can't use literal space within [] when using -F
$ echo '{foo}   bar=baz' | perl -F'[{}=\x20]+' -lane 'print $F[0]'

$ echo '{foo}   bar=baz' | perl -F'[{}=\x20]+' -lane 'print $F[1]'
foo
$ echo '{foo}   bar=baz' | perl -F'[{}=\x20]+' -lane 'print $F[2]'
bar
```

* empty argument to `-F` will split the input record character wise

```bash
$ # same as: gawk -v FS= '{print $1}'
$ echo 'apple' | perl -F -lane 'print $F[0]'
a
$ echo 'apple' | perl -F -lane 'print $F[1]'
p
$ echo 'apple' | perl -F -lane 'print $F[-1]'
e

$ # use -C option when dealing with unicode characters
$ # S will turn on UTF-8 for stdin/stdout/stderr streams
$ printf 'hi👍 how are you?' | perl -CS -F -lane 'print $F[2]'
👍
```

<br>

#### <a name="specifying-different-output-field-separator"></a>Specifying different output field separator

* Method 1: use `$,` to change separator between `print` arguments
    * could be remembered easily by noting that `,` is used to separate `print` arguments

```bash
$ # by default, the various arguments are concatenated
$ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[1], $F[-1]'
123789

$ # change $, if different separator is needed
$ echo 'foo:123:bar:789' | perl -F: -lane '$,=" "; print $F[1], $F[-1]'
123 789
$ echo 'foo:123:bar:789' | perl -F: -lane '$,="-"; print $F[1], $F[-1]'
123-789

$ # argument can be array too
$ echo 'foo:123:bar:789' | perl -F: -lane '$,="-"; print @F[1,-1]'
123-789
$ echo 'foo:123:bar:789' | perl -F: -lane '$,=" - "; print @F'
foo - 123 - bar - 789
```

* Method 2: use `join`

```bash
$ echo 'foo:123:bar:789' | perl -F: -lane 'print join "-", $F[1], $F[-1]'
123-789

$ echo 'foo:123:bar:789' | perl -F: -lane 'print join "-", @F[1,-1]'
123-789

$ echo 'foo:123:bar:789' | perl -F: -lane 'print join " - ", @F'
foo - 123 - bar - 789
```

* Method 3: use `$"` to change separator when array is interpolated, default is space character
    * could be remembered easily by noting that interpolation happens within double quotes

```bash
$ # default is space
$ echo 'foo:123:bar:789' | perl -F: -lane 'print "@F[1,-1]"'
123 789

$ echo 'foo:123:bar:789' | perl -F: -lane '$"="-"; print "@F[1,-1]"'
123-789

$ echo 'foo:123:bar:789' | perl -F: -lane '$"=","; print "@F"'
foo,123,bar,789
```

* use `BEGIN` if same separator is to be used for all lines
    * statements inside `BEGIN` are executed before processing any input text

```bash
$ # can also use: perl -lane 'BEGIN{$"=","} print "@F"' fruits.txt
$ perl -lane 'BEGIN{$,=","} print @F' fruits.txt
fruit,qty
apple,42
banana,31
fig,90
guava,6
```

## <a name="changing-record-separators"></a>Changing record separators

* Before seeing examples for changing record separators, let's cover a detail about contents of input record and use of `-l` option
* See also [perldoc - chomp](https://perldoc.perl.org/functions/chomp.html)

```bash
$ # input record includes the record separator as well
$ # can also use: perl -pe 's/$/ 123/'
$ echo 'foo' | perl -pe 's/\n/ 123\n/'
foo 123

$ # this example shows better use case
$ # similar to paste -sd but with ability to use multi-character delimiter
$ seq 5 | perl -pe 's/\n/ : / if !eof'
1 : 2 : 3 : 4 : 5

$ # -l option will chomp off the record separator (among other things)
$ echo 'foo' | perl -l -pe 's/\n/ 123\n/'
foo

$ # -l also sets output record separator which gets added to print statements
$ # ORS gets input record separator value if no argument is passed to -l
$ # hence the newline automatically getting added for print in this example
$ perl -lane 'print $F[0] if $F[1]<35 && $.>1' fruits.txt
banana
guava
```

<br>

#### <a name="input-record-separator"></a>Input record separator

* by default, newline character is used as input record separator
* use `$/` to specify a different input record separator
    * unlike `awk`, only string can be used, no regular expressions
* for single character separator, can also use `-0` command line option which accepts octal/hexadecimal value as argument
* if `-l` option is also used
    * input record separator will be chomped from input record
    * in addition, if argument is not passed to `-l`, output record separator will get whatever is current value of input record separator
    * so, order of `-l`, `-0` and/or `$/` usage becomes important

```bash
$ s='this is a sample string'

$ # space as input record separator, printing all records
$ # same as: awk -v RS=' ' '{print NR, $0}'
$ # ORS is newline as -l is used before $/ gets changed
$ printf "$s" | perl -lne 'BEGIN{$/=" "} print "$. $_"'
1 this
2 is
3 a
4 sample
5 string

$ # print all records containing 'a'
$ # same as: awk -v RS=' ' '/a/'
$ printf "$s" | perl -l -0040 -ne 'print if /a/'
a
sample

$ # if the order is changed, ORS will be space, not newline
$ printf "$s" | perl -0040 -l -ne 'print if /a/'
a sample 
```

* `-0` option used without argument will use the ASCII NUL character as input record separator

```bash
$ printf 'foo\0bar\0' | cat -A
foo^@bar^@$
$ printf 'foo\0bar\0' | perl -l -0 -ne 'print'
foo
bar

$ # could be golfed to: perl -l -0pe ''
$ # but don't use `-l0` as `0` will be treated as argument to `-l`
```

* values `-0400` to `-0777` will cause entire file to be slurped
    * idiomatically, `-0777` is used

```bash
$ # s modifier allows . to match newline as well
$ perl -0777 -pe 's/red.*are //s' poem.txt
Roses are you.

$ # replace first newline with '. '
$ perl -0777 -pe 's/\n/. /' greeting.txt
Hello there. Have a safe journey
```

* for paragraph mode (two or more consecutive newline characters), use `-00` or assign empty string to `$/`

Consider the below sample file

```bash
$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* again, input record will have the separator too and using `-l` will chomp it
* however, if more than two consecutive newline characters separate the paragraphs, only two newlines will be preserved and the rest discarded
    * use `$/="\n\n"` to avoid this behavior

```bash
$ # print all paragraphs containing 'it'
$ # same as: awk -v RS= -v ORS='\n\n' '/it/' sample.txt
$ perl -00 -ne 'print if /it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ perl -F'\n' -00 -ane 'print if $#F==0' sample.txt
Hello World

$ # unlike awk -F'\n' -v RS= -v ORS='\n\n' 'NF==2 && /do/' sample.txt
$ # there won't be empty line at end because input file didn't have it
$ perl -F'\n' -00 -ane 'print if $#F==1 && /do/' sample.txt
Just do-it
Believe it

Much ado about nothing
He he he
```

* Re-structuring paragraphs

```bash
$ # same as: awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1'
$ perl -F'\n' -00 -ane 'print join ". ", @F; print "\n\n"' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he

```

* multi-character separator

```bash
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ # number of records, same as: awk -v RS='Error:' 'END{print NR}'
$ perl -lne 'BEGIN{$/="Error:"} print $. if eof' report.log
3
$ # print first record
$ perl -lne 'BEGIN{$/="Error:"} print if $.==1' report.log
blah blah

$ # same as: awk -v RS='Error:' '/surely/{print RS $0}' report.log
$ perl -lne 'BEGIN{$/="Error:"} print "$/$_" if /surely/' report.log
Error: something surely went wrong
some text
some more text
blah blah blah

```

* Joining lines based on specific end of line condition

```bash
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

$ # same as: awk -v RS='-\n' -v ORS= '1' msg.txt
$ # can also use: perl -pe 's/-\n//' msg.txt
$ perl -pe 'BEGIN{$/="-\n"} chomp' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.
```

<br>

#### <a name="output-record-separator"></a>Output record separator

* one way is to use `$\` to specify a different output record separator
    * by default it doesn't have a value

```bash
$ # note that despite $\ not having a value, output has newlines
$ # because the input record still has the input record separator
$ seq 3 | perl -ne 'print'
1
2
3
$ # same as: awk -v ORS='\n\n' '{print $0}'
$ seq 3 | perl -ne 'BEGIN{$\="\n"} print'
1

2

3

$ seq 2 | perl -ne 'BEGIN{$\="---\n"} print'
1
---
2
---
```

* dynamically changing output record separator

```bash
$ # same as: awk '{ORS = NR%2 ? " " : "\n"} 1'
$ # note the use of -l to chomp the input record separator
$ seq 6 | perl -lpe '$\ = $.%2 ? " " : "\n"'
1 2
3 4
5 6

$ # -l also sets the output record separator
$ # but gets overridden by $\
$ seq 6 | perl -lpe '$\ = $.%3 ? "-" : "\n"'
1-2-3
4-5-6
```

* passing argument to `-l` to set output record separator

```bash
$ seq 8 | perl -ne 'print if /[24]/'
2
4

$ # null separator, note how -l also chomps input record separator
$ seq 8 | perl -l0 -ne 'print if /[24]/' | cat -A
2^@4^@

$ # comma separator, won't have a newline at end
$ seq 8 | perl -l054 -ne 'print if /[24]/'
2,4,

$ # to add a final newline to output, use END and printf
$ seq 8 | perl -l054 -ne 'print if /[24]/; END{printf "\n"}'
2,4,
```

<br>

## <a name="multiline-processing"></a>Multiline processing

* Processing consecutive lines

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # match two consecutive lines
$ # same as: awk 'p~/are/ && /is/{print p ORS $0} {p=$0}' poem.txt
$ perl -ne 'print $p,$_ if /is/ && $p=~/are/; $p=$_' poem.txt
Violets are blue,
Sugar is sweet,
$ # if only the second line is needed, same as: awk 'p~/are/ && /is/; {p=$0}'
$ perl -ne 'print if /is/ && $p=~/are/; $p=$_' poem.txt
Sugar is sweet,

$ # print if line matches a condition as well as condition for next 2 lines
$ # same as: awk 'p2~/red/ && p1~/blue/ && /is/{print p2} {p2=p1; p1=$0}'
$ perl -ne 'print $p2 if /is/ && $p1=~/blue/ && $p2=~/red/;
            $p2=$p1; $p1=$_' poem.txt
Roses are red,
```

Consider this sample input file

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* extracting lines around matching line
* how `$n && $n--` works:
    * need to note that right hand side of `&&` is processed only if left hand side is `true`
    * so for example, if initially `$n=2`, then we get
        * `2 && 2; $n=1` - evaluates to `true`
        * `1 && 1; $n=0` - evaluates to `true`
        * `0 && ` - evaluates to `false` ... no decrementing `$n` and hence will be `false` until `$n` is re-assigned non-zero value

```bash
$ # similar to: grep --no-group-separator -A1 'BEGIN' range.txt
$ # same as: awk '/BEGIN/{n=2} n && n--' range.txt
$ perl -ne '$n=2 if /BEGIN/; print if $n && $n--' range.txt
BEGIN
1234
BEGIN
a

$ # print only line after matching line, same as: awk 'n && n--; /BEGIN/{n=1}'
$ perl -ne 'print if $n && $n--; $n=1 if /BEGIN/' range.txt
1234
a

$ # generic case: print nth line after match, awk 'n && !--n; /BEGIN/{n=3}'
$ perl -ne 'print if $n && !--$n; $n=3 if /BEGIN/' range.txt
END
c

$ # print second line prior to matched line
$ # same as: awk '/END/{print p2} {p2=p1; p1=$0}' range.txt
$ perl -ne 'print $p2 if /END/; $p2=$p1; $p1=$_' range.txt
1234
b

$ # use reversing trick for generic case of nth line before match
$ # same as: tac range.txt | awk 'n && !--n; /END/{n=3}' | tac
$ tac range.txt | perl -ne 'print if $n && !--$n; $n=3 if /END/' | tac
BEGIN
a
```

**Further Reading**

* [stackoverflow - multiline find and replace](https://stackoverflow.com/questions/39884112/perl-multiline-find-and-replace-with-regex)
* [stackoverflow - delete line based on content of previous/next lines](https://stackoverflow.com/questions/49112877/delete-line-if-line-matches-foo-line-above-matches-bar-and-line-below-match)
* [softwareengineering - FSM examples](https://softwareengineering.stackexchange.com/questions/47806/examples-of-finite-state-machines)
* [wikipedia - FSM](https://en.wikipedia.org/wiki/Finite-state_machine)

<br>

## <a name="perl-regular-expressions"></a>Perl regular expressions

* examples to showcase some of the features not present in ERE and modifiers not available in `sed`'s substitute command
* many features of Perl regular expressions will NOT be covered, but external links will be provided wherever relevant
    * See [perldoc - perlre](https://perldoc.perl.org/perlre.html) for complete reference
    * and [perldoc - regular expressions FAQ](https://perldoc.perl.org/perlfaq.html#the-perlfaq6-manpage%3a-Regular-Expressions)
* examples/descriptions based only on ASCII encoding

<br>

#### <a name="sed-vs-perl-subtle-differences"></a>sed vs perl subtle differences

* input record separator being part of input record

```bash
$ echo 'foo:123:bar:789' | sed -E 's/[^:]+$/xyz/'
foo:123:bar:xyz
$ # newline character gets replaced too as shown by shell prompt
$ echo 'foo:123:bar:789' | perl -pe 's/[^:]+$/xyz/'
foo:123:bar:xyz$
$ # simple workaround is to use -l option
$ echo 'foo:123:bar:789' | perl -lpe 's/[^:]+$/xyz/'
foo:123:bar:xyz

$ # of course it has uses too
$ seq 10 | paste -sd, | sed 's/,/ : /g'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
$ seq 10 | perl -pe 's/\n/ : / if !eof'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
```

* how much does `*` match?

```bash
$ # sed will choose biggest match
$ echo ',baz,,xyz,,,' | sed 's/[^,]*/A/g'
A,A,A,A,A,A,A
$ echo 'foo,baz,,xyz,,,123' | sed 's/[^,]*/A/g'
A,A,A,A,A,A,A

$ # but perl will match both empty and non-empty strings
$ echo ',baz,,xyz,,,' | perl -lpe 's/[^,]*/A/g'
A,AA,A,AA,A,A,A
$ echo 'foo,baz,,xyz,,,123' | perl -lpe 's/[^,]*/A/g'
AA,AA,A,AA,A,A,AA

$ echo '42,789' | sed 's/[0-9]*/"&"/g'
"42","789"
$ echo '42,789' | perl -lpe 's/\d*/"$&"/g'
"42""","789"""
$ echo '42,789' | perl -lpe 's/\d+/"$&"/g'
"42","789"
```

* backslash sequences inside character classes

```bash
$ # \w isn't special inside [] for sed, so only literal w (and =) gets deleted
$ echo 'w=y-x+9*3' | sed 's/[\w=]//g'
y-x+9*3

$ # \w would match any word character
$ echo 'w=y-x+9*3' | perl -pe 's/[\w=]//g'
-+*
```

* replacing specific occurrence
* See [stackoverflow - substitute the nth occurrence of a match in a Perl regex](https://stackoverflow.com/questions/2555662/how-can-i-substitute-the-nth-occurrence-of-a-match-in-a-perl-regex) for workarounds

```bash
$ echo 'foo:123:bar:baz' | sed 's/:/-/2'
foo:123-bar:baz

$ echo 'foo:123:bar:baz' | perl -pe 's/:/-/2'
Unknown regexp modifier "/2" at -e line 1, at end of line
Execution of -e aborted due to compilation errors.
$ # e modifier covered later, allows Perl code in replacement section
$ echo 'foo:123:bar:baz' | perl -pe '$c=0; s/:/++$c==2 ? "-" : $&/ge'
foo:123-bar:baz
$ # or use non-greedy and \K(covered later), same as: sed 's/and/-/3'
$ echo 'foo and bar and baz land good' | perl -pe 's/(and.*?){2}\Kand/-/'
foo and bar and baz l- good

$ # emulating GNU sed's number+g modifier
$ a='456:foo:123:bar:789:baz
x:y:z:a:v:xc:gf'
$ echo "$a" | sed 's/:/-/3g'
456:foo:123-bar-789-baz
x:y:z-a-v-xc-gf
$ echo "$a" | perl -pe '$c=0; s/:/++$c<3 ? $& : "-"/ge'
456:foo:123-bar-789-baz
x:y:z-a-v-xc-gf
```

* variable interpolation when `$` or `@` is used
* See also [perldoc - Quote and Quote-like Operators](https://perldoc.perl.org/5.8.8/perlop.html#Quote-and-Quote-like-Operators)

```bash
$ seq 2 | sed 's/$x/xyz/'
1
2

$ # uninitialized variable, same applies for: perl -pe 's/@a/xyz/'
$ seq 2 | perl -pe 's/$x/xyz/'
xyz1
xyz2
$ # initialized variable
$ seq 2 | perl -pe '$x=2; s/$x/xyz/'
1
xyz

$ # using single quotes as delimiter won't interpolate
$ # not usable for one-liners given shell's own single/double quotes behavior
$ cat sub_sq.pl
s'$x'xyz'
$ seq 2 | perl -p sub_sq.pl
1
2
```

* back reference
* See also [perldoc - Warning on \1 Instead of $1](https://perldoc.perl.org/perlre.html#Warning-on-%5c1-Instead-of-%241)

```bash
$ # use $& to refer entire matched string in replacement section
$ echo 'hello world' | sed 's/.*/"&"/'
"hello world"
$ echo 'hello world' | perl -pe 's/.*/"&"/'
"&"
$ echo 'hello world' | perl -pe 's/.*/"$&"/'
"hello world"

$ # use \1, \2, etc or \g1, \g2 etc for back referencing in search section
$ # use $1, $2, etc in replacement section
$ echo 'a a a walking for for a cause' | perl -pe 's/\b(\w+)( \1)+\b/$1/g'
a walking for a cause
```

<br>

#### <a name="backslash-sequences"></a>Backslash sequences

* `\d` for `[0-9]`
* `\s` for `[ \t\r\n\f\v]`
* `\h` for `[ \t]`
* `\n` for newline character
* `\D`, `\S`, `\H`, `\N` respectively for their opposites
* See [perldoc - perlrecharclass](https://perldoc.perl.org/perlrecharclass.html#Backslash-sequences) for full list and details

```bash
$ # same as: sed -E 's/[0-9]+/xxx/g'
$ echo 'like 42 and 37' | perl -pe 's/\d+/xxx/g'
like xxx and xxx

$ # same as: sed -E 's/[^0-9]+/xxx/g'
$ # note again the use of -l because of newline in input record
$ echo 'like 42 and 37' | perl -lpe 's/\D+/xxx/g'
xxx42xxx37

$ # no need for -l here as \h won't match newline
$ echo 'a b c  ' | perl -pe 's/\h*$//'
a b c
```

<br>

#### <a name="non-greedy-quantifier"></a>Non-greedy quantifier

* adding a `?` to `?` or `*` or `+` or `{}` quantifiers will change matching from greedy to non-greedy. In other words, to match as minimally as possible
    * also known as lazy quantifier
* See also [regular-expressions.info - Possessive Quantifiers](https://www.regular-expressions.info/possessive.html)

```bash
$ # greedy matching
$ echo 'foo and bar and baz land good' | perl -pe 's/foo.*and//'
 good
$ # non-greedy matching
$ echo 'foo and bar and baz land good' | perl -pe 's/foo.*?and//'
 bar and baz land good

$ echo '12342789' | perl -pe 's/\d{2,5}//'
789
$ echo '12342789' | perl -pe 's/\d{2,5}?//'
342789

$ # for single character, non-greedy is not always needed
$ echo '123:42:789:good:5:bad' | perl -pe 's/:.*?:/:/'
123:789:good:5:bad
$ echo '123:42:789:good:5:bad' | perl -pe 's/:[^:]*:/:/'
123:789:good:5:bad

$ # like greedy, non-greedy must satisfy the overall regex, so .*? grows as needed
$ echo '123:42:789:good:5:bad' | perl -pe 's/:.*?:[a-z]/:/'
123:ood:5:bad
$ echo '123:42:789:good:5:bad' | perl -pe 's/:.*:[a-z]/:/'
123:ad
```

<br>

#### <a name="lookarounds"></a>Lookarounds

* Ability to add if conditions to match before/after required pattern
* There are four types
    * positive lookahead `(?=`
    * negative lookahead `(?!`
    * positive lookbehind `(?<=`
    * negative lookbehind `(?<!`
* One way to remember is that **behind** uses `<` and **negative** uses `!` instead of `=`

Like word boundaries and anchors, the strings matched by lookarounds do not become part of the matched string. They are termed **zero-width patterns**

* positive lookbehind `(?<=`

```bash
$ s='foo=5, bar=3; x=83, y=120'

$ # extract all digit sequences
$ echo "$s" | perl -lne 'print join " ", /\d+/g'
5 3 83 120

$ # extract digits only if preceded by two lowercase letters and =
$ # note how the characters matched by lookbehind aren't part of output
$ echo "$s" | perl -lne 'print join " ", /(?<=[a-z]{2}=)\d+/g'
5 3

$ # this can be done without lookbehind too
$ # taking advantage of behavior of //g when () is used
$ echo "$s" | perl -lne 'print join " ", /[a-z]{2}=(\d+)/g'
5 3

$ # change all digits preceded by single lowercase letter and =
$ echo "$s" | perl -pe 's/(?<=\b[a-z]=)\d+/42/g'
foo=5, bar=3; x=42, y=42
$ # alternate, without lookbehind
$ echo "$s" | perl -pe 's/(\b[a-z]=)\d+/${1}42/g'
foo=5, bar=3; x=42, y=42
```

* positive lookahead `(?=`

```bash
$ s='foo=5, bar=3; x=83, y=120'

$ # extract digits that end with ,
$ # can also use: perl -lne 'print join ":", /(\d+),/g'
$ echo "$s" | perl -lne 'print join ":", /\d+(?=,)/g'
5:83

$ # change all digits ending with ,
$ # can also use: perl -pe 's/\d+,/42,/g'
$ echo "$s" | perl -pe 's/\d+(?=,)/42/g'
foo=42, bar=3; x=42, y=120

$ # both lookbehind and lookahead
$ echo 'foo,,baz,,,xyz' | perl -pe 's/,,/,NA,/g'
foo,NA,baz,NA,,xyz
$ echo 'foo,,baz,,,xyz' | perl -pe 's/(?<=,)(?=,)/NA/g'
foo,NA,baz,NA,NA,xyz
```

* negative lookbehind `(?<!` and negative lookahead `(?!`

```bash
$ # change foo if not preceded by _
$ # note how 'foo' at start of line is matched as well
$ echo 'foo _foo 1foo' | perl -pe 's/(?<!_)foo/baz/g'
baz _foo 1baz

$ # join each line in paragraph by replacing newline character
$ # except the one at end of paragraph
$ perl -00 -pe 's/\n(?!$)/. /g' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he
```

* `\K` helps as a workaround for some of the variable-length lookbehind cases
* See also [stackoverflow - Variable-length lookbehind-assertion alternatives](https://stackoverflow.com/questions/11640447/variable-length-lookbehind-assertion-alternatives-for-regular-expressions)

```bash
$ # lookbehind is checking start of line (0 characters) and comma(1 character)
$ echo ',baz,,,xyz,,' | perl -pe 's/(?<=^|,)(?=,|$)/NA/g'
Variable length lookbehind not implemented in regex m/(?<=^|,)(?=,|$)/ at -e line 1.

$ # \K helps in such cases
$ echo ',baz,,,xyz,,' | perl -pe 's/(^|,)\K(?=,|$)/NA/g'
NA,baz,NA,NA,xyz,NA,NA
```

* some more examples

```bash
$ # helps to avoid , within fields for field splitting
$ # note how the quotes are still part of field value
$ echo '"foo","12,34","good"' | perl -F'/"\K,(?=")/' -lane 'print $F[1]'
"12,34"
$ echo '"foo","12,34","good"' | perl -F'/"\K,(?=")/' -lane 'print $F[2]'
"good"

$ # capture groups inside lookarounds
$ echo 'a b c d e' | perl -pe 's/(\H+\h+)(?=(\H+)\h)/$1$2\n/g'
a b
b c
c d
d e
$ # generic formula :)
$ echo 'a b c d e' | perl -pe 's/(\H+\h+)(?=(\H+(\h+\H+){1})\h)/$1$2\n/g'
a b c
b c d
c d e
$ echo 'a b c d e' | perl -pe 's/(\H+\h+)(?=(\H+(\h+\H+){2})\h)/$1$2\n/g'
a b c d
b c d e
```

**Further Reading**

* [stackoverflow - reverse four letter words](https://stackoverflow.com/questions/46870285/reverse-four-length-of-letters-with-sed-in-unix)
* [stackoverflow - lookarounds and possessive quantifier](https://stackoverflow.com/questions/42437747/pcre-negative-lookahead-gives-unexpected-match)

<br>

#### <a name="ignoring-specific-matches"></a>Ignoring specific matches

* A useful construct is `(*SKIP)(*F)`, which allows discarding matches that are not needed
    * the regular expression to be discarded is written first, followed by `(*SKIP)(*F)`, and then the required regular expression is added after `|`

```bash
$ s='Car Bat cod12 Map foo_bar'
$ # all words except those starting with 'c' or 'C'
$ echo "$s" | perl -lne 'print join "\n", /\bc\w+(*SKIP)(*F)|\w+/gi'
Bat
Map
foo_bar

$ s='I like "mango" and "guava"'
$ # all words except those surrounded by double quotes
$ echo "$s" | perl -lne 'print join "\n", /"[^"]+"(*SKIP)(*F)|\w+/g'
I
like
and
$ # change words except those surrounded by double quotes
$ echo "$s" | perl -pe 's/"[^"]+"(*SKIP)(*F)|\w+/\U$&/g'
I LIKE "mango" AND "guava"
```

* for line based decisions, simple if-else might help

```bash
$ cat nums.txt
42
-2
10101
-3.14
-75

$ # change +ve number to -ve and vice versa
$ # note that empty regexp will reuse last successfully matched regexp
$ perl -pe '/^-/ ? s/// : s/^/-/' nums.txt
-42
2
-10101
3.14
75
```

**Further Reading**

* [perldoc - Special Backtracking Control Verbs](https://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs)
* [rexegg - Excluding Unwanted Matches](https://www.rexegg.com/backtracking-control-verbs.html#skipfail)

<br>

#### <a name="special-capture-groups"></a>Special capture groups

* `\1`, `\2` etc match only the exact string captured by the corresponding group
* `(?1)`, `(?2)` etc re-uses the regular expression itself

```bash
$ s='baz 2008-03-24 and 2012-08-12 foo 2016-03-25'
$ # (?1) refers to first capture group (\d{4}-\d{2}-\d{2})
$ echo "$s" | perl -pe 's/(\d{4}-\d{2}-\d{2}) and (?1)/XYZ/'
baz XYZ foo 2016-03-25

$ # using \1 won't work as the two dates are different
$ echo "$s" | perl -pe 's/(\d{4}-\d{2}-\d{2}) and \1//'
baz 2008-03-24 and 2012-08-12 foo 2016-03-25
```

* use `(?:` to group regular expressions without capturing, so the group won't be counted for backreferences
* See also
    * [stackoverflow - what is non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do)
    * [stackoverflow - extract specific fields and key-value pairs](https://stackoverflow.com/questions/46632397/parse-vcf-files-info-field)

```bash
$ s='Car Bat cod12 Map foo_bar'
$ # check what happens if ?: is not used
$ echo "$s" | perl -lne 'print join "\n", /(?:Bat|Map)(*SKIP)(*F)|\w+/gi'
Car
cod12
foo_bar

$ # using ?: helps to focus only on required capture groups
$ echo 'cod1 foo_bar' | perl -pe 's/(?:co|fo)\K(\w)(\w)/$2$1/g'
co1d fo_obar
$ # without ?: you'd need to remember all the other groups as well
$ echo 'cod1 foo_bar' | perl -pe 's/(co|fo)\K(\w)(\w)/$3$2/g'
co1d fo_obar
```

* named capture groups `(?<name>`
    * for backreference, use `\k<name>`
    * accessible via `%+` hash in replacement section

```bash
$ s='baz 2008-03-24 and 2012-08-12 foo 2016-03-25'
$ echo "$s" | perl -pe 's/(\d{4})-(\d{2})-(\d{2})/$3-$2-$1/g'
baz 24-03-2008 and 12-08-2012 foo 25-03-2016

$ # naming the capture groups might offer clarity
$ echo "$s" | perl -pe 's/(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/$+{d}-$+{m}-$+{y}/g'
baz 24-03-2008 and 12-08-2012 foo 25-03-2016
$ echo "$s" | perl -pe 's/(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/$+{m}-$+{d}-$+{y}/g'
baz 03-24-2008 and 08-12-2012 foo 03-25-2016

$ # and useful to transform different capture groups
$ s='"foo,bar",123,"x,y,z",42'
$ echo "$s" | perl -lpe 's/"(?<a>[^"]+)",|(?<a>[^,]+),/$+{a}|/g'
foo,bar|123|x,y,z|42
$ # can also use (?| branch reset
$ echo "$s" | perl -lpe 's/(?|"([^"]+)",|([^,]+),)/$1|/g'
foo,bar|123|x,y,z|42
```

**Further Reading**

* [perldoc - Extended Patterns](https://perldoc.perl.org/perlre.html#Extended-Patterns)
* [rexegg - all the (? usages](https://www.rexegg.com/regex-disambiguation.html)
* [regular-expressions - recursion](https://www.regular-expressions.info/recurse.html#balanced)

<br>

#### <a name="modifiers"></a>Modifiers

* some are already seen, like the `g` (global match) and `i` (case insensitive matching)
* first up, the `r` modifier which returns the substitution result instead of modifying the variable it is acting upon

```bash
$ perl -e '$x="feed"; $y=$x=~s/e/E/gr; print "x=$x\ny=$y\n"'
x=feed
y=fEEd

$ # the r modifier is available for transliteration operator too
$ perl -e '$x="food"; $y=$x=~tr/a-z/A-Z/r; print "x=$x\ny=$y\n"'
x=food
y=FOOD
```

* `e` modifier allows using Perl code in the replacement section instead of a string
* use `ee` if you need to construct a string and then apply evaluation

```bash
$ # replace numbers with their squares
$ echo '4 and 10' | perl -pe 's/\d+/$&*$&/ge'
16 and 100

$ # replace matched string with incremental value
$ echo '4 and 10 foo 57' | perl -pe 's/\d+/++$c/ge'
1 and 2 foo 3
$ # passing initial value
$ echo '4 and 10 foo 57' | c=100 perl -pe 's/\d+/$ENV{c}++/ge'
100 and 101 foo 102

$ # formatting string
$ echo 'a1-2-deed' | perl -lpe 's/[^-]+/sprintf "%04s", $&/ge'
00a1-0002-deed

$ # calling a function
$ echo 'food:12:explain:789' | perl -pe 's/\w+/length($&)/ge'
4:2:7:3

$ # applying another substitution to matched string
$ echo '"mango" and "guava"' | perl -pe 's/"[^"]+"/$&=~s|a|A|gr/ge'
"mAngo" and "guAvA"
```

* multiline modifiers

```bash
$ # m modifier to match beginning/end of each line within multiline string
$ perl -00 -ne 'print if /^Believe/' sample.txt
$ perl -00 -ne 'print if /^Believe/m' sample.txt
Just do-it
Believe it

$ perl -00 -ne 'print if /funny$/' sample.txt
$ perl -00 -ne 'print if /funny$/m' sample.txt
Today is sunny
Not a bit funny
No doubt you like it too

$ # s modifier to allow . meta character to match newlines as well
$ perl -00 -ne 'print if /do.*he/' sample.txt
$ perl -00 -ne 'print if /do.*he/s' sample.txt
Much ado about nothing
He he he
```

**Further Reading**

* [perldoc - perlre Modifiers](https://perldoc.perl.org/perlre.html#Modifiers)
* [stackoverflow - replacement within matched string](https://stackoverflow.com/questions/40458639/replacement-within-the-matched-string-with-sed)

<br>

#### <a name="quoting-metacharacters"></a>Quoting metacharacters

* part of regular expression can be surrounded within `\Q` and `\E` to prevent matching meta characters within that portion
    * however, `$` and `@` would still be interpolated as long as delimiter isn't single quotes
    * `\E` is optional if applying `\Q` till end of search expression
* a typical use case is when the string to be protected is already present in a variable, for example user input or the result of another command
* quotemeta will add a backslash to all characters other than `\w` characters
* See also [perldoc - Quoting metacharacters](https://perldoc.perl.org/perlre.html#Quoting-metacharacters)

```bash
$ # quotemeta in action
$ perl -le '$x="[a].b+c^"; print quotemeta $x'
\[a\]\.b\+c\^

$ # same as: s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt
$ s='a+b' perl -ne 'print if /^\Q$ENV{s}/' eqns.txt
a+b,pi=3.14,5e12

$ s='a+b' perl -pe 's/^\Q$ENV{s}/ABC/' eqns.txt
a=b,a-b=c,c*d
ABC,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ s='a+b' perl -pe 's/\Q$ENV{s}\E.*,/ABC,/' eqns.txt
a=b,a-b=c,c*d
ABC,5e12
i*(t+9-g)/8,4-a+b
```

* use `q` operator for replacement section
* it would treat contents as if they were placed inside single quotes and hence no interpolation
* See also [perldoc - Quote and Quote-like Operators](https://perldoc.perl.org/5.8.8/perlop.html#Quote-and-Quote-like-Operators)

```bash
$ # q in action
$ perl -le '$x="[a].b+c^$@123"; print $x'
[a].b+c^123
$ perl -le '$x=q([a].b+c^$@123); print $x'
[a].b+c^$@123
$ perl -le '$x=q([a].b+c^$@123); print quotemeta $x'
\[a\]\.b\+c\^\$\@123

$ echo 'foo 123' | perl -pe 's/foo/$foo/'
 123
$ echo 'foo 123' | perl -pe 's/foo/q($foo)/e'
$foo 123
$ echo 'foo 123' | perl -pe 's/foo/q{$f)oo}/e'
$f)oo 123

$ # strings saved in other variables do not need special attention
$ echo 'foo 123' | s='a$b' perl -pe 's/foo/$ENV{s}/'
a$b 123
$ echo 'foo 123' | perl -pe 's/foo/a$b/'
a 123
```

<br>

#### <a name="matching-position"></a>Matching position

* From [perldoc - perlvar](https://perldoc.perl.org/perlvar.html#SPECIAL-VARIABLES)

>$-[0] is the offset of the start of the last successful match

>$+[0] is the offset into the string of the end of the entire match

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # starting position of match
$ perl -lne 'print "line: $., offset: $-[0]" if /are/' poem.txt
line: 1, offset: 6
line: 2, offset: 8
line: 4, offset: 7
$ # if offset is needed starting from 1 instead of 0
$ perl -lne 'print "line: $., offset: ",$-[0]+1 if /are/' poem.txt
line: 1, offset: 7
line: 2, offset: 9
line: 4, offset: 8

$ # ending position of match
$ perl -lne 'print "line: $., offset: $+[0]" if /are/' poem.txt
line: 1, offset: 9
line: 2, offset: 11
line: 4, offset: 10
```

* for multiple matches, use `while` loop to go over all the matches

```bash
$ perl -lne 'print "$.:$&:$-[0]" while /is|so|are/g' poem.txt
1:are:6
2:are:8
3:is:6
4:so:4
4:are:7
```

<br>

## <a name="using-modules"></a>Using modules

* There are many standard modules available that come with Perl installation
* and many more available from **Comprehensive Perl Archive Network** (CPAN)
    * [stackoverflow - easiest way to install a missing module](https://stackoverflow.com/questions/65865/whats-the-easiest-way-to-install-a-missing-perl-module)

```bash
$ echo '34,17,6' | perl -F, -lane 'BEGIN{use List::Util qw(max)} print max @F'
34
$ # -M option provides a way to specify modules from command line
$ echo '34,17,6' | perl -MList::Util=max -F, -lane 'print max @F'
34
$ echo '34,17,6' | perl -MList::Util=sum0 -F, -lane 'print sum0 @F'
57
$ echo '34,17,6' | perl -MList::Util=product -F, -lane 'print product @F'
3468

$ s='1,2,3,4,5'
$ echo "$s" | perl -MList::Util=shuffle -F, -lane 'print join ",",shuffle @F'
5,3,4,1,2

$ s='3,b,a,c,d,1,d,c,2,3,1,b'
$ echo "$s" | perl -MList::MoreUtils=uniq -F, -lane 'print join ",",uniq @F'
3,b,a,c,d,1,2

$ echo 'foo 123 baz' | base64
Zm9vIDEyMyBiYXoK
$ echo 'foo 123 baz' | perl -MMIME::Base64 -ne 'print encode_base64 $_'
Zm9vIDEyMyBiYXoK
$ echo 'Zm9vIDEyMyBiYXoK' | perl -MMIME::Base64 -ne 'print decode_base64 $_'
foo 123 baz
```

* a cool module [O](https://perldoc.perl.org/O.html) helps to convert one-liners to full-fledged programs
    * similar to `-o` option for GNU awk

```bash
$ # command being deparsed is discussed in a later section
$ perl -MO=Deparse -ne 'if(!$#ARGV){$h{$_}=1; next}
            print if $h{$_}' colors_1.txt colors_2.txt
LINE: while (defined($_ = <ARGV>)) {
    unless ($#ARGV) {
        $h{$_} = 1;
        next;
    }
    print $_ if $h{$_};
}
-e syntax OK

$ perl -MO=Deparse -00 -ne 'print if /it/' sample.txt
BEGIN { $/ = ""; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
    print $_ if /it/;
}
-e syntax OK
```

**Further Reading**

* [perldoc - perlmodlib](https://perldoc.perl.org/perlmodlib.html)
* [perldoc - Core modules](https://perldoc.perl.org/index-modules-L.html)
* [unix.stackexchange - example for Algorithm::Combinatorics](https://unix.stackexchange.com/questions/310840/better-solution-for-finding-id-groups-permutations-combinations)
* [unix.stackexchange - example for Text::ParseWords](https://unix.stackexchange.com/questions/319301/excluding-enclosed-delimiters-with-cut)
* [stackoverflow - regular expression modules](https://stackoverflow.com/questions/3258847/what-are-good-perl-pattern-matching-regex-modules)
* [metacpan - String::Approx](https://metacpan.org/pod/String::Approx) - Perl extension for approximate matching (fuzzy matching)
* [metacpan - Tie::IxHash](https://metacpan.org/pod/Tie::IxHash) - ordered associative arrays for Perl

<br>

## <a name="two-file-processing"></a>Two file processing

First, a bit about `$#ARGV` and hash variables

```bash
$ # $#ARGV can be used to know which file is being processed
$ perl -lne 'print $#ARGV' <(seq 2) <(seq 3) <(seq 1)
1
1
0
0
0
-1

$ # creating hash variable
$ # checking if a key is present using exists
$ # or if value is known to evaluate to true
$ perl -le '$h{"a"}=5; $h{"b"}=0; $h{1}="abc";
            print "key:a value=", $h{"a"};
            print "key:b present" if exists $h{"b"};
            print "key:1 present" if $h{1}'
key:a value=5
key:b present
key:1 present
```

<br>

#### <a name="comparing-whole-lines"></a>Comparing whole lines

Consider the following test files

```bash
$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow

$ cat colors_2.txt
Black
Blue
Green
Red
White
```

* For two files as input, `$#ARGV` will be `0` only when first file is being processed
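* Using `next` skips the rest of the code for the current line and moves on to the next input line
* The entire line (including the newline character) is used as the hash key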

```bash
$ # common lines
$ # note that all duplicates matching in second file would get printed
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ # same as: awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next}
            print if $h{$_}' colors_1.txt colors_2.txt
Blue
Red
$ # can also use: perl -ne '!$#ARGV ? $h{$_}=1 : $h{$_} && print'

$ # lines from colors_2.txt not present in colors_1.txt
$ # same as: grep -vFxf colors_1.txt colors_2.txt
$ # same as: awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next}
            print if !$h{$_}' colors_1.txt colors_2.txt
Black
Green
White
```

* alternative constructs
* `<FILEHANDLE>` reads line(s) from the specified file
    * defaults to current file argument (includes stdin as well), so `<>` can be used as shortcut
    * `<STDIN>` will read only from stdin, there are also predefined handles for stdout/stderr
    * in list context, all the lines would be read
    * See [perldoc - I/O Operators](https://perldoc.perl.org/perlop.html#I%2fO-Operators) for details

```bash
$ # using if-else instead of next
$ perl -ne 'if(!$#ARGV){ $h{$_}=1 }
            else{ print if $h{$_} }' colors_1.txt colors_2.txt
Blue
Red

$ # read all lines of first file in BEGIN block
$ # <> reads a line from current file argument
$ # eof will ensure only first file is read
$ perl -ne 'BEGIN{ $h{<>}=1 while !eof; }
            print if $h{$_}' colors_1.txt colors_2.txt
Blue
Red
$ # this method also makes it easy to reset the line number
$ # close ARGV is similar to calling nextfile in GNU awk
$ perl -ne 'BEGIN{ $h{<>}=1 while !eof; close ARGV}
            print "$.\n" if $h{$_}' colors_1.txt colors_2.txt
2
4

$ # or pass 1st file content as STDIN, $. will be automatically reset as well
$ perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> }
            print if $h{$_}' <colors_1.txt colors_2.txt
Blue
Red
```

<br>

#### <a name="comparing-specific-fields"></a>Comparing specific fields

Consider the sample input file

```bash
$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
```

* single field
* For ex: only first field comparison instead of entire line as key

```bash
$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ # same as: awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}=1 }
             else{ print if $h{$F[0]} }' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

$ # if header is needed as well
$ # same as: awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' list1 marks.txt
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}=1; $.=0 }
             else{ print if $h{$F[0]} || $.==1 }' list1 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
```

* multiple field comparison

```bash
$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # extract only lines matching both fields specified in list2
$ # same as: awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' list2 marks.txt
$ # default SUBSEP(stored in $;) is \034, same as GNU awk
$ perl -ane 'if(!$#ARGV){ $h{$F[0],$F[1]}=1 }
             else{ print if $h{$F[0],$F[1]} }' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67

$ # or use multidimensional hash
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}{$F[1]}=1 }
             else{ print if $h{$F[0]}{$F[1]} }' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
```

* field and value comparison

```bash
$ cat list3
ECE 70
EEE 65
CSE 80

$ # extract line matching Dept and minimum marks specified in list3
$ # same as: awk 'NR==FNR{d[$1]; m[$1]=$2; next} $1 in d && $3 >= m[$1]'
$ perl -ane 'if(!$#ARGV){ $d{$F[0]}=1; $m{$F[0]}=$F[1] }
             else{ print if $d{$F[0]} && $F[2]>=$m{$F[0]} }' list3 marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
```

* See also [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash)

<br>

#### <a name="line-number-matching"></a>Line number matching

```bash
$ # replace mth line in poem.txt with nth line from nums.txt
$ # assumes that there are at least n lines in nums.txt
$ # same as: awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "nums.txt"}
$ #                             FNR==m{$0=s} 1' poem.txt
$ m=3 n=2 perl -pe 'BEGIN{ $s=<> while $ENV{n}-- > 0; close ARGV}
                    $_=$s if $.==$ENV{m}' nums.txt poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ # same as: awk -v file='nums.txt' '(getline num < file)==1 && num>0'
$ <nums.txt perl -ne 'print if <STDIN> > 0' fruits.txt
fruit   qty
banana  31
```

<br>

## <a name="creating-new-fields"></a>Creating new fields

* Number of fields in input record can be changed by simply manipulating `$#F`

```bash
$ s='foo,bar,123,baz'

$ # reducing fields
$ # same as: awk -F, -v OFS=, '{NF=2} 1'
$ echo "$s" | perl -F, -lane '$,=","; $#F=1; print @F'
foo,bar

$ # creating new empty field(s)
$ # same as: awk -F, -v OFS=, '{NF=5} 1'
$ echo "$s" | perl -F, -lane '$,=","; $#F=4; print @F'
foo,bar,123,baz,

$ # assigning to field greater than $#F will create empty fields as needed
$ # same as: awk -F, -v OFS=, '{$7=42} 1'
$ echo "$s" | perl -F, -lane '$,=","; $F[6]=42; print @F'
foo,bar,123,baz,,,42
```

* adding a field based on existing fields
    * See also [split](#split) and [Array operations](#array-operations) sections

```bash
$ # adding a new 'Grade' field
$ # same as: awk 'BEGIN{OFS="\t"; split("DCBAS",g,//)}
$ #          {NF++; $NF = NR==1 ? "Grade" : g[int($(NF-1)/10)-4]} 1' marks.txt
$ perl -lane 'BEGIN{$,="\t"; @g = split //, "DCBAS"} $#F++;
              $F[-1] = $.==1 ? "Grade" : $g[$F[-2]/10 - 5]; print @F' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

$ # alternate syntax: array initialization and appending array element
$ perl -lane 'BEGIN{$,="\t"; @g = qw(D C B A S)}
              push @F, $.==1 ? "Grade" : $g[$F[-1]/10 - 5]; print @F' marks.txt
```

* two file example

```bash
$ cat list4
Raj class_rep
Amy sports_rep
Tia placement_rep

$ # same as: awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
$ #          {NF++; $NF = FNR==1 ? "Role" : r[$2]} 1' list4 marks.txt
$ perl -lane 'if(!$#ARGV){ $r{$F[0]}=$F[1]; $.=0 }
              else{ push @F, $.==1 ? "Role" : $r{$F[1]};
                    print join "\t", @F }' list4 marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep
```

<br>

## <a name="multiple-file-input"></a>Multiple file input

* perl has no direct equivalent of gawk's `FNR/BEGINFILE/ENDFILE`, but they can be emulated

```bash
$ # same as: awk 'FNR==2' poem.txt greeting.txt
$ # close ARGV will reset $. to 0
$ perl -ne 'print if $.==2; close ARGV if eof' poem.txt greeting.txt
Violets are blue,
Have a safe journey

$ # same as: awk 'BEGINFILE{print "file: "FILENAME} ENDFILE{print $0"\n------"}'
$ perl -lne 'print "file: $ARGV" if $.==1;
             print "$_\n------" and close ARGV if eof' poem.txt greeting.txt
file: poem.txt
And so are you.
------
file: greeting.txt
Have a safe journey
------
```

* workaround for gawk's `nextfile`
* to skip remaining lines from current file being processed and move on to next file

```bash
$ # same as: head -q -n1 and awk 'FNR>1{nextfile} 1'
$ perl -pe 'close ARGV if $.>=1' poem.txt greeting.txt fruits.txt
Roses are red,
Hello there
fruit   qty

$ # same as: awk 'tolower($1) ~ /red/{print FILENAME; nextfile}' *
$ perl -lane 'print $ARGV and close ARGV if $F[0] =~ /red/i' *
colors_1.txt
colors_2.txt
```

<br>

## <a name="dealing-with-duplicates"></a>Dealing with duplicates

* retain only first copy of duplicates

```bash
$ cat duplicates.txt
abc  7   4
food toy ****
abc  7   4
test toy 123
good toy ****

$ # whole line, same as: awk '!seen[$0]++' duplicates.txt
$ perl -ne 'print if !$seen{$_}++' duplicates.txt
abc  7   4
food toy ****
test toy 123
good toy ****

$ # particular column, same as: awk '!seen[$2]++' duplicates.txt
$ perl -ane 'print if !$seen{$F[1]}++' duplicates.txt
abc  7   4
food toy ****

$ # total count, same as: awk '!seen[$2]++{c++} END{print +c}' duplicates.txt
$ perl -lane '$c++ if !$seen{$F[1]}++; END{print $c+0}' duplicates.txt
2
```

* if input is so large that integer counters could overflow, either use arbitrary precision arithmetic or avoid counting altogether
* See also [perldoc - bignum](https://perldoc.perl.org/bignum.html)

```bash
$ perl -le 'print "equal" if
   102**33==1922231403943151831696327756255167543169267432774552016351387451392'
$ # -M option here enables the use of bignum module
$ perl -Mbignum -le 'print "equal" if
   102**33==1922231403943151831696327756255167543169267432774552016351387451392'
equal

$ # avoid unnecessary counting altogether
$ # same as: awk '!($2 in seen); {seen[$2]}' duplicates.txt
$ perl -ane 'print if !$seen{$F[1]}; $seen{$F[1]}=1' duplicates.txt
abc  7   4
food toy ****

$ # same as: awk -M '!($2 in seen){c++} {seen[$2]} END{print +c}' duplicates.txt
$ perl -Mbignum -lane '$c++ if !$seen{$F[1]}; $seen{$F[1]}=1;
                       END{print $c+0}' duplicates.txt
2
```

* multiple fields
* See also [unix.stackexchange - based on same fields that could be in different order](https://unix.stackexchange.com/questions/325619/delete-lines-that-contain-the-same-information-but-in-different-order)

```bash
$ # same as: awk '!seen[$2,$3]++' duplicates.txt
$ # default SUBSEP(stored in $;) is \034, same as GNU awk
$ perl -ane 'print if !$seen{$F[1],$F[2]}++' duplicates.txt
abc  7   4
food toy ****
test toy 123

$ # or use multidimensional key
$ perl -ane 'print if !$seen{$F[1]}{$F[2]}++' duplicates.txt
abc  7   4
food toy ****
test toy 123
```

* retaining specific copy

```bash
$ # second occurrence of duplicate
$ # same as: awk '++seen[$2]==2' duplicates.txt
$ perl -ane 'print if ++$seen{$F[1]}==2' duplicates.txt
abc  7   4
test toy 123

$ # third occurrence of duplicate
$ # same as: awk '++seen[$2]==3' duplicates.txt
$ perl -ane 'print if ++$seen{$F[1]}==3' duplicates.txt
good toy ****

$ # retaining only last copy of duplicate
$ # reverse the input line-wise, retain first copy and then reverse again
$ # same as: tac duplicates.txt | awk '!seen[$2]++' | tac
$ tac duplicates.txt | perl -ane 'print if !$seen{$F[1]}++' | tac
abc  7   4
good toy ****
```

* filtering based on duplicate count
* allows emulating the [uniq](./sorting_stuff.md#uniq) command for specific fields

```bash
$ # all duplicates based on 1st column
$ # same as: awk 'NR==FNR{a[$1]++; next} a[$1]>1' duplicates.txt duplicates.txt
$ perl -ane 'if(!$#ARGV){ $x{$F[0]}++ }
             else{ print if $x{$F[0]}>1 }' duplicates.txt duplicates.txt
abc  7   4
abc  7   4

$ # more than 2 duplicates based on 2nd column
$ # same as: awk 'NR==FNR{a[$2]++; next} a[$2]>2' duplicates.txt duplicates.txt
$ perl -ane 'if(!$#ARGV){ $x{$F[1]}++ }
             else{ print if $x{$F[1]}>2 }' duplicates.txt duplicates.txt
food toy ****
test toy 123
good toy ****

$ # only unique lines based on 3rd column
$ # same as: awk 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt
$ perl -ane 'if(!$#ARGV){ $x{$F[2]}++ }
             else{ print if $x{$F[2]}==1 }' duplicates.txt duplicates.txt
test toy 123
```

<br>

## <a name="lines-between-two-regexps"></a>Lines between two REGEXPs

* This section deals with filtering lines bounded by two *REGEXP*s (referred to as blocks)
* For simplicity the two *REGEXP*s usually used in below examples are the strings **BEGIN** and **END**

<br>

#### <a name="all-unbroken-blocks"></a>All unbroken blocks

Consider the below sample input file, which doesn't have any broken blocks (i.e. **BEGIN** and **END** are always present in pairs)

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* Extracting lines between starting and ending *REGEXP*

```bash
$ # include both starting/ending REGEXP
$ # same as: awk '/BEGIN/{f=1} f; /END/{f=0}' range.txt
$ perl -ne '$f=1 if /BEGIN/; print if $f; $f=0 if /END/' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # can also use: perl -ne 'print if /BEGIN/../END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
```

* other variations

```bash
$ # same as: awk '/END/{f=0} f; /BEGIN/{f=1}' range.txt
$ perl -ne '$f=0 if /END/; print if $f; $f=1 if /BEGIN/' range.txt
1234
6789
a
b
c

$ # check out what these do:
$ perl -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if $f' range.txt
$ perl -ne 'print if $f; $f=0 if /END/; $f=1 if /BEGIN/' range.txt
```

* Extracting lines other than lines between the two *REGEXP*s

```bash
$ # same as: awk '/BEGIN/{f=1} !f; /END/{f=0}' range.txt
$ # can also use: perl -ne 'print if !(/BEGIN/../END/)' range.txt
$ perl -ne '$f=1 if /BEGIN/; print if !$f; $f=0 if /END/' range.txt
foo
bar
baz

$ # the other three cases would be
$ perl -ne '$f=0 if /END/; print if !$f; $f=1 if /BEGIN/' range.txt
$ perl -ne 'print if !$f; $f=1 if /BEGIN/; $f=0 if /END/' range.txt
$ perl -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if !$f' range.txt
```

<br>

#### <a name="specific-blocks"></a>Specific blocks

* Getting first block

```bash
$ # same as: awk '/BEGIN/{f=1} f; /END/{exit}' range.txt
$ perl -ne '$f=1 if /BEGIN/; print if $f; exit if /END/' range.txt
BEGIN
1234
6789
END

$ # use other tricks discussed in previous section as needed
$ # same as: awk '/END/{exit} f; /BEGIN/{f=1}' range.txt
$ perl -ne 'exit if /END/; print if $f; $f=1 if /BEGIN/' range.txt
1234
6789
```

* Getting last block

```bash
$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ # same as: tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac
$ tac range.txt | perl -ne '$f=1 if /END/; print if $f; exit if /BEGIN/' | tac
BEGIN
a
b
c
END

$ # or, save the blocks in a buffer and print the last one alone
$ # same as: awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}'
$ seq 30 | perl -ne 'if(/4/){$f=1; $b=$_; next}
                     $b.=$_ if $f; $f=0 if /6/; END{print $b}'
24
25
26
```

* Getting blocks based on a counter

```bash
$ # get only 2nd block
$ # same as: seq 30 | awk -v b=2 '/4/{c++} c==b{print; if(/6/) exit}'
$ seq 30 | b=2 perl -ne '$c++ if /4/; if($c==$ENV{b}){print; exit if /6/}'
14
15
16

$ # to get all blocks after the first 'b' blocks
$ # same as: seq 30 | awk -v b=1 '/4/{f=1; c++} f && c>b; /6/{f=0}'
$ seq 30 | b=1 perl -ne '$f=1, $c++ if /4/;
                         print if $f && $c>$ENV{b}; $f=0 if /6/'
14
15
16
24
25
26
```

* excluding a particular block

```bash
$ # excludes 2nd block
$ # same as: seq 30 | awk -v b=2 '/4/{f=1; c++} f && c!=b; /6/{f=0}'
$ seq 30 | b=2 perl -ne '$f=1, $c++ if /4/;
                         print if $f && $c!=$ENV{b}; $f=0 if /6/'
4
5
6
24
25
26
```

* extract block only if it matches another string as well

```bash
$ # string to match inside block: 23
$ perl -ne 'if(/BEGIN/){$f=1; $m=0; $b=""}; $m=1 if $f && /23/;
            $b.=$_ if $f; if(/END/){print $b if $m; $f=0}' range.txt
BEGIN
1234
6789
END

$ # line to match inside block: 5 or 25
$ seq 30 | perl -ne 'if(/4/){$f=1; $m=0; $b=""}; $m=1 if $f && /^(5|25)$/;
                     $b.=$_ if $f; if(/6/){print $b if $m; $f=0}'
4
5
6
24
25
26
```

<br>

#### <a name="broken-blocks"></a>Broken blocks

* If there are blocks with ending *REGEXP* but without corresponding start, earlier techniques used will suffice
* Consider the modified input file where starting *REGEXP* doesn't have corresponding ending

```bash
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz

$ # the file reversing trick comes in handy here as well
$ # same as: tac broken_range.txt | awk '/END/{f=1} f; /BEGIN/{f=0}' | tac
$ tac broken_range.txt | perl -ne '$f=1 if /END/;
                         print if $f; $f=0 if /BEGIN/' | tac
BEGIN
1234
6789
END
```

* But if both kinds of broken blocks are present, for ex:

```bash
$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
xyzabc
```

then use buffers to accumulate the records and print accordingly

```bash
$ # same as: awk '/BEGIN/{f=1; buf=$0; next} f{buf=buf ORS $0}
$ #          /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt
$ perl -ne 'if(/BEGIN/){$f=1; $b=$_; next} $b.=$_ if $f;
            if(/END/){$f=0; print $b if $b; $b=""}' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END

$ # note how buffer is initialized as well as cleared
$ # on matching beginning/end REGEXPs respectively
$ # 'undef $b' can also be used here instead of $b=""
```

<br>

## <a name="array-operations"></a>Array operations

* initialization

```bash
$ # list example, each value is separated by comma
$ perl -e '($x, $y) = (4, 5); print "$x:$y\n"'
4:5

$ # using list to initialize arrays, allows variable interpolation
$ # ($x, $y) = ($y, $x) will swap variables :)
$ perl -e '@nums = (4, 5, 84); print "@nums\n"'
4 5 84
$ perl -e '@nums = (4, 5, 84, "foo"); print "@nums\n"'
4 5 84 foo
$ perl -e '$x=5; @y=(3, 2); @nums = ($x, "good", @y); print "@nums\n"'
5 good 3 2

$ # use qw to specify string elements separated by space, no interpolation
$ perl -e '@nums = qw(4 5 84 "foo"); print "@nums\n"'
4 5 84 "foo"
$ perl -e '@nums = qw(a $x @y); print "@nums\n"'
a $x @y
$ # use different delimiter as needed
$ perl -e '@nums = qw/baz 1)foo/; print "@nums\n"'
baz 1)foo
```

* accessing individual elements
* See also [perldoc - functions for arrays](https://perldoc.perl.org/index-functions-by-cat.html#Functions-for-real-@ARRAYs) for push,pop,shift,unshift functions

```bash
$ # index starts from 0
$ perl -le '@nums = (4, "foo", 2, "x"); print $nums[0]'
4
$ # note the use of $ when accessing individual element
$ perl -le '@nums = (4, "foo", 2, "x"); print $nums[2]'
2
$ # to access elements from end, use -ve index from -1
$ perl -le '@nums = (4, "foo", 2, "x"); print $nums[-1]'
x

$ # index of last element in array
$ perl -le '@nums = (4, "foo", 2, "x"); print $#nums'
3
$ # size of array, i.e total number of elements
$ perl -le '@nums = (4, "foo", 2, "x"); $s=@nums; print $s'
4
$ perl -le '@nums = (4, "foo", 2, "x"); print scalar @nums'
4
```

* array slices
* See also [perldoc - Range Operators](https://perldoc.perl.org/perlop.html#Range-Operators)

```bash
$ # note the use of @ when accessing more than one element
$ echo 'a b c d' | perl -lane 'print "@F[0,-1,2]"'
a d c
$ # range operator
$ echo 'a b c d' | perl -lane 'print "@F[1..2]"'
b c
$ # rotating elements
$ echo 'a b c d' | perl -lane 'print "@F[1..$#F,0]"'
b c d a

$ # index needed can be given from another array too
$ echo 'a b c d' | perl -lane '@i=(3,1); print "@F[@i]"'
d b

$ # easy swapping of columns
$ perl -lane 'print join "\t", @F[1,0]' fruits.txt
qty     fruit
42      apple
31      banana
90      fig
6       guava
```

* range operator also allows handy initialization

```bash
$ perl -le '@n = (12..17); print "@n"'
12 13 14 15 16 17

$ perl -le '@n = (l..ad); print "@n"'
l m n o p q r s t u v w x y z aa ab ac ad
```

<br>

#### <a name="iteration-and-filtering"></a>Iteration and filtering

* See also [stackoverflow - extracting multiline text and performing substitution](https://stackoverflow.com/questions/47653826/awk-extracting-a-data-which-is-on-several-lines/47654406#47654406)

```bash
$ # foreach assigns each value to $_ one by one
$ # can also use 'for' keyword instead of 'foreach'
$ perl -le 'print $_*2 foreach (12..14)'
24
26
28

$ # iterate using index
$ perl -le '@x = (a..e); foreach (0..$#x){print $x[$_]}'
a
b
c
d
e

$ # C-style for loop can be used as well
$ perl -le '@x = (a..c); for($i=0;$i<=$#x;$i++){print $x[$i]}'
a
b
c
```

* use `grep` for filtering array elements based on a condition
* See also [unix.stackexchange - extract specific fields and use corresponding header text](https://unix.stackexchange.com/questions/397498/create-lists-of-words-according-to-binary-numbers/397504#397504)

```bash
$ # as usual, $_ will get the value each iteration
$ perl -le '$,=" "; print grep { /[35]/ } 2..26'
3 5 13 15 23 25
$ # alternate syntax
$ perl -le '$,=" "; print grep /[35]/, 2..26'
3 5 13 15 23 25

$ # to get index instead of matches
$ perl -le '$,=" "; @n=(2..26); print grep {$n[$_]=~/[35]/} 0..$#n'
1 3 11 13 21 23

$ # compare values
$ s='23 756 -983 5'
$ echo "$s" | perl -lane 'print join " ", grep $_<100, @F'
23 -983 5

$ # filters only those elements with successful substitution
$ # note that it would modify array elements as well
$ echo "$s" | perl -lane 'print join " ", grep s/3/E/, @F'
2E -98E
```

* more examples

```bash
$ # filtering column(s) based on header
$ perl -lane '@i = grep {$F[$_] eq "Name"} 0..$#F if $.==1;
              print @F[@i]' marks.txt
Name
Raj
Joel
Moi
Surya
Tia
Om
Amy

$ cat split.txt
foo,1:2:5,baz
wry,4,look
free,3:8,oh
$ # print line if more than one column has a digit
$ perl -F: -lane 'print if (grep /\d/, @F) > 1' split.txt
foo,1:2:5,baz
free,3:8,oh
```

* to get random element from array

```bash
$ s='65 23 756 -983 5'
$ echo "$s" | perl -lane 'print $F[rand @F]'
5
$ echo "$s" | perl -lane 'print $F[rand @F]'
23
$ echo "$s" | perl -lane 'print $F[rand @F]'
-983

$ # in scalar context, size of array gets passed to rand
$ # rand actually returns a float
$ # which then gets converted to int index
```

<br>

#### <a name="sorting"></a>Sorting

* See [perldoc - sort](https://perldoc.perl.org/functions/sort.html) for details
* `$a` and `$b` are special variables used for sorting, avoid using them as user defined variables

```bash
$ # by default, sort does string comparison
$ s='foo baz v22 aimed'
$ echo "$s" | perl -lane 'print join " ", sort @F'
aimed baz foo v22

$ # same as default sort
$ echo "$s" | perl -lane 'print join " ", sort {$a cmp $b} @F'
aimed baz foo v22
$ # descending order, note how $a and $b are switched
$ echo "$s" | perl -lane 'print join " ", sort {$b cmp $a} @F'
v22 foo baz aimed

$ # functions can be used for custom sorting
$ # lc lowercases string, so this sorts case insensitively
$ perl -lane 'print join " ", sort {lc $a cmp lc $b} @F' poem.txt
are red, Roses
are blue, Violets
is Sugar sweet,
And are so you.
```

* sorting characters within word

```bash
$ echo 'foobar' | perl -F -lane 'print sort @F'
abfoor

$ cat words.txt
bot
art
are
boat
toe
flee
reed

$ # words with characters in ascending order
$ perl -F -lane 'print if (join "", sort @F) eq $_' words.txt
bot
art

$ # words with characters in descending order
$ perl -F -lane 'print if (join "", sort {$b cmp $a} @F) eq $_' words.txt
toe
reed
```

* for numeric comparison, use `<=>` instead of `cmp`

```bash
$ s='23 756 -983 5'
$ echo "$s" | perl -lane 'print join " ",sort {$a <=> $b} @F'
-983 5 23 756
$ echo "$s" | perl -lane 'print join " ",sort {$b <=> $a} @F'
756 23 5 -983

$ # sorting strings based on their length
$ s='floor bat to dubious four'
$ echo "$s" | perl -lane 'print join ":",sort {length $a <=> length $b} @F'
to:bat:four:floor:dubious
```

* sorting columns based on header

```bash
$ # need to get indexes of order required for header, then use it for all lines
$ perl -lane '@i = sort {$F[$a] cmp $F[$b]} 0..$#F if $.==1;
              print join "\t", @F[@i]' marks.txt
Dept    Marks   Name
ECE     53      Raj
ECE     72      Joel
EEE     68      Moi
CSE     81      Surya
EEE     59      Tia
ECE     92      Om
CSE     67      Amy

$ perl -lane '@i = sort {$F[$b] cmp $F[$a]} 0..$#F if $.==1;
              print join "\t", @F[@i]' marks.txt
Name    Marks   Dept
Raj     53      ECE
Joel    72      ECE
Moi     68      EEE
Surya   81      CSE
Tia     59      EEE
Om      92      ECE
Amy     67      CSE
```

**Further Reading**

* [perldoc - How do I sort a hash (optionally by value instead of key)?](https://perldoc.perl.org/perlfaq4.html#How-do-I-sort-a-hash-(optionally-by-value-instead-of-key)%3f)
* [stackoverflow - sort the keys of a hash by value](https://stackoverflow.com/questions/10901084/how-to-sort-perl-hash-on-values-and-order-the-keys-correspondingly-in-two-array)
* [stackoverflow - sort only from 2nd field, ignore header](https://stackoverflow.com/questions/48920626/sort-rows-in-csv-file-without-header-first-column)
* [stackoverflow - sort based on group of lines](https://stackoverflow.com/questions/48925359/sorting-groups-of-lines)

<br>

#### <a name="transforming"></a>Transforming

* shuffling list elements

```bash
$ s='23 756 -983 5'
$ # note that this doesn't change the input array
$ echo "$s" | perl -MList::Util=shuffle -lane 'print join " ", shuffle @F'
756 23 -983 5
$ echo "$s" | perl -MList::Util=shuffle -lane 'print join " ", shuffle @F'
5 756 23 -983

$ # randomizing file contents
$ perl -MList::Util=shuffle -e 'print shuffle <>' poem.txt
Sugar is sweet,
And so are you.
Violets are blue,
Roses are red,

$ # or if shuffle order is known
$ seq 5 | perl -e '@lines=<>; print @lines[3,1,0,2,4]'
4
2
1
3
5
```

* use `map` to transform every element

```bash
$ echo '23 756 -983 5' | perl -lane 'print join " ", map {$_*$_} @F'
529 571536 966289 25
$ echo 'a b c' | perl -lane 'print join ",", map {qq/"$_"/} @F'
"a","b","c"
$ echo 'a b c' | perl -lane 'print join ",", map {uc qq/"$_"/} @F'
"A","B","C"

$ # changing the array itself
$ perl -le '@s=(4, 245, 12); map {$_*$_} @s; print join " ", @s'
4 245 12
$ perl -le '@s=(4, 245, 12); map {$_ = $_*$_} @s; print join " ", @s'
16 60025 144

$ # ASCII int values for each character
$ echo 'AaBbCc' | perl -F -lane 'print join " ", map ord, @F'
65 97 66 98 67 99

$ s='this is a sample sentence'
$ # shuffle each word, split here converts each element to character array
$ # join the characters after shuffling with empty string
$ # finally print each changed element with space as separator
$ echo "$s" | perl -MList::Util=shuffle -lane '$,=" ";
                    print map {join "", shuffle split//} @F;'
tshi si a mleasp ncstneee
```

* fun little unreadable script...

```bash
$ cat para.txt
Why cannot I go back to my ignorant days with wild imaginations and fantasies?
Perhaps the answer lies in not being able to adapt to my freedom.
Those little dreams, goal setting, anticipation of results, used to be my world.
All joy within the soul and less dependent on outside world.
But all these are absent for a long time now.
Hope I can wake those dreams all over again.

$ perl -MList::Util=shuffle -F'/([^a-zA-Z]+)/' -lane '
        print map {@c=split//; $#c<3 || /[^a-zA-Z]/? $_ :
              join "",$c[0],(shuffle @c[1..$#c-1]),$c[-1]} @F;' para.txt
Why coannt I go back to my inoagrnt dyas wtih wild imiaintangos and fatenasis?
Phearps the awsenr lies in not bieng albe to aadpt to my fedoerm.
Toshe llttie draems, goal stetnig, aaioiciptntn of rtuelss, uesd to be my wrlod.
All joy witihn the suol and less dnenepedt on oiduste world.
But all tsehe are abenst for a lnog tmie now.
Hpoe I can wkae toshe daemrs all over aiagn.
```

* reverse array
* See also [stackoverflow - apply tr and reverse to particular column](https://stackoverflow.com/questions/45571828/execute-bash-command-inside-awk-and-print-command-output/45572038#45572038)

```bash
$ s='23 756 -983 5'
$ echo "$s" | perl -lane 'print join " ", reverse @F'
5 -983 756 23

$ echo 'foobar' | perl -lne 'print reverse split//'
raboof
$ # can also use scalar context instead of using split
$ echo 'foobar' | perl -lne '$x=reverse; print $x'
raboof
$ echo 'foobar' | perl -lne 'print scalar reverse'
raboof
```

<br>

## <a name="miscellaneous"></a>Miscellaneous

<br>

#### <a name="split"></a>split

* the `-a` command line option uses `split` and automatically saves the results in `@F` array
* default separator is the special pattern `' '`, which behaves like `\s+` but also ignores leading whitespace
* by default acts on `$_`
* and by default all splits are performed
* See also [perldoc - split function](https://perldoc.perl.org/functions/split.html)

```bash
$ echo 'a 1 b 2 c' | perl -lane 'print $F[2]'
b
$ echo 'a 1 b 2 c' | perl -lne '@x=split; print $x[2]'
b
$ # temp variable can be avoided by using list context
$ echo 'a 1 b 2 c' | perl -lne 'print join ":", (split)[2,-1]'
b:c

$ # using digits as separator
$ echo 'a 1 b 2 c' | perl -lne '@x=split /\d+/; print ":$x[1]:"'
: b :

$ # specifying maximum number of fields (the limit argument)
$ echo 'a 1 b 2 c' | perl -lne '@x=split /\h+/,$_,2; print "$x[0]:$x[1]:"'
a:1 b 2 c:
$ # specifying limit using -F option
$ echo 'a 1 b 2 c' | perl -F'/\h+/,$_,2' -lane 'print "$F[0]:$F[1]:"'
a:1 b 2 c:
```

* by default, trailing empty fields are stripped
* specify a negative value for the limit argument to preserve trailing empty fields

```bash
$ echo ':123::' | perl -lne 'print scalar split /:/'
2
$ echo ':123::' | perl -lne 'print scalar split /:/,$_,-1'
4

$ echo ':123::' | perl -F: -lane 'print scalar @F'
2
$ echo ':123::' | perl -F'/:/,$_,-1' -lane 'print scalar @F'
4
```

* to save the separators as well, use capture groups

```bash
$ echo 'a 1 b 2 c' | perl -lne '@x=split /(\d+)/; print "$x[1],$x[3]"'
1,2
$ # or, without the temp variable
$ echo 'a 1 b 2 c' | perl -lne 'print join ",", (split /(\d+)/)[1,3]'
1,2

$ # same can be done for -F option
$ echo 'a 1 b 2 c' | perl -F'(\d+)' -lane 'print "$F[1],$F[3]"'
1,2
```

* single line to multiple line by splitting a column

```bash
$ cat split.txt
foo,1:2:5,baz
wry,4,look
free,3:8,oh

$ perl -F, -ane 'print join ",", $F[0],$_,$F[2] for split /:/,$F[1]' split.txt
foo,1,baz
foo,2,baz
foo,5,baz
wry,4,look
free,3,oh
free,8,oh
```

* unexpected behavior if a literal space character is used with the `-F` option

```bash
$ # only one element in @F array
$ echo 'a 1 b 2 c' | perl -F'/b /' -lane 'print $F[1]'

$ # space not being used by separator
$ echo 'a 1 b 2 c' | perl -F'b ' -lane 'print $F[1]'
 2 c
$ # correct behavior
$ echo 'a 1 b 2 c' | perl -F'b\x20' -lane 'print $F[1]'
2 c

$ # errors out if space used inside character class
$ echo 'a 1 b 2 c' | perl -F'/b[ ]/' -lane 'print $F[1]'
Unmatched [ in regex; marked by <-- HERE in m//b[ <-- HERE /.
$ echo 'a 1 b 2 c' | perl -lne '@x=split /b[ ]/; print $x[1]'
2 c
```

<br>

#### <a name="fixed-width-processing"></a>Fixed width processing

```bash
$ # here 'a' indicates arbitrary binary data
$ # the number that follows indicates length
$ # the 'x' indicates characters to ignore, use length after 'x' if needed
$ # and there are many other formats, see perldoc for details
$ echo 'b 123 good' | perl -lne '@x = unpack("a1xa3xa4", $_); print $x[0]'
b
$ echo 'b 123 good' | perl -lne '@x = unpack("a1xa3xa4", $_); print $x[1]'
123
$ echo 'b 123 good' | perl -lne '@x = unpack("a1xa3xa4", $_); print $x[2]'
good

$ # unpack not always needed, can simply capture characters needed
$ echo 'b 123 good' | perl -lne 'print /.{2}(.{3})/'
123
$ # or use substr to specify offset (starts from 0) and length
$ echo 'b 123 good' | perl -lne 'print substr $_, 6, 4'
good

$ # substr can also be used for replacing
$ echo 'b 123 good' | perl -lpe 'substr $_, 2, 3, "gleam"'
b gleam good
```

**Further Reading**

* [perldoc - tutorial on pack and unpack](https://perldoc.perl.org/perlpacktut.html)
* [perldoc - substr](https://perldoc.perl.org/functions/substr.html)
* [stackoverflow - extract columns from a fixed-width format](https://stackoverflow.com/questions/1494611/how-can-i-extract-columns-from-a-fixed-width-format-in-perl)
* [stackoverflow - build fixed-width template from header](https://stackoverflow.com/questions/4911044/parse-fixed-width-files)
* [stackoverflow - convert fixed-width to delimited format](https://stackoverflow.com/questions/43734981/display-column-from-empty-column-delimited-space-in-bash)

<br>

#### <a name="string-and-file-replication"></a>String and file replication

```bash
$ # replicate each line
$ seq 2 | perl -ne 'print $_ x 2'
1
1
2
2

$ # replicate a string
$ perl -le 'print "abc" x 5'
abcabcabcabcabc

$ # works for lists too
$ perl -le '@x = (3, 2, 1) x 2; print join " ",@x'
3 2 1 3 2 1

$ # replicating file
$ wc -c poem.txt
65 poem.txt
$ perl -0777 -ne 'print $_ x 100' poem.txt | wc -c
6500
```

* the [perldoc - glob](https://perldoc.perl.org/functions/glob.html) function can be hacked to generate combinations of strings

```bash
$ # typical use case
$ # same as: echo *.log
$ perl -le 'print join " ", glob q/*.log/'
report.log
$ # same as: echo *.{log,pl}
$ perl -le 'print join " ", glob q/*.{log,pl}/'
report.log code.pl sub_sq.pl

$ # hacking
$ # same as: echo {1,3}{a,b}
$ perl -le '@x=glob q/{1,3}{a,b}/; print "@x"'
1a 1b 3a 3b
$ # same as: echo {1,3}{1,3}{1,3}
$ perl -le '@x=glob "{1,3}" x 3; print "@x"'
111 113 131 133 311 313 331 333
```

<br>

#### <a name="transliteration"></a>transliteration

* See `tr` under [perldoc - Quote-Like Operators](https://perldoc.perl.org/perlop.html#Quote-Like-Operators) section for details
* similar to substitution, by default `tr` acts on `$_` variable and modifies it unless `r` modifier is specified
* however, characters `$` and `@` are treated as literals - i.e. no interpolation
* similar to `sed`, one can also use `y` instead of `tr`

```bash
$ # one-to-one mapping of characters, all occurrences are translated
$ echo 'foo bar cat baz' | perl -pe 'tr/abc/123/'
foo 21r 31t 21z

$ # use - to represent a range in ascending order
$ echo 'Hello World' | perl -pe 'tr/a-zA-Z/n-za-mN-ZA-M/'
Uryyb Jbeyq
$ echo 'Uryyb Jbeyq' | perl -pe 'tr|a-zA-Z|n-za-mN-ZA-M|'
Hello World
```

* if arguments are of different lengths

```bash
$ # when second argument is longer, the extra characters are ignored
$ echo 'foo bar cat baz' | perl -pe 'tr/abc/1-9/'
foo 21r 31t 21z

$ # when first argument is longer
$ # the last character of second argument is repeated to make the lengths equal
$ echo 'foo bar cat baz' | perl -pe 'tr/a-z/123/'
333 213 313 213
```

* modifiers

```bash
$ # no padding, absent mappings are deleted
$ echo 'fob bar cat baz' | perl -pe 'tr/a-z/123/d'
2 21 31 21
$ echo 'Hello:123:World' | perl -pe 'tr/a-z//d'
H:123:W

$ # c modifier complements first argument characters
$ echo 'Hello:123:World' | perl -lpe 'tr/a-z//cd'
elloorld

$ # s modifier to keep only one copy of repeated characters
$ echo 'FFoo seed 11233' | perl -pe 'tr/a-z//s'
FFo sed 11233
$ # when replacement is done as well, only replaced characters are squeezed
$ # unlike 'tr -s' which squeezes characters specified by second argument
$ echo 'FFoo seed 11233' | perl -pe 'tr/A-Z/a-z/s'
foo seed 11233

$ perl -e '$x="food"; $y=$x=~tr/a-z/A-Z/r; print "x=$x\ny=$y\n"'
x=food
y=FOOD
```

* since `-` is used for character ranges, place it at the start/end to represent it literally
* similarly, to represent `\` literally, use `\\`

```bash
$ echo '/foo-bar/baz/report' | perl -pe 'tr/-a-z/_A-Z/'
/FOO_BAR/BAZ/REPORT

$ echo '/foo-bar/baz/report' | perl -pe 'tr|/-|\\_|'
\foo_bar\baz\report
```

* return value is number of replacements made

```bash
$ echo 'Hello there. How are you?' | grep -o '[a-z]' | wc -l
17

$ echo 'Hello there. How are you?' | perl -lne 'print tr/a-z//'
17
```

* unicode examples

```bash
$ echo 'hello!' | perl -CS -pe 'tr/a-z/\x{1d5ee}-\x{1d607}/'
𝗵𝗲𝗹𝗹𝗼!

$ echo 'How are you?' | perl -Mopen=locale -Mutf8 -pe 'tr/a-zA-Z/𝗮-𝘇𝗔-𝗭/'
𝗛𝗼𝘄 𝗮𝗿𝗲 𝘆𝗼𝘂?
```

<br>

#### <a name="executing-external-commands"></a>Executing external commands

* External commands can be issued using `system` function
* Output would be as usual on `stdout` unless redirected while calling the command

```bash
$ perl -e 'system("echo Hello World")'
Hello World
$ # use q operator to avoid interpolation
$ perl -e 'system q/echo $HOME/'
/home/learnbyexample

$ perl -e 'system q/wc poem.txt/'
 4 13 65 poem.txt

$ perl -e 'system q/seq 10 | paste -sd, > out.txt/'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | perl -F, -lane 'system "cat $F[1]"'
I bought two bananas and three mangoes
```

* the return value of `system` holds the exit status information, `$?` can be used as well
* see [perldoc - system](https://perldoc.perl.org/functions/system.html) for details

```bash
$ perl -le '$es=system q/ls poem.txt/; print "$es"'
poem.txt
0
$ perl -le 'system q/ls poem.txt/; print "exit status: $?"'
poem.txt
exit status: 0

$ perl -le 'system q/ls xyz.txt/; print "exit status: $?"'
ls: cannot access 'xyz.txt': No such file or directory
exit status: 512
```

* to save result of external command, use backticks or `qx` operator
* newline gets saved too, use `chomp` if needed

```bash
$ perl -e '$lines = `wc -l < poem.txt`; print $lines'
4
$ perl -e '$nums = qx/seq 3/; print $nums'
1
2
3
```

* See also [stackoverflow - difference between backticks, system, exec and open](https://stackoverflow.com/questions/799968/whats-the-difference-between-perls-backticks-system-and-exec)

<br>

## <a name="further-reading"></a>Further Reading

* Manual and related
    * [perldoc - overview](https://perldoc.perl.org/index-overview.html)
    * [perldoc - faqs](https://perldoc.perl.org/index-faq.html)
    * [perldoc - tutorials](https://perldoc.perl.org/index-tutorials.html)
    * [perldoc - functions](https://perldoc.perl.org/index-functions.html)
    * [perldoc - special variables](https://perldoc.perl.org/perlvar.html)
    * [perldoc - perlretut](https://perldoc.perl.org/perlretut.html)
* Tutorials and Q&A
    * [Perl one-liners explained](http://www.catonmat.net/series/perl-one-liners-explained)
    * [perl Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/perl?sort=votes&pageSize=15)
    * [regex FAQ on SO](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)
    * [regexone](https://regexone.com/) - interactive tutorial
    * [regexcrossword](https://regexcrossword.com/) - practice by solving crosswords, read 'How to play' section before you start
* Alternatives
    * [bioperl](http://bioperl.org/howtos/index.html)
    * [ruby](https://www.ruby-lang.org/en/)
    * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed)


================================================
FILE: restructure_text.md
================================================
# <a name="restructure-text"></a>Restructure text

**Table of Contents**

* [paste](#paste)
    * [Concatenating files column wise](#concatenating-files-column-wise)
    * [Interleaving lines](#interleaving-lines)
    * [Lines to multiple columns](#lines-to-multiple-columns)
    * [Different delimiters between columns](#different-delimiters-between-columns)
    * [Multiple lines to single row](#multiple-lines-to-single-row)
    * [Further reading for paste](#further-reading-for-paste)
* [column](#column)
    * [Pretty printing tables](#pretty-printing-tables)
    * [Specifying different input delimiter](#specifying-different-input-delimiter)
    * [Further reading for column](#further-reading-for-column)
* [pr](#pr)
    * [Converting lines to columns](#converting-lines-to-columns)
    * [Changing PAGE_WIDTH](#changing-page_width)
    * [Combining multiple input files](#combining-multiple-input-files)
    * [Transposing a table](#transposing-a-table)
    * [Further reading for pr](#further-reading-for-pr)
* [fold](#fold)
    * [Examples](#examples)
    * [Further reading for fold](#further-reading-for-fold)

<br>

## <a name="paste"></a>paste

```bash
$ paste --version | head -n1
paste (GNU coreutils) 8.25

$ man paste
PASTE(1)                         User Commands                        PASTE(1)

NAME
       paste - merge lines of files

SYNOPSIS
       paste [OPTION]... [FILE]...

DESCRIPTION
       Write  lines  consisting  of  the sequentially corresponding lines from
       each FILE, separated by TABs, to standard output.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="concatenating-files-column-wise"></a>Concatenating files column wise

* By default, `paste` adds a TAB between corresponding lines of input files

```bash
$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Purple  Green
Red     Red
Teal    White
```

* Specifying a different delimiter using `-d`
* The `<()` syntax is [Process Substitution](http://mywiki.wooledge.org/ProcessSubstitution)
    * to put it simply - allows output of command to be passed as input file to another command without needing to manually create a temporary file

```bash
$ paste -d, <(seq 5) <(seq 6 10)
1,6
2,7
3,8
4,9
5,10

$ # empty cells if number of lines is not same for all input files
$ # -d\| can also be used
$ paste -d'|' <(seq 3) <(seq 4 6) <(seq 7 10)
1|4|7
2|5|8
3|6|9
||10
```

* to paste without any character in between, use `\0` as delimiter
    * note that `\0` here doesn't mean the ASCII NUL character
    * can also use `-d ''` with `GNU paste`

```bash
$ paste -d'\0' <(seq 3) <(seq 6 8)
16
27
38
```

<br>

#### <a name="interleaving-lines"></a>Interleaving lines

* Interleave lines by using newline as delimiter

```bash
$ paste -d'\n' <(seq 11 13) <(seq 101 103)
11
101
12
102
13
103
```

<br>

#### <a name="lines-to-multiple-columns"></a>Lines to multiple columns

* Number of `-` specified determines number of output columns
* Input has to come from stdin, as each `-` reads the next line from the same input stream

```bash
$ # single column to two columns
$ seq 10 | paste -d, - -
1,2
3,4
5,6
7,8
9,10

$ # single column to five columns
$ seq 10 | paste -d: - - - - -
1:2:3:4:5
6:7:8:9:10

$ # input redirection for file input
$ paste -d, - - < colors_1.txt
Blue,Brown
Purple,Red
Teal,
```

* Use `printf` trick if number of columns to specify is too large

```bash
$ # prompt at end of line not shown for simplicity
$ printf -- "- %.s" {1..5}
- - - - - 

$ seq 10 | paste -d, $(printf -- "- %.s" {1..5})
1,2,3,4,5
6,7,8,9,10
```
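
* the same trick can be driven by a variable when the number of columns isn't known in advance
    * this is a sketch: `n` is a hypothetical variable name, `%.0s` is equivalent to the `%.s` used above and `seq` replaces the brace expansion so that the count can come from a variable

```bash
# n holds the number of columns needed
n=4
# %.0s consumes an argument without printing it, so '- ' gets repeated n times
seq 8 | paste -d, $(printf -- '- %.0s' $(seq "$n"))
# 1,2,3,4
# 5,6,7,8
```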

<br>

#### <a name="different-delimiters-between-columns"></a>Different delimiters between columns

* For more than 2 columns, a different delimiter character can be specified for each position by passing a list of characters to the `-d` option

```bash
$ # , is used between 1st and 2nd column
$ # - is used between 2nd and 3rd column
$ paste -d',-' <(seq 3) <(seq 4 6) <(seq 7 9)
1,4-7
2,5-8
3,6-9

$ # re-use list from beginning if not specified for all columns
$ paste -d',-' <(seq 3) <(seq 4 6) <(seq 7 9) <(seq 10 12)
1,4-7,10
2,5-8,11
3,6-9,12
$ # another example
$ seq 10 | paste -d':,' - - - - -
1:2,3:4,5
6:7,8:9,10

$ # so, with single delimiter, it is just re-used for all columns
$ paste -d, <(seq 3) <(seq 4 6) <(seq 7 9) <(seq 10 12)
1,4,7,10
2,5,8,11
3,6,9,12
```

* combination of `-d` and `/dev/null` (empty file) can give multi-character separation between columns
* If this is too confusing to use, consider [pr](#pr) instead

```bash
$ paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6) /dev/null /dev/null <(seq 7 9)
1 : 4 : 7
2 : 5 : 8
3 : 6 : 9

$ # or just use pr instead
$ pr -mts' : ' <(seq 3) <(seq 4 6) <(seq 7 9)
1 : 4 : 7
2 : 5 : 8
3 : 6 : 9

$ # but paste would allow different delimiters ;)
$ paste -d' :  - ' <(seq 3) /dev/null /dev/null <(seq 4 6) /dev/null /dev/null <(seq 7 9)
1 : 4 - 7
2 : 5 - 8
3 : 6 - 9

$ # pr would need two invocations
$ pr -mts' : ' <(seq 3) <(seq 4 6) | pr -mts' - ' - <(seq 7 9)
1 : 4 - 7
2 : 5 - 8
3 : 6 - 9
```

* example to show using empty file instead of `/dev/null`

```bash
$ # assuming file named e doesn't exist
$ touch e
$ # or use this, will empty contents even if file named e already exists :P
$ > e

$ paste -d' :  - ' <(seq 3) e e <(seq 4 6) e e <(seq 7 9)
1 : 4 - 7
2 : 5 - 8
3 : 6 - 9
```

<br>

#### <a name="multiple-lines-to-single-row"></a>Multiple lines to single row

```bash
$ paste -sd, colors_1.txt
Blue,Brown,Purple,Red,Teal

$ # multiple files each gets a row
$ paste -sd: colors_1.txt colors_2.txt
Blue:Brown:Purple:Red:Teal
Black:Blue:Green:Red:White

$ # multiple input files need not have same number of lines
$ paste -sd, <(seq 3) <(seq 5 9)
1,2,3
5,6,7,8,9
```

* Often used to serialize multiple line output from another command

```bash
$ sort -u colors_1.txt colors_2.txt | paste -sd,
Black,Blue,Brown,Green,Purple,Red,Teal,White
```

* For multiple character delimiter, post-process if separator is unique or use another tool like `perl`

```bash
$ seq 10 | paste -sd,
1,2,3,4,5,6,7,8,9,10

$ # post-process
$ seq 10 | paste -sd, | sed 's/,/ : /g'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10

$ # using perl alone
$ seq 10 | perl -pe 's/\n/ : / if(!eof)'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
```

<br>

#### <a name="further-reading-for-paste"></a>Further reading for paste

* `man paste` and `info paste` for more options and detailed documentation
* [paste Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/paste?sort=votes&pageSize=15)

<br>

## <a name="column"></a>column

```bash
COLUMN(1)                 BSD General Commands Manual                COLUMN(1)

NAME
     column — columnate lists

SYNOPSIS
     column [-entx] [-c columns] [-s sep] [file ...]

DESCRIPTION
     The column utility formats its input into multiple columns.  Rows are
     filled before columns.  Input is taken from file operands, or, by
     default, from the standard input.  Empty lines are ignored unless the -e
     option is used.
...
```

<br>

#### <a name="pretty-printing-tables"></a>Pretty printing tables

* by default whitespace is input delimiter

```bash
$ cat dishes.txt
North alootikki baati khichdi makkiroti poha
South appam bisibelebath dosa koottu sevai
West dhokla khakhra modak shiro vadapav
East handoguri litti momo rosgulla shondesh

$ column -t dishes.txt
North  alootikki  baati         khichdi  makkiroti  poha
South  appam      bisibelebath  dosa     koottu     sevai
West   dhokla     khakhra       modak    shiro      vadapav
East   handoguri  litti         momo     rosgulla   shondesh
```

* often useful to get neatly aligned columns from output of another command

```bash
$ paste fruits.txt price.txt
Fruits  Price
apple   182
guava   90
watermelon      35
banana  72
pomegranate     280

$ paste fruits.txt price.txt | column -t
Fruits       Price
apple        182
guava        90
watermelon   35
banana       72
pomegranate  280
```

<br>

#### <a name="specifying-different-input-delimiter"></a>Specifying different input delimiter

* Use `-s` to specify input delimiter
* Use `-n` to prevent merging empty cells
    * From `man column` "This option is a Debian GNU/Linux extension"

```bash
$ paste -d, <(seq 3) <(seq 5 9) <(seq 11 13)
1,5,11
2,6,12
3,7,13
,8,
,9,

$ paste -d, <(seq 3) <(seq 5 9) <(seq 11 13) | column -s, -t
1  5  11
2  6  12
3  7  13
8
9

$ paste -d, <(seq 3) <(seq 5 9) <(seq 11 13) | column -s, -nt
1  5  11
2  6  12
3  7  13
   8  
   9  
```

<br>

#### <a name="further-reading-for-column"></a>Further reading for column

* `man column` for more options and detailed documentation
* [column Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/columns?sort=votes&pageSize=15)
* More examples [here](http://www.commandlinefu.com/commands/using/column/sort-by-votes)

<br>

## <a name="pr"></a>pr

```bash
$ pr --version | head -n1
pr (GNU coreutils) 8.25

$ man pr
PR(1)                            User Commands                           PR(1)

NAME
       pr - convert text files for printing

SYNOPSIS
       pr [OPTION]... [FILE]...

DESCRIPTION
       Paginate or columnate FILE(s) for printing.

       With no FILE, or when FILE is -, read standard input.
...
```

* Pagination is not covered here, the examples relate only to columnating
* For example, the default invocation on a file adds a page header, blank trailer lines, etc

```bash
$ # truncated output shown
$ pr fruits.txt


2017-04-21 17:49                    fruits.txt                    Page 1


Fruits
apple
guava
watermelon
banana
pomegranate

```

* Following sections will use `-t` to omit page headers and trailers

<br>

#### <a name="converting-lines-to-columns"></a>Converting lines to columns

* With [paste](#lines-to-multiple-columns), changing input file rows to column(s) is possible only with consecutive lines
* `pr` can do that as well as split entire file itself according to number of columns needed
* And `-s` option in `pr` allows multi-character output delimiter
* As usual, examples to better show the functionalities

```bash
$ # note how the input got split into two and resulting splits joined by ,
$ seq 6 | pr -2ts,
1,4
2,5
3,6

$ # note how two consecutive lines get joined by ,
$ seq 6 | paste -d, - -
1,2
3,4
5,6
```

* Default **PAGE_WIDTH** is 72 characters, so each column gets 72 divided by number of columns unless `-s` is used

```bash
$ # 3 columns, so each column width is 24 characters
$ seq 9 | pr -3t
1                       4                       7
2                       5                       8
3                       6                       9

$ # using -s, desired delimiter can be specified
$ seq 9 | pr -3ts' '
1 4 7
2 5 8
3 6 9

$ seq 9 | pr -3ts' : '
1 : 4 : 7
2 : 5 : 8
3 : 6 : 9

$ # default is TAB when using -s option with no arguments
$ seq 9 | pr -3ts
1       4       7
2       5       8
3       6       9
```

* Use `-a` to fill columns across, so that consecutive lines end up in the same row, similar to `paste`

```bash
$ seq 8 | pr -4ats:
1:2:3:4
5:6:7:8

$ # no output delimiter for empty cells
$ seq 22 | pr -5ats,
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,17,18,19,20
21,22

$ # note output delimiter even for empty cells
$ seq 22 | paste -d, - - - - -
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,17,18,19,20
21,22,,,
```

<br>

#### <a name="changing-page_width"></a>Changing PAGE_WIDTH

* The default PAGE_WIDTH is 72
* The formula `(col-1)*len(delimiter) + col` seems to work in determining minimum PAGE_WIDTH required for multiple column output
    * `col` is number of columns required

```bash
$ # (36-1)*1 + 36 = 71, so within PAGE_WIDTH limit
$ seq 74 | pr -36ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36
37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72
73,74
$ # (37-1)*1 + 37 = 73, more than default PAGE_WIDTH limit
$ seq 74 | pr -37ats,
pr: page width too narrow
```

* Use `-w` to specify a different PAGE_WIDTH
* The `-J` option turns off truncation

```bash
$ # (37-1)*1 + 37 = 73
$ seq 74 | pr -J -w73 -37ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37
38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74

$ # (3-1)*4 + 3 = 11
$ seq 6 | pr -J -w10 -3ats'::::'
pr: page width too narrow
$ seq 6 | pr -J -w11 -3ats'::::'
1::::2::::3
4::::5::::6

$ # if calculating is difficult, simply use a large number
$ seq 6 | pr -J -w500 -3ats'::::'
1::::2::::3
4::::5::::6
```
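
* the formula can also be evaluated by the shell instead of by hand
    * a sketch under the assumption that the formula above holds, the variable names `col`, `delim` and `width` are made up for illustration

```bash
col=3
delim='::::'
# minimum width as per the formula: (col-1)*len(delimiter) + col
width=$(( (col - 1) * ${#delim} + col ))
echo "$width"
# 11

seq 6 | pr -J -w"$width" -"${col}"ats"$delim"
# 1::::2::::3
# 4::::5::::6
```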

<br>

#### <a name="combining-multiple-input-files"></a>Combining multiple input files

* Use `-m` option to combine multiple files in parallel, similar to `paste`

```bash
$ # 2 columns, so each column width is 36 characters
$ pr -mt fruits.txt price.txt
Fruits                              Price
apple                               182
guava                               90
watermelon                          35
banana                              72
pomegranate                         280

$ # default is TAB when using -s option with no arguments
$ pr -mts <(seq 3) <(seq 4 6) <(seq 7 10)
1       4       7
2       5       8
3       6       9
                10

$ # double TAB as separator
$ # shell expands $'\t\t' before command is executed
$ pr -mts$'\t\t' colors_1.txt colors_2.txt
Blue            Black
Brown           Blue
Purple          Green
Red             Red
Teal            White
```

* For interleaving, specify newline as separator

```bash
$ pr -mts$'\n' fruits.txt price.txt
Fruits
Price
apple
182
guava
90
watermelon
35
banana
72
pomegranate
280
```

<br>

#### <a name="transposing-a-table"></a>Transposing a table

```bash
$ # delimiter is single character, so easy to use tr to change it to newline
$ cat dishes.txt
North alootikki baati khichdi makkiroti poha
South appam bisibelebath dosa koottu sevai
West dhokla khakhra modak shiro vadapav
East handoguri litti momo rosgulla shondesh

$ # 4 columns, so each column width is 18 characters
$ # $(wc -l < dishes.txt) gives number of columns required
$ tr ' ' '\n' < dishes.txt | pr -$(wc -l < dishes.txt)t
North             South             West              East
alootikki         appam             dhokla            handoguri
baati             bisibelebath      khakhra           litti
khichdi           dosa              modak             momo
makkiroti         koottu            shiro             rosgulla
poha              sevai             vadapav           shondesh
```

* Pipe the output to `column` if spacing is too much

```bash
$ tr ' ' '\n' < dishes.txt | pr -$(wc -l < dishes.txt)t | column -t
North      South         West     East
alootikki  appam         dhokla   handoguri
baati      bisibelebath  khakhra  litti
khichdi    dosa          modak    momo
makkiroti  koottu        shiro    rosgulla
poha       sevai         vadapav  shondesh
```

<br>

#### <a name="further-reading-for-pr"></a>Further reading for pr

* `man pr` and `info pr` for more options and detailed documentation
* More examples [here](http://docstore.mik.ua/orelly/unix3/upt/ch21_15.htm)

<br>

## <a name="fold"></a>fold

```bash
$ fold --version | head -n1
fold (GNU coreutils) 8.25

$ man fold
FOLD(1)                          User Commands                         FOLD(1)

NAME
       fold - wrap each input line to fit in specified width

SYNOPSIS
       fold [OPTION]... [FILE]...

DESCRIPTION
       Wrap input lines in each FILE, writing to standard output.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="examples"></a>Examples

```bash
$ nl story.txt
     1	The princess of a far away land fought bravely to rescue a travelling group from bandits. And the happy story ends here. Have a nice day.
     2	Still here? okay, read on: The prince of Happalakkahuhu wished he could be as brave as his sister and vowed to train harder

$ # default folding width is 80
$ fold story.txt
The princess of a far away land fought bravely to rescue a travelling group from
 bandits. And the happy story ends here. Have a nice day.
Still here? okay, read on: The prince of Happalakkahuhu wished he could be as br
ave as his sister and vowed to train harder

$ fold story.txt | nl
     1	The princess of a far away land fought bravely to rescue a travelling group from
     2	 bandits. And the happy story ends here. Have a nice day.
     3	Still here? okay, read on: The prince of Happalakkahuhu wished he could be as br
     4	ave as his sister and vowed to train harder
```

* `-s` option breaks at spaces to avoid word splitting

```bash
$ fold -s story.txt
The princess of a far away land fought bravely to rescue a travelling group 
from bandits. And the happy story ends here. Have a nice day.
Still here? okay, read on: The prince of Happalakkahuhu wished he could be as 
brave as his sister and vowed to train harder
```

* Use `-w` to change default width

```bash
$ fold -s -w60 story.txt
The princess of a far away land fought bravely to rescue a 
travelling group from bandits. And the happy story ends 
here. Have a nice day.
Still here? okay, read on: The prince of Happalakkahuhu 
wished he could be as brave as his sister and vowed to 
train harder
```
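
* without `-s`, lines are broken at exactly the given width, which makes `-w` usable for cutting a string into fixed size chunks
    * a minimal sketch with made up input

```bash
echo 'abcdefghij' | fold -w4
# abcd
# efgh
# ij
```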

<br>

#### <a name="further-reading-for-fold"></a>Further reading for fold

* `man fold` and `info fold` for more options and detailed documentation


================================================
FILE: ruby_one_liners.md
================================================
<br> <br> <br>

---

:information_source: :information_source: This chapter has been converted into a better formatted ebook - https://learnbyexample.github.io/learn_ruby_oneliners/. The ebook also has content updated for newer version of `ruby`, extra chapter for parsing json/csv/xml, includes exercises, solutions, etc.

For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_ruby_oneliners

---

<br> <br> <br>

# <a name="ruby-one-liners"></a>Ruby one liners

**Table of Contents**

* [Executing Ruby code](#executing-ruby-code)
* [Simple search and replace](#simple-search-and-replace)
    * [inplace editing](#inplace-editing)
* [Line filtering](#line-filtering)
    * [Regular expressions based filtering](#regular-expressions-based-filtering)
    * [Fixed string matching](#fixed-string-matching)
    * [Line number based filtering](#line-number-based-filtering)
* [Field processing](#field-processing)
    * [Field comparison](#field-comparison)
    * [Specifying different input field separator](#specifying-different-input-field-separator)
    * [Specifying different output field separator](#specifying-different-output-field-separator)
* [Changing record separators](#changing-record-separators)
    * [Input record separator](#input-record-separator)
    * [Output record separator](#output-record-separator)
* [Multiline processing](#multiline-processing)
* [Ruby regular expressions](#ruby-regular-expressions)
    * [gotchas and tricks](#gotchas-and-tricks)
    * [Backslash sequences](#backslash-sequences)
    * [Non-greedy quantifier](#non-greedy-quantifier)
    * [Lookarounds](#lookarounds)
    * [Special capture groups](#special-capture-groups)
    * [Modifiers](#modifiers)
    * [Code in replacement section](#code-in-replacement-section)
    * [Quoting metacharacters](#quoting-metacharacters)
* [Two file processing](#two-file-processing)
    * [Comparing whole lines](#comparing-whole-lines)
    * [Comparing specific fields](#comparing-specific-fields)
    * [Line number matching](#line-number-matching)
* [Creating new fields](#creating-new-fields)
* [Multiple file input](#multiple-file-input)
* [Dealing with duplicates](#dealing-with-duplicates)
    * [using uniq method](#using-uniq-method)
* [Lines between two REGEXPs](#lines-between-two-regexps)
    * [All unbroken blocks](#all-unbroken-blocks)
    * [Specific blocks](#specific-blocks)
    * [Broken blocks](#broken-blocks)
* [Array operations](#array-operations)
    * [Filtering](#filtering)
    * [Sorting](#sorting)
    * [Transforming](#transforming)
* [Miscellaneous](#miscellaneous)
    * [split](#split)
    * [Fixed width processing](#fixed-width-processing)
    * [String and file replication](#string-and-file-replication)
    * [transliteration](#transliteration)
    * [Executing external commands](#executing-external-commands)
* [Further Reading](#further-reading)

<br>

```
$ ruby --version
ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-linux]

$ man ruby
RUBY(1)                Ruby Programmers Reference Guide                RUBY(1)

NAME
     ruby — Interpreted object-oriented scripting language

SYNOPSIS
     ruby [--copyright] [--version] [-SUacdlnpswvy] [-0[octal]] [-C directory]
          [-E external[:internal]] [-F[pattern]] [-I directory] [-K[c]]
          [-T[level]] [-W[level]] [-e command] [-i[extension]] [-r library]
          [-x[directory]] [--{enable|disable}-FEATURE] [--dump=target]
          [--verbose] [--] [program_file] [argument ...]

DESCRIPTION
     Ruby is an interpreted scripting language for quick and easy object-ori‐
     ented programming.  It has many features to process text files and to do
     system management tasks (like in Perl).  It is simple, straight-forward,
     and extensible.

     If you want a language for easy object-oriented programming, or you don't
     like the Perl ugliness, or you do like the concept of LISP, but don't
     like too many parentheses, Ruby might be your language of choice.
...
```

**Prerequisites and notes**

* familiarity with programming concepts like variables, printing, control structures, arrays, etc
* familiarity with regular expressions
* this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using `grep`, `sed`, `awk`, `perl` etc
* unless otherwise specified, consider input as ASCII encoded text only
* this is an attempt to translate [Perl chapter](./perl_the_swiss_knife.md) to `ruby`, I don't have prior experience of using `ruby`

<br>

## <a name="executing-ruby-code"></a>Executing Ruby code

* One way is to put code in a file and use `ruby` command with filename as argument
    * another is to use [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) at beginning of script, make the file executable and directly run it
* For short programs, one can use `-e` commandline option to provide code from command line itself
    * this entire chapter is about using `ruby` this way from commandline

```bash
$ cat code.rb
print "Hello Ruby\n"
$ ruby code.rb
Hello Ruby

$ # same as: perl -e 'print "Hello Perl\n"'
$ ruby -e 'print "Hello Ruby\n"'
Hello Ruby

$ # multiple statements can be issued separated by ;
$ # puts adds newline character if input doesn't end with a newline
$ # similar to: perl -E '$x=25; $y=12; say $x**$y'
$ ruby -e 'x=25; y=12; puts x**y'
59604644775390625
```

**Further Reading**

* `ruby -h` for summary of options
    * [explainshell](https://explainshell.com/explain?cmd=ruby+-F+-l+-anpe+-i+-0) - to quickly get information without having to traverse through the docs
* [ruby-lang documentation](https://www.ruby-lang.org/en/documentation/) - manuals, tutorials and references

<br>

## <a name="simple-search-and-replace"></a>Simple search and replace

* More detailed examples with regular expressions will be covered in later sections
* Just like other text processing commands, `ruby` will automatically loop over input line by line when `-n` or `-p` option is used
    * like `sed`, the `-n` option won't print the record
    * `-p` will print the record, including any changes made
    * default record separator is newline character
    * `$_` will contain the input record content, including the record separator (like `perl` and unlike `sed/awk`)
* and similar to other commands, `ruby` will work with both stdin and file input
    * See other chapters for examples of [seq](./miscellaneous.md#seq), [paste](./restructure_text.md#paste), etc

```bash
$ # sample stdin data
$ seq 10 | paste -sd,
1,2,3,4,5,6,7,8,9,10

$ # change only first ',' to ' : '
$ # same as: perl -pe 's/,/ : /'
$ seq 10 | paste -sd, | ruby -pe 'sub(/,/, " : ")'
1 : 2,3,4,5,6,7,8,9,10

$ # change all ',' to ' : '
$ # same as: perl -pe 's/,/ : /g'
$ seq 10 | paste -sd, | ruby -pe 'gsub(/,/, " : ")'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10

$ # sub(/,/, " : ") is shortcut for $_.sub!(/,/, " : ")
$ # gsub(/,/, " : ") is shortcut for $_.gsub!(/,/, " : ")
$ # sub! and gsub! do inplace changing
$ # $_.sub and $_.gsub return the result without changing $_, similar to perl's s///r modifier
$ # () is optional, sub /,/, " : " can be used instead of sub(/,/, " : ")
```
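In script form, the difference between the returning and the in-place variants can be checked directly. This is only a minimal sketch, and the sample string and variable names are made up for illustration.

```ruby
line = "1,2,3,4\n"

# sub returns a new string with only the first match replaced
first_only = line.sub(/,/, " : ")
# gsub returns a new string with all the matches replaced
all = line.gsub(/,/, " : ")

# sub! and gsub! modify the receiver, and return nil if nothing matched
copy = line.dup
copy.sub!(/,/, " : ")
no_match = copy.sub!(/xyz/, "")

puts first_only, all, copy
p no_match
```

The `nil` return value of `sub!` and `gsub!` is handy as a condition, but it also means their result shouldn't be chained without a check.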

<br>

#### <a name="inplace-editing"></a>inplace editing

```bash
$ cat greeting.txt
Hi there
Have a nice day

$ # original file gets preserved in 'greeting.txt.bkp'
$ # same as: perl -i.bkp -pe 's/Hi/Hello/' greeting.txt
$ ruby -i.bkp -pe 'sub(/Hi/, "Hello")' greeting.txt
$ cat greeting.txt
Hello there
Have a nice day

$ # use empty argument to -i with caution, changes made cannot be undone
$ ruby -i -pe 'sub(/nice day/, "safe journey")' greeting.txt
$ cat greeting.txt
Hello there
Have a safe journey
```

* Multiple input files are treated individually and changes are written back to respective files

```bash
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes

$ # same as: perl -i.bkp -pe 's/3/three/' f1 f2
$ ruby -i.bkp -pe 'sub(/3/, "three")' f1 f2
$ cat f1
I ate three apples
$ cat f2
I bought two bananas and three mangoes
```
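For comparison, here's a rough script equivalent of `-i.bkp` for a single file. This is a sketch of the idea, not a description of how `ruby -i` is implemented; the temporary directory and the file content are assumptions made so that the example is self-contained.

```ruby
require 'fileutils'
require 'tmpdir'

result = backup = nil
Dir.mktmpdir do |dir|
  # hypothetical sample file created just for this example
  path = File.join(dir, "greeting.txt")
  File.write(path, "Hi there\nHave a nice day\n")

  # keep a copy of the original, like the .bkp suffix given to -i
  FileUtils.cp(path, "#{path}.bkp")

  # process line by line and write the changes back to the same file
  changed = File.readlines(path).map { |line| line.sub(/Hi/, "Hello") }
  File.write(path, changed.join)

  result = File.read(path)
  backup = File.read("#{path}.bkp")
end

puts result
```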

**Further Reading**

* [ruby-doc: Pre-defined variables](https://ruby-doc.org/core-2.5.0/doc/globals_rdoc.html#label-Pre-defined+variables) for explanation on `$_` and other such special variables
* [ruby-doc: gsub](https://ruby-doc.org/core-2.5.0/String.html#method-i-gsub) for `gsub` syntax details

<br>

## <a name="line-filtering"></a>Line filtering

<br>

#### <a name="regular-expressions-based-filtering"></a>Regular expressions based filtering

* one way is to use `variable =~ /REGEXP/FLAGS` to check for a match
    * use `variable !~ /REGEXP/FLAGS` for negated match
    * by default acts on `$_` if variable is not specified
    * see [ruby-doc: Regexp](https://ruby-doc.org/core-2.5.0/Regexp.html) for documentation
* as we need to print only selective lines, use `-n` option
    * by default, contents of `$_` will be printed if no argument is passed to `print`

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # same as: perl -ne 'print if /^[RS]/' poem.txt
$ # /^[RS]/ is shortcut for $_ =~ /^[RS]/
$ ruby -ne 'print if /^[RS]/' poem.txt
Roses are red,
Sugar is sweet,

$ # same as: perl -ne 'print if /and/i' poem.txt
$ ruby -ne 'print if /and/i' poem.txt
And so are you.

$ # same as: perl -ne 'print if !/are/' poem.txt
$ # !/are/ is shortcut for $_ !~ /are/
$ ruby -ne 'print if !/are/' poem.txt
Sugar is sweet,

$ # same as: perl -ne 'print if /are/ && !/so/' poem.txt
$ ruby -ne 'print if /are/ && !/so/' poem.txt
Roses are red,
Violets are blue,
```
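The same filters can be written in a script with an explicit variable instead of the implicit `$_`. A minimal sketch, with the poem lines hard-coded as an array for illustration:

```ruby
lines = ["Roses are red,", "Violets are blue,",
         "Sugar is sweet,", "And so are you."]

# =~ gives the match position (truthy) or nil (falsy)
starts_rs = lines.select { |s| s =~ /^[RS]/ }

# !~ is the negated match
no_are = lines.select { |s| s !~ /are/ }

# conditions can be combined with && and ||
are_not_so = lines.select { |s| s =~ /are/ && s !~ /so/ }

puts starts_rs, no_are, are_not_so
```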

* using different delimiter
* quoting from [ruby-doc: Percent Strings](https://ruby-doc.org/core-2.5.0/doc/syntax/literals_rdoc.html#label-Percent+Strings)

> If you are using “(”, “[”, “{”, “<” you must close it with “)”, “]”, “}”, “>” respectively. You may use most other non-alphanumeric characters for percent string delimiters such as “%”, “|”, “^”, etc.

```bash
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log

$ # same as: perl -ne 'print if /\/foo\/a\//' paths.txt
$ ruby -ne 'print if /\/foo\/a\//' paths.txt
/foo/a/report.log

$ # same as: perl -ne 'print if m#/foo/a/#' paths.txt
$ ruby -ne 'print if %r#/foo/a/#' paths.txt
/foo/a/report.log

$ # same as: perl -ne 'print if !m#/foo/a/#' paths.txt
$ ruby -ne 'print if !%r#/foo/a/#' paths.txt
/foo/y/power.log
/foo/abc/errors.log
```

<br>

#### <a name="fixed-string-matching"></a>Fixed string matching

* To match strings literally, use `include?` method

```bash
$ echo 'int a[5]' | ruby -ne 'print if /a[5]/'
$ echo 'int a[5]' | ruby -ne 'print if $_.include?("a[5]")'
int a[5]

$ # however, string within double quotes gets interpolated
$ ruby -e 'a=5; puts "value of a:\t#{a}"'
value of a:     5
$ # use %q (covered later) to specify single quoted string
$ echo 'int #{a}' | ruby -ne 'print if $_.include?(%q/#{a}/)'
int #{a}
$ # or pass the string as environment variable
$ echo 'int #{a}' | s='#{a}' ruby -ne 'print if $_.include?(ENV["s"])'
int #{a}
```

* restricting match to start/end of line

```bash
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # start of line
$ s='a+b' ruby -ne 'print if $_.start_with?(ENV["s"])' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # -l option is needed to remove record separator (covered later)
$ s='a+b' ruby -lne 'print if $_.end_with?(ENV["s"])' eqns.txt
i*(t+9-g)/8,4-a+b
```
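The literal matching methods can be compared side by side in a script. Note that `Regexp.escape` is not used elsewhere in this section; it is shown here as one more possible way to match a string literally inside a regexp.

```ruby
line = "int a[5]"

# as a regexp, [5] is a character class, so this looks for "a5"
regexp_match = line =~ /a[5]/
# include? compares the string literally
literal_match = line.include?("a[5]")

# Regexp.escape adds backslashes before regexp metacharacters
escaped = Regexp.escape("a[5]")
escaped_match = line =~ /#{escaped}/

# start_with? and end_with? restrict the literal match to either end
starts = "a+b,pi=3.14,5e12".start_with?("a+b")
ends = "i*(t+9-g)/8,4-a+b".end_with?("a+b")

p regexp_match, literal_match, escaped, escaped_match, starts, ends
```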

* `index` method returns matching position (starts at 0) and nil if not found
    * supports both string and regexp
    * optional 2nd argument specifies the offset to start searching from
* See [ruby-doc: index](https://ruby-doc.org/core-2.5.0/String.html#method-i-index) for details

```bash
$ # passing string
$ ruby -ne 'print if $_.index("a+b")' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b
$ ruby -ne 'print if $_.index("a+b")==0' eqns.txt
a+b,pi=3.14,5e12

$ # passing regexp
$ ruby -ne 'print if $_.index(/[+*]/)<5' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ s='a+b' ruby -ne 'print if $_.index(ENV["s"], 1)' eqns.txt
i*(t+9-g)/8,4-a+b
```
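Since `index` returns `nil` when there is no match, a comparison like `index(...) < 5` raises an exception for such lines. The sketch below guards against that; the last array element is a made-up line added to show the no-match case.

```ruby
lines = ["a=b,a-b=c,c*d", "a+b,pi=3.14,5e12",
         "i*(t+9-g)/8,4-a+b", "no operators here"]

# check for nil before comparing the position
early = lines.select do |s|
  pos = s.index(/[+*]/)
  pos && pos < 5
end

# second argument is the offset to start searching from
from_start = "a+b,pi=3.14,5e12".index("a+b", 1)
later = "i*(t+9-g)/8,4-a+b".index("a+b", 1)

puts early
p from_start, later
```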

<br>

#### <a name="line-number-based-filtering"></a>Line number based filtering

* special variable `$.` contains total records read so far, similar to `NR` in `awk`
    * as far as I've checked the docs, there's no equivalent of awk's `FNR`
* See also [ruby-doc: eof](https://ruby-doc.org/core-2.5.0/IO.html#method-i-eof)

```bash
$ # print 2nd line
$ # same as: perl -ne 'print if $.==2' poem.txt
$ ruby -ne 'print if $.==2' poem.txt
Violets are blue,

$ # print 2nd and 4th line
$ # same as: perl -ne 'print if $.==2 || $.==4' poem.txt
$ # can also use: ruby -ne 'print if [2, 4].include?($.)' poem.txt
$ ruby -ne 'print if $.==2 || $.==4' poem.txt
Violets are blue,
And so are you.

$ # print last line
$ # same as: perl -ne 'print if eof' poem.txt
$ # $< is like filehandle for input files/stdin given from commandline
$ ruby -ne 'print if $<.eof' poem.txt
And so are you.
```

* for large input, use `exit` to avoid unnecessary record processing
* See [ruby-doc: Control Expressions](https://ruby-doc.org/core-2.5.0/doc/syntax/control_expressions_rdoc.html) for syntax details

```bash
$ # same as: perl -ne 'if($.==234){print; exit}'
$ seq 14323 14563435 | ruby -ne 'if $.==234 then print; exit end'
14556
$ # can also group the statements in ()
$ seq 14323 14563435 | ruby -ne '(print; exit) if $.==234'
14556

$ # mimicking head command
$ # same as: head -n3 and sed '3q' or perl -pe 'exit if $.>3'
$ seq 14 25 | ruby -pe 'exit if $.>3'
14
15
16

$ # same as: sed '3Q' and perl -pe 'exit if $.==3'
$ seq 14 25 | ruby -pe 'exit if $.==3'
14
15
```

* selecting range of lines
* See [ruby-doc: Range](https://ruby-doc.org/core-2.5.0/Range.html) for syntax details

```bash
$ # in this context, the range is compared against $.
$ # same as: perl -ne 'print if 3..5'
$ seq 14 25 | ruby -ne 'print if 3..5'
16
17
18

$ # selecting from particular line number to end of input
$ # same as: perl -ne 'print if $.>=10'
$ seq 14 25 | ruby -ne 'print if $.>=10'
23
24
25
```
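In a script, the role of `$.` can be played by an explicit counter. A minimal sketch, where the input is generated in the code itself instead of coming from `seq`:

```ruby
# same data as: seq 14 25
text = (14..25).map { |n| "#{n}\n" }.join

selected = []
text.each_line.with_index(1) do |line, num|
  # num is the line number, starting from 1
  selected << line if (3..5).cover?(num)
  # stop early once the range is done, like exit in the one-liners
  break if num == 5
end

print selected.join
```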

<br>

## <a name="field-processing"></a>Field processing

* `-a` option will auto-split each input record based on one or more continuous white-space
    * similar to default behavior in `awk` and same as `perl -a`
    * See also [split](#split) section
* Special variable array `$F` will contain all the elements, indexing starts from 0
    * negative indexing is also supported, `-1` gives last element, `-2` gives last-but-one and so on
    * see [Array operations](#array-operations) section for examples on array usage

```bash
$ cat fruits.txt
fruit   qty
apple   42
banana  31
fig     90
guava   6

$ # print only first field, indexing starts from 0
$ # same as: perl -lane 'print $F[0]' fruits.txt
$ ruby -ane 'puts $F[0]' fruits.txt
fruit
apple
banana
fig
guava

$ # print only second field
$ # same as: perl -lane 'print $F[1]' fruits.txt
$ ruby -ane 'puts $F[1]' fruits.txt
qty
42
31
90
6
```

* by default, leading and trailing whitespaces won't be considered when splitting the input record
    * same as `awk`'s default behavior and `perl -a`

```bash
$ printf ' a    ate b\tc   \n'
 a    ate b     c
$ printf ' a    ate b\tc   \n' | ruby -ane 'puts $F[0]'
a
$ printf ' a    ate b\tc   \n' | ruby -ane 'puts $F[-1]'
c

$ # number of elements
$ printf ' a    ate b\tc   \n' | ruby -ane 'puts $F.length'
4
```

<br>

#### <a name="field-comparison"></a>Field comparison

* operators `==`, `!=`, `<`, etc will work for both string/numeric comparison
* unlike `perl`, numeric comparison for text requires converting to appropriate numeric format
    * See [ruby-doc: string methods](https://ruby-doc.org/core-2.5.0/String.html#method-i-to_c) for details

```bash
$ # if first field exactly matches the string 'apple'
$ # same as: perl -lane 'print $F[1] if $F[0] eq "apple"' fruits.txt
$ ruby -ane 'puts $F[1] if $F[0] == "apple"' fruits.txt
42

$ # print first field if second field > 35 (excluding header)
$ # same as: perl -lane 'print $F[0] if $F[1]>35 && $.>1' fruits.txt
$ ruby -ane 'puts $F[0] if $F[1].to_i > 35 && $.>1' fruits.txt
apple
fig

$ # print header and lines with qty < 35
$ # same as: perl -ane 'print if $F[1]<35 || $.==1' fruits.txt
$ ruby -ane 'print if $F[1].to_i < 35 || $.==1' fruits.txt
fruit   qty
banana  31
guava   6

$ # if first field does NOT contain 'a'
$ # same as: perl -ane 'print if $F[0] !~ /a/' fruits.txt
$ ruby -ane 'print if $F[0] !~ /a/' fruits.txt
fruit   qty
fig     90
```
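The need for `to_i` in the examples above can be seen from how strings behave in comparisons. A small sketch with made-up values:

```ruby
# to_i converts leading digits and gives 0 if there are none
qty = "42".to_i
header = "qty".to_i
partial = "31abc".to_i

# strings are compared character by character, not by numeric value
string_cmp = "6" > "42"
number_cmp = "6".to_i > "42".to_i

# comparing a string with a number raises an exception
error = begin
  "42" > 35
rescue ArgumentError => e
  e.class
end

p qty, header, partial, string_cmp, number_cmp, error
```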

<br>

#### <a name="specifying-different-input-field-separator"></a>Specifying different input field separator

* by using `-F` command line option

```bash
$ # second field where input field separator is :
$ # same as: perl -F: -lane 'print $F[1]'
$ echo 'foo:123:bar:789' | ruby -F: -ane 'puts $F[1]'
123

$ # last field, same as: perl -F: -lane 'print $F[-1]'
$ echo 'foo:123:bar:789' | ruby -F: -ane 'puts $F[-1]'
789
$ # second last field, same as: perl -F: -lane 'print $F[-2]'
$ echo 'foo:123:bar:789' | ruby -F: -ane 'puts $F[-2]'
bar

$ # second and last field, same as: perl -F: -lane 'print "$F[1] $F[-1]"'
$ echo 'foo:123:bar:789' | ruby -F: -ane 'puts "#{$F[1]} #{$F[-1]}"'
123 789

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | ruby -F';' -ane 'puts $F[2]'
three
```

* last element of `$F` array will contain the record separator as well
    * note that default `-a` option without `-F` won't have this issue as whitespaces at start/end are stripped
* it doesn't make visual difference when `puts` is used as it adds newline only if not already present
* if the record separator is not desired, use `-l` option to remove the record separator from input

```bash
$ echo 'foo 123' | ruby -ane 'puts "#{$F[-1]}xyz"'
123xyz

$ echo 'foo:123:bar:789' | ruby -F: -ane 'puts "#{$F[-1]}a"'
789
a
$ echo 'foo:123:bar:789' | ruby -F: -lane 'puts "#{$F[-1]}a"'
789a
```
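What `-a`, `-F` and `-l` do can be approximated in a script using the `split` and `chomp` methods. This is a sketch of the observable behavior with hard-coded records, not of the option handling itself.

```ruby
# -a without -F behaves like split with no arguments
ws_fields = " a    ate b\tc   \n".split

# with ':' as separator, the newline stays in the last field
record = "foo:123:bar:789\n"
raw_fields = record.split(":")

# removing the record separator first, as -l does
clean_fields = record.chomp.split(":")

# a regexp can be used as the separator too
re_fields = "Sample123string54with908numbers".split(/\d+/)

p ws_fields, raw_fields, clean_fields, re_fields
```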

* Regular expressions based input field separator

```bash
$ # same as: perl -F'\d+' -lane 'print $F[1]'
$ echo 'Sample123string54with908numbers' | ruby -F'\d+' -ane 'puts $F[1]'
string

$ # first field will be empty as there is nothing before '{'
$ echo '{foo}   bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[0]'

$ echo '{foo}   bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[1]'
foo
$ echo '{foo}   bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[2]'
bar
$ echo '{foo}   bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[-1]'
baz
```

* to process individual characters, simply use indexing on input string
* See [ruby-doc: Encoding](https://ruby-doc.org/core-2.5.0/Encoding.html) for details on handling different string encodings

```bash
$ # same as: perl -F -lane 'print $F[0]'
$ echo 'apple' | ruby -ne 'puts $_[0]'
a

$ # if needed, chomp the record separator using -l
$ # same as: perl -F -lane 'print $F[-1]'
$ echo 'apple' | ruby -lne 'puts $_[-1]'
e

$ ruby -e 'puts Encoding.default_external'
UTF-8
$ printf 'hi👍 how are you?' | ruby -ne 'puts $_[2]'
👍
$ # use -E option to explicitly specify external/internal encodings
$ printf 'hi👍 how are you?' | ruby -E UTF-8:UTF-8 -ne 'puts $_[2]'
👍
```

<br>

#### <a name="specifying-different-output-field-separator"></a>Specifying different output field separator

* use `$,` to change separator between `print` arguments
    * could be remembered easily by noting that `,` is used to separate `print` arguments
    * note that `$,` doesn't affect `puts` which always uses newline as separator
* the `-l` option is useful here in more than one way
    * it removes input record separator
    * and appends the record separator to `print` output

```bash
$ # by default, the various arguments are concatenated
$ echo 'foo:123:bar:789' | ruby -F: -lane 'print $F[1], $F[-1]'
123789

$ # change $, if different separator is needed
$ # same as: perl -F: -lane '$,=" "; print $F[1], $F[-1]'
$ echo 'foo:123:bar:789' | ruby -F: -lane '$,=" "; print $F[1], $F[-1]'
123 789
$ echo 'foo:123:bar:789' | ruby -F: -lane '$,="-"; print $F[1], $F[-1]'
123-789

$ # array's join method also uses $,
$ # same as: perl -F: -lane '$,=" - "; print @F'
$ echo 'foo:123:bar:789' | ruby -F: -lane '$,=" - "; print $F.join'
foo - 123 - bar - 789
$ # or pass the separator as argument to join method
$ echo 'foo:123:bar:789' | ruby -F: -lane 'print $F.join(" - ")'
foo - 123 - bar - 789
$ # or the equivalent
$ echo 'foo:123:bar:789' | ruby -F: -lane 'print $F * " - "'
foo - 123 - bar - 789
```
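In a script, passing the separator to `join` explicitly avoids depending on the global `$,` (newer Ruby versions may also emit a deprecation warning when `$,` is set). A minimal sketch with a hard-coded record:

```ruby
fields = "foo:123:bar:789".split(":")

# pick second and last fields, then join with a space
picked = fields.values_at(1, -1).join(" ")

# join all the fields with a multi-character separator
all = fields.join(" - ")

# Array * String is the same as join
star = fields * " - "

puts picked, all, star
```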

* use `BEGIN` if same separator is to be used for all lines
    * statements inside `BEGIN` are executed before processing any input text

```bash
$ # same as: perl -lane 'BEGIN{$,=","} print @F' fruits.txt
$ ruby -lane 'BEGIN{$,=","}; print $F.join' fruits.txt
fruit,qty
apple,42
banana,31
fig,90
guava,6
```

<br>

## <a name="changing-record-separators"></a>Changing record separators

<br>

#### <a name="input-record-separator"></a>Input record separator

* by default, newline character is used as input record separator
* use `$/` to specify a different input record separator
    * unlike `gawk`, only string can be used, no regular expressions
* for single character separator, can also use `-0` command line option which accepts octal value as argument
* if `-l` option is also used
    * input record separator will be chomped from input record
        * earlier versions used `chop` instead of `chomp`. See [bugs.ruby-lang.org 12926](https://bugs.ruby-lang.org/issues/12926)
    * in addition, output record separator(ORS) will get whatever is current value of input record separator
    * so, order of `-l`, `-0` and/or `$/` usage becomes important

```bash
$ s='this is a sample string'

$ # space as input record separator, printing all records
$ # ORS is newline as -l is used before $/ gets changed
$ # same as: perl -lne 'BEGIN{$/=" "} print "$. $_"'
$ printf "$s" | ruby -lne 'BEGIN{$/=" "}; print "#{$.} #{$_}"'
1 this
2 is
3 a
4 sample
5 string

$ # print all records containing 'a'
$ # same as: perl -l -0040 -ne 'print if /a/'
$ printf "$s" | ruby -l -0040 -ne 'print if /a/'
a
sample

$ # if the order is changed, ORS will be space, not newline
$ printf "$s" | ruby -0040 -l -ne 'print if /a/'
a sample 
```
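Inside a script, `each_line` accepts the record separator as an argument, so `$/` need not be changed. A sketch using the same sample string:

```ruby
s = "this is a sample string"

# each record includes the separator, except the last one here
records = s.each_line(" ").to_a

# chomp with an argument removes that separator from the end
chomped = records.map { |r| r.chomp(" ") }

# records containing 'a'
with_a = chomped.select { |r| r =~ /a/ }

p records, chomped, with_a
```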

* `-0` option used without argument will use the ASCII NUL character as input record separator
* `-0777` will cause entire file to be slurped

```bash
$ printf 'foo\0bar\0' | cat -A
foo^@bar^@$
$ # same as: perl -l -0 -ne 'print'
$ # could be golfed to: ruby -l0pe ''
$ printf 'foo\0bar\0' | ruby -l -0 -ne 'print'
foo
bar

$ # replace first newline with '. '
$ # same as: perl -0777 -pe 's/\n/. /' greeting.txt
$ ruby -0777 -pe 'sub(/\n/, ". ")' greeting.txt
Hello there. Have a safe journey
```

* for paragraph mode (two or more consecutive newline characters), use `-00` or assign empty string to `$/`

Consider the below sample file

```bash
$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* again, input record will have the separator too and using `-l` will chomp it
* however, if more than two consecutive newline characters separate the paragraphs, only two newlines will be preserved and the rest discarded
    * use `$/="\n\n"` to avoid this behavior

```bash
$ # print all paragraphs containing 'it'
$ # same as: perl -00 -ne 'print if /it/' sample.txt
$ ruby -00 -ne 'print if /it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ # same as: perl -F'\n' -00 -ane 'print if $#F==0' sample.txt
$ ruby -F'\n' -00 -ane 'print if $F.length==1' sample.txt
Hello World

```

* Re-structuring paragraphs

```bash
$ # same as: perl -F'\n' -l -00 -ane 'print join ". ", @F' sample.txt
$ ruby -F'\n' -l -00 -ane 'print $F.join(". ")' sample.txt
Hello World
Good day. How are you
Just do-it. Believe it
Today is sunny. Not a bit funny. No doubt you like it too
Much ado about nothing. He he he
```
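Paragraph mode is available in scripts too, by passing an empty string to `each_line`. The sketch below uses the first three paragraphs of `sample.txt` as a hard-coded string.

```ruby
text = "Hello World\n\nGood day\nHow are you\n\nJust do-it\nBelieve it\n"

# an empty string as separator splits the input into paragraphs
paragraphs = text.each_line("").to_a

# keep paragraphs containing 'it'
with_it = paragraphs.select { |para| para =~ /it/ }

# join the lines of each paragraph with '. '
joined = paragraphs.map { |para| para.split("\n").join(". ") }

p paragraphs
puts joined
```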

* multi-character separator

```bash
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ # number of records, same as: perl -lne 'BEGIN{$/="Error:"} print $. if eof'
$ ruby -ne 'BEGIN{$/="Error:"}; puts $. if $<.eof' report.log
3
$ # print first record, same as: perl -lne 'BEGIN{$/="Error:"} print if $.==1'
$ ruby -lne 'BEGIN{$/="Error:"}; print if $.==1' report.log
blah blah

$ # print a record if it contains given string
$ # same as: perl -lne 'BEGIN{$/="Error:"} print "$/$_" if /surely/'
$ ruby -lne 'BEGIN{$/="Error:"}; print $/,$_ if /surely/' report.log
Error: something surely went wrong
some text
some more text
blah blah blah

```

* Joining lines based on specific end of line condition

```bash
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

$ # same as: perl -pe 'BEGIN{$/="-\n"} chomp' msg.txt
$ ruby -pe 'BEGIN{$/="-\n"}; chomp' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.
```
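The line joining example can be written as a script by splitting on `"-\n"` and chomping that separator from each record. A minimal sketch with the content of `msg.txt` as a string:

```ruby
msg = "Hello there.\nIt will rain to-\nday. Have a safe\n" \
      "and pleasant jou-\nrney.\n"

# records end with "-\n", except the last one
records = msg.each_line("-\n").to_a

# removing the separator joins the broken words
joined = records.map { |r| r.chomp("-\n") }.join

print joined
```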

<br>

#### <a name="output-record-separator"></a>Output record separator

* use `$\` to specify a different output record separator
    * applies to `print` but not `puts`

```bash
$ # note that despite not setting $\, output has newlines
$ # because the input record still has the input record separator
$ seq 3 | ruby -ne 'print'
1
2
3
$ # same as: perl -ne 'BEGIN{$\="\n"} print'
$ seq 3 | ruby -ne 'BEGIN{$\="\n"}; print'
1

2

3

$ seq 2 | ruby -ne 'BEGIN{$\="---\n"}; print'
1
---
2
---
```

* dynamically changing output record separator
* **Note:** except `nil` and `false`, all other values evaluate to `true`
    * `0`, empty string/array/etc evaluate to `true`

```bash
$ # note the use of -l to chomp the input record separator
$ # same as: perl -lpe '$\ = $.%2 ? " " : "\n"'
$ seq 6 | ruby -lpe '$\ = $.%2!=0 ? " " : "\n"'
1 2
3 4
5 6

$ # -l also sets the output record separator
$ # but gets overridden by $\
$ # same as: perl -lpe '$\ = $.%3 ? "-" : "\n"'
$ seq 6 | ruby -lpe '$\ = $.%3!=0 ? "-" : "\n"'
1-2-3
4-5-6
```
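The note about truthiness explains why `$.%2!=0` is needed instead of just `$.%2` as in `perl`. A short sketch:

```ruby
# only nil and false are falsy, 0 and empty objects are truthy
truthy = [0, "", [], nil, false].map { |value| value ? true : false }

# so the result of % has to be compared explicitly
out = (1..6).map { |num| num.to_s + (num % 2 != 0 ? " " : "\n") }.join

p truthy
print out
```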

<br>

## <a name="multiline-processing"></a>Multiline processing

* Processing consecutive lines
* to keep the one-liner short, global variables(`$` prefix) are used here
    * See [ruby-doc: Global variables](https://ruby-doc.org/core-2.5.0/doc/syntax/assignment_rdoc.html#label-Global+Variables) for syntax details

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # match two consecutive lines
$ # same as: perl -ne 'print $p,$_ if /is/ && $p=~/are/; $p=$_' poem.txt
$ ruby -ne 'print $p,$_ if /is/ && $p=~/are/; $p=$_' poem.txt
Violets are blue,
Sugar is sweet,
$ # if only the second line is needed
$ ruby -ne 'print if /is/ && $p=~/are/; $p=$_' poem.txt
Sugar is sweet,

$ # print if line matches a condition as well as condition for next 2 lines
$ ruby -ne 'print $p2 if /is/ && $p1=~/blue/ && $p2=~/red/;
            $p2=$p1; $p1=$_' poem.txt
Roses are red,
```
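When all the lines are already in memory, `each_cons` is another way to look at consecutive lines without saving the previous line in a variable. This is an alternative technique, not a translation of the one-liners above.

```ruby
poem = ["Roses are red,", "Violets are blue,",
        "Sugar is sweet,", "And so are you."]

# each_cons(2) gives every pair of adjacent lines
pairs = poem.each_cons(2).select do |prev, cur|
  prev =~ /are/ && cur =~ /is/
end

puts pairs
```

The one-liner approach reads one line at a time, so it is better suited for large inputs.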

Consider this sample input file

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* extracting lines around matching line
* **Note**
    * uninitialized global variables default to `nil`, which has to be explicitly converted (for example, using `to_i`) for numeric comparison
    * no auto increment/decrement operators, can use `+=1` and `-=1`


```bash
$ ruby -le 'print $a'

$ ruby -le 'print $a.to_i'
0

$ # print matching line and n-1 lines following the matched line
$ # same as: perl -ne '$n=2 if /BEGIN/; print if $n && $n--' range.txt
$ # can also use: ruby -ne 'BEGIN{n=0}; n=2 if /BEGIN/; print if n>0 && n-=1'
$ ruby -ne '$n=2 if /BEGIN/; print if $n.to_i>0 && $n-=1' range.txt
BEGIN
1234
BEGIN
a

$ # print nth line after match
$ # same as: perl -ne 'print if $n && !--$n; $n=3 if /BEGIN/' range.txt
$ ruby -ne '$n.to_i>0 && (print if $n==1; $n-=1); $n=3 if /BEGIN/' range.txt
END
c

$ # use reversing trick for nth line before match
$ tac range.txt | ruby -ne '$n.to_i>0 && (print if $n==1; $n-=1); $n=3 if /END/' | tac
BEGIN
a
```
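The counter logic is easier to follow when written as a script with initialized local variables. A sketch with the content of `range.txt` as an array; the variable names are made up for illustration.

```ruby
lines = %w[foo BEGIN 1234 6789 END bar BEGIN a b c END baz]

# matching line and n-1 lines following it (n=2)
remaining = 0
with_next = []
lines.each do |line|
  remaining = 2 if line =~ /BEGIN/
  if remaining > 0
    with_next << line
    remaining -= 1
  end
end

# only the nth line after the matching line (n=3)
countdown = 0
nth = []
lines.each do |line|
  if countdown > 0
    nth << line if countdown == 1
    countdown -= 1
  end
  countdown = 3 if line =~ /BEGIN/
end

p with_next, nth
```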

**Further Reading**

* [softwareengineering - FSM examples](https://softwareengineering.stackexchange.com/questions/47806/examples-of-finite-state-machines)
* [wikipedia - FSM](https://en.wikipedia.org/wiki/Finite-state_machine)

<br>

## <a name="ruby-regular-expressions"></a>Ruby regular expressions

* assuming that you are already familiar with basics of regular expressions
    * if not, check out [Ruby Regexp](https://leanpub.com/rubyregexp) ebook - step by step guide from beginner to advanced levels
* examples/descriptions are for string containing ASCII characters only
* See [ruby-doc: Regexp](https://ruby-doc.org/core-2.5.0/Regexp.html) for documentation
* See [rexegg ruby](https://www.rexegg.com/regex-ruby.html) for a bit of ruby regexp history and differences with other regexp engines

<br>

#### <a name="gotchas-and-tricks"></a>gotchas and tricks

* input record separator being part of input record

```bash
$ # newline character gets replaced too as shown by shell prompt
$ echo 'foo:123:bar:789' | ruby -pe 'sub(/[^:]+$/, "xyz")'
foo:123:bar:xyz$
$ # simple workaround is to use -l option
$ echo 'foo:123:bar:789' | ruby -lpe 'sub(/[^:]+$/, "xyz")'
foo:123:bar:xyz

$ # of course it is useful too
$ # same as: perl -pe 's/\n/ : / if !eof'
$ seq 10 | ruby -pe 'sub(/\n/, " : ") if !$<.eof'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
```

* how much does `*` match?

```bash
$ # both empty and non-empty strings are matched
$ # even though * is a greedy quantifier
$ echo ',baz,,xyz,,,' | ruby -lpe 'gsub(/[^,]*/, "A")'
A,AA,A,AA,A,A,A
$ echo 'foo,baz,,xyz,,,123' | ruby -lpe 'gsub(/[^,]*/, "A")'
AA,AA,A,AA,A,A,AA

$ # one workaround is to use lookarounds(covered later)
$ echo ',baz,,xyz,,,' | ruby -lpe 'gsub(/(?<=^|,)[^,]*/, "A")'
A,A,A,A,A,A,A
$ echo 'foo,baz,,xyz,,,123' | ruby -lpe 'gsub(/(?<=^|,)[^,]*/, "A")'
A,A,A,A,A,A,A
```
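Using `scan` makes the empty matches visible. If only non-empty fields need to be changed, the `+` quantifier avoids the issue altogether. A small sketch:

```ruby
s = ",baz,,xyz,,,"

# empty strings are matched wherever [^,]* cannot consume a character
matches = s.scan(/[^,]*/)

# with +, only the non-empty fields are matched
non_empty = s.gsub(/[^,]+/, "A")

# lookbehind version changes empty fields too, one match per field
every_field = s.gsub(/(?<=^|,)[^,]*/, "A")

p matches
puts non_empty, every_field
```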

* difference between `^` and `\A`

```bash
$ # ^ matches start of line, not start of string
$ # same as: perl -00 -ne 'print if /^Believe/m' sample.txt
$ ruby -00 -ne 'print if /^Believe/' sample.txt
Just do-it
Believe it

$ ruby -00 -ne 'print if /^he/i' sample.txt
Hello World

Much ado about nothing
He he he

$ # \A matches start of string
$ # without m modifier, both ^ and \A will match start of string in perl
$ ruby -00 -ne 'print if /\Ahe/i' sample.txt
Hello World

$ # similarly, $ matches end of line
$ ruby -00 -ne 'print if /funny$/' sample.txt
Today is sunny
Not a bit funny
No doubt you like it too
```

* difference between `\z` and `\Z`

```bash
$ # \Z matches end of string, or just before a newline at end of string
$ seq 14 | ruby -ne 'print if /2\Z/'
2
12

$ # \z matches end of string
$ seq 14 | ruby -ne 'print if /2\z/'
$ seq 14 | ruby -ne 'print if /2\n\z/'
2
12

$ # without newline at end of line, both \z and \Z will behave same
$ seq 14 | ruby -lne 'print if /2\z/'
2
12
```
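The anchors can be compared directly using `match?`, which returns `true` or `false`. A minimal sketch with hard-coded strings:

```ruby
para = "Just do-it\nBelieve it\n"

# ^ matches at the start of any line in the string
caret = para.match?(/^Believe/)
# \A matches only at the start of the string
start_of_string = para.match?(/\ABelieve/)

line = "12\n"
# \Z allows a newline at the end of the string
upper_z = line.match?(/2\Z/)
# \z is strictly the end of the string
lower_z = line.match?(/2\z/)
lower_z_chomped = line.chomp.match?(/2\z/)

p caret, start_of_string, upper_z, lower_z, lower_z_chomped
```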

* delimiters and quoting
* from [ruby-doc: Percent Strings](https://ruby-doc.org/core-2.5.0/doc/syntax/literals_rdoc.html#label-Percent+Strings)

> If you are using “(”, “[”, “{”, “<” you must close it with “)”, “]”, “}”, “>” respectively. You may use most other non-alphanumeric characters for percent string delimiters such as “%”, “|”, “^”, etc.

```bash
$ # %r allows to use delimiter other than /
$ echo 'a/b' | ruby -pe 'sub(/a\/b/, "foo")'
foo
$ echo 'a/b' | ruby -pe 'sub(%r{a/b}, "foo")'
foo

$ # use %q (single quoting) to avoid variable interpolation
$ echo 'foo123' | ruby -pe 'a="huh?"; sub(/12/, "#{a}")'
foohuh?3
$ echo 'foo123' | ruby -pe 'a="huh?"; sub(/12/, %q/#{a}/)'
foo#{a}3

$ # %q also useful for backreferences, as \ is special inside double quotes
$ echo 'a a a 2 be be' | ruby -pe 'gsub(/\b(\w+)( \1)+\b/, "\\1")'
a 2 be
$ echo 'a a a 2 be be' | ruby -pe 'gsub(/\b(\w+)( \1)+\b/, %q/\1/)'
a 2 be
$ # and when double quotes is part of replacement string
$ echo '42,789' | ruby -lpe 'gsub(/\d+/, "\"\\0\"")'
"42","789"
$ echo '42,789' | ruby -lpe 'gsub(/\d+/, %q/"\0"/)'
"42","789"
$ # \& can also be used instead of \0
```
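In a script file, single quotes are available (on the command line they would clash with the shell quoting), and a block can be used to build the replacement string. A sketch showing that the different ways of writing a backreference give the same string:

```ruby
# these three literals are the same two-character string
same = ['\1', %q/\1/, "\\1"].uniq

dedup = "a a a 2 be be".gsub(/\b(\w+)( \1)+\b/, '\1')

# block form: the value of the block is used as the replacement
quoted = "42,789".gsub(/\d+/) { |m| '"' + m + '"' }

p same
puts dedup, quoted
```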

<br>

#### <a name="backslash-sequences"></a>Backslash sequences

* `\w` for `[A-Za-z0-9_]`
* `\d` for `[0-9]`
* `\s` for `[ \t\r\n\f\v]`
* `\h` for `[0-9a-fA-F]` or `[[:xdigit:]]`
* `\W`, `\D`, `\S`, `\H`, respectively for their opposites
* See also [ruby-doc: scan](https://ruby-doc.org/core-2.5.0/String.html#method-i-scan)

```bash
$ # same as: perl -ne 'print if /^[[:xdigit:]]+$/'
$ # can also use: ruby -lne 'print if !/\H/'
$ printf '128A\n34\nfe32\nfoo1\nbar\n' | ruby -ne 'print if /^\h+$/'
128A
34
fe32

$ # same as: perl -pe 's/\d+/xxx/g'
$ echo 'like 42 and 37' | ruby -pe 'gsub(/\d+/, "xxx")'
like xxx and xxx

$ # note again the use of -l because of newline in input record
$ # same as: perl -lpe 's/\D+/xxx/g'
$ echo 'like 42 and 37' | ruby -lpe 'gsub(/\D+/, "xxx")'
xxx42xxx37

$ # get all matches as an array
$ echo 'tea sea-pit sit' | ruby -ne 'puts $_.scan(/[\w\s]+/)'
tea sea
pit sit
```

<br>

#### <a name="non-greedy-quantifier"></a>Non-greedy quantifier

* adding a `?` to `?` or `*` or `+` or `{}` quantifiers will change matching from greedy to non-greedy, i.e. match as minimally as possible
    * also known as lazy quantifier

```bash
$ # greedy matching
$ echo 'foo and bar and baz land good' | ruby -lne 'print $_.scan(/.*and/)'
["foo and bar and baz land"]
$ # non-greedy matching
$ echo 'foo and bar and baz land good' | ruby -lne 'print $_.scan(/.*?and/)'
["foo and", " bar and", " baz land"]

$ echo '12342789' | ruby -pe 'sub(/\d{2,5}/, "")'
789
$ echo '12342789' | ruby -pe 'sub(/\d{2,5}?/, "")'
342789

$ # for single character, non-greedy is not always needed
$ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:.*?:/, ":")'
123:789:good:5:bad
$ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:[^:]*:/, ":")'
123:789:good:5:bad

$ # just like greedy, overall matching is considered, as minimal as possible
$ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:.*?:[a-z]/, ":")'
123:ood:5:bad
$ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:.*:[a-z]/, ":")'
123:ad
```
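To see only the first match for each case, a string can be indexed with a regexp. A short sketch:

```ruby
s = "foo and bar and baz land good"

# greedy: as much as possible, up to the last 'and'
greedy = s[/.*and/]
# non-greedy: as little as possible, up to the first 'and'
lazy = s[/.*?and/]

p greedy, lazy
```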

<br>

#### <a name="lookarounds"></a>Lookarounds

* Ability to add if conditions to match before/after required pattern
* There are four types
    * positive lookahead `(?=`
    * negative lookahead `(?!`
    * positive lookbehind `(?<=`
    * negative lookbehind `(?<!`
* One way to remember is that **behind** uses `<` and **negative** uses `!` instead of `=`

The strings matched by lookarounds are like word boundaries and anchors: they are not part of the matched string. They are termed as **zero-width patterns**

* positive lookbehind `(?<=`

```bash
$ s='foo=5, bar=3; x=83, y=120'

$ # extract all digit sequences, same as: perl -lne 'print join " ", /\d+/g'
$ echo "$s" | ruby -lne 'print $_.scan(/\d+/).join(" ")'
5 3 83 120

$ # extract digits only if preceded by two lowercase alphabets and =
$ # note how the characters matched by lookbehind aren't part of output
$ # same as: perl -lne 'print join " ", /(?<=[a-z]{2}=)\d+/g'
$ echo "$s" | ruby -lne 'print $_.scan(/(?<=[a-z]{2}=)\d+/).join(" ")'
5 3
$ # this can be done without lookbehind too
$ echo "$s" | ruby -lne 'print $_.scan(/[a-z]{2}=(\d+)/).join(" ")'
5 3

$ # change all digits preceded by single lowercase alphabet and =
$ # same as: perl -pe 's/(?<=\b[a-z]=)\d+/42/g'
$ echo "$s" | ruby -pe 'gsub(/(?<=\b[a-z]=)\d+/, "42")'
foo=5, bar=3; x=42, y=42
```
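One difference between the lookbehind and capture group versions is the shape of the array returned by `scan`. It doesn't show up in the `join` output, but matters in a script. A minimal sketch:

```ruby
s = "foo=5, bar=3; x=83, y=120"

# without capture groups, scan gives an array of strings
lookbehind = s.scan(/(?<=[a-z]{2}=)\d+/)

# with capture groups, scan gives an array of arrays
captured = s.scan(/[a-z]{2}=(\d+)/)

p lookbehind, captured, captured.flatten
```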

* positive lookahead `(?=`

```bash
$ s='foo=5, bar=3; x=83, y=120'

$ # extract digits that end with ,
$ # same as: perl -lne 'print join ":", /\d+(?=,)/g'
$ echo "$s" | ruby -lne 'print $_.scan(/\d+(?=,)/).join(":")'
5:83

$ # change all digits ending with ,
$ # same as: perl -pe 's/\d+(?=,)/42/g'
$ echo "$s" | ruby -pe 'gsub(/\d+(?=,)/, "42")'
foo=42, bar=3; x=42, y=120

$ # both lookbehind and lookahead
$ echo 'foo,,baz,,,xyz' | ruby -pe 'gsub(/,,/, ",NA,")'
foo,NA,baz,NA,,xyz
$ echo 'foo,,baz,,,xyz' | ruby -pe 'gsub(/(?<=,)(?=,)/, "NA")'
foo,NA,baz,NA,NA,xyz
```

* negative lookbehind `(?<!` and negative lookahead `(?!`

```bash
$ # change foo if not preceded by _
$ # note how 'foo' at start of line is matched as well
$ # same as: perl -pe 's/(?<!_)foo/baz/g'
$ echo 'foo _foo 1foo' | ruby -pe 'gsub(/(?<!_)foo/, "baz")'
baz _foo 1baz

$ # join each line in paragraph by replacing newline character
$ # except the one at end of paragraph
$ # same as: perl -00 -pe 's/\n(?!$)/. /g' sample.txt
$ ruby -00 -pe 'gsub(/\n(?!$)/, ". ")' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he
```

* capture groups can also be used inside lookarounds

```bash
$ # same as: perl -pe 's/(\H+\h+)(?=(\H+)\h)/$1$2\n/g'
$ # %q cannot be used here as \n is not meaningful inside single quotes
$ echo 'a b c d e' | ruby -lpe 'gsub(/(\S+\s+)(?=(\S+)\s)/, "\\1\\2\n")'
a b
b c
c d
d e
```

* `\K` helps as a workaround for some of the variable-length lookbehind cases
* See also [stackoverflow - Variable-length lookbehind-assertion alternatives](https://stackoverflow.com/questions/11640447/variable-length-lookbehind-assertion-alternatives-for-regular-expressions)

```bash
$ echo '1 and 2 and 3 land 4' | ruby -pe 'sub(/(?<=(and.*?){2})and/, "-")'
-e:1: invalid pattern in look-behind: /(?<=(and.*?){2})and/

$ # \K helps in such cases
$ # same as: sed 's/and/-/3' and perl -pe 's/(and.*?){2}\Kand/-/'
$ echo '1 and 2 and 3 land 4' | ruby -pe 'sub(/(and.*?){2}\Kand/, "-")'
1 and 2 and 3 l- 4
```

* don't use `\K` if there are consecutive matches
* this is because of how the regexp engine has been implemented; `\K` in `perl` and `\zs` in `vim` don't have this limitation

```bash
$ echo ',,' | perl -pe 's/,\K/foo/g'
,foo,foo
$ echo ',,' | ruby -pe 'gsub(/,\K/, "foo")'
,foo,
$ echo ',,' | ruby -pe 'gsub(/(?<=,)/, "foo")'
,foo,foo

$ # another example
$ echo '"foo","12,34","good"' | perl -F'/"\K,(?=")/' -lane 'print $F[1]'
"12,34"
$ echo '"foo","12,34","good"' | ruby -F'"\K,(?=")' -lane 'print $F[1]'
"12,34
$ echo '"foo","12,34","good"' | ruby -F'(?<="),(?=")' -lane 'print $F[1]'
"12,34"
```

<br>

#### <a name="special-capture-groups"></a>Special capture groups

* `\1`, `\2` etc match only the exact string that was captured by the group
* `\g<1>`, `\g<2>` etc re-use the regular expression of the group itself

```bash
$ s='baz 2008-03-24 and 2012-08-12 foo 2016-03-25'
$ # same as: perl -pe 's/(\d{4}-\d{2}-\d{2}) and (?1)/XYZ/'
$ echo "$s" | ruby -pe 'sub(/(\d{4}-\d{2}-\d{2}) and \g<1>/, "XYZ")'
baz XYZ foo 2016-03-25

$ # using \1 won't work as the two dates are different
$ echo "$s" | ruby -pe 'sub(/(\d{4}-\d{2}-\d{2}) and \1/, "")'
baz 2008-03-24 and 2012-08-12 foo 2016-03-25
```
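The same distinction applies to named groups: `\g<name>` re-applies the group's regular expression, whereas `\k<name>` needs the same text again. A minimal sketch, assuming a group name `date` and method names chosen only for illustration:

```ruby
# assumption: the group name `date` and the method names are illustrative

# \g<date> re-uses the regular expression of the group, dates may differ
ANY_DATE_PAIR  = /(?<date>\d{4}-\d{2}-\d{2}) and \g<date>/
# \k<date> matches only the exact text captured earlier
SAME_DATE_PAIR = /(?<date>\d{4}-\d{2}-\d{2}) and \k<date>/

def collapse_date_pair(line)
  line.sub(ANY_DATE_PAIR, "XYZ")
end

def collapse_same_date_pair(line)
  line.sub(SAME_DATE_PAIR, "XYZ")
end

s = "baz 2008-03-24 and 2012-08-12 foo 2016-03-25"
puts collapse_date_pair(s)        # baz XYZ foo 2016-03-25
puts collapse_same_date_pair(s)   # unchanged, the two dates differ
```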

* use `(?:` to group part of a regular expression without capturing it, so that the group isn't counted for backreferences
* See also [stackoverflow - what is non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do)

```bash
$ # using ?: helps to focus only on required capture groups
$ # same as: perl -pe 's/(?:co|fo)\K(\w)(\w)/$2$1/g'
$ echo 'cod1 foo_bar' | ruby -pe 'gsub(/(?:co|fo)\K(\w)(\w)/, %q/\2\1/)'
co1d fo_obar

$ # without ?: you'd need to remember all the other groups as well
$ echo 'cod1 foo_bar' | ruby -pe 'gsub(/(co|fo)\K(\w)(\w)/, %q/\3\2/)'
co1d fo_obar
```

* named capture groups `(?<name>` or `(?'name'`
* for backreference, use `\k<name>`
* named capture groups and normal (numbered) capture groups cannot be used together in the same regexp
    * when named groups are present, normal groups are treated as non-capturing

```bash
$ # same as: perl -pe 's/(?<fw>\w+) (?<sw>\w+)/$+{sw} $+{fw}/'
$ echo 'foo 123' | ruby -pe 'sub(/(?<fw>\w+) (?<sw>\w+)/, %q/\k<sw> \k<fw>/)'
123 foo

$ # also useful to transform different capture groups
$ s='"foo,bar",123,"x,y,z",42'
$ # same as: perl -lpe 's/"(?<a>[^"]+)",|(?<a>[^,]+),/$+{a}|/g'
$ echo "$s" | ruby -lpe 'gsub(/"(?<a>[^"]+)",|(?<a>[^,]+),/, %q/\k<a>|/)'
foo,bar|123|x,y,z|42
```
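The interaction between named and normal groups can be checked with `match` in a small script. A minimal sketch, where the method names are hypothetical and only there to keep the two cases apart:

```ruby
# hypothetical method names, for illustration only

# both groups named: access by name on the MatchData object
def first_second(line)
  m = /(?<fw>\w+) (?<sw>\w+)/.match(line)
  m && [m[:fw], m[:sw]]
end

# one named and one normal group: the normal group does not capture
def captures_when_mixed(line)
  m = /(?<fw>\w+) (\w+)/.match(line)
  m && m.captures
end

p first_second("foo 123")         # ["foo", "123"]
p captures_when_mixed("foo 123")  # ["foo"]
```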

**Further Reading**

* [rexegg - all the (? usages](https://www.rexegg.com/regex-disambiguation.html)
* [regular-expressions - recursion](https://www.regular-expressions.info/recurse.html#balanced)
* [stackoverflow - Recursive nested matching pairs of curly braces](https://stackoverflow.com/questions/19486686/recursive-nested-matching-pairs-of-curly-braces-in-ruby-regex)

<br>

#### <a name="modifiers"></a>Modifiers

* use `i` modifier to ignore case while matching

```bash
$ ruby -ne 'print if /rose/i' poem.txt
Roses are red,

$ echo 'foo 123 FoO' | ruby -pe 'gsub(/foo/i, "good")'
good 123 good
```

* by default, `.` doesn't match the newline character
* `m` modifier allows `.` metacharacter to match newline character as well

```bash
$ # searching for a match which can span across multiple lines

$ # no output as . doesn't match newline
$ ruby -00 -ne 'print if /do.*he/' sample.txt

$ # same as: perl -00 -ne 'print if /do.*he/s' sample.txt
$ ruby -00 -ne 'print if /do.*he/m' sample.txt
Much ado about nothing
He he he
```

<br>

#### <a name="code-in-replacement-section"></a>Code in replacement section

* block form allows `ruby` code to be used in the replacement section

quoting from [ruby-doc: gsub](https://ruby-doc.org/core-2.5.0/String.html#method-i-gsub)

>In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.

* `$1`, `$2`, etc are equivalent to `\1`, `\2`, etc
* `$&` is equivalent to `\&` (or `\0`), i.e. the entire matched string


```bash
$ # replace numbers with their squares, same as: perl -pe 's/\d+/$&**2/ge'
$ echo '4 and 10' | ruby -pe 'gsub(/\d+/){$&.to_i ** 2}'
16 and 100

$ # replace matched string with incremental value
$ # same as: perl -pe 's/\d+/++$c/ge'
$ echo '4 and 10 foo 57' | ruby -pe 'BEGIN{c=0}; gsub(/\d+/){c+=1}'
1 and 2 foo 3

$ # replace with string length, same as: perl -pe 's/\w+/length($&)/ge'
$ echo 'food:12:explain:789' | ruby -pe 'gsub(/\w+/){$&.length}'
4:2:7:3

$ # formatting string, same as: perl -lpe 's/[^-]+/sprintf "%04s", $&/ge'
$ echo 'a1-2-deed' | ruby -lpe 'gsub(/[^-]+/){ $&.rjust(4, "0") }'
00a1-0002-deed

$ # applying another substitution to matched string
$ # same as: perl -pe 's/"[^"]+"/$&=~s|a|A|gr/ge'
$ echo '"mango" and "guava"' | ruby -pe 'gsub(/"[^"]+"/){$&.gsub(/a/, "A")}'
"mAngo" and "guAvA"
```
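Inside a script, the block parameter can be used instead of `$&`, while `$1`, `$2`, etc remain available for capture groups. A minimal sketch with made-up method names and sample input:

```ruby
# hypothetical helpers, names chosen for illustration

def square_numbers(line)
  # the block parameter holds the same text as $&
  line.gsub(/\d+/) { |m| (m.to_i**2).to_s }
end

def reorder_date(line)
  # $1, $2, $3 refer to the capture groups of the current match
  line.gsub(/(\d{4})-(\d{2})-(\d{2})/) { "#{$3}/#{$2}/#{$1}" }
end

puts square_numbers("4 and 10")      # 16 and 100
puts reorder_date("on 2016-03-25")   # on 25/03/2016
```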

* replacing specific occurrence

```bash
$ # replacing 2nd occurrence, same as: sed 's/:/-/2'
$ # same as: perl -pe '$c=0; s/:/++$c==2 ? "-" : $&/ge'
$ echo 'foo:123:bar:baz' | ruby -pe 'c=0; gsub(/:/){(c+=1)==2 ? "-" : $&}'
foo:123-bar:baz
$ # or use non-greedy matching, same as: sed 's/and/-/3'
$ echo 'foo and bar and baz land good' | ruby -pe 'sub(/(and.*?){2}\Kand/, "-")'
foo and bar and baz l- good

$ # emulating GNU sed's number+g modifier
$ a='456:foo:123:bar:789:baz
x:y:z:a:v:xc:gf'
$ echo "$a" | sed 's/:/-/3g'
456:foo:123-bar-789-baz
x:y:z-a-v-xc-gf
$ # same as: perl -pe '$c=0; s/:/++$c<3 ? $& : "-"/ge'
$ echo "$a" | ruby -pe 'c=0; gsub(/:/){(c+=1)<3 ? $& : "-"}'
456:foo:123-bar-789-baz
x:y:z-a-v-xc-gf
```
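The counter technique shown above can be wrapped in small methods when it is needed repeatedly in a script. A minimal sketch, where `replace_nth` and `replace_from_nth` are hypothetical names and not part of `ruby` itself:

```ruby
# hypothetical helpers built on gsub with a counter

# replace only the nth occurrence, similar to sed 's/:/-/2'
def replace_nth(str, pattern, n, replacement)
  count = 0
  str.gsub(pattern) { |m| (count += 1) == n ? replacement : m }
end

# replace the nth and all later occurrences, similar to GNU sed 's/:/-/3g'
def replace_from_nth(str, pattern, n, replacement)
  count = 0
  str.gsub(pattern) { |m| (count += 1) < n ? m : replacement }
end

puts replace_nth("foo:123:bar:baz", /:/, 2, "-")
# foo:123-bar:baz
puts replace_from_nth("456:foo:123:bar:789:baz", /:/, 3, "-")
# 456:foo:123-bar-789-baz
```

The counter is local to each call, so every input string starts counting from the first occurrence.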

<br>

#### <a name="quoting-metacharacters"></a>Quoting metacharacters

* to match contents of string variable exactly, all metacharacters need to be escaped
* See [ruby-doc: Regexp.escape](https://ruby-doc.org/core-2.5.0/Regexp.html#method-c-escape) for syntax details

```bash
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # since + is a metacharacter, no match found
$ # note that #{} allows interpolation
$ s='a+b' ruby -ne 'print if /#{ENV["s"]}/' eqns.txt

$ # same as: s='a+b' perl -ne 'print if /\Q$ENV{s}/' eqns.txt
$ s='a+b' ruby -ne 'print if /#{Regexp.escape(ENV["s"])}/' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # use regexp as needed around variable content, for ex: end of line anchor
$ ruby -pe 'BEGIN{s="a+b"}; sub(/#{Regexp.escape(s)}$/, "a**b")' eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a**b
```
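The effect of `Regexp.escape` is easier to see in a script where the escaped string can be printed. A minimal sketch using the contents of `eqns.txt` as an inline array (the method names are assumptions for illustration):

```ruby
# contents of eqns.txt, kept inline so that the sketch is self-contained
EQNS = ["a=b,a-b=c,c*d", "a+b,pi=3.14,5e12", "i*(t+9-g)/8,4-a+b"]

# hypothetical helper: lines containing the given text literally
def lines_containing(lines, text)
  lines.grep(/#{Regexp.escape(text)}/)
end

# hypothetical helper: lines ending with the given text literally
def lines_ending_with(lines, text)
  lines.grep(/#{Regexp.escape(text)}$/)
end

puts Regexp.escape("a+b")            # a\+b
puts lines_containing(EQNS, "a+b")   # 2nd and 3rd lines
puts lines_ending_with(EQNS, "a+b")  # 3rd line only
```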

<br>

## <a name="two-file-processing"></a>Two file processing

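First, a bit about `ARGV`, which helps to keep track of which file is being processed. `ARGV` holds the filenames yet to be opened, so its length decreases by one every time the next file starts getting processed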

```bash
$ # similar to: perl -lne 'print $#ARGV' <(seq 2) <(seq 3) <(seq 1)
$ ruby -ne 'puts ARGV.length' <(seq 2) <(seq 3) <(seq 1)
2
2
1
1
1
0
```

<br>

#### <a name="comparing-whole-lines"></a>Comparing whole lines

Consider the following test files

```bash
$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow

$ cat colors_2.txt
Black
Blue
Green
Red
White
```

* `-r` command line option allows specifying the library to be loaded
    * the `include?` method checks whether the `set` already contains the element
    * See [ruby-doc: include?](https://ruby-doc.org/stdlib-2.5.0/libdoc/set/rdoc/Set.html#method-i-include-3F) for syntax details

```bash
$ # common lines
$ # note that all duplicates matching in second file would get printed
$ # same as: perl -ne 'if(!$#ARGV){$h{$_}=1; next}
$ #            print if $h{$_}' colors_1.txt colors_2.txt
$ ruby -rset -ne 'BEGIN{s=Set.new}; s.add($_) && next if ARGV.length==1;
                  print if s.include?($_)' colors_1.txt colors_2.txt
Blue
Red

$ # lines from colors_2.txt not present in colors_1.txt
$ ruby -rset -ne 'BEGIN{s=Set.new}; s.add($_) && next if ARGV.length==1;
                  print if !s.include?($_)' colors_1.txt colors_2.txt
Black
Green
White

$ # next - to skip rest of code and process next input line
$ # here used to skip rest of code as long as first file is being processed
$ # alternate: ARGV.length==1 ? s.add($_) : s.include?($_) && print
```

alternate solution by using set operations available for arrays

* [ruby-doc: ARGF](https://ruby-doc.org/core-2.5.0/ARGF.html) filehandle allows reading from the filename arguments supplied to the script
    * if filename arguments are not present, it would act upon stdin
* `STDIN` filehandle allows reading from stdin
* [ruby-doc: readlines](https://ruby-doc.org/core-2.5.0/IO.html#method-c-readlines) method reads all the lines as an array
    * if filehandle is not specified, default is ARGF
* some comparison notes
    * both files will get saved as arrays in memory here, while the previous solution would save only the first file
    * duplicates would get removed here by `&` and `|`, but `-` keeps duplicates from the array on its left
    * likely to be faster compared to previous solution

```bash
$ # note that -n/-p options are not used
$ # and puts is helpful here as record separator is newline character

$ # common lines, output order is based on array to left of & operator
$ ruby -e 'f1=STDIN.readlines; f2=readlines;
           puts f1 & f2' <colors_1.txt colors_2.txt
Blue
Red

$ # lines from colors_2.txt not present in colors_1.txt
$ ruby -e 'f1=STDIN.readlines; f2=readlines;
           puts f2 - f1' <colors_1.txt colors_2.txt
Black
Green
White

$ # for union, use either of these
$ # ruby -e 'f1=STDIN.readlines; f2=readlines;
$ #          puts f1 | f2' <colors_1.txt colors_2.txt
$ # ruby -e 'puts readlines.uniq' colors_1.txt colors_2.txt
```
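How duplicates are treated by the array operators can be checked with small inline arrays. A minimal sketch, with method names and sample data made up for illustration:

```ruby
# hypothetical helpers to compare the two approaches

def common_unique(first, second)
  first & second   # duplicates removed, order follows `first`
end

def common_keep_duplicates(first, second)
  # every element of `second` that is also in `first`, duplicates retained
  second.select { |line| first.include?(line) }
end

f1 = %w[Blue Brown Red]
f2 = %w[Black Blue Red Blue]

p common_unique(f2, f1)            # ["Blue", "Red"]
p common_keep_duplicates(f1, f2)   # ["Blue", "Red", "Blue"]
p %w[Black Blue Black] - %w[Blue]  # ["Black", "Black"]
```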

<br>

#### <a name="comparing-specific-fields"></a>Comparing specific fields

Consider the sample input file

```bash
$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
```

* single field
* For ex: only first field comparison instead of entire line as key

```bash
$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ ruby -rset -ane 'BEGIN{s=Set.new}; s.add($F[0]) && next if ARGV.length==1;
                   print if s.include?($F[0])' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
```

* multiple field comparison

```bash
$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # $F[0..1] will return array with elements specified by range (0 to 1 here)
$ ruby -rset -ane 'BEGIN{s=Set.new}; s.add($F[0..1]) && next if ARGV.length==1;
                   print if s.include?($F[0..1])' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
```

* field and value comparison
* here, we use [hash](https://ruby-doc.org/core-2.5.0/Hash.html) as well to save values based on a key

```bash
$ cat list3
ECE 70
EEE 65
CSE 80

$ # extract line matching Dept and minimum marks specified in list3
$ # to_i is needed for numeric comparison as fields are strings by default
$ ruby -rset -ane 'BEGIN{d=Set.new; m={}};
                   (d.add($F[0]); m[$F[0]]=$F[1].to_i) && next if ARGV.length==1;
                   print if d.include?($F[0]) && $F[2].to_i>=m[$F[0]]' list3 marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
```
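The field and value comparison can also be written with the hash alone, since `key?` tells whether a department is listed. A minimal sketch with the data kept inline; the method name `with_minimum_marks` is an assumption for illustration:

```ruby
# rows from marks.txt (header omitted), kept inline for this sketch
MARKS = [%w[ECE Raj 53], %w[ECE Joel 72], %w[EEE Moi 68], %w[CSE Surya 81],
         %w[EEE Tia 59], %w[ECE Om 92], %w[CSE Amy 67]]

# hypothetical helper: rows whose Dept is listed and whose marks
# are at least the minimum given for that Dept
def with_minimum_marks(rows, minimums)
  rows.select do |dept, _name, marks|
    minimums.key?(dept) && marks.to_i >= minimums[dept]
  end
end

minimums = { "ECE" => 70, "EEE" => 65, "CSE" => 80 }
with_minimum_marks(MARKS, minimums).each { |row| puts row * "\t" }

# string comparison is character based, hence the to_i conversion
puts "9" >= "70"   # true
puts 9 >= 70       # false
```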

<br>

#### <a name="line-number-matching"></a>Line number matching

```bash
$ # replace mth line in poem.txt with nth line from list1
$ # same as: m=3 n=2 perl -pe 'BEGIN{ $s=<> while $ENV{n}-- > 0; close ARGV}
$ #                    $_=$s if $.==$ENV{m}' list1 poem.txt
$ m=3 n=2 ruby -pe 'BEGIN{ENV["n"].to_i.times { $s=gets }; ARGF.close };
                    $_=$s if $.==ENV["m"].to_i' list1 poem.txt
Roses are red,
Violets are blue,
CSE
And so are you.

$ # print line from fruits.txt if corresponding line from nums.txt is a positive number
$ # same as: <nums.txt perl -ne 'print if <STDIN> > 0' fruits.txt
$ # line from fruits.txt is saved first as STDIN.gets will also set $_
$ <nums.txt ruby -ne 'ln=$_; print ln if STDIN.gets.to_i>0' fruits.txt
fruit   qty
banana  31
$ # can also use:
$ # ruby -e 'STDIN.readlines.zip(readlines).each {|a| puts a[1] if a[0].to_i>0}'
```

For syntax and implementation details, see

* [ruby-doc: ARGF](https://ruby-doc.org/core-2.5.0/ARGF.html)
* [ruby-doc: times](https://ruby-doc.org/core-2.5.0/Integer.html#method-i-times)
* [ruby-doc: gets](https://ruby-doc.org/core-2.5.0/IO.html#method-i-gets)

<br>

## <a name="creating-new-fields"></a>Creating new fields

* See [ruby-doc: slice](https://ruby-doc.org/core-2.5.0/Array.html#method-i-slice) for syntax details

```bash
$ s='foo,bar,123,baz'

$ # to reduce fields, use slice! method, which removes the given elements in-place
$ # same as: echo "$s" | perl -F, -lane '$,=","; $#F=1; print @F'
$ # 1st arg - starting index, 2nd arg - number of elements
$ echo "$s" | ruby -F, -lane '$F.slice!(-2,2); print $F * ","'
foo,bar

$ # assigning to an index beyond the array length will create empty fields as needed
$ # same as: echo "$s" | perl -F, -lane '$,=","; $F[6]=42; print @F'
$ echo "$s" | ruby -F, -lane '$F[6]=42; print $F * ","'
foo,bar,123,baz,,,42
```

* adding a field based on existing fields
* See [ruby-doc: Percent Strings](https://ruby-doc.org/core-2.5.0/doc/syntax/literals_rdoc.html#label-Percent+Strings) for details on `%w`

```bash
$ # adding a new 'Grade' field
$ # same as: perl -lane 'BEGIN{$,="\t"; @g = qw(D C B A S)}
$ #          push @F, $.==1 ? "Grade" : $g[$F[-1]/10 - 5]; print @F' marks.txt
$ ruby -lane 'BEGIN{g = %w[D C B A S]};
              $F.push($.==1 ? "Grade" : g[$F[-1].to_i/10 - 5]);
              print $F * "\t"' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C
```
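The index calculation `marks/10 - 5` relies on marks being in the 50 to 99 range: a value below 50 gives a negative index, which counts from the end of the array. A minimal sketch that guards against this; clamping out-of-range marks to the lowest/highest grade is an assumption made here for illustration, not something the original example does:

```ruby
GRADES = %w[D C B A S]

# hypothetical helper: maps marks (string or integer) to a grade
# assumption: marks outside 50..99 are clamped to the first/last grade
def grade(marks)
  idx = (marks.to_i / 10 - 5).clamp(0, GRADES.size - 1)
  GRADES[idx]
end

puts grade("53")   # D
puts grade("72")   # B
puts grade("92")   # S
puts grade("45")   # D (without clamp, index -1 would give S)
```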

<br>

## <a name="multiple-file-input"></a>Multiple file input

* processing based on line-number/begin/end of each input file

```bash
$ # same as: perl -ne 'print if $.==2; close ARGV if eof'
$ # ARGF.close will reset $. to 0
$ ruby -ne 'print if $.==2; ARGF.close if $<.eof' poem.txt greeting.txt
Violets are blue,
Have a safe journey

$ # same as: perl -lne 'print "file: $ARGV" if $.==1;
$ #            print "$_\n------" and close ARGV if eof' poem.txt greeting.txt
$ ruby -lne 'print "file: #{ARGF.filename}" if $.==1;
             (print "#{$_}\n------"; ARGF.close) if $<.eof' poem.txt greeting.txt
file: poem.txt
And so are you.
------
file: greeting.txt
Have a safe journey
------
```

* to skip remaining lines from current file being processed and move on to next file

```bash
$ # same as: perl -pe 'close ARGV if $.>=1' poem.txt greeting.txt fruits.txt
$ ruby -pe 'ARGF.close if $.>=1' poem.txt greeting.txt fruits.txt
Roses are red,
Hello there
fruit   qty

$ # same as: perl -lane 'print $ARGV and close ARGV if $F[0] =~ /red/i' *
$ ruby -ane '(puts ARGF.filename; ARGF.close) if $F[0] =~ /red/i' *
colors_1.txt
colors_2.txt
```

<br>

## <a name="dealing-with-duplicates"></a>Dealing with duplicates

* retain only first copy of duplicates
* `-r` command line option allows specifying the library to be loaded
* here, `set` data type is used to keep track of unique values - be it whole line or a particular field
    * the `add?` method adds the element to the `set`, and returns `nil` if the element already exists
    * See [ruby-doc: add?](https://ruby-doc.org/stdlib-2.5.0/libdoc/set/rdoc/Set.html#method-i-add-3F) for syntax details

```bash
$ cat duplicates.txt
abc  7   4
food toy ****
abc  7   4
test toy 123
good toy ****

$ # whole line, same as: perl -ne 'print if !$seen{$_}++' duplicates.txt
$ ruby -rset -ne 'BEGIN{s=Set.new}; print if s.add?($_)' duplicates.txt
abc  7   4
food toy ****
test toy 123
good toy ****

$ # particular column, same as: perl -ane 'print if !$seen{$F[1]}++'
$ ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1])' duplicates.txt
abc  7   4
food toy ****

$ # total count, same as: perl -lane '$c++ if !$seen{$F[1]}++; END{print $c}'
$ ruby -rset -ane 'BEGIN{s=Set.new}; s.add($F[1]);
                   END{puts s.length}' duplicates.txt
2
```

* multiple fields

```bash
$ # same as: perl -ane 'print if !$seen{$F[1],$F[2]}++' duplicates.txt
$ # $F[1..2] will return an array with fields 2 and 3 as elements
$ ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1..2])' duplicates.txt
abc  7   4
food toy ****
test toy 123
```

* retaining only last copy of duplicate

```bash
$ # reverse the input line-wise, retain first copy and then reverse again
$ # same as: tac duplicates.txt | perl -ane 'print if !$seen{$F[1]}++' | tac
$ tac duplicates.txt | ruby -rset -ane 'BEGIN{s=Set.new};
                       print if s.add?($F[1])' | tac
abc  7   4
good toy ****
```

* for count based filtering (other than first/last count), use a `hash`
* `Hash.new(0)` will initialize value of new key to `0`

```bash
$ # second occurrence of duplicate
$ # same as: perl -ane 'print if ++$h{$F[1]}==2' duplicates.txt
$ ruby -ane 'BEGIN{h=Hash.new(0)}; print if (h[$F[1]]+=1)==2' duplicates.txt
abc  7   4
test toy 123

$ # third occurrence of duplicate
$ # same as: perl -ane 'print if ++$h{$F[1]}==3' duplicates.txt
$ ruby -ane 'BEGIN{h=Hash.new(0)}; print if (h[$F[1]]+=1)==3' duplicates.txt
good toy ****
```

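* filtering based on duplicate count
* the input file is passed twice: the first pass counts the occurrences, the second pass prints lines based on the count
* allows emulating the [uniq](./sorting_stuff.md#uniq) command for specific fields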

```bash
$ # all duplicates based on 1st column
$ # same as: perl -ane '!$#ARGV ? $x{$F[0]}++ : $x{$F[0]}>1 && print'
$ ruby -ane 'BEGIN{h=Hash.new(0)}; ARGV.length==1 ? h[$F[0]]+=1 :
              h[$F[0]]>1 && print' duplicates.txt duplicates.txt
abc  7   4
abc  7   4

$ # more than 2 duplicates based on 2nd column
$ ruby -ane 'BEGIN{h=Hash.new(0)}; ARGV.length==1 ? h[$F[1]]+=1 :
              h[$F[1]]>2 && print' duplicates.txt duplicates.txt
food toy ****
test toy 123
good toy ****

$ # only unique lines based on 3rd column
$ ruby -ane 'BEGIN{h=Hash.new(0)}; ARGV.length==1 ? h[$F[2]]+=1 :
              h[$F[2]]==1 && print' duplicates.txt duplicates.txt
test toy 123
```
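The two-pass idea (count first, then filter) looks like this in a script that already has the lines in memory. A minimal sketch with the contents of `duplicates.txt` inline; `field_counts` and `select_by_count` are hypothetical names:

```ruby
ROWS = ["abc  7   4", "food toy ****", "abc  7   4",
        "test toy 123", "good toy ****"]

# first pass: number of occurrences of each value in the given column
def field_counts(lines, col)
  lines.each_with_object(Hash.new(0)) { |line, h| h[line.split[col]] += 1 }
end

# second pass: keep lines whose count satisfies the given block
def select_by_count(lines, col)
  counts = field_counts(lines, col)
  lines.select { |line| yield counts[line.split[col]] }
end

puts select_by_count(ROWS, 0) { |c| c > 1 }    # both 'abc' lines
puts select_by_count(ROWS, 2) { |c| c == 1 }   # test toy 123
```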

<br>

#### <a name="using-uniq-method"></a>using uniq method

* [ruby-doc: uniq](https://ruby-doc.org/core-2.5.0/Array.html#method-i-uniq)
* original order is maintained

```bash
$ # same as: ruby -rset -ne 'BEGIN{s=Set.new}; print if s.add?($_)'
$ ruby -e 'puts readlines.uniq' duplicates.txt
abc  7   4
food toy ****
test toy 123
good toy ****

$ # same as: ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1])'
$ ruby -e 'puts readlines.uniq {|s| s.split[1]}' duplicates.txt
abc  7   4
food toy ****

$ # same as: ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1..2])'
$ ruby -e 'puts readlines.uniq {|s| s.split[1..2]}' duplicates.txt
abc  7   4
food toy ****
test toy 123
```
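Combining `uniq` with `reverse` gives one possible way to retain the last copy of duplicates without `tac`. A minimal sketch with the contents of `duplicates.txt` inline; the method name `keep_last_by_field` is made up for illustration:

```ruby
# hypothetical helper: keep only the last line for each value of a column
def keep_last_by_field(lines, col)
  # reverse, retain first copy, then reverse again to restore the order
  lines.reverse.uniq { |line| line.split[col] }.reverse
end

lines = ["abc  7   4", "food toy ****", "abc  7   4",
         "test toy 123", "good toy ****"]
puts keep_last_by_field(lines, 1)
# abc  7   4
# good toy ****
```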

<br>

## <a name="lines-between-two-regexps"></a>Lines between two REGEXPs

* This section deals with filtering lines bound by two *REGEXP*s (referred to as blocks)
* For simplicity the two *REGEXP*s usually used in below examples are the strings **BEGIN** and **END**

<br>

#### <a name="all-unbroken-blocks"></a>All unbroken blocks

Consider the below sample input file, which doesn't have any broken blocks (i.e **BEGIN** and **END** are always present in pairs)

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* Extracting lines between starting and ending *REGEXP*

```bash
$ # include both starting/ending REGEXP
$ # same as: perl -ne '$f=1 if /BEGIN/; print if $f; $f=0 if /END/'
$ ruby -ne '$f=1 if /BEGIN/; print if $f==1; $f=0 if /END/' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # can also use: ruby -ne 'print if /BEGIN/../END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
```

* other variations

```bash
$ # exclude both starting/ending REGEXP
$ # same as: perl -ne '$f=0 if /END/; print if $f; $f=1 if /BEGIN/'
$ ruby -ne '$f=0 if /END/; print if $f==1; $f=1 if /BEGIN/' range.txt
1234
6789
a
b
c

$ # check out what these do:
$ ruby -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if $f==1' range.txt
$ ruby -ne 'print if $f==1; $f=0 if /END/; $f=1 if /BEGIN/' range.txt
```

* Extracting lines other than lines between the two *REGEXP*s

```bash
$ # same as: perl -ne '$f=1 if /BEGIN/; print if !$f; $f=0 if /END/'
$ # can also use: ruby -ne 'print if !(/BEGIN/../END/)' range.txt
$ ruby -ne '$f=1 if /BEGIN/; print if $f!=1; $f=0 if /END/' range.txt
foo
bar
baz

$ # the other three cases would be
$ ruby -ne '$f=0 if /END/; print if $f!=1; $f=1 if /BEGIN/' range.txt
$ ruby -ne 'print if $f!=1; $f=1 if /BEGIN/; $f=0 if /END/' range.txt
$ ruby -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if $f!=1' range.txt
```
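The flag based logic can be collected into a method for use in scripts. This is only a sketch under the assumption that blocks are unbroken, as in `range.txt`; the method name and the `inclusive` keyword are made up for illustration, and the flag is a boolean here instead of `1`/`0`:

```ruby
# hypothetical helper: returns the lines that fall inside start/end pairs
# assumption: every start marker has a matching end marker
def lines_in_blocks(lines, start_re, end_re, inclusive: true)
  inside = false
  lines.each_with_object([]) do |line, out|
    if !inside && line =~ start_re
      inside = true
      out << line if inclusive
    elsif inside && line =~ end_re
      inside = false
      out << line if inclusive
    elsif inside
      out << line
    end
  end
end

# contents of range.txt as an array
range = %w[foo BEGIN 1234 6789 END bar BEGIN a b c END baz]
puts lines_in_blocks(range, /BEGIN/, /END/, inclusive: false)
```

With `inclusive: false` the marker lines are left out, which corresponds to the variation that excludes both starting/ending *REGEXP*.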

<br>

#### <a name="specific-blocks"></a>Specific blocks

* Getting first block

```bash
$ # same as: perl -ne '$f=1 if /BEGIN/; print if $f; exit if /END/'
$ ruby -ne '$f=1 if /BEGIN/; print if $f==1; exit if /END/' range.txt
BEGIN
1234
6789
END

$ # use other tricks discussed in previous section as needed
$ ruby -ne 'exit if /END/; print if $f==1; $f=1 if /BEGIN/' range.txt
1234
6789
```

* Getting last block

```bash
$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ tac range.txt | ruby -ne '$f=1 if /END/; print if $f==1; exit if /BEGIN/' | tac
BEGIN
a
b
c
END

$ # or, save the blocks in a buffer and print the last one alone
$ # same as: seq 30 | perl -ne 'if(/4/){$f=1; $b=$_; next}
$ #                     $b.=$_ if $f; $f=0 if /6/; END{print $b}'
$ # << operator concatenates given string to the variable in-place
$ seq 30 | ruby -ne '($f=1; $b=$_) && next if /4/;
                     $b << $_ if $f==1; $f=0 if /6/; END{print $b}'
24
25
26
```

* Getting blocks based on a counter

```bash
$ # get only 2nd block
$ # same as: b=2 perl -ne '$c++ if /4/; if($c==$ENV{b}){print; exit if /6/}'
$ seq 30 | b=2 ruby -ne 'BEGIN{c=0}; c+=1 if /4/;
                         c==ENV["b"].to_i && (print; exit if /6/)'
14
15
16

$ # to get all blocks after the first 'b' blocks
$ seq 30 | b=1 ruby -ne 'BEGIN{c=0}; ($f=1; c+=1) if /4/;
                         print if $f==1 && c>ENV["b"].to_i; $f=0 if /6/'
14
15
16
24
25
26
```

* excluding a particular block

```bash
$ # excludes 2nd block
$ seq 30 | b=2 ruby -ne 'BEGIN{c=0}; ($f=1; c+=1) if /4/;
                         print if $f==1 && c!=ENV["b"].to_i; $f=0 if /6/'
4
5
6
24
25
26
```

* extract block only if it matches another string as well

```bash
$ # string to match inside block: 23
$ # same as: perl -ne 'if(/BEGIN/){$f=1; $m=0; $b=""}; $m=1 if $f && /23/;
$ #            $b.=$_ if $f; if(/END/){print $b if $m; $f=0}' range.txt
$ ruby -ne '($f=1; $m=0; $b="") if /BEGIN/; $m=1 if $f==1 && /23/;
            $b<<$_ if $f==1; (print $b if $m==1; $f=0) if /END/' range.txt
BEGIN
1234
6789
END

$ # line to match inside block: 5 or 25
$ seq 30 | ruby -ne '($f=1; $m=0; $b="") if /4/; $m=1 if $f==1 && /^2?5$/;
                     $b<<$_ if $f==1; (print $b if $m==1; $f=0) if /6/'
4
5
6
24
25
26
```

<br>

#### <a name="broken-blocks"></a>Broken blocks

* If there are blocks with ending *REGEXP* but without corresponding start, earlier techniques used will suffice
* Consider the modified input file where starting *REGEXP* doesn't have corresponding ending

```bash
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz

$ # the file reversing trick comes in handy here as well
$ tac broken_range.txt | ruby -ne '$f=1 if /END/;
                         print if $f==1; $f=0 if /BEGIN/' | tac
BEGIN
1234
6789
END
```

* But if both kinds of broken blocks are present, for ex:

```bash
$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
xyzabc
```

then use buffers to accumulate the records and print accordingly

```bash
$ # same as: perl -ne 'if(/BEGIN/){$f=1; $b=$_; next} $b.=$_ if $f;
$ #            if(/END/){$f=0; print $b if $b; $b=""}' multiple_broken.txt
$ ruby -ne '($f=1; $b=$_) && next if /BEGIN/; $b << $_ if $f==1;
            ($f=0; print $b if $b!=""; $b="") if /END/' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END

$ # note how buffer is initialized as well as cleared
$ # on matching beginning/end REGEXPs respectively
```
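The buffer technique can be sketched as a method that returns each complete block as an array. The method name `complete_blocks` and its default arguments are assumptions for illustration; the input is the content of `multiple_broken.txt` as an array:

```ruby
# hypothetical helper: collect only blocks having both start and end markers
def complete_blocks(lines, start_re = /BEGIN/, end_re = /END/)
  blocks = []
  buffer = nil
  lines.each do |line|
    if line =~ start_re
      buffer = [line]      # (re)start the buffer, dropping any unfinished block
    elsif buffer
      buffer << line
      if line =~ end_re
        blocks << buffer   # block is complete, save it
        buffer = nil       # lines until next start marker are ignored
      end
    end
  end
  blocks
end

input = %w[qqqqqqq BEGIN foo BEGIN 1234 6789 END bar END 0-42-1
           BEGIN a BEGIN b END xyzabc]
complete_blocks(input).each { |b| puts b }
```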

<br>

## <a name="array-operations"></a>Array operations

See [ruby-doc: Array](https://ruby-doc.org/core-2.5.0/Array.html) for various ways to initialize and methods available

* initialization

```bash
$ # as comma separated values, indexing starts at 0
$ ruby -le 'sq = [1, 4, 9, 16]; print sq[2]'
9
$ ruby -le 'a = [123, "foo", "baz789"]; print a[1]'
foo
$ # negative indexing, -1 for last element, -2 for second last, etc
$ ruby -le 'foo = [2, "baz", ["a", "b"]]; print foo[-1]'
["a", "b"]

$ # variables can be used, double quoted strings allow escape sequences and interpolation
$ ruby -le 'a=5; b=["a", "b"]; c=[a, 789, b]; print c'
[5, 789, ["a", "b"]]
$ ruby -le 'c=[89, "a\nb"]; print c[-1]'
a
b

$ # %w allows space separated string values, no interpolation
$ ruby -le 'b = %w[123 foo baz789]; print b[1]'
foo
$ ruby -le 's = %w[foo "baz" "a\nb"]; print s[-1]'
"a\nb"
```

* array slices
* See also [ruby-doc: Array to Arguments Conversion](https://ruby-doc.org/core-2.5.0/doc/syntax/calling_methods_rdoc.html#label-Array+to+Arguments+Conversion)

```bash
$ # accessing more than one element in arbitrary order
$ echo 'a b c d' | ruby -lane 'print $F.values_at(0,-1,2) * " "'
a d c
$ echo 'a b c d' | ruby -lane 'i=[0, -1, 2]; print $F.values_at(*i) * " "'
a d c

$ # starting index and number of elements needed from that index
$ echo 'a b c d' | ruby -lane 'print $F[0,3] * " "'
a b c
$ # range operator, arguments are start/end indexes
$ echo 'a b c d' | ruby -lane 'print $F[1..3] * " "'
b c d

$ # n elements from start, can also use 'first' method instead of 'take'
$ echo 'a b c d' | ruby -lane 'print $F.take(2) * " "'
a b
$ # remaining elements after ignoring n elements from start
$ echo 'a b c d' | ruby -lane 'print $F.drop(3) * " "'
d
$ # n elements from end
$ echo 'a b c d' | ruby -lane 'print $F.last(3) * " "'
b c d
```

* looping

```bash
$ # by element value, use 'reverse_each' to iterate in reversed order
$ # can also use range here: ruby -e '(1..4).each {|n| puts n*2}'
$ ruby -e 'nums=[1, 2, 3, 4]; nums.each {|n| puts n*2}'
2
4
6
8

$ # by index
$ ruby -e 'books=%w[Elantris Martian Dune Alchemist]
           books.each_index {|i| puts "#{i+1}) #{books[i]}"}'
1) Elantris
2) Martian
3) Dune
4) Alchemist
```
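When both the element and its index are needed, `each_with_index` (or `with_index` with a starting offset) avoids indexing into the array by hand. A minimal sketch, with method names chosen only for illustration:

```ruby
# hypothetical helpers producing a numbered list

def numbered(items)
  # index starts at 0, hence the i + 1
  items.each_with_index.map { |item, i| "#{i + 1}) #{item}" }
end

def numbered_from(items, start)
  # with_index accepts the starting value of the counter
  items.each.with_index(start).map { |item, n| "#{n}) #{item}" }
end

books = %w[Elantris Martian Dune Alchemist]
puts numbered(books)
puts numbered_from(books, 10)
```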

<br>

#### <a name="filtering"></a>Filtering

* based on regexp

```bash
$ s='foo:123:bar:baz'
$ echo "$s" | ruby -F: -lane 'print $F.grep(/[a-z]/) * ":"'
foo:bar:baz

$ words='tryst fun glyph pity why'
$ echo "$words" | ruby -lane 'puts $F.grep(/[a-g]/)'
fun
glyph

$ # grep_v inverts the selection
$ echo "$words" | ruby -lane 'puts $F.grep_v(/[aeiou]/)'
tryst
glyph
why
```

* use `select` or `reject` for generic conditions

```bash
$ # to get index instead of matches
$ s='foo:123:bar:baz'
$ echo "$s" | ruby -F: -lane 'print $F.each_index.select{|i| $F[i] =~ /[a-z]/}'
[0, 2, 3]

$ # based on numeric value
$ s='23 756 -983 5'
$ echo "$s" | ruby -lane 'print $F.select { |s| s.to_i < 100 } * " "'
23 -983 5

$ # filters only those elements with successful substitution
$ # for opposite, either use negated condition or use reject instead of select
$ echo "$s" | ruby -lane 'print $F.select { |s| s.sub!(/3/, "E") } * " "'
2E -98E
```
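`reject` is the complement of `select` for the same condition. A minimal sketch using the numeric sample from above; the method names are made up for illustration:

```ruby
# hypothetical helpers showing select and reject with the same condition

def below_hundred(fields)
  fields.select { |s| s.to_i < 100 }
end

def hundred_or_more(fields)
  fields.reject { |s| s.to_i < 100 }
end

fields = %w[23 756 -983 5]
puts below_hundred(fields) * " "     # 23 -983 5
puts hundred_or_more(fields) * " "   # 756
```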

* random element(s)

```bash
$ s='65 23 756 -983 5'
$ echo "$s" | ruby -lane 'print $F.sample'
23
$ echo "$s" | ruby -lane 'print $F.sample'
5

$ echo "$s" | ruby -lane 'print $F.sample(2)'
["-983", "756"]
```

<br>

#### <a name="sorting"></a>Sorting

* [ruby-doc: sort](https://ruby-doc.org/core-2.5.0/Array.html#method-i-sort)
* See also [stackoverflow What does map(&:name) mean in Ruby?](https://stackoverflow.com/questions/1217088/what-does-mapname-mean-in-ruby) for explanation on `&:`

```bash
$ s='foo baz v22 aimed'
$ # same as: perl -lane 'print join " ", sort @F'
$ echo "$s" | ruby -lane 'print $F.sort * " "'
aimed baz foo v22

$ # demonstrating the <=> operator
$ ruby -e 'puts 4 <=> 2'
1
$ ruby -e 'puts 4 <=> 20'
-1
$ ruby -e 'puts 4 <=> 4'
0

$ # descending order
$ # same as: perl -lane 'print join " ", sort {$b cmp $a} @F'
$ echo "$s" | ruby -lane 'print $F.sort { |a,b| b <=> a } * " "'
v22 foo baz aimed
$ # can also reverse the array after default sorting
$ echo "$s" | ruby -lane 'print $F.sort.reverse * " "'
v22 foo baz aimed
```

* using `sort_by` to sort based on a key

```bash
$ s='floor bat to dubious four'
$ # can also use: ruby -lane 'print $F.sort_by(&:length) * ":"'
$ echo "$s" | ruby -lane 'print $F.sort_by {|a| a.length} * ":"'
to:bat:four:floor:dubious

$ # for descending order, simply negate the key
$ echo "$s" | ruby -lane 'print $F.sort_by {|a| -a.length} * ":"'
dubious:floor:four:bat:to

$ # need to explicitly convert from string to number for numeric input
$ s='23 756 -983 5'
$ echo "$s" | ruby -lane 'print $F.sort_by(&:to_i) * " "'
-983 5 23 756
$ s='5.33:2.2e3:42'
$ echo "$s" | ruby -F: -lane 'print $F.sort_by{|n| -n.to_f} * ":"'
2.2e3:42:5.33
```
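The key given to `sort_by` can be an array, which allows sorting by more than one criterion: later elements are used only to break ties. A minimal sketch, with method names and an extra word `tab` added to the sample for illustration:

```ruby
# hypothetical helpers: sort by length, alphabetic order used for ties

def by_length_then_alpha(words)
  words.sort_by { |w| [w.length, w] }
end

def by_length_desc_then_alpha(words)
  # negating only the numeric part keeps alphabetic order for ties
  words.sort_by { |w| [-w.length, w] }
end

words = %w[floor bat to dubious four tab]
puts by_length_then_alpha(words) * ":"
# to:bat:tab:four:floor:dubious
puts by_length_desc_then_alpha(words) * ":"
# dubious:floor:four:bat:tab:to
```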

* sorting characters within word
* `chars` method returns array with individual characters

```bash
$ echo 'foobar' | ruby -lne 'print $_.chars.sort * ""'
abfoor

$ cat words.txt
bot
art
are
boat
toe
flee
reed

$ # words with characters in ascending order
$ # can also use: ruby -lne 'print if $_.chars == $_.chars.sort' words.txt
$ ruby -lne 'print if $_ == $_.chars.sort * ""' words.txt
bot
art

$ # words with characters in descending order
$ # can also use: ruby -lne 'print if $_.chars == $_.chars.sort.reverse'
$ ruby -lne 'print if $_ == $_.chars.sort {|a,b| b <=> a} * ""' words.txt
toe
reed
```

* sorting columns based on header

```bash
$ # need to get indexes of order required for header, then use it for all lines
$ # same as: perl -lane '@i = sort {$F[$a] cmp $F[$b]} 0..$#F if $.==1;
$ #              print join "\t", @F[@i]' marks.txt
$ ruby -lane 'idx = $F.each_index.sort {|i,j| $F[i] <=> $F[j]} if $.==1;
              print $F.values_at(*idx) * "\t"' marks.txt
Dept    Marks   Name
ECE     53      Raj
ECE     72      Joel
EEE     68      Moi
CSE     81      Surya
EEE     59      Tia
ECE     92      Om
CSE     67      Amy
```

* [ruby-doc: uniq](https://ruby-doc.org/core-2.5.0/Array.html#method-i-uniq)
* order is preserved

```bash
$ s='3,b,a,c,d,1,d,c,2,3,1,b'
$ # same as: perl -MList::MoreUtils=uniq -F, -lane 'print join ",",uniq @F'
$ echo "$s" | ruby -F, -lane 'print $F.uniq * ","'
3,b,a,c,d,1,2

$ # same as: ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1])'
$ # note that -n/-p option is not used
$ ruby -e 'puts readlines.uniq {|s| s.split[1]}' duplicates.txt
abc  7   4
food toy ****
```

* max/min values

```bash
$ # if numeric array is constructed from string input
$ echo '34,17,6' | ruby -F, -lane 'print $F.max {|a,b| a.to_i <=> b.to_i}'
34
$ # or convert to numeric array first, 'map' is covered in next section
$ echo '34,17,6' | ruby -F, -lane 'print $F.map(&:to_i).max'
34
$ echo '23.5,42,-36' | ruby -F, -lane 'puts $F.map(&:to_f).max'
42.0

$ # string comparison is default
$ s='floor bat to dubious four'
$ echo "$s" | ruby -lane 'print $F.min'
bat

$ # can also get max/min 'n' elements
$ echo "$s" | ruby -lane 'print $F.max(2)'
["to", "four"]
$ echo "$s" | ruby -lane 'print $F.min(3) {|a,b| a.size <=> b.size}'
["to", "bat", "four"]
```
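`max_by` and `min_by` take a key instead of a comparison block, similar to `sort_by`, and return the original elements. A minimal sketch showing why the conversion matters for numeric strings (method names are assumptions for illustration):

```ruby
# hypothetical helpers based on max_by and minmax_by

def numeric_max(fields)
  fields.max_by(&:to_i)
end

def numeric_minmax(fields)
  fields.minmax_by(&:to_i)
end

fields = %w[34 17 6]
puts fields.max                # 6, as string comparison is the default
puts numeric_max(fields)       # 34
p numeric_minmax(fields)       # ["6", "34"]
```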

<br>

#### <a name="transforming"></a>Transforming

* shuffling elements

```bash
$ s='23 756 -983 5'
$ echo "$s" | ruby -lane 'print $F.shuffle * " "'
5 756 -983 23
$ echo "$s" | ruby -lane 'print $F.shuffle * " "'
756 5 23 -983

$ # randomizing file contents
$ # note that -n/-p option is not used
$ ruby -e 'puts readlines.shuffle' poem.txt
And so are you.
Violets are blue,
Roses are red,
Sugar is sweet,

$ # or if shuffle order is known
$ seq 5 | ruby -e 'puts readlines.values_at(3,1,0,2,4)'
4
2
1
3
5
```

* use `map` to transform every element
* See also [stackoverflow What does map(&:name) mean in Ruby?](https://stackoverflow.com/questions/1217088/what-does-mapname-mean-in-ruby) for explanation on `&:`

```bash
$ echo '23 756 -983 5' | ruby -lane 'print $F.map {|n| n.to_i ** 2} * " "'
529 571536 966289 25
$ echo 'a b c' | ruby -lane 'print $F.map {|s| %Q/"#{s}"/} * ","'
"a","b","c"
$ echo 'a b c' | ruby -lane 'print $F.map {|s| %Q/"#{s}"/.upcase} * ","'
"A","B","C"

$ # ASCII int values for each character
$ echo 'AaBbCc' | ruby -lne 'print $_.chars.map(&:ord) * " "'
65 97 66 98 67 99

$ echo '34,17,6' | ruby -F, -lane 'puts $F.map(&:to_i).sum'
57

$ # shuffle each field character wise
$ s='this is a sample sentence'
$ echo "$s" | ruby -lane 'print $F.map {|s| s.chars.shuffle * ""} * " "'
hsti si a mlepas esencnet
```

* reverse array/string

```bash
$ s='23 756 -983 5'
$ echo "$s" | ruby -lane 'print $F.reverse * " "'
5 -983 756 23

$ echo 'foobar' | ruby -lne 'print $_.reverse'
raboof
$ # or inplace reverse
$ echo 'foobar' | ruby -lpe '$_.reverse!'
raboof
```

* See also [ruby-doc: Enumerable](https://ruby-doc.org/core-2.5.0/Enumerable.html) for more methods like `inject`

<br>

## <a name="miscellaneous"></a>Miscellaneous

<br>

#### <a name="split"></a>split

* the `-a` command line option uses `split` and automatically saves the results in `$F` array
* default separator is `\s+` and also strips whitespace from start/end of string
* See also [ruby-doc: split](https://ruby-doc.org/core-2.5.0/String.html#method-i-split)

```bash
$ # specifying maximum number of splits
$ # same as: perl -lne 'print join ":", split /\s+/,$_,2'
$ echo 'a 1 b 2 c' | ruby -lne 'print $_.split(/\s+/, 2) * ":"'
a:1 b 2 c

$ # by default, trailing empty fields are stripped
$ echo ':123::' | ruby -lne 'print $_.split(/:/) * ","'
,123
$ # specify a negative count to preserve trailing empty fields
$ echo ':123::' | ruby -lne 'print $_.split(/:/, -1) * ","'
,123,,

$ # use string argument for fixed-string split instead of regexp
$ echo 'foo**123**baz' | ruby -lne 'print $_.split("**") * ":"'
foo:123:baz

$ # to save the separators as well, use capture groups
$ s='Sample123string54with908numbers'
$ echo "$s" | ruby -lne 'print $_.split(/(\d+)/) * ":"'
Sample:123:string:54:with:908:numbers
```

* converting a single line to multiple lines by splitting a column

```bash
$ cat split.txt
foo,1:2:5,baz
wry,4,look
free,3:8,oh

$ # same as: perl -F, -ane 'print join ",", $F[0],$_,$F[2] for split /:/,$F[1]'
$ ruby -F, -ane '$F[1].split(/:/).each {|x| print [$F[0],x,$F[2]]*","}' split.txt
foo,1,baz
foo,2,baz
foo,5,baz
wry,4,look
free,3,oh
free,8,oh
$ # can also use scan here:
$ # ruby -F, -ane '$F[1].scan(/[^:]+/) {|x| print [$F[0],x,$F[2]]*","}'
```

<br>

#### <a name="fixed-width-processing"></a>Fixed width processing

* [ruby-doc: unpack](https://ruby-doc.org/core-2.5.0/String.html#method-i-unpack)

```bash
$ # same as: perl -lne '@x = unpack("a1xa3xa4", $_); print $x[0]'
$ # here 'a' indicates arbitrary binary string
$ # the number that follows indicates length
$ # the 'x' indicates characters to ignore, use length after 'x' if needed
$ # and there are many other formats, see ruby-doc for details
$ echo 'b 123 good' | ruby -lne 'print $_.unpack("a1xa3xa4")[0]'
b
$ echo 'b 123 good' | ruby -lne 'print $_.unpack("a1xa3xa4")[1]'
123
$ echo 'b 123 good' | ruby -lne 'print $_.unpack("a1xa3xa4")[2]'
good

$ # unpack not always needed, simple slicing might help
$ echo 'b 123 good' | ruby -ne 'puts $_[2,3]'
123
$ echo 'b 123 good' | ruby -ne 'puts $_[6,4]'
good

$ # replacing arbitrary slice
$ # same as: perl -lpe 'substr $_, 2, 3, "gleam"'
$ echo 'b 123 good' | ruby -lpe '$_[2,3] = "gleam"'
b gleam good
```

<br>

#### <a name="string-and-file-replication"></a>String and file replication

```bash
$ # replicate each line, same as: perl -ne 'print $_ x 2'
$ seq 2 | ruby -ne 'print $_ * 2'
1
1
2
2

$ # replicate a string, same as: perl -le 'print "abc" x 5'
$ ruby -e 'puts "abc" * 5'
abcabcabcabcabc

$ # works for array too, but be careful with mutable elements
$ ruby -le 'x = [3, 2, 1] * 2; print x'
[3, 2, 1, 3, 2, 1]
$ ruby -le 'x = [3, 2, [1, 7]] * 2; x[2][0]="a"; print x'
[3, 2, ["a", 7], 3, 2, ["a", 7]]

$ # replicating file, same as: perl -0777 -ne 'print $_ x 100'
$ wc -c poem.txt
65 poem.txt
$ ruby -0777 -ne 'print $_ * 100' poem.txt | wc -c
6500
```

<br>

#### <a name="transliteration"></a>transliteration

* [ruby-doc: tr](https://ruby-doc.org/core-2.5.0/String.html#method-i-tr)

```bash
$ echo 'Uryyb Jbeyq' | ruby -pe '$_.tr!("a-zA-Z", "n-za-mN-ZA-M")'
Hello World
$ echo 'hi there!' | ruby -pe '$_.tr!("a-z", "\u{1d5ee}-\u{1d607}")'
𝗵𝗶 𝘁𝗵𝗲𝗿𝗲!

$ # when first argument is longer than the second
$ # the last character of second argument is reused as padding
$ echo 'foo bar cat baz' | ruby -pe '$_.tr!("a-z", "123")'
333 213 313 213

$ # use ^ at start of first argument to complement specified characters
$ echo 'foo:123:baz' | ruby -lpe '$_.tr!("^0-9", "-")'
----123----

$ # use empty second argument to delete specified characters
$ echo '"Foo1!", "Bar.", ":Baz:"' | ruby -lpe '$_.tr!("^A-Za-z,", "")'
Foo,Bar,Baz

$ # use - at start/end and ^ other than start to match themselves
$ echo 'a^3-b*d' | ruby -lpe '$_.tr!("-^*", "*/+")'
a/3*b+d
```

<br>

#### <a name="executing-external-commands"></a>Executing external commands

* External commands can be issued using the `system` method
* Output would be as usual on `stdout` unless redirected while calling the command

```bash
$ # same as: perl -e 'system("echo Hello World")'
$ ruby -e 'system("echo Hello World")'
Hello World

$ ruby -e 'system("wc poem.txt")'
 4 13 65 poem.txt

$ ruby -e 'system("seq 10 | paste -sd, > out.txt")'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ cat f2
I bought two bananas and three mangoes
$ # same as: perl -F, -lane 'system "cat $F[1]"'
$ echo 'f1,f2,odd.txt' | ruby -F, -lane 'system("cat #{$F[1]}")'
I bought two bananas and three mangoes
```

* return value of `system` or global variable `$?` can be used to act upon exit status of command issued
* see [ruby-doc: system](https://ruby-doc.org/core-2.5.0/Kernel.html#method-i-system) for details

```bash
$ ruby -e 'es=system("ls poem.txt"); puts es'
poem.txt
true
$ ruby -e 'system("ls poem.txt"); puts $?'
poem.txt
pid 17005 exit 0

$ ruby -e 'system("ls xyz.txt"); puts $?'
ls: cannot access 'xyz.txt': No such file or directory
pid 17059 exit 2
```

* to save result of external command, use backticks or `%x`

```bash
$ ruby -e 'lines = `wc -l < poem.txt`; print lines'
4

$ ruby -e 'nums = %x/seq 3/; print nums'
1
2
3
```

* See also [stackoverflow - difference between exec, system and %x() or backticks](https://stackoverflow.com/questions/6338908/ruby-difference-between-exec-system-and-x-or-backticks)

<br>

## <a name="further-reading"></a>Further Reading

* Manual and related
    * [ruby-lang documentation](https://www.ruby-lang.org/en/documentation/) - manuals, tutorials and references
    * [ruby-lang - faqs](https://www.ruby-lang.org/en/documentation/faq/)
    * [ruby-lang - quickstart](https://www.ruby-lang.org/en/documentation/quickstart/)
    * [ruby-lang - To Ruby From Perl](https://www.ruby-lang.org/en/documentation/ruby-from-other-languages/to-ruby-from-perl/)
    * [rubular - Ruby regular expression editor](http://rubular.com/)
* Tutorials and Q&A
    * [Smooth Ruby One-Liners](https://dev.to/rpalo/smooth-ruby-one-liners-154) - simple intro to ruby one-liners
    * [Ruby one-liners](http://benoithamelin.tumblr.com/ruby1line) based on [awk one-liners](http://www.pement.org/awk/awk1line.txt)
    * [Ruby Tricks, Idiomatic Ruby, Refactorings and Best Practices](https://franzejr.github.io/best-ruby/index.html)
    * [freecodecamp - learning Ruby](https://medium.freecodecamp.org/learning-ruby-from-zero-to-hero-90ad4eecc82d)
    * [Ruby Regexp](https://leanpub.com/rubyregexp) ebook - step by step guide from beginner to advanced levels
    * [regex FAQ on SO](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)
* Alternatives
    * [bioruby](https://github.com/bioruby/bioruby)
    * [perl](https://perldoc.perl.org/)
    * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed)


================================================
FILE: sorting_stuff.md
================================================
# <a name="sorting-stuff"></a>Sorting stuff

**Table of Contents**

* [sort](#sort)
    * [Default sort](#default-sort)
    * [Reverse sort](#reverse-sort)
    * [Various number sorting](#various-number-sorting)
    * [Random sort](#random-sort)
    * [Specifying output file](#specifying-output-file)
    * [Unique sort](#unique-sort)
    * [Column based sorting](#column-based-sorting)
    * [Further reading for sort](#further-reading-for-sort)
* [uniq](#uniq)
    * [Default uniq](#default-uniq)
    * [Only duplicates](#only-duplicates)
    * [Only unique](#only-unique)
    * [Prefix count](#prefix-count)
    * [Ignoring case](#ignoring-case)
    * [Combining multiple files](#combining-multiple-files)
    * [Column options](#column-options)
    * [Further reading for uniq](#further-reading-for-uniq)
* [comm](#comm)
    * [Default three column output](#default-three-column-output)
    * [Suppressing columns](#suppressing-columns)
    * [Files with duplicates](#files-with-duplicates)
    * [Further reading for comm](#further-reading-for-comm)
* [shuf](#shuf)
    * [Random lines](#random-lines)
    * [Random integer numbers](#random-integer-numbers)
    * [Further reading for shuf](#further-reading-for-shuf)

<br>

## <a name="sort"></a>sort

```bash
$ sort --version | head -n1
sort (GNU coreutils) 8.25

$ man sort
SORT(1)                          User Commands                         SORT(1)

NAME
       sort - sort lines of text files

SYNOPSIS
       sort [OPTION]... [FILE]...
       sort [OPTION]... --files0-from=F

DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.

       With no FILE, or when FILE is -, read standard input.
...
```

**Note**: All examples shown here assume ASCII encoded input files


<br>

#### <a name="default-sort"></a>Default sort

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ sort poem.txt
And so are you.
Roses are red,
Sugar is sweet,
Violets are blue,
```

* Well, that was easy. The lines were sorted alphabetically (ascending order by default) and it so happened that first letter alone was enough to decide the order
* For next example, let's extract all the words and sort them
    * this also showcases that `sort` accepts stdin as input
    * See [GNU grep](./gnu_grep.md) chapter if the `grep` command used below looks alien

```bash
$ # output might differ depending on locale settings
$ # note the case-insensitiveness of output
$ grep -oi '[a-z]*' poem.txt | sort
And
are
are
are
blue
is
red
Roses
so
Sugar
sweet
Violets
you
```

* Take note of the following footnote from `info sort` regarding locale
* See also
    * [arch wiki - locale](https://wiki.archlinux.org/index.php/locale)
    * [Linux: Define Locale and Language Settings](https://www.shellhacks.com/linux-define-locale-language-settings/)

```bash
$ info sort | tail

   (1) If you use a non-POSIX locale (e.g., by setting ‘LC_ALL’ to
‘en_US’), then ‘sort’ may produce output that is sorted differently than
you’re accustomed to.  In that case, set the ‘LC_ALL’ environment
variable to ‘C’.  Note that setting only ‘LC_COLLATE’ has two problems.
First, it is ineffective if ‘LC_ALL’ is also set.  Second, it has
undefined behavior if ‘LC_CTYPE’ (or ‘LANG’, if ‘LC_CTYPE’ is unset) is
set to an incompatible value.  For example, you get undefined behavior
if ‘LC_CTYPE’ is ‘ja_JP.PCK’ but ‘LC_COLLATE’ is ‘en_US.UTF-8’.
```

* Example to help show effect of locale setting

```bash
$ # note how uppercase is sorted before lowercase
$ grep -oi '[a-z]*' poem.txt | LC_ALL=C sort
And
Roses
Sugar
Violets
are
are
are
blue
is
red
so
sweet
you
```

<br>

#### <a name="reverse-sort"></a>Reverse sort

* This is simply reversing from default ascending order to descending order

```bash
$ sort -r poem.txt
Violets are blue,
Sugar is sweet,
Roses are red,
And so are you.
```

<br>

#### <a name="various-number-sorting"></a>Various number sorting

```bash
$ cat numbers.txt
20
53
3
101

$ sort numbers.txt
101
20
3
53
```

* Whoops, what happened there? `sort` won't know to treat them as numbers unless specified
* Depending on format of numbers, different options have to be used
* First up is `-n` option, which sorts based on numerical value

```bash
$ sort -n numbers.txt
3
20
53
101

$ sort -nr numbers.txt
101
53
20
3
```

* The `-n` option can handle negative numbers
* As well as thousands separator and decimal point (depends on locale)
* The `<()` syntax is [Process Substitution](http://mywiki.wooledge.org/ProcessSubstitution)
    * to put it simply - allows output of command to be passed as input file to another command without needing to manually create a temporary file

```bash
$ # multiple files are merged as single input by default
$ sort -n numbers.txt <(echo '-4')
-4
3
20
53
101

$ sort -n numbers.txt <(echo '1,234')
3
20
53
101
1,234

$ sort -n numbers.txt <(echo '31.24')
3
20
31.24
53
101
```
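
* A small sketch of the "depends on locale" part, assuming GNU `sort`: the `C` locale defines no thousands separator, so only the leading `1` of `1,234` should count as the number. The input values here are made up for illustration

```bash
# in the C locale, numeric parsing stops at the comma
# so 1,234 is treated as 1 and ends up first
result=$(printf '%s\n' 3 20 1,234 | LC_ALL=C sort -n)
printf '%s\n' "$result"
# 1,234
# 3
# 20
```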

* Use `-g` if input contains numbers prefixed by `+` or [E scientific notation](https://en.wikipedia.org/wiki/Scientific_notation#E_notation)

```bash
$ cat generic_numbers.txt
+120
-1.53
3.14e+4
42.1e-2

$ sort -g generic_numbers.txt
-1.53
42.1e-2
+120
3.14e+4
```

* Commands like `du` have options to display numbers in human readable formats
* `sort` supports sorting such numbers using the `-h` option

```bash
$ du -sh *
104K    power.log
746M    projects
316K    report.log
20K     sample.txt
$ du -sh * | sort -h
20K     sample.txt
104K    power.log
316K    report.log
746M    projects

$ # --si uses powers of 1000 instead of 1024
$ du -s --si *
107k    power.log
782M    projects
324k    report.log
21k     sample.txt
$ du -s --si * | sort -h
21k     sample.txt
107k    power.log
324k    report.log
782M    projects
```

* Version sort - dealing with numbers mixed with other characters
* If this sorting is needed simply while displaying directory contents, use `ls -v` instead of piping to `sort -V`

```bash
$ cat versions.txt
foo_v1.2
bar_v2.1.3
foobar_v2
foo_v1.2.1
foo_v1.3

$ sort -V versions.txt
bar_v2.1.3
foobar_v2
foo_v1.2
foo_v1.2.1
foo_v1.3
```

* Another common use case is when there are multiple filenames differentiated by numbers

```bash
$ cat files.txt
file0
file10
file3
file4

$ sort -V files.txt
file0
file3
file4
file10
```

* Can be used when dealing with numbers reported by `time` command as well

```bash
$ # different solving durations
$ cat rubik_time.txt
5m35.363s
3m20.058s
4m5.099s
4m1.130s
3m42.833s
4m33.083s

$ # assuming consistent min/sec format
$ sort -V rubik_time.txt
3m20.058s
3m42.833s
4m1.130s
4m5.099s
4m33.083s
5m35.363s
```

<br>

#### <a name="random-sort"></a>Random sort

* Note that duplicate lines will always end up next to each other
    * might be useful as a feature for some cases ;)
    * Use `shuf` if this is not desirable
* See also [How can I shuffle the lines of a text file on the Unix command line or in a shell script?](https://stackoverflow.com/questions/2153882/how-can-i-shuffle-the-lines-of-a-text-file-on-the-unix-command-line-or-in-a-shel)

```bash
$ cat nums.txt
1
10
10
12
23
563

$ # the two 10s will always be next to each other
$ sort -R nums.txt
563
12
1
10
10
23

$ # duplicates can end up anywhere
$ shuf nums.txt
10
23
1
10
563
12
```

<br>

#### <a name="specifying-output-file"></a>Specifying output file

* The `-o` option can be used to specify output file
* Useful for in place editing

```bash
$ sort -R nums.txt -o rand_nums.txt
$ cat rand_nums.txt
23
1
10
10
563
12

$ sort -R nums.txt -o nums.txt
$ cat nums.txt
563
23
10
10
1
12
```
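
* A minimal sketch of why `-o` is preferred over redirection for in place editing: with `> file`, the shell empties the file before `sort` gets to read it. The scratch directory and file names below are assumptions made for this illustration

```bash
# work inside a scratch directory so that no real file is touched
tmp=$(mktemp -d)
printf '%s\n' 20 53 3 101 > "$tmp/a.txt"
cp "$tmp/a.txt" "$tmp/b.txt"

# redirection: a.txt is truncated by the shell before sort reads it
sort -n "$tmp/a.txt" > "$tmp/a.txt"
wc -c < "$tmp/a.txt"

# -o: sort reads the entire input first and writes the file afterwards
sort -n -o "$tmp/b.txt" "$tmp/b.txt"
cat "$tmp/b.txt"
```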

* Use shell script looping if there are multiple files to be sorted in place
* Below snippet is for `bash` shell

```bash
$ for f in *.txt; do echo sort -V "$f" -o "$f"; done
sort -V files.txt -o files.txt
sort -V rubik_time.txt -o rubik_time.txt
sort -V versions.txt -o versions.txt

$ # remove echo once commands look fine
$ for f in *.txt; do sort -V "$f" -o "$f"; done
```

<br>

#### <a name="unique-sort"></a>Unique sort

* Keep only first copy of lines that are deemed to be same according to `sort` option used

```bash
$ cat duplicates.txt
foo
12 carrots
foo
12 apples
5 guavas

$ # only one copy of foo in output
$ sort -u duplicates.txt
12 apples
12 carrots
5 guavas
foo
```

* According to option used, definition of duplicate will vary
* For example, when `-n` is used, matching numbers are deemed same even if rest of line differs
    * Pipe the output to `uniq` if this is not desirable

```bash
$ # note how first copy of line starting with 12 is retained
$ sort -nu duplicates.txt
foo
5 guavas
12 carrots

$ # use uniq when entire line should be compared to find duplicates
$ sort -n duplicates.txt | uniq
foo
5 guavas
12 apples
12 carrots
```

* Use `-f` option to ignore case of letters while determining duplicates

```bash
$ cat words.txt
CAR
are
car
Are
foot
are

$ # only the two 'are' were considered duplicates
$ sort -u words.txt
are
Are
car
CAR
foot

$ # note again that first copy of duplicate is retained
$ sort -fu words.txt
are
CAR
foot
```

<br>

#### <a name="column-based-sorting"></a>Column based sorting

From `info sort`

```
‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
     Specify a sort field that consists of the part of the line between
     POS1 and POS2 (or the end of the line, if POS2 is omitted),
     _inclusive_.

     Each POS has the form ‘F[.C][OPTS]’, where F is the number of the
     field to use, and C is the number of the first character from the
     beginning of the field.  Fields and character positions are
     numbered starting with 1; a character position of zero in POS2
     indicates the field’s last character.  If ‘.C’ is omitted from
     POS1, it defaults to 1 (the beginning of the field); if omitted
     from POS2, it defaults to 0 (the end of the field).  OPTS are
     ordering options, allowing individual keys to be sorted according
     to different rules; see below for details.  Keys can span multiple
     fields.
```

* By default, blank characters (space and tab) serve as field separators

```bash
$ cat fruits.txt
apple   42
guava   6
fig     90
banana  31

$ sort fruits.txt
apple   42
banana  31
fig     90
guava   6

$ # sort based on 2nd column numbers
$ sort -k2,2n fruits.txt
guava   6
banana  31
apple   42
fig     90
```

* Using a different field separator
* Consider the following sample input file having fields separated by `:`

```bash
$ # name:pet_name:no_of_pets
$ cat pets.txt
foo:dog:2
xyz:cat:1
baz:parrot:5
abcd:cat:3
joe:dog:1
bar:fox:1
temp_var:squirrel:4
boss:dog:10
```

* Sorting based on particular column or column to end of line
* In case of ties on the specified key, `sort` by default falls back to comparing the entire line

```bash
$ # only 2nd column
$ # -k2,4 would mean 2nd column to 4th column
$ sort -t: -k2,2 pets.txt
abcd:cat:3
xyz:cat:1
boss:dog:10
foo:dog:2
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

$ # from 2nd column to end of line
$ sort -t: -k2 pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
boss:dog:10
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
```

* Multiple keys can be specified to resolve ties
* Note that if lines still tie on all the specified keys, the entire line is compared as a last resort

```bash
$ # default sort for 2nd column, numeric sort on 3rd column to resolve ties
$ sort -t: -k2,2 -k3,3n pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
foo:dog:2
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

$ # numeric sort on 3rd column, default sort for 2nd column to resolve ties
$ sort -t: -k3,3n -k2,2 pets.txt
xyz:cat:1
joe:dog:1
bar:fox:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
```

* Use `-s` option to retain original order of lines in case of tie

```bash
$ sort -s -t: -k2,2 pets.txt
xyz:cat:1
abcd:cat:3
foo:dog:2
joe:dog:1
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
```

* The `-u` option, as seen earlier, will retain only first match

```bash
$ sort -u -t: -k2,2 pets.txt
xyz:cat:1
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

$ sort -u -t: -k3,3n pets.txt
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
```

* Sometimes, the input has to be sorted first and then `-u` used on the sorted output
* See also [remove duplicates based on the value of another column](https://unix.stackexchange.com/questions/379835/remove-duplicates-based-on-the-value-of-another-column)

```bash
$ # sort by number in 3rd column
$ sort -t: -k3,3n pets.txt
bar:fox:1
joe:dog:1
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10

$ # then get unique entry based on 2nd column
$ sort -t: -k3,3n pets.txt | sort -t: -u -k2,2
xyz:cat:1
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
```

* Specifying particular characters within fields
* If character position is not specified, defaults to `1` for starting column and `0` (last character) for ending column

```bash
$ cat marks.txt
fork,ap_12,54
flat,up_342,1.2
fold,tn_48,211
more,ap_93,7
rest,up_5,63

$ # for 2nd column, sort numerically only from 4th character to end
$ sort -t, -k2.4,2n marks.txt
rest,up_5,63
fork,ap_12,54
fold,tn_48,211
more,ap_93,7
flat,up_342,1.2

$ # sort uniquely based on first two characters of line
$ sort -u -k1.1,1.2 marks.txt
flat,up_342,1.2
fork,ap_12,54
more,ap_93,7
rest,up_5,63
```

* If there are headers

```bash
$ cat header.txt
fruit   qty
apple   42
guava   6
fig     90
banana  31

$ # separate and combine header and content to be sorted
$ cat <(head -n1 header.txt) <(tail -n +2 header.txt | sort -k2nr)
fruit   qty
fig     90
apple   42
banana  31
guava   6
```
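
* Another possible way to keep the header in place, sketched here with the shell's `read` builtin consuming only the first line so that `sort` sees the rest. The scratch file recreates the `header.txt` sample and the variable names are assumptions for illustration

```bash
tmp=$(mktemp -d)
printf '%s\n' 'fruit   qty' 'apple   42' 'guava   6' \
              'fig     90' 'banana  31' > "$tmp/header.txt"

# read takes the header line, sort processes the remaining lines
result=$(
    {
        IFS= read -r header
        printf '%s\n' "$header"
        sort -k2,2nr
    } < "$tmp/header.txt"
)
printf '%s\n' "$result"
```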

* See also [sort by last field value when number of fields varies](https://stackoverflow.com/questions/3832068/bash-sort-text-file-by-last-field-value)
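
* One common approach for such cases is decorate-sort-undecorate. This is only a sketch with made up input, not necessarily the solution given in the linked answer: copy the last field to the front, sort on it and then remove it

```bash
# hypothetical input: number of fields varies, the number to sort on is last
data='pen red 30
book 7
laptop silver used 120
cup 15'

# awk adds the last field as a tab separated prefix
# sort uses only that prefix as a numeric key
# cut removes the prefix again
result=$(printf '%s\n' "$data" |
    awk '{print $NF "\t" $0}' |
    sort -t "$(printf '\t')" -k1,1n |
    cut -f2-)
printf '%s\n' "$result"
```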

<br>

#### <a name="further-reading-for-sort"></a>Further reading for sort

* There are many other options apart from the handful presented above. See `man sort` and `info sort` for detailed documentation and more examples
* [sort like a master](http://www.skorks.com/2010/05/sort-files-like-a-master-with-the-linux-sort-command-bash/)
* [When -b to ignore leading blanks is needed](https://unix.stackexchange.com/a/104527/109046)
* [sort Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/sort?sort=votes&pageSize=15)
* [sort on multiple columns using -k option](https://unix.stackexchange.com/questions/249452/unix-multiple-column-sort-issue)
* [sort a string character wise](https://stackoverflow.com/questions/2373874/how-to-sort-characters-in-a-string)
* [Scalability of 'sort -u' for gigantic files](https://unix.stackexchange.com/questions/279096/scalability-of-sort-u-for-gigantic-files)

<br>

## <a name="uniq"></a>uniq

```bash
$ uniq --version | head -n1
uniq (GNU coreutils) 8.25

$ man uniq
UNIQ(1)                          User Commands                         UNIQ(1)

NAME
       uniq - report or omit repeated lines

SYNOPSIS
       uniq [OPTION]... [INPUT [OUTPUT]]

DESCRIPTION
       Filter  adjacent matching lines from INPUT (or standard input), writing
       to OUTPUT (or standard output).

       With no options, matching lines are merged to the first occurrence.
...
```

<br>

#### <a name="default-uniq"></a>Default uniq

```bash
$ cat word_list.txt
are
are
to
good
bad
bad
bad
good
are
bad

$ # adjacent duplicate lines are removed, leaving one copy
$ uniq word_list.txt
are
to
good
bad
good
are
bad

$ # To remove duplicates from entire file, input has to be sorted first
$ # also showcases that uniq accepts stdin as input
$ sort word_list.txt | uniq
are
bad
good
to
```
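
* If duplicates have to be removed from the entire file while keeping the original order of lines, sorting is not an option. A widely used `awk` idiom can help, sketched below on a copy of the `word_list.txt` sample (the scratch directory is an assumption for illustration)

```bash
tmp=$(mktemp -d)
printf '%s\n' are are to good bad bad bad good are bad > "$tmp/word_list.txt"

# seen[$0]++ evaluates to 0 the first time a line is encountered
# so the negated condition is true only for the first occurrence
result=$(awk '!seen[$0]++' "$tmp/word_list.txt")
printf '%s\n' "$result"
# are
# to
# good
# bad
```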

<br>

#### <a name="only-duplicates"></a>Only duplicates

```bash
$ # duplicates adjacent to each other
$ uniq -d word_list.txt
are
bad

$ # duplicates in entire file
$ sort word_list.txt | uniq -d
are
bad
good
```

* To show every copy of the duplicated lines, instead of just one line per group

```bash
$ uniq -D word_list.txt
are
are
bad
bad
bad

$ sort word_list.txt | uniq -D
are
are
are
bad
bad
bad
bad
good
good
```

* To distinguish the different groups

```bash
$ # using --all-repeated=prepend will add a newline before the first group as well
$ sort word_list.txt | uniq --all-repeated=separate
are
are
are

bad
bad
bad
bad

good
good
```

<br>

#### <a name="only-unique"></a>Only unique

```bash
$ # lines with no adjacent duplicates
$ uniq -u word_list.txt
to
good
good
are
bad

$ # unique lines in entire file
$ sort word_list.txt | uniq -u
to
```

<br>

#### <a name="prefix-count"></a>Prefix count

```bash
$ # adjacent lines
$ uniq -c word_list.txt
      2 are
      1 to
      1 good
      3 bad
      1 good
      1 are
      1 bad

$ # entire file
$ sort word_list.txt | uniq -c
      3 are
      4 bad
      2 good
      1 to

$ # entire file, only duplicates
$ sort word_list.txt | uniq -cd
      3 are
      4 bad
      2 good
```

* Sorting by count

```bash
$ # sort by count
$ sort word_list.txt | uniq -c | sort -n
      1 to
      2 good
      3 are
      4 bad

$ # reverse the order, highest count first
$ sort word_list.txt | uniq -c | sort -nr
      4 bad
      3 are
      2 good
      1 to
```
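
* When several entries have the same count, their relative order can be made predictable by adding a second key. A rough sketch with made up input, assuming the usual `uniq -c` layout of count followed by the line

```bash
data='Red
Blue
Green
Red
Blue
Black'

# first key: count in descending numeric order
# second key: the word itself, to order entries having the same count
result=$(printf '%s\n' "$data" | sort | uniq -c | sort -k1,1nr -k2,2)
printf '%s\n' "$result"
```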

* To get only entries with min/max count, a bit of [awk](./gnu_awk.md) magic would help

```bash
$ # consider this result
$ sort colors.txt | uniq -c | sort -nr
      3 Red
      3 Blue
      2 Yellow
      1 Green
      1 Black

$ # to get all max count
$ # save 1st line 1st column value to c and then print if 1st column equals c
$ sort colors.txt | uniq -c | sort -nr | awk 'NR==1{c=$1} $1==c'
      3 Red
      3 Blue
$ # to get all min count
$ sort colors.txt | uniq -c | sort -n | awk 'NR==1{c=$1} $1==c'
      1 Black
      1 Green
```

* Get a rough count of the most used commands from the shell history file

```bash
$ # awk '{print $1}' will get the 1st column alone
$ awk '{print $1}' "$HISTFILE" | sort | uniq -c | sort -nr | head
   1465 echo
   1180 grep
    552 cd
    531 awk
    451 sed
    423 vi
    418 cat
    392 perl
    325 printf
    320 sort

$ # extract command name from start of line or preceded by 'spaces|spaces'
$ # won't catch commands in other places like command substitution though
$ grep -oP '(^| +\| +)\K[^ ]+' "$HISTFILE" | sort | uniq -c | sort -nr | head
   2006 grep
   1469 echo
    933 sed
    698 awk
    552 cd
    513 perl
    510 cat
    453 sort
    423 vi
    327 printf
```

<br>

#### <a name="ignoring-case"></a>Ignoring case

```bash
$ cat another_list.txt
food
Food
good
are
bad
Are

$ # note how first copy is retained
$ uniq -i another_list.txt
food
good
are
bad
Are

$ uniq -iD another_list.txt
food
Food
```

<br>

#### <a name="combining-multiple-files"></a>Combining multiple files

```bash
$ sort -f word_list.txt another_list.txt | uniq -i
are
bad
food
good
to

$ sort -f word_list.txt another_list.txt | uniq -c
      4 are
      1 Are
      5 bad
      1 food
      1 Food
      3 good
      1 to

$ sort -f word_list.txt another_list.txt | uniq -ic
      5 are
      5 bad
      2 food
      3 good
      1 to
```

* If only adjacent lines are to be compared (no sorting), concatenate the files using another command, since `uniq` accepts only one input file

```bash
$ uniq -id word_list.txt
are
bad

$ uniq -id another_list.txt
food

$ cat word_list.txt another_list.txt | uniq -id
are
bad
food
```

<br>

#### <a name="column-options"></a>Column options

* `uniq` has a few options dealing with column manipulations. Not as extensive as `sort -k` but handy for some cases
* First up, skipping fields
    * No option to specify different delimiter
    * From `info uniq`: Fields are sequences of non-space non-tab characters that are separated from each other by at least one space or tab
    * Number of spaces/tabs between fields should be the same, since the blanks preceding a field are part of the comparison

```bash
$ cat shopping.txt
lemon 5
mango 5
banana 8
bread 1
orange 5

$ # skips first field
$ uniq -f1 shopping.txt
lemon 5
banana 8
bread 1
orange 5

$ # use -f3 to skip first three fields and so on
```
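
* Since `uniq` cannot be told to use a different delimiter, one possible workaround is `awk`. The sketch below compares the second `:` separated field of adjacent lines, using a made up `:` separated variant of the shopping sample

```bash
data='lemon:5
mango:5
banana:8
bread:1
orange:5'

# print a line if it is the first one or if its 2nd field
# differs from the 2nd field of the previous line
# appending "" forces string comparison, similar to what uniq does
result=$(printf '%s\n' "$data" |
    awk -F: 'NR==1 || $2 != prev; {prev = $2 ""}')
printf '%s\n' "$result"
# lemon:5
# banana:8
# bread:1
# orange:5
```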

* Skipping characters

```bash
$ cat text
glue
blue
black
stack
stuck

$ # don't consider first 2 characters
$ uniq -s2 text
glue
black
stuck

$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 2nd column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck
```

* Up to a specified number of characters

```bash
$ # consider only first 2 characters
$ uniq -w2 text
glue
blue
stack

$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 1st column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck
```

* Combining `-s` and `-w`
* Can be combined with `-f` as well

```bash
$ # skip first 3 characters and then use next 2 characters
$ uniq -s3 -w2 text
glue
black
```
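
* A sketch of combining `-f` with `-s` and `-w`, assuming GNU `uniq` and made up input. After `-f1` skips the first field, the comparison would start at the blank before the second field, hence `-s1` to step over it

```bash
data='1 apple
2 apricot
3 banana
4 band
5 cat'

# -f1 skips the first field, -s1 skips the blank that follows it
# -w2 then compares only two characters: ap, ap, ba, ba, ca
result=$(printf '%s\n' "$data" | uniq -f1 -s1 -w2)
printf '%s\n' "$result"
# 1 apple
# 3 banana
# 5 cat
```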


<br>

#### <a name="further-reading-for-uniq"></a>Further reading for uniq

* Do check out `man uniq` and `info uniq` for other options and more detailed documentation
* [uniq Q&A on unix stackexchange](http://unix.stackexchange.com/questions/tagged/uniq?sort=votes&pageSize=15)
* [process duplicate lines only based on certain fields](https://unix.stackexchange.com/questions/387590/print-the-duplicate-lines-only-on-fields-1-2-from-csv-file)

<br>

## <a name="comm"></a>comm

```bash
$ comm --version | head -n1
comm (GNU coreutils) 8.25

$ man comm
COMM(1)                          User Commands                         COMM(1)

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

       When FILE1 or FILE2 (not both) is -, read standard input.

       With  no  options,  produce  three-column  output.  Column one contains
       lines unique to FILE1, column two contains lines unique to  FILE2,  and
       column three contains lines common to both files.
...
```

<br>

#### <a name="default-three-column-output"></a>Default three column output

Consider below sample input files

```bash
$ # sorted input files viewed side by side
$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Purple  Green
Red     Red
Teal    White
Yellow
```

* Without any option, `comm` gives 3 column output
    * lines unique to first file
    * lines unique to second file
    * lines common to both files

```bash
$ comm colors_1.txt colors_2.txt
        Black
                Blue
Brown
        Green
Purple
                Red
Teal
        White
Yellow
```

<br>

#### <a name="suppressing-columns"></a>Suppressing columns

* `-1` suppresses lines unique to first file
* `-2` suppresses lines unique to second file
* `-3` suppresses lines common to both files

```bash
$ # suppressing column 3
$ comm -3 colors_1.txt colors_2.txt
        Black
Brown
        Green
Purple
Teal
        White
Yellow
```

* Combining options gives three distinct and useful constructs
* First, getting only common lines to both files

```bash
$ comm -12 colors_1.txt colors_2.txt
Blue
Red
```

* Second, lines unique to first file

```bash
$ comm -23 colors_1.txt colors_2.txt
Brown
Purple
Teal
Yellow
```

* And the third, lines unique to second file

```bash
$ comm -13 colors_1.txt colors_2.txt
Black
Green
White
```

* See also how the above three cases can be done [using grep alone](./gnu_grep.md#search-strings-from-file)
    * **Note** input files do not need to be sorted for `grep` solution
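* A rough sketch of those `grep` equivalents, using copies of the two color files created in a scratch directory (the directory and variable names are assumptions for illustration). Unlike `comm`, the output follows the order of the file being searched

```bash
tmp=$(mktemp -d)
printf '%s\n' Blue Brown Purple Red Teal Yellow > "$tmp/colors_1.txt"
printf '%s\n' Black Blue Green Red White > "$tmp/colors_2.txt"

# -F fixed strings, -x match whole line, -f read search strings from file
# common lines, similar to: comm -12
common=$(grep -Fxf "$tmp/colors_1.txt" "$tmp/colors_2.txt")
# lines only in first file, similar to: comm -23
only_1=$(grep -vFxf "$tmp/colors_2.txt" "$tmp/colors_1.txt")
# lines only in second file, similar to: comm -13
only_2=$(grep -vFxf "$tmp/colors_1.txt" "$tmp/colors_2.txt")

printf '%s\n' "$common" '--' "$only_1" '--' "$only_2"
```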

If the input files are sorted in an order other than the one `comm` expects (numeric sort, for example), use `--nocheck-order` to suppress the error message

```bash
$ comm -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
comm: file 1 is not in sorted order
20
53
101

$ comm --nocheck-order -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
20
53
101
```
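
* An alternative worth considering is to give `comm` the order it expects and apply the numeric sort afterwards. A minimal sketch, with the two number files recreated in a scratch directory and the locale fixed to `C` so that `sort` and `comm` agree (both are assumptions for this illustration)

```bash
tmp=$(mktemp -d)
printf '%s\n' 20 53 3 101 > "$tmp/numbers.txt"
printf '%s\n' 1 10 10 12 23 563 > "$tmp/nums.txt"

# same locale for both commands, default sort order for the inputs
export LC_ALL=C
sort "$tmp/numbers.txt" > "$tmp/s1.txt"
sort "$tmp/nums.txt" > "$tmp/s2.txt"

# no complaint about order now, numeric sort is applied on the result
result=$(comm -23 "$tmp/s1.txt" "$tmp/s2.txt" | sort -n)
printf '%s\n' "$result"
# 3
# 20
# 53
# 101
```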

<br>

#### <a name="files-with-duplicates"></a>Files with duplicates

* Duplicate lines are matched one to one: only as many copies as are present in both files are considered common
* Rest will be unique to respective files
* This is useful for cases like finding lines present in first but not in second, taking into consideration the count of duplicates as well
    * This solution won't be possible with `grep`

```bash
$ paste list1 list2
a       a
a       b
a       c
b       c
b       d
c

$ comm list1 list2
                a
a
a
                b
b
                c
        c
        d

$ comm -23 list1 list2
a
a
b
```

<br>

#### <a name="further-reading-for-comm"></a>Further reading for comm

* `man comm` and `info comm` for more options and detailed documentation
* [comm Q&A on unix stackexchange](http://unix.stackexchange.com/questions/tagged/comm?sort=votes&pageSize=15)

<br>

## <a name="shuf"></a>shuf

```bash
$ shuf --version | head -n1
shuf (GNU coreutils) 8.25

$ man shuf
SHUF(1)                          User Commands                         SHUF(1)

NAME
       shuf - generate random permutations

SYNOPSIS
       shuf [OPTION]... [FILE]
       shuf -e [OPTION]... [ARG]...
       shuf -i LO-HI [OPTION]...

DESCRIPTION
       Write a random permutation of the input lines to standard output.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="random-lines"></a>Random lines

* Without repeating input lines

```bash
$ cat nums.txt
1
10
10
12
23
563

$ # duplicates can end up anywhere
$ # all lines are part of output
$ shuf nums.txt
10
23
1
10
563
12

$ # limit max number of output lines
$ shuf -n2 nums.txt
563
23
```
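Since the output is random, one way to convince yourself that plain `shuf` neither drops nor repeats lines is to sort the shuffled output and compare it with the sorted input. The sketch below does that with a temporary copy of the sample data; the variable names are illustrative.

```bash
tmp=$(mktemp -d)
printf '%s\n' 1 10 10 12 23 563 > "$tmp/nums.txt"

shuffled=$(shuf "$tmp/nums.txt")

# sorting the shuffled lines recovers the sorted input, duplicates included
resorted=$(printf '%s\n' "$shuffled" | sort -n)
expected=$(sort -n "$tmp/nums.txt")

# -n2 limits the output to at most two lines
sample_count=$(shuf -n2 "$tmp/nums.txt" | wc -l)
rm -r "$tmp"
```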

* Use the `-o` option to write to an output file instead of stdout
* Helpful for in-place editing, as `shuf` reads all the input before opening the output file

```bash
$ shuf nums.txt -o nums.txt
$ cat nums.txt
10
12
23
10
563
1
```

* With repeated input lines

```bash
$ # -n3 for max 3 lines, -r allows input lines to be repeated
$ shuf -n3 -r nums.txt
1
1
563

$ seq 3 | shuf -n5 -r
2
1
2
1
2

$ # with -r, if a limit using -n is not specified, shuf will output lines indefinitely
```

* use `-e` option to specify multiple input lines from command line itself

```bash
$ shuf -e red blue green
green
blue
red

$ shuf -e 'hi there' 'hello world' foo bar
bar
hi there
foo
hello world

$ shuf -n2 -e 'hi there' 'hello world' foo bar
foo
hi there

$ shuf -r -n4 -e foo bar
foo
foo
bar
foo
```

<br>

#### <a name="random-integer-numbers"></a>Random integer numbers

* The `-i` option accepts integer range as input to be shuffled

```bash
$ shuf -i 3-8
3
7
6
4
8
5
```

* Combine with other options as needed

```bash
$ shuf -n3 -i 3-8
5
4
7

$ shuf -r -n4 -i 3-8
5
5
7
8

$ shuf -r -n5 -i 0-1
1
0
0
1
1
```
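In scripts, the result is typically captured in a variable. A minimal sketch, with made-up variable names, that simulates a die roll and builds a string of four random digits:

```bash
# simulate rolling a six-sided die once
roll=$(shuf -n1 -i 1-6)
echo "rolled: $roll"

# four random digits, repeats allowed, joined into one string
digits=$(shuf -r -n4 -i 0-9 | tr -d '\n')
echo "digits: $digits"
```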

* Use [seq](./miscellaneous.md#seq) input if negative numbers, floating point, etc are needed

```bash
$ seq 2 -1 -2 | shuf
2
-1
-2
0
1

$ seq 0.3 0.1 0.7 | shuf -n3
0.4
0.5
0.7
```


<br>

#### <a name="further-reading-for-shuf"></a>Further reading for shuf

* `man shuf` and `info shuf` for more options and detailed documentation
* [Generate random numbers in specific range](https://unix.stackexchange.com/questions/140750/generate-random-numbers-in-specific-range)
* [Variable - randomly choose among three numbers](https://unix.stackexchange.com/questions/330689/variable-randomly-chosen-among-three-numbers-10-100-and-1000)
* Related to 'random' stuff:
    * [How to generate a random string?](https://unix.stackexchange.com/questions/230673/how-to-generate-a-random-string)
    * [How can I populate a file with random data?](https://unix.stackexchange.com/questions/33629/how-can-i-populate-a-file-with-random-data)
    * [Run commands at random](https://unix.stackexchange.com/questions/81566/run-commands-at-random)


================================================
FILE: tail_less_cat_head.md
================================================
# <a name="cat-less-tail-and-head"></a>Cat, Less, Tail and Head

**Table of Contents**

* [cat](#cat)
    * [Concatenate files](#concatenate-files)
    * [Accepting input from stdin](#accepting-input-from-stdin)
    * [Squeeze consecutive empty lines](#squeeze-consecutive-empty-lines)
    * [Prefix line numbers](#prefix-line-numbers)
    * [Viewing special characters](#viewing-special-characters)
    * [Writing text to file](#writing-text-to-file)
    * [tac](#tac)
    * [Useless use of cat](#useless-use-of-cat)
    * [Further Reading for cat](#further-reading-for-cat)
* [less](#less)
    * [Navigation commands](#navigation-commands)
    * [Further Reading for less](#further-reading-for-less)
* [tail](#tail)
    * [linewise tail](#linewise-tail)
    * [characterwise tail](#characterwise-tail)
    * [multiple file input for tail](#multiple-file-input-for-tail)
    * [Further Reading for tail](#further-reading-for-tail)
* [head](#head)
    * [linewise head](#linewise-head)
    * [characterwise head](#characterwise-head)
    * [multiple file input for head](#multiple-file-input-for-head)
    * [combining head and tail](#combining-head-and-tail)
    * [Further Reading for head](#further-reading-for-head)
* [Text Editors](#text-editors)

<br>

## <a name="cat"></a>cat

```bash
$ cat --version | head -n1
cat (GNU coreutils) 8.25

$ man cat
CAT(1)                           User Commands                          CAT(1)

NAME
       cat - concatenate files and print on the standard output

SYNOPSIS
       cat [OPTION]... [FILE]...

DESCRIPTION
       Concatenate FILE(s) to standard output.

       With no FILE, or when FILE is -, read standard input.
...
```

* For below examples, `marks_201*` files contain 3 fields delimited by TAB
* To avoid formatting issues, TAB has been converted to spaces using `col -x` while pasting the output here

<br>

#### <a name="concatenate-files"></a>Concatenate files

* One or more files can be given as input; `cat` is also often used to quickly view the contents of a single small file on the terminal
* To save the output of concatenation, just redirect stdout

```bash
$ ls
marks_2015.txt  marks_2016.txt  marks_2017.txt

$ cat marks_201*
Name    Maths   Science
foo     67      78
bar     87      85
Name    Maths   Science
foo     70      75
bar     85      88
Name    Maths   Science
foo     68      76
bar     90      90

$ # save stdout to a file
$ cat marks_201* > all_marks.txt
```

<br>

#### <a name="accepting-input-from-stdin"></a>Accepting input from stdin

```bash
$ # combining input from stdin and other files
$ printf 'Name\tMaths\tScience \nbaz\t56\t63\nbak\t71\t65\n' | cat - marks_2015.txt
Name    Maths   Science
baz     56      63
bak     71      65
Name    Maths   Science
foo     67      78
bar     87      85

$ # - can be placed in whatever order is required
$ printf 'Name\tMaths\tScience \nbaz\t56\t63\nbak\t71\t65\n' | cat marks_2015.txt -
Name    Maths   Science
foo     67      78
bar     87      85
Name    Maths   Science
baz     56      63
bak     71      65
```

<br>

#### <a name="squeeze-consecutive-empty-lines"></a>Squeeze consecutive empty lines

```bash
$ printf 'hello\n\n\nworld\n\nhave a nice day\n'
hello


world

have a nice day
$ printf 'hello\n\n\nworld\n\nhave a nice day\n' | cat -s
hello

world

have a nice day
```

<br>

#### <a name="prefix-line-numbers"></a>Prefix line numbers

```bash
$ # number all lines
$ cat -n marks_201*
     1  Name    Maths   Science
     2  foo     67      78
     3  bar     87      85
     4  Name    Maths   Science
     5  foo     70      75
     6  bar     85      88
     7  Name    Maths   Science
     8  foo     68      76
     9  bar     90      90

$ # number only non-empty lines
$ printf 'hello\n\n\nworld\n\nhave a nice day\n' | cat -sb
     1  hello

     2  world

     3  have a nice day
```

* For more numbering options, check out the command `nl`

```bash
$ whatis nl
nl (1)               - number lines of files
```

<br>

#### <a name="viewing-special-characters"></a>Viewing special characters

* End of line identified by `$`
* Useful for example to see trailing spaces

```bash
$ cat -E marks_2015.txt
Name    Maths   Science $
foo     67      78$
bar     87      85$
```

* TAB identified by `^I`

```bash
$ cat -T marks_2015.txt
Name^IMaths^IScience 
foo^I67^I78
bar^I87^I85
```

* Non-printing characters
* See [Show Non-Printing Characters](http://docstore.mik.ua/orelly/unix/upt/ch25_07.htm) for more detailed info

```bash
$ # NUL character
$ printf 'foo\0bar\0baz\n' | cat -v
foo^@bar^@baz

$ # to check for dos-style line endings
$ printf 'Hello World!\r\n' | cat -v
Hello World!^M

$ printf 'Hello World!\r\n' | dos2unix | cat -v
Hello World!
```

* the `-A` option is equivalent to `-vET`
* the `-e` option is equivalent to `-vE`
* If `dos2unix` and `unix2dos` are not available, see [How to convert DOS/Windows newline (CRLF) to Unix newline (\n)](https://stackoverflow.com/questions/2613800/how-to-convert-dos-windows-newline-crlf-to-unix-newline-n-in-a-bash-script)
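One common fallback is `tr`, sketched below. It deletes every carriage return, so it assumes that carriage returns appear only as part of CRLF line endings in the input.

```bash
# strip all carriage returns, then confirm with cat -v that no ^M is left
cleaned=$(printf 'Hello World!\r\nHave a nice day\r\n' | tr -d '\r' | cat -v)
printf '%s\n' "$cleaned"
```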

<br>

#### <a name="writing-text-to-file"></a>Writing text to file

```bash
$ cat > sample.txt
This is an example of adding text to a new file using cat command.
Press Ctrl+d on a newline to save and quit.

$ cat sample.txt
This is an example of adding text to a new file using cat command.
Press Ctrl+d on a newline to save and quit.
```

* See also how to use [heredoc](http://mywiki.wooledge.org/HereDocument)
    * [How can I write a here doc to a file](https://stackoverflow.com/questions/2953081/how-can-i-write-a-here-doc-to-a-file-in-bash-script)
* See also [difference between Ctrl+c and Ctrl+d to signal end of stdin input in bash](https://unix.stackexchange.com/questions/16333/how-to-signal-the-end-of-stdin-input-in-bash)
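For scripts, where typing interactively is not an option, a here document does the same job. A minimal sketch, with an illustrative temporary path and text:

```bash
tmp=$(mktemp -d)

# quoting the delimiter ('EOF') prevents variable and command expansion
cat > "$tmp/sample.txt" << 'EOF'
This is an example of adding text to a new file using a here document.
No need to press Ctrl+d, $HOME is kept as-is.
EOF

line_count=$(wc -l < "$tmp/sample.txt")
second_line=$(tail -n1 "$tmp/sample.txt")
cat "$tmp/sample.txt"
rm -r "$tmp"
```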

<br>

#### <a name="tac"></a>tac

```bash
$ whatis tac
tac (1)              - concatenate and print files in reverse
$ tac --version | head -n1
tac (GNU coreutils) 8.25

$ seq 3 | tac
3
2
1

$ tac marks_2015.txt
bar     87      85
foo     67      78
Name    Maths   Science
```

* Useful in cases where logic is easier to write when working on reversed file
* Consider this made-up log file: it has many **Warning** lines, but only the lines from the last **Warning** up to the **Error** line need to be extracted
    * See [GNU sed chapter](./gnu_sed.md#lines-between-two-regexps) for details on the `sed` command used below

```bash
$ cat report.log
blah blah
Warning: something went wrong
more blah
whatever
Warning: something else went wrong
some text
some more text
Error: something seriously went wrong
blah blah blah

$ tac report.log | sed -n '/Error:/,/Warning:/p' | tac
Warning: something else went wrong
some text
some more text
Error: something seriously went wrong
```

* Similarly, if characters in lines have to be reversed, use the `rev` command

```bash
$ whatis rev
rev (1)              - reverse lines characterwise
```
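A short sketch of `rev` on its own and combined with `tac`, using made-up input lines:

```bash
# reverse the characters of each line
reversed=$(printf 'foo bar\nhello\n' | rev)

# reverse the line order as well as the characters in each line
both=$(printf 'foo bar\nhello\n' | tac | rev)

printf '%s\n' "$reversed" "$both"
```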

<br>

#### <a name="useless-use-of-cat"></a>Useless use of cat

* `cat` is used so frequently to view contents of a file that somehow users think other commands cannot handle file input
* [UUOC](https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat)
* [Useless Use of Cat Award](http://porkmail.org/era/unix/award.html)

```bash
$ cat report.log | grep -E 'Warning|Error'
Warning: something went wrong
Warning: something else went wrong
Error: something seriously went wrong
$ grep -E 'Warning|Error' report.log
Warning: something went wrong
Warning: something else went wrong
Error: something seriously went wrong
```

* Use [input redirection](http://wiki.bash-hackers.org/howto/redirection_tutorial) if a command doesn't accept file input

```bash
$ cat marks_2015.txt | tr 'A-Z' 'a-z'
name    maths   science
foo     67      78
bar     87      85
$ tr 'A-Z' 'a-z' < marks_2015.txt
name    maths   science
foo     67      78
bar     87      85
```

* However, `cat` should definitely be used where **concatenation** is needed

```bash
$ grep -c 'foo' marks_201*
marks_2015.txt:1
marks_2016.txt:1
marks_2017.txt:1

$ # concatenating first gives the overall count in one shot in this case
$ cat marks_201* | grep -c 'foo'
3
```

<br>

#### <a name="further-reading-for-cat"></a>Further Reading for cat

* [cat Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/cat?sort=votes&pageSize=15)
* [cat Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/cat?sort=votes&pageSize=15)

<br>

## <a name="less"></a>less

```bash
$ less --version | head -n1
less 481 (GNU regular expressions)

$ # By default, pager is used to display the man pages
$ # and usually, pager is linked to less command
$ type pager less
pager is /usr/bin/pager
less is /usr/bin/less

$ realpath /usr/bin/pager
/bin/less
$ realpath /usr/bin/less
/bin/less
$ diff -s /usr/bin/pager /usr/bin/less
Files /usr/bin/pager and /usr/bin/less are identical
```

* `cat` command is NOT suitable for viewing contents of large files on the Terminal
* `less` displays contents of a file, automatically fits to size of Terminal, allows scrolling in either direction and other options for effective viewing
* Usually, `man` command uses `less` command to display the help page
* The navigation commands are similar to `vi` editor

<br>

#### <a name="navigation-commands"></a>Navigation commands

Commonly used commands are given below, press `h` for summary of options

* `g` go to start of file
* `G` go to end of file
* `q` quit
* `/pattern` search for the given pattern in forward direction
* `?pattern` search for the given pattern in backward direction
* `n` repeat the search in the same direction (next match)
* `N` repeat the search in the opposite direction (previous match)

<br>

#### <a name="further-reading-for-less"></a>Further Reading for less

* See `man less` for detailed info on commands and options. For example:
    * `-s` option to squeeze consecutive blank lines
    * `-N` option to prefix line number
* `less` command is an [improved version](https://unix.stackexchange.com/questions/604/isnt-less-just-more) of `more` command
* [differences between most, more and less](https://unix.stackexchange.com/questions/81129/what-are-the-differences-between-most-more-and-less)
* [less Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/less?sort=votes&pageSize=15)

<br>

## <a name="tail"></a>tail

```bash
$ tail --version | head -n1
tail (GNU coreutils) 8.25

$ man tail
TAIL(1)                          User Commands                         TAIL(1)

NAME
       tail - output the last part of files

SYNOPSIS
       tail [OPTION]... [FILE]...

DESCRIPTION
       Print  the  last  10  lines of each FILE to standard output.  With more
       than one FILE, precede each with a header giving the file name.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="linewise-tail"></a>linewise tail

Consider this sample file, with line numbers prefixed

```bash
$ cat sample.txt
 1) Hello World
 2) 
 3) Good day
 4) How are you
 5) 
 6) Just do-it
 7) Believe it
 8) 
 9) Today is sunny
10) Not a bit funny
11) No doubt you like it too
12) 
13) Much ado about nothing
14) He he he
15) Adios amigo
```

* default behavior - display last 10 lines

```bash
$ tail sample.txt
 6) Just do-it
 7) Believe it
 8) 
 9) Today is sunny
10) Not a bit funny
11) No doubt you like it too
12) 
13) Much ado about nothing
14) He he he
15) Adios amigo
```

* Use `-n` option to control number of lines to filter

```bash
$ tail -n3 sample.txt
13) Much ado about nothing
14) He he he
15) Adios amigo

$ # some versions of tail allow omitting the explicit n character
$ tail -5 sample.txt
11) No doubt you like it too
12) 
13) Much ado about nothing
14) He he he
15) Adios amigo
```

* when number is prefixed with `+` sign, all lines are fetched from that particular line number to end of file

```bash
$ tail -n +10 sample.txt
10) Not a bit funny
11) No doubt you like it too
12) 
13) Much ado about nothing
14) He he he
15) Adios amigo

$ seq 13 17 | tail -n +3
15
16
17
```
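A frequent use of the `+` form is dropping a header line before further processing. The sketch below assumes a small TAB-separated file similar to the `marks_201*` files used earlier; the path is illustrative.

```bash
tmp=$(mktemp -d)
printf 'Name\tMaths\tScience\nfoo\t67\t78\nbar\t87\t85\n' > "$tmp/marks.txt"

# start the output from the second line, dropping the header
records=$(tail -n +2 "$tmp/marks.txt")
record_count=$(printf '%s\n' "$records" | wc -l)
first_record=$(printf '%s\n' "$records" | head -n1)

printf '%s\n' "$records"
rm -r "$tmp"
```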

<br>

#### <a name="characterwise-tail"></a>characterwise tail

* Note that this works byte-wise and is not suitable for multi-byte character encodings

```bash
$ # last three characters including the newline character
$ echo 'Hi there!' | tail -c3
e!

$ # excluding the first character
$ echo 'Hi there!' | tail -c +2
i there!
```

<br>

#### <a name="multiple-file-input-for-tail"></a>multiple file input for tail

```bash
$ tail -n2 report.log sample.txt
==> report.log <==
Error: something seriously went wrong
blah blah blah

==> sample.txt <==
14) He he he
15) Adios amigo

$ # -q option to avoid filename in output
$ tail -q -n2 report.log sample.txt
Error: something seriously went wrong
blah blah blah
14) He he he
15) Adios amigo
```

<br>

#### <a name="further-reading-for-tail"></a>Further Reading for tail

* `tail -f` and related options are beyond the scope of this tutorial. Below links might be useful
    * [look out for buffering](http://mywiki.wooledge.org/BashFAQ/009)
    * [Piping tail -f output though grep twice](https://stackoverflow.com/questions/13858912/piping-tail-output-though-grep-twice)
    * [tail and less](https://unix.stackexchange.com/questions/196168/does-less-have-a-feature-like-tail-follow-name-f)
* [tail Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/tail?sort=votes&pageSize=15)
* [tail Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/tail?sort=votes&pageSize=15)

<br>

## <a name="head"></a>head

```bash
$ head --version | head -n1
head (GNU coreutils) 8.25

$ man head
HEAD(1)                          User Commands                         HEAD(1)

NAME
       head - output the first part of files

SYNOPSIS
       head [OPTION]... [FILE]...

DESCRIPTION
       Print  the  first  10 lines of each FILE to standard output.  With more
       than one FILE, precede each with a header giving the file name.

       With no FILE, or when FILE is -, read standard input.
...
```

<br>

#### <a name="linewise-head"></a>linewise head

* default behavior - display first 10 lines

```bash
$ head sample.txt
 1) Hello World
 2) 
 3) Good day
 4) How are you
 5) 
 6) Just do-it
 7) Believe it
 8) 
 9) Today is sunny
10) Not a bit funny
```

* Use `-n` option to control number of lines to filter

```bash
$ head -n3 sample.txt
 1) Hello World
 2) 
 3) Good day

$ # some versions of head allow omitting the explicit n character
$ head -4 sample.txt
 1) Hello World
 2) 
 3) Good day
 4) How are you
```

* when the number is prefixed with a `-` sign, all lines are fetched except that many lines at the end of the file

```bash
$ # except last 9 lines of file
$ head -n -9 sample.txt
 1) Hello World
 2) 
 3) Good day
 4) How are you
 5) 
 6) Just do-it

$ # except last 2 lines
$ seq 13 17 | head -n -2
13
14
15
```

<br>

#### <a name="characterwise-head"></a>characterwise head

* Note that this works byte-wise and is not suitable for multi-byte character encodings

```bash
$ # if output of command doesn't end with newline, prompt will be on same line
$ # to highlight working of command, the prompt for such cases is not shown here

$ # first two characters
$ echo 'Hi there!' | head -c2
Hi

$ # excluding last four characters
$ echo 'Hi there!' | head -c -4
Hi the
```
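To see why byte-wise extraction is risky with multi-byte encodings, the sketch below writes the character `é` as its two UTF-8 bytes using octal escapes and counts what `head -c` keeps. The UTF-8 encoding is an assumption of this example.

```bash
# 'é' in UTF-8 is the two bytes 303 251 (octal), followed here by a newline
byte_count=$(printf '\303\251\n' | wc -c)

# head -c1 keeps only the first byte, which is not a complete character
partial_bytes=$(printf '\303\251\n' | head -c1 | wc -c)

echo "total bytes: $byte_count, kept by head -c1: $partial_bytes"
```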

<br>

#### <a name="multiple-file-input-for-head"></a>multiple file input for head

```bash
$ head -n3 report.log sample.txt
==> report.log <==
blah blah
Warning: something went wrong
more blah

==> sample.txt <==
 1) Hello World
 2) 
 3) Good day

$ # -q option to avoid filename in output
$ head -q -n3 report.log sample.txt
blah blah
Warning: something went wrong
more blah
 1) Hello World
 2) 
 3) Good day
```

<br>

#### <a name="combining-head-and-tail"></a>combining head and tail

* Despite involving two commands, often this combination is faster than equivalent sed/awk versions

```bash
$ head -n11 sample.txt | tail -n3
 9) Today is sunny
10) Not a bit funny
11) No doubt you like it too

$ tail sample.txt | head -n2
 6) Just do-it
 7) Believe it
```
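The first form generalizes to any line range: `head` keeps everything up to the last wanted line and `tail -n +N` then drops the lines before the first wanted one. A sketch with a hypothetical helper named `line_range`, compared against the equivalent `sed` command:

```bash
# print lines START to END (inclusive) of FILE
# usage: line_range START END FILE (hypothetical helper, not a standard command)
line_range() {
    head -n "$2" "$3" | tail -n "+$1"
}

tmp=$(mktemp -d)
seq 15 > "$tmp/nums.txt"

picked=$(line_range 9 11 "$tmp/nums.txt")
same_with_sed=$(sed -n '9,11p' "$tmp/nums.txt")

printf '%s\n' "$picked"
rm -r "$tmp"
```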

<br>

#### <a name="further-reading-for-head"></a>Further Reading for head

* [head Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/head?sort=votes&pageSize=15)

<br>

## <a name="text-editors"></a>Text Editors

For editing text files, the following applications can be used. Of these, `gedit`, `nano`, `vi` and/or `vim` are available in most distros by default

Easy to use

* [gedit](https://wiki.gnome.org/Apps/Gedit)
* [geany](http://www.geany.org/)
* [nano](http://nano-editor.org/)

Powerful text editors

* [vim](https://github.com/vim/vim)
    * [vim learning resources](https://github.com/learnbyexample/scripting_course/blob/master/Vim_curated_resources.md) and [vim reference](https://github.com/learnbyexample/vim_reference) for further info
* [emacs](https://www.gnu.org/software/emacs/)
* [atom](https://atom.io/)
* [sublime](https://www.sublimetext.com/)

Check out [this analysis](https://github.com/jhallen/joes-sandbox/tree/master/editor-perf) for some performance/feature comparisons of various text editors


================================================
FILE: whats_the_difference.md
================================================
# <a name="whats-the-difference"></a>What's the difference

**Table of Contents**

* [cmp](#cmp)
* [diff](#diff)
    * [Comparing Directories](#comparing-directories)
    * [colordiff](#colordiff)

<br>

## <a name="cmp"></a>cmp

```bash
$ cmp --version | head -n1
cmp (GNU diffutils) 3.3

$ man cmp
CMP(1)                           User Commands                          CMP(1)

NAME
       cmp - compare two files byte by byte

SYNOPSIS
       cmp [OPTION]... FILE1 [FILE2 [SKIP1 [SKIP2]]]

DESCRIPTION
       Compare two files byte by byte.

       The optional SKIP1 and SKIP2 specify the number of bytes to skip at the
       beginning of each file (zero by default).
...
```

* As the comparison is byte by byte, it doesn't matter if file is human readable or not
* A typical use case is to check if two executables are same or not

```bash
$ echo 'foo 123' > f1; echo 'food 123' > f2
$ cmp f1 f2
f1 f2 differ: byte 4, line 1

$ # print differing bytes
$ cmp -b f1 f2
f1 f2 differ: byte 4, line 1 is  40   144 d

$ # skip given bytes from each file
$ # if only one number is given, it is used for both inputs
$ cmp -i 3:4 f1 f2
$ echo $?
0

$ # compare only given number of bytes from start of inputs
$ cmp -n 3 f1 f2
$ echo $?
0

$ # suppress output
$ cmp -s f1 f2
$ echo $?
1
```
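Because `cmp -s` communicates only through its exit status, it fits naturally in a shell conditional. A minimal sketch, where `same_or_not` is a hypothetical helper and the files are recreated in a temporary directory:

```bash
tmp=$(mktemp -d)
echo 'foo 123' > "$tmp/f1"
echo 'food 123' > "$tmp/f2"
cp "$tmp/f1" "$tmp/f1_copy"

# report whether two files have identical bytes
same_or_not() {
    if cmp -s "$1" "$2"; then
        echo 'identical'
    else
        echo 'different'
    fi
}

r1=$(same_or_not "$tmp/f1" "$tmp/f1_copy")
r2=$(same_or_not "$tmp/f1" "$tmp/f2")
echo "f1 vs f1_copy: $r1"
echo "f1 vs f2: $r2"
rm -r "$tmp"
```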

* By default, comparison stops at the first difference found
* With the verbose option `-l`, every differing byte is reported and comparison stops when either input reaches end of file

```bash
$ # first column is byte number
$ # second/third column is respective octal value of differing bytes
$ cmp -l f1 f2
4  40 144
5  61  40
6  62  61
7  63  62
8  12  63
cmp: EOF on f1
```

**Further Reading**

* `man cmp` and `info cmp` for more options and detailed documentation


<br>

## <a name="diff"></a>diff

```bash
$ diff --version | head -n1
diff (GNU diffutils) 3.3

$ man diff
DIFF(1)                          User Commands                         DIFF(1)

NAME
       diff - compare files line by line

SYNOPSIS
       diff [OPTION]... FILES

DESCRIPTION
       Compare FILES line by line.
...
```

* in `diff` output, lines from the first input file start with `<`
* lines from the second input file start with `>`
* between the two file contents, `---` is used as separator
* each difference is prefixed by a command that indicates the kind of change (see links at end of section for more details)

```bash
$ paste d1 d2
1       1
2       hello
3       3
world   4

$ diff d1 d2
2c2
< 2
---
> hello
4c4
< world
---
> 4

$ diff <(seq 4) <(seq 5)
4a5
> 5
```

* use `-i` option to ignore case

```bash
$ echo 'Hello World!' > i1
$ echo 'hello world!' > i2

$ diff i1 i2
1c1
< Hello World!
---
> hello world!

$ diff -i i1 i2
$ echo $?
0
```

* ignoring difference in white spaces

```bash
$ # -b option to ignore changes in the amount of white space
$ diff -b <(echo 'good day') <(echo 'good    day')
$ echo $?
0

$ # -w option to ignore all white spaces
$ diff -w <(echo 'hi    there ') <(echo ' hi there')
$ echo $?
0
$ diff -w <(echo 'hi    there ') <(echo 'hithere')
$ echo $?
0

$ # use -B to ignore changes where lines are all blank
$ # use -E to ignore changes due to tab expansion
$ # use -Z to ignore white space at end of line
```
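A small sketch of two of the remaining options, using temporary files with made-up content. It assumes GNU `diff`, where `-B` is `--ignore-blank-lines` and `-Z` is `--ignore-trailing-space`.

```bash
tmp=$(mktemp -d)
printf 'good\n\nday\n' > "$tmp/b1"
printf 'good\nday\n' > "$tmp/b2"
printf 'hi there  \n' > "$tmp/z1"
printf 'hi there\n' > "$tmp/z2"

# -B ignores changes that only add or remove blank lines
blank_status=0
diff -B "$tmp/b1" "$tmp/b2" || blank_status=$?

# -Z ignores white space at the end of lines
trailing_status=0
diff -Z "$tmp/z1" "$tmp/z2" || trailing_status=$?

# without the option, the trailing spaces count as a difference
plain_status=0
diff "$tmp/z1" "$tmp/z2" > /dev/null || plain_status=$?

echo "$blank_status $trailing_status $plain_status"
rm -r "$tmp"
```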

* side-by-side output

```bash
$ diff -y d1 d2
1                                                               1
2                                                             | hello
3                                                               3
world                                                         | 4

$ # -y is usually used along with other options
$ # default width is 130 print columns
$ diff -W 60 --suppress-common-lines -y d1 d2
2                            |  hello
world                        |  4

$ diff -W 20 --left-column -y <(seq 4) <(seq 5)
1     (
2     (
3     (
4     (
      > 5
```

* by default, there is no output if input files are same. Use `-s` option to additionally indicate files are same
* by default, all differences are shown. Use `-q` option to indicate only that files differ

```bash
$ cp i1 i1_copy
$ diff -s i1 i1_copy
Files i1 and i1_copy are identical
$ diff -s i1 i2
1c1
< Hello World!
---
> hello world!

$ diff -q i1 i1_copy
$ diff -q i1 i2
Files i1 and i2 differ

$ # combine them to always get one line output
$ diff -sq i1 i1_copy
Files i1 and i1_copy are identical
$ diff -sq i1 i2
Files i1 and i2 differ
```
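In scripts, the exit status is usually more convenient than the message: `diff` exits with 0 when the inputs are the same, 1 when they differ and 2 if there was trouble. A sketch with illustrative variable names:

```bash
tmp=$(mktemp -d)
echo 'Hello World!' > "$tmp/i1"
echo 'hello world!' > "$tmp/i2"

# -q keeps the output short, the redirection hides it altogether
if diff -q "$tmp/i1" "$tmp/i2" > /dev/null; then
    verdict='same'
else
    verdict='differ'
fi

# ignoring case makes the two files compare equal
if diff -qi "$tmp/i1" "$tmp/i2" > /dev/null; then
    verdict_nocase='same'
else
    verdict_nocase='differ'
fi

echo "$verdict / $verdict_nocase"
rm -r "$tmp"
```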

<br>

#### <a name="comparing-directories"></a>Comparing Directories

* when comparing two files of same name from different directories, specifying the filename is optional for one of the directories

```bash
$ mkdir dir1 dir2
$ echo 'Hello World!' > dir1/i1
$ echo 'hello world!' > dir2/i1

$ diff dir1/i1 dir2
1c1
< Hello World!
---
> hello world!

$ diff -s i1 dir1/
Files i1 and dir1/i1 are identical
$ diff -s . dir1/i1
Files ./i1 and dir1/i1 are identical
```

* if both arguments are directories, all files are compared

```bash
$ touch dir1/report.log dir1/lists dir2/power.log
$ cp f1 dir1/
$ cp f1 dir2/

$ # by default, all differences are reported
$ # as well as filenames which are unique to respective directories
$ diff dir1 dir2
diff dir1/i1 dir2/i1
1c1
< Hello World!
---
> hello world!
Only in dir1: lists
Only in dir2: power.log
Only in dir1: report.log
```

* to report only filenames

```bash
$ diff -sq dir1 dir2
Files dir1/f1 and dir2/f1 are identical
Files dir1/i1 and dir2/i1 differ
Only in dir1: lists
Only in dir2: power.log
Only in dir1: report.log

$ # list only differing files
$ # also useful to copy-paste the command for GUI diffs like tkdiff/vimdiff
$ diff dir1 dir2 | grep '^diff '
diff dir1/i1 dir2/i1
```

* to recursively compare sub-directories as well, use `-r`

```bash
$ mkdir dir1/subdir dir2/subdir
$ echo 'good' > dir1/subdir/f1
$ echo 'goad' > dir2/subdir/f1

$ diff -srq dir1 dir2
Files dir1/f1 and dir2/f1 are identical
Files dir1/i1 and dir2/i1 differ
Only in dir1: lists
Only in dir2: power.log
Only in dir1: report.log
Files dir1/subdir/f1 and dir2/subdir/f1 differ

$ diff -r dir1 dir2 | grep '^diff '
diff -r dir1/i1 dir2/i1
diff -r dir1/subdir/f1 dir2/subdir/f1
```

* See also [GNU diffutils manual - comparing directories](https://www.gnu.org/software/diffutils/manual/diffutils.html#Comparing-Directories) for further options and details like excluding files, ignoring filename case, etc and `dirdiff` command

<br>

#### <a name="colordiff"></a>colordiff

```bash
$ whatis colordiff 
colordiff (1)        - a tool to colorize diff output

$ whatis wdiff
wdiff (1)            - display word differences between text files
```

* simply replace `diff` with `colordiff`

![colordiff](./images/colordiff.png)

* or, pass output of a `diff` tool to `colordiff`

![wdiff to colordiff](./images/wdiff_to_colordiff.png)

* See also [stackoverflow - How to colorize diff on the command line?](https://stackoverflow.com/questions/8800578/how-to-colorize-diff-on-the-command-line) for other options

<br>

**Further Reading**

* `man diff` and `info diff` for more options and detailed documentation
    * [GNU diffutils manual](https://www.gnu.org/software/diffutils/manual/diffutils.html) for a better documentation
* `man -k diff` to get list of all commands related to `diff`
* [diff Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/diff?sort=votes&pageSize=15)
* [unix.stackexchange - GUI diff and merge tools](https://unix.stackexchange.com/questions/4573/which-gui-diff-viewer-would-you-recommend-with-copy-to-left-right-functionality)
* [unix.stackexchange - Understanding diff output](https://unix.stackexchange.com/questions/81998/understanding-of-diff-output)
* [stackoverflow - Using output of diff to create patch](https://stackoverflow.com/questions/437219/using-the-output-of-diff-to-create-the-patch)


================================================
FILE: wheres_my_file.md
================================================
# <a name="where's-my-file"></a>Where's my file

**Table of Contents**

* [find](#find)
* [locate](#locate)

<br>

## <a name="find"></a>find

```bash
$ find --version | head -n1
find (GNU findutils) 4.7.0-git

$ man find
FIND(1)                     General Commands Manual                    FIND(1)

NAME
       find - search for files in a directory hierarchy

SYNOPSIS
       find  [-H]  [-L]  [-P]  [-D  debugopts]  [-Olevel]  [starting-point...]
       [expression]

DESCRIPTION
       This manual page documents the GNU version of find.  GNU find  searches
       the  directory  tree  rooted at each given starting-point by evaluating
       the given expression from left to right,  according  to  the  rules  of
       precedence  (see  section  OPERATORS),  until the outcome is known (the
       left hand side is false for and operations,  true  for  or),  at  which
       point  find  moves  on  to the next file name.  If no starting-point is
       specified, `.' is assumed.
...
```

**Examples**

Filtering based on file name

* `find . -iname 'power.log'` search and print path of file named power.log (ignoring case) in current directory and its sub-directories
* `find -name '*log'` search and print path of all files whose name ends with log in current directory and its sub-directories - using `.` is optional when searching in current directory
* `find -not -name '*log'` print path of all files whose name does NOT end with log in current directory and its sub-directories
* `find -regextype egrep -regex '.*/\w+'` use extended regular expression to match filenames containing only `[a-zA-Z0-9_]` characters
    * `.*/` is needed to match initial part of file path

Filtering based on file type

* `find /home/guest1/proj -type f` print path of all regular files found in specified directory
* `find /home/guest1/proj -type d` print path of all directories found in specified directory
* `find /home/guest1/proj -type f -name '.*'` print path of all hidden files

Filtering based on depth

The relative path `.` is considered as depth 0 directory, files and folders immediately contained in a directory are at depth 1 and so on

* `find -maxdepth 1 -type f` all regular files (including hidden ones) from current directory (without going to sub-directories)
* `find -maxdepth 1 -type f -name '[!.]*'` all regular files (but not hidden ones) from current directory (without going to sub-directories)
    * `-not -name '.*'` can be also used
* `find -mindepth 1 -maxdepth 1 -type d` all directories (including hidden ones) in current directory (without going to sub-directories)
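The depth options can be tried out safely on a throwaway directory tree. The sketch below builds one under a temporary directory (all names are made up) and counts what each expression matches. An explicit starting point is given instead of relying on the default `.`.

```bash
tmp=$(mktemp -d)
mkdir -p "$tmp/proj/sub"
touch "$tmp/proj/a.log" "$tmp/proj/.hidden" "$tmp/proj/sub/b.log"

# regular files directly inside proj, hidden ones included
top_files=$(find "$tmp/proj" -maxdepth 1 -type f | wc -l)

# same, but skipping hidden files
top_visible=$(find "$tmp/proj" -maxdepth 1 -type f -name '[!.]*' | wc -l)

# regular files at any depth
all_files=$(find "$tmp/proj" -type f | wc -l)

# directories at depth 1 only (proj itself is at depth 0)
sub_dirs=$(find "$tmp/proj" -mindepth 1 -maxdepth 1 -type d | wc -l)

echo "$top_files $top_visible $all_files $sub_dirs"
rm -r "$tmp"
```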

Filtering based on file properties

* `find -mtime -2` print files that were modified within last two days in current directory
    * Note that day here means 24 hours
* `find -mtime +7` print files that were modified more than seven days back in current directory
* `find -daystart -type f -mtime -1` files that were modified from beginning of day (not past 24 hours)
* `find -size +10k` print files with size greater than 10 kilobytes in current directory
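* `find -size -1M` matches only empty files, because sizes are rounded up to whole units before comparison. Use `find -size -1048576c` to print files smaller than 1 mebibyte in current directory
* `find -size 2G` print files whose size, rounded up to whole gibibytes, is 2 (larger than 1 GiB and up to 2 GiB) in current directory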
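One subtlety with `-size`: GNU `find` rounds file sizes up to whole units before comparing. The sketch below, using two made-up files in a temporary directory, shows the effect of that rounding on a "less than" test.

```bash
tmp=$(mktemp -d)
: > "$tmp/empty.txt"
printf '0123456789' > "$tmp/ten_bytes.txt"

# a 10-byte file is rounded up to 1 unit of M, so it is not "less than 1M"
# only the empty file (0 units) matches
below_1m=$(find "$tmp" -type f -size -1M | wc -l)

# with the byte unit there is no rounding, so both files match
below_1m_bytes=$(find "$tmp" -type f -size -1048576c | wc -l)

echo "$below_1m $below_1m_bytes"
rm -r "$tmp"
```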

Passing filtered files as input to other commands

* `find report -name '*log*' -exec rm {} \;` delete all files whose name contains log in report folder and its sub-folders
    * here `rm` command is called for every file matching the search conditions
    * since `;` is a special character for shell, it needs to be escaped using `\`
* `find report -name '*log*' -delete` delete all files whose name contains log in report folder and its sub-folders
* `find -name '*.txt' -exec wc {} +` list of files ending with txt are all passed together as arguments to `wc` command instead of executing `wc` command for every file
    * no need to escape the `+` character in this case
    * also note that the specified command may be invoked more than once if the number of files found is too large
* `find -name '*.log' -exec mv {} ../log/ \;` move files ending with .log to log directory present one level above. `mv` is executed once for each filtered file
* `find -name '*.log' -exec mv -t ../log/ {} +` the `-t` option allows specifying the target directory first and then providing multiple files to be moved as arguments
    * Similarly, one can use `-t` for `cp` command
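
The difference between `\;` and `+` can be seen by counting invocations. This is only a sketch, assuming GNU `find` and `mv`; the directories `src` and `log` and the file names are hypothetical. `echo` stands in for the real command so that each invocation prints exactly one line.

```bash
tmp=$(mktemp -d)
mkdir "$tmp/src" "$tmp/log"
touch "$tmp/src/a.log" "$tmp/src/b.log" "$tmp/src/c.log" "$tmp/src/keep.txt"

# \; runs the command once per file: three invocations, so three lines
per_file=$(cd "$tmp/src" && find -name '*.log' -exec echo run: {} \; | wc -l)

# + passes the file names together: a single invocation here, so one line
batched=$(cd "$tmp/src" && find -name '*.log' -exec echo run: {} + | wc -l)

# mv -t names the target directory first, so the batched form works for moving too
(cd "$tmp/src" && find -name '*.log' -exec mv -t ../log/ {} +)
moved=$(ls "$tmp/log" | wc -l)      # number of files now in log
left=$(ls "$tmp/src")               # what remains in src

echo "per file: $per_file, batched: $batched, moved: $moved, left: $left"

rm -rf "$tmp"
```

With only three small file names the batched form needs a single invocation; for a very large number of files `find` may split them across several.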

**Further Reading**

* [using find](http://mywiki.wooledge.org/UsingFind)
* [find examples on SO](https://stackoverflow.com/documentation/bash/566/find#t=201612140534548263961)
* [Collection of find examples](http://alvinalexander.com/unix/edu/examples/find.shtml)
* [find Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/find?sort=votes&pageSize=15)
* [find and tar example](https://unix.stackexchange.com/questions/282762/find-mtime-1-print-xargs-tar-archives-all-files-from-directory-ignoring-t/282885#282885)
* [find Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/find?sort=votes&pageSize=15)
* [Why is looping over find's output bad practice?](https://unix.stackexchange.com/questions/321697/why-is-looping-over-finds-output-bad-practice)


<br>

## <a name="locate"></a>locate

```bash
$ locate --version | head -n1
mlocate 0.26

$ man locate
locate(1)                   General Commands Manual                  locate(1)

NAME
       locate - find files by name

SYNOPSIS
       locate [OPTION]... PATTERN...

DESCRIPTION
       locate  reads  one or more databases prepared by updatedb(8) and writes
       file names matching at least one of the PATTERNs  to  standard  output,
       one per line.

       If  --regex is not specified, PATTERNs can contain globbing characters.
       If any PATTERN contains no globbing characters, locate  behaves  as  if
       the pattern were *PATTERN*.
...
```

Faster alternative to the `find` command when searching for a file by its name. It is based on a database, which gets updated by a `cron` job, so newer files may not be present in the results. Use this command if it is available in your distro and you remember some part of the filename. It is very useful when the entire filesystem has to be searched, in which case `find` might take a very long time compared to `locate`.

**Examples**

* `locate 'power'` print paths containing power from the whole filesystem
    * matches anywhere in path, for example '/home/learnbyexample/lowpower_adder/result.log' and '/home/learnbyexample/power.log' are both valid matches
    * implicitly, `locate` would change the string to `*power*` as no globbing characters are present in the string specified
* `locate -b '\power.log'` print paths whose last component is exactly power.log
    * '/home/learnbyexample/power.log' matches but not '/home/learnbyexample/lowpower.log'
    * since the globbing character '\' is used in the search string, it doesn't get implicitly replaced by `*power.log*`
* `locate -b '\proj_adder'` the `-b` option also comes in handy to print only the path of the directory itself, otherwise every file under that folder would also be displayed
* [find vs locate - pros and cons](https://unix.stackexchange.com/questions/60205/locate-vs-find-usage-pros-and-cons-of-each-other)