Repository: learnbyexample/Command-line-text-processing Branch: master Commit: ce56c851f078 Files: 80 Total size: 519.3 KB Directory structure: gitextract_wr_ra6a8/ ├── README.md ├── exercises/ │ ├── GNU_grep/ │ │ ├── .ref_solutions/ │ │ │ ├── ex01_basic_match.txt │ │ │ ├── ex02_basic_options.txt │ │ │ ├── ex03_multiple_string_match.txt │ │ │ ├── ex04_filenames.txt │ │ │ ├── ex05_word_line_matching.txt │ │ │ ├── ex06_ABC_context_matching.txt │ │ │ ├── ex07_recursive_search.txt │ │ │ ├── ex08_search_pattern_from_file.txt │ │ │ ├── ex09_regex_anchors.txt │ │ │ ├── ex10_regex_this_or_that.txt │ │ │ ├── ex11_regex_quantifiers.txt │ │ │ ├── ex12_regex_character_class_part1.txt │ │ │ ├── ex13_regex_character_class_part2.txt │ │ │ ├── ex14_regex_grouping_and_backreference.txt │ │ │ ├── ex15_regex_PCRE.txt │ │ │ └── ex16_misc_and_extras.txt │ │ ├── ex01_basic_match/ │ │ │ └── sample.txt │ │ ├── ex01_basic_match.txt │ │ ├── ex02_basic_options/ │ │ │ └── sample.txt │ │ ├── ex02_basic_options.txt │ │ ├── ex03_multiple_string_match/ │ │ │ └── sample.txt │ │ ├── ex03_multiple_string_match.txt │ │ ├── ex04_filenames/ │ │ │ ├── greeting.txt │ │ │ ├── poem.txt │ │ │ └── sample.txt │ │ ├── ex04_filenames.txt │ │ ├── ex05_word_line_matching/ │ │ │ ├── greeting.txt │ │ │ ├── sample.txt │ │ │ └── words.txt │ │ ├── ex05_word_line_matching.txt │ │ ├── ex06_ABC_context_matching/ │ │ │ └── sample.txt │ │ ├── ex06_ABC_context_matching.txt │ │ ├── ex07_recursive_search/ │ │ │ ├── msg/ │ │ │ │ ├── greeting.txt │ │ │ │ └── sample.txt │ │ │ ├── poem.txt │ │ │ ├── progs/ │ │ │ │ ├── hello.py │ │ │ │ └── hello.sh │ │ │ └── words.txt │ │ ├── ex07_recursive_search.txt │ │ ├── ex08_search_pattern_from_file/ │ │ │ ├── baz.txt │ │ │ ├── foo.txt │ │ │ └── words.txt │ │ ├── ex08_search_pattern_from_file.txt │ │ ├── ex09_regex_anchors/ │ │ │ └── sample.txt │ │ ├── ex09_regex_anchors.txt │ │ ├── ex10_regex_this_or_that/ │ │ │ └── sample.txt │ │ ├── ex10_regex_this_or_that.txt │ │ ├── 
ex11_regex_quantifiers/ │ │ │ └── garbled.txt │ │ ├── ex11_regex_quantifiers.txt │ │ ├── ex12_regex_character_class_part1/ │ │ │ └── sample_words.txt │ │ ├── ex12_regex_character_class_part1.txt │ │ ├── ex13_regex_character_class_part2/ │ │ │ └── sample.txt │ │ ├── ex13_regex_character_class_part2.txt │ │ ├── ex14_regex_grouping_and_backreference/ │ │ │ └── sample.txt │ │ ├── ex14_regex_grouping_and_backreference.txt │ │ ├── ex15_regex_PCRE/ │ │ │ └── sample.txt │ │ ├── ex15_regex_PCRE.txt │ │ ├── ex16_misc_and_extras/ │ │ │ ├── garbled.txt │ │ │ ├── poem.txt │ │ │ └── sample.txt │ │ ├── ex16_misc_and_extras.txt │ │ └── solve │ └── README.md ├── file_attributes.md ├── gnu_awk.md ├── gnu_grep.md ├── gnu_sed.md ├── miscellaneous.md ├── overview_presentation/ │ ├── baz.json │ ├── foo.xml │ ├── greeting.txt │ └── sample.txt ├── perl_the_swiss_knife.md ├── restructure_text.md ├── ruby_one_liners.md ├── sorting_stuff.md ├── tail_less_cat_head.md ├── whats_the_difference.md └── wheres_my_file.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: README.md ================================================ # Command Line Text Processing Learn about various commands available for common and exotic text processing needs. Examples have been tested on GNU/Linux; syntax and features may vary on other distributions, so consult their respective `man` pages for details. --- :warning: :warning: I'm no longer actively working on this repo. Instead, I've converted existing chapters into ebooks (see [ebook section](#ebooks) below for links), available under the same license. These ebooks are better formatted, updated for newer versions of the software, and include exercises, solutions, etc. Since all the chapters have been converted, I'm archiving this repo. ---
## Ebooks

Individual online ebooks with better formatting, explanations, exercises, solutions, etc:

* [CLI text processing with GNU grep and ripgrep](https://learnbyexample.github.io/learn_gnugrep_ripgrep/)
* [CLI text processing with GNU sed](https://learnbyexample.github.io/learn_gnused/)
* [CLI text processing with GNU awk](https://learnbyexample.github.io/learn_gnuawk/)
* [Ruby One-Liners Guide](https://learnbyexample.github.io/learn_ruby_oneliners/)
* [Perl One-Liners Guide](https://learnbyexample.github.io/learn_perl_oneliners/)
* [CLI text processing with GNU Coreutils](https://learnbyexample.github.io/cli_text_processing_coreutils/)
* [Linux Command Line Computing](https://learnbyexample.github.io/cli-computing/)

See https://learnbyexample.github.io/books/ for links to PDF/EPUB versions and other ebooks.
## Chapters

As mentioned earlier, I'm no longer actively working on these chapters:

* [Cat, Less, Tail and Head](./tail_less_cat_head.md)
    * cat, less, tail, head, Text Editors
* [GNU grep](./gnu_grep.md)
* [GNU sed](./gnu_sed.md)
* [GNU awk](./gnu_awk.md)
* [Perl the swiss knife](./perl_the_swiss_knife.md)
* [Ruby one liners](./ruby_one_liners.md)
* [Sorting stuff](./sorting_stuff.md)
    * sort, uniq, comm, shuf
* [Restructure text](./restructure_text.md)
    * paste, column, pr, fold
* [Whats the difference](./whats_the_difference.md)
    * cmp, diff
* [Wheres my file](./wheres_my_file.md)
* [File attributes](./file_attributes.md)
    * wc, du, df, touch, file
* [Miscellaneous](./miscellaneous.md)
    * cut, tr, basename, dirname, xargs, seq
## Webinar recordings

I recorded a couple of videos based on content in the chapters; not sure if I'll do more:

* [Using the sort command](https://www.youtube.com/watch?v=qLfAwwb5vGs)
* [Using uniq and comm](https://www.youtube.com/watch?v=uAb2kxA2TyQ)

See also my short videos on [Linux command line tips](https://www.youtube.com/watch?v=p0KCLusMd5Q&list=PLTv2U3HnAL4PNTmRqZBSUgKaiHbRL2zeY)
## Exercises

Check out the [exercises](./exercises) directory to solve practice questions on `grep` right from the command line.

See also my [TUI-apps](https://github.com/learnbyexample/TUI-apps) repo for interactive CLI text processing exercises.
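To give a feel for what the first exercise asks, here is a minimal sketch that can be run by hand, assuming GNU `grep` is installed. It recreates `ex01_basic_match/sample.txt` in a scratch directory instead of using the repo's `solve` script, whose interface isn't covered here. The one-sentence-per-line layout of the sample file and the variable names (`workdir`, `day_matches`, and so on) are assumptions made for illustration.

```shell
#!/bin/bash
# Recreate the ex01 sample input in a scratch directory.
# Assumption: one sentence per line; blank lines in the real file are omitted.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" << 'EOF'
Hello World!
Good day
How do you do?
Just do it
Believe it!
Today is sunny
Not a bit funny
No doubt you like it too
Much ado about nothing
He he he
EOF

# Question 1: lines containing the string 'day'
day_matches=$(grep 'day' "$workdir/sample.txt")
printf '%s\n' "$day_matches"

# Question 3: lines containing the string 'do you'
do_you_matches=$(grep 'do you' "$workdir/sample.txt")
printf '%s\n' "$do_you_matches"

# Variation on question 2: count the matching lines instead of printing them
it_count=$(grep -c 'it' "$workdir/sample.txt")
printf '%s\n' "$it_count"
```

Note that `grep 'day'` also matches `Today is sunny`, since a plain pattern matches anywhere in the line; the later exercises on `-w` and anchors deal with that.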
## Contributing

* Please [open an issue](https://github.com/learnbyexample/Command-line-text-processing/issues) for typos or bugs
* As this repo is no longer actively worked upon, **please do not submit pull requests**
* Share the repo with friends/colleagues, on social media, etc to help reach other learners
* In case you need to reach me, mail me at `echo 'yrneaolrknzcyr.arg@tznvy.pbz' | tr 'a-z' 'n-za-m'` or send a DM via [twitter](https://twitter.com/learn_byexample)
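The `tr 'a-z' 'n-za-m'` command in the last item maps each lowercase letter to the one 13 places later in the alphabet (ROT13), so running it a second time restores the input. A minimal sketch, assuming a POSIX `tr` and using the made-up address `hello@example.com` purely for illustration:

```shell
#!/bin/bash
# Hypothetical input string, not a real contact address.
plain='hello@example.com'

# Encode: a-m map to n-z, and n-z map to a-m; other characters pass through.
encoded=$(echo "$plain" | tr 'a-z' 'n-za-m')
echo "$encoded"

# Applying the same mapping again decodes the text.
decoded=$(echo "$encoded" | tr 'a-z' 'n-za-m')
echo "$decoded"
```

Only lowercase letters are translated by this particular mapping; uppercase letters, digits and punctuation are left untouched.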
## Acknowledgements

* [unix.stackexchange](https://unix.stackexchange.com/) and [stackoverflow](https://stackoverflow.com/) - for getting answers to pertinent questions as well as sharpening skills by understanding and answering questions
* Forums like [Linux users](https://www.linkedin.com/groups/65688), [/r/commandline/](https://www.reddit.com/r/commandline/), [/r/linux/](https://www.reddit.com/r/linux/), [/r/ruby/](https://www.reddit.com/r/ruby/), [news.ycombinator](https://news.ycombinator.com/news), [devup](http://devup.in/) and others for valuable feedback (especially spotting mistakes) and encouragement
* See [wikipedia entry 'Roses Are Red'](https://en.wikipedia.org/wiki/Roses_Are_Red) for `poem.txt` used as sample text input file
## License This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/) ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex01_basic_match.txt ================================================ 1) Match lines containing the string: day Solution: grep 'day' sample.txt 2) Match lines containing the string: it Solution: grep 'it' sample.txt 3) Match lines containing the string: do you Solution: grep 'do you' sample.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex02_basic_options.txt ================================================ 1) Match lines containing the string irrespective of lower/upper case: no Solution: grep -i 'no' sample.txt 2) Match lines not containing the string: o Solution: grep -v 'o' sample.txt 3) Match lines with line numbers containing the string: it Solution: grep -n 'it' sample.txt 4) Output only number of matching lines containing the string: a Solution: grep -c 'a' sample.txt 5) Match first two lines containing the string: do Solution: grep -m2 'do' sample.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex03_multiple_string_match.txt ================================================ 1) Match lines containing either of these three strings String1: Not String2: he String3: sun Solution: grep -e 'Not' -e 'he' -e 'sun' sample.txt 2) Match lines containing both these strings String1: He String2: or Solution: grep 'He' sample.txt | grep 'or' 3) Match lines containing either of these two strings String1: a String2: i and contains this as well String3: do Solution: grep -e 'a' -e 'i' sample.txt | grep 'do' 4) Match lines containing the string String1: it but not these strings String2: No String3: no Solution: grep 'it' sample.txt | grep -vi 'no' ================================================ FILE: 
exercises/GNU_grep/.ref_solutions/ex04_filenames.txt ================================================ Note: All files present in the directory should be given as file inputs to grep 1) Show only filenames containing the string: are Solution: grep -l 'are' * 2) Show only filenames NOT containing the string: two Solution: grep -L 'two' * 3) Match all lines containing the string: are Solution: grep 'are' * 4) Match maximum of two matching lines along with filenames containing the character: a Solution: grep -m2 'a' * 5) Match all lines without prefixing filename containing the string: to Solution: grep -h 'to' * ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex05_word_line_matching.txt ================================================ Note: All files present in the directory should be given as file inputs to grep 1) Match lines containing whole word: do Solution: grep -w 'do' * 2) Match whole lines containing the string: Hello World Solution: grep -x 'Hello World' * 3) Match lines containing these whole words: Word1: He Word2: far Solution: grep -w -e 'far' -e 'He' * 4) Match lines containing the whole word: you and NOT containing the case insensitive string: How Solution: grep -w 'you' * | grep -vi 'how' ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex06_ABC_context_matching.txt ================================================ 1) Get lines and 3 following it containing the string: you Solution: grep -A3 'you' sample.txt 2) Get lines and 2 preceding it containing the string: is Solution: grep -B2 'is' sample.txt 3) Get lines and 1 following/preceding containing the string: Not Solution: grep -C1 'Not' sample.txt 4) Get lines and 1 following and 4 preceding containing the string: Not Solution: grep -A1 -B4 'Not' sample.txt 5) Get lines and 1 preceding it containing the string: you there should be no separator between the matches Solution: grep --no-group-separator -B1 'you' sample.txt 
6) Get lines and 1 preceding it containing the string: you the separator between the matches should be: ##### Solution: grep --group-separator='#####' -B1 'you' sample.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex07_recursive_search.txt ================================================ Note: Every file in this directory and sub-directories is input for grep, unless otherwise specified 1) Match all lines containing the string: you Solution: grep -r 'you' 2) Show only filenames matching the string: Hello filenames should only end with .txt Solution: grep -rl --include='*.txt' 'Hello' 3) Show only filenames matching the string: Hello filenames should NOT end with .txt Solution: grep -rl --exclude='*.txt' 'Hello' 4) Show only filenames matching the string: are should not include the directory: progs Solution: grep -rl --exclude-dir='progs' 'are' 5) Show only filenames matching the string: are should NOT include these directories dir1: progs dir2: msg Solution: grep -rl --exclude-dir='progs' --exclude-dir='msg' 'are' 6) Show only filenames matching the string: are should include files only from sub-directories hint: use shell glob pattern to specify directories to search Solution: grep -rl 'are' */ ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex08_search_pattern_from_file.txt ================================================ Note: words.txt has only whole words per line, use it as file input when task is to match whole words 1) Match all strings from file words.txt in file baz.txt Solution: grep -f words.txt baz.txt 2) Match all words from file words.txt in file foo.txt should only match whole words should print only matching words, not entire line Solution: grep -owf words.txt foo.txt 3) Show common lines between foo.txt and baz.txt Solution: grep -Fxf foo.txt baz.txt 4) Show lines present in baz.txt but not in foo.txt Solution: grep -Fxvf foo.txt baz.txt 5) Show lines present 
in foo.txt but not in baz.txt Solution: grep -Fxvf baz.txt foo.txt 6) Find all words common between all three files in the directory should only match whole words should print only matching words, not entire line Solution: grep -owf words.txt foo.txt | grep -owf- baz.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex09_regex_anchors.txt ================================================ 1) Match all lines starting with: no Solution: grep '^no' sample.txt 2) Match all lines ending with: it Solution: grep 'it$' sample.txt 3) Match all lines containing whole word: do Solution: grep -w 'do' sample.txt 4) Match all lines containing words starting with: do Solution: grep '\<do' sample.txt 5) Match all lines containing words ending with: do Solution: grep 'do\>' sample.txt 6) Match all lines starting with: ^ Solution: grep '^^' sample.txt 7) Match all lines ending with: $ Solution: grep '$$' sample.txt 8) Match all lines containing the string: in not surrounded by word boundaries, for ex: mint but not tin or ink Solution: grep '\Bin\B' sample.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex10_regex_this_or_that.txt ================================================ 1) Match all lines containing any of these strings: String1: day String2: not Solution: grep -E 'day|not' sample.txt 2) Match all lines containing any of these whole words: String1: he String2: in Solution: grep -wE 'he|in' sample.txt 3) Match all lines containing any of these strings: String1: you String2: be String3: to String4: he Solution: grep -E 'he|be|to|you' sample.txt 4) Match all lines containing any of these strings: String1: you String2: be String3: to String4: he but NOT these strings: String1: it String2: do Solution: grep -E 'he|be|to|you' sample.txt | grep -vE 'do|it' 5) Match all lines starting with any of these strings: String1: no String2: to Solution: grep -E '^no|^to' sample.txt ================================================ FILE: 
exercises/GNU_grep/.ref_solutions/ex11_regex_quantifiers.txt ================================================ 1) Extract all 3 character strings surrounded by word boundaries Solution: grep -ow '...' garbled.txt 2) Extract largest string from each line starting with character: d ending with character : g Solution: grep -o 'd.*g' garbled.txt 3) Extract all strings from each line starting with character: d followed by zero or one: o ending with character : g Solution: grep -oE 'do?g' garbled.txt 4) Extract all strings from each line starting with character: d followed by zero or one of any character ending with character : g Solution: grep -oE 'd.?g' garbled.txt 5) Extract all strings from each line starting with character: g followed by at least one: o ending with character : d Solution: grep -oE 'go+d' garbled.txt 6) Extract all strings from each line starting with character : g followed by exactly six: o ending with character : d Solution: grep -oE 'go{6}d' garbled.txt 7) Extract all strings from each line starting with character : g followed by min two and max four: o ending with character : d Solution: grep -oE 'go{2,4}d' garbled.txt 8) Extract all strings from each line starting with character: d followed by max of two : o ending with character : g Solution: grep -oE 'do{,2}g' garbled.txt 9) Extract all strings from each line starting with character : g followed by min of three: o ending with character : d Solution: grep -oE 'go{3,}d' garbled.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex12_regex_character_class_part1.txt ================================================ 1) Match all lines containing any of these characters: character1: q character2: x character3: z Solution: grep '[qzx]' sample_words.txt 2) Match all lines containing any of these characters: character1: c character2: f followed by any character followed by : t Solution: grep '[cf].t' sample_words.txt 3) Extract all words starting with character: 
s ignore case should contain only alphabets minimum two letters should be surrounded by word boundaries Solution: grep -iowE 's[a-z]+' sample_words.txt 4) Extract all words made up of these characters: character1: a character2: c character3: e character4: r character5: s ignore case should contain only alphabets should be surrounded by word boundaries Solution: grep -iowE '[acers]+' sample_words.txt 5) Extract all numbers surrounded by word boundaries Solution: grep -ow '[0-9]*' sample_words.txt 6) Extract all numbers surrounded by word boundaries matching the condition 30 <= number <= 70 Solution: grep -owE '[3-6][0-9]|70' sample_words.txt 7) Extract all words made up of non-vowel characters ignore case should contain only alphabets and at least two should be surrounded by word boundaries Solution: grep -iowE '[b-df-hj-np-tv-z]{2,}' sample_words.txt 8) Extract all sequence of strings consisting of character: - surrounded on either side by zero or more case insensitive alphabets Solution: grep -io '[a-z]*-[a-z]*' sample_words.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex13_regex_character_class_part2.txt ================================================ 1) Extract all characters before first occurrence of = Solution: grep -o '^[^=]*' sample.txt 2) Extract all characters from start of line made up of these characters upper or lower case alphabets all digits the underscore character Solution: grep -o '^\w*' sample.txt 3) Match all lines containing the sequence String1: there any number of whitespace String2: have Solution: grep 'there\s*have' sample.txt 4) Extract all characters from start of line made up of these characters upper or lower case alphabets all digits the characters [ and ] ending with ] Solution: grep -oi '^[]a-z0-9[]*]' sample.txt 5) Extract all punctuation characters from first line Solution: grep -om1 '[[:punct:]]' sample.txt ================================================ FILE: 
exercises/GNU_grep/.ref_solutions/ex14_regex_grouping_and_backreference.txt ================================================ 1) Match lines containing these strings String1: scare String2: spore Solution: grep -E 's(po|ca)re' sample.txt 2) Extract these words Word1: handy Word2: hand Word3: hands Word4: handful Solution: grep -oE 'hand([sy]|ful)?' sample.txt 3) Extract all whole words with at least one letter occurring twice in the word ignore case only alphabets the letter occurring twice need not be placed next to each other Solution: grep -ioE '[a-z]*([a-z])[a-z]*\1[a-z]*' sample.txt 4) Match lines where same sequence of three consecutive alphabets is matched another time in the same line ignore case Solution: grep -iE '([a-z]{3}).*\1' sample.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex15_regex_PCRE.txt ================================================ 1) Extract all strings to the right of = provided characters from start of line until = do not include [ or ] Solution: grep -oP '^[^][=]+=\K.*' sample.txt 2) Match all lines containing the string: Hi but shouldn't be followed afterwards in the line by: are Solution: grep -P 'Hi(?!.*are)' sample.txt 3) Extract from start of line up to the string: Hi provided it is followed afterwards in the line by: you Solution: grep -oP '.*Hi(?=.*you)' sample.txt 4) Extract all sequence of characters surrounded on both sides by space character the space character should not be part of output Solution: grep -oP ' \K[^ ]+(?= )' sample.txt 5) Extract all words made of upper or lower case alphabets at least two letters in length surrounded by word boundaries should not contain consecutive repeated alphabets Solution: grep -iowP '[a-z]*([a-z])\1[a-z]*(*SKIP)(*F)|[a-z]{2,}' sample.txt ================================================ FILE: exercises/GNU_grep/.ref_solutions/ex16_misc_and_extras.txt ================================================ Note: all files in directory are input to 
grep, unless otherwise specified 1) Extract all negative numbers starts with - followed by one or more digits do not output filenames Solution: grep -hoE -- '-[0-9]+' * 2) Display only filenames containing these two strings anywhere in the file String1: day String2: and Solution: grep -zlE 'day.*and|and.*day' * 3) The below command grep -c '^Solution:' ../.ref_solutions/* will give number of questions in each exercise. Change it, using another command and pipe if needed, so that only overall total is printed Solution: cat ../.ref_solutions/* | grep -c '^Solution:' ================================================ FILE: exercises/GNU_grep/ex01_basic_match/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe it! Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ================================================ FILE: exercises/GNU_grep/ex01_basic_match.txt ================================================ 1) Match lines containing the string: day 2) Match lines containing the string: it 3) Match lines containing the string: do you ================================================ FILE: exercises/GNU_grep/ex02_basic_options/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe it! 
Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ================================================ FILE: exercises/GNU_grep/ex02_basic_options.txt ================================================ 1) Match lines containing the string irrespective of lower/upper case: no 2) Match lines not containing the string: o 3) Match lines with line numbers containing the string: it 4) Output only number of matching lines containing the string: a 5) Match first two lines containing the string: do ================================================ FILE: exercises/GNU_grep/ex03_multiple_string_match/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe it! Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ================================================ FILE: exercises/GNU_grep/ex03_multiple_string_match.txt ================================================ 1) Match lines containing either of these three strings String1: Not String2: he String3: sun 2) Match lines containing both these strings String1: He String2: or 3) Match lines containing either of these two strings String1: a String2: i and contains this as well String3: do 4) Match lines containing the string String1: it but not these strings String2: No String3: no ================================================ FILE: exercises/GNU_grep/ex04_filenames/greeting.txt ================================================ Hi, how are you? Hola :) Hello world Good day Rock on ================================================ FILE: exercises/GNU_grep/ex04_filenames/poem.txt ================================================ Roses are red, Violets are blue, Sugar is sweet, And so are you. ================================================ FILE: exercises/GNU_grep/ex04_filenames/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe it! 
Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ================================================ FILE: exercises/GNU_grep/ex04_filenames.txt ================================================ Note: All files present in the directory should be given as file inputs to grep 1) Show only filenames containing the string: are 2) Show only filenames NOT containing the string: two 3) Match all lines containing the string: are 4) Match maximum of two matching lines along with filenames containing the character: a 5) Match all lines without prefixing filename containing the string: to ================================================ FILE: exercises/GNU_grep/ex05_word_line_matching/greeting.txt ================================================ Hi, how are you? Hola :) Hello World Good day Rock on ================================================ FILE: exercises/GNU_grep/ex05_word_line_matching/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe it! 
Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ================================================ FILE: exercises/GNU_grep/ex05_word_line_matching/words.txt ================================================ afar far carfare farce faraway airfare ================================================ FILE: exercises/GNU_grep/ex05_word_line_matching.txt ================================================ Note: All files present in the directory should be given as file inputs to grep 1) Match lines containing whole word: do 2) Match whole lines containing the string: Hello World 3) Match lines containing these whole words: Word1: He Word2: far 4) Match lines containing the whole word: you and NOT containing the case insensitive string: How ================================================ FILE: exercises/GNU_grep/ex06_ABC_context_matching/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe it! Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ================================================ FILE: exercises/GNU_grep/ex06_ABC_context_matching.txt ================================================ 1) Get lines and 3 following it containing the string: you 2) Get lines and 2 preceding it containing the string: is 3) Get lines and 1 following/preceding containing the string: Not 4) Get lines and 1 following and 4 preceding containing the string: Not 5) Get lines and 1 preceding it containing the string: you there should be no separator between the matches 6) Get lines and 1 preceding it containing the string: you the separator between the matches should be: ##### ================================================ FILE: exercises/GNU_grep/ex07_recursive_search/msg/greeting.txt ================================================ Hi, how are you? 
Hola :) Hello World Good day Rock on ================================================ FILE: exercises/GNU_grep/ex07_recursive_search/msg/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe it! Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ================================================ FILE: exercises/GNU_grep/ex07_recursive_search/poem.txt ================================================ Roses are red, Violets are blue, Sugar is sweet, And so are you. ================================================ FILE: exercises/GNU_grep/ex07_recursive_search/progs/hello.py ================================================ #!/usr/bin/python3 print("Hello World") ================================================ FILE: exercises/GNU_grep/ex07_recursive_search/progs/hello.sh ================================================ #!/bin/bash echo "Hello $USER" echo "Today is $(date -u +%A)" echo 'Hope you are having a nice day' ================================================ FILE: exercises/GNU_grep/ex07_recursive_search/words.txt ================================================ afar far carfare farce faraway airfare ================================================ FILE: exercises/GNU_grep/ex07_recursive_search.txt ================================================ Note: Every file in this directory and sub-directories is input for grep, unless otherwise specified 1) Match all lines containing the string: you 2) Show only filenames matching the string: Hello filenames should only end with .txt 3) Show only filenames matching the string: Hello filenames should NOT end with .txt 4) Show only filenames matching the string: are should not include the directory: progs 5) Show only filenames matching the string: are should NOT include these directories dir1: progs dir2: msg 6) Show only filenames matching the string: are should include files only from sub-directories hint: use shell glob 
pattern to specify directories to search ================================================ FILE: exercises/GNU_grep/ex08_search_pattern_from_file/baz.txt ================================================ I saw a few red cars going that way To the end! Are you coming today to the party? a[5] = 'good'; Have you read the Harry Potter series? ================================================ FILE: exercises/GNU_grep/ex08_search_pattern_from_file/foo.txt ================================================ part a[5] = 'good'; I saw a few red cars going that way Believe it! to do list ================================================ FILE: exercises/GNU_grep/ex08_search_pattern_from_file/words.txt ================================================ car part to read ================================================ FILE: exercises/GNU_grep/ex08_search_pattern_from_file.txt ================================================ Note: words.txt has only whole words per line, use it as file input when task is to match whole words 1) Match all strings from file words.txt in file baz.txt 2) Match all words from file words.txt in file foo.txt should only match whole words should print only matching words, not entire line 3) Show common lines between foo.txt and baz.txt 4) Show lines present in baz.txt but not in foo.txt 5) Show lines present in foo.txt but not in baz.txt 6) Find all words common between all three files in the directory should only match whole words should print only matching words, not entire line ================================================ FILE: exercises/GNU_grep/ex09_regex_anchors/sample.txt ================================================ hello world! good day how do you do? just do it believe it! 
today is sunny not a bit funny no doubt you like it too much ado about nothing he he he ^ could be exponentiation or xor operator scalar variables in perl start with $ ================================================ FILE: exercises/GNU_grep/ex09_regex_anchors.txt ================================================ 1) Match all lines starting with: no 2) Match all lines ending with: it 3) Match all lines containing whole word: do 4) Match all lines containing words starting with: do 5) Match all lines containing words ending with: do 6) Match all lines starting with: ^ 7) Match all lines ending with: $ 8) Match all lines containing the string: in not surrounded by word boundaries, for ex: mint but not tin or ink ================================================ FILE: exercises/GNU_grep/ex10_regex_this_or_that/sample.txt ================================================ hello world! good day how do you do? just do it believe it! today is sunny not a bit funny no doubt you like it too much ado about nothing he he he ^ could be exponentiation or xor operator scalar variables in perl start with $ ================================================ FILE: exercises/GNU_grep/ex10_regex_this_or_that.txt ================================================ 1) Match all lines containing any of these strings: String1: day String2: not 2) Match all lines containing any of these whole words: String1: he String2: in 3) Match all lines containing any of these strings: String1: you String2: be String3: to String4: he 4) Match all lines containing any of these strings: String1: you String2: be String3: to String4: he but NOT these strings: String1: it String2: do 5) Match all lines starting with any of these strings: String1: no String2: to ================================================ FILE: exercises/GNU_grep/ex11_regex_quantifiers/garbled.txt ================================================ gd god goood oh gold goooooodyyyy dog dg dig good gold doogoodog c@t made forty justify dodging a 
toy ================================================ FILE: exercises/GNU_grep/ex11_regex_quantifiers.txt ================================================ 1) Extract all 3 character strings surrounded by word boundaries 2) Extract largest string from each line starting with character: d ending with character : g 3) Extract all strings from each line starting with character: d followed by zero or one: o ending with character : g 4) Extract all strings from each line starting with character: d followed by zero or one of any character ending with character : g 5) Extract all strings from each line starting with character: g followed by at least one: o ending with character : d 6) Extract all strings from each line starting with character : g followed by exactly six: o ending with character : d 7) Extract all strings from each line starting with character : g followed by min two and max four: o ending with character : d 8) Extract all strings from each line starting with character: d followed by max of two : o ending with character : g 9) Extract all strings from each line starting with character : g followed by min of three: o ending with character : d ================================================ FILE: exercises/GNU_grep/ex12_regex_character_class_part1/sample_words.txt ================================================ far 30 scarce f@$t 42 fit Cute 34 quite pry far-fetched Sure 70 cast-away 12 good hue he cry just Nymph race Peace. 
67 foo;bar;baz;p@t ARE 72 cut copy paste p1ate rest 512 Sync ================================================ FILE: exercises/GNU_grep/ex12_regex_character_class_part1.txt ================================================ 1) Match all lines containing any of these characters: character1: q character2: x character3: z 2) Match all lines containing any of these characters: character1: c character2: f followed by any character followed by : t 3) Extract all words starting with character: s ignore case should contain only alphabets minimum two letters should be surrounded by word boundaries 4) Extract all words made up of these characters: character1: a character2: c character3: e character4: r character5: s ignore case should contain only alphabets should be surrounded by word boundaries 5) Extract all numbers surrounded by word boundaries 6) Extract all numbers surrounded by word boundaries matching the condition 30 <= number <= 70 7) Extract all words made up of non-vowel characters ignore case should contain only alphabets and at least two should be surrounded by word boundaries 8) Extract all sequence of strings consisting of character: - surrounded on either side by zero or more case insensitive alphabets ================================================ FILE: exercises/GNU_grep/ex13_regex_character_class_part2/sample.txt ================================================ a[2]='sample string' foo_bar=4232 appx_pi=3.14 greeting="Hi there have a nice day" food[4]="dosa" b[0][1]=42 ================================================ FILE: exercises/GNU_grep/ex13_regex_character_class_part2.txt ================================================ 1) Extract all characters before first occurrence of = 2) Extract all characters from start of line made up of these characters upper or lower case alphabets all digits the underscore character 3) Match all lines containing the sequence String1: there any number of whitespace String2: have 4) Extract all characters from start of line 
made up of these characters upper or lower case alphabets all digits the characters [ and ] ending with ] 5) Extract all punctuation characters from first line ================================================ FILE: exercises/GNU_grep/ex14_regex_grouping_and_backreference/sample.txt ================================================ hands hand library scare handy handful scared too big time eel candy spare food regulate circuit spore stare tire tempt cold malady ================================================ FILE: exercises/GNU_grep/ex14_regex_grouping_and_backreference.txt ================================================ 1) Match lines containing these strings String1: scare String2: spore 2) Extract these words Word1: handy Word2: hand Word3: hands Word4: handful 3) Extract all whole words with at least one letter occurring twice in the word ignore case only alphabets the letter occurring twice need not be placed next to each other 4) Match lines where same sequence of three consecutive alphabets is matched another time in the same line ignore case ================================================ FILE: exercises/GNU_grep/ex15_regex_PCRE/sample.txt ================================================ a[2]='Hi, how are you?' 
foo_bar=4232 appx_pi=3.14 greeting="Hi there have a nice day" food[4]="dosa" b[0][1]=42 ================================================ FILE: exercises/GNU_grep/ex15_regex_PCRE.txt ================================================ 1) Extract all strings to the right of = provided characters from start of line until = do not include [ or ] 2) Match all lines containing the string: Hi but shouldn't be followed afterwards in the line by: are 3) Extract from start of line up to the string: Hi provided it is followed afterwards in the line by: you 4) Extract all sequence of characters surrounded on both sides by space character the space character should not be part of output 5) Extract all words made of upper or lower case alphabets at least two letters in length surrounded by word boundaries should not contain consecutive repeated alphabets ================================================ FILE: exercises/GNU_grep/ex16_misc_and_extras/garbled.txt ================================================ day and night -43 and 99 and 12 ================================================ FILE: exercises/GNU_grep/ex16_misc_and_extras/poem.txt ================================================ Roses are red, Violets are blue, Sugar is sweet, And so are you. 
Good day to you :) ================================================ FILE: exercises/GNU_grep/ex16_misc_and_extras/sample.txt ================================================ account balance: -2300 good day foo and bar and baz ================================================ FILE: exercises/GNU_grep/ex16_misc_and_extras.txt ================================================ Note: all files in directory are input to grep, unless otherwise specified 1) Extract all negative numbers starts with - followed by one or more digits do not output filenames 2) Display only filenames containing these two strings anywhere in the file String1: day String2: and 3) The below command grep -c '^Solution:' ../.ref_solutions/* will give number of questions in each exercise. Change it, using another command and pipe if needed, so that only overall total is printed ================================================ FILE: exercises/GNU_grep/solve ================================================ dir_name=$(basename "$PWD") ref_file="../.ref_solutions/$dir_name.txt" sol_file="../$dir_name.txt" tmp_file='../.tmp.txt' # color output tcolors=$(tput colors) if [[ -n $tcolors && $tcolors -ge 8 ]]; then red=$(tput setaf 1) green=$(tput setaf 2) blue=$(tput setaf 4) clr_color=$(tput sgr0) else red='' green='' blue='' clr_color='' fi sub_sol=0 if [[ $1 == -s ]]; then prev_cmd=$(fc -ln -2 | sed 's/^[ \t]*//;q') sub_sol=1 elif [[ $1 == -q ]]; then # highlight the question to be solved next # or show only the (unanswered)? 
question to be solved next cat "$sol_file" return elif [[ -n $1 ]]; then echo -e 'Unknown option...Exiting script' return fi count=0 sol_count=0 err_count=0 while IFS= read -u3 -r ref_line && read -u4 -r sol_line; do if [[ "${ref_line:0:9}" == Solution: ]]; then (( count++ )) if [[ $sub_sol == 1 && -z $sol_line ]]; then sol_line="$prev_cmd" sub_sol=0 fi if [[ "$(eval "command ${ref_line:10}")" == "$(eval "command $sol_line")" ]]; then (( sol_count++ )) # use color if terminal supports echo '---------------------------------------------' echo "Match for question $count:" echo "${red}Submitted solution:${clr_color} $sol_line" echo "${green}Reference solution:${clr_color} ${ref_line:10}" echo '---------------------------------------------' else (( err_count++ )) if [[ $err_count == 1 && -n $sol_line ]]; then echo '---------------------------------------------' echo "Mismatch for question $count:" echo "$(tput bold)${red}Expected output is:${clr_color}$(tput rmso)" eval "command ${ref_line:10}" echo '---------------------------------------------' fi sol_line='' fi fi echo "$sol_line" >> "$tmp_file" done 3<"$ref_file" 4<"$sol_file" ((count==sol_count)) && printf "\t\t$(tput bold)${blue}All Pass${clr_color}$(tput rmso)\t\t\n" mv "$tmp_file" "$sol_file" # vim: syntax=bash ================================================ FILE: exercises/README.md ================================================ # Exercises Instructions and shell script here assumes `bash` shell. Tested on *GNU bash, version 4.3.46*
* For example, the first exercise for **GNU_grep**
    * directory: `ex01_basic_match`
    * question file: `ex01_basic_match.txt`
    * solution reference: `.ref_solutions/ex01_basic_match.txt`
* Each exercise contains one or more questions to be solved
* The script `solve` will assist in checking solutions

```bash
$ git clone https://github.com/learnbyexample/Command-line-text-processing.git
$ cd Command-line-text-processing/exercises/GNU_grep/
$ ls
ex01_basic_match      ex02_basic_options      ex03_multiple_string_match      solve
ex01_basic_match.txt  ex02_basic_options.txt  ex03_multiple_string_match.txt

$ find -name 'ex01*'
./.ref_solutions/ex01_basic_match.txt
./ex01_basic_match
./ex01_basic_match.txt
```
* Solving the questions * Go to the exercise folder * Use `ls` to see input file(s) * To see the problems for that exercise, follow the steps below ```bash $ cd ex01_basic_match $ ls sample.txt $ # to see the questions $ source ../solve -q 1) Match lines containing the string: day 2) Match lines containing the string: it 3) Match lines containing the string: do you $ # or open the questions file with your fav editor $ gvim ../$(basename "$PWD").txt $ # create an alias to use from any ex* directory $ alias oq='gvim ../$(basename "$PWD").txt' $ oq ```
* Submitting solutions one by one * immediately after executing command that answers a question, call the `solve` script ```bash $ grep 'day' sample.txt Good day Today is sunny $ source ../solve -s --------------------------------------------- Match for question 1: Submitted solution: grep 'day' sample.txt Reference solution: grep 'day' sample.txt --------------------------------------------- ```
* Submit all at once * by editing the `../$(basename "$PWD").txt` file directly * the answer should replace the empty line immediately following the question * **Note** * there are different ways to solve the same question * but for specific exercise like **GNU_grep** try to solve using `grep` only * also, remember that `eval` is used to check equivalence. So be sure of commands submitted ```bash $ cat ../$(basename "$PWD").txt 1) Match lines containing the string: day grep 'day' sample.txt 2) Match lines containing the string: it sed -n '/it/p' sample.txt 3) Match lines containing the string: do you echo 'How do you do?' $ source ../solve --------------------------------------------- Match for question 1: Submitted solution: grep 'day' sample.txt Reference solution: grep 'day' sample.txt --------------------------------------------- --------------------------------------------- Match for question 2: Submitted solution: sed -n '/it/p' sample.txt Reference solution: grep 'it' sample.txt --------------------------------------------- --------------------------------------------- Match for question 3: Submitted solution: echo 'How do you do?' Reference solution: grep 'do you' sample.txt --------------------------------------------- All Pass ```
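The note about `eval` is why the `sed` and `echo` answers above are accepted: only the output of the commands is compared, not the commands themselves. A minimal sketch of that idea, assuming nothing beyond a POSIX-style shell (the helper name `same_output` is hypothetical and is not part of the `solve` script):

```bash
# hypothetical helper: succeeds when both command strings print the same output
same_output() {
    [ "$(eval "$1")" = "$(eval "$2")" ]
}

# different commands, identical output: treated as a match
if same_output "printf 'Good day\n'" "echo 'Good day'"; then
    result='match'
else
    result='mismatch'
fi
echo "$result"
```

Because the submitted text is passed to `eval`, it is executed as-is, so a careless command in the solution file can modify or delete files.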
* Then move on to next exercise directory * Create aliases for different commands for easy use, after checking that the aliases are available of course ```bash $ type cs cq ca nq pq bash: type: cs: not found bash: type: cq: not found bash: type: ca: not found bash: type: nq: not found bash: type: pq: not found $ alias cs='source ../solve -s' $ alias cq='source ../solve -q' $ alias ca='source ../solve' $ # to go to directory of next question $ nq() { d=$(basename "$PWD"); nd=$(printf "../ex%02d*/" $((${d:2:2}+1))); cd $nd ; } $ # to go to directory of previous question $ pq() { d=$(basename "$PWD"); pd=$(printf "../ex%02d*/" $((${d:2:2}-1))); cd $pd ; } ```
If a wrong solution is submitted, the expected output is shown. This also helps to better understand the question, as I found it difficult to convey the intent of a question clearly with words alone...

```bash
$ source ../solve -q
1) Match lines containing the string: day

2) Match lines containing the string: it

3) Match lines containing the string: do you

$ grep 'do' sample.txt
How do you do?
Just do it
No doubt you like it too
Much ado about nothing

$ source ../solve -s
---------------------------------------------
Mismatch for question 1:
Expected output is:
Good day
Today is sunny
---------------------------------------------
```

================================================
FILE: file_attributes.md
================================================
# File attributes

**Table of Contents**

* [wc](#wc)
    * [Various counts](#various-counts)
    * [subtle differences](#subtle-differences)
    * [Further reading for wc](#further-reading-for-wc)
* [du](#du)
    * [Default size](#default-size)
    * [Various size formats](#various-size-formats)
    * [Dereferencing links](#dereferencing-links)
    * [Filtering options](#filtering-options)
    * [Further reading for du](#further-reading-for-du)
* [df](#df)
    * [Examples](#examples)
    * [Further reading for df](#further-reading-for-df)
* [touch](#touch)
    * [Creating empty file](#creating-empty-file)
    * [Updating timestamps](#updating-timestamps)
    * [Preserving timestamp](#preserving-timestamp)
    * [Further reading for touch](#further-reading-for-touch)
* [file](#file)
    * [File type examples](#file-type-examples)
    * [Further reading for file](#further-reading-for-file)
## wc ```bash $ wc --version | head -n1 wc (GNU coreutils) 8.25 $ man wc WC(1) User Commands WC(1) NAME wc - print newline, word, and byte counts for each file SYNOPSIS wc [OPTION]... [FILE]... wc [OPTION]... --files0-from=F DESCRIPTION Print newline, word, and byte counts for each FILE, and a total line if more than one FILE is specified. A word is a non-zero-length sequence of characters delimited by white space. With no FILE, or when FILE is -, read standard input. ... ```
#### Various counts ```bash $ cat sample.txt Hello World Good day No doubt you like it too Much ado about nothing He he he $ # by default, gives newline/word/byte count (in that order) $ wc sample.txt 5 17 78 sample.txt $ # options to get individual numbers $ wc -l sample.txt 5 sample.txt $ wc -w sample.txt 17 sample.txt $ wc -c sample.txt 78 sample.txt $ # use shell input redirection if filename is not needed $ wc -l < sample.txt 5 ``` * multiple file input * automatically displays total at end ```bash $ cat greeting.txt Hello there Have a safe journey $ cat fruits.txt Fruit Price apple 42 banana 31 fig 90 guava 6 $ wc *.txt 5 10 57 fruits.txt 2 6 32 greeting.txt 5 17 78 sample.txt 12 33 167 total ``` * use `-L` to get length of longest line ```bash $ wc -L < sample.txt 24 $ echo 'foo bar baz' | wc -L 11 $ echo 'hi there!' | wc -L 9 $ # last line will show max value, not sum of all input $ wc -L *.txt 13 fruits.txt 19 greeting.txt 24 sample.txt 24 total ```
#### subtle differences * byte count vs character count ```bash $ # when input is ASCII $ printf 'hi there' | wc -c 8 $ printf 'hi there' | wc -m 8 $ # when input has multi-byte characters $ printf 'hi👍' | od -x 0000000 6968 9ff0 8d91 0000006 $ printf 'hi👍' | wc -m 3 $ printf 'hi👍' | wc -c 6 ``` * `-l` option gives only the count of number of newline characters ```bash $ printf 'hi there\ngood day' | wc -l 1 $ printf 'hi there\ngood day\n' | wc -l 2 $ printf 'hi there\n\n\nfoo\n' | wc -l 4 ``` * From `man wc` "A word is a non-zero-length sequence of characters delimited by white space" ```bash $ echo 'foo bar ;-*' | wc -w 3 $ # use other text processing as needed $ echo 'foo bar ;-*' | grep -iowE '[a-z]+' foo bar $ echo 'foo bar ;-*' | grep -iowE '[a-z]+' | wc -l 2 ``` * `-L` won't count non-printable characters and tabs are converted to equivalent spaces ```bash $ printf 'food\tgood' | wc -L 12 $ printf 'food\tgood' | wc -m 9 $ printf 'food\tgood' | awk '{print length()}' 9 $ printf 'foo\0bar\0baz' | wc -L 9 $ printf 'foo\0bar\0baz' | wc -m 11 $ printf 'foo\0bar\0baz' | awk '{print length()}' 11 ```
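Since `wc -l` counts newline characters, input whose last line lacks a trailing newline is reported one short. A minimal sketch of two alternatives that also count such a final line, assuming GNU `grep` and a POSIX `awk` (the function and variable names are only for illustration):

```bash
# two lines of input, the last one not terminated by a newline
sample() { printf 'hi there\ngood day'; }

# counts newline characters only: 1
wc_count=$(sample | wc -l)
# empty regexp matches every line, including an unterminated last line: 2
grep_count=$(sample | grep -c '')
# NR is the number of records read so far: 2
awk_count=$(sample | awk 'END{print NR}')

echo "wc: $wc_count, grep: $grep_count, awk: $awk_count"
```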
#### Further reading for wc * `man wc` and `info wc` for more options and detailed documentation * [wc Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/wc?sort=votes&pageSize=15) * [wc Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/wc?sort=votes&pageSize=15)
## du ```bash $ du --version | head -n1 du (GNU coreutils) 8.25 $ man du DU(1) User Commands DU(1) NAME du - estimate file space usage SYNOPSIS du [OPTION]... [FILE]... du [OPTION]... --files0-from=F DESCRIPTION Summarize disk usage of the set of FILEs, recursively for directories. ... ```

#### Default size

* By default, size is reported in units of **1024 bytes**
* Files are not listed individually (their sizes still count towards the directory totals); all directories and sub-directories are recursively reported

```bash
$ ls -F
projs/  py_learn@  words.txt

$ du
17920   ./projs/full_addr
14316   ./projs/half_addr
32952   ./projs
33880   .
```

* use `-a` to recursively show both files and directories
* use `-s` to show total directory size without descending into its sub-directories

```bash
$ du -a
712     ./projs/report.log
17916   ./projs/full_addr/faddr.v
17920   ./projs/full_addr
14312   ./projs/half_addr/haddr.v
14316   ./projs/half_addr
32952   ./projs
0       ./py_learn
924     ./words.txt
33880   .

$ du -s
33880   .

$ du -s projs words.txt
32952   projs
924     words.txt
```

* use `-S` to show directory size without taking into account size of its sub-directories

```bash
$ du -S
17920   ./projs/full_addr
14316   ./projs/half_addr
716     ./projs
928     .
```

#### Various size formats

```bash
$ # number of bytes
$ stat -c %s words.txt
938848
$ du -b words.txt
938848  words.txt

$ # kilobytes = 1024 bytes
$ du -sk projs
32952   projs

$ # megabytes = 1024 kilobytes
$ du -sm projs
33      projs

$ # -B to specify custom byte scale size
$ du -sB 5000 projs
6749    projs
$ du -sB 1048576 projs
33      projs
```

* human readable and SI units

```bash
$ # in terms of powers of 1024
$ # M = 1048576 bytes and so on
$ du -sh projs/* words.txt
18M     projs/full_addr
14M     projs/half_addr
712K    projs/report.log
924K    words.txt

$ # in terms of powers of 1000
$ # M = 1000000 bytes and so on
$ du -s --si projs/* words.txt
19M     projs/full_addr
15M     projs/half_addr
730k    projs/report.log
947k    words.txt
```

* sorting

```bash
$ du -sh projs/* words.txt | sort -h
712K    projs/report.log
924K    words.txt
14M     projs/half_addr
18M     projs/full_addr

$ du -sk projs/* | sort -nr
17920   projs/full_addr
14316   projs/half_addr
712     projs/report.log
```

* to get size based on number of bytes in the file (apparent size) rather than disk space allotted

```bash
$ du -b words.txt
938848  words.txt

$ du -h words.txt
924K    words.txt

$ # 938848/1024 = 916.84
$ du --apparent-size -h words.txt
917K    words.txt
```
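The gap between apparent size and allotted disk space is easiest to see with a sparse file. A minimal sketch, assuming GNU `truncate`, `du` and `mktemp`; the file name is arbitrary and the disk usage figure depends on the filesystem in use:

```bash
workdir=$(mktemp -d)
# create a 1 MiB file consisting only of a hole (no data is written)
truncate -s 1M "$workdir/sparse.bin"

# apparent size in bytes, same value that 'stat -c %s' reports
apparent=$(du -b "$workdir/sparse.bin" | cut -f1)
# disk space actually allotted, in units of 1024 bytes (usually 0 or close to it)
on_disk=$(du -k "$workdir/sparse.bin" | cut -f1)

echo "apparent: $apparent bytes, on disk: $on_disk KiB"
rm -r "$workdir"
```

Here `du -b` is shorthand for `--apparent-size --block-size=1`, which is why it agrees with `stat -c %s` in the examples above.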
#### Dereferencing links * See `man` and `info` pages for other related options ```bash $ # -D to dereference command line argument $ du py_learn 0 py_learn $ du -shD py_learn 503M py_learn $ # -L to dereference links found by du $ du -sh 34M . $ du -shL 536M . ```
#### Filtering options * `-d` to specify maximum depth ```bash $ du -ah projs 712K projs/report.log 18M projs/full_addr/faddr.v 18M projs/full_addr 14M projs/half_addr/haddr.v 14M projs/half_addr 33M projs $ du -ah -d1 projs 712K projs/report.log 18M projs/full_addr 14M projs/half_addr 33M projs ``` * `-c` to also show total size at end ```bash $ du -cshD projs py_learn 33M projs 503M py_learn 535M total ``` * `-t` to provide a threshold comparison ```bash $ # >= 15M $ du -Sh -t 15M 18M ./projs/full_addr $ # <= 1M $ du -ah -t -1M 712K ./projs/report.log 0 ./py_learn 924K ./words.txt ``` * excluding files/directories based on **glob** pattern * see also `--exclude-from=FILE` and `--files0-from=FILE` options ```bash $ # note that excluded files affect directory size reported $ du -ah --exclude='*addr*' projs 712K projs/report.log 716K projs $ # depending on shell, brace expansion can be used $ du -ah --exclude='*.'{v,log} projs 4.0K projs/full_addr 4.0K projs/half_addr 12K projs ```
#### Further reading for du * `man du` and `info du` for more options and detailed documentation * [du Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/disk-usage?sort=votes&pageSize=15) * [du Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/du?sort=votes&pageSize=15)
## df ```bash $ df --version | head -n1 df (GNU coreutils) 8.25 $ man df DF(1) User Commands DF(1) NAME df - report file system disk space usage SYNOPSIS df [OPTION]... [FILE]... DESCRIPTION This manual page documents the GNU version of df. df displays the amount of disk space available on the file system containing each file name argument. If no file name is given, the space available on all currently mounted file systems is shown. ... ```
#### Examples ```bash $ # use df without arguments to get information on all currently mounted file systems $ df . Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 98298500 58563816 34734748 63% / $ # use -B option for custom size $ # use --si for size in powers of 1000 instead of 1024 $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/sda1 94G 56G 34G 63% / ``` * Use `--output` to report only specific fields of interest ```bash $ df -h --output=size,used,file / /media/learnbyexample/projs Size Used File 94G 56G / 92G 35G /media/learnbyexample/projs $ df -h --output=pcent . Use% 63% $ df -h --output=pcent,fstype | awk -F'%' 'NR>2 && $1>=40' 63% ext3 40% ext4 51% ext4 ```
#### Further reading for df * `man df` and `info df` for more options and detailed documentation * [df Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/df?sort=votes&pageSize=15) * [Parsing df command output with awk](https://unix.stackexchange.com/questions/360865/parsing-df-command-output-with-awk) * [processing df output](https://www.reddit.com/r/bash/comments/68dbml/using_an_array_variable_in_an_awk_command/)
## touch ```bash $ touch --version | head -n1 touch (GNU coreutils) 8.25 $ man touch TOUCH(1) User Commands TOUCH(1) NAME touch - change file timestamps SYNOPSIS touch [OPTION]... FILE... DESCRIPTION Update the access and modification times of each FILE to the current time. A FILE argument that does not exist is created empty, unless -c or -h is supplied. ... ```
#### Creating empty file ```bash $ ls foo.txt ls: cannot access 'foo.txt': No such file or directory $ touch foo.txt $ ls foo.txt foo.txt $ # use -c if new file shouldn't be created $ rm foo.txt $ touch -c foo.txt $ ls foo.txt ls: cannot access 'foo.txt': No such file or directory ```
#### Updating timestamps * Updating both access and modification timestamp to current time ```bash $ # last access time $ stat -c %x fruits.txt 2017-07-19 17:06:01.523308599 +0530 $ # last modification time $ stat -c %y fruits.txt 2017-07-13 13:54:03.576055933 +0530 $ touch fruits.txt $ stat -c %x fruits.txt 2017-07-21 10:11:44.241921229 +0530 $ stat -c %y fruits.txt 2017-07-21 10:11:44.241921229 +0530 ``` * Updating only access or modification timestamp ```bash $ touch -a greeting.txt $ stat -c %x greeting.txt 2017-07-21 10:14:08.457268564 +0530 $ stat -c %y greeting.txt 2017-07-13 13:54:26.004499660 +0530 $ touch -m sample.txt $ stat -c %x sample.txt 2017-07-13 13:48:24.945450646 +0530 $ stat -c %y sample.txt 2017-07-21 10:14:40.770006144 +0530 ``` * Using timestamp from another file to update ```bash $ stat -c $'%x\n%y' power.log report.log 2017-07-19 10:48:03.978295434 +0530 2017-07-14 20:50:42.850887578 +0530 2017-06-24 13:00:31.773583923 +0530 2017-06-24 12:59:53.316751651 +0530 $ # copy both access and modification timestamp from power.log to report.log $ touch -r power.log report.log $ stat -c $'%x\n%y' report.log 2017-07-19 10:48:03.978295434 +0530 2017-07-14 20:50:42.850887578 +0530 $ # add -a or -m options to limit to only access or modification timestamp ``` * Using date string to update * See also `-t` option ```bash $ # add -a or -m as needed $ touch -d '2010-03-17 17:04:23' report.log $ stat -c $'%x\n%y' report.log 2010-03-17 17:04:23.000000000 +0530 2010-03-17 17:04:23.000000000 +0530 ```
#### Preserving timestamp * Text processing on files would update the timestamps ```bash $ stat -c $'%x\n%y' power.log 2017-07-21 11:11:42.862874240 +0530 2017-07-13 21:31:53.496323704 +0530 $ sed -i 's/foo/bar/g' power.log $ stat -c $'%x\n%y' power.log 2017-07-21 11:12:20.303504336 +0530 2017-07-21 11:12:20.303504336 +0530 ``` * `touch` can be used to restore timestamps after processing ```bash $ # first copy the timestamps using touch -r $ stat -c $'%x\n%y' story.txt 2017-06-24 13:00:31.773583923 +0530 2017-06-24 12:59:53.316751651 +0530 $ # tmp.txt is temporary empty file $ touch -r story.txt tmp.txt $ stat -c $'%x\n%y' tmp.txt 2017-06-24 13:00:31.773583923 +0530 2017-06-24 12:59:53.316751651 +0530 $ # after text processing, copy back the timestamps and remove temporary file $ sed -i 's/cat/dog/g' story.txt $ touch -r tmp.txt story.txt && rm tmp.txt $ stat -c $'%x\n%y' story.txt 2017-06-24 13:00:31.773583923 +0530 2017-06-24 12:59:53.316751651 +0530 ```
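The save, edit and restore steps above can be wrapped in a small function. A minimal sketch, assuming GNU `sed`, `touch`, `stat` and `mktemp`; the name `sed_keep_time` is hypothetical and only seconds-level timestamps are checked here:

```bash
# hypothetical helper: in-place sed edit that keeps the file's timestamps
sed_keep_time() {
    local script=$1 file=$2 ref
    ref=$(mktemp) || return 1
    touch -r "$file" "$ref"      # save access and modification times
    sed -i "$script" "$file"     # the edit updates the timestamps
    touch -r "$ref" "$file"      # copy the saved timestamps back
    rm -f "$ref"
}

workdir=$(mktemp -d)
printf 'cat and cat\n' > "$workdir/story.txt"
touch -d '2010-03-17 17:04:23' "$workdir/story.txt"

# modification time in seconds since epoch, before and after the edit
before=$(stat -c %Y "$workdir/story.txt")
sed_keep_time 's/cat/dog/g' "$workdir/story.txt"
after=$(stat -c %Y "$workdir/story.txt")
content=$(cat "$workdir/story.txt")

echo "$content: mtime $before -> $after"
rm -r "$workdir"
```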
#### Further reading for touch * `man touch` and `info touch` for more options and detailed documentation * [touch Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/touch?sort=votes&pageSize=15)
## file ```bash $ file --version | head -n1 file-5.25 $ man file FILE(1) BSD General Commands Manual FILE(1) NAME file — determine file type SYNOPSIS file [-bcEhiklLNnprsvzZ0] [--apple] [--extension] [--mime-encoding] [--mime-type] [-e testname] [-F separator] [-f namefile] [-m magicfiles] [-P name=value] file ... file -C [-m magicfiles] file [--help] DESCRIPTION This manual page documents version 5.25 of the file command. file tests each argument in an attempt to classify it. There are three sets of tests, performed in this order: filesystem tests, magic tests, and language tests. The first test that succeeds causes the file type to be printed. ... ```

#### File type examples

```bash
$ file sample.txt
sample.txt: ASCII text
$ # without file name in output
$ file -b sample.txt
ASCII text

$ printf 'hi👍\n' | file -
/dev/stdin: UTF-8 Unicode text
$ printf 'hi👍\n' | file -i -
/dev/stdin: text/plain; charset=utf-8

$ file ch
ch: Bourne-Again shell script, ASCII text executable

$ file sunset.jpg moon.png
sunset.jpg: JPEG image data
moon.png:   PNG image data, 32 x 32, 8-bit/color RGBA, non-interlaced
```

* different line terminators

```bash
$ printf 'hi' | file -
/dev/stdin: ASCII text, with no line terminators

$ printf 'hi\r' | file -
/dev/stdin: ASCII text, with CR line terminators

$ printf 'hi\r\n' | file -
/dev/stdin: ASCII text, with CRLF line terminators

$ printf 'hi\n' | file -
/dev/stdin: ASCII text
```

* find all files of particular type in current directory, for example `image` files

```bash
$ find -type f -exec bash -c '(file -b "$0" | grep -wq "image data") && echo "$0"' {} \;
./sunset.jpg
./moon.png

$ # if filenames do not contain : or newline characters
$ find -type f -exec file {} + | awk -F: '/\<image data\>/{print $1}'
./sunset.jpg
./moon.png
```
#### Further reading for file * `man file` and `info file` for more options and detailed documentation * See also `identify` command which `describes the format and characteristics of one or more image files` ================================================ FILE: gnu_awk.md ================================================


--- :information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnuawk/. The ebook also has content updated for newer version of the commands, includes a chapter on regular expressions, has exercises, solutions, etc. For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnuawk ---


## GNU awk **Table of Contents** * [Field processing](#field-processing) * [Default field separation](#default-field-separation) * [Specifying different input field separator](#specifying-different-input-field-separator) * [Specifying different output field separator](#specifying-different-output-field-separator) * [Filtering](#filtering) * [Idiomatic print usage](#idiomatic-print-usage) * [Field comparison](#field-comparison) * [Regular expressions based filtering](#regular-expressions-based-filtering) * [Fixed string matching](#fixed-string-matching) * [Line number based filtering](#line-number-based-filtering) * [Case Insensitive filtering](#case-insensitive-filtering) * [Changing record separators](#changing-record-separators) * [Paragraph mode](#paragraph-mode) * [Multicharacter RS](#multicharacter-rs) * [Substitute functions](#substitute-functions) * [Inplace file editing](#inplace-file-editing) * [Using shell variables](#using-shell-variables) * [Multiple file input](#multiple-file-input) * [Control Structures](#control-structures) * [if-else and loops](#if-else-and-loops) * [next and nextfile](#next-and-nextfile) * [Multiline processing](#multiline-processing) * [Two file processing](#two-file-processing) * [Comparing whole lines](#comparing-whole-lines) * [Comparing specific fields](#comparing-specific-fields) * [getline](#getline) * [Creating new fields](#creating-new-fields) * [Dealing with duplicates](#dealing-with-duplicates) * [Lines between two REGEXPs](#lines-between-two-regexps) * [All unbroken blocks](#all-unbroken-blocks) * [Specific blocks](#specific-blocks) * [Broken blocks](#broken-blocks) * [Arrays](#arrays) * [awk scripts](#awk-scripts) * [Miscellaneous](#miscellaneous) * [FPAT and FIELDWIDTHS](#fpat-and-fieldwidths) * [String functions](#string-functions) * [Executing external commands](#executing-external-commands) * [printf formatting](#printf-formatting) * [Redirecting print output](#redirecting-print-output) * [Gotchas and 
Tips](#gotchas-and-tips) * [Further Reading](#further-reading)
```bash $ awk --version | head -n1 GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0) $ man awk GAWK(1) Utility Commands GAWK(1) NAME gawk - pattern scanning and processing language SYNOPSIS gawk [ POSIX or GNU style options ] -f program-file [ -- ] file ... gawk [ POSIX or GNU style options ] [ -- ] program-text file ... DESCRIPTION Gawk is the GNU Project's implementation of the AWK programming lan‐ guage. It conforms to the definition of the language in the POSIX 1003.1 Standard. This version in turn is based on the description in The AWK Programming Language, by Aho, Kernighan, and Weinberger. Gawk provides the additional features found in the current version of Brian Kernighan's awk and a number of GNU-specific extensions. ... ``` **Prerequisites and notes** * familiarity with programming concepts like variables, printing, control structures, arrays, etc * familiarity with regular expressions * if not, check out **ERE** portion of [GNU sed regular expressions](./gnu_sed.md#regular-expressions) which is close enough to features available in `gawk` * this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using `grep`, `sed`, etc * see [Gawk: Effective AWK Programming](https://www.gnu.org/software/gawk/manual/) manual for complete reference, has information on other `awk` versions as well as notes on POSIX standard
## Field processing
#### Default field separation * `$0` contains the entire input record * default input record separator is newline character * `$1` contains the first field text * default input field separator is one or more of continuous space, tab or newline characters * `$2` contains the second field text and so on * `$(2+3)` result of expressions can be used, this one evaluates to `$5` and hence gives fifth field * similarly if variable `i` has value `2`, then `$(i+3)` will give fifth field * See also [gawk manual - Expressions](https://www.gnu.org/software/gawk/manual/html_node/Expressions.html) * `NF` is a built-in variable which contains number of fields in the current record * so, `$NF` will give last field * `$(NF-1)` will give second last field and so on ```bash $ cat fruits.txt fruit qty apple 42 banana 31 fig 90 guava 6 $ # print only first field $ awk '{print $1}' fruits.txt fruit apple banana fig guava $ # print only second field $ awk '{print $2}' fruits.txt qty 42 31 90 6 ```
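The `NF`, `$NF`, `$(NF-1)` and `$(2+3)` forms described above can be tried in the same way. A minimal sketch, with the `fruits.txt` content recreated in a temporary file so that the commands can run anywhere (the variable names are only for illustration):

```bash
tmp=$(mktemp)
printf 'fruit   qty\napple   42\nbanana  31\nfig     90\nguava   6\n' > "$tmp"

# number of fields, followed by the last field of each record
nf_last=$(awk '{print NF, $NF}' "$tmp")

# result of an expression used as field number: $(2+3) is $5
fifth=$(echo 'a b c d e' | awk '{print $(2+3)}')
# second last field
second_last=$(echo 'a b c d e' | awk '{print $(NF-1)}')

echo "$nf_last"
echo "$fifth $second_last"
rm -f "$tmp"
```

Every record of the sample has two fields, so `$NF` is the same as `$2` here; the single line input shows `$(2+3)` and `$(NF-1)` picking `e` and `d` respectively.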
#### Specifying different input field separator * by using `-F` command line option * by setting `FS` variable * See [FPAT and FIELDWIDTHS](#fpat-and-fieldwidths) section for other ways of defining input fields ```bash $ # second field where input field separator is : $ echo 'foo:123:bar:789' | awk -F: '{print $2}' 123 $ # last field $ echo 'foo:123:bar:789' | awk -F: '{print $NF}' 789 $ # first and last field $ # note the use of , and space between output fields $ echo 'foo:123:bar:789' | awk -F: '{print $1, $NF}' foo 789 $ # second last field $ echo 'foo:123:bar:789' | awk -F: '{print $(NF-1)}' bar $ # use quotes to avoid clashes with shell special characters $ echo 'one;two;three;four' | awk -F';' '{print $3}' three ``` * Regular expressions based input field separator ```bash $ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{print $2}' string $ # first field will be empty as there is nothing before '{' $ echo '{foo} bar=baz' | awk -F'[{}= ]+' '{print $1}' $ echo '{foo} bar=baz' | awk -F'[{}= ]+' '{print $2}' foo $ echo '{foo} bar=baz' | awk -F'[{}= ]+' '{print $3}' bar ``` * default input field separator is one or more of continuous space, tab or newline characters (will be termed as whitespace here on) * exact same behavior if `FS` is assigned single space character * in addition, leading and trailing whitespaces won't be considered when splitting the input record ```bash $ printf ' a ate b\tc \n' a ate b c $ printf ' a ate b\tc \n' | awk '{print $1}' a $ printf ' a ate b\tc \n' | awk '{print NF}' 4 $ # same behavior if FS is assigned to single space character $ printf ' a ate b\tc \n' | awk -F' ' '{print $1}' a $ printf ' a ate b\tc \n' | awk -F' ' '{print NF}' 4 $ # for anything else, leading/trailing whitespaces will be considered $ printf ' a ate b\tc \n' | awk -F'[ \t]+' '{print $2}' a $ printf ' a ate b\tc \n' | awk -F'[ \t]+' '{print NF}' 6 ``` * assigning empty string to FS will split the input record character wise * note the use of command 
line option `-v` to set FS ```bash $ echo 'apple' | awk -v FS= '{print $1}' a $ echo 'apple' | awk -v FS= '{print $2}' p $ echo 'apple' | awk -v FS= '{print $NF}' e $ # detecting multibyte characters depends on locale $ printf 'hi👍 how are you?' | awk -v FS= '{print $3}' 👍 ``` **Further Reading** * [gawk manual - Field Splitting Summary](https://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html#Field-Splitting-Summary) * [stackoverflow - explanation on default FS](https://stackoverflow.com/questions/30405694/default-field-separator-for-awk) * [unix.stackexchange - filter lines if it contains a particular character only once](https://unix.stackexchange.com/questions/362550/how-to-remove-line-if-it-contains-a-character-exactly-once) * [stackoverflow - Processing 2 files with different field separators](https://stackoverflow.com/questions/24516141/awk-processing-2-files-with-different-field-separators)
#### Specifying different output field separator * by setting `OFS` variable * also gets added between every argument to `print` statement * use [printf](#printf-formatting) to avoid this * default is single space ```bash $ # statements inside BEGIN are executed before processing any input text $ echo 'foo:123:bar:789' | awk 'BEGIN{FS=OFS=":"} {print $1, $NF}' foo:789 $ # can also be set using command line option -v $ echo 'foo:123:bar:789' | awk -F: -v OFS=':' '{print $1, $NF}' foo:789 $ # changing a field will re-build contents of $0 $ echo ' a ate b ' | awk '{$2 = "foo"; print $0}' | cat -A a foo b$ $ # $1=$1 is an idiomatic way to re-build when there is nothing else to change $ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{print $0}' foo:123:bar:789 $ echo 'foo:123:bar:789' | awk -F: -v OFS='-' '{$1=$1; print $0}' foo-123-bar-789 $ # OFS is used to separate different arguments given to print $ echo 'foo:123:bar:789' | awk -F: -v OFS='\t' '{print $1, $3}' foo bar $ echo 'Sample123string54with908numbers' | awk -F'[0-9]+' '{$1=$1; print $0}' Sample string with numbers ```
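
A small sketch of when `OFS` applies, assuming a made-up `:` separated sample: it is inserted only between comma-separated arguments of `print`, not when values are concatenated and not by `printf`.

```bash
# no comma: plain string concatenation, OFS is not used
echo 'foo:123:bar' | awk -F: -v OFS='-' '{print $1 $2}'               # foo123
# comma: OFS is placed between the arguments
echo 'foo:123:bar' | awk -F: -v OFS='-' '{print $1, $2}'              # foo-123
# printf: only the format string decides what goes between values
echo 'foo:123:bar' | awk -F: -v OFS='-' '{printf "%s %s\n", $1, $2}'  # foo 123
```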
## Filtering
#### Idiomatic print usage

* `print` statement with no arguments prints the contents of `$0`
* if a condition is specified without corresponding statements, the contents of `$0` are printed when the condition evaluates to true
* `1` is typically used to represent an always-true condition and thus prints the contents of `$0`

```bash
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

$ # displaying contents of input file(s) similar to 'cat' command
$ # equivalent to using awk '{print $0}' and awk '1'
$ awk '{print}' poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
```
#### Field comparison

* Each block of statements within `{}` can be prefixed by an optional condition so that those statements will execute only if the condition evaluates to true
* A condition specified without corresponding statements will print the contents of `$0` if the condition evaluates to true

```bash
$ # if first field exactly matches the string 'apple'
$ awk '$1=="apple"{print $2}' fruits.txt
42

$ # print first field if second field > 35
$ # NR>1 to avoid the header line
$ # NR built-in variable contains record number
$ awk 'NR>1 && $2>35{print $1}' fruits.txt
apple
fig

$ # print header and lines with qty < 35
$ awk 'NR==1 || $2<35' fruits.txt
fruit qty
banana 31
guava 6
```

* If the above examples are too confusing, think of the compact form as syntactic sugar
    * Statements are grouped within `{}`
        * inside `{}`, we have an `if` control structure
        * Like `C` language, braces are not needed for single statements within `if`, but consider that `{}` is used for clarity
    * From this explicit syntax, remove the outer `{}`, `if` and the `()` used for `if`
* As we'll see later, this allows a few lines of program to be written compactly on the command line itself
    * Of course, for medium to large programs, it is better to put the code in a separate file. See [awk scripts](#awk-scripts) section

```bash
$ # awk '$1=="apple"{print $2}' fruits.txt
$ awk '{
         if($1 == "apple"){
            print $2
         }
       }' fruits.txt
42

$ # awk 'NR==1 || $2<35' fruits.txt
$ awk '{
         if(NR==1 || $2<35){
            print $0
         }
       }' fruits.txt
fruit qty
banana 31
guava 6
```

**Further Reading**

* [gawk manual - Truth Values and Conditions](https://www.gnu.org/software/gawk/manual/html_node/Truth-Values-and-Conditions.html)
* [gawk manual - Operator Precedence](https://www.gnu.org/software/gawk/manual/html_node/Precedence.html)
* [unix.stackexchange - filtering columns by header name](https://unix.stackexchange.com/questions/359697/print-columns-in-awk-by-header-name)
#### Regular expressions based filtering * the *REGEXP* is specified within `//` and by default acts upon `$0` * See also [stackoverflow - lines around matching regexp](https://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern) ```bash $ # all lines containing the string 'are' $ # same as: grep 'are' poem.txt $ awk '/are/' poem.txt Roses are red, Violets are blue, And so are you. $ # negating REGEXP, same as: grep -v 'are' poem.txt $ awk '!/are/' poem.txt Sugar is sweet, $ # same as: grep 'are' poem.txt | grep -v 'so' $ awk '/are/ && !/so/' poem.txt Roses are red, Violets are blue, $ # lines starting with 'a' or 'b' $ awk '/^[ab]/' fruits.txt apple 42 banana 31 $ # print last field of all lines containing 'are' $ awk '/are/{print $NF}' poem.txt red, blue, you. ``` * strings can be used as well, which will be interpreted as *REGEXP* if necessary * Allows [using shell variables](#using-shell-variables) instead of hardcoded *REGEXP* * that section also notes difference between using `//` and string ```bash $ awk '$0 !~ "are"' poem.txt Sugar is sweet, $ awk '$0 ~ "^[ab]"' fruits.txt apple 42 banana 31 $ # also helpful if search strings have the / delimiter character $ cat paths.txt /foo/a/report.log /foo/y/power.log $ awk '/\/foo\/a\//' paths.txt /foo/a/report.log $ awk '$0 ~ "/foo/a/"' paths.txt /foo/a/report.log ``` * *REGEXP* matching against specific field ```bash $ # if first field contains 'a' $ awk '$1 ~ /a/' fruits.txt apple 42 banana 31 guava 6 $ # if first field contains 'a' and qty > 20 $ awk '$1 ~ /a/ && $2 > 20' fruits.txt apple 42 banana 31 $ # if first field does NOT contain 'a' $ awk '$1 !~ /a/' fruits.txt fruit qty fig 90 ```
#### Fixed string matching * to search a string literally, `index` function can be used instead of *REGEXP* * similar to `grep -F` * the function returns the starting position and `0` if no match found ```bash $ cat eqns.txt a=b,a-b=c,c*d a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ # no output since '+' is meta character, would need '/a\+b/' $ awk '/a+b/' eqns.txt $ # same as: grep -F 'a+b' eqns.txt $ awk 'index($0,"a+b")' eqns.txt a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ # much easier than '/i\*\(t\+9-g\)/' $ awk 'index($0,"i*(t+9-g)")' eqns.txt i*(t+9-g)/8,4-a+b $ # check only last field $ awk -F, 'index($NF,"a+b")' eqns.txt i*(t+9-g)/8,4-a+b $ # index not needed if entire field/line is being compared $ awk -F, '$1=="a+b"' eqns.txt a+b,pi=3.14,5e12 ``` * return value is useful to match at specific position * for ex: at start/end of line ```bash $ # start of line $ awk 'index($0,"a+b")==1' eqns.txt a+b,pi=3.14,5e12 $ # end of line $ # length function returns number of characters, by default acts on $0 $ awk 'index($0,"a+b")==length()-length("a+b")+1' eqns.txt i*(t+9-g)/8,4-a+b $ # to avoid repetitions, save the search string in variable $ awk -v s="a+b" 'index($0,s)==length()-length(s)+1' eqns.txt i*(t+9-g)/8,4-a+b ```
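
To make the return value of `index` concrete, here is a minimal sketch using made-up strings (not the `eqns.txt` file). It also contrasts the literal search with the regular expression `/a+b/`.

```bash
# index returns the 1-based start position, or 0 when the string is absent
awk 'BEGIN{print index("a+b=c", "+b")}'             # 2
awk 'BEGIN{print index("a+b=c", "x")}'              # 0

# literal search keeps only the line containing the characters a+b
printf 'a+b\nab\naab\n' | awk 'index($0, "a+b")'    # a+b
# as a regexp, a+b means one or more 'a' followed by 'b'
printf 'a+b\nab\naab\n' | awk '/a+b/'               # ab and aab
```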
#### Line number based filtering * Built-in variable `NR` contains total records read so far * Use `FNR` if you need line numbers separately for [multiple file processing](#multiple-file-processing) ```bash $ # same as: head -n2 poem.txt | tail -n1 $ awk 'NR==2' poem.txt Violets are blue, $ # print 2nd and 4th line $ awk 'NR==2 || NR==4' poem.txt Violets are blue, And so are you. $ # same as: tail -n1 poem.txt $ # statements inside END are executed after processing all input text $ awk 'END{print}' poem.txt And so are you. $ awk 'NR==4{print $2}' fruits.txt 90 ``` * for large input, use `exit` to avoid unnecessary record processing ```bash $ seq 14323 14563435 | awk 'NR==234{print; exit}' 14556 $ # sample time comparison $ time seq 14323 14563435 | awk 'NR==234{print; exit}' 14556 real 0m0.004s user 0m0.004s sys 0m0.000s $ time seq 14323 14563435 | awk 'NR==234{print}' 14556 real 0m2.167s user 0m2.280s sys 0m0.092s ``` * See also [unix.stackexchange - filtering list of lines from every X number of lines](https://unix.stackexchange.com/questions/325985/how-to-print-lines-number-15-and-25-out-of-each-50-lines)
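
The `NR` conditions and the `exit` tip above combine naturally for a range of lines. A minimal sketch, assuming `seq` is available as in the other examples:

```bash
# print lines 3 to 5, then stop reading further input
# same as: seq 10 | sed -n '3,5p'
seq 10 | awk 'NR>=3 && NR<=5; NR==5{exit}'
# 3
# 4
# 5
```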
## Case Insensitive filtering

```bash
$ # same as: grep -i 'rose' poem.txt
$ awk -v IGNORECASE=1 '/rose/' poem.txt
Roses are red,

$ # for small enough set, can also use REGEXP character class
$ awk '/[rR]ose/' poem.txt
Roses are red,

$ # another way is to use built-in string function 'tolower'
$ awk 'tolower($0) ~ /rose/' poem.txt
Roses are red,
```
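
`IGNORECASE` is a `gawk` extension, whereas `tolower` also works for exact comparison of a single field. A small sketch with made-up input lines:

```bash
# compare the first field case-insensitively, leaving the record unchanged
printf 'Apple 42\nfig 90\nAPPLE 7\n' | awk 'tolower($1) == "apple"'
# Apple 42
# APPLE 7
```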
## Changing record separators * `RS` to change input record separator * default is newline character ```bash $ s='this is a sample string' $ # space as input record separator, printing all records $ printf "$s" | awk -v RS=' ' '{print NR, $0}' 1 this 2 is 3 a 4 sample 5 string $ # print all records containing 'a' $ printf "$s" | awk -v RS=' ' '/a/' a sample ``` * `ORS` to change output record separator * gets added to every `print` statement * use [printf](#printf-formatting) to avoid this * default is newline character ```bash $ seq 3 | awk '{print $0}' 1 2 3 $ # note that there is empty line after last record $ seq 3 | awk -v ORS='\n\n' '{print $0}' 1 2 3 $ # dynamically changing ORS $ # ?: ternary operator to select between two expressions based on a condition $ # can also use: seq 6 | awk '{ORS = NR%2 ? " " : RS} 1' $ seq 6 | awk '{ORS = NR%2 ? " " : "\n"} 1' 1 2 3 4 5 6 $ seq 6 | awk '{ORS = NR%3 ? "-" : "\n"} 1' 1-2-3 4-5-6 ```
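
Since `ORS` is added after every `print`, joining lines with a delimiter leaves a trailing delimiter and no final newline. One possible way around it, sketched here with `printf`, is to emit the delimiter before every record except the first:

```bash
# ORS alone: output is 1,2,3, with a trailing comma and no newline
seq 3 | awk -v ORS=',' '1'
echo    # finish the line left open by the previous command

# delimiter goes before each record except the first, newline added at the end
seq 3 | awk '{printf "%s%s", (NR>1 ? "," : ""), $0} END{print ""}'   # 1,2,3
```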
#### Paragraph mode * When `RS` is set to empty string, one or more consecutive empty lines is used as input record separator * Can also use regular expression `RS=\n\n+` but there are subtle differences, see [gawk manual - multiline records](https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html). Important points from that link quoted below >However, there is an important difference between ‘RS = ""’ and ‘RS = "\n\n+"’. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done >Now that the input is separated into records, the second step is to separate the fields in the records. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature. When RS is set to the empty string and FS is set to a single character, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS >When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’ Consider the below sample file ```bash $ cat sample.txt Hello World Good day How are you Just do-it Believe it Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ``` * Filtering paragraphs ```bash $ # print all paragraphs containing 'it' $ # if extra newline at end is undesirable, can use $ # awk -v RS= '/it/{print c++ ? 
"\n" $0 : $0}' sample.txt $ awk -v RS= -v ORS='\n\n' '/it/' sample.txt Just do-it Believe it Today is sunny Not a bit funny No doubt you like it too $ # based on number of lines in each paragraph $ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==1' sample.txt Hello World $ awk -F'\n' -v RS= -v ORS='\n\n' 'NF==2 && /do/' sample.txt Just do-it Believe it Much ado about nothing He he he ``` * Re-structuring paragraphs ```bash $ # default FS is one or more of continuous space, tab or newline characters $ # default OFS is single space $ # so, $1=$1 will change it uniformly to single space between fields $ awk -v RS= '{$1=$1} 1' sample.txt Hello World Good day How are you Just do-it Believe it Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he $ # a better usecase $ awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1' sample.txt Hello World Good day. How are you Just do-it. Believe it Today is sunny. Not a bit funny. No doubt you like it too Much ado about nothing. He he he ``` **Further Reading** * [unix.stackexchange - filtering line surrounded by empty lines](https://unix.stackexchange.com/questions/359717/select-line-with-empty-line-above-and-under) * [stackoverflow - excellent example and explanation of RS and FS](https://stackoverflow.com/questions/46142118/converting-regex-to-sed-or-grep-regex)
#### Multicharacter RS * Some marker like `Error` or `Warning` etc ```bash $ cat report.log blah blah Error: something went wrong more blah whatever Error: something surely went wrong some text some more text blah blah blah $ awk -v RS='Error:' 'END{print NR-1}' report.log 2 $ awk -v RS='Error:' 'NR==1' report.log blah blah $ # filter 'Error:' block matching particular string $ # to preserve formatting, use: '/whatever/{print RS $0}' $ awk -v RS='Error:' '/whatever/' report.log something went wrong more blah whatever $ # blocks with more than 3 lines $ # splitting string with 3 newlines will yield 4 fields $ awk -F'\n' -v RS='Error:' 'NF>4{print RS $0}' report.log Error: something surely went wrong some text some more text blah blah blah ``` * Regular expression based `RS` * the `RT` variable will contain string matched by `RS` * Note that entire input is treated as single string, so `^` and `$` anchors will apply only once - not every line ```bash $ s='Sample123string54with908numbers' $ printf "$s" | awk -v RS='[0-9]+' 'NR==1' Sample $ # note the relationship between record and separators $ printf "$s" | awk -v RS='[0-9]+' '{print NR " : " $0 " - " RT}' 1 : Sample - 123 2 : string - 54 3 : with - 908 4 : numbers - $ # need to be careful of empty records $ printf '123string54with908' | awk -v RS='[0-9]+' '{print NR " : " $0}' 1 : 2 : string 3 : with $ # and newline at end of input $ printf '123string54with908\n' | awk -v RS='[0-9]+' '{print NR " : " $0}' 1 : 2 : string 3 : with 4 : ``` * Joining lines based on specific end of line condition ```bash $ cat msg.txt Hello there. It will rain to- day. Have a safe and pleasant jou- rney. $ # join lines ending with - to next line $ # by manipulating RS and ORS $ awk -v RS='-\n' -v ORS= '1' msg.txt Hello there. It will rain today. Have a safe and pleasant journey. $ # by manipulating ORS alone, sub function covered in later sections $ awk '{ORS = sub(/-$/,"") ? "" : "\n"} 1' msg.txt Hello there. It will rain today. 
Have a safe and pleasant journey. $ # easier: perl -pe 's/-\n//' msg.txt as newline is still part of input line ``` * processing null terminated input ```bash $ printf 'foo\0bar\0' | cat -A foo^@bar^@$ $ printf 'foo\0bar\0' | awk -v RS='\0' '{print}' foo bar ``` **Further Reading** * [gawk manual - Records](https://www.gnu.org/software/gawk/manual/html_node/Records.html#Records) * [unix.stackexchange - Slurp-mode in awk](https://unix.stackexchange.com/questions/304457/slurp-mode-in-awk) * [stackoverflow - using RS to count number of occurrences of a given string](https://stackoverflow.com/questions/45102651/how-to-grep-double-quote-followed-by-a-string-at-same-time/45102962#45102962)
## Substitute functions

* Use `sub` string function for replacing first occurrence
* Use `gsub` for replacing all occurrences
* By default, `$0` which contains input record is modified, can specify any other field or variable as needed

```bash
$ # replacing first occurrence
$ echo '1-2-3-4-5' | awk '{sub("-", ":")} 1'
1:2-3-4-5

$ # replacing all occurrences
$ echo '1-2-3-4-5' | awk '{gsub("-", ":")} 1'
1:2:3:4:5

$ # return value for sub/gsub is number of replacements made
$ echo '1-2-3-4-5' | awk '{n=gsub("-", ":"); print n} 1'
4
1:2:3:4:5

$ # // format is better suited to specify search REGEXP
$ echo '1-2-3-4-5' | awk '{gsub(/[^-]+/, "abc")} 1'
abc-abc-abc-abc-abc

$ # replacing all occurrences only for third field
$ echo 'one;two;three;four' | awk -F';' '{gsub("e", "E", $3)} 1'
one two thrEE four
```

* Use `gensub` to return the modified string, unlike `sub` or `gsub` which modify in place
* it also supports back-references and the ability to modify a specific match
* acts upon `$0` if target is not specified

```bash
$ # replace second occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(":", "-", 2)} 1'
foo:123-bar:baz

$ # use REGEXP as needed
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 2)} 1'
foo:XYZ:bar:baz

$ # or print the returned string directly
$ echo 'foo:123:bar:baz' | awk '{print gensub(":", "-", 2)}'
foo:123-bar:baz

$ # replace third occurrence
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", 3)} 1'
foo:123:XYZ:baz

$ # replace all occurrences, similar to gsub
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "XYZ", "g")} 1'
XYZ:XYZ:XYZ:XYZ

$ # target other than $0
$ echo 'foo:123:bar:baz' | awk -F: -v OFS=: '{$1=gensub(/o/, "b", 2, $1)} 1'
fob:123:bar:baz
```

* back-reference examples
    * use `\"` within double-quotes to represent `"` character in replacement string
    * use `\\1` to represent `\1` - the first captured group and so on
    * `&` or `\0` will back-reference entire matched string

```bash
$ # replacing last occurrence without knowing how many occurrences are there
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/(.*):/, "\\1-", 1)} 1'
foo:123:bar-baz
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)and/, "\\1XYZ", 1)} 1'
foo and bar and baz lXYZ good
$ # use word boundaries as necessary
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)\<and\>/, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # replacing last but one
$ echo '456:foo:123:bar:789:baz' | awk '{$0=gensub(/(.*):(.*:)/, "\\1-\\2", 1)} 1'
456:foo:123:bar-789:baz

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
```

* saving quotes in variables - to avoid escaping double quotes or having to use octal code for single quotes

```bash
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1'
'foo':'123':'bar':'baz'
$ echo 'foo:123:bar:baz' | awk -v sq="'" '{$0=gensub(/[^:]+/, sq"&"sq, "g")} 1'
'foo':'123':'bar':'baz'

$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\"&\"", "g")} 1'
"foo":"123":"bar":"baz"
$ echo 'foo:123:bar:baz' | awk -v dq='"' '{$0=gensub(/[^:]+/, dq"&"dq, "g")} 1'
"foo":"123":"bar":"baz"
```

**Further Reading**

* [gawk manual - String-Manipulation Functions](https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html)
* [gawk manual - escape processing](https://www.gnu.org/software/gawk/manual/html_node/Gory-Details.html)
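
`gensub` is specific to `gawk`, but `&` also works in the replacement string of `sub` and `gsub`, which do not accept `\1` style back-references. A minimal sketch with a made-up input string:

```bash
# & stands for the entire matched text
echo 'foo:123:bar' | awk '{gsub(/[0-9]+/, "<&>")} 1'     # foo:<123>:bar

# \\& inside a double quoted string gives a literal & character
echo 'foo:123:bar' | awk '{gsub(/[0-9]+/, "<\\&>")} 1'   # foo:<&>:bar
```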
## Inplace file editing * Use this option with caution, preferably after testing that the `awk` code is working as intended ```bash $ cat greeting.txt Hi there Have a nice day $ awk -i inplace '{gsub("e", "E")} 1' greeting.txt $ cat greeting.txt Hi thErE HavE a nicE day ``` * Multiple input files are treated individually and changes are written back to respective files ```bash $ cat f1 I ate 3 apples $ cat f2 I bought two bananas and 3 mangoes $ awk -i inplace '{gsub("3", "three")} 1' f1 f2 $ cat f1 I ate three apples $ cat f2 I bought two bananas and three mangoes ``` * to create backups of original file, set `INPLACE_SUFFIX` variable * **Note** that in newer versions, you have to use `inplace::suffix` instead of `INPLACE_SUFFIX` ```bash $ awk -i inplace -v INPLACE_SUFFIX='.bkp' '{gsub("three", "3")} 1' f1 $ cat f1 I ate 3 apples $ cat f1.bkp I ate three apples ``` * See [gawk manual - Enabling In-Place File Editing](https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html) for implementation details
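
Where the `inplace` extension is not available, a common alternative is to write the output to a temporary file and replace the original only if `awk` succeeded. The sketch below uses a scratch file created with `mktemp` as a stand-in for a real input file.

```bash
f=$(mktemp)                       # scratch file, stands in for a real input file
printf 'Hi there\n' > "$f"

# write to a second file first, replace the original only if awk succeeded
awk '{gsub("e", "E")} 1' "$f" > "$f.tmp" && mv "$f.tmp" "$f"

cat "$f"                          # Hi thErE
rm -f "$f"
```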
## Using shell variables

* when `awk` code is part of shell program and shell variable needs to be passed as input to `awk` code
* for example:
    * command line argument passed to shell script, which is in turn passed on to `awk`
    * control structures in shell script calling `awk` with different search strings
* See also [stackoverflow - How do I use shell variables in an awk script?](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script)

```bash
$ # examples tested with bash shell
$ f='apple'
$ awk -v word="$f" '$1==word' fruits.txt
apple 42
$ f='fig'
$ awk -v word="$f" '$1==word' fruits.txt
fig 90

$ q='20'
$ awk -v threshold="$q" 'NR==1 || $2>threshold' fruits.txt
fruit qty
apple 42
banana 31
fig 90
```

* accessing shell environment variables

```bash
$ # existing environment variable
$ awk 'BEGIN{print ENVIRON["PWD"]}'
/home/learnbyexample
$ awk 'BEGIN{print ENVIRON["SHELL"]}'
/bin/bash

$ # defined along with awk code
$ word='hello world' awk 'BEGIN{print ENVIRON["word"]}'
hello world

$ # using ENVIRON also prevents awk's interpretation of escape sequences
$ s='a\n=c'
$ foo="$s" awk 'BEGIN{print ENVIRON["foo"]}'
a\n=c
$ awk -v foo="$s" 'BEGIN{print foo}'
a
=c
```

* passing *REGEXP*
* See also [gawk manual - Using Dynamic Regexps](https://www.gnu.org/software/gawk/manual/html_node/Computed-Regexps.html)

```bash
$ s='are'

$ # for: awk '!/are/' poem.txt
$ awk -v s="$s" '$0 !~ s' poem.txt
Sugar is sweet,

$ # for: awk '/are/ && !/so/' poem.txt
$ awk -v s="$s" '$0 ~ s && !/so/' poem.txt
Roses are red,
Violets are blue,

$ r='[^-]+'
$ echo '1-2-3-4-5' | awk -v r="$r" '{gsub(r, "abc")} 1'
abc-abc-abc-abc-abc

$ # escape sequence has to be doubled when string is interpreted as REGEXP
$ s='foo and bar and baz land good'
$ echo "$s" | awk '{$0=gensub("(.*)\\<and\\>", "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # hence passing as variable should be
$ r='(.*)\\<and\\>'
$ echo "$s" | awk -v r="$r" '{$0=gensub(r, "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good

$ # or use ENVIRON
$ r='(.*)\<and\>'
$ echo "$s" | r="$r" awk '{$0=gensub(ENVIRON["r"], "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
```
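
The difference in escape handling between `-v` and `ENVIRON` also matters for literal searches with `index`. A minimal sketch, assuming a made-up input line that contains a backslash followed by `t`:

```bash
s='a\tb'    # four characters: a, backslash, t, b

# ENVIRON passes the value untouched, so the literal search finds the line
printf 'a\\tb\n' | s="$s" awk 'index($0, ENVIRON["s"])'    # a\tb

# -v converts \t to a tab character, so nothing matches here
printf 'a\\tb\n' | awk -v s="$s" 'index($0, s)'
```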
## Multiple file input * Example to show difference between `NR` and `FNR` ```bash $ # NR for overall record number $ awk 'NR==1' poem.txt greeting.txt Roses are red, $ # FNR for individual file's record number $ # same as: head -q -n1 poem.txt greeting.txt $ awk 'FNR==1' poem.txt greeting.txt Roses are red, Hi thErE ``` * Constructs to do some processing before starting each file as well as at the end * `BEGINFILE` - to add code to be executed before start of each input file * `ENDFILE` - to add code to be executed after processing each input file * `FILENAME` - file name of current input file being processed ```bash $ # similar to: tail -n1 poem.txt greeting.txt $ awk 'BEGINFILE{print "file: "FILENAME} ENDFILE{print $0"\n------"}' poem.txt greeting.txt file: poem.txt And so are you. ------ file: greeting.txt HavE a nicE day ------ ``` * And of course, there can be usual `awk` code ```bash $ awk 'BEGINFILE{print "file: "FILENAME} FNR==1; ENDFILE{print "------"}' poem.txt greeting.txt file: poem.txt Roses are red, ------ file: greeting.txt Hi thErE ------ $ awk 'BEGINFILE{c++; print "file: "FILENAME} FNR==2; END{print "\nTotal input files: "c}' poem.txt greeting.txt file: poem.txt Violets are blue, file: greeting.txt HavE a nicE day Total input files: 2 ``` **Further Reading** * [gawk manual - Using ARGC and ARGV](https://www.gnu.org/software/gawk/manual/html_node/ARGC-and-ARGV.html) * [gawk manual - ARGIND](https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#index-ARGIND-variable) * [gawk manual - ERRNO](https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#index-ERRNO-variable) * [stackoverflow - Finding common value across multiple files](https://stackoverflow.com/a/43473385/4082052)
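
To see `FILENAME`, `FNR` and `NR` side by side, the following sketch creates two throwaway files (the names `a.txt` and `b.txt` are made up) in a temporary directory:

```bash
dir=$(mktemp -d)
printf '1\n2\n3\n' > "$dir/a.txt"
printf 'x\ny\n' > "$dir/b.txt"

# FNR restarts at 1 for every file, NR keeps counting across files
(cd "$dir" && awk '{print FILENAME, FNR, NR}' a.txt b.txt)
# a.txt 1 1
# a.txt 2 2
# a.txt 3 3
# b.txt 1 4
# b.txt 2 5

rm -r "$dir"
```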
## Control Structures * Syntax is similar to `C` language and single statements inside control structures don't require to be grouped within `{}` * See [gawk manual - Control Statements](https://www.gnu.org/software/gawk/manual/html_node/Statements.html) for details Remember that by default there is a loop that goes over all input records and constructs like `BEGIN` and `END` fall outside that loop ```bash $ cat nums.txt 42 -2 10101 -3.14 -75 $ awk '{sum += $1} END{print sum}' nums.txt 10062.9 $ # uninitialized variables will have empty string $ printf '' | awk '{sum += $1} END{print sum}' $ # so either add '0' or use unary '+' operator to convert to number $ printf '' | awk '{sum += $1} END{print +sum}' 0 $ awk '{sum += $1} END{print sum+0}' /dev/null 0 ``` * See also [unix.stackexchange - change in behavior of unary + with gawk version 4.2.0](https://unix.stackexchange.com/questions/421904/regression-with-unary-plus)
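
The same concern about empty input applies when dividing in the `END` block. A minimal sketch of an average that guards against zero records, using made-up numbers instead of `nums.txt`:

```bash
# NR holds the total number of records once END is reached
printf '10\n20\n30\n' | awk '{s += $1} END{print (NR ? s/NR : 0)}'    # 20

# no records: the ternary avoids a division by zero
printf '' | awk '{s += $1} END{print (NR ? s/NR : 0)}'                # 0
```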
#### if-else and loops * We have already seen simple `if` examples in [Filtering](#filtering) section * See also [gawk manual - Switch](https://www.gnu.org/software/gawk/manual/html_node/Switch-Statement.html) ```bash $ # same as: sed -n '/are/ s/so/SO/p' poem.txt $ # remember that sub/gsub returns number of substitutions made $ awk '/are/{if(sub("so", "SO")) print}' poem.txt And SO are you. $ # of course, can also use $ awk '/are/ && sub("so", "SO")' poem.txt And SO are you. $ # if-else example $ awk 'NR>1{if($2>40) $0="+"$0; else $0="-"$0} 1' fruits.txt fruit qty +apple 42 -banana 31 +fig 90 -guava 6 ``` * ternary operator * See also [stackoverflow - finding min and max value of a column](https://stackoverflow.com/a/29784278/4082052) ```bash $ cat nums.txt 42 -2 10101 -3.14 -75 $ # changing -ve to +ve and vice versa $ # same as: awk '{if($0 ~ /^-/) sub(/^-/,""); else sub(/^/,"-")} 1' nums.txt $ awk '{$0 ~ /^-/ ? sub(/^-/,"") : sub(/^/,"-")} 1' nums.txt -42 2 -10101 3.14 75 $ # can also use: awk '!sub(/^-/,""){sub(/^/,"-")} 1' nums.txt ``` * for loop * similar to `C` language, `break` and `continue` statements are also available * See also [stackoverflow - find missing numbers from sequential list](https://stackoverflow.com/questions/38491676/how-can-i-find-the-missing-integers-in-a-unique-and-sequential-list-one-per-lin) ```bash $ awk 'BEGIN{for(i=2; i<11; i+=2) print i}' 2 4 6 8 10 $ # looping each field $ s='scat:cat:no cat:abdicate:cater' $ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) if($i=="cat") $i="CAT"} 1' scat:CAT:no cat:abdicate:cater $ # can also use sub function $ echo "$s" | awk -F: -v OFS=: '{for(i=1;i<=NF;i++) sub(/^cat$/,"CAT",$i)} 1' scat:CAT:no cat:abdicate:cater ``` * while loop * do-while is also available ```bash $ awk 'BEGIN{i=2; while(i<11){print i; i+=2}}' 2 4 6 8 10 $ # recursive substitution $ # here again return value of sub/gsub is useful $ echo 'titillate' | awk '{while( gsub(/til/, "") ) print}' tilate ate ```
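
The bullets above mention `do-while`, `break` and `continue` without showing them. A short sketch of each:

```bash
# do-while checks the condition after the body
awk 'BEGIN{i=2; do {print i; i+=2} while(i<11)}'
# 2 4 6 8 10 (one number per line)

# so the body runs once even if the condition is false to begin with
awk 'BEGIN{i=20; do {print i; i+=2} while(i<11)}'    # 20

# continue skips odd numbers, break stops the loop once i exceeds 6
awk 'BEGIN{for(i=1; i<=10; i++){if(i%2) continue; if(i>6) break; print i}}'
# 2 4 6 (one number per line)
```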
#### next and nextfile * `next` will skip rest of statements and start processing next line of current file being processed * there is a loop by default which goes over all input records, `next` is applicable for that * it is similar to `continue` statement within loops * it is often used in [Two file processing](#two-file-processing) ```bash $ # here 'next' is used to skip processing header line $ awk 'NR==1{print; next} /a.*a/{$0="*"$0} /[eiou]/{$0="-"$0} 1' fruits.txt fruit qty -apple 42 *banana 31 -fig 90 -*guava 6 ``` * `nextfile` is useful to skip remaining lines from current file being processed and move on to next file ```bash $ # same as: head -q -n1 poem.txt greeting.txt fruits.txt $ awk 'FNR>1{nextfile} 1' poem.txt greeting.txt fruits.txt Roses are red, Hi thErE fruit qty $ # specific field $ awk 'FNR>2{nextfile} {print $1}' poem.txt greeting.txt fruits.txt Roses Violets Hi HavE fruit apple $ # similar to 'grep -il' $ awk -v IGNORECASE=1 '/red/{print FILENAME; nextfile}' * colors_1.txt colors_2.txt poem.txt $ awk -v IGNORECASE=1 '$1 ~ /red/{print FILENAME; nextfile}' * colors_1.txt colors_2.txt ```
## Multiline processing * Processing consecutive lines ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # match two consecutive lines $ awk 'p~/are/ && /is/{print p ORS $0} {p=$0}' poem.txt Violets are blue, Sugar is sweet, $ # if only the second line is needed $ awk 'p~/are/ && /is/; {p=$0}' poem.txt Sugar is sweet, $ # match three consecutive lines $ awk 'p2~/red/ && p1~/blue/ && /is/{print p2} {p2=p1; p1=$0}' poem.txt Roses are red, $ # common mistake $ sed -n '/are/{N;/is/p}' poem.txt $ # would need something like this and not practical to extend for other cases $ sed '$!N; /are.*\n.*is/p; D' poem.txt Violets are blue, Sugar is sweet, ``` Consider this sample input file ```bash $ cat range.txt foo BEGIN 1234 6789 END bar BEGIN a b c END baz ``` * extracting lines around matching line * See also [stackoverflow - lines around matching regexp](https://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern) * how `n && n--` works: * need to note that right hand side of `&&` is processed only if left hand side is `true` * so for example, if initially `n=2`, then we get * `2 && 2; n=1` - evaluates to `true` * `1 && 1; n=0` - evaluates to `true` * `0 && ` - evaluates to `false` ... 
no decrementing `n` and hence will be `false` until `n` is re-assigned non-zero value ```bash $ # similar to: grep --no-group-separator -A1 'BEGIN' range.txt $ awk '/BEGIN/{n=2} n && n--' range.txt BEGIN 1234 BEGIN a $ # only print the line after matching line $ # can also use: awk '/BEGIN/{n=1; next} n && n--' range.txt $ awk 'n && n--; /BEGIN/{n=1}' range.txt 1234 a $ # generic case: print nth line after match $ awk 'n && !--n; /BEGIN/{n=3}' range.txt END c $ # print second line prior to matched line $ awk '/END/{print p2} {p2=p1; p1=$0}' range.txt 1234 b $ # save all lines in an array for generic case $ # NR>n is checked to avoid printing empty line if there is a match $ # within first n lines $ awk -v n=3 '/BEGIN/ && NR>n{print a[NR-n]} {a[NR]=$0}' range.txt 6789 $ # or, use the reversing trick $ tac range.txt | awk 'n && !--n; /END/{n=3}' | tac BEGIN a ``` * Checking if multiple strings are present at least once in entire input file * If there are lots of strings to check, use arrays ```bash $ # can also use BEGINFILE instead of FNR==1 $ awk 'FNR==1{s1=s2=0} /is/{s1=1} /are/{s2=1} s1&&s2{print FILENAME; nextfile}' * poem.txt sample.txt $ awk 'FNR==1{s1=s2=0} /foo/{s1=1} /report/{s2=1} s1&&s2{print FILENAME; nextfile}' * paths.txt ``` **Further Reading** * [stackoverflow - delete line based on content of previous/next lines](https://stackoverflow.com/questions/49112877/delete-line-if-line-matches-foo-line-above-matches-bar-and-line-below-match) * [softwareengineering - FSM examples](https://softwareengineering.stackexchange.com/questions/47806/examples-of-finite-state-machines) * [wikipedia - FSM](https://en.wikipedia.org/wiki/Finite-state_machine)
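
The `n && n--` walkthrough from this section can be traced directly. This sketch needs no input file; the names `r` and `i` are only for illustration. It evaluates the condition four times starting from `n=2`:

```bash
awk 'BEGIN{
    n = 2
    for (i = 1; i <= 4; i++) {
        # n-- is evaluated only when n is non-zero
        r = (n && n--) ? "print" : "skip"
        print i, r, "n=" n
    }
}'
# 1 print n=1
# 2 print n=0
# 3 skip n=0
# 4 skip n=0
```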
## Two file processing

* We'll use awk's associative arrays (key-value pairs) here
    * key can be number or string
    * See also [gawk manual - Arrays](https://www.gnu.org/software/gawk/manual/html_node/Arrays.html)
* Unlike [comm](./sorting_stuff.md#comm) the input files need not be sorted and comparison can be done based on certain field(s) as well

#### Comparing whole lines Consider the following test files ```bash $ cat colors_1.txt Blue Brown Purple Red Teal Yellow $ cat colors_2.txt Black Blue Green Red White ``` * common lines and lines unique to one of the files * For two files as input, `NR==FNR` will be true only when first file is being processed * Using `next` will skip rest of code when first file is processed * `a[$0]` will create unique keys (here entire line content is used as key) in array `a` * just referencing a key will create it if it doesn't already exist, with value as empty string (will also act as zero in numeric context) * `$0 in a` will be true if key already exists in array `a` ```bash $ # common lines $ # same as: grep -Fxf colors_1.txt colors_2.txt $ awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt Blue Red $ # lines from colors_2.txt not present in colors_1.txt $ # same as: grep -vFxf colors_1.txt colors_2.txt $ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt Black Green White $ # reversing the order of input files gives $ # lines from colors_1.txt not present in colors_2.txt $ awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_2.txt colors_1.txt Brown Purple Teal Yellow ```
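
To see why `NR==FNR` is true only while the first file is being read, it can help to print both counters side by side. The following is a small sketch; the scratch files `first.txt` and `second.txt` are hypothetical and are created in a temporary directory just for this illustration.

```bash
# hypothetical scratch files, created only for this illustration
tmp=$(mktemp -d)
printf 'Blue\nRed\n' > "$tmp/first.txt"
printf 'Black\nBlue\nGreen\n' > "$tmp/second.txt"

# columns: NR, FNR, result of NR==FNR (1 is true, 0 is false), line content
trace=$(awk '{print NR, FNR, (NR==FNR), $0}' "$tmp/first.txt" "$tmp/second.txt")
printf '%s\n' "$trace"
# 1 1 1 Blue
# 2 2 1 Red
# 3 1 0 Black
# 4 2 0 Blue
# 5 3 0 Green

rm -r "$tmp"
```

`FNR` restarts from `1` for every input file whereas `NR` keeps counting, so the two differ from the second file onwards. One caveat worth keeping in mind: if the first file is empty, `NR==FNR` will be true for the second file as well.
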
#### Comparing specific fields Consider the sample input file ```bash $ cat marks.txt Dept Name Marks ECE Raj 53 ECE Joel 72 EEE Moi 68 CSE Surya 81 EEE Tia 59 ECE Om 92 CSE Amy 67 ``` * single field * For ex: only first field comparison by using `$1` instead of `$0` as key ```bash $ cat list1 ECE CSE $ # extract only lines matching first field specified in list1 $ awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt ECE Raj 53 ECE Joel 72 CSE Surya 81 ECE Om 92 CSE Amy 67 $ # if header is needed as well $ awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' list1 marks.txt Dept Name Marks ECE Raj 53 ECE Joel 72 CSE Surya 81 ECE Om 92 CSE Amy 67 ``` * multiple fields * create a string by adding some character between the fields to act as key * for ex: to avoid matching two field values `abc` and `123` to match with two other field values `ab` and `c123` * by adding character, say `_`, the key would be `abc_123` for first case and `ab_c123` for second case * this can still lead to false match if input data has `_` * there is also a built-in way to do this using [gawk manual - Multidimensional Arrays](https://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional) ```bash $ cat list2 EEE Moi CSE Amy ECE Raj $ # extract only lines matching both fields specified in list2 $ awk 'NR==FNR{a[$1"_"$2]; next} $1"_"$2 in a' list2 marks.txt ECE Raj 53 EEE Moi 68 CSE Amy 67 $ # uses SUBSEP as separator, whose default value is non-printing character \034 $ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' list2 marks.txt ECE Raj 53 EEE Moi 68 CSE Amy 67 ``` * field and value comparison ```bash $ cat list3 ECE 70 EEE 65 CSE 80 $ # extract line matching Dept and minimum marks specified in list3 $ awk 'NR==FNR{d[$1]=$2; next} $1 in d && $3 >= d[$1]' list3 marks.txt ECE Joel 72 EEE Moi 68 CSE Surya 81 ECE Om 92 ```
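
The false match mentioned above for multiple field keys (`abc` and `123` versus `ab` and `c123`) can be demonstrated with a short sketch. The helper name `check` and its `mode` argument are assumptions made for this example, they are not part of `awk`.

```bash
# check MODE : store the fields 'abc' and '123' as a key, then test whether
# the different field pair 'ab' and 'c123' is reported as present
check() {
    printf 'abc 123\n' | awk -v mode="$1" '
        {
            if (mode == "plain") seen[$1 $2]   # key is abc123
            else                 seen[$1, $2]  # key is abc SUBSEP 123
        }
        END {
            k = (mode == "plain") ? "ab" "c123" : "ab" SUBSEP "c123"
            r = (k in seen) ? "match" : "no match"
            print r
        }'
}

check plain     # match (a false positive, both keys collapse to abc123)
check subsep    # no match
```

As with the `_` character, using `SUBSEP` can still give a false match if the input data itself contains that character, but that is far less likely for a non-printing character.
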
#### getline * `getline` is an alternative way to read from a file and could be faster than `NR==FNR` method for some cases * But use it with caution * [gawk manual - getline](https://www.gnu.org/software/gawk/manual/html_node/Getline.html) for details, especially about corner cases, errors, etc * [getline caveats](https://web.archive.org/web/20170524214527/http://awk.freeshell.org/AllAboutGetline) * [gawk manual - Closing Input and Output Redirections](https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html) if you have to start from beginning of file again * `getline` return value: `1` if record is found, `0` if end of file, `-1` for errors such as file not found (use `ERRNO` variable to get details) ```bash $ # replace mth line in poem.txt with nth line from nums.txt $ # return value handling is not shown here, but should be done ideally $ awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "nums.txt"} FNR==m{$0=s} 1' poem.txt Roses are red, Violets are blue, -2 And so are you. $ # without getline, but slower due to NR==FNR check for every line processed $ awk -v m=3 -v n=2 'NR==FNR{if(FNR==n){s=$0; nextfile} next} FNR==m{$0=s} 1' nums.txt poem.txt Roses are red, Violets are blue, -2 And so are you. 
$ # Note that if nums.txt has less than n lines: $ # getline version will use last line of nums.txt if any $ # NR==FNR version will give empty string as 's' would be uninitialized ``` * Another use case is if two files are to be processed simultaneously ```bash $ # print line from fruits.txt if corresponding line from nums.txt is +ve number $ # the return value check ensures corresponding line number comparison $ awk -v file='nums.txt' '(getline num < file)==1 && num>0' fruits.txt fruit qty banana 31 $ # without getline, but has to save entire file in array $ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' nums.txt fruits.txt fruit qty banana 31 ``` * error handling ```bash $ awk 'NR==FNR{n[FNR]=$0; next} n[FNR]>0' xyz.txt fruits.txt awk: fatal: cannot open file 'xyz.txt' for reading (No such file or directory) $ awk -v file='xyz.txt' '{ e=(getline num < file); if(e<0){print file ": " ERRNO; exit} } e==1 && num>0' fruits.txt xyz.txt: No such file or directory ``` **Further Reading** * [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash) * [unix.stackexchange - filter lines based on line numbers specified in another file](https://unix.stackexchange.com/questions/320651/read-numbers-from-control-file-and-extract-matching-line-numbers-from-the-data-f) * [stackoverflow - three file processing to extract a matrix subset](https://stackoverflow.com/questions/45036019/how-to-filter-the-values-from-selected-columns-and-rows) * [unix.stackexchange - column wise merging](https://unix.stackexchange.com/questions/294145/merging-two-files-one-column-at-a-time) * [stackoverflow - extract specific rows from a text file using an index file](https://stackoverflow.com/questions/40595990/print-many-specific-rows-from-a-text-file-using-an-index-file)
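
The first `getline` example in this section skips return value handling. Here is one possible way to add it, as a sketch only: the shell function `nth_line`, the diagnostic messages and the scratch copy of `nums.txt` (with made-up content) are all assumptions for illustration.

```bash
tmp=$(mktemp -d)
printf '10\n-2\n30\n' > "$tmp/nums.txt"    # made-up sample content

# nth_line FILE N : print the Nth line of FILE, or a diagnostic message
nth_line() {
    awk -v file="$1" -v n="$2" 'BEGIN{
        # stop once n lines are read, or when getline returns 0 (EOF) or -1 (error)
        while (i < n && (r = (getline line < file)) > 0)
            i++
        if (r < 0)       print "cannot read file"
        else if (i < n)  print "only " i " line(s) available"
        else             print line
        close(file)
    }'
}

second=$(nth_line "$tmp/nums.txt" 2)
short=$(nth_line "$tmp/nums.txt" 5)
missing=$(nth_line "$tmp/no_such_file.txt" 1)
printf '%s\n' "$second" "$short" "$missing"
# -2
# only 3 line(s) available
# cannot read file

rm -r "$tmp"
```

Saving the return value in `r` allows the three outcomes (record found, end of file, error) to be told apart after the loop, instead of silently reusing the last line that was read.
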
## Creating new fields * Number of fields in input record can be changed by simply manipulating `NF` ```bash $ # reducing fields $ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=2} 1' foo,bar $ # creating new empty field(s) $ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=5} 1' foo,bar,123,baz, $ # assigning to field greater than NF will create empty fields as needed $ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{$7=42} 1' foo,bar,123,baz,,,42 ``` * adding a field based on existing fields ```bash $ # adding a new 'Grade' field $ awk 'BEGIN{OFS="\t"; g[9]="S"; g[8]="A"; g[7]="B"; g[6]="C"; g[5]="D"} {NF++; $NF = NR==1 ? "Grade" : g[int($(NF-1)/10)]} 1' marks.txt Dept Name Marks Grade ECE Raj 53 D ECE Joel 72 B EEE Moi 68 C CSE Surya 81 A EEE Tia 59 D ECE Om 92 S CSE Amy 67 C $ # can also use split (covered in a later section) $ # array assignment: split("DCBAS",g,//) $ # index adjustment: g[int($(NF-1)/10)-4] ``` * two file example ```bash $ cat list4 Raj class_rep Amy sports_rep Tia placement_rep $ awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next} {$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' list4 marks.txt Dept Name Marks Role ECE Raj 53 class_rep ECE Joel 72 EEE Moi 68 CSE Surya 81 EEE Tia 59 placement_rep ECE Om 92 CSE Amy 67 sports_rep ```
## Dealing with duplicates * default value of uninitialized variable is `0` in numeric context and empty string in text context * and evaluates to `false` when used conditionally *Illustration to show default numeric value and array in action* ```bash $ printf 'mad\n42\n42\ndam\n42\n' mad 42 42 dam 42 $ printf 'mad\n42\n42\ndam\n42\n' | awk '{print $0 "\t" int(a[$0]); a[$0]++}' mad 0 42 0 42 1 dam 0 42 2 $ # only those entries with second column value zero will be retained $ printf 'mad\n42\n42\ndam\n42\n' | awk '!a[$0]++' mad 42 dam ``` * first, examples that retain only first copy of duplicates * See also [iridakos: remove duplicates](https://iridakos.com/how-to/2019/05/16/remove-duplicate-lines-preserving-order-linux.html) for a detailed explanation * See also [stackoverflow - add a letter to duplicate entries](https://stackoverflow.com/questions/47774779/add-letter-to-second-third-fourth-occurrence-of-a-string) ```bash $ cat duplicates.txt abc 7 4 food toy **** abc 7 4 test toy 123 good toy **** $ # whole line $ awk '!seen[$0]++' duplicates.txt abc 7 4 food toy **** test toy 123 good toy **** $ # particular column $ awk '!seen[$2]++' duplicates.txt abc 7 4 food toy **** $ # total count $ awk '!seen[$2]++{c++} END{print +c}' duplicates.txt 2 ``` * if input is so large that integer numbers can overflow * See also [gawk manual - Arbitrary-Precision Integer Arithmetic](https://www.gnu.org/software/gawk/manual/html_node/Arbitrary-Precision-Integers.html) ```bash $ # avoid unnecessary counting altogether $ awk '!($2 in seen); {seen[$2]}' duplicates.txt abc 7 4 food toy **** $ # use arbitrary-precision integers, limited only by available memory $ awk -M '!($2 in seen){c++} {seen[$2]} END{print +c}' duplicates.txt 2 ``` * For multiple fields, separate them using `,` or form a string with some character in between * choose a character unlikely to appear in input data, else there can be false matches * `FS` is a good choice as fields wouldn't contain separator 
character(s) ```bash $ awk '!seen[$2 FS $3]++' duplicates.txt abc 7 4 food toy **** test toy 123 $ # can also use simulated multidimensional array $ # SUBSEP, whose default is \034 non-printing character, is used as separator $ awk '!seen[$2,$3]++' duplicates.txt abc 7 4 food toy **** test toy 123 ``` * retaining specific numbered copy ```bash $ # second occurrence of duplicate $ awk '++seen[$2]==2' duplicates.txt abc 7 4 test toy 123 $ # third occurrence of duplicate $ awk '++seen[$2]==3' duplicates.txt good toy **** ``` * retaining only last copy of duplicate ```bash $ # reverse the input line-wise, retain first copy and then reverse again $ tac duplicates.txt | awk '!seen[$2]++' | tac abc 7 4 good toy **** ``` * filtering based on duplicate count * allows to emulate [uniq](./sorting_stuff.md#uniq) command for specific fields * See also [unix.stackexchange - retain only parent directory paths](https://unix.stackexchange.com/questions/362571/filter-out-paths-from-a-text-file-that-are-deeper-than-their-immediate-predecces) ```bash $ # all duplicates based on 1st column $ awk 'NR==FNR{a[$1]++; next} a[$1]>1' duplicates.txt duplicates.txt abc 7 4 abc 7 4 $ # all duplicates based on 3rd column $ awk 'NR==FNR{a[$3]++; next} a[$3]>1' duplicates.txt duplicates.txt abc 7 4 food toy **** abc 7 4 good toy **** $ # more than 2 duplicates based on 2nd column $ awk 'NR==FNR{a[$2]++; next} a[$2]>2' duplicates.txt duplicates.txt food toy **** test toy 123 good toy **** $ # only unique lines based on 3rd column $ awk 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt test toy 123 ```
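
The compact `!seen[$0]++` idiom relies on the default value of an uninitialized array element and on post-increment returning the old value. A long-hand sketch of the same logic is shown below; the shell variable `out` is just a name chosen for this example.

```bash
# long-hand form of: awk '!seen[$0]++'
out=$(printf 'mad\n42\n42\ndam\n42\n' | awk '
    {
        if (seen[$0] == 0) {   # element not seen so far acts as 0
            print              # so only the first copy is printed
        }
        seen[$0]++             # count every occurrence
    }')

printf '%s\n' "$out"
# mad
# 42
# dam
```
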
## Lines between two REGEXPs

* This section deals with filtering lines bound by two *REGEXP*s (referred to as blocks)
* For simplicity, the two *REGEXP*s used in most of the examples below are the strings **BEGIN** and **END**

#### All unbroken blocks

Consider the below sample input file, which doesn't have any broken blocks (i.e. **BEGIN** and **END** are always present in pairs)

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* Extracting lines between starting and ending *REGEXP*

```bash
$ # include both starting/ending REGEXP
$ # can also use: awk '/BEGIN/,/END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
$ awk '/BEGIN/{f=1} f; /END/{f=0}' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # exclude both starting/ending REGEXP
$ # can also use: awk '/BEGIN/{f=1; next} /END/{f=0} f' range.txt
$ awk '/END/{f=0} f; /BEGIN/{f=1}' range.txt
1234
6789
a
b
c
```

* Include only start or end *REGEXP*

```bash
$ # include only starting REGEXP
$ awk '/BEGIN/{f=1} /END/{f=0} f' range.txt
BEGIN
1234
6789
BEGIN
a
b
c

$ # include only ending REGEXP
$ awk 'f; /END/{f=0} /BEGIN/{f=1}' range.txt
1234
6789
END
a
b
c
END
```

* Extracting lines other than lines between the two *REGEXP*s

```bash
$ awk '/BEGIN/{f=1} !f; /END/{f=0}' range.txt
foo
bar
baz

$ # the other three cases would be
$ awk '/END/{f=0} !f; /BEGIN/{f=1}' range.txt
$ awk '!f; /BEGIN/{f=1} /END/{f=0}' range.txt
$ awk '/BEGIN/{f=1} /END/{f=0} !f' range.txt
```
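
The variations in this section differ only in where the flag `f` is tested relative to the statements that set and clear it. To make that visible, the sketch below prints the state of the flag for each line instead of filtering. The shortened sample input and the labels `in` and `out` are assumptions made for this illustration.

```bash
# flag is set before the test and cleared after it,
# so both the BEGIN and END lines are reported as being inside the block
trace=$(printf '%s\n' foo BEGIN 1234 END bar | awk '
    /BEGIN/ { f = 1 }
    { print $0, (f ? "in" : "out") }
    /END/   { f = 0 }')

printf '%s\n' "$trace"
# foo out
# BEGIN in
# 1234 in
# END in
# bar out
```

Moving the `print` line above `/BEGIN/` or below `/END/` changes which of the boundary lines get reported as `in`, which corresponds to the include and exclude cases discussed in this section.
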
#### Specific blocks * Getting first block ```bash $ awk '/BEGIN/{f=1} f; /END/{exit}' range.txt BEGIN 1234 6789 END $ # use other tricks discussed in previous section as needed $ awk '/END/{exit} f; /BEGIN/{f=1}' range.txt 1234 6789 ``` * Getting last block ```bash $ # reverse input linewise, change the order of REGEXPs, finally reverse again $ tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac BEGIN a b c END $ # or, save the blocks in a buffer and print the last one alone $ # ORS contains output record separator, which is newline by default $ seq 30 | awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}' 24 25 26 ``` * Getting blocks based on a counter ```bash $ # all blocks $ seq 30 | sed -n '/4/,/6/p' 4 5 6 14 15 16 24 25 26 $ # get only 2nd block $ # can also use: seq 30 | awk -v b=2 '/4/{c++} c==b{print; if(/6/) exit}' $ seq 30 | awk -v b=2 '/4/{c++} c==b; /6/ && c==b{exit}' 14 15 16 $ # to get all blocks greater than 'b' blocks $ seq 30 | awk -v b=1 '/4/{f=1; c++} f && c>b; /6/{f=0}' 14 15 16 24 25 26 ``` * excluding a particular block ```bash $ # excludes 2nd block $ seq 30 | awk -v b=2 '/4/{f=1; c++} f && c!=b; /6/{f=0}' 4 5 6 24 25 26 ```
#### Broken blocks * If there are blocks with ending *REGEXP* but without corresponding start, `awk '/BEGIN/{f=1} f; /END/{f=0}'` will suffice * Consider the modified input file where starting *REGEXP* doesn't have corresponding ending ```bash $ cat broken_range.txt foo BEGIN 1234 6789 END bar BEGIN a b c baz $ # the file reversing trick comes in handy here as well $ tac broken_range.txt | awk '/END/{f=1} f; /BEGIN/{f=0}' | tac BEGIN 1234 6789 END ``` * But if both kinds of broken blocks are present, accumulate the records and print accordingly ```bash $ cat multiple_broken.txt qqqqqqq BEGIN foo BEGIN 1234 6789 END bar END 0-42-1 BEGIN a BEGIN b END xyzabc $ awk '/BEGIN/{f=1; buf=$0; next} f{buf=buf ORS $0} /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt BEGIN 1234 6789 END BEGIN b END ``` **Further Reading** * [stackoverflow - select lines between two regexps](https://stackoverflow.com/questions/38972736/how-to-select-lines-between-two-patterns) * [unix.stackexchange - print only blocks with lines > n](https://unix.stackexchange.com/questions/295600/deleting-lines-between-rows-in-a-text-file-using-awk-or-sed) * [unix.stackexchange - print a block only if it contains matching string](https://unix.stackexchange.com/a/335523/109046) * [unix.stackexchange - print a block matching two different strings](https://unix.stackexchange.com/questions/347368/grep-with-range-and-pass-three-filters) * [unix.stackexchange - extract block up to 2nd occurrence of ending REGEXP](https://unix.stackexchange.com/questions/404175/using-awk-to-print-lines-from-one-match-through-a-second-instance-of-a-separate)
## Arrays We've already seen examples using arrays, some more examples discussed in this section * array looping ```bash $ # average marks for each department $ awk 'NR>1{d[$1]+=$3; c[$1]++} END{for(i in d)print i, d[i]/c[i]}' marks.txt ECE 72.3333 EEE 63.5 CSE 74 ``` * Sorting * See [gawk manual - Predefined Array Scanning Orders](https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html#Controlling-Scanning) for more details ```bash $ # by default, keys are traversed in random order $ awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}' x 12 z 1 b 42 $ # index sorted ascending order as strings $ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc"; a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}' b 42 x 12 z 1 $ # value sorted ascending order as numbers $ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"; a["z"]=1; a["x"]=12; a["b"]=42; for(i in a)print i, a[i]}' z 1 x 12 b 42 ``` * deleting array elements ```bash $ cat list5 CSE Surya 75 EEE Jai 69 ECE Kal 83 $ # update entry if a match is found $ # else append the new entries $ awk '{ky=$1"_"$2} NR==FNR{upd[ky]=$0; next} ky in upd{$0=upd[ky]; delete upd[ky]} 1; END{for(i in upd)print upd[i]}' list5 marks.txt Dept Name Marks ECE Raj 53 ECE Joel 72 EEE Moi 68 CSE Surya 75 EEE Tia 59 ECE Om 92 CSE Amy 67 ECE Kal 83 EEE Jai 69 ``` * true multidimensional arrays * length of sub-arrays need not be same. 
See [gawk manual - Arrays of Arrays](https://www.gnu.org/software/gawk/manual/html_node/Arrays-of-Arrays.html#Arrays-of-Arrays) for details ```bash $ awk 'NR>1{d[$1][$2]=$3} END{for(i in d["ECE"])print i}' marks.txt Joel Raj Om $ awk -v f='CSE' 'NR>1{d[$1][$2]=$3} END{for(i in d[f])print i, d[f][i]}' marks.txt Surya 81 Amy 67 ``` **Further Reading** * [gawk manual - all array topics](https://www.gnu.org/software/gawk/manual/html_node/Arrays.html) * [unix.stackexchange - count words based on length](https://unix.stackexchange.com/questions/396855/is-there-an-easy-way-to-count-characters-in-words-in-file-from-terminal) * [unix.stackexchange - filtering specific lines](https://unix.stackexchange.com/a/326215/109046)
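
`PROCINFO["sorted_in"]` is specific to GNU awk. If the script has to run with other `awk` implementations too, one possible workaround is to let the external `sort` command order the output of the `for` loop. A minimal sketch, with the shell variable names `by_key` and `by_val` chosen just for this example:

```bash
# same array as in the sorting examples of this section
prog='BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'

# order by key (first column, as strings)
by_key=$(awk "$prog" | sort -k1,1)

# order by value (second column, numerically)
by_val=$(awk "$prog" | sort -k2,2n)

printf '%s\n' "$by_key" "$by_val"
# b 42
# x 12
# z 1
# z 1
# x 12
# b 42
```

This works only when sorting is needed for the final output. For ordering required while the `awk` program is still running, the gawk features discussed in this section are the better fit.
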
## awk scripts * For larger programs, save the code in a file and use `-f` command line option * `;` is not needed to terminate a statement * See also [gawk manual - Command-Line Options](https://www.gnu.org/software/gawk/manual/html_node/Options.html#Options) for other related options ```bash $ cat buf.awk /BEGIN/{ f=1 buf=$0 next } f{ buf=buf ORS $0 } /END/{ f=0 if(buf) print buf buf="" } $ awk -f buf.awk multiple_broken.txt BEGIN 1234 6789 END BEGIN b END ``` * Another advantage is that single quotes can be freely used ```bash $ echo 'foo:123:bar:baz' | awk '{$0=gensub(/[^:]+/, "\047&\047", "g")} 1' 'foo':'123':'bar':'baz' $ cat quotes.awk { $0 = gensub(/[^:]+/, "'&'", "g") } 1 $ echo 'foo:123:bar:baz' | awk -f quotes.awk 'foo':'123':'bar':'baz' ``` * If the code has been first tried out on command line, add `-o` option to get a pretty printed version ```bash $ awk -o -v OFS='\t' 'NR==FNR{r[$1]=$2; next} {$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' list4 marks.txt Dept Name Marks Role ECE Raj 53 class_rep ECE Joel 72 EEE Moi 68 CSE Surya 81 EEE Tia 59 placement_rep ECE Om 92 CSE Amy 67 sports_rep ``` File name can be passed along `-o` option, otherwise by default `awkprof.out` will be used ```bash $ cat awkprof.out # gawk profile, created Mon Mar 16 10:11:11 2020 # Rule(s) NR == FNR { r[$1] = $2 next } { $(NF + 1) = (FNR == 1 ? "Role" : r[$2]) } 1 { print $0 } $ # note that other command line options have to be provided as usual $ # for ex: awk -v OFS='\t' -f awkprof.out list4 marks.txt ```
## Miscellaneous
#### FPAT and FIELDWIDTHS * `FS` allows to define field separator * In contrast, `FPAT` allows to define what should the fields be made up of * See also [gawk manual - Defining Fields by Content](https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html) ```bash $ s='Sample123string54with908numbers' $ # define fields to be one or more consecutive digits $ echo "$s" | awk -v FPAT='[0-9]+' '{print $1, $2, $3}' 123 54 908 $ # define fields to be one or more consecutive alphabets $ echo "$s" | awk -v FPAT='[a-zA-Z]+' '{print $1, $2, $3, $4}' Sample string with numbers ``` * For simpler **csv** input having quoted strings if fields themselves have `,` in them, using `FPAT` is reasonable approach * Use a proper parser if input can have other cases like newlines in fields * See [unix.stackexchange - using csv parser](https://unix.stackexchange.com/a/238192) for a sample program in `perl` ```bash $ s='foo,"bar,123",baz,abc' $ echo "$s" | awk -F, '{print $2}' "bar $ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}' "bar,123" ``` * if input has well defined fields based on number of characters, `FIELDWIDTHS` can be used to specify width of each field ```bash $ awk -v FIELDWIDTHS='8 3' -v OFS= '/fig/{$2=35} 1' fruits.txt fruit qty apple 42 banana 31 fig 35 guava 6 $ # without FIELDWIDTHS $ awk '/fig/{$2=35} 1' fruits.txt fruit qty apple 42 banana 31 fig 35 guava 6 ``` **Further Reading** * [gawk manual - Processing Fixed-Width Data](https://www.gnu.org/software/gawk/manual/html_node/Fixed-width-data.html) * [unix.stackexchange - Modify records in fixed-width files](https://unix.stackexchange.com/questions/368574/modify-records-in-fixed-width-files) * [unix.stackexchange - detecting empty fields in fixed width files](https://unix.stackexchange.com/questions/321559/extracting-data-with-awk-when-some-lines-have-empty-missing-values) * [stackoverflow - count number of times value is repeated each 
line](https://stackoverflow.com/questions/37450880/how-do-i-filter-tab-separated-input-by-the-count-of-fields-with-a-given-value) * [stackoverflow - skip characters with FIELDWIDTHS in GNU Awk 4.2](https://stackoverflow.com/questions/46932189/how-do-you-skip-characters-with-fieldwidths-in-gnu-awk-4-2)
#### String functions

* `length` function - returns length of string, by default acts on `$0`

```bash
$ seq 8 13 | awk 'length()==1'
8
9

$ awk 'NR==1 || length($1)>4' fruits.txt
fruit qty
apple 42
banana 31
guava 6

$ # character count and not byte count is calculated, similar to 'wc -m'
$ printf 'hi👍' | awk '{print length()}'
3

$ # use -b option if number of bytes are needed
$ printf 'hi👍' | awk -b '{print length()}'
6
```

* `split` function - similar to `FS` splitting input record into fields
    * use `patsplit` function to get results similar to `FPAT`
* See also [gawk manual - Split function](https://www.gnu.org/software/gawk/manual/gawk.html#index-split_0028_0029-function)
* See also [unix.stackexchange - delimit second column](https://unix.stackexchange.com/questions/372253/awk-command-to-delimit-the-second-column)

```bash
$ # 1st argument is string to be split
$ # 2nd argument is array to save results, indexed from 1
$ # 3rd argument is separator, default is FS
$ s='foo,1996-10-25,hello,good'
$ echo "$s" | awk -F, '{split($2,d,"-"); print "Month is: " d[2]}'
Month is: 10

$ # using regular expression to define separator
$ # return value is number of fields after splitting
$ s='Sample123string54with908numbers'
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/); for(i=1;i<=n;i++)print s[i]}'
Sample
string
with
numbers

$ # use 4th argument if separators are needed as well
$ echo "$s" | awk '{n=split($0,s,/[0-9]+/,seps); for(i=1;i
```

#### Executing external commands

* External commands can be issued using `system` function
* Output would be as usual on `stdout` unless redirected while calling the command
* Return value of `system` depends on `exit` status of executed command, see [gawk manual - Input/Output Functions](https://www.gnu.org/software/gawk/manual/html_node/I_002fO-Functions.html) for details

```bash
$ awk 'BEGIN{system("echo Hello World")}'
Hello World

$ wc poem.txt
 4 13 65 poem.txt
$ awk 'BEGIN{system("wc poem.txt")}'
 4 13 65 poem.txt

$ awk 'BEGIN{system("seq 10 | paste -sd, > out.txt")}'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2
$ awk 'BEGIN{s=system("ls xyz.txt"); print "Status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Status: 2

$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | awk -F, '{system("cat " $2)}'
I bought two bananas and three mangoes
```
#### printf formatting * Similar to `printf` function in `C` and shell built-in command * use `sprintf` function to save result in variable instead of printing * See also [gawk manual - printf](https://www.gnu.org/software/gawk/manual/html_node/Printf.html) ```bash $ awk '{sum += $1} END{print sum}' nums.txt 10062.9 $ # note that ORS is not appended and has to be added manually $ awk '{sum += $1} END{printf "%.2f\n", sum}' nums.txt 10062.86 $ awk '{sum += $1} END{printf "%10.2f\n", sum}' nums.txt 10062.86 $ awk '{sum += $1} END{printf "%010.2f\n", sum}' nums.txt 0010062.86 $ awk '{sum += $1} END{printf "%d\n", sum}' nums.txt 10062 $ awk '{sum += $1} END{printf "%+d\n", sum}' nums.txt +10062 $ awk '{sum += $1} END{printf "%e\n", sum}' nums.txt 1.006286e+04 ``` * to refer argument by positional number (starts with 1), use `$` ```bash $ # can also use: awk 'BEGIN{printf "hex=%x\noct=%o\ndec=%d\n", 15, 15, 15}' $ awk 'BEGIN{printf "hex=%1$x\noct=%1$o\ndec=%1$d\n", 15}' hex=f oct=17 dec=15 $ # adding prefix to hex/oct numbers $ awk 'BEGIN{printf "hex=%1$#x\noct=%1$#o\ndec=%1$d\n", 15}' hex=0xf oct=017 dec=15 ``` * strings ```bash $ # prefix remaining width with spaces $ awk 'BEGIN{printf "%6s:%5s\n", "foo", "bar"}' foo: bar $ # suffix remaining width with spaces $ awk 'BEGIN{printf "%-6s:%-5s\n", "foo", "bar"}' foo :bar $ # truncate $ awk 'BEGIN{printf "%.2s\n", "foobar"}' fo ``` * avoid using `printf` without format specifier ```bash $ awk 'BEGIN{s="solve: 5 % x = 1"; printf s}' awk: cmd. line:1: fatal: not enough arguments to satisfy format string `solve: 5 % x = 1' ^ ran out for this one $ awk 'BEGIN{s="solve: 5 % x = 1"; printf "%s\n", s}' solve: 5 % x = 1 ``` * See also [stackoverflow - concatenating columns in middle](https://stackoverflow.com/questions/49135518/linux-csv-file-concatenate-columns-into-one-column)
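
The `sprintf` function mentioned at the start of this section accepts the same format string as `printf`, but returns the formatted text instead of printing it. A small sketch is given below; the variable name `label` and the sample values are made up for this illustration, and `LC_ALL=C` is used only so that the decimal point is not affected by locale settings.

```bash
out=$(LC_ALL=C awk 'BEGIN{
    # left aligned string, fixed width float, zero padded integer
    label = sprintf("%-6s|%5.1f|%03d", "foo", 3.14159, 7)
    # the saved string can be reused like any other value
    print "[" label "]"
}')

printf '%s\n' "$out"
# [foo   |  3.1|007]
```
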
#### Redirecting print output * redirecting to file instead of stdout using `>` * similar to behavior in shell, if file already exists it is overwritten * use `>>` to append to an existing file without deleting content * however, unlike shell, subsequent redirections to same file will append to it * See also [gawk manual - Closing Input and Output Redirections](https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html) if you have too many redirections ```bash $ seq 6 | awk 'NR%2{print > "odd.txt"; next} {print > "even.txt"}' $ cat odd.txt 1 3 5 $ cat even.txt 2 4 6 $ awk 'NR==1{col1=$1".txt"; col2=$2".txt"; next} {print $1 > col1; print $2 > col2}' fruits.txt $ cat fruit.txt apple banana fig guava $ cat qty.txt 42 31 90 6 ``` * redirecting to shell command * this is useful if you have different things to redirect to different commands, otherwise it can be done as usual in shell acting on `awk`'s output * all redirections to same command gets combined as single input to that command ```bash $ # same as: echo 'foo good 123' | awk '{print $2}' | wc -c $ echo 'foo good 123' | awk '{print $2 | "wc -c"}' 5 $ # to avoid newline character being added to print $ echo 'foo good 123' | awk -v ORS= '{print $2 | "wc -c"}' 4 $ # assuming no format specifiers in input $ echo 'foo good 123' | awk '{printf $2 | "wc -c"}' 4 $ # same as: echo 'foo good 123' | awk '{printf $2 $3 | "wc -c"}' $ echo 'foo good 123' | awk '{printf $2 | "wc -c"; printf $3 | "wc -c"}' 7 ``` **Further Reading** * [gawk manual - Input/Output Functions](https://www.gnu.org/software/gawk/manual/html_node/I_002fO-Functions.html) * [gawk manual - Redirecting Output of print and printf](https://www.gnu.org/software/gawk/manual/html_node/Redirection.html) * [gawk manual - Two-Way Communications with Another Process](https://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html) * [unix.stackexchange - inplace editing as well as 
stdout](https://unix.stackexchange.com/questions/321679/gawk-inplace-and-stdout) * [stackoverflow - redirect blocks to separate files](https://stackoverflow.com/questions/45098279/write-blocks-in-a-text-file-to-multiple-new-files)
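
When the number of output files is large, leaving all of them open can run into the limit on open files, which is where `close` becomes relevant. Below is one way this might be handled, as a sketch under a few assumptions: the file names `part_odd.txt` and `part_even.txt` and the temporary directory are hypothetical. `>>` is used because a file that is closed and then reopened with `>` would be overwritten.

```bash
tmp=$(mktemp -d)

printf '%s\n' 1 2 3 4 5 6 | awk -v dir="$tmp" '{
    out = dir "/part_" ((NR % 2) ? "odd" : "even") ".txt"
    print >> out    # append, since the file is reopened after every close
    close(out)      # release the file before moving on to the next record
}'

odd=$(cat "$tmp/part_odd.txt")
even=$(cat "$tmp/part_even.txt")
printf '%s\n' "$odd" "$even"
# 1
# 3
# 5
# 2
# 4
# 6

rm -r "$tmp"
```

Closing after every record is the simplest approach but not the fastest. If the input is grouped by output file, closing only when the output file name changes would avoid most of the reopening.
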
## Gotchas and Tips * using `$` for variables * only input record `$0` and field contents `$1`, `$2` etc need `$` * See also [unix.stackexchange - Why does awk print the whole line when I want it to print a variable?](https://unix.stackexchange.com/questions/291126/why-does-awk-print-the-whole-line-when-i-want-it-to-print-a-variable) ```bash $ # wrong $ awk -v word="apple" '$1==$word' fruits.txt $ # right $ awk -v word="apple" '$1==word' fruits.txt apple 42 ``` * dos style line endings * See also [unix.stackexchange - filtering when last column has \r](https://unix.stackexchange.com/questions/399560/using-awk-to-select-rows-with-specific-value-in-specific-column) ```bash $ # no issue with unix style line ending $ printf 'foo bar\n123 789\n' | awk '{print $2, $1}' bar foo 789 123 $ # dos style line ending causes trouble $ printf 'foo bar\r\n123 789\r\n' | awk '{print $2, $1}' foo 123 $ # easy to deal by simply setting appropriate RS $ # note that ORS would still be newline character only $ printf 'foo bar\r\n123 789\r\n' | awk -v RS='\r\n' '{print $2, $1}' bar foo 789 123 ``` * relying on default initial value ```bash $ # step 1 - works for single file $ awk '{sum += $1} END{print sum}' nums.txt 10062.9 $ # step 2 - change to work for multiple file $ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt nums.txt 10062.9 $ # step 3 - check with multiple file input $ # oops, default numerical value '0' for sum works only once $ awk '{sum += $1} ENDFILE{print FILENAME, sum}' nums.txt <(seq 3) nums.txt 10062.9 /dev/fd/63 10068.9 $ # step 4 - correctly initialize variables $ awk '{sum += $1} ENDFILE{print FILENAME, sum; sum=0}' nums.txt <(seq 3) nums.txt 10062.9 /dev/fd/63 6 ``` * use unary operator `+` to force numeric conversion ```bash $ awk '{sum += $1} END{print FILENAME, sum}' nums.txt nums.txt 10062.9 $ awk '{sum += $1} END{print FILENAME, sum}' /dev/null /dev/null $ awk '{sum += $1} END{print FILENAME, +sum}' /dev/null /dev/null 0 ``` * concatenate empty string 
to force string comparison ```bash $ echo '5 5.0' | awk '{print $1==$2 ? "same" : "different", "string"}' same string $ echo '5 5.0' | awk '{print $1""==$2 ? "same" : "different", "string"}' different string ``` * beware of expressions going -ve for field calculations ```bash $ cat misc.txt foo good bad ugly 123 xyz a b c d $ # trying to delete last two fields $ awk '{NF -= 2} 1' misc.txt awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: NF set to negative value $ # dynamically change it depending on number of fields $ awk '{NF = (NF<=2) ? 0 : NF-2} 1' misc.txt good a b $ # similarly, trying to access 3rd field from end $ awk '{print $(NF-2)}' misc.txt awk: cmd. line:1: (FILENAME=misc.txt FNR=1) fatal: attempt to access field -1 $ awk 'NF>2{print $(NF-2)}' misc.txt good b ``` * If input is ASCII alone, simple trick to improve speed * For simple non-regex based column filtering, using [cut](./miscellaneous.md#cut) command might give faster results * See [stackoverflow - how to split columns faster](https://stackoverflow.com/questions/46882557/how-to-split-columns-faster-in-python/46883120#46883120) for example ```bash $ # all words containing exactly 3 lowercase a $ time awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words 1019 real 0m0.075s $ time LC_ALL=C awk -F'a' 'NF==4{cnt++} END{print +cnt}' /usr/share/dict/words 1019 real 0m0.045s ```
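
For the dos style line endings gotcha, setting `RS='\r\n'` depends on the `awk` implementation supporting a record separator longer than one character. If that cannot be assumed, a possible alternative is to remove the trailing carriage return explicitly, as in this sketch (the shell variable `out` is just a name chosen for this example):

```bash
# modifying $0 causes the fields to be split again,
# so $2 no longer carries the trailing \r
out=$(printf 'foo bar\r\n123 789\r\n' | awk '{sub(/\r$/, "")} {print $2, $1}')

printf '%s\n' "$out"
# bar foo
# 789 123
```
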
## Further Reading * Manual and related * `man awk` and `info awk` for quick reference from command line * [gawk manual](https://www.gnu.org/software/gawk/manual/gawk.html#SEC_Contents) for complete reference, extensions and more * [awk FAQ](http://www.faqs.org/faqs/computer-lang/awk/faq/) - from 2002, but plenty of information, especially about all the various `awk` implementations * this tutorial has also been [converted to an ebook](https://github.com/learnbyexample/learn_gnuawk) with additional descriptions, examples, a chapter on regular expressions, etc. * What's up with different `awk` versions? * [unix.stackexchange - brief explanation](https://unix.stackexchange.com/questions/29576/difference-between-gawk-vs-awk) * [Differences between gawk, nawk, mawk, and POSIX awk](https://archive.is/btGky) * [cheat sheet for awk/nawk/gawk](https://catonmat.net/ftp/awk.cheat.sheet.txt) * Tutorials and Q&A * [code.snipcademy - gentle intro](https://code.snipcademy.com/tutorials/shell-scripting/awk/introduction) * [funtoo - using examples](https://www.funtoo.org/Awk_by_Example,_Part_1) * [grymoire - detailed tutorial](https://www.grymoire.com/Unix/Awk.html) - covers information about different `awk` versions as well * [catonmat - one liners explained](https://catonmat.net/awk-one-liners-explained-part-one) * [Why Learn AWK?](https://blog.jpalardy.com/posts/why-learn-awk/) * [awk Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/awk?sort=votes&pageSize=15) * [awk Q&A on unix.stackexchange](https://unix.stackexchange.com/questions/tagged/awk?sort=votes&pageSize=15) * Alternatives * [GNU datamash](https://www.gnu.org/software/datamash/alternatives/) * [bioawk](https://github.com/lh3/bioawk) * [hawk](https://github.com/gelisam/hawk/blob/master/doc/README.md) - based on Haskell * [miller](https://github.com/johnkerl/miller) - similar to awk/sed/cut/join/sort for name-indexed data such as CSV, TSV, and tabular JSON * See this [ycombinator 
news](https://news.ycombinator.com/item?id=10066742) for other tools like this * miscellaneous * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed) * [awk-libs](https://github.com/e36freak/awk-libs) - lots of useful functions * [awkaster](https://github.com/TheMozg/awk-raycaster) - Pseudo-3D shooter written completely in awk using raycasting technique * [awk REPL](https://awk.js.org/) - live editor on browser * examples for some of the stuff not covered in this tutorial * [unix.stackexchange - rand/srand](https://unix.stackexchange.com/questions/372816/awk-get-random-lines-of-file-satisfying-a-condition) * [unix.stackexchange - strftime](https://unix.stackexchange.com/questions/224969/current-date-in-awk) * [unix.stackexchange - ARGC and ARGV](https://unix.stackexchange.com/questions/222146/awk-does-not-end/222150#222150) * [stackoverflow - arbitrary precision integer extension](https://stackoverflow.com/questions/46904447/strange-output-while-comparing-engineering-numbers-in-awk) * [stackoverflow - recognizing hexadecimal numbers](https://stackoverflow.com/questions/3683110/how-to-make-calculations-on-hexadecimal-numbers-with-awk) * [unix.stackexchange - sprintf and close](https://unix.stackexchange.com/questions/223727/splitting-file-for-every-10000-numbers-not-lines/223739#223739) * [unix.stackexchange - user defined functions and array passing](https://unix.stackexchange.com/questions/72469/gawk-passing-arrays-to-functions) * [unix.stackexchange - rename csv files based on number of fields in header row](https://unix.stackexchange.com/questions/408742/count-number-of-columns-in-csv-files-and-rename-if-less-than-11-columns) ================================================ FILE: gnu_grep.md ================================================


--- :information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnugrep_ripgrep/. The ebook also has content updated for newer version of the commands, includes exercises, solutions, has a separate chapter for popular alternative `ripgrep`, etc. For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnugrep_ripgrep ---


# GNU grep **Table of Contents** * [Simple string search](#simple-string-search) * [Case insensitive search](#case-insensitive-search) * [Invert matching lines](#invert-matching-lines) * [Line number, count and limiting output lines](#line-number-count-and-limiting-output-lines) * [Multiple search strings](#multiple-search-strings) * [File names in output](#file-names-in-output) * [Match whole word or line](#match-whole-word-or-line) * [Colored output](#colored-output) * [Get only matching portion](#get-only-matching-portion) * [Context matching](#context-matching) * [Recursive search](#recursive-search) * [Basic recursive search](#basic-recursive-search) * [Exclude/Include specific files/directories](#excludeinclude-specific-filesdirectories) * [Recursive search with bash options](#recursive-search-with-bash-options) * [Recursive search using find command](#recursive-search-using-find-command) * [Passing file names to other commands](#passing-file-names-to-other-commands) * [Search strings from file](#search-strings-from-file) * [Options for scripting purposes](#options-for-scripting-purposes) * [Regular Expressions - BRE/ERE](#regular-expressions-breere) * [Line Anchors](#line-anchors) * [Word Anchors](#word-anchors) * [Alternation](#alternation) * [The dot meta character](#the-dot-meta-character) * [Quantifiers](#quantifiers) * [Character classes](#character-classes) * [Grouping](#grouping) * [Back reference](#back-reference) * [Multiline matching](#multiline-matching) * [Perl Compatible Regular Expressions](#perl-compatible-regular-expressions) * [Backslash sequences](#backslash-sequences) * [Non-greedy matching](#non-greedy-matching) * [Lookarounds](#lookarounds) * [Ignoring specific matches](#ignoring-specific-matches) * [Re-using regular expression pattern](#re-using-regular-expression-pattern) * [Gotchas and Tips](#gotchas-and-tips) * [Regular Expressions Reference (ERE)](#regular-expressions-reference-ere) * [Anchors](#anchors) * [Character 
Quantifiers](#character-quantifiers) * [Character classes and backslash sequences](#character-classes-and-backslash-sequences) * [Pattern groups](#pattern-groups) * [Basic vs Extended Regular Expressions](#basic-vs-extended-regular-expressions) * [Further Reading](#further-reading)
```bash $ grep -V | head -1 grep (GNU grep) 2.25 $ man grep GREP(1) General Commands Manual GREP(1) NAME grep, egrep, fgrep, rgrep - print lines matching a pattern SYNOPSIS grep [OPTIONS] PATTERN [FILE...] grep [OPTIONS] [-e PATTERN]... [-f FILE]... [FILE...] DESCRIPTION grep searches the named input FILEs for lines containing a match to the given PATTERN. If no files are specified, or if the file “-” is given, grep searches standard input. By default, grep prints the matching lines. In addition, the variant programs egrep, fgrep and rgrep are the same as grep -E, grep -F, and grep -r, respectively. These variants are deprecated, but are provided for backward compatibility. ... ``` **Note** For more detailed documentation and examples, use `info grep`
## Simple string search * First specify the search pattern (usually enclosed in single quotes) and then the file input * More than one file can be specified or input given from stdin ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ grep 'are' poem.txt Roses are red, Violets are blue, And so are you. $ grep 'so are' poem.txt And so are you. ``` * If search string contains any regular expression meta characters like `^$\.*[]` (covered later), use the `-F` option or `fgrep` if available ```bash $ echo 'int a[5]' | grep 'a[5]' $ echo 'int a[5]' | grep -F 'a[5]' int a[5] $ echo 'int a[5]' | fgrep 'a[5]' int a[5] ``` * See [Gotchas and Tips](#gotchas-and-tips) section if you get strange issues
## Case insensitive search ```bash $ grep -i 'rose' poem.txt Roses are red, $ grep -i 'and' poem.txt And so are you. ```
## Invert matching lines * Use the `-v` option to get lines other than those matching the search string * Tip: Look out for other opposite pairs like `-l -L`, `-h -H`, opposites in regular expression, etc ```bash $ grep -v 'are' poem.txt Sugar is sweet, $ # example for input from stdin $ seq 5 | grep -v '3' 1 2 4 5 ```
## Line number, count and limiting output lines * Show line number of matching lines ```bash $ grep -n 'sweet' poem.txt 3:Sugar is sweet, ``` * Count number of matching lines ```bash $ grep -c 'are' poem.txt 3 ``` * Limit number of matching lines ```bash $ grep -m2 'are' poem.txt Roses are red, Violets are blue, ```
## Multiple search strings

* Match any

```bash
$ # search blue or you
$ grep -e 'blue' -e 'you' poem.txt
Violets are blue,
And so are you.
```

If there are a lot of search strings, use a file input

**Note** Be careful to avoid empty lines in the file, as an empty line acts as an empty pattern and matches every input line

```bash
$ printf 'rose\nsugar\n' > search_strings.txt
$ cat search_strings.txt
rose
sugar

$ # -f option accepts file input with search terms in separate lines
$ grep -if search_strings.txt poem.txt
Roses are red,
Sugar is sweet,
```

* Match all

```bash
$ # match line containing both are & And
$ grep 'are' poem.txt | grep 'And'
And so are you.
```
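The empty line pitfall can be avoided by filtering the search strings file before handing it to `grep`. A minimal sketch, assuming GNU grep (which accepts `-f -` to read patterns from stdin); the function name `grep_patterns` is made up for illustration

```bash
# hypothetical helper, not part of grep: use PATFILE as a list of search
# strings after dropping its empty lines, then search the given FILE(s)
grep_patterns() {
    patfile=$1
    shift
    # first grep removes empty lines from the pattern list
    # -f - makes the second grep read its patterns from stdin
    grep -v '^$' "$patfile" | grep -if - -- "$@"
}
```

Usage would be along the lines of `grep_patterns search_strings.txt poem.txt`. At least one input file has to be given, since stdin is already used for the patterns.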
## File names in output * `-l` to get files matching the search * `-L` to get files not matching the search * `grep` skips the rest of file once a match is found ```bash $ grep -l 'Rose' poem.txt poem.txt $ grep -L 'are' poem.txt search_strings.txt search_strings.txt ``` * Prefix file name to search results * `-h` is default for single file input, no file name prefix in output * `-H` is default for multiple file input, file name prefix in output ```bash $ grep -h 'Rose' poem.txt Roses are red, $ grep -H 'Rose' poem.txt poem.txt:Roses are red, $ # -H is default for multiple file input $ grep -i 'sugar' poem.txt search_strings.txt poem.txt:Sugar is sweet, search_strings.txt:sugar $ grep -ih 'sugar' poem.txt search_strings.txt Sugar is sweet, sugar ```
## Match whole word or line

* Word search using `-w` option
    * a word consists of letters, digits and the underscore character
* This will ensure that given patterns are not surrounded by other word characters
    * this is slightly different from using word boundaries in regular expressions
* For example, this helps to distinguish `par` from `spar`, `part`, etc

```bash
$ printf 'par value\nheir apparent\n' | grep 'par'
par value
heir apparent

$ printf 'par value\nheir apparent\n' | grep -w 'par'
par value

$ printf 'scare\ncart\ncar\nmacaroni\n' | grep -w 'car'
car
```

* Another useful option is `-x` to match only complete line, not anywhere in the line

```bash
$ printf 'see my book list\nmy book\n' | grep 'my book'
see my book list
my book

$ printf 'see my book list\nmy book\n' | grep -x 'my book'
my book

$ printf 'scare\ncart\ncar\nmacaroni\n' | grep -x 'car'
car
```
## Colored output * Highlight search strings, line numbers, file name, etc in different colors * Depends on color support in terminal being used * options to `--color` are * `auto` when output is redirected (another command, file, etc) the color information won't be passed * `always` when output is redirected (another command, file, etc) the color information will also be passed * `never` explicitly specify no highlighting ```bash $ # can also use grep --color 'blue' as auto is default $ grep --color=auto 'blue' poem.txt Violets are blue, ``` * Sample screenshot ![grep color output](./images/color_option.png) * Example to show difference between `auto` and `always` ```bash $ grep --color=auto 'blue' poem.txt > saved_output.txt $ cat -v saved_output.txt Violets are blue, $ grep --color=always 'blue' poem.txt > saved_output.txt $ cat -v saved_output.txt Violets are ^[[01;31m^[[Kblue^[[m^[[K, $ # some commands like 'less' are capable of using the color information $ grep --color=always 'are' poem.txt | less -R $ # highlight multiple matching patterns $ grep --color=always 'are' poem.txt | grep --color 'd' Roses are red, And so are you. ```
## Get only matching portion * The `-o` option to get only matched portion is more useful with regular expressions * Comes in handy if overall number of matches is required, instead of only line wise ```bash $ grep -o 'are' poem.txt are are are $ # -c only gives count of matching lines $ grep -c 'e' poem.txt 4 $ grep -co 'e' poem.txt 4 $ # so need another command to get count of all matches $ grep -o 'e' poem.txt | wc -l 9 ```
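The difference between counting matching lines and counting every match can be wrapped in two small functions. This is only a sketch; the names `count_lines` and `count_matches` are hypothetical

```bash
# number of lines containing at least one match
count_lines() {
    grep -c -- "$1" "$2"
}

# total number of matches: -o prints each match on its own line,
# so counting the output lines gives the overall count
count_matches() {
    grep -o -- "$1" "$2" | wc -l
}
```

For the `poem.txt` file used above, `count_lines 'e' poem.txt` would report matching lines only, while `count_matches 'e' poem.txt` would report all occurrences.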
## Context matching * The `-A`, `-B` and `-C` options are useful to get lines after/before/around matching line respectively ```bash $ grep -A1 'blue' poem.txt Violets are blue, Sugar is sweet, $ grep -B1 'blue' poem.txt Roses are red, Violets are blue, $ grep -C1 'blue' poem.txt Roses are red, Violets are blue, Sugar is sweet, ``` * If there are multiple non-adjacent matching segments, by default `grep` adds a line `--` to separate them * non-adjacent here implies that segments are separated by at least one line in input data ```bash $ seq 29 | grep -A1 '3' 3 4 -- 13 14 -- 23 24 ``` * Use `--no-group-separator` option if the separator line is a hindrance, for example feeding the output of `grep` to another program ```bash $ seq 29 | grep --no-group-separator -A1 '3' 3 4 13 14 23 24 ``` * Use `--group-separator` to customize the separator ```bash $ seq 29 | grep --group-separator='*****' -A1 '3' 3 4 ***** 13 14 ***** 23 24 ```
## Recursive search First let's create some more test files ```bash $ mkdir -p test_files/hidden_files $ printf 'Red\nGreen\nBlue\nBlack\nWhite\n' > test_files/colors.txt $ printf 'Violet\nIndigo\nBlue\nGreen\nYellow\nOrange\nRed\n' > test_files/vibgyor.txt $ printf '#!/usr/bin/python3\n\nprint("Hello World")\n' > test_files/hello.py $ printf 'I like yellow\nWhat about you\n' > test_files/hidden_files/.fav_color.info ``` From `man grep` ```bash -r, --recursive Read all files under each directory, recursively, following symbolic links only if they are on the command line. Note that if no file operand is given, grep searches the working directory. This is equivalent to the -d recurse option. -R, --dereference-recursive Read all files under each directory, recursively. Follow all symbolic links, unlike -r. ```
#### Basic recursive search * Note that `-H` option automatically activates for multiple file input ```bash $ # by default, current working directory is searched $ grep -r 'red' poem.txt:Roses are red, $ grep -ri 'red' poem.txt:Roses are red, test_files/colors.txt:Red test_files/vibgyor.txt:Red $ grep -rin 'red' poem.txt:1:Roses are red, test_files/colors.txt:1:Red test_files/vibgyor.txt:7:Red $ grep -ril 'red' poem.txt test_files/colors.txt test_files/vibgyor.txt ```
#### Exclude/Include specific files/directories

* By default, recursive search includes hidden files as well
* They can be excluded by file name or directory name
    * [glob](https://github.com/learnbyexample/Linux_command_line/blob/master/Shell.md#wildcards) patterns can be used
    * for example: `*.[ch]` to specify all files ending with `.c` or `.h`
* The exclusion options can be used multiple times
    * for example: `--exclude='*.txt' --exclude='*.log'` or specified from a file using `--exclude-from=FILE`
* To search only files with specific pattern in their names, use `--include=GLOB`
* **Note:** exclusion/inclusion applies only to basename of file/directory, not the entire path
* To follow all symbolic links (not directly specified as arguments, but found on recursive search), use `-R` instead of `-r`

```bash
$ grep -ri 'you'
poem.txt:And so are you.
test_files/hidden_files/.fav_color.info:What about you

$ # exclude file names starting with `.` i.e hidden files
$ grep -ri --exclude='.*' 'you'
poem.txt:And so are you.

$ # include only file names ending with `.info`
$ grep -ri --include='*.info' 'you'
test_files/hidden_files/.fav_color.info:What about you

$ # exclude a directory
$ grep -ri --exclude-dir='hidden_files' 'you'
poem.txt:And so are you.

$ # If you are using git(or similar), this would be handy
$ # grep --exclude-dir='.git' -rl 'search pattern'
```
#### Recursive search with bash options * Using `bash` options `globstar` (for recursion) * Other options like `extglob` and `dotglob` come in handy too * See [glob](https://github.com/learnbyexample/Linux_command_line/blob/master/Shell.md#wildcards) for more info on these options * The `-d skip` option tells grep to skip directories instead of trying to treat them as text file to be searched ```bash $ grep -ril 'yellow' test_files/hidden_files/.fav_color.info test_files/vibgyor.txt $ # recursive search $ shopt -s globstar $ grep -d skip -il 'yellow' **/* test_files/vibgyor.txt $ # include hidden files as well $ shopt -s dotglob $ grep -d skip -il 'yellow' **/* test_files/hidden_files/.fav_color.info test_files/vibgyor.txt $ # use extended glob patterns $ shopt -s extglob $ # other than poem.txt $ grep -d skip -il 'red' **/!(poem.txt) test_files/colors.txt test_files/vibgyor.txt $ # other than poem.txt or colors.txt $ grep -d skip -il 'red' **/!(poem|colors).txt test_files/vibgyor.txt ```
#### Recursive search using find command * `find` is obviously more versatile * See also [this guide](./wheres_my_file.md#find) for more examples/tutorials on using `find` ```bash $ # all files, including hidden ones $ find -type f -exec grep -il 'red' {} + ./poem.txt ./test_files/colors.txt ./test_files/vibgyor.txt $ # all files ending with .txt $ find -type f -name '*.txt' -exec grep -in 'you' {} + ./poem.txt:4:And so are you. $ # all files not ending with .txt $ find -type f -not -name '*.txt' -exec grep -in 'you' {} + ./test_files/hidden_files/.fav_color.info:2:What about you ```
#### Passing file names to other commands * To pass files filtered to another command, see if the receiving command can differentiate file names by ASCII NUL character * If so, use the `-Z` so that `grep` output is terminated with NUL character and commands like `xargs` have option `-0` to understand it * This helps when file names can have characters like space, newline, etc * Typical use case: Search and replace something in all files matching some pattern, for ex: `grep -rlZ 'PAT1' | xargs -0 sed -i 's/PAT2/REPLACE/g'` ```bash $ # prompt at end of line not shown for simplicity $ # ^@ here indicates the NUL character $ grep -rlZ 'you' | cat -A poem.txt^@test_files/hidden_files/.fav_color.info^@ $ # print first column from all lines of all files $ grep -rlZ 'you' | xargs -0 awk '{print $1}' Roses Violets Sugar And I What ``` * simple example to show filenames with space causing issue if `-Z` is not used ```bash $ # 'abc xyz.txt' is a file with space in its name $ grep -ri 'are' abc xyz.txt:hi how are you poem.txt:Roses are red, poem.txt:Violets are blue, poem.txt:And so are you. saved_output.txt:Violets are blue, $ # problem when -Z is not used $ grep -ril 'are' | xargs grep 'you' grep: abc: No such file or directory grep: xyz.txt: No such file or directory poem.txt:And so are you. $ # no issues if -Z is used $ grep -rilZ 'are' | xargs -0 grep 'you' abc xyz.txt:hi how are you poem.txt:And so are you. 
``` * Example for matching more than one search string anywhere in file ```bash $ # files containing 'you' $ grep -rl 'you' poem.txt test_files/hidden_files/.fav_color.info $ # files containing 'you' as well as 'are' $ grep -rlZ 'you' | xargs -0 grep -l 'are' poem.txt $ # files containing 'you' but NOT 'are' $ grep -rlZ 'you' | xargs -0 grep -L 'are' test_files/hidden_files/.fav_color.info ``` * another example ```bash $ grep -rilZ 'red' | xargs -0 grep -il 'blue' poem.txt test_files/colors.txt test_files/vibgyor.txt $ # note the use of `-Z` for middle command $ grep -rilZ 'red' | xargs -0 grep -ilZ 'blue' | xargs -0 grep -il 'violet' poem.txt test_files/vibgyor.txt ```
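The search and replace use case mentioned in this section can be sketched as a function. This assumes GNU `grep`, GNU `xargs` (for `-r`) and GNU `sed` (for `-i`); the name `replace_in_tree` is hypothetical and the search/replacement strings are assumed to be free of characters special to `grep` or `sed`

```bash
# hypothetical helper: replace OLD with NEW in every file under DIR
# that contains OLD, safe for file names with spaces or newlines
replace_in_tree() {
    dir=$1
    old=$2
    new=$3
    # -Z terminates each file name with NUL, xargs -0 splits on NUL
    # -r prevents sed from running when no file matches
    grep -rlZ -- "$old" "$dir" | xargs -0 -r sed -i "s/$old/$new/g"
}
```

Only files that actually contain the search string are passed to `sed`, so other files are left untouched.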
## Search strings from file * using file input to specify search terms * `-F` option will force matching strings literally(no regular expressions) * See also [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash) - read all answers ```bash $ grep -if test_files/colors.txt poem.txt Roses are red, Violets are blue, $ # get common lines between two files $ grep -Fxf test_files/colors.txt test_files/vibgyor.txt Blue Green Red $ # get lines present in vibgyor.txt but not in colors.txt $ grep -Fvxf test_files/colors.txt test_files/vibgyor.txt Violet Indigo Yellow Orange ```
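A small illustration of why `-F` matters when the search terms file has regular expression meta characters. The file names and contents here are made up and created in a temporary directory

```bash
# hypothetical sample files
tmp=$(mktemp -d)
printf 'a.c\n' > "$tmp/terms.txt"
printf 'abc\na.c\n' > "$tmp/input.txt"

# treated as BRE: . matches any character, so abc is matched as well
regex_hits=$(grep -xf "$tmp/terms.txt" "$tmp/input.txt")

# treated as fixed strings: only the literal a.c line is matched
fixed_hits=$(grep -Fxf "$tmp/terms.txt" "$tmp/input.txt")

printf 'without -F:\n%s\n' "$regex_hits"
printf 'with -F:\n%s\n' "$fixed_hits"
```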
## Options for scripting purposes * In scripts, often it is needed just to know if a pattern matches or not * The `-q` option doesn't print anything on stdout and exit status is `0` if match is found * Check out [this practical script](https://github.com/learnbyexample/command_help/blob/master/ch) using the `-q` option ```bash $ grep -qi 'rose' poem.txt $ echo $? 0 $ grep -qi 'lily' poem.txt $ echo $? 1 $ if grep -qi 'rose' poem.txt; then echo 'match found!'; else echo 'match not found'; fi match found! $ if grep -qi 'lily' poem.txt; then echo 'match found!'; else echo 'match not found'; fi match not found ``` * The `-s` option will suppress error messages as well ```bash $ grep 'rose' file_xyz.txt grep: file_xyz.txt: No such file or directory $ grep -s 'rose' file_xyz.txt $ echo $? 2 $ touch foo.txt $ chmod -r foo.txt $ grep 'rose' foo.txt grep: foo.txt: Permission denied $ grep -s 'rose' foo.txt $ echo $? 2 ```
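The three exit status values seen above (`0` for match, `1` for no match, `2` for error) can be told apart in a script. A minimal sketch, with `check_pattern` being a hypothetical name

```bash
# hypothetical helper: report whether PATTERN is found in FILE
check_pattern() {
    # -q suppresses normal output, -s suppresses error messages
    grep -qs -- "$1" "$2"
    case $? in
        0) echo 'match found' ;;
        1) echo 'match not found' ;;
        *) echo 'error reading input' ;;
    esac
}
```

For example, `check_pattern 'rose' file_xyz.txt` would print the error message branch if the file doesn't exist.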
## Regular Expressions - BRE/ERE

Before diving into regular expressions, here are a few examples to show default `grep` behavior vs `-F`

```bash
$ # oops, why did it not match?
$ echo 'int a[5]' | grep 'a[5]'

$ # where did that error come from??
$ echo 'int a[5]' | grep 'a['
grep: Invalid regular expression

$ # what is going on???
$ echo 'int a[5]' | grep 'a[5'
grep: Unmatched [ or [^

$ # phew, -F is a life saver
$ echo 'int a[5]' | grep -F 'a[5]'
int a[5]

$ # [ and ] are meta characters, details in following sections
$ echo 'int a[5]' | grep 'a\[5]'
int a[5]
```

* By default, `grep` treats the search pattern as BRE (Basic Regular Expression)
    * `-G` option can be used to specify explicitly that BRE is used
* The `-E` option enables ERE (Extended Regular Expression), which in GNU grep's case differs from BRE only in how meta characters are written, with no difference in regular expression functionality
* If `-F` option is used, the search string is treated literally
* If available, one can also use `-P` which indicates PCRE (Perl Compatible Regular Expression)
#### Line Anchors * Often, search must match from beginning of line or towards end of line * For example, an integer variable declaration in `C` will start with optional white-space, the keyword `int`, white-space and then variable(s) * This way one can avoid matching declarations inside single line comments as well. * Similarly, one might want to match a variable at end of statement * The meta characters for line anchoring are `^` for beginning of line and `$` for end of line ```bash $ echo 'Fantasy is my favorite genre' > fav.txt $ echo 'My favorite genre is Fantasy' >> fav.txt $ cat fav.txt Fantasy is my favorite genre My favorite genre is Fantasy $ # start of line $ grep '^Fantasy' fav.txt Fantasy is my favorite genre $ # end of line $ grep 'Fantasy$' fav.txt My favorite genre is Fantasy $ # without anchors $ grep 'Fantasy' fav.txt Fantasy is my favorite genre My favorite genre is Fantasy ``` * As the meta characters have special meaning (assuming `-F` option is not used), they have to be escaped using `\` to match literally * The `\` itself is meta character, so to match it literally, use `\\` * The line anchors `^` and `$` have special meaning only when they are present at start/end of regular expression ```bash $ echo '^foo bar$' | grep '^foo' $ echo '^foo bar$' | grep '\^foo' ^foo bar$ $ echo '^foo bar$' | grep '^^foo' ^foo bar$ $ echo '^foo bar$' | grep 'bar$' $ echo '^foo bar$' | grep 'bar\$' ^foo bar$ $ echo '^foo bar$' | grep 'bar$$' ^foo bar$ $ echo 'foo $ bar' | grep ' $ ' foo $ bar $ printf 'foo\cbar' | grep -o '\c' c $ printf 'foo\cbar' | grep -o '\\c' \c ```
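The `C` declaration scenario mentioned at the start of this section can be sketched as below. It relies on the `[[:space:]]` character class and the `*` quantifier, both covered in later sections, and the function name `count_int_decl` is hypothetical. This is a rough filter, not a parser for `C` code

```bash
# hypothetical helper: count lines that start with optional white-space,
# then the keyword int, then at least one white-space character
# reads the given files, or stdin if no file is given
count_int_decl() {
    grep -c '^[[:space:]]*int[[:space:]]' -- "$@"
}
```

Because of the `^` anchor, a line like `// int c;` is not counted, and neither is `print x;` since `int` has to be the first non white-space text in the line.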
#### Word Anchors

* The `-w` option works well to match whole words. But what about matching only start or end of words?
* Anchors `\<` and `\>` will match start/end positions of a word
* `\b` can also be used instead of `\<` and `\>` which matches both edges of a word

```bash
$ printf 'spar\npar\npart\napparent\n'
spar
par
part
apparent

$ # words ending with par
$ printf 'spar\npar\npart\napparent\n' | grep 'par\>'
spar
par

$ # words starting with par
$ printf 'spar\npar\npart\napparent\n' | grep '\<par'
par
part

$ printf 'spar\npar\npart\napparent\n' | grep '\<par\>'
par

$ printf 'spar\npar\npart\napparent\n' | grep '\bpar\b'
par

$ printf 'spar\npar\npart\napparent\n' | grep -w 'par'
par
```

* `\b` has an opposite `\B` which is quite useful too

```bash
$ # string not surrounded by word boundary either side
$ printf 'spar\npar\npart\napparent\n' | grep '\Bpar\B'
apparent

$ # word containing par but not as start of word
$ printf 'spar\npar\npart\napparent\n' | grep '\Bpar'
spar
apparent

$ # word containing par but not as end of word
$ printf 'spar\npar\npart\napparent\n' | grep 'par\B'
part
apparent
```

* the word boundary escape sequences differ slightly from `-w` option

```bash
$ # this fails because there is no word boundary between space and +
$ echo '2 +3 = 5' | grep '\b+3\b'
$ # this works as -w only ensures that there are no surrounding word characters
$ echo '2 +3 = 5' | grep -w '+3'
2 +3 = 5

$ # doesn't work as , isn't at start of word boundary
$ echo 'hi, 2 one' | grep '\<, 2\>'
$ # won't match as there are word characters before ,
$ echo 'hi, 2 one' | grep -w ', 2'
$ # works as \b matches both edges and , is at end of word after i
$ echo 'hi, 2 one' | grep '\b, 2\b'
hi, 2 one
```
#### Alternation * The `|` meta character is similar to using multiple `-e` option * Each side of `|` is complete regular expression with their own start/end anchors * How each part of alternation is handled and order of evaluation/output is beyond the scope of this tutorial * See [this](https://www.regular-expressions.info/alternation.html) for more info on this topic. * `|` is one of meta characters that requires different syntax between BRE/ERE ```bash $ grep 'blue\|you' poem.txt Violets are blue, And so are you. $ grep -E 'blue|you' poem.txt Violets are blue, And so are you. $ # extract case-insensitive e or f from anywhere in line $ echo 'Fantasy is my favorite genre' | grep -Eio 'e|f' F f e e e $ # extract case-insensitive e at end of line, f at start of line $ echo 'Fantasy is my favorite genre' | grep -Eio 'e$|^f' F e ``` * A cool usecase of alternation is using `^` or `$` anchors to highlight searched term as well as display rest of unmatched lines * the line anchors will match every input line, even empty lines as they are position markers ```bash $ grep --color=auto -E '^|are' poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ grep --color=auto -E 'is|$' poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. ``` Screenshot for above example: ![highlighting string](./images/highlight_string_whole_file_op.png) See also * [stackoverflow - Grep output with multiple Colors](https://stackoverflow.com/questions/17236005/grep-output-with-multiple-colors) * [unix.stackexchange - Multicolored Grep](https://unix.stackexchange.com/questions/104350/multicolored-grep)
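The highlighting trick from this section could be turned into a reusable filter. A minimal sketch, assuming GNU grep; the function name `highlight` is made up for illustration

```bash
# hypothetical helper: print all lines from stdin, highlighting PATTERN
# the $ alternative matches every line, so no line gets filtered out
highlight() {
    grep --color=always -E -- "$1|\$"
}
```

A possible use is `some_command | highlight 'error'`. Here `--color=always` is chosen so that the colors survive further piping, for example to `less -R`.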
#### The dot meta character

The `.` meta character is used to match any character

```bash
$ # any two characters surrounded by word boundaries
$ echo 'I have 12, he has 132!' | grep -ow '..'
12
he

$ # match three characters from start of line
$ # \t (TAB) is single character here
$ printf 'a\tbcd\n' | grep -o '^...'
a	b

$ # all three character word starting with c
$ echo 'car bat cod cope scat dot abacus' | grep -ow 'c..'
car
cod

$ echo '1 & 2' | grep -o '.'
1
 
&
 
2
```
#### Greedy Quantifiers Defines how many times a character (simplified for now) should be matched * `?` will try to match 0 or 1 time * For BRE, use `\?` ```bash $ printf 'late\npale\nfactor\nrare\nact\n' late pale factor rare act $ # match a followed by t, with or without c in between $ printf 'late\npale\nfactor\nrare\nact\n' | grep -E 'ac?t' late factor act $ # same as using this alternation $ printf 'late\npale\nfactor\nrare\nact\n' | grep -E 'at|act' late factor act ``` * `*` will try to match 0 or more times * There is no upper limit and `*` will try to match as many times as possible * if matching maximum times results in overall regex failing, then next best count is chosen until overall regex passes * if there are multiple quantifiers, left-most quantifier gets precedence ```bash $ echo 'abbbc' | grep -o 'b*' bbb $ # matches 0 or more b only if surrounded by a and c $ echo 'abc ac adc abbc bbb bc' | grep -o 'ab*c' abc ac abbc $ # see how it matched everything $ echo 'car bat cod map scat dot abacus' | grep -o '.*' car bat cod map scat dot abacus $ # but here it stops at m $ echo 'car bat cod map scat dot abacus' | grep -o '.*m' car bat cod m $ # stopped at dot, not bat or scat - match as much as possible $ echo 'car bat cod map scat dot abacus' | grep -o 'c.*t' car bat cod map scat dot $ # matching overall expression gets preference $ echo 'car bat cod map scat dot abacus' | grep -o 'c.*at' car bat cod map scat $ # precedence is left to right in case of multiple matches $ echo 'car bat cod map scat dot abacus' | grep -o 'b.*m' bat cod m $ echo 'car bat cod map scat dot abacus' | grep -o 'b.*m*' bat cod map scat dot abacus ``` * `+` will try to match 1 or more times * Another meta character that differs in syntax between BRE/ERE ```bash $ echo 'abbbc' | grep -o 'b\+' bbb $ echo 'abbbc' | grep -oE 'b+' bbb $ echo 'abc ac adc abbc bbb bc' | grep -oE 'ab+c' abc abbc $ echo 'abc ac adc abbc bbb bc' | grep -o 'ab*c' abc ac abbc ``` * For more precise control on 
number of times to match, `{}` is useful * use `\{\}` for BRE * It can take one of four forms, `{m,n}`, `{,n}`, `{m,}` and `{n}` ```bash $ # {m,n} - m to n, including both m and n $ echo 'ac abc abbc abbbc' | grep -Eo 'ab{1,2}c' abc abbc $ # {,n} - 0 to n times $ echo 'ac abc abbc abbbc' | grep -Eo 'ab{,2}c' ac abc abbc $ # {m,} - at least m times $ echo 'ac abc abbc abbbc' | grep -Eo 'ab{2,}c' abbc abbbc $ # {n} - exactly n times $ echo 'ac abc abbc abbbc' | grep -Eo 'ab{2}c' abbc ```
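To see the BRE/ERE syntax difference for quantifiers side by side, here's a small sketch that runs the same search both ways. The variable names are arbitrary, and `\?` in BRE is a GNU extension

```bash
words='ac abc abbc abbbc'

# {m,n} in ERE vs \{m,n\} in BRE
ere=$(echo "$words" | grep -oE 'ab{1,2}c')
bre=$(echo "$words" | grep -o 'ab\{1,2\}c')

# ? in ERE vs \? in BRE
opt_ere=$(printf 'late\nfactor\nact\npale\n' | grep -E 'ac?t')
opt_bre=$(printf 'late\nfactor\nact\npale\n' | grep 'ac\?t')

# both pairs are expected to give identical results
[ "$ere" = "$bre" ] && [ "$opt_ere" = "$opt_bre" ] && echo 'BRE and ERE results match'
```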
#### Character classes * The meta character pairs `[]` allow to match any of the multiple characters within `[]` * Meta characters like `^`, `$` have different meaning inside and outside of `[]` * Simple example first, matching any of the characters within `[]` ```bash $ echo 'do so in to no on' | grep -ow '[nt]o' to no $ echo 'do so in to no on' | grep -ow '[sot][on]' so to on ``` * Adding a quantifier * Check out [unix words](https://en.wikipedia.org/wiki/Words_(Unix)) and [sample words file](https://users.cs.duke.edu/~ola/ap/linuxwords) ```bash $ # words made up of letters o and n, at least 2 letters $ grep -xE '[on]{2,}' /usr/share/dict/words no non noon on $ # lines containing only digits $ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xE '[0123456789]+' 123 42 ``` * Character ranges * Matching any alphabet, number, hexadecimal number etc becomes cumbersome if every character has to be individually specified * So, there's a shortcut, using `-` to construct a range (has to be specified in ascending order) * See [ascii codes table](https://ascii.cl/) for reference * Note that behavior of range will differ for other character encodings * See **Character Classes and Bracket Expressions** as well as **LC_COLLATE under Environment Variables** sections in `info grep` for more detail * [Matching Numeric Ranges with a Regular Expression](https://www.regular-expressions.info/numericranges.html) ```bash $ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xE '[0-9]+' 123 42 $ printf 'cat\nfoo\n123\nbaz\n42\n' | grep -xiE '[a-z]+' cat foo baz $ # only valid decimal numbers $ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xE '[0-9]+' 128 34 $ # only valid octal numbers $ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xE '[0-7]+' 34 $ # only valid hexadecimal numbers $ printf '128\n34\nfe32\nfoo1\nbar\n' | grep -xiE '[0-9a-f]+' 128 34 fe32 $ # numbers between 10-29 $ echo '23 54 12 92' | grep -owE '[12][0-9]' 23 12 ``` * Negating character class * By using `^` as first character inside `[]`, 
we get inverted character class * As pointed out earlier, some meta characters behave differently inside and outside of `[]` ```bash $ # alphabetic words not starting with c $ echo '123 core not sink code finish' | grep -owE '[^c][a-z]+' not sink finish $ # excluding numbers 2,3,4,9 $ # note that 200a 200; etc will also match, usage depends on knowing input $ echo '2001 2004 2005 2008 2009' | grep -ow '200[^2-49]' 2001 2005 2008 $ # get characters from start of line upto(not including) known identifier $ echo 'foo=bar; baz=123' | grep -oE '^[^=]+' foo $ # get characters at end of line from(not including) known identifier $ echo 'foo=bar; baz=123' | grep -oE '[^=]+$' 123 $ # get all sequence of characters surrounded by unique identifier $ echo 'I like "mango" and "guava"' | grep -oE '"[^"]+"' "mango" "guava" ``` * Matching meta characters inside `[]` * Most meta characters like `( ) . + { } | $` don't have special meaning inside `[]` and hence do not require special treatment * Some combination like `[.` or `=]` cannot be used in this order, as they have special meaning within `[]` * See **Character Classes and Bracket Expressions** section in `info grep` for more detail ```bash $ # to match - it should be first or last character within [] $ echo 'Foo-bar 123-456 42 Co-operate' | grep -oiwE '[a-z-]+' Foo-bar Co-operate $ # to match ] it should be first character within [] $ printf 'int a[5]\nfoo=bar\n' | grep '[]=]' int a[5] foo=bar $ # to match [ use [ anywhere in the character list $ # [][] will match both [ and ] $ printf 'int a[5]\nfoo=bar\n' | grep '[[]' int a[5] $ # to match ^ it should be other than first in the list $ echo '(a+b)^2 = a^2 + b^2 + 2ab' | grep -owE '[a-z^0-9]{3,}' a^2 b^2 2ab ``` * Named character classes * Equivalent class shown is for C locale and ASCII character encoding * See [ascii codes table](https://ascii.cl/) for reference * See **Character Classes and Bracket Expressions** section in `info grep` for more detail | Character classes | 
Description | | ------------- | ----------- | | `[:digit:]` | Same as `[0-9]` | | `[:lower:]` | Same as `[a-z]` | | `[:upper:]` | Same as `[A-Z]` | | `[:alpha:]` | Same as `[a-zA-Z]` | | `[:alnum:]` | Same as `[0-9a-zA-Z]` | | `[:xdigit:]` | Same as `[0-9a-fA-F]` | | `[:cntrl:]` | Control characters - first 32 ASCII characters and 127th (DEL) | | `[:punct:]` | All the punctuation characters | | `[:graph:]` | `[:alnum:]` and `[:punct:]` | | `[:print:]` | `[:alnum:]`, `[:punct:]` and space | | `[:blank:]` | Space and tab characters | | `[:space:]` | white-space characters: tab, newline, vertical tab, form feed, carriage return and space | ```bash $ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:alnum:]]*' 128 34 AB32 Foo bar $ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:lower:]]*' bar $ printf '128\n34\nAB32\nFoo\nbar\n' | grep -x '[[:lower:]0-9]*' 128 34 bar ``` * backslash character classes | Character classes | Description | | ------------- | ----------- | | `\w` | Same as `[0-9a-zA-Z_]` or `[[:alnum:]_]` | | `\W` | Same as `[^0-9a-zA-Z_]` or `[^[:alnum:]_]` | | `\s` | Same as `[[:space:]]` | | `\S` | Same as `[^[:space:]]` | ```bash $ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '\w*' 123 cmp_str Foo_bar $ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '[[:alnum:]_]*' 123 cmp_str Foo_bar $ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '\W*' $# $ printf '123\n$#\ncmp_str\nFoo_bar\n' | grep -x '[^[:alnum:]_]*' $# ```
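The table above lists `[:xdigit:]` without showing it in use. Here is a minimal sketch, assuming the C locale so that the class means exactly `[0-9a-fA-F]`. The function name `only_hex` and the sample input are made up for illustration.

```bash
# hypothetical helper: keep only lines made up entirely of hexadecimal digits
# LC_ALL=C is an assumption, it pins [:xdigit:] to [0-9a-fA-F]
only_hex() {
    LC_ALL=C grep -xE '[[:xdigit:]]+'
}

# 'food' has the letter o and '0x1f' has the letter x, so both are filtered out
printf 'cafe\nFACE\nfood\n128\n0x1f\n' | only_hex
```

Named classes are only valid inside a bracket expression, which is why the pattern has two pairs of square brackets.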
#### Grouping * Character classes allow matching against a choice of multiple character list and then quantifier added if needed * One of the uses of grouping is analogous to character classes for whole regular expressions, instead of just list of characters * The meta characters `()` are used for grouping * requires `\(\)` for BRE * Similar to `a(b+c)d = abd+acd` in maths, you get `a(b|c)d = abd|acd` in regular expressions ```bash $ # 5 letter words starting with c and ending with ty or ly $ grep -xE 'c..(ty|ly)' /usr/share/dict/words catty coyly curly $ # 7 letter words starting with e and ending with rged or sted $ grep -xE 'e..(rg|st)ed' /usr/share/dict/words emerged existed $ # repeat a pattern 3 times $ grep -xE '([a-d][r-z]){3}' /usr/share/dict/words avatar awards cravat $ # nesting of () is allowed $ grep -E '([as](p|c)[r-t]){2}' /usr/share/dict/words scraps $ # can be used to match specific columns in well defined tables $ echo 'foo:123:bar:baz' | grep -E '^([^:]+:){2}bar' foo:123:bar:baz ``` * See also [stackoverflow - matching character exactly n times in a line](https://stackoverflow.com/questions/40187643/grep-search-with-regex)
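The note about `\(\)` for BRE can be sketched as follows. The input is a small made-up word list instead of `/usr/share/dict/words`, and the function names are hypothetical. Note that `\|` for alternation in BRE is a GNU extension.

```bash
# hypothetical helpers: the same search written as BRE and as ERE
c_words_bre() {
    grep -x 'c..\(ty\|ly\)'
}
c_words_ere() {
    grep -xE 'c..(ty|ly)'
}

# both commands are expected to select catty, coyly and curly
printf 'catty\ncoyly\ncandy\ncurly\ncity\n' | c_words_bre
printf 'catty\ncoyly\ncandy\ncurly\ncity\n' | c_words_ere
```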
#### Back reference * The matched string within `()` can also be used to be matched again by back referencing the captured groups * `\1` denotes the first matched group, `\2` the second one and so on * Order is leftmost `(` is `\1`, next one is `\2` and so on * Note that the matched string, not the regular expression itself is referenced * for ex: if `([0-9][a-f])` matches `3b`, then back referencing will be `3b` not any other valid match of the regular expression like `8f`, `0a` etc * Other regular expressions like PCRE do allow referencing the regular expression itself ```bash $ # note how first three and last three letters are same $ grep -xE '([a-d]..)\1' /usr/share/dict/words bonbon cancan chichi $ # note how adding quantifier is not same as back-referencing $ grep -m4 -xE '([a-d]..){2}' /usr/share/dict/words abacus abided abides ablaze $ # words with consecutive repeated letters $ echo 'eel flee all pat ilk seen' | grep -iowE '[a-z]*(.)\1[a-z]*' eel flee all seen $ # 17 letter words with first and last as same letter $ grep -xE '(.)[a-z]{15}\1' /usr/share/dict/words semiprofessionals transcendentalist ``` * Spotting repeated words ```bash $ cat story.txt singing tin in the rain walking for for a cause have a nice day day and night $ grep -wE '(\w+)\W+\1' story.txt walking for for a cause ``` * **Note** that there is an [issue for certain usage of back-reference and quantifier](https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864) ```bash $ # no output $ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words $ # works when nesting is unrolled $ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words Abbott Annabelle Annette Appaloosa Appleseed $ # no problem if PCRE is used instead of ERE $ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words Abbott Annabelle Annette Appaloosa Appleseed ```
## Multiline matching

* If input is small enough to meet memory requirements, the `-z` option comes in handy to match across multiple lines
* Instead of newline being the line separator, the ASCII NUL character is used
* So, multiline matching depends on whether or not the input file itself contains the NUL character
* Usually text files have no reason to contain the NUL character, and its presence marks the input as a binary file for `grep`

```bash
$ # \0 for ASCII NUL character
$ printf 'red\nblue\n\0green\n' | cat -e
red$
blue$
^@green$

$ # see --binary-files=TYPE option in info grep for binary details
$ printf 'red\nblue\n\0green\n' | grep -a 'red'
red

$ # with -z, \0 marks the different 'lines'
$ printf 'red\nblue\n\0green\n' | grep -z 'red'
red
blue

$ # if no \0 in input, entire input read as single string
$ printf 'red\nblue\ngreen\n' | grep -z 'red'
red
blue
green
```

* `\n` is not defined in BRE/ERE
    * see [unix.stackexchange - How to specify characters using hexadecimal codes](https://unix.stackexchange.com/questions/19491/how-to-specify-characters-using-hexadecimal-codes-in-grep) for a workaround
* if some characteristics of the input are known, `[[:space:]]` can be used as a workaround, since it matches all white-space characters including newline

```bash
$ grep -oz 'Roses.*blue,[[:space:]]' poem.txt
Roses are red,
Violets are blue,
```
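With `-z`, the output records are terminated by NUL as well, which matters when the result is passed on to other commands. Here is a small sketch using `tr` to turn those NUL bytes back into newlines. The function name `nul_records` is made up for illustration.

```bash
# hypothetical helper: print NUL separated records matching the pattern in $1
# tr converts the NUL terminator added by grep -z into a newline
nul_records() {
    grep -z "$1" | tr '\0' '\n'
}

# input has two records: 'red\nblue\n' and 'green\n'
printf 'red\nblue\n\0green\n' | nul_records 'green'

# counting records instead of lines, empty pattern matches every record
printf 'red\nblue\n\0green\n' | grep -cz ''
```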
## Perl Compatible Regular Expressions

```bash
$ # see also: https://github.com/learnbyexample/command_help
$ man grep | sed -n '/^\s*-P/,/^$/p'
       -P, --perl-regexp
              Interpret the pattern as a Perl-compatible regular expression (PCRE).
              This is highly experimental and grep -P may warn of unimplemented
              features.

```

* The man page informs that `-P` is *highly experimental*. So far, I haven't faced any issues, but do keep this in mind.
    * newer versions of `GNU grep` have fixes for some `-P` bugs, see [release notes](https://savannah.gnu.org/news/?group_id=67) for an overview of changes between versions
* Only a few highlights are presented here
* For more info
    * `man pcrepattern` or [read it online](https://www.pcre.org/original/doc/html/pcrepattern.html)
    * [perldoc - re](https://perldoc.perl.org/perlre.html) - Perl regular expression syntax, also links to other related tutorials
    * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean)

#### Backslash sequences

Some of the backslash constructs available in PCRE, in addition to those already seen for ERE

* `\d` for `[0-9]`
* `\s` for `[ \t\r\n\f\v]`
* `\h` for `[ \t]`
* `\n` for newline character
* `\D`, `\S`, `\H`, `\N` etc for their opposites

```bash
$ # example for [0-9] in ERE and \d in PCRE
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oE '[0-9]+'
5
3
83
120
$ echo 'foo=5, bar=3; x=83, y=120' | grep -oP '\d+'
5
3
83
120

$ # (?s) allows newlines to also be matched when using the . meta character
$ grep -ozP '(?s)Roses.*blue,\n' poem.txt
Roses are red,
Violets are blue,
```

* See **INTERNAL OPTION SETTING** in `man pcrepattern` for more info on `(?s)`, `(?m)` etc
* [Specifying Modes Inside The Regular Expression](https://www.regular-expressions.info/modifiers.html) also has some detail on such options
#### Non-greedy matching * Both BRE/ERE support only greedy matching quantifiers * match as much as possible * PCRE supports non-greedy version by adding `?` after quantifiers * match as minimal as possible * See [this Python notebook](https://nbviewer.jupyter.org/url/norvig.com/ipython/pal3.ipynb) for an interesting project on palindrome sentences ```bash $ echo 'foo and bar and baz went shopping bytes' | grep -oi '\w.*and' foo and bar and $ echo 'foo and bar and baz went shopping bytes' | grep -oiP '\w.*?and' foo and bar and $ # recall that matching overall expression gets preference $ echo 'foo and bar and baz went shopping bytes' | grep -oi '\w.*and baz' foo and bar and baz $ echo 'foo and bar and baz went shopping bytes' | grep -oiP '\w.*?and baz' foo and bar and baz $ # minimal matching with single character has simple workaround $ echo 'A man, a plan, a canal, Panama' | grep -oi 'a.*,' A man, a plan, a canal, $ echo 'A man, a plan, a canal, Panama' | grep -oi 'a[^,]*,' A man, a plan, a canal, ```
#### Lookarounds

* Ability to add conditions to match before/after required pattern
* There are four types
    * positive lookahead `(?=`
    * negative lookahead `(?!`
    * positive lookbehind `(?<=`
    * negative lookbehind `(?<!`

#### Ignoring specific matches

* A useful construct is `(*SKIP)(*F)` which allows discarding matches that are not needed
* A simple way to use it: write the regular expression to be discarded first, append `(*SKIP)(*F)`, and then add whatever is required after `|`
* See [Excluding Unwanted Matches](https://www.rexegg.com/backtracking-control-verbs.html#skipfail) for more info

```bash
$ # all words except bat and map
$ echo 'car bat cod map' | grep -oP '(bat|map)(*SKIP)(*F)|\w+'
car
cod

$ # all words except those surrounded by double quotes
$ echo 'I like "mango" and "guava"' | grep -oP '"[^"]+"(*SKIP)(*F)|\w+'
I
like
and
```
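Going back to the four lookaround types listed under Lookarounds, here is a minimal sketch of each one. The sample strings are chosen only for illustration, and the commands assume a `grep` built with `-P` support.

```bash
s='foo=5, bar=3; x=83, y=120'

# positive lookbehind: digits only if preceded by 'bar=', gives 3
echo "$s" | grep -oP '(?<=bar=)\d+'

# positive lookahead: words only if followed by '=', gives foo bar x y
echo "$s" | grep -oP '\w+(?==)'

# negative lookahead: words other than bat and map, gives car cod
echo 'car bat cod map' | grep -oP '\b(?!bat|map)\w+'

# negative lookbehind: words not preceded by ':', gives dog
echo ':cat dog :bat' | grep -oP '(?<!:)\b\w+'
```

The text matched by a lookaround condition is not part of the output, which is what sets it apart from simply adding more characters to the pattern.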
#### Re-using regular expression pattern * `\1`, `\2` etc only matches exact string * `(?1)`, `(?2)` etc re-uses the regular expression itself ```bash $ # (?1) refers to first group \d{4}-\d{2}-\d{2} $ echo '2008-03-24 and 2012-08-12 foo' | grep -oP '(\d{4}-\d{2}-\d{2})\D+(?1)' 2008-03-24 and 2012-08-12 ```
## Gotchas and Tips * Always quote the search string (unless you know what you are doing :P) ```bash $ # spaces are special $ grep so are poem.txt grep: are: No such file or directory poem.txt:And so are you. $ grep 'so are' poem.txt And so are you. $ # use of # indicates start of comment $ printf 'foo\na#2\nb#3\n' | grep #2 Usage: grep [OPTION]... PATTERN [FILE]... Try 'grep --help' for more information. $ printf 'foo\na#2\nb#3\n' | grep '#2' a#2 ``` * Another common problem is unquoted search string will be open to shell's own globbing rules ```bash $ # sample output on bash shell, might vary for different shells $ echo '*.txt' | grep -F *.txt $ echo '*.txt' | grep -F '*.txt' *.txt ``` * Use double quotes for variable expansion, command substitution, etc (Note: could vary based on shell used) * See [mywiki.wooledge Quotes](https://mywiki.wooledge.org/Quotes) for detailed discussion of quoting in `bash` shell ```bash $ # sample output on bash shell, might vary for different shells $ color='blue' $ grep "$color" poem.txt Violets are blue, ``` * Pattern starting with `-` ```bash $ # this issue is not specific to grep alone $ # the command assumes -2 is an option and hence the error $ echo '5*3-2=13' | grep '-2' Usage: grep [OPTION]... PATTERN [FILE]... Try 'grep --help' for more information. $ # workaround by using \- $ echo '5*3-2=13' | grep '\-2' 5*3-2=13 $ # or use -- to indicate no further options to process $ echo '5*3-2=13' | grep -- '-2' 5*3-2=13 $ # same issue with printf $ printf '-1+2=1\n' bash: printf: -1: invalid option printf: usage: printf [-v var] format [arguments] $ printf -- '-1+2=1\n' -1+2=1 ``` * Tip: Options can be specified at end of command as well, useful if option was forgotten and have to quickly add it to previous command from history ```bash $ grep 'are' poem.txt Roses are red, Violets are blue, And so are you. 
$ # use previous command from history, for ex up arrow key in bash $ # then simply add the option at end $ grep 'are' poem.txt -n 1:Roses are red, 2:Violets are blue, 4:And so are you. ``` * Speed boost if input file is ASCII * See also [unix.stackexchange - Counting the number of lines having a number > 100](https://unix.stackexchange.com/questions/312297/counting-the-number-of-lines-having-a-number-greater-than-100/312330#312330) - where `grep` is blazing fast compared to other solutions ```bash $ time grep -xE '([a-d][r-z]){3}' /usr/share/dict/words avatar awards cravat real 0m0.145s $ time LC_ALL=C grep -xE '([a-d][r-z]){3}' /usr/share/dict/words avatar awards cravat real 0m0.011s ``` * Speed boost by using PCRE for back-references * might be faster when using quantifiers as well ```bash $ time LC_ALL=C grep -xE '([a-z]..)\1' /usr/share/dict/words bonbon cancan chichi murmur muumuu pawpaw pompom tartar testes real 0m0.174s $ time grep -xP '([a-z]..)\1' /usr/share/dict/words bonbon cancan chichi murmur muumuu pawpaw pompom tartar testes real 0m0.008s ```
## Regular Expressions Reference (ERE)
#### Anchors

* `^` match from start of line
* `$` match end of line
* `\<` match beginning of word
* `\>` match end of word
* `\b` match edge of word
* `\B` match other than edge of word

#### Character Quantifiers

* `.` match any single character
* `*` match preceding character/group 0 or more times
* `+` match preceding character/group 1 or more times
* `?` match preceding character/group 0 or 1 times
* `{m,n}` match preceding character/group m to n times, including m and n
* `{m,}` match preceding character/group m or more times
* `{,n}` match preceding character/group 0 to n times
* `{n}` match preceding character/group exactly n times

#### Character classes and backslash sequences

* `[aeiou]` match any of these characters
* `[^aeiou]` match any character other than these
* `[a-z]` match any lowercase alphabet
* `[0-9]` match any digit character
* `\w` match alphabets, digits and underscore character, short cut for `[a-zA-Z0-9_]`
* `\W` opposite of `\w`, short cut for `[^a-zA-Z0-9_]`
* `\s` match white-space characters: tab, newline, vertical tab, form feed, carriage return, and space
* `\S` match other than white-space characters

#### Pattern groups

* `|` matches either of the given patterns
* `()` patterns within `()` are grouped and treated as one pattern, useful in conjunction with `|`
* `\1` backreference to first grouped pattern within `()`
* `\2` backreference to second grouped pattern within `()` and so on
#### Basic vs Extended Regular Expressions By default, the pattern passed to `grep` is treated as Basic Regular Expressions(BRE), which can be overridden using options like `-E` for ERE and `-P` for Perl Compatible Regular Expression(PCRE). Paraphrasing from `info grep` >In Basic Regular Expressions the meta-characters `? + { | ( )` lose their special meaning, instead use the backslashed versions `\? \+ \{ \| \( \)`
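The difference can be sketched with a pair of equivalent commands. The sample strings are made up for illustration.

```bash
# grouping and quantifier need backslashes in BRE, but not in ERE
# both commands are expected to print abab
echo 'ab aabb abab' | grep -o '\(ab\)\{2\}'
echo 'ab aabb abab' | grep -oE '(ab){2}'

# the reverse holds for matching these characters literally
# both commands are expected to print a+b
echo 'a+b=c' | grep -o 'a+b'
echo 'a+b=c' | grep -oE 'a\+b'
```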
## Further Reading * `man grep` and `info grep` * At least go through all options ;) * **Usage section** in `info grep` has good examples as well * This chapter has also been [converted to a book](https://github.com/learnbyexample/learn_gnugrep_ripgrep) with additional examples, exercises and covers popular alternative `ripgrep` * A bit of history * [Brian Kernighan remembers the origins of grep](https://thenewstack.io/brian-kernighan-remembers-the-origins-of-grep/) * [how grep command was born](https://medium.com/@rualthanzauva/grep-was-a-private-command-of-mine-for-quite-a-while-before-i-made-it-public-ken-thompson-a40e24a5ef48) * [why GNU grep is fast](https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html) * [unix.stackexchange - Difference between grep, egrep and fgrep](https://unix.stackexchange.com/questions/17949/what-is-the-difference-between-grep-egrep-and-fgrep) * Q&A on stackoverflow/stackexchange are good source of learning material, good for practice exercises as well * [grep Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/grep?sort=votes&pageSize=15) * [grep Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/grep?sort=votes&pageSize=15) * Learn Regular Expressions (has information on flavors other than BRE/ERE/PCRE too) * [Regular Expressions Tutorial](https://www.regular-expressions.info/tutorial.html) * [rexegg](https://www.rexegg.com/) - tutorials, tricks and more * [regexcrossword](https://regexcrossword.com/) * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) * [online regex tester and debugger](https://regex101.com/) - by default `pcre` flavor * Alternatives * [ripgrep](https://github.com/BurntSushi/ripgrep) * [pcregrep](https://www.pcre.org/original/doc/html/pcregrep.html) * [ag - silver searcher](https://github.com/ggreer/the_silver_searcher) * [unix.stackexchange - When to use grep, sed, awk, perl, 
etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed) ================================================ FILE: gnu_sed.md ================================================


--- :information_source: :information_source: This chapter has been converted into a better formatted ebook: https://learnbyexample.github.io/learn_gnused/. The ebook also has content updated for newer version of the commands, includes exercises, solutions, etc. For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_gnused ---


# GNU sed **Table of Contents** * [Simple search and replace](#simple-search-and-replace) * [editing stdin](#editing-stdin) * [editing file input](#editing-file-input) * [Inplace file editing](#inplace-file-editing) * [With backup](#with-backup) * [Without backup](#without-backup) * [Multiple files](#multiple-files) * [Prefix backup name](#prefix-backup-name) * [Place backups in directory](#place-backups-in-directory) * [Line filtering options](#line-filtering-options) * [Print command](#print-command) * [Delete command](#delete-command) * [Quit commands](#quit-commands) * [Negating REGEXP address](#negating-regexp-address) * [Combining multiple REGEXP](#combining-multiple-regexp) * [Filtering by line number](#filtering-by-line-number) * [Print only line number](#print-only-line-number) * [Address range](#address-range) * [Relative addressing](#relative-addressing) * [Using different delimiter for REGEXP](#using-different-delimiter-for-regexp) * [Regular Expressions](#regular-expressions) * [Line Anchors](#line-anchors) * [Word Anchors](#word-anchors) * [Matching the meta characters](#matching-the-meta-characters) * [Alternation](#alternation) * [The dot meta character](#the-dot-meta-character) * [Quantifiers](#quantifiers) * [Character classes](#character-classes) * [Escape sequences](#escape-sequences) * [Grouping](#grouping) * [Back reference](#back-reference) * [Changing case](#changing-case) * [Substitute command modifiers](#substitute-command-modifiers) * [g modifier](#g-modifier) * [Replace specific occurrence](#replace-specific-occurrence) * [Ignoring case](#ignoring-case) * [p modifier](#p-modifier) * [w modifier](#w-modifier) * [e modifier](#e-modifier) * [m modifier](#m-modifier) * [Shell substitutions](#shell-substitutions) * [Variable substitution](#variable-substitution) * [Command substitution](#command-substitution) * [z and s command line options](#z-and-s-command-line-options) * [change command](#change-command) * [insert command](#insert-command) 
* [append command](#append-command) * [adding contents of file](#adding-contents-of-file) * [r for entire file](#r-for-entire-file) * [R for line by line](#r-for-line-by-line) * [n and N commands](#n-and-n-commands) * [Control structures](#control-structures) * [if then else](#if-then-else) * [replacing in specific column](#replacing-in-specific-column) * [overlapping substitutions](#overlapping-substitutions) * [Lines between two REGEXPs](#lines-between-two-regexps) * [Include or Exclude matching REGEXPs](#include-or-exclude-matching-regexps) * [First or Last block](#first-or-last-block) * [Broken blocks](#broken-blocks) * [sed scripts](#sed-scripts) * [Gotchas and Tips](#gotchas-and-tips) * [Further Reading](#further-reading)
```bash $ sed --version | head -n1 sed (GNU sed) 4.2.2 $ man sed SED(1) User Commands SED(1) NAME sed - stream editor for filtering and transforming text SYNOPSIS sed [OPTION]... {script-only-if-no-other-script} [input-file]... DESCRIPTION Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors. ... ``` **Note:** [Multiline and manipulating pattern space](https://www.gnu.org/software/sed/manual/sed.html#Multiline-techniques) with h,x,D,G,H,P etc is not covered in this chapter and examples/information is based on ASCII encoded text input only
## Simple search and replace Detailed examples for **substitute** command will be covered in later sections, syntax is ``` s/REGEXP/REPLACEMENT/FLAGS ``` The `/` character is idiomatically used as delimiter character. See also [Using different delimiter for REGEXP](#using-different-delimiter-for-regexp)
#### editing stdin ```bash $ # sample command output to be edited $ seq 10 | paste -sd, 1,2,3,4,5,6,7,8,9,10 $ # change only first ',' to ' : ' $ seq 10 | paste -sd, | sed 's/,/ : /' 1 : 2,3,4,5,6,7,8,9,10 $ # change all ',' to ' : ' by using 'g' modifier $ seq 10 | paste -sd, | sed 's/,/ : /g' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 ``` **Note:** As a good practice, all examples use single quotes around arguments to prevent shell interpretation. See [Shell substitutions](#shell-substitutions) section on use of double quotes
#### editing file input * By default newline character is the line separator * See [Regular Expressions](#regular-expressions) section for qualifying search terms, for ex * word boundaries to distinguish between 'hi', 'this', 'his', 'history', etc * multiple search terms, specific set of character, etc ```bash $ cat greeting.txt Hi there Have a nice day $ # change first 'e' in each line to 'E' $ sed 's/e/E/' greeting.txt Hi thEre HavE a nice day $ # change first 'nice day' in each line to 'safe journey' $ sed 's/nice day/safe journey/' greeting.txt Hi there Have a safe journey $ # change all 'e' to 'E' and save changed text to another file $ sed 's/e/E/g' greeting.txt > out.txt $ cat out.txt Hi thErE HavE a nicE day ```
## Inplace file editing * In previous section, the output from `sed` was displayed on stdout or saved to another file * To write the changes back to original file, use `-i` option **Note**: * Refer to `man sed` for details of how to use the `-i` option. It varies with different `sed` implementations. As mentioned at start of this chapter, `sed (GNU sed) 4.2.2` is being used here * See also [unix.stackexchange - working with symlinks](https://unix.stackexchange.com/questions/348693/sed-update-etc-grub-conf-in-spite-this-link-file)
#### With backup * When extension is given, the original input file is preserved with name changed according to extension provided ```bash $ # '.bkp' is extension provided $ sed -i.bkp 's/Hi/Hello/' greeting.txt $ # output from sed is written back to 'greeting.txt' $ cat greeting.txt Hello there Have a nice day $ # original file is preserved in 'greeting.txt.bkp' $ cat greeting.txt.bkp Hi there Have a nice day ```
#### Without backup * Use this option with caution, changes made cannot be undone ```bash $ sed -i 's/nice day/safe journey/' greeting.txt $ # note, 'Hi' was already changed to 'Hello' in previous example $ cat greeting.txt Hello there Have a safe journey ```
#### Multiple files * Multiple input files are treated individually and changes are written back to respective files ```bash $ cat f1 I ate 3 apples $ cat f2 I bought two bananas and 3 mangoes $ # -i can be used with or without backup $ sed -i 's/3/three/' f1 f2 $ cat f1 I ate three apples $ cat f2 I bought two bananas and three mangoes ```
#### Prefix backup name * A `*` in argument given to `-i` will get expanded to input filename * This way, one can add prefix instead of suffix for backup ```bash $ cat var.txt foo bar baz $ sed -i'bkp.*' 's/foo/hello/' var.txt $ cat var.txt hello bar baz $ cat bkp.var.txt foo bar baz ```
#### Place backups in directory * `*` also allows to specify an existing directory to place the backups instead of current working directory ```bash $ mkdir bkp_dir $ sed -i'bkp_dir/*' 's/bar/hi/' var.txt $ cat var.txt hello hi baz $ cat bkp_dir/var.txt hello bar baz $ # extensions can be added as well $ # bkp_dir/*.bkp for suffix $ # bkp_dir/bkp.* for prefix $ # bkp_dir/bkp.*.2017 for both and so on ```
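The directory plus suffix form mentioned in the comments above can be tried out safely in a throwaway directory. This is only a sketch: it assumes GNU `sed` and that `mktemp` is available, and the variable name `tmp` is arbitrary.

```bash
# throwaway directory so that no real files are touched
tmp=$(mktemp -d)

(
    cd "$tmp" || exit 1
    printf 'foo\nbar\nbaz\n' > var.txt
    mkdir bkp_dir

    # var.txt is modified, original content goes to bkp_dir/var.txt.bkp
    sed -i'bkp_dir/*.bkp' 's/foo/hello/' var.txt
)

cat "$tmp/var.txt"
cat "$tmp/bkp_dir/var.txt.bkp"
```

The subshell keeps the `cd` from affecting the rest of the session.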
## Line filtering options

* By default, `sed` acts on the entire file. Often, one needs to extract or change only specific lines based on text search, line numbers, lines between two patterns, etc
* This filtering is much like using `grep`, `head` and `tail` commands in many ways and there are even more features
    * Use `sed` when the filtered lines need to be edited in place or transformed, not as a substitute for those commands
#### Print command * It is usually used in conjunction with `-n` option * By default, `sed` prints every input line, including any changes made by commands like substitution * printing here refers to line being part of `sed` output which may be shown on terminal, redirected to file, etc * Using `-n` option and `p` command together, only specific lines needed can be filtered * Examples below use the `/REGEXP/` addressing, other forms will be seen in sections to follow ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # all lines containing the string 'are' $ # same as: grep 'are' poem.txt $ sed -n '/are/p' poem.txt Roses are red, Violets are blue, And so are you. $ # all lines containing the string 'so are' $ # same as: grep 'so are' poem.txt $ sed -n '/so are/p' poem.txt And so are you. ``` * Using print and substitution together ```bash $ # print only lines on which substitution happens $ sed -n 's/are/ARE/p' poem.txt Roses ARE red, Violets ARE blue, And so ARE you. $ # if line contains 'are', perform given command $ # print only if substitution succeeds $ sed -n '/are/ s/so/SO/p' poem.txt And SO are you. ``` * Duplicating every input line ```bash $ # note, -n is not used and no filtering applied $ seq 3 | sed 'p' 1 1 2 2 3 3 ```
#### Delete command * By default, `sed` prints every input line, including any changes like substitution * Using the `d` command, those specific lines will NOT be printed ```bash $ # same as: grep -v 'are' poem.txt $ sed '/are/d' poem.txt Sugar is sweet, $ # same as: seq 5 | grep -v '3' $ seq 5 | sed '/3/d' 1 2 4 5 ``` * Modifier `I` allows to filter lines in case-insensitive way * See [Regular Expressions](#regular-expressions) section for more details ```bash $ # /rose/I means match the string 'rose' irrespective of case $ sed '/rose/Id' poem.txt Violets are blue, Sugar is sweet, And so are you. ```
#### Quit commands * Exit `sed` without processing further input ```bash $ # same as: seq 23 45 | head -n5 $ # remember that printing is default action if -n is not used $ # here, 5 is line number based addressing $ seq 23 45 | sed '5q' 23 24 25 26 27 ``` * `Q` is similar to `q` but won't print the matching line ```bash $ seq 23 45 | sed '5Q' 23 24 25 26 $ # useful to print from beginning of file up to but not including line matching REGEXP $ sed '/is/Q' poem.txt Roses are red, Violets are blue, ``` * Use `tac` to get all lines starting from last occurrence of search string ```bash $ # all lines from last occurrence of '7' $ seq 50 | tac | sed '/7/q' | tac 47 48 49 50 $ # all lines from last occurrence of '7' excluding line with '7' $ seq 50 | tac | sed '/7/Q' | tac 48 49 50 ``` **Note** * This way of using quit commands won't work for inplace editing with multiple file input * See also [unix.stackexchange - applying changes to multiple files](https://unix.stackexchange.com/questions/309514/sed-apply-changes-in-multiple-files)
#### Negating REGEXP address * Use `!` to invert the specified address ```bash $ # same as: sed -n '/so are/p' poem.txt $ sed '/so are/!d' poem.txt And so are you. $ # same as: sed '/are/d' poem.txt $ sed -n '/are/!p' poem.txt Sugar is sweet, ```
#### Combining multiple REGEXP

* See also [sed manual - Multiple commands syntax](https://www.gnu.org/software/sed/manual/sed.html#Multiple-commands-syntax) for more details
* See also [sed scripts](#sed-scripts) section for an alternate way

```bash
$ # each command as argument to -e option
$ sed -n -e '/blue/p' -e '/you/p' poem.txt
Violets are blue,
And so are you.

$ # each command separated by ;
$ # not all commands can be separated this way
$ sed -n '/blue/p; /you/p' poem.txt
Violets are blue,
And so are you.

$ # each command separated by literal newline character
$ # might depend on whether the shell allows such multiline command
$ sed -n '
/blue/p
/you/p
' poem.txt
Violets are blue,
And so are you.
```

* Use `{}` command grouping for logical AND

```bash
$ # same as: grep 'are' poem.txt | grep 'And'
$ # space between /REGEXP/ and {} is optional
$ sed -n '/are/ {/And/p}' poem.txt
And so are you.

$ # same as: grep 'are' poem.txt | grep -v 'so'
$ sed -n '/are/ {/so/!p}' poem.txt
Roses are red,
Violets are blue,

$ # same as: grep -v 'red' poem.txt | grep -v 'blue'
$ sed -n '/red/!{/blue/!p}' poem.txt
Sugar is sweet,
And so are you.
$ # many ways to do it, use whatever feels easier to construct
$ # sed -e '/red/d' -e '/blue/d' poem.txt
$ # grep -v -e 'red' -e 'blue' poem.txt
```

* Different ways to do the same thing. See also [Alternation](#alternation) and [Control structures](#control-structures)

```bash
$ # multiple commands can lead to duplication
$ sed -n '/blue/p; /t/p' poem.txt
Violets are blue,
Violets are blue,
Sugar is sweet,
$ # in such cases, use regular expressions instead
$ sed -nE '/blue|t/p;' poem.txt
Violets are blue,
Sugar is sweet,

$ sed -nE '/red|blue/!p' poem.txt
Sugar is sweet,
And so are you.

$ sed -n '/so/b; /are/p' poem.txt
Roses are red,
Violets are blue,
```
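The last command above uses `b` without comment. When no label is given, `b` branches to the end of the script, so the remaining commands are skipped for that line. Here is a small sketch comparing it with the `{}` grouping form. The `poem` variable simply holds the contents of `poem.txt` so that the snippet is self-contained.

```bash
poem='Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.'

# lines containing 'so' jump to end of script, /are/p is never tried for them
printf '%s\n' "$poem" | sed -n '/so/b; /are/p'

# same filtering written with negation and grouping
printf '%s\n' "$poem" | sed -n '/so/!{/are/p}'
```

Both commands print the first two lines of the poem.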
#### Filtering by line number * Exact line number can be specified to be acted upon * As a special case, `$` indicates last line of file * See also [sed manual - Multiple commands syntax](https://www.gnu.org/software/sed/manual/sed.html#Multiple-commands-syntax) ```bash $ # here, 2 represents the address for print command, similar to /REGEXP/p $ # same as: head -n2 poem.txt | tail -n1 $ sed -n '2p' poem.txt Violets are blue, $ # print 2nd and 4th line $ sed -n '2p; 4p' poem.txt Violets are blue, And so are you. $ # same as: tail -n1 poem.txt $ sed -n '$p' poem.txt And so are you. $ # delete except 3rd line $ sed '3!d' poem.txt Sugar is sweet, $ # substitution only on 2nd line $ sed '2 s/are/ARE/' poem.txt Roses are red, Violets ARE blue, Sugar is sweet, And so are you. ``` * For large input files, combine `p` with `q` for speedy exit * `sed` would immediately quit without processing further input lines when `q` is used ```bash $ seq 3542 4623452 | sed -n '2452{p;q}' 5993 $ seq 3542 4623452 | sed -n '250p; 2452{p;q}' 3791 5993 $ # here is a sample time comparison $ time seq 3542 4623452 | sed -n '2452{p;q}' > /dev/null real 0m0.003s user 0m0.000s sys 0m0.000s $ time seq 3542 4623452 | sed -n '2452p' > /dev/null real 0m0.334s user 0m0.396s sys 0m0.024s ``` * mimicking `head` command using `q` ```bash $ # same as: seq 23 45 | head -n5 $ # remember that printing is default action if -n is not used $ seq 23 45 | sed '5q' 23 24 25 26 27 ```
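The `{p;q}` idiom is handy enough to wrap in a shell function. This is just a sketch and the name `nth_line` is made up. Double quotes are used so that the shell substitutes the line number into the `sed` script.

```bash
# hypothetical helper: print only line number $1 of stdin, then quit
nth_line() {
    sed -n "${1}{p;q}"
}

# fifth line of the sequence 23 to 45, which is 27
seq 23 45 | nth_line 5
```

Since the argument is pasted into the script as is, it should only ever be a plain number.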
#### Print only line number ```bash $ # gives both line number and matching line $ grep -n 'blue' poem.txt 2:Violets are blue, $ # gives only line number of matching line $ sed -n '/blue/=' poem.txt 2 $ sed -n '/are/=' poem.txt 1 2 4 ``` * If needed, matching line can also be printed. But there will be newline separation ```bash $ sed -n '/blue/{=;p}' poem.txt 2 Violets are blue, $ # or $ sed -n '/blue/{p;=}' poem.txt Violets are blue, 2 ```
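If the line number and the line are needed on the same line, one option is to join each pair of output lines with `paste`. This is a sketch, not something `sed` offers directly through `=`. The `poem` variable holds the contents of `poem.txt` so that the snippet is self-contained.

```bash
poem='Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.'

# = prints the line number on its own line, p prints the line itself
# paste with two - arguments joins every two input lines using : as delimiter
printf '%s\n' "$poem" | sed -n '/are/{=;p}' | paste -d: - -
```

The result has the same format as `grep -n 'are' poem.txt`.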
#### Address range * So far, we've seen how to filter specific line based on *REGEXP* and line numbers * `sed` also allows to combine them to enable selecting a range of lines * Consider the sample input file for this section ```bash $ cat addr_range.txt Hello World Good day How are you Just do-it Believe it Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ``` * Range defined by start and end *REGEXP* * For other cases like getting lines without the line matching start and/or end, unbalanced start/end, when end *REGEXP* doesn't match, etc see [Lines between two REGEXPs](#lines-between-two-regexps) section ```bash $ sed -n '/is/,/like/p' addr_range.txt Today is sunny Not a bit funny No doubt you like it too $ sed -n '/just/I,/believe/Ip' addr_range.txt Just do-it Believe it $ # the second REGEXP will always be checked after the line matching first address $ sed -n '/No/,/No/p' addr_range.txt Not a bit funny No doubt you like it too $ # all the matching ranges will be printed $ sed -n '/you/,/do/p' addr_range.txt How are you Just do-it No doubt you like it too Much ado about nothing ``` * Range defined by start and end line numbers ```bash $ # print lines numbered 3 to 7 $ sed -n '3,7p' addr_range.txt Good day How are you Just do-it Believe it $ # print lines from line number 13 to last line $ sed -n '13,$p' addr_range.txt Much ado about nothing He he he $ # delete lines numbered 2 to 13 $ sed '2,13d' addr_range.txt Hello World He he he ``` * Range defined by mix of line number and *REGEXP* ```bash $ sed -n '3,/do/p' addr_range.txt Good day How are you Just do-it $ sed -n '/Today/,$p' addr_range.txt Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he he ``` * Negating address range, just add `!` to end of address range ```bash $ # same as: seq 10 | sed '3,7d' $ seq 10 | sed -n '3,7!p' 1 2 8 9 10 $ # same as: sed '/Today/,$d' addr_range.txt $ sed -n '/Today/,$!p' addr_range.txt Hello World Good 
day How are you Just do-it Believe it ```
#### Relative addressing

* Prefixing `+` to a number for second address gives relative filtering
* Similar to using `grep -A<num> --no-group-separator 'REGEXP'` but `grep` merges adjacent groups while `sed` does not

```bash
$ # line matching 'is' and 2 lines after
$ sed -n '/is/,+2p' addr_range.txt
Today is sunny
Not a bit funny
No doubt you like it too

$ # note that all matching ranges will be filtered
$ sed -n '/do/,+2p' addr_range.txt
Just do-it
Believe it

No doubt you like it too

Much ado about nothing
```

* The first address could be number too
* Useful when using [Shell substitutions](#shell-substitutions)

```bash
$ sed -n '3,+4p' addr_range.txt
Good day
How are you

Just do-it
Believe it
```

* Another relative format is `i~j` which acts on ith line and i+j, i+2j, i+3j, etc
    * `1~2` means 1st, 3rd, 5th, 7th, etc (i.e odd numbered lines)
    * `5~3` means 5th, 8th, 11th, etc

```bash
$ # match odd numbered lines
$ # for even, use 2~2
$ seq 10 | sed -n '1~2p'
1
3
5
7
9

$ # match line numbers: 2, 2+1*4, 2+2*4, etc
$ seq 10 | sed -n '2~4p'
2
6
10
```

* If `~j` is specified after `,` then meaning changes completely
* After the matching line based on number or *REGEXP* of start address, the closest line number that is a multiple of `j` will mark the end address

```bash
$ # 2nd line is start address
$ # closest multiple of 4 is 4th line
$ seq 10 | sed -n '2,~4p'
2
3
4
$ # closest multiple of 4 is 8th line
$ seq 10 | sed -n '5,~4p'
5
6
7
8

$ # line matching on `Just` is 6th line, so ending is 10th line
$ sed -n '/Just/,~5p' addr_range.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
```
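The difference between `sed` and `grep` for adjacent groups can be seen with a small input where two consecutive lines match. This is an illustrative sketch, assuming GNU `sed` and GNU `grep`; the variable names are made up.

```bash
# lines 2 and 3 both match, so the two 'match + 2 lines' groups overlap
# sed: line 3 falls inside the range started by line 2 and is not
#      checked as a new start, so the range ends at line 4
sed_out=$(seq 6 | sed -n '/[23]/,+2p')

# grep: trailing context is counted from the last matching line (line 3)
grep_out=$(seq 6 | grep -A2 --no-group-separator '[23]')

# unquoted expansion joins the output lines with spaces for compact display
echo sed: $sed_out      # sed: 2 3 4
echo grep: $grep_out    # grep: 2 3 4 5
```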
## Using different delimiter for REGEXP * `/` is idiomatically used as the *REGEXP* delimiter * See also [a bit of history on why / is commonly used as delimiter](https://www.reddit.com/r/commandline/comments/3lhgwh/why_did_people_standardize_on_using_forward/cvgie7j/) * But any character other than `\` and newline character can be used instead * This helps to avoid/reduce use of `\` ```bash $ # instead of this $ echo '/home/learnbyexample/reports' | sed 's/\/home\/learnbyexample\//~\//' ~/reports $ # use a different delimiter $ echo '/home/learnbyexample/reports' | sed 's#/home/learnbyexample/#~/#' ~/reports ``` * For *REGEXP* used in address matching, syntax is a bit different `\REGEXP` ```bash $ printf '/foo/bar/1\n/foo/baz/1\n' /foo/bar/1 /foo/baz/1 $ printf '/foo/bar/1\n/foo/baz/1\n' | sed -n '\;/foo/bar/;p' /foo/bar/1 ```
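Both forms can be combined in a single command. A small sketch, assuming GNU `sed`; `shorten_path` is a hypothetical name used only for this illustration.

```bash
# shorten_path: hypothetical helper working on one path per line
# \#REGEXP# is used for the address and # as delimiter for the s command,
# so none of the / characters in the paths need escaping
shorten_path() {
    sed '\#^/usr/local/#d; s#^/usr/bin/#/bin/#'
}

printf '/usr/bin/sed\n/usr/local/bin/awk\n' | shorten_path
# output: /bin/sed
```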
## Regular Expressions

* By default, `sed` treats *REGEXP* as BRE (Basic Regular Expression)
* The `-E` option enables ERE (Extended Regular Expression) which in GNU sed's case only differs in how meta characters are used, no difference in functionalities
* Initially GNU sed only had `-r` option to enable ERE and older versions of `man sed` don't even mention `-E`
* Other `sed` versions use `-E` and `grep` uses `-E` as well. So `-r` won't be used in examples in this tutorial
* See also [sed manual - BRE-vs-ERE](https://www.gnu.org/software/sed/manual/sed.html#BRE-vs-ERE)
* See [sed manual - Regular Expressions](https://www.gnu.org/software/sed/manual/sed.html#sed-regular-expressions) for more details
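As a quick illustration of BRE and ERE differing only in how meta characters are written, here is a sketch using the `+` quantifier (quantifiers are covered in a later section). It assumes GNU `sed`; the `sample` function is made up to avoid repeating the input.

```bash
# sample input, hypothetical helper to avoid repeating the printf
sample() {
    printf 'ab\naab\na+b\n'
}

# BRE: + is a quantifier only when escaped
sample | sed -n '/a\+b/p'
# ERE: + is a quantifier without the backslash
sample | sed -nE '/a+b/p'
# both of the above print 'ab' and 'aab'

# the reverse also holds: these two match '+' literally and print 'a+b'
sample | sed -n '/a+b/p'
sample | sed -nE '/a\+b/p'
```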
#### Line Anchors * Often, search must match from beginning of line or towards end of line * For example, an integer variable declaration in `C` will start with optional white-space, the keyword `int`, white-space and then variable(s) * This way one can avoid matching declarations inside single line comments as well * Similarly, one might want to match a variable at end of statement Consider the input file and sample substitution without using any anchoring ```bash $ cat anchors.txt cat and dog too many cats around here to concatenate, use the cmd cat catapults laid waste to the village just scat and quit bothering me that is quite a fabricated tale try the grape variety muscat $ # without anchors, substitution will replace wherever the string is found $ sed 's/cat/XXX/g' anchors.txt XXX and dog too many XXXs around here to conXXXenate, use the cmd XXX XXXapults laid waste to the village just sXXX and quit bothering me that is quite a fabriXXXed tale try the grape variety musXXX ``` * The meta character `^` forces *REGEXP* to match only at start of line ```bash $ # filtering lines starting with 'cat' $ sed -n '/^cat/p' anchors.txt cat and dog catapults laid waste to the village $ # replace only at start of line $ # g modifier not needed as there can only be single match at start of line $ sed 's/^cat/XXX/' anchors.txt XXX and dog too many cats around here to concatenate, use the cmd cat XXXapults laid waste to the village just scat and quit bothering me that is quite a fabricated tale try the grape variety muscat $ # add something to start of line $ echo 'Have a good day' | sed 's/^/Hi! /' Hi! 
Have a good day ``` * The meta character `$` forces *REGEXP* to match only at end of line ```bash $ # filtering lines ending with 'cat' $ sed -n '/cat$/p' anchors.txt to concatenate, use the cmd cat try the grape variety muscat $ # replace only at end of line $ sed 's/cat$/YYY/' anchors.txt cat and dog too many cats around here to concatenate, use the cmd YYY catapults laid waste to the village just scat and quit bothering me that is quite a fabricated tale try the grape variety musYYY $ # add something to end of line $ echo 'Have a good day' | sed 's/$/. Cya later/' Have a good day. Cya later ```
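The `C` declaration case mentioned at the start of this section could be sketched as below. It is a rough approximation and not a `C` parser; it uses character classes and quantifiers which are explained in later sections. `int_decls` is a hypothetical helper name.

```bash
# int_decls: hypothetical helper to filter lines that look like 'int' declarations
# ^[[:space:]]*   optional white-space at start of line
# int[[:space:]]+ keyword followed by at least one white-space character
int_decls() {
    sed -nE '/^[[:space:]]*int[[:space:]]+/p'
}

printf 'int x;\n  int y = 2;\n// int z;\nprint(a);\ninterval = 5;\n' | int_decls
# output:
# int x;
#   int y = 2;
```

The `^` anchor is what keeps the commented line and `print(a);` out of the result.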
#### Word Anchors

* A **word** character is any alphabet (irrespective of case) or any digit or the underscore character
* The word anchors help in matching or not matching boundaries of a word
    * For example, to distinguish between `par`, `spar` and `apparent`
* `\b` matches word boundary
    * `\` is meta character and certain combinations like `\b` and `\B` have special meaning

```bash
$ # words ending with 'cat'
$ sed -n 's/cat\b/XXX/p' anchors.txt
XXX and dog
to concatenate, use the cmd XXX
just sXXX and quit bothering me
try the grape variety musXXX

$ # words starting with 'cat'
$ sed -n 's/\bcat/YYY/p' anchors.txt
YYY and dog
too many YYYs around here
to concatenate, use the cmd YYY
YYYapults laid waste to the village

$ # only whole words
$ sed -n 's/\bcat\b/ZZZ/p' anchors.txt
ZZZ and dog
to concatenate, use the cmd ZZZ

$ # word is made up of alphabets, numbers and _
$ echo 'foo, foo_bar and foo1' | sed 's/\bfoo\b/baz/g'
baz, foo_bar and foo1
```

* `\B` is opposite of `\b`, i.e it doesn't match word boundaries

```bash
$ # substitute only if 'cat' is surrounded by word characters
$ sed -n 's/\Bcat\B/QQQ/p' anchors.txt
to conQQQenate, use the cmd cat
that is quite a fabriQQQed tale

$ # substitute only if 'cat' is not start of word
$ sed -n 's/\Bcat/RRR/p' anchors.txt
to conRRRenate, use the cmd cat
just sRRR and quit bothering me
that is quite a fabriRRRed tale
try the grape variety musRRR

$ # substitute only if 'cat' is not end of word
$ sed -n 's/cat\B/SSS/p' anchors.txt
too many SSSs around here
to conSSSenate, use the cmd cat
SSSapults laid waste to the village
that is quite a fabriSSSed tale
```

* One can also use these alternatives for `\b`
    * `\<` for start of word
    * `\>` for end of word

```bash
$ # same as: sed 's/\bcat\b/X/g'
$ echo 'concatenate cat scat cater' | sed 's/\<cat\>/X/g'
concatenate X scat cater

$ # add something to both start/end of word
$ echo 'hi foo_baz 3b' | sed 's/\b/:/g'
:hi: :foo_baz: :3b:

$ # add something only at start of word
$ echo 'hi foo_baz 3b' | sed 's/\</:/g'
:hi :foo_baz :3b

$ # add something only at end of word
$ echo 'hi foo_baz 3b' | sed 's/\>/:/g'
hi: foo_baz: 3b:
```
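The `par`, `spar` and `apparent` case mentioned at the start of this section makes for a compact comparison of the anchors. A minimal sketch, assuming GNU `sed`; the `words` variable is made up for illustration.

```bash
# sample words from the start of this section
words='par spar apparent'

# whole word only
echo "$words" | sed 's/\bpar\b/X/g'
# output: X spar apparent

# only at end of word
echo "$words" | sed 's/par\b/X/g'
# output: X sX apparent

# only when surrounded by word characters
echo "$words" | sed 's/\Bpar\B/X/g'
# output: par spar apXent
```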
#### Matching the meta characters * Since meta characters like `^`, `$`, `\` etc have special meaning in *REGEXP*, they have to be escaped using `\` to match them literally ```bash $ # here, '^' will match only start of line $ echo '(a+b)^2 = a^2 + b^2 + 2ab' | sed 's/^/**/g' **(a+b)^2 = a^2 + b^2 + 2ab $ # '\` before '^' will match '^' literally $ echo '(a+b)^2 = a^2 + b^2 + 2ab' | sed 's/\^/**/g' (a+b)**2 = a**2 + b**2 + 2ab $ # to match '\' use '\\' $ echo 'foo\bar' | sed 's/\\/ /' foo bar $ echo 'pa$$' | sed 's/$/s/g' pa$$s $ echo 'pa$$' | sed 's/\$/s/g' pass $ # '^' has special meaning only at start of REGEXP $ # similarly, '$' has special meaning only at end of REGEXP $ echo '(a+b)^2 = a^2 + b^2 + 2ab' | sed 's/a^2/A^2/g' (a+b)^2 = A^2 + b^2 + 2ab ``` * Certain characters like `&` and `\` have special meaning in *REPLACEMENT* section of substitute as well. They too have to be escaped using `\` * And the delimiter character has to be escaped of course * See [back reference](#back-reference) section for use of `&` in *REPLACEMENT* section ```bash $ # & will refer to entire matched string of REGEXP section $ echo 'foo and bar' | sed 's/and/"&"/' foo "and" bar $ echo 'foo and bar' | sed 's/and/"\&"/' foo "&" bar $ # use different delimiter where required $ echo 'a b' | sed 's/ /\//' a/b $ echo 'a b' | sed 's# #/#' a/b $ # use \\ to represent literal \ $ echo '/foo/bar/baz' | sed 's#/#\\#g' \foo\bar\baz ```
#### Alternation * Two or more *REGEXP* can be combined as logical OR using the `|` meta character * syntax is `\|` for BRE and `|` for ERE * Each side of `|` is complete regular expression with their own start/end anchors * How each part of alternation is handled and order of evaluation/output is beyond the scope of this tutorial * See [this](https://www.regular-expressions.info/alternation.html) for more info on this topic. ```bash $ # BRE $ sed -n '/red\|blue/p' poem.txt Roses are red, Violets are blue, $ # ERE $ sed -nE '/red|blue/p' poem.txt Roses are red, Violets are blue, $ # filter lines starting or ending with 'cat' $ sed -nE '/^cat|cat$/p' anchors.txt cat and dog to concatenate, use the cmd cat catapults laid waste to the village try the grape variety muscat $ # g modifier is needed for more than one replacement $ echo 'foo and temp and baz' | sed -E 's/foo|temp|baz/XYZ/' XYZ and temp and baz $ echo 'foo and temp and baz' | sed -E 's/foo|temp|baz/XYZ/g' XYZ and XYZ and XYZ ```
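Since each side of `|` is a complete regular expression, an anchor applies only to the alternative it is written in. The sketch below contrasts that with a grouped version (see the Grouping section later in this chapter). It assumes GNU `sed`; the `line` variable is made up for illustration.

```bash
line='dog and cat and dog'

# ^ applies only to 'cat', so 'dog' is replaced anywhere in the line
echo "$line" | sed -E 's/^cat|dog/X/g'
# output: X and cat and X

# with grouping, ^ applies to both alternatives
echo "$line" | sed -E 's/^(cat|dog)/X/g'
# output: X and cat and dog
```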
#### The dot meta character * The `.` meta character matches any character once, including newline ```bash $ # replace all sequence of 3 characters starting with 'c' and ending with 't' $ echo 'coat cut fit c#t' | sed 's/c.t/XYZ/g' coat XYZ fit XYZ $ # replace all sequence of 4 characters starting with 'c' and ending with 't' $ echo 'coat cut fit c#t' | sed 's/c..t/ABCD/g' ABCD cut fit c#t $ # space, tab etc are also characters which will be matched by '.' $ echo 'coat cut fit c#t' | sed 's/t.f/IJK/g' coat cuIJKit c#t ```
#### Quantifiers

All quantifiers in `sed` are greedy, i.e longest match wins as long as overall *REGEXP* is satisfied and precedence is left to right. In this section, we'll cover usage of quantifiers on characters

* `?` will try to match 0 or 1 time
* For BRE, use `\?`

```bash
$ printf 'late\npale\nfactor\nrare\nact\n'
late
pale
factor
rare
act

$ # same as using: sed -nE '/at|act/p'
$ printf 'late\npale\nfactor\nrare\nact\n' | sed -nE '/ac?t/p'
late
factor
act

$ # greediness comes in handy in some cases
$ # problem: '<' has to be replaced with '\<' only if not preceded by '\'
$ echo 'blah \< foo bar < blah baz <'
blah \< foo bar < blah baz <
$ # this won't work as '\<' gets replaced with '\\<'
$ echo 'blah \< foo bar < blah baz <' | sed -E 's/</\\</g'
blah \\< foo bar \< blah baz \<
$ # by using '\\?<' both '\<' and '<' get replaced with '\<'
$ echo 'blah \< foo bar < blah baz <' | sed -E 's/\\?</\\</g'
blah \< foo bar \< blah baz \<
```

#### Character classes

* The `.` meta character provides a way to match any character
* Character class provides a way to match any character among a specified set of characters enclosed within `[]`

```bash
$ # same as: sed -nE '/lane|late/p'
$ printf 'late\nlane\nfate\nfete\n' | sed -n '/la[nt]e/p'
late
lane

$ printf 'late\nlane\nfate\nfete\n' | sed -n '/[fl]a[nt]e/p'
late
lane
fate

$ # quantifiers can be added similar to using for any other character
$ # filter lines made up entirely of digits, containing at least one digit
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[0123456789]+$/p'
123
42
$ # filter lines made up entirely of digits, containing at least three digits
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[0123456789]{3,}$/p'
123
```

Character ranges

* Matching any alphabet, number, hexadecimal number etc becomes cumbersome if every character has to be individually specified
* So, there's a shortcut, using `-` to construct a range (has to be specified in ascending order)
* See [ascii codes table](https://ascii.cl/) for reference
* Note that behavior of range will depend on locale settings
    * [arch wiki - locale](https://wiki.archlinux.org/index.php/locale)
    * [Linux: Define Locale and Language
Settings](https://www.shellhacks.com/linux-define-locale-language-settings/) ```bash $ # filter lines made up entirely of digits, at least one $ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[0-9]+$/p' 123 42 $ # filter lines made up entirely of lower case alphabets, at least one $ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[a-z]+$/p' foo $ # filter lines made up entirely of lower case alphabets and digits, at least one $ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[a-z0-9]+$/p' cat5 foo 123 42 ``` * Numeric ranges, easy for certain cases but not suitable always. Use `awk` or `perl` for arithmetic computation * See also [Matching Numeric Ranges with a Regular Expression](https://www.regular-expressions.info/numericranges.html) ```bash $ # numbers between 10 to 29 $ printf '23\n154\n12\n26\n98234\n' | sed -n '/^[12][0-9]$/p' 23 12 26 $ # numbers >= 100 $ printf '23\n154\n12\n26\n98234\n' | sed -nE '/^[0-9]{3,}$/p' 154 98234 $ # numbers >= 100 if there are leading zeros $ printf '0501\n035\n154\n12\n26\n98234\n' | sed -nE '/^0*[1-9][0-9]{2,}$/p' 0501 154 98234 ``` Negating character class * Meta characters inside and outside of `[]` are completely different * For example, `^` as first character inside `[]` matches characters other than those specified inside character class ```bash $ # delete zero or more characters before first = $ echo 'foo=bar; baz=123' | sed 's/^[^=]*//' =bar; baz=123 $ # delete zero or more characters after last = $ echo 'foo=bar; baz=123' | sed 's/[^=]*$//' foo=bar; baz= $ # same as: sed -n '/[aeiou]/!p' $ printf 'tryst\nglyph\npity\nwhy\n' | sed -n '/^[^aeiou]*$/p' tryst glyph why ``` Matching meta characters inside `[]` * Characters like `^`, `]`, `-`, etc need special attention to be part of list * Also, sequences like `[.` or `=]` have special meaning within `[]` * See [sed manual - Character-Classes-and-Bracket-Expressions](https://www.gnu.org/software/sed/manual/sed.html#Character-Classes-and-Bracket-Expressions) for complete list ```bash $ # 
to match - it should be first or last character within [] $ printf 'Foo-bar\nabc-456\n42\nCo-operate\n' | sed -nE '/^[a-z-]+$/Ip' Foo-bar Co-operate $ # to match ] it should be first character within [] $ printf 'int foo\nint a[5]\nfoo=bar\n' | sed -n '/[]=]/p' int a[5] foo=bar $ # to match [ use [ anywhere in the character list $ # [][] will match both [ and ] $ printf 'int foo\nint a[5]\nfoo=bar\n' | sed -n '/[[]/p' int a[5] $ # to match ^ it should be other than first in the list $ printf 'c=a^b\nd=f*h+e\nz=x-y\n' | sed -n '/[*^]/p' c=a^b d=f*h+e ``` Named character classes * Equivalent class shown is for C locale and ASCII character encoding * See [ascii codes table](https://ascii.cl/) for reference * See [sed manual - Character Classes and Bracket Expressions](https://www.gnu.org/software/sed/manual/sed.html#Character-Classes-and-Bracket-Expressions) for more details | Character classes | Description | | ------------- | ----------- | | `[:digit:]` | Same as `[0-9]` | | `[:lower:]` | Same as `[a-z]` | | `[:upper:]` | Same as `[A-Z]` | | `[:alpha:]` | Same as `[a-zA-Z]` | | `[:alnum:]` | Same as `[0-9a-zA-Z]` | | `[:xdigit:]` | Same as `[0-9a-fA-F]` | | `[:cntrl:]` | Control characters - first 32 ASCII characters and 127th (DEL) | | `[:punct:]` | All the punctuation characters | | `[:graph:]` | `[:alnum:]` and `[:punct:]` | | `[:print:]` | `[:alnum:]`, `[:punct:]` and space | | `[:blank:]` | Space and tab characters | | `[:space:]` | white-space characters: tab, newline, vertical tab, form feed, carriage return and space | ```bash $ # lines containing only hexadecimal characters $ printf '128\n34\nfe32\nfoo1\nbar\n' | sed -nE '/^[[:xdigit:]]+$/p' 128 34 fe32 $ # lines containing at least one non-hexadecimal character $ printf '128\n34\nfe32\nfoo1\nbar\n' | sed -n '/[^[:xdigit:]]/p' foo1 bar $ # same as: sed -nE '/^[a-z-]+$/Ip' $ printf 'Foo-bar\nabc-456\n42\nCo-operate\n' | sed -nE '/^[[:alpha:]-]+$/p' Foo-bar Co-operate $ # remove all punctuation characters $ 
sed 's/[[:punct:]]//g' poem.txt Roses are red Violets are blue Sugar is sweet And so are you ``` Backslash character classes * Equivalent class shown is for C locale and ASCII character encoding * See [ascii codes table](https://ascii.cl/) for reference * See [sed manual - regular expression extensions](https://www.gnu.org/software/sed/manual/sed.html#regexp-extensions) for more details | Character classes | Description | | ------------- | ----------- | | `\w` | Same as `[0-9a-zA-Z_]` or `[[:alnum:]_]` | | `\W` | Same as `[^0-9a-zA-Z_]` or `[^[:alnum:]_]` | | `\s` | Same as `[[:space:]]` | | `\S` | Same as `[^[:space:]]` | ```bash $ # lines containing only word characters $ printf '123\na=b+c\ncmp_str\nFoo_bar\n' | sed -nE '/^\w+$/p' 123 cmp_str Foo_bar $ # backslash character classes cannot be used inside [] unlike perl $ # \w would simply match w $ echo 'w=y-x+9*3' | sed 's/[\w=]//g' y-x+9*3 $ echo 'w=y-x+9*3' | perl -pe 's/[\w=]//g' -+* ```
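A common use of the named classes from the tables above is removing leading and trailing white-space. This is a minimal sketch, assuming GNU `sed` and GNU `cat`; `trim` is a hypothetical helper name.

```bash
# trim: hypothetical helper, removes white-space at start and end of each line
# g modifier is needed as both alternatives can match on the same line
trim() {
    sed -E 's/^[[:space:]]+|[[:space:]]+$//g'
}

# cat -A marks end of line with $ so that leftover white-space would be visible
printf '  \thello world \t \n' | trim | cat -A
# output: hello world$
```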
#### Escape sequences * Certain ASCII characters like tab, carriage return, newline, etc have escape sequence to represent them * Unlike backslash character classes, these can be used within `[]` as well * Any ASCII character can be also represented using their decimal or octal or hexadecimal value * See [ascii codes table](https://ascii.cl/) for reference * See [sed manual - Escapes](https://www.gnu.org/software/sed/manual/sed.html#Escapes) for more details ```bash $ # example for representing tab character $ printf 'foo\tbar\tbaz\n' foo bar baz $ printf 'foo\tbar\tbaz\n' | sed 's/\t/ /g' foo bar baz $ echo 'a b c' | sed 's/ /\t/g' a b c $ # using escape sequence inside character class $ printf 'a\tb\vc\n' a b c $ printf 'a\tb\vc\n' | cat -vT a^Ib^Kc $ printf 'a\tb\vc\n' | sed 's/[\t\v]/ /g' a b c $ # most common use case for hex escape sequence is to represent single quotes $ # equivalent is '\d039' and '\o047' for decimal and octal respectively $ echo "foo: '34'" foo: '34' $ echo "foo: '34'" | sed 's/\x27/"/g' foo: "34" $ echo 'foo: "34"' | sed 's/"/\x27/g' foo: '34' ```
#### Grouping * Character classes allow matching against a choice of multiple character list and then quantifier added if needed * One of the uses of grouping is analogous to character classes for whole regular expressions, instead of just list of characters * The meta characters `()` are used for grouping * requires `\(\)` for BRE * Similar to maths `ab + ac = a(b+c)`, think of regular expression `a(b|c) = ab|ac` ```bash $ # four letter words with 'on' or 'no' in middle $ printf 'known\nmood\nknow\npony\ninns\n' | sed -nE '/\b[a-z](on|no)[a-z]\b/p' know pony $ # common mistake to use character class, will match 'oo' and 'nn' as well $ printf 'known\nmood\nknow\npony\ninns\n' | sed -nE '/\b[a-z][on]{2}[a-z]\b/p' mood know pony inns $ # quantifier example $ printf 'handed\nhand\nhandy\nhands\nhandle\n' | sed -nE '/^hand([sy]|le)?$/p' hand handy hands handle $ # remove first two columns where : is delimiter $ echo 'foo:123:bar:baz' | sed -E 's/^([^:]+:){2}//' bar:baz $ # can be nested as required $ printf 'spade\nscore\nscare\nspare\nsphere\n' | sed -nE '/^s([cp](he|a)[rd])e$/p' spade scare spare sphere ```
#### Back reference * The matched string within `()` can also be used to be matched again by back referencing the captured groups * `\1` denotes the first matched group, `\2` the second one and so on * Order is leftmost `(` is `\1`, next one is `\2` and so on * Can be used both in *REGEXP* as well as in *REPLACEMENT* sections * `&` or `\0` represents entire matched string in *REPLACEMENT* section * Note that the matched string, not the regular expression itself is referenced * for ex: if `([0-9][a-f])` matches `3b`, then back referencing will be `3b` not any other valid match of the regular expression like `8f`, `0a` etc * As `\` and `&` are special characters in *REPLACEMENT* section, use `\\` and `\&` respectively for literal representation ```bash $ # filter lines with consecutive repeated alphabets $ printf 'eel\nflee\nall\npat\nilk\nseen\n' | sed -nE '/([a-z])\1/p' eel flee all seen $ # reduce \\ to single \ and delete if only single \ $ echo '\[\] and \\w and \[a-zA-Z0-9\_\]' | sed -E 's/(\\?)\\/\1/g' [] and \w and [a-zA-Z0-9_] $ # remove two or more duplicate words separated by space $ # word boundaries prevent false matches like 'the theatre' 'sand and stone' etc $ echo 'a a a walking for for a cause' | sed -E 's/\b(\w+)( \1)+\b/\1/g' a walking for a cause $ # surround only third column with double quotes $ # note the nested capture groups and numbers used in REPLACEMENT section $ echo 'foo:123:bar:baz' | sed -E 's/^(([^:]+:){2})([^:]+)/\1"\3"/' foo:123:"bar":baz $ # add first column data to end of line as well $ echo 'foo:123:bar:baz' | sed -E 's/^([^:]+).*/& \1/' foo:123:bar:baz foo $ # surround entire line with double quotes $ echo 'hello world' | sed 's/.*/"&"/' "hello world" $ # add something at start as well as end of line $ echo 'hello world' | sed 's/.*/Hi. &. Have a nice day/' Hi. hello world. Have a nice day ```
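Back references in the *REPLACEMENT* section also allow captured portions to be rearranged. A small sketch, assuming GNU `sed` and dates written as `YYYY-MM-DD`; `to_dmy` is a hypothetical helper name.

```bash
# to_dmy: hypothetical helper, changes first YYYY-MM-DD in a line to DD/MM/YYYY
# # is used as delimiter so that / in REPLACEMENT section needs no escaping
to_dmy() {
    sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#'
}

echo 'released on 2017-05-25.' | to_dmy
# output: released on 25/05/2017.
```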
#### Changing case * Applies only to *REPLACEMENT* section, unlike `perl` where these can be used in *REGEXP* portion as well * See [sed manual - The s Command](https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command) for more details and corner cases ```bash $ # UPPERCASE all alphabets, will be stopped on \L or \E $ echo 'HeLlO WoRLD' | sed 's/.*/\U&/' HELLO WORLD $ # lowercase all alphabets, will be stopped on \U or \E $ echo 'HeLlO WoRLD' | sed 's/.*/\L&/' hello world $ # Uppercase only next character $ echo 'foo bar' | sed 's/\w*/\u&/g' Foo Bar $ echo 'foo_bar next_line' | sed -E 's/_([a-z])/\u\1/g' fooBar nextLine $ # lowercase only next character $ echo 'FOO BAR' | sed 's/\w*/\l&/g' fOO bAR $ echo 'fooBar nextLine Baz' | sed -E 's/([a-z])([A-Z])/\1_\l\2/g' foo_bar next_line Baz $ # titlecase if input has mixed case $ echo 'HeLlO WoRLD' | sed 's/.*/\L&/; s/\w*/\u&/g' Hello World $ # sed 's/.*/\L\u&/' also works, but not sure if it is defined behavior $ echo 'HeLlO WoRLD' | sed 's/.*/\L&/; s/./\u&/' Hello world $ # \E will stop conversion started by \U or \L $ echo 'foo_bar next_line baz' | sed -E 's/([a-z]+)(_[a-z]+)/\U\1\E\2/g' FOO_bar NEXT_line baz ```
## Substitute command modifiers The `s` command syntax: ``` s/REGEXP/REPLACEMENT/FLAGS ``` * Modifiers (or FLAGS) like `g`, `p` and `I` have been already seen. For completeness, they will be discussed again along with rest of the modifiers * See [sed manual - The s Command](https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command) for more details and corner cases
#### g modifier By default, substitute command will replace only first occurrence of match. `g` modifier is needed to replace all occurrences ```bash $ # replace only first : with - $ echo 'foo:123:bar:baz' | sed 's/:/-/' foo-123:bar:baz $ # replace all : with - $ echo 'foo:123:bar:baz' | sed 's/:/-/g' foo-123-bar-baz ```
#### Replace specific occurrence * A number can be used to specify *N*th match to be replaced ```bash $ # replace first occurrence $ echo 'foo:123:bar:baz' | sed 's/:/-/' foo-123:bar:baz $ echo 'foo:123:bar:baz' | sed -E 's/[^:]+/XYZ/' XYZ:123:bar:baz $ # replace second occurrence $ echo 'foo:123:bar:baz' | sed 's/:/-/2' foo:123-bar:baz $ echo 'foo:123:bar:baz' | sed -E 's/[^:]+/XYZ/2' foo:XYZ:bar:baz $ # replace third occurrence $ echo 'foo:123:bar:baz' | sed 's/:/-/3' foo:123:bar-baz $ echo 'foo:123:bar:baz' | sed -E 's/[^:]+/XYZ/3' foo:123:XYZ:baz $ # choice of quantifier depends on knowing input $ echo ':123:bar:baz' | sed 's/[^:]*/XYZ/2' :XYZ:bar:baz $ echo ':123:bar:baz' | sed -E 's/[^:]+/XYZ/2' :123:XYZ:baz ``` * Replacing *N*th match from end of line when number of matches is unknown * Makes use of greediness of quantifiers ```bash $ # replacing last occurrence $ # can also use sed -E 's/:([^:]*)$/-\1/' $ echo 'foo:123:bar:baz' | sed -E 's/(.*):/\1-/' foo:123:bar-baz $ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):/\1-/' 456:foo:123:bar:789-baz $ echo 'foo and bar and baz land good' | sed -E 's/(.*)and/\1XYZ/' foo and bar and baz lXYZ good $ # use word boundaries as necessary $ echo 'foo and bar and baz land good' | sed -E 's/(.*)\band\b/\1XYZ/' foo and bar XYZ baz land good $ # replacing last but one $ echo 'foo:123:bar:baz' | sed -E 's/(.*):(.*:)/\1-\2/' foo:123-bar:baz $ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):(.*:)/\1-\2/' 456:foo:123:bar-789:baz $ # replacing last but two $ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):((.*:){2})/\1-\2/' 456:foo:123-bar:789:baz $ # replacing last but three $ echo '456:foo:123:bar:789:baz' | sed -E 's/(.*):((.*:){3})/\1-\2/' 456:foo-123:bar:789:baz ``` * Replacing all but first *N* occurrences by combining with `g` modifier ```bash $ # replace all : with - except first two $ echo '456:foo:123:bar:789:baz' | sed -E 's/:/-/3g' 456:foo:123-bar-789-baz $ # replace all : with - except first three $ echo 
'456:foo:123:bar:789:baz' | sed -E 's/:/-/4g' 456:foo:123:bar-789-baz ``` * Replacing multiple *N*th occurrences ```bash $ # replace first two occurrences of : with - $ echo '456:foo:123:bar:789:baz' | sed 's/:/-/; s/:/-/' 456-foo-123:bar:789:baz $ # replace second and third occurrences of : with - $ # note the changes in number to be used for subsequent replacement $ echo '456:foo:123:bar:789:baz' | sed 's/:/-/2; s/:/-/2' 456:foo-123-bar:789:baz $ # better way is to use descending order $ echo '456:foo:123:bar:789:baz' | sed 's/:/-/3; s/:/-/2' 456:foo-123-bar:789:baz $ # replace second, third and fifth occurrences of : with - $ echo '456:foo:123:bar:789:baz' | sed 's/:/-/5; s/:/-/3; s/:/-/2' 456:foo-123-bar:789-baz ```
#### Ignoring case * Either `i` or `I` can be used for replacing in case-insensitive manner * Since only `I` can be used for address filtering (for ex: `sed '/rose/Id' poem.txt`), use `I` for substitute command as well for consistency ```bash $ echo 'hello Hello HELLO HeLlO' | sed 's/hello/hi/g' hi Hello HELLO HeLlO $ echo 'hello Hello HELLO HeLlO' | sed 's/hello/hi/Ig' hi hi hi hi ```
#### p modifier * Usually used in conjunction with `-n` option to output only modified lines ```bash $ # no output if no substitution $ echo 'hi there. have a nice day' | sed -n 's/xyz/XYZ/p' $ # modified line if there is substitution $ echo 'hi there. have a nice day' | sed -n 's/\bh/H/pg' Hi there. Have a nice day $ # only lines containing 'are' $ sed -n 's/are/ARE/p' poem.txt Roses ARE red, Violets ARE blue, And so ARE you. $ # only lines containing 'are' as well as 'so' $ sed -n '/are/ s/so/SO/p' poem.txt And SO are you. ```
#### w modifier * Allows to write only the changes to specified file name instead of default **stdout** ```bash $ # space between w and filename is optional $ # same as: sed -n 's/3/three/p' > 3.txt $ seq 20 | sed -n 's/3/three/w 3.txt' $ cat 3.txt three 1three $ # do not use -n if output should be displayed as well as written to file $ echo '456:foo:123:bar:789:baz' | sed -E 's/(:[^:]*){2}$//w col.txt' 456:foo:123:bar $ cat col.txt 456:foo:123:bar ``` * For multiple output files, use `-e` for each file ```bash $ seq 20 | sed -n -e 's/5/five/w 5.txt' -e 's/7/seven/w 7.txt' $ cat 5.txt five 1five $ cat 7.txt seven 1seven ``` * There are two predefined filenames * `/dev/stdout` to write to **stdout** * `/dev/stderr` to write to **stderr** ```bash $ # inplace editing as well as display changes on terminal $ sed -i 's/three/3/w /dev/stdout' 3.txt 3 13 $ cat 3.txt 3 13 ```
#### e modifier * Allows to use shell command output in *REPLACEMENT* section * Trailing newline from command output is suppressed ```bash $ # replacing a line with output of shell command $ printf 'Date:\nreplace this line\n' Date: replace this line $ printf 'Date:\nreplace this line\n' | sed 's/^replace.*/date/e' Date: Thu May 25 10:19:46 IST 2017 $ # when using p modifier with e, order is important $ printf 'Date:\nreplace this line\n' | sed -n 's/^replace.*/date/ep' Thu May 25 10:19:46 IST 2017 $ printf 'Date:\nreplace this line\n' | sed -n 's/^replace.*/date/pe' date $ # entire modified line is executed as shell command $ echo 'xyz 5' | sed 's/xyz/seq/e' 1 2 3 4 5 ```
#### m modifier * Either `m` or `M` can be used * So far, we've seen only line based operations (newline character being used to distinguish lines) * There are various ways (see [sed manual - How sed Works](https://www.gnu.org/software/sed/manual/sed.html#Execution-Cycle)) by which more than one line is there in pattern space and in such cases `m` modifier can be used * See also [unix.stackexchange - usage of multi-line modifier](https://unix.stackexchange.com/questions/298670/simple-significant-usage-of-m-multi-line-address-suffix) for more examples Before seeing example with `m` modifier, let's see a simple example to get two lines in pattern space ```bash $ # line matching 'blue' and next line in pattern space $ sed -n '/blue/{N;p}' poem.txt Violets are blue, Sugar is sweet, $ # applying substitution, remember that . matches newline as well $ sed -n '/blue/{N;s/are.*is//p}' poem.txt Violets sweet, ``` * When `m` modifier is used, it affects the behavior of `^`, `$` and `.` meta characters ```bash $ # without m modifier, ^ will anchor only beginning of entire pattern space $ sed -n '/blue/{N;s/^/:: /pg}' poem.txt :: Violets are blue, Sugar is sweet, $ # with m modifier, ^ will anchor each individual line within pattern space $ sed -n '/blue/{N;s/^/:: /pgm}' poem.txt :: Violets are blue, :: Sugar is sweet, $ # same applies to $ as well $ sed -n '/blue/{N;s/$/ ::/pg}' poem.txt Violets are blue, Sugar is sweet, :: $ sed -n '/blue/{N;s/$/ ::/pgm}' poem.txt Violets are blue, :: Sugar is sweet, :: $ # with m modifier, . will not match newline character $ sed -n '/blue/{N;s/are.*//p}' poem.txt Violets $ sed -n '/blue/{N;s/are.*//pm}' poem.txt Violets Sugar is sweet, ```
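The anchor behavior described above can also be tried without `poem.txt`. A self-contained sketch, assuming GNU `sed`; the `two_lines` function is made up to supply the input.

```bash
# hypothetical helper supplying a two line input
two_lines() {
    printf 'foo\nbar\n'
}

# N appends the next line, so pattern space holds both lines
# without m, ^ matches only at start of the pattern space
two_lines | sed 'N; s/^/> /g'
# output:
# > foo
# bar

# with m, ^ matches at start of every line in the pattern space
two_lines | sed 'N; s/^/> /gm'
# output:
# > foo
# > bar
```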
## Shell substitutions

* Examples presented here work with the `bash` shell, behavior might differ for other shells
* See also [stackoverflow - Difference between single and double quotes in Bash](https://stackoverflow.com/questions/6697753/difference-between-single-and-double-quotes-in-bash)
* For robust substitutions taking care of meta characters in *REGEXP* and *REPLACEMENT* sections, see
    * [unix.stackexchange - How to ensure that string interpolated into sed substitution escapes all metachars](https://unix.stackexchange.com/questions/129059/how-to-ensure-that-string-interpolated-into-sed-substitution-escapes-all-metac)
    * [unix.stackexchange - What characters do I need to escape when using sed in a sh script?](https://unix.stackexchange.com/questions/32907/what-characters-do-i-need-to-escape-when-using-sed-in-a-sh-script)
    * [stackoverflow - Is it possible to escape regex metacharacters reliably with sed](https://stackoverflow.com/questions/29613304/is-it-possible-to-escape-regex-metacharacters-reliably-with-sed)
#### Variable substitution * Entire command in double quotes can be used for simple use cases ```bash $ word='are' $ sed -n "/$word/p" poem.txt Roses are red, Violets are blue, And so are you. $ replace='ARE' $ sed "s/$word/$replace/g" poem.txt Roses ARE red, Violets ARE blue, Sugar is sweet, And so ARE you. $ # need to use delimiter as suitable $ echo 'home path is:' | sed "s/$/ $HOME/" sed: -e expression #1, char 7: unknown option to `s' $ echo 'home path is:' | sed "s|$| $HOME|" home path is: /home/learnbyexample ``` * If command has characters like `\`, backtick, `!` etc, double quote only the variable ```bash $ # if history expansion is enabled, ! is special $ word='are' $ sed "/$word/!d" poem.txt sed "/$word/date +%A" poem.txt sed: -e expression #1, char 7: extra characters after command $ # so double quote only the variable $ # the command is concatenation of '/' and "$word" and '/!d' $ sed '/'"$word"'/!d' poem.txt Roses are red, Violets are blue, And so are you. ```
#### Command substitution * Much more flexible than using `e` modifier as part of line can be modified as well ```bash $ echo 'today is date' | sed 's/date/'"$(date +%A)"'/' today is Tuesday $ # need to use delimiter as suitable $ echo 'current working dir is: ' | sed 's/$/'"$(pwd)"'/' sed: -e expression #1, char 6: unknown option to `s' $ echo 'current working dir is: ' | sed 's|$|'"$(pwd)"'|' current working dir is: /home/learnbyexample/command_line_text_processing $ # multiline output cannot be substituted in this manner $ echo 'foo' | sed 's/foo/'"$(seq 5)"'/' sed: -e expression #1, char 7: unterminated `s' command ```
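
For single line strings held in variables, the escaping idea from the links listed at the start of this section can be sketched as shown below. This is a minimal sketch, assuming GNU `sed` and `/` as the delimiter. The variable names (`search`, `replace`, `search_esc`, `replace_esc`, `result`) are made up for illustration, and strings containing newlines are not handled.

```bash
# literal text to search for and to replace with, both have meta characters
search='a.b/c'
replace='x&y/z'

# REGEXP section: wrap every character except ^ in a bracket expression,
# then escape ^ with a backslash
search_esc=$(printf '%s\n' "$search" | sed 's/[^^]/[&]/g; s/\^/\\^/g')

# REPLACEMENT section: escape &, / and \ with a backslash
replace_esc=$(printf '%s\n' "$replace" | sed 's/[&/\]/\\&/g')

# only the literal 'a.b/c' is replaced, 'axb/c' is left alone
result=$(echo 'a.b/c axb/c' | sed "s/$search_esc/$replace_esc/")
echo "$result"
# x&y/z axb/c
```

Wrapping each character in its own bracket expression avoids having to list which characters are special in BRE or ERE.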
## z and s command line options * We have already seen a few options like `-n`, `-e`, `-i` and `-E` * This section will cover `-z` and `-s` options * See [sed manual - Command line options](https://www.gnu.org/software/sed/manual/sed.html#Command_002dLine-Options) for other options and more details The `-z` option will cause `sed` to separate input based on ASCII NUL character instead of newlines ```bash $ # useful to process null separated data $ # for ex: output of grep -Z, find -print0, etc $ printf 'teal\0red\nblue\n\0green\n' | sed -nz '/red/p' | cat -A red$ blue$ ^@ $ # also useful to process whole file(not having NUL characters) as a single string $ # adds ; to previous line if current line starts with c $ printf 'cat\ndog\ncoat\ncut\nmat\n' | sed -z 's/\nc/;&/g' cat dog; coat; cut mat ``` The `-s` option will cause `sed` to treat multiple input files separately instead of treating them as single concatenated input. If `-i` is being used, `-s` is implied ```bash $ # without -s, there is only one first line $ # F command prints file name of current file $ sed '1F' f1 f2 f1 I ate three apples I bought two bananas and three mangoes $ # with -s, each file has its own address $ sed -s '1F' f1 f2 f1 I ate three apples f2 I bought two bananas and three mangoes ```
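
As a small sketch of the NUL separated use case, the example below edits a list of made-up file names, similar to what `find -print0` would produce. `tr` is used only to make the result readable.

```bash
# with -z, $ anchors the end of each NUL separated record
out=$(printf 'a.txt\0b.log\0c.txt\0' | sed -z 's/\.txt$/.md/' | tr '\0' '\n')
echo "$out"
# a.md
# b.log
# c.md
```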

## change command The change command `c` will delete line(s) represented by address or address range and replace it with given string **Note** the string used cannot have literal newline character, use escape sequence instead ```bash $ # white-space between c and replacement string is ignored $ seq 3 | sed '2c foo bar' 1 foo bar 3 $ # note how all lines in address range are replaced $ seq 8 | sed '3,7cfoo bar' 1 2 foo bar 8 $ # escape sequences are allowed in string to be replaced $ sed '/red/,/is/chello\nhi there' poem.txt hello hi there And so are you. ``` * command will apply for all matching addresses ```bash $ seq 5 | sed '/[24]/cfoo' 1 foo 3 foo 5 ``` * `\` is special immediately after `c`, see [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for details * If escape sequence is needed at beginning of replacement string, use an additional `\` ```bash $ # \ helps to add leading spaces $ seq 3 | sed '2c a' 1 a 3 $ seq 3 | sed '2c\ a' 1 a 3 $ seq 3 | sed '2c\tgood day' 1 tgood day 3 $ seq 3 | sed '2c\\tgood day' 1 good day 3 ``` * Since `;` cannot be used to distinguish between string and end of command, use `-e` for multiple commands ```bash $ sed -e '/are/cHi;s/is/IS/' poem.txt Hi;s/is/IS/ Hi;s/is/IS/ Sugar is sweet, Hi;s/is/IS/ $ sed -e '/are/cHi' -e 's/is/IS/' poem.txt Hi Hi Sugar IS sweet, Hi ``` * Using shell substitution ```bash $ text='good day' $ seq 3 | sed '2c'"$text" 1 good day 3 $ text='good day\nfoo bar' $ seq 3 | sed '2c'"$text" 1 good day foo bar 3 $ seq 3 | sed '2c'"$(date +%A)" 1 Thursday 3 $ # multiline command output will lead to error $ seq 3 | sed '2c'"$(seq 2)" sed: -e expression #1, char 5: missing command ```
## insert command The insert command allows to add string before a line matching given address **Note** the string used cannot have literal newline character, use escape sequence instead ```bash $ # white-space between i and string is ignored $ # same as: sed '2s/^/hello\n/' $ seq 3 | sed '2i hello' 1 hello 2 3 $ # escape sequences can be used $ seq 3 | sed '2ihello\nhi' 1 hello hi 2 3 ``` * command will apply for all matching addresses ```bash $ seq 5 | sed '/[24]/ifoo' 1 foo 2 3 foo 4 5 ``` * `\` is special immediately after `i`, see [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for details * If escape sequence is needed at beginning of replacement string, use an additional `\` ```bash $ seq 3 | sed '2i foo' 1 foo 2 3 $ seq 3 | sed '2i\ foo' 1 foo 2 3 $ seq 3 | sed '2i\tbar' 1 tbar 2 3 $ seq 3 | sed '2i\\tbar' 1 bar 2 3 ``` * Since `;` cannot be used to distinguish between string and end of command, use `-e` for multiple commands ```bash $ sed -e '/is/ifoobar;s/are/ARE/' poem.txt Roses are red, Violets are blue, foobar;s/are/ARE/ Sugar is sweet, And so are you. $ sed -e '/is/ifoobar' -e 's/are/ARE/' poem.txt Roses ARE red, Violets ARE blue, foobar Sugar is sweet, And so ARE you. ``` * Using shell substitution ```bash $ text='good day' $ seq 3 | sed '2i'"$text" 1 good day 2 3 $ text='good day\nfoo bar' $ seq 3 | sed '2i'"$text" 1 good day foo bar 2 3 $ seq 3 | sed '2iToday is '"$(date +%A)" 1 Today is Thursday 2 3 $ # multiline command output will lead to error $ seq 3 | sed '2i'"$(seq 2)" sed: -e expression #1, char 5: missing command ```
## append command The append command allows to add string after a line matching given address **Note** the string used cannot have literal newline character, use escape sequence instead ```bash $ # white-space between a and string is ignored $ # same as: sed '2s/$/\nhello/' $ seq 3 | sed '2a hello' 1 2 hello 3 $ # escape sequences can be used $ seq 3 | sed '2ahello\nhi' 1 2 hello hi 3 ``` * command will apply for all matching addresses ```bash $ seq 5 | sed '/[24]/afoo' 1 2 foo 3 4 foo 5 ``` * `\` is special immediately after `a`, see [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for details * If escape sequence is needed at beginning of replacement string, use an additional `\` ```bash $ seq 3 | sed '2a foo' 1 2 foo 3 $ seq 3 | sed '2a\ foo' 1 2 foo 3 $ seq 3 | sed '2a\tbar' 1 2 tbar 3 $ seq 3 | sed '2a\\tbar' 1 2 bar 3 ``` * Since `;` cannot be used to distinguish between string and end of command, use `-e` for multiple commands ```bash $ sed -e '/is/afoobar;s/are/ARE/' poem.txt Roses are red, Violets are blue, Sugar is sweet, foobar;s/are/ARE/ And so are you. $ sed -e '/is/afoobar' -e 's/are/ARE/' poem.txt Roses ARE red, Violets ARE blue, Sugar is sweet, foobar And so ARE you. ``` * Using shell substitution ```bash $ text='good day' $ seq 3 | sed '2a'"$text" 1 2 good day 3 $ text='good day\nfoo bar' $ seq 3 | sed '2a'"$text" 1 2 good day foo bar 3 $ seq 3 | sed '2aToday is '"$(date +%A)" 1 2 Today is Thursday 3 $ # multiline command output will lead to error $ seq 3 | sed '2a'"$(seq 2)" sed: -e expression #1, char 5: missing command ``` * See also [stackoverflow - add newline character if last line of input doesn't have one](https://stackoverflow.com/questions/41343062/what-does-this-mean-in-linux-sed-a-a-txt)
## adding contents of file
#### r for entire file * The `r` command allows to add contents of file after a line matching given address * It is a robust way to add multiline content or if content can have characters that may be interpreted * Special name `/dev/stdin` allows to read from **stdin** instead of file input * First, a simple example to add contents of one file into another at specified address ```bash $ cat 5.txt five 1five $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # space between r and filename is optional $ sed '2r 5.txt' poem.txt Roses are red, Violets are blue, five 1five Sugar is sweet, And so are you. $ # content cannot be added before first line $ sed '0r 5.txt' poem.txt sed: -e expression #1, char 2: invalid usage of line address 0 $ # but that is trivial to solve: cat 5.txt poem.txt ``` * command will apply for all matching addresses ```bash $ seq 5 | sed '/[24]/r 5.txt' 1 2 five 1five 3 4 five 1five 5 ``` * adding content of variable as it is without any interpretation * also shows example for using `/dev/stdin` ```bash $ text='Good day\nfoo bar baz\n' $ # escape sequence like \n will be interpreted when 'a' command is used $ sed '/is/a'"$text" poem.txt Roses are red, Violets are blue, Sugar is sweet, Good day foo bar baz And so are you. $ # \ is just another character, won't be treated as special with 'r' command $ echo "$text" | sed '/is/r /dev/stdin' poem.txt Roses are red, Violets are blue, Sugar is sweet, Good day\nfoo bar baz\n And so are you. ``` * adding multiline command output is simple as well ```bash $ seq 3 | sed '/is/r /dev/stdin' poem.txt Roses are red, Violets are blue, Sugar is sweet, 1 2 3 And so are you. 
``` * replacing a line or range of lines with contents of file * See also [unix.stackexchange - various ways to replace line M in file1 with line N in file2](https://unix.stackexchange.com/a/396450) ```bash $ # replacing range of lines $ # order is important, first 'r' and then 'd' $ sed -e '/is/r 5.txt' -e '1,/is/d' poem.txt five 1five And so are you. $ # replacing a line $ seq 3 | sed -e '3r /dev/stdin' -e '3d' poem.txt Roses are red, Violets are blue, 1 2 3 And so are you. $ # can also use {} grouping to avoid repeating the address $ seq 3 | sed -e '/blue/{r /dev/stdin' -e 'd}' poem.txt Roses are red, 1 2 3 Sugar is sweet, And so are you. ```
#### R for line by line * add a line for every address match * Special name `/dev/stdin` allows to read from **stdin** instead of file input ```bash $ # space between R and filename is optional $ seq 3 | sed '/are/R /dev/stdin' poem.txt Roses are red, 1 Violets are blue, 2 Sugar is sweet, And so are you. 3 $ # to replace matching line $ seq 3 | sed -e '/are/{R /dev/stdin' -e 'd}' poem.txt 1 2 Sugar is sweet, 3 $ sed '2,3R 5.txt' poem.txt Roses are red, Violets are blue, five Sugar is sweet, 1five And so are you. ``` * number of lines from file to be read different from number of matching address lines ```bash $ # file has more lines than matching address $ # 2 lines in 5.txt but only 1 line matching 'is' $ sed '/is/R 5.txt' poem.txt Roses are red, Violets are blue, Sugar is sweet, five And so are you. $ # lines matching address is more than file to be read $ # 3 lines matching 'are' but only 2 lines from stdin $ seq 2 | sed '/are/R /dev/stdin' poem.txt Roses are red, 1 Violets are blue, 2 Sugar is sweet, And so are you. ```
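
When no address is given, `R` adds one line after every input line, which interleaves two files. A minimal sketch, assuming GNU `sed`; the files `f1` and `f2` are created in a temporary directory just for this illustration.

```bash
# work in a temporary directory so that no existing files are touched
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/f1"
printf '1\n2\n3\n' > "$tmpdir/f2"

# a line from f2 is added after every line of f1
out=$(sed "R $tmpdir/f2" "$tmpdir/f1")
echo "$out"
# a
# 1
# b
# 2
# c
# 3

rm -r "$tmpdir"
```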
## n and N commands * These two commands will fetch next line (newline or NUL character separated, depending on options) Quoting from [sed manual - common commands](https://www.gnu.org/software/sed/manual/sed.html#Common-Commands) for `n` command >If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with the next line of input. If there is no more input then sed exits without processing any more commands. ```bash $ # if line contains 'blue', replace 'e' with 'E' only for following line $ sed '/blue/{n;s/e/E/g}' poem.txt Roses are red, Violets are blue, Sugar is swEEt, And so are you. $ # better illustrated with -n option $ sed -n '/blue/{n;s/e/E/pg}' poem.txt Sugar is swEEt, $ # if line contains 'blue', replace 'e' with 'E' only for next to next line $ sed -n '/blue/{n;n;s/e/E/pg}' poem.txt And so arE you. ``` Quoting from [sed manual - other commands](https://www.gnu.org/software/sed/manual/sed.html#Other-Commands) for `N` command >Add a newline to the pattern space, then append the next line of input to the pattern space. If there is no more input then sed exits without processing any more commands >When -z is used, a zero byte (the ascii ‘NUL’ character) is added between the lines (instead of a new line) * See also [stackoverflow - apply substitution every 4 lines but excluding the 4th line](https://stackoverflow.com/questions/40229578/how-to-insert-a-line-feed-into-a-sed-line-concatenation) ```bash $ # if line contains 'blue', replace 'e' with 'E' both in current line and next $ sed '/blue/{N;s/e/E/g}' poem.txt Roses are red, ViolEts arE bluE, Sugar is swEEt, And so are you. $ # better illustrated with -n option $ sed -n '/blue/{N;s/e/E/pg}' poem.txt ViolEts arE bluE, Sugar is swEEt, $ sed -n '/blue/{N;N;s/e/E/pg}' poem.txt ViolEts arE bluE, Sugar is swEEt, And so arE you. 
``` * Combination ```bash $ # n will fetch next line, current line is out of pattern space $ # N will then add another line $ sed -n '/blue/{n;N;s/e/E/pg}' poem.txt Sugar is swEEt, And so arE you. ``` * not necessary to qualify with an address ```bash $ seq 6 | sed 'n;cXYZ' 1 XYZ 3 XYZ 5 XYZ $ seq 6 | sed 'N;s/\n/ /' 1 2 3 4 5 6 ```
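
The quoted text says `sed` exits when `N` finds no more input. The sketch below shows what that means for an odd number of lines, assuming GNU `sed` without `POSIXLY_CORRECT` set. The `$!N` form avoids depending on that behavior.

```bash
# N has no next line to fetch when pattern space holds '5'
# GNU sed prints the pattern space before exiting in this case
out=$(seq 5 | sed 'N;s/\n/ /')
echo "$out"
# 1 2
# 3 4
# 5

# $!N applies N to all but the last line, so N never runs out of input
out_portable=$(seq 5 | sed '$!N;s/\n/ /')
echo "$out_portable"
# 1 2
# 3 4
# 5
```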
## Control structures

* Using `:label` one can mark a command location to branch to conditionally or unconditionally
* See [sed manual - Commands for sed gurus](https://www.gnu.org/software/sed/manual/sed.html#Programming-Commands) for more details
#### if then else * Simple if-then-else can be simulated using `b` command * `b` command will unconditionally branch to specified label * Without label, `b` will skip rest of commands and start next cycle * See [unix.stackexchange - processing only lines between REGEXPs](https://unix.stackexchange.com/questions/292819/remove-commented-lines-except-one-comment-using-sed) for interesting use case ```bash $ # changing -ve to +ve and vice versa $ cat nums.txt 42 -2 10101 -3.14 -75 $ # same as: perl -pe '/^-/ ? s/// : s/^/-/' $ # empty REGEXP section will reuse previous REGEXP, in this case /^-/ $ sed '/^-/{s///;b}; s/^/-/' nums.txt -42 2 -10101 3.14 75 $ # same as: perl -pe '/are/ ? s/e/*/g : s/e/#/g' $ # if line contains 'are' replace 'e' with '*' else replace 'e' with '#' $ sed '/are/{s/e/*/g;b}; s/e/#/g' poem.txt Ros*s ar* r*d, Viol*ts ar* blu*, Sugar is sw##t, And so ar* you. ```
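
The same idea extends to an if, else-if, else chain by ending each branch with `b`. A minimal sketch with made-up input, written as a multiline script so that each `b` sits on its own line.

```bash
# if line starts with '-', prefix it with 'negative: '
# else if line is exactly 0, prefix it with 'zero: '
# else prefix it with 'positive: '
out=$(printf '42\n-2\n0\n' | sed '
/^-/ {
    s/^/negative: /
    b
}
/^0$/ {
    s/^/zero: /
    b
}
s/^/positive: /
')
echo "$out"
# positive: 42
# negative: -2
# zero: 0
```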
#### replacing in specific column * `t` command will branch to specified label on successful substitution * Without label, `t` will skip rest of commands and start next cycle * More examples * [stackoverflow - replace data after last delimiter](https://stackoverflow.com/questions/39907133/replace-data-after-last-delimiter-of-every-line-using-sed-or-awk/39908523#39908523) * [stackoverflow - replace multiple occurrences in specific column](https://stackoverflow.com/questions/42886531/replace-mutliple-occurances-in-delimited-columns/42886919#42886919) ```bash $ # replace space with underscore only in 3rd column $ # ^(([^|]+\|){2} captures first two columns $ # [^|]* zero or more non-column separator characters $ # as long as match is found, command will be repeated on same input line $ echo 'foo bar|a b c|1 2 3|xyz abc' | sed -E ':a s/^(([^|]+\|){2}[^|]*) /\1_/; ta' foo bar|a b c|1_2_3|xyz abc $ # use awk/perl for simpler syntax $ # for ex: awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"_",$3); print}' ``` * example to show difference between `b` and `t` ```bash $ # whether or not 'R' is found on lines containing 'are', branch will happen $ sed '/are/{s/R/*/g;b}; s/e/#/g' poem.txt *oses are red, Violets are blue, Sugar is sw##t, And so are you. $ # branch only if line contains 'are' and substitution of 'R' succeeds $ sed '/are/{s/R/*/g;t}; s/e/#/g' poem.txt *oses are red, Viol#ts ar# blu#, Sugar is sw##t, And so ar# you. ```
#### overlapping substitutions * `t` command looping with label comes in handy for overlapping substitutions as well * Note that in general this method will work recursively, see [stackoverflow - substitute recursively](https://stackoverflow.com/questions/9983646/sed-substitute-recursively) for example ```bash $ # consider the problem of replacing empty columns with something $ # case1: no consecutive empty columns - no problem $ echo 'foo::bar::baz' | sed 's/::/:0:/g' foo:0:bar:0:baz $ # case2: consecutive empty columns are present - problematic $ echo 'foo:::bar::baz' | sed 's/::/:0:/g' foo:0::bar:0:baz $ # t command looping will handle both cases $ echo 'foo::bar::baz' | sed ':a s/::/:0:/; ta' foo:0:bar:0:baz $ echo 'foo:::bar::baz' | sed ':a s/::/:0:/; ta' foo:0:0:bar:0:baz ```
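
Another case where the `t` loop helps is when each substitution has to work on the result of the previous one. The sketch below adds thousands separators to a plain run of digits at the start of the line. It is an illustration only and does not handle signs or decimal points.

```bash
# each pass adds one comma before the last three digits of the leading digits
# loop stops when the substitution fails
out=$(echo '1234567' | sed -E ':a s/^([0-9]+)([0-9]{3})/\1,\2/; ta')
echo "$out"
# 1,234,567
```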
## Lines between two REGEXPs

* Simple cases were seen in [address range](#address-range) section
* This section will deal with more cases and some corner cases
#### Include or Exclude matching REGEXPs

Consider the sample input file, for simplicity the two REGEXPs are **BEGIN** and **END** strings instead of regular expressions

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

First, lines between the two *REGEXP*s are to be printed

* Case 1: both starting and ending *REGEXP* part of output

```bash
$ sed -n '/BEGIN/,/END/p' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END
```

* Case 2: both starting and ending *REGEXP* not part of output

```bash
$ # remember that empty REGEXP section will reuse previously matched REGEXP
$ sed -n '/BEGIN/,/END/{//!p}' range.txt
1234
6789
a
b
c
```

* Case 3: only starting *REGEXP* part of output

```bash
$ sed -n '/BEGIN/,/END/{/END/!p}' range.txt
BEGIN
1234
6789
BEGIN
a
b
c
```

* Case 4: only ending *REGEXP* part of output

```bash
$ sed -n '/BEGIN/,/END/{/BEGIN/!p}' range.txt
1234
6789
END
a
b
c
END
```

Second, lines between the two *REGEXP*s are to be deleted

* Case 5: both starting and ending *REGEXP* not part of output

```bash
$ sed '/BEGIN/,/END/d' range.txt
foo
bar
baz
```

* Case 6: both starting and ending *REGEXP* part of output

```bash
$ # remember that empty REGEXP section will reuse previously matched REGEXP
$ sed '/BEGIN/,/END/{//!d}' range.txt
foo
BEGIN
END
bar
BEGIN
END
baz
```

* Case 7: only starting *REGEXP* part of output

```bash
$ sed '/BEGIN/,/END/{/BEGIN/!d}' range.txt
foo
BEGIN
bar
BEGIN
baz
```

* Case 8: only ending *REGEXP* part of output

```bash
$ sed '/BEGIN/,/END/{/END/!d}' range.txt
foo
END
bar
END
baz
```
#### First or Last block * Getting first block is very simple by using `q` command ```bash $ sed -n '/BEGIN/,/END/{p;/END/q}' range.txt BEGIN 1234 6789 END $ # use other tricks discussed in previous section as needed $ sed -n '/BEGIN/,/END/{//!p;/END/q}' range.txt 1234 6789 ``` * To get last block, reverse the input linewise, the order of *REGEXP*s and finally reverse again ```bash $ tac range.txt | sed -n '/END/,/BEGIN/{p;/BEGIN/q}' | tac BEGIN a b c END $ # use other tricks discussed in previous section as needed $ tac range.txt | sed -n '/END/,/BEGIN/{//!p;/BEGIN/q}' | tac a b c ``` * To get a specific block, say 3rd one, `awk` or `perl` would be a better choice * See [Specific blocks](./gnu_awk.md#specific-blocks) for `awk` examples
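
To give an idea of why `awk` is suggested for picking a specific block, here is one possible sketch that prints only the 2nd block. It is not necessarily the approach used in the linked `awk` chapter, and the variable names `f` and `n` are arbitrary.

```bash
# same content as range.txt, created here to keep the example standalone
input=$(printf '%s\n' foo BEGIN 1234 6789 END bar BEGIN a b c END baz)

# n counts the blocks seen so far, f is set while inside a block
# lines are printed only when inside the 2nd block
out=$(printf '%s\n' "$input" | awk '/BEGIN/{f=1; n++} f && n==2; /END/{f=0}')
echo "$out"
# BEGIN
# a
# b
# c
# END
```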
#### Broken blocks

* If there are blocks with ending *REGEXP* but without corresponding starting *REGEXP*, `sed -n '/BEGIN/,/END/p'` will suffice
* Consider the modified input file where final starting *REGEXP* doesn't have corresponding ending

```bash
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz
```

* All lines till end of file get printed with simple use of `sed -n '/BEGIN/,/END/p'`
* The file reversing trick comes in handy here as well
* But if both kinds of broken blocks are present, further processing will be required. Better to use `awk` or `perl` in such cases
    * See [Broken blocks](./gnu_awk.md#broken-blocks) for `awk` examples

```bash
$ sed -n '/BEGIN/,/END/p' broken_range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
baz

$ tac broken_range.txt | sed -n '/END/,/BEGIN/p' | tac
BEGIN
1234
6789
END
```

* If there are multiple starting *REGEXP* but single ending *REGEXP*, the reversing trick comes in handy again

```bash
$ cat uneven_range.txt
foo
BEGIN
1234
BEGIN
42
6789
END
bar
BEGIN
a
BEGIN
b
BEGIN
c
BEGIN
d
BEGIN
e
END
baz

$ tac uneven_range.txt | sed -n '/END/,/BEGIN/p' | tac
BEGIN
42
6789
END
BEGIN
e
END
```
## sed scripts * `sed` commands can be placed in a file and called using `-f` option or directly executed using [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) * See [sed manual - Some Sample Scripts](https://www.gnu.org/software/sed/manual/sed.html#Examples) for more examples * See [sed manual - Often-Used Commands](https://www.gnu.org/software/sed/manual/sed.html#Common-Commands) for more details on using comments ```bash $ cat script.sed # each line is a command /is/cfoo bar /you/r 3.txt /you/d # single quotes can be used freely s/are/'are'/g $ sed -f script.sed poem.txt Roses 'are' red, Violets 'are' blue, foo bar 3 13 $ # command line options are specified as usual $ sed -nf script.sed poem.txt foo bar 3 13 ``` * command line options can be specified along with shebang as well as added at time of invocation * See also [stackoverflow - usage of options along with shebang depends on lot of factors](https://stackoverflow.com/questions/4303128/how-to-use-multiple-arguments-with-a-shebang-i-e) ```bash $ type sed sed is /bin/sed $ cat executable.sed #!/bin/sed -f /is/cfoo bar /you/r 3.txt /you/d s/are/'are'/g $ chmod +x executable.sed $ ./executable.sed poem.txt Roses 'are' red, Violets 'are' blue, foo bar 3 13 $ ./executable.sed -n poem.txt foo bar 3 13 ```
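
One detail about comments in script files: GNU `sed` treats `#n` as the `-n` option when those are the first two characters of the script file and a newline follows immediately. A minimal sketch, where the script name `are.sed` and the temporary directory are made up for the illustration.

```bash
tmpdir=$(mktemp -d)

# first line '#n' acts like the -n option, second line is the actual command
printf '%s\n' '#n' '/are/p' > "$tmpdir/are.sed"

# only the line matching 'are' is printed, even though -n is not passed
out=$(printf 'Roses are red,\nSugar is sweet,\n' | sed -f "$tmpdir/are.sed")
echo "$out"
# Roses are red,

rm -r "$tmpdir"
```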
## Gotchas and Tips * dos style line endings ```bash $ # no issue with unix style line ending $ printf 'foo bar\n123 789\n' | sed -E 's/\w+$/xyz/' foo xyz 123 xyz $ # dos style line ending causes trouble $ printf 'foo bar\r\n123 789\r\n' | sed -E 's/\w+$/xyz/' foo bar 123 789 $ # can be corrected by adding \r as well to match $ # if needed, add \r in replacement section as well $ printf 'foo bar\r\n123 789\r\n' | sed -E 's/\w+\r$/xyz/' foo xyz 123 xyz ``` * changing dos to unix style line ending and vice versa ```bash $ # bash functions $ unix2dos() { sed -i 's/$/\r/' "$@" ; } $ dos2unix() { sed -i 's/\r$//' "$@" ; } $ cat -A 5.txt five$ 1five$ $ unix2dos 5.txt $ cat -A 5.txt five^M$ 1five^M$ $ dos2unix 5.txt $ cat -A 5.txt five$ 1five$ ``` * variable/command substitution * See also [stackoverflow - Is it possible to escape regex metacharacters reliably with sed](https://stackoverflow.com/questions/29613304/is-it-possible-to-escape-regex-metacharacters-reliably-with-sed) ```bash $ # variables don't get expanded within single quotes $ printf 'user\nhome\n' | sed '/user/ s/$/: $USER/' user: $USER home $ printf 'user\nhome\n' | sed '/user/ s/$/: '"$USER"'/' user: learnbyexample home $ # variable being substituted cannot have the delimiter character $ printf 'user\nhome\n' | sed '/home/ s/$/: '"$HOME"'/' sed: -e expression #1, char 15: unknown option to `s' $ printf 'user\nhome\n' | sed '/home/ s#$#: '"$HOME"'#' user home: /home/learnbyexample $ # use r command for robust insertion from file/command-output $ sed '1a'"$(seq 2)" 5.txt sed: -e expression #1, char 5: missing command $ seq 2 | sed '1r /dev/stdin' 5.txt five 1 2 1five ``` * common regular expression mistakes #1 - greediness ```bash $ s='foo and bar and baz land good' $ echo "$s" | sed 's/foo.*ba/123 789/' 123 789z land good $ # use a more restrictive version $ echo "$s" | sed -E 's/foo \w+ ba/123 789/' 123 789r and baz land good $ # or use a tool with non-greedy feature available $ echo "$s" | perl -pe 
's/foo.*?ba/123 789/' 123 789r and baz land good $ # for single characters, use negated character class $ echo 'foo=123,baz=789,xyz=42' | sed 's/foo=.*,//' xyz=42 $ echo 'foo=123,baz=789,xyz=42' | sed 's/foo=[^,]*,//' baz=789,xyz=42 ``` * common regular expression mistakes #2 - BRE vs ERE syntax ```bash $ # + needs to be escaped with BRE or enable ERE $ echo 'like 42 and 37' | sed 's/[0-9]+/xxx/g' like 42 and 37 $ echo 'like 42 and 37' | sed -E 's/[0-9]+/xxx/g' like xxx and xxx $ # or escaping when not required $ echo 'get {} and let' | sed 's/\{\}/[]/' sed: -e expression #1, char 10: Invalid preceding regular expression $ echo 'get {} and let' | sed 's/{}/[]/' get [] and let ``` * common regular expression mistakes #3 - using PCRE syntax/features * especially by trying out solution on online sites like [regex101](https://regex101.com/) and expecting it to work with `sed` as well ```bash $ # \d is not available as backslash character class, will match 'd' instead $ echo 'like 42 and 37' | sed -E 's/\d+/xxx/g' like 42 anxxx 37 $ echo 'like 42 and 37' | sed -E 's/[0-9]+/xxx/g' like xxx and xxx $ # features like lookarounds/non-greedy/etc not available $ echo 'foo,baz,,xyz,,,123' | sed -E 's/,\K(?=,)/NaN/g' sed: -e expression #1, char 16: Invalid preceding regular expression $ echo 'foo,baz,,xyz,,,123' | perl -pe 's/,\K(?=,)/NaN/g' foo,baz,NaN,xyz,NaN,NaN,123 ``` * common regular expression mistakes #4 - end of line white-space ```bash $ printf 'foo bar \n123 789\t\n' | sed -E 's/\w+$/xyz/' foo bar 123 789 $ printf 'foo bar \n123 789\t\n' | sed -E 's/\w+\s*$/xyz/' foo xyz 123 xyz ``` * and many more... see also * [unix.stackexchange - Why does my regular expression work in X but not in Y?](https://unix.stackexchange.com/questions/119905/why-does-my-regular-expression-work-in-x-but-not-in-y) * [stackoverflow - Greedy vs. Reluctant vs. 
Possessive Quantifiers](https://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers) * [stackoverflow - How to replace everything between but only until the first occurrence of the end string?](https://stackoverflow.com/questions/45168607/how-to-replace-everything-between-but-only-until-the-first-occurrence-of-the-end) * [stackoverflow - How to match a specified pattern with multiple possibilities](https://stackoverflow.com/questions/43650926/how-to-match-a-specified-pattern-with-multiple-possibilities) * [stackoverflow - mixing different regex syntax](https://stackoverflow.com/questions/45389684/cant-comment-a-line-in-my-cnf/45389833#45389833) * [sed manual - BRE-vs-ERE](https://www.gnu.org/software/sed/manual/sed.html#BRE-vs-ERE) * Speed boost for ASCII encoded input ```bash $ time sed -nE '/^([a-d][r-z]){3}$/p' /usr/share/dict/words avatar awards cravat real 0m0.058s $ time LC_ALL=C sed -nE '/^([a-d][r-z]){3}$/p' /usr/share/dict/words avatar awards cravat real 0m0.038s $ time sed -nE '/^([a-z]..)\1$/p' /usr/share/dict/words > /dev/null real 0m0.111s $ time LC_ALL=C sed -nE '/^([a-z]..)\1$/p' /usr/share/dict/words > /dev/null real 0m0.073s ```
## Further Reading * Manual and related * `man sed` and `info sed` for more details, known issues/limitations as well as options/commands not covered in this tutorial * [GNU sed manual](https://www.gnu.org/software/sed/manual/sed.html) has even more detailed information and examples * [sed FAQ](http://sed.sourceforge.net/sedfaq.html), last modified '10 March 2003' * [stackoverflow - BSD/macOS Sed vs GNU Sed vs the POSIX Sed specification](https://stackoverflow.com/questions/24275070/sed-not-giving-me-correct-substitute-operation-for-newline-with-mac-difference/24276470#24276470) * [unix.stackexchange - Differences between sed on Mac OSX and other standard sed](https://unix.stackexchange.com/questions/13711/differences-between-sed-on-mac-osx-and-other-standard-sed) * This chapter has also been [converted to a book](https://github.com/learnbyexample/learn_gnused) with additional description, examples and exercises. * Tutorials and Q&A * [sed basics](https://code.snipcademy.com/tutorials/shell-scripting/sed/introduction) * [sed detailed tutorial](https://www.grymoire.com/Unix/Sed.html) - has details on differences between various `sed` versions as well * [sed one-liners explained](https://catonmat.net/sed-one-liners-explained-part-one) * [cheat sheet](https://catonmat.net/ftp/sed.stream.editor.cheat.sheet.txt) * [unix.stackexchange - common search and replace examples](https://unix.stackexchange.com/questions/112023/how-can-i-replace-a-string-in-a-files) * [sed Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/sed?sort=votes&pageSize=15) * [sed Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/sed?sort=votes&pageSize=15) * Selected examples - portable solutions, commands not covered in this tutorial, same problem solved using different tools, etc * [unix.stackexchange - replace multiline string](https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string) * [stackoverflow - deleting empty 
lines with optional white spaces](https://stackoverflow.com/questions/16414410/delete-empty-lines-using-sed) * [unix.stackexchange - print only line above the matching line](https://unix.stackexchange.com/questions/264489/find-each-line-matching-a-pattern-but-print-only-the-line-above-it) * [stackoverflow - How to select lines between two patterns?](https://stackoverflow.com/questions/38972736/how-to-select-lines-between-two-patterns) * [stackoverflow - get lines between two patterns only if there is third pattern between them](https://stackoverflow.com/questions/39960075/bash-how-to-get-lines-between-patterns-only-if-there-is-pattern2-between-them) * [unix.stackexchange - similar example](https://unix.stackexchange.com/questions/228699/sed-print-lines-matched-by-a-pattern-range-if-one-line-matches-a-condition) * Learn Regular Expressions (has information on flavors other than BRE/ERE too) * [Regular Expressions Tutorial](https://www.regular-expressions.info/tutorial.html) * [regexcrossword](https://regexcrossword.com/) * [stackoverflow - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) * Related tools * [rpl](https://unix.stackexchange.com/questions/112023/how-can-i-replace-a-string-in-a-files/251742#251742) - search and replace tool, has interesting options like interactive mode and recursive mode * [sd](https://github.com/chmln/sd) - simple search and replace, implemented in Rust * [sedsed](https://github.com/aureliojargas/sedsed) - Debugger, indenter and HTMLizer for sed scripts * [xo](https://github.com/ezekg/xo) - composes regular expression match groups * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed) ================================================ FILE: miscellaneous.md ================================================ # Miscellaneous **Table of Contents** * [cut](#cut) * [select specific 
fields](#select-specific-fields) * [suppressing lines without delimiter](#suppressing-lines-without-delimiter) * [specifying delimiters](#specifying-delimiters) * [complement](#complement) * [select specific characters](#select-specific-characters) * [Further reading for cut](#further-reading-for-cut) * [tr](#tr) * [translation](#translation) * [escape sequences and character classes](#escape-sequences-and-character-classes) * [deletion](#deletion) * [squeeze](#squeeze) * [Further reading for tr](#further-reading-for-tr) * [basename](#basename) * [dirname](#dirname) * [xargs](#xargs) * [seq](#seq) * [integer sequences](#integer-sequences) * [specifying separator](#specifying-separator) * [floating point sequences](#floating-point-sequences) * [Further reading for seq](#further-reading-for-seq)
## cut ```bash $ cut --version | head -n1 cut (GNU coreutils) 8.25 $ man cut CUT(1) User Commands CUT(1) NAME cut - remove sections from each line of files SYNOPSIS cut OPTION... [FILE]... DESCRIPTION Print selected parts of lines from each FILE to standard output. With no FILE, or when FILE is -, read standard input. ... ```
#### select specific fields

* Default delimiter is the **tab** character
* `-f` option prints the specified field(s) from each input line

```bash
$ printf 'foo\tbar\t123\tbaz\n'
foo     bar     123     baz

$ # single field
$ printf 'foo\tbar\t123\tbaz\n' | cut -f2
bar

$ # multiple fields can be specified by using ,
$ printf 'foo\tbar\t123\tbaz\n' | cut -f2,4
bar     baz

$ # output is always in ascending order of field numbers
$ printf 'foo\tbar\t123\tbaz\n' | cut -f3,1
foo     123

$ # range can be specified using -
$ printf 'foo\tbar\t123\tbaz\n' | cut -f1-3
foo     bar     123

$ # if ending number is omitted, select till last field
$ printf 'foo\tbar\t123\tbaz\n' | cut -f3-
123     baz
```
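
* Since `cut` always prints fields in input order, reordering needs another tool
* A minimal sketch, assuming `awk` is available; the sample input is the same as in the examples above

```bash
# cut ignores the order given to -f: both commands print foo<TAB>123
printf 'foo\tbar\t123\tbaz\n' | cut -f3,1
printf 'foo\tbar\t123\tbaz\n' | cut -f1,3

# awk prints fields in the order requested
# FS and OFS are set to tab so that the delimiter matches the default of cut
printf 'foo\tbar\t123\tbaz\n' | awk 'BEGIN{FS=OFS="\t"} {print $3, $1}'
# prints 123<TAB>foo
```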
#### suppressing lines without delimiter ```bash $ cat marks.txt jan 2017 foobar 12 45 23 feb 2017 foobar 18 38 19 $ # by default lines without delimiter will be printed $ cut -f2- marks.txt jan 2017 12 45 23 feb 2017 18 38 19 $ # use -s option to suppress such lines $ cut -s -f2- marks.txt 12 45 23 18 38 19 ```
#### specifying delimiters

* use `-d` option to specify input delimiter other than default **tab** character
* only a single character can be used; for multi-character/regex based delimiters, use `awk` or `perl`

```bash
$ echo 'foo:bar:123:baz' | cut -d: -f3
123

$ # by default output delimiter is same as input
$ echo 'foo:bar:123:baz' | cut -d: -f1,4
foo:baz

$ # quote the delimiter character if it clashes with shell special characters
$ echo 'one;two;three;four' | cut -d; -f3
cut: option requires an argument -- 'd'
Try 'cut --help' for more information.
-f3: command not found
$ echo 'one;two;three;four' | cut -d';' -f3
three
```

* use `--output-delimiter` option to specify different output delimiter
* since this option accepts a string, more than one character can be specified
* See also [using $ prefixed string](https://unix.stackexchange.com/questions/48106/what-does-it-mean-to-have-a-dollarsign-prefixed-string-in-a-script)

```bash
$ printf 'foo\tbar\t123\tbaz\n' | cut --output-delimiter=: -f1-3
foo:bar:123

$ echo 'one;two;three;four' | cut -d';' --output-delimiter=' ' -f1,3-
one three four

$ # tested on bash, might differ with other shells
$ echo 'one;two;three;four' | cut -d';' --output-delimiter=$'\t' -f1,3-
one     three   four

$ echo 'one;two;three;four' | cut -d';' --output-delimiter=' - ' -f1,3-
one - three - four
```
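
* As noted above, `cut -d` accepts only one character
* A short sketch of the `awk` alternative for multi-character and regex delimiters; the input strings are made up for illustration

```bash
# two-character delimiter, which cut -d cannot handle
echo 'one::two::three' | awk -F'::' '{print $2}'
# prints: two

# regex delimiter: one or more digits
echo 'a1b22c333d' | awk -F'[0-9]+' '{print $3}'
# prints: c
```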
#### complement ```bash $ echo 'one;two;three;four' | cut -d';' -f1,3- one;three;four $ # to print other than specified fields $ echo 'one;two;three;four' | cut -d';' --complement -f2 one;three;four ```
#### select specific characters * similar to `-f` for field selection, use `-c` for character selection * See manual for what defines a character and differences between `-b` and `-c` ```bash $ echo 'foo:bar:123:baz' | cut -c4 : $ printf 'foo\tbar\t123\tbaz\n' | cut -c1,4,7 f r $ echo 'foo:bar:123:baz' | cut -c8- :123:baz $ echo 'foo:bar:123:baz' | cut --complement -c8- foo:bar $ echo 'foo:bar:123:baz' | cut -c1,6,7 --output-delimiter=' ' f a r $ echo 'abcdefghij' | cut --output-delimiter='-' -c1-3,4-7,8- abc-defg-hij $ cut -c1-3 marks.txt jan foo feb foo ```
#### Further reading for cut * `man cut` and `info cut` for more options and detailed documentation * [cut Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/cut?sort=votes&pageSize=15)
## tr ```bash $ tr --version | head -n1 tr (GNU coreutils) 8.25 $ man tr TR(1) User Commands TR(1) NAME tr - translate or delete characters SYNOPSIS tr [OPTION]... SET1 [SET2] DESCRIPTION Translate, squeeze, and/or delete characters from standard input, writ‐ ing to standard output. ... ```
#### translation * one-to-one mapping of characters, all occurrences are translated * as good practice, enclose the arguments in single quotes to avoid issues due to shell interpretation ```bash $ echo 'foo bar cat baz' | tr 'abc' '123' foo 21r 31t 21z $ # use - to represent a range in ascending order $ echo 'foo bar cat baz' | tr 'a-f' '1-6' 6oo 21r 31t 21z $ # changing case $ echo 'foo bar cat baz' | tr 'a-z' 'A-Z' FOO BAR CAT BAZ $ echo 'Hello World' | tr 'a-zA-Z' 'A-Za-z' hELLO wORLD $ echo 'foo;bar;baz' | tr ; : tr: missing operand Try 'tr --help' for more information. $ echo 'foo;bar;baz' | tr ';' ':' foo:bar:baz ``` * rot13 example ```bash $ echo 'foo bar cat baz' | tr 'a-z' 'n-za-m' sbb one png onm $ echo 'sbb one png onm' | tr 'a-z' 'n-za-m' foo bar cat baz $ echo 'Hello World' | tr 'a-zA-Z' 'n-za-mN-ZA-M' Uryyb Jbeyq $ echo 'Uryyb Jbeyq' | tr 'a-zA-Z' 'n-za-mN-ZA-M' Hello World ``` * use shell input redirection for file input ```bash $ cat marks.txt jan 2017 foobar 12 45 23 feb 2017 foobar 18 38 19 $ tr 'a-z' 'A-Z' < marks.txt JAN 2017 FOOBAR 12 45 23 FEB 2017 FOOBAR 18 38 19 ``` * if arguments are of different lengths ```bash $ # when second argument is longer, the extra characters are ignored $ echo 'foo bar cat baz' | tr 'abc' '1-9' foo 21r 31t 21z $ # when first argument is longer $ # the last character of second argument gets re-used $ echo 'foo bar cat baz' | tr 'a-z' '123' 333 213 313 213 $ # use -t option to truncate first argument to same length as second $ echo 'foo bar cat baz' | tr -t 'a-z' '123' foo 21r 31t 21z ```
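
* The re-use of the last character of the second argument can be handy for masking
* A small sketch, assuming GNU `tr` (other implementations may treat unequal lengths differently); the sample string is made up

```bash
# second argument has one character, so it is re-used for every digit
echo 'card 1234-5678' | tr '0-9' 'X'
# prints: card XXXX-XXXX

# with -t, first argument is truncated to just '0'
# there is no 0 in the input, so nothing changes
echo 'card 1234-5678' | tr -t '0-9' 'X'
# prints: card 1234-5678
```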
#### escape sequences and character classes * Certain characters like newline, tab, etc can be represented using escape sequences or octal representation * Certain commonly useful groups of characters like alphabets, digits, punctuations etc have character class as shortcuts * See [gnu tr manual](http://www.gnu.org/software/coreutils/manual/html_node/Character-sets.html#Character-sets) for all escape sequences and character classes ```bash $ printf 'foo\tbar\t123\tbaz\n' | tr '\t' ':' foo:bar:123:baz $ echo 'foo:bar:123:baz' | tr ':' '\n' foo bar 123 baz $ # makes it easier to transform $ echo 'foo:bar:123:baz' | tr ':' '\n' | pr -2ats'-' foo-bar 123-baz $ echo 'foo bar cat baz' | tr '[:lower:]' '[:upper:]' FOO BAR CAT BAZ ``` * since `-` is used for character ranges, place it at the end to represent it literally * cannot be used at start of argument as it would get treated as option * or use `--` to indicate end of option processing * similarly, to represent `\` literally, use `\\` ```bash $ echo '/foo-bar/baz/report' | tr '-a-z' '_A-Z' tr: invalid option -- 'a' Try 'tr --help' for more information. $ echo '/foo-bar/baz/report' | tr 'a-z-' 'A-Z_' /FOO_BAR/BAZ/REPORT $ echo '/foo-bar/baz/report' | tr -- '-a-z' '_A-Z' /FOO_BAR/BAZ/REPORT $ echo '/foo-bar/baz/report' | tr '/-' '\\_' \foo_bar\baz\report ```
#### deletion * use `-d` option to specify characters to be deleted * add complement option `-c` if it is easier to define which characters are to be retained ```bash $ echo '2017-03-21' | tr -d '-' 20170321 $ echo 'Hi123 there. How a32re you' | tr -d '1-9' Hi there. How are you $ # delete all punctuation characters $ echo '"Foo1!", "Bar.", ":Baz:"' | tr -d '[:punct:]' Foo1 Bar Baz $ # deleting carriage return character $ cat -v greeting.txt Hi there^M How are you^M $ tr -d '\r' < greeting.txt | cat -v Hi there How are you $ # retain only alphabets, comma and newline characters $ echo '"Foo1!", "Bar.", ":Baz:"' | tr -cd '[:alpha:],\n' Foo,Bar,Baz ```
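
* Combining `-c` and `-d` also gives a quick way to count occurrences of a character
* A minimal sketch, assuming `wc` is available; the inputs are made up for illustration

```bash
# delete everything except commas, then count the bytes that remain
echo 'a,b,c,d' | tr -cd ',' | wc -c
# count is 3

# same idea to count digits
echo 'Hi123 there 45' | tr -cd '0-9' | wc -c
# count is 5
```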
#### squeeze

* to change consecutive repeated characters to a single copy of that character

```bash
$ # only lower case alphabets
$ echo 'FFoo seed 11233' | tr -s 'a-z'
FFo sed 11233

$ # alphabets and digits
$ echo 'FFoo seed 11233' | tr -s '[:alnum:]'
Fo sed 123

$ # squeeze other than alphabets
$ echo 'FFoo seed 11233' | tr -sc '[:alpha:]'
FFoo seed 123

$ # only characters present in second argument are used for squeeze
$ echo 'FFoo seed 11233' | tr -s 'A-Z' 'a-z'
fo sed 11233

$ # multiple consecutive horizontal spaces to single space
$ printf 'foo\t\tbar \t123 baz\n'
foo             bar     123 baz
$ printf 'foo\t\tbar \t123 baz\n' | tr -s '[:blank:]' ' '
foo bar 123 baz
```
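
* Two more uses of squeeze, shown as a sketch with made-up input
* when `-d` and `-s` are used together, GNU `tr` deletes characters of the first argument and then squeezes characters of the second argument

```bash
# squeeze repeated newlines, which removes empty lines
printf 'foo\n\n\nbar\n\nbaz\n' | tr -s '\n'
# prints foo, bar and baz on three consecutive lines

# delete digits, then squeeze lowercase letters and spaces
echo 'fooo  123  baaar' | tr -ds '0-9' 'a-z '
# prints: fo bar
```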
#### Further reading for tr * `man tr` and `info tr` for more options and detailed documentation * [tr Q&A on unix stackexchange](http://unix.stackexchange.com/questions/tagged/tr?sort=votes&pageSize=15)
## basename ```bash $ basename --version | head -n1 basename (GNU coreutils) 8.25 $ man basename BASENAME(1) User Commands BASENAME(1) NAME basename - strip directory and suffix from filenames SYNOPSIS basename NAME [SUFFIX] basename OPTION... NAME... DESCRIPTION Print NAME with any leading directory components removed. If speci‐ fied, also remove a trailing SUFFIX. ... ```
**Examples** ```bash $ # same as using pwd command $ echo "$PWD" /home/learnbyexample $ basename "$PWD" learnbyexample $ # use -a option if there are multiple arguments $ basename -a foo/a/report.log bar/y/power.log report.log power.log $ # use single quotes if arguments contain space and other special shell characters $ # use suffix option -s to strip file extension from filename $ basename -s '.log' '/home/learnbyexample/proj adder/power.log' power $ # -a is implied when using -s option $ basename -s'.log' foo/a/report.log bar/y/power.log report power ``` * Can also use [Parameter expansion](http://mywiki.wooledge.org/BashFAQ/073) if working on file paths saved in variables * assumes `bash` shell and similar that support this feature ```bash $ # remove from start of string up to last / $ file='/home/learnbyexample/proj adder/power.log' $ basename "$file" power.log $ echo "${file##*/}" power.log $ t="${file##*/}" $ # remove .log from end of string $ echo "${t%.log}" power ``` * See `man basename` and `info basename` for detailed documentation
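
* The two parameter expansions can be combined in a loop, for example to derive a new file name for each path
* A minimal sketch; the `.txt` target extension is only an illustration, and nothing is renamed on disk

```bash
for f in 'foo/a/report.log' '/home/learnbyexample/proj adder/power.log'; do
    name="${f##*/}"            # strip directory part, same result as basename
    echo "${name%.log}.txt"    # replace .log suffix with .txt
done
# prints:
# report.txt
# power.txt
```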
## dirname ```bash $ dirname --version | head -n1 dirname (GNU coreutils) 8.25 $ man dirname DIRNAME(1) User Commands DIRNAME(1) NAME dirname - strip last component from file name SYNOPSIS dirname [OPTION] NAME... DESCRIPTION Output each NAME with its last non-slash component and trailing slashes removed; if NAME contains no /'s, output '.' (meaning the current directory). ... ```
**Examples** ```bash $ echo "$PWD" /home/learnbyexample $ dirname "$PWD" /home $ # use single quotes if arguments contain space and other special shell characters $ dirname '/home/learnbyexample/proj adder/power.log' /home/learnbyexample/proj adder $ # unlike basename, by default dirname handles multiple arguments $ dirname foo/a/report.log bar/y/power.log foo/a bar/y $ # if no / in argument, output is . to indicate current directory $ dirname power.log . ``` * Use `$()` command substitution to further process output as needed ```bash $ dirname '/home/learnbyexample/proj adder/power.log' /home/learnbyexample/proj adder $ dirname "$(dirname '/home/learnbyexample/proj adder/power.log')" /home/learnbyexample $ basename "$(dirname '/home/learnbyexample/proj adder/power.log')" proj adder ``` * Can also use [Parameter expansion](http://mywiki.wooledge.org/BashFAQ/073) if working on file paths saved in variables * assumes `bash` shell and similar that support this feature ```bash $ # remove from last / in the string to end of string $ file='/home/learnbyexample/proj adder/power.log' $ dirname "$file" /home/learnbyexample/proj adder $ echo "${file%/*}" /home/learnbyexample/proj adder $ # remove from second last / to end of string $ echo "${file%/*/*}" /home/learnbyexample $ # apply basename trick to get just directory name instead of full path $ t="${file%/*}" $ echo "${t##*/}" proj adder ``` * See `man dirname` and `info dirname` for detailed documentation
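
* Parameter expansion is not an exact replacement for `dirname`
* A short sketch of two cases where the results differ (trailing slash, and no slash at all)

```bash
file='/usr/bin/'
dirname "$file"      # /usr
echo "${file%/*}"    # /usr/bin (only the trailing slash is removed)

file='power.log'
dirname "$file"      # .
echo "${file%/*}"    # power.log (pattern does not match, value is unchanged)
```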
## xargs ```bash $ xargs --version | head -n1 xargs (GNU findutils) 4.7.0-git $ whatis xargs xargs (1) - build and execute command lines from standard input $ # from 'man xargs' This manual page documents the GNU version of xargs. xargs reads items from the standard input, delimited by blanks (which can be protected with double or single quotes or a backslash) or newlines, and executes the command (default is /bin/echo) one or more times with any initial- arguments followed by items read from standard input. Blank lines on the standard input are ignored. ``` While `xargs` is [primarily used](https://unix.stackexchange.com/questions/24954/when-is-xargs-needed) for passing output of command or file contents to another command as input arguments and/or parallel processing, it can be quite handy for certain text processing stuff with default `echo` command ```bash $ printf ' foo\t\tbar \t123 baz \n' | cat -e foo bar 123 baz $ $ # tr helps to change consecutive blanks to single space $ # but what if blanks at start and end have to be removed as well? 
$ printf ' foo\t\tbar \t123 baz \n' | tr -s '[:blank:]' ' ' | cat -e foo bar 123 baz $ $ # xargs does this by default $ printf ' foo\t\tbar \t123 baz \n' | xargs | cat -e foo bar 123 baz$ $ # -n option limits number of arguments per line $ printf ' foo\t\tbar \t123 baz \n' | xargs -n2 foo bar 123 baz $ # same as using: paste -d' ' - - - $ # or: pr -3ats' ' $ seq 6 | xargs -n3 1 2 3 4 5 6 ``` * use `-a` option to specify file input instead of stdin ```bash $ cat marks.txt jan 2017 foobar 12 45 23 feb 2017 foobar 18 38 19 $ xargs -a marks.txt jan 2017 foobar 12 45 23 feb 2017 foobar 18 38 19 $ # use -L option to limit max number of lines per command line $ xargs -L2 -a marks.txt jan 2017 foobar 12 45 23 feb 2017 foobar 18 38 19 ``` * **Note** since `echo` is the command being executed, it will cause issue with option interpretation ```bash $ printf ' -e foo\t\tbar \t123 baz \n' | xargs -n2 foo bar 123 baz $ # use -t option to see what is happening (verbose output) $ printf ' -e foo\t\tbar \t123 baz \n' | xargs -n2 -t echo -e foo foo echo bar 123 bar 123 echo baz baz ``` * See `man xargs` and `info xargs` for detailed documentation
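
* One way around the `echo` option issue is to ask `xargs` to run `printf` instead
* A minimal sketch with made-up input; the format string expects two arguments, so it pairs with `-n2` and an even number of items

```bash
# the format string comes first, so -e reaches printf as plain data
printf -- '-e foo bar 123\n' | xargs -n2 printf '%s %s\n'
# prints:
# -e foo
# bar 123
```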
## seq ```bash $ seq --version | head -n1 seq (GNU coreutils) 8.25 $ man seq SEQ(1) User Commands SEQ(1) NAME seq - print a sequence of numbers SYNOPSIS seq [OPTION]... LAST seq [OPTION]... FIRST LAST seq [OPTION]... FIRST INCREMENT LAST DESCRIPTION Print numbers from FIRST to LAST, in steps of INCREMENT. ... ```
#### integer sequences * see `info seq` for details of how large numbers are handled * for ex: `seq 50000000000000000000 2 50000000000000000004` may not work ```bash $ # default start=1 and increment=1 $ seq 3 1 2 3 $ # default increment=1 $ seq 25434 25437 25434 25435 25436 25437 $ seq -5 -3 -5 -4 -3 $ # different increment value $ seq 1000 5 1011 1000 1005 1010 $ # use negative increment for descending order $ seq 10 -5 -7 10 5 0 -5 ``` * use `-w` option for leading zeros * largest length of start/end value is used to determine padding ```bash $ seq 008 010 8 9 10 $ # or: seq -w 8 010 $ seq -w 008 010 008 009 010 $ seq -w 0003 0001 0002 0003 ```
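
* Zero padded sequences are handy for generating names that sort correctly
* A minimal sketch; the `report_N.log` naming is hypothetical

```bash
# -w pads to the width of the widest number
for i in $(seq -w 8 10); do
    echo "report_$i.log"
done
# prints:
# report_08.log
# report_09.log
# report_10.log
```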
#### specifying separator

* As seen already, the default separator between numbers is newline
* `-s` option specifies a custom separator string
* A newline is always added at the end

```bash
$ seq -s: 4
1:2:3:4

$ seq -s' ' 4
1 2 3 4

$ seq -s' - ' 4
1 - 2 - 3 - 4
```
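
* A custom separator can turn the sequence into an expression for another tool
* A small sketch, using shell arithmetic expansion to evaluate the sum

```bash
# build the expression
seq -s+ 10
# prints: 1+2+3+4+5+6+7+8+9+10

# evaluate it with arithmetic expansion
echo $(( $(seq -s+ 10) ))
# prints: 55
```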
#### floating point sequences ```bash $ # default increment=1 $ seq 0.5 2.5 0.5 1.5 2.5 $ seq -s':' -2 0.75 3 -2.00:-1.25:-0.50:0.25:1.00:1.75:2.50 $ # Scientific notation is supported $ seq 1.2e2 1.22e2 120 121 122 ``` * formatting numbers, see `info seq` for details ```bash $ seq -f'%.3f' -s':' -2 0.75 3 -2.000:-1.250:-0.500:0.250:1.000:1.750:2.500 $ seq -f'%.3e' 1.2e2 1.22e2 1.200e+02 1.210e+02 1.220e+02 ```
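
* The format given to `-f` can also carry surrounding text, as long as it has a single numeric directive
* A minimal sketch, assuming GNU `seq`; the `img_N.png` naming is hypothetical

```bash
# %03g pads the number with zeros to a width of 3
seq -f'img_%03g.png' 3
# prints:
# img_001.png
# img_002.png
# img_003.png
```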
#### Further reading for seq * `man seq` and `info seq` for more options, corner cases and detailed documentation * [seq Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/seq?sort=votes&pageSize=15) ================================================ FILE: overview_presentation/baz.json ================================================ { "abc": { "@attr": "good", "text": "Hi there" }, "xyz": { "@attr": "bad", "text": "I am good. How are you?" } } ================================================ FILE: overview_presentation/foo.xml ================================================ Hi there I am good. How are you? ================================================ FILE: overview_presentation/greeting.txt ================================================ Hi there Have a nice day ================================================ FILE: overview_presentation/sample.txt ================================================ Hello World! Good day How do you do? Just do it Believe 42 it! Today is sunny Not a bit funny No doubt you like it too Much ado about nothing He he 123 he he ================================================ FILE: perl_the_swiss_knife.md ================================================


--- :information_source: :information_source: This chapter has been converted into a better formatted ebook - https://learnbyexample.github.io/learn_perl_oneliners/. The ebook also has content updated for newer version of `perl`, includes exercises, solutions, etc. For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_perl_oneliners ---


# Perl one liners **Table of Contents** * [Executing Perl code](#executing-perl-code) * [Simple search and replace](#simple-search-and-replace) * [inplace editing](#inplace-editing) * [Line filtering](#line-filtering) * [Regular expressions based filtering](#regular-expressions-based-filtering) * [Fixed string matching](#fixed-string-matching) * [Line number based filtering](#line-number-based-filtering) * [Field processing](#field-processing) * [Field comparison](#field-comparison) * [Specifying different input field separator](#specifying-different-input-field-separator) * [Specifying different output field separator](#specifying-different-output-field-separator) * [Changing record separators](#changing-record-separators) * [Input record separator](#input-record-separator) * [Output record separator](#output-record-separator) * [Multiline processing](#multiline-processing) * [Perl regular expressions](#perl-regular-expressions) * [sed vs perl subtle differences](#sed-vs-perl-subtle-differences) * [Backslash sequences](#backslash-sequences) * [Non-greedy quantifier](#non-greedy-quantifier) * [Lookarounds](#lookarounds) * [Ignoring specific matches](#ignoring-specific-matches) * [Special capture groups](#special-capture-groups) * [Modifiers](#modifiers) * [Quoting metacharacters](#quoting-metacharacters) * [Matching position](#matching-position) * [Using modules](#using-modules) * [Two file processing](#two-file-processing) * [Comparing whole lines](#comparing-whole-lines) * [Comparing specific fields](#comparing-specific-fields) * [Line number matching](#line-number-matching) * [Creating new fields](#creating-new-fields) * [Multiple file input](#multiple-file-input) * [Dealing with duplicates](#dealing-with-duplicates) * [Lines between two REGEXPs](#lines-between-two-regexps) * [All unbroken blocks](#all-unbroken-blocks) * [Specific blocks](#specific-blocks) * [Broken blocks](#broken-blocks) * [Array operations](#array-operations) * [Iteration and 
filtering](#iteration-and-filtering) * [Sorting](#sorting) * [Transforming](#transforming) * [Miscellaneous](#miscellaneous) * [split](#split) * [Fixed width processing](#fixed-width-processing) * [String and file replication](#string-and-file-replication) * [transliteration](#transliteration) * [Executing external commands](#executing-external-commands) * [Further Reading](#further-reading)
```bash $ perl -le 'print $^V' v5.22.1 $ man perl PERL(1) Perl Programmers Reference Guide PERL(1) NAME perl - The Perl 5 language interpreter SYNOPSIS perl [ -sTtuUWX ] [ -hv ] [ -V[:configvar] ] [ -cw ] [ -d[t][:debugger] ] [ -D[number/list] ] [ -pna ] [ -Fpattern ] [ -l[octal] ] [ -0[octal/hexadecimal] ] [ -Idir ] [ -m[-]module ] [ -M[-]'module...' ] [ -f ] [ -C [number/list] ] [ -S ] [ -x[dir] ] [ -i[extension] ] [ [-e|-E] 'command' ] [ -- ] [ programfile ] [ argument ]... For more information on these options, you can run "perldoc perlrun". ... ``` **Prerequisites and notes** * familiarity with programming concepts like variables, printing, control structures, arrays, etc * Perl borrows syntax/features from **C, shell scripting, awk, sed** etc. Prior experience working with them would help a lot * familiarity with regular expression basics * if not, check out **ERE** portion of [GNU sed regular expressions](./gnu_sed.md#regular-expressions) * examples for non-greedy, lookarounds, etc will be covered here * this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using `grep`, `sed`, `awk` etc * do NOT use style/syntax presented here when writing full fledged Perl programs which should use **strict, warnings** etc * see [perldoc - perlintro](https://perldoc.perl.org/perlintro.html) and [learnxinyminutes - perl](https://learnxinyminutes.com/docs/perl/) for quick intro to using Perl for full fledged programs * links to Perl documentation will be added as necessary * unless otherwise specified, consider input as ASCII encoded text only * see also [stackoverflow - why UTF-8 is not default](https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default)
## Executing Perl code

* One way is to put code in a file and use `perl` command with filename as argument
* Another is to use [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) at beginning of script, make the file executable and directly run it

```bash
$ cat code.pl
print "Hello Perl\n"
$ perl code.pl
Hello Perl

$ # similar to bash
$ cat code.sh
echo 'Hello Bash'
$ bash code.sh
Hello Bash
```

* For short programs, one can use `-e` commandline option to provide code from command line itself
* Use `-E` option to use newer features like `say`. See [perldoc - new features](https://perldoc.perl.org/feature.html)
* This entire chapter is about using `perl` this way from commandline

```bash
$ perl -e 'print "Hello Perl\n"'
Hello Perl

$ # say automatically adds newline character
$ perl -E 'say "Hello Perl"'
Hello Perl

$ # similar to
$ bash -c 'echo "Hello Bash"'
Hello Bash

$ # multiple commands can be issued separated by ;
$ # -l will be covered later, here used to append newline to print
$ perl -le '$x=25; $y=12; print $x**$y'
59604644775390625
```

* Perl is (in)famous for being able to do things in more than one way
* examples in this chapter will mostly try to use the syntax that avoids `(){}`

```bash
$ # shows different syntax usage of if/say/print
$ perl -e 'if(2<3){print("2 is less than 3\n")}'
2 is less than 3
$ perl -E 'say "2 is less than 3" if 2<3'
2 is less than 3

$ # string comparison uses eq for ==, lt for < and so on
$ perl -e 'if("a" lt "b"){$x=5; $y=10} print "x=$x; y=$y\n"'
x=5; y=10
$ # x/y assignment will happen only if condition evaluates to true
$ perl -E 'say "x=$x; y=$y" if "a" lt "b" and $x=5,$y=10'
x=5; y=10

$ # variables will be interpolated within double quotes
$ # so, use q operator if single quoting is needed
$ # as single quote is already being used to group perl code for -e option
$ perl -le 'print "ab $x 123"'
ab  123
$ perl -le 'print q/ab $x 123/'
ab $x 123
```

**Further Reading**

* `perl -h` for summary of options
* [perldoc - Command Switches](https://perldoc.perl.org/perlrun.html#Command-Switches)
* [perldoc - Perl operators and precedence](https://perldoc.perl.org/perlop.html)
* [explainshell](https://explainshell.com/explain?cmd=perl+-F+-l+-anpeE+-i+-0+-M) - to quickly get information without having to traverse through the docs
* See [Changing record separators](#changing-record-separators) section for more details on `-l` option
## Simple search and replace * **substitution** command syntax is very similar to `sed` for search and replace * syntax is `variable =~ s/REGEXP/REPLACEMENT/FLAGS` and by default acts on `$_` if variable is not specified * see [perldoc - SPECIAL VARIABLES](https://perldoc.perl.org/perlvar.html#SPECIAL-VARIABLES) for explanation on `$_` and other such special variables * more detailed examples will be covered in later sections * Just like other text processing commands, `perl` will automatically loop over input line by line when `-n` or `-p` option is used * like `sed`, the `-n` option won't print the record * `-p` will print the record, including any changes made * newline character being default record separator * `$_` will contain the input record content, including the record separator (unlike `sed` and `awk`) * any directory name appearing in file arguments passed will be automatically ignored * and similar to other commands, `perl` will work with both stdin and file input * See other chapters for examples of [seq](./miscellaneous.md#seq), [paste](./restructure_text.md#paste), etc ```bash $ # sample stdin data $ seq 10 | paste -sd, 1,2,3,4,5,6,7,8,9,10 $ # change only first ',' to ' : ' $ # same as: sed 's/,/ : /' $ seq 10 | paste -sd, | perl -pe 's/,/ : /' 1 : 2,3,4,5,6,7,8,9,10 $ # change all ',' to ' : ' by using 'g' modifier $ # same as: sed 's/,/ : /g' $ seq 10 | paste -sd, | perl -pe 's/,/ : /g' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 $ cat greeting.txt Hi there Have a nice day $ # same as: sed 's/nice day/safe journey/' greeting.txt $ perl -pe 's/nice day/safe journey/' greeting.txt Hi there Have a safe journey ```
#### inplace editing * similar to [GNU sed - using * with inplace option](./gnu_sed.md#prefix-backup-name), one can also use `*` to either prefix the backup name or place the backup files in another existing directory * See also [effectiveperlprogramming - caveats of using -i option](https://www.effectiveperlprogramming.com/2017/12/in-place-editing-gets-safer-in-v5-28/) ```bash $ # same as: sed -i.bkp 's/Hi/Hello/' greeting.txt $ perl -i.bkp -pe 's/Hi/Hello/' greeting.txt $ # original file gets preserved in 'greeting.txt.bkp' $ cat greeting.txt Hello there Have a nice day $ # using -i'bkp.*' will save backup file as 'bkp.greeting.txt' $ # use empty argument to -i with caution, changes made cannot be undone $ perl -i -pe 's/nice day/safe journey/' greeting.txt $ cat greeting.txt Hello there Have a safe journey ``` * Multiple input files are treated individually and changes are written back to respective files ```bash $ cat f1 I ate 3 apples $ cat f2 I bought two bananas and 3 mangoes $ perl -i.bkp -pe 's/3/three/' f1 f2 $ cat f1 I ate three apples $ cat f2 I bought two bananas and three mangoes ```
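
* The prefix form of the backup name can be tried safely in a scratch directory
* A minimal sketch, assuming `mktemp` is available; the `*` in the `-i` argument is replaced by the file name, so it is used here with a plain file name (no directory part)

```bash
tmp=$(mktemp -d)    # scratch directory, so that no real file is touched
(
    cd "$tmp" || exit 1
    printf 'Hi there\nHave a nice day\n' > greeting.txt

    # backup is saved as 'bkp.greeting.txt'
    perl -i'bkp.*' -pe 's/Hi/Hello/' greeting.txt

    cat greeting.txt        # modified content, first line is 'Hello there'
    cat bkp.greeting.txt    # original content, first line is 'Hi there'
)
rm -r "$tmp"
```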
## Line filtering
#### Regular expressions based filtering * syntax is `variable =~ m/REGEXP/FLAGS` to check for a match * `variable !~ m/REGEXP/FLAGS` for negated match * by default acts on `$_` if variable is not specified * as we need to print only selective lines, use `-n` option * by default, contents of `$_` will be printed if no argument is passed to `print` ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # same as: grep '^[RS]' or sed -n '/^[RS]/p' or awk '/^[RS]/' $ # /^[RS]/ is shortcut for $_ =~ m/^[RS]/ $ perl -ne 'print if /^[RS]/' poem.txt Roses are red, Sugar is sweet, $ # same as: grep -i 'and' poem.txt $ perl -ne 'print if /and/i' poem.txt And so are you. $ # same as: grep -v 'are' poem.txt $ # !/are/ is shortcut for $_ !~ m/are/ $ perl -ne 'print if !/are/' poem.txt Sugar is sweet, $ # same as: awk '/are/ && !/so/' poem.txt $ perl -ne 'print if /are/ && !/so/' poem.txt Roses are red, Violets are blue, ``` * using different delimiter * quoting from [perldoc - Regexp Quote-Like Operators](https://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators) > With the m you can use any pair of non-alphanumeric, non-whitespace characters as delimiters ```bash $ cat paths.txt /foo/a/report.log /foo/y/power.log /foo/abc/errors.log $ perl -ne 'print if /\/foo\/a\//' paths.txt /foo/a/report.log $ perl -ne 'print if m#/foo/a/#' paths.txt /foo/a/report.log $ perl -ne 'print if !m#/foo/a/#' paths.txt /foo/y/power.log /foo/abc/errors.log ```
#### Fixed string matching * similar to `grep -F` and `awk index` * See also * [perldoc - index function](https://perldoc.perl.org/functions/index.html) * [perldoc - Quote and Quote-like Operators](https://perldoc.perl.org/5.8.8/perlop.html#Quote-and-Quote-like-Operators) * [Quoting metacharacters](#quoting-metacharacters) section ```bash $ # same as: grep -F 'a[5]' or awk 'index($0, "a[5]")' $ # index returns matching position(starts at 0) and -1 if not found $ echo 'int a[5]' | perl -ne 'print if index($_, "a[5]") != -1' int a[5] $ # however, string within double quotes gets interpolated, for ex $ x='123'; echo "$x" 123 $ perl -e '$x=123; print "$x\n"' 123 $ # so, for commandline usage, better to pass string as environment variable $ # they are accessible via the %ENV hash variable $ perl -le 'print $ENV{PWD}' /home/learnbyexample $ perl -le 'print $ENV{SHELL}' /bin/bash $ echo 'a#$%d' | perl -ne 'print if index($_, "#$%") != -1' $ echo 'a#$%d' | s='#$%' perl -ne 'print if index($_, $ENV{s}) != -1' a#$%d ``` * return value is useful to match at specific position * for ex: at start/end of line ```bash $ cat eqns.txt a=b,a-b=c,c*d a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ # start of line $ # same as: s='a+b' awk 'index($0, ENVIRON["s"])==1' eqns.txt $ s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt a+b,pi=3.14,5e12 $ # end of line $ # length function returns number of characters, by default acts on $_ $ s='a+b' perl -ne '$pos = length() - length($ENV{s}) - 1; print if index($_, $ENV{s}) == $pos' eqns.txt i*(t+9-g)/8,4-a+b ```
#### Line number based filtering * special variable `$.` contains total records read so far, similar to `NR` in `awk` * But no equivalent of awk's `FNR`, [see this stackoverflow Q&A for workaround](https://stackoverflow.com/questions/12384692/line-number-of-a-file-in-perl) * See also [perldoc - eof](https://perldoc.perl.org/perlfunc.html#eof) ```bash $ # same as: head -n2 poem.txt | tail -n1 $ # or sed -n '2p' or awk 'NR==2' $ perl -ne 'print if $.==2' poem.txt Violets are blue, $ # print 2nd and 4th line $ # same as: sed -n '2p; 4p' or awk 'NR==2 || NR==4' $ perl -ne 'print if $.==2 || $.==4' poem.txt Violets are blue, And so are you. $ # same as: tail -n1 poem.txt $ # or sed -n '$p' or awk 'END{print}' $ perl -ne 'print if eof' poem.txt And so are you. ``` * for large input, use `exit` to avoid unnecessary record processing ```bash $ # can also use: perl -ne 'print and exit if $.==234' $ seq 14323 14563435 | perl -ne 'if($.==234){print; exit}' 14556 $ # sample time comparison $ time seq 14323 14563435 | perl -ne 'if($.==234){print; exit}' > /dev/null real 0m0.005s $ time seq 14323 14563435 | perl -ne 'print if $.==234' > /dev/null real 0m2.439s $ # mimicking head command, same as: head -n3 or sed '3q' $ seq 14 25 | perl -pe 'exit if $.>3' 14 15 16 $ # same as: sed '3Q' $ seq 14 25 | perl -pe 'exit if $.==3' 14 15 ``` * selecting range of lines * `..` is [perldoc - range operator](https://perldoc.perl.org/perlop.html#Range-Operators) ```bash $ # same as: sed -n '3,5p' or awk 'NR>=3 && NR<=5' $ # in this context, the range is compared against $. $ seq 14 25 | perl -ne 'print if 3..5' 16 17 18 $ # selecting from particular line number to end of input $ # same as: sed -n '10,$p' or awk 'NR>=10' $ seq 14 25 | perl -ne 'print if $.>=10' 23 24 25 ```
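
* The `FNR` workaround mentioned above is to close `ARGV` at the end of each file, which resets `$.`
* A minimal sketch with two throwaway files (`f1` and `f2` are created in a scratch directory just for this example)

```bash
tmp=$(mktemp -d)
printf 'a1\na2\na3\n' > "$tmp/f1"
printf 'b1\nb2\n' > "$tmp/f2"

# $. keeps counting across files, similar to NR in awk
perl -ne 'print if $.==2' "$tmp/f1" "$tmp/f2"
# prints: a2

# closing ARGV at end of each file resets $., similar to FNR in awk
perl -ne 'print if $.==2; close ARGV if eof' "$tmp/f1" "$tmp/f2"
# prints:
# a2
# b2

rm -r "$tmp"
```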
## Field processing * `-a` option will auto-split each input record based on one or more continuous white-space, similar to default behavior in `awk` * See also [split](#split) section * Special variable array `@F` will contain all the elements, indexing starts from 0 * negative indexing is also supported, `-1` gives last element, `-2` gives last-but-one and so on * see [Array operations](#array-operations) section for examples on array usage ```bash $ cat fruits.txt fruit qty apple 42 banana 31 fig 90 guava 6 $ # print only first field, indexing starts from 0 $ # same as: awk '{print $1}' fruits.txt $ perl -lane 'print $F[0]' fruits.txt fruit apple banana fig guava $ # print only second field $ # same as: awk '{print $2}' fruits.txt $ perl -lane 'print $F[1]' fruits.txt qty 42 31 90 6 ``` * by default, leading and trailing whitespaces won't be considered when splitting the input record * mimicking `awk`'s default behavior ```bash $ printf ' a ate b\tc \n' a ate b c $ printf ' a ate b\tc \n' | perl -lane 'print $F[0]' a $ printf ' a ate b\tc \n' | perl -lane 'print $F[-1]' c $ # number of fields, $#F gives index of last element - so add 1 $ echo '1 a 7' | perl -lane 'print $#F+1' 3 $ printf ' a ate b\tc \n' | perl -lane 'print $#F+1' 4 $ # or use scalar context $ echo '1 a 7' | perl -lane 'print scalar @F' 3 ```
#### Field comparison * for numeric context, Perl automatically tries to convert the string to number, ignoring white-space * for string comparison, use `eq` for `==`, `ne` for `!=` and so on ```bash $ # if first field exactly matches the string 'apple' $ # same as: awk '$1=="apple"{print $2}' fruits.txt $ perl -lane 'print $F[1] if $F[0] eq "apple"' fruits.txt 42 $ # print first field if second field > 35 (excluding header) $ # same as: awk 'NR>1 && $2>35{print $1}' fruits.txt $ perl -lane 'print $F[0] if $F[1]>35 && $.>1' fruits.txt apple fig $ # print header and lines with qty < 35 $ # same as: awk 'NR==1 || $2<35' fruits.txt $ perl -ane 'print if $F[1]<35 || $.==1' fruits.txt fruit qty banana 31 guava 6 $ # if first field does NOT contain 'a' $ # same as: awk '$1 !~ /a/' fruits.txt $ perl -ane 'print if $F[0] !~ /a/' fruits.txt fruit qty fig 90 ```
#### Specifying different input field separator * by using `-F` command line option * See also [split](#split) section, which covers details about trailing empty fields ```bash $ # second field where input field separator is : $ # same as: awk -F: '{print $2}' $ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[1]' 123 $ # last field, same as: awk -F: '{print $NF}' $ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[-1]' 789 $ # second last field, same as: awk -F: '{print $(NF-1)}' $ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[-2]' bar $ # second and last field $ # other ways to print more than 1 element will be covered later $ echo 'foo:123:bar:789' | perl -F: -lane 'print "$F[1] $F[-1]"' 123 789 $ # use quotes to avoid clashes with shell special characters $ echo 'one;two;three;four' | perl -F';' -lane 'print $F[2]' three ``` * Regular expressions based input field separator ```bash $ # same as: awk -F'[0-9]+' '{print $2}' $ echo 'Sample123string54with908numbers' | perl -F'\d+' -lane 'print $F[1]' string $ # first field will be empty as there is nothing before '{' $ # same as: awk -F'[{}= ]+' '{print $1}' $ # \x20 is space character, can't use literal space within [] when using -F $ echo '{foo} bar=baz' | perl -F'[{}=\x20]+' -lane 'print $F[0]' $ echo '{foo} bar=baz' | perl -F'[{}=\x20]+' -lane 'print $F[1]' foo $ echo '{foo} bar=baz' | perl -F'[{}=\x20]+' -lane 'print $F[2]' bar ``` * empty argument to `-F` will split the input record character wise ```bash $ # same as: gawk -v FS= '{print $1}' $ echo 'apple' | perl -F -lane 'print $F[0]' a $ echo 'apple' | perl -F -lane 'print $F[1]' p $ echo 'apple' | perl -F -lane 'print $F[-1]' e $ # use -C option when dealing with unicode characters $ # S will turn on UTF-8 for stdin/stdout/stderr streams $ printf 'hi👍 how are you?' | perl -CS -F -lane 'print $F[2]' 👍 ```
#### Specifying different output field separator * Method 1: use `$,` to change separator between `print` arguments * could be remembered easily by noting that `,` is used to separate `print` arguments ```bash $ # by default, the various arguments are concatenated $ echo 'foo:123:bar:789' | perl -F: -lane 'print $F[1], $F[-1]' 123789 $ # change $, if different separator is needed $ echo 'foo:123:bar:789' | perl -F: -lane '$,=" "; print $F[1], $F[-1]' 123 789 $ echo 'foo:123:bar:789' | perl -F: -lane '$,="-"; print $F[1], $F[-1]' 123-789 $ # argument can be array too $ echo 'foo:123:bar:789' | perl -F: -lane '$,="-"; print @F[1,-1]' 123-789 $ echo 'foo:123:bar:789' | perl -F: -lane '$,=" - "; print @F' foo - 123 - bar - 789 ``` * Method 2: use `join` ```bash $ echo 'foo:123:bar:789' | perl -F: -lane 'print join "-", $F[1], $F[-1]' 123-789 $ echo 'foo:123:bar:789' | perl -F: -lane 'print join "-", @F[1,-1]' 123-789 $ echo 'foo:123:bar:789' | perl -F: -lane 'print join " - ", @F' foo - 123 - bar - 789 ``` * Method 3: use `$"` to change separator when array is interpolated, default is space character * could be remembered easily by noting that interpolation happens within double quotes ```bash $ # default is space $ echo 'foo:123:bar:789' | perl -F: -lane 'print "@F[1,-1]"' 123 789 $ echo 'foo:123:bar:789' | perl -F: -lane '$"="-"; print "@F[1,-1]"' 123-789 $ echo 'foo:123:bar:789' | perl -F: -lane '$"=","; print "@F"' foo,123,bar,789 ``` * use `BEGIN` if same separator is to be used for all lines * statements inside `BEGIN` are executed before processing any input text ```bash $ # can also use: perl -lane 'BEGIN{$"=","} print "@F"' fruits.txt $ perl -lane 'BEGIN{$,=","} print @F' fruits.txt fruit,qty apple,42 banana,31 fig,90 guava,6 ``` ## Changing record separators * Before seeing examples for changing record separators, let's cover a detail about contents of input record and use of `-l` option * See also [perldoc - 
chomp](https://perldoc.perl.org/functions/chomp.html) ```bash $ # input record includes the record separator as well $ # can also use: perl -pe 's/$/ 123/' $ echo 'foo' | perl -pe 's/\n/ 123\n/' foo 123 $ # this example shows better use case $ # similar to paste -sd but with ability to use multi-character delimiter $ seq 5 | perl -pe 's/\n/ : / if !eof' 1 : 2 : 3 : 4 : 5 $ # -l option will chomp off the record separator (among other things) $ echo 'foo' | perl -l -pe 's/\n/ 123\n/' foo $ # -l also sets output record separator which gets added to print statements $ # ORS gets input record separator value if no argument is passed to -l $ # hence the newline automatically getting added for print in this example $ perl -lane 'print $F[0] if $F[1]<35 && $.>1' fruits.txt banana guava ```
#### Input record separator

* by default, newline character is used as input record separator
* use `$/` to specify a different input record separator
    * unlike `awk`, only string can be used, no regular expressions
* for single character separator, can also use `-0` command line option which accepts octal/hexadecimal value as argument
* if `-l` option is also used
    * input record separator will be chomped from input record
    * in addition, if argument is not passed to `-l`, output record separator will get whatever is current value of input record separator
    * so, order of `-l`, `-0` and/or `$/` usage becomes important

```bash
$ s='this is a sample string'

$ # space as input record separator, printing all records
$ # same as: awk -v RS=' ' '{print NR, $0}'
$ # ORS is newline as -l is used before $/ gets changed
$ printf "$s" | perl -lne 'BEGIN{$/=" "} print "$. $_"'
1 this
2 is
3 a
4 sample
5 string

$ # print all records containing 'a'
$ # same as: awk -v RS=' ' '/a/'
$ printf "$s" | perl -l -0040 -ne 'print if /a/'
a
sample

$ # if the order is changed, ORS will be space, not newline
$ printf "$s" | perl -0040 -l -ne 'print if /a/'
a sample
```

* `-0` option used without argument will use the ASCII NUL character as input record separator

```bash
$ printf 'foo\0bar\0' | cat -A
foo^@bar^@$
$ printf 'foo\0bar\0' | perl -l -0 -ne 'print'
foo
bar

$ # could be golfed to: perl -l -0pe ''
$ # but don't use `-l0` as `0` will be treated as argument to `-l`
```

* values `-0400` to `-0777` will cause entire file to be slurped
    * idiomatically, `-0777` is used

```bash
$ # s modifier allows . to match newline as well
$ perl -0777 -pe 's/red.*are //s' poem.txt
Roses are you.

$ # replace first newline with '. '
$ perl -0777 -pe 's/\n/. /' greeting.txt
Hello there. Have a safe journey
```

* for paragraph mode (two or more consecutive newline characters), use `-00` or assign empty string to `$/`

Consider the below sample file

```bash
$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* again, input record will have the separator too and using `-l` will chomp it
* however, if more than two consecutive newline characters separate the paragraphs, only two newlines will be preserved and the rest discarded
    * use `$/="\n\n"` to avoid this behavior

```bash
$ # print all paragraphs containing 'it'
$ # same as: awk -v RS= -v ORS='\n\n' '/it/' sample.txt
$ perl -00 -ne 'print if /it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ perl -F'\n' -00 -ane 'print if $#F==0' sample.txt
Hello World

$ # unlike awk -F'\n' -v RS= -v ORS='\n\n' 'NF==2 && /do/' sample.txt
$ # there won't be empty line at end because input file didn't have it
$ perl -F'\n' -00 -ane 'print if $#F==1 && /do/' sample.txt
Just do-it
Believe it

Much ado about nothing
He he he
```

* Re-structuring paragraphs

```bash
$ # same as: awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1'
$ perl -F'\n' -00 -ane 'print join ". ", @F; print "\n\n"' sample.txt
Hello World

Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too

Much ado about nothing. He he he

```

* multi-character separator

```bash
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ # number of records, same as: awk -v RS='Error:' 'END{print NR}'
$ perl -lne 'BEGIN{$/="Error:"} print $. if eof' report.log
3

$ # print first record
$ perl -lne 'BEGIN{$/="Error:"} print if $.==1' report.log
blah blah

$ # same as: awk -v RS='Error:' '/surely/{print RS $0}' report.log
$ perl -lne 'BEGIN{$/="Error:"} print "$/$_" if /surely/' report.log
Error: something surely went wrong
some text
some more text
blah blah blah

```

* Joining lines based on specific end of line condition

```bash
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.

$ # same as: awk -v RS='-\n' -v ORS= '1' msg.txt
$ # can also use: perl -pe 's/-\n//' msg.txt
$ perl -pe 'BEGIN{$/="-\n"} chomp' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.
```
#### Output record separator * one way is to use `$\` to specify a different output record separator * by default it doesn't have a value ```bash $ # note that despite $\ not having a value, output has newlines $ # because the input record still has the input record separator $ seq 3 | perl -ne 'print' 1 2 3 $ # same as: awk -v ORS='\n\n' '{print $0}' $ seq 3 | perl -ne 'BEGIN{$\="\n"} print' 1 2 3 $ seq 2 | perl -ne 'BEGIN{$\="---\n"} print' 1 --- 2 --- ``` * dynamically changing output record separator ```bash $ # same as: awk '{ORS = NR%2 ? " " : "\n"} 1' $ # note the use of -l to chomp the input record separator $ seq 6 | perl -lpe '$\ = $.%2 ? " " : "\n"' 1 2 3 4 5 6 $ # -l also sets the output record separator $ # but gets overridden by $\ $ seq 6 | perl -lpe '$\ = $.%3 ? "-" : "\n"' 1-2-3 4-5-6 ``` * passing argument to `-l` to set output record separator ```bash $ seq 8 | perl -ne 'print if /[24]/' 2 4 $ # null separator, note how -l also chomps input record separator $ seq 8 | perl -l0 -ne 'print if /[24]/' | cat -A 2^@4^@ $ # comma separator, won't have a newline at end $ seq 8 | perl -l054 -ne 'print if /[24]/' 2,4, $ # to add a final newline to output, use END and printf $ seq 8 | perl -l054 -ne 'print if /[24]/; END{printf "\n"}' 2,4, ```
## Multiline processing * Processing consecutive lines ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # match two consecutive lines $ # same as: awk 'p~/are/ && /is/{print p ORS $0} {p=$0}' poem.txt $ perl -ne 'print $p,$_ if /is/ && $p=~/are/; $p=$_' poem.txt Violets are blue, Sugar is sweet, $ # if only the second line is needed, same as: awk 'p~/are/ && /is/; {p=$0}' $ perl -ne 'print if /is/ && $p=~/are/; $p=$_' poem.txt Sugar is sweet, $ # print if line matches a condition as well as condition for next 2 lines $ # same as: awk 'p2~/red/ && p1~/blue/ && /is/{print p2} {p2=p1; p1=$0}' $ perl -ne 'print $p2 if /is/ && $p1=~/blue/ && $p2=~/red/; $p2=$p1; $p1=$_' poem.txt Roses are red, ``` Consider this sample input file ```bash $ cat range.txt foo BEGIN 1234 6789 END bar BEGIN a b c END baz ``` * extracting lines around matching line * how `$n && $n--` works: * need to note that right hand side of `&&` is processed only if left hand side is `true` * so for example, if initially `$n=2`, then we get * `2 && 2; $n=1` - evaluates to `true` * `1 && 1; $n=0` - evaluates to `true` * `0 && ` - evaluates to `false` ... 
no decrementing `$n` and hence will be `false` until `$n` is re-assigned non-zero value ```bash $ # similar to: grep --no-group-separator -A1 'BEGIN' range.txt $ # same as: awk '/BEGIN/{n=2} n && n--' range.txt $ perl -ne '$n=2 if /BEGIN/; print if $n && $n--' range.txt BEGIN 1234 BEGIN a $ # print only line after matching line, same as: awk 'n && n--; /BEGIN/{n=1}' $ perl -ne 'print if $n && $n--; $n=1 if /BEGIN/' range.txt 1234 a $ # generic case: print nth line after match, awk 'n && !--n; /BEGIN/{n=3}' $ perl -ne 'print if $n && !--$n; $n=3 if /BEGIN/' range.txt END c $ # print second line prior to matched line $ # same as: awk '/END/{print p2} {p2=p1; p1=$0}' range.txt $ perl -ne 'print $p2 if /END/; $p2=$p1; $p1=$_' range.txt 1234 b $ # use reversing trick for generic case of nth line before match $ # same as: tac range.txt | awk 'n && !--n; /END/{n=3}' | tac $ tac range.txt | perl -ne 'print if $n && !--$n; $n=3 if /END/' | tac BEGIN a ``` **Further Reading** * [stackoverflow - multiline find and replace](https://stackoverflow.com/questions/39884112/perl-multiline-find-and-replace-with-regex) * [stackoverflow - delete line based on content of previous/next lines](https://stackoverflow.com/questions/49112877/delete-line-if-line-matches-foo-line-above-matches-bar-and-line-below-match) * [softwareengineering - FSM examples](https://softwareengineering.stackexchange.com/questions/47806/examples-of-finite-state-machines) * [wikipedia - FSM](https://en.wikipedia.org/wiki/Finite-state_machine)
## Perl regular expressions * examples to showcase some of the features not present in ERE and modifiers not available in `sed`'s substitute command * many features of Perl regular expressions will NOT be covered, but external links will be provided wherever relevant * See [perldoc - perlre](https://perldoc.perl.org/perlre.html) for complete reference * and [perldoc - regular expressions FAQ](https://perldoc.perl.org/perlfaq.html#the-perlfaq6-manpage%3a-Regular-Expressions) * examples/descriptions based only on ASCII encoding
#### sed vs perl subtle differences * input record separator being part of input record ```bash $ echo 'foo:123:bar:789' | sed -E 's/[^:]+$/xyz/' foo:123:bar:xyz $ # newline character gets replaced too as shown by shell prompt $ echo 'foo:123:bar:789' | perl -pe 's/[^:]+$/xyz/' foo:123:bar:xyz$ $ # simple workaround is to use -l option $ echo 'foo:123:bar:789' | perl -lpe 's/[^:]+$/xyz/' foo:123:bar:xyz $ # of course it has uses too $ seq 10 | paste -sd, | sed 's/,/ : /g' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 $ seq 10 | perl -pe 's/\n/ : / if !eof' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 ``` * how much does `*` match? ```bash $ # sed will choose biggest match $ echo ',baz,,xyz,,,' | sed 's/[^,]*/A/g' A,A,A,A,A,A,A $ echo 'foo,baz,,xyz,,,123' | sed 's/[^,]*/A/g' A,A,A,A,A,A,A $ # but perl will match both empty and non-empty strings $ echo ',baz,,xyz,,,' | perl -lpe 's/[^,]*/A/g' A,AA,A,AA,A,A,A $ echo 'foo,baz,,xyz,,,123' | perl -lpe 's/[^,]*/A/g' AA,AA,A,AA,A,A,AA $ echo '42,789' | sed 's/[0-9]*/"&"/g' "42","789" $ echo '42,789' | perl -lpe 's/\d*/"$&"/g' "42""","789""" $ echo '42,789' | perl -lpe 's/\d+/"$&"/g' "42","789" ``` * backslash sequences inside character classes ```bash $ # \w would simply match w $ echo 'w=y-x+9*3' | sed 's/[\w=]//g' y-x+9*3 $ # \w would match any word character $ echo 'w=y-x+9*3' | perl -pe 's/[\w=]//g' -+* ``` * replacing specific occurrence * See [stackoverflow - substitute the nth occurrence of a match in a Perl regex](https://stackoverflow.com/questions/2555662/how-can-i-substitute-the-nth-occurrence-of-a-match-in-a-perl-regex) for workarounds ```bash $ echo 'foo:123:bar:baz' | sed 's/:/-/2' foo:123-bar:baz $ echo 'foo:123:bar:baz' | perl -pe 's/:/-/2' Unknown regexp modifier "/2" at -e line 1, at end of line Execution of -e aborted due to compilation errors. $ # e modifier covered later, allows Perl code in replacement section $ echo 'foo:123:bar:baz' | perl -pe '$c=0; s/:/++$c==2 ? 
"-" : $&/ge' foo:123-bar:baz $ # or use non-greedy and \K(covered later), same as: sed 's/and/-/3' $ echo 'foo and bar and baz land good' | perl -pe 's/(and.*?){2}\Kand/-/' foo and bar and baz l- good $ # emulating GNU sed's number+g modifier $ a='456:foo:123:bar:789:baz x:y:z:a:v:xc:gf' $ echo "$a" | sed 's/:/-/3g' 456:foo:123-bar-789-baz x:y:z-a-v-xc-gf $ echo "$a" | perl -pe '$c=0; s/:/++$c<3 ? $& : "-"/ge' 456:foo:123-bar-789-baz x:y:z-a-v-xc-gf ``` * variable interpolation when `$` or `@` is used * See also [perldoc - Quote and Quote-like Operators](https://perldoc.perl.org/5.8.8/perlop.html#Quote-and-Quote-like-Operators) ```bash $ seq 2 | sed 's/$x/xyz/' 1 2 $ # uninitialized variable, same applies for: perl -pe 's/@a/xyz/' $ seq 2 | perl -pe 's/$x/xyz/' xyz1 xyz2 $ # initialized variable $ seq 2 | perl -pe '$x=2; s/$x/xyz/' 1 xyz $ # using single quotes as delimiter won't interpolate $ # not usable for one-liners given shell's own single/double quotes behavior $ cat sub_sq.pl s'$x'xyz' $ seq 2 | perl -p sub_sq.pl 1 2 ``` * back reference * See also [perldoc - Warning on \1 Instead of $1](https://perldoc.perl.org/perlre.html#Warning-on-%5c1-Instead-of-%241) ```bash $ # use $& to refer entire matched string in replacement section $ echo 'hello world' | sed 's/.*/"&"/' "hello world" $ echo 'hello world' | perl -pe 's/.*/"&"/' "&" $ echo 'hello world' | perl -pe 's/.*/"$&"/' "hello world" $ # use \1, \2, etc or \g1, \g2 etc for back referencing in search section $ # use $1, $2, etc in replacement section $ echo 'a a a walking for for a cause' | perl -pe 's/\b(\w+)( \1)+\b/$1/g' a walking for a cause ```
#### Backslash sequences

* `\d` for `[0-9]`
* `\s` for `[ \t\r\n\f\v]`
* `\h` for `[ \t]`
* `\n` for newline character
* `\D`, `\S`, `\H`, `\N` respectively for their opposites
* See [perldoc - perlrecharclass](https://perldoc.perl.org/perlrecharclass.html#Backslash-sequences) for full list and details

```bash
$ # same as: sed -E 's/[0-9]+/xxx/g'
$ echo 'like 42 and 37' | perl -pe 's/\d+/xxx/g'
like xxx and xxx

$ # same as: sed -E 's/[^0-9]+/xxx/g'
$ # note again the use of -l because of newline in input record
$ echo 'like 42 and 37' | perl -lpe 's/\D+/xxx/g'
xxx42xxx37

$ # -l is not needed here as \h won't match newline
$ echo 'a b c ' | perl -pe 's/\h*$//'
a b c
```
#### Non-greedy quantifier * adding a `?` to `?` or `*` or `+` or `{}` quantifiers will change matching from greedy to non-greedy. In other words, to match as minimally as possible * also known as lazy quantifier * See also [regular-expressions.info - Possessive Quantifiers](https://www.regular-expressions.info/possessive.html) ```bash $ # greedy matching $ echo 'foo and bar and baz land good' | perl -pe 's/foo.*and//' good $ # non-greedy matching $ echo 'foo and bar and baz land good' | perl -pe 's/foo.*?and//' bar and baz land good $ echo '12342789' | perl -pe 's/\d{2,5}//' 789 $ echo '12342789' | perl -pe 's/\d{2,5}?//' 342789 $ # for single character, non-greedy is not always needed $ echo '123:42:789:good:5:bad' | perl -pe 's/:.*?:/:/' 123:789:good:5:bad $ echo '123:42:789:good:5:bad' | perl -pe 's/:[^:]*:/:/' 123:789:good:5:bad $ # just like greedy, overall matching is considered, as minimal as possible $ echo '123:42:789:good:5:bad' | perl -pe 's/:.*?:[a-z]/:/' 123:ood:5:bad $ echo '123:42:789:good:5:bad' | perl -pe 's/:.*:[a-z]/:/' 123:ad ```
#### Lookarounds

* Ability to add if conditions to match before/after required pattern
* There are four types
    * positive lookahead `(?=`
    * negative lookahead `(?!`
    * positive lookbehind `(?<=`
    * negative lookbehind `(?<!`

#### Ignoring specific matches

* A useful construct is `(*SKIP)(*F)` which allows discarding matches that are not needed
    * regular expression which should be discarded is written first, `(*SKIP)(*F)` is appended and then required regular expression is added after `|`

```bash
$ s='Car Bat cod12 Map foo_bar'
$ # all words except those starting with 'c' or 'C'
$ echo "$s" | perl -lne 'print join "\n", /\bc\w+(*SKIP)(*F)|\w+/gi'
Bat
Map
foo_bar

$ s='I like "mango" and "guava"'
$ # all words except those surrounded by double quotes
$ echo "$s" | perl -lne 'print join "\n", /"[^"]+"(*SKIP)(*F)|\w+/g'
I
like
and
$ # change words except those surrounded by double quotes
$ echo "$s" | perl -pe 's/"[^"]+"(*SKIP)(*F)|\w+/\U$&/g'
I LIKE "mango" AND "guava"
```

* for line based decisions, simple if-else might help

```bash
$ cat nums.txt
42
-2
10101
-3.14
-75

$ # change +ve number to -ve and vice versa
$ # note that empty regexp will reuse last successfully matched regexp
$ perl -pe '/^-/ ? s/// : s/^/-/' nums.txt
-42
2
-10101
3.14
75
```

**Further Reading**

* [perldoc - Special Backtracking Control Verbs](https://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs)
* [rexegg - Excluding Unwanted Matches](https://www.rexegg.com/backtracking-control-verbs.html#skipfail)
#### Special capture groups

* `\1`, `\2` etc only matches exact string
* `(?1)`, `(?2)` etc re-uses the regular expression itself

```bash
$ s='baz 2008-03-24 and 2012-08-12 foo 2016-03-25'
$ # (?1) refers to first capture group (\d{4}-\d{2}-\d{2})
$ echo "$s" | perl -pe 's/(\d{4}-\d{2}-\d{2}) and (?1)/XYZ/'
baz XYZ foo 2016-03-25

$ # using \1 won't work as the two dates are different
$ echo "$s" | perl -pe 's/(\d{4}-\d{2}-\d{2}) and \1//'
baz 2008-03-24 and 2012-08-12 foo 2016-03-25
```

* use `(?:` to group regular expressions without capturing it, so this won't be counted for backreference
* See also
    * [stackoverflow - what is non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do)
    * [stackoverflow - extract specific fields and key-value pairs](https://stackoverflow.com/questions/46632397/parse-vcf-files-info-field)

```bash
$ s='Car Bat cod12 Map foo_bar'
$ # check what happens if ?: is not used
$ echo "$s" | perl -lne 'print join "\n", /(?:Bat|Map)(*SKIP)(*F)|\w+/gi'
Car
cod12
foo_bar

$ # using ?: helps to focus only on required capture groups
$ echo 'cod1 foo_bar' | perl -pe 's/(?:co|fo)\K(\w)(\w)/$2$1/g'
co1d fo_obar
$ # without ?: you'd need to remember all the other groups as well
$ echo 'cod1 foo_bar' | perl -pe 's/(co|fo)\K(\w)(\w)/$3$2/g'
co1d fo_obar
```

* named capture groups `(?<name>`
    * for backreference, use `\k<name>`
    * accessible via `%+` hash in replacement section

```bash
$ s='baz 2008-03-24 and 2012-08-12 foo 2016-03-25'
$ echo "$s" | perl -pe 's/(\d{4})-(\d{2})-(\d{2})/$3-$2-$1/g'
baz 24-03-2008 and 12-08-2012 foo 25-03-2016

$ # naming the capture groups might offer clarity
$ echo "$s" | perl -pe 's/(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/$+{d}-$+{m}-$+{y}/g'
baz 24-03-2008 and 12-08-2012 foo 25-03-2016
$ echo "$s" | perl -pe 's/(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/$+{m}-$+{d}-$+{y}/g'
baz 03-24-2008 and 08-12-2012 foo 03-25-2016

$ # and useful to transform different capture groups
$ s='"foo,bar",123,"x,y,z",42'
$ echo "$s" | perl -lpe 's/"(?<a>[^"]+)",|(?<a>[^,]+),/$+{a}|/g'
foo,bar|123|x,y,z|42
$ # can also use (?| branch reset
$ echo "$s" | perl -lpe 's/(?|"([^"]+)",|([^,]+),)/$1|/g'
foo,bar|123|x,y,z|42
```

**Further Reading**

* [perldoc - Extended Patterns](https://perldoc.perl.org/perlre.html#Extended-Patterns)
* [rexegg - all the (? usages](https://www.rexegg.com/regex-disambiguation.html)
* [regular-expressions - recursion](https://www.regular-expressions.info/recurse.html#balanced)
#### Modifiers

* some are already seen, like the `g` (global match) and `i` (case insensitive matching)
* first up, the `r` modifier which returns the substitution result instead of modifying the variable it is acting upon

```bash
$ perl -e '$x="feed"; $y=$x=~s/e/E/gr; print "x=$x\ny=$y\n"'
x=feed
y=fEEd

$ # the r modifier is available for transliteration operator too
$ perl -e '$x="food"; $y=$x=~tr/a-z/A-Z/r; print "x=$x\ny=$y\n"'
x=food
y=FOOD
```

* `e` modifier allows to use Perl code in replacement section instead of string
* use `ee` if you need to construct a string and then apply evaluation

```bash
$ # replace numbers with their squares
$ echo '4 and 10' | perl -pe 's/\d+/$&*$&/ge'
16 and 100

$ # replace matched string with incremental value
$ echo '4 and 10 foo 57' | perl -pe 's/\d+/++$c/ge'
1 and 2 foo 3
$ # passing initial value
$ echo '4 and 10 foo 57' | c=100 perl -pe 's/\d+/$ENV{c}++/ge'
100 and 101 foo 102

$ # formatting string
$ echo 'a1-2-deed' | perl -lpe 's/[^-]+/sprintf "%04s", $&/ge'
00a1-0002-deed

$ # calling a function
$ echo 'food:12:explain:789' | perl -pe 's/\w+/length($&)/ge'
4:2:7:3

$ # applying another substitution to matched string
$ echo '"mango" and "guava"' | perl -pe 's/"[^"]+"/$&=~s|a|A|gr/ge'
"mAngo" and "guAvA"
```

* multiline modifiers

```bash
$ # m modifier to match beginning/end of each line within multiline string
$ perl -00 -ne 'print if /^Believe/' sample.txt
$ perl -00 -ne 'print if /^Believe/m' sample.txt
Just do-it
Believe it

$ perl -00 -ne 'print if /funny$/' sample.txt
$ perl -00 -ne 'print if /funny$/m' sample.txt
Today is sunny
Not a bit funny
No doubt you like it too

$ # s modifier to allow . meta character to match newlines as well
$ perl -00 -ne 'print if /do.*he/' sample.txt
$ perl -00 -ne 'print if /do.*he/s' sample.txt
Much ado about nothing
He he he
```

**Further Reading**

* [perldoc - perlre Modifiers](https://perldoc.perl.org/perlre.html#Modifiers)
* [stackoverflow - replacement within matched string](https://stackoverflow.com/questions/40458639/replacement-within-the-matched-string-with-sed)
#### Quoting metacharacters * part of regular expression can be surrounded within `\Q` and `\E` to prevent matching meta characters within that portion * however, `$` and `@` would still be interpolated as long as delimiter isn't single quotes * `\E` is optional if applying `\Q` till end of search expression * typical use case is string to be protected is already present in a variable, for ex: user input or result of another command * quotemeta will add a backslash to all characters other than `\w` characters * See also [perldoc - Quoting metacharacters](https://perldoc.perl.org/perlre.html#Quoting-metacharacters) ```bash $ # quotemeta in action $ perl -le '$x="[a].b+c^"; print quotemeta $x' \[a\]\.b\+c\^ $ # same as: s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt $ s='a+b' perl -ne 'print if /^\Q$ENV{s}/' eqns.txt a+b,pi=3.14,5e12 $ s='a+b' perl -pe 's/^\Q$ENV{s}/ABC/' eqns.txt a=b,a-b=c,c*d ABC,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ s='a+b' perl -pe 's/\Q$ENV{s}\E.*,/ABC,/' eqns.txt a=b,a-b=c,c*d ABC,5e12 i*(t+9-g)/8,4-a+b ``` * use `q` operator for replacement section * it would treat contents as if they were placed inside single quotes and hence no interpolation * See also [perldoc - Quote and Quote-like Operators](https://perldoc.perl.org/5.8.8/perlop.html#Quote-and-Quote-like-Operators) ```bash $ # q in action $ perl -le '$x="[a].b+c^$@123"; print $x' [a].b+c^123 $ perl -le '$x=q([a].b+c^$@123); print $x' [a].b+c^$@123 $ perl -le '$x=q([a].b+c^$@123); print quotemeta $x' \[a\]\.b\+c\^\$\@123 $ echo 'foo 123' | perl -pe 's/foo/$foo/' 123 $ echo 'foo 123' | perl -pe 's/foo/q($foo)/e' $foo 123 $ echo 'foo 123' | perl -pe 's/foo/q{$f)oo}/e' $f)oo 123 $ # string saved in other variables do not need special attention $ echo 'foo 123' | s='a$b' perl -pe 's/foo/$ENV{s}/' a$b 123 $ echo 'foo 123' | perl -pe 's/foo/a$b/' a 123 ```
#### Matching position * From [perldoc - perlvar](https://perldoc.perl.org/perlvar.html#SPECIAL-VARIABLES) >$-[0] is the offset of the start of the last successful match >$+[0] is the offset into the string of the end of the entire match ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # starting position of match $ perl -lne 'print "line: $., offset: $-[0]" if /are/' poem.txt line: 1, offset: 6 line: 2, offset: 8 line: 4, offset: 7 $ # if offset is needed starting from 1 instead of 0 $ perl -lne 'print "line: $., offset: ",$-[0]+1 if /are/' poem.txt line: 1, offset: 7 line: 2, offset: 9 line: 4, offset: 8 $ # ending position of match $ perl -lne 'print "line: $., offset: $+[0]" if /are/' poem.txt line: 1, offset: 9 line: 2, offset: 11 line: 4, offset: 10 ``` * for multiple matches, use `while` loop to go over all the matches ```bash $ perl -lne 'print "$.:$&:$-[0]" while /is|so|are/g' poem.txt 1:are:6 2:are:8 3:is:6 4:so:4 4:are:7 ```
## Using modules

* There are many standard modules available that come with Perl installation
* and many more available from **Comprehensive Perl Archive Network** (CPAN)
    * [stackoverflow - easiest way to install a missing module](https://stackoverflow.com/questions/65865/whats-the-easiest-way-to-install-a-missing-perl-module)

```bash
$ echo '34,17,6' | perl -F, -lane 'BEGIN{use List::Util qw(max)} print max @F'
34

$ # -M option provides a way to specify modules from command line
$ echo '34,17,6' | perl -MList::Util=max -F, -lane 'print max @F'
34
$ echo '34,17,6' | perl -MList::Util=sum0 -F, -lane 'print sum0 @F'
57
$ echo '34,17,6' | perl -MList::Util=product -F, -lane 'print product @F'
3468

$ s='1,2,3,4,5'
$ echo "$s" | perl -MList::Util=shuffle -F, -lane 'print join ",",shuffle @F'
5,3,4,1,2

$ s='3,b,a,c,d,1,d,c,2,3,1,b'
$ echo "$s" | perl -MList::MoreUtils=uniq -F, -lane 'print join ",",uniq @F'
3,b,a,c,d,1,2

$ echo 'foo 123 baz' | base64
Zm9vIDEyMyBiYXoK
$ echo 'foo 123 baz' | perl -MMIME::Base64 -ne 'print encode_base64 $_'
Zm9vIDEyMyBiYXoK
$ echo 'Zm9vIDEyMyBiYXoK' | perl -MMIME::Base64 -ne 'print decode_base64 $_'
foo 123 baz
```

* a cool module [O](https://perldoc.perl.org/O.html) helps to convert one-liners to full fledged programs
    * similar to `-o` option for GNU awk

```bash
$ # command being deparsed is discussed in a later section
$ perl -MO=Deparse -ne 'if(!$#ARGV){$h{$_}=1; next} print if $h{$_}' colors_1.txt colors_2.txt
LINE: while (defined($_ = <ARGV>)) {
    unless ($#ARGV) {
        $h{$_} = 1;
        next;
    }
    print $_ if $h{$_};
}
-e syntax OK

$ perl -MO=Deparse -00 -ne 'print if /it/' sample.txt
BEGIN { $/ = ""; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
    print $_ if /it/;
}
-e syntax OK
```

**Further Reading**

* [perldoc - perlmodlib](https://perldoc.perl.org/perlmodlib.html)
* [perldoc - Core modules](https://perldoc.perl.org/index-modules-L.html)
* [unix.stackexchange - example for Algorithm::Combinatorics](https://unix.stackexchange.com/questions/310840/better-solution-for-finding-id-groups-permutations-combinations)
* [unix.stackexchange - example for Text::ParseWords](https://unix.stackexchange.com/questions/319301/excluding-enclosed-delimiters-with-cut)
* [stackoverflow - regular expression modules](https://stackoverflow.com/questions/3258847/what-are-good-perl-pattern-matching-regex-modules)
* [metacpan - String::Approx](https://metacpan.org/pod/String::Approx) - Perl extension for approximate matching (fuzzy matching)
* [metacpan - Tie::IxHash](https://metacpan.org/pod/Tie::IxHash) - ordered associative arrays for Perl
## Two file processing First, a bit about `$#ARGV` and hash variables ```bash $ # $#ARGV can be used to know which file is being processed $ perl -lne 'print $#ARGV' <(seq 2) <(seq 3) <(seq 1) 1 1 0 0 0 -1 $ # creating hash variable $ # checking if a key is present using exists $ # or if value is known to evaluate to true $ perl -le '$h{"a"}=5; $h{"b"}=0; $h{1}="abc"; print "key:a value=", $h{"a"}; print "key:b present" if exists $h{"b"}; print "key:1 present" if $h{1}' key:a value=5 key:b present key:1 present ```
#### Comparing whole lines

Consider the following test files

```bash
$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow

$ cat colors_2.txt
Black
Blue
Green
Red
White
```

* For two files as input, `$#ARGV` will be `0` only when first file is being processed
* Using `next` will skip rest of code
* entire line is used as key

```bash
$ # common lines
$ # note that all duplicates matching in second file would get printed
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ # same as: awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next} print if $h{$_}' colors_1.txt colors_2.txt
Blue
Red
$ # can also use: perl -ne '!$#ARGV ? $h{$_}=1 : $h{$_} && print'

$ # lines from colors_2.txt not present in colors_1.txt
$ # same as: grep -vFxf colors_1.txt colors_2.txt
$ # same as: awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next} print if !$h{$_}' colors_1.txt colors_2.txt
Black
Green
White
```

* alternative constructs
* `<FILEHANDLE>` reads line(s) from the specified file
    * defaults to current file argument(includes stdin as well), so `<>` can be used as shortcut
    * `<STDIN>` will read only from stdin, there are also predefined handles for stdout/stderr
    * in list context, all the lines would be read
    * See [perldoc - I/O Operators](https://perldoc.perl.org/perlop.html#I%2fO-Operators) for details

```bash
$ # using if-else instead of next
$ perl -ne 'if(!$#ARGV){ $h{$_}=1 } else{ print if $h{$_} }' colors_1.txt colors_2.txt
Blue
Red

$ # read all lines of first file in BEGIN block
$ # <> reads a line from current file argument
$ # eof will ensure only first file is read
$ perl -ne 'BEGIN{ $h{<>}=1 while !eof; } print if $h{$_}' colors_1.txt colors_2.txt
Blue
Red
$ # this method also allows to easily reset line number
$ # close ARGV is similar to calling nextfile in GNU awk
$ perl -ne 'BEGIN{ $h{<>}=1 while !eof; close ARGV} print "$.\n" if $h{$_}' colors_1.txt colors_2.txt
2
4

$ # or pass 1st file content as STDIN, $. will be automatically reset as well
$ perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <colors_1.txt colors_2.txt
Blue
Red
```

#### Comparing specific fields

Consider the sample input file

```bash
$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
```

* single field
* For ex: only first field comparison instead of entire line as key

```bash
$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ # same as: awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}=1 } else{ print if $h{$F[0]} }' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

$ # if header is needed as well
$ # same as: awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' list1 marks.txt
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}=1; $.=0 } else{ print if $h{$F[0]} || $.==1 }' list1 marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
```

* multiple field comparison

```bash
$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # extract only lines matching both fields specified in list2
$ # same as: awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' list2 marks.txt
$ # default SUBSEP(stored in $;) is \034, same as GNU awk
$ perl -ane 'if(!$#ARGV){ $h{$F[0],$F[1]}=1 } else{ print if $h{$F[0],$F[1]} }' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67

$ # or use multidimensional hash
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}{$F[1]}=1 } else{ print if $h{$F[0]}{$F[1]} }' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
```

* field and value comparison

```bash
$ cat list3
ECE 70
EEE 65
CSE 80

$ # extract line matching Dept and minimum marks specified in list3
$ # same as: awk 'NR==FNR{d[$1]; m[$1]=$2; next} $1 in d && $3 >= m[$1]'
$ perl -ane 'if(!$#ARGV){ $d{$F[0]}=1; $m{$F[0]}=$F[1] } else{ print if $d{$F[0]} && $F[2]>=$m{$F[0]} }' list3 marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
```

* See also [stackoverflow - Fastest way to find lines of a text file from another larger text file](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash)
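The field and value comparison can probably be done with a single hash, since `exists` already tells whether a department was listed. A minimal sketch under that assumption (the `min_marks` helper name is hypothetical, not from the original text):

```bash
# hypothetical helper: min_marks LIST MARKS
# LIST has 'dept minimum' pairs, MARKS has 'dept name marks' rows
# the header line of MARKS is skipped because 'Dept' is not a key in %m
min_marks() {
    perl -ane 'if(!$#ARGV){ $m{$F[0]}=$F[1] }
               else{ print if exists($m{$F[0]}) && $F[2]>=$m{$F[0]} }' "$1" "$2"
}
```

Calling `min_marks list3 marks.txt` should print the same four lines as the two-hash version above. Using `exists` also keeps a department listed with a minimum of `0` from being treated as absent.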
#### Line number matching

```bash
$ # replace mth line in poem.txt with nth line from nums.txt
$ # assumes that there are at least n lines in nums.txt
$ # same as: awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "nums.txt"}
$ #                             FNR==m{$0=s} 1' poem.txt
$ m=3 n=2 perl -pe 'BEGIN{ $s=<> while $ENV{n}-- > 0; close ARGV} $_=$s if $.==$ENV{m}' nums.txt poem.txt
Roses are red,
Violets are blue,
-2
And so are you.

$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ # same as: awk -v file='nums.txt' '(getline num < file)==1 && num>0'
$ <nums.txt perl -ne 'print if <STDIN> > 0' fruits.txt
fruit   qty
banana  31
```
## Creating new fields * Number of fields in input record can be changed by simply manipulating `$#F` ```bash $ s='foo,bar,123,baz' $ # reducing fields $ # same as: awk -F, -v OFS=, '{NF=2} 1' $ echo "$s" | perl -F, -lane '$,=","; $#F=1; print @F' foo,bar $ # creating new empty field(s) $ # same as: awk -F, -v OFS=, '{NF=5} 1' $ echo "$s" | perl -F, -lane '$,=","; $#F=4; print @F' foo,bar,123,baz, $ # assigning to field greater than $#F will create empty fields as needed $ # same as: awk -F, -v OFS=, '{$7=42} 1' $ echo "$s" | perl -F, -lane '$,=","; $F[6]=42; print @F' foo,bar,123,baz,,,42 ``` * adding a field based on existing fields * See also [split](#split) and [Array operations](#array-operations) sections ```bash $ # adding a new 'Grade' field $ # same as: awk 'BEGIN{OFS="\t"; split("DCBAS",g,//)} $ # {NF++; $NF = NR==1 ? "Grade" : g[int($(NF-1)/10)-4]} 1' marks.txt $ perl -lane 'BEGIN{$,="\t"; @g = split //, "DCBAS"} $#F++; $F[-1] = $.==1 ? "Grade" : $g[$F[-2]/10 - 5]; print @F' marks.txt Dept Name Marks Grade ECE Raj 53 D ECE Joel 72 B EEE Moi 68 C CSE Surya 81 A EEE Tia 59 D ECE Om 92 S CSE Amy 67 C $ # alternate syntax: array initialization and appending array element $ perl -lane 'BEGIN{$,="\t"; @g = qw(D C B A S)} push @F, $.==1 ? "Grade" : $g[$F[-1]/10 - 5]; print @F' marks.txt ``` * two file example ```bash $ cat list4 Raj class_rep Amy sports_rep Tia placement_rep $ # same as: awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next} $ # {NF++; $NF = FNR==1 ? "Role" : $NF=r[$2]} 1' list4 marks.txt $ perl -lane 'if(!$#ARGV){ $r{$F[0]}=$F[1]; $.=0 } else{ push @F, $.==1 ? "Role" : $r{$F[1]}; print join "\t", @F }' list4 marks.txt Dept Name Marks Role ECE Raj 53 class_rep ECE Joel 72 EEE Moi 68 CSE Surya 81 EEE Tia 59 placement_rep ECE Om 92 CSE Amy 67 sports_rep ```
## Multiple file input * there is no gawk's `FNR/BEGINFILE/ENDFILE` equivalent in perl, but it can be worked around ```bash $ # same as: awk 'FNR==2' poem.txt greeting.txt $ # close ARGV will reset $. to 0 $ perl -ne 'print if $.==2; close ARGV if eof' poem.txt greeting.txt Violets are blue, Have a safe journey $ # same as: awk 'BEGINFILE{print "file: "FILENAME} ENDFILE{print $0"\n------"}' $ perl -lne 'print "file: $ARGV" if $.==1; print "$_\n------" and close ARGV if eof' poem.txt greeting.txt file: poem.txt And so are you. ------ file: greeting.txt Have a safe journey ------ ``` * workaround for gawk's `nextfile` * to skip remaining lines from current file being processed and move on to next file ```bash $ # same as: head -q -n1 and awk 'FNR>1{nextfile} 1' $ perl -pe 'close ARGV if $.>=1' poem.txt greeting.txt fruits.txt Roses are red, Hello there fruit qty $ # same as: awk 'tolower($1) ~ /red/{print FILENAME; nextfile}' * $ perl -lane 'print $ARGV and close ARGV if $F[0] =~ /red/i' * colors_1.txt colors_2.txt ```
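The same `close ARGV if eof` idea can give a per-file line count, roughly what `ENDFILE{print FILENAME ": " FNR}` would do in gawk. A minimal sketch (the `lines_per_file` name is hypothetical):

```bash
# hypothetical helper: print 'filename: number_of_lines' for each file argument
# $. holds the line number of the last line read when eof is true
# close ARGV then resets $. before the next file starts
lines_per_file() {
    perl -ne 'if(eof){ print "$ARGV: $.\n"; close ARGV }' "$@"
}
```

For example, `lines_per_file poem.txt greeting.txt` would print one `name: count` line per file. Empty files produce no output in this sketch, because no line is ever read from them.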
## Dealing with duplicates * retain only first copy of duplicates ```bash $ cat duplicates.txt abc 7 4 food toy **** abc 7 4 test toy 123 good toy **** $ # whole line, same as: awk '!seen[$0]++' duplicates.txt $ perl -ne 'print if !$seen{$_}++' duplicates.txt abc 7 4 food toy **** test toy 123 good toy **** $ # particular column, same as: awk '!seen[$2]++' duplicates.txt $ perl -ane 'print if !$seen{$F[1]}++' duplicates.txt abc 7 4 food toy **** $ # total count, same as: awk '!seen[$2]++{c++} END{print +c}' duplicates.txt $ perl -lane '$c++ if !$seen{$F[1]}++; END{print $c+0}' duplicates.txt 2 ``` * if input is so large that integer numbers can overflow * See also [perldoc - bignum](https://perldoc.perl.org/bignum.html) ```bash $ perl -le 'print "equal" if 102**33==1922231403943151831696327756255167543169267432774552016351387451392' $ # -M option here enables the use of bignum module $ perl -Mbignum -le 'print "equal" if 102**33==1922231403943151831696327756255167543169267432774552016351387451392' equal $ # avoid unnecessary counting altogether $ # same as: awk '!($2 in seen); {seen[$2]}' duplicates.txt $ perl -ane 'print if !$seen{$F[1]}; $seen{$F[1]}=1' duplicates.txt abc 7 4 food toy **** $ # same as: awk -M '!($2 in seen){c++} {seen[$2]} END{print +c}' duplicates.txt $ perl -Mbignum -lane '$c++ if !$seen{$F[1]}; $seen{$F[1]}=1; END{print $c+0}' duplicates.txt 2 ``` * multiple fields * See also [unix.stackexchange - based on same fields that could be in different order](https://unix.stackexchange.com/questions/325619/delete-lines-that-contain-the-same-information-but-in-different-order) ```bash $ # same as: awk '!seen[$2,$3]++' duplicates.txt $ # default SUBSEP(stored in $;) is \034, same as GNU awk $ perl -ane 'print if !$seen{$F[1],$F[2]}++' duplicates.txt abc 7 4 food toy **** test toy 123 $ # or use multidimensional key $ perl -ane 'print if !$seen{$F[1]}{$F[2]}++' duplicates.txt abc 7 4 food toy **** test toy 123 ``` * retaining specific copy ```bash $ # 
second occurrence of duplicate $ # same as: awk '++seen[$2]==2' duplicates.txt $ perl -ane 'print if ++$seen{$F[1]}==2' duplicates.txt abc 7 4 test toy 123 $ # third occurrence of duplicate $ # same as: awk '++seen[$2]==3' duplicates.txt $ perl -ane 'print if ++$seen{$F[1]}==3' duplicates.txt good toy **** $ # retaining only last copy of duplicate $ # reverse the input line-wise, retain first copy and then reverse again $ # same as: tac duplicates.txt | awk '!seen[$2]++' | tac $ tac duplicates.txt | perl -ane 'print if !$seen{$F[1]}++' | tac abc 7 4 good toy **** ``` * filtering based on duplicate count * allows to emulate [uniq](./sorting_stuff.md#uniq) command for specific fields ```bash $ # all duplicates based on 1st column $ # same as: awk 'NR==FNR{a[$1]++; next} a[$1]>1' duplicates.txt duplicates.txt $ perl -ane 'if(!$#ARGV){ $x{$F[0]}++ } else{ print if $x{$F[0]}>1 }' duplicates.txt duplicates.txt abc 7 4 abc 7 4 $ # more than 2 duplicates based on 2nd column $ # same as: awk 'NR==FNR{a[$2]++; next} a[$2]>2' duplicates.txt duplicates.txt $ perl -ane 'if(!$#ARGV){ $x{$F[1]}++ } else{ print if $x{$F[1]}>2 }' duplicates.txt duplicates.txt food toy **** test toy 123 good toy **** $ # only unique lines based on 3rd column $ # same as: awk 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt $ perl -ane 'if(!$#ARGV){ $x{$F[2]}++ } else{ print if $x{$F[2]}==1 }' duplicates.txt duplicates.txt test toy 123 ```
## Lines between two REGEXPs * This section deals with filtering lines bound by two *REGEXP*s (referred to as blocks) * For simplicity the two *REGEXP*s usually used in below examples are the strings **BEGIN** and **END**
#### All unbroken blocks

Consider the below sample input file, which doesn't have any broken blocks (i.e. **BEGIN** and **END** are always present in pairs)

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* Extracting lines between starting and ending *REGEXP*

```bash
$ # include both starting/ending REGEXP
$ # same as: awk '/BEGIN/{f=1} f; /END/{f=0}' range.txt
$ perl -ne '$f=1 if /BEGIN/; print if $f; $f=0 if /END/' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # can also use: perl -ne 'print if /BEGIN/../END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
```

* other variations

```bash
$ # same as: awk '/END/{f=0} f; /BEGIN/{f=1}' range.txt
$ perl -ne '$f=0 if /END/; print if $f; $f=1 if /BEGIN/' range.txt
1234
6789
a
b
c

$ # check out what these do:
$ perl -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if $f' range.txt
$ perl -ne 'print if $f; $f=0 if /END/; $f=1 if /BEGIN/' range.txt
```

* Extracting lines other than lines between the two *REGEXP*s

```bash
$ # same as: awk '/BEGIN/{f=1} !f; /END/{f=0}' range.txt
$ # can also use: perl -ne 'print if !(/BEGIN/../END/)' range.txt
$ perl -ne '$f=1 if /BEGIN/; print if !$f; $f=0 if /END/' range.txt
foo
bar
baz

$ # the other three cases would be
$ perl -ne '$f=0 if /END/; print if !$f; $f=1 if /BEGIN/' range.txt
$ perl -ne 'print if !$f; $f=1 if /BEGIN/; $f=0 if /END/' range.txt
$ perl -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if !$f' range.txt
```
#### Specific blocks * Getting first block ```bash $ # same as: awk '/BEGIN/{f=1} f; /END/{exit}' range.txt $ perl -ne '$f=1 if /BEGIN/; print if $f; exit if /END/' range.txt BEGIN 1234 6789 END $ # use other tricks discussed in previous section as needed $ # same as: awk '/END/{exit} f; /BEGIN/{f=1}' range.txt $ perl -ne 'exit if /END/; print if $f; $f=1 if /BEGIN/' range.txt 1234 6789 ``` * Getting last block ```bash $ # reverse input linewise, change the order of REGEXPs, finally reverse again $ # same as: tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac $ tac range.txt | perl -ne '$f=1 if /END/; print if $f; exit if /BEGIN/' | tac BEGIN a b c END $ # or, save the blocks in a buffer and print the last one alone $ # same as: awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}' $ seq 30 | perl -ne 'if(/4/){$f=1; $b=$_; next} $b.=$_ if $f; $f=0 if /6/; END{print $b}' 24 25 26 ``` * Getting blocks based on a counter ```bash $ # get only 2nd block $ # same as: seq 30 | awk -v b=2 '/4/{c++} c==b{print; if(/6/) exit}' $ seq 30 | b=2 perl -ne '$c++ if /4/; if($c==$ENV{b}){print; exit if /6/}' 14 15 16 $ # to get all blocks greater than 'b' blocks $ # same as: seq 30 | awk -v b=1 '/4/{f=1; c++} f && c>b; /6/{f=0}' $ seq 30 | b=1 perl -ne '$f=1, $c++ if /4/; print if $f && $c>$ENV{b}; $f=0 if /6/' 14 15 16 24 25 26 ``` * excluding a particular block ```bash $ # excludes 2nd block $ # same as: seq 30 | awk -v b=2 '/4/{f=1; c++} f && c!=b; /6/{f=0}' $ seq 30 | b=2 perl -ne '$f=1, $c++ if /4/; print if $f && $c!=$ENV{b}; $f=0 if /6/' 4 5 6 24 25 26 ``` * extract block only if it matches another string as well ```bash $ # string to match inside block: 23 $ perl -ne 'if(/BEGIN/){$f=1; $m=0; $b=""}; $m=1 if $f && /23/; $b.=$_ if $f; if(/END/){print $b if $m; $f=0}' range.txt BEGIN 1234 6789 END $ # line to match inside block: 5 or 25 $ seq 30 | perl -ne 'if(/4/){$f=1; $m=0; $b=""}; $m=1 if $f && /^(5|25)$/; $b.=$_ if $f; if(/6/){print $b if $m; $f=0}' 4 5 6 24 
25 26 ```
#### Broken blocks * If there are blocks with ending *REGEXP* but without corresponding start, earlier techniques used will suffice * Consider the modified input file where starting *REGEXP* doesn't have corresponding ending ```bash $ cat broken_range.txt foo BEGIN 1234 6789 END bar BEGIN a b c baz $ # the file reversing trick comes in handy here as well $ # same as: tac broken_range.txt | awk '/END/{f=1} f; /BEGIN/{f=0}' | tac $ tac broken_range.txt | perl -ne '$f=1 if /END/; print if $f; $f=0 if /BEGIN/' | tac BEGIN 1234 6789 END ``` * But if both kinds of broken blocks are present, for ex: ```bash $ cat multiple_broken.txt qqqqqqq BEGIN foo BEGIN 1234 6789 END bar END 0-42-1 BEGIN a BEGIN b END xyzabc ``` then use buffers to accumulate the records and print accordingly ```bash $ # same as: awk '/BEGIN/{f=1; buf=$0; next} f{buf=buf ORS $0} $ # /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt $ perl -ne 'if(/BEGIN/){$f=1; $b=$_; next} $b.=$_ if $f; if(/END/){$f=0; print $b if $b; $b=""}' multiple_broken.txt BEGIN 1234 6789 END BEGIN b END $ # note how buffer is initialized as well as cleared $ # on matching beginning/end REGEXPs respectively $ # 'undef $b' can also be used here instead of $b="" ```
## Array operations * initialization ```bash $ # list example, each value is separated by comma $ perl -e '($x, $y) = (4, 5); print "$x:$y\n"' 4:5 $ # using list to initialize arrays, allows variable interpolation $ # ($x, $y) = ($y, $x) will swap variables :) $ perl -e '@nums = (4, 5, 84); print "@nums\n"' 4 5 84 $ perl -e '@nums = (4, 5, 84, "foo"); print "@nums\n"' 4 5 84 foo $ perl -e '$x=5; @y=(3, 2); @nums = ($x, "good", @y); print "@nums\n"' 5 good 3 2 $ # use qw to specify string elements separated by space, no interpolation $ perl -e '@nums = qw(4 5 84 "foo"); print "@nums\n"' 4 5 84 "foo" $ perl -e '@nums = qw(a $x @y); print "@nums\n"' a $x @y $ # use different delimiter as needed $ perl -e '@nums = qw/baz 1)foo/; print "@nums\n"' baz 1)foo ``` * accessing individual elements * See also [perldoc - functions for arrays](https://perldoc.perl.org/index-functions-by-cat.html#Functions-for-real-@ARRAYs) for push,pop,shift,unshift functions ```bash $ # index starts from 0 $ perl -le '@nums = (4, "foo", 2, "x"); print $nums[0]' 4 $ # note the use of $ when accessing individual element $ perl -le '@nums = (4, "foo", 2, "x"); print $nums[2]' 2 $ # to access elements from end, use -ve index from -1 $ perl -le '@nums = (4, "foo", 2, "x"); print $nums[-1]' x $ # index of last element in array $ perl -le '@nums = (4, "foo", 2, "x"); print $#nums' 3 $ # size of array, i.e total number of elements $ perl -le '@nums = (4, "foo", 2, "x"); $s=@nums; print $s' 4 $ perl -le '@nums = (4, "foo", 2, "x"); print scalar @nums' 4 ``` * array slices * See also [perldoc - Range Operators](https://perldoc.perl.org/perlop.html#Range-Operators) ```bash $ # note the use of @ when accessing more than one element $ echo 'a b c d' | perl -lane 'print "@F[0,-1,2]"' a d c $ # range operator $ echo 'a b c d' | perl -lane 'print "@F[1..2]"' b c $ # rotating elements $ echo 'a b c d' | perl -lane 'print "@F[1..$#F,0]"' b c d a $ # index needed can be given from another array too $ echo 'a b c 
d' | perl -lane '@i=(3,1); print "@F[@i]"' d b $ # easy swapping of columns $ perl -lane 'print join "\t", @F[1,0]' fruits.txt qty fruit 42 apple 31 banana 90 fig 6 guava ``` * range operator also allows handy initialization ```bash $ perl -le '@n = (12..17); print "@n"' 12 13 14 15 16 17 $ perl -le '@n = (l..ad); print "@n"' l m n o p q r s t u v w x y z aa ab ac ad ```
#### Iteration and filtering * See also [stackoverflow - extracting multiline text and performing substitution](https://stackoverflow.com/questions/47653826/awk-extracting-a-data-which-is-on-several-lines/47654406#47654406) ```bash $ # foreach will return each value one by one $ # can also use 'for' keyword instead of 'foreach' $ perl -le 'print $_*2 foreach (12..14)' 24 26 28 $ # iterate using index $ perl -le '@x = (a..e); foreach (0..$#x){print $x[$_]}' a b c d e $ # C-style for loop can be used as well $ perl -le '@x = (a..c); for($i=0;$i<=$#x;$i++){print $x[$i]}' a b c ``` * use `grep` for filtering array elements based on a condition * See also [unix.stackexchange - extract specific fields and use corresponding header text](https://unix.stackexchange.com/questions/397498/create-lists-of-words-according-to-binary-numbers/397504#397504) ```bash $ # as usual, $_ will get the value each iteration $ perl -le '$,=" "; print grep { /[35]/ } 2..26' 3 5 13 15 23 25 $ # alternate syntax $ perl -le '$,=" "; print grep /[35]/, 2..26' 3 5 13 15 23 25 $ # to get index instead of matches $ perl -le '$,=" "; @n=(2..26); print grep {$n[$_]=~/[35]/} 0..$#n' 1 3 11 13 21 23 $ # compare values $ s='23 756 -983 5' $ echo "$s" | perl -lane 'print join " ", grep $_<100, @F' 23 -983 5 $ # filters only those elements with successful substitution $ # note that it would modify array elements as well $ echo "$s" | perl -lane 'print join " ", grep s/3/E/, @F' 2E -98E ``` * more examples ```bash $ # filtering column(s) based on header $ perl -lane '@i = grep {$F[$_] eq "Name"} 0..$#F if $.==1; print @F[@i]' marks.txt Name Raj Joel Moi Surya Tia Om Amy $ cat split.txt foo,1:2:5,baz wry,4,look free,3:8,oh $ # print line if more than one column has a digit $ perl -F: -lane 'print if (grep /\d/, @F) > 1' split.txt foo,1:2:5,baz free,3:8,oh ``` * to get random element from array ```bash $ s='65 23 756 -983 5' $ echo "$s" | perl -lane 'print $F[rand @F]' 5 $ echo "$s" | perl -lane 'print 
$F[rand @F]' 23 $ echo "$s" | perl -lane 'print $F[rand @F]' -983 $ # in scalar context, size of array gets passed to rand $ # rand actually returns a float $ # which then gets converted to int index ```
#### Sorting * See [perldoc - sort](https://perldoc.perl.org/functions/sort.html) for details * `$a` and `$b` are special variables used for sorting, avoid using them as user defined variables ```bash $ # by default, sort does string comparison $ s='foo baz v22 aimed' $ echo "$s" | perl -lane 'print join " ", sort @F' aimed baz foo v22 $ # same as default sort $ echo "$s" | perl -lane 'print join " ", sort {$a cmp $b} @F' aimed baz foo v22 $ # descending order, note how $a and $b are switched $ echo "$s" | perl -lane 'print join " ", sort {$b cmp $a} @F' v22 foo baz aimed $ # functions can be used for custom sorting $ # lc lowercases string, so this sorts case insensitively $ perl -lane 'print join " ", sort {lc $a cmp lc $b} @F' poem.txt are red, Roses are blue, Violets is Sugar sweet, And are so you. ``` * sorting characters within word ```bash $ echo 'foobar' | perl -F -lane 'print sort @F' abfoor $ cat words.txt bot art are boat toe flee reed $ # words with characters in ascending order $ perl -F -lane 'print if (join "", sort @F) eq $_' words.txt bot art $ # words with characters in descending order $ perl -F -lane 'print if (join "", sort {$b cmp $a} @F) eq $_' words.txt toe reed ``` * for numeric comparison, use `<=>` instead of `cmp` ```bash $ s='23 756 -983 5' $ echo "$s" | perl -lane 'print join " ",sort {$a <=> $b} @F' -983 5 23 756 $ echo "$s" | perl -lane 'print join " ",sort {$b <=> $a} @F' 756 23 5 -983 $ # sorting strings based on their length $ s='floor bat to dubious four' $ echo "$s" | perl -lane 'print join ":",sort {length $a <=> length $b} @F' to:bat:four:floor:dubious ``` * sorting columns based on header ```bash $ # need to get indexes of order required for header, then use it for all lines $ perl -lane '@i = sort {$F[$a] cmp $F[$b]} 0..$#F if $.==1; print join "\t", @F[@i]' marks.txt Dept Marks Name ECE 53 Raj ECE 72 Joel EEE 68 Moi CSE 81 Surya EEE 59 Tia ECE 92 Om CSE 67 Amy $ perl -lane '@i = sort {$F[$b] cmp $F[$a]} 0..$#F if $.==1; 
print join "\t", @F[@i]' marks.txt Name Marks Dept Raj 53 ECE Joel 72 ECE Moi 68 EEE Surya 81 CSE Tia 59 EEE Om 92 ECE Amy 67 CSE ``` **Further Reading** * [perldoc - How do I sort a hash (optionally by value instead of key)?](https://perldoc.perl.org/perlfaq4.html#How-do-I-sort-a-hash-(optionally-by-value-instead-of-key)%3f) * [stackoverflow - sort the keys of a hash by value](https://stackoverflow.com/questions/10901084/how-to-sort-perl-hash-on-values-and-order-the-keys-correspondingly-in-two-array) * [stackoverflow - sort only from 2nd field, ignore header](https://stackoverflow.com/questions/48920626/sort-rows-in-csv-file-without-header-first-column) * [stackoverflow - sort based on group of lines](https://stackoverflow.com/questions/48925359/sorting-groups-of-lines)
#### Transforming * shuffling list elements ```bash $ s='23 756 -983 5' $ # note that this doesn't change the input array $ echo "$s" | perl -MList::Util=shuffle -lane 'print join " ", shuffle @F' 756 23 -983 5 $ echo "$s" | perl -MList::Util=shuffle -lane 'print join " ", shuffle @F' 5 756 23 -983 $ # randomizing file contents $ perl -MList::Util=shuffle -e 'print shuffle <>' poem.txt Sugar is sweet, And so are you. Violets are blue, Roses are red, $ # or if shuffle order is known $ seq 5 | perl -e '@lines=<>; print @lines[3,1,0,2,4]' 4 2 1 3 5 ``` * use `map` to transform every element ```bash $ echo '23 756 -983 5' | perl -lane 'print join " ", map {$_*$_} @F' 529 571536 966289 25 $ echo 'a b c' | perl -lane 'print join ",", map {qq/"$_"/} @F' "a","b","c" $ echo 'a b c' | perl -lane 'print join ",", map {uc qq/"$_"/} @F' "A","B","C" $ # changing the array itself $ perl -le '@s=(4, 245, 12); map {$_*$_} @s; print join " ", @s' 4 245 12 $ perl -le '@s=(4, 245, 12); map {$_ = $_*$_} @s; print join " ", @s' 16 60025 144 $ # ASCII int values for each character $ echo 'AaBbCc' | perl -F -lane 'print join " ", map ord, @F' 65 97 66 98 67 99 $ s='this is a sample sentence' $ # shuffle each word, split here converts each element to character array $ # join the characters after shuffling with empty string $ # finally print each changed element with space as separator $ echo "$s" | perl -MList::Util=shuffle -lane '$,=" "; print map {join "", shuffle split//} @F;' tshi si a mleasp ncstneee ``` * fun little unreadable script... ```bash $ cat para.txt Why cannot I go back to my ignorant days with wild imaginations and fantasies? Perhaps the answer lies in not being able to adapt to my freedom. Those little dreams, goal setting, anticipation of results, used to be my world. All joy within the soul and less dependent on outside world. But all these are absent for a long time now. Hope I can wake those dreams all over again. 
$ perl -MList::Util=shuffle -F'/([^a-zA-Z]+)/' -lane ' print map {@c=split//; $#c<3 || /[^a-zA-Z]/? $_ : join "",$c[0],(shuffle @c[1..$#c-1]),$c[-1]} @F;' para.txt Why coannt I go back to my inoagrnt dyas wtih wild imiaintangos and fatenasis? Phearps the awsenr lies in not bieng albe to aadpt to my fedoerm. Toshe llttie draems, goal stetnig, aaioiciptntn of rtuelss, uesd to be my wrlod. All joy witihn the suol and less dnenepedt on oiduste world. But all tsehe are abenst for a lnog tmie now. Hpoe I can wkae toshe daemrs all over aiagn. ``` * reverse array * See also [stackoverflow - apply tr and reverse to particular column](https://stackoverflow.com/questions/45571828/execute-bash-command-inside-awk-and-print-command-output/45572038#45572038) ```bash $ s='23 756 -983 5' $ echo "$s" | perl -lane 'print join " ", reverse @F' 5 -983 756 23 $ echo 'foobar' | perl -lne 'print reverse split//' raboof $ # can also use scalar context instead of using split $ echo 'foobar' | perl -lne '$x=reverse; print $x' raboof $ echo 'foobar' | perl -lne 'print scalar reverse' raboof ```
## Miscellaneous
#### split * the `-a` command line option uses `split` and automatically saves the results in `@F` array * default separator is `\s+` * by default acts on `$_` * and by default all splits are performed * See also [perldoc - split function](https://perldoc.perl.org/functions/split.html) ```bash $ echo 'a 1 b 2 c' | perl -lane 'print $F[2]' b $ echo 'a 1 b 2 c' | perl -lne '@x=split; print $x[2]' b $ # temp variable can be avoided by using list context $ echo 'a 1 b 2 c' | perl -lne 'print join ":", (split)[2,-1]' b:c $ # using digits as separator $ echo 'a 1 b 2 c' | perl -lne '@x=split /\d+/; print ":$x[1]:"' : b : $ # specifying maximum number of splits $ echo 'a 1 b 2 c' | perl -lne '@x=split /\h+/,$_,2; print "$x[0]:$x[1]:"' a:1 b 2 c: $ # specifying limit using -F option $ echo 'a 1 b 2 c' | perl -F'/\h+/,$_,2' -lane 'print "$F[0]:$F[1]:"' a:1 b 2 c: ``` * by default, trailing empty fields are stripped * specify a negative value to preserve trailing empty fields ```bash $ echo ':123::' | perl -lne 'print scalar split /:/' 2 $ echo ':123::' | perl -lne 'print scalar split /:/,$_,-1' 4 $ echo ':123::' | perl -F: -lane 'print scalar @F' 2 $ echo ':123::' | perl -F'/:/,$_,-1' -lane 'print scalar @F' 4 ``` * to save the separators as well, use capture groups ```bash $ echo 'a 1 b 2 c' | perl -lne '@x=split /(\d+)/; print "$x[1],$x[3]"' 1,2 $ # or, without the temp variable $ echo 'a 1 b 2 c' | perl -lne 'print join ",", (split /(\d+)/)[1,3]' 1,2 $ # same can be done for -F option $ echo 'a 1 b 2 c' | perl -F'(\d+)' -lane 'print "$F[1],$F[3]"' 1,2 ``` * single line to multiple line by splitting a column ```bash $ cat split.txt foo,1:2:5,baz wry,4,look free,3:8,oh $ perl -F, -ane 'print join ",", $F[0],$_,$F[2] for split /:/,$F[1]' split.txt foo,1,baz foo,2,baz foo,5,baz wry,4,look free,3,oh free,8,oh ``` * weird behavior if literal space character is used with `-F` option ```bash $ # only one element in @F array $ echo 'a 1 b 2 c' | perl -F'/b /' -lane 'print $F[1]' 
$ # space not being used by separator $ echo 'a 1 b 2 c' | perl -F'b ' -lane 'print $F[1]' 2 c $ # correct behavior $ echo 'a 1 b 2 c' | perl -F'b\x20' -lane 'print $F[1]' 2 c $ # errors out if space used inside character class $ echo 'a 1 b 2 c' | perl -F'/b[ ]/' -lane 'print $F[1]' Unmatched [ in regex; marked by <-- HERE in m//b[ <-- HERE /. $ echo 'a 1 b 2 c' | perl -lne '@x=split /b[ ]/; print $x[1]' 2 c ```
#### Fixed width processing ```bash $ # here 'a' indicates arbitrary binary data $ # the number that follows indicates length $ # the 'x' indicates characters to ignore, use length after 'x' if needed $ # and there are many other formats, see perldoc for details $ echo 'b 123 good' | perl -lne '@x = unpack("a1xa3xa4", $_); print $x[0]' b $ echo 'b 123 good' | perl -lne '@x = unpack("a1xa3xa4", $_); print $x[1]' 123 $ echo 'b 123 good' | perl -lne '@x = unpack("a1xa3xa4", $_); print $x[2]' good $ # unpack not always needed, can simply capture characters needed $ echo 'b 123 good' | perl -lne 'print /.{2}(.{3})/' 123 $ # or use substr to specify offset (starts from 0) and length $ echo 'b 123 good' | perl -lne 'print substr $_, 6, 4' good $ # substr can also be used for replacing $ echo 'b 123 good' | perl -lpe 'substr $_, 2, 3, "gleam"' b gleam good ``` **Further Reading** * [perldoc - tutorial on pack and unpack](https://perldoc.perl.org/perlpacktut.html) * [perldoc - substr](https://perldoc.perl.org/functions/substr.html) * [stackoverflow - extract columns from a fixed-width format](https://stackoverflow.com/questions/1494611/how-can-i-extract-columns-from-a-fixed-width-format-in-perl) * [stackoverflow - build fixed-width template from header](https://stackoverflow.com/questions/4911044/parse-fixed-width-files) * [stackoverflow - convert fixed-width to delimited format](https://stackoverflow.com/questions/43734981/display-column-from-empty-column-delimited-space-in-bash)
#### String and file replication ```bash $ # replicate each line $ seq 2 | perl -ne 'print $_ x 2' 1 1 2 2 $ # replicate a string $ perl -le 'print "abc" x 5' abcabcabcabcabc $ # works for lists too $ perl -le '@x = (3, 2, 1) x 2; print join " ",@x' 3 2 1 3 2 1 $ # replicating file $ wc -c poem.txt 65 poem.txt $ perl -0777 -ne 'print $_ x 100' poem.txt | wc -c 6500 ``` * the [perldoc - glob](https://perldoc.perl.org/functions/glob.html) function can be hacked to generate combinations of strings ```bash $ # typical use case $ # same as: echo *.log $ perl -le 'print join " ", glob q/*.log/' report.log $ # same as: echo *.{log,pl} $ perl -le 'print join " ", glob q/*.{log,pl}/' report.log code.pl sub_sq.pl $ # hacking $ # same as: echo {1,3}{a,b} $ perl -le '@x=glob q/{1,3}{a,b}/; print "@x"' 1a 1b 3a 3b $ # same as: echo {1,3}{1,3}{1,3} $ perl -le '@x=glob "{1,3}" x 3; print "@x"' 111 113 131 133 311 313 331 333 ```
#### transliteration * See `tr` under [perldoc - Quote-Like Operators](https://perldoc.perl.org/perlop.html#Quote-Like-Operators) section for details * similar to substitution, by default `tr` acts on `$_` variable and modifies it unless `r` modifier is specified * however, characters `$` and `@` are treated as literals - i.e no interpolation * similar to `sed`, one can also use `y` instead of `tr` ```bash $ # one-to-one mapping of characters, all occurrences are translated $ echo 'foo bar cat baz' | perl -pe 'tr/abc/123/' foo 21r 31t 21z $ # use - to represent a range in ascending order $ echo 'Hello World' | perl -pe 'tr/a-zA-Z/n-za-mN-ZA-M/' Uryyb Jbeyq $ echo 'Uryyb Jbeyq' | perl -pe 'tr|a-zA-Z|n-za-mN-ZA-M|' Hello World ``` * if arguments are of different lengths ```bash $ # when second argument is longer, the extra characters are ignored $ echo 'foo bar cat baz' | perl -pe 'tr/abc/1-9/' foo 21r 31t 21z $ # when first argument is longer $ # the last character of second argument gets padded to make it equal $ echo 'foo bar cat baz' | perl -pe 'tr/a-z/123/' 333 213 313 213 ``` * modifiers ```bash $ # no padding, absent mappings are deleted $ echo 'fob bar cat baz' | perl -pe 'tr/a-z/123/d' 2 21 31 21 $ echo 'Hello:123:World' | perl -pe 'tr/a-z//d' H:123:W $ # c modifier complements first argument characters $ echo 'Hello:123:World' | perl -lpe 'tr/a-z//cd' elloorld $ # s modifier to keep only one copy of repeated characters $ echo 'FFoo seed 11233' | perl -pe 'tr/a-z//s' FFo sed 11233 $ # when replacement is done as well, only replaced characters are squeezed $ # unlike 'tr -s' which squeezes characters specified by second argument $ echo 'FFoo seed 11233' | perl -pe 'tr/A-Z/a-z/s' foo seed 11233 $ perl -e '$x="food"; $y=$x=~tr/a-z/A-Z/r; print "x=$x\ny=$y\n"' x=food y=FOOD ``` * since `-` is used for character ranges, place it at the start/end to represent it literally * similarly, to represent `\` literally, use `\\` ```bash $ echo '/foo-bar/baz/report' | perl 
-pe 'tr/-a-z/_A-Z/' /FOO_BAR/BAZ/REPORT $ echo '/foo-bar/baz/report' | perl -pe 'tr|/-|\\_|' \foo_bar\baz\report ``` * return value is number of replacements made ```bash $ echo 'Hello there. How are you?' | grep -o '[a-z]' | wc -l 17 $ echo 'Hello there. How are you?' | perl -lne 'print tr/a-z//' 17 ``` * unicode examples ```bash $ echo 'hello!' | perl -CS -pe 'tr/a-z/\x{1d5ee}-\x{1d607}/' 𝗵𝗲𝗹𝗹𝗼! $ echo 'How are you?' | perl -Mopen=locale -Mutf8 -pe 'tr/a-zA-Z/𝗮-𝘇𝗔-𝗭/' 𝗛𝗼𝘄 𝗮𝗿𝗲 𝘆𝗼𝘂? ```
#### Executing external commands * External commands can be issued using `system` function * Output would be as usual on `stdout` unless redirected while calling the command ```bash $ perl -e 'system("echo Hello World")' Hello World $ # use q operator to avoid interpolation $ perl -e 'system q/echo $HOME/' /home/learnbyexample $ perl -e 'system q/wc poem.txt/' 4 13 65 poem.txt $ perl -e 'system q/seq 10 | paste -sd, > out.txt/' $ cat out.txt 1,2,3,4,5,6,7,8,9,10 $ cat f2 I bought two bananas and three mangoes $ echo 'f1,f2,odd.txt' | perl -F, -lane 'system "cat $F[1]"' I bought two bananas and three mangoes ``` * return value of `system` will have exit status information or `$?` can be used * see [perldoc - system](https://perldoc.perl.org/functions/system.html) for details ```bash $ perl -le '$es=system q/ls poem.txt/; print "$es"' poem.txt 0 $ perl -le 'system q/ls poem.txt/; print "exit status: $?"' poem.txt exit status: 0 $ perl -le 'system q/ls xyz.txt/; print "exit status: $?"' ls: cannot access 'xyz.txt': No such file or directory exit status: 512 ``` * to save result of external command, use backticks or `qx` operator * newline gets saved too, use `chomp` if needed ```bash $ perl -e '$lines = `wc -l < poem.txt`; print $lines' 4 $ perl -e '$nums = qx/seq 3/; print $nums' 1 2 3 ``` * See also [stackoverflow - difference between backticks, system, exec and open](https://stackoverflow.com/questions/799968/whats-the-difference-between-perls-backticks-system-and-exec)
## Further Reading * Manual and related * [perldoc - overview](https://perldoc.perl.org/index-overview.html) * [perldoc - faqs](https://perldoc.perl.org/index-faq.html) * [perldoc - tutorials](https://perldoc.perl.org/index-tutorials.html) * [perldoc - functions](https://perldoc.perl.org/index-functions.html) * [perldoc - special variables](https://perldoc.perl.org/perlvar.html) * [perldoc - perlretut](https://perldoc.perl.org/perlretut.html) * Tutorials and Q&A * [Perl one-liners explained](http://www.catonmat.net/series/perl-one-liners-explained) * [perl Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/perl?sort=votes&pageSize=15) * [regex FAQ on SO](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) * [regexone](https://regexone.com/) - interactive tutorial * [regexcrossword](https://regexcrossword.com/) - practice by solving crosswords, read 'How to play' section before you start * Alternatives * [bioperl](http://bioperl.org/howtos/index.html) * [ruby](https://www.ruby-lang.org/en/) * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed) ================================================ FILE: restructure_text.md ================================================ # Restructure text **Table of Contents** * [paste](#paste) * [Concatenating files column wise](#concatenating-files-column-wise) * [Interleaving lines](#interleaving-lines) * [Lines to multiple columns](#lines-to-multiple-columns) * [Different delimiters between columns](#different-delimiters-between-columns) * [Multiple lines to single row](#multiple-lines-to-single-row) * [Further reading for paste](#further-reading-for-paste) * [column](#column) * [Pretty printing tables](#pretty-printing-tables) * [Specifying different input delimiter](#specifying-different-input-delimiter) * [Further reading for column](#further-reading-for-column) * [pr](#pr) * [Converting lines to
columns](#converting-lines-to-columns) * [Changing PAGE_WIDTH](#changing-page_width) * [Combining multiple input files](#combining-multiple-input-files) * [Transposing a table](#transposing-a-table) * [Further reading for pr](#further-reading-for-pr) * [fold](#fold) * [Examples](#examples) * [Further reading for fold](#further-reading-for-fold)
## paste ```bash $ paste --version | head -n1 paste (GNU coreutils) 8.25 $ man paste PASTE(1) User Commands PASTE(1) NAME paste - merge lines of files SYNOPSIS paste [OPTION]... [FILE]... DESCRIPTION Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input. ... ```
#### Concatenating files column wise * By default, `paste` adds a TAB between corresponding lines of input files ```bash $ paste colors_1.txt colors_2.txt Blue Black Brown Blue Purple Green Red Red Teal White ``` * Specifying a different delimiter using `-d` * The `<()` syntax is [Process Substitution](http://mywiki.wooledge.org/ProcessSubstitution) * to put it simply - allows output of command to be passed as input file to another command without needing to manually create a temporary file ```bash $ paste -d, <(seq 5) <(seq 6 10) 1,6 2,7 3,8 4,9 5,10 $ # empty cells if number of lines is not same for all input files $ # -d\| can also be used $ paste -d'|' <(seq 3) <(seq 4 6) <(seq 7 10) 1|4|7 2|5|8 3|6|9 ||10 ``` * to paste without any character in between, use `\0` as delimiter * note that `\0` here doesn't mean the ASCII NUL character * can also use `-d ''` with `GNU paste` ```bash $ paste -d'\0' <(seq 3) <(seq 6 8) 16 27 38 ```
#### Interleaving lines * Interleave lines by using newline as delimiter ```bash $ paste -d'\n' <(seq 11 13) <(seq 101 103) 11 101 12 102 13 103 ```
#### Lines to multiple columns * Number of `-` specified determines number of output columns * Input lines can be passed only as stdin ```bash $ # single column to two columns $ seq 10 | paste -d, - - 1,2 3,4 5,6 7,8 9,10 $ # single column to five columns $ seq 10 | paste -d: - - - - - 1:2:3:4:5 6:7:8:9:10 $ # input redirection for file input $ paste -d, - - < colors_1.txt Blue,Brown Purple,Red Teal, ``` * Use `printf` trick if number of columns to specify is too large ```bash $ # prompt at end of line not shown for simplicity $ printf -- "- %.s" {1..5} - - - - - $ seq 10 | paste -d, $(printf -- "- %.s" {1..5}) 1,2,3,4,5 6,7,8,9,10 ```
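* Brace expansion such as `{1..5}` does not accept a variable, so the `printf` trick needs a different argument source when the column count is computed at run time. A sketch using `seq` for that, where `n` and `dashes` are hypothetical variable names chosen for illustration

```bash
# n is a hypothetical variable holding the desired number of columns
n=3
# seq "$n" supplies n dummy arguments, so printf emits '- ' n times
dashes=$(printf -- '- %.0s' $(seq "$n"))
# $dashes is left unquoted on purpose so that it splits into n separate '-' arguments
out=$(seq 9 | paste -d, $dashes)
echo "$out"
```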
#### Different delimiters between columns * For more than 2 columns, different delimiter character can be specified - passed as list to `-d` option ```bash $ # , is used between 1st and 2nd column $ # - is used between 2nd and 3rd column $ paste -d',-' <(seq 3) <(seq 4 6) <(seq 7 9) 1,4-7 2,5-8 3,6-9 $ # re-use list from beginning if not specified for all columns $ paste -d',-' <(seq 3) <(seq 4 6) <(seq 7 9) <(seq 10 12) 1,4-7,10 2,5-8,11 3,6-9,12 $ # another example $ seq 10 | paste -d':,' - - - - - 1:2,3:4,5 6:7,8:9,10 $ # so, with single delimiter, it is just re-used for all columns $ paste -d, <(seq 3) <(seq 4 6) <(seq 7 9) <(seq 10 12) 1,4,7,10 2,5,8,11 3,6,9,12 ``` * combination of `-d` and `/dev/null` (empty file) can give multi-character separation between columns * If this is too confusing to use, consider [pr](#pr) instead ```bash $ paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6) /dev/null /dev/null <(seq 7 9) 1 : 4 : 7 2 : 5 : 8 3 : 6 : 9 $ # or just use pr instead $ pr -mts' : ' <(seq 3) <(seq 4 6) <(seq 7 9) 1 : 4 : 7 2 : 5 : 8 3 : 6 : 9 $ # but paste would allow different delimiters ;) $ paste -d' : - ' <(seq 3) /dev/null /dev/null <(seq 4 6) /dev/null /dev/null <(seq 7 9) 1 : 4 - 7 2 : 5 - 8 3 : 6 - 9 $ # pr would need two invocations $ pr -mts' : ' <(seq 3) <(seq 4 6) | pr -mts' - ' - <(seq 7 9) 1 : 4 - 7 2 : 5 - 8 3 : 6 - 9 ``` * example to show using empty file instead of `/dev/null` ```bash $ # assuming file named e doesn't exist $ touch e $ # or use this, will empty contents even if file named e already exists :P $ > e $ paste -d' : - ' <(seq 3) e e <(seq 4 6) e e <(seq 7 9) 1 : 4 - 7 2 : 5 - 8 3 : 6 - 9 ```
#### Multiple lines to single row ```bash $ paste -sd, colors_1.txt Blue,Brown,Purple,Red,Teal $ # multiple files each gets a row $ paste -sd: colors_1.txt colors_2.txt Blue:Brown:Purple:Red:Teal Black:Blue:Green:Red:White $ # multiple input files need not have same number of lines $ paste -sd, <(seq 3) <(seq 5 9) 1,2,3 5,6,7,8,9 ``` * Often used to serialize multiple line output from another command ```bash $ sort -u colors_1.txt colors_2.txt | paste -sd, Black,Blue,Brown,Green,Purple,Red,Teal,White ``` * For multiple character delimiter, post-process if separator is unique or use another tool like `perl` ```bash $ seq 10 | paste -sd, 1,2,3,4,5,6,7,8,9,10 $ # post-process $ seq 10 | paste -sd, | sed 's/,/ : /g' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 $ # using perl alone $ seq 10 | perl -pe 's/\n/ : / if(!eof)' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 ```
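* The delimiter list given to `-d` is cycled with `-s` too, so ending the list with `\n` is one way to serialize a fixed number of lines per row. A small sketch, assuming `GNU paste`

```bash
# ',' joins 1st and 2nd line, '\n' ends the row, then the list is re-used
out=$(seq 6 | paste -sd',\n')
echo "$out"
```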
#### Further reading for paste * `man paste` and `info paste` for more options and detailed documentation * [paste Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/paste?sort=votes&pageSize=15)
## column ```bash COLUMN(1) BSD General Commands Manual COLUMN(1) NAME column — columnate lists SYNOPSIS column [-entx] [-c columns] [-s sep] [file ...] DESCRIPTION The column utility formats its input into multiple columns. Rows are filled before columns. Input is taken from file operands, or, by default, from the standard input. Empty lines are ignored unless the -e option is used. ... ```
#### Pretty printing tables * by default whitespace is input delimiter ```bash $ cat dishes.txt North alootikki baati khichdi makkiroti poha South appam bisibelebath dosa koottu sevai West dhokla khakhra modak shiro vadapav East handoguri litti momo rosgulla shondesh $ column -t dishes.txt North alootikki baati khichdi makkiroti poha South appam bisibelebath dosa koottu sevai West dhokla khakhra modak shiro vadapav East handoguri litti momo rosgulla shondesh ``` * often useful to get neatly aligned columns from output of another command ```bash $ paste fruits.txt price.txt Fruits Price apple 182 guava 90 watermelon 35 banana 72 pomegranate 280 $ paste fruits.txt price.txt | column -t Fruits Price apple 182 guava 90 watermelon 35 banana 72 pomegranate 280 ```
#### Specifying different input delimiter * Use `-s` to specify input delimiter * Use `-n` to prevent merging empty cells * From `man column` "This option is a Debian GNU/Linux extension" ```bash $ paste -d, <(seq 3) <(seq 5 9) <(seq 11 13) 1,5,11 2,6,12 3,7,13 ,8, ,9, $ paste -d, <(seq 3) <(seq 5 9) <(seq 11 13) | column -s, -t 1 5 11 2 6 12 3 7 13 8 9 $ paste -d, <(seq 3) <(seq 5 9) <(seq 11 13) | column -s, -nt 1 5 11 2 6 12 3 7 13 8 9 ```
#### Further reading for column * `man column` for more options and detailed documentation * [column Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/columns?sort=votes&pageSize=15) * More examples [here](http://www.commandlinefu.com/commands/using/column/sort-by-votes)
## pr ```bash $ pr --version | head -n1 pr (GNU coreutils) 8.25 $ man pr PR(1) User Commands PR(1) NAME pr - convert text files for printing SYNOPSIS pr [OPTION]... [FILE]... DESCRIPTION Paginate or columnate FILE(s) for printing. With no FILE, or when FILE is -, read standard input. ... ``` * `Paginate` is not covered, examples related only to `columnate` * For example, default invocation on a file would add a header, etc ```bash $ # truncated output shown $ pr fruits.txt 2017-04-21 17:49 fruits.txt Page 1 Fruits apple guava watermelon banana pomegranate ``` * Following sections will use `-t` to omit page headers and trailers
#### Converting lines to columns * With [paste](#lines-to-multiple-columns), changing input file rows to column(s) is possible only with consecutive lines * `pr` can do that as well as split entire file itself according to number of columns needed * And `-s` option in `pr` allows multi-character output delimiter * As usual, examples to better show the functionalities ```bash $ # note how the input got split into two and resulting splits joined by , $ seq 6 | pr -2ts, 1,4 2,5 3,6 $ # note how two consecutive lines gets joined by , $ seq 6 | paste -d, - - 1,2 3,4 5,6 ``` * Default **PAGE_WIDTH** is 72 characters, so each column gets 72 divided by number of columns unless `-s` is used ```bash $ # 3 columns, so each column width is 24 characters $ seq 9 | pr -3t 1 4 7 2 5 8 3 6 9 $ # using -s, desired delimiter can be specified $ seq 9 | pr -3ts' ' 1 4 7 2 5 8 3 6 9 $ seq 9 | pr -3ts' : ' 1 : 4 : 7 2 : 5 : 8 3 : 6 : 9 $ # default is TAB when using -s option with no arguments $ seq 9 | pr -3ts 1 4 7 2 5 8 3 6 9 ``` * Using `-a` to change consecutive rows, similar to `paste` ```bash $ seq 8 | pr -4ats: 1:2:3:4 5:6:7:8 $ # no output delimiter for empty cells $ seq 22 | pr -5ats, 1,2,3,4,5 6,7,8,9,10 11,12,13,14,15 16,17,18,19,20 21,22 $ # note output delimiter even for empty cells $ seq 22 | paste -d, - - - - - 1,2,3,4,5 6,7,8,9,10 11,12,13,14,15 16,17,18,19,20 21,22,,, ```
#### Changing PAGE_WIDTH * The default PAGE_WIDTH is 72 * The formula `(col-1)*len(delimiter) + col` seems to work in determining minimum PAGE_WIDTH required for multiple column output * `col` is number of columns required ```bash $ # (36-1)*1 + 36 = 71, so within PAGE_WIDTH limit $ seq 74 | pr -36ats, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36 37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72 73,74 $ # (37-1)*1 + 37 = 73, more than default PAGE_WIDTH limit $ seq 74 | pr -37ats, pr: page width too narrow ``` * Use `-w` to specify a different PAGE_WIDTH * The `-J` option turns off truncation ```bash $ # (37-1)*1 + 37 = 73 $ seq 74 | pr -J -w73 -37ats, 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37 38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74 $ # (3-1)*4 + 3 = 11 $ seq 6 | pr -J -w10 -3ats'::::' pr: page width too narrow $ seq 6 | pr -J -w11 -3ats'::::' 1::::2::::3 4::::5::::6 $ # if calculating is difficult, simply use a large number $ seq 6 | pr -J -w500 -3ats'::::' 1::::2::::3 4::::5::::6 ```
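* The formula above can be wrapped in a small shell function so that `-w` is calculated instead of guessed. A sketch only, `min_width` is a hypothetical helper name and the formula is the empirical one from this section, not something documented by `pr`

```bash
# hypothetical helper: min_width COLS DELIM
# applies (col-1)*len(delimiter) + col
min_width() { echo $(( ($1 - 1) * ${#2} + $1 )); }

w=$(min_width 3 '::::')
out=$(seq 6 | pr -J -w"$w" -3ats'::::')
echo "$w"
echo "$out"
```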
#### Combining multiple input files * Use `-m` option to combine multiple files in parallel, similar to `paste` ```bash $ # 2 columns, so each column width is 36 characters $ pr -mt fruits.txt price.txt Fruits Price apple 182 guava 90 watermelon 35 banana 72 pomegranate 280 $ # default is TAB when using -s option with no arguments $ pr -mts <(seq 3) <(seq 4 6) <(seq 7 10) 1 4 7 2 5 8 3 6 9 10 $ # double TAB as separator $ # shell expands $'\t\t' before command is executed $ pr -mts$'\t\t' colors_1.txt colors_2.txt Blue Black Brown Blue Purple Green Red Red Teal White ``` * For interleaving, specify newline as separator ```bash $ pr -mts$'\n' fruits.txt price.txt Fruits Price apple 182 guava 90 watermelon 35 banana 72 pomegranate 280 ```
#### Transposing a table ```bash $ # delimiter is single character, so easy to use tr to change it to newline $ cat dishes.txt North alootikki baati khichdi makkiroti poha South appam bisibelebath dosa koottu sevai West dhokla khakhra modak shiro vadapav East handoguri litti momo rosgulla shondesh $ # 4 columns, so each column width is 18 characters $ # $(wc -l < dishes.txt) gives number of columns required $ tr ' ' '\n' < dishes.txt | pr -$(wc -l < dishes.txt)t North South West East alootikki appam dhokla handoguri baati bisibelebath khakhra litti khichdi dosa modak momo makkiroti koottu shiro rosgulla poha sevai vadapav shondesh ``` * Pipe the output to `column` if spacing is too much ```bash $ tr ' ' '\n' < dishes.txt | pr -$(wc -l < dishes.txt)t | column -t North South West East alootikki appam dhokla handoguri baati bisibelebath khakhra litti khichdi dosa modak momo makkiroti koottu shiro rosgulla poha sevai vadapav shondesh ```
#### Further reading for pr * `man pr` and `info pr` for more options and detailed documentation * More examples [here](http://docstore.mik.ua/orelly/unix3/upt/ch21_15.htm)
## fold ```bash $ fold --version | head -n1 fold (GNU coreutils) 8.25 $ man fold FOLD(1) User Commands FOLD(1) NAME fold - wrap each input line to fit in specified width SYNOPSIS fold [OPTION]... [FILE]... DESCRIPTION Wrap input lines in each FILE, writing to standard output. With no FILE, or when FILE is -, read standard input. ... ```
#### Examples ```bash $ nl story.txt 1 The princess of a far away land fought bravely to rescue a travelling group from bandits. And the happy story ends here. Have a nice day. 2 Still here? okay, read on: The prince of Happalakkahuhu wished he could be as brave as his sister and vowed to train harder $ # default folding width is 80 $ fold story.txt The princess of a far away land fought bravely to rescue a travelling group from bandits. And the happy story ends here. Have a nice day. Still here? okay, read on: The prince of Happalakkahuhu wished he could be as br ave as his sister and vowed to train harder $ fold story.txt | nl 1 The princess of a far away land fought bravely to rescue a travelling group from 2 bandits. And the happy story ends here. Have a nice day. 3 Still here? okay, read on: The prince of Happalakkahuhu wished he could be as br 4 ave as his sister and vowed to train harder ``` * `-s` option breaks at spaces to avoid word splitting ```bash $ fold -s story.txt The princess of a far away land fought bravely to rescue a travelling group from bandits. And the happy story ends here. Have a nice day. Still here? okay, read on: The prince of Happalakkahuhu wished he could be as brave as his sister and vowed to train harder ``` * Use `-w` to change default width ```bash $ fold -s -w60 story.txt The princess of a far away land fought bravely to rescue a travelling group from bandits. And the happy story ends here. Have a nice day. Still here? okay, read on: The prince of Happalakkahuhu wished he could be as brave as his sister and vowed to train harder ```
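* A shorter input makes the difference between plain folding and `-s` easier to see. A minimal sketch with made-up sample strings

```bash
# without -s, lines are cut at exactly the given width
plain=$(echo 'abcdefghij' | fold -w4)
echo "$plain"

# with -s, the break moves back to the last space that fits within the width
# note that the space itself stays at the end of the first output line
wrapped=$(echo 'hello brave new world' | fold -s -w12)
echo "$wrapped"
```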
#### Further reading for fold * `man fold` and `info fold` for more options and detailed documentation ================================================ FILE: ruby_one_liners.md ================================================


--- :information_source: :information_source: This chapter has been converted into a better formatted ebook - https://learnbyexample.github.io/learn_ruby_oneliners/. The ebook also has content updated for newer version of `ruby`, extra chapter for parsing json/csv/xml, includes exercises, solutions, etc. For markdown source and links to buy pdf/epub versions, see: https://github.com/learnbyexample/learn_ruby_oneliners ---


# Ruby one liners **Table of Contents** * [Executing Ruby code](#executing-ruby-code) * [Simple search and replace](#simple-search-and-replace) * [inplace editing](#inplace-editing) * [Line filtering](#line-filtering) * [Regular expressions based filtering](#regular-expressions-based-filtering) * [Fixed string matching](#fixed-string-matching) * [Line number based filtering](#line-number-based-filtering) * [Field processing](#field-processing) * [Field comparison](#field-comparison) * [Specifying different input field separator](#specifying-different-input-field-separator) * [Specifying different output field separator](#specifying-different-output-field-separator) * [Changing record separators](#changing-record-separators) * [Input record separator](#input-record-separator) * [Output record separator](#output-record-separator) * [Multiline processing](#multiline-processing) * [Ruby regular expressions](#ruby-regular-expressions) * [gotchas and tricks](#gotchas-and-tricks) * [Backslash sequences](#backslash-sequences) * [Non-greedy quantifier](#non-greedy-quantifier) * [Lookarounds](#lookarounds) * [Special capture groups](#special-capture-groups) * [Modifiers](#modifiers) * [Code in replacement section](#code-in-replacement-section) * [Quoting metacharacters](#quoting-metacharacters) * [Two file processing](#two-file-processing) * [Comparing whole lines](#comparing-whole-lines) * [Comparing specific fields](#comparing-specific-fields) * [Line number matching](#line-number-matching) * [Creating new fields](#creating-new-fields) * [Multiple file input](#multiple-file-input) * [Dealing with duplicates](#dealing-with-duplicates) * [using uniq method](#using-uniq-method) * [Lines between two REGEXPs](#lines-between-two-regexps) * [All unbroken blocks](#all-unbroken-blocks) * [Specific blocks](#specific-blocks) * [Broken blocks](#broken-blocks) * [Array operations](#array-operations) * [Filtering](#filtering) * [Sorting](#sorting) * [Transforming](#transforming) * 
[Miscellaneous](#miscellaneous) * [split](#split) * [Fixed width processing](#fixed-width-processing) * [String and file replication](#string-and-file-replication) * [transliteration](#transliteration) * [Executing external commands](#executing-external-commands) * [Further Reading](#further-reading)
``` $ ruby --version ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-linux] $ man ruby RUBY(1) Ruby Programmers Reference Guide RUBY(1) NAME ruby — Interpreted object-oriented scripting language SYNOPSIS ruby [--copyright] [--version] [-SUacdlnpswvy] [-0[octal]] [-C directory] [-E external[:internal]] [-F[pattern]] [-I directory] [-K[c]] [-T[level]] [-W[level]] [-e command] [-i[extension]] [-r library] [-x[directory]] [--{enable|disable}-FEATURE] [--dump=target] [--verbose] [--] [program_file] [argument ...] DESCRIPTION Ruby is an interpreted scripting language for quick and easy object-ori‐ ented programming. It has many features to process text files and to do system management tasks (like in Perl). It is simple, straight-forward, and extensible. If you want a language for easy object-oriented programming, or you don't like the Perl ugliness, or you do like the concept of LISP, but don't like too many parentheses, Ruby might be your language of choice. ... ``` **Prerequisites and notes** * familiarity with programming concepts like variables, printing, control structures, arrays, etc * familiarity with regular expressions * this tutorial is primarily focussed on short programs that are easily usable from command line, similar to using `grep`, `sed`, `awk`, `perl` etc * unless otherwise specified, consider input as ASCII encoded text only * this is an attempt to translate [Perl chapter](./perl_the_swiss_knife.md) to `ruby`, I don't have prior experience of using `ruby`
## Executing Ruby code * One way is to put code in a file and use `ruby` command with filename as argument * another is to use [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) at beginning of script, make the file executable and directly run it * For short programs, one can use `-e` commandline option to provide code from command line itself * this entire chapter is about using `ruby` this way from commandline ```bash $ cat code.rb print "Hello Ruby\n" $ ruby code.rb Hello Ruby $ # same as: perl -e 'print "Hello Perl\n"' $ ruby -e 'print "Hello Ruby\n"' Hello Ruby $ # multiple statements can be issued separated by ; $ # puts adds newline character if input doesn't end with a newline $ # similar to: perl -E '$x=25; $y=12; say $x**$y' $ ruby -e 'x=25; y=12; puts x**y' 59604644775390625 ``` **Further Reading** * `ruby -h` for summary of options * [explainshell](https://explainshell.com/explain?cmd=ruby+-F+-l+-anpe+-i+-0) - to quickly get information without having to traverse through the docs * [ruby-lang documentation](https://www.ruby-lang.org/en/documentation/) - manuals, tutorials and references
## Simple search and replace * More detailed examples with regular expressions will be covered in later sections * Just like other text processing commands, `ruby` will automatically loop over input line by line when `-n` or `-p` option is used * like `sed`, the `-n` option won't print the record * `-p` will print the record, including any changes made * default record separator is newline character * `$_` will contain the input record content, including the record separator (like `perl` and unlike `sed/awk`) * and similar to other commands, `ruby` will work with both stdin and file input * See other chapters for examples of [seq](./miscellaneous.md#seq), [paste](./restructure_text.md#paste), etc ```bash $ # sample stdin data $ seq 10 | paste -sd, 1,2,3,4,5,6,7,8,9,10 $ # change only first ',' to ' : ' $ # same as: perl -pe 's/,/ : /' $ seq 10 | paste -sd, | ruby -pe 'sub(/,/, " : ")' 1 : 2,3,4,5,6,7,8,9,10 $ # change all ',' to ' : ' $ # same as: perl -pe 's/,/ : /g' $ seq 10 | paste -sd, | ruby -pe 'gsub(/,/, " : ")' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 $ # sub(/,/, " : ") is shortcut for $_.sub!(/,/, " : ") $ # gsub(/,/, " : ") is shortcut for $_.gsub!(/,/, " : ") $ # sub! and gsub! modify the string in place $ # sub and gsub return the result, similar to perl's s///r modifier $ # () is optional, sub /,/, " : " can be used instead of sub(/,/, " : ") ```
#### inplace editing ```bash $ cat greeting.txt Hi there Have a nice day $ # original file gets preserved in 'greeting.txt.bkp' $ # same as: perl -i.bkp -pe 's/Hi/Hello/' greeting.txt $ ruby -i.bkp -pe 'sub(/Hi/, "Hello")' greeting.txt $ cat greeting.txt Hello there Have a nice day $ # use empty argument to -i with caution, changes made cannot be undone $ ruby -i -pe 'sub(/nice day/, "safe journey")' greeting.txt $ cat greeting.txt Hello there Have a safe journey ``` * Multiple input files are treated individually and changes are written back to respective files ```bash $ cat f1 I ate 3 apples $ cat f2 I bought two bananas and 3 mangoes $ # same as: perl -i.bkp -pe 's/3/three/' f1 f2 $ ruby -i.bkp -pe 'sub(/3/, "three")' f1 f2 $ cat f1 I ate three apples $ cat f2 I bought two bananas and three mangoes ``` **Further Reading** * [ruby-doc: Pre-defined variables](https://ruby-doc.org/core-2.5.0/doc/globals_rdoc.html#label-Pre-defined+variables) for explanation on `$_` and other such special variables * [ruby-doc: gsub](https://ruby-doc.org/core-2.5.0/String.html#method-i-gsub) for `gsub` syntax details
## Line filtering
#### Regular expressions based filtering * one way is to use `variable =~ /REGEXP/FLAGS` to check for a match * use `variable !~ /REGEXP/FLAGS` for negated match * by default acts on `$_` if variable is not specified * see [ruby-doc: Regexp](https://ruby-doc.org/core-2.5.0/Regexp.html) for documentation * as we need to print only selective lines, use `-n` option * by default, contents of `$_` will be printed if no argument is passed to `print` ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # same as: perl -ne 'print if /^[RS]/' poem.txt $ # /^[RS]/ is shortcut for $_ =~ /^[RS]/ $ ruby -ne 'print if /^[RS]/' poem.txt Roses are red, Sugar is sweet, $ # same as: perl -ne 'print if /and/i' poem.txt $ ruby -ne 'print if /and/i' poem.txt And so are you. $ # same as: perl -ne 'print if !/are/' poem.txt $ # !/are/ is shortcut for $_ !~ /are/ $ ruby -ne 'print if !/are/' poem.txt Sugar is sweet, $ # same as: perl -ne 'print if /are/ && !/so/' poem.txt $ ruby -ne 'print if /are/ && !/so/' poem.txt Roses are red, Violets are blue, ``` * using different delimiter * quoting from [ruby-doc: Percent Strings](https://ruby-doc.org/core-2.5.0/doc/syntax/literals_rdoc.html#label-Percent+Strings) > If you are using “(”, “[”, “{”, “<” you must close it with “)”, “]”, “}”, “>” respectively. You may use most other non-alphanumeric characters for percent string delimiters such as “%”, “|”, “^”, etc. ```bash $ cat paths.txt /foo/a/report.log /foo/y/power.log /foo/abc/errors.log $ # same as: perl -ne 'print if /\/foo\/a\//' paths.txt $ ruby -ne 'print if /\/foo\/a\//' paths.txt /foo/a/report.log $ # same as: perl -ne 'print if m#/foo/a/#' paths.txt $ ruby -ne 'print if %r#/foo/a/#' paths.txt /foo/a/report.log $ # same as: perl -ne 'print if !m#/foo/a/#' paths.txt $ ruby -ne 'print if !%r#/foo/a/#' paths.txt /foo/y/power.log /foo/abc/errors.log ```
#### Fixed string matching * To match strings literally, use `include?` method ```bash $ echo 'int a[5]' | ruby -ne 'print if /a[5]/' $ echo 'int a[5]' | ruby -ne 'print if $_.include?("a[5]")' int a[5] $ # however, string within double quotes gets interpolated $ ruby -e 'a=5; puts "value of a:\t#{a}"' value of a: 5 $ # use %q (covered later) to specify single quoted string $ echo 'int #{a}' | ruby -ne 'print if $_.include?(%q/#{a}/)' int #{a} $ # or pass the string as environment variable $ echo 'int #{a}' | s='#{a}' ruby -ne 'print if $_.include?(ENV["s"])' int #{a} ``` * restricting match to start/end of line ```bash $ cat eqns.txt a=b,a-b=c,c*d a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ # start of line $ s='a+b' ruby -ne 'print if $_.start_with?(ENV["s"])' eqns.txt a+b,pi=3.14,5e12 $ # end of line $ # -l option is needed to remove record separator (covered later) $ s='a+b' ruby -lne 'print if $_.end_with?(ENV["s"])' eqns.txt i*(t+9-g)/8,4-a+b ``` * `index` method returns matching position (starts at 0) and nil if not found * supports both string and regexp * optional 2nd argument allows to specify offset to start searching * See [ruby-doc: index](https://ruby-doc.org/core-2.5.0/String.html#method-i-index) for details ```bash $ # passing string $ ruby -ne 'print if $_.index("a+b")' eqns.txt a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ ruby -ne 'print if $_.index("a+b")==0' eqns.txt a+b,pi=3.14,5e12 $ # passing regexp $ ruby -ne 'print if $_.index(/[+*]/)<5' eqns.txt a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ s='a+b' ruby -ne 'print if $_.index(ENV["s"], 1)' eqns.txt i*(t+9-g)/8,4-a+b ```
#### Line number based filtering * special variable `$.` contains total records read so far, similar to `NR` in `awk` * as far as I've checked the docs, there's no equivalent of awk's `FNR` * See also [ruby-doc: eof](https://ruby-doc.org/core-2.5.0/IO.html#method-i-eof) ```bash $ # print 2nd line $ # same as: perl -ne 'print if $.==2' poem.txt $ ruby -ne 'print if $.==2' poem.txt Violets are blue, $ # print 2nd and 4th line $ # same as: perl -ne 'print if $.==2 || $.==4' poem.txt $ # can also use: ruby -ne 'print if [2, 4].include?($.)' poem.txt $ ruby -ne 'print if $.==2 || $.==4' poem.txt Violets are blue, And so are you. $ # print last line $ # same as: perl -ne 'print if eof' poem.txt $ # $< is like filehandle for input files/stdin given from commandline $ ruby -ne 'print if $<.eof' poem.txt And so are you. ``` * for large input, use `exit` to avoid unnecessary record processing * See [ruby-doc: Control Expressions](https://ruby-doc.org/core-2.5.0/doc/syntax/control_expressions_rdoc.html) for syntax details ```bash $ # same as: perl -ne 'if($.==234){print; exit}' $ seq 14323 14563435 | ruby -ne 'if $.==234 then print; exit end' 14556 $ # can also group the statements in () $ seq 14323 14563435 | ruby -ne '(print; exit) if $.==234' 14556 $ # mimicking head command $ # same as: head -n3 and sed '3q' or perl -pe 'exit if $.>3' $ seq 14 25 | ruby -pe 'exit if $.>3' 14 15 16 $ # same as: sed '3Q' and perl -pe 'exit if $.==3' $ seq 14 25 | ruby -pe 'exit if $.==3' 14 15 ``` * selecting range of lines * See [ruby-doc: Range](https://ruby-doc.org/core-2.5.0/Range.html) for syntax details ```bash $ # in this context, the range is compared against $. $ # same as: perl -ne 'print if 3..5' $ seq 14 25 | ruby -ne 'print if 3..5' 16 17 18 $ # selecting from particular line number to end of input $ # same as: perl -ne 'print if $.>=10' $ seq 14 25 | ruby -ne 'print if $.>=10' 23 24 25 ```
## Field processing * `-a` option will auto-split each input record based on one or more continuous white-space * similar to default behavior in `awk` and same as `perl -a` * See also [split](#split) section * Special variable array `$F` will contain all the elements, indexing starts from 0 * negative indexing is also supported, `-1` gives last element, `-2` gives last-but-one and so on * see [Array operations](#array-operations) section for examples on array usage ```bash $ cat fruits.txt fruit qty apple 42 banana 31 fig 90 guava 6 $ # print only first field, indexing starts from 0 $ # same as: perl -lane 'print $F[0]' fruits.txt $ ruby -ane 'puts $F[0]' fruits.txt fruit apple banana fig guava $ # print only second field $ # same as: perl -lane 'print $F[1]' fruits.txt $ ruby -ane 'puts $F[1]' fruits.txt qty 42 31 90 6 ``` * by default, leading and trailing whitespaces won't be considered when splitting the input record * same as `awk`'s default behavior and `perl -a` ```bash $ printf ' a ate b\tc \n' a ate b c $ printf ' a ate b\tc \n' | ruby -ane 'puts $F[0]' a $ printf ' a ate b\tc \n' | ruby -ane 'puts $F[-1]' c $ # number of elements $ printf ' a ate b\tc \n' | ruby -ane 'puts $F.length' 4 ```
#### Field comparison * operators `==`, `!=`, `<`, etc will work for both string/numeric comparison * unlike `perl`, numeric comparison for text requires converting to appropriate numeric format * See [ruby-doc: string methods](https://ruby-doc.org/core-2.5.0/String.html#method-i-to_c) for details ```bash $ # if first field exactly matches the string 'apple' $ # same as: perl -lane 'print $F[1] if $F[0] eq "apple"' fruits.txt $ ruby -ane 'puts $F[1] if $F[0] == "apple"' fruits.txt 42 $ # print first field if second field > 35 (excluding header) $ # same as: perl -lane 'print $F[0] if $F[1]>35 && $.>1' fruits.txt $ ruby -ane 'puts $F[0] if $F[1].to_i > 35 && $.>1' fruits.txt apple fig $ # print header and lines with qty < 35 $ # same as: perl -ane 'print if $F[1]<35 || $.==1' fruits.txt $ ruby -ane 'print if $F[1].to_i < 35 || $.==1' fruits.txt fruit qty banana 31 guava 6 $ # if first field does NOT contain 'a' $ # same as: perl -ane 'print if $F[0] !~ /a/' fruits.txt $ ruby -ane 'print if $F[0] !~ /a/' fruits.txt fruit qty fig 90 ```
#### Specifying different input field separator * by using `-F` command line option ```bash $ # second field where input field separator is : $ # same as: perl -F: -lane 'print $F[1]' $ echo 'foo:123:bar:789' | ruby -F: -ane 'puts $F[1]' 123 $ # last field, same as: perl -F: -lane 'print $F[-1]' $ echo 'foo:123:bar:789' | ruby -F: -ane 'puts $F[-1]' 789 $ # second last field, perl -F: -lane 'print $F[-2]' $ echo 'foo:123:bar:789' | ruby -F: -ane 'puts $F[-2]' bar $ # second and last field, same as: perl -F: -lane 'print "$F[1] $F[-1]"' $ echo 'foo:123:bar:789' | ruby -F: -ane 'puts "#{$F[1]} #{$F[-1]}"' 123 789 $ # use quotes to avoid clashes with shell special characters $ echo 'one;two;three;four' | ruby -F';' -ane 'puts $F[2]' three ``` * last element of `$F` array will contain the record separator as well * note that default `-a` option without `-F` won't have this issue as whitespaces at start/end are stripped * it doesn't make visual difference when `puts` is used as it adds newline only if not already present * if the record separator is not desired, use `-l` option to remove the record separator from input ```bash $ echo 'foo 123' | ruby -ane 'puts "#{$F[-1]}xyz"' 123xyz $ echo 'foo:123:bar:789' | ruby -F: -ane 'puts "#{$F[-1]}a"' 789 a $ echo 'foo:123:bar:789' | ruby -F: -lane 'puts "#{$F[-1]}a"' 789a ``` * Regular expressions based input field separator ```bash $ # same as: perl -F'\d+' -lane 'print $F[1]' $ echo 'Sample123string54with908numbers' | ruby -F'\d+' -ane 'puts $F[1]' string $ # first field will be empty as there is nothing before '{' $ echo '{foo} bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[0]' $ echo '{foo} bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[1]' foo $ echo '{foo} bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[2]' bar $ echo '{foo} bar=baz' | ruby -F'[{}= ]+' -ane 'puts $F[-1]' baz ``` * to process individual characters, simply use indexing on input string * See [ruby-doc: Encoding](https://ruby-doc.org/core-2.5.0/Encoding.html) for details on 
handling different string encodings ```bash $ # same as: perl -F -lane 'print $F[0]' $ echo 'apple' | ruby -ne 'puts $_[0]' a $ # if needed, chomp the record separator using -l $ # same as: perl -F -lane 'print $F[-1]' $ echo 'apple' | ruby -lne 'puts $_[-1]' e $ ruby -e 'puts Encoding.default_external' UTF-8 $ printf 'hi👍 how are you?' | ruby -ne 'puts $_[2]' 👍 $ # use -E option to explicitly specify external/internal encodings $ printf 'hi👍 how are you?' | ruby -E UTF-8:UTF-8 -ne 'puts $_[2]' 👍 ```
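
The field-splitting behavior above can be approximated in script form. This is a minimal sketch, not how `ruby` implements `-F`/`-a` internally; the helper name `fields` and its `chomp:` keyword are made up for illustration (the keyword plays the role of the `-l` option).

```ruby
# hypothetical helper: split a record into fields, roughly like -F<sep> -a
# chomp: true stands in for the -l option (record separator removed first)
def fields(record, sep, chomp: false)
  record = record.chomp if chomp
  record.split(sep)
end

line = "foo:123:bar:789\n"
p fields(line, ":")[-1]               # last field still carries the newline
p fields(line, ":", chomp: true)[-1]  # newline removed before splitting
p fields("Sample123string54with908numbers\n", /\d+/)[1]
```

Using `p` instead of `puts` makes the trailing newline in the last field visible.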
#### Specifying different output field separator * use `$,` to change separator between `print` arguments * could be remembered easily by noting that `,` is used to separate `print` arguments * note that `$,` doesn't affect `puts` which always uses newline as separator * the `-l` option is useful here in more than one way * it removes input record separator * and appends the record separator to `print` output ```bash $ # by default, the various arguments are concatenated $ echo 'foo:123:bar:789' | ruby -F: -lane 'print $F[1], $F[-1]' 123789 $ # change $, if different separator is needed $ # same as: perl -F: -lane '$,=" "; print $F[1], $F[-1]' $ echo 'foo:123:bar:789' | ruby -F: -lane '$,=" "; print $F[1], $F[-1]' 123 789 $ echo 'foo:123:bar:789' | ruby -F: -lane '$,="-"; print $F[1], $F[-1]' 123-789 $ # array's join method also uses $, $ # same as: perl -F: -lane '$,=" - "; print @F' $ echo 'foo:123:bar:789' | ruby -F: -lane '$,=" - "; print $F.join' foo - 123 - bar - 789 $ # or pass the separator as argument to join method $ echo 'foo:123:bar:789' | ruby -F: -lane 'print $F.join(" - ")' foo - 123 - bar - 789 $ # or the equivalent $ echo 'foo:123:bar:789' | ruby -F: -lane 'print $F * " - "' foo - 123 - bar - 789 ``` * use `BEGIN` if same separator is to be used for all lines * statements inside `BEGIN` are executed before processing any input text ```bash $ # same as: perl -lane 'BEGIN{$,=","} print @F' fruits.txt $ ruby -lane 'BEGIN{$,=","}; print $F.join' fruits.txt fruit,qty apple,42 banana,31 fig,90 guava,6 ```
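
In script form, passing the separator to `join` explicitly avoids depending on the global `$,` at all. A small sketch under that assumption; `join_fields` is a hypothetical helper name, and as far as I know newer Ruby versions warn when `$,` is assigned a non-nil value, so the explicit form may be the safer habit.

```ruby
# hypothetical helper: split on in_sep, optionally pick fields, join with out_sep
def join_fields(record, in_sep, out_sep, indexes = nil)
  f = record.chomp.split(in_sep)
  f = f.values_at(*indexes) if indexes   # negative indexes count from the end
  f.join(out_sep)
end

puts join_fields("foo:123:bar:789\n", ":", " ", [1, -1])
puts join_fields("foo:123:bar:789\n", ":", " - ")
```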
## Changing record separators
#### Input record separator * by default, newline character is used as input record separator * use `$/` to specify a different input record separator * unlike `gawk`, only string can be used, no regular expressions * for single character separator, can also use `-0` command line option which accepts octal value as argument * if `-l` option is also used * input record separator will be chomped from input record * earlier versions used `chop` instead of `chomp`. See [bugs.ruby-lang.org 12926](https://bugs.ruby-lang.org/issues/12926) * in addition, output record separator(ORS) will get whatever is current value of input record separator * so, order of `-l`, `-0` and/or `$/` usage becomes important ```bash $ s='this is a sample string' $ # space as input record separator, printing all records $ # ORS is newline as -l is used before $/ gets changed $ # same as: perl -lne 'BEGIN{$/=" "} print "$. $_"' $ printf "$s" | ruby -lne 'BEGIN{$/=" "}; print "#{$.} #{$_}"' 1 this 2 is 3 a 4 sample 5 string $ # print all records containing 'a' $ # same as: perl -l -0040 -ne 'print if /a/' $ printf "$s" | ruby -l -0040 -ne 'print if /a/' a sample $ # if the order is changed, ORS will be space, not newline $ printf "$s" | ruby -0040 -l -ne 'print if /a/' a sample ``` * `-0` option used without argument will use the ASCII NUL character as input record separator * `-0777` will cause entire file to be slurped ```bash $ printf 'foo\0bar\0' | cat -A foo^@bar^@$ $ # same as: perl -l -0 -ne 'print' $ # could be golfed to: ruby -l0pe '' $ printf 'foo\0bar\0' | ruby -l -0 -ne 'print' foo bar $ # replace first newline with '. ' $ # same as: perl -0777 -pe 's/\n/. /' greeting.txt $ ruby -0777 -pe 'sub(/\n/, ". ")' greeting.txt Hello there. 
Have a safe journey
```

* for paragraph mode (two or more consecutive newline characters), use `-00` or assign empty string to `$/`

Consider the below sample file

```bash
$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
```

* again, input record will have the separator too and using `-l` will chomp it
* however, if more than two consecutive newline characters separate the paragraphs, only two newlines will be preserved and the rest discarded
    * use `$/="\n\n"` to avoid this behavior

```bash
$ # print all paragraphs containing 'it'
$ # same as: perl -00 -ne 'print if /it/' sample.txt
$ ruby -00 -ne 'print if /it/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

$ # based on number of lines in each paragraph
$ # same as: perl -F'\n' -00 -ane 'print if $#F==0' sample.txt
$ ruby -F'\n' -00 -ane 'print if $F.length==1' sample.txt
Hello World

```

* Re-structuring paragraphs

```bash
$ # same as: perl -F'\n' -l -00 -ane 'print join ". ", @F' sample.txt
$ ruby -F'\n' -l -00 -ane 'print $F.join(". ")' sample.txt
Hello World
Good day. How are you
Just do-it. Believe it
Today is sunny. Not a bit funny. No doubt you like it too
Much ado about nothing. He he he
```

* multi-character separator

```bash
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah

$ # number of records, same as: perl -lne 'BEGIN{$/="Error:"} print $. if eof'
$ ruby -ne 'BEGIN{$/="Error:"}; puts $. 
if $<.eof' report.log 3 $ # print first record, same as: perl -lne 'BEGIN{$/="Error:"} print if $.==1' $ ruby -lne 'BEGIN{$/="Error:"}; print if $.==1' report.log blah blah $ # print a record if it contains given string $ # same as: perl -lne 'BEGIN{$/="Error:"} print "$/$_" if /surely/' $ ruby -lne 'BEGIN{$/="Error:"}; print $/,$_ if /surely/' report.log Error: something surely went wrong some text some more text blah blah blah ``` * Joining lines based on specific end of line condition ```bash $ cat msg.txt Hello there. It will rain to- day. Have a safe and pleasant jou- rney. $ # same as: perl -pe 'BEGIN{$/="-\n"} chomp' msg.txt $ ruby -pe 'BEGIN{$/="-\n"}; chomp' msg.txt Hello there. It will rain today. Have a safe and pleasant journey. ```
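
The record-splitting cases above (single character, paragraph mode, multi-character, joining hyphenated lines) can be tried on in-memory strings with `String#each_line`, which accepts a separator much like `$/`. A rough sketch only; `records` is a hypothetical helper and `chomp(sep)` stands in for what `-l` does to each record.

```ruby
# hypothetical helper: split text into records on sep and strip the separator
# sep = "" selects paragraph mode, similar to -00
def records(text, sep)
  text.each_line(sep).map { |r| r.chomp(sep) }
end

p records("this is a sample string", " ").grep(/a/)
p records("Hello World\n\nGood day\nHow are you\n", "")
p records("blah\nError: one\nmore\nError: two\n", "Error:").length
print records("It will rain to-\nday.\n", "-\n").join
```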
#### Output record separator * use `$\` to specify a different output record separator * applies to `print` but not `puts` ```bash $ # note that despite not setting $\, output has newlines $ # because the input record still has the input record separator $ seq 3 | ruby -ne 'print' 1 2 3 $ # same as: perl -ne 'BEGIN{$\="\n"} print' $ seq 3 | ruby -ne 'BEGIN{$\="\n"}; print' 1 2 3 $ seq 2 | ruby -ne 'BEGIN{$\="---\n"}; print' 1 --- 2 --- ``` * dynamically changing output record separator * **Note:** except `nil` and `false`, all other values evaluate to `true` * `0`, empty string/array/etc evaluate to `true` ```bash $ # note the use of -l to chomp the input record separator $ # same as: perl -lpe '$\ = $.%2 ? " " : "\n"' $ seq 6 | ruby -lpe '$\ = $.%2!=0 ? " " : "\n"' 1 2 3 4 5 6 $ # -l also sets the output record separator $ # but gets overridden by $\ $ # same as: perl -lpe '$\ = $.%3 ? "-" : "\n"' $ seq 6 | ruby -lpe '$\ = $.%3!=0 ? "-" : "\n"' 1-2-3 4-5-6 ```
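
The dynamic output record separator idea boils down to choosing the terminator per record. Below is a script-form sketch with made-up helper names (`join_every`, `truthy?`); it also shows why the one-liners need `!=0`, since `0` is not false in Ruby.

```ruby
# hypothetical helper: terminate every nth item with newline, others with sep
def join_every(items, n, sep)
  out = +""
  items.each_with_index do |item, i|
    out << item.to_s << ((i + 1) % n != 0 ? sep : "\n")
  end
  out
end

# only nil and false are falsy, so 0 and "" count as true
def truthy?(value)
  value ? true : false
end

print join_every(1..6, 3, "-")
p truthy?(0), truthy?(""), truthy?(nil)
```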
## Multiline processing * Processing consecutive lines * to keep the one-liner short, global variables(`$` prefix) are used here * See [ruby-doc: Global variables](https://ruby-doc.org/core-2.5.0/doc/syntax/assignment_rdoc.html#label-Global+Variables) for syntax details ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ # match two consecutive lines $ # same as: perl -ne 'print $p,$_ if /is/ && $p=~/are/; $p=$_' poem.txt $ ruby -ne 'print $p,$_ if /is/ && $p=~/are/; $p=$_' poem.txt Violets are blue, Sugar is sweet, $ # if only the second line is needed $ ruby -ne 'print if /is/ && $p=~/are/; $p=$_' poem.txt Sugar is sweet, $ # print if line matches a condition as well as condition for next 2 lines $ ruby -ne 'print $p2 if /is/ && $p1=~/blue/ && $p2=~/red/; $p2=$p1; $p1=$_' poem.txt Roses are red, ``` Consider this sample input file ```bash $ cat range.txt foo BEGIN 1234 6789 END bar BEGIN a b c END baz ``` * extracting lines around matching line * **Note** * default uninitialized value is `nil`, has to be explicitly converted for comparison * no auto increment/decrement operators, can use `+=1` and `-=1` ```bash $ ruby -le 'print $a' $ ruby -le 'print $a.to_i' 0 $ # print matching line and n-1 lines following the matched line $ # same as: perl -ne '$n=2 if /BEGIN/; print if $n && $n--' range.txt $ # can also use: ruby -ne 'BEGIN{n=0}; n=2 if /BEGIN/; print if n>0 && n-=1' $ ruby -ne '$n=2 if /BEGIN/; print if $n.to_i>0 && $n-=1' range.txt BEGIN 1234 BEGIN a $ # print nth line after match $ # same as: perl -ne 'print if $n && !--$n; $n=3 if /BEGIN/' range.txt $ ruby -ne '$n.to_i>0 && (print if $n==1; $n-=1); $n=3 if /BEGIN/' range.txt END c $ # use reversing trick for nth line before match $ tac range.txt | ruby -ne '$n.to_i>0 && (print if $n==1; $n-=1); $n=3 if /END/' | tac BEGIN a ``` **Further Reading** * [softwareengineering - FSM 
examples](https://softwareengineering.stackexchange.com/questions/47806/examples-of-finite-state-machines) * [wikipedia - FSM](https://en.wikipedia.org/wiki/Finite-state_machine)
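
The counter logic used for lines around a match may be easier to follow in script form, where a local variable initialized to `0` removes the need for `.to_i` on `nil`. A minimal sketch with hypothetical helper names, operating on an array of lines instead of a file:

```ruby
# hypothetical helper: matching line plus the (count - 1) lines that follow it
def with_following(lines, pattern, count)
  remaining = 0
  lines.select do |line|
    remaining = count if line =~ pattern
    keep = remaining > 0
    remaining -= 1 if keep
    keep
  end
end

# hypothetical helper: only the nth line after each match
def nth_after(lines, pattern, n)
  remaining = 0
  picked = []
  lines.each do |line|
    if remaining > 0
      picked << line if remaining == 1
      remaining -= 1
    end
    remaining = n if line =~ pattern
  end
  picked
end

range = %w[foo BEGIN 1234 6789 END bar BEGIN a b c END baz]
puts with_following(range, /BEGIN/, 2)
puts nth_after(range, /BEGIN/, 3)
```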
## Ruby regular expressions * assuming that you are already familiar with basics of regular expressions * if not, check out [Ruby Regexp](https://leanpub.com/rubyregexp) ebook - step by step guide from beginner to advanced levels * examples/descriptions are for string containing ASCII characters only * See [ruby-doc: Regexp](https://ruby-doc.org/core-2.5.0/Regexp.html) for documentation * See [rexegg ruby](https://www.rexegg.com/regex-ruby.html) for a bit of ruby regexp history and differences with other regexp engines
#### gotchas and tricks * input record separator being part of input record ```bash $ # newline character gets replaced too as shown by shell prompt $ echo 'foo:123:bar:789' | ruby -pe 'sub(/[^:]+$/, "xyz")' foo:123:bar:xyz$ $ # simple workaround is to use -l option $ echo 'foo:123:bar:789' | ruby -lpe 'sub(/[^:]+$/, "xyz")' foo:123:bar:xyz $ # of course it is useful too $ # same as: perl -pe 's/\n/ : / if !eof' $ seq 10 | ruby -pe 'sub(/\n/, " : ") if !$<.eof' 1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 ``` * how much does `*` match? ```bash $ # both empty and non-empty strings are matched $ # even though * is a greedy quantifier $ echo ',baz,,xyz,,,' | ruby -lpe 'gsub(/[^,]*/, "A")' A,AA,A,AA,A,A,A $ echo 'foo,baz,,xyz,,,123' | ruby -lpe 'gsub(/[^,]*/, "A")' AA,AA,A,AA,A,A,AA $ # one workaround is to use lookarounds(covered later) $ echo ',baz,,xyz,,,' | ruby -lpe 'gsub(/(?<=^|,)[^,]*/, "A")' A,A,A,A,A,A,A $ echo 'foo,baz,,xyz,,,123' | ruby -lpe 'gsub(/(?<=^|,)[^,]*/, "A")' A,A,A,A,A,A,A ``` * difference between `^` and `\A` ```bash $ # ^ matches start of line, not start of string $ # same as: perl -00 -ne 'print if /^Believe/m' sample.txt $ ruby -00 -ne 'print if /^Believe/' sample.txt Just do-it Believe it $ ruby -00 -ne 'print if /^he/i' sample.txt Hello World Much ado about nothing He he he $ # \A matches start of string $ # without m modifier, both ^ and \A will match start of string in perl $ ruby -00 -ne 'print if /\Ahe/i' sample.txt Hello World $ # similarly, $ matches end of line $ ruby -00 -ne 'print if /funny$/' sample.txt Today is sunny Not a bit funny No doubt you like it too ``` * difference between `\z` and `\Z` ```bash $ # \Z matches just before newline $ seq 14 | ruby -ne 'print if /2\Z/' 2 12 $ # \z matches end of string $ seq 14 | ruby -ne 'print if /2\z/' $ seq 14 | ruby -ne 'print if /2\n\z/' 2 12 $ # without newline at end of line, both \z and \Z will behave same $ seq 14 | ruby -lne 'print if /2\z/' 2 12 ``` * delimiters and quoting * from 
[ruby-doc: Percent Strings](https://ruby-doc.org/core-2.5.0/doc/syntax/literals_rdoc.html#label-Percent+Strings) > If you are using “(”, “[”, “{”, “<” you must close it with “)”, “]”, “}”, “>” respectively. You may use most other non-alphanumeric characters for percent string delimiters such as “%”, “|”, “^”, etc. ```bash $ # %r allows to use delimiter other than / $ echo 'a/b' | ruby -pe 'sub(/a\/b/, "foo")' foo $ echo 'a/b' | ruby -pe 'sub(%r{a/b}, "foo")' foo $ # use %q (single quoting) to avoid variable interpolation $ echo 'foo123' | ruby -pe 'a="huh?"; sub(/12/, "#{a}")' foohuh?3 $ echo 'foo123' | ruby -pe 'a="huh?"; sub(/12/, %q/#{a}/)' foo#{a}3 $ # %q also useful for backreferences, as \ is special inside double quotes $ echo 'a a a 2 be be' | ruby -pe 'gsub(/\b(\w+)( \1)+\b/, "\\1")' a 2 be $ echo 'a a a 2 be be' | ruby -pe 'gsub(/\b(\w+)( \1)+\b/, %q/\1/)' a 2 be $ # and when double quotes is part of replacement string $ echo '42,789' | ruby -lpe 'gsub(/\d+/, "\"\\0\"")' "42","789" $ echo '42,789' | ruby -lpe 'gsub(/\d+/, %q/"\0"/)' "42","789" $ # \& can also be used instead of \0 ```
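
The anchor differences above can be checked directly on strings with embedded newlines. A small sketch; the helper names are hypothetical and `match?` is assumed to be available (Ruby 2.4 or later).

```ruby
# hypothetical helper: compare ^ (start of line) with \A (start of string)
def start_anchors(text, word)
  w = Regexp.escape(word)
  { caret: text.match?(/^#{w}/), backslash_a: text.match?(/\A#{w}/) }
end

# hypothetical helper: compare \Z (before a final newline) with \z (very end)
def end_anchors(text, tail)
  t = Regexp.escape(tail)
  { upper_z: text.match?(/#{t}\Z/), lower_z: text.match?(/#{t}\z/) }
end

p start_anchors("Just do-it\nBelieve it\n", "Believe")
p end_anchors("12\n", "2")
p end_anchors("12", "2")
```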
#### Backslash sequences * `\w` for `[A-Za-z0-9_]` * `\d` for `[0-9]` * `\s` for `[ \t\r\n\f\v]` * `\h` for `[0-9a-fA-F]` or `[[:xdigit:]]` * `\W`, `\D`, `\S`, `\H`, respectively for their opposites * See also [ruby-doc: scan](https://ruby-doc.org/core-2.5.0/String.html#method-i-scan) ```bash $ # same as: perl -ne 'print if /^[[:xdigit:]]+$/' $ # can also use: ruby -lne 'print if !/\H/' $ printf '128A\n34\nfe32\nfoo1\nbar\n' | ruby -ne 'print if /^\h+$/' 128A 34 fe32 $ # same as: perl -pe 's/\d+/xxx/g' $ echo 'like 42 and 37' | ruby -pe 'gsub(/\d+/, "xxx")' like xxx and xxx $ # note again the use of -l because of newline in input record $ # same as: perl -lpe 's/\D+/xxx/g' $ echo 'like 42 and 37' | ruby -lpe 'gsub(/\D+/, "xxx")' xxx42xxx37 $ # get all matches as an array $ echo 'tea sea-pit sit' | ruby -ne 'puts $_.scan(/[\w\s]+/)' tea sea pit sit ```
#### Non-greedy quantifier * adding a `?` to `?` or `*` or `+` or `{}` quantifiers will change matching from greedy to non-greedy. In other words, to match as minimally as possible * also known as lazy quantifier ```bash $ # greedy matching $ echo 'foo and bar and baz land good' | ruby -lne 'print $_.scan(/.*and/)' ["foo and bar and baz land"] $ # non-greedy matching $ echo 'foo and bar and baz land good' | ruby -lne 'print $_.scan(/.*?and/)' ["foo and", " bar and", " baz land"] $ echo '12342789' | ruby -pe 'sub(/\d{2,5}/, "")' 789 $ echo '12342789' | ruby -pe 'sub(/\d{2,5}?/, "")' 342789 $ # for single character, non-greedy is not always needed $ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:.*?:/, ":")' 123:789:good:5:bad $ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:[^:]*:/, ":")' 123:789:good:5:bad $ # just like greedy, overall matching is considered, as minimal as possible $ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:.*?:[a-z]/, ":")' 123:ood:5:bad $ echo '123:42:789:good:5:bad' | ruby -pe 'sub(/:.*:[a-z]/, ":")' 123:ad ```
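
For a side-by-side look at greedy and non-greedy matching, the same `scan` calls can be wrapped in a tiny helper. `scan_upto` and its `lazy:` keyword are names invented for this sketch.

```ruby
# hypothetical helper: scan for "anything up to word", greedy or lazy
def scan_upto(text, word, lazy: false)
  quantifier = lazy ? ".*?" : ".*"
  text.scan(/#{quantifier}#{Regexp.escape(word)}/)
end

s = "foo and bar and baz land good"
p scan_upto(s, "and")              # one long match
p scan_upto(s, "and", lazy: true)  # several short matches
p "12342789".sub(/\d{2,5}?/, "")   # lazy {2,5} stops at two digits
```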
#### Lookarounds

* Ability to add if conditions to match before/after required pattern
* There are four types
    * positive lookahead `(?=`
    * negative lookahead `(?!`
    * positive lookbehind `(?<=`
    * negative lookbehind `(?<!`

#### Special capture groups

* `\1`, `\2` etc only matches exact string
* `\g<1>`, `\g<2>` etc re-uses the regular expression itself

```bash
$ s='baz 2008-03-24 and 2012-08-12 foo 2016-03-25'
$ # same as: perl -pe 's/(\d{4}-\d{2}-\d{2}) and (?1)/XYZ/'
$ echo "$s" | ruby -pe 'sub(/(\d{4}-\d{2}-\d{2}) and \g<1>/, "XYZ")'
baz XYZ foo 2016-03-25

$ # using \1 won't work as the two dates are different
$ echo "$s" | ruby -pe 'sub(/(\d{4}-\d{2}-\d{2}) and \1/, "")'
baz 2008-03-24 and 2012-08-12 foo 2016-03-25
```

* use `(?:` to group regular expressions without capturing it, so this won't be counted for backreference
* See also [stackoverflow - what is non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do)

```bash
$ # using ?: helps to focus only on required capture groups
$ # same as: perl -pe 's/(?:co|fo)\K(\w)(\w)/$2$1/g'
$ echo 'cod1 foo_bar' | ruby -pe 'gsub(/(?:co|fo)\K(\w)(\w)/, %q/\2\1/)'
co1d fo_obar

$ # without ?: you'd need to remember all the other groups as well
$ echo 'cod1 foo_bar' | ruby -pe 'gsub(/(co|fo)\K(\w)(\w)/, %q/\3\2/)'
co1d fo_obar
```

* named capture groups `(?<name>` or `(?'name'`
    * for backreference, use `\k<name>`
    * named capture groups and normal capture groups cannot both be used at the same time

```bash
$ # same as: perl -pe 's/(?<fw>\w+) (?<sw>\w+)/$+{sw} $+{fw}/'
$ echo 'foo 123' | ruby -pe 'sub(/(?<fw>\w+) (?<sw>\w+)/, %q/\k<sw> \k<fw>/)'
123 foo

$ # also useful to transform different capture groups
$ s='"foo,bar",123,"x,y,z",42'
$ # same as: perl -lpe 's/"(?<a>[^"]+)",|(?<a>[^,]+),/$+{a}|/g'
$ echo "$s" | ruby -lpe 'gsub(/"(?<a>[^"]+)",|(?<a>[^,]+),/, %q/\k<a>|/)'
foo,bar|123|x,y,z|42
```

**Further Reading**

* [rexegg - all the (? usages](https://www.rexegg.com/regex-disambiguation.html)
* [regular-expressions - recursion](https://www.regular-expressions.info/recurse.html#balanced)
* [stackoverflow - Recursive nested matching pairs of curly braces](https://stackoverflow.com/questions/19486686/recursive-nested-matching-pairs-of-curly-braces-in-ruby-regex)
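
Named captures and subexpression calls from the special capture groups discussion can also be tried in script form. A brief sketch; the helper names are hypothetical and the group names `fw`/`sw` simply stand for first word and second word.

```ruby
# hypothetical helper: swap two words using named groups and \k<name>
def swap_words(text)
  text.sub(/(?<fw>\w+) (?<sw>\w+)/, '\k<sw> \k<fw>')
end

# hypothetical helper: \g<1> re-uses the pattern of group 1, not its matched text
def collapse_date_pair(text)
  text.sub(/(\d{4}-\d{2}-\d{2}) and \g<1>/, "XYZ")
end

puts swap_words("foo 123")
puts collapse_date_pair("baz 2008-03-24 and 2012-08-12 foo 2016-03-25")

# named groups are also available from the MatchData object
m = "foo 123".match(/(?<fw>\w+) (?<sw>\w+)/)
puts m[:sw]
```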
#### Modifiers

* use `i` modifier to ignore case while matching

```bash
$ ruby -ne 'print if /rose/i' poem.txt
Roses are red,

$ echo 'foo 123 FoO' | ruby -pe 'gsub(/foo/i, "good")'
good 123 good
```

* by default, `.` doesn't match the newline character
* `m` modifier allows `.` metacharacter to match newline character as well

```bash
$ # searching for a match which can span across multiple lines
$ # no output as . doesn't match newline
$ ruby -00 -ne 'print if /do.*he/' sample.txt
$ # same as: perl -00 -ne 'print if /do.*he/s' sample.txt
$ ruby -00 -ne 'print if /do.*he/m' sample.txt
Much ado about nothing
He he he
```
#### Code in replacement section * block form allows to use `ruby` code for replacement section quoting from [ruby-doc: gsub](https://ruby-doc.org/core-2.5.0/String.html#method-i-gsub) >In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call. * `$1`, `$2`, etc are equivalent of `\1`, `\2`, etc * `$&` is equivalent of `\&`(or `\0`) - i.e the entire matched string ```bash $ # replace numbers with their squares, same as: perl -pe 's/\d+/$&**2/ge' $ echo '4 and 10' | ruby -pe 'gsub(/\d+/){$&.to_i ** 2}' 16 and 100 $ # replace matched string with incremental value $ # same as: perl -pe 's/\d+/++$c/ge' $ echo '4 and 10 foo 57' | ruby -pe 'BEGIN{c=0}; gsub(/\d+/){c+=1}' 1 and 2 foo 3 $ # replace with string length, same as: perl -pe 's/\w+/length($&)/ge' $ echo 'food:12:explain:789' | ruby -pe 'gsub(/\w+/){$&.length}' 4:2:7:3 $ # formatting string, same as: perl -lpe 's/[^-]+/sprintf "%04s", $&/ge' $ echo 'a1-2-deed' | ruby -lpe 'gsub(/[^-]+/){ $&.rjust(4, "0") }' 00a1-0002-deed $ # applying another substitution to matched string $ # same as: perl -pe 's/"[^"]+"/$&=~s|a|A|gr/ge' $ echo '"mango" and "guava"' | ruby -pe 'gsub(/"[^"]+"/){$&.gsub(/a/, "A")}' "mAngo" and "guAvA" ``` * replacing specific occurrence ```bash $ # replacing 2nd occurrence, same as: sed 's/:/-/2' $ # same as: perl -pe '$c=0; s/:/++$c==2 ? "-" : $&/ge' $ echo 'foo:123:bar:baz' | ruby -pe 'c=0; gsub(/:/){(c+=1)==2 ? "-" : $&}' foo:123-bar:baz $ # or use non-greedy matching, same as: sed 's/and/-/3' $ echo 'foo and bar and baz land good' | ruby -pe 'sub(/(and.*?){2}\Kand/, "-")' foo and bar and baz l- good $ # emulating GNU sed's number+g modifier $ a='456:foo:123:bar:789:baz x:y:z:a:v:xc:gf' $ echo "$a" | sed 's/:/-/3g' 456:foo:123-bar-789-baz x:y:z-a-v-xc-gf $ # same as: perl -pe '$c=0; s/:/++$c<3 ? 
$& : "-"/ge' $ echo "$a" | ruby -pe 'c=0; gsub(/:/){(c+=1)<3 ? $& : "-"}' 456:foo:123-bar-789-baz x:y:z-a-v-xc-gf ```
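
The block form of `gsub` reads a little clearer in script form when the match is taken as a block parameter instead of `$&`. A minimal sketch with hypothetical helper names:

```ruby
# hypothetical helper: replace each number with its square
def square_numbers(text)
  text.gsub(/\d+/) { |m| (m.to_i**2).to_s }
end

# hypothetical helper: replace only the nth occurrence, similar to sed 's/x/y/n'
def replace_nth(text, pattern, n, replacement)
  count = 0
  text.gsub(pattern) { |m| (count += 1) == n ? replacement : m }
end

puts square_numbers("4 and 10")
puts replace_nth("foo:123:bar:baz", /:/, 2, "-")
```

The counter is a local variable captured by the block, so it starts from zero on every call, which matches resetting `c=0` per line in the one-liners.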
#### Quoting metacharacters * to match contents of string variable exactly, all metacharacters need to be escaped * See [ruby-doc: Regexp.escape](https://ruby-doc.org/core-2.5.0/Regexp.html#method-c-escape) for syntax details ```bash $ cat eqns.txt a=b,a-b=c,c*d a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ # since + is a metacharacter, no match found $ # note that #{} allows interpolation $ s='a+b' ruby -ne 'print if /#{ENV["s"]}/' eqns.txt $ # same as: s='a+b' perl -ne 'print if /\Q$ENV{s}/' eqns.txt $ s='a+b' ruby -ne 'print if /#{Regexp.escape(ENV["s"])}/' eqns.txt a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a+b $ # use regexp as needed around variable content, for ex: end of line anchor $ ruby -pe 'BEGIN{s="a+b"}; sub(/#{Regexp.escape(s)}$/, "a**b")' eqns.txt a=b,a-b=c,c*d a+b,pi=3.14,5e12 i*(t+9-g)/8,4-a**b ```
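
The effect of `Regexp.escape` is easy to see by filtering the same lines with and without it. A small sketch; the two helper names are made up for illustration.

```ruby
# hypothetical helper: str is interpolated as-is, so metacharacters stay special
def grep_as_regexp(lines, str)
  lines.grep(/#{str}/)
end

# hypothetical helper: str is escaped first, so it is matched literally
def grep_literal(lines, str)
  lines.grep(/#{Regexp.escape(str)}/)
end

eqns = ["a=b,a-b=c,c*d", "a+b,pi=3.14,5e12", "i*(t+9-g)/8,4-a+b"]
p grep_as_regexp(eqns, "a+b")
p grep_literal(eqns, "a+b")
p Regexp.escape("a+b")
```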
## Two file processing First, a bit about `ARGV` which allows to keep track of which file is being processed ```bash $ # similar to: perl -lne 'print $#ARGV' <(seq 2) <(seq 3) <(seq 1) $ ruby -ne 'puts ARGV.length' <(seq 2) <(seq 3) <(seq 1) 2 2 1 1 1 0 ```
#### Comparing whole lines Consider the following test files ```bash $ cat colors_1.txt Blue Brown Purple Red Teal Yellow $ cat colors_2.txt Black Blue Green Red White ``` * `-r` command line option allows to specify library required * the `include?` method allows to check if `set` already contains the element * See [ruby-doc: include?](https://ruby-doc.org/stdlib-2.5.0/libdoc/set/rdoc/Set.html#method-i-include-3F) for syntax details ```bash $ # common lines $ # note that all duplicates matching in second file would get printed $ # same as: perl -ne 'if(!$#ARGV){$h{$_}=1; next} $ # print if $h{$_}' colors_1.txt colors_2.txt $ ruby -rset -ne 'BEGIN{s=Set.new}; s.add($_) && next if ARGV.length==1; print if s.include?($_)' colors_1.txt colors_2.txt Blue Red $ # lines from colors_2.txt not present in colors_1.txt $ ruby -rset -ne 'BEGIN{s=Set.new}; s.add($_) && next if ARGV.length==1; print if !s.include?($_)' colors_1.txt colors_2.txt Black Green White $ # next - to skip rest of code and process next input line $ # here used to skip rest of code as long as first file is being processed $ # alternate: ARGV.length==1 ? 
s.add($_) : s.include?($_) && print
```

alternate solution by using set operations available for arrays

* [ruby-doc: ARGF](https://ruby-doc.org/core-2.5.0/ARGF.html) filehandle allows to read from filename arguments supplied to script
    * if filename arguments are not present, it would act upon stdin
* `STDIN` filehandle allows to read from stdin
* [ruby-doc: readlines](https://ruby-doc.org/core-2.5.0/IO.html#method-c-readlines) method allows to read all the lines as an array
    * if filehandle is not specified, default is ARGF
* some comparison notes
    * both files will get saved as array in memory here, while previous solution would save only first file
    * duplicates would get removed here
    * likely to be faster compared to previous solution

```bash
$ # note that -n/-p options are not used
$ # and puts is helpful here as record separator is newline character
$ # common lines, output order is based on array to left of & operator
$ ruby -e 'f1=STDIN.readlines; f2=readlines; puts f1 & f2'
```

#### Comparing specific fields

Consider the sample input file

```bash
$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
```

* single field
* For ex: only first field comparison instead of entire line as key

```bash
$ cat list1
ECE
CSE

$ # extract only lines matching first field specified in list1
$ ruby -rset -ane 'BEGIN{s=Set.new}; s.add($F[0]) && next if ARGV.length==1; print if s.include?($F[0])' list1 marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
```

* multiple field comparison

```bash
$ cat list2
EEE Moi
CSE Amy
ECE Raj

$ # $F[0..1] will return array with elements specified by range (0 to 1 here)
$ ruby -rset -ane 'BEGIN{s=Set.new}; s.add($F[0..1]) && next if ARGV.length==1; print if s.include?($F[0..1])' list2 marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
```

* field and value comparison
* here, we use [hash](https://ruby-doc.org/core-2.5.0/Hash.html) as well to save values based on a key

```bash
$ cat 
list3 ECE 70 EEE 65 CSE 80 $ # extract line matching Dept and minimum marks specified in list3 $ ruby -rset -ane 'BEGIN{d=Set.new; m={}}; (d.add($F[0]); m[$F[0]]=$F[1]) && next if ARGV.length==1; print if d.include?($F[0]) && $F[2]>=m[$F[0]]' list3 marks.txt ECE Joel 72 EEE Moi 68 CSE Surya 81 ECE Om 92 ```
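
The two-file comparisons can be sketched on in-memory arrays as well. The helper names below are hypothetical. One assumption worth flagging: the marks are converted with `to_i` before comparing, whereas comparing the fields as strings only behaves numerically when both values have the same number of digits.

```ruby
require "set"

# hypothetical helper: lines of second that are also present in first
def common_lines(first, second)
  seen = Set.new(first)
  second.select { |line| seen.include?(line) }
end

# hypothetical helper: rows whose dept is listed and whose marks meet the minimum
def meets_minimum(rows, minimums)
  rows.select do |dept, _name, marks|
    minimums.key?(dept) && marks.to_i >= minimums[dept]
  end
end

colors_1 = %w[Blue Brown Purple Red Teal Yellow]
colors_2 = %w[Black Blue Green Red White]
p common_lines(colors_1, colors_2)
p colors_2 - colors_1   # array difference, duplicates removed

rows = [%w[Dept Name Marks], %w[ECE Raj 53], %w[ECE Joel 72], %w[EEE Moi 68]]
p meets_minimum(rows, "ECE" => 70, "EEE" => 65)
```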
#### Line number matching

```bash
$ # replace mth line in poem.txt with nth line from list1
$ # same as: m=3 n=2 perl -pe 'BEGIN{ $s=<> while $ENV{n}-- > 0; close ARGV}
$ #                            $_=$s if $.==$ENV{m}' list1 poem.txt
$ m=3 n=2 ruby -pe 'BEGIN{ENV["n"].to_i.times { $s=gets }; ARGF.close }; $_=$s if $.==ENV["m"].to_i' list1 poem.txt
Roses are red,
Violets are blue,
CSE
And so are you.

$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ # same as: <nums.txt perl -ne 'print if <STDIN> > 0' fruits.txt
$ # line from fruits.txt is saved first as STDIN.gets will also set $_
$ <nums.txt ruby -ne 's=$_; print s if STDIN.gets.to_i>0' fruits.txt
fruit   qty
banana  31

$ # can also use:
$ # ruby -e 'STDIN.readlines.zip(readlines).each {|a| puts a[1] if a[0].to_i>0}'
```

For syntax and implementation details, see

* [ruby-doc: ARGF](https://ruby-doc.org/core-2.5.0/ARGF.html)
* [ruby-doc: times](https://ruby-doc.org/core-2.5.0/Integer.html#method-i-times)
* [ruby-doc: gets](https://ruby-doc.org/core-2.5.0/IO.html#method-i-gets)
## Creating new fields * See [ruby-doc: slice](https://ruby-doc.org/core-2.5.0/Array.html#method-i-slice) for syntax details ```bash $ s='foo,bar,123,baz' $ # to reduce fields, use slice method $ # same as: echo "$s" | perl -F, -lane '$,=","; $#F=1; print @F' $ # 1st arg - starting index, 2nd arg - number of elements $ echo "$s" | ruby -F, -lane '$F.slice!(-2,2); print $F * ","' foo,bar $ # assigning to field greater than length will create empty fields as needed $ # same as: echo "$s" | perl -F, -lane '$,=","; $F[6]=42; print @F' $ echo "$s" | ruby -F, -lane '$F[6]=42; print $F * ","' foo,bar,123,baz,,,42 ``` * adding a field based on existing fields * See [ruby-doc: Percent Strings](https://ruby-doc.org/core-2.5.0/doc/syntax/literals_rdoc.html#label-Percent+Strings) for details on `%w` ```bash $ # adding a new 'Grade' field $ # same as: perl -lane 'BEGIN{$,="\t"; @g = qw(D C B A S)} $ # push @F, $.==1 ? "Grade" : $g[$F[-1]/10 - 5]; print @F' marks.txt $ ruby -lane 'BEGIN{g = %w[D C B A S]}; $F.push($.==1 ? "Grade" : g[$F[-1].to_i/10 - 5]); print $F * "\t"' marks.txt Dept Name Marks Grade ECE Raj 53 D ECE Joel 72 B EEE Moi 68 C CSE Surya 81 A EEE Tia 59 D ECE Om 92 S CSE Amy 67 C ```
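
The grade lookup above relies on integer division to turn marks into an array index. A script-form sketch follows; `GRADES`, `grade` and `add_grade` are names chosen for this example. Note the assumption that marks stay within 50 to 99: a lower value would give a negative index, which wraps around to the end of the array, and 100 would index past the end and return `nil`.

```ruby
GRADES = %w[D C B A S]

# hypothetical helper: 50-59 => D, 60-69 => C, ... 90-99 => S
def grade(marks)
  GRADES[marks.to_i / 10 - 5]
end

# hypothetical helper: append a Grade column, first row is treated as header
def add_grade(rows)
  rows.each_with_index.map do |row, i|
    row + [i.zero? ? "Grade" : grade(row[-1])]
  end
end

rows = [%w[Dept Name Marks], %w[ECE Raj 53], %w[ECE Om 92]]
add_grade(rows).each { |r| puts r.join("\t") }

# assigning past the end pads with nil, which join renders as empty fields
f = %w[foo bar 123 baz]
f[6] = 42
puts f.join(",")
```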
## Multiple file input * processing based on line-number/begin/end of each input file ```bash $ # same as: perl -ne 'print if $.==2; close ARGV if eof' $ # ARGF.close will reset $. to 0 $ ruby -ne 'print if $.==2; ARGF.close if $<.eof' poem.txt greeting.txt Violets are blue, Have a safe journey $ # same as: perl -lne 'print "file: $ARGV" if $.==1; $ # print "$_\n------" and close ARGV if eof' poem.txt greeting.txt $ ruby -lne 'print "file: #{ARGF.filename}" if $.==1; (print "#{$_}\n------"; ARGF.close) if $<.eof' poem.txt greeting.txt file: poem.txt And so are you. ------ file: greeting.txt Have a safe journey ------ ``` * to skip remaining lines from current file being processed and move on to next file ```bash $ # same as: perl -pe 'close ARGV if $.>=1' poem.txt greeting.txt fruits.txt $ ruby -pe 'ARGF.close if $.>=1' poem.txt greeting.txt fruits.txt Roses are red, Hello there fruit qty $ # same as: perl -lane 'print $ARGV and close ARGV if $F[0] =~ /red/i' * $ ruby -ane '(puts ARGF.filename; ARGF.close) if $F[0] =~ /red/i' * colors_1.txt colors_2.txt ```
## Dealing with duplicates * retain only first copy of duplicates * `-r` command line option allows to specify library required * here, `set` data type is used to keep track of unique values - be it whole line or a particular field * the `add?` method will add element to `set` and returns `nil` if element already exists * See [ruby-doc: add?](https://ruby-doc.org/stdlib-2.5.0/libdoc/set/rdoc/Set.html#method-i-add-3F) for syntax details ```bash $ cat duplicates.txt abc 7 4 food toy **** abc 7 4 test toy 123 good toy **** $ # whole line, same as: perl -ne 'print if !$seen{$_}++' duplicates.txt $ ruby -rset -ne 'BEGIN{s=Set.new}; print if s.add?($_)' duplicates.txt abc 7 4 food toy **** test toy 123 good toy **** $ # particular column, same as: perl -ane 'print if !$seen{$F[1]}++' $ ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1])' duplicates.txt abc 7 4 food toy **** $ # total count, same as: perl -lane '$c++ if !$seen{$F[1]}++; END{print $c}' $ ruby -rset -ane 'BEGIN{s=Set.new}; s.add($F[1]); END{puts s.length}' duplicates.txt 2 ``` * multiple fields ```bash $ # same as: perl -ane 'print if !$seen{$F[1],$F[2]}++' duplicates.txt $ # $F[1..2] will return an array with fields 2 and 3 as elements $ ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1..2])' duplicates.txt abc 7 4 food toy **** test toy 123 ``` * retaining only last copy of duplicate ```bash $ # reverse the input line-wise, retain first copy and then reverse again $ # same as: tac duplicates.txt | perl -ane 'print if !$seen{$F[1]}++' | tac $ tac duplicates.txt | ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1])' | tac abc 7 4 good toy **** ``` * for count based filtering (other than first/last count), use a `hash` * `Hash.new(0)` will initialize value of new key to `0` ```bash $ # second occurrence of duplicate $ # same as: perl -ane 'print if ++$h{$F[1]}==2' duplicates.txt $ ruby -ane 'BEGIN{h=Hash.new(0)}; print if (h[$F[1]]+=1)==2' duplicates.txt abc 7 4 test toy 123 $ # third 
occurrence of duplicate $ # same as: perl -ane 'print if ++$h{$F[1]}==3' duplicates.txt $ ruby -ane 'BEGIN{h=Hash.new(0)}; print if (h[$F[1]]+=1)==3' duplicates.txt good toy **** ``` * filtering based on duplicate count * allows to emulate [uniq](./sorting_stuff.md#uniq) command for specific fields ```bash $ # all duplicates based on 1st column $ # same as: perl -ane '!$#ARGV ? $x{$F[0]}++ : $x{$F[0]}>1 && print' $ ruby -ane 'BEGIN{h=Hash.new(0)}; ARGV.length==1 ? h[$F[0]]+=1 : h[$F[0]]>1 && print' duplicates.txt duplicates.txt abc 7 4 abc 7 4 $ # more than 2 duplicates based on 2nd column $ ruby -ane 'BEGIN{h=Hash.new(0)}; ARGV.length==1 ? h[$F[1]]+=1 : h[$F[1]]>2 && print' duplicates.txt duplicates.txt food toy **** test toy 123 good toy **** $ # only unique lines based on 3rd column $ ruby -ane 'BEGIN{h=Hash.new(0)}; ARGV.length==1 ? h[$F[2]]+=1 : h[$F[2]]==1 && print' duplicates.txt duplicates.txt test toy 123 ```
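
The `Set#add?` and `Hash.new(0)` idioms carry over directly to script form. In this sketch the helper names are hypothetical and the key is supplied as a block, so the same code works for whole lines or specific fields.

```ruby
require "set"

# hypothetical helper: keep the first row for each distinct key
def first_copies(rows, &key)
  seen = Set.new
  rows.select { |row| seen.add?(key.call(row)) }
end

# hypothetical helper: keep only the nth row seen for each key
def nth_copies(rows, n, &key)
  counts = Hash.new(0)
  rows.select { |row| (counts[key.call(row)] += 1) == n }
end

rows = ["abc 7 4", "food toy ****", "abc 7 4", "test toy 123", "good toy ****"]
puts first_copies(rows) { |r| r.split[1] }
puts nth_copies(rows, 2) { |r| r.split[1] }
```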
#### using uniq method * [ruby-doc: uniq](https://ruby-doc.org/core-2.5.0/Array.html#method-i-uniq) * original order is maintained ```bash $ # same as: ruby -rset -ne 'BEGIN{s=Set.new}; print if s.add?($_)' $ ruby -e 'puts readlines.uniq' duplicates.txt abc 7 4 food toy **** test toy 123 good toy **** $ # same as: ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1])' $ ruby -e 'puts readlines.uniq {|s| s.split[1]}' duplicates.txt abc 7 4 food toy **** $ # same as: ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1..2])' $ ruby -e 'puts readlines.uniq {|s| s.split[1..2]}' duplicates.txt abc 7 4 food toy **** test toy 123 ```
## Lines between two REGEXPs * This section deals with filtering lines bound by two *REGEXP*s (referred to as blocks) * For simplicity the two *REGEXP*s usually used in below examples are the strings **BEGIN** and **END**
#### All unbroken blocks

Consider the below sample input file, which doesn't have any broken blocks (i.e **BEGIN** and **END** are always present in pairs)

```bash
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
```

* Extracting lines between starting and ending *REGEXP*

```bash
$ # include both starting/ending REGEXP
$ # same as: perl -ne '$f=1 if /BEGIN/; print if $f; $f=0 if /END/'
$ ruby -ne '$f=1 if /BEGIN/; print if $f==1; $f=0 if /END/' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END

$ # can also use: ruby -ne 'print if /BEGIN/../END/' range.txt
$ # which is similar to sed -n '/BEGIN/,/END/p'
$ # but not suitable to extend for other cases
```

* other variations

```bash
$ # exclude both starting/ending REGEXP
$ # same as: perl -ne '$f=0 if /END/; print if $f; $f=1 if /BEGIN/'
$ ruby -ne '$f=0 if /END/; print if $f==1; $f=1 if /BEGIN/' range.txt
1234
6789
a
b
c

$ # check out what these do:
$ ruby -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if $f==1' range.txt
$ ruby -ne 'print if $f==1; $f=0 if /END/; $f=1 if /BEGIN/' range.txt
```

* Extracting lines other than lines between the two *REGEXP*s

```bash
$ # same as: perl -ne '$f=1 if /BEGIN/; print if !$f; $f=0 if /END/'
$ # can also use: ruby -ne 'print if !(/BEGIN/../END/)' range.txt
$ ruby -ne '$f=1 if /BEGIN/; print if $f!=1; $f=0 if /END/' range.txt
foo
bar
baz

$ # the other three cases would be
$ ruby -ne '$f=0 if /END/; print if $f!=1; $f=1 if /BEGIN/' range.txt
$ ruby -ne 'print if $f!=1; $f=1 if /BEGIN/; $f=0 if /END/' range.txt
$ ruby -ne '$f=1 if /BEGIN/; $f=0 if /END/; print if $f!=1' range.txt
```
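
The order of the three statements (set flag, print, clear flag) is what decides whether the marker lines are included. The following sketch spells out the two most common orderings in script form; `between` and its `inclusive:` keyword are hypothetical names.

```ruby
# hypothetical helper: lines between start_re and end_re
# inclusive: true  => set flag, test, clear flag (markers kept)
# inclusive: false => clear flag, test, set flag (markers dropped)
def between(lines, start_re, end_re, inclusive: true)
  inside = false
  lines.select do |line|
    if inclusive
      inside = true if line =~ start_re
      keep = inside
      inside = false if line =~ end_re
    else
      inside = false if line =~ end_re
      keep = inside
      inside = true if line =~ start_re
    end
    keep
  end
end

range = %w[foo BEGIN 1234 6789 END bar BEGIN a b c END baz]
puts between(range, /BEGIN/, /END/)
puts between(range, /BEGIN/, /END/, inclusive: false)
```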
#### Specific blocks * Getting first block ```bash $ # same as: perl -ne '$f=1 if /BEGIN/; print if $f; exit if /END/' $ ruby -ne '$f=1 if /BEGIN/; print if $f==1; exit if /END/' range.txt BEGIN 1234 6789 END $ # use other tricks discussed in previous section as needed $ ruby -ne 'exit if /END/; print if $f==1; $f=1 if /BEGIN/' range.txt 1234 6789 ``` * Getting last block ```bash $ # reverse input linewise, change the order of REGEXPs, finally reverse again $ tac range.txt | ruby -ne '$f=1 if /END/; print if $f==1; exit if /BEGIN/' | tac BEGIN a b c END $ # or, save the blocks in a buffer and print the last one alone $ # same as: seq 30 | perl -ne 'if(/4/){$f=1; $b=$_; next} $ # $b.=$_ if $f; $f=0 if /6/; END{print $b}' $ # << operator concatenates given string to the variable in-place $ seq 30 | ruby -ne '($f=1; $b=$_) && next if /4/; $b << $_ if $f==1; $f=0 if /6/; END{print $b}' 24 25 26 ``` * Getting blocks based on a counter ```bash $ # get only 2nd block $ # same as: b=2 perl -ne '$c++ if /4/; if($c==$ENV{b}){print; exit if /6/}' $ seq 30 | b=2 ruby -ne 'BEGIN{c=0}; c+=1 if /4/; c==ENV["b"].to_i && (print; exit if /6/)' 14 15 16 $ # to get all blocks greater than 'b' blocks $ seq 30 | b=1 ruby -ne 'BEGIN{c=0}; ($f=1; c+=1) if /4/; print if $f==1 && c>ENV["b"].to_i; $f=0 if /6/' 14 15 16 24 25 26 ``` * excluding a particular block ```bash $ # excludes 2nd block $ seq 30 | b=2 ruby -ne 'BEGIN{c=0}; ($f=1; c+=1) if /4/; print if $f==1 && c!=ENV["b"].to_i; $f=0 if /6/' 4 5 6 24 25 26 ``` * extract block only if it matches another string as well ```bash $ # string to match inside block: 23 $ # same as: perl -ne 'if(/BEGIN/){$f=1; $m=0; $b=""}; $m=1 if $f && /23/; $ # $b.=$_ if $f; if(/END/){print $b if $m; $f=0}' range.txt $ ruby -ne '($f=1; $m=0; $b="") if /BEGIN/; $m=1 if $f==1 && /23/; $b<<$_ if $f==1; (print $b if $m==1; $f=0) if /END/' range.txt BEGIN 1234 6789 END $ # line to match inside block: 5 or 25 $ seq 30 | ruby -ne '($f=1; $m=0; $b="") if /4/; $m=1 
if $f==1 && /^2?5$/; $b<<$_ if $f==1; (print $b if $m==1; $f=0) if /6/' 4 5 6 24 25 26 ```
#### Broken blocks * If there are blocks with ending *REGEXP* but without corresponding start, earlier techniques used will suffice * Consider the modified input file where starting *REGEXP* doesn't have corresponding ending ```bash $ cat broken_range.txt foo BEGIN 1234 6789 END bar BEGIN a b c baz $ # the file reversing trick comes in handy here as well $ tac broken_range.txt | ruby -ne '$f=1 if /END/; print if $f==1; $f=0 if /BEGIN/' | tac BEGIN 1234 6789 END ``` * But if both kinds of broken blocks are present, for ex: ```bash $ cat multiple_broken.txt qqqqqqq BEGIN foo BEGIN 1234 6789 END bar END 0-42-1 BEGIN a BEGIN b END xyzabc ``` then use buffers to accumulate the records and print accordingly ```bash $ # same as: perl -ne 'if(/BEGIN/){$f=1; $b=$_; next} $b.=$_ if $f; $ # if(/END/){$f=0; print $b if $b; $b=""}' multiple_broken.txt $ ruby -ne '($f=1; $b=$_) && next if /BEGIN/; $b << $_ if $f==1; ($f=0; print $b if $b!=""; $b="") if /END/' multiple_broken.txt BEGIN 1234 6789 END BEGIN b END $ # note how buffer is initialized as well as cleared $ # on matching beginning/end REGEXPs respectively ```
## Array operations See [ruby-doc: Array](https://ruby-doc.org/core-2.5.0/Array.html) for various ways to initialize and methods available * initialization ```bash $ # as comma separated values, indexing starts at 0 $ ruby -le 'sq = [1, 4, 9, 16]; print sq[2]' 9 $ ruby -le 'a = [123, "foo", "baz789"]; print a[1]' foo $ # -ve indexing, -1 for last element, -2 for second last, etc $ ruby -le 'foo = [2, "baz", ["a", "b"]]; print foo[-1]' ["a", "b"] $ # variables can be used, double quoted string will interpolate $ ruby -le 'a=5; b=["a", "b"]; c=[a, 789, b]; print c' [5, 789, ["a", "b"]] $ ruby -le 'c=[89, "a\nb"]; print c[-1]' a b $ # %w allows space separated string values, no interpolation $ ruby -le 'b = %w[123 foo baz789]; print b[1]' foo $ ruby -le 's = %w[foo "baz" "a\nb"]; print s[-1]' "a\nb" ``` * array slices * See also [ruby-doc: Array to Arguments Conversion](https://ruby-doc.org/core-2.5.0/doc/syntax/calling_methods_rdoc.html#label-Array+to+Arguments+Conversion) ```bash $ # accessing more than one element in random order $ echo 'a b c d' | ruby -lane 'print $F.values_at(0,-1,2) * " "' a d c $ echo 'a b c d' | ruby -lane 'i=[0, -1, 2]; print $F.values_at(*i) * " "' a d c $ # starting index and number of elements needed from that index $ echo 'a b c d' | ruby -lane 'print $F[0,3] * " "' a b c $ # range operator, arguments are start/end indexes $ echo 'a b c d' | ruby -lane 'print $F[1..3] * " "' b c d $ # n elements from start, can also use 'first' method instead of 'take' $ echo 'a b c d' | ruby -lane 'print $F.take(2) * " "' a b $ # remaining elements after ignoring n elements from start $ echo 'a b c d' | ruby -lane 'print $F.drop(3) * " "' d $ # n elements from end $ echo 'a b c d' | ruby -lane 'print $F.last(3) * " "' b c d ``` * looping ```bash $ # by element value, use 'reverse_each' to iterate in reversed order $ # can also use range here: ruby -e '(1..4).each {|n| puts n*2}' $ ruby -e 'nums=[1, 2, 3, 4]; nums.each {|n| puts n*2}' 2 4 6 8 $ # by 
index $ ruby -e 'books=%w[Elantris Martian Dune Alchemist] books.each_index {|i| puts "#{i+1}) #{books[i]}"}' 1) Elantris 2) Martian 3) Dune 4) Alchemist ```
#### Filtering

* based on regexp

```bash
$ s='foo:123:bar:baz'
$ echo "$s" | ruby -F: -lane 'print $F.grep(/[a-z]/) * ":"'
foo:bar:baz

$ words='tryst fun glyph pity why'
$ echo "$words" | ruby -lane 'puts $F.grep(/[a-g]/)'
fun
glyph

$ # grep_v inverts the selection
$ echo "$words" | ruby -lane 'puts $F.grep_v(/[aeiou]/)'
tryst
glyph
why
```

* use `select` or `reject` for generic conditions

```bash
$ # to get index instead of matches
$ s='foo:123:bar:baz'
$ echo "$s" | ruby -F: -lane 'print $F.each_index.select{|i| $F[i] =~ /[a-z]/}'
[0, 2, 3]

$ # based on numeric value
$ s='23 756 -983 5'
$ echo "$s" | ruby -lane 'print $F.select { |s| s.to_i < 100 } * " "'
23 -983 5

$ # filters only those elements with successful substitution
$ # for opposite, either use negated condition or use reject instead of select
$ echo "$s" | ruby -lane 'print $F.select { |s| s.sub!(/3/, "E") } * " "'
2E -98E
```

* random element(s)

```bash
$ s='65 23 756 -983 5'
$ echo "$s" | ruby -lane 'print $F.sample'
23
$ echo "$s" | ruby -lane 'print $F.sample'
5
$ echo "$s" | ruby -lane 'print $F.sample(2)'
["-983", "756"]
```
#### Sorting * [ruby-doc: sort](https://ruby-doc.org/core-2.5.0/Array.html#method-i-sort) * See also [stackoverflow What does map(&:name) mean in Ruby?](https://stackoverflow.com/questions/1217088/what-does-mapname-mean-in-ruby) for explanation on `&:` ```bash $ s='foo baz v22 aimed' $ # same as: perl -lane 'print join " ", sort @F' $ echo "$s" | ruby -lane 'print $F.sort * " "' aimed baz foo v22 $ # demonstrating the <=> operator $ ruby -e 'puts 4 <=> 2' 1 $ ruby -e 'puts 4 <=> 20' -1 $ ruby -e 'puts 4 <=> 4' 0 $ # descending order $ # same as: perl -lane 'print join " ", sort {$b cmp $a} @F' $ echo "$s" | ruby -lane 'print $F.sort { |a,b| b <=> a } * " "' v22 foo baz aimed $ # can also reverse the array after default sorting $ echo "$s" | ruby -lane 'print $F.sort.reverse * " "' v22 foo baz aimed ``` * using `sort_by` to sort based on a key ```bash $ s='floor bat to dubious four' $ # can also use: ruby -lane 'print $F.sort_by(&:length) * ":"' $ echo "$s" | ruby -lane 'print $F.sort_by {|a| a.length} * ":"' to:bat:four:floor:dubious $ # for descending order, simply negate the key $ echo "$s" | ruby -lane 'print $F.sort_by {|a| -a.length} * ":"' dubious:floor:four:bat:to $ # need to explicitly convert from string to number for numeric input $ s='23 756 -983 5' $ echo "$s" | ruby -lane 'print $F.sort_by(&:to_i) * " "' -983 5 23 756 $ s='5.33:2.2e3:42' $ echo "$s" | ruby -F: -lane 'print $F.sort_by{|n| -n.to_f} * ":"' 2.2e3:42:5.33 ``` * sorting characters within word * `chars` method returns array with individual characters ```bash $ echo 'foobar' | ruby -lne 'print $_.chars.sort * ""' abfoor $ cat words.txt bot art are boat toe flee reed $ # words with characters in ascending order $ # can also use: ruby -lne 'print if $_.chars == $_.chars.sort' words.txt $ ruby -lne 'print if $_ == $_.chars.sort * ""' words.txt bot art $ # words with characters in descending order $ # can also use: ruby -lne 'print if $_.chars == $_.chars.sort.reverse' $ ruby -lne 'print if $_ == 
$_.chars.sort {|a,b| b <=> a} * ""' words.txt toe reed ``` * sorting columns based on header ```bash $ # need to get indexes of order required for header, then use it for all lines $ # same as: perl -lane '@i = sort {$F[$a] cmp $F[$b]} 0..$#F if $.==1; $ # print join "\t", @F[@i]' marks.txt $ ruby -lane 'idx = $F.each_index.sort {|i,j| $F[i] <=> $F[j]} if $.==1; print $F.values_at(*idx) * "\t"' marks.txt Dept Marks Name ECE 53 Raj ECE 72 Joel EEE 68 Moi CSE 81 Surya EEE 59 Tia ECE 92 Om CSE 67 Amy ``` * [ruby-doc: uniq](https://ruby-doc.org/core-2.5.0/Array.html#method-i-uniq) * order is preserved ```bash $ s='3,b,a,c,d,1,d,c,2,3,1,b' $ # same as: perl -MList::MoreUtils=uniq -F, -lane 'print join ",",uniq @F' $ echo "$s" | ruby -F, -lane 'print $F.uniq * ","' 3,b,a,c,d,1,2 $ # same as: ruby -rset -ane 'BEGIN{s=Set.new}; print if s.add?($F[1])' $ # note that -n/-p option is not used $ ruby -e 'puts readlines.uniq {|s| s.split[1]}' duplicates.txt abc 7 4 food toy **** ``` * max/min values ```bash $ # if numeric array is constructed from string input $ echo '34,17,6' | ruby -F, -lane 'print $F.max {|a,b| a.to_i <=> b.to_i}' 34 $ # or convert numeric array first, 'map' is covered in next section $ echo '34,17,6' | ruby -F, -lane 'print $F.map(&:to_i).max' 34 $ echo '23.5,42,-36' | ruby -F, -lane 'puts $F.map(&:to_f).max' 42.0 $ # string comparison is default $ s='floor bat to dubious four' $ echo "$s" | ruby -lane 'print $F.min' bat $ # can also get max/min 'n' elements $ echo "$s" | ruby -lane 'print $F.max(2)' ["to", "four"] $ echo "$s" | ruby -lane 'print $F.min(3) {|a,b| a.size <=> b.size}' ["to", "bat", "four"] ```
#### Transforming * shuffling elements ```bash $ s='23 756 -983 5' $ echo "$s" | ruby -lane 'print $F.shuffle * " "' 5 756 -983 23 $ echo "$s" | ruby -lane 'print $F.shuffle * " "' 756 5 23 -983 $ # randomizing file contents $ # note that -n/-p option is not used $ ruby -e 'puts readlines.shuffle' poem.txt And so are you. Violets are blue, Roses are red, Sugar is sweet, $ # or if shuffle order is known $ seq 5 | ruby -e 'puts readlines.values_at(3,1,0,2,4)' 4 2 1 3 5 ``` * use `map` to transform every element * See also [stackoverflow What does map(&:name) mean in Ruby?](https://stackoverflow.com/questions/1217088/what-does-mapname-mean-in-ruby) for explanation on `&:` ```bash $ echo '23 756 -983 5' | ruby -lane 'print $F.map {|n| n.to_i ** 2} * " "' 529 571536 966289 25 $ echo 'a b c' | ruby -lane 'print $F.map {|s| %Q/"#{s}"/} * ","' "a","b","c" $ echo 'a b c' | ruby -lane 'print $F.map {|s| %Q/"#{s}"/.upcase} * ","' "A","B","C" $ # ASCII int values for each character $ echo 'AaBbCc' | ruby -lne 'print $_.chars.map(&:ord) * " "' 65 97 66 98 67 99 $ echo '34,17,6' | ruby -F, -lane 'puts $F.map(&:to_i).sum' 57 $ # shuffle each field character wise $ s='this is a sample sentence' $ echo "$s" | ruby -lane 'print $F.map {|s| s.chars.shuffle * ""} * " "' hsti si a mlepas esencnet ``` * reverse array/string ```bash $ s='23 756 -983 5' $ echo "$s" | ruby -lane 'print $F.reverse * " "' 5 -983 756 23 $ echo 'foobar' | ruby -lne 'print $_.reverse' raboof $ # or inplace reverse $ echo 'foobar' | ruby -lpe '$_.reverse!' raboof ``` * See also [ruby-doc: Enumerable](https://ruby-doc.org/core-2.5.0/Enumerable.html) for more methods like `inject`
## Miscellaneous
#### split * the `-a` command line option uses `split` and automatically saves the results in `$F` array * default separator is `\s+` and also strips whitespace from start/end of string * See also [ruby-doc: split](https://ruby-doc.org/core-2.5.0/String.html#method-i-split) ```bash $ # specifying maximum number of splits $ # same as: perl -lne 'print join ":", split /\s+/,$_,2' $ echo 'a 1 b 2 c' | ruby -lne 'print $_.split(/\s+/, 2) * ":"' a:1 b 2 c $ # by default, trailing empty fields are stripped $ echo ':123::' | ruby -lne 'print $_.split(/:/) * ","' ,123 $ # specify a negative count to preserve trailing empty fields $ echo ':123::' | ruby -lne 'print $_.split(/:/, -1) * ","' ,123,, $ # use string argument for fixed-string split instead of regexp $ echo 'foo**123**baz' | ruby -lne 'print $_.split("**") * ":"' foo:123:baz $ # to save the separators as well, use capture groups $ s='Sample123string54with908numbers' $ echo "$s" | ruby -lne 'print $_.split(/(\d+)/) * ":"' Sample:123:string:54:with:908:numbers ``` * single line to multiple line by splitting a column ```bash $ cat split.txt foo,1:2:5,baz wry,4,look free,3:8,oh $ # same as: perl -F, -ane 'print join ",", $F[0],$_,$F[2] for split /:/,$F[1]' $ ruby -F, -ane '$F[1].split(/:/).each {|x| print [$F[0],x,$F[2]]*","}' split.txt foo,1,baz foo,2,baz foo,5,baz wry,4,look free,3,oh free,8,oh $ # can also use scan here: $ # ruby -F, -ane '$F[1].scan(/[^:]+/) {|x| print [$F[0],x,$F[2]]*","}' ```
#### Fixed width processing * [ruby-doc: unpack](https://ruby-doc.org/core-2.5.0/String.html#method-i-unpack) ```bash $ # same as: perl -lne '@x = unpack("a1xa3xa4", $_); print $x[0]' $ # here 'a' indicates arbitrary binary string $ # the number that follows indicates length $ # the 'x' indicates characters to ignore, use length after 'x' if needed $ # and there are many other formats, see ruby-doc for details $ echo 'b 123 good' | ruby -lne 'print $_.unpack("a1xa3xa4")[0]' b $ echo 'b 123 good' | ruby -lne 'print $_.unpack("a1xa3xa4")[1]' 123 $ echo 'b 123 good' | ruby -lne 'print $_.unpack("a1xa3xa4")[2]' good $ # unpack not always needed, simple slicing might help $ echo 'b 123 good' | ruby -ne 'puts $_[2,3]' 123 $ echo 'b 123 good' | ruby -ne 'puts $_[6,4]' good $ # replacing arbitrary slice $ # same as: perl -lpe 'substr $_, 2, 3, "gleam"' $ echo 'b 123 good' | ruby -lpe '$_[2,3] = "gleam"' b gleam good ```
#### String and file replication ```bash $ # replicate each line, same as: perl -ne 'print $_ x 2' $ seq 2 | ruby -ne 'print $_ * 2' 1 1 2 2 $ # replicate a string, same as: perl -le 'print "abc" x 5' $ ruby -e 'puts "abc" * 5' abcabcabcabcabc $ # works for array too, but be careful with mutable elements $ ruby -le 'x = [3, 2, 1] * 2; print x' [3, 2, 1, 3, 2, 1] $ ruby -le 'x = [3, 2, [1, 7]] * 2; x[2][0]="a"; print x' [3, 2, ["a", 7], 3, 2, ["a", 7]] $ # replicating file, same as: perl -0777 -ne 'print $_ x 100' $ wc -c poem.txt 65 poem.txt $ ruby -0777 -ne 'print $_ * 100' poem.txt | wc -c 6500 ```
#### transliteration * [ruby-doc: tr](https://ruby-doc.org/core-2.5.0/String.html#method-i-tr) ```bash $ echo 'Uryyb Jbeyq' | ruby -pe '$_.tr!("a-zA-Z", "n-za-mN-ZA-M")' Hello World $ echo 'hi there!' | ruby -pe '$_.tr!("a-z", "\u{1d5ee}-\u{1d607}")' 𝗵𝗶 𝘁𝗵𝗲𝗿𝗲! $ # when first argument is longer $ # the last character of second argument is padded $ echo 'foo bar cat baz' | ruby -pe '$_.tr!("a-z", "123")' 333 213 313 213 $ # use ^ at start of first argument to complement specified characters $ echo 'foo:123:baz' | ruby -lpe '$_.tr!("^0-9", "-")' ----123---- $ # use empty second argument to delete specified characters $ echo '"Foo1!", "Bar.", ":Baz:"' | ruby -lpe '$_.tr!("^A-Za-z,", "")' Foo,Bar,Baz $ # use - at start/end and ^ other than start to match themselves $ echo 'a^3-b*d' | ruby -lpe '$_.tr!("-^*", "*/+")' a/3*b+d ```
#### Executing external commands * External commands can be issued using `system` function * Output would be as usual on `stdout` unless redirected while calling the command ```bash $ # same as: perl -e 'system("echo Hello World")' $ ruby -e 'system("echo Hello World")' Hello World $ ruby -e 'system("wc poem.txt")' 4 13 65 poem.txt $ ruby -e 'system("seq 10 | paste -sd, > out.txt")' $ cat out.txt 1,2,3,4,5,6,7,8,9,10 $ cat f2 I bought two bananas and three mangoes $ # same as: perl -F, -lane 'system "cat $F[1]"' $ echo 'f1,f2,odd.txt' | ruby -F, -lane 'system("cat #{$F[1]}")' I bought two bananas and three mangoes ``` * return value of `system` or global variable `$?` can be used to act upon exit status of command issued * see [ruby-doc: system](https://ruby-doc.org/core-2.5.0/Kernel.html#method-i-system) for details ```bash $ ruby -e 'es=system("ls poem.txt"); puts es' poem.txt true $ ruby -e 'system("ls poem.txt"); puts $?' poem.txt pid 17005 exit 0 $ ruby -e 'system("ls xyz.txt"); puts $?' ls: cannot access 'xyz.txt': No such file or directory pid 17059 exit 2 ``` * to save result of external command, use backticks or `%x` ```bash $ ruby -e 'lines = `wc -l < poem.txt`; print lines' 4 $ ruby -e 'nums = %x/seq 3/; print nums' 1 2 3 ``` * See also [stackoverflow - difference between exec, system and %x() or backticks](https://stackoverflow.com/questions/6338908/ruby-difference-between-exec-system-and-x-or-backticks)
## Further Reading * Manual and related * [ruby-lang documentation](https://www.ruby-lang.org/en/documentation/) - manuals, tutorials and references * [ruby-lang - faqs](https://www.ruby-lang.org/en/documentation/faq/) * [ruby-lang - quickstart](https://www.ruby-lang.org/en/documentation/quickstart/) * [ruby-lang - To Ruby From Perl](https://www.ruby-lang.org/en/documentation/ruby-from-other-languages/to-ruby-from-perl/) * [rubular - Ruby regular expression editor](http://rubular.com/) * Tutorials and Q&A * [Smooth Ruby One-Liners](https://dev.to/rpalo/smooth-ruby-one-liners-154) - simple intro to ruby one-liners * [Ruby one-liners](http://benoithamelin.tumblr.com/ruby1line) based on [awk one-liners](http://www.pement.org/awk/awk1line.txt) * [Ruby Tricks, Idiomatic Ruby, Refactorings and Best Practices](https://franzejr.github.io/best-ruby/index.html) * [freecodecamp - learning Ruby](https://medium.freecodecamp.org/learning-ruby-from-zero-to-hero-90ad4eecc82d) * [Ruby Regexp](https://leanpub.com/rubyregexp) ebook - step by step guide from beginner to advanced levels * [regex FAQ on SO](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) * Alternatives * [bioruby](https://github.com/bioruby/bioruby) * [perl](https://perldoc.perl.org/) * [unix.stackexchange - When to use grep, sed, awk, perl, etc](https://unix.stackexchange.com/questions/303044/when-to-use-grep-less-awk-sed) ================================================ FILE: sorting_stuff.md ================================================ # Sorting stuff **Table of Contents** * [sort](#sort) * [Default sort](#default-sort) * [Reverse sort](#reverse-sort) * [Various number sorting](#various-number-sorting) * [Random sort](#random-sort) * [Specifying output file](#specifying-output-file) * [Unique sort](#unique-sort) * [Column based sorting](#column-based-sorting) * [Further reading for sort](#further-reading-for-sort) * [uniq](#uniq) * [Default uniq](#default-uniq) * [Only 
duplicates](#only-duplicates) * [Only unique](#only-unique) * [Prefix count](#prefix-count) * [Ignoring case](#ignoring-case) * [Combining multiple files](#combining-multiple-files) * [Column options](#column-options) * [Further reading for uniq](#further-reading-for-uniq) * [comm](#comm) * [Default three column output](#default-three-column-output) * [Suppressing columns](#suppressing-columns) * [Files with duplicates](#files-with-duplicates) * [Further reading for comm](#further-reading-for-comm) * [shuf](#shuf) * [Random lines](#random-lines) * [Random integer numbers](#random-integer-numbers) * [Further reading for shuf](#further-reading-for-shuf)
## sort

```bash
$ sort --version | head -n1
sort (GNU coreutils) 8.25

$ man sort
SORT(1)                          User Commands                         SORT(1)

NAME
       sort - sort lines of text files

SYNOPSIS
       sort [OPTION]... [FILE]...
       sort [OPTION]... --files0-from=F

DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.

       With no FILE, or when FILE is -, read standard input.
...
```

**Note**: All examples shown here assume ASCII encoded input files
#### Default sort ```bash $ cat poem.txt Roses are red, Violets are blue, Sugar is sweet, And so are you. $ sort poem.txt And so are you. Roses are red, Sugar is sweet, Violets are blue, ``` * Well, that was easy. The lines were sorted alphabetically (ascending order by default) and it so happened that first letter alone was enough to decide the order * For next example, let's extract all the words and sort them * also allows to showcase `sort` accepting stdin * See [GNU grep](./gnu_grep.md) chapter if the `grep` command used below looks alien ```bash $ # output might differ depending on locale settings $ # note the case-insensitiveness of output $ grep -oi '[a-z]*' poem.txt | sort And are are are blue is red Roses so Sugar sweet Violets you ``` * heed hereunto * See also * [arch wiki - locale](https://wiki.archlinux.org/index.php/locale) * [Linux: Define Locale and Language Settings](https://www.shellhacks.com/linux-define-locale-language-settings/) ```bash $ info sort | tail (1) If you use a non-POSIX locale (e.g., by setting ‘LC_ALL’ to ‘en_US’), then ‘sort’ may produce output that is sorted differently than you’re accustomed to. In that case, set the ‘LC_ALL’ environment variable to ‘C’. Note that setting only ‘LC_COLLATE’ has two problems. First, it is ineffective if ‘LC_ALL’ is also set. Second, it has undefined behavior if ‘LC_CTYPE’ (or ‘LANG’, if ‘LC_CTYPE’ is unset) is set to an incompatible value. For example, you get undefined behavior if ‘LC_CTYPE’ is ‘ja_JP.PCK’ but ‘LC_COLLATE’ is ‘en_US.UTF-8’. ``` * Example to help show effect of locale setting ```bash $ # note how uppercase is sorted before lowercase $ grep -oi '[a-z]*' poem.txt | LC_ALL=C sort And Roses Sugar Violets are are are blue is red so sweet you ```
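The locale note above can be checked with a small script. This is only a sketch: the word list is made up for illustration (it is not `poem.txt`) and GNU `sort` is assumed. `LC_ALL=C` is the setting suggested by `info sort`, and `-f`/`-s` are the ignore-case and stable options.

```bash
# hypothetical sample words, chosen to mix upper and lower case
words=$(printf '%s\n' banana Apple cherry Banana apple)

# C locale compares byte values, so uppercase sorts before lowercase
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"

# -f ignores case while comparing
# -s keeps input order for lines that compare equal after folding
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f -s)
printf '%s\n' "$folded"
```

Setting `LC_ALL=C` for just the `sort` command, as done here, avoids changing the locale for the rest of the shell session.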
#### Reverse sort * This is simply reversing from default ascending order to descending order ```bash $ sort -r poem.txt Violets are blue, Sugar is sweet, Roses are red, And so are you. ```
#### Various number sorting ```bash $ cat numbers.txt 20 53 3 101 $ sort numbers.txt 101 20 3 53 ``` * Whoops, what happened there? `sort` won't know to treat them as numbers unless specified * Depending on format of numbers, different options have to be used * First up is `-n` option, which sorts based on numerical value ```bash $ sort -n numbers.txt 3 20 53 101 $ sort -nr numbers.txt 101 53 20 3 ``` * The `-n` option can handle negative numbers * As well as thousands separator and decimal point (depends on locale) * The `<()` syntax is [Process Substitution](http://mywiki.wooledge.org/ProcessSubstitution) * to put it simply - allows output of command to be passed as input file to another command without needing to manually create a temporary file ```bash $ # multiple files are merged as single input by default $ sort -n numbers.txt <(echo '-4') -4 3 20 53 101 $ sort -n numbers.txt <(echo '1,234') 3 20 53 101 1,234 $ sort -n numbers.txt <(echo '31.24') 3 20 31.24 53 101 ``` * Use `-g` if input contains numbers prefixed by `+` or [E scientific notation](https://en.wikipedia.org/wiki/Scientific_notation#E_notation) ```bash $ cat generic_numbers.txt +120 -1.53 3.14e+4 42.1e-2 $ sort -g generic_numbers.txt -1.53 42.1e-2 +120 3.14e+4 ``` * Commands like `du` have options to display numbers in human readable formats * `sort` supports sorting such numbers using the `-h` option ```bash $ du -sh * 104K power.log 746M projects 316K report.log 20K sample.txt $ du -sh * | sort -h 20K sample.txt 104K power.log 316K report.log 746M projects $ # --si uses powers of 1000 instead of 1024 $ du -s --si * 107k power.log 782M projects 324k report.log 21k sample.txt $ du -s --si * | sort -h 21k sample.txt 107k power.log 324k report.log 782M projects ``` * Version sort - dealing with numbers mixed with other characters * If this sorting is needed simply while displaying directory contents, use `ls -v` instead of piping to `sort -V` ```bash $ cat versions.txt foo_v1.2 bar_v2.1.3 
foobar_v2 foo_v1.2.1 foo_v1.3 $ sort -V versions.txt bar_v2.1.3 foobar_v2 foo_v1.2 foo_v1.2.1 foo_v1.3 ``` * Another common use case is when there are multiple filenames differentiated by numbers ```bash $ cat files.txt file0 file10 file3 file4 $ sort -V files.txt file0 file3 file4 file10 ``` * Can be used when dealing with numbers reported by `time` command as well ```bash $ # different solving durations $ cat rubik_time.txt 5m35.363s 3m20.058s 4m5.099s 4m1.130s 3m42.833s 4m33.083s $ # assuming consistent min/sec format $ sort -V rubik_time.txt 3m20.058s 3m42.833s 4m1.130s 4m5.099s 4m33.083s 5m35.363s ```
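As a quick recap of the options in this section, here is a sketch that runs a few of them on small made-up inputs. The sample values are assumptions for illustration, GNU `sort` is assumed, and `LC_ALL=C` is used so that the results do not depend on locale.

```bash
# hypothetical inputs chosen to show where each option matters
nums=$(printf '%s\n' 20 3 101 -4)
plain=$(printf '%s\n' "$nums" | LC_ALL=C sort)        # character-wise comparison
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)   # numeric value

sizes=$(printf '%s\n' 2M 512K 1G 20K)
human=$(printf '%s\n' "$sizes" | LC_ALL=C sort -h)    # human readable suffixes

files=$(printf '%s\n' file10 file3 file0)
version=$(printf '%s\n' "$files" | LC_ALL=C sort -V)  # numbers mixed with text

printf '%s\n' "$plain" "$numeric" "$human" "$version"
```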
#### Random sort * Note that duplicate lines will always end up next to each other * might be useful as a feature for some cases ;) * Use `shuf` if this is not desirable * See also [How can I shuffle the lines of a text file on the Unix command line or in a shell script?](https://stackoverflow.com/questions/2153882/how-can-i-shuffle-the-lines-of-a-text-file-on-the-unix-command-line-or-in-a-shel) ```bash $ cat nums.txt 1 10 10 12 23 563 $ # the two 10s will always be next to each other $ sort -R nums.txt 563 12 1 10 10 23 $ # duplicates can end up anywhere $ shuf nums.txt 10 23 1 10 563 12 ```
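The difference between `sort -R` and `shuf` can also be checked without eyeballing the output. A minimal sketch, assuming GNU coreutils and using the same numbers as `nums.txt` supplied inline:

```bash
# same values as nums.txt, note the repeated 10
input=$(printf '%s\n' 1 10 10 12 23 563)

# sort -R orders lines by a random hash of the key, identical lines hash alike
# so the two 10s are always adjacent and uniq collapses them
groups_after_sort=$(printf '%s\n' "$input" | sort -R | uniq | wc -l)

# shuf only permutes the lines, the number of lines is unchanged
lines_after_shuf=$(printf '%s\n' "$input" | shuf | wc -l)

echo "$groups_after_sort $lines_after_shuf"
```

The first count is always 5 because the duplicates stay together, whichever random order is picked on a given run.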
#### Specifying output file

* The `-o` option can be used to specify output file
* Useful for in place editing

```bash
$ sort -R nums.txt -o rand_nums.txt
$ cat rand_nums.txt
23
1
10
10
563
12

$ sort -R nums.txt -o nums.txt
$ cat nums.txt
563
23
10
10
1
12
```

* Use shell script looping if there are multiple files to be sorted in place
* Below snippet is for `bash` shell

```bash
$ for f in *.txt; do echo sort -V "$f" -o "$f"; done
sort -V files.txt -o files.txt
sort -V rubik_time.txt -o rubik_time.txt
sort -V versions.txt -o versions.txt

$ # remove echo once commands look fine
$ for f in *.txt; do sort -V "$f" -o "$f"; done
```
#### Unique sort

* Keep only first copy of lines that are deemed to be same according to `sort` option used

```bash
$ cat duplicates.txt
foo
12 carrots
foo
12 apples
5 guavas

$ # only one copy of foo in output
$ sort -u duplicates.txt
12 apples
12 carrots
5 guavas
foo
```

* According to option used, definition of duplicate will vary
    * For example, when `-n` is used, matching numbers are deemed same even if rest of line differs
    * Pipe the output to `uniq` if this is not desirable

```bash
$ # note how first copy of line starting with 12 is retained
$ sort -nu duplicates.txt
foo
5 guavas
12 carrots

$ # use uniq when entire line should be compared to find duplicates
$ sort -n duplicates.txt | uniq
foo
5 guavas
12 apples
12 carrots
```

* Use `-f` option to ignore case of letters while determining duplicates

```bash
$ cat words.txt
CAR
are
car
Are
foot
are

$ # only the two 'are' were considered duplicates
$ sort -u words.txt
are
Are
car
CAR
foot

$ # note again that first copy of duplicate is retained
$ sort -fu words.txt
are
CAR
foot
```
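To see why `-o` suits in place editing, here is a minimal sketch that works on a throwaway file created with `mktemp`. The file contents are made up for illustration.

```bash
# work on a temporary file so that nothing real is modified
tmp=$(mktemp)
printf '%s\n' 3 1 2 > "$tmp"

# -o is safe even when the output file is also an input file,
# as sort reads the input before opening the output for writing
sort -n -o "$tmp" "$tmp"
in_place=$(cat "$tmp")
printf '%s\n' "$in_place"

# by contrast, with: sort -n "$tmp" > "$tmp"
# the shell truncates the file before sort gets to read it
rm -f "$tmp"
```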
#### Column based sorting From `info sort` ``` ‘-k POS1[,POS2]’ ‘--key=POS1[,POS2]’ Specify a sort field that consists of the part of the line between POS1 and POS2 (or the end of the line, if POS2 is omitted), _inclusive_. Each POS has the form ‘F[.C][OPTS]’, where F is the number of the field to use, and C is the number of the first character from the beginning of the field. Fields and character positions are numbered starting with 1; a character position of zero in POS2 indicates the field’s last character. If ‘.C’ is omitted from POS1, it defaults to 1 (the beginning of the field); if omitted from POS2, it defaults to 0 (the end of the field). OPTS are ordering options, allowing individual keys to be sorted according to different rules; see below for details. Keys can span multiple fields. ``` * By default, blank characters (space and tab) serve as field separators ```bash $ cat fruits.txt apple 42 guava 6 fig 90 banana 31 $ sort fruits.txt apple 42 banana 31 fig 90 guava 6 $ # sort based on 2nd column numbers $ sort -k2,2n fruits.txt guava 6 banana 31 apple 42 fig 90 ``` * Using a different field separator * Consider the following sample input file having fields separated by `:` ```bash $ # name:pet_name:no_of_pets $ cat pets.txt foo:dog:2 xyz:cat:1 baz:parrot:5 abcd:cat:3 joe:dog:1 bar:fox:1 temp_var:squirrel:4 boss:dog:10 ``` * Sorting based on particular column or column to end of line * In case of multiple entries, by default `sort` would use content of remaining parts of line to resolve ```bash $ # only 2nd column $ # -k2,4 would mean 2nd column to 4th column $ sort -t: -k2,2 pets.txt abcd:cat:3 xyz:cat:1 boss:dog:10 foo:dog:2 joe:dog:1 bar:fox:1 baz:parrot:5 temp_var:squirrel:4 $ # from 2nd column to end of line $ sort -t: -k2 pets.txt xyz:cat:1 abcd:cat:3 joe:dog:1 boss:dog:10 foo:dog:2 bar:fox:1 baz:parrot:5 temp_var:squirrel:4 ``` * Multiple keys can be specified to resolve ties * Note that if there are still multiple entries with specified keys, 
remaining parts of lines would be used ```bash $ # default sort for 2nd column, numeric sort on 3rd column to resolve ties $ sort -t: -k2,2 -k3,3n pets.txt xyz:cat:1 abcd:cat:3 joe:dog:1 foo:dog:2 boss:dog:10 bar:fox:1 baz:parrot:5 temp_var:squirrel:4 $ # numeric sort on 3rd column, default sort for 2nd column to resolve ties $ sort -t: -k3,3n -k2,2 pets.txt xyz:cat:1 joe:dog:1 bar:fox:1 foo:dog:2 abcd:cat:3 temp_var:squirrel:4 baz:parrot:5 boss:dog:10 ``` * Use `-s` option to retain original order of lines in case of tie ```bash $ sort -s -t: -k2,2 pets.txt xyz:cat:1 abcd:cat:3 foo:dog:2 joe:dog:1 boss:dog:10 bar:fox:1 baz:parrot:5 temp_var:squirrel:4 ``` * The `-u` option, as seen earlier, will retain only first match ```bash $ sort -u -t: -k2,2 pets.txt xyz:cat:1 foo:dog:2 bar:fox:1 baz:parrot:5 temp_var:squirrel:4 $ sort -u -t: -k3,3n pets.txt xyz:cat:1 foo:dog:2 abcd:cat:3 temp_var:squirrel:4 baz:parrot:5 boss:dog:10 ``` * Sometimes, the input has to be sorted first and then `-u` used on the sorted output * See also [remove duplicates based on the value of another column](https://unix.stackexchange.com/questions/379835/remove-duplicates-based-on-the-value-of-another-column) ```bash $ # sort by number in 3rd column $ sort -t: -k3,3n pets.txt bar:fox:1 joe:dog:1 xyz:cat:1 foo:dog:2 abcd:cat:3 temp_var:squirrel:4 baz:parrot:5 boss:dog:10 $ # then get unique entry based on 2nd column $ sort -t: -k3,3n pets.txt | sort -t: -u -k2,2 xyz:cat:1 joe:dog:1 bar:fox:1 baz:parrot:5 temp_var:squirrel:4 ``` * Specifying particular characters within fields * If character position is not specified, defaults to `1` for starting column and `0` (last character) for ending column ```bash $ cat marks.txt fork,ap_12,54 flat,up_342,1.2 fold,tn_48,211 more,ap_93,7 rest,up_5,63 $ # for 2nd column, sort numerically only from 4th character to end $ sort -t, -k2.4,2n marks.txt rest,up_5,63 fork,ap_12,54 fold,tn_48,211 more,ap_93,7 flat,up_342,1.2 $ # sort uniquely based on first two 
characters of line $ sort -u -k1.1,1.2 marks.txt flat,up_342,1.2 fork,ap_12,54 more,ap_93,7 rest,up_5,63 ``` * If there are headers ```bash $ cat header.txt fruit qty apple 42 guava 6 fig 90 banana 31 $ # separate and combine header and content to be sorted $ cat <(head -n1 header.txt) <(tail -n +2 header.txt | sort -k2nr) fruit qty fig 90 apple 42 banana 31 guava 6 ``` * See also [sort by last field value when number of fields varies](https://stackoverflow.com/questions/3832068/bash-sort-text-file-by-last-field-value)
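The effect of an extra key versus the `-s` option can be compared side by side. A minimal sketch, assuming GNU `sort`; the records below are made up in the style of `pets.txt`, and `LC_ALL=C` keeps the text comparison independent of locale.

```bash
# hypothetical colon separated records: name:pet_name:no_of_pets
data=$(printf '%s\n' foo:dog:2 boss:dog:10 joe:dog:1 xyz:cat:3 abcd:cat:1)

# first key: 2nd field as text, second key: 3rd field as number
by_pet_then_count=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k2,2 -k3,3n)

# -s keeps the input order for lines whose keys compare equal
stable_by_pet=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t: -k2,2)

printf '%s\n' "$by_pet_then_count" '---' "$stable_by_pet"
```

Both results group `cat` before `dog`. The order within each group differs: numeric order of the 3rd field in the first case, original input order in the second.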
#### Further reading for sort * There are many other options apart from handful presented above. See `man sort` and `info sort` for detailed documentation and more examples * [sort like a master](http://www.skorks.com/2010/05/sort-files-like-a-master-with-the-linux-sort-command-bash/) * [When -b to ignore leading blanks is needed](https://unix.stackexchange.com/a/104527/109046) * [sort Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/sort?sort=votes&pageSize=15) * [sort on multiple columns using -k option](https://unix.stackexchange.com/questions/249452/unix-multiple-column-sort-issue) * [sort a string character wise](https://stackoverflow.com/questions/2373874/how-to-sort-characters-in-a-string) * [Scalability of 'sort -u' for gigantic files](https://unix.stackexchange.com/questions/279096/scalability-of-sort-u-for-gigantic-files)
## uniq ```bash $ uniq --version | head -n1 uniq (GNU coreutils) 8.25 $ man uniq UNIQ(1) User Commands UNIQ(1) NAME uniq - report or omit repeated lines SYNOPSIS uniq [OPTION]... [INPUT [OUTPUT]] DESCRIPTION Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output). With no options, matching lines are merged to the first occurrence. ... ```
#### Default uniq ```bash $ cat word_list.txt are are to good bad bad bad good are bad $ # adjacent duplicate lines are removed, leaving one copy $ uniq word_list.txt are to good bad good are bad $ # To remove duplicates from entire file, input has to be sorted first $ # also showcases that uniq accepts stdin as input $ sort word_list.txt | uniq are bad good to ```
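
As a quick check of the sorted-input approach above, here is a minimal sketch that recreates the sample words inline and compares `sort | uniq` against `sort -u` from the `sort` section. The variable names are only for illustration.

```bash
# sample input, same words as in word_list.txt
words='are
are
to
good
bad
bad
bad
good
are
bad'

# sort first so that duplicates become adjacent, then let uniq merge them
via_uniq=$(printf '%s\n' "$words" | sort | uniq)

# sort -u performs both steps in a single command
via_sort_u=$(printf '%s\n' "$words" | sort -u)

printf '%s\n' "$via_uniq"
```

Both variables should hold the same four lines. The two-command form is still needed whenever `uniq` options such as `-c` or `-d` are required.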
#### Only duplicates ```bash $ # duplicates adjacent to each other $ uniq -d word_list.txt are bad $ # duplicates in entire file $ sort word_list.txt | uniq -d are bad good ``` * To get only duplicates as well as show all duplicates ```bash $ uniq -D word_list.txt are are bad bad bad $ sort word_list.txt | uniq -D are are are bad bad bad bad good good ``` * To distinguish the different groups ```bash $ # using --all-repeated=prepend will add a newline before the first group as well $ sort word_list.txt | uniq --all-repeated=separate are are are bad bad bad bad good good ```
#### Only unique ```bash $ # lines with no adjacent duplicates $ uniq -u word_list.txt to good good are bad $ # unique lines in entire file $ sort word_list.txt | uniq -u to ```
#### Prefix count ```bash $ # adjacent lines $ uniq -c word_list.txt 2 are 1 to 1 good 3 bad 1 good 1 are 1 bad $ # entire file $ sort word_list.txt | uniq -c 3 are 4 bad 2 good 1 to $ # entire file, only duplicates $ sort word_list.txt | uniq -cd 3 are 4 bad 2 good ``` * Sorting by count ```bash $ # sort by count $ sort word_list.txt | uniq -c | sort -n 1 to 2 good 3 are 4 bad $ # reverse the order, highest count first $ sort word_list.txt | uniq -c | sort -nr 4 bad 3 are 2 good 1 to ``` * To get only entries with min/max count, bit of [awk](./gnu_awk.md) magic would help ```bash $ # consider this result $ sort colors.txt | uniq -c | sort -nr 3 Red 3 Blue 2 Yellow 1 Green 1 Black $ # to get all max count $ # save 1st line 1st column value to c and then print if 1st column equals c $ sort colors.txt | uniq -c | sort -nr | awk 'NR==1{c=$1} $1==c' 3 Red 3 Blue $ # to get all min count $ sort colors.txt | uniq -c | sort -n | awk 'NR==1{c=$1} $1==c' 1 Black 1 Green ``` * Get rough count of most used commands from `history` file ```bash $ # awk '{print $1}' will get the 1st column alone $ awk '{print $1}' "$HISTFILE" | sort | uniq -c | sort -nr | head 1465 echo 1180 grep 552 cd 531 awk 451 sed 423 vi 418 cat 392 perl 325 printf 320 sort $ # extract command name from start of line or preceded by 'spaces|spaces' $ # won't catch commands in other places like command substitution though $ grep -oP '(^| +\| +)\K[^ ]+' "$HISTFILE" | sort | uniq -c | sort -nr | head 2006 grep 1469 echo 933 sed 698 awk 552 cd 513 perl 510 cat 453 sort 423 vi 327 printf ```
#### Ignoring case ```bash $ cat another_list.txt food Food good are bad Are $ # note how first copy is retained $ uniq -i another_list.txt food good are bad Are $ uniq -iD another_list.txt food Food ```
#### Combining multiple files ```bash $ sort -f word_list.txt another_list.txt | uniq -i are bad food good to $ sort -f word_list.txt another_list.txt | uniq -c 4 are 1 Are 5 bad 1 food 1 Food 3 good 1 to $ sort -f word_list.txt another_list.txt | uniq -ic 5 are 5 bad 2 food 3 good 1 to ``` * If only adjacent lines (not sorted) is required, need to concatenate files using another command ```bash $ uniq -id word_list.txt are bad $ uniq -id another_list.txt food $ cat word_list.txt another_list.txt | uniq -id are bad food ```
#### Column options * `uniq` has few options dealing with column manipulations. Not extensive as `sort -k` but handy for some cases * First up, skipping fields * No option to specify different delimiter * From `info uniq`: Fields are sequences of non-space non-tab characters that are separated from each other by at least one space or tab * Number of spaces/tabs between fields should be same ```bash $ cat shopping.txt lemon 5 mango 5 banana 8 bread 1 orange 5 $ # skips first field $ uniq -f1 shopping.txt lemon 5 banana 8 bread 1 orange 5 $ # use -f3 to skip first three fields and so on ``` * Skipping characters ```bash $ cat text glue blue black stack stuck $ # don't consider first 2 characters $ uniq -s2 text glue black stuck $ # to visualize the above example $ # assume there are two fields and uniq is applied on 2nd column $ sed 's/^../& /' text gl ue bl ue bl ack st ack st uck ``` * Upto specified characters ```bash $ # consider only first 2 characters $ uniq -w2 text glue blue stack $ # to visualize the above example $ # assume there are two fields and uniq is applied on 1st column $ sed 's/^../& /' text gl ue bl ue bl ack st ack st uck ``` * Combining `-s` and `-w` * Can be combined with `-f` as well ```bash $ # skip first 3 characters and then use next 2 characters $ uniq -s3 -w2 text glue black ```
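
Since `uniq` has no option to change the field delimiter, one possible workaround is to convert the delimiter to a space before using `-f`. The sketch below assumes a made-up comma separated input whose fields contain no spaces or commas.

```bash
# hypothetical comma separated input: name,quantity
csv='lemon,5
mango,5
banana,8
bread,1
orange,5'

# uniq cannot use ',' as field separator,
# so convert ',' to space and then skip the first field
result=$(printf '%s\n' "$csv" | tr ',' ' ' | uniq -f1)

printf '%s\n' "$result"
```

The output is space separated. If the original delimiter has to be restored, another `tr ' ' ','` can be added at the end of the pipeline.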
#### Further reading for uniq * Do check out `man uniq` and `info uniq` for other options and more detailed documentation * [uniq Q&A on unix stackexchange](http://unix.stackexchange.com/questions/tagged/uniq?sort=votes&pageSize=15) * [process duplicate lines only based on certain fields](https://unix.stackexchange.com/questions/387590/print-the-duplicate-lines-only-on-fields-1-2-from-csv-file)
## comm ```bash $ comm --version | head -n1 comm (GNU coreutils) 8.25 $ man comm COMM(1) User Commands COMM(1) NAME comm - compare two sorted files line by line SYNOPSIS comm [OPTION]... FILE1 FILE2 DESCRIPTION Compare sorted files FILE1 and FILE2 line by line. When FILE1 or FILE2 (not both) is -, read standard input. With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files. ... ```
#### Default three column output Consider below sample input files ```bash $ # sorted input files viewed side by side $ paste colors_1.txt colors_2.txt Blue Black Brown Blue Purple Green Red Red Teal White Yellow ``` * Without any option, `comm` gives 3 column output * lines unique to first file * lines unique to second file * lines common to both files ```bash $ comm colors_1.txt colors_2.txt Black Blue Brown Green Purple Red Teal White Yellow ```
#### Suppressing columns * `-1` suppress lines unique to first file * `-2` suppress lines unique to second file * `-3` suppress lines common to both files ```bash $ # suppressing column 3 $ comm -3 colors_1.txt colors_2.txt Black Brown Green Purple Teal White Yellow ``` * Combining options gives three distinct and useful constructs * First, getting only common lines to both files ```bash $ comm -12 colors_1.txt colors_2.txt Blue Red ``` * Second, lines unique to first file ```bash $ comm -23 colors_1.txt colors_2.txt Brown Purple Teal Yellow ``` * And the third, lines unique to second file ```bash $ comm -13 colors_1.txt colors_2.txt Black Green White ``` * See also how the above three cases can be done [using grep alone](./gnu_grep.md#search-strings-from-file) * **Note** input files do not need to be sorted for `grep` solution If different `sort` order than default is required, use `--nocheck-order` to ignore error message ```bash $ comm -23 <(sort -n numbers.txt) <(sort -n nums.txt) 3 comm: file 1 is not in sorted order 20 53 101 $ comm --nocheck-order -23 <(sort -n numbers.txt) <(sort -n nums.txt) 3 20 53 101 ```
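
The examples above use input files that are already sorted. For unsorted files, a minimal sketch would be to sort them first and then apply `comm`. The scratch directory and file names below are made up for illustration. Temporary files are used here, though process substitution like `<(sort file)` seen earlier would work as well in `bash`.

```bash
# scratch directory, for illustration only
dir=$(mktemp -d)
printf 'Red\nBlue\nTeal\nBrown\n' > "$dir/a.txt"
printf 'White\nRed\nBlack\nBlue\n' > "$dir/b.txt"

# comm expects sorted input, so sort each file first
sort "$dir/a.txt" > "$dir/a_sorted.txt"
sort "$dir/b.txt" > "$dir/b_sorted.txt"

# lines common to both files
common=$(comm -12 "$dir/a_sorted.txt" "$dir/b_sorted.txt")
# lines found only in the first file
only_first=$(comm -23 "$dir/a_sorted.txt" "$dir/b_sorted.txt")

printf 'common:\n%s\n' "$common"
printf 'only in first:\n%s\n' "$only_first"

rm -r "$dir"
```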
#### Files with duplicates

* If a line is repeated in both files, as many copies as can be paired between the two files are treated as common
* The remaining copies are treated as unique to the respective file
* This is useful for cases like finding lines present in the first file but not in the second, taking into consideration the count of duplicates as well
    * This solution won't be possible with `grep`

```bash
$ paste list1 list2
a       a
a       b
a       c
b       c
b       d
c

$ comm list1 list2
                a
a
a
                b
b
                c
        c
        d

$ comm -23 list1 list2
a
a
b
```
#### Further reading for comm * `man comm` and `info comm` for more options and detailed documentation * [comm Q&A on unix stackexchange](http://unix.stackexchange.com/questions/tagged/comm?sort=votes&pageSize=15)
## shuf ```bash $ shuf --version | head -n1 shuf (GNU coreutils) 8.25 $ man shuf SHUF(1) User Commands SHUF(1) NAME shuf - generate random permutations SYNOPSIS shuf [OPTION]... [FILE] shuf -e [OPTION]... [ARG]... shuf -i LO-HI [OPTION]... DESCRIPTION Write a random permutation of the input lines to standard output. With no FILE, or when FILE is -, read standard input. ... ```
#### Random lines * Without repeating input lines ```bash $ cat nums.txt 1 10 10 12 23 563 $ # duplicates can end up anywhere $ # all lines are part of output $ shuf nums.txt 10 23 1 10 563 12 $ # limit max number of output lines $ shuf -n2 nums.txt 563 23 ``` * Use `-o` option to specify output file name instead of displaying on stdout * Helpful for inplace editing ```bash $ shuf nums.txt -o nums.txt $ cat nums.txt 10 12 23 10 563 1 ``` * With repeated input lines ```bash $ # -n3 for max 3 lines, -r allows input lines to be repeated $ shuf -n3 -r nums.txt 1 1 563 $ seq 3 | shuf -n5 -r 2 1 2 1 2 $ # if a limit using -n is not specified, shuf will output lines indefinitely ``` * use `-e` option to specify multiple input lines from command line itself ```bash $ shuf -e red blue green green blue red $ shuf -e 'hi there' 'hello world' foo bar bar hi there foo hello world $ shuf -n2 -e 'hi there' 'hello world' foo bar foo hi there $ shuf -r -n4 -e foo bar foo foo bar foo ```
#### Random integer numbers * The `-i` option accepts integer range as input to be shuffled ```bash $ shuf -i 3-8 3 7 6 4 8 5 ``` * Combine with other options as needed ```bash $ shuf -n3 -i 3-8 5 4 7 $ shuf -r -n4 -i 3-8 5 5 7 8 $ shuf -r -n5 -i 0-1 1 0 0 1 1 ``` * Use [seq](./miscellaneous.md#seq) input if negative numbers, floating point, etc are needed ```bash $ seq 2 -1 -2 | shuf 2 -1 -2 0 1 $ seq 0.3 0.1 0.7 | shuf -n3 0.4 0.5 0.7 ```
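
A common use of `-i` is to save a random number in a variable for later use in a script. The sketch below is only an illustration, the variable names and ranges are arbitrary.

```bash
# roll a six-sided die: pick one number from the range 1-6
roll=$(shuf -i 1-6 -n1)
echo "rolled: $roll"

# pick three distinct numbers from the range 1-49, sorted numerically
# no -r option here, so the same number cannot be picked twice
picks=$(shuf -i 1-49 -n3 | sort -n)
printf '%s\n' "$picks"
```

The output differs from run to run, so no sample output is shown.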
#### Further reading for shuf * `man shuf` and `info shuf` for more options and detailed documentation * [Generate random numbers in specific range](https://unix.stackexchange.com/questions/140750/generate-random-numbers-in-specific-range) * [Variable - randomly choose among three numbers](https://unix.stackexchange.com/questions/330689/variable-randomly-chosen-among-three-numbers-10-100-and-1000) * Related to 'random' stuff: * [How to generate a random string?](https://unix.stackexchange.com/questions/230673/how-to-generate-a-random-string) * [How can I populate a file with random data?](https://unix.stackexchange.com/questions/33629/how-can-i-populate-a-file-with-random-data) * [Run commands at random](https://unix.stackexchange.com/questions/81566/run-commands-at-random) ================================================ FILE: tail_less_cat_head.md ================================================ # Cat, Less, Tail and Head **Table of Contents** * [cat](#cat) * [Concatenate files](#concatenate-files) * [Accepting input from stdin](#accepting-input-from-stdin) * [Squeeze consecutive empty lines](#squeeze-consecutive-empty-lines) * [Prefix line numbers](#prefix-line-numbers) * [Viewing special characters](#viewing-special-characters) * [Writing text to file](#writing-text-to-file) * [tac](#tac) * [Useless use of cat](#useless-use-of-cat) * [Further Reading for cat](#further-reading-for-cat) * [less](#less) * [Navigation commands](#navigation-commands) * [Further Reading for less](#further-reading-for-less) * [tail](#tail) * [linewise tail](#linewise-tail) * [characterwise tail](#characterwise-tail) * [multiple file input for tail](#multiple-file-input-for-tail) * [Further Reading for tail](#further-reading-for-tail) * [head](#head) * [linewise head](#linewise-head) * [characterwise head](#characterwise-head) * [multiple file input for head](#multiple-file-input-for-head) * [combining head and tail](#combining-head-and-tail) * [Further Reading for 
head](#further-reading-for-head) * [Text Editors](#text-editors)
## cat ```bash $ cat --version | head -n1 cat (GNU coreutils) 8.25 $ man cat CAT(1) User Commands CAT(1) NAME cat - concatenate files and print on the standard output SYNOPSIS cat [OPTION]... [FILE]... DESCRIPTION Concatenate FILE(s) to standard output. With no FILE, or when FILE is -, read standard input. ... ``` * For below examples, `marks_201*` files contain 3 fields delimited by TAB * To avoid formatting issues, TAB has been converted to spaces using `col -x` while pasting the output here
#### Concatenate files * One or more files can be given as input and hence a lot of times, `cat` is used to quickly see contents of small single file on terminal * To save the output of concatenation, just redirect stdout ```bash $ ls marks_2015.txt marks_2016.txt marks_2017.txt $ cat marks_201* Name Maths Science foo 67 78 bar 87 85 Name Maths Science foo 70 75 bar 85 88 Name Maths Science foo 68 76 bar 90 90 $ # save stdout to a file $ cat marks_201* > all_marks.txt ```
#### Accepting input from stdin ```bash $ # combining input from stdin and other files $ printf 'Name\tMaths\tScience \nbaz\t56\t63\nbak\t71\t65\n' | cat - marks_2015.txt Name Maths Science baz 56 63 bak 71 65 Name Maths Science foo 67 78 bar 87 85 $ # - can be placed in whatever order is required $ printf 'Name\tMaths\tScience \nbaz\t56\t63\nbak\t71\t65\n' | cat marks_2015.txt - Name Maths Science foo 67 78 bar 87 85 Name Maths Science baz 56 63 bak 71 65 ```
#### Squeeze consecutive empty lines ```bash $ printf 'hello\n\n\nworld\n\nhave a nice day\n' hello world have a nice day $ printf 'hello\n\n\nworld\n\nhave a nice day\n' | cat -s hello world have a nice day ```
#### Prefix line numbers ```bash $ # number all lines $ cat -n marks_201* 1 Name Maths Science 2 foo 67 78 3 bar 87 85 4 Name Maths Science 5 foo 70 75 6 bar 85 88 7 Name Maths Science 8 foo 68 76 9 bar 90 90 $ # number only non-empty lines $ printf 'hello\n\n\nworld\n\nhave a nice day\n' | cat -sb 1 hello 2 world 3 have a nice day ``` * For more numbering options, check out the command `nl` ```bash $ whatis nl nl (1) - number lines of files ```
#### Viewing special characters * End of line identified by `$` * Useful for example to see trailing spaces ```bash $ cat -E marks_2015.txt Name Maths Science $ foo 67 78$ bar 87 85$ ``` * TAB identified by `^I` ```bash $ cat -T marks_2015.txt Name^IMaths^IScience foo^I67^I78 bar^I87^I85 ``` * Non-printing characters * See [Show Non-Printing Characters](http://docstore.mik.ua/orelly/unix/upt/ch25_07.htm) for more detailed info ```bash $ # NUL character $ printf 'foo\0bar\0baz\n' | cat -v foo^@bar^@baz $ # to check for dos-style line endings $ printf 'Hello World!\r\n' | cat -v Hello World!^M $ printf 'Hello World!\r\n' | dos2unix | cat -v Hello World! ``` * the `-A` option is equivalent to `-vET` * the `-e` option is equivalent to `-vE` * If `dos2unix` and `unix2dos` are not available, see [How to convert DOS/Windows newline (CRLF) to Unix newline (\n)](https://stackoverflow.com/questions/2613800/how-to-convert-dos-windows-newline-crlf-to-unix-newline-n-in-a-bash-script)
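
If `dos2unix` is not available, one simple alternative is to delete the carriage return characters with `tr`. This sketch assumes `\r` occurs only as part of `\r\n` line endings, and uses `cat -v` to confirm that no `^M` is left.

```bash
# remove every carriage return character
# assumption: \r appears only as part of \r\n line endings
cleaned=$(printf 'Hello World!\r\nHave a nice day\r\n' | tr -d '\r' | cat -v)

printf '%s\n' "$cleaned"
```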
#### Writing text to file ```bash $ cat > sample.txt This is an example of adding text to a new file using cat command. Press Ctrl+d on a newline to save and quit. $ cat sample.txt This is an example of adding text to a new file using cat command. Press Ctrl+d on a newline to save and quit. ``` * See also how to use [heredoc](http://mywiki.wooledge.org/HereDocument) * [How can I write a here doc to a file](https://stackoverflow.com/questions/2953081/how-can-i-write-a-here-doc-to-a-file-in-bash-script) * See also [difference between Ctrl+c and Ctrl+d to signal end of stdin input in bash](https://unix.stackexchange.com/questions/16333/how-to-signal-the-end-of-stdin-input-in-bash)
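
Within a script, a here document is usually more convenient than typing text interactively. Below is a minimal sketch, the scratch file from `mktemp` is used only for illustration.

```bash
# scratch file, for illustration only
out=$(mktemp)

# quoting the delimiter ('EOF') prevents expansion of $ and backticks in the text
cat > "$out" << 'EOF'
This is an example of adding text to a file.
No need to press Ctrl+d, $HOME is not expanded either.
EOF

content=$(cat "$out")
printf '%s\n' "$content"

rm "$out"
```

Use an unquoted `EOF` if variables and command substitutions in the text should be expanded.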
#### tac ```bash $ whatis tac tac (1) - concatenate and print files in reverse $ tac --version | head -n1 tac (GNU coreutils) 8.25 $ seq 3 | tac 3 2 1 $ tac marks_2015.txt bar 87 85 foo 67 78 Name Maths Science ``` * Useful in cases where logic is easier to write when working on reversed file * Consider this made up log file, many **Warning** lines but need to extract only from last such **Warning** upto **Error** line * See [GNU sed chapter](./gnu_sed.md#lines-between-two-regexps) for details on the `sed` command used below ```bash $ cat report.log blah blah Warning: something went wrong more blah whatever Warning: something else went wrong some text some more text Error: something seriously went wrong blah blah blah $ tac report.log | sed -n '/Error:/,/Warning:/p' | tac Warning: something else went wrong some text some more text Error: something seriously went wrong ``` * Similarly, if characters in lines have to be reversed, use the `rev` command ```bash $ whatis rev rev (1) - reverse lines characterwise ```
#### Useless use of cat * `cat` is used so frequently to view contents of a file that somehow users think other commands cannot handle file input * [UUOC](https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat) * [Useless Use of Cat Award](http://porkmail.org/era/unix/award.html) ```bash $ cat report.log | grep -E 'Warning|Error' Warning: something went wrong Warning: something else went wrong Error: something seriously went wrong $ grep -E 'Warning|Error' report.log Warning: something went wrong Warning: something else went wrong Error: something seriously went wrong ``` * Use [input redirection](http://wiki.bash-hackers.org/howto/redirection_tutorial) if a command doesn't accept file input ```bash $ cat marks_2015.txt | tr 'A-Z' 'a-z' name maths science foo 67 78 bar 87 85 $ tr 'A-Z' 'a-z' < marks_2015.txt name maths science foo 67 78 bar 87 85 ``` * However, `cat` should definitely be used where **concatenation** is needed ```bash $ grep -c 'foo' marks_201* marks_2015.txt:1 marks_2016.txt:1 marks_2017.txt:1 $ # concatenation allows to get overall count in one-shot in this case $ cat marks_201* | grep -c 'foo' 3 ```
#### Further Reading for cat * [cat Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/cat?sort=votes&pageSize=15) * [cat Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/cat?sort=votes&pageSize=15)
## less ```bash $ less --version | head -n1 less 481 (GNU regular expressions) $ # By default, pager is used to display the man pages $ # and usually, pager is linked to less command $ type pager less pager is /usr/bin/pager less is /usr/bin/less $ realpath /usr/bin/pager /bin/less $ realpath /usr/bin/less /bin/less $ diff -s /usr/bin/pager /usr/bin/less Files /usr/bin/pager and /usr/bin/less are identical ``` * `cat` command is NOT suitable for viewing contents of large files on the Terminal * `less` displays contents of a file, automatically fits to size of Terminal, allows scrolling in either direction and other options for effective viewing * Usually, `man` command uses `less` command to display the help page * The navigation commands are similar to `vi` editor
#### Navigation commands

Commonly used commands are given below, press `h` for a summary of all commands

* `g` go to start of file
* `G` go to end of file
* `q` quit
* `/pattern` search for the given pattern in forward direction
* `?pattern` search for the given pattern in backward direction
* `n` go to next match of the pattern (in the direction of the search)
* `N` go to previous match (opposite to the direction of the search)
#### Further Reading for less * See `man less` for detailed info on commands and options. For example: * `-s` option to squeeze consecutive blank lines * `-N` option to prefix line number * `less` command is an [improved version](https://unix.stackexchange.com/questions/604/isnt-less-just-more) of `more` command * [differences between most, more and less](https://unix.stackexchange.com/questions/81129/what-are-the-differences-between-most-more-and-less) * [less Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/less?sort=votes&pageSize=15)
## tail ```bash $ tail --version | head -n1 tail (GNU coreutils) 8.25 $ man tail TAIL(1) User Commands TAIL(1) NAME tail - output the last part of files SYNOPSIS tail [OPTION]... [FILE]... DESCRIPTION Print the last 10 lines of each FILE to standard output. With more than one FILE, precede each with a header giving the file name. With no FILE, or when FILE is -, read standard input. ... ```
#### linewise tail

Consider this sample file, with line numbers prefixed

```bash
$ cat sample.txt
1) Hello World
2)
3) Good day
4) How are you
5)
6) Just do-it
7) Believe it
8)
9) Today is sunny
10) Not a bit funny
11) No doubt you like it too
12)
13) Much ado about nothing
14) He he he
15) Adios amigo
```

* default behavior - display last 10 lines

```bash
$ tail sample.txt
6) Just do-it
7) Believe it
8)
9) Today is sunny
10) Not a bit funny
11) No doubt you like it too
12)
13) Much ado about nothing
14) He he he
15) Adios amigo
```

* Use `-n` option to control the number of lines to display

```bash
$ tail -n3 sample.txt
13) Much ado about nothing
14) He he he
15) Adios amigo

$ # some versions of tail allow the number to be given directly, without the n character
$ tail -5 sample.txt
11) No doubt you like it too
12)
13) Much ado about nothing
14) He he he
15) Adios amigo
```

* when the number is prefixed with a `+` sign, all lines from that line number to the end of file are displayed

```bash
$ tail -n +10 sample.txt
10) Not a bit funny
11) No doubt you like it too
12)
13) Much ado about nothing
14) He he he
15) Adios amigo

$ seq 13 17 | tail -n +3
15
16
17
```
#### characterwise tail

* Note that this works byte wise, so it is not suitable for multi-byte character encodings

```bash
$ # last three characters including the newline character
$ echo 'Hi there!' | tail -c3
e!

$ # excluding the first character
$ echo 'Hi there!' | tail -c +2
i there!
```
#### multiple file input for tail ```bash $ tail -n2 report.log sample.txt ==> report.log <== Error: something seriously went wrong blah blah blah ==> sample.txt <== 14) He he he 15) Adios amigo $ # -q option to avoid filename in output $ tail -q -n2 report.log sample.txt Error: something seriously went wrong blah blah blah 14) He he he 15) Adios amigo ```
#### Further Reading for tail * `tail -f` and related options are beyond the scope of this tutorial. Below links might be useful * [look out for buffering](http://mywiki.wooledge.org/BashFAQ/009) * [Piping tail -f output though grep twice](https://stackoverflow.com/questions/13858912/piping-tail-output-though-grep-twice) * [tail and less](https://unix.stackexchange.com/questions/196168/does-less-have-a-feature-like-tail-follow-name-f) * [tail Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/tail?sort=votes&pageSize=15) * [tail Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/tail?sort=votes&pageSize=15)
## head ```bash $ head --version | head -n1 head (GNU coreutils) 8.25 $ man head HEAD(1) User Commands HEAD(1) NAME head - output the first part of files SYNOPSIS head [OPTION]... [FILE]... DESCRIPTION Print the first 10 lines of each FILE to standard output. With more than one FILE, precede each with a header giving the file name. With no FILE, or when FILE is -, read standard input. ... ```
#### linewise head

* default behavior - display starting 10 lines

```bash
$ head sample.txt
1) Hello World
2)
3) Good day
4) How are you
5)
6) Just do-it
7) Believe it
8)
9) Today is sunny
10) Not a bit funny
```

* Use `-n` option to control the number of lines to display

```bash
$ head -n3 sample.txt
1) Hello World
2)
3) Good day

$ # some versions of head allow the number to be given directly, without the n character
$ head -4 sample.txt
1) Hello World
2)
3) Good day
4) How are you
```

* when the number is prefixed with a `-` sign, all lines are displayed except that many lines from the end of file

```bash
$ # except last 9 lines of file
$ head -n -9 sample.txt
1) Hello World
2)
3) Good day
4) How are you
5)
6) Just do-it

$ # except last 2 lines
$ seq 13 17 | head -n -2
13
14
15
```
#### characterwise head

* Note that this works byte wise, so it is not suitable for multi-byte character encodings

```bash
$ # if output of command doesn't end with newline, prompt will be on same line
$ # to highlight working of command, the prompt for such cases is not shown here

$ # first two characters
$ echo 'Hi there!' | head -c2
Hi

$ # excluding last four characters
$ echo 'Hi there!' | head -c -4
Hi the
```
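
To see why byte wise operation matters for multi-byte encodings, here is a small sketch using `é`, which is two bytes in UTF-8. The character is written with octal escapes so that the example does not depend on the encoding of the script itself. `od` is used only to display the resulting byte in hexadecimal.

```bash
# 'é' in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251)
# head -c1 returns only the first byte, which is not a valid character by itself
first_byte=$(printf '\303\251\n' | head -c1 | od -An -tx1 | tr -d ' \n')

echo "$first_byte"
```

Only `c3` is retained, that is, half of the character. The same applies to `tail -c`.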
#### multiple file input for head ```bash $ head -n3 report.log sample.txt ==> report.log <== blah blah Warning: something went wrong more blah ==> sample.txt <== 1) Hello World 2) 3) Good day $ # -q option to avoid filename in output $ head -q -n3 report.log sample.txt blah blah Warning: something went wrong more blah 1) Hello World 2) 3) Good day ```
#### combining head and tail * Despite involving two commands, often this combination is faster than equivalent sed/awk versions ```bash $ head -n11 sample.txt | tail -n3 9) Today is sunny 10) Not a bit funny 11) No doubt you like it too $ tail sample.txt | head -n2 6) Just do-it 7) Believe it ```
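
The combination can be wrapped in a small function to extract a range of lines. This is only a sketch, `line_range` is a made-up helper name and it assumes that the start and end line numbers are valid positive integers with start not greater than end.

```bash
# print lines from start ($1) to end ($2), both inclusive, of stdin
# line_range is a hypothetical helper name
line_range() {
    # keep everything up to the end line, then drop lines before the start line
    head -n "$2" | tail -n +"$1"
}

seq 15 | line_range 9 11
```

Here `tail -n +"$1"` is used instead of a count from the end, so there is no need to calculate the number of lines in the range.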
#### Further Reading for head * [head Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/head?sort=votes&pageSize=15)
## Text Editors For editing text files, the following applications can be used. Of these, `gedit`, `nano`, `vi` and/or `vim` are available in most distros by default Easy to use * [gedit](https://wiki.gnome.org/Apps/Gedit) * [geany](http://www.geany.org/) * [nano](http://nano-editor.org/) Powerful text editors * [vim](https://github.com/vim/vim) * [vim learning resources](https://github.com/learnbyexample/scripting_course/blob/master/Vim_curated_resources.md) and [vim reference](https://github.com/learnbyexample/vim_reference) for further info * [emacs](https://www.gnu.org/software/emacs/) * [atom](https://atom.io/) * [sublime](https://www.sublimetext.com/) Check out [this analysis](https://github.com/jhallen/joes-sandbox/tree/master/editor-perf) for some performance/feature comparisons of various text editors ================================================ FILE: whats_the_difference.md ================================================ # What's the difference **Table of Contents** * [cmp](#cmp) * [diff](#diff) * [Comparing Directories](#comparing-directories) * [colordiff](#colordiff)
## cmp ```bash $ cmp --version | head -n1 cmp (GNU diffutils) 3.3 $ man cmp CMP(1) User Commands CMP(1) NAME cmp - compare two files byte by byte SYNOPSIS cmp [OPTION]... FILE1 [FILE2 [SKIP1 [SKIP2]]] DESCRIPTION Compare two files byte by byte. The optional SKIP1 and SKIP2 specify the number of bytes to skip at the beginning of each file (zero by default). ... ``` * As the comparison is byte by byte, it doesn't matter if file is human readable or not * A typical use case is to check if two executables are same or not ```bash $ echo 'foo 123' > f1; echo 'food 123' > f2 $ cmp f1 f2 f1 f2 differ: byte 4, line 1 $ # print differing bytes $ cmp -b f1 f2 f1 f2 differ: byte 4, line 1 is 40 144 d $ # skip given bytes from each file $ # if only one number is given, it is used for both inputs $ cmp -i 3:4 f1 f2 $ echo $? 0 $ # compare only given number of bytes from start of inputs $ cmp -n 3 f1 f2 $ echo $? 0 $ # suppress output $ cmp -s f1 f2 $ echo $? 1 ``` * Comparison stops immediately at the first difference found * If verbose option `-l` is used, comparison would stop at whichever input reaches end of file first ```bash $ # first column is byte number $ # second/third column is respective octal value of differing bytes $ cmp -l f1 f2 4 40 144 5 61 40 6 62 61 7 63 62 8 12 63 cmp: EOF on f1 ``` **Further Reading** * `man cmp` and `info cmp` for more options and detailed documentation
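
In scripts, `cmp -s` is typically used for its exit status alone. Below is a minimal sketch, the scratch directory and the function name `same_or_not` are made up for illustration.

```bash
# scratch directory, for illustration only
dir=$(mktemp -d)
echo 'foo 123' > "$dir/f1"
echo 'food 123' > "$dir/f2"
cp "$dir/f1" "$dir/f1_copy"

# cmp -s prints nothing, the exit status alone conveys the result
# note: any non-zero status (including errors like a missing file) is reported as 'differ' here
same_or_not() {
    if cmp -s "$1" "$2"; then
        echo 'same'
    else
        echo 'differ'
    fi
}

r1=$(same_or_not "$dir/f1" "$dir/f1_copy")
r2=$(same_or_not "$dir/f1" "$dir/f2")
echo "$r1 $r2"

rm -r "$dir"
```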
## diff

```bash
$ diff --version | head -n1
diff (GNU diffutils) 3.3

$ man diff
DIFF(1)                       User Commands                      DIFF(1)

NAME
       diff - compare files line by line

SYNOPSIS
       diff [OPTION]... FILES

DESCRIPTION
       Compare FILES line by line.
...
```

* `diff` output shows lines from first file input starting with `<`
* lines from second file input start with `>`
* between the two file contents, `---` is used as separator
* each difference is prefixed by a command that indicates the differences (see links at end of section for more details)

```bash
$ paste d1 d2
1       1
2       hello
3       3
world   4

$ diff d1 d2
2c2
< 2
---
> hello
4c4
< world
---
> 4

$ diff <(seq 4) <(seq 5)
4a5
> 5
```

* use `-i` option to ignore case

```bash
$ echo 'Hello World!' > i1
$ echo 'hello world!' > i2

$ diff i1 i2
1c1
< Hello World!
---
> hello world!

$ diff -i i1 i2
$ echo $?
0
```

* ignoring difference in white spaces

```bash
$ # -b option to ignore changes in the amount of white space
$ diff -b <(echo 'good day') <(echo 'good    day')
$ echo $?
0

$ # -w option to ignore all white spaces
$ diff -w <(echo 'hi there ') <(echo ' hi there')
$ echo $?
0
$ diff -w <(echo 'hi there ') <(echo 'hithere')
$ echo $?
0

# use -B to ignore only blank lines
# use -E to ignore changes due to tab expansion
# use -Z to ignore trailing white spaces at end of line
```

* side-by-side output

```bash
$ diff -y d1 d2
1                                                               1
2                                                             | hello
3                                                               3
world                                                         | 4

$ # -y is usually used along with other options
$ # default width is 130 print columns
$ diff -W 60 --suppress-common-lines -y d1 d2
2                            | hello
world                        | 4

$ diff -W 20 --left-column -y <(seq 4) <(seq 5)
1        (
2        (
3        (
4        (
         > 5
```

* by default, there is no output if input files are same. Use `-s` option to additionally indicate files are same
* by default, all differences are shown. Use `-q` option to indicate only that files differ

```bash
$ cp i1 i1_copy
$ diff -s i1 i1_copy
Files i1 and i1_copy are identical
$ diff -s i1 i2
1c1
< Hello World!
---
> hello world!

$ diff -q i1 i1_copy
$ diff -q i1 i2
Files i1 and i2 differ

$ # combine them to always get one line output
$ diff -sq i1 i1_copy
Files i1 and i1_copy are identical
$ diff -sq i1 i2
Files i1 and i2 differ
```
#### Comparing Directories * when comparing two files of same name from different directories, specifying the filename is optional for one of the directories ```bash $ mkdir dir1 dir2 $ echo 'Hello World!' > dir1/i1 $ echo 'hello world!' > dir2/i1 $ diff dir1/i1 dir2 1c1 < Hello World! --- > hello world! $ diff -s i1 dir1/ Files i1 and dir1/i1 are identical $ diff -s . dir1/i1 Files ./i1 and dir1/i1 are identical ``` * if both arguments are directories, all files are compared ```bash $ touch dir1/report.log dir1/lists dir2/power.log $ cp f1 dir1/ $ cp f1 dir2/ $ # by default, all differences are reported $ # as well as filenames which are unique to respective directories $ diff dir1 dir2 diff dir1/i1 dir2/i1 1c1 < Hello World! --- > hello world! Only in dir1: lists Only in dir2: power.log Only in dir1: report.log ``` * to report only filenames ```bash $ diff -sq dir1 dir2 Files dir1/f1 and dir2/f1 are identical Files dir1/i1 and dir2/i1 differ Only in dir1: lists Only in dir2: power.log Only in dir1: report.log $ # list only differing files $ # also useful to copy-paste the command for GUI diffs like tkdiff/vimdiff $ diff dir1 dir2 | grep '^diff ' diff dir1/i1 dir2/i1 ``` * to recursively compare sub-directories as well, use `-r` ```bash $ mkdir dir1/subdir dir2/subdir $ echo 'good' > dir1/subdir/f1 $ echo 'goad' > dir2/subdir/f1 $ diff -srq dir1 dir2 Files dir1/f1 and dir2/f1 are identical Files dir1/i1 and dir2/i1 differ Only in dir1: lists Only in dir2: power.log Only in dir1: report.log Files dir1/subdir/f1 and dir2/subdir/f1 differ $ diff -r dir1 dir2 | grep '^diff ' diff -r dir1/i1 dir2/i1 diff -r dir1/subdir/f1 dir2/subdir/f1 ``` * See also [GNU diffutils manual - comparing directories](https://www.gnu.org/software/diffutils/manual/diffutils.html#Comparing-Directories) for further options and details like excluding files, ignoring filename case, etc and `dirdiff` command
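
The recursive comparison can also be tried out in a throwaway location. The sketch below builds two small directory trees under a scratch directory (all names are made up for illustration) and saves the `diff -rq` report in a variable. `LC_ALL=C` is set so that the messages are in English regardless of the locale.

```bash
# scratch directory, for illustration only
top=$(mktemp -d)
mkdir -p "$top/dir1/subdir" "$top/dir2/subdir"
echo 'good' > "$top/dir1/subdir/f1"
echo 'goad' > "$top/dir2/subdir/f1"
echo 'same' > "$top/dir1/common"
echo 'same' > "$top/dir2/common"
touch "$top/dir1/lists"

# run from inside the scratch directory so that reported paths are relative
# diff exits with status 1 when differences are found, hence the || true
report=$(cd "$top" && LC_ALL=C diff -rq dir1 dir2 || true)

printf '%s\n' "$report"

rm -r "$top"
```

The report should mention `lists` as present only in `dir1` and the two `subdir/f1` files as differing, while the identical `common` files are not listed since `-s` is not used.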
#### colordiff ```bash $ whatis colordiff colordiff (1) - a tool to colorize diff output $ whatis wdiff wdiff (1) - display word differences between text files ``` * simply replace `diff` with `colordiff` ![colordiff](./images/colordiff.png) * or, pass output of a `diff` tool to `colordiff` ![wdiff to colordiff](./images/wdiff_to_colordiff.png) * See also [stackoverflow - How to colorize diff on the command line?](https://stackoverflow.com/questions/8800578/how-to-colorize-diff-on-the-command-line) for other options
**Further Reading**

* `man diff` and `info diff` for more options and detailed documentation
    * [GNU diffutils manual](https://www.gnu.org/software/diffutils/manual/diffutils.html) for better documentation
* `man -k diff` to get a list of all commands related to `diff`
* [diff Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/diff?sort=votes&pageSize=15)
* [unix.stackexchange - GUI diff and merge tools](https://unix.stackexchange.com/questions/4573/which-gui-diff-viewer-would-you-recommend-with-copy-to-left-right-functionality)
* [unix.stackexchange - Understanding diff output](https://unix.stackexchange.com/questions/81998/understanding-of-diff-output)
* [stackoverflow - Using output of diff to create patch](https://stackoverflow.com/questions/437219/using-the-output-of-diff-to-create-the-patch)

================================================
FILE: wheres_my_file.md
================================================
# Where's my file

**Table of Contents**

* [find](#find)
* [locate](#locate)

## find

```bash
$ find --version | head -n1
find (GNU findutils) 4.7.0-git

$ man find
FIND(1)                     General Commands Manual                    FIND(1)

NAME
       find - search for files in a directory hierarchy

SYNOPSIS
       find [-H] [-L] [-P] [-D debugopts] [-Olevel] [starting-point...] [expression]

DESCRIPTION
       This manual page documents the GNU version of find. GNU find searches
       the directory tree rooted at each given starting-point by evaluating
       the given expression from left to right, according to the rules of
       precedence (see section OPERATORS), until the outcome is known (the
       left hand side is false for and operations, true for or), at which
       point find moves on to the next file name. If no starting-point is
       specified, `.' is assumed.
...
```

**Examples**

Filtering based on file name

* `find . -iname 'power.log'` search and print path of file named power.log (ignoring case) in current directory and its sub-directories
* `find -name '*log'` search and print path of all files whose name ends with log in current directory - using `.` is optional when searching in current directory
* `find -not -name '*log'` print path of all files whose name does NOT end with log in current directory
* `find -regextype egrep -regex '.*/\w+'` use extended regular expression to match filename containing only `[a-zA-Z0-9_]` characters
    * `.*/` is needed to match initial part of file path

Filtering based on file type

* `find /home/guest1/proj -type f` print path of all regular files found in specified directory
* `find /home/guest1/proj -type d` print path of all directories found in specified directory
* `find /home/guest1/proj -type f -name '.*'` print path of all hidden files

Filtering based on depth

The relative path `.` is considered as depth 0 directory, files and folders immediately contained in a directory are at depth 1 and so on

* `find -maxdepth 1 -type f` all regular files (including hidden ones) from current directory (without going to sub-directories)
* `find -maxdepth 1 -type f -name '[!.]*'` all regular files (but not hidden ones) from current directory (without going to sub-directories)
    * `-not -name '.*'` can also be used
* `find -mindepth 1 -maxdepth 1 -type d` all directories (including hidden ones) in current directory (without going to sub-directories)

Filtering based on file properties

* `find -mtime -2` print files that were modified within last two days in current directory
    * Note that day here means 24 hours
* `find -mtime +7` print files that were modified more than seven days back in current directory
* `find -daystart -type f -mtime -1` files that were modified from beginning of day (not past 24 hours)
* `find -size +10k` print files with size greater than 10 kilobytes in current directory
* `find -size -1M` matches only empty files, because sizes are rounded up to whole units before comparison
    * use `find -size -1048576c` to print files with size less than 1 megabyte
* `find -size 2G` print files whose size, rounded up to whole gigabytes, is 2 (larger than 1 and up to 2 gigabytes) in current directory

Passing filtered files as input to other commands

* `find report -name '*log*' -exec rm {} \;` delete all files whose name contains log in report folder and its sub-folders
    * here `rm` command is called for every file matching the search conditions
    * since `;` is a special character for shell, it needs to be escaped using `\`
* `find report -name '*log*' -delete` delete all files whose name contains log in report folder and its sub-folders
* `find -name '*.txt' -exec wc {} +` files ending with txt are all passed together as arguments to `wc` command instead of executing `wc` command for every file
    * no need to escape the `+` character in this case
    * also note that the specified command may still be invoked more than once if the number of files found is too large
* `find -name '*.log' -exec mv {} ../log/ \;` move files ending with .log to log directory present one level above. `mv` is executed once for each filtered file
* `find -name '*.log' -exec mv -t ../log/ {} +` the `-t` option allows specifying the target directory first, followed by the files to be moved as arguments
    * Similarly, one can use `-t` for `cp` command

**Further Reading**

* [using find](http://mywiki.wooledge.org/UsingFind)
* [find examples on SO](https://stackoverflow.com/documentation/bash/566/find#t=201612140534548263961)
* [Collection of find examples](http://alvinalexander.com/unix/edu/examples/find.shtml)
* [find Q&A on unix stackexchange](https://unix.stackexchange.com/questions/tagged/find?sort=votes&pageSize=15)
* [find and tar example](https://unix.stackexchange.com/questions/282762/find-mtime-1-print-xargs-tar-archives-all-files-from-directory-ignoring-t/282885#282885)
* [find Q&A on stackoverflow](https://stackoverflow.com/questions/tagged/find?sort=votes&pageSize=15)
* [Why is looping over find's output bad practice?](https://unix.stackexchange.com/questions/321697/why-is-looping-over-finds-output-bad-practice)

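
Several of the `find` examples in this section can be tried safely in a scratch directory. The sketch below assumes GNU `find` and GNU `mv`; every file and directory name in it is made up for illustration. It checks a depth filter, the way `-size` rounds sizes up to whole units (so `-1M` and `-1048576c` are not equivalent), and the difference between ending `-exec` with `\;` and with `+`.

```bash
# throwaway sandbox; all file and directory names here are illustrative only
tmp=$(mktemp -d)
mkdir -p "$tmp/report/old" "$tmp/log"
printf 'a\n'     > "$tmp/report/power.log"
printf 'b\nc\n'  > "$tmp/report/old/timing.log"
printf 'notes\n' > "$tmp/report/notes.txt"
printf 'x\n'     > "$tmp/report/.hidden"
cd "$tmp" || exit 1

# depth 1 only, regular files, hidden ones excluded
top_files=$(find report -maxdepth 1 -type f -not -name '.*' | sort)
printf '%s\n' "$top_files"

# -size rounds up to whole units before comparing, so -1M matches only
# empty files, while -1048576c matches anything smaller than 1 MiB
small_m=$(find report -type f -size -1M | wc -l)
small_c=$(find report -type f -size -1048576c | wc -l)
echo "-1M: $small_m, -1048576c: $small_c"

# '\;' runs the command once per file, '+' batches files into as few runs as possible
per_file=$(find report -name '*.log' -exec echo run {} \; | wc -l)
batched=$(find report -name '*.log' -exec echo run {} + | wc -l)
echo "runs with \\;: $per_file, runs with +: $batched"

# move every .log file with a single mv call (-t is a GNU coreutils option)
find report -name '*.log' -exec mv -t log/ {} +
moved=$(ls log | sort | tr '\n' ' ')
echo "moved: $moved"

cd / && rm -rf "$tmp"
```

With the two `.log` files in this sandbox, the `\;` form prints two lines and the `+` form prints one, which is why `+` is usually preferred when the command accepts multiple file arguments.
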
## locate

```bash
$ locate --version | head -n1
mlocate 0.26

$ man locate
locate(1)                   General Commands Manual                  locate(1)

NAME
       locate - find files by name

SYNOPSIS
       locate [OPTION]... PATTERN...

DESCRIPTION
       locate reads one or more databases prepared by updatedb(8) and writes
       file names matching at least one of the PATTERNs to standard output,
       one per line.

       If --regex is not specified, PATTERNs can contain globbing characters.
       If any PATTERN contains no globbing characters, locate behaves as if
       the pattern were *PATTERN*.
...
```

Faster alternative to the `find` command when searching for a file by its name. It is based on a database, which gets updated by a `cron` job, so newer files may not be present in the results. Use this command if it is available in your distro and you remember some part of the filename. It is very useful when the entire filesystem has to be searched, in which case `find` might take a very long time compared to `locate`.

**Examples**

* `locate 'power'` print path of files containing power in the whole filesystem
    * matches anywhere in path, for example '/home/learnbyexample/lowpower_adder/result.log' and '/home/learnbyexample/power.log' are both valid matches
    * implicitly, `locate` would change the string to `*power*` as no globbing characters are present in the string specified
* `locate -b '\power.log'` print path matching the string power.log exactly at end of path
    * '/home/learnbyexample/power.log' matches but not '/home/learnbyexample/lowpower.log'
    * since globbing character '\' is used while specifying search string, it doesn't get implicitly replaced by `*power.log*`
* `locate -b '\proj_adder'` the `-b` option also comes in handy to print only the path of the directory itself, otherwise every file under that folder would also be displayed
* [find vs locate - pros and cons](https://unix.stackexchange.com/questions/60205/locate-vs-find-usage-pros-and-cons-of-each-other)