Repository: SixArm/usv Branch: main Commit: dd09cb6a8351 Files: 77 Total size: 140.5 KB Directory structure: gitextract_r3n42vl8/ ├── CODE_OF_CONDUCT.md ├── README.md ├── bin/ │ ├── bash/ │ │ ├── usv-to-csv.bash │ │ ├── usv-to-debug.bash │ │ └── usv-to-display.bash │ └── python/ │ ├── usv-to-csv.py │ ├── usv-to-debug.py │ └── usv-to-display.py ├── doc/ │ ├── abnf/ │ │ └── index.md │ ├── clap/ │ │ └── index.md │ ├── code/ │ │ └── index.md │ ├── comparisons/ │ │ ├── asv/ │ │ │ └── index.md │ │ ├── csv/ │ │ │ └── index.md │ │ ├── index.md │ │ ├── json/ │ │ │ └── index.md │ │ ├── rsv/ │ │ │ └── index.md │ │ ├── tsv/ │ │ │ └── index.md │ │ └── xlsx/ │ │ └── index.md │ ├── converters/ │ │ └── index.md │ ├── criticisms/ │ │ └── index.md │ ├── editors/ │ │ ├── emacs/ │ │ │ └── index.md │ │ └── vi/ │ │ └── index.md │ ├── end-of-transmission/ │ │ └── index.md │ ├── escape/ │ │ └── index.md │ ├── faq/ │ │ └── index.md │ ├── history-of-ascii-separated-values/ │ │ └── index.md │ ├── how-to-type-unicode-characters/ │ │ └── index.md │ ├── how-to-use-split-and-regex/ │ │ └── index.md │ ├── layout/ │ │ └── index.md │ ├── markup/ │ │ └── index.md │ ├── purpose/ │ │ └── index.md │ ├── rfc/ │ │ ├── draft-unicode-separated-values-01.txt │ │ ├── draft-unicode-separated-values-01.xml │ │ └── index.md │ ├── spacers/ │ │ └── index.md │ ├── styles/ │ │ └── index.md │ └── todo/ │ └── index.md ├── examples/ │ ├── blog-posts.csv │ ├── blog-posts.usv │ ├── end-of-transmission.usv │ ├── hello-goodnight.csv │ ├── hello-goodnight.usv │ ├── stream.usv │ ├── zen-koans.csv │ └── zen-koans.usv ├── tests/ │ ├── 1-dimensional-as-line/ │ │ ├── expect.json │ │ └── input.usv │ ├── 1-dimensional-as-lines/ │ │ ├── expect.json │ │ └── input.usv │ ├── 2-dimensional-as-line/ │ │ ├── expect.json │ │ └── input.usv │ ├── 2-dimensional-as-lines/ │ │ ├── expect.json │ │ └── input.usv │ ├── 3-dimensional-as-line/ │ │ ├── expect.json │ │ └── input.usv │ ├── 3-dimensional-as-lines/ │ │ ├── expect.json │ │ └── input.usv 
│ ├── 4-dimensional-as-line/ │ │ ├── expect.json │ │ └── input.usv │ ├── 4-dimensional-as-lines/ │ │ ├── expect.json │ │ └── input.usv │ ├── blog-posts/ │ │ ├── output-actual.txt │ │ ├── output-expect.txt │ │ └── test.sh │ ├── end-of-transmission-block/ │ │ ├── output-actual.txt │ │ ├── output-expect.txt │ │ └── test.sh │ ├── libreoffice-calc/ │ │ ├── example1.ods │ │ └── example2.ods │ ├── microsoft-excel/ │ │ ├── example1.xls │ │ ├── example1.xlsx │ │ ├── example2.xls │ │ └── example2.xlsx │ └── stream/ │ ├── output-actual.txt │ ├── output-expect.txt │ └── test.sh └── todo.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: CODE_OF_CONDUCT.md ================================================ # Contributor Covenant Code of Conduct ## Our Pledge We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation. We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. 
## Our Standards Examples of behavior that contributes to a positive environment for our community include: * Demonstrating empathy and kindness toward other people * Being respectful of differing opinions, viewpoints, and experiences * Giving and gracefully accepting constructive feedback * Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience * Focusing on what is best not just for us as individuals, but for the overall community Examples of unacceptable behavior include: * The use of sexualized language or imagery, and sexual attention or advances of any kind * Trolling, insulting or derogatory comments, and personal or political attacks * Public or private harassment * Publishing others' private information, such as a physical or email address, without their explicit permission * Other conduct which could reasonably be considered inappropriate in a professional setting ## Enforcement Responsibilities Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful. Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate. ## Scope This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. 
## Enforcement Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at [INSERT CONTACT METHOD]. All complaints will be reviewed and investigated promptly and fairly. All community leaders are obligated to respect the privacy and security of the reporter of any incident. ## Enforcement Guidelines Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct: ### 1. Correction **Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community. **Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested. ### 2. Warning **Community Impact**: A violation through a single incident or series of actions. **Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban. ### 3. Temporary Ban **Community Impact**: A serious violation of community standards, including sustained inappropriate behavior. **Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban. ### 4. 
Permanent Ban **Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals. **Consequence**: A permanent ban from any sort of public interaction within the community. ## Attribution This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.1, available at [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. For answers to common questions about this code of conduct, see the FAQ at [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at [https://www.contributor-covenant.org/translations][translations]. [homepage]: https://www.contributor-covenant.org [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html [Mozilla CoC]: https://github.com/mozilla/diversity [FAQ]: https://www.contributor-covenant.org/faq [translations]: https://www.contributor-covenant.org/translations ================================================ FILE: README.md ================================================ # Unicode Separated Values (USV) ™ Unicode Separated Values (USV) ™ is a data format that uses Unicode characters for markup. [FAQ](doc/faq/) • [RFC](doc/rfc/) • [Code](doc/code/) • [Comparisons](doc/comparisons/) • [TODO](doc/todo/) • [XKCD](https://xkcd.com/927/) ## Introduction Unicode Separated Values (USV) enables new ways of working with data as plain text. * USV builds on ASCII Separated Values (ASV) plus adds capabilities for visible markup. * USV contrasts with Comma Separated Values (CSV) because USV is more specific and powerful. * USV is similar in spirit to Markdown (MD) because the purpose is easy freeform text editing. ### USV markup USV uses Unicode characters for data markup. 
* [U+001F](https://codepoints.net/U+001F)/[U+241F](https://codepoints.net/U+241F) Unit Separator. * [U+001E](https://codepoints.net/U+001E)/[U+241E](https://codepoints.net/U+241E) Record Separator. * [U+001D](https://codepoints.net/U+001D)/[U+241D](https://codepoints.net/U+241D) Group Separator. * [U+001C](https://codepoints.net/U+001C)/[U+241C](https://codepoints.net/U+241C) File Separator. * [U+001B](https://codepoints.net/U+001B)/[U+241B](https://codepoints.net/U+241B) Escape. * [U+0004](https://codepoints.net/U+0004)/[U+2404](https://codepoints.net/U+2404) End of Transmission. ### USV examples USV looks like this for a 1-dimensional data made of units, such as a log. Each unit ends with a Unit Separator character and an optional newline character. ```usv a␟ b␟ c␟ d␟ ``` USV looks like this for 2-dimensional data made of units and records, such as a spreadsheet table. Each record ends with a Record Separator character and an optional newline character. ```usv a␟b␟␞ c␟d␟␞ ``` USV looks like this for 3-dimensional data made of units and records and groups, such as a spreadsheet folio. Each group ends with a Group Separator character and an optional newline character. ```usv Sheet1␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet2␟␞ e␟f␟␞ g␟h␟␞ ␝ ``` USV looks like this for 4-dimensional data made of units and records and groups and files, such as a collection of spreadsheet folios. Each file ends with a File Separator character and an optional newline character. ```usv Folio1␟␞ Sheet1␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet2␟␞ e␟f␟␞ g␟h␟␞ ␝␜ Folio2␟␞ Sheet3␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet4␟␞ e␟f␟␞ g␟h␟␞ ␝␜ ``` ### USV style USV uses style options to display marks in various ways. * Style Symbols: use visible symbol characters such as `␟` * Style Controls: use invisible control characters such as `\u001F` * Style Braces: use curly-braces with abbreviations such as: `{US}` ### USV layout USV uses layout options to format data in various ways. 
* Layout Default: format the data so it looks good on a typical terminal screen.
* Layout Lines: format each mark with 0 or 1 or 2 surrounding newlines.
* Layout by Units or Records or Groups or Files: format a chunk to display on one line.

## Documentation

Core:

* [Markup with separators and modifiers](doc/markup/)
* [Style with symbols, controls, braces](doc/styles/)
* [Layout with units, records, groups, files, spacers](doc/layout/)

Community:

* [Frequently Asked Questions (FAQ)](doc/faq/)
* [Criticisms and replies](doc/criticisms/)
* [TODO list](doc/todo/)

Specification:

* [Request For Comments (RFC)](doc/rfc/)
* [Augmented Backus–Naur Form (ABNF)](doc/abnf/)

Code:

* [Code examples and production crates](doc/code/)
* [Command line argument parsing](doc/clap/)

How to:

* [How to type Unicode characters](doc/how-to-type-unicode-characters/)
* [How to use split and regex](doc/how-to-use-split-and-regex/)

Context:

* [Converters for ASV, CSV, JSON, XLSX](doc/converters/)
* [Comparisons with ASV, CSV, TSV, RSV, JSON](doc/comparisons/)
* [History of ASCII separated values (ASV)](doc/history-of-ascii-separated-values/)

Editor notes:

* [vim notes](doc/editors/vi/)
* [emacs notes](doc/editors/emacs/)

Example files:

* [hello-goodnight.usv](examples/hello-goodnight.usv) versus [hello-goodnight.csv](examples/hello-goodnight.csv)
* [zen-koans.usv](examples/zen-koans.usv) versus [zen-koans.csv](examples/zen-koans.csv)
* [blog-posts.usv](examples/blog-posts.usv) versus [blog-posts.csv](examples/blog-posts.csv)
* [end-of-transmission.usv](examples/end-of-transmission.usv)

## Hello World

Suppose you want USV text with two units: "hello" and "world".

The USV text with USV symbol characters for unit separators:

```usv
hello␟world␟
```

The USV text with USV control characters for unit separators:

```usv
hello\u001Fworld\u001F
```

## Comparisons to spreadsheets and databases

USV semantics are units, records, groups, files.

Spreadsheet semantics are cells, lines, sheets, folios.

Database semantics are fields, rows, tables, schemas.

## Examples

USV with 2 units by 2 records by 2 groups by 2 files, and the style as sheets:

```usv
a␟b␟␞
c␟d␟␞
␝
e␟f␟␞
g␟h␟␞
␝
␜
i␟j␟␞
k␟l␟␞
␝
m␟n␟␞
o␟p␟␞
␝
␜
```

Parsing example with the USV Rust crate and its iterators:

```rust
use usv::*;

fn main() {
    let text = "a␟b␟␞c␟d␟␞␝e␟f␟␞g␟h␟␞␝␜i␟j␟␞k␟l␟␞␝m␟n␟␞o␟p␟␞␝␜";
    let files = text.files();
    for file in files {
        for group in file {
            for record in group {
                for unit in record {
                    // println! needs a format string as its first argument.
                    println!("{}", unit);
                }
            }
        }
    }
}
```

## Why use USV?

USV can handle data that contains commas, semicolons, quotes, tabs, newlines, and other special characters, all without escaping.

USV can format units/columns/cells and records/rows/lines and groups/tables/grids and files/schemas/folios.

USV aims to be an international standard, and has an official IETF RFCXML Internet Draft.

USV uses Unicode characters that are semantically meaningful.

USV works well with any typical modern editor, font, terminal, shell, search, and language.

USV uses visible letter-width characters, and these are easy to view, select, copy, paste, search.

## USV is easy and friendly

USV is intended to be easy to use and friendly to try.

USV works with many kinds of data, and many kinds of editors. Any editor that can render the USV characters will work. We use vim, emacs, helix, Zed, VS Code, JetBrains IDEs, Nova, TextMate, Sublime, Notepad++, etc.

USV works with many kinds of tools. Any tool that can parse the USV characters will work. We use awk, sed, grep, rg, miller, etc.

USV works with many kinds of languages. Any language that can handle UTF-8 character encoding and rendering should work. We use C, C++, C#, Elixir, Erlang, Go, Java, JavaScript, Julia, Kotlin, Perl, PHP, Python, R, Ruby, Rust, Swift, TypeScript, etc.

## Legal protection for standardization

The USV project aims to become a free open source IETF standard and IANA standard, much like the standards for CSV and TDF.
Until the standardization happens, the terms "Unicode Separated Values" and "USV" are both trademarks of this project. This repository is copyright 2022-2024. The trademarks and copyrights are by Joel Parker Henderson, me, an individual, not a company.

When IETF and IANA approve the submissions as a standard, then the trademarks and copyright will go to a free libre open source software advocacy foundation. We welcome advice about how to do this well.

## Conclusion

USV is helping us with data projects. We hope USV may help you too.

We welcome constructive feedback about USV, as well as git issues, pull requests, and standardization help.

[FAQ](doc/faq/) • [RFC](doc/rfc/) • [Code](doc/code/) • [Comparisons](doc/comparisons/) • [TODO](doc/todo/) • [XKCD](https://xkcd.com/927/)

================================================
FILE: bin/bash/usv-to-csv.bash
================================================
#!/usr/bin/env bash
set -euf -o pipefail

# USV example shell script that converts USV to CSV.
#
# Note this script is a simple demo, and does not attempt to escape CSV output,
# such as create a double-quoted unit to protect an embedded comma or newline.
#
# The control characters use ANSI-C quoting such as $'\u001F' (Bash 4.2+),
# because a double-quoted "\u001F" is a literal backslash sequence in Bash.

escape=false
comma=''
while IFS= read -N1 -r c; do
    if [ "$escape" = true ]; then
        escape=false
        # Emit any pending comma before the escaped character.
        printf %s%s "$comma" "$c"
        comma=''
    else
        case "$c" in
        $'\u001B' | "␛")
            escape=true
            ;;
        $'\u001F' | "␟")
            comma=','
            ;;
        $'\u001E' | "␞")
            printf "\n"
            comma=''
            ;;
        $'\u001D' | "␝")
            >&2 printf "\nerror: group separator\n"
            ;;
        $'\u001C' | "␜")
            >&2 printf "\nerror: file separator\n"
            ;;
        $'\u0004' | "␄")
            break
            ;;
        *)
            printf %s%s "$comma" "$c"
            comma=''
            ;;
        esac
    fi
done

================================================
FILE: bin/bash/usv-to-debug.bash
================================================
#!/usr/bin/env bash
set -euf -o pipefail

# USV example shell script that demonstrates the use of USV characters.
# This script reads STDIN one character at a time, and prints text.
#
# The control characters use ANSI-C quoting such as $'\u001F' (Bash 4.2+),
# because a double-quoted "\u001F" is a literal backslash sequence in Bash.

escape=false
while IFS= read -N1 -r c; do
    if [ "$escape" = true ]; then
        escape=false
        # Keep the text in the format string so that "\n" is interpreted.
        printf "\nescape character: %s\n" "$c"
    else
        case "$c" in
        $'\u001B' | "␛")
            printf "\nescape\n"
            escape=true
            ;;
        $'\u001F' | "␟")
            printf "\nunit separator\n"
            ;;
        $'\u001E' | "␞")
            printf "\nrecord separator\n"
            ;;
        $'\u001D' | "␝")
            printf "\ngroup separator\n"
            ;;
        $'\u001C' | "␜")
            printf "\nfile separator\n"
            ;;
        $'\u0004' | "␄")
            printf "\nend of transmission\n"
            break
            ;;
        *)
            printf %s "$c"
            ;;
        esac
    fi
done
printf "\n"

================================================
FILE: bin/bash/usv-to-display.bash
================================================
#!/usr/bin/env bash
set -euf -o pipefail

# USV example shell script that demonstrates the use of USV characters.
# This script reads STDIN one character at a time, and prints text.
#
# The control characters use ANSI-C quoting such as $'\u001F' (Bash 4.2+),
# because a double-quoted "\u001F" is a literal backslash sequence in Bash.

escape=false
while IFS= read -N1 -r c; do
    if [ "$escape" = true ]; then
        escape=false
        printf %s "$c"
    else
        case "$c" in
        $'\u001B' | "␛")
            escape=true
            ;;
        $'\u001F' | "␟")
            printf ","
            ;;
        $'\u001E' | "␞")
            printf "\n"
            ;;
        $'\u001D' | "␝")
            printf "\n-\n"
            ;;
        $'\u001C' | "␜")
            printf "\n=\n"
            ;;
        $'\u0004' | "␄")
            break
            ;;
        *)
            printf %s "$c"
            ;;
        esac
    fi
done
printf "\n"

================================================
FILE: bin/python/usv-to-csv.py
================================================
#!/usr/bin/env python3

# USV example script that converts USV to CSV.
#
# Note this script is a simple demo, and does not attempt to escape CSV output,
# such as create a double-quoted unit to protect an embedded comma or newline.
import io
import sys

sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')

escape = False
comma = ''
while True:
    c = sys.stdin.read(1)
    if c == '':
        break
    if escape:
        escape = False
        # Emit any pending comma before the escaped character.
        print(f"{comma}{c}", end='', flush=True)
        comma = ''
    else:
        match c:
            case "\u001B" | "␛":
                escape = True
            case "\u001F" | "␟":
                comma = ','
            case "\u001E" | "␞":
                print(f"\n", end='', flush=True)
                comma = ''
            case "\u001D" | "␝":
                raise Exception("error: group separator")
            case "\u001C" | "␜":
                raise Exception("error: file separator")
            case "\u0004" | "␄":
                break
            case (c):
                print(f"{comma}{c}", end='', flush=True)
                comma = ''

================================================
FILE: bin/python/usv-to-debug.py
================================================
#!/usr/bin/env python3

# USV example script that demonstrates the use of USV characters.
# This script reads STDIN one character at a time, and prints text.

import io
import sys

sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')

escape = False
while True:
    c = sys.stdin.read(1)
    if c == '':
        break
    if escape:
        escape = False
        print(f"\nescape character: {c}\n", end='', flush=True)
    else:
        match c:
            case "\u001B" | "␛":
                print("\nescape\n", end='', flush=True)
                escape = True
            case "\u001F" | "␟":
                print(f"\nunit separator\n", end='', flush=True)
            case "\u001E" | "␞":
                print(f"\nrecord separator\n", end='', flush=True)
            case "\u001D" | "␝":
                print(f"\ngroup separator\n", end='', flush=True)
            case "\u001C" | "␜":
                print(f"\nfile separator\n", end='', flush=True)
            case "\u0004" | "␄":
                print(f"\nend of transmission\n", end='', flush=True)
                break
            case (c):
                print(f"{c}", end='', flush=True)
print()

================================================
FILE: bin/python/usv-to-display.py
================================================
#!/usr/bin/env python3

# USV example script that demonstrates the use of USV characters.
# This script reads STDIN one character at a time, and prints text.
import io
import sys

sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')

escape = False
while True:
    c = sys.stdin.read(1)
    if c == '':
        break
    if escape:
        escape = False
        print(f"{c}", end='', flush=True)
    else:
        match c:
            case "\u001B" | "␛":
                escape = True
            case "\u001F" | "␟":
                print(f",", end='', flush=True)
            case "\u001E" | "␞":
                print(f"\n", end='', flush=True)
            case "\u001D" | "␝":
                print(f"\n-\n", end='', flush=True)
            case "\u001C" | "␜":
                print(f"\n=\n", end='', flush=True)
            case "\u0004" | "␄":
                break
            case (c):
                print(f"{c}", end='', flush=True)
print()

================================================
FILE: doc/abnf/index.md
================================================
# Augmented Backus–Naur Form (ABNF)

Augmented Backus–Naur Form (ABNF) grammar-- work in progress.

## Semantics

* usv = *files
* file = *groups
* group = *records
* record = *units
* unit = *content-characters

## Syntax

Sections:

* usv = ( header-and-body / body ) '*' ; anything after the body is chaff
* header-and-body = 1*unit-run / 1*record-run / 1*group-run / 1*file-run
* body = *unit-run / *record-run / *group-run / *file-run

Runs:

* file-run = *( *spacer-character file *spacer-character FS )
* group-run = *( *spacer-character group *spacer-character GS )
* record-run = *( *spacer-character record *spacer-character RS )
* unit-run = *( *spacer-character unit *spacer-character US )

Character classes:

* content-character = typical-character / escape-character
* typical-character = '*' - special-character - escape-character
* special-character = US / RS / GS / FS / ESC / EOT
* escape-character = ESC ( special-character / typical-character )
* spacer-character = Defined by Unicode Derived Core Property White_Space

## Unicode characters

Markers:

* US = U+001F Unit Separator / U+241F Symbol for Unit Separator
* RS = U+001E Record Separator / U+241E Symbol for Record Separator
* GS = U+001D Group Separator / U+241D Symbol for Group Separator
* FS = U+001C File Separator / U+241C
Symbol for File Separator Modifiers: * ESC = U+001B Escape / U+241B Symbol for Escape * EOT = U+0004 End Of Transmission / U+2404 Symbol for End Of Transmission ================================================ FILE: doc/clap/index.md ================================================ # Command line argument parsing (CLAP) USV tools should enable users to choose their preferred output style. USV tools for terminals should enable options with these settings. Options for USV separators and modifiers: * -u, --unit-separator : Set the unit separator string. * -r, --record-separator : Set the record separator string. * -g, --group-separator : Set the group separator string. * -f, --file-separator : Set the file separator string. * -e, --escape : Set the escape string. * -z, --end-of-transmission : Set the end-of-transmission string. Options for USV marks: * --style-symbols : Show marks as symbols, such as "␟" for Unit Separator. * --style-controls : Show marks as controls, such as "\u001F" for Unit Separator. This is most like ASCII Separated Values (ASV). * --style-braces : Show marks as braces, such as "{US}" for Unit Separator. This is to help plain text readers, and is not USV output. Options for USV layout: * --layout-0: Show each item with no line around it. This is no layout, in other words one long line. * --layout-1: Show each item with one line around it. This is like single-space lines for long form text. * --layout-2: Show each item with two lines around it. This is like double-space lines for long form text. * --layout-units: Show each unit on one line. This can be helpful for line-oriented tools. * --layout-records: Show each record on one line. This is like a typical spreadsheet sheet export. * --layout-groups: Show each group on one line. This can be helpful for folio-oriented tools. * --layout-files: Show one file on one line. This can be helpful for archive-oriented tools. 
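
The separator, style, and layout options listed above could be declared with Python's standard `argparse` module, roughly as follows. This is only a sketch, not the parser of any existing USV tool: the function name `build_parser`, the program name `usv-example`, the symbol-character defaults, and the choice to make the style flags (and the layout flags) mutually exclusive are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the USV options above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="usv-example")

    # Separators and modifiers. Defaulting to the USV symbol
    # characters is an assumption of this sketch.
    parser.add_argument("-u", "--unit-separator", default="␟")
    parser.add_argument("-r", "--record-separator", default="␞")
    parser.add_argument("-g", "--group-separator", default="␝")
    parser.add_argument("-f", "--file-separator", default="␜")
    parser.add_argument("-e", "--escape", default="␛")
    parser.add_argument("-z", "--end-of-transmission", default="␄")

    # Style flags share one destination; at most one may be given
    # (mutual exclusion is an assumption of this sketch).
    style = parser.add_mutually_exclusive_group()
    for name in ("symbols", "controls", "braces"):
        style.add_argument(f"--style-{name}", dest="style",
                           action="store_const", const=name)

    # Layout flags work the same way.
    layout = parser.add_mutually_exclusive_group()
    for name in ("0", "1", "2", "units", "records", "groups", "files"):
        layout.add_argument(f"--layout-{name}", dest="layout",
                            action="store_const", const=name)

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-u", "|", "--style-braces", "--layout-records"])
    print(args.unit_separator, args.style, args.layout)
```

When no style or layout flag is given, `args.style` and `args.layout` are `None`, which leaves the choice of a default to the tool.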
Options for command line tools:

* -h, --help : Print help
* -V, --version : Print version
* -v, --verbose... : Set the verbosity level: 0=none, 1=error, 2=warn, 3=info, 4=debug, 5=trace. Example: --verbose …
* --test : Print test output for debugging, verifying, tracing, and the like. Example: --test

================================================
FILE: doc/code/index.md
================================================
# Code

USV has source code examples and also has production-ready library code.

## Script examples with Bash and Python

This repository includes USV code examples that demonstrate parsing.

Bash examples:

* [usv-to-display.bash](../../bin/bash/usv-to-display.bash)
* [usv-to-debug.bash](../../bin/bash/usv-to-debug.bash)
* [usv-to-csv.bash](../../bin/bash/usv-to-csv.bash)

Python examples:

* [usv-to-display.py](../../bin/python/usv-to-display.py)
* [usv-to-debug.py](../../bin/python/usv-to-debug.py)
* [usv-to-csv.py](../../bin/python/usv-to-csv.py)

## Production code with Rust

Rust has a crate in its own repo suitable for production use:

* `cargo install usv`
* [https://crates.io/crates/usv](https://crates.io/crates/usv)
* [https://github.com/sixarm/usv-rust-crate](https://github.com/sixarm/usv-rust-crate)

Command line converters:

* [asv-to-usv](https://crates.io/crates/asv-to-usv) and [usv-to-asv](https://crates.io/crates/usv-to-asv)
* [csv-to-usv](https://crates.io/crates/csv-to-usv) and [usv-to-csv](https://crates.io/crates/usv-to-csv)
* [json-to-usv](https://crates.io/crates/json-to-usv) and [usv-to-json](https://crates.io/crates/usv-to-json)

The Rust code includes tests and benchmarks. We welcome improvements.

================================================
FILE: doc/comparisons/asv/index.md
================================================
# ASCII Separated Values (ASV) a.k.a.
DEL (Delimited ASCII)

ASCII Separated Values (ASV) uses these invisible zero-width control character separators:

* ASCII character 28 as file separator
* ASCII character 29 as group separator
* ASCII character 30 as record separator
* ASCII character 31 as unit separator.

These separators are identical in concept to those in USV.

ASV also:

* Forbids the ASCII control characters in content. In other words, there is no escaping.
* In practice, has many incompatible implementations and users that expect the record separator to be a newline character, because the implementations and users prefer to display the data on a screen.

## In our experience

In our experience, these ASCII characters tend to be hard to edit manually.

* Because many editors treat the characters as invisible zero-width characters.
* Because major character pickers show the visible character then insert the visible character, which is the corresponding USV Symbol.

In our experience, more than 90% of the ASV files we discovered in our research used the character "\n" as the record delimiter, or the combination of characters "\r\n", rather than the correct character 30.

================================================
FILE: doc/comparisons/csv/index.md
================================================
# Comma Separated Values (CSV)

Comma Separated Values (CSV) uses a comma character to separate values, and a newline character to separate records.

* Has fields, which are equivalent to USV units.
* Has records, which are equivalent to USV records.
* Does not have a greater hierarchy, such as USV groups and files, or spreadsheet sheets and folios, or database tables or schemas, etc.
* Forbids the tab character in content.
* Forbids the newline character in content.
* Some implementations forbid the comma character in content; other implementations allow it if and only if the field is surrounded by quotation marks.
* Some implementations forbid the newline character in content; other implementations allow it if and only if the field is surrounded by quotation marks. ## Custom delimiter character Some CSV implementations and users enable a custom delimiter character. * For example, some users prefer to use the semicolon character. This is prevalent among some European regions, where the comma character is frequently in use within numbers as a digit separator, such as "123,456,789". * For example, some users prefer to use the vertical pipe character. This is prevalent among some developers of natural language content, when the developers are aware that content may contain commas or semicolons, yet is unlikely to contain a pipe character. There is no standardization to know what the delimiter character is, ahead of time. * In practice, some CSV implementations use a heuristic to guess the delimiter character by inspecting the data. * In practice, some CSV users send along out-of-band instructions that explain the delimiter character. ### Commas CSV implementations may fail when there is a comma that is supposed to be in content, or may require quoting: This data is typically parsed as two CSV fields: ```csv hello, world ``` To get the data as one field, some CSV implementations support surrounding quotation marks: ```csv "hello, world" ``` USV honors commas, such as in this one unit that contains a comma: ```usv hello, world ``` ### Quotes CSV implementations may fail when there is a quotation mark that is supposed to be in content, or may require implementation-specific triple double-quotes. 
This data is typically parsed as a CSV error: ```csv I say "hello, world" ``` To get the data as one field, some CSV implementations support surrounding quotation marks and escaping via double double-quotes: ```csv "I say ""hello, world""" ``` USV honors quotes, such as in this one unit that contains quotation marks: ```usv I say "hello, world" ``` ### Newlines CSV implementations may fail when there is a newline that is supposed to be in content, or may require implementation-specific escaping. This data is typically parsed as a CSV error: ```csv "first line\nsecond line" ``` To get the data as one field, some CSV implementations support escaping by using backslash quotation marks like this: ```csv "\"first line\rsecond line\"" ``` USV honors newlines, such as in this one unit that contains a newline: ```usv first line second line ``` ## In our experience In our experience, the CSV format has various kinds of implementations, some incompatible, some with escaping and some without. In our experience, some software programs use the file name extension ".csv" to mean other ways of separating data with other characters, such as using tabs, or semi-colons, or spaces. ### CSV files We work with spreadsheets that are folios, that each contain sheets, that each contain grids. Suppose we work with 3 spreadsheets, and each spreadsheet contains 3 sheets. When we export the data, the export process needs multiple filesystem files, and needs some kind of ad hoc naming convention to show what's what: ```txt my-folio-1-sheet-1.csv my-folio-1-sheet-2.csv my-folio-1-sheet-3.csv my-folio-2-sheet-1.csv my-folio-2-sheet-2.csv my-folio-2-sheet-3.csv my-folio-3-sheet-1.csv my-folio-3-sheet-2.csv my-folio-3-sheet-3.csv ``` To send all the data to another team, we have tried a variety of combiner tools, such as `tar` and `zip`. For comparison, USV can contain all the data, because a USV file is equivalent to a spreadsheet folio, and USV group is equivalent to a spreadsheet sheet. 
Thus our export uses one filesystem file:

```txt
my.usv
```

================================================
FILE: doc/comparisons/index.md
================================================
# Comparisons with ASV, CSV, TSV, RSV, JSON, XLSX

Unicode separated values (USV) is similar to these formats, plus offers more capabilities, editor-friendly markup, and standards-track syntax.

* [ASCII separated values (ASV) a.k.a. DEL (Delimited ASCII)](asv)
* [Comma Separated Values (CSV)](csv)
* [Tab Separated Values (TSV) a.k.a. Tab Delimited Format (TDF)](tsv)
* [Rows of String Values (RSV)](rsv)
* [JavaScript Object Notation (JSON)](json)
* [Microsoft Excel (XLSX)](xlsx)

## Summary table

| Capability | [USV](../../) | [ASV](asv) | [CSV](csv) | [TSV](tsv) | [RSV](rsv) | [JSON](json) | [XLSX](xlsx) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Units / cells / fields | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | ✅ |
| Records / lines / rows | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | ✅ |
| Groups / sheets / tables | ✅ | ✅ | ⛔ | ⛔ | ⛔ | 🟡 | ✅ |
| Files / folios / schemas | ✅ | ✅ | ⛔ | ⛔ | ⛔ | 🟡 | ✅ |
| Text, not binary | ✅ | ✅ | ✅ | ✅ | ⛔ | ✅ | ⛔ |
| All visible separators | ✅ | ⛔ | ✅ | 🟡 | ⛔ | ✅ | ⛔ |
| Easy for any text editor | ✅ | ⛔ | ✅ | ✅ | ⛔ | ⛔ | ⛔ |
| Separator line spacing | ✅ | ⛔ | 🟡 | 🟡 | ⛔ | 🟡 | ⛔ |
| IETF.org standards-track | ✅ | ⛔ | 🟡 | 🟡 | ⛔ | ✅ | 🟡 |
| Escaping | ✅ | ✅ | ✅ | ⛔ | ⛔ | 🟡 | 🟡 |
| End of Transmission | ✅ | ✅ | ⛔ | ⛔ | ⛔ | ⛔ | ⛔ |
| Variable units per record | ✅ | ⛔ | ⛔ | ⛔ | ✅ | ✅ | ⛔ |
| Separators are terminators | ✅ | ⛔ | ⛔ | ⛔ | ✅ | ⛔ | ⛔ |
| Unicode UTF-8 default | ✅ | ⛔ | ⛔ | ⛔ | ⛔ | ✅ | 🟡 |

## Example for ASCII Separated Values (ASV)

```asv
a\u001Fb\u001F\u001Ec\u001Fd\u001F\u001E
```

USV with symbols:

```usv
a␟b␟␞c␟d␟␞
```

USV with controls is identical to ASV:

```usv
a\u001Fb\u001F\u001Ec\u001Fd\u001F\u001E
```

## Example for Comma Separated Values (CSV)

CSV example:

```csv
a,b
c,d
```

USV with symbols:

```usv
a␟b␟␞
c␟d␟␞
```

USV with controls:

```usv
a\u001Fb\u001F\u001E
c\u001Fd\u001F\u001E
```

## Example for Tab Separated Values (TSV)

TSV example:

```tsv
a	b
c	d
```

USV with symbols:

```usv
a␟b␟␞
c␟d␟␞
```

USV with controls:

```usv
a\u001Fb\u001F\u001E
c\u001Fd\u001F\u001E
```

## Example for Rows of String Values (RSV)

RSV example:

```rsv
a\b255b\b255\b253c\b255d\b255\b253
```

USV with symbols:

```usv
a␟b␟␞
c␟d␟␞
```

USV with controls:

```usv
a\u001Fb\u001F\u001E
c\u001Fd\u001F\u001E
```

## Example for Microsoft Excel (XLSX)

XLSX example:

```xlsx
Sheet 1
a,b
c,d

Sheet 2
e,f
g,h
```

USV with symbols:

```usv
Sheet 1␟␞
a␟b␟␞
c␟d␟␞
␝
Sheet 2␟␞
e␟f␟␞
g␟h␟␞
␝
```

USV with controls:

```usv
Sheet 1\u001F\u001E
a\u001Fb\u001F\u001E
c\u001Fd\u001F\u001E
\u001D
Sheet 2\u001F\u001E
e\u001Ff\u001F\u001E
g\u001Fh\u001F\u001E
\u001D
```

================================================
FILE: doc/comparisons/json/index.md
================================================
# JavaScript Object Notation (JSON)

JavaScript Object Notation (JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers. - Wikipedia ([Source](https://en.wikipedia.org/wiki/JSON))

JSON is more flexible and more powerful than USV because JSON can have arbitrarily deep nesting and also data types.

Example JSON:

```json
[
  ["a","b"],
  ["c","d"]
]
```

Equivalent USV:

```usv
a␟b␟␞
c␟d␟␞
```

## In our experience

We use JSON in many web applications, API endpoints, data transformations, and the like. It works very well for these purposes.

In our experience JSON is harder to edit by hand than USV, and harder to teach to novices who want to view and edit data. USV tends to be easier for these use cases because USV is simpler.
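
As an illustration of the JSON-to-USV equivalence on this page, here is a minimal Python sketch that turns a JSON array of arrays into USV with symbols. It is not the `json-to-usv` converter; the names `escape_unit` and `json_rows_to_usv` are hypothetical, and the sketch assumes a two-dimensional array of scalar values. It escapes only the symbol forms of the USV special characters, and it does not handle content that begins with a carriage return or newline.

```python
import json

# USV symbol characters (Unicode control pictures).
UNIT_SEPARATOR = "\u241F"    # ␟
RECORD_SEPARATOR = "\u241E"  # ␞
ESCAPE = "\u241B"            # ␛

# Symbol-form special characters that this sketch escapes inside content.
# Assumption: the control-character forms (U+001F etc.) are not present.
SPECIAL = {"\u241F", "\u241E", "\u241D", "\u241C", "\u241B", "\u2404"}


def escape_unit(text: str) -> str:
    """Prefix each USV special character in the content with Escape."""
    return "".join(ESCAPE + ch if ch in SPECIAL else ch for ch in text)


def json_rows_to_usv(json_text: str) -> str:
    """Convert a JSON array of arrays into USV, one record per line."""
    rows = json.loads(json_text)
    lines = []
    for row in rows:
        # Separators are terminators: every unit ends with a unit separator,
        # and every record ends with a record separator.
        units = "".join(escape_unit(str(value)) + UNIT_SEPARATOR for value in row)
        lines.append(units + RECORD_SEPARATOR)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(json_rows_to_usv('[["a","b"],["c","d"]]'), end="")
```

The newline after each record separator is only for visual layout, which USV permits after a separator. Nested JSON objects or deeper arrays would need a mapping decision (for example, to USV groups) that this sketch does not attempt.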
================================================
FILE: doc/comparisons/rsv/index.md
================================================
# Rows of String Values (RSV)

https://github.com/Stenway/RSV-Specification

The RSV data file format is a simple binary alternative to CSV. An RSV document represents an array of arrays of nullable string values, also called a jagged array. Its main purpose is to store tabular data. But because it's a jagged array, it's not limited to that. So, rows can contain the same number of values, but don't have to.

================================================
FILE: doc/comparisons/tsv/index.md
================================================
# Tab Separated Values (TSV) a.k.a. Tab Delimited Format (TDF)

Tab Separated Values (TSV) uses a tab character to separate values, and a newline character to separate records.

* Has fields, which are equivalent to USV units.
* Has records, which are equivalent to USV records.
* Does not have a greater hierarchy, such as USV groups and files, or spreadsheet sheets and folios, or database tables or schemas, etc.
* Forbids the tab character in content.
* Forbids the newline character in content.

## In our experience

In our experience, TSV can be difficult to edit with some editors, because the tab character can be invisible, or can take up a varying number of character widths such as 2 spaces or 4 spaces or 8 spaces or as many spaces as it takes to get to the next tab stop.

In our experience, some software programs use the file name extension ".tsv", others use the extension ".tdf", and others use the extension ".csv" even though the file actually uses tabs and doesn't use commas.

================================================
FILE: doc/comparisons/xlsx/index.md
================================================
# Microsoft Excel (XLSX)

Microsoft Excel (XLSX) is among the world's most popular spreadsheet programs. It uses a data format called "XLSX" which in turn uses XML and binary compression.

* Has spreadsheet sheets. Each sheet is called a "Worksheet", and can contain columns and rows.
* Has spreadsheet folios. Each folio is called a "Workbook", and can contain one or more sheets.
* Does not have a greater hierarchy, such as a collection of folios.
* Can import/export data in many formats, such as CSV and TSV, but not yet USV.

## Custom delimiters

Microsoft Excel enables the user to import/export using a wide range of custom delimiters, such as column separators and row separators.

## In our experience

In our experience, XLSX is great when the data is primarily read and edited using Microsoft Excel or a compatible spreadsheet program. We had some success using decompression software then an XML editor, but this process and the XML tooling are harder for end users.

### Workbooks and Worksheets

We work with spreadsheets that are folios a.k.a. workbooks, that each contain multiple sheets a.k.a. worksheets.

```txt
my-workbook-1.xlsx
my-workbook-2.xlsx
my-workbook-3.xlsx
```

Or if we export data to CSV or similar format then we have even more files:

```txt
my-workbook-1-worksheet-1.csv
my-workbook-1-worksheet-2.csv
my-workbook-1-worksheet-3.csv
my-workbook-2-worksheet-1.csv
my-workbook-2-worksheet-2.csv
my-workbook-2-worksheet-3.csv
my-workbook-3-worksheet-1.csv
my-workbook-3-worksheet-2.csv
my-workbook-3-worksheet-3.csv
```

To send all the data to another team, we have tried a variety of combiner tools, such as `tar` and `zip`.

For comparison, USV can contain all the data, because a USV file is equivalent to a spreadsheet folio, and a USV group is equivalent to a spreadsheet sheet.
Thus our export uses one filesystem file:

```txt
my.usv
```

================================================
FILE: doc/converters/index.md
================================================
# Converters for ASV, CSV, JSON, XLSX

ASCII Separated Values (ASV):

* [asv-to-usv](https://crates.io/crates/asv-to-usv)
* [usv-to-asv](https://crates.io/crates/usv-to-asv)

Comma Separated Values (CSV):

* [csv-to-usv](https://crates.io/crates/csv-to-usv)
* [usv-to-csv](https://crates.io/crates/usv-to-csv)

JavaScript Object Notation (JSON):

* [json-to-usv](https://crates.io/crates/json-to-usv)
* [usv-to-json](https://crates.io/crates/usv-to-json)

Microsoft Excel XML (XLSX):

* [xlsx-to-usv](https://crates.io/crates/xlsx-to-usv)
* [usv-to-xlsx](https://crates.io/crates/usv-to-xlsx)

================================================
FILE: doc/criticisms/index.md
================================================
# Criticisms

USV is led by Joel Parker Henderson (joel@joelparkerhenderson.com).

Constructive feedback is welcome.

See also [frequently asked questions](../faq/).

- [XKCD one universal standard](#xkcd-one-universal-standard) - [Fundamentally wrong](#fundamentally-wrong) - [You cannot edit it](#you-cannot-edit-it) - [No efficient storage](#no-efficient-storage) - [There is no wide library support](#there-is-no-wide-library-support) - [Not all data is representable](#not-all-data-is-representable) - [Editors work with invisible characters](#editors-work-with-invisible-characters) - [Doesn't work with Excel](#doesnt-work-with-excel) - [Not trivially splittable](#not-trivially-splittable) - [No need for an escape character](#no-need-for-an-escape-character) - [Can't encode as a single byte](#cant-encode-as-a-single-byte) - [Better off advocating for editor support](#better-off-advocating-for-editor-support) - [Cleverness for cleverness’s sake](#cleverness-for-clevernesss-sake) - [This is kinda stupid](#this-is-kinda-stupid) - [Nobody needs USV, and nobody should use it.](#nobody-needs-usv-and-nobody-should-use-it) - [Kill it with fire](#kill-it-with-fire) ## XKCD one universal standard
"This is like the XKCD cartoon about one universal standard."
Ha! That's funny. It turns out USV isn't trying to be one universal standard. CSV works really well for many use cases, and is well-supported everywhere, so by all means keep using CSV where you want and where it works well. USV aims just for use cases that CSV doesn't seem to handle well, such as text that contains paragraphs of natural language, or displays better with newlines between units, or data that involves spreadsheet collections (e.g. folios comprising sheets comprising rows and columns) and database collections (e.g. schemas comprising tables comprising records and fields), or data that needs an End of Transmission. ## Fundamentally wrong
"Using Unicode graphic characters as metasyntactic escape characters is fundamentally wrong. Those Unicode characters are for displaying the symbols for Unit Separator, Record Separator, etc. and not for actually being separators! ASCII already has those! Included in Unicode!"
USV accepts ASCII control characters and the corresponding Unicode symbol characters as equivalent. If you prefer to use exclusively ASCII control characters, then do that.

I tried that approach first, and the ASCII control characters didn't work well in practice for visual display and for text editors. This is because the ASCII control characters are rendered as invisible for many of the displays and editors I tried, and also didn't copy correctly in many of the tools.

Also, there are command-line tools for converting from ASCII Separated Values (ASV) to Unicode Separated Values (USV) and vice versa: [asv-to-usv](https://crates.io/crates/asv-to-usv), [usv-to-asv](https://crates.io/crates/usv-to-asv).

## You cannot edit it
"You cannot edit it in regular editor, like csv/tsv/jsonlines."
I edit it in regular editors, every day. I use vi, emacs, VS Code, JetBrains IDEs, and more. I've also tried USV on many more editors, and so far it works 100% of the time. If you have a specific editor that doesn't seem to be working well with USV, can you please contact me?

## No efficient storage
"There is no efficient storage, like binary formats."
USV is a text format, on purpose, because it's aiming to be human-readable and human-editable. USV storage goals are similar in magnitude to CSV.

If you want efficient storage like a binary format, one way is to use compression on the text data. USV, CSV, and similar text formats can work well with compression, especially if the content has compression-friendly aspects such as repetitions, sequences, patterns, and so forth.

## There is no wide library support
"There is no wide library support."
Currently there's library support using the [USV Rust crate](https://crates.io/crates/usv) and there are command line [converters](../converters/). I welcome help creating library support from anyone who wants to help. The Rust crate is relatively easy to understand, and should be portable to similar family languages such as C, C++, C#, Java, JavaScript, Python, Ruby, etc. ## Not all data is representable
"Not all data is representable."
Can you provide an example of data that is not representable, or an explanation of what the data could be? USV aims for all data to be representable. Specifically, USV aims to be able to represent all UTF-8 encoded text. USV provides an escape character, so you can escape any of the USV special characters as you wish. ## Editors work with invisible characters
"We already have editors that can work with invisible characters. It’s not hard."
It turns out it is hard, in practice. I tried using invisible characters first, and found ongoing hard problems such as with copy/paste, search/replace, import/export, pattern matching, font display, and zero-width rendering.

In fact, the difficulties with invisible characters seem to be the reason that programmers mostly abandoned ASCII Separated Values (ASV) in favor of Comma Separated Values (CSV). USV aims to build on ASV to add capabilities for visible characters and better visible displays.

## Doesn't work with Excel
"The adoptability challenge remains here to be Excel support."
Yes you're right. USV is brand-new on the standards track in 2024. Excel support is a long-term goal. Submitting to the IETF is to help programs like Excel to start supporting it. If you have experience with writing Excel import/export capabilities, I welcome your help. ## Not trivially splittable
"This format is not trivially splittable with a regular expression. I'd avoid most of the escaping they show, especially for line endings, and just make RS '\n' the record separator, or possibly RS '\n'*."
See the documentation about [how to use split and regex](../how-to-use-split-and-regex/).

Broadly speaking, USV does not have a goal to be trivially splittable, because visual editing is much more important in practice, and because library parsing is more reliable.

ASCII Separated Values (ASV) should be trivially splittable by using a unit separator byte character and record separator byte character. But it turns out that many ASV files in the wild actually change from using the record separator byte character to a newline character. Before you split, you need to know these choices.

Comma Separated Values (CSV) should be trivially splittable by using a comma byte character and newline byte character. But it turns out that many CSV files in the wild actually change from using the comma byte character to a semicolon byte character or a pipe character. And some CSV files use escaping such as for quotes, or commas that are embedded in content, or escaped newlines that are embedded in content. Before you split, you need to know these choices.

It's easy if you handle all data yourself; it's not easy if you're working with many worldwide organizations.

## No need for an escape character
"I am not convinced about the need for an escape character."
I tried USV without an escape character for a year to get real-world feedback. The feedback was that the escape was needed, because otherwise there could be data that couldn't be represented without an extra out-of-band reformatting/rewriting step. ## Can't encode as a single byte
"ASCII Separated Values is better because it can encode each separator as a single byte."
If single byte encoding is very important, and you don't care about visible symbols, then yes ASCII Separated Values (ASV) is better for you. USV doesn't have a goal of single byte separators.

You can freely convert between ASV and USV and back again, if you like, by using these [converters](../converters/).

## Better off advocating for editor support
"Just because a glyph is "invisible" doesn't mean it has to actually be invisible. The symbols for the separators are hard to read, like you're pointing out, which means someone would eventually replace them with some other graphical display, in which case you were just as well off with the actual separators themselves. They would have been better off advocating for editor support for actual separator display."
Yes you're correct. Programmers have been advocating for editor support for actual separator display since the 1980's ASCII Separated Values. So far, the advocating has not succeeded. USV is a compromise for the present. If the future offers editor support as you describe, then it will be great to use that instead of USV, and in fact USV will have been very useful for getting people using group separators, file separators, escapes, End of Transmissions, and other ASV features that are more extensive than CSV. ## Cleverness for cleverness’s sake
"USV would have the disadvantage of using multi-byte characters as delimiters, so you have to decode the file in order to separate records. And you still can’t type the characters directly or be guaranteed to display them without font support. This honestly seems like cleverness for cleverness’s sake."
Yes you're correct directionally on your technical points. To decode one record, you have to read that one record until you reach its record separator; in other words, you can't just use split on one byte value as you can with CSV. That said, you can decode one unit at a time, or one record at a time, or one group at a time, or one file at a time; you don't have to decode the whole file. As for cleverness, it's not especially clever. USV is essentially just ASCII Separated Values (ASV) plus visible symbols and some simple extras for escape, end of transmission, and spacers. The core ideas of ASV and USV are all from the 1970's. ## This is kinda stupid
"I've long wanted a successor to CSV, but this is kinda stupid. People like CSVs because they look good, feel natural even in plaintext. This is the same reason that Markdown in successful. As for including commas in your data, it could just have been managed with a simple escape character like a \, for when there's actually a comma in your data. That's it."
If you want a successor to CSV, do you have suggestions for what you want? What I learned is that when you escape with a backslash, then you have to also provide for escaping a backslash, such as two in a row, and then it causes issues for use cases such as Windows paths, regular expressions, backslash as used in a typical backslash-t for tab or backslash-n for newline, and so on. This is why I prefer to use the escape character as U+241B Symbol for Escape (ESC). More broadly, CSV handles units and records (such as one spreadsheet sheet), but not groups (such as multiple spreadsheet sheets) or files (such as multiple spreadsheet folios). USV handles all of these. ## Nobody needs USV, and nobody should use it.
"This is needlessly adding yet another standard to the mix. If you are in a position to choose what standard you use, just use: * Whatever is best for the data model and/or languages you use. JSON is a common modern choice, suitable for most things. * If you want something more tabular, closer to CSV (which is a valid choice for bulk data), use strict RFC 4180 compliant data. * If you want to specify your own binary super-compact data, use ASN.1. I am also given to understand that Protobuf is a popular modern choice. If you aren’t in a position to choose your standards, just do whatever you need to do to parse whatever junk you are given, and emit as standards-compliant data as possible as output. * Again, RFC 4180 is a great way to standardize your own CSV output, as long as you stick to a subset which the receiving party can parse. Nobody needs USV, and nobody should use it."
Thanks for your specific feedback and conclusion. :-) For me, what's best for my data model is text (not binary), that handles many human languages using UTF-8 (not ASCII), that is easy to read and edit in many text editors (not a specialized row-column editor), and that works especially well with content that is paragraphs of natural language with commas, quotes, newlines, indentations, and the like. I also want capabilities for groups (such as spreadsheet sheets) and files (such as spreadsheet folios). For comparison I've tried binary formats (e.g. ASN.1, Protobuf), row-column tabular formats (e.g. CSV, TDF), web data formats (e.g. JSON, YAML), web markup formats (e.g. HTML, XML). For me, USV is significantly easier to use, read, edit, and share. ## Kill it with fire
"Y'know, I greatly dislike this. It's an actual emotional reaction. This should not be standardized. No one should use this. This is a bad idea and deserves to die in obscurity. I'll tell you why, it's pretty simple. The characters this... thing is stealing, exist to represent invisible control sequences. That is their use. The fact that they can be mentioned by direct input is inevitable, but not to be encouraged. I will be greatly disappointed if this is accepted as a standard. The fact that a USV file looks like a rendered ASV file is a show stopping bug, an anti-feature, an insult to life itself. Kill it with fire."
That's great feedback! The previous time that I heard that kind of feedback, it was about emoji being terrible and how no one should use them. Luckily representations evolve. 😀 ================================================ FILE: doc/editors/emacs/index.md ================================================ # Emacs notes C-x = shows a summary about the character at point. C-u C-x = shows details about the character at point. The rest of this page is from the emacs manual: https://www.gnu.org/software/emacs/manual/html_node/emacs/International-Chars.html ## 23.1 Introduction to International Character Sets The users of international character sets and scripts have established many more-or-less standard coding systems for storing files. These coding systems are typically multibyte, meaning that sequences of two or more bytes are used to represent individual non-ASCII characters. Internally, Emacs uses its own multibyte character encoding, which is a superset of the Unicode standard. This internal encoding allows characters from almost every known script to be intermixed in a single buffer or string. Emacs translates between the multibyte character encoding and various other coding systems when reading and writing files, and when exchanging data with subprocesses. The command C-h h (view-hello-file) displays the file etc/HELLO, which illustrates various scripts by showing how to say “hello” in many languages. If some characters can’t be displayed on your terminal, they appear as ‘?’ or as hollow boxes (see Undisplayable Characters). Keyboards, even in the countries where these character sets are used, generally don’t have keys for all the characters in them. You can insert characters that your keyboard does not support, using C-x 8 RET (insert-char). See Inserting Text. Shorthands are available for some common characters; for example, you can insert a left single quotation mark ‘ by typing C-x 8 [, or in Electric Quote mode, usually by simply typing `. 
See Quotation Marks. Emacs also supports various input methods, typically one for each script or language, which make it easier to type characters in the script. See Input Methods. The prefix key C-x RET is used for commands that pertain to multibyte characters, coding systems, and input methods. The command C-x = (what-cursor-position) shows information about the character at point. In addition to the character position, which was described in Cursor Position Information, this command displays how the character is encoded. For instance, it displays the following line in the echo area for the character ‘c’: ``` Char: c (99, #o143, #x63) point=28062 of 36168 (78%) column=53 ``` The four values after ‘Char:’ describe the character that follows point, first by showing it and then by giving its character code in decimal, octal and hex. For a non-ASCII multibyte character, these are followed by ‘file’ and the character’s representation, in hex, in the buffer’s coding system, if that coding system encodes the character safely and with a single byte (see Coding Systems). If the character’s encoding is longer than one byte, Emacs shows ‘file ...’. On rare occasions, Emacs encounters raw bytes: single bytes whose values are in the range 128 (0200 octal) through 255 (0377 octal), which Emacs cannot interpret as part of a known encoding of some non-ASCII character. Such raw bytes are treated as if they belonged to a special character set eight-bit; Emacs displays them as escaped octal codes (this can be customized; see Customization of Display). In this case, C-x = shows ‘raw-byte’ instead of ‘file’. In addition, C-x = shows the character codes of raw bytes as if they were in the range #x3FFF80..#x3FFFFF, which is where Emacs maps them to distinguish them from Unicode characters in the range #x0080..#x00FF. 
With a prefix argument (C-u C-x =), this command additionally calls the command describe-char, which displays a detailed description of the character:

* The character set name, and the codes that identify the character within that character set; ASCII characters are identified as belonging to the ascii character set.
* The character’s script, syntax and categories.
* What keys to type to input the character in the current input method (if it supports the character).
* The character’s encodings, both internally in the buffer, and externally if you were to save the buffer to a file.
* If you are running Emacs on a graphical display, the font name and glyph code for the character. If you are running Emacs on a text terminal, the code(s) sent to the terminal.
* If the character was composed on display with any following characters to form one or more grapheme clusters, the composition information: the font glyphs if the frame is on a graphical display, and the characters that were composed.
* The character’s text properties (see Text Properties in the Emacs Lisp Reference Manual), including any non-default faces used to display the character, and any overlays containing it (see Overlays in the same manual).

Here’s an example, with some lines folded to fit into this manual:

```
             position: 1 of 1 (0%), column: 0
            character: ê (displayed as ê) (codepoint 234, #o352, #xea)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xEA
               script: latin
               syntax: w        which means: word
             category: .:Base, L:Left-to-right (strong), c:Chinese,
                          j:Japanese, l:Latin, v:Viet
             to input: type "C-x 8 RET ea" or
                          "C-x 8 RET LATIN SMALL LETTER E WITH CIRCUMFLEX"
          buffer code: #xC3 #xAA
            file code: #xC3 #xAA (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-PfEd-DejaVu Sans Mono-normal-normal-
        normal-*-15-*-*-*-m-0-iso10646-1 (#xAC)

Character code properties: customize what to show
  name: LATIN SMALL LETTER E WITH CIRCUMFLEX
  old-name: LATIN SMALL LETTER E CIRCUMFLEX
  general-category: Ll (Letter, Lowercase)
  decomposition: (101 770) ('e' '^')
```

================================================
FILE: doc/editors/vi/index.md
================================================
# vim notes

vim comes with most modern Linux and BSD distributions.

## Digraph characters

To add digraphs for each USV character, add

```
digraph us 9247 rs 9246 gs 9245 fs 9244 es 9243 eo 9220
```

to your `~/.vimrc`.

Then when you want to type, for instance, the record separator character, in insert mode, press `Ctrl-K` and then type `rs`.

## List hidden characters

To list hidden characters:

```
:set list
```

Later:

```
:set nolist
```

================================================
FILE: doc/end-of-transmission/index.md
================================================
# End of Transmission (EOT)

The End of Transmission (EOT) mark tells any reader that it can stop reading.

* EOT tells the data reader that data is done.
* EOT has no effect on the output content.

Example of a unit "abc" then EOT then extra data "xxx" that is ignored.

```usv
abc␞␄xxx
```

EOT can be useful for a variety of use cases:

* Streaming data, such as to signal that the reader can close a connection.
* Appending data, such as USV content, then extra information such as comments.
* Attaching data, such as a USV spreadsheet that has MIME attachments.

================================================
FILE: doc/escape/index.md
================================================
# Escape (ESC)

The Escape (ESC) symbol causes the subsequent character to be treated as a content character.

Example: USV with a unit that contains an Escape + End of Transmission, which is treated as content.

```usv
a␛␄b␟
```

In the rare case that you need a separator then content that starts with a carriage return or newline:

* You escape the carriage return or newline.
* This is because separators may optionally be followed by any number of carriage returns and/or newlines, which is to help with visual display.

================================================
FILE: doc/faq/index.md
================================================
# Frequently Asked Questions

USV is led by Joel Parker Henderson (joel@joelparkerhenderson.com).

Constructive feedback is welcome.

See also [criticisms](../criticisms/).

- [Is USV easy?](#is-usv-easy)
- [Is USV aiming to be a standard?](#is-usv-aiming-to-be-a-standard)
- [Why choose USV over CSV or TSV?](#why-choose-usv-over-csv-or-tsv)
- [Why choose USV over ASV?](#why-choose-usv-over-asv)
- [Why choose USV over ASV for machine-only data?](#why-choose-usv-over-asv-for-machine-only-data)
- [Why use control picture characters rather than the control characters themselves?](#why-use-control-picture-characters-rather-than-the-control-characters-themselves)
- [Why are the symbols so small on my screen?](#why-are-the-symbols-so-small-on-my-screen)

## Is USV easy?

Yes. If you know about comma separated values (CSV), or tab separated values (TSV), or ASCII separated values (ASV), or JavaScript Object Notation (JSON), then you already know much about USV.

## Is USV aiming to be a standard?

Yes, USV is aiming to become an IETF standard similar to IETF RFC 4180 for CSV.
We have submitted the IETF Internet Draft and it is a work in progress.

Yes, USV is aiming to become an IANA standard similar to IANA TSV. We have submitted the request for the "text/usv" media type.

## Why choose USV over CSV or TSV?

You want your data content to be able to contain commas, or tabs, or newlines, without special escaping or different quoting rules than other data such as numbers.

You want your data content to be able to use data groups, or database tables, or spreadsheet grids.

You want your data format to be able to use data files, or database schemas, or spreadsheet folios.

You want your data semantics to be able to use hierarchy levels, nesting, or outlines.

You want a consistent compatible standard format, which CSV can't always provide.

You want a consistent compatible standardized file name extension, which CSV/TSV/TDF can't always provide.

You want to use End of Transmission (EOT), so you can guarantee a reader has read data until the end.

## Why choose USV over ASV?

You want your data content to be friendlier for human reading and human editing.

USV provides typically-visible letter-width characters (such as Unicode 241F), whereas ASV provides typically-invisible zero-width characters (such as ASCII 31).

It's true that some editors do render ASV characters using other visual representations, such as using the corresponding USV visible characters; however in practice we haven't found much support for this approach.

## Why choose USV over ASV for machine-only data?

For machine-only data, such as data that will never be used for human reading or human editing, USV and ASV are similar because both can handle units, records, groups, and files.

## Why use control picture characters rather than the control characters themselves?

We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.
First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.

Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.

Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content).

## Why are the symbols so small on my screen?

USV renders on your system by using your local font. If your local font has small Unicode symbols for specific characters, then you'll see these. On many systems we've tried, the characters render with the letters "US", "RS", "GS", "FS", etc.

We are open to suggestions for fonts that work especially well with USV, and we are open to funding the creation of specialized fonts for these specific characters.

================================================
FILE: doc/history-of-ascii-separated-values/index.md
================================================
# History of ASCII separated values (ASV)

➤

## ASCII 28 = FS = File separator

The file separator FS is an interesting control code, as it gives us insight in the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.

## ASCII 29 = GS = Group separator Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group. ## ASCII 30 = RS = Record separator Within a group (or table) the records are separated with RS or record separator. ## ASCII 31 = US = Unit separator The smallest data item to be stored in a database is called a unit in the ASCII definition. The unit separator separates these fields in a serial data storage environment. The US control code allows all fields to have a variable length. If data storage space is limited—as in the sixties—this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. ## ASCII 14 = Shift Out & ASCII 15 = Shift In The original purpose of these characters was to provide a way to shift a coloured ribbon, split longitudinally usually with red and black, up and down to the other color in an electro-mechanical typewriter or teleprinter, such as the Teletype Model 38, to automate the same function of manual typewriters. Black was the conventional ambient default color and so was shifted "in" or "out" with the other color on the ribbon. ➤ ================================================ FILE: doc/how-to-type-unicode-characters/index.md ================================================ # How to type Unicode characters On many systems, you can type Unicode characters this way: 1. Press and hold the Alt key a.k.a. Option key. 2. Type + and the Unicode character hexadecimal code, such as +241f for Unit Separator. 3. Release the Alt key a.k.a. Option key. 
On Apple macOS, you may need to do a one-time setup:

1. Go to System Preferences -> Keyboard -> Input Sources.

2. Click on + button, select "Others" -> "Unicode Hex Input" and press "Add". (End of one-time setup)

3. Switch to the Unicode Hex Input in the menu bar.

4. Hold down the Option key and type the hexadecimal Unicode value, then release the Option key.

================================================
FILE: doc/how-to-use-split-and-regex/index.md
================================================
# How to use split and regex

If you want to use split and regex, rather than a specific USV parsing tool or library, then you have choices. The pseudocode here is the current best approximation of USV using split and regex.

If you are certain that your data never uses any escape characters:

```regex
transmission = split input on "[\u0004\u2404]" first
files = split transmission on "[\u001C\u241C]"
groups = split file on "[\u001D\u241D]"
records = split group on "[\u001E\u241E]"
units = split record on "[\u001F\u241F]"
unit = trim(unit)
```

If your data may use any escape characters, and also if your split and regex offer capabilities for negative lookbehind:

```regex
transmission = split input on "[\u0004\u2404]" first
files = split transmission on "(?
```

* [U+001F](https://codepoints.net/U+001F)/[U+241F](https://codepoints.net/U+241F) Unit Separator. For a spreadsheet cell, database field, etc.

* [U+001E](https://codepoints.net/U+001E)/[U+241E](https://codepoints.net/U+241E) Record Separator. For a spreadsheet line, database row, etc.

* [U+001D](https://codepoints.net/U+001D)/[U+241D](https://codepoints.net/U+241D) Group Separator. For a spreadsheet sheet, database table, etc.

* [U+001C](https://codepoints.net/U+001C)/[U+241C](https://codepoints.net/U+241C) File Separator. For a spreadsheet folio, database schema, etc.

* [U+001B](https://codepoints.net/U+001B)/[U+241B](https://codepoints.net/U+241B) Escape. For protecting markup characters in content.
* [U+0004](https://codepoints.net/U+0004)/[U+2404](https://codepoints.net/U+2404) End of Transmission. For concluding parsing. ## Character details * [Escape (ESC)](../escape/) * [End of Transmission (EOT)](../end-of-transmission/) * [Spacers](../spacers/) ================================================ FILE: doc/purpose/index.md ================================================ # USV purpose The USV purpose is to help people edit data, share data, and manage data. * Edit data by using plain text and any typical text editor. * Share data by using an international standard for markup. * Manage data in ways that work well with spreadsheets and databases. ## Edit data by using plain text and any typical text editor USV is a plain text format that aims to be easy to read and edit. * Because USV is plain text, you can use any text editor to open a USV file, edit it, save it, print it, and so on. * Because USV enables line spacing wherever you want it, you can edit anything from simple unit-oriented data (such as for logs and metrics) all the way up to complex file-oriented data (such as for blog posts and content management). * Because USV can display marks using your choice of visible symbol characters or invisible control characters, you can edit using your preferred editors and preferred settings for displaying Unicode symbols and Unicode controls. ## Share data by using an international standard for markup USV has a formal specification on track to become an international standard. * Because USV is for worldwide sharing, there is a specification that sets the same marks (such as delimiters) for everyone. * Because USV provides a formal IETF Internet-Draft, anyone may implement USV in any language, and know that it will work. * Because USV has a reference implementation that is free libre open source software, everyone can share the tooling as well. 
## Manage data in ways that work well with spreadsheets and databases USV can manage data collections such as spreadsheet sheets and folios, and database tables and schemas. * Because USV has units, records, groups, files, and end of transmission, it has more dimensions than CSV, and can even allow for attachments. * Because USV has more dimensions, it can replace ad hoc binders, such as ZIP files comprising CSV sheets, or XML files comprising Excel workbooks. * Because USV has jagged array capabilities, it can help save and restore system disk paths, spreadsheet folio tabs, database table names, and more. ================================================ FILE: doc/rfc/draft-unicode-separated-values-01.txt ================================================ Internet Engineering Task Force J. Henderson, Ed. Internet-Draft 16 March 2024 Intended status: Experimental Expires: 17 September 2024 Unicode Separated Values (USV) draft-unicode-separated-values-01 Abstract Unicode Separated Values (USV) is a data format that uses Unicode characters to mark parts. USV builds on ASCII separated values (ASV), and provides pragmatic ways to edit data in text editors by using visual symbols and layouts. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on 17 September 2024. Copyright Notice Copyright (c) 2024 IETF Trust and the persons identified as the document authors. 
All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License. Henderson Expires 17 September 2024 [Page 1] Internet-Draft Unicode Separated Values (USV) March 2024 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 1.2. Media Type Language . . . . . . . . . . . . . . . . . . . 3 1.3. ABNF Language . . . . . . . . . . . . . . . . . . . . . . 3 2. USV characters . . . . . . . . . . . . . . . . . . . . . . . 3 3. Definition of the USV Format . . . . . . . . . . . . . . . . 4 3.1. Data . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2. Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3. Record . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.4. Group . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.5. File . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.6. Header . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.7. Escape (ESC) . . . . . . . . . . . . . . . . . . . . . . 5 3.8. End of Transmission (EOT) . . . . . . . . . . . . . . . . 5 4. ABNF grammar . . . . . . . . . . . . . . . . . . . . . . . . 6 4.1. Semantics . . . . . . . . . . . . . . . . . . . . . . . . 6 4.2. Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.3. Runs . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.4. Character classes . . . . . . . . . . . . . . . . . . . . 6 4.5. Unicode symbols . . . . . . . . . . . . . . . . . . . . . 6 5. Examples . . . . . . . . . . 
. . . . . . . . . . . . . . . . 7 5.1. Hello World . . . . . . . . . . . . . . . . . . . . . . . 7 5.2. Hello World Goodnight Moon . . . . . . . . . . . . . . . 7 5.3. Units, Records, Groups, Files . . . . . . . . . . . . . . 8 5.4. Articles . . . . . . . . . . . . . . . . . . . . . . . . 9 6. Source Code Examples . . . . . . . . . . . . . . . . . . . . 10 7. MIME media type registration for text/usv . . . . . . . . . . 11 7.1. Optional parameters: charset, header . . . . . . . . . . 11 7.2. Encoding considerations . . . . . . . . . . . . . . . . . 11 7.3. Security considerations . . . . . . . . . . . . . . . . . 12 7.4. Interoperability considerations . . . . . . . . . . . . . 12 7.5. Published specification . . . . . . . . . . . . . . . . . 12 7.6. Applications that use this media type . . . . . . . . . . 12 7.7. Additional information . . . . . . . . . . . . . . . . . 12 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 9. Security Considerations . . . . . . . . . . . . . . . . . . . 13 10. Converters . . . . . . . . . . . . . . . . . . . . . . . . . 13 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 11.1. Normative References . . . . . . . . . . . . . . . . . . 13 11.2. Informative References . . . . . . . . . . . . . . . . . 14 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 15 Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 15 Henderson Expires 17 September 2024 [Page 2] Internet-Draft Unicode Separated Values (USV) March 2024 1. Introduction Unicode Separated Values (USV) is a data format useful for exchanging and converting data between various spreadsheet programs, databases, and streaming data services. This RFC explains USV. Additionally, we propose a new media type "text/usv", to be registered with IANA. 
We provide information references for a USV git repository [usv-git-repository], a programming implementation as a USV Rust crate [usv-rust-crate], and converter tools. 1.1. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here. 1.2. Media Type Language The media type normative references are RFC 6838 [RFC6838], RFC 2046 [RFC2046], and RFC 4289 [RFC4289]. 1.3. ABNF Language The ABNF normative reference is RFC 5234 [RFC5234]. 2. USV characters Separators: * File Separator (FS) is U+001C or U+241C * Group Separator (GS) is U+001D or U+241D * Record Separator (RS) is U+001E or U+241E * Unit Separator (US) is U+001F or U+241F Modifiers: * Escape (ESC) is U+001B or U+241B * End of Transmission (EOT) is U+0004 or U+2404 Henderson Expires 17 September 2024 [Page 3] Internet-Draft Unicode Separated Values (USV) March 2024 Spacers: * Carriage Return (CR) is U+000D * Line Feed (LF) is U+000A 3. Definition of the USV Format 3.1. Data Data comprises units, records, groups, and files. 3.2. Unit A unit comprises content characters. It runs until a Unit Separator (US): Example unit and unit separator: file "unit-and-unit-separator.usv" aaa␟ 3.3. Record A record comprises units. It runs until a Record Separator (RS): Example record and record separator: file "record-and-record-separator.usv" aaa␟bbb␟␞ 3.4. Group A group comprises records. It runs until a Group Separator (GS): Example group and group separator: file "group-and-group-separator.usv" aaa␟bbb␟␞ccc␟ddd␟␞␝ 3.5. File A file comprises groups. It runs until a file separator. 
Example file and file separator: Henderson Expires 17 September 2024 [Page 4] Internet-Draft Unicode Separated Values (USV) March 2024 file "file-and-file-separator.usv" aaa␟bbb␟␞ccc␟ddd␟␞␝eee␟fff␟␞ggg␟hhh␟␞␝␜ 3.6. Header There may be an optional header appearing as the first item and with the same format as normal items. This header will contain names corresponding to the fields in the data, and should contain the same number of fields as the rest of the data. The presence or absence of the header line should be indicated via the optional "header" parameter of this media type. For example: file "header.usv" name␟name␟␞aaa␟bbb␟␞ 3.7. Escape (ESC) Escape (ESC) makes the next character content. Example: USV with a unit that contains an Escape + End of Transmission; because of the Escape, the End of Transmission is treated as content: file "header.usv" a␛␄b␟ 3.8. End of Transmission (EOT) End of Transmission (EOT) tells any reader that it can stop reading. This can be useful for streaming data, such as to end a connection. This can also be useful for providing data files that contain USV data, then EOT, then additional non-USV information such as comments, images, attachments, etc. * EOT tells the data reader that it can stop. * EOT has no effect on the output content. Example of a unit then an End of Transmission: file "header.usv" abc␞␄ignorable Henderson Expires 17 September 2024 [Page 5] Internet-Draft Unicode Separated Values (USV) March 2024 4. ABNF grammar 4.1. Semantics usv = *files file = *groups group = *records record = *units unit = *content-characters 4.2. Syntax usv = ( header-and-body / body ) '*' ; anything after the body is chaff header-and-body = 1*unit-run / 1*record-run / 1*group-run / 1*file-run body = *unit-run / *record-run / *group-run / *file-run 4.3. 
Runs file-run = *( *spacer-character file *spacer-character FS ) group-run = *( *spacer-character group *spacer-character GS ) record-run = *( *spacer-character record *spacer-character RS ) unit-run = *( *spacer-character unit *spacer-character US ) 4.4. Character classes content-character = typical-character / ESC '*' typical-character = '*' - special-character special-character = US / RS / GS / FS / ESC / EOT spacer-character = CR / LF 4.5. Unicode symbols FS = U+001C File Separator / U+241C Symbol for File Separator Henderson Expires 17 September 2024 [Page 6] Internet-Draft Unicode Separated Values (USV) March 2024 GS = U+001D Group Separator / U+241D Symbol for Group Separator RS = U+001E Record Separator / U+241E Symbol for Record Separator US = U+001F Unit Separator / U+241F Symbol for Unit Separator ESC = U+001B Escape / U+241B Symbol for Escape EOT = U+0004 End of Transmission / U+2404 Symbol for End of Transmission CR = U+000D Carriage Return LF = U+000A Line Feed 5. Examples 5.1. Hello World This kind of data ... file "hello-world.txt" hello, world ... is represented in USV as two units: file "hello-world.usv" hello␟world␟ If you prefer to see one unit per line, then you can add carriage returns and/or newlines: file "hello-world-with-lines.usv" hello␟ world␟ 5.2. Hello World Goodnight Moon This kind of data ... file "hello-world-goodnight-moon.txt" [ hello, world ], [ goodnight, moon ] ... is represented in USV as two records, each with two units: Henderson Expires 17 September 2024 [Page 7] Internet-Draft Unicode Separated Values (USV) March 2024 file "hello-world-goodnight-moon.usv" hello␟world␟␞goodnight␟moon␟␞ If you prefer to see one record per line, then you can add carriage returns and/or newlines: file "hello-world-goodnight-moon-with-lines.usv" hello␟world␟␞ goodnight␟moon␟␞ 5.3. 
Units, Records, Groups, Files USV with 2 units by 2 records by 2 groups by 2 files: file "units-records-groups-files.usv" a␟b␟␞c␟d␟␞␝e␟f␟␞g␟h␟␞␝␜i␟j␟␞k␟l␟␞␝m␟n␟␞o␟p␟␞␝␜ If you prefer to see one record per line, then you can add carriage returns and/or newlines: file "units-records-groups-files-with-lines.usv" a␟b␟␞ c␟d␟␞ ␝ e␟f␟␞ g␟h␟␞ ␝ ␜ i␟j␟␞ k␟l␟␞ ␝ m␟n␟␞ o␟p␟␞ ␝ ␜ If you prefer to see one unit per line, then you can add carriage returns and/or newlines: Henderson Expires 17 September 2024 [Page 8] Internet-Draft Unicode Separated Values (USV) March 2024 file "units-records-groups-files-with-lines.usv" a␟ b␟ ␞ c␟ d␟ ␞ ␝ e␟ f␟ ␞ g␟ h␟ ␞ ␝ ␜ i␟ j␟ ␞ k␟ l␟ ␞ ␝ m␟ n␟ ␞ o␟ p␟ ␞ ␝ ␜ 5.4. Articles USV can format paragraphs, such as in this example data stream of articles; note the units contain leading spacers and trailing spacers. Henderson Expires 17 September 2024 [Page 9] Internet-Draft Unicode Separated Values (USV) March 2024 file "articles.usv" Title One ␟ Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip. ␟␞ Title Two ␟ Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. ␟␞ Title Three ␟ Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. ␟␞ 6. Source Code Examples These source code examples demonstrate the Rust programming language and the USV Rust crate. 
Units: file "usv-rust-crate-units.rs" use usv::*; let str = "a␟b␟"; let units: Units = str.units().collect(); Records: file "usv-rust-crate-records.rs" use usv::*; let str = "a␟b␟␞c␟d␟␞"; let records: Records = str.records().collect(); Groups: Henderson Expires 17 September 2024 [Page 10] Internet-Draft Unicode Separated Values (USV) March 2024 file "usv-rust-crate-groups.rs" use usv::*; let str = "a␟b␟␞c␟d␟␞␝e␟f␟␞g␟h␟␞␝"; let groups: Groups = str.groups().collect(); Files: file "usv-rust-crate-groups.rs" use usv::*; let str = "a␟b␟␞c␟d␟␞␝e␟f␟␞g␟h␟␞␝␜i␟j␟␞k␟l␟␞␝m␟n␟␞o␟p␟␞␝␜"; let files: Files = str.files().collect(); 7. MIME media type registration for text/usv This section provides the MIME media type registration application information. To: ietf-types@iana.org Subject: Registration of MIME media type text/usv MIME media type name: text MIME subtype name: usv Required parameters: none 7.1. Optional parameters: charset, header Common usage of USV is UTF-8, but other character sets defined by IANA for the "text" tree may be used in conjunction with the "charset" parameter. The "header" parameter indicates the presence or absence of the header line. Valid values are "present" or "absent". Implementors choosing not to use this parameter must make their own decisions as to whether the header line is present or absent. 7.2. Encoding considerations This media type uses LF to denote line breaks. However, implementors should be aware that some implementations may not conform i.e. may incorrectly use other values. Henderson Expires 17 September 2024 [Page 11] Internet-Draft Unicode Separated Values (USV) March 2024 7.3. Security considerations USV files contain passive text data that should not pose any risks. However, it is possible in theory that malicious binary data may be included in order to exploit potential buffer overruns in the program processing USV data. Additionally, private data may be shared via this format (which of course applies to any text data). 7.4. 
Interoperability considerations Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing USV data. Implementations deciding not to use the optional "header" parameter must make their own decision as to whether the header is absent or present. 7.5. Published specification https://github.com/sixarm/usv 7.6. Applications that use this media type Spreadsheet programs, such as with import/export. Database programs, such as with loading/saving text. Data conversion utilities. 7.7. Additional information Magic number(s): none File extension(s): usv Apple macOS File Type Code(s): TEXT Intended usage: COMMON Author/Change controller: IESG Contact: Joel Parker Henderson 8. IANA Considerations We are requesting IANA to create a standard MIME media type "text/usv". We have filed an IANA request for this, with the same contact information. Henderson Expires 17 September 2024 [Page 12] Internet-Draft Unicode Separated Values (USV) March 2024 9. Security Considerations This document should not affect the security of the Internet. 10. Converters We implement converters to/from USV and various popular data formats, including ASCII Separated Values (ASV), Comma Separated Values (CSV), JavaScript Object Notation (JSON), and Microsoft Excel XML (XLSX). * asv-to-usv[asv-to-usv-rust-crate], usv-to-asv[usv-to-asv-rust-crate] * csv-to-usv[csv-to-usv-rust-crate], usv-to-csv[usv-to-csv-rust-crate] * json-to-usv[json-to-usv-rust-crate], usv-to-json[usv-to-json-rust-crate] * xlsx-to-usv[xlsx-to-usv-rust-crate], usv-to-xlsx[usv-to-xlsx-rust-crate] The converters are provided for informational purposes. The converters are not part of the specification. 11. References 11.1. Normative References [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, . [RFC5234] Crocker, D., Ed. and P. 
Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, DOI 10.17487/RFC5234, January 2008, . [RFC6838] Freed, N., Klensin, J., and T. Hansen, "Media Type Specifications and Registration Procedures", BCP 13, RFC 6838, DOI 10.17487/RFC6838, January 2013, . [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", RFC 2046, DOI 10.17487/RFC2046, November 1996, . Henderson Expires 17 September 2024 [Page 13] Internet-Draft Unicode Separated Values (USV) March 2024 [RFC4289] Freed, N. and J. Klensin, "Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures", BCP 13, RFC 4289, DOI 10.17487/RFC4289, December 2005, . 11.2. Informative References [usv-git-repository] Henderson, J., "USV git repository at https://github.com/sixarm/usv", 2022. [usv-rust-crate] Henderson, J., "USV rust crate at https://crates.io/crates/usv", 2024. [asv-to-usv-rust-crate] Henderson, J., "ASV to USV rust crate at https://crates.io/crates/asv-to-usv", 2024. [usv-to-asv-rust-crate] Henderson, J., "USV to ASV rust crate at https://crates.io/crates/usv-to-asv", 2024. [csv-to-usv-rust-crate] Henderson, J., "CSV to USV rust crate at https://crates.io/crates/csv-to-usv", 2024. [usv-to-csv-rust-crate] Henderson, J., "USV to CSV rust crate at https://crates.io/crates/usv-to-csv", 2024. [json-to-usv-rust-crate] Henderson, J., "JSON to USV rust crate at https://crates.io/crates/json-to-usv", 2024. [usv-to-json-rust-crate] Henderson, J., "USV to JSON rust crate at https://crates.io/crates/usv-to-json", 2024. [xlsx-to-usv-rust-crate] Henderson, J., "XLSX to USV rust crate at https://crates.io/crates/xlsx-to-usv", 2024. [usv-to-xlsx-rust-crate] Henderson, J., "USV to XLSX rust crate at https://crates.io/crates/usv-to-xlsx", 2024. 
Henderson Expires 17 September 2024 [Page 14] Internet-Draft Unicode Separated Values (USV) March 2024 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . Acknowledgements The author would like to thank Y. Shafranovich, author of the CSV RFC, which provided guidance for this USV RFC. A special thank you goes to P.X.V. Contributors Thanks to all of the contributors. Joel Parker Henderson Email: joel@joelparkerhenderson.com Author's Address Joel Parker Henderson (editor) 601 Van Ness Ave #E3-359 San Francisco, CA 94102 United States of America Phone: 1-415-317-2700 Email: joel@joelparkerhenderson.com URI: https://linkedin.com/in/joelparkerhenderson Henderson Expires 17 September 2024 [Page 15] ================================================ FILE: doc/rfc/draft-unicode-separated-values-01.xml ================================================ ]> Unicode Separated Values (USV)
601 Van Ness Ave #E3-359 San Francisco CA 94102 US 1-415-317-2700 joel@joelparkerhenderson.com https://linkedin.com/in/joelparkerhenderson
General Internet Engineering Task Force usv data format markup Unicode Separated Values (USV) is a data format that uses Unicode characters to mark parts. USV builds on ASCII separated values (ASV), and provides pragmatic ways to edit data in text editors by using visual symbols and layouts.
Introduction Unicode Separated Values (USV) is a data format useful for exchanging and converting data between various spreadsheet programs, databases, and streaming data services. This RFC explains USV. Additionally, we propose a new media type "text/usv", to be registered with IANA. We provide information references for a USV git repository , a programming implementation as a USV Rust crate , and converter tools.
Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.
Media Type Language The media type normative references are RFC 6838 , RFC 2046 , and RFC 4289 .
ABNF Language The ABNF normative reference is RFC 5234 .
USV characters Separators:
  • File Separator (FS) is U+001C or U+241C
  • Group Separator (GS) is U+001D or U+241D
  • Record Separator (RS) is U+001E or U+241E
  • Unit Separator (US) is U+001F or U+241F
Modifiers:
  • Escape (ESC) is U+001B or U+241B
  • End of Transmission (EOT) is U+0004 or U+2404
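Each mark listed above pairs a control character with a visible symbol from the Unicode Control Pictures block, and the symbol's code point is the control's code point plus 0x2400 (for example U+001F and U+241F). The following Python sketch switches a string between the two forms. The names `MARKS`, `to_symbols`, and `to_controls` are hypothetical and are not part of the USV specification or of any USV library.

```python
# Hypothetical helper names; a sketch, not the USV reference implementation.
# Code points of the six USV marks in their control-character form.
MARKS = {"EOT": 0x04, "ESC": 0x1B, "FS": 0x1C, "GS": 0x1D, "RS": 0x1E, "US": 0x1F}

# Each control picture sits 0x2400 above its control character.
OFFSET = 0x2400

TO_SYMBOL = {code: code + OFFSET for code in MARKS.values()}
TO_CONTROL = {symbol: code for code, symbol in TO_SYMBOL.items()}


def to_symbols(text):
    """Replace USV control characters with their visible symbol forms."""
    return text.translate(TO_SYMBOL)


def to_controls(text):
    """Replace USV visible symbols with their control-character forms."""
    return text.translate(TO_CONTROL)


# Two units and a record separator, first as controls, then as symbols.
assert to_symbols("a\u001Fb\u001F\u001E") == "a\u241Fb\u241F\u241E"
```

Because both forms of a mark mean the same thing to a USV reader, the conversion changes how the data looks in an editor but not how it parses.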
Definition of the USV Format
Data Data comprises units, records, groups, and files.
Unit A unit comprises content characters. It runs until a Unit Separator (US): Example unit and unit separator:
Record A record comprises units. It runs until a Record Separator (RS): Example record and record separator:
Group A group comprises records. It runs until a Group Separator (GS): Example group and group separator:
File A file comprises groups. It runs until a File Separator (FS): Example file and file separator:
Header There may be an optional header appearing as the first item and with the same format as normal items. This header will contain names corresponding to the fields in the data, and should contain the same number of fields as the rest of the data. The presence or absence of the header line should be indicated via the optional "header" parameter of this media type. For example:
Escape (ESC) Escape (ESC) makes the next character content. Example: USV with a unit that contains an Escape + End of Transmission; because of the Escape, the End of Transmission is treated as content:
End of Transmission (EOT) End of Transmission (EOT) tells any reader that it can stop reading. This can be useful for streaming data, such as to end a connection. This can also be useful for providing data files that contain USV data, then EOT, then additional non-USV information such as comments, images, attachments, etc.
  • EOT tells the data reader that it can stop.
  • EOT has no effect on the output content.
Example of a unit then an End of Transmission:
ABNF grammar
Semantics usv = *files file = *groups group = *records record = *units unit = *content-characters
Syntax usv = ( header-and-body / body ) '*' ; anything after the body is chaff header-and-body = 1*unit-run / 1*record-run / 1*group-run / 1*file-run body = *unit-run / *record-run / *group-run / *file-run
Runs file-run = *( *spacer-character file *spacer-character FS ) group-run = *( *spacer-character group *spacer-character GS ) record-run = *( *spacer-character record *spacer-character RS ) unit-run = *( *spacer-character unit *spacer-character US )
Character classes content-character = typical-character / ESC '*' typical-character = '*' - special-character special-character = US / RS / GS / FS / ESC / EOT spacer-character = Defined by Unicode Derived Core Property White_Space
Unicode symbols FS = U+001C File Separator / U+241C Symbol for File Separator GS = U+001D Group Separator / U+241D Symbol for Group Separator RS = U+001E Record Separator / U+241E Symbol for Record Separator US = U+001F Unit Separator / U+241F Symbol for Unit Separator ESC = U+001B Escape / U+241B Symbol for Escape EOT = U+0004 End of Transmission / U+2404 Symbol for End of Transmission
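The grammar above can be read as a single pass over the characters. The Python sketch below is one hedged interpretation, not the reference implementation. The function name `parse_usv` and the nested-list result are assumptions. It treats only CR and LF as spacers, which is narrower than the White_Space property, and it drops content that a higher-level separator interrupts before the content's own separator appears.

```python
# Hypothetical sketch of a USV reader; names and result shape are assumptions.
US = {"\u001F", "\u241F"}
RS = {"\u001E", "\u241E"}
GS = {"\u001D", "\u241D"}
FS = {"\u001C", "\u241C"}
ESC = {"\u001B", "\u241B"}
EOT = {"\u0004", "\u2404"}
SPACERS = {"\r", "\n"}  # assumption: CR and LF only


def parse_usv(text):
    """Return files, each a list of groups of records of unit strings."""
    files, groups, records, units = [], [], [], []
    content = []  # characters of the unit being read
    end = 0       # length of content up to the last non-spacer character
    chars = iter(text)
    for ch in chars:
        if ch in ESC:
            # ESC makes the next character content, whatever it is.
            escaped = next(chars, None)
            if escaped is not None:
                content.append(escaped)
                end = len(content)
        elif ch in EOT:
            break  # the reader may stop; anything after is ignored
        elif ch in US:
            units.append("".join(content[:end]))  # trailing spacers cut
            content, end = [], 0
        elif ch in RS:
            records.append(units)
            units, content, end = [], [], 0
        elif ch in GS:
            groups.append(records)
            records, units, content, end = [], [], [], 0
        elif ch in FS:
            files.append(groups)
            groups, records, units, content, end = [], [], [], [], 0
        elif ch in SPACERS:
            if content:  # leading spacers are skipped
                content.append(ch)
        else:
            content.append(ch)
            end = len(content)
    # A body with fewer dimensions is wrapped so the shape stays the same.
    if units:
        records.append(units)
    if records:
        groups.append(records)
    if groups:
        files.append(groups)
    return files
```

A body with fewer dimensions still comes back nested four deep, so `parse_usv("hello␟world␟")` yields one file holding one group holding one record of two units.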
Examples
Hello World This kind of data ... ... is represented in USV as two units: If you prefer to see one unit per line, then you can add whitespace, such as newlines:
Hello World Goodnight Moon This kind of data ... ... is represented in USV as two records, each with two units: If you prefer to see one record per line, then you can add whitespace, such as newlines:
Units, Records, Groups, Files USV with 2 units by 2 records by 2 groups by 2 files: If you prefer to see one record per line, then you can add whitespace, such as newlines: If you prefer to see one unit per line, then you can add whitespace, such as newlines:
Articles USV can format paragraphs, such as in this example data stream of articles; note the units contain leading spacers and trailing spacers.
Source Code Examples These source code examples demonstrate the Rust programming language and the USV Rust crate. Units: Records: Groups: Files:
MIME media type registration for text/usv This section provides the MIME media type registration application information. To: ietf-types@iana.org Subject: Registration of MIME media type text/usv MIME media type name: text MIME subtype name: usv Required parameters: none
Optional parameters: charset, header Common usage of USV is UTF-8, but other character sets defined by IANA for the "text" tree may be used in conjunction with the "charset" parameter. The "header" parameter indicates the presence or absence of the header line. Valid values are "present" or "absent". Implementors choosing not to use this parameter must make their own decisions as to whether the header line is present or absent.
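A reader might inspect these parameters before parsing. The sketch below uses Python's standard `email.message.Message` class to read a Content-Type value. The function name `usv_parameters` is hypothetical, the UTF-8 fallback is an assumption based on the common usage described above, and text/usv is the media type this draft proposes rather than a registered one.

```python
from email.message import Message


def usv_parameters(content_type):
    """Return (media type, charset, header) from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()
    # Assumption: fall back to UTF-8 when no charset parameter is given.
    charset = msg.get_content_charset() or "utf-8"
    # None means the parameter is missing and the receiver must decide.
    header = msg.get_param("header")
    return media_type, charset, header
```

For example, `usv_parameters("text/usv; charset=UTF-8; header=present")` gives the media type, the lowercased charset, and the header value, while a bare `text/usv` leaves the header decision to the caller.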
Encoding considerations This media type uses LF to denote line breaks. However, implementors should be aware that some implementations may not conform i.e. may incorrectly use other values.
Security considerations USV files contain passive text data that should not pose any risks. However, it is possible in theory that malicious binary data may be included in order to exploit potential buffer overruns in the program processing USV data. Additionally, private data may be shared via this format (which of course applies to any text data).
Interoperability considerations Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing USV data. Implementations deciding not to use the optional "header" parameter must make their own decision as to whether the header is absent or present.
Published specification https://github.com/sixarm/usv
Applications that use this media type Spreadsheet programs, such as with import/export. Database programs, such as with loading/saving text. Data conversion utilities.
Additional information Magic number(s): none File extension(s): usv Apple macOS File Type Code(s): TEXT Intended usage: COMMON Author/Change controller: IESG Contact: Joel Parker Henderson <joel@joelparkerhenderson.com>
IANA Considerations We are requesting IANA to create a standard MIME media type "text/usv". We have filed an IANA request for this, with same contact information.
Security Considerations This document should not affect the security of the Internet.
Converters

We implement converters to/from USV and various popular data formats, including ASCII Separated Values (ASV), Comma Separated Values (CSV), JavaScript Object Notation (JSON), and Microsoft Excel XML (XLSX).
  • asv-to-usv, usv-to-asv
  • csv-to-usv, usv-to-csv
  • json-to-usv, usv-to-json
  • xlsx-to-usv, usv-to-xlsx
The converters are provided for informational purposes. The converters are not part of the specification.
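To make the idea of a converter concrete, here is a minimal Python sketch of a USV-to-CSV conversion for units-by-records data. It is not the implementation of the `usv-to-csv` crate or of the scripts in `bin/`; the function name `usv_to_csv` is hypothetical. The sketch assumes the visible separator symbols, trims whitespace around units, quotes every CSV field, and ignores escapes, groups, files, and End of Transmission.

```python
import csv
import io

US, RS = "\u241F", "\u241E"  # ␟ unit separator, ␞ record separator


def usv_to_csv(text):
    """Convert units-by-records USV text into CSV text (all fields quoted)."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for chunk in text.split(RS):
        # Each unit ends with US, so the piece after the last US holds
        # only spacers (or nothing) and is dropped.
        units = [piece.strip() for piece in chunk.split(US)][:-1]
        if units:
            writer.writerow(units)
    return out.getvalue()


if __name__ == "__main__":
    usv = 'I say "hello, world"\u241F\u241E\nYou say "goodnight, moon"\u241F\u241E\n'
    print(usv_to_csv(usv), end="")
```

The `csv` module doubles embedded quotation marks, which is the main place where CSV needs escaping that USV does not.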
References

Normative References

Informative References

USV git repository at https://github.com/sixarm/usv
USV rust crate at https://crates.io/crates/usv
ASV to USV rust crate at https://crates.io/crates/asv-to-usv
USV to ASV rust crate at https://crates.io/crates/usv-to-asv
CSV to USV rust crate at https://crates.io/crates/csv-to-usv
USV to CSV rust crate at https://crates.io/crates/usv-to-csv
JSON to USV rust crate at https://crates.io/crates/json-to-usv
USV to JSON rust crate at https://crates.io/crates/usv-to-json
XLSX to USV rust crate at https://crates.io/crates/xlsx-to-usv
USV to XLSX rust crate at https://crates.io/crates/usv-to-xlsx

Key words for use in RFCs to Indicate Requirement Levels

In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.
Acknowledgements

The author would like to thank Y. Shafranovich, author of the CSV RFC, which provided guidance for this USV RFC. A special thank you goes to P.X.V.

Contributors

Thanks to all of the contributors.
joel@joelparkerhenderson.com
================================================
FILE: doc/rfc/index.md
================================================
# Request For Comments (RFC)

USV is aiming to be an international standard with the IETF and IANA.

Work in progress:

* [https://datatracker.ietf.org/doc/draft-unicode-separated-values/01/](https://datatracker.ietf.org/doc/draft-unicode-separated-values/01/)

Files:

* [draft-unicode-separated-values-01.xml](draft-unicode-separated-values-01.xml) - this is the official IETF RFCXML.
* [draft-unicode-separated-values-01.pdf](draft-unicode-separated-values-01.pdf) - autogenerated from IETF RFCXML.
* [draft-unicode-separated-values-01.txt](draft-unicode-separated-values-01.txt) - autogenerated from IETF RFCXML.

================================================
FILE: doc/spacers/index.md
================================================
# Spacers

Spacers are characters that have the Unicode character property White_Space.

Examples:

* U+0020 Space (SP)
* U+0009 Tab (TAB) aka Horizontal Tab (HT)
* U+000A Line Feed (LF) aka New Line (NL) aka End Of Line (EOL)
* U+000D Carriage Return (CR)

USV supports spacers around content and markers, because this greatly helps typical display uses.
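
To show what tolerating spacers can look like in code, here is a minimal Python sketch that reads units-by-records USV and trims whitespace around each unit. The function name `parse_records` is hypothetical and this is not the USV crate's parser. It assumes the visible separator symbols, uses `str.strip()` as an approximation of the White_Space property, drops records that contain no units, and does not handle escapes, groups, files, or End of Transmission.

```python
US, RS = "\u241F", "\u241E"  # ␟ unit separator, ␞ record separator


def parse_records(text):
    """Parse USV text into a list of records, each a list of trimmed units."""
    records = []
    for chunk in text.split(RS):
        # Each unit ends with US, so the piece after the last US holds
        # only spacers (or nothing) and is dropped.
        units = [piece.strip() for piece in chunk.split(US)][:-1]
        if units:
            records.append(units)
    return records


if __name__ == "__main__":
    compact = "a\u241Fb\u241F\u241Ec\u241Fd\u241F\u241E"
    spaced = "a\n\u241F\nb\n\u241F\n\u241E\nc\n\u241F\nd\n\u241F\n\u241E\n"
    # Both layouts describe the same data.
    print(parse_records(compact) == parse_records(spaced))
```

Because trimming happens per unit, the same function accepts the record-per-line, unit-per-line, and column-aligned layouts.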

## Line Feed character

USV with no spacers looks like this:

```usv
a␟b␟␞c␟d␟␞
```

If you want to see each record on its own line, then you can use newline characters:

```usv
a␟b␟␞
c␟d␟␞
```

If you want to see each unit on its own line, then you can use newline characters:

```usv
a␟
b␟
␞
c␟
d␟
␞
```

If you want to see each token on its own line, then you can use newline characters:

```usv
a
␟
b
␟
␞
c
␟
d
␟
␞
```

## Space character

USV with no spacers looks like this:

```usv
a␟bbb␟ccccc␟
```

If you want to see a column with left alignment, then you can use newline characters and space characters:

```usv
a    ␟
bbb  ␟
ccccc␟
```

If you want to see a column with right alignment, then you can use newline characters and space characters:

```usv
    a␟
  bbb␟
ccccc␟
```

If you want to see a column with center alignment, then you can use newline characters and space characters:

```usv
  a  ␟
 bbb ␟
ccccc␟
```

================================================
FILE: doc/styles/index.md
================================================
# Styles

USV styles customize various kinds of output so it looks the way you prefer.

* Symbols: characters are visible symbols, such as "␟" for Unit Separator.
* Controls: characters are invisible controls, such as "\u001F" for Unit Separator.
* Braces: instead of characters, use pretty-print braces, such as "{US}" for Unit Separator.

================================================
FILE: doc/todo/index.md
================================================
# TODO list

We welcome help with this todo list.

## Add formats

Add USV formats to productivity applications:

* [ ] LibreOffice Calc
* [ ] Microsoft Excel
* [ ] Google Sheets
* Etc.

## Create libraries

Create USV libraries for programming languages:

* [x] Rust crate
* [ ] Python pip package
* [ ] Node npm package
* [ ] Ruby gem
* Etc.

## Add handling

Add USV handling to statistics systems:

* [ ] R
* [ ] Julia
* [ ] MATLAB
* [ ] Mathematica
* [ ] Python fsspec
* [ ] Python Pandas
* [ ] Python Polars
* [ ] Python Dask
* Etc.

## Extend CLI tools

Extend USV capabilities for command line interface tools:

* [ ] Miller
* [ ] TextQL
* [ ] Q
* [ ] jq
* [ ] xsv by BurntSushi
* Etc.

## Add comparisons

Add comparisons to other data formats:

* [ ] [Why isn’t there a decent file format for tabular data?](https://news.ycombinator.com/item?id=31220841)
* [ ] [Whitespace Separated Values (WSV)](https://dev.stenway.com/WSV/)
* [ ] [SimpleML](https://dev.stenway.com/SML/SimpleML.html)
* [ ] [KYLI](https://shkspr.mobi/blog/2017/03/kyli-because-it-is-superior-to-json/)
* [ ] [Rows of String Values (RSV)](https://github.com/Stenway/RSV-Specification)

## Improve converters

Improve converters: csv-to-usv and usv-to-csv

* [ ] Add support for CSV delimiters, especially semicolon instead of comma.
* [ ] Add CLAP option for USV output with RS+ESC+LF.

================================================
FILE: examples/blog-posts.csv
================================================
"Title One","Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
"Title Two","Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
"Title Three","Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo."

================================================ FILE: examples/blog-posts.usv ================================================ Title One ␟ Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ␟␞ Title Two ␟ Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. ␟␞ Title Three ␟ Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. ␟␞ ================================================ FILE: examples/end-of-transmission.usv ================================================ a␟b␟c␟␄ The End of Transmission (EOT) stops parsing. For example, this text comes after the EOT character. 

================================================
FILE: examples/hello-goodnight.csv
================================================
"I say ""hello, world"""
"You say ""goodnight, moon"""

================================================
FILE: examples/hello-goodnight.usv
================================================
I say "hello, world"␟␞
You say "goodnight, moon"␟␞

================================================
FILE: examples/stream.usv
================================================
a␟b␟␞c␟d␟␞␝e␟f␟␞g␟h␟␞␝␜i␟j␟␞k␟l␟␞␝m␟n␟␞o␟p␟␞␝␜

================================================
FILE: examples/zen-koans.csv
================================================
"Truth Koan","A monk asked, ""Without words or silence, will you tell me the truth?"""
"Lotus Koan","A child asked, ""Before the lotus blossom emerges, what is it?"""
"World Koan","A student asked, ""How does an enlightened one return to the world?"""

================================================
FILE: examples/zen-koans.usv
================================================
Truth Koan␟A monk asked, "Without words or silence, will you tell me the truth?"␟␞
Lotus Koan␟A child asked, "Before the lotus blossom emerges, what is it?"␟␞
World Koan␟A student asked, "How does an enlightened one return to the world?"␟␞

================================================
FILE: tests/1-dimensional-as-line/expect.json
================================================
["a","b"]

================================================
FILE: tests/1-dimensional-as-line/input.usv
================================================
a␟b␟

================================================
FILE: tests/1-dimensional-as-lines/expect.json
================================================
["a","b"]

================================================
FILE: tests/1-dimensional-as-lines/input.usv
================================================
a
␟
b
␟

================================================
FILE: tests/2-dimensional-as-line/expect.json
================================================
[["a","b"],["c","d"]]

================================================
FILE: tests/2-dimensional-as-line/input.usv
================================================
a␟b␟␞c␟d␟␞

================================================
FILE: tests/2-dimensional-as-lines/expect.json
================================================
[["a","b"],["c","d"]]

================================================
FILE: tests/2-dimensional-as-lines/input.usv
================================================
a
␟
b
␟
␞
c
␟
d
␟
␞

================================================
FILE: tests/3-dimensional-as-line/expect.json
================================================
[[["a","b"],["c","d"]],[["e","f"],["g","h"]]]

================================================
FILE: tests/3-dimensional-as-line/input.usv
================================================
a␟b␟␞c␟d␟␞␝e␟f␟␞g␟h␟␞␝

================================================
FILE: tests/3-dimensional-as-lines/expect.json
================================================
[[["a","b"],["c","d"]],[["e","f"],["g","h"]]]

================================================
FILE: tests/3-dimensional-as-lines/input.usv
================================================
a
␟
b
␟
␞
c
␟
d
␟
␞
␝
e
␟
f
␟
␞
g
␟
h
␟
␞
␝

================================================
FILE: tests/4-dimensional-as-line/expect.json
================================================
[[[["a","b"],["c","d"]],[["e","f"],["g","h"]]],[[["i","j"],["k","l"]],[["m","n"],["o","p"]]]]

================================================
FILE: tests/4-dimensional-as-line/input.usv
================================================
a␟b␟␞c␟d␟␞␝e␟f␟␞g␟h␟␞␝␜i␟j␟␞k␟l␟␞␝m␟n␟␞o␟p␟␞␝␜

================================================
FILE: tests/4-dimensional-as-lines/expect.json
================================================
[[[["a","b"],["c","d"]],[["e","f"],["g","h"]]],[[["i","j"],["k","l"]],[["m","n"],["o","p"]]]]

================================================ FILE: tests/4-dimensional-as-lines/input.usv ================================================ a ␟ b ␟ ␞ c ␟ d ␟ ␞ ␝ e ␟ f ␟ ␞ g ␟ h ␟ ␞ ␝ ␜ i ␟ j ␟ ␞ k ␟ l ␟ ␞ ␝ m ␟ n ␟ ␞ o ␟ p ␟ ␞ ␝ ␜ ================================================ FILE: tests/blog-posts/output-actual.txt ================================================ Title One unit separator Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. record separator Title Two unit separator Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. record separator Title Three unit separator Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. ================================================ FILE: tests/blog-posts/output-expect.txt ================================================ Title One unit separator Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. record separator Title Two unit separator Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 
record separator Title Three unit separator Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. ================================================ FILE: tests/blog-posts/test.sh ================================================ #!/bin/sh set -euf top="$(git rev-parse --show-toplevel)" cat "$top/examples/blog-posts.usv" | "$top/bin/usv-to-debug.bash" > output-actual.txt diff output-actual.txt output-expect.txt ================================================ FILE: tests/end-of-transmission-block/output-actual.txt ================================================ a unit separator b unit separator c End of Transmission ================================================ FILE: tests/end-of-transmission-block/output-expect.txt ================================================ a unit separator b unit separator c End of Transmission ================================================ FILE: tests/end-of-transmission-block/test.sh ================================================ #!/bin/sh set -euf top="$(git rev-parse --show-toplevel)" cat "$top/examples/end-of-transmission.usv" | "$top/bin/usv-to-debug.bash" > output-actual.txt diff output-actual.txt output-expect.txt ================================================ FILE: tests/microsoft-excel/example1.xls ================================================ 3#0000004#0000ee5#0066006#3333337#8080808#9966009#c0c0c010#cc000011#ccffcc12#dddddd13#ffcccc14#ffffcc15#ffffff90001386024075FalseFalse