Repository: stevengj/subsuper-proposal
Branch: master
Commit: 6e18e70fe350
Files: 5
Total size: 23.4 KB

Directory structure:
gitextract_7j_xm0le/
├── .gitignore
├── .travis.yml
├── Makefile
├── README.md
└── subsuper-proposal.tex

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
output
subsuper-proposal.aux
subsuper-proposal.log
subsuper-proposal.out
subsuper-proposal.pdf

================================================
FILE: .travis.yml
================================================
email: false
os: linux
sudo: false
addons:
  apt:
    packages:
      - texlive-full
script: make travis

================================================
FILE: Makefile
================================================
MAIN=subsuper-proposal

pdf:
	latexmk -pdf $(MAIN) -auxdir=output -outdir=output

travis:
	#Old latexmk doesn't understand auxdir and outdir options
	latexmk -pdf -pdflatex='pdflatex %S %O -interaction=nonstopmode -halt-on-error' $(MAIN)

arxiv: pdf
	mkdir -p arxiv
	cp -ivR figures/*/* output/*.bbl *.tex arxiv
	echo Fix paths to images before continuing
	#Test build
	cd arxiv && latexmk -pdf $(MAIN) -auxdir=crap -outdir=crap && rm -rf crap
	cd arxiv && zip arxiv.zip *

clean:
	rm -rvf *.bbl *.blg *.aux *.fls *.fdb_latexmk *.log *.out *.toc $(MAIN).pdf aux output arxiv

================================================
FILE: README.md
================================================
# Unicode Proposal to Encode Subscripts/Superscripts for Mathematical Programming

This is a draft proposal to encode additional subscript and superscript characters in future versions of Unicode, motivated mainly by computer programming for technical applications.

Many programming languages (C#, Go, Java, Julia, Python, Rust, Swift, etc.)
now allow identifiers to use a wide array of Unicode characters, which enables programs in mathematical fields to use notations that closely hew to the original mathematical notations. For example, one can now have variable names like `x̂` or `αₓ`.

Unfortunately, this opportunity is hampered by a lack of subscript and superscript characters in Unicode 9. (Most famously, there are superscript characters for every lowercase Latin letter *except q*.)

We believe that, given the recent explosion of Unicode support in programming languages (and editors), the time is ripe to propose better support for mathematical subscripts/superscripts in Unicode.

Our proposal is to add [combining characters](https://en.wikipedia.org/wiki/Combining_character), *mathematical subscript* and *mathematical superscript*, that indicate that the previous glyph (a character + combining characters) should be rendered in sub/superscript form. Fonts can then implement these characters as "ligatures" for selected characters, so that, e.g., an "A" followed by "mathematical subscript" would render as a subscript "A".

## How you can help

First, we welcome comments on this proposal (in the form of issues or pull requests): corrections, references to notable sources/applications, and technical comments.

Second, we suspect that it will be easier to get the Unicode Consortium to take us seriously if prominent authors, organizations, free/open-source projects, and companies can sign on to this proposal (as co-authors or endorsers).

Third, it will greatly strengthen the proposal if we can commit to providing draft fonts or (for new combining characters) the necessary improvements to text-rendering software, so experts in those areas are welcome.
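
## Example: combining marks in identifiers

As a rough illustration of why a combining-character encoding fits existing identifier rules, the sketch below uses Python's standard `unicodedata` module to inspect the two names mentioned above. It is only a stand-in: the proposed *mathematical subscript* and *mathematical superscript* characters are not encoded, so the existing combining circumflex (U+0302) plays the role of a combining mark here, and the helper name `describe` is hypothetical. The categories reported come from the Unicode data shipped with your Python version.

```python
import unicodedata

# The two example names from the text above:
# "x" + COMBINING CIRCUMFLEX ACCENT, and alpha + LATIN SUBSCRIPT SMALL LETTER X.
x_hat = "x\u0302"
alpha_x = "\u03b1\u2093"


def describe(name):
    """Return (code point, general category) for each character in `name`.

    Hypothetical helper, for illustration only.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in name]


# U+0302 is a nonspacing mark (Mn); U+2093 is a modifier letter (Lm).
print(describe(x_hat))
print(describe(alpha_x))

# Python 3 accepts both strings as identifiers (see PEP 3131).
print(x_hat.isidentifier(), alpha_x.isidentifier())
```

The point of the sketch is the first name: an existing combining mark in category Mn is already accepted after a letter in an identifier. A new combining character placed in a category that identifier rules already admit would presumably be accepted in the same way once a language updates its Unicode tables, which is the argument made in the proposal text.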
================================================
FILE: subsuper-proposal.tex
================================================
\documentclass[10pt,english]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{geometry}
\geometry{verbose,lmargin=1in,rmargin=1in}
\usepackage{babel}
\usepackage{mathrsfs}
\usepackage{amsbsy}
\usepackage{amssymb}
\usepackage[unicode=true]{hyperref}
\usepackage{fouriernc}
\usepackage[T1]{fontenc}

\usepackage{todonotes} %XXX REMOVE ME WHEN DONE
\newcommand{\TODO}[1]{\todo[inline]{#1}}

\makeatletter
\usepackage{upgreek}
\usepackage{cite}
\newcommand{\Secref}[1]{Section~\ref{sec:#1}}
\newcommand{\secref}[1]{Sec.~\ref{sec:#1}}
\makeatother

\begin{document}

\title{Proposal to Encode Subscripts/Superscripts\\
for Use in Programming-Language Source Code}

\author{
Dr.\ Jiahao Chen, AI Research Director, JPMorgan AI Research and Research Scientist, MIT CSAIL
\and
Prof.\ Alan Edelman, MIT Department of Mathematics
\and
Prof.\ Steven G.\ Johnson, MIT Department of Mathematics
\and
Stefan Karpinski, Chief Open Source Officer, Julia Computing Inc.
}

\maketitle

\TODO{%
.... add other endorsers here --- we don't want to turn this into an Internet petition, but endorsements from well-known developers, organizations, open-source projects, and companies, could be helpful
}

\abstract{Widespread Unicode support in programming-language parsers now allows software developers in technical fields to use notations like Greek letters and sub/superscripts \emph{in their source code} that closely follow standard mathematical notations.
This development has been hampered by the fact that only a subset of Latin and Greek characters have sub/superscript codepoints in Unicode 13, and so we propose to expand the set of sub/superscript characters to encompass common mathematical usage, ideally by simply adding two new combining characters.}

\section{Summary}
\label{sec:summary}

Many recent programming languages (including C\#~\cite{Csharp}, Fortress~\cite{Fortress}, Go~\cite{Go}, Java~\cite{Java}, Julia~\cite{Julia}, Python~3~\cite{Python}, Scala~\cite{Scala}, and Swift~\cite{Swift}) allow a broad range of Unicode characters to be employed \textbf{in source code} by the programmer for identifiers (names of functions and variables), and indeed such behavior is recommended by the \emph{Unicode Standard Annex~\#31}~\cite{UAX31}. For programmers working in mathematical fields, including statistics, science, and engineering, where a wide array of mathematical notations and symbols are commonplace, this often enables computer programs to be written in a notation that closely mimics standard mathematical descriptions. For example, in a mathematical context one might easily have a quantity with a name like $\hat{x}$ or $\alpha_{x}$, and the same names can now be used in computer programs for this quantity ($\mathrm{\hat{{x}}}$ = U+0078 U+0302, $\mathrm{{\upalpha_{x}}}$ = U+03B1 U+2093) rather than ASCII approximations such as \texttt{xhat} and \texttt{alpha\_x} that are used in older programming languages. Furthermore, Unicode-capable \emph{code-editing software} has made it easy for programmers to type such symbols by ``tab completion'' (e.g. typing ``$\mathrm{{\upalpha_{x}}}$'' as \texttt{\textbackslash{}alpha}\texttt{\emph{{[}tab{]}}}\texttt{\textbackslash{}\_x}\texttt{\emph{{[}tab{]}}}) or related techniques in several popular editors~\cite{vsCode,Atom,Emacs,Sublime,Vim} and interactive-programming environments~\cite{Julia,IPython}.

This is an important and positive trend.
To quote a 2001 draft of \emph{Unicode Technical Report~\#25}~\cite[section 5.3]{UTR25}, \emph{Using real mathematical expressions in computer programs would be far superior in terms of readability, reduced coding times, program maintenance, and streamlined documentation.}

One notation that is extremely common throughout all of mathematics is the use of subscripts and superscripts. Unfortunately, Unicode 13.0 provides only a very limited selection of subscript and superscript characters (digits 0--9, parentheses, $+$/$-$/$=$, a subset of the Latin and Greek alphabets, and a few phonetic symbols). Therefore, \textbf{we propose that Unicode be extended to encode all the subscript/superscript characters most commonly used in mathematical notation}: all lower- and upper-case Latin and Greek letters (and variants thereof) and a selection of other mathematical symbols described below. Because these are all derived from existing Unicode codepoints, the assignment of their properties and the development of their fonts will be straightforward.

As one possible implementation strategy, we suggest that the Consortium \textbf{add two new ``mathematical subscript'' and ``mathematical superscript'' combining characters} that convert the preceding character (or character + combining characters) into a sub/superscript. (This is closely related to the superscript and ``scientific inferiors'' features already present in OpenType~\cite{OpenType}.) Although the combining-character approach is potentially very general, the standard could initially recommend that the new combining characters be ignored except when they follow one of a limited subset of characters commonly used in mathematical notation (similar to how the emoji skin-tone modifiers only affect rendering of a small set of emoji characters with skin-tone variants).

We should mention that there were past proposals to extend the range of subscript and superscript characters in Unicode~\cite{L2-10-230,L2-11-208} (neither of which mentioned mathematical or programming applications) that were rejected. We have been unable to find an official record of the grounds for rejection of these proposals, but there seem to be two common objections in discussions of this topic on Unicode forums~\cite{Miller10,UCDF}. First, that this is \emph{mere ``text presentation''} devoid of semantic content, which is better accomplished via markup languages like MathML or other typesetting technologies such as the OpenType MATH table. Second, that the set of characters that deserve sub/superscript variants is unclear and might be subjected to never-ending extension.

We believe that both of those objections must be re-evaluated in light of mathematical programming. First, subscripts and superscripts of any kind in mathematics always have semantic content ($\mathrm{{\upalpha_{x}}}$ is very different in meaning from $\mathrm{\upalpha\mathrm{{x}}}$). More importantly, \textbf{programming language source code is almost invariably restricted to ``plain text''}---in ordinary usage, computer code (and especially identifiers) cannot be intermingled with MathML markup or other formatting metadata---with a few exceptions such as Mathematica~\cite{Mathematica} that are edited and/or rendered mainly within specialized software. Finally, the restriction to \emph{common mathematical notations} greatly limits the scope of the additions, excluding the vast majority of Unicode characters from consideration.

It is not practical to encode the full range of mathematical notations in Unicode; MathML and similar technologies will always be useful and generally should not be combined with Unicode-based formatting.
However, in programming contexts, where non-Unicode math formatting is unavailable, increasing the set of available subscripts and superscripts is ``low-hanging fruit'' that would be tremendously beneficial in technical fields. We respectfully urge you to expand the set of mathematical subscripts and superscripts, and especially to eliminate frustrating omissions from the Latin and Greek alphabets.

\section{Suggested Subscript/Superscript Characters}
\label{sec:required}

The following existing Unicode codepoints are widely used in subscripts and superscripts of mathematical notation, and should therefore have new sub/superscript variants (ideally via combining characters as described below, which will allow the set of available characters to be expanded incrementally by font providers as described in \secref{fonts}):
\begin{itemize}
\item Latin upper- and lower-case characters `a'--`z' (U+0061--007A) and `A'--`Z' (U+0041--005A). (See \secref{existingchars} regarding the subset of these already present in Unicode as sub/superscripts.)
\item Greek upper- and lower-case characters `$\upalpha$'--`$\upomega$' (U+03B1--03C9) and `A'--`$\Upomega$' (U+0391--03A9) as well as lunate epsilon `$\upepsilon$' (U+03F5). (See \secref{existingchars} regarding the small subset of these already present in Unicode as sub/superscripts.)
\item Decimal digits `0'--`9' (U+0030--0039): these already have acceptable sub/superscript characters (U+2080--2089 and U+2070--2079), which should be canonically equivalent to the corresponding digit followed by a sub/superscript combining character as described below.
\item Common mathematical operators: `$+$' (U+002B), `$-$' (U+002D and U+2212), `$\pm$' (U+00B1), `$*$' (U+002A), `$\times$' (U+00D7), `/' (U+002F), `$=$' (U+003D), `$<$' (U+003C), `$>$' (U+003E).
Of these, `$+$', `$-$', and `$=$' already have acceptable sub/superscript characters (U+207A--207C and U+208A--208C), which should be canonically equivalent to the corresponding character followed by a sub/superscript combining character as described below.
\item Other important mathematical symbols widely used in sub/superscript form: `$\Vert$' (U+2016), `$\perp$' (U+27C2), `$\infty$' (U+221E), `$\dagger$' (U+2020), `$\angle$' (U+2220).
\item Brackets: `(' and `)' (U+0028--0029), `[' and `]' (U+005B and U+005D), `\{' and `\}' (U+007B and U+007D). Of these, parentheses `()' already have acceptable sub/superscript characters (U+208D--208E and U+207D--207E), which should be canonically equivalent to the corresponding character followed by a sub/superscript combining character as described below.
\item Comma `,' (U+002C), colon `:' (U+003A), and semicolon `;' (U+003B).
\end{itemize}

\subsection{Relationship to existing sub/superscript letters}
\label{sec:existingchars}

A subset of Latin and Greek letters already have sub/superscript variants in Unicode. For example, `$^\mathrm{a}$' is U+1D43 and `$_\mathrm{a}$' is U+2090, while `$^\mathrm{A}$' is U+1D2C but there is no subscript `A' character. However, these codepoints seem to have been added piecemeal into the Unicode standard for a variety of semantic meanings unrelated to their usage in mathematical notation (e.g. `$^\mathrm{a}$' is in the Latin-1 Supplement block, whereas `$^\mathrm{k}$' is in the Phonetic Extensions block). Therefore we suggest that these existing characters \textbf{should not be canonically equivalent} to the proposed new mathematical sub/superscript characters (whether implemented by new codepoints or by combining characters). However, the existing Latin and Greek sub/superscripts \textbf{should at least be compatible} with the new characters because in a number of contexts (e.g. programming code where the existing superscripts are already in use) they should be interchangeable in practice.
Several other proposed canonical equivalences are noted in \secref{required}, above.

\section{Proposed new combining characters}

We believe that the simplest and most flexible way to support mathematical subscripts and superscripts would be to \textbf{add two new combining characters}, ``MATHEMATICAL SUBSCRIPT'' and ``MATHEMATICAL SUPERSCRIPT,'' which would indicate that the \textbf{preceding grapheme} should form a sub/superscript, respectively. As mentioned in \secref{summary}, text-rendering software might optionally display only a limited set of characters in sub/superscript form, following a recommended list such as the one in \secref{required}, implemented as variant characters or ligatures (e.g. similar to emoji skin-tone variants) as described in \secref{fonts}.

Both of these characters could be put into the general category Mn (nonspacing marks). Placing them in the nonspacing marks category would automatically allow the sub/superscript combining characters to be used following any valid identifier character in programming languages implementing \emph{Unicode Standard Annex~\#31}~\cite{UAX31} or similar rules, since category Mn is included in the \texttt{ID\_Continue} lexical class of \emph{Annex~\#31}. (That is, the languages would support the new combining characters automatically once their parsers are updated to use the new Unicode data tables, which typically happens within a reasonable interval after each Unicode revision.) Alternatively, they could be added to category Sk (symbol modifiers), similar to the emoji skin-tone variants, with an amendment to \emph{Annex~\#31} adding them to \texttt{ID\_Continue}.

The new combining characters should have bidirectional class ``Other Neutrals (ON)'' and mirror class ``N'' similar to the emoji modifiers, with no upper/lowercase mapping, should be treated like other combining characters for line-breaking and collation, and should be allowed in identifiers as lexical class \texttt{ID\_Continue} as mentioned above. Several canonical equivalences to existing sub/superscript characters are suggested in \secref{required}.

Some programming languages may further opt to allow additional sub/superscript mathematical symbols in identifiers on a case-by-case basis. For example, Julia~\cite{Julia} already allows sub/superscript parentheses in identifiers (e.g. $\upchi^{(3)}$ is a valid Julia identifier for a ``Kerr susceptibility'' in optics) so it might be natural in Julia to allow sub/superscript commas and asterisks (e.g. $\mathrm{p}^*$ or $\mathrm{p}_*$ may denote the ``primal optimum'' in convex-optimization problems).

\subsection{Alternative: Many new sub/superscript characters}

The alternative to adding new combining characters is to simply add new characters for the individual sub/superscripts proposed in \secref{required}, which would then be treated similarly to existing sub/superscript characters. They would be placed in general category Lm (modifier letter) for Latin and Greek letters, Ps/Pe (open/close punctuation) for brackets, Po (other punctuation) for comma, colon, and semicolon, and Sm (math symbol) for mathematical symbols. The disadvantages of this approach are obvious: many new codepoints must be chosen and documented, and supporting additional mathematical sub/superscripts in the future will require additional revisions to the standard (as opposed to improvements in fonts or rendering software).

\section{Lack of Possible Equivalents}

As explained in \secref{required}, Unicode currently contains only a subset of the most common mathematical subscripts and superscripts, and omits many Latin and Greek letters (e.g.
there is no superscript `q' or subscript `A'). Furthermore, formatting/presentation markup like MathML, OpenDocument format (ODF), and similar typesetting/text-formatting technologies are \textbf{not usable for source code} in programming languages, nearly all of which require ``plain text'' input with no XML or other formatting metadata.

\section{Provision of Fonts}
\label{sec:fonts}

Because the required glyphs are merely sub/superscript variants of existing characters, relatively little effort will be required to support the new combining characters for common mathematical usage. Common font editing software provides automated ways to take a set of glyphs and generate transformed variants by scaling them and shifting the baseline. Having generated a sub/superscript glyph, e.g. a subscript `A', the font can then define a glyph substitution sequence indicating that `A' followed by the new subscript combining character will be rendered as a subscript-`A' \emph{ligature}, e.g. using an OpenType glyph-substitution table~\cite{opentypeGSUB}. (Even if every glyph suggested in \secref{required} is implemented in this way, it still represents only about 250 glyphs.) Draft versions of the proposed glyphs in \secref{required} have already been implemented in the JuliaMono TrueType font~\cite{juliamono}, and in collaboration with Stefan~Karpinski (co-author of this proposal) we plan to support the new combining character via these glyphs in JuliaMono if this proposal is accepted.

In the longer run, text-rendering software such as Microsoft DirectWrite, Apple Core Text, or the free/open-source HarfBuzz engine may choose to automate this process even for fonts lacking such substitution sequences. Although this would add flexibility and improve adoption, it is not required---\emph{existing text rendering software would be sufficient} given a font providing sub/superscript ligatures.

\section{Contact Information}

\TODO{``names and addresses of appropriate contacts within national body or user organizations''}
\begin{itemize}
\item Steven G. Johnson, MIT Room 2-345, 77 Massachusetts Ave., Cambridge, MA 02139.
\end{itemize}

\begin{thebibliography}{10}
\bibitem{Csharp}Microsoft Corporation, \textit{C\# Language Specification}, version 5.0, section 2.4.2 (2012).
\bibitem{Fortress}Sun Microsystems, Inc., \textit{The Fortress Language Specification}, version 1.0, \url{http://www.ccs.neu.edu/home/samth/fortress-spec.pdf} (2008-03-31).
\bibitem{Go}The Go Authors, \textit{The Go Programming Language Specification}, \url{https://golang.org/ref/spec} (2020-01-14).
\bibitem{Java}James Gosling, Bill Joy, Guy Steele, Gilad Bracha, and Alex Buckley, \textit{The Java Language Specification: Java SE 8 Edition}, section 3.8, \url{http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html\#jls-3.8} (2015-02-13).
\bibitem{Julia}The Julia Authors, \textit{The Julia Programming Language Manual}, \url{http://docs.julialang.org/en/latest/manual/variables/} (retrieved 2016-08-25).
\bibitem{Python}Martin von L{\"o}wis, ``Python PEP 3131: Supporting Non-ASCII Identifiers,'' \url{https://www.python.org/dev/peps/pep-3131/} (2007-05-01).
\bibitem{Scala}M.~Odersky \textit{et al.}, \textit{Scala Language Specification}, version 2.13, \url{https://scala-lang.org/files/archive/spec/2.13/01-lexical-syntax.html} (2020-08-25).
\bibitem{Swift}Apple Inc., \textit{The Swift Programming Language}, version 5.3, \url{https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html} (2020-06-22).
\bibitem{UAX31}Mark Davis, \textit{Unicode Standard Annex \#31: Unicode Identifier and Pattern Syntax}, version 13.0.0, \url{https://www.unicode.org/reports/tr31/tr31-33.html} (2020-02-13).
\bibitem{vsCode}Unicode LaTeX plugin for the vsCode editor, \url{https://marketplace.visualstudio.com/items?itemName=oijaz.unicode-latex} (retrieved 2020-08-24).
\bibitem{Atom}atom-latex-completions plugin for the Atom editor, \url{https://github.com/JunoLab/atom-latex-completions} (retrieved 2020-08-24).
\bibitem{Emacs}\TeX{} Input Method for the Emacs editor, \url{https://www.emacswiki.org/emacs/TeXInputMethod} (retrieved 2020-08-24).
\bibitem{Sublime}UnicodeMath plugin for the Sublime editor, \url{https://github.com/mvoidex/UnicodeMath} (retrieved 2020-08-24).
\bibitem{Vim}julia-vim plugin for the vim editor, \url{https://github.com/JuliaLang/julia-vim} (retrieved 2020-08-24).
\bibitem{IPython}Brian E. Granger, ``Adds Julia-style latex->unicode tab completion,'' \textit{IPython pull request \#6380}, \url{https://github.com/ipython/ipython/pull/6380} (2014-08-28).
\bibitem{UTR25}Barbara Beeton, Asmus Freytag, and Murray Sargent III, ``Proposed Draft Unicode Technical Report \#25: Unicode Support for Mathematics,'' \url{http://www.unicode.org/unicode/reports/tr25/tr25-1.html} (2001-10-10).
\bibitem{L2-10-230}Karl Pentzlin, ``Proposal to encode a modifier letter used in French abbreviations in the UCS,'' \url{http://www.unicode.org/L2/L2010/10230-modifier-q.pdf} (2010-07-13).
\bibitem{L2-11-208}ISO/IEC JTC1/SC35/WG1, ``Proposal to encode missing Latin small capital and modifier letters in the UCS,'' \url{http://www.unicode.org/L2/L2011/11208-n4068.pdf} (2011-03-12).
\bibitem{Miller10}Eric Miller, ``Comment on L2-10-230, Proposal to encode a modifier letter used in French abbreviations in the UCS,'' \url{http://www.unicode.org/L2/L2010/10315-comment.pdf} (2010-08-09).
\bibitem{UCDF}``Superscript latin small letter q,'' discussion on \emph{The Unicode Consortium Discussion Forum}, \url{http://www.unicode.org/forum/viewtopic.php?f=24&t=142} (2011) (retrieved from \url{https://web.archive.org/web/20151019072726/http://www.unicode.org/forum/viewtopic.php?f=24&t=142} on 2020-08-24).
\bibitem{Mathematica}Stephen Wolfram, ``Mathematical Notation: Past and Future,'' \url{http://www.stephenwolfram.com/publications/mathematical-notation-past-future/} (2000).
\bibitem{OpenType}Microsoft Corporation, ``Registered Features,'' \emph{OpenType Specification}, version 1.7, \url{https://www.microsoft.com/typography/otspec/features_pt.htm\#sinf} (2008-11-19).
\bibitem{opentypeGSUB}Microsoft Corporation, ``GSUB---Glyph Substitution Table,'' \textit{OpenType Specification}, \url{https://docs.microsoft.com/en-us/typography/opentype/spec/gsub} (2018-08-15).
\bibitem{juliamono}``JuliaMono---A monospaced font for scientific and technical computing,'' \url{https://cormullion.github.io/pages/2020-07-26-JuliaMono/} (2020-07-26).
\end{thebibliography}

\end{document}