From 673ac6bd5696bce4c9f18d39b0cecd5db1aa8f22 Mon Sep 17 00:00:00 2001
From: Melody Horn
Date: Thu, 29 Oct 2020 19:32:38 -0600
Subject: finish moving tokens to new format

---
 language/scanning.rst | 91 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 81 insertions(+), 10 deletions(-)

(limited to 'language')

diff --git a/language/scanning.rst b/language/scanning.rst
index 7a7b7d3..86177ac 100644
--- a/language/scanning.rst
+++ b/language/scanning.rst
@@ -3,6 +3,13 @@ Scanning
 
 .. glossary::
 
+    token
+        A single atomic unit in a Crowbar source file.
+        May be a :term:`keyword`, an :term:`identifier`, a :term:`constant`,
+        a :term:`string literal`, or a :term:`punctuator`.
+        Keywords, identifiers, and constants (except for :term:`character constant`\ s) must have either whitespace or a comment separating them.
+        Punctuators, string literals, and character constants do not require explicit separation from adjacent tokens.
+
     keyword
         One of the literal words ``bool``, :crowbar:ref:`break`, ``case``,
         ``char``, ``const``, ``continue``, ``default``, ``do``, ``double``,
@@ -17,26 +24,90 @@ Scanning
         .. todo::
 
             figure out https://www.unicode.org/reports/tr31/tr31-33.html
+
+    constant
+        A numeric (or numeric-equivalent) value specified directly within the code.
+        May be a :term:`decimal constant`, a :term:`binary constant` , an :term:`octal constant`,
+        a :term:`hexadecimal constant`, a :term:`floating-point constant`, a :term:`hexadecimal floating-point constant`,
+        or a :term:`character constant`.
+        Any of these except for the character constant may contain underscores; these are ignored by the compiler and only meaningful to humans reading the code.
 
     decimal constant
         A sequence of characters matching the regular expression ``[0-9_]+``.
         Denotes the numeric value of the given sequence of decimal digits.
-        Underscores are ignored by the compiler, but may be useful separators for other readers.
-
+
     binary constant
         A sequence of characters matching the regular expression ``0[bB][01_]+``.
         Denotes the numeric value of the given sequence of binary digits (after the ``0[bB]`` prefix has been removed).
-        Underscores are ignored by the compiler, but may be useful separators for other readers.
-
+
     octal constant
         A sequence of characters matching the regular expression ``0o[0-7_]+``.
         Denotes the numeric value of the given sequence of octal digits (after the ``0o`` prefix has been removed).
-        Underscores are ignored by the compiler, but may be useful separators for other readers.
 
-    token
-        A single atomic unit in a Crowbar source file.
-        Has one (and exactly one) of the following types.
+    hexadecimal constant
+        A sequence of characters matching the regular expression ``0[xX][0-9a-fA-F]+``.
+        Denotes the numeric value of the given sequence of hexadecimal digits (after the ``0[xX]`` prefix has been removed).
+
+    floating-point constant
+        A sequence of characters matching the regular expression ``[0-9_]+\.[0-9_]+([eE][+-]?[0-9_]+)?``.
+
+        .. note::
 
-.. todo::
+            Unlike in C and many other languages, ``6e3`` in Crowbar is not a valid floating-point constant.
+            The Crowbar-compatible spelling is ``6.0e3``.
+
+        Denotes the numeric value of the given decimal number, optionally expressed in scientific notation.
+        That is, ``XeY`` denotes :math:`X * 10^Y`.
 
-    finish transcribing token definitions
+    hexadecimal floating-point constant
+        A sequence of characters matching the regular expression ``0(fx|FX)[0-9a-fA-F_]+\.[0-9a-fA-F_]+[pP][+-]?[0-9_]+``.
+        Denotes the numeric value of the given hexadecimal number expressed in binary scientific notation.
+        That is, ``XpY`` denotes :math:`X * 2^Y`.
+
+    character constant
+        A pair of single quotes ``'`` surrounding either a single character or an :term:`escape sequence`.
+        The single character may not be a single quote or a backslash ``\``.
+        Denotes the Unicode code point number for either the single surrounded character or the character denoted by the escape sequence.
+
+    escape sequence
+        One of the following pairs of characters:
+
+        * ``\'``, denoting the single quote ``'``
+        * ``\"``, denoting the double quote ``"``
+        * ``\\``, denoting the backslash ``\``
+        * ``\r``, denoting the carriage return (U+000D)
+        * ``\n``, denoting the line feed, or newline (U+000A)
+        * ``\t``, denoting the (horizontal) tab (U+0009)
+        * ``\0``, denoting a null character (U+0000)
+
+        Or a sequence of characters matching one of the following regular expressions:
+
+        * ``\\x[0-9a-fA-F]{2}``, denoting the numeric value of the given two hexadecimal digits
+        * ``\\x[0-9a-fA-F]{4}``, denoting the numeric value of the given four hexadecimal digits
+        * ``\\x[0-9a-fA-F]{8}``, denoting the numeric value of the given eight hexadecimal digits
+
+    string literal
+        A pair of double quotes ``"`` surrounding a sequence whose elements are either single characters or escape sequences.
+        No single-character element may be the double quote or the backslash.
+        Denotes the UTF-8-encoded sequence of bytes representing the sequence of characters which, either directly or via an escape sequence, are specified between the quotes.
+
+    punctuator
+        One of the literal sequences of characters ``[``, ``]``, ``(``, ``)``,
+        ``{``, ``}``, ``.``, ``,``, ``+``, ``-``, ``*``, ``/``, ``%``, ``;``,
+        ``!``, ``&``, ``|``, ``^``, ``~``, ``>``, ``<``, ``=``, ``->``, ``++``,
+        ``--``, ``>>``, ``<<``, ``<=``, ``>=``, ``==``, ``!=``, ``&&``, ``||``,
+        ``+=``, ``-=``, ``*=``, ``/=``, ``%=``, ``&=``, ``|=``, or ``^=``.
+
+    whitespace
+        A nonempty sequence of characters that each has a Unicode general category of either Control (``Cc``) or Separator (``Z``).
+        Separates tokens.
+
+    comment
+        Text that the compiler should ignore.
+        May be a :term:`line comment` or a :term:`block comment`.
+
+    line comment
+        A sequence of characters beginning with the characters ``//`` (outside of a :term:`string literal` or :term:`comment`) and ending with a newline character U+000A.
+
+    block comment
+        A sequence of characters beginning with the characters ``/*`` (outside of a :term:`string literal` or :term:`comment`) and ending with the characters ``*/``.
-- 
cgit v1.2.3
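One rough way to sanity-check the numeric-constant patterns this patch adds is to try them against sample spellings. The Python sketch below is not part of the commit: ``CONSTANT_PATTERNS`` and ``classify_constant`` are hypothetical names, the regular expressions are copied from the glossary entries as written, and it assumes a constant must match its pattern in full.

```python
import re

# Regular expressions copied from the glossary entries in the patch.
# The names on the left are the glossary terms; the table itself is
# a hypothetical helper, not anything defined by Crowbar.
CONSTANT_PATTERNS = [
    ("hexadecimal floating-point constant",
     r"0(fx|FX)[0-9a-fA-F_]+\.[0-9a-fA-F_]+[pP][+-]?[0-9_]+"),
    ("floating-point constant", r"[0-9_]+\.[0-9_]+([eE][+-]?[0-9_]+)?"),
    ("binary constant", r"0[bB][01_]+"),
    ("octal constant", r"0o[0-7_]+"),
    ("hexadecimal constant", r"0[xX][0-9a-fA-F]+"),
    ("decimal constant", r"[0-9_]+"),
]


def classify_constant(text):
    """Return the glossary term whose pattern matches all of `text`, or None."""
    for name, pattern in CONSTANT_PATTERNS:
        # Assumption: the whole candidate must match, not just a prefix.
        if re.fullmatch(pattern, text):
            return name
    return None


if __name__ == "__main__":
    for sample in ["1_000", "0b1010", "0o17", "0xFF", "6.0e3", "6e3", "0fx1.8p3"]:
        print(sample, "->", classify_constant(sample))
```

Under these assumptions ``6.0e3`` is classified as a floating-point constant while ``6e3`` matches nothing, which agrees with the note in the floating-point entry.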