-rw-r--r-- language/scanning.rst | 91
-rw-r--r-- syntax.md | 2
2 files changed, 82 insertions, 11 deletions
diff --git a/language/scanning.rst b/language/scanning.rst
index 7a7b7d3..86177ac 100644
--- a/language/scanning.rst
+++ b/language/scanning.rst
@@ -3,6 +3,13 @@ Scanning
.. glossary::
+ token
+ A single atomic unit in a Crowbar source file.
+ May be a :term:`keyword`, an :term:`identifier`, a :term:`constant`,
+ a :term:`string literal`, or a :term:`punctuator`.
+ Adjacent keywords, identifiers, and constants (except for :term:`character constant`\ s) must be separated by whitespace or a comment.
+ Punctuators, string literals, and character constants do not require explicit separation from adjacent tokens.
+
keyword
One of the literal words ``bool``, :crowbar:ref:`break`, ``case``,
``char``, ``const``, ``continue``, ``default``, ``do``, ``double``,
@@ -17,26 +24,90 @@ Scanning
.. todo::
figure out https://www.unicode.org/reports/tr31/tr31-33.html
+
+ constant
+ A numeric (or numeric-equivalent) value specified directly within the code.
+ May be a :term:`decimal constant`, a :term:`binary constant`, an :term:`octal constant`,
+ a :term:`hexadecimal constant`, a :term:`floating-point constant`, a :term:`hexadecimal floating-point constant`,
+ or a :term:`character constant`.
+ Any of these except for the character constant may contain underscores; these are ignored by the compiler and only meaningful to humans reading the code.
decimal constant
A sequence of characters matching the regular expression ``[0-9_]+``.
Denotes the numeric value of the given sequence of decimal digits.
- Underscores are ignored by the compiler, but may be useful separators for other readers.
-
+
binary constant
A sequence of characters matching the regular expression ``0[bB][01_]+``.
Denotes the numeric value of the given sequence of binary digits (after the ``0[bB]`` prefix has been removed).
- Underscores are ignored by the compiler, but may be useful separators for other readers.
-
+
octal constant
A sequence of characters matching the regular expression ``0o[0-7_]+``.
Denotes the numeric value of the given sequence of octal digits (after the ``0o`` prefix has been removed).
- Underscores are ignored by the compiler, but may be useful separators for other readers.
- token
- A single atomic unit in a Crowbar source file.
- Has one (and exactly one) of the following types.
+ hexadecimal constant
+ A sequence of characters matching the regular expression ``0[xX][0-9a-fA-F]+``.
+ Denotes the numeric value of the given sequence of hexadecimal digits (after the ``0[xX]`` prefix has been removed).
+
+ floating-point constant
+ A sequence of characters matching the regular expression ``[0-9_]+\.[0-9_]+([eE][+-]?[0-9_]+)?``.
+
+ .. note::
-.. todo::
+ Unlike in C and many other languages, ``6e3`` in Crowbar is not a valid floating-point constant.
+ The Crowbar-compatible spelling is ``6.0e3``.
+
+ Denotes the numeric value of the given decimal number, optionally expressed in scientific notation.
+ That is, ``XeY`` denotes :math:`X \times 10^Y`.
- finish transcribing token definitions
+ hexadecimal floating-point constant
+ A sequence of characters matching the regular expression ``0(fx|FX)[0-9a-fA-F_]+\.[0-9a-fA-F_]+[pP][+-]?[0-9_]+``.
+ Denotes the numeric value of the given hexadecimal number expressed in binary scientific notation.
+ That is, ``XpY`` denotes :math:`X \times 2^Y`.
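
The numeric constant entries above can be checked against sample spellings with a small Python sketch. The regular expressions are copied verbatim from the glossary entries in this diff; the `CONSTANT_PATTERNS` table and the `classify` helper are hypothetical names used only for illustration, and checking the more specific patterns first is an assumption of the sketch, not something the diff specifies.

```python
import re

# Patterns copied from the glossary entries in this diff.
# CONSTANT_PATTERNS and classify() are hypothetical, not part of Crowbar.
CONSTANT_PATTERNS = {
    "hexadecimal floating-point constant":
        r"0(fx|FX)[0-9a-fA-F_]+\.[0-9a-fA-F_]+[pP][+-]?[0-9_]+",
    "floating-point constant": r"[0-9_]+\.[0-9_]+([eE][+-]?[0-9_]+)?",
    "hexadecimal constant": r"0[xX][0-9a-fA-F]+",
    "octal constant": r"0o[0-7_]+",
    "binary constant": r"0[bB][01_]+",
    "decimal constant": r"[0-9_]+",
}


def classify(text):
    """Return the first glossary entry whose regex matches all of text, else None."""
    for name, pattern in CONSTANT_PATTERNS.items():
        if re.fullmatch(pattern, text):
            return name
    return None
```

With these patterns, `classify("6.0e3")` yields the floating-point entry while `classify("6e3")` yields `None`, which agrees with the note about `6e3` in the floating-point constant entry.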
+
+ character constant
+ A pair of single quotes ``'`` surrounding either a single character or an :term:`escape sequence`.
+ The single character may not be a single quote or a backslash ``\``.
+ Denotes the Unicode code point number for either the single surrounded character or the character denoted by the escape sequence.
+
+ escape sequence
+ One of the following pairs of characters:
+
+ * ``\'``, denoting the single quote ``'``
+ * ``\"``, denoting the double quote ``"``
+ * ``\\``, denoting the backslash ``\``
+ * ``\r``, denoting the carriage return (U+000D)
+ * ``\n``, denoting the line feed, or newline (U+000A)
+ * ``\t``, denoting the (horizontal) tab (U+0009)
+ * ``\0``, denoting a null character (U+0000)
+
+ Or a sequence of characters matching one of the following regular expressions:
+
+ * ``\\x[0-9a-fA-F]{2}``, denoting the numeric value of the given two hexadecimal digits
+ * ``\\x[0-9a-fA-F]{4}``, denoting the numeric value of the given four hexadecimal digits
+ * ``\\x[0-9a-fA-F]{8}``, denoting the numeric value of the given eight hexadecimal digits
+
+ string literal
+ A pair of double quotes ``"`` surrounding a sequence whose elements are either single characters or escape sequences.
+ No single-character element may be the double quote or the backslash.
+ Denotes the UTF-8-encoded sequence of bytes representing the sequence of characters which, either directly or via an escape sequence, are specified between the quotes.
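
As a rough illustration of the string literal entry above, the following Python sketch turns a literal into the UTF-8 bytes it denotes. It handles only the two-character escape sequences from the list and deliberately leaves out the hexadecimal forms; `SIMPLE_ESCAPES` and `decode_string_literal` are hypothetical names, and raising `ValueError` on malformed input is an assumption of the sketch.

```python
# Values for the two-character escape sequences listed in the glossary.
# Both names below are hypothetical helpers, not part of Crowbar.
SIMPLE_ESCAPES = {
    "'": "'", '"': '"', "\\": "\\",
    "r": "\r", "n": "\n", "t": "\t", "0": "\0",
}


def decode_string_literal(literal):
    """Return the UTF-8 bytes denoted by a double-quoted literal.

    Only the two-character escapes are supported in this sketch.
    """
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a string literal")
    body = literal[1:-1]
    chars = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # A backslash must start one of the supported escape sequences.
            if i + 1 >= len(body) or body[i + 1] not in SIMPLE_ESCAPES:
                raise ValueError("unsupported escape sequence")
            chars.append(SIMPLE_ESCAPES[body[i + 1]])
            i += 2
        elif ch == '"':
            # A bare double quote may not appear as a single-character element.
            raise ValueError("unescaped double quote")
        else:
            chars.append(ch)
            i += 1
    return "".join(chars).encode("utf-8")
```

A character outside ASCII, such as `é`, comes out as more than one byte, which is the practical consequence of the entry defining the result as a UTF-8 byte sequence.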
+
+ punctuator
+ One of the literal sequences of characters ``[``, ``]``, ``(``, ``)``,
+ ``{``, ``}``, ``.``, ``,``, ``+``, ``-``, ``*``, ``/``, ``%``, ``;``,
+ ``!``, ``&``, ``|``, ``^``, ``~``, ``>``, ``<``, ``=``, ``->``, ``++``,
+ ``--``, ``>>``, ``<<``, ``<=``, ``>=``, ``==``, ``!=``, ``&&``, ``||``,
+ ``+=``, ``-=``, ``*=``, ``/=``, ``%=``, ``&=``, ``|=``, or ``^=``.
+
+ whitespace
+ A nonempty sequence of characters, each of which has a Unicode general category of either Control (``Cc``) or Separator (``Z``).
+ Separates tokens.
+
+ comment
+ Text that the compiler should ignore.
+ May be a :term:`line comment` or a :term:`block comment`.
+
+ line comment
+ A sequence of characters beginning with the characters ``//`` (outside of a :term:`string literal` or :term:`comment`) and ending with a newline character U+000A.
+
+ block comment
+ A sequence of characters beginning with the characters ``/*`` (outside of a :term:`string literal` or :term:`comment`) and ending with the characters ``*/``.
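
The two comment entries above say a comment only begins outside of a string literal or another comment. A minimal Python sketch of that rule follows. `strip_comments` is a hypothetical helper; replacing a line comment with a newline and a block comment with a single space, also skipping over character constants, and treating end of input as the end of a line comment are all assumptions of the sketch rather than anything stated in the diff.

```python
def strip_comments(source):
    """Replace comments with whitespace, leaving quoted text untouched."""
    out = []
    i = 0
    n = len(source)
    while i < n:
        if source.startswith("//", i):
            # Line comment: runs through the next U+000A (or end of input).
            end = source.find("\n", i)
            if end == -1:
                i = n
            else:
                out.append("\n")
                i = end + 1
        elif source.startswith("/*", i):
            # Block comment: runs through the first "*/"; no nesting.
            end = source.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated block comment")
            out.append(" ")
            i = end + 2
        elif source[i] in "\"'":
            # Quoted text: copy it whole so "//" or "/*" inside is ignored.
            quote = source[i]
            j = i + 1
            while j < n and source[j] != quote:
                j += 2 if source[j] == "\\" else 1
            if j >= n:
                raise ValueError("unterminated quoted text")
            out.append(source[i:j + 1])
            i = j + 1
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Substituting whitespace instead of deleting the comment outright keeps the tokens on either side separated, in line with the token entry's rule that a comment can separate adjacent tokens.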
diff --git a/syntax.md b/syntax.md
index be62b47..80fa54b 100644
--- a/syntax.md
+++ b/syntax.md
@@ -1,4 +1,4 @@
-# Syntax
+# Syntax (old)
The syntax of Crowbar mostly matches the syntax of C, with fewer obscure/advanced/edge case features.