From 6f187130dcfb283fc0b0622cf1af62827d8443de Mon Sep 17 00:00:00 2001
From: Melody Horn
Date: Thu, 1 Oct 2020 04:05:58 -0600
Subject: define scanning, tokens, identifiers

---
 syntax.md | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/syntax.md b/syntax.md
index ff38c19..91d2bac 100644
--- a/syntax.md
+++ b/syntax.md
@@ -6,9 +6,21 @@ A Crowbar source file is UTF-8.
 Crowbar source files can come in two varieties, an *implementation file* and a *header file*.
 An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension.
 
-# Keywords
+A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into an Abstract Syntax Tree, or AST).
 
-Crowbar has 26 keywords:
+# Scanning
+
+A *token* is one of the following kinds of token:
+- a *keyword*,
+- an *identifier*,
+- a *constant*,
+- a *string literal*,
+- or a *punctuator*.
+
+## Keywords
+
+A *keyword* is one of the following 28 literal words:
+- `bool`
 - `break`
 - `case`
 - `char`
@@ -23,6 +35,7 @@ Crowbar has 26 keywords:
 - `float`
 - `for`
 - `if`
+- `include`
 - `int`
 - `long`
 - `return`
@@ -35,3 +48,23 @@ Crowbar has 26 keywords:
 - `unsigned`
 - `void`
 - `while`
+
+## Identifiers
+
+An *identifier* is a sequence of one or more characters having Unicode categories within a legal set.
+
+The first character in an identifier must have one of the following Unicode categories:
+- Connector Punctuation (e.g. `_`)
+- Format Other (e.g. Zero-Width Joiner)
+- Lowercase Letter (e.g. `h`)
+- Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime)
+- Modifier Symbol (e.g. `^`, U+005E Circumflex Accent)
+- Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent)
+- Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef)
+- Titlecase Letter (e.g. `ǅ`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron)
+- Uppercase Letter (e.g. `B`)
+
+Subsequent characters may have any of the above-listed Unicode categories, or one of the following:
+- Decimal Digit Number (e.g. `0`)
+- Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four)
+- Other Number (e.g. `¼`, U+00BC Vulgar Fraction One Quarter)
-- 
cgit v1.2.3
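
Reviewer note, not part of the patch: the identifier rule added in the last hunk can be illustrated with a short Python sketch built on the standard `unicodedata` module. The names `START_CATEGORIES`, `CONTINUE_CATEGORIES`, and `is_identifier` are hypothetical, and the mapping from the category names in the patch to two-letter general-category codes (for example, reading "Format Other" as `Cf`) is an assumption. The sketch checks only the category rule; it makes no claim about how Crowbar treats keywords or any other token kind.

```python
import unicodedata

# Assumed mapping from the category names in the patch to Unicode
# general-category codes:
#   Pc Connector Punctuation, Cf Format, Ll Lowercase Letter,
#   Lm Modifier Letter, Sk Modifier Symbol, Mn Nonspacing Mark,
#   Lo Other Letter, Lt Titlecase Letter, Lu Uppercase Letter
START_CATEGORIES = frozenset({"Pc", "Cf", "Ll", "Lm", "Sk", "Mn", "Lo", "Lt", "Lu"})

# Later characters may also be Nd (Decimal Digit Number),
# Nl (Letter Number), or No (Other Number).
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "Nl", "No"}


def is_identifier(text: str) -> bool:
    """Return True if `text` satisfies the category rule sketched above."""
    if not text:
        # An identifier is one or more characters.
        return False
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in CONTINUE_CATEGORIES for ch in text[1:])
```

Under these assumptions, `is_identifier("_foo1")` holds because `_` is Connector Punctuation and the digit appears only after the first character, while `is_identifier("1abc")` fails because a Decimal Digit Number cannot start an identifier.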