author | Melody Horn <melody@boringcactus.com> | 2020-10-01 04:05:58 -0600 |
---|---|---|
committer | Melody Horn <melody@boringcactus.com> | 2020-10-01 04:05:58 -0600 |
commit | 6f187130dcfb283fc0b0622cf1af62827d8443de (patch) | |
tree | 45e9c34ddad1acd608dd1f5a09bb8f29531b3bad | |
parent | 70eee34f829875b4f67b0f21a629bdc591d59034 (diff) | |
define scanning, tokens, identifiers
-rw-r--r-- | syntax.md | 37 |
1 files changed, 35 insertions, 2 deletions
@@ -6,9 +6,21 @@ A Crowbar source file is UTF-8.
 Crowbar source files can come in two varieties, an *implementation file* and a *header file*.
 An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension.
 
-# Keywords
+A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into an Abstract Syntax Tree, or AST).
 
-Crowbar has 26 keywords:
+# Scanning
+
+A *token* is one of the following kinds of token:
+- a *keyword*,
+- an *identifier*,
+- a *constant*,
+- a *string literal*,
+- or a *punctuator*.
+
+## Keywords
+
+A *keyword* is one of the following 28 literal words:
+- `bool`
 - `break`
 - `case`
 - `char`
@@ -23,6 +35,7 @@ Crowbar has 26 keywords:
 - `float`
 - `for`
 - `if`
+- `include`
 - `int`
 - `long`
 - `return`
@@ -35,3 +48,23 @@ Crowbar has 26 keywords:
 - `unsigned`
 - `void`
 - `while`
+
+## Identifiers
+
+An *identifier* is a sequence of one or more characters having Unicode categories within a legal set.
+
+The first character in an identifier must have one of the following Unicode categories:
+- Connector Punctuation (e.g. `_`)
+- Format Other (e.g. Zero-Width Joiner)
+- Lowercase Letter (e.g. `h`)
+- Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime)
+- Modifier Symbol (e.g. `^`, U+005E Circumflex Accent)
+- Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent)
+- Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef)
+- Titlecase Letter (e.g. `Dž`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron)
+- Uppercase Letter (e.g. `B`)
+
+Subsequent characters may have any of the above-listed Unicode categories, or one of the following:
+- Decimal Digit Number (e.g. `0`)
+- Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four)
+- Other Number (e.g. `¼`, U+00BC Vulgar Fraction One Quarter)
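
The identifier rule added in this commit can be checked mechanically. Below is a minimal sketch in Python, not part of the Crowbar spec or any Crowbar implementation. The function name `is_identifier` and both set names are hypothetical. The two-letter codes are the Unicode general category abbreviations that appear to correspond to the category names in the diff (for example, "Format Other" is taken to mean `Cf`). The sketch does not exclude keywords, since the diff does not say how keywords and identifiers interact.

```python
import unicodedata

# Assumed mapping from the category names in the diff to Unicode
# general category codes. These sets are illustrative, not from the spec.
START_CATEGORIES = {
    "Pc",  # Connector Punctuation, e.g. "_"
    "Cf",  # Format Other, e.g. U+200D Zero-Width Joiner
    "Ll",  # Lowercase Letter
    "Lm",  # Modifier Letter
    "Sk",  # Modifier Symbol, e.g. "^"
    "Mn",  # Nonspacing Mark
    "Lo",  # Other Letter
    "Lt",  # Titlecase Letter
    "Lu",  # Uppercase Letter
}

# Subsequent characters may also be numbers of any of these three kinds.
CONTINUE_CATEGORIES = START_CATEGORIES | {
    "Nd",  # Decimal Digit Number
    "Nl",  # Letter Number
    "No",  # Other Number
}


def is_identifier(text: str) -> bool:
    """Return True if text matches the identifier rule sketched above."""
    if not text:
        # An identifier is one or more characters.
        return False
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    return all(
        unicodedata.category(ch) in CONTINUE_CATEGORIES for ch in text[1:]
    )
```

Under these assumptions, `is_identifier("foo_1")` and `is_identifier("^x")` hold, while `is_identifier("1foo")` and `is_identifier("a-b")` do not. Because `unicodedata` reflects the Unicode version bundled with the Python interpreter, results for recently assigned code points may vary between Python releases.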