author: Melody Horn <melody@boringcactus.com> 2020-10-01 04:05:58 -0600
committer: Melody Horn <melody@boringcactus.com> 2020-10-01 04:05:58 -0600
commit: 6f187130dcfb283fc0b0622cf1af62827d8443de (patch)
tree: 45e9c34ddad1acd608dd1f5a09bb8f29531b3bad
parent: 70eee34f829875b4f67b0f21a629bdc591d59034 (diff)
define scanning, tokens, identifiers
-rw-r--r-- syntax.md 37
1 file changed, 35 insertions, 2 deletions
diff --git a/syntax.md b/syntax.md
index ff38c19..91d2bac 100644
--- a/syntax.md
+++ b/syntax.md
@@ -6,9 +6,21 @@ A Crowbar source file is UTF-8.
Crowbar source files can come in two varieties, an *implementation file* and a *header file*.
An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension.
-# Keywords
+A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into an Abstract Syntax Tree, or AST).
-Crowbar has 26 keywords:
+# Scanning
+
+A *token* is one of the following:
+- a *keyword*,
+- an *identifier*,
+- a *constant*,
+- a *string literal*,
+- or a *punctuator*.
+
+## Keywords
+
+A *keyword* is one of the following 28 literal words:
+- `bool`
- `break`
- `case`
- `char`
@@ -23,6 +35,7 @@ Crowbar has 26 keywords:
- `float`
- `for`
- `if`
+- `include`
- `int`
- `long`
- `return`
@@ -35,3 +48,23 @@ Crowbar has 26 keywords:
- `unsigned`
- `void`
- `while`
+
+## Identifiers
+
+An *identifier* is a sequence of one or more characters, each of which has a Unicode general category drawn from the permitted sets listed below.
+
+The first character in an identifier must have one of the following Unicode categories:
+- Connector Punctuation (e.g. `_`)
+- Format Other (e.g. Zero-Width Joiner)
+- Lowercase Letter (e.g. `h`)
+- Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime)
+- Modifier Symbol (e.g. `^`, U+005E Circumflex Accent)
+- Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent)
+- Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef)
+- Titlecase Letter (e.g. `Dž`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron)
+- Uppercase Letter (e.g. `B`)
+
+Subsequent characters may have any of the above-listed Unicode categories, or one of the following:
+- Decimal Digit Number (e.g. `0`)
+- Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four)
+- Other Number (e.g. `¼`, U+00BC Vulgar Fraction One Quarter)
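
The identifier rule introduced by this patch can be sketched as a small check. This is a minimal illustration and not part of the spec or of any Crowbar implementation: the function name `is_identifier` and the two set names are hypothetical, and the two-letter codes are the standard Unicode general category abbreviations that correspond to the category names listed above. It relies on Python's `unicodedata.category`, whose results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Categories allowed for the first character (hypothetical constant name).
# Pc = Connector Punctuation, Cf = Format, Ll = Lowercase Letter,
# Lm = Modifier Letter, Sk = Modifier Symbol, Mn = Nonspacing Mark,
# Lo = Other Letter, Lt = Titlecase Letter, Lu = Uppercase Letter
START_CATEGORIES = {"Pc", "Cf", "Ll", "Lm", "Sk", "Mn", "Lo", "Lt", "Lu"}

# Subsequent characters may additionally be numbers.
# Nd = Decimal Digit Number, Nl = Letter Number, No = Other Number
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "Nl", "No"}


def is_identifier(text: str) -> bool:
    """Return True if `text` matches the identifier rule sketched above.

    Assumption: this checks only the category rule. It does not exclude
    keywords, because the patch text does not say how the two interact.
    """
    if not text:
        # An identifier needs at least one character.
        return False
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    return all(
        unicodedata.category(ch) in CONTINUE_CATEGORIES for ch in text[1:]
    )
```

Under this reading, `is_identifier("_foo1")` holds because `_` is Connector Punctuation and `1` is a Decimal Digit Number, while `is_identifier("1foo")` does not, because a digit may not appear first.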