The syntax of Crowbar is intended to eventually match most of the syntax of C, with fewer obscure, advanced, or edge-case features.

# Source Files

A Crowbar source file is encoded as UTF-8.
Crowbar source files come in two varieties: an *implementation file* and a *header file*.
An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension.
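As a rough illustration of the conventions above, the following Python sketch classifies a file by its extension and decodes its bytes as UTF-8. The function names `source_file_kind` and `decode_source` are hypothetical and not part of Crowbar; since the extensions are only conventional, a real tool might accept other names.

```python
from pathlib import PurePath


def source_file_kind(filename: str) -> str:
    """Guess the variety of a Crowbar source file from its extension.

    Hypothetical helper: the extensions are a convention, not a rule.
    """
    suffix = PurePath(filename).suffix
    if suffix == ".cro":
        return "implementation"
    if suffix == ".hro":
        return "header"
    return "unknown"


def decode_source(data: bytes) -> str:
    """Decode raw file contents as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return data.decode("utf-8")
```

For example, `source_file_kind("main.cro")` yields `"implementation"`, and `decode_source` rejects input that is not valid UTF-8 by raising an exception.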

A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into an Abstract Syntax Tree, or AST).

# Scanning

A *token* is one of the following:
- a *keyword*,
- an *identifier*,
- a *constant*,
- a *string literal*,
- or a *punctuator*.
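One way a scanner might represent these five kinds is sketched below in Python. The names `TokenKind` and `Token` are assumptions made for illustration; the specification does not prescribe any particular in-memory representation.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenKind(Enum):
    """The five kinds of token listed above (hypothetical naming)."""
    KEYWORD = auto()
    IDENTIFIER = auto()
    CONSTANT = auto()
    STRING_LITERAL = auto()
    PUNCTUATOR = auto()


@dataclass(frozen=True)
class Token:
    """A single scanned token: its kind and the source text it covers."""
    kind: TokenKind
    text: str
```

A scanner built on this sketch would emit a flat list of `Token` values, which the parser then turns into an AST.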

## Keywords

A *keyword* is one of the following 28 literal words:
- `bool`
- `break`
- `case`
- `char`
- `const`
- `continue`
- `default`
- `do`
- `double`
- `else`
- `enum`
- `extern`
- `float`
- `for`
- `if`
- `include`
- `int`
- `long`
- `return`
- `short`
- `signed`
- `sizeof`
- `struct`
- `switch`
- `typedef`
- `unsigned`
- `void`
- `while`
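The keyword list can be captured as a simple set lookup. This is a minimal Python sketch; the names `KEYWORDS` and `is_keyword` are hypothetical.

```python
# The 28 keywords listed above, stored as an immutable set for fast lookup.
KEYWORDS = frozenset({
    "bool", "break", "case", "char", "const", "continue", "default",
    "do", "double", "else", "enum", "extern", "float", "for", "if",
    "include", "int", "long", "return", "short", "signed", "sizeof",
    "struct", "switch", "typedef", "unsigned", "void", "while",
})


def is_keyword(word: str) -> bool:
    """Return True if `word` is exactly one of the listed keywords."""
    return word in KEYWORDS
```

The comparison is exact and case-sensitive, so `is_keyword("while")` holds while `is_keyword("While")` does not.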

## Identifiers

An *identifier* is a sequence of one or more characters, each of which has a Unicode category within a permitted set.

The first character in an identifier must have one of the following Unicode categories:
- Connector Punctuation (e.g. `_`)
- Format Other (e.g. U+200D Zero Width Joiner)
- Lowercase Letter (e.g. `h`)
- Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime)
- Modifier Symbol (e.g. `^`, U+005E Circumflex Accent)
- Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent)
- Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef)
- Titlecase Letter (e.g. `Dž`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron)
- Uppercase Letter (e.g. `B`)

Subsequent characters may have any of the above-listed Unicode categories, or one of the following:
- Decimal Digit Number (e.g. `0`)
- Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four)
- Other Number (e.g. `¼`, U+00BC Vulgar Fraction One Quarter)
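The two category lists above can be checked with Python's standard `unicodedata` module, which reports each character's two-letter general category code. The sketch below assumes the usual mapping from the category names above to those codes (for example, Connector Punctuation is `Pc` and Decimal Digit Number is `Nd`); the names `START_CATEGORIES`, `CONTINUE_CATEGORIES`, and `is_identifier` are hypothetical. It tests only the category rules and does not consider whether the text is also a keyword.

```python
import unicodedata

# Categories allowed for the first character:
# Pc, Cf, Ll, Lm, Sk, Mn, Lo, Lt, Lu (in the order listed above).
START_CATEGORIES = frozenset({"Pc", "Cf", "Ll", "Lm", "Sk", "Mn", "Lo", "Lt", "Lu"})

# Subsequent characters may additionally be Nd, Nl, or No.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "Nl", "No"}


def is_identifier(text: str) -> bool:
    """Check `text` against the identifier category rules only."""
    if not text:
        return False  # an identifier needs at least one character
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    return all(
        unicodedata.category(ch) in CONTINUE_CATEGORIES for ch in text[1:]
    )
```

Under these rules `_count2` and `א` are accepted, while `2count` is rejected because a Decimal Digit Number may not come first.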