
The syntax of Crowbar is intended to mostly match the syntax of C, while leaving out many of C's obscure, advanced, or edge-case features.

# Source Files

A Crowbar source file is encoded in UTF-8.
Crowbar source files come in two varieties: an *implementation file* and a *header file*.
An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension.

A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into an Abstract Syntax Tree, or AST).

# Scanning

A *token* is one of the following:
- a *keyword*,
- an *identifier*,
- a *constant*,
- a *string literal*,
- or a *punctuator*.

Tokens are separated by either *whitespace* or a *comment*.
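
As a rough illustration of the scanner's output, the five token kinds could be modelled as below. This is only a sketch: the names `TokenKind`, `Token`, and `example` are hypothetical and are not part of Crowbar itself.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenKind(Enum):
    """The five kinds of token listed above."""
    KEYWORD = auto()
    IDENTIFIER = auto()
    CONSTANT = auto()
    STRING_LITERAL = auto()
    PUNCTUATOR = auto()


@dataclass(frozen=True)
class Token:
    kind: TokenKind
    text: str  # the exact source text of the token


# Hand-written illustration of what scanning `return x;` might yield.
# Whitespace and comments only separate tokens, so they do not appear here.
example = [
    Token(TokenKind.KEYWORD, "return"),
    Token(TokenKind.IDENTIFIER, "x"),
    Token(TokenKind.PUNCTUATOR, ";"),
]
```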

## Keywords

A *keyword* is one of the following 28 literal words:
- `bool`
- `break`
- `case`
- `char`
- `const`
- `continue`
- `default`
- `do`
- `double`
- `else`
- `enum`
- `extern`
- `float`
- `for`
- `if`
- `include`
- `int`
- `long`
- `return`
- `short`
- `signed`
- `sizeof`
- `struct`
- `switch`
- `typedef`
- `unsigned`
- `void`
- `while`
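
A scanner might store the keyword list as a set and consult it whenever it reads a word. The sketch below assumes that a word matching a keyword is always scanned as a keyword rather than as an identifier, which this document does not state explicitly. The names `KEYWORDS` and `is_keyword` are hypothetical.

```python
# The 28 keywords listed above.
KEYWORDS = frozenset({
    "bool", "break", "case", "char", "const", "continue", "default",
    "do", "double", "else", "enum", "extern", "float", "for", "if",
    "include", "int", "long", "return", "short", "signed", "sizeof",
    "struct", "switch", "typedef", "unsigned", "void", "while",
})


def is_keyword(word: str) -> bool:
    """Return True if `word` is exactly one of the Crowbar keywords."""
    return word in KEYWORDS
```

Note that the match is case-sensitive and exact, so `While` and `whiles` are not keywords under this sketch.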

## Identifiers

An *identifier* is a sequence of one or more characters having Unicode categories within a legal set.

The first character in an identifier must have one of the following Unicode categories:
- Connector Punctuation (e.g. `_`)
- Format Other (e.g. Zero-Width Joiner)
- Lowercase Letter (e.g. `h`)
- Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime)
- Modifier Symbol (e.g. `^`, U+005E Circumflex Accent)
- Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent)
- Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef)
- Titlecase Letter (e.g. `ǅ`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron)
- Uppercase Letter (e.g. `B`)

Subsequent characters may have any of the above-listed Unicode categories, or one of the following:
- Decimal Digit Number (e.g. `0`)
- Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four)
- Other Number (e.g. `¼`, U+00BC Vulgar Fraction One Quarter)
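
The category rules above can be checked with Python's standard `unicodedata` module, which reports each character's two-letter general category abbreviation (for example `Lu` for Uppercase Letter). The following is a minimal sketch; `is_identifier` and the two constant names are hypothetical, and the function deliberately does not exclude keywords.

```python
import unicodedata

# Categories allowed for the first character:
# Pc = Connector Punctuation, Cf = Format, Ll = Lowercase Letter,
# Lm = Modifier Letter, Sk = Modifier Symbol, Mn = Nonspacing Mark,
# Lo = Other Letter, Lt = Titlecase Letter, Lu = Uppercase Letter.
START_CATEGORIES = frozenset({"Pc", "Cf", "Ll", "Lm", "Sk", "Mn", "Lo", "Lt", "Lu"})

# Subsequent characters may also be:
# Nd = Decimal Digit Number, Nl = Letter Number, No = Other Number.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "Nl", "No"}


def is_identifier(text: str) -> bool:
    """Check `text` against the identifier rules (keywords are not excluded)."""
    if not text:
        return False  # an identifier has one or more characters
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in CONTINUE_CATEGORIES for ch in text[1:])
```

Under these rules `_foo1` and `א` are identifiers, while `1abc` is not because a Decimal Digit Number cannot come first.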

## Constants

A *constant* is one of the following five kinds:
- a *decimal constant*, a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `_`};
- a *binary constant*, a prefix (either `0b` or `0B`) followed by a sequence of characters drawn from the set {`0`, `1`, `_`};
- a *hexadecimal constant*, a prefix (either `0x` or `0X`) followed by a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`, `_`};
- a *floating-point constant*, a decimal constant followed by one of
    - `.` followed by a decimal constant,
    - either `e` or `E` followed by a decimal constant,
    - or a `.` followed by a decimal constant followed by either an `e` or `E` followed by a decimal constant;
- or a *character constant*, a `'`, followed by either a single character or an *escape sequence*, followed by a closing `'`.
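
The five kinds of constant can be expressed as regular expressions. The sketch below makes several assumptions that the text does not settle: every "sequence of characters" is nonempty, the single character in a character constant is neither `'` nor `\`, and the more specific patterns are tried before the plain decimal one. The names `PATTERNS` and `classify_constant` are hypothetical.

```python
import re

DEC = r"[0-9_]+"  # assumption: at least one character
ESCAPE = r"\\(?:['\"\\rnt]|x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})"

# Tried in order; prefixed and floating-point forms come before plain decimal.
PATTERNS = [
    ("binary", re.compile(r"0[bB][01_]+")),
    ("hexadecimal", re.compile(r"0[xX][0-9A-Fa-f_]+")),
    ("floating-point", re.compile(rf"{DEC}(?:\.{DEC}(?:[eE]{DEC})?|[eE]{DEC})")),
    ("decimal", re.compile(DEC)),
    # assumption: the unescaped character may not be ' or \
    ("character", re.compile(rf"'(?:{ESCAPE}|[^'\\])'")),
]


def classify_constant(text: str):
    """Return the kind of constant that `text` is, or None if it is not one."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    return None
```

Read literally, the grammar has no sign in the exponent and requires digits after the `.`, so this sketch rejects both `1e-9` and `1.`.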

### Escape Sequences

The following sequences of characters are *escape sequences*:
- `\'`
- `\"`
- `\\`
- `\r`
- `\n`
- `\t`
- `\x` followed by two characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`}
- `\u` followed by four characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`}
- `\U` followed by eight characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`}
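
The list above only defines which escape sequences are well-formed, not what they mean. The sketch below recognizes a single escape sequence and decodes it under two assumptions: the simple escapes have their usual C meanings, and the hexadecimal forms name a Unicode code point. The names `ESCAPE_RE`, `SIMPLE`, and `decode_escape` are hypothetical.

```python
import re

# One capture group per form: simple, \x (2 digits), \u (4 digits), \U (8 digits).
ESCAPE_RE = re.compile(
    r"\\(?:(['\"\\rnt])|x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))"
)

# Assumption: these carry their conventional C meanings.
SIMPLE = {"'": "'", '"': '"', "\\": "\\", "r": "\r", "n": "\n", "t": "\t"}


def decode_escape(seq: str) -> str:
    """Decode one escape sequence; raise ValueError if it is malformed."""
    m = ESCAPE_RE.fullmatch(seq)
    if m is None:
        raise ValueError(f"not an escape sequence: {seq!r}")
    simple, *hex_groups = m.groups()
    if simple is not None:
        return SIMPLE[simple]
    digits = next(g for g in hex_groups if g is not None)
    # Assumption: the hex digits are a code point. chr() itself raises
    # ValueError for values above U+10FFFF.
    return chr(int(digits, 16))
```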

## String Literals

A *string literal* begins with a `"`.
It then contains a sequence where each element is either an escape sequence or a character that is neither `"` nor `\`.
It then ends with a `"`.
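
This three-part definition maps directly onto a regular expression. The following is a minimal sketch; `STRING_RE` and `is_string_literal` are hypothetical names.

```python
import re

ESCAPE = r"\\(?:['\"\\rnt]|x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})"

# An opening quote, then any number of escape sequences or characters that
# are neither " nor \, then a closing quote.
STRING_RE = re.compile(rf'"(?:{ESCAPE}|[^"\\])*"')


def is_string_literal(text: str) -> bool:
    """Return True if `text` is exactly one well-formed string literal."""
    return STRING_RE.fullmatch(text) is not None
```

Because the definition only excludes `"` and `\` from the unescaped characters, this sketch also accepts a literal that contains a raw newline.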

## Punctuators

The following sequences of characters form *punctuators*:
- `[`
- `]`
- `(`
- `)`
- `{`
- `}`
- `.`
- `+`
- `-`
- `*`
- `/`
- `%`
- `;`
- `!`
- `&`
- `|`
- `^`
- `~`
- `>`
- `<`
- `=`
- `->`
- `++`
- `--`
- `>>`
- `<<`
- `<=`
- `>=`
- `==`
- `!=`
- `&&`
- `||`
- `+=`
- `-=`
- `*=`
- `/=`
- `%=`
- `&=`
- `|=`
- `^=`
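
Several punctuators are prefixes of others (for example `-` and `->`), so a scanner has to decide which one to take. The sketch below assumes the longest-match rule conventional in C, which this document does not state explicitly. The names `PUNCTUATORS` and `match_punctuator` are hypothetical.

```python
# The 40 punctuators listed above.
PUNCTUATORS = [
    "[", "]", "(", ")", "{", "}", ".", "+", "-", "*", "/", "%", ";",
    "!", "&", "|", "^", "~", ">", "<", "=",
    "->", "++", "--", ">>", "<<", "<=", ">=", "==", "!=", "&&", "||",
    "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=",
]

# Try longer punctuators first so that `->` wins over `-` (assumed rule).
_BY_LENGTH = sorted(PUNCTUATORS, key=len, reverse=True)


def match_punctuator(source: str, pos: int):
    """Return the longest punctuator starting at `pos`, or None."""
    for punctuator in _BY_LENGTH:
        if source.startswith(punctuator, pos):
            return punctuator
    return None
```

Since `>>=` is not in the list, `x >>= 1` would scan as `>>` followed by `=` under this sketch.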

## Whitespace

A nonempty sequence of characters is considered to be *whitespace* if each character in it has a Unicode category of either Space Separator or Control Other.
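
This rule can be checked with Python's `unicodedata` module, where Space Separator is abbreviated `Zs` and Control is abbreviated `Cc`. The following is a minimal sketch; `is_whitespace` is a hypothetical name.

```python
import unicodedata

# Zs = Space Separator (e.g. U+0020), Cc = Control (e.g. tab, newline).
WHITESPACE_CATEGORIES = frozenset({"Zs", "Cc"})


def is_whitespace(text: str) -> bool:
    """Return True if `text` is nonempty and consists only of whitespace."""
    return bool(text) and all(
        unicodedata.category(ch) in WHITESPACE_CATEGORIES for ch in text
    )
```

Tab and newline count as whitespace because they are control characters, whereas U+2028 Line Separator does not, since its category is Line Separator rather than Space Separator.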

## Comments

A *comment* can be either a *line comment* or a *block comment*.

A *line comment* begins with the characters `//`, provided they occur outside of a string literal, and ends with the next newline character (U+000A).

A *block comment* begins with the characters `/*`, provided they occur outside of a string literal, and ends with the next occurrence of the characters `*/`.
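
A scanner can skip both kinds of comment with a simple search for the terminator. The sketch below assumes that a line comment may also end at the end of the file, that block comments do not nest, and that an unterminated block comment is an error, none of which this document states. It also assumes the caller only invokes it outside of a string literal. The name `skip_comment` is hypothetical.

```python
def skip_comment(source: str, pos: int) -> int:
    """Return the index just past the comment starting at `pos`.

    Returns `pos` unchanged if no comment starts there.
    """
    if source.startswith("//", pos):
        end = source.find("\n", pos)
        # Assumption: a line comment on the last line ends at end of file.
        return len(source) if end == -1 else end + 1
    if source.startswith("/*", pos):
        # Search after the opener so that `/*/` does not close itself.
        end = source.find("*/", pos + 2)
        if end == -1:
            raise ValueError("unterminated block comment")  # assumed behaviour
        return end + 2
    return pos
```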