From 5af481d62df80d8be3f5835042d30372ef9cbe04 Mon Sep 17 00:00:00 2001
From: Melody Horn
Date: Sat, 31 Oct 2020 21:59:00 -0600
Subject: define and annotate some language elements

---
 syntax.md | 196 --------------------------------------------------------------
 1 file changed, 196 deletions(-)

(limited to 'syntax.md')

diff --git a/syntax.md b/syntax.md
index 80fa54b..96f0b88 100644
--- a/syntax.md
+++ b/syntax.md
@@ -1,204 +1,8 @@
 # Syntax (old)
 
-The syntax of Crowbar mostly matches the syntax of C, with fewer obscure/advanced/edge case features.
-
-## Source Files
-
-A Crowbar source file is UTF-8.
-Crowbar source files can come in two varieties, an *implementation file* and a *header file*.
-An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension.
-
-A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into a parse tree).
-
-## Scanning
-
-A *token* is one of the following kinds of token:
-
-- a *keyword*,
-- an *identifier*,
-- a *constant*,
-- a *string literal*,
-- or a *punctuator*.
-
-Tokens are separated by either *whitespace* or a *comment*.
-
-### Keywords
-
-A *keyword* is one of the following literal words:
-
-- `bool`
-- `break`
-- `case`
-- `char`
-- `const`
-- `continue`
-- `default`
-- `do`
-- `double`
-- `else`
-- `enum`
-- `extern`
-- `float`
-- `for`
-- `fragile`
-- `function`
-- `if`
-- `include`
-- `int`
-- `long`
-- `return`
-- `short`
-- `signed`
-- `sizeof`
-- `struct`
-- `switch`
-- `unsigned`
-- `void`
-- `while`
-
-### Identifiers
-
-An *identifier* is a sequence of one or more characters having Unicode categories within a legal set.
-
-The first character in an identifier must have one of the following Unicode categories:
-
-- `Pc` Connector Punctuation (e.g. `_`)
-- `Ll` Lowercase Letter (e.g. `h`)
-- `Lm` Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime)
-- `Lo` Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef)
-- `Lt` Titlecase Letter (e.g. `Dž`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron)
-- `Lu` Uppercase Letter (e.g. `B`)
-- `Mn` Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent)
-- `Sk` Modifier Symbol (e.g. `^`, U+005E Circumflex Accent)
-
-Subsequent characters may have any of the above-listed Unicode categories, or one of the following:
-
-- `Nd` Decimal Digit Number (e.g. `0`)
-- `Nl` Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four)
-- `No` Other Number (e.g. `¼`, U+00BC Vulgar Fraction One Quarter)
-
-### Constants
-
-A *constant* can have one of six types:
-
-- a *decimal constant*, a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `_`};
-- a *binary constant*, a prefix (either `0b` or `0B`) followed by a sequence of characters drawn from the set {`0`, `1`, `_`};
-- an *octal constant*, the prefix `0o` followed by a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `_`};
-- a *hexadecimal constant*, a prefix (either `0x` or `0X`) followed by a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`, `_`};
-- a *floating-point constant*, a decimal constant followed by one of
-  - `.` followed by a decimal constant,
-  - either `e` or `E` followed by a decimal constant,
-  - or a `.` followed by a decimal constant followed by either an `e` or `E` followed by a decimal constant;
-- or a *character constant*, a `'` followed by either a single character or an *escape sequence* followed by another `'`.
-
-#### Escape Sequences
-
-The following sequences of characters are *escape sequences*:
-
-- `\'`
-- `\"`
-- `\\`
-- `\r`
-- `\n`
-- `\t`
-- `\0`
-- `\x` followed by two characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`}
-- `\u` followed by four characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`}
-- `\U` followed by eight characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`}
-
-### String Literals
-
-A *string literal* begins with a `"`.
-It then contains a sequence where each element is either an escape sequence or a character that is neither `"` nor `\`.
-It then ends with a `"`.
-
-### Punctuators
-
-The following sequences of characters form *punctuators*:
-
-- `[`
-- `]`
-- `(`
-- `)`
-- `{`
-- `}`
-- `.`
-- `,`
-- `+`
-- `-`
-- `*`
-- `/`
-- `%`
-- `;`
-- `!`
-- `&`
-- `|`
-- `^`
-- `~`
-- `>`
-- `<`
-- `=`
-- `->`
-- `++`
-- `--`
-- `>>`
-- `<<`
-- `<=`
-- `>=`
-- `==`
-- `!=`
-- `&&`
-- `||`
-- `+=`
-- `-=`
-- `*=`
-- `/=`
-- `%=`
-- `&=`
-- `|=`
-- `^=`
-
-### Whitespace
-
-A nonempty sequence of characters is considered to be *whitespace* if each character in it has a Unicode class of either Space Separator or Control Other.
-
-### Comments
-
-A *comment* can be either a *line comment* or a *block comment*.
-
-A *line comment* begins with the characters `//` if they occur outside of a string literal or comment, and ends with a newline character U+000A.
-
-A *block comment* begins with the characters `/*` if they occur outside of a string literal or comment, and ends with the characters `*/`.
-
-## Parsing
-
-The syntax of Crowbar is given as a [parsing expression grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar):
-
-### Entry points
-
-```PEG
-HeaderFile ← HeaderFileElement+
-HeaderFileElement ← IncludeStatement /
-                    TypeDeclaration /
-                    FunctionDeclaration
-
-ImplementationFile ← ImplementationFileElement+
-ImplementationFileElement ← HeaderFileElement /
-                            FunctionDefinition
-```
-
 ### Top-level elements
 
 ```PEG
-IncludeStatement ← 'include' string-literal ';'
-
-TypeDeclaration ← StructDeclaration /
-                  EnumDeclaration
-StructDeclaration ← 'struct' identifier '{' VariableDeclaration+ '}' ';'
-EnumDeclaration ← 'enum' identifier '{' EnumBody '}' ';'
-EnumBody ← identifier ('=' Expression)? ',' EnumBody /
-           identifier ('=' Expression)? ','?
-
 FunctionDeclaration ← FunctionSignature ';'
 FunctionDefinition ← FunctionSignature Block
 FunctionSignature ← Type identifier '(' SignatureArguments? ')'
-- 
cgit v1.2.3