From 1f20ab0d5fe29276a6e55e8bd9aa3e1d967aafdf Mon Sep 17 00:00:00 2001 From: Melody Horn Date: Sun, 25 Oct 2020 11:40:21 -0600 Subject: fucking windows line endings smdh --- .build.yml | 30 +-- errors.md | 2 +- index.md | 74 +++--- safety.md | 184 +++++++-------- syntax.md | 696 +++++++++++++++++++++++++++---------------------------- tagged-unions.md | 2 +- types.md | 2 +- vs-c.md | 140 +++++------ 8 files changed, 565 insertions(+), 565 deletions(-) diff --git a/.build.yml b/.build.yml index e40edc3..73562ea 100644 --- a/.build.yml +++ b/.build.yml @@ -1,15 +1,15 @@ -image: debian/stable -packages: - - pandoc - - wkhtmltopdf - - poppler-utils -sources: - - https://git.sr.ht/~boringcactus/crowbar-spec -tasks: - - page-count: | - cd crowbar-spec - pandoc -s -o ../spec.pdf -t html -M "title=Crowbar Specification" *.md - cd .. - pdfinfo spec.pdf | grep Pages -artifacts: - - spec.pdf +image: debian/stable +packages: + - pandoc + - wkhtmltopdf + - poppler-utils +sources: + - https://git.sr.ht/~boringcactus/crowbar-spec +tasks: + - page-count: | + cd crowbar-spec + pandoc -s -o ../spec.pdf -t html -M "title=Crowbar Specification" *.md + cd .. + pdfinfo spec.pdf | grep Pages +artifacts: + - spec.pdf diff --git a/errors.md b/errors.md index 1ea1912..1333ed7 100644 --- a/errors.md +++ b/errors.md @@ -1 +1 @@ -TODO +TODO diff --git a/index.md b/index.md index b63a778..6692800 100644 --- a/index.md +++ b/index.md @@ -1,37 +1,37 @@ -Crowbar: the good parts of C, with a little bit extra. - -**This is entirely a work-in-progress, and should not be relied upon to be stable (or even true) in any way.** - -Crowbar is a language that is derived from (and, wherever possible, interoperable with) C, and aims to remove as many [footgun](https://en.wiktionary.org/wiki/footgun)s and as much needless complexity from C as possible while still being familiar to C developers. 
- -Ideally, a typical C codebase should be straightforward to rewrite in Crowbar, and any atypical C constructions not supported by Crowbar can be left as C. - -# Context - -- [Rust is not a good C replacement](https://drewdevault.com/2019/03/25/Rust-is-not-a-good-C-replacement.html) - -# cactus's Blog Posts - -- [Crowbar: Defining a good C replacement](https://www.boringcactus.com/2020/09/28/crowbar-1-defining-a-c-replacement.html) -- [Crowbar: Simplifying C's type names](https://www.boringcactus.com/2020/10/13/crowbar-2-simplifying-c-type-names.html) -- [Crowbar: Turns out, language development is hard](https://www.boringcactus.com/2020/10/19/crowbar-3-this-is-tough.html) - -# Comparison with C - -The [comparison with C](vs-c.md) is an informal overview of the places where Crowbar and C diverge. - -# Syntax - -[Read the Syntax chapter of the spec.](syntax.md) - -# Semantics - -TODO - -# Discuss - -- [announcement mailing list](https://lists.sr.ht/~boringcactus/crowbar-lang-announce) -- [permanent discussion mailing list](https://lists.sr.ht/~boringcactus/crowbar-lang-devel) -- ephemeral discussions via IRC: #crowbar-lang on freenode ([join via irc](ircs://chat.freenode.net/#crowbar-lang), [join via web](https://webchat.freenode.net/#crowbar-lang)) - -[![Creative Commons BY-SA License](https://i.creativecommons.org/l/by-sa/4.0/80x15.png)](http://creativecommons.org/licenses/by-sa/4.0/) +Crowbar: the good parts of C, with a little bit extra. + +**This is entirely a work-in-progress, and should not be relied upon to be stable (or even true) in any way.** + +Crowbar is a language that is derived from (and, wherever possible, interoperable with) C, and aims to remove as many [footgun](https://en.wiktionary.org/wiki/footgun)s and as much needless complexity from C as possible while still being familiar to C developers. 
+ +Ideally, a typical C codebase should be straightforward to rewrite in Crowbar, and any atypical C constructions not supported by Crowbar can be left as C. + +# Motivation + +- [Rust is not a good C replacement](https://drewdevault.com/2019/03/25/Rust-is-not-a-good-C-replacement.html) + +# Journal + +- [Crowbar: Defining a good C replacement](https://www.boringcactus.com/2020/09/28/crowbar-1-defining-a-c-replacement.html) +- [Crowbar: Simplifying C's type names](https://www.boringcactus.com/2020/10/13/crowbar-2-simplifying-c-type-names.html) +- [Crowbar: Turns out, language development is hard](https://www.boringcactus.com/2020/10/19/crowbar-3-this-is-tough.html) + +# Comparison with C + +The [comparison with C](vs-c.md) is an informal overview of the places where Crowbar and C diverge. + +# Syntax + +[Read the Syntax chapter of the spec.](syntax.md) + +# Semantics + +TODO + +# Discuss + +- [announcement mailing list](https://lists.sr.ht/~boringcactus/crowbar-lang-announce) +- [permanent discussion mailing list](https://lists.sr.ht/~boringcactus/crowbar-lang-devel) +- ephemeral discussions via IRC: #crowbar-lang on freenode ([join via irc](ircs://chat.freenode.net/#crowbar-lang), [join via web](https://webchat.freenode.net/#crowbar-lang)) + +[![Creative Commons BY-SA License](https://i.creativecommons.org/l/by-sa/4.0/80x15.png)](http://creativecommons.org/licenses/by-sa/4.0/) diff --git a/safety.md b/safety.md index b8a2303..845c45b 100644 --- a/safety.md +++ b/safety.md @@ -1,92 +1,92 @@ -Each item in Wikipedia's [list of types of memory errors](https://en.wikipedia.org/wiki/Memory_safety#Types_of_memory_errors) and what Crowbar does to prevent them. - -In general, Crowbar does its best to ensure that code will not exhibit any of the following memory errors. -However, sometimes the compiler knows less than the programmer, and so code that looks dangerous is actually fine. -Crowbar allows programmers to suspend the memory safety checks with the `fragile` keyword. 
- -# Access errors - -## Buffer overflow - -Crowbar addresses buffer overflow with bounds checking. -In C, the type `char *` can point to a single character, a null-terminated string of unknown length, a buffer of fixed size, or nothing at all. -In Crowbar, the type `char *` can only point to either a single character or nothing at all. -If a buffer is declared as `char[50] name;` then it has type `char[50]`, and can be implicitly converted to `(char[50])*`, a pointer-to-50-chars. -If memory is dynamically allocated, it works as follows: - -```crowbar -void process(size_t bufferSize, char[bufferSize] buffer) { - // do some work with buffer, given that we know its size -} - -int main(int argc, (char[1024?])[argc] argv) { - size_t bufferSize = getBufferSize(); - (char[bufferSize])* buffer = malloc(bufferSize); - process(bufferSize, buffer); - free(buffer); -} -``` - -Note that `malloc` as part of the Crowbar standard library has signature `(void[size])* malloc(size_t size);` and so no cast is needed above. -In C, `buffer` in `main` would have type pointer-to-VLA-of-char, but `buffer` in `process` would have type VLA-of-char, and this conversion would emit a compiler warning. -However, in Crowbar, a `(T[N])*` is always implicitly convertible to `T[N]`, so no warning exists. -(This is translated into C by dereferencing `buffer` in `main`.) - -Note as well that the type of `argv` is complicated. -This is because the elements of `argv` have unconstrained size. -TODO figure out if that's the right way to handle that - -## Buffer over-read - -bounds checking again - -## Race condition - -uhhhhh 🤷‍♀️ - -## Page fault - -bounds checking, dubious-pointer checking - -## Use after free - -`free(x);` not followed by `x = NULL;` is a compiler error.
-`owned` and `borrowed` keywords - -# Uninitialized variables - -forbid them in syntax - -## Null pointer dereference - -dubious-pointer checking - -## Wild pointers - -dubious-pointer checking - -# Memory leak - -## Stack exhaustion - -uhhhhhh 🤷‍♀️ - -## Heap exhaustion - -that counts as error handling, just the `malloc`-shaped kind - -## Double free - -this is just use-after-free but the use is calling free on it - -## Invalid free - -don't do that - -## Mismatched free - -how does that even happen - -## Unwanted aliasing - -uhhh don't do that? +Each item in Wikipedia's [list of types of memory errors](https://en.wikipedia.org/wiki/Memory_safety#Types_of_memory_errors) and what Crowbar does to prevent them. + +In general, Crowbar does its best to ensure that code will not exhibit any of the following memory errors. +However, sometimes the compiler knows less than the programmer, and so code that looks dangerous is actually fine. +Crowbar allows programmers to suspend the memory safety checks with the `fragile` keyword. + +# Access errors + +## Buffer overflow + +Crowbar addresses buffer overflow with bounds checking. +In C, the type `char *` can point to a single character, a null-terminated string of unknown length, a buffer of fixed size, or nothing at all. +In Crowbar, the type `char *` can only point to either a single character or nothing at all. +If a buffer is declared as `char[50] name;` then it has type `char[50]`, and can be implicitly converted to `(char[50])*`, a pointer-to-50-chars.
+If memory is dynamically allocated, it works as follows: + +```crowbar +void process(size_t bufferSize, char[bufferSize] buffer) { + // do some work with buffer, given that we know its size +} + +int main(int argc, (char[1024?])[argc] argv) { + size_t bufferSize = getBufferSize(); + (char[bufferSize])* buffer = malloc(bufferSize); + process(bufferSize, buffer); + free(buffer); +} +``` + +Note that `malloc` as part of the Crowbar standard library has signature `(void[size])* malloc(size_t size);` and so no cast is needed above. +In C, `buffer` in `main` would have type pointer-to-VLA-of-char, but `buffer` in `process` would have type VLA-of-char, and this conversion would emit a compiler warning. +However, in Crowbar, a `(T[N])*` is always implicitly convertible to `T[N]`, so no warning exists. +(This is translated into C by dereferencing `buffer` in `main`.) + +Note as well that the type of `argv` is complicated. +This is because the elements of `argv` have unconstrained size. +TODO figure out if that's the right way to handle that + +## Buffer over-read + +bounds checking again + +## Race condition + +uhhhhh 🤷‍♀️ + +## Page fault + +bounds checking, dubious-pointer checking + +## Use after free + +`free(x);` not followed by `x = NULL;` is a compiler error. +`owned` and `borrowed` keywords + +# Uninitialized variables + +forbid them in syntax + +## Null pointer dereference + +dubious-pointer checking + +## Wild pointers + +dubious-pointer checking + +# Memory leak + +## Stack exhaustion + +uhhhhhh 🤷‍♀️ + +## Heap exhaustion + +that counts as error handling, just the `malloc`-shaped kind + +## Double free + +this is just use-after-free but the use is calling free on it + +## Invalid free + +don't do that + +## Mismatched free + +how does that even happen + +## Unwanted aliasing + +uhhh don't do that?
diff --git a/syntax.md b/syntax.md index c195811..e0ecea5 100644 --- a/syntax.md +++ b/syntax.md @@ -1,348 +1,348 @@ -The syntax of Crowbar mostly matches the syntax of C, with fewer obscure/advanced/edge case features. - -# Source Files - -A Crowbar source file is UTF-8. -Crowbar source files can come in two varieties, an *implementation file* and a *header file*. -An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension. - -A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into a parse tree). - -# Scanning - -A *token* is one of the following kinds of token: -- a *keyword*, -- an *identifier*, -- a *constant*, -- a *string literal*, -- or a *punctuator*. - -Tokens are separated by either *whitespace* or a *comment*. - -## Keywords - -A *keyword* is one of the following literal words: -- `bool` -- `break` -- `case` -- `char` -- `const` -- `continue` -- `default` -- `do` -- `double` -- `else` -- `enum` -- `extern` -- `float` -- `for` -- `fragile` -- `function` -- `if` -- `include` -- `int` -- `long` -- `return` -- `short` -- `signed` -- `sizeof` -- `struct` -- `switch` -- `typedef` -- `unsigned` -- `void` -- `while` - -## Identifiers - -An *identifier* is a sequence of one or more characters having Unicode categories within a legal set. - -The first character in an identifier must have one of the following Unicode categories: -- `Pc` Connector Punctuation (e.g. `_`) -- `Ll` Lowercase Letter (e.g. `h`) -- `Lm` Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime) -- `Lo` Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef) -- `Lt` Titlecase Letter (e.g. `ǅ`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron) -- `Lu` Uppercase Letter (e.g. `B`) -- `Mn` Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent) -- `Sk` Modifier Symbol (e.g.
`^`, U+005E Circumflex Accent) - -Subsequent characters may have any of the above-listed Unicode categories, or one of the following: -- `Nd` Decimal Digit Number (e.g. `0`) -- `Nl` Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four) -- `No` Other Number (e.g. `¼`, U+00BC Vulgar Fraction One Quarter) - -## Constants - -A *constant* can have one of six types: -- a *decimal constant*, a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `_`}; -- a *binary constant*, a prefix (either `0b` or `0B`) followed by a sequence of characters drawn from the set {`0`, `1`, `_`}; -- an *octal constant*, the prefix `0o` followed by a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `_`}; -- a *hexadecimal constant*, a prefix (either `0x` or `0X`) followed by a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`, `_`}; -- a *floating-point constant*, a decimal constant followed by one of - - `.` followed by a decimal constant, - - either `e` or `E` followed by a decimal constant, - - or a `.` followed by a decimal constant followed by either an `e` or `E` followed by a decimal constant; -- or a *character constant*, a `'` followed by either a single character or an *escape sequence* followed by another `'`.
- -### Escape Sequences - -The following sequences of characters are *escape sequences*: -- `\'` -- `\"` -- `\\` -- `\r` -- `\n` -- `\t` -- `\0` -- `\x` followed by two characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`} -- `\u` followed by four characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`} -- `\U` followed by eight characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`} - -## String Literals - -A *string literal* begins with a `"`. -It then contains a sequence where each element is either an escape sequence or a character that is neither `"` nor `\`. -It then ends with a `"`. - -## Punctuators - -The following sequences of characters form *punctuators*: -- `[` -- `]` -- `(` -- `)` -- `{` -- `}` -- `.` -- `,` -- `+` -- `-` -- `*` -- `/` -- `%` -- `;` -- `!` -- `&` -- `|` -- `^` -- the tilde, `~` (given special treatment on this line due to [a bug in the Markdown renderer that sr.ht uses](https://github.com/miyuchina/mistletoe/issues/91)) -- `>` -- `<` -- `=` -- `->` -- `++` -- `--` -- `>>` -- `<<` -- `<=` -- `>=` -- `==` -- `!=` -- `&&` -- `||` -- `+=` -- `-=` -- `*=` -- `/=` -- `%=` -- `&=` -- `|=` -- `^=` - -## Whitespace - -A nonempty sequence of characters is considered to be *whitespace* if each character in it has a Unicode class of either Space Separator or Control Other. - -## Comments - -A *comment* can be either a *line comment* or a *block comment*. - -A *line comment* begins with the characters `//` if they occur outside of a string literal or comment, and ends with a newline character U+000A. - -A *block comment* begins with the characters `/*` if they occur outside of a string literal or comment, and ends with the characters `*/`. 
- -# Parsing - -The syntax of Crowbar is given as a [parsing expression grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar): - -## Entry points - -``` -HeaderFile ← HeaderFileElement+ -HeaderFileElement ← IncludeStatement / - TypeDeclaration / - FunctionDeclaration - -ImplementationFile ← ImplementationFileElement+ -ImplementationFileElement ← HeaderFileElement / - FunctionDefinition -``` - -## Top-level elements - -``` -IncludeStatement ← 'include' string-literal ';' - -TypeDeclaration ← StructDeclaration / - EnumDeclaration / - TypedefDeclaration -StructDeclaration ← 'struct' identifier '{' VariableDeclaration+ '}' ';' -EnumDeclaration ← 'enum' identifier '{' EnumBody '}' ';' -EnumBody ← identifier ('=' Expression)? ',' EnumBody / - identifier ('=' Expression)? ','? -TypedefDeclaration ← 'typedef' identifier '=' Type ';' - -FunctionDeclaration ← FunctionSignature ';' -FunctionDefinition ← FunctionSignature Block -FunctionSignature ← Type identifier '(' SignatureArguments? ')' -SignatureArguments ← Type identifier ',' SignatureArguments / - Type identifier ','? -``` - -## Statements - -``` -Block ← '{' Statement* '}' - -Statement ← VariableDefinition / - VariableDeclaration / - IfStatement / - SwitchStatement / - WhileStatement / - DoWhileStatement / - ForStatement / - FlowControlStatement / - AssignmentStatement / - ExpressionStatement - -VariableDefinition ← Type identifier '=' Expression ';' -VariableDeclaration ← Type identifier ';' - -IfStatement ← 'if' Expression Block 'else' Block / - 'if' Expression Block - -SwitchStatement ← 'switch' Expression '{' SwitchCase+ '}' -SwitchCase ← CaseSpecifier Block / - 'default' Block -CaseSpecifier ← 'case' Expression ',' CaseSpecifier / - 'case' Expression ','? - -WhileStatement ← 'while' Expression Block -DoWhileStatement ← 'do' Block 'while' Expression ';' -ForStatement ← 'for' VariableDefinition? ';' Expression ';' AssignmentStatementBody? 
Block - -FlowControlStatement ← 'continue' ';' / - 'break' ';' / - 'return' Expression? ';' - -AssignmentStatement ← AssignmentStatementBody ';' -AssignmentStatementBody ← AssignmentTargetExpression '=' Expression / - AssignmentTargetExpression '+=' Expression / - AssignmentTargetExpression '-=' Expression / - AssignmentTargetExpression '*=' Expression / - AssignmentTargetExpression '/=' Expression / - AssignmentTargetExpression '%=' Expression / - AssignmentTargetExpression '&=' Expression / - AssignmentTargetExpression '^=' Expression / - AssignmentTargetExpression '|=' Expression / - AssignmentTargetExpression '++' / - AssignmentTargetExpression '--' - -ExpressionStatement ← Expression ';' -``` - -## Types - -``` -Type ← 'const' BasicType / - BasicType '*' / - BasicType '[' Expression ']' / - BasicType 'function' '(' (BasicType ',')* ')' / - BasicType -BasicType ← 'void' / - IntegerType / - 'signed' IntegerType / - 'unsigned' IntegerType / - 'float' / - 'double' / - 'bool' / - 'struct' identifier / - 'enum' identifier / - 'typedef' identifier / - '(' Type ')' -IntegerType ← 'char' / - 'short' / - 'int' / - 'long' -``` - -## Expressions - -``` -AssignmentTargetExpression ← identifier ATEElementSuffix* -ATEElementSuffix ← '[' Expression ']' / - '.' identifier / - '->' identifier - -AtomicExpression ← identifier / - constant / - string-literal / - '(' Expression ')' - -ObjectExpression ← AtomicExpression ObjectSuffix* / - ArrayLiteralExpression / - StructLiteralExpression -ObjectSuffix ← '[' Expression ']' / - '(' CommasExpressionList? ')' / - '.' identifier / - '->' identifier -CommasExpressionList ← Expression ',' CommasExpressionList? / - Expression ','? -ArrayLiteralExpression ← '{' CommasExpressionList '}' -StructLiteralExpression ← '{' StructLiteralBody '}' -StructLiteralBody ← StructLiteralElement ',' StructLiteralBody? / - StructLiteralElement ','? -StructLiteralElement ← '.' 
identifier '=' Expression - -FactorExpression ← '(' Type ')' FactorExpression / - '&' FactorExpression / - '*' FactorExpression / - '+' FactorExpression / - '-' FactorExpression / - '~' FactorExpression / - '!' FactorExpression / - 'sizeof' FactorExpression / - 'sizeof' Type / - ObjectExpression - -TermExpression ← FactorExpression TermSuffix* -TermSuffix ← '*' FactorExpression / - '/' FactorExpression / - '%' FactorExpression - -ArithmeticExpression ← TermExpression ArithmeticSuffix* -ArithmeticSuffix ← '+' TermExpression / - '-' TermExpression - -BitwiseOpExpression ← ArithmeticExpression '<<' ArithmeticExpression / - ArithmeticExpression '>>' ArithmeticExpression / - ArithmeticExpression '^' ArithmeticExpression / - ArithmeticExpression ('&' ArithmeticExpression)+ / - ArithmeticExpression ('|' ArithmeticExpression)+ / - ArithmeticExpression - -ComparisonExpression ← BitwiseOpExpression '==' BitwiseOpExpression / - BitwiseOpExpression '!=' BitwiseOpExpression / - BitwiseOpExpression '<=' BitwiseOpExpression / - BitwiseOpExpression '>=' BitwiseOpExpression / - BitwiseOpExpression '<' BitwiseOpExpression / - BitwiseOpExpression '>' BitwiseOpExpression / - BitwiseOpExpression - -Expression ← ComparisonExpression ('&&' ComparisonExpression)+ / - ComparisonExpression ('||' ComparisonExpression)+ / - ComparisonExpression -``` - -[![Creative Commons BY-SA License](https://i.creativecommons.org/l/by-sa/4.0/80x15.png)](http://creativecommons.org/licenses/by-sa/4.0/) +The syntax of Crowbar mostly matches the syntax of C, with fewer obscure/advanced/edge case features. + +# Source Files + +A Crowbar source file is UTF-8. +Crowbar source files can come in two varieties, an *implementation file* and a *header file*. +An implementation file conventionally has a `.cro` extension, and a header file conventionally has a `.hro` extension. 
+ +A Crowbar source file is read into memory in two phases: *scanning* (which converts text into an unstructured sequence of tokens) and *parsing* (which converts an unstructured sequence of tokens into a parse tree). + +# Scanning + +A *token* is one of the following kinds of token: +- a *keyword*, +- an *identifier*, +- a *constant*, +- a *string literal*, +- or a *punctuator*. + +Tokens are separated by either *whitespace* or a *comment*. + +## Keywords + +A *keyword* is one of the following literal words: +- `bool` +- `break` +- `case` +- `char` +- `const` +- `continue` +- `default` +- `do` +- `double` +- `else` +- `enum` +- `extern` +- `float` +- `for` +- `fragile` +- `function` +- `if` +- `include` +- `int` +- `long` +- `return` +- `short` +- `signed` +- `sizeof` +- `struct` +- `switch` +- `typedef` +- `unsigned` +- `void` +- `while` + +## Identifiers + +An *identifier* is a sequence of one or more characters having Unicode categories within a legal set. + +The first character in an identifier must have one of the following Unicode categories: +- `Pc` Connector Punctuation (e.g. `_`) +- `Ll` Lowercase Letter (e.g. `h`) +- `Lm` Modifier Letter (e.g. `ʹ`, U+02B9 Modifier Letter Prime) +- `Lo` Other Letter (e.g. `א`, U+05D0 Hebrew Letter Alef) +- `Lt` Titlecase Letter (e.g. `ǅ`, U+01C5 Latin Capital Letter D With Small Letter Z With Caron) +- `Lu` Uppercase Letter (e.g. `B`) +- `Mn` Nonspacing Mark (e.g. ` ̂`, U+0302 Combining Circumflex Accent) +- `Sk` Modifier Symbol (e.g. `^`, U+005E Circumflex Accent) + +Subsequent characters may have any of the above-listed Unicode categories, or one of the following: +- `Nd` Decimal Digit Number (e.g. `0`) +- `Nl` Letter Number (e.g. `Ⅳ`, U+2163 Roman Numeral Four) +- `No` Other Number (e.g.
`¼`, U+00BC Vulgar Fraction One Quarter) + +## Constants + +A *constant* can have one of six types: +- a *decimal constant*, a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `_`}; +- a *binary constant*, a prefix (either `0b` or `0B`) followed by a sequence of characters drawn from the set {`0`, `1`, `_`}; +- an *octal constant*, the prefix `0o` followed by a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `_`}; +- a *hexadecimal constant*, a prefix (either `0x` or `0X`) followed by a sequence of characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`, `_`}; +- a *floating-point constant*, a decimal constant followed by one of + - `.` followed by a decimal constant, + - either `e` or `E` followed by a decimal constant, + - or a `.` followed by a decimal constant followed by either an `e` or `E` followed by a decimal constant; +- or a *character constant*, a `'` followed by either a single character or an *escape sequence* followed by another `'`. + +### Escape Sequences + +The following sequences of characters are *escape sequences*: +- `\'` +- `\"` +- `\\` +- `\r` +- `\n` +- `\t` +- `\0` +- `\x` followed by two characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`} +- `\u` followed by four characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`} +- `\U` followed by eight characters drawn from the set {`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `A`, `a`, `B`, `b`, `C`, `c`, `D`, `d`, `E`, `e`, `F`, `f`} + +## String Literals + +A *string literal* begins with a `"`. +It then contains a sequence where each element is either an escape sequence or a character that is neither `"` nor `\`. +It then ends with a `"`.
+ +## Punctuators + +The following sequences of characters form *punctuators*: +- `[` +- `]` +- `(` +- `)` +- `{` +- `}` +- `.` +- `,` +- `+` +- `-` +- `*` +- `/` +- `%` +- `;` +- `!` +- `&` +- `|` +- `^` +- the tilde, `~` (given special treatment on this line due to [a bug in the Markdown renderer that sr.ht uses](https://github.com/miyuchina/mistletoe/issues/91)) +- `>` +- `<` +- `=` +- `->` +- `++` +- `--` +- `>>` +- `<<` +- `<=` +- `>=` +- `==` +- `!=` +- `&&` +- `||` +- `+=` +- `-=` +- `*=` +- `/=` +- `%=` +- `&=` +- `|=` +- `^=` + +## Whitespace + +A nonempty sequence of characters is considered to be *whitespace* if each character in it has a Unicode class of either Space Separator or Control Other. + +## Comments + +A *comment* can be either a *line comment* or a *block comment*. + +A *line comment* begins with the characters `//` if they occur outside of a string literal or comment, and ends with a newline character U+000A. + +A *block comment* begins with the characters `/*` if they occur outside of a string literal or comment, and ends with the characters `*/`. + +# Parsing + +The syntax of Crowbar is given as a [parsing expression grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar): + +## Entry points + +``` +HeaderFile ← HeaderFileElement+ +HeaderFileElement ← IncludeStatement / + TypeDeclaration / + FunctionDeclaration + +ImplementationFile ← ImplementationFileElement+ +ImplementationFileElement ← HeaderFileElement / + FunctionDefinition +``` + +## Top-level elements + +``` +IncludeStatement ← 'include' string-literal ';' + +TypeDeclaration ← StructDeclaration / + EnumDeclaration / + TypedefDeclaration +StructDeclaration ← 'struct' identifier '{' VariableDeclaration+ '}' ';' +EnumDeclaration ← 'enum' identifier '{' EnumBody '}' ';' +EnumBody ← identifier ('=' Expression)? ',' EnumBody / + identifier ('=' Expression)? ','? 
+TypedefDeclaration ← 'typedef' identifier '=' Type ';' + +FunctionDeclaration ← FunctionSignature ';' +FunctionDefinition ← FunctionSignature Block +FunctionSignature ← Type identifier '(' SignatureArguments? ')' +SignatureArguments ← Type identifier ',' SignatureArguments / + Type identifier ','? +``` + +## Statements + +``` +Block ← '{' Statement* '}' + +Statement ← VariableDefinition / + VariableDeclaration / + IfStatement / + SwitchStatement / + WhileStatement / + DoWhileStatement / + ForStatement / + FlowControlStatement / + AssignmentStatement / + ExpressionStatement + +VariableDefinition ← Type identifier '=' Expression ';' +VariableDeclaration ← Type identifier ';' + +IfStatement ← 'if' Expression Block 'else' Block / + 'if' Expression Block + +SwitchStatement ← 'switch' Expression '{' SwitchCase+ '}' +SwitchCase ← CaseSpecifier Block / + 'default' Block +CaseSpecifier ← 'case' Expression ',' CaseSpecifier / + 'case' Expression ','? + +WhileStatement ← 'while' Expression Block +DoWhileStatement ← 'do' Block 'while' Expression ';' +ForStatement ← 'for' VariableDefinition? ';' Expression ';' AssignmentStatementBody? Block + +FlowControlStatement ← 'continue' ';' / + 'break' ';' / + 'return' Expression? 
';' + +AssignmentStatement ← AssignmentStatementBody ';' +AssignmentStatementBody ← AssignmentTargetExpression '=' Expression / + AssignmentTargetExpression '+=' Expression / + AssignmentTargetExpression '-=' Expression / + AssignmentTargetExpression '*=' Expression / + AssignmentTargetExpression '/=' Expression / + AssignmentTargetExpression '%=' Expression / + AssignmentTargetExpression '&=' Expression / + AssignmentTargetExpression '^=' Expression / + AssignmentTargetExpression '|=' Expression / + AssignmentTargetExpression '++' / + AssignmentTargetExpression '--' + +ExpressionStatement ← Expression ';' +``` + +## Types + +``` +Type ← 'const' BasicType / + BasicType '*' / + BasicType '[' Expression ']' / + BasicType 'function' '(' (BasicType ',')* ')' / + BasicType +BasicType ← 'void' / + IntegerType / + 'signed' IntegerType / + 'unsigned' IntegerType / + 'float' / + 'double' / + 'bool' / + 'struct' identifier / + 'enum' identifier / + 'typedef' identifier / + '(' Type ')' +IntegerType ← 'char' / + 'short' / + 'int' / + 'long' +``` + +## Expressions + +``` +AssignmentTargetExpression ← identifier ATEElementSuffix* +ATEElementSuffix ← '[' Expression ']' / + '.' identifier / + '->' identifier + +AtomicExpression ← identifier / + constant / + string-literal / + '(' Expression ')' + +ObjectExpression ← AtomicExpression ObjectSuffix* / + ArrayLiteralExpression / + StructLiteralExpression +ObjectSuffix ← '[' Expression ']' / + '(' CommasExpressionList? ')' / + '.' identifier / + '->' identifier +CommasExpressionList ← Expression ',' CommasExpressionList? / + Expression ','? +ArrayLiteralExpression ← '{' CommasExpressionList '}' +StructLiteralExpression ← '{' StructLiteralBody '}' +StructLiteralBody ← StructLiteralElement ',' StructLiteralBody? / + StructLiteralElement ','? +StructLiteralElement ← '.' 
identifier '=' Expression + +FactorExpression ← '(' Type ')' FactorExpression / + '&' FactorExpression / + '*' FactorExpression / + '+' FactorExpression / + '-' FactorExpression / + '~' FactorExpression / + '!' FactorExpression / + 'sizeof' FactorExpression / + 'sizeof' Type / + ObjectExpression + +TermExpression ← FactorExpression TermSuffix* +TermSuffix ← '*' FactorExpression / + '/' FactorExpression / + '%' FactorExpression + +ArithmeticExpression ← TermExpression ArithmeticSuffix* +ArithmeticSuffix ← '+' TermExpression / + '-' TermExpression + +BitwiseOpExpression ← ArithmeticExpression '<<' ArithmeticExpression / + ArithmeticExpression '>>' ArithmeticExpression / + ArithmeticExpression '^' ArithmeticExpression / + ArithmeticExpression ('&' ArithmeticExpression)+ / + ArithmeticExpression ('|' ArithmeticExpression)+ / + ArithmeticExpression + +ComparisonExpression ← BitwiseOpExpression '==' BitwiseOpExpression / + BitwiseOpExpression '!=' BitwiseOpExpression / + BitwiseOpExpression '<=' BitwiseOpExpression / + BitwiseOpExpression '>=' BitwiseOpExpression / + BitwiseOpExpression '<' BitwiseOpExpression / + BitwiseOpExpression '>' BitwiseOpExpression / + BitwiseOpExpression + +Expression ← ComparisonExpression ('&&' ComparisonExpression)+ / + ComparisonExpression ('||' ComparisonExpression)+ / + ComparisonExpression +``` + +[![Creative Commons BY-SA License](https://i.creativecommons.org/l/by-sa/4.0/80x15.png)](http://creativecommons.org/licenses/by-sa/4.0/) diff --git a/tagged-unions.md b/tagged-unions.md index 1ea1912..1333ed7 100644 --- a/tagged-unions.md +++ b/tagged-unions.md @@ -1 +1 @@ -TODO +TODO diff --git a/types.md b/types.md index 1ea1912..1333ed7 100644 --- a/types.md +++ b/types.md @@ -1 +1 @@ -TODO +TODO diff --git a/vs-c.md b/vs-c.md index fe4ed3e..e086b4c 100644 --- a/vs-c.md +++ b/vs-c.md @@ -1,70 +1,70 @@ -What differentiates Crowbar from C? 
- -# Removals - -Some of the footguns and complexity in C come from misfeatures that can simply not be used. - -## Footguns - -Some constructs in C are almost always the wrong thing. - -- `goto` -- Hexadecimal float literals -- Wide characters -- Digraphs -- Prefix `++` and `--` -- Chaining mixed left and right shifts (e.g. `x << 3 >> 2`) -- Chaining relational/equality operators (e.g. `3 < x == 2`) -- Mixed chains of bitwise or logical operators (e.g. `2 & x && 4 ^ y`) -- The comma operator `,` - -Some constructs in C exhibit implicit behavior that should instead be made explicit. - -- `typedef` -- Octal escape sequences -- Using an assignment operator (`=`, `+=`, etc) or (postfix) `++` and `--` as components in a larger expression -- The conditional operator `?:` -- Preprocessor macros (but constants are fine) - -## Needless Complexity - -Some type modifiers in C exist solely for the purpose of enabling optimizations which most compilers can do already. - -- `inline` -- `register` - -Some type modifiers in C only apply in very specific circumstances and so aren't important. - -- `restrict` -- `volatile` -- `_Imaginary` - -# Adjustments - -Some C features are footguns by default, so Crowbar ensures that they are only used correctly. - -- Unions are not robust by default. - Crowbar only supports unions when they are [tagged unions](tagged-unions.md) (or declared and used with the `fragile` keyword). - -C's syntax isn't perfect, but it's usually pretty good. -However, sometimes it just sucks, and in those cases Crowbar makes changes. - -- C's variable declaration syntax is far from intuitive in nontrivial cases (function pointers, pointer-to-`const` vs `const`-pointer, etc). - Crowbar uses [simplified type syntax](types.md) to keep types and variable names distinct. -- `_Bool` is just `bool`, `_Complex` is just `complex` (why drag the preprocessor into it?) 
-- Adding a `_` to numeric literals as a separator -- All string literals, char literals, etc are UTF-8 -- Octal literals have a `0o` prefix (never `0O` because that looks nasty) - -# Additions - -## Anti-Footguns - -- C is generous with memory in ways that are unreliable by default. - Crowbar adds [memory safety conventions](safety.md) to make correctness the default behavior. -- C's conventions for error handling are unreliable by default. - Crowbar adds [error propagation](errors.md) to make correctness the default behavior. - -## Trivial Room For Improvement - -- Binary literals, prefixed with `0b`/`0B` +What differentiates Crowbar from C? + +# Removals + +Some of the footguns and complexity in C come from misfeatures that can simply not be used. + +## Footguns + +Some constructs in C are almost always the wrong thing. + +- `goto` +- Hexadecimal float literals +- Wide characters +- Digraphs +- Prefix `++` and `--` +- Chaining mixed left and right shifts (e.g. `x << 3 >> 2`) +- Chaining relational/equality operators (e.g. `3 < x == 2`) +- Mixed chains of bitwise or logical operators (e.g. `2 & x && 4 ^ y`) +- The comma operator `,` + +Some constructs in C exhibit implicit behavior that should instead be made explicit. + +- `typedef` +- Octal escape sequences +- Using an assignment operator (`=`, `+=`, etc) or (postfix) `++` and `--` as components in a larger expression +- The conditional operator `?:` +- Preprocessor macros (but constants are fine) + +## Needless Complexity + +Some type modifiers in C exist solely for the purpose of enabling optimizations which most compilers can do already. + +- `inline` +- `register` + +Some type modifiers in C only apply in very specific circumstances and so aren't important. + +- `restrict` +- `volatile` +- `_Imaginary` + +# Adjustments + +Some C features are footguns by default, so Crowbar ensures that they are only used correctly. + +- Unions are not robust by default. 
+ Crowbar only supports unions when they are [tagged unions](tagged-unions.md) (or declared and used with the `fragile` keyword). + +C's syntax isn't perfect, but it's usually pretty good. +However, sometimes it just sucks, and in those cases Crowbar makes changes. + +- C's variable declaration syntax is far from intuitive in nontrivial cases (function pointers, pointer-to-`const` vs `const`-pointer, etc). + Crowbar uses [simplified type syntax](types.md) to keep types and variable names distinct. +- `_Bool` is just `bool`, `_Complex` is just `complex` (why drag the preprocessor into it?) +- Adding a `_` to numeric literals as a separator +- All string literals, char literals, etc are UTF-8 +- Octal literals have a `0o` prefix (never `0O` because that looks nasty) + +# Additions + +## Anti-Footguns + +- C is generous with memory in ways that are unreliable by default. + Crowbar adds [memory safety conventions](safety.md) to make correctness the default behavior. +- C's conventions for error handling are unreliable by default. + Crowbar adds [error propagation](errors.md) to make correctness the default behavior. + +## Trivial Room For Improvement + +- Binary literals, prefixed with `0b`/`0B` -- cgit v1.2.3
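
Note on reading this patch: every hunk above removes a line and re-adds one with the same visible text, and the diffstat reports 565 insertions against 565 deletions. That pattern is what a CRLF-to-LF conversion looks like to a line-based diff, which matches the subject line. The sketch below is only an illustration of that effect. The function names `normalize_line_endings` and `count_changed_lines` are hypothetical, and nothing here is taken from the author's actual tooling, which the patch does not show.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF).

    Hypothetical helper for illustration; a lone CR is left untouched.
    """
    return data.replace(b"\r\n", b"\n")


def count_changed_lines(before: bytes, after: bytes) -> int:
    """Count lines whose bytes differ when both inputs are split on LF.

    This mimics how a line-based diff sees the files: a line that still
    carries a trailing CR is a different line from one that does not.
    """
    old_lines = before.split(b"\n")
    new_lines = after.split(b"\n")
    return sum(1 for old, new in zip(old_lines, new_lines) if old != new)


# A two-line file saved with Windows line endings (sample data, not from the repo).
original = b"image: debian/stable\r\npackages:\r\n"
converted = normalize_line_endings(original)

# Every line differs byte-for-byte even though the visible text is the same.
changed = count_changed_lines(original, converted)
```

Under these assumptions, each line in a converted file counts as changed, which is why one-line files such as `errors.md` show up as a single deletion plus a single insertion.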