ArkScript's compiler implementation details
The compiler is divided into multiple parts, interacting with one another in sequence:
- the parser, generating the AST from char predicates
- the import solver, calling the parser as many times as needed to get every file's AST and piece them together
- the macro processor, applying macros in the AST
- the name resolver, resolving packages
- the optimizer, removing unused symbols from the AST
- the compiler, generating IR from the AST
- the IR optimizer, enhancing the IR
- the IR compiler, generating bytecode from IR
There is another compiler, the JSON compiler, which converts a given piece of code to JSON by iterating through the AST. It follows the same process as the compiler to generate the AST.
Parser
It generates the abstract syntax tree (AST) using character predicates, in a "parser combinator" fashion.
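To give an idea of the "character predicate" approach, here is a minimal, self-contained C++ sketch of the technique (it is not the actual ArkScript parser and the names are made up): a cursor consumes runs of characters matching a predicate, and such small pieces are then combined into bigger parsers.

// Minimal sketch of parsing with character predicates (illustrative only,
// not ArkScript's real parser).
#include <cctype>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <string_view>

struct Cursor
{
    std::string_view src;
    std::size_t pos = 0;

    // consume characters while the predicate holds; true if at least one matched
    bool accept(const std::function<bool(char)>& predicate, std::string* out = nullptr)
    {
        const std::size_t start = pos;
        while (pos < src.size() && predicate(src[pos]))
            ++pos;
        if (pos == start)
            return false;
        if (out != nullptr)
            *out = std::string(src.substr(start, pos - start));
        return true;
    }
};

int main()
{
    Cursor cur { "(let foo 5)" };
    std::string token;

    auto is_paren = [](char c) { return c == '(' || c == ')'; };
    auto is_space = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    auto is_name  = [](char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; };

    // prints: ( let foo 5 )
    while (cur.pos < cur.src.size())
    {
        if (cur.accept(is_space))
            continue;
        if (cur.accept(is_paren, &token) || cur.accept(is_name, &token))
            std::cout << token << ' ';
    }
    std::cout << '\n';
    return 0;
}

The real parser builds AST nodes instead of printing tokens, but it composes the same kind of predicate-driven building blocks.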
Import solver
Using the main script's AST, it retrieves the imports and resolves them relative to the main script's path.
Say you have three files, a.ark, foo/b.ark and foo/bar/c.ark:
./
|__ a.ark
|__ foo/
    |__ b.ark
    |__ bar/
        |__ c.ark
To import c.ark from a.ark or b.ark, you will need to write (import foo.bar.c).
Then, it continues to resolve imports recursively, calling the parser on each new unimported package it finds.
The final AST is the agglomeration of all ASTs, where every import directive is replaced by a Namespace node, which holds the file's AST, its name, path, and import attributes (is glob, symbols, prefix).
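Below is a rough, hypothetical C++ sketch of that resolution loop (the node layout and parseFile are invented for the example; this is not ArkScript's actual code): each Import node triggers a parse of the target package, once per package, and is then replaced in place by a Namespace node wrapping the imported file's AST.

// Sketch of recursive import resolution (illustrative only).
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct Node
{
    enum class Kind { List, Symbol, Import, Namespace } kind;
    std::string value;              // symbol name, package name...
    std::vector<Node> children;     // sub-AST (for Namespace: the imported file's AST)
};

// stand-in for the real parser: returns a fake AST for a given package
Node parseFile(const std::string& package)
{
    return Node { Node::Kind::List, package, { { Node::Kind::Symbol, "let", {} } } };
}

void resolveImports(Node& node, std::unordered_set<std::string>& visited)
{
    for (Node& child : node.children)
        resolveImports(child, visited);

    if (node.kind == Node::Kind::Import && !visited.contains(node.value))
    {
        visited.insert(node.value);
        Node sub = parseFile(node.value);   // parse the imported file...
        resolveImports(sub, visited);       // ...and resolve its own imports recursively
        // replace the Import node in place by a Namespace node holding the sub-AST
        node = Node { Node::Kind::Namespace, node.value, { sub } };
    }
}

int main()
{
    Node main_ast { Node::Kind::List, "a", { { Node::Kind::Import, "foo.bar.c", {} } } };
    std::unordered_set<std::string> visited;
    resolveImports(main_ast, visited);
    std::cout << "import replaced by a namespace: "
              << (main_ast.children[0].kind == Node::Kind::Namespace) << '\n';
}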
Macro processor
Once the AST has been built, the macro processor kicks into action: it goes over the AST, looking for macro nodes to register and apply. Macros are registered per block: if we are at depth 0 in the AST and find a macro, it will be available in every node. If we find a macro in a node at depth 2, the macro will get registered and applied in all the potential sub-nodes of the current node, but once we leave the depth 2 node, the macro is removed from memory and won't be accessible in the other parts of the script.
It is composed of different pieces:
- the processor, receiving the AST, keeping track of all the macros, and visiting every node, applying macros in them (it also does complex stuff with function macros like unifying the argument lists)
- the pipeline, holding all the different macro executors, trying to apply them one after another (until one matches) when the processor asks for a macro to be applied on a node
- the executors:
- the conditional one works only on if macros and applies them depending on the value of the condition
- the symbol one applies the value held by the macro
- the list one applies predefined macros and user defined macros (function macros); it is arguably the most complex one, but also the most capable
Once it is done running, every macro definition and macro usage in the AST has been removed and/or replaced with the corresponding value, so we are left with an AST that the compiler could already work with.
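A way to picture the block-scoped registration described above is a stack of macro scopes: entering a block pushes a scope, macros found in that block live in the top scope, and leaving the block pops it. The following C++ sketch is only an illustration of that idea, not the real MacroProcessor.

// Sketch of per-block macro registration (illustrative only).
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

class MacroScopes
{
public:
    void enterBlock() { m_scopes.emplace_back(); }
    void leaveBlock() { m_scopes.pop_back(); }   // macros of this block are dropped

    void registerMacro(const std::string& name, const std::string& value)
    {
        m_scopes.back()[name] = value;
    }

    // look from the innermost scope to the outermost one
    std::optional<std::string> find(const std::string& name) const
    {
        for (auto it = m_scopes.rbegin(); it != m_scopes.rend(); ++it)
            if (auto macro = it->find(name); macro != it->end())
                return macro->second;
        return std::nullopt;
    }

private:
    std::vector<std::unordered_map<std::string, std::string>> m_scopes;
};

int main()
{
    MacroScopes scopes;
    scopes.enterBlock();                    // depth 0: visible everywhere below
    scopes.registerMacro("answer", "42");

    scopes.enterBlock();                    // nested block
    scopes.registerMacro("local", "1");
    std::cout << scopes.find("local").value_or("<none>") << '\n';   // 1
    scopes.leaveBlock();                    // "local" is gone now

    std::cout << scopes.find("local").value_or("<none>") << '\n';   // <none>
    std::cout << scopes.find("answer").value_or("<none>") << '\n';  // 42
}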
Name Resolver
Once macros have been processed, the name resolver visits the AST and updates each symbol to fully qualify it. Fully qualifying a symbol means finding the nearest namespace in which a symbol is declared; if in our input AST we have (let foo 5), in the final AST we will have (let package:foo 5), or even (let package:foo#hidden 5) in the case of symbols that should be available but masked, as they weren't explicitly imported (eg. with (import foo :bar), all symbols from the package foo are imported and suffixed with #hidden, while bar is imported as-is).
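The following C++ sketch shows the gist of that qualification (illustrative only, not the real NameResolver; the data layout is invented): search the enclosing namespaces from the innermost to the outermost, prefix the symbol with the namespace's name, and append #hidden when it wasn't explicitly imported.

// Sketch of fully qualifying a symbol (illustrative only).
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct Namespace
{
    std::string name;
    std::unordered_set<std::string> declared;             // symbols declared in this package
    std::unordered_set<std::string> explicitly_imported;  // eg. (import foo :bar) => { "bar" }
};

std::string qualify(const std::string& symbol, const std::vector<Namespace>& enclosing)
{
    // search from the innermost namespace to the outermost one
    for (auto it = enclosing.rbegin(); it != enclosing.rend(); ++it)
    {
        if (!it->declared.contains(symbol))
            continue;
        std::string qualified = it->name + ":" + symbol;
        if (!it->explicitly_imported.contains(symbol))
            qualified += "#hidden";       // available but masked
        return qualified;
    }
    return symbol;  // not declared in any namespace: leave it untouched
}

int main()
{
    const std::vector<Namespace> scopes = {
        { "foo", { "bar", "egg" }, { "bar" } }
    };
    std::cout << qualify("bar", scopes) << '\n';  // foo:bar
    std::cout << qualify("egg", scopes) << '\n';  // foo:egg#hidden
}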
AST Optimizer
Before being able to compile the AST, though, we can optimize it a bit by removing unused variable definitions. For simple variables, like (let a 12), leaving them in isn't really a problem, but unused functions would each get a separate page of bytecode, and the resulting bytecode would be a lot longer than just a variable declaration and assignment. Having long bytecode isn't a problem in terms of computing power, as we only compute what we are asked to, but we are limited to 65'534 functions as our code segment indexes are on 2 bytes, and we are limited to 65'536 different symbols (and values). Those symbols and values might be used only in the functions we want to delete, yet still use substantial space in the symbol and value tables.
So here comes the AST optimizer: it reads only the first level of the AST, searching for variable and function definitions. Then, it visits the whole AST, counting how many times each gathered symbol has been used. In the end, if a symbol has been used only once (which corresponds to its definition/declaration), it gets removed from the AST.
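A minimal sketch of that counting pass could look like the following C++ (names and node layout are made up for the example, not the actual optimizer): gather top-level definitions, count occurrences of each name over the whole AST, and drop the definitions seen exactly once.

// Sketch of dead-symbol elimination (illustrative only).
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Node
{
    std::string kind;                 // "let", "symbol", "begin", ...
    std::string name;                 // for "let"/"symbol": the identifier
    std::vector<Node> children;
};

void countUsages(const Node& node, std::unordered_map<std::string, int>& counts)
{
    if ((node.kind == "symbol" || node.kind == "let") && counts.contains(node.name))
        ++counts[node.name];
    for (const Node& child : node.children)
        countUsages(child, counts);
}

void removeUnused(Node& root)
{
    // 1) only look at the first level for definitions
    std::unordered_map<std::string, int> counts;
    for (const Node& child : root.children)
        if (child.kind == "let")
            counts[child.name] = 0;

    // 2) count usages in the whole AST
    countUsages(root, counts);

    // 3) erase definitions seen exactly once (their own declaration)
    std::erase_if(root.children, [&](const Node& child) {
        return child.kind == "let" && counts[child.name] == 1;
    });
}

int main()
{
    Node root { "begin", "", {
        { "let", "a", {} },        // never used: removed
        { "let", "b", {} },
        { "symbol", "b", {} }      // b is used here
    } };
    removeUnused(root);
    std::cout << "top level nodes left: " << root.children.size() << '\n';  // 2
}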
AST Lowerer
The AST Lowerer's role is to take a tree-like structure, the AST, and flatten it, lowering it to IR that is linear, so that the virtual machine can just read one instruction after another, without having to dive into a nested structure (which is very inefficient).
It works recursively on the nodes, and sometimes even I have some trouble understanding why something works and why something else doesn't. In the end, it all boils down to "the IR has been badly generated".
The first step is to run through the AST, passing each node to the recursive ASTLowerer::compileExpression method, which is in charge of generating the IR.
The compilation
Given an input node, we make a decision based on its type (ASTLowerer::compileExpression):
- if it's a symbol, we will compile it to a LOAD_SYMBOL <symbol> (or LOAD_SYMBOL_BY_INDEX <index> if the symbol is in the current scope and can be accessed by its index in the scope stack)
- "get field" operation, GET_FIELD <symbol>
- String, Number or nil: we push the value on the stack with LOAD_CONST <const>
  - an additional step is done here to avoid pushing the value on the stack if the value isn't being used (eg. in a function call, variable assignment)
- Keywords:
  - If: first we compile the condition, then add a jump if true to go to the ifTrue branch. We compile the ifFalse branch, add a jump to the end of the if, and finally we compile the ifTrue branch.
  - Let, Mut, Set: we register the symbol, compile the value, and add the corresponding instruction after it: LET, MUT, STORE.
  - Fun: we start by checking if the function is a closure (has any Capture nodes) and emplace CAPTURE instructions in the callee page. Then we create a new page, dedicated to the function. If it's a closure, we'll add a MAKE_CLOSURE instruction instead of a LOAD_CONST to the callee page. Arguments are loaded by the function with a series of STORE instructions, in the order the arguments appear. Then we compile the body of the function, marking it as used and terminal.
  - Begin: we take each node and compile it. An additional step is done here to remove any unused value, except for the last node, whose fate is decided by the caller (so that we don't always drop the last node in a begin, which may be the return value of a function).
  - While: much like the if, we compile the condition, add a jump to the end of the while if the condition is false, compile the body and add a jump to the beginning (right before the condition).
  - Import: this is used only to import modules at runtime; we put the module name (a string) on the stack and then the IMPORT instruction. The VM will do what's needed for us.
  - Del: this adds DEL <symbol> to the bytecode, to tell the virtual machine to remove <symbol> from memory.
- if none of the previous cases matched, then we most likely have a function call (ASTLowerer::handleCalls):
  - we first check if the function call is a short-circuiting operator (and or or). If it is, we duplicate the first value, and add a POP_JUMP_IF_FALSE for and, otherwise a POP_JUMP_IF_TRUE for or, then a POP to remove the leftover value in case we didn't jump. The next value is added, and if it's not the last one, we DUP it as well.
  - otherwise, if the function call isn't an operator, we check if the call is terminal and a self-call (calling the function we are currently compiling). If it is, we replace the call by a jump to address 0 (relative to the current page), transforming the tail call into a loop. If it isn't, we compile the function call to a temporary page.
    - then we compile the arguments on the current page, marking them as "no discard"
    - we push the computed expression for the function call and discard the temporary page
    - we add the CALL instruction, followed by the number of arguments
  - finally, if we're here we have an operator call. We start by pushing the arguments to the stack, check the argument count for error reporting, and then add the operator.
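To make the flattening of an if more concrete, here is a small, hypothetical C++ sketch of the ordering described above (labels and instruction names are illustrative, and the real ASTLowerer emits IR entities rather than strings): condition first, then a jump-if-true to the ifTrue branch, the ifFalse branch, a jump to the end, and finally the ifTrue branch.

// Sketch of lowering an if to linear IR with labels (illustrative only).
#include <iostream>
#include <string>
#include <vector>

struct Inst
{
    std::string op;    // eg. LOAD_CONST, POP_JUMP_IF_TRUE, JUMP, LABEL...
    std::string arg;
};

// hypothetical helper compiling a sub-expression; here it just loads a constant
void compileExpression(std::vector<Inst>& page, const std::string& value)
{
    page.push_back({ "LOAD_CONST", value });
}

void compileIf(std::vector<Inst>& page, const std::string& cond,
               const std::string& ifTrue, const std::string& ifFalse)
{
    compileExpression(page, cond);
    page.push_back({ "POP_JUMP_IF_TRUE", "L_true" });
    compileExpression(page, ifFalse);            // false branch first...
    page.push_back({ "JUMP", "L_end" });
    page.push_back({ "LABEL", "L_true" });
    compileExpression(page, ifTrue);             // ...then the true branch
    page.push_back({ "LABEL", "L_end" });
}

int main()
{
    std::vector<Inst> page;
    compileIf(page, "cond", "1", "2");           // something like (if cond 1 2)
    for (const Inst& inst : page)
        std::cout << inst.op << ' ' << inst.arg << '\n';
}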
IR Optimizer
The IR Optimizer is given the IR blocks generated by the ASTLowerer, as well as the symbol and value tables. From there, it visits each page and sees if it can merge 2 or 3 instructions together.
First, it tries to merge 2 instructions. Then, if no instructions were merged, it tries with 3. If still nothing was merged, it leaves the instruction untouched. Otherwise, the group of instructions is skipped over and the new, merged instruction is kept in the final IR.
The value table is particularly useful when compacting code like (+ i 2), to check that the value used for incrementing (or decrementing) is between 0 and 4095 (the maximum value we can encode on 12 bits in super instructions).
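As a sketch of that peephole pass (the merge rule and the ADD_CONST super instruction below are invented for the example; the real optimizer also goes through the value table to read the constant, while here the argument is the value itself for brevity), merging works roughly like this:

// Sketch of merging pairs of instructions into super instructions (illustrative only).
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct Inst
{
    std::string op;
    uint16_t arg = 0;
};

// try to merge two consecutive instructions into one "super instruction"
std::optional<Inst> tryMerge(const Inst& a, const Inst& b)
{
    // eg. LOAD_CONST n + ADD  ->  ADD_CONST n, if n fits on 12 bits
    if (a.op == "LOAD_CONST" && b.op == "ADD" && a.arg <= 4095)
        return Inst { "ADD_CONST", a.arg };
    return std::nullopt;
}

std::vector<Inst> optimize(const std::vector<Inst>& page)
{
    std::vector<Inst> out;
    for (std::size_t i = 0; i < page.size(); ++i)
    {
        if (i + 1 < page.size())
        {
            if (auto merged = tryMerge(page[i], page[i + 1]); merged)
            {
                out.push_back(*merged);
                ++i;                    // skip over the group we just merged
                continue;
            }
        }
        out.push_back(page[i]);         // nothing matched: keep the instruction as-is
    }
    return out;
}

int main()
{
    const std::vector<Inst> page = {
        { "LOAD_SYMBOL", 0 }, { "LOAD_CONST", 2 }, { "ADD", 0 }, { "STORE", 0 }
    };
    for (const Inst& inst : optimize(page))
        std::cout << inst.op << ' ' << inst.arg << '\n';
}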
IR Compiler
The IR Compiler takes the optimized blocks of IR, together with the symbol and value tables, and transforms them into bytecode for the virtual machine. Its role is also to compute the jumps' positions from the labels in the IR blocks.
- The first step is to generate the file header: the string ark\0, followed by the version of the compiler (each number on two bytes), and then the timestamp (IRCompiler::pushFileHeader).
- We continue building the bytecode file header with the symbol and value tables, respectively a list of the identifiers used throughout the program and all the values used (a tag-value system is used here so that the VM knows what type each value is) (IRCompiler::pushSymAndValTables).
- Finally, we add the different code pages to the bytecode file, generated by the IRCompiler::compile method call, and create the SHA256 used for integrity checking of the bytecode, stored in a dedicated spot in the file header.
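A hedged sketch of the header-building step, following the description above (byte order, the timestamp width and the version numbers below are assumptions for the example, not the exact output of IRCompiler::pushFileHeader):

// Sketch of writing the bytecode file header (illustrative only).
#include <cstdint>
#include <ctime>
#include <iostream>
#include <vector>

void pushWord(std::vector<uint8_t>& out, uint16_t value)
{
    out.push_back(static_cast<uint8_t>(value >> 8));    // byte order is an assumption here
    out.push_back(static_cast<uint8_t>(value & 0xff));
}

std::vector<uint8_t> makeHeader(uint16_t major, uint16_t minor, uint16_t patch)
{
    std::vector<uint8_t> out = { 'a', 'r', 'k', '\0' };  // magic string
    pushWord(out, major);                                // version, each number on two bytes
    pushWord(out, minor);
    pushWord(out, patch);

    // timestamp, written here on 8 bytes (width is an assumption)
    const auto timestamp = static_cast<uint64_t>(std::time(nullptr));
    for (int shift = 56; shift >= 0; shift -= 8)
        out.push_back(static_cast<uint8_t>(timestamp >> shift));
    return out;
}

int main()
{
    const std::vector<uint8_t> header = makeHeader(1, 2, 3);  // placeholder version numbers
    std::cout << "header size: " << header.size() << " bytes\n";
}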
JSON Compiler
Given an input node, we generate the corresponding JSON structure:
- Symbol:
{"type": "Symbol", "name": "..."}
- Spread:
{"type": "Spread", "name": "..."}
- Capture:
{"type": "Capture", "name": "..."}
- Field:
{"type": "Field", "children": [{"type": "Symbol", "name": "..."}, ...]}
- String:
{"type": "String", "value": "..."}
- Number:
{"type": "Number", "value": ...}
- Function call:
{"type": "FunctionCall", "name": "...", "args": [{"type": "Symbol", "name": "..."} ...]}
- Keywords:
- Function declaration:
{"type": "Fun", "args": [{"type": "Symbol", "name": "..."} ...], "body": [...] | {...}}
- Variable declaration/modification:
{"type": "Let|Mut|Set", "name": {"type": "Symbol", "name": "..."}, "value": {...}}
- Condition:
{"type": "If", "condition": [...], "then": [...] | {}, "else": [...] | {}}
- While loop:
{"type": "While", "condition": [...], "body": [...] | {}}
- Begin:
{"type": "Begin", "children": [...]}
- Import:
{"type": "Import", "package": "foo.bar.egg", "glob": false, "symbols": ["a", "b", "c"]}
- Del:
{"type": "Del", "value": {"type": "Symbol", "name": "..."}}
- Macros:
- Macro value:
{"type": "Macro", "name": "...", "value": {...}}
- Macro function:
{"type": "Macro", "name": "...", "args": [...], "body": {...}}
- Macro condition:
{"type": "MacroCondition", "condition": {...}, "then": {...}, "else": {...}}
Some fields have been filled to give you an idea of what the value should look like in most cases. One should note that function arguments can have Capture nodes as well as Symbol nodes.
Generic structure:
{
"type": "<name>",
"value": ... || "..." || {...}, // optional
"name": "...", // optional
"args": [{...}, ...], // optional
"condition": [{...}, ...] || {...}, // optional
"then": [{...}, ...] || {...}, // optional
"else": [{...}, ...] || {...}, // optional
"body": [{...}, ...] || {...}, // optional
"children": [{...}, ...] // optional
}
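As a small worked example derived from the structures above, a simple (let a 5) would come out roughly as:
{"type": "Let", "name": {"type": "Symbol", "name": "a"}, "value": {"type": "Number", "value": 5}}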