Implementation Details
I found that implementing the parser and lexer alongside all of the existing ISPC source code was a little distracting. There's always the temptation to change things in order to make them fit a certain paradigm or preference. So I decided to develop most of the code outside of the source tree, and to make it easy to write a method that converts from the new AST to the old one.
Lexer
The new lexer is generic over the character type, so that it will work with char, char16_t, char32_t and wchar_t. When C++20 comes along, I also expect it will work with char8_t with little to no effort. I also parameterized the size type used to describe lines, columns and indices, because those values generally fit in 16 bits each, and storing them as std::size_t can take up a lot of memory in large files. To give an example, here's what's usually stored in each token:
struct SourcePos final {
    std::size_t index;
    std::size_t line;
    std::size_t column;
};

struct SourceRange final {
    SourcePos start;
    SourcePos finish;
};
That is probably going to be 48 bytes per token (not including other token fields). If the size type is parameterized, then it's usually going to be either 12 bytes (uint16_t) or 24 bytes (uint32_t) per token, which is much better.
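The parameterized versions aren't shown here, but a minimal sketch would look something like this (my own illustration of the sizes quoted above):

#include <cstdint>

template <typename SizeType>
struct SourcePos final {
    SizeType index;
    SizeType line;
    SizeType column;
};

template <typename SizeType>
struct SourceRange final {
    SourcePos<SizeType> start;
    SourcePos<SizeType> finish;
};

// No padding here, since every field has the same type.
static_assert(sizeof(SourceRange<std::uint16_t>) == 12);
static_assert(sizeof(SourceRange<std::uint32_t>) == 24);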
The lexer ended up looking like this:
template <typename Char, typename SizeType>
class Lexer final {
public:
    using TokenType = Token<Char, SizeType>;

    // string_view is cheap to copy, so it is taken by value.
    static auto Lex(std::basic_string_view<Char> sourceCode) -> std::vector<TokenType>;
};
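The Token type itself isn't shown in this post, but given the range type above, a plausible shape would be something like this (the kind and text fields are my assumption, not the actual implementation):

template <typename Char, typename SizeType>
struct Token final {
    TokenKind kind;                    // hypothetical enumeration of token kinds
    std::basic_string_view<Char> text; // the lexeme itself
    SourceRange<SizeType> range;       // where the lexeme appears in the source
};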
I opted to make it a stateless lexer, because none of the tokens are context-sensitive; a usage sketch follows the list below. This may not have been the case if I had decided to either:
- Use the lexer hack, where type specifiers are fed back to the lexer when handling typedef
- Write a lexer for preprocessing (the #include directive requires special tokens for the header path)
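Since the lexer is stateless, lexing a file is a single static call. A sketch (the std::uint16_t size type is an arbitrary choice for illustration):

std::basic_string_view<char> source = "uniform int x = 0;";
auto tokens = Lexer<char, std::uint16_t>::Lex(source);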
Parser
The parser also required parameterization of the token type. I chose to do this for three reasons:
- Keeping it decoupled from the lexer
- Making it easier to mock the input if needed (an example follows the Parser definition below)
- If I chose to implement a preprocessor, the tokens emitted from it would be heterogeneous, since each token may or may not be part of a macro expansion. If the token types are parameterized, then this can easily be accommodated without having to rewrite the parser.
This leads to the consequence that each component of the AST is also parameterized by the token type. In order to keep the code readable, I opted to put everything into a Parser class that is parameterized by the token type, so that I don't have template declarations all over the place. Here's what it looks like:
template <typename TokenIterator>
class Parser final {
    /* ... */

    class FunctionDecl final {
        Type returnType;
        FunctionDeclarator declarator;
        CompoundStmt body;
    };

    using TopLevelNode = std::variant<FunctionDecl, VariableDecl, StructDecl>;

    class AST final {
        std::vector<TopLevelNode> nodes;
    public:
        /* ... */
    };

public:
    auto Parse(TokenIterator begin, TokenIterator end) -> ASTContext<TokenIterator>;
};
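As an example of the second reason above, mocking the parser's input only requires an iterator pair over hand-built tokens; no lexer needs to be involved. A hypothetical test (MakeTokensForTest is an invented helper):

using MockToken = Token<char, std::uint16_t>;

std::vector<MockToken> mockTokens = MakeTokensForTest(); // invented helper
Parser<std::vector<MockToken>::const_iterator> parser;
auto astContext = parser.Parse(mockTokens.cbegin(), mockTokens.cend());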
The interface to the AST is kept separate, so that client code does not have to have templates all over the place. Each node in the AST has a base class that uses static polymorphism (it's not shown above). That way, nobody has to visit a type like ispc::Parser<const Token<char, std::size_t> *>::FunctionDecl; instead, they use an interface and hierarchy like this:
template <typename Derived, typename TokenIt>
class FunctionDecl {
public:
    template <typename DerivedStmtVisitor>
    auto VisitBody(StmtVisitor<DerivedStmtVisitor> &) const;

    template <typename DerivedTypeVisitor>
    auto VisitReturnType(TypeVisitor<DerivedTypeVisitor> &) const;

    auto GetName() const -> TokenIt;

    /* ... */
};

template <typename DerivedVisitor>
class TopLevelASTVisitor {
public:
    template <typename DerivedFunc, typename TokenIt>
    auto Visit(const FunctionDecl<DerivedFunc, TokenIt> &func);

    /* ... */
};

template <typename TokenIt>
class ASTContext final {
public:
    template <typename DerivedVisitor>
    auto VisitAST(TopLevelASTVisitor<DerivedVisitor> &) const;
};
An example usage:
class CustomVisitor final : public TopLevelASTVisitor<CustomVisitor> {
public:
    template <typename DerivedFunc, typename TokenIt>
    void VisitImpl(const FunctionDecl<DerivedFunc, TokenIt> &func) {
        // GetName returns an iterator, so dereference it to reach the
        // name token (this assumes the token type is streamable).
        std::cout << "Function name: " << *func.GetName() << std::endl;
    }
};
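For anyone wondering how VisitImpl gets found: the visitor base class uses CRTP to dispatch statically. The actual implementation isn't shown in this post, but it boils down to something like this:

template <typename DerivedVisitor>
template <typename DerivedFunc, typename TokenIt>
auto TopLevelASTVisitor<DerivedVisitor>::Visit(const FunctionDecl<DerivedFunc, TokenIt> &func) {
    // Static dispatch: cast down to the derived visitor and forward the node.
    return static_cast<DerivedVisitor &>(*this).VisitImpl(func);
}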
Yes, there are still templates, but they're much shorter, and the expectation of what's required is shown clearly in TopLevelASTVisitor. There's also the benefit of having return types in visitation methods. It would not be as easy to have a non-void return type using virtual visitor classes. With this method, there's a clean picture of what's an input and what's an output in a visitor call:
auto errorList = astContext.VisitAST(typeChecker);
Instead of something like:
astContext.VisitAST(typeChecker);
auto errorList = typeChecker.GetErrorList();
The latter also makes the implementation of a derived visitor less readable, since results have to be accumulated in member state and read back afterwards.
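To make that concrete, the virtual-visitor equivalent tends to look something like this sketch (ASTVisitor and Error are hypothetical names, not types from the project):

class TypeChecker final : public ASTVisitor {
    std::vector<Error> errors; // output smuggled through member state
public:
    void Visit(const FunctionDecl &decl) override {
        /* ... type-check, then append any problems to `errors` ... */
    }
    // The caller has to remember to come back for the results.
    auto GetErrorList() const -> const std::vector<Error> & { return errors; }
};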
Downsides
As with most template-heavy code, the downside is going to be build times. Also, anyone not familiar with CRTP is probably going to be confused when looking at the AST hierarchy. And lastly, template errors are a horror.
I also wonder whether the type parameterization of the lexer is actually worth it, since most compilers only support UTF-8 and the benefits of smaller token types might be negligible.
Conclusion
I'm happy with the design and I think it'll hold up for the rest of my time working on this project. It's extremely nice to have very loose coupling between the lexer and parser, and reassuring to know that supporting a preprocessing phase won't require a rewrite. I'm really looking forward to seeing the implementation through to completion.