Building a Compiler for My Static Site Generator

My side project has exploded in scope. My original goal was to build a static site generator to learn how they work, not to support lots of features. Yet here I am building a compiler for my own template language.

Before the compiler, I had a very simple model for handling templates, centered on variable replacement: given some mustache expression, replace the requested variable with its value from the parent environment.

;; An example from Orgify's test suite
(assert-compile-to-string
  :input    "Hello, {{ name }}!"
  :expected "Hello, world!"
  :env      '((name . "world")))

Originally I had implemented these substitutions via search and replace, something Emacs is adept at. It looked something like this:

(defun parse-handlebars (handlebars)
  (string-trim (substring handlebars 2 (- (length handlebars) 2))))

(defun search-and-replace-handlebars (env)
  (while (re-search-forward "{{[ ]*[a-z]*[ ]*}}" nil t)
    ;; Intern the trimmed variable name so it can be looked up
    ;; against the symbol keys in ENV.
    (let ((expr (save-match-data
                  (intern (parse-handlebars (match-string 0))))))
      (unless (alist-get expr env)
        (error "Unrecognized variable: %s" (match-string 0)))
      (replace-match (alist-get expr env)))))
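
Driving it is buffer-oriented, something like the following (a minimal sketch of the call site; the real one in Orgify differs):

(with-temp-buffer
  (insert "Hello, {{ name }}!")
  (goto-char (point-min))
  (search-and-replace-handlebars '((name . "world")))
  (buffer-string))
;; => "Hello, world!"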

I was happy with this solution because it solved variable replacement in a super simple, Emacs-y way. I soon learned, however, that this method is very difficult to extend. It bakes in too many assumptions: everything between mustaches must be a bare variable name, every variable must already exist in the environment, and the whole language must be describable by a single regular expression. There's no room for loops, conditionals, or evaluating arbitrary expressions.

With this in mind, I ditched the prior code in favor of a more traditional, compiler-driven approach. The new Orgify template language goes through the usual tokenize, parse, and generate cycle. The result resembles Handlebars, but with the added ability to execute Emacs Lisp expressions.

;; You can include any Emacs Lisp code in mustache templates:
(assert-compile-to-string
  :input "{{ (+ 1 41) }}"
  :expected "42")

The rest of this post describes how the new compiler works.

Building the compiler

Orgify's compiler has three phases: tokenization, parsing, and code generation (although the last two steps are actually performed at the same time).

Tokenization

The goal of tokenization is to make the text easier to parse by breaking it down into smaller tokens that pick out language symbols. This step is surprisingly crucial. The difference in a program's ability to understand a list of tokens vs. raw text is night and day.

My template language only cares about a few tokens:

- text: a run of ordinary characters
- obrace: an opening mustache, {{
- cbrace: a closing mustache, }}
- oeach: the start of a loop, #each
- ceach: the end of a loop, /each

The scope of each token is intentionally small. It's very easy to accidentally blur the line between the job of the tokenizer and the job of the parser by trying to capture tokens that contain too much information. This path leads only to headaches.
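
For example, a token that captures a whole mustache expression (a hypothetical mustache token, shown here for illustration) forces the tokenizer to understand variable syntax, which is really the parser's job:

;; Too much information in one token:
'(mustache "{{ name }}")

;; Small tokens leave interpretation to the parser:
'((obrace "{{") (text "name") (cbrace "}}"))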

Finding these tokens in the original text still relies on regular expressions, though the implementation is quite a bit different from the original approach. Rather than using re-search-forward, which operates on a buffer, I use string-match, which operates on a string. Additionally, I use an incrementing index and check the match position to ensure the regular expression matches exactly at the current position in the string (important because string-match will happily find a match anywhere later in the string).
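
To see why the position check matters:

(string-match "{{" "abc {{ name }}" 0)
;; => 4, a match, but not at the position we asked about

(eq (string-match "{{" "abc {{ name }}" 0) 0)
;; => nil, so position 0 is treated as plain text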

Altogether, tokenization looks like this:

;; cl-lib provides cl-flet and cl-incf
(require 'cl-lib)

(defun tokenize (input)
  (let ((tokens '()) (cur-text "") (idx 0))
    ;; A helper function to append text tokens
    (cl-flet ((purge-text ()
                (when (> (length cur-text) 0)
                  (push (list 'text cur-text) tokens)
                  (setq cur-text ""))))
      ;; Looping over the input string, one character at a time
      (while (< idx (length input))
        ;; Need to compare against idx to ensure the regex matched
        ;; exactly at idx, not merely somewhere after it.
        (cond ((eq (string-match "{{" input idx) idx)
               (purge-text)
               (push `(obrace ,(match-string 0 input)) tokens)
               (setq idx (1- (match-end 0))))
              ((eq (string-match "}}" input idx) idx)
               (purge-text)
               (push `(cbrace ,(match-string 0 input)) tokens)
               (setq idx (1- (match-end 0))))
              ;; ...snip

              ;; For everything else, just append the character
              ;; to cur-text.
              (t
               (setq cur-text (concat
                                cur-text
                                (char-to-string (aref input idx))))))
        (cl-incf idx))
      (purge-text))
    (reverse tokens)))

For every index in the string, decide whether the string starting at that index matches one of the language tokens via regular expression. If it does, purge any text that might be hanging around from previous iterations and push a new token to the list. If it doesn't, append the current character to the growing string of characters for the next text purge.

It might be easier to visualize by looking at an example template:

;; Template:
;; <p>Hello {{ name }}!</p>
;;
;; <ul>
;;   #each page in pages
;;   <li>
;;     {{ page }}
;;   </li>
;;   /each
;; </ul>

;; Tokens:
'((text "<p>Hello ")
  (obrace "{{")
  (text "name")
  (cbrace "}}")
  (text "!</p>\n\n<ul>\n  ")
  (text "\n  <li>\n    ")
  (oeach "#each page in pages")
  (obrace "{{")
  (text "page")
  (cbrace "}}")
  (text "\n  </li>\n  ")
  (ceach "/each")
  (text "\n</ul>"))

Parsing and code generation

The next step in compilation feeds these tokens to the parser. The parser runs through the list of tokens and gives them structure, appending them as leaves and branches to an abstract syntax tree (AST).

(defun parse (tokens)
  (let ((expressions '()) (cur 0))
    (while (< cur (length tokens))
      (let ((token (nth cur tokens)))
        (cond ((eq 'text (car token))
               ;; Handle text...
               )
              ((eq 'obrace (car token))
               ;; Handle opening braces...
               )
              ((eq 'cbrace (car token))
               ;; Handle closing braces...
               )))
      (setq cur (1+ cur)))
    (reverse expressions)))

The call to reverse here might be unexpected, but it's a common Lisp-ism since push prepends items to the front of the list. Reversing expressions ensures the order of the tree matches the order of the original tokens.
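
A quick demonstration of the idiom:

(let ((xs '()))
  (push 'a xs)   ;; xs is now (a)
  (push 'b xs)   ;; xs is now (b a)
  (reverse xs))
;; => (a b)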

I left the implementation of each token branch as comments because there are some important prerequisite topics to cover first: quoting and eval. These two tools are crucial for this compiler, so they're worth explaining in some extra detail.

Quoting

There's a very interesting consequence of building an AST for this compiler in Lisp. Because Lisp code is naturally structured as lists, the parser is able to directly generate a tree of Emacs Lisp code, rather than a tree of generic nodes that needs another layer of translation. This is accomplished through quoting, a Lisp special form that returns an object without evaluating it.

(+ 1 2)
;; => 3

'(+ 1 2)
;; => (+ 1 2)

The ability to defer evaluation by quoting is incredibly useful for a compiler that generates instructions. Every node in the AST generated by the parser is a quoted Emacs Lisp expression. It satisfies not only the AST data structure, a tree representing the program's shape, but also the code that needs to be generated from the source tokens.
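
To make that concrete, a node like the ones this parser emits can be inspected as data or executed as code:

(setq node '(insert "Hello, "))

(car node)
;; => insert, because it's just a list

(eval node)
;; Side effect: writes "Hello, " into the current buffer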

This aspect of code-as-data in Lisp is referred to as homoiconicity.

Taking it one step further, the backtick character enables mixing quoting and evaluation. This is how macros are generally written in Lisp.

(setq value 42)

`(+ 5 value)
;; => (+ 5 value)

`(+ 5 ,value)
;; => (+ 5 42)

An expression preceded by a comma is evaluated, and its result is inserted into the surrounding quoted form. By mixing quoted forms and evaluation, it's easy to construct complex snippets of code for the compiler.
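
One more backquote feature worth knowing, since the parser leans on it later: ,@ splices a list of forms into the surrounding form instead of inserting it as a single element.

(setq body '((insert "a") (insert "b")))

`(lambda (env) ,body)
;; => (lambda (env) ((insert "a") (insert "b")))

`(lambda (env) ,@body)
;; => (lambda (env) (insert "a") (insert "b"))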

Evaluating Emacs Lisp expressions

Given that the parser generates quoted Emacs Lisp code, how does the compiler actually evaluate it? With eval.

(eval '(+ 1 2))
;; => 3

One problem remains, however. My original approach relied on alist-get to insert values from the parent environment into the source template when replacing regular expressions. This assumed that all text inside mustache braces was represented by a key-value pair in an association list.

;; Recall, replacing a variable in a template
(assert-compile-to-string
  :input    "Hello, {{ name }}!"
  :expected "Hello, world!"
  :env      '((name . "world")))

How does eval similarly replace variables from the parent environment? The key is the second argument of eval, LEXICAL:

Signature
(eval FORM &optional LEXICAL)

Documentation
Evaluate FORM and return its value.

If LEXICAL is t, evaluate using lexical scoping.
LEXICAL can also be an actual lexical environment, in the form of an
alist mapping symbols to their value.

If an alist is passed to eval, it's used as the lexical environment in which variables are resolved. This solves the problem of variable substitution without hard-coding the use of alist-get.

(eval 'name '((name . "world")))
;; "world"

When a quoted symbol is evaluated, Emacs Lisp looks it up in the LEXICAL alist to find its value. Cool.

Requiring a quoted symbol for the variable actually poses a bit of a problem for this compiler. When a layout is tokenized, that layout comes in as a string, and all of the generated tokens hold string values. Since eval expects a Lisp form, not a string, those strings need to be converted.

Luckily this problem is easily solvable with the function read-from-string:

(read-from-string "name")
;; => (name . 4)

(eval (car (read-from-string "name")) '((name . "world")))
;; => "world"

Parsing

With quoting and eval out of the way, it's time to fill in the parser. For each conditional branch against a token, the parser pushes a tree of Emacs Lisp expressions into the AST. When mustaches are detected (e.g. obrace), a quoted call to eval-string is emitted to defer evaluating that text as an Emacs Lisp expression. Regular text is inserted as-is.

(defun lastcar (l)
  "Extract the second (and, for two-element tokens, last) element of list L."
  (car (cdr l)))
  
(defun eval-string (string env)
  (eval (car (read-from-string string)) env))
  
(defun parse (tokens)
  (let ((expressions '()) (cur 0))
    (while (< cur (length tokens))
      (let ((token (nth cur tokens)))
        (cond ((eq 'text (car token))
               (push `(insert ,(lastcar token)) expressions))
              ((eq 'obrace (car token))
               ;; Assume that the only token between an obrace
               ;; and a cbrace is text.
               (push `(insert (eval-string
                               ,(lastcar (nth (1+ cur) tokens))
                               env))
                     expressions)
               (setq cur (+ cur 2)))
              ((eq 'cbrace (car token))
               (error "Unexpected closing brace"))))
      (setq cur (1+ cur)))
    ;; ,@ means "splice in the contents", kind of like
    ;; the ... spread operator in JS.
    `(lambda (env) ,@(reverse expressions))))

;; How the generated code is actually executed.
(defun compile-and-exec (input env)
  (funcall (parse (tokenize input)) env))

It's probably a little easier to look at the generated code:

;; Template: Hello, {{ name }}

(lambda (env)
  (insert "Hello, ")
  (insert (eval-string "name" env)))

The root of the tree is a lambda expression taking env as its single argument. The lexical environment, containing the page variables and other metadata, is assembled earlier in the static site generator and passed down as an alist.

Everything else boils down to insert statements, writing a string into the current buffer. What's great about the generated code is the deferred evaluation of env. The parser builds quoted forms to avoid working with env until the very last minute, that is, when the lambda expression is evaluated. This keeps the parser decoupled from anything that may happen in the lexical environment.
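
A nice consequence is that one compiled template can be run against many environments. A minimal sketch using the functions defined above:

(let ((template (parse (tokenize "Hello, {{ name }}!"))))
  (with-temp-buffer
    (funcall template '((name . "world")))
    (buffer-string)))
;; => "Hello, world!"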

compile-and-exec is meant for use with a fresh buffer, since the insert statements will mutate that buffer with their string arguments. with-temp-file is a good fit: it runs its body in a temporary buffer and writes the buffer contents out to a new file:

(with-temp-file destination-file
  (compile-and-exec template-string env))

That about wraps it up. Orgify supports some additional syntax not mentioned in this article, but hopefully it's clear how the components from the parser can be altered to add loops and conditionals.


Thanks for reading! Send your comments to [email protected].