UltiSnips got a new parser
UltiSnips is the Vim snippet plugin I started. It has become fairly successful, with quite a few people contributing snippets and code to it. I use it myself on a daily basis and can't remember how I did things before it was around. For example, I have the following snippet defined to start blog posts:
snippet blog "Blog Post Header" b
---
uuid: `!p if not snip.c: snip.rv = uuid.uuid4().hex`
date: `!p snip.rv = dt.datetime.now().strftime("%Y/%m/%d")`
type: ${1:article|quote|link|image|video|audio|custom}
title: ${2:Awesome, but short title}
tags: ${3:tag1,tag2}
---
${0}
endsnippet
The first line introduces this as a snippet definition, while the last line ends it. Everything in between is pasted into my text when I type blog followed by <Tab>. There is some special syntax:
- The parts enclosed in backticks and starting with !p are interpolated Python code; that is, the uuid line will become a UUID and the date line the current date.
- The parts starting with ${<number> are tabstops (tabs for short). When I expand the snippet, the first one gets selected and I overwrite it with my own text. Then, by pressing <C-j>, I quickly jump to the next tab, which is selected in turn so I can overwrite it.
All in all, UltiSnips spares me from remembering a lot of syntax (like the various article types my blog supports) and from a lot of manual work (like putting in the date or creating a UUID). I love it.
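To make the two interpolations concrete, here is a minimal sketch of what they compute when run outside of Vim. It assumes that `uuid` and `datetime` (as `dt`) have been imported somewhere in the snippet file, which the snippet above relies on but does not show. The function name `blog_header_values` and its parameter are made up for illustration; inside UltiSnips the current text comes from `snip.c` and the result is written to `snip.rv`.

```python
import datetime as dt
import uuid


def blog_header_values(current_uuid=""):
    """Mimic the two `!p` blocks of the blog snippet (hypothetical helper)."""
    # Like `if not snip.c`: only create a UUID if none has been filled in yet,
    # so the value stays stable while the snippet is being edited.
    post_id = current_uuid or uuid.uuid4().hex
    # Same format string as in the snippet, e.g. 2012/01/31.
    date = dt.datetime.now().strftime("%Y/%m/%d")
    return post_id, date


post_id, date = blog_header_values()
print("uuid:", post_id)
print("date:", date)
```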
The purpose of parsing
When <Tab> is pressed, UltiSnips must work out where the snippet needs to be inserted and where the individual Text Objects (tabs, Python code, shell code, VimL code, mirrors, transformations or escaped characters) end up in the text file the user is writing. To do that, it first has to learn where the text objects are inside the snippet and how they relate to each other (e.g., tabs can also sit inside the default text of other tabs). This requires parsing the snippet.
The original approach was to use a bunch of regular expressions to find individual Text Objects inside a snippet. This has a number of downsides; the two biggest are that
- the snippet syntax is not regular,
- the snippet syntax may contain ambiguity.
See this example:
$1 ${1:This is ${2:a default} text} $3 ${4:another tabstop}
The first point is illustrated by tabstop 1: regular expressions are out of the picture here, because you need to count the { and } to make sure you take neither too little nor too much text into your tabstop.
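A small sketch of the problem, using the example snippet from above. The helper `tabstop_body` is hypothetical and deliberately simplistic: it only counts braces and ignores escaped braces or braces inside interpolated code.

```python
import re

SNIPPET = "$1 ${1:This is ${2:a default} text} $3 ${4:another tabstop}"

# A non-greedy pattern stops at the first "}" and takes too little ...
naive = re.search(r"\$\{1:(.*?)\}", SNIPPET).group(1)
# ... while a greedy one runs to the last "}" and takes too much.
greedy = re.search(r"\$\{1:(.*)\}", SNIPPET).group(1)


def tabstop_body(text, start):
    """Return the text between the "{" at index `start` and its matching "}"."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i]
    raise ValueError("unbalanced braces")


counted = tabstop_body(SNIPPET, SNIPPET.index("${1:") + 1)

print(repr(naive))    # 'This is ${2:a default'
print(repr(greedy))   # 'This is ${2:a default} text} $3 ${4:another tabstop'
print(repr(counted))  # '1:This is ${2:a default} text'
```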
The ambiguity is with the $<number> syntax: the first $1 is a Mirror (a text object that simply repeats the contents of a tabstop), because a tabstop definition for 1 appears later in the snippet. The $3, however, is a tabstop without default text, as there is no proper tabstop definition for 3. The $3 could therefore be replaced by ${3:} or ${3} and the snippet would behave exactly the same.
Nevertheless, we have had a regular-expression-based parser for a while now and jumped through many hoops to support the cases I just mentioned. Not any more.
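The rule can be written down in a few lines. This is only an illustration of the decision, not how UltiSnips implements it: the function name `classify_bare_references` is invented, it uses regular expressions for brevity, and it ignores escaped dollar signs.

```python
import re


def classify_bare_references(snippet):
    """Label every bare $<number> in `snippet` as 'mirror' or 'tabstop'."""
    # Numbers that have a proper ${<number>...} definition anywhere in the text.
    defined = set(re.findall(r"\$\{(\d+)", snippet))
    result = []
    for match in re.finditer(r"\$(\d+)", snippet):
        number = match.group(1)
        if number in defined:
            # A tabstop for this number exists, so this one just repeats it.
            result.append((match.group(0), "mirror"))
        else:
            # First appearance without a definition: this is the tabstop.
            result.append((match.group(0), "tabstop"))
            defined.add(number)
    return result


example = "$1 ${1:This is ${2:a default} text} $3 ${4:another tabstop}"
print(classify_bare_references(example))
# [('$1', 'mirror'), ('$3', 'tabstop')]
```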
The new approach
This is what the new algorithm does:
- Tokenize the input text with a context-sensitive lexer. The lexer checks for each position in the string which token happens to start there. If a potential token is found, it is parsed according to its own rules.
- Iterate through all tokens found in the text.
  - Append the token and its parent to a list of all tokens in the snippet. As we recurse for tabstops (see next point), this builds a list of tokens in the order they appear in the text.
  - If the token is a tabstop, recurse the algorithm with text being the default text of the tabstop and parent being the tabstop itself.
  - If the token describes a text object that has no immediate reference to other tabstops (e.g. Python code interpolation, shell code interpolation or escaped special characters), create the text object and link it to its parent.
- When the whole snippet is traversed, apply the finishing touches to the list of all tokens and their parents:
  - Resolve the ambiguity. As we have seen all tokens by now and have already instantiated all easily identifiable tabstops, we can go through all the $<number> tokens in order, decide if each is a tabstop (first appearance) or a mirror (a tabstop is already known for this number), and create them accordingly.
  - Create objects with links to tabs. Transformations take their content from a tabstop. Now that all tabs are created, we can traverse the list of tokens again, create all the transformations and pass them their proper tabstop. Also, only now can we detect whether a transformation references a nonexistent tabstop.
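The steps above can be sketched in a condensed form as follows. This is not the UltiSnips code: all names (`Token`, `TextObject`, `tokenize`, `parse`) are invented for this illustration, transformations and shell/VimL code are left out, positions in the text are not tracked, and brace matching ignores escapes.

```python
import re
from dataclasses import dataclass
from typing import Optional

TABSTOP = re.compile(r"\$\{(\d+)(?::|(?=\}))")  # ${1: or ${3}
DOLLAR = re.compile(r"\$(\d+)")                 # $1 (tabstop or mirror)
PYTHON = re.compile(r"`!p (.*?)`")              # `!p code`


@dataclass
class Token:
    kind: str          # "tabstop", "dollar", "python" or "escape"
    number: int = 0
    content: str = ""  # default text, code or the escaped character


@dataclass
class TextObject:
    kind: str
    number: int = 0
    parent: Optional["TextObject"] = None


def matching_brace(text, start):
    """Index of the "}" that closes the "{" at index `start`."""
    depth = 0
    for i in range(start, len(text)):
        depth += {"{": 1, "}": -1}.get(text[i], 0)
        if depth == 0:
            return i
    raise ValueError("unbalanced braces")


def tokenize(text):
    """Check at each position which token starts there; skip plain text."""
    pos = 0
    while pos < len(text):
        if text[pos] == "\\" and pos + 1 < len(text):
            yield Token("escape", content=text[pos + 1])
            pos += 2
        elif m := TABSTOP.match(text, pos):
            end = matching_brace(text, pos + 1)
            yield Token("tabstop", int(m.group(1)), text[m.end():end])
            pos = end + 1
        elif m := DOLLAR.match(text, pos):
            yield Token("dollar", int(m.group(1)))
            pos = m.end()
        elif m := PYTHON.match(text, pos):
            yield Token("python", content=m.group(1))
            pos = m.end()
        else:
            pos += 1


def parse(snippet):
    root = TextObject("snippet")
    all_tokens = []  # (parent, token) pairs in order of appearance
    tabstops = {}    # number -> TextObject
    objects = []     # every text object created, in creation order

    def walk(text, parent):
        for token in tokenize(text):
            all_tokens.append((parent, token))
            if token.kind == "tabstop":
                obj = TextObject("tabstop", token.number, parent)
                tabstops[token.number] = obj
                objects.append(obj)
                walk(token.content, obj)  # recurse into the default text
            elif token.kind in ("python", "escape"):
                objects.append(TextObject(token.kind, parent=parent))

    walk(snippet, root)

    # Finishing touches: every explicit tabstop is known by now, so each
    # $<number> can be resolved to a mirror or a tabstop.
    for parent, token in all_tokens:
        if token.kind == "dollar":
            if token.number in tabstops:
                objects.append(TextObject("mirror", token.number, parent))
            else:
                obj = TextObject("tabstop", token.number, parent)
                tabstops[token.number] = obj
                objects.append(obj)
    return objects


objs = parse("$1 ${1:This is ${2:a default} text} $3 ${4:another tabstop}")
for obj in objs:
    print(obj.kind, obj.number, "inside", obj.parent.kind, obj.parent.number)
```

For the example snippet this creates tabstops 1, 2 and 4 during the traversal (with tabstop 2 having tabstop 1 as its parent), and only afterwards turns $1 into a mirror and $3 into a tabstop.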
You can find the implementation of the new parser in the _TOParser class and the implementation of the lexer in the new Lexer.py file.