A Descent Into JSON Parsing

It has been said that a programmer is not worth their salt until they understand how compilers work. In the spirit of self improvement, I’m taking a small step down that path by making a foray into the world of compiler frontends: lexing and parsing.

A good candidate for tiptoeing into this space is JSON, which is arguably the most popular data serialization format today. Writing a parser for JSON is not something most programmers do, as there exists a number of libraries for doing this type of thing in many of the mainstream languages. Taking a deep dive into the implementation of a technology that is often taken for granted seems like a good way of learning something important.

Here, we’ll attempt to write a simplified JSON parser that follows but does not fully implement the official JSON specification. In particular, only objects or arrays will be considered to be valid JSON text, and standalone string literals, numbers, and boolean values will not be considered valid. This constraint is not clearly specified in the original JSON specification, but we deliberately make this choice for the exercise. (However, as it turns out the implementation we arrive at below will actually be able to handle standalone literals, which is a nice side-effect).

Additionally, we will further simplify the task at hand by:

  • Not handling escape characters in string literals
  • Not handling scientific notation for numeric values

Boolean values and null are also out of scope and left as an exercise for the reader to add as extensions. Our JSON parsing task will be modeled as a single process that combines both lexing (i.e. tokenization) and parsing. The JSON text is to be treated as a stream of characters, and a stack data structure will be used for storing values.

To start, let’s define the basic parsing control flow. When a bracket or brace is encountered, we should create a new array or object, respectively, and push it onto the parser stack. Once the closing bracket or brace is encountered, the active container (i.e. value at the top of the stack) can be considered complete and popped off the stack. By handling the opening and closing brackets and braces, we will have defined a significant portion of the core parsing code.

def parse(text):
    stack = []
    index = 0
    while index < len(text):
        char = text[index]
        if char == '{':       # Start object
            stack.append({})
        elif char == '}':     # End object
            stack.pop()
        elif char == '[':     # Start array
            stack.append([])
        elif char == ']':     # End array
            stack.pop()
        index += 1

The stack keeps track of the arrays or objects that have been encountered thus far at any point in the stream. Importantly, it preserves the order of these arrays or objects, and will allow us to correctly handle nested structures. For this example JSON text:

[{"a":[1,2,3]},"b",["c"]]

a condensed view of the state stored by the stack looks like this:

[]
[[]]
[[], {}]
[[], {}, []]
[[], {}]
[[]]
[[], []]
[[]]
[]

The stack starts off empty ([]), then the initial/outer array is pushed onto the stack ([[]]). The first element in the array is an object, which itself contains an array as a value; this reflects the deepest part of the nesting in the structure, which is when the stack is at its maximum size: [[], {}, []]. As closing brackets and braces are encountered in the stream, items are popped off the stack accordingly.

Now that we have the basic structure, the next step is to define some logic for handling literals in the JSON text so that we can actually parse values and insert them into the container objects on the parser stack. For strings, we have made the simplifing assumption that there are no escape characters. This means we can simply look for opening and closing double quotes as the delimiters for a full string literal, and parse accordingly.

def parse_string(text, start):
    "Returns a string literal and index at last character of literal (i.e. a double quotation mark)"
    end = start + 1
    while text[end] != '"':
        end += 1
    return (text[start+1:end], end)

Numeric values mostly follow the same story. Since we are ignoring scientific notation, we just have to handle a few special cases around negative numbers and decimals. Without diminishing the value of this exercise, we rely on the string to number parsing logic built into the language of our choice.

def parse_number(text, start):
    "Returns a numeric literal and index at last character of literal"
    end = start
    has_decimal = False
    while end < len(text) and (text[end] == '-' or text[end] == '.' or (text[end] >= '0' and text[end] <= '9')):
        if text[end] == '.':
            has_decimal = True
        end += 1
    val = text[start:end]
    return (float(val) if has_decimal else int(val), end-1)

The parse_string and parse_number helper functions both effectively eagerly advance the pointer into the character stream. At the end of parsing a literal, the index in the original text pointing to the last character in the literal is returned (along with the actual literal value) to be used by the core parsing logic for advancing the pointer appropriately.

Once the parsing logic has been expanded to handle literal values, we are ready to connect the dots between values and containers. Since we are following the strict JSON specification, we can assume that there is always at least one active container during any valid parsing action. That is, after a string or numeric literal is successfully parsed, it should be added to the active container as per the semantics of the active container.

In fact, after any value is successfully parsed - be it an array, object, or literal - the value should be added to the active container. This is equivalent to adding the value to the item at the top of the stack. The important edge case is when a value is successfully parsed but there is nothing at the top of the stack (i.e. the stack is empty). When parsing valid JSON, this can happen only when the parsed value is the outermost container. In this case, the parsed value is the result to be returned in as the final parsed object.

For arrays, the underlying representation simply maps to the built-in list, vector, or equivalent sequential data structure of your choice. For objects, the important thing is to convert the container structure into a key-value structure, such as your garden variety hash table, hash map, associative array, or dictionary. The implementation here takes a shortcut by storing both types of containers as lists, and only upon creation will construct a dictionary be out of the list of elements. The list is expected to have an even number of elements in order to fulfill the key-value contract of a valid JSON object.

def parse(text):
    stack = []
    index = 0
    result = None
    while index < len(text):
        char = text[index]
        val = None
        if char == '{':                                    # Start object
            stack.append([])
        elif char == '}':                                  # End object
            val = stack.pop()
            val = dict(zip(val[::2], val[1::2]))
        elif char == '[':                                  # Start array
            stack.append([])
        elif char == ']':                                  # End array
            val = stack.pop()
        elif char == '"':                                  # String literal
            val, index = parse_string(text, index)
        elif char == '-' or (char >= '0' and char <= '9'): # Numeric literal
            val, index = parse_number(text, index)
        index += 1
        if val is not None:
            if stack:
                # Add parsed value to the active container at top of stack
                stack[-1].append(val)
            else:
                # Parsed value is the outermost container
                result = val
    return result

For convenience (and without compromising correctness), commas, colons, and whitespace characters are ignored by the parsing logic. Additionally, error checking can be added in the form of assertions on properties of certain values. For example, when a } character is encountered and a dictionary is to be created out of the container object at the top of the stack, a valid assertion is that the number of items in the container (i.e. list) should be even.

To close things out, a list of test cases for the parser:

assert parse('1.1') == 1.1
assert parse('1') == 1
assert parse('-0.3') == -0.3
assert parse('{}') == {}
assert parse('[]') == []
assert parse('""') == ""
s = '[{"a":[1,2,[{}],4]},"b",["c",{"d":6}]]'
assert parse(s) == eval(s)
s = '{"xkd":1, "kcw":2, "art":3, "hxm":4, "qrt":5, "pad":6, "hoy":7}'
assert parse(s) == eval(s)
s = '[{"a_key": 1, "b_\xe9": 2}, {"a_key": 3, "b_\xe9": 4}]'
assert parse(s) == eval(s)

That concludes our foray into the world of JSON parsing. The parser works but has certainly not been optimized for performance. A rough comparison shows that the built-in Python json library is roughly 11-12x faster than the implementation in this article. However, the hope is that the implementation can serve as an instructive baseline upon which to build faster and more fully-featured parsers.

Comments