Kelvin’s Domain

Stoicism in the Time of Coronavirus

2020-03-30T00:00:00-05:00

I recently started exploring the ancient philosophy of stoicism by reading William Irvine’s A Guide to the Good Life: The Ancient Art of Stoic Joy. After finishing the book, the plan was to start a series of articles that attempt to draw lines between the ideas and techniques of ancient stoicism and modern-day programming and software engineering. For example, a decent argument can be made that the practice of chaos engineering goes hand in hand with the recurring theme of negative visualization in stoicism, where the robustness of a system improves over time due to the creators of the system actively thinking about and preparing for all the things that could go wrong.

However, in light of the coronavirus pandemic that has been unfolding over the last few months, it feels more apt to discuss stoicism the way it was intended by ancient scholars and practitioners - as a philosophy of life particularly relevant during turbulent times of uncertainty, and to think about how it may help people from all walks of life get through this crisis. At a later time it may be useful to revisit the idea of stoical (and perhaps even antifragile) software; but given the current situation, let us find ways to remain stoic in the face of shared adversity.

There are many facets of stoicism that can be applied to modern life, particularly during times of hardship. Irvine’s book lays out some of the fundamental pillars of the ancient Stoic school, and these pillars are useful for guiding the discussion in uncovering how stoicism can help one cope during a time of crisis.

The first place to start is with the idea of negative visualization, or the practice of contemplating the loss of things you value in life so as to gain a greater appreciation for such things, and to be mentally and emotionally protected against their eventual loss should such an event occur. As this unprecedented pandemic unfolds, I think it’s fair to say that few people, if any, had considered the loss of so many things that many take for granted: the freedom to work, socialize, and simply be outside; the existence of schools and child care; safe and stable work environments; even reliably procuring household essentials such as groceries and toilet paper.

Stoicism recommends proactively visualizing the loss of things one takes for granted, so that one may gain a greater appreciation for these things and be better prepared, literally and figuratively, when those things are no longer available. For those in the middle of a shelter-in-place, lockdown lifestyle, the stoic silver lining is that after the crisis is over they will be able to withstand much more difficult hardships in the future. In these times, all of us are a little bit like Seneca, the Roman Stoic, who was exiled to an island for multiple years for political reasons. Seneca’s stoic ways allowed him to get through the tough times. The current situation is hard, but once it has been overcome, the built-up scar tissue will remain.

A more extreme form of negative visualization is the practice of self-denial. By intentionally subjecting oneself to minor pains, displeasures, and inconveniences, one learns to better appreciate the smaller things. For many people unaware of or inexperienced in the practice of self-denial, the state-mandated orders to stay at home are more or less a collective self-denial that has been uniformly forced upon everyone simultaneously.

Nevertheless, negative visualization is still a useful technique for coping with this pandemic. Even for those under quarantine and self-isolation, it’s useful to contemplate the loss of and appreciate the things that still have not been taken away: one’s own health, the health of loved ones, the ability to connect and communicate with friends and family, and the spirit of compassion, community, and volunteerism that is emerging all around the world.

The slower pace of life under quarantine is a blessing in disguise to a practicing stoic: it is a hard lesson in learning to appreciate the small things and the things often taken for granted. Take a walk outside and breathe in the air that has been made fresher by fewer cars on the road and planes in the sky. Take a moment to appreciate the massive but mostly invisible technological infrastructure that allows billions to stay connected despite being physically separate.

Another theme is the dichotomy of control, or the partitioning of things, experiences, or events one encounters into those over which one has no control, versus those over which one has partial or full control. Irvine actually advocates for a trichotomy, but using a binary partitioning can be an equally effective simplification.

The key idea here is to shift one’s attention away from things that cannot be controlled, which is particularly important during times of great uncertainty and chaos. Things such as government policies, temporary restrictions on freedom, stock market movements, and disease infection and mortality rates, for the most part can be considered to be things out of one’s control.

Instead, focus should be placed on things that are within an individual’s immediate control. Some of the most important messages being communicated by public health officials around the world are simple things that each individual has full control over. In order to reduce risk of infection, consistently and repeatedly do the simple things to minimize probabilities of contracting the virus: stay at home, stay away from other people, and practice good hygiene by thoroughly washing hands.

While staying at home, it is equally important to focus on general health. Eating healthy food, getting adequate sleep, and exercising regularly are all things that are fully under one’s control and may simply require a bit of creativity to maintain. Mental health may be overlooked but is just as important, and many practices that improve it also fall under the realm of control: avoid consuming too much doom and gloom from the media; and for those who are well-versed, meditation and mindfulness can go a long way in remaining even-keeled. The stoic practice of meditation was perhaps not as rigorous as modern-day practices, but does encourage constant self-reflection and introspection. With many parts of life slowing down or grinding to a halt, looking deeply inward and reflecting on one’s values can be a useful and rewarding exercise.

The final theme is fatalism - the idea of treating the past and present “as fate would have it” and relinquishing fears and worries over things that are happening or have already happened. It is important to resist the urge to point fingers and to assign blame. The events of the recent past and present are unfolding before our eyes, and being up in arms about things that cannot be changed is mostly a useless exercise. However, it is critical to focus on the future, as fate has yet to decree the way the future will unfold. Focusing on the future means being fatalistic about the past and present while learning from them, and making sure the hard lessons do not go to waste.

Perhaps the most important way of focusing on the future is to think about the extent to which one can control the future, which harks back to the theme of dichotomy of control. Second guessing and blaming elected officials for their mistakes and missteps is not terribly useful. Government policies, responses, and the subsequent outcomes are largely out of our control. However, it is important to keep these outcomes in mind, along with the words and actions of elected officials and representatives. When it comes time for one to flex civic duty muscles, use the power of the ballot to shape the future policies, laws, and institutions and ensure that one does not remain blindly fatalistic about the future.

Intuitive Iterators For Binary Trees

2019-12-06T00:00:00-06:00

As a follow-up to the previous article on intuitive iterative traversals for binary trees, we discuss the application of the intuitive stack based method to the task of creating iterators for binary trees. The iterator is an abstraction for making incremental traversals over data containers or data streams. At its core, the iterator pattern is implemented by two functions:

next - a function that retrieves the next element in the data container or stream and advances the underlying cursor
hasNext - a function that indicates whether or not the underlying data container or stream has been exhausted.

As a design pattern, the iterator can be used for accessing data streams with an arbitrary, and sometimes infinite, number of elements. For data structures, an iterator is useful for performing computations based on advancing cursors in certain defined orders. An elementary application of iterators to binary trees is for traversing a given tree in preorder, inorder, or postorder fashion one element at a time. For example, a specific use case might be an iterator over binary search trees that provides access to comparable elements in increasing order by means of an inorder iterator.

Implementing an iterator for binary trees by modeling the access pattern for a traversal is not a very intuitive task. However, using the technique covered in the previous article of a stack and an algorithm that employs bookkeeping of discovered and undiscovered nodes, an implementation follows quite naturally.

Let’s walk through implementation of an inorder iterator. To implement this iterator, its internal state can be initialized in the same way as the beginning of a conventional, all-at-once iterative traversal by pushing the root node onto an empty stack.

def __init__(self, root):
    self.stack = []
    self.discovered = {}
    if root:
        self.stack.append(root)

The stack will serve as the primary internal state of the iterator. When the stack is empty, that indicates the end of the traversal and the depletion of elements.

def hasNext(self):
    return len(self.stack) > 0

Traversing a tree while advancing the implicit cursor requires a bit of thought. Just like in the conventional, all-at-once traversal, the top of the stack always contains the next element to be processed. However, it may not be the next element to be visited. For inorder traversals, we previously determined that a visit should happen only if the node has already been discovered. Thus, we will continue to process nodes until we come across a discovered node, in which case its value can be returned. Any undiscovered node should be processed by marking it as discovered and pushing the node and its children onto the stack in the appropriate order.

def next(self):
    node = self.stack.pop()
    while node not in self.discovered:
        self.discovered[node] = True
        if node.right:
            self.stack.append(node.right)
        self.stack.append(node)
        if node.left:
            self.stack.append(node.left)
        node = self.stack.pop()
    return node.val

As a sanity check, we can walk through a base case using the logic above. For a single node binary tree, the root node will initially be undiscovered. The first invocation of next will pop the root node off the stack, mark it as discovered, push it back onto the stack, then pop it off the stack again. Since the node is now marked as discovered, its value can be returned. The resulting stack is empty and an invocation of hasNext will return False, indicating that the iterator has been exhausted.

Putting it all together, we have an implementation of an inorder iterator for binary trees that follows quite naturally from the intuitive stack based method of traversing binary trees.

class BinaryTreeInorderIterator:
    def __init__(self, root):
        self.stack = []
        self.discovered = {}
        if root:
            self.stack.append(root)

    def hasNext(self):
        return len(self.stack) > 0

    def next(self):
        node = self.stack.pop()
        while node not in self.discovered:
            self.discovered[node] = True
            if node.right:
                self.stack.append(node.right)
            self.stack.append(node)
            if node.left:
                self.stack.append(node.left)
            node = self.stack.pop()
        return node.val
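
As a rough usage sketch, the iterator can be drained over a small binary search tree to check that values come out in increasing order. The Node class and the drain helper are hypothetical names introduced only for this example (the article does not define a node type), and the iterator is repeated here, with a set standing in for the dict, so that the snippet runs on its own.

```python
class Node:
    """Hypothetical minimal tree node; the article does not define one."""
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


class BinaryTreeInorderIterator:
    """Same logic as the class above; a set replaces the dict of flags."""
    def __init__(self, root):
        self.stack = [root] if root else []
        self.discovered = set()

    def hasNext(self):
        return len(self.stack) > 0

    def next(self):
        node = self.stack.pop()
        while node not in self.discovered:
            self.discovered.add(node)
            if node.right:
                self.stack.append(node.right)
            self.stack.append(node)
            if node.left:
                self.stack.append(node.left)
            node = self.stack.pop()
        return node.val


def drain(iterator):
    """Collect every remaining value from the iterator into a list."""
    values = []
    while iterator.hasNext():
        values.append(iterator.next())
    return values


# A small binary search tree:
#         4
#        / \
#       2   6
#      / \
#     1   3
bst = Node(4, Node(2, Node(1), Node(3)), Node(6))
in_order = drain(BinaryTreeInorderIterator(bst))
print(in_order)  # [1, 2, 3, 4, 6]
```

Note that next assumes hasNext has returned True; calling it on an exhausted iterator pops from an empty list and raises IndexError.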

As previously discussed, an implementation of a preorder or postorder iterator can easily be derived by modifying the order in which a node and its child nodes are pushed onto the stack.
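
One way to sketch that derivation is a single iterator whose push order depends on the requested traversal. This is an illustration rather than code from the article: the Node class, the BinaryTreeOrderIterator name, and the order strings are all assumptions made for the example.

```python
class Node:
    """Hypothetical minimal tree node used only for this sketch."""
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


class BinaryTreeOrderIterator:
    """Iterator whose traversal order is chosen by the push sequence."""
    def __init__(self, root, order="inorder"):
        if order not in ("preorder", "inorder", "postorder"):
            raise ValueError("unknown order: %r" % (order,))
        self.order = order
        self.stack = [root] if root else []
        self.discovered = set()

    def hasNext(self):
        return len(self.stack) > 0

    def _expand(self, node):
        # Push in the reverse of the desired processing order, because
        # the last item pushed is the first one popped.
        if self.order == "preorder":
            sequence = [node.right, node.left, node]
        elif self.order == "inorder":
            sequence = [node.right, node, node.left]
        else:  # postorder
            sequence = [node, node.right, node.left]
        self.stack.extend(n for n in sequence if n is not None)

    def next(self):
        node = self.stack.pop()
        while node not in self.discovered:
            self.discovered.add(node)
            self._expand(node)
            node = self.stack.pop()
        return node.val


def collect(root, order):
    """Drain an iterator of the given order into a list."""
    it = BinaryTreeOrderIterator(root, order)
    values = []
    while it.hasNext():
        values.append(it.next())
    return values


# Tree: A has children B and C; B has children D and E.
tree = Node("A", Node("B", Node("D"), Node("E")), Node("C"))
print(collect(tree, "preorder"))   # ['A', 'B', 'D', 'E', 'C']
print(collect(tree, "postorder"))  # ['D', 'E', 'B', 'C', 'A']
```

Only the _expand method differs between the three orders; the discovered bookkeeping and the next loop stay the same.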

A Descent Into JSON Parsing

2019-11-05T00:00:00-06:00

It has been said that a programmer is not worth their salt until they understand how compilers work. In the spirit of self improvement, I’m taking a small step down that path by making a foray into the world of compiler frontends: lexing and parsing.

A good candidate for tiptoeing into this space is JSON, which is arguably the most popular data serialization format today. Writing a parser for JSON is not something most programmers do, as there exist a number of libraries for doing this type of thing in many of the mainstream languages. Taking a deep dive into the implementation of a technology that is often taken for granted seems like a good way of learning something important.

Here, we’ll attempt to write a simplified JSON parser that follows but does not fully implement the official JSON specification. In particular, only objects or arrays will be considered to be valid JSON text, and standalone string literals, numbers, and boolean values will not be considered valid. This constraint is not clearly specified in the original JSON specification, but we deliberately make this choice for the exercise. (However, as it turns out the implementation we arrive at below will actually be able to handle standalone literals, which is a nice side-effect).

Additionally, we will further simplify the task at hand by:

Not handling escape characters in string literals
Not handling scientific notation for numeric values

Boolean values and null are also out of scope and left as an exercise for the reader to add as extensions. Our JSON parsing task will be modeled as a single process that combines both lexing (i.e. tokenization) and parsing. The JSON text is to be treated as a stream of characters, and a stack data structure will be used for storing values.

To start, let’s define the basic parsing control flow. When a bracket or brace is encountered, we should create a new array or object, respectively, and push it onto the parser stack. Once the closing bracket or brace is encountered, the active container (i.e. value at the top of the stack) can be considered complete and popped off the stack. By handling the opening and closing brackets and braces, we will have defined a significant portion of the core parsing code.

def parse(text):
    stack = []
    index = 0
    while index < len(text):
        char = text[index]
        if char == '{':       # Start object
            stack.append({})
        elif char == '}':     # End object
            stack.pop()
        elif char == '[':     # Start array
            stack.append([])
        elif char == ']':     # End array
            stack.pop()
        index += 1

The stack keeps track of the arrays or objects that have been encountered thus far at any point in the stream. Importantly, it preserves the order of these arrays or objects, and will allow us to correctly handle nested structures. For this example JSON text:

[{"a":[1,2,3]},"b",["c"]]

a condensed view of the state stored by the stack looks like this:

[]
[[]]
[[], {}]
[[], {}, []]
[[], {}]
[[]]
[[], []]
[[]]
[]

The stack starts off empty ([]), then the initial/outer array is pushed onto the stack ([[]]). The first element in the array is an object, which itself contains an array as a value; this reflects the deepest part of the nesting in the structure, which is when the stack is at its maximum size: [[], {}, []]. As closing brackets and braces are encountered in the stream, items are popped off the stack accordingly.
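
The listing above can be reproduced with a small tracing variant of the skeleton. This is only a sketch: trace_stack is a hypothetical helper, and like the skeleton it looks at brackets and braces alone, so a bracket inside a string literal would be miscounted.

```python
import copy


def trace_stack(text):
    """Return snapshots of the parser stack: one before parsing starts and
    one after every opening or closing bracket or brace."""
    stack = []
    snapshots = [copy.deepcopy(stack)]
    for char in text:
        if char == '{':       # Start object
            stack.append({})
        elif char == '[':     # Start array
            stack.append([])
        elif char in '}]':    # End object or array
            stack.pop()
        else:
            continue          # Literals and separators leave the stack as is
        snapshots.append(copy.deepcopy(stack))
    return snapshots


for snapshot in trace_stack('[{"a":[1,2,3]},"b",["c"]]'):
    print(snapshot)
```

Running it prints the same nine states as the condensed view, from the empty stack through the maximum depth of three and back to empty.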

Now that we have the basic structure, the next step is to define some logic for handling literals in the JSON text so that we can actually parse values and insert them into the container objects on the parser stack. For strings, we have made the simplifying assumption that there are no escape characters. This means we can simply look for opening and closing double quotes as the delimiters for a full string literal, and parse accordingly.

def parse_string(text, start):
    "Returns a string literal and index at last character of literal (i.e. a double quotation mark)"
    end = start + 1
    while text[end] != '"':
        end += 1
    return (text[start+1:end], end)

Numeric values mostly follow the same story. Since we are ignoring scientific notation, we just have to handle a few special cases around negative numbers and decimals. Without diminishing the value of this exercise, we rely on the string to number parsing logic built into the language of our choice.

def parse_number(text, start):
    "Returns a numeric literal and index at last character of literal"
    end = start
    has_decimal = False
    while end < len(text) and (text[end] == '-' or text[end] == '.' or (text[end] >= '0' and text[end] <= '9')):
        if text[end] == '.':
            has_decimal = True
        end += 1
    val = text[start:end]
    return (float(val) if has_decimal else int(val), end-1)

The parse_string and parse_number helper functions both eagerly advance the pointer into the character stream. At the end of parsing a literal, the index in the original text pointing to the last character in the literal is returned (along with the actual literal value) to be used by the core parsing logic for advancing the pointer appropriately.
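
To make the return convention concrete, here is a small check on an arbitrary example input. The two helper bodies are copied from above so the snippet runs on its own; the sample text and variable names are assumptions made for illustration.

```python
def parse_string(text, start):
    "Returns a string literal and index at last character of literal (i.e. a double quotation mark)"
    end = start + 1
    while text[end] != '"':
        end += 1
    return (text[start+1:end], end)


def parse_number(text, start):
    "Returns a numeric literal and index at last character of literal"
    end = start
    has_decimal = False
    while end < len(text) and (text[end] == '-' or text[end] == '.' or (text[end] >= '0' and text[end] <= '9')):
        if text[end] == '.':
            has_decimal = True
        end += 1
    val = text[start:end]
    return (float(val) if has_decimal else int(val), end-1)


# Index:  0 1 2 3 4 5 6 7
# Char:   [ " a b " , 1 ]
text = '["ab",1]'

string_val, string_end = parse_string(text, 1)
number_val, number_end = parse_number(text, 6)

print(string_val, string_end)  # ab 4  (index 4 is the closing quote)
print(number_val, number_end)  # 1 6   (index 6 is the last digit)
```

In both cases the caller can resume scanning at the returned index plus one, which is exactly what a trailing index += 1 in the main loop would do.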

Once the parsing logic has been expanded to handle literal values, we are ready to connect the dots between values and containers. Since we are following the strict JSON specification, we can assume that there is always at least one active container during any valid parsing action. That is, after a string or numeric literal is successfully parsed, it should be added to the active container as per the semantics of the active container.

In fact, after any value is successfully parsed - be it an array, object, or literal - the value should be added to the active container. This is equivalent to adding the value to the item at the top of the stack. The important edge case is when a value is successfully parsed but there is nothing at the top of the stack (i.e. the stack is empty). When parsing valid JSON, this can happen only when the parsed value is the outermost container. In this case, the parsed value is the result to be returned as the final parsed object.

For arrays, the underlying representation simply maps to the built-in list, vector, or equivalent sequential data structure of your choice. For objects, the important thing is to convert the container structure into a key-value structure, such as your garden variety hash table, hash map, associative array, or dictionary. The implementation here takes a shortcut by storing both types of containers as lists, and only when an object is closed will a dictionary be constructed out of the list of elements. The list is expected to have an even number of elements in order to fulfill the key-value contract of a valid JSON object.

def parse(text):
    stack = []
    index = 0
    result = None
    while index < len(text):
        char = text[index]
        val = None
        if char == '{':                                    # Start object
            stack.append([])
        elif char == '}':                                  # End object
            val = stack.pop()
            val = dict(zip(val[::2], val[1::2]))
        elif char == '[':                                  # Start array
            stack.append([])
        elif char == ']':                                  # End array
            val = stack.pop()
        elif char == '"':                                  # String literal
            val, index = parse_string(text, index)
        elif char == '-' or (char >= '0' and char <= '9'): # Numeric literal
            val, index = parse_number(text, index)
        index += 1
        if val is not None:
            if stack:
                # Add parsed value to the active container at top of stack
                stack[-1].append(val)
            else:
                # Parsed value is the outermost container
                result = val
    return result

For convenience (and without compromising correctness), commas, colons, and whitespace characters are ignored by the parsing logic. Additionally, error checking can be added in the form of assertions on properties of certain values. For example, when a } character is encountered and a dictionary is to be created out of the container object at the top of the stack, a valid assertion is that the number of items in the container (i.e. list) should be even.
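
A minimal sketch of that kind of check follows. The build_object name is hypothetical and not part of the parser above; it raises ValueError rather than using assert so the checks survive when Python runs with assertions disabled. The string-key check is an extra validation that follows from JSON requiring object names to be strings.

```python
def build_object(items):
    """Turn a flat [key, value, key, value, ...] list into a dict,
    validating its shape first."""
    if len(items) % 2 != 0:
        raise ValueError("object has a key without a value: %r" % (items,))
    keys, values = items[::2], items[1::2]
    for key in keys:
        if not isinstance(key, str):
            raise ValueError("object key must be a string: %r" % (key,))
    return dict(zip(keys, values))


print(build_object(["a", 1, "b", [2, 3]]))  # {'a': 1, 'b': [2, 3]}
```

In the parser, the line that builds the dictionary when a } is encountered could call a helper like this instead of calling dict(zip(...)) directly.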

To close things out, a list of test cases for the parser:

assert parse('1.1') == 1.1
assert parse('1') == 1
assert parse('-0.3') == -0.3
assert parse('{}') == {}
assert parse('[]') == []
assert parse('""') == ""
s = '[{"a":[1,2,[{}],4]},"b",["c",{"d":6}]]'
assert parse(s) == eval(s)
s = '{"xkd":1, "kcw":2, "art":3, "hxm":4, "qrt":5, "pad":6, "hoy":7}'
assert parse(s) == eval(s)
s = '[{"a_key": 1, "b_\xe9": 2}, {"a_key": 3, "b_\xe9": 4}]'
assert parse(s) == eval(s)

That concludes our foray into the world of JSON parsing. The parser works but has certainly not been optimized for performance. A rough comparison shows that the built-in Python json library is roughly 11-12x faster than the implementation in this article. However, the hope is that the implementation can serve as an instructive baseline upon which to build faster and more fully-featured parsers.
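
As one possible first extension, the Boolean and null literals set aside earlier could be handled by a helper that follows the same return convention as parse_string and parse_number. Everything named here (parse_keyword, KEYWORDS, JSON_NULL) is hypothetical. The sentinel is an assumption forced by the main loop, which uses None to mean that no value was parsed at the current position; a final pass would have to convert the sentinel back to None.

```python
# Sentinel standing in for JSON null, because the parse loop uses None
# to signal "nothing parsed here". Hypothetical name for this sketch.
JSON_NULL = object()

KEYWORDS = {"true": True, "false": False, "null": JSON_NULL}


def parse_keyword(text, start):
    "Returns a keyword literal and index at last character of literal"
    for word, value in KEYWORDS.items():
        if text.startswith(word, start):
            return (value, start + len(word) - 1)
    raise ValueError("unexpected token at index %d" % start)


print(parse_keyword('[true]', 1))   # (True, 4)
print(parse_keyword('false', 0))    # (False, 4)
```

Wiring it in would mean adding a branch to the main loop for the characters t, f, and n, alongside the existing branches for strings and numbers.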

Intuitive Iterative Binary Tree Traversals

2019-10-23T00:00:00-05:00

Binary tree traversals are a staple of the technical interview process at many software companies, small and large. For anyone with an understanding of recursion, the family of traversal techniques is quite straightforward. A common twist on these concepts that shows up more in technical interviews than undergraduate computer science problem sets is the rather artificial constraint that asks one to implement the traversals using iteration rather than recursion.

I’ve always found the reference implementations of iterative tree traversals, particularly inorder traversal, to be lacking in intuitive understanding. The classic way of iteratively traversing a binary tree is to use a stack data structure, and the first snippet of code you often see is something like this:

stack = []
curr = root
while curr is not null || stack is not empty
    while curr is not null
        stack.push(curr)
        curr = curr.left
        ...

When I see this code, I immediately have more questions than insight.

Why is the code following left pointers while pushing all the nodes onto the stack?
Why is the loop condition checking against both the current node and the stack size?
Why is there a nested while loop?

To me, the intuitive way to reason about an iterative implementation of a recursive function is to simulate a call stack, and that usually begins with a pen and paper. For example, suppose we have a binary tree that looks like this:

    A
   / \
  B   C
 / \
D   E

An inorder traversal of such a tree should yield the nodes in this order: D, B, E, A, C. The best way to simulate the call stack that yields such a traversal is to draw out the contents of the stack as the traversal makes its way through the tree. I like to model my stacks after the real world, with a physical base to indicate the bottom of the stack, and elements being pushed on and popped off. Here’s the visualization of an empty stack, and its transformation following two push (push(A), push(B)) operations and one pop (pop()) operation:

                  B      ~B~
          A       A       A
_____   _____   _____   _____
stack   stack   stack   stack

At the end of the operations, the stack contains a single element A. The notation here marks any items popped off the stack with strikethrough-like markers (~), but leaves them on the stack in their original location to better illustrate ordering. We can now use this notation to visually simulate the first few steps of what an inorder traversal on the example tree might look like using a standard depth-first search approach. Initially, the stack is empty, and the root node is pushed onto the stack.

  A
_____
stack

We invoke the same logic repeatedly while the stack has items: pop a node off and process it. When a node is popped off the stack, we need to process it in inorder fashion: traverse its left child first, visit itself, then traverse its right child. This translates to push(C), push(A), and push(B). Notice that the elements are pushed onto the stack in reverse order from the way they would be processed in a standard recursive implementation, so as to achieve the desired order. Perhaps more importantly, the node itself (A) is pushed back onto the stack.

                  B
                  A
                  C
  A      ~A~     ~A~
_____   _____   _____
stack   stack   stack

At this point, B is popped off the stack and the same logic is applied. The traversal proceeds down to the left child of B, followed by a visit to B, and a subsequent traversal down the right child of B. That is, push(E), push(B), push(D). Here’s what the stack looks like after reaching the first leaf node D:

                  D
                  B
                  E
  B      ~B~     ~B~
  A       A       A
  C       C       C
 ~A~     ~A~     ~A~
_____   _____   _____
stack   stack   stack

Since D has no child nodes, it will be popped off the stack then pushed back onto the stack. Here’s the second piece of logic that is core to our traversal: if a node being processed has already been discovered, then it should be visited. With that, the algorithm for our inorder traversal - with respect to processing a single node - can be expressed as follows:

if the node has already been discovered
    "visit" or do something with the node
else
    mark the node as discovered
    push the right child of the node onto the stack
    push the node onto the stack
    push the left child of the node onto the stack

As noted earlier, the symmetry between this iterative approach and the standard recursive implementation is clear. The recursive implementation will first (in an eager, depth-first manner) traverse down the left child of a given node, then visit the node, followed by a traversal down the right child. Using an explicit stack data structure to simulate the calling pattern simply means pushing the nodes onto the explicit stack in reverse order as compared to the implicit recursion call stack.

To illustrate the process more clearly, we can push nodes onto the stack with an explicit status: a start status to indicate that the node has yet to be processed and an end status to indicate that it has been processed. For example, A.start and A.end will represent the start and end state for a node A, respectively. Here’s the same visualization for the same first few operations as above, with explicit status attributes for each node:

                       B.start
                       A.end
                       C.start
A.start   ~A.start~   ~A.start~
 _____      _____       _____
 stack      stack       stack

With this extended notation, the logic needed to process any given node at the top of the stack is clear. If a node at the top of the stack has a start status, push its right child onto the stack with a start status, push itself back onto the stack with an end status, and push its left child onto the stack with a start status.
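
The start/end notation maps directly onto code if each stack entry is a (node, status) pair. The following is a sketch under that assumption; the Node class, the START and END constants, and the function name are hypothetical and introduced only for illustration.

```python
START, END = "start", "end"


class Node:
    """Hypothetical minimal tree node used only for this sketch."""
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def inorder_with_status(root):
    """Inorder traversal where each stack entry carries an explicit status."""
    visited = []
    stack = [(root, START)] if root else []
    while stack:
        node, status = stack.pop()
        if status == END:
            # Second time this node is seen: visit it
            visited.append(node.val)
        else:
            # First time: schedule right subtree, the node itself, then left
            if node.right:
                stack.append((node.right, START))
            stack.append((node, END))
            if node.left:
                stack.append((node.left, START))
    return visited


# The example tree: A has children B and C; B has children D and E.
tree = Node("A", Node("B", Node("D"), Node("E")), Node("C"))
print(inorder_with_status(tree))  # ['D', 'B', 'E', 'A', 'C']
```

Because the status travels with the stack entry, no separate record of discovered nodes is needed in this variant.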

An astute reader will notice that since there are only two states, a binary flag is sufficient for storing the same information. In fact, this can be represented in exactly the same way as the classic implementations of graph traversals introduced in CLRS, in which nodes are assigned colors to keep track of traversal progress. In the case of binary tree traversals, we only need two colors: white for undiscovered nodes and black for discovered nodes. With that insight, the iterative version of any of the traversals becomes easy to derive:

def iterative_inorder_traversal(root):
    # Start with the root on the stack, unless the tree is empty
    stack = [root] if root else []
    discovered = {}
    while len(stack) > 0:
        node = stack.pop()
        if node in discovered:
            pass # "Visit" or do something with the node
        else:
            discovered[node] = True
            if node.right:
                stack.append(node.right)
            stack.append(node)
            if node.left:
                stack.append(node.left)

An iterative implementation of preorder or postorder traversal should easily follow from the inorder traversal; the sequence in which the nodes should be pushed onto the stack simply needs to be modified to match the desired traversal behavior. The cost of the intuitive version of these iterative traversals is a larger constant in the runtime complexity, as each node is actually processed twice. Ultimately, the runtime still grows at a rate linearly proportional to the size of the input.
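
As an illustration of both points, here is a postorder variant that also counts how many times a node is popped off the stack. It is a sketch, not code from the article: the Node class and the pop counter are additions for the example, and a set replaces the dict of discovered nodes.

```python
class Node:
    """Hypothetical minimal tree node used only for this sketch."""
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def iterative_postorder_traversal(root):
    """Return (values in postorder, total number of stack pops)."""
    visited, pops = [], 0
    stack = [root] if root else []
    discovered = set()
    while stack:
        node = stack.pop()
        pops += 1
        if node in discovered:
            visited.append(node.val)
        else:
            discovered.add(node)
            # Reverse of postorder (left, right, node): push the node first
            stack.append(node)
            if node.right:
                stack.append(node.right)
            if node.left:
                stack.append(node.left)
    return visited, pops


# The example tree: A has children B and C; B has children D and E.
tree = Node("A", Node("B", Node("D"), Node("E")), Node("C"))
values, pops = iterative_postorder_traversal(tree)
print(values)  # ['D', 'E', 'B', 'C', 'A']
print(pops)    # 10
```

Each of the five nodes is popped exactly twice, once undiscovered and once discovered, which is where the factor of two in the constant comes from.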

Wrath of the Amazon Mechanical Turks

2019-07-03T00:00:00-05:00

I recently launched a small hobby website that aggregates documents and papers posted to a popular tech news website. Some of the feedback I received after the launch included suggestions to categorize the aggregated documents. It seemed like a nice, small exercise in document categorization, and I decided to take a shot using the data I had on hand, with the objective being to determine the category for a document from just the title text.

For starters, I limited the dataset to arXiv.org submissions, and used the categories associated with each document as ground truth labels. After playing around with the data, I realized that I would need an expanded dataset if I wanted to train useful models that could differentiate between a variety of subjects beyond just those related to science and technology, such as business, economics, games, news, and politics.

Enter the Mechanical Turks

In order to get my hands on a quality dataset with document labels for the categories I wanted, I turned to Amazon Mechanical Turk. I had prior experience with Amazon Mechanical Turk from using it shortly after its initial launch, playing around as a worker and earning pennies per task by solving unsophisticated CAPTCHA puzzles or examining satellite imagery to look for famed computer scientist and missing person Jim Gray.

After signing in as a requester and setting up my project, I was struck by how outdated the entire Amazon Mechanical Turk website appeared. Upon creation of a project and submission of a batch of tasks, simply viewing the progress of the tasks and downloading the ongoing results is a very clunky experience. The modal dialogs feel like they’re stuck in 2007, and look out of place when compared to the user interfaces of modern AWS services. However, the lackluster user experience as a requester was nothing compared to the anger I would soon face from other users as I started reviewing the results rolling in.

Big Bad Data

Document categorization is a fairly commonplace project by Amazon Mechanical Turk standards; the project creation page even has a built-in template that makes the setup for this class of projects fairly straightforward. A requester has the option of sending the same task (i.e. provide a label for a given document) to multiple workers, so as to triangulate on the most appropriate answer in ambiguous or unclear cases.

After some manual inspection of a sample of the unlabeled dataset, these were chosen as the target categories:

Business and Economics
Computers and Technology
Games and Hobbies
Lifestyle
Math and Science
News, Politics, and Government

As I examined the results that came in after I submitted my first batch of requests, I was surprised by the poor quality of data for what should be a fairly straightforward task. Some of the examples were extreme, such as political documents or court case briefings getting labeled as “Games and Hobbies”. In fact, the most egregious mislabeled examples I found were all tagged with that label, as I came across several cases of technical papers, scientific journal submissions, and corporate earnings releases all miscategorized as such.

As a machine learning practitioner, I found the obvious thing to do was to reject the mislabeled data. A mislabeled document introduces noise to the model training process, and is particularly troubling in contexts involving a limited number of examples or features. Thus, my first inclination was to reject all responses that were not unanimous - even if two workers agreed on a label and a third worker provided a different label, all three submissions would be rejected. However, I decided that such a policy would be too harsh, and wrote some custom code to instead only reject submissions for documents that had no majority answer; that is, when all three responses were of different labels.

However, that meant that some babies would be thrown out with the bath water, as some appropriately labeled responses would be rejected along with the bad ones. I did not see any other option; the whole point of using Amazon Mechanical Turk was to outsource the document labeling, and not have to manually inspect the outlier submissions and determine which ones were “right” or “wrong”.
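The original review script is not shown here, but the no-majority rule can be sketched roughly as follows. This is a minimal illustration in Python, assuming each document's responses arrive as a plain list of labels; the function name, the document IDs, and the decision to approve every submission for a document that does have a majority are all assumptions made for the example.

```python
from collections import Counter


def review_submissions(labels_by_document):
    """Decide, per document, whether its worker submissions are kept.

    Assumed policy: a document is approved (with the winning label) only
    when more than half of its responses agree; otherwise every response
    for that document is rejected.
    """
    decisions = {}
    for doc_id, labels in labels_by_document.items():
        # Most frequent label and how many workers chose it.
        label, votes = Counter(labels).most_common(1)[0]
        if votes > len(labels) / 2:
            decisions[doc_id] = ("approve", label)
        else:
            decisions[doc_id] = ("reject", None)
    return decisions


# Hypothetical responses: three workers per document.
responses = {
    "doc-1": ["Math and Science", "Math and Science", "Games and Hobbies"],
    "doc-2": ["Lifestyle", "Games and Hobbies", "Business and Economics"],
}
decisions = review_submissions(responses)
```

Under this sketch, a document with three different labels is rejected wholesale, which is exactly how a correct response can end up rejected alongside the incorrect ones.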

Pitchforks

As soon as I submitted the reviews of the first batch of results, the angry feedback started flowing in. I received dozens of messages from workers: some were sincere apologies imploring me to reconsider the rejection in order for the worker to retain their worker rating; others were disgruntled rants about how the rejection was unjust and a demand for correction.

Not only was I surprised by the amount of anger and frustration from these workers over tasks that paid only a penny each, but I felt that I had my hands tied as there was no other alternative. If I had not rejected submissions for documents that had no majority answer, I would’ve been left with unusable examples for a large fraction of my dataset. As a hobbyist, I can let it slide as there is no academic or business pressure to wring out all available value from the data. In fact, to avoid any further backlash, I ended up approving all submissions in the second batch, and decided to write off the poor dataset as a loss and forgo the experiment altogether.

After this poor experience, I find it hard to see how real research projects using Amazon Mechanical Turk can deal with this level of data quality while managing the need to “appease” the workers creating these datasets and compensating them appropriately. Perhaps that is the reason many companies and researchers are turning to semi-supervised learning techniques and training models to generate labeled datasets or embeddings to be used by other models. It is a direction that could’ve been explored for this project; perhaps some off-the-shelf or well-known approach can be used in order to build topics from the comment thread text for each submission. At the very least, the semi-supervised models won’t get all up in arms about your treatment of their low accuracy results, and demand that you give pennies where pennies are due.

Data

A modified version of the resulting dataset is available on GitHub. The dataset includes the URLs for 2557 documents along with the labels tagged by the workers, with all of the Amazon Mechanical Turk metadata removed.

A Risk-Oriented View of Asset Classes

2018-02-09T00:08:00-06:00

In this article, let’s take a risk-oriented examination of the asset classes available to the everyday, retail investor. The goal is to provide a reasonably comprehensive overview of the different investment options available to the typical investor, along with the risks associated with each investment option.

For example, you may think that your current net worth precludes you from investing in fancy asset classes, such as venture capital. “Angel investing and technology startups? That’s for highbrow folks unlike myself!” However, with the advent of new technology and recent legislative changes, there are now many channels for folks from varying backgrounds to invest not only in things like seed-level venture capital, but other areas such as commercial real estate and private loans, all with a laptop from the comfort of your own home.

Without further ado, here’s the Table of Asset Classes. It is meant to be ordered by general level of risk, from the least risky to the most risky of investments. Of course, there are always exceptions, nuances, and differences of risk within and across neighboring asset classes on the risk spectrum. The hope is that the reader can use this guideline as a starting point, but still maintain due diligence and do the required homework before diving into unfamiliar investment areas.

Asset Class	Types	Risk
Cash	Checking/savings, money market, Certificate of Deposit (CD); Money market funds	Low
Bonds (Debt)	Bonds - government, municipal, corporate; Loans - peer-to-peer (P2P), private lending	Low
Stocks (Equities)	Domestic, developed market, large cap stocks; Mid cap stocks; Emerging market, small/micro/nano cap stocks	Medium
Real Estate	Privately owned; Private funds; Real Estate Investment Trusts (REIT); Crowdfunded	Medium
Commodities	Precious metals - gold, tin, aluminum; Food - agriculture, meat; Energy - oil, natural gas	High
Currencies	Foreign currencies	High
Private equity	Venture capital - angel investments, syndicate funds	Very High
Speculative	Artwork, beanie babies, and other collectibles; Cryptocurrencies; Tulip bulbs	Very High

Additionally, it’s useful to visualize historical measures of risk for some of the aforementioned asset classes. Here, we examine 45 years of historical annual returns from the Asset Class Returns dataset provided by Portfolio Visualizer, dating back to 1972. To start, let’s look at the annual returns across broad asset classes:

We can see that the performance of some asset classes has bigger fluctuations than that of other asset classes. That is, in fact, a visual representation of risk - the likelihood of an asset class to have wild swings in returns. Rather than look at a relatively noisy graph of returns over time, we can compare the distribution of returns for each asset class across all time periods in our dataset. In this case, a box plot does a decent job of letting us compare the “spread” of annual returns for each asset class:

From the box plot, we can see that asset classes like cash or bonds tend to have small spreads in their distributions, where the returns are likely to fluctuate less but the potential upside is limited. On the other end of the spectrum, an asset class like gold has the potential to occasionally exhibit extreme returns, but in both positive and negative directions.

Finally, we can compute a single, statistical measure of risk for each asset class by taking the standard deviation of annual returns across all time periods. For the common asset classes for which we have historical data, the statistical measure more or less reflects the conventional wisdom as depicted in our Table of Asset Classes above.
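As a rough illustration of that computation, here is a minimal Python sketch using only the standard library. The return figures below are made-up placeholders, not values from the Portfolio Visualizer dataset, and the choice of the sample standard deviation (rather than the population version) is an assumption.

```python
from statistics import stdev

# Hypothetical annual returns (as fractions) for a few asset classes.
# These numbers are invented for illustration only.
annual_returns = {
    "Cash": [0.05, 0.04, 0.06, 0.05],
    "Bonds": [0.08, -0.02, 0.10, 0.04],
    "Gold": [0.30, -0.25, 0.15, -0.10],
}

# Sample standard deviation of annual returns as a single risk number.
risk = {name: stdev(returns) for name, returns in annual_returns.items()}

# Asset classes ordered from least to most volatile.
ranked = sorted(risk, key=risk.get)

for name in ranked:
    print(f"{name}: {risk[name]:.2%}")
```

With real data, the same ranking step is what lets the statistical measure be compared against the ordering in the table.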

My view is that taking a risk-oriented view of broad asset classes can serve as a healthy reminder for investors to think critically about the type of investments they make. Before you plop down a hefty sum of cash for that fancy cryptocurrency exchange-traded fund, take a second to think where that investment type falls in the risk spectrum.

Micro-Geographic Arbitrage with 529 Plans

2018-01-15T15:42:00-06:00

The recent tax reform bill that made its way through the legislative gauntlet of the US government brings with it a host of new changes, which will affect investors from all walks of life in big and small ways starting in 2018. One of the less covered but equally important changes is the extension of tax benefits for the 529 plan that many families use to save money for their children’s post-secondary education. At first glance, the change may not seem like much, but for those of us optimization-minded financiers, it may make a huge difference in savings.

With the new tax bill, the qualifying expenses for the 529 plan have been expanded to include tuition for private schools at the primary and secondary levels, that is, from kindergarten through the 12th grade (K-12). Private school is typically associated with the highbrow, affluent folks amongst us who decide public education is not good enough for their children. However, with the rising cost of living in many metropolitan, coastal areas such as the San Francisco Bay Area, the choice between public and private schools is actually more than simply an educational values judgement.

For a family with a heavy emphasis on education but more modest means, it may not be feasible to live in areas with good public education throughout the K-12 levels. For these families, a micro level of geographic arbitrage may come into play: buy or rent a house in a lower cost of living area, and use the savings in housing costs to put school age children in better, private schools. With the 529 plan changes to include K-12 private school tuition, this strategy actually becomes even more attractive. These families can use the money they reaped from savings in housing costs, plow those after-tax dollars into 529 plans for their children, and start withdrawing up to $10,000 a year without having to pay taxes on any capital gains.

Ultimately, any money left over after the primary and secondary school periods can still be used for higher education, so there’s very little downside for a family to start contributing more money even earlier in their children’s life. Add on the deduction that certain states offer from state taxes for 529 contributions, and the 529 plan becomes a great tool for opportunistically using micro-geographic arbitrage to optimize a family’s quality of housing and education.

Null is a Global Variable

2017-11-28T16:47:00-06:00

Programmers often bemoan the problems of the concept of null that exists in programming languages. Even C.A.R. Hoare, the inventor of the null reference, calls it a billion dollar mistake. Some detest its existence, and indicate that it’s useless. However, it occurred to me that null can actually be thought of as a global variable - one that is used across all applications and domains to indicate special cases, such as the end of a data structure or a missing entry.

Null is a very useful concept across many fundamental data structures. Without null, a typical linked list implementation would need to define its own sentinel node reference to signify the end of a list. A typical tree implementation would need to define its own sentinel node reference to signify the absence of child nodes. A hash table would need to define a sentinel value to signify the absence of values in a particular bucket.

Application programmers would need to define their own domain-specific null references to represent the concept of “missing, but valid.” A database accessor that fails to find the record for a Person object in a human resources application would return its own sentinel Person object to indicate that the record could not be found, or be required to raise some sort of exception to its caller. Any programmer that maintains an object with optional fields would need to define their own sentinel value to indicate a missing field.
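The database accessor case might look something like the following sketch. The Person class, the in-memory _RECORDS table, and both lookup functions are hypothetical stand-ins for whatever a real human resources application would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    employee_id: int
    name: str


# Hypothetical in-memory stand-in for a database table.
_RECORDS = {1: Person(1, "Ada")}


def find_person(employee_id: int) -> Optional[Person]:
    """Return the record, or None (the shared null) when it is missing."""
    return _RECORDS.get(employee_id)


# Without null, the application needs its own "missing" Person value,
# and every caller must remember to compare against it.
MISSING_PERSON = Person(-1, "<missing>")


def find_person_with_sentinel(employee_id: int) -> Person:
    """Return the record, or the domain-specific sentinel when missing."""
    return _RECORDS.get(employee_id, MISSING_PERSON)
```

The sentinel version works, but each domain object (Person, Department, Payroll entry, and so on) would need its own equivalent of MISSING_PERSON.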

Rather than have each library or application declare its own custom sentinel representations, null is the convenient, global variable that gets (re)used everywhere to denote the special, non-exceptional, terminal case that signals to the programmer some special treatment may be needed, but is not required.

Maximize Your Homeowner Tax Deductions

2017-11-03T00:01:00-05:00

Update: the following tips may or may not be applicable for your county and state, depending on the process of property tax payments. Also, the general landscape of deductions has certainly changed since the original writing of this article, so be sure to check the latest regulations.

Paying taxes is never a fun thing, but if you happen to be a homeowner, you are entitled to a nice deduction for any property taxes or mortgage interest paid for the calendar tax year.

Property Tax Deductions

It turns out that property taxes are something you should look forward to paying off as soon as you get the bill, and the reason is a bit subtle.

Since most property taxes are split into two installments, homeowners are given the option of paying part one of the taxes sometime around November, and part two sometime around February. Since the property tax deduction lets you deduct the full amount of property taxes paid in the calendar year, paying the second installment of your property tax bill in the current calendar year actually nets you an immediate return on the installment amount equivalent to your effective tax rate.

Let’s go through an example. Suppose your home has an assessed value of $1 million in 2017. A simple property tax rate of 1% would equate to a 2017 property tax bill of $10,000, split into two installments of $5,000. Let’s also assume an effective tax rate of 30%. When you get your property tax bill in November 2017, you have two options: (A) pay both installments at a total of $10,000 on November 1, 2017, or (B) split the payments into two installments on November 1, 2017 and February 1, 2018.

Let’s assume that the assessed value of your home, your effective tax rate, and the prevailing interest rate and market conditions remain the same, and only consider the property tax bill for 2017. How much money would end up in your pocket in each of the scenarios?

With option A, you pay off your entire property tax bill on November 1, 2017 for $10,000. Come April 15, 2018, you’ll be able to deduct the full amount and effectively end up with 30% * $10,000 = $3,000 in your pocket. If you leave that amount in a bank account paying 1% interest, you’ll end up with $3,030 on April 15, 2019.

With option B, you pay the first installment of your property tax bill on November 1, 2017 for $5,000, and you leave the balance in your bank account. On February 1, 2018, you’ll pay the second installment. For the three months that elapsed between your first and second installments, you’ll earn $12.50 in interest on the $5,000 you earmarked for the second installment. Come April 15, 2018, you’ll deduct the first installment and end up with 30% * $5,000 = $1,500 in your pocket. If you leave everything in the bank account, on April 15, 2019 you’ll end up with an extra $0.15 from the $12.50, and $15 in interest earned from the deduction amount itself. You’ll also be able to deduct the second installment for another $1,500. In total, you’ll end up with $12.50 + $0.15 + $15 + $1500 + $1500 = $3,027.65 on April 15, 2019.
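The arithmetic for the two options above can be reproduced with a short Python sketch. It assumes simple (non-compounding) interest at 1% per year, and it assumes the $12.50 keeps earning interest for roughly 14.5 months (February 1, 2018 to April 15, 2019), which is one reading that yields the $0.15 figure; the function and constant names are made up for the example.

```python
ANNUAL_INTEREST_RATE = 0.01   # bank account rate from the example
EFFECTIVE_TAX_RATE = 0.30     # effective tax rate from the example
INSTALLMENT = 5_000.0         # each half of the $10,000 property tax bill


def simple_interest(principal, months):
    """Simple (non-compounding) interest earned over a number of months."""
    return principal * ANNUAL_INTEREST_RATE * months / 12


def option_a_total():
    """Pay both installments in 2017; deduct $10,000 on April 15, 2018."""
    refund = EFFECTIVE_TAX_RATE * 2 * INSTALLMENT
    # The refund then sits in the bank for a year, until April 15, 2019.
    return refund + simple_interest(refund, 12)


def option_b_total():
    """Pay one installment in 2017 and the other on February 1, 2018."""
    # Interest on the $5,000 held back for three months.
    held_interest = simple_interest(INSTALLMENT, 3)
    # Assumption: that interest keeps earning for about 14.5 more months.
    interest_on_interest = simple_interest(held_interest, 14.5)
    # First deduction arrives April 15, 2018 and earns a year of interest.
    first_refund = EFFECTIVE_TAX_RATE * INSTALLMENT
    first_refund_interest = simple_interest(first_refund, 12)
    # Second deduction only arrives on April 15, 2019.
    second_refund = EFFECTIVE_TAX_RATE * INSTALLMENT
    return (held_interest + interest_on_interest + first_refund
            + first_refund_interest + second_refund)


print(f"Option A: ${option_a_total():,.2f}")
print(f"Option B: ${option_b_total():,.2f}")
print(f"Difference: ${option_a_total() - option_b_total():,.2f}")
```

Changing the tax rate or interest rate constants is a quick way to see how the gap between the two options moves under different assumptions.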

The difference between option A and option B is a meager $2.35, which may seem too trivial to be of concern. But as personal finance aficionados, we love to squeeze every bit of min/max opportunity available at hand! However, the picture may become clearer if we disregard returns you may reap on cash on hand, and simply look at the amount of deductions you may get for two payment options:

With option A’, you pay the first installment of your property tax bill on November 1, 2017 and the second installment on December 31, 2017. Come Tax Day 2018, you’ll be able to deduct the full amount of the property tax bill and pocket $3,000.

With option B’, you pay the first installment of your property tax bill on November 1, 2017 and the second installment on January 1, 2018. Come Tax Day 2018, you’ll only be able to deduct the first installment of the property tax bill, and pocket $1,500. You’ll then need to wait a full year for Tax Day 2019 before you can deduct the second installment.

Would you rather have $3,000 now or $1,500 now and $1,500 later? The choice is pretty clear. Pay those property taxes before the end of the year!

Mortgage Interest Deductions

Now let’s move on to mortgage interest deductions. Interest paid on a mortgage is only deductible for the first $1,000,000 of principal. In practice, the actual limit is probably $1,100,000 for homeowners that don’t have any home equity debt. To calculate the percentage of your mortgage interest that is deductible, divide 1.1M by your outstanding principal if it happens to be over 1.1M, and that’s the percentage of deductible interest.
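The percentage calculation described above can be written as a small Python sketch. The 1.1M limit is simply the figure used in this article, and the function names and sample numbers are hypothetical; check the current rules before relying on any of it.

```python
DEDUCTION_LIMIT = 1_100_000  # principal limit as described in this article


def deductible_fraction(outstanding_principal, limit=DEDUCTION_LIMIT):
    """Fraction of mortgage interest that is deductible."""
    if outstanding_principal <= limit:
        # At or below the limit, all of the interest is deductible.
        return 1.0
    return limit / outstanding_principal


def deductible_interest(interest_paid, outstanding_principal):
    """Dollar amount of interest that can be deducted."""
    return interest_paid * deductible_fraction(outstanding_principal)


# Hypothetical jumbo loan: $1,375,000 principal, $50,000 interest paid.
fraction = deductible_fraction(1_375_000)
deductible = deductible_interest(50_000, 1_375_000)
print(f"{fraction:.0%} deductible, or ${deductible:,.2f}")
```

In this made-up case only 80% of the interest is deductible, which is the gap that paying the principal down to the limit would close.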

With real estate prices on the rise once again in many of America’s most coveted markets, many are no doubt using jumbo loans to purchase their homes. Since the home mortgage interest deduction is one of the best tax breaks available, you’ll want to ensure you’re getting the maximum deduction. If your principal amount is over 1.1M, consider paying down the principal quickly to get at or below the 1.1M mark.

In today’s environment of low interest rates, you can consider a lump sum payment towards your mortgage principal an investment with an automatic return rate equivalent to your mortgage rate. If you have a mortgage rate anywhere north of 3-4%, that’s already quite a deal compared to the meager rates your bank is offering to hold your cash for you. Even at the high end of the savings rates being offered by some of the online banks such as Ally or newer commercial offerings such as Goldman Sachs, you’re looking at 1% at most.

Oft-Misheard Phrases in the Workplace

2017-10-30T23:33:00-05:00

There’s an affliction that affects many millions of Americans in the workplace, and it’s time to bring that affliction to light. There’s perhaps nothing more benignly embarrassing than uttering one of these often misheard phrases during a work meeting, much less writing them down in a widely distributed email or memo. Perhaps it’s time to set things straight once and for all, and help bring our less fortunate colleagues out of the darkness by setting them on the righteous path to using the correct version of these phrases.

Which version of each of these phrases do you think is the correct one?

“the long pull” or “the long pole”

“flush out” or “flesh out”

“all intensive purposes” or “all intents and purposes”

“could care less” or “couldn’t care less”

Do you have other examples of phrases that are repeatedly used in the workplace in the wrong way? Your colleagues always appreciate your honest feedback, the more pedantic the better.

Update: in a recent conversation with a colleague, I also learned that “beg(s) the question” is another phrase that is often misused in the workplace, amongst other contexts. The phrase describes a form of circular reasoning in an argument, but it has widely evolved to be used in situations where “raises the question” or “invites the question” is actually appropriate.