<p>Kelvin’s Domain. Personal website of Kelvin Jiang. Author: Kelvin Jiang. Feed: https://www.kelvinjiang.com/feed.xml (generated by Jekyll, updated 2020-09-26T02:21:24-05:00).</p>
<p>Stoicism in the Time of Coronavirus. 2020-03-30T00:00:00-05:00. https://www.kelvinjiang.com/2020/03/stoicism-in-the-time-of-coronavirus</p>
<p><img src="/assets/img/marcus_aurelius.jpg" alt="Marcus Aurelius" title="Marcus Aurelius" /></p>
<p>I recently started exploring the ancient philosophy of stoicism by reading
William Irvine’s <em>A Guide to the Good Life: The Ancient Art of Stoic Joy</em>. After
finishing the book, the plan was to start a series of articles that attempt to
draw lines between the ideas and techniques of ancient stoicism and modern-day
programming and software engineering. For example, a decent argument can be made
that the practice of <a href="https://en.wikipedia.org/wiki/Chaos_engineering">chaos
engineering</a> goes hand in hand
with the recurring theme of negative visualization in stoicism, where the
robustness of a system improves over time due to the creators of the system
actively thinking about and preparing for all the things that could go wrong.</p>
<p>However, in light of the <a href="https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic">coronavirus
pandemic</a>
that has been unfolding over the last few months, it feels more apt to discuss
stoicism the way it was intended by ancient scholars and practitioners - as a
philosophy of life particularly relevant during turbulent times of uncertainty,
and to think about how it may help people from all walks of life get through
this crisis. At a later time it may be useful to revisit the idea of stoical
(and perhaps even <a href="/books/2019/antifragile.html">antifragile</a>) software; but
given the current situation, let us find ways to remain stoic in the face of
shared adversity.</p>
<p>There are many facets of stoicism that can be applied to modern life,
particularly during times of hardship. Irvine’s book lays out some of the
fundamental pillars of the ancient Stoic school, and these pillars are useful
for guiding the discussion in uncovering how stoicism can help one cope during a
time of crisis.</p>
<p>The first place to start is with the idea of <strong>negative visualization</strong>, or the
practice of contemplating the loss of things you value in life so as to gain a
greater appreciation for such things, and to be mentally and emotionally
protected against their eventual loss should such an event occur. As this
unprecedented pandemic unfolds, I think it’s fair to say that few people, if
any, had considered the loss of so many things that many take for granted: the
freedom to work, socialize, and simply be outside; the existence of schools and
child care; safe and stable work environments; and the ability to reliably
procure household essentials such as groceries and even toilet paper.</p>
<p>Stoicism recommends proactively visualizing the loss of things one takes for
granted, so that one may gain a greater appreciation for these things and be
better prepared, literally and figuratively, when those things are no longer
available. For those in the middle of a shelter-in-place, lockdown lifestyle,
the stoic silver lining is that after the crisis is over they will be able to
withstand much more difficult hardships in the future. In these times, all of us
are a little bit like
<a href="https://en.wikipedia.org/wiki/Seneca_the_Younger">Seneca</a>, the Roman Stoic, who
was exiled to an island for multiple years for political reasons. Seneca’s stoic
ways allowed him to get through the tough times. The current situation is hard,
but once it has been overcome, the built-up scar tissue will remain.</p>
<p>A more extreme form of negative visualization is the practice of
<strong>self-denial</strong>. By intentionally subjecting oneself to minor pains,
displeasures, and inconveniences, one learns to better appreciate the smaller
things. For many people unaware of or inexperienced in the practice of self-denial,
the state-mandated orders to stay at home are more or less a collective
self-denial that has been forced upon everyone simultaneously.</p>
<p>Nevertheless, negative visualization is still a useful technique for coping with
this pandemic. Even for those under quarantine and self-isolation, it’s useful
to contemplate the loss of and appreciate the things that still have not been
taken away: one’s own health, the health of loved ones, the ability to connect
and communicate with friends and family, and the spirit of compassion,
community, and volunteerism that is emerging all around the world.</p>
<p>The slower pace of life under quarantine is a blessing in disguise to a
practicing stoic: it is a hard lesson in learning to appreciate the small things
and the things often taken for granted. Take a walk outside and breathe in the
air that has been made fresher by fewer cars on the road and planes in the
sky. Take a moment to appreciate the massive but mostly invisible technological
infrastructure that allows billions to stay connected despite being physically
separate.</p>
<p>Another theme is the <strong>dichotomy of control</strong>, or the partitioning of things,
experiences, or events one encounters into those over which one has no control
versus those over which one has partial or full control. Irvine actually advocates
for a trichotomy, but using a binary partitioning can be an equally effective
simplification.</p>
<p>The key idea here is to shift one’s attention away from things that cannot be
controlled, which is particularly important during times of great uncertainty
and chaos. Things such as government policies, temporary restrictions on
freedom, stock market movements, and disease infection and mortality rates, for
the most part can be considered to be things out of one’s control.</p>
<p>Instead, focus should be placed on things that are within an individual’s
immediate control. Some of the most important messages being communicated by
public health officials around the world are simple things that each individual
has full control over. In order to reduce risk of infection, consistently and
repeatedly do the simple things to minimize probabilities of contracting the
virus: stay at home, stay away from other people, and practice good hygiene by
thoroughly washing hands.</p>
<p>While staying at home, it is equally important to focus on general
health. Eating healthy food, getting adequate sleep, and exercising regularly
are all things that are fully under one’s control and may simply require a bit
of creativity to maintain. Mental health is easily overlooked but just as
important, and many practices for maintaining it also fall within one’s control:
avoid consuming too much doom and gloom from the media; and for those who are
well-versed, meditation and mindfulness can go a long way toward remaining
even-keeled. The stoic practice of meditation was perhaps not as rigorous as modern
day practices, but does encourage constant self-reflection and
introspection. With many parts of life slowing down or grinding to a halt,
looking deeply inward and reflecting on one’s values can be a useful and
rewarding exercise.</p>
<p>The final theme is <strong>fatalism</strong> - the idea of treating the past and present <em>“as
fate would have it”</em> and to relinquish fears and worries over things that are
happening or have already happened. It is important to resist the urge to point
fingers and to assign blame. The events of the recent past and present are
unraveling before our eyes, and being up in arms about things that cannot be
changed is mostly a useless exercise. However, it is critical to focus on the
future, as fate has yet to decree the way the future will unfold. Focusing on
the future means being fatalistic while learning from the past and present, and
making sure the hard lessons do not go to waste.</p>
<p>Perhaps the most important way of focusing on the future is to think about the
extent to which one can control the future, which harks back to the theme of
dichotomy of control. Second-guessing and blaming elected officials for their
mistakes and missteps is not terribly useful. Government policies, responses,
and the subsequent outcomes are largely out of our control. However, it is
important to keep these outcomes in mind, along with the words and actions of
elected officials and representatives. When it comes time for one to flex civic
duty muscles, use the power of the ballot to shape the future policies, laws,
and institutions and ensure that one does not remain blindly fatalistic about
the future.</p>
<p>Kelvin Jiang</p>
<p>Intuitive Iterators For Binary Trees. 2019-12-06T00:00:00-06:00. https://www.kelvinjiang.com/2019/12/intuitive-iterators-binary-trees</p>
<p>As a follow-up to the previous article on <a href="/2019/10/intuitive-iterative-tree-traversals.html">intuitive iterative traversals for
binary trees</a>, we discuss the
application of the intuitive stack based method to the task of creating
<strong>iterators for binary trees</strong>. The
<a href="https://en.wikipedia.org/wiki/Iterator">iterator</a> is an abstraction for making
incremental traversals over data containers or data streams. At its core, the
<a href="https://en.wikipedia.org/wiki/Iterator_pattern">iterator pattern</a> is
implemented by two functions:</p>
<ul>
<li><code class="highlighter-rouge">next</code> - a function that retrieves the next element in
the data container or stream and advances the underlying cursor</li>
<li><code class="highlighter-rouge">hasNext</code> - a function that indicates whether or not the underlying data
container or stream has been exhausted.</li>
</ul>
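<p>To make the two-function contract concrete, here is a minimal sketch of an iterator over a plain Python list. The class name <code class="highlighter-rouge">ListIterator</code> and its fields are hypothetical and chosen only for illustration; the method names follow the <code class="highlighter-rouge">next</code> and <code class="highlighter-rouge">hasNext</code> convention used in this article.</p>

```python
class ListIterator:
    """Hypothetical iterator over a Python list with the two-function interface."""

    def __init__(self, items):
        self.items = items
        self.cursor = 0  # index of the next element to hand out

    def hasNext(self):
        # Elements remain as long as the cursor is still inside the list.
        return self.cursor < len(self.items)

    def next(self):
        # Retrieve the current element, then advance the cursor.
        item = self.items[self.cursor]
        self.cursor += 1
        return item


it = ListIterator(["a", "b", "c"])
seen = []
while it.hasNext():
    seen.append(it.next())
```

<p>The caller never touches the cursor directly; it only asks whether more elements exist and requests the next one.</p>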
<p>As a design pattern, the iterator can be used for accessing data streams with an
arbitrary, and sometimes infinite, number of elements. For data structures, an
iterator is useful for performing computations based on advancing cursors in
certain defined orders. An elementary application of iterators to binary trees
is for traversing a given tree in preorder, inorder, or postorder fashion one
element at a time. For example, a specific use case might be an iterator
over <a href="https://en.wikipedia.org/wiki/Binary_search_tree#Traversal">binary search
trees</a> that provides
access to comparable elements in increasing order by means of an inorder
traversal.</p>
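<p>As a small illustration of why inorder access matters for binary search trees, the sketch below builds a tree and walks it inorder. It uses a recursive generator rather than the stack based method developed in this article, and the <code class="highlighter-rouge">Node</code>, <code class="highlighter-rouge">insert</code>, and <code class="highlighter-rouge">inorder</code> names are assumptions made for the example.</p>

```python
class Node:
    """Hypothetical tree node with val, left, and right fields."""

    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None


def insert(root, val):
    """Insert val into the binary search tree rooted at root; return the root."""
    if root is None:
        return Node(val)
    if val < root.val:
        root.left = insert(root.left, val)
    else:
        root.right = insert(root.right, val)
    return root


def inorder(root):
    """Recursive generator: left subtree, then the node, then right subtree."""
    if root is not None:
        yield from inorder(root.left)
        yield root.val
        yield from inorder(root.right)


root = None
for val in [5, 2, 8, 1, 9, 3]:
    root = insert(root, val)

values = list(inorder(root))  # elements come out in increasing order
```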
<p>Implementing an iterator for binary trees by modeling the access pattern for a
traversal is not a very intuitive task. However, using the technique covered in
the previous article of a stack and an algorithm that employs bookkeeping of
discovered and undiscovered nodes, an implementation follows quite naturally.</p>
<p>Let’s walk through implementation of an inorder iterator. To implement this
iterator, its internal state can be initialized in the same way as the beginning
of a conventional, all-at-once iterative traversal by pushing the root node onto
an empty stack.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">root</span><span class="p">):</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">stack</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">discovered</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">if</span> <span class="n">root</span><span class="p">:</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">root</span><span class="p">)</span>
</code></pre></div></div>
<p>The stack will serve as the primary internal state of the iterator. When the
stack is empty, that indicates the end of the traversal and the depletion of
elements.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hasNext</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span>
</code></pre></div></div>
<p>Traversing a tree while advancing the implicit cursor requires a bit of
thought. Just like in the conventional, all-at-once traversal, the top of the
stack always contains the next element to be processed. However, it may not be
the next element to be visited. For inorder traversals, we previously determined
that a visit should happen only if the node has already been discovered. Thus,
we will continue to process nodes until we come across a discovered node, in
which case its value can be returned. Any undiscovered node should be processed
by marking it as discovered and pushing its right child, the node itself, and
its left child onto the stack, in that order.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">next</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
    <span class="k">while</span> <span class="n">node</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">discovered</span><span class="p">:</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">discovered</span><span class="p">[</span><span class="n">node</span><span class="p">]</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">right</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">right</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">left</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">left</span><span class="p">)</span>
        <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">node</span><span class="o">.</span><span class="n">val</span>
</code></pre></div></div>
<p>As a sanity check, we can walk through a base case using the logic above. For a
single node binary tree, the root node will initially be undiscovered. The first
invocation of <code class="highlighter-rouge">next</code> will pop the root node off the stack, mark it as
discovered, push it back onto the stack (it has no children to push), and then
pop it off the stack again. Since the node is now marked as
discovered, its value can be returned. The resulting stack is empty and an
invocation of <code class="highlighter-rouge">hasNext</code> will return <code class="highlighter-rouge">False</code>, indicating that the iterator has
been exhausted.</p>
<p>Putting it all together, we have an implementation of an inorder iterator for
binary trees that follows quite naturally from the intuitive stack based method
of traversing binary trees.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BinaryTreeInorderIterator</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">root</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">stack</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">discovered</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="k">if</span> <span class="n">root</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">root</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">hasNext</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span>
    <span class="k">def</span> <span class="nf">next</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
        <span class="k">while</span> <span class="n">node</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">discovered</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">discovered</span><span class="p">[</span><span class="n">node</span><span class="p">]</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">right</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">right</span><span class="p">)</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">left</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">left</span><span class="p">)</span>
            <span class="n">node</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">node</span><span class="o">.</span><span class="n">val</span>
</code></pre></div></div>
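<p>As a usage sketch, the iterator can be driven over a small tree. The <code class="highlighter-rouge">Node</code> class below is an assumption (the article does not define one); any hashable object with <code class="highlighter-rouge">val</code>, <code class="highlighter-rouge">left</code>, and <code class="highlighter-rouge">right</code> attributes should work. The iterator class is repeated so that the example runs on its own.</p>

```python
class Node:
    """Hypothetical node type with the val/left/right fields the iterator expects."""

    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


class BinaryTreeInorderIterator:
    def __init__(self, root):
        self.stack = []
        self.discovered = {}
        if root:
            self.stack.append(root)

    def hasNext(self):
        return len(self.stack) > 0

    def next(self):
        node = self.stack.pop()
        while node not in self.discovered:
            # First encounter: mark the node, then push right, self, left
            # so that the left subtree is processed before the node itself.
            self.discovered[node] = True
            if node.right:
                self.stack.append(node.right)
            self.stack.append(node)
            if node.left:
                self.stack.append(node.left)
            node = self.stack.pop()
        return node.val


#       4
#      / \
#     2   6
#    / \   \
#   1   3   7
root = Node(4, Node(2, Node(1), Node(3)), Node(6, None, Node(7)))

it = BinaryTreeInorderIterator(root)
values = []
while it.hasNext():
    values.append(it.next())
```

<p>Because the example tree happens to be a binary search tree, the collected values come out in increasing order.</p>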
<p>As previously discussed, a preorder or postorder iterator can be derived by
changing the order in which a node and its child nodes are pushed onto the
stack.</p>
<p>Kelvin Jiang</p>
<p>A Descent Into JSON Parsing. 2019-11-05T00:00:00-06:00. https://www.kelvinjiang.com/2019/11/a-descent-into-json-parsing</p>
<p>It has been said that a programmer is not worth their salt until <a href="http://steve-yegge.blogspot.com/2007/06/rich-programmer-food.html">they
understand how compilers
work</a>. In the
spirit of self improvement, I’m taking a small step down that path by making a
foray into the world of compiler frontends: lexing and parsing.</p>
<p>A good candidate for tiptoeing into this space is
<a href="https://en.wikipedia.org/wiki/JSON">JSON</a>, which is arguably the most popular
data serialization format today. Writing a parser for JSON is not something most
programmers do, as there are a number of libraries for doing this type of
thing in many of the mainstream languages. Taking a deep dive into the
implementation of a technology that is often taken for granted seems like a good
way of learning something important.</p>
<p>Here, we’ll attempt to write a simplified JSON parser that follows but does not
fully implement the official JSON specification. In particular, only objects or
arrays will be considered to be valid JSON text, and standalone string literals,
numbers, and boolean values will not be considered valid. This constraint is not
clearly specified in the original JSON specification, but we deliberately make
this choice for the exercise. (As it turns out, the implementation
we arrive at below can actually handle standalone literals, which
is a nice side effect.)</p>
<p>Additionally, we will further simplify the task at hand by:</p>
<ul>
<li>Not handling escape characters in string literals</li>
<li>Not handling scientific notation for numeric values</li>
</ul>
<p>Boolean values and <code class="highlighter-rouge">null</code> are also out of scope and left as an exercise for the
reader to add as extensions. Our JSON parsing task will be modeled as a single
process that combines both lexing (i.e. tokenization) and parsing. The JSON text
is to be treated as a stream of characters, and a stack data structure will be
used for storing values.</p>
<p>To start, let’s define the basic parsing control flow. When a bracket or brace
is encountered, we should create a new array or object, respectively, and push
it onto the parser stack. Once the closing bracket or brace is encountered, the
active container (i.e. value at the top of the stack) can be considered complete
and popped off the stack. By handling the opening and closing brackets and
braces, we will have defined a significant portion of the core parsing code.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">stack</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">index</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="n">index</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
        <span class="n">char</span> <span class="o">=</span> <span class="n">text</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'{'</span><span class="p">:</span> <span class="c1"># Start object
</span>            <span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">({})</span>
        <span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'}'</span><span class="p">:</span> <span class="c1"># End object
</span>            <span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
        <span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'['</span><span class="p">:</span> <span class="c1"># Start array
</span>            <span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">([])</span>
        <span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">']'</span><span class="p">:</span> <span class="c1"># End array
</span>            <span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
        <span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>
<p>The stack keeps track of the arrays or objects that have been encountered thus
far at any point in the stream. Importantly, it preserves the order of these
arrays or objects, and will allow us to correctly handle nested structures. For
this example JSON text:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[{</span><span class="s2">"a"</span><span class="p">:[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]},</span><span class="s2">"b"</span><span class="p">,[</span><span class="s2">"c"</span><span class="p">]]</span>
</code></pre></div></div>
<p>a condensed view of the state stored by the stack looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[]
[[]]
[[], {}]
[[], {}, []]
[[], {}]
[[]]
[[], []]
[[]]
[]
</code></pre></div></div>
<p>The stack starts off empty (<code class="highlighter-rouge">[]</code>), then the initial/outer array is pushed onto
the stack (<code class="highlighter-rouge">[[]]</code>). The first element in the array is an object, which itself
contains an array as a value; this reflects the deepest part of the nesting in
the structure, which is when the stack is at its maximum size: <code class="highlighter-rouge">[[], {}, []]</code>.
As closing brackets and braces are encountered in the stream, items are popped
off the stack accordingly.</p>
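<p>The condensed trace above can be reproduced with a few lines of code. This is only a sketch: the helper name <code class="highlighter-rouge">trace_stack</code> is hypothetical, and it naively treats every bracket or brace as structural, which works here only because the example strings contain none.</p>

```python
def trace_stack(text):
    """Record the parser stack after every bracket or brace in text."""
    stack = []
    states = [repr(stack)]  # the stack starts off empty
    for char in text:
        if char == '{':
            stack.append({})
        elif char == '[':
            stack.append([])
        elif char == '}' or char == ']':
            stack.pop()
        else:
            continue  # literals and punctuation do not change the stack yet
        states.append(repr(stack))
    return states


states = trace_stack('[{"a":[1,2,3]},"b",["c"]]')
print("\n".join(states))
```

<p>Each printed line corresponds to one row of the condensed view, from the empty stack through the deepest nesting and back to empty.</p>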
<p>Now that we have the basic structure, the next step is to define some logic for
handling literals in the JSON text so that we can actually parse values and
insert them into the container objects on the parser stack. For strings, we have
made the simplifying assumption that there are no escape characters. This means
we can simply look for opening and closing double quotes as the delimiters for a
full string literal, and parse accordingly.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">parse_string</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">start</span><span class="p">):</span>
    <span class="s">"Returns a string literal and index at last character of literal (i.e. a double quotation mark)"</span>
    <span class="n">end</span> <span class="o">=</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">while</span> <span class="n">text</span><span class="p">[</span><span class="n">end</span><span class="p">]</span> <span class="o">!=</span> <span class="s">'"'</span><span class="p">:</span>
        <span class="n">end</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="n">start</span><span class="o">+</span><span class="mi">1</span><span class="p">:</span><span class="n">end</span><span class="p">],</span> <span class="n">end</span><span class="p">)</span>
</code></pre></div></div>
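<p>A quick check of the index convention may help here. In the sketch below the helper is repeated so the example is self-contained; the sample text and the variable names <code class="highlighter-rouge">value</code>, <code class="highlighter-rouge">last</code>, and <code class="highlighter-rouge">next_index</code> are made up for illustration.</p>

```python
def parse_string(text, start):
    "Returns a string literal and index at last character of literal (i.e. a double quotation mark)"
    end = start + 1
    while text[end] != '"':
        end += 1
    return (text[start+1:end], end)


text = '["hi",1]'
value, last = parse_string(text, 1)  # index 1 is the opening double quote
next_index = last + 1                # first character after the literal
```

<p>Here <code class="highlighter-rouge">value</code> is <code class="highlighter-rouge">'hi'</code>, <code class="highlighter-rouge">last</code> points at the closing quote, and the caller resumes scanning at the comma that follows. Note that an unterminated string would run past the end of the text and raise an <code class="highlighter-rouge">IndexError</code>.</p>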
<p>Numeric values mostly follow the same story. Since we are ignoring scientific
notation, we just have to handle a few special cases around negative numbers and
decimals. Without diminishing the value of this exercise, we rely on the string
to number parsing logic built into the language of our choice.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">parse_number</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">start</span><span class="p">):</span>
    <span class="s">"Returns a numeric literal and index at last character of literal"</span>
    <span class="n">end</span> <span class="o">=</span> <span class="n">start</span>
    <span class="n">has_decimal</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="k">while</span> <span class="n">end</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="n">end</span><span class="p">]</span> <span class="o">==</span> <span class="s">'-'</span> <span class="ow">or</span> <span class="n">text</span><span class="p">[</span><span class="n">end</span><span class="p">]</span> <span class="o">==</span> <span class="s">'.'</span> <span class="ow">or</span> <span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="n">end</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="s">'0'</span> <span class="ow">and</span> <span class="n">text</span><span class="p">[</span><span class="n">end</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="s">'9'</span><span class="p">)):</span>
        <span class="k">if</span> <span class="n">text</span><span class="p">[</span><span class="n">end</span><span class="p">]</span> <span class="o">==</span> <span class="s">'.'</span><span class="p">:</span>
            <span class="n">has_decimal</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="n">end</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">val</span> <span class="o">=</span> <span class="n">text</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span>
    <span class="k">return</span> <span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">val</span><span class="p">)</span> <span class="k">if</span> <span class="n">has_decimal</span> <span class="k">else</span> <span class="nb">int</span><span class="p">(</span><span class="n">val</span><span class="p">),</span> <span class="n">end</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
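<p>The same kind of check applies to numbers. The helper is repeated below, lightly reformatted, so the example stands alone; the sample inputs are made up for illustration.</p>

```python
def parse_number(text, start):
    "Returns a numeric literal and index at last character of literal"
    end = start
    has_decimal = False
    # Consume every character that can belong to a simple numeric literal.
    while end < len(text) and (text[end] == '-' or text[end] == '.'
                               or ('0' <= text[end] <= '9')):
        if text[end] == '.':
            has_decimal = True
        end += 1
    val = text[start:end]
    return (float(val) if has_decimal else int(val), end - 1)


text = '[-12.5,3]'
value, last = parse_number(text, 1)  # index 1 is the minus sign
whole, whole_last = parse_number('42', 0)
```

<p>A literal with a decimal point becomes a <code class="highlighter-rouge">float</code> and one without becomes an <code class="highlighter-rouge">int</code>. Malformed input such as <code class="highlighter-rouge">1-2</code> or <code class="highlighter-rouge">1.2.3</code> is still consumed by the loop and then rejected by the built-in conversion with a <code class="highlighter-rouge">ValueError</code>.</p>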
<p>The <code class="highlighter-rouge">parse_string</code> and <code class="highlighter-rouge">parse_number</code> helper functions both eagerly
advance the pointer into the character stream. At the end of parsing a literal,
the index in the original text pointing to the last character in the literal is
returned (along with the actual literal value) to be used by the core parsing
logic for advancing the pointer appropriately.</p>
<p>Once the parsing logic has been expanded to handle literal values, we are ready
to connect the dots between values and containers. Since we only accept JSON
text whose outermost value is an object or array, we can assume that there is
always at least one active container whenever a literal is parsed. That is, after a string or
numeric literal is successfully parsed, it should be added to the active
container as per the semantics of the active container.</p>
<p>In fact, after <em>any</em> value is successfully parsed - be it an array, object, or
literal - the value should be added to the active container. This is equivalent
to adding the value to the item at the top of the stack. The important edge case
is when a value is successfully parsed but there is nothing at the top of the
stack (i.e. the stack is empty). When parsing valid JSON, this can happen only
when the parsed value is the outermost container. In this case, the parsed value
is the result to be returned as the final parsed object.</p>
<p>For arrays, the underlying representation simply maps to the built-in list,
vector, or equivalent sequential data structure of your choice. For objects, the
important thing is to convert the container structure into a key-value
structure, such as your garden variety hash table, hash map, associative array,
or dictionary. The implementation here takes a shortcut by storing both types of
containers as lists, and only constructs a dictionary out of the list of
elements when the end of an object is reached. The list is expected to have an even number of
elements in order to fulfill the key-value contract of a valid JSON object.</p>
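<p>The conversion from a flat list to a dictionary can be seen in isolation in the
sketch below. The helper name <code class="highlighter-rouge">pairs_to_dict</code> is made up for illustration;
the slicing itself is standard Python.</p>

```python
def pairs_to_dict(items):
    """Build a dict from a flat [key, value, key, value, ...] list."""
    keys = items[::2]     # elements at positions 0, 2, 4, ...
    values = items[1::2]  # elements at positions 1, 3, 5, ...
    return dict(zip(keys, values))


flat = ['a', 1, 'b', [2, 3], 'c', {'d': 4}]
print(pairs_to_dict(flat))
```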
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
<span class="n">stack</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">index</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">result</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">while</span> <span class="n">index</span> <span class="o"><</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
<span class="n">char</span> <span class="o">=</span> <span class="n">text</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
<span class="n">val</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">if</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'{'</span><span class="p">:</span> <span class="c1"># Start object
</span> <span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">([])</span>
<span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'}'</span><span class="p">:</span> <span class="c1"># End object
</span> <span class="n">val</span> <span class="o">=</span> <span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
<span class="n">val</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">val</span><span class="p">[::</span><span class="mi">2</span><span class="p">],</span> <span class="n">val</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]))</span>
<span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'['</span><span class="p">:</span> <span class="c1"># Start array
</span> <span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">([])</span>
<span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">']'</span><span class="p">:</span> <span class="c1"># End array
</span> <span class="n">val</span> <span class="o">=</span> <span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
<span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'"'</span><span class="p">:</span> <span class="c1"># String literal
</span> <span class="n">val</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">parse_string</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">char</span> <span class="o">==</span> <span class="s">'-'</span> <span class="ow">or</span> <span class="p">(</span><span class="n">char</span> <span class="o">>=</span> <span class="s">'0'</span> <span class="ow">and</span> <span class="n">char</span> <span class="o"><=</span> <span class="s">'9'</span><span class="p">):</span> <span class="c1"># Numeric literal
</span> <span class="n">val</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="n">parse_number</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span>
<span class="n">index</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">val</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="k">if</span> <span class="n">stack</span><span class="p">:</span>
<span class="c1"># Add parsed value to the active container at top of stack
</span> <span class="n">stack</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">val</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># Parsed value is the outermost container
</span> <span class="n">result</span> <span class="o">=</span> <span class="n">val</span>
<span class="k">return</span> <span class="n">result</span>
</code></pre></div></div>
<p>For convenience (and without compromising correctness), commas, colons, and
whitespace characters are ignored by the parsing logic. Additionally, error
checking can be added in the form of assertions on properties of certain
values. For example, when a <code class="highlighter-rouge">}</code> character is encountered and a dictionary is to
be created out of the container object at the top of the stack, a valid
assertion is that the number of items in the container (i.e. list) should be
even.</p>
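<p>As a rough sketch of what such assertions might look like, the hypothetical
<code class="highlighter-rouge">build_object</code> helper below validates the flat list before building the
dictionary. The second check, that every key is a string, is an extra assumption
layered on top of the even-length check described above; it follows from the JSON
rule that object keys are strings.</p>

```python
def build_object(items):
    # Hypothetical helper: validate the flat container, then build the dict.
    assert len(items) % 2 == 0, 'object has a key without a value'
    keys = items[::2]
    assert all(isinstance(k, str) for k in keys), 'object keys must be strings'
    return dict(zip(keys, items[1::2]))


print(build_object(['a', 1, 'b', 2]))
```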
<p>To close things out, a list of test cases for the parser:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="s">'1.1'</span><span class="p">)</span> <span class="o">==</span> <span class="mf">1.1</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="s">'1'</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="s">'-0.3'</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mf">0.3</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="s">'{}'</span><span class="p">)</span> <span class="o">==</span> <span class="p">{}</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="s">'[]'</span><span class="p">)</span> <span class="o">==</span> <span class="p">[]</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="s">'""'</span><span class="p">)</span> <span class="o">==</span> <span class="s">""</span>
<span class="n">s</span> <span class="o">=</span> <span class="s">'[{"a":[1,2,[{}],4]},"b",["c",{"d":6}]]'</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="nb">eval</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="s">'{"xkd":1, "kcw":2, "art":3, "hxm":4, "qrt":5, "pad":6, "hoy":7}'</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="nb">eval</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="s">'[{"a_key": 1, "b_</span><span class="se">\xe9</span><span class="s">": 2}, {"a_key": 3, "b_</span><span class="se">\xe9</span><span class="s">": 4}]'</span>
<span class="k">assert</span> <span class="n">parse</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="nb">eval</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</code></pre></div></div>
<p>That concludes our foray into the world of JSON parsing. The parser works but
has certainly not been optimized for performance. A rough comparison shows that
the built-in Python <a href="https://docs.python.org/3/library/json.html">json</a> library
is roughly 11-12x faster than the implementation in this article. However, the
hope is that the implementation can serve as an instructive baseline upon which
to build faster and more fully-featured parsers.</p>
Kelvin Jiang
It has been said that a programmer is not worth their salt until they understand how compilers work. In the spirit of self improvement, I’m taking a small step down that path by making a foray into the world of compiler frontends: lexing and parsing.
Intuitive Iterative Binary Tree Traversals
2019-10-23T00:00:00-05:00
https://www.kelvinjiang.com/2019/10/intuitive-iterative-tree-traversals
<p>Binary tree traversals are a staple of the technical interview process at many
software companies, small and large. For anyone with an understanding of
recursion, the family of traversal techniques is quite straightforward. A
common twist on these concepts that shows up more in technical interviews than
undergraduate computer science problem sets is the rather artificial constraint
that asks one to implement the traversals using iteration rather than recursion.</p>
<p>I’ve always found the reference implementations of iterative tree traversals,
particularly inorder traversal, to be lacking in intuitive understanding. The
classic way of iteratively traversing a binary tree is to use a stack data
structure, and the first snippet of code you often see is something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stack = []
curr = root
while curr is not null || stack is not empty
while curr is not null
stack.push(curr)
curr = curr.left
...
</code></pre></div></div>
<p>When I see this code, I immediately have more questions than insight.</p>
<ul>
<li>Why is the code following left pointers while pushing all the nodes onto the
stack?</li>
<li>Why is the loop condition checking against both the current node and the
stack size?</li>
<li>Why is there a nested <code class="highlighter-rouge">while</code> loop?</li>
</ul>
<p>To me, the intuitive way to reason about an iterative implementation of a
recursive function is to simulate a call stack, and that usually begins with
a pen and paper. For example, suppose we have a binary tree that looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> A
/ \
B C
/ \
D E
</code></pre></div></div>
<p>An inorder traversal of such a tree should yield the nodes in this order:
<code class="highlighter-rouge">D, B, E, A, C</code>. The best way to simulate the call stack that yields such a
traversal is to draw out the contents of the stack as the traversal makes its
way through the tree. I like to model my stacks after the real world, with a
physical base to indicate the bottom of the stack, and elements being pushed on
and popped off. Here’s the visualization of an empty stack, and its
transformation following two push (<code class="highlighter-rouge">push(A)</code>, <code class="highlighter-rouge">push(B)</code>) operations and one pop
(<code class="highlighter-rouge">pop()</code>) operation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> B ~B~
A A A
_____ _____ _____ _____
stack stack stack stack
</code></pre></div></div>
<p>At the end of the operations, the stack contains a single element <code class="highlighter-rouge">A</code>. The
notation here marks any items popped off the stack with strikethrough-like
markers (~), but leaves them on the stack in their original location to better
illustrate ordering. We can now use this notation to visually simulate the first
few steps of what an inorder traversal on the example tree might look like using
a standard <a href="https://en.wikipedia.org/wiki/Depth-first_search">depth-first
search</a> approach. Initially,
the stack is empty, and the root node is pushed onto the stack.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> A
_____
stack
</code></pre></div></div>
<p>We invoke the same logic repeatedly while the stack has items: pop a node off
and process it. When a node is popped off the stack, we need to process it in
inorder fashion: traverse its left child first, visit itself, then traverse its
right child. This translates to <code class="highlighter-rouge">push(C)</code>, <code class="highlighter-rouge">push(A)</code>, and <code class="highlighter-rouge">push(B)</code>. Notice
that the elements are pushed onto the stack in reverse order from the way they
would be processed in a standard recursive implementation, so as to achieve the
desired order. Perhaps more importantly, the node itself (<code class="highlighter-rouge">A</code>) is <em>pushed back
onto the stack.</em></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> B
A
C
A ~A~ ~A~
_____ _____ _____
stack stack stack
</code></pre></div></div>
<p>At this point, <code class="highlighter-rouge">B</code> is popped off the stack and the same logic is applied. The
traversal proceeds down to the left child of <code class="highlighter-rouge">B</code>, followed by a visit to <code class="highlighter-rouge">B</code>,
and a subsequent traversal down the right child of <code class="highlighter-rouge">B</code>. That is, <code class="highlighter-rouge">push(E)</code>,
<code class="highlighter-rouge">push(B)</code>, <code class="highlighter-rouge">push(D)</code>. Here’s what the stack looks like after reaching the first
leaf node <code class="highlighter-rouge">D</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> D
B
E
B ~B~ ~B~
A A A
C C C
~A~ ~A~ ~A~
_____ _____ _____
stack stack stack
</code></pre></div></div>
<p>Since <code class="highlighter-rouge">D</code> has no child nodes, it will be popped off the stack then pushed back
onto the stack. Here’s the second piece of logic that is core to our traversal:
if a node being processed has already been discovered, then it should be
visited. With that, the algorithm for our inorder traversal - with respect to
processing a single node - can be expressed as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if the node has already been discovered
"visit" or do something with the node
else
mark the node as discovered
push the right child of the node onto the stack
push the node onto the stack
push the left child of the node onto the stack
</code></pre></div></div>
<p>As noted earlier, the symmetry between this iterative approach and the standard
recursive implementation is clear. The recursive implementation will first (in
an eager, depth-first manner) traverse down the left child of a given node, then
visit the node, followed by a traversal down the right child. Using an explicit
stack data structure to simulate the calling pattern simply means pushing the
nodes onto the explicit stack in reverse order as compared to the implicit
recursion call stack.</p>
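<p>For comparison, here is a minimal sketch of the standard recursive inorder
traversal, run on the example tree from earlier. The <code class="highlighter-rouge">Node</code> class and the
<code class="highlighter-rouge">visit</code> callback are assumptions made for illustration.</p>

```python
class Node:
    # Assumed minimal node type; not defined in the article itself.
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def recursive_inorder(node, visit):
    if node is None:
        return
    recursive_inorder(node.left, visit)   # traverse the left child first
    visit(node.val)                       # then visit the node itself
    recursive_inorder(node.right, visit)  # then traverse the right child


root = Node('A', Node('B', Node('D'), Node('E')), Node('C'))
order = []
recursive_inorder(root, order.append)
print(order)  # ['D', 'B', 'E', 'A', 'C']
```

<p>The three statements in the function body appear in exactly the reverse of the
order in which the iterative version pushes the corresponding items onto its stack.</p>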
<p>To illustrate the process more clearly, we can push nodes onto the stack with
an explicit status: a <code class="highlighter-rouge">start</code> status to indicate that the node has yet to be
processed and an <code class="highlighter-rouge">end</code> status to indicate that it has been processed. For
example, <code class="highlighter-rouge">A.start</code> and <code class="highlighter-rouge">A.end</code> will represent the start and end state for a node
<code class="highlighter-rouge">A</code>, respectively. Here’s the same visualization for the same first few
operations as above, with explicit status attributes for each node:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> B.start
A.end
C.start
A.start ~A.start~ ~A.start~
_____ _____ _____
stack stack stack
</code></pre></div></div>
<p>With this extended notation, the logic needed to process any given node at the
top of the stack is clear. If a node at the top of the stack has a <code class="highlighter-rouge">start</code>
status, push its right child onto the stack with a <code class="highlighter-rouge">start</code> status, push itself
back onto the stack with an <code class="highlighter-rouge">end</code> status, and push its left child onto the stack
with a <code class="highlighter-rouge">start</code> status.</p>
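<p>One way to sketch this, assuming a simple <code class="highlighter-rouge">Node</code> class and storing each
stack entry as a <code class="highlighter-rouge">(node, status)</code> tuple, is shown below. The names are
illustrative, and a node popped with an <code class="highlighter-rouge">end</code> status is simply visited.</p>

```python
START, END = 'start', 'end'


class Node:
    # Assumed minimal node type for illustration.
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def inorder_with_status(root):
    visited = []
    stack = [(root, START)] if root else []
    while stack:
        node, status = stack.pop()
        if status == END:
            visited.append(node.val)  # "visit" the node
            continue
        # Push in reverse of the desired processing order.
        if node.right:
            stack.append((node.right, START))
        stack.append((node, END))
        if node.left:
            stack.append((node.left, START))
    return visited


root = Node('A', Node('B', Node('D'), Node('E')), Node('C'))
print(inorder_with_status(root))  # ['D', 'B', 'E', 'A', 'C']
```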
<p>An astute reader will notice that since there are only two states, a binary flag
is sufficient for storing the same information. In fact, this can be represented in
exactly the same way as the classic implementations of graph traversals
introduced in <a href="https://en.wikipedia.org/wiki/Introduction_to_Algorithms">CLRS</a>,
in which nodes are assigned colors to keep track of traversal progress. In the
case of binary tree traversals, we only need two colors: white for undiscovered
nodes and black for discovered nodes. With that insight, the iterative version
of any of the traversals becomes easy to derive:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">iterative_inorder_traversal</span><span class="p">(</span><span class="n">root</span><span class="p">):</span>
<span class="n">stack</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">root</span><span class="p">)</span>
<span class="n">discovered</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">stack</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">node</span> <span class="o">=</span> <span class="n">stack</span><span class="o">.</span><span class="n">pop</span><span class="p">()</span>
<span class="k">if</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">discovered</span><span class="p">:</span>
<span class="k">pass</span> <span class="c1"># "Visit" or do something with the node
</span> <span class="k">else</span><span class="p">:</span>
<span class="n">discovered</span><span class="p">[</span><span class="n">node</span><span class="p">]</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">right</span><span class="p">:</span>
<span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">right</span><span class="p">)</span>
<span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
<span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">left</span><span class="p">:</span>
<span class="n">stack</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">left</span><span class="p">)</span>
</code></pre></div></div>
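<p>As a sketch of how the push order alone determines the traversal, the variant
below collects visited values into a list and selects the push order by name. The
<code class="highlighter-rouge">Node</code> class, the <code class="highlighter-rouge">order</code> parameter, and the function name are
assumptions made for illustration; a set stands in for the dictionary of discovered nodes.</p>

```python
class Node:
    # Assumed minimal node type for illustration.
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def iterative_traversal(root, order='inorder'):
    visited = []
    stack = [root] if root else []
    discovered = set()
    while stack:
        node = stack.pop()
        if node in discovered:
            visited.append(node.val)  # second time popped: visit it
            continue
        discovered.add(node)
        # Each list is the reverse of the order in which items get processed.
        push_order = {
            'preorder': [node.right, node.left, node],
            'inorder': [node.right, node, node.left],
            'postorder': [node, node.right, node.left],
        }[order]
        for item in push_order:
            if item is not None:
                stack.append(item)
    return visited


root = Node('A', Node('B', Node('D'), Node('E')), Node('C'))
for name in ('preorder', 'inorder', 'postorder'):
    print(name, iterative_traversal(root, name))
```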
<p>An iterative implementation of preorder or postorder traversal should easily
follow from the inorder traversal; the sequence in which the nodes should be
pushed onto the stack simply needs to be modified to match the desired traversal
behavior. The cost of the intuitive version of these iterative traversals is a
larger constant in the runtime complexity, as each node is actually processed
twice. Ultimately, the runtime still grows at a rate linearly proportional to
the size of the input.</p>
Kelvin Jiang
Wrath of the Amazon Mechanical Turks
2019-07-03T00:00:00-05:00
https://www.kelvinjiang.com/2019/07/wrath-amazon-mechanical-turks
<p><img src="/assets/img/alex-kotliarskyi-QBpZGqEMsKg-unsplash.jpg" alt="Amazon Mechanical Turk Workers" title="Wrath of the Amazon Mechanical Turks" /></p>
<p>I recently <a href="https://news.ycombinator.com/item?id=19626404">launched a small hobby website</a>
that <a href="https://www.hackernewspapers.com/">aggregates documents and papers</a> posted
to a <a href="https://news.ycombinator.com/">popular tech news website</a>. Some of the
feedback I received after the launch included suggestions to categorize the
aggregated documents. It seemed like a nice, small exercise in document
categorization, and I decided to take a shot using the data I had on hand, with
the objective being to determine the category for a document from just the title
text.</p>
<p>For starters, I limited the dataset to <a href="https://arxiv.org/">arXiv.org</a>
submissions, and used the categories associated with each document as ground
truth labels. After playing around with the data, I realized that I would need
an expanded dataset if I wanted to train useful models that could differentiate
between a variety of subjects beyond just those related to science and
technology, such as business, economics, games, news, and politics.</p>
<h1 id="enter-the-mechanical-turks">Enter the Mechanical Turks</h1>
<p>In order to get my hands on a quality dataset with document labels for the
categories I wanted, I turned to <a href="https://www.mturk.com/">Amazon Mechanical Turk</a>.
I had prior experience with Amazon Mechanical Turk from using it shortly after
its initial launch, playing around as a worker and earning pennies per task by
solving unsophisticated CAPTCHA puzzles or examining satellite imagery to look
for <a href="https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist)#Disappearance">famed computer scientist and missing person Jim Gray</a>.</p>
<p>After signing in as a requester and setting up my project, I was struck by how
outdated the entire Amazon Mechanical Turk website appeared. Upon creation of a
project and submission of a batch of tasks, simply viewing the progress of the
tasks and downloading the ongoing results is a very clunky experience. The modal
dialogs feel like they’re stuck in 2007, and look out of place when compared to
the user interfaces of modern AWS services. However, the lackluster user
experience as a requester was nothing compared to the anger I would soon face
from other users as I started reviewing the results rolling in.</p>
<h1 id="big-bad-data">Big Bad Data</h1>
<p>Document categorization is a fairly commonplace project by Amazon Mechanical Turk
standards; the project creation page even has a built-in template that makes the
setup for this class of projects fairly straightforward. A requester has the
option of sending the same task (i.e. provide a label for a given document) to
multiple workers, so as to triangulate on the most appropriate answer in ambiguous
or unclear cases.</p>
<p>After some manual inspection of a sample of the unlabeled dataset, these were
chosen as the target categories:</p>
<ul>
<li>Business and Economics</li>
<li>Computers and Technology</li>
<li>Games and Hobbies</li>
<li>Lifestyle</li>
<li>Math and Science</li>
<li>News, Politics, and Government</li>
</ul>
<p>As I examined the results that came in after I submitted my first batch of
requests, I was surprised by the poor quality of data for what should be a
fairly straightforward task. Some of the examples were extreme, such as
political documents or court case briefings getting labeled as “Games and
Hobbies”. In fact, the most egregious mislabeled examples I found were all
tagged with that label, as I came across several cases of technical papers,
scientific journal submissions, and corporate earnings releases all
miscategorized as such.</p>
<p>As a machine learning practitioner, the obvious thing to do was to reject the
mislabeled data. A mislabeled document introduces noise to the model training
process, and is particularly troubling in contexts involving a limited number of
examples or features. Thus, my first inclination was to reject all responses
that were not unanimous - even if two workers agreed on a label and a third
worker provided a different label, all three submissions would be
rejected. However, I decided that such a policy would be too harsh, and wrote
some custom code to instead only reject submissions for documents that had no
majority answer; that is, when all three responses were of different labels.</p>
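<p>The custom code itself is not reproduced here, but a minimal sketch of a
majority-based review policy might look like the following. The function name, the
document IDs, and the choice to approve every submission for a document that has a
majority label are all assumptions made for illustration.</p>

```python
from collections import Counter


def review_submissions(labels_by_document):
    """Map each document ID to (majority label or None, decision)."""
    decisions = {}
    for doc_id, labels in labels_by_document.items():
        if not labels:
            continue  # nothing to review for this document
        label, count = Counter(labels).most_common(1)[0]
        if count > len(labels) / 2:
            decisions[doc_id] = (label, 'approve')
        else:
            # No label was chosen by more than half of the workers.
            decisions[doc_id] = (None, 'reject')
    return decisions


# Hypothetical responses from three workers per document.
responses = {
    'doc-1': ['Math and Science', 'Math and Science', 'Games and Hobbies'],
    'doc-2': ['Lifestyle', 'Business and Economics', 'Games and Hobbies'],
}
print(review_submissions(responses))
```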
<p>However, that meant that some babies would be thrown out with the bath water,
since some appropriately labeled responses would be rejected along with the bad
ones. I did not see any other option; the whole point of using Amazon Mechanical
Turk was to outsource the document labeling, and not have to manually inspect
the outlier submissions and determine which ones were “right” or “wrong”.</p>
<h1 id="pitchforks">Pitchforks</h1>
<p><a href="/assets/img/wrath_amazon_mechanical_turks.png"><img src="/assets/img/wrath_amazon_mechanical_turks.png" alt="Feedback from Amazon Mechanical Turk Workers" title="Feedback from Amazon Mechanical Turk Workers" /></a></p>
<p>As soon as I submitted the reviews of the first batch of results, the angry
feedback started flowing in. I received dozens of messages from workers:
some were sincere apologies imploring me to reconsider the rejection in
order for the worker to retain their worker rating; others were disgruntled rants
about how the rejection was unjust and a demand for correction.</p>
<p>Not only was I surprised by the amount of anger and frustration from these
workers over tasks that paid only a penny each, but I felt that I had my hands
tied as there was no other alternative. If I had not rejected submissions for
documents that had no majority answer, I would’ve been left with unusable
examples for a large fraction of my dataset. As a hobbyist, I can let it slide
as there is no academic or business pressure to wring out all available value
from the data. In fact, to avoid any further backlash, I ended up approving all
submissions in the second batch, and decided to write off the poor dataset as a
loss and forgo the experiment altogether.</p>
<p>After this poor experience, I find it hard to see how real research projects
using Amazon Mechanical Turk can deal with this level of data quality while
managing the need to “appease” the workers creating these datasets and
compensating them appropriately. Perhaps that is the reason many companies and
researchers are turning to <a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised learning techniques</a>
and training models to generate labeled datasets or embeddings to be used by
other models. It is a direction that could’ve been explored for this project;
perhaps some off-the-shelf or well-known approach can be used in order to build
topics from the comment thread text for each submission. At the very least, the
semi-supervised models won’t get all up in arms about your treatment of their
low accuracy results, and demand that you give pennies where pennies are due.</p>
<h1 id="data">Data</h1>
<p>A modified version of the <a href="https://gist.github.com/cloudkj/43f0ec7129893fad62e542002122b960">resulting dataset</a>
is available on GitHub. The dataset includes the URLs for 2557 documents along
with the labels tagged by the workers, with all of the Amazon Mechanical Turk
metadata removed.</p>Kelvin JiangA Risk-Oriented View of Asset Classes2018-02-09T00:08:00-06:002018-02-09T00:08:00-06:00https://www.kelvinjiang.com/2018/02/risk-oriented-view-asset-classes<p>In this article, let’s take a risk-oriented examination of the asset classes
available to the everyday, retail investor. The goal is to provide a reasonably
comprehensive overview of the different investment options available to the
typical investor, along with the risks associated with each investment option.</p>
<p>For example, you may think that your current net worth precludes you from
investing in fancy asset classes, such as venture capital. <em>“Angel investing and
technology startups? That’s for high brow folks unlike myself!”</em> However, with
the advent of new technology and <a href="https://en.wikipedia.org/wiki/Jumpstart_Our_Business_Startups_Act">recent legislative
changes</a>,
there are now many channels for folks from varying backgrounds to invest not
only in things like seed-level venture capital, but other areas such as
commercial real estate and private loans, all with a laptop from the comfort of
your own home.</p>
<p>Without further ado, here’s the <strong>Table of Asset Classes</strong>. It is meant to be
ordered in general level of risk, from the least risky to the most risky of
investments. Of course, there are always exceptions, nuances, and differences of
risk within and across neighboring asset classes on the risk spectrum. The hope
is that the reader can use this guideline as a starting point, but still
maintain due diligence and do the required homework before diving into
unfamiliar investment areas.</p>
<div style="overflow-x: scroll;">
<table>
<colgroup>
<col width="33%" />
<col width="33%" />
<col width="33%" />
</colgroup>
<thead>
<tr class="header">
<th>Asset Class</th>
<th>Types</th>
<th>Risk</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Cash</td>
<td><ul>
<li>Checking/savings, money market, Certificate of Deposit (CD)</li>
<li>Money market funds</li>
</ul></td>
<td>Low</td>
</tr>
<tr class="even">
<td>Bonds (Debt)</td>
<td><ul>
<li>Bonds - government, municipal, corporate</li>
<li>Loans - peer-to-peer (P2P), private lending</li>
</ul></td>
<td>Low</td>
</tr>
<tr class="odd">
<td>Stocks (Equities)</td>
<td><ul>
<li>Domestic, developed market, large cap stocks</li>
<li>Mid cap stocks</li>
<li>Emerging market, small/micro/nano cap stocks</li>
</ul></td>
<td>Medium</td>
</tr>
<tr class="even">
<td>Real Estate</td>
<td><ul>
<li>Privately owned</li>
<li>Private funds</li>
<li>Real Estate Investment Trusts (REIT)</li>
<li>Crowdfunded</li>
</ul></td>
<td>Medium</td>
</tr>
<tr class="odd">
<td>Commodities</td>
<td><ul>
<li>Metals - gold, tin, aluminum</li>
<li>Food - agriculture, meat</li>
<li>Energy - oil, natural gas</li>
</ul></td>
<td>High</td>
</tr>
<tr class="even">
<td>Currencies</td>
<td><ul>
<li>Foreign currencies</li>
</ul></td>
<td>High</td>
</tr>
<tr class="odd">
<td>Private equity</td>
<td><ul>
<li>Venture capital - angel investments, syndicate funds</li>
</ul></td>
<td>Very High</td>
</tr>
<tr class="even">
<td>Speculative</td>
<td><ul>
<li>Artwork, beanie babies, and other collectibles</li>
<li>Cryptocurrencies</li>
<li>Tulip bulbs</li>
</ul></td>
<td>Very High</td>
</tr>
</tbody>
</table>
</div>
<p>Additionally, it’s useful to visualize historical measures of risk for some of the
aforementioned asset classes. Here, we examine 45 years of annual returns from
the <a href="https://www.portfoliovisualizer.com/historical-asset-class-returns">Asset Class Returns</a>
dataset provided by Portfolio Visualizer, dating back to 1972. To start, let’s
look at the annual returns across broad asset classes:</p>
<p><img src="/assets/img/asset_class_returns_56748eaf.png" alt="Asset Class Returns" title="Asset Class Returns" class="center" /></p>
<p>We can see that the returns of some asset classes fluctuate more than those of
others. That is, in fact, a visual representation of risk -
the likelihood that an asset class will have wild swings in returns. Rather than
look at a relatively noisy graph of returns over time, we can compare the
distribution of returns for each asset class across all time periods in our
dataset. In this case, a box plot does a decent job of letting us compare the
“spread” of annual returns for each asset class:</p>
<p><img src="/assets/img/distribution_asset_class_returns_bc68ab5d.png" alt="Distribution of Asset Class Returns" title="Distribution of Asset Class Returns" class="center" /></p>
<p>From the box plot, we can see that asset classes like cash or bonds tend to have
small spreads in their distributions, where the returns are likely to fluctuate
less but the potential upside is limited. On the other end of the spectrum, an
asset class like gold has the potential to occasionally exhibit extreme returns,
but in both positive and negative directions.</p>
<p>Finally, we can compute a single, statistical measure of risk for each asset
class by taking the standard deviation of annual returns across all time
periods. For the common asset classes for which we have historical data, the
statistical measure more or less reflects the conventional wisdom as depicted in
our Table of Asset Classes above.</p>
<p><img src="/assets/img/asset_class_risk_2dcecc3e.png" alt="Asset Class Risk" title="Asset Class Risk" class="center" /></p>
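<p>As a rough sketch of the calculation described above, the snippet below computes the
sample standard deviation of annual returns per asset class and ranks the classes from
least to most volatile. The return figures are made-up placeholders, not the Portfolio
Visualizer data, and the function name <code>risk_by_asset_class</code> is my own.</p>

```python
from statistics import stdev

# Hypothetical annual returns in percent. These are illustrative placeholders,
# NOT values from the Portfolio Visualizer dataset.
sample_returns = {
    "Cash": [1.0, 2.0, 3.0, 2.0, 1.5],
    "Bonds": [4.0, -1.0, 6.0, 3.0, 8.0],
    "Stocks": [15.0, -20.0, 25.0, 10.0, -5.0],
    "Gold": [30.0, -25.0, 5.0, 40.0, -15.0],
}


def risk_by_asset_class(returns):
    """Map each asset class to the sample standard deviation of its returns."""
    return {name: stdev(values) for name, values in returns.items()}


risk = risk_by_asset_class(sample_returns)

# Rank asset classes from the smallest spread of returns to the largest.
ranked = sorted(risk, key=risk.get)
for name in ranked:
    print(f"{name}: {risk[name]:.1f}")
```

<p>With real data, the same few lines would take one list of annual returns per asset
class; the ranking is what gets compared against the Table of Asset Classes.</p>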
<p>In my opinion, taking a risk-oriented view of broad asset classes can serve as
a healthy reminder for investors to think critically about the types of
investments they make. Before you plop down a hefty sum of cash for that fancy
cryptocurrency exchange-traded fund, take a second to think where that
investment type falls in the risk spectrum.</p>Kelvin
Micro-Geographic Arbitrage with 529 Plans2018-01-15T15:42:00-06:002018-01-15T15:42:00-06:00https://www.kelvinjiang.com/2018/01/micro-geographic-arbitrage-529-plans<p>The recent tax reform bill that made its way through the legislative gauntlet of
the US government brings with it a host of new changes, which will affect
investors from all walks of life in big and small ways starting in 2018. One of
the less covered but equally important changes is the extension of tax benefits
for the 529 plan that many families use to save money for their children’s
post-secondary education. At first glance, the change may not seem like much,
but for the optimization-minded financiers among us, it may make a huge
difference in savings.</p>
<p>With the new tax bill, the qualifying expenses for the 529 plan have been
<a href="https://www.nytimes.com/2017/12/21/your-money/529-plans-taxes-private-school.html">expanded to include tuition for private schools at the primary and secondary
levels</a>,
that is, from kindergarten through the 12th grade (K-12). Private school is
typically associated with the high-brow, affluent among us who decide public
education is not good enough for their children. However, with the rising cost
of living in many metropolitan, coastal areas such as the San Francisco Bay Area,
the choice between public and private schools is actually more than simply an
educational values judgment.</p>
<p>For a family with a heavy emphasis on education but more modest means, it may
not be feasible to live in areas with good public education throughout the K-12
levels. For these families, a micro level of geographic arbitrage may come into
play: buy or rent a house in a lower cost of living area, and use the savings
in housing costs to put school-age children in better, private schools. With the
529 plan changes to include K-12 private school tuition, this strategy actually
becomes even more attractive. These families can use the money they reaped from
savings in housing costs, plow those after-tax dollars into 529 plans for their
children, and start withdrawing up to $10,000 a year without having to pay taxes
on any capital gains.</p>
<p>Ultimately, any money left over after the primary and secondary school periods
can still be used for higher education, so there’s very little downside for a
family to start contributing more money even earlier in their children’s
lives. Add on the <a href="https://vanguard.wealthmsi.com/stdc.php">deduction that certain states offer from state taxes for 529
contributions</a>, and the 529 plan becomes a
great tool for opportunistically using micro-geographic arbitrage to optimize a
family’s quality of housing and education.</p>Kelvin
Null is a Global Variable2017-11-28T16:47:00-06:002017-11-28T16:47:00-06:00https://www.kelvinjiang.com/2017/11/null-is-global-variable<p>Programmers often bemoan the problems of the concept of <em>null</em> that exists in programming languages. Even <a href="https://en.wikipedia.org/wiki/Tony_Hoare">C.A.R. Hoare</a>, the inventor of the null reference, calls it a <a href="https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare">billion dollar mistake</a>. Some detest its existence, and indicate that it’s useless. However, it occurred to me that null can actually be thought of as a global variable - one that is used across all applications and domains to indicate special cases, such as the end of a data structure or a missing entry.</p>
<p><span id="more"></span>Null is a very useful concept across many fundamental data structures. Without null, a typical linked list implementation would need to define its own sentinel node reference to signify the end of a list. A typical tree implementation would need to define its own sentinel node reference to signify the absence of child nodes. A hash table would need to define a sentinel value to signify the absence of values in a particular bucket.</p>
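<p>To make the linked list case concrete, here is a minimal sketch in Python, where
<code>None</code> plays the role of null. The first version leans on the shared, built-in
null; the second shows what a list would have to do without it. The names
<code>Node</code>, <code>SentinelNode</code>, and <code>END</code> are hypothetical and only
for illustration.</p>

```python
class Node:
    """Linked list node that uses the language-wide null (None) as terminator."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node  # None marks the end of the list


def length(head):
    """Count nodes by walking until the shared null is reached."""
    count = 0
    while head is not None:
        count += 1
        head = head.next
    return count


class _End:
    """Hypothetical list-specific terminator, needed only if null did not exist."""


END = _End()  # every list implementation would have to declare its own


class SentinelNode:
    """Same node, but terminated by the list's private sentinel object."""

    def __init__(self, value, next_node=END):
        self.value = value
        self.next = next_node


def sentinel_length(head):
    """Count nodes by walking until the list-specific sentinel is reached."""
    count = 0
    while head is not END:
        count += 1
        head = head.next
    return count
```

<p>Both versions behave the same; the difference is that <code>END</code> is known only to
this one list, while <code>None</code> is understood by every library that touches it.</p>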
<p>Application programmers would need to define their own domain-specific null references to represent the concept of “missing, but valid.” A database accessor that fails to find the record for a Person object in a human resources application would return its own sentinel Person object to indicate that the record could not be found, or be required to raise some sort of exception to its caller. Any programmer that maintains an object with optional fields would need to define their own sentinel value to indicate a missing field.</p>
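<p>The database accessor example might look something like the sketch below. The
<code>Person</code> fields, the in-memory <code>_RECORDS</code> table, and both lookup
functions are assumptions made up for illustration; a real accessor would talk to an
actual database.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    employee_id: int
    name: str
    middle_name: Optional[str] = None  # optional field: None means "not provided"


# Hypothetical in-memory stand-in for a human resources table.
_RECORDS = {
    1: Person(1, "Alice"),
    2: Person(2, "Bob", middle_name="Lee"),
}


def find_person(employee_id: int) -> Optional[Person]:
    """With null available: return None when the record does not exist."""
    return _RECORDS.get(employee_id)


# Without null, the application must invent its own "missing, but valid" value.
MISSING_PERSON = Person(-1, "<missing>")


def find_person_or_sentinel(employee_id: int) -> Person:
    """Without null: return the application-specific sentinel instead."""
    return _RECORDS.get(employee_id, MISSING_PERSON)
```

<p>Callers of the second function have to know about <code>MISSING_PERSON</code>
specifically, which is exactly the per-application bookkeeping that a shared null avoids.</p>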
<p>Rather than have each library or application declare its own custom sentinel representations, null is the convenient, global variable that gets (re)used everywhere to denote the special, non-exceptional, terminal case that signals to the programmer some special treatment may be needed, but is not required.</p>Kelvin
Maximize Your Homeowner Tax Deductions2017-11-03T00:01:00-05:002017-11-03T00:01:00-05:00https://www.kelvinjiang.com/2017/11/maximize-your-homeowner-tax-deductions<p><em>Update: the following tips may or may not be applicable for your county
and state, depending on the process of property tax payments. Also, the general
landscape of deductions has certainly <a href="https://turbotax.intuit.com/tax-tips/irs-tax-return/2017-tax-reform-legislation-what-you-should-know/L96aFuPhc">changed</a>
since the original writing of this article, so be sure to check the latest
regulations.</em></p>
<p>Paying taxes is never a fun thing, but if you happen to be a homeowner, you are
entitled to a nice deduction for any property taxes or mortgage interest paid during
the calendar tax year.</p>
<h3 id="property-tax-deductions">Property Tax Deductions</h3>
<p>It turns out that property taxes are something you should look forward to paying
off as soon as you get the bill, and the reason is a bit subtle.</p>
<p>Since most property taxes are split into two installments, homeowners are given
the option of paying part one of the taxes sometime around November, and
part two sometime around February. Since the property tax deduction lets you
deduct the full amount of property taxes paid in the calendar year, paying the
second installment of your property tax bill in the current calendar year
actually nets you an immediate return on the installment amount equivalent to
your effective tax rate.</p>
<p>Let’s go through an example. Suppose your home has an assessed value of $1
million in 2017. A simple property tax rate of 1% would equate to a 2017
property tax bill of $10,000, split into two installments of $5,000. Let’s also
assume an effective tax rate of 30%. When you get your property tax bill in
November 2017, you have two options: (A) pay both installments at a total of
$10,000 on November 1, 2017, or (B) split the payments into two installments on
November 1, 2017 and February 1, 2018.</p>
<p>Let’s assume that the assessed value of your home, your effective tax rate, and
the prevailing interest rate and market conditions remain the same, and only
consider the property tax bill for 2017. How much money would end up in your
pocket in each of the scenarios?</p>
<p>With option A, you pay off your entire property tax bill on November 1, 2017 for
$10,000. Come April 15, 2018, you’ll be able to deduct the full amount and
effectively end up with 30% * $10,000 = $3,000 in your pocket. If you leave
that amount in a bank account paying 1% interest, you’ll end up with $3,030 on
April 15, 2019.</p>
<p>With option B, you pay the first installment of your property tax bill on
November 1, 2017 for $5,000, and you leave the balance in your bank account. On
February 1, 2018, you’ll pay the second installment. For the three months that
elapsed between your first and second installments, you’ll earn $12.50 in
interest on the $5,000 you earmarked for the second installment. Come April 15,
2018, you’ll deduct the first installment and end up with 30% * $5,000 = $1,500
in your pocket. If you leave everything in the bank account, on April 15, 2019
you’ll end up with an extra $0.15 from the $12.50, and $15 in interest earned
from the deduction amount itself. You’ll also be able to deduct the second
installment for another $1,500. In total, you’ll end up with $12.50 + $0.15 +
$15 + $1500 + $1500 = $3,027.65 on April 15, 2019.</p>
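<p>For those who would rather check the arithmetic in code, here is a small sketch that
reproduces both totals. It assumes simple (non-compounding) interest at 1% a year,
treats February 1, 2018 to April 15, 2019 as 14.5 months, and rounds the tiny
interest-on-interest amount to the cent, as the figures above do. The function and
constant names are my own.</p>

```python
ANNUAL_RATE = 0.01    # bank account interest rate from the example
TAX_RATE = 0.30       # effective tax rate from the example
BILL = 10_000         # full 2017 property tax bill
INSTALLMENT = BILL / 2


def simple_interest(principal, months):
    """Simple interest earned on principal over the given number of months."""
    return principal * ANNUAL_RATE * months / 12


def option_a_total():
    """Pay everything on Nov 1, 2017; deduct the full bill on Apr 15, 2018."""
    refund = TAX_RATE * BILL
    # The refund then sits in the bank until Apr 15, 2019 (12 months).
    return refund + simple_interest(refund, 12)


def option_b_total():
    """Pay half on Nov 1, 2017 and half on Feb 1, 2018."""
    # Interest on the second installment while it waits in the bank (3 months).
    float_interest = simple_interest(INSTALLMENT, 3)
    # That interest keeps earning from Feb 1, 2018 to Apr 15, 2019 (assumed 14.5 months).
    interest_on_float = round(simple_interest(float_interest, 14.5), 2)
    first_refund = TAX_RATE * INSTALLMENT    # received Apr 15, 2018
    interest_on_refund = simple_interest(first_refund, 12)
    second_refund = TAX_RATE * INSTALLMENT   # received Apr 15, 2019
    return (float_interest + interest_on_float + first_refund
            + interest_on_refund + second_refund)


print(f"Option A: ${option_a_total():,.2f}")
print(f"Option B: ${option_b_total():,.2f}")
print(f"Difference: ${option_a_total() - option_b_total():,.2f}")
```

<p>Changing <code>ANNUAL_RATE</code> or <code>TAX_RATE</code> is a quick way to see how
sensitive the gap between the two options is to your own situation.</p>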
<p>The difference between option A and option B is a meager $2.35, which may seem
too trivial to be of concern. But as personal finance aficionados, we love to
squeeze every bit of min/max opportunity available at hand! The picture
may become clearer if we disregard the returns you may reap on cash on hand, and
simply look at the amount of deductions you get under two payment options:</p>
<p>With option A’, you pay the first installment of your property tax bill on November
1, 2017 and the second installment on December 31, 2017. Come Tax Day 2018,
you’ll be able to deduct the full amount of the property tax bill and pocket
$3,000.</p>
<p>With option B’, you pay the first installment of your property tax bill on November
1, 2017 and the second installment on January 1, 2018. Come Tax Day 2018, you’ll
only be able to deduct the first installment of the property tax bill, and
pocket $1,500. You’ll then need to wait a full year for Tax Day 2019 before you
can deduct the second installment.</p>
<p>Would you rather have $3,000 now or $1,500 now and $1,500 later? The choice is
pretty clear. Pay those property taxes before the end of the year!</p>
<h3 id="mortgage-interest-deductions">Mortgage Interest Deductions</h3>
<p>Now let’s move on to mortgage interest deductions. Interest paid on a mortgage
is only <a href="https://en.wikipedia.org/wiki/Home_mortgage_interest_deduction#United_States">deductible</a>
on the first $1,000,000 of principal. In practice, the actual limit is probably $1,100,000
for homeowners who don’t have any home equity debt. To calculate the percentage
of your mortgage interest that is deductible, divide $1.1M by your outstanding
principal if it happens to be over $1.1M; the result is the percentage of deductible
interest.</p>
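<p>The rule of thumb above can be sketched as a couple of small functions. This is a
simplification of my own: it uses the $1.1M limit cited above and a single outstanding
principal figure, whereas an actual tax worksheet may compute the balance differently.
The $1,375,000 principal and $50,000 of interest are hypothetical numbers chosen to make
the math easy.</p>

```python
DEDUCTION_LIMIT = 1_100_000  # limit cited in the text; check the current rules


def deductible_fraction(outstanding_principal, limit=DEDUCTION_LIMIT):
    """Fraction of mortgage interest that is deductible under the given limit."""
    if outstanding_principal <= 0:
        raise ValueError("outstanding principal must be positive")
    # At or below the limit, all interest is deductible; above it, only a share.
    return min(1.0, limit / outstanding_principal)


def deductible_interest(interest_paid, outstanding_principal):
    """Dollar amount of interest that is deductible for the year."""
    return interest_paid * deductible_fraction(outstanding_principal)


# Hypothetical example: $1,375,000 principal with $50,000 of interest paid.
fraction = deductible_fraction(1_375_000)
print(f"Deductible share: {fraction:.0%}")
print(f"Deductible interest: ${deductible_interest(50_000, 1_375_000):,.0f}")
```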
<p>With real estate prices on the rise once again in many of America’s most coveted
markets, many are no doubt using jumbo loans to purchase their homes. Since the
home mortgage interest deduction is one of the best tax breaks available,
you’ll want to ensure you’re getting the maximum deduction. If your principal
amount is over $1.1M, consider paying down the principal quickly to get at or
below the $1.1M mark.</p>
<p>In today’s environment of low interest rates, you can consider a lump sum
payment towards your mortgage principal an investment with an automatic return
rate equivalent to your mortgage rate. If you have a mortgage rate anywhere
north of 3-4%, that’s already quite a deal compared to the meager rates your
bank is offering to hold your cash for you. Even at the high end of the savings
rates being offered by some of the online banks such as Ally or newer commercial
offerings such as Goldman Sachs, you’re looking at 1% at most.</p>Kelvin
Oft-Misheard Phrases in the Workplace2017-10-30T23:33:00-05:002017-10-30T23:33:00-05:00https://www.kelvinjiang.com/2017/10/oft-misheard-phrases-in-workplace<p>There’s an affliction that affects many millions of Americans in the workplace, and it’s time to bring that affliction to light. There’s perhaps nothing more benignly embarrassing than uttering one of these often misheard phrases during a work meeting, much less writing them down in a widely distributed email or memo. Perhaps it’s time to set things straight once and for all, and help bring our less fortunate colleagues out of the darkness by setting them on the righteous path to using the correct version of these phrases.</p>
<p>Which version of each of these phrases do you think is the correct one?</p>
<p><span id="more"></span><em>“the long pull”</em> or <em>“the long pole”</em></p>
<p><em>“flush out”</em> or <em>“flesh out”</em></p>
<p><em>“all intensive purposes”</em> or <em>“all intents and purposes”</em></p>
<p><em>“could care less”</em> or <em>“couldn’t care less”</em></p>
<p>Do you have other examples of phrases that are repeatedly used in the workplace in the wrong way? Your colleagues always appreciate your honest feedback, the more pedantic the better.</p>
<p><strong>Update</strong>: in a recent conversation with a colleague, I also learned that <a href="https://en.wikipedia.org/wiki/Begging_the_question">“beg(s) the question”</a> is another phrase that is often misused in the workplace, amongst other contexts. The phrase is used to describe a form of argumentative circular reasoning, but it has widely evolved to be used to describe situations where “raises the question” or “invites the question” are actually appropriate.</p>Kelvin