tag:blogger.com,1999:blog-33821991520689439862017-02-26T16:09:23.516-08:00Kelvin's DomainKelvinnoreply@blogger.comBlogger22125tag:blogger.com,1999:blog-3382199152068943986.post-24542879974392114952017-02-03T00:28:00.001-08:002017-02-09T22:12:32.153-08:00The Math Behind the Jury Selection Problem<div class="separator" style="clear: both; text-align: center;"><a href="https://images-na.ssl-images-amazon.com/images/M/MV5BOTcwMmE5YWItNDkzMi00Yjk2LTk5ZjItMmY5ZDhhNWE0ZWRmXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="320" src="https://images-na.ssl-images-amazon.com/images/M/MV5BOTcwMmE5YWItNDkzMi00Yjk2LTk5ZjItMmY5ZDhhNWE0ZWRmXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_.jpg" width="216" /></a></div>I recently served on the jury for a trial in the county superior court system, and contrary to popular belief it turned out to be an interesting experience. The ordeal gave me confidence in the idea of being judged by a jury of one's peers, and I came away with an appreciation of the fact that the responsibilities of a jury are precise and straightforward, much like the role of computer hardware (jurors) carrying out the instructions of a software program (law) given a set of well-defined inputs (evidence).<br /><br />As I sat through the initial jury selection process, I also got to thinking about some of the mathematics behind the process. In particular, is there a way to answer the two questions on every prospective juror's mind: will I be selected as a juror and how long is this process going to take?<br /><br /><a name='more'></a>The <a href="https://en.wikipedia.org/wiki/Jury_selection_in_the_United_States" target="_blank">jury selection</a> process used in the American court system can be formulated as an online variant of the <a href="https://en.wikipedia.org/wiki/Secretary_problem" target="_blank">secretary problem</a>, where the task is to hire the k best secretaries (jurors) from a pool of n candidates (jury pool). During jury selection, the prospective jurors are <a href="https://en.wikipedia.org/wiki/Voir_dire" target="_blank">interviewed</a> one by one and an immediate decision is made to select the candidate as part of the jury, or to <a href="https://en.wikipedia.org/wiki/Strike_for_cause" target="_blank">excuse the candidate</a> from service. It is assumed that there exists a way to rank the entire pool of candidates from best to worst, and that this method is employed by the judge and attorneys during the selection process to evaluate jurors on a relative basis.<br /><br />It turns out that the problem is a generalized version of the secretary problem, and there exists an optimal strategy for performing such a selection process as to maximize the probability of selecting the top k jurors from the jury pool <a href="#footnote1">[1]</a>. The idea is to choose a cutoff value r, and to interview and immediately excuse the first r prospective jurors while remembering the best amongst the first r candidates interviewed. Starting with the r+1 candidate, every candidate that is better than the best candidate interviewed in the initial r prospective jurors is selected, until k jurors have been selected or the pool is exhausted.<br /><br />To maximize the probability of selecting the top k jurors from a pool of n candidates, the optimal value for r turns out to be <span style="font-family: "courier new" , "courier" , monospace;">floor(n/(k*e^(1/k)))</span>. For example, to select twelve jurors and three alternates (k = 15) from a pool of eighty prospective jurors (n = 80), the optimal cutoff is r = 4. That is, the judge and attorneys should interview and excuse the first four candidates, then begin selecting every subsequent candidate that proves better suited for the case than the best candidate amongst the first four interviewed.<br /><br />With the optimal strategy in hand, we can also compute the expected number of prospective jurors that will be interviewed and evaluated before the final k are selected when such a strategy is employed. By way of simulation, to select 15 jurors from a pool of 80 prospective jurors, the expected number of candidates that will be interviewed during the process is approximately 66, which turns out to be quite close to the number of candidates actually interviewed during my recent experience with the jury selection process with n = 80. For k = 15, here are some more values for the expected number of prospective jurors that will be interviewed, for several different pool sizes:<br /><br /><a href="https://4.bp.blogspot.com/-J7mb6ifxYqA/WJQ9NbHISVI/AAAAAAAAksw/qlbWRK4Dj1gLHmSd8IRFu3xcyFnQyMx4wCLcB/s1600/jury.png" imageanchor="1"><img border="0" src="https://4.bp.blogspot.com/-J7mb6ifxYqA/WJQ9NbHISVI/AAAAAAAAksw/qlbWRK4Dj1gLHmSd8IRFu3xcyFnQyMx4wCLcB/s1600/jury.png" /></a><br /><br />Of course, the actual jury selection process may slightly differ by jurisdiction, and if so the optimal strategy will also differ. For example, the selection process I went through is more akin to a mini-batch online version of the problem, where a small batch of candidates are interviewed together before the judge and attorneys have to decide whether to excuse any of the candidates in the batch. Additionally, there's a small buffer of candidates that can be kept around during the selection process, and may be excused at later stages.<br /><br /><a href="https://www.blogger.com/null" name="footnote1">[1]</a> Girdhar and Dudek, 2009. <i><a href="http://cim.mcgill.ca/~yogesh/publications/crv2009.pdf" target="_blank">Optimal Online Data Sampling</a></i>Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-62201572467268610142016-03-08T23:00:00.000-08:002017-02-03T00:32:07.237-08:00Wheel of Fortune Mystery Wedge Problem<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-cu6E9VVe29Y/Vt-8Uqpnt4I/AAAAAAAAgHY/q-XrGLvdAkY/s1600/Mystery_wedge_10_000_side_by_wheelgenius-d2z42q0.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="320" src="https://4.bp.blogspot.com/-cu6E9VVe29Y/Vt-8Uqpnt4I/AAAAAAAAgHY/q-XrGLvdAkY/s320/Mystery_wedge_10_000_side_by_wheelgenius-d2z42q0.png" width="141" /></a></div>The <b>Wheel of Fortune Mystery Wedge Problem</b> is my attempt at a slightly more modern spin (pun intended) on the classic <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Monty Hall problem</a>, which is a probability puzzle based on an older game show. Most modern American (and many international, so I've learned) TV viewers are likely familiar with <a href="https://en.wikipedia.org/wiki/Wheel_of_Fortune_(U.S._game_show)">Wheel of Fortune</a>, but this specific problem can be framed in a way that doesn't require complete knowledge of the rules of the game:<br /><div><br /></div><div><ul><li>You have X dollars in the bank and it's your turn to spin the wheel</li><li>You spin the wheel and land on the <a href="http://wheeloffortunehistory.wikia.com/wiki/Gameplay_elements#Mystery_Round.2FWedges">$10,000 Mystery Wedge</a>, make a correct guess to win Y dollars, and Pat Sajak, the host, then offers you two options:</li><li>The first option is to do nothing. It is still your turn and you will end up with X + Y dollars in the bank.</li><li>The second option is to flip over the Mystery Wedge for a 50/50 shot at winning $10,000. If you lose, you lose all your money plus your turn.</li></ul><div><br /></div><div>Should you flip over the Mystery Wedge or not? At what values of X and Y should you flip the wedge and take the risk?</div></div><br /><a name='more'></a><div>A quick calculation might suggest that $5,000 is the break-even point, since the probability of winning multiplied by the winnings is $5,000. From that reasoning, the amount you can end the turn with needs to be less than $5,000 (X + Y < 5000) in order to justify flipping over the Mystery Wedge.</div><div><br /></div><div>However, a more detailed calculation of the expected value of the wager reveals a different conclusion. The expected value of the wager is 0.5 * 10000 - 0.5 * (X + Y). If the amount you can end the turn with is $5,000 (X + Y = 5000), the expected value of the wager is actually +$2,500, which is an advantageous wager you should take. This suggests that the break-even point is actually $10,000, and that you should always flip over the Mystery Wedge if the amount you can end the turn with (amount in your bank plus what you just won by guessing letters) is less than that.</div><div><br /></div><div>Perhaps a more intuitive approach is to consider the odds versus the payout. With the option of a guaranteed $5,000, you are being offered a wager to win twice your money, which is a 2:1 payout at 1:1 odds (50%), a favorable wager. At $10,000, the payout and odds are both 1:1, which is the aforementioned break-even point.</div><div><br /></div><div>However, the fun doesn't end there. One subtle but important consequence of flipping over the Mystery Wedge is that, should you lose, you not only lose all your money but you also lose your turn. That is, you don't actually end up with $0 and the chance to spin the wheel again, which is unaccounted for in the expected value calculation. If you expect to win Z dollars after retaining your turn, then Z needs to be factored into the amount at risk in the wager.</div>Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-25463772756554241372015-12-12T12:45:00.000-08:002017-02-03T00:36:43.580-08:00California Drought Dashboard<div class="separator" style="clear: both; float: left; text-align: center;"><a href="http://2.bp.blogspot.com/-u2h9s1gmVZY/VmuyUqyMkvI/AAAAAAAAexI/adTBoPXnyxQ/s1600/20151208_ca_none.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://2.bp.blogspot.com/-u2h9s1gmVZY/VmuyUqyMkvI/AAAAAAAAexI/adTBoPXnyxQ/s200/20151208_ca_none.png" width="200" /></a></div>The drought in California has been going on for a few years now, and the talk of an El Niño winter bringing much needed relief makes me wonder about the latest state of the drought situation as we move through a hopefully wet winter. Surprisingly, I was hard-pressed to find good, up-to-date sources of California drought conditions data that show changes over time.<br /><br />For starters, the <a href="http://droughtmonitor.unl.edu/Home/StateDroughtMonitor.aspx?CA" target="_blank">United States Drought Monitor</a> provides great percentile data for different drought level intensities, updated on a weekly basis. However, the tables do a poor job of depicting the progression of the drought conditions on a weekly basis, even though that data is readily available and dates all the way back to 2000. The best visualization of this data that I've found, in time-series form, is <a href="https://www.drought.gov/dcpaws/rest/CA/graph?h=400&w=800" target="_blank">the graph from the US Drought Monitor</a> that almost gets the job done but lacks some fine grained control. So, I made a slightly tweaked version of this graph that is hopefully a solid dashboard that can serve as an up-to-date summary of drought conditions. The graph can be easily manipulated for comparisons over different time periods, and always loads the latest data from the aforementioned sources. <br /><br /><a name='more'></a><b>October 2016 Update:</b> Unfortunately, California is still in a drought and some of the aforementioned links and data sources no longer work. Nevertheless, the code used to generate the custom dashboard is reproduced here (requires Google Charts and jQuery):<br /><br /><pre id="dashboard-code-copy" style="overflow: auto;">$(document).ready(function() {<br /> var yearsAgo = 5;<br /><br /> var parseDate = function(str) {<br /> var date = Date.parse(str.substring(0, 4) + "-" +<br /> str.substring(4, 6) + "-" +<br /> str.substring(6, 8));<br /> return new Date(date);<br /> };<br /><br /> var graph = function(raw) {<br /> var columns = ["Exceptional", "Extreme", "Severe", "Moderate", "Abnormal"];<br /> var colors = ["#730000", "#E60000", "#FFAA00", "#FCD37F", "#FFFF00"];<br /> <br /> var data = new google.visualization.DataTable();<br /> data.addColumn('date', 'Date');<br /> columns.forEach(function(col) { data.addColumn("number", col); });<br /> data.addRows(raw.map(function(row) {<br /> return Array.of(parseDate(row[0])).concat(row.slice(1).map(parseFloat));<br /> }));<br /><br /> var latest = parseDate(raw[raw.length-1][0]);<br /> var chart = new google.visualization.AnnotationChart(document.getElementById("graph"));<br /> var options = {<br /> displayAnnotations: false,<br /> displayZoomButtons: false,<br /> colors: ["#730000", "#E60000", "#FFAA00", "#FCD37F", "#FFFF00"],<br /> fill: 40,<br /> legendPosition: 'newRow',<br /> min: 0,<br /> max: 100,<br /> thickness: 2,<br /> zoomStartTime: new Date(latest.getTime() - yearsAgo*365*24*60*60*1000)<br /> };<br /> chart.draw(data, options);<br /> };<br /><br /> var updateTimestamp = function(raw) {<br /> var text = $('#updated').text();<br /> text += parseDate(raw[raw.length-1][0]);<br /> $('#updated').text(text);<br /> $('#updated').show();<br /> };<br /><br /> var init = function() {<br /> $.get("http://www.drought.gov/dcpaws/data/CA/d0,d1,d2,d3,d4", function(data) {<br /> data = data.replace(/(^.*values\>|\<\/values.*$)/g, "")<br /> .trim()<br /> .split("\n")<br /> .map(function(line) {<br /> return line.split(/,\s+/);<br /> });<br /> updateTimestamp(data);<br /> graph(data);<br /> });<br /> };<br /><br /> google.load('visualization', '1.0', {'packages':['annotationchart']});<br /> google.setOnLoadCallback(init);<br />});<br /></pre>Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-51435668065630647952013-03-10T09:00:00.000-07:002013-03-10T09:00:02.420-07:00Yelp is to Netflix as Foursquare is to Netflix<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-ca6Bkwimcw0/USfEnuBC-VI/AAAAAAAAB1E/X8hEmzt4sJc/s1600/foursquare_yelp.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="188" src="http://1.bp.blogspot.com/-ca6Bkwimcw0/USfEnuBC-VI/AAAAAAAAB1E/X8hEmzt4sJc/s320/foursquare_yelp.jpg" width="320" /></a></div>Pop quiz, hot shot. Here's an SAT-style analogy question for you. Yelp is to Netflix as Foursquare is to what? Answer: Netflix. To be precise, Yelp is to Netflix <i>DVD</i> as Foursquare is to Netflix <i>Streaming</i>. The reason for that is data.<br /><br /><a name='more'></a>When it comes to data and the applications of data to personalization and merchandising, Yelp is like the Netflix of yesteryear. Before its foray into instant streaming, Netflix had limited access to data that described the behavior of its customers. In the DVD-by-mail world, Netflix was able to know the movies its users wanted, the movies the users rented and returned, and the rating a user gives to a movie. In the streaming world, Netflix has <a href="http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html" target="_blank">access to all that data, plus a whole lot more</a> - what movie the user watches, when they watched the movie, how much of the movie they watched, how often they watched the movie, and where (i.e. on which device) they watched the movie. The data collection abilities in the steaming world are so much more fine-grained and specific in the streaming world that it dwarfs the old days of DVD-by-mail.<br /><br />Likewise, Yelp for the most part gets data after-the-fact when a user decides to write a review about or rate a restaurant they just visited. However, it's a single datapoint - a blob of text or a single 1 to 5 star rating about a restaurant. On the other hand, Foursquare is in the same boat as Netflix Streaming with respect to its ability to collect fine-grained information about user behavior. Unlike Yelp, Foursquare can access data about which the restaurant the user goes to, how often they go to that restaurant, when they go to that restaurant, who they go to the restaurant with, and where they were before and after going to the restaurant. The amount of information that can be captured is immense.<br /><br />It all boils down to having the ability to collect discrete data in the old, transaction-driven world versus continuous data in the new, real-time world. Foursquare is in a much better position to <a href="http://www.nytimes.com/2012/06/07/technology/in-app-overhaul-foursquare-shifts-focus-to-recommendations.html" target="_blank">take advantage of their data</a> to build more interesting, engaging experiences for its users. Unless Yelp is able to catch up in the big data game by extending their mobile capabilities, it may slowly fade away much like the old world of DVD-by-mail for Netflix.Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-72626409722239398642013-02-21T20:00:00.000-08:002016-03-08T22:03:44.112-08:00Winning Bar Bets With Mathematics<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-U-8zIOcZCdw/USbH6urikUI/AAAAAAAAB0w/FIMREgr-jzA/s1600/Bar-Bets-aw1.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="288" src="http://4.bp.blogspot.com/-U-8zIOcZCdw/USbH6urikUI/AAAAAAAAB0w/FIMREgr-jzA/s320/Bar-Bets-aw1.jpg" width="320" /></a></div>Benford's law is pretty neat. I first came across this phenomenon while watching a silly Internet TV show related to bar bets. The bar bet that exploits <a href="http://en.wikipedia.org/wiki/Benford's_law" target="_blank">Benford's law</a> is a little complicated, and relies (of course) on the law of large numbers.<br /><br />The premise is that you challenge your fellow bar patrons to a bet: <b>that humans are no good at generating random numbers</b>. It makes sense intuitively since people tend to pick numbers that have a special meaning, and there's a finite number of numbers that have meaningful connotations (i.e. 3's a charm, lucky number 7, etc.). To add to the challenge, you'll throw in some randomness. Each person will come up with two randomly chosen quantities based on some data, such as the population of pet birds in the US and the volume of water in the Dead Sea in cubic feet. Then, multiply the two numbers to get a third number. The bet will be that even the first digit of *that* number is not uniformly distributed. On the surface, it seems counter-intuitive. But Benford's law will give you an edge in this silly bar game.<br /><br /><a name='more'></a>It turns out that, given natural data, the first digit of any given number is not uniformly distributed across digits 1 through 9. Exponentially growing phenomena, for example, may double in size each period. Thus, the observed quantities of something like a growing bacteria colony would be 1, 2, 4, 8, 16, 32, 64, 128, 256, etc. Taking the first digit of these quantities, it's clear that the lower digits have higher chances of occurring. The other way to prove it to yourself is to assume that for a random variable X with the first digit d, the digit d is equal to 1 if 1 <= d < 2, equal to 2 if 2 <= d < 3, etc. If we take the logarithm of each inequality, we end up with log 1 <= log d < log 2, log 2 <= log d < log 3, etc. In log-space, the size of each interval is actually non-uniform.<br /><br />Thus, to win your bar game, assign one team of your bar patrons to the digits 1, 2, 3 and the other team to the digits 5, 6, 7, 8, 9. You, of course, want to make sure you're on the team with digits 1, 2, 3. The team with the most "random" numbers that have the first digit belonging to their set wins. Assuming you play enough iterations, Benford's law will ensure that you always win. By discarding the number 4, you give the illusion that you're giving yourself even less of an advantage, when in fact it's actually advantageous for you to simply disregard it rather than assign it to the other team.<br /><br />In fact, an even more interesting fact about the tricky bar bet is that numbers that follow normal distributions don't abide by Benford's law. However, if you "mix" the numbers by way of multiplication, then the law reappears. So the seemingly innocuous trick of multiplying two arbitrary naturally occurring numbers just plays to your advantage.<br /><br />Benford's law can even be <a href="http://en.wikipedia.org/wiki/Benford's_law#Accounting_fraud_detection" target="_blank">applied to fraud detection</a> by analyzing the distribution of the first digit of numbers to look for anomalous results.Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-86668404627736590122011-11-03T01:04:00.000-07:002013-02-21T20:56:59.393-08:00Time Asymmetry and the Department of Motor Vehicles<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-vVTPa7gptvM/TrJKd1Y30QI/AAAAAAAABck/SFphop9I2GQ/s1600/dmv.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-vVTPa7gptvM/TrJKd1Y30QI/AAAAAAAABck/SFphop9I2GQ/s1600/dmv.jpg" /></a></div>Why do people dread going to the <a href="http://dmv.ca.gov/" target="_blank">Department of Motor Vehicles</a>? Why do the workers there seem so disgruntled? Because of <b>time asymmetry</b>. The amount of time you have and will invest in taking care of your errands at the DMV is orders of magnitude more than the amount of time a worker will invest in handling your matters.<br /><br />You only have one appointment, whereas any given worker may go through dozens of customers in the span of a day. This imbalance and asymmetry is what leads to conflicting attitudes towards the same interaction between the representative and the customer, and why both sides seem to loathe the experience.<br /><br /><a name='more'></a>In fact, this phenomenon of time asymmetry can be generalized to customer service in general. It's likely one of the primary factors that determine the level of customer satisfaction for various products and services. For any company to get really good at customer service, it needs to overcome the barrier of time asymmetry. One way to do this is to use scale and leverage. Companies like <a href="http://www.amazon.com/" target="_blank">Amazon.com</a> are really good at this, as they can use automated technologies and computing scale to devote a seemingly asymmetric amount of time to each customer's issues. By automating the majority of customer service issues such as returns and refunds, Amazon.com provides the illusion of being able to constantly deliver a highly personalized experience without having to spend more on actual customer service representatives.<br /><br />Another way of tackling the time asymmetry problem is in training employees to mitigate the problem while interacting with customers. <a href="http://www.zappos.com/" target="_blank">Zappos</a> is well known for its customer service, and it employs a few tactics to help balance the time asymmetry between its representatives and customers. Zappos focuses on hiring great employees and keeping them happy, which contributes to the amount of care each representative is willing to give to each customer's call. There are no <a href="http://blogs.zappos.com/blogs/zappos-family/zcltstoriesthe8hrcall" target="_blank">call time limits</a>, and often times the representatives go above and beyond to help the customer take of his or her problems. Although this is approach may not scale as well as automated technologies, it brings an extraordinary amount of goodwill to the brand, perhaps even more so than slick self-service web pages.<br /><br />Ultimately, it's no surprise that Amazon.com and Zappos ended up together, as both companies share the same vision of delivering the absolute best customer experience in the world. And it's largely because they are able to overcome the problem of time asymmetry in the context of customer service. Until places like the DMV are able to catch up, people will always dread going to get their license renewed. Perhaps the best therapy then is to do some shopping on Amazon.com or Zappos while you spend the better part of your day waiting to be called.Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-62422806430960829922011-06-18T10:24:00.000-07:002013-02-22T11:07:03.250-08:00The Nate Dogg Number<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-c0QrBI5YgMM/TbKRn1sBv2I/AAAAAAAABHQ/5-N2kujqjjE/s1600/nate-dogg.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-c0QrBI5YgMM/TbKRn1sBv2I/AAAAAAAABHQ/5-N2kujqjjE/s320/nate-dogg.jpg" width="320" /></a></div>Nate Dogg was a rapper known for his prolific collaborations with numerous other hip-hop artists, whence he can usually be heard singing a catchy hook in between melodic rap verses. <a href="http://en.wikipedia.org/wiki/Nate_Dogg">Nate Dogg</a> passed away recently, and I thought it would be fun to honor him by creating the <b>Nate Dogg number</b>. Similar to the <a href="http://en.wikipedia.org/wiki/Erdos_number">Erdős number</a>, the Nate Dogg number is the collaborative distance between a person and Nate Dogg as measured by a collaboration on a musical track.<br /><br /><a name='more'></a>By definition, Nate Dogg himself has a Nate Dogg number of 0. Since Nate Dogg is closely associated with many West Coast hip-hop artists, musicians such as Warren G, Snoop Dogg, Dr. Dre, and 2Pac all have Nate Dogg numbers of 1.<br /><br />It's likely quite easy to determine the Nate Dogg number for many hip-hop and R&B artists due to the popularity of collaborative tracks and remixed songs featuring upwards of half a dozen to a dozen different artists. One could imagine that the graph representing the collaborative efforts between contemporary rappers and R&B singers is quite dense and well-connected, with most artists having low Nate Dogg numbers. Theoretically, since Nate Dogg has passed away, the number of artists with a Nate Dogg number of 1 should remain constant perpetually. That is, until Nate Dogg starts releasing tracks posthumously a la 2Pac.<br /><br />An interesting area to explore would be to determine the Nate Dogg number for artists in genres that are markedly distinct from hip-hop and R&B. For example, what would you guess the Nate Dogg numbers to be for Björk, Barbara Streisand, or Marilyn Manson? These exercises can easily be done by generating the aforementioned collaboration graph and doing unweighted shortest distance calculations from the root node representing Nate Dogg. Sites like <a href="http://musicbrainz.org/">MusicBrainz</a> provide free, public-domain data sets that include artist and track data, though parsing and normalizing the data is a small challenge in and of itself.<br /><br />Calculating Nate Dogg numbers by hand might be more fun than programmatic calculations, and may reveal interesting tidbits of music trivia unbeknownst to the Nate Dogg fan. It might even make for a fun road trip game: given a random artist, find the least number of collaborative links between the artist and Nate Dogg.<br /><br />So, go forth, and calculate the Nate Dogg number for your favorite artists. What is the most unexpectedly low Nate Dogg number you can find?Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-7450110865456780672011-06-08T23:56:00.000-07:002017-02-03T00:36:57.595-08:00Data Mining the NBA - Players Most Similar to Michael Jordan<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-NlMEs-_Dh04/TfBsBCrnXxI/AAAAAAAABIE/YqCbfkmv92o/s1600/Mj-slam_dunk_comp.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="211" src="https://1.bp.blogspot.com/-NlMEs-_Dh04/TfBsBCrnXxI/AAAAAAAABIE/YqCbfkmv92o/s320/Mj-slam_dunk_comp.jpg" width="320" /></a></div>Professional sports are chock full of numbers and statistics, which are possibly the most widely consumed and readily available mass data sets. For an aspiring data geek and long time sports fan, applying data mining techniques to sports is a fun and interesting way to play with various approaches and potentially discover new things from data. Lucky for us, many of these data sets are available online and can be downloaded for free. For NBA data, we can use the great data set provided by <a href="http://www.databasebasketball.com/">databaseBasketball.com</a> that includes comprehensive statistics for the NBA and the ABA from inception until 2009.<br /><br />When it comes to basketball, the most oft-debated issue is the question of who the greatest player of all time is. However, there's usually little controversy or dissent.<br /><br /><a name='more'></a>Ask almost any fan, analyst, player, coach, or anyone that hasn't been living under a rock in the last twenty years, and they'll likely say it's <a href="http://en.wikipedia.org/wiki/Michael_Jordan">Michael Jordan</a>. With Jordan's career on the record books and his legacy cemented in the form of six NBA titles, the controversial debates of today circle around who is the heir to the throne of His Airness. One interesting data mining technique that can be applied to throw some statistical weight into the debate is the notion of similarities. That is, finding the players that are the most statistically similar to Michael Jordan. That may give us some good insight into who is most <a href="http://www.imdb.com/title/tt0308506/">Like Mike</a>.<br /><br /><b>Euclidean Distance </b><br /><br />There are <a href="http://www.dcs.shef.ac.uk/~sam/stringmetrics.html">many different measures of similarity.</a> For starters, we're most interested in similarity measures that can compare vectors of real numbers. Since we have player statistics at our disposal, it seems most natural to use career stats such as points, rebounds, and assists to perform the comparisons. To start, we can use a basic, intuitive measure of similarity by calculating the <a href="http://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> between two vectors of player statistics. To normalize the comparisons, we'll take all statistics at a per game level. Additionally, we'll strive for completeness and compare all statistical categories provided in the dataset, which include points, rebounds (offensive and defensive), assists, steals, blocks, turnovers, fouls, field goals (attempted and made), free throws (attempted and made), and three-pointers (attempted and made).<br /><br />This makes our task of finding the most similar players to Michael Jordan quite simple: we can compare the pairwise distance between his career statistics against all other inactive and active players in our dataset. The players for which we get the smallest distance results are the players most similar to Michael Jordan. By using the Euclidean distance similarity measure, the similar players we end up with are<br /><br /><pre>LeBron James 5.0118536393989<br />Allen Iverson 5.65221580572828<br />Kobe Bryant 7.32620244207667<br />Rick Barry 7.40270212447083<br />Dominique Wilkins 7.41171127109192<br />George Gervin 7.51859410948606<br />Dwyane Wade 7.79350569920791<br />Jerry West 7.91375182383501<br />Pete Maravich 8.13164773279095<br />Carmelo Anthony 8.15452017320458<br /></pre><br /><b>Cosine Similarity</b><br /><br />Although the Euclidean distance is a very basic measurement, it actually works very well for our use case. Another measurement of similarity is the <a href="http://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>. Cosine similarity is slightly different in that it represents the difference in angles of the two vectors. In our case, it measures how much the ratio of a player's statistics differs from the ratio of another player.<br /><br />For example, suppose we only take points per game and rebounds per game into account. A player with 20 points per game and 10 rebounds per game would be considered exactly the same as a player with 10 points per game and 5 rebounds per game. This is because their points-to-rebounds ratios are identical, as are the trajectories of the vectors, and hence the angular difference is zero. By using cosine similarity, the players we end up with are<br /><br /><pre>Rolando Blackman 0.998534495549499<br />Kelly Tripucka 0.997919081452981<br />George Gervin 0.997494409372126<br />David Thompson 0.997361238336493<br />Chris Mullin 0.997326855700903<br />Kiki Vandeweghe 0.997284742031856<br />Mark Aguirre 0.997123483334025<br />Monta Ellis 0.997102936964528<br />Ronnie Brewer 0.996697911841195<br />Chris Douglas-Roberts 0.996676745860137<br /></pre><br />As you can see, cosine similarity yields dramatically different results. In our case, we are trying to measure statistical similarity and use it as a gauge of performance and effectiveness. By only comparing the relative ratios of statistics, we lose information about the absolute numbers. However, cosine similarity may be useful for other purposes, such as for clustering players based on the positions they played and their roles. For example, in the list above, all of the players played at either the <a href="http://en.wikipedia.org/wiki/Shooting_guard">shooting guard</a> or <a href="http://en.wikipedia.org/wiki/Small_forward">small forward</a> position, and have similar heights, which would explain why they have similar statistical ratios.<br /><br /><b>Pearson Correlation</b><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-zfhPNpMsWi4/TfBo1foDnRI/AAAAAAAABIA/VBDIjHmqZ50/s1600/438px-Linear_regression.svg.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="131" src="https://2.bp.blogspot.com/-zfhPNpMsWi4/TfBo1foDnRI/AAAAAAAABIA/VBDIjHmqZ50/s200/438px-Linear_regression.svg.png" width="200" /></a></div><br />Another popular similarity measure is the <a href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient">Pearson correlation</a>. An intuitive, geometric interpretation of this similarity measure is that it measures how well a regression line fits the statistics of each player. That is, two players with identical statistics would have a best fit line where all data points lie perfectly on the line. As players differ more and more, their statistical data points will drift farther away from the best fit regression line.<br /><br />With the Pearson correlation, we get these similar players:<br /><br /><pre>Rolando Blackman 0.998156576027356<br />Kiki Vandeweghe 0.996956795013002<br />Kelly Tripucka 0.996522611284401<br />Chris Mullin 0.996417918887698<br />George Gervin 0.996039073475534<br />Monta Ellis 0.995632004734591<br />David Thompson 0.995578688988299<br />Ronnie Brewer 0.995306500496774<br />Chris Douglas-Roberts 0.995056403935876<br />Mark Aguirre 0.995004915521134<br /></pre><br />The results are very close to the similarities generated using cosine similarity, for similar reasons.<br /><br />Out of the three similarity measures we used, the basic Euclidean distance seems to work the best. Also, the data appears to be supportive of <b>LeBron James being the most statistically similar player to Michael Jordan</b>. Although statistics never lie, they may serve little purpose than to add fuel to the burning debates about James' South Beach talents versus Jordan's six rings, and who will end up with the Greatest Of All Time title when all is said and done.Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-51556093181536804332011-04-23T09:21:00.000-07:002011-04-23T09:21:00.090-07:00The Best Marketing Strategy That You Don't Know About<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/--EWE_l34PFQ/TbKDIMCCxYI/AAAAAAAABHM/8idgrovwvis/s1600/800px-Public_Storage_logo.svg.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="164" src="http://2.bp.blogspot.com/--EWE_l34PFQ/TbKDIMCCxYI/AAAAAAAABHM/8idgrovwvis/s320/800px-Public_Storage_logo.svg.png" width="320" /></a></div>Every time I pass a <a href="http://www.publicstorage.com/">Public Storage</a> facility, I'm reminded of their brilliant marketing and pricing strategy of offering the first month of storage for $1. The effectiveness of this strategy is subtle, but ingenious. Rather than offering the first month of storage for free, they put a tangible (albeit small) price on a service that few people know the price of. I, for one, had no idea how much monthly storage facilities cost. For years, I thought the industry standard cost of a month's worth of rent for storage space was around $1. It wasn't until I actually looked into storage facilities and started comparing prices did I realize that the $1 was far from the truth.<br /><br /><a name='more'></a>If Public Storage had offered the first month for free, I (and I imagine most other people) would have saw right through the marketing and become dubious of the pricing, chalking the costs up to something higher than what I would expect. However, since I (and I'm willing to bet most people) was not familiar with the storage market, the tangible cost of a dollar seemed much more real and believable. Here is a rare, non-Perl, non-shell-script example where $0 > $1 is true.<br /><br />Even though I now know better, I still associate Public Storage as that orange storage company with affordable storage costs of only a buck. Now that is good marketing.Kelvinnoreply@blogger.com1tag:blogger.com,1999:blog-3382199152068943986.post-11402342806970494022011-03-23T19:07:00.000-07:002017-02-03T00:34:41.612-08:00Facebook Puzzles - Refrigerator Madness<div class="separator" style="clear: both; text-align: center;"><a href="https://lh6.googleusercontent.com/-UZkSlPRzyXU/TYW0AFTwZFI/AAAAAAAABG0/iI-USYfH23o/s1600/summit-freezerless-refrigerator-full-size-b767.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://lh6.googleusercontent.com/-UZkSlPRzyXU/TYW0AFTwZFI/AAAAAAAABG0/iI-USYfH23o/s1600/summit-freezerless-refrigerator-full-size-b767.jpg" /></a></div>For software engineers, free drinks at a place of employment equates to a certain level of prestige. If a company cares about you enough to always keep the refrigerator fully stocked with caffeinated beverages, then it must be a great place to work. It's sort of silly, considering how bad most soft drinks are for your health. Add that to the fact that most software engineers nary get any semblance of exercise - or even much daylight for that matter - and you have a dangerous combination on your hands.<br /><br />Nevertheless, free food and drinks, no matter how unhealthy, are always popular staples at hot technology companies. Facebook is no exception, which is why they released a programming puzzle to solicit help for optimizing productivity amongst their soda-drinking, caffeine-slinging engineers.<br /><br />The <a href="http://www.facebook.com/careers/puzzles.php#%21/careers/puzzles.php?puzzle_id=10">Refrigerator Madness puzzle</a> (fridgemadness) is interesting not because it deals with matching engineers with beverages, but rather because it presents the task at hand in the form of a classic mathematical and computer science problem. In fact, this problem has many real world applications, and has been used for things like pairing college students up with internship programs, or paring medical students up with residency programs.<br /><br /><a name='more'></a><b>Match Madness</b><br /><br />The input of the Refrigerator Madness puzzle is simple: there is an even number of engineers, an arbitrary list of types of drinks, and a preference list for every engineer that indicates his or her preferred drinks of choice, in order. At its core, the puzzle is a matching problem, as the goal is to match each upper 50th percentile engineer to a lower 50th percentile engineer in the other set. The key insight to the puzzle is the requirement that the final matching between engineers must be "stable" for every pair of engineers. That is, every matching pair of engineers must not prefer another engineer with whom he or she is not already paired with. This problem is an instance of the <a href="http://en.wikipedia.org/wiki/Stable_marriage">stable marriage problem</a>, and it's actually easier to think about the puzzle in this context.<br /><br />In the stable marriage problem, the goal is to match N men up with N men. A <b>stable matching</b> is a set of matches where no arbitrary pair of a man or a woman would prefer each other over the person to whom they're already matched up with. That is, everyone is happy with their matches and none of the pairs need to change, and thus are stable. A prerequisite for the stable marriage problem is that all men and women should have a preference list which ranks the members of the opposite sex in order of preference.<br /><br />Likewise, the Refrigerator Madness puzzle provides such a definition for the degree of preference between two engineers, using their drink preferences as a proxy. That is, an "engineer's preference to work with another engineer can be calculated by comparing the drink preferences of each engineer. An engineer prefers to work with one engineer over another if the preference score is higher." We can easily write a method for calculating the preference scores between a pair of engineers, where <span style="font-family: "courier new" , "courier" , monospace;">n</span> is the number of distinct drinks:<br /><br /><pre class="brush: ruby">def score(a, b, n)<br /> score_ab = score_ba = (n * (n + 1)) / 2<br /> for i in 0..n-1<br /> base_a = n - i<br /> for j in 0..n-1<br /> base_b = n - j<br /> if a[i] == b[j]<br /> if i < j<br /> score_ab += base_a<br /> elsif i > j<br /> score_ba += base_b<br /> else<br /> score_ab += base_a * base_a<br /> score_ba += base_b * base_b<br /> end<br /> score_ab -= base_a<br /> score_ba -= base_b<br /> end<br /> end<br /> end<br /> return [score_ab, score_ba]<br />end<br /></pre><br />Although the preferences are one-directional, we can take a few shortcuts and be more efficient in calculating the preference scores between two engineers. Once we have this method handy, we can tackle the heart of the puzzle.<br /><br /><b>I Choose You, You Choose Me</b> <br /><br />The primary motivation for calculating preference scores between pairs of engineers is to establish an ordered preference list for each engineer. It turns out that the actual drinks are irrelevant; we only need to know how many types of drinks are available. Let's assume that we have these preference lists readily available, in the form of arrays indexed by the drink id. <br /><br /><pre class="brush: ruby;">def scores(drink_prefs, num_engineers)<br /> # Median is the first engineer in lower 50th percentile<br /> median = num_engineers / 2<br /> scores = {}<br /><br /> # Compute scores between all pairs of engineers<br /> for i in 0..median-1<br /> scores[i] ||= {}<br /> p1 = drink_prefs[i]<br /> n = p1.size<br /> for j in median.. num_engineers-1<br /> scores[j] ||= {}<br /> p2 = drink_prefs[j]<br /> scores[i][j], scores[j][i] = score(p1, p2, n)<br /> end<br /> end<br /> return scores<br />end<br /></pre><br />Since the number of engineers is always even and all have consecutive ids starting at 0, we can use a shortcut of taking the median id as the cut-off between the top 50th percentile and the bottom 50th percentile. Additionally, we know the total number of engineers, so we can use these pieces of information to calculate preference lists.<br /><br /><pre class="brush: ruby;" style="overflow: auto;">def prefs(scores, num_engineers)<br /> median = num_engineers / 2<br /> prefs = {}<br /><br /> # Compute preference lists for all engineers<br /> for i in 0..median-1<br /> sorted = scores[i].sort do |a, b|<br /> v = b[1] <=> a[1]<br /><br /> # Tie break with engineer effectiveness, which is their id<br /> v = b[0] <=> a[0] if v == 0<br /> v<br /> end<br /> prefs[i] = sorted.map { |s| s[0] }<br /> end<br /> return prefs<br />end<br /></pre><br /><b>Matchmaker, Matchmaker</b> <br /><br />Note that we only calculate preference lists for the top 50th percentile of engineers. Later on, we'll see why we can utilize this optimization. Armed with the preference scores between engineers and the preference lists, we can start finding stable matchings. The <b>Gale-Shapley algorithm</b> is a method of establishing stable matchings by going through rounds of matchings. Since the algorithm is intended to match individuals from one set with individuals from another set, we naturally divide the engineers into the top 50th percentile and the bottom 50th percentile. To begin, the pool of engineers to be matched includes all the engineers in the top 50th percentile. We don't actually need to keep track of a pool of engineers from the bottom 50th percentile, since a matching by definition means that one engineer from each groups is selected.<br /><br />In each iteration of the algorithm, we select an engineer (from the top 50th percentile) from the pool. We attempt to match the engineer with the candidate (an engineer in the bottom 50th percentile) highest on his or her preference list, and remove the candidate from the list. If the candidate is unmatched, then we establish the matching. If the candidate is already matched, we try to see if we can improve the matching. If the candidate happens to prefer this new engineer more than the engineer he or she is already matched to, then we match the candidate to the new engineer and put the old engineer back into the pool; otherwise, nothing changes. The algorithm goes through as many rounds as it needs until all engineers are matched, and guarantees that all engineers will eventually be matched.<br /><br /><pre class="brush: ruby;">def matches(scores, prefs, num_engineers)<br /> median = num_engineers / 2<br /> matches = {}<br /><br /> # Initial pool of engineers<br /> pool = (0..median-1).to_a<br /><br /> # Keep matching engineers until the pool is exhausted<br /> while !pool.empty?<br /> hi = pool.shift<br /> lo = prefs[hi].shift<br /> if existing = matches[lo]<br /><br /> # Match exists; decide whether to break it up<br /> a = scores[lo][hi]<br /> b = scores[lo][existing]<br /> if a > b || (a == b && hi > existing)<br /> matches[lo] = hi<br /> pool.push(existing)<br /> else<br /> pool.push(hi)<br /> end<br /> else<br /> matches[lo] = hi<br /> end<br /> end<br /><br /> # Pair up engineers based on matches<br /> pairs = []<br /> matches.each do |a, b|<br /> pairs.push(a < b ? [a, b] : [b, a])<br /> end<br /> return pairs<br />end<br /></pre><br />Technically, we could've also calculated the preference lists for the engineers from the bottom 50th percentile, and used these lists to whether or not to break up existing matches. However, since we have available to us the notion of preference scores between all pairs of engineers, we can be slightly more efficient by using this to do a constant time lookup to determine if an engineer prefers one over another.<br /><br />Putting it all together, we just need to sort the pairs of engineers in ascending order, and we're done.<br /><br /><pre class="brush: ruby;">#!/usr/bin/ruby<br />num_engineers = 6<br />drink_prefs = {<br /> 0 => [0, 2, 1],<br /> 1 => [1, 2, 0],<br /> 2 => [2, 1, 0],<br /> 3 => [1, 0, 2],<br /> 4 => [2, 0, 1],<br /> 5 => [0, 1, 2],<br />}<br /><br />scores = scores(drink_prefs, num_engineers)<br />prefs = prefs(scores, num_engineers)<br />pairs = matches(scores, prefs, num_engineers)<br /><br />pairs.sort { |a, b| a[0] <=> b[0] }.each do |p|<br /> puts p.join(',')<br />end<br /></pre><br />All the engineers at Facebook can now work and co-exist in love, peace, and harmony, and enjoy their caffeine-fueled extreme programming sessions.<br /><br />If you enjoyed this post, you should follow me on Twitter <a href="http://twitter.com/cloudkj">@cloudkj</a>. If you're interested in more puzzle fun, you should check out my posts about various other puzzles, such as <a href="http://www.kelvinjiang.com/2010/09/facebook-puzzles-and-liar-liar.html">Liar Liar</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-gattaca.html">Gattaca</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-dance-battle.html">Dance Battle</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-user-bin-crash.html">User Bin Crash</a>, and <a href="http://www.kelvinjiang.com/2011/02/facebook-puzzles-its-small-world.html">It's A Small World</a>.<br /><br /><b>Food For Thought</b><br /><ul><li>Does the algorithm achieve an optimal matching for all engineers?</li></ul>Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-16234032500556324552011-02-16T09:21:00.000-08:002017-02-03T00:35:14.547-08:00Facebook Puzzles - It's A Small World<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-v9-QqbaDudA/TVujFxgshPI/AAAAAAAABF4/AcT8SDfTIKU/s1600/iasw11_5.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="240" src="https://4.bp.blogspot.com/-v9-QqbaDudA/TVujFxgshPI/AAAAAAAABF4/AcT8SDfTIKU/s320/iasw11_5.jpg" width="320" /></a></div>The <a href="http://xorswap.com/questions/162-facebook-engineering-puzzle-its-a-small-world">It's A Small World</a> Facebook Engineering Puzzle (smallworld) is probably one of the more difficult snack-level problems; it's also one of the funnest since the standard solution involves using a nifty data structure (note: Facebook took down the original puzzle so I'm linking to another site with the description). Even on first glance, it's easy to understand the objective of the puzzle. The input is a list of Cartesian coordinates on a two-dimensional plane, and the goal is to find the 3 closest points for every point in the list.<br /><br />In fact, this problem is commonly found in applications we all use everyday. Be it location based mobile apps, mapping software, or recommendations systems, chances are you've experienced the need to find things that are "close to" other things, whether physically or abstractly. Let's dive a little deeper and see how it all works.<br /><br /><a name='more'></a>Now, if you think back to basic geometry, you'll remember that the <a href="http://en.wikipedia.org/wiki/Pythagorean_theorem">Pythagorean Theorem</a> is handy in finding the distance between two points. For any two points A and B, we simply need to treat the line segment between A and B as the hypotenuse of a triangle. However, since we're merely using relative distances to compare the points, the actual distance value isn't required. Thus, we can save ourselves the cost of calculating a square root. Assuming that points are represented as 2-element arrays for the (x, y) coordinates, the distance function looks like this:<br /><br /><pre class="brush: ruby">def dist(p1, p2)<br /> a = p1[0] - p2[0]<br /> b = p1[1] - p2[1]<br /> return (a * a) + (b * b)<br />end<br /></pre><br />With the ability to calculate the distance between two points, let's shoot for the easiest, most straightforward way to solve the puzzle as a first attempt. The naive approach would be to go through the list of points, and calculate the distance between it and all other points, and keep track of the three closest points. It's a simple solution, but runs in <span style="font-family: "courier new" , "courier" , monospace;">O(n<sup>2</sup>)</span>. There has to be a more clever way.<br /><br /><b>"O kd-tree, O kd-tree"</b><br /><br />What if we can store the points in some sort of auxiliary data structure that would be more efficient for searching for points by proximity to a given point? That might speed things up a bit. It turns out that there's a nifty data structure called the <a href="http://en.wikipedia.org/wiki/Kd-tree">kd-tree</a> that is a helpful space-partitioning data structure. A <b>kd-tree</b> is short for k-dimensional tree, and is a binary tree that evenly partitions a set of points at each level into two branches.<br /><br />In a two-dimensional kd-tree, each level in the tree alternates between the x-axis and y-axis in order to partition the points. That is, at the root level, all the points to the left of the root point have smaller x-values, and all the points to the right have larger x-values. At one level lower, the tree partitions according to the y-value. Thus, all the points in the left branch of a node have smaller y-values, and all the points in the right branch have greater y-values. The next level splits by the x-axis, and so on.<br /><br />For example, a kd-tree that represents a small set of points on a Cartesian plane might look something like this (courtesy of Wikipedia): <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://upload.wikimedia.org/wikipedia/commons/thumb/2/25/Tree_0001.svg/300px-Tree_0001.svg.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/25/Tree_0001.svg/300px-Tree_0001.svg.png" /></a></div><br />The kd-tree above represents this space partitioning:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Kdtree_2d.svg/200px-Kdtree_2d.svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Kdtree_2d.svg/200px-Kdtree_2d.svg.png" /></a></div><br />Each colored line segment represents a branch in the tree, which effectively divides each sub-tree into two halves. At the root level, we have the point (7, 2), and the two branches divide the entire set of points into two sets: one with points to the left of (7, 2), and the other with points to the right. As you can see, each partition is further divided into halves, and the partitions match up directly with the branches in the kd-tree.<br /><br />Constructing such a kd-tree is fairly straightforward. Given a list of points, we need to alternate between different the two axes as we go deeper in the tree. Once we know the axis (dimension) we're working in, we can go through the list of points and find the median point according to its value on the axis. Here, we can actually optimize the construction algorithm by using a linear-time median selection, such as the one covered in <a href="http://www.amazon.com/gp/product/0262033844?ie=UTF8&tag=kelvinjiang-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0262033844">CLRS</a>. However, I've found that standard sort-then-select approach is sufficient. Once the median point has been found, it's simply a matter of splitting the points up into two branches, and recursively processing each branch.<br /><br /><pre class="brush: ruby">def build_tree(points, depth = 0)<br /> unless points.empty?<br /> axis = depth % 2<br /> points = points.sort { |a, b| a[axis] <=> b[axis] }<br /> median = points.size / 2<br /> return {<br /> :point => points[median],<br /> :left => build_tree(points[0, median], depth+1),<br /> :right => build_tree(points[median+1, points.size],<br /> depth+1),<br /> }<br /> end<br />end<br /></pre><br />Now that we have the locations of our friends in the small world organized in a neat little kd-tree, we need an approach that takes advantage of this data structure to efficiently find the nearest friends for each person. By now, some of you may have recognized that this problem is a <b>k-nearest neighbor search</b>. To keep things simple, let's first tackle the problem of finding the single nearest neighbor for each given friend.<br /><br />To find the nearest neighbor for a given point using the kd-tree we just built, we need to search through the tree in an appropriate manner. They key to the kd-tree traversal in a nearest neighbor search is in deciding on the right branch to explore. Since each branch in a kd-tree represents a space partition, the idea is to explore the partition that is closer to the point in question first. For a two-dimensional kd-tree, it's easy to visualize this approach.<br /><br /><b>An Illustrative Example</b><br /><br />Take the example tree from above as an example. Say we're looking for the nearest neighbor of the point at (6, 5). We start at the root node, and see the point (7, 2), which we remember as the best candidate so far (since it's the only one we've examined). At the root node, the point (7, 2) acts as a divider along the x-axis. That is, its left and right sub-trees are partitioned at x = 7. Since we're looking for the point closest to (6, 5), we want to first explore the left partition - the left sub-tree rooted at (5, 4) - as that partition is closer.<br /><br />Since the point (5, 4) is actually closer to our target than the best candidate so far, we'll replace the candidate with (5, 4). Likewise, we'll look at both branches, which now act as partitions on the y-axis. The right sub-tree represents the half that is closer to the target point, so we'll examine it first. It contains (4, 7), which is not closer to our target than our best candidate. Since we've hit a leaf node, we stop and examine the left sub-tree.<br /><br />At this point, we are exploring what is considered to be the "far" branch of the tree. By looking at our best candidate so far, we can decide whether or not the far branch is worth exploring. Conceptually, this is equivalent to drawing a circle around the target point, where the circle's radius is equal to the distance from the target point to the best candidate point. This circle represents the "best estimate" of the target point's nearest neighbor. If the far branch represents a partition where none of the points can possibly be closer than the best estimate so far (i.e. the best estimate circle does not intersect the partition), then we can forgo exploring the far branch altogether.<br /><br />Going back to our example, since the far branch in question represents a split along the y-axis, we can actually calculate the minimum distance between any point in this branch and the target point. This split along the y-axis is at y = 4. Thus, any point in this partition must be at least 1 unit away from our target point. However, our best candidate so far, (5, 4), is actually approximately 1.41 (square root of 2) units away from the target. Thus, we should still examine this branch.<br /><br />At this point, you should be able to see how the algorithm will proceed as it explores the rest of the kd-tree. In the end, the algorithm will determine that (5, 4) is in fact the closet neighbor to our target point of (6, 5). Let's formalize this in code.<br /><br /><pre class="brush: ruby">def nearest(node, point, min = node[:point], depth = 0)<br /> if !node.nil?<br /> axis = depth % 2<br /> d = point[axis] - node[:point][axis]<br /><br /> # Determine the near and far branches<br /> near = d <= 0 ? node[:left] : node[:right]<br /> far = d <= 0 ? node[:right] : node[:left]<br /><br /> # Explore the near branch<br /> min = nearest(near, point, min, depth + 1)<br /><br /> # If necessary, explore the far branch<br /> if d * d < dist(point, min)<br /> min = nearest(far, point, min, depth + 1)<br /> end<br /><br /> # Update the candidate if necessary<br /> if dist(point, node[:point]) < dist(point, min)<br /> min = node[:point]<br /> end<br /> end<br /> return min<br />end<br /></pre><br /><b>Love Thy Neighbors</b><br /><br />How do we extend the nearest neighbor search to find the nearest three neighbors? It's actually quite easy to do, and simply requires keeping track of a bit of state. As a matter of fact, we can generalize it for finding the <a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">k-nearest neighbors</a>. The trick is to keep track of a list of the k best candidates, instead of a single candidate. Whenever a new candidate is found, it should be inserted into the list if it is closer to the target point.<br /><br /><pre class="brush: ruby">def nearest_k(node, point, k, min = [], k_dist = Float::MAX, depth = 0)<br /> if !node.nil?<br /> axis = depth % 2<br /> d = point[axis] - node[:point][axis]<br /><br /> # Determine the near and far branches<br /> near = d <= 0 ? node[:left] : node[:right]<br /> far = d <= 0 ? node[:right] : node[:left]<br /><br /> # Explore the near branch<br /> min, k_dist = nearest_k(near, point, k, min, k_dist, depth+1)<br /><br /> # If necessary, explore the far branch<br /> if d * d < k_dist<br /> min, k_dist = nearest_k(far, point, k, min, k_dist, depth+1)<br /> end<br /><br /> # Save the current point as a candidate if it's eligible<br /> d = dist(point, node[:point])<br /> if d < k_dist<br /><br /> # Do a binary search to insert the candidate<br /> i, j = 0, min.size<br /> while i < j<br /> m = (i + j) / 2<br /> if min[m][1] < d<br /> i = m + 1<br /> else<br /> j = m<br /> end<br /> end<br /> min.insert(i, [node[:point], d])<br /><br /> # Keep only the k-best candidates<br /> min = min[0, k]<br /><br /> # Keep track of the radius of the "best estimates" circle<br /> k_dist = min[min.size - 1][1] if min.size >= k<br /> end<br /> end<br /> return min, k_dist<br />end<br /></pre><br />Since we're looking for the nearest three friends for each friend in the world, we actually want to find the nearest k = 4 neighbors. By definition, each person is his or her own nearest neighbor, so we'll need to skip the first point that we find. After we put it all together, we should be well on our way to finding ways to more efficiently visit our friends.<br /><br /><pre class="brush: ruby">#!/usr/bin/ruby<br />friends = [<br /> [0.0, 0.0, 1],<br /> [-10.1, 10.1, 2],<br /> [12.2, -12.2, 3],<br /> [38.3, 38.3, 4],<br /> [179.99, 79.99, 5],<br />]<br /><br />k = 4<br />tree = build_tree(friends)<br />friends.each do |f|<br /> near, d = nearest_k(tree, f, k)<br /> puts "#{f[2]} " + near[1, k].map { |n| n[0][2] }.join(',')<br />end<br /></pre><br />If you enjoyed this post, you should follow me on Twitter <a href="http://twitter.com/cloudkj">@cloudkj</a>. If you're interested in more puzzle fun, you should check out my posts about various other puzzles, such as <a href="http://www.kelvinjiang.com/2010/09/facebook-puzzles-and-liar-liar.html">Liar Liar</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-gattaca.html">Gattaca</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-dance-battle.html">Dance Battle</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-user-bin-crash.html">User Bin Crash</a>, and <a href="http://www.kelvinjiang.com/2011/03/facebook-puzzles-refrigerator-madness.html">Refrigerator Madness</a>.<br /><br /><b>Food For Thought</b><br /><ul><li>What is the run-time complexity of the algorithm?</li></ul>Kelvinnoreply@blogger.com2tag:blogger.com,1999:blog-3382199152068943986.post-86651342404548200732011-01-01T17:00:00.000-08:002011-01-01T17:47:03.619-08:00An Open Letter Regarding Innovation To The Household Goods Manufacturers of America<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_odeQu6U_EYw/TR_MOL77qII/AAAAAAAABE4/gj_pI54doYY/s1600/2244-2-large.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="149" src="http://1.bp.blogspot.com/_odeQu6U_EYw/TR_MOL77qII/AAAAAAAABE4/gj_pI54doYY/s200/2244-2-large.jpg" width="200" /></a></div>Dear Household Goods Manufacturers of America, <br /><br />I have a simple request for you all: stop innovating. When it comes to things like home storage or cleaning supplies, I just need something that works and does its job. I don't need an ironing board that can double as a dining table or a lawn mower that also feeds the dog. My request is backed by two anecdotes with regards to plastic storage bins and vacuum cleaners.<br /><br /><a name='more'></a>Plastic storage is a product group that does <i>not</i> require further innovation. Today, I visited five different retail stores in search of a classic <a href="http://www.rubbermaid.com/Category/Pages/ProductDetail.aspx?Prod_ID=RP091418">14 gallon Rubbermaid Roughneck storage bin</a>. None of the stores had them. Instead, I saw a bunch of newfangled storage boxes in a myriad of non-standard sizes with more bells and whistles and handles and wheels than I need. Have your customers really requested all those features in their plastic storage boxes? I may be a traditionalist when it comes to plastic storage bins, but when I can't find the most common, dependable plastic storage bin known to mankind in five of the most common retail stores known to suburbia, there's something wrong. The plastic storage bin can't even be found on <a href="http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=rubbermaid+roughneck&x=0&y=0#/ref=sr_nr_p_76_1?rh=i%3Aaps%2Ck%3Arubbermaid+roughneck%2Cp_76%3A1-&keywords=rubbermaid+roughneck&ie=UTF8&qid=1293928739&tag=kelvinjiang-20">Amazon.com</a>.<br /><br />A second example of unnecessary innovation is in the household vacuum cleaner. Today, I tried to vacuum the house with a fancy Hoover vacuum cleaner with built-in lights, an automatic cord winder, and more brushes and attachments than I could shake a stick at. Yet, it sucked at the one thing that matters in vacuum cleaners: sucking. With all the features tacked on by Hoover to justify a higher sticker price and bigger margins, this product fails to perform at even the basic level expected of vacuum cleaners.<br /><br />Let's face it - when it comes to household goods, simplicity is king. Today, Sterlite and Rubbermaid wasted 1 hour of my life, and Hoover lost my respect for it as the best vacuum cleaner manufacturer in the market. I beg of you all to please get back to basics and start keeping it simple.Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-27980176415107350832010-12-30T10:49:00.000-08:002010-12-30T10:52:13.527-08:00What Rejection Means For You (Hint: Nothing)<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/_odeQu6U_EYw/TRzUgQ0nleI/AAAAAAAABE0/RDUcFIsNOAk/s1600/War%252BFor%252BTalent.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="210" src="http://4.bp.blogspot.com/_odeQu6U_EYw/TRzUgQ0nleI/AAAAAAAABE0/RDUcFIsNOAk/s320/War%252BFor%252BTalent.jpg" width="320" /></a></div>With the <a href="http://blogs.wsj.com/venturecapital/2010/12/29/the-war-for-top-talent-in-silicon-valley/" target="_blank">Great Silicon Valley Talent War of 2010</a> (<a href="http://www.focus.com/images/view/42092/" target="_blank">infographic</a>) raging on between Google and Facebook, many engineers are undoubtedly dusting off their resumes and considering taking a stab at getting in on a piece of the action. Some may consider the escalating efforts at getting engineering talent to be a bubble, but there's no reason why talented engineers shouldn't get in on a piece of the action while the going is hot. Google's reputation as an employer is nothing short of legendary, as there have been hundreds of articles written about the Google work culture, benefits and perks, and the hiring process. If Facebook's trajectory remains on course, it will likely join Google at the top of the totem pole of coveted Silicon Valley employers.<br /><br /><a name='more'></a>An abundance of job applications will inevitably be followed by an abundance of rejected candidates. Companies like Google are well known for their arduous interview process and high hiring bar. Even people that work at Google admit to a <a href="http://steve-yegge.blogspot.com/2008/03/get-that-job-at-google.html" target="_blank">high false-negative rate</a> when it comes to identifying the right engineers to hire. This is likely the case at various other top technology companies in and out of Silicon Valley.<br /><br />The technical interview and hiring process is known to be fickle and tricky. The more prestigious the company, the bigger the applicant pool, and hence, the higher the false negative rate. Does this mean the process is broken and needs to be fixed? Probably. There are probably things that can be changed to improve the interview and hiring process for both candidates and companies alike. However, one can argue that there are challenges that are intrinsic to "talent evaluation" that make it an inevitably difficult process.<br /><br />The same challenges can be found in college admissions and professional sports. The best athletes in professional sports often don't spring out of drafts. Top draft picks often turn out to be busts, and lower picked or even undrafted players often end up having brilliant careers. The 1984 NBA Draft involving <a href="http://en.wikipedia.org/wiki/Sam_Bowie" target="_blank">Sam Bowie</a> and Michael Jordan is just one of many examples. The same phenomenon can be seen throughout many other professional sports. And to think, here is a set of businesses that are fueled by hundreds of millions of dollars in salaries and bonuses, where each talent acquisition can mean orders of magnitude in difference between revenue, and yet the draft scouts and general managers still manage to make many mistakes. If recruiters in the sports industry can't get it right, how will recruiters in the other industries fare?<br /><br />The bottom line is that when it comes to evaluating "human resources," humans are apt to make mistakes. It is hard to evaluate technical talent based purely on a resume and answers to arbitrarily chosen technical interview questions in a challenging interview environment. Yet, both sides recognize it as a necessary evil. Much like NFL hopefuls running through various tests and evaluations in the <a href="http://en.wikipedia.org/wiki/NFL_Scouting_Combine" target="_blank">NFL Scouting Combine</a>, job applicants in the tech industry have to jump through various hoops and play by the rules of the game. Even in a sport like football, where raw athleticism should be a paramount indicator of success, the measurable and quantifiable figures often don't yield the best draft results.<br /><br />In the end, whether you're a collegiate athlete whose hopes and dreams of going pro have been dashed by falling out of a draft or an engineer that didn't make the cut at a leading technology company, rejection really doesn't mean much.Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-50824237311895525862010-12-23T20:00:00.000-08:002016-03-24T22:15:59.530-07:00Hacking The Padlock<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_odeQu6U_EYw/TRR5H2N7uUI/AAAAAAAABEk/PF5TceJ53Qo/s1600/11964887.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://2.bp.blogspot.com/_odeQu6U_EYw/TRR5H2N7uUI/AAAAAAAABEk/PF5TceJ53Qo/s200/11964887.jpg" width="200" /></a></div>While I was at the gym a few weeks ago, a couple pretty silly thoughts came to me that I thought were interesting enough (at least to me) to jot down. Most gyms provide lockers for personal storage, sans the actual padlocks. As I turned and spun my trusty old Master Lock combination padlock, I realized that these padlocks are actually <b>97.5%</b> less secure than they're made out to be.<br /><br />From what I can tell, the common padlocks used by most people are more or less the same. These <a href="http://www.masterlock.com/school_and_institutional/">Master Lock-style combination locks</a> were also used for the hallway lockers I had during grade school. The locks have a combination of 3 numbers, and the sequence of actions required to unlock the lock are: spin the lock clockwise past the first number 3 times, spin the lock counter-clockwise past the second number 2 times, and finally spin the lock clockwise directly to the final number.<br /><br /><a name='more'></a>However, one trick I've learned over many years of using hallway lockers and gym lockers is that as you spin the lock towards the final number, you can simply pull down on the lock as if to open it. As you spin towards the final number, the lock will loosen and eventually open as you pass over the number. The final number in the combination is actually useless. I (and I'm sure many of my classmates and fellow gym-goers) have known this for a long time, but nonetheless this means that only 2 of the 3 numbers in a padlock combination are needed to open a lock. Thus, you only need to go through 40^2 = 1600 combinations in order to brute force a lock, as opposed to 40^3 = 64000 combinations.<br /><br />Of course, there are <a href="http://www.wikihow.com/Crack-a-%22Master-Lock%22-Combination-Lock">better ways of cracking a padlock</a> than a brute force approach. Nonetheless, the fact that there are 62,400 (97.5%) fewer combinations than "advertised" isn't immediately apparent. Perhaps newer generations of Master Lock padlocks will do away with the "shortcut" mechanism for the final spin in the combination, but who knows. I would be both impressed and amused if I saw a guy hunched over my lock at the gym, crossing off numbers on a list as he worked his way through 1600 combinations.Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-44257758339195286892010-10-24T14:56:00.000-07:002011-03-08T14:17:42.022-08:00Doublets - A Word Puzzle Game<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/_odeQu6U_EYw/TMNfKJmy3WI/AAAAAAAAAwg/aVlRTvNdWgg/s1600/images.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/_odeQu6U_EYw/TMNfKJmy3WI/AAAAAAAAAwg/aVlRTvNdWgg/s1600/images.jpg" /></a></div><a href="http://thinks.com/puzzles/doublets.htm">Doublets</a> is a word puzzle game invented by <a href="http://en.wikipedia.org/wiki/Lewis_Carroll">Lewis Carroll</a> where the objective is to turn one dictionary word into another by changing one letter at a time. At each step, one letter may be changed, and the resulting word must be a valid dictionary word.<br /><br />For example, to turn "milk" into "wine", "milk" changes into "mink", "mink" changes into "mine", and "mine" changes into "wine". Each intermediate word is a valid dictionary word. If we are given two arbitrary words, how do we determine the minimal number of intermediate steps needed to transform one word into the other?<br /><br /><a name='more'></a>Since each transformation is a transition between two valid dictionary words, we need to first figure out a way to generate all the possible words that a given word can turn into by changing only one letter at a time. We can simply go through each character in the word and replace it with each of the 26 character; each changed word should be checked against the dictionary to ensure that it is a valid word.<br /><br /><pre class="brush: ruby"># Assume we're given a hash of words as the dictionary<br />def edits(word, dictionary)<br /> edits = {}<br /> for i in 0..word.length-1<br /> # Assuming the word is lowercase, go through each character<br /> # in the alphabet<br /> 'a'.upto('z') do |c|<br /> changed = word[0, i] + c + word[i + 1, word.length - 1]<br /> edits[changed] = true if dictionary[changed]<br /> end<br /> end<br /> return edits.keys<br />end<br /></pre><br />The transitions between words can be represented in the form of a graph, where each vertex represents a word and each edge represents a valid transition from one dictionary word to another by changing one letter. Thus, the problem of solving a doublets puzzle boils down to finding the shortest path between the two given words. Since all the edges represent one letter transitions, they are effectively unweighted. Using <b>breadth-first search</b> to search for the word we want to transition to, we are guaranteed to find a solution with the least number of transitions, although there may be alternate solutions with the same number of transitions. Since <a href="http://en.wikipedia.org/wiki/Breadth-first_search">breadth-first search</a> examines vertices in order of depth - which also happens to be distance from the starting vertex - the algorithm will never prematurely find the end vertex before traversing a path to the vertex that is shorter.<br /><br />We can see a portion of the graph of transitions for our example of turning "milk" into "wine".<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_odeQu6U_EYw/TMNc2SEurGI/AAAAAAAAAwc/U-3Zc7jpK5Q/s1600/get_preview.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="http://1.bp.blogspot.com/_odeQu6U_EYw/TMNc2SEurGI/AAAAAAAAAwc/U-3Zc7jpK5Q/s400/get_preview.png" width="400" /></a></div><br />During the traversal, visited vertices will be marked so that they are not examined again. A vertex <span style="font-family: "Courier New",Courier,monospace;">u</span> will be set as the predecessor of its neighboring vertices <span style="font-family: "Courier New",Courier,monospace;">v</span> so that we can retrace the word transitions once the traversal terminates. Since the neighbors of each vertex is the set of valid words it can transform into with one-letter changes, we can call the function we defined earlier for each vertex we encounter.<br /><br /><pre class="brush: ruby">def bfs(start, finish, dictionary)<br /> visited = {start => true}<br /><br /> # Keep track of predecessors in the traversal<br /> pred = {}<br /><br /> # FIFO queue for vertices<br /> queue = [start]<br /> while !queue.empty?<br /> u = queue.shift<br /><br /> # Terminate once we've found the finish word<br /> break if u == finish<br /><br /> # Visit all the neighboring words of u<br /> edits(u, dictionary).each do |v|<br /> unless visited[v]<br /> pred[v] = u<br /> queue.push(v)<br /> visited[v] = true<br /> end<br /> end<br /> end<br /> return pred<br />end<br /></pre><br />By dynamically generating a vertex's neighbors, the algorithm doesn't need to go through each word in the dictionary in advance and precompute the graph by generating all of the possible transitions. In the worst case, breadth-first search will examine every vertex and every edge as it traverses the graph. Generating the neighboring transitions for a word requires going through each character in the word, and going through the entire alphabet for each of the characters. Since the alphabet size is constant, the time complexity is linear, or <span style="font-family: "Courier New",Courier,monospace;">O(n)</span> where <span style="font-family: "Courier New",Courier,monospace;">n</span> is the length of the word. Since our algorithm will generate transitions for every vertex encountered, the time complexity is <span style="font-family: "Courier New",Courier,monospace;">O(n|V| + |E|)</span>.<br /><br />When all is said and done, we can put together a program to solve any doublets puzzle. Let's assume that we have access to a text file of dictionary words, and the text file will be passed in as the an argument to the program.<br /><br /><pre class="brush: ruby">#!/usr/bin/ruby<br />file = File.new(ARGV[0], 'r')<br /><br /># Convert all dictionary words into lowercase, and store<br /># them in a hash<br />dictionary = {}<br />while line = file.gets<br /> dictionary[line.strip.downcase] = true<br />end<br /><br />start = 'milk'<br />finish = 'wine'<br />pred = bfs(start, finish, dictionary)<br /><br /># Output the words by retracing the path<br />while !pred[finish].nil?<br /> puts finish<br /> finish = pred[finish]<br />end<br />puts start<br /></pre>Kelvinnoreply@blogger.com1tag:blogger.com,1999:blog-3382199152068943986.post-55691456294086965982010-10-18T10:03:00.000-07:002011-03-08T14:21:16.319-08:00Currency Arbitrage In 99 Lines of Ruby<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_odeQu6U_EYw/TLJjpc5am-I/AAAAAAAAAwI/F6rzQORl8uM/s1600/currency+exchange.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="231" src="http://1.bp.blogspot.com/_odeQu6U_EYw/TLJjpc5am-I/AAAAAAAAAwI/F6rzQORl8uM/s320/currency+exchange.jpg" width="320" /></a></div>A few months ago, I had the crazy idea of writing some code to automatically detect currency <a href="http://en.wikipedia.org/wiki/Arbitrage">arbitrage</a> given current foreign exchange rates. The brilliant plan was to (1) develop this evil-genius program (2) unleash it upon the market (3) and profit. Okay, not really, but the thought did cross my mind when for some unknown reason I was reminiscing about homework problems from my introductory data structures class. One of the problems involved looking for currency arbitrage in a graph of exchange rates, albeit it was a simplified exercise in depth-first search. The actual problem involves looking for a sequence of currency exchanges (a path in the graph) that starts at a given currency and ends up with more units of that currency than we started with. This problem is straight out of <a href="http://www.amazon.com/gp/product/0262033844?ie=UTF8&tag=kelvinjiang-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0262033844">CLRS</a>, and there are various solutions to it floating around on the web. Nevertheless, I decided to give it a go and see if arbitrage situations can really be found using real-world exchange rates.<br /><br /><a name='more'></a>Foreign exchange rates can be represented by a graph where each currency type is a vertex, and directed edges between vertices indicate the exchange rate. That is, weight <span style="font-family: 'Courier New', Courier, monospace;">w(u, v)</span> for edge <span style="font-family: 'Courier New', Courier, monospace;">(u, v)</span> means that one unit of currency <span style="font-family: 'Courier New', Courier, monospace;">u</span> buys <span style="font-family: 'Courier New', Courier, monospace;">w(u, v)</span> units of currency <span style="font-family: 'Courier New', Courier, monospace;">v</span>. Since the goal is to simply end up with more money than we started with, we want to look for cycles in the graph where the product of the edge weights (that is, the product of all the exchange rates) is greater than 1. The first key observation is the product of edge weights can be reduced to a sum of the logarithm of the edge weights. Once we're dealing with sums, the problem is reduced to searching for positive-weight cycles in the graph. By inverting the sums to look for negative-weight cycles instead, we can use the <a href="http://en.wikipedia.org/wiki/Bellman-ford">Bellman-Ford algorithm</a>.<br /><br />To start, we'll read the currency exchange rates from <span style="font-family: 'Courier New', Courier, monospace;">STDIN</span>. To make this exercise more applicable to the real world, we should include a fixed transaction cost for each exchange rate. Later on, we'll see how this affects the outcomes. <br /><br /><pre class="brush: ruby">#!/usr/bin/ruby<br />TRANSACTION_COST = 0.00001<br /><br /># Initialize graph from curreny exchange rates<br />rates = {}<br />graph = {}<br /><br />STDIN.read.split("\n").each do |line|<br /> c_1, r, c_2 = line.split(' ')<br /><br /> r = r.to_f * (1.0 - TRANSACTION_COST)<br /> w = Math.log10(r)<br /><br /> graph[c_1] ||= {}<br /> graph[c_1][c_2] = -1.0 * w<br /><br /> rates[c_1] ||= {}<br /> rates[c_1][c_2] = r<br />end<br /></pre><br />Since the Bellman-Ford algorithm computes shortest path values for a single source in a graph, we can add a dummy "root" vertex to the graph and use it as the source. By adding a zero-weight edge between the root vertex and all other vertices, we ensure that no new negative-weight cycles are introduced.<br /><br /><pre class="brush: ruby"># Add a dummy root node<br />root = 'Root'<br />graph[root] = {}<br />graph.each_key { |v| graph[root][v] = 0.0 }<br /><br /># Initialize distances and predecessors<br />dist = {}<br />pred = {}<br />graph.each_key { |v| dist[v] = Float::MAX }<br />dist[root] = 0<br /></pre><br />The algorithm works by "relaxing" each edge in the graph <span style="font-family: 'Courier New', Courier, monospace;">n - 1</span> times, where <span style="font-family: 'Courier New', Courier, monospace;">n</span> is the number of vertices. It has been shown that after <span style="font-family: 'Courier New', Courier, monospace;">n - 1</span> relaxations on each edge, the shortest paths to each vertex will be found. During all iterations of the algorithm, an estimate of the cost of the shortest path from the source to each vertex is kept. When an edge is relaxed, the algorithm checks whether or not the shortest path estimate can be improved by going through the edge, and updates the estimate accordingly.<br /><br /><pre class="brush: ruby"># Relax every edge n - 1 times<br />(graph.keys.size - 1).times do<br /> graph.each do |v_1, e|<br /> e.each do |v_2, w|<br /> if dist[v_2] > dist[v_1] + w<br /> dist[v_2] = dist[v_1] + w<br /> pred[v_2] = v_1<br /> end<br /> end<br /> end<br />end<br /></pre><br />Intuitively, it makes more sense to think of the relaxation in terms of a rubber band wrapped around nobs on a surface, which are the vertices. As the initial estimate for a path is high, the rubber band is stretched and represents a high upper bound on the shortest path. When the rubber band is "relaxed," the worst case shortest path estimate gets smaller; eventually the optimal path value is found.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/_odeQu6U_EYw/TLvQBBImQ1I/AAAAAAAAAwY/lOPOD_qP6qY/s1600/stretch-shelf.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="233" src="http://4.bp.blogspot.com/_odeQu6U_EYw/TLvQBBImQ1I/AAAAAAAAAwY/lOPOD_qP6qY/s320/stretch-shelf.jpg" width="320" /></a></div><br />The main reason for using the Bellman-Ford algorithm is to detect negative-weight cycles. In a graph without negative-weight cycles, all calculated shortest path values will be optimal after <span style="font-family: 'Courier New', Courier, monospace;">n - 1</span> iterations. For any vertices in a negative-weight cycles, the shortest path is actually undefined, since the cycle can be traversed an arbitrary number of times in order to decrease the path cost. Thus, to detect negative-weight cycles, we can simply attempt to relax every edge one more time. If any edges are relaxed (i.e. the shortest path estimate for one of the vertices improves), that means there's a negative weight cycle. We can also use this opportunity to keep track of the actual vertices in these cycles for later use.<br /><br /><pre class="brush: ruby"># Relax every edge again to find negative-weight cycles<br />arbitrage = false<br />cyclic = {}<br />graph.each do |v_1, e|<br /> e.each do |v_2, w|<br /> if dist[v_2] > dist[v_1] + w<br /> arbitrage = true<br /> dist[v_2] = dist[v_1] + w<br /><br /> # Keep track of vertices in negative-weight cycles<br /> cyclic[v_2] = true<br /> end<br /> end<br />end<br /><br />if !arbitrage<br /> puts "No arbitrage found."<br /> exit<br />end<br /></pre><br />Once we have a list of the vertices in negative-weight cycles, we can go through each one and calculate the arbitrage sequences required to make some money. Note that as we relaxed each edge, we also stored the predecessor vertices. We can now use this information to backtrack from a vertex in a cycle and find the remaining vertices in the cycle. Since these paths are cyclic, we'll also need to keep track of visited vertices.<br /><br />To calculate the amount of money made by making currency exchange transactions in each sequence, we simply multiply all the rates together to get the amount of money we end up with in units of the initial currency.<br /><br /><pre class="brush: ruby"># Calculate the arbitrage sequences<br />sequences = []<br />cyclic.each_key do |v|<br /> # Recursively visit predecessors to trace vertices in cycle<br /> visited = {v => true}<br /> seq = []<br /> p = v<br /> begin<br /> seq.push(p)<br /> visited[p] = true<br /> p = pred[p]<br /> end while !p.nil? && !visited[p]<br /> seq.reverse!.push(seq.first)<br /><br /> # Calculate the arbitrage amount<br /> val = (0..seq.size - 2).inject(1.0) do |v, i|<br /> v * rates[seq[i]][seq[i+1]]<br /> end<br /> sequences.push({:currencies => seq, :value => val})<br />end<br /></pre><br />Finally, it's simply a matter of sorting the sequences by their potential profit gains.<br /><br /><pre class="brush: ruby"># Output the sequences in descending order of value<br />puts "Arbitrage sequences:"<br />sequences.sort! { |a, b| b[:value] <=> a[:value] }<br />sequences.each do |s|<br /> break if s[:value] <= 1.0<br /> puts ("%.14f " % s[:value].to_s) + s[:currencies].inspect.to_s<br />end<br /></pre><br />Pretty straightforward, right? The interesting results are revealed when you actually apply the algorithm to actual market exchange rates. I opted to write a simple script to scrape the exchange rates off of <a href="http://www.xe.com/">XE.com</a>. All it takes to gather the various exchange rates are a few <a href="http://mechanize.rubyforge.org/mechanize/">Mechanize</a> calls followed by a few DOM traversals.<br /><br />The output of the script matches the expected input of the arbitrage program. That is, a string identifier for the first currency, a space, the decimal representation of the exchange rate, another space, then the string identifier for the second currency. A sample output may look like this:<br /><br /><pre style="background-color: #fff2cc;">USD 0.69546 EUR<br />USD 0.61482 GBP<br />USD 90.7400 JPY<br />EUR 1.43790 USD<br />EUR 0.88405 GBP<br />EUR 130.475 JPY<br />GBP 1.62650 USD<br />GBP 1.13116 EUR<br />GBP 147.589 JPY<br />JPY 0.01102 USD<br />JPY 0.00766 EUR<br />JPY 0.00678 GBP<br /></pre><br />Running the arbitrage program on the sample output, with a transaction cost of 0.00001 (1/10th of a basis point), yields:<br /><br /><pre style="background-color: #fff2cc;">Arbitrage sequences:<br />1.00063340703167 ["JPY", "GBP", "JPY"]<br />1.00063340703167 ["GBP", "JPY", "GBP"]<br />1.00062075657692 ["JPY", "GBP", "USD", "JPY"]<br />1.00061730566045 ["JPY", "GBP", "EUR", "JPY"]<br /></pre><br />According to the output, we can simply exchange Japanese Yen to British Pounds then back to Yen, and we'll be rich. Time to head down to the bank! Wait, there's got to be a catch, right? Let's see what happens when we try it on real exchange rates.<br /><br />In practice, foreign exchange transaction costs are an order of magnitude higher. As an example, we'll use the <a href="http://www.interactivebrokers.com/en/p.php?f=commission#fx-clear">commission costs from Interactive Brokers</a>, which is one basis point of the trade value. To focus solely on foreign exchange trading, we'll also ignore the exchange rates between currencies and precious metals (the rates from XE.com include metals such as silver, gold, platinum, and palladium). Taking these factors into account, here are the profitable arbitrage sequences for real exchange rates taken on October 17, 2010:<br /><br /><pre style="background-color: #fff2cc;">Arbitrage sequences:<br />1.00009725953714 ["ZAR", "JPY", "ZAR"]<br />1.00009725953714 ["JPY", "ZAR", "JPY"]<br /></pre><br />For this particular set of exchange rates, the only arbitrage opportunity lies in exchanging Japanese Yen and South African Rand. However, the gain is minuscule, at slightly less than one basis point. Additionally, if we look closer at the costs from Interactive Brokers, we'll see that there's also a minimum cost per transaction. Unless we trade at least $25,000-equivalent worth of currency each transaction, we incur costs of more than a basis point each time. Combine that with the fact that currency exchange rates are constantly changing and fluctuating makes it clear that making a sure profit by exploiting currency arbitrage isn't a trivial matter. Just to get in the game, you'll need very powerful computers, an ability to execute currency trades at a very high frequency, and a very large bankroll.Kelvinnoreply@blogger.com5tag:blogger.com,1999:blog-3382199152068943986.post-54619569337651063742010-10-13T11:15:00.000-07:002017-02-03T00:35:24.853-08:00Facebook Puzzles - User Bin Crash<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/_odeQu6U_EYw/TLFyPgQNAvI/AAAAAAAAAwE/0KFAqtSDJwk/s1600/7734_Cargo_Plane.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="135" src="https://3.bp.blogspot.com/_odeQu6U_EYw/TLFyPgQNAvI/AAAAAAAAAwE/0KFAqtSDJwk/s200/7734_Cargo_Plane.jpg" width="200" /></a></div>The <a href="http://xorswap.com/questions/163-facebook-engineering-puzzle-user-bin-crash">User Bin Crash</a> puzzle (usrbincrash) is a quick and fun puzzle to solve (note: Facebook took down the original puzzle so I'm linking to another site with the description). After reading the problem statement, it should be easily recognizable as an inverted instance of the <a href="http://en.wikipedia.org/wiki/Knapsack_problem">knapsack problem</a>. The fact that there is a weight requirement is a dead giveaway. Unlike a burglar trying to maximize the value of the stolen goods in his knapsack, you're trying to minimize the cost of the goods being thrown out of the plane. Although the knapsack problem is <a href="http://en.wikipedia.org/wiki/NP-complete">NP-complete</a>, it can be solved in a straightforward manner using dynamic programming.<br /><br />Of course, it would be no fun to simply reveal the solution. Instead, I've decided to formulate the problem in parable form with a tropical theme. Hopefully, it'll help provide a more intuitive understanding of the knapsack problem. If you've been following along with my posts about various <a href="http://www.kelvinjiang.com/search/label/facebook%20puzzles">Facebook Puzzles</a> and are interested in trying some of them out yourself, getting a good handle on dynamic programming will prove to be quite useful.<br /><br /><a name='more'></a><b>Volcano Island</b><br /><br />An ancient volcano island tribe holds an annual festival to honor and worship the volcano gods that control the fate of island. It is believed that each year, they need to sacrifice a pound of fruit to the volcano for every fish spotted in the tribal lagoon on the day of last year's festival. Being an economically and technically advanced people, the island tribe has determined an exact monetary value for each type of fruit. The tribe is also very frugal, and wants to sacrifice just enough fruit to appease the gods but wants to keep its cost down so that it can sell the rest of its fruits to other tribes. After many years of using rough calculations to figure out the amount of fruits to be sacrificed, the tribal council has finally decided to come up with an algorithm to predetermine how much and what types of fruits they should grow to prepare for the coming year.<br /><br />Given that the tribe needs to sacrifice <span style="font-family: "courier new" , "courier" , monospace;">n</span> pounds of fruits on a given year, with access to a set of different types of fruits, each with a fixed unit weight and value, determine the minimum value that needs to be sacrificed in order to meet the requisite <span style="font-family: "courier new" , "courier" , monospace;">n</span> pounds and appease the gods.<br /><br /><b>Greedo the Flawed</b><br /><br />One of the tribe members, Greedo the Flawed, thinks he has a solution. He takes inventory of the different types of fruits that the tribe can grow, and calculates their weight to cost ratio. The higher the ratio, the more efficient the fruit for sacrifice. For example, suppose a watermelon weighs 5 lbs and costs $3, and a papaya weighs 4 lbs and costs $2; watermelons can add 1.67 lbs for every $1 spent, whereas papayas can add 2 lbs for every $1 spent. Thus, papayas should be considered for sacrifice before watermelons.<br /><br /><table border="1"><tbody><tr><td>Watermelon</td><td>5 lbs</td><td>$3</td></tr><tr><td>Papaya</td><td>4 lbs</td><td>$2</td></tr></tbody></table><br />Once the ratios are calculated, Greedo will keep picking the most efficient fruit to be sacrificed until the necessary weight is met. If the total weight exceeds more than is necessary to appease the gods, Greedo will remove the fruit he most recently added, and try to add more of the second most efficient fruit to see if costs can be cut, and so on. Greedo claims that this way, the tribe will be able to sacrifice just enough fruits while keeping the costs to a minimum.<br /><br />However, the wiser tribal council members immediately see a problem with Greedo's algorithm. If the gods want 5 lbs of fruit, Greedo's algorithm would select two papayas for a weight of 8 lbs and a cost of $4; it would not replace a papaya with a watermelon, since that would be more costly. However, sacrificing just one watermelon will meet the required 5 lbs with a cost of just $3. It is apparent that Greedo the Flawed's algorithm is indeed flawed.<br /><br /><b>A Better Approach</b><br /><br />While the rest of the tribe is debating the merits of Greedo's approach, an older, much wiser tribal council member steps forward with a solution. Dyno the Pragmatic points out that the key is that when considering a weight <span style="font-family: "courier new" , "courier" , monospace;">n</span>, the optimal cost of including one particular fruit in the solution is simply the optimal cost for "<span style="font-family: "courier new" , "courier" , monospace;">n</span> minus the weight of the fruit" pounds plus the cost of the fruit itself. For example, suppose 10 lbs of fruit are needed, he can take one watermelon and subtract its weight to get 5 lbs. Assuming that he knows what the optimal cost for 5 lbs of fruit is, he simply adds the cost of a watermelon, $3. If he make this calculation for one of every fruit (since he <i>has</i> to add at least one fruit in order to arrive at weight <span style="font-family: "courier new" , "courier" , monospace;">n</span>), and take the result with the minimum cost, then he will get an optimal cost! Dyno proclaims that he can use the ancient art of <a href="http://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a> to build the solution from bottom up and arrive at the optimal solution; he quickly pulls out a stick and starts drawing symbols in the sand.<br /><br />Let's see how to implement Dyno's approach. Suppose that each type of fruit is represented by pairs of the form <span style="font-family: "courier new" , "courier" , monospace;">(weight, cost)</span> and we are given the total weight needed to appease the gods.<br /><br /><pre class="brush: ruby">def sacrifice(weight, fruits)<br /> # costs[k] = minimum cost for sacrificing k lbs of fruit<br /> costs = Array.new(weight + 1)<br /> costs[0] = 0<br /><br /> for w in 1..weight<br /> # For each weight, pick a fruit that yields us the lowest cost<br /> min = nil<br /> fruits.each do |fruit|<br /> diff = w - fruit[0]<br /> cost = (diff <= 0 ? 0 : costs[diff]) + fruit[1]<br /> min = cost if min.nil? || cost < min<br /> end<br /> costs[w] = min<br /> end<br /> return costs<br />end<br /></pre><br /><b>Optimizations</b><br /><br />Stepping away from the parable, we can see that this problem is actually a variant of the <b>unbounded knapsack problem</b>. Instead of trying to figure out the maximum value of goods that can fit in a knapsack so that the total weight is less than a given limit, the tribe is trying to figure out the minimum cost of goods that can be thrown out so that the total weight is more than a given limit. The knapsack problem is known to be <b>NP-complete</b>. For a weight <span style="font-family: "courier new" , "courier" , monospace;">W</span> and <span style="font-family: "courier new" , "courier" , monospace;">n</span> fruits, Our dynamic programming solution goes through <span style="font-family: "courier new" , "courier" , monospace;">n</span> fruits <span style="font-family: "courier new" , "courier" , monospace;">W</span> times, for a complexity of <span style="font-family: "courier new" , "courier" , monospace;">O(nW)</span>. Alas, the complexity is still <a href="http://en.wikipedia.org/wiki/Pseudo-polynomial_time">not polynomial in the length of the input</a> to the problem. Nevertheless, there are some optimizations we can make to speed up the algorithm, albeit not its complexity.<br /><br />Since we only care about the cost of the items in the knapsack, the individual item weights only have to remain in proportion to the weight limit. Thus, we can find the <b>greatest common denominator</b> (<span style="font-family: "courier new" , "courier" , monospace;">gcd</span>) of all item weights and the weight limit <span style="font-family: "courier new" , "courier" , monospace;">W</span>, and divide all the weights by this value if it's greater than one. That will cut down on the size of <span style="font-family: "courier new" , "courier" , monospace;">W</span>, thus reducing the running time. We can easily implement this optimization using Ruby's <span style="font-family: "courier new" , "courier" , monospace;">gcd</span> function for the <a href="http://ruby-doc.org/core/classes/Integer.html"><span style="font-family: "courier new" , "courier" , monospace;">Integer</span></a> class. (Note that you need to call <span style="font-family: "courier new" , "courier" , monospace;">require 'rubygems'</span> in order to use the <span style="font-family: "courier new" , "courier" , monospace;">gcd</span> function.) To find the <span style="font-family: "courier new" , "courier" , monospace;">gcd</span> of all the weights, find the <span style="font-family: "courier new" , "courier" , monospace;">gcd</span> of the first pair of values, then find the <span style="font-family: "courier new" , "courier" , monospace;">gcd</span> of that value with the next weight, and so on.<br /><br />With this optimization in hand, we can implement Dyno's simple, elegant solution. Here's what it would look like for an example case of planning for next year's sacrifice.<br /><br /><pre class="brush: ruby">#!/usr/bin/ruby<br />require 'rubygems'<br /><br />weight = 12<br />fruits = [[13, 11], [7, 5], [2, 3], [1, 10]]<br /><br />gcd = weight.gcd(fruits[0][0])<br />fruits.each { |fruit| gcd = gcd.gcd(fruit[0]) }<br />if gcd > 1<br /> weight /= gcd<br /> fruits.each { |fruit| fruit[0] /= gcd }<br />end<br /><br />result = sacrifice(weight, items)<br />puts result[weight]<br /></pre><br />If you enjoyed this post, you should follow me on Twitter <a href="http://twitter.com/cloudkj">@cloudkj</a>. If you're interested in more puzzle fun, you should check out my posts about various other puzzles, such as <a href="http://www.kelvinjiang.com/2010/09/facebook-puzzles-and-liar-liar.html">Liar Liar</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-gattaca.html">Gattaca</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-dance-battle.html">Dance Battle</a>, <a href="http://www.kelvinjiang.com/2011/02/facebook-puzzles-its-small-world.html">It's A Small World</a>, and <a href="http://www.kelvinjiang.com/2011/03/facebook-puzzles-refrigerator-madness.html">Refrigerator Madness</a>.<br /><br /><b>Food For Thought</b><br /><ul><li>How can the program be modified to output the fruits to be chosen, and the units of each fruit?</li><li>How can the algorithm be further optimized by comparing the weight/cost ratios of different fruits?</li></ul>Kelvinnoreply@blogger.com4tag:blogger.com,1999:blog-3382199152068943986.post-9126644113164153752010-10-09T10:00:00.000-07:002017-02-03T00:35:34.012-08:00Facebook Puzzles - Dance Battle<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/_odeQu6U_EYw/TKw1aGVPAgI/AAAAAAAAAwA/5AzJJQAOazw/s1600/k1x6o1.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="213" src="https://3.bp.blogspot.com/_odeQu6U_EYw/TKw1aGVPAgI/AAAAAAAAAwA/5AzJJQAOazw/s320/k1x6o1.jpg" width="320" /></a></div>As I set out to solve the <a href="http://xorswap.com/questions/164-facebook-engineering-puzzle-dance-battle">Dance Battle</a> Facebook Engineering Puzzle (dancebattle), I realized that it was really just some sort of game (note: Facebook took down the original puzzle so I'm linking to another site with the description). It had alternating turns, repeatable moves, moves represented by non-negative integers, and was <a href="http://en.wikipedia.org/wiki/Zero-sum">zero-sum</a>. After some digging around, I realized that it was just a thinly-veiled version of a game of <a href="http://en.wikipedia.org/wiki/Dominoes">dominoes</a>.<br /><br />Naturally, it seemed fitting to present the solution to the puzzle in the form of the game it was modeled after. I had never played dominoes before, but I knew what the basic game pieces looked like since dominoes sets are among some of the more popular swag that certain companies like to give away at college career fairs. Nonetheless, I exercised due diligence and did some research on the game. Thus, I present to you Dance Battle, played out in the form of dominoes.<br /><br /><a name='more'></a><b>Dominoes!</b><br /><br />A domino set consists of game pieces that are rectangular in shape, formed by joining two identically-sized, adjacent squares. Each square has a number on it, represented in the form of a die face. A traditional double-six domino set contains tiles with numbers ranging from zero to six, with a total of twenty eight unique tiles.<br /><br />Consider a simple, modified, two-player game of dominoes using the traditional double-six set. The game starts with a player making an initial move by picking any of the twenty eight tiles, then placing it into play to form the line of play. The players alternatively extend the line of play by picking one of the remaining tiles to play so that adjacent tiles have equal values in their respective squares. In the variation, the line of play can only be extended in one direction.<br /><br />For example, suppose the first player chooses to play the tile with numbers <span style="font-family: "courier new" , "courier" , monospace;">(3, 0)</span> with the tile oriented such that 0 is at the tail of the line of play. Then the next player must pick one of the remaining tiles that contains 0, and place it such that the end with 0 is adjacent to the tail of the line of play. Players alternate until no more moves can be made; at that point, the person that played the last tile wins (and the player that is stuck loses).<br /><br />Now, consider a generalized version of this variant where the numbers on the tiles can range from 0 to <span style="font-family: "courier new" , "courier" , monospace;">n</span>. Given n and a list of tiles that have already been played, let's figure out how to determine the outcome if both players play optimally.<br /><br /><b>Exploring the Game Tree</b><br /><br />In order to simulate optimal play, we need to look ahead in the game tree and determine the results of optimal play. Each level of the game tree represents the choices a player has at any point in the game. To play optimally, the player will seek the best possible result. Since the game is a two-player, zero-sum game, we can use the<b> minimax algorithm</b> to simulate optimal play.<br /><br />The <a href="http://en.wikipedia.org/wiki/Minimax">minimax algorithm</a> performs an exhaustive search of the game tree space. At each level (height) of the game tree, it takes on the actions of a player and tries to maximize the outcome amongst all possible choices; equivalently, the algorithm will minimize the outcome for the other player. This can be done by calling the algorithm recursively at each level on all of the possible moves. When no more moves can be made, we reach an end game state, and a numerical value can returned to indicate the outcome of the game.<br /><br />Before we actually implement the algorithm, we'll need to define a function that can generate the set of all possible tiles that can be played, given a list of tiles that have already been played, and the last tile that was played. A tile will be represented by a pair of integers.<br /><br /><pre class="brush: ruby">def valid_tiles(n, played, last)<br /> valid = []<br /><br /> # If no tiles have been played, all is fair game<br /> if last.nil?<br /> for i in 0..n-1<br /> for j in 0..n-1<br /> valid.push([i, j])<br /> end<br /> end<br /><br /> # Valid tiles must have a value equal to the tail<br /> # value of the last tile<br /> else<br /> for i in 0..n-1<br /> t = [last[1], i]<br /> valid.push(t) if !played[t]<br /> end<br /> end<br /><br /> return valid<br />end<br /></pre><br />Once we've defined a way for generating the set of valid moves, we can proceed with implementing the minimax algorithm. As our game of dominoes will always end up with a winner and a loser, the algorithm only has to return two values. To be consistent, we can return 1 in the case of a win for the player that goes first (a loss for the other player), and -1 in the case of a loss (a win for the other player).<br /><br /><pre class="brush: ruby">def minimax(n, max, played, last)<br /> tiles = valid_tiles(n, played, last)<br /><br /> # End game - return a value to indicate outcome<br /> return max ? -1 : 1 if tiles.empty?<br /><br /> # Best possible outcome<br /> best = max ? -2 : 2<br /><br /> tiles.each do |t|<br /> # Mark tile (and reverse, equivalent tile) as played<br /> rt = [t[1], t[0]]<br /> played[t] = played[rt] = true<br /><br /> reply = minimax(n, !max, played, t)<br /><br /> # Mark the tile as unplayed, to preserve game state<br /> # for further calls<br /> played.delete(t)<br /> played.delete(rt)<br /><br /> best = reply if (max && reply > best) || (!max && reply < best)<br /> end<br /><br /> return best<br />end<br /></pre><br />Note that in order to ensure that the first move and outcome is accepted when no moves have been explored yet, we set the initial value of the best possible outcome out of range. We now have an algorithm that will exhaustively explore the game tree space, and return the outcome of a game when both players play optimally.<br /><br />This approach will always find the correct solution. However, since every possible game state needs to examined, the complexity of the algorithm grows exponentially with the depth of the tree. Furthermore, as the value of <span style="font-family: "courier new" , "courier" , monospace;">n</span> increases, the number of possible tiles (and thus the branching factor) that can be played at each stage increases. It's easy to see that this approach will be too slow for even reasonable values of <span style="font-family: "courier new" , "courier" , monospace;">n</span>. We should be able to do better.<br /><br /><b>Pruning</b><br /><br />We can incorporate some simple techniques for pruning the game tree by stopping the tree traversal when we've encountered a state with our ideal outcome. Additionally, we can use <b>alpha-beta pruning</b> in order to cut down on the potential search space. <a href="http://en.wikipedia.org/wiki/Alpha-beta_pruning">Alpha-beta pruning</a> allows us to stop the tree traversal when we discover a move that can, at best, only result in an outcome that is no better than the best one we've found so far. These pruning techniques allow us to significantly cut down on the search space.<br /><br /><pre class="brush: ruby">def minimax_alpha_beta(n, max, played, last, alpha = -2, beta = 2)<br /> tiles = valid_tiles(n, played, last)<br /> return max ? -1 : 1 if tiles.empty?<br /><br /> best = max ? alpha : beta<br /> tiles.each do |t|<br /> rt = [t[1], t[0]]<br /> played[t] = played[rt] = true<br /><br /> reply = minimax_alpha_beta(n, !max, played, t, alpha, beta)<br /><br /> played.delete(t)<br /> played.delete(rt)<br /><br /> # We can stop once we've found the best possible outcome<br /> if max<br /> best = alpha = reply if reply > best<br /> return best if best >= 1<br /> else<br /> best = beta = reply if reply < best<br /> return best if best <= -1<br /> end<br /><br /> # Can do no better than the best outcome so far<br /> return best if alpha >= beta<br /> end<br /><br /> return best<br />end<br /></pre><br /><b>Putting It Together</b><br /><br />We can now put the functions we've defined together and simulate an optimal game of our variant of dominoes. For example, a game with <span style="font-family: "courier new" , "courier" , monospace;">n = 5</span>, and a line of play of <span style="font-family: "courier new" , "courier" , monospace;">(0, 1), (1, 3)</span> can be solved by the following program:<br /><br /><pre class="brush: ruby">#!/usr/bin/ruby<br />n = 5<br />moves = [[0, 1], [1, 3]]<br /><br />played = {}<br />moves.each { played[m] = played[[m[1], m[0]]] = true }<br />last = moves[moves.size - 1]<br />max = moves.size % 2 == 0<br /><br />result = minimax_alpha_beta(n, max, played, last)<br />puts result == 1 ? 'Player 1 wins' : 'Player 2 wins'<br /></pre><br />If you enjoyed this post, you should follow me on Twitter <a href="http://twitter.com/cloudkj">@cloudkj</a>. If you're interested in more puzzle fun, you should check out my posts about various other puzzles, such as <a href="http://www.kelvinjiang.com/2010/09/facebook-puzzles-and-liar-liar.html">Liar Liar</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-gattaca.html">Gattaca</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-dance-battle.html"></a><a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-user-bin-crash.html">User Bin Crash</a>, <a href="http://www.kelvinjiang.com/2011/02/facebook-puzzles-its-small-world.html">It's A Small World</a>, and <a href="http://www.kelvinjiang.com/2011/03/facebook-puzzles-refrigerator-madness.html">Refrigerator Madness</a>.<br /><br /><b>Food For Thought</b><br /><ul><li>How can the program be modified to output the tile that the next player should play, given a line of play?</li><li>How can the algorithm be adapted to solve other zero-sum games, such as tic-tac-toe?</li><li>How would you handle the case of a tie game?</li><li>What is the run-time complexity of the program?</li></ul>Kelvinnoreply@blogger.com0tag:blogger.com,1999:blog-3382199152068943986.post-67847913160155776392010-10-06T09:00:00.000-07:002017-02-20T21:39:55.448-08:00The Lego Minifigure Collector's Problem<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_odeQu6U_EYw/TKwby-JfMCI/AAAAAAAAAv8/zU1ysGVpI7I/s1600/lego-minifigures-series-2-8684-limited-edition-2.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="218" src="http://1.bp.blogspot.com/_odeQu6U_EYw/TKwby-JfMCI/AAAAAAAAAv8/zU1ysGVpI7I/s320/lego-minifigures-series-2-8684-limited-edition-2.jpg" width="320" /></a></div>I recently attended <a href="http://www.brickcon.org/">BrickCon</a> and had the pleasure of listening to a very entertaining talk about Lego Minifigures. For those not in the know, Lego recently released a set of <a href="http://minifigures.lego.com/en-us/Default.aspx">Lego Collectible Minifigures</a>. Apparently, the collectible minifigures were made available for only a limited time, and come in "mystery" packages that don't show the minifigures inside, just like a pack of baseball cards. <br /><br />In the Lego lovers circuit, these things are currently selling like hotcakes. The speaker at BrickCon likened the minifigures hysteria to a virus in that the collectible minifigures are like "small infectious agents." Lego fans are doing everything they can in order to collect every figure in a series.<br /><br /><a name='more'></a>At BrickCon, there were sessions specifically for minifigure trading; people have also taken note of the unique barcodes on the back of each package and are bringing in barcode printouts to stores in order to find the minifigures they're missing; there's even a <a href="http://m.brickset.com/">Lego barcode scanner for Android phones</a> that that can identify the minifigure packages. However, the clever Lego folks in Denmark are one step ahead: the next series of minifigures will have no bar codes on them.<br /><br />I was reminded of the Lego minifigures when I came across the <a href="http://en.wikipedia.org/wiki/Coupon_collector%27s_problem">coupon collector's problem</a> while I was flipping through <a href="http://www.amazon.com/gp/product/0262033844?ie=UTF8&tag=kelvinjiang-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0262033844">CLRS</a>. Lego minifigure collectors have the same dilemma; they need to know how many minifigures they need to buy before they can expect to have collected every figure in a series. It turns out that the expected number of purchases, given <span style="font-family: "Courier New",Courier,monospace;">n</span> distinct items, grows on the order of <span style="font-family: "Courier New",Courier,monospace;">n log n</span>. Since there are 16 figures in each Lego collectible minifigure series, one can expect to purchase around 55 minifigures before collecting every unique one.<br /><br />The catch is, according to the talk, Lego didn't manufacture and release the minifigures with uniform distribution. Apparently, certain minifigures like the popular <a href="http://minifigures.lego.com/en-us/Bios/Spartan%20Warrior.aspx">Spartan Warrior</a> are much more rare and in demand than other minifigures, such as the oft-ridiculed <a href="http://minifigures.lego.com/en-us/Bios/Ringmaster.aspx">Ringmaster</a>. Nevertheless, even if there was an equal probability of getting each minifigure (1/16), the expected number of minifigures one would need to purchase before collecting <a href="http://www.imdb.com/title/tt0416449/">300</a> Spartan Warriors is 4800. Touche, Lego, touche.Kelvinnoreply@blogger.com1tag:blogger.com,1999:blog-3382199152068943986.post-79395577465139741192010-10-02T12:00:00.000-07:002017-02-03T00:35:50.405-08:00Facebook Puzzles - Gattaca<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_odeQu6U_EYw/TKmLcqIdQeI/AAAAAAAAAv4/1EOsGgbNxBY/s1600/MV5BMTQ3MDMwMDA1OF5BMl5BanBnXkFtZTcwNzY2OTc1MQ@@._V1._SY314_CR15,0,214,314_.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/_odeQu6U_EYw/TKmLcqIdQeI/AAAAAAAAAv4/1EOsGgbNxBY/s1600/MV5BMTQ3MDMwMDA1OF5BMl5BanBnXkFtZTcwNzY2OTc1MQ@@._V1._SY314_CR15,0,214,314_.jpg" /></a></div>Continuing with my <a href="http://www.facebook.com/careers/puzzles.php">Facebook Engineering Puzzles</a> series, here's the writeup for <a href="http://xorswap.com/questions/152-facebook-engineering-puzzle-gattaca">Gattaca</a> (gattaca) (note: Facebook took down the original puzzle so I'm linking to another site with the description). The original problem statement is interesting in that it has a computational biology twist, but it's really a weighted <a href="http://en.wikipedia.org/wiki/Interval_scheduling">interval scheduling</a> problem in disguise. This one took me a while to recognize, even though I was barking up the right tree by looking for an optimal substructure that would lend to a dynamic programming solution. Ultimately, the key to the problem is that the order of the gene predictions is important, and is part of the solution.<br /><br />For this writeup, I reformulated the problem in a more "classic" interval scheduling manner. I think treating the DNA strand as a timeline and the gene predictions as time intervals is a good way to gain an intuitive understanding of the problem and the solution.<br /><br /><a name='more'></a><b>Party Prioritization</b><br /><br />Suppose you are the owner of the most popular dance hall in town. Everyone wants to hold their company party, school prom, or wedding at your venue, so you often receive more booking requests than you can handle. Being a tech-savvy business person, you decide to write a program to figure out the maximum profit you can make if you schedule events for your dance hall optimally.<br /><br />Each event has a start time and end time, along with a booking fee. A first stab at the problem might involve using a greedy algorithm. That is, repeatedly book the event with the highest booking fee that hasn't been booked yet, and reject any other events that have a time conflict with the booked event. However, it's easy to see that this approach quickly breaks down. A counter example would be three events: (A) one that takes the whole day and would net us a profit of $1000, (B) one that occupies the morning with a fee of $750, and (C) one that occupies the evening with a fee of $800. The greedy algorithm would automatically choose to book event C since it has the highest fee; but it's obvious that booking events B and C would make more money for the business.<br /><br /><b>Determining Conflicts</b><br /><br />We can develop a more sensible approach by recognizing some key properties of the problem. For a given event, we can always consider two solutions: one in which the event is booked, and one in which the event isn't booked. When the event is booked, we can subsequently reject all other conflicting events; otherwise, we only reject the given event and all other events are still fair game. The solution with the higher profit would then be the optimal solution.<br /><br />With that in mind, we first need to figure out how to find all the events that conflict with a given event. Suppose events are represented by tuples of the form <span style="font-family: "courier new" , "courier" , monospace;">(start, finish, fee)</span>, where start and finish are integers representing the (inclusive) start and finish times of the event, and fee is the profit to be made. If we sort the events by their finish times, we can figure out the conflicts for each event: for a given event <span style="font-family: "courier new" , "courier" , monospace;">i</span>, we can start examining all the events that finish earlier than <span style="font-family: "courier new" , "courier" , monospace;">i</span>, in order; any event <span style="font-family: "courier new" , "courier" , monospace;">j</span> that has a finish time earlier than the start time of <span style="font-family: "courier new" , "courier" , monospace;">i</span> does not conflict with <span style="font-family: "courier new" , "courier" , monospace;">i</span>. Once we find a event <span style="font-family: "courier new" , "courier" , monospace;">j</span> that meets the criteria, we know that all events with earlier finish times (that is, with index less than <span style="font-family: "courier new" , "courier" , monospace;">j</span>) do not conflict with event <span style="font-family: "courier new" , "courier" , monospace;">i</span> either.<br /><br />Thus, we can define an array <span style="font-family: "courier new" , "courier" , monospace;">compatible</span> such that <span style="font-family: "courier new" , "courier" , monospace;">compatible[i]</span> is the largest index in the sort order that does not conflict with the <span style="font-family: "courier new" , "courier" , monospace;">i</span>th event. Due to the sorted nature of the events, any event with an index smaller than <span style="font-family: "courier new" , "courier" , monospace;">compatible[i]</span> will also not conflict with the <span style="font-family: "courier new" , "courier" , monospace;">i</span>th event.<br /><br /><pre class="brush: ruby">def compatible(events)<br /> # Events conflicting with all earlier events have -1 values<br /> compatible = Array.new(events.size, -1)<br /><br /> # Assume that events are sorted by finish time<br /> for i in 0..events.size-1<br /> i_start = events[i][0]<br /><br /> # Examine events before i, in reverse sort order<br /> (i-1).downto(0) do |j|<br /> j_finish = events[j][1]<br /> if j_finish < i_start<br /> compatible[i] = j<br /> break<br /> end<br /> end<br /> end<br /> return compatible<br />end<br /></pre><br /><b>To Include Or Not To Include</b><br /><br />As determined earlier, when we examine event <span style="font-family: "courier new" , "courier" , monospace;">i</span> we only have to consider a solution including <span style="font-family: "courier new" , "courier" , monospace;">i</span> and a solution excluding <span style="font-family: "courier new" , "courier" , monospace;">i</span>. For a solution including <span style="font-family: "courier new" , "courier" , monospace;">i</span>, we need to exclude all conflicting events; equivalently, we only need to consider events up to <span style="font-family: "courier new" , "courier" , monospace;">compatible[i]</span>. Likewise, for a solution excluding <span style="font-family: "courier new" , "courier" , monospace;">i</span>, we only need to consider events up to <span style="font-family: "courier new" , "courier" , monospace;">compatible[i-1]</span>. This allows us to reuse the results of potentially overlapping subproblems, and naturally leads us to a dynamic programming solution.<br /><br />Our dynamic programming algorithm will produce an array <span style="font-family: "courier new" , "courier" , monospace;">profit</span>, where <span style="font-family: "courier new" , "courier" , monospace;">profit[i]</span> is the maximum profit we can make with an optimal schedule of events up to <span style="font-family: "courier new" , "courier" , monospace;">i</span>.<br /><br /><pre class="brush: ruby">def profit(events)<br /> # Note that profits is 1-indexed and compatible is 0-indexed<br /> compatible = compatible(events)<br /><br /> # profits[k] = maximum possible profit when scheduling up to<br /> # the kth event<br /> profits = Array.new(events.size + 1)<br /> profits[0] = 0<br /><br /> for i in 1..profits.size-1<br /> # Profit if we include event i<br /> include = events[i-1][2] + profits[compatible[i-1] + 1]<br /><br /> # Profit if we exclude event i<br /> exclude = profits[i-1]<br /><br /> profits[i] = include > exclude ? include : exclude<br /> end<br /> return profits<br />end<br /></pre><br />Suppose you've been sent eight booking requests for the same day. The start and finish times are the hours at which the events are scheduled to start and finish. The fee is in thousands of dollars. We can now run our program to find the maximum amount of profit we can make, which would be $11,000.<br /><br /><pre class="brush: ruby">#!/usr/bin/ruby<br />events = [<br /> [10, 14, 1],<br /> [14, 16, 3],<br /> [0, 4, 3],<br /> [0, 4, 4],<br /> [11, 15, 4],<br /> [8, 11, 2],<br /> [17, 18, 2],<br /> [11, 14, 4],<br />]<br />events.sort! { |a, b| a[1] <=> b[1] }<br /><br />result = profit(events)<br />puts result[events.size]<br /></pre><br />If you enjoyed this post, you should follow me on Twitter <a href="http://twitter.com/cloudkj">@cloudkj</a>. If you're interested in more puzzle fun, you should check out my posts about various other puzzles, such as <a href="http://www.kelvinjiang.com/2010/09/facebook-puzzles-and-liar-liar.html">Liar Liar</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-dance-battle.html">Dance Battle</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-user-bin-crash.html">User Bin Crash</a>, <a href="http://www.kelvinjiang.com/2011/02/facebook-puzzles-its-small-world.html">It's A Small World</a>, and <a href="http://www.kelvinjiang.com/2011/03/facebook-puzzles-refrigerator-madness.html">Refrigerator Madness</a>.<br /><br /><b>Food For Thought</b><br /><ul><li>How can the program be modified to output the events that should be scheduled to maximize profit?</li><li>What is the run-time complexity of the algorithm?</li></ul>Kelvinnoreply@blogger.com3tag:blogger.com,1999:blog-3382199152068943986.post-1102286115052895832010-09-29T16:55:00.000-07:002017-02-03T00:36:03.315-08:00Facebook Puzzles - Liar, Liar<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_odeQu6U_EYw/TKmLUYNExbI/AAAAAAAAAv0/yrYXODWZnW0/s1600/MV5BMjEzMzA3NzgwNF5BMl5BanBnXkFtZTcwNzQ4MDgyMQ@@._V1._SY314_CR3,0,214,314_.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/_odeQu6U_EYw/TKmLUYNExbI/AAAAAAAAAv0/yrYXODWZnW0/s1600/MV5BMjEzMzA3NzgwNF5BMl5BanBnXkFtZTcwNzQ4MDgyMQ@@._V1._SY314_CR3,0,214,314_.jpg" /></a></div>A few months ago, I stumbled across the <a href="http://www.facebook.com/careers/puzzles.php">Facebook Engineering Puzzles</a> page. I had come across some of the older puzzles before - I vaguely remember one about movie theater seating and another about escaping from velociraptors - but just kind of ignored them. This time around, I decided to give the <a href="http://xorswap.com/questions/149-facebook-engineering-puzzle-liar-liar">Liar, Liar</a> puzzle (liarliar) a try (note: Facebook took down the original puzzle so I'm linking to another site with the description). After I scribbled some stuff on paper, I had a very satisfying "a-ha" moment when I recognized the puzzle as a <a href="http://en.wikipedia.org/wiki/Graph_coloring">2-color graph coloring problem</a>. After a bit of review on graph coloring and bipartite graphs, I wrote up a solution, submitted it, and successfully solved the puzzle. After that, I was hooked.<br /><br />So for the next few weeks, I went on a puzzle binge, solving puzzles in between watching episodes of <a href="http://abc.go.com/shows/lost">Lost</a> on Netflix (I had just gotten hooked on the show as well; the show is puzzling in its own way). The puzzles were extremely fun and quite educational. I was able to brush up on some graph theory and dynamic programming, and learned about cool new things like <a href="http://en.wikipedia.org/wiki/Kd-tree">kd-trees</a>.<br /><br />After I solved a bunch of the puzzles, I was at a nice stopping point and had the ambitious goal of writing up the solutions and putting together some kind of programming puzzle book. I didn't get too far, but I do have a decent amount of content. I've decided to go ahead and just post the various "chapters" in the next few weeks and share them with fellow puzzle lovers. Note that for some puzzles I tried to reword the problem to give it a different flavor; nevertheless, the solutions are pretty much the same.<br /><br /><a name='more'></a>Since I had been riding the Ruby train for a while, I submitted most of my solutions in Ruby. Using Ruby helped me keep my solutions neat, compact, and easy to read. However, there were some puzzles where I ran into performance problems, but we'll sideline that for now.<br /><br />For those of you just finding out about the puzzles - take some time and try them out! Even though it's obviously a recruiting tool for Facebook, there's no downside to giving it a go. The Facebook recruiters won't start harassing you or come knocking on your door - I think it just helps if you eventually decide to apply for a job. The puzzles are just fun; and that feeling you get when you hit the "Eureka!" moment is what computer science is all about.<br /><br />Without further ado, here's the piece about <a href="http://www.facebook.com/careers/puzzles.php?puzzle_id=20">Liar, Liar</a>, which is essentially an exercise dealing with bipartite graphs. I decided to give the problem a bit of a scifi spin.<br /><br /><b>10 Types of People</b><br /><br /><i>"There are only 10 types of people in the world: those who understand binary, and those who don't."</i> This is an old mathematics joke that most programmers have heard already. Now, let's get a little more scifi and suppose that you're an inhabitant on a small planet where an understanding of binary is of utmost importance.<br /><br />One particular quirk about the citizens of this planet is that anyone who understands binary knows who else understands binary, and who does not. On the other hand, any person that does not understand binary always gets it wrong: if he or she thinks a person understands binary, then that person actually doesn't, and vice versa.<br /><br />The governing organization on this small planet performs a census every year to tabulate the level of binary understanding in the population. However, since it's a census, they have impartial data collected in a strange manner. The census workers go around asking each person for a list of people they know that do not understand binary. Each person responds with a list of people that they claim are clueless about binary. Although it is a small planet, not everyone knows everyone else, so the lists are not going to be complete.<br /><br />As a responsible citizen, you are curious as to just how many people understand binary, and how many do not. Given the impartial census data, find the size of each group of people.<br /><br /><b>A Planet Divided</b><br /><br />We can break this puzzle down by first examining an example. Suppose there are only five people on the planet: Alice, Bob, Carol, Dave, and Eve. The census data we have on hand is incomplete, and the raw data collected by the census worker looks like this:<br /><br /><pre>Alice claims Bob does not understand binary.<br />Bob claims Carol does not understand binary.<br />Dave claims Bob does not understand binary.<br />Carol claims Bob does not understand binary.<br />Eve claims Alice and Dave do not understand binary.</pre><br />Since we are dealing with a binary world where everybody either understands binary or doesn't, it seems reasonable to try to examine the data with a binary decision. We can assume two worlds: one in which Alice understands binary, and one in which Alice does not understand binary. How will these two worlds differ?<br /><br />In the first world, since Alice understands the binary system, she has a perfect knowledge of everyone else's level of understanding. Thus, her claim that Bob does not understand binary must be true. Since Bob does not understand binary, his knowledge of everyone else's level of understanding is always wrong. Thus, his claim that Carol does not understand binary must be false; Carol actually understands binary just fine. Since we've already established that Bob does not understand binary, Dave's claim in the census data is true, and therefore he must also have a perfect understanding of binary. Finally, since Eve made a bold claim that both Alice and Dave do not understand binary, which we know to be incorrect in this world, Eve must not understand binary either.<br /><br />By fixing the world on a single assumption about a person's knowledge, we can logically infer the level of understanding of all the other citizens based on the claims made in the census. At this point, we can apply the same logic to a world where Alice does not understand the binary system. When we combine the results of examining both worlds, we arrive at the the following conclusions:<br /><br /><table border="1"><tbody><tr><th>Alice</th><th>Understands</th><th>Does not understand</th></tr><tr><td>Bob</td><td>Does not understand</td><td>Understands</td></tr><tr><td>Carol</td><td>Understands</td><td>Does not understand</td></tr><tr><td>Dave</td><td>Understands</td><td>Does not understand</td></tr><tr><td>Eve</td><td>Does not understand</td><td>Understands</td></tr></tbody></table><br />In both cases, the two groups of people are of size 3 and size 2, which is the answer to this example problem. Note that we don't particularly care about how many people understand or don't understand binary, but rather how the groups are divided. In fact, we can't actually distinguish between the two groups due to the ambiguity of the data.<br /><br /><b>Binary Partition</b><br /><br />Let's visualize the raw census data in a graph. Each citizen can be represented by a vertex, and a directed edge <span style="font-family: "courier new" , "courier" , monospace;">(u, v)</span> represents the piece of census data that indicates person <span style="font-family: "courier new" , "courier" , monospace;">u</span> claimed that person <span style="font-family: "courier new" , "courier" , monospace;">v</span> does not understand binary.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_odeQu6U_EYw/TKPNXm2xKyI/AAAAAAAAAvk/eEmAEesBcwg/s1600/bipartite1.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/_odeQu6U_EYw/TKPNXm2xKyI/AAAAAAAAAvk/eEmAEesBcwg/s1600/bipartite1.gif" /></a></div><br />If we apply our previous logic to the graph, we would start a graph traversal at Alice. If we assume Alice understands binary, then we know that everyone Alice mentions on the census does not understand binary; likewise, if we assume Alice doesn't understand binary, everyone Alice mentions on the census does understand binary. In terms of the graph, this is equivalent to marking Alice as being in one group, and all of Alice's neighboring vertices as in the other group. If we do this by coloring each vertex, it's easy to see that this problem becomes a <b>graph coloring</b> problem with two colors. That is, the problem of assigning each vertex one of two colors such that no two adjacent vertices share the same color.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_odeQu6U_EYw/TKPNqKJtjfI/AAAAAAAAAvs/xR9RpDvYSCg/s1600/bipartite2.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/_odeQu6U_EYw/TKPNqKJtjfI/AAAAAAAAAvs/xR9RpDvYSCg/s1600/bipartite2.gif" /></a></div><br />In fact, if we group the different colored vertices together visually, we'll see that all the edges connect vertices in one group to vertices in the other group, which is also known as a <b>bipartite graph</b>. It turns out that all the graphs in this problem will be bipartite because of the binary division in the citizens and the relationships within the population itself.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_odeQu6U_EYw/TKPNp823jcI/AAAAAAAAAvo/Zs-ohx6Ud58/s1600/bipartite3.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/_odeQu6U_EYw/TKPNp823jcI/AAAAAAAAAvo/Zs-ohx6Ud58/s1600/bipartite3.gif" /></a></div><br />Now that we've established the fact that we're dealing exclusively with bipartite graphs, we can simply use <b>breadth-first search</b> to find the size of each set of vertices. We can represent the graph of raw census data with an <a href="http://en.wikipedia.org/wiki/Adjacency_matrix">adjacency matrix</a>, where the keys are the names of the citizens and a value represents that there's an edge from the first citizen to the second.<br /><br />Our breadth-first search algorithm will start at an arbitrary vertex, mark it to be in one group, and then mark all its unvisited neighbors to be in another group. Since neighboring vertices cannot be in the same group (have the same color), directed edges are no longer relevant; we can consider all edges to be undirected. For this exercise, we'll make the assumption that the data always forms a connected graph, so we don't have to worry about unreachable vertices. Thus, once the traversal terminates, every vertex will have been visited and marked.<br /><br /><pre class="brush: ruby">def bfs(graph, start)<br /> # Keep track of visited vertices<br /> visited = {start => true}<br /><br /> # Group vertices into two groups<br /> group1 = {start => true}<br /> group2 = {}<br /><br /> # Maintain a FIFO queue for traversal<br /> queue = [start]<br /> while !queue.empty?<br /> u = queue.shift<br /> graph[u].keys.each do |v|<br /> unless visited[v]<br /> visited[v] = true<br /><br /> # Mark the vertex to be in different group from neighbor<br /> group1[v] = true if group2[u]<br /> group2[v] = true if group1[u]<br /> queue.push(v)<br /> end<br /> end<br /> end<br /> return [group1, group2]<br />end<br /></pre><br />Putting together a program to solve our example problem is then simply a matter of creating the adjacency matrix from the census data. Once all the vertices in the graph have been marked, the program outputs the sizes of each group of vertices.<br /><br /><pre class="brush: ruby">#!/usr/bin/ruby<br />data = {<br /> 'Alice' => { 'Bob' => true },<br /> 'Bob' => { 'Carol' => true },<br /> 'Dave' => { 'Bob' => true },<br /> 'Carol' => { 'Bob' => true },<br /> 'Eve' => {<br /> 'Alice' => true,<br /> 'Dave' => true,<br /> },<br />}<br /><br /># Treat edges as undirected<br />start = nil<br />graph = {}<br />data.each_key do |u|<br /> start = u<br /> graph[u] ||= {}<br /> data[u].each_key do |v|<br /> graph[v] ||= {}<br /> graph[u][v] = graph[v][u] = true<br /> end<br />end<br /><br />p1, p2 = bfs(graph, start)<br />puts "#{p1.size}, #{p2.size}"<br /></pre><br />If you enjoyed this post, you should follow me on Twitter <a href="http://twitter.com/cloudkj">@cloudkj</a>. If you're interested in more puzzle fun, you should check out my posts about various other puzzles, such as <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-gattaca.html">Gattaca</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-dance-battle.html">Dance Battle</a>, <a href="http://www.kelvinjiang.com/2010/10/facebook-puzzles-user-bin-crash.html">User Bin Crash</a>, <a href="http://www.kelvinjiang.com/2011/02/facebook-puzzles-its-small-world.html">It's A Small World</a>, and <a href="http://www.kelvinjiang.com/2011/03/facebook-puzzles-refrigerator-madness.html">Refrigerator Madness</a>.<br /><br /><b>Food For Thought</b><br /><ul><li>Can the problem also be solved using depth-first search?</li></ul>Kelvinnoreply@blogger.com13tag:blogger.com,1999:blog-3382199152068943986.post-31039978754220262602010-09-27T17:29:00.000-07:002010-10-04T01:14:05.292-07:00Founder Patterns<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_odeQu6U_EYw/TKmLE4U6M2I/AAAAAAAAAvw/v34m0cNnCwg/s1600/founding-fathers.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="131" src="http://2.bp.blogspot.com/_odeQu6U_EYw/TKmLE4U6M2I/AAAAAAAAAvw/v34m0cNnCwg/s200/founding-fathers.jpg" width="200" /></a></div>A few weeks ago I ventured into the depths of Redmond in search of ingredients for making pickled pork noodle soup. There also happened to be a Goodwill store next door. After perusing the used books section, I came across a pristine copy of <a href="http://www.amazon.com/gp/product/1430210788?ie=UTF8&tag=kelvinjiang-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1430210788">Founders At Work</a> for less than a can of pickled radish, which was a pretty sweet deal. I had heard of and read about the book from various sources, but never added it to my reading list. I'm now more than half way through the book, and I must say it is an enjoyable and worthy read. I've definitely gotten to the point where I can spot some patterns that seem to be common amongst successful start-up founders. At least, patterns which were not as apparent to me before.<br /><br /><a name='more'></a><b>Founders work on stuff they want to use</b><br /><br />I guess this pattern is pretty apparent. Lots of great products and companies have sprung out of an engineer's desire to do something faster or simpler. They simply scratched their own itch, spent some personal time building out tools for themselves to use. Over time, the tools began to demonstrate value for other people as well. This is a pretty common pattern. Some of the start-ups that originated out of this pattern include Delicious, ONElist, Bloglines, craigslist, and many more.<br /><br />If anything, this is a pattern that should encourage hackers, builders, engineers, and any creative types to keep making things. I've definitely been guilty of building things that I think people want, and just kind of set out doing it blindly without really thinking about its utility. People say that a good way to conceive and validate a product idea is to find ten people that say they will use it, or better yet pay for it. Working on a tool or service for your own needs brings you 10% of the way along that path to validation.<br /><br />Although it seems like common sense, I think this pattern occurs a lot because it steers people away from coming up with ideas that they think will work, when they themselves have no need for it or even exposure to the same environment. For example, I've had friends pitch me ideas about products that they think would be cool, like iPad apps for doctors. Logically, it certainly seems like something that can be of value, but neither my friend or I are doctors and we don't really know what goes on in a doctor's daily life.<br /><br /><b>Founders talk about their ideas before working on them</b><br /><br />One of the issues you immediately face after building something is the problem of getting people to use the thing you built. One pattern I recognized from the book, and from other stories I've been reading about successful products that have gained traction, is that often times the people build an audience before they build a product. This is the way 37signals evolved from a service company into a product company. Their service and consultancy expertise allowed them to build a wide blog following and an audience on the web. Since they already had followers that were interested in what they were doing, it was an easier step into turning some of that expertise into a product. Once the product was built, they immediately had an audience they could present it to.<br /><br />It also leads back to some Internet marketing fundamentals that I've read about in <a href="http://www.amazon.com/gp/product/0307465357?ie=UTF8&tag=kelvinjiang-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0307465357">The Four Hour Work Week</a>. Whether you're selling an e-book, a web service, or downloadable software, driving people to your product page and gauging their interest before you actually start producing your product is a sound approach. Having a skeleton sign-up page fulfills the minimally viable product idea and also helps build the initial audience you need once you're ready to launch. Blogging seems to be an even better forum, since you can easily throw back of the envelope ideas out there and see how people react to it. After all, as the legend goes, Groupon started out as a simple Wordpress blog and grew from there.<br /><br /><b>Founders project out the future and try to position themselves to meet it</b><br /><br />Predicting the future is hard. When it comes to things like financial predictions, I take a pretty conservative approach and stick with evaluating fundamentals. Projected earnings or revenue forecasts are usually just numbers on a page, and don't mean much. When it comes to building things, I've had a similar mentality drilled into me. On more than one occasion, Jeff Bezos has paraphrased Alan Kay at various all-hands meetings, reminding employees that it's easier to invent the future than to prevent it. On the surface, it definitely seems like a good mantra to stick by.<br /><br />However, a pattern I observed in the book is that some successful founders are able to avoid predicting, but instead look forward and anticipate changes that are probably going to occur. Technological projections aren't really that hard to make. I think it's pretty safe to say that storage costs are going to get cheaper, computing costs are going to get cheaper, people are going to own more computing devices, etc. Companies like Iris and Adobe grew while anticipating the surge in adoption of desktop computers. Even in hindsight, it doesn't seem like it would've been a far-fetched projection to tout the utility of desktop computers as being a driving factor in bringing computers into homes and offices around the world. Those companies were able to make pretty humble projections about trends that were on-going, and position themselves so that they can build useful products at a time when people would start to need them.<br /><br />I think it's an interesting way to go about conceptualizing ideas, and one that I seem to have been trained to stay away from. Predictions are hard to make, yes, but logical projections of trends seem to be slightly different and safer to make. Instead of figuring out ways to build things in the ecosystem that exists today, an interesting exercise would be to try to project trends out into the future. Mobile is a big space, but where is it going to be in a few years? Connectivity should get better, not worse. Devices should become more powerful, not less. Working on ideas that will coincide with impending changes that will most likely happen seems to have worked out for various successful founders.Kelvinnoreply@blogger.com1