## Wednesday, June 8, 2011

### Data Mining the NBA - Players Most Similar to Michael Jordan

Professional sports are chock-full of numbers and statistics, which are possibly the most widely consumed and readily available mass data sets. For an aspiring data geek and longtime sports fan, applying data mining techniques to sports is a fun way to play with various approaches and potentially discover new things from data. Lucky for us, many of these data sets are available online and can be downloaded for free. For NBA data, we can use the great data set provided by databaseBasketball.com, which includes comprehensive statistics for the NBA and the ABA from inception until 2009.

When it comes to basketball, the most frequently debated question is who the greatest player of all time is. In practice, though, there's usually little dissent.

Ask almost any fan, analyst, player, coach, or anyone who hasn't been living under a rock for the last twenty years, and they'll likely say it's Michael Jordan. With Jordan's career in the record books and his legacy cemented in the form of six NBA titles, today's debates circle around who is the heir to the throne of His Airness. One data mining technique that can throw some statistical weight into the debate is similarity: finding the players who are the most statistically similar to Michael Jordan. That may give us some good insight into who is most Like Mike.

#### Euclidean Distance

There are many different measures of similarity. For starters, we're most interested in similarity measures that can compare vectors of real numbers. Since we have player statistics at our disposal, it seems most natural to use career stats such as points, rebounds, and assists to perform the comparisons. To start, we can use a basic, intuitive measure: the Euclidean distance between two vectors of player statistics, where a smaller distance means more similar players. To normalize the comparisons, we'll take all statistics at a per-game level. Additionally, we'll strive for completeness and compare all statistical categories provided in the dataset, which include points, rebounds (offensive and defensive), assists, steals, blocks, turnovers, fouls, field goals (attempted and made), free throws (attempted and made), and three-pointers (attempted and made).
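As a minimal sketch of the idea, the snippet below converts career totals to per-game averages and computes the Euclidean distance between two stat vectors. The helper names `per_game` and `euclidean_distance` and the numbers are made up for illustration; they are not taken from the databaseBasketball.com data set, and only three categories are shown instead of the full list.

```python
import math


def per_game(totals, games_played):
    """Convert career totals to per-game averages."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return [total / games_played for total in totals]


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length stat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up career totals (points, rebounds, assists), not real player data.
player_a = per_game([3000, 600, 500], games_played=100)    # [30.0, 6.0, 5.0]
player_b = per_game([5400, 1600, 1400], games_played=200)  # [27.0, 8.0, 7.0]

# Differences are 3, -2 and -2, so the distance is sqrt(9 + 4 + 4) = sqrt(17).
print(euclidean_distance(player_a, player_b))
```

Dividing by games played first keeps a long career from looking "far away" from a short one simply because the totals are larger.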

This makes our task of finding the players most similar to Michael Jordan quite simple: we compute the distance between his career statistics and those of every other player in our dataset, active or inactive. The players with the smallest distances are the players most similar to Michael Jordan. Using the Euclidean distance, the most similar players we end up with are:

```
LeBron James 5.0118536393989
Allen Iverson 5.65221580572828
Kobe Bryant 7.32620244207667
Rick Barry 7.40270212447083
Dominique Wilkins 7.41171127109192
George Gervin 7.51859410948606
Jerry West 7.91375182383501
Pete Maravich 8.13164773279095
Carmelo Anthony 8.15452017320458
```
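A ranking like the one above could be produced along the following lines. This is only a sketch under stated assumptions: the `most_similar` helper, the dictionary layout, and the player labels and numbers are hypothetical, not the code or data actually used for the list.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length stat vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target_name, stats_by_player, top_n=10):
    """Return (name, distance) pairs for the players closest to the target."""
    target = stats_by_player[target_name]
    distances = [
        (name, euclidean_distance(target, stats))
        for name, stats in stats_by_player.items()
        if name != target_name  # skip the target itself (distance 0)
    ]
    # Smallest distance first: a smaller distance means a more similar player.
    return sorted(distances, key=lambda pair: pair[1])[:top_n]


# Hypothetical per-game vectors (points, rebounds, assists); not real data.
stats_by_player = {
    "target": [30.0, 6.0, 5.0],
    "scorer": [27.0, 7.0, 7.0],
    "big_man": [15.0, 12.0, 2.0],
    "playmaker": [12.0, 3.0, 10.0],
}

ranking = most_similar("target", stats_by_player, top_n=3)
for name, distance in ranking:
    print(f"{name} {distance:.3f}")
```

With these toy numbers the high-scoring wing comes out closest to the target, mostly because points per game is the largest number in each vector and therefore dominates the distance.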

#### Cosine Similarity

Although the Euclidean distance is a very basic measurement, it actually works very well for our use case. Another measurement of similarity is the cosine similarity. Cosine similarity is different in that it measures the cosine of the angle between the two vectors, ignoring their lengths. In our case, it measures how closely the ratios among one player's statistics match the ratios among another player's.

For example, suppose we only take points per game and rebounds per game into account. A player with 20 points per game and 10 rebounds per game would be considered exactly the same as a player with 10 points per game and 5 rebounds per game. This is because their points-to-rebounds ratios are identical, so the two vectors point in the same direction, the angle between them is zero, and the cosine similarity is 1. By using cosine similarity, the players we end up with are:

```
Rolando Blackman 0.998534495549499
Kelly Tripucka 0.997919081452981
George Gervin 0.997494409372126
David Thompson 0.997361238336493
Chris Mullin 0.997326855700903
Kiki Vandeweghe 0.997284742031856
Mark Aguirre 0.997123483334025
Monta Ellis 0.997102936964528
Ronnie Brewer 0.996697911841195
Chris Douglas-Roberts 0.996676745860137
```

As you can see, cosine similarity yields dramatically different results. In our case, we are trying to measure statistical similarity and use it as a gauge of performance and effectiveness. By only comparing the relative ratios of statistics, we lose information about the absolute numbers. However, cosine similarity may be useful for other purposes, such as for clustering players based on the positions they played and their roles. For example, in the list above, all of the players played at either the shooting guard or small forward position, and have similar heights, which would explain why they have similar statistical ratios.
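To make the loss of absolute information concrete, here is a small sketch comparing the two measures on the 20/10 versus 10/5 example from earlier. The function names and the `star`/`reserve` labels are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# (points per game, rebounds per game) for two hypothetical players.
star = [20.0, 10.0]
reserve = [10.0, 5.0]

# Same direction, so the cosine similarity is 1 (up to floating-point error).
print(cosine_similarity(star, reserve))

# The absolute gap is still visible to the Euclidean distance: sqrt(125).
print(euclidean_distance(star, reserve))
```

Cosine similarity treats the two players as identical, while the Euclidean distance still separates the player who produces twice as much.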

#### Pearson Correlation

Another popular similarity measure is the Pearson correlation. An intuitive, geometric interpretation is that it measures how well a straight line fits a scatter plot of one player's statistics against another's, with one point per statistical category. Two players with identical statistics would have every point lying exactly on the best-fit line. As players differ more and more, the points drift farther away from that line.
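A rough sketch of the calculation is shown below. The `pearson_correlation` helper and the three stat vectors are hypothetical and chosen only to show the behavior; they are not drawn from the data set.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two vectors of the same length (at least 2)")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Center each vector on its own mean before comparing.
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    covariance = sum(x * y for x, y in zip(centered_a, centered_b))
    var_a = sum(x * x for x in centered_a)
    var_b = sum(y * y for y in centered_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_a * var_b)


# Hypothetical per-game stats (points, rebounds, assists, steals).
player_a = [30.0, 6.0, 5.0, 2.0]
player_b = [15.0, 3.0, 2.5, 1.0]  # exactly half of player_a in every category
player_c = [8.0, 12.0, 2.0, 1.0]  # a different statistical profile

r_ab = pearson_correlation(player_a, player_b)
r_ac = pearson_correlation(player_a, player_c)
print(r_ab)  # essentially 1: every point lies on one straight line
print(r_ac)  # noticeably lower: the points scatter around the line
```

Note that `player_b` produces half as much as `player_a` in every category yet correlates perfectly with him, so the Pearson correlation, like cosine similarity, says nothing about absolute output.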

With the Pearson correlation, we get these similar players:

```
Rolando Blackman 0.998156576027356
Kiki Vandeweghe 0.996956795013002
Kelly Tripucka 0.996522611284401
Chris Mullin 0.996417918887698
George Gervin 0.996039073475534
Monta Ellis 0.995632004734591
David Thompson 0.995578688988299
Ronnie Brewer 0.995306500496774
Chris Douglas-Roberts 0.995056403935876
Mark Aguirre 0.995004915521134
```

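The results are very close to those from cosine similarity, and for similar reasons: the Pearson correlation is equivalent to the cosine similarity of the two vectors after each has had its own mean subtracted, so it also ignores absolute magnitudes.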

Out of the three similarity measures we used, the basic Euclidean distance seems to work the best. Also, the data appears to support LeBron James being the most statistically similar player to Michael Jordan. Although statistics never lie, they may serve little purpose other than to add fuel to the burning debates about James' South Beach talents versus Jordan's six rings, and who will end up with the Greatest Of All Time title when all is said and done.