NBA 3.0. Draft analytics. Big data. Draft 3.0. After watching the train wreck that was the end of the Maloof era, it’s all very exciting and sounds very sexy. But what does any of that mumbo jumbo actually mean? When Dalt posts his awesome model, Hollinger and now Pelton post on ESPN, Layne Varsho does his annual model on Canis Hoopus, etc., what is actually happening? And more importantly, what can we actually learn, other than seeing a number next to a prospect’s name that has little context beyond the numbers next to other prospects’ names?
The good news is that while people like Dalt are highly skilled at their craft, understanding the basics of what they are doing and learning from their work is easy enough. The point of this post is to give a Draft 3.0 overview for the average fan, so that as I put up more data and learnings in parts 2 and 3 of this draft preview, or Pete D'Alessandro gives some jargony press conference explaining our latest pick, STR can be a savvy board and take something from what is said. To do that, I am going to start with an overview and some easy-to-understand examples of what these draft analytics are and how they are helpful, then cover their shortcomings and some common mistakes made in interpreting them (by both experts and amateurs).
Draft Analytics 101
At their core, these different draft models are looking at past information about players who are in the NBA and trying to use this information to explain differences in their performance, such as their PER or Win Score. There are a few important concepts I think will be helpful to understand up front.
Correlation: One of the foundations of this type of analysis. When you hear someone refer to the correlation between two numbers, it’s just a fancy way of describing what happens to one statistic when another one goes up or down. So imagine you have each fan rate how happy they are on a scale of 1-10. After the Kings win, those ratings get higher, so there is a positive correlation between the Kings winning and Kings fans’ happiness. If the Kings lose, it goes down, so there is a negative correlation. If Section goes to a bar and gets a pretty lady’s number, it’s nice for him, but does nothing for the rest of us (he never plays wingman, just hogs all the ladies for himself), so there is no correlation. This is important for the draft models, because people like Hollinger and Dalt are, at a very basic level, trying to see if someone’s PER or Win Score in the NBA increases as their points or assists in college increase.
Variance: This concept is important in two ways. The first is for creating the models. If there are two players who both scored 20 points per game in college, but one has a 10 PER in the NBA and one has an 18 PER, we want to know why there is a variance (or difference) between the two players. What other statistics are different between them that can help explain this? The second reason this is important is that there are different ranges for an individual player’s predicted performance. This may sound fancy, but lines up exactly with how we already talk about prospects. For example, we might be certain that Kyle Anderson will be a fantastic 6th man and potential starter, making him a very safe pick with little upside, while Aaron Gordon might be a 9th man if his offense doesn’t develop, but he also has the potential to be a superstar. Most models only tell you the expected performance for a player, but there can be big differences between their predicted floor and ceiling.
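The floor/ceiling idea can be sketched with two invented prospects who have the same expected outcome but very different spreads. Everything here is made up for illustration:

```python
# Sketch: two hypothetical prospects with identical expected (mean) outcomes
# but very different floors and ceilings -- same average, different variance.
import statistics

# Invented "possible career PER" outcomes for each prospect
safe_pick = [12, 13, 13, 14, 13]  # steady sixth-man type: narrow range
boom_bust = [6, 8, 13, 18, 20]    # could be a star, could wash out: wide range

mean_safe, mean_boom = statistics.mean(safe_pick), statistics.mean(boom_bust)
spread_safe, spread_boom = statistics.pstdev(safe_pick), statistics.pstdev(boom_bust)

print(mean_safe, mean_boom)      # identical expected value for both
print(spread_safe, spread_boom)  # but very different spread around that mean
```

A model that only reports the mean would call these two prospects identical, which is exactly why the floor/ceiling distinction matters.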
Proxy: Let’s say you love ice cream. But you love to eat it slowly and hate when it melts. So you want to predict on any day how fast your ice cream will melt. However, you don’t have access to the temperature. So you make a model with a lot of variables and the model tells you that the number of cars in the parking lot at the beach is very important in determining how fast your ice cream will melt. Now clearly, cars parking somewhere are not causing the ice cream to melt, so what is happening here? Without being able to include the temperature (the actual cause of melting), another variable that is highly correlated is taking its place. Since more cars park at the beach when it is hot, that is serving as a proxy for the temperature. This is important in looking at draft models and analytics, as many statistics that are important for predicting future success are proxies. For example, steals and minutes played come up as important in some models. However, steals can be a somewhat empty stat in the NBA and the ability to be on the court for more minutes might seem nonsensical in terms of predicting future performance. However, steals at the college level are an example of applied athleticism, so it’s not that the steal itself is important, but it shows the player can hang with NBA athletes. And minutes show that a college coach needs that player on the court. If we trust the coach is trying to win games, then heavy minutes show the coach trusts that player and lower minutes for an otherwise top prospect is a bit of a red flag (e.g. why is the coach playing this "top prospect" 25 minutes a game if he is trying to win, unless he sees the prospect hurting the team).
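The ice cream example can be simulated directly. In the toy model below (all numbers invented), temperature drives both the car count and the melt time, so cars end up strongly correlated with melting even though they cause nothing:

```python
# Sketch of a proxy variable: temperature drives both beach parking and melting,
# so car counts "predict" melt time without causing it. All numbers are invented.
import random

random.seed(42)

def corr(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

cars, melt_minutes = [], []
for _ in range(200):
    temp = random.uniform(50, 100)                             # the real cause
    cars.append(2 * temp + random.gauss(0, 10))                # hot days fill the lot
    melt_minutes.append(60 - 0.5 * temp + random.gauss(0, 2))  # hot days melt it faster

r = corr(cars, melt_minutes)
print(round(r, 2))  # strongly negative: more cars, faster melt -- with zero causation
```

If the model could see temperature directly, the car count would add nothing. Since it can’t, cars stand in for it, just like steals and minutes stand in for athleticism and coach trust.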
Coefficient: To keep from getting too complex here, let’s just say that in a model this helps tell you how much each statistic matters. So in these models, age tends to be very important and can add a lot to a player’s predicted performance (the younger they are), while some other statistics add significantly less to their score. So going back to the ice cream example, the number of cars in the parking lot might be the most important data point for explaining how fast your ice cream melts, but how many hats people are wearing might also be meaningful, just less so than the number of cars.
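Concretely, coefficients are just the weights a model multiplies each stat by. The toy model below is entirely invented (the weights come from nowhere real), but it shows why a big age coefficient swamps the smaller ones:

```python
# Sketch of how coefficients weight each stat in a HYPOTHETICAL draft model.
# These weights are invented for illustration, not taken from any real model.

coefficients = {
    "intercept": 30.0,
    "age": -1.5,         # big coefficient: each extra year of age costs 1.5 points
    "pts_per_40": 0.30,  # scoring helps, but each point adds far less
    "stl_per_40": 0.80,
}

def predicted_score(age, pts_per_40, stl_per_40):
    """Weighted sum of stats: each coefficient says how much that stat matters."""
    return (coefficients["intercept"]
            + coefficients["age"] * age
            + coefficients["pts_per_40"] * pts_per_40
            + coefficients["stl_per_40"] * stl_per_40)

# Identical stats, different ages: the younger prospect scores noticeably higher.
print(predicted_score(19, 20, 1.5))
print(predicted_score(22, 20, 1.5))
```

Same production, three years younger, and the predicted score jumps, which is exactly the pattern you see in most published draft models.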
Signal v Noise: This is the title of Nate Silver’s book on statistics and may sound like fancy jargon, but is quite simple. Signal is just a way of saying we have found a legitimate "pattern" in the data. Noise basically means we think something might be random coincidence (and there are statistical ways of measuring this, so perhaps think is the wrong word). So let’s say one player averages 2.4 steals per game and one averages 2.5 steals per game. Is the one who averages 2.5 steals per game a better ball thief? Over the course of a 30 game season he will get 3 extra steals. Are we sure that is due to skill (signal), or if both players were equally good at getting steals, is it possible that over 30 games one guy could get 3 extra just by chance (noise)? This is important when looking at the three point shooting percentage for players like Derrick Williams and Kyle Anderson. They took so few threes that their accuracy looks impressive, but a couple of extra misses would greatly decrease their percentage. However, a player like Doug McDermott shoots so many threes that even if he missed his next 10, his percentage would be very close to the same.
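We can actually answer the steals question with a quick simulation. Below, two invented players have exactly the same true steal-getting ability, and we see how often pure chance still separates their season totals by 3 or more steals:

```python
# Sketch: two EQUALLY skilled players (same true steal rate) still differ by a
# few steals over a 30-game season purely by chance. All numbers are invented.
import random

random.seed(1)

def season_total(p=0.245, chances_per_game=10, games=30):
    """Each game offers 10 steal chances, each converted with the same probability."""
    return sum(1 for _ in range(games * chances_per_game) if random.random() < p)

trials = 2000
gaps = [abs(season_total() - season_total()) for _ in range(trials)]
share = sum(1 for g in gaps if g >= 3) / trials

print(f"{share:.0%} of paired seasons differ by 3+ steals despite identical skill")
```

In most runs well over half of the paired seasons differ by 3+ steals, so a 2.4 vs 2.5 steals-per-game gap over 30 games tells us almost nothing by itself. That is noise, not signal.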
What we can learn
Comparisons given limitations / bias: Just eyeballing a model’s results (especially looking at past years), you can generally see that models have a specific bias. They tend to apply the same metrics to all players. So maybe big men are too highly rated or the model gives a bit too much love to three point shooting guards. That doesn’t mean you throw it out. It’s actually better if you can see what the bias is, as you can now compare the players while understanding the model’s limitations and knowing which players are probably being over/underrated by it. Typically it’s not important to fixate on a model’s exact rankings. Who really cares if one guy is rated a 10.2 and the next guy a 10.1? Those players are virtually the same, and the difference might just be random noise.
What types of factors matter: Some of my favorite write ups from Dalt, Pelton, Hollinger, etc. are when they talk about what variables mattered and affected their scores the most. Not only can this help you spot the bias, but when looking at future players, it can give you an idea of what statistics to look at. It’s why you will see people who follow these analytics become concerned when a player has low steals / offensive rebounds / free throws attempted. Some combination of these is important for showing applied athleticism across nearly every model. A player doesn’t have to excel in all three, but it’s a red flag if they are poor in all of them. And that’s something learned by understanding what statistics are driving the model, which is a great way to learn even if the overall model and its predictions don’t mean much to you.
Where do these models fall flat (aka shortcomings)
Much better at predicting offensive performance: Quite simply, there aren’t a lot of great defensive statistics. As you look for busts that are highly rated in different draft models, there tend to be three patterns that emerge. First, players who succumbed to injuries. It’s understandable that a model based on college data isn’t going to predict an injury prone career; that’s a mix of team doctors’ predictions and random chance. Second is players with character issues (see below). And finally, we see guys who were actually very good offensive players but couldn’t guard anyone. For example, DeJuan Blair is loved by most models and has put up some great PERs in the NBA, but his defense limits him to a 20mpg role (to be fair, with multiple knee surgeries he might also fall into the first list). So when looking at model lists, ask yourself if someone rated highly will be able to guard anyone or rotate quickly enough to play more than 20mpg as a specialist.
They are bad at the "squishy stuff": I’ll let John Hollinger take this one, "we've seen the particular ways in which it fails. The most obvious one is on all the squishy stuff -- character, dedication, conditioning, etc. Michael Beasley, Michael Sweetney and DeMarcus Cousins all got huge marks from the Draft Rater, but one could justify passing on them on draft day given the other red flags." And it’s true, in almost any model Beasley and Sweetney are highly rated. And if you look at their college numbers, it’s easy to see why. Both could have been offensive stars. The models weren’t wrong per se. The models just can’t predict mental issues.
Can’t pick up situational differences: These models somewhat rely on players being in similar situations. Dalt’s model (see his explanation in the comments this year) apparently looks at player skills on a team, and Hollinger and I believe Dalt look at usage rates. However, models can’t really adjust for Russell Westbrook playing out of position with Collison running PG for UCLA. They struggle to take into account inflated steals due to an aggressive zone versus deflated steals from a coach who tells his players to stay at home and play solid D. They are great at generalizing, over a large volume of players, what statistics are very important. But they miss situational differences. So when looking at Zach LaVine, it’s fair to wonder if his ratings are low because he wasn’t playing PG, or if Tyler Ennis’ ratings are inflated due to playing PG in Syracuse’s aggressive zone defense and getting more steals than he would have playing man D.
Missing data: Quite simply, we don’t have access to all of the information we would like. And some of the information the models think is important are proxies for information we are missing. Going back to the ice cream example, the number of cars in the beach parking lot might be a good proxy for the temperature, but if the polar bear club comes to the beach in the winter or a jazz festival makes everyone go to the park for a day in the summer, the fallibility of the proxy will show.
There will always be some degree of randomness: We’re predicting the future of 18-23 year olds. No matter how good a model is, there’s always going to be some randomness. Did you ever work with someone in their 20s who was a rock star until their SO broke up with them and then they fell apart? Or someone who didn’t really seem to care about their job until they got fired and then rocked at their next job? Welcome to the world of trying to see the inner-workings of the 18 year old brain, which science says isn’t even fully developed until 25. Some players like Andre Drummond will suddenly wake up and perform far better than most models would suggest. Some players like Deron Williams put up pedestrian college numbers even though scouts saw something special. Two players with identical college numbers will still vary. There is no perfect prediction, only a guide to follow and use as an additional tool in evaluation.
The modelers are human: It’s easy to get excited about the modeling a machine can help us do. But it is still a human being picking different statistics and using more judgment than you might realize from the outside.
What are common mistakes in using this data to evaluate players
Becoming too rigid with a rule: It’s good to have statistical guides to help you ask the right questions (e.g. should I be concerned with a player’s low steals), but no rule is written in stone and the second you fool yourself into thinking it is, you will get burned. It’s better to take the red flag and then look for different explanations through other statistics, scouting reports and video to try to understand if the concern is legitimate or if you might be wrong.
Picking a biased or meaningless sample: This can apply to models or other comparisons as well. Take Tyler Ennis as an example. If you look for players who came out after their freshman-junior years who have an A/T ratio over 2.5 and at least 2 steals per 40 minutes, you wind up with a list of Chris Paul, Ty Lawson and Mike Conley. Wow, Ennis looks like a surefire all star. However, you might notice his 51% true shooting percentage is far below those three. If you look for college PGs with a true shooting percentage at 55% or below and 7 assists per 40 or less, you get a much less attractive list. Whenever you are making a list to compare players, always ask yourself why you chose those numbers or those players. And try a couple of alternate statistics to see if it changes your outlook. It’s common to read posts on fan boards where someone says Player A averaged 16-5-5 last season, then shows a list of players from Basketball-Reference who averaged at least 16-5-5, mostly all stars. Those lists are almost always misleading unless there is a reason why the 16-5-5 matters.
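Here is a toy version of that cherry-picking problem. The prospects and their stats below are entirely invented; the point is that the same pool of players looks elite or mediocre depending on which thresholds you pick:

```python
# Sketch of cherry-picked filters: the same invented pool of prospects looks
# elite or mediocre depending on which thresholds you choose. All stats invented.

prospects = [
    # (name, assist/turnover ratio, steals per 40, true shooting %)
    ("Prospect A", 2.8, 2.1, 0.51),
    ("Prospect B", 2.6, 2.3, 0.59),
    ("Prospect C", 1.9, 1.2, 0.56),
    ("Prospect D", 3.1, 2.0, 0.48),
]

# Flattering filter: high A/T ratio and steals -- looks like a list of stars
flattering = [n for n, at, stl, ts in prospects if at > 2.5 and stl >= 2.0]

# Unflattering filter: weak shooting -- several of the same players reappear
unflattering = [n for n, at, stl, ts in prospects if ts <= 0.55]

print(flattering)    # A, B and D make the "surefire all-star" list
print(unflattering)  # but A and D also land on the weak-shooting list
```

Prospects A and D appear on both lists. Which list you build, and therefore which conclusion you reach, is entirely a function of the thresholds you chose.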
Failing to understand the right objective: This is less about the model and more about the hubris of the modeler. Take this article from Wages of Wins. I am not here to critique the model, which might be fantastic. However, the writer’s conclusion is that GMs were making dumb mistakes and would be better off if they drafted according to his model. But this assumes that every GM’s goal is to select the player who is ultimately going to have the highest production. That might sound reasonable, but we know that GMs of bad teams (especially small market teams) need to get superstars. They aren’t always trying to pick the safest player. They often would prefer to draft a player with a 30% chance of becoming an all star and a 70% chance of becoming a bust over a player who is "guaranteed" to become a solid bench contributor.
Getting swayed by a recent or vivid example: The inverse here is a fan mistake that models can help correct. People are often swayed by recent or vivid examples (this is psychology 101). In the draft, this leads us to immediately link Elfrid Payton with Rondo, even though there are plenty of athletic guards who can’t shoot who have been busts. As Kings fans, some of us see Aaron Gordon and can’t get images of Thomas Robinson out of our heads, even though there are definitive differences. Looking for patterns isn’t bad. But it’s important to stop for a moment and look at the data. Look at the models and question how the two players are different. Don’t let one vivid mistake lead you to make another, bigger mistake as your brain stereotypes certain types of players.
With all the talk of NBA 3.0, hopefully this fills in some of the gaps and helps as we go into the draft and discuss outcomes. In the early 2000s there was a battle between scouting and analytics with many people viewing it as a zero sum game. But now, as Nate Silver writes about in his book The Signal and the Noise, as analytics have grown so have scouting budgets. The two work perfectly in combination and the models produced by Dalt, Hollinger, Pelton, etc. aren’t meant to replace the observations of scouts or college basketball fans on this board, but add to them. Hopefully understanding the value of these models, their limitations and some of the common mistakes made with interpretation helps lead to better conversations.