When we learn about computational mathematics, software engineering, or data science in general, one of the most important questions we have to answer is **how many operations are required**. The point isn't to show off the prowess of our algorithm, but rather to identify how quickly the algorithm can perform. We need to know whether the answer is **obtainable in a reasonable amount of time**.

In this post, we take a cursory look at streaming computations and apply them to a simple problem, **scheduling in the NBA**, in an effort to identify the teams with the most **compact schedules**; meaning they have the **least rest between Game 1 and Game 82**.

## A Real Data Science Problem

**Note:** This is a discussion about real data science problems. If you're only here for the basketball analysis, skip to the next section.

A common (and sometimes forgotten!) problem in data science is the ability to **compute** statistics. Recall that a statistic is merely a **function of the data**. By computing statistics, we mean performing **computations** that the computer is able to **store in memory**. Take, for instance, a 100 gigabit per second stream of data with the charge of calculating a mean.

### Sublinear Computations

If we can process only 1 gigabit per second of that stream, we lose **99% of our data**. If we'd like a response in **real time**, we're forced to perform **sublinear computations**.

There is no way we can calculate the mean of all the data. Even if we obtained the mean in one calculation through some fairy-tale manner, we would still require **100 seconds** to process one second of data. If we need to process an entire hour's worth of data, we are stuck waiting an **extra 356,400 seconds** for the computation to complete. That's an extra **four days and three hours**. Sorry, MapReduce won't help you here either (it's slower).
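The backlog arithmetic above can be checked directly; here is a quick sketch using only the rates stated in the text:

```python
# Stream arrives at 100 Gb/s, but we can only process 1 Gb/s,
# so each second of data takes 100 seconds to chew through.
stream_rate = 100.0   # gigabits per second arriving
compute_rate = 1.0    # gigabits per second we can process

seconds_of_data = 3600.0                                        # one hour of stream
processing_time = seconds_of_data * stream_rate / compute_rate  # 360,000 seconds
backlog = processing_time - seconds_of_data                     # wait beyond real time

print(backlog)          # 356400.0 seconds
print(backlog / 86400)  # 4.125 days -> four days and three hours
```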

A sublinear algorithm is akin to **random sampling**. In this case, we sample the data using some random number generator, selecting an adequate amount of data and computing the mean. The idea is straightforward; however, several checks need to be performed to ensure the sample is representative enough of the global mean. We can do that through the process of **data forking**: working on multiple streams at the expense of requiring less data to perform computations.
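As a rough sketch of the sampling idea — the stream here is simulated with made-up parameters, and `p` is an arbitrary sampling rate — we might do something like:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Simulated 100,000-point stream; the true mean is 50 by construction.
stream = (random.gauss(50.0, 5.0) for _ in range(100_000))

p = 0.01          # keep roughly 1% of the stream
total, kept = 0.0, 0
for x in stream:
    if random.random() < p:   # Bernoulli sampling: each point kept independently
        total += x
        kept += 1

estimate = total / kept
print(kept, round(estimate, 2))  # roughly 1,000 samples kept; estimate lands near 50
```

The downstream computation touches only a small fraction of the points, which is where the sublinear savings come from; in a real system the sampling check itself would live at the stream's edge.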

### Real Time Computations

Now if our data stream is only 1 gigabit per second, we are able to compute the true mean in real time, but only for **order 1** calculations. This means we can perform only "one" action on the data as each data point comes in. Recall that the mean is computed by

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

From the formula above, the time to compute the mean is actually **order n**. This means we require all **n** data points to compute the mean. From this viewpoint, we are in trouble. However, the mean is probably the most well-known streaming computation out there. So there's hope!

In the streaming paradigm, we change the method of computation to a **forked stream**. What this means is that we construct a **window** of data by building **n** streams of data. Here, all we need to do is perform a fixed series of three calculations, which has nothing to do with the size of the data stream. The only worry is memory allocation, which we already assumed to be fine in our first sentence.

Let's rewrite the mean calculation:

$$\bar{x}_{\text{new}} = \frac{S - x_1 + x_{n+1}}{n}, \qquad S = \sum_{i=1}^{n} x_i$$

To perform this computation, we simply hold the sum and the number of data points in memory. We can start with an **initial fill** of all zeros. This means the first **n-1** means are **biased**, but that's alright, as we only care about the mean of **n** points at a given time.

As we fork the stream across **n** data points, when the **n+1** data point arrives we subtract the first stream, containing data point **x_1**, replace that stream element with **x_(n+1)**, and reassign the stream order. By doing this, we perform **three (four) calculations on the data**: subtract the oldest data point, add the newest data point, and divide by the sample size. The (fourth) calculation occurs while the **initial fill washes out**: the update **n = n + 1** runs until the initial fill is removed.
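The subtract-add-divide update above can be sketched as follows (the window size and input data are illustrative):

```python
from collections import deque

def rolling_means(stream, n):
    """Yield the mean of the last n points with O(1) work per point:
    subtract the oldest point, add the newest, divide by the window size."""
    window = deque([0.0] * n, maxlen=n)  # initial fill of zeros
    total = 0.0
    for x in stream:
        total -= window[0]   # subtract the oldest data point
        total += x           # add the newest data point
        window.append(x)     # the bounded deque drops the oldest automatically
        yield total / n      # divide by the (fixed) sample size

means = list(rolling_means([2, 4, 6, 8, 10], 3))
print(means)  # the first n-1 means are biased by the zero fill, then 4.0, 6.0, 8.0
```

Each point costs the same handful of operations regardless of how much data has streamed past, which is exactly the order-1 behavior described above.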

So let’s apply this methodology to a simple NBA problem: **scheduling.**

## NBA Schedules: Schedule Distribution

Opening day for the 2018 NBA season is **October 17th**, and the season proceeds for the following **177 days**, concluding on **April 11th**. Over the course of these 177 days, a series of 1,230 games needs to be scheduled. The games are distributed to ensure that all 30 NBA teams play 82 games: 41 on the road and 41 at home.

Of the 82 games, each team must play every team in its own division **four times**, for 16 games. Each team must also play every team in the opposing conference **twice**, for 30 games. The remaining 36 games are distributed across the remaining 10 in-conference teams, with six of those teams played **four times** and the other four teams **three times**.

Furthermore, the NBA has a required set of nine days off:

- November 23rd: Thanksgiving
- December 24th: Christmas Eve
- February 16th – 21st: All-Star Break
- April 2nd: NCAA National Championship

All teams need to be mixed together, with no other days off for the league, in such a way that every team gets an **acceptable amount of rest**. The NBA does a phenomenal job of scheduling, despite "an acceptable amount of rest" never being well-defined over the years.

### 2018 Distribution

For the 2018 season, we can walk through the entire 1,230-game schedule and construct a 30×177 binary matrix representing the season. Using the **imshow** command, we can view the matrix, where **yellow** represents a **game played** and **black** represents a **day off**.

We immediately see the league's days off as black columns. By running across the rows, we can start to measure the **compactness** of the schedule. A simple, naive measure would be to count the number of days off. However, every team plays 82 games in 177 days, so every team gets 95 days "off." Counting days off just doesn't cut it.

### Compactness

Instead, we become interested in the **compactness** of the schedule. Compactness means the schedule has **accumulation points**. In this sense, an accumulation point is a day with **many games within a neighborhood** of that day. For instance, a team **might** have two games before a day and two games after it; seen as a **4-in-5** set of games. That day is an accumulation point.

We can then map accumulation points to **high-volume sections** of the schedule. That is, we hunt for **short-burst** stretches where 66–75% (or more) of the days have games. We also focus on **two-week intervals**. This allows us to **nest** bursts in the schedule.

### Nesting

Nesting is a situation where certain scenarios in the schedule **cannot happen** unless other scenarios occur. For instance, a **back-to-back** pair of games is a high-volume set of games, and a **3-in-4** burst of games **must contain a back-to-back**.

Similarly, a **4-in-6** burst of games **must contain a back-to-back** and **almost always a 3-in-4**. Let's illustrate these using **binary vectors**. Suppose we have a three-day set. Using a **one** to indicate a game played and a **zero** to indicate no game, all possible 2-in-3 bursts are **011, 101,** and **110**. This means two of the three possibilities contain back-to-back sets.

But for 3-in-4, there’s **always a back-to-back:**

- 0111 – Two Back-to-Back’s and a Three-In-A-Row!
- 1011 – Back-to-Back
- 1101 – Back-to-Back
- 1110 – Two Back-to-Back’s and a Three-In-A-Row!

Similarly, for 4-in-6, there’s **almost always a 3-in-4:**

- 001111 – Three Back-to-Back’s, Two 3-in-3, One 3-in-4, one 4-in-4, and one 4-in-5
- 010111 – Two Back-to-Back’s, one 3-in-3, two 3-in-4, and one 4-in-5
- 011011 – Two Back-to-Back’s, two 3-in-4, and one 4-in-5
- 011101 – Two Back-to-Back’s, a 3-in-3, three 3-in-4, and one 4-in-5
- 011110 – Three Back-to-Back’s, two 3-in-3, two 3-in-4, one 4-in-4, and two 4-in-5
- 100111 – Two Back-to-Back’s, one 3-in-3, one 3-in-4
- 101011 – One Back-to-Back, one 3-in-4
- 101101 – One Back-to-Back, one 3-in-4
- 101110 – Two Back-to-Back, one 3-in-3, two 3-in-4, and one 4-in-5
- 110011 – Two Back-to-Back’s… **no 3-in-4!**
- 110101 – One Back-to-Back, one 3-in-4
- 110110 – Two Back-to-Back’s, two 3-in-4, and a 4-in-5
- 111001 – Two Back-to-Back’s, one 3-in-3, and a 3-in-4
- 111010 – Two Back-to-Back’s, one 3-in-3, two 3-in-4, and a 4-in-5
- 111100 – Three Back-to-Back’s, two 3-in-3, one 3-in-4, one 4-in-4, and a 4-in-5

That's 15 total options, with one exception to the 3-in-4 rule: the **110011** burst. In this case **there must be LOTS of rest days to avoid a 3-in-4**. In fact, for a 3-in-4 not to exist, the surrounding stretch must be **0011001100**; which is **four games in 10 days**. Ridiculous. Effectively, there's a 3-in-4.
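This enumeration is easy to verify by brute force; a small sketch that generates all fifteen 4-in-6 patterns and checks each for a 3-in-4:

```python
from itertools import combinations

def has_burst(pattern, games, days):
    """True when some stretch of `days` consecutive entries holds >= `games` ones."""
    return any(sum(pattern[i:i + days]) >= games
               for i in range(len(pattern) - days + 1))

# All six-day binary patterns with exactly four games: C(6, 4) = 15 of them.
patterns = [[1 if i in idx else 0 for i in range(6)]
            for idx in combinations(range(6), 4)]

no_3in4 = [''.join(map(str, p)) for p in patterns if not has_burst(p, 3, 4)]
print(len(patterns), no_3in4)  # 15 ['110011'] -- the lone 4-in-6 without a 3-in-4
```

The same brute-force check confirms that every one of the fifteen patterns contains a back-to-back, matching the nesting claim above.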

### NBA Cutting Back on High Volume Bursts

Starting in the 2018 season, the NBA announced that 4-in-5 stretches would be eliminated from the schedule after a series of studies on the effects of travel, lack of sleep, and game density (compactness). This resulted in a **red alert** system for games, with a **Charlotte Hornets** game cited as one of the worst offenders: a 4-in-5 finishing with a back-to-back on a Memphis-to-Charlotte transfer, resulting in "three hours of sleep." The Hornets were dispatched by the **Detroit Pistons** with ease: 112-89.

So let’s analyze the 2018 NBA season!

## Processing the Schedule: Streaming Computations

To perform this analysis, we take the entire schedule and walk through every scenario of **compactness**. We look for all back-to-back games, all 3-in-4, 4-in-6, 5-in-7, and so on. We can do this by rolling through the schedule, or we can perform **streaming computations**!

To perform the streaming computation, we first construct a **window** of games. Here, we are interested only in 12-day windows. Therefore, we build an initial fill of all zeros and walk through the first 12 days of the season: **October 17th through October 28th.**

We then compute a fixed series of checks: add the first two elements of the binary representation; if the sum is two, we have a **back-to-back**. Similarly, add the first four elements; if the sum is three or more, we have a **3-in-4**. (A simple sum suffices here because there are no 3-in-3's or 4-in-4's in the schedule.) We perform similar checks for all other high-volume bursts across the 12 days.
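Those fixed checks can be sketched directly on a 12-day binary window (the `detect_bursts` helper and the sample window are illustrative, not the code used for the analysis):

```python
def detect_bursts(window):
    """Check the front of a 12-day binary window for high-volume bursts.
    A k-in-m burst exists when the first m days hold at least k games."""
    bursts = []
    if sum(window[:2]) == 2:
        bursts.append('back-to-back')
    if sum(window[:4]) >= 3:   # a sum of 3+ must be a 3-in-4; 4-in-4's don't exist
        bursts.append('3-in-4')
    if sum(window[:6]) >= 4:
        bursts.append('4-in-6')
    if sum(window[:7]) >= 5:
        bursts.append('5-in-7')
    return bursts

# A 12-day stretch opening with a back-to-back, a 3-in-4, and a 4-in-6:
window = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
print(detect_bursts(window))  # ['back-to-back', '3-in-4', '4-in-6']
```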

Once we perform these checks, we add in the new day. In the streaming computation, we just **pop** off the oldest day and append the new day at the end. Since the data is small-scale, we can hold the windows in a 30×12 matrix. This can be performed in Python using the **numpy.roll** function and a wipe.
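A minimal sketch of the roll-and-wipe step, assuming the 30×12 window described above (the example games are invented):

```python
import numpy as np

# 30 teams x 12-day window of binary game indicators.
games = np.zeros((30, 12))
games[0, 0] = 1.0    # team 0 played on the oldest day in the window...
games[0, 11] = 1.0   # ...and on the newest

# A new day arrives: slide the window left, then wipe the stale last column.
games = np.roll(games, -1, axis=1)  # the oldest column wraps around to the end...
games[:, 11] = 0.0                  # ...so zero it out (the "wipe")

# Today's games go into the freed final column.
games[0, 11] = 1.0
print(games[0])  # only columns 10 and 11 hold a 1
```

The wipe is needed precisely because `numpy.roll` wraps values around rather than discarding them.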

As we walk through the schedule, the window slides across the binary matrix of games, pushing the oldest games to the front of the matrix, where the fixed burst checks are performed. Therefore, we never have to hunt for games!

The only catch is the end of the season, where we can either place a **terminal fill** of zeros, or sweep across the final games. I chose the latter.

### Python Code

By taking in a schedule data file, we can walk through each line, strip out the teams, and populate the games window matrix (**games**) and the season matrix (**gameMat**). We populate **gameMat** only in order to display the matrix image above.

```python
# Walk through the scores file. Each line will consist of:
# DATE: Winning Team Score Losing Team Score
for line in lines:
    # Strip to ensure no white space messes with the key structure of the team dictionary.
    winner = line[12:30].strip()
    loser = line[41:60].strip()
    day = line[8:10].strip()

    if day == currentDate:
        if gameNumber < 12:
            games[int(teams[winner])][int(gameNumber)] = 1.
            games[int(teams[loser])][int(gameNumber)] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
        else:
            games[int(teams[winner])][11] = 1.
            games[int(teams[loser])][11] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
    else:
        if line[11:15] == 'NONE':
            print('NO GAMES TODAY!', line[0:10])
            # New day shifts games. Let's get our counts before we run away!
            for i in range(30):
                specials = CountGames(games[i], i, specials)
            # Work on shifting games; as it is a new day!
            games = np.roll(games, -1, axis=1)
            # Must wipe out the last element of every row
            for i in range(30):
                games[int(i)][11] = 0.
            gameNumber += 1.
            continue

        # New Day!
        gameNumber += 1.
        currentDate = day
        if gameNumber < 12:
            games[int(teams[winner])][int(gameNumber)] = 1.
            games[int(teams[loser])][int(gameNumber)] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
        else:
            # New day shifts games. Let's get our counts before we run away!
            for i in range(30):
                specials = CountGames(games[i], i, specials)
            # Work on shifting games; as it is a new day!
            games = np.roll(games, -1, axis=1)
            # Must wipe out the last element of every row
            for i in range(30):
                games[int(i)][11] = 0.
            # Now add the new guy!
            games[int(teams[winner])][11] = 1.
            games[int(teams[loser])][11] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
```

What we see in this code block is that we apply the roll function and then wipe the new column to zeros in order to populate new data. **This is a cheat**, as the data is small. If we performed a purely streaming computation, we would hold the row sums for each burst type, search for the teams playing on the new day, and add/subtract accordingly. Despite this, you can see how the streaming window interacts with the schedule.

## 2018 Results

Performing the above computations, we can then look at a **compactness measure**. In this case, we perform a simple calculation: count **high-volume bursts**, weighted by their volume. That is, we add all **back-to-back** games to **75% of 3-in-4** bursts, **66.6% of 4-in-6** bursts, **71.4% of 5-in-7** bursts, and so on. The higher the number, the denser the schedule.
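A sketch of that weighting — the burst counts below are invented for illustration, and the weights are simply each burst's games-per-day density:

```python
# Weight each burst type by its game density (games per day).
WEIGHTS = {
    'back-to-back': 2 / 2,   # 1.000
    '3-in-4':       3 / 4,   # 0.750
    '4-in-6':       4 / 6,   # 0.666...
    '5-in-7':       5 / 7,   # 0.714...
    '6-in-9':       6 / 9,   # 0.666...
    '8-in-12':      8 / 12,  # 0.666...
}

def compactness(burst_counts):
    """Sum a team's burst counts weighted by burst density; higher = denser."""
    return sum(WEIGHTS[b] * n for b, n in burst_counts.items())

# Hypothetical team: 14 back-to-backs, 19 3-in-4s, 20 4-in-6s, one 5-in-7.
score = compactness({'back-to-back': 14, '3-in-4': 19, '4-in-6': 20, '5-in-7': 1})
print(round(score, 3))  # 42.298
```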

### Back-To-Back Games:

In total, there were **425 instances of Back-to-Back games**, with each team averaging **14** back-to-backs. The **Memphis Grizzlies** and **Utah Jazz** are the worst offenders with **16 Back-To-Back games** each. Compare this to the **Charlotte Hornets, Cleveland Cavaliers, Detroit Pistons, Golden State Warriors, Los Angeles Lakers, Miami Heat,** and **New Orleans Pelicans**, who all have **13 Back-To-Back** games, and we find that the discrepancy isn't too high. This is arguably dependent on travel time/distance between games.

### Three-In-Four Games:

Three-In-Four bursts are fairly common as well. In this case, we find that each NBA team averages **18.9** three-in-four bursts during the season. As there were a total of **567** of these instances, we find a much larger discrepancy among teams than with **Back-To-Back** games. For the 2018 season, the **Memphis Grizzlies** have **23 3-in-4 bursts**, while the **Cleveland Cavaliers** have **10**. The next-lowest total? The **Denver Nuggets** with **15**.

### Four-In-Five Games:

Finally, the NBA season witnessed **zero 4-in-5 bursts!**

### Four-In-Six:

For 4-in-6 bursts, we see **611 instances** throughout the season. However, recall from the breakdown above that a sequence of **101101** results in a pair of **3-in-4 bursts**. The same overlap happens with **4-in-6** situations when we expand the window. So don't read this as 611 **unique** scenarios.

Despite this, teams witness **20.4 4-in-6 bursts** per season on average. The worst offender is the **Charlotte Hornets with 27**. The team with the least? You guessed it… the **Cleveland Cavaliers with 14**. The **Los Angeles Lakers** come in second lowest with **15**.

### Five-In-Seven:

The NBA attempted to reduce the number of 5-in-7 stretches this season but was unable to eliminate them. Under this scenario, we find **35 total instances**, with the **Minnesota Timberwolves** and the **San Antonio Spurs** taking the brunt at three each.

There are eight teams that observe this burst **twice** during the season: **Atlanta, Boston, Chicago, Orlando, Philadelphia, Phoenix, Sacramento,** and **Toronto**. Thirteen teams see it once. However, **seven teams** escape the wrath of 5-in-7 bursts entirely: **Cleveland, Golden State, Indiana, the Los Angeles Lakers, New Orleans, New York,** and **Memphis**.

### Six-In-Eight:

Fortunately, there are no 6-in-8 games this season.

### Six-In-Nine:

Six-In-Nine bursts start to become fleeting, as there are **87 instances** of them; each team averages **2.9 6-in-9** bursts. Again, we find a large discrepancy across teams. The **Boston Celtics** appear in **7 games that are the tail end of a 6-in-9 burst**. **Philadelphia** (6), **Orlando** (5), and **Minnesota** (5) also take the brunt with a high number of these scenarios.

There are, however, two teams that avoid the 6-in-9 run entirely. **This means they ALWAYS** get **FOUR days of rest in every nine-day stretch**. These two fortunate teams? **Cleveland** and **New York.**

### Eight-In-Twelve:

The final scenario we look at is 8-in-12, because there are **zero instances** of **7-in-9, 8-in-11,** and **9-in-12** bursts. In fact, there are **only three instances of 8-in-12 bursts**. The poor teams who suffer them are the **Boston Celtics, Phoenix Suns,** and **Chicago Bulls**.

### Compactness Score

By computing the compactness score over this sliding window, we find that the **Philadelphia 76ers** (51.929), **Memphis Grizzlies** (51.917), and **Orlando Magic** (51.595) suffer the most from the schedule. Teams with compactness scores over 50 have dense schedules with the least amount of rest. This includes the top three above as well as the **Boston Celtics** (50.512), **Phoenix Suns** (50.179), **Charlotte Hornets** (50.131), and **Minnesota Timberwolves** (50.060).

There is a significant drop-off after 50 (to 47). However, there is one exceptional outlier: **Cleveland** (29.833). This means that Cleveland's schedule is roughly twice as spread out as those of **Philadelphia, Memphis,** and **Orlando**.

The entire table can be viewed here:

Looking at the distribution of scores, we find that Cleveland does indeed have a significant advantage in schedule. Denver finds itself in a similar situation, and this added rest along with their home court advantage may potentially be giving Denver a slight boost in the NBA standings.

## 2017 NBA Season…

Compared to the previous season, there were only 170 total days between opening day and the final day of the 2017 season. The 2018 season not only witnessed fewer 5-in-7 bursts and the elimination of 4-in-5 bursts, but also gained an extra week of play. Computing the binary matrix, we find that there is also no **NCAA National Championship** day off in 2017.

We find that the distribution of compactness is **much higher** in 2017 than in 2018.

So much higher: Back-To-Back games totaled 486 (425 in 2018), 3-in-4 bursts totaled 714 (567 in 2018), and 4-in-6 bursts totaled 855 (611 in 2018). Similarly, there were **23 total 4-in-5 bursts in the 2017 season**. There were also dreaded **7-in-10 bursts** in the 2017 NBA season; another burst that is non-existent in the 2018 season. The **Los Angeles Clippers**, **Philadelphia 76ers**, and **Orlando Magic** all experienced these during the season. And as a note, each of these teams suffered significant injuries during the season.

Continuing this path, the number of 8-in-12 games dropped from 22 (2017) to 3 (2018) while the number of 5-in-7 games dropped from 93 (2017) to 35 (2018).

For the 2017 season, we see almost every team above the 50 mark. The exceptions are **Dallas** and **Brooklyn**. Every team, however, dropped in compactness, indicating that every team technically obtains more rest this season than in the previous one. The lone near-exception is the **Minnesota Timberwolves**, who dropped only from 50.679 to 50.060. This may become a problem, as the Timberwolves are notorious for not resting their players.

While we have seen the NBA increase rest across the season, we still find kinks in the scheduling format, as some teams are generously given days off while others find themselves in 5-in-7 ruts. So how would you improve the schedule? Note that we didn't even factor in travel and time-zone changes, which compound the problem of scheduling and rest; making the job of an NBA scheduler an unenviable one.
