
# New project

###
#1
Posted 09 May 2012 - 06:22 PM


I issued the warning not to frighten anyone away from this thread, but rather to read carefully and ask any questions in case you don't understand something.

_____________________________

Having said that... Baker's recent article about MLB attendance this year made me think about doing a project in which I estimate a team's attendance based on several factors.

So it is a mathematical linear model in which attendance depends on several factors like:

1. The franchise we are talking about. Clearly more people attend games at Boston than at Oakland.

2. The novelty of the stadium. It's been proved that when a team builds a new stadium... they will come.

3. The winning percentage in that year. The better the team, the better the attendance.

4. Whether the team reaches the postseason or not. (correlated to the previous one).

5. Strike. Strike reduced attendance.

6. Year. We can see if attendance has been increasing with time.

7. Recent time success (whether it is 3 years, 5 years or 10 years).

Pretty much those are the variables that can explain yearly attendance.

________________________________

So I ran a generalized linear model where the dependent variable is yearly total attendance and the regressors were:

1. Year (from 1990 to 2011)

2. Strike (1 in 1994 and 1995, 0 otherwise)

3. Winning% (the winning percentage of the team in that year)

4. Winning%MovingAverage (a 3/5/10 year moving average of the winning%)

5. Stadium Novelty (exponential weight depending on the year the new stadium opened).

6. PostSeason (1 if the team reached the Postseason that year, 0 if not)

7. Franchise (a categorical variable representing the team).
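The regressor list above can be sketched as one row of a design matrix per team-season. This is only an illustration: the function name `build_row`, the short team list, the choice of baseline franchise, and the example input values are all assumptions, not the actual setup used for the results in this thread.

```python
# Sketch: turn one team-season into a row of regressors.
# STRIKE_YEARS follows the definition above (1 in 1994 and 1995).
STRIKE_YEARS = {1994, 1995}

# Hypothetical short team list; the real model uses every franchise.
TEAMS = ["Oakland", "Seattle", "Washington (Montreal)"]
BASELINE = "Washington (Montreal)"  # assumed reference level (no dummy column)


def build_row(team, year, win_pct, win_ma, novelty, postseason,
              teams=TEAMS, baseline=BASELINE):
    """Return [intercept, year, strike, win%, win% MA, novelty, postseason, dummies...]."""
    # One 0/1 column per franchise, skipping the baseline franchise.
    dummies = [1.0 if team == t else 0.0 for t in teams if t != baseline]
    strike = 1.0 if year in STRIKE_YEARS else 0.0
    return [1.0, float(year), strike, win_pct, win_ma, novelty,
            float(postseason)] + dummies


# Illustrative inputs only (placeholder winning percentages).
row = build_row("Seattle", 1995, win_pct=0.5, win_ma=0.5,
                novelty=0.0, postseason=1)
print(row)
```

Leaving one franchise without a dummy column is the usual way to keep the franchise dummies from duplicating the intercept; the other franchise coefficients are then read relative to that baseline.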

Results will follow soon.

###
#2
Posted 09 May 2012 - 06:37 PM


Just want to tell you that I tried 3 different models.

Model 1. Uses the 3 Year Moving Average of the Winning Percentage as a measure of "recent success". This is simply the average of the 3 most recent years' winning% (current year included), and it moves in time. So for example, the 2011 Moving Average of the Mariners = (.525 + .377 + .414) / 3 = .439 (2009 Win%, 2010 Win% and 2011 Win%).

Model 2. Uses the 5 Year Moving Average of the Winning Percentage. So it considers the last 5 years.

Model 3. Uses the 10 Year Moving Average.

All other variables remained the same in the 3 models.
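As a minimal sketch of the moving average used in the three models, assuming the window ends with the current season (as in the Mariners example) and that the seasons are passed oldest first. The function name `trailing_moving_average` is made up for illustration.

```python
def trailing_moving_average(win_pcts, window):
    """Average of the last `window` winning percentages, current season included."""
    if len(win_pcts) < window:
        raise ValueError("not enough seasons for this window")
    recent = win_pcts[-window:]  # the most recent `window` seasons
    return sum(recent) / window


# Mariners Win% for 2009, 2010 and 2011, as quoted in the post.
mariners = [0.525, 0.377, 0.414]
ma_2011 = trailing_moving_average(mariners, 3)
print(round(ma_2011, 3))  # 0.439
```

The 5-year and 10-year versions only change `window`, at the cost of needing more past seasons for each team.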

Results.

All 3 models were significant, but in model #1 the current-year winning% was not significant at the 5% level (p-value = .0657). Perhaps this happened because Winning% and the 3 year moving average are highly correlated. For models #2 and #3, the current-year winning% was significant at the 5% level, which suggests it is less correlated with the longer moving averages.

The PostSeason dummy had the next-weakest significance, but it was still significant (p-value = .02).

The R^2, which is a measure of the variance explained by the model, was highest for model #1.

Model, R^2

1, .7417

2, .7301

3, .7037

So, it seems like the 3 year moving average is better at predicting the Attendance.
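For readers who want to see what the R^2 comparison measures, here is a small sketch of the usual definition (1 minus residual sum of squares over total sum of squares). The function name `r_squared` and the toy numbers are illustrative only; they are not the attendance data.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1.0 - ss_res / ss_tot


# Toy values: a perfect fit gives 1.0, predicting the mean gives 0.0.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```

Since the three models differ only in the moving-average window and otherwise use the same regressors, comparing their plain R^2 values is a reasonable first check.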

**Edited by Pirata Morado, 09 May 2012 - 06:38 PM.**

###
#3
Posted 09 May 2012 - 06:53 PM


So, for example, the stadium novelty weights go like this under the first parameter (one value per year since the stadium opened):

0.95

0.9025

0.8574

0.8145

0.7738

...

for the second parameter they go like this:

0.9

0.81

0.729

0.6561

...

for the third parameter they go like this:

0.8

0.64

0.512

0.4096

0.3277

...

so you can see that the weights decrease faster under the 3rd alpha.
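The three weight sequences above look like powers of (1 - alpha), so here is a sketch under that assumption. The alpha values 0.05, 0.1 and 0.2 are inferred from the listed numbers, and the function name `novelty_weights` is hypothetical.

```python
def novelty_weights(alpha, years):
    """Weights (1 - alpha)**k for k = 1..years, with k = 1 for the opening year."""
    return [(1.0 - alpha) ** k for k in range(1, years + 1)]


# Assumed alphas for the first, second and third parameter.
for alpha in (0.05, 0.1, 0.2):
    print(alpha, [round(w, 4) for w in novelty_weights(alpha, 5)])
```

A larger alpha makes the new-stadium effect fade faster, which is why the third sequence drops the quickest.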

Results.

Once I knew that Model 1 above was the highest R^2 model, I only tried the 3 different alphas under that one.

Alpha, R^2

1, 0.7417

2, 0.7559

3, 0.7555

So, it seems as though the model with alpha = 0.1 (the second one) works better.

**Edited by Pirata Morado, 09 May 2012 - 06:54 PM.**

###
#4
Posted 09 May 2012 - 07:13 PM


I also noticed that the Florida and the Tampa Bay parameters weren't significant under any model (Florida p-value = 0.3698, Tampa Bay p-value = 0.4971, all other <.0001).

__________

So the next interesting things to analyze are the regression coefficients for each variable, which follow:

Variable, Parameter

B0 (intercept), -28,161,802

Year, 13,269.96

Strike, -499,461

Winning%, 618,022

3YearWin%MovingAverage, 5,462,943

StadiumNovelty, 1,226,575

PostSeasonDummy, 100,954

Arizona, 775,465

Atlanta, 632,983

....

Seattle, 804,966

...

The p-value of Winning% is .1114, so it is not significant.

###
#5
Posted 09 May 2012 - 07:19 PM

Other factors which might have a measurable effect:

- local economic conditions. Looking at the unemployment rate and average income for a city plotted against the average ticket price.

- competition. Are there other baseball teams or sports with an overlapping season which have a higher local popularity?

Oakland has as rich and storied a history as many other big franchises. They were THE team in the early 70's. But with local competition from the Giants, they are starting from a weak position.

Detroit has one of the weakest economies in the nation. How does that affect their attendance?


###
#6
Posted 09 May 2012 - 07:29 PM

Interesting suggestions, Huindekmi. The difficult thing could be where to get all those variables for each team, for each year.

###
#7
Posted 09 May 2012 - 07:37 PM

These are the parameters:

Intercept: -27,645,125

Year: 13,054

Strike: -496,805

3YearWin%MA: 5,899,364

StadiumNovelty: 1,222,494

PostSeason: 139,188

Arizona: 771,370

Atlanta: 629,310

Baltimore: 1,338,687

Boston: 913,474

Cincinnati: 531,032

Cleveland: 576,181

Colorado: 1,416,095

Chi Cubs: 1,369,389

Detroit: 620,950

Florida: 105,065 *** (p-value = 0.374 not significant)

Houston: 583,011

Kansas City: 567,395

LA Angels: 1,076,424

LA Dodgers: 1,677,318

Milwaukee: 616,390

Minnesota: 440,434

NY Mets: 959,585

NY Yankees: 1,068,593

Oakland: 249,748

Philly: 840,866

Pittsburgh: 348,764

SD: 623,270

Seattle: 801,995

SF: 808,816

White Sox: 330,274

StL: 1,180,503

TB: 83,770 *** (not significant, p-value = 0.5218)

Texas: 832,011

Toronto: 882,722

Washington (Montreal): 0

###
#8
Posted 09 May 2012 - 07:47 PM

So for instance, look at the M's:

Start with the intercept (-27,645,125) and add the M's coefficient (801,995), which gives -26,843,130 yearly attendance (yes, negative). Then add the moving-average term: the coefficient of 5,899,364 corresponds to a full 1.000 of winning%, so assuming a .500 average you add only half of it, 2,949,682 more people, to get to -23,893,449. Then you add the year term. Assuming you are in 1995, multiply 1995 by the year coefficient (13,054 after rounding) and you get 26,043,528, which, added to the negative number before, gives a 1995 estimate of 2,150,079. You can also add the PostSeason coefficient of 139,188, because the M's got to the postseason in 1995. So your estimate is 2,289,267. (This walk-through leaves out the Strike term, which the model also counts for 1995.)

Edit: I guess the explanation above is not very clear.

Perhaps this example helps:

Team: M's: Coefficient: 801,995

Year: 2001: 13,054 * 2001 = 26,121,854

PostSeason: 1 * 139,188 = 139,188

3YearMovingAverage: .588 = (.488 + .561 + .716) / 3, so it is .588 * 5,899,364 = 3,468,826

StadiumNovelty = .729 (only the third year away from SafeCo opening) * 1,222,494 = 891,198

Strike: 0 (no strike year) * -496,805 = 0

Intercept = -27,645,125

Sum of the above = 3,777,936. The Mariners drew about 3.5 million in 2001, so it is a pretty good estimate.
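The 2001 Mariners example can be reproduced with a few lines of code. This sketch uses the rounded coefficients as posted, so the year term (and therefore the total) comes out roughly 800 lower than the figures in the example; the names `COEF`, `TEAM` and `predict_attendance` are made up for illustration.

```python
# Rounded coefficients as posted in the thread.
COEF = {
    "intercept": -27_645_125,
    "year": 13_054,
    "strike": -496_805,
    "ma3": 5_899_364,       # 3-year moving average of Win%
    "novelty": 1_222_494,   # stadium novelty weight
    "postseason": 139_188,
}
TEAM = {"Seattle": 801_995, "Pittsburgh": 348_764}


def predict_attendance(team, year, ma3, novelty=0.0, postseason=0, strike=0):
    """Sum each coefficient times its regressor value."""
    return (COEF["intercept"]
            + TEAM[team]
            + COEF["year"] * year
            + COEF["strike"] * strike
            + COEF["ma3"] * ma3
            + COEF["novelty"] * novelty
            + COEF["postseason"] * postseason)


mariners_2001 = predict_attendance("Seattle", 2001, ma3=0.588,
                                   novelty=0.729, postseason=1)
print(round(mariners_2001))
```

Any other team-season works the same way: swap in the franchise coefficient and that season's regressor values.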

**Edited by Pirata Morado, 09 May 2012 - 08:02 PM.**

###
#9
Posted 09 May 2012 - 08:22 PM

Team: Pittsburgh = 348,764

Year: 2004, 2004 * 13,054 = 26,161,018

PostSeason: 0

MovingAverage = 0.452 * 5,899,364 = 2,669,212

StadiumNovelty = 0.6561 (4th year of PNC) * 1,222,494 = 802,078

Intercept = -27,645,125

Sum of the above = 2,335,947. The 2004 Pirates actually drew 1,583,031, so here the model overestimates by a wide margin.
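To put a number on how far off the Pirates estimate is, here is a quick sketch using only the two figures quoted above (the variable names are illustrative).

```python
predicted = 2_335_947   # model estimate for the 2004 Pirates, from the post
actual = 1_583_031      # actual 2004 attendance, as quoted in the post

error = predicted - actual    # positive means the model overestimates
pct_error = error / actual    # error relative to the actual attendance

print(error, round(100 * pct_error, 1))  # 752916 47.6
```

So this particular estimate is high by roughly three quarters of a million fans.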

###
#10
Posted 09 May 2012 - 08:32 PM

1. There's evidence of a league-wide upward trend: attendance increases by about 13,054 people per team every year.

2. The strike (1994, 1995) had a negative impact of almost half a million people for each team (-496,805).

3. The current year's winning% is not a significant driver of attendance (which makes sense, since you don't know in advance what it will be, so people react to it later), but rather...

4. A better way to explain attendance is the 3 year average of the winning%: people tend to respond to how good the team has been "recently" more than to the current year alone.

5. If a team won all its games in those last 3 years, it would draw 5,899,364 more people in a year, or perhaps easier to say: a team with a .500 3-year average will draw 2,949,682 more people than a .000 team.

6. When you open a new stadium you can expect on the order of 1.2 million more people in the year (1,222,494 times the novelty weight), but this boost decays exponentially with each year since the opening.

7. If you reach the postseason you can expect 139,188 more tickets sold in that year than if you don't reach it.

8. The teams with the biggest fanbases are: Dodgers: 1.7M, Colorado: 1.4M, Cubs: 1.4M, Baltimore: 1.3M, Cardinals: 1.2M

9. The teams with the smallest fanbases are: White Sox: 330K, Oakland: 250K, Florida: 105K, Tampa Bay: 84K, Washington: 0

10. The Mariners fanbase ranks in the middle (14th) with a "fanbase" of 801,995.
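The fanbase ranking in points 8 to 10 can be checked by sorting the franchise coefficients posted earlier in the thread. This is just a sketch; the dictionary name `FRANCHISE` is made up, and the values are the rounded coefficients as posted. Note that these are offsets relative to the franchise listed with a coefficient of 0, which is presumably the reference level of the categorical variable.

```python
# Franchise coefficients as posted (rounded).
FRANCHISE = {
    "Arizona": 771_370, "Atlanta": 629_310, "Baltimore": 1_338_687,
    "Boston": 913_474, "Cincinnati": 531_032, "Cleveland": 576_181,
    "Colorado": 1_416_095, "Chi Cubs": 1_369_389, "Detroit": 620_950,
    "Florida": 105_065, "Houston": 583_011, "Kansas City": 567_395,
    "LA Angels": 1_076_424, "LA Dodgers": 1_677_318, "Milwaukee": 616_390,
    "Minnesota": 440_434, "NY Mets": 959_585, "NY Yankees": 1_068_593,
    "Oakland": 249_748, "Philly": 840_866, "Pittsburgh": 348_764,
    "SD": 623_270, "Seattle": 801_995, "SF": 808_816,
    "White Sox": 330_274, "StL": 1_180_503, "TB": 83_770,
    "Texas": 832_011, "Toronto": 882_722, "Washington (Montreal)": 0,
}

# Sort team names from largest to smallest coefficient.
ranked = sorted(FRANCHISE, key=FRANCHISE.get, reverse=True)
top5 = ranked[:5]
bottom5 = ranked[-5:]
seattle_rank = ranked.index("Seattle") + 1  # 1-based rank

print(top5)
print(bottom5)
print(seattle_rank)  # 14
```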

**Edited by Pirata Morado, 09 May 2012 - 08:37 PM.**

###
#11
Posted 10 May 2012 - 07:47 AM

R^2 is now .7636, so a little bit better than before (76% of the variance in attendance is being explained by the model).

Coefficients are now:

Intercept: -27,245,815

Year: 12,845

Strike: -501,776

3YearMovingAverageWinning%: 5,928,636

StadiumNovelty: 1,258,165

PostSeason: 133,018

Mariners: 869,308

....

So you can see that the coefficients varied just a little bit; the results hold.

__________________

Now, perhaps a picture is worth more than a thousand words, so here's a Time Series chart of the actual attendance and the predicted one based on this model for the Mariners:

Some things worth noticing:

1. You can see that the model does a very nice job of estimating the actual yearly attendance.

2. Notice the lowest point in 1994, the year of the strike.

3. Then we see an upward trend thanks to "successful" seasons.

4. We can see a spike in 1997 thanks to the PostSeason.

5. We see a hill in the 2000-2003 period of time, thanks to the 2000 and 2001 seasons.

6. From then on it has been a downward trend.

7. We can also notice that the SafeCo Field effect seems to overestimate attendance, see that the teal line crosses the navy one precisely there in 1999.

**Edited by Pirata Morado, 10 May 2012 - 07:55 AM.**

###
#12
Posted 10 May 2012 - 08:10 AM

We can see here the yearly increasing trend, with some bumps in the road (1994-1995). In 2006 we see an overestimation due to the new park and the PostSeason. It seems the Cards fans are indeed very loyal, but they don't show homerism.

###
#13
Posted 10 May 2012 - 08:14 AM

Notice first the lower general level compared to the M's and Cards (the limits in the Y-axis are the same in all charts).

Notice now the peak in 1991 due to their new stadium; the model predicts it quite well.

This time, it seems the model underestimated the effect of the good 2005 season they had (champions), because we can see a spike in attendance in 2006 that wasn't reflected in the model.

###
#14
Posted 10 May 2012 - 08:53 AM

Maybe I'm missing it ... but I don't see any control for stadium capacity.

Obviously, there is a maximum total number of seats possible at every stadium - (an upper bound) - which is not a constant. I would think this would significantly influence the upper bounds of attendance spikes. (No team can actually get 5-6 million in attendance, though it seems your model could predict that with a 162-win team).
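One simple way to picture the capacity point is to cap a raw prediction at seats per game times the number of home dates. This is only a sketch of the idea, not something the model in this thread does; the function name `capped_prediction` and the 45,000-seat park are hypothetical, while 81 home games is the standard half of a 162-game schedule.

```python
HOME_GAMES = 81  # home dates in a 162-game regular season


def capped_prediction(raw_prediction, seats_per_game, home_games=HOME_GAMES):
    """Limit a yearly attendance estimate to what the park can physically hold."""
    ceiling = seats_per_game * home_games
    return min(raw_prediction, ceiling)


# Hypothetical 45,000-seat park: the ceiling is 45,000 * 81 = 3,645,000.
print(capped_prediction(5_500_000, 45_000))  # 3645000
print(capped_prediction(2_300_000, 45_000))  # 2300000
```

Because capacity changes when a team moves into a new park, `seats_per_game` would have to be supplied per team and per year.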


###
#15
Posted 10 May 2012 - 09:14 AM

You're right Sandy, the model has no upper bound. Perhaps it controls itself, because no team has a Moving Average Winning% that big.

###
#16
Posted 10 May 2012 - 09:20 AM

I ran this, and I want to show you some neat charts. The difference here is that we get different coefficients for each regressor for each team, but we lose degrees of freedom (we only use the sample size of each team).

Anyway, I thought these charts look very nice.

Milwaukee:

Notice the big peak in 2001 due to Miller Park, which isn't being reflected in the new model, but you can see the yearly trend and the improvement in the team's winning percentage driving the attendance up from 2003 onwards.

By the way, the coefficients were:

Intercept: -108,925,437

Year: 52,651

Strike: -225,113

3YearMovingAverage: 11,406,979

StadiumNovelty: 1,604,437

PostSeason: 109,523 *** (not significant, p-value = .4459)

What these mean is that PostSeason doesn't have a significant effect on attendance for the Brewers. Their yearly increase in attendance is 52K per year, the strike cost them only about a quarter million, and their stadium novelty brought them 1.6 million more.

By the way, the R^2 for the Brewers was .9559, a very high percentage of variance explained (I picked the biggest ones).

**Edited by Pirata Morado, 10 May 2012 - 09:23 AM.**

###
#17
Posted 10 May 2012 - 09:31 AM

Notice how well 1993 gets predicted (their first postseason in years), but then look also at how well the downward trend due to bad play gets predicted.

Again notice the big peak their new stadium brought them in 2004, which isn't perfectly predicted by the model, but notice how well the model predicts the upward trend once they started playing "better" baseball (higher 3 year Moving Average).

Coefficients:

Intercept: 65,408,688

Year: -33,987

Strike: -219,122 *** (not significant, p-value = .2436)

3YMAWin%: 9,367,844

StadiumNovelty: 977,488

PostSeason: 780,556

Seems like the Strike wasn't significant for them, and notice the negative yearly trend, meaning that fewer people go there as time goes by (has recession affected Philly badly?). But they seem to suffer from a lot of homerism. Their 3 Year coefficient is very high, meaning that they go if their team is good, and look also at their big PostSeason coefficient: they draw 3/4 million more when they get to the playoffs. Their new stadium drew only 977K more, which is about 60% of the Brewers' figure. The R^2 for the Phillies is 0.9584, also quite good.

**Edited by Pirata Morado, 10 May 2012 - 09:33 AM.**

###
#18
Posted 10 May 2012 - 09:41 AM

First thing you can see is that it predicts better than the previous one (lower error in prediction), particularly notice the strike effect, beautifully captured.

Again we see the hill in the 2000-2003 period of time, but now the prediction fits better. The peak in 1997 is also quite noticeable due to the postseason.

Coefficients:

Intercept: -51,238,281

Year: 25,054

Strike: -851,534

3YMAWin%: 7,046,137

StadiumNovelty: 591,446

PostSeason: 325,968

Big effect from the strike for us, of almost a million. PostSeason is big for us, driving 325K more in those years. Stadium Novelty doesn't seem as big as for the other 2 teams, since SafeCo seemed to draw only about 0.6 million more fans. The R^2 for the M's model is 0.8741, lower than for the Phillies and Brewers, but still better than the previous model.

###
#19
Posted 10 May 2012 - 10:04 AM

In the whole MLB model (with Franchise as a regressor), the RMSE was 369,692, meaning that the model's yearly estimates were typically off by about 370K (an error can be an underestimation or an overestimation).

Among the individual models, the Milwaukee model has an RMSE of 152,881, clearly much better (less than half that of the MLB model).

The Phillies model had an RMSE of 174,913, again better than the combined model.

The Mariners individual model had an RMSE of 277,617, worse than the Phillies and Brewers, but better than the combined model.

So the Mariners model typically misses the prediction by about a quarter million fans, not bad.
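For reference, RMSE is the square root of the mean squared error. A minimal sketch follows; the function name `rmse` and the toy numbers are illustrative, not the attendance data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy example: errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```

Because the errors are squared before averaging, a few big misses weigh more than they would in a plain average of absolute errors, so RMSE reads as a "typical" miss rather than a strict average.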

###
#20
Posted 19 May 2012 - 04:57 PM
