Welcome to Mariner Central

New project


30 replies to this topic

#1
Pirata Morado

Pirata Morado
  • Members
  • 9,014 posts
  • Gender:Male
  • Location:Queretaro, Mexico
  • Interests:Statistics, Theatre, Alan Parsons music, Astronomy, and of course, Mariners Baseball.

*
POPULAR

Warning! Heavy technical discussion!

I issued the warning not to frighten anyone away from this thread, but rather to ask you to read carefully and post any questions in case you don't understand something.

_____________________________

Having said that... Baker's recent article about MLB attendance this year made me think about doing a project in which I estimate a team's attendance based on several factors.

So it is a mathematical linear model in which attendance depends on several factors like:

1. The franchise we are talking about. Clearly more people attend games at Boston than at Oakland.
2. The novelty of the stadium. It's been proved that when a team builds a new stadium... they will come.
3. The winning percentage in that year. The better the team, the better the attendance.
4. Whether the team reaches the postseason or not. (correlated to the previous one).
5. Strike. Strike reduced attendance.
6. Year. We can see if attendance has been increasing with time.
7. Recent success (whether over 3 years, 5 years or 10 years).

Pretty much those are the variables that can explain yearly attendance.

________________________________

So I ran a generalized linear model where the dependent variable is yearly total attendance and the regressors were:
1. Year (from 1990 to 2011)
2. Strike (1 in 1994 and 1995, 0 otherwise)
3. Winning% (the winning percentage of the team in that year)
4. Winning%MovingAverage (a 3/5/10 year moving average of the winning%)
5. Stadium Novelty (exponential weight depending on the year the new stadium opened).
6. PostSeason (1 if the team reached the Postseason that year, 0 if not)
7. Franchise (a categorical variable representing the team).
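As a rough illustration of how the regressors listed above could be laid out for one team-season, here is a minimal Python sketch. The function name `design_row`, the column names, and the idea of leaving one baseline franchise out of the dummy columns are assumptions for illustration, not necessarily how the model was actually set up. The sample values are the 2001 Mariners figures that appear later in the thread.

```python
# Sketch only: build one row of regressors for a team-season.
# Names and layout are illustrative assumptions, not the original setup.

STRIKE_YEARS = {1994, 1995}  # Strike dummy is 1 in these years, 0 otherwise


def design_row(team, year, win_pct, win_ma, novelty, postseason, teams, baseline):
    """Return (column_names, values) for one team-season.

    The franchise is dummy-coded: one 0/1 column per team except the
    baseline team, whose level is absorbed by the intercept.
    """
    names = ["Intercept", "Year", "Strike", "Win%", "Win%MA",
             "StadiumNovelty", "PostSeason"]
    values = [1.0, float(year), 1.0 if year in STRIKE_YEARS else 0.0,
              win_pct, win_ma, novelty, 1.0 if postseason else 0.0]
    for t in teams:
        if t == baseline:
            continue  # baseline team gets no column of its own
        names.append("Team_" + t)
        values.append(1.0 if t == team else 0.0)
    return names, values


if __name__ == "__main__":
    teams = ["Seattle", "Oakland", "Washington"]  # shortened list for the example
    names, values = design_row("Seattle", 2001, 0.716, 0.588, 0.729, True,
                               teams, baseline="Washington")
    for n, v in zip(names, values):
        print(n, v)
```

A regression routine would then take one such row per team and year, with yearly attendance as the response.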

Results will follow soon.
  • 3

#2
Pirata Morado


I won't bore you with all the technical details.

Just want to tell you that I tried 3 different models.

Model 1. Uses the 3 Year Moving Average of the Winning Percentage as a measure of "recent success". This is simply the average of the winning% for the current year and the two years before it, so it moves in time. So for example, the 2011 Moving Average of the Mariners = (.525 + .377 + .414) / 3 = .439 (2009 Win%, 2010 Win% and 2011 Win%).

Model 2. Uses the 5 Year Moving Average of the Winning Percentage. So it considers the last 5 years.

Model 3. Uses the 10 Year Moving Average.

All other variables remained the same in the 3 models.
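The moving average used in the three models can be sketched in a few lines of Python. The helper name `trailing_average` is made up for this example; the window includes the current season, as in the Mariners example above.

```python
def trailing_average(values, window):
    """Average of the last `window` values, current season included."""
    if window <= 0 or len(values) < window:
        raise ValueError("need at least `window` seasons")
    recent = values[-window:]
    return sum(recent) / window


if __name__ == "__main__":
    # Mariners Win% for 2009, 2010 and 2011, as quoted above
    mariners = [0.525, 0.377, 0.414]
    print(round(trailing_average(mariners, 3), 4))  # 0.4387
```

The same function with `window=5` or `window=10` would give the regressor for Models 2 and 3, provided enough seasons are supplied.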

Results.
All 3 models were significant, but in model #1 the current-year winning% was not significant at the 5% level (p-value = .0657), perhaps because Winning% and the 3 year moving average are highly correlated. For models #2 and #3, the current-year winning% was significant at the 5% level, which suggests it is less correlated with the longer moving averages.

The other variable with low significance was the PostSeason dummy, but it was still significant (p-value = .02).

The R^2, which measures the share of variance explained by the model, was highest for model #1.

Model, R^2
1, .7417
2, .7301
3, .7037

So, it seems like the 3 year moving average is better at predicting the Attendance.

Edited by Pirata Morado, 09 May 2012 - 06:38 PM.

  • 3

#3
Pirata Morado


I also tried 3 different exponential smoothing parameters for the novelty of the stadium: alpha1 = 0.05, alpha2 = 0.1, alpha3 = 0.2. What this does is give the most weight to the year in which the stadium opened, with the weight decreasing exponentially as time goes by.

So, for example, the weights go like this under the first parameter:
0.95
0.9025
0.8574
0.8145
0.7738
...

for the second parameter they go like this:
0.9
0.81
0.729
0.6561
...

for the third parameter they go like this:
0.8
0.64
0.512
0.4096
0.3277
...

so you can see that the weights decrease faster under the 3rd alpha.
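The weight sequences above look like powers of (1 - alpha). Here is a small Python sketch under that assumption; the function name `novelty_weight` and the convention that the opening season counts as year 1 are inferred from the listed numbers, not stated outright in the post.

```python
def novelty_weight(alpha, years_open):
    """Stadium novelty weight, assumed to be (1 - alpha) ** years_open.

    years_open is 1 in the season the park opens, 2 the next season, etc.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    if years_open < 1:
        raise ValueError("years_open starts at 1 in the opening season")
    return (1 - alpha) ** years_open


if __name__ == "__main__":
    for alpha in (0.05, 0.1, 0.2):
        weights = [round(novelty_weight(alpha, k), 4) for k in range(1, 6)]
        print(alpha, weights)
```

With alpha = 0.2 the fifth-season weight is already about a third of the opening value, while with alpha = 0.05 it is still above 0.77.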

Results.
Once I knew that Model 1 above had the highest R^2, I only tried the 3 different alphas under that one.

Alpha, R^2
1, 0.7417
2, 0.7559
3, 0.7555

So, it seems as though the model with alpha = 0.1 works slightly better.

Edited by Pirata Morado, 09 May 2012 - 06:54 PM.

  • 3

#4
Pirata Morado


So, one first conclusion we can arrive at is that all these variables explain roughly 75% of the yearly variance in attendance.

I also noticed that the Florida and the Tampa Bay parameters weren't significant under any model (Florida p-value = 0.3698, Tampa Bay p-value = 0.4971, all other <.0001).

__________

So the next interesting thing to analyze is the set of regression coefficients for each variable, which follow:

Variable, Parameter
B0 (intercept), -28,161,802
Year, 13,269.96
Strike, -499,461
Winning%, 618,022
3YearWin%MovingAverage, 5,462,943
StadiumNovelty, 1,226,575
PostSeasonDummy, 100,954
Arizona, 775,465
Atlanta, 632,983
....
Seattle, 804,966
...

The p-value of Winning% is .1114, so it is not significant.
  • 3

#5
Huindekmi

Huindekmi

    Muppet Extraordinaire!

  • Line Drive Boosters
  • 5,499 posts
  • Gender:Male
  • Location:West Seattle
Other factors which might have a measurable effect:
- local economic conditions. Looking at the unemployment rate and average income for a city plotted against the average ticket price.
- competition. Are there other baseball teams or sports with an overlapping season which have a higher local popularity?

Oakland has as rich and storied a history as many other big franchises. They were THE team in the early 70's. But with local competition from the Giants, they are starting from a weak position.

Detroit has one of the weakest economies in the nation. How does that affect their attendance?
  • 1



#6
Pirata Morado


Other factors which might have a measurable effect:
- local economic conditions. Looking at the unemployment rate and average income for a city plotted against the average ticket price.
- competition. Are there other baseball teams or sports with an overlapping season which have a higher local popularity?

Oakland has as rich and storied a history as many other big franchises. They were THE team in the early 70's. But with local competition from the Giants, they are starting from a weak position.

Detroit has one of the weakest economies in the nation. How does that affect their attendance?

Interesting suggestions, Huindekmi. The difficult part would be finding all those variables for each team and each year.
  • 2

#7
Pirata Morado

I decided to run the model again but WITHOUT the effect of the actual year winning percentage.

These are the parameters:

Intercept: -27,645,125
Year: 13,054
Strike: -496,805
3YearWin%MA: 5,899,364
StadiumNovelty: 1,222,494
PostSeason: 139,188
Arizona: 771,370
Atlanta: 629,310
Baltimore: 1,338,687
Boston: 913,474
Cincinnati: 531,032
Cleveland: 576,181
Colorado: 1,416,095
Chi Cubs: 1,369,389
Detroit: 620,950
Florida: 105,065 *** (p-value = 0.374 not significant)
Houston: 583,011
Kansas City: 567,395
LA Angels: 1,076,424
LA Dodgers: 1,677,318
Milwaukee: 616,390
Minnesota: 440,434
NY Mets: 959,585
NY Yankees: 1,068,593
Oakland: 249,748
Philly: 840,866
Pittsburgh: 348,764
SD: 623,270
Seattle: 801,995
SF: 808,816
White Sox: 330,274
StL: 1,180,503
TB: 83,770 *** (not significant, p-value = 0.5218)
Texas: 832,011
Toronto: 882,722
Washington (Montreal): 0
  • 2

#8
Pirata Morado

Be careful how you interpret those parameters. Washington (Montreal) has a coefficient of 0, so it appears to be the baseline franchise and every other team coefficient is relative to it. The intercept plus a team's coefficient is the "estimate" for that team in year 0 with a 0.000 3 year moving average winning%.

So for instance, look at the M's:

Add the intercept (-27,645,125) and the M's coefficient (801,995) to get -26,843,130 in yearly attendance (yes, negative). Then add 5,899,364 times the 3 year moving average winning%. Assuming a .500 average, that is half of the parameter, so you add 2,949,682 more people to get to about -23,893,449. Then add the Year coefficient times the year. For 1995 that is 26,043,528 (this uses the unrounded Year coefficient, about 13,054.4), which added to the negative number before gives a 1995 estimate of 2,150,079. You can also add the PostSeason coefficient of 139,188, because the M's reached the postseason in 1995, so the estimate becomes 2,289,267. Note that 1995 is also coded as a strike year in this model, so a full estimate would add the Strike coefficient (-496,805) as well, giving about 1,792,462.

Edit: I guess the explanation above is not very clear.

Perhaps this example helps:
Team: M's: Coefficient: 801,995
Year: 2001: 13,054.4 * 2001 = 26,121,854 (the Year coefficient is shown rounded to 13,054 in the list above)
PostSeason: 1 * 139,188 = 139,188
3YearMovingAverage: .588 = (.488 + .561 + .716) / 3, so it is .588 * 5,899,364 = 3,468,826
StadiumNovelty = .729 (third season since SafeCo opened) * 1,222,494 = 891,198
Strike: 0 (no strike year) * -496,805 = 0
Intercept = -27,645,125
Sum of the above = 3,777,936. The Mariners drew about 3.5 million in 2001, so it is a pretty good estimate (about 8% high).
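The sum above can be written as a small Python function. This is only a sketch: the name `predict_attendance` and the dictionary layout are made up for illustration, and the coefficients are the rounded values from the parameter list, so the result differs slightly from the figure quoted in the post.

```python
# Rounded coefficients from the model without current-year Winning%
COEF = {
    "intercept": -27_645_125,
    "year": 13_054,
    "strike": -496_805,
    "win_ma3": 5_899_364,
    "novelty": 1_222_494,
    "postseason": 139_188,
}
TEAM = {"Seattle": 801_995, "Pittsburgh": 348_764}  # two teams, for brevity


def predict_attendance(team, year, win_ma3, novelty=0.0, postseason=0, strike=0):
    """Sum of intercept, team effect and each coefficient times its regressor."""
    return (COEF["intercept"]
            + TEAM[team]
            + COEF["year"] * year
            + COEF["strike"] * strike
            + COEF["win_ma3"] * win_ma3
            + COEF["novelty"] * novelty
            + COEF["postseason"] * postseason)


if __name__ == "__main__":
    estimate = predict_attendance("Seattle", 2001, win_ma3=0.588,
                                  novelty=0.729, postseason=1)
    # Lands within about 1,000 of the 3,777,936 in the post; the gap
    # comes from using rounded coefficients here.
    print(round(estimate))
```

Passing `strike=1` for 1994 or 1995 subtracts the strike effect, and leaving `novelty` at 0.0 covers seasons with no new ballpark (assuming the novelty regressor is 0 in that case).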

Edited by Pirata Morado, 09 May 2012 - 08:02 PM.

  • 2

#9
Pirata Morado

Another example: estimating the attendance of the 2004 Pirates.

Team: Pittsburgh = 348,764
Year: 2004, 2004 * 13,054.4 (unrounded Year coefficient) = 26,161,018
PostSeason: 0
MovingAverage = 0.452 (rounded) * 5,899,364 = 2,669,212
StadiumNovelty = 0.6561 (4th year of PNC) * 1,222,494 = 802,078
Intercept = -27,645,125
Sum of the above = 2,335,947. The 2004 Pirates actually drew 1,583,031, so here the model overestimates by about 750K.
  • 2

#10
Pirata Morado

Some interesting conclusions so far:

1. There's evidence that attendance increases every year (a global MLB trend) by about 13,054 people per team.
2. The strike (1994, 1995) had a negative impact of almost half a million people for each team (-496,805).
3. The current year (matching) winning% is not significant in driving attendance (which makes sense since you don't know in advance what it will be, so people go to the stadium later), but rather...
4. A better way to explain attendance is by using the 3 year average of the winning%, so people tend to go by how good the team has been "recently" more than by the current year alone.
5. If a team won all its games in those last 3 years, it would draw 5,899,364 more people in a year, or perhaps easier to say: a .500 3-year average team will draw 2,949,682 more people than a 0.000 team.
6. When you open a new stadium you can expect about 1,222,494 more people per year at full novelty weight, but this will decrease exponentially each year away from the opening of your new stadium.
7. If you reach the postseason you can expect 139,188 more tickets sold in that year than if you don't reach it.
8. The teams with the biggest fanbases (relative to the Washington baseline) are: Dodgers, 1.7M, Colorado, 1.4M, Cubs: 1.4M, Baltimore: 1.3M, Cardinals: 1.2M
9. The teams with the smallest fanbases are: White Sox, 330K, Oakland: 250K, Florida: 105K, TampaBay: 84K, Washington: 0 (the baseline)
10. The Mariners fanbase ranks in the middle (14th) with a 801,995 "fanbase".
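The fanbase ranking in the list above can be checked by sorting the franchise coefficients from the earlier post. A short Python sketch follows; the variable and function names are arbitrary, and the numbers are copied from the parameter list of the model without current-year Winning%.

```python
# Franchise coefficients copied from the parameter list earlier in the thread
TEAM_COEF = {
    "Arizona": 771_370, "Atlanta": 629_310, "Baltimore": 1_338_687,
    "Boston": 913_474, "Cincinnati": 531_032, "Cleveland": 576_181,
    "Colorado": 1_416_095, "Chi Cubs": 1_369_389, "Detroit": 620_950,
    "Florida": 105_065, "Houston": 583_011, "Kansas City": 567_395,
    "LA Angels": 1_076_424, "LA Dodgers": 1_677_318, "Milwaukee": 616_390,
    "Minnesota": 440_434, "NY Mets": 959_585, "NY Yankees": 1_068_593,
    "Oakland": 249_748, "Philly": 840_866, "Pittsburgh": 348_764,
    "SD": 623_270, "Seattle": 801_995, "SF": 808_816,
    "White Sox": 330_274, "StL": 1_180_503, "TB": 83_770,
    "Texas": 832_011, "Toronto": 882_722, "Washington": 0,
}

# Largest coefficient first
ranked = sorted(TEAM_COEF, key=TEAM_COEF.get, reverse=True)


def rank_of(team):
    """1-based position of a team in the sorted coefficient list."""
    return ranked.index(team) + 1


if __name__ == "__main__":
    print("Top 5:", ranked[:5])
    print("Bottom 5:", ranked[-5:])
    print("Seattle rank:", rank_of("Seattle"))  # 14
```

Keep in mind that these coefficients are offsets relative to Washington (Montreal), and that the Florida and TB values were not statistically significant.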

Edited by Pirata Morado, 09 May 2012 - 08:37 PM.

  • 2

#11
Pirata Morado

I reviewed the data today while I was making a chart, and noticed that the 1999 attendance figure for the M's was wrong, perhaps because it only had either the Kingdome or the SafeCo figure (remember they moved in the middle of the season). So, I've changed it and ran the analysis once more. Things improved a little bit, but the coefficients are a bit different.

R^2 is now .7636, so a little bit better than before (76% of the variance in attendance is being explained by the model).

Coefficients are now:

Intercept: -27,245,815
Year: 12,845
Strike: -501,776
3YearMovingAverageWinning%: 5,928,636
StadiumNovelty: 1,258,165
PostSeason: 133,018
Mariners: 869,308
....

So you can see that coefficients varied just a little bit, the results hold.

__________________

Now, perhaps a picture is worth a thousand words, so here's a Time Series chart of the actual attendance and the predicted one based on this model for the Mariners:

Posted Image

Some things worth noticing:

1. You can see that the model does a very nice job of estimating the actual yearly attendance.
2. Notice the sharp dip in 1994, the year of the strike.
3. Then we see an upward trend thanks to "successful" seasons.
4. We can see a spike in 1997 thanks to the PostSeason.
5. We see a hill in the 2000-2003 period, thanks to the 2000 and 2001 seasons.
6. From then on it has been a downward trend.
7. We can also notice that the SafeCo Field effect seems to overestimate attendance; see that the teal line crosses the navy one precisely there in 1999.

Edited by Pirata Morado, 10 May 2012 - 07:55 AM.

  • 2

#12
Pirata Morado

Another example chart, the Saint Louis Cardinals:

Posted Image

We can see here the yearly increasing trend, with some bumps in the road (1994-1995). In 2006 we see an overestimation due to the new park and the PostSeason. Seems like the Cards fans are indeed very loyal, but don't have homerism.
  • 2

#13
Pirata Morado

One more chart, the Chicago White Sox:

Posted Image

Notice first the lower general level compared to the M's and Cards (the limits on the Y-axis are the same in all charts).
Notice now the peak in 1991 due to their new stadium; it gets predicted quite well by the model.
This time, it seems like the model underestimated the effect of the good 2005 season they had (champions), because we can see a spike in attendance in 2006 that wasn't reflected in the model.
  • 2

#14
Sandy - Raleigh

Sandy - Raleigh
  • Members
  • 2,730 posts
  • Gender:Male
  • Location:Raleigh, NC
  • Interests:Songwriting, spiritual growth, sports, family, puzzle solving
Maybe I'm missing it ... but I don't see any control for stadium capacity.

Obviously, there is a maximum total number of seats possible at every stadium - (an upper bound) - which is not a constant. I would think this would significantly influence the upper bounds of attendance spikes. (No team can actually get 5-6 million in attendance, though it seems your model could predict that with a 162-win team).
  • 0

#15
Pirata Morado


Maybe I'm missing it ... but I don't see any control for stadium capacity.

Obviously, there is a maximum total number of seats possible at every stadium - (an upper bound) - which is not a constant. I would think this would significantly influence the upper bounds of attendance spikes. (No team can actually get 5-6 million in attendance, though it seems your model could predict that with a 162-win team).

You're right, Sandy, there's no upper bound. Perhaps it is self-limiting, since no team has a Moving Average Winning% that big.
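One simple way to picture the capacity point is to cap a predicted total at seats per game times home dates. The Python sketch below is purely illustrative and is not part of the model in this thread: `capped_attendance` is a made-up helper, and the 47,000-seat capacity is a hypothetical number, not any real park's figure.

```python
HOME_GAMES = 81  # home dates in a full 162-game schedule


def capped_attendance(predicted, seats_per_game, home_games=HOME_GAMES):
    """Limit a predicted yearly total to what the park can physically hold."""
    ceiling = seats_per_game * home_games
    return min(predicted, ceiling)


if __name__ == "__main__":
    # Hypothetical 47,000-seat park: the yearly ceiling is 47,000 * 81
    print(capped_attendance(5_000_000, 47_000))  # 3807000
    print(capped_attendance(2_000_000, 47_000))  # 2000000
```

A cap applied after the fact like this does not change the fitted coefficients; it only keeps extreme predictions within a physically possible range.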
  • 2

#16
Pirata Morado

Another alternative is to run separate analyses for each franchise, instead of using Franchise as a regressor.

I ran this, and I want to show you some neat charts. The difference here is that we get different coefficients for each regressor for each team, but we lose degrees of freedom (we only use the sample size of each team).

Anyway, I thought these charts look very nice.

Milwaukee:

Posted Image

Notice the big peak in 2001 due to Miller Park, which isn't being reflected in the new model, but you can see the yearly trend and the improvement in the team's winning percentage driving the attendance up from 2003 onwards.

By the way, the coefficients were:

Intercept: -108,925,437
Year: 52,651
Strike: -225,113
3YearMovingAverage: 11,406,979
StadiumNovelty: 1,604,437
PostSeason: 109,523 *** (not significant, p-value = .4459)

What this means is that PostSeason doesn't have a statistically significant effect on attendance for the Brewers. Their yearly increase in attendance is about 52K per year, the strike cost them only about a quarter million, and their stadium novelty brought them 1.6 million more.

By the way, the R^2 for the Brewers was .9559, a very high percentage of variance explained (I picked the teams with the highest values).

Edited by Pirata Morado, 10 May 2012 - 09:23 AM.

  • 1

#17
Pirata Morado

Another neat example are the Phillies:

Posted Image

Notice how well 1993 gets predicted (their first postseason in years), but then look also at how well the downward trend due to bad play gets predicted.

Again notice the big peak their new stadium brought them in 2004, which isn't perfectly predicted by the model, but notice how well the model predicts the upward trend once they started playing "better" baseball (higher 3 year Moving Average).

Coefficients:

Intercept: 65,408,688
Year: -33,987
Strike: -219,122 *** (not significant, p-value = .2436)
3YMAWin%: 9,367,844
StadiumNovelty: 977,488
PostSeason: 780,556

Seems like the Strike wasn't significant for them, and notice the negative yearly trend, meaning that fewer people go there as time goes by (has recession affected Philly badly?). But they seem to suffer a lot of homerism. Their 3Year Coeff is very high, meaning that they go if their team is good, and look also at their big PostSeason Coeff: they draw about 3/4 of a million more when they get to the playoffs. Their new stadium drew only 977K more, about 60% of the Brewers' 1.6 million. The R^2 for the Phillies is 0.9584, also quite good.

Edited by Pirata Morado, 10 May 2012 - 09:33 AM.

  • 1

#18
Pirata Morado

Now let me show you our beloved M's once again, but this time with a model fitted to their data alone instead of pooled with all of MLB.

Posted Image

First thing you can see is that it predicts better than the previous one (lower error in prediction), particularly notice the strike effect, beautifully captured.

Again we see the hill in the 2000-2003 period of time, but now the prediction fits better. The peak in 1997 is also quite noticeable due to the postseason.

Coefficients:

Intercept: -51,238,281
Year: 25,054
Strike: -851,534
3YMAWin%: 7,046,137
StadiumNovelty: 591,446
PostSeason: 325,968

Big strike effect for us, of almost a million. PostSeason is big for us, driving 325K more in those years. Stadium Novelty doesn't seem as big as for the other 2 teams, since SafeCo seemed to draw only a bit over half a million more fans. The R^2 for the M's model is 0.8741, lower than the Phillies and Brewers, but still better than the previous model.
  • 1

#19
Pirata Morado

Another way to compare models is by comparing their Root Mean Squared Errors, or RMSE for short. Obviously, the lower the error, the better the model.

In the whole MLB model (with Franchise as a regressor), the RMSE was 369,692, meaning that the model's yearly estimates were typically off by about 370K (an error can be an underestimation or an overestimation).

For the individual models, the Milwaukee model has an RMSE of 152,881, clearly much better (less than half that of the MLB model).
The Phillies model had an RMSE of 174,913, again better than the combined model.
The Mariners individual model had an RMSE of 277,617, worse than the Phillies and Brewers, but better than the combined model.

So the Mariners model typically misses the prediction by a bit over a quarter million fans, not bad.
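For anyone who wants to compute this kind of error themselves, here is a minimal Python sketch of RMSE, plus the ratios between the RMSE values quoted above. The function name `rmse` and the variable names are made up for the example; the single-season check uses the 2004 Pirates numbers from earlier in the thread and is not the RMSE of any of the models.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # One season only: 2004 Pirates, actual vs. the estimate from the thread
    print(rmse([1_583_031], [2_335_947]))  # 752916.0

    # RMSE values quoted in the post, relative to the combined MLB model
    combined = 369_692
    for name, value in [("Brewers", 152_881), ("Phillies", 174_913),
                        ("Mariners", 277_617)]:
        print(name, round(value / combined, 2))
```

The ratios come out at roughly 0.41, 0.47 and 0.75 of the combined model's error.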
  • 1

#20
mabalasek

mabalasek
  • Members
  • 129 posts
i just found out that i cant give all my cpoints to you :D you deserve all of it pirata :D good job! im thoroughly impressed :D
  • 0



