Bowl Championship Series: BCS

By: Scott Moore
January 17, 2012 · Posted in statistics · Comment 

Issue

The Bowl Championship Series (BCS) ranking process is a failure by any measure. The good news is that it finally appears the powers-that-be are going to work out a playoff system. But what is the root cause of the problematic BCS rankings? Why don’t they work? And what type of numerical system might meet the needs of a college football ranking system?

Statistics: Or Lack of!

A cursory review of BCS statistics quickly identifies the main problem, which is that the people who created these “methods” do not appear to use any form of statistics. Further limiting the public’s understanding of these data is that the methods used to calculate rankings are not available. In other words, they have not been peer-reviewed in any meaningful way – and subscribe to the “trust me” method!

We know the accuracy is questionable at best or scandalous at worst, since we never read or hear about odds, confidence intervals, error, probability or other common statistical references when referring to these data. We also know intuitively that around each number there is error. If the error is not displayed, we can trust neither the numbers nor the authors – hence the ruckus around these rankings.

The Champ: Play-off

The great thing about a playoff for college football, like every other major sports league, is that you know the answer at the end. The best team on that day is the final one standing. End of debate. Rodney Harrison was recently asked who he liked in the NFL playoff and his answer was that it is hard to estimate since anything can happen in a playoff game. Well said. The challenge with a college playoff system is not that it wouldn’t work, because it would. Rather, it cuts the number of bowl games in half. Ouch, that is a lot of lost revenue!

The Champ: Numerical Calculation

I will disclose my bias for a playoff system since, as Rodney stated, anything can happen. But I believe there is likely a method that would, in fact, provide a numerical answer that most would agree with. First, the method needs to be made public, and it should be a method that has a history of success. “Odds” are, of course, one system, but in reviewing the odds estimates for the BCS championship game, there were many conflicting estimates with some odds makers suggesting a difference of only a point or two. In other words, it was too close to call.

Odds is an interesting process (better than the “look what I made up” numerical process), but probability estimates are the only real tool we have that could pick a winner. Odds and probability sound similar but in fact are quite different. The difference:

  • Probability is used to express sensitivity, specificity and predictive value. It is the proportion of people in whom a particular characteristic, such as a positive test, is present.
  • Odds is the ratio of two complementary probabilities. (PDF)
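
To make the distinction concrete, here is a minimal Python sketch of the conversion between the two. The function names are mine, chosen for illustration; they do not come from the linked PDF.

```python
def probability_to_odds(p):
    """Odds in favor: the ratio of a probability to its complement."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)


def odds_to_probability(odds):
    """Invert the ratio: odds of 3 (3 to 1) correspond to a probability of 0.75."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)
```

A probability of 0.5 is odds of 1 (even money), so a game that is “too close to call” sits near odds of 1 for either team.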

Along the probability line is a process called Evidence Based Management (EBM) which uses Bayesian analysis.

Bayes Theorem: a statistical principle for combining prior knowledge of the classes with new evidence gathered from data. (See Introduction to Data Mining, Chapter 5, pp. 228-229) (PDF)

EBM with Bayesian analysis states: what was thought before the test was done, combined with the test result, yields what is thought after the test result. In other words, what you thought you knew before the football contest, combined with the game itself, yields what you think afterward – the “LSU is still No. 1” syndrome! It is this process that could provide an answer to who is No. 1 regardless of the date, time or opponent,* effectively removing the Rodney effect, but not likely the debate!
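
As a minimal sketch of that before-and-after update, the Python below applies Bayes’ theorem to a single game. Every number in it is hypothetical, and the assumption that the better team wins 70 percent of the time is mine, made only so the arithmetic has something to work on.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior P(H | E) from the prior P(H) and the two likelihoods."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1 - prior) * p_evidence_if_false
    return numerator / denominator


# Hypothetical numbers, for illustration only.
# Before the game we believe Team A is the better team with probability 0.60.
# Assumption: the better team wins any single game 70% of the time.
prior = 0.60
posterior_after_win = bayes_update(prior, 0.70, 0.30)   # Team A won
posterior_after_loss = bayes_update(prior, 0.30, 0.70)  # Team A lost
```

Under these made-up inputs, a win moves the belief from 0.60 to about 0.78 and a loss drops it to about 0.39. One game shifts the estimate but does not settle it, which is the Rodney point in numerical form.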

Conclusion

I am not sure that the BCS question is all that important or worth a lot of time in the context of solving the world’s problems, but if we are going to do the math, let’s at least try to make the process transparent, thoughtful and based on some sort of peer-reviewed science. Frankly, that is the only way my team will EVER have a chance at a BCS championship!

*Note: I do not address “style” points: a non-sportsmanship concept.

Gini Coefficient

By: Scott Moore
January 10, 2012 · Posted in statistics · Comment 

Issue

The Gini Coefficient, developed by the Italian statistician Corrado Gini, is the most commonly used measure of inequality. The coefficient varies between 0, which reflects complete equality, and 1, which indicates complete inequality (one person has all the income or consumption, all others have none). (The World Bank) We wanted to use this method to look at income distribution throughout South Carolina, but first we had to understand the formula.

At first glance, there is a fair amount of math needed to calculate the coefficient. Make no mistake, this is and can be a very complex formula, utilizing probability sampling, bootstrapping, confidence intervals and other statistical methodology. We, however, tried to keep it applied, and therefore used the most basic variation:

Gini Formula

After sorting out the symbolism, we created a sample problem (PDF). The sample problem allowed us to work through the math in a structured process. The value of “doing the math” is that one gains an understanding of how different variables affect the formula. The PDF contains two versions of the sample problem, one showing the formula and the other with plugged numbers. Note that, unlike most of the available examples, we show a calculation needed prior to using the formula: in this case, (dollars strata) TIMES (number of persons). That’s because the analyst may need to do a number of calculations prior to applying the formula.
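
For readers without the PDF, here is a hedged Python sketch of one common basic variant for grouped data: the trapezoid approximation of the Lorenz curve. It is not necessarily the variation used in the sample problem, and the function name and input layout are assumptions made for illustration.

```python
def gini_from_strata(strata):
    """Gini coefficient from (average_dollars, persons) pairs.

    Trapezoid approximation of the Lorenz curve:
        G = 1 - sum((X_k - X_{k-1}) * (Y_k + Y_{k-1}))
    where X is the cumulative share of persons and Y is the
    cumulative share of income, lowest stratum first.
    """
    strata = sorted(strata)  # lowest income stratum first
    # The preliminary step: (dollars strata) TIMES (number of persons).
    totals = [dollars * persons for dollars, persons in strata]
    total_persons = sum(persons for _, persons in strata)
    total_income = sum(totals)

    gini = 1.0
    cum_persons = 0.0
    cum_income = 0.0
    for (_, persons), income in zip(strata, totals):
        prev_x, prev_y = cum_persons, cum_income
        cum_persons += persons / total_persons
        cum_income += income / total_income
        gini -= (cum_persons - prev_x) * (cum_income + prev_y)
    return gini
```

Two strata with the same average dollars give a coefficient of 0. Raising only the average of the top stratum, with the head counts unchanged, pushes the coefficient up.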

The Formula: Results

We applied the formula to the classic income distribution (wealth share) problem, using Census, Household and Family Income Report B19001, for each county in South Carolina. These data have 16 income strata. We found the formula is particularly sensitive to changes in the top two strata: not necessarily to the number of persons, but to the average dollar value. In other words, “the tail wags the dog” in this formula. The other critical piece of information needed is what value to assign the highest stratum. The census uses approximately $400,000 as an approximation for the average top-stratum dollar figure. They calculate this number using volumes of data, so it’s good enough for me.

After making our calculations, the formula really did reveal a number of interesting trends. One, the impact of the economy on higher wage earners – in the case of these data – is very delayed. In other words, higher income households continued to make money well into the latest recession. The other revealing attribute is the effect of a rising tide. A rising tide does in fact lift boats, but some higher than others, and in the process it also sinks a few! In this case, households with higher incomes grew at a proportionally higher rate than those with lower incomes, and in some counties, household income (high and low) was hit particularly hard.

Conclusions

Now that you understand the formula, if you use these data, the Census Bureau has already done the Gini Coefficient income calculations for you! Yes, to my surprise, the Bureau has been doing this calculation since the 1990s. The file is B19083. It may sound like I have given you a shortcut, but now you have to figure out the new GUI American Community Survey interface. Good Luck!

Acknowledgement: Thank you to the staff at the US Census Bureau for assisting me in understanding key drivers of the Gini Coefficient.

JMP 9.0 – Applied Data Analysis

By: Scott Moore
December 18, 2010 · Posted in statistics · Comment 

JMP (jump): The Sharpest Tool in the Shed

I have been a JMP fan ever since being introduced to the product through the University of Minnesota statistics department. I have used a number of statistical programs over the years, but JMP is a perfect fit for the wide range of data analysis work I perform for customers.

The Problem – What You Don’t Know

I have found that even simple data sets can, and do, hide their secrets effectively. In fact, it is amazing what we don’t know about even the most basic data sets unless the data is run through a statistical package. Here is an example (PDF) of state population estimates for 2009. There are only 50 data points. The tool used here is the basic and easy distribution analysis in JMP. Of the 50 states, four actually have populations considered outliers in the data set. The median population is about 4.1 million. It would seem that these four states would  be easy to identify, but that’s not necessarily true. I thought Wyoming, with a small population at 544,000, would also be an outlier – but that’s not so. All this information is at your fingertips with the click of a button.
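
For readers without JMP, the sketch below shows the 1.5 × IQR rule commonly used to flag outliers on a box plot. I am assuming this is close to what the distribution report does; the function name is hypothetical and the populations (in millions) are made up, not the 2009 estimates.

```python
from statistics import quantiles


def find_outliers(values):
    """Return values beyond 1.5 * IQR from the quartiles (box plot rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical populations in millions, for illustration only.
populations = [0.5, 1.3, 2.9, 4.1, 5.6, 6.5, 9.9, 37.0]
outliers = find_outliers(populations)
```

With these made-up numbers only the largest value is flagged. The smallest is not, because the lower fence falls below zero when the data are skewed to the right, which is the same reason a very small state need not be an outlier.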

A public example of a more complex data set is collected by NOAA. Here we are taking a small sample (416K points) of Sea Level Pressure Data and plotting it. JMP 9.0 makes short work of this data set.

Setting Expectations

One of the best attributes of any statistical package is helping users understand their hypotheses, or assumptions about some aspect of the world. Each of us creates hypotheses every day as a part of life. Estimating commute times to work is one example. JMP allows us to not only think more clearly about these everyday data interactions, but to test them if we so desire. The hypothesis test is a statistical approach to testing a theory, according to The Economist, Numbers Guide. This test, however, is not necessary to increase the basic understanding of data you are responsible for, whether it is financial, engineering, manufacturing, marketing, medical or administrative.

JMP – World Applications

JMP’s latest magazine describes uses of the program. (PDF) It is used in clinical trials, consumer products, product development and, of course, manufacturing. In today’s competitive marketplace, analyzing your data with a statistical package has become a business fundamental – much like cash flow. It’s something every business should have and deploy as part of an effective business strategy.

South Carolina Lazy? I don’t think so!

By: Scott Moore
August 6, 2010 · Posted in statistics · Comment 

Lazy:  When Noise Interferes with the Signal

Recently the Post and Courier ran an article highlighting a Business Week analysis that said South Carolina was the eighth laziest state in the union! Typically, subjective words used to describe data pop a red flag that warns me of impending data misuse doom.

The Data Set

The American Time Use Survey (ATUS) measures the time people spend doing various activities such as work, childcare, housework, watching television, volunteering and socializing. Hence this is an activity survey, not a lazy survey. The data are collected by the Census Bureau and sponsored by the Bureau of Labor Statistics (BLS). I ran a query to understand the nature of the survey, data availability and error rates. I called in the big guns from Global Pragmatica LLC to assist in converting the data from an ASCIDAT file to my JMP statistical software package format. These folks are experts in scripting and were a huge help. Thank you!

These data are collected regionally but analyzed nationally.  There is about a 90-percent chance, or level of confidence, that an estimate based on a sample will differ by no more than 1.6 standard errors from the “true” population value because of sampling error.   No estimates are made for state level data, and one University of Minnesota analyst stated she was not aware of state level error estimates.
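
The 90-percent statement can be sketched in a few lines of Python. The function name is hypothetical, and 1.645 is the standard two-sided 90 percent critical value of the normal distribution, which the sentence above rounds to 1.6.

```python
def confidence_interval_90(estimate, standard_error):
    """Return the (low, high) bounds of a 90% confidence interval."""
    z = 1.645  # two-sided 90% critical value of the standard normal
    margin = z * standard_error
    return estimate - margin, estimate + margin


# Hypothetical estimate of 20 minutes with a standard error of 2 minutes.
low, high = confidence_interval_90(20.0, 2.0)
```

Without a published standard error for a state, there is nothing to plug in as the second argument, which is exactly the problem with state-level comparisons.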

It is inappropriate to analyze these data at the state level without calculating the error inherent in the data. An analysis that skips this step would be interesting but useless when comparing one state to another. Why?

Sports Activity Variable Analysis

For a test sample, I chose state-level geography, with sports as a variable activity. This category captures the respondent’s participation in sports, exercise and recreational activities. To extract the data from the system, I used a tool created by the University of Minnesota called the American Time Use Survey-X. The data need to be processed by a statistical package, in this case my JMP program. An analysis of people participating in sports activities indicates that South Carolina would rank 22nd out of 50 states in terms of average minutes spent participating in sports in a 24-hour period – not bad. However, upon further inspection of South Carolina’s 2009 detailed weighted data, the state could rank anywhere from 12th to 23rd, based on national error rates! (PDF) Unfortunately, since these are state data, the results are meaningless. That’s because the sample is simply too small, which is one of many buried statistical problems. This 2009 sample included a total of 200 people, of whom 166 recorded zero sports activity minutes. (PDF) In fact, the median is zero, which is another red flag for this data set. A review of other states’ data revealed the same issue. This is a fascinating national data set. But unfortunately, analysis of non-national geographies yields unreliable results.
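
A quick Python sketch shows why the median has to be zero here. The counts of 200 and 166 come from the sample described above; the 60 minutes assigned to the remaining 34 respondents is a hypothetical placeholder, since the actual values are only in the PDF.

```python
from statistics import median

zeros = [0] * 166    # respondents reporting no sports minutes (from the post)
active = [60] * 34   # hypothetical minutes for the other 34 respondents
sample = zeros + active

sample_median = median(sample)            # middle of the sorted sample
sample_mean = sum(sample) / len(sample)   # driven entirely by 34 people
```

Because more than half of the respondents report zero, the median is zero no matter what the other 34 report, and the state “average” rests on just 34 people.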

Real Estate and In-Migration

By: Scott Moore
July 30, 2010 · Posted in statistics · Comment 

The Post and Courier covered a local real estate economist’s presentation on the Real Estate Recovery.  Core to any real estate recovery is, of course, employment and wage growth.  However, a key statistic overlooked in this presentation was  migration patterns.  I had mentioned in my June Unemployment post that areas such as Detroit were having problems as a result of a declining labor force. This map from Forbes graphically displays the migration problems Detroit is having.

But when you click on Berkeley, Charleston, or Dorchester counties, a picture of in-migration emerges. This is an important indicator of growth potential because people have jobs when they move here, have decided to collect transfer payments (retirement) in this region or believe there is potential for work in the area.

Another important statistic this map displays is how our rural population is moving to metro areas (short black lines).  This is important for two reasons: 1) unemployed people may have  the opportunity to find work and 2) if they find work, the state increases its tax base while decreasing social services.

Unlike the economist quoted in the article, I predict our real estate growth will be better than the median national real estate growth, primarily because of in-migration. This is not to say it will be even close to the bubble years (when we had an unrealistic and unsustainable market), but we should see steady improvement as a result of our region’s possibilities.

I am bullish, for a change. I do believe we have significant control over our own growth, since the most important contributors to growth and sustainability include education, health care, public safety, urban planning, convenience and infrastructure (including biking and walking trails), all of which are within our control.

Thank you to Keihly Moore for her assistance with this article.

Reverse Pivot Table: Matrix → xyz Format

By: Scott Moore
July 8, 2010 · Posted in statistics · Comment 

I like to add a few technical tools now and again. Here is a sweet piece of programming that could save time converting a matrix to an xyz table. The surface plot on my web page can be created by converting matrix data to an xyz format.

Issue

The problem I often run into with Excel spreadsheets is that the data is defined in a matrix. Sometimes it is more convenient to reorganize the data with a pivot table in order to represent the data as xyz coordinates. At first glance it appears this should be an easy task, but without the right Excel module or the full version of SQL – forget it.

Solution

A solution to this problem is provided by The Spreadsheet Page: a reverse pivot table. The link does an excellent job of explaining the process. At the bottom, a VBA link allows you to copy the code into your Excel application. A big thank you to these guys for sharing this – it saved me many hours of work.
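
For those working outside Excel, the same reshaping can be sketched in plain Python. This is not the VBA from The Spreadsheet Page; the function name and the input layout (header labels as x, first cell of each row as y) are assumptions made for illustration.

```python
def matrix_to_xyz(column_labels, rows):
    """Unpivot a matrix into a list of (x, y, z) tuples.

    column_labels: the header cells (x values), excluding the corner cell.
    rows: each row is [row_label, value_1, value_2, ...],
          where row_label is the y value and the rest are z values.
    """
    xyz = []
    for row in rows:
        y, values = row[0], row[1:]
        for x, z in zip(column_labels, values):
            xyz.append((x, y, z))
    return xyz


# A hypothetical 2 x 2 matrix: x labels 1 and 2, y labels 10 and 20.
table = matrix_to_xyz([1, 2], [[10, 5.0, 6.0], [20, 7.0, 8.0]])
```

Each cell of the matrix becomes one row of the xyz table, so a matrix with m rows and n columns of values yields m × n rows.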

Survey Monkey

By: Scott Moore
March 2, 2010 · Posted in statistics · Comment 

One of the easiest methods to collect survey data is Survey Monkey (SM). What many do not realize is that this tool is a cost-effective way to collect simple everyday samples. Who wants to go to lunch? Give us some feedback on the meeting? SM allows 10 questions for free. It is surprising the amount of data (no relation to quality) one can capture in 10 questions.

I have also used SM for larger research projects. OK, yes, there are issues with online accessibility. As an example, in SC 40 percent do not have a computer at home, so one needs to know the subject and audience to ensure data is not unintentionally skewed – you all know the rules!

When this process is appropriate, I typically supplement the online survey with a phone call (a reminder, especially when time is an issue) and spend measured time confirming emails and contacts. This process, however, saves a significant amount of time in the end, especially if it is a survey that is repeated time and time again. SM output quality is quite high – as good as YOUR process. SM provides a simple but effective interface to do what you need.

Groundhog Day- Forecasting Made Simple!

By: Scott Moore
February 2, 2010 · Posted in statistics · Comment 

Groundhog Day is one of my favorite holidays. As an impact economist and data statistician, I find this holiday represents the truth about forecasting. It has all the elements of forecasting in a simple-to-understand format. It has the good stuff: null and alternative hypotheses, geography, time, historical data sets, measurements applied against “strict” criteria, sources, and witnesses to boot! The result is clean and understandable to all. If the groundhog sees its shadow, expect six more weeks of winter; if not, the season will likely be a little shorter. Brings a grin to my face.

In all seriousness, it is a day when we need to thank the people who work very hard every day in the forecasting sciences. In particular, the nod goes this year to NOAA staff who forecast hurricanes in the South, tornadoes in the Midwest and fire dangers in the West. These efforts are real and intense, using sophisticated models driven off of extensive databases where engineering, science, statistics and social science all come together to try to warn us of events that can and do affect our lives. Thank you. As for Smokey, my groundhog mascot…no shadow today. However, if we could get all the groundhogs on Google, I wonder what that forecast would look like! I guess we need more research.

You Cannot Prove the Null

By: Scott Moore
January 21, 2010 · Posted in statistics · Comment 

If this sounds like statistics and you are about to go to the next site, stop before you do that. I think we can help. In 2009, I joined the SAS JMP® software user group, and I receive periodic updates and explanations on different aspects of statistics. Since using JMP®, my whole world has become significantly (no pun intended) less complicated.

The latest issue of JMPer Cable (pdf) (Issue 26, Winter 2010, pp. 6-9) has a short and informative article by Ramirez and Bailey on significance testing. Questions we analysts need to ask ourselves, from a data standpoint, include: Is there a difference? Does the data tell me anything? Is the simple comment “that’s interesting” good enough, or do we need to have that discussion with the marketing guys again? Regardless, this article explains the null hypothesis (no change) and the alternative hypothesis in a few easy-to-understand pages. The authors do this with short, informative examples and, more importantly, the intuitive computer display from the JMP® statistical package. This article is a conversation, not a lecture, allowing one to absorb concepts that frankly can be confusing.

Accessing Deleted Microsoft Access Files

By: Scott Moore
December 12, 2009 · Posted in statistics · Comment 

If you have ever lost a Microsoft Access file, check out a way to get it back from Black-YoYo.
