Bowl Championship Series: BCS

January 17, 2012 · Posted in statistics

Issue

The Bowl Championship Series (BCS) ranking process is a failure by any measure. The good news is that it finally appears the powers-that-be are going to work out a playoff system. But what is the root cause of the problematic BCS rankings? Why don’t they work? And what type of numerical system might meet the needs of a college football ranking system?

Statistics: Or Lack of!

A cursory review of BCS statistics quickly identifies the main problem: the people who created these “methods” do not appear to use any form of statistics. Further limiting the public’s understanding of these data, the methods used to calculate the rankings are not available. In other words, they have not been peer-reviewed in any meaningful way – and subscribe to the “trust me” method!

We know the accuracy is questionable at best or scandalous at worst, since we never read or hear about odds, confidence intervals, error, probability or other common statistical references when referring to these data. We also know intuitively that around each number there is error. If the error is not displayed, we know we can trust neither the numbers nor the authors – hence the ruckus around these rankings.

The Champ: Play-off

The great thing about a playoff for college football, like every other major sports league, is that you know the answer at the end. The best team on that day is the final one standing. End of debate. Rodney Harrison was recently asked who he liked in the NFL playoff and his answer was that it is hard to estimate since anything can happen in a playoff game. Well said. The challenge with a college playoff system is not that it wouldn’t work, because it would. Rather, it cuts the number of bowl games in half. Ouch, that is a lot of lost revenue!

The Champ: Numerical Calculation

I will disclose my bias for a playoff system since, as Rodney stated, anything can happen. But I believe there is likely a method that would, in fact, provide a numerical answer that most would agree with. First, the method needs to be made public, and it should be a method that has a history of success. “Odds” are, of course, one system, but in reviewing the odds estimates for the BCS championship game, there were many conflicting estimates with some odds makers suggesting a difference of only a point or two. In other words, it was too close to call.

Odds is an interesting process (better than the “look what I made up” numerical process), but probability estimates are the only real tool we have that could pick a winner. Odds and probability sound similar but in fact are quite different. The difference:

  • Probability is used to express sensitivity, specificity and predictive value. It is the proportion of people in whom a particular characteristic, such as a positive test, is present.
  • Odds is the ratio of two complementary probabilities. (PDF)
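To make the distinction concrete, here is a minimal Python sketch of the conversion between the two scales. The function names are my own, chosen for illustration, and the 75 percent figure is a made-up example rather than a real estimate for any game.

```python
def probability_to_odds(p):
    """Odds in favor: the probability divided by its complement (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)


def odds_to_probability(odds):
    """Inverse conversion: odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


# A team given a 75 percent chance of winning is a 3-to-1 favorite.
print(probability_to_odds(0.75))  # 3.0
print(odds_to_probability(3.0))   # 0.75
```

The same information is carried on both scales; the difference is that probability is bounded by 0 and 1 while odds run from 0 upward without limit.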

Along the probability line is a process called Evidence Based Management (EBM) which uses Bayesian analysis.

Bayes’ Theorem: a statistical principle for combining prior knowledge of the classes with new evidence gathered from data. (See Introduction to Data Mining, Chapter 5, pp. 228-229.) (PDF)

EBM with Bayesian analysis states: What was thought before the test was done, combined with the test result, determines what is thought after the test. In other words, what you thought you knew before the football contest, combined with the game itself, yields what you think afterward – the “LSU is still No. 1” syndrome! It is this process that could provide an answer to who is No. 1 regardless of the date, time or opponent,* effectively removing the Rodney effect, but not likely the debate!
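A minimal sketch of that before-and-after update, in Python. Every number here is hypothetical: the prior belief that Team A is the better team, and the chances that Team A wins a given game depending on whether it really is the better team. The function name is also mine, not part of any ranking system.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of a claim after seeing one piece of evidence.

    prior               -- belief in the claim before the evidence
    p_evidence_if_true  -- chance of this evidence if the claim is true
    p_evidence_if_false -- chance of this evidence if the claim is false
    """
    numerator = prior * p_evidence_if_true
    evidence = numerator + (1 - prior) * p_evidence_if_false
    return numerator / evidence


# Hypothetical: we are 60 percent sure Team A is the better team. The better
# team wins 70 percent of the time; the weaker team wins 40 percent of the time.
belief_before = 0.60
belief_after_win = bayes_update(belief_before, 0.70, 0.40)
belief_after_loss = bayes_update(belief_before, 0.30, 0.60)

print(round(belief_after_win, 3))   # belief rises after a win
print(round(belief_after_loss, 3))  # belief falls after a loss
```

One game moves the belief but does not settle it, which is the point: the prior still counts after the final whistle.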

Conclusion

I am not sure that the BCS question is all that important or worth a lot of time in the context of solving the world’s problems, but if we are going to do the math, let’s at least try to make the process transparent, thoughtful and based on some sort of peer-reviewed science. Frankly, that is the only way my team will EVER have a chance at a BCS championship!

*Note: I do not address “style” points, an unsportsmanlike concept.

Gini Coefficient

January 10, 2012 · Posted in statistics

Issue

The Gini Coefficient, developed by the Italian statistician Corrado Gini, is the most commonly used measure of inequality. The coefficient varies between 0, which reflects complete equality, and 1, which indicates complete inequality (one person has all the income or consumption, all others have none). (The World Bank) We wanted to use this method to look at income distribution throughout South Carolina, but first we had to understand the formula.

At first glance, there is a fair amount of math needed to calculate the coefficient. Make no mistake, this can be a very complex formula, utilizing probability sampling, bootstrapping, confidence intervals and other statistical methodology. We, however, tried to keep it applied and therefore used the most basic variation:

Gini Formula

After sorting out the symbolism, we created a sample problem (PDF). The sample problem allowed us to work through the math in a structured process. The value of “doing the math” is that one gains an understanding of how different variables affect the formula. The PDF contains two versions of the sample problem, one showing the formula and the other with plugged numbers. Note that, unlike most of the available examples, we show a calculation needed prior to using the formula: in this case, (dollars strata) TIMES (number of persons). That’s because the analyst may need to do a number of calculations prior to applying the formula.
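For readers who prefer code to a worksheet, here is a minimal Python sketch of one common basic form of the calculation for grouped (strata) data: the trapezoid rule applied to the Lorenz curve. This is an assumption on my part about what a “basic variation” looks like, not necessarily the exact formula in the PDF, and the three strata below are invented for illustration.

```python
def gini_grouped(dollars_per_stratum, persons_per_stratum):
    """Gini coefficient from income strata via the Lorenz-curve trapezoid rule.

    dollars_per_stratum -- representative (average) dollar value of each stratum
    persons_per_stratum -- number of persons or households in each stratum
    """
    strata = sorted(zip(dollars_per_stratum, persons_per_stratum))
    total_persons = sum(n for _, n in strata)
    # The preliminary step: (dollars strata) TIMES (number of persons).
    total_income = sum(d * n for d, n in strata)

    cum_pop = cum_inc = 0.0
    area = 0.0
    for dollars, n in strata:
        prev_pop, prev_inc = cum_pop, cum_inc
        cum_pop += n / total_persons
        cum_inc += dollars * n / total_income
        area += (cum_pop - prev_pop) * (cum_inc + prev_inc)
    return 1.0 - area


# Hypothetical strata: average dollars and number of households.
dollars = [10_000, 50_000, 400_000]
households = [50, 40, 10]
print(round(gini_grouped(dollars, households), 3))

# Raising only the assumed top-stratum average pushes the coefficient up.
print(round(gini_grouped([10_000, 50_000, 500_000], households), 3))
```

Because everyone inside a stratum is treated as earning the same amount, a grouped calculation like this one tends to understate inequality compared with person-level data.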

The Formula: Results

We applied the formula to the classic income distribution (wealth share) problem, using Census, Household and Family Income Report B19001, for each county in South Carolina. These data have 16 income strata. We found the formula is particularly sensitive to changes in the top two strata: not necessarily the number of persons, but the average dollar value. In other words, “the tail wags the dog” in this formula. The other critical piece of information needed is what value to assign the highest stratum. The Census Bureau uses approximately $400,000 as the average dollar figure for the top stratum. They calculate this number using volumes of data, so it’s good enough for me.

After making our calculations, the formula really did reveal a number of interesting trends. One, the impact of the economy on higher wage earners – in the case of these data – is very delayed. In other words, higher income households continued to make money well into the latest recession. The other revealing attribute is the effect of a rising tide. A rising tide does in fact lift boats, but some higher than others and in the process it also sinks a few! In this case, households with higher incomes grew at a proportionally higher rate than those with lower incomes, and in some counties, household income (high and low) was hit particularly hard.

Conclusions

Now that you understand the formula, if you use these data, the Census Bureau has already done the Gini Coefficient income calculations for you! Yes, to my surprise the Bureau has been doing this calculation since the 1990s. The file is B19083. It may sound like I have given you a shortcut, but now you have to figure out the new GUI American Community Survey interface. Good Luck!

Acknowledgement: Thank you to the staff at the US Census Bureau for assisting me in understanding key drivers of the Gini Coefficient.

Transportation Economic Development Impact System (TREDIS)

December 6, 2011 · Posted in TREDIS

The Transportation Economic Development Impact System (TREDIS) is a product developed by the Economic Development Research Group, Inc. (EDR). It is an integrated framework for transportation planning and project assessment – designed to cover a wide range of applications, from looking at benefit/cost impacts of a single transportation investment, to analyzing the macroeconomic impacts of alternative long-range plans.

It models passenger and freight travel across all modes, and it assesses costs, benefits, and impacts across a range of economic responses and societal perspectives. To  integrate this range of features, TREDIS operates as four separate but interconnected modules:

  • Travel Cost
  • Market Access
  • Economic Adjustment, and
  • Benefit Cost


State and Local Government Employment

March 18, 2011 · Posted in employment

Issue

The Richmond Federal Reserve recently published an informative article on the recession and government tax shortfalls. The analysis included the effect on government employment. “State and local governments employed nearly 20 million workers in the U.S. That is about 15 percent of total payroll employment in the nation, more than the manufacturing and construction industries combined. As a result of the fiscal duress, state and local governments have been cutting jobs and more are likely to follow.” (PDF)

Economic Impact: Cutting Government Workers

I am not sure anyone would argue that an efficient and productive government is not a good thing for almost everyone. However, arbitrary employment cuts have a significant negative effect on an economy.

As an example, a Targeting Economic Development study using the Analytic Hierarchy Process (AHP) showed the impact of government workers (Cox et al., 2000). The study showed that, within a three-county region in Virginia, the State and Local Government (non-education) sector created 32 jobs per million dollars of output. It was the No. 1 industry for this region out of the top 20 studied. The industry also had the 15th-lowest average wage of the top 20 but the highest value-added effect (total Virginia/dollars of output), at 1.30. The next-closest industry, oil and gas, was 1.17.

Value-added includes employee compensation, proprietor income (i.e. self employment), other property-type income (i.e. rents and profits), and indirect business tax (i.e. sales tax paid to business). So if government cuts  employment, indirect and induced dollars flowing to private sector industries are significantly reduced.

Conclusion

Having an efficient and productive workforce is important for both the government and private sectors. Random cutting, however, will lead to direct negative economic impacts in the private sector at a time when we are all looking for a sign of an improved economy.

BEA RIMS II and Lucky Charms

February 8, 2011 · Posted in economic development

Recently I have had a number of questions concerning the Bureau of Economic Analysis (BEA) Regional Input-Output Modeling System, or RIMS II, data set. I use these data primarily for scoping, to determine whether there are any surprises in the economic study region that may help me formulate a plan. It performs superbly in this application. From the RIMS II handbook, p. 1:

“Using RIMS II for impact analyses has several advantages. RIMS II multipliers can be estimated for any region composed of one or more counties and for any industry or group of industries in the national I-O table. The cost of estimating regional multipliers is relatively low because of the accessibility of the main data sources for RIMS II. According to empirical tests, the estimates based on RIMS II are similar in magnitude to the estimates based on relatively expensive surveys. To effectively use the multipliers for impact analysis, users must provide geographically and industrially detailed information on the initial changes in output, earnings, or employment that are associated with the project or program under study. The multipliers can then be used to estimate the total impact of the project or program on regional output, earnings, or employment.”

RIMS is a solid input-output modeling system for the right phase of a project because it is able to provide final multipliers for many different industries. However, this is where the capability ends. It is like eating a bowl of Lucky Charms. You can find the marshmallows (yellow moons, orange stars and green clovers), but no meat and potatoes. In this case, the good stuff is the impact on affected local industries. With RIMS it is necessary to know that information up front, which is unlikely. From the handbook case study, p. 15:

“These changes consist of the decline in the purchases of goods and services that results from closing the military base and the decline in purchases by military personnel. For both types of purchases, the user must determine which purchases occur in the economic area and then must show these purchases in producers’ prices.”

So although we know something about direct industries – likely a guess determined through a review of an available budget – we have no way of knowing the relationships of these impacts to the broader economy other than the multiplier, which gives us an accurate, yet gross, estimate of impacts.
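The multiplier arithmetic itself is simple, which is both its strength and its limit. A minimal Python sketch follows, using entirely hypothetical figures: the initial change, the output multiplier and the jobs-per-million multiplier are placeholders, not values from any RIMS II table.

```python
def total_impact(initial_change, multiplier):
    """Gross regional impact: the initial change times a final multiplier."""
    return initial_change * multiplier


# Hypothetical inputs for a base-closing style scenario.
initial_change_millions = -10.0  # assumed drop in local purchases, in $ millions
output_multiplier = 1.5          # assumed dollars of regional output per dollar of change
jobs_per_million = 20.0          # assumed jobs per $1 million of change

print(total_impact(initial_change_millions, output_multiplier))  # -15.0 ($ millions of output)
print(total_impact(initial_change_millions, jobs_per_million))   # -200.0 (jobs)
```

One number in, one number out. Nothing in the calculation says which local industries absorb the loss, and that is the missing meat and potatoes.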

Below is an example of the RIMS II output, p. 18:

RIMS Output

Missing from the basic calculation are the indirect, induced and industry details such as taxes, proprietor income and the relationship of these data to other regions either within or outside the study area. In other words, we do not have a complete picture of the money flows that result from a change in the economy. Although some calculations can be completed using Type I and Type II data, it is these missing details that fill out the playbook for a competent economic development analysis and subsequent plan.

The Advent of the Algorithm

January 17, 2011 · Posted in method

The Post and Courier recently printed an unusual column discussing “Deep Reading.” It is an excellent article by Laura Casey discussing a lost art.

“Deep Reading, or slow reading, is a sophisticated process in which people can critically think, reflect and understand the words they are looking at. With most this means slowing down – even stopping and rereading a page or paragraph if it doesn’t sink in…”

Deep reading seems out of sync with today’s “modern” communication tools, like Twitter, and it is.  But what does this have to do with the algorithm? As a data guy, the algorithm is my life, and the only way to understand what makes it tick is deep reading! My choice for this subject is The Advent of the Algorithm, by David Berlinski.

We experience the algorithm everywhere in our daily lives – in the operation of the toaster, the dentist’s office, our vehicles and in most, if not all, of the communication tools we use daily. I reviewed definitions of the algorithm online, but none have the elegance of Berlinski, in his book, The Advent of the Algorithm:

“In the logician’s voice:

an algorithm is a finite procedure,

written in a fixed symbolic vocabulary,

governed by precise instructions,

moving in discrete steps, 1,2,3,…,

whose execution requires no insight, cleverness,

intuition, intelligence, or perspicuity,

and that sooner or later comes to an end.”

Where did this idea come from? It turns out the algorithm we all know is the brainchild of four people: Kurt Gödel, Alonzo Church, Alan M. Turing and Emil Post. Each contributed, in part, to the concept of the modern-day algorithm, including functions, calculus of conversion and machines capable of manipulating symbols (computers).

Berlinski describes, in depth, the development of the algorithm concept. In one example, the Euler algorithm, he makes clear how important it is – for an analyst, anyway – to know, not just understand, the magic and flaws behind the algorithm. A case in point is the Numerical Solution For Ordinary Differential Equation, page 245.

“From a mathematical point of view, the original differential equation, contingent as it was upon the concept of the limit, has been replaced by a difference equation, one in which the derivative is approximated by a difference quotient, involving no limits whatsoever.”

The Euler algorithm demonstrates this method:

BEGIN Euler
  INPUT x0, y0, xf, h
  x := x0
  y := y0
  WHILE (x < xf) DO
    y := y + h * f(x, y)
    x := x + h
    OUTPUT x, y
  ENDDO
END Euler

This simple algorithm, however, provides critical insight into the weakness of the algorithm:

“…the difference between an analytic and algorithmic solution to an ordinary differential equation is sharp and it is inescapable. An analytic solution completely penetrates the future or the past; an algorithmic solution acts only over a finite interval of time and space. The analytic solution returns a differential equation to a continuous world; an algorithmic solution, to a world that is discrete.”
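To see that gap between the analytic and the algorithmic answer in numbers, here is a minimal runnable Python sketch of the Euler method. The test equation y' = y with y(0) = 1 is my choice, not Berlinski's; it is convenient because the analytic solution is simply e^x. I also use a fixed step count instead of the pseudocode's WHILE (x < xf) test, since repeatedly adding a step like 0.1 in floating point can land just short of xf and trigger one extra step.

```python
import math


def euler(f, x0, y0, xf, h):
    """Forward Euler: advance y' = f(x, y) from x0 to xf in steps of size h."""
    x, y = x0, y0
    steps = round((xf - x0) / h)  # fixed count avoids float drift in an x < xf test
    for _ in range(steps):
        y += h * f(x, y)  # the derivative replaced by a difference quotient
        x += h
    return y


def growth(x, y):
    """Right-hand side of the test equation y' = y."""
    return y


exact = math.e  # analytic solution e^x evaluated at x = 1
for h in (0.1, 0.01, 0.001):
    approx = euler(growth, 0.0, 1.0, 1.0, h)
    print(f"h={h}: euler={approx:.5f}  error={exact - approx:.5f}")
```

Shrinking the step shrinks the error, but it never reaches zero: the discrete world approaches the continuous one without ever arriving.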

Now I understand why the brakes failed in my Toyota. :)

Note: some of the reviews of the book were not very flattering, but this is DEEP READING with the good stuff starting on page 205.  Do you think you have what it takes to DEEP READ? Have at it!

JMP 9.0 – Applied Data Analysis

December 18, 2010 · Posted in statistics

JMP (jump): The Sharpest Tool in the Shed

I have been a JMP fan ever since being introduced to the product through the University of Minnesota statistics department. I have used a number of statistical programs over the years, but JMP is a perfect fit for the wide range of data analysis work I perform for customers.

The Problem – What You Don’t Know

I have found that even simple data sets can, and do, hide their secrets effectively. In fact, it is amazing what we don’t know about even the most basic data sets unless the data is run through a statistical package. Here is an example (PDF) of state population estimates for 2009. There are only 50 data points. The tool used here is the basic and easy distribution analysis in JMP. Of the 50 states, four actually have populations considered outliers in the data set. The median population is about 4.1 million. It would seem that these four states would  be easy to identify, but that’s not necessarily true. I thought Wyoming, with a small population at 544,000, would also be an outlier – but that’s not so. All this information is at your fingertips with the click of a button.
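For readers without JMP, the idea can be sketched in a few lines of Python. This assumes the common box-plot rule, where a point is an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles; whether that matches the exact settings behind the linked PDF is my assumption. The nine population figures (in millions) are invented for illustration and are not the 2009 state estimates.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical populations, in millions, with one very large state.
populations_millions = [0.5, 1.3, 2.9, 4.1, 4.4, 5.6, 6.6, 9.9, 37.0]
print(iqr_outliers(populations_millions))  # [37.0]
```

Note that the smallest value is not flagged. In a right-skewed data set the lower fence falls below zero, so no population, however small, can be a low outlier. That is the Wyoming surprise in miniature.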

A public example of a more complex data set is collected by NOAA. Here we are taking a small sample (416K points) of Sea Level Pressure Data and plotting it. JMP 9.0 makes short work of this data set.

Setting Expectations

One of the best attributes of any statistical package is helping users understand their hypotheses, or assumptions about some aspect of the world. Each of us creates hypotheses every day as a part of life. Estimating commute times to work is one example. JMP allows us to not only think more clearly about these everyday data interactions, but to test them if we so desire. The hypothesis test is a statistical approach to testing a theory, according to The Economist Numbers Guide. This test, however, is not necessary to increase the basic understanding of data you are responsible for, whether it is financial, engineering, manufacturing, marketing, medical or administrative.

JMP – World Applications

JMP’s latest magazine describes uses of the program. (PDF) It is used in clinical trials, consumer products, product development and, of course, manufacturing. In today’s competitive marketplace, analyzing your data with a statistical package has become a business fundamental – much like cash flow. It’s something every business should have and deploy as part of an effective business strategy.

State Gross Domestic Product (GDP)

November 30, 2010 · Posted in economics

In This Together – Not Really!

Recently the Post and Courier published an article on South Carolina 2009 GDP. (See GDP Discussion.)  Wells Fargo’s Mark Vitner provided the color commentary:

“This recession was very much centered on housing, manufacturing and financial services, and those three industries are much more important to the South than the nation as a whole.”

The Devil is in the Details

Unfortunately, this article is about South Carolina and not “the South.” What particularly grabbed my attention was the reference to financial services. I did not believe that financial services were much more important to South Carolina than to the United States as a whole. In fact a little research, apparently not provided to Mr. Vitner, indicates that financial services GDP is significantly LESS as a percentage of the South Carolina total than of the United States total – 6.6 percent versus 9.7 percent. Another surprise is that manufacturing is a significantly LARGER portion of the state’s GDP than of the nation’s – 18.4 percent versus 12.7 percent (PDF).

Summary

The bottom line is this: We are not like the national economy. It is a poor comparison because South Carolina is too small. More appropriately, the reporter compares South Carolina with Georgia and North Carolina but misses Florida, the big dog in the region. At least at the macro level, South Carolina’s numbers are better than our neighbors’.

And that’s something to build on.

2010 October Unemployment

November 30, 2010 · Posted in unemployment

Forecasting Unemployment

The Bureau of Labor Statistics recently posted October unemployment statistics for South Carolina. The state’s unemployment actually dropped, which is a positive sign and was not expected. It tends to be difficult to forecast any trend from one set of data, especially when we hope for a better economy but have no way to know whether that hope is grounded.

History of September to October

I thought I would do a back-of-the-envelope review of the employment change trend from September to October. As is to be expected, it is all over the place, but there are some interesting relationships that become apparent and are typically glossed over in monthly reports. (PDF) What is revealed is the link between labor force, employment and unemployment.

From 2005 to 2007,  employment and the labor force moved together. The result of the way these two variables interact is the third category, which is unemployment. In 2008 the bottom fell out of employment, with unemployment shooting to the moon and the labor force making a steady march south. In 2010, it seems that employment has overshot the capability of the labor market, so it is reasonable to expect that employment will moderate going forward.

Final Thoughts

What I do not like is the potential for the labor force to make a strong comeback and for employment to flatten. The result would be more unemployment. The best case is continued strong employment, which decreases unemployment while allowing the labor force to expand at a moderate pace.
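The arithmetic behind both scenarios is the identity that the unemployed are the labor force minus the employed. A minimal Python sketch, using round, made-up head counts rather than South Carolina’s actual figures:

```python
def unemployment_rate(labor_force, employed):
    """Unemployed persons (labor force minus employed) as a share of the labor force."""
    unemployed = labor_force - employed
    return unemployed / labor_force


# Hypothetical starting point: 2.0 million in the labor force, 1.8 million employed.
base = unemployment_rate(2_000_000, 1_800_000)

# Scenario I do not like: labor force grows 2 percent while employment stays flat.
worst = unemployment_rate(2_040_000, 1_800_000)

# Best case: employment grows 2 percent while the labor force grows 1 percent.
best = unemployment_rate(2_020_000, 1_836_000)

for label, rate in [("base", base), ("flat employment", worst), ("strong employment", best)]:
    print(f"{label}: {rate:.1%}")
```

With these made-up numbers the rate rises in the first scenario without a single job being lost, and falls in the second even though the labor force is still growing.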

Update 12.03.2010

Jobs Byte – Sometimes I get it right!

GDP Explained (Third Quarter 2010)

November 16, 2010 · Posted in economics

The Quarterly Data Mind Melt

Gross Domestic Product (GDP) is a huge data set managed by the Bureau of Economic Analysis (BEA). On a quarterly basis, I receive a number of emails announcing the latest data from the BEA. Most economists, including Dean Baker, give concise analyses of these data. But even with one-page summaries, I wonder where these data come from and what exactly they are talking about, since the analysis is usually out of context. Furthermore, the data are national in scope and tell very little about what is going on in my state or the relationship between the national data and the state or regional economy.

Third Quarter 2010 Perspective

The third quarter briefing is an excellent example of how these data are developed over a period of time. In fact, the “advance”  third quarter numbers are actually estimates, not final numbers. (Most skim over this fact.)

“Real gross domestic product – the output of goods and services produced by labor and property located in the United States – increased at an annual rate of 2.0 percent in the third quarter of 2010, (that is, from the second quarter to the third quarter), according to the ‘advance’ estimate released by the Bureau of Economic Analysis.  In the second quarter, real GDP increased 1.7 percent.”

A technical note describes assumptions, data and how “advance” estimates are calculated. The method is described in detail, which is one of the truly great features of these data. This release goes on to state:

“The change in real private inventories added 1.44 percentage points to the third-quarter change in real GDP after adding 0.82 percentage point to the second-quarter change.  Private businesses increased inventories $115.5 billion in the third quarter, following increases of $68.8 billion in the second quarter and $44.1 billion in the first.”

These statements allow the reader to delve deeper into the data set. But where did these data come from? The BEA has a number of interactive tables so you can explore the data in more detail. The $115.5 billion is found in Table 5.6.6B., “Change in Real Private Inventories by Industry, Chained Dollars” (PDF). This happens to be an important number because most economists, including me, believe inventory building is not sustainable. Subtracting the 1.44 percentage points contributed by inventories from the 2.0 percent total growth leaves final GDP growth at a measly 0.6 percent, close to zero. Likely not what most are looking for.
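The subtraction is worth writing out, since it uses only the two figures quoted from the release. A minimal Python sketch (the variable names are mine):

```python
# Figures quoted from the BEA "advance" estimate for the third quarter of 2010.
real_gdp_growth = 2.0          # percent, annual rate
inventory_contribution = 1.44  # percentage points contributed by private inventories

growth_excluding_inventories = round(real_gdp_growth - inventory_contribution, 2)
print(growth_excluding_inventories)  # 0.56, which rounds to about 0.6 percent
```

Put differently, with these figures inventories account for roughly 70 percent of the quarter’s reported growth.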

You may have noted recent news stories of the private sector trying to move up or expand Black Friday. That’s because retailers hope to decrease the temporary inventory bubble they have created.

National, State and Local Comparisons

If you are like me, national data are fine, but I like to know how they sync with the regional economy. State and local data lag behind national data by about two years (PDF). That is quite a long time. However, there are a number of ways an analyst can create an index comparing national and state level data, with reasonable assumptions, to produce a current trend for the regional economy. That would be particularly helpful here in South Carolina when discussing automobile inventories and the effect an increase in inventory has on both short- and long-term investment and employment.
