Friday, November 16, 2007

Step 0: Why Stata?

I'm going to cover the basic steps of setting up Stata on a Mac, including specifying the working directory, editing and placing the file, selecting an external text editor, and logging your work. However, I thought I'd start with the larger question: why use Stata in the first place?

Stata is hardly the only option in social science statistics software. At the very least, you could consider SAS, The R Project, Minitab, SPSS, S-Plus, and MPlus, not to mention more specialized modeling software like HLM (hierarchical linear modeling) and LISREL (structural equation modeling).

However, for Mac users there are considerably fewer options:

Here's the simple part: SAS, Minitab, HLM, LISREL, MPlus, and S-Plus are not developed for OS X. Yes, I know there are Intel Macs that can run Windows, but that's not what this website is about. I am among those who still have an old PowerPC Mac; moreover, if I upgrade, I'd rather keep doing my data analysis in OS X.

Many people have told me R is the future. It is free, open-source, and offers incredibly robust graphics features. I believe it. However, R is largely a programming environment, not primarily a data analysis package. I've tried it once or twice but I just never felt comfortable. I'd suggest R only for tech-oriented people who are very comfortable with programming. If you still need drop-down menus and extensive help files to do anything in statistics (and I do), I wouldn't recommend R.

This leaves us with the big players in the statistical software market: SPSS and Stata.

Stata offers four different versions: Small, Intercooled, SE, and MP. This page from the Stata website outlines the differences.

  • Small Stata is for teaching environments only - you're limited to 99 variables and 1,000 observations.
  • Stata MP is for multi-processor environments, either newer dual-core computers or servers. With an older PowerPC laptop this was not an option. It's also unnecessary for most users, since its main advantage is parallel processing.
  • Stata SE is the larger version of Intercooled. The main difference is that it allows ~32,000 (2^15) variables instead of ~2,000 (2^11) for Intercooled. SE also allows you to include up to 11,000 variables in a regression (!).
  • Intercooled Stata is what most people will be purchasing. It allows an unlimited number of observations, 2,047 variables and 800 variables in a regression. In practice, I've never had any limitations with Intercooled Stata. Even with large social science datasets, it's rare to deal with more than 2,000 variables; even the largest ICPSR datasets like the General Social Survey don't usually contain that many.
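To make the size limits above concrete, here is a minimal sketch in Python (not Stata) that picks the smallest edition able to hold a dataset. The limits for Small and Intercooled are the ones quoted above; the exact SE cap of 2^15 - 1 is my assumption based on the "~32,000 (2^15)" figure, and the function name is hypothetical. MP is left out because no limits are given for it here.

```python
# Variable/observation limits as quoted in the post.
# The SE cap of 2**15 - 1 is an assumption based on the "~32,000 (2^15)" figure.
EDITIONS = [
    # (name, max variables, max observations or None for "unlimited")
    ("Small", 99, 1_000),
    ("Intercooled", 2**11 - 1, None),  # 2,047 variables
    ("SE", 2**15 - 1, None),           # assumed 32,767 variables
]


def smallest_edition(n_vars, n_obs):
    """Return the first (smallest) edition whose limits fit the dataset, or None."""
    for name, max_vars, max_obs in EDITIONS:
        fits_vars = n_vars <= max_vars
        fits_obs = max_obs is None or n_obs <= max_obs
        if fits_vars and fits_obs:
            return name
    return None


if __name__ == "__main__":
    # A hypothetical survey extract: 1,200 variables, 50,000 respondents.
    print(smallest_edition(1_200, 50_000))
```

This only checks raw size limits; it ignores the fact that Small Stata is meant for teaching environments.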

    All versions of Stata include a full range of data analysis options. There are no add-ons or extra modules to purchase. This includes basic functions like descriptive statistics, graphics, and many forms of regression: linear, logistic, multinomial, Poisson, GLM, ANOVA, ordered logistic and others. It also includes tools for time-series, event history analysis, multidimensional scaling, survey data, panel data, and robust standard error techniques like bootstrap and jackknife.

    Stata also archives many user-submitted programs; these can be accessed through a simple search in the main window. It's a quasi-open-source structure: the software is proprietary but people submit add-ons that are freely distributed.

    Intercooled Stata is available for $155 US with an educational discount. If you're working on a limited-time project and know you won't need the software long-term, you can buy a one-year license for $95 instead.

    Stata comes with a small manual, "Getting Started With Stata for Macintosh." It covers the basics like file management and help as well as how to interact with your data. It's short but quite useful. There is a much larger set of Stata manuals that you can purchase separately, but I haven't bothered.

    Stata offers a full-featured package with excellent documentation for a very reasonable price. For comparison's sake, it's roughly the same cost as an upgrade to OS X 10.5. But what about the competition?

    SPSS is available for Mac, and it appears they now develop the Mac, Windows, and Linux versions concurrently instead of letting the Mac version lag behind. However, the real drawback of SPSS is simple: price.

    Let's take a look at the SPSS web store: thankfully I'll receive the higher education discount. That means the SPSS base system will only cost $619 US. Pretty high for a starting point. But wait, suppose I want to continue with an interrupted time-series analysis project I've been working on regarding crime rates in Atlantic City, NJ? I suppose I'd need SPSS Trends for another $519. Even worse, if I want basic maximum likelihood estimation models like logistic regression, I'll need to buy SPSS Regression Models for another $519. Remember, this is with the educational discount, and we're already talking about ~$1,700 in software costs.

    As you can imagine, the pricing continues like this. Would you like generalized linear model options? You guessed it, another $519 for that module. How about correspondence analysis and multidimensional scaling? For that option, the bargain-basement price is only $419.
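For anyone who wants to check the arithmetic, here is a small Python sketch that adds up the educational prices quoted above. The dollar amounts come straight from this post; the labels for the last two modules are my own descriptions rather than official product names, and the helper function is hypothetical.

```python
# Educational prices in US dollars, as quoted in this post.
STATA_INTERCOOLED = 155
SPSS_BASE = 619
SPSS_MODULES = {
    "Trends": 519,
    "Regression Models": 519,
    # The two labels below are descriptive, not official module names.
    "generalized linear models module": 519,
    "correspondence analysis / MDS module": 419,
}


def spss_total(modules):
    """Cost of the SPSS base system plus the named add-on modules."""
    return SPSS_BASE + sum(SPSS_MODULES[name] for name in modules)


# The time-series plus logistic regression scenario from the post.
two_module_total = spss_total(["Trends", "Regression Models"])
# Base system plus every module mentioned in the post.
all_module_total = spss_total(SPSS_MODULES)

if __name__ == "__main__":
    print(f"SPSS base + Trends + Regression Models: ${two_module_total:,}")
    print(f"SPSS base + all four modules: ${all_module_total:,}")
    print(f"Multiple of the Stata price: {two_module_total / STATA_INTERCOOLED:.1f}x")
```

The two-module scenario comes to $1,657, which is where the "~$1,700" figure comes from, and it is more than ten times the price of Intercooled Stata.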

    In spite of all this, I'm not completely opposed to SPSS. When I started out in graduate school and wrote my master's thesis, SPSS was a nice first option for dealing with logistic regression models and descriptive statistics. I've also used it in summer statistics workshops at ICPSR and been satisfied with its functionality. However, in both of those situations, the university had already paid to have all of the modules included; it wasn't until recently that I learned about the piece-by-piece pricing structure.

    If you are purchasing a statistics package for yourself, there's no way I can recommend SPSS. The price is just unreasonable, and I really dislike the idea of separate modules that you have to purchase on top of the base system. The non-university prices are even more prohibitive: the base system starts at $1,600!

    If you're still dubious, look at this MacStats review of Stata versus SPSS:

    There are some reasons why SPSS for the Mac is not a viable long-term option for many people, even though it’s easier to explore than most stats programs. Mac versions lag behind Windows versions, and the user interface has quirks, bugs, odd crashes and pauses, and problems working with other programs. The price is absurd, and on top of the excessive cost for the base package, most users will need extra modules, each of which costs about as much as Stata – and they charge for module upgrades, too. Finally, there might not even be another version of SPSS for the Mac, and if there is, it might not work with new or old computer. Even now, there is no version of SPSS for Intel Macs.

    Stata is the most full-featured, affordable and Mac-friendly statistical package available, and it's really not even close. Next post I'll discuss the basics of installation and setting up a working directory, with screenshots included.

    The Geeks said...

    Thanks for review, it was excellent and very informative.
    thank you :)
