Wednesday, November 28, 2007

Step 4: Importing unformatted data

If you're lucky, you'll never have to deal with unformatted data. You'll just cruise to ICPSR or the local library and get your data in pre-formatted, machine-readable files with no errors, complete dictionary and label files, and lots of documentation.

That's exactly what I did for four years. Then I ran across an old, useful piece of data that had never before been put into a statistics package.

That's an exciting moment. While the General Social Survey has been combed over (and over and over) by social scientists, including me, it's nice to find a new dataset that hasn't been mined very heavily. Either that means it's a dud, or just that no one bothered with it and you may find something new.

Here's what the data looked like when I opened it:

0001124352540000000000000000000000000000000000000

000222223225202022222322120200001222232322520202223232212020

371141530500000000000000140010021803



Hmm.

No commas, no spaces, and no file to read all of this into Stata (or SPSS or anything else).

However, there was a paper codebook that explained what all of this meant. The first four digits are the ID; the next is the card number; the next is the opinions of Fraternal Order of Police members on the severity of illegal bookmaking on a 1-to-4 Likert scale. And so on.

infix dictionary {

* imports Fraternal Order of Police Study from 1975 Gambling Commission Study

2 lines

1:
ID 1-4


card 5


serbook 7


serlarc 8


sernumb 9


serpot 10


sercard 11


serfence 12


serhook 13



And we're off! By the way, "2 lines" is a Stata command indicating that a data records runs longer than a single line.

From here there are two ways to go:

First way:
  • Save the file as a .dct file, meaning it's a data dictionary
  • Use the infix menu in Stata (under File)
  • Specify the location of the raw data file and the dictionary file:

    Stata File Menu



    Picture 9.jpg



    Second way (preferred for me):

  • Paste the raw data into the file below the dictionary
  • Save the file as a .do file
  • Run it in Stata



    At this point, everything should be fine. Of course, you'll make mistakes the first few times, and need to re-run things.

    Next time I'll address Step 5: Logging your sessions.

  • No comments: