11.2 Data Manipulation

11.2.1 Import and Export

Shipped datasets

Stata contains some demonstration datasets in the system directories.

sysuse dir: list the names of shipped datasets.

sysuse lifeexp: use lifeexp

Note that use lifeexp will return error. Data not found.

User datasets

.dta

use myauto [, clear]: Load myauto.dta (Stata-format) into memory.

  • clear it is okay to replace the data in memory, even though the current data have not been saved to disk.

save myauto [, replace]: Create a Stata data type file myauto.dta

  • replace allows Stata to overwrite existing dataset that is the output from previous attempts to run the do file.

.csv

import delimited myauto.csv: Import myauto.csv to Stata’s memory

export delimited myauto.csv” Export to myauto.csv

import delimited filename reads text (ASCII) files in which there is one observation per line and the values are separated by commas, tabs, or some other delimiter.

By default, Stat will check if the file is delimited by tabs or commas based on the first line of data.

export delimited filename writes data into a file in comma-separated (.csv) format by default. You can specify any separation character delimiter that you prefer.

If filename is specified without an extension, .csv is assumed. If filename contains embedded spaces, enclose it in double quotes.

import delimited [using] filename [, import_delimited_options]

Options

  • delimiters("chars"[, collapse | asstring] ):

    • "chars" specifies the delimiter

      ";": uses semicolon as a delimiter; "\t" uses tab, "whitespace" uses whitespace

    • collapse treat multiple consecutive delimiters as just one delimiter.

    • asstring treat chars as one delimiter. By default, each character in chars is treated as an individual delimiter.

    // use example
    import delimited auto, delim(" ", collapse) colrange(:3) rowrange(8) 
  • clear replace data in memory


11.2.2 Save Estimation Results

estimates store model_name stores the current (active) estimation results under the name model_name.

// Store estimation results as m1 for use later in the same session
. estimates store m1
// to get them back 
. estimates restore m1
// Find out what you have stored 
. estimates dir

estimates save saves the current active estimation results to a file with the extension .ster.

// Save the current active estimation results
. estimates save basemodel
file basemodel.ster saved

In a different session, you can reload those results:

// Load the saved estimation results
. estimates use basemodel
// Display the results
. estimates table

Q: What is the difference between estimates store and estimates save?
A: Once estimation results are stored (estimates store), you can use other estimates commands to produce tables and reports from them. estimates save saves the results to disk (as a .ster file) so that you can use them in a different Stata session.

  • estimates store keeps the results in memory, fast and temporary. The most common use is to store results from multiple models in the same session for comparison.

  • estimates save keeps the results on disk, persistent. The most common use is to save time-consuming important results for future reference. Also useful when you want to share results with others.


Q: How to print the original regression output again after storing the results?
A: Use estimates replay model_name. This will show the original output as it appeared when the model was first estimated.


estimates table [namelist] [, options] organizes estimation results from one or more models in a single formatted table.

If you type estimates table without arguments, a table of the most recent estimation coefficients will be shown.

estimates table namelist is useful to compare results from multiple models side by side. Good for a quick preview. Use esttab or etable for more advanced table formatting and exporting.

// Display a table of coefficients for stored estimates m1 and m2
estimates table m1 m2

// with SE
estimates table m1 m2, se

// with sample size, adjusted R2, and stars
estimates table m1 m2, stats(N r2_a) star

You can add more results to show using options:

Parameter statistics options:

  • b[%fmt] how to format the coefficients.
  • se[%fmt] show standard errors and use optional format
  • t[%fmt] show \(t\) or \(z\) statistics and use optional format
  • p[%fmt] show \(p\) values and use optional format
  • varlabel display variable labels rather than variable names
// show se, t, and p values
. estimates table, se t p

All statistics are shown in order under the coefficients. If you have a long list of variables, the table can be very long.

You can use keep(varlist) to keep only the variables you want to show in the table.

  • varlist is a list of variables you want to keep in the table.

    • A list of variables can be specified as keep(var1 var2 var3).

      Names are separated by spaces.

    • Not possible to use variable ranges, e.g., keep(var1-var3) will return an error.

    • When you have multiple equations, use eqn_name:varname to specify the variable in a specific equation.

  • stats(scalarlist) reports additional statistics in the table. Below are commonly used result identifiers:

    • N for sample size
    • r2_a for adjusted \(R^2\)
    • r2 for \(R^2\)
    • F for F-statistic
    • chi2 for chi-squared statistic
    • p for p-value
    // show sample size and adjusted R2
    . estimates table, stats(N r2_a)
  • star shows stars for significance levels.

    // show stars for sig. levels
    . estimates table, star
    • By default, star(.05 .01 .001), which uses the following significance levels:

      • * for \(p < 0.05\)
      • ** for \(p < 0.01\)
      • *** for \(p < 0.001\)
    • You can change the significance levels using star(.1 .05 .01) to set the levels to 0.10, 0.05, and 0.01, respectively.

      Many journals use this convention actually. Also this boosts the appearance of significance even though the p-values remain the same.

      • * for \(p < 0.10\)
      • ** for \(p < 0.05\)
      • *** for \(p < 0.01\)
    • N.B. the star option may NOT be combined with the se, t, or p option.

    An error will be returned if you try to combine them:

    . estimate table, star se t p star
    option star not allowed

Example output

By default, estimates table shows only the coefficients for all regressors.

Use keep() to show only selected variables, and use star to show significance levels.

. estimates table global_trend country_trend, keep(L.logd_gdp tmp tmp2 pre pre2 tmp_pre tmp2_pre tmp_pre2 tmp2_pre2) star

----------------------------------------------
    Variable | global_trend    country_trend  
-------------+--------------------------------
    logd_gdp |
         L1. |  .02106535       .01196853     
             |
         tmp |  .02233754**     .02285209**   
        tmp2 | -.00091508***   -.00093839***  
         pre |  .07140978       .07469934     
        pre2 | -.04168053      -.04459604     
     tmp_pre | -.01259437      -.01284796     
    tmp2_pre |  .00047348*      .00047987*    
    tmp_pre2 |  .00472783       .00496185     
   tmp2_pre2 | -.00013928      -.00014426     
----------------------------------------------
      Legend: * p<0.05; ** p<0.01; *** p<0.001

Print standard errors, t-statistics, and p-values together:

. estimates table global_trend country_trend, keep(L.logd_gdp tmp tmp2 pre pre2 tmp_pre tmp2_pre tmp_pre2 tmp2_pre2) 
> se t p

----------------------------------------
    Variable | global_t~d   country_~d  
-------------+--------------------------
    logd_gdp |
         L1. |  .02106535    .01196853  
             |  .01402981    .01433765  
             |       1.50         0.83  
             |     0.1332       0.4039  
             |
         tmp |  .02233754    .02285209  
             |   .0068979    .00701607  
             |       3.24         3.26  
             |     0.0012       0.0011  
        tmp2 | -.00091508   -.00093839  
             |  .00019456    .00019787  
             |      -4.70        -4.74  
             |     0.0000       0.0000  
...
...
    tmp_pre2 |  .00472783    .00496185  
             |  .00420448    .00427315  
             |       1.12         1.16  
             |     0.2608       0.2456  
   tmp2_pre2 | -.00013928   -.00014426  
             |  .00009435    .00009588  
             |      -1.48        -1.50  
             |     0.1399       0.1324  
----------------------------------------
                        Legend: b/se/t/p

11.2.2.1 esttab

esttab is a wrapper for estout. It is a command for publication-style regression tables that display nicely in Stata’s results window or, optionally, can be exported to various formats such as CSV, RTF, HTML, or LaTeX.

esttab [ namelist ] [ using filename ] [, options estout_options ]
  • namelist is a list of stored estimation results to be included in the table. Can be _all.

    If namelist is omitted, esttab tabulates the estimation sets stored by eststo. If no such estimates exist, esttab tabulates the most recent estimation results.


eststo stores a copy of the active estimation results for later tabulation. If name is provided, the estimation set is stored under name. If name is not provided, the estimation set is stored under est#, where # is a counter for the number of stored estimation sets.

eststo may be used in two ways: Either after fitting a model as in

. regress y x
. eststo

or as a prefix command:

. eststo: regress y x

Add name to model using eststo model_name.

. sysuse auto
(1978 Automobile Data)

. quietly regress price weight

. eststo model1

. quietly regress turn weight foreign

. eststo model2

. estout
        
--------------------------------------
                    model1       model2
                        b            b
--------------------------------------
weight           2.044063     .0042183
foreign                      -1.809802
_cons           -6.707353     27.44963
--------------------------------------

Alternatively, use estimates store model_name to store estimation results.


Output example of estab

--------------------------------------------
                      (1)             (2)   
                    price           price   
--------------------------------------------
weight              1.747**         3.465***
                   (2.72)          (5.49)   

mpg                -49.51           21.85   
                  (-0.57)          (0.29)   

foreign                            3673.1***
                                   (5.37)   

_cons              1946.1         -5853.7   
                   (0.54)         (-1.73)   
--------------------------------------------
N                      74              74   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

The default of esttab is to display raw point estimates along with t-statistics and to print the number of observations in the table footer.

Parameter statistics options:

You can replace the t-statistics with standard errors by using the se option.

The t-statistics can also be replaced by p-values (option p), confidence intervals (option ci), or any parameter statistics contained in the estimates (see the aux() option).

If you want to include multiple parameter statistics, you can make use of the cells() option. E.g., cells("b(fmt(%9.4f)) se(fmt(%9.4f) par) t(fmt(%9.2f)) p(fmt(%9.3f) star)") will display the point estimates, standard errors, t-statistics, and p-values in the specified formats. Note that the significance star will be added to the p-values. By doing so, you can perform calculations using the coefficients.

Use help estout for more details about the cells(element[(subopts)]) option.

cells() specifies the contents of the table cells. For each element (b, se, t, p, etc.), you can specify suboptions in parentheses to control the format and appearance of the statistics.

Available cells() subopts:

  • fmt(%fmt) specify the format for the statistic
  • star add significance stars
    • nostar suppress significance stars
  • label add a customized label instead of default abbreviation
  • par enclose the statistic in parentheses

N.B. By default, multiple statistics are stacked vertically (i.e., one statistic per line). If you want to display them side by side, enclose the elements in quotes or parentheses, e.g., cells("b se t p") or cells((b se t p)). You can add suboptions to each element as needed. See the wide table below for more details.


Summary statistics options:

Adjusted R-squared (option ar2), pseudo R-squared (option pr2), Akaike’s or Schwarz’s information criterion (options aic and sic).


Significance stars options:

  • star(* .05 ** .01 *** .001) adds significance stars to the point estimates based on the specified p-value cutoffs.

    You can change the significance levels by specifying different cutoff values. For example, star(* .1 ** .05 *** .01) will use 0.1, 0.05, and 0.01 as the significance levels for one, two, and three stars, respectively.

    Syntax:

    star[(symbol level [...])]

    Note that the threshold levels must lie in the \((0,1]\) interval and must be specified in descending order.

    star(+ 0.10 * 0.05) will use + for \(p<.10\) and * for \(p<.05.\)

    When you change the default behavior of star, the legned is sometimes disabled. This happens when you add the stars to p-values instead of point estimates.

    In such cases, use legend to show the legend explaining the significance levels.

    When you change the significance levels, e.g., star(* 0.10 ** 0.05 *** 0.01), the legend will be updated automatically.

  • staraux adds significance stars based on auxiliary statistics (t-stat or SE depending on your settings) instead of the point estimates.


Layout options:

  • plain setting all formats to %9.0g.

    This can be useful when you want to export the table to CSV.

    Without plain, the table cells are enclosed in double quotes and preceded by an equal sign (i.e., ="..."). This is to prevent Excel from interpreting the contents of the table cells.

    But the drawback is that you cannot perform calculations directly in Excel using these cells. To enable calculations, you can use the plain option to export the table cells enclosed in double quotes without the leading equal sign.

  • wide causes point estimates and t-statistics (or standard errors, etc.) to be printed beside one another instead of beneath one another. (By default, esttab stacks all parameter statistics beneath the point estimates.)

    N.B. This only works when a single parameter statistic is displayed per model.

    If you have multiple statistics per model (e.g., both standard errors and t-statistics), wide will be ignored. You need to use cells() to explicitly specify the layout you want.

    // Each statistic as a separate column, wide format
    esttab dynamic_model_xtabond using "data/stata/xtabond_test.csv", ///
      cells("b(fmt(%9.4f)) se(fmt(%9.4f)) t(fmt(%9.2f)) p(fmt(%9.3f) star)") ///
      keep(L.logd_gdp tmp tmp2 pre pre2 tmp_pre tmp2_pre tmp_pre2 tmp2_pre2) ///
      csv replace legend

    Note that cells(b se) will produce the vertical layout by default. To get the wide layout, two options:

    • put elements in quotes cells("b se"), or
    • put elements in parentheses cells((b se))

Long table example

// avoid "cells(b se)", difficult to read
. estout, cells(b se)

--------------------------------------
                     est1         est2
                     b/se         b/se
--------------------------------------
weight           1.746559     3.464706
                 .6413538      .630749
mpg             -49.51222      21.8536
                 86.15604     74.22114
foreign                        3673.06
                              683.9783
_cons            1946.069    -5853.696
                  3597.05     3376.987
--------------------------------------

Wide table example

// better to use "cells("b se t p")" for wide format
. estout est2, cells("b se t p") 

----------------------------------------------------------------
                     est2                                       
                        b           se            t            p
----------------------------------------------------------------
weight           3.464706      .630749     5.493003     5.99e-07
mpg               21.8536     74.22114     .2944391     .7692938
foreign           3673.06     683.9783     5.370142     9.72e-07
_cons           -5853.696     3376.987    -1.733408     .0874262
----------------------------------------------------------------
  • mtitles[(list)] specifies model titles to be printed as the table header. If list is omitted, the names of the stored estimation results are used as model titles.

Output options:

  • replace allows Stata to overwrite existing files.

ref: estout package documentation


11.2.2.2 etable

etable allows you to easily create a table of estimation results and export it to a variety of file types, e.g., docx, html, pdf, xlsx, tex, txt, markdown, md.

Easier to use than esttab for exporting a regression table; but less flexible. esttab is more customizable.

// use example of etable
. clear all
. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)
. quietly regress bpsystol age weight i.region
. estimates store model1

. quietly regress bpsystol i.sex weight i.agegrp
. estimates store model2

. quietly regress bpsystol age weight i.agegrp
. estimates store model3

. etable, estimates(model1 model2 model3) showstars showstarsnote title("Table 1. Models for systolic blood pressure") export(mydoc.docx, replace)

Options:

  • showstars and showstarsnote shows stars and notes for significance levels.
  • export allows you to specify the output format

11.2.3 Stored Results

Stata commands that report results also store the results where they can be subsequently used by other commands or programs. This is documented in the Stored results section of the particular command in the reference manuals.

  • r-class commands, such as summarize, store their results in r();

    most commands are r-class.

  • e-class commands, such as regress, store their results in e();

    e-class commands are Stata’s model estimation commands.

// for r-class command
return list
// for e-class command
ereturn list

Most estimation commands leave behind

  • e(b) the coefficient vector, and
  • e(V) the variance–covariance matrix of the estimates (VCE)
// display coef vector
matrix list e(b)
// assign it to a variable
matrix myb = e(b)
matrix list myb

You can refer to e(b) and e(V) in any matrix expression:

matrix c = e(b)*invsym(e(V))*e(b)’
matrix list c

invsym(e(V)) returns the inverse of e(V). Generally, invsym requires a a square, symmetric, and positive-definite matrix.