11.2 Data Manipulation
11.2.1 Import and Export
Shipped datasets
Stata contains some demonstration datasets in the system directories.
sysuse dir: list the names of shipped datasets.
sysuse lifeexp: use lifeexp
Note that use lifeexp will return error. Data not found.
User datasets
.dta
use myauto [, clear]: Load myauto.dta (Stata-format) into memory.
clearit is okay to replace the data in memory, even though the current data have not been saved to disk.
save myauto [, replace]: Create a Stata data type file myauto.dta
replaceallows Stata to overwrite existing dataset that is the output from previous attempts to run the do file.
.csv
import delimited myauto.csv: Import myauto.csv to Stata’s memory
export delimited myauto.csv” Export to myauto.csv
import delimited filename reads text (ASCII) files in which there is one observation
per line and the values are separated by commas, tabs, or some other delimiter.
By default, Stat will check if the file is delimited by tabs or commas based on the first line of data.
export delimited filename writes data into a file in comma-separated (.csv) format by default. You can specify any separation character delimiter that you prefer.
If filename is specified without an extension, .csv is assumed. If filename contains embedded spaces, enclose it in double quotes.
Options
delimiters("chars"[, collapse | asstring] ):"chars"specifies the delimiter";": uses semicolon as a delimiter;"\t"uses tab,"whitespace"uses whitespacecollapsetreat multiple consecutive delimiters as just one delimiter.asstringtreatcharsas one delimiter. By default, each character incharsis treated as an individual delimiter.
clearreplace data in memory
11.2.2 Save Estimation Results
estimates store model_name stores the current (active) estimation results under the name model_name.
// Store estimation results as m1 for use later in the same session
. estimates store m1
// to get them back
. estimates restore m1
// Find out what you have stored
. estimates direstimates save saves the current active estimation results to a file with the extension .ster.
In a different session, you can reload those results:
// Load the saved estimation results
. estimates use basemodel
// Display the results
. estimates tableQ: What is the difference between estimates store and estimates save?
A: Once estimation results are stored (estimates store), you can use other estimates commands to produce tables and reports from them. estimates save saves the results to disk (as a .ster file) so that you can use them in a different Stata session.
estimates storekeeps the results in memory, fast and temporary. The most common use is to store results from multiple models in the same session for comparison.estimates savekeeps the results on disk, persistent. The most common use is to save time-consuming important results for future reference. Also useful when you want to share results with others.
Q: How to print the original regression output again after storing the results?
A: Use estimates replay model_name. This will show the original output as it appeared when the model was first estimated.
estimates table [namelist] [, options] organizes estimation results from one or more models in a single formatted table.
If you type estimates table without arguments, a table of the most recent estimation coefficients will be shown.
estimates table namelist is useful to compare results from multiple models side by side. Good for a quick preview. Use esttab or etable for more advanced table formatting and exporting.
// Display a table of coefficients for stored estimates m1 and m2
estimates table m1 m2
// with SE
estimates table m1 m2, se
// with sample size, adjusted R2, and stars
estimates table m1 m2, stats(N r2_a) starYou can add more results to show using options:
Parameter statistics options:
b[%fmt]how to format the coefficients.se[%fmt]show standard errors and use optional formatt[%fmt]show \(t\) or \(z\) statistics and use optional formatp[%fmt]show \(p\) values and use optional formatvarlabeldisplay variable labels rather than variable names
All statistics are shown in order under the coefficients. If you have a long list of variables, the table can be very long.
You can use keep(varlist) to keep only the variables you want to show in the table.
varlistis a list of variables you want to keep in the table.A list of variables can be specified as
keep(var1 var2 var3).Names are separated by spaces.
Not possible to use variable ranges, e.g.,
keep(var1-var3)will return an error.When you have multiple equations, use
eqn_name:varnameto specify the variable in a specific equation.
stats(scalarlist)reports additional statistics in the table. Below are commonly used result identifiers:Nfor sample sizer2_afor adjusted \(R^2\)r2for \(R^2\)Ffor F-statisticchi2for chi-squared statisticpfor p-value
starshows stars for significance levels.By default,
star(.05 .01 .001), which uses the following significance levels:*for \(p < 0.05\)**for \(p < 0.01\)***for \(p < 0.001\)
You can change the significance levels using
star(.1 .05 .01)to set the levels to 0.10, 0.05, and 0.01, respectively.Many journals use this convention actually. Also this boosts the appearance of significance even though the p-values remain the same.
*for \(p < 0.10\)**for \(p < 0.05\)***for \(p < 0.01\)
N.B. the
staroption may NOT be combined with these,t, orpoption.
An error will be returned if you try to combine them:
Example output
By default, estimates table shows only the coefficients for all regressors.
Use keep() to show only selected variables, and use star to show significance levels.
. estimates table global_trend country_trend, keep(L.logd_gdp tmp tmp2 pre pre2 tmp_pre tmp2_pre tmp_pre2 tmp2_pre2) star
----------------------------------------------
Variable | global_trend country_trend
-------------+--------------------------------
logd_gdp |
L1. | .02106535 .01196853
|
tmp | .02233754** .02285209**
tmp2 | -.00091508*** -.00093839***
pre | .07140978 .07469934
pre2 | -.04168053 -.04459604
tmp_pre | -.01259437 -.01284796
tmp2_pre | .00047348* .00047987*
tmp_pre2 | .00472783 .00496185
tmp2_pre2 | -.00013928 -.00014426
----------------------------------------------
Legend: * p<0.05; ** p<0.01; *** p<0.001Print standard errors, t-statistics, and p-values together:
. estimates table global_trend country_trend, keep(L.logd_gdp tmp tmp2 pre pre2 tmp_pre tmp2_pre tmp_pre2 tmp2_pre2)
> se t p
----------------------------------------
Variable | global_t~d country_~d
-------------+--------------------------
logd_gdp |
L1. | .02106535 .01196853
| .01402981 .01433765
| 1.50 0.83
| 0.1332 0.4039
|
tmp | .02233754 .02285209
| .0068979 .00701607
| 3.24 3.26
| 0.0012 0.0011
tmp2 | -.00091508 -.00093839
| .00019456 .00019787
| -4.70 -4.74
| 0.0000 0.0000
...
...
tmp_pre2 | .00472783 .00496185
| .00420448 .00427315
| 1.12 1.16
| 0.2608 0.2456
tmp2_pre2 | -.00013928 -.00014426
| .00009435 .00009588
| -1.48 -1.50
| 0.1399 0.1324
----------------------------------------
Legend: b/se/t/p11.2.2.1 esttab
esttab is a wrapper for estout. It is a command for publication-style regression tables that display nicely in Stata’s results window or, optionally, can be exported to various formats such as CSV, RTF, HTML, or LaTeX.
namelistis a list of stored estimation results to be included in the table. Can be_all.If namelist is omitted, esttab tabulates the estimation sets stored by
eststo. If no such estimates exist,esttabtabulates the most recent estimation results.
eststo stores a copy of the active estimation results for later tabulation. If name is provided, the estimation set is stored under name. If name is not provided, the estimation set is stored under est#, where # is a counter for the number of stored estimation sets.
eststo may be used in two ways: Either after fitting a model as in
. regress y x
. eststo
or as a prefix command:
. eststo: regress y x
Add name to model using eststo model_name.
. sysuse auto
(1978 Automobile Data)
. quietly regress price weight
. eststo model1
. quietly regress turn weight foreign
. eststo model2
. estout
--------------------------------------
model1 model2
b b
--------------------------------------
weight 2.044063 .0042183
foreign -1.809802
_cons -6.707353 27.44963
--------------------------------------
Alternatively, use estimates store model_name to store estimation results.
Output example of estab
--------------------------------------------
(1) (2)
price price
--------------------------------------------
weight 1.747** 3.465***
(2.72) (5.49)
mpg -49.51 21.85
(-0.57) (0.29)
foreign 3673.1***
(5.37)
_cons 1946.1 -5853.7
(0.54) (-1.73)
--------------------------------------------
N 74 74
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
The default of esttab is to display raw point estimates along with t-statistics and to print the number of observations in the table footer.
Parameter statistics options:
You can replace the t-statistics with standard errors by using the se option.
The t-statistics can also be replaced by p-values (option p), confidence intervals (option ci), or any parameter statistics contained in the estimates (see the aux() option).
If you want to include multiple parameter statistics, you can make use of the cells() option. E.g., cells("b(fmt(%9.4f)) se(fmt(%9.4f) par) t(fmt(%9.2f)) p(fmt(%9.3f) star)") will display the point estimates, standard errors, t-statistics, and p-values in the specified formats. Note that the significance star will be added to the p-values. By doing so, you can perform calculations using the coefficients.
Use help estout for more details about the cells(element[(subopts)]) option.
cells() specifies the contents of the table cells. For each element (b, se, t, p, etc.), you can specify suboptions in parentheses to control the format and appearance of the statistics.
Available cells() subopts:
fmt(%fmt)specify the format for the statisticstaradd significance starsnostarsuppress significance stars
labeladd a customized label instead of default abbreviationparenclose the statistic in parentheses
N.B. By default, multiple statistics are stacked vertically (i.e., one statistic per line). If you want to display them side by side, enclose the elements in quotes or parentheses, e.g., cells("b se t p") or cells((b se t p)). You can add suboptions to each element as needed.
See the wide table below for more details.
Summary statistics options:
Adjusted R-squared (option ar2), pseudo R-squared (option pr2), Akaike’s or Schwarz’s information criterion (options aic and sic).
Significance stars options:
star(* .05 ** .01 *** .001)adds significance stars to the point estimates based on the specified p-value cutoffs.You can change the significance levels by specifying different cutoff values. For example,
star(* .1 ** .05 *** .01)will use 0.1, 0.05, and 0.01 as the significance levels for one, two, and three stars, respectively.Syntax:
Note that the threshold levels must lie in the \((0,1]\) interval and must be specified in descending order.
star(+ 0.10 * 0.05)will use+for \(p<.10\) and*for \(p<.05.\)When you change the default behavior of
star, the legned is sometimes disabled. This happens when you add the stars to p-values instead of point estimates.In such cases, use
legendto show the legend explaining the significance levels.When you change the significance levels, e.g.,
star(* 0.10 ** 0.05 *** 0.01), the legend will be updated automatically.starauxadds significance stars based on auxiliary statistics (t-stat or SE depending on your settings) instead of the point estimates.
Layout options:
plainsetting all formats to%9.0g.This can be useful when you want to export the table to CSV.
Without
plain, the table cells are enclosed in double quotes and preceded by an equal sign (i.e.,="..."). This is to prevent Excel from interpreting the contents of the table cells.But the drawback is that you cannot perform calculations directly in Excel using these cells. To enable calculations, you can use the
plainoption to export the table cells enclosed in double quotes without the leading equal sign.
widecauses point estimates and t-statistics (or standard errors, etc.) to be printed beside one another instead of beneath one another. (By default, esttab stacks all parameter statistics beneath the point estimates.)N.B. This only works when a single parameter statistic is displayed per model.
If you have multiple statistics per model (e.g., both standard errors and t-statistics),
widewill be ignored. You need to usecells()to explicitly specify the layout you want.// Each statistic as a separate column, wide format esttab dynamic_model_xtabond using "data/stata/xtabond_test.csv", /// cells("b(fmt(%9.4f)) se(fmt(%9.4f)) t(fmt(%9.2f)) p(fmt(%9.3f) star)") /// keep(L.logd_gdp tmp tmp2 pre pre2 tmp_pre tmp2_pre tmp_pre2 tmp2_pre2) /// csv replace legendNote that
cells(b se)will produce the vertical layout by default. To get the wide layout, two options:- put elements in quotes
cells("b se"), or - put elements in parentheses
cells((b se))
- put elements in quotes
Long table example
// avoid "cells(b se)", difficult to read
. estout, cells(b se)
--------------------------------------
est1 est2
b/se b/se
--------------------------------------
weight 1.746559 3.464706
.6413538 .630749
mpg -49.51222 21.8536
86.15604 74.22114
foreign 3673.06
683.9783
_cons 1946.069 -5853.696
3597.05 3376.987
--------------------------------------Wide table example
// better to use "cells("b se t p")" for wide format
. estout est2, cells("b se t p")
----------------------------------------------------------------
est2
b se t p
----------------------------------------------------------------
weight 3.464706 .630749 5.493003 5.99e-07
mpg 21.8536 74.22114 .2944391 .7692938
foreign 3673.06 683.9783 5.370142 9.72e-07
_cons -5853.696 3376.987 -1.733408 .0874262
----------------------------------------------------------------mtitles[(list)]specifies model titles to be printed as the table header. Iflistis omitted, the names of the stored estimation results are used as model titles.
Output options:
replaceallows Stata to overwrite existing files.
ref: estout package documentation
11.2.2.2 etable
etable allows you to easily create a table of estimation results and export it to a variety of file types, e.g., docx, html, pdf, xlsx, tex, txt, markdown, md.
Easier to use than esttab for exporting a regression table; but less flexible. esttab is more customizable.
// use example of etable
. clear all
. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)
. quietly regress bpsystol age weight i.region
. estimates store model1
. quietly regress bpsystol i.sex weight i.agegrp
. estimates store model2
. quietly regress bpsystol age weight i.agegrp
. estimates store model3
. etable, estimates(model1 model2 model3) showstars showstarsnote title("Table 1. Models for systolic blood pressure") export(mydoc.docx, replace)Options:
showstarsandshowstarsnoteshows stars and notes for significance levels.exportallows you to specify the output format
11.2.3 Stored Results
Stata commands that report results also store the results where they can be subsequently used by other commands or programs. This is documented in the Stored results section of the particular command in the reference manuals.
r-class commands, such as summarize, store their results in
r();most commands are r-class.
e-class commands, such as regress, store their results in
e();e-class commands are Stata’s model estimation commands.
Most estimation commands leave behind
e(b)the coefficient vector, ande(V)the variance–covariance matrix of the estimates (VCE)
// display coef vector
matrix list e(b)
// assign it to a variable
matrix myb = e(b)
matrix list mybYou can refer to e(b) and e(V) in any matrix expression:
invsym(e(V)) returns the inverse of e(V). Generally, invsym requires a a square, symmetric, and positive-definite matrix.