11.1 Basic syntax
Package management
Users can add new features to Stata, and some users choose to make new features that they have written available to others via the web. The files that comprise a new feature are called a package, and a package usually consists of one or more ado-files and help files.
- ssc install newpkgname: installs newpkgname from SSC. SSC (Statistical Software Components) is the premier Stata download site.
- ssc uninstall pkgname: uninstalls pkgname.
- ado update: checks installed packages for available updates; add the update option (ado update, update) to install them.
- ssc hot [, n(#)]: lists the most popular packages at SSC; n(#) sets the number of packages listed.
Stata is case-sensitive: myvar, Myvar, and MYVAR are three distinct names.
In Mata, a semicolon (;) is treated as a line separator. It is not required, but it may be used to place two statements on the same physical line:
x = 1 ; y = 2 ;
The last semicolon in the above example is unnecessary but allowed.
Types and Declarations
A Mata variable’s type is described from two perspectives:
- eltype: specifies the type of the elements. Default: transmorphic.
- orgtype: specifies the organization of the elements. Default: matrix.
eltype | orgtype |
---|---|
transmorphic | matrix |
numeric | vector |
real | rowvector |
complex | colvector |
string | scalar |
pointer | |
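The following is a minimal sketch of how eltype and orgtype appear in a declaration, assuming the types above refer to Mata, Stata's matrix language. The function name add2 and the variable y are hypothetical and chosen only for illustration.

```stata
mata:
// eltype real and orgtype scalar, declared for both the result and the argument
real scalar add2(real scalar x)
{
    return(x + 2)
}
// y is not declared, so it is transmorphic and takes its type from the assigned value
y = add2(3); y
// report the element type and organization currently held by y
eltype(y), orgtype(y)
end
```

The second statement on the `y = add2(3); y` line also shows the semicolon acting as a separator between two statements.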
With few exceptions, the basic Stata language syntax is
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]
where square brackets distinguish optional qualifiers and options from required ones. In this diagram,
- varlist denotes a list of variable names. If no varlist appears, most commands assume _all, which indicates all the variables in the dataset.
- command denotes a Stata command,
- exp denotes an algebraic expression,
- range denotes an observation range,
- weight denotes a weighting expression, and
- options denotes a list of options.
Note the comma (,), which separates the command’s main body from its options.
by varlist: repeats a command for each subset of the data, grouped by varlist.
Ex: group by region and summarize marriage and divorce:
sysuse census
sort region
by region: summarize marriage divorce
Note that you have to sort before using by varlist.
Alternatively, you can sort and group in one step:
by region, sort: summarize marriage divorce
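if exp restricts the command to observations for which exp is true.
// census12 stores region as a string and already contains the rate variables
webuse census12, clear
summarize marriage_rate divorce_rate if region == "West"
Use & (and) and | (or) to join conditions.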
in range restricts the command to a specific observation range.
- The first observation can be denoted by f.
- The last observation can be denoted by l.
- Negative numbers mean “from the end of the data”.
// summarize for observations 5 to 25
summarize marriage_rate divorce_rate in 5/25
// summarize for the last five observations
summarize marriage_rate divorce_rate in -5/l
Create new variables
gen variable = expression // generate new variables
replace variable = expression // replace the value of existing variables
generate creates a variable from the expression you specify.
// newvar equals oldvar + 2
generate newvar = oldvar + 2
// natural log of gdp
generate lngdp = ln(gdp)
// square of the variable exp
generate exp2 = exp^2
egen (extensions to generate) creates a new variable based on egen functions of existing variables.
Q: What are egen functions?
A: They are functions written specifically for egen.
// Generate newv1 for distinct groups of v1 and v2, and create and apply value label mylabel
egen newv1 = group(v1 v2), label(mylabel)
// for each country, calculate the average of wpop
by country_id, sort: egen pop_country = mean(wpop)
gen vs. egen
- gen is used for simple algebraic transformations.
- egen is used for more complex transformations, e.g., operations based on groups of observations.
- They behave differently when you calculate a sum: gen with sum() returns the running sum, whereas egen with total() returns the overall sum (or the group sum when used with by).
// Create variable containing the running sum of x
generate runsum = sum(x)
// Create variable containing a constant equal to the overall sum of x
egen totalsum = total(x)
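The per-group difference can be seen on a tiny made-up dataset. This is only a sketch: the variables g and x and their values are invented for illustration, and bysort g (x) is used so that the order within each group is deterministic.

```stata
clear
input str1 g x
"a" 1
"a" 2
"b" 3
"b" 4
end
// running sum within each group: 1, 3 for group a and 3, 7 for group b
bysort g (x): generate runsum = sum(x)
// group total repeated on every row: 3 for group a and 7 for group b
bysort g: egen grpsum = total(x)
list, sepby(g)
```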
encode var, gen(newvar) creates a new variable named newvar based on the string variable var. It alphabetizes the unique values in var and assigns a numeric code to each, using the original strings as value labels.
encode sex, gen(gender)
// nolabel suppresses value labels and shows the underlying numeric codes
list sex gender in 1/4, nolabel
// without nolabel, sex and gender look identical
list sex gender in 1/4
sex is a string variable that takes on the values female and male.
encode creates a new variable gender, mapping each level in sex to a numerical value: female becomes 1 and male becomes 2.
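A self-contained version of this example can be sketched as follows. The four observations are made up for illustration; only the variable names sex and gender come from the text above.

```stata
clear
input str6 sex
"male"
"female"
"female"
"male"
end
// codes are assigned in alphabetical order of the strings
encode sex, gen(gender)
// with nolabel, gender shows the underlying codes 1 and 2
list sex gender, nolabel
// without nolabel, gender displays its value labels and looks like sex
list sex gender
```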
display displays strings and values of scalar expressions.
list displays the values of variables. If no varlist is specified, the values of all the variables are displayed.
11.1.1 System Variables
Expressions may also contain _variables (pronounced “underscore variables”), which are built-in system variables that are created and updated by Stata. They are called _variables because their names all begin with the underscore character, _.
Var | Description |
---|---|
_n | the number of the current observation. |
_N | the total number of observations in the dataset or the number of observations in the current by() group. |
_pi | \(\pi\) |
[eqno]_b[varname] or [eqno]_coef[varname] | value of the coefficient on varname from the most recently fitted model |
[eqno]_se[varname] | standard error of the coefficient on varname from the most recently fitted model |
_b[_cons] | value of the intercept term |
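As a rough illustration of these system variables, the sketch below uses the auto dataset that ships with Stata. The new variable names (obs_id, rank_in_group, group_size, price_prev) are hypothetical.

```stata
sysuse auto, clear
// _n is the observation number; _N is the total number of observations
generate obs_id = _n
display _N
// under by, _n and _N are relative to the current group
bysort foreign (price): generate rank_in_group = _n
bysort foreign: generate group_size = _N
// explicit subscripting: value from the previous observation (missing in row 1)
generate price_prev = price[_n-1]
// coefficient, standard error, and intercept from the most recent model
regress price mpg
display _b[mpg] "  " _se[mpg] "  " _b[_cons]
```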
11.1.2 Matrix
Enter matrices by row, separating one element from the next with commas (,) and one row from the next with backslashes (\).
To create
\[ A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \]
type
matrix input A = (1,2\3,4)
The input keyword is optional:
- Without input, the matrix must be small, and its elements can be expressions.
- With input, the matrix can be large, but its elements cannot be expressions.
Menu: Data > Matrices, ado language > Input matrix by hand
Use matname[r,c] to get the element in row r and column c.
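The two entry forms and element access can be sketched as below. The matrix names A and B and their contents are illustrative only.

```stata
// without input, elements may be expressions
matrix A = (1, 2 \ 3, 2*2)
// with input, elements must be literal numbers
matrix input B = (1, 2 \ 3, 4)
matrix list A
// element in row 2, column 2 of A
display A[2,2]
// dimensions of B
display rowsof(B) " x " colsof(B)
```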
matrix rownames and matrix colnames reset the row and column names of an already existing matrix.
matrix roweq and matrix coleq also reset the row and column names of an already existing matrix, but if a simple name (a name without a colon) is specified, it is interpreted as an equation name.
// Reset row and column names of matrix A
matrix rownames A = names
matrix colnames A = names
// Reset row and column names, interpreting simple names as equation names
matrix roweq A = names
matrix coleq A = names
Here A is a matrix, and each name in names can be:
- a simple name, e.g., var
- an interaction, e.g., var1#var2
- a colon followed by a simple name
- a colon followed by an interaction
- an equation name followed by a colon and a simple name, e.g., myeq:var
- an equation name, a colon, and an interaction, e.g., myeq:var1#var2
Matrix define: https://www.stata.com/manuals/pmatrixdefine.pdf#pmatrixdefine
Macro functions
- rownames A and colnames A return a list of all the row or column subnames (with time-series operators if applicable) of A, separated by single blanks. The equation names, even if present, are not included.
- roweq A and coleq A return a list of all the row equation names or column equation names of A, separated by single blanks, and with each name appearing however many times it appears in the matrix.
- rowfullnames A and colfullnames A return a list of all the row or column names, including equation names of A, separated by single blanks.
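A short sketch of naming a matrix and reading the names back with the macro functions follows. The names r1, r2, x1, x2, and myeq are hypothetical placeholders, not names from the text.

```stata
matrix A = (1, 2 \ 3, 4)
// simple row names; column names carry the equation name myeq
matrix rownames A = r1 r2
matrix colnames A = myeq:x1 myeq:x2
matrix list A
// macro functions return the names as space-separated strings
local rnames : rownames A
local cnames : colnames A
local cfull  : colfullnames A
local ceq    : coleq A
display "`rnames' | `cnames' | `cfull' | `ceq'"
```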
11.1.3 Factor Variables
i.varname creates indicators for each level of the variable.
// group=1 as base level
list group i.group in 1/5
// group=3 as base level (ib3. sets the base; i3. would select only the group=3 indicator)
list group ib3.group in 1/5
// individual fixed effects
regress y i.group
- c.varname treats the variable as continuous.
- # (cross) creates an interaction for each combination of the variables. Spaces are not allowed in interactions.
- ## (factorial cross) creates a full factorial of the variables: standalone effects for each variable and an interaction.
- o.varname omits a variable or indicator. For example, o.age means that the continuous variable age should be omitted, and o2.group means that the indicator for group = 2 should be omitted.
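To see the operators in a model, here is a minimal sketch using the auto dataset that ships with Stata; the choice of price, rep78, mpg, and weight is purely illustrative.

```stata
sysuse auto, clear
// indicators for each level of rep78, with level 3 as the base
regress price ib3.rep78
// ## expands to both main effects plus their interaction
regress price c.mpg##c.weight
// the same model spelled out with #
regress price c.mpg c.weight c.mpg#c.weight
```

The last two commands should report the same coefficients, since ## is shorthand for the main effects plus the # interaction.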
Interaction Expansion
xi expands terms containing categorical variables into indicator (also called dummy) variable sets. xi provides a convenient way to include dummy or indicator variables when fitting a model that does NOT support factor variables, e.g., xtabond.
We recommend that you use factor variables instead of xi if a command allows factor variables.
By default, xi will create interaction variables starting with _I. This can be changed using the prefix(string) option.
Operator | Description |
---|---|
i.varname | creates dummies for categorical variable varname |
i.varname1*i.varname2 | creates dummies for categorical variables varname1 and varname2: main effects and all interactions |
i.varname1*varname3 | creates dummies for categorical variable varname1 and continuous variable varname3: main effects and all interactions |
i.varname1\|varname3 | creates dummies for categorical variable varname1 and continuous variable varname3: all interactions and main effect of varname3, but NO main effect of varname1 |
xi expands both numeric and string categorical variables. Suppose agegrp takes on values 1, 2, 3, and 4. Given the term i.agegrp, xi tabulates agegrp and creates indicator (dummy) variables for each observed value, omitting the indicator for the smallest value. This creates variables named _Iagegrp_2, _Iagegrp_3, and _Iagegrp_4.
Dummy variables are created automatically and are left in your dataset. You can drop them by typing drop _I*. You do not have to do this; each time you use xi, any automatically generated dummies with the same prefix as the one specified in the prefix(string) option, or _I by default, are dropped and new ones are created.
Use xi as a command prefix:
// simple effects
xi: logistic outcome weight i.agegrp bp
// interactions of categorical variables
xi: logistic outcome weight bp i.agegrp*i.race
// interactions of dummy variables with continuous variables
// fits a model with indicator variables for all agegrp categories interacted with weight, plus the main-effect terms weight and i.agegrp
xi: logistic outcome bp i.agegrp*weight i.race
// interaction terms without the agegrp main effect (but with the weight main effect)
xi: logistic outcome bp i.agegrp|weight i.race
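The commands above rely on variables (outcome, weight, bp, agegrp, race) from a dataset that is not shown here. As a runnable sketch of the same prefix usage, the example below swaps in the auto dataset that ships with Stata and uses rep78 as the categorical variable; that substitution is an assumption made only for illustration.

```stata
sysuse auto, clear
// xi expands i.rep78 into _Irep78_* dummies before regress runs
xi: regress price i.rep78
// the generated indicators are left in the dataset
describe _I*
```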
11.1.4 Time series varlists
Time-series operators include L. (lag), F. (lead), D. (difference), and S. (seasonal difference).
First declare the time variable with tsset; then you can use the time-series operators.
Convert to panel
TS Operator | Meaning |
---|---|
L. | lag \(x_{t-1}\) |
L2. | 2-period lag \(x_{t-2}\) |
L(1/2). | a varlist \(x_{t-1}\) and \(x_{t-2}\) |
F. | lead \(x_{t+1}\) |
F2. | 2-period lead \(x_{t+2}\) |
D. | difference \(x_{t}-x_{t-1}\) |
D2. | difference of difference \((x_{t}-x_{t-1})-(x_{t-1}-x_{t-2})\) |
S. | “seasonal” difference \(x_{t}-x_{t-1}\) |
S2. | lag-2 seasonal difference \(x_{t}-x_{t-2}\) |
Note that D1. = S1., but D2. \(\ne\) S2.:
- D2. refers to the difference of the difference.
- S2. refers to the two-period difference.
Operators may be typed in uppercase or lowercase
L(1/3).(gnp cpi)
// equivalently
L.gnp L2.gnp L3.gnp L.cpi L2.cpi L3.cpi
DS12.gnp is the one-period difference of the 12-period difference.
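The distinction between D2. and S2. is easy to check on a small made-up series. In this sketch the variables t and x are invented, with \(x_t = t^2\), so D2.x is constant while S2.x grows.

```stata
clear
set obs 6
generate t = _n
generate x = t^2
// declare t as the time variable so the operators can be used
tsset t
generate lag1 = L.x
generate d1 = D.x
// difference of the difference
generate d2 = D2.x
// two-period difference x_t - x_{t-2}
generate s2 = S2.x
list t x lag1 d1 d2 s2
```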
.do is the file extension for a Stata do-file.
.dta is the Stata dataset file format.
11.1.5 Labels
Variable labels convey information about a variable, and can be a substitute for long variable names.
// generally
label variable variable_name "variable label"
// use example
label variable price "Price in 1978 Dollars"
Value labels are used with categorical variables to tell you what the categories mean.
- First define a mapping
// generally
label define map_name value1 "label1" value2 "label2"...
// use example
label define rep_label 1 "Bad" 2 "Average" 3 "Good"
- Add value labels to existing variables