11.1 Basic syntax

Package management

Users can add new features to Stata, and some users choose to make new features that they have written available to others via the web. The files that comprise a new feature are called a package, and a package usually consists of one or more ado-files and help files.

ssc install newpkgname installs newpkgname from SSC (Statistical Software Components), the premier Stata download site.

ssc uninstall pkgname uninstalls pkgname.

ado update lists the installed packages for which updates are available; ado update, update installs those updates.

ssc hot [, n(#)] lists the most popular packages at SSC; n(#) specifies the number of packages listed.

Stata is case-sensitive: myvar, Myvar and MYVAR are three distinct names.

In Mata, a semicolon (;) is treated as a line separator. It is not required, but it may be used to place two statements on the same physical line:

x = 1 ; y = 2 ;

The last semicolon in the above example is unnecessary but allowed.

Types and Declarations

A variable’s type can be described from two perspectives:

  • eltype: specifies the type of the elements. Default: transmorphic.
  • orgtype: specifies the organization of the elements. Default: matrix.
eltype          orgtype
transmorphic    matrix
numeric         vector
real            rowvector
complex         colvector
string          scalar
pointer
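A minimal sketch of how an eltype and an orgtype combine in a Mata declaration, run from a Stata do-file. The function name first_two is hypothetical and exists only for illustration; eltype() and orgtype() are Mata's built-in functions for inspecting a value.

```stata
mata:
// hypothetical helper: returns a real rowvector, takes a real matrix
real rowvector first_two(real matrix X)
{
    real rowvector r            // eltype real, orgtype rowvector
    r = X[1, (1, 2)]            // first row, columns 1 and 2
    return(r)
}

A = (1, 2 \ 3, 4)
eltype(A), orgtype(A)           // inspect the two perspectives of A
first_two(A)
end
```

Omitting a declaration leaves the variable at the defaults listed above, transmorphic matrix.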
With few exceptions, the basic Stata command syntax is

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]

where square brackets distinguish optional qualifiers and options from required ones. In this diagram,

  • varlist denotes a list of variable names,

    If no varlist appears, most commands assume _all, which indicates all the variables in the dataset.

  • command denotes a Stata command,

  • exp denotes an algebraic expression,

  • range denotes an observation range,

  • weight denotes a weighting expression, and

  • options denotes a list of options.

    Note the comma (,), which separates the command’s main body from its options.

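by varlist: repeats a command for each subset of the data, grouped by varlist.

Ex: group by region and summarize marriage_rate and divorce_rate

// census12 is the dataset used in the Stata User's Guide examples; region is a string variable in it
webuse census12
sort region
by region: summarize marriage_rate divorce_rate

Note that the data must be sorted by varlist before you use by varlist:.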

Alternatively, you can sort and group in one step:

by region, sort: summarize marriage_rate divorce_rate
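The same pattern can be sketched on the auto dataset that ships with Stata (an assumption made so the example runs without a download; it is not the census data used above). The variable name n_in_group is hypothetical.

```stata
sysuse auto, clear

// bysort is shorthand for "by ..., sort:"
bysort foreign: summarize price mpg

// the data are now sorted by foreign, so a plain by prefix works
by foreign: generate n_in_group = _N
```

Within a by group, _N is the number of observations in that group, so n_in_group holds the group size on every row.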

if exp restricts the command to observations for which exp is true (nonzero).

summarize marriage_rate divorce_rate if region == "West"

  • Use & (and) and | (or) to join conditions.
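A short sketch of joined conditions, using the auto dataset shipped with Stata as an assumed stand-in. It also shows one common pitfall: numeric missing values compare as larger than any number, so a "greater than" condition includes them unless they are excluded explicitly.

```stata
sysuse auto, clear

// both conditions must hold
summarize price if foreign == 1 & mpg > 25

// either condition may hold
count if price > 10000 | mpg > 30

// rep78 has missing values; exclude them explicitly
count if rep78 > 4 & !missing(rep78)
```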

in range restricts the command to a specific range of observations.

  • First observation can be denoted by f
  • Last observation can be denoted by l
  • Negative numbers mean “from the end of the data”
// summarize for observations 5 to 25
summarize marriage_rate divorce_rate in 5/25
// summarize for the last five observations
summarize marriage_rate divorce_rate in -5/l
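The range forms can be tried on any dataset; the sketch below assumes the auto dataset shipped with Stata.

```stata
sysuse auto, clear

// first three observations (f = first)
list make price in f/3

// last three observations (l = last, negative = counted from the end)
list make price in -3/l

// observations 10 through 20, inclusive
count in 10/20
```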

Create new variables

gen variable = expression      // generate new variables
replace variable = expression  // replace the value of existing variables

generate creates a new variable from the expression you specify.

generate newvar = oldvar + 2    // newvar equals oldvar + 2
generate lngdp = ln(gdp)        // natural log of gdp
generate exp2 = exp^2           // square of the variable exp
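A runnable sketch of generate and replace, assuming the auto dataset shipped with Stata. The new variable names (price_k, lnprice, mpg2) are hypothetical.

```stata
sysuse auto, clear

generate price_k = price / 1000     // price in thousands of dollars
generate lnprice = ln(price)        // natural log of price
generate mpg2    = mpg^2            // square of mpg

// replace overwrites an existing variable
replace price_k = round(price_k, 0.1)
```

generate fails if the variable already exists, and replace fails if it does not, which guards against overwriting data by accident.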

egen: Extensions to generate; creates a new variable based on egen functions of existing variables.

Q: What are egen functions?
A: They are functions written specifically for egen, such as mean(), total(), and group(); see help egen for the full list.

// Generate newv1 for distinct groups of v1 and v2, and create and apply value label mylabel
egen newv1 = group(v1 v2), label(mylabel)

// for each country, calculate the average of wpop
by country_id, sort: egen pop_country = mean(wpop)
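The two egen calls above refer to variables (v1, v2, wpop) that are not defined in this section. The following is a self-contained sketch of the same ideas, assuming the auto dataset shipped with Stata; mean_price and grp are hypothetical names.

```stata
sysuse auto, clear

// group mean: every observation gets the mean price of its foreign group
egen mean_price = mean(price), by(foreign)

// one integer code per distinct combination of foreign and rep78
egen grp = group(foreign rep78), label

list make foreign rep78 mean_price grp in 1/5
```

By default, group() assigns missing to observations with a missing value in any of the grouping variables.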

gen vs. egen

  • gen is used for simple algebraic transformations.
  • egen is used for more complex transformations, e.g., operations based on groups of observations.
  • They behave differently when you calculate a sum:
    • gen with sum() returns the running sum
    • egen with total() returns the overall (or group) sum
// Create variable containing the running sum of x
generate runsum = sum(x)

// Create variable containing a constant equal to the overall sum of x
egen totalsum = total(x)
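A tiny worked example makes the difference visible. The dataset below is made up for illustration, and the variable names are hypothetical.

```stata
clear
input grp x
1 1
1 2
2 3
2 4
end

generate runsum = sum(x)                        // 1, 3, 6, 10
egen totalsum = total(x)                        // 10 on every row

bysort grp (x): generate grp_runsum = sum(x)    // running sum restarts in each group
by grp: egen grp_total = total(x)               // 3 for grp 1, 7 for grp 2

list
```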

encode varname, gen(newvar) creates a new numeric variable named newvar based on the string variable varname. It alphabetizes the unique values of varname, assigns a numeric code to each, and attaches the original strings as value labels.

encode sex, gen(gender)
// nolabel suppresses value labels and shows the underlying numeric codes
list sex gender in 1/4, nolabel
// without nolabel, sex and gender look identical
list sex gender in 1/4

sex is a string variable and takes on values female and male.

encode creates a new variable gender, mapping each level in sex to a numerical value. female becomes 1 and male becomes 2.
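The encode example above assumes a dataset that already contains sex. A self-contained sketch with a made-up four-row dataset:

```stata
clear
input str6 sex
"male"
"female"
"female"
"male"
end

encode sex, generate(gender)

list sex gender, nolabel    // shows the numeric codes
label list gender           // shows the mapping encode created
```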

display displays strings and values of scalar expressions.

display [display_directive [display_directive [...]]]

list displays the values of variables. If no varlist is specified, the values of all the variables are displayed.

list [varlist] [if] [in] [, options]

11.1.1 System Variables

Expressions may also contain _variables (pronounced “underscore variables”), which are built-in system variables that are created and updated by Stata. They are called _variables because their names all begin with the underscore character, _.

Var Description
_n the number of the current observation.
_N the total number of observations in the dataset or the number of observations in the current by() group.
_pi \(\pi\)
[eqno]_b[varname] or [eqno]_coef[varname] value of the coefficient on varname from the most recently fitted model
[eqno]_se[varname] standard error of the coefficient on varname from the most recently fit model
_b[_cons] value of the intercept term
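A sketch of the system variables in use, assuming the auto dataset shipped with Stata; obs_id, prev_price, and n_group are hypothetical names.

```stata
sysuse auto, clear

generate obs_id = _n                      // observation number
generate prev_price = price[_n-1]         // value from the previous observation
bysort foreign: generate n_group = _N     // group size inside a by prefix

regress price mpg
display _b[mpg]                           // coefficient on mpg
display _se[mpg]                          // its standard error
display _b[_cons]                         // intercept
display _pi
```

prev_price is missing in the first observation because there is no observation 0.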

11.1.2 Matrix

Enter matrices by row: separate one element from the next with commas (,) and one row from the next with backslashes (\).

To create

\[ A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \]

type

matrix [input] A = (1,2\3,4)
matrix list A

input is optional.

  • Without input, the matrix must be small, but its elements can be expressions.
  • With input, the matrix can be large, but its elements cannot be expressions.

Menu: Data > Matrices, ado language > Input matrix by hand

matname[r,c] returns the element in row r and column c of matname.
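A minimal sketch of matrix entry and element access; the matrix names A and B are arbitrary.

```stata
matrix A = (1, 2 \ 3, 4)
matrix list A

// row 2, column 1
display A[2, 1]

// without input, elements may be expressions
matrix B = (sqrt(4), 1 + 1 \ A[1, 1], 0)
matrix list B
```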

matrix rownames and colnames reset the row and column names of an already existing matrix.

matrix roweq and coleq also reset the row and column names of an already existing matrix, but if a simple name (a name without a colon) is specified, it is interpreted as an equation name.

// Reset the row and column names of matrix A
matrix rownames A = names
matrix colnames A = names
// Reset the row and column names, interpreting simple names as equation names
matrix roweq A = names
matrix coleq A = names

Here A is an existing matrix and names is a list of names.

Each name can be:

  • a simple name, e.g., var
  • an interaction; e.g., var1#var2
  • a colon followed by a simple name;
  • a colon followed by an interaction;
  • an equation name followed by a colon, and a simple name; e.g., myeq:var
  • an equation name, a colon, and an interaction, e.g., myeq:var1#var2

Matrix define: https://www.stata.com/manuals/pmatrixdefine.pdf#pmatrixdefine


Macro functions

rownames A and colnames A return a list of all the row or column subnames (with time-series operators if applicable) of A, separated by single blanks. The equation names, even if present, are not included.

roweq A and coleq A return a list of all the row equation names or column equation names of A, separated by single blanks, and with each name appearing however many times it appears in the matrix.

rowfullnames A and colfullnames A return a list of all the row or column names, including equation names of A, separated by single blanks.
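A sketch that sets names and then reads them back with the macro functions. The names r1, r2, mpg, weight, and myeq are arbitrary choices for illustration.

```stata
matrix A = (1, 2 \ 3, 4)
matrix rownames A = r1 r2
matrix colnames A = mpg weight
matrix coleq A = myeq               // a single name is applied to every column

local cn : colnames A
local ce : coleq A
local cf : colfullnames A

display "`cn'"                      // subnames only
display "`ce'"                      // equation names only
display "`cf'"                      // equation:subname pairs
```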


11.1.3 Factor Variables

i.varname creates indicators for each level of the variable.

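// group = 1 (the smallest value) is the base level by default
list group i.group in 1/5
// set group = 3 as the base level (i3.group alone would be just the indicator for group = 3)
list group ib3.group in 1/5
// fixed effects for each level of group
regress y i.group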

c.varname treats varname as continuous.

# (cross) creates an interaction for each combination of the variables. Spaces are not allowed in interactions.

## (factorial cross) creates a full factorial of the variables: standalone effects for each variable and their interaction.

group##sex
// equivalently
i.group i.sex i.group#i.sex

o.varname omits a variable or indicator.

o.age means that the continuous variable age should be omitted, and o2.group means that the indicator for group = 2 should be omitted.
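The operators can be sketched with the auto dataset shipped with Stata (an assumed stand-in for the group and sex variables above). The scalar name r2_full is hypothetical.

```stata
sysuse auto, clear

// indicators for rep78, smallest level as base
regress price i.rep78

// same model with rep78 = 3 as the base level
regress price ib3.rep78

// full factorial of a continuous and a categorical variable
regress price c.mpg##i.foreign
scalar r2_full = e(r2)

// the same model written out term by term
regress price c.mpg i.foreign i.foreign#c.mpg
```

The last two commands fit the same model, so their R-squared values agree.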

Interaction Expansion

xi [ , prefix(string) noomit ] term(s)

xi expands terms containing categorical variables into indicator (also called dummy) variable sets. xi provides a convenient way to include dummy or indicator variables when fitting a model that does NOT support factor variables, e.g., xtabond.

We recommend that you use factor variables instead of xi if a command allows factor variables.

By default, xi will create interaction variables starting with _I. This can be changed using the prefix(string) option.

Operator Description
i.varname creates dummies for categorical variable varname
i.varname1*i.varname2 creates dummies for categorical variables varname1 and varname2: main effects and all interactions
i.varname1*varname3 creates dummies for categorical variable varname1 and continuous variable varname3: main effects and all interactions
i.varname1|varname3 creates dummies for categorical variable varname1 and continuous variable varname3: all interactions and main effect of varname3, but NO main effect of varname1
  • xi expands both numeric and string categorical variables.

    agegrp takes on values 1, 2, 3, and 4.

    xi: logistic outcome i.agegrp

    xi tabulates agegrp and creates an indicator (dummy) variable for each observed value, omitting the indicator for the smallest value.

    This creates variables named _Iagegrp_2, _Iagegrp_3, and _Iagegrp_4.

    // The expanded logistic model is
    logistic outcome _Iagegrp_2 _Iagegrp_3 _Iagegrp_4
  • Dummy variables are created automatically and are left in your dataset.

    You can drop them by typing drop _I*. You do not have to do this; each time you use xi, any automatically generated dummies with the same prefix as the one specified in the prefix(string) option, or _I by default, are dropped and new ones are created.

Use xi as a command prefix

// simple effects
xi: logistic outcome weight i.agegrp bp
// interactions of categorical variables
xi: logistic outcome weight bp i.agegrp*i.race
// interactions of dummy variables with continuous variables
// fits a model with indicator variables for all agegrp categories interacted with weight, plus the main-effect terms weight and i.agegrp.
xi: logistic outcome bp i.agegrp*weight i.race
// interaction terms without the agegrp main effect (but with the weight main effect)
xi: logistic outcome bp i.agegrp|weight i.race
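The examples above assume a dataset with outcome, agegrp, race, weight, and bp. A runnable sketch on the auto dataset shipped with Stata, comparing xi with factor-variable notation; the scalar names b_xi and b_fv are hypothetical.

```stata
sysuse auto, clear

// xi creates _Irep78_2 ... _Irep78_5 and leaves them in the dataset
xi: regress price mpg i.rep78
scalar b_xi = _b[_Irep78_5]
describe _I*

// the same model with factor variables; nothing is added to the dataset
regress price mpg i.rep78
scalar b_fv = _b[5.rep78]

display b_xi " " b_fv
```

Both approaches omit the smallest level of rep78, so the two coefficients should match.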

11.1.4 Time series varlists

Four time-series operators: L. (lag), F. (lead), D. (difference), and S. (seasonal difference).

First declare the data to be time series with tsset; then you can use the time-series operators.

tsset time
list L.gnp

Declare panel data (panel variable first, then time variable):

tsset country year
// or
xtset country year

TS Operator Meaning
L. lag \(x_{t-1}\)
L2. 2-period lag \(x_{t-2}\)
L(1/2). a varlist \(x_{t-1}\) and \(x_{t-2}\)
F. lead \(x_{t+1}\)
F2. 2-period lead \(x_{t+2}\)
D. difference \(x_{t}-x_{t-1}\)
D2. difference of difference \((x_{t}-x_{t-1})-(x_{t-1}-x_{t-2})\)
S. “seasonal” difference \(x_{t}-x_{t-1}\)
S2. lag-2 seasonal difference \(x_{t}-x_{t-2}\)

Note that D1. = S1., but D2. \(\ne\) S2..

  • D2. refers to the difference of difference
  • S2. refers to the two-period difference

Operators may be typed in uppercase or lowercase, and an operator can be applied to a parenthesized varlist:

L(1/3).(gnp cpi)
// equivalently
L.gnp L2.gnp L3.gnp L.cpi L2.cpi L3.cpi

DS12.gnp is the one-period difference of the 12-period seasonal difference of gnp.
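A worked sketch on a made-up series, gnp = time^2, which makes the difference between D2. and S2. easy to check by hand. All variable names are hypothetical.

```stata
clear
set obs 6
generate time = _n
generate gnp = time^2           // 1, 4, 9, 16, 25, 36
tsset time

generate lag1  = L.gnp          // x[t-1]
generate lead1 = F.gnp          // x[t+1]
generate d1    = D.gnp          // x[t] - x[t-1]
generate d2    = D2.gnp         // difference of the difference
generate s2    = S2.gnp         // x[t] - x[t-2]

list
```

For this series d2 is the constant 2 from t = 3 onward, while s2 is 8, 12, 16, 20, which shows that D2. and S2. are different operators.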

.do is the file extension of a Stata do-file.

.dta is the Stata dataset file format.

11.1.5 Labels

Variable labels convey information about a variable, and can be a substitute for long variable names.

// generally
label variable variable_name "variable label"
// use example
label variable price "Price in 1978 Dollars"

Value labels are used with categorical variables to tell you what the categories mean.

  1. First define a mapping
// generally
label define map_name value1 "label1" value2 "label2"...
// use example
label define rep_label 1 "Bad" 2 "Average" 3 "Good"
  2. Attach the value labels to an existing variable
// generally
label values variable_name map_name
// use example
label values rep3 rep_label
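An end-to-end sketch, assuming the auto dataset shipped with Stata. That dataset has no rep3 variable, so the example builds one from rep78; the three-way grouping is an assumption made for illustration.

```stata
sysuse auto, clear

label variable price "Price in 1978 Dollars"

// collapse the 5-point repair record into three categories
generate rep3 = 1 if rep78 < 3
replace  rep3 = 2 if rep78 == 3
replace  rep3 = 3 if rep78 > 3 & !missing(rep78)

// step 1: define the mapping; step 2: attach it to the variable
label define rep_label 1 "Bad" 2 "Average" 3 "Good"
label values rep3 rep_label

tabulate rep3
```

One mapping can be attached to several variables, which is why defining it and attaching it are separate steps.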