11.1 Basic syntax

11.1.1 Package management

Users can add new features to Stata, and some users choose to make new features that they have written available to others via the web. The files that comprise a new feature are called a package, and a package usually consists of one or more ado-files and help files.

ssc install newpkgname installs newpkgname from SSC. SSC (Statistical Software Components) is the premier Stata download site.

ssc uninstall pkgname uninstalls pkgname.

ado update checks installed packages for available updates; add the update option (ado update, update) to actually install them.

ssc hot [, n(#)] lists the most popular packages at SSC; n(#) specifies the number of packages listed.
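A typical package workflow might look like the following sketch. The package name estout is only an illustrative choice, and all of these commands need internet access.

```stata
* Sketch of a package workflow; estout is just an example package name
ssc install estout, replace   // install (or reinstall) from SSC
ado describe estout           // show what was installed
ado update                    // list installed packages that have updates
ado update, update            // actually install those updates
ssc hot, n(10)                // ten most popular packages at SSC
```

To remove the package again, type ssc uninstall estout.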

Stata is case-sensitive: myvar, Myvar and MYVAR are three distinct names.

In Mata, a semicolon (;) is treated as a line separator. It is not required, but it may be used to place two statements on the same physical line:

x = 1 ; y = 2 ;

The last semicolon in the above example is unnecessary but allowed.

11.1.2 Types and Declarations

In Mata, a variable’s type is described from two perspectives:

eltype: specifies the type of the elements. Default: transmorphic.
orgtype: specifies the organization of the elements. Default: matrix.

`eltype`	`orgtype`
`transmorphic`	`matrix`
`numeric`	`vector`
`real`	`rowvector`
`complex`	`colvector`
`string`	`scalar`
`pointer`
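As a minimal Mata sketch of how eltype and orgtype appear in practice: the function name add_one and the variable v are hypothetical, chosen only for illustration.

```stata
mata:
// Declared types: eltype "real" and orgtype "scalar" for argument and result
real scalar add_one(real scalar x)
{
    return(x + 1)
}

v = (1, 2, 3)     // undeclared, so v may hold anything (transmorphic matrix)
eltype(v)         // eltype of what v currently holds: real
orgtype(v)        // orgtype of what v currently holds: rowvector
add_one(2)        // 3
end
```

Declaring types is optional, but it lets Mata catch a call such as add_one("a") as a type error.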

With few exceptions, the basic syntax of a Stata command is

[by varlist:] command [ varlist ] [=exp] [if exp] [in range] [ weight ] [, options]

where square brackets distinguish optional qualifiers and options from required ones. In this diagram,

varlist denotes a list of variable names.

Variables are separated by spaces.

Use help varlist for more details on how to specify varlist.

If no varlist appears, most commands assume _all, which indicates all the variables in the dataset.
command denotes a Stata command,
exp denotes an algebraic expression,
range denotes an observation range,
weight denotes a weighting expression, and
options denotes a list of options.

Note the comma (,), which separates the command’s main body from its options.

by varlist repeats a command for each subset of the data, grouped by varlist.

Ex: group by region and summarize marriage_rate and divorce_rate

webuse census12
sort region
by region: summarize marriage_rate divorce_rate

Note that you have to sort the data before using by varlist.

Alternatively, you can sort and group in a single step:

by region, sort: summarize marriage_rate divorce_rate

if exp restricts the command to observations for which exp is true.

summarize marriage_rate divorce_rate if region == "West"

Use & (and) and | (or) to join conditions.

in range restricts the command to a specific range of observations.

First observation can be denoted by f
Last observation can be denoted by l
Negative numbers mean “from the end of the data”

// summarize for observations 5 to 25
summarize marriage_rate divorce_rate in 5/25
// summarize for the last five observations
summarize marriage_rate divorce_rate in -5/l
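The three qualifiers can be tried side by side. The sketch below uses the auto dataset that ships with Stata (an assumption made here so that it runs without internet access); foreign is its 0/1 origin indicator.

```stata
sysuse auto, clear

* by varlist: one summary per group (bysort sorts first)
bysort foreign: summarize price mpg

* if exp: only observations where the expression is true
summarize price if foreign == 1 & mpg > 25

* in range: observations 5 through 25, then the last five
summarize price in 5/25
summarize price in -5/l
```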

11.1.3 Create new variables

gen variable = expression      // generate new variables
replace variable = expression  // replace the value of existing variables

generate creates variables based on the expressions you specify.

generate newvar = oldvar + 2 generates a new variable newvar, which equals oldvar + 2.

generate lngdp = ln(gdp) generates the natural log of gdp.

generate exp2 = exp^2 generates the square of exp.

egen: Extensions to generate; creates a new variable based on egen functions of existing variables.

Q: What are egen functions?
A: The functions are specifically written for egen.

// Generate newv1 for distinct groups of v1 and v2, and create and apply value label mylabel
egen newv1 = group(v1 v2), label(mylabel)

// for each country, calculate the average of wpop
by country_id, sort: egen pop_country = mean(wpop)

gen vs. egen

gen is used for simple algebraic transformations.
egen is used for more complex transformations, e.g., operations based on groups of observations.
They behave differently if you want to calculate a sum:
- gen with sum() returns the running sum
- egen with total() returns the overall sum (or the group sum when combined with by)

// Create variable containing the running sum of x
generate runsum = sum(x)

// Create variable containing a constant equal to the overall sum of x
egen totalsum = total(x)
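The difference is easiest to see on a tiny dataset. The following sketch uses made-up data with hypothetical variable names group and x.

```stata
clear
input group x
1 1
1 2
2 3
2 4
end

generate runsum = sum(x)                          // 1, 3, 6, 10
egen totalsum = total(x)                          // 10 in every row
bysort group (x): generate grp_runsum = sum(x)    // 1, 3, 3, 7
by group: egen grp_total = total(x)               // 3, 3, 7, 7
list, sepby(group)
```

Sorting by x within group, via bysort group (x), keeps the running sum reproducible, because the running sum depends on the order of the observations.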

encode var, gen(newvar) creates a new variable named newvar based on the string variable var. It alphabetizes the unique values in var and assigns a numeric code to each.

encode sex, gen(gender)
// nolabel suppresses value labels and shows the underlying numeric codes
list sex gender in 1/4, nolabel
// without nolabel, sex and gender look identical
list sex gender in 1/4

sex is a string variable and takes on values female and male.

encode creates a new variable gender, mapping each level in sex to a numerical value. female becomes 1 and male becomes 2.
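A self-contained sketch of the same idea, using four made-up observations:

```stata
clear
input str6 sex
"male"
"female"
"female"
"male"
end

encode sex, generate(gender)
list sex gender, nolabel      // gender shows 2, 1, 1, 2
label list gender             // the mapping: 1 female, 2 male
```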

display displays strings and values of scalar expressions.

display [display_directive [display_directive [...]]]

list displays the values of variables. If no varlist is specified, the values of all the variables are displayed.

list [varlist] [if] [in] [, options]

11.1.4 Refer to a range of variables

How can I list, drop, and keep a consecutive set of variables without typing the names individually?

list all variables starting with a certain prefix
list all variables between two variables
combination of the two

// list all variables starting with a certain prefix
.  list var* // all variables starting with "var"

// list all variables between two variables
.  list var1-var5 // all variables between var1 and var5

// combination of the two
.  list var1 var3-var5

Wildcard characters:

* matches any string of characters, including no characters.
? matches any single character.
- ?* matches one or more characters
- ??* matches two or more characters

var1-var2 specifies a range of variables, from the first variable to the second variable, in the order in which they appear in the dataset.
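To see what a wildcard or a range expands to before using it, one option is unab, which stores the expanded varlist in a local macro. The sketch below assumes the auto dataset that ships with Stata; the macro names are arbitrary.

```stata
sysuse auto, clear

unab prefix_vars : t*            // wildcard: trunk turn
unab range_vars  : price-rep78   // dataset order: price mpg rep78
display "`prefix_vars'"
display "`range_vars'"

describe price-rep78 t*          // both forms can be mixed in one varlist
```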

A numlist is a list of numbers. It can include individual numbers, ranges of numbers, and increments.

Common operators in numlist:

range: start/end means all numbers from start to end, inclusive.

specify increment
- start(increment)end or start[increment]end
- first second to last, or first second : last, where the increment is second - first

1/3       // 1,2,3
-5/-8     // -5,-6,-7,-8
1 2 to 4  // 1,2,3,4
1(2)9     // 1,3,5,7,9
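The numlist command expands a numlist and returns the result in r(numlist), which makes it a convenient way to check a specification. A short sketch:

```stata
numlist "1/3"
display "`r(numlist)'"     // 1 2 3
numlist "-5/-8"
display "`r(numlist)'"     // -5 -6 -7 -8
numlist "1(2)9"
display "`r(numlist)'"     // 1 3 5 7 9

* numlists are also accepted by loops such as foreach
foreach k of numlist 1(2)9 {
    display `k'
}
```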

To reorder the variables in your dataset, use order with the sequential option, which puts the variables in alphabetical order (and handles numeric suffixes sensibly).

. order *, sequential

The resulting order will be:

1.  alpha
2.  beta
3.  gamma
4.  v1
5.  v2
6.  v3
7.  v4

order, sequential is smart enough to know that v10 comes after v9 and not between v1 and v2, which pure alphabetical order would specify. For online help, type help order in Stata, or see [D] order.

11.1.5 System Variables

Expressions may also contain _variables (pronounced “underscore variables”), which are built-in system variables that are created and updated by Stata. They are called _variables because their names all begin with the underscore character, _.

Var	Description
`_n`	the number of the current observation.
`_N`	the total number of observations in the dataset or the number of observations in the current `by()` group.
`_pi`	$\pi$
`[eqno]_b[varname]` or `[eqno]_coef[varname]`	value of the coefficient on `varname` from the most recently fitted model
`[eqno]_se[varname]`	standard error of the coefficient on `varname` from the most recently fit model
`_b[_cons]`	value of the intercept term
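A short sketch of these system variables in use, assuming the auto dataset that ships with Stata; the new variable names are hypothetical.

```stata
sysuse auto, clear

generate obs_id = _n                                  // 1, 2, ..., 74 in current order
bysort foreign: generate n_in_group = _N              // 52 for domestic, 22 for foreign
bysort foreign (price): generate rank_in_group = _n   // 1 = cheapest within group

regress price mpg
display "slope:     " _b[mpg]
display "std. err.: " _se[mpg]
display "intercept: " _b[_cons]
display "pi:        " _pi
```

Inside a by group, _n and _N count within the group, which is why rank_in_group restarts at 1 for each value of foreign.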

11.1.6 Matrix

You enter matrices by row, separating one element from the next with commas (,) and one row from the next with backslashes (\).

To create

\[ A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \]

matrix [input] A = (1,2\3,4)
matrix list A

input is optional:

without input, the matrix must be small, but its elements can be expressions.
with input, the matrix can be large, but its elements cannot be expressions.

Menu: Data > Matrices, ado language > Input matrix by hand

matname[r,c] returns the element in row r, column c.
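A minimal sketch of entering a matrix and reading an element; the names B and C are arbitrary.

```stata
matrix A = (1, 2 \ 3, 4)
matrix list A

display A[2,1]              // element in row 2, column 1: 3
matrix B = (1+1, sqrt(16))  // without input, elements may be expressions
matrix C = A * A'           // matrix expressions work too
matrix list C
```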

matrix rownames and colnames reset the row and column names of an already existing matrix.

matrix roweq and coleq also reset the row and column names of an already existing matrix, but if a simple name (a name without a colon) is specified, it is interpreted as an equation name.

// Reset row and column names of matrix A
matrix rownames A = names
matrix colnames A = names
// Reset row and column equation names; simple names are interpreted as equation names
matrix roweq A = names
matrix coleq A = names

A is a matrix.

names is a list of names, and each name can be:

a simple name; e.g., var
an interaction; e.g., var1#var2
a colon followed by a simple name;
a colon followed by an interaction;
an equation name followed by a colon, and a simple name; e.g., myeq:var
an equation name, a colon, and an interaction, e.g., myeq:var1#var2

Matrix define: https://www.stata.com/manuals/pmatrixdefine.pdf#pmatrixdefine

Macro functions

rownames A and colnames A return a list of all the row or column subnames (with time-series operators if applicable) of A, separated by single blanks. The equation names, even if present, are not included.

roweq A and coleq A return a list of all the row equation names or column equation names of A, separated by single blanks, and with each name appearing however many times it appears in the matrix.

rowfullnames A and colfullnames A return a list of all the row or column names, including equation names of A, separated by single blanks.
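These macro functions are used with the local (or global) command. A small sketch, with hypothetical row, column, and equation names:

```stata
matrix A = (1, 2 \ 3, 4)
matrix rownames A = r1 r2
matrix colnames A = eq1:x eq1:z      // equation name, colon, simple name

local cn : colnames A                // x z
local ce : coleq A                   // eq1 eq1
local cf : colfullnames A            // eq1:x eq1:z
local rn : rownames A                // r1 r2
display "`cn' | `ce' | `cf' | `rn'"
```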

11.1.7 Factor Variables

See help fvvarlist for documentation on factor variables.

i.varname creates indicators for each level of the variable.

// group=1 as base level
list group i.group in 1/5
// group=3 as base level
list group ib3.group in 1/5
// one indicator per level of group (group fixed effects)
regress y i.group

ib#.varname specifies the base level; # is the value of the base level.

By default, the smallest level becomes the base level.

The i may be omitted: ib3.group is equivalent to b3.group.

c.varname treats the variable as continuous.

# (cross) creates an interaction for each combination of the variables. Spaces are not allowed in interactions.

sex#c.age   // interaction between categorical variable `sex` and continuous variable `age`

c.age#c.age        // age squared
c.age#c.age#c.age  // age cubed

## (factorial cross) specifies a full factorial of the variables: main effects for each variable plus their interaction.

group##sex
// equivalently
i.group i.sex i.group#i.sex

o.varname omits a variable or indicator.

o.age means that the continuous variable age should be omitted, and
o2.group means that the indicator for group = 2 should be omitted.
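The operators above can be tried on the auto dataset that ships with Stata (an assumption of this sketch; rep78 is a categorical repair record with levels 1 to 5 and some missing values).

```stata
sysuse auto, clear

regress price i.rep78                 // indicators; smallest level (1) is the base
regress price ib3.rep78               // same model, rep78 = 3 as the base
regress price c.mpg##c.mpg            // mpg and mpg squared
regress price i.foreign##c.weight     // main effects plus interaction
```

No new variables are created: the indicators and interactions exist only for the duration of the command.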

Interaction Expansion

xi [ , prefix(string) noomit ] term(s)

xi expands terms containing categorical variables into indicator (also called dummy) variable sets. xi provides a convenient way to include dummy or indicator variables when fitting a model that does NOT support factor variables, e.g., xtabond.

We recommend that you use factor variables instead of xi if a command allows factor variables.

By default, xi will create interaction variables starting with _I. This can be changed using the prefix(string) option.

Operator	Description
`i.varname`	creates dummies for categorical variable `varname`
`i.varname1*i.varname2`	creates dummies for categorical variables `varname1` and `varname2`: main effects and all interactions
`i.varname1*varname3`	creates dummies for categorical variable `varname1` and continuous variable `varname3`: main effects and all interactions
`i.varname1|varname3`	creates dummies for categorical variable `varname1` and continuous variable `varname3`: all interactions and main effect of `varname3`, but NO main effect of `varname1`

xi expands both numeric and string categorical variables.

agegrp takes on values 1, 2, 3, and 4.
```
xi: logistic outcome i.agegrp
```
xi tabulates agegrp and creates indicator (dummy) variables for each observed value, omitting the indicator for the smallest value.

This creates variables named _Iagegrp_2, _Iagegrp_3, and _Iagegrp_4.
```
// The expanded logistic model is
logistic outcome _Iagegrp_2 _Iagegrp_3 _Iagegrp_4
```
Dummy variables are created automatically and are left in your dataset.

You can drop them by typing drop _I*. You do not have to do this; each time you use xi, any automatically generated dummies with the same prefix as the one specified in the prefix(string) option, or _I by default, are dropped and new ones are created.

Use xi as a command prefix

// simple effects
xi: logistic outcome weight i.agegrp bp
// interactions of categorical variables
xi: logistic outcome weight bp i.agegrp*i.race
// interactions of dummy variables with continuous variables
// fits a model with indicator variables for all agegrp categories interacted with weight, plus the main-effect terms weight and i.agegrp.
xi: logistic outcome bp i.agegrp*weight i.race
// interaction terms without the agegrp main effect (but with the weight main effect)
xi: logistic outcome bp i.agegrp|weight i.race
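For a runnable comparison of xi with factor variables, here is a sketch on the auto dataset that ships with Stata (an assumption; the outcome and agegrp variables above are not part of it).

```stata
sysuse auto, clear

xi i.rep78                       // creates _Irep78_2 ... _Irep78_5 (level 1 omitted)
describe _I*
xi: regress price mpg i.rep78    // prefix form: expand, then fit
regress price mpg i.rep78        // factor-variable equivalent, no variables created
drop _I*
```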

11.1.8 Time series varlists

Four time-series operators: L. (lag), F. (lead), D. (difference), and S. (seasonal difference).

Basic syntax: operator(order/spec).(varlist)

First declare the data to be time series by using tsset; then you can use the time-series operators.

tsset time
list L.gnp

Declare panel data (panel variable first, then time variable)

tsset country year
// or
xtset country year

TS Operator	Meaning
`L.`	lag $x_{t-1}$
`L2.`	2-period lag $x_{t-2}$
`L(1/2).`	a varlist $x_{t-1}$ and $x_{t-2}$
`L(2/.).`	from $x_{t-2}$ up to the maximum available lag
`F.`	lead $x_{t+1}$
`F2.`	2-period lead $x_{t+2}$
`D.`	difference $x_{t}-x_{t-1}$
`D2.`	difference of difference $(x_{t}-x_{t-1})-(x_{t-1}-x_{t-2})$
`S.`	“seasonal” difference $x_{t}-x_{t-1}$
`S2.`	lag-2 seasonal difference $x_{t}-x_{t-2}$

In a range such as 2/., the dot means “up to the maximum available lag/lead”.

Note that D1. = S1., but D2. $\ne$ S2..

D2. refers to the difference of difference
S2. refers to the two-period difference

Operators may be typed in uppercase or lowercase

L(1/3).(gnp cpi)
// equivalently
L.gnp L2.gnp L3.gnp L.cpi L2.cpi L3.cpi

DS12.gnp one-period difference of the 12-period difference
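The difference between D2. and S2. is easy to verify on a small made-up series. In this sketch, t and x are hypothetical variables with x = t squared.

```stata
clear
set obs 6
generate t = _n
generate x = t^2        // 1, 4, 9, 16, 25, 36
tsset t

generate lag1 = L.x     // x[t-1]; missing for t = 1
generate d1   = D.x     // 3, 5, 7, 9, 11
generate d2   = D2.x    // difference of differences: 2 for t >= 3
generate s2   = S2.x    // x[t] - x[t-2]: 8, 12, 16, 20
list t x lag1 d1 d2 s2
```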

.do is the file extension for Stata do-files.

.dta is the Stata dataset file format.

11.1.9 Labels

Variable labels convey information about a variable, and can be a substitute for long variable names.

// generally
label variable variable_name "variable label"
// use example
label variable price "Price in 1978 Dollars"

Value labels are used with categorical variables to tell you what the categories mean.

First define a mapping

// generally
label define map_name value1 "label1" value2 "label2"...
// use example
label define rep_label 1 "Bad" 2 "Average" 3 "Good"

Add value labels to existing variables

// generally
label values variable_name map_name
// use example
label values rep3 rep_label
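Putting the three steps together on the auto dataset that ships with Stata: high_mpg and mpg_lbl are hypothetical names introduced only for this sketch.

```stata
sysuse auto, clear

label variable price "Price in 1978 dollars"

generate byte high_mpg = mpg > 25                 // hypothetical 0/1 variable
label define mpg_lbl 0 "25 mpg or less" 1 "More than 25 mpg"
label values high_mpg mpg_lbl

tabulate high_mpg          // categories shown by label
describe price high_mpg    // shows the variable label and attached value label
```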

11.1.10 Output format

% indicates the start of a format specification.

%9.2f means a fixed-format number 9 characters wide, with 2 digits after the decimal point.

the number before the decimal point states the total width of the result
the number after the decimal point states the number of digits after the decimal point
f for fixed format. Alternatively,
- e for scientific notation
- g for general format
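Formats can be tried directly with display, or attached to a variable with format. A brief sketch, assuming the auto dataset that ships with Stata for the second part:

```stata
display %9.2f 3.14159        // fixed format: 3.14, right-aligned in 9 characters
display %9.2e 31415.9        // scientific notation
display %9.0g 3.14159        // general format

sysuse auto, clear
format price %9.2f           // changes how price is displayed, not its values
list price in 1/3
```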

ref: [U] 12.5 Data: Formats, control how data are displayed
