8.1 Data Input & Output

File names

file.path(...) constructs file paths in a platform-independent way. ... are character vectors, which are concatenated together with \ on Windows and / on Unix-alikes (Mac) as the path separator.

On Windows, the path separator is \, but in R you need to use \\ to represent a single backslash because \ is an escape character in R strings.

# define file path
path <- file.path("C:", "Users", "bob", "Data_Science_Documents")

# view file path
path
[1] "C:\\Users\\bob\\Data_Science_Documents"

# set path as working directory
setwd(path)

The working directory is now set to the following location: C:\Users\bob\Data_Science_Documents.


On Mac,

path <- file.path("data", "my_data.csv")

# view file path
path
[1] "data/my_data.csv"

8.1.1 Read Data

Read Fortran

read.fortran(file, format, ..., as.is = TRUE, colClasses = NA)

  • format Character vector or list of vectors.

Read dta

haven::read_dta() read Stata data file.

data <- read_dta("climate_health_2406yl.dta")
# retrieve variable labels/definitions
var_dict <- tibble(
  "name" = colnames(data),
  "label" = sapply(data, function(x) attr(x, "label")) %>% 
  as.character()
  )
var_dict
                   
var_label(data$gor) # get variable label
val_labels(data$gor) # get value labels 

Read fixed width text files

8.1.1.1 Base R functions

read.fwf(file, widths)

  • widths integer vector, giving the widths of the fixed-width fields (of one line), or list of integer vectors giving widths for multiline records.

read.table(f_name, header=FALSE, row.names, col.names, sep="", na.strings = "NA") a very versatile function. Can be used to read .csv or .txt files.

  • f_name path to data.
  • header=FALSE defaults to FALSE, assumes there is no header row in the file unless specified otherwise.
    • If there is a header in the first row, should specify header=TRUE.
  • row.names a vector of row names. This can be
    • a vector giving the actual row names, or
    • a single number giving the column of the table which contains the row names, or
    • character string giving the name of the table column containing the row names.
  • col.names a vector of optional names for the variables. The default is to use "V" followed by the column number.
  • sep use white space as delimiter.
    • if it is a csv file, use sep=',' to specify comma as delimiter
  • na.strings = "NA" a character vector of strings which are to be interpreted as NA values.
    • A useful setting: na.strings = c("", "NA", "NULL")

read.csv(f_name, header = TRUE, sep = ",", na.strings = "..", dec=".")

  • header = TRUE whether the file contains the names of the variables as its first line.
  • sep the field separator string. Values within each row of x are separated by this string.
  • na the string to use for missing values in the data.
  • dec the string to use for decimal points in numeric or complex columns: must be a single character.
  • fileEncoding UTF-8

When reading data from github, you need to pass in the raw version of the data in read.csv(),

R cannot read the display version.

You can get the URL for the raw version by clicking on the Raw button displayed above the data.

github raw data

read.table(filename, header=FALSE, sep="") is more versatile than read.csv. Useful when you have a data file saved as txt. Default separator is “white space” for read.table, i.e., one or more spaces, tabs, newlines or carriage returns.

# read.table can be used to read txt and csv. Need to specify sep=',' when reading csv.
data <- read.table("https://raw.githubusercontent.com/my1396/course_dataset/refs/heads/main/bonedensity.txt", header=TRUE)
data

data <- read.table("https://raw.githubusercontent.com/my1396/course_dataset/refs/heads/main/bonedensity.csv", header=TRUE, sep=",")

# Alternatively, can use read_csv or read.csv directly
data <- read_csv("https://raw.githubusercontent.com/my1396/course_dataset/refs/heads/main/bonedensity.csv")
data

8.1.1.2 readr

The major difference of readr is that it returns a tibble instead of a data frame.

read_delim(f_name, delim = ";", col_names = TRUE, skip = 0) allows you to specify the delimeter as ;.

  • col_names = TRUE whether the first row contains column names.
  • skip = 0 number of lines to skip before reading the data. Default is 0, meaning no lines are skipped.

read_delim(f_name, delim = "\t") read tab separated values.

read_tsv() read tab separated values.


Read comma separated values.

readr::read_csv(
  f_name,
  na = c("..", NA, ""),
  locale = locale(encoding = "UTF-8"),
  col_types = cols(Date = col_date(format = "%m/%d/%y"))
)
  • col_types specify column types. Could be created by list() or cols().

    read_csv will automatically guess, if you don’t explicitly specify column types. You can override column types by providing the argument col_types. You don’t need to provide all column types, just the ones you want to override.

  • By default, reading a file without a column specification will print a message showing what readr guessed they were. To remove this message,

    • set show_col_types = FALSE for one time setting, or
    • set options(readr.show_col_types = FALSE) for the current sessions’ global options setting. If want to change permanently everytime when R starts, put options(readr.show_col_types = FALSE) in .Rprofile as global options.

read_csv2(f_name, na = c("..", NA, "")) use semicolon ; to separate values; and use comma , for the decimal point. This is common in some European countries.

  • locale The locale controls defaults that vary from place to place. The default locale is US-centric (like R), but you can use locale() to create your own locale that controls things like the default time zone, encoding, decimal mark, big mark, and day/month names.
  • locale(date_names = "en", date_format = "%AD", time_format = "%AT", decimal_mark = ".", grouping_mark = ",", tz = "UTC", encoding = "UTF-8", asciify = FALSE)
    • decimal_mark indicate the decimal place, can only be , or .
    • encoding This only affects how the file is read - readr always converts the output to UTF-8.

8.1.2 Write Data

Save data in uft8 encoding with special language characters

write_excel_csv() include a UTF-8 Byte order mark which indicates to Excel the csv is UTF-8 encoded.

write.csv(x, f_name, row.names=TRUE, fileEncoding ="UTF-8")

  • x a matrix or data frame. If not one of the types, it is attempted to coerce x to a data frame.

    • write_csv(x) x can only be data frame or tibble. Doesn’t support matrix.

      mat %>% as_tibble(rownames = "rowname") %>% write_csv("mat.csv")
      mat %>% write.csv("mat.csv")
  • row.names whether to write row names of x. Defaults to TRUE.


flextable

flextable package create tables for reporting and publications.

The main function is flextable which takes a data.frame as argument and returns a flextable. If you are using RStudio or another R GUI, the table will be displayed in the Viewer panel or in your default browser.

The package provides a set of functions to easily create some tables from others objects.

The as_flextable() function is used to transform specific objects into flextable objects. For example, you can transform a crosstab produced with the ‘tables’ package into a flextable which can then be formatted, annotated or augmented with footnotes.


8.1.2.1 Prevent Accidentally Overwriting Files

When you do data cleaning, you want to be explicit about if you have saved the data. The code snippet below prompts the user to confirm before overwriting an existing file. It you enter 1, the file will be overwritten; if you enter 0 or any other value, the file will not be saved.

f_name <- file.path(data_dir, "US_fundamental_macro_cleaned_V2.csv")
message(glue("Dataset dimension: ({nrow(fundamental_df)}, {ncol(fundamental_df)})"))
message(glue("Target file: {f_name}"))
cat("Overwrite data? [1 = Yes, 0 = No]: ")
flush.console() # forces immediate display of the prompt
user_input <- as.integer(readLines("stdin", n = 1))

if (!is.na(user_input) && user_input == 1) {
    write_csv(fundamental_df, f_name)
    message(glue("✓ Dataset saved to: {f_name}"))
} else {
    message("✗ Data not saved")
}

flush.console() forces the output to be displayed immediately, which is important for interactive prompts. Without it, the whole cell will run before the prompt is shown, leading to errors.

It looks like the following in the console:

Wrap the code above in a function to make it reusable.

save_with_prompt <- function(df, f_name) {
    #' Prompts the user to confirm before saving a data frame or text output to a file.
    #' Prevents accidental overwriting of files
    #' @param df The data frame or text output to be saved.
    #' @param f_name The file name (including path) where the data should be saved to.
    
    message(glue("Target file: {f_name}"))
    cat("Overwrite data? [1 = Yes, 0 = No]: ")
    flush.console()
    user_input <- as.integer(readLines("stdin", n = 1))

    if (!is.na(user_input) && user_input == 1) {
        if (inherits(df, c("data.frame", "tbl_df", "tbl"))) {
            write_csv(df, f_name)
        } else {
            log_to_file(f_name, cat(df))
        }
        message(glue("✓ Dataset saved to: {f_name}"))
    } else {
        message("✗ Data not saved")
    }
}
# csv files
df <- tibble(year = 1:20, values = letters[1:20])
f_name <- here(out_dir, "my_table.csv")
save_with_prompt(df, f_name)

# txt files
res <- glue("
    === Data Description ===
    Period: {min(fundamental_complete$year)} – {max(fundamental_complete$year)}
    Firm-year observations: {nrow(fundamental_complete)}
    Unique firms: {n_distinct(fundamental_complete$fsym_id)}
    Missing market value: {sum(is.na(fundamental_complete$ff_mkt_val))}
    Missing Tobins Q (lagged): {sum(is.na(fundamental_complete$tobin_q_lag1))}
    Sectors ({n_distinct(fundamental_complete$l1_name)}): {paste(sort(unique(fundamental_complete$l1_name)), collapse = ', ')}
")
res
f_name <- here(out_dir, "my_table.txt")
save_with_prompt(res, f_name)