Intro to R for Data Analysis

A workshop for NYU Wagner MSPP class Policy & Data Studio on the basics of R for data analysis

Maxwell Austensen https://github.com/austensen
2020-07-05

All the workshop materials are available on GitHub.

R

R is a free and open-source software environment for statistical computing and graphics. It has a narrower set of uses and a smaller user base compared to Python, but because it is specifically designed for data analysis it is a great language for data analysts to learn. In recent years it has become increasingly popular and has also become much easier to learn, especially for those new to coding.

There are three main reasons that I enjoy using R myself and teaching it to others.

  1. R has an amazing community of users that has produced a wealth of user-friendly guides, documentation, and other tools and resources to support people learning the language.
  2. RStudio is an incredibly helpful application (Integrated Development Environment) in which you can work with R.
  3. There is a wide ecosystem of user-written packages that provide tools for almost every possible use case. Especially important is the collection of packages known as the Tidyverse, which prioritize good design and documentation and make R easier to learn.

R Community Resources

Here is just a small sample of some of the great resources available for learning R:

RStudio

The RStudio Desktop application is free to download.

This is the default RStudio interface.

* The top left pane contains all your code files. This is where you can write and save all the code for your analysis.
* The bottom left pane has the R console, where you can run individual pieces of R code, such as quick calculations, printing objects, or anything else that you don't need to save in your final files.
* The top right pane contains a list of all the objects currently in your environment. If you click on a dataframe object it will open in a viewer in the top left pane, where you can browse, sort, and filter your view of the data (without altering the object itself).
* The bottom right pane contains a few important tabs: the plot viewer where any graphs you create will appear, the files explorer, and the help page.

Packages in R

Installing and Loading Packages

R is an open-source language, so in addition to the basic functions that come standard with R (referred to as Base R) there are more than 10,000 user-written packages that can accomplish virtually any task in R. There is an official repository for these packages called CRAN that does some vetting of package quality, and packages from there can be installed directly from within R using:


install.packages("PackageName")

These packages only need to be installed like this once, and after that initial installation we only need to load the packages that we want to use for each analysis with library().
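
For example, to load the package installed above at the top of each analysis script or session (using the same placeholder name):


library(PackageName)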

This project uses renv to handle dependency management. To get this project set up on your own system, open the project in RStudio (by opening the .Rproj file), install renv (install.packages("renv")), and run renv::init().
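
Here is a minimal sketch of those setup commands, run once from the R console after opening the project:


install.packages("renv")
renv::init()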

If you haven’t installed any R packages yet, this might take a little while.

Tidyverse Packages

All of the packages we are using here are part of a collection of R packages referred to as the tidyverse.

The Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. All of these packages are extremely well-maintained and have helpful websites that include examples and guides, function documentation, cheatsheets, and links to the GitHub repos where the packages are developed.

The following are the core set of Tidyverse packages, but there are many more.

In addition to the package websites, there is an amazing free book that covers how to use all these packages to do data analysis, called R for Data Science.


library(dplyr) # manipulate dataframes
library(readr) # read/write dataframes
library(tidyr) # reshaping dataframes
library(stringr) # string manipulation
library(forcats) # factor manipulation
library(purrr) # iteration (instead of loops)
library(ggplot2) # making plots

Import and Preview Dataset

For these examples we’ll be using a dataset of buildings from the Public Advocate’s NYC Landlord Watchlist.

Let's get started by reading in our dataset from a CSV file using read_csv from the readr package.

We’ll also be making use of the fs package, which provides cross-platform file system operations.

If you need to import datasets that aren't simple rectangular flat files (like csv, tsv, and fwf) then you will need another package (see the short sketch after this list).
* DBI for relational databases (paired with database specific backends),
* haven for SPSS, Stata, and SAS data,
* httr for web APIs,
* readxl for .xls and .xlsx sheets,
* rvest for web scraping,
* jsonlite for JSON, and
* xml2 for XML.
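
For example, here is a minimal sketch of reading the first sheet of an Excel workbook with readxl (the file name is just a hypothetical placeholder):


library(readxl)

example_data <- read_excel("data-raw/example-workbook.xlsx", sheet = 1)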


library(fs) # cross-platform file system operations

watchlist_bldgs <- read_csv(path("data-raw", "landlord-watchlist-buildings_2018-12-08.csv"))

read_csv() guesses the data type of each column, and tells you the column specification it used. Often this will be what you want, but if you want to override the guesses you can supply your own specification (see ?readr::cols for details).


watchlist_bldgs <- read_csv(
  file = path("data-raw", "landlord-watchlist-buildings_2018-12-08.csv"),
  col_types = cols(
    .default = col_character(),
    units = col_integer(),
    violations = col_integer()
  )
)

Now let’s take a look at this new dataframe that we’ve imported. You can print the dataframe to get a simple preview.


watchlist_bldgs

# A tibble: 452 x 8
   landlord    address    units violations number street borough zip  
   <chr>       <chr>      <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonathan C… 5416 4 AV…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonathan C… 5422 4 AV…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonathan C… 314 57 ST…    17         70 314    57 ST… BROOKL… 11220
 4 Jonathan C… 372 BALTI…     8         76 372    BALTI… BROOKL… 11201
 5 Jonathan C… 166 BLEEC…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonathan C… 590 BUSHW…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonathan C… 165 HOWAR…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonathan C… 208 HOYT …     3          9 208    HOYT … BROOKL… 11217
 9 Jonathan C… 203 HULL …     6        143 203    HULL … BROOKL… 11233
10 Jonathan C… 318 HUMBO…     7        102 318    HUMBO… BROOKL… 11211
# … with 442 more rows

When simply printing the dataframe you’ll only see a few rows and as many columns as fit nicely on your screen. When you have many columns it’s often helpful to use the function glimpse() to see a list of all your columns.


glimpse(watchlist_bldgs)

Rows: 452
Columns: 8
$ landlord   <chr> "Jonathan Cohen-Silvershore Properties", "Jonath…
$ address    <chr> "5416 4 AVENUE, BROOKLYN 11220", "5422 4 AVENUE,…
$ units      <int> 16, 21, 17, 8, 6, 4, 8, 3, 6, 7, 6, 8, 6, 6, 7, …
$ violations <int> 62, 78, 70, 76, 41, 55, 39, 9, 143, 102, 19, 54,…
$ number     <chr> "5416", "5422", "314", "372", "166", "590", "165…
$ street     <chr> "4 AVENUE", "4 AVENUE", "57 STREET", "BALTIC STR…
$ borough    <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", "BROOKLYN", …
$ zip        <chr> "11220", "11220", "11220", "11201", "11221", "11…

We can also get a very helpful overview of our dataset, including some informative descriptive statistics, using skim() from the skimr package.


library(skimr)

skim(watchlist_bldgs)
Table 1: Data summary

Name                     watchlist_bldgs
Number of rows           452
Number of columns        8
Column type frequency:
  character              6
  numeric                2
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
landlord               0              1    7   37      0        98           0
address                0              1   27   50      0       452           0
number                 0              1    1    5      0       394           0
street                 0              1    7   28      0       294           0
borough                0              1    5   13      0         5           0
zip                    0              1    5    5      0        84           0

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd  p0  p25  p50  p75  p100  hist
units                  0              1  15.91  17.67   3    6    8   20   120  ▇▁▁▁▁
violations             0              1  91.22  78.81   8   39   71  117   650  ▇▁▁▁▁

Data Manipulation with dplyr

The package dplyr contains functions for basic data manipulation. It is organized around five main functions (filter(), select(), mutate(), arrange(), and summarize()) that each take a dataframe and manipulate it in some way. The functions are named as verbs that help explain what they do.

Every one of these functions takes a dataframe as the first argument and returns an altered version of that dataframe.

Inside of these functions columns are referred to with just their names without quotes.

filter()

Use filter() to find rows/cases where conditions are true. Rows where the condition evaluates to NA are dropped.


bk_bldgs <- filter(watchlist_bldgs, borough == "BROOKLYN")
bk_bldgs

# A tibble: 256 x 8
   landlord    address    units violations number street borough zip  
   <chr>       <chr>      <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonathan C… 5416 4 AV…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonathan C… 5422 4 AV…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonathan C… 314 57 ST…    17         70 314    57 ST… BROOKL… 11220
 4 Jonathan C… 372 BALTI…     8         76 372    BALTI… BROOKL… 11201
 5 Jonathan C… 166 BLEEC…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonathan C… 590 BUSHW…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonathan C… 165 HOWAR…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonathan C… 208 HOYT …     3          9 208    HOYT … BROOKL… 11217
 9 Jonathan C… 203 HULL …     6        143 203    HULL … BROOKL… 11233
10 Jonathan C… 318 HUMBO…     7        102 318    HUMBO… BROOKL… 11211
# … with 246 more rows

Multiple conditions, separated by commas, are combined with & (that is, all conditions must be true).


qn_big_bldgs <- filter(watchlist_bldgs, units > 10, borough == "QUEENS")
qn_big_bldgs

# A tibble: 4 x 8
  landlord    address    units violations number street  borough zip  
  <chr>       <chr>      <int>      <int> <chr>  <chr>   <chr>   <chr>
1 Jonathan C… 1708 SUMM…    39        116 1708   SUMMER… QUEENS  11385
2 RONALD SWA… 34-01 28 …    41        157 34-01  28 AVE… QUEENS  11103
3 RONALD SWA… 47-02 88 …    29         98 47-02  88 STR… QUEENS  11373
4 HILLDISDE … 87-40 165…   119        383 87-40  165 ST… QUEENS  11432
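
The same filter can also be written with an explicit &:


filter(watchlist_bldgs, units > 10 & borough == "QUEENS")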

select()

Use select() to keep or drop columns. You can either specify a set of variables to keep by listing them, or specify columns to be dropped with -.


select(watchlist_bldgs, landlord, borough, units)

# A tibble: 452 x 3
   landlord                              borough  units
   <chr>                                 <chr>    <int>
 1 Jonathan Cohen-Silvershore Properties BROOKLYN    16
 2 Jonathan Cohen-Silvershore Properties BROOKLYN    21
 3 Jonathan Cohen-Silvershore Properties BROOKLYN    17
 4 Jonathan Cohen-Silvershore Properties BROOKLYN     8
 5 Jonathan Cohen-Silvershore Properties BROOKLYN     6
 6 Jonathan Cohen-Silvershore Properties BROOKLYN     4
 7 Jonathan Cohen-Silvershore Properties BROOKLYN     8
 8 Jonathan Cohen-Silvershore Properties BROOKLYN     3
 9 Jonathan Cohen-Silvershore Properties BROOKLYN     6
10 Jonathan Cohen-Silvershore Properties BROOKLYN     7
# … with 442 more rows

select(watchlist_bldgs, -landlord)

# A tibble: 452 x 7
   address             units violations number street    borough zip  
   <chr>               <int>      <int> <chr>  <chr>     <chr>   <chr>
 1 5416 4 AVENUE, BRO…    16         62 5416   4 AVENUE  BROOKL… 11220
 2 5422 4 AVENUE, BRO…    21         78 5422   4 AVENUE  BROOKL… 11220
 3 314 57 STREET, BRO…    17         70 314    57 STREET BROOKL… 11220
 4 372 BALTIC STREET,…     8         76 372    BALTIC S… BROOKL… 11201
 5 166 BLEECKER STREE…     6         41 166    BLEECKER… BROOKL… 11221
 6 590 BUSHWICK AVENU…     4         55 590    BUSHWICK… BROOKL… 11206
 7 165 HOWARD AVENUE,…     8         39 165    HOWARD A… BROOKL… 11233
 8 208 HOYT STREET, B…     3          9 208    HOYT STR… BROOKL… 11217
 9 203 HULL STREET, B…     6        143 203    HULL STR… BROOKL… 11233
10 318 HUMBOLDT STREE…     7        102 318    HUMBOLDT… BROOKL… 11211
# … with 442 more rows

You can rename the columns that you are selecting within select(), or use rename() which keeps all columns.


select(watchlist_bldgs, borough_name = borough)

# A tibble: 452 x 1
   borough_name
   <chr>       
 1 BROOKLYN    
 2 BROOKLYN    
 3 BROOKLYN    
 4 BROOKLYN    
 5 BROOKLYN    
 6 BROOKLYN    
 7 BROOKLYN    
 8 BROOKLYN    
 9 BROOKLYN    
10 BROOKLYN    
# … with 442 more rows

rename(watchlist_bldgs, landlord_name = landlord)

# A tibble: 452 x 8
   landlord_name  address units violations number street borough zip  
   <chr>          <chr>   <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonathan Cohe… 5416 4…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonathan Cohe… 5422 4…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonathan Cohe… 314 57…    17         70 314    57 ST… BROOKL… 11220
 4 Jonathan Cohe… 372 BA…     8         76 372    BALTI… BROOKL… 11201
 5 Jonathan Cohe… 166 BL…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonathan Cohe… 590 BU…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonathan Cohe… 165 HO…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonathan Cohe… 208 HO…     3          9 208    HOYT … BROOKL… 11217
 9 Jonathan Cohe… 203 HU…     6        143 203    HULL … BROOKL… 11233
10 Jonathan Cohe… 318 HU…     7        102 318    HUMBO… BROOKL… 11211
# … with 442 more rows

mutate()

Use mutate() to add new columns to a dataset. mutate() keeps all the existing columns and adds the new ones to the end of the dataset, while the variant transmute() creates new columns but keeps only the new ones.


mutate(watchlist_bldgs, landlord_lower = str_to_lower(landlord))

# A tibble: 452 x 9
   landlord address units violations number street borough zip  
   <chr>    <chr>   <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonatha… 5416 4…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonatha… 5422 4…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonatha… 314 57…    17         70 314    57 ST… BROOKL… 11220
 4 Jonatha… 372 BA…     8         76 372    BALTI… BROOKL… 11201
 5 Jonatha… 166 BL…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonatha… 590 BU…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonatha… 165 HO…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonatha… 208 HO…     3          9 208    HOYT … BROOKL… 11217
 9 Jonatha… 203 HU…     6        143 203    HULL … BROOKL… 11233
10 Jonatha… 318 HU…     7        102 318    HUMBO… BROOKL… 11211
# … with 442 more rows, and 1 more variable: landlord_lower <chr>

transmute(watchlist_bldgs, violations_per_unit = violations / units)

# A tibble: 452 x 1
   violations_per_unit
                 <dbl>
 1                3.88
 2                3.71
 3                4.12
 4                9.5 
 5                6.83
 6               13.8 
 7                4.88
 8                3   
 9               23.8 
10               14.6 
# … with 442 more rows

arrange()

Use arrange() to order the rows in your dataset by the values of one or more columns. By default the rows will be in ascending order, and you can use desc() for descending order.


arrange(watchlist_bldgs, landlord, desc(units))

# A tibble: 452 x 8
   landlord  address     units violations number street  borough zip  
   <chr>     <chr>       <int>      <int> <chr>  <chr>   <chr>   <chr>
 1 ADAM STR… 150 WEST 1…    31         92 150    WEST 1… BRONX   10468
 2 ADAM STR… 19 EAST 12…    25        105 19     EAST 1… MANHAT… 10035
 3 ADAM STR… 611 WEST 1…    22         78 611    WEST 1… MANHAT… 10031
 4 ADAM STR… 535 EDGECO…    21         96 535    EDGECO… MANHAT… 10032
 5 ADAM STR… 480 CONVEN…    15         76 480    CONVEN… MANHAT… 10031
 6 ADAM STR… 616 EAST 1…    13         62 616    EAST 1… BRONX   10458
 7 ADAM STR… 2372 ARTHU…    12         47 2372   ARTHUR… BRONX   10458
 8 ADAM STR… 2550 CREST…    11         67 2550   CRESTO… BRONX   10468
 9 ADAM STR… 293 PLEASA…    10         49 293    PLEASA… MANHAT… 10029
10 ADAM STR… 2405 BEAUM…     9         33 2405   BEAUMO… BRONX   10458
# … with 442 more rows

summarize()

You can use summarize() on a dataset to collapse all the rows down to a single row to calculate an aggregate statistic of one or more columns. It works in a similar way to mutate(), except whereas in mutate() you create new columns that are the same length as your existing dataset, with summarize() you use some sort of aggregate function (like sum()) that takes a column of multiple values and returns only one value.


summarise(watchlist_bldgs, total_units = sum(units))

# A tibble: 1 x 1
  total_units
        <int>
1        7192

group_by()

The sixth function is group_by(), and it doesn't change the contents of your dataframe, but instead affects how all of the above functions work when they are subsequently called on the dataframe. After a dataframe has been grouped by one or more columns, all functions apply to each group of rows in the dataset as if it were its own dataset. group_by() is most commonly used with summarize(). Alone, summarize() will collapse a dataframe to a single row, but a grouped dataframe is collapsed down to one row per group. After you have finished with your grouped operations, use ungroup() to make sure the grouping doesn't unintentionally alter later operations.


boro_bldgs <- group_by(watchlist_bldgs, borough)
boro_bldgs <- summarise(boro_bldgs, total_units = sum(units))
boro_bldgs <- ungroup(boro_bldgs)
boro_bldgs

# A tibble: 5 x 2
  borough       total_units
  <chr>               <int>
1 BRONX                1662
2 BROOKLYN             2723
3 MANHATTAN            2458
4 QUEENS                237
5 STATEN ISLAND         112

Data manipulation pipelines with %>% (“pipe”)

As you can see above when you want to make a series of changes to a dataframe you can end up repeating yourself a lot and overwriting a dataframe with each step. Thankfully there’s a way to avoid this!

The beauty of dplyr is that all of the functions above take a dataframe as the first argument, and return an altered version of that dataframe as the result. This allows us to start with a dataframe and link together multiple functions so that they each make a change to the dataframe then pass it along to the next function. dplyr includes a special operator, %>% (pronounced “pipe”), that allows us to chain together these function calls. When reading the code for these pipelines you can read %>% as “then”.

This %>% takes the object on the left and passes it to the function on the right as the first argument.

For a simple example, let’s look at the function str_c(), which concatenates strings together. Instead of passing "a" and "b" as the first and second argument, we can use the %>% to “pipe” the "a" into the function as the first argument and the "b" becomes the second argument.


str_c("a", "b")

[1] "ab"

"a" %>% str_c("b")

[1] "ab"

Now let’s practice putting together some of these dplyr functions into a little data manipulation pipeline by getting some information about the landlords on the watchlist and the buildings they own in Brooklyn.

The long pipeline of these dplyr functions can seem overwhelming at first, but once you get familiar with the functions you’ll be able to read these code chunks like a little paragraph explaining the changes being made to a dataframe. To help illustrate this the following paragraph is a written explanation of every step of the accompanying block of code.

We’ll start with the full watchlist_bldgs dataset, then “pipe” (%>%) it into the next function to filter the dataset to just buildings where the borough is "Brooklyn". Then we mutate the dataset to add a new column called landlord_name that is simply a more nicely-formatted version of the existing landlord column. Then we select only the columns that we need: landlord_name, units, and HPD violations. Then we group_by the new landlord_name column, and then, with the dataset grouped, we’ll summarize the data across all buildings for each landlord to get some summary information about each landlord and their buildings in Brooklyn. Specifically, we’ll summarize to get the total number of buildings using the special n() function that counts the number of rows, we’ll also get the total_units by summing the units across all buildings for each landlord, and we’ll get the avg_bldg_size of each landlord’s Brooklyn buildings by taking the mean of units across their buildings. Similarly, we get the sum and mean of HPD violations for each landlord. We’ve now gone from a dataset in which each row represents a building to one in which each row is a landlord. Since we are done with our grouped operations we can ungroup the data, then finally we can arrange the dataset in descending order of the number of buildings the landlord owns in Brooklyn. After all of this our final resulting dataset is assigned to a new dataframe we’ll call bk_landlords.


bk_landlords <- watchlist_bldgs %>% 
  filter(borough == "BROOKLYN") %>% 
  mutate(landlord_name = str_to_title(landlord)) %>% 
  select(landlord_name, units, violations) %>% 
  group_by(landlord_name) %>% 
  summarize(
    buildings = n(),
    total_units = sum(units),
    avg_bldg_size = mean(units),
    total_viol = sum(violations),
    avg_bldg_viol = mean(violations)
  ) %>% 
  ungroup() %>% 
  arrange(desc(buildings))
bk_landlords

# A tibble: 53 x 6
   landlord_name buildings total_units avg_bldg_size total_viol
   <chr>             <int>       <int>         <dbl>      <int>
 1 Jonathan Coh…        18         149          8.28        974
 2 Meir Fried           17         125          7.35        690
 3 Deodat Lowtan        14          80          5.71        565
 4 Cheryl Ighod…        11          68          6.18        413
 5 Joel Kohn            10          60          6           312
 6 Iskyo Aronov          9          37          4.11        495
 7 Yosef Emergi          9          46          5.11        528
 8 Chaim Schwar…         8          39          4.88        361
 9 Isaac Schwar…         8         110         13.8         510
10 Jacob Rubin           8          52          6.5         349
# … with 43 more rows, and 1 more variable: avg_bldg_viol <dbl>

Making graphs with ggplot2

Now let’s visualize this new dataset we’ve created using the package ggplot2.

ggplot2 is designed to work with dataframe inputs, so the first step is always to use ggplot(data = your_dataframe). You can build plots step by step by adding layers with +. The second step is always aes(), which establishes the aesthetics of the plot by mapping columns from the dataframe to aesthetic elements of the plot. For example, here we are setting the x axis values to landlord names and y to the total number of HPD violations. After the aes is set, you can use one of the many geom_*() functions to transform the aesthetics into a type of plot. In this case we want a column plot, so we use geom_col(). Finally, we can label any part of our graph with labs().


ggplot(data = bk_landlords) +
  aes(x = landlord_name, y = total_viol) +
  geom_col() +
  labs(
    title = '"Worst" Landlords in Brooklyn',
    subtitle = "Total HPD Violations in All Buildings for 2017",
    x = NULL,
    y = "Number of Violations",
    caption = "Source: NYC Public Advocate's Landlord Watchlist"
  )

With only the defaults ggplot2 graphs tend to look pretty good, and are not too difficult to create. However, there are definitely some things we’ll want to improve with this graph. Luckily, there is a near-infinite amount of customization possible with ggplot2 to get the plot looking exactly the way you want.

To start, there are clearly too many landlords to display clearly in a graph like this, so we can use dplyr to arrange the data by violations and filter to keep only the top 10 landlords. The first landlord name doesn't match the format of the others, so let's remove the " Properties" part using str_remove from the stringr package. It would also be nice if the landlords were sorted in order of the number of violations. To achieve this we can change the landlord_name column from a string to R's factor datatype, which allows us to specify an ordering to the values. Specifically, we'll use the function fct_reorder() from the package forcats to make the column a factor and put the values in order based on the values of the total_viol column.

Now we can use this new dataframe with ggplot2 and make a few more changes to improve the graph further. One obvious problem with our initial graph is that the landlord names are completely illegible due to overlap. To solve this we can use the ggplot2 function coord_flip() to flip our bars sideways so we can read the labels more cleanly. Another smaller adjustment we can make is to format the violation count labels on our y-axis. To make changes to anything related to one of the aesthetic elements of a plot we can use one of the many scale_*_* functions. The first * is always one of the aes element types, and the second * indicates the type of data that is mapped to it. In our case we want to make a change to the y axis, and since we've mapped our count of violations to y it's a continuous scale, so the function we'll want to use is scale_y_continuous(). Within that function we'll use the formatting function comma from the scales package on our axis labels. Lastly, we can use one of the theme_* functions to apply some alternative styling to the plot. These functions provide some helpful preset styling, but you can make your own fine-tuned adjustments using theme(). This can get a bit overwhelming, but just to illustrate what's possible, here we'll remove the unnecessary lines on the plot, move the landlord name labels over a bit, and change the font of the caption.


landlord_bk_10_worst <- bk_landlords %>% 
  arrange(desc(total_viol)) %>% 
  filter(row_number() <= 10) %>% 
  mutate(
    landlord_name = str_remove(landlord_name, " Properties"),
    landlord_name = fct_reorder(landlord_name, total_viol)
  )

ggplot(data = landlord_bk_10_worst) +
  aes(x = landlord_name, y = total_viol) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(margin = margin(r = -15)),
    plot.caption = element_text(face = "italic", color = "darkgrey", margin = margin(t = 10))
  ) +
  labs(
    title = '10 "Worst" Landlords in Brooklyn',
    subtitle = "Total HPD Violations in All Buildings for 2017",
    x = NULL,
    y = "Number of Violations",
    caption = "Source: NYC Public Advocate's Landlord Watchlist"
  )


Reshaping data with tidyr

This demo for reshaping data was borrowed from Sam Raby

One of the core ideas of the tidyverse is “tidy data”, which is a structure of datasets that facilitates analysis using the tools in tidyverse packages (dplyr, ggplot2, etc.). In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

“Tidy datasets are all alike but every messy dataset is messy in its own way”

There are lots of common problems with datasets that make them untidy, and the package tidyr provides some powerful and flexible tools to help tidy them up into the standard format. Here we'll focus on the most important of those tools, pivot_longer and pivot_wider.

This animation (from this blog post) nicely captures the basic idea behind the pivot_* functions.

For this example of reshaping data we'll be working with a dataset of buildings in NYC that have rent stabilized units, from http://taxbills.nyc/. This project scraped data from PDFs of property tax documents to get estimates of rent stabilized unit counts in buildings across NYC. You can find a direct link to the data we're using and read up on the various field names at the GitHub project page.

In this separate script we download the file if we don’t already have it.


source(path("R", "download-rent-stab.R"))

taxbills <- read_csv(path("data-raw", "rent-stab-units_joined_2007-2017.csv"))

For this demo, we only want to look at rent stabilized unit counts, which according to the GitHub documentation correspond to column names that end in “uc”. Let's also grab BBL (which is a unique identifier for NYC buildings) and Borough while we're at it:


rentstab <- taxbills %>% select(borough, ucbbl, ends_with("uc") )

rentstab

# A tibble: 46,461 x 13
   borough  ucbbl `2007uc` `2008uc` `2009uc` `2010uc` `2011uc`
   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 MN      1.00e9      293      293      292      293      293
 2 MN      1.00e9      253      253      253      253      253
 3 <NA>    1.00e9       NA       NA       NA       NA       29
 4 MN      1.00e9       NA       NA       NA       NA       NA
 5 MN      1.00e9       97       60       38       15       11
 6 MN      1.00e9       NA       NA       NA       NA      130
 7 MN      1.00e9       NA       NA       NA        1       NA
 8 MN      1.00e9        4        2        2        2        2
 9 MN      1.00e9       NA       NA       NA       NA        2
10 MN      1.00e9        5        5        5        5        5
# … with 46,451 more rows, and 6 more variables: `2012uc` <dbl>,
#   `2013uc` <dbl>, `2014uc` <dbl>, `2015uc` <dbl>, `2016uc` <dbl>, …

Annoyingly, the data separates unit counts for different years into different columns, violating the principle of tidy data that every column is a variable. To make proper use of ggplot to visualize our data, we'll need to first tidy up our dataset to get all of the year values in one column and the unit counts into another.

We can use pivot_longer to achieve this, performing this basic transformation:


rs_long <- rentstab %>% 
  pivot_longer(
    ends_with("uc"),  # The multiple column names we want to mush into one column
    names_to = "year", # The title for the new column of names we're generating
    values_to = "units" # The title for the new column of values we're generating
  )

rs_long

# A tibble: 511,071 x 4
   borough      ucbbl year   units
   <chr>        <dbl> <chr>  <dbl>
 1 MN      1000160180 2007uc   293
 2 MN      1000160180 2008uc   293
 3 MN      1000160180 2009uc   292
 4 MN      1000160180 2010uc   293
 5 MN      1000160180 2011uc   293
 6 MN      1000160180 2012uc   293
 7 MN      1000160180 2013uc   293
 8 MN      1000160180 2014uc   293
 9 MN      1000160180 2015uc   293
10 MN      1000160180 2016uc   293
# … with 511,061 more rows

Now that we have our data in the proper “long” (tidy) format, we can start working towards our desired plot. Let’s try and make a yearly timeline of rent stab. unit counts for the boroughs of Manhattan and Brooklyn:


rs_long_mn_bk_summary <- rs_long %>% 
  filter(
    borough %in% c("MN","BK"), # Filter only Manhattan and Brooklyn values
    !is.na(units) # Filter out null unit count values
  ) %>% 
  mutate(
    year = str_remove(year, "uc"), # Remove "uc" from year values
    year = as.integer(year) # change from character to integer
  ) %>% 
  group_by(year, borough) %>%
  summarise(total_units = sum(units)) %>% 
  ungroup()

rs_long_mn_bk_summary

# A tibble: 22 x 3
    year borough total_units
   <int> <chr>         <dbl>
 1  2007 BK           236242
 2  2007 MN           284001
 3  2008 BK           237201
 4  2008 MN           282871
 5  2009 BK           223466
 6  2009 MN           266088
 7  2010 BK           226492
 8  2010 MN           261740
 9  2011 BK           231016
10  2011 MN           263745
# … with 12 more rows

Let’s build our bar graph. We are going to specify a dodge property of the plot to show the Manhattan and Brooklyn bars side-by-side:


rs_over_time_graph_col <- ggplot(rs_long_mn_bk_summary) +
  aes(x = year, y = total_units, fill = borough) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 2007:2017) +
  scale_y_continuous(labels = scales::label_number_si()) +
  scale_fill_discrete(labels = c("BK" = "Brooklyn", "MN" = "Manhattan")) +
  labs(
    title = "Total Rent Stabilized Units over Time",
    subtitle = "Manhattan and Brooklyn, 2007 to 2017",
    fill = NULL,
    x = "Year",
    y = "Total Rent Stabilized Units",
    caption = "Source: taxbills.nyc"
  )

rs_over_time_graph_col

We can also change the geom_* function, and make a few other tweaks to change this to a line graph instead.


rs_over_time_graph_line <- ggplot(rs_long_mn_bk_summary) +
  aes(x = year, y = total_units, color = borough) +
  geom_line() +
  scale_x_continuous(breaks = 2007:2017) +
  scale_y_continuous(labels = scales::label_number_si()) +
  scale_color_discrete(labels = c("BK" = "Brooklyn", "MN" = "Manhattan")) +
  labs(
    title = "Total Rent Stabilized Units over Time",
    subtitle = "Manhattan and Brooklyn, 2007 to 2017",
    color = NULL,
    x = "Year",
    y = "Total Rent Stabilized Units",
    caption = "Source: taxbills.nyc"
  )

rs_over_time_graph_line

If you ever need to make the opposite transformation, going from “long” data to “wide”, you can use pivot_wider. For example, we can reverse the change we made above using the following code.


rs_wide <- rs_long %>% 
  pivot_wider(
    names_from = year, # The current column containing our future column names 
    values_from = units # The current column containing the values for our future columns
  )

rs_wide

# A tibble: 46,461 x 13
   borough  ucbbl `2007uc` `2008uc` `2009uc` `2010uc` `2011uc`
   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 MN      1.00e9      293      293      292      293      293
 2 MN      1.00e9      253      253      253      253      253
 3 <NA>    1.00e9       NA       NA       NA       NA       29
 4 MN      1.00e9       NA       NA       NA       NA       NA
 5 MN      1.00e9       97       60       38       15       11
 6 MN      1.00e9       NA       NA       NA       NA      130
 7 MN      1.00e9       NA       NA       NA        1       NA
 8 MN      1.00e9        4        2        2        2        2
 9 MN      1.00e9       NA       NA       NA       NA        2
10 MN      1.00e9        5        5        5        5        5
# … with 46,451 more rows, and 6 more variables: `2012uc` <dbl>,
#   `2013uc` <dbl>, `2014uc` <dbl>, `2015uc` <dbl>, `2016uc` <dbl>, …

Census/ACS data with tidycensus

Now we’ll put all these skills into practice with Census data, which will be relevant to many of your projects.

The tidycensus package uses the Census Bureau's APIs to download Decennial Census and ACS summary file estimates. As the name implies, the package fits in with the tidyverse collection of packages.

While the Census API is free to use, it requires you to sign up for an API key to get access. This is easy to do; all you need to provide is an email address, and there are instructions on the help page for this function: ?census_api_key


library(tidycensus)

census_api_key("c32dfbf7d25fe9558fd11bf021780457970f96ff")

Tidycensus includes decennial census and ACS data, but today we’ll stick with just ACS using get_acs(). There are many variables and they need to be specified using codes. We can explore these using load_variables()


acs_vars <- load_variables(2017, "acs5", cache = TRUE)
acs_vars

# A tibble: 25,070 x 3
   name      label                       concept                      
   <chr>     <chr>                       <chr>                        
 1 B00001_0… Estimate!!Total             UNWEIGHTED SAMPLE COUNT OF T…
 2 B00002_0… Estimate!!Total             UNWEIGHTED SAMPLE HOUSING UN…
 3 B01001_0… Estimate!!Total             SEX BY AGE                   
 4 B01001_0… Estimate!!Total!!Male       SEX BY AGE                   
 5 B01001_0… Estimate!!Total!!Male!!Und… SEX BY AGE                   
 6 B01001_0… Estimate!!Total!!Male!!5 t… SEX BY AGE                   
 7 B01001_0… Estimate!!Total!!Male!!10 … SEX BY AGE                   
 8 B01001_0… Estimate!!Total!!Male!!15 … SEX BY AGE                   
 9 B01001_0… Estimate!!Total!!Male!!18 … SEX BY AGE                   
10 B01001_0… Estimate!!Total!!Male!!20 … SEX BY AGE                   
# … with 25,060 more rows

The main function to extract the data is a bit complicated; you should pull up the help page (?get_acs) and walk through the arguments as you write it out.

In this demo we’ll start by just downloading the estimates, and will incorporate the geometries later on.


acs_tracts_raw <- get_acs(
  geography = "tract",
  state = "NY",
  county = c("005", "047", "061", "081", "085"), # NYC counties/boroughs
  variables = c(
    "gross_rent_med" = "B25064_001", # median gross rent
    "hh_inc_med" = "B19013_001", # median household income
    "rent_burden_med" = "B25071_001", # median rent burden
    "pov_pct" = "C17002_001", # poverty rate
    "hh_size_avg" = "B25010_001", # average hosehold size
    "occ_units" = "B25003_001", # total occupied units
    "occ_renter_units" = "B25003_003", # renter occupied units
    "vac_forrent_units" = "B25004_002", # vacant units for rent
    "vac_rented_units" = "B25004_003" # vacant units rented
  ),
  survey = "acs5",
  year = 2017,
  output = "wide",
  geometry = FALSE
)


acs_tracts_clean <- acs_tracts_raw %>% 
  rename(geoid = GEOID) %>% 
  mutate(
    state = str_sub(geoid, 1, 2),
    county = str_sub(geoid, 3, 5),
    tract = str_sub(geoid, 6, 11),
    renter_pctE = occ_renter_unitsE / na_if(occ_unitsE, 0),
    renter_pctM = moe_ratio(occ_renter_unitsE, na_if(occ_unitsE, 0), occ_renter_unitsM, occ_unitsM),
    rental_unitsE = occ_renter_unitsE + vac_forrent_unitsE + vac_rented_unitsE
  ) %>% 
  # moe_sum is designed to sum one column, but we can adjust the behaviour by
  # using rowwise and c_across from dplyr (this is a bit confusing, and you
  # won't come across it much)
  rowwise() %>% 
  mutate(
    rental_unitsM = moe_sum(c_across(c(occ_renter_unitsM, vac_forrent_unitsM, vac_rented_unitsM)))
  ) %>% 
  ungroup()

acs_tracts_clean

# A tibble: 2,167 x 27
   geoid NAME  gross_rent_medE gross_rent_medM hh_inc_medE hh_inc_medM
   <chr> <chr>           <dbl>           <dbl>       <dbl>       <dbl>
 1 3608… Cens…            1606              98       63750       13419
 2 3608… Cens…            1548             238       67132       13825
 3 3608… Cens…            1413              88       66645        6978
 4 3608… Cens…            1492             112       61154       11608
 5 3608… Cens…            1617              88       65313       17909
 6 3608… Cens…            1663              80       51964       13265
 7 3608… Cens…             532              96       20920        2131
 8 3608… Cens…            1446             200       66739       14830
 9 3608… Cens…            1463             259       69630        8971
10 3608… Cens…            1636              74       58879       16157
# … with 2,157 more rows, and 21 more variables:
#   rent_burden_medE <dbl>, rent_burden_medM <dbl>, pov_pctE <dbl>,
#   pov_pctM <dbl>, hh_size_avgE <dbl>, …

Now that we have our nice clean dataset of tract-level indicators, we can easily visualize some of the relationships between indicators across tracts with the ggplot basics we learned above. Here we'll expand beyond geom_col and geom_line, and make a scatter plot with geom_point, adding a simple smoothed conditional mean line with geom_smooth to help cut through the potential overplotting and reveal the pattern.


ggplot(acs_tracts_clean) +
  aes(x = renter_pctE, y = pov_pctE) +
  geom_point(size = 0.5, alpha = 0.25) +
  geom_smooth() +
  theme_minimal()

Joining datasets

Now we can combine some of our data. First we'll take the property-level rent stabilization data and aggregate the counts of stabilized units by census tract. Then we can join that new aggregated data with our ACS tract data. This will allow us to incorporate the count of rental units from the ACS data to create a tract-level measure of the share of rental units that are rent stabilized.

To eventually join these two datasets we'll need a common column that identifies the rows in each dataset - in this case it will be the census tract “geoid”. One issue is that in the rent stabilization data the tract column is not in the same format, so first we'll have to fix that.


tract_stab_units_2017 <- taxbills %>% 
  mutate(
    county = recode(borough, "MN" = "061", "BX" = "005", "BK" = "047", "QN" = "081", "SI" = "085"),
    tract = scales::number(ct2010, accuracy = 0.01) %>% str_remove("\\.") %>% str_pad(6, "left", "0"),
    geoid = str_c("36", county, tract)
  ) %>% 
  group_by(geoid) %>% 
  summarise(stab_units_2017 = replace_na(sum(`2017uc`), 0)) %>% 
  ungroup()

tract_stab_units_2017  

# A tibble: 1,753 x 2
   geoid       stab_units_2017
   <chr>                 <dbl>
 1 36005000200               0
 2 36005000400               9
 3 36005001600               0
 4 36005001900               0
 5 36005002000               0
 6 36005002500               0
 7 36005002701               0
 8 36005002702            1159
 9 36005002800            1585
10 36005003100             303
# … with 1,743 more rows

There are a few different types of joins that you can use to merge two datasets.
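
As a quick reference, here is a sketch of the main "mutating" joins in dplyr, using hypothetical dataframes x and y that share a geoid column:


left_join(x, y, by = "geoid")  # keep all rows of x, add matching columns from y
inner_join(x, y, by = "geoid") # keep only rows with a match in both x and y
full_join(x, y, by = "geoid")  # keep all rows from both x and y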

In this case we’ll be using the most common join type - left_join - to keep all the records from our left table (ACS data) and merge in all the data for matching records in the right table (stabilized unit counts).


tracts_acs_stab <- acs_tracts_clean %>% 
  left_join(tract_stab_units_2017, by = "geoid")

tracts_acs_stab %>% select(geoid, rental_unitsE, stab_units_2017)

# A tibble: 2,167 x 3
   geoid       rental_unitsE stab_units_2017
   <chr>               <dbl>           <dbl>
 1 36081011900           608               0
 2 36081013500           284               6
 3 36081013600           466             326
 4 36081014100           626               0
 5 36081014500           597               0
 6 36081016100           638               0
 7 36081016300          1604               0
 8 36081018000           120              NA
 9 36081020600           311              36
10 36081006900          1574               0
# … with 2,157 more rows

Whenever you are working with joins, it's always important to inspect the results carefully to make sure everything went as expected and to ensure that you understand what happened. Joins can very easily introduce errors into your work without you noticing. For example, you can easily introduce duplicate records or drop important records you need.

A couple of helpful ways to inspect your data actually make use of some special versions of join functions, known as “filtering” joins, that do the matching between datasets without actually merging in any new data.

Above we used left_join to merge in all the records from the right table that find matches in the left table. Here we'll use anti_join, which keeps only the records from the left table that don't find matches in the right table (its counterpart, semi_join, keeps only the records that do find matches). This will let us see which tracts from the ACS data don't have any records in the rent stabilization data. And then we can use some of the same basic tools to get a better understanding of that data, for example counting up the rows by borough (county).


in_acs_notin_stab <- acs_tracts_clean %>% 
  anti_join(tract_stab_units_2017, by = "geoid")

in_acs_notin_stab %>% count(county)

# A tibble: 5 x 2
  county     n
  <chr>  <int>
1 005       29
2 047      167
3 061       18
4 081      279
5 085       40

Once we are confident that our join went as planned, we can continue with our analysis work. Next we’ll use data from the two sources to create a measure of the share of all rental units that are rent stabilized.


tracts_rent_info <- tracts_acs_stab %>% 
  mutate(rental_stab_pct = replace_na(stab_units_2017 / na_if(rental_unitsE, 0), 0)) %>% 
  select(county, geoid, rental_stab_pct, rent_burden_medE)

tracts_rent_info

# A tibble: 2,167 x 4
   county geoid       rental_stab_pct rent_burden_medE
   <chr>  <chr>                 <dbl>            <dbl>
 1 081    36081011900          0                  29.8
 2 081    36081013500          0.0211             27.9
 3 081    36081013600          0.700              29.1
 4 081    36081014100          0                  33.3
 5 081    36081014500          0                  31.8
 6 081    36081016100          0                  35  
 7 081    36081016300          0                  30.2
 8 081    36081018000          0                  30.7
 9 081    36081020600          0.116              29.7
10 081    36081006900          0                  29.4
# … with 2,157 more rows

tracts_rent_info %>% 
  filter(
    rental_stab_pct > 0,
    rental_stab_pct <= 1
  ) %>% 
  ggplot() +
  aes(x = rental_stab_pct, y = rent_burden_medE) +
  geom_point(size = 0.5, alpha = 0.25) +
  geom_smooth()


tracts_rent_info %>% 
  filter(
    rental_stab_pct > 0,
    rental_stab_pct <= 1
  ) %>% 
  ggplot() +
  aes(x = rental_stab_pct, y = rent_burden_medE) +
  geom_point(size = 0.5, alpha = 0.25) +
  geom_smooth() +
  facet_wrap(~county)

Spatial Data

There are two kinds of spatial data: raster (pixels) and vector (points, lines, and polygons). For working with vector data, the sf package is emerging as the new standard in R; you might also see the older sp package, but it is being phased out.

Simple features is an international standard format for representing geometries (vector data).

A feature geometry is called simple when it consists of points connected by straight line pieces, and does not intersect itself.


knitr::include_graphics(path("img", "wkt_primitives.png"))


knitr::include_graphics(path("img", "wkt_multipart.png"))

We'll start with the simplest type - point data. You'll already be familiar with latitude and longitude, and this is spatial point data. The only difference between something like lat/lon and these example values in the images above is that lat/lon have an associated Coordinate Reference System (CRS) that describes how the numeric values of x and y correspond to positions on the earth. The CRS used for latitude and longitude data is called WGS84, and is often more conveniently specified using EPSG codes, in this case 4326.

Here are some common coordinate reference systems that you may see:
* 4326 - latitude/longitude
* 3857 - Web Mercator
* 2263 - NY State Plane Long Island (feet)

Let’s start by creating some simple spatial data from lat/lon columns in a dataframe


nyc_housing_courts <- tribble(
                               ~court,             ~lat,              ~lon,
                "Bronx Housing Court", 40.8322917983711, -73.9189644981048,
             "Brooklyn Housing Court", 40.6908281879688, -73.9881400109281,
  "Red Hook Community Justice Center", 40.6792674361971, -74.0094171241109,
            "Manhattan Housing Court", 40.7168162247486, -74.0014321226876,
    "Harlem Community Justice Center", 40.8012116588986, -73.9384598376274,
               "Queens Housing Court", 40.7034894794273, -73.8080039304503,
        "Staten Island Housing Court", 40.6350251497506, -74.1119001588143
)

nyc_housing_courts

# A tibble: 7 x 3
  court                               lat   lon
  <chr>                             <dbl> <dbl>
1 Bronx Housing Court                40.8 -73.9
2 Brooklyn Housing Court             40.7 -74.0
3 Red Hook Community Justice Center  40.7 -74.0
4 Manhattan Housing Court            40.7 -74.0
5 Harlem Community Justice Center    40.8 -73.9
6 Queens Housing Court               40.7 -73.8
7 Staten Island Housing Court        40.6 -74.1

Now we'll use st_as_sf() from the sf package to make it into a spatial dataframe, by providing the columns containing the x and y coordinates (longitude and latitude) and specifying the CRS to use. Now when we print the dataframe we get some extra information, including the bounding box and CRS, and you'll see we have a new column geometry that is of the type sfc_POINT (simple features collection).


library(sf)

nyc_housing_courts_sf <- st_as_sf(nyc_housing_courts, coords = c("lon", "lat"), crs = 4326)
  
nyc_housing_courts_sf

Simple feature collection with 7 features and 1 field
geometry type:  POINT
dimension:      XY
bbox:           xmin: -74.1119 ymin: 40.63503 xmax: -73.808 ymax: 40.83229
CRS:            EPSG:4326
# A tibble: 7 x 2
  court                                         geometry
  <chr>                                      <POINT [°]>
1 Bronx Housing Court               (-73.91896 40.83229)
2 Brooklyn Housing Court            (-73.98814 40.69083)
3 Red Hook Community Justice Center (-74.00942 40.67927)
4 Manhattan Housing Court           (-74.00143 40.71682)
5 Harlem Community Justice Center   (-73.93846 40.80121)
6 Queens Housing Court                (-73.808 40.70349)
7 Staten Island Housing Court        (-74.1119 40.63503)

Now that we have a spatial dataframe with a CRS, we can transform the CRS by using st_transform.


nyc_housing_courts_sf_ny <- st_transform(nyc_housing_courts_sf, 2263)
nyc_housing_courts_sf_ny

Simple feature collection with 7 features and 1 field
geometry type:  POINT
dimension:      XY
bbox:           xmin: 953192 ymin: 170653 xmax: 1037484 ymax: 242514
CRS:            EPSG:2263
# A tibble: 7 x 2
  court                                             geometry
  <chr>                             <POINT [US_survey_foot]>
1 Bronx Housing Court                       (1006675 242514)
2 Brooklyn Housing Court                     (987539 190964)
3 Red Hook Community Justice Center          (981638 186752)
4 Manhattan Housing Court                    (983853 200432)
5 Harlem Community Justice Center           (1001288 231186)
6 Queens Housing Court                      (1037484 195635)
7 Staten Island Housing Court                (953192 170653)

Now we'll look at some polygon data that we'll read in from a shapefile.

Shapefiles are the most common format you'll find for spatial data.

They are really a collection of multiple files - as many as 13, but at least 3: .dbf (attributes), .shp (shapes), and .shx (shape index).

We can read this in with read_sf().

First we need to download the file. We'll be using a file of NYC census tracts from Open Data.


source(path("R", "download_tracts.R"))

When we print the dataframe we can see that it is using the 2263 CRS. One of the optional files (.prj) contains the projection (CRS) information, but if that’s missing you’ll need to specify what CRS to use when you import the data.

We can also see that the data type of the geometry column is sfc_MULTIPOLYGON. They are multipolygons because some tracts include multiple islands, etc.

nyc_tracts <- read_sf(path("data-raw", "nyc-tracts-2010", "nyct2010.shp"))

# let's first clean up the column names
nyc_tracts <- nyc_tracts %>% rename_all(str_to_lower)

Now that we have a couple of different spatial dataframes, let's take a look at them.

We can use the same basic ggplot2 tools to do this.

To just look at the geometries we can omit the aes(), because ggplot knows which columns to use for x and y, but if we want to specify what column to use for color we can do that the same way as before.


ggplot(nyc_housing_courts_sf) +
  # aes(color = court) +
  geom_sf()

If we want to change the CRS for a plot, we don't have to use st_transform(); we can use coord_sf() instead.

This can be helpful because the CRS that's best for working with the data might not be the same as how you want to display it. For example, if you are doing analysis it's helpful to use a system that is in feet or meters rather than degrees, but the web mapping standard 3857 can look good for maps.


ggplot(nyc_housing_courts_sf) +
  geom_sf() +
  coord_sf(crs = 3857)

Now let's layer on the tract polygons.

By default, when plotting multiple spatial dataframes, ggplot will transform them to use the same CRS.


ggplot() +
  geom_sf(data = nyc_tracts) +
  geom_sf(data = nyc_housing_courts_sf)

Now let's add some data to these tract geometries.


nyc_tracts_acs <- nyc_tracts %>% 
  mutate(
    county = recode(borocode, "1" = "061", "2" = "005", "3" = "047", "4" = "081", "5" = "085"),
    geoid = str_c("36", county, ct2010)
  ) %>% 
  left_join(acs_tracts_clean, by = "geoid")

nyc_tracts_acs

Simple feature collection with 2165 features and 39 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS:            2263
# A tibble: 2,165 x 40
   ctlabel borocode boroname ct2010 boroct2010 cdeligibil ntacode
   <chr>   <chr>    <chr>    <chr>  <chr>      <chr>      <chr>  
 1 9       5        Staten … 000900 5000900    E          SI22   
 2 98      1        Manhatt… 009800 1009800    I          MN19   
 3 100     1        Manhatt… 010000 1010000    I          MN19   
 4 102     1        Manhatt… 010200 1010200    I          MN17   
 5 104     1        Manhatt… 010400 1010400    I          MN17   
 6 113     1        Manhatt… 011300 1011300    I          MN17   
 7 114.02  1        Manhatt… 011402 1011402    I          MN40   
 8 130     1        Manhatt… 013000 1013000    I          MN40   
 9 140     1        Manhatt… 014000 1014000    I          MN40   
10 148.01  1        Manhatt… 014801 1014801    I          MN40   
# … with 2,155 more rows, and 33 more variables: ntaname <chr>,
#   puma <chr>, shape_leng <dbl>, shape_area <dbl>,
#   geometry <MULTIPOLYGON [US_survey_foot]>, …

Now we can plot some of our census data.


ggplot(nyc_tracts_acs) +
  aes(fill = renter_pctE) +
  geom_sf(color = "white", size = 0.05) +
  scale_fill_viridis_b(labels = scales::percent) +
  theme_void() +
  theme(legend.position = c(0.1, .75)) +
  labs(
    title = "Renter Share of Units",
    subtitle = "New York City,Census Tracts, 2014-2018",
    fill = NULL,
    caption = "Sources: American Community Survey (2014-2018)"
  )

We always have to be careful about how we communicate uncertainty in graphs like this: we know the margins of error can be quite large for tract-level data, and this map doesn't communicate that in any way.

One way to address this is to bin the continuous values into categories.

But we still aren't showing the level of uncertainty about whether any given area may be misclassified simply due to sampling error.
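
As a rough sketch of that binning idea (reusing the nyc_tracts_acs dataframe from above), we could cut the renter share into five categories before mapping it:


nyc_tracts_acs %>%
  mutate(renter_pct_cat = cut_number(renter_pctE, n = 5)) %>% # bins with roughly equal numbers of tracts
  ggplot() +
  aes(fill = renter_pct_cat) +
  geom_sf(color = "white", size = 0.05) +
  theme_void()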

There are some tools that NYC City Planning uses to help enforce some basic standards for reliability, and I made a simple package that helps you test this out in R; for details see mapreliability.

When you use this tool with ACS tract data you'll often find that it's very hard to classify the data in any reasonable way that satisfies the thresholds. In that case the best thing to do is consider using data aggregated at a larger geography that will give you a larger sample and smaller MOEs.

Those larger geographies will often be too large for neighborhood analysis, but City Planning has also helped with that by defining Neighborhood Tabulation Areas (NTAs), which are aggregations of census tracts.

The census tract geometries we downloaded include the NTA codes, and so we can aggregate our census data and at the same time combine our tract geometries to create a new NTA-level dataset and geometries.

You need to take some care in handling the estimates and margins of error so that the MOE calculations are valid. We can't simply take the mean of the share and its MOE, but instead need to sum up the components and then recalculate the share.


nyc_ntas_acs <- nyc_tracts_acs %>% 
  group_by(ntacode, ntaname) %>% 
  summarise(
    occ_renter_unitsE = sum(occ_renter_unitsE),
    occ_renter_unitsM = moe_sum(occ_renter_unitsM),
    occ_unitsE = sum(occ_unitsE),
    occ_unitsM = moe_sum(occ_unitsM),
    geometry = st_union(geometry)
  ) %>% 
  ungroup() %>% 
  mutate(
    renter_pctE = occ_renter_unitsE / na_if(occ_unitsE, 0),
    renter_pctM = moe_ratio(occ_renter_unitsE, na_if(occ_unitsE, 0), occ_renter_unitsM, occ_unitsM)
  )

nyc_ntas_acs

Simple feature collection with 195 features and 8 fields
geometry type:  GEOMETRY
dimension:      XY
bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS:            2263
# A tibble: 195 x 9
   ntacode ntaname occ_renter_unit… occ_renter_unit… occ_unitsE
 * <chr>   <chr>              <dbl>            <dbl>      <dbl>
 1 BK09    Brookl…             6021             364.      11115
 2 BK17    Sheeps…            12000             483.      26150
 3 BK19    Bright…            10047             408.      14557
 4 BK21    Seagat…             9419             292.      11236
 5 BK23    West B…             4922             292.       8401
 6 BK25    Homecr…             9938             381.      15528
 7 BK26    Graves…             7514             331.      11445
 8 BK27    Bath B…             6733             302.      11204
 9 BK28    Benson…            19702             515.      30079
10 BK29    Benson…            14269             469.      22373
# … with 185 more rows, and 4 more variables: occ_unitsM <dbl>,
#   geometry <POLYGON [US_survey_foot]>, renter_pctE <dbl>,
#   renter_pctM <dbl>

ggplot(nyc_ntas_acs) +
  aes(fill = renter_pctE) +
  geom_sf(color = "white", size = 0.05) +
  scale_fill_viridis_c(labels = scales::percent) +
  theme_void() +
  theme(legend.position = c(0.1, .75)) +
  labs(
    title = "Renter Share of Units",
    subtitle = "New York City, Neighborhood Tabulation Areas, 2014-2018",
    fill = NULL,
    caption = "Sources: American Community Survey (2014-2018)"
  )


Next let's bring in a second spatial dataset: Housing Maintenance Code Violations from NYC Open Data. The Socrata API lets us write a SoQL query to download just the columns we need for inspections from the past week, and then we can convert the latitude/longitude coordinates into an sf object in the same coordinate reference system as our NTA geometries.


# Query the Socrata API for the past week of violations with valid coordinates
hpd_viol_query <- str_glue(
  "https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=
  select violationid, inspectiondate, latitude, longitude
  where inspectiondate > '{Sys.Date()-7}T00:00:00'
   and latitude is not null
  limit 100000")

# Convert lat/lon points (WGS84, EPSG:4326) to sf, then reproject to the
# same state plane system (EPSG:2263) as our NTA geometries
hpd_violations <- read_csv(URLencode(hpd_viol_query)) %>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>% 
  st_transform(2263) 

hpd_violations

Simple feature collection with 3747 features and 2 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 950378.9 ymin: 148699.2 xmax: 1058381 ymax: 269486
CRS:            EPSG:2263
# A tibble: 3,747 x 3
   violationid inspectiondate                      geometry
         <dbl> <dttm>              <POINT [US_survey_foot]>
 1    13714193 2020-06-29 00:00:00       (1001557 196353.2)
 2    13713995 2020-06-29 00:00:00      (995169.1 177480.2)
 3    13713994 2020-06-29 00:00:00      (995169.1 177480.2)
 4    13713993 2020-06-29 00:00:00      (995169.1 177480.2)
 5    13713992 2020-06-29 00:00:00      (995169.1 177480.2)
 6    13713991 2020-06-29 00:00:00      (995169.1 177480.2)
 7    13713990 2020-06-29 00:00:00      (995169.1 177480.2)
 8    13713989 2020-06-29 00:00:00      (995169.1 177480.2)
 9    13713988 2020-06-29 00:00:00      (995169.1 177480.2)
10    13713987 2020-06-29 00:00:00      (995169.1 177480.2)
# … with 3,737 more rows

Now we can spatially join the violation points to the NTA polygons. st_join() with st_intersects matches each point to the NTA geometry it falls within, and because it is a left join by default every NTA is kept (NTAs with no recent violations just get NA values, which is why we count non-missing inspection dates in the next step).


hpd_violations_joined <- nyc_ntas_acs %>% 
  st_join(hpd_violations, st_intersects) %>% 
  select(violationid, inspectiondate, ntacode, ntaname)

hpd_violations_joined

Simple feature collection with 3813 features and 4 fields
geometry type:  GEOMETRY
dimension:      XY
bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS:            2263
First 10 features:
   violationid inspectiondate ntacode
1     13712509     2020-07-01    BK09
2     13710989     2020-06-30    BK17
3     13710952     2020-06-30    BK17
4     13710935     2020-06-30    BK17
5     13710446     2020-06-29    BK19
6     13710430     2020-06-29    BK19
7     13710421     2020-06-29    BK19
8     13711532     2020-06-30    BK19
9     13711529     2020-06-30    BK19
10    13710744     2020-06-30    BK19
                                          ntaname
1                    Brooklyn Heights-Cobble Hill
2  Sheepshead Bay-Gerritsen Beach-Manhattan Beach
3  Sheepshead Bay-Gerritsen Beach-Manhattan Beach
4  Sheepshead Bay-Gerritsen Beach-Manhattan Beach
5                                  Brighton Beach
6                                  Brighton Beach
7                                  Brighton Beach
8                                  Brighton Beach
9                                  Brighton Beach
10                                 Brighton Beach
                         geometry
1  POLYGON ((986838.3 192326.7...
2  POLYGON ((997485 148503.8, ...
3  POLYGON ((997485 148503.8, ...
4  POLYGON ((997485 148503.8, ...
5  POLYGON ((997341.1 149453.7...
6  POLYGON ((997341.1 149453.7...
7  POLYGON ((997341.1 149453.7...
8  POLYGON ((997341.1 149453.7...
9  POLYGON ((997341.1 149453.7...
10 POLYGON ((997341.1 149453.7...

hpd_violations_ntas <- hpd_violations_joined %>% 
  group_by(ntacode, ntaname) %>% 
  summarise(violations = sum(!is.na(inspectiondate))) %>% 
  ungroup()

hpd_violations_ntas

Simple feature collection with 195 features and 3 fields
geometry type:  GEOMETRY
dimension:      XY
bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS:            2263
# A tibble: 195 x 4
   ntacode ntaname         violations                         geometry
   <chr>   <chr>                <int>       <POLYGON [US_survey_foot]>
 1 BK09    Brooklyn Heigh…          1 ((986838.3 192326.7, 986899.7 1…
 2 BK17    Sheepshead Bay…          3 ((997485 148503.8, 997466.1 148…
 3 BK19    Brighton Beach          31 ((997341.1 149453.7, 997466.1 1…
 4 BK21    Seagate-Coney …          4 ((989467.6 148972.3, 989627.2 1…
 5 BK23    West Brighton            0 ((990752.8 148875.7, 990727.1 1…
 6 BK25    Homecrest               35 ((996416.9 153416.9, 996158.3 1…
 7 BK26    Gravesend               10 ((992262.4 154391.8, 992520 154…
 8 BK27    Bath Beach              15 ((985473.7 157957.2, 984988.9 1…
 9 BK28    Bensonhurst We…         42 ((985575.9 156813.6, 985277.5 1…
10 BK29    Bensonhurst Ea…         33 ((987585.9 155159.6, 987132.5 1…
# … with 185 more rows

ggplot(hpd_violations_ntas) +
  aes(fill = violations) +
  geom_sf(color = "white", size = 0.05) +
  scale_fill_viridis_c(labels = scales::comma) +
  theme_void() +
  theme(legend.position = c(0.1, .75)) +
  labs(
    title = "Housing Maintenance Code Violations",
    subtitle = str_glue("New York City, Neighborhood Tabulation Areas, {Sys.Date()-7} to {Sys.Date()}"),
    fill = NULL,
    caption = "Sources: Department of Housing Preservation and Development (HPD) via Open Data"
  )

Resources

Survey Data

So far we've been working with ACS data that is already tabulated for us, but sometimes the cuts of the data we need are not available, and we can use the microdata to calculate these statistics ourselves.

Because these microdata come from a survey, we can't simply treat them as a regular dataframe and sum up columns: there are survey weights associated with each observation, and these need to be incorporated. In some cases the weights are simple and it would be enough to use weighted means, etc. to get our estimates, but when the survey design is more complicated and/or we want margins of error on our estimates, we'll need some other tools.
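For instance, a simple weighted point estimate can be computed directly with dplyr (a quick sketch using the ipums_clean dataframe we build below), but it gives us no standard error or margin of error that reflects the survey design:


# A plain weighted mean is fine for a point estimate,
# but provides no measure of uncertainty
ipums_clean %>% 
  summarise(avg_hh_income = weighted.mean(hh_inc_nom, w = hhwt, na.rm = TRUE))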

Here we’ll be working with the srvyr package, which (in the same spirit as the sf package) allows us to set up our data frame with some special information about the survey design and then use the existing dplyr tools we know to work with the data.

First let's import some ACS microdata. We'll be using a file for NY downloaded from IPUMS, which cleans up the ACS Public Use Microdata Sample (PUMS) data to make it easier to work with. Unfortunately they don't have a system to automate these downloads, so we need to do it manually. I'm going to borrow the file from another project, where you can read a description of the variables and how to download a copy.


# Read in IPUMS USA ACS microdata, standardize the column names
ipums_raw <- read_csv(path("data-raw", "ipums_acs-2018-1yr_ny.csv.gz")) %>% rename_all(str_to_lower)

First I'm going to do some basic cleaning and create some variables for analysis.


ipums_clean <- ipums_raw %>% 
  filter(
    # Remove group quarters population
    gq %in% 1:2,
    # Keep only head-of-household records
    pernum == 1,
    # keep only NYC
    countyfip %in% c(5, 47, 61, 81, 85)
  ) %>% 
  transmute(
    serial,
    hhwt,
    countyfip,
    
    # Household income
    hh_inc_nom = case_when(
      hhincome <= 0 ~ 0,
      hhincome == 9999999 ~ NA_real_, 
      TRUE ~ hhincome
    ),
    hh_inc_grp = cut(
      hh_inc_nom, c(-Inf, 25e3, 50e3, 75e3, 100e3, 125e3, 150e3, Inf),
      c("< $25k", "$25k - 50k", "$50k - $75k", "$75k - $100k", "$100k - $125k", "$125k - $150k", ">= $150k"),
      include.lowest = TRUE, 
      ordered_result = TRUE
    ),
    
    # Renter vars
    is_renter = (ownershp == 2),
    gross_rent_nom = if_else(is_renter, rentgrs, NA_real_),
    rent_burden = gross_rent_nom / (hh_inc_nom / 12),
    is_rent_burdened = (rent_burden > 0.30),
    
    
    # Race/ethnicity 
    race_eth = case_when(
      hispan %in% 1:4 ~ "Hispanic, of any race",
      race == 1 ~ "Non-Hispanic white",
      race == 2 ~ "Non-Hispanic Black",
      race == 3 ~ "Non-Hispanic American Indian or Alaska Native",
      race == 4 ~ "Non-Hispanic Asian or Pacific Islander", # Chinese
      race == 5 ~ "Non-Hispanic Asian or Pacific Islander", # Japanese
      race == 6 ~ "Non-Hispanic Asian or Pacific Islander", # Other Asian or Pacific Island
      race == 7 ~ "Non-Hispanic other",
      race == 8 ~ "Non-Hispanic Two or more major races", # Two major races
      race == 9 ~ "Non-Hispanic Two or more major races" # Three or more major races
    ),
    
    age
  )

Now we can set the survey design information.


library(srvyr)

ipums_svy <- as_survey_design(ipums_clean, weights = hhwt)

ipums_svy

Independent Sampling design (with replacement)
Called via srvyr
Sampling variables:
 - ids: `1`
 - weights: hhwt
Data variables: serial (dbl), hhwt (dbl), countyfip (dbl), hh_inc_nom
  (dbl), hh_inc_grp (ord), is_renter (lgl), gross_rent_nom (dbl),
  rent_burden (dbl), is_rent_burdened (lgl), race_eth (chr), age
  (dbl)

ipums_inc_burden <- ipums_svy %>% 
  filter(
    is_renter,
    !is.na(hh_inc_grp),
  ) %>% 
  group_by(hh_inc_grp) %>% 
  summarise(
    rent_burden_pct = survey_mean(is_rent_burdened, na.rm = TRUE, vartype = "ci", level = 0.95)
  ) %>% 
  ungroup()

ipums_inc_burden

# A tibble: 6 x 4
  hh_inc_grp    rent_burden_pct rent_burden_pct_low rent_burden_pct_u…
  <ord>                   <dbl>               <dbl>              <dbl>
1 < $25k                0.852              0.839               0.865  
2 $25k - 50k            0.742              0.723               0.761  
3 $50k - $75k           0.453              0.428               0.478  
4 $75k - $100k          0.220              0.196               0.244  
5 $100k - $125k         0.114              0.0950              0.133  
6 $125k - $150k         0.00489            0.000914            0.00888

ggplot(ipums_inc_burden) +
  aes(x = hh_inc_grp, y = rent_burden_pct, ymin = rent_burden_pct_low, ymax = rent_burden_pct_upp) +
  geom_col(fill = "goldenrod") +
  geom_errorbar() +
  geom_text(
    aes(y = rent_burden_pct_upp, label = scales::percent(rent_burden_pct)),
    vjust = -0.5, size = 3.5
  ) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  theme_minimal() +
  labs(
    title = str_glue(
      "Share of Renter Households Paying More than 30% of Monthly Income on Rent, 
      by Annual Household Income Level"
    ),
    subtitle = "New York City, 2018",
    x = "Household income",
    y = "Share of Renter Households",
    caption = str_glue(
      "Notes: Error bars represent 95% confidence intervals, and value labels reflect point estimates
      Sources: American Community Survey (2018), IPUMS USA"
    )
  )

Modeling

We'll fit a logistic regression where the dependent variable indicates whether the household rents its home, predicted by characteristics of the head of household.


model_1 <- glm(
  is_renter ~ hh_inc_nom + factor(race_eth) + age + factor(countyfip), 
  data = ipums_clean, 
  weights = hhwt,
  family = binomial(link = "logit")
)

The built-in summary function gives you helpful information, but it only prints it to the screen as text, and if you capture the results it’s a complicated list structure.


summary(model_1)

Call:
glm(formula = is_renter ~ hh_inc_nom + factor(race_eth) + age + 
    factor(countyfip), family = binomial(link = "logit"), data = ipums_clean, 
    weights = hhwt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-53.478   -8.784    4.712    8.204   54.715  

Coefficients:
                                                                Estimate
(Intercept)                                                    4.347e+00
hh_inc_nom                                                    -5.525e-06
factor(race_eth)Non-Hispanic American Indian or Alaska Native -1.477e+00
factor(race_eth)Non-Hispanic Asian or Pacific Islander        -1.215e+00
factor(race_eth)Non-Hispanic Black                            -5.763e-01
factor(race_eth)Non-Hispanic other                            -1.278e+00
factor(race_eth)Non-Hispanic Two or more major races          -6.613e-01
factor(race_eth)Non-Hispanic white                            -1.021e+00
age                                                           -3.891e-02
factor(countyfip)47                                           -2.303e-01
factor(countyfip)61                                            6.590e-01
factor(countyfip)81                                           -8.373e-01
factor(countyfip)85                                           -1.736e+00
                                                              Std. Error
(Intercept)                                                    6.761e-03
hh_inc_nom                                                     1.382e-08
factor(race_eth)Non-Hispanic American Indian or Alaska Native  3.099e-02
factor(race_eth)Non-Hispanic Asian or Pacific Islander         4.900e-03
factor(race_eth)Non-Hispanic Black                             4.364e-03
factor(race_eth)Non-Hispanic other                             1.622e-02
factor(race_eth)Non-Hispanic Two or more major races           1.024e-02
factor(race_eth)Non-Hispanic white                             4.037e-03
age                                                            8.434e-05
factor(countyfip)47                                            4.692e-03
factor(countyfip)61                                            5.372e-03
factor(countyfip)81                                            4.733e-03
factor(countyfip)85                                            7.038e-03
                                                              z value
(Intercept)                                                    642.97
hh_inc_nom                                                    -399.78
factor(race_eth)Non-Hispanic American Indian or Alaska Native  -47.66
factor(race_eth)Non-Hispanic Asian or Pacific Islander        -248.08
factor(race_eth)Non-Hispanic Black                            -132.04
factor(race_eth)Non-Hispanic other                             -78.80
factor(race_eth)Non-Hispanic Two or more major races           -64.56
factor(race_eth)Non-Hispanic white                            -253.01
age                                                           -461.37
factor(countyfip)47                                            -49.09
factor(countyfip)61                                            122.69
factor(countyfip)81                                           -176.92
factor(countyfip)85                                           -246.68
                                                              Pr(>|z|)
(Intercept)                                                     <2e-16
hh_inc_nom                                                      <2e-16
factor(race_eth)Non-Hispanic American Indian or Alaska Native   <2e-16
factor(race_eth)Non-Hispanic Asian or Pacific Islander          <2e-16
factor(race_eth)Non-Hispanic Black                              <2e-16
factor(race_eth)Non-Hispanic other                              <2e-16
factor(race_eth)Non-Hispanic Two or more major races            <2e-16
factor(race_eth)Non-Hispanic white                              <2e-16
age                                                             <2e-16
factor(countyfip)47                                             <2e-16
factor(countyfip)61                                             <2e-16
factor(countyfip)81                                             <2e-16
factor(countyfip)85                                             <2e-16
                                                                 
(Intercept)                                                   ***
hh_inc_nom                                                    ***
factor(race_eth)Non-Hispanic American Indian or Alaska Native ***
factor(race_eth)Non-Hispanic Asian or Pacific Islander        ***
factor(race_eth)Non-Hispanic Black                            ***
factor(race_eth)Non-Hispanic other                            ***
factor(race_eth)Non-Hispanic Two or more major races          ***
factor(race_eth)Non-Hispanic white                            ***
age                                                           ***
factor(countyfip)47                                           ***
factor(countyfip)61                                           ***
factor(countyfip)81                                           ***
factor(countyfip)85                                           ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4027428  on 26582  degrees of freedom
Residual deviance: 3263248  on 26570  degrees of freedom
AIC: 3263274

Number of Fisher Scoring iterations: 9
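If you do capture the output of summary(), you have to dig the pieces you want out of that list structure yourself. For example (just to illustrate the point):


model_1_summary <- summary(model_1)

class(model_1_summary)        # "summary.glm"
names(model_1_summary)        # many components: coefficients, deviance, etc.
head(coef(model_1_summary))   # the coefficient table, as a plain matrix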

The broom package provides some helpful functions that extract the relevant information from the model object and return it as a dataframe, so that you can use the data manipulation and graphing tools you are already familiar with.


library(broom)

There are three functions that extract different levels of information from the model.

tidy() gives you a dataframe where each row contains information about one regression coefficient.


tidy(model_1)

# A tibble: 13 x 5
   term                          estimate  std.error statistic p.value
   <chr>                            <dbl>      <dbl>     <dbl>   <dbl>
 1 (Intercept)                    4.35e+0    6.76e-3     643.        0
 2 hh_inc_nom                    -5.53e-6    1.38e-8    -400.        0
 3 factor(race_eth)Non-Hispani…  -1.48e+0    3.10e-2     -47.7       0
 4 factor(race_eth)Non-Hispani…  -1.22e+0    4.90e-3    -248.        0
 5 factor(race_eth)Non-Hispani…  -5.76e-1    4.36e-3    -132.        0
 6 factor(race_eth)Non-Hispani…  -1.28e+0    1.62e-2     -78.8       0
 7 factor(race_eth)Non-Hispani…  -6.61e-1    1.02e-2     -64.6       0
 8 factor(race_eth)Non-Hispani…  -1.02e+0    4.04e-3    -253.        0
 9 age                           -3.89e-2    8.43e-5    -461.        0
10 factor(countyfip)47           -2.30e-1    4.69e-3     -49.1       0
# … with 3 more rows

glance() returns a dataframe with exactly one row of goodness-of-fit measures and related statistics.


glance(model_1)

# A tibble: 1 x 7
  null.deviance df.null    logLik     AIC     BIC deviance df.residual
          <dbl>   <int>     <dbl>   <dbl>   <dbl>    <dbl>       <int>
1      4027428.   26582 -1631624.  3.26e6  3.26e6 3263248.       26570

augment() adds columns to a dataset, containing information such as fitted values, residuals, or cluster assignments. All added columns have a . prefix to prevent existing columns from being overwritten.


augment(model_1, data = ipums_clean) %>% 
  # Select just a few columns to see what's added
  select(serial, starts_with("."))

# A tibble: 26,583 x 8
   serial .fitted .se.fit .resid     .hat .sigma  .cooksd .std.resid
    <dbl>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>      <dbl>
 1 838281   2.21  0.00444 -21.0  0.000166   11.1 0.0111       -21.0 
 2 838284  -0.356 0.00417 -13.9  0.000763   11.1 0.00745      -13.9 
 3 838287  -0.396 0.00327  -7.99 0.000159   11.1 0.000512      -7.99
 4 838288   1.95  0.00440  10.7  0.000909   11.1 0.00429       10.7 
 5 838294   0.345 0.00372  14.7  0.000676   11.1 0.00741       14.7 
 6 838298   0.675 0.00424   6.79 0.000225   11.1 0.000493       6.79
 7 838306   1.36  0.0102    7.37 0.00202    11.1 0.00475        7.38
 8 838308   2.02  0.00458 -15.1  0.000115   11.1 0.00353      -15.1 
 9 838310   1.53  0.00357   4.88 0.000113   11.1 0.000115       4.88
10 838318   1.28  0.00335 -21.0  0.000278   11.1 0.0111       -21.0 
# … with 26,573 more rows

The gtsummary package builds on broom and on another table-making package called gt, and provides helpful tools for quickly creating nicely formatted, highly customizable tables of results.


library(gtsummary)

Before we even get to the regression model, it can be helpful to create some presentation-ready summary tables about our two groups: renters and owners. tbl_summary() can help with this (for more details see this tutorial).

By default it will give you a very helpful table, choosing reasonable defaults for the statistics based on the type of each variable (numeric, character/factor, etc.), but you can specify exactly which statistics to show (and how they are formatted). You can also use add_p() to automatically add a column of p-values.
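For example, to swap in means and standard deviations for the continuous variables, you could use gtsummary's formula syntax for the statistic argument (a small sketch; the variables mirror the full example below):


ipums_clean %>%
  select(is_renter, hh_inc_nom, age) %>% 
  tbl_summary(
    by = is_renter,
    # Show mean (SD) instead of the default median (IQR) for numeric columns
    statistic = all_continuous() ~ "{mean} ({sd})"
  )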

NOTE: the gtsummary package does not support data with survey weights, so for this example of how it works we'll be ignoring the survey weights. If your data has weights, you'll need to use the srvyr tools above to create your own summary tables, and then you can use gt alone to format tables of those results.
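As a rough sketch of that weighted workflow (assuming the ipums_svy survey object created above; gt() simply formats whatever dataframe you hand it):


library(gt)

# Survey-weighted renter share by race/ethnicity, formatted with gt alone
ipums_svy %>% 
  group_by(race_eth) %>% 
  summarise(renter_pct = survey_mean(is_renter, na.rm = TRUE)) %>% 
  gt()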


ipums_clean %>%
  select(is_renter, hh_inc_nom, age, race_eth, countyfip) %>% 
  tbl_summary(
    by = is_renter,
    label = list(
      hh_inc_nom ~ "Household Income",
      age ~ "Age, years", 
      countyfip ~ "County", 
      race_eth ~ "Race/Ethnicity"
    )
  ) %>% 
  add_p()
| Characteristic | FALSE, N = 10,619¹ | TRUE, N = 15,964¹ | p-value² |
|---|---|---|---|
| Household Income | 100000 (50925, 170000) | 52000 (21500, 103825) | <0.001 |
| Age, years | 58 (46, 69) | 46 (33, 61) | <0.001 |
| Race/Ethnicity | | | <0.001 |
| Hispanic, of any race | 1225 (12%) | 4461 (28%) | |
| Non-Hispanic American Indian or Alaska Native | 13 (0.1%) | 12 (<0.1%) | |
| Non-Hispanic Asian or Pacific Islander | 1959 (18%) | 1876 (12%) | |
| Non-Hispanic Black | 1879 (18%) | 3277 (21%) | |
| Non-Hispanic other | 91 (0.9%) | 83 (0.5%) | |
| Non-Hispanic Two or more major races | 188 (1.8%) | 361 (2.3%) | |
| Non-Hispanic white | 5264 (50%) | 5894 (37%) | |
| County | | | <0.001 |
| 5 | 909 (8.6%) | 2840 (18%) | |
| 47 | 3457 (33%) | 5708 (36%) | |
| 61 | 1168 (11%) | 3413 (21%) | |
| 81 | 4032 (38%) | 3625 (23%) | |
| 85 | 1053 (9.9%) | 378 (2.4%) | |

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence

Now we'll move back to our logit model and use the tbl_regression() function. Again, it detects the type of model you have and uses reasonable defaults to display your results in a helpful format. For example, for a logit model like ours it will present odds ratios.
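The odds ratios are just the exponentiated model coefficients; you can confirm this for yourself:


# Odds ratios by hand: exponentiate the log-odds coefficients
exp(coef(model_1)) %>% head()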


tbl_regression(
  model_1, 
  exponentiate = TRUE,
  label = list(
    "hh_inc_nom" ~ "Household Income",
    "age" ~ "Age, years", 
    "factor(countyfip)" ~ "County", 
    "factor(race_eth)" ~ "Race/Ethnicity"
  )
)
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| Household Income | 1.00 | 1.00, 1.00 | <0.001 |
| Race/Ethnicity | | | |
| Hispanic, of any race | | | |
| Non-Hispanic American Indian or Alaska Native | 0.23 | 0.21, 0.24 | <0.001 |
| Non-Hispanic Asian or Pacific Islander | 0.30 | 0.29, 0.30 | <0.001 |
| Non-Hispanic Black | 0.56 | 0.56, 0.57 | <0.001 |
| Non-Hispanic other | 0.28 | 0.27, 0.29 | <0.001 |
| Non-Hispanic Two or more major races | 0.52 | 0.51, 0.53 | <0.001 |
| Non-Hispanic white | 0.36 | 0.36, 0.36 | <0.001 |
| Age, years | 0.96 | 0.96, 0.96 | <0.001 |
| County | | | |
| 5 | | | |
| 47 | 0.79 | 0.79, 0.80 | <0.001 |
| 61 | 1.93 | 1.91, 1.95 | <0.001 |
| 81 | 0.43 | 0.43, 0.44 | <0.001 |
| 85 | 0.18 | 0.17, 0.18 | <0.001 |

¹ OR = Odds Ratio, CI = Confidence Interval

If instead we are using a basic OLS model, it will present the results differently.


model_2 <- lm(
  hh_inc_nom ~ is_renter + factor(race_eth) + age + factor(countyfip), 
  data = ipums_clean, 
  weights = hhwt
)

tbl_regression(
  model_2,
  label = list(
    "is_renter" ~ "Renter",
    "age" ~ "Age, years", 
    "factor(countyfip)" ~ "County", 
    "factor(race_eth)" ~ "Race/Ethnicity"
  )
)
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| Renter | -75906 | -79273, -72539 | <0.001 |
| Race/Ethnicity | | | |
| Hispanic, of any race | | | |
| Non-Hispanic American Indian or Alaska Native | -28955 | -65838, 7929 | 0.12 |
| Non-Hispanic Asian or Pacific Islander | 14281 | 9039, 19523 | <0.001 |
| Non-Hispanic Black | 3807 | -539, 8154 | 0.086 |
| Non-Hispanic other | 1089 | -18081, 20259 | >0.9 |
| Non-Hispanic Two or more major races | 23812 | 13171, 34452 | <0.001 |
| Non-Hispanic white | 50673 | 46634, 54712 | <0.001 |
| Age, years | -1235 | -1322, -1148 | <0.001 |
| County | | | |
| 5 | | | |
| 47 | 8049 | 3359, 12739 | <0.001 |
| 61 | 65554 | 60546, 70561 | <0.001 |
| 81 | -714 | -5630, 4202 | 0.8 |
| 85 | -17880 | -25527, -10234 | <0.001 |

¹ CI = Confidence Interval

Resources