A workshop for NYU Wagner MSPP class Policy & Data Studio on the basics of R for data analysis
All the workshop materials are available on GitHub.
R is a free and open-source software environment for statistical computing and graphics. It has a narrower set of uses and a smaller user base compared to Python, but because it is specifically designed for data analysis it is a great language for data analysts to learn. In recent years it has become increasingly popular and much easier to learn, especially for those new to coding.
There are three main reasons that I enjoy using R myself and teaching it to others.
Here is just a small sample of some of the great resources available for learning R:
The RStudio Desktop application is free to download.
This is the default RStudio interface:
* The top left pane contains all your code files. This is where you can write and save all the code for your analysis.
* The bottom left pane has the R console, where you can run individual pieces of R code, such as quick calculations, printing objects, or anything else that you don’t need to save in your final files.
* The top right pane contains a list of all the objects currently in your environment. If you click on a dataframe object it will open in a viewer in the top left pane, where you can browse, sort, and filter your view of the data (without altering the object itself).
* The bottom right pane contains a few important tabs: the plot viewer where any graphs you create will appear, the files explorer, and the help page.
R is an open-source language, so in addition to the basic functions that come standard with R (referred to as Base R) there are more than 10,000 user-written packages that can accomplish virtually any task in R. There is an official repository for these packages called CRAN that does some vetting of package quality, and packages from there can be installed directly from R using:
install.packages("PackageName")
These packages only need to be installed like this once; after that initial installation we only need to load the packages that we want to use for each analysis with library().
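For example, to set up dplyr on a new machine:

install.packages("dplyr") # run once to download the package from CRAN
library(dplyr)            # run in every script/session where you use it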
This project uses renv to handle dependency management. To get this project set up on your own system, open the project in RStudio (by opening the .Rproj file), install renv (install.packages("renv")), and run renv::init().
If you haven’t installed any R packages yet, this might take a little while.
All of the packages we are using here are part of a collection of R packages referred to as the tidyverse.
The Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. All of these packages are extremely well-maintained and have helpful websites that include examples and guides, function documentation, cheatsheets, and links to the GitHub repos where the packages are developed.
The following are the core set of Tidyverse packages, but there are many more.
* dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges
* readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf)
* tidyr helps you create tidy data. Tidy data is data where: each variable is in a column, each observation is a row, and each value is a cell
* stringr provides a cohesive set of functions designed to make working with strings as easy as possible
* forcats provides a suite of useful tools that solve common problems with factors
* purrr is a complete and consistent set of tools for working with functions and vectors
* ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics

In addition to the package websites, there is an amazing free book that covers how to use all these packages to do data analysis, called R for Data Science.
library(dplyr) # manipulate dataframes
library(readr) # read/write dataframes
library(tidyr) # reshaping dataframes
library(stringr) # string manipulation
library(forcats) # factor manipulation
library(purrr) # iteration (instead of loops)
library(ggplot2) # making plots
For these examples we’ll be using a dataset of buildings from the Public Advocate’s NYC Landlord Watchlist.
Let’s get started by reading in our dataset from a CSV file using read_csv from the readr package.
We’ll also be making use of the fs
package, which provides cross-platform file system operations.
If you need to import datasets that aren’t simple rectangular flat files (like csv, tsv, and fwf) then you will need another package:
* DBI for relational databases (paired with database specific backends),
* haven for SPSS, Stata, and SAS data,
* httr for web APIs,
* readxl for .xls and .xlsx sheets (see the sketch below),
* rvest for web scraping,
* jsonlite for JSON, and
* xml2 for XML.
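As an aside, here’s a minimal sketch of reading an Excel sheet with readxl (the file name here is a hypothetical example, not part of this workshop’s data):

library(readxl) # read Excel files
# Read the first sheet of a hypothetical workbook into a dataframe
my_xlsx_data <- read_excel("data-raw/example-workbook.xlsx", sheet = 1)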
library(fs) # cross-platform file system operations
watchlist_bldgs <- read_csv(path("data-raw", "landlord-watchlist-buildings_2018-12-08.csv"))
read_csv() guesses the data type of each column and prints the column specification it used. Often this will be what you want, but if you want to override the guesses you can supply your own specification (see ?readr::cols for details).
watchlist_bldgs <- read_csv(
file = path("data-raw", "landlord-watchlist-buildings_2018-12-08.csv"),
col_types = cols(
.default = col_character(),
units = col_integer(),
violations = col_integer()
)
)
Now let’s take a look at this new dataframe that we’ve imported. You can print the dataframe to get a simple preview.
watchlist_bldgs
# A tibble: 452 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan C… 5416 4 AV… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonathan C… 5422 4 AV… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonathan C… 314 57 ST… 17 70 314 57 ST… BROOKL… 11220
4 Jonathan C… 372 BALTI… 8 76 372 BALTI… BROOKL… 11201
5 Jonathan C… 166 BLEEC… 6 41 166 BLEEC… BROOKL… 11221
6 Jonathan C… 590 BUSHW… 4 55 590 BUSHW… BROOKL… 11206
7 Jonathan C… 165 HOWAR… 8 39 165 HOWAR… BROOKL… 11233
8 Jonathan C… 208 HOYT … 3 9 208 HOYT … BROOKL… 11217
9 Jonathan C… 203 HULL … 6 143 203 HULL … BROOKL… 11233
10 Jonathan C… 318 HUMBO… 7 102 318 HUMBO… BROOKL… 11211
# … with 442 more rows
When simply printing the dataframe you’ll only see a few rows and as many columns as fit nicely on your screen. When you have many columns it’s often helpful to use the function glimpse()
to see a list of all your columns.
glimpse(watchlist_bldgs)
Rows: 452
Columns: 8
$ landlord <chr> "Jonathan Cohen-Silvershore Properties", "Jonath…
$ address <chr> "5416 4 AVENUE, BROOKLYN 11220", "5422 4 AVENUE,…
$ units <int> 16, 21, 17, 8, 6, 4, 8, 3, 6, 7, 6, 8, 6, 6, 7, …
$ violations <int> 62, 78, 70, 76, 41, 55, 39, 9, 143, 102, 19, 54,…
$ number <chr> "5416", "5422", "314", "372", "166", "590", "165…
$ street <chr> "4 AVENUE", "4 AVENUE", "57 STREET", "BALTIC STR…
$ borough <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", "BROOKLYN", …
$ zip <chr> "11220", "11220", "11220", "11201", "11221", "11…
We can also get a very helpful overview of our dataset that includes some informative descriptive statistics using skim() from the package skimr.
library(skimr)
skim(watchlist_bldgs)
Name | watchlist_bldgs |
Number of rows | 452 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
landlord | 0 | 1 | 7 | 37 | 0 | 98 | 0 |
address | 0 | 1 | 27 | 50 | 0 | 452 | 0 |
number | 0 | 1 | 1 | 5 | 0 | 394 | 0 |
street | 0 | 1 | 7 | 28 | 0 | 294 | 0 |
borough | 0 | 1 | 5 | 13 | 0 | 5 | 0 |
zip | 0 | 1 | 5 | 5 | 0 | 84 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
units | 0 | 1 | 15.91 | 17.67 | 3 | 6 | 8 | 20 | 120 | ▇▁▁▁▁ |
violations | 0 | 1 | 91.22 | 78.81 | 8 | 39 | 71 | 117 | 650 | ▇▁▁▁▁ |
dplyr
The package dplyr
contains functions for basic data manipulation. It is organized around 5 main functions that take a dataframe and manipulate it in some way. The functions are named as verbs which help explain what they do.
* filter() - filter to the rows you want to keep based on conditions
* select() - select columns you want to keep
* arrange() - sort dataframe by a column
* mutate() - adds new columns
* summarise() - collapse multiple rows down to a single one

Every one of these functions takes a dataframe as the first argument and returns an altered version of that dataframe.
Inside of these functions columns are referred to with just their names without quotes.
filter()
Use filter() to find rows/cases where conditions are true. Rows where the condition evaluates to NA are dropped.
bk_bldgs <- filter(watchlist_bldgs, borough == "BROOKLYN")
bk_bldgs
# A tibble: 256 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan C… 5416 4 AV… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonathan C… 5422 4 AV… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonathan C… 314 57 ST… 17 70 314 57 ST… BROOKL… 11220
4 Jonathan C… 372 BALTI… 8 76 372 BALTI… BROOKL… 11201
5 Jonathan C… 166 BLEEC… 6 41 166 BLEEC… BROOKL… 11221
6 Jonathan C… 590 BUSHW… 4 55 590 BUSHW… BROOKL… 11206
7 Jonathan C… 165 HOWAR… 8 39 165 HOWAR… BROOKL… 11233
8 Jonathan C… 208 HOYT … 3 9 208 HOYT … BROOKL… 11217
9 Jonathan C… 203 HULL … 6 143 203 HULL … BROOKL… 11233
10 Jonathan C… 318 HUMBO… 7 102 318 HUMBO… BROOKL… 11211
# … with 246 more rows
Multiple conditions can be combined with &, or passed to filter() as separate arguments, which it combines with AND.
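For example, these two calls are equivalent:

filter(watchlist_bldgs, units > 10 & borough == "QUEENS")
filter(watchlist_bldgs, units > 10, borough == "QUEENS")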
bk_big_bldgs <- filter(watchlist_bldgs, units > 10, borough == "QUEENS")
bk_big_bldgs
# A tibble: 4 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan C… 1708 SUMM… 39 116 1708 SUMMER… QUEENS 11385
2 RONALD SWA… 34-01 28 … 41 157 34-01 28 AVE… QUEENS 11103
3 RONALD SWA… 47-02 88 … 29 98 47-02 88 STR… QUEENS 11373
4 HILLDISDE … 87-40 165… 119 383 87-40 165 ST… QUEENS 11432
select()
Use select()
to keep or drop columns. You can either specify a set of variables to keep by listing them, or specify columns to be dropped with -
.
select(watchlist_bldgs, landlord, borough, units)
# A tibble: 452 x 3
landlord borough units
<chr> <chr> <int>
1 Jonathan Cohen-Silvershore Properties BROOKLYN 16
2 Jonathan Cohen-Silvershore Properties BROOKLYN 21
3 Jonathan Cohen-Silvershore Properties BROOKLYN 17
4 Jonathan Cohen-Silvershore Properties BROOKLYN 8
5 Jonathan Cohen-Silvershore Properties BROOKLYN 6
6 Jonathan Cohen-Silvershore Properties BROOKLYN 4
7 Jonathan Cohen-Silvershore Properties BROOKLYN 8
8 Jonathan Cohen-Silvershore Properties BROOKLYN 3
9 Jonathan Cohen-Silvershore Properties BROOKLYN 6
10 Jonathan Cohen-Silvershore Properties BROOKLYN 7
# … with 442 more rows
select(watchlist_bldgs, -landlord)
# A tibble: 452 x 7
address units violations number street borough zip
<chr> <int> <int> <chr> <chr> <chr> <chr>
1 5416 4 AVENUE, BRO… 16 62 5416 4 AVENUE BROOKL… 11220
2 5422 4 AVENUE, BRO… 21 78 5422 4 AVENUE BROOKL… 11220
3 314 57 STREET, BRO… 17 70 314 57 STREET BROOKL… 11220
4 372 BALTIC STREET,… 8 76 372 BALTIC S… BROOKL… 11201
5 166 BLEECKER STREE… 6 41 166 BLEECKER… BROOKL… 11221
6 590 BUSHWICK AVENU… 4 55 590 BUSHWICK… BROOKL… 11206
7 165 HOWARD AVENUE,… 8 39 165 HOWARD A… BROOKL… 11233
8 208 HOYT STREET, B… 3 9 208 HOYT STR… BROOKL… 11217
9 203 HULL STREET, B… 6 143 203 HULL STR… BROOKL… 11233
10 318 HUMBOLDT STREE… 7 102 318 HUMBOLDT… BROOKL… 11211
# … with 442 more rows
You can rename the columns that you are selecting within select()
, or use rename()
which keeps all columns.
select(watchlist_bldgs, borough_name = borough)
# A tibble: 452 x 1
borough_name
<chr>
1 BROOKLYN
2 BROOKLYN
3 BROOKLYN
4 BROOKLYN
5 BROOKLYN
6 BROOKLYN
7 BROOKLYN
8 BROOKLYN
9 BROOKLYN
10 BROOKLYN
# … with 442 more rows
rename(watchlist_bldgs, landlord_name = landlord)
# A tibble: 452 x 8
landlord_name address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan Cohe… 5416 4… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonathan Cohe… 5422 4… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonathan Cohe… 314 57… 17 70 314 57 ST… BROOKL… 11220
4 Jonathan Cohe… 372 BA… 8 76 372 BALTI… BROOKL… 11201
5 Jonathan Cohe… 166 BL… 6 41 166 BLEEC… BROOKL… 11221
6 Jonathan Cohe… 590 BU… 4 55 590 BUSHW… BROOKL… 11206
7 Jonathan Cohe… 165 HO… 8 39 165 HOWAR… BROOKL… 11233
8 Jonathan Cohe… 208 HO… 3 9 208 HOYT … BROOKL… 11217
9 Jonathan Cohe… 203 HU… 6 143 203 HULL … BROOKL… 11233
10 Jonathan Cohe… 318 HU… 7 102 318 HUMBO… BROOKL… 11211
# … with 442 more rows
mutate()
Use mutate() to add new columns to a dataset. mutate() keeps all the existing columns and adds new ones to the end of the dataset, while the variant transmute() creates new columns but keeps only the new ones.
mutate(watchlist_bldgs, landlord_lower = str_to_lower(landlord))
# A tibble: 452 x 9
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonatha… 5416 4… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonatha… 5422 4… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonatha… 314 57… 17 70 314 57 ST… BROOKL… 11220
4 Jonatha… 372 BA… 8 76 372 BALTI… BROOKL… 11201
5 Jonatha… 166 BL… 6 41 166 BLEEC… BROOKL… 11221
6 Jonatha… 590 BU… 4 55 590 BUSHW… BROOKL… 11206
7 Jonatha… 165 HO… 8 39 165 HOWAR… BROOKL… 11233
8 Jonatha… 208 HO… 3 9 208 HOYT … BROOKL… 11217
9 Jonatha… 203 HU… 6 143 203 HULL … BROOKL… 11233
10 Jonatha… 318 HU… 7 102 318 HUMBO… BROOKL… 11211
# … with 442 more rows, and 1 more variable: landlord_lower <chr>
transmute(watchlist_bldgs, violations_per_unit = violations / units)
# A tibble: 452 x 1
violations_per_unit
<dbl>
1 3.88
2 3.71
3 4.12
4 9.5
5 6.83
6 13.8
7 4.88
8 3
9 23.8
10 14.6
# … with 442 more rows
arrange()
Use arrange() to order the rows in your dataset by the values of one or more columns. By default they will be in ascending order, and you can use desc() for descending order.
arrange(watchlist_bldgs, landlord, desc(units))
# A tibble: 452 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 ADAM STR… 150 WEST 1… 31 92 150 WEST 1… BRONX 10468
2 ADAM STR… 19 EAST 12… 25 105 19 EAST 1… MANHAT… 10035
3 ADAM STR… 611 WEST 1… 22 78 611 WEST 1… MANHAT… 10031
4 ADAM STR… 535 EDGECO… 21 96 535 EDGECO… MANHAT… 10032
5 ADAM STR… 480 CONVEN… 15 76 480 CONVEN… MANHAT… 10031
6 ADAM STR… 616 EAST 1… 13 62 616 EAST 1… BRONX 10458
7 ADAM STR… 2372 ARTHU… 12 47 2372 ARTHUR… BRONX 10458
8 ADAM STR… 2550 CREST… 11 67 2550 CRESTO… BRONX 10468
9 ADAM STR… 293 PLEASA… 10 49 293 PLEASA… MANHAT… 10029
10 ADAM STR… 2405 BEAUM… 9 33 2405 BEAUMO… BRONX 10458
# … with 442 more rows
summarize()
You can use summarize() on a dataset to collapse all the rows down to a single row and calculate an aggregate statistic of one or more columns. It works in a similar way to mutate(), except that whereas with mutate() you create new columns that are the same length as your existing dataset, with summarise() you use some sort of aggregate function (like sum()) that takes a column of multiple values and returns only one value.
summarise(watchlist_bldgs, total_units = sum(units))
# A tibble: 1 x 1
total_units
<int>
1 7192
group_by()
The 6th function is group_by(), and this doesn’t change the contents of your dataframe, but instead affects how all of the above functions work if they are subsequently called on the dataframe. After a dataframe has been grouped by one or more columns, all functions apply to each group of rows in the dataset as if it were its own dataset. group_by() is most commonly used with summarize(). Alone, summarize() will collapse a dataframe to a single row, but a grouped dataframe is collapsed down to one row per group. After you have finished with your grouped operations, use ungroup() to make sure the grouping doesn’t unintentionally alter later operations.
boro_bldgs <- group_by(watchlist_bldgs, borough)
boro_bldgs <- summarise(boro_bldgs, total_units = sum(units))
boro_bldgs <- ungroup(boro_bldgs)
boro_bldgs
# A tibble: 5 x 2
borough total_units
<chr> <int>
1 BRONX 1662
2 BROOKLYN 2723
3 MANHATTAN 2458
4 QUEENS 237
5 STATEN ISLAND 112
%>% (“pipe”)
As you can see above, when you want to make a series of changes to a dataframe you can end up repeating yourself a lot and overwriting a dataframe with each step. Thankfully there’s a way to avoid this!
The beauty of dplyr is that all of the functions above take a dataframe as the first argument, and return an altered version of that dataframe as the result. This allows us to start with a dataframe and link together multiple functions so that they each make a change to the dataframe then pass it along to the next function. dplyr
includes a special operator, %>%
(pronounced “pipe”), that allows us to chain together these function calls. When reading the code for these pipelines you can read %>%
as “then”.
This %>%
takes the object on the left and passes it to the function on the right as the first argument.
For a simple example, let’s look at the function str_c()
, which concatenates strings together. Instead of passing "a"
and "b"
as the first and second argument, we can use the %>%
to “pipe” the "a"
into the function as the first argument and the "b"
becomes the second argument.
str_c("a", "b")
[1] "ab"
"a" %>% str_c("b")
[1] "ab"
Now let’s practice putting together some of these dplyr functions into a little data manipulation pipeline by getting some information about the landlords on the watchlist and the buildings they own in Brooklyn.
The long pipeline of these dplyr
functions can seem overwhelming at first, but once you get familiar with the functions you’ll be able to read these code chunks like a little paragraph explaining the changes being made to a dataframe. To help illustrate this the following paragraph is a written explanation of every step of the accompanying block of code.
We’ll start with the full watchlist_bldgs dataset, then “pipe” (%>%) it into the next function to filter the dataset to just buildings where the borough is "Brooklyn". Then we mutate the dataset to add a new column called landlord_name that is simply a more nicely-formatted version of the existing landlord column. Then we select only the columns that we need: landlord_name, units, and HPD violations. Then we group_by the new landlord_name column, and then, with the dataset grouped, we’ll summarize the data across all buildings for each landlord to get some summary information about each landlord and their buildings in Brooklyn. Specifically, we’ll summarize to get the total number of buildings using the special n() function that counts the number of rows, we’ll also get the total_units by summing the units across all buildings for each landlord, and we’ll get the avg_bldg_size of each landlord’s Brooklyn buildings by taking the mean of units across their buildings. Similarly, we get the sum and mean of HPD violations for each landlord. We’ve now gone from a dataset in which each row represents a building to one in which each row is a landlord. Since we are done with our grouped operations we can ungroup the data, then finally we can arrange the dataset in descending order of the number of buildings the landlord owns in Brooklyn. After all of this our final resulting dataset is assigned to a new dataframe we’ll call bk_landlords.
bk_landlords <- watchlist_bldgs %>%
filter(borough == "BROOKLYN") %>%
mutate(landlord_name = str_to_title(landlord)) %>%
select(landlord_name, units, violations) %>%
group_by(landlord_name) %>%
summarize(
buildings = n(),
total_units = sum(units),
avg_bldg_size = mean(units),
total_viol = sum(violations),
avg_bldg_viol = mean(violations)
) %>%
ungroup() %>%
arrange(desc(buildings))
bk_landlords
# A tibble: 53 x 6
landlord_name buildings total_units avg_bldg_size total_viol
<chr> <int> <int> <dbl> <int>
1 Jonathan Coh… 18 149 8.28 974
2 Meir Fried 17 125 7.35 690
3 Deodat Lowtan 14 80 5.71 565
4 Cheryl Ighod… 11 68 6.18 413
5 Joel Kohn 10 60 6 312
6 Iskyo Aronov 9 37 4.11 495
7 Yosef Emergi 9 46 5.11 528
8 Chaim Schwar… 8 39 4.88 361
9 Isaac Schwar… 8 110 13.8 510
10 Jacob Rubin 8 52 6.5 349
# … with 43 more rows, and 1 more variable: avg_bldg_viol <dbl>
ggplot2
Now let’s visualize this new dataset we’ve created using the package ggplot2.
ggplot2 is designed to work with dataframe inputs, so the first step is always to use ggplot(data = your_dataframe). You can build plots step by step by adding layers with +. The second step is always aes(), which establishes the aesthetics of the plot by mapping columns from the dataframe to aesthetic elements of the plot. For example, here we are setting the x axis values to landlord names and y to the total number of HPD violations. After the aes is set, you can use one of the many geom_*() functions to transform the aesthetics into a type of plot. In this case we want a column plot, so we use geom_col(). Finally, we can label any part of our graph with labs().
ggplot(data = bk_landlords) +
aes(x = landlord_name, y = total_viol) +
geom_col() +
labs(
title = '"Worst" Landlords in Brooklyn',
subtitle = "Total HPD Violations in All Buildings for 2017",
x = NULL,
y = "Number of Violations",
caption = "Source: NYC Public Advocate's Landlord Watchlist"
)
With only the defaults ggplot2 graphs tend to look pretty good, and are not too difficult to create. However, there are definitely some things we’ll want to improve with this graph. Luckily, there is a near-infinite amount of customization possible with ggplot2 to get the plot looking exactly the way you want.
To start, there are clearly too many landlords to display legibly in a graph like this, so we can use dplyr to arrange the data by violations and filter to keep only the top 10 landlords. The first landlord name doesn’t match the same format as the others, so let’s remove the " Properties" part using str_remove from the stringr package. It would also be nice if the landlords were sorted in order of the number of violations. To achieve this we can change the landlord_name column from a string to R’s factor datatype, which allows us to specify an ordering for the values. Specifically, we’ll use the function fct_reorder() from the package forcats to make the column a factor and put the values in order based on the values of the total_viol column.
Now we can use this new dataframe with ggplot2 and make a few more changes to improve the graph further. One obvious problem with our initial graph is that the landlord names are completely illegible due to overlap. To solve this we can use the ggplot2 function coord_flip() to flip our bars sideways so we can read the labels more cleanly. Another smaller adjustment we can make is to format the violation count labels on our y-axis. To make changes to anything related to one of the aesthetic elements of a plot we can use one of the many scale_*_* functions. The first * is always one of the aes element types, and the second * indicates the type of data that is mapped to it. In our case we want to make a change to the y axis, and since we’ve mapped our count of violations to y it uses a continuous scale, so the function we’ll want is scale_y_continuous(). Within that function we’ll use the formatting function comma from the scales package on our axis labels. Lastly, we can use one of the theme_* functions to apply some alternative styling to the plot. These functions provide some helpful preset styling, but you can make your own fine-tuned adjustments using theme(). This can get a bit overwhelming, but just to illustrate what’s possible, here we’ll remove the unnecessary lines on the plot, move the landlord name labels over a bit, and change the font of the caption.
landlord_bk_10_worst <- bk_landlords %>%
arrange(desc(total_viol)) %>%
filter(row_number() <= 10) %>%
mutate(
landlord_name = str_remove(landlord_name, " Properties"),
landlord_name = fct_reorder(landlord_name, total_viol)
)
ggplot(data = landlord_bk_10_worst) +
aes(x = landlord_name, y = total_viol) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_text(margin = margin(r = -15)),
plot.caption = element_text(face = "italic", color = "darkgrey", margin = margin(t = 10))
) +
labs(
title = '10 "Worst" Landlords in Brooklyn',
subtitle = "Total HPD Violations in All Buildings for 2017",
x = NULL,
y = "Number of Violations",
caption = "Source: NYC Public Advocate's Landlord Watchlist"
)
tidyr
This demo for reshaping data was borrowed from Sam Raby
One of the core ideas of the tidyverse is “tidy data”, which is a structure of datasets that facilitates analysis using the tools in tidyverse packages (dplyr
, ggplot2
, etc.). In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table.
“Tidy datasets are all alike but every messy dataset is messy in its own way”
There are lots of common problems with datasets that make them untidy, and the package tidyr provides some powerful and flexible tools to help tidy them up into the standard format. Here we’ll focus on the most important of those tools, pivot_longer and pivot_wider.
This animation (from this blog post) nicely captures the basic idea behind the pivot_* functions.
For this example of reshaping data we’ll be working with a dataset of buildings in NYC that have rent stabilized units, from http://taxbills.nyc/. This project scraped data from PDFs of property tax documents to get estimates of rent stabilized unit counts in buildings across NYC. You can find a direct link to the data we’re using and read up on the various field names at the GitHub project page.
In this separate script we download the file if we don’t already have it.
source(path("R", "download-rent-stab.R"))
taxbills <- read_csv(path("data-raw", "rent-stab-units_joined_2007-2017.csv"))
For this demo, we only want to look at rent stabilized unit counts, which according to the GitHub documentation correspond to column names that end in “uc”. Let’s also grab BBL (which is a unique identifier for NYC buildings) and borough while we’re at it:
rentstab <- taxbills %>% select(borough, ucbbl, ends_with("uc") )
rentstab
# A tibble: 46,461 x 13
borough ucbbl `2007uc` `2008uc` `2009uc` `2010uc` `2011uc`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MN 1.00e9 293 293 292 293 293
2 MN 1.00e9 253 253 253 253 253
3 <NA> 1.00e9 NA NA NA NA 29
4 MN 1.00e9 NA NA NA NA NA
5 MN 1.00e9 97 60 38 15 11
6 MN 1.00e9 NA NA NA NA 130
7 MN 1.00e9 NA NA NA 1 NA
8 MN 1.00e9 4 2 2 2 2
9 MN 1.00e9 NA NA NA NA 2
10 MN 1.00e9 5 5 5 5 5
# … with 46,451 more rows, and 6 more variables: `2012uc` <dbl>,
# `2013uc` <dbl>, `2014uc` <dbl>, `2015uc` <dbl>, `2016uc` <dbl>, …
Annoyingly, the data separates unit counts for different years into different columns, violating the principle of tidy data that every column is a variable. To make proper use of ggplot to visualize our data, we’ll need to first tidy up our dataset to get all of the year values into one column and the unit counts into another.
We can use pivot_longer
to achieve this, performing this basic transformation:
rs_long <- rentstab %>%
pivot_longer(
ends_with("uc"), # The multiple column names we want to mush into one column
names_to = "year", # The title for the new column of names we're generating
values_to = "units" # The title for the new column of values we're generating
)
rs_long
# A tibble: 511,071 x 4
borough ucbbl year units
<chr> <dbl> <chr> <dbl>
1 MN 1000160180 2007uc 293
2 MN 1000160180 2008uc 293
3 MN 1000160180 2009uc 292
4 MN 1000160180 2010uc 293
5 MN 1000160180 2011uc 293
6 MN 1000160180 2012uc 293
7 MN 1000160180 2013uc 293
8 MN 1000160180 2014uc 293
9 MN 1000160180 2015uc 293
10 MN 1000160180 2016uc 293
# … with 511,061 more rows
Now that we have our data in the proper “long” (tidy) format, we can start working towards our desired plot. Let’s try and make a yearly timeline of rent stab. unit counts for the boroughs of Manhattan and Brooklyn:
rs_long_mn_bk_summary <- rs_long %>%
filter(
borough %in% c("MN","BK"), # Filter only Manhattan and Brooklyn values
!is.na(units) # Filter out null unit count values
) %>%
mutate(
year = str_remove(year, "uc"), # Remove "uc" from year values
year = as.integer(year) # change from character to integer
) %>%
group_by(year, borough) %>%
summarise(total_units = sum(units)) %>%
ungroup()
rs_long_mn_bk_summary
# A tibble: 22 x 3
year borough total_units
<int> <chr> <dbl>
1 2007 BK 236242
2 2007 MN 284001
3 2008 BK 237201
4 2008 MN 282871
5 2009 BK 223466
6 2009 MN 266088
7 2010 BK 226492
8 2010 MN 261740
9 2011 BK 231016
10 2011 MN 263745
# … with 12 more rows
Let’s build our bar graph. We are going to use position = "dodge" to show the Manhattan and Brooklyn bars side-by-side:
rs_over_time_graph_col <- ggplot(rs_long_mn_bk_summary) +
aes(x = year, y = total_units, fill = borough) +
geom_col(position = "dodge") +
scale_x_continuous(breaks = 2007:2017) +
scale_y_continuous(labels = scales::label_number_si()) +
scale_fill_discrete(labels = c("BK" = "Brooklyn", "MN" = "Manhattan")) +
labs(
title = "Total Rent Stabilized Units over Time",
subtitle = "Manhattan and Brooklyn, 2007 to 2017",
fill = NULL,
x = "Year",
y = "Total Rent Stabilized Units",
caption = "Source: taxbills.nyc"
)
rs_over_time_graph_col
We can also change the geom_*
function, and make a few other tweaks to change this to a line graph instead.
rs_over_time_graph_line <- ggplot(rs_long_mn_bk_summary) +
aes(x = year, y = total_units, color = borough) +
geom_line() +
scale_x_continuous(breaks = 2007:2017) +
scale_y_continuous(labels = scales::label_number_si()) +
scale_color_discrete(labels = c("BK" = "Brooklyn", "MN" = "Manhattan")) +
labs(
title = "Total Rent Stabilized Units over Time",
subtitle = "Manhattan and Brooklyn, 2007 to 2017",
color = NULL,
x = "Year",
y = "Total Rent Stabilized Units",
caption = "Source: taxbills.nyc"
)
rs_over_time_graph_line
If you ever need to make the opposite transformation to go from “long” data to “wide”, you can use pivot_wider. For example, we can reverse the change we made above using the following code.
rs_wide <- rs_long %>%
pivot_wider(
names_from = year, # The current column containing our future column names
values_from = units # The current column containing the values for our future columns
)
rs_wide
# A tibble: 46,461 x 13
borough ucbbl `2007uc` `2008uc` `2009uc` `2010uc` `2011uc`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MN 1.00e9 293 293 292 293 293
2 MN 1.00e9 253 253 253 253 253
3 <NA> 1.00e9 NA NA NA NA 29
4 MN 1.00e9 NA NA NA NA NA
5 MN 1.00e9 97 60 38 15 11
6 MN 1.00e9 NA NA NA NA 130
7 MN 1.00e9 NA NA NA 1 NA
8 MN 1.00e9 4 2 2 2 2
9 MN 1.00e9 NA NA NA NA 2
10 MN 1.00e9 5 5 5 5 5
# … with 46,451 more rows, and 6 more variables: `2012uc` <dbl>,
# `2013uc` <dbl>, `2014uc` <dbl>, `2015uc` <dbl>, `2016uc` <dbl>, …
tidycensus
Now we’ll put all these skills into practice with Census data, which will be relevant to many of your projects.
The tidycensus package uses the Census Bureau’s APIs to download Decennial Census and ACS summary file estimates. As the name implies, the package fits in with the tidyverse collection of packages.
While the Census API is free to use, they require you to sign up for an API key to get access. This is easy to do: all you need to provide is an email address, and there are instructions on the help page for this function: ?census_api_key
library(tidycensus)
census_api_key("c32dfbf7d25fe9558fd11bf021780457970f96ff")
Tidycensus includes decennial census and ACS data, but today we’ll stick with just ACS using get_acs(). There are many variables and they need to be specified using codes. We can explore these using load_variables().
acs_vars <- load_variables(2017, "acs5", cache = TRUE)
acs_vars
# A tibble: 25,070 x 3
name label concept
<chr> <chr> <chr>
1 B00001_0… Estimate!!Total UNWEIGHTED SAMPLE COUNT OF T…
2 B00002_0… Estimate!!Total UNWEIGHTED SAMPLE HOUSING UN…
3 B01001_0… Estimate!!Total SEX BY AGE
4 B01001_0… Estimate!!Total!!Male SEX BY AGE
5 B01001_0… Estimate!!Total!!Male!!Und… SEX BY AGE
6 B01001_0… Estimate!!Total!!Male!!5 t… SEX BY AGE
7 B01001_0… Estimate!!Total!!Male!!10 … SEX BY AGE
8 B01001_0… Estimate!!Total!!Male!!15 … SEX BY AGE
9 B01001_0… Estimate!!Total!!Male!!18 … SEX BY AGE
10 B01001_0… Estimate!!Total!!Male!!20 … SEX BY AGE
# … with 25,060 more rows
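One handy way to explore this table is with the dplyr and stringr tools from above, for example searching the concept column for keywords (a quick sketch; the search term here is just an example):

acs_vars %>%
  filter(str_detect(concept, "MEDIAN HOUSEHOLD INCOME")) %>%
  select(name, label)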
The main function to extract the data is a bit complicated, so you should pull up the help page (?get_acs) and walk through the arguments as you write it out.
* geography: the level of geography we want for our data, in this case census tract. (Full list of all available geography levels)
* state: if the requested geography nests within states, you can limit to one or more states
* variables: this can either be a simple vector of variable codes, or a named vector of variable codes where the names replace the codes in the column names
* survey: the ACS releases 1-, 3-, and 5-year estimates. (Tracts are only available with 5-year data)
* year: this is the latest year of the range. So 2018 with “acs5” means 2014-2018
* output: this can either be “tidy” (default) or “wide”. For mapping, “wide” makes sense, where each variable is its own column
* geometry: whether to include geometries (shapes) with the data

In this demo we’ll start by just downloading the estimates, and will incorporate the geometries later on.
acs_tracts_raw <- get_acs(
geography = "tract",
state = "NY",
county = c("005", "047", "061", "081", "085"), # NYC counties/boroughs
variables = c(
"gross_rent_med" = "B25064_001", # median gross rent
"hh_inc_med" = "B19013_001", # median household income
"rent_burden_med" = "B25071_001", # median rent burden
"pov_pct" = "C17002_001", # poverty rate
"hh_size_avg" = "B25010_001", # average hosehold size
"occ_units" = "B25003_001", # total occupied units
"occ_renter_units" = "B25003_003", # renter occupied units
"vac_forrent_units" = "B25004_002", # vacant units for rent
"vac_rented_units" = "B25004_003" # vacant units rented
),
survey = "acs5",
year = 2017,
output = "wide",
geometry = FALSE
)
acs_tracts_clean <- acs_tracts_raw %>%
rename(geoid = GEOID) %>%
mutate(
state = str_sub(geoid, 1, 2),
county = str_sub(geoid, 3, 5),
tract = str_sub(geoid, 6, 11),
renter_pctE = occ_renter_unitsE / na_if(occ_unitsE, 0),
renter_pctM = moe_ratio(occ_renter_unitsE, na_if(occ_unitsE, 0), occ_renter_unitsM, occ_unitsM),
rental_unitsE = occ_renter_unitsE + vac_forrent_unitsE + vac_rented_unitsE
) %>%
# moe_sum is designed to sum one column, but we can adjust the behaviour by
# using rowwise and c_across from dplyr (this is a bit confusing, and you
# won't come across it much)
rowwise() %>%
mutate(
rental_unitsM = moe_sum(c_across(c(occ_renter_unitsM, vac_forrent_unitsM, vac_rented_unitsM)))
) %>%
ungroup()
acs_tracts_clean
# A tibble: 2,167 x 27
geoid NAME gross_rent_medE gross_rent_medM hh_inc_medE hh_inc_medM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 3608… Cens… 1606 98 63750 13419
2 3608… Cens… 1548 238 67132 13825
3 3608… Cens… 1413 88 66645 6978
4 3608… Cens… 1492 112 61154 11608
5 3608… Cens… 1617 88 65313 17909
6 3608… Cens… 1663 80 51964 13265
7 3608… Cens… 532 96 20920 2131
8 3608… Cens… 1446 200 66739 14830
9 3608… Cens… 1463 259 69630 8971
10 3608… Cens… 1636 74 58879 16157
# … with 2,157 more rows, and 21 more variables:
# rent_burden_medE <dbl>, rent_burden_medM <dbl>, pov_pctE <dbl>,
# pov_pctM <dbl>, hh_size_avgE <dbl>, …
Now that we have our nice clean dataset of tract-level indicators, we can easily visualize some of the relationships between indicators across tracts with the basics we learned about ggplot. Here we’ll expand beyond geom_col and geom_line, and make a scatter plot with geom_point and add a simple smoothed conditional mean line with geom_smooth to help cut through the potential overplotting and reveal the pattern.
ggplot(acs_tracts_clean) +
aes(x = renter_pctE, y = pov_pctE) +
geom_point(size = 0.5, alpha = 0.25) +
geom_smooth() +
theme_minimal()
Now we can combine some of our data. First we’ll take the property level rent stabilization data and aggregate the counts of stabilized units by census tract. Then we can join that new aggregated data with our ACS tract data. This will allow us to incorporate the count of rental units from the ACS data to create a tract-level measure of share of rental units that are rent stabilized.
To eventually join these two datasets we’ll need to have a common column that identifies the rows in each dataset - in this case it will be the census tract “geoid”. One issue is that in the rent stabilization data the tract column is not in the same format, so first we will have to fix that.
tract_stab_units_2017 <- taxbills %>%
mutate(
county = recode(borough, "MN" = "061", "BX" = "005", "BK" = "047", "QN" = "081", "SI" = "085"),
tract = scales::number(ct2010, accuracy = 0.01) %>% str_remove("\\.") %>% str_pad(6, "left", "0"),
geoid = str_c("36", county, tract)
) %>%
group_by(geoid) %>%
summarise(stab_units_2017 = replace_na(sum(`2017uc`), 0)) %>%
ungroup()
tract_stab_units_2017
# A tibble: 1,753 x 2
geoid stab_units_2017
<chr> <dbl>
1 36005000200 0
2 36005000400 9
3 36005001600 0
4 36005001900 0
5 36005002000 0
6 36005002500 0
7 36005002701 0
8 36005002702 1159
9 36005002800 1585
10 36005003100 303
# … with 1,743 more rows
There are a few different types of joins that you can use to merge two datasets.
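As a quick reference, these are dplyr’s main “mutating” joins (a sketch using two generic dataframes x and y that share a key column):

left_join(x, y, by = "key")   # keep all rows of x, add matching columns from y
inner_join(x, y, by = "key")  # keep only rows that match in both x and y
full_join(x, y, by = "key")   # keep all rows from both, filling non-matches with NA
right_join(x, y, by = "key")  # keep all rows of y, add matching columns from x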
In this case we’ll be using the most common join type - left_join
- to keep all the records from our left table (ACS data) and merge in all the data for matching records in the right table (stabilized unit counts).
tracts_acs_stab <- acs_tracts_clean %>%
left_join(tract_stab_units_2017, by = "geoid")
tracts_acs_stab %>% select(geoid, rental_unitsE, stab_units_2017)
# A tibble: 2,167 x 3
geoid rental_unitsE stab_units_2017
<chr> <dbl> <dbl>
1 36081011900 608 0
2 36081013500 284 6
3 36081013600 466 326
4 36081014100 626 0
5 36081014500 597 0
6 36081016100 638 0
7 36081016300 1604 0
8 36081018000 120 NA
9 36081020600 311 36
10 36081006900 1574 0
# … with 2,157 more rows
Whenever you are working with joins, it’s always important to inspect the results carefully to make sure everything went as expected, and ensure that you understand what happened. Joins can very easily introduce errors into your work without you noticing. For example, you can easily introduce duplicate records or drop important records you need.
A couple helpful ways to inspect your data actually make use of some special versions of join functions, known as “filtering” joins, that do the matching between datasets without actually merging in any new data.
Above we used left_join to merge in all the records from the right table that find matches in the left table. Here we’ll use anti_join to keep only the records from the left table that do not find matches in the right table. This will let us see which tracts from the ACS data don’t have any records in the rent stabilization data. Then we can use some of the same basic tools to get a better understanding of that data, for example counting up the rows by borough (county).
in_acs_notin_stab <- acs_tracts_clean %>%
anti_join(tract_stab_units_2017, by = "geoid")
in_acs_notin_stab %>% count(county)
# A tibble: 5 x 2
county n
<chr> <int>
1 005 29
2 047 167
3 061 18
4 081 279
5 085 40
Once we are confident that our join went as planned, we can continue with our analysis work. Next we’ll use data from the two sources to create a measure of the share of all rental units that are rent stabilized.
tracts_rent_info <- tracts_acs_stab %>%
mutate(rental_stab_pct = replace_na(stab_units_2017 / na_if(rental_unitsE, 0), 0)) %>%
select(county, geoid, rental_stab_pct, rent_burden_medE)
tracts_rent_info
# A tibble: 2,167 x 4
county geoid rental_stab_pct rent_burden_medE
<chr> <chr> <dbl> <dbl>
1 081 36081011900 0 29.8
2 081 36081013500 0.0211 27.9
3 081 36081013600 0.700 29.1
4 081 36081014100 0 33.3
5 081 36081014500 0 31.8
6 081 36081016100 0 35
7 081 36081016300 0 30.2
8 081 36081018000 0 30.7
9 081 36081020600 0.116 29.7
10 081 36081006900 0 29.4
# … with 2,157 more rows
tracts_rent_info %>%
filter(
rental_stab_pct > 0,
rental_stab_pct <= 1
) %>%
ggplot() +
aes(x = rental_stab_pct, y = rent_burden_medE) +
geom_point(size = 0.5, alpha = 0.25) +
geom_smooth()
tracts_rent_info %>%
filter(
rental_stab_pct > 0,
rental_stab_pct <= 1
) %>%
ggplot() +
aes(x = rental_stab_pct, y = rent_burden_medE) +
geom_point(size = 0.5, alpha = 0.25) +
geom_smooth() +
facet_wrap(~county)
There are two kinds of spatial data: raster (pixels) and vector (points, lines, polygons). For vector data the sf package is emerging as the new standard in R; you might also see the older sp package, but it is being phased out.
sf stands for “simple features”, an international standard format for representing geometries (vector data) in data.
A feature geometry is called simple when it consists of points connected by straight line pieces, and does not intersect itself.
knitr::include_graphics(path("img", "wkt_primitives.png"))
knitr::include_graphics(path("img", "wkt_multipart.png"))
We’ll start with the simplest type - point data. You’ll already be familiar with latitude and longitude, and this is spatial point data. The only difference between something like lat/lon and the example values in the images above is that lat/lon have an associated Coordinate Reference System (CRS) that describes how the numeric values of x and y correspond to positions on the earth. The CRS used for latitude and longitude data is called WGS84, and is often more conveniently specified using EPSG codes, in this case 4326.
Here are some common coordinate reference systems that you may see:
* 4326 - latitude/longitude
* 3857 - Web Mercator
* 2263 - NY State Plane Long Island (feet)
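Once we load the sf package (below), you can look up the full definition of any of these codes with st_crs() (a quick sketch):

st_crs(2263) # prints the full CRS definition (name, units, projection) for NY State Plane Long Island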
Let’s start by creating some simple spatial data from lat/lon columns in a dataframe
nyc_housing_courts <- tribble(
~court, ~lat, ~lon,
"Bronx Housing Court", 40.8322917983711, -73.9189644981048,
"Brooklyn Housing Court", 40.6908281879688, -73.9881400109281,
"Red Hook Community Justice Center", 40.6792674361971, -74.0094171241109,
"Manhattan Housing Court", 40.7168162247486, -74.0014321226876,
"Harlem Community Justice Center", 40.8012116588986, -73.9384598376274,
"Queens Housing Court", 40.7034894794273, -73.8080039304503,
"Staten Island Housing Court", 40.6350251497506, -74.1119001588143
)
nyc_housing_courts
# A tibble: 7 x 3
court lat lon
<chr> <dbl> <dbl>
1 Bronx Housing Court 40.8 -73.9
2 Brooklyn Housing Court 40.7 -74.0
3 Red Hook Community Justice Center 40.7 -74.0
4 Manhattan Housing Court 40.7 -74.0
5 Harlem Community Justice Center 40.8 -73.9
6 Queens Housing Court 40.7 -73.8
7 Staten Island Housing Court 40.6 -74.1
Now we’ll use st_as_sf() from the sf package to make it into a spatial dataframe, by providing the columns containing the x and y coordinates and specifying the CRS to use. Now when we print the dataframe we get some extra information, including the bounding box and CRS, and you’ll see we now have a new column geometry that is of the type sfc_POINT (a simple features collection of points).
library(sf)
nyc_housing_courts_sf <- st_as_sf(nyc_housing_courts, coords = c("lon", "lat"), crs = 4326)
nyc_housing_courts_sf
Simple feature collection with 7 features and 1 field
geometry type: POINT
dimension: XY
bbox: xmin: -74.1119 ymin: 40.63503 xmax: -73.808 ymax: 40.83229
CRS: EPSG:4326
# A tibble: 7 x 2
court geometry
<chr> <POINT [°]>
1 Bronx Housing Court (-73.91896 40.83229)
2 Brooklyn Housing Court (-73.98814 40.69083)
3 Red Hook Community Justice Center (-74.00942 40.67927)
4 Manhattan Housing Court (-74.00143 40.71682)
5 Harlem Community Justice Center (-73.93846 40.80121)
6 Queens Housing Court (-73.808 40.70349)
7 Staten Island Housing Court (-74.1119 40.63503)
Now that we have a spatial dataframe with a CRS, we can transform the CRS by using st_transform
.
nyc_housing_courts_sf_ny <- st_transform(nyc_housing_courts_sf, 2263)
nyc_housing_courts_sf_ny
Simple feature collection with 7 features and 1 field
geometry type: POINT
dimension: XY
bbox: xmin: 953192 ymin: 170653 xmax: 1037484 ymax: 242514
CRS: EPSG:2263
# A tibble: 7 x 2
court geometry
<chr> <POINT [US_survey_foot]>
1 Bronx Housing Court (1006675 242514)
2 Brooklyn Housing Court (987539 190964)
3 Red Hook Community Justice Center (981638 186752)
4 Manhattan Housing Court (983853 200432)
5 Harlem Community Justice Center (1001288 231186)
6 Queens Housing Court (1037484 195635)
7 Staten Island Housing Court (953192 170653)
Now we’ll look at some polygon data that we’ll read in from a shapefile. Shapefiles are the most common format you’ll find for spatial data. They are really a collection of multiple files - as many as 13, but at least 3: .dbf (attributes), .shp (shapes), and .shx (shape index). We can read them in with read_sf.
First we need to download the file. We’ll be using a file of NYC census tracts from Open Data
source(path("R", "download_tracts.R"))
When we print the dataframe we can see that it is using the 2263 CRS. One of the optional files (.prj) contains the projection (CRS) information, but if that’s missing you’ll need to specify what CRS to use when you import the data.
The geometries here are of type sfc_MULTIPOLYGON. They are *multi*polygons because some tracts include multiple islands, etc.
nyc_tracts <- read_sf(path("data-raw", "nyc-tracts-2010", "nyct2010.shp"))
# let's first clean up the column names
nyc_tracts <- nyc_tracts %>% rename_all(str_to_lower)
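If the .prj file had been missing, read_sf() wouldn’t have known the CRS, and we could assign one ourselves after importing. Here’s a sketch (the file path is hypothetical, and note that st_set_crs() only labels the CRS, it does not reproject the coordinates):

tracts_no_prj <- read_sf(path("data-raw", "some-shapefile-without-prj", "tracts.shp")) %>%
  st_set_crs(2263) # assign the CRS we know the data to be in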
Now that we have a couple of different spatial dataframes, let’s take a look at them. We can use the same basic ggplot2 tools to do this. To just look at the geometries we can omit the aes(), because geom_sf() already knows to use the geometry column, but if we want to specify what column to use for color, we can do that the same way as before.
ggplot(nyc_housing_courts_sf) +
# aes(color = court) +
geom_sf()
If we want to change the CRS just for plotting, we don’t have to use st_transform, but can instead use coord_sf. This can be helpful because the CRS that’s best for working with the data might not be the same as how you want to display it. For example, if you are doing analysis it’s helpful to use a system that is in feet or meters rather than degrees, but the web mapping standard 3857 can look good for maps.
ggplot(nyc_housing_courts_sf) +
geom_sf() +
coord_sf(crs = 3857)
Now let’s layer on the tract polygons. By default, when using multiple spatial dataframes, ggplot will transform them to use the same CRS.
ggplot() +
geom_sf(data = nyc_tracts) +
geom_sf(data = nyc_housing_courts_sf)
Now let’s add some data to these tract geometries.
nyc_tracts_acs <- nyc_tracts %>%
mutate(
county = recode(borocode, "1" = "061", "2" = "005", "3" = "047", "4" = "081", "5" = "085"),
geoid = str_c("36", county, ct2010)
) %>%
left_join(acs_tracts_clean, by = "geoid")
nyc_tracts_acs
Simple feature collection with 2165 features and 39 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS: 2263
# A tibble: 2,165 x 40
ctlabel borocode boroname ct2010 boroct2010 cdeligibil ntacode
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 9 5 Staten … 000900 5000900 E SI22
2 98 1 Manhatt… 009800 1009800 I MN19
3 100 1 Manhatt… 010000 1010000 I MN19
4 102 1 Manhatt… 010200 1010200 I MN17
5 104 1 Manhatt… 010400 1010400 I MN17
6 113 1 Manhatt… 011300 1011300 I MN17
7 114.02 1 Manhatt… 011402 1011402 I MN40
8 130 1 Manhatt… 013000 1013000 I MN40
9 140 1 Manhatt… 014000 1014000 I MN40
10 148.01 1 Manhatt… 014801 1014801 I MN40
# … with 2,155 more rows, and 33 more variables: ntaname <chr>,
# puma <chr>, shape_leng <dbl>, shape_area <dbl>,
# geometry <MULTIPOLYGON [US_survey_foot]>, …
Now we can plot some of our census data.
ggplot(nyc_tracts_acs) +
aes(fill = renter_pctE) +
geom_sf(color = "white", size = 0.05) +
scale_fill_viridis_b(labels = scales::percent) +
theme_void() +
theme(legend.position = c(0.1, .75)) +
labs(
title = "Renter Share of Units",
subtitle = "New York City,Census Tracts, 2014-2018",
fill = NULL,
caption = "Sources: American Community Survey (2014-2018)"
)
You always have to be careful about how to communicate uncertainty in graphs like this: we know the margins of error can be quite large for tract data, and this map doesn’t communicate that in any way. One way to address this is to bin the continuous values into categories, but we still aren’t showing the level of uncertainty about whether any given area may be misclassified simply due to sampling error.
There are some tools that City Planning uses to help enforce some basic standards for reliability, and I made a simple package that helps you test this out in R; for details see mapreliability.
When you use this tool with ACS tract data you’ll often find that it’s very hard to classify the data in any reasonable way that satisfies the thresholds. In that case then the best thing to do is consider using data aggregated at a larger geography that will give you a larger sample and smaller MOEs.
The other geography options will often be too large for neighborhood analysis, but City Planning has also helped with that by defining Neighborhood Tabulation Areas (NTAs), which are aggregations of census tracts. The census tract geometries we downloaded include this variable, so we can aggregate our own census data and at the same time combine our tract geometries to create a new NTA-level dataset and geometries.
You need to take some care in handling the estimates and margins so that the MOE calculations are valid. We can’t simply take the mean of the share and its MOE, but instead need to sum up the components and then recalculate the share.
nyc_ntas_acs <- nyc_tracts_acs %>%
group_by(ntacode, ntaname) %>%
summarise(
occ_renter_unitsE = sum(occ_renter_unitsE),
occ_renter_unitsM = moe_sum(occ_renter_unitsM),
occ_unitsE = sum(occ_unitsE),
occ_unitsM = moe_sum(occ_unitsM),
geometry = st_union(geometry)
) %>%
ungroup() %>%
mutate(
renter_pctE = occ_renter_unitsE / na_if(occ_unitsE, 0),
renter_pctM = moe_ratio(occ_renter_unitsE, na_if(occ_unitsE, 0), occ_renter_unitsM, occ_unitsM)
)
nyc_ntas_acs
Simple feature collection with 195 features and 8 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS: 2263
# A tibble: 195 x 9
ntacode ntaname occ_renter_unit… occ_renter_unit… occ_unitsE
* <chr> <chr> <dbl> <dbl> <dbl>
1 BK09 Brookl… 6021 364. 11115
2 BK17 Sheeps… 12000 483. 26150
3 BK19 Bright… 10047 408. 14557
4 BK21 Seagat… 9419 292. 11236
5 BK23 West B… 4922 292. 8401
6 BK25 Homecr… 9938 381. 15528
7 BK26 Graves… 7514 331. 11445
8 BK27 Bath B… 6733 302. 11204
9 BK28 Benson… 19702 515. 30079
10 BK29 Benson… 14269 469. 22373
# … with 185 more rows, and 4 more variables: occ_unitsM <dbl>,
# geometry <POLYGON [US_survey_foot]>, renter_pctE <dbl>,
# renter_pctM <dbl>
ggplot(nyc_ntas_acs) +
aes(fill = renter_pctE) +
geom_sf(color = "white", size = 0.05) +
scale_fill_viridis_c(labels = scales::percent) +
theme_void() +
theme(legend.position = c(0.1, .75)) +
labs(
title = "Renter Share of Units",
subtitle = "New York City, Neighborhood Tabulation Areas, 2014-2018",
fill = NULL,
caption = "Sources: American Community Survey (2014-2018)"
)
hpd_viol_query <- str_glue(
"https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=
select violationid, inspectiondate, latitude, longitude
where inspectiondate > '{Sys.Date()-7}T00:00:00'
and latitude is not null
limit 100000")
hpd_violations <- read_csv(URLencode(hpd_viol_query)) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(2263)
hpd_violations
Simple feature collection with 3747 features and 2 fields
geometry type: POINT
dimension: XY
bbox: xmin: 950378.9 ymin: 148699.2 xmax: 1058381 ymax: 269486
CRS: EPSG:2263
# A tibble: 3,747 x 3
violationid inspectiondate geometry
<dbl> <dttm> <POINT [US_survey_foot]>
1 13714193 2020-06-29 00:00:00 (1001557 196353.2)
2 13713995 2020-06-29 00:00:00 (995169.1 177480.2)
3 13713994 2020-06-29 00:00:00 (995169.1 177480.2)
4 13713993 2020-06-29 00:00:00 (995169.1 177480.2)
5 13713992 2020-06-29 00:00:00 (995169.1 177480.2)
6 13713991 2020-06-29 00:00:00 (995169.1 177480.2)
7 13713990 2020-06-29 00:00:00 (995169.1 177480.2)
8 13713989 2020-06-29 00:00:00 (995169.1 177480.2)
9 13713988 2020-06-29 00:00:00 (995169.1 177480.2)
10 13713987 2020-06-29 00:00:00 (995169.1 177480.2)
# … with 3,737 more rows
hpd_violations_joined <- nyc_ntas_acs %>%
st_join(hpd_violations, st_intersects) %>%
select(violationid, inspectiondate, ntacode, ntaname)
hpd_violations_joined
Simple feature collection with 3813 features and 4 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS: 2263
First 10 features:
violationid inspectiondate ntacode
1 13712509 2020-07-01 BK09
2 13710989 2020-06-30 BK17
3 13710952 2020-06-30 BK17
4 13710935 2020-06-30 BK17
5 13710446 2020-06-29 BK19
6 13710430 2020-06-29 BK19
7 13710421 2020-06-29 BK19
8 13711532 2020-06-30 BK19
9 13711529 2020-06-30 BK19
10 13710744 2020-06-30 BK19
ntaname
1 Brooklyn Heights-Cobble Hill
2 Sheepshead Bay-Gerritsen Beach-Manhattan Beach
3 Sheepshead Bay-Gerritsen Beach-Manhattan Beach
4 Sheepshead Bay-Gerritsen Beach-Manhattan Beach
5 Brighton Beach
6 Brighton Beach
7 Brighton Beach
8 Brighton Beach
9 Brighton Beach
10 Brighton Beach
geometry
1 POLYGON ((986838.3 192326.7...
2 POLYGON ((997485 148503.8, ...
3 POLYGON ((997485 148503.8, ...
4 POLYGON ((997485 148503.8, ...
5 POLYGON ((997341.1 149453.7...
6 POLYGON ((997341.1 149453.7...
7 POLYGON ((997341.1 149453.7...
8 POLYGON ((997341.1 149453.7...
9 POLYGON ((997341.1 149453.7...
10 POLYGON ((997341.1 149453.7...
hpd_violations_ntas <- hpd_violations_joined %>%
group_by(ntacode, ntaname) %>%
summarise(violations = sum(!is.na(inspectiondate))) %>%
ungroup()
hpd_violations_ntas
Simple feature collection with 195 features and 3 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
CRS: 2263
# A tibble: 195 x 4
ntacode ntaname violations geometry
<chr> <chr> <int> <POLYGON [US_survey_foot]>
1 BK09 Brooklyn Heigh… 1 ((986838.3 192326.7, 986899.7 1…
2 BK17 Sheepshead Bay… 3 ((997485 148503.8, 997466.1 148…
3 BK19 Brighton Beach 31 ((997341.1 149453.7, 997466.1 1…
4 BK21 Seagate-Coney … 4 ((989467.6 148972.3, 989627.2 1…
5 BK23 West Brighton 0 ((990752.8 148875.7, 990727.1 1…
6 BK25 Homecrest 35 ((996416.9 153416.9, 996158.3 1…
7 BK26 Gravesend 10 ((992262.4 154391.8, 992520 154…
8 BK27 Bath Beach 15 ((985473.7 157957.2, 984988.9 1…
9 BK28 Bensonhurst We… 42 ((985575.9 156813.6, 985277.5 1…
10 BK29 Bensonhurst Ea… 33 ((987585.9 155159.6, 987132.5 1…
# … with 185 more rows
ggplot(hpd_violations_ntas) +
aes(fill = violations) +
geom_sf(color = "white", size = 0.05) +
scale_fill_viridis_c(labels = scales::comma) +
theme_void() +
theme(legend.position = c(0.1, .75)) +
labs(
title = "Housing Maintenance Code Violations",
subtitle = str_glue("New York City, Neighborhood Tabulation Areas, {Sys.Date()-7} to {Sys.Date()}"),
fill = NULL,
caption = "Sources: Department of Housing Preservation and Development (HPD) via Open Data"
)
sf package website
So far we’ve been working with ACS data that is already tabulated for us, but sometimes there are cuts of the data that are not available, and so we can use the microdata to calculate these statistics ourselves.
Because these microdata are from a survey, we can’t simply treat them as a regular dataframe and sum up columns: there are survey weights associated with each observation and these need to be incorporated. In some cases the weights are simple and it would be enough to use weighted means, etc. to get our estimates, but when the survey design is more complicated and/or we want margins of error on our estimates, we’ll need some other tools.
Here we’ll be working with the srvyr
package, which (in the same spirit as the sf
package) allows us to set up our data frame with some special information about the survey design and then use the existing dplyr
tools we know to work with the data.
First let’s import some ACS microdata. We’ll be using a file for NY downloaded from IPUMS, which cleans up the ACS Public Use Microdata Sample (PUMS) data to make it easier to work with. Unfortunately they don’t have a system to automate these downloads, so we need to do it manually. I’m going to borrow the file from another project, where you can read a description of the variables and how to download a copy.
# Read in IPUMS USA ACS microdata, standardize the column names
ipums_raw <- read_csv(path("data-raw", "ipums_acs-2018-1yr_ny.csv.gz")) %>% rename_all(str_to_lower)
Next, I'm going to do some basic cleaning and create some variables for analysis.
ipums_clean <- ipums_raw %>%
filter(
# Remove group quarters population
gq %in% 1:2,
# Keep only head-of-household records
pernum == 1,
# keep only NYC
countyfip %in% c(5, 47, 61, 81, 85)
) %>%
transmute(
serial,
hhwt,
countyfip,
# Household income
hh_inc_nom = case_when(
hhincome <= 0 ~ 0,
hhincome == 9999999 ~ NA_real_,
TRUE ~ hhincome
),
hh_inc_grp = cut(
hh_inc_nom, c(-Inf, 25e3, 50e3, 75e3, 100e3, 125e3, 150e3, Inf),
c("< $25k", "$25k - 50k", "$50k - $75k", "$75k - $100k", "$100k - $125k", "$125k - $150k", ">= $150k"),
include.lowest = TRUE,
ordered_result = TRUE
),
# Renter vars
is_renter = (ownershp == 2),
gross_rent_nom = if_else(is_renter, rentgrs, NA_real_),
rent_burden = gross_rent_nom / (hh_inc_nom / 12),
is_rent_burdened = (rent_burden > 0.30),
# Race/ethnicity
race_eth = case_when(
hispan %in% 1:4 ~ "Hispanic, of any race",
race == 1 ~ "Non-Hispanic white",
race == 2 ~ "Non-Hispanic Black",
race == 3 ~ "Non-Hispanic American Indian or Alaska Native",
race == 4 ~ "Non-Hispanic Asian or Pacific Islander", # Chinese
race == 5 ~ "Non-Hispanic Asian or Pacific Islander", # Japanese
race == 6 ~ "Non-Hispanic Asian or Pacific Islander", # Other Asian or Pacific Island
race == 7 ~ "Non-Hispanic other",
race == 8 ~ "Non-Hispanic Two or more major races", # Two major races
race == 9 ~ "Non-Hispanic Two or more major races" # Three or more major races
),
age
)
Now we can set up the survey design information.
library(srvyr)
ipums_svy <- as_survey_design(ipums_clean, weights = hhwt)
ipums_svy
Independent Sampling design (with replacement)
Called via srvyr
Sampling variables:
- ids: `1`
- weights: hhwt
Data variables: serial (dbl), hhwt (dbl), countyfip (dbl), hh_inc_nom
(dbl), hh_inc_grp (ord), is_renter (lgl), gross_rent_nom (dbl),
rent_burden (dbl), is_rent_burdened (lgl), race_eth (chr), age
(dbl)
ipums_inc_burden <- ipums_svy %>%
filter(
is_renter,
!is.na(hh_inc_grp),
) %>%
group_by(hh_inc_grp) %>%
summarise(
rent_burden_pct = survey_mean(is_rent_burdened, na.rm = TRUE, vartype = "ci", level = 0.95)
) %>%
ungroup()
ipums_inc_burden
# A tibble: 6 x 4
hh_inc_grp rent_burden_pct rent_burden_pct_low rent_burden_pct_u…
<ord> <dbl> <dbl> <dbl>
1 < $25k 0.852 0.839 0.865
2 $25k - 50k 0.742 0.723 0.761
3 $50k - $75k 0.453 0.428 0.478
4 $75k - $100k 0.220 0.196 0.244
5 $100k - $125k 0.114 0.0950 0.133
6 $125k - $150k 0.00489 0.000914 0.00888
ggplot(ipums_inc_burden) +
aes(x = hh_inc_grp, y = rent_burden_pct, ymin = rent_burden_pct_low, ymax = rent_burden_pct_upp) +
geom_col(fill = "goldenrod") +
geom_errorbar() +
geom_text(
aes(y = rent_burden_pct_upp, label = scales::percent(rent_burden_pct)),
vjust = -0.5, size = 3.5
) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
theme_minimal() +
labs(
title = str_glue(
"Share of Renter Households Paying More than 30% of Monthly Income on Rent,
by Annual Household Income Level"
),
subtitle = "New York City, 2018",
x = "Household income",
y = "Share of Renter Households",
caption = str_glue(
"Notes: Error bars represent 95% confidence intervals, and value labels reflect point estimates
Sources: American Community Survey (2018), IPUMS USA"
)
)
Next we'll fit a logistic regression where the dependent variable is whether the household rents their home, predicted by characteristics of the head of household.
model_1 <- glm(
is_renter ~ hh_inc_nom + factor(race_eth) + age + factor(countyfip),
data = ipums_clean,
weights = hhwt,
family = binomial(link = "logit")
)
The built-in `summary()` function gives you helpful information, but it only prints it to the screen as text, and if you capture the results you get a complicated list structure.
summary(model_1)
Call:
glm(formula = is_renter ~ hh_inc_nom + factor(race_eth) + age +
factor(countyfip), family = binomial(link = "logit"), data = ipums_clean,
weights = hhwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-53.478 -8.784 4.712 8.204 54.715
Coefficients:
Estimate
(Intercept) 4.347e+00
hh_inc_nom -5.525e-06
factor(race_eth)Non-Hispanic American Indian or Alaska Native -1.477e+00
factor(race_eth)Non-Hispanic Asian or Pacific Islander -1.215e+00
factor(race_eth)Non-Hispanic Black -5.763e-01
factor(race_eth)Non-Hispanic other -1.278e+00
factor(race_eth)Non-Hispanic Two or more major races -6.613e-01
factor(race_eth)Non-Hispanic white -1.021e+00
age -3.891e-02
factor(countyfip)47 -2.303e-01
factor(countyfip)61 6.590e-01
factor(countyfip)81 -8.373e-01
factor(countyfip)85 -1.736e+00
Std. Error
(Intercept) 6.761e-03
hh_inc_nom 1.382e-08
factor(race_eth)Non-Hispanic American Indian or Alaska Native 3.099e-02
factor(race_eth)Non-Hispanic Asian or Pacific Islander 4.900e-03
factor(race_eth)Non-Hispanic Black 4.364e-03
factor(race_eth)Non-Hispanic other 1.622e-02
factor(race_eth)Non-Hispanic Two or more major races 1.024e-02
factor(race_eth)Non-Hispanic white 4.037e-03
age 8.434e-05
factor(countyfip)47 4.692e-03
factor(countyfip)61 5.372e-03
factor(countyfip)81 4.733e-03
factor(countyfip)85 7.038e-03
z value
(Intercept) 642.97
hh_inc_nom -399.78
factor(race_eth)Non-Hispanic American Indian or Alaska Native -47.66
factor(race_eth)Non-Hispanic Asian or Pacific Islander -248.08
factor(race_eth)Non-Hispanic Black -132.04
factor(race_eth)Non-Hispanic other -78.80
factor(race_eth)Non-Hispanic Two or more major races -64.56
factor(race_eth)Non-Hispanic white -253.01
age -461.37
factor(countyfip)47 -49.09
factor(countyfip)61 122.69
factor(countyfip)81 -176.92
factor(countyfip)85 -246.68
Pr(>|z|)
(Intercept) <2e-16
hh_inc_nom <2e-16
factor(race_eth)Non-Hispanic American Indian or Alaska Native <2e-16
factor(race_eth)Non-Hispanic Asian or Pacific Islander <2e-16
factor(race_eth)Non-Hispanic Black <2e-16
factor(race_eth)Non-Hispanic other <2e-16
factor(race_eth)Non-Hispanic Two or more major races <2e-16
factor(race_eth)Non-Hispanic white <2e-16
age <2e-16
factor(countyfip)47 <2e-16
factor(countyfip)61 <2e-16
factor(countyfip)81 <2e-16
factor(countyfip)85 <2e-16
(Intercept) ***
hh_inc_nom ***
factor(race_eth)Non-Hispanic American Indian or Alaska Native ***
factor(race_eth)Non-Hispanic Asian or Pacific Islander ***
factor(race_eth)Non-Hispanic Black ***
factor(race_eth)Non-Hispanic other ***
factor(race_eth)Non-Hispanic Two or more major races ***
factor(race_eth)Non-Hispanic white ***
age ***
factor(countyfip)47 ***
factor(countyfip)61 ***
factor(countyfip)81 ***
factor(countyfip)85 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4027428 on 26582 degrees of freedom
Residual deviance: 3263248 on 26570 degrees of freedom
AIC: 3263274
Number of Fisher Scoring iterations: 9
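For example, to pull a single number out of that captured object you need to know where it lives inside the list (a rough sketch):

```r
# Capture the summary object rather than just printing it
model_1_summary <- summary(model_1)

# str() shows how many pieces the list contains
str(model_1_summary, max.level = 1)

# The coefficient table is a matrix buried inside it
model_1_summary$coefficients["age", "Estimate"]
```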
The `broom` package provides some helpful functions that extract the relevant information from the model object and return it as a dataframe, so that you can use the data manipulation and graphing tools you are already familiar with.
library(broom)
There are three functions that extract different levels of information from the model.

`tidy()` gives you a dataframe where each row contains information about one regression coefficient.
tidy(model_1)
# A tibble: 13 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.35e+0 6.76e-3 643. 0
2 hh_inc_nom -5.53e-6 1.38e-8 -400. 0
3 factor(race_eth)Non-Hispani… -1.48e+0 3.10e-2 -47.7 0
4 factor(race_eth)Non-Hispani… -1.22e+0 4.90e-3 -248. 0
5 factor(race_eth)Non-Hispani… -5.76e-1 4.36e-3 -132. 0
6 factor(race_eth)Non-Hispani… -1.28e+0 1.62e-2 -78.8 0
7 factor(race_eth)Non-Hispani… -6.61e-1 1.02e-2 -64.6 0
8 factor(race_eth)Non-Hispani… -1.02e+0 4.04e-3 -253. 0
9 age -3.89e-2 8.43e-5 -461. 0
10 factor(countyfip)47 -2.30e-1 4.69e-3 -49.1 0
# … with 3 more rows
`glance()` returns a dataframe with exactly one row of goodness-of-fit measures and related statistics.
glance(model_1)
# A tibble: 1 x 7
null.deviance df.null logLik AIC BIC deviance df.residual
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 4027428. 26582 -1631624. 3.26e6 3.26e6 3263248. 26570
`augment()` adds columns to a dataset, containing information such as fitted values, residuals, or cluster assignments. All columns added to the dataset have a `.` prefix to prevent existing columns from being overwritten.
augment(model_1, data = ipums_clean) %>%
# Select just a few columns to see what's added
select(serial, starts_with("."))
# A tibble: 26,583 x 8
serial .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 838281 2.21 0.00444 -21.0 0.000166 11.1 0.0111 -21.0
2 838284 -0.356 0.00417 -13.9 0.000763 11.1 0.00745 -13.9
3 838287 -0.396 0.00327 -7.99 0.000159 11.1 0.000512 -7.99
4 838288 1.95 0.00440 10.7 0.000909 11.1 0.00429 10.7
5 838294 0.345 0.00372 14.7 0.000676 11.1 0.00741 14.7
6 838298 0.675 0.00424 6.79 0.000225 11.1 0.000493 6.79
7 838306 1.36 0.0102 7.37 0.00202 11.1 0.00475 7.38
8 838308 2.02 0.00458 -15.1 0.000115 11.1 0.00353 -15.1
9 838310 1.53 0.00357 4.88 0.000113 11.1 0.000115 4.88
10 838318 1.28 0.00335 -21.0 0.000278 11.1 0.0111 -21.0
# … with 26,573 more rows
The `gtsummary` package builds on `broom` and another package for making tables called `gt`, and provides helpful tools for quickly creating nicely formatted tables of results that can be highly customized.
library(gtsummary)
Before we even get to the regression model, it can be helpful to create some presentation-ready summary tables about our two groups, renters and owners. `tbl_summary()` can help with this (for more details see this tutorial). By default it will give you a very helpful table, choosing reasonable defaults for the statistics based on the type of each variable (numeric, character/factor, etc.), but you can specify exactly which statistics to show and how they are formatted (a quick example of that customization appears after the table below). You can also use `add_p()` to automatically add a column of p-values.
NOTE: the `gtsummary` package does not support data with survey weights, so for this example of how it works we'll be ignoring the survey weights. If your data has weights, you'll need to use the `srvyr` tools above to create your own summary tables, and then you can use `gt` on its own to create tables for those results; a small sketch of that workflow follows this note.
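A minimal sketch of that weighted workflow might look like the following (this assumes the `gt` package is installed; `renter_share_se` is just the standard-error column that `srvyr` adds by default, and depending on your version of `gt` the column selection syntax may differ slightly).

```r
library(gt)

# Weighted renter share by county, summarised with srvyr...
ipums_svy %>%
  group_by(countyfip) %>%
  summarise(renter_share = survey_mean(is_renter, na.rm = TRUE)) %>%
  # ...then formatted as a presentation table with gt
  gt() %>%
  fmt_percent(columns = c(renter_share, renter_share_se), decimals = 1)
```

With that caveat in mind, here's the unweighted example: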
ipums_clean %>%
select(is_renter, hh_inc_nom, age, race_eth, countyfip) %>%
tbl_summary(
by = is_renter,
label = list(
hh_inc_nom ~ "Household Income",
age ~ "Age, years",
countyfip ~ "County",
race_eth ~ "Race/Ethnicity"
)
) %>%
add_p()
| Characteristic | FALSE, N = 10,619¹ | TRUE, N = 15,964¹ | p-value² |
|---|---|---|---|
| Household Income | 100000 (50925, 170000) | 52000 (21500, 103825) | <0.001 |
| Age, years | 58 (46, 69) | 46 (33, 61) | <0.001 |
| Race/Ethnicity | | | <0.001 |
| Hispanic, of any race | 1225 (12%) | 4461 (28%) | |
| Non-Hispanic American Indian or Alaska Native | 13 (0.1%) | 12 (<0.1%) | |
| Non-Hispanic Asian or Pacific Islander | 1959 (18%) | 1876 (12%) | |
| Non-Hispanic Black | 1879 (18%) | 3277 (21%) | |
| Non-Hispanic other | 91 (0.9%) | 83 (0.5%) | |
| Non-Hispanic Two or more major races | 188 (1.8%) | 361 (2.3%) | |
| Non-Hispanic white | 5264 (50%) | 5894 (37%) | |
| County | | | <0.001 |
| 5 | 909 (8.6%) | 2840 (18%) | |
| 47 | 3457 (33%) | 5708 (36%) | |
| 61 | 1168 (11%) | 3413 (21%) | |
| 81 | 4032 (38%) | 3625 (23%) | |
| 85 | 1053 (9.9%) | 378 (2.4%) | |

¹ Statistics presented: median (IQR); n (%)

² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence
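As a quick sketch of that kind of customization (the `statistic` argument and the `all_continuous()`/`all_categorical()` helpers are part of `gtsummary`; the particular formats chosen here are just an example), we could show means and standard deviations instead of the default medians:

```r
ipums_clean %>%
  select(is_renter, hh_inc_nom, age) %>%
  tbl_summary(
    by = is_renter,
    # Show mean (SD) for continuous variables instead of the default median (IQR)
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    label = list(
      hh_inc_nom ~ "Household Income",
      age ~ "Age, years"
    )
  )
```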
Now we'll move back to our logit model and use the `tbl_regression()` function. Again, it detects the type of model you have and uses some reasonable defaults to display your results in a helpful format. For example, for a logit model like ours it will present odds ratios (when we set `exponentiate = TRUE`, as below).
tbl_regression(
model_1,
exponentiate = TRUE,
label = list(
"hh_inc_nom" ~ "Household Income",
"age" ~ "Age, years",
"factor(countyfip)" ~ "County",
"factor(race_eth)" ~ "Race/Ethnicity"
)
)
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| Household Income | 1.00 | 1.00, 1.00 | <0.001 |
| Race/Ethnicity | | | |
| Hispanic, of any race | — | — | |
| Non-Hispanic American Indian or Alaska Native | 0.23 | 0.21, 0.24 | <0.001 |
| Non-Hispanic Asian or Pacific Islander | 0.30 | 0.29, 0.30 | <0.001 |
| Non-Hispanic Black | 0.56 | 0.56, 0.57 | <0.001 |
| Non-Hispanic other | 0.28 | 0.27, 0.29 | <0.001 |
| Non-Hispanic Two or more major races | 0.52 | 0.51, 0.53 | <0.001 |
| Non-Hispanic white | 0.36 | 0.36, 0.36 | <0.001 |
| Age, years | 0.96 | 0.96, 0.96 | <0.001 |
| County | | | |
| 5 | — | — | |
| 47 | 0.79 | 0.79, 0.80 | <0.001 |
| 61 | 1.93 | 1.91, 1.95 | <0.001 |
| 81 | 0.43 | 0.43, 0.44 | <0.001 |
| 85 | 0.18 | 0.17, 0.18 | <0.001 |

¹ OR = Odds Ratio, CI = Confidence Interval
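Those odds ratios are just the exponentiated model coefficients. For example, the raw coefficient for non-Hispanic white heads of household was about -1.02, and exponentiating it reproduces the 0.36 shown in the table:

```r
# Exponentiate the raw logit coefficient to get the odds ratio
exp(coef(model_1)[["factor(race_eth)Non-Hispanic white"]])
# approximately 0.36, matching the OR column above
```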
If instead we fit a basic OLS model, it will present the results differently.
model_2 <- lm(
hh_inc_nom ~ is_renter + factor(race_eth) + age + factor(countyfip),
data = ipums_clean,
weights = hhwt
)
tbl_regression(
model_2,
label = list(
"is_renter" ~ "Renter",
"age" ~ "Age, years",
"factor(countyfip)" ~ "County",
"factor(race_eth)" ~ "Race/Ethnicity"
)
)
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| Renter | -75906 | -79273, -72539 | <0.001 |
| Race/Ethnicity | | | |
| Hispanic, of any race | — | — | |
| Non-Hispanic American Indian or Alaska Native | -28955 | -65838, 7929 | 0.12 |
| Non-Hispanic Asian or Pacific Islander | 14281 | 9039, 19523 | <0.001 |
| Non-Hispanic Black | 3807 | -539, 8154 | 0.086 |
| Non-Hispanic other | 1089 | -18081, 20259 | >0.9 |
| Non-Hispanic Two or more major races | 23812 | 13171, 34452 | <0.001 |
| Non-Hispanic white | 50673 | 46634, 54712 | <0.001 |
| Age, years | -1235 | -1322, -1148 | <0.001 |
| County | | | |
| 5 | — | — | |
| 47 | 8049 | 3359, 12739 | <0.001 |
| 61 | 65554 | 60546, 70561 | <0.001 |
| 81 | -714 | -5630, 4202 | 0.8 |
| 85 | -17880 | -25527, -10234 | <0.001 |

¹ CI = Confidence Interval