US Public Transit System Data Analysis
Introduction
When I was in Japan, I rode some of the most efficient transit systems in the world, characterized by their punctuality, their service, and their large numbers of passengers. Here in the United States, public transit is a whole different story. By the end of this analysis, I want to figure out which transit system in the country is the most efficient. How do I define efficiency? To me, efficiency can be broken down into ridership (trips, miles, the functionality) and financial efficiency (revenue, expenses). In this project, I will be investigating the usage and financial statistics of US transit systems.
Data Sets
The National Transit Database (NTD) records the financial and operating data of transit systems to keep track of the industry and provide public information and statistics. The data is collected by transit agencies, submitted annually to the Federal Transit Administration (FTA), and reviewed by the FTA. The most recent complete information available at the moment is for 2022. Let’s dive into it by exploring the following data sets from the FTA:
2022 Fare Revenue
2022 Expenses
Ridership
I will be extracting data on fares, expenses, and ridership from these data sets.
Fare Revenue Data
Expenses Data
Financials Data
We can inner join the Revenue and Expenses data into a more comprehensive Financials data set to use in our analysis.
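As a rough illustration of that join, here is a minimal sketch on made-up data. The table names FARES_TOY and EXPENSES_TOY, the values, and the choice of NTD ID plus Mode as join keys are assumptions for illustration, not the actual FTA files or the exact code used in this project.

```r
library(dplyr)

# Toy stand-ins for the fare revenue and expenses tables (values are made up).
FARES_TOY <- tibble(
  `NTD ID` = c(1, 1, 2),
  Mode = c("MB", "HR", "MB"),
  `Total Fares` = c(100, 250, 40)
)
EXPENSES_TOY <- tibble(
  `NTD ID` = c(1, 1, 3),
  Mode = c("MB", "HR", "MB"),
  Expenses = c(400, 500, 90)
)

# inner_join() keeps only the agency/mode pairs that appear in both tables,
# so agency 2 (no expenses) and agency 3 (no fares) are dropped.
FINANCIALS_TOY <- inner_join(FARES_TOY, EXPENSES_TOY, by = c("NTD ID", "Mode"))
FINANCIALS_TOY
```

An inner join is the conservative choice here: any system missing either fares or expenses cannot be given a financial ratio anyway.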
Here’s a sample of the Financials Data.
Trips Data
From the ridership data, I will be extracting information on unlinked passenger trips (UPT) taken on public transportation.
Miles Data
Also from the ridership data, I will be extracting information on the vehicle revenue miles.
Usage Data
We can inner join the Trips and Miles data together using NTD ID into a data set on the usage.
Attributes
Some of the attribute names are unclear, so let’s use the FTA Glossary to interpret the data.
Renaming
renaming UZA Name to metro_area
replacing the modes with their respective full names
renaming UPT to unlinked_passenger_trips
renaming VRM to vehicle_revenue_miles
VRM (vehicle revenue miles): The miles that vehicles travel while in revenue service.
UPT (unlinked passenger trips): The number of passengers who board public transportation vehicles. Passengers are counted each time they board vehicles no matter how many vehicles they use to travel from their origin to their destination.
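The renaming steps above could be sketched as follows. The toy table USAGE_RAW_TOY and its values are hypothetical, and only two mode codes are mapped here ("HR" to Heavy Rail and "MB" to Bus); the real data has many more modes.

```r
library(dplyr)

# Toy usage table with the original NTD-style column names (values are made up).
USAGE_RAW_TOY <- tibble(
  `UZA Name` = c("New York--Jersey City--Newark, NY--NJ", "San Francisco--Oakland, CA"),
  Mode = c("HR", "MB"),
  UPT = c(1000, 200),
  VRM = c(500, 300)
)

USAGE_TOY_RENAMED <- USAGE_RAW_TOY |>
  # rename(new_name = old_name) keeps the column order unchanged
  rename(
    metro_area = `UZA Name`,
    unlinked_passenger_trips = UPT,
    vehicle_revenue_miles = VRM
  ) |>
  # replace mode codes with full names; unmapped codes are flagged as "Unknown"
  mutate(Mode = case_when(
    Mode == "HR" ~ "Heavy Rail",
    Mode == "MB" ~ "Bus",
    TRUE ~ "Unknown"
  ))
USAGE_TOY_RENAMED
```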
Now, here’s a sample of the processed Usage Data.
Join Usage and Financial Data
Next, we can join the Usage and Financial Data sets.
To do so, we first need the Usage data for 2022 so that it matches the 2022 Financials data.
Since we are joining on Mode, we need to convert the modes of the Financials data as well.
After that, we can LEFT JOIN the two data sets.
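One way these steps might look is sketched below on made-up data. The table names, the values, and the grouping and join keys (NTD ID, Agency, Mode) are assumptions for illustration; the financials table is assumed to already carry full mode names.

```r
library(dplyr)

# Toy monthly usage data (values are made up).
USAGE_MONTHLY_TOY <- tibble(
  `NTD ID` = c(1, 1, 1, 2),
  Agency = c("Agency A", "Agency A", "Agency A", "Agency B"),
  Mode = c("Bus", "Bus", "Bus", "Heavy Rail"),
  month = as.Date(c("2022-01-01", "2022-02-01", "2023-01-01", "2022-01-01")),
  unlinked_passenger_trips = c(10, 20, 99, 5),
  vehicle_revenue_miles = c(100, 200, 999, 50)
)
# Toy 2022 financials, with modes already converted to full names.
FINANCIALS_2022_TOY <- tibble(
  `NTD ID` = 1,
  Mode = "Bus",
  `Total Fares` = 60,
  Expenses = 240
)

# Keep only 2022 records and total them per agency and mode.
USAGE_2022_TOY <- USAGE_MONTHLY_TOY |>
  filter(format(month, "%Y") == "2022") |>
  group_by(`NTD ID`, Agency, Mode) |>
  summarise(
    UPT = sum(unlinked_passenger_trips),
    VRM = sum(vehicle_revenue_miles),
    .groups = "drop"
  )

# left_join() keeps every usage row; systems without financials get NA.
JOINED_TOY <- left_join(USAGE_2022_TOY, FINANCIALS_2022_TOY, by = c("NTD ID", "Mode"))
JOINED_TOY
```

With a left join, systems that report usage but no financials stay in the table with missing values, so they need to be handled when computing ratios.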
Let’s take a look at the Usage and Financials data.
Project Outcomes
I used summary statistics to explore the data sets processed above and extract insights that shed light on the efficiency of US public transit systems.
Libraries: tidyverse, dplyr
Let’s see what the data can tell us about public transit in the US looking at transit Usage and Financial data.
Vehicle Revenue Miles
What transit agency had the most total VRM in this sample?
MTA New York City Transit with 10832855350 total miles
library(dplyr)
# total VRM per agency, largest first
USAGE |>
  group_by(Agency) |>
  summarise(total_VRM = sum(vehicle_revenue_miles)) |>
  arrange(desc(total_VRM)) |>
  slice(1)
# A tibble: 1 × 2
Agency total_VRM
<chr> <dbl>
1 MTA New York City Transit 10832855350
What transit mode had the most total VRM in this sample?
The Bus at 49444494088 total miles
# total VRM per mode, largest first
USAGE |>
  group_by(Mode) |>
  summarise(total_VRM = sum(vehicle_revenue_miles)) |>
  arrange(desc(total_VRM)) |>
  slice(1)
# A tibble: 1 × 2
Mode total_VRM
<chr> <dbl>
1 Bus 49444494088
What mode of transport had the longest average trip in May 2024?
Heavy Rail did, with an average of 2654864 vehicle revenue miles.
# average VRM per mode for May 2024
USAGE |>
  filter(month == "2024-05-01") |>
  group_by(Mode) |>
  summarise(average_VRM = mean(vehicle_revenue_miles)) |>
  arrange(desc(average_VRM)) |>
  slice(1)
# A tibble: 1 × 2
Mode average_VRM
<chr> <dbl>
1 Heavy Rail 2654864.
Unlinked Passenger Trips
How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?
A total of 237383777 trips were taken.
TRIPS |>
  filter(Mode == "HR", month == "2024-05-01") |>
  summarise(total_trips = sum(UPT))
# A tibble: 1 × 1
total_trips
<dbl>
1 237383777
How much did NYC subway ridership fall between April 2019 and April 2020?
Ridership fell by 296864650 between April 2019 and April 2020.
# total Heavy Rail riders in April 2020
april_2020 <- USAGE |>
  filter(Mode == "Heavy Rail") |>
  filter(month == "2020-04-01") |>
  summarise(total_riders = sum(unlinked_passenger_trips))
# total Heavy Rail riders in April 2019
april_2019 <- USAGE |>
  filter(Mode == "Heavy Rail") |>
  filter(month == "2019-04-01") |>
  summarise(total_riders = sum(unlinked_passenger_trips))
difference <- abs(april_2020 - april_2019)
print(difference)
total_riders
1 296864650
Which transit system (agency and mode) had the most UPT in 2022?
The MTA New York City Transit Heavy Rail had the most UPT in 2022 at 1793073801.
USAGE_AND_FINANCIALS |>
  select(Agency, Mode, UPT) |>
  arrange(desc(UPT)) |>
  slice(1)
# A tibble: 1 × 3
Agency Mode UPT
<chr> <chr> <dbl>
1 MTA New York City Transit Heavy Rail 1793073801
Three more interesting transit facts
Which month had the highest number of average trips between 2002 and 2024?
October, with an average of about 768206 trips.
# month() comes from lubridate, which is loaded with the tidyverse
USAGE <- USAGE |> mutate(month_number = month(month))
USAGE |>
  group_by(month_number) |>
  summarise(avg_UPT = mean(unlinked_passenger_trips)) |>
  arrange(desc(avg_UPT)) |>
  slice(1)
# A tibble: 1 × 2
month_number avg_UPT
<dbl> <dbl>
1 10 768206.
USAGE <- USAGE |> select(-month_number)
Which metro area had the most unlinked passenger trips?
The New York, Jersey City, and Newark area has the greatest total UPT.
USAGE |>
  group_by(metro_area) |>
  summarise(total_UPT = sum(unlinked_passenger_trips)) |>
  arrange(desc(total_UPT)) |>
  slice(1)
# A tibble: 1 × 2
metro_area total_UPT
<chr> <dbl>
1 New York--Jersey City--Newark, NY--NJ 84020935224
Which metro area offers the most modes of transit?
San Francisco–Oakland, CA with 13 Modes
USAGE |>
  group_by(metro_area) |>
  summarise(mode_count = n_distinct(Mode)) |>
  arrange(desc(mode_count)) |>
  slice(1)
# A tibble: 1 × 2
metro_area mode_count
<chr> <int>
1 San Francisco--Oakland, CA 13
distinct(USAGE |> select(Mode, metro_area) |> filter(metro_area == "San Francisco--Oakland, CA"))
# A tibble: 13 × 2
Mode metro_area
<chr> <chr>
1 Heavy Rail San Francisco--Oakland, CA
2 Monorail San Francisco--Oakland, CA
3 Demand Response San Francisco--Oakland, CA
4 Bus San Francisco--Oakland, CA
5 Commuter Bus San Francisco--Oakland, CA
6 Bus Rapid Transit San Francisco--Oakland, CA
7 Cable Car San Francisco--Oakland, CA
8 Light Rail San Francisco--Oakland, CA
9 Streetcar San Francisco--Oakland, CA
10 Trolleybus San Francisco--Oakland, CA
11 Ferryboat San Francisco--Oakland, CA
12 Vanpool San Francisco--Oakland, CA
13 Commuter Rail San Francisco--Oakland, CA
Financial Efficiency
Farebox Recovery Among Major Systems
Farebox recovery is defined as the ratio of Total Fares to Expenses and can be used to measure efficiency. A value above 1 means fares more than cover expenses.
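As a quick worked example of the ratio, the sketch below plugs in the fare and expense figures reported for the top-ranked system in this section (97300 in fares against 40801 in expenses). The variable names are only illustrative.

```r
# Figures taken from the top-ranked row reported in this section.
total_fares <- 97300
expenses <- 40801

# Farebox recovery = Total Fares / Expenses
farebox_recovery <- total_fares / expenses
round(farebox_recovery, 2)
```

Here the ratio is above 1, meaning that fare revenue exceeded the reported expenses for that system.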
Which transit system (agency and mode) had the highest farebox recovery?
Transit Authority of Central Kentucky Vanpool has the highest farebox recovery at 2.38.
USAGE_AND_FINANCIALS |>
  select(Agency, Mode, `Total Fares`, Expenses) |>
  mutate(farebox_recovery = `Total Fares` / Expenses) |>
  arrange(desc(farebox_recovery)) |>
  slice(1)
# A tibble: 1 × 5
Agency Mode `Total Fares` Expenses farebox_recovery
<chr> <chr> <dbl> <dbl> <dbl>
1 Transit Authority of Central Ke… Vanp… 97300 40801 2.38
Which transit system (agency and mode) has the lowest expenses per UPT?
San Francisco Bay Area Rapid Transit District Heavy Rail has the lowest expenses per UPT at 0.396.
USAGE_AND_FINANCIALS |>
  select(Agency, Mode, UPT, Expenses) |>
  mutate(Expenses_per_UPT = Expenses / UPT) |>
  arrange(Expenses_per_UPT) |>
  slice(1)
# A tibble: 1 × 5
Agency Mode UPT Expenses Expenses_per_UPT
<chr> <chr> <dbl> <dbl> <dbl>
1 San Francisco Bay Area Rapid Transit D… Heav… 4.53e7 17965407 0.396
Which transit system (agency and mode) has the highest total fares per UPT?
The highest total fares per UPT belongs to Altoona Metro Transit’s Demand Response at 656 per UPT.
USAGE_AND_FINANCIALS |>
  select(Agency, Mode, `Total Fares`, UPT) |>
  mutate(Total_Fares_per_UPT = `Total Fares` / UPT) |>
  arrange(desc(Total_Fares_per_UPT)) |>
  slice(1)
# A tibble: 1 × 5
Agency Mode `Total Fares` UPT Total_Fares_per_UPT
<chr> <chr> <dbl> <dbl> <dbl>
1 Altoona Metro Transit Demand Response 17058 26 656.
Which transit system (agency and mode) has the lowest expenses per VRM?
San Francisco Bay Area Rapid Transit District’s Heavy Rail at 0.217 per VRM.
USAGE_AND_FINANCIALS |>
  select(Agency, Mode, Expenses, VRM) |>
  mutate(Expense_VRM = Expenses / VRM) |>
  arrange(Expense_VRM) |>
  slice(1)
# A tibble: 1 × 5
Agency Mode Expenses VRM Expense_VRM
<chr> <chr> <dbl> <dbl> <dbl>
1 San Francisco Bay Area Rapid Transit Distri… Heav… 17965407 8.27e7 0.217
Which transit system (agency and mode) has the highest total fares per VRM?
Chicago Water Taxi (Wendella)’s Ferryboat at 237 total fares per VRM
USAGE_AND_FINANCIALS |>
  select(Agency, Mode, `Total Fares`, VRM) |>
  mutate(Fares_VRM = `Total Fares` / VRM) |>
  arrange(desc(Fares_VRM)) |>
  slice(1)
# A tibble: 1 × 5
Agency Mode `Total Fares` VRM Fares_VRM
<chr> <chr> <dbl> <dbl> <dbl>
1 Chicago Water Taxi (Wendella) Ferryboat 142473 600 237.
Conclusion
In terms of ridership, MTA New York City Transit takes the win, with the most Vehicle Revenue Miles in the sample and the most Unlinked Passenger Trips in 2022. Ridership in the New York, Jersey City, and Newark area is the highest overall, and the transit systems there are some of the most utilized in the US. Financially, San Francisco’s BART Heavy Rail/Subway comes out on top, with both the lowest expenses per VRM and the lowest expenses per UPT. The San Francisco–Oakland, CA area also offers the most modes of transit. When it comes to usage, the MTA is the transit system that shines; once finances are added to the picture, BART seems to be the most cost-effective transit system.