US Public Transit System Data Analysis

Author

Cindy Li

Introduction

When I was in Japan, I got to ride some of the most efficient transit systems in the world, characterized by its punctuality, its service, and the large numbers of passengers. Here in the United States, public transit is a whole different story. At the end of this analysis, I want to figure out what the most efficient transit system in the country is. How do I define efficiency? To me, efficiency can be broken down into ridership (trips, miles, the functionality) and financial efficiency (revenue, expenses). In this project, I will be investigating the usage and financial statistics of the US transit systems.

Data Sets

The National Transit Database (NTD) records the financial and operations of transit systems to keep track of the industry and provide public information and statistics. The data is collected by transit agencies and submitted to the Federal Administration (FTA) annually and reviewed by the FTA. The most recent and complete information available at the moment is for 2022. Let’s dive into it by exploring the following data sets from the FTA:

2022 Fare Revenue
2022 Expenses
Ridership

I will be extracting data on fares, expenses,

Fare Revenue Data

Expenses Data

Financials Data

We can inner join the Revenue and Expenses data into a more comprehensiive financials data set that we can do our analysis with.

Here’s a sample of the Financials Data.

Trips Data

From the ridership data, I will be extracting information on public transportation trips taken by unlinked passengers.

Miles Data

Also from the ridership data, I will be extracting information on the vehicle revenue miles.

Usage Data

We can inner join the Trips and Miles data together using NTD ID into a data set on the usage.

Attributes

Some of the attribute namings are unclear so let’s use the FTA Glossary to interpret the data.

Renaming

renaming UZA Name to metro_area
replacing the modes with their respective full names
renaming UPT to unlinked_passenger_trips
renaming VRM to vehicle_revenue_miles

VRM (vehicle revenue miles): The miles that vehicles travel while in revenue service.

UPT (unlinked passenger trips): The number of passengers who board public transportation vehicles. Passengers are counted each time they board vehicles no matter how many vehicles they use to travel from their origin to their destination.

Now,, here’s a sample of the processed Usage Data.

Join Usage and Financial Data

Next, we can join the Usage and Financial Data sets.

In order to do so, we need to get the Usage for 2022 in order to match the 2022 Financial data.

Since we are joining on Mode, we need to convert the modes of the Financials data as well.

After that, we can LEFT JOIN the two data sets.

Let’s take a look at the Usage and Financials data.

Project Outcomes

I used summary statistics to explore the data sets processed above to extract insights that can shed light on efficiency of the US Public Transit Systems.

Libraries: tidyverse, dplyr

Let’s see what the data can tell us about public transit in the US looking at transit Usage and Financial data.

Vehicle Revenue Miles

What transit agency had the most total VRM in this sample?

MTA New York City Transit with 10832855350 total miles

library(dplyr)
USAGE |> 
  group_by(Agency) |>
  summarise(total_VRM = sum(vehicle_revenue_miles)) |>
  arrange(desc(total_VRM)) |>
  slice(1)

# A tibble: 1 × 2
  Agency                      total_VRM
  <chr>                           <dbl>
1 MTA New York City Transit 10832855350

What transit mode had the most total VRM in this sample?

The Bus at 49444494088 total miles

USAGE |>
  group_by(Mode) |>
  summarise(total_VRM = sum(vehicle_revenue_miles)) |>
  arrange(desc(total_VRM)) |>
  slice(1)

# A tibble: 1 × 2
  Mode    total_VRM
  <chr>       <dbl>
1 Bus   49444494088

What mode of transport had the longest average trip in May 2024?

The Heavy Rail did with 2654864 average miles

USAGE |> 
  filter(month == "2024-05-01") |> 
  group_by(Mode) |> 
  summarise(average_VRM = mean(vehicle_revenue_miles)) |> 
  arrange(desc(average_VRM)) |>
  slice(1)

# A tibble: 1 × 2
  Mode       average_VRM
  <chr>            <dbl>
1 Heavy Rail    2654864.

Unlinked Passenger Trips

How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?

A total of 237383777 trips were taken.

TRIPS |> 
  filter(Mode == "HR", month == "2024-05-01") |> 
  summarise(total_trips = sum(UPT))

# A tibble: 1 × 1
  total_trips
        <dbl>
1   237383777

How much did NYC subway ridership fall between April 2019 and April 2020?

Ridership fell by 296864650 between April 2018 and April 2020.

april_2020 <- USAGE |> 
  filter(Mode == "Heavy Rail") |> 
  filter(month == "2020-04-01") |> 
  summarise(total_riders = sum(unlinked_passenger_trips))

april_2019 <- USAGE |> 
  filter(Mode == "Heavy Rail") |> 
  filter(month == "2019-04-01") |> 
  summarise(total_riders = sum(unlinked_passenger_trips))

difference = abs(april_2020 - april_2019)
print(difference)

  total_riders
1    296864650

Which transit system (agency and mode) had the most UPT in 2022?

The MTA New York City Transit Heavy Rail had the most UPT in 2022 at 1793073801.

USAGE_AND_FINANCIALS |> select(Agency, Mode, UPT) |> 
  arrange(desc(UPT))|>
  slice(1)

# A tibble: 1 × 3
  Agency                    Mode              UPT
  <chr>                     <chr>           <dbl>
1 MTA New York City Transit Heavy Rail 1793073801

Three more interesting transit facts

Which month had the highest number of average trips between 2002 and 2024?

October with an average of 768205 trips.

USAGE <- USAGE |> mutate(month_number = month(month))
USAGE |> group_by(month_number) |> summarise(avg_UPT = mean(unlinked_passenger_trips)) |> arrange(desc(avg_UPT)) |>
  slice(1)

# A tibble: 1 × 2
  month_number avg_UPT
         <dbl>   <dbl>
1           10 768206.

USAGE <- USAGE |> select(-month_number)

Which metro area had the most unlinked passenger trips?

The New York, Jersey City, and Newark area has the greatest total UPT.

USAGE |> group_by(metro_area) |> 
  summarise(total_UPT = sum(unlinked_passenger_trips)) |> 
  arrange(desc(total_UPT)) |> 
  slice(1)

# A tibble: 1 × 2
  metro_area                              total_UPT
  <chr>                                       <dbl>
1 New York--Jersey City--Newark, NY--NJ 84020935224

Which metro area offers the most modes of transit?

San Francisco–Oakland, CA with 13 Modes

USAGE |> group_by(metro_area) |> 
  summarise(mode_count = n_distinct(Mode)) |> 
  arrange(desc(mode_count)) |> 
  slice(1)

# A tibble: 1 × 2
  metro_area                 mode_count
  <chr>                           <int>
1 San Francisco--Oakland, CA         13

distinct(USAGE |> select(Mode, metro_area) |> filter(metro_area == "San Francisco--Oakland, CA"))

# A tibble: 13 × 2
   Mode              metro_area                
   <chr>             <chr>                     
 1 Heavy Rail        San Francisco--Oakland, CA
 2 Monorail          San Francisco--Oakland, CA
 3 Demand Response   San Francisco--Oakland, CA
 4 Bus               San Francisco--Oakland, CA
 5 Commuter Bus      San Francisco--Oakland, CA
 6 Bus Rapid Transit San Francisco--Oakland, CA
 7 Cable Car         San Francisco--Oakland, CA
 8 Light Rail        San Francisco--Oakland, CA
 9 Streetcar         San Francisco--Oakland, CA
10 Trolleybus        San Francisco--Oakland, CA
11 Ferryboat         San Francisco--Oakland, CA
12 Vanpool           San Francisco--Oakland, CA
13 Commuter Rail     San Francisco--Oakland, CA

Financial Efficiency

Farebox Recovery Among Major Systems

Farebox recovery is defined as the highest ratio of Total Fares to Expenses and can be used to measure efficiency.

Which transit system (agency and mode) had the highest farebox recovery?

Transit Authority of Central Kentucky Vanpool has the highest farebox recovery at 2.38.

USAGE_AND_FINANCIALS |> select(Agency, Mode, `Total Fares`, Expenses) |> 
  mutate(farebox_recovery = `Total Fares`/Expenses) |> 
  arrange(desc(farebox_recovery))|>
  slice(1)

# A tibble: 1 × 5
  Agency                           Mode  `Total Fares` Expenses farebox_recovery
  <chr>                            <chr>         <dbl>    <dbl>            <dbl>
1 Transit Authority of Central Ke… Vanp…         97300    40801             2.38

Which transit system (agency and mode) has the lowest expenses per UPT?

San Francisco Bay Area Rapid Transit District Heavy Rail has the lowest expenses per UPT at 0.396.

USAGE_AND_FINANCIALS |> select(Agency, Mode, UPT, Expenses) |> 
  mutate(Expenses_per_UPT = Expenses/UPT) |> 
  arrange(Expenses_per_UPT)|>
  slice(1)

# A tibble: 1 × 5
  Agency                                  Mode     UPT Expenses Expenses_per_UPT
  <chr>                                   <chr>  <dbl>    <dbl>            <dbl>
1 San Francisco Bay Area Rapid Transit D… Heav… 4.53e7 17965407            0.396

Which transit system (agency and mode) has the highest total fares per UPT?

The highest total fares per UPT belongs to Altoona Metro Transit’s Demand Response at 656 per UPT.

USAGE_AND_FINANCIALS |> select(Agency, Mode, `Total Fares`, UPT) |> 
  mutate(Total_Fares_per_UPT = `Total Fares`/UPT) |> 
  arrange(desc(Total_Fares_per_UPT))|>
  slice(1)

# A tibble: 1 × 5
  Agency                Mode            `Total Fares`   UPT Total_Fares_per_UPT
  <chr>                 <chr>                   <dbl> <dbl>               <dbl>
1 Altoona Metro Transit Demand Response         17058    26                656.

Which transit system (agency and mode) has the lowest expenses per VRM?

San Francisco Bay Area Rapid Transit District’s Heavy Rail at 0.217 per VRM.

USAGE_AND_FINANCIALS |> select(Agency, Mode, Expenses, VRM) |> 
  mutate(Expense_VRM = Expenses/VRM) |> 
  arrange(Expense_VRM)|>
  slice(1)

# A tibble: 1 × 5
  Agency                                       Mode  Expenses    VRM Expense_VRM
  <chr>                                        <chr>    <dbl>  <dbl>       <dbl>
1 San Francisco Bay Area Rapid Transit Distri… Heav… 17965407 8.27e7       0.217

Which transit system (agency and mode) has the highest total fares per VRM?

Chicago Water Taxi (Wendella)’s Ferryboat at 237 total fares per VRM

USAGE_AND_FINANCIALS |> select(Agency, Mode, `Total Fares`, VRM) |> 
  mutate(Fares_VRM = `Total Fares`/VRM) |>
  arrange(desc(Fares_VRM)) |>
  slice(1)

# A tibble: 1 × 5
  Agency                        Mode      `Total Fares`   VRM Fares_VRM
  <chr>                         <chr>             <dbl> <dbl>     <dbl>
1 Chicago Water Taxi (Wendella) Ferryboat        142473   600      237.

Conclusion

In terms of ridership, the MTA New York City Transit takes the win in with the most Vehicle Revenue Miles and the most Unlinked Passenger Trips in 2022. Ridership in the NYC, NJ, Newark area overall is the highest and the transit systems in the area are some of the most utilized public transit systems in the US. Financially, San Francisco’s BART Heavy Rail/Subway comes out on top with both the lowest expense per VRM and lowest expense per UPT. Additionally, San Francisco/Oakland, CA also offers the most modes of transportation. When it comes to usage, the MTA is the transit system that shines, covering the most revenue miles with its vehicles and servicing the most passenger trips. When finances are added to the picture, the BART seems to be the most cost effective transit system.