Chapter 6 Generalized Additive Model

By Sheptim Veseli

6.1 Introduction

The Generalized Additive Model (GAM) represents an advanced statistical approach designed to capture non-linear relationships between predictor and response variables. GAMs extend linear models by allowing non-linear functions of predictor variables while maintaining additivity. This flexibility makes GAMs particularly useful in scenarios where the relationship between variables is complex and cannot be adequately modeled with simple linear terms. In this report, we apply GAM to predict the log-transformed flight arrival delays using the ‘carrier’ variable from the nycflights13 dataset. This analysis aims to assess the effectiveness of GAM in modeling flight delays and uncover patterns associated with different airlines.

6.2 Data Preparation

The dataset used in this analysis is a subset of the nycflights13 dataset, consisting of 5000 randomly selected flights. The following steps were undertaken to prepare the data:

# Set random seed for reproducibility
set.seed(123)

# Randomly select 5000 rows and use them to subset original data
selected_indices <- sample(1:nrow(flights), 5000, replace = FALSE)
flights_subset <- flights[selected_indices, ]

A random sample of 5000 rows was selected from the original dataset to create a manageable subset for analysis. Handling Missing Values: Rows with missing values in critical variables such as dep_time, arr_time, arr_delay, tailnum, and air_time were removed to ensure data integrity. Removing Negative Arrival Delays: Rows with negative values for arr_delay were eliminated as they do not represent valid flight delay times.

Code Preview

# Remove rows with missing values in key variables
flights_subset <- flights_subset[complete.cases(flights_subset$dep_time, flights_subset$arr_time, flights_subset$arr_delay, flights_subset$tailnum, flights_subset$air_time), ]

# Remove NA and negative values from arrival delay time (arr_delay)
flights_subset <- flights_subset[complete.cases(flights_subset$arr_delay) & flights_subset$arr_delay >= 0, ]

6.2.1 Log Transformation

The arr_delay variable was log-transformed to normalize its distribution and stabilize variance. One was added to avoid log(0) issues.

# Apply logarithmic transformation to arrival delay time (arr_delay)
flights_subset$log_arr_delay <- log(flights_subset$arr_delay + 1)  # Adding 1 to avoid log(0) issues

# Convert carrier to factor
flights_subset$carrier <- as.factor(flights_subset$carrier)

6.3 Data Splitting & Model Fitting

The data was split into training (80%) and test (20%) sets for model evaluation. This step ensures that the model’s performance is evaluated on unseen data, providing a more realistic assessment of its predictive capabilities.

# Split the data into training and test sets
set.seed(123)  # Set seed for reproducibility
train_index <- createDataPartition(flights_subset$log_arr_delay, p = 0.8, list = FALSE)
train_data <- flights_subset[train_index, ]
test_data <- flights_subset[-train_index, ]

A Generalized Additive Model was fitted to the training data with the following specifications:

  • Response Variable: log_arr_delay

  • Predictor Variable: carrier (as a factor with a smooth term)

The model formula used was: log_arr_delay∼s(carrier,bs=“re”) log_arr_delay∼s(carrier,bs=“re”) where s(carrier, bs = “re”) indicates a smooth term for the carrier variable using a random effects basis spline.

# Fit a Generalized Additive Model
gam_model <- gam(log_arr_delay ~ s(carrier, bs = "re"), data = train_data)

6.3.1 Model Summary

The summary of the fitted GAM is as follows:

  • Family: Gaussian

  • Link Function: Identity

  • Intercept: 2.88605 (Std. Error: 0.06721, p < 2e-16)

  • Smooth Terms: s(carrier): edf = 6.945, Ref.df = 14, F = 1.4, p = 0.00212

  • Model Performance: Adjusted

  • R-squared: 0.011

  • Deviance Explained: 1.52%

  • Generalized Cross-Validation (GCV) Score: 1.6482

  • Scale Estimate: 1.6402

  • Number of Observations: 1643

Code Preview

# Print model summary
summary(gam_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log_arr_delay ~ s(carrier, bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.88605    0.06721   42.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df   F p-value   
## s(carrier) 6.945     14 1.4 0.00212 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.011   Deviance explained = 1.52%
## GCV = 1.6482  Scale est. = 1.6402    n = 1643

The intercept of 2.88605 represents the baseline log-transformed arrival delay when the carrier effect is not considered. The significant smooth term for the carrier variable (p = 0.00212) indicates that different airlines have different patterns of arrival delays. The effective degrees of freedom (edf) for the carrier term suggests a moderately complex relationship. However, the adjusted R-squared value of 0.011 and the deviance explained of 1.52% indicate that the carrier variable alone does not explain much of the variability in arrival delays. This low explanatory power suggests that other factors not included in this model, such as weather conditions, airport congestion, and specific aircraft issues, are likely to play a significant role in influencing arrival delays. These findings align with expectations, as flight delays are typically influenced by a multitude of factors beyond the airline carrier alone.

6.3.2 Model Evaluation

The model was evaluated on the test set, yielding the following performance metrics:

## Mean Squared Error: 1.576027
## Root Mean Squared Error: 1.255399

The MSE and RMSE indicate the average squared difference and the square root of this average, respectively, between the actual and predicted log-transformed arrival delays. These metrics provide a quantitative measure of the model’s prediction accuracy. The RMSE value of approximately 1.255 suggests that the model’s predictions are, on average, within 1.255 units of the actual log-transformed arrival delays. Next,the scatterplot below compares the actual log-transformed arrival delays with the predicted values from the GAM model. The red line represents the ideal fit where predicted values equal actual values. The scatterplot indicates a spread around the red line, reflecting the model’s predictions. The spread suggests variability in prediction accuracy, which is expected given the low R-squared value.

Summary Preview

summary(gam_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log_arr_delay ~ s(carrier, bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.88605    0.06721   42.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df   F p-value   
## s(carrier) 6.945     14 1.4 0.00212 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.011   Deviance explained = 1.52%
## GCV = 1.6482  Scale est. = 1.6402    n = 1643

6.4 Conclusion

The application of GAM to predict flight arrival delays based on the carrier variable highlights the significance of airline-specific patterns in arrival delays. However, the low explanatory power of the model underscores the complexity of factors influencing flight delays. While the GAM captured some carrier-specific trends, it is evident that additional variables such as weather conditions, airport traffic, flight distance, and aircraft-specific issues need to be included to improve predictive performance. Future research should focus on incorporating these additional predictors to develop a more comprehensive model.