Chapter 1 Introduction

In the contemporary landscape of Applied Machine Learning, predictive modeling has become a central approach for deriving insights from data. The field encompasses a variety of methods, each with distinct capabilities worth exploring. This project aims to build a comprehensive understanding of six machine learning methods, focusing on their application, structure, and outcomes, by covering the following key aspects.

First, an overview introduces the fundamental principles of each model. Next, an in-depth examination of the algorithmic framework explores the model's architecture and computational mechanisms. A detailed description of the training and prediction process then explains how the model acquires knowledge and generates predictions. Finally, an assessment of the model's strengths and limitations evaluates its practical utility, considering factors such as interpretability, scalability, and computational complexity.

1.1 Dataset

The dataset comprises 19 variables holding historical information on flight operations for all aircraft departing from New York City airports throughout 2013. The table is readily accessible with the following code:

# Load the nycflights13 package and store the flights table in a working object.
library(nycflights13)
data_flight <- flights

As the code shows, this dataset comes from the nycflights13 library in R. Although it contains only 19 variables, it holds 336,776 rows, which exceeds the project's criterion for a "moderate" dataset. To meet the requirement of working with a dataset in the range of 10^3 to 10^5 rows, and to simplify the analysis, random sampling is used to reduce the table to 5,000 rows; the code below shows how this is done.


# Set random seed to ensure reproducibility.
set.seed(123)

# Randomly select 5000 row indices and use them to subset the original data.
selected_indices <- sample(1:nrow(data_flight), 5000, replace = FALSE)
data_flight <- data_flight[selected_indices, ]

The table contains discrete categorical and numerical variables, such as year, month, day, carrier, origin, destination, and tail number, alongside continuous numerical attributes such as hour, minute, departure time, arrival time, departure delay, arrival delay, air time, and distance. Among the categorical variables, some are ordinal, like month, while others are nominal, such as carrier, origin, destination, and tail number. Below is a depiction of the table layout and column names; a short code sketch for checking these variable types in R follows the table.

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
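One quick way to verify these variable types in R (a minimal sketch, assuming the dplyr package is installed) is to inspect the sampled table directly:

# List the class of every column in the sampled data.
sapply(data_flight, class)

# dplyr::glimpse() prints each column with its type and a preview of its values.
dplyr::glimpse(data_flight)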

Utilizing this dataset, the project explores the predictive potential of the six selected machine learning methods to model the relationship between flight arrival delay and airline, where carrier (the airline code) serves as the primary independent variable and arr_delay (arrival delay in minutes) as the dependent variable.
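As a concrete illustration of this setup (a minimal sketch, not the model specification developed in later chapters), a baseline linear fit of arrival delay on carrier could be obtained as follows; note that arr_delay is missing for cancelled flights, and lm() drops those rows by default:

# Treat the airline code as a categorical predictor and regress arrival delay on it.
data_flight$carrier <- factor(data_flight$carrier)
baseline_fit <- lm(arr_delay ~ carrier, data = data_flight)

# One coefficient per carrier, relative to the reference level.
summary(baseline_fit)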

1.2 The models

To conclude this introduction, the selected models to be studied are listed below; a sketch of how they could be expressed in R follows the list:

  1. A simple linear model

  2. A generalized linear model with family set to Poisson

  3. A generalized linear model with family set to binomial

  4. A support vector machine

  5. A generalized additive model

  6. A neural network
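For orientation, the sketch below shows one way these six models could be written in R. It assumes the e1071, mgcv, and nnet packages and two illustrative derived targets (a non-negative delay for the Poisson model and a delayed/on-time indicator for the binomial model); the exact formulas, predictors, and packages used in the following chapters may differ.

library(e1071)  # svm()
library(mgcv)   # gam()
library(nnet)   # nnet()

# Illustrative preprocessing: drop rows with missing delays and derive the
# targets required by the two generalized linear models.
d <- na.omit(data_flight[, c("arr_delay", "dep_delay", "carrier")])
d$carrier   <- factor(d$carrier)
d$delay_pos <- pmax(d$arr_delay, 0)          # non-negative delay for the Poisson GLM
d$delayed   <- as.integer(d$arr_delay > 0)   # 1 if the flight arrived late, 0 otherwise

m_lm   <- lm(arr_delay ~ carrier, data = d)                        # 1. simple linear model
m_pois <- glm(delay_pos ~ carrier, family = poisson,  data = d)    # 2. GLM, Poisson family
m_bin  <- glm(delayed ~ carrier,   family = binomial, data = d)    # 3. GLM, binomial family
m_svm  <- svm(arr_delay ~ carrier + dep_delay, data = d)           # 4. support vector machine
m_gam  <- gam(arr_delay ~ carrier + s(dep_delay), data = d)        # 5. generalized additive model
m_nn   <- nnet(arr_delay ~ carrier + dep_delay, data = d,          # 6. single-hidden-layer neural network
               size = 5, linout = TRUE, trace = FALSE)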

1.3 Use of Generative AI

Generative AI proved particularly effective in providing quick and intuitive explanations of complex concepts, such as the natural log of the odds (log-odds), and in assisting with debugging R code. However, it had limitations, especially in handling intricate statistical nuances and in judging whether advanced methods were appropriate for our specific dataset. Some debugging suggestions included irrelevant steps and required careful double-checking. Obtaining suggestions tailored to the specific conditions of our models was also challenging, as the AI generally offers broad insights that we then had to connect to our particular case. Despite these limitations, Generative AI was especially useful for creating the README file, providing a structured format that ensured the document was correctly written and all necessary sections were included.