Chapter 1 Introduction

In the contemporary landscape of Applied Machine Learning, predictive modeling has become a central approach for deriving insights from data. The field encompasses a variety of methods, each with distinct capabilities worth exploring. This project aims to build a comprehensive understanding of six machine learning methods, focusing on their application, structure, and outcomes, by covering the following key aspects.

First, an overview introduces the fundamental principles of each model. Next, an in-depth examination of the algorithmic framework explores the model's architecture and computational mechanisms. A detailed description of the training and prediction process then explains how the model acquires knowledge and generates predictions. Finally, an assessment of the model's strengths and limitations evaluates its practical utility, considering factors such as interpretability, scalability, and computational complexity.

1.1 Dataset

The dataset comprises 19 variables holding historical information on flight operations for all aircraft departing from New York City airports throughout 2013. The table is readily accessible with the following code:

# Load the nycflights13 package and store the flights table in a working object.
library(nycflights13)
data_flight <- flights

As the code shows, this dataset comes from the nycflights13 library in R. Although it contains only 19 variables, it holds 336,776 rows, which exceeds the project's criterion for a "moderate" dataset. To meet the requirement of working with a dataset in the range of 10^3 to 10^5 rows, and to simplify the analysis, random sampling is used to reduce the table to 5,000 rows; the code below shows how this is done.


# Set random seed to ensure reproducibility.
set.seed(123)

# Randomly select 5000 row indices and use them to subset the original data.
selected_indices <- sample(1:nrow(data_flight), 5000, replace = FALSE)
data_flight <- data_flight[selected_indices, ]

The table contains discrete categorical and numerical variables, such as year, month, day, carrier, origin, destination, and tail number, alongside continuous numerical attributes such as hour, minute, departure time, arrival time, departure delay, arrival delay, air time, and distance. Among the categorical variables, some are ordinal, like month, while others are nominal, such as carrier, origin, destination, and tail number. Below is a depiction of the table layout and column names; a short code sketch for checking these variable types in R follows the table.

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
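One quick way to verify these variable types in R (a minimal sketch, assuming the dplyr package is installed) is to inspect the sampled table directly:

# List the class of every column in the sampled data.
sapply(data_flight, class)

# dplyr::glimpse() prints each column with its type and a preview of its values.
dplyr::glimpse(data_flight)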

Utilizing this dataset, the project explores the predictive potential of the six selected machine learning methods to model the relationship between flight arrival delay and airline, where carrier (the airline code) serves as the primary independent variable and arr_delay (arrival delay in minutes) as the dependent variable.
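As a concrete illustration of this setup (a minimal sketch, not the model specification developed in later chapters), a baseline linear fit of arrival delay on carrier could be obtained as follows; note that arr_delay is missing for cancelled flights, and lm() drops those rows by default:

# Treat the airline code as a categorical predictor and regress arrival delay on it.
data_flight$carrier <- factor(data_flight$carrier)
baseline_fit <- lm(arr_delay ~ carrier, data = data_flight)

# One coefficient per carrier, relative to the reference level.
summary(baseline_fit)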

1.2 The models

To conclude this introduction, the selected models to be studied are listed below; a sketch of how they could be expressed in R follows the list:

  1. A simple linear model

  2. A generalized linear model with family set to Poisson

  3. A generalized linear model with family set to binomial

  4. A support vector machine

  5. A generalized additive model

  6. A neural network
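For orientation, the sketch below shows one way these six models could be written in R. It assumes the e1071, mgcv, and nnet packages and two illustrative derived targets (a non-negative delay for the Poisson model and a delayed/on-time indicator for the binomial model); the exact formulas, predictors, and packages used in the following chapters may differ.

library(e1071)  # svm()
library(mgcv)   # gam()
library(nnet)   # nnet()

# Illustrative preprocessing: drop rows with missing delays and derive the
# targets required by the two generalized linear models.
d <- na.omit(data_flight[, c("arr_delay", "dep_delay", "carrier")])
d$carrier   <- factor(d$carrier)
d$delay_pos <- pmax(d$arr_delay, 0)          # non-negative delay for the Poisson GLM
d$delayed   <- as.integer(d$arr_delay > 0)   # 1 if the flight arrived late, 0 otherwise

m_lm   <- lm(arr_delay ~ carrier, data = d)                        # 1. simple linear model
m_pois <- glm(delay_pos ~ carrier, family = poisson,  data = d)    # 2. GLM, Poisson family
m_bin  <- glm(delayed ~ carrier,   family = binomial, data = d)    # 3. GLM, binomial family
m_svm  <- svm(arr_delay ~ carrier + dep_delay, data = d)           # 4. support vector machine
m_gam  <- gam(arr_delay ~ carrier + s(dep_delay), data = d)        # 5. generalized additive model
m_nn   <- nnet(arr_delay ~ carrier + dep_delay, data = d,          # 6. single-hidden-layer neural network
               size = 5, linout = TRUE, trace = FALSE)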

1.3 Use of Generative AI

Generative AI proved particularly effective in providing quick and intuitive explanations of complex concepts, such as the natural log of the odds (log-odds), and in assisting with debugging R code. However, it had limitations, especially in handling intricate statistical nuances and in judging whether advanced methods were appropriate for our specific dataset. Some debugging suggestions included irrelevant steps and required careful double-checking. Obtaining suggestions tailored to the specific conditions of our models was also challenging, as the AI generally offers broad insights that we then had to connect to our particular case. Despite these limitations, Generative AI was especially useful for creating the README file, providing a structured format that ensured the document was correctly written and all necessary sections were included.