Applied Machine Learning
2024-06-06
Chapter 1 Introduction
In applied machine learning, predictive modeling is a central tool for deriving insights from data. The field encompasses a variety of methods, each with distinct capabilities. This project examines six machine learning methods, focusing on their application, structure, and outcomes, and covers the following aspects for each model.
First, an overview introduces the fundamental principles of the model. Next, an in-depth examination of the algorithmic framework explores the model’s architecture and computational mechanisms. Then, a detailed description of the training and prediction process explains how the model learns from data and makes predictions. Finally, an assessment of the model’s strengths and limitations evaluates its practical utility, considering factors such as interpretability, scalability, and computational complexity.
1.1 Dataset
The dataset comprises 19 variables that hold historical information on flight operations for all flights that departed from New York City airports in 2013. The table can be loaded with the following code:
library(nycflights13)
data_flight <- flights
As the code shows, the dataset comes from the nycflights13 package in R. Although it has only 19 columns, it contains 336,776 rows, which exceeds the upper bound for a “moderate” dataset. To meet the project’s requirement of working with a dataset of 10^3 to 10^5 rows, and to simplify the analysis, the table is reduced to 5,000 rows by random sampling without replacement, as coded here:
# Set random seed to ensure reproducibility.
set.seed(123)
# Randomly select 5000 row indices and use them to subset original data.
selected_indices <- sample(1:nrow(data_flight), 5000, replace = FALSE)
data_flight <- data_flight[selected_indices, ]
The table contains discrete variables, both categorical and numerical, such as year, month, day, carrier, origin, destination, and tail number. It also includes numerical attributes treated as continuous, such as hour, minute, departure time, arrival time, departure delay, arrival delay, air time, and distance. Among the categorical variables, some are ordinal, like month, while others are nominal, such as carrier, origin, destination, and tail number. The first rows of the table show its layout and column names:
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 | 2013-01-01 05:00:00 |
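The variable types described above can be checked directly in R. The following is a minimal sketch using only base R on the full `flights` table; the name `col_summary` is introduced here for illustration and does not come from the project code.

```r
library(nycflights13)

# For each column, record its storage class and its number of distinct values.
# A low distinct count (e.g. for origin) suggests a categorical variable.
col_summary <- data.frame(
  column     = names(flights),
  class      = vapply(flights, function(x) class(x)[1], character(1)),
  n_distinct = vapply(flights, function(x) length(unique(x)), integer(1)),
  row.names  = NULL
)

print(col_summary)
```

Note that the storage class alone does not settle how a variable should be modeled: month, for example, is stored as a number but is described above as an ordinal category.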
Using this dataset, the project explores how well the six selected machine learning methods predict flight arrival delay from the airline. Here carrier (the airline, recorded as a two-letter code) is the primary independent variable and arr_delay (arrival delay in minutes) is the dependent variable.
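Before any model is fitted, the relation between the two variables can be inspected with a per-carrier summary. The sketch below repeats the sampling step from Section 1.1 so that it runs on its own. The names `complete` and `delay_by_carrier` are illustrative, and dropping rows with a missing arr_delay is an assumption about preprocessing, not a step taken from the project code.

```r
library(nycflights13)

# Reproduce the 5000-row sample from Section 1.1.
set.seed(123)
selected_indices <- sample(1:nrow(flights), 5000, replace = FALSE)
data_flight <- flights[selected_indices, ]

# Assumption: rows without a recorded arrival delay are dropped.
complete <- data_flight[!is.na(data_flight$arr_delay), ]

# Mean arrival delay (minutes) and number of flights per carrier.
mean_delay <- tapply(complete$arr_delay, complete$carrier, mean)
n_flights  <- tapply(complete$arr_delay, complete$carrier, length)

delay_by_carrier <- data.frame(
  carrier        = names(mean_delay),
  mean_arr_delay = as.vector(mean_delay),
  n_flights      = as.vector(n_flights)
)

# Carriers with the largest mean delay first.
delay_by_carrier[order(-delay_by_carrier$mean_arr_delay), ]
```

Carriers with very few flights in the sample will have noisy means, which is worth keeping in mind when interpreting carrier effects in the models.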
1.2 The models
Concluding this introduction, here is the list of the selected models to be studied:
A simple linear model
A generalized linear model with family set to Poisson
A generalized linear model with family set to Binomial
A support vector machine
A generalized additive model
A neural network
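As a rough orientation, the first and third entries of the list can be sketched with base R alone. This is an illustration under stated assumptions, not the specification used in later chapters: the binary response `late` and its 15-minute threshold are hypothetical, as are the object names `fit_lm` and `fit_binom`.

```r
library(nycflights13)

# Reproduce the 5000-row sample from Section 1.1.
set.seed(123)
selected_indices <- sample(1:nrow(flights), 5000, replace = FALSE)
data_flight <- flights[selected_indices, ]

# Simple linear model: arrival delay (minutes) explained by carrier.
# lm() drops rows with missing arr_delay by default.
fit_lm <- lm(arr_delay ~ carrier, data = data_flight)

# Hypothetical binary response for the binomial family:
# 1 if the flight arrived more than 15 minutes late, 0 otherwise.
data_flight$late <- as.integer(data_flight$arr_delay > 15)
fit_binom <- glm(late ~ carrier, data = data_flight, family = binomial)

summary(fit_lm)$r.squared
head(coef(fit_binom))
```

In both calls `carrier` is a character column, so R converts it to a factor and estimates one coefficient per carrier relative to a baseline carrier. Carriers with only a handful of flights in the sample may yield unstable coefficients.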
1.3 Use of Generative AI
Generative AI proved particularly effective in providing quick and intuitive explanations of complex concepts, such as the natural logarithm of the odds, and in assisting with debugging R code. However, it had limitations, especially in handling intricate statistical nuances and in judging whether advanced methods were appropriate for our specific dataset. Some debugging suggestions included irrelevant steps, which made careful double-checking necessary. Suggestions tailored to the specific conditions of our model were also hard to obtain, because the AI generally offers broad insights that we had to connect to our particular case. Despite these limitations, Generative AI was especially useful for creating the README file, where it provided a structured format that helped ensure the document was correctly written and included all necessary sections.