ML models from the ground up
Math behind Data-Driven Decisions
Organization
The University of Chicago
Core Technologies
Python, Scikit-learn, NumPy, Pandas, Matplotlib, Plotly, SciPy
Domain
Machine Learning
Date
April 2024
Technical highlight
Implemented a decision tree algorithm capable of handling both categorical and continuous variables. Designed methods for node splitting and information-gain calculation, and used Matplotlib to plot validation curves for assessing model performance. Analyzed classification of arrhythmias from 279 patient variables, achieving an accuracy of 56%.
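As a sketch of the splitting criterion, an entropy-based information gain can be computed as below; the helper names are illustrative, not the project's exact API:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits.
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(parent, left, right):
    # Entropy reduction from splitting a node into two children,
    # weighting each child's entropy by its share of the samples.
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

The splitter then evaluates candidate thresholds (for continuous variables) or category partitions (for categorical ones) and keeps the split with the highest gain.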
Decision trees offer interpretability but are prone to overfitting and instability, since small changes in the data can produce very different trees. To address these issues, I extended the model by implementing Random Forest and AdaBoost algorithms. Random Forest mitigated the high variance and improved stability by averaging predictions from many trees trained on bootstrap samples, making it effective for complex, high-dimensional data.
AdaBoost further improved accuracy by reweighting training examples so that each new weak learner focuses on previously misclassified instances, though it required careful handling of noisy data. Each model was evaluated for its trade-offs among accuracy, complexity, and interpretability.
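A minimal sketch of the AdaBoost reweighting step, using the standard discrete-AdaBoost update (the function name and interface are illustrative):

import numpy as np

def adaboost_round(weights, y_true, y_pred):
    # One boosting round: compute the weak learner's weighted error,
    # derive its vote alpha, and up-weight the misclassified samples.
    miss = (y_pred != y_true).astype(float)
    err = np.dot(weights, miss) / weights.sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    weights = weights * np.exp(alpha * (2 * miss - 1))
    return weights / weights.sum(), alpha

This up-weighting is also why noisy labels are a problem: mislabeled points are never classified correctly, so their weights keep growing round after round.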
Technical highlight
Developed a neural network from scratch using a custom framework built around a primary class, NN. This class represents a fully connected, feed-forward neural network and includes attributes and methods for managing network layers, training with stochastic gradient descent (SGD), and evaluating performance on a test set.
I implemented forward and backward passes for each operation, including softmax, matrix multiplication, and ReLU, and coded a dynamic computational graph that adapts to the network architecture defined at instantiation. The model was tested on the MNIST dataset of handwritten digits, training on 60,000 examples and reaching 97% accuracy on the 10,000-example test set.
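To illustrate the operation-class pattern (a sketch only; the actual class and method names in the framework may differ), each op caches during forward whatever its backward pass will need:

import numpy as np

class ReLU:
    # Forward caches the positive mask; backward routes the gradient
    # only through inputs that were positive.
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask

class MatMul:
    # A fully connected layer as a matrix multiply; forward caches the
    # input so backward can produce the weight gradient for the SGD step.
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)

    def forward(self, x):
        self.x = x
        return x @ self.W

    def backward(self, grad_out):
        self.grad_W = self.x.T @ grad_out  # gradient for the parameter update
        return grad_out @ self.W.T         # gradient passed to earlier ops

The computational graph is then just an ordered collection of such ops, traversed forward for prediction and in reverse for backpropagation.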
Technical highlight
Extended the neural network framework to implement a convolutional neural network (CNN) designed to classify 32×32 color images into four classes. Building on the modular approach, I introduced additional operation classes (Conv, MaxPool, and Flatten) to support CNN functionality and learn effectively from image data.
In this framework, the architecture is not specified by a simple list of layer sizes, as with the original neural network. Instead, I modified the build_computation_graph() function to accommodate the combination of different layer types, including convolutional, pooling, and fully connected layers. Each convolutional or fully connected layer is followed by a ReLU activation function, with the exception of the last layer. A flatten layer is used between the final max-pooling layer and the first fully connected layer to convert the 3D image tensor into a 1D vector.
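A sketch of what that graph assembly might look like, reusing op classes like those above (the spec format here is illustrative, not the project's exact interface):

def build_computation_graph(layer_specs):
    # Translate a list of layer specs into op instances, inserting a
    # ReLU after every conv or fully connected layer except the last.
    ops = []
    for i, spec in enumerate(layer_specs):
        if spec["type"] == "conv":
            ops.append(Conv(spec["in_ch"], spec["out_ch"], spec["kernel"]))
        elif spec["type"] == "maxpool":
            ops.append(MaxPool(spec["size"]))
        elif spec["type"] == "flatten":
            ops.append(Flatten())
        elif spec["type"] == "fc":
            ops.append(MatMul(spec["in_dim"], spec["out_dim"]))
        if spec["type"] in ("conv", "fc") and i < len(layer_specs) - 1:
            ops.append(ReLU())
    return ops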
The network was trained using stochastic gradient descent (SGD) and evaluated on 3,000 images from the CIFAR-10 dataset of labeled tiny images, achieving a test accuracy of 65% across the four target classes.
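The training loop itself reduces to plain minibatch SGD; a sketch, assuming ops expose forward/backward as above and a loss object with a hypothetical backward(outputs, targets) method that returns the initial gradient:

import numpy as np

def train_sgd(ops, loss_fn, X, y, lr=0.01, epochs=10, batch_size=32):
    # Minibatch SGD: forward through every op, backpropagate the loss
    # gradient, then step each learnable parameter down its gradient.
    n = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            out = X[idx]
            for op in ops:                        # forward pass
                out = op.forward(out)
            grad = loss_fn.backward(out, y[idx])  # hypothetical loss interface
            for op in reversed(ops):              # backward pass
                grad = op.backward(grad)
            for op in ops:                        # parameter update
                if hasattr(op, "grad_W"):
                    op.W -= lr * op.grad_W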
Takeaways