ML models from the ground up
Math behind Data-Driven Decisions
Organization
The University of Chicago
Core Technologies
Python, Scikit-learn, NumPy, Pandas, Matplotlib, Plotly, SciPy
Domain
Machine Learning
Date
April 2024
Technical highlight
Implemented a decision tree algorithm capable of handling both categorical and continuous variables. Designed methods for node splitting and information-gain calculation, and used Matplotlib to plot validation curves for assessing model performance. Analyzed classification of arrhythmias from 279 patient variables, achieving an accuracy of 56%.
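As a sketch of the splitting criterion, an entropy-based information gain can be computed as below; the helper names are illustrative, not the project's exact API:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits.
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(parent, left, right):
    # Entropy reduction from splitting a node into two children,
    # weighting each child's entropy by its share of the samples.
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

The splitter then evaluates candidate thresholds (for continuous variables) or category partitions (for categorical ones) and keeps the split with the highest gain.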
Decision trees offer interpretability but are prone to overfitting and instability, since small changes in the data can produce very different trees. To address these issues, I extended the model by implementing Random Forest and AdaBoost algorithms. Random Forest mitigated the high variance and improved stability by averaging predictions from many trees trained on bootstrap samples, making it effective for complex, high-dimensional data.
AdaBoost further improved accuracy by reweighting training examples so that each new weak learner focuses on previously misclassified instances, though it required careful handling of noisy data. Each model was evaluated for its trade-offs among accuracy, complexity, and interpretability.
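A minimal sketch of the AdaBoost reweighting step, using the standard discrete-AdaBoost update (the function name and interface are illustrative):

import numpy as np

def adaboost_round(weights, y_true, y_pred):
    # One boosting round: compute the weak learner's weighted error,
    # derive its vote alpha, and up-weight the misclassified samples.
    miss = (y_pred != y_true).astype(float)
    err = np.dot(weights, miss) / weights.sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    weights = weights * np.exp(alpha * (2 * miss - 1))
    return weights / weights.sum(), alpha

This up-weighting is also why noisy labels are a problem: mislabeled points are never classified correctly, so their weights keep growing round after round.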
Technical highlight
Developed a neural network from scratch using a custom framework built around a primary class, NN. This class represents a fully connected, feed-forward neural network and includes attributes and methods for managing network layers, training with stochastic gradient descent (SGD), and evaluating performance on a test set.
I implemented forward and backward passes for each operation, including softmax, matrix multiplication, and ReLU, and coded a dynamic computational graph that adapts to the network architecture defined at instantiation. The model was tested on the MNIST dataset of handwritten digits, training on 60,000 examples and reaching 97% accuracy on the 10,000-example test set.
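To illustrate the operation-class pattern (a sketch only; the actual class and method names in the framework may differ), each op caches during forward whatever its backward pass will need:

import numpy as np

class ReLU:
    # Forward caches the positive mask; backward routes the gradient
    # only through inputs that were positive.
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask

class MatMul:
    # A fully connected layer as a matrix multiply; forward caches the
    # input so backward can produce the weight gradient for the SGD step.
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)

    def forward(self, x):
        self.x = x
        return x @ self.W

    def backward(self, grad_out):
        self.grad_W = self.x.T @ grad_out  # gradient for the parameter update
        return grad_out @ self.W.T         # gradient passed to earlier ops

The computational graph is then just an ordered collection of such ops, traversed forward for prediction and in reverse for backpropagation.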
Technical highlight
Extended the neural network framework to implement a convolutional neural network (CNN) designed to classify 32×32 color images into four classes. Building on the modular approach, I introduced additional operation classes (Conv, MaxPool, and Flatten) to support CNN functionality and learn effectively from image data.
In this framework, the architecture is not specified by a simple list of layer sizes, as with the original neural network. Instead, I modified the build_computation_graph() function to accommodate the combination of different layer types, including convolutional, pooling, and fully connected layers. Each convolutional or fully connected layer is followed by a ReLU activation function, with the exception of the last layer. A flatten layer is used between the final max-pooling layer and the first fully connected layer to convert the 3D image tensor into a 1D vector.
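A sketch of what that graph assembly might look like, reusing op classes like those above (the spec format here is illustrative, not the project's exact interface):

def build_computation_graph(layer_specs):
    # Translate a list of layer specs into op instances, inserting a
    # ReLU after every conv or fully connected layer except the last.
    ops = []
    for i, spec in enumerate(layer_specs):
        if spec["type"] == "conv":
            ops.append(Conv(spec["in_ch"], spec["out_ch"], spec["kernel"]))
        elif spec["type"] == "maxpool":
            ops.append(MaxPool(spec["size"]))
        elif spec["type"] == "flatten":
            ops.append(Flatten())
        elif spec["type"] == "fc":
            ops.append(MatMul(spec["in_dim"], spec["out_dim"]))
        if spec["type"] in ("conv", "fc") and i < len(layer_specs) - 1:
            ops.append(ReLU())
    return ops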
The network was trained using stochastic gradient descent (SGD) and evaluated on 3,000 images from the CIFAR-10 dataset of labeled tiny images, achieving a test accuracy of 65% across the four target classes.
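The training loop itself reduces to plain minibatch SGD; a sketch, assuming ops expose forward/backward as above and a loss object with a hypothetical backward(outputs, targets) method that returns the initial gradient:

import numpy as np

def train_sgd(ops, loss_fn, X, y, lr=0.01, epochs=10, batch_size=32):
    # Minibatch SGD: forward through every op, backpropagate the loss
    # gradient, then step each learnable parameter down its gradient.
    n = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            out = X[idx]
            for op in ops:                        # forward pass
                out = op.forward(out)
            grad = loss_fn.backward(out, y[idx])  # hypothetical loss interface
            for op in reversed(ops):              # backward pass
                grad = op.backward(grad)
            for op in ops:                        # parameter update
                if hasattr(op, "grad_W"):
                    op.W -= lr * op.grad_W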
Takeaways