Distributed FaaS Platform

A High-Performance Serverless Platform for Running Functions at Scale

This Function-as-a-Service (FaaS) platform allows users to run Python functions in a serverless environment. It was implemented in Python, using the FastAPI framework for client-server communication and ZMQ for load-balanced, fault-tolerant communication between a Task Dispatcher and a pool of worker processes. The system was optimized to handle many concurrent requests and provide fast response times.

Organization

The University of Chicago

Core Technologies

Python, FastAPI, Redis, ZMQ, Pytest

Domain

Distributed Computing

Date

Dec 2024

Technical highlight

FaaS Endpoints

The FaaS service provides the following FastAPI endpoints for clients to interact with:

  • /register: Allows clients to register a Python function by serializing its body (using dill). The service stores the function in Redis with metadata (e.g., name) and generates a unique function UUID, which is returned to the client for future invocation.

  • /execute: Clients can invoke a registered function by sending its function UUID along with serialized input arguments. The service creates a task with a unique task UUID, stores the task details in Redis, and publishes the task to a Redis channel and Task Queue for processing.

  • /status: Enables clients to check the current status of a task using its task UUID. The status (e.g., pending, running, completed, or failed) is retrieved from Redis.

  • /result: Fetches the result of a completed task using its task UUID. The result, including exceptions if the task failed, is serialized and returned to the client.

These endpoints are designed for high performance, fault tolerance, and efficient task management in a serverless environment.
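
To make these endpoints concrete, here is a minimal sketch of how /register and /execute could be written with FastAPI and a redis-py client. The request models, Redis key layout, and names (app, rdb) are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of the /register and /execute endpoints; key names and
# payload encoding are illustrative, not the project's actual implementation.
import uuid

import redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
rdb = redis.Redis(host="localhost", port=6379)

class RegisterRequest(BaseModel):
    name: str
    payload: str           # hex-encoded, dill-serialized function body

class ExecuteRequest(BaseModel):
    function_id: str
    payload: str           # hex-encoded, dill-serialized (args, kwargs)

@app.post("/register")
def register(req: RegisterRequest):
    fn_id = str(uuid.uuid4())
    # Store the serialized function and its metadata under the new function UUID.
    rdb.hset(f"function:{fn_id}", mapping={"name": req.name, "body": req.payload})
    return {"function_id": fn_id}

@app.post("/execute")
def execute(req: ExecuteRequest):
    task_id = str(uuid.uuid4())
    # Record the task as pending, then notify the dispatcher on both channels.
    rdb.hset(f"task:{task_id}", mapping={
        "function_id": req.function_id,
        "args": req.payload,
        "status": "pending",
    })
    rdb.publish("tasks", task_id)      # local/push modes subscribe to this channel
    rdb.rpush("task_queue", task_id)   # pull mode polls this queue
    return {"task_id": task_id}
```

The /status and /result endpoints follow the same pattern: look up the task hash by its UUID and return the status field or the serialized result (or exception).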

Technical highlight

Task Dispatcher

The Task Dispatcher is a separate component responsible for routing tasks to workers, detecting failures, and updating Redis with task status and results. It runs independently of the FaaS service, which simplifies concurrency handling and keeps the worker processes from interacting with Redis directly.

Key Features:

  • Task Listening: In local and push modes, the dispatcher listens for task IDs on the Redis tasks channel via publish/subscribe (Redis.sub("tasks")); in pull mode, it polls the Redis task queue instead (Redis.lpop("task_queue")).

  • Task Dispatching: For each task ID received, it retrieves the function body and input arguments from Redis and assigns the task to an available worker for execution.

  • Load Balancing: In push mode, a round-robin strategy is used to balance load across workers.

  • Execution Models: Three methods are implemented to ensure concurrent execution:

    1. Local Mode: Uses Python's multiprocessing to execute tasks locally for development and testing. This served as the baseline for performance testing.

    2. Pull Mode: Uses a ZMQ REQ/REP model, where workers request tasks from the dispatcher. The dispatcher checks the Redis task queue and, if tasks are available, sends them as responses to worker requests.

    3. Push Mode: Uses a ZMQ DEALER/ROUTER pattern to push tasks to waiting workers in a load-balanced manner (round-robin assignment) as they are received.


This modular design ensures efficient task distribution and concurrent execution across workers, supporting scalability and fault detection in distributed environments.
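
As an illustration of the push model, the sketch below shows a ROUTER-side dispatch loop that learns worker identities from READY messages and pushes task IDs to workers in round-robin order. The port, message framing, and field names are assumptions, not the project's exact protocol.

```python
# Hypothetical push-mode dispatch loop (ZMQ ROUTER side); framing and names are
# illustrative only.
import collections
import itertools
import json

import zmq

context = zmq.Context()
router = context.socket(zmq.ROUTER)
router.bind("tcp://*:5556")

poller = zmq.Poller()
poller.register(router, zmq.POLLIN)

workers = []                      # DEALER identities learned from READY messages
round_robin = None
pending = collections.deque()     # task IDs; in the real system, fed by the Redis "tasks" channel

while True:
    events = dict(poller.poll(timeout=100))

    if router in events:
        # Every message from a DEALER arrives prefixed with that worker's identity frame.
        identity, raw = router.recv_multipart()
        message = json.loads(raw)
        if message["type"] == "READY" and identity not in workers:
            workers.append(identity)
            round_robin = itertools.cycle(workers)   # rebuild the cycle when membership changes
        elif message["type"] == "RESULT":
            pass                                     # persist status/result to Redis here

    # Assign queued tasks to workers, one per worker, in round-robin order.
    while pending and workers:
        task_id = pending.popleft()
        target = next(round_robin)
        router.send_multipart([target, json.dumps({"task_id": task_id}).encode()])
```

In pull mode the roles invert: workers send REQ messages asking for work, and the dispatcher replies with a task popped from the Redis queue (or an empty reply when none is available).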

Technical highlight

Fault Tolerance

To ensure robustness against failures, the Task Dispatcher implements fault tolerance for two primary failure scenarios:

  1. Task Failures:

    • If a function raises an exception during execution (e.g., due to serialization errors or runtime issues), the exception is serialized and stored in Redis. The user is notified via an appropriate error code and a detailed error message.

  2. Worker Failures:

    • A worker may fail during task execution or miss its deadline for completing a task. Both cases are surfaced to the user as WorkerFailure exceptions.

    • Failure Detection:

      • A heartbeat mechanism is implemented for Push Mode to monitor worker health. If a worker misses its heartbeats or fails to report task completion within a deadline, it is marked as failed, and all tasks assigned to that worker are marked as failed as well.

      • For Pull Mode, a deadline-based tracking mechanism ensures that tasks not completed within a configurable timeframe are marked as failed. No conclusion is made about the worker as a whole.

    • The system maintains a record of active and failed workers, along with task-to-worker mappings, to reroute workloads and provide meaningful error reports.

By implementing these mechanisms, the system ensures accurate failure detection, graceful error handling, and reliable communication with clients, even in the face of distributed system challenges.
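
To illustrate the heartbeat idea, the sketch below tracks the last heartbeat seen per worker and, once a worker goes silent past a deadline, marks it and all of its in-flight tasks as failed. The timeout value, data structures, and callback are assumptions for illustration.

```python
# Hypothetical heartbeat bookkeeping for push mode; the timeout and structures are illustrative.
import time

HEARTBEAT_TIMEOUT = 5.0      # seconds a worker may stay silent before being declared dead

last_heartbeat = {}          # worker_id -> monotonic timestamp of the last heartbeat
worker_tasks = {}            # worker_id -> set of task IDs currently assigned to it
failed_workers = set()

def record_heartbeat(worker_id):
    """Called whenever a heartbeat message arrives from a worker."""
    last_heartbeat[worker_id] = time.monotonic()

def check_heartbeats(mark_task_failed):
    """Mark stale workers, and every task they hold, as failed."""
    now = time.monotonic()
    for worker_id, seen in list(last_heartbeat.items()):
        if worker_id not in failed_workers and now - seen > HEARTBEAT_TIMEOUT:
            failed_workers.add(worker_id)
            for task_id in worker_tasks.get(worker_id, set()):
                # Surfaces as a WorkerFailure-style error to the client on /result.
                mark_task_failed(task_id, f"worker {worker_id!r} missed heartbeats")
```

Pull mode skips the per-worker bookkeeping: a deadline is stored per task when it is handed out, and only that task is failed if the deadline passes.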

Technical highlight

Performance Analysis

To evaluate the performance of our FaaS system, we conducted a weak scaling study to compare the relative performance of the local, push, and pull execution modes. The local implementation (using a single Python multiprocessing pool) was treated as the baseline for comparison. The goal of the weak scaling study is to assess how well the system maintains consistent performance as both the workload and the resources scale proportionally.
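
For context, weak-scaling efficiency is typically the time to complete the base workload on one worker divided by the time to complete an N-times larger workload on N workers (ideal value 1.0). The helper below is a hypothetical illustration of that calculation, not the benchmarking harness used for the report.

```python
# Hypothetical weak-scaling efficiency calculation; the example numbers are made up.
def weak_scaling_efficiency(t_baseline: float, t_scaled: float) -> float:
    """1.0 means perfect weak scaling; lower values mean overhead grows with scale."""
    return t_baseline / t_scaled

# e.g. baseline: 1 worker, W tasks in 12.0 s; scaled: 8 workers, 8*W tasks in 15.0 s
print(weak_scaling_efficiency(12.0, 15.0))   # 0.8
```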

A detailed performance analysis is available in the Git repo: reports/performance_report.pdf

We evaluated latency, system overhead, efficiency, and throughput. The push mode scaled horizontally with increasing load as expected, but the measurements indicated a communication overhead.

We found this deeply curious and dug in to find the bottlenecks. At the heart of the problem were... THE HEARTBEATS! The push dispatcher runs two threads: one handles the main message processing, and the other checks the heartbeats sent by the workers to decide whether any should be marked as dead. Two threads spinning in busy while-loops cause excessive context switching, which bogged the system down. Both threads also updated a shared dictionary of worker IDs, active tasks, and heartbeats, which required acquiring a global lock and created sequential bottlenecks in the dispatcher.

The performance analysis brought this to light and highlighted areas of improvement which are as follows:

  1. Run both the push worker and the push dispatcher on a single thread to avoid global locks in the system.

  2. On the dispatcher, check heartbeats only when no other task-related messages have been received from the workers (a sketch of this loop follows).
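
A possible shape for that single-threaded loop, assuming a ZMQ poller: the dispatcher blocks on the socket instead of busy-waiting, and only runs the heartbeat scan when no worker message is pending, so no second thread or global lock is needed. Names and timeouts are illustrative.

```python
# Hypothetical single-threaded push dispatcher loop: one poller, no locks, and
# heartbeat checks only when the socket is idle.
import zmq

context = zmq.Context()
router = context.socket(zmq.ROUTER)
router.bind("tcp://*:5556")

poller = zmq.Poller()
poller.register(router, zmq.POLLIN)

def handle_worker_message(frames):
    pass    # READY announcements, heartbeats, task results, ...

def check_heartbeats():
    pass    # mark workers dead if their last heartbeat is too old

while True:
    events = dict(poller.poll(timeout=100))   # block briefly instead of spinning
    if router in events:
        # Worker traffic takes priority; process it and skip the heartbeat scan this iteration.
        handle_worker_message(router.recv_multipart())
    else:
        # Only an idle socket pays for the heartbeat check.
        check_heartbeats()
```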

This was eye-opening and made the value of performance analyses abundantly clear to us.

Takeaways

This project gave me a deep understanding of the complexities involved in communication and coordination in a distributed system. I most enjoyed learning the importance of testing and performance analysis in a distributed setting, since it involved thinking through unusual edge cases and protecting against bad-faith actions that could affect the system as a whole.
