Distributed FaaS Platform
A High-Performance Serverless Platform for Running Functions at Scale
Organization
The University of Chicago
Core Technologies
Python, FastAPI, Redis, ZMQ, Pytest
Domain
Distributed Computing
Date
Dec 2024
Technical highlight
FaaS Endpoints
The FaaS service exposes the following FastAPI endpoints for clients to interact with:
/register: Allows clients to register a Python function by serializing its body (using dill). The service stores the function in Redis with metadata (e.g., its name) and generates a unique function UUID, which is returned to the client for future invocations.
/execute: Clients invoke a registered function by sending its function UUID along with serialized input arguments. The service creates a task with a unique task UUID, stores the task details in Redis, and publishes the task to a Redis channel and a task queue for processing.
/status: Enables clients to check the current status of a task using its task UUID. The status (e.g., pending, running, completed, or failed) is retrieved from Redis.
/result: Fetches the result of a completed task using its task UUID. The result, including the exception if the task failed, is serialized and returned to the client.
These endpoints are designed for high performance, fault tolerance, and efficient task management in a serverless environment.
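To make the API concrete, here is a minimal sketch of how the /register and /execute endpoints might look. The request models, Redis key names, and hex encoding of the dill payloads are illustrative assumptions, not the exact implementation.

```python
import uuid

import redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
rdb = redis.Redis()  # assumes a local Redis instance


class RegisterRequest(BaseModel):  # hypothetical request model
    name: str
    payload: str  # dill-serialized function body, hex-encoded


class ExecuteRequest(BaseModel):  # hypothetical request model
    function_id: str
    payload: str  # dill-serialized (args, kwargs), hex-encoded


@app.post("/register")
def register(req: RegisterRequest):
    # Store the serialized function and its metadata under a fresh UUID.
    function_id = str(uuid.uuid4())
    rdb.hset(f"function:{function_id}",
             mapping={"name": req.name, "payload": req.payload})
    return {"function_id": function_id}


@app.post("/execute")
def execute(req: ExecuteRequest):
    # Create a task record, then announce it both on the PUB/SUB
    # channel (local/push modes) and on the task queue (pull mode).
    task_id = str(uuid.uuid4())
    rdb.hset(f"task:{task_id}",
             mapping={"function_id": req.function_id,
                      "payload": req.payload,
                      "status": "QUEUED"})
    rdb.publish("tasks", task_id)
    rdb.rpush("task_queue", task_id)
    return {"task_id": task_id}
```

/status and /result would follow the same pattern, reading the task hash back from Redis and returning the stored status or serialized result.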
Technical highlight
Task Dispatcher
The Task Dispatcher is a separate component responsible for routing tasks to workers, detecting failures, and updating Redis with task status and results. It operates independently of the FaaS service to simplify concurrency handling and to keep worker processes from interacting with Redis directly.
Key Features:
Task Listening: The dispatcher listens for task IDs on the Redis tasks channel using Redis PUB/SUB in local or push modes (via Redis.sub("tasks")), or polls the Redis task queue (via Redis.lpop("task_queue")) in pull mode.
Task Dispatching: For each task ID received, it retrieves the function body and input arguments from Redis and assigns the task to an available worker for execution.
Load Balancing: In push mode, a round-robin strategy is used to balance load across workers.
Execution Models: Three methods are implemented to ensure concurrent execution:
Local Mode: Uses Python's multiprocessing to execute tasks locally for development and testing; this served as the baseline for performance testing.
Pull Mode: Uses a ZMQ REQ/REP model, where workers request tasks from the dispatcher. The dispatcher checks the Redis task queue and, if any tasks are available, sends them as responses to worker requests (see the sketch after this list).
Push Mode: Uses a ZMQ DEALER/ROUTER pattern to push tasks to waiting workers in a load-balanced manner (round-robin assignment) as they arrive.
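The pull-mode exchange can be sketched roughly as follows. Socket addresses, message framing, and the NO_TASK sentinel are assumptions for illustration rather than the project's actual protocol.

```python
import time

import zmq


def run_pull_dispatcher(rdb, addr="tcp://*:5555"):  # addr is assumed
    # REP side: answer each worker request with a task ID, if any.
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(addr)
    while True:
        sock.recv()                       # worker asks for work
        task_id = rdb.lpop("task_queue")  # next queued task, or None
        if task_id is None:
            sock.send(b"NO_TASK")
        else:
            # The real system would also fetch the function body and
            # arguments from Redis and ship them along with the ID.
            sock.send(task_id)


def run_pull_worker(addr="tcp://localhost:5555"):  # addr is assumed
    # REQ side: repeatedly request a task, execute it, report back.
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect(addr)
    while True:
        sock.send(b"READY")
        reply = sock.recv()
        if reply == b"NO_TASK":
            time.sleep(0.1)  # back off briefly before polling again
            continue
        # deserialize, execute, and store the result here
```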
This modular design ensures efficient task distribution and concurrent execution across workers, supporting scalability and fault detection in distributed environments.
Technical highlight
Fault Tolerance
To ensure robustness against failures, the Task Dispatcher implements fault tolerance for two primary failure scenarios:
Task Failures:
If a function raises an exception during execution (e.g., due to serialization errors or runtime issues), the exception is serialized and stored in Redis. The user is notified via an appropriate error code and a detailed error message.
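As a rough sketch, the worker-side execution wrapper might look like this; the Redis key layout and status strings are assumptions:

```python
import dill

# Hedged sketch: execute a task and record either the result or the
# serialized exception so that /result can return it to the client.
def run_task(rdb, task_id, fn, args, kwargs):
    try:
        result = fn(*args, **kwargs)
        rdb.hset(f"task:{task_id}",
                 mapping={"status": "COMPLETED",
                          "result": dill.dumps(result).hex()})
    except Exception as exc:
        # The exception itself is serialized and stored for the client.
        rdb.hset(f"task:{task_id}",
                 mapping={"status": "FAILED",
                          "result": dill.dumps(exc).hex()})
```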
Worker Failures:
A worker may fail during task execution or miss its deadline for completing a task. Both cases are surfaced to the user as WorkerFailure exceptions.
Failure Detection:
A heartbeat mechanism is implemented for Push Mode to monitor worker health. If a worker stops sending heartbeats or fails to report task completion within its deadline, it is marked as failed, and all tasks assigned to it are marked as failed as well.
For Pull Mode, a deadline-based tracking mechanism marks tasks as failed if they are not completed within a configurable timeframe; no conclusion is drawn about the worker as a whole.
The system maintains a record of active and failed workers, along with task-to-worker mappings, to reroute workloads and provide meaningful error reports.
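A minimal sketch of the push-mode heartbeat check, assuming an in-memory record of last-seen timestamps and task assignments (the data structures and timeout value are illustrative):

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; an assumed value


def check_heartbeats(last_seen, tasks_by_worker, failed_tasks):
    """Mark workers, and all tasks assigned to them, as failed when
    no heartbeat has arrived within the timeout.

    last_seen: dict worker_id -> timestamp of the last heartbeat
    tasks_by_worker: dict worker_id -> set of in-flight task IDs
    failed_tasks: set collecting the IDs of failed tasks
    """
    now = time.time()
    for worker_id, seen in list(last_seen.items()):
        if now - seen > HEARTBEAT_TIMEOUT:
            del last_seen[worker_id]  # worker is considered dead
            failed_tasks.update(tasks_by_worker.pop(worker_id, set()))
```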
By implementing these mechanisms, the system ensures accurate failure detection, graceful error handling, and reliable communication with clients, even in the face of distributed system challenges.
Technical highlight
Performance Analysis
To evaluate the performance of our FaaS system, we conducted a weak scaling study to compare the relative performance of the local, push, and pull execution modes. The local implementation (using a single Python multiprocessing pool) was treated as the baseline for comparison. The goal of the weak scaling study is to assess how well the system maintains consistent performance as both the workload and the resources scale proportionally.
A detailed performance analysis is available in the Git repo: reports/performance_report.pdf
We evaluated latency, system overhead, efficiency, and throughput. Push mode scaled horizontally with increasing load as expected, but it also showed a noticeable communication overhead.
We found this deeply curious, so we dove in to find the bottlenecks. At the heart of the problem was... THE HEARTBEATS! The push dispatcher runs two threads: one to handle the main message processing and another to check the heartbeats sent by the workers and decide whether any need to be marked as dead. Two threads running continuous busy while-loops cause excessive context switches, which bogged the system down. Both threads also updated a shared dictionary of worker IDs, active tasks, and heartbeats, which required acquiring a global lock. This caused sequential bottlenecks in the dispatcher.
The performance analysis brought this to light and highlighted the following improvements:
Make both the push worker and the push dispatcher single-threaded, avoiding global locks in the system.
On the dispatcher, check heartbeats only when no other task-related messages have been received from the workers (sketched below).
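One way to realize both improvements is a single-threaded event loop built on a ZMQ poller with a timeout: worker messages are always served first, and heartbeat checks run only when the socket is idle. The sketch below is an assumed shape (handle_message and check_heartbeats are hypothetical helpers), not the project's final code.

```python
import zmq


def run_push_dispatcher(addr="tcp://*:5556"):  # addr is assumed
    ctx = zmq.Context()
    router = ctx.socket(zmq.ROUTER)
    router.bind(addr)
    poller = zmq.Poller()
    poller.register(router, zmq.POLLIN)
    workers = {}  # worker_id -> last heartbeat timestamp
    while True:
        # Wait up to 100 ms for worker traffic: no busy loop, no
        # second thread, and therefore no global lock.
        events = dict(poller.poll(timeout=100))
        if router in events:
            frames = router.recv_multipart()
            handle_message(frames, workers)  # ready/result/heartbeat
        else:
            # The socket is idle: only now scan for missed heartbeats.
            check_heartbeats(workers)
```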
This was eye-opening and made the value of performance analyses abundantly clear to us.
Takeaways