Parallelized Marching Squares

Accelerated contour detection in Go

The Marching Squares algorithm is a computational method used to generate contours or boundary lines from scalar field data, often represented in a 2D grid. I specifically examined performance on lung cancer images to detect areas that could indicate malignant growths. The Marching Squares algorithm is highly suitable for parallelization due to the inherent independence between the squares in the grid. I implemented a baseline sequential version of the algorithm, followed by two parallel versions, one without and one with work stealing, and analyzed the performance on non-uniform and uniform image datasets.

Organization

The University of Chicago

Core Technologies

Go Slurm NumPy Matplotlib

Domain

Parallel Programming

Date

December 2024

Technical highlight

How Marching Squares works

The algorithm works by iterating over the grid, examining each square formed by four adjacent grid points, and determining how the contour or boundary should pass through that square based on the scalar values.

Basic Steps:

  1. Grid Setup: The scalar field is represented as a grid of values (e.g., height values or temperature).

  2. Square Evaluation: Each grid cell is treated as a square, with each corner having a scalar value.

  3. Thresholding: A threshold value is used to classify each corner of the square as either inside or outside the contour (e.g., if the scalar value is above or below a certain threshold).

  4. Edge Interpolation: Based on the corner classifications, the edges of the square are interpolated to generate the contour line.

  5. Pattern Matching: A lookup table is used to determine the correct contour for each of the 16 possible corner configurations (since each corner can either be inside or outside the contour, forming 2^4 = 16 possibilities).

  6. Connecting the Contours: The contour lines are drawn by connecting the interpolated edges of each square, forming a continuous contour.
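Steps 3 through 5 above can be sketched for a single cell in Go. This is a minimal illustration, not the project's actual implementation; the function names `cellIndex` and `interp` and the corner ordering are my own choices:

```go
package main

import "fmt"

// cellIndex classifies the four corners of one grid cell against the
// isovalue and packs the results into a 4-bit case index (0..15),
// which would then be looked up in the 16-entry contour table.
// Corner order (a convention chosen for this sketch):
// top-left, top-right, bottom-right, bottom-left.
func cellIndex(tl, tr, br, bl, iso float64) int {
	idx := 0
	if tl >= iso {
		idx |= 8
	}
	if tr >= iso {
		idx |= 4
	}
	if br >= iso {
		idx |= 2
	}
	if bl >= iso {
		idx |= 1
	}
	return idx
}

// interp linearly interpolates where the isovalue crosses the edge
// between two corner values a and b, as a fraction in [0, 1].
func interp(a, b, iso float64) float64 {
	return (iso - a) / (b - a)
}

func main() {
	// Only the top-left corner is above the isovalue: one bit set.
	fmt.Println(cellIndex(1.0, 0.0, 0.0, 0.0, 0.5)) // 8
	// The contour crosses this edge exactly at its midpoint.
	fmt.Println(interp(0.0, 1.0, 0.5)) // 0.5
}
```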

After all images are processed, a report is generated on the dataset as a whole, summarizing, for each image, the fraction of pixels whose grayscale intensity is at or above the isovalue. This report was introduced both to justify the need for a barrier in the parallel versions of the project and because it genuinely aids understanding of the dataset. It could help answer questions such as the spread of cancerous growths, or the extent of areas at certain altitudes. The statistics I include on the filled fractions are: mean, mode, minimum filled area, maximum filled area, histogram, standard deviation, and variance.
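A few of the reported statistics can be sketched as a small Go helper. This is illustrative only; the struct and function names (`fractionStats`, `summarize`) are not the project's actual report format, and mode and histogram are omitted for brevity:

```go
package main

import (
	"fmt"
	"math"
)

// fractionStats summarizes the filled fractions (share of pixels at or
// above the isovalue) collected across a dataset of processed images.
type fractionStats struct {
	Mean, Min, Max, Variance, StdDev float64
}

// summarize computes mean, min, max, population variance, and standard
// deviation over the per-image filled fractions.
func summarize(fracs []float64) fractionStats {
	min, max, sum := fracs[0], fracs[0], 0.0
	for _, f := range fracs {
		sum += f
		if f < min {
			min = f
		}
		if f > max {
			max = f
		}
	}
	mean := sum / float64(len(fracs))
	variance := 0.0
	for _, f := range fracs {
		variance += (f - mean) * (f - mean)
	}
	variance /= float64(len(fracs))
	return fractionStats{mean, min, max, variance, math.Sqrt(variance)}
}

func main() {
	s := summarize([]float64{0.2, 0.4, 0.6})
	fmt.Printf("mean=%.2f min=%.2f max=%.2f std=%.3f\n", s.Mean, s.Min, s.Max, s.StdDev)
}
```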

Technical highlight

Three modes of working

1. Sequential Implementation

  • Overview: In this mode, each pixel was processed one after the other in a sequential manner, with no parallelism involved. It served as a baseline for comparing the parallel approaches.

2. Basic Parallel Implementation

  • Overview: This mode introduced parallelism using a map-reduce pattern. Key steps involved:

    1. Task Division:

      • The total tasks (T) are divided among N goroutines, with each goroutine assigned approximately T/N tasks.

      • A circular modulo operation (i % N) is used to distribute tasks evenly across the goroutines.

    2. Task Structure:

      • Each task is represented by a Task struct, and tasks are assigned to goroutines based on the modulo operation.

    3. Processing by Goroutines:

      • Each goroutine processes the entire assigned image (task) rather than smaller slices of it. This decision was made to explore the impact of tasks with uniform vs. nonuniform sizes, which could be controlled more directly at the image level.

    4. Barrier Implementation:

      • A barrier ensures all goroutines finish processing before generating a report capturing statistical information about the dataset processed (mean, median, histograms, and so on).

      • A condition variable and mutex are used to synchronize the goroutines. Each goroutine updates a shared variable (completedTasks) after processing its tasks.

      • If not all tasks are completed, a goroutine sleeps. The last goroutine to complete signals all sleeping goroutines via a broadcast.

    5. Map-Reduce Pattern:

      • Map Phase: The Marching Squares algorithm is applied in parallel across different sections of the data.

      • Reduce Phase: After all tasks are completed, the barrier synchronizes the goroutines, and a report is generated based on the processed data.

    This approach ensures parallel processing of tasks with synchronization for accurate reporting at the end.
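The barrier in step 4 above can be sketched with Go's condition variables. The shared counter follows the `completedTasks` name from the description, while `newBarrier` and `arrive` are illustrative names of my own:

```go
package main

import (
	"fmt"
	"sync"
)

// barrier makes every goroutine wait until all have finished their
// tasks: each arrival increments completedTasks under a mutex, the
// last arrival broadcasts, and earlier finishers sleep on the
// condition variable until then.
type barrier struct {
	mu             sync.Mutex
	cond           *sync.Cond
	completedTasks int
	total          int
}

func newBarrier(total int) *barrier {
	b := &barrier{total: total}
	b.cond = sync.NewCond(&b.mu)
	return b
}

func (b *barrier) arrive() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.completedTasks++
	if b.completedTasks == b.total {
		b.cond.Broadcast() // last goroutine wakes all sleepers
		return
	}
	for b.completedTasks < b.total {
		b.cond.Wait() // sleep until every goroutine has arrived
	}
}

func main() {
	const n = 4
	b := newBarrier(n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// ... process tasks id, id+n, id+2n, ... (round-robin i % N)
			b.arrive() // synchronize before the report (reduce) phase
		}(i)
	}
	wg.Wait()
	fmt.Println("all goroutines passed the barrier; generate report")
}
```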

3. Work-Stealing Refinement

  • Overview: In this mode, work stealing is introduced to address the issue of idle goroutines in the basic parallel version, particularly when tasks are nonuniform in complexity. Goroutines that finish their tasks quickly can "steal" tasks from other goroutines that are still working, improving overall efficiency.

    Key Components:

    1. Unbounded Double-Ended Queue:

      • PopBottom(): Allows a goroutine to pop a task from the bottom of the queue.

      • PushBottom(): Allows a goroutine to push a task to the bottom of the queue.

      • PopTop(): Allows other goroutines to steal tasks from the top of the queue.

      • This unbounded queue avoids issues like the ABA problem by using dynamic memory management.

    2. Work Stealing Process:

      • Each goroutine is assigned a double-ended queue of tasks.

      • Once a goroutine finishes its tasks, it starts checking other goroutines' queues for tasks to steal, starting at a random thread ID and checking others sequentially.

      • A goroutine can steal one task at a time and immediately starts processing it.

      • If no tasks are available to steal, the goroutine enters the barrier and waits for others to finish.

    3. Synchronization:

      • The process of work stealing is synchronized with a barrier to ensure all goroutines finish their tasks before a report is generated, similar to the basic parallel version.


    Outcome: This mode helps balance the workload dynamically, reducing idle time and potentially improving the overall processing speed by enabling goroutines to steal tasks from each other.

Technical highlight

Performance Analysis on a Dataset with uniform-sized images:

  • Speedup Trends: Both the parallel and work-stealing versions showed a linear increase in speedup as the number of threads increased. However, the rate of speedup slowed and started plateauing, suggesting diminishing returns as more threads were added.

  • Potential Reasons for Plateauing: The decrease in speedup is likely due to:

    • Communication overhead: Threads may be idle while waiting for synchronization at the barrier in the parallel version.

    • Task checking in work-stealing: In the work-stealing version, threads check other threads for tasks to steal, which can lead to inefficiencies when tasks are nearly evenly distributed.

  • Work-Stealing Results:

    • The work-stealing implementation showed little to no improvement over the basic parallel version when processing a uniform dataset. Since all tasks had approximately the same processing time, threads finished around the same time, leaving few tasks to steal.

    • The slight improvement in the work-stealing version, particularly after 8 threads, may have occurred because the last threads (with fewer tasks) finished slightly earlier and could steal tasks from others with more assignments.

  • Conclusion: Work-stealing showed negligible performance gains for a uniform dataset, as the workload was evenly distributed. The minimal improvements came from the way tasks were assigned in a round-robin fashion, where threads with fewer tasks might have stolen tasks from overloaded threads. At higher thread counts (8 or 12), the limited number of tasks ran out before all threads could be fully utilized, reducing the effectiveness of work-stealing.

Technical highlight

Performance Analysis on Dataset with non-uniform-sized images:

Speedup Trends: Both the basic parallel and work-stealing versions showed increasing speedup with more threads, but at a decreasing rate up to 8 threads, as the benefits of parallelization were offset by communication overheads (similar to the explanation for the uniform dataset above).

  • Basic Parallel Plateaus: After 8 threads, the basic parallel version experienced a drastic plateau, likely due to:

    • Varying task sizes: Some threads were assigned tasks of vastly different sizes, causing some to finish quickly while others took much longer.

    • Idle threads: The threads that finished early had to wait idly for the slowest thread, leading to a bottleneck as the overall runtime was dominated by the slowest thread.

  • Work-Stealing Plateaus and Decreases:

    • In the work-stealing version, after 8 threads, the system didn't just plateau but showed a slight decrease in performance. This is because searching for tasks to steal became less efficient when most threads were either finished or couldn't find tasks to steal. This added unnecessary overhead, increasing the time it took for threads to determine they were done.

  • Key Benefit of Work-Stealing on Nonuniform Datasets:

    • The work-stealing implementation performed significantly better than the basic parallel version on nonuniform datasets. This is because when tasks vary in size, some threads finish their tasks faster than others.

    • In the basic parallel version, faster threads are left idle while waiting for the slowest thread, which creates a bottleneck. However, in work-stealing, faster threads can steal tasks from slower threads, helping balance the workload and preventing idle time.

    • This leads to a faster overall system, as work-stealing reduces the bottleneck caused by slow threads, and the system doesn't wait for just one thread to finish.

  • Conclusion: The work-stealing algorithm provides considerable benefits in scenarios with nonuniform tasks, improving performance significantly over the basic parallel version, where the system is constrained by the slowest thread. However, performance gains diminish past 8 threads due to the overhead involved in searching for tasks to steal among an increasing number of threads.

Takeaways

Through this project, I learned how to think about the granularity of tasks and at what level to parallelize. I also learned, concretely, how effective work stealing is on uniform versus nonuniform images; processing at the image level allowed me to construct datasets of both uniform and nonuniform nature.
