AWS for Genomic Insights

Scaling Annotation Pipelines with AWS

In our cloud computing course, we embarked on a rewarding project to develop a Software-as-a-Service (SaaS) solution for genomics analysis, aptly named the Genomics Analysis Service (GAS). The primary goal of GAS is to streamline genomic data processing and analysis, providing researchers and clinicians with a robust, scalable tool to interpret complex genomic information.

Organization

The University of Chicago

Core Technologies

AWS Lambda · Amazon DynamoDB · Amazon EC2 · Amazon S3 · Amazon SNS · Amazon SQS · AWS Step Functions

Domain

Cloud Computing

Date

May 2024

Full Dashboard

Technical highlight

The GAS provided the following core functionalities:

1. Log in (via Globus Auth) to use the service. Some aspects of the service are available only to registered users. Two classes of users were supported: Free and Premium. Premium users have access to additional functionality, such as larger file sizes and persistence of annotation result files.
2. Submit an annotation job. Free users may only submit jobs up to a certain size; Premium users may submit jobs of any size. If a Free user submits an oversized job, the system refuses it and prompts the user to upgrade to Premium.
3. Upgrade from a Free to a Premium user. Premium users are required to provide a credit card to pay for the service subscription. The GAS integrated with Stripe (www.stripe.com) for credit card payment processing.
4. Receive email notifications when annotation jobs finish. When an annotation request completes, the GAS sends the user an email with a link to view the log file and download the results file.
5. Browse jobs and download annotation results. The GAS stores annotation results for later retrieval. Users may view a list of their jobs (completed and running) and the log file for completed jobs.
6. Restrict data access for Free users. Free users may download their results file for a limited time after their job completes; thereafter the results file is archived and is only available again if they convert to a Premium user. Premium users always have all their data available for download.
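The size gate in item 2 reduces to a simple check at submission time. Below is a minimal, hypothetical sketch; the function name, role strings, and size limit are illustrative assumptions, not the actual GAS code:

```python
# Hypothetical job-size gate for annotation submissions.
FREE_USER_MAX_BYTES = 150_000  # assumed free-tier limit, not the real GAS value

def can_submit_job(user_role: str, file_size_bytes: int) -> tuple[bool, str]:
    """Return (allowed, message) for a submission attempt."""
    if user_role == "premium_user":
        # Premium users may submit jobs of any size.
        return True, "Job accepted."
    if file_size_bytes <= FREE_USER_MAX_BYTES:
        return True, "Job accepted."
    # Oversized free-tier job: refuse and prompt an upgrade.
    return False, "File too large for the free tier; please upgrade to Premium."
```

The web endpoint would call this before presigning the S3 upload, so oversized free-tier files are rejected before any data moves.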

Extracted currency modules

Technical highlight

I built an architecture that automates the thawing of genomic files stored in AWS Glacier when free-tier users upgrade to premium. This architecture leverages several AWS services, including SNS, SQS, Lambda, DynamoDB, and S3, ensuring efficient and reliable file retrieval and storage.

Key components and workflow:

1. Premium user upgrade event: When a user links their Stripe account to the Genomics Analysis Service (GAS) and upgrades to premium, a message about the upgrade is published to the SNS topic navyavedachala_a16_start_thaw.
2. Thaw endpoint subscription: The /thaw API endpoint is subscribed to the navyavedachala_a16_start_thaw SNS topic. When a user upgrades, it receives the notification and begins processing.
3. Polling from an SQS queue: An SQS queue, navyavedachala_a16_start_thaw, subscribed to the SNS topic, is polled for messages. Using SQS ensures each message is processed at least once, increasing system reliability in case of downtime or errors.
4. Querying DynamoDB for archived files: Within the /thaw endpoint, the DynamoDB table is queried for jobs that have an associated results_file_archive_id, indicating archived files from free-tier users. I optimized this query by avoiding a full table scan, keeping the cost minimal.
5. File information extraction: The query also retrieves the s3_key_result_file and job_id for the corresponding jobs. This metadata is passed into the Glacier retrieval process to track the files and ensure accurate restoration.
6. Initiating the Glacier job: For each archived file, a Glacier retrieval job is initiated using the glacier_client.initiate_job function.
7. Receiving the Glacier completion notification: An SNS topic, navyavedachala_a16_complete_thaw, receives a message from Glacier when a file has been successfully retrieved. The SQS queue navyavedachala_a16_complete_thaw is subscribed to this topic and captures these completion messages.
8. Lambda function for file restoration: The Lambda function navyavedachala_a16_restore is triggered by messages in the SQS queue.
9. Cleanup and archive deletion: Glacier supplies a glacier_job_id when retrieval finishes. The Lambda function uses this glacier_job_id to save the retrieved file to S3, then deletes the archived Glacier file and removes the message from the SQS queue, completing the thawing process.
10. Frontend integration: The frontend reflects the file restoration status.
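Because the queues in steps 3 and 7 are subscribed to SNS topics, each SQS message body carries an SNS envelope whose Message field is itself a JSON string. A small helper to unwrap it can be sketched as follows; the inner payload's field names here are illustrative assumptions, while the Type/TopicArn/Message envelope fields are the standard SNS-to-SQS delivery format:

```python
import json

def unwrap_sns_from_sqs(sqs_body: str) -> dict:
    """Parse an SQS message body that wraps an SNS notification.

    SNS delivers to SQS as a JSON envelope; the actual payload is the
    JSON-encoded string stored in the envelope's "Message" field.
    """
    envelope = json.loads(sqs_body)
    return json.loads(envelope["Message"])

# Example envelope, shaped like what SNS would deliver to the queue
# (topic ARN abbreviated; user_id/role fields are hypothetical):
body = json.dumps({
    "Type": "Notification",
    "TopicArn": "arn:aws:sns:us-east-1:000000000000:navyavedachala_a16_start_thaw",
    "Message": json.dumps({"user_id": "abc123", "role": "premium_user"}),
})
payload = unwrap_sns_from_sqs(body)
```

Forgetting this double decode is a common stumbling block when wiring SNS to SQS, so isolating it in one helper keeps both the /thaw poller and the restore Lambda consistent.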

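The restore-and-cleanup work in steps 8 and 9 can be sketched as a handler that takes the AWS clients as parameters (which also makes it easy to test with stubs). This is a hypothetical outline: the JobId and ArchiveId fields are standard in Glacier's job-completion notification, but the surrounding parameter names are assumptions, not the actual GAS code:

```python
import json

def restore_thawed_file(record, glacier, s3, vault_name, bucket, s3_key):
    """Copy one thawed Glacier archive into S3, then delete the archive.

    `record` is an SQS record whose body wraps the SNS notification that
    Glacier publishes when an archive-retrieval job completes.
    """
    envelope = json.loads(record["body"])
    notice = json.loads(envelope["Message"])
    job_id = notice["JobId"]          # the retrieval job's glacier_job_id
    archive_id = notice["ArchiveId"]  # the archive to delete afterwards

    # Stream the retrieved bytes out of Glacier...
    output = glacier.get_job_output(vaultName=vault_name, jobId=job_id)
    data = output["body"].read()

    # ...into the results bucket under the job's original key...
    s3.put_object(Bucket=bucket, Key=s3_key, Body=data)

    # ...and drop the now-redundant Glacier archive.
    glacier.delete_archive(vaultName=vault_name, archiveId=archive_id)
    return s3_key
```

Passing `glacier` and `s3` in (rather than creating boto3 clients inside the function) keeps the Lambda handler thin: it just builds the clients once and calls this per record.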

Takeaways

This project deepened my understanding of key cloud computing concepts, including message queues, serverless functions, and tiered storage. It reinforced my ability to design robust architectures that automate workflows across AWS services, ensuring high availability and cost efficiency.
