

Experienced Data Engineer with a strong background in cloud technologies, specializing in AWS and GCP. Proficient in designing and implementing scalable ETL/ELT pipelines, data warehousing, and advanced analytics solutions. Skilled in leveraging technologies such as Apache Airflow, Spark, and Kafka for efficient data processing and real-time streaming. Demonstrated expertise in integrating AI and machine learning models to enhance data extraction and search capabilities. Committed to data quality and governance, with a track record of improving data integrity and optimizing query performance in large-scale systems.
Vosyn Inc.
February 2025 - April 2025
• Designed and implemented a scalable PostgreSQL database schema for an HR management system, ensuring ACID compliance and normalized data structures to track employee and payroll details.
• Led the migration of an internal HR tool from development to production, collaborating with cross-functional teams to implement database changes, ensure data integrity, enforce security protocols, and document PostgreSQL backup strategies on GCE.
• Engineered automated ETL pipelines using Cloud Functions and Dataflow with validation checks to streamline intern-hour analytics, employee onboarding, and share-issuance reporting, cutting manual effort and ensuring data quality across financial metrics such as hourly pay, leaves, and billable hours.
• Provided leadership and documentation support by designing SOA, delivering KT sessions to new engineers, and leading team discussions on translating stakeholder requirements into technical solutions.
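The validation checks mentioned above could look something like the following minimal Python sketch. All field names and rules here are hypothetical illustrations, not the actual pipeline code:

```python
def validate_payroll_record(record: dict) -> list:
    """Return a list of validation errors for one payroll record.

    Illustrative only: field names ('employee_id', 'hourly_pay',
    'billable_hours') and thresholds are assumptions, not the
    production schema.
    """
    errors = []
    # Required fields must be present and non-empty.
    for field in ("employee_id", "hourly_pay", "billable_hours"):
        if record.get(field) in (None, ""):
            errors.append("missing field: " + field)
    # Numeric sanity checks on financial metrics.
    pay = record.get("hourly_pay")
    if isinstance(pay, (int, float)) and pay < 0:
        errors.append("hourly_pay must be non-negative")
    hours = record.get("billable_hours")
    if isinstance(hours, (int, float)) and not (0 <= hours <= 80):
        errors.append("billable_hours out of expected weekly range")
    return errors
```

In a Cloud Functions-style pipeline, records failing such checks would typically be diverted to a dead-letter location rather than loaded.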
CrowdDoing
June 2024 - January 2025
• Designed and implemented scalable ETL pipelines using MinIO, Docker, and Apache Airflow to support efficient extraction, transformation, and loading of book data and metadata.
• Integrated LLMs to extract and consolidate information into JSON files and managed data quality with automated checks, achieving a 40% improvement in context and quote extraction accuracy.
• Developed a knowledge graph in Neo4j for storing vector embeddings and relationships, and integrated OpenAI APIs and RAG models with LangChain for contextual searches.
• Implemented query routing to distinguish contextual from data-specific queries, significantly enhancing application search functionality and user response precision.
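Query routing of the kind described above can be sketched as a simple classifier. This keyword heuristic is a deliberately simplified stand-in for the actual routing logic (which may have used an LLM); the keyword list and labels are assumptions:

```python
def route_query(query: str) -> str:
    """Route a user query to 'data' (structured metadata lookup) or
    'contextual' (semantic/RAG search over book content).

    Simplified illustration: real routing would likely use an LLM or
    embedding classifier rather than fixed keywords.
    """
    data_keywords = ("isbn", "author", "publisher", "page count", "publication date")
    q = query.lower()
    # Metadata-style terms suggest a structured lookup...
    if any(keyword in q for keyword in data_keywords):
        return "data"
    # ...otherwise fall back to contextual (vector) search.
    return "contextual"
```

The router's output would decide whether the application queries the Neo4j graph directly or runs a RAG retrieval over embeddings.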
Indiana University, University Information Technology Services
January 2023 - May 2024
• Architected a scalable, cloud-based ETL pipeline for student and HR data from PeopleSoft Oracle databases using AWS services (S3, Glue, Lambda, Step Functions, Redshift).
• Automated infrastructure provisioning with CloudFormation, reducing deployment time by 70%.
• Developed AWS Glue ETL jobs for data extraction, transformation (including PII handling), and loading into Redshift, implementing incremental loading to decrease processing time by 60%.
• Designed a dimensional data model in Redshift, optimizing query performance through distribution styles and indexing, and established RBAC for enhanced data governance and security compliance.
• Developed and executed comprehensive data quality checks that improved the integrity of datasets used by analytics teams, ensuring consistent monitoring and validation of over 10,000 records weekly.
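The core idea behind the incremental loading mentioned above is a watermark: only rows modified since the last successful load are processed. A minimal sketch, with a hypothetical `updated_at` column standing in for whatever change-tracking field the source tables actually used:

```python
def incremental_batch(rows, last_watermark):
    """Select only rows changed after the previous load's watermark.

    rows: iterable of dicts with an 'updated_at' value (hypothetical
    schema). Returns the new batch and the advanced watermark, which
    would be persisted for the next run (e.g. via Glue job bookmarks
    or a control table).
    """
    batch = [r for r in rows if r["updated_at"] > last_watermark]
    # Advance the watermark to the newest row seen; keep the old one
    # if nothing changed.
    new_watermark = max((r["updated_at"] for r in batch), default=last_watermark)
    return batch, new_watermark
```

Processing only the delta instead of a full table scan is what drives reductions like the 60% figure above.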
Indiana University Bloomington, May 2024
Computer Science
University of Mumbai, May 2022
Computer Engineering
Amazon Web Services
Issued: 10/25/2024 - Expires: 10/25/2027
Credential ID: c2076df0c4314cbc8971a8eb547aacd9
This project implements an ELT (Extract, Load, Transform) data pipeline to process and analyze global health data using Google Cloud Platform (GCP), Apache Airflow, and dbt (data build tool). The pipeline extracts data from Google Cloud Storage (GCS), loads it into BigQuery, and applies transformations to create region-specific tables and views for analysis.
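The final transformation step, splitting a global dataset into region-specific tables, can be illustrated in plain Python (in the project itself this would be dbt SQL models; the `region` field name is an assumption):

```python
from collections import defaultdict

def split_by_region(records):
    """Group health records into per-region buckets, mirroring the
    region-specific tables the dbt transformations would create.

    Illustrative only: the actual partitioning logic lives in dbt
    models over BigQuery, not application code.
    """
    tables = defaultdict(list)
    for rec in records:
        # Records without a region go to a catch-all bucket.
        tables[rec.get("region", "unknown")].append(rec)
    return dict(tables)
```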
Developed and optimized a real-time car travel data simulation system using Apache Kafka and Spark, achieving a 50% increase in data processing speed. Utilized AWS S3, Glue, and Redshift for efficient data storage and transformation, enabling advanced analysis with Tableau.
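The simulation side of this project can be sketched as an event generator; in the real system these events would be published to a Kafka topic. The event schema below is entirely illustrative:

```python
import random

def generate_travel_events(n, seed=42):
    """Generate n simulated car travel events of the kind a Kafka
    producer would stream. Field names are hypothetical, not the
    project's actual schema.
    """
    rng = random.Random(seed)  # seeded for reproducible simulation runs
    events = []
    for i in range(n):
        events.append({
            "vehicle_id": "car-" + str(rng.randint(1, 50)),
            "speed_kmh": round(rng.uniform(0, 120), 1),
            "lat": round(rng.uniform(-90, 90), 5),
            "lon": round(rng.uniform(-180, 180), 5),
            "seq": i,
        })
    return events
```

Downstream, a Spark job would consume the topic, and S3/Glue/Redshift would handle storage and transformation as described above.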
Spearheaded an end-to-end Extract, Transform, Load (ETL) process for Amazon's mobile sales orders using Snowpark and Snowflake, facilitating seamless data flow from three regions while ensuring integrity and accessibility. Analysis revealed a notable 10% annual growth in sales, empowering stakeholders with actionable insights. Tools included SQL, Snowflake, the Snowpark Python API, and AWS S3 and IAM for secure data management and analysis.
A robust ETL (Extract, Transform, Load) pipeline built on AWS for processing student enrollment data, implementing data quality checks, and maintaining a secure data warehouse.