Mustafa Mirza

Summary

Senior Data Engineer with expertise in designing scalable Data Mesh architectures, optimizing low-latency data processing using Apache Spark, Apache Flink, and Apache Kafka, and implementing robust microservices-based distributed systems. Experienced in developing and automating real-time ingestion pipelines with Airflow, DBT, and Trino on AWS. Strong background in data governance (RBAC, PII compliance, Apache Ranger) and performance optimization. Passionate about leveraging open-source technologies to build self-service, high-performance data platforms.

Overview

3 years of professional experience

Work History

Lead Platform Data Engineer

HugoBank
07.2024 - Current
  • Designed and implemented low-latency ingestion pipelines leveraging Apache Spark, Apache Flink, and Kafka, enabling real-time processing and reducing data ingestion latency by 40% for scalable Data Mesh solutions.
  • Optimized data storage and retrieval using Hudi, Trino, and OpenMetadata, ensuring 99.9% data availability, reducing query response times by 50%, and improving overall governance.
  • Developed and executed Data Mesh-driven strategies, implementing self-service data platforms, increasing data democratization and access speed by 3x across distributed teams.
  • Built microservices-based architectures for event-driven data processing, leading to a 30% improvement in pipeline scalability and integration across AWS, GCP, and Azure.
  • Automated data workflows with Airflow and Cosmos, reducing manual intervention by 70%, ensuring faster and more reliable data operations.
  • Led a high-performing team of data engineers, driving a 25% increase in engineering productivity across data analytics and platform automation.

Senior Software Engineer - Big Data and Platform

Bazaar Technologies
07.2022 - 07.2024
  • Spearheaded scalable Data Warehouse initiatives, ensuring data quality and lineage with DBT, reducing data freshness lag from 12 hours to under 30 minutes.
  • Developed and maintained 400+ data pipelines using Apache Hudi, Spark, and Airflow, processing 100+ terabytes of data daily while improving pipeline execution efficiency by 60%.
  • Built an enterprise-grade analytics platform with Apache Superset and Tableau, enabling teams to generate 1,000+ dashboards, increasing real-time reporting accuracy by 35%.
  • Enhanced architecture monitoring by integrating Prometheus and Loki, reducing MTTR (Mean Time to Resolution) of system failures by 50%, improving overall service uptime to 99.98%.
  • Implemented fine-grained data governance policies using Apache Ranger, ensuring 100% GDPR compliance, reducing unauthorized data access incidents by 80%.
  • Optimized AWS infrastructure costs, achieving $12,000/month savings through Spark job optimizations, improved auto-scaling with Karpenter, and S3 storage efficiency improvements.
  • Developed a Data-as-a-Service (DaaS) solution using GoLang, Trino, and Apache Pinot, reducing query response times from 5 seconds to under 500ms, supporting millions of analytical queries per day.

Education

Bachelor's - Computer Science

FAST-NUCES
08.2022

Skills

  • Big Data & Streaming: Apache Spark, Apache Flink, Apache Kafka, Hadoop
  • Cloud & Orchestration: AWS (S3, EMR, Glue, Lambda), Airflow, Kubernetes, Docker, Terraform
  • Data Processing & Storage: Trino, Hive, Hudi, DBT, Redshift, Pinot
  • Programming & Development: Python, SQL, Scala, GoLang, Bash
  • Architecture & Governance: Data Mesh, Microservices, Data Security (RBAC, PII Compliance, Apache Ranger), ETL Development, Cost Optimization

Projects

Financial Data Platform – AWS-Based, Regulatory-Compliant

 Technologies: Apache Spark, Flink, Kafka, Airflow, Hudi, Trino, OpenMetadata, AWS (S3, EMR, Glue), Python, Scala

  • Architected a self-service Data Mesh platform, enabling distributed teams to autonomously manage, process, and access high-quality data.
  • Built ingestion pipelines with Spark and Flink, integrating streaming and batch data sources into a centralized Data Lake.
  • Ensured schema evolution and data drift management using DBT with Trino, maintaining high data integrity.
  • Integrated OpenMetadata to enhance data cataloging, lineage tracking, and discoverability across multiple business units.


Enterprise Data Mesh Platform

 Technologies: AWS (S3, EMR, Glue), Spark, Airflow, DBT, Apache Ranger, Trino

  • Developed an AWS-based modular data platform, ensuring compliance with financial regulations and security best practices.
  • Automated ETL processes using Airflow and DBT, enabling real-time reporting and predictive analytics.
  • Implemented Apache Ranger for role-based access control (RBAC), data masking, and regulatory compliance.
