Lead Data Engineer (Spark/Scala)

Bengaluru

Xebia

About the Company

A technology-driven organization is seeking a Lead Data Engineer to join its data engineering team. The company focuses on building scalable data solutions and platforms that support batch and real-time processing pipelines for large-scale distributed systems.

About the Role

The Lead Data Engineer is responsible for designing and implementing robust data engineering solutions using Scala and Apache Spark. The role involves building scalable batch and streaming data pipelines, deploying workloads on Kubernetes, and orchestrating workflows through Apache Airflow.

The position requires close collaboration with engineering teams to ensure high-quality, maintainable, and efficient data systems that follow modern software engineering practices.

Responsibilities

  • Design and implement scalable batch and real-time data engineering solutions using Apache Spark (Scala) and Spark Structured Streaming.
  • Architect modular, reusable, and testable Scala codebases following SOLID and clean architecture principles.
  • Develop, deploy, and manage Spark workloads on Kubernetes clusters ensuring scalability, fault tolerance, and efficient resource utilization.
  • Orchestrate data workflows using Apache Airflow, including DAG design, task dependencies, retries, and SLA monitoring.
  • Develop and maintain unit and integration tests for data pipelines and utilities.
  • Tune the performance of data processing jobs, including partitioning strategies and job-level optimization.
  • Enforce version control best practices including branching strategies, pull requests, and code reviews.
  • Support CI/CD processes for automated testing and deployment.
  • Maintain clear and structured technical documentation including README files and inline documentation.
  • Participate in design reviews and provide technical guidance to team members and junior engineers.

Requirements

  • 6 to 12 years of overall experience in data engineering roles.
  • Minimum 5 years of experience with Hadoop and Apache Spark.
  • Strong proficiency in Scala and Java.

Technical Skills

  • Experience with Apache Spark, Spark Structured Streaming, Hadoop, Kafka, and Airflow.
  • Strong understanding of distributed systems and big data architectures.
  • Experience deploying and managing Spark workloads on Kubernetes clusters.
  • Knowledge of YARN and Oozie.
  • Strong software engineering fundamentals including SOLID and DRY principles.
  • Experience in functional programming with Scala, including case classes, complex data structures, and algorithms.
  • Experience with Docker, Helm, and containerized environments.
  • Experience in data validation and data quality frameworks.
  • Experience with CI/CD and build tooling such as Jenkins, Maven, GitHub, and GitHub Actions.
  • Strong experience in automated unit and integration testing frameworks.
