Title: Reliability Engineering Lead


Hyderabad, TG, IN; Gurugram, HR, IN; Bengaluru, KA, IN


Arcesium seeks a highly skilled Site Reliability Engineer to join our Technology team. You will work as part of a cross-functional product team to build elegant solutions to complex, intricate business challenges.


What you'll do:

  • Supervise a team of SREs, ensuring that the production applications the team supports are stable, reliable, and well documented
  • Own the end-to-end availability and performance of mission-critical services
  • Identify and scope areas for improvement, and build automation to reduce toil
  • Assist in the rollout and deployment of new product features and installations to support rapid iteration and continued growth
  • Contribute to the design and architecture of the system
  • Manage capacity so that utilization stays optimal
  • Identify top errors and reliability issues, and drive root-cause analysis to prevent repeat incidents
  • Analyze and debug complex issues across tiers, from frontend to mid-tier to infrastructure
  • Actively participate in 24x7 on-call support on a rotational basis (at least one week per month)


What you'll need:

  • 7 to 9 years of experience managing systems in large-scale production environments.
  • Experience with a variety of tools that help manage, understand, and debug large, complex distributed systems.
  • Good knowledge of Unix systems, networking, web technologies, databases, and public cloud platforms such as AWS.
  • Experience with monitoring and logging tools (e.g., Datadog, ELK, Prometheus, Grafana).
  • Experience automating the software development, testing, and deployment lifecycle with continuous integration and continuous deployment (CI/CD).
  • Deep experience with Kubernetes and Docker.
  • Deep understanding of SRE concepts like SLAs, SLOs, SLIs, error budgets, MTTR, MTTD, etc.
  • Incident management experience, coupled with effective communication skills.
  • Experience leading cross-departmental efforts, communicating and negotiating with multiple teams to accomplish goals.
  • Expertise in troubleshooting issues and bugs.
  • Programming experience (Python/Java/Go/C/shell).
  • Experience in financial domain (desirable).
  • Prior SRE/DevOps experience desirable.

Arcesium and its affiliates do not discriminate in employment matters on the basis of race, color, religion, gender, gender identity, pregnancy, national origin, age, military service eligibility, veteran status, sexual orientation, marital status, disability, or any other category protected by law. For us, this is more than legal boilerplate. We are genuinely committed to these principles, which form an important part of our corporate culture, and we are eager to hear from extraordinarily well-qualified individuals with a wide range of backgrounds and personal characteristics.