Title: Reliability Engineer Lead - Distributed Systems


Hyderabad, TG, IN; Bengaluru, KA, IN; Gurugram, HR, IN


Team Summary

Arcesium seeks a highly skilled Site Reliability Engineer to join our Technology team. You will work as part of a cross-functional product team to create elegant solutions to complex business challenges.

What You'll Do

  • Working with the rest of the team to deploy, maintain, and run a highly available, multi-tenant distributed system
  • Automating both infrastructure creation and application deployment to that environment
  • Contributing to the design and architecture of the system
  • Programming in the core application (e.g., instrumenting code with monitoring metrics, setting up traces, shipping and organizing logs)
  • Ensuring the system performs as intended


The ideal candidate will have at least 6 years of experience in an SRE/Operations/DevOps role running distributed systems in production.

What You'll Need


  • Experience with automated provisioning and management of AWS infrastructure and services
  • Deep experience with Kubernetes and Docker
  • Experience automating the software dev/test/deployment lifecycle with continuous integration and continuous deployment
  • Experience with scaling, monitoring, and troubleshooting actively running systems
  • Ability to program in Python / Bash
  • Comfort with configuration management tools such as Ansible, Chef, or Puppet
  • Familiarity with other technologies: Fluentd, key-value datastores, API management/service meshes, Git, key management