Title: Senior Reliability Engineer


Hyderabad, TG, IN; Gurugram, HR, IN; Bengaluru, KA, IN


Team Summary

Arcesium seeks a highly skilled Site Reliability Engineer to join our Technology team. You will be working as part of a cross-functional product team to create elegant solutions to highly complex and intricate business challenges.

What You'll Do

  • Supervise a team of SREs, ensuring that the production applications the team supports are stable, reliable, and well documented.
  • Own the end-to-end availability and performance of mission-critical services.
  • Identify and scope areas for improvement, and build automation to reduce toil.
  • Assist in the rollout and deployment of new product features and installations to facilitate rapid iteration and constant growth.
  • Contribute to the design and architecture of the system.
  • Tune capacity to keep utilization optimal.
  • Identify top errors and reliability issues, and drive root-cause analysis to prevent repeat incidents.
  • Analyze and debug complex issues across tiers, from frontend to mid-tier to infrastructure.
  • Actively participate in 24x7 on-call support on a rotational basis, at least one week per month.

What You'll Need

  • Programming experience (Python/Java/Go/C/shell)
  • 4 to 7 years of experience in Operations and/or Development teams
  • Good understanding of standard networking protocols and security concerns (SSL, VPN, VPC)
  • Experience with monitoring and data analysis tools such as Nagios, Prometheus, and DataDog
  • Strong knowledge of Linux systems and internals
  • Debugging and troubleshooting skills using tools such as strace, tcpdump, Wireshark, auditd, and gdb
  • Exposure to a cloud platform, e.g. AWS
  • Working knowledge of application servers, servlet containers, and web servers
  • Good communication and collaboration skills, and attention to detail