
Location: Hybrid (Alpharetta, GA – 3 days/week in office)
Type: Full-Time

We are seeking a Site Reliability Engineer to join our team and play a key role in enhancing the stability, performance, and reliability of our production systems. You’ll work closely with development, DevOps, and security teams to improve observability, optimize system performance, and ensure production readiness. From monitoring to automation, you’ll make a direct impact on our cloud infrastructure and service reliability.

 

In this role, you will work hand-in-hand with our development, operations, and security teams worldwide to implement best practices, automate deployments, and ensure our platforms are reliable, secure, and scalable. Troubleshooting in Kubernetes requires a deep understanding of pods, nodes, networking, scaling, logs, and service-to-service communication.

 

This role requires a deep understanding of SRE best practices and a strong ability to troubleshoot complex issues.

 

Your responsibilities in this role will include:

  • Identify opportunities for automation and ensure continuous security and quality in application development by automating security checks and test executions in build and deployment pipelines.
  • Deploy and manage Kubernetes workloads on AWS EKS(A) using Helm and ArgoCD.
  • Work with Kubernetes and take responsibility for ensuring that applications and clusters stay reliable, performant, scalable, and observable.
  • Collaborate with development, operations, and security teams to build secure, optimized, and efficient pipelines.
  • Create comprehensive documentation on pipeline functionality and provide training to team members as needed.
  • Proactively monitor system performance and identify potential issues before they become critical.
  • Participate in on-call rotation. Troubleshoot production issues and perform root cause analysis.
  • Engage in continuous learning and actively advocate for Dev(Sec)Ops and GitOps best practices and standards across the team.

 

We are looking for you to have the following skills and experience:

  • 8+ years of experience as a Site Reliability Engineer or in an equivalent role.
  • Experience with tools like New Relic for monitoring and Graylog for logging.
  • 3+ years of experience with Amazon Web Services (AWS) or Microsoft Azure
  • 3+ years of experience with Kubernetes clusters, including performance monitoring in Kubernetes.
  • Proficiency with public cloud environments (AWS preferred)
  • Proficiency in a scripting language such as Bash, Groovy, or Python.
  • Excellent debugging and troubleshooting skills.
  • Ability to prioritize tasks efficiently and independently under minimal supervision.

Nice to Have

  • AWS Cloud certification
  • Familiarity with .NET applications.
  • Knowledge of Terraform, Ansible, and monitoring tools.

This is a full-time role. We are unable to sponsor visas, so you must be a U.S. citizen or a Green Card holder. We work onsite three days each week in our Alpharetta office, so you must live in the Atlanta area, within commuting distance of the office. If you thrive on solving complex technical challenges, have a passion for automation, and want to influence how enterprise platforms evolve and modernize, this is an ideal opportunity for you.

Ready to take the next step in your SRE career? Apply now and help us build the future of reliable systems!
