Senior Site Reliability Engineer
About the Senior Site Reliability Engineer role
We’re seeking a senior infrastructure-focused engineer to help operate and scale a complex production platform where reliability, performance, and visibility are first-class concerns. You’ll work closely with software engineers and data-focused teams to design, automate, and run the systems that support large-scale data processing, machine learning workloads, and low-latency services.
This role is best suited for a hands-on infrastructure or systems engineer with strong SRE experience, rather than a narrowly scoped, cloud-only SRE.
The work spans the full lifecycle of infrastructure, from architecture and automation to incident response and continuous improvement. You’ll have broad ownership across the stack, with direct responsibility for infrastructure architecture decisions, operational practices, and how reliability and observability are implemented as systems scale.
As the platform scales to support larger data volumes, more compute-intensive workloads, and lower-latency services, this role plays a critical part in ensuring reliability keeps pace with growth.
Salary/OTE: $170,000 - $250,000
Location: San Francisco, CA (on-site)
What You'll Do
- Architect, operate, and evolve scalable infrastructure supporting data-driven and compute-intensive workloads, including ownership of key architectural tradeoffs
- Improve platform stability and performance through automation, observability, and capacity forecasting
- Build and maintain deployment workflows, release automation, and configuration management systems
- Define, own, and continuously improve operational practices including monitoring, alerting, on-call processes, and incident response workflows
- Partner cross-functionally to embed reliability and performance best practices into engineering workflows
- Maintain a strong security and operational posture across production environments
- Lead incident reviews and drive long-term corrective and preventative improvements
This role is on-site in San Francisco to enable close collaboration with platform, data, and hardware-adjacent engineering teams.
What Will Help You Succeed
- 8+ years of experience in infrastructure, SRE, or systems engineering roles
- 5+ years of hands-on work with physical infrastructure, systems administration, and/or networking in environments supporting production workloads
- Deep experience with containerized systems and orchestration platforms (Docker, Kubernetes)
- Strong command of Linux internals, networking fundamentals, and performance optimization
- Hands-on experience with infrastructure-as-code tools such as Terraform and Ansible
- Ability to write and maintain clean, reliable automation in languages such as Bash and Python
- Practical experience implementing observability stacks (metrics, logs, tracing)
- Familiarity with modern CI/CD pipelines and deployment tooling
- Strong troubleshooting skills across distributed systems
- Clear communicator who works effectively across engineering disciplines
- Experience taking systems from earlier-stage architectures to more mature, production-hardened platforms is highly valued
- Visa sponsorship is not available for this role