San Francisco, California, United States

Software Engineer, Infrastructure Reliability

About the Team

We’re hiring software engineers to join our broader Infrastructure organization, which supports multiple high-impact teams. Depending on your interests and experience, you could work on one of several focus areas, including Core Distributed Systems, Databases, Observability, or Cloud Infrastructure.

Company: OpenAI
Compensation: $255K - $405K
Schedule: Full-Time
Role overview

What this role actually needs.

Software Engineer, Infrastructure Reliability at OpenAI in San Francisco.

Responsibilities

Day-to-day expectations

A clear list of the work this role is designed to cover.

  • Design, build, and operate reliable and performant systems used across engineering.
  • Identify and fix performance bottlenecks and inefficiencies, ensuring our infrastructure can scale to the next order of magnitude.
  • Dig deep to resolve complex issues.
  • Continuously improve automation to reduce manual work. Improve internal tooling and our developer experience.
  • Contribute to incident response, postmortems, and the development of best practices around system reliability and scalability.
  • Have a deep understanding of distributed systems principles and a proven track record in building and operating scalable and reliable systems.
Requirements

What a strong candidate brings

  • 4+ years of relevant industry experience, with 2+ years leading large-scale, complex projects or teams as an engineer or tech lead.
  • A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement.
  • Proven experience as a reliability engineer, production engineer, or in a similar role at a fast-paced, rapidly scaling company.
  • Strong proficiency in cloud infrastructure (such as AWS, GCP, or Azure) and IaC tools such as Terraform. Proficiency in programming/scripting languages.
  • Experience with containerization technologies and container orchestration platforms like Kubernetes.
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack.

    Subscriber playbook

    Turn this listing into an application plan.

    This is the first pass at the premium UpJobz layer: a fast brief that helps serious applicants move with more clarity.

    Next moves

    • Tailor your resume around AI and LLM work instead of sending a generic application.
    • Use the first two bullets of your application to connect your background directly to the Software Engineer, Infrastructure Reliability role, an on-site position in San Francisco that is most realistic for United States residents.
    • Open the role quickly if it fits and bookmark three similar jobs before you leave the page.

    Interview themes

    Cloud & DevOps, On-site, ai, llm, research, kubernetes

    Watchouts

    • $255K - $405K is visible, so calibrate your application around the posted range.
    • State that you are a United States resident as part of your positioning so the recruiter does not have to infer it.
    • Show concrete examples of succeeding in on-site environments.