System Reliability Engineer

Full time
Fully remote
Senior

Job Description:

This role is a greenfield opportunity to build out a multi-cloud GPU compute architecture and to set the engineering standards for reliable code. You will work with the Chief Product Officer to instill best practices, automation, and secure, reliable infrastructure across the organization. The infrastructure will be cloud agnostic and must be able to drop into any cloud as well as customer on-premises environments.

Responsibilities:

  • Design and implement:
    • Cloud agnostic infrastructure
    • Observability
    • Deployments
    • Escalation policy
    • Post-resolution evaluation
    • Operational support plans
    • Disaster recovery plans
  • Educate and guide engineering team on reliable software practices
  • Participate in design of application architecture

You:

  • Have planned and built greenfield infrastructure spanning multiple environment tiers, from test to production
  • Have experience with Nvidia GPU deployments
  • Have experience with all three major cloud providers
  • Are deeply knowledgeable about containers and container orchestration systems like Kubernetes
  • Enjoy educating engineers about building reliable systems
  • Consider automation second nature
  • Understand cloud network architecture, security, load balancing, edge protections, and other critical cloud systems
  • Are eager to deploy environments on-premises, in private clouds, or in public clouds
  • Are in love with observability

How to Apply:

To apply, email your resume and cover letter to careers@codevalet.com and be sure to include the phrase "I want to empower 10x engineers." Alternatively, reach out to Jan Drake at https://www.linkedin.com/in/janman with the same information.