System Reliability Engineer

Full time

Fully remote

Senior

Job Description:

This role provides a greenfield opportunity to build out a multi-cloud GPU compute architecture and to set the engineering standards for reliable code. It involves working with the Chief Product Officer to imbue the organization with best practices, automation, and secure, reliable infrastructure. This infrastructure will be cloud agnostic and must be deployable to any cloud as well as to customer on-premises environments.

Responsibilities:

Design and implement:
- Cloud agnostic infrastructure
- Observability
- Deployments
- Escalation policy
- Post-resolution evaluation
- Operational support plans
- Disaster recovery plans
Educate and guide the engineering team on reliable software practices
Participate in the design of application architecture

You:

Have planned and built greenfield infrastructure across multiple environment levels, from test to production
Have experience with NVIDIA GPU deployments
Have experience with all three major cloud providers
Are deeply knowledgeable about containers and container orchestration systems like Kubernetes
Enjoy educating engineers about building reliable systems
Consider automation second nature
Understand cloud network architecture, security, load balancing, edge protections, and other critical cloud systems
Are eager to deploy environments on premises, in private clouds, or in public clouds
Are in love with observability

How to Apply:

To apply, email your resume and cover letter to careers@codevalet.com and be sure to include the phrase "I want to empower 10x engineers." Alternatively, reach out to Jan Drake at https://www.linkedin.com/in/janman with the same information.