Aviatrix

Principal Engineer - Site Reliability

Posted 16 Days Ago

Remote

Hiring Remotely in United States

203K-227K

Senior level

The Principal Engineer - SRE will ensure system reliability through design, implementation, automation, and monitoring, while managing on-call duties and collaboration.

The summary above was generated by AI

The Aviatrix SRE team is a small but highly skilled global group of Systems Engineers/SREs dedicated to ensuring the reliability, availability, and performance of Aviatrix’s critical systems and services. Our mission is to build and maintain a robust, resilient infrastructure that enables Aviatrix to deliver high-quality services with agility through automation, best practices, and a culture of operational excellence.

About the Role

As an SRE – Principal Engineer, you’ll play a key role in designing, implementing, and maintaining highly available, fault-tolerant, and scalable systems. You’ll focus on automation, proactive monitoring, and Infrastructure-as-Code (IaC) to drive efficiency and reliability across our services.

Tech Stack & Responsibilities

Kubernetes – Manage application lifecycles, automate operational tasks, troubleshoot issues, integrate monitoring and alerting, optimize infrastructure, and ensure reliable operations using custom-built operators and cdk8s.
Terraform – Implement Infrastructure-as-Code (IaC) to enable rapid provisioning, seamless configuration changes, and efficient scaling.
Automation & Development – Build and enhance automation tools and frameworks in Golang and Python to streamline operations.

On-Call Rotation

We maintain a structured on-call rotation to ensure 24/7 coverage:

During Business Hours (rotates every 2 days)

EST: 9 AM – 6 PM
CST: 8 AM – 5 PM
PST: 6 AM – 3 PM

Outside Business Hours (6 PM – 9 AM PT, rotates weekly: Monday to Monday)

Location & Eligibility

This is a remote role open to candidates located in the US or Canada. You must be eligible to work in either country and currently reside there.

If you're passionate about building resilient infrastructure, automating operations, and ensuring system reliability at scale, we'd love to hear from you! 🚀

RESPONSIBILITIES:  

Ensure Reliability and Availability: You will ensure uptime for crucial services and systems based on business-required SLOs, and minimize service disruptions through proactive monitoring, capacity planning, and fault-tolerant design.
Architecture and System Design: You will design and architect complex, scalable, and reliable systems.
Automation and Efficiency: You will develop and implement automation tools and frameworks that automate routine tasks, reduce human error, and streamline operational processes.
Build Observability and Monitoring Tools: You will define, build, deploy, maintain, and extend our observability and monitoring tools to enhance system reliability and availability.
Incident Management and Response: You will help maintain an effective on-call rotation to ensure 24/7 coverage, and follow incident response procedures to swiftly address and mitigate service disruptions.
Performance Monitoring and SLIs/SLOs: You will help define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to set clear expectations for system performance.
Collaboration: You will work closely with product engineering to ensure service-level objectives and reliability targets are met.
Problem-Solving & Troubleshooting: You will respond to escalations by troubleshooting complex system and application incidents, performing root cause analysis, and implementing the necessary corrective actions.
Thought Leadership and Innovation: You will stay up to date with the latest industry trends and emerging technologies, and iterate on best practices to increase the quality and velocity of development and deliverables.

QUALIFICATIONS:   

8+ years of experience maintaining and deploying highly available, fault-tolerant systems at scale. 
Proficiency in Golang or Python is required.
Infrastructure-as-Code (IaC): Deep understanding of Terraform core components, with real-world experience using Terraform for infrastructure provisioning and management (Terragrunt is a bonus).
Experience with at least one cloud service provider (e.g., AWS, GCP, Azure, OCI)
Good knowledge of Kubernetes (cdk8s and operators are a bonus)
Solid experience developing automation tools and frameworks.
Experience with logging solutions (e.g., Loki, Syslog, Elasticsearch, Logstash, Kibana, Filebeat, Fluent Bit)
Experience with monitoring and metrics solutions (e.g., Prometheus, Grafana, VictoriaMetrics)
Practical experience with Linux system administration
Experience with version control systems (e.g., Git, GitHub) and code review
Excellent communication skills are required.

US Pay Range

The US annual base salary range for this full-time position is $202,900-$226,700 + benefits + 401(k) match + equity. The pay range is determined by the role, work location, job-related skills, level, experience, and relevant education. [Certain roles are eligible to earn sales commission, depending on the terms of the applicable plan.] The range displayed is the minimum and maximum target base salary and is applicable only for new hires for the listed position located in the US. Your Talent Advisor can share more details regarding salary ranges, benefits, and equity for your location during the hiring process.

BENEFITS

US: We cover 100% of employee premiums and 88% of dependent(s) premiums for medical, dental and vision coverage, 401(k) match, short and long-term disability, life/AD&D insurance, $1,000/year education reimbursement, and a flexible vacation policy.

Outside the US: We offer a comprehensive benefits package which (subject to regional variations) could include pension, private medical for you and dependents, generous holiday allowance, life assurance, long-term disability, and an annual wellbeing stipend.

Your total compensation package will be based on job-related knowledge, education, certifications and location, per our aligned ranges.

About Aviatrix 
Aviatrix is the cloud networking expert. We’re on a mission to make cloud networking simple so companies stay agile. Trusted by more than 500 of the world’s leading enterprises, our cloud networking platform creates the visibility, security, and control needed to adapt with ease and move ahead at speed. Combined with the Aviatrix Certified Engineer (ACE) Program, the industry's leading multicloud networking and security certification, Aviatrix empowers the cloud networking community to stay at the forefront of digital transformation.

WE WANT TO INCLUDE YOU

We embrace the fact that not everyone’s journey took the same route or started at the same place. If your experience doesn’t quite meet the requirements but the opportunity excites you and you believe you could be great, don’t let that hold you back from applying. Tell us what you CAN bring and what makes you special.

Aviatrix is a community where everyone's career can grow and we want to help you achieve your goals and be “your best YOU,” however that looks. If you're seeking an opportunity where you can be excited to start work every morning with enthusiastic people, make a real difference and be part of something amazing then let’s talk. We want to get to know you and how we could grow together.

Aviatrix, Inc. is an equal opportunity employer and does not make hiring decisions based on race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws. This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation and training.

CPRA - California Applicant Privacy Notice

Top Skills

AWS

Azure

Elasticsearch

Filebeat

Fluentbit

GCP

Git

Grafana

Kibana

Kubernetes

Linux

Logstash

Loki

OCI

Prometheus

Python

Syslog

Terraform

Victoria Metrics
