Observability at WarpBuild: Designing for Uptime and Reliability

Uptime and reliability are critical for any platform, but especially for WarpBuild, since CI is central to our customers' workflows. Poor uptime can block hotfixes and releases, causing significant business impact. At WarpBuild, our goal is to have a system that our customers can set and forget, so that it just works.

In the last two months, we overhauled our internal observability stack for better alerting and visibility. This post discusses how we architect for uptime and reliability at WarpBuild through three key pillars: intelligent automation, multi-cloud redundancy, and comprehensive observability. But first, let's understand the infrastructure landscape that makes this all possible.

In the previous post, we discussed how we built a zero-maintenance observability system using S3, presigned URLs, and OpenTelemetry for our users to view their job metrics and logs. This post focuses on the internal infrastructure that we use to ensure that all our customers' workloads always run regardless of infrastructure issues, capacity constraints, or unexpected demand spikes.

Infrastructure Overview

Compute Layer

At WarpBuild, we run GitHub Actions runners across multiple infrastructure stacks, optimizing for both performance and user experience.

Bare Metal Servers: Most Linux and macOS runners run on bare metal servers. This gives us maximum performance and control, which is critical for compute-intensive CI workloads.

Hyperscalers: Windows runners, ARM64 Linux runners, and Docker builders run on hyperscalers due to licensing (for Windows) and performance-related constraints (ARM64 instances on AWS and Azure are vastly superior to Ampere servers). When we need to rely on hyperscaler infrastructure, we primarily use AWS, with passive backup stacks on GCP and Azure for redundancy and failover.

Persistence and Backend Services

Persistence layers - including S3, databases, Redis, and SQS - are primarily hosted on AWS. Our backend services also run on AWS, but in a separate region isolated from the GitHub Actions runners. This isolation ensures that customer workloads don't impact our control plane and vice versa.

Critically, our backend services are infrastructure aware. They maintain real-time visibility into the state of our compute infrastructure and communicate bidirectionally with our orchestrators.
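
For illustration only (the type and method names below are hypothetical, not WarpBuild's actual interfaces), "infrastructure aware" boils down to the backend consuming a live state feed from each orchestrator and pushing placement commands back the other way:

```go
package scheduler

// PoolState is a hypothetical snapshot of one compute pool (bare metal,
// AWS, GCP, or Azure) as reported by its orchestrator.
type PoolState struct {
	Provider       string // "bare-metal", "aws", "gcp", "azure"
	Region         string
	Healthy        bool
	AvailableSlots int // runners that can still be scheduled right now
	PendingJobs    int // jobs currently queued against this pool
}

// OrchestratorLink models the bidirectional relationship: state flows up
// to the backend services, placement commands flow back down.
type OrchestratorLink interface {
	StateUpdates() <-chan PoolState     // orchestrator -> backend
	Place(jobID, provider string) error // backend -> orchestrator
}
```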

Orchestration

We use both Kubernetes and Nomad as orchestrators, depending on the workload characteristics. These orchestrators handle the placement of virtual machines and containers, but they don't make decisions in isolation. They're part of a closed-loop system with our backend services, continuously providing real-time information about the state of the underlying infrastructure.

Three Pillars

Building reliable infrastructure isn't about a single magic solution. At WarpBuild, we approach reliability through three interconnected pillars that work together to ensure uptime and robust performance.

Pillar 1: Intelligent Automation

The foundation of our reliability is automation that continuously monitors infrastructure state and makes intelligent scheduling decisions on the fly.

Our backend services are the brain of this operation. They're infrastructure aware and make dynamic decisions about where compute needs to be scheduled, optimizing for both queue times and performance. This isn't static configuration - it's real-time decision-making based on current conditions.

The system works as a closed loop:

  1. Our orchestrators (Kubernetes or Nomad) continuously provide real-time information about the state of the underlying infrastructure to our backend services
  2. The backend services analyze this data alongside incoming workload requests
  3. Based on capacity, performance characteristics, and current state, the backend instructs the orchestrator on the optimal VM placement
  4. The orchestrator executes the placement and feeds updated state back to the backend

This closed-loop system enables automatic failover management. When the backend detects that scheduling isn't feasible on the preferred infrastructure - whether due to capacity constraints or infrastructure issues - it automatically triggers failover to alternative compute resources.
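
To make the loop concrete, here is a minimal Go sketch of that cycle. It is an illustration under simplified assumptions - the types, channels, and function names are invented for this post, not our production code:

```go
package scheduler

import (
	"context"
	"log"
)

// PoolState is the same hypothetical snapshot sketched above: a simplified
// view of one compute pool as reported by its orchestrator.
type PoolState struct {
	Provider       string // "bare-metal", "aws", "gcp", "azure"
	Healthy        bool
	AvailableSlots int
}

// Placement is a hypothetical command sent back to an orchestrator.
type Placement struct {
	JobID    string
	Provider string
}

// ControlLoop sketches the closed loop: orchestrators stream state in,
// the backend decides a placement for each incoming job, and commands
// flow back out. The orchestrator's execution of a placement shows up
// again as a later update on the states channel.
func ControlLoop(ctx context.Context, states <-chan PoolState, jobs <-chan string, commands chan<- Placement) {
	latest := map[string]PoolState{} // most recent state per provider

	for {
		select {
		case <-ctx.Done():
			return
		case s := <-states: // step 1: orchestrators continuously report state
			latest[s.Provider] = s
		case jobID := <-jobs: // steps 2-3: analyze state, choose a placement
			p, ok := schedule(jobID, latest)
			if !ok {
				log.Printf("no healthy capacity for job %s; surfacing an alert", jobID)
				continue
			}
			commands <- p // step 4: orchestrator executes and reports back
		}
	}
}

// schedule picks any healthy pool with free capacity; the ordered
// bare-metal-first fallback policy is sketched separately under Pillar 2.
func schedule(jobID string, latest map[string]PoolState) (Placement, bool) {
	for provider, s := range latest {
		if s.Healthy && s.AvailableSlots > 0 {
			return Placement{JobID: jobID, Provider: provider}, true
		}
	}
	return Placement{}, false
}
```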

Pillar 2: Multi-cloud Redundancy with Intelligent Failover

While automation handles the decision making, multi-cloud redundancy provides the infrastructure options that make seamless failover possible.

Primary Infrastructure: We prioritize bare metal servers for runners whenever possible. Bare metal delivers the best performance for CPU-intensive CI workloads, and our customers benefit from faster build times.

Automatic Fallback: When bare metal capacity is insufficient or when there are issues with our bare metal hosting provider, the system automatically falls back to hyperscalers. This isn't a manual process - our infrastructure-aware backend services detect capacity or availability issues and seamlessly redirect workloads to hyperscaler infrastructure.

Multi-cloud Backup: Beyond the bare metal to hyperscaler failover, we maintain backup stacks on GCP and Azure in addition to our primary AWS infrastructure. This provides an additional layer of redundancy for hyperscaler workloads.
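
As a rough sketch of that ordering (again hypothetical, not our actual code), the fallback policy can be thought of as a priority list that is walked until a healthy tier with free capacity is found:

```go
package main

import "fmt"

// Tier is a hypothetical compute tier, listed in priority order.
type Tier struct {
	Name           string
	Healthy        bool
	AvailableSlots int
}

// pickTier walks the tiers in priority order (bare metal first, then the
// primary hyperscaler, then the backup stacks) and returns the first one
// that is healthy and has capacity.
func pickTier(tiers []Tier) (Tier, bool) {
	for _, t := range tiers {
		if t.Healthy && t.AvailableSlots > 0 {
			return t, true
		}
	}
	return Tier{}, false
}

func main() {
	// Example: bare metal is out of capacity and AWS is unhealthy,
	// so the job fails over to the GCP backup stack.
	tiers := []Tier{
		{Name: "bare-metal", Healthy: true, AvailableSlots: 0},
		{Name: "aws", Healthy: false, AvailableSlots: 40},
		{Name: "gcp", Healthy: true, AvailableSlots: 12},
		{Name: "azure", Healthy: true, AvailableSlots: 8},
	}
	if t, ok := pickTier(tiers); ok {
		fmt.Println("scheduling on", t.Name) // prints "scheduling on gcp"
	}
}
```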

The result is a seamless customer experience. Whether a job runs on bare metal or gets automatically failed over to a hyperscaler, customer workloads are always running. The complexity is hidden from users - they simply see their CI jobs complete successfully.

Pillar 3: Observability and Alerting

While automation and redundancy handle most reliability scenarios, observability provides visibility and enables quick response to everything else.

Tools and Stack: We use a comprehensive observability stack including OpenTelemetry for instrumentation, Prometheus for metrics collection, Grafana for visualization, and Datadog and SigNoz for unified monitoring and alerting. This gives us deep visibility into logs, metrics, alerts, and dashboards across all our systems.
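
To give a flavor of the instrumentation side, here is a minimal OpenTelemetry Go sketch that records a scheduling counter. The meter and metric names are illustrative, and in a real service an SDK meter provider and exporter (for example the Prometheus exporter) would need to be configured for the measurements to go anywhere:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()

	// Hypothetical instrumentation scope; without a configured SDK this
	// returns a no-op meter, so the snippet is safe to run as-is.
	meter := otel.Meter("warpbuild.scheduler")

	scheduled, err := meter.Int64Counter(
		"ci.jobs.scheduled", // illustrative metric name
		metric.WithDescription("CI jobs scheduled, labeled by compute tier"),
	)
	if err != nil {
		panic(err)
	}

	// Record one scheduled job, tagging which tier served it and whether
	// it was a failover placement.
	scheduled.Add(ctx, 1, metric.WithAttributes(
		attribute.String("tier", "bare-metal"),
		attribute.Bool("failover", false),
	))
}
```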

Persistence Layer Strategy: Unlike our compute layer which has automated failover, our persistence layers (databases, Redis, SQS) rely on observability and rapid response rather than automated failover.

Recent Improvements: The observability stack overhaul we completed in the last two months significantly improved our alert quality and visibility. We now have a better signal-to-noise ratio in alerts, more comprehensive dashboards for infrastructure health, and broader observability coverage across our systems.

How It All Works Together

These three pillars aren't independent - they work in concert to deliver the reliability our customers depend on.

Automation handles dynamic workload placement and failover, continuously optimizing where jobs run based on real-time infrastructure state. Multi-cloud redundancy provides the infrastructure options - bare metal, multiple hyperscalers, multiple regions - that make intelligent failover possible. Observability ensures we have visibility into every layer of the stack and can quickly respond to any issues that automation doesn't handle.

The result is the "set and forget" reliability that we promised. Our customers configure their CI once, and from that point forward, WarpBuild handles the complexity of ensuring their workloads always run regardless of infrastructure issues, capacity constraints, or unexpected demand spikes.

This setup also helps protect against the dreaded us-east-1 failures and other outages.

Conclusion

Reliability at scale requires more than just redundant infrastructure - it requires intelligent systems that can make real-time decisions, seamless failover mechanisms, and comprehensive observability to catch what automation misses.

At WarpBuild, these three pillars - intelligent automation, multi-cloud redundancy, and observability - work together to deliver the uptime and reliability that modern development teams require. As we continue to grow and evolve our infrastructure, these principles guide how we approach every architectural decision.

Call for Developers

We are looking for developers who are interested in building the future of CI/CD. If you are interested in working on these kinds of infrastructure challenges, get in touch with us at [email protected]!
