# Your CI Wasn't Built for AI-Assisted Development

URL: /blog/ai-assisted-development-ci-infrastructure

Learn why CI pipelines break under AI-assisted development velocity. Discover how to fix queue times, cache thrashing, and flaky tests when using Copilot, Cursor, and other AI coding tools.

---
title: "Your CI Wasn't Built for AI-Assisted Development"
excerpt: "AI coding tools have changed how fast teams ship code. Most CI infrastructure was designed for human-speed development. Here's what breaks — and how to fix it."
description: "Learn why CI pipelines break under AI-assisted development velocity. Discover how to fix queue times, cache thrashing, and flaky tests when using Copilot, Cursor, and other AI coding tools."
date: "2026-01-13"
author: surya_oruganti
cover: "/images/blog/ai-assisted-development-ci-infrastructure/cover.png"
---

Something shifted in the past eighteen months. Engineers who once spent hours crafting code now generate working implementations in minutes. Claude Code builds entire features from a prompt. [Cursor](https://cursor.sh/) rewrites files on command. The velocity gains are real, measurable, and accelerating.

But there's a downstream effect that's getting less attention: CI infrastructure is buckling under the load. The CI systems most teams run today were architected for human-speed development. They assume a certain cadence of commits, a predictable volume of pull requests, a manageable rate of test execution. When code velocity doubles or triples, everything downstream breaks in predictable ways. Queue times spike. Caches thrash. Flaky tests that surfaced once a week now fail daily. Costs climb faster than budgets.

This isn't an argument against AI tools. They're a genuine productivity multiplier. But the infrastructure that validates and ships that code needs to catch up. The bottleneck has shifted from writing code to validating code, and CI is now the constraint.
## Key Takeaways

- **AI coding tools increase PR volume by 26-98%** depending on adoption levels, overwhelming CI systems designed for human-speed development
- **Queue times, cache thrashing, and flaky tests** are the primary failure modes at higher velocity
- **GitHub-hosted runner concurrency limits** (20 for Free, 60 for Team) become bottlenecks for AI-assisted teams
- **Solutions include unlimited concurrency, larger caches, and test parallelization**
- **Total cost analysis** should include developer wait time, not just compute costs

## The Velocity Shift Is Real

[GitHub's research on Copilot](https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/) found that developers using the tool completed tasks 55% faster than those without it. A [multi-company study](https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/) involving over 4,800 developers showed a 26% increase in completed pull requests per week.

[Stack Overflow's 2025 Developer Survey](https://survey.stackoverflow.co/2025/ai) found 84% of developers are using or planning to use AI tools in their workflow. Copilot alone has over 20 million users, with 90% of Fortune 100 companies now using it.

The research confirms substantial PR volume increases. [Faros AI's analysis](https://www.index.dev/blog/ai-coding-assistants-roi-productivity) of over 10,000 developers found teams with high AI adoption created 47% more pull requests per developer per day, with some teams seeing up to 98% more PRs overall. The effect compounds: faster code generation leads to faster code reviews (because reviewers use AI too), which leads to faster merges, which leads to more CI runs per day.

Consider a concrete scenario. A 20-person engineering team that previously opened 40 PRs per week adopts AI coding assistants across the org.
Based on research showing 47-98% PR increases for high-adoption teams, within a month they're opening 60-80 PRs weekly. Each PR triggers 3-5 CI jobs on average. That's 120-200 CI runs per week becoming 180-400. The CI system that handled the old volume with headroom to spare is now frequently saturated.

The math is straightforward. If your CI was sized for X throughput and your development velocity goes to 1.5-2X, something has to give. Usually it's developer wait time, and that erodes the velocity gains you thought you were getting.

## What Breaks at Higher Velocity

The failure modes are predictable once you understand the mechanics. Each one has a trigger point and a symptom that shows up in developer experience.

**Queue times explode.** Most CI systems have concurrency limits. [GitHub-hosted runners](https://docs.github.com/en/actions/reference/limits) allow 20 concurrent jobs for Free plans, 40 for Pro, 60 for Team, and up to 500 for Enterprise. At 40 PRs per week with modest job counts, you rarely hit those limits. Jobs start immediately. At 100+ PRs per week, the math changes. Jobs queue constantly, especially during peak hours when the whole team is pushing code. Developers report "CI is slow," but the jobs aren't actually running slowly. They're waiting in line.

**Cache economics change.** Caches have size limits and eviction policies. [GitHub Actions cache storage](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows) was historically limited to 10GB per repository (though this limit has been relaxed as of late 2025). At low velocity, your dependency caches, build caches, and test caches all fit comfortably. Cache hits are high. Builds are fast. At high velocity, more PRs means more cache writes. More cache writes means faster eviction. The cache that "always hit" at low velocity starts missing at high velocity. A build that took 3 minutes with warm caches now takes 8 minutes cold.
Multiply that across hundreds of runs and you've lost hours of developer time daily. Learn more in our [guide to GitHub Actions caching](/blog/github-actions-cache).

**Flaky tests surface more often.** A test with a 1% flake rate fails once per 100 runs. At 200 CI runs per week, that's 2 flakes. Annoying but manageable. At 600 CI runs per week, that's 6 flakes. Now flaky tests are a daily occurrence. The noise becomes constant. Teams start ignoring CI failures or reflexively re-running jobs. Trust in the test suite erodes. The signal that CI is supposed to provide gets lost.

**Cost scales linearly, or worse.** Three times the runs means three times the compute cost. But the velocity gains aren't perfectly linear because there's coordination overhead. Finance starts asking questions. The CI budget that seemed reasonable last quarter is now a line item that needs justification. Teams defer infrastructure improvements because the bill is already high. See our [guide to reducing GitHub Actions costs](/blog/github-actions-cost-reduction) for optimization strategies.

**Developer wait time compounds.** A 10-minute CI wait isn't bad in isolation. But at higher velocity, developers are pushing more frequently. Three PRs per day with 10-minute waits is 30 minutes of waiting. Context switching during those waits has its own cost. The "fast" AI-assisted development starts feeling slow because CI can't keep up.
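The arithmetic behind these failure modes is worth making explicit. Here is a quick back-of-the-envelope model; all inputs are illustrative assumptions, not measurements from any particular team:

```javascript
// Back-of-the-envelope model of flake volume and wait time at higher velocity.
// All inputs below are illustrative assumptions, not measured values.
const flakeRate = 0.01;        // a test suite with a 1% spurious-failure rate
const runsPerWeek = 600;       // CI runs per week after the velocity increase
const expectedFlakes = Math.round(flakeRate * runsPerWeek);

const pushesPerDevPerDay = 3;  // PRs/pushes per developer per day
const waitMinutesPerRun = 10;  // average CI wait per push
const dailyWaitMinutes = pushesPerDevPerDay * waitMinutesPerRun;

console.log(`~${expectedFlakes} flaky failures per week`);                    // ~6
console.log(`${dailyWaitMinutes} minutes of waiting per developer per day`);  // 30
```

At 200 runs per week the same model yields roughly 2 flakes. The point is that flake volume scales linearly with run volume, while trust in the suite erodes much faster than linearly.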
| Failure Mode | Trigger | Symptom |
|--------------|---------|---------|
| Queue saturation | PR volume exceeds concurrency limits | Jobs waiting 5-15 minutes before starting |
| Cache thrashing | Write volume exceeds cache capacity | Build times 2-3x longer than baseline |
| Flake amplification | Run volume surfaces rare failures | Multiple false failures per day |
| Cost escalation | Linear compute scaling | 2-3x CI spend increase |
| Wait time compounding | Higher PR frequency per developer | 30+ minutes daily waiting on CI |

## The AI-Generated Code Factor

There's a nuance most teams miss. AI-generated code has characteristics that stress CI in specific ways, beyond just the volume increase. Research is beginning to quantify these effects.

AI tends to generate explicit, readable code. That's good for humans reviewing it. But more lines means more to compile, more to analyze, more to test. [GitClear's analysis](https://www.gitclear.com/ai_assistant_code_quality_2025_research) of 211 million changed lines of code found a 4x increase in code duplication since AI tools became prevalent. Codebases are growing faster in absolute terms, not just in commit frequency. The build that used to process 50,000 lines now processes 80,000.

AI coding assistants have no awareness of your build system. They don't optimize for incremental compilation. They don't consider whether the code they're generating will invalidate caches. They don't know that touching a certain file triggers a full rebuild. They're optimizing for correctness and readability, not CI performance.

The copy-paste pattern is particularly problematic. AI makes it trivially easy to generate similar code across multiple files. Need the same validation logic in three places? AI will happily generate three implementations. GitClear's research shows copy-pasted code now exceeds refactored/moved code for the first time.
This creates redundant test coverage and reduces cache effectiveness because more files are changing per commit.

Generated tests are often brittle. AI-generated test suites frequently have poor isolation. Time dependencies, order dependencies, shared state. [CodeRabbit's 2025 report](https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report) found AI-generated pull requests contain approximately 1.7x more issues than human-written PRs, with logic and correctness issues rising 75%. The tests pass when run individually but fail in certain sequences. More tests plus worse test architecture equals more flakiness. The test suite grows faster than its reliability.

Code churn—code that gets discarded within two weeks of being written—is also increasing. GitClear projects this metric has doubled compared to pre-AI baselines. This means CI is running more jobs to validate code that won't survive.

None of this is AI's fault. These tools aren't optimized for CI performance, and that's probably the right tradeoff for their primary use case. But it means the code velocity increase comes with a CI tax that teams need to account for.

## Adapting Your CI for AI-Speed Development

The good news is that these problems are solvable. The solutions require some upfront investment but pay dividends as velocity continues to increase.

**Remove the concurrency ceiling.** Start by auditing your current limits. Calculate your saturation rate: average PR volume multiplied by jobs per PR multiplied by average job duration, divided by working hours. If you're hitting concurrency limits more than 10% of the time, you need more headroom. The options are upgrading your GitHub plan (which has its own limits), [self-hosting runners](/blog/self-hosting-github-actions) (which adds operational burden), or using a runner provider that doesn't impose concurrency limits. The right choice depends on your team's appetite for infrastructure work.
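That saturation audit is easy to script. A minimal sketch of the calculation with hypothetical numbers (substitute your own PR volume, job counts, durations, and plan limit; the 5x peak factor is our assumption, not a GitHub figure):

```javascript
// Saturation estimate: PR volume x jobs per PR x job duration / working hours
// gives the average number of jobs running concurrently. Inputs are hypothetical.
const prsPerWeek = 100;           // weekly PR volume after AI adoption
const jobsPerPr = 5;              // average CI jobs triggered per PR
const avgJobMinutes = 15;         // average job duration
const workingMinutesPerWeek = 40 * 60;

const jobMinutes = prsPerWeek * jobsPerPr * avgJobMinutes;  // 7500 job-minutes
const avgConcurrent = jobMinutes / workingMinutesPerWeek;   // 3.125 jobs on average

// Load is bursty: pushes cluster around merges and peak hours, so peak
// concurrency sits well above the average. 5x is a rough rule of thumb.
const peakFactor = 5;
const peakConcurrent = avgConcurrent * peakFactor;          // 15.625 at peak

const planLimit = 20; // concurrent-job limit on the GitHub Free plan
console.log(
  `avg ${avgConcurrent.toFixed(1)} concurrent jobs, ` +
    `peak ~${peakConcurrent.toFixed(1)} vs a limit of ${planLimit}`,
);
```

With these illustrative numbers, peak demand already approaches the Free-plan ceiling. Double the PR volume again and peak demand clears the limit, which is exactly when jobs start queueing.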
**Optimize caching for churn.** GitHub's cache storage fills fast at high velocity. Consider external caching solutions with larger storage limits. Smarter cache keys help too. Don't invalidate your entire dependency cache when the lockfile changes. Use content-addressed keys where possible. For Docker builds, layer caching is usually the biggest win. A well-structured Dockerfile with stable base layers can reduce build times by 60-80%. Measure your cache hit rate at current velocity, then project what happens when volume doubles. Our [Docker build optimization guide](/blog/optimizing-docker-builds) covers this in detail.

**Architect tests for scale.** Parallelize aggressively. Sharding test suites across multiple runners can turn a 20-minute test run into a 4-minute one. Matrix builds let you test across environments simultaneously rather than sequentially. Implement flaky test detection and quarantine. When a test fails intermittently, automatically flag it and move it out of the critical path. Prune redundant tests. AI often generates overlapping coverage, and removing duplicates speeds up the suite without reducing confidence. Set time budgets per PR and enforce them. Learn more about [running concurrent tests effectively](/blog/concurrent-tests).

```yaml
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: warpbuild-ubuntu-22.04-x64-4x
    steps:
      - uses: actions/checkout@v4
      - name: Run tests (shard ${{ matrix.shard }}/4)
        run: |
          npm test -- --shard=${{ matrix.shard }}/4
```

**Match runner sizing to the workload.** AI-assisted PRs often have larger diffs. Larger diffs benefit from bigger runners. The cost-per-minute versus time tradeoff shifts when velocity is high. A runner that costs 1.5x as much but finishes in 0.4x the time is worth it when you're running hundreds of jobs daily. The developer time saved exceeds the compute cost increase. Run the numbers for your specific workload.

**Model costs for the new normal.** Don't assume linear growth.
AI adoption curves are steep. If your team is at 30% AI tool adoption today, plan for 70% within a year. Build CI cost into the AI tooling ROI calculation. The productivity gains from AI coding assistants are real, but so is the infrastructure cost. Consider total cost: compute plus developer wait time plus operational burden. A system that costs more per minute but eliminates wait time and ops work often has lower total cost.

## The Infrastructure Gap

There's a structural issue underneath all of this. GitHub-hosted runners were designed in an era of human-speed development. The concurrency limits, cache sizes, and pricing models assume a certain velocity. That assumption is breaking.

Self-hosted runners give you control but add operational burden. That burden scales with volume. More runs means more infrastructure to manage, more capacity planning, more on-call rotations. For teams that already have platform engineering capacity, this can work. For teams that don't, it's a distraction from shipping product. Read about the [challenges of GitHub Actions at scale](/blog/github-actions-challenges) for more context.

The new requirement is infrastructure that scales elastically with demand, has caching that performs at high throughput, starts instantly without queue time, and doesn't require a dedicated team to operate. This is the problem we built WarpBuild to solve.

## Frequently Asked Questions

### How much faster do AI coding tools make developers?

GitHub's research found that developers using Copilot completed tasks 55% faster, with a [controlled study](https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/) showing a 26% increase in completed PRs per week. The downstream effect on CI compounds this—[research from Faros AI](https://www.index.dev/blog/ai-coding-assistants-roi-productivity) found high-adoption teams saw 47-98% more pull requests per day, significantly increasing CI load.
### Why does my CI feel slow even though individual jobs are fast?

The most common cause is queue saturation. [GitHub-hosted runners have concurrency limits](https://docs.github.com/en/actions/reference/limits): 20 for Free, 40 for Pro, 60 for Team, and up to 500 for Enterprise. When you exceed these limits, jobs wait in line before they even start running. Check the "Queued" timestamp versus "In progress" timestamp in your workflow runs.

### How do I know if my cache is thrashing?

Look for inconsistent build times. If the same build takes 3 minutes sometimes and 8 minutes other times, you're likely experiencing cache misses due to eviction. GitHub's cache storage can fill quickly at high velocity, especially with frequent lockfile changes.

### Should I use self-hosted runners or a managed service?

Self-hosted runners remove concurrency limits but add operational burden that scales with volume. For teams without dedicated platform engineering capacity, managed services like WarpBuild provide unlimited concurrency without the ops overhead.

### How do I calculate the true cost of CI wait time?

Multiply average wait time per PR by PRs per day by average developer hourly cost. A team with 50 PRs/day, 10-minute average waits, and $75/hour developer cost loses about $625 per day, or roughly $3,125 per five-day week, in developer time alone. That is often more than the CI compute cost itself.

## Moving Forward

AI coding tools are a net positive for engineering velocity. The teams using them are shipping more, faster. But velocity gains upstream create pressure downstream. CI is the validation layer, and most CI infrastructure wasn't built for this level of throughput.

The teams that will thrive in AI-assisted development are the ones that upgrade their infrastructure proactively. Not the ones scrambling after CI becomes the bottleneck. Not the ones watching developers wait in queues while the productivity gains evaporate.

The specifics matter. Audit your concurrency headroom.
Rethink your caching strategy for higher write volumes. Architect tests for parallelization and scale. Honestly assess whether your current CI setup can handle 50-100% more volume than it handles today.

We're still early in the AI-assisted development curve. The tools are getting better. Adoption is accelerating. The teams that build infrastructure for this future now will have a structural advantage over those that wait.

---

*If you're hitting these limits, WarpBuild offers unlimited concurrency and high-performance caching designed for high-velocity teams. [Start free →](https://www.warpbuild.com)*

# Cost comparison: GitHub Actions Runner Controller (ARC) and WarpBuild

URL: /blog/arc-warpbuild-comparison-case-study

Comparing cost efficiency of GitHub Actions Runner Controller (ARC) using Kubernetes and Karpenter on AWS with WarpBuild BYOC runners. WarpBuild leads to >40% cost savings.

---
title: "Cost comparison: GitHub Actions Runner Controller (ARC) and WarpBuild"
excerpt: "Comparing Costs of GitHub Actions Runner Controller (ARC) with WarpBuild BYOC runners"
description: "Comparing Cost Efficiency of GitHub Actions Runner Controller (ARC) using kubernetes and karpenter on AWS with WarpBuild BYOC runners. WarpBuild leads to >40% cost savings."
date: "2024-09-18"
author: prajjwal_dimri
cover: "/images/blog/arc-warpbuild-comparison-case-study/cover.webp"
---

In this case study, we will explore the cost, flexibility, and management aspects of running your own GitHub Actions runners using ARC (Actions Runner Controller) vs. using WarpBuild's **Bring Your Own Cloud (BYOC)** offering on AWS.

## TL;DR

We compare setting up GitHub's Actions Runner Controller on EKS, using Karpenter for autoscaling, with WarpBuild's BYOC offering. We found that ARC comes with significant operational overhead and efficiency challenges.
On the other hand, WarpBuild's BYOC solution provides better performance, ease of use, and lower operational costs, making it a more suitable choice for teams, especially those with large volumes of CI/CD workflows.

**Cost Comparison Highlights**: The cost comparison is for a representative 2-hour period with a continuous load of commits, each triggering a job. We use the `PostHog` OSS repository as an example to demonstrate the cost comparison on a real-world use case over 960 jobs.

- ARC setup cost (for the analyzed period): **$42.60**
- WarpBuild BYOC cost: **$25.20**

This is effectively a **~41%** cost savings.

![Cost Comparison](/images/blog/arc-warpbuild-comparison-case-study/cost-comparison.png)

You can find the detailed cost comparison [here](#cost-comparison). The following sections describe the setup of ARC runners on EKS and the assumptions that went into it.

## Setting up ARC Runners on EKS

We set up Karpenter v1 and EKS using Terraform to provision the infrastructure. This approach provided more control, automation, and consistency in deploying and managing the EKS cluster and related resources.

Complete setup code is available at https://github.com/WarpBuilds/github-arc-setup

### EKS Cluster Setup

The EKS cluster was provisioned using Terraform and runs Kubernetes v1.30. A key aspect of our setup was using a dedicated node group for essential add-ons, keeping them isolated from other workloads. The `default-ng` node group uses `t3.xlarge` instance types, with taints to ensure that only critical workloads, such as networking, DNS management, node management, and the ARC controllers, can be scheduled on these nodes.
```hcl
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = local.cluster_name
  cluster_version = "1.30"

  cluster_endpoint_public_access = true

  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    vpc-cni                = {}
  }

  subnet_ids = var.private_subnet_ids
  vpc_id     = var.vpc_id

  eks_managed_node_groups = {
    default-ng = {
      desired_capacity = 2
      max_capacity     = 5
      min_capacity     = 1

      instance_types = ["t3.xlarge"]
      subnet_ids     = var.private_subnet_ids

      taints = {
        addons = {
          key    = "CriticalAddonsOnly"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }
  }

  node_security_group_tags = merge(local.tags, {
    "karpenter.sh/discovery" = local.cluster_name
  })

  enable_cluster_creator_admin_permissions = true

  tags = local.tags
}
```

#### Private Subnets and NAT Gateway

To secure our infrastructure, we placed the EKS nodes in private subnets, allowing them to communicate with external resources through a NAT Gateway. This configuration ensured that the nodes could still access the internet for essential tasks without exposing them directly to external traffic. Using private subnets with a NAT Gateway enhanced the security posture of the cluster while allowing for the necessary external connectivity.

### Karpenter for Autoscaling

To manage autoscaling of the nodes and optimize cost and resource efficiency, we utilized Karpenter, which offers a more flexible and cost-effective alternative to the Kubernetes Cluster Autoscaler. Karpenter allows nodes to be created and terminated dynamically based on real-time resource needs, reducing over-provisioning and unnecessary costs.

We deployed Karpenter using Terraform and Helm, with some notable configurations:

- [**Karpenter v1.0.2**](https://karpenter.sh/): We chose the latest version of Karpenter at the time of writing.
- **Amazon Linux 2023 (AL2023)**: The default NodeClass provisions nodes with AL2023, and each node is configured with 300GiB of EBS storage. This additional storage is crucial for workloads that require high disk usage, such as CI/CD runners, preventing out-of-disk errors commonly encountered with default node storage (17GiB). This needs to be increased based on the number of jobs expected to run on a node in parallel.
- **Private Subnet Selection**: The NodeClass is configured to use the private subnets created earlier. This ensures that nodes are spun up in a secure, isolated environment, consistent with the EKS cluster's network setup.
- [**m7a Node Families**](https://aws.amazon.com/ec2/instance-types/m7a/): Using the NodePool resource, we restricted node provisioning to the m7a instance family. These instances were chosen for their performance-to-cost efficiency and are only provisioned in the us-east-1a and us-east-1b Availability Zones.
- **On-demand Instances**: While Karpenter supports Spot Instances for cost savings, we opted for on-demand instances for an equivalent cost comparison.
- **Consolidation Policy**: We configured a 5-minute consolidation delay, preventing premature node terminations that could disrupt workflows. Karpenter will only consolidate nodes once they are underutilized for at least 5 minutes, ensuring stable operations during peak workloads.
```hcl
module "karpenter" {
  source = "terraform-aws-modules/eks/aws//modules/karpenter"

  cluster_name = module.eks.cluster_name

  enable_pod_identity             = true
  create_pod_identity_association = true
  create_instance_profile         = true

  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = local.tags
}

resource "helm_release" "karpenter-crd" {
  namespace        = "karpenter"
  create_namespace = true

  name       = "karpenter-crd"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter-crd"
  version    = "1.0.2"
  wait       = true

  values = []
}

resource "helm_release" "karpenter" {
  depends_on = [helm_release.karpenter-crd]

  namespace        = "karpenter"
  create_namespace = true

  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.0.2"
  wait       = true
  skip_crds  = true

  values = [
    <<-EOT
    serviceAccount:
      name: ${module.karpenter.service_account}
    settings:
      clusterName: ${module.eks.cluster_name}
      clusterEndpoint: ${module.eks.cluster_endpoint}
    EOT
  ]
}

resource "kubectl_manifest" "karpenter_node_class" {
  yaml_body = <<-YAML
    apiVersion: karpenter.k8s.aws/v1beta1
    kind: EC2NodeClass
    metadata:
      name: default
    spec:
      amiFamily: AL2023
      detailedMonitoring: true
      blockDeviceMappings:
        - deviceName: /dev/xvda
          ebs:
            volumeSize: 300Gi
            volumeType: gp3
            deleteOnTermination: true
            iops: 5000
            throughput: 500
      instanceProfile: ${module.karpenter.instance_profile_name}
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: ${module.eks.cluster_name}
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: ${module.eks.cluster_name}
      tags:
        karpenter.sh/discovery: ${module.eks.cluster_name}
        Project: arc-test-praj
  YAML

  depends_on = [
    helm_release.karpenter,
    helm_release.karpenter-crd
  ]
}

resource "kubectl_manifest" "karpenter_node_pool" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          tags:
            Project: arc-test-praj
          nodeClassRef:
            name: default
          requirements:
            - key: "karpenter.k8s.aws/instance-category"
              operator: In
              values: ["m"]
            - key: "karpenter.k8s.aws/instance-family"
              operator: In
              values: ["m7a"]
            - key: "karpenter.k8s.aws/instance-cpu"
              operator: In
              values: ["4", "8", "16", "32", "64"]
            - key: "karpenter.k8s.aws/instance-generation"
              operator: Gt
              values: ["2"]
            - key: "topology.kubernetes.io/zone"
              operator: In
              values: ["us-east-1a", "us-east-1b"]
            - key: "kubernetes.io/arch"
              operator: In
              values: ["amd64"]
            - key: "karpenter.sh/capacity-type"
              operator: In
              values: ["on-demand"]
      limits:
        cpu: 1000
      disruption:
        consolidationPolicy: WhenEmpty
        consolidateAfter: 5m
  YAML

  depends_on = [
    kubectl_manifest.karpenter_node_class
  ]
}
```

We also ran another setup with a single job per node to compare the performance and cost implications of running multiple jobs on a single node.

```diff
-            - key: "karpenter.k8s.aws/instance-cpu"
-              operator: In
-              values: ["4", "8", "16", "32", "64"]
+            - key: "karpenter.k8s.aws/instance-cpu"
+              operator: In
+              values: ["8"]
```

### Actions Runner Controller and Runner Scale Set

Once Karpenter was configured, we proceeded to set up the GitHub Actions Runner Controller (ARC) and the Runner Scale Set using Helm. The ARC setup was deployed with Helm using the following command and values:

```bash
helm upgrade arc \
  --namespace "${NAMESPACE}" \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --values runner-set-values.yaml --install
```

```yaml
tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

This configuration applies tolerations to the controller, enabling it to run on nodes with the `CriticalAddonsOnly` taint (i.e., the `default-ng` node group), ensuring it doesn't interfere with other runner workloads.
Next, we set up the Runner Scale Set using another Helm command:

```bash
helm upgrade warp-praj-arc-test \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace ${NAMESPACE} --values values.yaml --install
```

The key points for our Runner Scale Set configuration:

- **GitHub App Integration**: We connected our runners to GitHub via a GitHub App, enabling the runners to operate at the organization level.
- **Listener Tolerations**: Like the controller, the listener template also included tolerations to allow it to run on the `default-ng` node group.
- **Custom Image for Runners**: We used a custom Docker image for the runner pods (detailed in the next section).
- **Resource Requirements**: To simulate high-performance runners, the runner pods were configured to require 8 CPU cores and 32 GiB of RAM, which aligns with the performance of an 8x runner used in the workflows.

```yaml
githubConfigUrl: "https://github.com/Warpbuilds"
githubConfigSecret:
  github_app_id: ""
  github_app_installation_id: ""
  github_app_private_key: |
    -----BEGIN RSA PRIVATE KEY-----
    [your-private-key-contents]
    -----END RSA PRIVATE KEY-----
  github_token: ""
listenerTemplate:
  spec:
    containers:
      - name: listener
        securityContext:
          runAsUser: 1000
    tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
template:
  spec:
    containers:
      - name: runner
        image:
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            cpu: "8"
            memory: "32Gi"
controllerServiceAccount:
  namespace: arc-systems
  name: arc-gha-rs-controller
```

### Custom Image for Runner Pods

By default, Runner Scale Sets use GitHub's official `actions-runner` image. However, this image doesn't include essential utilities such as wget, curl, and git, which are required by various workflows. To address this, we created a custom Docker image based on GitHub's runner image, adding the necessary tools.
This image was hosted in a public ECR repository and was used by the runner pods during our tests. The custom image allowed us to run workflows without missing dependencies and ensured smooth execution.

```dockerfile
FROM ghcr.io/actions/actions-runner:2.319.1

RUN sudo apt-get update && sudo apt-get install -y wget curl unzip git
RUN sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
```

This approach ensured that our runners were always equipped with the required utilities, preventing errors and reducing friction during the workflow runs.

### Tagging Infrastructure for Cost Tracking

To track costs effectively during the ARC setup, we applied cost allocation tags across all the resources used for the setup and collected hourly data. AWS Cost Explorer allowed us to monitor and attribute costs to specific resources based on these tags. This was essential for calculating the true cost of running ARC compared to the WarpBuild BYOC solution.

## Setting up BYOC Runners on WarpBuild

### Adding Cloud Account

Setting up BYOC (Bring Your Own Cloud) runners on WarpBuild begins by connecting your own cloud account. After signing up for WarpBuild, navigate to the BYOC page and follow the process to add your cloud account. This step is critical as it allows WarpBuild to provision and manage runners directly in your own AWS environment, providing greater control and flexibility.

![Add Cloud Account Flow](/images/blog/arc-warpbuild-comparison-case-study/C2.gif)

### Creating Stack

Once your cloud account is connected, you need to create a Stack in the WarpBuild dashboard. A WarpBuild Stack represents a group of essential infrastructure components, such as VPCs, subnets, and object storage buckets, provisioned in a specific region of your cloud account. These components are required for running CI workflows on WarpBuild.
![Create Stack Flow](/images/blog/arc-warpbuild-comparison-case-study/S2.gif)

### Custom Runner Creation

For this experiment, we also created a custom 8x runner. Although WarpBuild provides default stock runner configurations, creating a custom runner allowed us to match the specifications of the ARC runners.

WarpBuild runners are based on the Ubuntu 22.04 image, which is approximately 60GB in size. This image is pre-configured to work seamlessly with GitHub Actions workflows, offering better performance and compatibility than a general-purpose runner image. While such an image would be impractical for an ARC setup due to the high storage costs incurred every time a new node is provisioned, WarpBuild manages this efficiently through its runner orchestration.

![Create Runner Flow](/images/blog/arc-warpbuild-comparison-case-study/R2.gif)

### Tagging Infrastructure for Cost Tracking

WarpBuild simplifies cost tracking for its users by automatically tagging all provisioned resources. This allows users to monitor and manage costs more effectively. Additionally, WarpBuild offers a dedicated dashboard where users can see real-time cost breakdowns, making cost management more transparent.

## Workflow Simulation

### PostHog's Frontend CI Workflow

To simulate a real-world use case, we leveraged PostHog's Frontend CI workflow. This workflow is designed to run a series of frontend checks, followed by two sets of jobs: one for code quality checks and another for executing a matrix of Jest tests. This setup provided a comprehensive load for both the ARC and WarpBuild BYOC runners, allowing us to assess their performance under typical CI workloads.

You can view the workflow file here: [PostHog Frontend CI Workflow](https://github.com/WarpBuilds/posthog/blob/master/.github/workflows/ci-frontend.yml)

### Auto-Commit Simulation Script

To ensure continuous triggering of the Frontend CI workflow, we developed an automated commit script in JavaScript.
This script generates commits every minute on the forked PostHog repository, which in turn triggers the CI workflow. Both the ARC and the WarpBuild BYOC runners simultaneously pick up these jobs, enabling us to track costs and performance over time. The script is designed to run for two hours, ensuring a consistent workload over an extended period for accurate cost measurement. The results were then analyzed to compare the costs of using ARC versus WarpBuild's BYOC runners. Commit simulation script: ```javascript const { exec } = require("child_process"); const fs = require("fs"); const path = require("path"); const repoPath = "arc-setup/posthog"; const frontendDir = path.join(repoPath, "frontend"); const intervalTime = 1 * 60 * 1000; // Every Minute const maxRunTime = 2 * 60 * 60 * 1000; // 2 hours const setupGitConfig = () => { exec('git config user.name "Auto Commit Script"', { cwd: repoPath }); exec('git config user.email "auto-commit@example.com"', { cwd: repoPath }); }; const makeCommit = () => { const logFilePath = path.join(frontendDir, "commit_log.txt"); // Create the frontend directory if it doesn't exist if (!fs.existsSync(frontendDir)) { fs.mkdirSync(frontendDir); } // Write to commit_log.txt in the frontend directory fs.appendFileSync( logFilePath, `Auto commit in frontend at ${new Date().toISOString()}\n`, ); // Add, commit, and push changes exec(`git add ${logFilePath}`, { cwd: repoPath }, (err) => { if (err) return console.error("Error adding file:", err); exec( `git commit -m "Auto commit at ${new Date().toISOString()}"`, { cwd: repoPath }, (err) => { if (err) return console.error("Error committing changes:", err); exec("git push origin master", { cwd: repoPath }, (err) => { if (err) return console.error("Error pushing changes:", err); console.log("Changes pushed successfully"); }); }, ); }); }; setupGitConfig(); const interval = setInterval(makeCommit, intervalTime); // Stop the script after 2 hours setTimeout(() => { clearInterval(interval); 
console.log("Script completed after 2 hours"); }, maxRunTime); ``` ## Cost Comparison | **Category** | **ARC (Varied Node Sizes)** | **WarpBuild** | **ARC (1 Job Per Node)** | | ------------------ | --------------------------- | ------------------ | ------------------------ | | **Total Jobs Ran** | 960 | 960 | 960 | | Node Type | m7a (varied vCPUs) | m7a.2xlarge | m7a.2xlarge | | Max K8s Nodes | 8 | - | 27 | | Storage | 300GiB per node | 150GiB per runner | 150GiB per node | | IOPS | 5000 per node | 5000 per runner | 5000 per node | | Throughput | 500Mbps per node | 500Mbps per runner | 500Mbps per node | | Compute | $27.20 | $20.83 | $22.98 | | EC2-Other | $18.45 | $0.27 | $19.39 | | VPC | $0.23 | $0.29 | $0.23 | | S3 | $0.001 | $0.01 | $0.001 | | WarpBuild Costs | - | $3.80 | - | | **Total Cost** | **$45.88** | **$25.20** | **$42.60** | ## Performance and Scalability The following metrics showcase the average time taken by WarpBuild BYOC Runners and ARC Runners for jobs in the Frontend-CI workflow: | **Test** | **ARC (Varied Node Sizes)** | **WarpBuild** | **ARC (1 Job Per Node)** | | ----------------------- | --------------------------- | -------------------- | ------------------------ | | **Code Quality Checks** | ~9 minutes 30 seconds | ~7 minutes | ~7 minutes | | **Jest Test (FOSS)** | ~2 minutes 10 seconds | ~1 minute 30 seconds | ~1 minute 30 seconds | | **Jest Test (EE)** | ~1 minute 35 seconds | ~1 minute 25 seconds | ~1 minute 25 seconds | ARC runners exhibited slower performance primarily because multiple runners shared disk and network resources on the same node, causing bottlenecks despite larger node sizes. In contrast, WarpBuild's dedicated VM runners eliminated this resource contention, allowing jobs to complete faster. To address these bottlenecks, we tested a **1 Job Per Node** configuration with ARC, where each job ran on its own node. This approach significantly improved performance, matching the job times of WarpBuild runners. 
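For reference, one way to approximate the 1 Job Per Node configuration is to size each runner pod's resource requests to nearly fill an m7a.2xlarge (8 vCPU, 32GiB), so the Kubernetes scheduler places exactly one runner per node. This is a sketch against the official `gha-runner-scale-set` Helm chart; the release name, namespace, and resource numbers are illustrative, not the exact values from our tests:

```shell
# Sketch: force one runner pod per m7a.2xlarge node by requesting
# almost all of its 8 vCPUs / 32GiB (numbers are illustrative).
cat > one-job-per-node-values.yaml <<'EOF'
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:2.319.1
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "7"
            memory: "28Gi"
EOF

# Apply it to the runner scale set (skipped here if helm isn't installed).
command -v helm >/dev/null && helm upgrade --install arc-runner-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners \
  -f one-job-per-node-values.yaml || true
```

Leaving a small headroom below the node's full capacity (7 of 8 vCPUs, 28 of 32GiB) accounts for kubelet and system daemons that also reserve resources on each node.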
However, it introduced higher job start delays due to the time required to provision new nodes.

> Note: Job start delays are directly influenced by the time needed to provision a new node and pull the container image. Larger image sizes increase pull times, leading to longer delays. If the image size is reduced, additional tools must be installed during the action run, increasing the overall workflow run time. This is a trade-off you don't have to make with WarpBuild.

You can optimize further by leveraging WarpBuild features such as [custom images](/docs/ci/byoc/custom-vm-images), [snapshot runners](/docs/ci/snapshot-runners), and more.

## Conclusion

The cost and performance comparison between ARC and WarpBuild's BYOC offering demonstrates clear advantages to using WarpBuild. WarpBuild provides the same flexibility as ARC in configuring and scaling your own runners, but without the operational complexity and performance bottlenecks (such as resource contention on larger nodes), making it ideal for large-scale workloads. ARC's scalability is limited by node resources like disk I/O and network throughput, which can affect workflow performance despite using high-performance nodes.

WarpBuild simplifies the entire process, offering better performance with lower operational overhead and lower costs. It handles provisioning and scaling seamlessly while maintaining performance, making it the ideal option for CI/CD management for high-performance teams.

# Blacksmith vs WarpBuild Comparison - 2025 May

URL: /blog/blacksmith-warpbuild-comparison-2025-May

Blacksmith features, performance, pricing, and how it compares to WarpBuild.

---
title: "Blacksmith vs WarpBuild Comparison - 2025 May"
excerpt: "Blacksmith features, performance, pricing, and how it compares to WarpBuild."
description: "Blacksmith features, performance, pricing, and how it compares to WarpBuild."
date: "2025-05-19" author: surya_oruganti cover: "/images/blog/blacksmith-warpbuild-comparison-2025-May/cover.png" --- Blacksmith provides Github Actions runners that you can start using by just changing one line in the Github actions workflow file. This post provides a detailed comparison between Blacksmith and WarpBuild to help you make an informed decision. ## Feature Comparison | Feature | Blacksmith | WarpBuild | WarpBuild Advantage | | ------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **CPUs** | Server class (AMD EPYC, ARM Ampere) | x86-64 (Desktop Class Ryzen 7950X3D), arm64 (Graviton 4) | arm64: 40% more powerful but ~20% more expensive than Blacksmith; x86-64: same performance; Better networking, more disk options | | **Architecture** | x86-64, arm64 | x86-64, arm64 | arm64: 40% more powerful but ~20% more expensive than Blacksmith; x86-64: same performance; Better networking, more disk options | | **Concurrency** | Unlimited | Unlimited | | | **OS Support** | Ubuntu 22.04, 24.04 only | Ubuntu 22.04, 24.04, MacOS 13/14/15 with M4 Pro, Windows Server 2022 | Latest Linux, MacOS, Windows, custom images through app | | **Caching** | 25GB per repo | Unlimited; 7-day retention; Scales to 100+ jobs with no throttling | Unlimited storage, better scaling, container layer caching | | **Infrastructure** | EU and US | Cloud (EU), BYOC (AWS, GCP, Azure any region) | Multiple providers/regions, reduced data transfer costs | | **Sticky Disks** | Available | Not available | Cache stored on sticky disks is instantly available to runners, but concurrency is limited | | **Remote Docker Builder** | Only x86-64 | x86-64 and arm64 
multi-arch builds | Multi arch support for fast, native builds. | | **SSO support** | Not available | SSO supported | SSO supported for Microsoft Entra ID, Google, Okta, Auth0, JumpCloud etc. | | **Static IPs** | Available as paid feature | Available at no cost (BYOC only) | | | **Pricing** | 50% of GitHub cost | x86-64: Same as Blacksmith; arm64: 40% more powerful, 20% more expensive | | | **Snapshots** | Not available | Available | Save and restore runner state for persistence and incremental builds. Provides 10x improvement in build times by eliminating dependency installation time. | | **Bring Your Own Cloud (BYOC)** | Not available | Available | Cloud-hosted control plane with runners in user's cloud account. Provides maximum flexibility, zero management overhead, and is 10x cheaper than Blacksmith. Users can leverage preferential pricing with their cloud providers. | | **Compliance** | SOC2 Type2 | SOC2 Type2 | | | **Global Regions** | Limited to EU and US | Cloud (EU), Support for 29+ regions globally (BYOC) | Minimizes data transfer costs, improves performance, and supports data residency regulations essential for sensitive workloads. | | **Configurable Disks** | Fixed 80-160GB | Configurable sizes, IOPS, and throughput | Optimized for ML/AI workloads, large container builds, monorepos, game and mobile app development. | | **Advanced Dashboard** | Richer dashboard | Rich dashboard | Blacksmith has more insights on metrics like cost saved, and analytics on cache usage. | | **Security** | Bot requires read-write access to code and build logs. | Bot requires minimal permissions and no code or build logs access. | This is crucial for Enterprises and security conscious teams. 
|

![Blacksmith WarpBuild Cache](/images/blog/blacksmith-warpbuild-comparison-2025-May/blacksmith-warpbuild-cache.png)
![WarpBuild Dashboard](/images/blog/blacksmith-warpbuild-comparison-2025-May/warpbuild-dashboard.png)
![WarpBuild Analytics](/images/blog/blacksmith-warpbuild-comparison-2025-May/warpbuild-analytics.png)

## Conclusion

Blacksmith provides fast Github Actions runners. WarpBuild provides equally fast x86-64 runners and even faster arm64 runners. For enterprises, WarpBuild delivers 10x faster builds with SSO, snapshots, and a multi-arch Docker builder cloud. WarpBuild also provides BYOC for improved security and more customization at a fraction of the cost (~5x cheaper than Blacksmith).

## Get Started Today

WarpBuild is committed to providing you with the tools you need to build faster, smarter, and more cost-effectively. Join us in this new era of development.

---

For detailed technical documentation, visit [WarpBuild Docs](http://docs.warpbuild.com). For any errors in this post, please contact us at support@warpbuild.com.

---

# Blacksmith vs WarpBuild Comparison

URL: /blog/blacksmith-warpbuild-comparison

Blacksmith features, performance, and why WarpBuild is a better Github Actions runner alternative.

---
title: "Blacksmith vs WarpBuild Comparison"
excerpt: "Blacksmith features, performance, and why WarpBuild is a better Github Actions runner alternative."
description: "Blacksmith features, performance, and why WarpBuild is a better Github Actions runner alternative."
date: "2024-08-22"
author: surya_oruganti
cover: "/images/blog/blacksmith-warpbuild-comparison/cover.png"
---

Blacksmith provides Github Actions runners that you can start using by changing just one line in the Github actions workflow file. This post describes Blacksmith's features and shows how it compares to WarpBuild.

## Blacksmith Features

### `x86-64` and `arm64` Runners

Blacksmith supports both x86-64 and arm64 architectures.
This allows you to build your projects on both platforms, without emulation. This can speed up your arm64 builds 2-5x.

WarpBuild Advantage: WarpBuild supports x86-64 and arm64 instances. WarpBuild arm64 instances are ~20% more powerful for faster raw performance, but Blacksmith's x86-64 instances are ~15% faster. WarpBuild has faster networking and more disk configurations, which lead to overall faster builds.

### Concurrency

Blacksmith has no concurrency limits, in theory. However, there may be some delays for large customers because of lead time in provisioning instances.

WarpBuild Advantage: WarpBuild supports truly unlimited concurrency at no additional cost, out of the box, for x86-64 and arm64 instances.

### OS and Images

Blacksmith supports Ubuntu 22.04 only, compatible with Github actions runner images.

WarpBuild Advantage: WarpBuild supports Ubuntu 22.04 as well as the latest Ubuntu 24.04 image, and is 100% compatible with Github actions runner images. Users can use the app to set up any number of custom base images directly.

> WarpBuild also supports MacOS runners powered by M2 Pros for blazing fast MacOS builds on MacOS 13 and 14.

### Caching

Blacksmith provides 25GB of cache per repo, free. Any additional usage evicts the least recently used cache entries. The cache performance is fast for low-concurrency workflows but may not scale well for high-concurrency workflows, since it is backed by a single disk.

WarpBuild Advantage: WarpBuild provides unlimited cache storage with a 7-day retention policy from last access. The cache performance is blazing fast even for workflows with 100+ concurrent jobs. This is a major advantage for larger customers, especially when there are large artifacts, monorepos, or container builds with large layers.

> WarpBuild supports container large layer caching, which Blacksmith does not support.
![Blacksmith WarpBuild Cache](/images/blog/blacksmith-warpbuild-comparison/blacksmith-warpbuild-cache.png) ### Hosting Provider Blacksmith runners are hosted on Hetzner, with most of the compute in the EU region. WarpBuild Advantage: WarpBuild runs compute on AWS, GCP, and Azure and users can choose which region to run their builds in. This has enormous advantages for customers to minimize inter-region data transfer costs and improve performance. ### Security Blacksmith runners are ephemeral VMs, running on bare-metal. This is potentially subject to noisy neighbors and performance degradation. WarpBuild Advantage: WarpBuild runners are ephemeral VMs as well, with the virtualization and isolation handled by the underlying cloud provider (AWS, GCP, Azure) and strong performance guarantees. This allows WarpBuild to be more secure and compliant with the most stringent security standards. ### Enterprise Compliance Blacksmith is SOC2 Type1 compliant. Data residency regions are not handled. WarpBuild Advantage: WarpBuild is in the process of getting SOC2 Type2 compliance certification. The documentation will be available for free, on request. ### Static IPs Blacksmith supports static IPs for the runner instances, required for allowlisting in sensitive workflows. This is a paid feature and is available on request. WarpBuild Advantage: WarpBuild offers static IPs for runners (on BYOC only) at no additional cost. ### Runner Pricing Blacksmith's pricing is 50% of the cost of a Github Actions runner for x86-64 and arm64 runners. WarpBuild Advantage: WarpBuild x86-64 runners are the same price as Blacksmith runners. The arm64 runners are 20% more expensive but offer ~20% higher performance. ### Dashboard and Analytics Blacksmith has a basic dashboard for runners at the high level without drill-downs, but with nice cache analytics. WarpBuild Advantage: WarpBuild supports a rich dashboard for runners, cache usage, and builds. 
There are also analytics and insights for the entire repository, including build times, build failure rates, runtime trends, activity heatmaps, and more.

![WarpBuild Analytics](/images/blog/blacksmith-warpbuild-comparison/warpbuild-analytics.png)

## Missing Features

Blacksmith is missing a lot of features essential for robust CI. These features are usually deal breakers for large and fast-growing teams. WarpBuild supports all of these features.

### Snapshots

WarpBuild supports saving and restoring runner instance state for persistence and incremental builds, which is essential for large codebases. WarpBuild users see a _10x improvement_ in build times due to this feature. Snapshots eliminate the time a CI workflow spends installing dependencies and enable incremental builds.

### Spot Instances

WarpBuild has multiple runner instance configurations, including spot instances, which are ideal for low-cost and short-duration workflows. This makes WarpBuild instances ~20-40% cheaper than Blacksmith.

### Bring Your Own Cloud (BYOC)

WarpBuild supports BYOC, with a cloud-hosted control plane and the runners spawned in the user's cloud account. This provides the best of both worlds, with maximum flexibility and zero management overhead. This is a major advantage for larger customers and is 10x cheaper than Blacksmith. Users can leverage preferential pricing agreements with their cloud providers for even more value.

### Regions

WarpBuild supports over 29 regions globally. This is huge for minimizing data transfer costs and improving performance. It is essential for some customers with sensitive workloads and data residency regulations.

### Disk Configurations

Blacksmith only supports 64GB of disk storage. WarpBuild supports configurable disk sizes, IOPS, and throughput. This is useful for ML/AI workloads, large container builds, monorepos, game developers, and mobile app development.

### Roadmap

Blacksmith's product is still evolving.
In contrast, WarpBuild is already the most capable product in this space and has been rapidly adding new features and capabilities since launching less than a year ago.

![WarpBuild Dashboard](/images/blog/blacksmith-warpbuild-comparison/warpbuild-dashboard.png)

## Conclusion

Blacksmith is a basic provider of Github Actions runners. WarpBuild is superior to Blacksmith in every way, except for a ~10% difference in x86-64 vCPU performance. Blacksmith is ~5x more expensive compared to WarpBuild's BYOC runners. Large or fast-growing teams use WarpBuild for their CI/CD needs at scale: 10x faster builds with snapshots, better security, and more customization with BYOC.

## Get Started Today

WarpBuild is committed to providing you with the tools you need to build faster, smarter, and more cost-effectively. Join us in this new era of development.

---

Stay tuned for more updates and features coming soon. Happy building!

---

For detailed technical documentation, visit [WarpBuild Docs](http://docs.warpbuild.com). Contact us at support@warpbuild.com.

---

# BuildJet Is Shutting Down: Migrate to WarpBuild

URL: /blog/buildjet-shutting-down

BuildJet is shutting down on March 31, 2026. Learn why WarpBuild is the best alternative for fast, affordable GitHub Actions runners with snapshots, unlimited concurrency, and multi-cloud support.

---
title: "BuildJet Is Shutting Down: Migrate to WarpBuild"
description: "BuildJet is shutting down on March 31, 2026. Learn why WarpBuild is the best alternative for fast, affordable GitHub Actions runners with snapshots, unlimited concurrency, and multi-cloud support."
excerpt: "BuildJet is shutting down. Here's why WarpBuild is the best alternative for your GitHub Actions workflows."
date: "2026-02-10"
author: surya_oruganti
cover: "/images/blog/buildjet-shutting-down/cover.png"
---

On February 6, 2026, BuildJet [announced](https://buildjet.com/for-github-actions/blog/we-are-shutting-down) that it is shutting down.
The service will stop running jobs on March 31, 2026, and new signups have already been halted. If you're a BuildJet customer, you need to migrate your workflows before the deadline.

BuildJet was one of the first to prove that teams needed faster GitHub Actions runners. They showed that the default 2-core VMs weren't cutting it — and thousands of teams agreed. Credit where it's due: BuildJet helped establish the market for third-party CI runners.

Teams with demanding CI workloads still need more: snapshots, unlimited concurrency, multi-cloud infrastructure, and fast caching at scale. This matters even more now that AI tools generate so much more code every day, making CI the new bottleneck. That's exactly where WarpBuild comes in.

## What This Means for Your Team

Here's the timeline:

- **February 6, 2026:** New signups halted. All concurrency subscriptions are now free.
- **March 31, 2026:** BuildJet stops running jobs entirely.

If your workflows use `buildjet-*` runner labels, they will fail after March 31. You need to migrate before then. You have two options: go back to GitHub's stock runners, or move to WarpBuild and keep the fast CI experience your team is used to.

## Why Not Just Go Back to GitHub Runners?

If your team chose BuildJet, it was because you needed faster builds. That need doesn't go away when BuildJet does. GitHub's default runners are 2-core shared VMs, and while their larger runners offer more CPU, they still lack features that modern CI demands — snapshots, bring-your-own-cloud, unlimited concurrency, and fast caching at scale. If fast CI matters to your team, it's worth looking at what's available today.

## Why WarpBuild

WarpBuild is a natural next step for BuildJet users. Same one-line setup, same drop-in replacement model — with additional capabilities on top. For a detailed feature-by-feature comparison, see our [BuildJet vs WarpBuild comparison](/blog/buildjet-warpbuild-comparison-2025-May).
Here's what matters most: ### Snapshots: 2-10x Faster Builds WarpBuild provides fast hardware plus [snapshots](/docs/ci/snapshot-runners) — the ability to save and restore full VM state across builds. Instead of reinstalling dependencies, rebuilding caches, and setting up environments from scratch every run, your CI starts exactly where it left off. This is the single biggest unlock for CI performance. ### Unlimited Concurrency WarpBuild has no concurrency limits and no extra fees. Run as many parallel jobs as your workflows need — no caps, no add-on pricing. ### Fast, Unlimited Cache WarpBuild provides unlimited [cache storage](/docs/ci/caching) with 7-day retention from last access, designed to stay fast even at 100+ concurrent jobs. It works with the standard `actions/cache` action — no proprietary tooling needed. ### Multi-Cloud and 29+ Regions WarpBuild supports AWS, GCP, and Azure across 29+ regions globally. Pick the region closest to your source code and artifact storage to minimize latency and data transfer costs. ### Bring Your Own Cloud (BYOC) For teams that want maximum control and savings, [WarpBuild BYOC](/docs/ci/byoc) lets you run CI runners in your own AWS, GCP, or Azure account. WarpBuild manages the control plane while runners execute in your infrastructure — giving you 10x cost savings compared to hosted options, plus static IPs, custom images, and full network control. BuildJet never offered this. ### Full OS Coverage BuildJet supported Ubuntu 20.04 and 22.04 only. WarpBuild runs Ubuntu 20.04/22.04/24.04, Windows Server 2022/2025, and macOS 13/14/15/26 on M4 Pro hardware. Whatever your workflows need, WarpBuild covers it. ### Enterprise Ready WarpBuild is SOC2 Type 2 compliant, supports SSO through Microsoft Entra ID, Google, Okta, Auth0, and JumpCloud, and offers a 99.9% SLA with dedicated support. BuildJet was not SOC2 compliant and charged $500 just for a security assessment. 
### Active Development BuildJet's product was essentially unchanged for over 2.5 years before the shutdown announcement. WarpBuild ships new features every week — remote Docker builders, configurable disks, analytics dashboards, and more. You're migrating to a platform that's investing in the future, not winding down. ## Migrating from BuildJet to WarpBuild Migration takes under 10 minutes for most teams. Here's how: ### 1. Sign up and install the WarpBuild bot Create an account at [app.warpbuild.com](https://app.warpbuild.com) and install the WarpBuild GitHub bot on your repositories. See the [quick start guide](/docs/ci/quick-start) for details. ### 2. Update your runner labels Replace BuildJet runner labels with their WarpBuild equivalents in your workflow files: | BuildJet Label | WarpBuild Label | | --- | --- | | `buildjet-2vcpu-ubuntu-2204` | `warp-ubuntu-2204-x64-2x` | | `buildjet-4vcpu-ubuntu-2204` | `warp-ubuntu-2204-x64-4x` | | `buildjet-8vcpu-ubuntu-2204` | `warp-ubuntu-2204-x64-8x` | | `buildjet-16vcpu-ubuntu-2204` | `warp-ubuntu-2204-x64-16x` | | `buildjet-2vcpu-ubuntu-2204-arm` | `warp-ubuntu-latest-arm64-2x` | | `buildjet-4vcpu-ubuntu-2204-arm` | `warp-ubuntu-latest-arm64-4x` | For a full list of available runner configurations, see the [cloud runners documentation](/docs/ci/cloud-runners). Here's what the change looks like in a workflow file: ```yaml title="Before (BuildJet)" jobs: build: runs-on: buildjet-4vcpu-ubuntu-2204 steps: - uses: actions/checkout@v4 - run: make build ``` ```yaml title="After (WarpBuild)" jobs: build: runs-on: warp-ubuntu-2204-x64-4x steps: - uses: actions/checkout@v4 - run: make build ``` ### 3. Swap the cache action If you're using `buildjet/cache`, replace it with `actions/cache`. WarpBuild's caching infrastructure is fully compatible with the standard GitHub Actions cache action. ```yaml title="Before" - uses: buildjet/cache@v4 ``` ```yaml title="After" - uses: actions/cache@v4 ``` That's it. 
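To apply the label mapping above across many workflow files at once, a small script can do the rewrite in bulk. This is a sketch, not an official migration tool; it assumes GNU sed and covers only the labels from the table, so extend the list for any other sizes you use:

```shell
# Bulk-rewrite BuildJet runner labels to their WarpBuild equivalents.
migrate_buildjet_labels() {
  local dir="${1:-.github/workflows}" pair old new
  # arm labels come first so the plain x64 patterns below
  # don't clobber their "buildjet-...-2204" prefix.
  for pair in \
    "buildjet-2vcpu-ubuntu-2204-arm:warp-ubuntu-latest-arm64-2x" \
    "buildjet-4vcpu-ubuntu-2204-arm:warp-ubuntu-latest-arm64-4x" \
    "buildjet-2vcpu-ubuntu-2204:warp-ubuntu-2204-x64-2x" \
    "buildjet-4vcpu-ubuntu-2204:warp-ubuntu-2204-x64-4x" \
    "buildjet-8vcpu-ubuntu-2204:warp-ubuntu-2204-x64-8x" \
    "buildjet-16vcpu-ubuntu-2204:warp-ubuntu-2204-x64-16x"
  do
    old="${pair%%:*}"; new="${pair#*:}"
    # Find files containing the old label and rewrite them in place.
    grep -rl "$old" "$dir" 2>/dev/null | xargs -r sed -i "s/$old/$new/g"
  done
}

migrate_buildjet_labels .github/workflows
```

Review the resulting diff before committing; labels built up inside matrix expressions or reusable workflows may still need manual attention.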
Your workflows will now run on WarpBuild's infrastructure with faster hardware, better caching, and no concurrency limits. ## Get Started BuildJet helped prove that teams deserve faster CI. WarpBuild is where that journey continues — with snapshots, unlimited concurrency, multi-cloud support, and pricing that's 50% cheaper than GitHub. Don't wait until March 31. [Sign up for WarpBuild](https://app.warpbuild.com) and migrate your workflows today. For a deeper technical comparison, read our [full BuildJet vs WarpBuild breakdown](/blog/buildjet-warpbuild-comparison-2025-May). # BuildJet vs WarpBuild Comparison - 2025 May URL: /blog/buildjet-warpbuild-comparison-2025-May BuildJet features, performance, pricing, and how it compares to WarpBuild. --- title: "BuildJet vs WarpBuild Comparison - 2025 May" excerpt: "BuildJet features, performance, pricing, and how it compares to WarpBuild." description: "BuildJet features, performance, pricing, and how it compares to WarpBuild." date: "2025-05-19" author: surya_oruganti cover: "/images/blog/buildjet-warpbuild-comparison-2025-May/cover.png" --- BuildJet provides Github Actions runners that you can start using by just changing one line in the Github actions workflow file. This post provides a detailed comparison between BuildJet and WarpBuild to help you make an informed decision. 
## Feature Comparison | Feature | BuildJet | WarpBuild | WarpBuild Advantage | | ------------------ | --------------------------------------------------- | -------------------------------------------------------- | -------------------------------------------------- | | **Architecture** | x86-64, arm64 | x86-64 (Desktop Class Ryzen 7950X3D), arm64 (Graviton 4) | More powerful instances for faster raw performance | | **Concurrency** | Limited (64 x86, 32 arm64); +$300/month for 100vCPU | Unlimited | No limits or fees | | **OS Support** | Ubuntu 20/22; Limited custom images | Ubuntu 20/22/24; MacOS 13/14/15; Custom images | Latest OS, MacOS, full compatibility | | **Caching** | 20GB/repo; Slow | Unlimited; 7-day retention; Fast | Unlimited storage, faster performance | | **Infrastructure** | Hetzner (EU) | Cloud (EU), BYOC (AWS, GCP, Azure) | Multiple providers/regions | | **Security** | KVM-based | KVM-based / Cloud provider isolation; Strong guarantees | Better security standards | | **Compliance** | Not SOC2 compliant; $500 assessment fee | SOC2 Type2 compliant | Higher compliance at no cost | | **Pricing** | x86-64: 50% of GitHub; arm64: 20% cheaper | x86-64: 50% of GitHub; arm64: 40% cheaper | Better x86-64 and arm64 price-performance | ## WarpBuild Exclusive Features | Feature | Description & Benefit | | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | **Remote Docker Builder** | Run docker builds on a remote machine with full local caching. This infrastructure is optimized with high performance processors, and local NVMe SSDs. This is a major advantage for large container builds, monorepos, and game and mobile app development. | | **Snapshots** | Save and restore runner state for persistence and incremental builds. 
Provides 10x improvement in build times by eliminating dependency installation time. | | **Bring Your Own Cloud (BYOC)** | Cloud-hosted control plane with runners in user's cloud account on AWS, GCP, Azure. Provides maximum flexibility, zero management overhead, and is 10x cheaper than BuildJet. | | **SSO support** | SSO support for Microsoft Entra ID, Google, Okta, Auth0, JumpCloud etc. Ensures secure and enterprise ready deployment. | | **SOC2 Type2 Certification** | SOC2 Type2 compliant. Documentation available on request. Provides higher compliance standards at no additional cost. | | **Global Regions** | Support for 29+ regions globally. Minimizes data transfer costs, improves performance, and supports data residency regulations. | | **Static IPs** | Static IP addresses for BYOC runner instances. Required for allowlisting in sensitive workflows. | | **Configurable Disks** | Cloud runners with local NVMe SSDs. BYOC runners with configurable disk sizes, IOPS, and throughput. Optimized for ML/AI workloads, large container builds, monorepos, game and mobile app development. | | **Dashboard and Analytics** | Rich dashboard for runners, cache usage, builds, with insights on build times, failure rates, trends. Enables data-driven CI optimization. 
|

## Product Development

| Aspect              | BuildJet                         | WarpBuild                                         |
| ------------------- | -------------------------------- | ------------------------------------------------- |
| **Innovation Rate** | Product unchanged for >2.5 years | Rapid, feature-rich development                   |
| **Roadmap**         | Maintenance mode                 | Active development with regular feature additions |

![WarpBuild Dashboard](/images/blog/buildjet-warpbuild-comparison-2025-May/warpbuild-dashboard.png)
![WarpBuild Analytics](/images/blog/buildjet-warpbuild-comparison-2025-May/warpbuild-analytics.png)
![WarpBuild Cache](/images/blog/buildjet-warpbuild-comparison-2025-May/warpbuild-cache.png)

## Conclusion

BuildJet provides basic Github Actions runners but lacks features essential for a robust CI/CD platform, and its product has been in maintenance mode for years. WarpBuild offers superior price, performance, and features, making it the preferred choice for large or fast-growing teams.

## Get Started Today

WarpBuild is committed to providing you with the tools you need to build faster, smarter, and more cost-effectively. Join us in this new era of development.

---

For detailed technical documentation, visit [WarpBuild Docs](http://docs.warpbuild.com). For any errors in this post, please contact us at support@warpbuild.com.

---

# BuildJet vs WarpBuild Comparison

URL: /blog/buildjet-warpbuild-comparison

BuildJet features, performance, and why WarpBuild is a better Github Actions runner alternative.

---
title: "BuildJet vs WarpBuild Comparison"
excerpt: "BuildJet features, performance, and why WarpBuild is a better Github Actions runner alternative."
description: "BuildJet features, performance, and why WarpBuild is a better Github Actions runner alternative."
date: "2024-08-20"
author: surya_oruganti
cover: "/images/blog/buildjet-warpbuild-comparison/cover.png"
---

BuildJet provides Github Actions runners that you can start using by changing just one line in the Github actions workflow file.
This post describes BuildJet's features and shows how it compares to WarpBuild.

## BuildJet Features

### `x86-64` and `arm64` Runners

BuildJet supports both x86-64 and arm64 architectures. This allows you to build your projects on both platforms, without emulation. This can speed up your builds 2-5x.

WarpBuild Advantage: WarpBuild supports x86-64 and arm64 instances that are more powerful, for faster raw performance.

### Concurrency

BuildJet supports up to 64 concurrent x86-64 vCPUs and 32 concurrent arm64 vCPUs for builds. This can be increased at an additional cost of $300/month per 100 vCPUs.

WarpBuild Advantage: WarpBuild supports unlimited concurrency at no additional cost, out of the box, for x86-64 and arm64 instances.

### OS and Images

BuildJet supports Ubuntu 22.04 and Ubuntu 20.04, compatible with Github actions runner images with some deviations. Custom base images are available for large customers at an additional cost.

WarpBuild Advantage: WarpBuild also supports the latest Ubuntu 24.04 image and is 100% compatible with Github actions runner images. Users can use the app to set up any number of custom base images directly.

> WarpBuild also supports MacOS runners powered by M2 Pros for blazing fast MacOS builds on MacOS 13 and 14.

### Caching

BuildJet provides 20GB of cache per repo, free. Any additional usage evicts the oldest cache entries. Users migrating away from BuildJet have reported that the cache performance was slow.

WarpBuild Advantage: WarpBuild provides unlimited cache storage with a 7-day retention policy from last access. The cache performance is blazing fast even for workflows with 100+ concurrent jobs. This is a major advantage for larger customers, especially when there are large artifacts, monorepos, or container builds with large layers.

> WarpBuild supports container large layer caching, which BuildJet does not support.
![WarpBuild Cache](/images/blog/buildjet-warpbuild-comparison/warpbuild-cache.png)

### Hosting Provider

BuildJet runners are hosted on Hetzner, with most of the compute in the EU region.

WarpBuild Advantage: WarpBuild runs compute on AWS, GCP, and Azure, and users can choose which region to run their builds in. This helps customers minimize inter-region data transfer costs and improve performance.

### Security

BuildJet runners are KVM-based ephemeral VMs running on bare metal. This is potentially subject to noisy neighbors and performance degradation.

WarpBuild Advantage: WarpBuild runners are ephemeral VMs as well, with the virtualization and isolation handled by the underlying cloud provider (AWS, GCP, Azure) and strong performance guarantees. This allows WarpBuild to be more secure and compliant with the most stringent security standards.

### Enterprise Compliance

BuildJet is not SOC2 compliant. A Security Assessment Questionnaire is available for a $500 fee.

WarpBuild Advantage: WarpBuild is in the process of getting SOC2 Type2 compliance certification. The documentation will be available for free, on request.

### Runner Pricing

BuildJet's pricing is 50% of the cost of a GitHub-hosted runner for x86-64. The arm64 runners are less powerful (about 40% of the memory per vCPU) compared to GitHub-hosted runners and are 20% cheaper.

WarpBuild Advantage: WarpBuild x86-64 runners are the same price as BuildJet's. The arm64 runners offer better single-core and multi-core performance than GitHub-hosted runners while being 40% cheaper.

![BuildJet Dashboard](/images/blog/buildjet-warpbuild-comparison/buildjet-dashboard.png)

## Missing Features

BuildJet is missing a lot of features essential for robust CI. These features are usually deal breakers for large and fast-growing teams. WarpBuild supports all of these features.
![WarpBuild Dashboard](/images/blog/buildjet-warpbuild-comparison/warpbuild-dashboard.png)

### Snapshots

WarpBuild supports saving and restoring state from a runner instance, which is essential for persistence and incremental builds in large codebases. WarpBuild users see a 10x improvement in build times due to this feature, because snapshots eliminate the time a CI workflow spends installing dependencies and enable incremental builds.

### Spot Instances

WarpBuild has multiple runner instance configurations, including spot instances, which are ideal for low-cost, short-duration workflows.

### Bring Your Own Cloud (BYOC)

WarpBuild supports BYOC, with a cloud-hosted control plane and runners spawned in the user's cloud account. This provides the best of both worlds: maximum flexibility and zero management overhead. This is a major advantage for larger customers and is 10x cheaper than BuildJet. Users can leverage preferential pricing agreements with their cloud providers for even more value.

### Regions

WarpBuild supports over 29 regions globally. This is huge for minimizing data transfer costs and improving performance, and it is essential for customers with sensitive workloads and data residency regulations.

### Static IPs

WarpBuild supports static IPs for the runner instances, required for allowlisting in sensitive workflows.

### Disk Configurations

WarpBuild supports configurable disk sizes, IOPS, and throughput. This is useful for ML/AI workloads, large container builds, monorepos, game developers, and mobile app development.

### Dashboard and Analytics

WarpBuild provides a rich dashboard for runners, cache usage, and builds, along with analytics and insights for the entire repository, including build times, build failure rates, runtime trends, activity heatmaps, and more.
![WarpBuild Analytics](/images/blog/buildjet-warpbuild-comparison/warpbuild-analytics.png)

### Roadmap and Execution

BuildJet's product has remained unchanged for >1.5 years, with no features or updates added. In contrast, WarpBuild is already the most capable product in this space and has been rapidly adding new features and capabilities to its platform since launching less than a year ago.

## Conclusion

BuildJet is a basic provider of GitHub Actions runners, but it is missing many features essential for a robust CI/CD platform, and it has effectively been in maintenance mode for the last ~1.5 years. Large or fast-growing teams use WarpBuild for their CI/CD needs for price, performance, and features.

## Get Started Today

WarpBuild is committed to providing you with the tools you need to build faster, smarter, and more cost-effectively. Join us in this new era of development.

---

Stay tuned for more updates and features coming soon. Happy building!

---

For detailed technical documentation, visit [WarpBuild Docs](http://docs.warpbuild.com). Contact us at support@warpbuild.com.

---

# Concurrent tests in GitHub Actions

URL: /blog/concurrent-tests

Speeding up test suites by running them concurrently on multiple machines with GitHub Actions

---
title: "Concurrent tests in GitHub Actions"
excerpt: "Concurrent tests in GitHub Actions"
description: "Speeding up test suites by running them concurrently on multiple machines with GitHub Actions"
date: "2024-04-18"
author: abhijit_hota
cover: "/images/blog/concurrent-tests/cover.png"
---

Building and testing are probably the most common use cases for GitHub Actions. Modern test frameworks and build systems have built-in support for sharding tests and remote execution. Sharding is essentially running your tests or builds in parallel, not only on different threads or cores but by distributing them across different machines altogether.
## Concurrent tests for popular test frameworks

This guide details how to set up concurrent tests for some of the popular test frameworks. We will explore how to set up test sharding in GitHub Actions for Jest, Playwright, and Pytest.

### Jest

Jest is a popular JavaScript testing framework. It provides [native support for test sharding](https://jestjs.io/docs/cli#--shard) using the `--shard` option to run your tests simultaneously across multiple machines. The option takes an argument in the form `shardIdx/shardCount`, where `shardIdx` is the index of the shard and `shardCount` is the total number of shards. A simple benchmark on [a dummy test suite run](https://github.com/WarpBuilds/concurrent-tests/actions/runs/8738287628) showed that sharding improves the run time from 3 minutes to 30 seconds. Here's how you can set it up in GitHub Actions:

```yaml
name: CI
on: push

jobs:
  test:
    name: Tests
    runs-on: warp-ubuntu-latest-x64-4x
    strategy:
      fail-fast: false
      matrix:
        shardCount: [10]
        shardIdx: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - name: Run Jest tests
        run: npx jest --shard=${{ matrix.shardIdx }}/${{ matrix.shardCount }}
```

Jest parallelizes test runs across workers to maximize performance. You can tune the performance of each shard by using the `--maxWorkers` option to specify the number of workers to use.

### Playwright

Playwright is a popular end-to-end automation and testing framework for web applications. [Native support for test sharding](https://playwright.dev/docs/test-sharding) is available using the `--shard` option. Just like we saw with Jest earlier, the `--shard` option takes an argument in the form `shardIdx/shardCount`, where `shardIdx` is the index of the shard and `shardCount` is the total number of shards.
Sharding improves the run time from around 5 minutes 26 seconds to 1 minute 25 seconds for a [dummy test suite run](https://github.com/WarpBuilds/concurrent-tests/actions/runs/8740435954). The setup is very similar to that of Jest:

```yaml
name: CI
on: push

jobs:
  tests:
    name: Tests
    runs-on: warp-ubuntu-latest-x64-4x
    strategy:
      fail-fast: false
      matrix:
        shardCount: [10]
        shardIndex: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - name: Install Playwright browsers
        run: npx playwright install --with-deps
      - name: Run Playwright tests
        run: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardCount }}
```
  1. Playwright runs your tests in parallel by default using worker processes. To further optimize the tests in each shard, you can use the `--workers` option to specify the number of workers to use.
  2. Consider using container-based jobs to potentially speed up tests by reducing the overhead of browser installation for each shard.
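A minimal sketch of that container-based setup, reusing the shard matrix from the workflow above (the `mcr.microsoft.com/playwright:v1.44.0-jammy` image tag is illustrative; pin it to the Playwright version in your `package.json`):

```yaml
jobs:
  tests:
    name: Tests
    runs-on: warp-ubuntu-latest-x64-4x
    # The official Playwright image ships with browsers pre-installed,
    # so the per-shard `npx playwright install --with-deps` step goes away.
    container:
      image: mcr.microsoft.com/playwright:v1.44.0-jammy
    strategy:
      fail-fast: false
      matrix:
        shardCount: [10]
        shardIndex: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run Playwright tests
        run: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardCount }}
```

Whether this nets out faster depends on how long the image pull takes on your runners, so benchmark it against the browser-install step.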
### Pytest

Pytest is a widely used testing framework for Python. Although Pytest doesn't natively support test sharding across machines, the third-party plugin [`pytest-split`](https://github.com/jerry-git/pytest-split) can distribute tests based on duration or name. The performance gain after sharding was really significant, since Pytest doesn't run tests in parallel by default: the [dummy test runs](https://github.com/WarpBuilds/concurrent-tests/actions/runs/8738290262) showed a 10x improvement in test run times, from 500 seconds down to 50. Setting up the workflow involves installing the `pytest-split` package and running pytest with the `--splits` and `--group` options:

```yaml
name: CI
on: push

jobs:
  test:
    name: Tests
    runs-on: warp-ubuntu-latest-x64-4x
    strategy:
      fail-fast: false
      matrix:
        splitCount: [10]
        group: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install pytest pytest-split
      - name: Run pytest
        run: pytest --splits ${{ matrix.splitCount }} --group ${{ matrix.group }}
```

You can also make the tests within each shard run faster by parallelizing them with the `pytest-xdist` plugin.

## Comparison and Notes

All of the test suites used, the workflows, and their runs can be found in our [`concurrent-tests`](https://github.com/WarpBuilds/concurrent-tests) repository.
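As a rough sketch of the `pytest-xdist` combination mentioned in the Pytest section (treat the pairing of `pytest-split` and `pytest-xdist` as an assumption to verify for your own suite, since duration-based splitting and process-level parallelism interact), the test steps would change to:

```yaml
      - run: pip install pytest pytest-split pytest-xdist
      - name: Run pytest (sharded across machines, parallel within each shard)
        # --splits/--group select this machine's shard; -n auto lets
        # pytest-xdist spawn one worker process per available CPU core.
        run: pytest --splits ${{ matrix.splitCount }} --group ${{ matrix.group }} -n auto
```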
Here is a comparison of the test run performance before and after sharding for the three test frameworks:

| Test Framework | Default | Sharded x10 | Improvement | Workflow Run |
| -------------- | ------- | ----------- | ----------- | ------------------------------------------------------------------------------ |
| Jest           | 3m      | 30s         | 6x          | [Link](https://github.com/WarpBuilds/concurrent-tests/actions/runs/8738287628) |
| Playwright     | 5m 26s  | 1m 25s      | 3.8x        | [Link](https://github.com/WarpBuilds/concurrent-tests/actions/runs/8740435954) |
| Pytest         | 8m 20s  | 50s         | 10x         | [Link](https://github.com/WarpBuilds/concurrent-tests/actions/runs/8738290262) |

An important thing to keep in mind before sharding your tests with GitHub Actions is that all the steps inside the job are run again for each shard. This includes installing dependencies, fetching third-party actions, and so on. For example, as stated earlier in the Playwright setup, we need to install the browsers for each shard.

Steps like these can add significant overhead, making the performance gain smaller than expected or even nonexistent; in the worst case, sharding can increase the overall run time. It makes sense to benchmark your tests before and after sharding to verify the benefit. Usually, large test suites with long run times benefit the most from sharding.
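One way to trim the repeated dependency-install overhead in the Node.js examples above is to restore a shared package cache in every shard rather than downloading from scratch; a sketch using the built-in npm cache of `actions/setup-node` (the `node-version` value is illustrative):

```yaml
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          # Cache keyed on package-lock.json; each shard restores the same
          # npm download cache instead of re-fetching every package.
          cache: npm
      - run: npm ci
```

This doesn't remove the `npm ci` step itself, but it cuts most of the network time out of it for every shard.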

## Limitations

While GitHub Actions' matrix strategy is really handy for parallelizing tests and builds, one constraint can't be ignored: the limit on the number of concurrent jobs. GitHub has [a limit of 20 concurrent jobs per account](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits). The limit rises to 40 on a Pro account and 60 on a Team account, but that is still very limiting for large teams.

## Further improvements

Overall CI workflow times with GitHub Actions can be reduced by parallelizing tests. Run times can be improved further by making the tests within each shard run faster, using the native parallelization features of the test frameworks or third-party packages.

WarpBuild provides blazing fast runners, optimized for CPU, network, and disk performance, that are 30% faster at half the cost. **At [WarpBuild](https://warpbuild.com), there are _no limits_ on the number of concurrent jobs**. You can run as many jobs as you want in parallel and minimize the wait times on GitHub Actions workflows. It takes only ~2 minutes to [get started](https://app.warpbuild.com).

Even though WarpBuild doesn't impose a concurrency limit, GitHub itself caps the number of jobs generated by a single job matrix at 256. That limit is seldom a problem in practice.

# Debug GitHub Actions in live CI environment with AI

URL: /blog/debug-github-actions-ai

A comprehensive guide to debugging GitHub Actions CI failures by SSH-ing into the live runner environment. Covers Tailscale (zero-config) and Cloudflare Tunnel (SSH key) approaches with step-by-step setup instructions. Uses Cursor as the IDE and AI assistants for real-time debugging. --- title: "Debug GitHub Actions in live CI environment with AI" excerpt: "Stop the push-wait-fail cycle.
Two methods to SSH directly into your GitHub Actions runner on failure for real-time debugging in live CI environment with your IDE (Cursor) and AI assistants." description: "A comprehensive guide to debugging GitHub Actions CI failures by SSH-ing into the live runner environment. Covers Tailscale (zero-config) and Cloudflare Tunnel (SSH key) approaches with step-by-step setup instructions. Uses Cursor as the IDE and AI assistants for real-time debugging." date: "2025-12-23" author: prashant_raj cover: "/images/blog/debug-github-actions-ai/cover.png" --- import { Step, Steps } from 'fumadocs-ui/components/steps'; ## The Problem Debugging CI failures often devolves into a frustrating cycle: make a change, push, wait for the build, hit the next error, repeat. This happens because replicating the CI environment locally is difficult—or sometimes impossible when dealing with OAuth tokens, cloud IAM roles, or environment-specific configurations. This guide shows you how to break that cycle by SSH-ing directly into your live CI environment when a job fails. You'll get full IDE access (with AI coding assistants) and an interactive terminal to diagnose and fix issues in real-time. This guide uses Cursor. Any VS Code fork may work but alternatives have not been tested. ## Prerequisites - [Remote-SSH extension](https://marketplace.cursorapi.com/items?itemName=anysphere.remote-ssh) installed in Cursor/VS Code - A GitHub repository with Actions workflows - For Tailscale method: A [Tailscale account](https://login.tailscale.com/start) - For SSH method: An SSH key pair and [Cloudflare CLI](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/install-and-setup/installation/) --- ## Method 1: Tailscale (Recommended) Tailscale provides a zero-trust mesh VPN that eliminates SSH key management entirely. Authentication is handled through your identity provider, and devices on your tailnet can communicate directly without exposing ports to the public internet. 
No SSH keys to configure, rotate, or store in secrets. Keyless authentication via your identity provider. Ephemeral nodes auto-remove when the runner terminates. ### Setup Tailscale ### Create a Tailscale account Sign up at [login.tailscale.com/start](https://login.tailscale.com/start). This creates your personal tailnet—a private mesh network for your devices. ### Add your local machine to the tailnet Follow the [Tailscale device setup guide](https://tailscale.com/kb/1316/device-add) to install Tailscale on your development machine. Verify it appears in your [admin dashboard](https://login.tailscale.com/admin/machines). ### Generate an ephemeral auth key Navigate to **Settings → Personal Settings → Keys** in the Tailscale admin console. Click **Generate auth key** with these options: - **Ephemeral**: Enabled (auto-removes the device when disconnected) - **Reusable**: Enabled (allows the same key for multiple runner registrations) ![tailscale auth key](/images/blog/github-actions-faster-debugging/tailscale-auth-key-form.png) ### Add the auth key to GitHub Secrets In your repository, go to **Settings → Secrets and variables → Actions** and create a new secret named `TAILSCALE_AUTH_KEY_DEBUG` with your generated auth key. ### Configure the Workflow Add this step to your workflow after the step that's failing. 
It connects the runner to your tailnet and keeps the job alive for debugging: ```yaml - name: Setup Tailscale SSH if: ${{ failure() }} run: | # Write secret to temp file mkdir -p /tmp/secrets echo "${{ secrets.TAILSCALE_AUTH_KEY_DEBUG }}" > /tmp/secrets/tailscale_auth_key chmod 600 /tmp/secrets/tailscale_auth_key # Install Tailscale curl -fsSL https://tailscale.com/install.sh | sh # Start Tailscale with auth key and SSH enabled sudo tailscale up --auth-key=$(cat /tmp/secrets/tailscale_auth_key) --hostname=gha-debug-${GITHUB_RUN_ID} --ssh # Get connection info TAILSCALE_IP=$(tailscale ip -4) TAILSCALE_HOSTNAME="gha-debug-${GITHUB_RUN_ID}" echo "" echo "========================================" echo "SSH INTO RUNNER" echo "========================================" echo "ssh runner@$TAILSCALE_IP" echo "ssh runner@$TAILSCALE_HOSTNAME" echo "" echo "OPEN IN CURSOR:" echo "cursor --folder-uri \"vscode-remote://ssh-remote+runner%40$TAILSCALE_IP/home/runner/work/\"" echo "========================================" echo "" echo "(Keeping job alive; cancel the workflow when done debugging.)" tail -f /dev/null ``` ### Connect from Cursor When the workflow fails, the step above logs a Cursor command. Copy and run it in your terminal: ```console cursor --folder-uri "vscode-remote://ssh-remote+runner%40100.88.xx.xx/home/runner/work/" ``` Cursor opens directly to the runner's workspace. You now have full IDE access with integrated terminal and AI assistance. --- ## Method 2: SSH with Cloudflare Tunnel If you can't use Tailscale, this method uses traditional SSH keys with Cloudflare Tunnel to expose the runner's SSH port without a public IP. This approach requires managing SSH keys and updating your local SSH config for each debug session. It's more complex but works in environments where Tailscale isn't an option. ### Setup SSH Keys ### Generate an SSH key pair (if needed) Check for existing keys with `ls ~/.ssh`. 
If you need a new key, generate one:

```bash
ssh-keygen -t ed25519 -C "github-actions-debug"
```

See the [GitHub SSH key guide](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent) for details.

### Add the private key to GitHub Secrets

Create a secret named `SSH_PRIVATE_KEY_DEBUG` containing your **private** key (the file without the `.pub` extension).

### Configure the Workflow

Add these steps after the failing step. They configure SSH and create a Cloudflare tunnel:

```yaml
- name: Setup SSH private key
  if: ${{ failure() }}
  run: |
    mkdir -p ~/.ssh
    echo "${{ secrets.SSH_PRIVATE_KEY_DEBUG }}" > ~/.ssh/custom_key
    chmod 600 ~/.ssh/custom_key
    ssh-keygen -y -f ~/.ssh/custom_key > ~/.ssh/custom_key.pub
    chmod 644 ~/.ssh/custom_key.pub
    cat >> ~/.ssh/config << 'EOF'
    Host *
      IdentityFile ~/.ssh/custom_key
      AddKeysToAgent yes
    EOF
    chmod 600 ~/.ssh/config

- name: Start SSH session with Cloudflare Tunnel
  if: ${{ failure() }}
  uses: valeriangalliat/action-sshd-cloudflared@v1
```

### Connect from Cursor

### Clear any cached host keys

```bash
ssh-keygen -R action-sshd-cloudflared
```

### Configure the SSH proxy

Add a host entry to `~/.ssh/config` using the hostname from the action output:

```text
Host cf-gha
  HostName action-sshd-cloudflared
  User runner
  ProxyCommand cloudflared access tcp --hostname <hostname>
  StrictHostKeyChecking accept-new
```

Replace `<hostname>` with the Cloudflare hostname (e.g., `facilities-canvas-frequency-reasonable.trycloudflare.com`).

### Open Cursor

```bash
cursor --folder-uri "vscode-remote://ssh-remote+cf-gha/home/runner/work/"
```

---

## Production Considerations

### Finding Runner Logs

When debugging with AI assistants, you'll often want to feed them the full job logs.
WarpBuild runners store logs at: | Image | Log Location | | --- | --- | | Ubuntu x86-64 | `/runner/_diag/*.log` | | Ubuntu ARM64 24.04 | `/runner/_diag/*.log` | | Windows x86-64 | `C:\warpbuilds\runner\_diag\*.log` | | macOS ARM64 | `/Users/runner/.warpbuild/github-runner/runner-app-new/_diag/*.log` | ### Timeout and Notifications Keeping a runner alive indefinitely burns CI minutes. Add a timeout and optional Slack notification for visibility: ```yaml - name: Notify on failure if: ${{ failure() }} uses: rtCamp/action-slack-notify@v2 env: SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }} - name: Setup debug session if: ${{ failure() }} timeout-minutes: 30 # Auto-terminate after 30 minutes run: | # ... Tailscale or SSH setup from above ``` GitHub Actions has a default job timeout of 6 hours. Without an explicit timeout-minutes, a forgotten debug session will continue burning minutes until it hits that limit. See [rtCamp/action-slack-notify](https://github.com/rtCamp/action-slack-notify) for notification configuration and [GitHub's timeout documentation](https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax#jobsjob_idstepstimeout-minutes) for step timeout syntax. --- WarpBuild provides high-performance GitHub Actions runners with 10x faster job start times and built-in debugging features. Our runners include pre-configured log locations and optimized networking for seamless SSH access. - **Linux, Windows, and macOS** runners across all major cloud providers - **50-90% cost savings** compared to GitHub-hosted runners - **Enterprise-ready** with BYOC (Bring Your Own Cloud) support Get started or book a demo to see how WarpBuild can accelerate your CI/CD pipeline. # Depot vs WarpBuild Comparison - 2025 May URL: /blog/depot-warpbuild-comparison-2025-May Depot features, performance, pricing, and how it compares to WarpBuild. 
--- title: "Depot vs WarpBuild Comparison - 2025 May" excerpt: "Depot features, performance, pricing, and how it compares to WarpBuild." description: "Depot features, performance, pricing, and how it compares to WarpBuild." date: "2025-05-19" author: surya_oruganti cover: "/images/blog/depot-warpbuild-comparison-2025-May/cover.png" --- Depot provides Github Actions runners that you can start using by just changing one line in the Github actions workflow file. This post provides a detailed comparison between Depot and WarpBuild to help you make an informed decision. Check out the live benchmark comparison between WarpBuild and Depot [here](https://www.warpbuild.com/compare/depot). ## Feature Comparison | Feature | Depot | WarpBuild | WarpBuild Advantage | | ------------------------------- | -------------------------------------------------- | -------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **CPU: x86-64** | AMD EPYC m7a | x86-64 (Desktop Class Ryzen 7950X3D) | WarpBuild is 40% more powerful than Depot for the same price | | **CPU: arm64** | AWS Graviton 4 | AWS Graviton 4 | Depot is 33% more expensive for the same performance; | | **MacOS hardware** | M2 Pro (8vcpu) 24GB RAM | M4 Pro (6 vcpu) 24GB RAM | WarpBuild jobs are ~20% faster | | **OS Support** | Ubuntu 22.04, 24.04; MacOS 14; Windows Server 2022 | Ubuntu 22.04, 24.04, MacOS 13/14/15 with M4 Pro, Windows Server 2022 | Latest Linux, MacOS, Windows, custom images | | **Bring Your Own Cloud (BYOC)** | Enterprise plan and AWS only | Self-serve on AWS, GCP, Azure | Cloud-hosted control plane with runners in user's cloud account. Provides maximum flexibility, zero management overhead, and does not require an Enterprise Plan. 
| | **Infrastructure** | Region info unavailable | Cloud (EU), BYOC (AWS, GCP, Azure any region) | Multiple providers/regions, reduced data transfer costs, and BYOC for improved security | | **Remote Docker Builder** | Multi-arch, limited concurrency | Multi-arch, higher concurrency | WarpBuild spins up multiple instances in parallel for concurrent builds while Depot runs parallel builds on the same instance, causing slow-down. WarpBuild also provides a multi-arch docker builder cloud. | | **SSO support** | Available | Available | WarpBuild supports SSO with direct integrations for Microsoft Entra ID, Google, Okta, Auth0, JumpCloud etc. | | **Static IPs** | No | Available at no cost (BYOC only) | | | **Snapshots** | Not available | Available | Save and restore runner state for persistence and incremental builds. Provides 10x improvement in build times by eliminating dependency installation time. | | **Caching** | Unlimited | Unlimited; 7-day retention | - | | **Advanced Dashboard** | Richer dashboard | Rich dashboard | Depot has estimated analytics on time saved by builds and docker layer explorer. WarpBuild does not. | | **Dedicated Cache Actions** | No | Yes | WarpBuild provides framework-specific cache actions to speed up builds. | | **Compliance** | SOC2 Type2 | SOC2 Type2 | | | **Fast boot** | Instances on Standby for fast boot | Instances on standby for fast boot even on BYOC. | Extremely valuable, especially for Windows instances and larger instance types | | **Pricing** | Fixed subscription of $200/month beyond usage | No subscription required, pay as you go. 
| | ![Depot WarpBuild Cache](/images/blog/depot-warpbuild-comparison-2025-May/depot-warpbuild-cache.png) ![WarpBuild Dashboard](/images/blog/depot-warpbuild-comparison-2025-May/warpbuild-dashboard.png) ![WarpBuild Analytics](/images/blog/depot-warpbuild-comparison-2025-May/warpbuild-analytics.png) ## Conclusion WarpBuild's x86-64 runners are much faster than Depot's runners while the arm64 runners are similar in performance but cheaper. Overall, WarpBuild is cheaper and fully usage based without any subscription fees. WarpBuild also provides MacOS M4 Pro hardware for faster builds. For enterprises, WarpBuild delivers fully customizable BYOC options, SSO, snapshots, better security, and more customization at a fraction of the cost (~5x cheaper compared to Depot). ## Get Started Today WarpBuild is committed to providing you with the tools you need to build faster, smarter, and more cost-effectively. Join us in this new era of development. --- For detailed technical documentation, visit [WarpBuild Docs](http://docs.warpbuild.com). For any errors in this post, please contact us at support@warpbuild.com. --- # Design files: Onboarding URL: /blog/design-files-onboarding Designing the best onboarding experience --- title: "Design files: Onboarding" excerpt: "Designing the onboarding experience from first principles that engineers love by going the extra mile." description: "Designing the best onboarding experience" date: "2023-12-15" author: surya_oruganti cover: "/images/blog/design-files-onboarding/cover.png" --- WarpBuild's value proposition is built around a fundamental invariant - everybody wants builds to be faster, and everybody wants things cheaper[1]. Now, that means our primary job shifts from convincing developers to use our product to making people aware of the fact that our product exists. This shortens the funnel significantly. 
Once the user signs up to try the product, the onus is on the onboarding flow to ensure there is no drop-off, and that is a challenge we took very seriously. We laid out some principles to deliver on this promise:

- Onboarding must be trivially easy
- A high degree of polish is required for the product, from day 1
- Speed is a feature and critical to developer experience

## The onboarding process

Switching a GitHub Actions workflow to WarpBuild is a one-line change in the workflow file. Easy, right? Not really. It's very common for repositories to have tens or even hundreds of GitHub Actions workflows. Our single highest priority for a signed-up user is that all of the user's workflows run on WarpBuild.

Changing hundreds of workflows manually is frustrating, obviously. Asking our users to do that would be conducive to exactly two things: having our users hate us and ensuring they don't switch workflows. We set out to solve this while designing the UI to:

- Pull information from GitHub about the connected repositories, then parse and display the workflows present.
- Raise a PR with the user-selected runner configuration for all selected workflows in one click.
- Open the PR in a new tab so users can review and merge.

This is what our onboarding workflow looks like now, and it works brilliantly!

![Onboarding workflow](/images/blog/design-files-onboarding/onboarding-workflow.gif)

## The results

We had users move 50+ workflows in a few minutes and [got good HN karma](https://news.ycombinator.com/item?id=38572160) because of this. Our users love it! While the numbers are too small for statistically significant analysis, we continue to see very high funnel progression and conversion to paid usage.

📚 Check out the [documentation here](/docs/ci/)

⚡️ [Get started with WarpBuild in 2 minutes here](https://app.warpbuild.com)

---

[1] Variant of [this quote by Jeff Bezos](https://www.goodreads.com/quotes/966699-i-very-frequently-get-the-question-what-s-going-to-change).
# Cached Docker Builders - ARM64 support URL: /blog/docker-builders-arm64 ARM64 support for cached docker builders for multi-arch builds, for the world's fastest docker builds. --- title: "Cached Docker Builders - ARM64 support" excerpt: "ARM64 support for cached docker builders for multi-arch builds, for the world's fastest docker builds." description: "ARM64 support for cached docker builders for multi-arch builds, for the world's fastest docker builds." date: "2025-03-19" author: surya_oruganti cover: "/images/blog/docker-builders-arm64/cover.png" --- WarpBuild recently announced new docker container builders that combine high-performance processors with directly attached SSDs to deliver the fastest docker builds in the world. Now, we are excited to announce ARM64 support for these builders to enable multi-arch builds. ## What does this mean? This means that you can now build and push your docker images for `linux/amd64` and `linux/arm64` architectures from a single build. ## Caching Similar to the x86_64 builders, the ARM64 builders support caching of the build context, base image, and dependencies. Caches are automatically managed by the builder and no additional configuration is required. Remove the `cache-from` and `cache-to` steps from your `build-push-action` step as the builder will handle it for you. For `linux/arm64` builds, this leads to build times that are 2-3x faster than x86_64 builders using emulation. ## 🚀 Try it out Create a builder profile in the WarpBuild dashboard and add the following to your GitHub actions workflow before your `build-push-action` step: ```yaml - name: Configure WarpBuild Docker Builders uses: Warpbuilds/docker-configure@v1 with: profile-name: "super-fast-builder" - name: Build and push uses: docker/build-push-action@v6 with: context: .
file: Dockerfile platforms: linux/amd64,linux/arm64 # Provide the platforms you want to build for push: true tags: ${{ steps.docker_build.outputs.image }} ``` Try out WarpBuild's new docker builders - [Documentation](/docs/ci/docker-builders). - [Get started](https://app.warpbuild.com). --- # The World's Fastest Docker Builders URL: /blog/docker-builders WarpBuild's new docker container builders combine high performance processors with directly attached SSDs to deliver the fastest docker builds in the world. --- title: "The World's Fastest Docker Builders" excerpt: "WarpBuild's new docker container builders combine high performance processors with directly attached SSDs to deliver the fastest docker builds in the world." description: "WarpBuild's new docker container builders combine high performance processors with directly attached SSDs to deliver the fastest docker builds in the world." date: "2025-03-14" author: surya_oruganti cover: "/images/blog/docker-builders/cover.png" --- WarpBuild's new docker container builders combine high performance processors with directly attached SSDs to deliver the fastest docker builds in the world. ## First principles ### The hardware The cloud instances have amazing network speeds, but the disk speed is often a limiting factor. There are two types of instances that cloud providers offer: 1. Ones with local SSDs - The local SSDs are fast, but they come with a catch - the local SSDs are ephemeral. This means that the disk is lost when the instance is terminated. This requires the context to be persisted elsewhere which becomes a bottleneck and a maintenance nightmare. 2. Ones with network-attached SSDs - The network-attached SSDs are high-latency, but the size is not limited. The performance in terms of IOPS and throughput is low by default but can be improved though that becomes extremely expensive. 
Processors are fast, and the Genoa-X EPYC processors in particular outperform the mainstream cloud instance types (`m7a` on AWS, `c4d` on GCP, and `dasv6` on Azure). However, for quick builds, the IO is the limiting factor. The processor speeds show a significant improvement to overall build times when the build step is time consuming (as opposed to the IO operations).

A quick aside: the `m7a` and `dasv6` instances have roughly the same single core performance when tested on Passmark (~2900), while the `c4d` instances score ~3400. While benchmarks are not everything, this is a good indicator of the performance difference.

### Logical optimizations aka caching

The fastest network or disk transfer is one that is not needed. Caching can eliminate the need for some of the transfers. The more we can cache, the faster the build.

## What makes a fast docker builder?

The (interesting part of the) lifecycle of a docker build is as follows:

1. Transfer the context to the builder instance
2. Download the base image
3. Download the dependencies
4. Build the application
5. Push the image to a container registry

All the steps except the actual build step are network or disk IO bound. The actual build step is processor and memory bound, though it can also depend on the disk IOPS in certain cases.

### But it builds fast on my laptop

That is true, because your typical MacBook has an extremely high single core performance processor and an extremely high IOPS disk (SSD). The network speed may be low, but the entire context, the base image, the dependencies, and some parts of the build step are cached and readily available.

### How do we replicate this, but on a cloud instance?

The answer is simple: attach a high performance disk to a high performance CPU instance and pair it with a fast network. Doing this is inefficient in a public cloud environment, because the disk is ephemeral and the cost of the instance is high.

## The solution

We went baremetal.
We bought high performance CPUs and attached large volumes of extremely fast SSDs to them, paired with a blazing fast network. We then created a custom orchestration layer on top of that to manage the lifecycle of the docker builders.

This orchestration layer is built from the ground up with the following capabilities:

1. Complete caching for maximum build speed.
2. Dedicated cores of fast, high single core performance processors.
3. Storage that is directly attached to the builder instance to ensure that the disk IO is not a bottleneck.
4. Nested virtualization to ensure that the builder instance is completely isolated from the host.
5. Managing the storage lifecycle even with parallel builds.

Managing the storage lifecycle is a non-trivial problem. We had to ensure that we could handle parallel builds (currently limited, but soon to be unlimited) and that the build context is available to the builder instance. Copy-on-write volumes are used to make the build context available to the builder instance with minimal storage overhead and fast builder startup times.

## Performance

This was a lot of effort to get right, but the results speak for themselves. For the same commit on the `posthog/posthog` and `netflix/dispatch` repositories, the docker builders deliver build times that are up to 65x faster than the public cloud builders!

![Docker Builder Performance Comparison](/images/blog/docker-builders/comparison.png)

The chart above shows the build time comparison between the default GitHub Actions runners, another provider with builders hosted in the public cloud, and the WarpBuild docker builders.

Since launch, customers have reported real-life build time reductions from `7:00 -> 0:45`, `4:00 -> 1:30`, and the list goes on.
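As an aside, the copy-on-write mechanism described above can be demonstrated with nothing more than coreutils. This is only an illustrative sketch of the idea, not WarpBuild's actual implementation; `cp --reflink=auto` clones blocks on filesystems that support it (XFS with reflinks, Btrfs) and silently falls back to a normal copy elsewhere:

```shell
# Create a 64 MiB file, then "copy" it via a reflink clone. On a
# reflink-capable filesystem the clone shares data blocks with the
# original until either file is modified, so the operation is nearly
# instant and consumes almost no extra space.
truncate -s 64M context.img
cp --reflink=auto context.img clone.img
```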
## 🚀 Try it out

Create a builder profile in the WarpBuild dashboard and add the following to your GitHub Actions workflow before your `build-push-action` step:

```yaml
- name: Configure WarpBuild Docker Builders
  uses: Warpbuilds/docker-configure@v1
  with:
    profile-name: "super-fast-builder"
```

Try out WarpBuild's new docker builders:

- [Documentation](/docs/ci/docker-builders).
- [Get started](https://app.warpbuild.com).

---

# Docker registry mirror setup

URL: /blog/docker-mirror-setup

Setup docker registry mirror for faster image pulls and avoiding rate limit issues

---
title: "Docker registry mirror setup"
excerpt: "Setup docker registry mirror for faster image pulls and avoiding rate limit issues"
description: "Setup docker registry mirror for faster image pulls and avoiding rate limit issues"
date: "2024-04-17"
author: surya_oruganti
cover: "/images/blog/docker-mirror-setup/cover.png"
---

You might have run into a docker rate limit issue when pulling images from DockerHub anonymously, without signing in. DockerHub has a rate limit of 100 pulls per 6 hours per IP address. For authenticated users, the limit is 200 pulls per 6 hours, and it goes up to 5000 pulls per day with a paid plan. In many cases, this is not enough.

One way to work around these limits is to send requests to a registry mirror. The docker daemon can be configured to pull images from the mirror instead of going to DockerHub each time for the pull. Docker maintains an [image](https://hub.docker.com/_/registry) which can be used for this exact purpose. Another way is to use public mirrors such as Google's `https://mirror.gcr.io`. However, this will not serve private images, and coverage of all public DockerHub images is not guaranteed either.

We'll focus on the first method, setting it up on AWS, and cover the pros, cons, and common pitfalls.

## Infra setup

- A Kubernetes cluster, EKS.
This isn't a requirement for deploying the registry, but we prefer to manage it through a k8s cluster.
- Ingress nginx setup on the k8s cluster.
- Cert-manager setup on the k8s cluster.
- Cluster issuer for cert-manager setup called `letsencrypt-prod` or something else.
- An S3 bucket. This is used to store the images.

You can set this up as just a container on AWS Fargate, AWS ECS, Google Cloud Run, or any other container runner. The data from the S3 backend does not go through this container but is instead transferred directly from S3 to the docker process. This is beneficial as it avoids egress fees in many scenarios.

## Deploying

We'll deploy the registry using a [helm chart](https://artifacthub.io/packages/helm/twuni/docker-registry). If you wish to use plain Kubernetes YAML instead, run `helm template` and make the suggested changes to the generated manifest.

```yaml
image:
  repository: mirror.gcr.io/registry
ingress:
  enabled: true
  className: nginx
  path: /
  hosts:
    -
  annotations:
    ingress.kubernetes.io/ssl-redirect: "true"
    ingress.kubernetes.io/proxy-body-size: "0"
    # You might need to change based on your issuer
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  labels: {}
  tls:
    - secretName: registry-mirror-docker-registry-ingress
      hosts:
        -
resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 0.5
    memory: 512Mi
storage: s3
# Secrets for S3 access and secret keys
# Use a secretRef with keys (accessKey, secretKey) for secrets stored outside
# the chart
s3:
  secretRef: "registry-mirror-aws-credentials"
```

Save the above file as values.yaml.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: registry-mirror-aws-credentials
  namespace: mirror
type: Opaque
stringData:
  accessKey:
  secretKey:
```

Save the above Kubernetes secret as secret.yaml and replace the accessKey and secretKey.
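If you'd rather not write credentials into a manifest on disk, the same secret can also be created directly from the command line. This is a sketch; the `...` placeholders stand for your real keys:

```console
kubectl create secret generic registry-mirror-aws-credentials \
  --namespace mirror \
  --from-literal=accessKey=... \
  --from-literal=secretKey=...
```

Either way, the secret must exist before the registry pods start, since the chart references it via `secretRef`.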
Since we are using `stringData`, you don't need to base64-encode your credentials.

Make sure that the DNS mapping is correct and that the URL is routed to your cluster. The process of configuring DNS will depend on what you are using to manage the mappings. If you are completely on AWS, it should be in Route53.

Make sure you have configured kubectl with your cluster context. In the case of AWS EKS, just run:

```console
aws eks update-kubeconfig --name --region
```

This will make your cluster the default context.

We set up the secret first so that the registry has access to the s3 bucket. If the `mirror` namespace doesn't exist yet, create it first with `kubectl create namespace mirror`, since the secret is applied before the helm install that would otherwise create the namespace.

```console
kubectl apply -f secret.yaml
```

If you wish to validate that the secret was added to the cluster, run the following.

```console
kubectl get secret registry-mirror-aws-credentials -n mirror
```

This should give you info on the k8s secret.

Install the registry on your cluster with the following command:

```console
helm repo add twuni https://helm.twun.io
helm install registry-mirror twuni/docker-registry \
  --namespace mirror \
  --version 2.2.3 \
  --values values.yaml \
  --create-namespace
```

This will set up a registry mirror deployment in the mirror namespace. If you need to make updates to the values, replace the `helm install` with `helm upgrade`.

## Try out the mirror

Configure your docker daemon to use the registry mirror. Add the following entry to your `/etc/docker/daemon.json`. If you don't have this file, create a new one with only the entry below.

```json
{
  "registry-mirrors": [""]
}
```

Restart the docker daemon.

```console
# stop the docker daemon
sudo systemctl stop docker.service

# verify that the docker daemon is stopped; this would list the service as stopped
sudo systemctl status docker.service

# start the docker daemon
sudo systemctl start docker.service
```

Pull an image, say `ubuntu:22.04`.

```console
docker pull ubuntu:22.04
```

After this change, the docker daemon will first ask the mirror service whether it has ubuntu:22.04.
If it doesn't, the daemon falls back to pulling from `docker.io`, and the mirror service silently pulls and caches ubuntu:22.04 in the background. Remove the docker image for ubuntu:22.04 and re-pull.

```console
docker image rm ubuntu:22.04
docker pull ubuntu:22.04
```

This should pull the image from S3. You can check the registry mirror logs to verify this: you would see a number of 307 responses with layer SHA digests. The 307 is sent because the registry mirror redirects to the pre-signed S3 URL. You can also verify that the S3 bucket has these SHA layers cached.

## Pros

- Faster downloads. Pulls served from S3 are very fast, so your images download at a boosted speed.
- DockerHub rate limits are bypassed.

## Cons

- Infra management overhead. You'll need to pay for compute and S3.
- No direct lifecycle support for images, so there is no built-in way to clean up stale images, and old images accumulate in your S3 bucket. You can configure a lifecycle policy on S3, but it applies to all objects, not just the stale ones.

## Common Pitfalls

### Using the docker.io image for the registry mirror

We had a bad deployment that was causing the Docker mirror to restart. The registry mirror pods became unreachable, which meant all the traffic was directed to DockerHub again. This quickly led to the Docker mirror container itself being throttled, because we were pulling the [official registry image](https://hub.docker.com/_/registry) from DockerHub. We switched the image to pull from `gcr.io` instead to mitigate this.

### Surprising data transfer charges for S3

S3 is excellent when you want a cheap and simple way to save and restore data at scale. It is also free if your transfers are within the same region, which is the case for us. So it was surprising to see an unexpectedly large data transfer charge on our AWS bill. Our infra hadn't changed apart from the registry, and the registry only sends a pre-signed URL, which is small enough to be negligible.
Scouring the AWS docs, we found that even though the bucket was in the same region, the traffic was not treated as same-region because our route tables didn't have a direct gateway to the S3 service. To fix this, set up a VPC gateway endpoint for S3, which makes same-region data transfers free.

## Footnote

This is just one of the many optimizations that you'll get out of the box with WarpBuild runners to make your GitHub Actions CI workflows faster. You can try it out at [app.warpbuild.com](https://app.warpbuild.com).

## References

- Docker registry mirror image: https://hub.docker.com/_/registry
- Docker registry mirror source code: https://github.com/distribution/distribution/tree/main/registry
- AWS S3 Pricing: https://aws.amazon.com/s3/pricing/
- AWS S3 Gateway Endpoint: https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html

# Using GitHub Actions Cache with popular languages

URL: /blog/github-actions-cache

Using GitHub Actions Cache with popular languages

---
title: "Using GitHub Actions Cache with popular languages"
excerpt: "Using GitHub Actions Cache with popular languages"
description: "Using GitHub Actions Cache with popular languages"
date: "2024-05-16"
author: surya_oruganti
cover: "/images/blog/github-actions-cache/cover.webp"
---

## GitHub Actions Cache

GitHub Actions provides a powerful CI/CD platform enabling developers to automate workflows and streamline development pipelines. One valuable feature to speed up these pipelines is **caching**. By caching dependencies and other build artifacts, you can drastically reduce build times, particularly for frameworks that rely on extensive dependency fetching and compilation.

GitHub provides a [`cache`](https://github.com/actions/cache) action that allows workflows to cache files between workflow runs. In this post, we'll dive into how to use this action for popular programming languages and frameworks, including Node.js, Python, Rust, Go, PHP, and Java.
We'll also highlight common pitfalls, considerations, and shortcomings of the cache action to provide a comprehensive understanding.

## Benefits

Caching dependencies offers several benefits:

1. **Faster Builds**: By caching dependencies, you can avoid the time-consuming process of downloading and installing them on every workflow run. This leads to faster build times and quicker feedback loops.
2. **Reduced Network Bandwidth**: Caching minimizes the need to download dependencies repeatedly, saving network bandwidth and reducing the load on package registries.
3. **Improved Reliability**: Caching makes your builds less susceptible to network issues or outages of package registries, as the cached dependencies can be restored locally.

## Using the Cache Action

GitHub provides official environment setup actions for a few popular languages and frameworks. These actions support caching dependencies out of the box. For the rest, you can use the `actions/cache` action directly to cache the relevant directories.

### Node.js

Caching dependencies in Node.js involves storing the package manager cache. The official [`setup-node`](https://github.com/actions/setup-node) action [supports caching](https://github.com/actions/setup-node#caching-global-packages-data) by using `actions/cache` under the hood, abstracting away the setup required to cache the package manager's cache directory. It supports caching for `npm`, `yarn`, and `pnpm` with the `cache` input. Caching is **disabled** by default.

Caching `node_modules` itself is not recommended by GitHub, as it can break across Node versions and doesn't work well with `npm ci`.
```yaml
name: Node CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm" # or 'yarn' or 'pnpm'
      - name: Install dependencies
        run: npm ci # or 'yarn install --frozen-lockfile' or 'pnpm install --frozen-lockfile'
```

The above action uses the relevant lockfile used by the package manager to create the cache key.

For more control over caching, such as using your own cache keys or caching the `node_modules` directory, you can use the `actions/cache` action directly. Here's an example of caching the package directory for an `npm` project:

```yaml
name: Node CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Cache .npm directory
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-
      - name: Install dependencies
        run: npm ci
```

Make sure to use the correct cache directory for your package manager (npm, yarn, or pnpm). You can get the cache directory by running `npm config get cache` or `yarn cache dir`.

### Python

Caching in Python projects also involves storing the package manager cache directory. Similar to Node.js, the official [`setup-python`](https://github.com/actions/setup-python/) action also [supports caching](https://github.com/actions/setup-python/#caching-packages-dependencies) via the `cache` input.
```yaml
name: Python CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: "pip" # or 'poetry' or 'pipenv'
      - name: Install dependencies
        run: pip install -r requirements.txt
```

Similarly, you can also use the `actions/cache` action directly for more control over caching:

```yaml
name: Python CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Cache pip packages
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-
      - name: Install dependencies
        run: pip install -r requirements.txt
```

You can find the correct cache directory for pip by running `pip cache dir` and for poetry by running `poetry config cache-dir`.

### Golang

For Go projects, caching the Go modules directory speeds up the build process. The directory is generally located at `~/go/pkg/mod` and can be found by running `go env GOMODCACHE`. The [`setup-go`](https://github.com/actions/setup-go/) action provides [caching support](https://github.com/actions/setup-go/#caching-dependency-files-and-build-outputs) for Go projects with the `cache` input. Unlike Node.js and Python, this input is a boolean value that enables or disables caching. Caching is **enabled** by default.

```yaml
name: Go CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: "1.22"
          cache: true # Note: this is not required as caching is enabled by default
      - name: Build
        run: go build ./...
```

You can also disable the caching by setting `cache: false` in the `actions/setup-go` action and use the `actions/cache` action directly for more control.
```yaml name: Go CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Go uses: actions/setup-go@v5 with: go-version: "1.22" cache: false - name: Cache Go modules uses: actions/cache@v4 with: path: ~/go/pkg/mod key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }} restore-keys: | ${{ runner.os }}-go- - name: Build run: go build ./... ``` If you use vendor directories, the modules get loaded from your project's vendor directory instead of downloading from the network or restoring from cache. In such cases, caching the Go modules directory may not be necessary. ### Rust Rust's package manager, Cargo caches the modules and binaries in the `~/.cargo` directory. The compiled dependencies are stored in the `target` directory of the project. Rust has no official GitHub Action for setup, so you can use the `actions/cache` action directly to cache the relevant directories. ```yaml name: Rust CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Rust uses: dtolnay/rust-toolchain@stable - name: Cache cargo registry and build uses: actions/cache@v4 with: path: | ~/.cargo/bin/ ~/.cargo/registry/index/ ~/.cargo/registry/cache/ ~/.cargo/git/db/ target key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }} restore-keys: | ${{ runner.os }}-cargo- - name: Build and test run: cargo test --all ``` ### Java Java projects commonly use the Gradle or Maven build tools which have their corresponding cache directories. Java does have an official [`setup-java`](https://github.com/actions/setup-java) action that [supports caching](https://github.com/actions/setup-java#caching-packages-dependencies) via the `cache` input. This input takes the name of the build tool (`maven`, `gradle` or `sbt`) to cache their relevant directories. Caching is disabled by default. 
```yaml name: Java CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Java uses: actions/setup-java@v4 with: distribution: "temurin" java-version: "17" cache: "maven" # or 'gradle' or 'sbt' - name: Build with Maven run: mvn -B clean verify ``` You can also use the `actions/cache` action directly for more control over caching. #### Maven Example ```yaml name: Java CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Java uses: actions/setup-java@v4 with: distribution: "temurin" java-version: "17" - name: Cache Maven repository uses: actions/cache@v4 with: path: ~/.m2/repository key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }} restore-keys: | ${{ runner.os }}-maven- - name: Build with Maven run: mvn -B clean verify ``` #### Gradle Example ```yaml name: Java Gradle CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Java uses: actions/setup-java@v4 with: distribution: "temurin" java-version: "17" - name: Cache Gradle wrapper uses: actions/cache@v4 with: path: | ~/.gradle/caches ~/.gradle/wrapper key: ${{ runner.os }}-gradle-${{ hashFiles('**/gradle/wrapper/gradle-wrapper.properties') }} restore-keys: | ${{ runner.os }}-gradle- - name: Build with Gradle run: chmod +x gradlew && ./gradlew build ``` ### PHP PHP projects with Composer can cache their dependencies by caching the Composer cache directory. PHP has no official GitHub Action for setup, so you can use the `actions/cache` action directly. ```yaml name: PHP Cache Workflow on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - name: Checkout code uses: actions/checkout@v4 - uses: shivammathur/setup-php@v2 with: php-version: '8.3' # The cache directory is usually located at ~/.composer/cache # This step can be skipped and the cache directory can be hardcoded # in the `path` field of the `actions/cache` step. 
- name: Get Composer Cache Directory id: composer-cache run: | echo "dir=$(composer config cache-files-dir)" >> $GITHUB_OUTPUT - name: Cache Composer dependencies uses: actions/cache@v4 with: path: ${{ steps.composer-cache.outputs.dir }} key: ${{ runner.os }}-composer-${{ hashFiles('**/composer.lock') }} restore-keys: | ${{ runner.os }}-composer- - name: Install dependencies run: composer install --prefer-dist ``` ### Ruby The official [`ruby/setup-ruby`](https://github.com/ruby/setup-ruby) action provides [caching support](https://github.com/ruby/setup-ruby#caching-bundle-install-automatically) for Ruby projects with the `bundler-cache` input. This input takes a boolean value to enable or disable caching. Caching is **disabled** by default. ```yaml name: Ruby CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Ruby uses: ruby/setup-ruby@v1 with: ruby-version: "3.3" bundler-cache: true # Installing dependencies via `bundle install` or `gem install bundler` # is not required as the action automatically installs dependencies. - name: Run tests run: bundle exec rake test ``` We can also directly cache the `bundle` directory using the `actions/cache` action. Manually caching this directory is not recommended, and the suggested approach is to use the ruby/setup-ruby action as shown above. 
```yaml name: Ruby CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Ruby uses: ruby/setup-ruby@v1 with: ruby-version: "3.3" - name: Cache gems uses: actions/cache@v4 with: path: vendor/bundle key: ${{ runner.os }}-gems-${{ hashFiles('**/Gemfile.lock') }} restore-keys: | ${{ runner.os }}-gems- - name: Install dependencies run: | gem install bundler bundle install --path vendor/bundle - name: Run tests run: bundle exec rake test ``` ### .NET (NuGet) The official [`setup-dotnet`](https://github.com/actions/setup-dotnet) action provides [caching support](https://github.com/actions/setup-dotnet#caching-nuget-packages) for .NET projects with the `cache` input. This input takes a boolean value to enable or disable caching. Caching is **disabled** by default. Passing an explicit NUGET_PACKAGES is also recommended for caching the NuGet packages directory instead of the global cache directory since there might be some huge packages pre-installed. ```yaml name: .NET CI on: [push, pull_request] jobs: build: runs-on: windows-latest env: NUGET_PACKAGES: ${{ github.workspace }}/.nuget/packages steps: - uses: actions/checkout@v4 - name: Setup .NET uses: actions/setup-dotnet@v4 with: dotnet-version: "6.x" cache: true - name: Restore dependencies run: dotnet restore --locked-mode - name: Build run: dotnet build my-project ``` For caching manually, use the `actions/cache` action directly. 
```yaml
name: .NET CI
on: [push, pull_request]
jobs:
  build:
    runs-on: windows-latest
    env:
      NUGET_PACKAGES: ${{ github.workspace }}/.nuget/packages
    steps:
      - uses: actions/checkout@v4
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: "6.x"
      - name: Cache NuGet packages
        uses: actions/cache@v4
        with:
          path: ${{ github.workspace }}/.nuget/packages
          key: ${{ runner.os }}-nuget-${{ hashFiles('**/packages.lock.json') }} # Or **/*.csproj
          restore-keys: |
            ${{ runner.os }}-nuget-
      - name: Restore dependencies
        run: dotnet restore --locked-mode
      - name: Build
        run: dotnet build my-project
```

## Considerations

While the cache action speeds up workflows, be mindful of these considerations:

1. **Cache Keys and Restoring:** Using unique cache keys ensures a cache is correctly restored or created. However, overly specific keys may result in missed cache hits.
2. **Sequencing:** The order in which caches are saved and restored can lead to unpredictable behavior in workflows, especially if the cache keys are insufficiently specific.
3. **Cache Size:** Large caches can take longer to restore, reducing the performance gain.
4. **Invalidation:** Changes in dependencies (like updates in `package-lock.json` or `go.sum`) will cause a new cache to be created.
5. **Security Risks:** Ensure sensitive files are not accidentally cached.

## Common Pitfalls

1. **Concurrency Issues:** Parallel jobs can overwrite the cache, leading to incomplete or corrupted data.
2. **Storage Limits:** Exceeding the storage limit (10 GB per repository) causes cache eviction, and this is a very low limit for many use cases.
3. **Compatibility:** Some caches may not be compatible across different operating systems or configurations.

## Conclusion

Leveraging the GitHub Actions cache action can significantly accelerate your CI/CD workflows, especially with common languages like Node.js, Python, Rust, Go, PHP, and Java.
However, it's essential to manage cache keys, avoid overly large caches, and be cautious of security issues. For optimal results, test different strategies to see which works best for your specific project requirements.

## Unlimited, fast cache with `WarpBuilds/cache` action

While GitHub Actions provides great flexibility and functionality, workflow speeds can always benefit from improvements. This is where WarpBuild comes in. Offering GitHub Actions runners with unlimited, superfast caching capabilities, WarpBuild accelerates your builds with blazing speed. The `WarpBuilds/cache` action is a drop-in replacement for `actions/cache`, so you can get started instantly. [Here are the cache docs](/docs/ci/caching).

- **Unlimited Caching:** Never worry about hitting cache size limits or losing important build data.
- **Fast Caching:** Save substantial time by utilizing highly optimized caching mechanisms.

By seamlessly integrating WarpBuild into your workflows, you can significantly speed up your CI/CD pipelines without compromising flexibility or reliability. [Try it out](https://www.warpbuild.com).

## References

- [Caching Dependencies with GitHub Actions](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#examples)
- [`actions/cache` GitHub Repository](https://github.com/actions/cache)
- [More caching examples using `actions/cache`](https://github.com/actions/cache/blob/main/examples.md)

# Common challenges with GitHub Actions

URL: /blog/github-actions-challenges

Common challenges with GitHub Actions

---
title: "Common challenges with GitHub Actions"
excerpt: "Common challenges with GitHub Actions"
description: "Common challenges with GitHub Actions"
date: "2023-11-14"
author: surya_oruganti
cover: "/images/blog/github-actions-challenges/thumbnail.png"
---

GitHub Actions is very powerful, but there are some common challenges that developers face while using it.
In this article, we will discuss some of the common challenges for folks evaluating the use of GitHub Actions for CI. 1. **YAML Syntax and Expression Evaluation**: One non-trivial aspect is the correct evaluation of expressions in YAML. For example, GitHub Actions' context and expression syntax within `if` conditions are often misunderstood. A misinterpretation of how expressions like `github.event_name == 'push' && github.ref == 'refs/heads/main'` are evaluated can lead to workflows not triggering as expected. It's not just about syntax; it's about understanding the context and logical evaluation within the GitHub Actions environment. 2. **Workflow Debugging and Logging**: A significant challenge is identifying issues within actions that don't provide verbose logging. For instance, if a custom action doesn't log its internal processing steps, and it fails, developers might have to fork the action and add logging themselves to diagnose the issue. It's a layer deeper than just looking at workflow logs; it's about understanding and possibly modifying the actions themselves. 3. **Advanced Dependency Management**: Consider managing dependencies across multiple jobs and workflows. Developers might face issues where a job in one workflow produces a build artifact that should be used in another workflow. Managing this across branches or forks, especially with versioning, requires a deep understanding of artifacts, workflow triggers, and job dependencies. 4. **Resource Allocation and Optimization**: Beyond just hitting resource limits, there's the optimization of these resources. For instance, parallelizing tests can speed up workflows, but if not managed correctly, it can lead to resource contention or rate-limiting issues, especially when interacting with external APIs or services. It's about strategically utilizing resources for optimal performance. 5. **Security in Complex Environments**: Handling secrets in multi-environment workflows poses challenges. 
For example, using organization-level secrets effectively while ensuring they're not exposed in forked repository runs, especially when dealing with public repositories, requires a careful setup of access controls and understanding of the security model of GitHub Actions. 6. **Dynamic Workflow Configurations**: Developers might need to dynamically generate workflow configurations based on external factors, like changes in a database or a response from an API. This requires using output parameters from one job to control the behavior of subsequent jobs, often leading to complex workflow setups that are hard to maintain and debug. 7. **Integrating Diverse Tools and Services**: Challenges arise when integrating with tools that don't have existing marketplace actions or when the existing actions are not flexible enough. This might involve writing custom scripts or actions to interface with these tools, understanding their APIs, and handling authentication in a secure way. 8. **Concurrency and Dependency Management**: Managing dependencies between parallel jobs can be complex. For instance, if multiple jobs modify a shared resource like a database or a file in a storage bucket, ensuring that these modifications don't conflict or cause data integrity issues requires sophisticated coordination and concurrency control mechanisms. 9. **Caching Strategies for Complex Builds**: Effective caching in multi-language or multi-framework projects is challenging. It requires a deep understanding of how dependencies are resolved and stored. Incorrect caching can lead to outdated dependencies being used, or in the worst case, corrupt build environments. Developers need to craft caching strategies that are both efficient and reliable. 10. **Cross-Platform Workflow Design**: Designing workflows that are truly cross-platform involves more than just specifying different runners. 
It requires an understanding of the different environment variables, filesystem paths, and tool availability across operating systems. For example, handling path normalization across Windows and Unix systems in a workflow requires careful scripting and consideration of OS-specific peculiarities.

These points reflect the intricacies and advanced challenges faced in GitHub Actions, requiring a deep understanding and strategic approach to workflow design and execution.

# Reducing GitHub Actions Costs
URL: /blog/github-actions-cost-reduction
Practical guide to reducing GitHub Actions costs with triggers, cancellation, caching, right-sizing, and more

---
title: "Reducing GitHub Actions Costs"
excerpt: "Practical guide to reducing GitHub Actions costs with triggers, cancellation, caching, right-sizing, and more"
description: "Practical guide to reducing GitHub Actions costs with triggers, cancellation, caching, right-sizing, and more"
author: surya_oruganti
date: "2025-10-20"
cover: "/images/blog/github-actions-cost-reduction/cover.png"
---

This guide focuses on practical ways to reduce costs while improving reliability and developer UX. It's vendor-neutral and cites primary docs throughout.

Cost is largely a function of billed minutes, runner SKU, artifact/storage usage, and fan-out. Understanding billing and usage is the first step: see About billing for GitHub Actions and Viewing your Actions usage.

## Cost Optimizations

### Run less by default (event filters)

Only run workflows when it matters.
```yaml
on:
  pull_request:
    branches: ["main", "release/*"]
    paths: ["app/**", "lib/**", "package.json"]
    # Alternatively, use paths-ignore — an event cannot combine both filters:
    # paths-ignore: ["**/*.md", "docs/**", "*.png", "*.svg"]
  push:
    branches: ["main"]
```

- See workflow syntax for `paths`/`paths-ignore`: [docs](https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions)
- Skip runs with commit messages or workflow inputs: [skipping runs](https://docs.github.com/en/actions/managing-workflow-runs/skipping-workflow-runs)

Use separate lightweight linters for every push and run heavy integration tests only for PRs targeting main.

### Cancel superseded work

Stop old runs when new commits arrive.

```yaml
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

- Concurrency groups and cancellation: [docs](https://docs.github.com/en/actions/using-jobs/using-concurrency)

### Add timeouts to jobs

Set hard ceilings on job time.

```yaml
jobs:
  test:
    timeout-minutes: 20
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
```

- Timeouts via workflow syntax: [docs](https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions)

### Cache smartly (dependencies and Docker)

Caching cuts both time and cost.

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-${{ runner.os }}-
```

- Dependency caching: [docs](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows)

For Docker image builds, prefer BuildKit/Buildx with the GitHub cache backend:

```yaml
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    push: false
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

- Buildx cache with `type=gha`: [Docker docs](https://docs.docker.com/build/ci/github-actions/cache/)

### Store less, keep it shorter (artifacts)

Artifacts are billed for storage and egress.
Upload only what you need, compress appropriately, and reduce retention.

**Example savings:** A team uploading 5GB of artifacts daily with the default 90-day retention stores 450GB at steady state. Reducing retention to 7 days cuts this to 35GB: a **92% reduction**. At GitHub's storage rates ($0.008/GB/day for Actions), this saves ~$100/month.

```yaml
- uses: actions/upload-artifact@v4
  with:
    name: reports
    path: reports/**
    retention-days: 7     # default is 90 days
    compression-level: 6  # balance speed vs size
```

**Quick wins:**

- Reduce retention from 90 to 7-14 days: **~85-90% storage cost reduction**
- Compress artifacts: typical **40-60% size reduction** for logs and build outputs
- Upload only test failures, not full test runs
- Artifacts and retention: [docs](https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts)

### Choose cheaper runners and architectures

Linux runners are cheaper than macOS/Windows. Where possible, move tasks to Linux, and isolate platform-specific steps. Use arm64 runners for builds that are not CPU-bound.

**GitHub-hosted pricing multipliers (relative to Linux x64):**

- Linux x64: **1x** baseline
- Linux arm64: **~0.5x** (50% cheaper)
- Windows: **2x** (2x more expensive)
- macOS x64: **10x** (10x more expensive)
- macOS arm64 (M1): **3-5x** (3-5x more expensive)

**Example:** A workflow running 1,000 minutes/month on macOS x64 costs ~10x more than the same workflow on Linux. Switching non-macOS-specific tasks to Linux can save **$80-90/month** per 1,000 minutes at typical GitHub Team pricing.

Billing differences and entitlements vary by plan and OS: see About billing for GitHub Actions.

### Parallelize thoughtfully

Matrix builds are powerful but can explode costs. Keep fan-out intentional and throttle when needed.
```yaml
strategy:
  matrix:
    node: [18, 20, 22]
  # note: max-parallel is a property of strategy, not matrix
  max-parallel: 2
```

- Matrix usage and limits (256 jobs): [docs](https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs)

### Observe and budget

Track performance and spend so you can right-size and de-fanout with confidence.
- Billing overview and usage: docs
- Viewing usage: docs
```bash
# Org billing summary (requires org admin)
gh api \
  -H "Accept: application/vnd.github+json" \
  /orgs/OWNER/settings/billing/actions | jq

# A run's billable minutes and timing
gh api \
  -H "Accept: application/vnd.github+json" \
  /repos/OWNER/REPO/actions/runs/123456/timing | jq

# Repository artifact sizes
gh api \
  -H "Accept: application/vnd.github+json" \
  "/repos/OWNER/REPO/actions/artifacts?per_page=100" | jq '.artifacts[] | {name, size_in_bytes, expired}'

# Cache usage for a repo
gh api \
  -H "Accept: application/vnd.github+json" \
  /repos/OWNER/REPO/actions/cache/usage | jq
```

Track: total minutes/spend, median run time for critical workflows, queue time, artifact GB.
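To turn the `timing` responses above into a spend number, a small helper can sum billable minutes per OS. This is a sketch: it assumes the run-timing payload shape (a `billable` object keyed by OS, each entry carrying `total_ms`) and uses illustrative per-minute rates, so substitute your plan's actual prices.

```python
# Sum billable minutes from a workflow run's timing payload and price them.
# Rates below are illustrative per-minute prices, not authoritative.
RATES_PER_MIN = {"UBUNTU": 0.008, "WINDOWS": 0.016, "MACOS": 0.08}

def run_cost(timing: dict) -> float:
    """Estimate the dollar cost of one workflow run from its timing payload."""
    total = 0.0
    for os_name, usage in timing.get("billable", {}).items():
        minutes = usage["total_ms"] / 60_000
        total += minutes * RATES_PER_MIN.get(os_name, 0.0)
    return total

sample = {"billable": {"UBUNTU": {"total_ms": 1_200_000},  # 20 min of Linux
                       "MACOS": {"total_ms": 300_000}}}    # 5 min of macOS
print(round(run_cost(sample), 2))
```

Feed it the JSON from the `gh api .../timing` call above, per run, and aggregate across runs to approximate workflow-level spend.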

### Right-size your runners

Use data to pick the smallest runner that meets SLOs.

**Example:** If a build takes 20 minutes on 2-core and 12 minutes on 8-core, you're paying **4x the rate for a 1.67x speedup**; the net result is **2.4x more expensive** for that run. Right-sizing down from 8-core to 4-core (if 4-core completes in 15 minutes) cuts that run's cost by **~1.6x** while only adding 3 minutes. This optimizes for cost, but developer experience may suffer.

- Track job duration trends in the Actions UI and compare before/after changes: [usage](https://docs.github.com/en/billing/managing-billing-for-github-actions/viewing-your-github-actions-usage)
- Watch CPU/memory-bound steps: run smaller/larger machine experiments and choose the knee of the curve
- Prefer improving IO (cache, network, disk) before simply scaling CPU

Mermaid decision helper:

```mermaid
flowchart LR
  S[Job slow?] -->|Yes| CPU{CPU >80%?}
  CPU -- Yes --> SizeUp[Use larger runner]
  CPU -- No --> IO{IO wait/high net?}
  IO -- Yes --> BiggerDisk[Use faster disk/network]
  IO -- No --> Fanout[Reduce matrix fanout]
  S -->|No| Keep[Keep size]
```

To test right-sizing safely, cap `max-parallel`, then A/B compare durations and queue times before switching your default runner.

### Flaky tests quietly tax your bill

Retries and re-runs multiply minutes. Treat flakiness as a cost center.

- Prefer "retry failed tests only" where supported (keeps cost bounded)
- Quarantine known-flaky suites and run them on schedule, not per-PR
- Surface failure triage early; fail fast before long e2e suites

Framework knobs (examples):

See: jest.retryTimes

```yaml
# Jest
run: npx jest --maxWorkers=50%
# consider also: jest.retryTimes()
```

See: Playwright retries docs

```ts
// playwright.config.ts (Playwright config is TypeScript, not YAML)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 2,
  workers: 4,
});
```

See: pytest-rerunfailures

```bash
# Pytest
pytest -q --maxfail=1  # fail fast
pytest --reruns 2 --only-rerun "AssertionError"
```
- Re-running jobs/workflows: [docs](https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs)

### Self-hosted runners can save costs

Self-hosting can reduce per-minute costs (e.g., spot/preemptible instances) and unlock bespoke hardware and caching. It also introduces infra, security, and maintenance overhead. Compared to GitHub-hosted runners, self-hosted runners have the potential to save up to 90% on costs.

## Cost-first execution flowchart

```mermaid
flowchart TD
  A[Push/PR] --> B{Paths match?}
  B -- No --> X[Skip]
  B -- Yes --> C[Concurrency group]
  C -->|cancel previous| D[Run CI]
  D --> E{Cache hit?}
  E -- Yes --> F[Fast + cheaper]
  E -- No --> G[Build + save cache]
  G --> H[Upload minimal artifacts]
```

---

## How WarpBuild can help

This guide is vendor-neutral; if you want managed building blocks that implement many of the above:

- Fast cloud runners (Linux, Windows, macOS): [`/docs/ci/cloud-runners/`](/docs/ci/cloud-runners/)
- Snapshot runners: `/docs/ci/snapshot-runners/`
- Fast cache and Docker Builders: [`/docs/ci/caching/`](/docs/ci/caching/), [`/docs/ci/docker-builders/`](/docs/ci/docker-builders/)
- Observability guides: [`/docs/ci/observability/`](/docs/ci/observability/)
- Quick start: [`/docs/ci/quick-start/`](/docs/ci/quick-start/)

---

# The Complete Guide to GitHub Actions for Monorepos: Turborepo, Nx, and pnpm Workspaces
URL: /blog/github-actions-monorepo-guide
Learn how to optimize GitHub Actions for monorepos using Turborepo, Nx, and pnpm workspaces. Reduce CI time by 12x with affected-only execution, remote caching, and dynamic matrix strategies.

---
title: "The Complete Guide to GitHub Actions for Monorepos: Turborepo, Nx, and pnpm Workspaces"
excerpt: "Monorepos run 10-100x more CI jobs than single-repo teams. Naive GitHub Actions configs explode CI minutes, queue times, and cost. This is the guide that should have existed years ago."
description: "Learn how to optimize GitHub Actions for monorepos using Turborepo, Nx, and pnpm workspaces. Reduce CI time by 12x with affected-only execution, remote caching, and dynamic matrix strategies."
date: "2026-01-13"
author: surya_oruganti
cover: "/images/blog/github-actions-monorepo-guide/cover.png"
---

import { Callout } from 'fumadocs-ui/components/callout';
import { Tab, Tabs } from 'fumadocs-ui/components/tabs';
import { Step, Steps } from 'fumadocs-ui/components/steps';

## Key Takeaways

- **Run less, not faster.** Affected-only execution is the biggest lever. Running 4 packages instead of 45 beats any caching optimization.
- **Remote caching compounds.** Local caching helps one run. Remote caching lets every run share work across your team.
- **Matrix parallelism requires concurrency.** 30 matrix jobs on a 20-concurrency system just moves the bottleneck to queue time.
- **Pin your base SHA.** Dynamic affected detection can race with merges to main.

---

## Why Monorepos Break CI

Single-repo CI is predictable: code changes, tests run, build happens. Three jobs per PR, done.

Monorepos scale combinatorially. A 30-package repo with a naive config runs all 30 test suites on every push. Add matrix testing across Node versions and you're at 60-90 jobs per PR. A one-line README fix triggers the same CI load as a core library rewrite.

Dependencies make it worse. Change package C and you need to test A and B too (they import it). Naive configs either test everything (wasteful) or only the changed package (misses downstream breakage).
```mermaid
flowchart LR
  subgraph Naive["Naive CI"]
    P1[Push] --> A1[Run all 30 packages]
    A1 --> W1[90 jobs per PR]
  end
  subgraph Smart["Affected-Only CI"]
    P2[Push] --> D[Detect changes]
    D --> A2[Run 4 affected packages]
    A2 --> W2[12 jobs per PR]
  end
```

[GitHub-hosted runners](https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners) bottleneck this in two ways:

| Constraint | Impact |
|------------|--------|
| **Concurrency limits** | Free: 20 jobs, Team: 60, Enterprise: up to 500. A 90-job PR queues constantly. |
| **Cache storage** | Historically 10GB per repo. 30 packages with pnpm, dist, and test caches fill it fast. Caches evict, builds run cold. |

Learn more about [common GitHub Actions challenges](/blog/github-actions-challenges) and [caching strategies](/blog/github-actions-cache).

---

## Affected-Only Execution

Detect which packages changed, run CI only for those plus their dependents. This is the difference between 90 jobs and 8 jobs per PR.

### The Wrong Way

```yaml
# Don't do this - runs all 50 packages on every push
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bun install
      - run: bun test
```

### The Right Way

The `--filter` flag tells Turborepo to only run tasks for packages changed since `origin/main`. The `...` syntax includes dependents.

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for git comparison
      - run: bun install
      - run: bunx turbo run test --filter='...[origin/main...HEAD]'
```

See the [Turborepo GitHub Actions guide](https://turborepo.dev/docs/guides/ci-vendors/github-actions).

Nx uses `NX_BASE` and `NX_HEAD` environment variables. The [`nrwl/nx-set-shas`](https://github.com/nrwl/nx-set-shas) action sets these correctly for PRs and push events.
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: nrwl/nx-set-shas@v4
      - run: bun install
      - run: bunx nx affected -t test --base=$NX_BASE --head=$NX_HEAD
```

See the [Nx affected command docs](https://nx.dev/docs/features/ci-features/affected).

### Common Footguns

Both tools need git history to compare changes. Default `actions/checkout` does a shallow clone with `fetch-depth: 1`. You need `fetch-depth: 0` for full history, or enough depth to reach your merge-base.

On a PR, compare against the base branch. On push to main, compare against the previous commit. Turborepo handles this with `[origin/main...HEAD]`. Nx requires the `nx-set-shas` action.

PRs from forks may need an explicit fetch:

```yaml
- run: git fetch origin main:main
```

Changes to `package.json`, `turbo.json`, or `bun.lockb` at the root might affect all packages. Both tools handle this, but verify your config triggers full CI when needed. For explicit control over root changes:

```yaml
- id: check-root
  run: |
    if git diff --name-only origin/main...HEAD | grep -qE '^(package\.json|turbo\.json|bun\.lockb)$'; then
      echo "root_changed=true" >> $GITHUB_OUTPUT
    else
      echo "root_changed=false" >> $GITHUB_OUTPUT
    fi
- name: Run affected tests
  if: steps.check-root.outputs.root_changed != 'true'
  run: bunx turbo run test --filter='...[origin/main...HEAD]'
- name: Run all tests (root changed)
  if: steps.check-root.outputs.root_changed == 'true'
  run: bunx turbo run test
```

---

## Caching That Works

Dependency caching is necessary but not sufficient. Build artifacts are where real time savings live.

### Dependency Caching

**bun** is excellent for monorepos. Its binary lockfile and global cache mean fast installs with minimal overhead.
```yaml
- uses: oven-sh/setup-bun@v2
- uses: actions/cache@v4
  with:
    path: ~/.bun/install/cache
    key: bun-${{ runner.os }}-${{ hashFiles('bun.lockb') }}
    restore-keys: bun-${{ runner.os }}-
- run: bun install
```

**pnpm**'s content-addressable store means identical dependencies across packages are stored once:

```yaml
- uses: pnpm/action-setup@v2
  with:
    version: 9
- uses: actions/setup-node@v4
  with:
    node-version: '22'
    cache: 'pnpm'
```

### Build Artifact Caching

For TypeScript monorepos, cache `dist` folders and `.tsbuildinfo` files:

```yaml
- uses: actions/cache@v4
  with:
    path: |
      packages/**/dist
      packages/**/.tsbuildinfo
    key: build-${{ runner.os }}-${{ hashFiles('packages/**/src/**', 'packages/**/tsconfig.json') }}
    restore-keys: build-${{ runner.os }}-
```

Use `hashFiles` on source content, not `github.sha`. Every commit has a different SHA, so you'd almost never hit cache. Content-based keys mean identical source produces hits regardless of commit. The `restore-keys` fallback means partial hits still help—you get a recent build even if today's exact hash isn't cached.

### Why Naive Caching Falls Short

- **Key granularity matters.** One key for all packages = any change invalidates everything. Per-package keys = 50 cache operations per job.
- **Size limits bite.** A 30-package TypeScript monorepo can exceed storage limits fast. Caches evict under LRU. Jobs run cold.
- **Restore time scales with size.** A 2GB cache takes 30-60 seconds to restore. If your build only takes 90 seconds, you've added 50% overhead.

---

## Matrix Strategies for Parallel Testing

Once you've detected affected packages, parallelize their execution across runners.
### Dynamic Matrix Generation

Generate the matrix from affected packages, not hardcoded:

```yaml
jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      packages: ${{ steps.affected.outputs.packages }}
      base_sha: ${{ steps.set-base.outputs.base_sha }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - id: set-base
        run: |
          BASE_SHA=$(git merge-base origin/main HEAD)
          echo "base_sha=$BASE_SHA" >> $GITHUB_OUTPUT
      - run: bun install
      - id: affected
        run: |
          PACKAGES=$(bunx turbo run test --filter='...[${{ steps.set-base.outputs.base_sha }}...HEAD]' --dry-run=json | jq -c '[.tasks[].package] | unique')
          echo "packages=$PACKAGES" >> $GITHUB_OUTPUT

  test:
    needs: detect
    if: ${{ needs.detect.outputs.packages != '[]' }}
    strategy:
      matrix:
        package: ${{ fromJson(needs.detect.outputs.packages) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bun install
      - run: bunx turbo run test --filter=${{ matrix.package }}
```

Between the detect job and test jobs, someone might merge to main. If you use `origin/main` directly, the affected calculation might not match reality when tests run. Pinning the merge-base SHA ensures consistency.

### When Matrix Helps vs Hurts

| Matrix helps when... | Matrix hurts when... |
|---------------------|----------------------|
| Package tests are slow (>2 min) | Package tests are fast (<30 sec) |
| You have concurrency headroom | Job startup exceeds test time |
| Tests are independent | Concurrency limits mean jobs queue anyway |

For fast tests, a single job running all affected packages sequentially might be faster than 10 matrix jobs each spending 30 seconds on setup.

### Test Sharding Within Packages

If one package has 80% of your test time, shard within it.
Learn more about [running concurrent tests effectively](/blog/concurrent-tests):

```yaml
strategy:
  matrix:
    package: ${{ fromJson(needs.detect.outputs.packages) }}
    shard: [1, 2, 3, 4]
steps:
  - run: bunx turbo run test --filter=${{ matrix.package }} -- --shard=${{ matrix.shard }}/4
```

A 12-minute test suite becomes 3 minutes wall-clock (with sufficient concurrency).

### Concurrency Reality Check

30 jobs at 2 minutes each on 20 concurrent slots:

- First batch: 20 jobs run (2 min)
- Second batch: 10 jobs run (2 min)
- **Total: 4 minutes** instead of 2 with unlimited concurrency

This is where infrastructure becomes the bottleneck, not configuration.

---

## Remote Caching

[Turborepo's remote cache](https://turbo.build/repo/docs/core-concepts/remote-caching) shares build artifacts across CI runs. PR #2 doesn't rebuild what PR #1 already built.

### Why It Matters for CI

Without remote caching, every CI run starts cold. With it, PRs share work. A team running 50 PRs/day with 30-minute builds can save 20+ hours daily at an 80% hit rate.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
      TURBO_TEAM: ${{ vars.TURBO_TEAM }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bunx turbo run build test lint --filter='...[origin/main...HEAD]'
```

The environment variables authenticate with Vercel's remote cache. No additional config needed.

```yaml
env:
  TURBO_API: 'https://your-cache-server.com'
  TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
  TURBO_TEAM: 'your-team'
```

You can self-host with S3, GCS, or other storage backends.
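The 20-hours figure above is straightforward arithmetic. A minimal sketch, assuming each PR would otherwise run one cold 30-minute build and that cache hits eliminate the corresponding share of that work:

```python
def daily_hours_saved(prs_per_day: int, build_minutes: float, hit_rate: float) -> float:
    """CI hours avoided per day when `hit_rate` of build work is served from the remote cache."""
    return prs_per_day * build_minutes * hit_rate / 60

# 50 PRs/day x 30-minute builds x 80% hit rate
print(daily_hours_saved(50, 30, 0.8))  # → 20.0
```

Plug in your own PR volume and hit rate to estimate whether the setup cost pays for itself.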
---

## Nx Commands Reference

### Affected vs run-many

```yaml
# Affected projects only (use for PR CI)
- run: bunx nx affected -t test

# All projects (use for nightly/release builds)
- run: bunx nx run-many -t test --all

# Specific projects
- run: bunx nx run-many -t test --projects=app1,app2
```

The `--parallel` flag controls tasks within a single job (different from matrix parallelism across jobs):

```yaml
- run: bunx nx affected -t lint test build --parallel=3
```

### Nx Cloud Distributed Execution

[Nx Cloud DTE](https://nx.dev/ci/features/distribute-task-execution) automatically distributes tasks across multiple agents:

```yaml
jobs:
  main:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: nrwl/nx-set-shas@v4
      - run: bun install
      - run: bunx nx-cloud start-ci-run --distribute-on="5 linux-medium-js"
      - run: bunx nx affected -t lint test build

  agents:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        agent: [1, 2, 3, 4, 5]
    steps:
      - uses: actions/checkout@v4
      - run: bun install
      - run: bunx nx-cloud start-agent
```

For 100+ package monorepos, DTE can reduce CI from hours to minutes.

### Turborepo vs Nx

| | Turborepo | Nx |
|---|-----------|-----|
| **Setup** | Simpler, faster to adopt | More configuration options |
| **Cold start** | Faster (~5MB runtime) | Larger (~40MB) |
| **Remote cache** | Vercel integration or self-host | Nx Cloud |
| **Distributed execution** | Manual with matrix | Built-in with Nx Cloud |
| **Polyglot support** | JS/TS focused | Go, Rust, Java, etc. |
| **Best for** | Under 50 packages, JS/TS | 100+ packages, polyglot |

Both integrate cleanly with GitHub Actions. Choose based on scale and whether you need distributed execution.

---

## Measuring Performance

Run these before optimizing to quantify your opportunity:

```bash
# How many tasks run today?
bunx turbo run test --dry-run | grep "Tasks:"

# How many with affected-only?
bunx turbo run test --filter='...[origin/main...HEAD]' --dry-run | grep "Tasks:"
```

If full CI runs 45 tasks and affected runs 4, you're doing 10x more work than necessary.

### Key Metrics

| Metric | How to measure | Target |
|--------|---------------|--------|
| **Affected ratio** | Tasks with filter vs without | Under 20% of total |
| **Cache hit rate** | Run same build twice, count `FULL TURBO` | Above 80% |
| **Queue time** | "Queued" vs "In progress" timestamp | Under 30 sec average |
| **Wall-clock vs CI minutes** | Total time vs sum of job times | High wall-clock + low minutes = queue saturation |

---

## When Infrastructure Is the Bottleneck

Configuration optimization has limits. At some point, infrastructure is the constraint.

### Signs You've Hit the Limit

- [ ] Jobs queue even with optimized configs
- [ ] Cache operations take longer than builds
- [ ] Matrix strategies don't improve wall-clock time
- [ ] Cost scales linearly despite optimizations

### What Moves the Needle

1. **Affected-only execution** — configuration
2. **Remote caching** — configuration + service
3. **Unlimited concurrency** — infrastructure
4. **Faster cache I/O** — infrastructure
5. **Faster runners** — infrastructure

[GitHub-hosted runner limits](https://docs.github.com/en/actions/reference/limits) vary by plan: Free (20), Pro (40), Team (60), Enterprise (up to 500). Cache storage has historically been 10GB per repo.

WarpBuild removes these constraints. Unlimited concurrency means matrix strategies actually parallelize. 50GB+ cache storage means caches don't evict. Change `runs-on: ubuntu-latest` to `runs-on: warpbuild-ubuntu-22.04-x64-4x` and the infrastructure constraints disappear.

---

## What To Do Next

### Measure your waste

```bash
bunx turbo run test --dry-run | grep "Tasks:"
bunx turbo run test --filter='...[origin/main...HEAD]' --dry-run | grep "Tasks:"
```

The difference is your optimization opportunity.
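The two dry-run counts above can be turned into the affected ratio from the metrics table with a tiny parser. This is a sketch: it assumes the dry-run summary contains a line like `Tasks:    4 total`, which can vary across Turborepo versions.

```python
import re

def task_count(dry_run_output: str) -> int:
    """Pull the task total out of `turbo run ... --dry-run` summary text."""
    match = re.search(r"Tasks:\s*(\d+)", dry_run_output)
    if match is None:
        raise ValueError("no 'Tasks:' line found in output")
    return int(match.group(1))

# Using the example counts from this article: 45 tasks full, 4 affected
full = task_count("Tasks:    45 total")
affected = task_count("Tasks:    4 total")
print(f"affected ratio: {affected / full:.0%}")
```

A ratio under 20% means affected-only execution is doing its job; near 100% means your filters or dependency graph need attention.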
### Check cache hit rate

Look for `FULL TURBO` (hit) vs `cache miss` in Turborepo output. Below 80% on repeated runs means cache keys or storage need attention.

### Calculate queue time

In GitHub Actions, compare "Queued" to "In progress" timestamps. Over 30 seconds average means you're hitting concurrency limits.

### Implement affected-only execution

Start with `--filter='...[origin/main...HEAD]'` for Turborepo or `nx affected` for Nx. Pin your base SHA to avoid race conditions.

### Enable remote caching

Vercel's Turborepo cache or Nx Cloud. Setup takes 10 minutes. Run the same build twice and watch the second complete in seconds.

### Evaluate infrastructure

If you've optimized config and still hit limits, the constraint is infrastructure: unlimited concurrency and faster cache I/O are the next lever.

---

*Need unlimited concurrency and faster caching for your monorepo? WarpBuild removes GitHub's infrastructure constraints with a single line change. [Start free →](https://www.warpbuild.com)*

# GitHub Actions Price Change
URL: /blog/github-actions-price-change
GitHub Actions reduces pricing for GitHub hosted runners and adds a new $0.002/minute cost for self-hosted runners.

---
title: "GitHub Actions Price Change"
excerpt: "GitHub Actions reduces pricing for GitHub hosted runners and adds a new $0.002/minute cost for self-hosted runners."
description: "GitHub Actions reduces pricing for GitHub hosted runners and adds a new $0.002/minute cost for self-hosted runners."
date: "2025-12-15"
author: surya_oruganti
cover: "/images/blog/github-actions-price-change/cover.png"
---

GitHub Actions reduces pricing by 15-39% for GitHub-hosted runners (from 2026-01-01) and adds a new $0.002/minute cost for self-hosted runners (from 2026-03-01).

GitHub [recently wrote](https://github.blog/news-insights/product-news/lets-talk-about-github-actions/) that developers used 11.5 billion GitHub Actions minutes in 2025.
One can safely assume that a majority of this comes from enterprises, who in turn use self-hosted runners. In the earlier model with free self-hosted runner usage, GitHub had no way to monetize most of that Actions usage. This new self-hosted runner tax is a simple way for GitHub to monetize its Actions platform and push users toward GitHub-hosted runners.

Note: [Bitbucket recently announced](https://www.atlassian.com/blog/bitbucket/announcing-v5-self-hosted-runners) that they will be charging for self-hosted runners as well.

This is a significant change, and we break down what it means.

## GitHub's new tax on self-hosted runners

GitHub adds a new $0.002/minute cost for self-hosted runners. This is charged to all users except GitHub Enterprise Server (GHES) customers.

## Reduced pricing for GitHub hosted runners

GitHub has reduced pricing for GitHub-hosted runners. The reduction scales with runner size: smaller runners see a smaller price cut, whereas larger runners see a greater one.

It's fantastic to see that GitHub is reducing pricing for GitHub-hosted runners. However, the magnitude of the reduction is not as significant as one might expect.
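The interaction between the reduced hosted prices and the new tax is easy to model: a job's effective cost is (per-minute price, plus the tax where it applies) multiplied by duration. Here is a sketch using this article's figures (a 2 vCPU Ubuntu job at the new $0.006/min hosted rate versus a $0.004/min runner that finishes in roughly half the time and pays the $0.002/min tax); treat the numbers as illustrative.

```python
TAX_PER_MIN = 0.002  # GitHub's new fee on self-hosted runner minutes

def job_cost(price_per_min: float, minutes: float, self_hosted: bool) -> float:
    """Effective cost of one job, adding the per-minute tax for self-hosted runners."""
    tax = TAX_PER_MIN if self_hosted else 0.0
    return (price_per_min + tax) * minutes

hosted = job_cost(0.006, 10, self_hosted=False)     # 10-minute job, GitHub-hosted
alternative = job_cost(0.004, 5, self_hosted=True)  # same job on a ~2x-faster runner
print(round(hosted, 3), round(alternative, 3))
```

Because the tax is flat per minute, faster runners shrink it along with the base price, which is why runner speed matters more under the new model.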
## Pricing table

| OS | vCPUs | Old Price | New Price | WarpBuild Price | delta |
| ------- | ---- | --------- | --------- | --------------- | ------ |
| ubuntu | 2 | $0.008 | $0.006 | $0.004 | $0.002 |
| ubuntu | 4 | $0.016 | $0.012 | $0.008 | $0.004 |
| ubuntu | 8 | $0.032 | $0.022 | $0.016 | $0.006 |
| ubuntu | 16 | $0.064 | $0.042 | $0.032 | $0.010 |
| ubuntu | 32 | $0.128 | $0.082 | $0.064 | $0.018 |
| windows | 2 | $0.016 | $0.010 | $0.008 | $0.002 |
| windows | 4 | $0.032 | $0.022 | $0.016 | $0.006 |
| windows | 8 | $0.064 | $0.042 | $0.032 | $0.010 |
| windows | 16 | $0.128 | $0.082 | $0.064 | $0.018 |
| windows | 32 | $0.256 | $0.162 | $0.128 | $0.034 |
| macos | 6 | $0.160 | $0.102 | $0.080 | $0.022 |

The `delta` column shows the difference between the reduced GitHub-hosted runner price and the WarpBuild price. Two important observations:

1. WarpBuild runners remain cheaper for most sizes even after including the $0.002/minute self-hosted runner tax imposed by GitHub (the smallest 2 vCPU tiers land at parity on sticker price).
2. WarpBuild runners are ~twice as fast, so the observed costs are still significantly lower than the GitHub-hosted runners.

## Optimizing for cost

Here are the practical implications and considerations to optimize for cost, given the new pricing. These are generic and ensure you think through your workflows and runners before making any changes.

### 1. Self-hosting runners or using WarpBuild runners is still cheaper

Despite the $0.002/minute self-hosted runner tax, self-hosting runners on your cloud (aws/gcp/azure/...) or using WarpBuild runners remains the cheaper option.

### 2. Prefer larger runners

If your workflow scales with the number of vCPUs, prefer larger runners. That ensures you spend fewer minutes on the runner, which reduces the GitHub self-hosted runner tax. For example, using `actions-runner-controller` with heavy jobs running on 1 vCPU runners is not a good idea. Instead, prefer a 2 vCPU runner (say) if it runs the job ~2x faster.

### 3. Prefer faster runners

All else being equal, prefer faster runners. That ensures you spend fewer minutes on the runner, which reduces the GitHub self-hosted runner tax. For example, if you're self-hosting on AWS and using a `t3.medium` runner, consider a `t4g.medium` runner: the newer generation is faster but not much more expensive (note that `t4g` is arm64, so your build must support it).

WarpBuild runners have higher single-core performance than aws/gcp/azure hosted runners. This is coupled with directly attached NVMe disks for fast disk IO.

### 4. Prefer fewer shards

If you have a lot of shards for your jobs (example: tests on ~50 shards), consider reducing the number of shards and parallelizing the tests on fewer but larger runners.

### 5. Improve job performance

This is not new advice, but it's now more important than ever because of the additional GitHub self-hosted runner tax.

### 6. Use GitHub hosted runners for very short jobs

For linters and other very short jobs, it's better to use GitHub-hosted runners.

## What's not changing?

1. Public repos generally stand to gain from this change.
   - Standard runner size (`ubuntu-latest`) is still free.
   - No $0.002/minute tax for self-hosted runners.
2. GitHub Enterprise Server (GHES) users do not have the $0.002/minute tax.

## 🚀 Try WarpBuild

Using WarpBuild runners is cheaper than using GitHub-hosted runners, even after including the $0.002/minute self-hosted runner tax. WarpBuild Cloud Runners are baremetal servers with high single-core performance and directly attached NVMe disks for fast disk IO. Paired with caching and snapshotting capabilities, WarpBuild runners are a great way to optimize your workflows and save money.
# A Developer's Guide to Speeding Up GitHub Actions
URL: /blog/github-actions-speeding-up
A Developer's Guide to Speeding Up GitHub Actions

---
title: "A Developer's Guide to Speeding Up GitHub Actions"
excerpt: "A Developer's Guide to Speeding Up GitHub Actions"
description: "A Developer's Guide to Speeding Up GitHub Actions"
date: "2023-11-13"
author: surya_oruganti
cover: "/images/blog/github-actions-speeding-up/thumbnail.png"
---

GitHub Actions offers a powerful tool for automating CI/CD pipelines. However, slow runs can be frustrating. This guide provides an exhaustive evaluation of GitHub Actions workflow performance and suggests effective mitigation strategies.

Struggling with debugging GitHub Actions? Try our free tool, Action Debugger by WarpBuild, which allows real-time debugging via SSH into a running workflow.

If you are just starting off on GitHub Actions performance optimization, here are the first three things you should do:

1. **Cache Dependencies:** Reduce build times by avoiding repetitive dependency downloads.
2. **Parallelize Jobs:** Accelerate workflows by running jobs concurrently.
3. **Use Powerful Runners:** For intensive tasks, custom runners with high-spec hardware are beneficial.

Here is a comprehensive list of factors that can affect GitHub Actions performance, along with examples and mitigation strategies.

## Workflow and Script Configuration

### Complex Workflows

- **Description**: Workflows with many steps or intricate logic can take longer to execute.
- **Example**: A workflow that includes multiple build processes, testing across different environments, and deployment steps can become time-consuming.
- **Mitigation**: Simplify workflows by breaking them into smaller, more focused jobs. Use conditional steps to avoid unnecessary runs.

### Inefficient Build Scripts

- **Description**: Scripts not optimized for performance can slow down the workflow.
- **Example**: A build script that compiles code without leveraging caching or incremental builds. - **Mitigation**: Refactor scripts for efficiency, use caching where possible, and remove redundant operations. ### Misconfigured Workflow Triggers - **Description**: Triggers that initiate workflows unnecessarily or too frequently can cause delays. - **Example**: A workflow set to trigger on every push, including documentation updates, can lead to unneeded builds. - **Mitigation**: Configure triggers carefully, such as triggering workflows only on specific branches or for specific types of changes. ### Poorly Managed Artifacts/Logs - **Description**: Large or numerous build artifacts and logs can slow down operations. - **Example**: Saving extensive logs or large binary files after every build can consume significant storage and bandwidth. - **Mitigation**: Implement log rotation, compress artifacts, and retain only essential artifacts. If using self-hosted runners, the default GitHub cache can be extremely slow. Consider hosting your own cache server and/or artifact registry that is closer to compute. ## Repository and Codebase Factors ### High Repository Size - **Description**: Large repositories with extensive histories or large files take longer to clone and process. - **Example**: A repository with gigabytes of historical data and large binary files will be slow to clone. - **Mitigation**: Use Git LFS for large files, prune old branches, and compact repository history. ### Build Complexity - **Description**: Complex applications with many dependencies and modules require more time to build. - **Example**: A project with dozens of dependencies and a multi-module structure. - **Mitigation**: Optimize build processes, use dependency caching, and modularize the codebase. ### Dependency Management - **Description**: Fetching large or numerous external dependencies can be time-consuming. 
- **Example**: A build process that downloads hundreds of npm packages or large Docker images. - **Mitigation**: Cache dependencies and consider using a package manager that supports efficient dependency resolution. ## GitHub Hosted Runner Specifications ### Computational Resources - **Description**: Limited CPU and memory on runners can slow down intensive tasks. - **Example**: Compiling a large application on a runner with only 2 CPU cores and limited RAM. - **Mitigation**: Optimize code for resource efficiency or use self-hosted runners with better specs. ### Disk I/O Performance - **Description**: Disk speed and capacity affect tasks involving significant read/write operations. - **Example**: A workflow that involves processing large datasets stored on disk. - **Mitigation**: Optimize disk usage, consider storing less data on disk, or use self-hosted runners with faster disks. ### Network Bandwidth - **Description**: Limited network speed can delay tasks that require data transfer. - **Example**: Uploading large build artifacts or pulling large Docker images from a registry. - **Mitigation**: Compress data before transfer, reduce the size of artifacts, and use Docker layer caching. ## External and Environmental Factors ### External Dependencies - **Description**: Reliance on external services or APIs can introduce delays. - **Example**: A workflow that makes numerous calls to external APIs, which may be slow or rate-limited. - **Mitigation**: Minimize external API calls, implement retry logic with backoff, and use mocking for testing. ### Network Issues - **Description**: General internet connectivity problems or specific network issues can cause delays. - **Example**: Slowdowns due to poor network performance in the runner's region. - **Mitigation**: If persistent, consider using self-hosted runners in a different region with better connectivity. 
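To make the dependency-caching advice concrete, the workflow sketch below caches npm downloads keyed on the lockfile and also shards tests across a matrix, combining two of the quick wins listed at the start of this guide. The cache path, key, and the test runner's shard flag are illustrative for a Node.js project; adapt them to your stack.

```yaml
name: tests
on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      # Quick win 2: run four test shards concurrently.
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4

      # Quick win 1: cache the npm download cache, keyed on the
      # lockfile so it is invalidated only when dependencies change.
      - uses: actions/cache@v4
        with:
          path: ~/.npm
          key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}

      - run: npm ci

      # Illustrative shard flag; Jest and Playwright both support an
      # equivalent --shard=<n>/<total> option.
      - run: npm test -- --shard=${{ matrix.shard }}/4
```

Each shard job restores the same cache, so the dependency download cost is paid roughly once rather than once per shard.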
### Service Outages/Degradation - **Description**: Issues with GitHub or third-party services can impact workflow performance. - **Example**: Delays or failures due to GitHub service outages. - **Mitigation**: Implement error handling and retry mechanisms, stay informed about service statuses. ### Rate Limiting - **Description**: Hitting rate limits on external APIs or services used in workflows. - **Example**: Exceeding the API rate limit of a third-party service used in a build process. - **Mitigation**: Optimize API usage, cache responses where possible, and handle rate limit errors gracefully. ## Resource Management and Optimization Strategies ### Caching Strategies - **Description**: Effective use of caching can significantly reduce build times. - **Example**: Rebuilding dependencies in every run can be time-consuming. Using cached dependencies speeds up the process. - **Mitigation**: Implement caching for dependencies, build outputs, and other reusable data to avoid unnecessary repetition. ### Parallelization - **Description**: Running tasks in parallel can reduce overall workflow duration. - **Example**: Sequential execution of tests can be slow. Running tests in parallel across different runners can speed it up. - **Mitigation**: Break down workflows into parallelizable jobs and use matrix strategies for testing across environments. ### Resource Allocation Policies - **Description**: GitHub's internal resource allocation can affect performance. - **Example**: During peak times, resource contention might slow down workflows. - **Mitigation**: Optimize workflows to run efficiently with available resources and consider off-peak scheduling. ### Use of Self-Hosted Runners - **Description**: Custom hardware specifications can meet specific performance needs. - **Example**: A workflow requiring high CPU and RAM might be slow on GitHub-hosted runners. This is especially true of disk intensive workloads and workflows that require GPUs. 
- **Mitigation**: Set up self-hosted runners with tailored resources for intensive workflows. ## Platform-Specific Constraints and Policies ### Concurrent Usage Limits - **Description**: Limits on the number of concurrent workflows can delay execution. - **Example**: If multiple workflows are queued, subsequent ones will wait, causing delays. - **Mitigation**: Optimize workflow triggers and durations to minimize queuing delays. ### Resource Throttling Policies - **Description**: GitHub may limit resources to manage platform stability. - **Example**: Throttling can occur during times of high demand, affecting performance. - **Mitigation**: Design workflows to be efficient under varied resource availability conditions. ## Underlying Infrastructure and Platform Changes ### Updates in GitHub's Infrastructure - **Description**: GitHub's infrastructure changes can temporarily affect performance. - **Example**: Upgrades or maintenance activities can slow down or disrupt services. - **Mitigation**: Stay informed about GitHub updates, and plan for alternative strategies during scheduled maintenance. ### Environmental Variables Configuration - **Description**: Management of environment variables affects workflow efficiency. - **Example**: Misconfigured environment variables, like the `CI`, `DEBUG`, and `LOGLEVEL` variables, can lead to errors or inefficiencies. - **Mitigation**: Regularly review and optimize the use of environment variables, ensuring they are used effectively and securely. By addressing these factors, one can effectively manage resources, adapt to platform constraints, and stay responsive to changes in GitHub's infrastructure, all contributing to improved GitHub Action performance. ## A Note on GitHub-hosted Runner Processors The processors used in GitHub Actions runners are not optimized for CI and build workloads. They are server-grade processors designed for high-performance computing and data center workloads. 
The following table compares the specifications of the processors used in GitHub Actions runners (on 2023-11-13). From an anecdotal sampling of ~100 runs, the AMD EPYC 7763 64-Core Processor appears to be the most common processor used in GitHub Actions runners. | Processor Model | Class | Socket | Clockspeed | Turbo Speed | Cores/Threads | Typical TDP | Cache Size | Avg CPU Mark | Single Thread Rating | | ------------------------------------ | ------ | ----------- | ---------- | ----------- | ------------- | ----------- | ------------------------------------ | ------------ | -------------------- | | AMD EPYC 7763 64-Core Processor | Server | SP3 | 2.5 GHz | 3.5 GHz | 64/128 | 280 W | L1: 8128 KB, L2: 63.5 MB, L3: 512 MB | 86194 | 2577 | | Intel Xeon Platinum 8171M @ 2.60GHz | Server | FCLGA3647 | 2.6 GHz | 3.7 GHz | 26/52 | 165 W | L1: 1664 KB, L2: 26.0 MB, L3: 36 MB | 30632 | 2222 | | Intel Xeon Platinum 8272CL @ 2.60GHz | Server | FCLGA3647 | 2.7 GHz | 4.0 GHz | 26/52 | 205 W | L1: 1664 KB, L2: 26.0 MB, L3: 36 MB | 52386 | 2382 | | Intel Xeon Platinum 8370C @ 2.80GHz | Server | FCLGA4189 | 2.9 GHz | 3.5 GHz | 32/64 | 300 W | - | 55705 | 2479 | | Intel Xeon CPU E5-2673 v4 @ 2.30GHz | Server | FCLGA2011-3 | 2.3 GHz | 3.3 GHz | 20/40 | 135 W | L1: 1280 KB, L2: 5.0 MB, L3: 50 MB | 21576 | 2079 | For comparison, here are the specifications of some popular desktop processors: | Processor Model | Class | Socket | Clockspeed | Turbo Speed | Cores/Threads | Typical TDP | Cache Size | Avg CPU Mark | Single Thread Rating | | --------------------- | --------------- | --------- | ---------- | ----------- | -------------------- | ----------- | ----------------------------------- | ------------ | -------------------- | | Apple M3 8 Core | Desktop, Laptop | | 4.0 GHz | NA | 8 Cores, 8 Threads | | | 19247 | 4822 | | Intel Core i9-14900KF | Desktop | FCLGA1700 | 3.2 GHz | 6.0 GHz | 24 Cores, 32 Threads | 125 W | L1: 2176 KB, L2: 32.0 MB, L3: 36 MB | 61135 | 4798 | | Intel Core 
i9-13900KS | Desktop | FCLGA1700 | 3.2 GHz | 6.0 GHz | 24 Cores, 32 Threads | 150 W | L1: 2176 KB, L2: 32.0 MB, L3: 36 MB | 61924 | 4764 | | Intel Core i7-14700KF | Desktop | FCLGA1700 | 3.4 GHz | 5.6 GHz | 20 Cores, 28 Threads | 125 W | L1: 1792 KB, L2: 28.0 MB, L3: 33 MB | 54073 | 4541 | | Apple A17 Pro | Mobile/Embedded | | 3.8 GHz | NA | 6 Cores, 6 Threads | | | 12208 | 4515 | Source: [PassMark](https://www.cpubenchmark.net/) Note: ChatGPT was used to scrape processor information and to structure the article. These are not equivalent comparisons, however, since desktop and mobile processors can have heterogeneous cores, which are optimized for different workloads. WarpBuild provides runners with high-performance processors, which are optimized for CI and build workloads with fast disk IO and [improved caching](/docs/ci/caching). You can learn more about our hosted runners [here](/docs/ci/cloud-runners). ## Conclusion Improving the speed of your GitHub Actions isn't just about tweaking a few settings; it's about a holistic approach to workflow management and resource optimization. By understanding the underlying factors, you'll be well on your way to more efficient, faster GitHub Actions. **Share & Feedback:** Found this guide useful? Share it with fellow engineers. Feedback and additional factors are welcome. I'm surya@warpbuild.com. 
**Keywords:** - GitHub Actions Performance - CI/CD Pipeline Optimization - Workflow Efficiency in GitHub - GitHub Actions Speed Tips - Developer Guide to GitHub Actions # GitHub Actions runners in your AWS account URL: /blog/launch-byoc WarpBuild can now manage GitHub Actions runners on your own AWS account to increase security, customizability, and cut costs by 90% --- title: "GitHub Actions runners in your AWS account" excerpt: "WarpBuild can now manage GitHub Actions runners on your own AWS account to increase security, customizability, and cut costs by 90%" description: "WarpBuild can now manage GitHub Actions runners on your own AWS account to increase security, customizability, and cut costs by 90%" date: "2024-07-28" author: surya_oruganti cover: "/images/blog/launch-byoc/cover.png" --- ## Maximize Customizability and Cut Costs by 90% We're excited to announce a game-changing feature for WarpBuild: the ability to connect your own AWS cloud account. This integration allows WarpBuild to manage GitHub Actions runners directly within your AWS environment in a VPC that you specify. By leveraging your own cloud resources, you can reduce costs by up to 90% compared to GitHub-hosted runners, while gaining access to WarpBuild's advanced features that streamline your development process. This is the most flexible, powerful, and cost-effective way to run GitHub Actions on your AWS account. ## BYOC: Advantages of the AWS Integration ### Unmatched Cost Efficiency By using your own AWS account, you take advantage of AWS's flexible pricing and billing, leading to substantial savings. Our benchmarks show cost reductions of up to 90%, allowing you to allocate your budget more effectively. - Leverage spot instances, preferred pricing, and reserved instances to further reduce costs. - Choose your region to eliminate data transfer costs (ECR, NAT, etc.). 
### Enhanced Flexibility Every project has unique needs, and WarpBuild's AWS integration offers the flexibility to tailor your infrastructure accordingly. Whether you need specific instance types, custom storage performance, or tailored network configurations, managing your own infrastructure empowers you to optimize resources to meet your exact requirements - securely. ### Fast Colocated Caches One of the standout features of WarpBuild's AWS integration is the implementation of fast colocated caches. By placing caches close to your build runners, we significantly reduce build times, ensuring your team spends less time waiting and more time coding. This proximity speeds up data access, enhances performance, and improves overall efficiency. ### Advanced Debugging Tools Debugging in complex environments can be a headache. With WarpBuild managing your AWS infrastructure, you gain access to sophisticated debugging tools. These tools provide deep insights into your builds, enabling you to quickly identify and resolve issues. Real-time monitoring, detailed logs, and comprehensive analytics are at your fingertips, making troubleshooting faster and more effective. ## Additional Benefits of Hosting Runners on Your AWS Account ### Improved Security - **Network Isolation**: By hosting runners within your own VPC, you can isolate your CI/CD environment from other internet traffic, enhancing security. - **Custom Security Groups**: Define precise security group rules to control inbound and outbound traffic to your runners, minimizing exposure to potential threats. - **IAM Roles and Policies**: Apply custom IAM roles to ensure that runners have only the permissions they need. ### Performance Optimization - **Proximity to Resources**: Running within your own VPC allows for low-latency access to other AWS resources such as servers, databases, storage, and other services, speeding up your CI/CD pipelines. 
- **Custom Instance Types**: Choose instance types that best match your workload requirements, optimizing for CPU, memory, and I/O performance. ### Compliance and Governance - **Data Residency**: Ensure data compliance by keeping your build data within specific geographic regions, adhering to data residency regulations. - **Audit Logging**: Utilize AWS CloudTrail and other monitoring tools to maintain detailed audit logs of all actions performed within your infrastructure. ### Scalability and Customization - **Auto Scaling**: WarpBuild automatically manages your infrastructure, ensuring that your runners are always available and ready to run your builds with zero wasted resources. - **Custom AMIs**: Use custom Amazon Machine Images (AMIs) to pre-configure runners with specific tools and dependencies, reducing setup time and ensuring consistency across builds. ### Enhanced Monitoring and Alerts - **Resource Tagging**: Apply tags to AWS resources to track usage and costs, enabling more effective budget management and cost allocation. - **CloudWatch Integration** (coming soon): Monitor your runners with Amazon CloudWatch, setting up alerts for key metrics to proactively manage performance and health. - **Detailed Metrics** (coming soon): Collect and analyze detailed metrics on runner performance, helping to identify bottlenecks and optimize build processes. ## How It Works 1. **Connect Your AWS Account**: Sign in to WarpBuild and navigate to the AWS integration section. Follow the simple steps to securely link your AWS account. 2. **Connect Your Infrastructure**: Specify your infrastructure setup. Choose your network and security groups according to your project needs. 3. **Configure Runners**: Choose your runner instance types, disk, and network configurations. WarpBuild managed images are available for use on your account without any changes. 4. **Deploy and Manage**: WarpBuild takes care of deploying and managing the runners within your AWS environment. 
Enjoy seamless builds with our fast colocated caches and advanced debugging tools. 5. **Monitor and Optimize** (coming soon): Use WarpBuild's intuitive dashboard to monitor your builds, analyze performance, and make data-driven decisions to optimize your infrastructure further. ## Get Started Today Start maximizing your development efficiency with WarpBuild's AWS integration. [Learn more](http://docs.warpbuild.com/ci/byoc) and follow our step-by-step guide to connect your AWS account. WarpBuild is committed to providing you with the tools you need to build faster, smarter, and more cost-effectively. Join us in this new era of development and take your projects to the next level. --- Stay tuned for more updates and features coming soon. Happy building! --- For detailed technical documentation, visit [WarpBuild Docs](http://docs.warpbuild.com). For support and inquiries, contact us at support@warpbuild.com. --- # M4Pro powered MacOS Runners for GitHub Actions URL: /blog/m4pro-launch New M4Pro powered MacOS runners for GitHub Actions --- title: "M4Pro powered MacOS Runners for GitHub Actions" excerpt: "New M4Pro powered MacOS runners for GitHub Actions" description: "New M4Pro powered MacOS runners for GitHub Actions" date: "2025-03-20" author: surya_oruganti cover: "/images/blog/m4pro-launch/cover.png" --- WarpBuild's new M4Pro powered MacOS runners for GitHub Actions are now available. This is a huge upgrade from the previous M2Pro powered runners and is a step towards providing the best possible experience for MacOS runners on GitHub Actions. ## What's new? The new runners are powered by the latest M4Pro processors and are 30% faster than the previous M2Pro runners in our benchmarks. They are the same price as the previous runners, but now come with 22GB of RAM (up from 14GB previously). The PassMark single-core CPU benchmark score for the M4Pro processor is 4625, vs 4100 for the M2Pro processor. ## How do I get access? 
All users have been upgraded to the new runners. When a MacOS runner is requested, the new M4Pro runners will be used. - Add the WarpBuild bot to your organization and provide permissions to the repository. - Change the runner type to `warp-macos-15-arm64-6x` in your GitHub Actions workflow. Try out WarpBuild's new MacOS runners powered by the latest M4Pro processors: see the documentation to get started. --- # Fast MacOS runners for GitHub Actions URL: /blog/macos-runners WarpBuild provides fast, ephemeral MacOS runners for GitHub Actions that are 25% faster and 50% cheaper. --- title: "Fast MacOS runners for GitHub Actions" excerpt: "M2 Pro powered MacOS instances for GitHub Actions" description: "WarpBuild provides fast, ephemeral MacOS runners for GitHub Actions that are 25% faster and 50% cheaper." date: "2024-01-28" author: surya_oruganti cover: "/images/blog/macos-runners/cover.png" --- WarpBuild now supports MacOS GitHub Actions runners on M2 Pros with 6vCPU and 14GB memory. On paper, this is comparable to the M1 powered `macos-13-xlarge` runners by GitHub, but it is 25% faster and 50% cheaper, with the same tools pre-installed. This is interesting for the following reasons: 1. Managing Mac infra is a hassle. There are no orchestration mechanisms available for managing fleets of Mac instances. 1. Ephemeral runners for clean and reproducible builds are challenging to maintain. 1. GitHub runners are slow and pricey. 1. Build concurrency on MacOS instances is limited. WarpBuild addresses these challenges with the ability to run ephemeral VMs that can process GitHub Actions workflows in a fast and secure manner. More details on the MacOS instance configuration are in [the docs here](/docs/ci/cloud-runners#macos-m2-pro-on-arm64). You can get started with using MacOS instances on WarpBuild by changing the runner to `warp-macos-latest-arm64-6x` in the GitHub workflow file. Our customers use WarpBuild MacOS runners for iOS and Mac app builds. 
They are seeing benefits of 25-70% reduction in build time, resulting in up to 85% reduction in cost per job. We're very excited to make it broadly available for everyone to use. WarpBuild supports `macos-13` and `macos-14` runners. | Runner Tag | CPU | Memory | Storage | Price | Aliases | | -------------------------- | ------ | ------ | -------- | ------------ | ---------------------- | | warp-macos-latest-arm64-6x | 6 vCPU | 14GB | 64GB SSD | $0.08/minute | warp-macos-13-arm64-6x | | warp-macos-14-arm64-6x | 6 vCPU | 14GB | 64GB SSD | $0.08/minute | | We are on a mission to build the world's fastest CI cloud. I'd love your feedback on the product and thoughts on your biggest challenges with CI workflows that you'd like to see addressed. # WarpBuild's Observability Architecture URL: /blog/observability-architecture How WarpBuild built a zero-maintenance observability system using S3, presigned URLs, and OpenTelemetry - achieving infinite scalability with minimal infrastructure. --- title: "WarpBuild's Observability Architecture" excerpt: "How WarpBuild built a zero-maintenance observability system using S3, presigned URLs, and OpenTelemetry - achieving infinite scalability with minimal infrastructure." description: "How WarpBuild built a zero-maintenance observability system using S3, presigned URLs, and OpenTelemetry - achieving infinite scalability with minimal infrastructure." author: surya_oruganti date: "2025-10-21" cover: "/images/blog/observability-architecture/cover.png" --- Observability is critical for understanding CI/CD performance, but traditional observability stacks are complex, expensive, and require significant operational overhead. WarpBuild took a radically different approach: an S3-first architecture that eliminates maintenance burden while providing infinite scalability. 
- **Zero maintenance**: No databases, no clusters, no operational burden - **Infinite scalability**: Built on S3's proven durability and scale - **Minimal infrastructure**: Two simple components instead of complex observability stacks - **Cost-effective**: Pay only for S3 storage and data transfer This post walks through the architecture decisions that make this possible and why this approach works uniquely well for CI observability. ## The Textbook Observability Stack Most observability systems follow a familiar pattern: ```mermaid flowchart LR Agent["Agent/Collector
on Runner"] --> Gateway["Gateway/
Aggregator"] Gateway --> Storage["Time-series DB
(Prometheus, InfluxDB)"] Storage --> Query["Query Service
(API Layer)"] Query --> UI["Dashboard UI"] ``` This architecture requires: - Multiple service tiers (collector, gateway, storage, query layer) - Database clusters with replication and backups - Query optimization and indexing strategies - Monitoring for the monitoring system - Capacity planning and scaling - Multi-tenant support in each of the tiers For general-purpose observability, this complexity is necessary. Since WarpBuild is not general purpose observability, it has unique characteristics that allow for a much simpler approach. Not cargo-culting hardcore observability architecture, but building for the specific use case of CI/CD observability has huge advantages in our case. ## WarpBuild's S3-First Architecture ### High-Level Architecture Here's the complete observability flow in WarpBuild: ```mermaid flowchart TB subgraph Runner["Runner"] OTEL["OTEL Collector"] Proxy["Proxy"] end Backend["WarpBuild Backend
(Presigned URL Generator)"] S3["S3 Storage"] UI["WarpBuild UI
(Browser)"] OTEL -->|Metrics & Logs| Proxy Proxy <-->|Request/Get
Presigned URL| Backend Proxy -->|Upload Data| S3 UI <-->|Request/Get
Presigned URL| Backend UI -->|Fetch Data| S3 ``` WarpBuild's observability architecture has two simple paths: ingestion and retrieval. ### Data Ingestion Path When a job runs on a WarpBuild runner, metrics and logs flow directly to S3: ```mermaid sequenceDiagram participant R as Runner participant O as OTEL Collector participant P as Proxy participant B as WarpBuild Backend participant S3 as S3 Bucket Note over R,O: Runner boots R->>O: Start OTEL collector Note over O: Collect metrics & logs Note over R: Job allocated alt Observability Disabled R->>O: Kill collector Note over O: Stop collection else Observability Enabled O->>P: Send metrics/logs P->>B: Request presigned URL B->>P: Return S3 presigned URL P->>S3: Upload data directly end ``` **Key components:** 1. **OpenTelemetry Collector**: Starts automatically when the runner boots, collecting system metrics (CPU, memory, disk, network) and logs. 2. **Conditional Collection**: When a GitHub Actions job is allocated, the system checks if observability is enabled. If disabled, the collector is killed immediately to save resources. 3. **Proxy Service**: A lightweight proxy running on the runner that handles authentication and presigned URL generation. This exists because OTEL collectors don't natively support S3 presigned URLs—they require long-lived credentials. 4. **Direct S3 Upload**: Using presigned URLs, data flows directly from the runner to S3 without passing through intermediate services. OpenTelemetry collectors are designed to work with credential-based authentication (IAM roles, access keys). Since WarpBuild uses presigned URLs for security and simplicity, the proxy translates between OTEL's credential expectations and S3's presigned URL model. The proxy is minimal and handles only URL generation and secure communication with the backend. 
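To make the ingestion path more concrete, a collector configuration along these lines would gather the system metrics described above and hand them to the on-runner proxy. This is a hedged sketch rather than WarpBuild's actual configuration: the proxy endpoint, port, and scrape interval are assumptions, and the real collector also ships logs.

```yaml
receivers:
  # Scrape host-level CPU, memory, disk, and network metrics.
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
      disk:
      network:

exporters:
  # Export over OTLP/HTTP to the local proxy, which swaps the data
  # onto S3 presigned-URL uploads. The endpoint is an assumption.
  otlphttp:
    endpoint: http://localhost:4318

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlphttp]
```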
### Data Retrieval Path When a user views observability data in the WarpBuild UI, the architecture is even simpler: ```mermaid sequenceDiagram participant Browser participant Backend as WarpBuild Backend participant S3 as S3 Bucket Browser->>Backend: Request job metrics/logs Backend->>Backend: Generate presigned URLs Backend->>Browser: Return presigned URLs Browser->>S3: Fetch data directly Note over Browser: Display metrics & logs ``` The retrieval path has no intermediate query layer, no caching tier, no aggregation service. The browser fetches data directly from S3 and renders it. That's it. ## Why This Architecture Works This simplified architecture is possible because of several unique characteristics of CI/CD observability:
Unlike application monitoring where you need real-time alerting and complex queries, CI observability is primarily for **human consumption**. Developers want to see: - What resources did my job use? - When did the CPU spike? - What errors appeared in the logs? These are simple, single-job queries that don't require sophisticated query engines or real-time aggregation. Each CI job is isolated. Developers rarely need to query across multiple jobs simultaneously based on log contents or metric patterns. When aggregate analysis is needed (e.g., "show me all jobs that failed with X error this week"), AWS Athena can query S3 data directly using SQL without building a dedicated query layer. CI jobs typically generate a few megabytes of metrics and logs. Even heavy jobs rarely exceed tens of megabytes. This means: - S3's latency is acceptable (100-200ms to fetch a job's data) - Transfers are throughput bound, not latency constrained By using S3 as the primary storage and query layer, WarpBuild eliminates: - Database clusters that need monitoring and scaling - Index management and optimization - Backup and disaster recovery processes - Software upgrades and security patches S3 provides 99.999999999% (11 nines) durability and built-in versioning, replication, and lifecycle management.
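The lifecycle management mentioned above is what keeps an append-only bucket like this from growing forever. As an illustrative sketch (the bucket name and retention window are assumptions, not WarpBuild's actual policy), a CloudFormation bucket definition with an expiration rule might look like:

```yaml
Resources:
  ObservabilityBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: ci-observability-data   # placeholder name
      VersioningConfiguration:
        Status: Enabled
      LifecycleConfiguration:
        Rules:
          # Expire raw per-job metrics and logs after 90 days.
          - Id: expire-old-job-data
            Status: Enabled
            ExpirationInDays: 90
```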
## Architecture Benefits The entire observability system is two components: 1. OTEL collector with S3 export (via proxy) 2. Presigned URL generation in the backend The most important benefit is that we do not need to manage any infrastructure, while providing extremely fast data retrieval for end users as we do not have throughput bottlenecks with databases. At 1 million jobs/day, the data volume is ~1TB/day, which quickly gets out of control in databases. Besides, we do not need database features like aggregation, indexing, etc. Traditional observability stacks cost many thousands of dollars per month for database clusters and monitoring infrastructure, not including engineering time. With our architecture at 1 million jobs/day with 1MB data each, the total cost is ~$700/month instead of thousands - a 100x cost reduction with better reliability. This architecture isn't great for: - Aggregate analysis of metrics (p50, p90, p99, deviation from the median, etc.) - Metrics visualization (histogram, scatter plot, etc.) However, this is easily achievable by layering on AWS Athena and other tools. ## Conclusion Context matters. The "textbook" observability stack is necessary for real-time monitoring and complex querying, but our requirements allow for radical simplification. By leveraging S3's durability, scalability, and simplicity, WarpBuild delivers robust observability with: - Zero maintenance burden - Infinite scalability - Two simple components instead of a complex stack, with no scalability concerns This architectural decision embodies WarpBuild's philosophy: **build for scale with minimal people**. As the platform grows, the observability system requires no additional engineering effort - it just works. Our architecture is inspired by the rise of S3-first architectures in the industry including turbopuffer and others. 
--- ### Learn More - **View job metrics and logs**: [WarpBuild Observability Dashboard](https://app.warpbuild.com/ci/observability) - **Documentation**: [Observability Docs](/docs/ci/observability/) - **Get started**: [Quick Start Guide](/docs/ci/quick-start/) Experience zero-maintenance observability alongside the world's fastest CI runners. Get started in minutes: [app.warpbuild.com](https://app.warpbuild.com) ### Call for developers We are looking for developers who are interested in building the future of CI/CD. If you are interested in this, get in touch with us at [hello@warpbuild.com](mailto:hello@warpbuild.com)! # Optimizing Dockerfiles for Fast Builds URL: /blog/optimizing-docker-builds Optimize Dockerfile definition for speeding up container builds and reducing image sizes --- title: "Optimizing Dockerfiles for Fast Builds" excerpt: "Optimize Dockerfile definition for speeding up container builds and reducing image sizes" description: "Optimize Dockerfile definition for speeding up container builds and reducing image sizes" date: "2023-12-28" author: prajjwal_dimri cover: "/images/blog/optimizing-docker-builds/cover.png" --- In this post, we will create a Dockerfile starting with a naive definition and incrementally improve it with well-established best practices used across the industry, to illustrate the optimization that can be achieved in build time and container size. A container image is a read-only template with instructions for creating a container. To build our own image, we need to create a Dockerfile, which is just a series of instructions that describes how the image needs to be built. The project that we are going to build today is a simple backend server written in Go. We need to create a binary of our server. Additionally, the binary requires zstd (the Zstandard compression algorithm) to be present on the system. 
The code for the project is available at [https://github.com/WarpBuilds/docker-build-optimization-example-project/blob/main/Dockerfile](https://github.com/WarpBuilds/docker-build-optimization-example-project/blob/main/Dockerfile)

The processes and steps described here are not language or framework specific. They can be adapted to any project.

### A note on terminology

While container images originated in Docker, the company, they have become a standard way of packaging applications and running them in a portable manner. The specification for generating container images is now maintained by the Open Container Initiative (OCI), which is part of the Linux Foundation. Writing a Dockerfile is the most common way to build an OCI-compliant image. Colloquially, Docker images and OCI images are used interchangeably.

## Initial Dockerfile

Our first target is to write a Dockerfile that can build our code and get the server up and running.

```dockerfile
# The base image.
FROM ubuntu:22.04

# Sets the working directory for any instructions that follow it.
WORKDIR /build

# Copies all the files from the current directory and
# adds them to the filesystem of the container.
COPY . .

# Upgrades all the installed packages using Ubuntu's package manager.
RUN apt update -y
RUN apt upgrade -y

# The default ubuntu image doesn't validate proxy.golang.org as CA,
# so we need to manually add it.
RUN apt install golang-go ca-certificates openssl -y
ARG cert_location=/usr/local/share/ca-certificates
RUN openssl s_client -showcerts -connect proxy.golang.org:443 </dev/null | openssl x509 -outform PEM > ${cert_location}/proxy.golang.crt
RUN update-ca-certificates

# Installs Z-standard
RUN apt install zstd -y

# Verify Z-standard installation
RUN zstd --version

# Downloads dependencies of our project
RUN go mod download

# Builds the go binary
ENV CGO_ENABLED=0 GOOS=linux GOARCH=amd64
RUN go build -ldflags="-s -w" -o apiserver .
# Exposes port used by our backend server
EXPOSE 8080

# Runs the backend server
RUN chmod +x apiserver
ENTRYPOINT ["./apiserver"]
```

![Initial Docker Build Benchmark](/images/blog/optimizing-docker-builds/initial-benchmark.png)

Build Time: `352.9s`
Final image size: `1.42GB`

An image is built from a Dockerfile using the `docker build .` command. Running the build takes our system around `6 minutes` and the final image size is `1.42GB`. All the benchmarks are done on a MacBook Air M1 with 16GB RAM.

Waiting `6 minutes` for every build is a really bad experience. Additionally, if the image is not cached, downloading a `1.42GB` image will impact the startup time of our container. So, let's address these issues. First, we will focus on optimizing our build time. Once that is done, we will address the issue of image size.

## Base Image

For our first improvement, we can start using the GoLang image as our base image, which already contains the GoLang SDK and the correct CA certificate configuration.

```dockerfile
# Changes to golang image
FROM golang:1.21

WORKDIR /build

COPY . .

RUN apt update -y
RUN apt upgrade -y

RUN apt install zstd -y
RUN zstd --version

RUN go mod download

ENV CGO_ENABLED=0 GOOS=linux GOARCH=amd64
RUN go build -ldflags="-s -w" -o apiserver .

ENTRYPOINT ["/apiserver"]
```

![Base Image Benchmark](/images/blog/optimizing-docker-builds/base-image-benchmark.png)

Build Time: `38.5s`
Final image size: `1.38GB`

As you would have noticed, this has already reduced our build time quite significantly. The image size is reduced a little bit as well.

## Layer Caching

Docker uses the concept of layers while building images. Each layer contains the filesystem changes to the image between the state before and after the execution of an instruction. In our tests above, you would have noticed that the benchmarks deliberately delete all the cache. This is done so that we can notice the difference it makes when we do start using layer caching. Layer caching is enabled by default.
If we build the Dockerfile above again, this time with layer caching, we can see that it takes around `24s` compared to the `38s` it took before.

![Layer Caching](/images/blog/optimizing-docker-builds/layer-caching.png)

![Layer Caching Benchmark](/images/blog/optimizing-docker-builds/layer-caching-benchmark.png)

Build Time: `24.7s`
Final image size: `1.38GB`

## Layer Ordering

Docker caches every layer it creates. Whenever it encounters a layer that has changed, the caches of all downstream layers are invalidated and built again. In our current Dockerfile, you would notice that every time any file changes, the `COPY . .` layer is invalidated, and this causes all of the downstream steps to be built again.

Let's first introduce a `.dockerignore` file, so that changes to env files, the .git folder, etc. don't invalidate our cache.

```
# Files
.dockerignore
.editorconfig
.gitignore
.env.*
Dockerfile
Makefile
LICENSE
**/*.md
**/*_test.go
*.out

# Folders
.git/
.github/
build/
```

As we are already aware, changes in any layer invalidate the caches of all downstream layers. So, we should always try to order our Dockerfiles from the least-changing instructions at the top to the most-changing instructions at the bottom. For our project, we can move our dependency installation steps above our copy step.

```dockerfile
FROM golang:1.21

WORKDIR /build

RUN apt update -y
RUN apt upgrade -y

RUN apt install zstd -y
RUN zstd --version

COPY . .

RUN go mod download

ENV CGO_ENABLED=0 GOOS=linux GOARCH=amd64
RUN go build -ldflags="-s -w" -o apiserver .

ENTRYPOINT ["/apiserver"]
```

![Layer Ordering Benchmark](/images/blog/optimizing-docker-builds/layer-ordering-benchmark.png)

Build Time: `17.9s`
Final image size: `1.38GB`

This reduces our cached build time to `18s`. We can optimize this further by only copying the files required by the dependency installer first, i.e. `go.mod` and `go.sum` for Go projects.
This will make sure that even if our code changes, our dependency installs are cached.

```dockerfile
FROM golang:1.21

WORKDIR /build

RUN apt update -y
RUN apt upgrade -y

RUN apt install zstd -y
RUN zstd --version

COPY go.mod go.sum ./
RUN go mod download

COPY . .

ENV CGO_ENABLED=0 GOOS=linux GOARCH=amd64
RUN go build -ldflags="-s -w" -o apiserver .

ENTRYPOINT ["/apiserver"]
```

![Dependency Install Benchmark](/images/blog/optimizing-docker-builds/dependency-install-benchmark.png)

Build Time: `10.7s`
Final image size: `1.38GB`

This reduces our build time to `10.7s`. We started with a build time of around 6 minutes and have reached 10 seconds, but our image size is still quite large. Every time our image changes, our container would have to download the new image, which will affect its startup time. Let's optimize that as well.

## Image Size

The first step to reduce our image size is to reduce our layers. Every layer that we introduce to our image increases its size. To reduce layers, we can bunch up our RUN statements together.

```dockerfile
FROM golang:1.21

WORKDIR /build

RUN apt update -y && apt upgrade -y && apt install zstd -y && zstd --version

COPY go.mod go.sum ./
RUN go mod download

COPY . .

ENV CGO_ENABLED=0 GOOS=linux GOARCH=amd64
RUN go build -ldflags="-s -w" -o apiserver .

ENTRYPOINT ["/apiserver"]
```

In our case this will not have a large effect, as we didn't have many layers to begin with.

## Multi-Stage Builds

We use `golang:1.21` as our base image, which is built on top of `debian`. It contains various packages and dependencies which we do not need. Go's build process generates a binary which can run on various systems even without the Go SDK installed. Let's address this by introducing another important Docker concept: multi-stage builds.

```docker
# Build stage
FROM golang:1.21 AS builder

WORKDIR /build

COPY go.mod go.sum ./
RUN go mod download

COPY . .
ENV CGO_ENABLED=0 GOOS=linux GOARCH=amd64
RUN go build -ldflags="-s -w" -o apiserver .

# Runtime stage
FROM alpine:3.19

COPY --from=builder ["/build/apiserver", "/"]

# Uses alpine's package manager to install zstd
RUN apk add zstd && zstd --version

ENTRYPOINT ["/apiserver"]
```

![Final Benchmark](/images/blog/optimizing-docker-builds/final-benchmark.png)

Build Time: `10.3s`
Final image size: `34MB`

We have now changed our Dockerfile to contain two stages. In the first (build) stage, we use the `golang:1.21` base image to build our Go source code and get a binary as output. We copy the binary to our runtime stage, which uses `alpine:3.19` as its base. Alpine is a very lightweight Linux distro suitable for creating lightweight runtime images. We have also added zstd to our Alpine base image, as the binary requires it at runtime.

The final size of our image is `34MB`. Our build time is also reduced by some milliseconds, as Docker tries to run these stages in parallel until it hits a dependency on the output of another stage.

Here we are assuming that our binary requires zstd to be installed on the system. If that were not the case, we could have used the `scratch` base image, which would have reduced our final image size to `25.7MB`.

```docker
FROM golang:1.21 AS builder

WORKDIR /build

COPY go.mod go.sum ./
RUN go mod download

COPY . .

ENV CGO_ENABLED=0 GOOS=linux GOARCH=amd64
RUN go build -ldflags="-s -w" -o apiserver .

FROM scratch

COPY --from=builder ["/build/apiserver", "/"]

# RUN apk add zstd && zstd --version

ENTRYPOINT ["/apiserver"]
```

## CI Builds

All the concepts that we have talked about here are also applicable to building a Docker image on CI systems. Let's see how we can run the Docker build on GitHub CI with layer caching enabled.
```yaml
name: Build Docker Image

on:
  push:
    branches:
      - "main"

jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: false
          # We are using GitHub's backend as the cache storage here. More info:
          # https://docs.docker.com/build/ci/github-actions/cache/#github-cache
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

It takes us `2m6s` to build this for the first time without using the cache. Using the cache reduces the build time to `1m38s`.

## 🚀 Use WarpBuild Runners

If you want to optimize your build times even further, you can use WarpBuild's runners. The same workflow, on a system similar to `ubuntu-latest`, finishes in `55s`. See the results for yourself here: [https://github.com/WarpBuilds/docker-build-optimization-example-project/actions/runs/7326651267/job/19952582967](https://github.com/WarpBuilds/docker-build-optimization-example-project/actions/runs/7326651267/job/19952582967)

Using WarpBuild Runners is as easy as replacing a line in your GitHub workflow file.

```diff
- runs-on: ubuntu-latest
+ runs-on: warp-ubuntu-latest-x64-2x
```

## Build Optimization Reference

| Optimization Step        | Build Time (seconds) | Image Size (GB) |
| ------------------------ | -------------------- | --------------- |
| Initial                  | 352.9                | 1.42            |
| Specific image selection | 38.5                 | 1.38            |
| Use caching              | 24.7                 | 1.38            |
| Layer ordering           | 17.9                 | 1.38            |
| Order file copying step  | 10.7                 | 1.38            |
| Multi-stage builds       | 10.3                 | 0.034 (34MB)    |

## Conclusion

In this post, we have seen how we can optimize our Dockerfiles for faster builds and smaller image sizes. We have also seen how we can use WarpBuild Runners to further optimize our build times. This post focused on optimizing the definition of the Dockerfile and is foundational to optimizing build times and image sizes.
In the future, we will look at optimizing the build process itself through container layer caching in CI systems and alternate build systems like Bazel.

# Optimizing Self-Hosted GitHub Actions Runner Costs

URL: /blog/optimizing-self-hosted-runner-costs

Checklist of strategies to cut self-hosted GitHub Actions costs with networking, caching, autoscaling, and compliance-friendly patterns.

---
title: "Optimizing Self-Hosted GitHub Actions Runner Costs"
excerpt: "Checklist of strategies to cut self-hosted GitHub Actions costs with networking, caching, autoscaling, and compliance-friendly patterns."
description: "Checklist of strategies to cut self-hosted GitHub Actions costs with networking, caching, autoscaling, and compliance-friendly patterns."
author: surya_oruganti
cover: "/images/blog/optimizing-self-hosted-runner-costs/cover.png"
date: "2025-10-21"
---

import { Step, Steps } from 'fumadocs-ui/components/steps';

Running CI is essential, but your self-hosted runner bill doesn't have to be. This guide contains strategies to reduce costs without sacrificing reliability or compliance. To keep things vendor-neutral, we cite primary sources throughout so you can validate assumptions and adapt them to your environment.

Treat this post as a checklist - going through each item and applying the best practices to your environment will help you reduce costs.

Before optimizing, baseline your usage and costs:

- GitHub billing and usage: About billing, Viewing usage, Run usage API
- Cloud cost explorers: AWS Cost Explorer, GCP Cloud Billing, Azure Cost Management

## Core cost optimization strategies

### Infrastructure optimization

- Spot/Preemptible capacity: Typically 60-90% cheaper, but interruptible. Use job retries and checkpointing; isolate long-lived state from runners.
- AWS EC2 Spot: capacity-optimized allocation; interruption notices. See EC2 Spot and best practices.
- GCP Preemptible/Spot VMs: GCP Preemptible/Spot VMs
- Azure Spot VMs: Azure Spot VMs
- Autoscaling: Scale-to-zero when queues are empty; scale quickly when demand spikes. Combine queue depth, pending job counts, and target start SLOs.
- Right-sizing: Measure CPU, memory, I/O. Choose the knee of the performance-cost curve, not the max spec.
- Commitments: Reserved/Committed use discounts work for steady baselines; keep burst on spot.

```mermaid
flowchart LR
  Q["Queued jobs?"] -->|No| Z["Scale to zero"]
  Q -->|Yes| T{"Target start SLO met?"}
  T -- No --> Up["Scale up"]
  T -- Yes --> K["Keep size"]
  Up --> R{"Budget guardrails?"}
  R -- Exceeded --> Fan["Reduce fan-out or size"]
  R -- OK --> Mon["Monitor"]
```

### Ephemeral vs reusable runners

Ephemeral runners (one job, then teardown) provide a guaranteed clean state and stronger isolation. Reusable runners keep state and caches across jobs to cut minutes, but need hygiene.

- Ephemeral: best for untrusted code, stricter compliance, and multi-tenant orgs. Trade-off: less cache reuse, potentially more minutes but lower security/ops risk. Most importantly, it leads to reproducible builds. This is highly recommended for CI.
- Reusable: best for trusted repos and cache-heavy builds. Trade-off: requires cleanup to avoid state bleed; consider periodic reimaging. Do this only if you have a good reason to keep the state.

```mermaid
flowchart TD
  A["Repo trust: org-internal?"] -->|No| E["Use ephemeral"]
  A -->|Yes| C{"Cache hit rate high?"}
  C -- Yes --> R["Consider reusable"]
  C -- No --> E
  R --> H{"Compliance strict?
  (PCI/HIPAA)"}
  H -- Yes --> E
  H -- No --> O{"High ops maturity
  & ok with risk?"}
  O -- Yes --> RU["Use reusable"]
  O -- No --> E
```

- GitHub runner `--ephemeral` for one-job-per-runner: docs
- actions-runner-controller RunnerScaleSet with ephemeral pods and scale-to-zero: ARC
- Terraform AWS GitHub Runner module supports ephemeral, autoscaled runners on AWS: repo

### Caching and storage

- Local dependency caches (npm, pip, gradle, cargo, etc.) via the GitHub Actions cache.
- Docker layer caching: use Buildx and a registry/cache near compute via Docker Buildx cache.
- Artifacts: upload only what's needed, compress, and reduce retention.

Example (Docker Buildx with GitHub cache backend):

```yaml
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v6
  with:
    push: false
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

| Strategy | Typical impact | Notes |
| --- | --- | --- |
| Dependency cache | 20-60% faster | Stable lockfiles help maximize hits |
| Docker layer cache | 20-70% faster | Co-locate cache/registry with runners |
| Artifact retention 7-14d | 80-90% storage reduction | From GitHub default of 90d |
| Reusable runners | Up to 40x faster | Depends on runner size and amount of state kept; requires periodic cleanup |

### Networking optimization

Private subnets often require NAT for egress. NAT gateways typically charge hourly + per-GB processed. Heavy egress can dwarf compute savings. Prefer endpoints and keep traffic in-region.

Public runners have direct internet egress (cheapest); private runners require NAT (higher cost but better control). Use a hybrid: public for general CI, private for sensitive workloads.

Use gateway endpoints (AWS S3, GCP Private Google Access, Azure service endpoints) to bypass NAT and reduce egress costs. Keep runners, registries, caches, and buckets in the same region/AZ to minimize cross-region and cross-AZ transfer charges. Use regional repos; avoid cross-region pulls.

```mermaid
flowchart TB
  subgraph Region
    subgraph VPC["VPC / VNet"]
      Runners["Runners
      (ASG/VMSS or K8s nodes)"]
      NAT["NAT
      (only if needed)"]
      Endp["Endpoints:
      S3/GCS/Storage,
      ECR/AR/ACR"]
    end
    Cross-Account[("Cross-Account
    Storage and Access")]
  end
  Runners --> Endp
  Runners --> Cross-Account
  Runners -.-> NAT
```

---

## Open-source and free tools

- actions-runner-controller (ARC): Kubernetes operator for autoscaling GitHub runners.
- Terraform AWS GitHub Runner module: serverless, autoscaling self-hosted runners on AWS.
- Infracost: cost impact in PRs.
- AWS Cost Explorer, GCP Cloud Billing, Azure Cost Management.

---

## Cloud-specific optimization

Use this accordion for provider details. Keep the rest of this guide cloud-agnostic.

#### AWS

Compute

- EC2 Spot with capacity-optimized allocation; test multiple instance types. EC2 Spot
- EKS + ARC for scale-to-zero runners; consider Karpenter for node right-sizing.

Networking

- Prefer Gateway Endpoints for S3 and DynamoDB to avoid NAT traversal. VPC endpoints
- Use Interface Endpoints for ECR API and ECR DKR (image pulls) to keep traffic private. ECR endpoints
- NAT choices: gateway (hourly + per-GB) vs NAT instance for low throughput; place one NAT per AZ to avoid cross-AZ data charges. NAT pricing

Storage/Registry

- S3 lifecycle policies and storage classes (IA/Glacier) for artifacts. S3 lifecycle
- ECR repos in the same region; replicate only if needed. ECR

Terraform examples

```hcl
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.route_table_ids
}
```

```hcl
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

References: EC2 pricing, Spot, NAT pricing, VPC endpoints, S3 pricing
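To see why the NAT-vs-endpoint tradeoff matters, here is a rough model of NAT gateway cost for container image pulls. The per-hour and per-GB rates are typical published us-east-1 figures and the workload numbers are hypothetical; check current AWS pricing before relying on this:

```python
# Rough NAT gateway cost for image pulls through a private subnet.
# Assumptions (illustrative): $0.045/hr and $0.045/GB processed, typical
# us-east-1 NAT gateway rates; workload numbers are hypothetical.
nat_hourly = 0.045            # USD per hour per NAT gateway
nat_per_gb = 0.045            # USD per GB processed

builds_per_day = 500
image_gb_per_build = 2        # pulled through NAT on every cold runner

gb_per_month = builds_per_day * image_gb_per_build * 30
nat_monthly = nat_hourly * 24 * 30 + nat_per_gb * gb_per_month
print(f"{gb_per_month} GB/month via NAT -> ~${nat_monthly:.0f}/month")
```

Since S3 gateway endpoints have no data-processing charge, routing ECR layer pulls through endpoints removes most of the per-GB component, leaving only the hourly NAT cost for residual traffic.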

#### GCP

Compute

- Spot/Preemptible VMs in Managed Instance Groups with multiple machine types. docs
- GKE (Autopilot or Standard) with ARC; right-size node pools.

Networking

- Enable Private Google Access so private VMs reach GCS/Artifact Registry without public egress. docs
- Cloud NAT sized appropriately; avoid cross-region pulls. Cloud NAT pricing

Storage/Registry

- Artifact Registry regional repos in the same region as runners. docs
- GCS lifecycle rules for artifacts. docs

Examples

```hcl
resource "google_compute_subnetwork" "subnet" {
  name                     = "ci-private"
  ip_cidr_range            = var.cidr
  network                  = var.network
  region                   = var.region
  private_ip_google_access = true
}
```

```hcl
resource "google_compute_router" "router" {
  name    = "ci-router"
  network = var.network
  region  = var.region
}

resource "google_compute_router_nat" "nat" {
  name                               = "ci-nat"
  router                             = google_compute_router.router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  subnetwork {
    name                    = google_compute_subnetwork.subnet.name
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}
```

References: Compute pricing, Spot/Preemptible, Private Google Access, Cloud NAT pricing, Artifact Registry

#### Azure

Compute

- Spot VMs in VM Scale Sets; consider capacity reservations for stability. docs
- AKS with ARC; autoscaling node pools and right-sized SKUs.

Networking

Storage/Registry

- ACR in-region with runners; enable geo-replication only if required. docs
- Blob Storage lifecycle rules for artifacts. docs

Examples

```hcl
resource "azurerm_subnet_service_endpoint_storage_policy" "storage" {
  name                 = "allow-storage"
  resource_group_name  = var.rg
  virtual_network_name = var.vnet
  subnet_name          = var.subnet
  storage_accounts     = [azurerm_storage_account.artifacts.id]
}
```

```hcl
resource "azurerm_private_endpoint" "acr" {
  name                = "acr-pe"
  location            = var.location
  resource_group_name = var.rg
  subnet_id           = azurerm_subnet.private.id

  private_service_connection {
    name                           = "acr"
    private_connection_resource_id = azurerm_container_registry.acr.id
    is_manual_connection           = false
    subresource_names              = ["registry"]
  }
}
```

References: VM pricing, Spot, NAT pricing, Private Endpoints, ACR

---

## Industry-specific considerations

### Financial services

- SOC 2 and PCI-DSS drive stricter isolation and auditability. Prefer ephemeral runners for untrusted code; ensure logs are centralized (not kept as long-lived artifacts).
- Use OIDC and short-lived credentials for cloud access; scope IAM roles tightly.
- Keep sensitive builds in private subnets behind endpoints; avoid cross-region traffic.

References: SOC 2, PCI-DSS

### Healthcare

- HIPAA requires administrative, physical, and technical safeguards; do not store PHI in CI logs or artifacts.
- Sign a BAA with your cloud provider; choose compliant regions; encrypt at rest and in transit.
- Favor ephemeral runners and minimal artifact retention.

References: HIPAA Security Rule, provider guidance for HIPAA on AWS, GCP, Azure

---

## Monitoring and cost tracking

- Dashboards: minutes, spend, queue time, runner utilization, cache hit rates.
- Alerts: budget thresholds, anomaly detection.
- APIs: GitHub run usage API, cloud billing exports.

```bash
# Org billing summary (requires org admin)
gh api -H "Accept: application/vnd.github+json" /orgs/OWNER/settings/billing/actions | jq

# Run timing (billable minutes)
gh api -H "Accept: application/vnd.github+json" /repos/OWNER/REPO/actions/runs/123456/timing | jq
```

---

## Advanced optimization strategies

### Ephemeral runners deep dive

- CI reproducibility: ephemeral runners lead to reproducible builds. This is extremely important for CI.
- Security-first: one job per VM/pod; automatic teardown eliminates drift.
- Cost knobs: rely on remote/registry caches and artifact pruning to offset cache losses.
- ARC RunnerScaleSet and the Terraform AWS GitHub Runner module support ephemeral patterns out of the box.

### Job batching and scheduling

- Batching: batch nightly jobs and low-priority tasks in off-peak windows; restrict max-parallel to contain burst costs.
- Spot-friendly pipelines: persist caches early; checkpoint long jobs to resume.
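The spot guidance above can be made concrete with a blended-cost sketch. Every rate and probability below is a hypothetical placeholder, not a measured number:

```python
# Hypothetical blended cost of running one CI job on spot vs on-demand.
# Interrupted spot jobs are retried from scratch (no checkpointing), so
# expected spot cost includes the wasted minutes from interruptions.
on_demand_per_min = 0.008     # USD/min for the runner instance (assumed)
spot_discount = 0.70          # spot ~70% cheaper (assumed)
interruption_rate = 0.05      # 5% of jobs interrupted and retried (assumed)
job_minutes = 20

spot_per_min = on_demand_per_min * (1 - spot_discount)
expected_spot_cost = spot_per_min * job_minutes * (1 + interruption_rate)
on_demand_cost = on_demand_per_min * job_minutes

print(f"on-demand: ${on_demand_cost:.3f}, spot with retries: ${expected_spot_cost:.4f}")
```

Even with retry waste priced in, spot stays far cheaper here; the gap narrows as interruption rates rise or jobs get longer, which is why checkpointing long jobs matters.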
```mermaid
flowchart TD
  S["Non-critical job?"] -->|Yes| Off["Schedule off-peak"]
  S -->|No| Now["Run now"]
  Off --> Spot["Prefer Spot/Preemptible"]
  Now --> Guard["Apply concurrency + timeouts"]
```

### Workflow-level cost controls

- Conditional execution (paths/paths-ignore), concurrency cancellation, timeouts, matrix throttling.
- Keep storage cheap: compress artifacts, shorten retention, upload minimal logs.

---

## Cost comparison and ROI

| Monthly workload | Hosted Linux x64 | Self-hosted on-demand | Self-hosted spot | Notes |
| --- | --- | --- | --- | --- |
| 10,000 min | $$ | $ | $ | Depends on instance type, cache hits, NAT/egress |
| 100,000 min | $$$ | $$ | $-$$ | Maintenance overhead more salient |

Exact numbers vary by region, instance type, cache effectiveness, and egress. Use cloud calculators and your actual run data.

---

## Implementation checklist

- Enable concurrency cancellation and job timeouts
- Reduce artifact retention to 7-14 days; compress logs
- Co-locate runners, registry, and artifacts in the same region
- Add storage/registry endpoints to avoid NAT traversal
- Introduce spot/preemptible runners with safe retry policies
- Migrate to ephemeral runners for untrusted code paths
- Adopt ARC or the Terraform AWS GitHub Runner module for autoscaling
- Right-size instance SKUs based on utilization
- Implement per-team cost allocation and budgets
- Consolidate NAT and endpoint topology; reduce cross-AZ traffic
- Establish image baking with pre-baked caches

---

## References

- GitHub Actions billing and usage: billing, usage, run usage API
- AWS: EC2 pricing, Spot, NAT pricing, VPC endpoints, S3 pricing, ECR endpoints
- GCP: Compute pricing, Spot/Preemptible, Private Google Access, Cloud NAT pricing, Artifact Registry
- Azure: VM pricing, Spot, NAT pricing, Private Endpoints, ACR
- Compliance: SOC 2, PCI-DSS, HIPAA Security Rule

---

This guide is vendor-neutral; if you want managed building blocks that implement many of the above, see the WarpBuild docs: [`/docs/ci/`](/docs/ci/).
WarpBuild offers a comprehensive solution for self-hosted runners, including support for Linux and Windows across all major cloud providers, built for enterprises. Get started today with WarpBuild: [`https://app.warpbuild.com/`](https://app.warpbuild.com/). WarpBuild also offers a cloud-hosted solution with high-performance runners that are 10x faster and 90% cheaper than GitHub-hosted infrastructure, optimized for peak performance and seamless integration.

# Rate Limit Cheatsheet for Self-Hosting GitHub Runners

URL: /blog/rate-limits-self-hosted-runners

Various rate limits to keep in mind when deciding to self-host GitHub Actions runners

---
title: "Rate Limit Cheatsheet for Self-Hosting Github Runners"
excerpt: "Various rate limits to keep in mind when deciding to self-host GitHub Action Runners"
description: "Various rate limits to keep in mind when deciding to self-host GitHub Action Runners"
date: "2024-06-12"
author: prajjwal_dimri
cover: "/images/blog/rate-limits-self-hosted-runners/cover.webp"
---

When self-hosting GitHub Actions runners, understanding and managing rate limits across various services is crucial for maintaining efficient and uninterrupted CI/CD workflows. This guide provides an overview of rate limits for major services and some best practices to handle them effectively.

This blog post is also accompanied by a handy _cheat-sheet_ which you can download [here](/images/blog/rate-limits-self-hosted-runners/cheatsheet.pdf).

## GitHub APIs

GitHub imposes several rate limits to ensure fair usage. These can cause issues when you are trying to pull information from GitHub using their APIs.

- **API requests from self-hosted runners**: 1,000 requests per hour across all actions within a repository.
- **Primary rate limit for authenticated users**: 5,000 requests per hour.
- **Primary rate limit for GitHub App installations**:
  - Minimum 5,000 requests per hour (can go up to 12,500 depending on the number of repos and users).
  - 15,000 requests per hour if the installation is on a GitHub Enterprise Cloud organization.
- **Primary rate limit for GITHUB_TOKEN in GitHub Actions**: 1,000 requests per hour per repository. For GitHub Enterprise Cloud accounts, the limit is 15,000 requests per hour per repo.

**Secondary rate limits**: Along with the primary limits above, GitHub also imposes secondary rate limits to limit concurrent calls from systems.

- No more than 100 concurrent requests allowed.
- No more than 900 read requests per minute or 180 write requests per minute to a single REST endpoint.

## Cloud Providers (Provisioning Runners)

Keep these limits in mind when provisioning VMs as self-hosted runners on your preferred cloud provider.

#### AWS EC2

- **Token bucket algorithm**: Maximum of 1,000 tokens, with a refill rate of 2 tokens per second.

#### Google Compute Engine

- **API Rate Limit**: 1,500 requests per minute per region.

#### Microsoft Azure

- **API Rate Limit**: 1,200 writes per hour per subscription.

#### Hetzner Cloud

- **API Rate Limit**: 3,600 requests per hour per project.

> These limits can be increased by contacting the respective cloud provider's support.

## Docker Hub

Docker Hub enforces rate limits on container image pulls. As you can see, the limits are quite low and are usually a major issue when using self-hosted runners.

- **Anonymous users**: 100 pulls per 6 hours per IP address.
- **Authenticated users**: 200 pulls per 6-hour period per account.
- **Users with a paid Docker subscription**: Up to 5,000 pulls per day.

Read about how we dealt with this limitation in our runners in our blog post: [Docker registry mirror setup](https://www.warpbuild.com/blog/docker-mirror-setup)

## Image Registries

### Amazon ECR

- **Authenticated**: 10 image pulls per second.
- **Unauthenticated**: 1 image pull per second.

### Google Artifact Registry

- **Rate Limits**:
  - 1,000 requests per second.
  - 300 write requests per second.
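The EC2 token-bucket limit quoted above is easy to reason about with a small calculation. The bucket size and refill rate come from the AWS figures listed; the burst size is a hypothetical workload:

```python
# EC2-style token-bucket throttling: a 1,000-token bucket refilled at
# 2 tokens/second. How long does a large burst of API calls take once
# the stored tokens are spent? (Burst size is hypothetical.)
bucket_max = 1_000
refill_per_s = 2.0

burst_calls = 1_600                       # e.g. provisioning 1,600 runners at once
immediate = min(burst_calls, bucket_max)  # first calls spend stored tokens
remaining = burst_calls - immediate
drain_seconds = remaining / refill_per_s  # the rest proceed at the refill rate

print(f"{immediate} calls go through immediately; the remaining {remaining} take {drain_seconds:.0f}s")
```

This is why large scale-out events should be spread over time or batched: once the bucket drains, provisioning throughput collapses to the refill rate.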
### Azure Container Registry

- **Basic Tier**: 1,000 read, 100 write operations per minute.
- **Standard Tier**: 3,000 read, 500 write operations per minute.
- **Premium Tier**: 10,000 read, 2,000 write operations per minute.

> These limits can be increased by contacting the respective cloud provider's support.

## Package Registries

### RubyGems

- **Rate Limit**: 10 requests per second.

## Best Practices

- **Monitor Usage**: Regularly check your usage against these quotas using tools provided by the respective cloud providers.
- **Optimize API Calls**: Reduce the frequency of API calls where possible, using caching and batch operations.
- **Request Increases**: If your usage patterns require higher limits, request quota increases for that service.

By managing these rate limits and optimizing your interactions with these services, you can ensure smooth and efficient operations for your self-hosted GitHub runners.

## WarpBuild Runners

With WarpBuild, most of the rate limiting cases above (VM provisioning, GitHub APIs, Docker Hub APIs) are automatically handled for you. WarpBuild provides performant runners for GitHub Actions at a fraction of the cost. Supercharge your builds and [Go Warp!](https://www.warpbuild.com/) today.
## References

- [GitHub API Rate Limits](https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting)
- [Docker Hub Rate Limits](https://docs.docker.com/docker-hub/download-rate-limit/)
- [EC2 Rate Limits](https://docs.aws.amazon.com/ec2/latest/devguide/ec2-api-throttling.html)
- [Compute Engine Rate Quotas](https://cloud.google.com/compute/api-quota)
- [Azure Throttling](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling)
- [Hetzner Cloud Rate Limits](https://docs.hetzner.cloud/#rate-limiting)
- [Amazon ECR Service Quotas](https://docs.aws.amazon.com/AmazonECR/latest/userguide/service-quotas.html)
- [Google Artifact Registry Quotas](https://cloud.google.com/compute/quotas)
- [Azure Container Registry Limits](https://github.com/MicrosoftDocs/azure-docs/blob/main/includes/container-registry-limits.md)
- [Ruby Gems Rate Limit](https://guides.rubygems.org/rubygems-org-rate-limits/)

# RSS feed generator from Markdown files

URL: /blog/rss-feed-generator

Generate RSS feeds from Markdown pages.

---
title: "RSS feed generator from Markdown files"
excerpt: "Generate RSS feeds from Markdown pages."
description: "Generate RSS feeds from Markdown pages."
date: "2024-11-11"
author: surya_oruganti
cover: "/images/blog/rss-feed-generator/cover.png"
---

The [WarpBuild documentation site](https://docs.warpbuilds.com) is built with Docusaurus and hosted on Vercel. The documentation is a collection of markdown files stored in a GitHub repository. Here's a simple script to generate RSS feeds for the documentation pages.

I used this script to generate the RSS feed for the [changelog](https://docs.warpbuilds.com/changelog) page so users can subscribe to the changelog via RSS, especially to keep track of breaking changes. This was built heavily leveraging `claude sonnet 3.5 v2` and `cursor`.
[Docusaurus](https://docusaurus.io/showcase) is a static site generator with content in markdown and extensive customization options. It is maintained by Meta Open Source and is used by many popular companies including Meta, The Linux Foundation, and Red Hat.

While Docusaurus has a great [RSS feed generator](https://docusaurus.io/docs/blog#rss-feed) for blog posts, it does not support RSS feeds for the documentation content page type.

Hope you find this useful!

## RSS Feed Generator Usage

The `changelog-to-rss.sh` script generates the `changelog.xml` file, which is the RSS feed for the changelog.

1. Keep the `slug` in the frontmatter of the changelog file the same as the filename.
2. The `slug` is used to generate the permalink for the changelog entry.
3. The `updatedAt` field in the frontmatter is used to set the date of the changelog entry.
4. The permalink points to the different sections in the changelog.
5. Sections starting with `###` in the changelog file are used as the title of the RSS item.
6. All the markdown files are in the `docs/changelog` directory, one file per month. The naming convention is `YYYY-monthname.mdx`. Example: `2024-October.mdx`.

## The Script

The code for the script is available in the [warpbuilds/docs-rss-feed](https://github.com/warpbuilds/docs-rss-feed) repository.
```bash
#!/bin/bash

# Configuration
FEED_TITLE="WarpBuild Changelog"
FEED_DESC="WarpBuild platform updates, improvements, and bug fixes"
FEED_LINK="/docs/ci/changelog"
DOCS_BASE_URL="https://docs.warpbuild.com"
OUTPUT_FILE="static/changelog.xml"
CHANGELOG_DIR="docs/changelog"

# Create RSS header
cat > "$OUTPUT_FILE" << EOF
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>$FEED_TITLE</title>
<description>$FEED_DESC</description>
<link>$FEED_LINK</link>
<lastBuildDate>$(date -R)</lastBuildDate>
EOF

# Function to convert date format for macOS
convert_date() {
    local input_date="$1"
    if [ -z "$input_date" ]; then
        return 1
    fi
    # Convert "Month DD, YYYY" to RFC822 format and strip the time portion
    date -R -j -f "%B %d, %Y" "$input_date" 2>/dev/null | sed 's/ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] .*//'
}

# Function to create anchor-friendly string
create_anchor() {
    local input="$1"
    if [ -z "$input" ]; then
        return 1
    fi
    echo "$input" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -d ',' 2>/dev/null
}

# Function to extract frontmatter value
get_frontmatter_value() {
    local file="$1"
    local key="$2"
    awk -v key="$key:" '$1 == key {print substr($0, length(key) + 3)}' "$file" | tr -d '"'
}

# Function to process markdown content
process_markdown() {
    local content="$1"
    local processed="$content"
    # Replace markdown links with their link text
    processed=$(echo "$processed" | perl -pe 's|\[([^\]]*)\]\(([^\)]*)\)|\1|g')
    # Strip literal newline escapes
    processed=$(echo "$processed" | sed 's/\\n//g')
    echo "$processed"
}

# Process each changelog file in reverse chronological order
for file in $(ls -r "$CHANGELOG_DIR"/*.mdx); do
    # Skip changelog.mdx
    if [[ $file == *"changelog.mdx" ]]; then
        continue
    fi

    # Get update date and title from frontmatter
    updated_at=$(get_frontmatter_value "$file" "updatedAt")
    title=$(get_frontmatter_value "$file" "title")

    # Extract the slug from the filename (remove path and extension)
    SLUG=$(basename "$file" .mdx)

    CONTENT=""
    CURRENT_DATE=""

    while IFS= read -r line; do
        # Look for changelog entries starting with ###
        if [[ $line =~ ^###[[:space:]]+(.*,[[:space:]]+[0-9]{4})$ ]]; then
            # If we have accumulated content, create an item
            if [ ! -z "$CURRENT_DATE" ] && [ ! -z "$CONTENT" ]; then
                RFC_DATE=$(convert_date "$CURRENT_DATE")
                PROCESSED_CONTENT=$(process_markdown "$CONTENT")
                # Create anchor-friendly date string with error checking
                ANCHOR_DATE=$(create_anchor "$CURRENT_DATE")
                if [ ! -z "$ANCHOR_DATE" ]; then
                    cat >> "$OUTPUT_FILE" << EOF
<item>
<title>WarpBuild Updates - $CURRENT_DATE</title>
<link>$FEED_LINK/$SLUG#$ANCHOR_DATE</link>
<guid>$FEED_LINK/$SLUG#$ANCHOR_DATE</guid>
<pubDate>$RFC_DATE</pubDate>
<description>$PROCESSED_CONTENT</description>
</item>
EOF
                fi
            fi
            CURRENT_DATE="${BASH_REMATCH[1]}"
            CONTENT=""
        elif [[ -n $line && ! $line =~ ^--- && ! $line =~ ^$ ]]; then
            CONTENT+="$line\n"
        fi
    done < "$file"

    # Process the last entry in the file
    if [ ! -z "$CURRENT_DATE" ] && [ ! -z "$CONTENT" ]; then
        RFC_DATE=$(convert_date "$CURRENT_DATE")
        PROCESSED_CONTENT=$(process_markdown "$CONTENT")
        ANCHOR_DATE=$(create_anchor "$CURRENT_DATE")
        if [ ! -z "$ANCHOR_DATE" ]; then
            cat >> "$OUTPUT_FILE" << EOF
<item>
<title>WarpBuild Updates - $CURRENT_DATE</title>
<link>$FEED_LINK/$SLUG#$ANCHOR_DATE</link>
<guid>$FEED_LINK/$SLUG#$ANCHOR_DATE</guid>
<pubDate>$RFC_DATE</pubDate>
<description>$PROCESSED_CONTENT</description>
</item>
EOF
        fi
    fi
done

# Close RSS feed
cat >> "$OUTPUT_FILE" << EOF
</channel>
</rss>
EOF

echo "RSS feed generated at $OUTPUT_FILE"
```

## Example Markdown File

Here's a snippet of the markdown file for the changelog:

```mdx
---
title: "October 2024"
slug: "2024-October"
description: "List of updates in 2024-October"
sidebar_position: -9
createdAt: "2024-10-04"
updatedAt: "2024-10-29"
---

### October 29, 2024

- `Feature`: Custom VM images are now supported for GCP BYOC runners.

### October 21, 2024

- `Feature`: Ubuntu 24.04 arm64 runners are now supported natively as cloud runners as well as with AWS and GCP custom runners. These runners are compatible with GitHub's Ubuntu 24.04 arm64. Refer to [cloud runner labels](/cloud-runners#linux-arm64) for the full list of available labels. Refer to [this link](https://github.com/actions/partner-runner-images/blob/main/images/arm-ubuntu-24-image.md) for the details on the packaged tools.

### October 17, 2024

- `Enhancement`: The image for `macos-14` (https://github.com/actions/runner-images/releases/tag/macos-14-arm64%2F20241007.259) has been updated. This fixes the issue with the iOS 18 SDK and simulator not being available.

### October 15, 2024

- `Feature`: Docker Layer Caching is now available for GCP BYOC runners.
- `Enhancement`: The images for `ubuntu-2204` for [x86-64](https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20241006.1) and for the `arm64` architecture have been updated.
- `Enhancement`: The [ubuntu-2404 for x86-64](https://github.com/actions/runner-images/releases/tag/ubuntu24%2F20241006.1) image has been updated.

### October 14, 2024

- `Enhancement`: BYOC features do not require a payment method to be added, by default. Credits can be used for BYOC runners.

### October 11, 2024

- `Pricing`: Cost for cache operations has been **reduced** from $0.001 to $0.0001 per operation.

### October 09, 2024

- `Feature`: GCP BYOC is now generally available. Read more here: [BYOC on GCP](/byoc/gcp).

### October 08, 2024

- `Enhancement`: The runner start times are now much faster, with a 90%ile of the start times being under 20 seconds. This is a significant improvement over the previous 90%ile of 45 seconds.

---
```

## Next steps

It would be fantastic to have this as a Docusaurus plugin so it can be reused for other markdown pages. If you are interested in this, please let me know!

The full script is available as a [GitHub Gist](https://gist.github.com/suryaoruganti/8f520335f3c9c9a705687ac6d3c47b9f). An example markdown file is available [here](https://gist.github.com/suryaoruganti/792b444fc1f2e1d12831712daf68de69).

---

Use WarpBuild for blazing fast GitHub Actions runners with superior job start times, caching backed by object storage, unlimited concurrency, and easy-to-use dashboards. Save 50-90% on your GitHub Actions costs while getting 10x the performance. Book a call or get started today!
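One portability note on the script: the `convert_date` helper relies on BSD `date` flags (`-j -f`), which exist on macOS but not on GNU/Linux. If you run the generator in Linux CI, an equivalent helper — the name `convert_date_gnu` is illustrative — can use GNU `date -d` instead:

```shell
#!/bin/bash
# GNU/Linux equivalent of the macOS-only convert_date helper: converts a
# "Month DD, YYYY" heading into an RFC 822 date for the RSS <pubDate> field.
convert_date_gnu() {
  local input_date="$1"
  [ -n "$input_date" ] || return 1
  # GNU date parses the long-form date directly; -R emits RFC 822/2822 output.
  # The sed strips the time-of-day portion, matching the original helper.
  date -R -d "$input_date" | sed 's/ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] .*//'
}
```

For example, `convert_date_gnu "October 29, 2024"` prints `Tue, 29 Oct 2024`. Detecting the OS with `uname` and dispatching to the right helper would make the script run unchanged on both platforms.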
# A Complete Guide to Self-hosting GitHub Actions Runners

URL: /blog/self-hosting-github-actions

Comprehensive guide to self-hosting GitHub Actions runners on AWS

---
title: "A Complete Guide to Self-hosting GitHub Actions Runners"
excerpt: "A Complete Guide to Self-hosting GitHub Actions Runners"
description: "Comprehensive guide to self-hosting GitHub Actions runners on AWS"
date: "2024-04-29"
author: surya_oruganti
cover: "/images/blog/self-hosting-github-actions/cover.webp"
---

# Self-Hosting GitHub Actions Runners on AWS: A Comprehensive Guide

GitHub Actions has rapidly become a favourite tool for CI/CD, thanks to its seamless integration with GitHub repositories and its extensive marketplace of pre-built actions. However, in cases where you need more control over the environment, security, or costs, self-hosting your runners can be a beneficial strategy. AWS provides robust and scalable infrastructure that can be tailored to host self-managed GitHub Actions runners.

In this blog post, we will explore various methods to deploy these runners on AWS, detailing the steps involved and discussing the pros and cons of each approach.

## Method 1: Using EC2 Instances

One straightforward way to host GitHub Actions runners is by using Amazon EC2 instances. This method gives you full control over the compute environment.

### Steps

#### 1. **Set Up an EC2 Instance**

Start by launching an EC2 instance from the AWS Management Console or AWS CLI. An instance with 2 vCPUs and 4GB RAM (e.g., `t3.medium`) is a good starting point. Ensure the security group allows outbound connections to access GitHub and any other needed resources.

Attach an EBS volume for persistent storage if required. 100GB is a good root volume size for most use cases.

In this guide, we will use Ubuntu 22.04 as the base OS for the EC2 instance.

#### 2. **Install GitHub Actions Runner**

- Go to your GitHub organization's settings and then go to `Actions` → `Runners`.
Click on `New runner` and choose `New self-hosted runner`. Choose Linux as the OS and x64 as the architecture.

![GitHub runner settings page](/images/blog/self-hosting-github-actions/image.png)

> You can directly go to the following URL (after replacing `ORG` with your GitHub organisation name) to get to the runner setup page along with the OS and architecture pre-selected:
> `https://github.com/organizations/ORG/settings/actions/runners/new?arch=x64&os=linux`

The configuration should look like this:

![GitHub new runner configuration screenshot](/images/blog/self-hosting-github-actions/image-1.png)

> [!NOTE]
> If you want to create a runner only for a specific repository, you can do so by going to the repository's settings and following the same steps. The direct link looks like this:
> `https://github.com/ORG/REPO/settings/actions/runners/new?arch=x64&os=linux`

- Follow the instructions on that page to download, configure, and start the runner on your EC2 instance.

> [!TIP]
> Instead of starting the runner with `./run.sh`, you can run it as a service to ensure it starts automatically on boot and restarts if the app or the host machine crashes. After successfully configuring with the `config.sh` script, you get a `svc.sh` script that can be used to install the runner as a service:
>
> ```sh
> sudo ./svc.sh install && sudo ./svc.sh start
> ```
>
> Learn more about running it as a service [here](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/configuring-the-self-hosted-runner-application-as-a-service?platform=linux).

### Pros

- **Full Control**: Customize the OS, installed software, and hardware specifications as needed.
- **Cost-Effective**: Particularly with spot instances or reserved instances for long-term use.

### Cons

- **Maintenance Overhead**: Requires regular software updates and monitoring.
- **Scalability Issues**: Manually managing multiple runners can be cumbersome.
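Part of that maintenance overhead is keeping the runner binary itself up to date across machines. Scripting the download makes version bumps a one-line change; a small sketch, where the helper name `runner_download_url` is illustrative but the URL scheme matches the official releases at `github.com/actions/runner`:

```shell
#!/bin/bash
# Build the download URL for a given GitHub Actions runner release.
# Usage: runner_download_url <version> [arch]   (arch defaults to x64)
runner_download_url() {
  local version="$1"
  local arch="${2:-x64}"   # x64 or arm64
  echo "https://github.com/actions/runner/releases/download/v${version}/actions-runner-linux-${arch}-${version}.tar.gz"
}
```

You could then fetch a pinned version on each instance with `curl -fL -o runner.tar.gz "$(runner_download_url 2.316.0)"` and upgrade every machine by changing a single variable.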
### References

- **AWS EC2**: https://aws.amazon.com/ec2/
- **Adding self-hosted GitHub Actions Runner**: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners

## Method 2: Using ECS (Elastic Container Service)

ECS allows you to run containers directly and can be an efficient way to manage GitHub Actions runners, especially if you prefer using Docker containers.

### Steps

#### 1. **Create a Docker Image**

- **Dockerfile**: Create a Dockerfile that installs the GitHub Actions runner. The following Dockerfile installs the runner and its dependencies on an Ubuntu 22.04 base image. On container start, it registers a new runner with GitHub and starts the runner.

```dockerfile
FROM amd64/ubuntu:22.04

RUN apt-get update && apt-get install -y curl sudo jq

ADD https://github.com/actions/runner/releases/download/v2.316.0/actions-runner-linux-x64-2.316.0.tar.gz runner.tar.gz

RUN newuser=runner && \
    adduser --disabled-password --gecos "" $newuser && \
    usermod -aG sudo $newuser && \
    echo "$newuser ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

USER runner
WORKDIR /home/runner

RUN sudo mv /runner.tar.gz ./runner.tar.gz && \
    sudo chown runner:runner ./runner.tar.gz && \
    mkdir runner && \
    tar xzf runner.tar.gz -C runner && \
    rm runner.tar.gz

WORKDIR /home/runner/runner

RUN sudo ./bin/installdependencies.sh

COPY start.sh start.sh

ENTRYPOINT ["./start.sh"]
```

The above Dockerfile assumes that the following `start.sh` script is present in the same directory as the Dockerfile.
```bash
#!/bin/bash
set -euo pipefail

check_env() {
  if [ -z "${GITHUB_PAT:-}" ]; then
    echo "Env variable GITHUB_PAT is required but not set"
    exit 1
  fi

  if [ -z "${GITHUB_ORG:-}" ]; then
    echo "Env variable GITHUB_ORG is required but not set"
    exit 1
  fi
}

register_runner() {
  local github_token=$(curl -sL \
    -X POST \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer $GITHUB_PAT" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    "https://api.github.com/orgs/$GITHUB_ORG/actions/runners/registration-token" | jq -r .token)

  ./config.sh --unattended --url "https://github.com/$GITHUB_ORG" --token "$github_token"
}

check_env
register_runner
./run.sh
```

You can configure the runner name and labels by passing additional arguments to the `config.sh` script. For example, to set the runner name, use `--name RUNNER_NAME`. Use `./config.sh --help` to see all available options. The options are attached below for your reference:

**Configuration Options**

```sh
$ ./config.sh --help

Commands:
 ./config.sh         Configures the runner
 ./config.sh remove  Unconfigures the runner
 ./run.sh            Runs the runner interactively. Does not require any options.

Options:
 --help     Prints the help for each command
 --version  Prints the runner version
 --commit   Prints the runner commit
 --check    Check the runner's network connectivity with GitHub server

Config Options:
 --unattended           Disable interactive prompts for missing arguments. Defaults will be used for missing options
 --url string           Repository to add the runner to. Required if unattended
 --token string         Registration token. Required if unattended
 --name string          Name of the runner to configure (default mac)
 --runnergroup string   Name of the runner group to add this runner to (defaults to the default runner group)
 --labels string        Custom labels that will be added to the runner. This option is mandatory if --no-default-labels is used.
 --no-default-labels    Disables adding the default labels: 'self-hosted,OSX,Arm64'
 --local                Removes the runner config files from your local machine. Used as an option to the remove command
 --work string          Relative runner work directory (default _work)
 --replace              Replace any existing runner with the same name (default false)
 --pat                  GitHub personal access token with repo scope. Used for checking network connectivity when executing `./run.sh --check`
 --disableupdate        Disable self-hosted runner automatic update to the latest released version
 --ephemeral            Configure the runner to only take one job and then let the service un-configure the runner after the job finishes (default false)

Examples:
 Check GitHub server network connectivity:
  ./run.sh --check --url <url> --pat <pat>
 Configure a runner non-interactively:
  ./config.sh --unattended --url <url> --token <token>
 Configure a runner non-interactively, replacing any existing runner with the same name:
  ./config.sh --unattended --url <url> --token <token> --replace [--name <name>]
 Configure a runner non-interactively with three extra labels:
  ./config.sh --unattended --url <url> --token <token> --labels L1,L2,L3
```

- Build the Docker image with an appropriate tag.

```sh
docker build -t github-runner .
```

#### 2. **Push to ECR (Elastic Container Registry)**

- **Create Repository**: Create a new repository in ECR from the AWS Console or AWS CLI.
- **Authenticate Docker**: Authenticate your Docker client to your default registry.

```sh
aws ecr get-login-password --region YOUR_REGION | docker login --username AWS --password-stdin YOUR_ECR_REPOSITORY_URL
```

- **Tag and Push**: Tag your Docker image and push it to ECR.

```sh
docker tag github-runner:latest YOUR_ECR_REPOSITORY_URL:YOUR_TAG
docker push YOUR_ECR_REPOSITORY_URL:YOUR_TAG
```

#### 3. **Deploy on ECS**

- **Create Cluster**: Set up an ECS cluster from the AWS Management Console or AWS CLI which uses `t3.medium` instances.
Your infrastructure should look like this:

![ECS EC2 cluster configuration](/images/blog/self-hosting-github-actions/image-2.png)

- Create a secret using AWS Secrets Manager to store the GitHub PAT:

```sh
aws secretsmanager create-secret --region us-east-2 --name github_runner_ecs_secrets --secret-string '{ "github_pat": "<YOUR_GITHUB_PAT>" }'
```

You can also store the GitHub organization name in the same secret or use it as an environment variable in the ECS task definition.

- Create an ECS Task Execution role. The `executionRoleArn` field is required for tasks to interact with other AWS services. You can create a new role with the necessary permissions or use an existing one. Learn about the role and how to create it here: [Amazon ECS task execution IAM role](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html#create-task-execution-role). You will also need to [create an inline policy](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/secrets-envvar-secrets-manager.html#secrets-envvar-secrets-manager-iam) to allow the container to access the secret.

- **Task Definition**: Create a new task definition in ECS that uses the Docker image pushed to ECR and the secret from the previous steps. Make sure to replace the placeholders with your actual values.

```json
{
  "family": "github-runner",
  "executionRoleArn": "<TASK_EXECUTION_ROLE_ARN>",
  "containerDefinitions": [
    {
      "name": "github-runner",
      "image": "<YOUR_ECR_REPOSITORY_URL>:<YOUR_TAG>",
      "memory": 4096,
      "cpu": 2048,
      "secrets": [
        {
          "name": "GITHUB_PAT",
          "valueFrom": "<SECRET_ARN>:github_pat::"
        }
      ],
      "environment": [
        {
          "name": "GITHUB_ORG",
          "value": "<YOUR_GITHUB_ORG>"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/ecs/github-runners",
          "awslogs-region": "<YOUR_REGION>",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
```

- **Run Task**: Go to the task definition and select the first (or latest) revision. Click on `Deploy` and then `Create Service`.
Choose the cluster you created earlier and select the cluster's default capacity provider strategy. In the deployment configuration section, give the service a name, e.g., `github-runner-service`, and choose an appropriate number of desired tasks (e.g., 3). Click on `Create Service` to deploy the tasks.

### Pros

- **Scalability**: Easily scale out by adjusting the service's desired count.
- **Isolation**: Runners operate in isolated environments, improving security.

### Cons

- **Complexity**: Requires familiarity with Docker and AWS ECS.
- **Costs**: Potentially higher costs depending on the ECS configuration and usage pattern.
- **Runner Management**: Manually managing multiple runners can be cumbersome.

### References

- **AWS ECS**: https://aws.amazon.com/ecs/
- **Docker Basics**: https://www.docker.com/101-tutorial
- **AWS ECR**: https://aws.amazon.com/ecr/
- **Task Definitions in ECS**: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definitions.html
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/secrets-envvar-secrets-manager.html

## Method 3: Using AWS Fargate

AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). It abstracts the server and cluster management and provides a straightforward way to run containers.

### Steps

#### 1. **Create and Push Docker Image**

Follow the same initial steps as for ECS to create and push a Docker image.

#### 2. **Configure Fargate Task**

- **Fargate Task Definition**: Similar to ECS but select Fargate as the launch type.
```json
{
  "requiresCompatibilities": ["FARGATE"],
  "executionRoleArn": "<TASK_EXECUTION_ROLE_ARN>",
  "networkMode": "awsvpc",
  "cpu": "2048",
  "family": "github-runners",
  "memory": "4096",
  "containerDefinitions": [
    {
      "name": "github-runner",
      "image": "<YOUR_ECR_REPOSITORY_URL>:<YOUR_TAG>",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ],
      "secrets": [
        {
          "name": "GITHUB_PAT",
          "valueFrom": "<SECRET_ARN>:github_pat::"
        }
      ],
      "environment": [
        {
          "name": "GITHUB_ORG",
          "value": "<YOUR_GITHUB_ORG>"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/ecs/github-runners",
          "awslogs-region": "<YOUR_REGION>",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
```

#### 3. **Deploy on Fargate**

- **Create Cluster**: Set up an ECS cluster with the Fargate launch type.
- Create a service using the task definition created in the previous step.

### Pros

- **Serverless**: No need to manage servers or clusters.
- **Scalable and Isolated**: Automatically scales and provides high isolation.

### Cons

- **Cost**: Can be expensive for high compute usage.
- **Networking Limitations**: Requires a good understanding of AWS VPCs, subnets, and security groups.

### References

- **AWS Fargate**: https://aws.amazon.com/fargate
- **AWS ECS on Fargate**: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html

# Advanced Methods for Self-Hosting GitHub Actions Runners on AWS

Following up on our previous exploration of basic methods like using EC2, ECS, and AWS Fargate for hosting GitHub Actions runners, we now get into more sophisticated strategies. These involve Kubernetes solutions and Terraform modules, which can significantly streamline and enhance the management of GitHub runners at scale.

## Method 4: Using `actions-runner-controller` on EKS

`actions-runner-controller` is a Kubernetes operator designed to automate the deployment, scaling, and management of GitHub Actions self-hosted runners within a Kubernetes cluster.
It supports features like automatic scaling based on the number of queued jobs, which makes it highly efficient for dynamic CI/CD environments.

### Steps

#### 1. **Set Up a Kubernetes Cluster**

- Deploy a Kubernetes cluster using Amazon EKS.
- Create a node group with the desired instance type and capacity. As stated before, `t3.medium` instances are good enough for most use cases.

```sh
eksctl create cluster \
  --name YOUR_CLUSTER_NAME \
  --region YOUR_REGION \
  --node-type t3.medium \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 4
```

You can adjust the `--nodes`, `--nodes-min`, and `--nodes-max` values based on your workload and scaling requirements.

- Configure `kubectl` to communicate with your cluster:

```sh
aws eks --region YOUR_REGION update-kubeconfig --name YOUR_CLUSTER_NAME
```

#### 2. **Install `actions-runner-controller`**

- Install and set up the controller using Helm:

```sh
NAMESPACE="arc-systems"
helm install arc \
  --namespace "${NAMESPACE}" \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```

#### 3. **Setup a runner scale set**

- Create a separate Kubernetes namespace for the runner pods:

```sh
kubectl create namespace arc-runners
```

- Create a GitHub App that will be used to authenticate the runners. Install the app in your organization.
- From the app's dashboard, generate a private key file (`*.pem`) and get the App ID. Get the installation ID from the app installation page's URL, which is of the form: `https://github.com/organizations/ORGANIZATION/settings/installations/INSTALLATION_ID`

For detailed instructions about the above two steps, follow the official documentation: [Authenticating ARC with a GitHub App](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/authenticating-to-the-github-api#authenticating-arc-with-a-github-app).
- Store the app ID, installation ID, and the private key in a Kubernetes secret:

```sh
kubectl create secret generic github-secrets \
  --namespace=arc-runners \
  --from-literal=github_app_id=123456 \
  --from-literal=github_app_installation_id=654321 \
  --from-file=github_app_private_key=YOUR_APP_NAME.DATE.private-key.pem
```

- Configure a scale set for your organization or repo:

```sh
INSTALLATION_NAME="arc-runner-set"
NAMESPACE="arc-runners"
GITHUB_ORG="YOUR_ORG"
GITHUB_REPO="" # If you want to use an org-level runner, leave this empty

helm upgrade --install "${INSTALLATION_NAME}" \
  --namespace "${NAMESPACE}" \
  --set githubConfigUrl="https://github.com/$GITHUB_ORG/$GITHUB_REPO" \
  --set githubConfigSecret=github-secrets \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
```

### Pros

- **Auto-Scaling**: The controller automatically adjusts the number of runners based on the workload.
- **Efficiency**: Reduces costs by scaling down to zero when no jobs are queued.
- **Security**: Uses GitHub App authentication for secure communication with the fewest privileges.

### Cons

- **Setup Complexity**: Requires a moderate understanding of Kubernetes and Helm.
- **Overhead**: More Kubernetes resources to manage.
- **GitHub App Configuration**: Setting up the GitHub App can be a bit involved.

### References

- **actions-runner-controller GitHub**: https://github.com/actions/actions-runner-controller
- **Helm Installation**: https://helm.sh/docs/intro/install
- **GitHub Apps**: https://docs.github.com/en/apps

## Method 5: Philips Terraform Module

The Philips software team has developed a Terraform module specifically for deploying self-hosted GitHub Actions runners on AWS.

### Steps

#### 1. **Set Up Terraform**

- Ensure Terraform is installed and configured to manage your AWS resources.

#### 2. **Use the Philips Module**

- **Write Configuration**: Define your Terraform configuration using the Philips module.
```hcl
module "github-runner" {
  source  = "philips-labs/github-runner/aws"
  version = "REPLACE_WITH_VERSION"

  aws_region = "eu-west-1"
  vpc_id     = "vpc-123"
  subnet_ids = ["subnet-123", "subnet-456"]

  prefix = "gh-ci"

  github_app = {
    key_base64     = "base64string"
    id             = "1"
    webhook_secret = "webhook_secret"
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners       = true
}
```

- **Initialize and Apply**: Initialize Terraform and apply the configuration to set up the runners.

```sh
terraform init
terraform apply
```

### Pros

- **Infrastructure as Code**: Easy versioning, auditing, and replication of infrastructure.
- **Scalable and Flexible**: Easily adjust settings and scale resources through code.

### Cons

- **Initial Learning Curve**: Requires understanding of Terraform and AWS.
- **Terraform Management**: Need to manage Terraform state and possibly costs associated with state storage.

### References

- **Philips Labs GitHub Runner Module**: https://github.com/philips-labs/terraform-aws-github-runner
- **Terraform AWS Provider**: https://registry.terraform.io/providers/hashicorp/aws/latest/docs

## Method 6: Self-Hosting on Kubernetes

Deploying directly on a Kubernetes cluster gives you full control over the environment and may reduce costs compared to using Fargate.

### Steps

#### 1. **Prepare the Kubernetes Cluster**

- Set up a Kubernetes cluster on AWS, either through EKS or manually with EC2 instances.

#### 2. **Deploy Runner Manually**

- **Create Docker Image**: Build and push the Docker image just as you did [when setting up ECS](#method-2-using-ecs-elastic-container-service).
- **Add Secrets**: Store the GitHub PAT in a Kubernetes secret:

```sh
kubectl create secret generic github-secrets --from-literal=github_pat=YOUR_GITHUB_PAT
```

- **Deploy Pods**: Write Kubernetes deployment manifests to specify the pods that will run the GitHub runners.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-runner
spec:
  replicas: 2
  selector:
    matchLabels:
      app: github-runner
  template:
    metadata:
      labels:
        app: github-runner
    spec:
      containers:
        - name: runner
          image: <YOUR_ECR_REPOSITORY_URL>:<YOUR_TAG>
          env:
            - name: GITHUB_PAT
              valueFrom:
                secretKeyRef:
                  name: github-secrets
                  key: github_pat
            - name: GITHUB_ORG
              value: <YOUR_GITHUB_ORG>
```

### Pros

- **Complete Control**: Full control over the Kubernetes cluster and how it scales.
- **Cost-Effective**: Potentially lower costs by managing the underlying resources yourself.

### Cons

- **Complex Configuration**: Requires detailed knowledge of Kubernetes.
- **Maintenance**: You are responsible for all updates, scaling, and health monitoring.

# Conclusion

Self-hosting GitHub Actions runners on AWS provides flexibility, control, and potential cost savings, especially for complex workflows that require specific configurations. By choosing the appropriate AWS service, be it EC2, ECS, or Fargate, you can optimize your CI/CD pipeline according to your project's needs. Each method has its trade-offs in terms of complexity, cost, and scalability. Therefore, it's crucial to evaluate your requirements and expertise in AWS services when deciding the best approach for self-hosting GitHub Actions runners.

WarpBuild provides runners with high performance processors, which are optimized for CI and build workloads with fast disk IO and improved caching. [Get started](https://app.warpbuild.com) today.

# Self-host GitHub Actions runners with Actions Runner Controller (ARC) on AWS

URL: /blog/setup-actions-runner-controller

Self-host GitHub Actions runners with Actions Runner Controller (ARC) on AWS. Includes terraform code, and production ready configurations for `arc` and `karpenter`.

---
title: "Self-host GitHub Actions runners with Actions Runner Controller (ARC) on AWS"
excerpt: "Self-host GitHub Actions runners with Actions Runner Controller (ARC) on AWS.
Includes terraform code, and production ready configurations for `arc` and `karpenter`."
description: "Self-host GitHub Actions runners with Actions Runner Controller (ARC) on AWS. Includes terraform code, and production ready configurations for `arc` and `karpenter`."
date: "2024-11-06"
author: surya_oruganti
cover: "/images/blog/setup-actions-runner-controller/cover.png"
---

This post details setting up GitHub Actions runners using ARC (Actions Runner Controller) on AWS using EKS. It includes Terraform code for provisioning the infrastructure and a custom runner image for the runners. It also includes optimizations for cost and performance using Karpenter for autoscaling, along with other best practices.

## Setup

We set up Karpenter v1.0.2 and EKS using Terraform to provision the infrastructure. The complete setup code is available here: [https://github.com/WarpBuilds/github-arc-setup](https://github.com/WarpBuilds/github-arc-setup)

### EKS Cluster Setup

The EKS cluster was provisioned using Terraform and runs on Kubernetes v1.30. A key aspect of our setup was using a dedicated node group for essential add-ons, keeping them isolated from other workloads. The `default-ng` node group uses `t3.xlarge` instance types, with taints to ensure that only critical workloads — networking, DNS management, node management, the ARC controllers, and so on — can be scheduled on these nodes.
```hcl
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = local.cluster_name
  cluster_version = "1.30"

  cluster_endpoint_public_access = true

  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    vpc-cni                = {}
  }

  subnet_ids = var.private_subnet_ids
  vpc_id     = var.vpc_id

  eks_managed_node_groups = {
    default-ng = {
      desired_capacity = 2
      max_capacity     = 5
      min_capacity     = 1

      instance_types = ["t3.xlarge"]
      subnet_ids     = var.private_subnet_ids

      taints = {
        addons = {
          key    = "CriticalAddonsOnly"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }
  }

  node_security_group_tags = merge(local.tags, {
    "karpenter.sh/discovery" = local.cluster_name
  })

  enable_cluster_creator_admin_permissions = true

  tags = local.tags
}
```

#### Private Subnets and NAT Gateway

The EKS nodes are in private subnets, allowing them to communicate with external resources through a NAT Gateway. This configuration ensures node connectivity without exposing them directly to external traffic.

### Karpenter for Autoscaling

Karpenter provides fast and flexible autoscaling of the nodes to optimize cost and resource efficiency. We explore a few variations of configuration to reduce over-provisioning and unnecessary costs.

- [**Karpenter v1.0.2**](https://karpenter.sh/): We chose the latest version of Karpenter at the time of writing.
- **Amazon Linux 2023 (AL2023)**: The default NodeClass provisions nodes with AL2023, and each node is configured with 300GiB of EBS storage. This additional storage is crucial for workloads that require high disk usage, such as CI/CD runners, preventing the out-of-disk errors commonly encountered with default node storage (17GiB). This needs to be increased based on the number of jobs expected to run on a node in parallel.
- **Private Subnet Selection**: The NodeClass is configured to use the private subnets created earlier. This ensures that nodes are spun up in a secure, isolated environment, consistent with the EKS cluster's network setup.
- [**m7a Node Families**](https://aws.amazon.com/ec2/instance-types/m7a/): Using the NodePool resource, we restricted node provisioning to the m7a instance family. These instances were chosen for their performance-to-cost efficiency and are only provisioned in the us-east-1a and us-east-1b Availability Zones. - **On-demand Instances**: While Karpenter supports Spot Instances for cost savings, we opted for on-demand instances for an equivalent cost comparison. - **Consolidation Policy**: We configured a 5-minute consolidation delay, preventing premature node terminations that could disrupt workflows. Karpenter will only consolidate nodes once they are underutilized for at least 5 minutes, ensuring stable operations during peak workloads. ```hcl module "karpenter" { source = "terraform-aws-modules/eks/aws//modules/karpenter" cluster_name = module.eks.cluster_name enable_pod_identity = true create_pod_identity_association = true create_instance_profile = true node_iam_role_additional_policies = { AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" } tags = local.tags } resource "helm_release" "karpenter-crd" { namespace = "karpenter" create_namespace = true name = "karpenter-crd" repository = "oci://public.ecr.aws/karpenter" chart = "karpenter-crd" version = "1.0.2" wait = true values = [] } resource "helm_release" "karpenter" { depends_on = [helm_release.karpenter-crd] namespace = "karpenter" create_namespace = true name = "karpenter" repository = "oci://public.ecr.aws/karpenter" chart = "karpenter" version = "1.0.2" wait = true skip_crds = true values = [ <<-EOT serviceAccount: name: ${module.karpenter.service_account} settings: clusterName: ${module.eks.cluster_name} clusterEndpoint: ${module.eks.cluster_endpoint} EOT ] } resource "kubectl_manifest" "karpenter_node_class" { yaml_body = <<-YAML apiVersion: karpenter.k8s.aws/v1beta1 kind: EC2NodeClass metadata: name: default spec: amiFamily: AL2023 detailedMonitoring: true 
blockDeviceMappings: - deviceName: /dev/xvda ebs: volumeSize: 300Gi volumeType: gp3 deleteOnTermination: true iops: 5000 throughput: 500 instanceProfile: ${module.karpenter.instance_profile_name} subnetSelectorTerms: - tags: karpenter.sh/discovery: ${module.eks.cluster_name} securityGroupSelectorTerms: - tags: karpenter.sh/discovery: ${module.eks.cluster_name} tags: karpenter.sh/discovery: ${module.eks.cluster_name} Project: arc-test-praj YAML depends_on = [ helm_release.karpenter, helm_release.karpenter-crd ] } resource "kubectl_manifest" "karpenter_node_pool" { yaml_body = <<-YAML apiVersion: karpenter.sh/v1beta1 kind: NodePool metadata: name: default spec: template: spec: tags: Project: arc-test-praj nodeClassRef: name: default requirements: - key: "karpenter.k8s.aws/instance-category" operator: In values: ["m"] - key: "karpenter.k8s.aws/instance-family" operator: In values: ["m7a"] - key: "karpenter.k8s.aws/instance-cpu" operator: In values: ["4", "8", "16", "32", "64"] - key: "karpenter.k8s.aws/instance-generation" operator: Gt values: ["2"] - key: "topology.kubernetes.io/zone" operator: In values: ["us-east-1a", "us-east-1b"] - key: "kubernetes.io/arch" operator: In values: ["amd64"] - key: "karpenter.sh/capacity-type" operator: In values: ["on-demand"] limits: cpu: 1000 disruption: consolidationPolicy: WhenEmpty consolidateAfter: 5m YAML depends_on = [ kubectl_manifest.karpenter_node_class ] } ``` **Variant #2:** We also ran another setup with a single job per node to compare the performance and cost implications of running multiple jobs on a single node. ```diff - key: "karpenter.k8s.aws/instance-cpu" - operator: In - values: ["4", "8", "16", "32", "64"] + key: "karpenter.k8s.aws/instance-cpu" + operator: In + values: ["8"] ``` ### Actions Runner Controller and Runner Scale Set Once Karpenter was configured, we proceeded to set up the GitHub Actions Runner Controller (ARC) and the Runner Scale Set using Helm. 
The ARC setup was deployed with Helm using the following command and values:

```bash
helm upgrade arc \
  --namespace "${NAMESPACE}" \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --values runner-set-values.yaml --install
```

```yaml
tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

This configuration applies tolerations to the controller, enabling it to run on nodes with the `CriticalAddonsOnly` taint, i.e., the `default-ng` node group, ensuring it doesn't interfere with other runner workloads.

Next, we set up the Runner Scale Set using another Helm command:

```bash
helm upgrade warp-praj-arc-test oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set --namespace ${NAMESPACE} --values values.yaml --install
```

The key points of our Runner Scale Set configuration:

- **GitHub App Integration**: We connected our runners to GitHub via a GitHub App, enabling the runners to operate at the organization level.
- **Listener Tolerations**: Like the controller, the listener template also includes tolerations to allow it to run on the `default-ng` node group.
- **Custom Image for Runners**: We used a custom Docker image for the runner pods (detailed in the next section).
- **Resource Requirements**: To simulate high-performance runners, the runner pods were configured to require 8 CPU cores and 32 GiB of RAM, which aligns with the performance of an 8x runner used in the workflows.
```yaml
githubConfigUrl: "https://github.com/Warpbuilds"
githubConfigSecret:
  github_app_id: ""
  github_app_installation_id: ""
  github_app_private_key: |
    -----BEGIN RSA PRIVATE KEY-----
    [your-private-key-contents]
    -----END RSA PRIVATE KEY-----
  github_token: ""
listenerTemplate:
  spec:
    containers:
      - name: listener
        securityContext:
          runAsUser: 1000
    tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
template:
  spec:
    containers:
      - name: runner
        image:
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            cpu: "8"
            memory: "32Gi"
controllerServiceAccount:
  namespace: arc-systems
  name: arc-gha-rs-controller
```

### Custom Image for Runner Pods

By default, Runner Scale Sets use GitHub's official `actions-runner` image. However, this image doesn't include essential utilities such as wget, curl, and git, which are required by various workflows. To address this, we created a custom Docker image based on GitHub's runner image, adding the necessary tools. This image was hosted in a public ECR repository and was used by the runner pods during our tests. The custom image allowed us to run workflows without missing dependencies and ensured smooth execution.

```dockerfile
FROM ghcr.io/actions/actions-runner:2.319.1

RUN sudo apt-get update && sudo apt-get install -y wget curl unzip git
RUN sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
```

This approach ensured that our runners were always equipped with the required utilities, preventing errors and reducing friction during workflow runs.

### Tagging Infrastructure for Cost Tracking

To track costs effectively during the ARC setup, all infra resources created in this process were tagged, and hourly cost data was collected. AWS Cost Explorer allows us to monitor and attribute costs to specific resources based on these tags.
This was essential for calculating the true cost of running ARC, including all costs such as EC2, EBS, VPC, S3, NAT Gateway, and data ingress/egress.

## Running workflows

We use the `PostHog` OSS repo as an example to demonstrate the cost comparison on a real-world use case over 960 jobs. The duty cycle is a representative 2-hour period with a continuous load of commits, each triggering a job every few minutes.

### PostHog's Frontend CI Workflow

To simulate a real-world use case, we leveraged PostHog's Frontend CI workflow. This workflow runs a series of frontend checks, followed by two sets of jobs: one for code quality checks and another for executing a matrix of Jest tests. You can view the workflow file here: [PostHog Frontend CI Workflow](https://github.com/WarpBuilds/posthog/blob/master/.github/workflows/ci-frontend.yml)

### Auto-Commit Simulation Script

To ensure continuous triggering of the Frontend CI workflow, we developed an automated commit script in JavaScript. The script generates a commit every minute on the forked PostHog repository, which in turn triggers the CI workflow. It runs for two hours, ensuring a consistent workload over an extended period for accurate cost measurement. The results were then analyzed to compare the costs of using ARC versus WarpBuild's BYOC runners.
Commit simulation script: ```javascript const { exec } = require("child_process"); const fs = require("fs"); const path = require("path"); const repoPath = "arc-setup/posthog"; const frontendDir = path.join(repoPath, "frontend"); const intervalTime = 1 * 60 * 1000; // Every Minute const maxRunTime = 2 * 60 * 60 * 1000; // 2 hours const setupGitConfig = () => { exec('git config user.name "Auto Commit Script"', { cwd: repoPath }); exec('git config user.email "auto-commit@example.com"', { cwd: repoPath }); }; const makeCommit = () => { const logFilePath = path.join(frontendDir, "commit_log.txt"); // Create the frontend directory if it doesn't exist if (!fs.existsSync(frontendDir)) { fs.mkdirSync(frontendDir); } // Write to commit_log.txt in the frontend directory fs.appendFileSync( logFilePath, `Auto commit in frontend at ${new Date().toISOString()}\n`, ); // Add, commit, and push changes exec(`git add ${logFilePath}`, { cwd: repoPath }, (err) => { if (err) return console.error("Error adding file:", err); exec( `git commit -m "Auto commit at ${new Date().toISOString()}"`, { cwd: repoPath }, (err) => { if (err) return console.error("Error committing changes:", err); exec("git push origin master", { cwd: repoPath }, (err) => { if (err) return console.error("Error pushing changes:", err); console.log("Changes pushed successfully"); }); }, ); }); }; setupGitConfig(); const interval = setInterval(makeCommit, intervalTime); // Stop the script after 2 hours setTimeout(() => { clearInterval(interval); console.log("Script completed after 2 hours"); }, maxRunTime); ``` ## Results ### Performance and Scalability The following metrics showcase the average time taken by ARC Runners for jobs in the Frontend-CI workflow. All the jobs are run on the same underlying CPU family (m7a) and request the same amount of resources (vcpu and memory). 
| **Test**                | **ARC (Varied Node Sizes)** | **ARC (1 Job Per Node)** |
| ----------------------- | --------------------------- | ------------------------ |
| **Code Quality Checks** | ~9 minutes 30 seconds       | ~7 minutes               |
| **Jest Test (FOSS)**    | ~2 minutes 10 seconds       | ~1 minute 30 seconds     |
| **Jest Test (EE)**      | ~1 minute 35 seconds        | ~1 minute 25 seconds     |

ARC runners with varied node sizes exhibited slower performance primarily because multiple runners shared disk and network resources on the same node, causing bottlenecks despite the larger node sizes. To address these bottlenecks, we tested a **1 Job Per Node** configuration with ARC, where each job ran on its own node. This approach significantly improved performance. However, it introduced higher job start delays due to the time required to provision new nodes.

> Note: Job start delays are directly influenced by the time needed to provision a new node and pull the container image. Larger image sizes increase pull times, leading to longer delays. If the image size is reduced, additional tools would need to be installed during the action run, increasing the overall workflow run time.
>
> Node spin-up and image pull take ~45s to 1.5m for `arc` runners. This is a significant overhead for workflows that run multiple jobs.

### Cost Comparison

| **Category**       | **ARC (Varied Node Sizes)** | **ARC (1 Job Per Node)** |
| ------------------ | --------------------------- | ------------------------ |
| **Total Jobs Ran** | 960                         | 960                      |
| Node Type          | m7a (varied vCPUs)          | m7a.2xlarge              |
| Max K8s Nodes      | 8                           | 27                       |
| Storage            | 300GiB per node             | 150GiB per node          |
| IOPS               | 5000 per node               | 5000 per node            |
| Throughput         | 500Mbps per node            | 500Mbps per node         |
| Compute            | $27.20                      | $22.98                   |
| EC2-Other          | $18.45                      | $19.39                   |
| VPC                | $0.23                       | $0.23                    |
| S3                 | $0.001                      | $0.001                   |
| **Total Cost**     | **$45.88**                  | **$42.60**               |

The cost comparison shows that ARC with 1 job per node is more cost-effective than ARC with varied node sizes.
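As a sanity check, the totals above can be reduced to an effective per-job cost. A quick sketch using the numbers from the cost comparison table (`awk` handles the floating-point math):

```shell
# Effective per-job cost for the 960-job duty cycle, using the
# "Total Cost" figures from the cost comparison table.
jobs=960
varied_total=45.88   # ARC, varied node sizes
single_total=42.60   # ARC, 1 job per node

per_job_varied=$(awk -v t="$varied_total" -v j="$jobs" 'BEGIN { printf "%.4f", t / j }')
per_job_single=$(awk -v t="$single_total" -v j="$jobs" 'BEGIN { printf "%.4f", t / j }')

echo "varied node sizes: \$${per_job_varied} per job"
echo "1 job per node:    \$${per_job_single} per job"
```

This works out to roughly $0.048 per job for varied node sizes versus roughly $0.044 per job for 1 job per node, a small absolute difference that compounds quickly at higher job volumes.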
This is also the more performant setup.

## Conclusion

ARC provides a flexible and scalable solution for running GitHub Actions workflows, but it must be configured correctly to avoid performance bottlenecks and to optimize costs. It also comes with operational overhead (Kubernetes cluster management, Terraform, etc.) and requires continuous maintenance at scale, including keeping the runner binaries updated. Despite these challenges, ARC is a powerful tool for running GitHub Actions workflows at scale, being roughly 10x cheaper than the default GitHub Actions runners.

---

WarpBuild provides the same flexibility as actions-runner-controller but with none of the operational complexity. WarpBuild runners are also more cost-effective than ARC runners, with a ~41% cost saving.

> Get started with WarpBuild in ~3 minutes for faster job start times, caching backed by object storage, and easy-to-use dashboards. [Book a call](https://cal.com/suryao/start) or [get started](https://app.warpbuilds.com) today!

# Supercharge your CI with Snapshot Runners

URL: /blog/snapshot-runners

WarpBuild introduces new Snapshot Runners

--- title: "Supercharge your CI with Snapshot Runners" excerpt: "Snapshot Runners for GitHub Actions" description: "WarpBuild introduces new Snapshot Runners" date: "2024-09-12" author: prajjwal_dimri cover: "/images/blog/snapshot-runners/cover.webp" ---

## Why Local Builds and Tests Are Faster than CI

Developers often experience faster builds locally than in CI, largely because their local machines have cached dependencies, libraries, and other artifacts. Every CI runner starts fresh, losing this advantage. This disparity in build times can cause frustration and delay continuous integration and deployment workflows.

## Caching in CI Runners

CI builds can be sped up by caching dependencies, Docker layers, and other artifacts.
However, traditional CI caching mechanisms can't fully replicate the performance of local builds due to the limitations of what can be cached. ## Enter Snapshot Runners Snapshot Runners offer a more powerful caching approach. Instead of relying on external caches, Snapshot Runners capture a complete snapshot of the VM runner just before it exits. This snapshot includes all dependencies, build caches, and system-level optimizations, allowing subsequent runners to start with a fully primed environment. This drastically reduces initialization time and improves overall build performance. ## Seamlessly Integrate Snapshot Runners into Your CI Pipelines Integrating Snapshot Runners into your existing workflows is easy. Simply change the warp runner in `runs-on` to have a `snapshot.key` parameter. This key is used to identify the snapshot and load it into the runner. ```diff - runs-on: "warp-ubuntu-latest-x64-4x" + runs-on: "warp-ubuntu-latest-x64-4x;snapshot.key=pocketbase-snp-warp" ``` Also, add the [`WarpBuilds/snapshot-save@v1`](https://github.com/WarpBuilds/snapshot-save) action at the end of your workflow or at the point where you want to create the snapshot. > It is recommended to clean up all credentials and sensitive information before creating a snapshot. Here's an example workflow which we modified to use snapshot runners. This is the basebuild action for Pocketbase, a popular open-source project. ```yaml name: basebuild on: pull_request: push: jobs: goreleaser: runs-on: "warp-ubuntu-latest-x64-4x;snapshot.key=pocketbase-snp-warp" steps: - name: Checkout uses: actions/checkout@v4 with: fetch-depth: 0 - name: Log GitHub context uses: actions/github-script@v7 with: script: | console.log('GitHub context:', context); core.debug('Full GitHub context object:'); core.debug(JSON.stringify(context, null, 2)); - name: Set up Node.js uses: WarpBuilds/setup-node@v4 with: node-version: 20.11.0 - name: Ensure GCC is installed run: | if ! 
command -v gcc &> /dev/null then echo "GCC could not be found, installing..." sudo apt-get update sudo apt-get install -y gcc else echo "GCC is already installed" fi - name: Set up Go uses: WarpBuilds/setup-go@v5 with: go-version: ">=1.22.5" cache: false - name: Build Admin dashboard UI run: npm --prefix=./ui ci && npm --prefix=./ui run build - name: Run tests run: go test ./... - name: Run GoReleaser uses: goreleaser/goreleaser-action@v3 with: distribution: goreleaser version: latest args: release --clean env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - name: Create snapshot if: github.event_name == 'push' uses: WarpBuilds/snapshot-save@v1 with: wait-timeout-minutes: 30 fail-on-error: "false" alias: "pocketbase-snp-warp" ```

## Benchmarks

We modified some popular CI workflows to use Snapshot Runners. Here are the results:

| Workflow                  | GitHub Time    | Snapshot Runner Time                           |
| ------------------------- | -------------- | ---------------------------------------------- |
| Pocketbase Build          | 7 to 8 minutes | 3m19s (includes 1m1s snapshot creation)        |
| Supabase Playwright Tests | 5m25s          | 4m56s (includes 1 minute of snapshot creation) |
| Flux Build and Test       | 20m23s         | 14m28s                                         |

## Common Pitfalls and Best Practices

### Network attached disks

Currently, Snapshot Runners use a network-attached disk, which might perform worse than WarpBuild's [cache action](https://github.com/WarpBuilds/cache) for a large number of files. For large directories like node_modules, it may be better to use the cache action instead of relying on files in the snapshot. Combine both approaches based on your use case.
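For example, a hybrid workflow might let the snapshot carry toolchains and system packages while the file-heavy `node_modules` directory goes through the cache action. A minimal sketch (the `my-app-snapshot` key and the build steps are illustrative, not from the Pocketbase example above):

```yaml
jobs:
  build:
    # Snapshot carries toolchains and system-level setup.
    runs-on: "warp-ubuntu-latest-x64-4x;snapshot.key=my-app-snapshot"
    steps:
      - uses: actions/checkout@v4
      # Large, file-heavy directories go through the cache action instead.
      - uses: WarpBuilds/cache@v1
        with:
          path: node_modules
          key: node-modules-${{ hashFiles('package-lock.json') }}
      - run: npm ci
      - run: npm test
      # Refresh the snapshot on pushes, reusing the same key as runs-on.
      - name: Create snapshot
        if: github.event_name == 'push'
        uses: WarpBuilds/snapshot-save@v1
        with:
          alias: "my-app-snapshot"
```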
### Refreshing Snapshots

While Snapshot Runners can store everything from previous builds, it is a good idea to refresh the base snapshot periodically to prevent outdated or unnecessary artifacts from accumulating. The snapshot can be refreshed by using the `snapshot-save` action with the same key.

### Runner Compatibility

Snapshot Runners are a WarpBuild-exclusive feature and work only with WarpBuild runners.

Get started with [WarpBuild](https://warpbuild.com).

# WarpBuild's SOC 2 Certification

URL: /blog/soc2

Another step in our commitment to security and trust. This post details our learnings from the SOC 2 certification process.

--- title: "WarpBuild's SOC 2 Certification" excerpt: "WarpBuild's SOC 2 Certification" description: "Another step in our commitment to security and trust. This post details our learnings from the SOC 2 certification process." date: "2024-09-26" author: surya_oruganti cover: "/images/blog/soc2/cover.png" ---

SOC 2 Type II certification is a significant milestone for WarpBuild, demonstrating our commitment to security and trust. It is especially important as we continue to onboard new customers of all sizes, many of whom require SOC 2 certification.

## What is SOC 2 Certification?

SOC 2 certification is a widely recognized standard for evaluating the security and controls of a company's information systems. It is a voluntary process that requires a company to undergo an audit by an independent third-party auditor. The audit evaluates the company's controls over the security, availability, processing integrity, confidentiality, and privacy of customer data.

### Who certifies this?

The American Institute of Certified Public Accountants (AICPA) oversees the SOC 2 certification process. The AICPA developed the SOC 2 standard, which auditors use to evaluate the security and controls of a company's information systems.
### Types of SOC 2 certifications

There are two types of SOC 2 certifications:

- SOC 2 Type I: Focuses on the design of controls at a specific point in time. This is a one-time audit that evaluates whether controls are suitably designed as of a given date. It is typically used by organizations that are just starting to implement security controls and want to demonstrate a basic level of security.
- SOC 2 Type II: Evaluates the operating effectiveness of controls over a period of time, typically 3 months to one year. This is a more comprehensive audit, typically used by organizations that have been in operation for a while and want to demonstrate a more mature security posture.

### Security questionnaire as a stop-gap solution

Many companies use a security questionnaire to assess the security of a vendor's information systems. This is a more informal process than SOC 2 certification, but it can unblock the procurement process. Some companies can send a security questionnaire from their compliance tool. I've found that a standard questionnaire like this is a good option for minimizing effort:

1. [Whistic](https://www.whistic.com/) can help centralize security questionnaires and posture reports. IMO, it is overkill for a stop-gap solution.
1. [CAIQ Lite v4](https://cloudsecurityalliance.org/artifacts/ccm-lite-and-caiq-lite-v4/) is a fantastic option in a spreadsheet format.

## Who is this for?

Most B2B companies require SOC 2 certification as part of the procurement process. Financial institutions, healthcare providers, and other organizations subject to regulatory requirements usually have additional compliance requirements, such as HIPAA and PCI DSS, on top of SOC 2 certification.

### Do you really, really need SOC 2 certification?
The founder of a SOC 2 audit company advised me not to pursue SOC 2 certification until we had a few customers who refused to sign up without it. If you are targeting SMB customers exclusively, or your customers are not regulated, evaluate whether you can skip SOC 2 certification. Consider whether a standard security questionnaire will suffice instead; this can be a good short-term solution while you are in the process of getting SOC 2 certification.

### General thoughts

1. Once you decide SOC 2 is mandatory, start immediately.
   - The process gets more complex as the company grows, both in the number of users and the number of services.
   - The dollar cost and the effort required both increase as the company grows.
1. SOC 2 certification is not going to get you new customers.
   - It will only help resolve some blockers during the procurement process.
1. SOC 2 Type I certification is useful if you desperately need to onboard a customer. It's a waste of time and money in most cases.

## Evaluating compliance automation tools

I spoke to a few compliance automation companies. Here are the dimensions that matter:

1. **Product**: The tool should be able to automate evidence collection, with the flexibility to adapt to your internal processes. Most tools will cite 80-95% automation.
   - There is always manual effort involved in the certification process, so good support is a must. IMO, support >> product automation.
   - Integration with AWS, Azure, and GCP is a must, along with GitHub and other common tools. The existence of these common integrations is table stakes; however, the quality of the integrations is not obvious before sign-up.
   - I was not very impressed with the quality of Sprinto's integrations, but they made up for that with a good support team.
1. **Cost**: Most companies are flexible on their pricing, especially if you get into multi-year contracts.
A good hack is to start the process at the end of the quarter so that you can get a discount. - Generally, the cost order is: Drata ~ Oneleet > Secureframe > Vanta > Sprinto 1. **Documentation**: Guides and documentation on how to use the tool are useful in saving time. 1. **Auditor Network**: A company that has a network of auditors that they already work with makes the transfer of evidence from tool to auditor seamless. This generally is not an issue. - Sprinto gets an extra point because they have auditors who can support a wider budget range than the others. 1. **Support and Responsiveness**: There are a LOT of back and forths during the setup, evidence collection, and audit processes. This is a big deal. It could add weeks to the process if the support folks are not responsive. - Sprinto was upfront about having very hands-on support. This was a very good thing in hindsight. - Support was on Slack Connect and ready to hop on a call quickly for unblocking issues. - Evaluating this is only possible after signing up. I spoke to users of other products and generally found that most companies have mediocre support, with the exception of Oneleet. Here's my evaluation matrix: | Criterion // Company | Secureframe | Vanta | Drata | Sprinto | Oneleet | | -------------------------- | ----------- | ---------- | -------- | ---------- | -------- | | Cost | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | | Product and Integrations | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ❓ | | Documentation | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ❓ | | Auditor Network | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | | Support and Responsiveness | ❓ | ❓ | ❓ | ⭐⭐⭐⭐⭐ | ❓ | ❓ = Not evaluated ### Other considerations 1. The auditor reputation supposedly matters, for certain customers. Pick an auditor that has a good reputation. 1. Some companies charge extra for enabling a Trust Center. 1. Oneleet is rather unique: - They are much smaller than the other companies and the founders are very involved. This leads to generally better support. 
- They claim to actually help your team implement a good security program and improve the security posture of the company. This **should** be obvious table stakes for any compliance automation company; that it is a differentiator for Oneleet speaks to the general state of the industry.

## Our choice: Sprinto and Prescient

I chose Sprinto for their hands-on support and because they were the cheapest option. Prescient were our auditors; they have a good reputation from what I could gather from other founders.

## Process

This section is specific to our experience with Sprinto. For context, we are a small team of 5 people. Our infrastructure is multi-cloud, with AWS hosting most of it and some spread across GCP, Azure, and MacStadium. This could change in the future.

### Audit Readiness: Evidence Collection

We are an engineering-focused team and already had a lot of best practices in place, which helped quite a bit. Here are the interesting things about the certification process that stood out to me:

1. It requires at least 2 people to be involved in the process, as part of some checks.
1. A pen test is recommended but not mandatory for SOC 2 Type II certification. It comes at an additional cost with most providers.
1. We did not have a Mobile Device Management (MDM) tool in place before the audit and needed to get that sorted.
1. It is important for production environments and customer data to be kept secure. We had most of this in place, so it was relatively easy.
   - Tag all infra with the environment details.
   - Having fewer environments makes this process much simpler.
   - We do not have a requirement to replicate data across environments, which made our data posture much simpler.
   - Ensure all prod DBs and data are in private VPCs and not publicly accessible.
1. We set up Cloudflare for WAF. The paid plan comes with default protection rules, which are quite valuable. This was a good option given we are multi-cloud.
The DDoS protection is very helpful, and free.
1. Setting up an Intrusion Detection System (IDS) was tricky - there is a non-trivial cost associated with it. We decided to use the cloud-specific options because of ease of setup, and will revisit this later.
1. This was a forcing function for formalizing the incident response process and running backup-restore exercises.
1. The GitHub Team plan is the minimum required. We were on the Enterprise plan with most of the checks already in place.

### Sprinto: Product Review ⭐⭐

The Sprinto product is quite basic, but it works well when coupled with responsive support. A lot of actions are much easier for support to do in the backend than for you to do in the UI. Here are some observations:

1. The UI is dated and laggy. There are lots of modals, drawers, and popups, and contents do not update after an action without a page refresh. This was very frustrating.
1. The integrations with AWS, GCP, Azure, and Cloudflare are basic, and the refresh intervals are slow (~once a day).
1. The GitHub integration was pretty poor. Repo rules set at the org level were not reflected.
   - Bulk updates are not possible.
   - Automated commits by a bot for GitOps were being flagged as commits without review. Our current process is that support bulk-updates these in the backend once a week.
1. The Trust Center design could be better, but it comes at no additional cost.
1. The documentation is passable. The fact that we had a Slack Connect channel with support and could offload a bunch of tasks to them was a huge plus.

### Cost

You can expect to spend ~$8-10k for the first year, split half and half between Sprinto and the auditor. Some product and auditor combinations go up to $15k. You're being ripped off if you spend more than that as a company of fewer than 50 people.

Getting a SOC 2 Type I certification costs ~$2-4k more.
### Timeline

| Date               | Event                                               | Notes                                                                                                                    |
| ------------------ | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| March 21-28        | Calls with various compliance automation companies. | Sales teams across the board are really responsive.                                                                       |
| March 29           | Signed up with Sprinto                              | -                                                                                                                         |
| April 02           | Got access to Sprinto dashboard                     | -                                                                                                                         |
| April 02           | Call with Onboarding Manager                        | My onboarding manager was fantastic - super knowledgeable and responsive.                                                 |
| April 03 to May 13 | Getting the compliance checklists in place          | This could have been ~2 weeks faster, but we were occupied with product releases.                                         |
| May 13 to Aug 13   | Evidence collection period                          | Found multiple rough edges with the product integrations that were annoying, but support helped resolve things quickly.   |
| Aug 20             | Auditors got access to Sprinto evidence             | This delay was entirely avoidable.                                                                                        |
| Aug 20-28          | Auditor evidence review: Testing                    | Initial review and questions. The timeline given was 4-6 weeks after the testing period.                                  |
| Aug 28 - Sep 18    | Auditor evidence review: Finalization               | Following up every 2-3 days and lots more back and forth with auditors about the evidence.                                |
| Sep 19 - Sep 23    | Draft SOC 2 Type II report review                   | I reviewed it thoroughly and corrected errors in the draft report.                                                        |
| Sep 24             | WarpBuild is SOC 2 Type II Certified                | The final report was e-signed and issued.                                                                                 |
| Sep 25             | WarpBuild Trust Center Setup                        | The trust center was set up and the SOC 2 Type II report was published. The SOC 2 logo is updated on the website.         |

1. Connect the Sprinto POCs to the audit team. It helps get everyone aligned.
1. 1-2 weeks before the end of the evidence collection window, send a reminder email to the Sprinto POCs and the audit team. Auditors could be busy with other engagements, so the heads-up will help.

## Final thoughts

The overall process took ~6 months.
In practice, the best case would have been ~4.5 months. We had to spend ~7 days of engineering time getting the compliance checklists in place. It was a conscious effort to ensure that the team was not burdened with the compliance process, and I went through every policy document in detail to ensure that. We took this opportunity to put in place lightweight processes for best practices that we didn't previously have.

WarpBuild is now SOC 2 Type II certified, with an unqualified opinion attestation (that's a good thing).

---

WarpBuild provides GitHub Actions runner infrastructure for optimized CI with fast disk IO and improved caching. You can run this in our cloud or your own cloud account with our Bring Your Own Cloud (BYOC) option, and get 10x cheaper runners while improving performance. Get started today.

---

# Spot Instances - Fast GitHub actions for the budget conscious

URL: /blog/spot-instances

WarpBuild provides fast runners for GitHub actions, now even cheaper with `spot` instances

--- title: "Spot Instances - Fast GitHub actions for the budget conscious" excerpt: "Spot instances for GitHub Actions" description: "WarpBuild provides fast runners for GitHub actions, now even cheaper with `spot` instances" date: "2024-04-12" author: surya_oruganti cover: "/images/blog/spot-instances/cover.png" ---

WarpBuild is excited to announce the launch of our new spot instance runners for GitHub Actions, offering a significant cost advantage to developers. These spot instances are 62.5% cheaper than regular GitHub Actions runner instances and 25% less expensive than our standard WarpBuild runners. This makes them an ideal choice for budget-conscious teams and individuals who need flexible, powerful CI/CD solutions without a hefty price tag.

## Key Features of WarpBuild Spot Instances

_Cost-Effective:_ Enjoy substantial savings with spot instances that cost significantly less than standard runners.
_Usage:_ Runners are named using the format `warp-<image>-<arch>-<size>x-spot`, making them easy to identify and integrate into your workflows.

_Seamless Configuration:_ Setting up a spot instance follows the same straightforward process as our standard instances, ensuring a smooth transition and no learning curve. These are 100% compatible with GitHub-hosted runners.

_Optimized for Non-Critical Tasks:_ These instances are perfect for short, interruptible tasks. They are not recommended for critical deployments that can't afford interruptions, such as production application updates, `tofu apply`, etc.

## Pricing and Configurations

These runners provide the best value for your CI/CD needs while being extremely fast. Here's a quick look at our configurations:

| Runner Tag | CPU | Memory | Storage | Price | Aliases |
| --- | --- | --- | --- | --- | --- |
| warp-ubuntu-latest-x64-2x-spot | 2 vCPU | 7GB | 150GB SSD | $0.003/minute | warp-ubuntu-2204-x64-2x-spot |
| warp-ubuntu-latest-x64-4x-spot | 4 vCPU | 16GB | 150GB SSD | $0.006/minute | warp-ubuntu-2204-x64-4x-spot |
| warp-ubuntu-latest-x64-8x-spot | 8 vCPU | 32GB | 150GB SSD | $0.012/minute | warp-ubuntu-2204-x64-8x-spot |
| warp-ubuntu-latest-x64-16x-spot | 16 vCPU | 64GB | 150GB SSD | $0.024/minute | warp-ubuntu-2204-x64-16x-spot |
| warp-ubuntu-latest-x64-32x-spot | 30 vCPU | 120GB | 150GB SSD | $0.048/minute | warp-ubuntu-2204-x64-32x-spot |
| warp-ubuntu-latest-arm64-2x-spot | 2 vCPU | 7GB | 150GB SSD | $0.00225/minute | warp-ubuntu-2204-arm64-2x-spot |
| warp-ubuntu-latest-arm64-4x-spot | 4 vCPU | 16GB | 150GB SSD | $0.0045/minute | warp-ubuntu-2204-arm64-4x-spot |
| warp-ubuntu-latest-arm64-8x-spot | 8 vCPU | 32GB | 150GB SSD | $0.009/minute | warp-ubuntu-2204-arm64-8x-spot |
| warp-ubuntu-latest-arm64-16x-spot | 16 vCPU | 64GB | 150GB SSD | $0.018/minute | warp-ubuntu-2204-arm64-16x-spot |
| warp-ubuntu-latest-arm64-32x-spot | 32 vCPU | 128GB | 150GB SSD | $0.036/minute | warp-ubuntu-2204-arm64-32x-spot |

You can find out more about our runners [here](/docs/ci/cloud-runners#spot-instances).

## Conclusion

WarpBuild's spot instances offer an unmatched blend of performance, flexibility, and cost-effectiveness for your GitHub Actions workflows. Whether you're running build/test cycles, automation tasks, or any other CI/CD operations, our runners provide a reliable and affordable solution. Embrace the power of spot instances and optimize your development cycle efficiently and economically.

For more details, and to start integrating WarpBuild spot instances into your CI/CD pipelines, visit our [website](https://www.warpbuild.com), [email us](mailto:support@warpbuild.com), or [schedule a call](https://cal.com/suryao/start).

# Ubicloud vs WarpBuild Comparison

URL: /blog/ubicloud-warpbuild-comparison

Ubicloud features, performance, and why WarpBuild is a better GitHub Actions runner alternative.

--- title: "Ubicloud vs WarpBuild Comparison" excerpt: "Ubicloud features, performance, and why WarpBuild is a better Github Actions runner alternative." description: "Ubicloud features, performance, and why WarpBuild is a better Github Actions runner alternative." date: "2024-08-22" author: surya_oruganti cover: "/images/blog/ubicloud-warpbuild-comparison/cover.png" ---

Ubicloud is an OSS cloud provider that can be run on any physical infrastructure. They also have a GitHub Actions runner product, likely as a test bed for new features and to seed usage. Ubicloud provides GitHub Actions runners that you can start using by changing just one line in your GitHub Actions workflow file. This post describes Ubicloud's features in detail and examines how it compares to WarpBuild.

## Ubicloud Features

### `x86-64` and `arm64` Runners

Ubicloud supports both x86-64 and arm64 architectures. This lets you build your projects on both platforms without emulation, which can speed up your arm64 builds 2-5x.
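Adopting a third-party runner really is the one-line `runs-on` change described above. A minimal sketch, assuming a `warp-ubuntu-latest-arm64-4x` label (inferred from the naming scheme in the spot pricing table earlier in this archive; check the runner docs for the exact tags):

```yaml
jobs:
  build:
    # Before: runs-on: ubuntu-latest
    # After: a WarpBuild arm64 runner. The label below is inferred
    # from the spot naming scheme; verify the exact tag in the docs.
    runs-on: warp-ubuntu-latest-arm64-4x
    steps:
      - uses: actions/checkout@v4
      - run: make build
```

Everything else in the workflow stays the same, since the runners are compatible with GitHub-hosted runner images.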
WarpBuild Advantage: WarpBuild supports x86-64 and arm64 instances. WarpBuild's arm64 instances are ~40% more powerful for faster raw performance, while Ubicloud's x86-64 instances are ~10% faster. WarpBuild has faster job start times and better platform uptime.

### Concurrency

Ubicloud supports up to 128 concurrent vCPUs for builds. This can be increased at an additional cost by reaching out to Ubicloud support.

WarpBuild Advantage: WarpBuild supports truly unlimited concurrency out of the box, at no additional cost, for both x86-64 and arm64 instances.

### OS and Images

Ubicloud supports Ubuntu 22.04 only, compatible with GitHub Actions runner images.

WarpBuild Advantage: WarpBuild supports Ubuntu 22.04 as well as the latest Ubuntu 24.04 image, and is 100% compatible with GitHub Actions runner images. Users can use the app to set up any number of custom base images directly.

> WarpBuild also supports macOS runners powered by M2 Pros for blazing fast macOS builds on macOS 13 and 14.

### Caching

Ubicloud provides 10GB of free cache per repo. Any additional usage evicts the oldest cache entries.

WarpBuild Advantage: WarpBuild provides unlimited cache storage with a 7-day retention policy from last access. Cache performance is blazing fast even for workflows with 100+ concurrent jobs. This is a major advantage for larger customers, especially with large artifacts, monorepos, or container builds with large layers.

> WarpBuild supports caching of large container layers, which Ubicloud does not.

![WarpBuild Cache](/images/blog/ubicloud-warpbuild-comparison/warpbuild-cache.png)

### Hosting Provider

Ubicloud runners are hosted on Hetzner, with most of the compute in the EU region.

WarpBuild Advantage: WarpBuild runs compute on AWS, GCP, and Azure, and users can choose which region to run their builds in. This has enormous advantages for customers who want to minimize inter-region data transfer costs and improve performance.
### Dashboard and Analytics

Ubicloud has a very basic page that lists the available runners.

WarpBuild Advantage: WarpBuild provides a rich dashboard for runners, cache usage, and builds. There are also analytics and insights for the entire repository, including build times, build failure rates, runtime trends, activity heatmaps, and more.

![WarpBuild Analytics](/images/blog/ubicloud-warpbuild-comparison/warpbuild-analytics.png)

![Ubicloud Dashboard](/images/blog/ubicloud-warpbuild-comparison/ubicloud-dashboard.png)

### Security

Ubicloud runners are ephemeral VMs running on bare metal. This is potentially subject to noisy neighbors and performance degradation.

WarpBuild Advantage: WarpBuild runners are ephemeral VMs as well, but with virtualization and isolation handled by the underlying cloud provider (AWS, GCP, Azure) and strong performance guarantees. This allows WarpBuild to be more secure and compliant with the most stringent security standards.

### Enterprise Compliance

Ubicloud is not SOC 2 compliant. Data residency regions are not handled.

WarpBuild Advantage: WarpBuild is in the process of obtaining SOC 2 Type II compliance certification. The documentation will be available for free, on request.

### Static IPs

Ubicloud supports public IPs for runner instances at an additional cost. However, these are not guaranteed to be static.

WarpBuild Advantage: WarpBuild offers static IPs for runners (on BYOC only) at no additional cost.

### Runner Pricing

Ubicloud's pricing is 10% of the cost of GitHub Actions runners for x86-64, and ~16% of the cost for arm64 runners.

WarpBuild Advantage: WarpBuild runners offer ~20% higher performance for both x86-64 and arm64 instances. Ubicloud runners are way cheaper than WarpBuild's cloud pricing, but comparable to overall WarpBuild BYOC pricing at a per-job level because of the performance improvements.

## Missing Features

Ubicloud is missing a lot of features essential for robust CI.
These features are usually deal breakers for large and fast-growing teams. WarpBuild supports all of them.

![WarpBuild Dashboard](/images/blog/buildjet-warpbuild-comparison/warpbuild-dashboard.png)

### Snapshots

WarpBuild supports saving and restoring runner instance state for persistence and incremental builds, which is essential for large codebases. WarpBuild users see a 10x improvement in build times due to this feature. Snapshots eliminate the time a CI workflow spends installing dependencies and enable incremental builds.

### Spot Instances

WarpBuild offers multiple runner configurations, including spot instances, which are ideal for low-cost, short-duration workflows. This makes WarpBuild instances ~20-40% cheaper than Ubicloud.

### Bring Your Own Cloud (BYOC)

WarpBuild supports BYOC, with a cloud-hosted control plane and runners spawned in the user's cloud account. This provides the best of both worlds: maximum flexibility with zero management overhead. It is a major advantage for larger customers and is 10x cheaper than Ubicloud. Users can also leverage preferential pricing agreements with their cloud providers for even more value.

### Regions

WarpBuild supports over 29 regions globally. This is huge for minimizing data transfer costs and improving performance, and it is essential for customers with sensitive workloads and data residency regulations.

### Disk Configurations

Ubicloud only supports 64GB of disk storage. WarpBuild supports configurable disk sizes, IOPS, and throughput. This is useful for ML/AI workloads, large container builds, monorepos, game developers, and mobile app development.

### Roadmap

Ubicloud's GitHub Actions product is a "use case" for their cloud, not core to their overall vision. In contrast, WarpBuild is already the most capable product in this space and has been rapidly adding new features and capabilities since launching less than a year ago.
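To illustrate the spot instances mentioned above: interruptible jobs can opt into a spot runner via its label, while non-interruptible work stays on a standard runner. A minimal sketch, using the spot label from the spot pricing table earlier in this archive (the non-spot label is an assumption based on the same naming scheme):

```yaml
jobs:
  test:
    # Interruptible work suits spot runners
    # (label from the spot pricing table above)
    runs-on: warp-ubuntu-latest-x64-4x-spot
    steps:
      - uses: actions/checkout@v4
      - run: make test

  deploy:
    # Deploys and other non-interruptible work stay on a standard runner
    # (non-spot label assumed from the same naming scheme)
    runs-on: warp-ubuntu-latest-x64-4x
    needs: test
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh
```

Splitting jobs this way captures the spot discount on the bulk of CI minutes without exposing critical steps to interruption.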
## Conclusion

Ubicloud is a basic provider of GitHub Actions runners, though a much cheaper one on list price. It's a good fit for teams on a very tight budget that don't particularly care about performance. WarpBuild is superior to Ubicloud in every other respect: large and fast-growing teams use WarpBuild for their CI/CD needs at scale, with 10x faster builds via snapshots, better security, and even lower effective costs with BYOC.

## Get Started Today

WarpBuild is committed to providing you with the tools you need to build faster, smarter, and more cost-effectively. Join us in this new era of development.

---

Stay tuned for more updates and features coming soon. Happy building!

---

For detailed technical documentation, visit [WarpBuild Docs](http://docs.warpbuild.com). Contact us at support@warpbuild.com.

---

# Observability at WarpBuild: Designing for Uptime and Reliability

URL: /blog/uptime-reliability

Observability at WarpBuild: Designing for Uptime and Reliability

--- title: "Observability at WarpBuild: Designing for Uptime and Reliability" excerpt: "Observability at WarpBuild: Designing for Uptime and Reliability" description: "Observability at WarpBuild: Designing for Uptime and Reliability" author: surya_oruganti date: "2025-11-05" cover: "/images/blog/uptime-reliability/cover.png" ---

Uptime and reliability are critical for any platform, and especially for WarpBuild, since CI sits on the critical path of our customers' workflows. Poor uptime can block hotfixes and releases, causing significant business impact. At WarpBuild, our goal is a system that customers can set and forget, so that it just works. In the last two months, we overhauled our internal observability stack for better alerting and visibility. This post discusses how we architect for uptime and reliability at WarpBuild through three key pillars: intelligent automation, multi-cloud redundancy, and comprehensive observability. But first, let's understand the infrastructure landscape that makes this all possible.
[In the previous post](/blog/observability-architecture), we discussed how we built a zero-maintenance observability system using S3, presigned URLs, and OpenTelemetry for our users to view their job metrics and logs. This post focuses on the internal infrastructure we use to ensure that all our customers' workloads always run, regardless of infrastructure issues, capacity constraints, or unexpected demand spikes.

## Infrastructure Overview

### Compute Layer

At WarpBuild, we run GitHub Actions runners across multiple infrastructure stacks, optimizing for both performance and user experience.

**Bare Metal Servers**: Most Linux and macOS runners run on bare metal servers. This gives us maximum performance and control, which is critical for compute-intensive CI workloads.

**Hyperscalers**: Windows runners, ARM64 Linux runners, and Docker builders run on hyperscalers due to licensing (for Windows) and performance-related constraints (ARM64 instances on AWS and Azure are vastly superior to Ampere servers). When we need to rely on hyperscaler infrastructure, we primarily use AWS, with passive backup stacks on GCP and Azure for redundancy and failover.

### Persistence and Backend Services

Persistence layers - including S3, databases, Redis, and SQS - are primarily hosted on AWS. Our backend services also run on AWS, but in a separate region isolated from the GitHub Actions runners. This isolation ensures that customer workloads don't impact our control plane, and vice versa. Critically, our backend services are infrastructure-aware. They maintain real-time visibility into the state of our compute infrastructure and communicate bidirectionally with our orchestrators.

### Orchestration

We use both Kubernetes and Nomad as orchestrators, depending on workload characteristics. These orchestrators handle the placement of virtual machines and containers, but they don't make decisions in isolation.
They're part of a closed-loop system with our backend services, continuously providing real-time information about the state of the underlying infrastructure.

## Three Pillars

Building reliable infrastructure isn't about a single magic solution. At WarpBuild, we approach reliability through three interconnected pillars that work together to ensure uptime and robust performance.

### Pillar 1: Intelligent Automation

The foundation of our reliability is automation that continuously monitors infrastructure state and makes intelligent scheduling decisions on the fly. Our backend services are the brain of this operation. They're infrastructure-aware and make dynamic decisions about where compute needs to be scheduled, optimizing for both queue times and performance. This isn't static configuration - it's real-time decision making based on current conditions.

The system works as a closed loop:

1. Our orchestrators (Kubernetes or Nomad) continuously provide real-time information about the state of the underlying infrastructure to our backend services
2. The backend services analyze this data alongside incoming workload requests
3. Based on capacity, performance characteristics, and current state, the backend commands the orchestrator for optimal VM placement
4. The orchestrator executes the placement and feeds updated state back to the backend

This closed-loop system enables automatic failover management. When the backend detects that scheduling isn't feasible on the preferred infrastructure - whether due to capacity constraints or infrastructure issues - it automatically triggers failover to alternative compute resources.

### Pillar 2: Multi-cloud Redundancy with Intelligent Failover

While automation handles the decision making, multi-cloud redundancy provides the infrastructure options that make seamless failover possible.

**Primary Infrastructure**: We prioritize bare metal servers for runners whenever possible.
Bare metal delivers the best performance for CPU-intensive CI workloads, and our customers benefit from faster build times.

**Automatic Fallback**: When bare metal capacity is insufficient, or when there are issues with our bare metal hosting provider, the system automatically falls back to hyperscalers. This isn't a manual process - our infrastructure-aware backend services detect capacity or availability issues and seamlessly redirect workloads to hyperscaler infrastructure.

**Multi-cloud Backup**: Beyond the bare-metal-to-hyperscaler failover, we maintain backup stacks on GCP and Azure in addition to our primary AWS infrastructure. This provides an additional layer of redundancy for hyperscaler workloads.

The result is a seamless customer experience. Whether a job runs on bare metal or is automatically failed over to a hyperscaler, customer workloads keep running. The complexity is hidden from users - they simply see their CI jobs complete successfully.

### Pillar 3: Observability and Alerting

While automation and redundancy handle most reliability scenarios, observability provides visibility and enables quick response to everything else.

**Tools and Stack**: We use a comprehensive observability stack: OpenTelemetry for instrumentation, Prometheus for metrics collection, Grafana for visualization, and Datadog and SigNoz for unified monitoring and alerting. This gives us deep visibility into logs, metrics, alerts, and dashboards across all our systems.

**Persistence Layer Strategy**: Unlike our compute layer, which has automated failover, our persistence layers (databases, Redis, SQS) rely on observability and rapid response rather than automated failover.

**Recent Improvements**: The observability stack overhaul we completed in the last two months significantly improved our alert quality and visibility.
We now have a better signal-to-noise ratio in alerts, more comprehensive dashboards for infrastructure health, and broader observability coverage overall.

## How It All Works Together

These three pillars aren't independent - they work in concert to deliver the reliability our customers depend on. **Automation** handles dynamic workload placement and failover, continuously optimizing where jobs run based on real-time infrastructure state. **Multi-cloud redundancy** provides the infrastructure options - bare metal, multiple hyperscalers, multiple regions - that make intelligent failover possible. **Observability** ensures we have visibility into every layer of the stack and can quickly respond to any issues that automation doesn't handle.

The result is the "set and forget" reliability that we promised. Our customers configure their CI once, and from that point forward, WarpBuild handles the complexity of ensuring their workloads always run, regardless of infrastructure issues, capacity constraints, or unexpected demand spikes. This setup also helps protect against the dreaded `us-east-1` failures and other outages.

## Conclusion

Reliability at scale requires more than just redundant infrastructure. It requires intelligent systems that make real-time decisions, seamless failover mechanisms, and comprehensive observability to catch what automation misses. At WarpBuild, these three pillars - intelligent automation, multi-cloud redundancy, and observability - work together to deliver the uptime and reliability that modern development teams require. As we continue to grow and evolve our infrastructure, these principles guide every architectural decision.

### Call for Developers

We are looking for developers who are interested in building the future of CI/CD. If you are interested in working on these kinds of infrastructure challenges, get in touch with us at [hello@warpbuild.com](mailto:hello@warpbuild.com)!