Drift Detection and Management: Keeping Terraform Infrastructure in Sync

Infrastructure as Code (IaC) tools, such as Terraform, have revolutionized how developers and DevOps teams manage cloud resources. By defining infrastructure in code, teams can version, review, and reuse configurations to ensure consistency across environments. However, a common challenge that arises over time is infrastructure drift — when the actual state of resources in the cloud diverges from the state defined in your Terraform code. In other words, changes happen outside your code, and your deployments gradually fall out of sync.

For developers or DevOps engineers, drift can be a source of confusion and risk. Imagine deploying a set of cloud instances via Terraform, only to have an urgent on-call fix applied manually in the cloud console at 2 AM. A week later, you run Terraform again and get unexpected changes in the plan output — a clear sign that something drifted behind the scenes.

If not addressed, these hidden differences can lead to deployment errors, security holes, and lots of head-scratching. In this post, we’ll explore what drift is, why it happens, and how to detect and manage drift effectively in a Terraform workflow. We’ll also look at tools and best practices — including how platforms like Spacelift help automate drift detection, so you can keep your infrastructure true to its code.

What is Drift in Terraform?

Infrastructure drift refers to any situation where the real-world state of your infrastructure does not match the state defined in your Terraform configuration and state files. Terraform records the last-known state of each managed resource in a state file, and when a change is made to those resources outside of Terraform (for example, via a cloud provider’s web console or CLI), the state file knows nothing about it until the next refresh. This discrepancy is what we call “drift.” Put simply, drift is the accumulation of untracked changes in your infrastructure.

For example, if your Terraform code defines a virtual machine instance with a certain type or size, but an engineer manually changes that instance type in the cloud console, your code and reality are now out of sync. The next time Terraform runs a plan, it refreshes its view of the instance and reports the difference. In Terraform’s terms, drift shows up as a mismatch between the desired state (your .tf code), the recorded state (the state file), and the actual state (in the cloud). Even without running Terraform, drift is there, lurking until detected.
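To make the idea concrete, here is a minimal Python sketch of the comparison, not Terraform's actual implementation. The attribute names and values are hypothetical, and only attributes that the configuration sets are compared, so provider-computed values such as an IP address are ignored.

```python
from __future__ import annotations

from typing import Any


def find_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (desired_value, actual_value)} for each mismatch.

    Only attributes declared in `desired` are checked; anything else the
    cloud reports (computed values) is ignored.
    """
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical values: what the code declares vs. what the cloud reports
# after someone resized the instance in the console.
desired = {"instance_type": "t3.micro", "monitoring": True}
actual = {"instance_type": "t3.large", "monitoring": True, "public_ip": "203.0.113.10"}

print(find_drift(desired, actual))
# prints {'instance_type': ('t3.micro', 't3.large')}
```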

Not all drift is malicious or accidental; sometimes changes are made intentionally for quick fixes or testing. However, any drift means your code is no longer the single source of truth for your infrastructure. It undermines the benefits of IaC by introducing uncertainty about what’s really running.

Common Causes of Infrastructure Drift

Understanding why drift happens is the first step to preventing it. Here are some of the most common causes of drift in a Terraform-managed environment:

  • Manual changes in the cloud: The most frequent culprit is someone making a manual change to resources through a cloud provider’s console or CLI. This might happen during an emergency (to fix a production issue fast) or while testing a configuration tweak. For instance, an engineer might manually switch an AWS S3 bucket’s access from private to public for debugging, inadvertently exposing data. If they neither revert the change nor record it in Terraform, that modification becomes drift.
  • Automated processes and external scripts: Modern cloud setups often have automation beyond Terraform. Auto-scaling events, cloud-managed updates, or custom scripts can all alter infrastructure without Terraform’s knowledge. For example, an AWS auto-scaling group might spin up extra instances or change parameters in response to load, or a CI/CD script might directly adjust a security group rule. These changes are outside Terraform’s control and thus introduce drift unless Terraform is informed of them.
  • Using multiple IaC or configuration tools: Organizations might use Terraform alongside other tools like configuration management (e.g. Ansible) or cloud-specific services. If these tools overlap responsibilities, one tool could change infrastructure that another tool (Terraform) also manages. Ansible, for example, can provision infrastructure as well as configure it, so running Ansible playbooks might inadvertently create or modify resources tracked by Terraform.
  • Resource deletion or policy-driven changes: Sometimes resources drift because they’re removed or altered due to policies or cost controls. A classic case is someone deleting a resource (like an unused VM or database) directly to save money, or an automated policy shutting something down for compliance. Terraform’s state file still lists the resource as existing, which is drift.
  • State file manipulation or corruption: While less common, editing Terraform state files manually or encountering a corrupted state can cause drift. For example, if a state file is manually adjusted or not properly updated after a partial apply, Terraform’s view of reality may be skewed, leading to inconsistencies until corrected.
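Two of the causes above, out-of-band deletion and resources created by other tools, reduce to a set comparison between what the state file tracks and what the cloud actually contains. The sketch below assumes you have already collected both sets of resource IDs; how you collect them is up to you, and the IDs shown are made up.

```python
from __future__ import annotations


def compare_inventory(state_ids: set[str], cloud_ids: set[str]) -> dict[str, list[str]]:
    """Split resource IDs into those missing from the cloud and those unmanaged."""
    return {
        # Tracked in state but gone from the cloud (deleted out-of-band).
        "missing": sorted(state_ids - cloud_ids),
        # Present in the cloud but unknown to Terraform (created elsewhere).
        "unmanaged": sorted(cloud_ids - state_ids),
    }


# Hypothetical instance IDs.
state_ids = {"i-0aaa", "i-0bbb", "i-0ccc"}
cloud_ids = {"i-0aaa", "i-0ccc", "i-0ddd"}

print(compare_inventory(state_ids, cloud_ids))
# prints {'missing': ['i-0bbb'], 'unmanaged': ['i-0ddd']}
```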

Why Drift Matters: Risks of Ignoring Drift

Allowing your infrastructure to drift away from your codebase can introduce significant risks and operational challenges. Some key impacts of unmanaged drift include:

Security vulnerabilities

Untracked changes can open up security holes. For example, if someone manually broadens a firewall rule or opens a storage bucket to the public and doesn’t revert it in code, your system could be exposed to attackers. Drift means security policies defined in code might not actually be enforced in the cloud.

Compliance violations

Many organizations rely on IaC to enforce compliance standards (e.g., ensuring encryption is enabled, networks are private, etc.). Drift can silently break these rules. A drift that exposes user data publicly or changes an encryption setting can mean you’re no longer compliant with regulations. Because the change isn’t tracked, it might go unnoticed until an audit or incident occurs.

Operational and performance issues

When infrastructure doesn’t match the expected state, it can lead to instability. Configuration drift might disable auto-scaling, alter resource sizes, or create mismatched settings that degrade performance. Troubleshooting becomes difficult too — teams waste time chasing “mystery bugs” only to find some forgotten manual change was the cause.

Increased costs

Drift can hit the wallet as well. Orphaned or altered resources might be running when they shouldn’t, incurring cloud costs that aren’t accounted for. For instance, if a VM was manually scaled up to a larger instance type outside of Terraform, it could continue running at the higher cost unnoticed. Additionally, resolving drift after the fact (rewriting code, figuring out changes) has a labor cost.

Erosion of infrastructure as code benefits

Perhaps most subtly, unchecked drift undermines the whole point of using Terraform. If the code doesn’t reflect reality, infrastructure automation pipelines (CI/CD, testing, etc.) can’t be trusted. Over time, the Terraform code becomes obsolete, and engineers may lose confidence in it, leading to more ad-hoc changes — a vicious cycle.

Detecting Drift: How to Spot Infrastructure Changes

Detecting infrastructure drift in Terraform essentially means comparing the real world to your code on a regular basis. Terraform’s built-in commands provide the primary means of surfacing drift:

  • Run terraform plan regularly: The Terraform CLI plan command is your first line of defense. Every time you run terraform plan, Terraform will refresh the state and show any differences between the current state and your configuration. If something in the real infrastructure doesn’t match your code, the plan output will show it as a change.
  • Use the refresh-only plan option: Terraform 0.15.4 and later offer a terraform plan -refresh-only mode, which refreshes the state and reports changes made outside Terraform without proposing any infrastructure actions. This makes it well suited to pure drift checks.
  • Scheduled Terraform plans: Rather than waiting for the next terraform run during a deployment, many teams set up automated jobs to run terraform plan on a schedule (daily or hourly) against critical environments.
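A scheduled check needs a machine-readable result rather than plan output read by a human. One option is the -detailed-exitcode flag of terraform plan, which exits with 0 when the diff is empty, 1 on error, and 2 when the diff is non-empty. The Python wrapper below is a rough sketch: the function names and the status labels are made up for illustration, and it assumes terraform is on the PATH and the working directory is already initialized.

```python
from __future__ import annotations

import subprocess

# Exit codes of `terraform plan -detailed-exitcode`; the labels are our own.
EXIT_MEANING = {0: "in_sync", 1: "error", 2: "changes_detected"}


def classify_exit(code: int) -> str:
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return EXIT_MEANING.get(code, "error")


def check_drift(workdir: str, terraform_bin: str = "terraform") -> str:
    """Run a non-interactive plan in `workdir` and return its status label."""
    result = subprocess.run(
        [terraform_bin, "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit(result.returncode)
```

A cron job or scheduler could call check_drift("/path/to/stack") and raise an alert on anything other than "in_sync". Keep in mind that a non-empty diff can also come from code changes that have not been applied yet, so run the check against the commit that was last deployed.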

Beyond core Terraform, there are also specialized tools designed to detect drift:

  • Drift detection tools (e.g. driftctl): Third-party tools like driftctl can scan cloud resources and compare them with Terraform state to find anything unmanaged or out-of-sync.
  • CI/CD pipeline checks: You can script drift detection into CI pipelines. For instance, a nightly pipeline could run terraform plan across your environments and send an alert if any drift is detected.
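For a pipeline alert, it helps to know which resources drifted, not just that something did. Terraform can render a saved plan as JSON (terraform plan -refresh-only -out=drift.tfplan followed by terraform show -json drift.tfplan), and that JSON carries a resource_drift list. The sketch below summarizes that list; the file name and the sample data are hypothetical, and the sample is trimmed to the fields the function reads.

```python
from __future__ import annotations


def summarize_drift(plan: dict) -> list[str]:
    """Return one 'address: actions' line per entry in the plan's resource_drift list."""
    lines = []
    for item in plan.get("resource_drift", []):
        actions = item.get("change", {}).get("actions", [])
        lines.append(f"{item['address']}: {'/'.join(actions)}")
    return lines


# Hypothetical, trimmed-down output of `terraform show -json drift.tfplan`.
sample_plan = {
    "format_version": "1.2",
    "resource_drift": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete"]}},
    ],
}

for line in summarize_drift(sample_plan):
    print(line)
# prints:
# aws_instance.web: update
# aws_s3_bucket.logs: delete
```

In a nightly job, you would load the real JSON with json.load and send the resulting lines to whatever alerting channel the team already uses.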

The key is to make drift detection automatic and frequent. Many teams integrate a combination of the above and may use tools like Spacelift, which offers continuous drift detection and automatic remediation as described in their Terraform drift detection guide.

Drift Remediation Strategies: Reconcile or Align?

Detecting drift is only half the battle. Once you find drift, you need to decide how to fix it. Broadly, there are two approaches to remediate drift in Terraform-managed infrastructure:

  1. Reconcile (Revert to code): This involves using Terraform to reapply your intended configuration, overriding the out-of-band changes. It is ideal when the drifted changes were mistakes or unauthorized.
  2. Align (Adopt the changes into code): If the out-of-band change was valid and beneficial, update your Terraform code to match the new reality. Then run a plan to confirm that no differences remain, and apply (for example with -refresh-only) so the state file records the new values.
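The two strategies above map to different command sequences. The following sketch only builds those sequences as argument lists and does not run anything. The function name and the plan file name are made up for illustration, and the "align" path assumes the .tf files have already been edited by hand to match the drifted values.

```python
from __future__ import annotations


def remediation_steps(strategy: str) -> list[list[str]]:
    """Return the terraform commands (as argv lists) for a remediation strategy."""
    if strategy == "reconcile":
        # Plan the revert, review it, then apply exactly that saved plan.
        return [
            ["terraform", "plan", "-out=revert.tfplan"],
            ["terraform", "apply", "revert.tfplan"],
        ]
    if strategy == "align":
        # Accept the drifted values into state, then confirm that the
        # (already updated) code produces an empty diff: exit code 0.
        return [
            ["terraform", "apply", "-refresh-only"],
            ["terraform", "plan", "-detailed-exitcode"],
        ]
    raise ValueError(f"unknown strategy: {strategy!r}")


for argv in remediation_steps("align"):
    print(" ".join(argv))
# prints:
# terraform apply -refresh-only
# terraform plan -detailed-exitcode
```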

In either case, the goal is to bring code and infrastructure back in sync. Letting drift linger increases the risk of issues and erodes trust in your IaC workflow.

Best Practices to Prevent and Manage Drift

  • Avoid out-of-band changes: Promote a culture of using Terraform workflows for all changes.
  • Implement policy-as-code: Use tools like OPA or Sentinel to enforce infrastructure rules.
  • Run frequent drift checks: Conducting daily or hourly checks helps catch and address drift early.
  • Use version control and logging: Ensure changes are traceable and auditable.
  • Consider IaC platforms: Tools like HCP Terraform or Spacelift can automate the detection and remediation process.
  • Document remediation protocols: Establish clear runbooks for how to handle drift.
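To give a feel for what a policy-as-code rule checks, here is a small Python stand-in. It is not OPA or Sentinel, both of which use their own policy languages. It scans the resource_changes list of a JSON plan and flags any aws_s3_bucket_acl resource that would end up with a public ACL. The rule itself and the sample plan are illustrative assumptions.

```python
from __future__ import annotations

PUBLIC_ACLS = {"public-read", "public-read-write"}


def policy_violations(plan: dict) -> list[str]:
    """Return addresses of S3 bucket ACL resources that would become public."""
    violations = []
    for rc in plan.get("resource_changes", []):
        # "after" is null when a resource is being deleted.
        after = rc.get("change", {}).get("after") or {}
        if rc.get("type") == "aws_s3_bucket_acl" and after.get("acl") in PUBLIC_ACLS:
            violations.append(rc["address"])
    return violations


# Hypothetical, trimmed-down JSON plan.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket_acl.logs", "type": "aws_s3_bucket_acl",
         "change": {"actions": ["update"], "after": {"acl": "public-read"}}},
        {"address": "aws_s3_bucket_acl.assets", "type": "aws_s3_bucket_acl",
         "change": {"actions": ["no-op"], "after": {"acl": "private"}}},
    ],
}

print(policy_violations(sample_plan))
# prints ['aws_s3_bucket_acl.logs']
```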

Conclusion

Drift happens. But with proactive detection, clear decision-making, and automation, you can manage it effectively. Whether you choose to revert changes or adopt them into code, the key is ensuring your Terraform remains a trustworthy source of truth. Integrating tools, routines, and team awareness into your workflows will help you maintain control, reduce surprises, and keep your cloud environments stable and secure.