Terraform Gotchas And How We Work Around Them

Kamal Marhubi
June 12, 2017 · 8 min read

Heap’s infrastructure runs on AWS, and we manage it using Terraform. This post is a collection of tips and gotchas we’ve picked up along the way.

Terraform and infrastructure as code

Terraform is a tool from HashiCorp to help manage infrastructure declaratively. Instead of manually creating instances, networks, and so on in your cloud provider’s console or command line client, you write configuration that describes what you want your infrastructure to look like.

This configuration is in a human-readable text format. When you want to modify your infrastructure, you modify the configuration and run terraform apply. Terraform will make API calls to your cloud provider to bring the infrastructure in line with what’s defined in the configuration.

Moving our infrastructure management into text files allows us to take all our favorite tools and processes for source code and apply them to our infrastructure. Now infrastructure can live in source control, we can review it just like source code, and we can often roll back to an earlier state if something goes wrong.

As an example, here’s a Terraform definition of an EC2 instance with an EBS volume:

resource "aws_instance" "example" {
    ami = "ami-2757f631" 
    instance_type = "t2.micro" { ebs_block_device {  
    device_name = "/dev/xvdb"  
    volume_type = "gp2" 
    volume_size = 100 } 
}

If you haven’t tried Terraform yet, the getting started guide is quite good, and will quickly get you familiar with the workflow.

Terraform’s data model

At a high level, Terraform has a simple data model: it manages resources, and resources have attributes. A few examples from the AWS world:

  • an EC2 instance is a resource with attributes like the machine type, boot image, availability zone, and security groups

  • an EBS volume is a resource with attributes like volume size, volume type, and IOPS

  • an Elastic Load Balancer is a resource with attributes for its backing instances, how it checks their health, and a few others

Terraform maintains a mapping between resources defined in configuration and the corresponding cloud provider resources. The mapping is called the state, and it’s a giant JSON file. When you run terraform apply, Terraform refreshes its state by querying the cloud provider. Then it compares the returned resources against what you have in your Terraform configuration.

If there are any differences, it will create a plan: a set of changes to the resources in your cloud provider that will bring them in line with your configuration. Finally, it applies those changes by making calls to your cloud provider.
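
To make the state concrete, here’s a heavily truncated, illustrative excerpt of a state file for the instance defined above (the instance ID is made up, and real state files carry many more attributes):

{
  "version": 3,
  "terraform_version": "0.9.6",
  "modules": [
    {
      "path": ["root"],
      "resources": {
        "aws_instance.example": {
          "type": "aws_instance",
          "primary": {
            "id": "i-0123456789abcdef0",
            "attributes": {
              "ami": "ami-2757f631",
              "instance_type": "t2.micro"
            }
          }
        }
      }
    }
  ]
}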

Not every Terraform resource is an AWS resource

This resources-and-attributes data model is not too hard to understand, but it doesn’t necessarily match the cloud provider APIs perfectly. In fact, a single Terraform resource can correspond to one, more than one, or even zero underlying entities in your cloud provider. Here are some examples from AWS:

  • a Terraform aws_ebs_volume corresponds to one AWS EBS volume

  • a Terraform aws_instance with an embedded ebs_block_device block as in the example above corresponds to two EC2 resources: the instance and the volume

  • a Terraform aws_volume_attachment corresponds to zero entities in EC2!

The last one might be surprising. When you create an aws_volume_attachment, Terraform will make an AttachVolume request; when you destroy it, it will make a DetachVolume request. There’s no EC2 object involved: Terraform’s aws_volume_attachment is completely synthetic! Like all resources in Terraform, it has an ID. But where most have an ID that comes from the cloud provider, the aws_volume_attachment’s ID is simply a hash of the volume ID, instance ID, and device name.

Synthetic resources show up in a few other places in Terraform, for example aws_route53_zone_association, aws_elb_attachment, and aws_security_group_rule. One way to spot them is to look for association or attachment in the resource name, though that isn’t a foolproof rule.
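
To see what “synthetic” means in practice, an ID like aws_volume_attachment’s can be derived purely from the resource’s inputs, with no API call involved. This isn’t the provider’s actual code, just a sketch of the idea:

package main

import (
    "fmt"
    "hash/crc32"
)

// volumeAttachmentID derives a stable ID from the attachment's inputs.
// The same inputs always produce the same ID, and no cloud API call is
// needed to compute it.
func volumeAttachmentID(deviceName, instanceID, volumeID string) string {
    s := fmt.Sprintf("%s-%s-%s", deviceName, instanceID, volumeID)
    return fmt.Sprintf("vai-%d", crc32.ChecksumIEEE([]byte(s)))
}

func main() {
    fmt.Println(volumeAttachmentID("/dev/xvdb", "i-0123456789abcdef0", "vol-0123456789abcdef0"))
}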

There’s more than one way to do it, so choose carefully!

With Terraform, there can be more than one way to represent exactly the same infrastructure. Here’s another way to represent our example instance with an EBS volume in Terraform, one that results in the same EC2 resources:

resource "aws_instance" "example" {
  ami           = "ami-2757f631"
  instance_type = "t2.micro"
}

resource "aws_ebs_volume" "example-volume" {
  availability_zone = "${aws_instance.example.availability_zone}"
  type              = "gp2"
  size              = 100
}

resource "aws_volume_attachment" "example-volume-attachment" {
  device_name = "/dev/xvdb"
  instance_id = "${aws_instance.example.id}"
  volume_id   = "${aws_ebs_volume.example-volume.id}"
}

Now the EBS volume is a Terraform resource in its own right, distinct from the EC2 instance, and a third, synthetic resource ties the two together. Representing our instance and volume this way allows us to add and remove volumes by adding and removing aws_ebs_volume and aws_volume_attachment resources.

In many cases, it doesn’t matter which EBS representation you choose. But sometimes making the wrong choice can make changing your infrastructure quite difficult!

We made the wrong choice

We got bitten by this at Heap. We operate a large PostgreSQL cluster in AWS, and each instance has 18 EBS volumes attached for storage. We represented the instances in Terraform as a single aws_instance resource with the EBS volumes defined in ebs_block_device blocks.

Our database instances store data on a ZFS filesystem. ZFS lets you dynamically add block devices to grow the filesystem with no downtime. This means we can gradually grow our storage as our customers send us more data. As an analytics company that captures everything, this flexibility is a huge win. We’re continually improving the insert and query efficiency of our cluster. Instead of being stuck with the CPU-to-storage ratio we picked when we provisioned the cluster, we can adjust the balance on the fly to take advantage of the latest improvements. We’ll go into more detail on how this works in another post.

Using ebs_block_device blocks kept this process from being as smooth as it could be. You might hope that Terraform would let us add a nineteenth ebs_block_device block to the aws_instance and everything would just work.

But unfortunately, Terraform sees this as an incompatible change: it doesn’t know how to modify an instance with 18 volumes to turn it into one with 19. Instead, the Terraform plan is to tear down the whole instance and create a new one in its place. This definitely isn’t what we want for our database instances with tens of terabytes of storage!
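
The plan for such a change looks roughly like this (illustrative output for a hypothetical aws_instance.database resource, not a verbatim capture); the -/+ marker means destroy and recreate:

-/+ aws_instance.database
    ebs_block_device.#: "18" => "19" (forces new resource)
    ...

Plan: 1 to add, 0 to change, 1 to destroy.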

Until recently, we worked around this with a hackish procedure that got Terraform back in sync in a few steps:

  1. we ran a script that used the AWS CLI to create and attach the volumes

  2. we ran terraform refresh to get Terraform to update its state, and

  3. finally we changed the configuration to match the new reality

Between steps 2 and 3, terraform plan would show that Terraform wanted to destroy and recreate all our database instances. This made it impossible to do anything with those instances in Terraform until someone updated the config. Needless to say, this is a scary state to end up in routinely!
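
As a sketch, with hypothetical IDs and hardcoded values standing in for the real script’s arguments, the workaround looked like this:

# Step 1: create and attach the volume behind Terraform's back
VOLUME_ID=$(aws ec2 create-volume \
    --availability-zone us-east-1a \
    --volume-type gp2 \
    --size 500 \
    --query VolumeId --output text)
aws ec2 attach-volume \
    --volume-id "$VOLUME_ID" \
    --instance-id i-0123456789abcdef0 \
    --device /dev/xvds

# Step 2: pull the new volume into Terraform's state
terraform refresh

# Step 3: edit the configuration by hand to add the matching
# ebs_block_device block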

Terraform state surgery

Once we found the aws_volume_attachment approach, we decided to switch our representation over. Each volume became two new Terraform resources: an aws_ebs_volume and an aws_volume_attachment. With 18 volumes per instance across our cluster, we were looking at well over a thousand new resources.

Switching the representation isn’t just a matter of changing the Terraform configuration. We had to reach into Terraform’s state to change how it sees the resources, and with over a thousand resources being added, we were definitely not going to do that manually.

Terraform’s state is stored as JSON. While the format is stable, the docs state that ‘direct file editing of the state is discouraged’. We had to do it anyway, but we wanted to be sure we were doing it correctly. Rather than reverse-engineer the JSON format by inspection, we wrote a program that uses Terraform’s internals as a library to read, modify, and write it. This wasn’t exactly straightforward, especially since it was the first Go program for both of the people working on it! But we think it was worth it to be sure we weren’t subtly messing up the Terraform state of our database instances.
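
A heavily simplified sketch of that kind of program, assuming the 0.9-era internals (terraform.ReadState and terraform.WriteState; these internal APIs have changed in later Terraform versions, and the actual resource rewriting is elided):

package main

import (
    "log"
    "os"

    "github.com/hashicorp/terraform/terraform"
)

func main() {
    // Read the existing state with Terraform's own parser rather than
    // reverse-engineering the JSON format.
    f, err := os.Open("terraform.tfstate")
    if err != nil {
        log.Fatal(err)
    }
    state, err := terraform.ReadState(f)
    f.Close()
    if err != nil {
        log.Fatal(err)
    }

    for _, mod := range state.Modules {
        for name := range mod.Resources {
            // The real program rewrote each aws_instance's embedded
            // ebs_block_device entries into separate aws_ebs_volume and
            // aws_volume_attachment resources here.
            log.Println("found resource:", name)
        }
    }

    // Write the modified state back out with Terraform's own serializer.
    out, err := os.Create("terraform.tfstate.new")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()
    if err := terraform.WriteState(state, out); err != nil {
        log.Fatal(err)
    }
}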

Terraforming safely

Running terraform apply is one of the few times you have the power to seriously damage your company’s infrastructure. There are a few things you can do to make this safer and less scary.

Always write your plan -out, and apply that plan

If you run terraform plan -out planfile, Terraform will write the plan it computes to planfile. You can then apply exactly that plan by running terraform apply planfile. That way, the changes made at apply time are exactly what Terraform showed you at plan time, and you won’t find yourself unexpectedly changing infrastructure that a coworker modified between your plan and your apply.
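
The workflow looks like this:

terraform plan -out planfile
# review the plan, then apply exactly the changes it showed
terraform apply planfile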

Take care with the plan file though: it will include your Terraform variables, so if you put secrets in those they will be written to the filesystem in the clear. For example, if you pass in your cloud provider credentials as variables, those will end up stored on disk in plaintext.

Have a read-only IAM role for iterating on changes

When you run terraform plan, Terraform refreshes its view of your infrastructure. To do this it only needs read access to your cloud provider. By using a read-only role, you can iterate on your config changes and verify them with terraform plan without ever risking a stray apply ruining your day — or week!

With AWS, we can manage the IAM roles and their permissions in Terraform. Our role looks like this:

resource "aws_iam_role" "terraform-readonly" {
  name = "terraform-readonly"
  path = "/",
  assume_role_policy = "[error]data.aws_iam_policy_document.assume-terraform-readonly-role-policy.json[/error]"
}

Our assume_role_policy simply lists the users who are allowed to assume the role. The final piece is the policy that grants read-only access to all AWS resources. Amazon helpfully provides a copy-pastable policy document, and that’s what we used. We define an aws_iam_policy that references the policy document:

resource "aws\_iam\_policy" "terraform-readonly" {
name = "terraform-readonly"
path = "/"
description = "Readonly policy for terraform planning"
policy = "[error]file("policies/terraform-readonly.json")[/error]"
}

Then we apply the policy to the terraform-readonly role with an aws_iam_policy_attachment:

resource "aws\_iam\_policy\_attachment" "terraform-readonly-attachment" {
name = "Terraform read-only attachment"
roles = ["[error]aws\_iam\_role.terraform-readonly.name[/error]"]
policy\_arn = "[error]aws\_iam\_policy.terraform-readonly.arn[/error]"
}

Now you can use the Security Token Service’s AssumeRole API to get temporary credentials that only have the power to query AWS, not change it. Running terraform plan will update your Terraform state to reflect the current infrastructure. If you’re using local state, this means it will write to the terraform.tfstate file. If you’re using remote state, e.g. in S3, you’ll need to grant your read-only role write access to it.
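
Getting those temporary credentials looks something like this (the account ID and session name here are made up):

aws sts assume-role \
    --role-arn arn:aws:iam::123456789012:role/terraform-readonly \
    --role-session-name terraform-plan

# Export the AccessKeyId, SecretAccessKey, and SessionToken from the
# response as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
# AWS_SESSION_TOKEN, then iterate safely:
terraform plan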

Having this role in place made us much happier when rewriting Terraform’s state to use aws_volume_attachment for our database volumes. We knew there should be no change to the infrastructure in AWS, only in Terraform’s view of it, and running everything under the read-only role guaranteed exactly that. After all, we weren’t actually modifying any infrastructure, so why have that power available?

Ideas for the future

As our team grows, more and more people are making changes to our infrastructure with Terraform. We want to make this easy and safe. Most outages are caused by human error and configuration changes, and applying Terraform changes is a terrifying mix of the two.

For example, with a tiny team, it’s easy to be sure only one person is running Terraform at any given time. With a larger team, that becomes less of a guarantee and more of a hope. If two terraform apply runs were happening at the same time, the result could be a horrible non-deterministic mess. While we’re not using it just yet, Terraform 0.9 introduced state locking, making it possible to guarantee only one terraform apply is happening at a time.
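
With the S3 backend in Terraform 0.9, for example, a sketch of enabling locking looks like this (the bucket and table names are hypothetical, and the DynamoDB table must already exist):

terraform {
  backend "s3" {
    bucket     = "example-terraform-state"
    key        = "infrastructure.tfstate"
    region     = "us-east-1"
    lock_table = "terraform-state-lock"
  }
}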

Another place where we’re thinking about ease and safety is in reviewing infrastructure changes. Right now our review process involves copy/pasting terraform plan output as a comment on the review, and applying it manually once it’s approved.

We’re already using our continuous integration tool to validate the Terraform configuration. For now this just runs terraform validate, which checks for syntax errors. The next step we want to work towards is having our continuous integration run terraform plan and post the infrastructure changes as a comment in code review. The CI system would automatically run terraform apply when the change is approved. This removes a manual step, while also providing a more consistent audit trail of changes in the review comments. Terraform Enterprise has a feature like this, and we’ll be taking a look at it.

Have any ideas on how to improve Terraform workflows? Let me know! And if you enjoyed this post, join our team!
