Terraform Design Considerations for Cisco ACI - Part 1¶

Originally Written: August, 2023

This is the first post in a four-part series.

Part 1: A Terraform recap and introduction to the design considerations
Part 2: Terraform Considerations Related to ACI Network Connectivity Options
Part 3: Example Designs
Part 4: Single Source of Truth and Production Ready Environment

Topics Covered¶

Overview
Terraform Recap
How Terraform Works
Terraform State
Nexus as Code Recap
Design Considerations Summary
Design Consideration: Single Folder vs Multiple Folders
What happens if the state file is damaged? (e.g. corrupted, deleted)
Reducing the run time and fault domain size
Applying role based access
Updating A Property Across Multiple Configurations Or Re-using Config Across Multiple Environments
Single file vs multiple files?
1 large folder vs 1000s of tiny folders + 1 large file vs 1000s of tiny files = ?
Additional References

Overview¶

Many customers are starting to use Infrastructure as Code (IaC) tools such as Terraform to deploy and manage their infrastructure and applications.

Infrastructure as Code is a methodology that allows developers and operations teams to manage and provision infrastructure resources through machine-readable files, rather than manually configuring them.

There are numerous benefits of using IaC including:

Scalability and Consistency: IaC allows for the automatic provisioning of resources, making it easier to scale infrastructure up or down as needed. Terraform's declarative approach ensures that infrastructure is consistently and reliably provisioned across environments, reducing the risk of errors or configuration drift.
Version Control and Collaboration: By treating infrastructure as code, teams can manage infrastructure changes using version control systems like Git. This enables multiple team members to work on the infrastructure simultaneously, track changes, and roll back if necessary. It also fosters collaboration, as code reviews can be conducted, ensuring best practices and reducing the chance of mistakes.
Automation and Efficiency: With IaC and Terraform, infrastructure provisioning and management tasks can be automated, saving significant time and effort. Infrastructure can be provisioned, updated, or destroyed with a single command, making it easier to replicate environments and maintain consistency. This automation reduces the chance of human error and allows teams to focus on higher-value tasks.
Testing: Infrastructure code can be tested, just like application code. By using tools like Terraform's plan command, potential issues can be identified before infrastructure changes are applied. This helps catch misconfigurations, conflicts, or other problems early on, minimizing the impact on production environments and improving overall reliability.

These advantages help organizations streamline infrastructure management, improve efficiency, and reduce the risk of errors, ultimately enabling them to deliver reliable and scalable applications and services.

While it is fairly straightforward to get started, it can be difficult to understand the various options for managing real-world ACI fabrics with Terraform and the trade-offs associated with each design. This paper covers some of the available options.

Notes

This paper is intended for network admins with some knowledge of Terraform. A short recap is also provided.

It focuses on the operational design considerations of Terraform for ACI, such as state and folder structure, rather than module design.

For a great starting point on Terraform module design for ACI, see the Nexus as Code (NaC) project. NaC is used in the paper's examples, which has two main benefits:

The configuration (YAML) is simpler to understand, which allows you to focus on the design considerations relating to how Terraform works
Cisco have developed many Terraform modules which NaC uses to configure ACI. If you want to develop your own Terraform modules for ACI, they form a good starting point for understanding module design

The Nexus as Code project uses the Terraform ACI REST Managed resource. Resource calculations have been performed based on NaC unless otherwise stated.

https://developer.cisco.com/docs/nexus-as-code/#!introduction

Finally, a consistent message is present throughout the paper. There is no right or wrong option, only trade-offs which must be considered as part of the overall design and implementation process. Use the questions raised to help guide the decision making process and determine a design which best fits the business requirements.

Terraform Recap¶

Terraform is a tool in the Infrastructure as Code (IaC) category. Rather than directly configuring devices through the CLI or GUI, IaC is a way of describing the desired state with text files. The IaC tool (in this case Terraform) reads these text files and implements the required configuration on the endpoint (in this case ACI).

How Terraform Works¶

When you run commands, Terraform searches the current folder (where the terraform apply command is entered) for one or more configuration files. These files can be written either in JSON (using the .tf.json extension) or in the HashiCorp Configuration Language (HCL) using the .tf extension.

The following link provides detailed information regarding Terraform configuration files.

https://www.terraform.io/docs/configuration/index.html
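
As a minimal sketch of what such a configuration file might contain, the following main.tf declares the CiscoDevNet ACI provider and a single tenant. The URL, credentials, and tenant name are placeholders chosen for illustration and do not come from this paper.

```hcl
# main.tf - a minimal sketch; all values below are placeholders.

terraform {
  required_providers {
    aci = {
      source = "CiscoDevNet/aci"
    }
  }
}

# Connection details for the APIC (hypothetical URL and credentials).
provider "aci" {
  url      = "https://apic.example.com"
  username = "admin"
  password = "example-password"
}

# Desired state: a tenant named "development" should exist.
resource "aci_tenant" "development" {
  name        = "development"
  description = "Managed by Terraform"
}
```

Running terraform apply in the folder containing this file would read it, along with any other .tf files in the same folder, and configure the tenant on the fabric.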

Terraform State¶

A key component of how Terraform works is the state file. In order for Terraform to know what changes need to be made to your infrastructure, it must keep track of the environment. This information is stored by default in a local file named "terraform.tfstate". When you run the terraform apply command, your desired configuration (.tf files) is compared against the current state (.tfstate file) and the difference is calculated.

For example, if a subnet exists within the main.tf configuration file but not within terraform.tfstate, Terraform will configure a new subnet in ACI and update terraform.tfstate.

The opposite is also true. If the subnet exists in terraform.tfstate but not within the main.tf configuration file, Terraform assumes this configuration is not required and will delete the subnet from ACI.

Note: This is a very important point and may result in undesired behaviour if the terraform.tfstate file were to change unexpectedly.
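
The subnet example above could look something like the following sketch. The tenant, bridge domain, and subnet values are hypothetical, and the provider configuration is assumed to exist elsewhere in the same folder.

```hcl
# Hypothetical tenant and bridge domain that hold the subnet.
resource "aci_tenant" "development" {
  name = "development"
}

resource "aci_bridge_domain" "bd_web" {
  tenant_dn = aci_tenant.development.id
  name      = "bd_web"
}

# Present in main.tf but not yet in terraform.tfstate:
#   the next apply creates this subnet in ACI and records it in the state.
# Removed from main.tf while still recorded in terraform.tfstate:
#   the next apply deletes the subnet from ACI.
resource "aci_subnet" "web" {
  parent_dn = aci_bridge_domain.bd_web.id
  ip        = "192.168.10.1/24"
}
```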

Nexus As Code Recap¶

As previously stated, the examples provided in this post will be using the Nexus as Code (NaC) project rather than pure Terraform config written in HCL.

The Nexus as Code (NaC) project makes it easy for customers to gain the benefits of programmability and IaC with minimal effort.

You can find more details about NaC here: https://developer.cisco.com/docs/nexus-as-code

Info

Although ACI configuration is defined as YAML, NaC still uses Terraform under the hood and therefore the same design considerations can be applied when working with Terraform HCL configuration. Any differences will be mentioned where applicable.

Design Considerations Summary¶

In general, when working with IaC there are a number of topics to think about.

Who will be managing the environment? One person, multiple people?
Do they manage all the environment or just a component?
What does the environment look like? What is the size? Are there multiple locations, multiple functions (e.g. dev, test, prod), multiple physical components?
What is the programming/automation experience of the person/team managing the environment?
Is there buy-in from all the relevant stakeholders? e.g. Do some people want to continue using the CLI/GUI and the same processes they have in place today?
How has the environment been configured?
What is the current process to deploy and maintain the environment?
How often does the environment change?
Are there any shared resources or configuration?

The answers to these questions can help guide your design. It's important to note that there is no right or wrong option, only trade-offs which must be considered as part of the overall design and implementation process.

As an example, later on you will see the length of time taken for Terraform to apply large configurations (1000s of resources). When first applied this can take 30 - 60 minutes and any subsequent changes can take 1 - 2 minutes to run.

For a very dynamic environment with many changes each day or week across multiple teams this may be a poor user experience. In that environment it may be decided to split the configuration into multiple folders to reduce the run time. The tradeoff is then having to manage those separate folders and individual state files.

For a very static environment the run time may not be an issue. e.g. an admin that only makes a change every few months may not have a problem waiting 1 - 2 minutes if it means they only have to manage a single folder and state file.

Additionally, when teams are working with the same files at the same time there is potential for configuration conflicts to occur. Tools such as version control systems and structured workflows or pipelines can help to reduce these issues but can increase complexity.

The choice to split the Terraform configuration may instead be based on a business process or workflow. For example, a customer might have a day 1 configuration to bring up the fabric and a separate day 2 configuration for additional tenant policies.

Design Consideration: Single Folder vs Multiple Folders¶

Whether you're working with Terraform on your local machine or have all your files in a VCS (e.g. GitHub), you'll need to decide on a structure. Will you have a single folder/repository with all Terraform configuration, or will you split the config into multiple directories/repositories? As mentioned previously, there are multiple ways to structure your environment, each with advantages and disadvantages.

Some key areas to consider are:

Fault domain size and deployment time
Readability and troubleshooting
Static vs dynamic environments
Roles and responsibilities i.e. RBAC
Dependencies
Business processes

The state file is a key component of how Terraform works and provides a number of benefits. If configuration for a resource exists in the .tf files but does not exist in the state file Terraform will assume it needs to be created.

Conversely, this also provides an easy way to clean up or destroy resources when they're no longer needed. If a resource exists in a state file but the configuration does not exist in the .tf files, Terraform will assume that the resource is no longer needed and will delete it from the endpoint. For example, if you previously created a Bridge Domain in ACI using Terraform and remove the configuration from your .tf file, Terraform will delete the BD from your ACI fabric the next time it runs. Terraform takes dependencies into consideration so that resources are created and destroyed in the correct order.

Terraform will also allow you to simulate the expected changes which will occur based on an updated configuration file. For example, it can tell you if a resource will be created or updated without actually making any changes to the environment.

By default, when Terraform runs a plan or applies a configuration, it first pulls the latest configuration information from the endpoint (e.g. ACI fabric) to refresh the state file.

Based on this information, the answers to the following two questions can help when deciding upon the structure of your environment.

What happens when the state file grows too large?
What happens if the state file is damaged? (e.g. corrupted, deleted)

What Happens When The State File Grows Too Large?

The time it takes Terraform to refresh and apply the configuration increases as the state file grows larger. This could be minutes or even an hour depending on the size of the environment and may lead to a poor user experience.

For example, consider a large ACI fabric managed from a single Terraform config/state file. The environment uses static bindings in a 1 EPG = 1 BD = 1 VLAN design. Almost 6000 resources would be required to configure 20 VLANs on 6 switches, each with 48 ports: the static port bindings alone account for 20 x 6 x 48 = 5760 of them. Creating all of these resources can take a very long time.
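
To see where a number of that size comes from, the sketch below expands every VLAN/switch/port combination with setproduct. The VLAN and node ID ranges are hypothetical; only the counts (20, 6, and 48) come from the example above.

```hcl
locals {
  vlans    = range(101, 121) # 20 VLAN IDs (hypothetical numbering)
  switches = range(101, 107) # 6 leaf node IDs (hypothetical numbering)
  ports    = range(1, 49)    # 48 ports per switch

  # One entry per VLAN/switch/port combination.
  static_bindings = setproduct(local.vlans, local.switches, local.ports)
}

output "static_binding_count" {
  value = length(local.static_bindings) # 20 * 6 * 48 = 5760
}
```

If each of these combinations becomes its own resource, each one is also a separate entry in the state file that has to be refreshed on every run.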

Factors Contributing to Performance Issues with Large Terraform Plans

There are a couple of factors contributing to the long run times mentioned above.

If you are making individual API calls to provision resources, the more calls there are, the longer the run time. If you see this issue with large numbers of static port bindings, have a look at the aci_bulk_epg_to_static_path Terraform resource, which can help reduce the number of API calls.
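
A hedged sketch of how aci_bulk_epg_to_static_path might be used is shown below. The tenant, EPG, pod, node, and VLAN values are hypothetical, and the argument names follow the ACI provider documentation as I understand it, so check them against the provider version you use.

```hcl
# Hypothetical tenant, application profile and EPG used only for illustration.
resource "aci_tenant" "production" {
  name = "production"
}

resource "aci_application_profile" "app" {
  tenant_dn = aci_tenant.production.id
  name      = "app"
}

resource "aci_application_epg" "vlan_101" {
  application_profile_dn = aci_application_profile.app.id
  name                   = "vlan_101"
}

locals {
  # Ports 1-48 on leaf 101 in pod 1 (hypothetical topology).
  paths = [for p in range(1, 49) : "topology/pod-1/paths-101/pathep-[eth1/${p}]"]
}

# A single resource carrying all 48 bindings for the EPG,
# instead of 48 separate static path resources.
resource "aci_bulk_epg_to_static_path" "vlan_101" {
  application_epg_dn = aci_application_epg.vlan_101.id

  dynamic "static_path" {
    for_each = local.paths
    content {
      interface_dn = static_path.value
      encap        = "vlan-101"
    }
  }
}
```

The trade-off is that the bindings for an EPG are then tracked as one resource in the state rather than individually.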

As outlined in the GitHub issues below, certain uses of for_each in Terraform can result in longer run times as "Terraform needs to lookup resource as complete objects containing all instances in order to be able to index them in expressions."

Performance issues when referencing high cardinality resources

Terraform with large set of resources take very long time to run

What Happens If The State File Is Damaged? (e.g. Corrupted, Deleted)¶

The state file is used to determine what resources to create/update/delete from an endpoint (e.g. ACI fabric). If the state file is deleted, Terraform would try to create all resources again. This may cause issues depending on the behavior of the Terraform provider and the endpoint API.

In the example screenshot below, a port group on a Distributed Virtual Switch was created using the vCenter Terraform provider. The state file was then deleted and the plan was applied a second time. Terraform attempted to re-create the port group, but since it already existed an error was returned. Typically in this case the resource would need to be re-imported or the state rebuilt. For a single resource that may not be a difficult task, but it may be a challenge in a real environment with tens, hundreds, or even thousands of resources configured.
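
As a rough illustration of what re-importing a single resource involves, the sketch below uses a Terraform import block. It assumes Terraform 1.5 or later, and it uses an ACI tenant rather than the vCenter port group from the screenshot because the import ID format differs per provider. The tenant name is hypothetical.

```hcl
# Requires Terraform 1.5 or later (import blocks).
# Illustrative only: an ACI tenant, not the vCenter port group.
import {
  to = aci_tenant.development
  id = "uni/tn-development" # the tenant's distinguished name in ACI
}

# The configuration that the imported object is mapped to.
resource "aci_tenant" "development" {
  name = "development"
}
```

Each existing object needs its own mapping like this, which is why rebuilding state for a large environment can become a significant task.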

The behaviour of ACI is different from that of the previous example. As per the APIC API Guide, "Both create and update operations in the REST API are implemented using the POST method, so that if an object does not already exist, it will be created, and if it does already exist, it will be updated to reflect any changes between its existing state and desired state."

https://www.cisco.com/c/en/us/td/docs/dcn/aci/apic/all/apic-rest-api-configuration-guide/cisco-apic-rest-api-configuration-guide-42x-and-later/m_using_the_rest_api.html

This has both advantages and disadvantages.

As per the guide above, "if it does already exist, it will be updated". Imagine a tenant, development, already exists and is managed by Terraform, but the state file is deleted by mistake. After re-applying the Terraform plan, a new state file will be created containing the resources in the configuration files. This is the opposite of what was seen in the example above when re-creating the vCenter port group (where an error was returned and the resource had to be manually re-imported). Since the resource, development, already exists in ACI, no error is returned and the resource is updated. Terraform then creates the state file containing the development properties.

One disadvantage is the potential for things to go wrong if the behaviour is not well understood. Consider a scenario in which a network admin is learning Terraform using the ACI simulator. They configure a tenant-production resource but use the wrong URL. Instead of pointing to the ACI simulator, they point it to their production ACI fabric, which already contains the real tenant-production. Since the tenant already exists, no error will be returned and Terraform will essentially "adopt" the real tenant-production. If they then run the terraform destroy command, they could potentially delete the entire tenant and any resources it contains.

This is where ACI features such as RBAC and Security Domains can help. This will be covered in a later section.

Info

This behaviour is not specific to Terraform but relates to how the APIC API works.

The following example shows the creation of a tenant with 35 resources. The state file is then removed and a new plan runs. It shows the creation of 35 new resources, however no errors about the resources already existing. This is because the resources still exist in APIC and therefore they will be updated.

Reducing The Run Time And Fault Domain Size¶

By default Terraform applies any configuration found in the current folder. This is also where the local state file is stored by default. Therefore, one way to reduce the time it takes to run a Terraform plan is to use separate folders for configuration.

As shown in the example below, you might have one folder for tenant-production, one folder for tenant-development, and one folder for access-policies. Each folder would contain only the configuration that applies to that particular area. After all, does it make sense to refresh all of the tenant-production state when a change only applies to tenant-development?

Any change to one folder, e.g. tenant-development, would not impact configuration/state within the other folders, i.e. tenant-production or access-policies. Not only does segmenting the configuration/state into separate folders reduce the time it takes to run a Terraform plan, it also reduces the fault domain size.

If an issue affects tenant-development, e.g. the configuration or state file is deleted, the configuration/state files within tenant-production and access-policies should not be affected.
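
The folder layout described above might look like the following sketch, shown from the point of view of the tenant-development folder. The file names inside each folder and the tenant name are assumptions made for illustration.

```hcl
# Hypothetical layout: each folder is an independent Terraform root
# with its own configuration and its own state file.
#
#   access-policies/      main.tf   terraform.tfstate
#   tenant-development/   main.tf   terraform.tfstate
#   tenant-production/    main.tf   terraform.tfstate
#
# tenant-development/main.tf (provider "aci" block omitted for brevity)

terraform {
  required_providers {
    aci = {
      source = "CiscoDevNet/aci"
    }
  }

  # Keeps the state file in this folder. This is also where the local
  # backend stores state when no backend is configured at all.
  backend "local" {
    path = "terraform.tfstate"
  }
}

# Only development resources live here, so a plan in this folder
# never refreshes tenant-production or access-policies state.
resource "aci_tenant" "development" {
  name = "development"
}
```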

Applying Role Based Access¶

ACI features such as RBAC and Security Domains can also be used to reduce risk. For example, tenant-development and tenant-production could be configured in separate security domains with different accounts used to access each domain. Credentials used to configure the APIC through tools such as Terraform are subject to the same security policies and restrictions as logging in through the UI. So if an account only has access to a single tenant in the UI, that's all you'll be able to configure with Terraform.

This would also resolve the disadvantage previously discussed regarding the APIC API behaviour. Proper RBAC structure would ensure that the credentials used in the development environment or the simulator are different from those used to access the production tenant. If the incorrect URL was provided, an authorization error would be returned to alert the user.
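
One way to keep credentials separate per environment is to pass them in as variables rather than hard-coding them, as in the sketch below. The variable names are hypothetical, and the account itself would still need to be restricted to the right security domain on the APIC.

```hcl
# Hypothetical variable names; values are supplied per environment,
# for example from a .tfvars file or environment variables.
variable "apic_url" {
  type = string
}

variable "apic_username" {
  type = string
}

variable "apic_password" {
  type      = string
  sensitive = true
}

provider "aci" {
  url      = var.apic_url
  username = var.apic_username # e.g. an account limited to the development security domain
  password = var.apic_password
}
```

With this arrangement, pointing the development credentials at the production URL would fail authorization rather than silently adopting production objects.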

RBAC and security policies can also be applied to a version control system, e.g. Git/GitHub, to further reduce risk.

Info

A folder in the context above refers to a folder on a local machine from where the Terraform commands are run. This folder can also be stored in a version control system such as Git/GitHub, which is discussed later in the paper.

Updating A Property Across Multiple Configurations Or Re-using Config Across Multiple Environments¶

A question that's often brought up regarding multiple folders is "What happens if I need to change a common setting?" Like any IT problem the answer depends on what is required. There are a couple of points to consider.

What needs to change?
How often do things change?

If a property or resource needs to change across 10s or 100s of places but does not change all that frequently, it may not be an issue to go through the configuration and make the update manually. Carefully using the find and replace function of a text editor or writing a small script can also help in this situation.

If a resource needs to change across 10s or 100s of places and the change is made every day, then the design should be revisited. Is it a common resource that could be managed in a separate "shared resources" configuration or in the common tenant?

When planning your Terraform configuration layout think about what resources are shared and which resources or properties will realistically be changed. For example, although you might have 1000 bridge domains segmented across 10s of folders, is it realistic that one day you need to update all BDs to disable ARP flooding?

Forgetting Terraform for a moment, these types of large scale changes are typically associated with a structured process and rolled out in phases. This reduces the size of the fault domain to only a subset of resources. This same process can be applied when using Terraform.
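
As a sketch of how such a phased change could be structured, each folder might expose the setting as a single variable so that one edit updates every BD in that folder, and folders are then changed one at a time. The tenant name, BD names, and variable name are hypothetical.

```hcl
# One setting shared by every bridge domain in this folder.
variable "arp_flood" {
  description = "ARP flooding setting for all BDs managed from this folder"
  type        = string
  default     = "yes" # change to "no" in this folder when its phase is scheduled
}

resource "aci_tenant" "development" {
  name = "development"
}

# One bridge domain per name; all of them pick up var.arp_flood.
resource "aci_bridge_domain" "bd" {
  for_each = toset(["bd_web", "bd_app", "bd_db"])

  tenant_dn = aci_tenant.development.id
  name      = each.key
  arp_flood = var.arp_flood
}
```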

In some cases you may want to use the same configuration between environments and not have it replicated across multiple folders. For example, testing a configuration on one ACI fabric and then applying the same configuration to a production fabric when all tests pass.

There are a couple of options available.

You could manually update the provider "aci" {} Terraform configuration which contains the URL, username, and password to access the fabric. The remote backend configuration would also need to be updated to ensure the correct state file is selected. You will also need to re-initialise the Terraform environment so the new state file is used. This method can be prone to error.
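
The two pieces that would need to be edited by hand might look like the sketch below. The S3 backend is used only as an example of a remote backend, and the bucket, key, region, URL, and credentials are all placeholders.

```hcl
terraform {
  required_providers {
    aci = {
      source = "CiscoDevNet/aci"
    }
  }

  # Example remote backend (an assumption; any remote backend applies).
  backend "s3" {
    bucket = "example-terraform-state"           # hypothetical bucket
    key    = "aci/test-fabric/terraform.tfstate" # edit to the prod key when switching
    region = "eu-west-1"
  }
}

provider "aci" {
  url      = "https://apic-test.example.com" # edit to the prod APIC when switching
  username = "terraform"
  password = "example-password"
}
```

Both edits have to be made together, followed by terraform init, which is what makes this approach easy to get wrong.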

Another option is to use Terraform Workspaces, which allow you to use the same configuration but maintain one individual state file per workspace. This still means you need to update the configuration with the correct credentials and properties relevant to a particular workspace. To do this you could maintain different .tfvars files with the values for each environment. Conditionals (if/then/else) and variable substitution allow you to create a generic Terraform configuration. When a Terraform plan runs, an environment's .tfvars file is used to update the configuration with that environment's values. This method can also be prone to errors, as it is easy to forget to switch workspaces before applying configuration changes.
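
A minimal sketch of the workspace approach follows. The workspace names, variable names, and .tfvars file names are hypothetical; terraform.workspace is the built-in value holding the name of the currently selected workspace.

```hcl
# Values come from a per-environment file passed with -var-file, e.g.
#   test.tfvars:  apic_url = "https://apic-test.example.com"
#   prod.tfvars:  apic_url = "https://apic-prod.example.com"
variable "apic_url" {
  type = string
}

variable "apic_username" {
  type = string
}

variable "apic_password" {
  type      = string
  sensitive = true
}

provider "aci" {
  url      = var.apic_url
  username = var.apic_username
  password = var.apic_password
}

locals {
  # True only when the workspace named "prod" is selected.
  is_prod = terraform.workspace == "prod"
}

# Same configuration in every workspace; only the values differ.
resource "aci_tenant" "app" {
  name        = "app"
  description = local.is_prod ? "Production tenant" : "Test tenant (${terraform.workspace})"
}
```

Note that the selected workspace and the .tfvars file are chosen independently, so nothing in this sketch stops prod.tfvars being applied while the test workspace is selected.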

A third option is to take the common resources and build a Terraform module with the configuration. Terraform plans can then use this module to have consistent configuration but maintain separate state files and environment specific variables.
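
As a sketch, each environment's folder could call the same module with its own values. The module path and its name input are hypothetical and stand in for whatever common resources you choose to package.

```hcl
# The shared module, e.g. modules/tenant/main.tf, might contain:
#   variable "name" { type = string }
#   resource "aci_tenant" "this" { name = var.name }

# tenant-development/main.tf
# (provider and backend configuration for this environment omitted)
module "tenant" {
  source = "../modules/tenant" # hypothetical local module path
  name   = "development"
}
```

A tenant-production folder would contain the same module block with different values, its own provider credentials, and its own state file.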

The Nexus as Code project is a great example of module design in Terraform, in particular building modules for Cisco ACI.

Nexus as Code translates ACI configuration in the form of YAML files into Terraform code. Just like in the example above, with the NaC modules the YAML (ACI configuration) files can be stored in a single folder while multiple Terraform plans and state files work with the same configuration.

Info

The three options discussed fit into scenarios where the configuration is consistent e.g. dev, test, prod fabrics. Only the environment specific values such as fabric credentials differ. A change can be developed and tested in dev/test and the same configuration then pushed to prod.

To get the most benefits this could also be combined with the designs found later in the paper.

Bonus NaC Capabilities¶

While the focus of this paper is on general Terraform design considerations with ACI, the architecture of Nexus as Code provides an additional advantage. As previously mentioned, when using Terraform, all configuration found in files ending in either .tf (pure Terraform) or .yaml (NaC) will be applied. If you want to manage only a subset of the configuration, you need to separate the files into different folders.

As can be seen in the diagrams below, the NaC modules have flags which allow you to enable only the areas you want a given Terraform plan to configure. This allows you to have a single folder (data in this case) with all configuration, but separate plans and state files to manage the various areas.

You may still want to use separate repositories or folders, however, to separate domains and responsibilities across teams.
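
A hedged sketch of this pattern is shown below for a plan that manages only access policies. The flag and input names are given as I recall them from the NaC ACI module documentation, and the folder names are assumptions, so verify both against the module version you use and pin that version.

```hcl
# access-policies/main.tf
# (provider "aci" configuration omitted for brevity)
module "aci" {
  source = "netascode/nac-aci/aci"
  # Pin a module version here in real use.

  # All YAML lives in one shared folder (hypothetical relative path).
  yaml_directories = ["../data"]

  # This plan and state file manage access policies only.
  manage_access_policies = true
  manage_tenants         = false
}

# A second folder, e.g. tenants/main.tf, could point at the same
# ../data directory with manage_tenants = true instead, giving it
# a separate plan and state file over the same YAML.
```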

Single File vs Multiple Files?¶

So far we've looked at design considerations relating to folder structure and how to segment your configuration into more manageable components. There's also a second layer to consider which is the file structure within an individual folder. Whether you're working with Terraform .tf files or Nexus as Code YAML files, you have the possibility to use either a single configuration file or multiple configuration files. You can also use whatever name you want for each file (as long as it ends in .tf or .tf.json when using pure Terraform or .yaml or .yml when using Nexus as Code).

As stated previously, Terraform applies all configuration found in the current folder, so using multiple files will provide the same result as a single file with all configuration. The main advantage of using multiple files is for readability. A minimal delay will most likely be seen when the files are concatenated at the start of the plan.

When working with small configurations a single file may be suitable, however think about if you are trying to understand the configuration applied to a large ACI tenant. Would you find it easier to work with a single 3000 line file or would it be more helpful to have bridge_domains.tf, epgs.tf, and contracts.tf?
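
The sketch below shows how such a split might look, with file boundaries marked by comments. The tenant, VRF, BD, and EPG names are hypothetical. Because Terraform loads every .tf file in the folder together, a resource in one file can reference a resource defined in another.

```hcl
# --- tenant.tf ---
resource "aci_tenant" "production" {
  name = "production"
}

resource "aci_vrf" "main" {
  tenant_dn = aci_tenant.production.id
  name      = "main"
}

resource "aci_application_profile" "app" {
  tenant_dn = aci_tenant.production.id
  name      = "app"
}

# --- bridge_domains.tf ---
resource "aci_bridge_domain" "bd_web" {
  tenant_dn          = aci_tenant.production.id
  name               = "bd_web"
  relation_fv_rs_ctx = aci_vrf.main.id
}

# --- epgs.tf ---
resource "aci_application_epg" "web" {
  application_profile_dn = aci_application_profile.app.id
  name                   = "web"
  relation_fv_rs_bd      = aci_bridge_domain.bd_web.id # defined in bridge_domains.tf
}
```

Moving these blocks between files changes nothing about the plan; it only changes where a reader has to look.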

There are three examples shown in the screenshots above. The first one is a single file containing the configuration of all resources in a tenant (i.e. tenant, VRF, BDs, EPGs, contracts).

The medium example has resources separated into different files (bridge_domains, epgs, vrf_and_contracts).

In the large example you can see that the BDs and EPGs have been broken out into smaller files. Based on the file name you could assume that the configuration for bridge domains 192.168.10.0 to 192.168.50.0 can be found in the bds_192.168.10.0-50.0.yaml file.

Again, the main advantage of structuring configuration in this way is for readability. There is no hard rule on when or how to split across multiple files but think about what will be most helpful to you and your team. A single file with 10000 lines of configuration could probably be split into smaller files with meaningful names, but on the opposite side you probably don't want 1000 files each only containing a couple of lines of configuration.

1 large folder vs 1000s of tiny folders + 1 large file vs 1000s of tiny files = ?¶

The previous two sections have discussed:

Separate folders containing configuration which reduces risk and decreases run/apply time - e.g. tenant-production and tenant-development
Within a folder splitting the configuration into multiple files for readability - e.g. within tenant-production having epgs, bridge_domains, and contracts all in separate files

This still leaves the question of how small or large these folders/files should be and how the configuration should be split. Should there be a single folder with all configuration? Should a folder contain 10, 50, 100, or 500 EPGs? This is going to depend on the environment, how it's configured, who's managing it, and how often it changes, i.e. all the questions that were posed at the beginning of the "Design Considerations Summary".

Onto the next chapter¶

Part 2 - Terraform Considerations Related to ACI Network Connectivity Options
