Simplifying ACI - What features help me manage a data center fabric?¶
Estimated time to read: 32 minutes
- Originally Written: June, 2024
Info
This post uses the Nexus as Code (NaC) project, which makes it very easy to configure ACI fabrics through YAML files. More details and examples can be found at https://developer.cisco.com/docs/nexus-as-code/#!aci-introduction
I speak with many customers about ACI and I'm using this post to collect some of the features they find most helpful in simplifying the roll-out and management of their data center fabrics.
The features¶
- Automated fabric standup
- Nexus as Code to configure everything
- ACI Built-in workflows
- VM Manager integration for visibility and configuration
- Endpoint database for visibility and troubleshooting
- Endpoint Security Groups (ESGs)
- Snapshots and config rollbacks
- Automated upgrades
- Multi-tenancy
Automated fabric standup¶
Benefits
- The guided process and automated cluster installation simplify the initial setup
- Providing the APIC as a packaged appliance (physical or virtual) simplifies deployment by removing the need for you to bring your own server or VM and manually install the software
ACI is managed as a single system. Rather than configuring individual switches through the CLI, with ACI all config is pushed from the controller (known as the APIC). APICs are deployed as either a physical or virtual appliance (i.e. the server is shipped to you with the OS and APIC application already installed) and there are typically 3 or more APICs running as a cluster in a production environment.
There have been many improvements made to the APIC bringup process and these days it's very easy to stand up both the physical and virtual appliances. The following shows the virtual deployment.
In the case of the physical APIC you need to install the server in the rack and connect via the console. You enter a few pieces of information and then switch to a browser to perform the rest of the installation.
Each virtual APIC is deployed as an OVA into a VMware environment. It can run on a server either directly connected to the fabric or across an L3 network (e.g. on a management cluster). The following screenshots will show the process for deploying a directly connected vAPIC cluster.
There are a few inputs to provide as part of the OVA deployment, including the OOB management IP, mask, gateway, and admin password. This provides access to the APIC through the browser to complete the cluster bringup.
There are two networks in the vAPIC, a management and a data/infrastructure network.
The management network provides connectivity to the vAPIC through a browser. It should be accessible via a separate, previously provisioned, management network. Otherwise you would have a chicken and egg situation where you need the switches/network to be provisioned to access the vAPIC but the vAPIC is responsible for configuring the network. In this example the VM Network port group is on a standard vSwitch used for management connectivity.
The INFRA port group is on a distributed vSwitch with physical uplinks directly connected to the Nexus 9000 switches in the ACI fabric. This port group requires specific VLANs to reach the ACI switches.
- VLAN 0: Required for forwarding and receiving LLDP packets to and from the fabric for discovery. LLDP packets sent from the leaf switch will be untagged
- VLAN 3910-3914: We use 3914 as the infrastructure VLAN but you may also need additional VLANs if you configure inband management
For more information on the installation process see the Deploying Cisco Virtual APIC Using VMware vCenter guide.
Once deployed, the rest of the installation can be performed through a browser. You can navigate to the management IP address you provided in the OVA deployment and you should see the following screen.
Note
The following steps of the process are the same for both the virtual and physical APICs
First select the method of deployment. There's also documentation built into the browser if you need more information about an option.
You need to configure the cluster properties such as the cluster size, TEP pool (used for VXLAN), and infrastructure VLAN (used for communication between the fabric and the APICs).
Then add each APIC into the cluster by entering the management IP of the APIC, the controller ID, pod ID, serial number, and out of band management details. When all APICs are added you can start the deployment.
Remember that the management IPs were entered in the wizard when you initially deployed each OVA.
Once the cluster bringup process has completed you can refresh the page and should now have access to a new ACI fabric.
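Once the cluster is up, you can also verify its health over the same REST API the GUI uses. The sketch below is a minimal example using Python and requests; the APIC address and credentials are placeholders, and the infraWiNode attribute names are to the best of my knowledge and worth checking against your release.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder OOB management address
USER, PWD = "admin", "password"    # placeholder lab credentials

s = requests.Session()
s.verify = False  # lab only: the APIC ships with a self-signed certificate

# Authenticate; the session reuses the returned APIC-cookie for later calls
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# infraWiNode lists the APIC cluster members as seen by this controller
r = s.get(f"{APIC}/api/class/infraWiNode.json")
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["infraWiNode"]["attributes"]
    # 'health' should report 'fully-fit' for each APIC once the cluster has converged
    print(attrs.get("id"), attrs.get("addr"), attrs.get("health"))
```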
There are a couple of options to configure your fabric. To simplify the initial setup you can go through the setup wizard and then configure the rest of the ACI fabric through the UI.
Alternatively you can use Nexus as Code to build the rest of the fabric for you.
Nexus as Code to configure everything¶
Benefits
- Get the benefits of programmability and automation without needing to know Terraform or Ansible
- Human errors are minimized by integrating with testing/validation solutions, which helps ensure more reliable network configurations
- Allows for consistent and repeatable network deployments across different environments
- The NaC YAML files contain only the configuration which has been applied to the ACI fabric. This makes it very easy to understand how an environment was configured and can help reduce troubleshooting time
While Cisco ACI offers very powerful APIs and integrations with tools such as Ansible and Terraform, not all network engineers are developers. The Nexus as Code (NaC) project makes it easy for users to gain the benefits of programmability and Infrastructure as Code with minimal effort. NaC uses Terraform under the hood: the data model (YAML files describing the ACI configuration) is translated into Terraform resources which are then applied to the ACI fabric.
https://developer.cisco.com/docs/nexus-as-code
Nexus as Code configuration to standup entire fabric¶
After deploying the APIC cluster as shown in the previous section, an entire fabric can be provisioned with Nexus as Code configuration stored in a YAML file. This includes the spine and leaf switch registration. This also means it's very easy to add or change configuration. For example, if you want to add a new tenant, just copy an existing tenant config and update it with any details specific to the new tenant.
You could use a single YAML file, however to make it more readable I've split the configuration into multiple files. See the following document for further details and additional design considerations when using Terraform with ACI.
Terraform Design Considerations for Cisco ACI - Single File vs Multiple Files
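Since the configuration is split across several *.nac.yaml files, it can be handy to preview what the combined data model looks like before applying it. The snippet below is a simplified, hypothetical illustration (it deep-merges dictionaries and concatenates lists) and is not the actual merge logic used by Nexus as Code; the "data" directory name is an assumption.

```python
import os
from pathlib import Path
import yaml  # pip install pyyaml

# The NaC files in this post use an '!env' tag for secrets; teach the safe loader
# to resolve it from environment variables so the files parse cleanly
yaml.SafeLoader.add_constructor("!env", lambda loader, node: os.environ.get(node.value, ""))

def deep_merge(a: dict, b: dict) -> dict:
    """Recursively merge b into a; lists are concatenated, scalars overwritten."""
    out = dict(a)
    for key, value in b.items():
        if key in out and isinstance(out[key], dict) and isinstance(value, dict):
            out[key] = deep_merge(out[key], value)
        elif key in out and isinstance(out[key], list) and isinstance(value, list):
            out[key] = out[key] + value
        else:
            out[key] = value
    return out

model: dict = {}
# Assumes the *.nac.yaml files shown below live in a local 'data' directory
for path in sorted(Path("data").glob("*.nac.yaml")):
    model = deep_merge(model, yaml.safe_load(path.read_text()) or {})

print(yaml.safe_dump(model, sort_keys=False))
```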
Note
The following configuration is to be used as an example only. You would need to update the values to meet the needs of your own fabric
- The node policies are used for registering the switches and configuring management addresses
node_policies.nac.yaml
apic:
node_policies:
inb_endpoint_group: inb_management_epg
nodes:
- id: 1
role: apic
inb_address: 172.20.132.1/24
inb_gateway: 172.20.132.254
- id: 201
pod: 1
role: spine
serial_number: FD012345678
name: spine-201
oob_address: 10.58.30.210/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.210/24
inb_gateway: 172.20.132.254
- id: 202
pod: 1
role: spine
serial_number: FD123456789
name: spine-202
oob_address: 10.58.30.211/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.211/24
inb_gateway: 172.20.132.254
- id: 101
pod: 1
role: leaf
serial_number: FD234567890
name: leaf-101
oob_address: 10.58.30.215/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.215/24
inb_gateway: 172.20.132.254
- id: 102
pod: 1
role: leaf
serial_number: FD345678901
name: leaf-102
oob_address: 10.58.30.216/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.216/24
inb_gateway: 172.20.132.254
- id: 103
pod: 1
role: leaf
serial_number: FD456789012
name: leaf-103
oob_address: 10.58.30.214/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.214/24
inb_gateway: 172.20.132.254
- In this case we only have a single pod in our lab, however you could configure multiple pods using pod policies
pod_policies.nac.yaml
- These are fabric-wide settings covering things like global policies, NTP/DNS, and vCenter integration
fabric_policies.nac.yaml
apic:
fabric_policies:
external_connectivity_policy:
name: ixn
site_id: 0
fabric_id: 1
pod_policies:
date_time_policies:
- name: ntppol
ntp_servers:
- hostname_ip: 69.164.213.136
preferred: true
mgmt_epg: oob
config_passphrase: !env MY_CONFIG_PASSPHRASE
date_time_format:
display_format: utc
pod_profiles:
- name: "pod1_prof"
selectors:
- name: "pod_1_sel"
policy: "pod_1_podpol"
type: "range"
pod_blocks:
- name: "pod_1"
from: 1
to: 1
pod_policy_groups:
- name: "pod_1_podpol"
snmp_policy: default
date_time_policy: ntppol
apic_conn_pref: ooband
banners:
apic_gui_alias: lab - unauthorized access is prohibited
apic_gui_banner_url: lab - unauthorized access is prohibited
apic_cli_banner: lab - unauthorized access is prohibited
switch_cli_banner: lab - unauthorized access is prohibited
ep_loop_protection:
admin_state: true
detection_interval: 180
detection_multiplier: 10
action: bd-learn-disable
global_settings:
domain_validation: true
enforce_subnet_check: true
opflex_authentication: true
disable_remote_endpoint_learn: false
overlapping_vlan_validation: true
remote_leaf_direct: true
reallocate_gipo: false
ptp:
admin_state: true
global_domain: 1
fabric_isis_redistribute_metric: 46
dns_policies:
- name: default
mgmt_epg: oob
providers:
- ip: 8.8.8.8
preferred: true
domains:
- name: depexp.local
default: true
err_disabled_recovery:
interval: 30
ep_move: true
bpdu_guard: true
l2_mtu_policies:
- name: vmm_mtu_pol
port_mtu_size: 9000
l2_port_mtu: 9216
remote_locations:
- name: utilities01
description: ubuntu server
hostname_ip: 10.1.1.1
protocol: scp
path: '/home/files'
port: 22
auth_type: password
username: files
password: !env MY_REMOTE_LOCATION_PASSWORD
mgmt_epg: oob
fabric_bgp_as: 65003
fabric_bgp_rr:
- 201
- 202
switch_policies:
node_control_policies:
- name: default
dom: true
telemetry: telemetry
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
access_mode: read-write
tag_collection: true
vlan_pool: vmm_vlp
vswitch:
mtu_policy: vmm_mtu_pol
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-disabled
credential_policies:
- name: vsphere_local_cred
username: Administrator@vsphere.local
password: !env VCENTER_PASSWORD
vcenters:
- name: mil_vcenter
hostname_ip: 10.2.2.2
datacenter: MILAN-SITE-1
statistics: true
credential_policy: vsphere_local_cred
dvs_version: unmanaged
uplinks:
- id: 1
name: uplink_1
- id: 2
name: uplink_2
- id: 3
name: uplink_3
- The access policies define how interfaces are configured, e.g. which interfaces on which switches are enabled, whether CDP/LLDP is configured, and which VLANs are available on each interface
access_policies.nac.yaml
apic:
access_policies:
infra_vlan: 3914
spine_interface_policy_groups:
- name: ixn_ipg
description: "for ixn link"
aaep: ixn_aaep
leaf_interface_policy_groups:
- name: apic_ipg
description: "for inband management"
type: access
link_level_policy: system-link-level-10G-auto
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-enabled
aaep: apic_aaep
- name: core_ipg
description: "for l3out"
type: access
link_level_policy: system-link-level-10G-auto
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-enabled
aaep: core_aaep
- name: esxi_site_3_pod_1_ipg
description: "for vmm"
type: access
link_level_policy: system-link-level-10G-auto
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-enabled
aaep: esxi_site_2_pod_1_aaep
# spine_interface_profiles:
# - name: spine_201_intprof
# selectors:
# - name: ixn_link
# policy_group: ixn_ipg
# port_blocks:
# - name: 201_1_31
# description: to_ixn
# from_port: 31
# spine_switch_profiles:
# - name: spine_201_swprof
# selectors:
# - name: 201
# node_blocks:
# - name: 201
# from: 201
# interface_profiles:
# - spine_201_intprof
leaf_interface_profiles:
- name: leaf_101_intprof
selectors:
- name: apic
policy_group: apic_ipg
port_blocks:
- name: 101_1_1
description: apic
from_port: 1
- name: esxi
policy_group: esxi_site_3_pod_1_ipg
port_blocks:
- name: 101_1_2
description: esxi_146
from_port: 2
- name: leaf_102_intprof
selectors:
- name: apic
policy_group: apic_ipg
port_blocks:
- name: 102_1_1
description: apic
from_port: 1
- name: esxi
policy_group: esxi_site_3_pod_1_ipg
port_blocks:
- name: 102_1_2
description: esxi_145
from_port: 2
- name: leaf_103_intprof
selectors:
- name: esxi
policy_group: esxi_site_3_pod_1_ipg
port_blocks:
- name: 103_1_1
description: esxi_145
from_port: 1
- name: 103_1_2
description: esxi_146
from_port: 2
- name: 103_1_3
description: esxi_148
from_port: 3
- name: core
policy_group: core_ipg
port_blocks:
- name: 103_1_40
description: core-1
from_port: 40
- name: 103_1_41
description: core-2
from_port: 41
leaf_switch_profiles:
- name: leaf_101_swprof
selectors:
- name: 101
node_blocks:
- name: 101
from: 101
interface_profiles:
- leaf_101_intprof
- name: leaf_102_swprof
selectors:
- name: 102
node_blocks:
- name: 102
from: 102
interface_profiles:
- leaf_102_intprof
- name: leaf_103_swprof
selectors:
- name: 103
node_blocks:
- name: 103
from: 103
interface_profiles:
- leaf_103_intprof
aaeps:
- name: baremetal_aaep
physical_domains:
- baremetal_pdom
- name: apic_aaep
infra_vlan: true
physical_domains:
- inband_pdom
- name: esxi_site_2_pod_1_aaep
routed_domains:
- vmm_l3dom
vmware_vmm_domains:
- mil_3_pod_1_vmm
- name: core_aaep
routed_domains:
- baremetal_l3dom
- name: ixn_aaep
infra_vlan: true
routed_domains:
- ixn_l3dom
physical_domains:
- name: baremetal_pdom
vlan_pool: baremetal_vlp
- name: inband_pdom
vlan_pool: inband_vlp
routed_domains:
- name: baremetal_l3dom
vlan_pool: baremetal_vlp
- name: ixn_l3dom
vlan_pool: ixn_vlp
- name: vmm_l3dom
vlan_pool: vmm_vlp
vlan_pools:
- name: ixn_vlp
description: "vlan 4"
allocation: static
ranges:
- from: 4
to: 4
role: external
allocation: inherit
- name: baremetal_vlp
description: "physical vlan pool"
allocation: static
ranges:
- from: 1100
to: 1299
role: external
allocation: inherit
- name: inband_vlp
description: "inband management vlan pool"
allocation: static
ranges:
- from: 3913
to: 3913
role: external
allocation: inherit
- name: vmm_vlp
description: "vmm vlan pool"
allocation: dynamic
ranges:
- from: 103
to: 1450
role: external
allocation: inherit
- from: 2303
to: 3500
role: external
allocation: inherit
- from: 1451
to: 1500
role: external
allocation: static
- The final configuration is for the tenants. This is where we configure the VRFs, BDs (subnets and SVIs), EPGs (VLANs), firewall integration, and L3Outs
tenant_policies.nac.yaml
---
apic:
tenants:
- name: production
managed: false
vrfs:
- name: vrf-01
bridge_domains:
- name: 192.168.10.0_24
vrf: vrf-01
subnets:
- ip: 192.168.10.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 192.168.20.0_24
vrf: vrf-01
subnets:
- ip: 192.168.20.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 192.168.30.0_24
vrf: vrf-01
subnets:
- ip: 192.168.30.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 192.168.40.0_24
vrf: vrf-01
subnets:
- ip: 192.168.40.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 6.6.6.0_24
alias: pbr_bd
vrf: vrf-01
subnets:
- ip: 6.6.6.1/24
application_profiles:
- name: network-segments
endpoint_groups:
- name: 192.168.10.0_24
bridge_domain: 192.168.10.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
resolution_immediacy: immediate
- name: 192.168.20.0_24
bridge_domain: 192.168.20.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
resolution_immediacy: immediate
- name: 192.168.30.0_24
bridge_domain: 192.168.30.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
resolution_immediacy: immediate
- name: 6.6.6.0_24
alias: pbr_bd
bridge_domain: 6.6.6.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
endpoint_security_groups:
- name: production
vrf: vrf-01
epg_selectors:
- endpoint_group: 192.168.10.0_24
- endpoint_group: 192.168.20.0_24
# We don't need intra-esg isolation as the intra-esg contract will send all traffic to the firewall
intra_esg_isolation: false
contracts:
intra_esgs:
- intra-esg-production
providers:
- permit-to-esg-production
- name: development
vrf: vrf-01
epg_selectors:
- endpoint_group: 192.168.30.0_24
# We don't need intra-esg isolation as the intra-esg contract will send all traffic to the firewall
intra_esg_isolation: false
contracts:
intra_esgs:
- intra-esg-development
providers:
- permit-to-esg-development
filters:
- name: src-any-to-dst
entries:
- name: src-any-to-dst
ethertype: unspecified
contracts:
- name: permit-to-esg-production
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
- name: permit-to-esg-development
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
- name: intra-esg-production
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
- name: intra-esg-development
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
services:
service_graph_templates:
- name: conmurph-ftdv-routed-1
template_type: FW_ROUTED
redirect: true
device:
tenant: production
name: conmurph-ftdv-routed-1
l4l7_devices:
- name: conmurph-ftdv-routed-1
context_aware: single-Context
type: VIRTUAL
vmware_vmm_domain: mil_3_pod_1_vmm
function: GoTo
managed: false
service_type: FW
concrete_devices:
- name: conmurph-ftdv-routed-1
vcenter_name: mil_vcenter
vm_name: conmurph-ftd-1
interfaces:
- name: client
vnic_name: Network adapter 3 # network adapter on the VM which is used for PBR
logical_interfaces:
- name: client
concrete_interfaces:
- device: conmurph-ftdv-routed-1
interface_name: client
redirect_policies:
- name: client
l3_destinations:
- ip: 6.6.6.2
mac: 00:50:56:b6:f3:02 # MAC address of the network adapter 3 from above
device_selection_policies:
- contract: any
service_graph_template: conmurph-ftdv-routed-1
consumer:
l3_destination: true
redirect_policy:
name: client
logical_interface: client
bridge_domain:
name: 6.6.6.0_24
provider:
l3_destination: true
redirect_policy:
name: client
logical_interface: client
bridge_domain:
name: 6.6.6.0_24
l3outs:
- name: floating-l3out-to-csr
vrf: vrf-01
domain: vmm_l3dom
ospf:
area: 0
area_type: regular
node_profiles:
- name: border-leafs
nodes:
- node_id: 1101
router_id: 101.2.1.1
interface_profiles:
- name: mil_3_pod_1_vmm
ospf:
policy: floating-l3out-to-csr
interfaces: # floating SVI
- node_id: 1101
vlan: 500
floating_svi: true
ip: 172.16.100.1/24
paths:
- vmware_vmm_domain: mil_3_pod_1_vmm
floating_ip: 172.16.100.3/24
external_endpoint_groups:
# - name: all-ext-subnets
# contracts:
# consumers:
# - to-firewall-pbr
# subnets:
# - prefix: 0.0.0.0/1
# - prefix: 128.0.0.0/1
- name: 10.1.3.0
contracts:
consumers:
- permit-to-esg-production
- permit-to-esg-development
subnets:
- prefix: 10.1.3.0/24
- name: 10.1.4.0
contracts:
consumers:
- permit-to-esg-production
- permit-to-esg-development
subnets:
- prefix: 10.1.4.0/24
policies:
ospf_interface_policies:
- name: floating-l3out-to-csr
network_type: p2p
ACI Built-in workflows¶
Benefits
- Simplify network configuration
- Reduce the time it takes to make a network configuration change
- The guided process helps minimize the risk of errors
If you don't want to configure everything with Nexus as Code, ACI provides built-in workflows to simplify configuration of some tasks.
Port configuration¶
Before you can configure VRFs, bridge domains, EPGs, and security groups, you first need to enable the switch interfaces and define the VLANs allowed on each interface. With the port configuration workflow you can apply a policy to one or more switches and one or more interfaces on those switches.
The policy includes settings such as the associated VLANs, port speed, and whether or not CDP/LLDP are enabled.
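However the ports are configured (workflow, UI, or Nexus as Code), the resulting interface state can be read back over the API. A rough sketch follows, using the placeholder credentials and the leaf node IDs from the earlier configuration; the ethpmPhysIf attribute names may vary slightly between releases.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# ethpmPhysIf holds the operational state of each front-panel port;
# filter on the DN to restrict the query to leaf 101
params = {"query-target-filter": 'wcard(ethpmPhysIf.dn,"node-101")'}
r = s.get(f"{APIC}/api/class/ethpmPhysIf.json", params=params)
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["ethpmPhysIf"]["attributes"]
    # dn looks like topology/pod-1/node-101/sys/phys-[eth1/1]/phys
    print(attrs["dn"], attrs.get("operSt"), attrs.get("operSpeed"))
```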
Multi-pod¶
The multi-pod wizard makes it very simple to deploy a new ACI pod. A pod is a separate leaf-and-spine network managed by the same APIC cluster. Multi-pod was introduced to isolate failures within a pod and increase the overall design resiliency. Each pod runs separate instances of fabric control-plane protocols such as IS-IS, COOP, and MP-BGP. For more information see the ACI Multi-pod Whitepaper.
First configure the spine to Inter-Pod Network (IPN) device connectivity.
Then set up OSPF peering with the IPN device.
Next you need to set up the routable external Tunnel Endpoint (TEP) and dataplane TEP pools which are used to communicate with remote locations.
Finally confirm that the settings are all correct and let the wizard deploy the new ACI pod.
VMM integration for visibility and configuration¶
Benefits
- Removes the need for the server team to configure the vCenter networking
- You no longer need to think about which VLANs are associated to which portgroups (since it's an automated process)
- If something changes on the virtual switch which breaks communication (e.g. VLAN is manually changed), an event will be shown in ACI
Virtual Machine Manager (VMM) integration creates a connection between ACI and the VM Manager (e.g. VMware vCenter) to give network admins visibility into the virtual environment (e.g. hosts, DVS, VMs) and the ability to configure virtual networking resources. When an EPG is associated with a VMM domain (e.g. vCenter), ACI can create a new portgroup on the ESXi hosts using a VLAN from a pre-configured pool of VLANs.
vCenter Integration
ACI is integrated with vCenter through the public APIs, i.e. the same ones you would use if you were to configure vCenter with Ansible, Terraform, or a Python script
There is no hard requirement to use VMM integration in ACI. Some customers have separate teams managing the servers and network and prefer to keep the configuration of these environments separate. In that case you can setup connectivity to virtualized hosts using static ports, the same way you would connect ACI to a bare metal server.
Reference: BRKACI-2645: ACI Troubleshooting - VMware vDS VMM Integration
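Because the VMM inventory is pulled into the APIC, you can also read it back programmatically. The sketch below queries the compVm objects that represent the vCenter VMs; the class and attribute names are my best understanding and worth verifying against your release.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# compVm objects are the virtual machines learned through the VMM integration;
# their DN shows which VMM domain and controller (vCenter) they were learned from
r = s.get(f"{APIC}/api/class/compVm.json")
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["compVm"]["attributes"]
    print(attrs.get("name"), attrs["dn"])
```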
In this example a new EPG is created in the 192.168.10.0/24 subnet and a VMM domain associated.
- You will see the portgroup automatically created in vCenter
- The VM is attached and can then reach the default gateway (192.168.10.254/24)
- Connectivity is lost when the VLAN is manually changed on the DVS
- You will see a fault straight away in ACI showing the port encapsulation was changed
- The connectivity is restored when ACI resyncs the config
- There are multiple ways to trigger the resync but I find it easy to just update the EPG description
Endpoint database for visibility and troubleshooting¶
Benefits
- Simpler troubleshooting as you can query the global endpoint database to see if/where an endpoint was learned
- Leaf switches don't need to learn all remote endpoints (they just send the packet to the spine switch)
Since ACI is a single system, any locally connected endpoints will be learned by the leaf switches and stored in a global database on the spine switches. You can find more details about ACI endpoint learning from the ACI Fabric Endpoint Learning White Paper and there's also a great Cisco Live session covering ACI Forwarding Behavior.
In this example I am showing the global endpoint database. I then show the endpoint table and IP drops for the conmurph-01 tenant. Again this is very helpful when troubleshooting to see if and how an endpoint was learned, as well as to which interface and VLAN the endpoint is connected.
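The same endpoint information is exposed through the fvCEp class, which is handy for scripted troubleshooting. A minimal sketch, assuming placeholder credentials; the filter narrows the output to a single example MAC address.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# fvCEp is the learned endpoint (the same data behind the GUI endpoint database);
# the DN encodes the tenant/application profile/EPG it was learned in
params = {"query-target-filter": 'eq(fvCEp.mac,"00:50:56:B6:F3:02")'}  # example MAC
r = s.get(f"{APIC}/api/class/fvCEp.json", params=params)
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["fvCEp"]["attributes"]
    print(attrs.get("mac"), attrs.get("ip"), attrs.get("encap"), attrs["dn"])
```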
Endpoint Security Groups (ESGs)¶
Benefits
- Better understand to which switch/host endpoints are connected across the fabric
- Apply flexible security policies to bare metal, virtualized, or containerized workloads regardless of which subnet/VLAN/switch an endpoint belongs to
When migrating to ACI, many if not most customers I've spoken with use a "network-centric" configuration style when creating BDs and EPGs. This means one EPG (think of it like a VLAN) is associated to one Bridge Domain (like a subnet/SVI) and all of these EPGs and BDs are a part of the same VRF.
The problem is that an application service or tier may be spread across multiple subnets, which makes it a lot harder to classify an application tier into security domains using the traditional method (contracts between EPGs).
Endpoint Security Groups (ESGs) are essentially collections of endpoints that are subjected to the same set of security policies and rules. Unlike EPGs, which are associated at the bridge domain level (i.e. an EPG can only be tied to one BD), ESGs exist at the VRF level. Therefore I could create a security group which has endpoints from multiple different BDs/EPGs (i.e. subnets and VLANs).
You could create multiple ESGs, one per application, similar to the example below. Alternatively you might want to create one ESG per environment (e.g. production, dev, test) or even a more granular design of one ESG per application tier (e.g. web, app, database).
There is no single right way to implement ESGs, however the Cisco ACI Endpoint Security Group (ESG) Design Guide and Cisco Live session provide some good insights.
Using ESGs also provides visibility into your environment. By classifying traffic into an application ESG you can start to see where the endpoints of that application connect, e.g. what is the VM name and to which switch/port/VLAN is it connected?
There are two ESGs in the following example, production and development. The production ESG contains any endpoints in the 192.168.10.0/24 or 192.168.20.0/24 subnets (you could also classify based on individual MACs or IPs or even VM tags). The development ESG contains the 192.168.30.0/24 subnet.
As you can see in the video, traffic flows between the 10.0 and 20.0 subnets since they are in the same security group; however, when the 20.0 subnet is removed from the ESG the traffic stops. Traffic is not permitted between the production endpoints and the development endpoints until a contract is consumed/provided by both security groups.
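If you prefer to check the classification programmatically, ESGs are modelled as fvESg objects under an application profile. A rough sketch with placeholder credentials; only the name and DN attributes are assumed here.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# fvESg objects are the configured Endpoint Security Groups; the DN shows the
# tenant and application profile they belong to (e.g. uni/tn-production/ap-.../esg-production)
r = s.get(f"{APIC}/api/class/fvESg.json")
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["fvESg"]["attributes"]
    print(attrs.get("name"), attrs["dn"])
```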
Snapshots and config rollbacks¶
Benefits
- Config snapshot and rollback out of the box
- Scheduled snapshot exports to an external location are available
- Applies to the entire DC fabric and not just individual devices
There are different ways to back up network device config. For example it could be done manually, you could write a script, or use software like RANCID. ACI has config snapshots and rollbacks built into the controller and snapshots can be taken on a per-tenant or fabric-wide basis. They can also be scheduled (e.g. daily) and exported to an external repo for backup.
APIC Snapshots and Infrastructure as Code
When working with Infrastructure as Code, one of the best practices is to have a single source of truth. Having multiple configuration methods (e.g. manually configuring some resources) may cause your infrastructure to drift from the desired configuration in your IaC files. This also applies to ACI snapshots, as they can be thought of as another source of truth, and you may run into issues if, for example, you use snapshots and Terraform together.
See the following section in the Terraform Design Considerations for Cisco ACI - Part 4 for more details and examples.
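Snapshots can also be triggered from the API, which is useful right before an automated change. The sketch below posts a one-time configExportP policy scoped to a tenant; the attribute values reflect my understanding of the object and should be validated in a lab first, and the policy/tenant names are placeholders.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# A configExportP policy with snapshot=true and adminSt=triggered takes a one-time
# snapshot; targetDn scopes it to a single tenant instead of the whole fabric
payload = {
    "configExportP": {
        "attributes": {
            "name": "pre-change-snapshot",
            "format": "json",
            "snapshot": "true",
            "adminSt": "triggered",
            "targetDn": "uni/tn-production",   # omit to snapshot the whole fabric
        }
    }
}
r = s.post(f"{APIC}/api/mo/uni/fabric/configexp-pre-change-snapshot.json", json=payload)
r.raise_for_status()
print("Snapshot triggered:", r.status_code)
```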
In this example I have a number of EPGs, BDs, a VRF, and an L3out peering to an external router.
- I first take a snapshot of my production tenant
- As you can see in the video, you could also schedule a recurring backup and export the snapshot to an external location
- I delete the VRF and L3Out
- The config is then reverted to the previous snapshot and the VRF and L3Out return
Automated upgrades¶
Benefits
- Reduced fabric upgrade time through pre-downloaded images and parallel upgrades
- Reduced risk of unplanned downtime by using pre-validation checklists and automatic APIC/switch upgrades on deployment
One of the benefits of a controller based fabric is the ability to perform fabric-wide upgrades rather than upgrading one switch at a time. If you think about the upgrade process there are a number of steps that need to take place besides just installing the new image.
- First obtain the image
- Upload it to the device
- Perform the upgrade
- Recover if something fails
- Move onto the next device and repeat the process
Additionally, there may be other checks you perform around these steps. For example:
- Checking there's enough storage on the device
- Checking all hosts are dual connected so they won't be disrupted when the switch is upgraded
- Pre-upgrading the devices before bringing them into the fabric
ACI has made many improvements to the controller/switch upgrade process, with many enhancements coming in the 4.2(5) and 5.2(1) releases.
Reference: Slide 48 from the Why You Shouldn’t Fear Upgrading Your ACI Fabric Cisco Live session
Here are a few of the enhancements I really like.
Parallel upgrades¶
With ACI you can put switches into upgrade groups and perform an upgrade one group at a time. For example you may want to put one leaf and one spine into the green group and the second leaf and spine into the blue group. You can then upgrade the switches in the green group in parallel and when this process finishes you can perform the parallel upgrade of switches in the blue group.
Reference: Slide 68 from the Why You Shouldn’t Fear Upgrading Your ACI Fabric Cisco Live session
Auto firmware upgrades¶
Whether it's ACI or any other network, it's typically a good idea to run consistent firmware versions across devices as it helps ensure feature, performance, and security consistency across the network. If you need to add or replace a switch in the fabric you would traditionally upgrade its firmware manually before connecting it to the other switches.
ACI has automated this process for both new switches and the controllers. That means you don't need to spend time performing that task manually and you also lower the risk of mismatched firmware because the upgrade process was forgotten or delayed.
When the Auto Firmware Update on Switch Discovery feature is enabled, APIC automatically updates the switch firmware for the following scenarios:
- A new switch discovery with a new node ID
- A switch replacement with an existing node ID
- Initialization and rediscovery of an existing node
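After an upgrade, or after auto firmware update has brought a new switch in, it's easy to spot-check that every node is running the expected version. A hedged sketch using the firmwareRunning (switches) and firmwareCtrlrRunning (APICs) classes; attribute names may vary by release and the credentials are placeholders.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# firmwareRunning = running image on the switches, firmwareCtrlrRunning = on the APICs
for cls in ("firmwareRunning", "firmwareCtrlrRunning"):
    r = s.get(f"{APIC}/api/class/{cls}.json")
    r.raise_for_status()
    for obj in r.json()["imdata"]:
        attrs = obj[cls]["attributes"]
        # the DN identifies the node; 'version' is the running firmware release
        print(attrs["dn"], attrs.get("version"))
```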
Pre-validation checklist¶
There may be a number of things you want to check prior to an upgrade. For example, is there enough storage available on the device to store the new image? Are all hosts dual connected so the upgrade doesn't take down a host with only one connection?
ACI has a built-in health checklist which runs prior to an upgrade. This applies to both the switches and the APIC controllers.
Check Configurations and Conditions That May Cause An Upgrade Failure
Multi-tenancy¶
Benefits
- Provides administrative separation of resources or environments (e.g. business units, or production/development environments)
In ACI a tenant is a container for grouping various policies such as VRFs, Bridge Domains, Endpoint Groups, L3Outs, and Service Graphs. Besides logical separation of objects, tenants can also have access control policies applied. For example, some users may have read/write access to a development tenant but only read access to a production tenant. Additionally, it's possible to assign different switches (nodes) to a security domain so that a user can only configure the switches in the fabric that are part of their domain.
Restricting Access Using Security Domains and Node Rules
There are no exact rules for how tenants should be designed, however in many cases customers use tenants to represent different environments, for example production, testing, and development. In the example below I've created a new security domain which only allows configuration of node-1101.
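To see which security domains a tenant is associated with, you can read the tenant objects together with their security domain children over the API. This is a rough sketch; the aaaDomainRef class name is my best recollection of how the association is modelled and is worth double-checking, and the credentials are placeholders.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# Return each tenant plus the security domains (aaaDomainRef children) attached to it
params = {"rsp-subtree": "children", "rsp-subtree-class": "aaaDomainRef"}
r = s.get(f"{APIC}/api/class/fvTenant.json", params=params)
r.raise_for_status()

for obj in r.json()["imdata"]:
    tenant = obj["fvTenant"]
    domains = [c["aaaDomainRef"]["attributes"]["name"]
               for c in tenant.get("children", [])
               if "aaaDomainRef" in c]
    print(tenant["attributes"]["name"], domains)
```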