Simplifying ACI - What features help me manage a data center fabric?¶
Estimated time to read: 32 minutes
- Originally Written: June, 2024
Info
This post uses the Nexus as Code (NaC) project, which makes it very easy to configure ACI fabrics through YAML files. More details and examples can be found at https://developer.cisco.com/docs/nexus-as-code/#!aci-introduction
I speak with many customers about ACI and I'm using this post to collect some of the features they find most helpful in simplifying the roll-out and management of their data center fabrics.
The features¶
- Automated fabric standup
- Nexus as Code to configure everything
- ACI Built-in workflows
- VM Manager integration for visibility and configuration
- Endpoint database for visibility and troubleshooting
- Endpoint Security Groups (ESGs)
- Snapshots and config rollbacks
- Automated upgrades
- Multi-tenancy
Automated fabric standup¶
Benefits
- The guided process and automated cluster installation simplify the initial setup
- Providing the APIC as a packaged appliance (physical or virtual) simplifies deployment by removing the need for you to bring your own server or VM and manually install the software
ACI is managed as a single system. Rather than configuring individual switches through the CLI, with ACI all config is pushed from the controller (known as the APIC). APICs are deployed as either a physical or virtual appliance (i.e. the server is shipped to you with the OS and APIC application already installed) and there are typically 3 or more APICs running as a cluster in a production environment.
There have been many improvements made to the APIC bringup process and these days it's very easy to stand up both the physical and virtual appliances. The following shows the virtual deployment.
In the case of the physical APIC you need to install the server in the rack and connect via the console. You enter a few pieces of information and then switch to a browser to perform the rest of the installation.
Each virtual APIC is deployed as an OVA into a VMware environment. It can run on a server either directly connected to the fabric or across an L3 network (e.g. on a management cluster). The following screenshots will show the process for deploying a directly connected vAPIC cluster.
There are a few inputs to provide as part of the OVA deployment, including the OOB management IP, mask, gateway, and admin password. This provides access to the APIC through the browser to complete the cluster bringup.
There are two networks in the vAPIC, a management and a data/infrastructure network.
The management network provides connectivity to the vAPIC through a browser. It should be accessible via a separate, previously provisioned, management network. Otherwise you would have a chicken and egg situation where you need the switches/network to be provisioned to access the vAPIC but the vAPIC is responsible for configuring the network. In this example the VM Network port group is on a standard vSwitch used for management connectivity.
The INFRA port group is on a distributed vSwitch with physical uplinks directly connected to the Nexus 9000 switches in the ACI fabric. This port group requires specific VLANs to reach the ACI switches.
- VLAN 0: Required for forwarding and receiving LLDP packets to and from the fabric for discovery. LLDP packets sent from the leaf switch will be untagged
- VLAN 3910-3914: We use 3914 as the infrastructure VLAN but you may also need additional VLANs if you configure inband management
For more information on the installation process see the Deploying Cisco Virtual APIC Using VMware vCenter guide.
Once deployed, the rest of the installation can be performed through a browser. You can navigate to the management IP address you provided in the OVA deployment and you should see the following screen.
Note
The following steps of the process are the same for both the virtual and physical APICs
First select the method of deployment. There's also documentation built into the browser if you need more information about an option.
You need to configure the cluster properties such as the cluster size, TEP pool (used for VXLAN), and infrastructure VLAN (used for communication between the fabric and the APICs).
Then add each APIC into the cluster by entering the management IP of the APIC, the controller ID, pod ID, serial number, and out of band management details. When all APICs are added you can start the deployment.
Remember that the management IPs were entered in the wizard when you initially deployed each OVA.
Once the cluster bringup process has completed you can refresh the page and should now have access to a new ACI fabric.
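Once the cluster is up, you can also verify its health over the same REST API the GUI uses. The sketch below is a minimal example using Python and requests; the APIC address and credentials are placeholders, and the infraWiNode attribute names are to the best of my knowledge and worth checking against your release.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder OOB management address
USER, PWD = "admin", "password"    # placeholder lab credentials

s = requests.Session()
s.verify = False  # lab only: the APIC ships with a self-signed certificate

# Authenticate; the session reuses the returned APIC-cookie for later calls
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# infraWiNode lists the APIC cluster members as seen by this controller
r = s.get(f"{APIC}/api/class/infraWiNode.json")
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["infraWiNode"]["attributes"]
    # 'health' should report 'fully-fit' for each APIC once the cluster has converged
    print(attrs.get("id"), attrs.get("addr"), attrs.get("health"))
```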
There are a couple of options to configure your fabric. To simplify the initial setup you can go through the setup wizard and then configure the rest of the ACI fabric through the UI.
Alternatively you can use Nexus as Code to build the rest of the fabric for you.
Nexus as Code to configure everything¶
Benefits
- Get the benefits of programmability and automation without needing to know Terraform or Ansible
- Human errors are minimized by integrating with testing/validation solutions, which helps ensure more reliable network configurations
- Allows for consistent and repeatable network deployments across different environments
- The NaC YAML files contain only the configuration which has been applied to the ACI fabric. This makes it very easy to understand how an environment was configured and can help reduce troubleshooting time
While Cisco ACI offers very powerful APIs and integrations with tools such as Ansible and Terraform, not all network engineers are developers. The Nexus as Code (NaC) project makes it easy for users to gain the benefits of programmability and Infrastructure as Code with minimal effort. NaC uses Terraform under the hood: the data model (YAML files describing the ACI configuration) is translated into Terraform resources which are then applied to the ACI fabric.
https://developer.cisco.com/docs/nexus-as-code
Nexus as Code configuration to standup entire fabric¶
After deploying the APIC cluster as shown in the previous section, an entire fabric can be provisioned with Nexus as Code configuration stored in a YAML file. This includes the spine and leaf switch registration. This also means it's very easy to add or change configuration. For example, if you want to add a new tenant, just copy an existing tenant config and update it with any details specific to the new tenant.
You could use a single YAML file, however to make it more readable I've split the configuration into multiple files. See the following document for further details and additional design considerations when using Terraform with ACI.
Terraform Design Considerations for Cisco ACI - Single File vs Multiple Files
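Since the configuration is split across several *.nac.yaml files, it can be handy to preview what the combined data model looks like before applying it. The snippet below is a simplified, hypothetical illustration (it deep-merges dictionaries and concatenates lists) and is not the actual merge logic used by Nexus as Code; the "data" directory name is an assumption.

```python
import os
from pathlib import Path
import yaml  # pip install pyyaml

# The NaC files in this post use an '!env' tag for secrets; teach the safe loader
# to resolve it from environment variables so the files parse cleanly
yaml.SafeLoader.add_constructor("!env", lambda loader, node: os.environ.get(node.value, ""))

def deep_merge(a: dict, b: dict) -> dict:
    """Recursively merge b into a; lists are concatenated, scalars overwritten."""
    out = dict(a)
    for key, value in b.items():
        if key in out and isinstance(out[key], dict) and isinstance(value, dict):
            out[key] = deep_merge(out[key], value)
        elif key in out and isinstance(out[key], list) and isinstance(value, list):
            out[key] = out[key] + value
        else:
            out[key] = value
    return out

model: dict = {}
# Assumes the *.nac.yaml files shown below live in a local 'data' directory
for path in sorted(Path("data").glob("*.nac.yaml")):
    model = deep_merge(model, yaml.safe_load(path.read_text()) or {})

print(yaml.safe_dump(model, sort_keys=False))
```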
Note
The following configuration is to be used as an example only. You would need to update the values to meet the needs of your own fabric
- The node policies are used for registering the switches and configuring management addresses
node_policies.nac.yaml
apic:
node_policies:
inb_endpoint_group: inb_management_epg
nodes:
- id: 1
role: apic
inb_address: 172.20.132.1/24
inb_gateway: 172.20.132.254
- id: 201
pod: 1
role: spine
serial_number: FD012345678
name: spine-201
oob_address: 10.58.30.210/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.210/24
inb_gateway: 172.20.132.254
- id: 202
pod: 1
role: spine
serial_number: FD123456789
name: spine-202
oob_address: 10.58.30.211/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.211/24
inb_gateway: 172.20.132.254
- id: 101
pod: 1
role: leaf
serial_number: FD234567890
name: leaf-101
oob_address: 10.58.30.215/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.215/24
inb_gateway: 172.20.132.254
- id: 102
pod: 1
role: leaf
serial_number: FD345678901
name: leaf-102
oob_address: 10.58.30.216/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.216/24
inb_gateway: 172.20.132.254
- id: 103
pod: 1
role: leaf
serial_number: FD456789012
name: leaf-103
oob_address: 10.58.30.214/25
oob_gateway: 10.58.30.254
inb_address: 172.20.132.214/24
inb_gateway: 172.20.132.254
- In this case we only have a single pod in our lab, however you could configure multiple pods using pod policies
pod_policies.nac.yaml
- These are fabric-wide settings covering things like global policies, NTP/DNS, and vCenter integration
fabric_policies.nac.yaml
apic:
fabric_policies:
external_connectivity_policy:
name: ixn
site_id: 0
fabric_id: 1
pod_policies:
date_time_policies:
- name: ntppol
ntp_servers:
- hostname_ip: 69.164.213.136
preferred: true
mgmt_epg: oob
config_passphrase: !env MY_CONFIG_PASSPHRASE
date_time_format:
display_format: utc
pod_profiles:
- name: "pod1_prof"
selectors:
- name: "pod_1_sel"
policy: "pod_1_podpol"
type: "range"
pod_blocks:
- name: "pod_1"
from: 1
to: 1
pod_policy_groups:
- name: "pod_1_podpol"
snmp_policy: default
date_time_policy: ntppol
apic_conn_pref: ooband
banners:
apic_gui_alias: lab - unauthorized access is prohibited
apic_gui_banner_url: lab - unauthorized access is prohibited
apic_cli_banner: lab - unauthorized access is prohibited
switch_cli_banner: lab - unauthorized access is prohibited
ep_loop_protection:
admin_state: true
detection_interval: 180
detection_multiplier: 10
action: bd-learn-disable
global_settings:
domain_validation: true
enforce_subnet_check: true
opflex_authentication: true
disable_remote_endpoint_learn: false
overlapping_vlan_validation: true
remote_leaf_direct: true
reallocate_gipo: false
ptp:
admin_state: true
global_domain: 1
fabric_isis_redistribute_metric: 46
dns_policies:
- name: default
mgmt_epg: oob
providers:
- ip: 8.8.8.8
preferred: true
domains:
- name: depexp.local
default: true
err_disabled_recovery:
interval: 30
ep_move: true
bpdu_guard: true
l2_mtu_policies:
- name: vmm_mtu_pol
port_mtu_size: 9000
l2_port_mtu: 9216
remote_locations:
- name: utilities01
description: ubuntu server
hostname_ip: 10.1.1.1
protocol: scp
path: '/home/files'
port: 22
auth_type: password
username: files
password: !env MY_REMOTE_LOCATION_PASSWORD
mgmt_epg: oob
fabric_bgp_as: 65003
fabric_bgp_rr:
- 201
- 202
switch_policies:
node_control_policies:
- name: default
dom: true
telemetry: telemetry
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
access_mode: read-write
tag_collection: true
vlan_pool: vmm_vlp
vswitch:
mtu_policy: vmm_mtu_pol
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-disabled
credential_policies:
- name: vsphere_local_cred
username: Administrator@vsphere.local
password: !env VCENTER_PASSWORD
vcenters:
- name: mil_vcenter
hostname_ip: 10.2.2.2
datacenter: MILAN-SITE-1
statistics: true
credential_policy: vsphere_local_cred
dvs_version: unmanaged
uplinks:
- id: 1
name: uplink_1
- id: 2
name: uplink_2
- id: 3
name: uplink_3
- The access policies define how interfaces are configured, e.g. which interfaces on which switches are enabled, whether CDP/LLDP is configured, and which VLANs are available on each interface
access_policies.nac.yaml
apic:
access_policies:
infra_vlan: 3914
spine_interface_policy_groups:
- name: ixn_ipg
description: "for ixn link"
aaep: ixn_aaep
leaf_interface_policy_groups:
- name: apic_ipg
description: "for inband management"
type: access
link_level_policy: system-link-level-10G-auto
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-enabled
aaep: apic_aaep
- name: core_ipg
description: "for l3out"
type: access
link_level_policy: system-link-level-10G-auto
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-enabled
aaep: core_aaep
- name: esxi_site_3_pod_1_ipg
description: "for vmm"
type: access
link_level_policy: system-link-level-10G-auto
cdp_policy: system-cdp-enabled
lldp_policy: system-lldp-enabled
aaep: esxi_site_2_pod_1_aaep
# spine_interface_profiles:
# - name: spine_201_intprof
# selectors:
# - name: ixn_link
# policy_group: ixn_ipg
# port_blocks:
# - name: 201_1_31
# description: to_ixn
# from_port: 31
# spine_switch_profiles:
# - name: spine_201_swprof
# selectors:
# - name: 201
# node_blocks:
# - name: 201
# from: 201
# interface_profiles:
# - spine_201_intprof
leaf_interface_profiles:
- name: leaf_101_intprof
selectors:
- name: apic
policy_group: apic_ipg
port_blocks:
- name: 101_1_1
description: apic
from_port: 1
- name: esxi
policy_group: esxi_site_3_pod_1_ipg
port_blocks:
- name: 101_1_2
description: esxi_146
from_port: 2
- name: leaf_102_intprof
selectors:
- name: apic
policy_group: apic_ipg
port_blocks:
- name: 102_1_1
description: apic
from_port: 1
- name: esxi
policy_group: esxi_site_3_pod_1_ipg
port_blocks:
- name: 102_1_2
description: esxi_145
from_port: 2
- name: leaf_103_intprof
selectors:
- name: esxi
policy_group: esxi_site_3_pod_1_ipg
port_blocks:
- name: 103_1_1
description: esxi_145
from_port: 1
- name: 103_1_2
description: esxi_146
from_port: 2
- name: 103_1_3
description: esxi_148
from_port: 3
- name: core
policy_group: core_ipg
port_blocks:
- name: 103_1_40
description: core-1
from_port: 40
- name: 103_1_41
description: core-2
from_port: 41
leaf_switch_profiles:
- name: leaf_101_swprof
selectors:
- name: 101
node_blocks:
- name: 101
from: 101
interface_profiles:
- leaf_101_intprof
- name: leaf_102_swprof
selectors:
- name: 102
node_blocks:
- name: 102
from: 102
interface_profiles:
- leaf_102_intprof
- name: leaf_103_swprof
selectors:
- name: 103
node_blocks:
- name: 103
from: 103
interface_profiles:
- leaf_103_intprof
aaeps:
- name: baremetal_aaep
physical_domains:
- baremetal_pdom
- name: apic_aaep
infra_vlan: true
physical_domains:
- inband_pdom
- name: esxi_site_2_pod_1_aaep
routed_domains:
- vmm_l3dom
vmware_vmm_domains:
- mil_3_pod_1_vmm
- name: core_aaep
routed_domains:
- baremetal_l3dom
- name: ixn_aaep
infra_vlan: true
routed_domains:
- ixn_l3dom
physical_domains:
- name: baremetal_pdom
vlan_pool: baremetal_vlp
- name: inband_pdom
vlan_pool: inband_vlp
routed_domains:
- name: baremetal_l3dom
vlan_pool: baremetal_vlp
- name: ixn_l3dom
vlan_pool: ixn_vlp
- name: vmm_l3dom
vlan_pool: vmm_vlp
vlan_pools:
- name: ixn_vlp
description: "vlan 4"
allocation: static
ranges:
- from: 4
to: 4
role: external
allocation: inherit
- name: baremetal_vlp
description: "physical vlan pool"
allocation: static
ranges:
- from: 1100
to: 1299
role: external
allocation: inherit
- name: inband_vlp
description: "inband management vlan pool"
allocation: static
ranges:
- from: 3913
to: 3913
role: external
allocation: inherit
- name: vmm_vlp
description: "vmm vlan pool"
allocation: dynamic
ranges:
- from: 103
to: 1450
role: external
allocation: inherit
- from: 2303
to: 3500
role: external
allocation: inherit
- from: 1451
to: 1500
role: external
allocation: static
- The final configuration is for the tenants. This is where we configure the VRFs, BDs (subnets and SVIs), EPGs (VLANs), firewall integration, and L3Outs
tenant_policies.nac.yaml
---
apic:
tenants:
- name: production
managed: false
vrfs:
- name: vrf-01
bridge_domains:
- name: 192.168.10.0_24
vrf: vrf-01
subnets:
- ip: 192.168.10.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 192.168.20.0_24
vrf: vrf-01
subnets:
- ip: 192.168.20.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 192.168.30.0_24
vrf: vrf-01
subnets:
- ip: 192.168.30.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 192.168.40.0_24
vrf: vrf-01
subnets:
- ip: 192.168.40.254/24
public: true
l3outs:
- floating-l3out-to-csr
- name: 6.6.6.0_24
alias: pbr_bd
vrf: vrf-01
subnets:
- ip: 6.6.6.1/24
application_profiles:
- name: network-segments
endpoint_groups:
- name: 192.168.10.0_24
bridge_domain: 192.168.10.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
resolution_immediacy: immediate
- name: 192.168.20.0_24
bridge_domain: 192.168.20.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
resolution_immediacy: immediate
- name: 192.168.30.0_24
bridge_domain: 192.168.30.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
resolution_immediacy: immediate
- name: 6.6.6.0_24
alias: pbr_bd
bridge_domain: 6.6.6.0_24
vmware_vmm_domains:
- name: mil_3_pod_1_vmm
endpoint_security_groups:
- name: production
vrf: vrf-01
epg_selectors:
- endpoint_group: 192.168.10.0_24
- endpoint_group: 192.168.20.0_24
# We don't need intra-esg isolation as the intra-esg contract will send all traffic to the firewall
intra_esg_isolation: false
contracts:
intra_esgs:
- intra-esg-production
providers:
- permit-to-esg-production
- name: development
vrf: vrf-01
epg_selectors:
- endpoint_group: 192.168.30.0_24
# We don't need intra-esg isolation as the intra-esg contract will send all traffic to the firewall
intra_esg_isolation: false
contracts:
intra_esgs:
- intra-esg-development
providers:
- permit-to-esg-development
filters:
- name: src-any-to-dst
entries:
- name: src-any-to-dst
ethertype: unspecified
contracts:
- name: permit-to-esg-production
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
- name: permit-to-esg-development
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
- name: intra-esg-production
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
- name: intra-esg-development
subjects:
- name: permit-any
filters:
- filter: src-any-to-dst
service_graph: conmurph-ftdv-routed-1
services:
service_graph_templates:
- name: conmurph-ftdv-routed-1
template_type: FW_ROUTED
redirect: true
device:
tenant: production
name: conmurph-ftdv-routed-1
l4l7_devices:
- name: conmurph-ftdv-routed-1
context_aware: single-Context
type: VIRTUAL
vmware_vmm_domain: mil_3_pod_1_vmm
function: GoTo
managed: false
service_type: FW
concrete_devices:
- name: conmurph-ftdv-routed-1
vcenter_name: mil_vcenter
vm_name: conmurph-ftd-1
interfaces:
- name: client
vnic_name: Network adapter 3 # network adapter on the VM which is used for PBR
logical_interfaces:
- name: client
concrete_interfaces:
- device: conmurph-ftdv-routed-1
interface_name: client
redirect_policies:
- name: client
l3_destinations:
- ip: 6.6.6.2
mac: 00:50:56:b6:f3:02 # MAC address of the network adapter 3 from above
device_selection_policies:
- contract: any
service_graph_template: conmurph-ftdv-routed-1
consumer:
l3_destination: true
redirect_policy:
name: client
logical_interface: client
bridge_domain:
name: 6.6.6.0_24
provider:
l3_destination: true
redirect_policy:
name: client
logical_interface: client
bridge_domain:
name: 6.6.6.0_24
l3outs:
- name: floating-l3out-to-csr
vrf: vrf-01
domain: vmm_l3dom
ospf:
area: 0
area_type: regular
node_profiles:
- name: border-leafs
nodes:
- node_id: 1101
router_id: 101.2.1.1
interface_profiles:
- name: mil_3_pod_1_vmm
ospf:
policy: floating-l3out-to-csr
interfaces: # floating SVI
- node_id: 1101
vlan: 500
floating_svi: true
ip: 172.16.100.1/24
paths:
- vmware_vmm_domain: mil_3_pod_1_vmm
floating_ip: 172.16.100.3/24
external_endpoint_groups:
# - name: all-ext-subnets
# contracts:
# consumers:
# - to-firewall-pbr
# subnets:
# - prefix: 0.0.0.0/1
# - prefix: 128.0.0.0/1
- name: 10.1.3.0
contracts:
consumers:
- permit-to-esg-production
- permit-to-esg-development
subnets:
- prefix: 10.1.3.0/24
- name: 10.1.4.0
contracts:
consumers:
- permit-to-esg-production
- permit-to-esg-development
subnets:
- prefix: 10.1.4.0/24
policies:
ospf_interface_policies:
- name: floating-l3out-to-csr
network_type: p2p
ACI Built-in workflows¶
Benefits
- Simplify network configuration
- Reduce the time it takes to make a network configuration change
- The guided process helps minimize the risk of errors
If you don't want to configure everything with Nexus as Code, ACI provides built-in workflows to simplify configuration of some tasks.
Port configuration¶
Before you can configure VRFs, bridge domains, EPGs, and security groups, you first need to enable the switch interfaces and define the VLANs allowed on each interface. With the port configuration workflow you can apply a policy to one or more switches and one or more interfaces on those switches.
The policy includes settings such as the associated VLANs, port speed, and whether or not CDP/LLDP are enabled.
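However the ports are configured (workflow, UI, or Nexus as Code), the resulting interface state can be read back over the API. A rough sketch follows, using the placeholder credentials and the leaf node IDs from the earlier configuration; the ethpmPhysIf attribute names may vary slightly between releases.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# ethpmPhysIf holds the operational state of each front-panel port;
# filter on the DN to restrict the query to leaf 101
params = {"query-target-filter": 'wcard(ethpmPhysIf.dn,"node-101")'}
r = s.get(f"{APIC}/api/class/ethpmPhysIf.json", params=params)
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["ethpmPhysIf"]["attributes"]
    # dn looks like topology/pod-1/node-101/sys/phys-[eth1/1]/phys
    print(attrs["dn"], attrs.get("operSt"), attrs.get("operSpeed"))
```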
Multi-pod¶
The multi-pod wizard makes it very simple to deploy a new ACI pod. A pod is a separate leaf-and-spine network managed by the same APIC cluster. Multi-pod was introduced to isolate failures within a pod and increase the overall design resiliency. Each pod runs separate instances of fabric control-plane protocols such as IS-IS, COOP, and MP-BGP. For more information see the ACI Multi-pod Whitepaper.
First configure the spine to Inter-Pod Network (IPN) device connectivity.
Then set up OSPF peering with the IPN device.
Next you need to set up the routable external Tunnel Endpoint (TEP) and dataplane TEP pools which are used to communicate with remote locations.
Finally confirm that the settings are all correct and let the wizard deploy the new ACI pod.
VMM integration for visibility and configuration¶
Benefits
- Removes the need for the server team to configure the vCenter networking
- You no longer need to think about which VLANs are associated to which portgroups (since it's an automated process)
- If something changes on the virtual switch which breaks communication (e.g. VLAN is manually changed), an event will be shown in ACI
Virtual Machine Manager (VMM) integration creates a connection between ACI and the VM Manager (e.g. VMware vCenter) to give network admins visibility into the virtual environment (e.g. hosts, DVS, VMs) and the ability to configure virtual networking resources. When an EPG is associated with a VMM domain (e.g. vCenter), ACI can create a new portgroup on the ESXi hosts using a VLAN from a pre-configured pool of VLANs.
vCenter Integration
ACI is integrated with vCenter through the public APIs, i.e. the same ones you would use if you were to configure vCenter with Ansible, Terraform, or a Python script
There is no hard requirement to use VMM integration in ACI. Some customers have separate teams managing the servers and network and prefer to keep the configuration of these environments separate. In that case you can setup connectivity to virtualized hosts using static ports, the same way you would connect ACI to a bare metal server.
Reference: BRKACI-2645: ACI Troubleshooting - VMware vDS VMM Integration
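Because the VMM inventory is pulled into the APIC, you can also read it back programmatically. The sketch below queries the compVm objects that represent the vCenter VMs; the class and attribute names are my best understanding and worth verifying against your release.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# compVm objects are the virtual machines learned through the VMM integration;
# their DN shows which VMM domain and controller (vCenter) they were learned from
r = s.get(f"{APIC}/api/class/compVm.json")
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["compVm"]["attributes"]
    print(attrs.get("name"), attrs["dn"])
```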
In this example a new EPG is created in the 192.168.10.0/24 subnet and a VMM domain associated.
- You will see the portgroup automatically created in vCenter
- The VM is attached and can then reach the default gateway (192.168.10.254/24)
- Connectivity is lost when the VLAN is manually changed on the DVS
- You will see a fault straight away in ACI showing the port encapsulation was changed
- The connectivity is restored when ACI resyncs the config
- There are multiple ways to trigger the resync but I find it easy to just update the EPG description
Endpoint database for visibility and troubleshooting¶
Benefits
- Simpler troubleshooting as you can query the global endpoint database to see if/where an endpoint was learned
- Leaf switches don't need to learn all remote endpoints (they just send the packet to the spine switch)
Since ACI is a single system, any locally connected endpoints will be learned by the leaf switches and stored in a global database on the spine switches. You can find more details about ACI endpoint learning from the ACI Fabric Endpoint Learning White Paper and there's also a great Cisco Live session covering ACI Forwarding Behavior.
In this example I am showing the global endpoint database. I then show the endpoint table and IP drops for the conmurph-01 tenant. Again this is very helpful when troubleshooting to see if and how an endpoint was learned, as well as to which interface and VLAN the endpoint is connected.
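The same endpoint information is exposed through the fvCEp class, which is handy for scripted troubleshooting. A minimal sketch, assuming placeholder credentials; the filter narrows the output to a single example MAC address.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# fvCEp is the learned endpoint (the same data behind the GUI endpoint database);
# the DN encodes the tenant/application profile/EPG it was learned in
params = {"query-target-filter": 'eq(fvCEp.mac,"00:50:56:B6:F3:02")'}  # example MAC
r = s.get(f"{APIC}/api/class/fvCEp.json", params=params)
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["fvCEp"]["attributes"]
    print(attrs.get("mac"), attrs.get("ip"), attrs.get("encap"), attrs["dn"])
```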
Endpoint Security Groups (ESGs)¶
Benefits
- Better understand to which switch/host endpoints are connected across the fabric
- Apply flexible security policies to bare metal, virtualized, or containerized workloads regardless of which subnet/VLAN/switch an endpoint belongs to
When migrating to ACI, many if not most customers I've spoken with use a "network-centric" configuration style when creating BDs and EPGs. This means one EPG (think of it like a VLAN) is associated to one Bridge Domain (like a subnet/SVI) and all of these EPGs and BDs are a part of the same VRF.
The problem is that an application service or tier may be spread across multiple subnets, which makes it a lot harder to classify an application tier into security domains using the traditional method (contracts between EPGs).
Endpoint Security Groups (ESGs) are essentially collections of endpoints that are subjected to the same set of security policies and rules. Unlike EPGs, which are associated at the bridge domain level (i.e. an EPG can only be tied to one BD), ESGs exist at the VRF level. Therefore I could create a security group which has endpoints from multiple different BDs/EPGs (i.e. subnets and VLANs).
You could create multiple ESGs, one per application, similar to the example below. Alternatively you might want to create one ESG per environment (e.g. production, dev, test) or even a more granular design of one ESG per application tier (e.g. web, app, database).
There is no single right way to implement ESGs, however the Cisco ACI Endpoint Security Group (ESG) Design Guide and Cisco Live session provide some good insights.
Using ESGs also provides visibility into your environment. By classifying traffic into an application ESG you can start to see where the endpoints of that application connect, e.g. what is the VM name and to which switch/port/VLAN is it connected?
There are two ESGs in the following example, production and development. The production ESG contains any endpoints in the 192.168.10.0/24 or 192.168.20.0/24 subnets (you could also classify based on individual MACs or IPs or even VM tags). The development ESG contains the 192.168.30.0/24 subnet.
As you can see in the video, traffic flows between the 10.0 and 20.0 subnets since they are in the same security group; however, when the 20.0 subnet is removed from the ESG the traffic stops. Traffic is not permitted between the production endpoints and the development endpoints until a contract is consumed/provided by both security groups.
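If you prefer to check the classification programmatically, ESGs are modelled as fvESg objects under an application profile. A rough sketch with placeholder credentials; only the name and DN attributes are assumed here.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# fvESg objects are the configured Endpoint Security Groups; the DN shows the
# tenant and application profile they belong to (e.g. uni/tn-production/ap-.../esg-production)
r = s.get(f"{APIC}/api/class/fvESg.json")
r.raise_for_status()

for obj in r.json()["imdata"]:
    attrs = obj["fvESg"]["attributes"]
    print(attrs.get("name"), attrs["dn"])
```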
Snapshots and config rollbacks¶
Benefits
- Config snapshot and rollback out of the box
- Scheduled snapshot exports to an external location are available
- Applies to the entire DC fabric and not just individual devices
There are different ways to back up network device config. For example it could be done manually, you could write a script, or use software like RANCID. ACI has config snapshots and rollbacks built into the controller and snapshots can be taken on a per-tenant or fabric-wide basis. They can also be scheduled (e.g. daily) and exported to an external repo for backup.
APIC Snapshots and Infrastructure as Code
When working with Infrastructure as Code, one of the best practices is to have a single source of truth. Having multiple configuration methods (e.g. manually configuring some resources) may cause your infrastructure to drift from the desired configuration in your IaC files. This also applies to ACI snapshots, as they can be thought of as another source of truth, and you may run into issues if, for example, you use snapshots and Terraform together.
See the following section in the Terraform Design Considerations for Cisco ACI - Part 4 for more details and examples.
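Snapshots can also be triggered from the API, which is useful right before an automated change. The sketch below posts a one-time configExportP policy scoped to a tenant; the attribute values reflect my understanding of the object and should be validated in a lab first, and the policy/tenant names are placeholders.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# A configExportP policy with snapshot=true and adminSt=triggered takes a one-time
# snapshot; targetDn scopes it to a single tenant instead of the whole fabric
payload = {
    "configExportP": {
        "attributes": {
            "name": "pre-change-snapshot",
            "format": "json",
            "snapshot": "true",
            "adminSt": "triggered",
            "targetDn": "uni/tn-production",   # omit to snapshot the whole fabric
        }
    }
}
r = s.post(f"{APIC}/api/mo/uni/fabric/configexp-pre-change-snapshot.json", json=payload)
r.raise_for_status()
print("Snapshot triggered:", r.status_code)
```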
In this example I have a number of EPGs, BDs, a VRF, and an L3out peering to an external router.
- I first take a snapshot of my production tenant
- As you can see in the video, you could also schedule a recurring backup and export the snapshot to an external location
- I delete the VRF and L3Out
- The config is then reverted to the previous snapshot and the VRF and L3Out return
Automated upgrades¶
Benefits
- Reduced fabric upgrade time through pre-downloaded images and parallel upgrades
- Reduced risk of unplanned downtime by using pre-validation checklists and automatic APIC/switch upgrades on deployment
One of the benefits of a controller based fabric is the ability to perform fabric-wide upgrades rather than upgrading one switch at a time. If you think about the upgrade process there are a number of steps that need to take place besides just installing the new image.
- First obtain the image
- Upload it to the device
- Perform the upgrade
- Recover if something fails
- Move onto the next device and repeat the process
Additionally, there may be other checks you perform around these steps. For example:
- Checking there's enough storage on the device
- Checking all hosts are dual connected so they won't be disrupted when the switch is upgraded
- Pre-upgrading the devices before bringing them into the fabric
ACI has made many improvements to the controller/switch upgrade process, with many enhancements coming in the 4.2(5) and 5.2(1) releases.
Reference: Slide 48 from the Why You Shouldn’t Fear Upgrading Your ACI Fabric Cisco Live session
Here are a few of the enhancements I really like.
Parallel upgrades¶
With ACI you can put switches into upgrade groups and perform an upgrade one group at a time. For example you may want to put one leaf and one spine into the green group and the second leaf and spine into the blue group. You can then upgrade the switches in the green group in parallel and when this process finishes you can perform the parallel upgrade of switches in the blue group.
Reference: Slide 68 from the Why You Shouldn’t Fear Upgrading Your ACI Fabric Cisco Live session
Auto firmware upgrades¶
Whether it's ACI or any other network, it's typically a good idea to run consistent firmware versions across devices as it helps ensure feature, performance, and security consistency across the network. If you need to add or replace a switch in the fabric you would traditionally upgrade its firmware manually before connecting it to the other switches.
ACI has automated this process for both new switches and the controllers. That means you don't need to spend time performing that task manually and you also lower the risk of mismatched firmware because the upgrade process was forgotten or delayed.
When the Auto Firmware Update on Switch Discovery feature is enabled, APIC automatically updates the switch firmware for the following scenarios:
- A new switch discovery with a new node ID
- A switch replacement with an existing node ID
- Initialization and rediscovery of an existing node
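After an upgrade, or after auto firmware update has brought a new switch in, it's easy to spot-check that every node is running the expected version. A hedged sketch using the firmwareRunning (switches) and firmwareCtrlrRunning (APICs) classes; attribute names may vary by release and the credentials are placeholders.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# firmwareRunning = running image on the switches, firmwareCtrlrRunning = on the APICs
for cls in ("firmwareRunning", "firmwareCtrlrRunning"):
    r = s.get(f"{APIC}/api/class/{cls}.json")
    r.raise_for_status()
    for obj in r.json()["imdata"]:
        attrs = obj[cls]["attributes"]
        # the DN identifies the node; 'version' is the running firmware release
        print(attrs["dn"], attrs.get("version"))
```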
Pre-validation checklist¶
There may be a number of things you want to check prior to an upgrade. For example, is there enough storage available on the device to store the new image? Are all hosts dual connected so the upgrade doesn't take down a host with only one connection?
ACI has a built-in health checklist which runs prior to an upgrade. This applies to both the switches and the APIC controllers.
Check Configurations and Conditions That May Cause An Upgrade Failure
Multi-tenancy¶
Benefits
- Provides administrative separation of resources or environments (e.g. business units, or production/development environments)
In ACI a tenant is a container for grouping various policies such as VRFs, Bridge Domains, Endpoint Groups, L3Outs, and Service Graphs. Besides logical separation of objects, tenants can also have access control policies applied. For example, some users may have read/write access to a development tenant but only read access to a production tenant. Additionally, it's possible to assign different switches (nodes) to a security domain so that a user can only configure the switches in the fabric that are part of their domain.
Restricting Access Using Security Domains and Node Rules
There are no exact rules for how tenants should be designed, however in many cases customers use tenants to represent different environments, for example production, testing, and development. In the example below I've created a new security domain which only allows configuration of node-1101.
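To see which security domains a tenant is associated with, you can read the tenant objects together with their security domain children over the API. This is a rough sketch; the aaaDomainRef class name is my best recollection of how the association is modelled and is worth double-checking, and the credentials are placeholders.

```python
import requests

APIC = "https://10.58.30.200"      # placeholder
USER, PWD = "admin", "password"    # placeholder

s = requests.Session()
s.verify = False
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}).raise_for_status()

# Return each tenant plus the security domains (aaaDomainRef children) attached to it
params = {"rsp-subtree": "children", "rsp-subtree-class": "aaaDomainRef"}
r = s.get(f"{APIC}/api/class/fvTenant.json", params=params)
r.raise_for_status()

for obj in r.json()["imdata"]:
    tenant = obj["fvTenant"]
    domains = [c["aaaDomainRef"]["attributes"]["name"]
               for c in tenant.get("children", [])
               if "aaaDomainRef" in c]
    print(tenant["attributes"]["name"], domains)
```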