Kubernetes Networking - Using the defaults vs optimizing your setup

Estimated time to read: 33 minutes

  • Originally Written: June, 2025

Overview

This post is documenting a few details about Kubernetes networking, the Cilium CNI, default settings vs tuning for performance, and an example of how to connect Kubernetes to Cisco ACI.

Kubernetes Networking

For more general information on Kubernetes networking please have a look at the following link.

https://tl10k.dev/categories/kubernetes/kubernetes-networking-intro/part-1/

We first need access to the Kubernetes cluster before the network can be configured. This can be achieved by connecting via the kubectl CLI tool.

If you're completely new to Kubernetes you might also want to read through the following guides.

There will be a few example configurations provided as YAML. The Kubernetes configuration uses a standard format.

The ACI configuration uses the Network as Code project which makes it easy to deploy a network through a YAML file.

Kubernetes Networking and the CNI

This post will look at the two main components of Kubernetes networking: the CNI and the Kubernetes Service. First we will look at the defaults and then see how to increase performance, visibility, and security.

Kubernetes is an orchestration platform used to automate the deployment, scaling, and management of containerized applications. However, it does not manage all of this by itself. Instead Kubernetes "outsources" some functions such as deploying the containers themselves, some networking, and storage configuration, to third party applications.

For example, Kubernetes makes requests to a container runtime such as containerd or CRI-O, which is the actual software that starts, stops, and deletes the containers. The interface between Kubernetes and the runtime is referred to as the Container Runtime Interface or CRI.

Similarly, Kubernetes uses a third-party plugin such as Calico or Cilium to implement some of the networking in a Kubernetes cluster. This interface is known as the Container Network Interface or CNI.

The CNI (e.g. Cilium) is responsible for assigning an IP address to each pod or Kubernetes Service, creating interfaces on the Kubernetes nodes/pods, and setting up routing/tunnels to ensure all pods within a cluster can communicate. Some CNIs provide additional functionality. For example, Cilium is also able to assign IP addresses to Services of type LoadBalancer rather than having to rely on additional software such as MetalLB.

The CNI typically runs as a pod on each Kubernetes node. In the example of Cilium, the cilium-agent process runs in the cilium pod on each node to manage the eBPF programs used to provide network connectivity for the cluster.

Image reference: Cilium Component Overview

Each node receives a PodCIDR block and the CNI programs each new pod with an IP from this block. You can use the following command to find the pod CIDR assigned to each node.

kubectl get nodes -o custom-columns=NODE:.metadata.name,PODCIDR:.spec.podCIDR

The routes and interfaces are also configured by the CNI.

The default settings

All the CNIs I've worked with ship default configurations that favour compatibility over raw performance. In that case, how does one pod talk to a pod on a different node?

The easiest option is to let the CNI set up the connectivity for you, rather than having to configure routing to the upstream network and advertise your networks yourself. Cilium, Flannel, and Weave use VXLAN tunnels to encapsulate traffic on the node and transport it across the network to the destination node. Calico can use either IPinIP or VXLAN encapsulation.

By encapsulating the traffic within the Kubernetes cluster, the upstream network only needs to understand how to reach the destination node, not how to reach the service or pod network. This provides a simple solution to connect the nodes and pods together. However, there's a tradeoff: you lose visibility in the network (as traffic is encapsulated) and can't apply security policies as granularly. There's also a slight performance overhead which can be an issue in some environments.
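One concrete cost of encapsulation is header overhead. Here is a back-of-the-envelope sketch (assuming VXLAN over IPv4 and a standard 1500-byte underlay MTU; the figures are the standard header sizes, not taken from a specific deployment):

```python
# Rough VXLAN overhead calculation: each pod-to-pod frame is wrapped in
# outer IPv4 + UDP + VXLAN headers, and the inner frame keeps its own
# Ethernet header, which is why tunnel-mode pod interfaces typically
# run a reduced MTU.

OUTER_IPV4 = 20      # bytes
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14  # the encapsulated frame's own Ethernet header

underlay_mtu = 1500
pod_mtu = underlay_mtu - (OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET)
print(pod_mtu)  # → 1450
```

Every encapsulated packet carries those extra 50 bytes and the CPU cost of adding/removing them, which is the overhead the tuning section later tries to avoid.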

The second half of the post will look at how to tune the environment to address these tradeoffs.

Should I care about pod IPs?

In a Kubernetes cluster you may have more than a single replica for a workload, e.g. I might have 10 pods serving my application, and when I need to scale up for performance I just add another 10. In that case, to which pod IP address should I connect?

Kubernetes Services abstract away this problem and help you to track and manage connectivity to your pods. They are a native concept to Kubernetes, meaning they do not rely on an external plugin as we saw with pod communication.

There are a few key benefits that Kubernetes Services provide.

  • Tracking pods
  • Providing access within the cluster
  • Providing access from outside the cluster

Tracking pods

Labels and selectors are very important concepts in Kubernetes.

Labels are key/value pairs that are attached to objects, such as pods [and] are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users. Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects to carry the same label(s)

Via a label selector, the client/user can identify a set of objects

https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/

In this case we can create a new service and track any pods with a specific label (app: hello-node in the following example). Kubernetes will then maintain a reference of the pods (endpoints) including the pod IP.
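The selection logic can be sketched in a few lines of Python. This is only an illustration of how a selector picks out endpoints; the pod names and IPs are invented:

```python
# Hypothetical sketch of how a Service's selector tracks pod endpoints:
# a pod matches when its labels contain every selector key/value pair.

def match_selector(pods, selector):
    """Return the IPs of pods whose labels satisfy the selector."""
    return [
        pod["ip"]
        for pod in pods
        if all(pod["labels"].get(k) == v for k, v in selector.items())
    ]

pods = [
    {"name": "hello-node-1", "ip": "10.42.0.11", "labels": {"app": "hello-node"}},
    {"name": "hello-node-2", "ip": "10.42.1.12", "labels": {"app": "hello-node"}},
    {"name": "other-app-1",  "ip": "10.42.0.13", "labels": {"app": "other"}},
]

endpoints = match_selector(pods, {"app": "hello-node"})
print(endpoints)  # the two hello-node pod IPs
```

Kubernetes keeps this endpoint list up to date automatically as matching pods come and go.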

Internal and external connectivity

Each Kubernetes service will be assigned an internal ClusterIP and a DNS record will be configured (the general format is <service-name>.<namespace>.svc.cluster.local but this may change depending on how your K8s cluster is configured).

You can then access this ClusterIP or DNS record from any pod within the cluster, rather than having to know the IPs of the individual pods.

Additionally, Kubernetes Services can be used to provide access to pods from an external client (e.g. another application or a user). See Part 3 of the Kubernetes Networking post for more details on the available options. This post will show Cilium load balancing as an option in a later section.

But what are Kubernetes Services and what creates them?

You may have seen a kube-proxy pod running on each of your nodes when you deploy a cluster and it's this component which implements services. When a service is created or changed, the kube-proxy pod will configure IPTables rules on each of the Kubernetes nodes. These rules direct traffic hitting the ClusterIP to one of the pods with the specified label.

As you see in the screenshot below, there is an element of randomness when selecting to which pod the traffic should be sent. The iptables statistics module is used to allow traffic to be randomly selected based on a specified probability.

This random selection is performed on each node using the kernel's pseudo-random number generator and therefore there is no consistency in which pod will be selected, i.e. if traffic is sent from a client to two nodes it may be redirected to two different backend pods. In some environments it may be required that a client connection is always sent to the same backend pod, regardless of the ingress node. This is possible with consistent hashing, which is covered in a later section.

There are four pods associated with the hello-node service in the following screenshot. You can also see four rules (last four lines).

  • Initial Rule (0.25 probability)
    • When there are four pods, the traffic needs to be evenly split between them
    • The statistic module matching is based on a uniform distribution, where the probability of a packet being selected is defined by the --probability parameter. It is a value between 0 and 1. For example, --probability 0.5 means each packet has a 50% chance of being matched
    • The first rule is set with a --probability 0.25 to capture 25% of the traffic and send it to the first pod
    • When traversing the IPTables rules, if this rule is matched based on the random selection it's sent to the associated endpoint (KUBE-SEP)
    • If it's not selected it proceeds to the second rule
  • Second Rule (0.33 probability)
    • If the previous rule had matched, the traffic would have been sent to the first pod
    • Since we are now at the second rule it means it didn't match, and we therefore need to select one of the remaining three pods
    • Therefore in this rule the probability is split among three pods or --probability 0.33
    • Like the previous, if this rule is selected the packet is sent to the associated endpoint (KUBE-SEP)
    • If not it proceeds to the third rule
  • Third Rule (0.5 probability)
    • There are now two remaining pods or --probability 0.5
    • If it's selected go to the KUBE-SEP endpoint rule
    • Otherwise go to the final rule
  • Final Rule (No probability)
    • If we get to this point then the first three rules were not selected
    • This means there's only one more pod and we don't need a probability rule
    • This is because any traffic that doesn't match earlier rules will fall through to the next rule or destination, making the fourth pod the default recipient of the remaining traffic
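The cascade above can be simulated to confirm it really splits traffic evenly: with N backends, rule i (counting from zero) matches with probability 1/(N - i), i.e. 0.25, 0.33, and 0.5 for four pods, with the last pod as the fall-through. A small sketch (not kube-proxy code, just the same probability scheme):

```python
# Simulate the iptables "statistic" rule cascade for N backends.
import random

def pick_backend(n_backends, rng):
    for i in range(n_backends - 1):
        # Rule i matches with probability 1/(N - i): 0.25, 0.33, 0.5 for N=4.
        if rng.random() < 1.0 / (n_backends - i):
            return i
    return n_backends - 1  # final rule: no probability, catches the rest

rng = random.Random(42)   # fixed seed so the run is repeatable
trials = 100_000
counts = [0] * 4
for _ in range(trials):
    counts[pick_backend(4, rng)] += 1

shares = [c / trials for c in counts]
print(shares)  # each share is close to 0.25
```

Each backend ends up with roughly a quarter of the connections, which is exactly what the cascading probabilities are designed to achieve.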

You may also have seen that masquerading or Network Address Translation (NAT) is used to send traffic to the Kubernetes endpoint. In this case the destination IP of the packet is changed from the service IP to the pod IP.
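A toy sketch of that DNAT step (all addresses are illustrative, not from a real cluster):

```python
# Minimal illustration of destination NAT: if the packet is addressed to the
# service's ClusterIP, rewrite the destination to the chosen backend pod IP.

def dnat(packet, service_ip, backend_ip):
    if packet["dst"] == service_ip:
        packet = {**packet, "dst": backend_ip}  # rewrite destination only
    return packet

pkt = {"src": "192.168.99.10", "dst": "10.96.0.20", "dport": 80}
out = dnat(pkt, "10.96.0.20", "10.42.1.12")
print(out)  # source unchanged, destination now the pod IP
```

The reverse (SNAT of the reply) restores the service IP as the source so the client never sees the pod's address, which is relevant to the Direct Server Return discussion later.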

Tuning for performance, visibility, and security

The second half of the post will look at a few configuration changes we can make to improve performance, visibility, and security. There will be references to additional resources for further details and optimizations.

Removing kube-proxy and IPTables

There are many CNI options available but Cilium is one of the most popular ones and provides a lot of configuration flexibility.

One of the reasons Cilium became so popular is due to its use of eBPF which allows programs (typically written in the C programming language) to run safely within the Linux kernel rather than within userspace. This gives you access to advanced network visibility, security, and performance optimizations without requiring kernel modifications.

For network admins, think of it like being able to implement your own changes to a routing protocol such as OSPF without having to first wait for the IETF (kernel developers) to update the standard and then have vendors (Linux distributions) update their devices to support the change.

Under the covers

Here are a few more resources if you're interested in what these programs look like and how they're implemented

  • The eBPF programs used by Cilium are located in its GitHub repository under the bpf/ directory at https://github.com/cilium/cilium/tree/main/bpf/

    • bpf_lxc.c: one of the main eBPF programs responsible for handling packet processing for containers. It implements various networking features, such as routing, NAT, and policy enforcement
    • bpf_host.c: responsible for handling traffic on the host network, including interactions with non-containerized workloads
    • bpf_overlay.c: implements functionality for handling overlay networks, such as VXLAN or Geneve
    • maps/: contains definitions for the various eBPF maps used by Cilium programs to store and share data, such as connection tracking or policy enforcement data (More on maps below)
  • The eBPF programs in the bpf/ directory are compiled and loaded into the kernel by the Cilium agent which is running on each Kubernetes node

  • The programs are dynamically attached to various hooks in the kernel. For example:
    • tc (Traffic Control): For managing ingress/egress network traffic
    • XDP (eXpress Data Path): For high-performance packet processing at the driver level
    • kprobes/tracepoints: For observing kernel function calls and events

We saw previously that by default kube-proxy installs IPTables rules for each new service that is created. When a packet reaches the Kubernetes node, it is inspected against the rules until a match is found or the packet reaches the end of the chain (remember the iptables statistics module and probability example).

Since IPTables rules are traversed sequentially, the more services you create, the longer it takes to process a packet and therefore the lower the performance.

eBPF takes a different approach by using maps. This offers a couple of benefits:

  • They provide a way for eBPF programs to communicate with each other
  • They use efficient data structures like hash tables or arrays for lookups, which means the time to process a packet is independent of the number of "rules" or entries in the map
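A toy comparison makes the difference concrete. This is not Cilium code, just an illustration of sequential rule traversal vs a hash-map lookup over invented service entries:

```python
# Contrast iptables-style sequential matching with an eBPF-map-style
# hash lookup. Both return the same backend, but the list scan does work
# proportional to the number of services while the dict lookup does not.

def linear_lookup(rules, service_ip):
    for ip, backend in rules:  # O(n): walk rules until a match is found
        if ip == service_ip:
            return backend
    return None

# 200 fake services: 10.96.0.1 -> pod-1, 10.96.0.2 -> pod-2, ...
rules = [(f"10.96.0.{i}", f"pod-{i}") for i in range(1, 201)]
service_map = dict(rules)  # O(1) average-case lookup, like an eBPF hash map

# The worst case for the list is the last entry; the dict doesn't care.
print(linear_lookup(rules, "10.96.0.200"), service_map["10.96.0.200"])
```

In a real cluster the list would be iptables chains and the dict an in-kernel eBPF map, but the asymptotic behaviour is the same, which is what the latency chart below reflects.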

The following chart shows that as the number of services increases, the HTTP request latency increases when using IPTables but remains constant when using eBPF. This is due to the efficient data structures used.

Reference: https://isovalent.com/blog/post/why-replace-iptables-with-ebpf/

Cilium can also move connection tracking from IPTables to eBPF maps. The IPTables conntrack module tracks the state of network connections (e.g. new, established, related, or invalid), which allows IPTables to implement stateful firewall rules, such as allowing return traffic for established connections without explicitly matching each packet.

Improving the random loadbalancing mechanism with Maglev Consistent Hashing

It was shown earlier in the post that IPTables uses the Kubernetes node's kernel pseudo-random number generator to determine to which backend pod traffic should be sent. Each node operates independently and therefore there is no consistency in which pod will be selected, i.e. if traffic is sent from a client to two nodes it may be redirected to two different backend pods.

In some environments it may be required that a client connection is always sent to the same backend pod, regardless of the ingress node. This is possible through Cilium by configuring the Maglev consistent hashing feature. Maglev creates a pre-computed static lookup table on each node that maps hash values to backend endpoints. The hashes are derived from packet attributes such as source IP, destination IP, and port, and given the same set of inputs and configuration parameters it will always produce the same lookup table.

The Cilium agent running on each Kubernetes node integrates with the K8s API server to receive updates about services (ClusterIP, NodePort, LoadBalancer) and endpoints/pods. Since each K8s node has the same list of backend endpoints for any given service they can all generate the same hashtable.

If the set of backend endpoints for a service changes due to a failure or scaling the number of pods, the Cilium agents on all nodes are notified of the change via the Kubernetes API and can update their respective tables.

eBPF maps are used on each node to store the Maglev lookup table in the kernel for efficient packet processing.
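To make the "same inputs, same table" property concrete, here is a heavily simplified, illustrative take on Maglev-style table population. It is not Cilium's implementation (which uses different hash functions and a large prime table size), but it shows the key idea: each backend walks the table in its own deterministic permutation, so any node given the same backend list builds the identical table.

```python
# Simplified Maglev-style lookup table: deterministic, so every node
# computes the same mapping from hash slot to backend.
import hashlib

def build_table(backends, size=13):  # real tables use a much larger prime
    def h(name, salt):
        return int(hashlib.sha256(f"{name}:{salt}".encode()).hexdigest(), 16)

    table = [None] * size
    next_idx = {b: 0 for b in backends}
    filled = 0
    while filled < size:
        for b in backends:
            # Each backend probes slots in its own permutation of the table.
            offset = h(b, "offset") % size
            skip = h(b, "skip") % (size - 1) + 1  # size is prime, so coprime
            while True:
                slot = (offset + next_idx[b] * skip) % size
                next_idx[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == size:
                break
    return table

backends = ["pod-a", "pod-b", "pod-c"]
# Two "nodes" building the table independently get the same result.
print(build_table(backends) == build_table(backends))  # → True
```

Packet hashes then index into this table, so any node that receives a flow picks the same backend.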


Direct Server Return

Imagine a client connecting to a web server running in a Kubernetes cluster. In the default configuration, a client could potentially connect to any of the Kubernetes nodes for the web service and be redirected to another node in the cluster where the web server pod resides. If using masquerading (NAT) then Cilium would perform Destination NAT (DNAT) to replace the service IP with the IP of the selected backend web server pod.

On the return journey from the web server to the client, the packet will be sent back to the original Kubernetes node and Cilium will use Source NAT (SNAT) to rewrite the pods source IP with the services IP. This allows the client to see the response as coming from the original service IP.

In some situations this behaviour may not be desired, for a couple of reasons:

  • There is additional latency since the return traffic needs to exit via the original node
  • The backend pod can't see the original client's IP as it's masqueraded

To overcome these problems it's possible to implement Direct Server Return (DSR) with Cilium. With DSR, the backend pod sends the response directly to the client without routing the return traffic back through the node which initially handled the request. This works because the backend pod retains the client's original source IP, which allows it to send the response directly.

As shown in step 3 of the diagram below, the service IP/port is used as the source so to the upstream network or client it appears as though the packet is coming from the same node to which it was originally sent.


Additional Cilium tuning

There are many other settings you can configure to further enhance Cilium's performance. The link below has some recommended settings and further details. Here is an example:

helm install cilium cilium/cilium --version 1.17.2 \
  --namespace kube-system \
  --set routingMode=native \
  --set ipv4NativeRoutingCIDR=10.42.0.0/16 \
  --set bpf.datapathMode=netkit \
  --set kubeProxyReplacement=true \
  --set bpf.masquerade=true \
  --set ipv4.enabled=true \
  --set enableIPv4BIGTCP=true \
  --set loadBalancer.algorithm=maglev \
  --set bpf.distributedLRU.enabled=true \
  --set bpf.mapDynamicSizeRatio=0.08 \
  --set bpfClockProbe=true \
  --set bandwidthManager.enabled=true

Breakdown of options

  • --set routingMode=native
    • Uses native routing on the host rather than VXLAN/Geneve encapsulation - more on this later
  • --set ipv4NativeRoutingCIDR=10.42.0.0/16
    • Set the CIDR in which native routing can be performed

      The default behavior is to exclude any destination within the IP allocation CIDR of the local node. If the pod IPs are routable across a wider network, that network can be specified with the option: ipv4-native-routing-cidr: 10.0.0.0/8 (or ipv6-native-routing-cidr: fd00::/100 for IPv6 addresses) in which case all destinations within that CIDR will not be masqueraded.

  • --set bpf.datapathMode=netkit
    • Uses netkit devices instead of veth pairs to attach pods to the network (requires a recent Linux kernel)
  • --set kubeProxyReplacement=true
    • Replaces kube-proxy with Cilium's eBPF-based load balancer
  • --set bpf.masquerade=true
    • Enables eBPF-based SNAT rather than using IPTables
  • --set ipv4.enabled=true
    • Enables IPv4 for the cluster
  • --set enableIPv4BIGTCP=true
    • This provides packet aggregation within the Kubernetes node for increased performance
  • --set loadBalancer.algorithm=maglev
    • Enables Maglev consistent hashing
  • --set bandwidthManager.enabled=true
    • Enables bandwidth management to monitor and enforce traffic limits using eBPF for fair resource usage

Reference: https://docs.cilium.io/en/stable/operations/performance/tuning/

There is also a great KubeCon talk, Turning up Performance to 11: Cilium, NetKit Devices, and Going Big with TCP, which goes into further detail on these configuration options.

Native routing connectivity options

To recap, in the last section we looked at some configuration options to tune Cilium for maximum performance. One of those settings was to enable native routing rather than encapsulation. As we saw towards the start, most CNIs by default are aimed at optimal compatibility. This means encapsulation such as VXLAN is automatically configured to provide pod-to-pod/service-to-service connectivity between Kubernetes nodes.

This can have a performance impact due to the VXLAN/Geneve encapsulation/decapsulation (higher CPU utilization) which leads to increased latency. Instead of using VXLAN/Geneve, native routing (peering with the upstream network) can be configured in environments where the highest performance is required.

There are also a couple of additional benefits.

  • Visibility from the upstream network - We can advertise different resources (pods, service IPs)
  • Additional layer of security within the network - Since we have visibility we can now provide network security policies to permit or deny traffic

This next section will show an example of connecting a Kubernetes cluster running Cilium with an ACI fabric. This is only one example and there are many design options available with any network fabric type (e.g. routed, VXLAN).

Here are some things to consider which may influence how you design your solution.

  • How do the node and pod counts influence CIDR sizing?
  • VM or bare metal? Consider default gateways and vMotion if running VMware
  • Physical connections - Port-Channels / VPCs?
  • Pod-to-pod routing mode: direct routing or VXLAN encapsulation?
  • Single subnet for K8s nodes or split across multiple subnets? (see auto-direct-node-routes later on)
  • BGP per cluster, per rack, or per server?
  • Dedicated ingress/egress nodes or peer all nodes with the fabric?
  • If dedicated, which racks and ToRs should we use for ingress traffic?
  • If dedicated, do we have redundant nodes, and also across racks?
  • If with ACI, standard SVI or floating SVI (ease of configuration)?
  • If using egress nodes, which racks and ToRs should we use for egress traffic and for pinning pod IPs to a consistent external IP?
  • When running multi-cluster setups, how will traffic be routed between clusters (e.g. leveraging Cilium Cluster Mesh or BGP)?
  • How will you handle cluster auto-scaling (if enabled)?
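As an example of the first consideration, here is a small sizing helper. It assumes each node receives its own PodCIDR block; the /24-per-node figure is an assumption for illustration, so check your CNI's per-node CIDR setting:

```python
# How large must the cluster pod CIDR be if every node gets its own
# per-node PodCIDR block (e.g. a /24)?
import math

def cluster_pod_cidr_prefix(max_nodes, per_node_prefix=24):
    """Smallest prefix length whose block covers max_nodes per-node blocks."""
    bits_needed = math.ceil(math.log2(max_nodes))  # bits to address the nodes
    return per_node_prefix - bits_needed

# 100 nodes, each with a /24 PodCIDR, need at least a /17 cluster CIDR.
print(cluster_pod_cidr_prefix(100))  # → 17
```

Running the same check for your expected maximum node count before choosing ipv4NativeRoutingCIDR (seen earlier) avoids having to renumber the cluster later.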

Info

This post uses the Network as Code (NaC) project which makes it very easy to configure ACI fabrics through a YAML file. More details and examples can be found at https://netascode.cisco.com

ACI configuration

In this example I have configured an ACI L3Out which is used for Kubernetes node and service connectivity. The default gateway on the nodes is the SVI/Floating SVI (192.168.2.3/24 in the config below) which you can see in the interface profiles.

The config below also has a subnet, 192.168.99.0_24, which I used to simulate an external client which could access the various services hosted on the Kubernetes cluster. The 99.0 subnet is advertised via BGP to the cluster.

Since this is a lab I just used a single interface and VLAN to connect to the nodes which are running in VMs.

There are multiple ways you could configure the BGP connections but to keep things simple I used a single AS number for the entire cluster. This has the added benefit that we can use dynamic BGP peers (should work for both ACI and NX-OS), which means I can define a subnet, ip: 192.168.2.0/24, instead of an individual neighbour IP. The L3Out then dynamically establishes BGP peering with any IPs in the subnet.

ACI Network as Code Configuration
---
apic:
  tenants:
    - name: conmurph-01
      managed: true

      vrfs:
        - name: vrf-01

      bridge_domains:

        - name: 192.168.99.0_24
          alias: external-client
          vrf: vrf-01
          subnets:
          - ip: 192.168.99.254/24
            public: true
          l3outs:
            - l3out-to-k8s-01-group-01

      application_profiles:
        - name: network-segments
          endpoint_groups:
            - name: 192.168.99.0_24
              bridge_domain: 192.168.99.0_24
              contracts:
                providers:
                  - permit-to-app-1
                  - deny-to-app-2
                  - permit-any
              physical_domains:
                - n9k-sbx-static
              static_ports:
                - node_id: 101
                  port: 10
                  vlan: 2102
                  deployment_immediacy: immediate
                - node_id: 102
                  port: 10
                  vlan: 2102
                  deployment_immediacy: immediate


      filters:
        - name: icmp
          entries:
            - name: icmp
              ethertype: ip
              protocol: icmp
        - name: web
          entries:
            - name: http
              ethertype: ip
              protocol: tcp
              destination_from_port: http
              destination_to_port: http
        - name: all-protocols
          entries:
            - name: any
              ethertype: unspecified

      contracts:
        - name: permit-any
          subjects:
            - name: permit-any
              filters:
                - filter: all-protocols
        - name: permit-to-app-1
          subjects:
            - name: permit-any
              filters:
                - filter: all-protocols
                  action: permit
                  log: true
        - name: deny-to-app-2
          subjects:
            - name: deny-all
              filters:
                - filter: all-protocols
                  action: deny
                  log: true

      l3outs:
        - name: l3out-to-k8s-01-group-01
          vrf: vrf-01
          domain: n9k-sbx-k8s-cluster-01-bgp
          bgp:
            name: bgp-l3out-to-k8s-01-group-01

          node_profiles:
            - name: border-leafs
              nodes:
                - node_id: 101
                  router_id: 101.2.1.1
                  router_id_as_loopback: false

                - node_id: 102
                  router_id: 102.2.1.1
                  router_id_as_loopback: false

              interface_profiles:
                - name: n9k-sbx-leaf-switches
                  interfaces:
                    - node_id: 101
                      #channel: vpc_to_n9k-sbx-servers
                      port: 10
                      vlan: 602
                      svi: true
                      ip: 192.168.2.1/24
                      ip_shared: 192.168.2.3/24
                      bgp_peers:
                      - ip: 192.168.2.0/24
                        local_as: 65151
                        remote_as: 65152
                        as_override: true
                        disable_peer_as_check: true
                    - node_id: 102
                      #channel: vpc_to_n9k-sbx-servers
                      port: 10
                      vlan: 602
                      svi: true
                      ip: 192.168.2.2/24
                      ip_shared: 192.168.2.3/24
                      bgp_peers:
                      - ip: 192.168.2.0/24
                        local_as: 65151
                        remote_as: 65152
                        as_override: true
                        disable_peer_as_check: true

          external_endpoint_groups:
            - name: k8s-ingress-nodes
              subnets:
                - prefix: 192.168.2.0/24
              contracts:
                consumers:
                  - permit-any
            - name: namespace-app-1-subnet
              subnets:
                - prefix: 30.0.10.0/24
              contracts:
                consumers:
                  - permit-to-app-1
            - name: namespace-app-2-subnet
              subnets:
                - prefix: 40.0.10.0/24
              contracts:
                consumers:
                  - deny-to-app-2

Kubernetes cluster BGP configuration

On the K8s side I configured a BGP peering with each of the ACI leafs (there are only two in my lab). In my case each node peers with the upstream network, however you may also want to have dedicated ingress nodes (see the commented-out nodeSelector below as an example). This would allow you to have deterministic traffic paths/patterns as well as different profiles (CPU/memory/storage/network) for the ingress nodes vs the worker nodes.

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: group-01
spec:
  # nodeSelector:
  #   matchLabels:
  #     ingress-group: group-01
  bgpInstances:
  - name: group-01
    localASN: 65152
    peers:
    - name: n9k-sbx-leaf-101
      peerASN: 65151
      peerAddress: 192.168.2.1
      peerConfigRef:
        name: n9k-sbx-leaf-switches-group-01
    - name: n9k-sbx-leaf-102
      peerASN: 65151
      peerAddress: 192.168.2.2
      peerConfigRef:
        name: n9k-sbx-leaf-switches-group-01
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: n9k-sbx-leaf-switches-group-01
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: advertise-group-01

We need a way for each Kubernetes node to know how to reach pods on the other nodes. This could be done via BGP, but Cilium has a nice feature which can be enabled with auto-direct-node-routes: true. When auto-direct-node-routes is configured, Cilium inserts routes for the pod CIDRs of each of the other nodes, as you can see in this screenshot.
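A small illustration of what this ends up installing (node names and addresses are invented, and this works because all nodes share a subnet): for every other node, a route to that node's PodCIDR via that node's IP.

```python
# Sketch of the routes a node would install for its peers' PodCIDRs,
# in the spirit of auto-direct-node-routes.

nodes = {
    "node-1": {"ip": "192.168.2.11", "pod_cidr": "10.42.0.0/24"},
    "node-2": {"ip": "192.168.2.12", "pod_cidr": "10.42.1.0/24"},
    "node-3": {"ip": "192.168.2.13", "pod_cidr": "10.42.2.0/24"},
}

def direct_routes(local, nodes):
    """Routes the local node installs: one per peer node's PodCIDR."""
    return [
        f"{info['pod_cidr']} via {info['ip']}"
        for name, info in sorted(nodes.items())
        if name != local
    ]

print(direct_routes("node-1", nodes))  # routes via node-2 and node-3
```

Each node ends up with a direct next-hop for every peer's pod range, with no encapsulation and no BGP session required between the nodes themselves.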

You may still want to advertise pods to the upstream network for endpoint visibility. In the config below I'm just advertising the LoadBalancerIP service. The dummy key/value in the matchExpressions selector ensures that all services are advertised.

---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: cilium-l2-announcements
spec:
  externalIPs: false
  loadBalancerIPs: true
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: advertise-group-01
  labels:
    advertise: advertise-group-01
spec:
  advertisements:
    # - advertisementType: "PodCIDR"
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchExpressions:
        - {key: somekey, operator: NotIn, values: ['never-used-value']}

Cilium will now advertise /32 host routes for each service from every node.

Cilium LoadBalancer IPAM

You may have noticed in the screenshots above that there are two /32 addresses advertised, 30.0.10.1 and 40.0.10.1. Cilium can also assign IP addresses to services of type LoadBalancer. It's quite flexible, and this can be powerful when combined with the EPG classification on ACI.

As you can see in the configuration below, I've defined two IP pools with a range of IP addresses in each. There are multiple ways you can define how IPs are assigned, but in this example I have it matching on the namespaces used to segment different applications.

---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "app-1"
spec:
  blocks:
  - start: "30.0.10.1"
    stop: "30.0.10.100"
  serviceSelector:
    matchLabels:
      "io.kubernetes.service.namespace": "app-1"

---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "app-2"
spec:
  blocks:
  - start: "40.0.10.1"
    stop: "40.0.10.100"
  serviceSelector:
    matchLabels:
      "io.kubernetes.service.namespace": "app-2"

Any services that are created in these namespaces will be assigned an IP from their respective pools. These addresses are then advertised to the rest of the network via BGP.

Enhancing network security

On the ACI side we can use the ExternalEPG concept to classify traffic. Since we know which ranges are used for each namespace we can essentially classify the namespace into an ExternalEPG. Contracts can then be added to these EPGs to permit or deny traffic. They could even be used to send particular sets of traffic through a firewall for inspection.

In this scenario I associated an application (app-1 and app-2) with a namespace, however you may want a more micro or macro design; it all depends on your organization and requirements. For example you may just want to classify all traffic (0.0.0.0/0 or 0.0.0.0/1 + 128.0.0.0/1) into a single ExternalEPG and then redirect this to a firewall which performs more granular inspection.

Egress gateways

The previous sections were primarily looking at traffic in the external-to-internal direction. In some cases you may also need to identify certain traffic flows that originate from the cluster. For example, imagine you wanted to create a security policy to inspect traffic coming from the Kubernetes cluster and going to a legacy application. It would be helpful to have all this traffic egress from the same node or group of nodes so that you can apply the policy in one location. Cilium provides an egress gateway functionality which allows you to achieve this.

There are two key pieces of configuration; classify which traffic to send through the egress gateway and which node should be used as the egress gateway.

The configuration below is just an example to show the different options, with some commented out. If you think about the example scenario above, we could define the IP range of the legacy application in the destinationCIDRs field. Any traffic from the pods in the cluster that is destined to this CIDR block will egress from the cluster via n9k-sbx-k8s-cluster-01-worker-03 in this example.

This means we can have a deterministic exit point to which we can apply our policy.

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: cilium-egress
spec:
  # selectors:
  #   - podSelector:
  #       matchLabels:
  #         io.kubernetes.pod.namespace: app-1
  destinationCIDRs:
    - 10.25.10.0/24 # legacy application
  egressGateway:
    nodeSelector:
      matchLabels:
        kubernetes.io/hostname: n9k-sbx-k8s-cluster-01-worker-03
    egressIP: 192.168.3.1
    #interface: ens256

Summary and other resources

Hopefully that's given you a clear overview of some of the default behaviour when it comes to Kubernetes networking. We also saw how to configure your environment to provide performance, visibility, and security benefits. There are many resources available with more details and here is a list to get you started.
