Saturday, July 2, 2022

Cisco Nexus Switch Virtual PortChannel (vPC) Failure Scenarios

Here's a Cisco link covering Nexus vPC terminology, best practices and failure scenarios for the Peer-Link and Peer-Keepalive. I simulated the different failure scenarios in my Nexus switch lab.


Peer-Keepalive Failure (mgmt0 via Layer 3):

- Only the heartbeat between the Primary and Secondary Nexus peers will be lost

- vPC adjacency will NOT break/fail

- There's no change in vPC role (Primary/Secondary)

- vPC will still run as normal/forward traffic

- Ensure NMS monitoring for the Nexus mgmt0 interface



N5K-1# show run interface mgmt0

 

!Command: show running-config interface mgmt0

!Time: Mon Jul 19 02:56:36 2021

 

version 7.3(8)N1(1)

 

interface mgmt0

  vrf member management

  ip address 10.10.2.8/23

 

 

N5K-1# show run vpc

!Command: show running-config vpc
!Time: Thu Jul 22 08:30:50 2021

version 7.3(8)N1(1)
feature vpc

vpc domain 1

  role priority 10

  peer-keepalive destination 10.10.2.9 source 10.10.2.8


interface port-channel1
  vpc peer-link

interface port-channel100
  vpc 100


I shut down the switch port connected to N5K-1 mgmt0.

 

SW01#configure terminal

Enter configuration commands, one per line.  End with CNTL/Z.

SW01(config)#interface Gi1/0/6

SW01(config-if)#shutdown

 

 

N5K-1# show interface mgmt0

mgmt0 is down (Link not connected)

 

  Hardware: GigabitEthernet, address: 00de.fb78.0123 (bia 00de.fb78.0112)

  Internet Address is 10.10.2.8/23

 

 

The Peer-Keepalive status changed to "peer is not reachable through peer-keepalive", but the peer status is still "peer adjacency formed ok".

 

N5K-1# 2021 Jul 19 02:28:59 N5K-1 %$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed

 

N5K-1# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

 

vPC domain id                     : 1  

Peer status                       : peer adjacency formed ok     

vPC keep-alive status             : peer is not reachable through peer-keepalive

Configuration consistency status  : success

Per-vlan consistency status       : success                      

Type-2 consistency status         : success

vPC role                          : primary                      

Number of vPCs configured         : 294

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Operational Layer3 Peer-router    : Disabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

 

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans   

--   ----   ------ --------------------------------------------------

1    Po1    up     1,99  

 

<OUTPUT TRUNCATED>

 

 

N5K-1# show vpc peer-keepalive

 

vPC keep-alive status             : peer is not reachable through peer-keepalive

--Send status                   : Success

--Last send at                  : 2021.07.19 02:29:59 804 ms

--Sent on interface             :

--Receive status                : Failed

--Last update from peer         : (65) seconds, (174) msec

 

vPC Keep-alive parameters

--Destination                   : 10.10.2.9

--Keepalive interval            : 1000 msec

--Keepalive timeout             : 5 seconds

--Keepalive hold timeout        : 3 seconds

--Keepalive vrf                 : management

--Keepalive udp port            : 3200

--Keepalive tos                 : 192
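Since a Peer-Keepalive failure does not interrupt traffic, it is easy to miss without monitoring. Below is a minimal Python sketch of a check that an NMS script could run against the text of "show vpc peer-keepalive". The function names parse_peer_keepalive and keepalive_alarms are made up for illustration, and the sample text is a shortened copy of the N5K-1 output above. How the text is collected from the switch (SSH, NX-API, etc.) is left out.

```python
# Shortened copy of the N5K-1 "show vpc peer-keepalive" capture from this post.
SAMPLE = """\
vPC keep-alive status             : peer is not reachable through peer-keepalive
--Send status                   : Success
--Last send at                  : 2021.07.19 02:29:59 804 ms
--Sent on interface             :
--Receive status                : Failed
--Last update from peer         : (65) seconds, (174) msec

vPC Keep-alive parameters
--Destination                   : 10.10.2.9
--Keepalive vrf                 : management
"""


def parse_peer_keepalive(output):
    """Turn the 'name : value' lines into a dict keyed by lower-case name."""
    fields = {}
    for line in output.splitlines():
        if ":" not in line:
            continue  # blank lines and section titles carry no value
        # Split on the first colon only, timestamps contain more colons.
        key, value = line.split(":", 1)
        fields[key.strip().lstrip("-").lower()] = value.strip()
    return fields


def keepalive_alarms(fields):
    """Return a list of human-readable problems, empty if all looks healthy."""
    alarms = []
    status = fields.get("vpc keep-alive status", "unknown")
    if status != "peer is alive":
        alarms.append("keep-alive status: " + status)
    for name in ("send status", "receive status"):
        if fields.get(name) != "Success":
            alarms.append(name + ": " + fields.get(name, "unknown"))
    return alarms


print(keepalive_alarms(parse_peer_keepalive(SAMPLE)))
```

The parser only relies on the "name : value" layout seen in this lab on 7.3(8)N1(1). Other NX-OS releases may format the output differently.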

 

 

The FEX module state is still Online.

 

N5K-1# show fex

  FEX         FEX           FEX              FEX              Fex      

Number    Description      State            Model            Serial    

------------------------------------------------------------------------

100    FEX100                Online   N2K-C2348UPQ-10GE   FOC22401234


 

The same output is seen on the Nexus peer switch.

 

N5K-2# 2021 Jul 19 02:28:59 N5K-2 %$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed

 

N5K-2# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

 

vPC domain id                     : 1  

Peer status                       : peer adjacency formed ok     

vPC keep-alive status             : peer is not reachable through peer-keepalive

Configuration consistency status  : success

Per-vlan consistency status       : success                      

Type-2 consistency status         : success

vPC role                          : secondary                    

Number of vPCs configured         : 294

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Operational Layer3 Peer-router    : Disabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

 

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans   

--   ----   ------ --------------------------------------------------

1    Po1    up     1,99      

 

<OUTPUT TRUNCATED>

 

 

N5K-2# show vpc peer-keepalive

 

vPC keep-alive status             : peer is not reachable through peer-keepalive

--Send status                   : Success

--Last send at                  : 2021.07.19 02:30:46 803 ms

--Sent on interface             : mgmt0

--Receive status                : Failed

--Last update from peer         : (112) seconds, (807) msec

 

vPC Keep-alive parameters

--Destination                   : 10.10.2.8

--Keepalive interval            : 1000 msec

--Keepalive timeout             : 5 seconds

--Keepalive hold timeout        : 3 seconds

--Keepalive vrf                 : management

--Keepalive udp port            : 3200

--Keepalive tos                 : 192

 

 

N5K-2# show fex

  FEX         FEX           FEX              FEX              Fex      

Number    Description      State            Model            Serial    

------------------------------------------------------------------------

100    FEX100                Online   N2K-C2348UPQ-10GE   FOC22401234

 

 

The vPC Peer-Keepalive status immediately changed to alive after I unshut the switch port connected to N5K-1 mgmt0.

 

SW01(config)#interface Gi1/0/6

SW01(config-if)#no shutdown

 

 

N5K-1# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

 

vPC domain id                     : 1  

Peer status                       : peer adjacency formed ok     

vPC keep-alive status             : peer is alive                

Configuration consistency status  : success

Per-vlan consistency status       : success                      

Type-2 consistency status         : success

vPC role                          : primary                      

Number of vPCs configured         : 294

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Operational Layer3 Peer-router    : Disabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

 

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans   

--   ----   ------ --------------------------------------------------

1    Po1    up     1,99        

 

<OUTPUT TRUNCATED>


 

N5K-1# show vpc peer-keepalive

 

vPC keep-alive status             : peer is alive                

--Peer is alive for             : (84) seconds, (386) msec

--Send status                   : Success

--Last send at                  : 2021.07.19 02:36:40 813 ms

--Sent on interface             : mgmt0

--Receive status                : Success

--Last receive at               : 2021.07.19 02:36:40 854 ms

--Received on interface         : mgmt0

--Last update from peer         : (0) seconds, (336) msec

 

vPC Keep-alive parameters

--Destination                   : 10.10.2.9

--Keepalive interval            : 1000 msec

--Keepalive timeout             : 5 seconds

--Keepalive hold timeout        : 3 seconds

--Keepalive vrf                 : management

--Keepalive udp port            : 3200

--Keepalive tos                 : 192

 

 

N5K-2# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

 

vPC domain id                     : 1  

Peer status                       : peer adjacency formed ok     

vPC keep-alive status             : peer is alive                

Configuration consistency status  : success

Per-vlan consistency status       : success                      

Type-2 consistency status         : success

vPC role                          : secondary                    

Number of vPCs configured         : 294

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Operational Layer3 Peer-router    : Disabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

 

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans   

--   ----   ------ --------------------------------------------------

1    Po1    up     1,99                                                     

 

<OUTPUT TRUNCATED>

 

 

N5K-2# show vpc peer-keepalive

 

vPC keep-alive status             : peer is alive                

--Peer is alive for             : (114) seconds, (258) msec

--Send status                   : Success

--Last send at                  : 2021.07.19 02:37:11 851 ms

--Sent on interface             : mgmt0

--Receive status                : Success

--Last receive at               : 2021.07.19 02:37:11 834 ms

--Received on interface         : mgmt0

--Last update from peer         : (0) seconds, (227) msec

 

vPC Keep-alive parameters

--Destination                   : 10.10.2.8

--Keepalive interval            : 1000 msec

--Keepalive timeout             : 5 seconds

--Keepalive hold timeout        : 3 seconds

--Keepalive vrf                 : management

--Keepalive udp port            : 3200

--Keepalive tos                 : 192

 

Peer-Link failure (Port-channel 1):

- All the vPC member ports/FEX on the Secondary Nexus switch will be suspended

- All traffic will flow via the Primary Nexus switch

- This will prevent a "split-brain" scenario

- Traffic on Orphan port/device (i.e. trunk to a standalone switch or router) connected to Secondary Nexus switch will fail or "blackhole"

- Create a Port-Channel with multiple interfaces for Peer-link

 

N5K-1# show run interface po1

 

!Command: show running-config interface port-channel1

!Time: Mon Jul 19 02:55:48 2021

 

version 7.3(8)N1(1)

 

interface port-channel1

  switchport mode trunk

  spanning-tree port type network

  vpc peer-link

 

N5K-1# show port-channel summary

Flags:  D - Down        P - Up in port-channel (members)

        I - Individual  H - Hot-standby (LACP only)

        s - Suspended   r - Module-removed

        S - Switched    R - Routed

        U - Up (port-channel)

        M - Not in use. Min-links not met

--------------------------------------------------------------------------------

Group Port-       Type     Protocol  Member Ports

      Channel

--------------------------------------------------------------------------------

1     Po1(SU)     Eth      LACP      Eth1/23(P)   Eth1/24(P)

<OUTPUT TRUNCATED>


I disabled the Port-Channel 1 Peer-Link on the N5K-1 switch. The N5K-2 vPC Port-Channel interface was immediately suspended and its FEX went offline.

N5K-1# configure terminal

Enter configuration commands, one per line.  End with CNTL/Z.

N5K-1(config)# interface port-channel1

N5K-1(config-if)# shutdown

 

 

N5K-2# 2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary

2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-100 Off-line (Serial Number FOX25191234)

2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is offline


 

N5K-2# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

 

vPC domain id                     : 1  

Peer status                       : peer link is down            

vPC keep-alive status             : peer is alive                

Configuration consistency status  : success

Per-vlan consistency status       : success                      

Type-2 consistency status         : success

vPC role                          : secondary                    

Number of vPCs configured         : 6  

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Operational Layer3 Peer-router    : Disabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

 

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans   

--   ----   ------ --------------------------------------------------

1    Po1    down   -                                                        

 

vPC status

----------------------------------------------------------------------------

id     Port        Status Consistency Reason                     Active vlans

------ ----------- ------ ----------- -------------------------- -----------

100    Po100       down   failed      Peer-link is down          -          

 

 

N5K-2# show vpc peer-keepalive

 

vPC keep-alive status             : peer is alive                

--Peer is alive for             : (1343) seconds, (564) msec

--Send status                   : Success

--Last send at                  : 2021.07.19 03:03:14 84 ms

--Sent on interface             : mgmt0

--Receive status                : Success

--Last receive at               : 2021.07.19 03:03:14 84 ms

--Received on interface         : mgmt0

--Last update from peer         : (0) seconds, (334) msec

 

vPC Keep-alive parameters

--Destination                   : 10.10.2.8

--Keepalive interval            : 1000 msec

--Keepalive timeout             : 5 seconds

--Keepalive hold timeout        : 3 seconds

--Keepalive vrf                 : management

--Keepalive udp port            : 3200

--Keepalive tos                 : 192

 

 

N5K-2# show fex

  FEX         FEX           FEX              FEX              Fex      

Number    Description      State            Model            Serial    

------------------------------------------------------------------------

100    FEX100               Offline   N2K-C2348UPQ-10GE   FOC22401234
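The FEX state is a quick way to see which peer has suspended its side. As a rough sketch, the "show fex" table can be parsed in Python to list any FEX that is not Online. The function names parse_show_fex and fex_not_online are hypothetical. The parser assumes the five-column layout shown above and a single-word state, which may not hold for every state or NX-OS release.

```python
# Copy of the N5K-2 "show fex" capture taken while the Peer-Link was down.
SAMPLE = """\
  FEX         FEX           FEX              FEX              Fex
Number    Description      State            Model            Serial
------------------------------------------------------------------------
100    FEX100               Offline   N2K-C2348UPQ-10GE   FOC22401234
"""


def parse_show_fex(output):
    """Return one dict per FEX row; header and separator lines are skipped."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with the numeric FEX id and have at least 5 columns.
        if len(parts) < 5 or not parts[0].isdigit():
            continue
        rows.append({
            "number": int(parts[0]),
            # Read state/model/serial from the right so that a description
            # containing spaces does not shift the other columns.
            "description": " ".join(parts[1:-3]),
            "state": parts[-3],
            "model": parts[-2],
            "serial": parts[-1],
        })
    return rows


def fex_not_online(rows):
    """Return the FEX numbers whose state is anything other than Online."""
    return [row["number"] for row in rows if row["state"] != "Online"]


print(fex_not_online(parse_show_fex(SAMPLE)))
```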

 

 

Only the N5K-1 FEX is online. This is to prevent a "split-brain" scenario, so the peer switch N5K-2 no longer forwards vPC traffic.

 

N5K-1# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

 

vPC domain id                     : 1  

Peer status                       : peer link is down            

vPC keep-alive status             : peer is alive                

Configuration consistency status  : success

Per-vlan consistency status       : success                      

Type-2 consistency status         : success

vPC role                          : primary                      

Number of vPCs configured         : 294

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Operational Layer3 Peer-router    : Disabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

 

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans   

--   ----   ------ --------------------------------------------------

1    Po1    down   -    

 

 

N5K-1# show vpc peer-keepalive

 

vPC keep-alive status             : peer is alive                

--Peer is alive for             : (1278) seconds, (446) msec

--Send status                   : Success

--Last send at                  : 2021.07.19 03:02:07 934 ms

--Sent on interface             : mgmt0

--Receive status                : Success

--Last receive at               : 2021.07.19 03:02:07 882 ms

--Received on interface         : mgmt0

--Last update from peer         : (0) seconds, (428) msec

 

vPC Keep-alive parameters

--Destination                   : 10.10.2.9

--Keepalive interval            : 1000 msec

--Keepalive timeout             : 5 seconds

--Keepalive hold timeout        : 3 seconds

--Keepalive vrf                 : management

--Keepalive udp port            : 3200

--Keepalive tos                 : 192

 

 

N5K-1# show fex

  FEX         FEX           FEX              FEX              Fex      

Number    Description      State            Model            Serial    

------------------------------------------------------------------------

100    FEX100                Online   N2K-C2348UPQ-10GE   FOC22401234


 

I re-enabled Port-Channel 1 and it took a couple of minutes for the FEX on N5K-2 to come back online.

 

N5K-1# configure terminal

Enter configuration commands, one per line.  End with CNTL/Z.

N5K-1(config)# interface port-channel1

N5K-1(config-if)# no shutdown

 

 

N5K-2# 2021 Jul 19 03:05:44 N5K-2 %$ VDC-1 %$ %SATCTRL-FEX105-2-SOHMS_ENV_ERROR: FEX-100 Module 1: Check environment alarms.

2021 Jul 19 03:05:48 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is online

2021 Jul 19 03:05:48 N5K-2 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_ONLINE: FEX-100 On-line

2021 Jul 19 03:05:50 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is online
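The syslog timestamps on N5K-2 can be used to work out how long the FEX was offline. Here is a small Python sketch that parses a few of the log lines from this post and computes the time between the "is offline" and "is online" messages. The names parse_syslog and fex_outage_seconds are invented for this example. The result is the total time the FEX was offline, including the period where Po1 was left shut down, so it is longer than the recovery time after the "no shutdown".

```python
import re
from datetime import datetime

# Log lines copied from the N5K-2 captures in this post.
LOG = """\
2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary
2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is offline
2021 Jul 19 03:05:48 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is online
2021 Jul 19 03:05:50 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is online
"""

# timestamp, hostname, then %FACILITY-SEVERITY-MNEMONIC: message text
LINE_RE = re.compile(
    r"(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) .*?"
    r"%(?P<facility>[\w-]+?)-(?P<severity>\d)-(?P<mnemonic>\w+): (?P<text>.*)"
)


def parse_syslog(text):
    """Return (timestamp, mnemonic, message) tuples for lines that match."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.search(line)  # search() tolerates a "N5K-2# " prefix
        if not match:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y %b %d %H:%M:%S")
        events.append((ts, match.group("mnemonic"), match.group("text").strip()))
    return events


def fex_outage_seconds(events, fex_id=100):
    """Seconds between the first 'offline' and the next 'online' message."""
    went_offline = None
    for ts, mnemonic, text in events:
        if mnemonic != "FEX_STATUS":
            continue
        if text == f"Fex {fex_id} is offline" and went_offline is None:
            went_offline = ts
        elif text == f"Fex {fex_id} is online" and went_offline is not None:
            return (ts - went_offline).total_seconds()
    return None  # no complete offline/online pair found


print(fex_outage_seconds(parse_syslog(LOG)))
```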


N5K-2# show fex

  FEX         FEX           FEX              FEX              Fex      

Number    Description      State            Model            Serial    

------------------------------------------------------------------------

100    FEX100                Online   N2K-C2348UPQ-10GE   FOC22401234


In summary, you can tolerate a Peer-Keepalive failure on its own or a Peer-Link failure on its own. Either one leaves enough time to troubleshoot and fix the problem (usually at Layer 1). Avoid a Peer-Keepalive failure followed by a Peer-Link failure at all costs, otherwise traffic instability/split-brain will occur.
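
The behaviour observed in this lab can be condensed into a small lookup. The Python sketch below is only a simplified model of the outcomes described in this post, not of how NX-OS implements vPC. The function name expected_behaviour is made up, and the last case reflects the warning above rather than something tested in the lab.

```python
def expected_behaviour(peer_link_up, keepalive_alive):
    """Map the state of the two vPC links to the outcome described in this post."""
    if peer_link_up and keepalive_alive:
        return "normal: both peers forward on their vPC member ports"
    if peer_link_up and not keepalive_alive:
        # Peer-Keepalive failure test: adjacency stays formed, roles unchanged.
        return "tolerated: vPC keeps forwarding, fix the keepalive path"
    if not peer_link_up and keepalive_alive:
        # Peer-Link failure test: secondary suspends its vPC member ports/FEX.
        return "tolerated: all traffic flows via the primary, orphan ports on the secondary blackhole"
    # Both down: the peers cannot see each other at all (not tested in the lab).
    return "avoid: risk of split-brain and traffic instability"


for peer_link_up in (True, False):
    for keepalive_alive in (True, False):
        print(peer_link_up, keepalive_alive, expected_behaviour(peer_link_up, keepalive_alive))
```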