Here's a Cisco link regarding the different Nexus vPC terminology, best practices and failure scenarios for the Peer-Link and Peer-Keepalive. I simulated several of these failure scenarios in my Nexus switch lab.
Peer-Keepalive Failure (mgmt0 via Layer 3):
- Only the heartbeat between Primary and Secondary Nexus peer will be lost
- vPC adjacency will NOT break/fail
- There's no change in vPC role (Primary/Secondary)
- vPC will still run as normal/forward traffic
- Ensure NMS monitoring for the Nexus mgmt0 interface
N5K-1# show run interface mgmt0
!Command: show running-config interface mgmt0
!Time: Mon Jul 19 02:56:36 2021
version 7.3(8)N1(1)
interface mgmt0
vrf member management
ip address 10.10.2.8/23
N5K-1# show run vpc
!Command: show running-config vpc
!Time: Thu Jul 22 08:30:50 2021
version 7.3(8)N1(1)
feature vpc
vpc domain 1
role priority 10
peer-keepalive destination 10.10.2.9 source 10.10.2.8
interface port-channel1
vpc peer-link
interface port-channel100
vpc 100
I shut down the switchport connected to N5K-1 mgmt0.
SW01#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
SW01(config)#interface Gi1/0/6
SW01(config-if)#shutdown
N5K-1# show interface mgmt0
mgmt0 is down (Link not connected)
Hardware: GigabitEthernet, address: 00de.fb78.0123 (bia 00de.fb78.0112)
Internet Address is 10.10.2.8/23
The Peer-Keepalive status changed to "peer is not reachable", but the peer adjacency is still "formed ok".
N5K-1# 2021 Jul 19 02:28:59 N5K-1 %$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
N5K-1# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is not reachable through peer-keepalive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 294
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Operational Layer3 Peer-router : Disabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,99
<OUTPUT TRUNCATED>
N5K-1# show vpc peer-keepalive
vPC keep-alive status : peer is not reachable through peer-keepalive
--Send status : Success
--Last send at : 2021.07.19 02:29:59 804 ms
--Sent on interface :
--Receive status : Failed
--Last update from peer : (65) seconds, (174) msec
vPC Keep-alive parameters
--Destination : 10.10.2.9
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
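For the NMS monitoring point above, here is a minimal Python sketch that turns the text of `show vpc peer-keepalive` into a dictionary and flags an unhealthy keepalive. The function names `parse_keepalive` and `keepalive_healthy` are my own (hypothetical), and how you collect the CLI text (SSH, NX-API, etc.) is outside the sketch. It only assumes the `field : value` layout seen in the output above.

```python
import re

# Sample text copied from the N5K-1 output above (mgmt0 down).
SAMPLE = """\
vPC keep-alive status : peer is not reachable through peer-keepalive
--Send status : Success
--Last send at : 2021.07.19 02:29:59 804 ms
--Sent on interface :
--Receive status : Failed
--Last update from peer : (65) seconds, (174) msec
"""

def parse_keepalive(output):
    """Turn 'show vpc peer-keepalive' text into a dict of lowercase field -> value."""
    fields = {}
    for line in output.splitlines():
        # Fields look like '--Send status : Success'; leading dashes are optional.
        # The separator is whitespace followed by a colon, so timestamps such as
        # 02:29:59 inside the value are not split.
        match = re.match(r"^\s*-*\s*(.+?)\s+:\s*(.*)$", line)
        if match:
            fields[match.group(1).strip().lower()] = match.group(2).strip()
    return fields

def keepalive_healthy(fields):
    """True only when the peer is alive and both directions report Success."""
    return (
        fields.get("vpc keep-alive status") == "peer is alive"
        and fields.get("send status") == "Success"
        and fields.get("receive status") == "Success"
    )

if __name__ == "__main__":
    parsed = parse_keepalive(SAMPLE)
    print(parsed["receive status"], keepalive_healthy(parsed))
```

An empty "Sent on interface" value, as seen while mgmt0 was down, comes back as an empty string rather than being dropped.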
The FEX module state is still Online.
N5K-1# show fex
FEX FEX FEX FEX Fex
Number Description State Model Serial
------------------------------------------------------------------------
100 FEX100 Online N2K-C2348UPQ-10GE FOC22401234
The same output is seen on the Nexus peer switch.
N5K-2# 2021 Jul 19 02:28:59 N5K-2 %$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
N5K-2# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is not reachable through peer-keepalive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 294
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Operational Layer3 Peer-router : Disabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,99
<OUTPUT TRUNCATED>
N5K-2# show vpc peer-keepalive
vPC keep-alive status : peer is not reachable through peer-keepalive
--Send status : Success
--Last send at : 2021.07.19 02:30:46 803 ms
--Sent on interface : mgmt0
--Receive status : Failed
--Last update from peer : (112) seconds, (807) msec
vPC Keep-alive parameters
--Destination : 10.10.2.8
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
N5K-2# show fex
FEX FEX FEX FEX Fex
Number Description State Model Serial
------------------------------------------------------------------------
100 FEX100 Online N2K-C2348UPQ-10GE FOC22401234
The vPC Peer-Keepalive status immediately changed to alive after I re-enabled the switch port connected to N5K-1 mgmt0.

SW01(config)#interface Gi1/0/6
SW01(config-if)#no shutdown
N5K-1# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 294
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Operational Layer3 Peer-router : Disabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,99
<OUTPUT TRUNCATED>
N5K-1# show vpc peer-keepalive
vPC keep-alive status : peer is alive
--Peer is alive for : (84) seconds, (386) msec
--Send status : Success
--Last send at : 2021.07.19 02:36:40 813 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2021.07.19 02:36:40 854 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (336) msec
vPC Keep-alive parameters
--Destination : 10.10.2.9
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
N5K-2# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 294
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Operational Layer3 Peer-router : Disabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,99
<OUTPUT TRUNCATED>
N5K-2# show vpc peer-keepalive
vPC keep-alive status : peer is alive
--Peer is alive for : (114) seconds, (258) msec
--Send status : Success
--Last send at : 2021.07.19 02:37:11 851 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2021.07.19 02:37:11 834 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (227) msec
vPC Keep-alive parameters
--Destination : 10.10.2.8
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
Peer-Link Failure (Port-Channel 1):
- All the vPC member ports/FEX on the Secondary Nexus switch will be suspended
- All traffic will flow via the Primary Nexus switch
- This will prevent a "split-brain" scenario
- Traffic on an Orphan port/device (i.e. trunk to a standalone switch or router) connected to the Secondary Nexus switch will fail or "blackhole"
- Create a Port-Channel with multiple interfaces for Peer-link
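The behaviour in the two bullet lists (Peer-Keepalive failure and Peer-Link failure) can be written down as a small decision table. This is a simplified Python model of what I observed in the lab, not how NX-OS implements it; `PeerView` and `expected_action` are hypothetical names, and the "both down" branch reflects the split-brain risk rather than a tested outcome.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PeerView:
    """What one vPC peer currently sees (simplified, hypothetical model)."""
    role: str              # "primary" or "secondary"
    peer_link_up: bool     # Port-Channel used as the vPC Peer-Link
    keepalive_alive: bool  # Peer-Keepalive heartbeat over mgmt0

def expected_action(view):
    """Return the behaviour observed in the lab for this combination."""
    if view.peer_link_up:
        # Losing only the heartbeat changes nothing: roles and forwarding stay put.
        return "forward normally"
    if view.keepalive_alive:
        # Peer-Link is down but the peer is known to be alive:
        # the secondary steps aside so only one switch forwards.
        if view.role == "secondary":
            return "suspend vPC member ports"
        return "keep forwarding"
    # Neither path works, so each switch cannot tell whether the peer is alive.
    return "dual-active risk"

if __name__ == "__main__":
    print(expected_action(PeerView("secondary", False, True)))
```

The model makes the orphan-port problem easy to see: anything single-attached to the secondary loses its path through the Peer-Link as soon as `peer_link_up` is false.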
N5K-1# show run interface po1
!Command: show running-config interface port-channel1
!Time: Mon Jul 19 02:55:48 2021
version 7.3(8)N1(1)
interface port-channel1
switchport mode trunk
spanning-tree port type network
vpc peer-link
N5K-1# show port-channel summary
Flags: D - Down P - Up in port-channel (members)
I - Individual H - Hot-standby (LACP only)
s - Suspended r - Module-removed
S - Switched R - Routed
U - Up (port-channel)
M - Not in use. Min-links not met
--------------------------------------------------------------------------------
Group Port- Type Protocol Member Ports
Channel
--------------------------------------------------------------------------------
1 Po1(SU) Eth LACP Eth1/23(P) Eth1/24(P)
<OUTPUT TRUNCATED>
I disabled Port-Channel 1 (the Peer-Link) on the N5K-1 switch. The vPC Port-Channel interface on N5K-2 was immediately suspended and its FEX went offline.
N5K-1# configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
N5K-1(config)# interface port-channel1
N5K-1(config-if)# shutdown
N5K-2# 2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary
2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-100 Off-line (Serial Number FOX25191234)
2021 Jul 19 02:59:03 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is offline
N5K-2# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer link is down
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 6
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Operational Layer3 Peer-router : Disabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
------ ----------- ------ ----------- -------------------------- -----------
100 Po100 down failed Peer-link is down -
N5K-2# show vpc peer-keepalive
vPC keep-alive status : peer is alive
--Peer is alive for : (1343) seconds, (564) msec
--Send status : Success
--Last send at : 2021.07.19 03:03:14 84 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2021.07.19 03:03:14 84 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (334) msec
vPC Keep-alive parameters
--Destination : 10.10.2.8
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
N5K-2# show fex
FEX FEX FEX FEX Fex
Number Description State Model Serial
------------------------------------------------------------------------
100 FEX100 Offline N2K-C2348UPQ-10GE FOC22401234
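To catch an offline FEX like the one above without reading the table by eye, here is a rough Python sketch that parses `show fex` text. `parse_show_fex` and `offline_fex` are hypothetical helper names, and the parser assumes the state is a single word (Online/Offline) and that the model and serial are the last two columns, as in this lab output.

```python
# Sample text copied from the N5K-2 output above (Peer-Link down).
SAMPLE_FEX = """\
FEX FEX FEX FEX Fex
Number Description State Model Serial
------------------------------------------------------------------------
100 FEX100 Offline N2K-C2348UPQ-10GE FOC22401234
"""

def parse_show_fex(output):
    """Return one dict per FEX row found in 'show fex' text."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with the numeric FEX id; headers and rulers do not.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        rows.append({
            "number": int(parts[0]),
            # Read the fixed columns from the right so a description
            # containing spaces still ends up in one field.
            "description": " ".join(parts[1:-3]),
            "state": parts[-3],
            "model": parts[-2],
            "serial": parts[-1],
        })
    return rows

def offline_fex(output):
    """Return the FEX numbers whose state is anything other than Online."""
    return [row["number"] for row in parse_show_fex(output) if row["state"] != "Online"]

if __name__ == "__main__":
    print(offline_fex(SAMPLE_FEX))
```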
Only the FEX on N5K-1 is online. This prevents "split-brain" traffic on the peer switch N5K-2.
N5K-1# show vpc
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 1
Peer status : peer link is down
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 294
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Operational Layer3 Peer-router : Disabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 down -
N5K-1# show vpc peer-keepalive
vPC keep-alive status : peer is alive
--Peer is alive for : (1278) seconds, (446) msec
--Send status : Success
--Last send at : 2021.07.19 03:02:07 934 ms
--Sent on interface : mgmt0
--Receive status : Success
--Last receive at : 2021.07.19 03:02:07 882 ms
--Received on interface : mgmt0
--Last update from peer : (0) seconds, (428) msec
vPC Keep-alive parameters
--Destination : 10.10.2.9
--Keepalive interval : 1000 msec
--Keepalive timeout : 5 seconds
--Keepalive hold timeout : 3 seconds
--Keepalive vrf : management
--Keepalive udp port : 3200
--Keepalive tos : 192
N5K-1# show fex
FEX FEX FEX FEX Fex
Number Description State Model Serial
------------------------------------------------------------------------
100 FEX100 Online N2K-C2348UPQ-10GE FOC22401234
I re-enabled Port-Channel 1 and it took a couple of minutes for the FEX on N5K-2 to come back online.
N5K-1# configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
N5K-1(config)# interface port-channel1
N5K-1(config-if)# no shutdown
N5K-2# 2021 Jul 19 03:05:44 N5K-2 %$ VDC-1 %$ %SATCTRL-FEX105-2-SOHMS_ENV_ERROR: FEX-100 Module 1: Check environment alarms.
2021 Jul 19 03:05:48 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is online
2021 Jul 19 03:05:48 N5K-2 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_ONLINE: FEX-100 On-line
2021 Jul 19 03:05:50 N5K-2 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 100 is online
N5K-2# show fex
FEX FEX FEX FEX Fex
Number Description State Model Serial
------------------------------------------------------------------------
100 FEX100 Online N2K-C2348UPQ-10GE FOC22401234
In summary, a vPC domain can tolerate a Peer-Keepalive failure on its own or a Peer-Link failure on its own. Either one leaves enough time to troubleshoot and fix the problem (usually at Layer 1). Avoid a Peer-Keepalive failure followed by a Peer-Link failure at all costs; otherwise traffic instability or split-brain will occur.
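One way to act on this summary is to grade the two status lines from `show vpc` so that a single failure raises an alert before the second one happens. The Python sketch below is only an illustration: `vpc_field` and `classify` are hypothetical names, the severity labels are my own, and the status strings are the ones that appeared in the lab output.

```python
import re

def vpc_field(output, name):
    """Return the value after 'name :' in 'show vpc' text, or '' if absent."""
    # [ \t]* instead of \s* keeps the match on a single line.
    pattern = r"^[ \t]*" + re.escape(name) + r"[ \t]*:[ \t]*(.*)$"
    match = re.search(pattern, output, flags=re.MULTILINE)
    return match.group(1).strip() if match else ""

def classify(output):
    """Grade vPC health from the Peer status and keep-alive status lines."""
    link_ok = vpc_field(output, "Peer status") == "peer adjacency formed ok"
    keepalive_ok = vpc_field(output, "vPC keep-alive status") == "peer is alive"
    if link_ok and keepalive_ok:
        return "ok"
    if link_ok:
        return "warning: keepalive down"   # tolerable, but fix before the Peer-Link fails
    if keepalive_ok:
        return "major: peer-link down"     # secondary has suspended its vPCs
    return "critical: both down"           # split-brain risk

if __name__ == "__main__":
    text = "Peer status : peer link is down\nvPC keep-alive status : peer is alive\n"
    print(classify(text))  # prints: major: peer-link down
```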