
FS#1683 — FS#5664 — Network update

Attached to Project— Dedicated Cloud
Maintenance
Backend / Core
CLOSED
100%
We are going to update the pCC switches. Normally there will be no outage, since the switches are updated via ISSU (without interrupting the service).
However, we have already had crashes at this level. If this happens again, everything will be switched to the 2nd network.


Date:  Saturday, 13 August 2011, 00:39AM
Reason for closing:  Done
Comment by OVH - Friday, 05 August 2011, 18:16PM

pcc-10a done
pcc-10b done

pcc-11a in progress


Comment by OVH - Friday, 05 August 2011, 18:16PM

storage-s28a-n5 in progress


Comment by OVH - Friday, 05 August 2011, 18:16PM

pcc-29-n5 in progress


Comment by OVH - Friday, 05 August 2011, 18:16PM

pcc-28-n5 in progress


Comment by OVH - Friday, 05 August 2011, 18:16PM

pcc-26-n5 in progress


Comment by OVH - Friday, 05 August 2011, 18:17PM

pcc-11a

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  rolling
101     yes       non-disruptive  rolling
102     yes       non-disruptive  rolling
103     yes       non-disruptive  rolling
104     yes       non-disruptive  rolling
105     yes       non-disruptive  rolling
106     yes       non-disruptive  rolling
107     yes       non-disruptive  rolling
108     yes       non-disruptive  rolling
109     yes       non-disruptive  rolling
110     yes       non-disruptive  rolling
111     yes       non-disruptive  rolling
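
The compatibility check above could be verified automatically before starting an upgrade. The following is a minimal sketch, assuming the whitespace-separated column layout shown in this log; the `ModuleCheck` type and both function names are hypothetical helpers, not part of any Cisco tooling.

```python
from typing import List, NamedTuple


class ModuleCheck(NamedTuple):
    module: int
    bootable: bool
    impact: str
    install_type: str


def parse_compat_table(text: str) -> List[ModuleCheck]:
    """Parse rows such as '100 yes non-disruptive rolling'."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric module id; the header and the
        # dashed separator do not, so they are skipped.
        if len(fields) >= 4 and fields[0].isdigit():
            rows.append(ModuleCheck(int(fields[0]), fields[1] == "yes",
                                    fields[2], fields[3]))
    return rows


def all_non_disruptive(rows: List[ModuleCheck]) -> bool:
    """True only if there is at least one row and every module is
    bootable with a non-disruptive impact."""
    return bool(rows) and all(
        r.bootable and r.impact == "non-disruptive" for r in rows
    )


SAMPLE = """\
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
"""

rows = parse_compat_table(SAMPLE)
print(len(rows), all_non_disruptive(rows))
```

As the rest of this log shows, a clean check is only a precondition: the switches reported "non-disruptive" and still failed during the update.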


Comment by OVH - Friday, 05 August 2011, 18:17PM

storage-s28a-n5

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  rolling
101     yes       non-disruptive  rolling


Comment by OVH - Friday, 05 August 2011, 18:17PM

pcc-11b-n5# 2011 Aug 5 17:13:45 pcc-11b-n5 %VPC-2-VPC_ISSU_START: Peer vPC switch ISSU start, locking configuration
storage-s28b-n5# 2011 Aug 5 17:14:33 storage-s28b-n5 %VPC-2-VPC_ISSU_START: Peer vPC switch ISSU start, locking configuration


Comment by OVH - Friday, 05 August 2011, 18:17PM

2011 Aug 5 17:18:23 pcc-11b-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 154, VPC peer keep-alive receive has failed


Comment by OVH - Friday, 05 August 2011, 18:20PM

pcc-11a-n5 failed during its update. pcc-11b-n5 continues to manage the FEX. pcc-11a is UP. We will cut the FEX.
We are updating pcc-11b. If it works, it will update the FEX and we can put the FEX back on pcc-11a.


Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  rolling
101     yes       non-disruptive  rolling
102     yes       non-disruptive  rolling
103     yes       non-disruptive  rolling
104     yes       non-disruptive  rolling
105     yes       non-disruptive  rolling
106     yes       non-disruptive  rolling
107     yes       non-disruptive  rolling
108     yes       non-disruptive  rolling
109     yes       non-disruptive  rolling
110     yes       non-disruptive  rolling
111     yes       non-disruptive  rolling


Comment by OVH - Friday, 05 August 2011, 18:20PM

storage-s28a-n5 done, along with its FEX.
storage-s28b-n5 in progress

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  none
101     yes       non-disruptive  none


Comment by OVH - Friday, 05 August 2011, 18:20PM

storage-s28 updated.
We are moving on to storage-s27.


Comment by OVH - Friday, 05 August 2011, 18:21PM

storage-s27a-n5

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  rolling
101     yes       non-disruptive  rolling
102     yes       non-disruptive  rolling
103     yes       non-disruptive  rolling
104     yes       non-disruptive  rolling
105     yes       non-disruptive  rolling


Comment by OVH - Friday, 05 August 2011, 18:21PM

pcc-11a-n5# 2011 Aug 5 17:36:55 pcc-11a-n5 %VPC-2-VPC_ISSU_END: Peer vPC switch ISSU end, unlocking configuration
2011 Aug 5 17:37:00 pcc-11a-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 154, VPC peer keep-alive receive has failed

pcc-11b has also crashed. pcc-22 has taken over the VLAN switching.


Comment by OVH - Friday, 05 August 2011, 18:24PM

pcc-11b is UP. pcc-11a and pcc-11b have updated the FEX and activated the port of each configured host; once a port is UP, the host sends its traffic to pcc-11.


Comment by OVH - Friday, 05 August 2011, 18:24PM

storage-s27a-n5 done
storage-s27b-n5 in progress

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  none
101     yes       non-disruptive  none
102     yes       non-disruptive  none
103     yes       non-disruptive  none
104     yes       non-disruptive  none
105     yes       non-disruptive  none


Comment by OVH - Friday, 05 August 2011, 18:24PM

pcc-12a-n5 in progress

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  rolling
101     yes       non-disruptive  rolling
102     yes       non-disruptive  rolling
103     yes       non-disruptive  rolling
104     yes       non-disruptive  rolling
105     yes       non-disruptive  rolling
106     yes       non-disruptive  rolling
107     yes       non-disruptive  rolling
108     yes       non-disruptive  rolling
109     yes       non-disruptive  rolling
110     yes       non-disruptive  rolling
111     yes       non-disruptive  rolling


Comment by OVH - Friday, 05 August 2011, 18:24PM

storage-s27b-n5 done


Comment by OVH - Friday, 05 August 2011, 18:25PM

pcc-25-n5 in progress

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  rolling
101     yes       non-disruptive  rolling
102     yes       non-disruptive  rolling
103     yes       non-disruptive  rolling


Comment by OVH - Friday, 05 August 2011, 18:35PM

pcc-12a-n5 done
pcc-12b-n5 in progress

Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
100     yes       non-disruptive  none
101     yes       non-disruptive  none
102     yes       non-disruptive  none
103     yes       non-disruptive  none
104     yes       non-disruptive  none
105     yes       non-disruptive  none
106     yes       non-disruptive  none
107     yes       non-disruptive  none
108     yes       non-disruptive  none
109     yes       non-disruptive  none
110     yes       non-disruptive  none
111     yes       non-disruptive  none


Comment by OVH - Friday, 05 August 2011, 18:51PM

pcc-25-n5 done

We are seeing the same problem as on pcc-22-n5, which seems linked to the Nexus 5548P: netstack is consuming the CPU.
We already have a TAC case open with Cisco on this subject.


pcc-25-n5# sh processes cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
4459   184          43        4294   49.5%   netstack


Comment by OVH - Friday, 05 August 2011, 18:52PM

pcc-15-n5: we are going to cut all the FEX ports, then hard-reboot the N5.


Comment by OVH - Friday, 05 August 2011, 18:52PM

pcc-12b-n5 has crashed. pcc-12a continues to switch the FEX.


Comment by OVH - Friday, 05 August 2011, 19:19PM

Both pcc-12 switches have crashed, but they are not releasing the host ports. We are hard-rebooting them.


Comment by OVH - Friday, 05 August 2011, 19:20PM

pcc-12a and pcc-12b are back to normal after a hard reboot; the FEX are running.


Comment by OVH - Friday, 05 August 2011, 19:23PM

In-service updates do not work at all on the Nexus 5xxx with FEX. We are changing strategy: we cut the ports on one of the 2 sides, which forces the traffic onto the 2nd switch of the pair, then we update the isolated switch. It may crash. Once it is back to normal, we put it back into production.
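
The changed strategy can be read as a small procedure applied to each side of a pair in turn. The sketch below only models the ordering of the steps; the `Switch` class, the `install` and `healthy` callbacks and the version labels are hypothetical stand-ins and do not talk to any real device.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Switch:
    name: str
    version: str
    ports_up: bool = True


def upgrade_pair(a: Switch, b: Switch, target: str,
                 install: Callable[[Switch, str], None],
                 healthy: Callable[[Switch], bool]) -> List[str]:
    """Upgrade both switches of a pair, one isolated side at a time."""
    steps = []
    for sw, peer in ((a, b), (b, a)):
        # The peer must be able to carry all traffic alone.
        if not (peer.ports_up and healthy(peer)):
            raise RuntimeError(f"{peer.name} cannot carry traffic alone")
        sw.ports_up = False  # cut the ports: traffic moves to the peer
        steps.append(f"cut ports on {sw.name}")
        install(sw, target)  # may crash; no host traffic is on this side
        steps.append(f"install {target} on {sw.name}")
        if not healthy(sw):
            raise RuntimeError(f"{sw.name} unhealthy; ports stay down")
        sw.ports_up = True   # back into production
        steps.append(f"restore ports on {sw.name}")
    return steps


def fake_install(sw: Switch, version: str) -> None:
    sw.version = version  # placeholder for the real upgrade


side_a = Switch("pcc-12a-n5", "old")
side_b = Switch("pcc-12b-n5", "old")
steps = upgrade_pair(side_a, side_b, "new", fake_install, lambda sw: True)
print("\n".join(steps))
```

The point of the ordering is that a crash during `install` happens while the ports are down, so hosts keep using the other side.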


Comment by OVH - Friday, 05 August 2011, 19:24PM

2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102400
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102401
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102402
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102403
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102404
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102405
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102407
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102408
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102409
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102410
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102411
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102412
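
When many such lines scroll by, it can help to extract the vPC numbers and see which ones are absent from the series. A minimal sketch, assuming the message format shown in this log; the function names are hypothetical. In the excerpt above the numbers run from 102400 to 102412 and 102406 does not appear.

```python
import re
from typing import List

TIMEOUT_RE = re.compile(
    r"%VPC-2-PEER_VPC_RESP_TIMEDOUT: .*for vPC: (\d+)"
)


def timed_out_vpcs(log: str) -> List[int]:
    """Return the sorted, de-duplicated vPC numbers that timed out."""
    return sorted({int(m.group(1)) for m in TIMEOUT_RE.finditer(log)})


def gaps(ids: List[int]) -> List[int]:
    """Numbers between the lowest and highest id that are not present."""
    if not ids:
        return []
    present = set(ids)
    return [i for i in range(ids[0], ids[-1] + 1) if i not in present]


PREFIX = ("2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: "
          "Failed to receive response from peer for vPC: ")
SAMPLE = "\n".join(PREFIX + str(n) for n in (102405, 102407, 102408))

ids = timed_out_vpcs(SAMPLE)
print(ids, gaps(ids))
```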


Comment by OVH - Friday, 05 August 2011, 19:26PM

We will cut all the pcc-12 ports and do a hard reboot.


Comment by OVH - Friday, 05 August 2011, 19:26PM

The ports of both pcc-12 switches are cut.


Comment by OVH - Friday, 05 August 2011, 19:48PM

We brought the ports UP on the B side and the CPU exploded on pcc-22.

pcc-12b-n5# sh proc cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
4382   292          100       2923   84.0%   netstack

One or more hosts must be sending packets that go straight to the N5 in software and take all the CPU.
It is a software bug on the N5, but we need to find exactly what is triggering it.


Comment by OVH - Friday, 05 August 2011, 21:41PM

We are downgrading pcc-12 to n5000-uk9.5.0.3.N1.1b.bin, which does not seem to cause the netstack problem but has other bugs.


Comment by OVH - Friday, 05 August 2011, 21:43PM

pcc-12b-n5(config-if)# sh proc cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
4382   292          100       2923   95.2%   netstack
pcc-12b-n5(config-if)# inter po 111
pcc-12b-n5(config-if)# shutdown
2011 Aug 5 20:01:05 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 111 is offline
2011 Aug 5 20:01:05 pcc-12b-n5 %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-111 Off-line (Serial Number )
pcc-12b-n5(config-if)# sh proc cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
4382   292          100       2923   2.0%    netstack

The FEX had to be cut in order to bring the CPU back down to 2%.
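
The before/after comparison above (shut a FEX, re-read the netstack CPU share) lends itself to a small helper. This is a minimal sketch, assuming the six-column `sh proc cpu sort` layout shown in this log; `cpu_share` and the 50% threshold are hypothetical choices for illustration.

```python
from typing import Optional


def cpu_share(output: str, process: str) -> Optional[float]:
    """Return the 1Sec CPU percentage of `process`, or None if absent."""
    for line in output.splitlines():
        fields = line.split()
        # Expected row: PID Runtime(ms) Invoked uSecs 1Sec Process
        if (len(fields) == 6 and fields[-1] == process
                and fields[-2].endswith("%")):
            return float(fields[-2].rstrip("%"))
    return None


BEFORE = """\
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 95.2% netstack
"""
AFTER = """\
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 2.0% netstack
"""

THRESHOLD = 50.0  # hypothetical alert level, not taken from the log

before = cpu_share(BEFORE, "netstack")
after = cpu_share(AFTER, "netstack")
print(before, after, before > THRESHOLD > after)
```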


Comment by OVH - Friday, 05 August 2011, 21:43PM

pcc-12b-n5(config-if)# sh proc cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
1      1025         1462      701    0.0%    init
pcc-12b-n5(config)# inter po 100
pcc-12b-n5(config-if)# no shutdown
2011 Aug 5 20:03:38 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 100 is online
2011 Aug 5 20:03:38 pcc-12b-n5 %NOHMS-2-NOHMS_ENV_FEX_ONLINE: FEX-100 On-line
2011 Aug 5 20:03:38 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 100 is online
pcc-12b-n5(config-if)# sh proc cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
4382   292          100       2923   95.0%   netstack

Nothing to do but downgrade.


Comment by OVH - Friday, 05 August 2011, 21:43PM

kickstart: version 5.0(3)N2(1)
system: version 5.0(3)N2(1)


Comment by OVH - Friday, 05 August 2011, 23:29PM

Following the discussions with TAC and some dumps on the network, it is possible that some packets have an unexpected effect on the N5 in the (3)Nx.x versions.
These are spantree packets with a source MAC of 0100.0ccc.cccd circulating on the network; we don't know where they come from (customers are probably sending them).
This is a malformed packet that should not exist in a perfect world: a packet may have 0100.0ccc.cccd as its destination, but not as its source.
So the packet ends up at the CPU.
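
Why such a source address is malformed: in an Ethernet MAC address, the least significant bit of the first octet marks a group (multicast) address, and a frame's source is expected to be an individual address. A minimal sketch of that check, assuming the Cisco dotted notation used in this log; the function names and the unicast example address are hypothetical.

```python
def parse_cisco_mac(mac: str) -> bytes:
    """Convert 'xxxx.xxxx.xxxx' notation into 6 raw bytes."""
    hex_digits = mac.replace(".", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return bytes.fromhex(hex_digits)


def is_group_address(mac: str) -> bool:
    """True if the I/G bit (lowest bit of the first octet) is set."""
    return bool(parse_cisco_mac(mac)[0] & 0x01)


def is_valid_source(mac: str) -> bool:
    """A source address must be an individual (unicast) address."""
    return not is_group_address(mac)


SUSPECT = "0100.0ccc.cccd"   # address quoted in this log
ORDINARY = "0025.90aa.bbcc"  # hypothetical unicast host address

print(is_valid_source(SUSPECT), is_valid_source(ORDINARY))
```

The first octet of 0100.0ccc.cccd is 0x01, so the group bit is set and the address is only acceptable as a destination.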

The first idea was to apply a MAC access-list to filter these packets:
pcc-12b-n5# sh mac access-lists

MAC access list test
10 deny 0100.0ccc.cccd ffff.ffff.ffff any
20 permit any any

This didn't work; the CPU is still at 100%.

We were asked to enable spantree in order to check whether the spantree process could handle these packets instead of the CPU.

We enabled spantree, but when we enable the ports we hit a new limit on the number of spantree instances per port and per VLAN.
We set up spantree MST, which reduces the number of instances, but nothing changed.

So we forced the test by enabling all the ports, and we anxiously watched the log messages appearing on our consoles.

2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73600) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73700) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73800) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73900) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74000) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74100) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74200) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74300) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74400) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74500) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74600) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74700) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74800) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74900) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75000) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75100) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75200) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75300) exceeded [MST mode] recommended limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75400) exceeded [MST mode] recommended limit of 14500
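
To put the last of these messages in perspective, the instance count can be pulled out and compared with the recommended limit. A minimal sketch, assuming the message text shown in this log; the pattern accepts any whitespace between "recommended" and "limit", and `limit_events` is a hypothetical name.

```python
import re
from typing import List, Tuple

LIMIT_RE = re.compile(
    r"vlan-port instances \((\d+)\) exceeded \[MST mode\] "
    r"recommended\s+limit of (\d+)"
)


def limit_events(log: str) -> List[Tuple[int, int]]:
    """Return (instance_count, recommended_limit) for each message."""
    return [(int(count), int(limit)) for count, limit in LIMIT_RE.findall(log)]


SAMPLE = (
    "2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: "
    "The number of vlan-port instances (75400) exceeded [MST mode] "
    "recommended\nlimit of 14500\n"
)

events = limit_events(SAMPLE)
count, limit = events[-1]
print(f"{count} instances, {count / limit:.1f}x the recommended limit of {limit}")
```

For the final message, 75400 / 14500 = 5.2, so the switch was running at roughly five times the recommended number of vlan-port instances.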

Finally, the configuration has been applied and it seems to be switching. The hosts work, spantree probably does not, but the CPU is fine.

pcc-12b-n5# sh processes cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
4210   588          201530    2      2.0%    gatosusd
1      1014         1305      777    0.0%    init

CPU util : 0.0% user, 1.0% kernel, 99.0% idle

Apparently, these packets are the origin of the CPU problem.
We will pass this information on to Cisco TAC and see
whether they can give us a patched version of NX-OS so that we can get rid of spantree.


Comment by OVH - Friday, 05 August 2011, 23:32PM

We are stopping work for today, enough emotions for one short day :( .