OVHcloud Private Cloud Status

FS#5664 — Network update
Scheduled Maintenance Report for Hosted Private Cloud
Completed
We are going to update the pCC switches. Normally there should be no outage, since the switches are updated with ISSU (in-service software upgrade, without interrupting the service).
However, we have already had crashes during this kind of update. If it happens again, all traffic will be switched to the 2nd network.



Update(s):

Date: 2011-08-05 21:32:11 UTC
We are stopping work for today; enough emotions for one short day :(

Date: 2011-08-05 21:29:49 UTC
Following discussions with the TAC and some dumps taken on the network, it is possible that certain packets have an unexpected effect on the N5 running version 5.0(3)Nx.x.
These are spanning-tree packets with source MAC 0100.0ccc.cccd that show up on the network; we do not know where they come from (customers are probably sending them).
This is a malformed packet that should not exist in a perfect world: a packet may have 0100.0ccc.cccd as its destination, but never as its source.
So the packet ends up at the CPU.
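
Why such a source address is malformed can be sketched in a few lines of Python. In Ethernet, the lowest bit of the first octet (the I/G bit) marks a group (multicast) address, and a group address is only meaningful as a destination. The helper names below are hypothetical and only illustrate the check; this is not what the switch runs internally.

```python
def parse_cisco_mac(text: str) -> bytes:
    """Convert Cisco dotted notation (e.g. '0100.0ccc.cccd') to 6 raw bytes."""
    digits = text.replace(".", "")
    if len(digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {text!r}")
    return bytes.fromhex(digits)


def is_group_address(mac: bytes) -> bool:
    """True when the I/G bit (lowest bit of the first octet) is set."""
    return bool(mac[0] & 0x01)


def is_valid_source(mac: bytes) -> bool:
    """A frame's source must be an individual (unicast) address."""
    return not is_group_address(mac)


suspect = parse_cisco_mac("0100.0ccc.cccd")
print(is_group_address(suspect))  # True: first octet 0x01 has the group bit set
print(is_valid_source(suspect))   # False: valid as a destination only
```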

The first idea was to apply a MAC access list to filter these packets:
pcc-12b-n5# sh mac access-lists

MAC access list test
10 deny 0100.0ccc.cccd ffff.ffff.ffff any
20 permit any any

This did not work: the CPU is still at 100%.

We were asked to enable spanning tree to check whether the spanning-tree process could handle these packets instead of the CPU.

We enabled spanning tree, but when we enable the ports we hit a new limit: the number of spanning-tree instances per port and per VLAN.
We configured MST, which reduces the number of instances, but nothing changed.

So we forced the test by enabling all the ports, and nervously watched the log messages appearing on our consoles.

2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73600) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73700) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73800) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (73900) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74000) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74100) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74200) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74300) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74400) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74500) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74600) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74700) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:34 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74800) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (74900) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75000) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75100) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75200) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75300) exceeded [MST mode] recommended
limit of 14500
2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances (75400) exceeded [MST mode] recommended
limit of 14500
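
To get a feel for how far over the limit the switch was, the messages above can be summarised with a small script. This is a minimal sketch: the regular expression is written against the two-line wrapped form shown in this report, and the function names are hypothetical.

```python
import re

# Matches the message as pasted above, including the line break before "limit of".
LIMIT_RE = re.compile(
    r"%STP-2-VLAN_PORT_LIMIT_EXCEEDED: The number of vlan-port instances "
    r"\((\d+)\) exceeded \[(\w+) mode\] recommended\s+limit of (\d+)"
)


def vlan_port_overruns(log_text: str) -> list[tuple[int, int]]:
    """Return (instances, limit) pairs for every matching log message."""
    return [(int(m.group(1)), int(m.group(3))) for m in LIMIT_RE.finditer(log_text)]


def worst_ratio(pairs: list[tuple[int, int]]) -> float:
    """Largest instances/limit ratio among the parsed messages."""
    return max(instances / limit for instances, limit in pairs)


# First and last messages from the log above.
sample = (
    "2011 Aug 5 21:41:33 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: "
    "The number of vlan-port instances (73600) exceeded [MST mode] recommended\n"
    "limit of 14500\n"
    "2011 Aug 5 21:41:36 pcc-12b-n5 %STP-2-VLAN_PORT_LIMIT_EXCEEDED: "
    "The number of vlan-port instances (75400) exceeded [MST mode] recommended\n"
    "limit of 14500\n"
)

pairs = vlan_port_overruns(sample)
print(pairs)
print(f"{worst_ratio(pairs):.1f}x the recommended limit")
```

On the last message of the log, 75400 instances against a recommended limit of 14500 is a little over five times the limit.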

In the end the configuration was applied and the switch appears to be forwarding traffic. The hosts work, spanning tree probably does not, but CPU usage is normal.

pcc-12b-n5# sh processes cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4210 588 201530 2 2.0% gatosusd
1 1014 1305 777 0.0% init

CPU util : 0.0% user, 1.0% kernel, 99.0% idle

Apparently, these packets are the cause of the CPU problem.
We will report this information to the Cisco TAC and see
whether they can give us a patched version of NX-OS so that we can remove spanning tree.

Date: 2011-08-05 19:43:55 UTC
kickstart: version 5.0(3)N2(1)
system: version 5.0(3)N2(1)


Date: 2011-08-05 19:43:37 UTC
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
1 1025 1462 701 0.0% init
pcc-12b-n5(config)# inter po 100
pcc-12b-n5(config-if)# no shutdown
2011 Aug 5 20:03:38 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 100 is online
2011 Aug 5 20:03:38 pcc-12b-n5 %NOHMS-2-NOHMS_ENV_FEX_ONLINE: FEX-100 On-line
2011 Aug 5 20:03:38 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 100 is online
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 95.0% netstack

There is nothing left to do but downgrade.

Date: 2011-08-05 19:43:11 UTC
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 95.2% netstack
pcc-12b-n5(config-if)# inter po 111
pcc-12b-n5(config-if)# shutdown
2011 Aug 5 20:01:05 pcc-12b-n5 %PFMA-2-FEX_STATUS: Fex 111 is offline
2011 Aug 5 20:01:05 pcc-12b-n5 %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-111 Off-line (Serial Number )
pcc-12b-n5(config-if)# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 2.0% netstack

The FEX had to be shut down to bring the CPU back down to 2%.
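
The before/after comparison above can be automated along these lines. This is a sketch only: it assumes the six-column layout pasted in this report (PID, Runtime(ms), Invoked, uSecs, 1Sec, Process), and the 50% alert threshold and all names are arbitrary choices for illustration.

```python
from typing import NamedTuple


class ProcessCpu(NamedTuple):
    pid: int
    runtime_ms: int
    invoked: int
    usecs: int
    one_sec_pct: float
    name: str


def parse_cpu_rows(output: str) -> list[ProcessCpu]:
    """Extract process rows from pasted 'sh proc cpu sort' output."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have six fields, a numeric PID, and a percentage in column five.
        if len(fields) == 6 and fields[0].isdigit() and fields[4].endswith("%"):
            rows.append(ProcessCpu(
                int(fields[0]), int(fields[1]), int(fields[2]),
                int(fields[3]), float(fields[4].rstrip("%")), fields[5],
            ))
    return rows


def busy_processes(output: str, threshold_pct: float = 50.0) -> list[str]:
    """Names of processes at or above the (assumed) alert threshold."""
    return [r.name for r in parse_cpu_rows(output) if r.one_sec_pct >= threshold_pct]


HEADER = (
    "PID Runtime(ms) Invoked uSecs 1Sec Process\n"
    "----- ----------- -------- ----- ------ -----------\n"
)
before = HEADER + "4382 292 100 2923 95.2% netstack\n"
after = HEADER + "4382 292 100 2923 2.0% netstack\n"

print(busy_processes(before))  # ['netstack']
print(busy_processes(after))   # []
```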


Date: 2011-08-05 19:41:54 UTC
We are downgrading pcc-12 to n5000-uk9.5.0.3.N1.1b.bin, which does not seem to cause the netstack problem but has other bugs.


Date: 2011-08-05 17:48:03 UTC
We brought the port UP on the B side and the CPU spiked on the pcc-22.

pcc-12b-n5# sh proc cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4382 292 100 2923 84.0% netstack

One or more hosts must be sending packets that go straight to the N5 software path and consume all the CPU.
This is a software bug on the N5, but we still need to find everything that is triggering it.


Date: 2011-08-05 17:26:38 UTC
The ports of the 2 pcc-2 are shut down.


Date: 2011-08-05 17:26:03 UTC
We will shut down all pcc-12 ports and do a hard reboot.



Date: 2011-08-05 17:24:08 UTC
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102400
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102401
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102402
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102403
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102404
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102405
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102407
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102408
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102409
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102410
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102411
2011 Aug 5 19:14:38 pcc-12a-n5 %VPC-2-PEER_VPC_RESP_TIMEDOUT: Failed to receive response from peer for vPC: 102412

Date: 2011-08-05 17:23:58 UTC
In-service updates do not work at all on the Nexus 5xxx with FEX. We are changing strategy: we will shut down the ports on one of the two sides, which forces traffic onto the second pair, then update the side we cut off. It could crash. Once it is back to normal, we will put it back into production.


Date: 2011-08-05 17:20:18 UTC
pcc-12a and pcc-12b are back to normal after a hard reboot; the FEX are running.


Date: 2011-08-05 17:19:27 UTC
Both pcc-12 switches are down, but they are not releasing the host ports. We are doing a hard reboot.

Date: 2011-08-05 16:52:57 UTC
pcc-12b-n5 has crashed. pcc-12a continues to switch the FEX traffic.

Date: 2011-08-05 16:52:09 UTC
pcc-15-n5: we are going to shut down all FEX ports, then hard-restart the N5.

Date: 2011-08-05 16:51:22 UTC
pcc-25-n5 done

We see the same problem as on pcc-22-n5, which seems linked to the Nexus 5548P: netstack is consuming the CPU.
We already have a TAC case open with Cisco on this subject.


pcc-25-n5# sh processes cpu sort

PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4459 184 43 4294 49.5% netstack

Date: 2011-08-05 16:35:28 UTC
pcc-12a-n5 done
pcc-12b-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive none
101 yes non-disruptive none
102 yes non-disruptive none
103 yes non-disruptive none
104 yes non-disruptive none
105 yes non-disruptive none
106 yes non-disruptive none
107 yes non-disruptive none
108 yes non-disruptive none
109 yes non-disruptive none
110 yes non-disruptive none
111 yes non-disruptive none
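
Across many switches, compatibility tables like the one above can be checked programmatically rather than by eye. The sketch below assumes the whitespace-separated layout as pasted in this report, where the Reason column is empty; the function names are hypothetical.

```python
def parse_compat_rows(output: str) -> list[dict]:
    """Parse the rows of a pasted 'Compatibility check' table."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with the numeric module ID; header and rule lines do not.
        if len(fields) >= 4 and fields[0].isdigit():
            rows.append({
                "module": int(fields[0]),
                "bootable": fields[1] == "yes",
                "impact": fields[2],
                "install_type": fields[3],
            })
    return rows


def modules_at_risk(output: str) -> list[int]:
    """Modules that are not bootable or whose impact is not 'non-disruptive'."""
    return [
        r["module"]
        for r in parse_compat_rows(output)
        if not r["bootable"] or r["impact"] != "non-disruptive"
    ]


table = (
    "Module bootable Impact Install-type Reason\n"
    "------ -------- -------------- ------------ ------\n"
    "1 yes non-disruptive reset\n"
    "100 yes non-disruptive none\n"
    "101 yes non-disruptive none\n"
)

print(modules_at_risk(table))  # []
```

An empty result only means the pre-check reported no expected disruption. As the pcc-11 and pcc-12 updates in this report show, a switch can still crash during an upgrade that the check rated non-disruptive.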

Date: 2011-08-05 16:25:37 UTC
pcc-25-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling

Date: 2011-08-05 16:24:43 UTC
storage-s27b-n5 done

Date: 2011-08-05 16:24:31 UTC
pcc-12a-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling
106 yes non-disruptive rolling
107 yes non-disruptive rolling
108 yes non-disruptive rolling
109 yes non-disruptive rolling
110 yes non-disruptive rolling
111 yes non-disruptive rolling

Date: 2011-08-05 16:24:15 UTC
storage-s27a-n5 done
storage-s27b-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive none
101 yes non-disruptive none
102 yes non-disruptive none
103 yes non-disruptive none
104 yes non-disruptive none
105 yes non-disruptive none

Date: 2011-08-05 16:24:04 UTC
pcc-11b is UP. pcc-11a and pcc-11b have updated the FEX and activated the port of each configured host; once its port is UP, the host sends its traffic to pcc-11.


Date: 2011-08-05 16:21:53 UTC
pcc-11a-n5# 2011 Aug 5 17:36:55 pcc-11a-n5 %VPC-2-VPC_ISSU_END: Peer vPC switch ISSU end, unlocking configuration
2011 Aug 5 17:37:00 pcc-11a-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 154, VPC peer keep-alive receive has failed

pcc-11b has also crashed. pcc-22 has taken over the VLAN switching.

Date: 2011-08-05 16:21:04 UTC
storage-s27a-n5

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling


Date: 2011-08-05 16:20:55 UTC
storage-s28 updated.
We are moving on to storage-s27.


Date: 2011-08-05 16:20:33 UTC
storage-s28a-n5 done, along with its FEX.
storage-s28b-n5 in progress

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive none
101 yes non-disruptive none


Date: 2011-08-05 16:20:22 UTC
pcc-11a-n5 failed during its update. pcc-11b-n5 continues to manage the FEX. pcc-11a is UP again. We will disconnect its FEX.
We are updating pcc-11b. If it works, it will update the FEX and we can reconnect the FEX to pcc-11a.


Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling
106 yes non-disruptive rolling
107 yes non-disruptive rolling
108 yes non-disruptive rolling
109 yes non-disruptive rolling
110 yes non-disruptive rolling
111 yes non-disruptive rolling

Date: 2011-08-05 16:17:34 UTC
2011 Aug 5 17:18:23 pcc-11b-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 154, VPC peer keep-alive receive has failed

Date: 2011-08-05 16:17:27 UTC
pcc-11b-n5# 2011 Aug 5 17:13:45 pcc-11b-n5 %VPC-2-VPC_ISSU_START: Peer vPC switch ISSU start, locking configuration
storage-s28b-n5# 2011 Aug 5 17:14:33 storage-s28b-n5 %VPC-2-VPC_ISSU_START: Peer vPC switch ISSU start, locking configuration

Date: 2011-08-05 16:17:19 UTC
storage-s28a-n5

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling

Date: 2011-08-05 16:17:07 UTC
pcc-11a

Compatibility check is done:
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
100 yes non-disruptive rolling
101 yes non-disruptive rolling
102 yes non-disruptive rolling
103 yes non-disruptive rolling
104 yes non-disruptive rolling
105 yes non-disruptive rolling
106 yes non-disruptive rolling
107 yes non-disruptive rolling
108 yes non-disruptive rolling
109 yes non-disruptive rolling
110 yes non-disruptive rolling
111 yes non-disruptive rolling


Date: 2011-08-05 16:16:56 UTC
pcc-26-n5 in progress

Date: 2011-08-05 16:16:44 UTC
pcc-28-n5 in progress

Date: 2011-08-05 16:16:29 UTC
pcc-29-n5 in progress

Date: 2011-08-05 16:16:17 UTC
storage-s28a-n5 in progress

Date: 2011-08-05 16:16:05 UTC
pcc-10a done
pcc-10b done

pcc-11a in progress
Posted Aug 05, 2011 - 16:15 UTC