OVHcloud Network Status

FS#12261 — bhs2-15b-n6
Incident Report for Network & Infrastructure
Resolved
The n6 rebooted due to a bug related to port-security.

Kernel uptime is 0 day(s), 0 hour(s), 27 minute(s), 33 second(s)

Last reset at 423002 usecs after Mon Dec 22 04:35:09 2014

Reason: Reset triggered due to HA policy of Reset
System version: 6.0(2)N2(4)
Service: eth_port_sec hap reset

During the reboot, traffic was forwarded by the 15a, so there was no downtime.
All the FEX are UP and present.
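
For anyone who wants to pull the reset cause out of this kind of output automatically, here is a minimal sketch. It assumes the text has the same layout as the excerpt pasted above; the function name `parse_reset_report` and the key `uptime_seconds` are made up for this illustration and are not part of any Cisco tool.

```python
import re

# Text copied from the report above.
RESET_OUTPUT = """\
Kernel uptime is 0 day(s), 0 hour(s), 27 minute(s), 33 second(s)

Last reset at 423002 usecs after Mon Dec 22 04:35:09 2014

Reason: Reset triggered due to HA policy of Reset
System version: 6.0(2)N2(4)
Service: eth_port_sec hap reset
"""

FIELD_RE = re.compile(r"^(Reason|System version|Service):\s*(.+)$")
UPTIME_RE = re.compile(
    r"(\d+) day\(s\), (\d+) hour\(s\), (\d+) minute\(s\), (\d+) second\(s\)"
)


def parse_reset_report(text):
    """Return the labelled fields plus the kernel uptime in seconds."""
    report = {}
    for line in text.splitlines():
        line = line.strip()
        field = FIELD_RE.match(line)
        if field:
            report[field.group(1)] = field.group(2)
        uptime = UPTIME_RE.search(line)
        if uptime:
            days, hours, minutes, seconds = map(int, uptime.groups())
            report["uptime_seconds"] = (
                ((days * 24 + hours) * 60 + minutes) * 60 + seconds
            )
    return report


report = parse_reset_report(RESET_OUTPUT)
print(report["Service"], report["uptime_seconds"])
```

The service name is the useful part here: it points at the port-security process as the one that triggered the HA reset.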



Update(s):

Date: 2014-12-22 09:28:32 UTC
The pair of Nexus is stable again! We have not seen any problems for 10 minutes.

Date: 2014-12-22 09:26:27 UTC
We still have ports in err-disabled state on A. On the B, everything is stable again.

bhs2-15a-n6# sh inter status | i err
Eth102/1/42  server-EG       err-disab  trunk  full  auto  --
Eth108/1/36  server-EG       err-disab  trunk  full  auto  --
Eth109/1/14  server-EG       err-disab  trunk  auto  auto  --
Eth109/1/43  server-EG       err-disab  trunk  auto  auto  --
Eth110/1/22  server-SP-HOST  err-disab  trunk  auto  auto  --
Eth113/1/44  server-SP-HOST  err-disab  trunk  auto  auto  --
Eth115/1/2   server-SP-HOST  err-disab  trunk  auto  auto  --

We are going to do one last reload of the n6 A. All FEX are up on the B, which will forward the traffic during the reboot.
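
The err-disabled listing in this update can be grouped per FEX to see which racks are still affected. This is a rough sketch: it assumes each line follows the column order shown above (port, name, status, ...) and that the first number in `EthX/Y/Z` is the FEX number. The helper name `err_disabled_by_fex` is hypothetical.

```python
from collections import defaultdict

# Lines copied from the "sh inter status | i err" output above.
ERR_OUTPUT = """\
Eth102/1/42 server-EG err-disab trunk full auto --
Eth108/1/36 server-EG err-disab trunk full auto --
Eth109/1/14 server-EG err-disab trunk auto auto --
Eth109/1/43 server-EG err-disab trunk auto auto --
Eth110/1/22 server-SP-HOST err-disab trunk auto auto --
Eth113/1/44 server-SP-HOST err-disab trunk auto auto --
Eth115/1/2 server-SP-HOST err-disab trunk auto auto --
"""


def err_disabled_by_fex(text):
    """Map FEX number -> list of err-disabled ports on that FEX."""
    by_fex = defaultdict(list)
    for line in text.splitlines():
        cols = line.split()
        # Third column is the port status; skip anything else.
        if len(cols) < 3 or cols[2] != "err-disab":
            continue
        port = cols[0]
        if not port.startswith("Eth"):
            continue
        fex_id = int(port[3:].split("/")[0])
        by_fex[fex_id].append(port)
    return dict(by_fex)


ports = err_disabled_by_fex(ERR_OUTPUT)
for fex_id in sorted(ports):
    print(fex_id, ports[fex_id])
```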


Date: 2014-12-22 09:25:22 UTC
The pair is UP, the FEX are all UP.
4 servers remain down.

Date: 2014-12-22 09:24:46 UTC
The A has now been updated.
After its reboot, the B was in a strange state: it kept the uplink port to bhs-3a / b-a9 in suspended state even though the vPC is UP.

We are reloading it to start again from a clean base.

16 servers remain down on this pair.

Date: 2014-12-22 09:23:47 UTC
Nothing is going as planned...
The A has crashed too.

I am moving the FEX back to the B, which runs the latest version; the FEX are updating.
FEX     FEX            FEX             FEX                Fex
Number  Description    State           Model              Serial
------------------------------------------------------------------------
100     FEX100|T02A40  Image Download  N2K-C2248TP-E-1GE  SSI16410495
101     FEX101|T02A41  Connected       N2K-C2248TP-E-1GE  FOX1724G9CL
102     FEX102|T02A42  Connected       N2K-C2248TP-E-1GE  SSI17160DEA
103     FEX103|T02A43  Connected       N2K-C2248TP-E-1GE  FOX1724GZ4S
104     FEX104|T02A44  Connected       N2K-C2248TP-E-1GE  FOX1724GZ5S
105     FEX105|T02A45  Online          N2K-C2248TP-E-1GE  SSI17160D7R
106     FEX106|T02A46  Online          N2K-C2248TP-E-1GE  FOX1720GEK6
107     FEX107|T02A47  Connected       N2K-C2248TP-1GE    SSI1601073V
108     FEX108|T02A48  Online          N2K-C2248TP-E-1GE  FOX1720GE3G
109     FEX109|T02A49  Connected       N2K-C2248TP-E-1GE  FOX1720GEMP
110     FEX110|T02D05  Connected       N2K-C2248TP-E-1GE  SSI173608P6
111     FEX111|T02A61  Connected       N2K-C2248TP-E-1GE  SSI1641048V
112     FEX112|T02D06  Connected       N2K-C2248TP-E-1GE  FOX1750GJ2J
113     FEX113|T02D07  Connected       N2K-C2248TP-E-1GE  SSI173608RT
114     FEX114|T02D08  Connected       N2K-C2248TP-E-1GE  SSI173606JB
115     FEX115|T02D09  Connected       N2K-C2248TP-E-1GE  FOX1749GBF5
116     FEX116|T02D10  Online          N2K-C2248TP-E-1GE  SSI1736062S
117     FEX117|T02D11  Online          N2K-C2248TP-E-1GE  FOX1748G4U1
118     FEX118|T02D12  Online          N2K-C2248TP-E-1GE  SSI173606JS
119     FEX119|T02D13  Connected       N2K-C2248TP-E-1GE  FOX1748G4T6
120     FEX120|T02D14  Connected       N2K-C2248TP-E-1GE  FOX1750GNV3
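
To follow the FEX as they come back, the listing above can be summarised by state. The sketch below is only an illustration: it uses four rows copied from the table (the full listing has 21 FEX), it assumes every model name starts with `N2K-` as it does here, and `parse_show_fex` is a made-up helper name.

```python
import re
from collections import Counter

# A few rows copied from the listing above, not the full table.
SHOW_FEX = """\
100 FEX100|T02A40 Image Download N2K-C2248TP-E-1GE SSI16410495
101 FEX101|T02A41 Connected N2K-C2248TP-E-1GE FOX1724G9CL
105 FEX105|T02A45 Online N2K-C2248TP-E-1GE SSI17160D7R
107 FEX107|T02A47 Connected N2K-C2248TP-1GE SSI1601073V
"""

# The state may contain spaces ("Image Download"), so it is matched
# lazily up to the model column, assumed to start with "N2K-".
ROW_RE = re.compile(r"^(\d+)\s+(\S+)\s+(.+?)\s+(N2K-\S+)\s+(\S+)$")


def parse_show_fex(text):
    """Return one dict per FEX row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            number, description, state, model, serial = match.groups()
            rows.append({
                "number": int(number),
                "description": description,
                "state": state,
                "model": model,
                "serial": serial,
            })
    return rows


rows = parse_show_fex(SHOW_FEX)
states = Counter(row["state"] for row in rows)
not_online = [row["number"] for row in rows if row["state"] != "Online"]
print(states)
print("not yet Online:", not_online)
```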

Date: 2014-12-22 09:22:10 UTC
The B crashed during the operation.

bb [local7.err] === : 2014 Dec 22 07:58:33 CET: %SYSMGR-3-HEARTBEAT_FAILURE: Service "afm" sent SIGABRT for not setting heartbeat for last 4 periods. Last heartbeat 175.15 secs ago.
ba [local7.crit] === : 2014 Dec 22 07:58:33 CET: %SYSMGR-2-SERVICE_CRASHED: Service "afm" (PID 3986) hasn't caught signal 6 (core will be saved).
ba [local7.crit] === : 2014 Dec 22 07:58:33 CET: %SYSMGR-2-HAP_FAILURE_SUP_RESET: System reset due to service "afm" in vdc 1 has had a hap failure
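
The three log lines above follow the usual Cisco `%FACILITY-SEVERITY-MNEMONIC:` message format, where a lower severity number means a more serious event. A small sketch of how the crash sequence could be extracted from such lines follows; the helper name `parse_events` is an assumption for this example, and the regular expressions are only tested against these three lines.

```python
import re

# Log lines copied from the report above.
LOG_LINES = """\
bb [local7.err] === : 2014 Dec 22 07:58:33 CET: %SYSMGR-3-HEARTBEAT_FAILURE: Service "afm" sent SIGABRT for not setting heartbeat for last 4 periods. Last heartbeat 175.15 secs ago.
ba [local7.crit] === : 2014 Dec 22 07:58:33 CET: %SYSMGR-2-SERVICE_CRASHED: Service "afm" (PID 3986) hasn't caught signal 6 (core will be saved).
ba [local7.crit] === : 2014 Dec 22 07:58:33 CET: %SYSMGR-2-HAP_FAILURE_SUP_RESET: System reset due to service "afm" in vdc 1 has had a hap failure
"""

MSG_RE = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")
SERVICE_RE = re.compile(r'[Ss]ervice "([^"]+)"')


def parse_events(text):
    """Return (facility, severity, mnemonic, service) for each log line."""
    events = []
    for line in text.splitlines():
        msg = MSG_RE.search(line)
        if not msg:
            continue
        service = SERVICE_RE.search(line)
        events.append((
            msg.group(1),
            int(msg.group(2)),
            msg.group(3),
            service.group(1) if service else None,
        ))
    return events


events = parse_events(LOG_LINES)
for facility, severity, mnemonic, service in events:
    print(facility, severity, mnemonic, service)
```

Read in order, the lines show the `afm` service missing its heartbeat, being killed with SIGABRT, and the supervisor resetting as a result.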

Date: 2014-12-22 09:19:58 UTC
We can't do it without an outage.
I am stopping the ISSU on B.


Remaining action::
"Module(s) 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120 still need to be upgraded".

Install has been aborted.

The upgrade failed while updating FEX 100; the servers are down.
The B is stuck with FEX 100 in the Check Upg Seq state.


Plan of action:
- I disconnect the FEX from B
- I reload B
- I update NX-OS on A and then move the FEX to B.

There will be downtime during the update of the FEX.


Date: 2014-12-22 09:17:03 UTC
The ISSU is not working!

FEX 100 is stuck:
FEX     FEX            FEX             FEX                Fex
Number  Description    State           Model              Serial
------------------------------------------------------------------------
100     FEX100|T02A40  Check Upg Seq   N2K-C2248TP-E-1GE  SSI16410495

bhs2-15a-n6# sh fex
FEX     FEX            FEX             FEX                Fex
Number  Description    State           Model              Serial
------------------------------------------------------------------------
100     FEX100|T02A40  Image Download  N2K-C2248TP-E-1GE  SSI16410495


Date: 2014-12-22 09:15:43 UTC
Ready to go for the upgrade.
notifying services about system upgrade.
[####################] 100% -- SUCCESS



Compatibility check is done:
Module  bootable  Impact          Install-type  Reason
------  --------  --------------  ------------  ------
1       yes       non-disruptive  reset
2       yes       non-disruptive  rolling
100     yes       non-disruptive  rolling
101     yes       non-disruptive  rolling
102     yes       non-disruptive  rolling
103     yes       non-disruptive  rolling
104     yes       non-disruptive  rolling
105     yes       non-disruptive  rolling
106     yes       non-disruptive  rolling
107     yes       non-disruptive  rolling
108     yes       non-disruptive  rolling
109     yes       non-disruptive  rolling
110     yes       non-disruptive  rolling
111     yes       non-disruptive  rolling
112     yes       non-disruptive  rolling
113     yes       non-disruptive  rolling
114     yes       non-disruptive  rolling
115     yes       non-disruptive  rolling
116     yes       non-disruptive  rolling
117     yes       non-disruptive  rolling
118     yes       non-disruptive  rolling
119     yes       non-disruptive  rolling
120     yes       non-disruptive  rolling
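
The compatibility table above can be checked mechanically before launching the upgrade. Here is a minimal sketch, assuming the four-column row layout shown (module, bootable, impact, install type) and using only the first rows of the table as sample data. `parse_compat` is a hypothetical helper name. A clean result only reflects what the pre-check reports, not how the upgrade will actually go.

```python
# First rows copied from the compatibility check above, not the full table.
COMPAT_TABLE = """\
Module bootable Impact Install-type Reason
------ -------- -------------- ------------ ------
1 yes non-disruptive reset
2 yes non-disruptive rolling
100 yes non-disruptive rolling
101 yes non-disruptive rolling
"""


def parse_compat(text):
    """Return one dict per module row; header and dashes are skipped."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        # Data rows start with a numeric module id.
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        rows.append({
            "module": int(cols[0]),
            "bootable": cols[1] == "yes",
            "impact": cols[2],
            "install_type": cols[3],
        })
    return rows


rows = parse_compat(COMPAT_TABLE)
# Modules the pre-check flags as not bootable or as disruptive.
risky = [
    row["module"]
    for row in rows
    if not row["bootable"] or row["impact"] != "non-disruptive"
]
rolling = [row["module"] for row in rows if row["install_type"] == "rolling"]
print("flagged modules:", risky)
print("rolling upgrades:", rolling)
```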

Date: 2014-12-22 09:14:37 UTC
After the reboot the FEX are not UP.

As soon as both sides are up, I will turn on port-security and run the ISSU.



Date: 2014-12-22 09:01:51 UTC
The images are downloaded.

The B has been rebooted.




Date: 2014-12-22 09:00:02 UTC
Okay

The image is being downloaded on the n6.

The pair is stable. I will run the ISSU upgrade around 4 or 5 am.


Date: 2014-12-22 08:58:29 UTC
I spoke too soon: the 15a will be rebooted in a moment (forwarding by the 15b).

I prepared the ISSU upgrade for the pair.
Posted Dec 22, 2014 - 08:57 UTC