
FS#1645 — FS#5626 — pcc-19a pcc-19b

Attached to Project — Dedicated Cloud
Incident
Roubaix 4
CLOSED
100%
The switch pair pcc-19a-n5 and pcc-19b-n5, which manages eth0
on some of the hosts, showed strange behaviour at 22:10.
pcc-19a-n5 lost and then rediscovered the 11 FEX of the racks
it manages, for no apparent reason: during that time
pcc-19b-n5 had no problem at all.

After this DOWN/UP, the two switches pcc-19a-n5 and pcc-19b-n5 were
no longer switching all the MAC addresses: some yes, others no.
So even though the port was UP on the host side,
traffic was not passing between the VMs and the Internet.
During this problem, eth1 kept working through pcc-20a
and pcc-20b, ensuring traffic between the hosts and the storage.


We have not found the origin of this problem, which seems to be a
software bug in the NX-OS version running on these switches: 5.0(3)N1(1b).

We simply restarted the two switches at 23:40.
Traffic stayed on eth1 (so on pcc-20a and pcc-20b) during the reboot,
and after that everything came back up. The switches pcc-19a and pcc-19b
are back in service and are switching on eth0 without problems.

A new version of NX-OS exists, 5.0(3)N2(1),
and we are going to plan the upgrade of
the whole infrastructure to this latest NX-OS version.


If we have not done it yet, it is because we had problems
doing the upgrades without causing a crash.
In fact, the in-service software upgrades (ISSU) do not work either,
and we sometimes see strange behaviour with them.
We recently received the information that an ISSU upgrade
can trigger a software bug because not all the information
is really updated: some « stuff » remains in memory.
A (proper) hard reboot is necessary to get back to a clean state.

Example :
http://status.ovh.net/?do=details&id=1625


So, knowing that, we now know how to upgrade the Cisco Nexus
switches correctly: with a hard reboot.
Since we have two physical networks for each host,
we are going to shut down the eth0 port of each host and
then do the maintenance on the eth0 side only.
Then the same thing on eth1. It is long and not really sexy,
but today it is the only procedure that seems to work in 100% of the cases.
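
For reference, the per-side maintenance could look roughly like the
following NX-OS CLI sketch. This is a hedged sketch only: the interface
numbering and the image file names are assumptions for illustration, not
taken from the actual configuration.

```
! Shut the host-facing eth0 ports first (Eth1/1 is an example port):
pcc-19a-n5# configure terminal
pcc-19a-n5(config)# interface ethernet 1/1
pcc-19a-n5(config-if)# shutdown
pcc-19a-n5(config-if)# exit

! Point the boot variables at the new 5.0(3)N2(1) images
! (file names are assumed -- check what is actually on bootflash):
pcc-19a-n5(config)# boot kickstart bootflash:n5000-uk9-kickstart.5.0.3.N2.1.bin
pcc-19a-n5(config)# boot system bootflash:n5000-uk9.5.0.3.N2.1.bin
pcc-19a-n5(config)# end
pcc-19a-n5# copy running-config startup-config

! Hard reboot (no ISSU), then re-enable the ports once the switch is back:
pcc-19a-n5# reload
pcc-19a-n5# configure terminal
pcc-19a-n5(config)# interface ethernet 1/1
pcc-19a-n5(config-if)# no shutdown
```

The same sequence would then be repeated on the eth1 side (pcc-20a and
pcc-20b) once traffic is confirmed healthy on eth0 again.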





Date:  Friday, 22 July 2011, 15:35
Reason for closing:  Done