Hosted Private Cloud Status - FS#6788

OVHcloud Private Cloud Status

FS#6788 — Service vCenter

Incident Report for Hosted Private Cloud

Resolved

A switch supporting the vCenter service crashed.
We are working on it.

Update(s):

Date: 2012-05-31 11:53:12 UTC
All vShield Manager instances are operational.

Date: 2012-05-29 19:59:52 UTC
All infrastructures are operational.

Some vShield Manager instances are still unavailable; they will be fixed overnight.

Date: 2012-05-29 17:46:18 UTC
Most of the infrastructures are now operational.

We are continuing the maintenance work.

Date: 2012-05-29 17:45:20 UTC
We isolated this switch with only FEX 105, which had been having a problem, connected.
With only this FEX connected, the switch does not crash.
We are now trying to reconnect the 4 other FEX that were initially plugged in.

Date: 2012-05-29 17:43:49 UTC
We connected the FEX on one side only. This caused the switch concerned to crash.
Switching continues on the other side.

The core dumps were retrieved and escalated to the developers at Cisco.

------------------
2012 May 29 16:33:05 pcc-30a-n5 %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 3166) hasn't caught signal 6 (core will be saved).

Broadcast message from root (console) (Tue May 29 16:33:18 2012):

The system is going down for reboot NOW!
------------------

Date: 2012-05-29 17:42:25 UTC
We are connecting the new FEX.

Date: 2012-05-29 13:43:38 UTC
The 2 switches crashed again. We have identified the damaged FEX and are replacing it.

Date: 2012-05-29 13:02:20 UTC
95% of the vCenter services are up. We are restarting the remaining services that are still having problems.

Date: 2012-05-29 13:01:26 UTC
We are investigating the cause of the crash with the manufacturer.

Date: 2012-05-29 12:29:43 UTC
Reason: Reset triggered due to HA policy of Reset

Date: 2012-05-29 12:29:34 UTC
We have checked the connectivity of every host running the vCenter services.

Date: 2012-05-29 12:28:57 UTC
One of the 2 switches also crashed. The other one provided redundancy.

--------------
2012 May 29 13:31:36 pcc-30b-n5 %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 3163) hasn't caught signal 6 (core will be saved).

Broadcast message from root (console) (Tue May 29 13:31:50 2012):

The system is going down for reboot NOW!
--------------

Date: 2012-05-29 12:24:40 UTC
The 2 dual-homed switches supporting the vCenter service crashed one after the other:

pcc-30b-n5:
-------------
2012 May 29 13:04:12 pcc-30b-n5 %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 3277) hasn't caught signal 6 (core will be saved).

Broadcast message from root (console) (Tue May 29 13:04:25 2012):

The system is going down for reboot NOW!
--------------

pcc-30a-n5:
-------------
2012 May 29 13:04:30 pcc-30a-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 30, VPC peer keep-alive receive has failed
2012 May 29 13:05:01 pcc-30a-n5 %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 3284) hasn't caught signal 6 (core will be saved).

Broadcast message from root (console) (Tue May 29 13:05:13 2012):

The system is going down for reboot NOW!
-------------
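
For readers who want to check the crash order in the excerpts above, here is a minimal sketch in Python. It assumes the log lines are available as plain text in the format shown above. The names `CRASH_RE`, `parse_crashes` and `SAMPLE` are hypothetical and are not part of any OVH or Cisco tooling.

```python
import re
from datetime import datetime

# Matches the SERVICE_CRASHED lines shown in this report, for example:
# 2012 May 29 13:04:12 pcc-30b-n5 %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 3277) ...
# A backslash before each quote is tolerated, in case the text was escaped on export.
CRASH_RE = re.compile(
    r'^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) %SYSMGR-2-SERVICE_CRASHED: '
    r'Service \\?"(?P<service>[^"\\]+)\\?" \(PID (?P<pid>\d+)\)'
)


def parse_crashes(lines):
    """Return (timestamp, host, service, pid) tuples, sorted by timestamp."""
    events = []
    for line in lines:
        match = CRASH_RE.match(line.strip())
        if match is None:
            continue  # not a service-crash line (e.g. the vPC keep-alive message)
        # Month abbreviations are parsed with the default (English) locale.
        ts = datetime.strptime(match.group("ts"), "%Y %b %d %H:%M:%S")
        events.append(
            (ts, match.group("host"), match.group("service"), int(match.group("pid")))
        )
    return sorted(events)


# Lines copied from the excerpts above.
SAMPLE = """\
2012 May 29 13:04:12 pcc-30b-n5 %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 3277) hasn't caught signal 6 (core will be saved).
2012 May 29 13:04:30 pcc-30a-n5 %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 30, VPC peer keep-alive receive has failed
2012 May 29 13:05:01 pcc-30a-n5 %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 3284) hasn't caught signal 6 (core will be saved).
"""

if __name__ == "__main__":
    crashes = parse_crashes(SAMPLE.splitlines())
    for ts, host, service, pid in crashes:
        print(ts.isoformat(sep=" "), host, service, pid)
    gap = (crashes[1][0] - crashes[0][0]).total_seconds()
    print(f"seconds between the two crashes: {gap:.0f}")
```

On the sample lines, the script lists the "fwm" crash on pcc-30b-n5 first and the one on pcc-30a-n5 49 seconds later.
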

The switches are back up.
We are launching a check of the vCenter service infrastructure.

Posted May 29, 2012 - 12:23 UTC