
FS#7099 — FS#10988 — hosts

Attached to Project: Dedicated Cloud
Incident
Rbx2a
In progress
100%
The monitoring system has detected a large quantity of faulty hosts.
We will investigate.
Comment by OVH - Wednesday, 18 June 2014, 13:29PM

The affected hosts all appear to be running ESXi 5.0 Update 1.
They are in a purple screen (PSOD) state.
They are being rebooted.


Comment by OVH - Wednesday, 18 June 2014, 13:32PM

Over half the affected hosts have been checked and rebooted.
The intervention is in progress.


Comment by OVH - Wednesday, 18 June 2014, 13:32PM

All servers have been checked and rebooted.
We are checking on the monitoring system that everything is back to normal.


Comment by OVH - Wednesday, 18 June 2014, 13:32PM

We have opened an SR with VMware for the root cause analysis.

A diagnosis is in progress.


Comment by OVH - Thursday, 26 June 2014, 17:56PM

The root of the problem has been found.

"Engineering have analyzed the dumps and found that the PSOD's were due to corruption which originated from the igb network driver."

We will escalate the SR in order to find the root of the corruption.


Comment by OVH - Friday, 27 June 2014, 10:28AM

VMware engineering found corrupted data in the headers of the network frames.
The exact reason for the corruption is unknown, but it originates from the Intel igb driver.
The firmware and driver versions currently installed are not the latest, so we will proceed with a driver update.

Log analysis: (Bug Id 1272069)
The PSOD occurs because the head pointer of (&(container->slabInfo[2].pktList))->csList is corrupted.

[esx-host3922.ovh.net-2014-06-18--09.04]

(gdb) f 4
#4 PktContainerGetPkt (slabType=PKT_SLAB_HIGH_MEM, container=0x410004c49f00, index=2) at bora/vmkernel/net/pkt.c:3733
3733 entry = PktList_PopHead(&(container->slabInfo[index].pktList));
(gdb) p container
$11 = (PktContainer *) 0x410004c49f00
(gdb) p &(container->slabInfo[index].pktList)
$12 = (PktList *) 0x410004c49fa8
(gdb) p ((PktList *) 0x410004c49fa8)->csList
$13 = {
  slist = {
    head = 0x61646e656974656c,   <---- invalid value
    tail = 0x4100085e4980
  },
  numElements = 11
}
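As a side note on why the head value stands out: the other addresses in this dump start with 0x4100, whereas every byte of the head value falls in the printable ASCII range. The short Python sketch below decodes both values in little-endian order, which is an assumption based on the hosts being x86-64 machines. It only illustrates that the pointer was overwritten with text-like data; where that data came from is not established in this ticket.

```python
# Values copied from the gdb output above.
head = 0x61646E656974656C  # flagged as invalid in the dump
tail = 0x4100085E4980      # neighbouring pointer that looks plausible


def as_ascii(value: int, width: int = 8) -> str:
    """Return the bytes of `value` in little-endian (memory) order,
    showing non-printable bytes as '.'."""
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


head_text = as_ascii(head)
tail_text = as_ascii(tail)

print(head_text)  # eight printable characters: "letienda"
print(tail_text)  # mostly non-printable, as expected for a real address
```

A pointer made up entirely of printable characters is a common sign that a buffer holding text was written over the structure, which is consistent with the corruption VMware engineering describes.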


Comment by OVH - Wednesday, 02 July 2014, 12:31PM

The same issue has just recurred.

We are currently checking all the hosts and verifying their driver versions.


Comment by OVH - Wednesday, 02 July 2014, 15:07PM

The ESXi version is not the relevant factor: a few host servers still have the buggy version of the driver.

~ # vmware -lv
VMware ESXi 5.0.0 build-721882
VMware ESXi 5.0.0 Update 1
~ # esxcli software vib list |grep igb
net-igb 3.2.10-1OEM.500.0.0.472560 Intel VMwareCertified 2013-05-14

We will force an update of the drivers.
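The check above can be scripted when many hosts have to be audited. The Python sketch below is a minimal illustration, not the tooling used for this intervention: it parses text in the format of the `esxcli software vib list` output shown above and flags the driver version quoted in this ticket. The function names and the `AFFECTED_IGB_VERSION` constant are hypothetical, and how the output is collected from each host is left out.

```python
# Version string reported in this ticket for the hosts still affected.
AFFECTED_IGB_VERSION = "3.2.10-1OEM.500.0.0.472560"


def igb_version(vib_list_output: str):
    """Return the net-igb version found in `esxcli software vib list`
    output, or None if the driver is not listed."""
    for line in vib_list_output.splitlines():
        fields = line.split()
        # Expected columns: name, version, vendor, acceptance level, install date.
        if len(fields) >= 2 and fields[0] == "net-igb":
            return fields[1]
    return None


def needs_update(vib_list_output: str) -> bool:
    """True when the host still reports the driver version quoted in the ticket."""
    return igb_version(vib_list_output) == AFFECTED_IGB_VERSION


# Sample line copied from the output above.
sample = "net-igb 3.2.10-1OEM.500.0.0.472560 Intel VMwareCertified 2013-05-14"
print(igb_version(sample), needs_update(sample))
```

Matching on the exact version string keeps the sketch simple; a real audit would more likely compare against the minimum fixed version, which this ticket does not state.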


Comment by OVH - Wednesday, 02 July 2014, 15:09PM

The first host servers are up-to-date.

A reboot is necessary to apply the update.

We will open a ticket for the relevant host servers.


Comment by OVH - Wednesday, 02 July 2014, 19:07PM

All host servers are up-to-date and the tickets concerning the impacted machines have been opened.

We now need to reboot the host servers to apply the driver update.


Comment by OVH - Wednesday, 02 July 2014, 19:08PM

We are checking the entire infrastructure to see if there are any other hosts affected by this update.


Comment by OVH - Friday, 18 July 2014, 10:19AM

The bug impacted some of the remaining servers in the infrastructure.

Tomorrow, all the remaining servers with this version of the driver will be rebooted in order to update the network driver and ensure that they are no longer affected.