Bare Metal Cloud Status - FS#4847

OVHcloud Bare Metal Cloud Status

FS#4847 - HG servers under Windows

Incident Report for Bare Metal Cloud

Resolved

Some HG servers, apparently those running Windows, have not responded
to ping since 6:36. We are still looking for the cause of the problem.

Update(s):

Date: 2010-11-18 15:40:58 UTC
The cause of the problem has been found. Overnight, the team
responsible for deploying new servers installed the new HG servers
and mistakenly assigned them the IP address of the DHCP servers.
This crashed all of the HG servers that use DHCP.

A lack of communication between internal teams in the same
data centre caused this problem. We will fix this communication
problem and set up a DHCP server outside the network. We will then
refund the customers affected by the crash.
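
An address conflict of the kind described in this update can be caught before new servers go live by comparing the planned assignments against a list of reserved infrastructure addresses. The following is only a minimal sketch under stated assumptions: the host names, the addresses (taken from the 192.0.2.0/24 documentation range) and the helper find_reserved_conflicts are hypothetical and do not reflect the tooling actually used by the teams.

```python
import ipaddress


def find_reserved_conflicts(planned, reserved):
    """Return the planned hosts whose address collides with a reserved one.

    planned:  dict mapping host name -> IP address string (hypothetical data)
    reserved: iterable of IP address strings kept for infrastructure,
              for example DHCP servers (hypothetical data)
    """
    reserved_set = {ipaddress.ip_address(addr) for addr in reserved}
    return {
        host: addr
        for host, addr in planned.items()
        if ipaddress.ip_address(addr) in reserved_set
    }


# Hypothetical example values, not taken from the incident.
reserved = ["192.0.2.1", "192.0.2.2"]
planned = {"hg-new-01": "192.0.2.2", "hg-new-02": "192.0.2.50"}

conflicts = find_reserved_conflicts(planned, reserved)
print(conflicts)
```

Parsing the addresses with ipaddress rather than comparing strings means that differently written forms of the same address still match.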

Date: 2010-11-18 15:33:27 UTC
Of the 53 Windows servers in racks 27XXX on the affected network,
only 18 are not working. They use DHCP to boot.

We will replace the network cards in one of the servers to see whether
that fixes the problem.

Date: 2010-11-18 15:30:45 UTC
The servers correctly announce their MAC addresses on the network, but it still does not work.

Date: 2010-11-18 15:30:03 UTC
The switch is up to date. It still does not work.

That leaves a possible hardware problem. We will intervene to replace
the hardware.

Date: 2010-11-18 15:28:40 UTC
sw-n5-14.242# install all kickstart bootflash:n5000-uk9-kickstart.4.2.1.N1.1.bin system bootflash:n5000-uk9.4.2.1.N1.1.bin

Verifying image bootflash:/n5000-uk9-kickstart.4.2.1.N1.1.bin for boot variable "kickstart".
[####################] 100% -- SUCCESS

Verifying image bootflash:/n5000-uk9.4.2.1.N1.1.bin for boot variable "system".
[####################] 100% -- SUCCESS

Verifying image type.
[####################] 100% -- SUCCESS

Extracting "system" version from image bootflash:/n5000-uk9.4.2.1.N1.1.bin.
[####################] 100% -- SUCCESS

Extracting "kickstart" version from image bootflash:/n5000-uk9-kickstart.4.2.1.N1.1.bin.
[####################] 100% -- SUCCESS

Extracting "bios" version from image bootflash:/n5000-uk9.4.2.1.N1.1.bin.
[####################] 100% -- SUCCESS

Notifying services about system upgrade.
[####################] 100% -- SUCCESS

Compatibility check is done:
Module  bootable          Impact  Install-type  Reason
------  --------  --------------  ------------  ------
     1       yes      disruptive         reset  Reset due to single supervisor

Images will be upgraded according to following table:
Module       Image         Running-Version             New-Version  Upg-Required
------  ----------  ----------------------  ----------------------  ------------
     1      system             4.1(3)N2(1)             4.2(1)N1(1)           yes
     1   kickstart             4.1(3)N2(1)             4.2(1)N1(1)           yes
     1        bios        v1.3.0(09/08/09)        v1.3.0(09/08/09)            no
     1   power-seq                    v1.2                    v1.2            no

Switch will be reloaded for disruptive upgrade.
Do you want to continue with the installation (y/n)? [n] y

Install is in progress, please wait.

Setting boot variables.
[####################] 100% -- SUCCESS

Performing configuration copy.
[####################] 100% -- SUCCESS

Module 1: Refreshing compact flash and upgrading bios/loader/bootrom/power-seq.
Warning: please do not remove or power off the module at this time.
Note: Power-seq upgrade needs a power-cycle to take into effect.
On success of power-seq upgrade, SWITCH OFF THE POWER to the system and then, power it up.
[####################] 100% -- SUCCESS

Finishing the upgrade, switch will reboot in 10 seconds.
sw-n5-14.242#
Broadcast message from root (Thu Nov 18 10:26:57 2010):

The system is going down for reboot NOW!
2010 Nov 18 10:26:57 sw-n5-14.242 %KERN-0-SYSTEM_MSG: writing reset reason 31, - kernel
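
The upgrade table in the output above can also be read programmatically, for example to list which images the installer will actually replace. This is only a minimal sketch: parse_upgrade_table is a hypothetical helper, not part of NX-OS, and it assumes the rows are whitespace-separated with five fields, as in the capture.

```python
def parse_upgrade_table(text):
    """Parse the rows of an 'Images will be upgraded' table.

    Assumes five whitespace-separated fields per data row and a numeric
    module number in the first field; header and separator lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0].isdigit():
            module, image, running, new, required = parts
            rows.append(
                {
                    "module": int(module),
                    "image": image,
                    "running": running,
                    "new": new,
                    "required": required == "yes",
                }
            )
    return rows


# Table rows copied from the console output in this report.
table = """\
Module Image Running-Version New-Version Upg-Required
------ ---------- ---------------------- ---------------------- ------------
1 system 4.1(3)N2(1) 4.2(1)N1(1) yes
1 kickstart 4.1(3)N2(1) 4.2(1)N1(1) yes
1 bios v1.3.0(09/08/09) v1.3.0(09/08/09) no
1 power-seq v1.2 v1.2 no
"""

to_upgrade = [row["image"] for row in parse_upgrade_table(table) if row["required"]]
print(to_upgrade)  # ['system', 'kickstart']
```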

Date: 2010-11-18 15:28:17 UTC
We will restart the switch.

Meanwhile, we have looked internally for similar problems,
and apparently we had problems with Linux on 10G. Because of
incompatibilities, we had to introduce specific procedures
covering the choice of SFP+ cables and network cards in order
to run Linux. We did not have this problem under Windows.

We will therefore check at the same time whether this is the same
problem as under Linux, but it is happening long after the introduction
of Windows and on a single network. Very weird.

The switch has started booting.

Date: 2010-11-18 15:19:33 UTC
It does not work.

We will update the switch to see whether that fixes the problem.

Date: 2010-11-18 15:18:34 UTC
Same thing.

We will therefore change the ports for the 7 HG servers under Windows
that no longer work.

Date: 2010-11-18 15:16:53 UTC
We tried a different reconfiguration of the port. It does not
work. We recovered one server by changing its switch port.
It seems to be a bug in the switch software.
We will see whether we can recover the servers by restarting the
switch.

Posted Nov 18, 2010 - 13:01 UTC