FS#1681 — FS#5662 — switch leclerc

Attached to Project: Dedicated Cloud
Incident
Roubaix 4
CLOSED
100%
We had a hardware failure on a switch in the leclerc section of the infrastructure.
It is operational now. We are checking all the infrastructure.
Date: Friday, 05 August 2011, 18:11
Reason for closing:  Done
Comment by OVH - Friday, 05 August 2011, 01:13

Both fan modules failed at the same time (for an unknown reason) on one of the switches to which the impacted storage units are attached.
In that situation the switch protects itself and performs an automatic shutdown after 2 minutes.


Comment by OVH - Friday, 05 August 2011, 18:08

Subject: pCC incident

Hi,
Last night we had an incident on one of the pCC storage racks. The switch managing 19 NAS-HA masters went into a fault state following the simultaneous failure of 2 of its fans.
In itself that is not a problem; it happens. The problem is that the 19 slave heads were unable to detect this failure and did not take over the service immediately.
This case had (strangely) neither been anticipated nor coded for. A mistake.
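To make the gap concrete: a minimal sketch of the detection logic that was missing, assuming a simple heartbeat scheme between master and slave. The interval and miss limit are illustrative assumptions, not values from the report.

```python
# Hypothetical heartbeat check a slave head could run to decide whether
# its master is dead. All names and thresholds here are assumptions.
HEARTBEAT_INTERVAL = 2.0   # seconds between expected heartbeats (assumed)
MISSED_LIMIT = 3           # consecutive misses tolerated before failover (assumed)

def should_take_over(last_heartbeat_at, now):
    """Return True once the master has been silent for longer than
    MISSED_LIMIT heartbeat intervals, i.e. the slave should promote itself."""
    return (now - last_heartbeat_at) > HEARTBEAT_INTERVAL * MISSED_LIMIT
```

With these assumed values, a master silent for more than 6 seconds would trigger a takeover; on 5 August no equivalent check fired at all.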

It took us 1h45 to restore the service. Another mistake, especially since 4 independent teams escalated the problem. Even though we had no data loss, this failure is not acceptable.
Below, plainly, is our analysis.

At 22:30, the switch was electrically cut off.

First of all, the on-call (astreinte) team, who received alerts following the service failure, did not panic at first. Why? A network infrastructure upgrade had been planned, and the on-call team assumed this was simply that maintenance, which would be normal. No verification was made with the network team. That would have helped, since those tasks were not planned for that particular night. Only after about 1 hour of failure did the on-call team start looking for the origin of the problem and alerting the other teams. That was a fatal error.

The pCC team has its own monitoring system. We monitor everything for each customer. But the system turned out to be blocked (we still don't know why) and had not recorded any failure, so the monitoring did not alert the pCC team.
Its alerts only fired after 1h45.

The 3rd team is the support team, which was dealing with a PABX failure at the same time as the incident (!?). The on-call team repaired the PABX server without suspecting
any link with the problem on the pCC. Once the PABX was repaired, support started receiving calls from customers and alerted the VIP team
as well as its supervisor. There we lost one more hour.

Then the 4th team, in the datacenter, received alerts but did not treat the failure as urgent: "everything is going to be ok, no panic" (c).
After 40 minutes of investigation, the datacenter alerted the network teams, who analysed the problem for about 20 minutes. Still not urgently: "everything is ok, no panic" (c).


It is now 23:30.

After about 1 hour, the 4 teams converged and found the origin of the problem. At that point, the decision was taken to repair the switch rather than to fail over manually to the slaves. A huge mistake. The network team repaired the switch in 45 minutes and the service came back to normal.
But failing over to the slaves takes between 15 and 30 seconds. We could have had 1h00 of failure instead of 1h45.

At 00:15 the service is UP.

That's it.

There is no excuse. We were very bad at this level. We can only apologize and trigger the SLA, hoping that your confidence will come back quickly thanks to the improvements we will make in the coming hours and days.

This morning, we determined everything we have to do to improve the internal processes linked to the pCC. We also found several problems preventing the teams from switching into urgent mode when the situation really calls for it. Fundamentally, better communication will speed up problem fixing. The monitoring system is not perfect; it needs to be improved.

Below is the list of improvements we have already started to make:

- name the network equipment by function, not with generic names.

- prepare and validate a standard, independent configuration for the monitoring servers: independent email, SMS and PABX.

- add a fake IP for the NAS-HA service, reachable from outside, and monitor it.

- make the pCC monitoring independent of the pCC itself, running in another datacenter, to monitor each customer's hosts and NAS.

- monitor the pCC monitoring system itself.

- add one monitoring VM per host, over IPv6, to measure each client host's performance from the outside.

- on the NAS-HA, fix the master/slave switching code to take "eth0 down" into account.

- on the NAS-HA, add an OCO check on the default route of both master and slave, handling the case "the slave cannot ping the master && OCO reports master GW down && OCO reports slave GW up".

- officially create a level-2 team in the datacenter, available 24/7 instead of 10 hours a day, 5 days a week.

- alerts for the pCC, FEX, N5 and N7 equipment are now also routed to the network team.

- reduce false alerts, which dull the teams' reactivity.

- support will alert the VIP teams by phone (not SMS), and we are lowering the number of same-type alerts needed to trigger the support supervisor.

- the N5 spare stock will be doubled in each datacenter that has N5 equipment, to decrease the time needed to swap in a spare.

- the "pCC tasks error" monitoring alert will trigger the pCC team 24/7.
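The NAS-HA failover rule described in the list above can be sketched as a single guard condition. This is only an illustration of the stated rule; "OCO" is kept as in the report, and the function and parameter names are our own assumptions.

```python
# Hypothetical encoding of the planned rule:
# "the slave does not ping the master && OCO master GW down && OCO slave GW up".
def slave_should_promote(slave_pings_master, master_gw_up, slave_gw_up):
    """Promote the slave only when the master is unreachable AND the master
    has lost its default gateway AND the slave still has its own gateway.
    The last condition guards against a split-brain: a slave that has lost
    its own network must not promote itself."""
    return (not slave_pings_master) and (not master_gw_up) and slave_gw_up
```

Note how the rule deliberately stays passive when the slave's own gateway is down, since in that case the slave cannot tell whether the master is really dead or the slave itself is isolated.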

Of the 64 failed NAS-HA services, 60 pCC customers were impacted. In terms of SLA it is very simple: OVH guarantees a 100% SLA on storage. That means that, in case of failure, penalties start from the 1st second of failure detected by vSphere. The ratio on storage is "-5% per 10 minutes of failure".
There was no data loss. Failure times are calculated by vSphere, and we are going to apply the penalties automatically.
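As a worked example of the stated ratio, here is a minimal sketch of the penalty calculation. Whether a started 10-minute interval counts in full, and whether the penalty is capped at 100%, are our assumptions; the report only states "-5% per 10 minutes".

```python
import math

def sla_penalty_percent(outage_minutes):
    """Sketch of the stated SLA rule: -5% per 10 minutes of failure.
    Assumptions: every started 10-minute interval counts in full,
    and the penalty is capped at 100%."""
    intervals = math.ceil(outage_minutes / 10)
    return min(100, 5 * intervals)
```

Under these assumptions, the 1h45 outage (105 minutes) would yield 11 started intervals, i.e. a 55% penalty on the affected storage service.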

We are sincerely sorry for the incident and for how it was handled. We are going to do what is necessary so that this does not happen again.

All the best,
Octave