OVHcloud Network Status

FS#6533 — general routing
Incident Report for Network & Infrastructure
Resolved
We have had a general routing problem.
We are looking for the origin of the problem.

Apparently, a card in one of the two routers in Roubaix began to malfunction but did not crash completely. This caused the isolation of the network and a split between the Paris, Roubaix and London segments.

We have electrically shut down the card and are checking the logs in order to understand how a single card could trigger such a problem.

Update(s):

Date: 2012-04-16 09:17:15 UTC
To claim the SLA credit, please visit https://www.ovh.co.uk/managerv3/sla-list.pl

Date: 2012-03-29 00:39:18 UTC
Both routers, rbx-g1 and rbx-g2, are working properly. The patches have been applied.

Date: 2012-03-29 00:08:39 UTC
This time there is no issue with BGP. The router is up and stable. We are reactivating traffic on it.

Date: 2012-03-29 00:06:42 UTC
The rbx-g2 router is isolated from the network. Routing is now provided by rbx-g1. The router will be reloaded as part of the patching process.

Date: 2012-03-29 00:04:47 UTC
All BGP sessions are back up. Work continues on rbx-g2.

Date: 2012-03-28 23:22:48 UTC
The patches are applied and the router is stable.
However, we have a problem with BGP. One of the sessions to rf-2 (a BGP route reflector) does not come up in IPv4, and another one to rf-1 does not come up in IPv6. We are looking at this more closely before proceeding further.

Date: 2012-03-28 23:19:19 UTC
Wed Mar 28 22:31:25.042 UTC
Install operation 6 '(admin) install activate
disk0:asr9k-px-4.2.0.CSCty46761-1.0.0 disk0:asr9k-px-4.2.0.CSCtx89601-1.0.0'
started by user 'gui' via CLI at 22:31:25 UTC Wed Mar 28 2012.
Info: This operation will reload the following nodes in parallel:
Info: 0/RSP0/CPU0 (RP) (SDR: Owner)
Info: 0/RSP1/CPU0 (RP) (SDR: Owner)
Info: 0/0/CPU0 (LC) (SDR: Owner)
Info: 0/1/CPU0 (LC) (SDR: Owner)
Info: 0/2/CPU0 (LC) (SDR: Owner)
Info: 0/3/CPU0 (LC) (SDR: Owner)
Info: 0/4/CPU0 (LC) (SDR: Owner)
Info: 0/5/CPU0 (LC) (SDR: Owner)
Info: 0/6/CPU0 (LC) (SDR: Owner)
Info: 0/7/CPU0 (LC) (SDR: Owner)

Date: 2012-03-28 23:19:03 UTC
Routing is provided by rbx-g2. We are applying the patches.
A full reload of the router rbx-g1 will be performed. No impact on traffic is expected, since routing is handled by rbx-g2.

Date: 2012-03-28 22:30:58 UTC
We started deploying the patch.

We are isolating rbx-g1-a9 from the network.

Date: 2012-03-28 20:14:08 UTC
Both patches (Cisco bug IDs and the corresponding SMU packages):
CSCty46761
CSCtx89601

asr9k-px-4.2.0.CSCtx89601-1.0.0
asr9k-px-4.2.0.CSCty46761-1.0.0

Date: 2012-03-28 18:39:05 UTC
Hello,
We had a routing problem tonight, due to a software bug which affected the 2 main routers in Roubaix. These Cisco ASR 9010s aggregate the bandwidth of the Roubaix datacentres (RBX1, RBX2, RBX3, RBX4, RBX5) and provide the connections to Paris, Brussels, Amsterdam, London and Frankfurt. In short, the routing heart of Roubaix.

This bug is known and is indeed related to the new cards (24x10G per slot) that we installed at the end of January. For a random reason the card detects RAM ECC errors and stops routing packets. Despite this, the card is not declared "failed" and remains in the router as if it were healthy.
The other routers keep sending it packets, but there is nothing on the other side to process them. This causes a major problem: the network no longer works correctly.
Worst of all: it is not a clean failure.
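
To illustrate the failure mode described above, here is a minimal Python sketch (purely illustrative, not OVH tooling; all names and values are hypothetical) of why a check based only on reported link state misses this fault, while an end-to-end probe through the card does not:

from dataclasses import dataclass

@dataclass
class LineCard:
    # Toy model: what the chassis reports vs. what happens to packets.
    name: str
    link_up: bool = True       # the card still reports "up"
    forwarding: bool = True    # whether packets actually get through

    def hit_ecc_bug(self):
        # ECC error in card memory: forwarding stops, but the card is
        # never declared failed (the behaviour described above).
        self.forwarding = False

def link_state_check(card):
    # Health check that only looks at the reported state: insufficient here.
    return card.link_up

def data_plane_probe(card, probes=5):
    # Health check that pushes test packets through the card.
    delivered = probes if card.forwarding else 0
    return delivered == probes

card = LineCard("24x10G slot 0/1")
card.hit_ecc_bug()
print(link_state_check(card))   # True  -> neighbours keep sending traffic
print(data_plane_probe(card))   # False -> the silent black hole is visible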

Tonight, 3 of the 24x10G cards across the 2 ASR 9010 routers hit this bug at almost the same time. This broke the network into 3 pieces: USA/London/Amsterdam/Warsaw, Roubaix, and Paris/Frankfurt/Madrid/Milan, with the packets being drawn into Roubaix. Normally the traffic would have been rerouted, but instead it was drawn into Roubaix and dropped there.

As a result, we could not use the network itself to manage it and retrieve the logs from all the routers in order to find the origin of the problem.
We fell back to the old way, using rescue/out-of-band connections to reach each backbone router and check whether that router was the origin of the issue.
This operation took time, since there were 2 broken routers, and it took us a while to understand that the problem was due not only to the router rbx-g2-a9 but also to rbx-g1-a9.
Once we restarted the 3 cards, everything went back to normal within 5 minutes.

We had already opened a ticket with Cisco 3 weeks ago regarding the RAM ECC issue. Cisco worked on the matter and provided, this morning, the software patch to apply to these routers in order to fix the problem. We are going to start the operation tonight. No outage is expected.

We will also focus on how to improve the management of our routers when the entire backbone is down for a reason that "will never happen".
We know how to handle this case, but it takes a long time. A very long time.

In any case, the outage lasted around 1h22, whereas under the 99.9% SLA we are "entitled" to no more than 43 minutes of downtime per month. Penalties are triggered when we exceed the allowed time.
Example: for OVH dedicated servers it is 5% per hour of unavailability.
We are going to set up a URL so that you can claim the SLA and send us the documentation to credit the 5% onto your service. It will be posted in this task:
http://status.ovh.net/?do=details&id=2571
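
As a rough illustration of the figures quoted above (only the numbers from this post are used; the exact SLA terms are those published by OVH), a short Python sketch of the 99.9% monthly allowance and the 5%-per-hour credit:

import math

# 99.9% availability over a 30-day month leaves ~43 minutes of allowed downtime.
minutes_per_month = 30 * 24 * 60
allowed_minutes = minutes_per_month * (1 - 0.999)
print(round(allowed_minutes, 1))   # 43.2

# Outage of 1h22 = 82 minutes, credited at 5% per hour of unavailability
# (dedicated-server example above); rounding partial hours up is an assumption.
outage_minutes = 82
credit_pct = 5 * math.ceil(outage_minutes / 60)
print(credit_pct)                  # 10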

It is never pleasant to write this kind of email, but when we fall short we take responsibility and apologize.

We do apologize once again.

Regards,
Octave

Date: 2012-03-28 17:39:49 UTC
We worked with Cisco today on the issues we encountered. We have to apply urgent fixes on the routers. These fixes will be deployed tonight:
00:00 on rbx-g1
01:00 on rbx-g2

Date: 2012-03-28 17:34:56 UTC
We had already encountered this issue and it had been escalated to the Cisco TAC. The TAC worked on it and prepared an SMU, a software patch for the IOS XR version that we are running. This small fix will be integrated into the next release.

We are retrieving it, and tonight we will start the maintenance on these 2 routers to apply the software patch, which will require reloading the routers afterwards. We are not going to do this during the day.

Date: 2012-03-28 07:00:33 UTC
Murphy's law of problems that never happen.

Something caused the simultaneous failure of cards of the same type in two different routers.
It is a hardware/software bug on the new 24x10G cards of the Cisco ASR 9010; the other 8x10G cards remained up.

We opened a TAC case to request the replacement of the three cards that crashed, but we must find the origin of the problem so we can prevent it from happening again, because with the same hardware and the same software, the same cause will produce the same problem.

Date: 2012-03-28 06:42:58 UTC
2 of the 24x10G cards on rbx-g1-a9 crashed, and 1 of the 24x10G cards on rbx-g2-a9 crashed as well.


Date: 2012-03-28 06:39:08 UTC
LC/0/0/CPU0:Mar 28 04:18:20 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab9f8, bit 4294967295, ext info 0x05cab9f8 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
LC/0/0/CPU0:Mar 28 04:18:20 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab9f8, bit 4294967295, ext info 0x05cab9f8 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
LC/0/1/CPU0:Mar 28 04:18:20 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab960, bit 4294967295, ext info 0x05cab960 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
LC/0/0/CPU0:Mar 28 04:18:21 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab9f8, bit 4294967295, ext info 0x05cab9f8 0x00
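
For reference, a minimal Python sketch (hypothetical; not the monitoring actually in use) that extracts the double-bit ECC errors from syslog lines like the ones above and counts them per line card and network processor:

import re
from collections import Counter

# Matches IOS XR messages of the form shown above, e.g.
# "LC/0/0/CPU0:Mar 28 04:18:20 UTC: ... Double-bit ECC error detected: NP 1, ..."
ECC_RE = re.compile(r"(?P<card>LC/\S+?):.*Double-bit ECC error detected: NP (?P<np>\d+)")

def count_ecc_errors(lines):
    # Count double-bit ECC errors per (line card, network processor).
    counts = Counter()
    for line in lines:
        m = ECC_RE.match(line)
        if m:
            counts[(m.group("card"), int(m.group("np")))] += 1
    return counts

with open("router-syslog.txt") as f:   # hypothetical log export
    for (card, np), n in count_ecc_errors(f).items():
        print(card, "NP", np, "->", n, "double-bit ECC errors")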

Date: 2012-03-28 06:38:55 UTC
One of the two main routers in Roubaix, rbx-g1-a9, is down, and the second has a defective card.
Posted Mar 28, 2012 - 06:19 UTC