Network & Infrastructure Status - FS#4411 — routing on Saturday 24 July at night

FS#4411 — routing on Saturday 24 July at night

Incident Report for Network & Infrastructure

Resolved

Context:
Two separate tasks are planned on the network: one tonight (25 July) and one on the night of Monday 27 July.
http://status.ovh.co.uk/?do=details&id=246
http://status.ovh.co.uk/?do=details&id=285

Both are maintenance tasks on the Roubaix/London and Roubaix/Brussels optical fibre links.

Because of this, we sped up the completion of the work to make the network redundant via London/Amsterdam and Frankfurt/Paris. Those links were brought up on Thursday.
http://status.ovh.co.uk/?do=details&id=318
http://status.ovh.co.uk/?do=details&id=319

Before these tasks started, at 20:03, the London router crashed and did not come back up on its own. It had a boot problem, and we had to connect over the serial cable to finish booting it.
http://status.ovh.co.uk/?do=details&id=326

Right afterwards, the Frankfurt router crashed because of memory, and then the Amsterdam router crashed as well: three routers in one hour.

It took time to restabilise the backbone, and especially to bring Frankfurt back up. Losing Frankfurt also means losing Zurich, Milan, Prague and Vienna. Only France and Spain were not impacted:
http://p19.smokeping.ovh.net/ovh-server-statistics/show.cgi?target=PING

After analysis, we think that with the redundancy links we set up, and the additional BGP data to synchronise, the London, Amsterdam and Frankfurt routers ran out of unfragmented RAM. They had not been restarted for a long time, and their RAM had become fragmented.
It is also related to MPLS: the router works without it. We disabled MPLS and brought Frankfurt back link by link. The router is now stable with 200 MB free out of 1 GB.
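
A rough illustration of why fragmentation matters even when free memory remains: an allocation needs one contiguous block, so the total free figure can look healthy while a large request still fails. This is only a sketch, not the routers' actual allocator; the block sizes and function names below are hypothetical.

```python
def memory_summary(free_blocks_mb):
    """Return (total free, largest contiguous free block), both in MB."""
    return sum(free_blocks_mb), max(free_blocks_mb, default=0)


def can_allocate(free_blocks_mb, request_mb):
    """An allocation succeeds only if a single free block is large enough."""
    return any(block >= request_mb for block in free_blocks_mb)


# Hypothetical figures: the same 120 MB of free RAM in two different states.
freshly_booted = [120]        # one contiguous free region
long_uptime = [5] * 24        # 24 scattered 5 MB holes

for name, blocks in (("freshly booted", freshly_booted), ("long uptime", long_uptime)):
    total, largest = memory_summary(blocks)
    ok = can_allocate(blocks, 40)
    print(f"{name}: {total} MB free, largest block {largest} MB, 40 MB request ok: {ok}")
```

In this toy model both states report the same free total, but only the freshly booted one can satisfy the 40 MB request.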

One of the solutions was ordered three weeks ago and will arrive in 5 weeks: 2 ASR1000 to act as route collectors. Instead of setting up 1 BGP session between each pair of routers, we are going to set up only 2 BGP sessions per router, and the 2 collectors will calculate the routes and then simply propagate the information to the other routers.
This will also use less CPU and less RAM.
This matters especially once the network is made redundant: because of the loops, the same information from the same router reaches each router over different paths at different times, and each router has to recalculate everything several times. That adds up to a lot of permanent calculation. The current configuration has reached its limits and must be improved; this will be done. Another solution would be BGP confederations, so that changes stay within a confederation, but we prefer route collectors.
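
To give a sense of scale, the session counts of the two designs can be compared with a short calculation. This is a sketch under assumptions: the router counts are hypothetical (the report does not state how many backbone routers there are), and the model assumes the collectors also keep one session with each other.

```python
def full_mesh_sessions(n_routers):
    """One BGP session between every pair of routers."""
    return n_routers * (n_routers - 1) // 2


def collector_sessions(n_routers, n_collectors=2):
    """Each router peers only with the collectors.

    Assumption: the collectors are also fully meshed among themselves.
    """
    return n_routers * n_collectors + n_collectors * (n_collectors - 1) // 2


# Hypothetical backbone sizes, for illustration only.
for n in (10, 20, 40):
    print(
        f"{n} routers: full mesh = {full_mesh_sessions(n)} sessions "
        f"({n - 1} per router), collectors = {collector_sessions(n)} sessions "
        f"(2 per router)"
    )
```

The full mesh grows quadratically with the number of routers, while the collector design grows linearly, which is where the CPU and RAM savings per router come from.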

The second solution is to replace the Cisco 6509 with Nexus 7016. We have received one for the lab and it is now being tested. We are waiting until September to order 5 Nexus 7016 because the cards we need are not yet available; they should be available in September.

In addition, some FR and ES customers were also impacted by these problems, especially on vss-1 and vss-2. When the routing changes and the BGP tables have to be recalculated, the vss routers are overwhelmed: they run at 100% CPU for a very long time and the ARP process stops answering the servers' ARP requests. OSPF drops, and so does BGP; customer servers expire the router's MAC address and no longer receive replies to their requests, so ping fails. With the route collectors we will reduce the CPU used by BGP, which will already fix many problems. A second solution is to move the proxy-ARP function from the router to a server built specifically for this. This is going to be coded and deployed.
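
As an illustration of what a dedicated ARP-answering server has to produce, the sketch below builds the bytes of a standard Ethernet ARP reply. It is not the planned implementation, which the report does not describe; the function names, MAC addresses and IP addresses are hypothetical, and actually sending the frame (which needs a raw socket and privileges) is left out.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_OP_REPLY = 2


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_arp_reply(gateway_mac, gateway_ip, server_mac, server_ip):
    """Build an Ethernet frame telling server_ip that gateway_ip is at gateway_mac."""
    sender_hw = mac_to_bytes(gateway_mac)
    target_hw = mac_to_bytes(server_mac)
    # Ethernet header: destination MAC, source MAC, EtherType.
    eth_header = target_hw + sender_hw + struct.pack("!H", ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                              # hardware type: Ethernet
        0x0800,                         # protocol type: IPv4
        6,                              # hardware address length
        4,                              # protocol address length
        ARP_OP_REPLY,
        sender_hw,
        socket.inet_aton(gateway_ip),
        target_hw,
        socket.inet_aton(server_ip),
    )
    return eth_header + arp_payload


# Hypothetical addresses (documentation range and locally administered MACs).
frame = build_arp_reply("02:00:00:00:00:01", "192.0.2.254",
                        "02:00:00:00:00:10", "192.0.2.10")
print(len(frame), frame.hex())
```

Answering ARP from a separate machine would keep the gateway's MAC address fresh in the servers' ARP caches even while the router's CPU is saturated by BGP recalculation.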

Posted Jul 26, 2010 - 00:42 UTC