FS#346 — FS#3830 — Internal Routing Roubaix
Attached to Project: Network
Maintenance
Whole Network
CLOSED
In order to manage the traffic between our backbone routers in Roubaix (rbx-1-6k<>rbx-2-6k<>vss-1-6k<>vss-2-6k<>rbx-99-6k), we are putting a new routing architecture in place. The switch to this new architecture will take place tonight, starting at midnight.
This maintenance concerns the Roubaix <> Brussels links (bru-1-6k).
We are switching the links one by one, which should not have any impact on traffic.
Date: Saturday, 31 July 2010, 02:25 AM
Reason for closing: Done
The maintenance is not going well. We are getting CRC errors between the routers. We have gone back to the initial configuration, with extra pain caused by bugs:
rbx-99-6k#sh inter ten 9/1
[...]
30 second output rate 90000 bits/sec, 98 packets/sec
[...]
No traffic is getting through.
rbx-99-6k#conf t
Enter configuration commands, one per line. End with CNTL/Z.
rbx-99-6k(config)#inter ten 9/1
rbx-99-6k(config-if)#shutdown
rbx-99-6k(config-if)#no shutdown
rbx-99-6k#sh inter ten 9/1
[...]
30 second output rate 2345596000 bits/sec, 384765 packets/sec
[...]
This is what we call a nice bug: the kind that wastes two hours in the middle of the night.
We believe the CRC errors are caused by incompatible optics (!?) between the Cisco N5 and the Cisco 6509 ...
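For reference (a hedged example, using the interface from the transcript above), the CRC counters in question can be read from the standard interface error counters on the 6509 side:
rbx-99-6k#sh inter ten 9/1 | include CRC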
We are retesting.
No luck.
We will put the links back as they were before and forward the bugs to Cisco ...
The problem is probably due to the MTU, which is XXXXX managed on the N5
(replace XXXXX with "badly", "differently", etc.)
We have modified the MTU configuration of the N5 switches and switched the rbx-1<>rbx-2 link mentioned above. The BGP session is currently stable. We are going to switch the other links over progressively.
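As a hedged illustration (the switch name and interface are placeholders), the MTU actually applied on a Nexus 5000 can be checked in the per-interface queuing information:
sw.int-1# show queuing interface ethernet 1/1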
We are switching the rbx-1<>vss-2 and rbx-2<>vss-1 links.
We found problems on the rbx-1<>vss-2 link even before the switch started. We have set up a temporary fibre and are expecting a maintenance intervention to repair it once and for all.
We are measuring abnormally high attenuation on the vss-2 <> rbx-99 links, which we are going to fix.
Repair of the faulty links will take place tonight from 23:00. Depending on how that part progresses, we will carry on with switching the routing links onto the new internal routing switches.
We are starting the maintenance.
The faulty links are now repaired. We are taking the opportunity to repair other faulty links as well.
We attempted again to switch the 10G links onto the new infrastructure but are still running into difficulties. We are switching back to the old configuration, except for rbx-1<>rbx-2, which is the only link running correctly over the new infrastructure.
Tonight, there will be work on the Roubaix 2 network. We are switching the vss-1 <> vss-2 traffic onto a new Nexus infrastructure. In case of problems, we will roll back immediately.
We are starting the switching operation.
The traffic is switched.
It is an MTU problem and a bug.
There is no problem between the Nexus 5000 and a standard 6509 and/or one running SXF.
We set the MTU to 9216 and it works properly.
Nexus 5000:
policy-map type network-qos jumbo
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos jumbo
BOOTLDR: s72033_rp Software (s72033_rp-IPSERVICESK9-M), Version 12.2(18)SXF16, RELEASE SOFTWARE (fc2)
interface Port-channelXXX
mtu 9216
The bug exists between the Nexus 5000 and a VSS running SXI.
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVIPSERVICESK9-M), Version 12.2(33)SXI3, RELEASE SOFTWARE (fc2)
2 bytes are missing (9216 - 9214).
with
interface Port-channelXXX
mtu 9216
there are CRC errors on the interfaces
with
interface Port-channelXXX
mtu 9214
No more problems.
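As a hedged check (the port-channel number is a placeholder, as in the snippets above, and vss-1-6k is just taken as an example device), the MTU actually in effect on the VSS side shows up in the interface details:
vss-1-6k#sh inter Port-channelXXX | include MTU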
We noticed it from the frame size reported in the BGP sessions:
Datagrams (max data segment is 9214 bytes):
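(For context, and as a hedged example with the neighbor address left as a placeholder, this line comes from the TCP transport details shown for a BGP neighbor:)
vss-1-6k#sh ip bgp neighbors XXXX | include max data segment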
# ping ip XXXX size 9216 df-bit
Type escape sequence to abort.
Sending 5, 9216-byte ICMP Echos to XXXX, timeout is 2 seconds:
Packet sent with the DF bit set
.....
Success rate is 0 percent (0/5)
-> whereas it is OK at 9214:
#ping ip XXXX size 9214 df-bit
Type escape sequence to abort.
Sending 5, 9214-byte ICMP Echos to XXXX, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/52/204 ms
We are going to finalise the internal routing infrastructure with this "workaround" and then report the bug to Cisco ...
We are continuing the work tonight, hoping that handling the MTU fixes the problem once and for all and lets us switch completely onto the new infrastructure.
We are starting the work.
We are switching the traffic onto the new links sw.int-1 <> vss-1/2 and rbx-99.
The switch-over is complete. One faulty link remains (rbx-1<>sw.int-1), running on a temporary path since tonight; it should be fixed by tomorrow.
The MTU problem has been resolved by moving from the Nexus 5000 to the Nexus 7000:
http://status.ovh.co.uk/?do=details&id=345