FS#1608 — FS#5592 — rbx-g1/g2
Attached to Project: Network
Incident
Whole Network
CLOSED
We have a problem on the ASR9000
Jul 6 12:58:05 rbx-g1-a9.fr.eu 5919: LC/0/0/CPU0:Jul 6 10:57:46 UTC: fib_mgr[161]: %ROUTING-FIB-4-RSRC_LOW : CEF running low on DATA_TYPE_TABLE_SET resource memory. CEF will now begin resource constrained forwarding. Only route deletes will be handled in this state, which may result in mismatch between RIB/CEF. Traffic loss on certain prefixes can be expected. The CEF will automatically resume normal operation, once the resource utilization returns to normal level
Jul 6 12:57:42 rbx-g2-a9.fr.eu 15654: LC/0/3/CPU0:Jul 6 10:57:23 UTC: fib_mgr[161]: %PLATFORM-PLAT_FIB-6-INFO : PD FIB object LEAF OOR state changed to GREEN
Jul 6 12:57:42 rbx-g2-a9.fr.eu 15655: LC/0/3/CPU0:Jul 6 10:57:23 UTC: fib_mgr[161]: %ROUTING-FIB-6-RSRC_OK : CEF resource state has returned to normal. CEF has exited resource constrained operation and normal forwarding has been restored
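The messages above alternate between a "resource low" and a "resource OK" state per line card. A minimal sketch of how such lines could be reduced to the latest CEF state per router and card is shown here. The regular expression is inferred only from the three lines above, the message texts are abbreviated, and the names `EVENT_RE`, `STATE_BY_MNEMONIC` and `cef_states` are hypothetical.

```python
import re

# Layout inferred from the log lines above (an assumption, not a Cisco spec):
# "<date> <router> <seq>: <card>:<card date>: <process>: %<FACILITY>-<sev>-<MNEMONIC> : <text>"
EVENT_RE = re.compile(
    r"(?P<router>\S+) \d+: (?P<card>LC/\d+/\d+/CPU\d+):"
    r".*?%(?P<facility>[A-Z_-]+?)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+)"
)

# Only the two mnemonics seen in this incident change the tracked state.
STATE_BY_MNEMONIC = {"RSRC_LOW": "constrained", "RSRC_OK": "normal"}


def cef_states(lines):
    """Return the last known CEF state for each (router, line card) pair."""
    states = {}
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue  # not a line-card message in the expected layout
        state = STATE_BY_MNEMONIC.get(match["mnemonic"])
        if state is not None:
            states[(match["router"], match["card"])] = state
    return states


# Message texts shortened; headers copied from the log lines above.
LOG_LINES = [
    "Jul 6 12:58:05 rbx-g1-a9.fr.eu 5919: LC/0/0/CPU0:Jul 6 10:57:46 UTC: "
    "fib_mgr[161]: %ROUTING-FIB-4-RSRC_LOW : CEF running low on DATA_TYPE_TABLE_SET",
    "Jul 6 12:57:42 rbx-g2-a9.fr.eu 15654: LC/0/3/CPU0:Jul 6 10:57:23 UTC: "
    "fib_mgr[161]: %PLATFORM-PLAT_FIB-6-INFO : PD FIB object LEAF OOR state changed to GREEN",
    "Jul 6 12:57:42 rbx-g2-a9.fr.eu 15655: LC/0/3/CPU0:Jul 6 10:57:23 UTC: "
    "fib_mgr[161]: %ROUTING-FIB-6-RSRC_OK : CEF resource state has returned to normal",
]

STATES = cef_states(LOG_LINES)
for (router, card), state in sorted(STATES.items()):
    print(router, card, state)
```

With these three lines, card 0/0 of g1 ends up constrained while card 0/3 of g2 is back to normal, which matches the flapping described in this ticket.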
Date: Thursday, 07 July 2011, 14:51PM
Reason for closing: Done
The problem resembles this one:
http://status.ovh.net/?do=details&id=752
but it is not quite the same.
We have added next-hop-self on IPv6.
The result is the same.
We have just opened a TAC case with Cisco.
RP/0/RSP1/CPU0:rbx-g2-a9# show bgp nexthops statistics
Wed Jul 6 12:34:19.284 UTC
Total Nexthop Processing
Time Spent: 871.632 secs
Maximum Nexthop Processing
Received: 6w3d
Bestpaths Deleted: 0
Bestpaths Changed: 144079
Time Spent: 2.918 secs
Last Notification Processing
Received: 1d14h
Time Spent: 0.021 secs
Gateway Address Family: IPv4 Unicast
Table ID: 0xe0000000
Nexthop Count: 147
Critical Trigger Delay: 3000msec
Non-critical Trigger Delay: 10000msec
Nexthop Version: 1, RIB version: 1
Total Critical Notifications Received: 119
Total Non-critical Notifications Received: 11570
Bestpaths Deleted After Last Walk: 0
Bestpaths Changed After Last Walk: 1961
Nexthop register:
Sync calls: 426747, last sync call: 00:15:14
Async calls: 1697, last async call: 14w6d
Nexthop unregister:
Async calls: 426603, last async call: 00:14:38
Nexthop batch finish:
Calls: 947770, last finish call: 00:14:37
Nexthop flush timer:
Times started: 853358, last time flush timer started: 00:14:38
RIB update: 0 rib update runs, last update: 00:00:00
0 prefixes installed, 0 modified, 0 removed
RP/0/RSP1/CPU0:rbx-g2-a9#show controller np struct 6 summary location 0/0/cpu0
Wed Jul 6 12:34:29.161 UTC
Node: 0/0/CPU0:
----------------------------------------------------------------
NP: 0 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
NP: 1 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
NP: 2 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
NP: 3 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
NP: 4 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
NP: 5 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
NP: 6 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
NP: 7 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
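As a sanity check on the output above, the buddy allocator rows can be converted back into entry counts. This sketch assumes that the columns of "Block Size", "Free Blocks" and "Used Blocks" line up position by position; the helper name `buddy_totals` is hypothetical.

```python
def buddy_totals(block_sizes, free_blocks, used_blocks):
    """Convert per-size block counts into total free and used entries."""
    free_entries = sum(size * count for size, count in zip(block_sizes, free_blocks))
    used_entries = sum(size * count for size, count in zip(block_sizes, used_blocks))
    return free_entries, used_entries


# Values copied from the R_LDI output above (identical on all eight NPs).
BLOCK_SIZES = [1, 2, 4, 8, 16, 32]
FREE_BLOCKS = [288, 57, 8, 1, 1, 1981]
USED_BLOCKS = [1673, 0, 3, 0, 0, 0]
TABLE_ENTRIES = 65536

FREE_ENTRIES, USED_ENTRIES = buddy_totals(BLOCK_SIZES, FREE_BLOCKS, USED_BLOCKS)
UTILISATION = USED_ENTRIES / TABLE_ENTRIES

print("used entries:", USED_ENTRIES)
print("free entries:", FREE_ENTRIES)
print("utilisation: {:.1%}".format(UTILISATION))
```

Under that reading, the used blocks add up to the 1685 entries reported as in use, and the table is under 3% full, so this particular structure does not look like the exhausted resource.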
RP/0/RSP1/CPU0:rbx-g2-a9#sh cef resource detail location 0/0/cpu0
Wed Jul 6 12:35:19.098 UTC
CEF resource availability summary state: YELLOW
CEF will drop route updates
No. of times HW caused oor: 26
CEF entered oor at : Jul 6 12:30:33.573
CEF came out of oor at : Jul 6 12:29:48.370
ipv4 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
ipv6 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
mpls shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
common shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
DATA_TYPE_TABLE_SET hardware resource: YELLOW
DATA_TYPE_TABLE hardware resource: YELLOW
DATA_TYPE_IDB hardware resource: YELLOW
DATA_TYPE_IDB_EXT hardware resource: YELLOW
DATA_TYPE_LEAF hardware resource: YELLOW
DATA_TYPE_LOADINFO hardware resource: YELLOW
DATA_TYPE_PATH_LIST hardware resource: YELLOW
DATA_TYPE_NHINFO hardware resource: YELLOW
DATA_TYPE_LABEL_INFO hardware resource: YELLOW
DATA_TYPE_FRR_NHINFO hardware resource: YELLOW
DATA_TYPE_ECD hardware resource: YELLOW
DATA_TYPE_RECURSIVE_NH hardware resource: YELLOW
DATA_TYPE_TUNNEL_ENDPOINT hardware resource: YELLOW
DATA_TYPE_LOCAL_TUNNEL_INTF hardware resource: YELLOW
DATA_TYPE_ECD_TRACKER hardware resource: YELLOW
DATA_TYPE_ECD_V2 hardware resource: YELLOW
DATA_TYPE_ATTRIBUTE hardware resource: YELLOW
DATA_TYPE_LSPA hardware resource: YELLOW
DATA_TYPE_LDI_LW hardware resource: YELLOW
DATA_TYPE_LDSH_ARRAY hardware resource: YELLOW
DATA_TYPE_TE_TUN_INFO hardware resource: YELLOW
DATA_TYPE_DUMMY hardware resource: YELLOW
DATA_TYPE_IDB_VRF_LCL_CEF hardware resource: YELLOW
DATA_TYPE_TABLE_UNRESOLVED hardware resource: YELLOW
DATA_TYPE_MOL hardware resource: YELLOW
DATA_TYPE_MPI hardware resource: YELLOW
DATA_TYPE_SUBS_INFO hardware resource: YELLOW
DATA_TYPE_GRE_TUNNEL_INFO hardware resource: YELLOW
RP/0/RSP1/CPU0:rbx-g2-a9#
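The output above mixes two kinds of resources: shared memory, which is GREEN, and hardware resources, which are all YELLOW. A small sketch of how that text could be summarized follows. The parsing is based only on the line shapes visible above, the sample is shortened to a few lines, and `summarize_cef` is a hypothetical helper.

```python
import re

# Shortened copy of the output above; only a few representative lines are kept.
SAMPLE = """\
CEF resource availability summary state: YELLOW
ipv4 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
ipv6 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
DATA_TYPE_TABLE_SET hardware resource: YELLOW
DATA_TYPE_LEAF hardware resource: YELLOW
"""


def summarize_cef(text):
    """Split the output into shared-memory headroom and hardware resource states."""
    memory = {}
    pattern = (r"^(\w+) shared memory resource:\s*\n"
               r"\s*CurrMode (\w+), CurrAvail (\d+) bytes, MaxAvail (\d+) bytes")
    for name, mode, curr, maxi in re.findall(pattern, text, re.M):
        memory[name] = {"mode": mode, "available": int(curr) / int(maxi)}
    hardware = dict(re.findall(r"^(DATA_TYPE_\w+) hardware resource: (\w+)", text, re.M))
    not_green = sorted(name for name, state in hardware.items() if state != "GREEN")
    return {"memory": memory, "hardware": hardware, "not_green": not_green}


SUMMARY = summarize_cef(SAMPLE)
for name, info in SUMMARY["memory"].items():
    print("{} shared memory: {} ({:.0%} available)".format(name, info["mode"], info["available"]))
print("hardware resources not GREEN:", ", ".join(SUMMARY["not_green"]))
```

On the figures above, about 88% of the shared memory is still available, so the YELLOW state appears to come from the hardware side rather than from shared memory.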
The new IPs are not being registered.
We are in contact with Cisco TAC in order to fix the problem.
Traffic to the new IPs is looping inside the network.
We are waiting for Cisco.
6 th2-1-6k.fr.eu (213.186.32.181) 55.409 ms * 50.620 ms
7 th1-1-6k.fr.eu (213.186.32.165) 58.132 ms * 50.333 ms
8 rbx-g2-a9.fr.eu (91.121.131.141) 55.075 ms 53.812 ms 54.613 ms
9 gsw-2-6k.fr.eu (91.121.131.214) 77.756 ms * *
10 rbx-g1-a9.fr.eu (91.121.131.33) 57.627 ms 57.028 ms 57.390 ms
11 gsw-2-6k.fr.eu (91.121.131.38) 263.777 ms
gsw-2-6k.fr.eu (91.121.131.34) 205.179 ms
gsw-2-6k.fr.eu (213.251.128.106) 209.499 ms
12 rbx-g1-a9.fr.eu (91.121.131.33) 62.124 ms 59.690 ms 62.422 ms
13 gsw-2-6k.fr.eu (91.121.131.38) 62.392 ms *
gsw-2-6k.fr.eu (213.251.128.106) 61.387 ms
14 rbx-g1-a9.fr.eu (91.121.131.33) 65.804 ms 65.402 ms 65.773 ms
15 gsw-2-6k.fr.eu (91.121.131.38) 65.205 ms *
gsw-2-6k.fr.eu (213.251.128.106) 64.206 ms
16 rbx-g1-a9.fr.eu (91.121.131.33) 69.591 ms 67.366 ms 68.669 ms
17 * * gsw-2-6k.fr.eu (213.251.128.106) 220.553 ms
18 rbx-g1-a9.fr.eu (91.121.131.33) 71.096 ms 73.312 ms 71.266 ms
19 gsw-2-6k.fr.eu (91.121.131.38) 70.817 ms
gsw-2-6k.fr.eu (91.121.131.34) 70.360 ms
gsw-2-6k.fr.eu (213.251.128.106) 71.530 ms
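The traceroute above bounces between rbx-g1-a9 and gsw-2-6k from hop 9 onwards. A minimal sketch of how such a loop could be detected from the ordered list of responding hostnames is shown here. Comparing hostnames instead of interface IPs is a simplification, and `find_loop` is a hypothetical helper.

```python
def find_loop(hops):
    """Return the repeating segment of the path, or None if no hostname repeats."""
    first_seen = {}
    for index, host in enumerate(hops):
        if host in first_seen:
            # Everything between the first and second sighting is one loop cycle.
            return hops[first_seen[host]:index]
        first_seen[host] = index
    return None


# One hostname per hop, copied from hops 6 to 19 of the traceroute above.
HOPS = [
    "th2-1-6k.fr.eu", "th1-1-6k.fr.eu", "rbx-g2-a9.fr.eu",
    "gsw-2-6k.fr.eu", "rbx-g1-a9.fr.eu",
    "gsw-2-6k.fr.eu", "rbx-g1-a9.fr.eu",
    "gsw-2-6k.fr.eu", "rbx-g1-a9.fr.eu",
    "gsw-2-6k.fr.eu", "rbx-g1-a9.fr.eu",
    "gsw-2-6k.fr.eu", "rbx-g1-a9.fr.eu",
    "gsw-2-6k.fr.eu",
]

LOOP = find_loop(HOPS)
print("loop:", " -> ".join(LOOP) if LOOP else "none")
```

A repeated hostname is only a hint, since a router can legitimately answer from several interfaces, but here the same pair of routers alternates for ten consecutive hops.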
RP/0/RSP1/CPU0:rbx-g2-a9(admin-config)#hw-module profile scale l3xl
Wed Jul 6 18:50:16.520 UTC
In order to activate this new memory resource profile, you must manually reboot the system.
We have to restart the router.
All routing currently goes through g1.
We are ready to restart g2.
RP/0/RSP1/CPU0:rbx-g2-a9(admin)#reload location all
Wed Jul 6 18:58:42.597 UTC
Preparing system for backup. This may take a few minutes especially for large configurations.
Status report: node0_RSP1_CPU0: START TO BACKUP
Status report: node0_RSP1_CPU0: BACKUP HAS COMPLETED SUCCESSFULLY
[Done]
Proceed with reload? [confirm]
RP/0/RSP1/CPU0::This node received reload command. Reloading in 5 secs
g2 is up.
We are checking it.
g2 is OK.
We have put it back into routing; it is in the loop.
We will now take g1 out of routing.
g1 is out of the loop; everything is routed through g2.
We are ready to restart g1.
RP/0/RSP0/CPU0:rbx-g1-a9(admin)#reload location all
Wed Jul 6 19:13:11.504 UTC
Preparing system for backup. This may take a few minutes especially for large configurations.
Status report: node0_RSP0_CPU0: START TO BACKUP
Status report: node0_RSP0_CPU0: BACKUP HAS COMPLETED SUCCESSFULLY
[Done]
Proceed with reload? [confirm]
RP/0/RSP0/CPU0::This node received reload command. Reloading in 5 secs
Restart in progress.
g1 is up.
We will check it now.
Card 0/4 has died.
We have started the card replacement with Cisco under our T+2H hardware support, which means that Cisco delivers a replacement for the failed card in less than 2 hours when one of the router's components has a hardware problem.
We checked the ports down and we don't expect an impact on traffic even without the card. All ports are lined and it should not saturate.
We have just put the router back into routing.
Now we will check the links for saturation.
Cisco asked us to restart the card to see whether it is definitely dead.
RP/0/RSP0/CPU0:rbx-g1-a9(admin)#reload location 0/4/CPU0
Wed Jul 6 19:37:06.607 UTC
Preparing system for backup. This may take a few minutes especially for large configurations.
[Done]
Proceed with reload? [confirm]
Traffic is back and everything is running correctly.
The initial problem is fixed.
Now we need to replace the card. The RMA is in progress.
To top it all off, Cisco's databases have not been updated with the recently signed contract, so we will not get the card within 2 hours.
Apparently the card is not in their databases.
This is probably because we have already had two broken cards and the databases were not updated after the previous RMAs.
http://status.ovh.co.uk/?do=details&id=1154
[...]
We will replace card #6 of g1 with card #4 of g2, whose ports are unused or carry little traffic.
[...]
That is why it does not match Cisco's databases.
We received the spare card from Cisco at 4:00 am.
http://yfrog.com/z/kejb0uj
The old card is still in the router.
First of all, we disconnect the optical fibres.
http://yfrog.com/z/kg4rknnj
Done; the card is ready to be pulled out.
http://yfrog.com/z/kl2d5jj
Ready to go? Go... The card is out.
http://yfrog.com/z/kl1aslhj
We verify the logs and everything is OK
http://yfrog.com/z/kj1kfij
We put down the old card and unpack the new one
http://yfrog.com/z/kh47sqj
The card is ready to be inserted
http://yfrog.com/z/kiz82vtj
The card is inserted and it boots
http://yfrog.com/z/kjh2dvj
We verify the logs: the boot goes well
http://yfrog.com/z/kl42ttj
We re-connect the optical fibres.
http://yfrog.com/z/h7iialhxj
We verify the logs: everything is OK
http://yfrog.com/z/khd74nj
We verify the weathermap and the traffic flows to Paris and Frankfurt: everything is OK
http://weathermap.ovh.net/backbone
The old card has been re-packed and will be sent back to Cisco.
We thank the Cisco team for their follow-up during the night. The internal bug was fixed at 1:00 am.