Network & Infrastructure Status - FS#17791

OVHcloud Network Status

FS#17791 — rbx6-12b-n56

Incident Report for Network & Infrastructure

Resolved

We have detected a high level of buffer usage on this switch. (12b shows no abnormalities.)

This is caused by the AFM (ACL Feature Manager) process, which is never a good sign.

rbx6-12b-n56# sh system internal mts buffers summary
node  sapno  recv_q  pers_q  npers_q  log_q
sup   175    0       9       0        0
sup   377    0       0       0        47
sup   608    0       159     0        0
sup   284    0       4       0        0
sup   351    0       0       0        17

rbx6-12b-n56# sh system internal mts sup sap 608 description
Afm SAP

We are investigating the root cause, but it seems to be the reload.
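
The buffer summary above can be screened automatically. The following is a minimal Python sketch, not OVH's tooling: it assumes the output has already been captured as text, and the names `parse_mts_summary`, `busy_saps` and the threshold of 100 queued messages are hypothetical choices made for illustration.

```python
from typing import List, NamedTuple


class SapQueues(NamedTuple):
    node: str
    sap: int
    recv_q: int
    pers_q: int
    npers_q: int
    log_q: int


def parse_mts_summary(text: str) -> List[SapQueues]:
    """Parse rows of 'sh system internal mts buffers summary' output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have six fields: a node name followed by five integers.
        # The header, prompts and blank lines fail this check and are skipped.
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            rows.append(SapQueues(parts[0], *map(int, parts[1:])))
    return rows


def busy_saps(rows: List[SapQueues], threshold: int = 100) -> List[SapQueues]:
    """Return SAPs whose largest queue reaches the (hypothetical) threshold."""
    return [
        r for r in rows
        if max(r.recv_q, r.pers_q, r.npers_q, r.log_q) >= threshold
    ]


# Sample taken from the output in this report.
SAMPLE = """\
node sapno recv_q pers_q npers_q log_q
sup 175 0 9 0 0
sup 377 0 0 0 47
sup 608 0 159 0 0
sup 284 0 4 0 0
sup 351 0 0 0 17
"""

if __name__ == "__main__":
    for row in busy_saps(parse_mts_summary(SAMPLE)):
        print(row.sap, row.pers_q)
```

With the sample above, only SAP 608 (the AFM SAP) crosses the assumed threshold.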

Update(s):

Date: 2016-05-03 16:54:50 UTC
fex116 up, buffer and SW OK, everything is back to normal.

Date: 2016-05-03 16:54:03 UTC
The FEXs are up. Redundancy has been restored on all of the FEXs except 116.
FEX 116 flapped on 12: an optic is out of service and is being fixed by the datacentre.

Date: 2016-05-03 16:31:14 UTC
rbx6-12b-n56# sh fex
FEX     FEX          FEX     FEX                Fex
Number  Description  State   Model              Serial
------------------------------------------------------------------------
100     fex100       Online  N2K-C2248TP-E-1GE  SSI181709KY
101     fex101       Online  N2K-C2248TP-E-1GE  FOX1844G5AX
102     fex102       Online  N2K-C2248TP-E-1GE  FOX1901G31F
103     fex103       Online  N2K-C2248TP-E-1GE  FOX1901G2YS
104     fex104       Online  N2K-C2248TP-E-1GE  FOX1844G75X
105     fex105       Online  N2K-C2248TP-E-1GE  FOX1905GDWS
106     fex106       Online  N2K-C2248TP-E-1GE  FOX1844GJHP

Date: 2016-05-03 16:31:02 UTC
SW reload done.
We had to shut down the port-channels (po) to the FEXs to avoid bringing up 1000 eth ports at once and saturating the switch again.
We are bringing the FEXs up one by one while monitoring the buffers.

rbx6-12b-n56# sh fex
FEX     FEX          FEX     FEX                Fex
Number  Description  State   Model              Serial
------------------------------------------------------------------------
100     fex100       Online  N2K-C2248TP-E-1GE  SSI181709KY
101     fex101       Online  N2K-C2248TP-E-1GE  FOX1844G5AX
102     fex102       Online  N2K-C2248TP-E-1GE  FOX1901G31F
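
The one-by-one bring-up described in this update can be sketched as a simple loop. This is an illustration only, not the procedure actually run on the switch: the function `staged_bring_up` and its callbacks `enable`, `buffers_ok` and `settle` are hypothetical, and how a FEX is enabled or how buffers are checked is left to whatever the caller plugs in.

```python
from typing import Callable, Iterable, List


def staged_bring_up(
    fex_ids: Iterable[int],
    enable: Callable[[int], None],
    buffers_ok: Callable[[], bool],
    settle: Callable[[], None],
) -> List[int]:
    """Enable FEXs one at a time, checking buffers after each one.

    Returns the FEXs confirmed healthy. Stops at the first FEX after
    which the buffer check fails, leaving the remaining FEXs untouched.
    """
    healthy = []
    for fex in fex_ids:
        enable(fex)       # hypothetical: e.g. re-enable the po to this FEX
        settle()          # hypothetical: wait for the FEX to come online
        if not buffers_ok():
            break         # do not continue while buffers are unhealthy
        healthy.append(fex)
    return healthy


if __name__ == "__main__":
    # Dry run with stand-in callbacks; nothing is sent to a device.
    enabled_log = []
    result = staged_bring_up(
        [100, 101, 102],
        enable=enabled_log.append,
        buffers_ok=lambda: True,
        settle=lambda: None,
    )
    print(result)
```

Passing the actions in as callbacks keeps the loop testable without touching a device, at the cost of leaving the device-specific parts unspecified.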

Date: 2016-05-03 16:28:48 UTC
CPU before the reload: snmpd is hammering the switch, but that seems to be an effect, not a cause.
Wild guess, to be confirmed with Cisco: ETHPM struggles => AFM buffers fill up => SNMP struggles.
This impacts the CPU, creating a vicious circle...

rbx6-12b-n56# sh system internal processes cpu
top - 12:10:39 up 315 days, 19:11, 3 users, load average: 1.28, 1.45, 1.15
Tasks: 240 total, 3 running, 236 sleeping, 0 stopped, 1 zombie
Cpu(s): 2.9%us, 1.7%sy, 0.0%ni, 95.0%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 8243352k total, 3861200k used, 4382152k free, 288k buffers
Swap: 0k total, 0k used, 0k free, 1463832k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28326 root 20 0 348m 38m 25m R 52.8 0.5 30903:01 snmpd
4458 root 20 0 321m 70m 19m R 33.9 0.9 19053:50 ethpm
3994 root 20 0 310m 41m 15m S 17.0 0.5 15171:01 stats_client
8423 nicolas. 20 0 3620 1528 1140 R 7.5 0.0 0:00.07 top
4050 root 20 0 321m 32m 20m S 3.8 0.4 5352:08 pm
4174 root 20 0 442m 73m 26m S 3.8 0.9 6659:44 netstack
4170 root 20 0 297m 49m 20m S 1.9 0.6 1567:58 satmgr
1 root 20 0 2004 664 580 S 0.0 0.0 5:19.84 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root RT -5 0 0 0 S 0.0 0.0 0:11.29 migration/0
4 root 15 -5 0 0 0 S 0.0 0.0 94:25.26 ksoftirqd/0
5 root RT -5 0 0 0 S 0.0 0.0 5:09.96 watchdog/0
6 root RT -5 0 0 0 S 0.0 0.0 0:14.36 migration/1

Date: 2016-05-03 16:23:28 UTC
No spanning tree instance exists.
rbx6-12b-n56# sh platform afm info copp-tbls | diff
8,10c8,10
< 0 default 64000 6250 51700252190 4151275828
< 1 stp 2500000 4687 1214117872 0
< 2 lacp 128000 4687 574984688 0
---
> 0 default 64000 6250 51700312959 4151275828
> 1 stp 2500000 4687 1214119104 0
> 2 lacp 128000 4687 574985296 0
15c15
< 7 sat control 62500000 65535 2318965670683 0
---
> 7 sat control 62500000 65535 2318968001023 0
25c25
< 18 cdp 128000 4687 159709968 0
---
> 18 cdp 128000 4687 159710144 0
28,29c28,29
< 21 mgmt/ipv6-mgmt* 1500000 4687 139677728087 5781405
< 23 arp/ipv6-nd 8000 3515 16452102544 630004096
---
> 21 mgmt/ipv6-mgmt* 1500000 4687 139677925157 5781405
> 23 arp/ipv6-nd 8000 3515 16452118836 630004096
33c33
< 27 hsrp vrrp/ipv6-hsrp 128000 250 2987080360 85648746
---
> 27 hsrp vrrp/ipv6-hsrp 128000 250 2987083756 85648746
44c44
< 41 excp/ipv6-excp** 8000 4687 5679291770 384144830
---
> 41 excp/ipv6-excp** 8000 4687 5679301982 384144830
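
To see which CoPP classes moved between the two snapshots, the diff above can be reduced to per-class deltas. The sketch below is an assumption-laden illustration: the report does not show the column headers of `copp-tbls`, so the last two numeric fields are simply treated as two counters (`counter_a`, `counter_b`), and the function names are hypothetical.

```python
from typing import Dict, Tuple

Row = Tuple[str, Tuple[int, int, int, int]]


def parse_side(diff_text: str, marker: str) -> Dict[int, Row]:
    """Collect rows from one side of a diff ('<' = before, '>' = after)."""
    rows = {}
    for line in diff_text.splitlines():
        if not line.startswith(marker + " "):
            continue
        parts = line[2:].split()
        if len(parts) < 6:
            continue
        # First field: class id. Last four fields: integers.
        # Everything in between is the (possibly multi-word) class name.
        class_id = int(parts[0])
        name = " ".join(parts[1:-4])
        numbers = tuple(int(p) for p in parts[-4:])
        rows[class_id] = (name, numbers)
    return rows


def counter_deltas(diff_text: str) -> Dict[str, Tuple[int, int]]:
    """Return {class name: (delta counter_a, delta counter_b)}."""
    before = parse_side(diff_text, "<")
    after = parse_side(diff_text, ">")
    deltas = {}
    for class_id, (name, old) in before.items():
        if class_id in after:
            new = after[class_id][1]
            # Assumption: the last two numeric columns are counters.
            deltas[name] = (new[2] - old[2], new[3] - old[3])
    return deltas


# Excerpt of the diff shown in this report.
SAMPLE_DIFF = """\
8,10c8,10
< 1 stp 2500000 4687 1214117872 0
---
> 1 stp 2500000 4687 1214119104 0
15c15
< 7 sat control 62500000 65535 2318965670683 0
---
> 7 sat control 62500000 65535 2318968001023 0
"""

if __name__ == "__main__":
    for name, (delta_a, delta_b) in counter_deltas(SAMPLE_DIFF).items():
        print(name, delta_a, delta_b)
```

On this excerpt the second counter does not change for either class, while the first counter of `sat control` grows far more than that of `stp`.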

We are taking some logs and reloading the box. No downtime. Traffic is forwarded through 12a.

Posted May 03, 2016 - 16:21 UTC