We have an incident on the ACE of rbx-s1. We are looking for the origin of the problem.
Update(s):
Date: 2011-09-25 16:13:28 UTC
rbx-s2-ace/Admin# sh proc cpu
CPU utilization for five seconds: 3%; one minute: 4%; five minutes: 5%
rbx-s1-ace/Admin# sh proc cpu
CPU utilization for five seconds: 10%; one minute: 12%; five minutes: 13%
It's much better.
Date: 2011-09-25 02:23:02 UTC We have applied the limitation to some contexts of some customers.
Date: 2011-09-25 02:22:00 UTC If the situation does not stabilize, we will limit administration access to the ACE to 4 simultaneous connections. Some customers use 50 or 100 connections (!?) and they are probably causing the problem.
Date: 2011-09-25 02:19:37 UTC And why do we have the problem only at night? Do we have a nag customer?
s2/ace is master:
rbx-s2-ace/Admin# sh proc cpu
CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%
s1/ace is currently slave:
rbx-s1-ace/Admin# sh proc cpu
CPU utilization for five seconds: 31%; one minute: 34%; five minutes: 33%
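To compare the two cards without reading the figures by eye, the `sh proc cpu` summary line can be parsed. The following is only a sketch: the function name `parse_cpu` and the dictionary keys are hypothetical, and the regular expression assumes the output format shown above.

```python
import re

# Matches the summary line printed by "sh proc cpu" as shown above.
# \s* between fields also tolerates a line wrapped by the terminal.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds:\s*(\d+)%;\s*"
    r"one minute:\s*(\d+)%;\s*five minutes:\s*(\d+)%"
)


def parse_cpu(output: str) -> dict:
    """Return the three CPU averages (in percent) found in the output."""
    match = CPU_LINE.search(output)
    if match is None:
        raise ValueError("no CPU utilization line found")
    five_sec, one_min, five_min = (int(value) for value in match.groups())
    return {"5s": five_sec, "1m": one_min, "5m": five_min}


master = parse_cpu(
    "CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%"
)
slave = parse_cpu(
    "CPU utilization for five seconds: 31%; one minute: 34%; five minutes: 33%"
)

# Difference between master and slave on the five-minute average.
print(master["5m"] - slave["5m"])
```

With the figures above, the master's five-minute average is 30 points higher than the slave's.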
Date: 2011-09-25 02:17:35 UTC Reading the error message, it seems that a client (uspace) causes a high load (big loadavg), and the watchdog (ft, fault tolerance) therefore triggers the switch from the master card to the slave card. In effect the watchdog says: "When in doubt, I switch to the slave card, because I consider that the master is not fine."
We have no idea whether this is true. We'll see what the TAC answers.
We changed the ft values from
heartbeat interval 300
heartbeat count 20
to
heartbeat interval 1000
heartbeat count 50
We'll see if it's more stable this way.
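As a rough illustration of what this change means, the time a card can stay silent before its peer takes over can be estimated as interval × count. This is a sketch under two assumptions not stated above: that the heartbeat interval is expressed in milliseconds, and that the peer is declared failed after `count` consecutive missed heartbeats. The function name is hypothetical.

```python
def detection_time_seconds(interval_ms: int, count: int) -> float:
    """Estimated silence (in seconds) before the peer is declared failed.

    Assumes interval_ms is in milliseconds and that failover happens
    after `count` consecutive missed heartbeats.
    """
    return interval_ms * count / 1000


old = detection_time_seconds(300, 20)    # previous ft values
new = detection_time_seconds(1000, 50)   # new ft values

print(f"before: {old:.0f} s, after: {new:.0f} s")
```

Under these assumptions the tolerance goes from 6 seconds to 50 seconds, so a card that is merely slow under load has more time to answer before a failover is triggered.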
Date: 2011-09-25 02:01:54 UTC
Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 02:03:20 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 02:08:52 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 02:09:05 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online
The card is up with the reboot message:
last boot reason: SB Wdog uspace big loadavg
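The duration of this reload can be read from the log timestamps. The sketch below parses the two relevant lines and computes the time between the power-cycle and the card coming back online. The helper name `parse_events` is hypothetical, and the year is taken from the update's date because the log lines do not carry one.

```python
import re
from datetime import datetime

# Two of the log lines quoted above, copied verbatim.
LOG = """\
Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online
"""

# Timestamp, then the message mnemonic between '%' and the next ':'.
LOG_LINE = re.compile(r"(\w{3} +\d+ \d{2}:\d{2}:\d{2}) GMT: %([\w-]+):")


def parse_events(log: str, year: int) -> dict:
    """Map each message mnemonic to the timestamp of its first occurrence."""
    events = {}
    for line in log.splitlines():
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines without a "Mon DD HH:MM:SS GMT" timestamp
        stamp, mnemonic = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.setdefault(mnemonic, when)
    return events


events = parse_events(LOG, year=2011)
downtime = events["OIR-SP-6-INSCARD"] - events["OIR-SP-3-PWRCYCLE"]
print(downtime)
```

For the two lines above, the card was unavailable for 5 minutes 48 seconds.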
Date: 2011-09-25 02:01:06 UTC The slave card (s2 ace), which had taken over the load of s1, crashed.
Sep 25 01:38:28 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 01:38:29 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 01:38:30 GMT: %DIAG-SP-3-TEST_FAIL: Module 2: TestAsicSync{ID=3} has failed. Error code = 0x76 (DIAG_QUERY_HYPERION_SYNC_ERROR)
The card is back up, with the same message as the original crash:
last boot reason: SB Wdog uspace big loadavg
Date: 2011-09-24 23:41:30 UTC It's done, the card is up again.
Date: 2011-09-24 23:41:04 UTC Card is being restarted:
20w1d: SP: The PC in slot 2 is shutting down. Please wait ...
20w1d: SP: PC shutdown completed for module 2
Sep 25 00:07:45 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)
20w1d: Processor 0 of module in slot 2 cannot service session requests.
20w2d: Processor 0 of module in slot 2 cannot service session requests.
20w2d: Processor 0 of module in slot 2 cannot service session requests.
20w2d: Processor 0 of module in slot 2 cannot service session requests.