OVHcloud Private Cloud Status

FS#9745 — rbx-s50-6k
Incident Report for Hosted Private Cloud
Resolved
rbx-s50-6k, one of the main routers of the Roubaix PCC network, crashed. We will restart it.

Nov 22 04:47:14 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:47:14 CET: %DIAG-SP-3-MINOR: Module 7: Online Diagnostics detected a Minor Error. Please use 'show diagnostic result ' to see test results.
Nov 22 04:47:36 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:48:00 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:48:22 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:48:45 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:48:45 CET: %CONST_DIAG-SP-3-HM_TEST_FAIL: Module 7 TestSPRPInbandPing consecutive failure count:5
Nov 22 04:48:45 CET: %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=56% RP=71% Traffic=0%
netint_thr_active[0], Tx_Rate[600], Rx_Rate[147], dev=3[IPv4, fail=5], 4[IPv4, fail=5]
Nov 22 04:49:12 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:49:36 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:50:13 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:50:35 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:50:56 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:50:56 CET: %CONST_DIAG-SP-3-HM_TEST_FAIL: Module 7 TestSPRPInbandPing consecutive failure count:10
Nov 22 04:50:56 CET: %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=41% RP=84% Traffic=0%
netint_thr_active[0], Tx_Rate[600], Rx_Rate[146], dev=3[IPv4, fail=10], 4[IPv4, fail=10]
Nov 22 04:51:18 CET: %DIAG-SP-3-TEST_FAIL: Module 7: TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: 1Process Forced Exit- MAXRUN timer expired.
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: while executing
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: "if [catch {cli_exec $cli1(fd) "diagnostic action mod $card test TestSPRPInbandPing default"} result] {
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: error $result $errorInfo
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: } else {
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: set c..."
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: (file "tmpsys:/eem_policy/Mandatory.go_sprping.tcl" line 78)
Nov 22 04:51:38 CET: %HA_EM-6-LOG: Mandatory.go_sprping.tcl: Tcl policy execute failed: 1Process Forced Exit- MAXRUN timer expired.
Queued messages:
Nov 22 04:51:59 CET: %SYS-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

Nov 22 04:51:59 CET: %C6K_PLATFORM-2-PEER_RESET: RP is being reset by the SP
Nov 22 04:52:06 CET: %SYS-SP-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.

Nov 22 04:52:05 CET: %SYS-SP-3-CPUHOG: Task is running for (4428)msecs, more than (2000)msecs (90/90),process = Crash writer.
-Traceback= 2
Nov 22 04:52:05 CET: %SYS-SP-3-CPUHOG: Task is running for (4432)msecs, more than (2000)msecs (90/90),process = Crash writer.
-Traceback= 419B5A30 114C 7D0 5A 5A 41D2C1C0 443A9780
Nov 22 04:52:06 CET: %OIR-SP-6-CONSOLE: Changing console ownership to switch processor



No warm reboot Storage
*** System received an unknown failure ***
signal= 0x0, code= 0x0, context= 0x443ab2f4
PC = 0x417bf3f0, Cause = 0x1020, Status Reg = 0x34008102
Exit at the end of BOOT string
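
For readers who want to tally the messages in the console log above, here is a minimal Python sketch. It assumes each message follows the usual IOS layout of timestamp, time zone, then %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC: text. The names parse_line and summarize and the dictionary keys are illustrative choices, not part of any OVH or Cisco tool. Continuation lines that lack the timestamp prefix (for example the netint_thr_active line) are skipped.

```python
import re
from collections import Counter

# Assumed layout: "Nov 22 04:47:14 CET: %FACILITY[-SUB]-SEVERITY-MNEMONIC: text"
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<tz>[A-Z]+): "
    r"%(?P<facility>[A-Z0-9_]+)(?:-(?P<sub>[A-Z0-9_]+))?-(?P<severity>[0-7])-"
    r"(?P<mnemonic>[A-Z0-9_]+): (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count messages per (facility, mnemonic) pair, ignoring unparsed lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["facility"], record["mnemonic"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Nov 22 04:47:14 CET: %DIAG-SP-3-TEST_FAIL: Module 7: "
        "TestSPRPInbandPing{ID=2} has failed. Error code = 0xC3 (DIAG_CHECK_RP_PAK_ERROR)",
        "netint_thr_active[0], Tx_Rate[600], Rx_Rate[147], dev=3[IPv4, fail=5], 4[IPv4, fail=5]",
        "Nov 22 04:51:59 CET: %C6K_PLATFORM-2-PEER_RESET: RP is being reset by the SP",
    ]
    for key, count in summarize(sample).items():
        print(key, count)
```

Run over the full log, a tally like this makes the sequence easier to see: repeated DIAG TEST_FAIL messages for module 7, followed by the C6K_PLATFORM PEER_RESET.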

Update(s):

Date: 2013-11-22 06:36:31 UTC
We didn't detect any issues following the router reboot.

Date: 2013-11-22 06:34:01 UTC
The router is up again. We will run checks to make sure there was no impact on production.

Date: 2013-11-22 06:32:47 UTC
We are having trouble booting the router with the correct TCAM configuration. We will restart the chassis again.

Date: 2013-11-22 05:08:38 UTC
The router is rebooting from the CF card onto which we loaded the IOS image and the configuration backup.

Date: 2013-11-22 05:07:55 UTC
We have a problem with the CF card of the supervisor. This card stores the IOS image and the config. We are preparing a new CF card.


Date: 2013-11-22 05:07:17 UTC
The router is rebooting. We are preparing to replace card 7, which caused the crash, if necessary.
Posted Nov 22, 2013 - 05:06 UTC