OVHcloud Private Cloud Status

FS#8696 — 1000v and host 10G
Incident Report for Hosted Private Cloud
Resolved
We have detected a failure on our infrastructure affecting hosts with 10G connectivity
(host models L2 and XL) in the datacenters where the Cisco Nexus 1000v has been activated.

Following a network event, and combined with a bad setting on the 1000v dvUplink, this
failure left the host unable to provide network connectivity to the virtual machines it hosts.

The dvUplink is composed of three network cards: two 10G and one 100M.
The 1000v does not allow the same uplink settings as the vSS,
or as the vDS it replaces, and it treated the 100M card as a 10G card.
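
For reference, the real link speed of each uplink card is visible from the ESXi shell, where the 100M card stands out immediately (esxcfg-nics is standard ESXi tooling; this is only an illustration, not taken from the original report):

~ # esxcfg-nics -l      # lists every vmnic with its actual link speed (100Mbps vs 10000Mbps)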

Under certain conditions, we found the host sending its traffic over this 100M card, which
made it unreachable. By removing the 100M card, we restored the host's connectivity.
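
As a sketch of that removal step, assuming the 100M card is vmnic1 (an illustrative name) and using the DVS name reported by esxcfg-vswitch -l, the uplink can be unlinked from the distributed switch on the host:

~ # esxcfg-vswitch -l                                   # note the DVS name and the DVPort ID bound to vmnic1
~ # esxcfg-vswitch -Q vmnic1 -V <dvport_id> <dvs_name>  # unlink the 100M uplink from the DVS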

Once the 10G links were back in the port-channel, the VMs should have been redistributed
across them, but the 1000v refused to switch them over:

In the normal case:

~ # vemcmd show port
LTL  VSM Port  Admin  Link  State  PC-LTL  SGID  Vem Port            Type
 19  Eth5/3    UP     UP    F/B*        0     2  vmnic2
 20  Eth5/4    UP     UP    F/B*        0     3  vmnic3
 49  Veth33    UP     UP    FWD         0     2  vmk1
 50  Veth41    UP     UP    FWD         0     2  vmk0
 51  Veth385   UP     UP    FWD         0     2  vmk2                VXLAN
 59  Veth342   UP     UP    FWD         0     3  vm1.localhost.eth0


Note that the SGID of vm1.localhost.eth0 is 3 => its traffic leaves through vmnic3,
while vmk0 to vmk2 have SGID 2 => they leave through vmnic2.


Here, on an affected host:
~ # vemcmd show port
LTL  VSM Port  Admin  Link  State  PC-LTL  SGID  Vem Port            Type
 19  Eth5/3    UP     UP    F/B*        0     2  vmnic2
 20  Eth5/4    UP     UP    F/B*        0     3  vmnic3
 49  Veth33    UP     UP    FWD         0        vmk1
 50  Veth41    UP     UP    FWD         0        vmk0
 51  Veth385   UP     UP    FWD         0        vmk2                VXLAN
 59  Veth342   UP     UP    FWD         0        vm1.localhost.eth0

The VEM no longer knows how to assign the pinning within the port-channel: the SGID column is empty.
The host and the 1000v accept a VM arriving by vMotion, but the VM does not respond to ping.
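
The symptom can be confirmed from the host itself (the addresses are illustrative placeholders):

~ # vmkping <gateway_ip>   # vmkernel traffic does not get out either
~ # ping <vm_ip>           # the freshly vMotioned VM stays silent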

To get the VEM to restore proper pinning, so that the VMs and the vmkernel
interfaces answer ping again, we had to hot-swap the VEM:

~ # hotswap.sh -u && sleep 3 && hotswap.sh -l && sleep 5 && vem restart
The following switch is of type cisco_nexus_1000v: DvsPortset-0
Starting time: Fri May 24 16:36:01 UTC 2013
stopDpa
VEM SwISCSI PID is
returning
watchdog-vemdpa: Terminating watchdog with PID 5393551
Unload N1k switch modules
stop stun
Module vem-v152-stun being unloaded..
Module vem-v152-stun unloaded..
Module vem-v152-vssnet being unloaded..
Module vem-v152-vssnet unloaded..
Module vem-v152-n1kv being unloaded..
Module vem-v152-n1kv unloaded..
Module vem-v152-l2device being unloaded..
Module vem-v152-l2device unloaded..
Unload of N1k modules done.
startDpa
Ending time: Fri May 24 16:36:12 UTC 2013
stopDpa
VEM SwISCSI PID is
returning
watchdog-vemdpa: Terminating watchdog with PID 5791406
startDpa

This completely resets the VEM module.
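
Once the VEM is back, the recovery can be verified with the standard VEM tooling (a quick check, not taken from the original report):

~ # vem status -v      # confirms the VEM modules are loaded and the DVS port set is attached
~ # vemcmd show port   # every Veth should show a numeric SGID again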

We are going to go through the entire infrastructure to fix the issue
and add dedicated monitoring for this case:

0 - disabling the impacted robots
1 - removing the 100M uplink from the port-channel
2 - checking the status of the port-channel (operational pinning)
3 - patching the robots and resetting them
4 - setting up monitoring in case the pinning failure reoccurs in production (a sketch of the check follows this list)
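
For point 4, a minimal sketch of such a check, run on each host and assuming the vemcmd output layout shown above (a Veth line whose SGID field is missing means its pinning is lost, so we alert on it):

~ # vemcmd show port | awk '/Veth/ && $7 !~ /^[0-9]+$/ {print "pinning lost on", $2}'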

Ticket 626085527 opened with Cisco is in the process of being resolved. (/n1k/dc1003a-n1/4.2.1.SV2.1.1a/1000v and 10G behavior)


Update(s):

Date: 2013-05-30 12:54:53 UTC
No further occurrence of this bug has been observed since.
We remain vigilant and the ticket with Cisco remains open;
we will feed it with the monitoring results as they come in.

Date: 2013-05-25 00:17:12 UTC
The monitoring is in place; we are now watching for all of these behaviours.

The ticket with Cisco remains open; we will continue working with the Cisco engineers during the day on Monday.

Date: 2013-05-25 00:15:13 UTC
3 - patching the robots and resetting them

The robots are now working correctly; the 100M cards are no longer present.

4 - the monitoring is being set up


Date: 2013-05-25 00:13:33 UTC
1 - removing the 100M uplink from the port-channel

All the impacted hosts now have a port-channel composed of 10G ports only.

2 - checking the status of the port-channel (operational pinning)

All pinning has been recovered and redundancy is assured.

3 - patching the robots and resetting them

We are patching the robots.
Posted May 25, 2013 - 00:10 UTC