Managing High Availability in the Cisco UCS System

By Anthony Sequeira on April 29th, 2012

This post is an excerpt from the CCIE Data Center Written Bootcamp that begins live, online, July 31, 2012, here at IPexpert.

Overview

In a previous post, we introduced students to an important component of the Cisco Unified Computing System (UCS) – the Cisco UCS 6100 Series Fabric Interconnect. That previous post can be found here:

Introducing the Cisco UCS 6120XP 20-Port Fabric Interconnect

In this post, we are going to talk about the importance of these devices for providing high availability. When we configure two of these Cisco UCS 6100 Series Fabric Interconnects in a cluster, both data planes are able to actively forward traffic. The management plane, by contrast, operates in an active-standby relationship between the two devices.

Creating the Cluster

As you might expect, in order for this clustering to take place, both fabric interconnects must be identical models. So, for example, a Cisco UCS 6120 fabric interconnect cannot cluster with a Cisco UCS 6140. Also, keep in mind that the two devices must run the same version of the Cisco UCS Manager software.

Now, thankfully, there is one important exception to this identical-model requirement. In order to facilitate an upgrade to a higher port count in your UCS, an unconfigured 6140 can connect to the active member of a 6120 cluster. Once the database is synchronized to the 6140, the 6120 can be removed from the cluster and replaced with a second 6140.

A private cluster network is required for management communications between the two devices. This is a 1 Gbps network that requires EIA/TIA Category 6 cabling and is formed by the L1 and L2 interfaces on the fabric interconnects. These links carry the cluster heartbeat messages as well as high-level management messages between the two fabric interconnects, and they run in an LACP bond (port channel) with fixed IP addressing.

Process Management

Cisco NX-OS is responsible for starting all Cisco UCS Manager processes on the two devices, and it also monitors these processes.

The Cisco UCS Manager Controller is a distributed application that runs on top of Cisco NX-OS. This modular approach helps to guarantee a high degree of fault isolation. The separation is also important because it allows the controller to distinguish between a failure of the system and a failure of the controller itself.

Local Chassis Storage

For local storage, NVRAM and flash are used to hold static data. This information is read and written by the local UCS Manager instance and, when both nodes in the cluster are up and available, it is synchronized between the two.

Cluster state data is stored in the serial EEPROM. This data is read and written by the chassis management controllers and is not replicated; UCS Manager reads it in order to determine chassis state information.

Primary Cluster Node Election

A Cisco UCS Manager instance will declare a new leader under the following circumstances:

  • The instance has received an acknowledgement that the election request has been processed
  • The election counter matches, confirming this is the correct election process
  • All processes propose the same new leader node
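To make those conditions concrete, here is a small, purely illustrative Python sketch of the acceptance check. This is not Cisco's actual implementation; the function and parameter names are hypothetical.

    # Illustrative model only -- not Cisco UCS Manager code.
    def accept_new_leader(ack_received, local_counter, election_counter, proposals):
        """Accept a new leader only when all three conditions above hold."""
        # 1. The election request has been acknowledged as processed.
        if not ack_received:
            return False
        # 2. The election counter matches, so this is the current election round.
        if local_counter != election_counter:
            return False
        # 3. Every participating process proposed the same new leader node.
        if len(set(proposals)) != 1:
            return False
        return True

    # Example: three processes all propose fabric interconnect "A" in round 7.
    print(accept_new_leader(True, 7, 7, ["A", "A", "A"]))   # True
    print(accept_new_leader(True, 7, 6, ["A", "A", "A"]))   # False (stale counter)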

To ensure stability, leadership changes only when there is an administrative configuration change or the leader process fails. Elections are infrequent, since only the following events can trigger them:

  • Administrative changes
  • New processes joining the group
  • Processes exiting the group
  • A process failure

Monitoring the Cluster Status

In order to monitor the cluster status in the GUI, choose the fabric interconnect on the Equipment tab in the navigation pane. In the content pane, click the arrow next to High Availability Details. From the CLI, use the show cluster extended-state command. The cluster lead and cluster force primary commands can be used at the CLI in order to force an election.
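For reference, here is roughly how those commands look at the fabric interconnect CLI. The hostname UCS-A is just a placeholder, and the exact prompts and command context can vary by UCS Manager release:

    UCS-A# connect local-mgmt
    UCS-A(local-mgmt)# show cluster extended-state
    UCS-A(local-mgmt)# cluster lead b
    UCS-A(local-mgmt)# cluster force primary

The show command reports the detailed cluster and HA state, cluster lead hands the primary role to the specified fabric interconnect (A or B), and cluster force primary forces the local fabric interconnect to take the primary role.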

Issues in the High Availability Cluster

Split Brain

This can occur in the cluster when there is a failure in the private network that is responsible for the node-to-node communications between the fabric interconnects. The serial EEPROM in the chassis is used to resolve these issues, with the chassis management controller on each half of the fabric handling the arbitration: the fabric A side has access to its own portion of the EEPROM data and read-only access to the fabric B portion, and the same applies in reverse for the fabric B side.

Partition in Space

Another issue that can occur when the private network between the cluster members fails is a partition in space. Here the problem is that each fabric interconnect might try to declare itself the active node. In order to resolve this, both nodes are demoted to subordinate and what Cisco calls a quorum race begins. The node that claims the most resources first wins and becomes the active node; the loser aborts the cluster and can rejoin as a subordinate once communications are restored.
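As a rough mental model only (this is not Cisco's code, and the resource-claiming details are simplified), the quorum race can be pictured like this in Python:

    # Hypothetical sketch of the quorum race described above.
    def quorum_race(resources_claimed_by_a, resources_claimed_by_b):
        """The node that has claimed the most chassis resources wins."""
        if resources_claimed_by_a > resources_claimed_by_b:
            return "A"   # A becomes the active node; B aborts and rejoins later
        if resources_claimed_by_b > resources_claimed_by_a:
            return "B"
        return None      # no winner yet; the race continues

    print(quorum_race(3, 1))   # A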

Partition in Time

This condition occurs when a node boots alone in the cluster. The node compares its database version against the version recorded in the serial EEPROM and discovers that its own version number is lower than the current database version. With this condition there is a risk of applying an old configuration to the Cisco UCS components, so the node will not become the active management node.
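Again as a simplified, hypothetical sketch (not Cisco's implementation), the version check amounts to something like this:

    # Hypothetical sketch of the partition-in-time check described above.
    def can_become_primary(local_db_version, eeprom_db_version):
        """A node that booted alone must not activate an older configuration."""
        # Refuse the active management role if the local database is older
        # than the version recorded in the chassis serial EEPROM.
        return local_db_version >= eeprom_db_version

    print(can_become_primary(local_db_version=41, eeprom_db_version=42))   # False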

We hope you enjoyed this detailed look at high availability in the Cisco UCS system.

Anthony Sequeira CCIE, CCSI
Twitter: @compsolv
Facebook: http://www.facebook.com/compsolv
