Never mind the switch, feel the fabric
Vendors take up weaving
It is not clear who started it, but it is all-out war in the systems business and the battleground is convergence.
Networking giant Cisco Systems has moved forcefully into the server racket, IBM is buying its way back into the networking business, Dell is buying into it for the first time and Hewlett-Packard is expanding the business it already has while also making acquisitions.
Servers and storage arrays are being virtualised and sharing networks, which provides operational efficiencies and lower capital expenses (in theory anyway). The network is at the centre of it all – and is quickly becoming a management hairball.
In an ideal world, you would have a single tool that spans the switches, servers and storage, and the server hypervisors, with their integrated virtual switches, would be the command-and-control system for the whole enchilada.
But we don't live in an ideal world. A cross-platform, cross-vendor network fabric management tool that hooks into all of this physical and virtual iron does not yet exist.
Some networking vendors do, however, have fabric management tools for their own products, and at least one player is working on a set of tools that will span a wide variety of products, not just its own.
Let’s explore the approaches of three vendors: Mellanox, Cisco and IBM.
Mellanox Technologies, the maker of InfiniBand switch and network interface chips, hybrid InfiniBand/Ethernet host adapter cards and InfiniBand switches, shelled out $218m in November to keep an ASIC partner from slipping into enemy hands. It wanted to give itself a head start in moving into the 10 Gigabit Ethernet switch market, but also to get its hands on the Unified Fabric Manager (UFM) software created by Voltaire to span its InfiniBand and Ethernet switches.
UFM can provide some basic management functions for any switch supporting a standard management information base, which is part of the Simple Network Management Protocol (SNMP).
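To make that concrete, here is a toy sketch of the visibility a standard MIB gives any management tool. The "agent" below is an in-memory stand-in for a real switch answering SNMP GET requests, and its contents are invented; the object identifiers, though, are from the standard MIB-II subtree that any compliant switch exposes.

```python
# Standard MIB-II object identifiers (the same on every vendor's switch).
SYS_DESCR    = "1.3.6.1.2.1.1.1.0"        # sysDescr: device description
SYS_UPTIME   = "1.3.6.1.2.1.1.3.0"        # sysUpTime: hundredths of a second
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10.1"   # ifInOctets for interface 1

# Hypothetical switch state; a real tool would send UDP packets to port 161.
fake_agent = {
    SYS_DESCR: "Acme 10GbE top-of-rack switch",
    SYS_UPTIME: 360000,          # one hour of uptime
    IF_IN_OCTETS: 123456789,
}

def snmp_get(agent, oid):
    """Return the value bound to an OID, as an SNMP GET would."""
    return agent[oid]

print(snmp_get(fake_agent, SYS_DESCR))
```

Because these objects are standardised, even a vendor-specific tool such as UFM can read basic health and traffic counters from any switch on the network.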
Because Hewlett-Packard and Blade Network Technologies were partners of Voltaire, UFM also had extensive integration with HP's ProCurve 6600 and BNT's RackSwitch 8124G switches.
But following the acquisition of Voltaire by Mellanox, of 3Com by HP and of BNT by IBM, UFM has been more tightly focused on Mellanox’s InfiniBand and Ethernet switches and adapters. This parallels what happened with blade servers, with every vendor offering its own in-chassis management tool to make the hardware run applications more efficiently.
Gilad Shainer, senior director of high-performance and technical computing at Mellanox, says the fabric manager is the last missing piece in the cloud stack, which includes virtual machine managers, service orchestrators, job schedulers and storage managers.
Through the keyhole
Being a networking vendor, Mellanox naturally puts the fabric manager at the centre of it all, providing a window into all aspects of the network and the systems that use it.
Its job involves monitoring the performance of devices on the network and aggregating statistics on applications and jobs running on systems; detecting problems and correlating them with particular jobs or applications; providing detailed alert information for each job and node; and changing network fabric policies on the fly in response to changing resource needs to optimise performance.
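The correlation step in that list can be sketched in a few lines: given per-port congestion readings and a map of which job's traffic crosses which port, flag the jobs behind any hot ports. Every name and number below is invented for illustration.

```python
# Which job's traffic uses which switch port (hypothetical mapping).
port_to_job = {"swA/1": "risk-calc", "swA/2": "render-farm"}

# Fraction of link capacity in use per port, as a monitor might report it.
congestion = {"swA/1": 0.92, "swA/2": 0.15}

def jobs_behind_hot_ports(congestion, port_to_job, threshold=0.8):
    """Return the set of jobs whose ports exceed the congestion threshold."""
    return {port_to_job[p] for p, load in congestion.items() if load > threshold}

print(jobs_behind_hot_ports(congestion, port_to_job))  # {'risk-calc'}
```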
UFM monitors the physical, logical and application layers of the network, and can push fabric policies back down to the switches, hypervisors and servers in the network. This includes provisioning virtual networks for virtual machines and letting network settings travel with the virtual machines as they are migrated around the network.
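The "settings travel with the virtual machine" idea boils down to keeping the network profile attached to the VM rather than to a physical switch port, so a migration re-applies it on the destination host automatically. The sketch below is a toy model of that behaviour; all names are hypothetical.

```python
# A VM's network profile (VLAN, QoS class) lives with the VM, not the port.
vm_profiles = {"vm42": {"vlan": 110, "qos_class": "gold"}}

# Per-host view of which profiles are plumbed into the local virtual switch.
host_ports = {"hostA": {"vm42": vm_profiles["vm42"]}, "hostB": {}}

def migrate(vm, src, dst):
    """Move a VM between hosts, carrying its network profile along."""
    profile = host_ports[src].pop(vm)
    host_ports[dst][vm] = profile   # no manual switch reconfiguration needed

migrate("vm42", "hostA", "hostB")
print(host_ports["hostB"]["vm42"])  # {'vlan': 110, 'qos_class': 'gold'}
```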
As far as Mellanox is concerned, large systems with thousands of server nodes absolutely require something like UFM to make the networks support the needs of applications more efficiently.
"Even when you are talking about a few hundred servers, without a view into the fabric, your setup could be dead wrong and you wouldn't even know it," says Shainer.
"It's like Google Maps for your network. You can see where your traffic jams are."
UFM includes quality-of-service (QoS) features that allow network admins to set policies for the performance of switches and NICs, interfacing with job schedulers to make sure the right work gets done on a cluster (whether it is running commercial software or a high-performance workload) in the right order and at the right speed.
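At its simplest, a QoS policy of this kind amounts to tagging traffic classes with priorities and granting contended link time in priority order. The class names and priority numbers below are made up for illustration, not taken from UFM.

```python
# Lower number = serviced sooner (invented classes and priorities).
QOS_PRIORITY = {"trading": 0, "batch": 2, "backup": 3}

def service_order(jobs):
    """Return job names in the order their traffic gets link time.

    `jobs` is a list of (job_name, traffic_class) tuples; ties keep
    submission order because sorted() is stable.
    """
    return [name for name, cls in sorted(jobs, key=lambda j: QOS_PRIORITY[j[1]])]

jobs = [("nightly-backup", "backup"), ("risk-calc", "trading"),
        ("log-crunch", "batch")]
print(service_order(jobs))  # ['risk-calc', 'log-crunch', 'nightly-backup']
```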
How UFM fits into a dynamic server cluster
In one benchmark test run by Mellanox, a single job running in a cluster has an average latency for transactions of 2.1 microseconds (the kind of thing Wall Street hedge funds need to make a living). When a second job is added to the cluster, the latency jumps to 4.5 microseconds. Switching on UFM's QoS features and letting it optimise the network reduces the overall latency on both jobs to 2.2 microseconds.
Mellanox charges for UFM on a per-node basis: the Standard Edition costs $139 per node and the Advanced Edition, which has optimisations specifically for high-performance computing libraries, costs about twice that. Enterprise licences for customers building machines with thousands of nodes are available at a discounted price.
In the end, UFM adds a few per cent to the overall cost – not a big deal, says Shainer. "But it can increase the productivity of the clusters in the tens of per cents, if not more," he adds.
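A quick back-of-the-envelope check on "a few per cent": the per-node server cost below is an assumption for illustration, not a figure from Mellanox; only the $139 licence price comes from the article.

```python
ufm_per_node = 139.0    # UFM Standard Edition licence, per node (article figure)
server_cost = 4000.0    # assumed all-in hardware cost per cluster node

overhead_pct = 100 * ufm_per_node / server_cost
print(f"{overhead_pct:.1f}%")  # about 3.5 per cent
```

On that assumed node price, the licence overhead lands squarely in the "few per cent" range Shainer describes.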
Storage warp and server weft
Cisco Systems deals with two levels of fabric management. One is inside the "California" Unified Computing System blade servers and related rack servers, which feature an integrated network and server management tool called UCS Manager.
At a higher level, Cisco peddles Data Center Network Manager, a converging set of server and storage network management tools that span its director, top-of-rack and edge switches.
The company also needs to integrate the two products so Unified Computing System looks like any other kind of network device and can be managed from Data Center Network Manager.
"We have a lot of cool technology out there but you need fabric management," says Omar Sultan, Cisco’s solution manager for data centre switching.
This means not only putting in hooks for automation between Data Center Network Manager and Unified Computing System, but also integrating with existing enterprise management frameworks.
"There are no greenfield management accounts," says Sultan. "Everyone already has some sort of management framework in place and you can't disrupt that.
"Our fabric view is beyond the Layer 2 and Layer 3 fabric – it includes servers, security and Layers 4 through 7 on the network."
Cisco has been hard at work expanding the Fabric Manager software for managing the Fibre Channel storage switches that it inherited from its $700m acquisition of Andiamo in 2004.
Andiamo is the source for much of the technology in the Nexus family of converged switches, as well as the networking at the heart of the UCS blade servers. Fabric Manager came out in 2002 and was updated in 2005.
In 2008, Cisco rolled out Data Center Network Manager as an Ethernet companion to Fabric Manager. This software can be used to manage Catalyst and Nexus switches, including the Nexus 1000V virtual switch that tucks up inside VMware's ESXi hypervisor and replaces its own integrated distributed virtual switch. (The Nexus 1000V is a key component of Unified Computing System but can run on any x86 server sporting ESXi.)
Data Center Network Manager handles configuration through templates and wizards as well as command line interfaces, offers event monitoring and performance trending, and does event forwarding and impact analysis for troubleshooting. Not all features are available for all Nexus, Catalyst, MDS 9XX and UCS 6XXX switches, but Cisco is working on it.
To make things easier, in July Cisco also rebranded these storage and network fabric managers and gave them a common management console: the Data Center Network Manager Web Client. This has dashboards, written in Adobe Flex, for looking into the fabrics.
Fabric Manager is now Data Center Network Manager for SAN, and the original Data Center Network Manager is now called Data Center Network Manager for LAN. Now, Cisco has a single, integrated product that spans Fibre Channel, Ethernet and Fibre Channel over Ethernet switches – so long as the gear has the Cisco name on it, of course.
Sultan says that while Data Center Network Manager can discover and interface with any switch on the network, the company would have to expand its engineering and quality assurance teams by a factor of three or four to get it to work properly with other vendors' gear.
The Essentials Edition of Data Center Network Manager is free and can do simple SAN and LAN provisioning and monitoring. If you want more complex SAN and LAN provisioning, monitoring and management, then you need to pay for the Advanced Edition.
Sultan says that generally speaking, Data Center Network Manager has an incremental cost above the hardware price that works out to between five and 10 per cent of the cost of the switching gear.
Big Blue has open mind
In September 2010 IBM bought Blade Network Technologies, a maker of rack-based Ethernet switches (rebranded Voltaire gear), as well as blade switches for IBM, HP and NEC blade servers, for a rumoured $400m.
Three months later, before its acquisition by Mellanox, Voltaire signed a global licensing deal for Unified Fabric Manager and a joint development agreement with IBM, with a combined value of $17m over two years.
IBM has not disclosed the nature of the agreement between itself and Mellanox, but Tom Parker, IBM's senior product manager for systems networking management software, confirms that a new tool combining the functionality of various networking tools does not include any Unified Fabric Manager code.
But clearly something is in the works. The acquisitions of BNT and of Intelliden, bought back in February 2010, as well as IBM's Systems Director multi-platform systems management tools, could be the foundation of an interesting management framework that extends the fabric beyond the network and all the way up to applications.
"We think we are in a great position to bring together a lot of different assets," says Dan Tuchler, vice-president of product management for IBM's system networking division.
As far as IBM is concerned, you need to integrate the management of networks, servers, storage, middleware and applications because customers are tired of getting stuck down in the weeds of device management.
"What they want is to load their applications and make them work," says Tuchler.
He adds that it is not safe to assume that IBM will support this future network-application management software only on its own iron. "I think we will support many vendors," he says.
And, of course, thanks to the vast Global Services behemoth which manages lots of servers, storage and networks on behalf of IBM’s customers, the vendor has to take a broader view than, say, Mellanox or Cisco.
Follow my leader
In the meantime, at the end of November IBM announced what could be the foundation of a new switch-system-application fabric: System Networking Element Manager v6.1.
According to Parker, it incorporates what was BNT's BladeHarmony v6.1 switch manager, including an upgrade to the important VMReady feature, which allows virtual machines to migrate from server to server without having to reconfigure the network settings. This is the kind of thing IBM wants to do across many kinds of devices.
"People don't want to get bogged down in MAC addresses and QoS settings. We think we have a lead here and we are going to extend it," says Tuchler.
The new System Networking Element Manager v6.1 tool also wraps in the Tivoli Netcool Configuration Manager from the Intelliden acquisition, and a subset of Tivoli Network Manager for System Networking Element Manager, which does network discovery and topology-based root-cause analysis.
The System Networking Element Manager stack also includes functionality from Tivoli Netcool/OMNIbus, an event manager that can de-duplicate alerts as switches keep pinging for help. ®
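That de-duplication idea fits in a few lines: repeated alerts with the same identity are collapsed into one record with a tally, rather than flooding the operator's console. The event format below is invented for illustration.

```python
def deduplicate(events):
    """Collapse repeated (node, alarm) events into one entry with a count."""
    seen = {}
    for node, alarm in events:
        key = (node, alarm)
        seen[key] = seen.get(key, 0) + 1
    return seen

# A switch in trouble raises the same alarm 50 times; the operator sees it once.
storm = [("switch7", "link-down")] * 50 + [("switch9", "fan-fail")]
print(deduplicate(storm))  # two entries, with counts 50 and 1
```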