Original URL: https://www.theregister.com/2011/11/01/hyperscale_network_management/

Hyperscale networks need holistic management

End-to-end services for virtual machines

By Chris Mellor

Posted in Networks, 1st November 2011 15:44 GMT

If large-scale storage networks were managed the way city road traffic systems are, the result would be catastrophic: traffic jams, delayed deliveries and lost messages. Network fabric management, unlike road traffic management, combines real-time traffic management with an end-to-end view that lets faults be found and fixed fast.

Consider the magnitude of the task. A network could have 800 servers with network adapters, 300 switches, two directors and fifty storage arrays. That is more than 1,150 network devices and upwards of 5,000 ports, and this is not an especially big network by large-scale cloud standards: commercially available fabric managers can scale up to 9,000 switch ports and 20,000 end-user devices. There is no way you or I could watch a status screen showing all these devices and ports and spot faults.
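The device tally in that example is simple addition. A few lines of Python make the arithmetic explicit, using only the counts quoted above:

```python
# Device counts from the example network described in the article.
devices = {
    "servers with network adapters": 800,
    "switches": 300,
    "directors": 2,
    "storage arrays": 50,
}

total_devices = sum(devices.values())
print(f"Total network devices: {total_devices:,}")  # Total network devices: 1,152
```

That total, 1,152, is what "more than 1,150 network devices" refers to; the port figure of 5,000 or so depends on how many ports each device has, which the example does not break down.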

The job is complex enough already: configuring paths through such a network, setting up zones, and applying policies for different traffic types so that network resources are allocated sensibly and low-priority traffic does not take resources better used by high-priority traffic.

Such networks need intelligent fabric management systems. Indeed, they are darn near un-manageable without them.

It's also not enough to monitor individual devices and network ports. The network is a single entity, providing data transport services between server applications, storage arrays and external services. The management system should monitor at the service level and provide end-to-end management.

End to end

Let's return to the city road traffic analogy. Service-level management there would be like having end-to-end monitoring of the road links between Grand Central Station and LaGuardia Airport in New York. The idea is so unrealistic as to be laughable. End-to-end traffic management in a town or city would be wonderful but is impractical. Yet it is exactly what data networks require because, without it, you have no idea which network-using services are affected by a device failure.

With products like Cisco's Fabric Manager and Brocade's Data Centre Fabric Manager (which is also OEMed by IBM), you can define end-to-end links across the fabric and monitor them. Real-time performance data for these links, such as traffic volume and network transit time, can be recorded and stored as a historical record.
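To make the idea of a monitored end-to-end link concrete, here is a minimal sketch, assuming a hypothetical data model: the `EndToEndPath` and `Sample` classes, device names and field names are illustrative only and are not taken from Cisco's or Brocade's products. Each path lists the devices it crosses and keeps every performance sample so trends can be reviewed later.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Sample:
    """One performance reading for an end-to-end path (hypothetical schema)."""
    timestamp: float   # seconds since the start of monitoring
    megabytes: float   # traffic carried during the sample interval
    transit_ms: float  # measured network transit time


@dataclass
class EndToEndPath:
    """A monitored path from a server port, through the fabric, to a storage port."""
    name: str
    hops: list[str]  # device IDs along the path, in order
    history: list[Sample] = field(default_factory=list)

    def record(self, sample: Sample) -> None:
        # Keep every sample: the history is the historical record.
        self.history.append(sample)

    def average_transit_ms(self) -> float:
        return mean(s.transit_ms for s in self.history)

    def total_megabytes(self) -> float:
        return sum(s.megabytes for s in self.history)


path = EndToEndPath("db01-to-array07", ["hba-db01", "edge-sw12", "core-dir1", "array07-p3"])
path.record(Sample(0.0, 120.0, 0.8))
path.record(Sample(60.0, 180.0, 1.2))
print(path.average_transit_ms(), path.total_megabytes())  # 1.0 300.0
```

A real fabric manager would collect these samples from switch counters on a schedule; here they are simply appended by hand.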

Brocade DCFM window


The fabric manager can monitor individual devices and issue alerts if utilisation exceeds settable thresholds. The management software thus knows both about the devices in the fabric, such as switches and their ports, and about end-to-end services and their requirements. If an end-to-end service starts failing, the information about the individual devices is to hand: it can be used to identify any failing switch devices, find the cause of the service lapse and guide remedial action.
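The combination described here, device-level thresholds plus knowledge of which services cross which devices, can be sketched in a few functions. This is a minimal illustration, assuming a hypothetical model in which each service is simply an ordered list of device names; the function names and the 0.80 threshold are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    utilisation: float    # fraction of capacity in use, 0.0 to 1.0
    healthy: bool = True  # False if the device reports a fault


def over_threshold(devices: dict[str, Device], threshold: float) -> list[str]:
    """Names of devices whose utilisation exceeds the settable alert threshold."""
    return sorted(name for name, d in devices.items() if d.utilisation > threshold)


def suspects_for(service: str, paths: dict[str, list[str]],
                 devices: dict[str, Device], threshold: float) -> list[str]:
    """Devices on a failing service's path that are faulty or over threshold."""
    return [name for name in paths[service]
            if not devices[name].healthy or devices[name].utilisation > threshold]


def services_using(device: str, paths: dict[str, list[str]]) -> list[str]:
    """Every end-to-end service whose path crosses the given device."""
    return sorted(svc for svc, hops in paths.items() if device in hops)


devices = {
    "edge-sw12": Device(0.95),
    "core-dir1": Device(0.40),
    "edge-sw15": Device(0.30, healthy=False),
}
paths = {
    "db01-to-array07": ["edge-sw12", "core-dir1"],
    "web03-to-array02": ["edge-sw15", "core-dir1"],
}

print(over_threshold(devices, 0.80))                           # ['edge-sw12']
print(suspects_for("web03-to-array02", paths, devices, 0.80))  # ['edge-sw15']
print(services_using("core-dir1", paths))                      # ['db01-to-array07', 'web03-to-array02']
```

The last function answers the question raised earlier: which network-using services are affected when a given device fails.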

Don't stop at the server port

In these days of cloud computing, fabric management cannot stop at the server's entry point, its network or SAN port. A holistic view is needed. The raison d'être of the network is to deliver services, not to the server, but to the applications running on it. With server virtualisation happening at an increasing rate, and VMware the dominant force in it, fabric managers must integrate with VMware and be virtualisation-aware.

The virtual machine should be the end-point for end-to-end service management, and it should appear on monitoring displays. The fabric manager should integrate with the hypervisor in each physical server attached to the network, and have access to virtual machine (VM) information so it can link that to its own network information, which is invisible to the hypervisors.
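The join described above, hypervisor knowledge of VMs combined with fabric knowledge of ports, can be sketched as two lookups. This assumes hypothetical inventories: a VM-to-host mapping as a hypervisor might report it, and a host-to-port mapping that only the fabric manager holds. The names and data shapes are illustrative, not any vendor's API.

```python
# What the hypervisors report: VM name -> physical host it runs on (hypothetical data).
hypervisor_inventory = {
    "vm-payroll": "esx-host-04",
    "vm-crm": "esx-host-04",
    "vm-web": "esx-host-09",
}

# What only the fabric manager knows: host -> [(adapter port, attached switch port)].
fabric_attachments = {
    "esx-host-04": [("hba0", "edge-sw12/port7"), ("hba1", "edge-sw13/port7")],
    "esx-host-09": [("hba0", "edge-sw15/port2")],
}


def vm_network_view(vm, inventory, attachments):
    """Trace a VM to its host, adapter ports and the switch ports they plug into."""
    host = inventory[vm]
    return [(host, adapter, switch_port) for adapter, switch_port in attachments.get(host, [])]


def vms_behind(switch_port, inventory, attachments):
    """VMs whose host is attached to the given switch port, i.e. affected if it fails."""
    hosts = {host for host, links in attachments.items()
             if any(port == switch_port for _, port in links)}
    return sorted(vm for vm, host in inventory.items() if host in hosts)


print(vm_network_view("vm-web", hypervisor_inventory, fabric_attachments))
# [('esx-host-09', 'hba0', 'edge-sw15/port2')]
print(vms_behind("edge-sw12/port7", hypervisor_inventory, fabric_attachments))
# ['vm-crm', 'vm-payroll']
```

Neither inventory alone can answer "which VMs lose storage access if this switch port fails"; together they can.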

Cisco Fabric Manager

Cisco Fabric Manager network display

The fabric manager should be able to set up and deliver against network quality of service (QOS) targets for individual VMs, specifying such things as performance, access security and encryption.

One network

For a customer there is a single network infrastructure made up of sub-networks: the server-SAN fabric, the server network, branch office-head office networks and disaster recovery sites. Each has its own management facilities, and the fabric management software should integrate with these to present a single overall fabric management function.
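One way to picture a "single overall fabric management function" is a thin layer that asks each sub-network's own manager for its status and merges the answers. The sketch below assumes a hypothetical `SubnetManager` interface with a `name` and an `alerts()` method; real management tools expose far richer and vendor-specific interfaces.

```python
from typing import Protocol


class SubnetManager(Protocol):
    """Hypothetical interface each sub-network's management tool is assumed to expose."""
    name: str

    def alerts(self) -> list[str]: ...


class StaticSubnetManager:
    """Stand-in for a real sub-network manager that returns canned alerts."""

    def __init__(self, name: str, alerts: list[str]):
        self.name = name
        self._alerts = alerts

    def alerts(self) -> list[str]:
        return list(self._alerts)


class OverallFabricManager:
    """Presents one view by collecting alerts from every sub-network manager."""

    def __init__(self, subnets: list[SubnetManager]):
        self.subnets = subnets

    def all_alerts(self) -> list[str]:
        # Prefix each alert with its sub-network so the merged list stays readable.
        return [f"{s.name}: {a}" for s in self.subnets for a in s.alerts()]

    def healthy(self) -> bool:
        return not self.all_alerts()


manager = OverallFabricManager([
    StaticSubnetManager("server-SAN fabric", ["edge-sw12 utilisation 95%"]),
    StaticSubnetManager("server network", []),
    StaticSubnetManager("DR site", ["replication link down"]),
])
print(manager.all_alerts())
# ['server-SAN fabric: edge-sw12 utilisation 95%', 'DR site: replication link down']
```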

Storage networking, server networking and even data centre management should be integrated. With data centres becoming virtualised as well as servers, the fabric manager has to manage virtual resources as well as physical ones. It has to interface with virtualised resource administrators and deliver fabric management resources to them so that, for example, networking can be provisioned to new VMs.

The VMware administrator could, for example, pick a network QOS profile from a template and tell the fabric manager to apply it to a new VM. The fabric manager then automatically provides a path to storage resources, bandwidth and security, and starts monitoring the health of these functions for that VM. It should also discover new VMs automatically, in the same way it discovers new physical network elements such as servers and switches.
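The template-driven workflow can be sketched as follows, assuming a hypothetical `FabricManager` class and made-up profile fields (bandwidth, encryption, transit-time target). It shows the two behaviours described: an administrator explicitly applying a profile, and automatic discovery assigning a default profile to VMs the hypervisors report but the fabric manager has not seen yet. The actual path, zoning and encryption set-up is reduced to recording intent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosProfile:
    """A network QOS template an administrator might pick (fields are illustrative)."""
    name: str
    min_bandwidth_mbps: int
    encrypted: bool
    max_transit_ms: float


TEMPLATES = {
    "gold": QosProfile("gold", 4000, True, 1.0),
    "bronze": QosProfile("bronze", 500, False, 5.0),
}


class FabricManager:
    """Hypothetical fabric manager that provisions and monitors VMs by template."""

    def __init__(self, default_profile: str = "bronze"):
        self.default_profile = default_profile
        self.provisioned: dict[str, QosProfile] = {}  # VM -> applied profile
        self.monitored: set[str] = set()

    def apply_profile(self, vm: str, template: str) -> QosProfile:
        profile = TEMPLATES[template]
        # A real product would now set up the storage path, bandwidth and security;
        # this sketch only records the profile and starts "monitoring" the VM.
        self.provisioned[vm] = profile
        self.monitored.add(vm)
        return profile

    def discover(self, reported_vms: list[str]) -> list[str]:
        """Give the default profile to any reported VM that is not yet known."""
        new_vms = [vm for vm in reported_vms if vm not in self.provisioned]
        for vm in new_vms:
            self.apply_profile(vm, self.default_profile)
        return new_vms


fm = FabricManager()
fm.apply_profile("vm-payroll", "gold")
print(fm.discover(["vm-payroll", "vm-web", "vm-test"]))  # ['vm-web', 'vm-test']
print(fm.provisioned["vm-web"].name)                     # bronze
```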

When VMs are moved from one server to another in a vMotion exercise, the fabric manager should know about it and handle the networking side, for example by reconfiguring paths through the fabric.
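A vMotion event changes the physical host behind a VM, so the VM's path to storage has to be recomputed from the new host. The sketch below assumes a small, hypothetical fabric topology and uses breadth-first search to pick the fewest-hop path; that is a deliberate simplification, since a real fabric manager would also weigh link load, zoning and redundancy. The `on_vmotion` handler name is illustrative, not a VMware or fabric-manager API.

```python
from collections import deque

# Hypothetical fabric topology as an undirected adjacency list.
FABRIC = {
    "esx-host-04": ["edge-sw12"],
    "esx-host-09": ["edge-sw15"],
    "edge-sw12": ["esx-host-04", "core-dir1"],
    "edge-sw15": ["esx-host-09", "core-dir2"],
    "core-dir1": ["edge-sw12", "core-dir2", "array07"],
    "core-dir2": ["edge-sw15", "core-dir1"],
    "array07": ["core-dir1"],
}


def shortest_path(graph, src, dst):
    """Breadth-first search: fewest-hop path from src to dst, or None if unreachable."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back through predecessors
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def on_vmotion(vm, new_host, storage, vm_hosts, vm_paths):
    """Record the VM's new host and recompute its path to its storage array."""
    vm_hosts[vm] = new_host
    vm_paths[vm] = shortest_path(FABRIC, new_host, storage)
    return vm_paths[vm]


vm_hosts = {"vm-payroll": "esx-host-04"}
vm_paths = {"vm-payroll": shortest_path(FABRIC, "esx-host-04", "array07")}
print(vm_paths["vm-payroll"])  # ['esx-host-04', 'edge-sw12', 'core-dir1', 'array07']
print(on_vmotion("vm-payroll", "esx-host-09", "array07", vm_hosts, vm_paths))
# ['esx-host-09', 'edge-sw15', 'core-dir2', 'core-dir1', 'array07']
```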

Hyper-scale networks

Road traffic management is tremendously complex. An interstate highway or autoroute can have a hundred or more entry and exit points, some with traffic lights, hundreds of bridges and information display signs, and tens of thousands of signs, road markings, speed limits and other infrastructure elements.

It could be managed as a single entity delivering a transit service to the vehicles using it, and that is the direction its management is taking. Managing a city's road traffic network is on a different level altogether.

Freeway at night


There are millions of access points and junctions, tens of millions of infrastructure elements such as signs, traffic lights and road markings, and more traffic types, such as buses, bicycles and pedestrians, which are kept off interstate highways, motorways and the like. That is an example of a hyper-scale network, and that scale is where cloud computing is taking us.

Unlike a city, with its millions of individuals deciding for themselves when, where and how to travel, a hyper-scale computing network is inherently more disciplined and manageable. The only way it can be managed is with an over-arching fabric manager that understands its responsibility: delivering reliable networking services to the network's end-users, end-to-end services that operate in both the virtual and physical domains.

Then, when there is a traffic hold-up in the city, we can sit in a coffee bar or hotel lounge, log on to the net and work remotely, effectively and securely, because the network 'just works.' ®