IBM parks parallel file system on Big Data's lawn

Mirror, mirror on the wall, who is the fattest of them all?

The IT universe is seeing a massive collision as the worlds of high-performance computing, big data and warehousing intermingle. IBM is pushing its General Parallel File System (GPFS) further to broaden its footprint in this space, with the 3.5 release adding big data and asynchronous replication features as well as customer-defined metadata and more performance.

GPFS is a large-scale file system running on Network Shared Disk (NSD) server nodes with the file data spread over a variety of storage devices and users enjoying parallel access. We got the GPFS 3.5 news from Crispin Keable, IBM's HPC architect based at Basingstoke.
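
To picture how that parallel access works, here is a minimal Python sketch of round-robin striping across a handful of NSD servers. The server names, block size and threaded reads are purely illustrative assumptions and bear no relation to GPFS's real allocator or client protocol.

    from concurrent.futures import ThreadPoolExecutor

    BLOCK_SIZE = 4 * 1024 * 1024                      # assume a 4MB file system block
    NSD_SERVERS = ["nsd0", "nsd1", "nsd2", "nsd3"]    # hypothetical server names

    def place_blocks(file_size):
        # Round-robin each block of a file onto an NSD server.
        n_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        return [(i, NSD_SERVERS[i % len(NSD_SERVERS)]) for i in range(n_blocks)]

    def read_block(block_no, server):
        # A real client would issue this read over the network to 'server'.
        return f"block {block_no} from {server}"

    def parallel_read(file_size):
        # Clients fetch blocks from many servers at once - hence the bandwidth.
        layout = place_blocks(file_size)
        with ThreadPoolExecutor(max_workers=len(NSD_SERVERS)) as pool:
            return list(pool.map(lambda bs: read_block(*bs), layout))

    print(parallel_read(20 * 1024 * 1024))            # five blocks, fetched in parallel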

The new release has Active File Management, an asynchronous version of the existing GPFS multi-cluster synchronous replication feature, which enables a central GPFS site to be mirrored to other remote sites, where users then get file access to the mirrored data at local rather than wide-area network speed. The link is duplex, so updates on either side of it are propagated across.

If the link goes down, the remote site can continue operating on the effectively cached GPFS data. Any updates are cached too, and to prevent stale data overwriting more recent data, the update of the central site from an offline remote site coming back online can be restricted to data newer than a pre-set date and time.
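
A rough sketch of that rule in Python, assuming each cached update carries a modification timestamp; the record layout and function names are invented for illustration and are not the AFM interface itself.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class CachedUpdate:       # hypothetical record of a write made while the link was down
        path: str
        mtime: datetime
        data: bytes

    def replay_updates(updates, cutoff):
        # Push cached updates back to the home site, skipping anything older
        # than the administrator's cutoff so stale data never overwrites newer data.
        accepted, skipped = [], []
        for u in sorted(updates, key=lambda u: u.mtime):
            (accepted if u.mtime > cutoff else skipped).append(u)
        return accepted, skipped

    updates = [
        CachedUpdate("/gpfs/projects/a.dat", datetime(2012, 4, 1, 9, 0), b"stale"),
        CachedUpdate("/gpfs/projects/b.dat", datetime(2012, 4, 3, 14, 30), b"fresh"),
    ]
    ok, dropped = replay_updates(updates, cutoff=datetime(2012, 4, 2))
    print([u.path for u in ok], [u.path for u in dropped])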

One thing to bear in mind is that there is no in-built deduplication in GPFS. If you wanted to reduce the data flowing across such a mirrored link you'd need something like a pair of Diligent dedupe boxes either side of it, or else use some other WAN optimisation/data reduction technique.

RAID and Big data

In petabyte-scale GPFS deployments there can be a thousand or more disks – and disks fail often enough that a RAID rebuild is going on somewhere in the deployment virtually all the time. This limits GPFS performance to that of the device on which the rebuild is taking place.

Keable says that, with de-clustered RAID, the NSD servers that farm out GPFS data to clients have spare CPU capacity, which they can use to run software RAID. GPFS deployments can have data blocks randomly scattered across JBOD disks, and this provides a stronger RAID scheme than RAID 6, says Keable. The big plus is that it spreads the RAID rebuild work across the entire disk farm, which lifts GPFS performance. Keable says this feature, which is a block-level algorithm and so capable of dealing with ever-larger disk capacities, was released on POWER7.
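
To see why spreading the stripes helps, here is a small Python simulation of the idea, with invented numbers: scatter each stripe over a random handful of JBOD disks and count how many disks end up sharing the rebuild work when one fails.

    import random

    N_DISKS = 1000          # a petabyte-scale JBOD farm
    STRIPE_WIDTH = 10       # data plus parity strips per stripe (illustrative)
    N_STRIPES = 50_000

    # De-clustered RAID: every stripe lands on a random set of disks.
    placement = [random.sample(range(N_DISKS), STRIPE_WIDTH) for _ in range(N_STRIPES)]

    failed = 42             # pretend one disk dies
    affected = [s for s in placement if failed in s]

    # Rebuild reads come from the surviving partners of the affected stripes,
    # which is close to the whole farm rather than one small RAID group.
    partners = {d for s in affected for d in s if d != failed}
    print(f"{len(affected)} stripes to rebuild, spread across {len(partners)} disks")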

He said IBM expected GPFS customers to use flash storage with de-clustered RAID "to hold its specific metadata – the V-disk as it's called."

GPFS is pretty much independent of what goes on below it at the physical storage layer.

GPFS 3.5 can also be run in a shared-nothing, Hadoop-style cluster and is POSIX-compliant, unlike Hadoop's HDFS. Keable says GPFS 3.5 is big-data capable and can deliver "big insights" from what he termed a "big insight cluster". This release of GPFS does not, however, have any HDFS import facility.
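
One practical consequence of POSIX compliance, sketched below with an invented path: a client can seek into the middle of an existing file and overwrite it in place with ordinary system calls, something HDFS's append-oriented interface does not allow.

    import os

    path = "/gpfs/bigdata/records.bin"               # hypothetical file on a GPFS mount

    # Plain POSIX semantics: open, seek to an offset, overwrite in place.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, b"\x00" * 4096, 1024 * 1024)   # patch 4KB at a 1MB offset
    finally:
        os.close(fd)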

Fileset features and metadata matters

Prior to GPFS 3.5, a sysadmin could take part of a GPFS file system tree – a fileset – and put it on a specific set of disks to provide a specific quality of service, such as faster responses from a set of fast Fibre Channel drives. Filesets can be moved dynamically without taking the file system down, and the sysadmin can move data across disk tiers on a daily or other time-unit basis.
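
GPFS expresses that kind of movement in its own policy language; the Python below is only a sketch of the underlying rule, with made-up pool names: data touched within the last day stays on fast disk, anything older becomes a candidate for demotion.

    import os
    import time

    FAST_POOL, SLOW_POOL = "fc15k", "sata"   # hypothetical storage pool names
    ONE_DAY = 24 * 3600

    def target_pool(path, now=None):
        # Keep recently accessed files on the fast tier, demote the rest.
        now = now or time.time()
        age = now - os.stat(path).st_atime
        return FAST_POOL if age < ONE_DAY else SLOW_POOL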

The fileset has an "i-node" associated with it – an i-node being a tag plus a block of data – which points to the actual file data and contains metadata such as origination date, time of first access and so on. GPFS used to keep all the fileset metadata mixed together in one place. With 3.5 the fileset metadata is separated out, and this has enabled fileset-level backup, snapshot, quota and group quota policies to be applied. Previously backup policies were applied at the file system level, but now, Keable says: "We can apply separate backup policies at the fileset level. It makes the GPFS sysadmin's job easier and more flexible."
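
Conceptually, the result is that each fileset can carry its own settings rather than inheriting one filesystem-wide set. The mapping below is an invented illustration of the idea, not GPFS configuration syntax.

    # Per-fileset policies, now that fileset metadata is kept separately.
    fileset_policy = {
        "projects": {"backup": "daily",  "snapshots": 7,  "quota_tb": 50},
        "scratch":  {"backup": "none",   "snapshots": 0,  "quota_tb": 200},
        "home":     {"backup": "hourly", "snapshots": 24, "quota_tb": 10},
    }

    def policy_for(fileset):
        # Fall back to a filesystem-wide default when a fileset has no policy of its own.
        return fileset_policy.get(fileset, {"backup": "weekly", "snapshots": 2, "quota_tb": 1})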

Because of this change GPFS has gained POSIX.0-compliance, which means the i-node can contain small files along with their metadata. So you don't have to do two accesses to get at such small files – for example one for the i-node pointer and then one more for the real data – as the i-node metadata and small file data are co-located.
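
The saving is easy to see in a sketch: if the data fits in the i-node it comes back with the metadata in a single read, otherwise a second read has to chase the block pointer. The structures below are invented for illustration.

    class Inode:
        # Metadata plus either inline data (small files) or a pointer to data blocks.
        def __init__(self, meta, inline_data=None, block_ptr=None):
            self.meta = meta
            self.inline_data = inline_data
            self.block_ptr = block_ptr

    def read_file(inode, read_block):
        if inode.inline_data is not None:
            return inode.inline_data          # one access: data co-located with metadata
        return read_block(inode.block_ptr)    # second access needed for larger files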

It gets better. A customer's own metadata can be added to the i-node as well. Keable says you could put the latitude and longitude of the file in the i-node and enable location-based activities for such files, such as might be needed in a follow-the-sun scheme. You could do this before but the process was slow as the necessary metadata wasn't in the i-node.
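
From Linux, one familiar way to hang that sort of user metadata off a file is through extended attributes, as sketched below. Whether GPFS exposes its custom i-node metadata through this exact interface is an assumption here, and the path and attribute names are invented.

    import os

    path = "/gpfs/satellite/frame_0001.img"          # hypothetical file

    # Tag the file with a position, then read it back for location-based processing.
    os.setxattr(path, "user.latitude",  b"51.26")
    os.setxattr(path, "user.longitude", b"-1.09")

    lat = float(os.getxattr(path, "user.latitude"))
    lon = float(os.getxattr(path, "user.longitude"))
    print(lat, lon)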

GPFS object storage and supercomputing

A UK GPFS customer said that this opened the way for GPFS to be used for object storage, as the customer-inserted metadata could be a hash based on the file's contents. Such hashed files could thereby be located and addressed via the hashes, effectively layering an object storage scheme on to GPFS.
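
A minimal sketch of that layering in Python: name each object by the SHA-256 of its contents and store it at a path derived from the hash, so it can be fetched by hash later. The directory layout is invented for illustration.

    import hashlib
    import os
    import shutil

    STORE = "/gpfs/objects"                     # hypothetical object store root

    def put(src_path):
        # Hash the file's contents and file it away under that name.
        h = hashlib.sha256()
        with open(src_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        key = h.hexdigest()
        dst = os.path.join(STORE, key[:2], key)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copyfile(src_path, dst)
        return key                              # callers address the object by this hash

    def get(key):
        # Resolve a hash back to the stored object.
        return os.path.join(STORE, key[:2], key)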

We also hear GPFS is involved with the Daresbury supercomputer initiatives. There are broadly three systems at Daresbury – a big SMP machine, a conventional x86 cluster and a Blue Gene – with some 7PB of disk drive data. GPFS underpins this and fronts a massive TS3500 tape library with 15PB of capacity.

GPFS is a mature and highly capable parallel file system, and it is being extended and tuned to work more effectively at the growing scale of big data systems. The worlds of scale-out file systems, massive unstructured data stores, high-performance computing storage, data warehousing, business analytics and object storage are colliding and mingling, and that collision is driving an intense and competitive development effort.

IBM is pushing GPFS development hard so that the product more than holds its position in this collision – in fact it extends it. ®
