
Big data can be deduplicated

Isilon view disputed


Opinion Ronaldo Yamashita, an ESG analyst in Brazil, took exception to the Isilon view that big data is not compressible, and sent El Reg his reasoning for why he believes it can in fact be deduplicated.

When we talk about "text" big data, like logs or information gathered from different sources (e.g. web, credit agencies, Facebook), it is highly compressible. In fact, most data warehousing products use column-based compression to achieve high de-duplication ratios and to improve performance. After all, the fastest I/O is the I/O you don't have to make.
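Column-based compression works because values within a single column repeat heavily, especially once sorted. A minimal illustration of the idea, using run-length encoding (a common columnar scheme; this toy code is ours, not any vendor's):

```python
# Toy run-length encoding (RLE), a typical columnar compression scheme:
# repeated values within a column collapse into (value, run_length) pairs.

def rle_encode(column):
    """Encode a list of values as (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [v for v, n in runs for _ in range(n)]

# A sorted "country" column from a warehouse table compresses dramatically:
column = ["BR"] * 500 + ["UK"] * 300 + ["US"] * 200
runs = rle_encode(column)
assert len(runs) == 3             # 1,000 values stored as 3 pairs
assert rle_decode(runs) == column
```

The same property is why a column store can scan compressed data faster than a row store scans raw data: fewer bytes read means fewer I/Os.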

De-duplicated data results in better cache utilisation and less disk I/O. Deduplication is useful at any scale; just because most products can't handle it at high volumes doesn't mean it should not be used.
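The mechanism behind this is simple: store each unique chunk once and replace repeats with references. A hedged sketch using fixed-size chunks and a hash-keyed store (real products use content-defined chunking and far more engineering):

```python
# Illustrative fixed-size-chunk deduplication: identical chunks are stored
# once and referenced by hash, so repeated data costs almost nothing.
import hashlib

def dedupe(data, chunk_size=8):
    store = {}      # hash -> chunk bytes (stored once)
    refs = []       # the file as an ordered list of chunk hashes
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)
        refs.append(h)
    return store, refs

def rebuild(store, refs):
    return b"".join(store[h] for h in refs)

data = b"ABCDEFGH" * 100          # highly repetitive, log-like data
store, refs = dedupe(data)
assert len(store) == 1            # one unique chunk backs 100 references
assert rebuild(store, refs) == data
```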

When we look at things from an overall storage perspective, and not just from the specialised database point of view, Rob Peglar's worries with metadata make sense. [But there are ways around this.]

Microsoft proposed a way to reduce RAM needs for deduplication in its research technology called ChunkStash. It allocates only 2 bytes per record in RAM.
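The trick is to keep only a tiny signature per record in RAM and leave the full hash in the flash-resident index. A loose sketch of that idea (the class and method names here are our own invention, not Microsoft's API):

```python
# Loose sketch of the ChunkStash-style compact index: RAM holds only a
# 2-byte signature per record; the full 20-byte hash lives "on flash" and
# is checked only for the few candidates the signature admits.
import hashlib

class CompactIndex:
    def __init__(self):
        self.ram = {}      # 2-byte signature -> list of flash slots
        self.flash = []    # full (hash, payload) records, standing in for flash

    def _sig(self, full_hash):
        return full_hash[:2]          # the 2-bytes-per-record RAM footprint

    def insert(self, chunk):
        full = hashlib.sha1(chunk).digest()
        slot = len(self.flash)
        self.flash.append((full, chunk))
        self.ram.setdefault(self._sig(full), []).append(slot)

    def lookup(self, chunk):
        full = hashlib.sha1(chunk).digest()
        # RAM filters candidates; each candidate costs one flash read
        for slot in self.ram.get(self._sig(full), []):
            if self.flash[slot][0] == full:
                return self.flash[slot][1]
        return None
```

Signature collisions just mean an occasional extra flash read; correctness is preserved because the full hash is always compared before a match is declared.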

The problem of replicating metadata between nodes was solved by Scality, a startup which uses DHTs (Distributed Hash Tables) to spread metadata across nodes. This is the same technique used by P2P (peer-to-peer) systems that handle petabyte-scale volumes of data.
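A common way to build such a DHT is consistent hashing: nodes and keys hash onto the same ring, each node owns the arc before it, and a node joining or leaving disturbs only its neighbours. A small sketch under those assumptions (not Scality's actual implementation):

```python
# Hedged sketch of DHT-style metadata placement via consistent hashing:
# keys and nodes share one hash ring; each key belongs to the first node
# at or after its ring position.
import bisect
import hashlib

def ring_pos(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_pos(n), n) for n in nodes)

    def owner(self, key):
        positions = [p for p, _ in self.ring]
        i = bisect.bisect(positions, ring_pos(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# every metadata key maps deterministically to exactly one node
assert ring.owner("obj-42") == ring.owner("obj-42")
assert all(ring.owner(f"obj-{i}") in {"node-a", "node-b", "node-c"}
           for i in range(100))
```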

Scality is not as efficient as Isilon from a performance perspective, but it shows a possible way to solve the problem.

NetApp is as 'performant' as Isilon and solved the problem in a simpler way, with no dedupe metadata replication. Deduplication happens in a single node and the cluster has some intelligence to place similar files together. It is good for performance and also for deduplication.
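One simple way to get that placement intelligence is to route a file by its parent directory rather than its full path, so related files land on the same node and local dedupe can find their shared data. A hypothetical sketch of the idea (our toy policy, not NetApp's actual algorithm):

```python
# Hypothetical locality-aware placement: hash the parent directory, not
# the full path, so sibling files co-locate on one node and can be
# deduplicated there with no cross-node dedupe metadata.
import hashlib
import posixpath

def place(path, nodes):
    parent = posixpath.dirname(path)
    h = int(hashlib.md5(parent.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["n1", "n2", "n3", "n4"]
a = place("/logs/web/access.2014-11-01", nodes)
b = place("/logs/web/access.2014-11-02", nodes)
assert a == b   # siblings always land together, so local dedupe sees both
```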

Databases such as Vertica and Greenplum can benefit from data locality too. They don't use global deduplication and achieve spectacular compression ratios.

Ocarina, the compression/deduplication company bought by Dell, even demonstrated how to get better compression from unexpected files, like images and videos. It can be used in the oil and gas industry for example, where it had long been thought impossible to achieve good compression ratios.

The way companies handle data may make it more compressible too. Jeff Jonas, from IBM, has a great presentation showing how stacking data ... is great for data reduction algorithms, as you don't need to store a person's name twice.
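The stacking idea reduces data before any storage-layer dedupe even runs: records describing the same entity are merged so the identifying fields appear once. A toy illustration, keying entities on name plus date of birth purely for the example:

```python
# Illustrative entity "stacking": records about the same person collapse
# onto one key, so identifying fields (name, DOB) are stored only once.
# Keying on (name, dob) is a toy assumption for this sketch.
records = [
    {"name": "Ana Silva", "dob": "1980-01-05", "source": "web"},
    {"name": "Ana Silva", "dob": "1980-01-05", "source": "credit"},
    {"name": "Bob Reis",  "dob": "1975-06-12", "source": "web"},
]

stacked = {}
for r in records:
    key = (r["name"], r["dob"])
    stacked.setdefault(key, []).append(r["source"])

assert len(stacked) == 2   # three records, two people
assert stacked[("Ana Silva", "1980-01-05")] == ["web", "credit"]
```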

Compression and deduplication will play a key role in big data; it's all about economics. As ESG's Steve Duplessie pointed out, the next storage battle will be economic. If your system demands an order of magnitude more storage than competing suppliers' systems, it will struggle to compete. ®
