Microsoft boasts its cloudy Hadoop big data mill's faster than yours
Open-sourcier Redmond talks up speed, security gains for Azure HDInsight
Microsoft has overhauled its cloud-hosted Azure HDInsight Hadoop big data mill with extra security in the shape of enhanced authentication and identity management features plus a claimed 25 times performance boost in crunching big data queries.
Azure HDInsight is a service that lets users deploy and manage Apache Hadoop clusters on Microsoft’s Azure cloud, and has been developed in partnership with Hadoop specialist Hortonworks using the latter’s Hortonworks Data Platform.
It was also updated with support for Apache Spark just a few months back, adding in-memory processing to speed up analytics jobs.
Much of the underlying framework of Azure HDInsight is thus open source software, which Redmond is very much in favour of these days.
In fact, the firm claims it has played an important part in making the Apache Hive data warehouse tool run faster, and this is where the significant performance gains have come from, thanks to something called Live Long and Process (LLAP) functionality.
LLAP keeps data compressed while running in-memory, and along with other enhancements, delivers a 25x performance improvement for big data queries, according to Microsoft. However, as is often the case with cloud services, this is currently offered only as a public preview.
Performance gains also come from updating the Spark platform support to Spark 2.0, which overhauls the core query engine with the ability to perform cache-efficient vectorised computations for up to 10x faster processing.
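For readers wondering what "vectorised" means in practice, here is a minimal sketch of the general idea, not Spark's actual engine. It contrasts handling one record at a time with working through contiguous batches of column values, which is the layout that tends to keep CPU caches busy. The data, the function names and the batch size are all hypothetical, and plain Python will not show the speed-up itself; only the shape of the computation is illustrated.

```python
from array import array

# Hypothetical data: three rows of (price, quantity).
rows = [(2.0, 3), (1.5, 4), (4.0, 1)]


def total_row_at_a_time(rows):
    """Process one record at a time, touching every field of each row."""
    total = 0.0
    for price, quantity in rows:
        total += price * quantity
    return total


def total_columnar(prices, quantities, batch_size=2):
    """Process contiguous column batches, the layout vectorised engines favour."""
    total = 0.0
    for start in range(0, len(prices), batch_size):
        p = prices[start:start + batch_size]
        q = quantities[start:start + batch_size]
        # Each batch is a tight loop over adjacent values of the same type.
        total += sum(a * b for a, b in zip(p, q))
    return total


# Store each column in its own contiguous, typed array (assumed layout).
prices = array("d", (r[0] for r in rows))
quantities = array("l", (r[1] for r in rows))
```

Both functions return the same total for the same data; the difference lies in how memory is walked, which is where a real engine would find its gains.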
Security is set to get a boost with new features that will be turned on in October. These include integration of Azure HDInsight with Azure Active Directory, the cloud-based version of Microsoft’s directory and identity management service, and implementation of Apache Ranger, an open source project that provides centralised policy control for Hadoop clusters.
Meanwhile, the data processed by Azure HDInsight can now be secured while at rest through server-side encryption in the Azure Data Lake Store or Azure Storage. Users can also choose to manage their own encryption keys for this, storing them in the Azure Key Vault.
Redmond also welcomed a new crop of third-party vendors to the Azure HDInsight tent. Two outfits called Cask and StreamSets have joined the partner programme that enables application code to run directly on the HDInsight clusters instead of being hosted elsewhere. This enables end users to access Hadoop and Spark clusters pre-integrated and pre-tuned with their big data application of choice, Microsoft said. ®