Apache Foundation rushes out Arrow as 'Top-Level Project'

... then it took an Arrow to the TLP

Wed 17 Feb 2016 // 12:29 UTC

The Apache Software Foundation has today announced Apache Arrow, its new project which aims to provide a cross-system data layer for columnar in-memory analytics.

While Apache projects normally go through incubation periods, Arrow has been immediately announced as a Top-Level Project, and its code – seeded from the Apache Drill project – is being released today.

Apache Arrow is intended to establish a "de-facto standard for columnar in-memory processing and interchange," although its first formal release is a few months away.

Jacques Nadeau, veep of both Arrow and Drill, modestly said: "We anticipate the majority of the world's data will be processed through Arrow within the next few years."

Talking to The Register, Nadeau said that "key guys" from other Apache Big Data projects – comprising Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark and Storm – "as well as established and emerging Open Source projects such as Pandas and Ibis" are involved.

In addition to traditional relational data, Arrow supports complex data with dynamic schemas. For example, Arrow can handle JSON data, which is commonly used in IoT workloads, modern applications and log files. Implementations are also available (or under way) for a number of programming languages including Java, C++ and Python to allow greater interoperability.

Todd Lipcon, founder of the Apache Kudu project and member of Arrow's Project Management Committee, said: "Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."
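To make the columnar idea concrete, here is a minimal sketch using only Python's standard library. It is not Arrow's API or its memory format; the records, field names and helper functions are hypothetical and exist purely for illustration. It shows the same records pivoted from a row-by-row layout into one contiguous typed buffer per field, which is the kind of layout that vectorized scans favour.

```python
from array import array

# Hypothetical sample records (not from the article), stored row by row.
rows = [
    {"sensor_id": 1, "temperature": 20.5},
    {"sensor_id": 2, "temperature": 21.0},
    {"sensor_id": 3, "temperature": 19.5},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per field."""
    return {
        "sensor_id": array("q", (r["sensor_id"] for r in records)),      # signed 64-bit ints
        "temperature": array("d", (r["temperature"] for r in records)),  # 64-bit floats
    }


def mean(column):
    """Scan a single contiguous column; the other fields are never touched."""
    return sum(column) / len(column)


columns = to_columns(rows)
average_temperature = mean(columns["temperature"])
```

Pure Python will not emit SIMD instructions here; the point is only that an aggregate over one field reads a single dense buffer instead of hopping across whole records.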

Apache claimed that, for many workloads, 70-80 per cent of CPU cycles were spent serializing and deserializing data. Arrow is intended to solve this problem by "enabling data to be shared between systems and processes with no serialization, deserialization or memory copies."
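The difference between copying data through a serialization round trip and sharing it in place can be sketched with Python's standard library alone. This is an analogy under stated assumptions, not Arrow's implementation: the column contents are hypothetical, `pickle` stands in for any serialize/deserialize step, and `memoryview` stands in for zero-copy access to a shared buffer.

```python
import pickle
from array import array

# Hypothetical column of 64-bit integers held in one contiguous buffer.
column = array("q", [10, 20, 30, 40])

# Serialization round trip: the consumer ends up with an independent copy.
copied = pickle.loads(pickle.dumps(column))

# Zero-copy access: a memoryview exposes the same memory without duplicating it.
view = memoryview(column)

# Slicing a memoryview yields another view onto the same bytes, still no copy.
tail = view[2:]

# Writes to the original buffer are visible through the views...
column[3] = 99
shared = tail.tolist()

# ...but not through the deserialized copy.
column[0] = -1
```

After this runs, `view` and `tail` reflect both writes, while `copied` still holds the original values, because it was rebuilt from bytes rather than sharing the buffer.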

Letting multiple systems work together without the overhead of moving data between them is what Apache Arrow is meant to do, Nadeau told us.

"You have to move data around between different nodes, and potentially move it between Java and Python, for instance," Nadeau added, "so any time doing this between two different programming environments, or engines, all of those transfers benefit from the lack of serialization/deserialization."

"An industry-standard columnar in-memory data layer enables users to combine multiple systems, applications and programming languages in a single workload without the usual overhead," said Ted Dunning, Vice President of the Apache Incubator and member of the Arrow PMC. ®
