Original URL: https://www.theregister.com/2012/03/11/hugh_durrant_whyte_big_data/

The challenges of hot rocks and big data

Interview: NICTA CEO, Hugh Durrant-Whyte

By Richard Chirgwin

Posted in Science, 11th March 2012 23:00 GMT

Of all the alternative energy proposals suggested for Australia, geothermal is probably the least-understood in the public mind.

Australia’s problem is this: most places that adopt geothermal power do so because the heat is right there at the surface. Australia’s “hot rocks”, on the other hand, are a long way underground. Turning the heat into electricity needs a lot of expensive drilling, and sites have to be chosen carefully.

Finding a suitable site – one in which you can pump water down to the heat source without losing it, without damaging the environment in some way (for example, by polluting or compromising an aquifer), and without losing the water to porous rocks at the bottom of the hole – can demand millions of dollars in drilling.

Hence, for a nation full of suitable underground heat, if only you can get at it, the hunt for geothermal power sites becomes a “big data” problem of staggering scale. Last week, The Register covered NICTA’s announcement that it’s working on a multi-million-dollar geothermal analytics project. In this interview with The Register, NICTA CEO Dr Hugh Durrant-Whyte discusses the “big data” challenges involved in the project.

The Register: If we look around the world, there are places that have been using geothermal power for years. What is it that turns geothermal into a “big data” project in Australia?

Durrant-Whyte: “The type of geothermal resource we’re typically looking for in Australia is not near the surface – it’s four kilometres underground, because the heat is driven by radioactivity, not by vulcanism.

“The aim is to try to work out what’s going to be down there before we drill the hole. To do that, we have to gather all the data – and there is a lot of it – from diverse sources.

“We have gravity data, which allows us to estimate the density of rocks. There’s magnetotelluric data, which measures the electromagnetic energy from the Sun as it passes through the Earth at different frequencies. There’s also magnetic data, and there’s a lot of seismic information, particularly for areas where people have been exploring for minerals.

“What we want to do with all this data is to infer, in a machine learning sense, not just if there is something interesting down there, but to get some idea of its properties.”

Durrant-Whyte said it’s not just about whether we can find the “hot rocks”: we have to know whether the intervening rock between the surface and the heat source is suitable for drilling.

“Our basement rock [in Australia] at 4 km is full of things like uranium to heat the rocks up, so if we can get this right, it will be a very valuable resource. But it’s not just finding the hot rocks – they have to be the type of rock that you can work with.

“So one of the goals of this project is to try to minimize the risk for geothermal companies drilling, to help them predict where the best places to drill will be.”

Masses of data

The Register: What does that imply in a computer science sense?

Durrant-Whyte: “The big issue is that there is a vast amount of different unstructured data – gravity images, seismic data, drilling reports. Each is measured differently: drilling gives you point data, gravity scans are a sum over areas, and so on.

“And the amount of data is massive.”
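To make that distinction concrete: both a drill-hole reading and a gravity reading can be written as linear observations of the same latent subsurface field, just with very different observation operators. The short Python sketch below is only an illustration (the one-dimensional grid, the values and the operators are invented for this article, not drawn from NICTA’s system).

    # A toy "observation model" contrasting point data with area-averaged data.
    # Both can be written as y = H @ f over the same latent field f; only the
    # operator H differs. Grid size and values are illustrative assumptions.
    import numpy as np

    # Latent subsurface property on a coarse 1-D grid of 8 cells (made-up values).
    f = np.array([2.0, 2.1, 2.3, 2.6, 2.4, 2.2, 2.1, 2.0])

    # A drill hole samples one cell directly: its H row is a one-hot selector.
    H_drill = np.zeros((1, 8))
    H_drill[0, 3] = 1.0

    # A gravity reading integrates over an area: its H row averages several cells.
    H_gravity = np.zeros((1, 8))
    H_gravity[0, 2:6] = 0.25

    print("drill observation  :", (H_drill @ f)[0])    # one cell, sharp
    print("gravity observation:", (H_gravity @ f)[0])  # a smoothed average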

Australia, he explained, is unusual in having a form of “public deposit” for geodata. Exploration surveys by the private sector are, eventually, made public, either in surveys collected by state governments or via the national authority, Geoscience Australia.

The computing facility NICTA is assembling will have to undertake three steps to bring the datasets to a starting point, Durrant-Whyte explained:

“First, we have to assimilate the data – how do we get the different pieces of information together into a coherent picture? We’re trying to generate a 3D picture of the subsurface of Australia.

“Each data source has different levels of uncertainty, which you need to capture in the process of putting it together, and it’s at a continent-wide scale.

“The second process is discovery. Unlike most ‘big data’ problems, we don’t start with a unique characterization of what we’re looking for. Every geothermal target we find will be different.”

“The third is the computation challenge, particularly the algorithms. We’re developing non-parametric Bayesian algorithms to deal with this, and they require serious computation, able to deal with ten to twelve dimensions of data.

“There’s an issue of how you map this kind of algorithm onto something like a cloud – how do you put that into a distributed environment and do serious computation on it?”
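The assimilation step he describes, combining sources that carry different levels of uncertainty, boils down to weighting each source by how much you trust it. The minimal sketch below shows classic inverse-variance fusion of two hypothetical temperature estimates; the numbers are illustrative assumptions rather than survey data, and NICTA’s non-parametric Bayesian machinery is considerably more elaborate.

    # Minimal inverse-variance fusion of two estimates of one quantity.
    # All numbers are illustrative assumptions, not real survey values.
    import numpy as np

    def fuse_gaussian(estimates, variances):
        """Combine independent Gaussian estimates of one latent quantity.

        With a flat prior, the posterior is the precision-weighted average;
        the more certain source gets the larger weight.
        """
        estimates = np.asarray(estimates, dtype=float)
        precisions = 1.0 / np.asarray(variances, dtype=float)
        post_var = 1.0 / precisions.sum()
        post_mean = post_var * (precisions * estimates).sum()
        return post_mean, post_var

    # Hypothetical temperature estimates (deg C) at 4 km depth: a drill-hole
    # measurement (accurate, small variance) and a gravity-derived model (coarse).
    mean, var = fuse_gaussian(estimates=[212.0, 195.0], variances=[25.0, 400.0])
    print(f"fused estimate: {mean:.1f} degC, std dev {np.sqrt(var):.1f}")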

The Register: Can we look at the discovery problem in a little more detail? Why is this an unusual ‘big data’ problem?

Durrant-Whyte: “It’s because the domain of inference – what you’re trying to find – is spatially continuous across a huge area.

“When you take a measurement at a specific location, its impact is non-local. The fact that you know the stratigraphy (the characteristics of the sub-surface) in one location tells you a lot about how you interpret the data in other locations.

“That’s quite different from other ‘big data’ problems, which are often in discrete packets. In this case, instead, there’s a big continuum that you’re trying to discover, but you’re only seeing little bits of it.”
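A Gaussian process is one common non-parametric Bayesian model for exactly this kind of spatially continuous inference (the interview does not name NICTA’s specific choice). The sketch below, with an assumed 10 km covariance length-scale, shows the non-local effect Durrant-Whyte describes: a single observation tightens the predictions at nearby locations while leaving distant ones untouched.

    # A Gaussian process sketch of "non-local" information: one observation
    # at x = 0 shrinks predictive uncertainty at nearby locations as well.
    # The kernel, its 10 km length-scale and the noise level are assumptions.
    import numpy as np

    def rbf_kernel(xa, xb, length_scale=10.0, variance=1.0):
        """Squared-exponential covariance between two sets of 1-D locations (km)."""
        d = xa[:, None] - xb[None, :]
        return variance * np.exp(-0.5 * (d / length_scale) ** 2)

    x_star = np.array([0.0, 5.0, 20.0, 100.0])   # where we want predictions (km)
    x_obs = np.array([0.0])                      # a single noisy observation site
    noise_var = 0.01

    # Standard GP conditioning: posterior covariance at the prediction points.
    K_oo = rbf_kernel(x_obs, x_obs) + noise_var * np.eye(1)
    K_so = rbf_kernel(x_star, x_obs)
    K_ss = rbf_kernel(x_star, x_star)
    post_cov = K_ss - K_so @ np.linalg.solve(K_oo, K_so.T)

    for x, s in zip(x_star, np.sqrt(np.diag(post_cov))):
        print(f"x = {x:6.1f} km  posterior std = {s:.3f}  (prior std = 1.000)")
    # Near the observation the uncertainty collapses; 100 km away it is unchanged.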

Different type of cloud problem

The Register: Just loading that huge continuum from storage, and getting it into a computer ‘image’ you can work on, will be a challenge.

Durrant-Whyte: “Yes – not that I know how to do it yet! But we would love to load it as one huge dataset.

“Another problem is that as geoscientists come in with new data, we want to see its effects on the whole model.

“There are a lot of similar problems that could benefit from this – everything in the environmental sciences, for example.

“If I’m trying to infer crop yields from forecasts of rainfall and salinity, that’s a continuum problem. If I’m trying to look at what happens to marine habitats, that’s a continuum problem. A lot of things work in the same domain.”

The Register: So how do you build the thing? How do you assemble the data, the compute power, the storage and the software to solve the problem?

Durrant-Whyte: “My own specialty is in data fusion – that underpins the assimilation part – and there’s a big bag of tools for the non-parametric Bayesian models.

“Those models allow you to build a continuum. A naïve approach might be to make up a grid, and calculate everything in that grid; rather than doing it that way, we want to build [the data] into a continuous function and work on that.

“We have used those techniques before, only not on as big a scale as this.

“The good news is that the properties in the rock are correlated. For example, if you know the fractional porosity of a rock, you can infer its directional stress.

“So far, cloud computing has primarily been used for data storage and visualization more than anything else. Nobody’s really doing these kinds of complex machine learning operations on a cloud yet.

“The kinds of problems we’re dealing with aren’t about I/O, for example – it’s how to put a lot of core machine learning algorithms into a highly distributed environment.

“And that also means communications – within the data centre/s and externally – will be important.”
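His earlier point about correlated rock properties, that knowing the fractional porosity tells you something about the stress, can be illustrated with a joint Gaussian and a single conditioning step. The means, units and the correlation used below are illustrative assumptions only.

    # Conditioning one rock property on another via an assumed joint Gaussian.
    # Means, units and the (negative) correlation are illustrative assumptions.
    import numpy as np

    mu = np.array([0.15, 40.0])          # prior means: porosity fraction, stress in MPa
    cov = np.array([[0.002, -0.25],
                    [-0.25, 100.0]])     # assumed joint covariance

    def condition_on_first(mu, cov, observed):
        """Condition a 2-D Gaussian on an exact observation of its first component."""
        mean_2 = mu[1] + cov[1, 0] / cov[0, 0] * (observed - mu[0])
        var_2 = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
        return mean_2, var_2

    mean_stress, var_stress = condition_on_first(mu, cov, observed=0.20)
    print(f"prior stress          : 40.0 +/- {np.sqrt(cov[1, 1]):.1f} MPa")
    print(f"stress given porosity : {mean_stress:.1f} +/- {np.sqrt(var_stress):.1f} MPa")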

The first step, he said, will be to start with known areas with known geology, so as to test the algorithms against drill data, and “get a sense of the ground truth”. ®