Brit boffins teach mere PCs to find galaxies in Hubble pics
You didn't think we could sift petabytes of astro-data without automation, did you?
As astronomical datasets head towards the petabyte scale, they'll be too big for even crowd-sourced labelling and analysis, so a pair of Brit boffins are working on teaching computers to see and label galaxies in Hubble images.
The pair believe the algorithm they've created (a combination of a “growing neural gas”, or GNG, and hierarchical clustering) is general enough that it could also be useful in medical imaging and in analysing satellite images of Earth.
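For the curious, here is a rough idea of how those two pieces can fit together. This is a minimal pure-Python sketch, not the authors' code: a textbook-style growing neural gas learns a small set of prototype vectors from unlabelled samples, a simple average-linkage agglomerative step groups those prototypes, and a new sample takes the label of its nearest prototype. The function names, parameter values and the toy two-dimensional "feature vectors" are all assumptions made for illustration; the feature extraction the researchers apply to Hubble pixels is not reproduced here.

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_gng(samples, max_nodes=20, steps=4000, insert_every=100,
              eps_winner=0.05, eps_neighbour=0.006, max_age=50,
              alpha=0.5, decay=0.995, seed=0):
    """Fit a growing neural gas; returns the learned prototype vectors."""
    rng = random.Random(seed)
    nodes = {0: list(rng.choice(samples)), 1: list(rng.choice(samples))}
    error = {0: 0.0, 1: 0.0}
    edges = {}                      # frozenset({i, j}) -> edge age
    next_id = 2
    for step in range(1, steps + 1):
        x = rng.choice(samples)
        ranked = sorted(nodes, key=lambda i: dist2(nodes[i], x))
        s1, s2 = ranked[0], ranked[1]           # winner and runner-up
        error[s1] += dist2(nodes[s1], x)
        nodes[s1] = [w + eps_winner * (xi - w) for w, xi in zip(nodes[s1], x)]
        for e in list(edges):                   # age winner's edges, nudge neighbours
            if s1 in e:
                edges[e] += 1
                (n,) = e - {s1}
                nodes[n] = [w + eps_neighbour * (xi - w)
                            for w, xi in zip(nodes[n], x)]
        edges[frozenset((s1, s2))] = 0          # create or refresh winner/runner-up edge
        edges = {e: age for e, age in edges.items() if age <= max_age}
        connected = set().union(*edges)
        for i in list(nodes):                   # drop nodes left without edges
            if i not in connected:
                del nodes[i]
                del error[i]
        if step % insert_every == 0 and len(nodes) < max_nodes:
            q = max(error, key=error.get)       # node with the largest error
            nbrs = [next(iter(e - {q})) for e in edges if q in e]
            f = max(nbrs, key=error.get)        # its worst neighbour
            nodes[next_id] = [(a + b) / 2 for a, b in zip(nodes[q], nodes[f])]
            del edges[frozenset((q, f))]
            edges[frozenset((q, next_id))] = 0
            edges[frozenset((f, next_id))] = 0
            error[q] *= alpha
            error[f] *= alpha
            error[next_id] = error[q]
            next_id += 1
        for i in error:                         # decay all accumulated errors
            error[i] *= decay
    return list(nodes.values())

def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(math.sqrt(dist2(vectors[i], vectors[j])) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] += clusters.pop(b)          # b > a, so index a stays valid
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def classify(x, nodes, labels):
    """Label a sample with the cluster of its nearest prototype."""
    nearest = min(range(len(nodes)), key=lambda i: dist2(nodes[i], x))
    return labels[nearest]

# Toy data: two well-separated blobs standing in for two kinds of image feature.
_rng = random.Random(1)
blob_a = [(_rng.gauss(0, 0.5), _rng.gauss(0, 0.5)) for _ in range(200)]
blob_b = [(_rng.gauss(10, 0.5), _rng.gauss(10, 0.5)) for _ in range(200)]
nodes = train_gng(blob_a + blob_b)
labels = agglomerate(nodes, n_clusters=2)
```

Nothing in the training step ever sees a label: the two groups emerge from the data, and `classify((0, 0), nodes, labels)` simply reports which group a previously unseen point falls into. The group numbers themselves are arbitrary, so a human still has to decide which one to call "early" and which "late".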
While machine learning is already applied to astronomical images, the authors explain that it's a supervised learning environment: humans have to give the machines guidance by pre-labelling the images they're sending for classification.
The aim of the work (available on arXiv), led by the University of Hertfordshire's Alex Hocking, is to take out that supervision, letting the computer start with a raw image and identify (in this case) possible galaxies without help.
“Unsupervised machine learning techniques have not been fully applied to astronomical image (pixel) data directly with a goal of performing automatic classification and labelling”, they say.
Their aim is to create an algorithm which, once trained, “can be presented with images it has never ‘seen’ before and provide consistent categorisation of features”.
To test their idea, they used Hubble images of large galactic clusters containing “a mixture of galaxy types” that humans can identify and label quite easily – in particular, early and late galaxy types.
The training image Hocking's group used
Even on an Intel desktop – Core i7-3770T 2.50GHz with 8GB RAM, along with a Tesla K40 GPU – training using the image above needed just 39.5 milliseconds per pixel, and testing against a new image (below) took 1.7 milliseconds per pixel.
Hocking's group says their algorithm correctly identified early and late galaxies
That's not good enough, the authors concede. At 1.7 milliseconds per pixel, a one-megapixel image would take roughly 28 minutes to classify on that desktop. But applying the algorithm to big datasets would happen on supercomputers (they're porting the software to Nvidia's CUDA platform), and “the compute time for classification scales linearly with the number of pixels for a given model”. ®