Wanna gobble Google's custom chips? Now you can – its Cloud TPUs at $6.50 an hour
TensorFlow math accelerators available to rent
Google’s Cloud Tensor Processing Unit (TPU) chips are now available to rent on its cloud platform as a beta-grade service.
The Cloud TPU – formerly known as the TPU2 – is the second generation of Google’s homegrown math accelerators geared towards AI and machine-learning workloads that rely on TensorFlow. While the first generation could only handle inference, that is, running already-trained neural networks, the Cloud TPU can handle both training and inference.
John Barrus, product manager for Cloud TPUs, and Zak Stone, product manager for TensorFlow and Cloud TPUs, said earlier today the chips are “available in limited quantities today and usage is billed by the second at the rate of $6.50 USD / Cloud TPU / hour.”
Six and a half bucks an hour may seem steep, but consider that Amazon's AWS charges up to roughly $24 an hour for a top-end GPU-powered virtual instance for machine-learning tasks.
It typically takes several hours or a few days to teach AI models new tricks, depending on how complex they are. Google claims it requires only a single Cloud TPU running for less than one day, and costing under $200, to train ResNet-50, an old but popular convolutional neural network developed by Microsoft, to an accuracy comparable to the benchmarks set in the ImageNet challenges.
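For a rough sanity check on that sub-$200 figure, here is a minimal back-of-the-envelope sketch using the quoted $6.50 hourly rate and per-second billing. The function name and the flat-rate assumption (no storage, networking, or VM charges) are ours, not Google's:

```python
# Back-of-the-envelope Cloud TPU rental cost, assuming a flat
# $6.50 per device per hour, billed by the second, and nothing else.
RATE_USD_PER_HOUR = 6.50  # beta price quoted by Google


def cloud_tpu_cost(seconds, devices=1):
    """Return the rental cost in USD (hypothetical helper, flat rate only)."""
    return seconds * devices * RATE_USD_PER_HOUR / 3600


full_day = cloud_tpu_cost(24 * 3600)        # one device for 24 hours
hours_for_200 = 200 / RATE_USD_PER_HOUR     # how long $200 lasts

print(f"One Cloud TPU for 24 hours: ${full_day:.2f}")
print(f"$200 buys about {hours_for_200:.1f} device-hours")
```

On those assumptions, a full day on one device comes in under the $200 mark, which squares with the claim.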
Crucially, the TPU2 is optimized for code running on TensorFlow, so, in the ResNet example, make sure you follow this tutorial.
Last year, Jeff Dean, a Google senior fellow leading the Google Brain project – the advertising giant's internal deep-learning research team – revealed that each Cloud TPU is made up of four ASICs. Each chip contains:
- Two processing cores, each with a 128x128 matrix multiply unit (MXU) and 8GB of high-bandwidth memory, adding up to 64GB of HBM and 600GBps of memory bandwidth in one four-chip device.
- 32-bit floating-point precision math units for scalars and vectors, and 32-bit floating-point-precision matrix multiplication units with reduced precision for multipliers.
- Some 45 TFLOPS of max performance, adding up to 180 TFLOPS for one four-chip device.
A single TPU2 device with four processor chips ... The red box highlights one chip. Image credit: Google
Google also plans to rent out pods of Cloud TPUs later this year. A pod crams 64 devices together to create "machine learning supercomputers" that perform up to 11.5 PFLOPS using 4TB of high-bandwidth memory.
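The device and pod totals are simple multiples of the per-chip specs. A quick sketch of the arithmetic, taking the quoted peak numbers at face value and assuming 1TB = 1,024GB (the constant names are ours):

```python
# Roll the quoted per-core and per-chip figures up to device and pod level.
# These are peak, on-paper numbers; real workloads will see less.
CORES_PER_CHIP = 2
HBM_GB_PER_CORE = 8
TFLOPS_PER_CHIP = 45
CHIPS_PER_DEVICE = 4
DEVICES_PER_POD = 64

device_hbm_gb = CHIPS_PER_DEVICE * CORES_PER_CHIP * HBM_GB_PER_CORE
device_tflops = CHIPS_PER_DEVICE * TFLOPS_PER_CHIP

pod_pflops = DEVICES_PER_POD * device_tflops / 1000   # TFLOPS -> PFLOPS
pod_hbm_tb = DEVICES_PER_POD * device_hbm_gb / 1024   # GB -> TB

print(f"Per device: {device_hbm_gb}GB HBM, {device_tflops} TFLOPS peak")
print(f"Per pod: {pod_hbm_tb:.0f}TB HBM, {pod_pflops:.2f} PFLOPS peak")
```

That works out to 64GB and 180 TFLOPS per device, and roughly 11.5 PFLOPS and 4TB across a 64-device pod.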