Microsoft to world: We've got open source machine learning too
Help teach Cortana to say 'Sorry, Dave'
Microsoft's decided that it, too, wants to open source some of its machine learning code, publishing its Distributed Machine Learning Toolkit (DMTK) on GitHub.
Google released some of its code last week. Redmond's (coincidental?) response is pretty basic: there's a framework and two algorithms, but Microsoft Research promises it will be extended in the future.
The DMTK Framework is front-and-centre, since that's where future extensions will happen. It's a two-piece critter, consisting of a parameter server and a client SDK.
The parameter server has “separate data structures for high- and low-frequency parameters”, Microsoft says, so as to balance memory capacity and access speed. It aggregates updates from local workers, and supports different model synchronisation mechanisms, including bulk synchronous parallel (BSP), asynchronous parallel (ASP), and stale synchronous parallel (SSP) “in a unified manner”.
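For the curious, one common way to think about those three synchronisation schemes as a single mechanism is a bound on how far a worker may run ahead of the slowest worker: zero for BSP, a fixed number of iterations for SSP, and no limit for ASP. The Python sketch below illustrates that idea only. The names `staleness_bound` and `may_proceed` and the default staleness of 2 are our own assumptions, not part of DMTK's API.

```python
import math


def staleness_bound(mode, ssp_staleness=2):
    """Return how many iterations a worker may lead the slowest worker.

    Hypothetical helper for illustration; not DMTK code.
    """
    if mode == "BSP":   # bulk synchronous parallel: everyone in lock-step
        return 0
    if mode == "SSP":   # stale synchronous parallel: bounded lead
        return ssp_staleness
    if mode == "ASP":   # asynchronous parallel: no waiting at all
        return math.inf
    raise ValueError(f"unknown synchronisation mode: {mode}")


def may_proceed(worker_clock, all_clocks, mode, ssp_staleness=2):
    """True if a worker at worker_clock may start its next iteration."""
    lead = worker_clock - min(all_clocks)
    return lead <= staleness_bound(mode, ssp_staleness)
```

With worker clocks of 5, 3 and 4, the worker at iteration 5 must wait under BSP, may continue under SSP with a staleness of 2, and always continues under ASP.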
The client SDK provides:
- A local model cache – designed to reduce communication workloads by synching with the parameter server only when needed;
- A pipeline between local training and model communication; and
- Round-robin scheduling of big model training, which the project's site explains “allows each worker machine to pull the sub-models as needed from the parameter server, resulting in a frugal use of limited memory capacity and network bandwidth to support very big models.”
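To make the cache and scheduling ideas concrete, here is a minimal Python sketch under our own assumptions: a toy in-memory parameter server, a local cache that batches updates and only pushes them every few steps, and a helper that splits a model's keys into sub-models to be visited in turn. Every class and function name here (`ToyParameterServer`, `LocalModelCache`, `round_robin_slices`) is hypothetical, and DMTK's real SDK will differ.

```python
from collections import defaultdict


class ToyParameterServer:
    """Stand-in for a parameter server: sums the deltas it receives."""

    def __init__(self):
        self.params = defaultdict(float)
        self.push_count = 0  # how many network round-trips we "paid" for

    def push(self, deltas):
        self.push_count += 1
        for key, delta in deltas.items():
            self.params[key] += delta

    def pull(self, keys):
        return {key: self.params[key] for key in keys}


class LocalModelCache:
    """Accumulates updates locally and syncs only every sync_every updates."""

    def __init__(self, server, sync_every=3):
        self.server = server
        self.sync_every = sync_every
        self.pending = defaultdict(float)
        self.updates_since_sync = 0

    def update(self, key, delta):
        self.pending[key] += delta
        self.updates_since_sync += 1
        if self.updates_since_sync >= self.sync_every:
            self.flush()

    def flush(self):
        if self.pending:
            self.server.push(dict(self.pending))
        self.pending.clear()
        self.updates_since_sync = 0


def round_robin_slices(keys, num_slices):
    """Split parameter keys into sub-models a worker can pull one at a time."""
    return [keys[i::num_slices] for i in range(num_slices)]
```

Seven local updates with `sync_every=3` cost two pushes rather than seven, and a worker iterating over `round_robin_slices(keys, n)` only needs to hold one slice of the model in memory at a time.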
The first algorithms in the DMTK are LightLDA, described as a “highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers”; and distributed word embedding algorithms that can “train 300-d word embedding vectors for a 10-million-word vocabulary, on a document collection with over 100-billion tokens, on a cluster of just 8 machines”.
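Why would the cost per sample not depend on model size? A Metropolis-Hastings step only has to compare two candidate values, however many alternatives exist. The Python sketch below shows a generic Metropolis-Hastings sampler over a small discrete distribution with a uniform proposal. It is not LightLDA's algorithm, whose proposals are more elaborate, and the function names and weights are illustrative assumptions.

```python
import random


def mh_step(current, weights, rng):
    """One Metropolis-Hastings step with a uniform (symmetric) proposal.

    Cost is constant: one proposal draw and one ratio of two weights,
    regardless of how many states there are. Weights must be positive.
    """
    proposal = rng.randrange(len(weights))
    accept_prob = min(1.0, weights[proposal] / weights[current])
    return proposal if rng.random() < accept_prob else current


def sample_frequencies(weights, steps, seed=0):
    """Run the chain and return the fraction of steps spent in each state."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    state = 0
    for _ in range(steps):
        state = mh_step(state, weights, rng)
        counts[state] += 1
    return [count / steps for count in counts]
```

Run for long enough, the visit frequencies approach the normalised weights, even though no step ever sums over all the states.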
Or, scaling up, Microsoft explains in this blog post:
“One can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines.”
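Some back-of-the-envelope arithmetic shows why those numbers are striking. The Python sketch below counts the entries in a dense topics-by-vocabulary (or dimensions-by-vocabulary) table and sizes it at four bytes per entry. Both the dense layout and the four-byte figure are our assumptions for illustration; Microsoft does not say how the models are stored, and sparse storage would be far smaller.

```python
def dense_parameter_count(rows, vocabulary_size):
    """Number of entries in a dense rows x vocabulary table."""
    return rows * vocabulary_size


def dense_size_bytes(rows, vocabulary_size, bytes_per_param=4):
    """Size of that table, assuming a fixed width per entry (assumed 4 bytes)."""
    return dense_parameter_count(rows, vocabulary_size) * bytes_per_param


# Figures quoted by Microsoft: 1 million topics, 1000 dimensions, 20 million words.
topic_model_bytes = dense_size_bytes(1_000_000, 20_000_000)
embedding_bytes = dense_size_bytes(1_000, 20_000_000)
```

Under those assumptions the topic model has 20 trillion entries, or 80TB if stored densely, which is about 3.3TB for each of 24 machines. The word-embedding model is a more modest 20 billion entries, or 80GB.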