Addressing challenges in statistical machine learning to make data useful

We are undoubtedly living in the Age of Data, where data is power and currency. The massive and complex datasets being collected via the Internet and in domains such as genomics, astronomy, physics, and finance hold the promise of many transformational applications, and machine learning techniques are seemingly poised to deliver on this promise. However, concerns over scalability and ease of use present major roadblocks to the wider adoption of these statistical methods. For instance, modern genomic sequencing technologies can rapidly generate hundreds of gigabytes of raw data for a single individual, yet the cost of analyzing this data is emerging as a major obstacle to achieving the goal of personalized medicine. As a second example, measuring customer activity is no longer a practice limited to pioneering companies like Google and Netflix, yet many organizations lack the statistical expertise required to make actionable decisions from their data.

Dr. Ameet Talwalkar, Assistant Professor of Computer Science at the University of California, Los Angeles, addresses scalability and ease-of-use issues like these, and ultimately aims to connect advances in machine learning back to real problems in science and technology. Having led the initial development of MLlib, a leading distributed machine learning library that is part of the Apache Spark project, and having co-authored a graduate-level textbook, Foundations of Machine Learning, Dr. Talwalkar is fully equipped with both the knowledge and the expertise to tackle these challenges. He continues to educate researchers and practitioners through his massive open online course (MOOC) on edX as well as other venues.

Current research projects include:

  • Scalability: Data analyses that are feasible on modest-sized datasets are often entirely infeasible on the terabyte- and petabyte-scale datasets that are quickly becoming the norm. Over the past decade, Dr. Talwalkar has devised and analyzed scalable learning algorithms that leverage distributed (e.g., cloud-based) computing architectures, focusing on problems including collaborative filtering for recommender systems, estimating the uncertainty or quality of learned models, and facilitating the processing of genomic sequencing data. He often employs a divide-and-conquer strategy, building upon existing base algorithms that have proven their value at smaller scales but executing them on subsamples of the data, thus yielding efficient approximation algorithms (a minimal illustration of this pattern appears after this list). Currently, he is working on scalable approximation methods for learning algorithms such as deep learning and random forests, two state-of-the-art approaches to classification and regression with a wide range of applications, including image classification, protein function prediction, and speech recognition.

  • Ease-of-use: Successfully deploying data processing pipelines is currently a highly manual process, fraught with pitfalls for domain experts who lack strong statistical backgrounds. Preprocessing the input data, selecting an appropriate learning model, and tuning the various knobs of that model can be an ad hoc and time-consuming task. To address these issues, Dr. Talwalkar is developing principled tools to automatically tune components of typical machine learning pipelines (see the second sketch after this list). The long-term vision is to build a system called MLbase that leverages these tools to automatically construct machine learning pipelines. Through its abstractions, MLbase will enable end users to issue high-level queries and reason about predicted attributes without the burden of understanding the low-level details.
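
The following is a minimal sketch of the divide-and-conquer pattern described under Scalability: a base learner that works well at small scale is fit independently on disjoint subsamples of the data, and the resulting local models are combined by averaging their predictions. The choice of scikit-learn, the Ridge base learner, and the synthetic data are illustrative assumptions rather than Dr. Talwalkar's actual implementation; in a genuinely distributed setting, each partition would reside on a separate worker.

    # Illustrative divide-and-conquer approximation: fit a proven base learner
    # on disjoint subsamples of the data, then combine the local models.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 20))                             # synthetic features
    y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10000)   # synthetic labels

    def fit_on_partitions(X, y, n_partitions=10):
        """Fit one base model per disjoint data partition (the 'divide' step)."""
        models = []
        for part in np.array_split(rng.permutation(len(X)), n_partitions):
            models.append(Ridge(alpha=1.0).fit(X[part], y[part]))
        return models

    def predict_average(models, X_new):
        """Average the local models' predictions (the 'combine' step)."""
        return np.mean([m.predict(X_new) for m in models], axis=0)

    models = fit_on_partitions(X, y)
    print(predict_average(models, X[:5]))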
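
Similarly, the second sketch below illustrates the kind of automated pipeline tuning described under Ease-of-use, using scikit-learn's Pipeline and GridSearchCV purely as stand-ins: it searches jointly over a preprocessing choice and a model hyperparameter rather than tuning them by hand. It is not MLbase itself, only an illustration of the underlying idea, and the dataset and parameter grid are illustrative assumptions.

    # Illustrative automated tuning of a two-stage pipeline: the search space
    # covers both a preprocessing knob and a model hyperparameter.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    pipeline = Pipeline([
        ("scale", StandardScaler()),                   # preprocessing step
        ("clf", LogisticRegression(max_iter=1000)),    # model step
    ])

    param_grid = {
        "scale__with_mean": [True, False],   # preprocessing knob
        "clf__C": [0.01, 0.1, 1.0, 10.0],    # regularization strength
    }

    # Cross-validated search over the joint space of pipeline settings.
    search = GridSearchCV(pipeline, param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)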

Dr. Ameet Talwalkar has always been drawn to mathematics while also being pragmatic in nature. He thus gravitated naturally to the field of machine learning, as it is inspired by practical applications yet grounded in deep mathematical principles. He remarks, “As we enter the ‘age of data,’ statistical methods can potentially have a large societal impact by harnessing the power of massive and complex datasets.” Through his research and teaching activities, Dr. Talwalkar strives to promote the widespread use of machine learning techniques among a broad group of people, and to enable these techniques to “thrive in the wild.”

Beyond the theoretical ideas presented in his publications, Dr. Talwalkar’s work stands to impact data-intensive fields across science and technology through software artifacts and educational outreach. In terms of software, he led the initial development of MLlib, a machine learning library that is part of Apache Spark, a leading cluster computing engine. MLlib is quickly becoming a leading open-source distributed machine learning library, as it is easy to use and provides state-of-the-art methods for common predictive analytics use cases. Many of its core design decisions have been heavily influenced by his research, and he expects to continue his close involvement with the project moving forward.
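
As a rough illustration of that ease of use, the snippet below fits a classifier with MLlib's DataFrame-based Python API in a few lines. The toy data and parameter settings are the author's own illustrative assumptions, not an example drawn from the project; real workloads would load large datasets from distributed storage.

    # Illustrative use of MLlib's DataFrame-based API from Python (PySpark).
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # A tiny toy DataFrame of (label, features) rows.
    data = spark.createDataFrame(
        [(0.0, Vectors.dense(0.0, 1.1)),
         (1.0, Vectors.dense(2.0, 1.0)),
         (1.0, Vectors.dense(2.2, 1.5)),
         (0.0, Vectors.dense(0.1, 1.3))],
        ["label", "features"],
    )

    # Fit a logistic regression model on the (distributed) DataFrame.
    model = LogisticRegression(maxIter=10).fit(data)
    print(model.coefficients)
    spark.stop()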

On the educational front, he has co-authored a graduate-level textbook on machine learning, Foundations of Machine Learning. The book has been widely cited since its publication and has been adopted as the main textbook for several graduate-level classes at universities around the world. Additionally, his upcoming MOOC on edX about Scalable Machine Learning and the training material he has developed for MLlib are leading resources for researchers and practitioners to learn about techniques for large-scale data analysis and machine learning.

In a past life, Dr. Talwalkar was a competitive ultimate frisbee player, and he still enjoys playing recreationally. He enjoys sports in general, especially biking, tennis, and basketball. He finds fantasy basketball particularly interesting, as it blends his enthusiasm for sports with his passion for statistics and data analytics.

For more information, visit http://cs.ucla.edu/~ameet/

NSF Office of Cyberinfrastructure (OCI) Postdoctoral Fellowship, 2011

Janet Fabri Prize for best doctoral dissertation in NYU’s Computer Science Department, 2011

Henning Biermann Award for exceptional service to NYU’s Computer Science Department, 2008

Yale Computer Science Undergraduate Prize, 2002

Westinghouse Science Talent Search Scholarship, 1998