Abstract
Motivated by models of natural language, we consider estimation and classification in the asymptotic regime where the source alphabet grows linearly with the number of observations. We show that, although the underlying distribution cannot be learned in this regime, it is possible to consistently estimate its entropy, the likelihood of the observed string, the divergence between the true and empirical distributions, and other similar quantities. We also show that consistent classification is possible, although standard tests such as the generalized likelihood ratio test, the chi-squared test, and the Hellinger test are all inconsistent in this regime. In fact, consistent classification is possible if the alphabet grows subquadratically with the number of observations, and it is impossible if the alphabet grows quadratically or faster.
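As a concrete illustration of the regime (a minimal simulation sketch, not taken from the talk), consider n i.i.d. draws from a uniform source whose alphabet size k equals n. The empirical distribution then stays roughly a constant total-variation distance (about 1/e) from the truth, so the distribution itself cannot be learned, while the naive plug-in entropy estimate carries a non-vanishing bias of roughly 0.57 nats, which is why consistent estimation in this regime requires estimators beyond the plug-in. The uniform source and the plug-in estimator here are illustrative choices, not the methods of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in [1_000, 10_000, 100_000]:
    k = n                                 # alphabet grows linearly with sample size
    sample = rng.integers(k, size=n)      # n i.i.d. draws from Uniform{0, ..., k-1}
    counts = np.bincount(sample, minlength=k)
    emp = counts / n                      # empirical distribution

    # Total variation distance to the true (uniform) distribution:
    # stays near 1/e ~ 0.37 as n grows, so the distribution is not learnable.
    tv = 0.5 * np.abs(emp - 1.0 / k).sum()

    # Plug-in entropy estimate vs. true entropy log(k):
    # the bias settles near a constant (~0.57 nats) instead of vanishing.
    nz = emp[emp > 0]
    h_plugin = -(nz * np.log(nz)).sum()
    print(f"n={n:6d}  TV={tv:.3f}  H_true={np.log(k):.3f}  "
          f"H_plugin={h_plugin:.3f}  bias={np.log(k) - h_plugin:.3f}")
```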
This is joint work with Benjamin Kelly, Thitidej Tularak, Pramod Viswanath, and Sanjeev Kulkarni.
Bio:
Prof. Wagner is an Assistant Professor in the School of Electrical and Computer Engineering at Cornell University. During the 2005-2006 academic year, he was a Postdoctoral Research Associate in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign and a Visiting Assistant Professor in the School of Electrical and Computer Engineering at Cornell. He did his graduate work at the University of California, Berkeley. He received his undergraduate degree from the University of Michigan, Ann Arbor.
Prof. Wagner studies problems at the intersection of information theory and other fields, including networking, statistics, queueing theory, security, computational linguistics, and learning. He is particularly interested in network information theory, distributed compression and its application to peer-to-peer networks, secure communication over timing and photonic channels, and communication and classification in learning-limited environments.