Four Tutorial Articles on Text Analysis Worth Reading Carefully

Source: http://jacoxu.com/?p=415

These four articles come up again and again; the original post is at http://blog.sciencenet.cn/blog-611051-535693.html.
They are certainly not the only detailed, in-depth tutorials on text analysis, only the ones I have read so far; recommendations of other good tutorial articles are very welcome.

The first article gives a detailed introduction to parameter estimation for discrete data, rather than working from the Gaussian distribution as most textbooks do. To my mind the part most worth reading is its use of Gibbs sampling for inference in LDA, with very detailed derivations of the relevant formulas; it is required reading for anyone who wants to understand LDA and related topic models. (A minimal code sketch of the sampler follows the entry below.)
@TECHREPORT{Hei09,
author = {Heinrich, Gregor},
title = {Parameter Estimation for Text Analysis},
institution = {vsonix GmbH and University of Leipzig},
year = {2009},
type = {Technical Report Version 2.9},
abstract = {Presents parameter estimation methods common with discrete probability
distributions, which is of particular interest in text modeling.
Starting with maximum likelihood, a posteriori and Bayesian estimation,
central concepts like conjugate distributions and Bayesian networks
are reviewed. As an application, the model of latent Dirichlet allocation
(LDA) is explained in detail with a full derivation of an approximate
inference algorithm based on Gibbs sampling, including a discussion
of Dirichlet hyperparameter estimation.},
}
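
To make the report's centerpiece concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in Python. It is my own illustration, not code from the paper: each token's topic is resampled in proportion to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta), the full conditional Heinrich derives; all variable names and hyperparameter defaults are mine.

import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V).
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # doc-topic counts
    nkw = np.zeros((K, V))   # topic-word counts
    nk = np.zeros(K)         # tokens assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):                        # initial counts
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                               # remove this token
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z=k | rest) proportional to (n_dk+alpha)(n_kw+beta)/(n_k+V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                               # record and add back
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw

After burn-in, the count matrices give point estimates of the document-topic and topic-word distributions, exactly the quantities whose derivation the report walks through.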

The second article: just as its abstract states, it is particularly suited to computer scientists. It involves relatively little mathematics, so it works well for readers who do not care much about mathematical details. "Uninitiated" roughly means layman, which makes Resnik and Hardisty's purpose in writing it easy to see. (A toy sampler in the same spirit follows the entry below.)
@TECHREPORT{RH10,
author = {Resnik, Philip and Hardisty, Eric},
title = {Gibbs Sampling for the Uninitiated},
institution = {University of Maryland},
year = {2010},
type = {Technical Report CS-TR-4956, UMIACS-TR-2010-04, LAMP-153},
abstract = {This document is intended for computer scientists who would like to
try out a Markov Chain Monte Carlo (MCMC) technique, particularly
in order to do inference with Bayesian models on problems related
to text processing. We try to keep theory to the absolute minimum
needed, though we work through the details much more explicitly than
you usually see even in “introductory” explanations. That means we’ve
attempted to be ridiculously explicit in our exposition and notation.

After providing the reasons and reasoning behind Gibbs sampling (and
at least nodding our heads in the direction of theory), we work through
an example application in detail—the derivation of a Gibbs sampler
for a Na\"{i}ve Bayes model. Along with the example, we discuss some
practical implementation issues, including the integrating out of
continuous parameters when possible. We conclude with some pointers
to literature that we’ve found to be somewhat more friendly to uninitiated
readers.

Note: as of June 3, 2010 we have corrected some small errors in the
original April 2010 report.},
keywords = {Gibbs Sampling; Markov Chain Monte Carlo; Na\"{i}ve Bayes; Bayesian
Inference; Tutorial},
url = {http://drum.lib.umd.edu/bitstream/1903/10058/3/gsfu.pdf}
}
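
In the same spirit, here is a toy Gibbs sampler for a two-class Bayesian Naive Bayes model with unobserved labels. One hedge: the report derives a collapsed sampler with the continuous parameters integrated out where possible; this sketch keeps pi and theta explicit and resamples them, the simpler uncollapsed variant, and all names and hyperparameters are illustrative.

import numpy as np

def gibbs_naive_bayes(X, gamma_pi=1.0, gamma_theta=1.0, iters=500, seed=0):
    # X: (N, V) matrix of per-document word counts; class labels are latent.
    rng = np.random.default_rng(seed)
    N, V = X.shape
    L = rng.integers(2, size=N)            # random initial binary labels
    for _ in range(iters):
        # Sample the class prior pi from its Beta posterior.
        pi = rng.beta(gamma_pi + (L == 1).sum(), gamma_pi + (L == 0).sum())
        # Sample each class's word distribution from its Dirichlet posterior.
        theta = np.stack([rng.dirichlet(gamma_theta + X[L == c].sum(axis=0))
                          for c in (0, 1)])
        # Resample every label from its full conditional given pi and theta.
        logp = X @ np.log(theta).T + np.log([1.0 - pi, pi])
        p1 = 1.0 / (1.0 + np.exp(logp[:, 0] - logp[:, 1]))
        L = (rng.random(N) < p1).astype(int)
    return L, pi, theta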

The third article: Knight is a big name whom everyone working in NLP knows well, so he needs no further introduction.
@ELECTRONIC{Kni09,
author = {Knight, Kevin},
title = {Bayesian Inference with Tears: A Tutorial Workbook for Natural Language
Researchers},
url = {http://www.isi.edu/natural-language/people/bayes-with-tears.pdf},
}

The fourth article, co-written by Blei (the father of LDA) and his student Gershman, gives a detailed introduction to Bayesian nonparametric models, and in particular discusses the Chinese Restaurant Process (CRP) and the Indian Buffet Process in a very intuitive way. (A short CRP simulation follows the entry below.)
@ARTICLE{GB11,
author = {Gershman, Samuel J. and Blei, David M.},
title = {A Tutorial on Bayesian Nonparametric Models},
journal = {Journal of Mathematical Psychology},
year = {2011},
abstract = {A key problem in statistical modeling is model selection, that is,
how to choose a model at an appropriate level of complexity. This
problem appears in many settings, most prominently in choosing the
number of clusters in mixture models or the number of factors in
factor analysis. In this tutorial, we describe Bayesian nonparametric
methods, a class of methods that side-steps this issue by allowing
the data to determine the complexity of the model. This tutorial
is a high-level introduction to Bayesian nonparametric methods and
contains several examples of their application.},
keywords = {Bayesian Methods; Chinese Restaurant Process; Indian Buffet Process},
}
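
The CRP intuition the tutorial builds is easy to simulate. Below is a short sketch, my own illustration rather than code from the paper: customer i joins an occupied table k with probability n_k / (i + alpha) and opens a new table with probability alpha / (i + alpha), so the data themselves decide how many clusters appear.

import numpy as np

def sample_crp(n, alpha=1.0, seed=0):
    # Seat n customers; return each customer's table index.
    rng = np.random.default_rng(seed)
    counts = []                       # customers at each occupied table
    assignments = []
    for i in range(n):
        # Existing tables in proportion to size, a new table with weight alpha.
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)          # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(sample_crp(20, alpha=2.0))  # e.g. one random partition of 20 customers

The expected number of occupied tables grows only logarithmically with n, which is how the CRP lets model complexity grow with the data, the model-selection point made in the abstract above.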


