A statistical '''language model''' assigns a [[probability]] to a sequence of ''m'' words <math>P(w_1,\ldots,w_m)</math> by means of a [[probability distribution]].

Language modeling is used in many [[natural language processing]] applications such as [[speech recognition]], [[machine translation]], [[part-of-speech tagging]], [[parsing]] and [[information retrieval]].
In [[speech recognition]] and in [[data compression]], such a model tries to capture the properties of a language and to predict the next word in a speech sequence.

When used in information retrieval, a language model is associated with each [[document]] in a collection. Given a query ''Q'' as input, retrieved documents are ranked by the probability that the document's language model would generate the terms of the query, ''P''(''Q''|''M<sub>d</sub>''). This way of using language models in information retrieval is called the [[query likelihood model]].
In practice, [[unigram]] language models are most commonly used in information retrieval, as they are often sufficient to determine the topic of a piece of text. Unigram models estimate the probability of each word in isolation, without considering any influence from the words before or after it. This leads to the [[bag of words model]] and yields a [[multinomial distribution]] over words.

Estimating the probability of sequences can become difficult in [[Corpus linguistics|corpora]], in which [[phrase]]s or [[Sentence (linguistics)|sentence]]s can be arbitrarily long, so some sequences are never observed during [[training]] of the language model (the [[data sparseness problem]], which leads to [[overfitting]]). For that reason these models are often approximated using smoothed [[N-gram]] models.
== Unigram models ==

A unigram model used in information retrieval can be treated as the combination of several one-state [[Finite-state machine|finite automata]].<ref>Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval, pages 237–240. Cambridge University Press, 2009.</ref> It splits the probabilities of different terms in a context, e.g. from <math>P(t_1t_2t_3)=P(t_1)P(t_2|t_1)P(t_3|t_1t_2)</math> to <math>P_{uni}(t_1t_2t_3)=P(t_1)P(t_2)P(t_3)</math>.

In this model, the probability of each word depends only on the word itself, so the units are one-state finite automata. Each automaton has a single state that is reached with a single associated probability, and across the whole model these probabilities sum to 1. The following is an illustration of a unigram model of a document.
{| class="wikitable"
|-
! Terms !! Probability in doc
|-
| a || 0.1
|-
| world || 0.2
|-
| likes || 0.05
|-
| we || 0.05
|-
| share || 0.3
|-
| ... || ...
|}
<math>\sum_{term\ in\ doc} P(term) = 1</math>

The probability generated for a specific query is calculated as

<math>P(query) = \prod_{term\ in\ query} P(term)</math>

Each document has its own unigram model, with its own term probabilities, so different documents assign different probabilities to the same query. Documents can therefore be ranked for a query according to the probabilities their models generate for it. The following is an example of two unigram models of two documents.
{| class="wikitable"
|-
! Terms !! Probability in Doc1 !! Probability in Doc2
|-
| a || 0.1 || 0.3
|-
| world || 0.2 || 0.1
|-
| likes || 0.05 || 0.03
|-
| we || 0.05 || 0.02
|-
| share || 0.3 || 0.2
|-
| ... || ... || ...
|}
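Using the probabilities in the table above, a short fragment can illustrate how the two documents would be ranked for a query; the query here is chosen purely for illustration.

<syntaxhighlight lang="python">
# Unigram models taken from the table above: term -> probability in the document.
doc1 = {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3}
doc2 = {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2}

def query_prob(query_terms, model):
    """P(query) under a unigram model: the product of the individual term probabilities."""
    p = 1.0
    for term in query_terms:
        p *= model[term]   # assumes every query term occurs in the model (see smoothing below)
    return p

query = ["a", "world", "likes"]
print(query_prob(query, doc1))   # 0.1 * 0.2 * 0.05 = 0.001
print(query_prob(query, doc2))   # 0.3 * 0.1 * 0.03 = 0.0009, so Doc1 is ranked above Doc2
</syntaxhighlight>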
In information retrieval contexts, unigram language models are often smoothed to avoid instances where <math>P(term) = 0</math>. A common approach is to generate a maximum-likelihood model for the entire collection and [[Linear interpolation|linearly interpolate]] the collection model with a maximum-likelihood model for each document to create a smoothed document model.<ref>Buttcher, Clarke, and Cormack. Information Retrieval: Implementing and Evaluating Search Engines. pp. 289–291. MIT Press.</ref>
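A minimal sketch of this interpolation (commonly known as Jelinek-Mercer smoothing), assuming that maximum-likelihood term probabilities for the document and for the whole collection have already been computed; the mixing weight and the example values below are arbitrary.

<syntaxhighlight lang="python">
def smoothed_doc_prob(term, p_doc, p_collection, lambda_=0.5):
    """Linearly interpolate the document model with the collection model so that
    terms missing from the document still receive a non-zero probability."""
    return lambda_ * p_doc.get(term, 0.0) + (1 - lambda_) * p_collection.get(term, 0.0)

# Illustrative maximum-likelihood estimates (made-up values).
p_doc = {"we": 0.05, "share": 0.3}
p_collection = {"we": 0.02, "share": 0.1, "likes": 0.04}
print(smoothed_doc_prob("likes", p_doc, p_collection))   # 0.5 * 0.0 + 0.5 * 0.04 = 0.02
</syntaxhighlight>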
== N-gram models ==

{{Main|N-gram}}

In an n-gram model, the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence w<sub>1</sub>,...,w<sub>m</sub> is approximated as
<math>
P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i|w_1,\ldots,w_{i-1})
 \approx \prod^m_{i=1} P(w_i|w_{i-(n-1)},\ldots,w_{i-1})
</math>

Here, it is assumed that the probability of observing the ''i''<sup>th</sup> word ''w<sub>i</sub>'' in the context history of the preceding ''i-1'' words can be approximated by the probability of observing it in the shortened context history of the preceding ''n-1'' words (''n''<sup>th</sup>-order [[Markov property]]).
The conditional probability can be calculated from n-gram frequency counts:

<math>
P(w_i|w_{i-(n-1)},\ldots,w_{i-1}) = \frac{count(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{count(w_{i-(n-1)},\ldots,w_{i-1})}
</math>

The terms '''bigram''' and '''trigram''' language model denote n-gram language models with ''n'' = 2 and ''n'' = 3, respectively.<ref>Craig Trim, [http://trimc-nlp.blogspot.com/2013/04/language-modeling.html ''What is Language Modeling?''], April 26, 2013.</ref>
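The following Python fragment sketches this maximum-likelihood estimate for a bigram model: it counts bigrams in a toy corpus and divides by the count of the preceding word. The corpus and function name are purely illustrative.

<syntaxhighlight lang="python">
from collections import Counter

def bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency, as in the formula above."""
    bigram_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]      # sentence boundary markers (see the example below)
        for prev, cur in zip(words, words[1:]):
            bigram_counts[(prev, cur)] += 1
            history_counts[prev] += 1
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}

corpus = [["I", "saw", "the", "red", "house"],
          ["I", "saw", "the", "dog"]]
probs = bigram_mle(corpus)
print(probs[("the", "red")])   # count(the, red) / count(the) = 1 / 2 = 0.5
</syntaxhighlight>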
Typically, however, the n-gram probabilities are not derived directly from the frequency counts, because models derived this way have severe problems when confronted with any n-grams that have not explicitly been seen before. Instead, some form of ''smoothing'' is necessary, assigning some of the total probability mass to unseen words or n-grams. Various methods are used, from simple "add-one" smoothing (assign a count of 1 to unseen n-grams) to more sophisticated models, such as [[Good-Turing discounting]] or [[Katz's back-off model|back-off model]]s.
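As a sketch of the simplest case, add-one (Laplace) smoothing adds one pseudo-count to every possible bigram; the vocabulary-size term in the denominator is what reserves probability mass for unseen events. The counts below are made up for illustration.

<syntaxhighlight lang="python">
def add_one_prob(prev, cur, bigram_counts, history_counts, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(cur | prev)."""
    return (bigram_counts.get((prev, cur), 0) + 1) / (history_counts.get(prev, 0) + vocab_size)

# Toy counts: "the" seen twice as a history, "the red" seen once, vocabulary of 6 word types.
bigram_counts = {("the", "red"): 1, ("the", "dog"): 1}
history_counts = {"the": 2}
print(add_one_prob("the", "red", bigram_counts, history_counts, vocab_size=6))    # (1+1)/(2+6) = 0.25
print(add_one_prob("the", "house", bigram_counts, history_counts, vocab_size=6))  # (0+1)/(2+6) = 0.125
</syntaxhighlight>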
=== Example ===

In a bigram (''n'' = 2) language model, the probability of the sentence ''I saw the red house'' is approximated as
<math>
P(I,saw,the,red,house) \approx P(I|\langle s\rangle) P(saw|I) P(the|saw) P(red|the) P(house|red) P(\langle /s\rangle|house)
</math>

whereas in a trigram (''n'' = 3) language model, the approximation is

<math>
P(I,saw,the,red,house) \approx P(I|\langle s\rangle,\langle s\rangle) P(saw|\langle s\rangle,I) P(the|I,saw) P(red|saw,the) P(house|the,red) P(\langle /s\rangle|red,house)
</math>

Note that the context of the first <math>n-1</math> n-grams is filled with start-of-sentence markers, typically denoted <nowiki><s></nowiki>.

Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence ''*I saw the'' would always be higher than that of the longer sentence ''I saw the red house''.
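A short sketch of how a sentence could be scored under such a bigram model, with explicit <nowiki><s></nowiki> and <nowiki></s></nowiki> padding; the probabilities in the dictionary are invented solely to illustrate the two points above.

<syntaxhighlight lang="python">
import math

def sentence_log_prob(words, bigram_prob, unseen=1e-6):
    """Bigram log-probability of a sentence padded with <s> and </s> markers."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob.get((prev, cur), unseen))
               for prev, cur in zip(padded, padded[1:]))

# Invented bigram probabilities, for illustration only.
bigram_prob = {("<s>", "I"): 0.4, ("I", "saw"): 0.3, ("saw", "the"): 0.5,
               ("the", "red"): 0.1, ("red", "house"): 0.6, ("house", "</s>"): 0.7,
               ("the", "</s>"): 0.01}

# With the end-of-sentence marker, the full sentence comes out more probable here
# than the ungrammatical fragment, which ends with the unlikely ("the", "</s>") step.
print(sentence_log_prob(["I", "saw", "the", "red", "house"], bigram_prob))
print(sentence_log_prob(["I", "saw", "the"], bigram_prob))
</syntaxhighlight>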
== Other models ==

A [[positional language model]]<ref>Yuanhua Lv and ChengXiang Zhai, [http://times.cs.uiuc.edu/czhai/pub/sigir09-PLM.pdf ''Positional Language Models for Information Retrieval''], in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR), 2009.</ref> is one that describes the probability of given words occurring close to one another in a text, not necessarily immediately adjacent. Similarly, bag-of-concepts models<ref>E. Cambria and A. Hussain. Sentic Computing: Techniques, Tools, and Applications. Dordrecht, Netherlands: Springer, ISBN 978-94-007-5069-2 (2012).</ref> exploit the semantics associated with multi-word expressions such as ''buy_christmas_present'', even when they are used in information-rich sentences like "today I bought a lot of very nice Christmas presents".
== See also ==
* [[Factored language model]]
* [[Cache language model]]
* [[Katz's back-off model]]

== References ==
{{reflist}}
=== Further reading ===
{{refbegin}}
* {{cite conference | author=J M Ponte and W B Croft | id={{citeseerx|10.1.1.117.4237}} | title=A Language Modeling Approach to Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1998 | pages=275–281}}
* {{cite conference | author=F Song and W B Croft | id={{citeseerx|10.1.1.21.6467}} | title=A General Language Model for Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1999 | pages=279–280}}
* {{cite techreport |first=Stanley |last=Chen |author2=Joshua Goodman |title=An Empirical Study of Smoothing Techniques for Language Modeling |number=10-98 |institution=Harvard University |year=1998 |url=https://research.microsoft.com/~joshuago/tr-10-98.pdf |id={{citeseerx|10.1.1.131.5458}} }}
{{refend}}
== External links ==
* [https://lmsharp.codeplex.com/ LMSharp] - Free language model toolkit for Kneser-Ney smoothed n-gram models and recurrent neural network models
* [https://github.com/jnory/DALM DALM] - Fast, free software for language model queries
* [http://sourceforge.net/projects/irstlm IRSTLM] - Free software for language modeling
* [http://kheafield.com/code/kenlm KenLM] - Fast, free software for language modeling
* [https://code.google.com/p/mitlm MITLM] - MIT Language Modeling toolkit; free software
* [http://openfst.cs.nyu.edu/twiki/bin/view/GRM/NGramLibrary OpenGrm NGram] library - Free software for language modeling, built on [[OpenFst]]
* [http://sifaka.cs.uiuc.edu/~ylv2/pub/plm/plm.htm Positional Language Model]
* [http://sourceforge.net/projects/randlm RandLM] - Free software for randomised language modeling
* [http://www-speech.sri.com/projects/srilm SRILM] - Proprietary software for language modeling
* [http://vsiivola.github.io/variKN VariKN] - Free software for creating, growing and pruning Kneser-Ney smoothed n-gram models
* [http://www.keithv.com/software/csr/ Language models trained on newswire data]
[[Category:Statistical natural language processing]]
[[Category:Markov models]]