{{machine learning bar}}
{{About|the machine learning technique|other kinds of random tree|Random tree (disambiguation)}}
{{merge from|Random naive Bayes|discuss=Talk:Random forest#Proposed merge with Random naive Bayes|date=December 2013}}
'''Random forests''' are an [[ensemble learning]] method for [[statistical classification|classification]] (and [[regression analysis|regression]]) that operates by constructing a multitude of [[decision tree learning|decision trees]] at training time and outputting the class that is the [[mode (statistics)|mode]] of the classes output by the individual trees. The algorithm for inducing a random forest was developed by [[Leo Breiman]]<ref name="breiman2001">{{cite journal |first=Leo |last=Breiman |authorlink=Leo Breiman |title=Random Forests |journal=[[Machine Learning (journal)|Machine Learning]] |year=2001 |volume=45 |issue=1 |pages=5–32 |doi=10.1023/A:1010933404324}}</ref> and Adele Cutler,<ref name="rpackage"/> and "Random Forests" is their [[trademark]]. The term came from '''random decision forests''', first proposed by Tin Kam Ho of [[Bell Labs]] in 1995. The method combines Breiman's "[[Bootstrap aggregating|bagging]]" idea with the random selection of features, introduced independently by Ho<ref>{{cite conference |first=Tin Kam |last=Ho |title=Random Decision Forest |conference=Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995 |year=1995 |pages=278–282 |url=http://cm.bell-labs.com/cm/cs/who/tkh/papers/odt.pdf}}</ref><ref name="ho1998">{{cite journal |first=Tin Kam |last=Ho |title=The Random Subspace Method for Constructing Decision Forests |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |year=1998 |volume=20 |issue=8 |pages=832–844 |doi=10.1109/34.709601 |url=http://cm.bell-labs.com/cm/cs/who/tkh/papers/df.pdf}}</ref> and by Amit and [[Donald Geman|Geman]],<ref name="amitgeman1997">{{cite journal |last=Amit |first=Yali |last2=Geman |first2=Donald |authorlink2=Donald Geman |title=Shape quantization and recognition with randomized trees |journal=[[Neural Computation]] |year=1997 |volume=9 |issue=7 |pages=1545–1588 |doi=10.1162/neco.1997.9.7.1545 |url=http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf}}</ref> in order to construct a collection of decision trees with controlled variation.

The selection of a random subset of features is an example of the [[random subspace method]], which, in Ho's formulation, is a way to implement [[stochastic discrimination]]<ref>{{cite journal |first=Eugene |last=Kleinberg |title=An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition |journal=[[Annals of Statistics]] |year=1996 |volume=24 |issue=6 |pages=2319–2349 |url=http://kappa.math.buffalo.edu/aos.pdf |doi=10.1214/aos/1032181157 |mr=1425956}}</ref> as proposed by Eugene Kleinberg.
== History ==

The early development of random forests was influenced by the work of Amit and Geman,<ref name="amitgeman1997"/> which introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single [[Decision tree|tree]]. The idea of random subspace selection from Ho<ref name="ho1998"/> was also influential in the design of random forests. In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen [[Linear subspace|subspace]] before fitting each tree. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Dietterich.<ref>{{cite journal |first=Thomas |last=Dietterich |title=An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization |journal=[[Machine Learning (journal)|Machine Learning]] |year=2000 |pages=139–157}}</ref>

The introduction of random forests proper was first made in a paper by [[Leo Breiman]].<ref name="breiman2001"/> This paper describes a method of building a forest of uncorrelated trees using a [[Classification and regression tree|CART]]-like procedure, combined with randomized node optimization and [[Bootstrap aggregating|bagging]]. In addition, the paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

# Using out-of-bag error as an estimate of the [[generalization error]].
# Measuring variable importance through permutation.

The paper also offers the first theoretical result for random forests, in the form of a bound on the [[generalization error]] which depends on the strength of the trees in the forest and their [[correlation]].

More recently, several major advances in this area have come from Microsoft Research,<ref name="criminisi2011"/> which incorporate and extend the earlier work from Breiman.
== Framework ==

It is better to think of random forests as a framework rather than as a particular model. The framework consists of several interchangeable parts which can be mixed and matched to create a large number of particular models, all built around the same central theme. Constructing a model in this framework requires making several choices:

# The shape of the decision to use in each node.
# The type of predictor to use in each leaf.
# The splitting objective to optimize in each node.
# The method for injecting randomness into the trees.

The simplest type of decision to make at each node is to apply a threshold to a single dimension of the input. This is a very common choice and leads to trees that partition the space into [[Hyperrectangle|hyper-rectangular]] regions. However, other decision shapes, such as splitting a node with a [[Linear equation|linear]] or [[Quadratic equation|quadratic]] decision, are also possible.

Leaf predictors determine how a prediction is made for a point, given that it falls in a particular cell of the space partition defined by the tree. Simple and common choices here include using a [[histogram]] over the class labels for [[categorical data]], or a [[Constant function|constant predictor]] for real-valued outputs.

In principle there is no restriction on the type of predictor that can be used; for example, one could fit a [[Support Vector Machine]] or a [[Spline (mathematics)|spline]] in each leaf, but in practice this is uncommon. If the tree is large then each leaf may have very few points, making it difficult to fit complex models; also, the tree-growing procedure itself may be complicated if it is difficult to compute the splitting objective based on a complex leaf model. However, many of the more exotic generalizations of random forests, e.g. to [[Density estimation|density]] or [[Nonlinear_dimensionality_reduction#Manifold_learning_algorithms|manifold]] estimation, rely on replacing the constant leaf model.
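
As a minimal illustration (not taken from any of the cited implementations; the function names are hypothetical), the two simple leaf models mentioned above can be sketched as follows:

<syntaxhighlight lang="python">
import numpy as np

def fit_histogram_leaf(y_leaf, n_classes):
    """Class-frequency histogram for the categorical labels falling in a leaf."""
    counts = np.bincount(y_leaf, minlength=n_classes)
    return counts / counts.sum()          # predicted class probabilities

def fit_constant_leaf(y_leaf):
    """Constant (mean) predictor for real-valued outputs falling in a leaf."""
    return float(np.mean(y_leaf))
</syntaxhighlight>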
The splitting objective is a function used to rank candidate splits of a leaf as the tree is being grown. It is commonly based on an impurity measure, such as the [[Information gain in decision trees|information gain]] or the [[Decision_tree_learning#Gini_impurity|Gini gain]].
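
As an illustration, the following is a minimal sketch of a Gini-based splitting objective for a binary split; the function names are illustrative and do not come from any particular implementation. A candidate split with larger Gini gain is preferred; substituting entropy for the Gini impurity gives the information gain instead.

<syntaxhighlight lang="python">
import numpy as np

def gini_impurity(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(y, left_mask):
    """Impurity decrease from splitting the labels y into left/right children."""
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * gini_impurity(left) +
                len(right) * gini_impurity(right)) / len(y)
    return gini_impurity(y) - weighted
</syntaxhighlight>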
The method for injecting randomness into each tree is the component of the random forests framework which affords the most freedom to model designers. Breiman's original algorithm achieves this in two ways:

# Each tree is trained on a [[Bootstrapping (statistics)|bootstrapped]] sample of the original data set.
# Each time a leaf is split, only a randomly chosen subset of the dimensions is considered for splitting.

In Breiman's model, once the dimensions are chosen the splitting objective is evaluated at every possible split point in each dimension and the best is chosen. This can be contrasted with the method of Criminisi et al.,<ref name="criminisi2011">{{cite journal |first1=Antonio |last1=Criminisi |first2=Jamie |last2=Shotton |first3=Ender |last3=Konukoglu |title=Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning |journal=Foundations and Trends in Computer Graphics and Vision |year=2011 |volume=7 |pages=81–227 |doi=10.1561/0600000035}}</ref> which performs no [[Bootstrapping (statistics)|bootstrapping]] or [[Sampling (statistics)|subsampling]] of the data between trees, but uses a different approach for choosing the decisions in each node. Their model selects entire decisions at random (e.g. a (dimension, threshold) pair rather than just a dimension). The optimization in the node is then performed over a fixed number of these randomly selected decisions, rather than over every possible decision involving some fixed set of dimensions.
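
To make the contrast concrete, here is a minimal sketch of the two candidate-generation strategies; the function names are illustrative and do not come from either paper's code. In either case the candidate with the best splitting objective (e.g. the largest Gini gain above) is kept.

<syntaxhighlight lang="python">
import numpy as np

def breiman_candidates(X, m, rng):
    """Breiman-style: pick m of the M dimensions, then enumerate every
    observed value in each chosen dimension as a potential threshold."""
    dims = rng.choice(X.shape[1], size=m, replace=False)
    return [(d, t) for d in dims for t in np.unique(X[:, d])]

def randomized_node_candidates(X, n_candidates, rng):
    """Criminisi-style: draw whole (dimension, threshold) decisions at random;
    only this fixed number of candidates is evaluated at the node."""
    dims = rng.randint(0, X.shape[1], size=n_candidates)
    return [(d, rng.uniform(X[:, d].min(), X[:, d].max())) for d in dims]
</syntaxhighlight>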

=== Breiman's Algorithm ===

Each tree is constructed using the following [[algorithm]]:

# Let the number of training cases be ''N'', and the number of variables in the classifier be ''M''.
# The number ''m'' of input variables used to determine the decision at a node of the tree is specified; ''m'' should be much less than ''M''.
# Choose the training set for this tree by sampling ''N'' times with replacement from all ''N'' available training cases (i.e., take a [[bootstrap (statistics)|bootstrap]] sample). Use the rest of the cases to estimate the error of the tree by predicting their classes.
# For each node of the tree, randomly choose ''m'' (out of ''M'') variables on which to search for the best split. Calculate the best split based on these ''m'' variables in the training set and use it as the decision at that node.
# Each tree is fully grown and not [[Pruning (algorithm)|pruned]] (as may be done in constructing a normal tree classifier).

For prediction, a new sample is pushed down the tree and assigned the label of the training sample in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the modal vote over all trees is reported as the random forest prediction. A minimal sketch of this procedure is given below.
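
The following sketch assumes integer class labels <math>0, 1, \ldots</math> and uses scikit-learn's <code>DecisionTreeClassifier</code> as a stand-in for the per-tree CART step; the function names are illustrative, not Breiman's implementation. Cases not drawn into a given tree's bootstrap sample are that tree's out-of-bag cases and can be used to estimate its error, as in step 3 above.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=100, m="sqrt", seed=0):
    """Grow n_trees unpruned trees, each on a bootstrap sample, considering
    only m of the M variables at every split (steps 3-5 above)."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    forest = []
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=n)                   # bootstrap sample
        tree = DecisionTreeClassifier(max_features=m,     # m of M variables per node
                                      random_state=rng.randint(2**31 - 1))
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict_forest(forest, X):
    """Push each sample down every tree and return the modal class vote."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
</syntaxhighlight>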

== Relationship to Nearest Neighbors ==

Given a set of training data
:<math>\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n </math>
a [[K-nearest neighbor algorithm|weighted neighborhood scheme]] makes a prediction for a query point <math>X</math> by computing
:<math>\hat{Y} = \sum_{i=1}^n W_i(X)Y_i</math>
for some set of non-negative weights <math>\{W_i(X)\}_{i=1}^n</math> which sum to 1. The points <math>X_i</math> for which <math>W_i(X) > 0</math> are called the neighbors of <math>X</math>. A common example of a weighted neighborhood scheme is the [[K-nearest neighbor algorithm|k-NN algorithm]], which sets <math>W_i(X) = 1/k</math> if <math>X_i</math> is among the <math>k</math> closest points to <math>X</math> in <math>\mathcal{D}_n</math>, and <math>0</math> otherwise.

Random forests with constant leaf predictors can be interpreted as a [[K-nearest neighbor algorithm|weighted neighborhood scheme]] in the following way. Given a forest of <math>M</math> trees, the prediction that the <math>m</math>-th tree makes for <math>X</math> can be written as
:<math>T_m(X) = \sum_{i=1}^n W_{im}(X)Y_i</math>
where <math>W_{im}(X)</math> is equal to <math>1/{k_m}</math> if <math>X</math> and <math>X_i</math> are in the same leaf of the <math>m</math>-th tree and <math>0</math> otherwise, and <math>k_m</math> is the number of training points which fall in the same leaf as <math>X</math> in the <math>m</math>-th tree. The prediction of the whole forest is
:<math>F(X) = \frac{1}{M}\sum_{m=1}^M T_m(X) = \frac{1}{M}\sum_{m=1}^M\sum_{i=1}^n W_{im}(X)Y_i = \sum_{i=1}^n\left(\frac{1}{M}\sum_{m=1}^M W_{im}(X)\right)Y_i,</math>
which shows that the random forest prediction is a weighted average of the <math>Y_i</math>'s, with weights
:<math>W_i(X) = \frac{1}{M}\sum_{m=1}^M W_{im}(X).</math>
The neighbors of <math>X</math> in this interpretation are the points <math>X_i</math> which fall in the same leaf as <math>X</math> in at least one tree of the forest. In this way, the neighborhood of <math>X</math> depends in a complex way on the structure of the trees, and thus on the structure of the training set.
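
For illustration, the weights <math>W_i(X)</math> can be computed from a forest of fitted scikit-learn trees (such as those in the sketch above) by comparing leaf memberships; the function name is hypothetical.

<syntaxhighlight lang="python">
import numpy as np

def forest_weights(forest, X_train, x):
    """Weights W_i(x): training point i receives 1/k_m from tree m whenever
    it shares a leaf with the query x, averaged over the M trees."""
    w = np.zeros(X_train.shape[0])
    for tree in forest:
        leaves = tree.apply(X_train)                   # leaf index of each training point
        same_leaf = leaves == tree.apply(x.reshape(1, -1))[0]
        k_m = same_leaf.sum()
        if k_m > 0:
            w[same_leaf] += 1.0 / k_m
    return w / len(forest)
</syntaxhighlight>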

This connection was first described by Lin and Jeon in a technical report from 2002,<ref name="linjeon02">{{Citation |first1=Yi |last1=Lin |first2=Yongho |last2=Jeon |contribution=Random forests and adaptive nearest neighbors |series=Technical Report No. 1055 |year=2002 |place=University of Wisconsin |contribution-url=http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.9168}}</ref> where they show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.

== Variable Importance ==

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper<ref name=breiman2001/> and is implemented in the [[R (programming language)|R]] package randomForest.<ref name="rpackage">{{cite web |url=http://cran.r-project.org/web/packages/randomForest/randomForest.pdf |title=Documentation for R package randomForest |first1=Andy |last1=Liaw |date=16 October 2012 |accessdate=15 March 2013}}</ref>

The first step in measuring the variable importance in a data set <math>\mathcal{D}_n =\{(X_i, Y_i)\}_{i=1}^n</math> is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the <math>j</math>-th feature after training, the values of the <math>j</math>-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the <math>j</math>-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values.
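
A minimal sketch of this permutation step is given below, substituting a held-out validation set for the out-of-bag samples (as the text above permits when bagging is not used) and taking the standard deviation over repeated permutations rather than over trees; <code>model</code> is any fitted classifier with a <code>predict</code> method and the function name is illustrative.

<syntaxhighlight lang="python">
import numpy as np

def permutation_importance(model, X_val, y_val, n_repeats=10, seed=0):
    """Importance of each feature: mean increase in error after permuting
    that feature alone, normalized by the standard deviation of the increases."""
    rng = np.random.RandomState(seed)
    base_error = np.mean(model.predict(X_val) != y_val)
    scores = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # permute feature j only
            increases.append(np.mean(model.predict(X_perm) != y_val) - base_error)
        scores[j] = np.mean(increases) / (np.std(increases) + 1e-12)
    return scores
</syntaxhighlight>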

This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Methods such as [[partial permutation]]s can be used to solve the problem.<ref>{{cite conference |author=Deng, H. |coauthors=Runger, G.; Tuv, E. |title=Bias of importance measures for multi-valued attributes and solutions |conference=Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN) |year=2011 |pages=293–300}}</ref><ref>{{cite journal |author=Altmann A, Tolosi L, Sander O, Lengauer T |title=Permutation importance: a corrected feature importance measure |journal=Bioinformatics |year=2010 |doi=10.1093/bioinformatics/btq134 |url=http://bioinformatics.oxfordjournals.org/content/early/2010/04/12/bioinformatics.btq134.abstract}}</ref> If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.<ref>{{cite journal |author=Tolosi L, Lengauer T |title=Classification with correlated features: unreliability of feature ranking and solutions |journal=Bioinformatics |year=2011 |doi=10.1093/bioinformatics/btr300 |url=http://bioinformatics.oxfordjournals.org/content/27/14/1986.abstract}}</ref>

== See also ==
* [[Random multinomial logit]]
* [[Random naive Bayes]]
* [[Decision tree learning]]
* [[Gradient boosting]]
* [[Randomized algorithm]]
* [[Bootstrap aggregating]] (bagging)
* [[Ensemble learning]]
* [[Boosting (machine learning)|Boosting]]
* [[Non-parametric statistics]]

== References ==
{{Reflist}}

== External links ==
* [http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_home.htm Random Forests classifier description] (site of Leo Breiman)
* [http://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf Liaw, Andy & Wiener, Matthew "Classification and Regression by randomForest" R News (2002) Vol. 2/3 p. 18] (discussion of the use of the randomForest package for [[R programming language|R]])
* [http://cm.bell-labs.com/cm/cs/who/tkh/papers/compare.pdf Ho, Tin Kam (2002). "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors". Pattern Analysis and Applications 5, pp. 102–112] (comparison of bagging and the random subspace method)
* {{cite journal |doi=10.1007/978-3-540-74469-6_35 |chapter=Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB |title=Database and Expert Systems Applications |series=Lecture Notes in Computer Science |year=2007 |last1=Prinzie |first1=Anita |last2=Poel |first2=Dirk |isbn=978-3-540-74467-2 |volume=4653 |pages=349}}
* {{cite journal |doi=10.1016/j.eswa.2007.01.029 |title=Random Forests for multiclass classification: Random MultiNomial Logit |year=2008 |last1=Prinzie |first1=Anita |last2=Van Den Poel |first2=Dirk |journal=Expert Systems with Applications |volume=34 |issue=3 |pages=1721}}
* [http://semanticsearchart.com/researchRF.html C# implementation] of a random forest algorithm for categorizing text documents, supporting document reading, dictionary building, stop-word filtering, stemming, word counting, construction of a document-term matrix, and its use for building a random forest and categorizing new documents.
* A [http://scikit-learn.org/stable/modules/ensemble.html Python implementation] of the random forest algorithm for regression and classification, with multi-output support.
* [https://github.com/david-matheson/rftk RFTK], a flexible toolkit for building random forests with Python bindings.

[[Category:Classification algorithms]]
[[Category:Ensemble learning]]
[[Category:Decision trees]]