Von Neumann's trace inequality: Difference between revisions

{{multiple issues|
{{Primary sources|date=May 2010}}
{{Expert-subject|Statistics|date=May 2010}}
}}
 
The '''Davies&ndash;Bouldin index (DBI)''', introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating [[clustering algorithm]]s.<ref>{{cite doi |10.1109/TPAMI.1979.4766909 }}</ref> It is an internal evaluation scheme, in which the quality of the clustering is judged using only quantities and features inherent to the dataset. A drawback is that a good value reported by this method does not imply the best information retrieval.
 
==Preliminaries==
 
Let ''C''<sub>''i''</sub> be a cluster of vectors, and let ''X''<sub>''j''</sub> be an ''n''-dimensional feature vector assigned to cluster ''C''<sub>''i''</sub>.
 
: <math> S_i =  \sqrt[q]{\frac{1}{T_i} \displaystyle\sum_{j=1}^{T_i} {|X_j-A_i|^q} } </math>
 
Here <math>A_i</math> is the [[centroid]] of ''C''<sub>''i''</sub> and ''T''<sub>''i''</sub> is the size of cluster ''i''. ''S''<sub>''i''</sub> is a measure of scatter within the cluster. Usually ''q'' is 2, which makes ''S''<sub>''i''</sub> the root-mean-square [[Euclidean distance]] between the centroid of the cluster and the individual feature vectors. Other distance metrics can be used, for instance with [[manifolds]] and higher-dimensional data, where the Euclidean distance may not be the best measure for determining the clusters. Note that this distance metric has to match the metric used in the clustering scheme itself for the results to be meaningful.
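As an illustration, ''S''<sub>''i''</sub> can be computed with NumPy as follows (a minimal sketch; the function name is ours, and <math>|X_j-A_i|</math> is taken to be the Euclidean norm):

```python
import numpy as np

def within_cluster_scatter(cluster_points, q=2):
    """S_i: q-th root of the mean q-th power of the distances
    from each point of the cluster to its centroid A_i.

    cluster_points is a (T_i, n) array; q=2 gives the usual
    root-mean-square Euclidean scatter.
    """
    centroid = cluster_points.mean(axis=0)                     # A_i
    dists = np.linalg.norm(cluster_points - centroid, axis=1)  # |X_j - A_i|
    return float(np.mean(dists ** q) ** (1.0 / q))
```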
 
: <math> M_{i,j} = \left|\left|A_i-A_j\right|\right|_p = \sqrt[p]{\displaystyle\sum_{k=1}^{n}\left|a_{k,i}-a_{k,j}\right|^p } </math>
 
: <math> M_{i,j}</math> is a measure of separation between cluster <math>C_i</math> and cluster <math>C_j</math>.
 
: <math> a_{k,i} </math> is the ''k''th element of <math>A_i</math>; there are ''n'' such elements, since <math>A_i</math> is an ''n''-dimensional centroid.
 
Here ''k'' indexes the features of the data; when ''p'' equals 2, ''M''<sub>''i,j''</sub> is the [[Euclidean distance]] between the centers of clusters ''i'' and ''j''.
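For instance, the separation ''M''<sub>''i,j''</sub> is simply the Minkowski ''p''-distance between the two centroids (a small sketch; the function name is ours):

```python
import numpy as np

def centroid_separation(centroid_i, centroid_j, p=2):
    """M_{i,j}: Minkowski p-distance between the centroids A_i and A_j.

    p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    """
    return float(np.sum(np.abs(centroid_i - centroid_j) ** p) ** (1.0 / p))
```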
 
==Definition==
 
Let ''R''<sub>''i,j''</sub> be a measure of how good the clustering scheme is. By definition, this measure must account for ''M''<sub>''i,j''</sub>, the separation between the ''i''th and the ''j''th cluster, which should ideally be as large as possible, and for ''S''<sub>''i''</sub>, the within-cluster scatter of cluster ''i'', which should be as small as possible. Hence the Davies&ndash;Bouldin index is defined through a ratio of ''S''<sub>''i''</sub> and ''M''<sub>''i,j''</sub> satisfying the following properties:
 
#  <math> R_{i,j} \geqslant 0 </math>.
#  <math> R_{i,j} = R_{j,i} </math>.
#  if <math> S_j > S_k </math> and <math> M_{i,j} = M_{i,k} </math> then <math> R_{i,j} > R_{i,k} </math>.
#  if <math> S_j = S_k </math> and <math> M_{i,j} < M_{i,k} </math> then <math> R_{i,j} > R_{i,k} </math>.
 
A choice of ''R''<sub>''i,j''</sub> that satisfies these properties is:

: <math> R_{i,j} = \frac{S_i + S_j}{M_{i,j}} </math>
 
This formulation is [[symmetry|symmetric]] and non-negative by construction. The lower the value, the better the separation between the clusters and the 'tightness' inside the clusters.
 
: <math> D_i \equiv \max_{j : i \neq j} R_{i,j}</math>
 
If N is the number of clusters:
 
: <math> {DB} \equiv \frac{1}{N}\displaystyle\sum_{i=1}^N D_i</math>
 
''DB'' is called the Davies&ndash;Bouldin index. It depends both on the data and on the algorithm. ''D''<sub>''i''</sub> picks the worst case: its value equals ''R''<sub>''i,j''</sub> for the cluster most similar to cluster ''i''. There are many possible variations of this formulation, such as taking the average or a weighted average of the cluster similarities.
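Putting the definitions together, the full index can be sketched in NumPy as follows (assuming ''q'' = ''p'' = 2; function and variable names are ours):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index with q = p = 2 (Euclidean scatter and separation).

    X is an (m, n) data matrix, labels an (m,) array of cluster ids.
    Lower values indicate tighter, better-separated clusters.
    """
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])  # A_i
    # S_i: root-mean-square distance of the cluster's points to its centroid
    S = np.array([
        np.sqrt(np.mean(np.linalg.norm(X[labels == c] - centroids[k], axis=1) ** 2))
        for k, c in enumerate(ids)
    ])
    N = len(ids)
    D = np.empty(N)
    for i in range(N):
        M = np.linalg.norm(centroids - centroids[i], axis=1)  # M_{i,j} for all j
        # D_i = max over j != i of R_{i,j} = (S_i + S_j) / M_{i,j}
        D[i] = max((S[i] + S[j]) / M[j] for j in range(N) if j != i)
    return float(D.mean())
```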
 
==Explanation==
 
These conditions constrain the index to be symmetric and non-negative. Because it is defined as a ratio of within-cluster scatter to between-cluster separation, a lower value means that the clustering is better. The index is the similarity between each cluster and the cluster most similar to it, averaged over all clusters, where the similarity is the ''R''<sub>''i,j''</sub> defined above. This captures the idea that no cluster should be similar to another, so the best clustering scheme minimizes the Davies&ndash;Bouldin index. Since the index is an average over all ''N'' clusters, a practical way of deciding how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over: the number of clusters for which the value is lowest is a good estimate of the number of clusters into which the data could ideally be classified. This has applications in deciding the value of ''k'' in the [[kmeans|''k''-means]] algorithm, where ''k'' is not known ''a priori''. The SOM toolbox contains a [[MATLAB]] implementation.<ref>{{cite web |url=http://www.cis.hut.fi/somtoolbox/package/docs2/db_index.html |title=Matlab implementation |accessdate=12 November 2011 }}</ref>
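This model-selection use can be sketched on synthetic blob data (all names are ours; the ''k''-means below is a deliberately minimal Lloyd's iteration with deterministic farthest-first seeding, kept only so the example is self-contained — in practice a library implementation would be used):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index (q = p = 2); lower is better."""
    ids = np.unique(labels)
    cents = np.array([X[labels == c].mean(axis=0) for c in ids])
    S = np.array([np.sqrt(np.mean(np.linalg.norm(X[labels == c] - cents[k], axis=1) ** 2))
                  for k, c in enumerate(ids)])
    D = [max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
             for j in range(len(ids)) if j != i)
         for i in range(len(ids))]
    return float(np.mean(D))

def kmeans(X, k, iters=50):
    """Minimal Lloyd's algorithm, illustrative rather than production-grade."""
    cents = [X[0]]
    for _ in range(k - 1):  # farthest-first initialisation
        d = np.min(np.linalg.norm(X[:, None] - np.array(cents)[None], axis=2), axis=1)
        cents.append(X[np.argmax(d)])
    cents = np.array(cents)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - cents[None], axis=2), axis=1)
        cents = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else cents[c]
                          for c in range(k)])
    return labels

# Three well-separated synthetic blobs; scan k and compare the index.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in ([0, 0], [5, 0], [0, 5])])
scores = {k: davies_bouldin(X, kmeans(X, k)) for k in range(2, 6)}
best_k = min(scores, key=scores.get)
```

On data like this, the index is typically lowest at the true number of blobs, since merging blobs inflates the scatter and splitting one leaves two tight clusters with tiny separation.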
 
==External links==
* http://machaon.karanagai.com/validation_algorithms.html
* http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.2072
* http://books.google.com/books?id=HY8gB2OIqSoC
 
== Notes and references ==
{{Reflist}}
 
{{DEFAULTSORT:Davies-Bouldin index}}
[[Category:Clustering criteria]]

Revision as of 19:47, 7 January 2013
