The '''constellation model''' is a probabilistic, [[generative model]] for category-level object recognition in [[computer vision]]. Like other [[part-based models]], the constellation model attempts to represent an object class by a set of ''N'' parts under mutual geometric constraints. Because it considers the geometric relationship between different parts, the constellation model differs significantly from appearance-only, or "[[bag-of-words model|bag-of-words]]" representation models, which explicitly disregard the location of image features.


The problem of defining a generative model for object recognition is difficult. The task becomes significantly complicated by factors such as background clutter, occlusion, and variations in viewpoint, illumination, and scale. Ideally, we would like the particular representation we choose to be robust to as many of these factors as possible.


In category-level recognition, the problem is even more challenging because of the fundamental problem of intra-class variation. Even if two objects belong to the same visual category, their appearances may be significantly different. However, for structured objects such as cars, bicycles, and people, separate instances of objects from the same category are subject to similar geometric constraints. For this reason, particular parts of an object such as the headlights or tires of a car still have consistent appearances and relative positions. The Constellation Model takes advantage of this fact by explicitly modeling the relative location, relative scale, and appearance of these parts for a particular object category. Model parameters are estimated using an [[unsupervised learning]] algorithm, meaning that the visual concept of an object class can be extracted from an unlabeled set of training images, even if that set contains "junk" images or instances of objects from multiple categories. It can also account for the absence of model parts due to appearance variability, occlusion, clutter, or detector error.
 
== History ==
 
The idea for a "parts and structure" model was originally introduced by Fischler and Elschlager in 1973.<ref>[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1672195 M. Fischler and R. Elschlager. ''The Representation and Matching of Pictorial Structures.'' (1973)]</ref> This model has since been built upon and extended in many directions. The Constellation Model, as introduced by Perona and his colleagues, was a probabilistic adaptation of this approach.
 
In the late '90s, Burl et al.<ref>
[ftp://vision.caltech.edu/pub/tech-reports-vision/IWAFGR95.ps.Z M. Burl, T. Leung, and P. Perona. ''Face Localization via Shape Statistics.'' (1995)]</ref><ref>[ftp://vision.caltech.edu/pub/tech-reports-vision/ICCV95-faces.ps.Z T. Leung, M. Burl, and P. Perona. ''Finding Faces in Cluttered Scenes Using Random Labeled Graph Matching.'' (1995)]</ref><ref>[ftp://vision.caltech.edu/pub/tech-reports-vision/CVPR96-recog.ps.gz M. Burl and P. Perona. ''Recognition of Planar Object Classes'' (1996)]</ref><ref>[http://www.vision.caltech.edu/publications/ECCV98-recog.pdf M. Burl, M. Weber, and P. Perona. ''A Probabilistic Approach to Object Recognition Using Local Photometry and Global Geometry'' (1998)]</ref> revisited the Fischler and Elschlager model for the purpose of face recognition. In their work, Burl et al. used manual selection of constellation parts in training images to construct a statistical model for a set of detectors and the relative locations at which they should be applied. In 2000, Weber et al. <ref>[http://www.vision.caltech.edu/publications/MarkusWeber-thesis.pdf M. Weber. ''Unsupervised Learning of Models for Object Recognition.'' PhD Thesis. (2000)]</ref><ref>[ftp://vision.caltech.edu/pub/tech-reports/FG00-recog.pdf M. Weber, W. Einhaeuser, M. Welling and P. Perona. ''Viewpoint-Invariant Learning and Detection of Human Heads.'' (2000)]</ref><ref name="weber_towards">[ftp://vision.caltech.edu/pub/tech-reports/CVPR00-recog.pdf M. Weber, M. Welling, and P. Perona. ''Towards Automatic Discovery of Object Categories.'' (2000)]</ref><ref name="weber_unsupervised">[ftp://vision.caltech.edu/pub/tech-reports/ECCV00-recog.pdf M. Weber, M. Welling and P. Perona. ''Unsupervised Learning of Models for Recognition.'' (2000)]</ref> made the significant step of training the model using a more unsupervised learning process, which precluded the necessity for tedious hand-labeling of parts. Their algorithm was particularly remarkable because it performed well even on cluttered and occluded image data. Fergus et al.<ref name="object_class_recognition">[ftp://vision.caltech.edu/pub/tech-reports/Fergus_CVPR03.pdf R. Fergus, P. Perona, and A. Zisserman. ''Object Class Recognition by Unsupervised Scale-Invariant Learning.'' (2003)]</ref><ref name="fergus_thesis">[http://cs.nyu.edu/~fergus/papers/fergus_thesis.pdf R. Fergus. ''Visual Object Category Recognition.'' PhD Thesis. (2005)]</ref> then improved upon this model by making the learning step fully unsupervised, having both shape and appearance learned simultaneously, and accounting explicitly for the relative scale of parts.
 
== The method of Weber and Welling et al.<ref name="weber_unsupervised" /> ==
 
In the first step, a standard [[interest point detection]] method, such as Harris [[corner detection]], is used to generate interest points. [[Image feature]]s generated from the vicinity of these points are then clustered using [[k-means]] or another appropriate algorithm. In this process of [[vector quantization]], one can think of the centroids of these clusters as being representative of the appearance of distinctive object parts. Appropriate [[feature detector]]s are then trained using these clusters, which can be used to obtain a set of candidate parts from images.
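
The following sketch illustrates this preprocessing stage. It is not the authors' implementation, but a rough equivalent using Harris corners from scikit-image and k-means from scikit-learn; the patch size and number of part types are arbitrary choices for illustration.

<syntaxhighlight lang="python">
import numpy as np
from skimage.feature import corner_harris, corner_peaks
from sklearn.cluster import KMeans

def extract_patches(image, patch_size=11):
    """Detect Harris interest points and return flattened patches around them."""
    half = patch_size // 2
    coords = corner_peaks(corner_harris(image), min_distance=half)
    patches, locations = [], []
    for r, c in coords:
        patch = image[r - half:r + half + 1, c - half:c + half + 1]
        if patch.shape == (patch_size, patch_size):   # skip detections too close to the border
            patches.append(patch.ravel())
            locations.append((r, c))
    return np.array(patches), np.array(locations)

def build_part_vocabulary(training_images, num_types=100):
    """Vector-quantize patch appearance into T candidate part types."""
    all_patches = np.vstack([extract_patches(img)[0] for img in training_images])
    kmeans = KMeans(n_clusters=num_types, n_init=10).fit(all_patches)
    return kmeans   # cluster centroids act as appearance templates for candidate parts
</syntaxhighlight>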
 
 
As a result of this process, each image can now be represented as a set of parts. Each part has a type, corresponding to one of the aforementioned appearance clusters, as well as a location in the image space.
 
=== Basic generative model ===
 
Weber & Welling here introduce the concept of ''foreground'' and ''background''. ''Foreground'' parts correspond to an instance of a target object class, whereas ''background'' parts correspond to background clutter or false detections.
 
Let ''T'' be the number of different types of parts. The positions of all parts extracted from an image can then be represented in the following "matrix,"
 
:<math>
X^o =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1N_1} \\
x_{21} & x_{22} & \cdots & x_{2N_2} \\
\vdots & & & \vdots \\
x_{T1} & x_{T2} & \cdots & x_{TN_T}
\end{pmatrix}
</math>
 
where <math>N_i\,</math> represents the number of parts of type <math>i \in \{1,\dots,T\}</math> observed in the image. The superscript ''o'' indicates that these positions are ''observable'', as opposed to ''missing''. The positions of unobserved object parts can be represented by the vector <math>x^m\,</math>. Suppose that the object will be composed of <math>F\,</math> distinct foreground parts. For notational simplicity, we assume here that <math>F = T\,</math>, though the model can be generalized to <math>F > T\,</math>. A ''hypothesis'' <math>h\,</math> is then defined as a set of indices, with <math>h_i = j\,</math>, indicating that point <math>x_{ij}\,</math> is a foreground point in <math>X^o\,</math>. The generative probabilistic model is defined through the joint probability density <math>p(X^o,x^m,h)\,</math>.
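
As a minimal illustration of this representation (with hypothetical names, not taken from the paper), detections can be grouped by appearance type to form the ragged "matrix" <math>X^o</math>, and a hypothesis stored as a vector of per-type indices with 0 marking a missing part:

<syntaxhighlight lang="python">
import numpy as np

def group_detections_by_type(kmeans, patches, locations):
    """Build X^o: for each part type i, the array of observed positions x_{i1..iN_i}."""
    types = kmeans.predict(patches)                    # appearance cluster of each detection
    T = kmeans.n_clusters
    X_o = [locations[types == i] for i in range(T)]    # row i has N_i = len(X_o[i]) entries
    return X_o

# A hypothesis h assigns to each foreground part f either a detection index
# (1-based, as in the text) or 0 if that part is unobserved (missing).
def hypothesis_positions(X_o, h):
    """Return the observed foreground coordinates selected by hypothesis h."""
    return np.array([X_o[f][j - 1] for f, j in enumerate(h) if j > 0])
</syntaxhighlight>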
 
=== Model details ===
 
The rest of this section summarizes the details of Weber & Welling's model for a single component model. The formulas for multiple component models<ref name="weber_towards" /> are extensions of those described here.
 
To parametrize the joint probability density, Weber & Welling introduce the auxiliary variables <math>b\,</math> and <math>n\,</math>, where <math>b\,</math> is a binary vector encoding the presence/absence of parts in detection (<math>b_i = 1\,</math> if <math>h_i > 0\,</math>, otherwise <math>b_i = 0\,</math>), and <math>n\,</math> is a vector where <math>n_i\,</math> denotes the number of ''background'' candidates included in the <math>i^{th}</math> row of <math>X^o\,</math>. Since <math>b\,</math> and <math>n\,</math> are completely determined by <math>h\,</math> and the size of <math>X^o\,</math>, we have <math>p(X^o,x^m,h) = p(X^o,x^m,h,n,b)\,</math>. By decomposition,
 
:<math>
p(X^o,x^m,h,n,b) = p(X^o,x^m|h,n,b)p(h|n,b)p(n)p(b)\,
</math>
 
The probability density over the number of background detections can be modeled by a [[Poisson distribution]],
 
:<math>
p(n) = \prod_{i=1}^T \frac{1}{n_i!}(M_i)^{n_i}e^{-M_i}
</math>
 
where <math>M_i\,</math> is the average number of background detections of type <math>i\,</math> per image.
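
This density translates directly into code; the short sketch below assumes <code>M</code> is the vector of per-type background rates <math>M_i\,</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import poisson

def log_p_n(n, M):
    """log p(n) = sum_i log Poisson(n_i; M_i) for the background detection counts."""
    n, M = np.asarray(n), np.asarray(M)
    return poisson.logpmf(n, M).sum()
</syntaxhighlight>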
 
Depending on the number of parts <math>F\,</math>, the probability <math>p(b)\,</math> can be modeled either as an explicit table of length <math>2^F\,</math>, or, if <math>F\,</math> is large, as <math>F\,</math> independent probabilities, each governing the presence of an individual part.
 
The density <math>p(h|n,b)\,</math> is modeled by
 
:<math>
p(h|n,b) =
 
\begin{cases}
\frac{1}{ \textstyle \prod_{f=1}^F N_f^{b_f}}, & \mbox{if } h \in H(b,n) \\
0, & \mbox{for other } h
\end{cases}
 
</math>
 
where <math>H(b,n)\,</math> denotes the set of all hypotheses consistent with <math>b\,</math> and <math>n\,</math>, and <math>N_f\,</math> denotes the total number of detections of parts of type <math>f\,</math>. This expresses the fact that all consistent hypotheses, of which there are <math>\textstyle \prod_{f=1}^F N_f^{b_f}</math>, are equally likely in the absence of information on part locations.
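
A direct transcription of this density, assuming the consistency of the hypothesis with <math>(b,n)\,</math> has already been checked:

<syntaxhighlight lang="python">
import numpy as np

def log_p_h_given_nb(b, N):
    """log p(h|n,b) for a consistent hypothesis: uniform over prod_f N_f^{b_f} choices."""
    b, N = np.asarray(b), np.asarray(N)
    return -np.log(N[b == 1]).sum()   # minus the log of the number of consistent hypotheses
</syntaxhighlight>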
 
And finally,
 
:<math>
p(X^o,x^m|h,n) = p_{fg}(z)p_{bg}(x_{bg})\,
</math>
 
where <math>z = (x^ox^m)\,</math> are the coordinates of all foreground detections, observed and missing, and <math>x_{bg}\,</math> represents the coordinates of the background detections. Note that foreground detections are assumed to be independent of the background. <math>p_{fg}(z)\,</math> is modeled as a joint Gaussian with mean <math>\mu\,</math> and covariance <math>\Sigma\,</math>.
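
For illustration, the position term can be evaluated as follows, assuming all foreground parts were observed (so <math>x^m\,</math> is empty) and treating the background density as uniform over the image area; the latter is an assumption of this sketch rather than a statement of the original parametrization.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def log_p_positions(z, x_bg, mu, Sigma, image_area):
    """log p(X^o, x^m | h, n) = log p_fg(z) + log p_bg(x_bg).

    z          : (F, 2) array of foreground part coordinates (all assumed observed here)
    x_bg       : coordinates of the detections explained as background clutter
    mu, Sigma  : mean (length 2F) and covariance (2F x 2F) of the joint Gaussian p_fg
    image_area : background positions are modeled here as uniform over the image
    """
    log_fg = multivariate_normal.logpdf(np.ravel(z), mean=mu, cov=Sigma)
    log_bg = -len(x_bg) * np.log(image_area)   # one uniform factor per background detection
    return log_fg + log_bg
</syntaxhighlight>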
 
=== Classification ===
 
The ultimate objective of this model is to classify images into classes "object present" (class <math>C_1\,</math>) and "object absent" (class <math>C_0\,</math>) given the observation <math>X^o\,</math>. To accomplish this, Weber & Welling run part detectors from the learning step exhaustively over the image, examining different combinations of detections. If occlusion is considered, then combinations with missing detections are also permitted. The goal is then to select the class with maximum a posteriori probability, by considering the ratio
 
:<math>
\frac{p(C_1|X^o)}{p(C_0|X^o)} \propto \frac{\sum_h p(X^o,h|C_1)}{p(X^o,h_0|C_0)}
</math>
 
where <math>h_0\,</math> denotes the null hypothesis, which explains all parts as background noise. In the numerator, the sum includes all hypotheses, including the null hypothesis, whereas in the denominator, the only hypothesis consistent with the absence of an object is the null hypothesis. In practice, some threshold can be defined such that, if the ratio exceeds that threshold, we then consider an instance of an object to be detected.
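
Schematically, detection then amounts to summing the joint density over all hypotheses and comparing against the null hypothesis; <code>log_joint</code> below is a placeholder for a function assembling the terms above, and the threshold is a free parameter.

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import logsumexp

def detect_object(X_o, hypotheses, null_hypothesis, log_joint, log_threshold=0.0):
    """Approximate MAP decision via the ratio sum_h p(X^o,h|C1) / p(X^o,h0|C0)."""
    log_num = logsumexp([log_joint(X_o, h) for h in hypotheses])   # sum includes h0
    log_den = log_joint(X_o, null_hypothesis)                      # every part is clutter
    return (log_num - log_den) > log_threshold
</syntaxhighlight>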
 
=== Model learning ===
 
After the preliminary step of interest point detection, feature generation and clustering, we have a large set of candidate parts over the training images. To learn the model, Weber & Welling first perform a greedy search over possible model configurations, or equivalently, over potential subsets of the candidate parts. This is done in an iterative fashion, starting with random selection. At subsequent iterations, parts in the model are randomly substituted, the model parameters are estimated, and the performance is assessed. The process is complete when further model performance improvements are no longer possible.
 
At each iteration, the model parameters
 
:<math>
\Theta = \{\mu, \Sigma, p(b), M\}\,
</math>
 
are estimated using [[expectation maximization]]. <math>\mu\,</math> and <math>\Sigma\,</math>, we recall, are the mean and covariance of the joint Gaussian <math>p_{fg}(z)\,</math>, <math>p(b)\,</math> is the probability distribution governing the binary presence/absence of parts, and <math>M\,</math> is the mean number of background detections over part types.
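
The outer search loop can be sketched as follows, with <code>fit_em</code> and <code>validation_score</code> standing in (as hypothetical names) for the EM fit described below and whatever performance measure is used to accept or reject a swap:

<syntaxhighlight lang="python">
import random

def greedy_part_selection(candidate_parts, F, fit_em, validation_score, max_iters=100):
    """Greedy search over F-part subsets of the candidate parts."""
    model_parts = random.sample(candidate_parts, F)      # random initial configuration
    theta = fit_em(model_parts)
    best = validation_score(theta)
    for _ in range(max_iters):
        trial = model_parts.copy()
        trial[random.randrange(F)] = random.choice(candidate_parts)  # substitute one part
        theta_trial = fit_em(trial)
        score = validation_score(theta_trial)
        if score > best:                                 # keep the swap only if it helps
            model_parts, theta, best = trial, theta_trial, score
    return model_parts, theta
</syntaxhighlight>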
 
=== M-step ===
 
EM proceeds by maximizing the likelihood of the observed data,
 
:<math>
L(X^o|\Theta) = \sum_{i=1}^I \log \sum_{h_i} \int p(X_i^o,x_i^m,h_i|\Theta)dx_i^m
</math>
 
with respect to the model parameters <math>\Theta\,</math>. Since this is difficult to achieve analytically, EM iteratively maximizes a sequence of cost functions,
 
:<math>
Q(\tilde{\Theta}|\Theta) = \sum_{i=1}^I E[\log p(X_i^o,x_i^m,h_i|\tilde{\Theta})]
</math>
 
Taking the derivative of this with respect to the parameters and equating to zero produces the update rules:
 
:<math>
\tilde{\mu} = \frac{1}{I} \sum_{i=1}^I E[z_i]
</math>
 
:<math>
\tilde{\Sigma} = \frac{1}{I} \sum_{i=1}^I E[z_iz_i^T] - \tilde{\mu}\tilde{\mu}^T
</math>
 
:<math>
\tilde{p}(\bar{b}) = \frac{1}{I} \sum_{i=1}^I E[\delta_{b,\bar{b}}]
</math>
 
:<math>
\tilde{M} = \frac{1}{I} \sum_{i=1}^I E[n_i]
</math>
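
Given per-image expectations from the E-step, these update rules translate directly into code; the array names below are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def m_step(E_z, E_zzT, E_delta_b, E_n):
    """M-step updates averaged over the I training images."""
    mu = np.mean(E_z, axis=0)                              # length 2F
    Sigma = np.mean(E_zzT, axis=0) - np.outer(mu, mu)      # 2F x 2F
    p_b = np.mean(E_delta_b, axis=0)                       # table over occlusion patterns
    M = np.mean(E_n, axis=0)                               # background rates per part type
    return mu, Sigma, p_b, M
</syntaxhighlight>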
 
=== E-step ===
 
The update rules in the M-step are expressed in terms of [[sufficient statistics]], <math>E[z]\,</math>, <math>E[zz^T]\,</math>, <math>E[\delta_{b,\bar{b}}]\,</math> and <math>E[n]\,</math>, which are calculated in the E-step by considering the posterior density:
 
:<math>
p(h_i,x_i^m|X_i^o,\Theta) = \frac{p(h_i,x_i^m,X_i^o|\Theta)}{\textstyle \sum_{h_i \in H_b} \int p(h_i,x_i^m,X_i^o|\Theta) dx_i^m}
</math>
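
A minimal sketch of the E-step for one image, ignoring missing parts (so the integral over <math>x_i^m\,</math> and the occlusion statistic drop out) and reusing the hypothetical <code>log_joint</code> from the classification sketch:

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import logsumexp

def e_step_statistics(X_o, hypotheses, log_joint, foreground_coords, count_background):
    """Posterior-weighted sufficient statistics for one image (no missing parts)."""
    log_post = np.array([log_joint(X_o, h) for h in hypotheses])
    log_post -= logsumexp(log_post)                       # normalize the posterior over h
    w = np.exp(log_post)
    Z = np.array([foreground_coords(X_o, h) for h in hypotheses])   # shape (H, 2F)
    E_z = w @ Z
    E_zzT = np.einsum('h,hi,hj->ij', w, Z, Z)
    E_n = w @ np.array([count_background(X_o, h) for h in hypotheses])
    return E_z, E_zzT, E_n
</syntaxhighlight>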
 
==The method of Fergus et al.<ref name="object_class_recognition" />==
 
In Weber et al., shape and appearance models are constructed separately. Once the set of candidate parts has been selected, shape is learned independently of appearance. The innovation of Fergus et al. is to learn three model parameters simultaneously: shape, appearance, and relative scale. Each of these parameters is represented by a Gaussian density.
 
=== Feature representation ===
 
Whereas the preliminary step in the Weber et al. method is to search for the locations of interest points, Fergus et al. use the detector of Kadir and Brady<ref>[http://www.springerlink.com/content/t45n2g8543574026/fulltext.pdf T. Kadir and M. Brady. ''Saliency, scale and image description.'' (2001)]</ref> to find salient regions in the image over both location (center) and scale (radius). Thus, in addition to location information <math>X\,</math> this method also extracts associated scale information <math>S\,</math>. Fergus et al. then normalize the squares bounding these circular regions to 11 x 11 pixel patches, or equivalently, 121-dimensional vectors in the appearance space. These are then reduced to 10-15 dimensions by [[principal component analysis]], giving the appearance information <math>A\,</math>.
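
A rough analogue of this pipeline is sketched below; it substitutes a generic list of salient regions for the Kadir and Brady detector (which is not available in common libraries) and uses scikit-learn's PCA, so the function and its parameters are illustrative rather than the authors' code.

<syntaxhighlight lang="python">
import numpy as np
from skimage.transform import resize
from sklearn.decomposition import PCA

def appearance_vectors(image, regions, patch_size=11, out_dims=15):
    """regions: list of (row, col, radius) from a saliency detector.

    Returns appearance vectors A, locations X, and scales S for each region."""
    patches, X, S = [], [], []
    for r, c, rad in regions:
        r, c, rad = int(r), int(c), int(rad)
        crop = image[max(r - rad, 0):r + rad + 1, max(c - rad, 0):c + rad + 1]
        patches.append(resize(crop, (patch_size, patch_size)).ravel())  # 121-dim vector
        X.append((r, c))
        S.append(rad)
    pca = PCA(n_components=out_dims).fit(patches)   # in practice fit once on the training set
    A = pca.transform(patches)                      # reduced appearance descriptors
    return A, np.array(X), np.array(S)
</syntaxhighlight>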
 
=== Model structure ===
 
Given a particular object class model with parameters <math>\Theta\,</math>, we must decide whether or not a new image contains an instance of that class. This is accomplished by making a Bayesian decision,
 
:<math>
R = \frac{p(\mbox{Object}|X,S,A)}{p(\mbox{No object}|X,S,A)}
</math>
 
:<math>
= \frac{p(X,S,A|\mbox{Object})p(\mbox{Object})}{p(X,S,A|\mbox{No object})p(\mbox{No object})}
</math>
 
:<math>
\approx \frac{p(X,S,A|\Theta)p(\mbox{Object})}{p(X,S,A|\Theta_{bg})p(\mbox{No object})}
</math>
 
where <math>\Theta_{bg}</math> is the background model. This ratio is compared to a threshold <math>T\,</math> to determine object presence/absence.
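
In log space this decision reduces to a simple comparison; the prior odds and threshold are free parameters here.

<syntaxhighlight lang="python">
import numpy as np

def object_present(log_lik_obj, log_lik_bg, prior_object=0.5, threshold=1.0):
    """Bayesian decision R = p(Object|X,S,A) / p(No object|X,S,A) compared to a threshold."""
    log_prior_odds = np.log(prior_object) - np.log(1.0 - prior_object)
    log_R = log_lik_obj - log_lik_bg + log_prior_odds
    return log_R > np.log(threshold)
</syntaxhighlight>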
 
The likelihoods are factored as follows:
 
:<math>
p(X,S,A|\Theta) = \sum_{h \in H} p(X,S,A,h|\Theta) =
</math>
 
:<math>
\sum_{h \in H} \underbrace{ p(A|X,S,h,\Theta) }_{\mbox{Appearance}} \underbrace{ p(X|S,h,\Theta) }_{\mbox{Shape}} \underbrace{ p(S|h,\Theta) }_{\mbox{Rel. Scale}} \underbrace{ p(h|\Theta) }_{\mbox{Other}}
</math>
 
=== Appearance ===
 
Each part <math>p\,</math> has an appearance modeled by a Gaussian density in the appearance space, with mean and covariance parameters <math>\Theta_p^{app} = \{c_p,V_p\}</math>, independent of other parts' densities. The background model has parameters <math>\Theta_{bg}^{app} = \{c_{bg},V_{bg}\}</math>. Fergus et al. assume that, given detected features, the position and appearance of those features are independent. Thus, <math>p(A|X,S,h,\Theta) = p(A|h,\Theta)\,</math>. The ratio of the appearance terms reduces to
 
:<math>
\frac{p(A|X,S,h,\Theta)}{p(A|X,S,h,\Theta_{bg})} = \frac{p(A|h,\Theta)}{p(A|h,\Theta_{bg})}
</math>
 
:<math>
= \prod_{p=1}^P \left ( \frac{G(A(h_p)|c_p,V_p)}{G(A(h_p)|c_{bg},V_{bg})} \right )^{b_p}
</math>
 
Recall from Weber et al. that <math>h\,</math> is the hypothesis for the indices of foreground parts, and <math>b\,</math> is the binary vector giving the occlusion state of each part in the hypothesis.
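
In code, this product over parts is conveniently computed in log space; only parts with <math>b_p = 1\,</math> contribute, exactly as the exponent indicates. The sketch below assumes appearance vectors indexed by detection and Gaussian parameters stored as arrays; the names are illustrative.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def log_appearance_ratio(A, h, b, c, V, c_bg, V_bg):
    """sum_p b_p * [log G(A(h_p)|c_p,V_p) - log G(A(h_p)|c_bg,V_bg)]."""
    total = 0.0
    for p, (idx, present) in enumerate(zip(h, b)):
        if present:
            a = A[idx]                                   # appearance vector of detection h_p
            total += multivariate_normal.logpdf(a, mean=c[p], cov=V[p])
            total -= multivariate_normal.logpdf(a, mean=c_bg, cov=V_bg)
    return total
</syntaxhighlight>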
 
=== Shape ===
 
Shape is represented by a joint Gaussian density of part locations within a particular hypothesis, after those parts have been transformed into a scale-invariant space. This transformation precludes the need to perform an exhaustive search over scale. The Gaussian density has parameters <math>\Theta^{\mbox{shape}} = \{\mu,\Sigma\}\,</math>. The background model <math>\Theta_{bg}\,</math> is assumed to be a uniform distribution over the image, which has area <math>\alpha\,</math>. Letting <math>f\,</math> be the number of foreground parts,
 
:<math>
\frac{p(X|S,h,\Theta)}{p(X|S,h,\Theta_{bg})} = G(X(h)|\mu,\Sigma)\alpha^f
</math>
 
=== Relative scale ===
 
The scale of each part <math>p\,</math> relative to a reference frame is modeled by a Gaussian density with parameters <math>\Theta^{\mbox{scale}} = \{t_p,U_p\}\,</math>. Each part is assumed to be independent of other parts. The background model <math>\Theta_{bg}\,</math> assumes a uniform distribution over scale, within a range <math>r\,</math>.
 
:<math>
\frac{p(S|h,\Theta)}{p(S|h,\Theta_{bg})} = \prod_{p=1}^P G(S(h_p)|t_p,U_p)^{d_p} r^f
</math>
 
=== Occlusion and statistics of feature detection ===
 
:<math>
\frac{p(h|\Theta)}{p(h|\Theta_{bg})} = \frac{p_{\mbox{Poiss}}(n|M)}{p_{\mbox{Poiss}}(N|M)} \frac{1}{^nC_r(N,f)} p(b|\Theta)
</math>
 
The first term models the number of features detected using a [[Poisson distribution]], which has mean M. The second term serves as a "book-keeping" term for the hypothesis variable. The last term is a probability table for all possible occlusion patterns.
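
The remaining terms combine with the appearance ratio in the same way. The sketch below assembles the full per-hypothesis log-ratio under the simplifying assumption that every model part is detected, so the shape Gaussian need not be marginalized over missing parts; parameter names and the dictionary-style occlusion table are illustrative, not the authors' notation.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal, poisson
from scipy.special import comb

def log_hypothesis_ratio(X, S, A, h, theta, theta_bg, image_area, scale_range, N):
    """log of the per-hypothesis ratio, assuming all P model parts are detected (b = 1...1)."""
    P = len(h)
    # Appearance term (see the previous sketch), with every part present.
    log_ratio = log_appearance_ratio(A, h, [1] * P, theta['c'], theta['V'],
                                     theta_bg['c'], theta_bg['V'])
    # Shape: joint Gaussian over the scale-normalized foreground locations vs. uniform clutter.
    x_h = np.concatenate([X[idx] for idx in h])           # stacked 2-D locations, length 2P
    log_ratio += multivariate_normal.logpdf(x_h, mean=theta['mu'], cov=theta['Sigma'])
    log_ratio += P * np.log(image_area)                   # alpha^f with f = P
    # Relative scale: independent Gaussians per part vs. uniform over the scale range r.
    for p, idx in enumerate(h):
        log_ratio += multivariate_normal.logpdf(S[idx], mean=theta['t'][p], cov=theta['U'][p])
    log_ratio += P * np.log(scale_range)                  # r^f with f = P
    # Occlusion and detection statistics: Poisson counts, book-keeping, occlusion table.
    log_ratio += poisson.logpmf(N - P, theta['M']) - poisson.logpmf(N, theta['M'])
    log_ratio -= np.log(comb(N, P))                       # book-keeping term for the hypothesis
    log_ratio += np.log(theta['p_b'][(1,) * P])           # probability of this occlusion pattern
    return log_ratio
</syntaxhighlight>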
 
 
=== Learning ===
 
The task of learning the model parameters <math>\Theta = \{\mu,\Sigma,c,V,M,p(b|\Theta),t,U\}\,</math> is accomplished by [[expectation maximization]]. This is carried out in a spirit similar to that of Weber et al. Details and formulas for the E-step and M-step can be seen in the literature.<ref name="fergus_thesis" />
 
== Performance ==
 
The Constellation Model as conceived by Fergus et al. achieves successful categorization rates consistently above 90% on large datasets of motorbikes, faces, airplanes, and spotted cats.<ref>[http://www.vision.caltech.edu/html-files/archive.html R. Fergus and P. Perona. Caltech Object Category datasets. http://www.vision.caltech.edu/html-files/archive.html (2003)]</ref> For each of these datasets, the Constellation Model is able to capture the "essence" of the object class in terms of appearance and/or shape. For example, face and motorbike datasets generate very tight shape models because objects in those categories have very well-defined structure, whereas spotted cats vary significantly in pose, but have a very distinctive spotted appearance. Thus, the model succeeds in both cases. It is important to note that the Constellation Model does not generally account for significant changes in orientation. Thus, if the model is trained on images of horizontal airplanes, it will not perform well on, for instance, images of vertically oriented planes unless the model is extended to account for this sort of rotation explicitly.
 
In terms of computational complexity, the Constellation Model is very expensive. If <math>N\,</math> is the number of feature detections in the image, and <math>P\,</math> the number of parts in the object model, then the hypothesis space <math>H\,</math> is <math>O(N^P)\,</math>. Because the computation of sufficient statistics in the E-step of [[expectation maximization]] necessitates evaluating the likelihood for every hypothesis, learning becomes a major bottleneck operation. For this reason, only values of <math>P \le 6</math> have been used in practical applications, and the number of feature detections <math>N\,</math> is usually kept within the range of about 20-30 per image.
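
The <math>O(N^P)\,</math> blow-up is easy to make concrete: enumerating every assignment of <math>P\,</math> parts to <math>N\,</math> detections with <math>N = 25\,</math> and <math>P = 6\,</math> already yields about 244 million hypotheses per image, before occlusion is even considered.

<syntaxhighlight lang="python">
from itertools import product

def all_hypotheses(N, P):
    """Every assignment of P model parts to N detections: N^P hypotheses."""
    return product(range(N), repeat=P)

# For N = 25 and P = 6 this is 25**6 = 244,140,625 hypotheses per image,
# each of which requires a likelihood evaluation in the E-step.
</syntaxhighlight>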
 
== Variations ==
 
One variation that attempts to reduce complexity is the star model proposed by Fergus et al.<ref>[http://www.robots.ox.ac.uk/%7Efergus/papers/fergus_cvpr05.pdf R. Fergus, P. Perona, and A. Zisserman. ''A Sparse Object Category Model for Efficient Learning and Exhaustive Recognition.'' (2005)]</ref> The reduced dependencies of this model allow learning in <math>O(N^2P)\,</math> time instead of <math>O(N^P)\,</math>, which permits a greater number of model parts and image features to be used in training. Because the star model has fewer parameters, it is also better at avoiding over-fitting when trained on fewer images.
 
== References ==
 
{{reflist}}
 
== External links ==
* [http://courses.ece.uiuc.edu/ece598/ffl/lecture7_ConstellationModel_shortversion.pdf L. Fei-fei. ''Object categorization: the constellation models''. Lecture Slides. (2005)] (link not working)
 
== See also ==
 
* [[Part-based models]]
* [[One-shot learning]]
 
[[Category:Learning in computer vision]]
[[Category:Probabilistic models]]
