Introduction
This paper describes a method for improving object tracking by adjusting, online, the features used to locate the object of interest. Features provide a compact representation of a high-dimensional space (e.g. a color image). The features that best separate the foreground object from the scene can be considered the best features for tracking. Most earlier work assumes a fixed set of tracking features chosen a priori. This is a reasonable assumption only as long as the appearance of the object and the background does not change. For example, a set of features that performs well for tracking a car in sunlight might be a poor choice for tracking the same car in shadow. The idea of online feature selection arises from the observation that adapting the features to the current state of the scene improves tracking performance.
The general idea of this paper is to select the features with the strongest power to discriminate between the foreground (the tracked object) and the background. The features in a candidate set are therefore ranked by how well they separate the tracked object from the surrounding background. The highest-ranked features are used for tracking; as tracking continues, the discrimination score of each feature is re-evaluated and the candidate feature set is updated accordingly.
1. Feature Selection for Tracking
Various features, such as texture, color and motion, can be used for tracking, so the space of possible tracking features is enormous. The candidate feature set must be chosen so that it can be re-evaluated and updated efficiently. Since color features are relatively insensitive to changes in appearance due to viewpoint, partial occlusion and non-rigidity, the feature set in this paper consists of linear combinations of the components of a color space. Specifically, each feature is a linear combination of the R, G and B pixel values:
$$F = \{\, w_1 R + w_2 G + w_3 B \mid w_* \in \{-2, -1, 0, 1, 2\} \,\}$$
where the w_* are integer coefficients ranging from -2 to 2. Each feature is also normalized to the range 0 to 255. This gives 5^3 = 125 candidate features in total. Features whose coefficients are multiples of another feature's coefficients (for example, 2R+2G+2B and R+G+B), as well as the feature with all-zero coefficients, are removed, leaving 49 features in the set. Other color spaces could be used as well. Some of these features are familiar from the literature, for example R+G+B, which is intensity, and R-B, which is an approximate chrominance feature.
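The pruning of the 125 weight triples down to 49 can be checked with a short sketch. It assumes that negative multiples also count as redundant (so R-B and B-R are treated as one feature), which is the reading that reproduces the count of 49. The function names and the linear rescaling to 0..255 are illustrative assumptions, not details taken from the paper.

```python
from itertools import product
from math import gcd


def candidate_features(max_weight=2):
    """Enumerate (w1, w2, w3) triples, keeping one per scalar-multiple family."""
    seen = set()
    features = []
    for w in product(range(-max_weight, max_weight + 1), repeat=3):
        if w == (0, 0, 0):
            continue  # the all-zero feature carries no information
        g = gcd(gcd(abs(w[0]), abs(w[1])), abs(w[2]))
        canon = tuple(c // g for c in w)  # e.g. (2, 2, 2) -> (1, 1, 1)
        # Assumption: w and -w are the same feature, so make the
        # first non-zero weight positive.
        if next(c for c in canon if c != 0) < 0:
            canon = tuple(-c for c in canon)
        if canon not in seen:
            seen.add(canon)
            features.append(canon)
    return features


def feature_value(weights, rgb):
    """Feature value of one 8-bit RGB pixel, rescaled linearly to [0, 255]."""
    raw = sum(w * c for w, c in zip(weights, rgb))
    lo = 255 * sum(w for w in weights if w < 0)  # smallest possible raw value
    hi = 255 * sum(w for w in weights if w > 0)  # largest possible raw value
    return 255.0 * (raw - lo) / (hi - lo)


features = candidate_features()
print(len(features))  # 49
```

Reducing each triple by its greatest common divisor and fixing the sign gives a single canonical representative per family, so the redundancy test becomes a set lookup.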
The overall tracking process can be summarized as follows. First, the above feature values are computed for object and background pixels. It is assumed that the distributions of object and background pixel values are known. The features are then ranked by how well they separate the object from the background (the details are given in the next section). Whenever a new video frame arrives, a likelihood map is generated for each of the top candidate features, in which object pixels have high probabilities and background pixels have low probabilities. A mean-shift process is then initialized on each of these likelihood images to find a local peak of the object pixel distribution. The results of the mean-shift processes are merged to obtain a 2D estimate of the object's position in the image, and the procedure repeats from the first step.
2. Evaluation of Discriminative Power of Features
As mentioned above, the next step is to measure the discriminative power of each feature and select the features that best separate the tracked object from its immediate surroundings. The inner box in Figure 1 is the object bounding box, and the outer box marks the boundary of the local background.
The value of each feature is evaluated for object and background pixels. H_obj(i) and H_bg(i) denote the histograms of object and background pixel values, respectively. For the experiments in this project, the histograms have 32 bins, so i ranges from 1 to 32. With 49 features, this gives 49 pairs of histograms in total. Each histogram is turned into a probability distribution by dividing it by the number of pixels:
$$p(i) = \frac{H_{obj}(i)}{n_{obj}}, \qquad q(i) = \frac{H_{bg}(i)}{n_{bg}}$$
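where n_obj and n_bg are the numbers of object and background pixels, respectively. The log likelihood of a feature value i, computed from the normalized object and background distributions p(i) and q(i), is given by:
$$L(i) = \log \frac{\max\{p(i), \delta\}}{\max\{q(i), \delta\}}$$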
where δ is a small value that prevents division by zero or taking the log of zero. Intuitively, the log likelihood maps the multi-modal histograms to positive values for colors that belong to the object, negative values for background colors, and values close to zero for colors shared by the object and the background. Finally, the log likelihood distributions of the features must be ranked. The following equation gives the variance ratio of a feature, where p and q are the normalized object and background histograms and var(L; a) is the variance of L under a distribution a. The feature with the largest variance ratio is the most discriminative for the current appearance of the object and background:
$$VR(L; p, q) = \frac{\mathrm{var}(L; (p+q)/2)}{\mathrm{var}(L; p) + \mathrm{var}(L; q)}, \qquad \mathrm{var}(L; a) = \sum_i a(i)\, L^2(i) - \Big[ \sum_i a(i)\, L(i) \Big]^2$$
The numerator of the equation is the variance of L over both object and background pixels. A higher value means the object and background values are spread further apart, which is desirable. The denominator accounts for the variance of the object values and of the background values themselves. Features that minimize these within-class variances are preferred, that is, features that tightly cluster the object pixels and tightly cluster the background pixels. Therefore, the features with the largest variance ratios are selected for tracking.
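The ranking step described in this section can be sketched in a few lines of plain Python. The sketch follows the definitions above (32-bin normalized histograms, a log likelihood ratio guarded by delta, and the variance ratio). The function names, the value delta = 1e-3, the guard against a zero denominator and the toy pixel values are assumptions made for illustration.

```python
from math import log


def normalized_histogram(values, bins=32, max_value=255.0):
    """Histogram of feature values in [0, max_value], normalized to sum to 1."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins / (max_value + 1)), bins - 1)
        counts[idx] += 1
    n = len(values)
    return [c / n for c in counts]


def log_likelihood(p, q, delta=1e-3):
    """Per-bin log likelihood ratio; delta avoids log(0) and division by zero."""
    return [log(max(pi, delta) / max(qi, delta)) for pi, qi in zip(p, q)]


def variance(L, a):
    """Variance of L under the discrete distribution a."""
    mean = sum(ai * li for ai, li in zip(a, L))
    return sum(ai * li * li for ai, li in zip(a, L)) - mean * mean


def variance_ratio(p, q, delta=1e-3):
    """Between-class spread of L divided by the summed within-class variances."""
    L = log_likelihood(p, q, delta)
    mixed = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    within = variance(L, p) + variance(L, q)
    # Assumption: clamp the denominator so perfectly tight clusters do not divide by zero.
    return variance(L, mixed) / max(within, 1e-12)


# Toy feature values (hypothetical): one feature separates the classes, one does not.
separated = variance_ratio(
    normalized_histogram([200, 205, 210, 198]),  # object pixels
    normalized_histogram([40, 50, 45, 60]),      # background pixels
)
overlapping = variance_ratio(
    normalized_histogram([100, 110, 120, 130]),
    normalized_histogram([105, 115, 125, 135]),
)
print(separated > overlapping)  # True
```

In a full tracker this score would be computed for each of the 49 features and the features sorted in descending order of the result.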
Figure 2 shows the likelihood images of Figure 1 for the 49 features, sorted by variance ratio. White pixels have a high probability of being object pixels, black pixels are most likely background pixels, and values in between are shown as intermediate intensities (the log likelihood values are mapped to intensity values between 0 and 255).
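A likelihood image of this kind can be produced by replacing each pixel's feature value with the log likelihood of its histogram bin. The sketch below assumes a linear min-max mapping of the log likelihood values to 0..255, since the exact mapping is not specified here. The function name and the example values are hypothetical.

```python
def likelihood_image(feature_image, L, bins=32, max_value=255.0):
    """Map each feature value to the log likelihood of its bin, rescaled to 0..255."""
    lo, hi = min(L), max(L)
    span = (hi - lo) or 1.0  # avoid dividing by zero when L is constant
    out = []
    for row in feature_image:
        out_row = []
        for v in row:
            idx = min(int(v * bins / (max_value + 1)), bins - 1)
            out_row.append(int(round(255 * (L[idx] - lo) / span)))
        out.append(out_row)
    return out


# Hypothetical log likelihoods: bin 25 is object-like, bin 5 is background-like.
example_L = [0.0] * 32
example_L[25] = 6.0
example_L[5] = -6.0

# A 2x2 image of feature values: object, background / ambiguous, object.
example_map = likelihood_image([[200, 40], [100, 205]], example_L)
print(example_map[0])  # [255, 0]
```

Object-like values come out white, background-like values black, and values shared by both classes land near mid-gray.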
In this example, B is the most discriminative feature and 2R+2G-B is the least discriminative feature. In the likelihood image of the best feature, the red car is well separated from its local background. The following figures show the feature value histograms for the best and the worst feature. As expected, in the best case, the feature value distributions of object and background pixels show two different clusters.
3. Tracking
So far, we have found the features that best separate the tracked object from the background. The assumption is that two consecutive video frames do not differ much, so the best features for the current frame remain valid for separating object and background in the next frame. The top N most discriminative features are used for tracking (N=5 in the experiments of this project). It is known that the top N individually ranked features do not necessarily form the best feature set, but they are good enough for this purpose. As mentioned before, after the top feature candidates are chosen, a mean-shift process is initialized for each feature to find the new location of the tracked target in the image. The task of each mean-shift process is to find a local mode in its likelihood image. For each frame, the process starts from the position of the previous object window (the window that bounds the object pixels in the previous frame). The window moves in the current frame until it reaches a local maximum in its neighbourhood. The movement vector, in which each pixel a is weighted by its likelihood image value w(a), is calculated according to the following equation:
$$\Delta x = \frac{\sum_a K(a - x)\, w(a)\, (a - x)}{\sum_a K(a - x)\, w(a)}$$
where x is the current location of the window center, a ranges over the pixels in a window around x, and K is a suitable kernel function (a quadratic function is used in this project). After the mean-shift processes for all of the top feature candidates have converged, the medians of the x and y coordinates of the window centers are computed. The resulting point is the estimated location of the object in the current frame. The median is used so that a single outlier cannot cause a large jump. Figure 4 gives an overview of the tracking system.
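As a rough illustration of the mean-shift step and the median merge described above, the sketch below runs mean-shift on a small non-negative weight map and then combines several window centers. The quadratic kernel form K(d) = max(0, 1 - |d|^2 / (r^2 + 1)), the square window of radius r, the stopping tolerance and all names are assumptions chosen for the example, not the settings used in the project.

```python
from statistics import median


def mean_shift(weights, start, radius=3, max_iter=20, tol=0.5):
    """Move a window center (x, y) toward a local mode of a non-negative weight map."""
    height, width = len(weights), len(weights[0])
    x, y = start
    for _ in range(max_iter):
        num_x = num_y = den = 0.0
        cx, cy = int(round(x)), int(round(y))
        for ay in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for ax in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dx, dy = ax - x, ay - y
                # Assumed quadratic kernel that falls to zero at the window edge.
                k = max(0.0, 1.0 - (dx * dx + dy * dy) / float(radius * radius + 1))
                kw = k * weights[ay][ax]
                num_x += kw * dx
                num_y += kw * dy
                den += kw
        if den == 0.0:
            break  # no evidence inside the window; keep the current position
        shift_x, shift_y = num_x / den, num_y / den
        x, y = x + shift_x, y + shift_y
        if shift_x * shift_x + shift_y * shift_y < tol * tol:
            break  # the window has stopped moving
    return x, y


def merge_estimates(centers):
    """Combine per-feature window centers with a coordinate-wise median."""
    return (median(c[0] for c in centers), median(c[1] for c in centers))


# Toy 15x15 likelihood image with a bright 3x3 blob centered at x=10, y=8.
blob = [[255 if abs(ax - 10) <= 1 and abs(ay - 8) <= 1 else 0
         for ax in range(15)] for ay in range(15)]
found = mean_shift(blob, start=(8, 7))

# One outlier estimate (40, 2) does not drag the merged position away.
merged = merge_estimates([(10, 8), (11, 9), (40, 2)])
print(merged)  # (11, 8)
```

With five features, one such mean-shift run would be made per likelihood image, and `merge_estimates` would receive the five converged centers.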
Results
This section shows the results of tracking with online feature selection. The first example is a simple case from a very popular computer vision application: the tracker follows a soccer player just before a goal is scored. The shape of the tracked person changes over time, but the appearance of the background and the object does not change significantly, so this is an easy case for the tracking algorithm. In the first frame, a bounding box is drawn around the object, and the pixels inside it are used to learn the distribution of object pixels. The program then iterates over the sequence of frames and tracks the object. The original movie and the tracking result can be found below:
The features used most frequently in tracking the soccer player are R, R-G, R-B, 2R-G and G. An example likelihood image for feature R is shown below. Within their local surroundings, the soccer player pixels have the highest probability of belonging to the tracked object and are shown in white:
The next example is a harder tracking problem, in which the tracker follows a person skiing on a mountain. The main difficulty is the changing lighting conditions caused by the movement of the person and the camera. Figure 6 shows three zoomed views of the person. The left image shows the start of tracking, where the pixels are brown. The middle image shows a case in which reflected sunlight makes the person appear blue to the camera. Finally, the right image is very dark, which is caused by the movement of the camera.
The following movies show the tracking result on this example:
The five features chosen as the most discriminative were, in order, G, B, R, G-B and G-2B. Since the object is close to black and the background is almost white, the individual R, G and B components are already discriminative, which is one reason these features are selected more frequently in this example. A likelihood image for feature G is shown for one of the frames in which G was the most discriminative feature:
Summary and Discussion
An efficient and powerful tracking algorithm was implemented to track objects with changing appearance and shape. The general idea is that the features that best discriminate the foreground from the background are the best features for tracking. In addition, changing features on the fly can significantly improve tracking performance compared to methods that use a fixed set of features.
The feature space consisted of linear combinations of RGB components, which can be computed very efficiently for image pixels, and each feature was ranked by a variance ratio measure of the separability of the object and the background. A mean-shift process was then started for each of the top-ranked features to find the location of the tracked target in a new frame. The procedure continues by fetching a new frame, re-evaluating the features and restarting the mean-shift processes. Despite the promising results, this method has some deficiencies, which are discussed below:
Nice result by the authors of the paper
References
R. T. Collins and Y. Liu, "On-line Selection of Discriminative Tracking Features," in International Conference on Computer Vision (ICCV), Nice, France, 2003.
R. T. Collins, "Mean-shift Blob Tracking through Scale Space," in IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, 2003.
R. T. Collins, Y. Liu and M. Leordeanu, "On-Line Selection of Discriminative Tracking Features," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 27(10), October 2005, pp. 1631-1643.