2011 International Conference on Reconfigurable Computing and FPGAs
Improving KLT in Embedded Systems by Processing Oversampling Video Sequence in Real-time
Zhilei Chai
School of Internet of Things Engineering
Jiangnan University
Wuxi, China
Email: zlchai@jiangnan.edu.cn
Jianbo Shi
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, USA
Email: jshi@cis.upenn.edu
Abstract—We propose an efficient hardware architecture for the Kanade-Lucas-Tomasi (KLT) algorithm to process oversampling video data in real time. The philosophy of this paper is to ensure KLT's tracking performance by decreasing the inter-frame displacement, rather than by handling large displacements with more complex algorithms. This approach suits embedded systems, whose special-purpose computational performance is far higher than their general-purpose performance. With this architecture, the time to extract all features is just the time to capture one frame of the image, and that interval can be used to track as many features as possible. The architecture was implemented in a Xilinx Virtex-5 FPGA. As shown by the synthesis results, the maximum frequency of the system can be up to 182.287 MHz. Thus, around 182 frames of 1024×768 images can be processed per second, and more than 1000 features can be tracked successfully within each frame. This makes the inter-frame displacement very small in contrast to the lower frame rates provided before. Empirical analysis shows that when the displacement is small enough (e.g., no more than 5 pixels), features can be tracked well with just the basic KLT algorithm.
Keywords-FPGA; Oversampling Video Sequence; KLT Tracking; Embedded Systems;
I. INTRODUCTION
The Kanade-Lucas-Tomasi [1], [2] algorithm tries to answer two basic questions: how to select features, and how to track them from frame to frame. The original algorithm was proposed by Lucas and Kanade [1]; its goal is to align or track a selected image patch from a template image to an input image. Later, Shi and Tomasi [2] supplemented the L&K algorithm by proposing how to select the image patches with good features to track and how to abandon bad features during tracking, thus composing the Kanade-Lucas-Tomasi algorithm. As mentioned in L&K, successful tracking requires the inter-frame displacement to be small enough to make the approximation adequate; otherwise, more complex algorithms are necessary to keep the tracking results acceptable. Hence, there are obviously two directions to make the KLT track well:
1) Dealing with large displacement by employing more
complex algorithms.
2) Making the basic algorithm work well by keeping the
displacement small enough.
For embedded systems, method 1) complicates the computation and leads to a low real-time frame rate, which in turn leads to a larger inter-frame displacement. On the contrary, when the algorithm is kept simple, embedded systems can provide much higher computational performance through hardware acceleration.
Progress in imaging sensor technology now makes the oversampling technique viable for video processing. For example, the DALSA Falcon VGA300HG camera [4] can capture VGA video at a frame rate of 300 fps. Using a high frame rate and resolution to oversample the scene yields more accurate information about its motion and illumination. Thus, method 2), coordinated with hardware acceleration, is potentially the more reasonable way to deploy KLT in embedded systems.
Lim et al. [5] investigated using temporally oversampled video to improve the accuracy of KLT optical flow estimation. They found that a key benefit of temporal oversampling for optical flow estimation is the reduction in motion aliasing, not only for areas with large displacements but also for areas with small displacements but high spatial frequencies. The primary goal of their work was to improve the accuracy of KLT; it does not concentrate on how to process the oversampled video data in real time. For embedded systems, however, real-time processing is generally a necessity.
Recently, work has emerged that focuses on improving the real-time performance of KLT. Besides software-related work that upgrades the algorithm itself, such as [3], most of this work depends on special-purpose hardware (e.g., GPU or FPGA) to accelerate the computation. GPU KLT [6] implemented the translational model and can track approximately 1000 features in 1024×768 video at 30 Hz. Kanade et al. [7] implemented affine-photometric KLT on a GPU in the CUDA framework; 512 features from 640×480 video can be tracked at more than 50 fps. Wonjun Jang et al. [8] introduced an FPGA implementation of pyramidal KLT with a working frequency of 209.249 MHz; it can extract and track around 100 features at 60 fps with an image size of 720×480. Yean Choon Ham et al. [9] proposed a KLT-based gesture tracking system, implemented on a Virtex-II XC2VP30 FPGA, that processes 60 fps with an image size of 320×240.
Current work still cannot provide the performance to process video with both high resolution (e.g., above 1024×768) and high frame rate (e.g., above 180 fps) in real time. In this paper, we try to improve both further with a fully pipelined architecture, so that the oversampling technique can be used to make the basic KLT work well.
II. KLT ALGORITHM
The KLT algorithm is composed of two phases of operation, namely feature extraction [2] and feature tracking [1]. Generally, image patches with good features are extracted from the initial frame by feature extraction and then tracked from frame to frame.
When the inter-frame displacement is small enough, the assumption of intensity constancy can be assured. Let the image intensity at point (x, y) taken at time t be denoted by I(x, y, t); then we have:

\[ I(x, y, t+\tau) = I(x-\xi,\, y-\eta,\, t) \qquad (1) \]

That means one pixel (x, y) in the later image at time t + τ can be obtained by moving another pixel in the current image, taken at time t, by a suitable amount. The amount of motion d = (ξ, η) is called the displacement of the point at x = (x, y) between time instants t and t + τ.
A. Feature Extraction

A single pixel is not sufficient for tracking because its intensity can change during tracking, so the KLT algorithm chooses image patches instead of single pixels. Through the further constraint that an image patch as a whole can be approximated by a translation of the old one, the d above becomes the displacement of the whole image patch. The KLT selects good features based on tracking: it requires that the coefficient matrix Z of the image patch, defined in (2), be both above the image noise level and well conditioned. In (2), g_x and g_y denote the gradients of the image patch in the x and y directions, respectively.

\[ Z = \begin{bmatrix} g_x^2 & g_x g_y \\ g_x g_y & g_y^2 \end{bmatrix} \qquad (2) \]

According to the KLT, we compute the smaller eigenvalue λ_min by equation (3). The image patch with the bigger λ_min is the better feature to track.

\[ \lambda_{min} = \frac{g_x^2 + g_y^2 - \sqrt{(g_x^2 - g_y^2)^2 + 4(g_x g_y)^2}}{2} \qquad (3) \]
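As a point of reference for this selection criterion, the following minimal software sketch (our own illustration, not the hardware datapath; numpy assumed, and the 7×7 patch size is an example value) computes the entries of Z and λ_min for every pixel-centered patch:

```python
import numpy as np

def min_eigenvalues(grad_x, grad_y, k=7):
    """Per-pixel lambda_min of the 2x2 matrix Z (equations (2)-(3)).

    grad_x, grad_y: gradient images of equal shape.
    k: patch size (7 is an assumed example value).
    """
    gx, gy = grad_x.astype(float), grad_y.astype(float)
    # Per-pixel products; summing them over a patch gives the entries of Z.
    gxx, gxy, gyy = gx * gx, gx * gy, gy * gy

    def window_sum(img):
        # Sum each k x k pixel-centered window (borders left at zero,
        # mirroring the abandoned borders in the hardware).
        out = np.zeros_like(img)
        r = k // 2
        for y in range(r, img.shape[0] - r):
            for x in range(r, img.shape[1] - r):
                out[y, x] = img[y - r:y + r + 1, x - r:x + r + 1].sum()
        return out

    sxx, sxy, syy = window_sum(gxx), window_sum(gxy), window_sum(gyy)
    # Smaller eigenvalue of Z, exactly equation (3).
    return (sxx + syy - np.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2)) / 2.0
```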
B. Feature Tracking

With the image patches selected by feature extraction, the task of feature tracking is to track them from frame to frame. Equation (1) can be redefined as:

\[ J(x) = I(x, y, t+\tau), \qquad I(x-d) = I(x-\xi,\, y-\eta,\, t) \qquad (4) \]

Here J(x) and I(x−d) denote one image patch ω instead of just one pixel as in (1). Because of noise or shape change, there is possibly a difference between J(x) and I(x−d), so the local image model should be described as:

\[ J(x) = I(x-d) + n(x) \qquad (5) \]

where n represents the error introduced by noise or shape change. The problem of tracking then becomes finding the suitable displacement d that minimizes the residue error ε defined by the following equation:

\[ \varepsilon = \sum_{i \in \omega} \left[ I(x_i - d) - J(x_i) \right]^2 \qquad (6) \]

Rather than searching the possible d exhaustively to find the best match, the KLT uses a type of Newton-Raphson iteration to direct the matching procedure: when d is small enough, the intensity function can be approximated by its Taylor series and the minimization can be represented by equation (7),

\[ Zd = e \qquad (7) \]

where Z is defined by (2) above and e can be expressed as follows:

\[ e = \begin{bmatrix} g_x (I - J) \\ g_y (I - J) \end{bmatrix} \qquad (8) \]

Equation (7) can be expressed in a more intuitive way as follows:

\[ d_x = \frac{g_y^2 \times e_x - g_x g_y \times e_y}{\delta}, \qquad d_y = \frac{g_x^2 \times e_y - g_x g_y \times e_x}{\delta} \qquad (9) \]

where \( \delta = g_x^2 \times g_y^2 - (g_x g_y)^2 \). For each image patch with a good feature, the KLT computes d_x and d_y and updates the new position iteratively until it converges or exceeds the maximum number of iterations. Then the residue error is checked again to decide whether the feature has been tracked successfully.
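To make the iteration concrete, here is a minimal software sketch of this tracking loop (our illustration with hypothetical names; integer patch positions are used for brevity, whereas a real tracker would interpolate sub-pixel patches):

```python
import numpy as np

def track_feature(I, J, gx, gy, pos, patch=7, th=0.1, max_iter=10):
    """Iteratively solve Zd = e (eq. 7) via eq. (9) for one feature.

    I, J: previous/current frames; gx, gy: gradients of I;
    pos: integer (x, y) feature position in I.
    Returns the updated position, or None on failure.
    """
    r = patch // 2
    x, y = pos
    d = np.array([0.0, 0.0])
    for _ in range(max_iter):
        # Patch in I at the original position; patch in J at the current
        # estimate (rounded here; the real tracker interpolates).
        xi, yi = int(round(x + d[0])), int(round(y + d[1]))
        pI = I[y - r:y + r + 1, x - r:x + r + 1].astype(float)
        pJ = J[yi - r:yi + r + 1, xi - r:xi + r + 1].astype(float)
        pgx = gx[y - r:y + r + 1, x - r:x + r + 1].astype(float)
        pgy = gy[y - r:y + r + 1, x - r:x + r + 1].astype(float)

        # Entries of Z (eq. 2) and e (eq. 8), summed over the patch.
        gxx, gxy, gyy = (pgx * pgx).sum(), (pgx * pgy).sum(), (pgy * pgy).sum()
        diff = pI - pJ
        ex, ey = (pgx * diff).sum(), (pgy * diff).sum()

        # Cramer's rule, equation (9).
        delta = gxx * gyy - gxy * gxy
        if delta == 0:
            return None
        dx = (gyy * ex - gxy * ey) / delta
        dy = (gxx * ey - gxy * ex) / delta
        d += np.array([dx, dy])
        if abs(dx) <= th and abs(dy) <= th:
            return (x + d[0], y + d[1])   # converged
    return None                           # failed to converge
```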
III. SUITABLE KLT VARIATIONS FOR DIFFERENT INTER-FRAME DISPLACEMENTS

It is known that different techniques must be employed to maintain KLT's tracking results as the inter-frame displacement increases. Fig. 1 and Fig. 2 show the tracking trajectory and convergence result of a selected feature under different inter-frame displacements, tracked by the KLT algorithm with different enhancements. Letting the inter-frame displacement increase from (1,1) to (10,10), the KLT without image smoothing, the KLT with image smoothing, and the multi-pyramid KLT are used to track the same feature point. Below, these three algorithms are referred to as algorithms a, b, and c, respectively, for brevity.
[Figure 1: two panels plotting tracking trajectories (x-axis vs. y-axis). Left: displacement 1 in the finest level, (18,23) to (19,24). Right: displacement 5, (18,23) to (23,28). Legend in both: Smooth=No/MultiPyramid=No, Smooth=Yes/MultiPyramid=No, Smooth=Yes/MultiPyramid=Yes.]
Figure 1. Tracking results of different KLT variations with displacements of 1 and 5. When the displacement is 1, algorithm a fails to converge after 10 iterations, b converges after 1 iteration, and c converges after 10 + 1 iterations from coarse to fine. When the displacement is 5, algorithm a fails after 1 iteration, b converges after 5 iterations, and c converges after 10 + 3 iterations.
[Figure 2: two panels plotting tracking trajectories (x-axis vs. y-axis). Left: displacement 6 in the finest level, (18,23) to (24,29). Right: displacement 9, (18,23) to (27,32). Same legend as Figure 1.]
Figure 2. Tracking results of different KLT variations with displacements of 6 and 9. When the displacement is 6, algorithm a fails to converge after 8 iterations, b fails to converge after 10 iterations, and c converges after 10 + 4 iterations. When the displacement is 9, algorithms a and b fail to converge after 10 iterations, and c fails to converge after 1 + 10 iterations.
As shown in Figs. 1 and 2, when the displacement is no more than 5, algorithm b needs far fewer iterations than algorithm c to converge, and algorithm c, with its multiple pyramid levels, is much more complex than algorithm b to implement in hardware. Thus, in this paper, we prefer to implement algorithm b in hardware and process a high-frame-rate video sequence in real time, keeping the inter-frame displacement small enough.
IV. HARDWARE ARCHITECTURE OF KLT
According to the KLT algorithm, the process of the KLT tracker can be described as follows (a software sketch of this control flow is given after the list):

1) Capture the incoming image and smooth it if necessary;
2) Compute the gradients of the image in the x and y directions (Grad_x, Grad_y);
3) For tracking, go to step 8, otherwise go to step 4;
4) Compute g_x^2, g_x g_y, g_y^2 for the image patch centered on each pixel of the current frame;
5) Compute λ_min for each image patch based on (3);
6) Select N image patches with good features to track based on their λ_min;
7) Go to step 1;
8) Select one feature to track;
9) Compute g_x^2, g_x g_y, g_y^2, e_x, e_y for this feature's image patch, at its updated coordinates, in the previous image and the current image;
10) Compute d_x, d_y;
11) Update the feature's coordinates based on d_x, d_y within the current image;
12) If (|d_x| > th or |d_y| > th) and the iteration limit is not exceeded, go to step 9, otherwise go to step 13;
13) If there are features left, go to step 8, otherwise go to step 14;
14) Go to step 1.
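The following compact Python mirror of these steps (the callables `capture`, `smooth`, `gradients`, `lambda_min`, `select`, and `track_once` are hypothetical stand-ins for the hardware modules) shows the overall control flow:

```python
def klt_tracker(capture, smooth, gradients, lambda_min, select,
                track_once, N, th, max_iter):
    """Software mirror of steps 1-14: extract once, then track each frame."""
    # Steps 1-2 and 4-6: feature extraction on the first frame.
    prev = smooth(capture())
    pgx, pgy = gradients(prev)
    features = select(lambda_min(pgx, pgy), N)   # N best-lambda_min patches

    while True:
        cur = smooth(capture())                  # steps 1-2, next frame
        cgx, cgy = gradients(cur)
        for f in features:                        # steps 8-13
            for _ in range(max_iter):
                # Steps 9-10: one Newton-Raphson step for this feature.
                dx, dy = track_once(prev, pgx, pgy, cur, cgx, cgy, f)
                f.x, f.y = f.x + dx, f.y + dy     # step 11
                if abs(dx) <= th and abs(dy) <= th:
                    break                         # step 12: converged
        prev, pgx, pgy = cur, cgx, cgy            # step 14: next frame pair
```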
Fig. 3 shows the architecture of the hardware KLT implemented in this paper. As shown in Fig. 3, the image smoothing and gradient computing modules are multiplexed by feature extraction and feature tracking to reduce resource consumption.

There are three pipeline stages in this architecture, namely feature extraction, image preprocessing, and feature tracking. It should be noted that feature extraction includes the image preprocessing while feature tracking does not: the time of image preprocessing can be hidden in the time of feature extraction due to the pipeline mechanism, but cannot be hidden in the time of feature tracking due to its iterative nature.
While the first frame of the video is being captured, its smoothed version is computed by the image smoothing module if necessary. The output of the image smoothing is then sent to the gradient computing module. To improve computational performance and avoid storing intermediate results, the gradients in the x and y directions are computed in parallel, so Grad_x and Grad_y are ready simultaneously for use by the eigenvalue computing.

For eigenvalue computing, we adopt a row-column-separate accumulation mechanism, so the eigenvalue for each pixel-centered window can be accumulated gradually as each Grad_x and Grad_y value arrives. A similar stream mode is adopted for the maximum eigenvalue selection module: the system selects the maximum eigenvalue for each area dynamically as the eigenvalues are produced one by one. Thus, feature extraction in this architecture is entirely a stream computation, from image smoothing to maximum eigenvalue selection, and no intermediate results need to be saved before maximum eigenvalue selection. Considering that the borders are abandoned when computing the eigenvalues, the time to get all the selected features is just the time to capture one frame of the video data.
For feature tracking, because the possible position of the tracked image patch in the current frame is updated iteratively, the memory holding S2, Grad_x2, and Grad_y2 must be accessed randomly to fetch the corresponding patch. These results are generated by the image smoothing and gradient computing, respectively.
Figure 3. Overall architecture of the hardware KLT implementation. When the first frame of the video is captured, the good features are obtained after image smoothing, gradient computing, eigenvalue computing, and maximum eigenvalue selection; at the same time, the intermediate results S1, Grad_x1, and Grad_y1 are saved in memory section M1. After feature extraction, the second frame is captured and S2, Grad_x2, and Grad_y2 are computed and saved in memory section M2. While features are being tracked from M1 to M2, the corresponding intermediate results of the third frame can be computed and saved to M3. Then, when tracking from M2 to M3, M1 is reused to save the intermediate results, and so on.
A. Convolution

We use 1D convolutions in the horizontal and vertical directions to replace the 2D convolution and save resources. Rather than trying to improve the performance of a single 1D convolution alone, we implement the 1D convolutions in a stream fashion and improve the performance of the system as a whole. That is, the horizontal convolution starts with a latency of several cycles after the first pixel is received; the vertical convolution starts after a further latency once the horizontal one is running; and the first convolved result is obtained after the latency of the convolution itself. After the first result is produced, one result is output per cycle. Furthermore, once the first result of the previous convolver is produced, it can be sent to the next one immediately instead of waiting for all results to be produced. This method makes the time to compute multiple passes of convolution nearly equal to the time to compute just one pass.
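The resource saving comes from kernel separability. A small numpy check (an illustration with an assumed 3-tap kernel, not the kernel used in the design) verifies that two chained 1D passes equal one 2D pass:

```python
import numpy as np
from scipy.signal import convolve2d

# A separable 2D kernel is the outer product of two 1D kernels.
h = np.array([0.25, 0.5, 0.25])           # horizontal 1D kernel
v = np.array([0.25, 0.5, 0.25])           # vertical 1D kernel
k2d = np.outer(v, h)                       # equivalent 2D kernel

img = np.random.rand(64, 64)

# One 2D pass versus two chained 1D passes (rows, then columns).
full = convolve2d(img, k2d, mode='same')
rows = convolve2d(img, h[np.newaxis, :], mode='same')
both = convolve2d(rows, v[:, np.newaxis], mode='same')

assert np.allclose(full, both)             # identical up to float error
```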
B. Eigenvalue Computing

As described in Section II, computing the eigenvalue λ_min requires first computing g_x^2, g_x g_y, and g_y^2 for each image patch from Grad_x and Grad_y. Generally, this is computed as a convolution using the patches of Grad_x and Grad_y as filter kernels. However, this kind of 2D convolution is really time and resource consuming. Fortunately, the difference between this computation and a true 2D convolution is that the filter kernel of a convolution needs to shift to align with each window being convolved, whereas here the operands stay fixed. So it is possible to compute all the products of each pixel and its corresponding multiplicand from Grad_x and Grad_y in advance and accumulate them in a row-column-separate fashion. Using an internal buffer array, the row sums of each window are saved and the final sum of each window is computed on the fly. Hence, the sum of the first window is obtained with a latency after the first product is received; after that, the sum of one window is obtained every cycle.
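A behavioral Python model of this row-column-separate accumulation (our sketch, consuming one product per step; the hardware updates the running sums incrementally rather than re-summing each window):

```python
from collections import deque

def windowed_sums(products, width, k):
    """Stream k x k window sums over a row-major stream of per-pixel products.

    Yields (row, col, total) for each k x k window as soon as its last
    product arrives, mimicking the row/column accumulators.
    """
    prev_rows = deque(maxlen=k - 1)   # horizontal sums of the last k-1 rows
    cur_row = []                      # horizontal sums of the row in flight
    window = deque(maxlen=k)          # last k raw products of the current row
    row = 0
    for i, p in enumerate(products):
        if i > 0 and i % width == 0:  # a new image row begins
            prev_rows.append(cur_row)
            cur_row, window = [], deque(maxlen=k)
            row += 1
        window.append(p)
        if len(window) == k:          # one k-wide horizontal sum is ready
            cur_row.append(sum(window))
            col = len(cur_row) - 1
            if len(prev_rows) == k - 1:   # k rows stacked: emit a window sum
                yield row, col, cur_row[col] + sum(r[col] for r in prev_rows)
```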
C. Non-maximum eigenvalue suppression

Selecting good features by sorting is time consuming as well as memory intensive, and the sorting algorithm itself is difficult to accelerate in hardware because of its data hazards. In this paper, we propose a non-maximum suppression mechanism that selects the best feature within each area and can be performed dynamically as the eigenvalues are produced one by one. With this method, in each clock cycle one input eigenvalue is either selected as the maximum of its region or abandoned; so when all eigenvalues have been produced, the good features have been selected as well. To implement this mechanism, we split the image of eigenvalues into a c × r 2D grid, where each grid cell covers (k/2 + 1) × (k/2 + 1) eigenvalues and k is the number of columns of an image patch; a memory space indexed by the 2D address, with the same size as the grid, stores the maximum eigenvalue of each cell. To prevent two good features from being too close together, besides the policy that only one feature can exist within one grid cell, the features saved in neighboring cells are also kept at least that distance apart. Combined with a threshold, this method selects the good features on the fly.
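A simplified behavioral model of the grid-based selection (our sketch; the neighboring-cell distance check and the one-eigenvalue-per-cycle hardware timing are omitted for brevity):

```python
def stream_nms(eigenvalues, width, cell, threshold):
    """Grid-based non-maximum suppression over a streamed eigenvalue image.

    eigenvalues: row-major stream of per-pixel values; width: image width;
    cell: grid cell size ((k/2)+1 in the text); threshold: noise floor.
    Returns {(grid_row, grid_col): (value, x, y)} of per-cell maxima.
    """
    best = {}
    for i, lam in enumerate(eigenvalues):
        if lam <= threshold:
            continue                      # below the noise level: discard
        y, x = divmod(i, width)
        key = (y // cell, x // cell)      # grid cell containing this pixel
        # Keep only the maximum eigenvalue seen so far in this cell.
        if key not in best or lam > best[key][0]:
            best[key] = (lam, x, y)
    return best
```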
D. d_x and d_y computing

When a feature is under tracking, the KLT computes its d_x and d_y iteratively until it converges or exceeds the maximum number of iterations. Because the computation of d_x and d_y in the current iteration depends on the result of the previous one, the parallelism available for accelerating the tracking of one individual feature is limited. However, the tracking of different features is independent, so it is possible to improve the parallelism of the computation by tracking different features concurrently. In this paper, we prefer to track all features one after another within each iteration, instead of tracking one feature through multiple iterations until it converges or fails.
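The chosen schedule can be sketched as follows (illustrative only; `step` is a hypothetical single-iteration d_x, d_y computation for one feature):

```python
def track_all_iteration_major(features, step, th, max_iter):
    """Track every live feature one iteration per pass (the order used here)."""
    live = set(range(len(features)))
    for _ in range(max_iter):
        for i in sorted(live):
            dx, dy = step(features[i])        # one d_x, d_y computation
            features[i].x += dx
            features[i].y += dy
            if abs(dx) <= th and abs(dy) <= th:
                live.discard(i)               # converged: skip in later passes
    # Features still live at the end failed to converge.
    return [f for i, f in enumerate(features) if i not in live]
```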
E. Computational Performance analysis based on the architecture

As introduced above, the hardware KLT in this paper has two separate data paths, namely feature extraction and feature tracking. The feature extraction path is composed of image smoothing, gradient computing, eigenvalue computing, and non-maximum eigenvalue suppression; the feature tracking path is composed of the d_x, d_y computing.
So, the computing time for these two parts should be described separately. Let K be the size of the filter, M × N the image size, k × l the size of the image patch, and L_r the latency of the row accumulation; let the number of features to track be F, the maximum number of iterations per feature be I, and the border of abandoned pixels be b.
1) Feature extraction: The feature extraction is implemented entirely in a stream fashion. Its execution time can be calculated as follows:

1) The delay from the beginning to getting the first smoothed pixel is K/2 + 12 + (K/2) × M + 12.
2) The delay from the smoothed results to getting the first Grad_x and Grad_y is also K/2 + 12 + (K/2) × M + 12, because Grad_x and Grad_y are computed in parallel.
3) Thus, the time from the beginning until the gradient computing finishes is 2 × (K/2 + 12 + (K/2) × M + 12) + M × N.
4) The delay from the gradient results to getting the first eigenvalue is L_r + k × M + 4 + 21 + k. Allowing for the delays from 1) and 2), the time from the beginning until the eigenvalue computing within the borders finishes is 2 × (K/2 + 12 + (K/2) × M + 12) + (L_r + k × M + 4 + 21 + k) + M × (N − b).
5) There is just one cycle of delay from the end of the eigenvalue computing to the eigenvalue selection, so it can be omitted from the analysis.
6) Consequently, the time from the beginning to finishing the feature selection is the time given in 4). Generally, if b > k, this time is less than that of 3).

Obviously, if the image resolution is large compared with the delays and the border is bigger than the image patch size, the time to finish feature extraction is nearly the time to fetch one frame of the video. For example, if the image size is 1024 × 768, the time in 3) is 808960 cycles and the time in 4) is 810031 cycles. Given a working frequency of 182 MHz, feature extraction can thus be completed at over 182 fps.
2) Feature tracking: As mentioned above, tracking a feature from one frame to the next requires accessing the latter frame at different positions as the tracking iterates, so it is hard to implement feature tracking in a stream fashion like feature extraction. In order to leave more time for feature tracking, we implemented the S2, Grad_x2, and Grad_y2 computing and the feature tracking in separate pipeline stages. The interval between consecutive frames being captured and S2, Grad_x2, and Grad_y2 being computed is the time left for tracking; the problem then becomes how many features can be tracked within this interval. The time to track a feature in each iteration is around k × l cycles, so the total time to track all features is around k × l × F × I cycles. As in the example above, the time to capture a frame and compute S2, Grad_x2, and Grad_y2 is 808960 cycles, which is also the number of cycles available for tracking. So, in this example, over 1000 features can be tracked when the number of iterations is less than 4.
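A back-of-the-envelope check of this budget (the 7×7 patch size is an assumption for illustration; the text does not state k and l here):

```python
def max_trackable_features(cycles_per_frame, k, l, iterations):
    """Features that fit in one frame interval, per the k*l*F*I model."""
    return cycles_per_frame // (k * l * iterations)

# Example from the text: 808960 cycles per 1024x768 frame.
print(max_trackable_features(808960, 7, 7, 4))   # -> 4127 with a 7x7 patch
```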
[Figure 4: plot "Software Extracted Features" (x-axis vs. y-axis) showing the extracted results without image smoothing.]
Figure 4. Features extracted by the software KLT. Features are selected by first sorting the eigenvalues and then picking out the good features with the maximum eigenvalues.
[Figure 5: intensity map of the corresponding eigenvalues of the software-extracted features.]
Figure 5. Corresponding eigenvalues for the features extracted by the software KLT. These eigenvalues are the λ_min in equation (3); the features with bigger eigenvalues are the good features to track.
V. EXPERIMENTAL RESULTS
A. Synthesized results

The design above was programmed in VHDL and synthesized with Xilinx ISE 10.1, and finally downloaded onto the ML506 evaluation platform (xc5vsx50t-3-ff1136). As shown by the synthesis results, the working frequency of the design can be up to 182.287 MHz, which means the computing performance analyzed above is attainable. To establish a minimal testable environment without manufacturing the high-speed camera interface for the FPGA board, we used two images of a shifted object, stored in memory, to provide the video data for tracking. The results show that 95% of the resources in total, 75% of the multiplier resources, and 84% of the on-chip memory are used.
B. Feature extraction and tracking results

In this section, the feature extraction and feature tracking of the hardware KLT tracker are verified by comparing their results with the corresponding results of the software KLT implemented in [10]; the filter used in this design is therefore the same as in [10]. Its smoothing kernel is [0.04, 0.11, 0.21,