Universidade Federal de Minas Gerais
Instituto de Ciências Exatas
Departamento de Ciência da Computação
Alinhamento Espaço-Temporal de Sequências
de Vídeo Capturadas a Partir de
Múltiplos Pontos de Vista
Flávio Luis Cardeal Pádua
Tese apresentada ao Curso de Pós-Graduação em
Ciência da Computação da Universidade Federal de
Minas Gerais, como requisito parcial para obtenção
do título de Doutor em Ciência da Computação.
Orientador: Prof. Rodrigo Lima Carceroni
Belo Horizonte, 20 de maio de 2005
Universidade Federal de Minas Gerais
Instituto de Ciências Exatas
Departamento de Ciência da Computação
Spatio-Temporal Alignment of Video Sequences
Captured from Multiple Viewpoints
Flávio Luis Cardeal Pádua
Thesis presented to the Graduate Program in
Computer Science of the Federal University
of Minas Gerais in partial fulfillment of the
requirements for the degree of Doctor in
Computer Science.
Advisor: Prof. Rodrigo Lima Carceroni
Belo Horizonte, May 20, 2005
Resumo
Esta tese aborda o problema de se estimar o alinhamento espaço-temporal
entre N sequências de vídeo não-sincronizadas referentes à mesma cena dinâ-
mica 3D e capturadas a partir de pontos de vista distintos. Diferentemente
dos métodos existentes, os quais funcionam somente para N = 2, este traba-
lho apresenta uma abordagem inovadora que reduz o problema caracterizado
por um N qualquer ao problema de se estimar uma única reta em $\mathbb{R}^N$. Esta
reta captura todas as relações temporais entre os vídeos, podendo ser cal-
culada sem qualquer conhecimento a priori sobre as mesmas. Considerando
que o alinhamento espacial é capturado por parâmetros de tensores bilineares
(matrizes fundamentais), um algoritmo iterativo é usado para refinar simul-
taneamente os parâmetros temporais e espaciais que definem o alinhamento
entre as sequências, uma vez que o refinamento exclusivo dos parâmetros
temporais é subótimo. Resultados experimentais obtidos com sequências de
vídeo reais demonstram que a metodologia proposta é capaz de recuperar efi-
cazmente o alinhamento entre as sequências mesmo diante da existência de
grandes desalinhamentos, diante da presença de ambiguidades (por exemplo,
cenas com movimentos periódicos) e quando um alinhamento manual preciso
é inviável. Finalmente, experimentos com sequências sintéticas demonstram
a escalabilidade e acurácia de nossa abordagem, fornecendo medidas quanti-
tativas para a qualidade dos alinhamentos estimados.
Abstract
This thesis addresses the problem of estimating the spatio-temporal align-
ment between N unsynchronized video sequences of the same dynamic 3D
scene, captured from distinct viewpoints. Unlike existing methods, which
work for N = 2 and rely on a computationally-intensive search in the space
of temporal alignments, we present a novel approach that reduces the prob-
lem for general N to the robust estimation of a single line in $\mathbb{R}^N$. This
line captures all temporal relations between the sequences and can be com-
puted without any prior knowledge of these relations. Considering that the
parameters of fundamental matrices capture the spatial alignment, we use
an iterative algorithm to refine simultaneously the parameters representing
the temporal and spatial relations between the sequences, since the ex-
clusive refinement of the temporal parameters is suboptimal. Experimental
results with real-world sequences show that our method can accurately align
videos even when they have large misalignments (e.g., hundreds of frames),
when the problem is seemingly ambiguous (e.g., scenes with roughly peri-
odic motion), and when accurate manual alignment is difficult (e.g., due to
slow-moving objects). Finally, experiments with synthetic sequences demon-
strate the scalability and accuracy of our approach, providing quantitative
measurements for the quality of the spatio-temporal alignments estimated.
Resumo Estendido
O texto a seguir consiste em um resumo estendido sobre o trabalho
desenvolvido nesta tese. Primeiramente, este texto introduz o problema
abordado, a principal motivação para se resolvê-lo e alguns dos principais
trabalhos relacionados. Em seguida, é feita uma breve descrição da me-
todologia desenvolvida e dos experimentos realizados que comprovam sua
aplicabilidade, escalabilidade e exatidão. Finalmente, são apresentadas
conclusões e propostas de trabalhos futuros.
Introdução
Esta tese aborda o problema de se estimar o Alinhamento Espaço-
Temporal entre Múltiplas Sequências de Vídeo referentes a uma mesma cena
3D, as quais são capturadas a partir de pontos de vista distintos. A di-
nâmica da cena bem como características estáticas presentes na mesma são
utilizadas como poderosas pistas para se estimar a sincronização temporal
(alinhamento temporal) e o alinhamento espacial entre as sequências. Ti-
picamente, o desalinhamento temporal entre sequências de vídeo origina-se
por duas razões principais. A primeira relaciona-se com o fato de que as
sequências de entrada podem possuir diferentes taxas de quadros, enquanto
a segunda relaciona-se com a existência de um deslocamento temporal en-
tre as sequências frequentemente criado quando as câmeras não são ativadas
simultaneamente. Por outro lado, o desalinhamento espacial resulta das dife-
rentes posições, orientações e parâmetros internos de calibração das câmeras.
Em muitas aplicações atuais que se beneficiam da disponibilidade de re-
gistros de vídeo simultâneos de um mesmo evento físico, como por exemplo,
tele-imersão (Vedula et al., 2002), segurança baseada em vídeo (Zelnik-Manor
and Irani, 2001), criação de mosaicos a partir de múltiplos vídeos (Caspi and
Irani, 2001) e análise de lances duvidosos em eventos esportivos (Reid and
Zisserman, 1996), observa-se a necessidade da estimação numa fase anterior
do alinhamento espaço-temporal das múltiplas sequências de vídeo referentes
ao evento físico monitorado.
Neste contexto, nota-se uma demanda crescente por métodos automáti-
cos eficazes para a estimação do alinhamento espaço-temporal entre múlti-
plas sequências, especialmente sequências previamente gravadas onde o uso
de hardwares de sincronização é inviável. Sendo assim, esta tese propõe uma
nova abordagem cujo objetivo principal consiste em avançar no desenvolvi-
mento de novas metodologias para se estimar com grande exatidão o alinha-
mento espaço-temporal entre não somente duas sequências, como a maioria
dos métodos existentes, mas entre um conjunto genérico de N sequências.
Especificamente, este trabalho considera conjuntos de sequências que pos-
suam uma certa sobreposição entre seus campos de visão, isto é, dadas N
sequências, é possível identificar tuplas correspondentes de pixels $(x_1, y_1, t_1) \in S_1, \ldots, (x_N, y_N, t_N) \in S_N$, onde todas estas tuplas são formadas pelas pro-
jeções de um único ponto na cena. Além disso, a abordagem apresentada
neste trabalho considera que todas as regiões de sobreposição contenham
algum movimento não-rígido.
Nós acreditamos que qualquer solução suficientemente genérica para o
problema de alinhamento espaço-temporal deva tratar os seguintes casos:
Taxas de quadros desconhecidas: as taxas de quadros das sequên-
cias são desconhecidas e podem apresentar qualquer valor.
Deslocamento temporal arbitrário: o deslocamento temporal entre
as sequências é desconhecido e pode ser arbitrariamente grande.
Movimento desconhecido: o movimento 3D dos objetos na cena é
desconhecido e suas características são arbitrárias.
Falhas no rastreamento: pontos de interesse na cena não podem ser
rastreados de forma confiável ao longo de toda a sequência.
Geometria epipolar desconhecida: a geometria epipolar das câme-
ras de vídeo é desconhecida.
Escalabilidade: a eficiência computacional da metodologia utilizada
deve cair proporcionalmente ao aumento no número de sequências.
Ausência de pontos estáticos: nenhum ponto visível na cena per-
manece estático para dois ou mais quadros.
Neste sentido, esta tese apresenta uma abordagem inovadora que trata to-
dos os casos acima mencionados, com exceção do último caso. Em particular,
nós assumimos que para cada par de sequências de vídeo é possível identi-
ficar pontos estáticos suficientes na cena que permitam a obtenção de uma
estimativa inicial da geometria epipolar das câmeras. Além disso, com o ob-
jetivo de se assegurar que os parâmetros desta estimativa inicial permaneçam
constantes durante a aplicação de nossa abordagem, nós consideramos um
cenário no qual as câmeras são estáticas e apresentam parâmetros intrínsecos
e extrínsecos constantes e desconhecidos.
A idéia básica de nossa abordagem está fundamentada na definição de
uma reta N-dimensional que captura globalmente as relações temporais en-
tre todas as N sequências de vídeo. Uma propriedade fundamental desta
reta é que, embora seu conhecimento implique no conhecimento do
alinhamento temporal entre as sequências, a estimativa de pontos sobre a
mesma pode ser realizada sem o conhecimento prévio de tal alinhamento.
Utilizando-se esta propriedade como ponto de partida, a abordagem pro-
posta neste trabalho reduz o problema de se estimar o alinhamento temporal
entre N sequências para o problema de se estimar de forma robusta uma
única reta N-dimensional a partir de um conjunto de pontos gerados em $\mathbb{R}^N$.
Grande parte das soluções existentes na literatura para o problema de se
adquirir o alinhamento temporal realizam uma pesquisa explícita em todo o
espaço de soluções de alinhamentos possíveis (Caspi et al., 2002; Rao et al.,
2003; Wolf and Zomet, 2002a,b; Lee et al., 2000; Stein, 1998). Infelizmente,
a natureza combinatória desta pesquisa requer o estabelecimento de várias
hipóteses adicionais para torná-la gerenciável, como por exemplo, deve-se
conhecer a priori as taxas de quadros das sequências, deve-se assumir sem-
pre N = 2, que o desalinhamento temporal é um inteiro e ainda que tal
desalinhamento resida dentro de uma pequena faixa limitada definida pelo
usuário (tipicamente menor do que cinquenta quadros). Consequentemente,
ainda que a maior parte destas soluções possam ser utilizadas diante da ine-
xistência de pontos estáticos na cena, suas eficiências computacionais podem
limitar bastante suas aplicabilidades. Diferentemente de tais técnicas, nossa
abordagem alinha N sequências num único passo, pode tratar desalinhamen-
tos temporais arbitrariamente grandes e não requer qualquer informação a
priori sobre as relações temporais entre as mesmas.
A metodologia proposta nesta tese está mais diretamente relacionada com
a abordagem proposta em Caspi et al. (2002). Neste trabalho, a técnica pro-
posta pelos autores recupera a geometria epipolar e o desalinhamento tem-
poral entre as sequências a partir da trajetória no plano de imagem de um
único ponto na cena que é visível em ambas as sequências. Posteriormente,
a geometria epipolar e o desalinhamento temporal estimados são refinados
usando-se mais pontos. Para fazerem isso, os autores assumem que as taxas
de quadros são conhecidas e formulam um problema de otimização não-linear
para estimar os parâmetros refinados que capturam a geometria epipolar e
o desalinhamento temporal. Infelizmente, a natureza altamente não-linear
deste processo de otimização necessita da aquisição de boas estimativas ini-
ciais para a geometria epipolar e o desalinhamento temporal. É importante
ainda mencionar que a abordagem proposta em Caspi et al. (2002) assume
que um único ponto na cena possa ser rastreado de forma confiável ao longo
de toda a sequência, o que pode ser difícil de se realizar no caso de vídeos
de cenas complexas, onde o rastreamento pode falhar devido a diversos pro-
blemas, como por exemplo, problemas de oclusão. Nossa solução, por outro
lado, requer a habilidade de se rastrear pontos na cena somente ao longo de
dois quadros consecutivos da mesma sequência. Além disso, ela não requer a
habilidade de se estabelecer correspondência de pontos entre as sequências.
A seguir será apresentada uma breve descrição da metodologia desen-
volvida nesta tese, a qual pode ser dividida em duas etapas principais. Em
sua primeira etapa, chamada de Alinhamento Temporal, realiza-se uma
estimativa inicial dos parâmetros que recuperam o alinhamento temporal
entre as sequências, utilizando-se para isso as trajetórias de pontos de inte-
resse que se movem na cena bem como as estimativas iniciais das matrizes
fundamentais que recuperam o alinhamento espacial entre as sequências.
Na segunda e última etapa de nossa metodologia, chamada de Alinhamento
Espaço-Temporal, um processo de otimização linear é formulado para se
refinar os parâmetros temporais e espaciais estimados inicialmente, os
quais podem possuir erros significativos relacionados com as limitações
impostas pelas técnicas utilizadas para se recuperar as estimativas iniciais
das geometrias epipolares de pares de câmeras bem como pelas técnicas
utilizadas para se adquirir as trajetórias dos objetos móveis.
Alinhamento Temporal
Considere uma cena dinâmica monitorada simultaneamente por N câme-
ras localizadas em diferentes pontos de vista, onde tais câmeras obedecem
ao modelo de projeção em perspectiva. Assuma que cada câmera capture
quadros a uma taxa constante e desconhecida. Além disso, considere que
as câmeras não estejam sincronizadas, isto é, elas começam a capturar qua-
dros em diferentes instantes de tempo e possivelmente com taxas de quadros
distintas. Com o objetivo de se alinhar temporalmente as sequências de ví-
deo resultantes, deve-se determinar as correspondências entre os números dos
quadros de uma sequência de “referência” com os números dos quadros em
todas as outras sequências. Esta correspondência pode ser expressa como
um conjunto de equações lineares do tipo:
$$f_i = \alpha_i f_r + \beta_i, \qquad (1)$$
onde $f_i$ e $f_r$ são os números dos quadros da i-ésima sequência e da sequência
de referência, respectivamente, e $\alpha_i$, $\beta_i$ são constantes desconhecidas que
representam a dilatação e o deslocamento temporal, respectivamente, entre
as sequências. Em geral, estas constantes não são números inteiros.
As relações temporais entre pares de sequências capturadas pela Equação
(1) induzem uma relação global entre os números dos quadros das sequên-
cias de entrada. Especificamente, nós representamos esta relação global pela
reta N-dimensional L a seguir:
$$L = \left\{\, [\alpha_1 \ldots \alpha_N]^T t + [\beta_1 \ldots \beta_N]^T \;\middle|\; t \in \mathbb{R} \,\right\}. \qquad (2)$$
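A título de ilustração, o esboço a seguir (em Python, com nomes hipotéticos) mostra como os parâmetros de L, uma vez estimados, mapeiam um quadro da sequência de referência nos quadros correspondentes das demais sequências, conforme as Equações (1) e (2).

```python
import numpy as np

# Esboço hipotético: dado o quadro f_r da sequência de referência e os
# coeficientes (alpha_i, beta_i) de cada sequência, o ponto correspondente
# sobre a reta L fornece os números de quadro das demais sequências.
def ponto_sobre_L(f_r, alphas, betas):
    """Retorna [f_1, ..., f_N], com f_i = alpha_i * f_r + beta_i (Equação (1))."""
    return np.asarray(alphas, dtype=float) * f_r + np.asarray(betas, dtype=float)

# Exemplo: duas sequências com a mesma taxa de quadros (alpha = 1) e
# deslocamento temporal de 55 quadros em relação à referência.
print(ponto_sobre_L(100.0, alphas=[1.0, 1.0], betas=[0.0, 55.0]))  # [100. 155.]
```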
A propriedade chave desta reta N-dimensional, pela qual a estimativa
de pontos sobre a mesma pode ser realizada sem o conhecimento prévio do
alinhamento temporal entre as sequências, permite a elaboração de um algo-
ritmo simples para sua reconstrução a partir das trajetórias de características
de interesse que se movem na cena e são visíveis por duas ou mais sequências.
Especificamente, seja $q_1(f_1)$ a projeção instantânea na câmera 1, no quadro $f_1$,
de um ponto móvel na cena, expressa em coordenadas 2D homogêneas,
como ilustrado na Figura 3.1. Além disso, seja $q_i(\cdot)$ a trajetória formada pela
projeção deste ponto móvel na câmera i e suponha que a matriz fundamental
$F_{1i}$ entre as câmeras 1 e i seja conhecida para todo i, onde i = 2...N.
Neste caso, a câmera 1 é considerada a câmera de referência.
Se o ponto de interesse na cena é visível por todas as câmeras quando
o quadro $f_1$ é capturado pela câmera 1, tem-se a seguinte restrição:
O conjunto:
$$V_{q_1(f_1)} = \left\{\, [f_1 \ldots f_N]^T \;\middle|\; q_1(f_1)^T F_{1i}\, q_i(f_i) = 0,\ i = 2 \ldots N \,\right\}$$
contém pelo menos um ponto sobre a reta N-dimensional L.
Intuitivamente, a restrição acima pode ser imaginada como um procedi-
mento para se gerar um conjunto $V_{q_1(f_1)}$ de alinhamentos temporais “candi-
datos”, o qual deve conter pelo menos um ponto sobre a reta N-dimensional
procurada. Além disso, esta restrição nos diz que tal conjunto pode ser criado
de acordo com os seguintes passos principais: (1) intersectar a reta epipolar
de $q_1(f_1)$ com a trajetória $q_i(\cdot)$ na câmera i, (2) armazenar o(s) número(s)
dos quadro(s) correspondente(s) ao(s) ponto(s) de interseção na câmera i e
(3) gerar vetores de alinhamento temporal a partir dos números dos quadros
armazenados.
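Um esboço simplificado desse procedimento, para um único par de câmeras, é apresentado a seguir (interface e nomes hipotéticos); assume-se que a trajetória na câmera i está amostrada quadro a quadro, em coordenadas homogêneas.

```python
import numpy as np

# Esboço hipotético do passo de votação para um par de câmeras: intersecta a
# reta epipolar l = F_1i^T q1 com cada segmento [q_i(f), q_i(f+1)] da trajetória
# na câmera i e devolve os números de quadro (possivelmente fracionários)
# correspondentes aos pontos de interseção.
def quadros_candidatos(q1, F_1i, trajetoria_i):
    """q1: ponto homogêneo (3,); trajetoria_i: lista de pontos homogêneos, um por quadro."""
    l = F_1i.T @ q1                      # reta epipolar de q1 na câmera i
    candidatos = []
    for f in range(len(trajetoria_i) - 1):
        a, b = trajetoria_i[f], trajetoria_i[f + 1]
        da, db = l @ a, l @ b            # valores algébricos (com sinal) em relação à reta
        if da * db <= 0 and da != db:    # o segmento cruza a reta epipolar
            s = da / (da - db)           # fração ao longo do segmento
            candidatos.append(f + s)     # número de quadro fracionário f_i
    return candidatos
```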
Para que se possa aplicar a restrição acima apresentada, deve-se conhecer
a priori as matrizes fundamentais $F_{ij}$, as quais descrevem a geometria
epipolar de cada par de câmeras (i, j). Na prática, nossa abordagem obtém
uma estimativa inicial de $F_{ij}$ por meio da identificação de pontos estáticos
no plano de fundo da cena, os quais sejam visíveis por duas ou mais câmeras.
Uma vez que a reta N-dimensional L tenha sido reconstruída, isto é, uma
vez que a estimação do alinhamento temporal entre as sequências tenha sido
realizada, nossa abordagem realiza um processo de otimização linear dos
parâmetros de L juntamente com os parâmetros das matrizes fundamentais
que descrevem a geometria da cena. A seguir, será descrito o algoritmo
proposto nesta tese para se reconstruir L.
Reconstrução da reta N-dimensional L
Pode-se notar que a restrição descrita na seção anterior nos leva a um
algoritmo baseado em um processo de votação para se reconstruir a reta
N-dimensional que recuperará o alinhamento temporal entre N sequências.
Especificamente, este algoritmo opera em duas etapas principais. Na primeira
etapa, escolhe-se uma das sequências de vídeo como sendo a sequência de
“referência” e utiliza-se as posições instantâneas $q_r(f_r)$ de cada trajetória
$q_r(\cdot)$ desta sequência de referência juntamente com as trajetórias completas
$q_i(\cdot)$ das outras sequências com o objetivo de se estimar $V_{q_r(f_r)}$ para cada
$q_r(f_r)$. Na segunda etapa, estima-se a reta N-dimensional L a partir da
nuvem de pontos formada pela união dos conjuntos $V_{q_r(f_r)}$. Sendo assim,
para especificarmos completamente este algoritmo, três questões principais
devem ser respondidas:
1. Como são calculadas as trajetórias $q_i(\cdot)$ dos pontos de interesse na cena, para i = 1, ..., N?
2. Como é estimado o conjunto $V_{q_r(f_r)}$ para cada $q_r(f_r)$?
3. Como são calculados os parâmetros de L?
Para se calcular as trajetórias $q_i(\cdot)$, nossa metodologia utiliza um rastrea-
dor de características, o qual é tratado por nosso algoritmo como uma “caixa
preta” responsável por retornar uma lista de segmentos de reta de caracterís-
ticas correspondentes para todo par de quadros consecutivos. É importante
ressaltar que nosso algoritmo independe do rastreador utilizado. Assim, a
escolha de uma determinada metodologia de rastreamento depende somente
da complexidade da cena e das propriedades das características de interesse.
No que se refere ao cálculo do conjunto $V_{q_r(f_r)}$ para um dado $q_r(f_r)$, nosso
algoritmo utiliza as estimativas iniciais das matrizes fundamentais $F_{ij}$, entre
cada par (i, j) de câmeras, assim como também os segmentos de reta forne-
cidos pelo rastreador. Quando um segmento de reta específico intersecta a
reta epipolar de $q_r(f_r)$, define-se um número de quadro possivelmente não
inteiro $f_i$, o qual representa a i-ésima coordenada de um elemento potencial
de $V_{q_r(f_r)}$. Para se gerar $V_{q_r(f_r)}$, agrupa-se todas as coordenadas $f_i$ calcula-
das para todas as sequências e concatena-se as mesmas de tal forma que elas
construam vetores N-dimensionais válidos, os quais representam alinhamen-
tos temporais candidatos em um espaço de votos. Estes passos são ilustrados
na Figura 3.4(a)-(d) para o conjunto de duas sequências de vídeo reais exibidas
na Figura 5.1.
Observe que se a reta epipolar de $q_r(f_r)$ intersecta dois ou mais segmentos
de reta em cada uma das $N-1$ câmeras restantes, tem-se um total de $2^{N-1}$
possíveis maneiras de se concatenar em um vetor N-dimensional as coorde-
nadas $f_i$ estimadas. Para se evitar a inclusão de um número exponencial de
vetores em $V_{q_r(f_r)}$, nosso algoritmo somente inclui aqueles que sejam consis-
tentes com a geometria epipolar das câmeras. Em particular, concatena-se
as coordenadas $f_i$ e $f_j$ para as câmeras i e j, respectivamente, se os pontos
de interseção que as definiram estão próximos de suas retas epipolares corres-
pondentes. Observe que nosso procedimento de concatenação é conservador,
isto é, ele garante que o conjunto de vetores gerado desta maneira será um
superconjunto de $V_{q_r(f_r)}$.
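O teste de consistência usado nessa concatenação pode ser esboçado como a seguir (implementação hipotética, apenas ilustrativa): as coordenadas só são combinadas se cada ponto de interseção está suficientemente próximo da reta epipolar induzida pelo outro, dentro de uma tolerância dada pelo erro médio das matrizes fundamentais.

```python
import numpy as np

def distancia_ponto_reta(p, l):
    """Distância euclidiana do ponto homogêneo p (última coordenada 1) à reta homogênea l."""
    return abs(l @ p) / np.hypot(l[0], l[1])

# Esboço hipotético do teste conservador; assume-se a convenção q_i^T F_ij q_j = 0.
def consistentes(q_i, q_j, F_ij, e):
    d_j = distancia_ponto_reta(q_j, F_ij.T @ q_i)   # q_j vs. reta epipolar de q_i
    d_i = distancia_ponto_reta(q_i, F_ij @ q_j)     # q_i vs. reta epipolar de q_j
    return d_i <= e and d_j <= e                    # e: erro médio das matrizes fundamentais
```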
Em geral, o conjunto de todos os alinhamentos temporais candidatos con-
tém um grande número de dados espúrios (do inglês, outliers), como ilustrado
na Figura 3.4(d). Neste contexto, para se reconstruir a reta N-dimensional
L, nossa metodologia se baseia no uso de um algoritmo bastante robusto
à presença de tais dados, conhecido como RANSAC (Fischler and Bolles,
1981), o qual é descrito detalhadamente no Apêndice A. Basicamente, o al-
goritmo RANSAC escolhe aleatoriamente um par de alinhamentos temporais
candidatos para se definir a reta L e, em seguida, calcula o número total de
candidatos que se encontram a uma distância máxima desta reta. Estes
dois passos são repetidos durante um número determinado de iterações. Por-
tanto, os dois parâmetros críticos deste algoritmo são o número z de iterações
e a distância máxima. Para se determinar z, utiliza-se a seguinte equação:

$$z = \frac{\log(1-p)}{\log(1-r^2)}, \qquad (3)$$
onde p é um parâmetro especificado a priori cujo valor varia de 0 a 1 e r é
a probabilidade de um candidato aleatoriamente selecionado ser um ponto
sobre L (do inglês, inlier). A Equação (3) expressa o fato de que z deva
ser suficientemente grande para assegurar que, com probabilidade p, pelo
menos um dos pares de candidatos selecionados aleatoriamente esteja sobre
L. Em nossos experimentos, foram utilizados p = 0.99 e r = 0.05, os quais são
valores conservadores que levaram aos resultados mais exatos de alinhamento
espaço-temporal. A distância máxima é calculada pela distância média
entre características estáticas identificadas nas sequências de entrada e suas
respectivas retas epipolares.
Finalmente, após a estimativa realizada pelo RANSAC do potencial con-
junto de candidatos sobre os quais a reta L deva passar, nossa metodologia
utiliza o método dos mínimos quadrados sobre este conjunto para se estimar
os parâmetros de L. A seguir, será apresentada a segunda etapa de nossa
metodologia, a qual consiste na realização de um processo de otimização dos
parâmetros temporais e espaciais estimados na primeira etapa.
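Um esboço dessa etapa de estimação robusta (em Python, com interface hipotética) é mostrado a seguir; o ajuste final por mínimos quadrados é feito aqui via componente principal dos inliers, uma das formas possíveis de realizar esse ajuste.

```python
import numpy as np

# Esboço hipotético do ajuste robusto da reta L em R^N: RANSAC com o número
# de iterações z dado pela Equação (3), seguido de mínimos quadrados sobre os inliers.
def ajustar_reta_L(votos, dist_max, p=0.99, r=0.05, rng=np.random.default_rng(0)):
    votos = np.asarray(votos, dtype=float)               # (M, N) candidatos
    z = int(np.ceil(np.log(1 - p) / np.log(1 - r**2)))   # Equação (3)
    melhor_inliers = None
    for _ in range(z):
        a, b = votos[rng.choice(len(votos), 2, replace=False)]
        d = b - a
        n = np.linalg.norm(d)
        if n == 0:
            continue
        d /= n
        # distância de cada voto à reta que passa por a com direção d
        res = (votos - a) - np.outer((votos - a) @ d, d)
        inliers = np.linalg.norm(res, axis=1) <= dist_max
        if melhor_inliers is None or inliers.sum() > melhor_inliers.sum():
            melhor_inliers = inliers
    # mínimos quadrados: a reta passa pelo centróide dos inliers, na direção
    # do primeiro componente principal
    X = votos[melhor_inliers]
    centro = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - centro)
    return centro, Vt[0]                                  # ponto e direção de L
```

No texto, foram utilizados p = 0.99 e r = 0.05, e a distância máxima é dada pela distância média entre características estáticas e suas retas epipolares.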
Alinhamento Espaço-Temporal
Embora as imagens de uma cena dinâmica possam conter pontos estáti-
cos em seus planos de fundo, observa-se que tais pontos frequentemente não
representam a maior parte das características detectáveis na cena. Qualquer
metodologia que busque estimar a geometria epipolar entre um par de câme-
ras exclusivamente a partir deste conjunto de características estáticas estará
provavelmente ignorando um conjunto significativamente grande de informa-
ções relevantes disponíveis nas imagens. Na prática, esta conduta poderá
causar erros nas matrizes fundamentais calculadas e, em última análise, na
reta N-dimensional L calculada por nosso método. A seguir, será brevemente
mostrado como nossa metodologia refina as matrizes fundamentais $F_{ij}$ e os
parâmetros de L por meio da consideração de todas as características detec-
tadas nas sequências. Sem perda de generalidade, a câmera 1 é assumida
como sendo a câmera de referência.
Seja $q_1(f_1)$ a projeção de um ponto da cena no plano de imagem da
câmera 1 durante a aquisição do quadro $f_1$. Além disso, suponha que a
projeção instantânea de tal ponto da cena em uma câmera i no quadro $f_i$
possa ser parametrizada como a seguir:

$$q_i(f_i) = (1 - f_i)\, q_i(f_a) + f_i\, q_i(f_b), \qquad (4)$$
onde $q_i(f_a)$ e $q_i(f_b)$ são os extremos de um segmento linear, o qual contém
a posição $q_i(f_i)$. Observe que esta equação permite estabelecer uma restri-
ção envolvendo os parâmetros de L e os parâmetros da matriz fundamental
$F_{1i}$. Especificamente, pode-se utilizar os parâmetros de L para se calcular
a posição instantânea $q_i(f_i)$ correspondente à posição instantânea $q_1(f_1)$ na
câmera 1, como se segue:
$$q_i(f_i) = \left[1 - (\alpha_i f_1 + \beta_i)\right] q_i(f_a) + (\alpha_i f_1 + \beta_i)\, q_i(f_b). \qquad (5)$$
Além disso, sendo $q_i(f_i)$ e $q_1(f_1)$ pontos correspondentes, tem-se que:

$$q_1(f_1)^T F_{1i}\, q_i(f_i) = 0. \qquad (6)$$
Combinando-se as Equações (5) e (6), obtém-se uma equação homogênea
que leva em conta os parâmetros estimados de L, isto é, os parâmetros $\alpha_i$,
$\beta_i$, bem como os parâmetros estimados da matriz fundamental $F_{1i}$:
$$(1 - \alpha_i f_1 - \beta_i)\, q_1(f_1)^T F_{1i}\, q_i(f_a) + (\alpha_i f_1 + \beta_i)\, q_1(f_1)^T F_{1i}\, q_i(f_b) = 0. \qquad (7)$$
Na prática, erros nos parâmetros de L e $F_{1i}$ fazem com que a Equação
(7) não seja satisfeita exatamente, originando-se um resíduo que representa
a distância algébrica entre $q_i(f_i)$ e sua reta epipolar correspondente. Para
se refinar a estimativa inicial dos parâmetros temporais e espaciais, nossa
técnica realiza a expansão da Equação (7) ao introduzir as incógnitas $\Delta F_{1i}$,
$\Delta\alpha_i$ e $\Delta\beta_i$ na mesma, de forma que:

$$F_{1i} \rightarrow F_{1i} + \Delta F_{1i}, \qquad \alpha_i \rightarrow \alpha_i + \Delta\alpha_i, \qquad \beta_i \rightarrow \beta_i + \Delta\beta_i.$$

Dada a equação expandida, determina-se as incógnitas $\Delta F_{1i}$, $\Delta\alpha_i$ e $\Delta\beta_i$
que minimizam seu resíduo. Note que para cada ponto na cena e cada qua-
dro no qual ele é visível, nossa técnica obtém uma equação em função dos
parâmetros desconhecidos que especificam completamente $\Delta F_{1i}$, $\Delta\alpha_i$ e $\Delta\beta_i$.
Esta equação é não linear pelo fato de conter os termos de segunda ordem
$(\Delta\alpha_i\,\Delta F_{1i})$ e $(\Delta\beta_i\,\Delta F_{1i})$. Em nossa implementação, nós simplificamos os cál-
culos a serem realizados ao ignorarmos estes termos não lineares e resolvermos
o sistema resultante sobredeterminado de equações lineares.
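A título de ilustração, segue um esboço hipotético de uma iteração desse refinamento para um par de câmeras; por simplicidade, restrições adicionais sobre F (escala e posto) e a acumulação de observações de vários pontos ao longo do tempo são omitidas.

```python
import numpy as np

# Esboço hipotético: para cada observação (q1, q_a, q_b, f1), a Equação (7) é
# expandida em F <- F + dF, alpha <- alpha + dalpha, beta <- beta + dbeta; os
# termos de segunda ordem são descartados e o sistema linear sobredeterminado
# resultante é resolvido por mínimos quadrados.
def refinar_uma_iteracao(F, alpha, beta, observacoes):
    A, b = [], []
    for q1, qa, qb, f1 in observacoes:                   # pontos homogêneos (3,)
        s = alpha * f1 + beta
        g = (1 - s) * (q1 @ F @ qa) + s * (q1 @ F @ qb)  # resíduo atual da Eq. (7)
        coef_F = ((1 - s) * np.outer(q1, qa) + s * np.outer(q1, qb)).ravel()
        dg_ds = q1 @ F @ (qb - qa)                       # derivada em relação a (alpha*f1+beta)
        A.append(np.concatenate([coef_F, [f1 * dg_ds, dg_ds]]))
        b.append(-g)
    x, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    dF, dalpha, dbeta = x[:9].reshape(3, 3), x[9], x[10]
    return F + dF, alpha + dalpha, beta + dbeta
```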
Em geral, as trajetórias de um objeto móvel na cena são altamente não
lineares. Para superarmos este problema, a técnica de refinamento acima
descrita é aplicada a um conjunto de segmentos de reta que aproximam a
trajetória original realizada pelo ponto na cena (Horst and Beichl, 1997),
como ilustrado na Figura 4.1. A seguir são apresentados os resultados
experimentais que comprovam a eficácia da metodologia proposta neste
trabalho.
Resultados experimentais
Os experimentos realizados neste trabalho são divididos em dois grupos
principais. No primeiro grupo, realizamos experimentos com sequências
de vídeo reais para avaliarmos e demonstrarmos a aplicabilidade de nossa
abordagem, sobretudo em cenários desafiadores para os demais métodos
encontrados na literatura, como por exemplo, cenários nos quais o sistema
de rastreamento não é confiável ao longo de toda a sequência e estimativas
iniciais exatas para as matrizes fundamentais não são possíveis. Por
outro lado, para avaliarmos cuidadosamente a escalabilidade e exatidão de
nossa metodologia diante da variação de alguns parâmetros críticos para a
qualidade dos alinhamentos espaço-temporais calculados, um segundo grupo
de experimentos com sequências sintéticas de uma cena artificial foi reali-
zado. A seguir, os resultados experimentais obtidos são brevemente descritos.
Sequências reais
As sequências de vídeo reais utilizadas em nossos experimentos con-
tém dois e três pontos de vista distintos do evento físico monitorado,
sendo representantes de cenários bastante diversos onde, por exemplo,
as sequências possuem comprimentos que variam de 55 a 605 quadros,
os desalinhamentos temporais variam de 21 a 285 quadros, as imagens
podem apresentar desde uma alta qualidade (sequências capturadas em
laboratório) a uma baixa qualidade (video clipes de transmissões de TV),
as características de interesse podem se mover desde vários pixels a menos
do que um pixel por quadro e as razões entre as taxas de quadros de
pares de sequência variam entre 1 e 2. Em todos os casos, as dimensões
das imagens são 320 × 240 pixels. A exatidão dos alinhamentos temporais
calculados é avaliada com base no cálculo do erro de alinhamento temporal
médio. Este erro é dado pela média das diferenças absolutas entre as
coordenadas temporais calculadas utilizando-se a reta L estimada por nosso
método e as coordenadas temporais calculadas usando-se a reta L de refe-
rência, a qual é determinada a partir de um processo de alinhamento manual.
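Essa métrica pode ser expressa, de forma esquemática (nomes hipotéticos), como a seguir.

```python
import numpy as np

# Esboço hipotético da métrica: erro médio de alinhamento temporal entre a reta
# L estimada e a reta L de referência (obtida por alinhamento manual), avaliado
# sobre os quadros da sequência de referência.
def erro_medio_alinhamento(alpha, beta, alpha_ref, beta_ref, quadros_ref):
    f_r = np.asarray(quadros_ref, dtype=float)
    return np.mean(np.abs((alpha * f_r + beta) - (alpha_ref * f_r + beta_ref)))
```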
Sequência 1
Como um primeiro experimento, nossa técnica de alinhamento espaço-
temporal foi aplicada a duas sequências de vídeo (veja Figura 5.1) também
utilizadas em Caspi and Irani (2000). Basicamente, o objetivo deste expe-
rimento inicial era comparar a eficácia de nossa técnica com a eficácia da
metodologia desenvolvida por estes autores. As duas sequências adquirem
quadros a uma taxa de 25qps (α = 1) e possuem um desalinhamento tempo-
ral de referência dado por: β = 55 ± 0.5 quadros.
A Figura 3.4(d) exibe a reta L reconstruída utilizando-se a primeira
etapa de nossa metodologia. Esta estimativa inicial de L leva a um erro de
alinhamento temporal médio de 0.66 quadros. Entretanto, após a aplicação
de nossa técnica de refinamento, a nova reta estimada levou a um erro
ainda menor de 0.35 quadros. Note que os parâmetros calculados por nosso
método são tão exatos quanto aqueles calculados em Caspi and Irani (2000),
embora nossa metodologia se baseie na solução de um problema muito me-
nos restritivo (por exemplo, α é desconhecido e a cena não precisa ser planar).
Sequência 2
O segundo experimento foi realizado com duas sequências de mesma taxa
de quadros (30qps) adquiridas em laboratório. Neste caso, os objetivos prin-
cipais eram mostrar que nossa metodologia independe do fato da cena ser
ou não planar e do tamanho do desalinhamento temporal entre as sequên-
cias. Especificamente, a cena monitorada é constituída por objetos (robôs)
que se movem em dois planos distintos, como ilustrado na Figura 5.3. Os
parâmetros temporais são α = 1 e β = 284.5 ± 2 quadros. Observe que o
desalinhamento temporal entre as sequências é significativamente grande.
A reta L inicialmente reconstruída é mostrada na Figura 5.4, a qual leva
a um erro de alinhamento temporal médio de 5.84 quadros. Entretanto, após
a aplicação da técnica de refinamento, a nova reta estimada diminuiu este
erro para 4.43 quadros. Dado que os robôs se movem muito lentamente, este
erro de alinhamento temporal aparentemente significativo não é perceptível
visualmente. A Figura 5.7 confirma a boa qualidade do alinhamento
temporal recuperado.
Sequência 3
Neste experimento, duas pessoas são monitoradas por um par de câmeras
com mesma taxa de quadros (30qps), enquanto realizam um malabarismo
com cinco bolas coloridas em suas mãos (veja Figura 5.5). Estas sequências
representam um cenário bastante difícil para todos os métodos existentes,
uma vez que (1) as trajetórias das diferentes bolas se sobrepõem no mundo
3D, (2) as trajetórias individuais são aproximadamente cíclicas e (3) as bolas
se movem muito rapidamente (mais de 9 pixels por quadro). Os parâmetros
temporais são α = 1 e β = 41 ± 0.5 quadros. Para tornar o problema de se
estimar o alinhamento temporal ainda mais desafiador, uma das sequências de
entrada foi inicialmente modificada por meio da inserção de novos quadros e,
posteriormente, por meio da exclusão de quadros. Estas modificações foram
realizadas para se simular sequências que apresentam mais do que uma taxa
de quadros e sequências que possuem vídeo clipes intermediários, como por
exemplo, comerciais de TV.
As Figuras 5.6(a)-(c) mostram as retas reconstruídas antes do refina-
mento, as quais capturam os alinhamentos temporais entre as sequências.
Para as sequências originais, obtivemos erros de alinhamento temporal de
0.75 e 0.26 quadros antes e após o refinamento, respectivamente.
Sequência 4
Neste último experimento com sequências reais, nossa técnica foi aplicada
a três sequências de vídeo referentes a uma partida de futebol (veja Figura
5.9). As sequências apresentam baixa qualidade e foram adquiridas a partir
de câmeras que se moviam. Neste contexto, para que pudéssemos utilizar
nossa metodologia, foi preciso compensarmos os movimentos das câmeras
numa fase anterior (Brown and Lowe, 2003).
Uma vez que agora três sequências de vídeo são consideradas, a reta L
calculada por nossa metodologia é uma reta 3D, como ilustrada nas Figuras
5.10(b) e 5.10(c). Uma percepção visual da boa qualidade do alinhamento
temporal estimado por nossa técnica é fornecida pela Figura 5.11.
Sequências sintéticas
Na segunda etapa de nossos experimentos, um software foi desenvolvido
para se simular as cinemáticas de partículas 3D com movimentos indepen-
dentes, bem como a visualização das mesmas a partir de múltiplos pontos de
vista, gerando-se assim sequências de vídeo sintéticas. Por meio da utilização
de tais dados sintéticos, foi possível controlar e avaliar os impactos de valores
específicos atribuídos a parâmetros considerados críticos para a exatidão e
escalabilidade de nossa metodologia.
Em particular, com estes experimentos foram analisados os efeitos da
variação de três parâmetros principais: (1) número de objetos rastreados,
(2) erro na geometria epipolar de cada par de câmeras e, finalmente, (3) o
nível de ruído do sistema de rastreamento.
A cena artificial gerada simula a cena ilustrada na Figura 5.12(a), a qual
contém duas câmeras calibradas, cujos parâmetros intrínsecos e extrínsecos
são aqueles das câmeras utilizadas nos experimentos com sequências reais
obtidas em laboratório. As trajetórias das partículas aleatórias eram geradas
no interior de uma esfera 3D, cujo raio foi definido empiricamente de forma a
maximizar suas projeções nos planos de imagem de ambas as câmeras (veja
Figura 5.12(b)). Cada partícula era iniciada aleatoriamente em uma posição
uniformemente distribuída no interior desta esfera. O modelo utilizado para
a cinemática de uma partícula específica é ilustrado na Figura 5.13.
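Apenas como ilustração do procedimento de inicialização (implementação hipotética, não a do software original), posições uniformemente distribuídas no interior de uma esfera podem ser geradas como a seguir.

```python
import numpy as np

# Esboço hipotético: k posições iniciais uniformemente distribuídas no interior
# de uma esfera 3D de raio `raio` centrada em `centro`.
def posicoes_iniciais(k, raio, centro, rng=np.random.default_rng(0)):
    direcoes = rng.normal(size=(k, 3))
    direcoes /= np.linalg.norm(direcoes, axis=1, keepdims=True)
    raios = raio * rng.random(k) ** (1 / 3)   # r ~ U^(1/3) garante densidade uniforme no volume
    return np.asarray(centro) + direcoes * raios[:, None]
```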
Considerando-se conjuntos de k partículas ($k \in \{1, 2, 4, 8, 16, 32\}$), dois
cenários principais foram concebidos para a realização dos experimentos com
sequências sintéticas. Primeiramente, considerou-se um cenário no qual a
geometria epipolar do par de câmeras continha um erro fixo de 2 pixels e
variou-se o desvio padrão R (em pixels) do ruído gaussiano de média zero
adicionado ao rastreador ($R \in \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$). Por outro lado, no
segundo cenário, fixou-se o desvio padrão do ruído do rastreador em 2 pixels
e variou-se o erro $\varepsilon_f$ da matriz fundamental ($\varepsilon_f \in \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$).
Fixando-se o erro da matriz fundamental e o desvio padrão do ruído do
rastreador em 2 pixels, simulou-se situações mais realísticas, uma vez que
tal valor foi o pior caso para as matrizes fundamentais calculadas para as
sequências reais e representa na média o desvio padrão do ruído observado
em alguns dos rastreadores usados na prática. Para cada tupla $(\varepsilon_f, R, k)$
em cada um dos dois cenários acima descritos, foram simulados 100 eventos
dinâmicos distintos na cena artificial. Em seguida, dados os espaços de votos
dos eventos dinâmicos simulados, foram calculados os percentuais das retas
que capturam o alinhamento temporal entre as sequências que levam a erros
médios de alinhamento temporal menores ou iguais a 1, 2 e 5 quadros.
Analisando-se os resultados obtidos e supondo-se que o limite mínimo
desejável para o percentual de retas que levam a um determinado erro médio
de alinhamento temporal seja 95%, nota-se que para valores de erro na geo-
metria epipolar e no sistema de rastreamento próximos àqueles verificados na
prática, nossa metodologia não somente alcançou como também superou o li-
mite desejável de 95% para diversos valores de número de objetos rastreados,
como ilustrado na Figura 5.16(a). Observe nesta figura que o uso da técnica
de refinamento mostrou-se ser de grande importância, sendo que a melhoria
trazida aos resultados pela mesma caiu bruscamente quando se aumentou o
erro no sistema de rastreamento e na geometria epipolar das câmeras (por
exemplo, veja Figura 5.20(d), onde R = 10).
Considerando-se os valores de erro na geometria epipolar e no rastreador
que foram efetivamente observados na prática ($\varepsilon_f = R = 2$ pixels), observa-se
a partir das Figuras 5.30(a) e 5.33(a) que para aplicações onde se necessita de
alinhamentos temporais bastante exatos ($\varepsilon_f \leq 1$ quadro), nossa metodologia
alcançou um percentual de sucessos de 60%. Por outro lado, considerando-se
aplicações mais flexíveis quanto ao valor do erro de alinhamento temporal
(por exemplo, $\varepsilon_f \leq 5$ quadros), nota-se que até mesmo para rastreadores
bastante corrompidos por ruído (R = 10) e matrizes fundamentais com erros
severos ($\varepsilon_f = 4$), nossa metodologia superou o limite desejável de 95% para
o percentual de sucessos, como ilustrado nas Figuras 5.30(c) e 5.33(c).
Finalmente, nota-se a partir de nossos resultados que para conjuntos
muito pequenos de objetos rastreados (por exemplo, k = 1) ou conjun-
tos muito grandes (por exemplo, k = 32), os percentuais de sucesso de
nossa metodologia diminuem significativamente, como exemplificado na Fi-
gura 5.20(d)-(f). As razões para tal fato estão relacionadas com os volumes
de pontos gerados nos correspondentes espaços de votos. Por exemplo, ob-
serve a Figura 5.14(a)-(f). Quando se tem um pequeno número de objetos
rastreados (veja Figura 5.14(a)), obtém-se consequentemente um espaço de
votos com poucos pontos e, em particular, poucos inliers, o que dificulta o
trabalho do RANSAC para encontrar a reta L desejada. Por outro lado,
quando um número muito grande de objetos rastreados é considerado (veja
Figura 5.14(f)), obtém-se um espaço de votos bastante denso com um volume
grande de dados espúrios, o que também leva a uma queda na qualidade do
conjunto de votos estimados pelo RANSAC e, consequentemente, nos parâ-
metros da reta estimada pelo método dos mínimos quadrados. Em geral,
os melhores resultados de nossa metodologia foram alcançados quando se
considerou conjuntos de aproximadamente 4 objetos rastreados.
Conclusões e propostas de trabalhos futuros
Esta tese propõe uma metodologia inovadora para se alinhar no tempo e
no espaço múltiplas sequências de vídeo adquiridas a partir de pontos de vista
distintos. Especificamente, a abordagem apresentada neste trabalho reduz
o problema de se estimar o alinhamento espaço-temporal entre sequências a
dois subproblemas mais simples: um problema de regressão linear e um pro-
blema de otimização linear, enquanto todos os demais trabalhos relacionados
ao nosso se baseiam na pesquisa da solução desejada em todo o espaço de
alinhamentos possíveis, o que faz com que os mesmos possuam uma natureza
combinatória.
A qualidade dos alinhamentos estimados por nossa abordagem e seu cor-
respondente custo computacional são invariantes à magnitude dos desalinha-
mentos temporais entre as sequências. Além disso, diferentemente dos demais
métodos existentes na literatura, os quais são dedicados a conjuntos de ape-
nas duas sequências de vídeo, nossa abordagem é capaz de alinhar num único
passo um número arbitrário de sequências.
Os experimentos realizados nesta tese demonstraram que nossa aborda-
gem é adequada para se resolver diversos problemas encontrados em aplica-
ções atuais que se beneficiam da disponibilidade de registros simultâneos de
um mesmo evento físico, fornecendo assim uma alternativa interessante às
tecnologias de sincronização de vídeos baseadas em hardware, as quais são
mais caras e mais difíceis de se utilizar em ambientes externos.
Do ponto de vista teórico, este trabalho demonstra sua relevância ao
fornecer evidências adicionais de que, por meio da consideração de pistas tem-
porais e espaciais em uma única metodologia, muitos eventos físicos que são
inerentemente ambíguos para métodos tradicionais de alinhamento imagem-
a-imagem são resolvidos eficazmente por técnicas de alinhamento sequência-
a-sequência.
Como trabalhos futuros, pesquisas teóricas adicionais precisam ser consi-
deradas. Primeiramente, deve-se estudar a criação de um modelo matemá-
tico alternativo para o desalinhamento temporal entre sequências de vídeo
quando as mesmas são adquiridas a partir de câmeras que não trabalham a
taxas de quadros constantes, um fato muitas vezes comum em aplicações de
robótica. Além disso, técnicas alternativas para a estimação inicial da geo-
metria epipolar de cada par de câmeras devem ser concebidas. Atualmente,
nossa metodologia considera a existência de um conjunto suficiente de pontos
estáticos na cena que sejam visíveis pelas câmeras, os quais possam ser uti-
lizados pelo algoritmo dos oito pontos normalizado para se estimar a matriz
fundamental. Sabe-se, entretanto, que este cenário pode não acontecer em
alguns casos.
Finalmente, uma outra direção importante para pesquisa futura consiste
na combinação de nossa metodologia com técnicas de reconstrução de cenas
3D para melhorar a eficácia de tais métodos. Atualmente, estamos pesqui-
sando a extensão da abordagem proposta nesta tese para a realização do
alinhamento espaço-temporal entre não somente câmeras baseadas no mo-
delo de projeção em perspectiva como também câmeras catadióptricas.
To my parents, brothers and sister, who taught me to persevere
and prepared me to face challenges with faith and humility. They
are a constant source of inspiration to my life.
Acknowledgments
It’s been quite a ride getting to this point, and I owe many thanks to
several people for many things.
Special thanks to my advisor, Rodrigo Carceroni, for sharing with me his
great ideas and for giving me the opportunity to get this far. Without his
support, advice and close supervision, this thesis would not be possible.
Thanks a lot to Geraldo Massahud for his invaluable help in designing
and building the system proposed in this thesis for aligning both in time and
space multiple video sequences. It was really funny to work with someone
who has an amazing ability to disagree with almost all conceivable points of
view and a complex nonlinear way of thinking!
Thanks to Professor Kyros Kutulakos, who also contributed significantly
to the development of this work. By working with him when writing our
paper for CVPR, I could understand what Professor Carceroni meant by:
“Professor Kyros has an unstoppable determination to reach perfection and
an extensive knowledge of the rules of the game...”.
I would like to thank the members of my thesis committee, professors
Guilherme Pereira, Hani Yehia, Paulo Carvalho and Siome Goldenstein, for
their suggestions to improve this work.
Many thanks to my colleagues and friends at VERLab, for the excellent
work atmosphere and the valuable discussions during the development of this
thesis. In special, I would like to thank the support of Pedro Shiroma, Vilar
Neto, Guilherme Pereira, Erikson Morais, Wagner Barros, José Pinheiro and
Bruno Pimentel.
I would like to express my gratitude to professors Mario Campos and
Antônio de Pádua Braga for showing me the beauty of a research career.
I would also like to thank my family for the support they provided me
through my entire life and in particular, I must acknowledge my brothers
Pedro and Paulo, who had a direct influence on my choice to pursue a
scientific career.
Special thanks to my girlfriend and best friend, Fabiana, without whose
love and encouragement, I would not have finished this thesis.
Thanks to the administrative staff of the Department of Computer Science
at UFMG, for making things run smoothly.
I thank CNPq for the financial support.
Finally, I thank all people with whom I’ve worked and who participated
directly or indirectly in this work.
Contents
List of Figures xxx
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Outline of this work . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 12
2.1 Spatio-Temporal Alignment . . . . . . . . . . . . . . . . . . . 12
2.2 Single View Geometry . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Extrinsic parameters . . . . . . . . . . . . . . . . . . . 19
2.2.2 Intrinsic parameters . . . . . . . . . . . . . . . . . . . 20
2.3 Multiple View Geometry . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Two-View Geometry . . . . . . . . . . . . . . . . . . . 22
3 Temporal Alignment 30
3.1 The Timeline Constraint . . . . . . . . . . . . . . . . . . . . . 30
3.2 The Temporal Alignment Algorithm . . . . . . . . . . . . . . 33
4 Spatio-Temporal Alignment 41
4.1 The Refinement Algorithm . . . . . . . . . . . . . . . . . . . . 42
4.1.1 Two-View Refinement . . . . . . . . . . . . . . . . . . 42
4.1.2 N-View Refinement . . . . . . . . . . . . . . . . . . . . 47
5 Experimental Results 49
5.1 Real-world Sequences . . . . . . . . . . . . . . . . . . . . . . . 50
5.1.1 Two-view Car Dataset . . . . . . . . . . . . . . . . . . 51
5.1.2 Two-view Robots Dataset . . . . . . . . . . . . . . . . . 53
5.1.3 Two-view Juggling Dataset . . . . . . . . . . . . . . . . 54
5.1.4 Three-view Soccer Dataset . . . . . . . . . . . . . . . . 60
5.2 Synthetic Sequences . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.1 Scene dynamics simulation . . . . . . . . . . . . . . . . 64
5.2.2 Error measurements and experiments’ description . . . 68
5.2.3 Evaluation of the experimental results . . . . . . . . . 71
6 Conclusions 98
6.1 Summary of the Accomplished Work . . . . . . . . . . . . . . 98
6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Bibliography 104
A The RANSAC Algorithm 109
A.1 Hypotheses evaluation . . . . . . . . . . . . . . . . . . . . . . 110
A.2 RANSAC parameters . . . . . . . . . . . . . . . . . . . . . . . 111
A.3 RANSAC efficiency . . . . . . . . . . . . . . . . . . . . . . . . 111
B Tensorial Notation 113
B.1 Tensorial notation and the trifocal tensor . . . . . . . . . . . . 115
C Multiple View Geometry 116
C.1 Three-View Geometry . . . . . . . . . . . . . . . . . . . . . . 116
C.1.1 The Trifocal Tensor T . . . . . . . . . . . . . . . . . . 116
C.2 N-View Geometry . . . . . . . . . . . . . . . . . . . . . . . . 122
List of Figures
1.1 The famous Hand of God goal scored by Maradona in the
1986 FIFA World Cup match between England and Argentina
(FIFA, 2004b). . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Images from two available sequences from the incident regard-
ing the supposed goal scored by the English player Geoff Hurst
in the 1966 FIFA World Cup (Reid and Zisserman, 1996). . . 5
1.3 Example of application of spatio-temporal alignment of multi-
ple sequences for multi-sensor fusion. (a) and (b) are tempo-
rally corresponding frames from the visible-light and Infra-Red
sequences, respectively. (c) shows the results of fusing the two
sequences after spatio-temporal alignment. (Caspi and Irani,
2001). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Alignment and integration of information across multiple
video sequences to exceed the limited spatial-resolution and
limited temporal-resolution of video cameras. (a)-(c) Display
the event captured by three low-resolution sequences. (d)
The reconstructed event as captured by the generated high-
resolution sequence. (e) A close-up image of the distorted ball
in one of the low resolution frames. (f) A close-up image
of the ball at the exact corresponding frame in time in the
high-resolution output sequence (Shechtman et al., 2002). . . . 7
2.1 The pinhole camera model (Hartley and Zisserman, 2003). . . 17
2.2 The homography map induced by a world plane (Hartley and
Zisserman, 2003). The ray corresponding to a point q is ex-
tended to meet the world plane π in a point Q. This point
is projected to a point $q'$ in the other image. The map from
q to $q'$ is the homography induced by the plane π. If $H_1$ and
$H_2$ are the perspectivities from the world plane π to the first
and second image planes, respectively, we have $q = H_1 Q$ and
$q' = H_2 Q$. It is the composition of these two perspectivities
that defines a homography H, $q' = H_2 H_1^{-1} q = Hq$, between
the image planes. . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 The epipolar geometry. . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Geometry of the Timeline Constraint. In this two-
camera example, the point’s trajectory in camera i inter-
sects the epipolar line, $q_1(f_1)^T F_{1i}$, twice. Given the inter-
section points $q_i(f_i)$ and $q_i(f'_i)$, we have the set
$V_{q_1(f_1)} = \{ [f_1 \; f_i]^T, [f_1 \; f'_i]^T \}$. . . . . . . . . . . . . . . . . . . 32
3.2 (a) Trajectory of the car’s pixel centroid in Sequence 1. (b)
Trajectory of the car’s pixel centroid in Sequence 2. The car
was tracked by a simple blob tracker that relies on foreground-
background detection to label all foreground pixels in each frame. 34
3.3 Two intersection points $q_i(f_i)$ and $q_{i+1}(f_{i+1})$ of cameras i and
i+1, respectively, are considered by our approach as potential
representations of the same scene point only if $d_i \leq e$ and
$d_{i+1} \leq e$, where $d_i$ and $d_{i+1}$ are distance values that measure
how close $q_i(f_i)$ and $q_{i+1}(f_{i+1})$ are to each others’ epipolar
lines and e is the fundamental matrices’ average error. . . . . . 36
3.4 (a) Trajectory of a feature in Sequence 1 of the Car dataset,
which is presented in Chapter 5 and illustrated in Figure 3.2.
The feature was the centroid of all pixels labeled as “fore-
ground” by a color-based foreground-background detector. (b)
Trajectory of the foreground pixel centroid in Sequence 2 of the
dataset. Also shown is the epipolar line corresponding to pixel
$q_1(363)$ in (a). (c) Magnified view of the trajectory/epipolar
line intersection in (b). The individual line segments con-
necting feature locations in adjacent frames are now visible.
Note that the epipolar line of $q_1(363)$ intersects multiple line
segments along the trajectory. (d) Exploiting the Timeline
Constraint for two-sequence alignment. Each point represents
a candidate temporal alignment, i.e., an element of $V_{q_1(f_1)}$ for
the feature location, $q_1(f_1)$, in (a). The reconstructed time-
line, drawn as a solid line, describes the temporal alignment
of the two sequences in the Car dataset. . . . . . . . . . . . . 38
4.1 Geometry of our refinement method. . . . . . . . . . . . . . . 43
5.1 Four representative frames (100, 200, 300, 400) from the
cameras 1 and 2, of the two-view Car dataset (Caspi
and Irani, 2000). We can identify the spatial misalign-
ment by observing near image boundaries, where different
static objects are visible in each sequence. The tempo-
ral misalignment is easily identified by comparing the po-
sition of the gate in frames 400. This dataset, along
with more experimental results and videos, are available at
http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . . . 52
5.2 The recovered epipolar geometry for the two-view Car dataset.
Points and their epipolar lines are displayed in each image
for verification. Accuracy of the computed fundamental ma-
trix can be appreciated by the closeness of each point to the
epipolar line of its corresponding point. . . . . . . . . . . . . . 53
5.3 Three representative frames (000, 150, 300) from the cam-
eras 1 and 2, of the two-view Robots dataset. The spa-
tial misalignment can be easily identified by observing
the distinct orientations of the robots’ soccer field. On
the other hand, the temporal misalignment can be no-
ticeable by comparing the position of the robot with the
dark green circle in frames 300. This dataset, along
with more experimental results and videos, are available at
http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . . . 54
5.4 Voting space, timeline, and timeline equation recovered prior
to refinement for the two-view Robots dataset. Each point is
an element of $V_{q_1(f_1)}$ for some feature $q_1(f_1)$ in sequence 1. . . 55
5.5 Four representative frames (100, 115, 120, 130) from the cam-
eras 1 and 2, of the two-view Juggling dataset. This dataset,
along with more experimental results and videos, are available
at http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . 56
5.6 Voting spaces, timelines, and timeline equations recovered
prior to refinement for the two-view Juggling dataset. (a)
Juggling dataset without modification, (b) simulation of a se-
quence with more than one frame rate and (c) simulation of a
sequence with spurious clips. . . . . . . . . . . . . . . . . . . . 57
5.7 Before alignment images were created by superimposing
the green band of a frame $f_2$ with the red and blue bands
of frame $f_1 = (f_2 - \beta^*)/\alpha^*$, using ground truth timeline coeffi-
cients $\alpha^*$ and $\beta^*$. After alignment images were created by
replacing the green band of the images above them with that
of frame $\hat{f}_1 = (f_2 - \beta)/\alpha$, with α, β computed by our algo-
rithm. For both types of images, deviations from the ground-
truth alignment cause “double exposure” artifacts (i.e., when
$f_1 \neq f_2$ or $\hat{f}_1 \neq f_1$, respectively). . . . . . . . . . . . . . . . . 58
5.8 Distribution of distances of inlier votes from the reconstructed time-
line. Left column: Distribution before the timeline refinement
stage. Right column: Distribution after the refinement stage.
Note that the updated epipolar geometry and updated timeline
parameters reduce the distance between inliers and the timeline
and cause more votes to be labeled as inliers. . . . . . . . . . . . 59
5.9 Two representative frames (8, 46) from the cameras 1, 2
and 3, of the three-view Soccer dataset. This dataset, along
with more experimental results and videos, are available at
http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . . . 61
5.10 (a),(b) Two views of the 3D voting space and 3D timeline
computed for the Soccer dataset. . . . . . . . . . . . . . . . . 62
5.11 Before alignment images were created by superimposing
the green band of a frame $f_2$ with the red and blue bands
of frame $f_1 = (f_2 - \beta^*)/\alpha^*$, using ground truth timeline coeffi-
cients $\alpha^*$ and $\beta^*$. After alignment images were created by
replacing the green band of the images above them with that
of frame $\hat{f}_1 = (f_2 - \beta)/\alpha$, with α, β computed by our algo-
rithm. For both types of images, deviations from the ground-
truth alignment cause “double exposure” artifacts (i.e., when
$f_1 \neq f_2$ or $\hat{f}_1 \neq f_1$, respectively). . . . . . . . . . . . . . . . . 63
5.12 Simulation of scene dynamics. (a) Experimental setup. (b)
All features start in random positions uniformly distributed
within the illustrated sphere, which is viewed from both cameras. 65
5.13 Feature dynamics. . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.14 Examples of voting spaces for k = 1, 2, 4, 8, 16 and 32 features.
In all these cases, the fundamental matrix has an average error
of 2 pixels ($\varepsilon_f = 2$) and the tracker is corrupted by a gaussian
noise with mean zero and standard deviation of ±2 pixels (R =
2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.15 Percentage of inliers for each voting space of Figure 5.14, where
k represents the number of tracked features (k = 1, 2, 4, 8, 16
and 32). A specific vote is considered an inlier if its coordinates
$f_1$ and $f_2$ satisfy $|f_2 - f_1 + 32| \leq 1$ frame. . . . . . . . . . . . .
5.16 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard
deviation R of the tracker’s gaussian noise is ±1 pixel. (d), (e)
and (f) Standard deviation R of the tracker’s gaussian noise is
±2 pixels. In all cases, a fundamental matrix with an average
error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . . . . . 79
5.17 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Stan-
dard deviation R of the tracker’s gaussian noise is ±3 pixels.
(d), (e) and (f) Standard deviation R of the tracker’s gaussian
noise is ±4 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . 80
5.18 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Stan-
dard deviation R of the tracker’s gaussian noise is ±5 pixels.
(d), (e) and (f) Standard deviation R of the tracker’s gaussian
noise is ±6 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . 81
5.19 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Stan-
dard deviation R of the tracker’s gaussian noise is ±7 pixels.
(d), (e) and (f) Standard deviation R of the tracker’s gaussian
noise is ±8 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . 82
5.20 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Standard deviation R of the tracker's gaussian
noise is ±9 pixels. (d), (e) and (f) Standard deviation R of the tracker's
gaussian noise is ±10 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 83
5.21 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 1 pixel. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 2 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 84
5.22 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 3 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 4 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 85
5.23 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 5 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 6 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 86
5.24 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 7 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 8 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 87
5.25 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 9 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 10 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 88
5.26 Space of estimated parameters for (a) k = 1 feature and (b) k = 2
features. Each point shown corresponds to the parameters of an estimated
timeline with temporal alignment error smaller than or equal to 1 frame.
m_α and m_β represent the average values, and σ²_α and σ²_β the variances,
of the temporal dilation (α) and the temporal shift (β), respectively. In both
cases, the average error of the fundamental matrix and the standard deviation
of the gaussian noise added to the tracker are 2 pixels. . . . . 89
5.27 Space of estimated parameters for (a) k = 4 features and (b) k = 8
features. Each point shown corresponds to the parameters of an estimated
timeline with temporal alignment error smaller than or equal to 1 frame.
m_α and m_β represent the average values, and σ²_α and σ²_β the variances,
of the temporal dilation (α) and the temporal shift (β), respectively. In both
cases, the average error of the fundamental matrix and the standard deviation
of the gaussian noise added to the tracker are 2 pixels. . . . . 90
5.28 Space of estimated parameters for (a) k = 16 features and (b) k = 32
features. Each point shown corresponds to the parameters of an estimated
timeline with temporal alignment error smaller than or equal to 1 frame.
m_α and m_β represent the average values, and σ²_α and σ²_β the variances,
of the temporal dilation (α) and the temporal shift (β), respectively. In both
cases, the average error of the fundamental matrix and the standard deviation
of the gaussian noise added to the tracker are 2 pixels. . . . . 91
5.29 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's
noise level. (a), (b) and (c) Results for k = 1 feature. (d), (e) and (f)
Results for k = 2 features. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 92
5.30 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's
noise level. (a), (b) and (c) Results for k = 4 features. (d), (e) and (f)
Results for k = 8 features. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 93
5.31 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's
noise level. (a), (b) and (c) Results for k = 16 features. (d), (e) and (f)
Results for k = 32 features. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 94
5.32 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the average
error of the fundamental matrix. (a), (b) and (c) Results for k = 1 feature.
(d), (e) and (f) Results for k = 2 features. In all cases, the tracker is
corrupted by gaussian noise with a standard deviation of ±2 pixels (R = 2). . 95
5.33 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the average
error of the fundamental matrix. (a), (b) and (c) Results for k = 4 features.
(d), (e) and (f) Results for k = 8 features. In all cases, the tracker is
corrupted by gaussian noise with a standard deviation of ±2 pixels (R = 2). . 96
5.34 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the average
error of the fundamental matrix. (a), (b) and (c) Results for k = 16 features.
(d), (e) and (f) Results for k = 32 features. In all cases, the tracker is
corrupted by gaussian noise with a standard deviation of ±2 pixels (R = 2). . 97
B.1 3D representation of the trifocal tensor. The picture represents
λ_i = λ'_j λ''_k T_i^{jk}, which is the contraction of the tensor with the
lines λ' and λ'' to produce a line λ. In pseudo-matrix notation this can be
written as λ_i = λ'^T T_i λ'', where (T_i)_{jk} = T_i^{jk}
(Hartley and Zisserman, 2003). . . . . 115
C.1 Three images of a line define it as the intersection of three
planes in the same pencil. . . . . . . . . . . . . . . . . . . . . 118
Chapter 1
Introduction
I don’t want to achieve immortality through my work.
I want to achieve it through not dying.
Woody Allen
In this work we consider the problem of Spatio-Temporal Alignment of
Multiple Video Sequences of the same 3D scene, captured from distinct viewpoints.
The scene dynamics as well as static scene features are used as powerful cues
for estimating the temporal synchronization (temporal alignment) and the spatial
alignment between the sequences. Typically, the temporal misalignment between
video sequences arises for two main reasons. The first is that the input sequences
may have different frame rates (e.g., NTSC and PAL); the second is the existence
of a time shift or offset between the sequences, frequently created when the
cameras are not activated simultaneously. The spatial misalignment, on the other
hand, results from the different positions, orientations and internal calibration
parameters of all the cameras.
Real-world scenes where objects move and deform in 3D space are often
so complex that, in order to understand them completely, it is necessary
to observe them simultaneously from multiple viewpoints (Carceroni, 2001;
Kutulakos, 2000; Kutulakos and Seitz, 2000; Szeliski, 1999). Consider, for
instance, a sports event. Even a well-trained referee who is closely observing
a scene of this type sometimes fails to capture pieces of information that
are essential for accurate judgment. Figure 1.1 illustrates a famous example
of such a failure, where two views of The Hand of God goal are presented
(FIFA, 2004b). That goal was scored by Maradona, who knocked the ball
into the net with the back of his left hand rather than with his head, during
the 1986 FIFA World Cup quarter-final match between England and Argentina
at the Azteca Stadium, Mexico City.

Figure 1.1: The famous Hand of God goal scored by Maradona in the 1986
FIFA World Cup match between England and Argentina (FIFA, 2004b).
(a) First viewpoint. (b) Second viewpoint.
In many of these situations, the missing visual clues are later revealed
clearly in video sequences captured by strategically positioned cameras, making
the single-observer error evident. In particular, one of the major obstacles
to be overcome in the analysis of events such as the one in Figure 1.1 is
the need for a prior spatio-temporal alignment between the sequences, that is,
the need to establish correspondences in time and in space between the
different image sequences.
The demand for effective automatic methods for temporally and spatially
aligning multiple videos, mainly pre-recorded videos (e.g., regular and
slow-motion clips of the same penalty kick), where video synchronization
hardware and calibration techniques cannot be applied, has directed us in
our quest to develop a novel solution that takes into account the constraints
related to the monitored scene, as well as the constraints related to the set
of cameras. Within this context, our main goal in this work is to advance
towards the development of new practical methods for accurately aligning,
both in time and space, not only two sequences, as most existing methods do,
but a general set of N video sequences captured from distinct viewpoints.
Therefore, the problem that we are addressing in this thesis can be stated
as follows:
Given a dynamic scene, viewed simultaneously by N perspective cam-
eras located at distinct viewpoints, which work with constant (albeit not
necessarily identical) temporal sampling rates and constant (but unknown)
intrinsic and extrinsic parameters, recover the N-dimensional function that
captures the temporal relations between all sequences as well as the spatial
transformations that align them in space.
Specifically, our work focuses on sets of video sequences that have overlap
between their fields of view, i.e., given N sequences S_1, ..., S_N, we can
identify corresponding tuples of pixels (x_1, y_1, t_1) ∈ S_1, ..., (x_N, y_N, t_N) ∈ S_N,
where each such tuple is formed by projections of a single point Q in the
scene space-time, in such a way that these overlapping parts contain some
non-rigid motion.
1.1 Motivation
Many applications today benefit from the availability of simultaneous
video recordings of the same physical event. Examples include tele-immersion
(Vedula et al., 2002), video-based surveillance (Zelnik-Manor and Irani,
2001), video mosaicing (Caspi and Irani, 2001), video metrology from television
broadcasts of athletic events (Reid and Zisserman, 1996) and recovery
of the affine structure of non-rigid motion (Tresadern and Reid, 2003).
A critical task in all of these applications is the spatio-temporal alignment
between the videos involved. The use of special synchronization hardware has
been the most common solution adopted to acquire temporally aligned sequences.
However, alignment based on the content of the image sequences themselves has
proved to be a more interesting alternative, since it is less expensive, easier
to use outside laboratories, and applicable to the many multi-view sequences
that already exist in video databases, such as those of sports events (FIFA,
2004a; Reid and Zisserman, 1996), artistic performances, or crimes captured
on surveillance tapes.
In Reid and Zisserman (1996), for instance, the authors combined infor-
mation from two independent sequences, illustrated in Figure 1.2, to resolve
the controversy regarding the supposed goal scored by the English player Ge-
off Hurst in the 1966 FIFA World Cup. In particular, they wanted to know
if the ball had actually crossed the goal line, and if not, how close it came to
crossing it. In that work, the authors manually synchronized the sequences
and then computed spatial alignment between selected corresponding images.
This is an interesting example where automatic spatio-temporal alignment
may provide better results.
In Caspi and Irani (2001), the authors illustrate an interesting applica-
tion by performing the alignment of multi-sensor sequences for sensor fusion.
Figure 1.2: Images from two available sequences from the incident regarding
the supposed goal scored by the English player Geoff Hurst in the 1966 FIFA
World Cup (Reid and Zisserman, 1996).
Figure 1.3 shows the example presented by the authors. Two sequences of
an outdoor scene were aligned, one captured by an Infra-Red camera and the
other by a regular (visible-light) video camera. The result of the spatio-temporal
alignment is illustrated by fusing temporally corresponding frames. The Infra-Red
camera provides only intensity information, which was therefore fused with the
intensity component of the visible-light camera.
A final example of the importance of effective video alignment techniques
Figure 1.3: Example of application of spatio-temporal alignment of multiple
sequences for multi-sensor fusion. (a) and (b) are temporally corresponding
frames from the visible-light and Infra-Red sequences, respectively. (c)
shows the result of fusing the two sequences after spatio-temporal alignment
(Caspi and Irani, 2001).
is given in Shechtman et al. (2002), where the authors propose a novel method
for constructing a video sequence of high space-time resolution by combining
information from multiple low-resolution video sequences of the same dynamic
scene. Some results are illustrated in Figure 1.4. In that work, the
spatio-temporal alignment method proposed in Caspi and Irani (2000) is
applied as a preliminary step.
Importantly, content-based alignment of sequences acquired with station-
ary cameras is possible only if (a) these sequences depict parts of the scene
space-time that have some overlap and (b) the regions visible in both se-
quences move in a way that is not completely rigid. If both conditions above
hold, then the existence of a rigid transformation between the overlapping
parts of any two frames (one from each sequence) indicates that these frames
were probably acquired simultaneously. Every solution to the alignment
problem that we address here (ours included) exploits this constraint.
Figure 1.4: Alignment and integration of information across multiple video
sequences to exceed the limited spatial resolution and limited temporal
resolution of video cameras. (a)-(c) Display the event captured by three
low-resolution sequences. (d) The reconstructed event as captured by the
generated high-resolution sequence. (e) A close-up image of the distorted
ball in one of the low-resolution frames. (f) A close-up image of the ball at
the exact corresponding frame in time in the high-resolution output sequence
(Shechtman et al., 2002).
1.2 Approach
We believe that any general solution to the spatio-temporal alignment
problem should handle the following cases:
Unknown frame rate: The relative frame rate of the video sequences
is unknown and unconstrained.
Arbitrary time shift: The time shift between the sequences is un-
known and can be arbitrarily large.
Unknown motion: The 3D motion of objects in the scene is unknown
and unconstrained.
Tracking failures: Individual scene points cannot be tracked reliably
over many frames.
Unknown epipolar geometry: The relative camera geometry of the
video sequences is unknown.
Scalability: Computational efficiency should degrade gracefully with
increasing number of sequences.
No static points: No visible point in the scene remains stationary for
two or more frames.
As a step toward this goal, we present a novel approach that operates
under all of the above conditions except the last one. In particular, we assume
that for every pair of video sequences we can identify enough static scene
points to obtain an initial estimate of the cameras' epipolar geometry. Moreover,
in order to ensure that the parameters of that initial estimate remain constant
during the application of our approach, we consider a scenario where the
video cameras are stationary, with fixed (but unknown) intrinsic and extrinsic
parameters.
Because of that last assumption, every corresponding set of N pixels is
related by the same spatio-temporal transformation, whose spatial components
are temporally invariant. Moreover, as long as each individual image
in each sequence is acquired instantaneously, the temporal component of this
transformation is spatially invariant, which allows us to decouple its recovery
from the recovery of the spatial alignment. Importantly, our methodology
can still be used in the case of moving cameras, as long as a preliminary
video stabilization process is applied.
We further assume that all cameras follow the projective pinhole model,
and that they acquire frames at constant (albeit not necessarily identical)
temporal sampling rates. The constant sampling rate assumption implies
that the temporal coordinates (frame numbers) in one reference sequence
and the temporal coordinates in all other sequences are related by a one-dimensional
affine transformation:

    f_i = α_i f_r + β_i,                                               (1.1)

where f_i and f_r are the frame numbers of the i-th sequence and the reference
sequence, respectively, and α_i, β_i are unknown constants describing the
temporal dilation and the temporal shift, respectively, between the sequences.
In general, these constants will not be integers.
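As a purely illustrative example (with made-up numbers, not data from our
experiments): suppose the reference sequence is captured at 25 fps and the
i-th sequence at 30 fps, with camera i activated 2 seconds before the reference
camera. Frame f_r of the reference, captured f_r/25 seconds after the reference
camera started, then corresponds to frame f_i = 30 (f_r/25 + 2), so α_i = 1.2
and β_i = 60; frame f_r = 1, for instance, maps to the fractional frame
f_i = 61.2 of sequence i.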
The pairwise temporal relations captured by Equation (1.1) induce a
global relationship between the frame numbers of the input sequences. In
fact, at the heart of our approach lies the concept of a timeline. Given N
sequences, the timeline is an N-dimensional line that completely describes all
temporal relations between the sequences. A key property of the timeline
is that even though knowledge of the timeline implies knowledge of the
sequences' temporal alignment, we can compute points on the timeline without
knowing this alignment. Using this property as a starting point, we reduce
the temporal alignment problem for N sequences to the problem of robustly
estimating a single N-dimensional line from a set of appropriately-generated
points in R^N.
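To make this reduction concrete, the following minimal sketch (in Python with
NumPy; an illustration under simplifying assumptions and with our own function
names, not the estimator detailed in Chapter 3) fits a line robustly to a cloud
of candidate points in R^N with a simple RANSAC-style loop:

    # Minimal sketch: robustly fit an N-dimensional line to candidate temporal
    # alignments ("votes").  Each vote is a point [f_1, ..., f_N] in R^N.
    import numpy as np

    def fit_timeline_ransac(votes, n_iters=1000, inlier_tol=1.0, rng=None):
        """votes: (M, N) array of candidate alignments. Returns (point, direction)."""
        rng = np.random.default_rng() if rng is None else rng
        best_inliers = None
        for _ in range(n_iters):
            i, j = rng.choice(len(votes), size=2, replace=False)
            p, q = votes[i], votes[j]
            d = q - p
            norm = np.linalg.norm(d)
            if norm < 1e-9:
                continue
            d = d / norm
            # Distance of every vote to the candidate line through p along d.
            diff = votes - p
            dist = np.linalg.norm(diff - np.outer(diff @ d, d), axis=1)
            inliers = dist <= inlier_tol
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # Refine with a total least-squares fit to the inlier votes.
        pts = votes[best_inliers]
        centroid = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - centroid)
        return centroid, vt[0]      # a point on the timeline and its direction

Given a point p on the fitted line and its (unit) direction d, rescaling d so
that its reference-sequence component equals one yields the α_i directly, and
β_i = p_i - α_i p_r, as in Equation (1.1).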
Finally, we assume in this work that the overlapping sequence parts con-
tain features such as uniformly-colored blobs or corners that move along
(piecewise) smooth trajectories, and these trajectories can be captured by
trackers that output trajectory segments detected in all the sequences as
parametric curves. We do not assume that correspondences between trajec-
tories are known a priori: recovering such correspondences is part of the job
done by the method that we will introduce in the next chapters. Therefore, in
practice the data that we need can be obtained with one of the many single-
view, real-time, multi-feature trackers available in the literature (Lowe, 2004;
Jepson et al., 2003; Isard and MacCormick, 2001; Shi and Tomasi, 1994).
1.3 Contributions
From a practical standpoint, this work has ushered in two major contributions
to the area of video analysis, especially to the field of applications
where multiple video sequences must be temporally and spatially aligned:
A generalized framework for solving the problem of spatio-temporal
alignment between N video sequences captured from distinct viewpoints.
The framework (1) can handle arbitrarily large misalignments
between the sequences, (2) does not require any a priori information
about their temporal relations, (3) does not assume that a single scene
point can be tracked reliably over the entire sequence, (4) does not
require the ability to establish feature correspondences between the
sequences, (5) can handle sequences with feature trajectories that nearly
overlap in 3D, that are approximately cyclical and that contain features
with quite large image velocities (up to 9 pixels per frame), (6)
can handle sequences with multiple frame rates and that contain spurious
clips, and (7) can handle sequences where the tracked features
move in completely distinct planes in the scene.
A new iterative algorithm to refine simultaneously the parameters
representing the temporal and spatial relations between the sequences,
since the exclusive refinement of the temporal parameters is suboptimal.
From a theoretical point of view, this work is important because it provides
additional theoretical and empirical evidence that, by combining temporal
and spatial cues in a single alignment framework, many events that
are inherently ambiguous for traditional image-to-image alignment techniques
can be uniquely resolved by sequence-to-sequence alignment methods.
1.4 Outline of this work
This document is organized into six chapters (including this one) and three
appendices. Chapter 2 discusses the state-of-the-art in Spatio-Temporal
Alignment of Multiple Video Sequences and introduces some fundamental
concepts in single and multiple view geometries that we use as a theoretical
basis for our work. In Chapter 3, we present the timeline constraint and
our temporal alignment methodology associated with that concept. Chapter
4 presents an iterative algorithm for refining simultaneously the parameters
representing the temporal and spatial relations between the sequences. In
Chapter 5, we present and discuss experimental results with real-world and
synthetic video sequences. Chapter 6 presents our conclusions and perspectives
for future work. Appendix A describes the RANSAC algorithm in detail.
Appendix B gives a brief introduction to tensorial notation. Finally, Appendix C
describes an important tool applied in the analysis of three-view geometries,
the trifocal tensor, and some techniques for deriving multi-linear constraints
on correspondences in the case of N views.
Chapter 2
Background
Everything has been said before, but since nobody
listens we have to keep going back and beginning all
over again.
Andre Gide
This chapter presents a survey of the current state-of-the-art in Spatio-Temporal
Alignment of Multiple Video Sequences and introduces some fundamental
concepts in single and multiple view geometries, especially two-view
geometries, that we use as a theoretical basis for our work. Related work that
supports specific topics of this thesis will be surveyed wherever necessary.
2.1 Spatio-Temporal Alignment
In spite of the fact that spatial alignment is one of the problems that have
been most heavily researched by the Computer Vision community, so far only
a handful of works have dealt with the problem of temporal alignment.
Based on the analysis of the most relevant articles found in the literature,
we classify the existing spatio-temporal alignment methods into two main
groups: feature-based methods and direct methods. The feature-based methods
(Caspi et al., 2002; Rao et al., 2003; Wolf and Zomet, 2002a,b;
Lee et al., 2000; Stein, 1998) extract all information needed to perform
spatio-temporal alignment from the trajectories of tracked features, such
as uniformly-colored blobs or corners. On the other hand, direct methods
(Caspi and Irani, 2000, 2001) extract that information from the intensities
and intensity gradients of all pixels that belong to overlapping regions.
Therefore, direct methods tend to align sequences more accurately if their
appearances are similar, while feature-based methods are widely prescribed
(Caspi et al., 2002; Rao et al., 2003; Torr and Zisserman, 1999) for sequences
with dissimilar appearance, such as those acquired with wide baselines, different
magnifications, or by cameras with distinct spectral sensitivities.
In this work, we propose a novel feature-based methodology for sequence-to-sequence
alignment. More specifically, a major novelty of our methodology
is that it reduces the computation of temporal and spatio-temporal
alignments between sequences to linear regression and linear optimization
problems, while existing feature-based techniques (Caspi et al., 2002; Rao
et al., 2003; Wolf and Zomet, 2002a,b; Lee et al., 2000; Stein, 1998) search
the entire space of possible temporal alignments. Unfortunately, the combinatorial
nature of this search requires several additional assumptions to make
it manageable, such as (1) known camera frame rates, (2) a number of
video sequences N restricted to two, (3) an integer temporal misalignment,
and (4) a temporal misalignment within a small user-specified range
(typically less than fifty frames).

Unlike previous feature-based techniques, our approach aligns N sequences
in a single step. It can handle arbitrarily large misalignments between
them and does not require any a priori information about their temporal
relations.
The quality of the alignments that we obtain and the computational cost
of our alignment process are invariant to the magnitude of the initial temporal
offsets between sequences. To achieve this breakthrough, we derive alignment
constraints by matching instantaneous positions of features in one sequence
against entire feature trajectories in the other sequences, while most feature-based
techniques rely on matches between pairs of instantaneous positions.
Our key observation is that, because feature trajectories are (piecewise-)
smooth curves parameterized by time, once the geometric relations of a set
of cameras have been estimated, each match between an instantaneous feature
in one reference sequence and a trajectory in one of the other sequences
constrains the feature's temporal coordinate to be aligned with one among a
finite set of instants in the other sequence: those instants where the feature's
epipolar line (see the concept of epipolar line in Section 2.3.1) intersects the
matching trajectory. Importantly, the instants where intersections occur do
not necessarily correspond to the second sequence's frames, which means that
our methodology yields temporal alignment at sub-frame accuracy, contrary
to techniques based on position-to-position matches.
In this thesis, we exploit the observation above to develop a sequence–to–
sequence alignment approach based on two techniques: (a) one that builds
large sets of those temporal constraints from a rough spatial alignment be-
tween sequences and then performs a robust linear regression in the temporal
domain to recover the globally correct temporal alignment, and (b) one that
linearizes feature trajectories around the points of intersection with epipolar
lines to reduce the problem of finding the complete spatio–temporal align-
ment between two sequences to a problem of solving a linear system.
Our work is most closely related to the approach of Caspi, Simakov and
Irani (Caspi et al., 2002). In their approach, the epipolar geometry and
temporal misalignment between two sequences are recovered from the im-
age trajectory of a single scene point that is visible in both sequences, and
are subsequently refined using more features. To achieve this, they assume
known frame rates and formulate a non-linear optimization problem to jointly
estimate epipolar geometry and temporal misalignment. Unfortunately, the
highly non-linear nature of this optimization necessitates good initial esti-
mates for the temporal misalignment and the epipolar geometry.
Importantly, that approach still assumes that a single scene point can be
tracked reliably over the entire sequence. This may be difficult to achieve
when aligning videos of complex scenes, where feature tracking can fail often
because of occlusions or large inter-frame motions. Our solution, on the other
hand, requires the ability to track scene points only across two consecutive
frames of the same sequence. Moreover, it does not require the ability to
establish feature correspondences between the sequences.
All other feature-based methods for spatio-temporal alignment that we
are aware of (Rao et al., 2003; Wolf and Zomet, 2002a,b; Lee et al., 2000;
Stein, 1998) use position-to-position constraints. It is known (Hartley and
Zisserman, 2003; Ma et al., 2003) that in the case of two-view geometries,
except in degenerate cases, a match between static points in images
acquired by two stationary cameras generates a linear constraint on the
parameters of the cameras' fundamental matrix. It is also known (Hartley and
Zisserman, 2003; Ma et al., 2003) that in static scenes (or in cases where
two sequences are correctly synchronized) the matrix formed by all such
constraints is singular. The methods cited above use this fact to check whether
any possible temporal transformation between two sequences is consistent with
the instantaneous feature positions in the sequences. They do this by assembling
large constraint matrices from multiple position-to-position matches
and then testing these matrices for rank deficiency.
Thus, all those methods need to search the space of possible temporal
transformations to find the optimum alignment. Rao et al. (Rao et al.,
2003) have recently proposed a way to perform this search in an incremental,
non-exhaustive way, but their solution assumes that the initial frames of the
two sequences are aligned a priori. Moreover, because of this search–based
nature, none of the techniques based on position–to–position constraints can
synchronize sequences at sub–frame accuracy and only one of them (Wolf
and Zomet, 2002b) can tolerate outliers in the matched position pairs, albeit
by assuming that both cameras are orthographic.
Finally, there is the method proposed by Caspi and Irani (Caspi and
Irani, 2000) that aligns sequences directly from pixel intensities and their
spatio-temporal gradients. Because it uses only linear terms of the intensities'
Taylor-series expansions to approximate spatial and temporal intensity variations
that are in general non-linear, it only works if the initial sequence-to-sequence
misalignments in space and time are small enough to fall within
the range of validity of the intensity linearizations. In addition, it models
spatial transformations between sequences as homographies, which, contrary
to fundamental matrices, are not appropriate to align sequences with significant
depth discontinuities. Homographies and fundamental matrices are
geometric relations between two views, which are introduced in Section 2.3.1.
Caspi and Irani also developed a direct method that can align sequences
without any overlap (Caspi and Irani, 2001), but this latter method does not
work with stationary cameras: it only works with sequences acquired by pairs
of cameras that remain rigidly attached to each other while moving relative
to a mostly rigid scene.
Figure 2.1: The pinhole camera model (Hartley and Zisserman, 2003).
2.2 Single View Geometry
This section concentrates on the introduction of some fundamental concepts
in the geometry of a single perspective camera. Basically, a camera
is a mapping between the 3D world (object space) and a 2D image (Trucco
and Verri, 1998). There are several camera models in the literature, which
are matrices with particular properties that represent the camera mapping.
Specifically, we consider in this work that all cameras follow the projective
pinhole model, which is illustrated in Figure 2.1. The basic pinhole model
consists of a plane π, the image plane, and a 3D point O, the center of
projection. The distance f between π and O is the focal length. The line
through O and perpendicular to π is the optical axis, and o, the intersection
between π and the optical axis, is named the image center. As illustrated in
Figure 2.1, q, the image of Q, is the point at which the straight line through
Q and O intersects the image plane π. Consider the 3D reference frame
in which O is the origin and the plane π is orthogonal to the Z axis, and
let Q = (Q_1, Q_2, Q_3)^T. By similar triangles, one quickly computes that the
point Q is mapped to the point q = (fQ_1/Q_3, fQ_2/Q_3, f)^T on the image plane.
Ignoring the final image coordinate, we see that

    (Q_1, Q_2, Q_3)^T  →  (fQ_1/Q_3, fQ_2/Q_3)^T                       (2.1)

describes the central projection mapping from world to image coordinates
(Hartley and Zisserman, 2003). This is a mapping from Euclidean 3D-space R^3
to Euclidean 2D-space R^2.
If world and image points are represented by homogeneous vectors, then
the central projection is expressed as a linear mapping between their homogeneous
coordinates (Hartley and Zisserman, 2003). In particular, Equation
(2.1) may be written in terms of matrix multiplication as

    (Q_1, Q_2, Q_3, 1)^T  →  (fQ_1, fQ_2, Q_3)^T  =  [ f  0  0  0 ] (Q_1, Q_2, Q_3, 1)^T.      (2.2)
                                                     [ 0  f  0  0 ]
                                                     [ 0  0  1  0 ]
Computer vision algorithms for reconstructing the 3D structure of a scene
or computing the position of objects in space need equations such as Equation (2.2)
to link the coordinates of points in 3D space with the coordinates of their
corresponding image points (Trucco and Verri, 1998). In these applications
it is often assumed that the coordinates of the image points in the camera
reference frame can be obtained from pixel coordinates and that the camera
reference frame can be located with respect to some other frame, known as the
world reference frame (Trucco and Verri, 1998). This is equivalent to assuming
knowledge of some of the camera's characteristics, known in vision as the camera's
extrinsic and intrinsic parameters. In the following we briefly introduce
the definitions of the extrinsic and intrinsic parameters in practical terms. The
problem of estimating their values is called camera calibration (Hartley and
Zisserman, 2003).
2.2.1 Extrinsic parameters
The extrinsic parameters are defined as any set of geometric parameters
that identify uniquely the transformation between the unknown camera ref-
erence frame and a known reference frame, named the world reference frame
(Trucco and Verri, 1998).
A typical choice for describing the transformation between camera and
world frame is to use (Trucco and Verri, 1998; Hartley and Zisserman, 2003):
a 3 × 1 translation vector t, describing the relative positions of the
origins of the two reference frames, and
a 3 × 3 rotation matrix R, an orthogonal matrix that brings the corre-
sponding axes of the two frames onto each other.
The relation between the coordinates of a point Q in the world and camera frames,
Q_w and Q_c respectively, is

    Q_c = R Q_w + t.                                                   (2.3)
Now, if we add a “1” as a fourth coordinate of Q_w (that is, express Q_w in
homogeneous coordinates), we may group the extrinsic parameters in a single
3 × 4 matrix M_e, called the matrix of extrinsic parameters,

    M_e = [ R | t ],                                                   (2.4)

and obtain the following linear matrix equation for linking the coordinates
of points in the world reference frame with their corresponding coordinates
in the camera reference frame:

    Q_c = M_e (Q_w^T, 1)^T.                                            (2.5)
2.2.2 Intrinsic parameters
The intrinsic parameters can be defined as the set of parameters needed to
characterize the optical, geometric, and digital characteristics of the viewing
camera (Trucco and Verri, 1998). They are the focal length f, the location
of the image center in pixel coordinates (o_x, o_y), the effective pixel sizes
in the horizontal and vertical directions (s_x, s_y), and, if required, a radial
distortion coefficient k.

Neglecting radial distortion, we can group all the intrinsic parameters in
a single 3 × 3 matrix M_i, called the matrix of intrinsic parameters,

    M_i = [ f/s_x    0      o_x ]
          [   0    f/s_y    o_y ]                                      (2.6)
          [   0      0       1  ].
Again, if we express Q_w in homogeneous coordinates, we obtain the following
linear matrix equation describing perspective projection:

    q = M_i M_e (Q_w^T, 1)^T.                                          (2.7)

What is interesting about the vector q = [q_1, q_2, q_3]^T is that the ratios
(q_1/q_3) and (q_2/q_3) are nothing but the pixel coordinates of the image point.
Therefore, M_i performs the transformation between the camera reference
frame and the image reference frame.
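As an illustration of Equations (2.3) to (2.7), the following short sketch
(in Python with NumPy; all calibration values are made-up, arbitrary numbers)
builds M_e and M_i and projects a world point to pixel coordinates:

    # Illustrative sketch of Equations (2.3)-(2.7): project a 3D world point to
    # pixel coordinates with made-up intrinsic and extrinsic parameters.
    import numpy as np

    # Extrinsic parameters: rotation R (30 degrees about the Y axis) and
    # translation t, so that Q_c = R Q_w + t as in Equation (2.3).
    theta = np.deg2rad(30.0)
    R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    t = np.array([0.1, -0.2, 2.0])
    M_e = np.hstack([R, t.reshape(3, 1)])            # 3 x 4 matrix of Eq. (2.4)

    # Intrinsic parameters: focal length f, pixel sizes (s_x, s_y), image center.
    f, s_x, s_y, o_x, o_y = 0.008, 1e-5, 1e-5, 320.0, 240.0
    M_i = np.array([[f / s_x, 0.0,     o_x],
                    [0.0,     f / s_y, o_y],
                    [0.0,     0.0,     1.0]])        # matrix of Eq. (2.6)

    Q_w = np.array([0.5, 0.3, 4.0, 1.0])             # world point, homogeneous
    q = M_i @ M_e @ Q_w                              # Eq. (2.7)
    pixel = q[:2] / q[2]                             # (q_1/q_3, q_2/q_3)
    print(pixel)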
2.3 Multiple View Geometry
A basic problem in computer vision is to understand the structure of a real-world
scene given several images of it (Hartley and Zisserman, 2003). Over
the past decade there has been a rapid development in the understanding
and modelling of the geometry of multiple views, especially due to new
achievements in our theoretical understanding and improvements in the estimation
of mathematical objects from images (Hartley and Zisserman, 2003;
Ma et al., 2003; Forsyth and Ponce, 2002).

While several problems of scene reconstruction have already reached reasonable
solutions, such as the problem of estimating a multifocal tensor from
image point correspondences, particularly the fundamental matrix and the
trifocal tensor (Hartley and Zisserman, 2003), other relevant problems still
call for more careful study. Examples include: (1) the application of bundle
adjustment to solve more general reconstruction problems, and (2) the automatic
detection of correspondences in image sequences, with elimination of outliers
and false matches, using the multifocal tensor relationships.
Next, we will briefly describe some important concepts and definitions in
two-view geometry, which constitute the basis of our spatio-temporal align-
ment methodology, such as the epipolar geometry of two cameras. In Ap-
pendix C, we present more general frameworks that are natural extensions
for three-, four- and N-views. Particularly, we introduce in this appendix the
trifocal tensor, which plays an analogous role in three views to that played by
the fundamental matrix in two. Finally, we also describe some techniques for
deriving multi-linear constraints on correspondences in the case of N-views.
2.3.1 Two-View Geometry
The two perspective views may be acquired simultaneously, as in a stereo rig,
or sequentially, for example by a camera moving relative to the scene. We say
that these two situations are geometrically equivalent, and they will not be
differentiated here. Basically, there are two relations between two views of a
scene (Hartley and Zisserman, 2003; Ma et al., 2003):

1. The homography: a point in one view determines a point in the other,
which is the image of the intersection of its ray with a plane, as illustrated
in Figure 2.2.

2. The epipolar geometry: a point in one view determines a line in the
other, which is the image of the ray through that point, as illustrated
in Figure 2.3.

In the following, we briefly introduce both of these relations.
The homography map
Images of points on a plane are related to corresponding image points
in a second view by a (planar) homography as shown in Figure 2.2. This
is a projective relation since it depends only on the intersections of planes
with lines (Forsyth and Ponce, 2002). We say that the world plane induces a
homography between the views and that the homography map is responsible
for transferring points from one view to the other.
In Hartley and Zisserman (2003), it is shown that for world planes in
general position the homography is determined uniquely by the plane and
vice versa. In this case, general position means that the world plane does
not contain either of the camera centers. If the world plane does contain one
of the camera centers, then the induced homography is degenerate (Hartley
and Zisserman, 2003; Ma et al., 2003; Forsyth and Ponce, 2002).

Figure 2.2: The homography map induced by a world plane (Hartley and
Zisserman, 2003). The ray corresponding to a point q is extended to meet
the world plane π in a point Q. This point is projected to a point q' in the
other image. The map from q to q' is the homography induced by the plane
π. If H_1 and H_2 are the perspectivities from the world plane π to the first
and second image planes, respectively, we have q = H_1 Q and q' = H_2 Q. It
is the composition of these two perspectivities that defines a homography H,
q' = H_2 H_1^{-1} q = H q, between the image planes.
In the following we derive an explicit expression for the induced homography
between the two views. Suppose a world plane π as the one illustrated in
Figure 2.2, which is specified by its coordinates in the world frame. Consider
the following projection matrices for the two views:

    M = [ I | 0 ],    M' = [ A | a ],

where I is a 3 × 3 identity matrix, 0 = [0, 0, 0]^T is a null 3-vector and A
and a are the parameters of the projection matrix M'. Moreover, let π be a
world plane with π = [v^T, 1]^T. Then the homography induced by the plane
is q' = H q with

    H = A - a v^T.                                                     (2.8)

We may assume the fourth coordinate of π equal to 1 since the plane does
not pass through the center of the first camera at [0, 0, 0, 1]^T (Hartley and
Zisserman, 2003).

Observe that there is a three-parameter family of planes in the 3D world,
and correspondingly a three-parameter family of homographies between two
views induced by planes in the 3D world. These three parameters are specified
by the elements of the vector v, which is not a homogeneous 3-vector
(Hartley and Zisserman, 2003).
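As a small numerical check of Equation (2.8) (a sketch with made-up values for
A, a and v; it is not part of our alignment method), one can verify that H
transfers the first-view image of any point on the plane to its second-view image:

    # Illustrative check of Equation (2.8): H = A - a v^T transfers image points
    # of world points lying on the plane pi = [v^T, 1]^T.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))          # made-up parameters of M' = [A | a]
    a = rng.standard_normal(3)
    v = rng.standard_normal(3)               # plane pi = [v^T, 1]^T

    H = A - np.outer(a, v)                   # Equation (2.8)

    # Pick a world point Q on the plane, i.e. satisfying v . Q_xyz + 1 = 0.
    Q_xyz = rng.standard_normal(3)
    Q_xyz -= v * (v @ Q_xyz + 1) / (v @ v)   # project onto the plane
    Q = np.append(Q_xyz, 1.0)

    q = np.hstack([np.eye(3), np.zeros((3, 1))]) @ Q    # first view,  M = [I | 0]
    q_prime = np.hstack([A, a.reshape(3, 1)]) @ Q       # second view, M' = [A | a]

    # H q should equal q_prime up to scale.
    print(np.cross(H @ q, q_prime))          # approximately [0, 0, 0]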
Epipolar Geometry
The epipolar geometry between two views is essentially the geometry of
the intersection of the image planes with the pencil of planes having the
baseline as axis, where the baseline is the line joining the camera centers
(Hartley and Zisserman, 2003; Ma et al., 2003; Trucco and Verri, 1998). The
epipolar geometry is independent of scene structure, and only depends on
the cameras’ internal parameters and relative pose (Trucco and Verri, 1998).
The geometric entities involved in epipolar geometry are illustrated in
Figure 2.3. This figure shows two pinhole cameras, their projection centers,
O and O', and image planes, π and π'. As usual, each camera identifies a
3D reference frame, the origin of which coincides with the projection center,
and the Z-axis with the optical axis. The vectors Q and Q' refer to the
scene point, while the vectors q and q' refer to the projections of the scene
point onto the image planes π and π', respectively, and are expressed in
the corresponding reference frame. We call epipole the point of intersection
of the baseline with the image plane. In Figure 2.3, we denote the epipoles
by e and e'. The plane identified by the scene point and the camera centers
O and O' is called the epipolar plane, while its intersections λ and λ' with
the image planes π and π', respectively, are called epipolar lines. With
the exception of the epipoles, only one epipolar line goes through any image
point.

Figure 2.3: The epipolar geometry.
The reference frames of both cameras in Figure 2.3 are related via the
extrinsic parameters. Therefore, given a point in space, the relation between
its representations Q and Q' in the camera reference frames is

    Q' = R (Q - t),                                                    (2.9)

where t = (O' - O) is the translation vector and R is the rotation matrix.
The relation between the scene point and its projections is described by
the usual equations of perspective projection, in vector form:

    q  = (f / Z) Q,                                                    (2.10)
    q' = (f' / Z') Q',                                                 (2.11)

where f, f' are the cameras' focal lengths and Z, Z' are the coordinates
along the Z-axis of Q and Q', respectively.
Finally, another important concept is the so-called epipolar constraint.
Consider the triplet formed by the scene point and its projections q and
q'. Given q, the scene point can lie anywhere on the ray from O through
q. However, since the image of this ray in π' is the epipolar line through
the corresponding point, q', the correct match must lie on that epipolar line.
This fact is known as the epipolar constraint and establishes a mapping between
points in π and lines in π', and vice versa.
The Fundamental Matrix F
The fundamental matrix is an algebraic representation of the epipolar
geometry (Hartley and Zisserman, 2003; Ma et al., 2003). It is a basic tool
in the development of our technique for spatio-temporal alignment between
video sequences. In the following, in order to derive the fundamental matrix,
we first derive another important matrix known as the essential matrix
(Longuet-Higgins, 1981).

The equation of the epipolar plane through the scene point can be written
as the coplanarity condition of the vectors Q, t, and Q - t, or

    (Q - t)^T (t × Q) = 0.                                             (2.12)
By using Equation (2.9), we obtain

    (R^T Q')^T (t × Q) = 0.                                            (2.13)
Given that a vector product can be rewritten as a multiplication by a rank-deficient
matrix, we can write

    t × Q = J Q,                                                       (2.14)

where

    J = [   0   -t_z    t_y ]
        [  t_z    0    -t_x ]                                          (2.15)
        [ -t_y   t_x     0  ].
By using this fact, Equation (2.13) becomes

    Q'^T E Q = 0,                                                      (2.16)

with

    E = R J.                                                           (2.17)

The matrix E is known as the essential matrix (Longuet-Higgins, 1981) and
establishes a link between the epipolar constraint and the extrinsic parameters
of the stereo system. Observe that, by using Equations (2.10) and (2.11),
and dividing by the product Z Z', Equation (2.16) can be rewritten as

    q'^T E q = 0.                                                      (2.18)
Consider now the matrices M_i and M'_i of the intrinsic parameters of
the cameras in Figure 2.3. Let q̄ and q̄' be the points in pixel coordinates
corresponding to q and q' in camera coordinates, that is,

    q̄  = M_i q,                                                        (2.19)
    q̄' = M'_i q'.                                                      (2.20)
By substituting Equations (2.19) and (2.20) into Equation (2.18), we have

    q̄'^T F q̄ = 0,                                                      (2.21)

where

    F = M'_i^{-T} E M_i^{-1}.                                          (2.22)

F is the so-called fundamental matrix. Observe that F q̄ in Equation (2.21)
can be thought of as the equation of the projective epipolar line, λ', that
corresponds to the point q̄, or

    λ' = F q̄.                                                          (2.23)

Observe that if it is possible to estimate the fundamental matrix from a
number of point matches in pixel coordinates, we can reconstruct the epipolar
geometry with no information at all on the intrinsic or extrinsic parameters.
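To make the chain from Equation (2.15) to Equation (2.23) concrete, the
following small numerical sketch (Python with NumPy; all calibration values are
made up) builds E and F from given extrinsic and intrinsic parameters, following
the convention Q' = R(Q - t) of Equation (2.9), and verifies the epipolar
constraint for one scene point:

    # Illustrative sketch of Equations (2.15)-(2.23); all values are made up.
    import numpy as np

    def skew(t):
        """The rank-deficient matrix J of Equation (2.15), so that t x Q = J Q."""
        return np.array([[0.0, -t[2], t[1]],
                         [t[2], 0.0, -t[0]],
                         [-t[1], t[0], 0.0]])

    # Made-up extrinsics (rotation about Y, small baseline) and intrinsics.
    th = np.deg2rad(10.0)
    R = np.array([[np.cos(th), 0, np.sin(th)], [0, 1, 0], [-np.sin(th), 0, np.cos(th)]])
    t = np.array([0.5, 0.0, 0.1])
    Mi = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])   # first camera
    Mi2 = Mi.copy()                                                # second camera

    E = R @ skew(t)                                  # Equation (2.17)
    F = np.linalg.inv(Mi2).T @ E @ np.linalg.inv(Mi) # Equation (2.22)

    # A scene point in the first camera's frame and its two pixel projections.
    Q = np.array([0.3, -0.2, 4.0])
    Q2 = R @ (Q - t)                                 # Equation (2.9)
    q = Mi @ Q;  q = q / q[2]
    q2 = Mi2 @ Q2;  q2 = q2 / q2[2]

    print(q2 @ F @ q)                                # approximately 0, Eq. (2.21)
    print(F @ q)                                     # epipolar line of q, Eq. (2.23)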
We present below some of the most important properties of the fundamental
matrix (Hartley and Zisserman, 2003):

If F is the fundamental matrix of the pair of cameras (π, π'), then F^T
is the fundamental matrix of the pair in the opposite order: (π', π).

For any point q in the first image, the corresponding epipolar line is
λ' = F q. Similarly, λ = F^T q' represents the epipolar line corresponding
to q' in the second image.

For any point q (other than e), the epipolar line λ' = F q contains
the epipole e'. Therefore, e' satisfies e'^T F q = (e'^T F) q = 0 for
all q. It follows that e'^T F = 0 and, similarly, F e = 0.

F is a rank 2 homogeneous matrix with seven degrees of freedom.

F is a correlation, a projective map taking a point to a line. However,
F is not a proper correlation, that is, F is not invertible.
Several methods for computing the fundamental matrix have been presented
in the literature (Zhang, 1998; Hartley and Zisserman, 2003) and many
researchers still work on the development of new techniques. Of the various
current methods, the eight-point algorithm (Hartley, 1997a) is by far the
simplest. The idea behind this technique is based on the solution of a homogeneous
linear system. Given a set of n point correspondences between two
images, the fundamental matrix F satisfies the condition q'_i^T F q_i = 0, for
i = 1, ..., n. With the q_i and q'_i known, this equation is linear in the entries
of the matrix F. Thus, given at least 8 point correspondences it is possible
to solve linearly for the entries of F up to scale. With more than 8 equations
a least-squares solution is found. More detailed studies and characterizations
of the best methods for computing the fundamental matrix can be found in
Hartley and Zisserman (2003) and Zhang (1998).
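As an illustration, the following is a minimal sketch (in Python with NumPy;
function and variable names are ours) of this linear estimation step; practical
implementations also normalize the image coordinates before solving, as discussed
by Hartley (1997a):

    # Minimal sketch of the linear (eight-point) estimation of F from n >= 8
    # correspondences q_i <-> q'_i given in homogeneous pixel coordinates.
    import numpy as np

    def eight_point(q, q_prime):
        """q, q_prime: (n, 3) arrays of homogeneous points with q'_i^T F q_i = 0."""
        n = q.shape[0]
        A = np.zeros((n, 9))
        for i in range(n):
            x, y, w = q[i]
            xp, yp, wp = q_prime[i]
            # Row encoding q'_i^T F q_i = 0, linear in the 9 entries of F.
            A[i] = [xp * x, xp * y, xp * w,
                    yp * x, yp * y, yp * w,
                    wp * x, wp * y, wp * w]
        # Least-squares solution: right singular vector for the smallest value.
        _, _, vt = np.linalg.svd(A)
        F = vt[-1].reshape(3, 3)
        # Enforce the rank-2 constraint by zeroing the smallest singular value.
        u, s, vt = np.linalg.svd(F)
        s[2] = 0.0
        return u @ np.diag(s) @ vt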
Chapter 3
Temporal Alignment
Time poses other difficulties. One, perhaps the greatest, that of
synchronizing each person's individual time with the general time of
mathematics, has been amply proclaimed by the recent relativist alarm,
and everyone recalls it, or remembers having recalled it until quite
recently.
Jorge Luis Borges
This chapter presents our framework for temporally aligning multiple se-
quences acquired from distinct viewpoints. We begin this chapter by describing
a key concept in our method: the timeline, an N-dimensional line responsible
for capturing all the temporal relations between the video sequences.
3.1 The Timeline Constraint
Suppose that a dynamic scene is viewed simultaneously by N perspective
cameras located at distinct viewpoints. We assume that each camera captures
frames with a constant, unknown frame rate. We also assume that the cameras
are unsynchronized, i.e., they began capturing frames at different times,
with possibly-distinct frame rates. In order to temporally align the resulting
sequences, we must determine the correspondence between frame numbers
in one “reference” sequence and frame numbers in all other sequences. This
correspondence can be expressed as a set of linear equations,

    f_i = α_i f_r + β_i,                                               (3.1)

where f_i and f_r denote the frame numbers of the i-th sequence and the
reference sequence, respectively, and α_i, β_i are unknown constants describing
the temporal dilation and the temporal shift, respectively, between the sequences.
In general, these constants will not be integers.
The pairwise temporal relations captured by Equation (3.1) induce a
global relationship between the frame numbers of the sequences. We represent
this relationship by an N-dimensional line L that we call the timeline:

    L = { [α_1 . . . α_N]^T t + [β_1 . . . β_N]^T  |  t ∈ R }.          (3.2)
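As a purely illustrative example (with made-up numbers): for N = 3 sequences,
with sequence 1 taken as the reference (so α_1 = 1 and β_1 = 0), α_2 = 1.2,
β_2 = 10, α_3 = 0.8 and β_3 = -5, the timeline is the line
{ (t, 1.2 t + 10, 0.8 t - 5) | t ∈ R }; frame 100 of the reference sequence
then corresponds to frame 130 of sequence 2 and frame 75 of sequence 3.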
A key property of the timeline is that even though knowledge of L implies
knowledge of the temporal alignment of the sequences, we can compute points
on the timeline without knowing the sequences’ alignment. This observation
leads to a simple algorithm for reconstructing the timeline from dynamic
features in the scene that are visible in two or more of the sequences.
Specifically, let q_1(f_1) be the instantaneous projection of a moving scene
point in camera 1 at frame f_1, expressed in homogeneous 2D coordinates, as
illustrated in Figure 3.1. Furthermore, let q_i(·) be the trajectory traced by
the point's projection in camera i and suppose that the fundamental matrix,
F_{1i}, between cameras 1 and i is known for all i, where i = 2 ... N. Observe
that in this case we are considering camera 1 as our reference camera. If
the scene point is visible to all cameras when frame f_1 is captured by camera
1, we have the following constraint:
Figure 3.1: Geometry of the Timeline Constraint. In this two-camera
example, the point's trajectory in camera i intersects the epipolar line,
q_1(f_1)^T F_{1i}, twice. Given the intersection points q_i(f_i) and q_i(f'_i),
we have the set V_{q_1(f_1)} = { [f_1 f_i]^T, [f_1 f'_i]^T }.
Timeline Constraint: The set

    V_{q_1(f_1)} = { [f_1 . . . f_N]^T  |  q_1(f_1)^T F_{1i} q_i(f_i) = 0,  i = 2 . . . N }

contains at least one point on the timeline L.
Intuitively, the Timeline Constraint can be thought of as a procedure for
generating a set V_{q_1(f_1)} of “candidate” temporal alignments that is guaranteed
to contain at least one point on the timeline. The constraint tells us that we
can create such a set by (1) intersecting the epipolar line of q_1(f_1) in camera i
with the trajectory q_i(·), (2) recording the frame number(s) corresponding to
each intersection point for camera i, and (3) generating temporal alignment
vectors from the recorded frame numbers. To see why the Timeline Constraint
holds, observe that if [f_1 . . . f_N]^T is on the timeline it must represent
the “true” temporal alignment between the frame f_1 of pixel q_1(f_1) and the
remaining cameras.
Hence, pixels q_1(f_1) and q_i(f_i) must satisfy the epipolar
constraint equation, q_1(f_1)^T F_{1i} q_i(f_i) = 0. Since, by definition, the set
V_{q_1(f_1)} contains all temporal alignments that satisfy the epipolar constraint
equation across the N cameras, it must also contain the true alignment, which
is a point on the timeline L. In this respect, the Timeline Constraint can
be thought of as the converse of the epipolar constraint for the case of N
unaligned sequences.
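As an illustration of how a single candidate coordinate f_i can be generated in
practice (a minimal sketch with our own function and variable names; it assumes
the trajectory q_i(·) is given as straight segments between consecutive frames,
as produced by the two-frame tracker described in Section 3.2):

    # Minimal sketch: intersect the epipolar line of q_1(f_1) with one straight
    # trajectory segment of camera i and recover a (possibly fractional) frame
    # number f_i, as required by the Timeline Constraint.
    import numpy as np

    def segment_vote(q1, F_1i, p_a, p_b, f_a):
        """q1: homogeneous pixel in the reference camera; F_1i: fundamental matrix;
        p_a, p_b: 2D feature positions in camera i at frames f_a and f_a + 1.
        Returns the fractional frame f_i where the segment crosses q1's epipolar
        line, or None if it does not cross it."""
        line = F_1i.T @ q1                 # epipolar line of q1 in camera i
        a = np.append(p_a, 1.0)            # homogeneous endpoints of the segment
        b = np.append(p_b, 1.0)
        da, db = line @ a, line @ b        # signed distances (up to scale)
        if da * db > 0 or da == db:        # same side of the line, or degenerate
            return None
        s = da / (da - db)                 # interpolation parameter in [0, 1]
        return f_a + s                     # fractional frame number f_i

Concatenating such f_i values for i = 2, ..., N, subject to the consistency
test described in Section 3.2, yields the candidate vectors of V_{q_1(f_1)}.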
In order to apply the Timeline Constraint, we must know the fundamental
matrices, F_{ij}, describing the cameras' epipolar geometry between each
pair (i, j) of cameras. In practice, we obtain an initial estimate of F_{ij} by
finding “background features,” i.e., points in the scene that remain stationary
and are jointly visible by two or more cameras. Once the timeline L
is reconstructed, that is, once the estimation of the temporal alignment is
performed, we jointly optimize L and the parameters of the fundamental
matrices that describe the scene geometry, by using a linear, iterative refinement
procedure. We describe our temporal alignment algorithm in the
next section, and consider in Chapter 4 the joint optimization of L and the
pre-computed fundamental matrices F_{ij}, which gives us the spatio-temporal
alignment between the sequences.
3.2 The Temporal Alignment Algorithm
The Timeline Constraint leads directly to a voting-based algorithm for
reconstructing the timeline of N sequences, which provides the temporal
alignment. The algorithm operates in two phases. In the first phase, we
choose one of the image sequences to be the reference sequence and use
the instantaneous positions q_r(f_r) from each feature trajectory q_r(·) of that
sequence, together with the entire trajectories q_i(·) of the other sequences,
to estimate V_{q_r(f_r)} for each q_r(f_r). In the second phase, we fit an
N-dimensional line L to the union of the estimated sets V_{q_r(f_r)}.

Figure 3.2: (a) Trajectory of the car's pixel centroid in Sequence 1. (b)
Trajectory of the car's pixel centroid in Sequence 2. The car was tracked by
a simple blob tracker that relies on foreground-background detection to label
all foreground pixels in each frame.

Therefore, to fully specify this algorithm we must ask three questions:
this algorithm we must ask three questions:
1. How do we compute the feature trajectories q
i
(·), for i = 1, ..., N?
2. How do we estimate the set V
q
r
(f
r
)
for each q
r
(f
r
)?
3. How do we compute the timeline L?
To compute the feature trajectories q_i(·), we use a two-frame feature
tracker that is treated by our algorithm as a “black box” responsible for
returning a list of line segments of corresponding features for every pair of
consecutive frames. Each line segment connects the locations of a feature
that was detected in some frame of the i-th sequence and was successfully
tracked to the next frame, as illustrated on a real-world sequence in Figure
3.2. Importantly, our algorithm does not depend on a specific tracker. Thus,
the choice of a particular tracking methodology depends exclusively on the
scene's complexity and on the properties of the features of interest.
Next, to compute the set V_{q_r(f_r)} for a given q_r(f_r), our algorithm uses the initial estimates of the fundamental matrices F_ij between each pair (i, j) of cameras, as well as the line segments provided by the feature tracker. When a specific line segment intersects the epipolar line of q_r(f_r), it defines a possibly-fractional frame number, f_i, corresponding to the instant that q_r(f_r)'s epipolar line intersects the image trajectory of a point in the scene. Hence, f_i is the i-th coordinate of a potential element of V_{q_r(f_r)}. To generate V_{q_r(f_r)}, we collect all the f_i coordinates computed for all sequences and concatenate them so that they form valid N-dimensional vectors, which represent candidate temporal alignments in a voting space. These steps are illustrated in Figure 3.4a-d for the two-camera example of Figure 3.2.
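As an illustration of this intersection step, the sketch below shows one way it could be coded for a single non-reference camera. It is only a sketch: the function names, the polyline representation and the homogeneous-coordinate convention are assumptions made for this example, not a description of our implementation.

    import numpy as np

    def epipolar_line(q_r, F_ri):
        """Epipolar line of the reference feature q_r (homogeneous 3-vector) in
        camera i, following the convention q_r F_ri q_i = 0 used in the text."""
        return q_r @ F_ri

    def candidate_frame_numbers(line, trajectory):
        """trajectory: list of (frame_number, np.array([x, y, 1])) samples of q_i(.).
        Returns the possibly-fractional frame numbers f_i at which consecutive
        trajectory samples straddle the epipolar line."""
        candidates = []
        for (f_a, p_a), (f_b, p_b) in zip(trajectory[:-1], trajectory[1:]):
            d_a, d_b = line @ p_a, line @ p_b      # signed distances, up to a common scale
            if d_a * d_b < 0:                      # the segment crosses the epipolar line
                t = d_a / (d_a - d_b)              # interpolation parameter in [0, 1]
                candidates.append(f_a + t * (f_b - f_a))
        return candidates

Each returned value is one coordinate f_i of a potential element of V_{q_r(f_r)}; collecting such values across all non-reference cameras and concatenating the consistent ones produces the candidate vectors described above.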
Note that, according to the algorithm described above, our approach may result in a large number of intersections of the epipolar line of q_r(f_r) with the line segments of the trajectories in each one of the N − 1 cameras, since we may have several possible ways of "concatenating" the computed f_i coordinates into an N-dimensional vector. However, to avoid including an exponential number of vectors in V_{q_r(f_r)}, we only include vectors that are consistent with the cameras' epipolar geometry. In particular, let [f_1 ... f_N]^T be a candidate vector for a set of N cameras, where f_1 represents the temporal coordinate of a feature position in the reference camera, that is, q_r(f_r) = q_1(f_1). Given fundamental matrices with an average error of e pixels, where the error of a fundamental matrix is measured as the average of the distances between background feature projections in the image plane of the reference camera and their corresponding epipolar lines, we assume that the afore-mentioned candidate vector is consistent with the cameras' epipolar geometry only if the intersection points that defined each pair of its consecutive temporal coordinates (f_i, f_{i+1}), for 2 ≤ i ≤ N − 1, satisfy d_i ≤ e and d_{i+1} ≤ e, where d_i and d_{i+1} are illustrated in Figure 3.3 and are distance values that measure how close the intersection points that defined f_i and f_{i+1} are to each others' epipolar lines.

Figure 3.3: Two intersection points q_i(f_i) and q_{i+1}(f_{i+1}) of cameras i and i+1, respectively, are considered by our approach as potential representations of the same scene point only if d_i ≤ e and d_{i+1} ≤ e, where d_i and d_{i+1} are distance values that measure how close q_i(f_i) and q_{i+1}(f_{i+1}) are to each others' epipolar lines and e is the fundamental matrices' average error.
Therefore, given a set of N cameras, our approach evaluates the constraints d_i ≤ e and d_{i+1} ≤ e N − 2 times. If all the N − 2 evaluations performed obtain a positive answer, the candidate vector is considered as a potential inlier and is added to the voting space; otherwise it is rejected. Note that our procedure does not test those constraints for each possible pair of temporal coordinates, but rather it considers consecutive temporal coordinates in the candidate vector. That is, we make use of the transitive property that allows us to go from step to step in our reasoning process for inferring that a candidate point is consistent with the cameras' epipolar geometry.
For instance, consider that the intersection points that defined f_i and f_{i+1} are, according to our approach (see Figure 3.3), potential representations of a specific scene point Q. If the intersection points that defined f_{i−1} and f_i potentially represent a scene point P, we may infer that P and Q are probably the same point, since the intersection point of f_i is considered in both cases. Moreover, we may also conclude that the intersection points of f_{i−1} and f_{i+1} are probable representations of the same scene point. By using this reasoning process along consecutive temporal coordinates of the candidate point, we ensure that it is a potential representation of the temporal misalignment between the cameras. Note that our concatenation procedure is conservative, i.e., it guarantees that the set of vectors generated will be a superset of V_{q_r(f_r)}.
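For concreteness, a minimal sketch of one such pairwise test is given below. It assumes homogeneous image points, the convention q_i F_{i,i+1} q_{i+1} = 0, and hypothetical helper names; it is not intended as a definitive implementation.

    import numpy as np

    def point_to_line_distance(p, line):
        """Euclidean distance from a homogeneous point p to a homogeneous line."""
        return abs(line @ p) / np.hypot(line[0], line[1])

    def pair_is_consistent(q_i, q_ip1, F, e):
        """Check d_i <= e and d_{i+1} <= e for intersection points in cameras i and
        i+1, where F is the fundamental matrix F_{i,i+1}."""
        d_ip1 = point_to_line_distance(q_ip1, q_i @ F)  # distance of q_{i+1} to the epipolar line of q_i
        d_i = point_to_line_distance(q_i, F @ q_ip1)    # distance of q_i to the epipolar line of q_{i+1}
        return d_i <= e and d_ip1 <= e

A candidate vector is accepted only if this test succeeds for all N − 2 consecutive pairs (f_2, f_3), ..., (f_{N−1}, f_N).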
The set of candidate temporal alignments is the union of the sets V_{q_r(f_r)} for all q_r(f_r). In general, this union will contain a large number of outliers, as illustrated in Figure 3.4d. To reconstruct the timeline in the presence of outliers, we use the RANSAC algorithm (Fischler and Bolles, 1981), which is described in detail in Appendix A.
RANSAC can be regarded as an algorithm for robust fitting of models in the presence of many data outliers (Fischler and Bolles, 1981). Since it gives us the opportunity to evaluate any estimate of a set of parameters no matter how accurate the method that generated this solution might be, the RANSAC method represents an interesting approach to the solution of many computer vision problems (Cantzler, 2004). The algorithm randomly chooses a pair of candidate temporal alignments to define the timeline L, and then computes the total number of candidates that fall within an ε-distance of this line. These two steps are repeated for a number of iterations. Provided sufficient repetitions are performed, RANSAC is expected to identify solutions computed from outlier-free data. Therefore, the two critical parameters of the algorithm are the number z of RANSAC iterations and the distance ε.
To determine z, we use the formula

    z = log(1 − p) / log(1 − r²),     (3.3)

where p is the probability that at least one of our random selections is an error-free set of candidates and r is the probability that a randomly-selected candidate is an inlier.

Figure 3.4: (a) Trajectory of a feature in Sequence 1 of the Car dataset, which is presented in Chapter 5 and illustrated in Figure 3.2. The feature was the centroid of all pixels labeled as "foreground" by a color-based foreground-background detector. (b) Trajectory of the foreground pixel centroid in Sequence 2 of the dataset. Also shown is the epipolar line corresponding to pixel q_1(363) in (a). (c) Magnified view of the trajectory/epipolar line intersection in (b). The individual line segments connecting feature locations in adjacent frames are now visible. Note that the epipolar line of q_1(363) intersects multiple line segments along the trajectory. (d) Exploiting the Timeline Constraint for two-sequence alignment. Each point represents a candidate temporal alignment, i.e., an element of V_{q_1(f_1)} for the feature location q_1(f_1) in (a). The reconstructed timeline, drawn as a solid line (f_2 = 1.0030 f_1 + 54.7708), describes the temporal alignment of the two sequences in the Car dataset.
Equation (3.3) expresses the fact that z should be large enough to ensure that, with probability p, at least one randomly-selected pair of candidates is outlier-free. We used p = 0.99 and r = 0.05 (z = 1840 iterations) for all experiments, which are conservative values that lead to accurate results in our experiments, as demonstrated in Chapter 5. To compute the distance ε, we observe that ε can be thought of as a bound on the distance between detected feature locations in the input cameras and their associated epipolar lines. This allows us to approximate ε by the average distance between static features in the scene and their associated epipolar lines. Next, we summarize our algorithm for recovering the temporal alignment between multiple video sequences.
Algorithm 1 The Temporal Alignment Algorithm

Input
 1: N; {Number of cameras.}
 2: F_ij; {Fundamental matrices between cameras i and j.}
Output
 3: [α_1 ... α_N]; {Temporal dilation parameters of the timeline L.}
 4: [β_1 ... β_N]; {Temporal shift parameters of the timeline L.}
BEGIN
 5: {Step 1 - Generate the voting space: V_{q_r(f_r)}, ∀ q_r(f_r).}
 6: for i ← 1 to N do
 7:   Compute feature trajectories q_i(·).
 8: end for
 9: {Consider the camera 1 as the reference camera.}
10: for (each instantaneous position q_1(f_1)) do
11:   for i ← 2 to N do
12:     for (each line segment in sequence i) do
13:       if (the epipolar line q_1(f_1)F_1i intersects the line segment) then
14:         Obtain the frame number f_i of the intersection point.
15:         if (the corresponding pixels of (f_1, f_i) are consistent with the cameras' epipolar geometry) then
16:           Store f_i in the intersections' vector of sequence i.
17:         end if
18:       end if
19:     end for
20:   end for
21:   Construct the set V_{q_1(f_1)} by collecting all the f_i coordinates computed and concatenating them so that they form valid N-dimensional vectors [f_1 ... f_N].
22: end for
23: Generate the voting space by performing the union of the sets V_{q_1(f_1)}.
24: {Step 2 - Fit the timeline L to the union of the estimated sets V_{q_1(f_1)}.}
25: Apply the RANSAC algorithm to the voting space in order to determine the data set that best fits the searched timeline L.
26: Apply the least-squares method over the data set estimated by RANSAC to compute the timeline parameters: [α_1 ... α_N] and [β_1 ... β_N].
27: return([α_1 ... α_N], [β_1 ... β_N]).
END
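To make Step 2 of Algorithm 1 concrete, the sketch below fits the timeline to a voting space with RANSAC and then refines it by least squares; it reproduces z = 1840 iterations for the parameter values discussed above. The array layout and helper names are assumptions of this example, and camera 1 is taken as the reference (α_1 = 1, β_1 = 0); this is an illustration, not our implementation.

    import numpy as np

    def ransac_iterations(p=0.99, r=0.05):
        """Number of iterations z from Equation (3.3); z = 1840 for the defaults."""
        return int(np.ceil(np.log(1 - p) / np.log(1 - r**2)))

    def point_line_distances(votes, a, b):
        """Distance of each N-dimensional vote to the line through points a and b."""
        u = (b - a) / np.linalg.norm(b - a)
        diff = votes - a
        return np.linalg.norm(diff - np.outer(diff @ u, u), axis=1)

    def fit_timeline(votes, eps=0.5, p=0.99, r=0.05, rng=np.random.default_rng(0)):
        """votes: (M, N) array of candidate temporal alignments [f_1 ... f_N]."""
        best_inliers = None
        for _ in range(ransac_iterations(p, r)):
            i, j = rng.choice(len(votes), size=2, replace=False)
            inliers = point_line_distances(votes, votes[i], votes[j]) <= eps
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # Least-squares refit: f_i = alpha_i * f_1 + beta_i for i = 2..N.
        f1 = votes[best_inliers, 0]
        A = np.column_stack([f1, np.ones_like(f1)])
        coeffs, _, _, _ = np.linalg.lstsq(A, votes[best_inliers, 1:], rcond=None)
        alpha = np.concatenate([[1.0], coeffs[0]])
        beta = np.concatenate([[0.0], coeffs[1]])
        return alpha, beta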
Chapter 4
Spatio-Temporal Alignment
Time and space are modes by which we think and not
conditions in which we live.
Albert Einstein
While images of a dynamic scene may contain stationary points in the
background, these points cannot be expected to represent the majority of
detected features. Any procedure that attempts to estimate the geometric
relations between the views from those features alone is likely to ignore a
significant portion of the available image information. In practice, this will
cause errors in the computed fundamental matrices that encapsulate the
geometric relations and, ultimately, in the reconstructed timeline. In this
chapter we show how to refine the pre-computed fundamental matrices F_ij and the timeline L by incorporating all features detected in the sequences, i.e., both the tracked dynamic features and the static features detected on
the background. Without loss of generality, we assume that the camera
represented by the number 1 is our reference camera.
4.1 The Refinement Algorithm
Even though the solution presented in Chapter 3 for temporally aligning
multiple video sequences is robust, it is not very accurate: it does not use
any information from the dynamic features to refine the spatial alignment
between sequences and even the temporal alignment is affected by errors in
the initial spatial alignment, because the RANSAC threshold for regarding
a point in the voting space as an inlier is set proportional to the magnitude
of those errors.
In the following, we present a method that can refine both the temporal
and the spatial transformations between sequences, using information from
all features available. Firstly, we present its derivation for the case of scenes
monitored by two distinct viewpoints (N = 2) and, next, we present the
general ideas behind its application for N > 2.
4.1.1 Two-View Refinement
The geometric basis of the refinement method in the case of two different viewpoints is presented in Figure 4.1. The multilinear tensor that encapsulates the geometric relations between the two views is represented by a fundamental matrix. To make the problem linear, we approximate each tracked trajectory q_2(·) with a polygonal spline s_2(·) (represented by the red line segments in Figure 4.1) using a method proposed by Horst and Beichl that guarantees an upper bound on the difference between the length of the original curve and the length of its polygonal approximation (Horst and Beichl, 1997). Importantly, we parameterize these polygonal approximations in a way that is consistent with the (temporal) parameterization of the original curves, i.e., each point that lies both on the original curve and on the approximation keeps its original coordinate, so that each linear spline segment has endpoints with well-defined temporal coordinates f_{2a} and f_{2b}.

Figure 4.1: Geometry of our refinement method.
The key steps of our method to refine the current estimates for the spatial parameters represented by the entries of the fundamental matrix F̂, and the temporal parameters α̂ and β̂ of the timeline, are:

1. Create the epipolar lines q_1(f_1)F̂ in camera 2 from their corresponding instantaneous feature positions q_1(f_1) in camera 1, by using F̂.

2. Intersect the epipolar lines q_1(f_1)F̂ with the polygonal trajectory approximations s_2(t_2) in camera 2.

3. Screen the resulting intersections s_2(f_2) for consistency with the current estimates for α̂ and β̂.

4. Use only the consistent intersections to generate algebraic equations that jointly constrain the (unknown) transformation parameters F, α and β.
More specifically, according to Equation (3.1), an estimated temporal transformation with parameters α̂ and β̂ implies that any instantaneous feature q_1(f_1) in camera 1 should correspond to a feature in camera 2 whose temporal coordinate is

    f_2 = α̂ f_1 + β̂.     (4.1)
Thus, an intersection s_2(f_2) in camera 2 (see Figure 4.1) between an epipolar line q_1(f_1)F̂ and an approximated trajectory segment whose endpoints have temporal coordinates f_{2a} and f_{2b} is consistent with the current estimated temporal alignment only if

    f_{2a} < α̂ f_1 + β̂ < f_{2b}.     (4.2)
Every intersection that satisfies the consistency condition in Eq. (4.2) yields a linear constraint on α, β, and the entries of F. To obtain this constraint, we note that an arbitrary point s_2(f_2) on an intersected spline segment q_2(f_{2a})q_2(f_{2b}) (Figure 4.1) is, according to the segment's parameterization, given by

    s_2(f_2) = q_2(f_{2a}) + (f_2 − f_{2a}) [q_2(f_{2b}) − q_2(f_{2a})] / (f_{2b} − f_{2a}).     (4.3)
Observe that if f_2 = f_{2a} then s_2(f_2) = q_2(f_{2a}), which means that the epipolar line q_1(f_1)F̂ intersects the endpoint q_2(f_{2a}). Similarly, if f_2 = f_{2b}, then s_2(f_2) = q_2(f_{2b}), meaning that the epipolar line q_1(f_1)F̂ intersects the other endpoint q_2(f_{2b}).
Now, consider that s_2(f_2) in camera 2 represents the corresponding point of q_1(f_1) in camera 1. Given that assumption, those points together must satisfy the following equation:

    q_1(f_1) F s_2(f_2) = 0.     (4.4)
Observe that we have considered F instead of F̂ in Equation (4.4), where F = F̂ + ΔF, i.e., the entries of F represent the spatial parameters obtained after the refinement process, F̂ is the current estimate of the fundamental matrix and ΔF is the refinement term computed by our method. Substituting Equation (4.3) into Equation (4.4), we obtain:

    q_1(f_1) F { q_2(f_{2a}) + (f_2 − f_{2a}) [q_2(f_{2b}) − q_2(f_{2a})] / (f_{2b} − f_{2a}) } = 0.     (4.5)
Equation (4.5) may be rewritten in a more concise manner as follows:

    q_1(f_1) {F k f_2 + F m} = 0,     (4.6)

where

    k = [q_2(f_{2b}) − q_2(f_{2a})] / (f_{2b} − f_{2a}),     (4.7)

    m = q_2(f_{2a}) − f_{2a} k.     (4.8)
Now, by considering f_2 = α f_1 + β, where α = α̂ + Δα and β = β̂ + Δβ, i.e., α and β are the temporal parameters obtained after the refinement process, α̂ and β̂ are the current estimates and Δα, Δβ are the refinement terms, and similarly by writing F = F̂ + ΔF, the factor enclosed by curly brackets in Equation (4.6) becomes

    F̂(f_1 α̂ k + β̂ k + m) + f_1 F̂ k Δα + F̂ k Δβ + ΔF(f_1 α̂ k + β̂ k + m) + f_1 Δα ΔF k + Δβ ΔF k.
Disregarding the second-order terms f_1 Δα ΔF k and Δβ ΔF k, we obtain the following linear constraint on ΔF, Δα and Δβ:

    q_1(f_1) [ f_1 F̂ k Δα + F̂ k Δβ + ΔF h ] = − q_1(f_1) F̂ h,     (4.9)

where h = (f_1 α̂ + β̂) k + m.
After straightforward algebraic manipulation, Equation (4.9) may be rewritten as the (inner) product of two vectors: an 11-element row vector that contains only known coefficients and an 11-element column vector that contains the 9 unknown coefficients of ΔF followed by the scalar unknowns Δα and Δβ. Linear constraints of this form, yielded by all the temporally consistent intersections between epipolar lines and approximated trajectories, are finally assembled into an over-constrained linear system. Importantly, traditional linear constraints of the type used in the eight-point algorithm can also be added to this system, just by concatenating two zeros at the end of their coefficient vectors, as the coefficients of the temporal parameters Δα and Δβ. Therefore, our solution allows us to use all available constraints both from static features and from dynamic features in order to refine the estimated spatio-temporal alignment.
The set of equations that we construct is of the form A_{n×11} x_{11×1} = b_{n×1}, where the number of linear constraints n is frequently much larger than the 11 unknowns. Our task now consists in finding the best solution x for that linear system. There are many different techniques for solving such a system (Press et al., 1988; Atkinson, 1989; Tomasi, 2000; Hefferon, 2001), and a good way to find its solution is by using the Singular Value Decomposition (SVD) (Press et al., 1988), although linear least-squares methods also represent very interesting alternatives (Atkinson, 1989). By using one of the above-mentioned numerical methods, our spatio-temporal approach computes the system's solution in an iterative way, until the unknowns ΔF, Δα and Δβ converge to zero.
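As an illustration of how such a system might be assembled and solved, the sketch below builds the 11-element constraint row of Equation (4.9) for one consistent intersection and solves the stacked system by linear least squares. The row-major vectorization of ΔF and the function names are assumptions made for this example only.

    import numpy as np

    def refinement_row(q1, f1, q2a, f2a, q2b, f2b, F_hat, alpha_hat, beta_hat):
        """One linear constraint of Eq. (4.9) on x = [vec(dF); d_alpha; d_beta].
        q1, q2a, q2b are homogeneous 3-vectors; f2a, f2b bound the intersected segment."""
        k = (q2b - q2a) / (f2b - f2a)                 # Eq. (4.7)
        m = q2a - f2a * k                             # Eq. (4.8)
        h = (f1 * alpha_hat + beta_hat) * k + m       # h as defined after Eq. (4.9)
        row = np.empty(11)
        row[:9] = np.outer(q1, h).ravel()             # coefficients of the 9 entries of dF
        row[9] = f1 * (q1 @ F_hat @ k)                # coefficient of d_alpha
        row[10] = q1 @ F_hat @ k                      # coefficient of d_beta
        rhs = -(q1 @ F_hat @ h)
        return row, rhs

    def solve_refinement(rows, rhs):
        """Least-squares solution of A x = b; returns dF (3x3), d_alpha, d_beta."""
        x, _, _, _ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
        return x[:9].reshape(3, 3), x[9], x[10]

In the same spirit, a static-feature constraint could be appended to the system by using np.outer(q1, q2).ravel() for the first nine coefficients, zeros for the last two, and −q1 F̂ q2 as the right-hand side, matching the remark above about concatenating two zeros to eight-point-style constraints.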
4.1.2 N-View Refinement
Consider now a dynamic scene viewed by N distinct cameras, where N > 2, and suppose that our reference camera is the camera labeled by the number 1, which is denoted by c_1. In this case, we may simply use the two-view refinement technique previously described for each pair of cameras (c_1, c_i), where c_i is the i-th camera, for i = 2, ..., N. Thus, by combining the computed equations t_i = α_i t_1 + β_i with refined parameters α_i and β_i, we may obtain new equations that capture the temporal relation between any two arbitrary sequences i and j.
As far as our methodology for refining the spatio-temporal alignment between N different video sequences is concerned, one may argue that the use of other tools such as the trifocal and quadrifocal tensors could also be considered, since they represent natural extensions of the fundamental matrix to the case of three and four views, respectively. Because of the added stability of a third or even a fourth view, and the fact that they constrain the position of reconstructed points in space more tightly, use of the trifocal and quadrifocal tensors should lead to greater accuracy than two-view techniques (Hartley, 1998). This hypothesis is supported by the results of Heyden (Heyden, 1995a,b). In order to evaluate the use of those alternative tools, we derived and tested a three-view refinement methodology similar to the one derived for the two-view case. The main difference of this alternative approach relates to the multilinear tensor (a trifocal tensor instead of a fundamental matrix) that encapsulates the geometric relations between the views. Our experiments showed that this strategy is not effective, since the optimization of the temporal and spatial parameters is quite unstable. One of the main impediments to using the trifocal tensor (and the quadrifocal tensor) is its overparameterization (Hartley, 1998): 27 components of the tensor describe a geometric configuration that depends only on 18 parameters. This is probably the main reason for the inaccuracies of our results and the observed instability in the optimization process.
Chapter 5
Experimental Results
No amount of experimentation can ever prove me
right; a single experiment can prove me wrong.
Albert Einstein
In this chapter, we present and discuss several experimental results with
real-world and synthetic sequences. Firstly, in Section 5.1, we illustrate the
applicability of our approach by testing it on several challenging two- and
three-view datasets of real-world dynamic scenes. Scenarios that may be critical for most of the current spatio-temporal alignment methodologies are successfully handled by our approach, such as situations where reliable feature tracking cannot be performed over the entire sequence, the initial estimates of the cameras' epipolar geometry are inaccurate, the video sequences have large temporal misalignments, and the scene points move along three-dimensional, overlapping and cyclical trajectories.
However, although these experiments with real-world sequences demonstrate how effective our method can be in some common and challenging scenarios, their results do not allow a careful analysis of the scalability, efficiency and accuracy of our approach, since many critical variables, such as the error in the initial estimate of the cameras' epipolar geometry, had their values tested only in a limited range.
Therefore, to obtain additional experimental results for a better evalua-
tion of our method, we have performed experiments with synthetic sequences
of an artificial scene, presented in Section 5.2, where we could simulate and
control some of the main parameters that affect the results of our approach.
In particular, based on that simulation, we obtained quantitative measure-
ments of the quality of the estimated spatio-temporal alignments as functions
of three key factors: (1) the accuracy of the initial estimates of the fundamental matrices that capture the geometric relations between the views, (2) the amount of noise in the tracking system and (3) the number of tracked features that are considered by the method.
5.1 Real-world Sequences
To demonstrate the applicability of our timeline reconstruction algorithm,
we tested it on various challenging two- and three-view real-world datasets.
Image dimensions in all datasets were about 320 × 240 pixels. The sequences
represented a wide variety of conditions, including sequences that ranged
from 55 to 605 frames; temporal misalignments of 21 to 285 frames; relative
frame rates between 1 and 2; image quality that ranged from quite high (i.e.,
sequences captured by laboratory-based color cameras) to rather low (i.e.,
clips from a low-quality, MPEG-compressed video of a broadcast TV signal);
and object motions ranging from several pixels per frame to less than a pixel.
Since no single tracker was able to handle all of our datasets, and since
our algorithm does not depend on a specific tracker, we experimented with
several—a simple color-based blob tracker, a blob tracker based on back-
ground subtraction, and the WSL tracker (Jepson et al., 2003). In each case,
we treated the tracker as a “black box” that returned a list of corresponding
features for every pair of consecutive frames.
Alignment accuracy was evaluated by measuring the average temporal
misalignment. This is the average difference between the computed time of
each frame and the frame’s “ground-truth” time, i.e., when it was actually
captured. Since our sequences were acquired with unsynchronized cameras,
the ground-truth time of each frame could only be known to within ±0.5
frames. This is because even if we could perfectly align the sequences at
frame resolution, corresponding frames could have been captured up to 0.5
frame intervals apart. This lower bound on ground-truth accuracy is critical
in evaluating the presented results.
5.1.1 Two-view Car Dataset
As a first test, we applied our technique to a two-view sequence used by
Caspi and Irani (Caspi and Irani, 2000) for evaluating their method (Figure
The data was acquired by two cameras with an identical frame rate of
25fps, implying a unit ground-truth temporal dilation (α = 1). The ground-
truth temporal shift was β = 55 ± 0.5 frames.
Most frames in the resulting sequences contain a single rigid object (a
car) moving over a static background (a parking lot), along a fairly smooth
trajectory. We therefore used a simple blob tracker that relied on foreground-
background detection to label all foreground pixels in each frame. The cen-
troid of the foreground pixels was the only “feature” detected and tracked
(Figures 3.2(a) and 3.2(b)). To compute the cameras’ fundamental matrix
we used the normalized eight-point algorithm (Hartley, 1997a), which was
provided with twenty six correspondences between background pixels in the
two views illustrated in Figure 5.2.
Figure 5.1: Four representative frames (100, 200, 300, 400) from the cameras 1 and 2, of the two-view Car dataset (Caspi and Irani, 2000). We can identify the spatial misalignment by observing near image boundaries, where different static objects are visible in each sequence. The temporal misalignment is easily identified by comparing the position of the gate in frames 400. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .
Figure 3.4(d) shows the timeline reconstructed using the RANSAC-based algorithm of Chapter 3, with the RANSAC parameter ε set to 2.0. The reconstructed timeline gives an average temporal misalignment of 0.66 frames, almost within the 0.5-frame uncertainty of the ground-truth measurements.
By applying the refinement procedure of Chapter 4 we obtained updated
values of α = 1.0027 and β = 54.16 for the timeline coefficients. These co-
efficients correspond to an improved average temporal misalignment of 0.35
frames, i.e., below the accuracy of the ground-truth alignment. Note that
these results are at least as accurate as those of Caspi and Irani, even though
we are solving a less constrained problem (i.e., α is unknown and scene pla-
narity is not required). Moreover, the results were obtained from raw results
of a tracker that was not particularly robust (e.g., the centroid of the fore-
ground pixels drifts off the moving car for approximately 30 frames in each
sequence).
Figure 5.2: The recovered epipolar geometry for the two-view Car dataset.
Points and their epipolar lines are displayed in each image for verification.
Accuracy of the computed fundamental matrix can be appreciated by the
closeness of each point to the epipolar line of its corresponding point.
5.1.2 Two-view Robots Dataset
In a second experiment, we used two cameras operating at 30fps to acquire images of four small robots, as they executed small random movements on two planes (Figure 5.3). The ground-truth timeline coefficients were α = 1 and β = −284.5 ± 2. We used a uniform-color blob tracker to track these robots between consecutive frames. The resulting data was challenging for four reasons. First, the robots' inter-frame motion was imperceptibly small (roughly 0.25 pixels per frame), making precise manual alignment by a human observer virtually impossible. Second, the temporal shift of the sequences was large, making it inefficient to find this shift via exhaustive search. Third, the uniformly-colored regions on each robot were small, causing our tracker to generate fragmented and noisy trajectories. Fourth, the robots' motion was designed to produce trajectories that self-intersect and that are non-smooth, complicating the shape of each blob's trajectory.
Figure 5.3: Three representative frames (000, 150, 300) from the cameras 1 and 2, of the two-view Robots dataset. The spatial misalignment can be easily identified by observing the distinct orientations of the robots' soccer field. On the other hand, the temporal misalignment can be noticed by comparing the position of the robot with the dark green circle in frames 300. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .

The timeline reconstructed with ε = 2.0 prior to refinement is shown in Figure 5.4. This line gives an average temporal misalignment error of 5.84 frames. Our refinement stage reduced this error to 4.43 frames, with α = 1.015 and β = −286.89. Given the robots' image velocity, this translates to a misalignment of about one pixel. Figure 5.7 confirms that the computed alignment is quite good, despite the robots' slow motion and the tracker's poor performance.

Figure 5.4: Voting space, timeline, and timeline equation (f_2 = 1.0329 f_1 − 305.7445) recovered prior to refinement for the two-view Robots dataset. Each point is an element of V_{q_1(f_1)} for some feature q_1(f_1) in sequence 1.
5.1.3 Two-view Juggling Dataset
In this dataset, two people are observed by a wide-baseline camera pair while juggling five uniformly-colored balls (see Figure 5.5). Both sequences were acquired at a rate of 30fps. This dataset represents a difficult case for existing direct- or feature-based methods because (1) the trajectories of different balls nearly overlap in 3D, (2) individual trajectories are approximately cyclical, (3) image velocities are quite large, up to 9 pixels per frame, making long-range feature tracking difficult, and (4) the ground-truth temporal shift between the sequences is β = −41 ± 0.5 frames, or about 1.5 periods of a ball's motion. This shift is likely to cause difficulties for techniques based on non-exhaustive search (Rao et al., 2003) or non-linear optimization (Caspi et al., 2002) because of the possibility of getting trapped in deep local minima.
Figure 5.5: Four representative frames (100, 115, 120, 130) from the cameras 1 and 2, of the two-view Juggling dataset. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .

To make the alignment problem even more challenging, we modified this dataset by deleting or adding frames to one of the sequences. These modifications were intended to simulate sequences with more than one frame rate (e.g., containing a slow-motion segment) and sequences that contain spurious clips (e.g., a TV commercial). We used a uniform-color blob tracker to track four of the balls in each sequence, providing us with the location of four features per frame. No information about feature correspondences between cameras was given to the algorithm (i.e., color information was not used). Figures 5.6(a)-(c) show the reconstructed timelines before the refinement stage, with ε = 0.5. The average temporal misalignment error was 0.75 frames for the original dataset. The refinement stage brought this error down to 0.26 frames, with α = 1.0004 and β = −40.80.
Figure 5.6: Voting spaces, timelines, and timeline equations recovered prior to refinement for the two-view Juggling dataset. (a) Juggling dataset without modification (f_2 = 1.0035 f_1 − 40.8382), (b) simulation of a sequence with more than one frame rate (f_2 = 1.0067 f_1 − 42.2728 and f_2 = 0.49677 f_1 + 44.874) and (c) simulation of a sequence with spurious clips (f_2 = 1.0086 f_1 + 57.5611 and f_2 = 0.98712 f_1 − 39.4849).
Figure 5.7: Before alignment images were created by superimposing the green band of a frame f_2 with the red and blue bands of frame f_1 = (f_2 − β*)/α*, using ground truth timeline coefficients α* and β*. After alignment images were created by replacing the green band of the images above them with that of frame f_1 = (f_2 − β)/α, with α, β computed by our algorithm. For both types of images, deviations from the ground-truth alignment cause "double exposure" artifacts (i.e., when f_1 ≠ f_2 or f_1 ≠ f*_1, respectively). The frames shown are f_2 = 300 (Robots dataset) and f_2 = 115 (Juggling dataset), together with magnified views.
Figure 5.8: Distribution of distances of inlier votes from the reconstructed timeline (rows: Car, Robots and Juggling datasets; each histogram plots the number of temporal votes against the residual error in frames). Left column: distribution before the timeline refinement stage. Right column: distribution after the refinement stage. Note that the updated epipolar geometry and updated timeline parameters reduce the distance between inliers and the timeline and cause more votes to be labeled as inliers.
5.1.4 Three-view Soccer Dataset
As a final experiment, we applied our technique to three video clips extracted from a single MPEG-compressed TV broadcast of a soccer match (FIFA, 2002). The clips were replays of the same goal filmed from three distinct viewpoints (Figure 5.9). Each sequence contained a significant panning motion to maintain the moving players within the field of view. To ensure that the pairwise fundamental matrices remained constant for all frames, we stabilized each sequence by computing the frame-to-frame homography using Brown and Lowe's system (Brown and Lowe, 2003). We used the WSL tracker to track the same player in each sequence, thereby obtaining one feature trajectory per camera. WSL was initialized manually in the first frame of each sequence. Even though it was able to track the chosen player for most frames, the player's small size and jitter artifacts caused by the video's poor quality resulted in noisy measurements of his location. These measurements were given as input to the basic timeline reconstruction algorithm with ε = 1.5 and no timeline refinement.
Since this dataset contained N = 3 views, the timeline is a 3D line with 3-
vectors as its coefficients (see Eq. (3.2) and Figures 5.10(a) and 5.10(b)). To
evaluate the timeline’s accuracy in the absence of ground-truth information,
we attempted to estimate the ground-truth alignment by visual inspection:
we identified three easily-distinguishable events (e.g., a player stepping on a
field line, as in Figure 5.9) and recorded the frame where each event occurred
in each sequence. These frames were used as “ground-truth” event times for
each camera. To evaluate the timeline’s accuracy, we used it to predict the
event times in cameras 1 and 2 from the ground-truth time in camera 3.
Figure 5.9: Two representative frames (8, 46) from the cameras 1, 2 and 3, of the three-view Soccer dataset; the panels label these frames as Event #1 and Event #2 and include a magnified view of camera 3. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .
Figure 5.10: (a),(b) Two views of the 3D voting space and 3D timeline computed for the Soccer dataset (the axes are the frame numbers of cameras 1, 2 and 3).
The minimum difference between the predictions and the ground-truth
times across all three events was 0.22 frames in camera 1 and 0.86 frames
in camera 2; the maximum difference was 1.66 and 1.33 frames, respectively.
This confirms that the sequences were aligned quite well (see Figure 5.11),
despite the low quality of the videos and their unequal frame rates.
Figure 5.11: Before alignment images were created by superimposing the green band of a frame f_2 with the red and blue bands of frame f_1 = (f_2 − β*)/α*, using ground truth timeline coefficients α* and β*. After alignment images were created by replacing the green band of the images above them with that of frame f_1 = (f_2 − β)/α, with α, β computed by our algorithm. For both types of images, deviations from the ground-truth alignment cause "double exposure" artifacts (i.e., when f_1 ≠ f_2 or f_1 ≠ f*_1, respectively). The frame shown is f_2 = 46, together with a magnified view.
5.2 Synthetic Sequences
In the second phase of our experiments, we developed a software system for simulating the dynamics of 3D particles with independent motion and their visualization from multiple viewpoints. Through the use of synthetic data, the values of some key parameters, such as the number of tracked features, could be controlled and their impact on the accuracy, stability and scalability of our approach could be carefully analyzed. In particular, by using the developed simulator we answer in this section three fundamental questions:
1. What is the scalability of our method against an increasing number of
tracked features?
2. How is the accuracy of our method affected by errors in the initial esti-
mates of the fundamental matrices that capture the geometric relations
between the views?
3. How does the noise level in the tracking system affect the accuracy of
our technique?
5.2.1 Scene dynamics simulation
We consider in this analysis an artificial scene monitored by two calibrated cameras, whose intrinsic and extrinsic parameters are listed in Table 5.1 and are those of the real cameras used in our indoor experiments presented in Section 5.1, namely, a Hitachi KP-D50 Color CCD Camera (C_1) and a Sony DCR-TRV320 Digital Camcorder (C_2). Those cameras were positioned according to the scheme illustrated in Figure 5.12(a).

Table 5.1: Matrices of intrinsic and extrinsic parameters for the cameras used during the simulation of the artificial scene.

Camera | Intrinsic parameters          | Extrinsic parameters
C_1    | [675.00    0     438.69]      | [0.71  0.01  0.70  1237.20]
       | [  0     674.61  260.08]      | [0.04  1.00  0.06     2.57]
       | [  0       0       1   ]      | [0.70  0.07  0.71  2636.10]
C_2    | [1835.30   0     352.39]      | [0.78  0.28  0.55   222.31]
       | [  0     1834.20 220.85]      | [0.12  0.81  0.58   115.09]
       | [  0       0       1   ]      | [0.61  0.52  0.60 10130.00]
Figure 5.12: Simulation of scene dynamics. (a) Experimental setup. (b) All features start in random positions uniformly distributed within the illustrated sphere, which is viewed from both cameras.
Random object trajectories were generated in the world coordinate system
and projected in the image planes by using the afore-mentioned intrinsic and
extrinsic parameters. All features were started in random positions uniformly
distributed within a sphere that was visible from both cameras, as illustrated
in Figure 5.12(b). Note that the sphere’s central point was defined by the
center of the black circle located in the middle of a planar calibration target
that was positioned in the 3D world, so that the projections of the central
point of that circle were located near the centers of the cameras’ CCDs.
The sphere’s radius was determined empirically in order to maximize its
projections in both cameras, subject to the constraint that such projections
should be entirely visible.
In our simulation, each feature moves for l frames, where l is drawn from a uniform distribution in the interval [1, 2L]. Specifically, we consider that both cameras acquire image sequences of the same length (256 frames) and that the average feature lifetime L is 128 frames. By defining a potentially distinct lifetime for each feature, we simulate situations where the tracker may lose some features, either because they actually become occluded or go out of bounds, or because the computation fails for one reason or another. As it is frequently desired to maintain a certain number of features, every time a feature dies, a new one is started in a random position within the volume visible from both cameras.

Figure 5.13: Feature dynamics.
The features move with the following dynamics:

    p_{t+1} = p_t + v cos(φ) ρ + v sin(φ) τ,     (5.1)

which is illustrated in Figure 5.13 and whose notation is described below:

- p_t and p_{t+1} are the current frame and next frame positions, respectively.

- v is the magnitude of the feature displacement in the 3D world (in cm), simulated as additive gaussian noise with zero mean and standard deviation V. In the current simulation, we defined V = ±2.5 cm, resulting in an average feature displacement of about 2 pixels in the image planes of both cameras.

- φ is an angular variation in the direction of movement of the feature in the 3D world (in radians), simulated as additive gaussian noise with mean zero and standard deviation A. Specifically, we defined A = 0.09 radians, that is, about 5 degrees.

- ρ is a unit direction vector along the current direction of movement.

- τ is a unit direction vector perpendicular to ρ, given by the cross product between ρ and a unit direction vector uniformly sampled on the sphere.
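A minimal sketch of one simulation step is shown below, using the V and A values given above. The way ρ is updated after the step (reset to the realized displacement direction) is an assumption of this sketch, not a statement about our simulator.

    import numpy as np

    rng = np.random.default_rng(0)

    def dynamics_step(p_t, rho, V=2.5, A=0.09):
        """One step of Eq. (5.1): p_{t+1} = p_t + v cos(phi) rho + v sin(phi) tau.
        V is the displacement standard deviation (cm); A the angular one (radians)."""
        v = rng.normal(0.0, V)                           # displacement magnitude
        phi = rng.normal(0.0, A)                         # angular variation of the direction
        u = rng.normal(size=3)
        u /= np.linalg.norm(u)                           # unit vector uniformly sampled on the sphere
        tau = np.cross(rho, u)
        tau /= np.linalg.norm(tau)                       # unit vector perpendicular to rho
        p_next = p_t + v * np.cos(phi) * rho + v * np.sin(phi) * tau
        rho_next = (p_next - p_t) / (np.linalg.norm(p_next - p_t) + 1e-12)  # assumed direction update
        return p_next, rho_next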
Regarding the video sequences obtained by the visualization process performed by the cameras, we defined the following "ground-truth" affine transformation for modelling their temporal misalignment:

    f_2 = f_1 − 32,     (5.2)

that is, we have cameras with the same frame rate (α = 1) and whose corresponding video sequences have a temporal misalignment of 32 frames (β = −32). Importantly, the quality of the temporal alignments estimated by our method as well as its computational cost are invariant to the magnitude of the temporal shift between the sequences, since our approach derives alignment constraints by matching instantaneous positions of features in one sequence against entire feature trajectories in the other sequences. Therefore, we have arbitrarily defined a temporal misalignment of 32 frames, which represents approximately the average value of most of the temporal shifts of the real-world sequences presented in Section 5.1. Given that both cameras used in this experiment work at a frame rate of 30 frames per second, we note that this temporal shift simulates cameras that were activated at distinct time instants with an interval of about 1 second.
5.2.2 Error measurements and experiments’ description
In the current analysis, we use the average temporal alignment error as our basic measurement for evaluating the accuracy, scalability and stability of our approach. Specifically, its value is given by the average of the absolute values of the differences between the temporal coordinates computed by the estimated timeline and the temporal coordinates computed by the "ground-truth" affine transformation in Equation (5.2). That is, if f_1 represents the temporal coordinate of the reference sequence, t_g(f_1) represents its corresponding temporal coordinate computed by the affine transformation in Equation (5.2) and t_e(f_1) is its corresponding temporal coordinate computed by using the timeline estimated by our method, the average temporal alignment error ε_t is given by:

    ε_t = (1/256) Σ_{f_1=0}^{255} |t_e(f_1) − t_g(f_1)| = (1/256) Σ_{f_1=0}^{255} |(α − 1) f_1 + β + 32|.     (5.3)
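For reference, Equation (5.3) amounts to the following small computation, a sketch under the assumption that the estimated timeline is given by its two coefficients:

    import numpy as np

    def average_temporal_alignment_error(alpha, beta, n_frames=256):
        """Average temporal alignment error of Eq. (5.3): compare the estimated
        timeline f_2 = alpha*f_1 + beta with the ground truth f_2 = f_1 - 32 of Eq. (5.2)."""
        f1 = np.arange(n_frames)
        return np.mean(np.abs((alpha - 1.0) * f1 + beta + 32.0))

A perfect estimate (α = 1, β = −32) yields ε_t = 0.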
To measure the error of the fundamental matrix, we used the same background features that were used to compute its initial estimate in the real scene. Specifically, this error was measured as the average of the distances between each background feature projection in the image plane of the reference camera (C_1) and its corresponding epipolar line. That is, if e_i represents the distance between a static feature i and its corresponding epipolar line, the error ε_f (in pixels) of the fundamental matrix is given by:

    ε_f = (1/m) Σ_{i=1}^{m} e_i,     (5.4)

where m is the total number of background features.
In order to simulate the error in the cameras' epipolar geometry, new fundamental matrices with pre-determined errors (in pixels) were generated by adding a small multiple Δ_F of a 3 × 3 unit matrix J to the original fundamental matrix F estimated with the normalized eight-point algorithm. That is, we added the term Δ_F = δ_F J to F, where δ_F simulates gaussian noise with mean zero and standard deviation 1e−5. This operation was repeated until the new fundamental matrix had the desired error.
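The sketch below illustrates both the error measure of Equation (5.4) and the perturbation procedure just described. The correspondence format, the convention q_1 F q_2 = 0, the relative scale of F and δ_F, and the stopping logic are assumptions made for this example.

    import numpy as np

    def fundamental_matrix_error(F, points1, points2):
        """Eq. (5.4): average distance between each background feature in the
        reference camera (points1, homogeneous 3-vectors) and the epipolar line
        F q_2 of its correspondence in the other camera (points2)."""
        distances = []
        for q1, q2 in zip(points1, points2):
            line = F @ q2
            distances.append(abs(line @ q1) / np.hypot(line[0], line[1]))
        return np.mean(distances)

    def perturb_fundamental_matrix(F, points1, points2, target_error,
                                   rng=np.random.default_rng(0)):
        """Repeatedly add delta_F * J (J the 3x3 matrix of ones, delta_F ~ N(0, 1e-5))
        until the perturbed matrix reaches the desired average error."""
        J = np.ones((3, 3))
        F_new = F.copy()
        while fundamental_matrix_error(F_new, points1, points2) < target_error:
            F_new = F_new + rng.normal(0.0, 1e-5) * J
        return F_new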
On the other hand, to simulate distinct noise levels in the tracker, we considered the following model:

    x_n = x_o + d cos(ϕ),     (5.5)
    y_n = y_o + d sin(ϕ),     (5.6)

where:

- (x_o, y_o) represents the original coordinates of a feature.

- (x_n, y_n) represents the new corrupted coordinates of a feature.

- d represents the magnitude of the feature displacement, simulated as additive gaussian noise with zero mean and standard deviation R.

- ϕ defines the direction of the feature displacement in the image plane and is drawn from a uniform distribution in the interval [0, 2π].

In particular, we controlled the variation of the tracker's noise level in our experiments by defining specific values for the standard deviation R of the random variable d described above.
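A one-function sketch of this noise model (function and parameter names are illustrative only):

    import numpy as np

    def corrupt_feature(x_o, y_o, R, rng=np.random.default_rng(0)):
        """Tracker noise model of Eqs. (5.5)-(5.6): displace a feature by a distance
        d ~ N(0, R) along a direction phi drawn uniformly from [0, 2*pi)."""
        d = rng.normal(0.0, R)
        phi = rng.uniform(0.0, 2.0 * np.pi)
        return x_o + d * np.cos(phi), y_o + d * np.sin(phi)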
Using the above-described mechanisms for defining and measuring the errors of the estimated temporal alignments, of the cameras' epipolar geometry and of the feature positions estimated by the tracker, we performed the following two groups of experiments:

1. Varying number of features and tracker's error with a fixed fundamental matrix's error of 2 pixels (ε_f = 2). In this case, for each tuple (ε_f, R, k), where R ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and k ∈ {1, 2, 4, 8, 16, 32}, we simulated 100 distinct 256-frame 3D sequences, computed their corresponding voting spaces and measured the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), before and after the timeline refinement stage.

2. Varying number of features and fundamental matrix's error for a tracker with a gaussian noise that has a fixed standard deviation of ±2 pixels (R = 2). Similarly to the previous group of experiments, for each tuple (ε_f, R, k), where ε_f ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and k ∈ {1, 2, 4, 8, 16, 32}, we simulated 100 distinct 256-frame 3D sequences, computed their voting spaces and measured the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s).
Since a fundamental matrix with an average error of 2 pixels represented
the worst case in the real-world sequences presented in Section 5.1, we decided
to use that value in our first group of experiments above, in order to guarantee
a realistic perception about the scalability of our approach in cases where the
computation of very accurate fundamental matrices is impossible. Similarly,
a tracker with a standard deviation of ±2 pixels was used in the second group
of experiments, since that value represents roughly the noise level observed
in some trackers used in our real-world sequences.
By measuring the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), we illustrate the applicability of our approach in scenarios where we need timelines with highly accurate parameters (ε_t ≤ 1 frame) or in less challenging situations where lower accuracies are allowed (ε_t ≤ 2 frames or even ε_t ≤ 5 frames).
The values used in all experiments for the RANSAC parameters were p = 0.99, r = 0.05 and ε = 0.5. According to Equation (3.3), those values lead to a RANSAC execution of 1840 iterations. Importantly, they represent conservative values that were defined in order to maximize the accuracy of the timeline's parameters, even though they also result in a lower efficiency for our method. Moreover, in order to ensure the acquisition of accurate results, we restricted the search of the temporal alignment algorithm to timelines whose angular coefficient α satisfies 0.2 ≤ α ≤ 5, that is, we used a-priori information that the video sequences have frame rates that differ by at most a factor of 5.
5.2.3 Evaluation of the experimental results
As a first step, we evaluated the scalability of our temporal alignment and refinement methods against an increasing number of moving features per frame. Specifically, we considered scenes with k = 1, 2, 4, 8, 16 and 32 feature(s). In Figure 5.14 we illustrate some examples of voting spaces for those values of k, where the fundamental matrix has an average error of 2 pixels and the tracker is corrupted by a gaussian noise with zero mean and standard deviation of ±2 pixels.
Note that the larger the number of features, the denser the voting spaces. In fact, by increasing the number of features, the ratio between inliers and outliers decreases visibly. Figure 5.15 quantifies this visual perception by presenting the percentage of inliers for each voting space of Figure 5.14, where a specific vote is regarded as an inlier if its temporal coordinates f_1 and f_2 satisfy |f_2 − f_1 + 32| ≤ 1 frame.
Figure 5.14: Examples of voting spaces for k = 1, 2, 4, 8, 16 and 32 features. In all these cases, the fundamental matrix has an average error of 2 pixels (ε_f = 2) and the tracker is corrupted by a gaussian noise with mean zero and standard deviation of ±2 pixels (R = 2).
Figure 5.15: Percentage of inliers for each voting space of Figure 5.14, plotted against log2(k), where k represents the number of tracked features (k = 1, 2, 4, 8, 16 and 32). A specific vote is considered an inlier if its coordinates f_1 and f_2 satisfy |f_2 − f_1 + 32| ≤ 1 frame.
The voting spaces for 16 and 32 features represent especially challenging
cases, where it is impossible to identify visually the actual timeline. Accord-
ing to Figure 5.15, we have less than 2% of inliers in both cases.
Consider now Figures 5.16 to 5.25. Those figures illustrate the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the number of tracked features. In particular, Figures 5.16 to 5.20 illustrate scenarios with distinct tracker errors, but that have in common a fundamental matrix with an average error of 2 pixels. On the other hand, Figures 5.21 to 5.25 illustrate situations where different fundamental matrix errors are considered and a tracker that is corrupted by a gaussian noise with standard deviation of ±2 pixels is used.
By observing the curves that illustrate the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s) in all those figures, both before and after the refinement stage, we note that although sets containing from 2 to 16 features bring similar results when the tracker's noise level and the fundamental matrix's error have small values, as illustrated in Figures 5.16(a)-(f) and 5.21(a)-(f), our approach tends to be more accurate when sets of 4 features are used, especially in scenarios where the tracking system and the initial estimate of the cameras' epipolar geometry have errors with more significant magnitudes, as illustrated in Figure 5.20(d)-(f), where the tracker is corrupted by a noise with standard deviation of ±10 pixels, and in Figures 5.22(a)-(f) and 5.23(a)-(f) for fundamental matrix errors varying from 3 to 6 pixels.
Specifically, if we consider sets of about 4 features, a fundamental matrix error of about 2 pixels and applications that need timelines whose temporal alignment errors are smaller than or equal to 1 frame, we note from Figures 5.16 to 5.20 that the percentage of solutions computed by our method that satisfy this error constraint varies from 100%, as illustrated in Figure 5.16(a) for a tracker with a low noise level (R = 1), to about 30%, as illustrated in Figure 5.20(d) for a case where the tracker was severely corrupted by noise (R = 10). On the other hand, by fixing the tracker's noise level (R = 2) and varying the fundamental matrix's error, we observe from Figures 5.21 to 5.25 that when sets with 4 features are used, the percentage of timelines whose temporal alignment errors are smaller than or equal to 1 frame varies from 90%, as shown in Figure 5.21(a) for a fundamental matrix error of 1 pixel (ε_f = 1), to about 5%, as illustrated in Figure 5.25(d) for a fundamental matrix error of 10 pixels (ε_f = 10). Therefore, note that the accuracy of our approach, in particular the accuracy of the temporal alignment algorithm, is much more negatively affected by an increase in the error of the fundamental matrix than by an equivalent increase of the tracker's noise level.
Importantly, by assuming less challenging upper bounds of 2 and 5 frames for the temporal alignment error, we observe that the percentage of timelines computed by our method that satisfy these error constraints increases significantly. In fact, by using a set with 4 features and considering a temporal alignment error of at most 2 frames, our method succeeds in 65% of the total number of trials (after the refinement stage) even when a tracker with a high noise level (R = 10) is used, as shown in Figure 5.20(e), and in 55% of the trials (after the refinement stage) when a highly inaccurate fundamental matrix (ε_f = 6) is computed, as illustrated in Figure 5.23(e). The results are even better in cases where the temporal alignment error may be up to 5 frames as, for example, in Figures 5.20(f) and 5.25(f), where our approach succeeded in about 100% and 50% of the total number of trials, respectively, even though the tracker's noise level (R = 10) and the fundamental matrix's error (ε_f = 10) had large values.
Additionally, note from Figures 5.16 to 5.25 that the accuracy of the timeline's parameters may decrease significantly when feature sets that are too small (e.g., 1 feature) or too large (e.g., 32 features) are used. This behavior of our methodology is explained by the direct relation between the complexity of the voting space and the number of tracked features considered. When more features are added, the number of outliers increases faster than the number of inliers, as illustrated in Figures 5.14 and 5.15. Therefore, with a higher amount of spurious information in the voting space due to the higher number of features, our approach may compute timelines whose parameters are more inaccurate. On the other hand, the smaller the set of features, the smaller the number of votes in the voting space. In this case, an insufficient number of inliers may also result in the estimation of inaccurate parameters for the timeline that models the temporal alignment between the sequences.
However, Figure 5.16(a)-(c) suggests that in cases where trackers with low noise levels are used, an increase in the number of tracked features does not have much impact on the accuracy of the estimated timeline parameters, especially before the application of the refinement technique. Note in Figure 5.16(a), for example, the small variations in the percentage of correct timelines before the refinement stage as the number of features increases.
To illustrate how the values of the timeline's parameters vary as the number of features increases, we present Figures 5.26 to 5.28. For all the cases illustrated in those figures, we consider that the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels. Moreover, only the parameters of timelines that lead to an average temporal alignment error smaller than or equal to 1 frame were considered. We note that for all numbers of features illustrated in those figures, the refinement stage played an important role: after its application, the average values of the timeline's parameters became closer to the ground-truth parameters and their corresponding variances decreased significantly. This fact is especially noticeable in the cases of sets with 4 and 16 features, which are illustrated in Figures 5.27(a) and 5.28(a), respectively.
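To make this accuracy criterion concrete, the following sketch shows one straightforward way of computing the average temporal alignment error of an estimated timeline against a ground-truth one in the two-sequence case. It is only an illustrative sketch with purely hypothetical parameter values, not the evaluation code used in our experiments.

```python
import numpy as np

def mean_temporal_alignment_error(alpha_est, beta_est, alpha_gt, beta_gt, num_frames):
    """Average absolute difference, in frames, between the correspondences
    t2 = alpha * t1 + beta induced by an estimated timeline and by the
    ground-truth one, over the frames of the reference sequence."""
    t1 = np.arange(num_frames)
    return np.mean(np.abs((alpha_est * t1 + beta_est) - (alpha_gt * t1 + beta_gt)))

# Purely hypothetical values, used only to illustrate the computation.
err = mean_temporal_alignment_error(1.0003, -31.63, 1.0, -31.77, 1000)
print(f"average temporal alignment error: {err:.2f} frames")
```

With the hypothetical values above, the estimate would be counted among the timelines whose average error does not exceed 1 frame.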
Now, consider Figures 5.29 to 5.34. Essentially, those figures provide the same information already presented in Figures 5.16 to 5.25. However, unlike those previous figures, Figures 5.29 to 5.34 illustrate the percentage of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s) as a function of the standard deviation of the Gaussian noise added to the tracker (Figures 5.29 to 5.31) and as a function of the error in the initial estimate of the fundamental matrix (Figures 5.32 to 5.34). Therefore, by analyzing Figures 5.29 to 5.34 together with Figures 5.16 to 5.25, we can better assess the impact that errors in the cameras' epipolar geometry and in the tracking system have on the accuracy and stability of our results.
By observing the curves in Figures 5.29 to 5.34 that illustrate the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), both before and after the application of the refinement technique, we note that the higher the tracker's noise level and/or the fundamental matrix's error, the greater the tendency of the temporal alignment and refinement methodologies to present lower accuracies. This expected behavior of our approach is more critical when more features are considered, as illustrated in Figures 5.31(a)-(f) and 5.34(a)-(f) for sets with 16 and 32 features.
Two main reasons explain this degradation in the accuracy of our results caused by increases in the tracker's noise level and in the error of the cameras' epipolar geometry. The first reason is that the higher those errors, the lower the quality of the information regarding the actual correspondences between temporal coordinates of feature positions in both image planes. That is, considering the binary representation of a candidate point in the voting space, those errors produce an effect that is equivalent to losing some of its least significant bits. In this context, potential inliers are shifted away from their actual positions in the voting space, with the magnitudes of those shifts proportional to the errors' magnitudes. This behavior may affect the estimation process performed by RANSAC, resulting in timelines with inaccurate parameters.
The second reason is similar to the one pointed out previously regarding increases in the number of features. That is, when trackers and fundamental matrices with higher errors are used, we observe a significant increase of spurious information in the voting spaces. Consequently, this higher number of outliers may negatively affect the effectiveness of RANSAC and lead our approach to compute timelines with inaccurate parameters.
Importantly, the more inaccurate the timeline's parameters, the harder it is for the optimization task formulated in the refinement stage to converge to their actual values. This fact explains the observation from Figures 5.29 to 5.34 that the higher the tracker's noise level and the fundamental matrix's error, the smaller the improvement brought by the application of the refinement technique. Consider, for instance, Figure 5.29(a), where we have only 1 feature in the scene. When the standard deviation of the tracker's noise increases and, consequently, timelines with less accurate parameters are estimated due to the reasons mentioned above, the effectiveness of the refinement technique is severely degraded. Specifically, Figure 5.29(a) suggests that for trackers with standard deviations larger than or equal to 6 pixels, our refinement method should not be used.
By observing Figures 5.29 to 5.31 and establishing that our approach should ideally estimate timelines with average temporal alignment errors smaller than or equal to 1 frame in 95% of the total number of trials, we note that for a fundamental matrix error of about 2 pixels, that goal is achieved only when trackers with low noise levels (standard deviation of about 1 pixel) are used, as illustrated in Figures 5.30(a),(d) and 5.31(a) for sets containing 4, 8 and 16 tracked features.
Finally, we note from Figures 5.29(a),(d), 5.30(a),(d) and 5.31(a),(d) that the exclusive use of the temporal alignment technique is not appropriate when we need timelines whose temporal alignment errors must be smaller than or equal to 1 frame, since for any number of features and noise level considered, the percentage of timelines that satisfied that error constraint before the refinement stage was usually smaller than 30%.
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.16: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±1 pixel. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±2 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.17: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±3 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±4 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.18: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±5 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±6 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.19: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±7 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±8 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.20: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±9 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±10 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.21: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 1 pixel. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 2 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.22: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 3 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 4 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.23: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 5 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 6 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.24: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 7 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 8 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.25: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 9 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 10 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: scatter plots of temporal shift (β) versus temporal dilation (α), before and after refinement, for each feature set.]
(a) k = 1. Before ref.: m_α = 1.0004, σ²_α = 1.1300e−4, m_β = −31.1767, σ²_β = 2.1003. After ref.: m_α = 1.0004, σ²_α = 3.9847e−5, m_β = −31.7825, σ²_β = 0.6514.
(b) k = 2. Before ref.: m_α = 1.0008, σ²_α = 8.9396e−5, m_β = −31.0969, σ²_β = 2.0429. After ref.: m_α = 1.0009, σ²_α = 1.2767e−5, m_β = −31.7608, σ²_β = 0.2588.
Figure 5.26: Space of estimated parameters for (a) k = 1 feature and (b) k = 2 features. Each point shown corresponds to the parameters of an estimated timeline with temporal alignment error smaller than or equal to 1 frame. m_α and m_β represent the average values, and σ²_α and σ²_β the variances, of the temporal dilation (α) and the temporal shift (β), respectively. In both cases, the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels.
[Figure omitted: scatter plots of temporal shift (β) versus temporal dilation (α), before and after refinement, for each feature set.]
(a) k = 4. Before ref.: m_α = 0.9995, σ²_α = 5.9613e−5, m_β = −30.7143, σ²_β = 0.9581. After ref.: m_α = 1.0003, σ²_α = 1.3732e−5, m_β = −31.6266, σ²_β = 0.1964.
(b) k = 8. Before ref.: m_α = 0.9991, σ²_α = 7.7272e−5, m_β = −30.8503, σ²_β = 1.6661. After ref.: m_α = 1.0001, σ²_α = 1.4973e−5, m_β = −31.6023, σ²_β = 0.3413.
Figure 5.27: Space of estimated parameters for (a) k = 4 features and (b) k = 8 features. Each point shown corresponds to the parameters of an estimated timeline with temporal alignment error smaller than or equal to 1 frame. m_α and m_β represent the average values, and σ²_α and σ²_β the variances, of the temporal dilation (α) and the temporal shift (β), respectively. In both cases, the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels.
[Figure omitted: scatter plots of temporal shift (β) versus temporal dilation (α), before and after refinement, for each feature set.]
(a) k = 16. Before ref.: m_α = 0.9997, σ²_α = 1.7003e−4, m_β = −30.7789, σ²_β = 3.4365. After ref.: m_α = 1.0003, σ²_α = 1.7867e−5, m_β = −31.7365, σ²_β = 0.2663.
(b) k = 32. Before ref.: m_α = 0.9982, σ²_α = 1.2828e−4, m_β = −30.8495, σ²_β = 1.7812. After ref.: m_α = 0.9994, σ²_α = 5.2500e−5, m_β = −31.4488, σ²_β = 0.6915.
Figure 5.28: Space of estimated parameters for (a) k = 16 features and (b) k = 32 features. Each point shown corresponds to the parameters of an estimated timeline with temporal alignment error smaller than or equal to 1 frame. m_α and m_β represent the average values, and σ²_α and σ²_β the variances, of the temporal dilation (α) and the temporal shift (β), respectively. In both cases, the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels.
[Figure omitted: six panels plotting the percentage of timelines against the standard deviation of the tracker's noise; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.29: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's noise level. (a), (b) and (c) Results for k = 1 feature. (d), (e) and (f) Results for k = 2 features. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against the standard deviation of the tracker's noise; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.30: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's noise level. (a), (b) and (c) Results for k = 4 features. (d), (e) and (f) Results for k = 8 features. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against the standard deviation of the tracker's noise; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.31: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's noise level. (a), (b) and (c) Results for k = 16 features. (d), (e) and (f) Results for k = 32 features. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against the fundamental matrix's error; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.32: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the average error of the fundamental matrix. (a), (b) and (c) Results for k = 1 feature. (d), (e) and (f) Results for k = 2 features. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against the fundamental matrix's error; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.33: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the average error of the fundamental matrix. (a), (b) and (c) Results for k = 4 features. (d), (e) and (f) Results for k = 8 features. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against the fundamental matrix's error; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.34: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the average error of the fundamental matrix. (a), (b) and (c) Results for k = 16 features. (d), (e) and (f) Results for k = 32 features. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
Chapter 6
Conclusions
Living, one learns; but what one learns, most of all, is how to ask other, bigger questions.
João Guimarães Rosa
6.1 Summary of the Accomplished Work
We have proposed a novel feature-based methodology for aligning multiple video sequences acquired from distinct viewpoints both in time and in space. More specifically, a major contribution of our methodology is that it reduces the computation of temporal and spatio-temporal alignments between sequences to linear regression and linear optimization problems, while previous feature-based techniques need to search the entire space of possible temporal alignments. The quality of the computed alignments and the computational cost of our techniques are invariant to the magnitude of the initial temporal offsets between sequences. Moreover, unlike existing methods, which work for only two video sequences, our approach can handle an arbitrary number of sequences in a single step. Basically, our sequence-to-sequence alignment approach consists of two techniques: (a) one that builds large sets of temporal constraints from a rough spatial alignment between sequences and then performs a robust linear regression in the temporal domain to recover the globally correct temporal alignment, and (b) one that linearizes feature trajectories around the points of intersection with epipolar lines in order to reduce the problem of finding the complete spatio-temporal alignment between two sequences to that of solving a linear system.
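To give a concrete flavour of the first technique, the sketch below shows, in simplified form, how a single line in R^N can be robustly fitted to a cloud of candidate temporal correspondences (the votes). It is an illustrative sketch only: the function name, the fixed iteration count and the inlier tolerance are our own simplifications, not the exact implementation evaluated in Chapter 5.

```python
import numpy as np

def fit_timeline_ransac(votes, num_iters=500, inlier_tol=1.0, rng=None):
    """Robustly fit a line in R^N to candidate temporal correspondences.

    votes: array of shape (M, N); each row is a candidate correspondence
    between frame numbers of the N sequences.
    Returns a point on the line, its unit direction, and the inlier mask."""
    rng = np.random.default_rng() if rng is None else rng
    best_inliers = None
    for _ in range(num_iters):
        i, j = rng.choice(len(votes), size=2, replace=False)
        p, q = votes[i], votes[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-9:          # degenerate sample, skip
            continue
        d = d / norm
        # Perpendicular distance of every vote to the line through p with direction d.
        diff = votes - p
        dist = np.linalg.norm(diff - np.outer(diff @ d, d), axis=1)
        inliers = dist <= inlier_tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-estimate the line from all inliers (principal direction of the inlier cloud).
    pts = votes[best_inliers]
    center = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - center)
    return center, vt[0], best_inliers
```

In our method, each row of votes would be one candidate correspondence between frame numbers of the N sequences, and the recovered line is the timeline.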
We have shown that our work is suitable for solving several types of problems presented in current applications that benefit from the availability of simultaneous video recordings of the same physical event. It is an attractive alternative, since it is less expensive and easier to use outside laboratories than video synchronization hardware, and it can be applied to the many multi-view sequences that already exist in video databases, such as those of sport events.
From a theoretical standpoint, this work is relevant because it provides additional theoretical and empirical evidence that, by considering temporal and spatial cues within a single alignment framework, many physical events that are inherently ambiguous for traditional image-to-image alignment methods are uniquely resolved by sequence-to-sequence alignment techniques.
Our experimental results suggest that our timeline reconstruction algorithm provides a simple and effective method for temporally aligning multiple video sequences. Unlike previous approaches, it is able to handle temporal dilations and large time shifts with no degradation in accuracy, even when scene points move along three-dimensional, overlapping and almost-cyclical trajectories. Importantly, by reducing the alignment problem to a RANSAC-based procedure, our algorithms are able to tolerate large proportions of outliers in the data, high levels of noise, discontinuities in feature trajectories, complete absence of stereo correspondences for moving features, and sequences that contain multiple frame rates. Finally, our approach only requires the ability to track scene points across two consecutive frames of the same sequence, which makes it robust to common tracking problems, such as occlusion, and it does not require the ability to establish feature correspondences between the sequences.
The above-mentioned contributions of the present work and the experiments performed with real-world sequences presented in Section 5.1 were published in one of the most important international peer-reviewed conferences in computer vision:
R. Carceroni, F. Pádua, G. Massahud, K. Kutulakos, "Linear Sequence-to-Sequence Alignment", in Proc. of IEEE Computer Vision and Pattern Recognition Conference, Washington, USA, pp. 746-753, 2004.
Moreover, we are currently preparing a complete journal article, which will be submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). This article includes a careful analysis of the scalability and accuracy of our approach, and discusses the experimental results obtained with synthetic sequences, which are described in Section 5.2.
6.2 Discussion
Although our experimental results demonstrate that our approach can be applied successfully in several challenging scenarios, it has some failure modes that must be pointed out. Firstly, our approach cannot be applied to dynamic scenes in which no moving features can be detected (e.g., the monitored scene is an empty room whose lights have been turned off), or to cases where the scene features move in such a way that their corresponding trajectories in the image plane degenerate to a single point (e.g., a feature moves in a direction that is perpendicular to the image plane). In those scenarios, it would not be possible to generate the voting space for estimating the sought timeline. Another critical situation occurs when the epipolar geometry of a pair of cameras defines epipolar lines that do not intersect the feature trajectories (e.g., the epipolar lines are parallel to the trajectories). Again, in that scenario, we cannot compute the space of candidate temporal alignments. Fortunately, the situations mentioned above do not represent a significant portion of the scenarios found in practice.
Regarding the accuracy of our methodology, we believe that some simple additional changes to our technique could lead to even better results. Firstly, we believe that by using specific object characteristics, such as color and texture, we could avoid the insertion of spurious information into the voting spaces created by our temporal alignment algorithm. In particular, a candidate point should concatenate the coordinates f_i and f_j for cameras i and j, respectively, only if the intersection points that defined them belong to blobs that present the same characteristic of interest (e.g., color).
Moreover, we believe that in cases where it is possible to provide the optimization process formulated by our methodology with a priori information about the range in which the true temporal misalignment lies, we could improve the accuracy of the refined temporal parameter estimates. In this case, by using that a priori information, the optimization algorithm would know when it was diverging from the actual solution and corrective actions could be taken.
6.3 Future Work
Additional theoretical investigations should be considered in future work. Firstly, the methodology proposed in Chapter 3 assumes that all cameras acquire frames at constant (albeit not necessarily identical) temporal sampling rates. Based on that assumption, our approach models the temporal misalignment between a pair of video sequences as a one-dimensional affine transformation. The pairwise temporal relations modelled by that transformation induce a global relationship between the frame numbers of the input sequences, which is captured by the N-dimensional line that we call the timeline. However, this kind of mathematical model is not appropriate when some sequences have variable frame rates. Therefore, the development of an alternative mathematical model that can cope with this problem is an important topic for future research.
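For reference, the affine temporal model assumed above can be stated compactly as follows (a condensed restatement of the model from Chapter 3, with sequence 1 taken as the reference; the exact parameterization used there may differ slightly):

\[ t_i \;=\; \alpha_i\, t_1 + \beta_i, \qquad i = 2, \ldots, N, \]

so that the N-tuple of corresponding frame numbers

\[ \big( t_1,\; \alpha_2 t_1 + \beta_2,\; \ldots,\; \alpha_N t_1 + \beta_N \big) \]

traces a straight line in R^N as t_1 varies: this line is the timeline. When frame rates vary over time, the constants α_i and β_i would become functions of time and the resulting curve would no longer, in general, be a straight line, which is precisely why an alternative model is needed.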
Also, it is necessary to conceive alternative techniques for obtaining initial estimates of the cameras' epipolar geometry. Currently, our approach relies on background features whose image coordinates are processed by the normalized eight-point algorithm in order to estimate the fundamental matrices that capture the geometric relations between the views. However, there are cases where we cannot identify enough static scene points for every pair of video sequences, making the computation of an initial estimate of the cameras' epipolar geometry impossible and, consequently, preventing the use of our methodology.
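For completeness, the sketch below outlines the normalized eight-point algorithm mentioned above (following Hartley, 1997a). It is a simplified sketch: the function names, the (n, 2) array shapes and the absence of degeneracy checks are assumptions of this illustration, not part of our implementation.

```python
import numpy as np

def normalize(pts):
    """Translate points to zero mean and scale them so that the average
    distance to the origin is sqrt(2); returns normalized points and T."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - centroid, axis=1))
    T = np.array([[scale, 0, -scale * centroid[0]],
                  [0, scale, -scale * centroid[1]],
                  [0, 0, 1]])
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ homog.T).T, T

def eight_point(x1, x2):
    """Estimate the fundamental matrix from >= 8 point correspondences
    x1 <-> x2 (arrays of shape (n, 2)), using Hartley's normalization."""
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence x2^T F x1 = 0 gives one linear constraint on F.
    A = np.column_stack([p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
                         p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
                         p1[:, 0], p1[:, 1], np.ones(len(p1))])
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    # Undo the normalization.
    return T2.T @ F @ T1
```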
Finally, another important direction for future work is 3D scene reconstruction. By combining our temporal alignment approach with multi-view stereo techniques, important advances could be achieved in the development of robust systems for reconstructing dynamic 3D scenes, especially those found in old video footage, where multiple replays of the same event are shown from different viewpoints. Moreover, as the capabilities of video standards and receiver hardware advance towards integrated 3D animations, generating realistic content is becoming a limiting factor. In this context, there is an increasing demand for techniques that generate 3D content from reality, i.e., from video sequences acquired with TV cameras, providing the TV viewer with animated 3D reconstructions of physical events and allowing for an immersive experience via free interaction on the receiver side.
Bibliography
Atkinson, K. (1989). An Introduction to Numerical Analysis. John Wiley
and Sons, second edition.
Brown, M. and Lowe, D. (2003). Recognising Panoramas. In Proc. IEEE International Conference on Computer Vision, pages 1218–1225.
Canterakis, N. (2000). A Minimal Set of Constraints for the Trifocal Tensor.
In Proc. European Conference on Computer Vision, pages 84–99. Springer
LNCS 1842.
Cantzler, H. (2004). Random Sample Consensus (RANSAC). Technical
report, Institute for Perception, Action and Behaviour, Division of Infor-
matics, University of Edinburgh.
Carceroni, R. L. (2001). Recovering Non-Rigid 3D Motion, Shape and Re-
flectance from Multi-View Image Sequences: A Differential-Geometric Ap-
proach. PhD thesis, Department of Computer Science - The College Arts
and Sciences, University of Rochester, Rochester, New York.
Caspi, Y. and Irani, M. (2000). A Step Towards Sequence-to-Sequence Alignment. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 682–689, Hilton Head Island, South Carolina.
Caspi, Y. and Irani, M. (2001). Alignment of Non-Overlapping Sequences.
In Proc. IEEE International Conference on Computer Vision, volume 2,
pages 76–83, Vancouver, Canada.
Caspi, Y., Simakov, D., and Irani, M. (2002). Feature-Based Sequence-
to-Sequence Matching. In VAMODS (Vision and Modelling of Dynamic
Scenes) workshop with ECCV, Copenhagen, Denmark.
Clarke, J., Carlsson, S., and Zisserman, A. (1996). Detecting and Tracking
Linear Features Efficiently. In Proc. British Machine Vision Conference,
pages 415–424.
Faugeras, O. and Mourrain, B. (1995). On the Geometry and Algebra of the Point and Line Correspondences Between N Images. In Proc. IEEE International Conference on Computer Vision, pages 951–956.
Faugeras, O. and Papadopoulo, T. (1997). Grassmann-Cayley Algebra for
Modeling Systems of Cameras and the Algebraic Equations of the Mani-
fold of Trifocal Tensors. Technical Report 3225, INRIA, Sophia-Antipolis,
France.
FIFA (2002). FIFA World Cup Archives: Goal of the Century. http://
fifaworldcup.yahoo.com/02/en/pf/h/gotc/launch.html.
FIFA (2004a). FIFA World Cup Archives: Classic Games. http://
fifaworldcup.yahoo.com/06/en/p/cg/index.html.
FIFA (2004b). FIFA World Cup Archives: Photo Gallery. http://
fifaworldcup.yahoo.com/02/en/011221/4/44c.html.
Fischler, M. and Bolles, R. (1981). Random Sample Consensus: A Paradigm
for Model Fitting with Applications to Image Analysis and Automated
Cartography. Communications of the ACM, 24(6):381–395.
Forsyth, D. and Ponce, J. (2002). Computer Vision: A Modern Approach.
Prentice Hall, ISBN: 0130851981, first edition.
Hartley, R. (1997a). In Defense of the Eight-point Algorithm. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 19(6):580–593.
Hartley, R. (1997b). Lines and Points in Three Views and the Trifocal Tensor.
International Journal of Computer Vision, 22(2):125–140.
Hartley, R. (1998). Computation of the Quadrifocal Tensor. In Proc. Euro-
pean Conference on Computer Vision, pages 20–35. Springer-Verlag.
Hartley, R. and Zisserman, A. (2003). Multiple View Geometry in Computer
Vision. Cambridge University Press, ISBN: 0521540518, second edition.
Hefferon, J. (2001). Linear Algebra. Mathematics Department of Saint
Michael’s College, first edition.
Heyden, A. (1995a). Geometry and Algebra of Multiple Projective Transformations. PhD thesis, Department of Mathematics, Lund University, Sweden.
Heyden, A. (1995b). Reconstruction from Multiple Images Using Kinetic
Depths. International Journal of Computer Vision.
Heyden, A. (1998). Algebraic Varieties in Multiple View Geometry. In
Proc. 5th European Conference on Computer Vision, pages 3–19, Freiburg,
Germany.
Heyden, A. and Åström, K. (1997). Algebraic Properties of Multilinear Constraints. Mathematical Methods in the Applied Sciences, 20:1135–1162.
Horst, J. and Beichl, I. (1997). A Simple Algorithm for Efficient Piecewise
Linear Approximation of Space Curves. In Proc. IEEE International Con-
ference on Image Processing.
Isard, M. and MacCormick, J. (2001). BraMBLe: A Bayesian Multiple-Blob
Tracker. In Proc. IEEE International Conference on Computer Vision,
volume 2, pages 34–41, Vancouver, Canada.
Jepson, A., Fleet, D., and El-Maraghi, T. (2003). Robust On-line Appearance Models for Visual Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1296–1311.
Kutulakos, K. and Seitz, S. (2000). A Theory of Shape by Space Carving.
International Journal of Computer Vision, 38(3):199–218.
Kutulakos, K. N. (2000). Approximate N-View Stereo. In Proc. European
Conference on Computer Vision, pages 67–83.
Lee, L., Romano, R., and Stein, G. (2000). Monitoring Activities from Multiple Video Streams: Establishing a Common Coordinate Frame. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22:758–767.
Longuet-Higgins, H. (1981). A Computer Algorithm for Reconstructing a
Scene from Two Projections. Nature, 293:133–135.
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Key-
points. International Journal of Computer Vision, 60(2):91–110.
Luong, Q.-T. and Vieville, T. (1994). Canonic Representations for the
Geometries of Multiple Projective Views. In Proc. 4th European Con-
ference on Computer Vision, volume 800, pages 589–599, Cambridge, UK.
Ma, Y., Soatto, S., Kosecka, J., and Sastry, S. S. (2003). An Invitation to 3-D
Vision - From Images to Geometric Models. Springer-Verlag, first edition.
Matas, J. and Chum, O. (2004). Randomized RANSAC with T_{d,d} Test. Image and Vision Computing, 22(10):837–842.
McLauchlan, P. and Jaenicke, A. (2002). Image Mosaicing Using Sequential
Bundle Adjustment. Image and Vision Computing, 20(9-10):751–759.
Myatt, D., Torr, P., Nasuto, S., Bishop, J., and Craddock, R. (2002). Napsac:
High Noise, High Dimensional Robust Estimation - It’s in the Bag. In Proc.
British Machine Vision Conference, pages 458–467.
Press, W., Flannery, B., Teukolsky, S., and Vetterling, W. (1988). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, first edition.
Pritchett, P. and Zisserman, A. (1998). Wide Baseline Stereo Matching. In
Proc. IEEE International Conference on Computer Vision, pages 754–760.
Rao, C., Gritai, A., Shah, M., and Syeda-Mahmood, T. (2003). View-Invariant Alignment and Matching of Video Sequences. In Proc. of IEEE International Conference on Computer Vision, volume 2, pages 939–945, Nice, France.
Reid, I. and Zisserman, A. (1996). Goal Directed Video Metrology. In Proc.
European Conference on Computer Vision, pages 647–658.
Schaffalitzky, F. and Zisserman, A. (2001). Viewpoint Invariant Texture
Matching and Wide Baseline Stereo. In Proc. 8th IEEE International
Conference on Computer Vision, Vancouver, Canada.
Sharipov, R. (2004). Quick Introduction to Tensor Analysis. Freely distrib-
uted on-line - http://samizdat.mines.edu/tensors/, first edition.
Shashua, A. and Werman, M. (1995). Fundamental Tensor: On the Geome-
try of Three Perspective Views. In Proc. IEEE International Conference
on Computer Vision, pages 920–925.
Shechtman, E., Caspi, Y., and Irani, M. (2002). Increasing Video Resolution in Time and Space. In Proc. European Conference on Computer Vision, Copenhagen, Denmark.
Shi, J. and Tomasi, C. (1994). Good Features to Track. In Proc. IEEE
Conference on Computer Vision and Pattern Recognition, Seattle.
Stein, G. (1998). Tracking from Multiple View Points: Self-Calibration of Space and Time. In DARPA Image Understanding Workshop, pages 521–527, Monterey, USA.
Szeliski, R. (1999). A Multi-View Approach to Motion and Stereo. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, volume 1,
pages 157–163.
Tomasi, C. (2000). Mathematical Methods for Robotics and Vision. Technical
Report CS 205, Stanford University.
Torr, P. (1995). Outlier Detection and Motion Segmentation. PhD thesis,
Department of Engineering Science, University of Oxford.
Torr, P. and Zisserman, A. (1997). Robust Parametrization and Computation
of the Trifocal Tensor. Image and Vision Computing, 15:591–605.
Torr, P. and Zisserman, A. (1999). Feature-Based Methods for Structure
and Motion Estimation. In International Workshop on Vision Algorithms,
pages 278–295.
Tresadern, P. and Reid, I. (2003). Synchronizing Image Sequences of Non-
rigid Objects. In Proc. British Machine Vision Conference, volume 2,
pages 629–638, Norwich.
Triggs, B. (1995). Matching Constraints and the Joint Image. In Proc. IEEE International Conference on Computer Vision, pages 338–343.
Trucco, E. and Verri, A. (1998). Introductory Techniques for 3-D Computer
Vision. Prentice Hall, ISBN: 0132611082, first edition.
Vedula, S., Baker, S., and Kanade, T. (2002). Spatio-Temporal View Inter-
polation. In Proc. Eurographics Workshop on Rendering, pages 65–76.
Viéville, T. and Luong, Q. (1993). Motion of Points and Lines in the Uncalibrated Case. Technical Report RR-2054, INRIA.
Wolf, L. and Zomet, A. (2002a). Correspondence-Free Synchronization and
Reconstruction in a Non-Rigid Scene. In Workshop on Vision and Mod-
elling of Dynamic Scenes, Copenhagen, Denmark.
Wolf, L. and Zomet, A. (2002b). Sequence-to-Sequence Self Calibration. In
Proc. European Conference on Computer Vision, volume 2, pages 370–382.
Zelnik-Manor, L. and Irani, M. (2001). Event-Based Analysis of Video. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, USA.
Zhang, Z. (1998). Determining the Epipolar Geometry and its Uncertainty -
A Review. International Journal of Computer Vision, 27(2):161–195.
Appendix A
The RANSAC Algorithm
The Random Sample Consensus (RANSAC) algorithm is a paradigm for robust fitting of models that was introduced by Fischler and Bolles in 1981 (Fischler and Bolles, 1981). It is robust in the sense of having good tolerance to outliers in the experimental data, and it is capable of interpreting and smoothing data containing a significant percentage of gross errors (Fischler and Bolles, 1981). The estimate is only correct with a certain probability, since RANSAC is a randomised estimator. The algorithm has been applied to a wide range of model parameter estimation problems in computer vision, such as detection of geometric primitives (Clarke et al., 1996), mosaicing (McLauchlan and Jaenicke, 2002), wide baseline stereo matching (Schaffalitzky and Zisserman, 2001; Pritchett and Zisserman, 1998) and motion segmentation (Torr, 1995).
Although RANSAC is a quite simple algorithm, it is a very powerful tool. In an iterative way, it randomly selects subsets of the input data and computes the model parameters that best fit each sample (Matas and Chum, 2004). Samples are drawn uniformly from the input data set, so each point has the same probability of being selected (uniform point sampling). For each sample, a model hypothesis is constructed by computing the model parameters from the sample data. The size of the sample depends on the model one wants to find. Typically, it is the smallest size sufficient for determining the model parameters (Matas and Chum, 2004). For example, to find circles in the data set, one has to draw three points, since three points are required to determine the parameters of a circle.
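As an illustration of this paradigm, the sketch below fits a circle to noisy 2D points by repeatedly drawing the three-point samples mentioned above. It is a minimal didactic sketch (fixed iteration count, simple inlier counting, illustrative function names), not the RANSAC variant used elsewhere in this thesis.

```python
import numpy as np

def circle_from_three_points(p1, p2, p3):
    """Return (center, radius) of the circle through three non-collinear points."""
    A = 2 * np.array([p2 - p1, p3 - p1])
    b = np.array([p2 @ p2 - p1 @ p1, p3 @ p3 - p1 @ p1])
    center = np.linalg.solve(A, b)     # circumcenter from the perpendicular bisectors
    return center, np.linalg.norm(p1 - center)

def ransac_circle(points, num_iters=1000, tol=1.0, rng=None):
    """Fit a circle to 2D points (array of shape (n, 2)) with RANSAC."""
    rng = np.random.default_rng() if rng is None else rng
    best, best_count = None, -1
    for _ in range(num_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        try:
            center, radius = circle_from_three_points(*sample)
        except np.linalg.LinAlgError:  # collinear sample: singular system
            continue
        residuals = np.abs(np.linalg.norm(points - center, axis=1) - radius)
        count = int((residuals <= tol).sum())
        if count > best_count:         # keep the hypothesis with most support
            best, best_count = (center, radius), count
    return best, best_count
```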
A.1 Hypotheses evaluation
In a second phase, the quality of the model parameters is evaluated on the full data set. Different cost functions may be used for the evaluation, the standard one being the number of data points consistent with the model (Matas and Chum, 2004). The process is terminated when the likelihood of finding a better model becomes low (Fischler and Bolles, 1981). Usually, the model parameters estimated by RANSAC are not very precise. Therefore, the estimated model parameters are recomputed by, for example, a least-squares fit to the data subset that supports the best estimate. The input data may support several distinct models. In this case, the parameters of the first model are estimated, the data points supporting that model are removed from the input data, and the algorithm is simply repeated with the remainder of the data set to find the next best model. The strength of the algorithm is that it is likely to draw at least one set of points consisting only of inliers (Cantzler, 2004). Depending on the size of the random samples, RANSAC can handle contamination levels well above 50%, which is commonly assumed to be a practical limit in robust statistics (Matas and Chum, 2004).
A.2 RANSAC parameters
The RANSAC technique uses three parameters to control the model estimation process (Cantzler, 2004). The first parameter is an error tolerance that determines a volume within which all compatible points must fall. The second parameter (p) is the probability that at least one of the random selections is an error-free set of candidates. Finally, the third parameter (r) is the probability that a randomly-selected candidate is an inlier. The parameters p and r define the number of iterations z of RANSAC as follows:

\[ z = \frac{\log(1 - p)}{\log\left(1 - r^{2}\right)} \tag{A.1} \]

Equation (A.1) expresses the fact that z should be large enough to ensure that, with probability p, at least one randomly-selected set of candidates consists only of inliers.
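As a quick worked example with illustrative values (not values used elsewhere in this thesis): for p = 0.99 and r = 0.5, Equation (A.1) gives

\[ z = \frac{\log(1 - 0.99)}{\log\left(1 - 0.5^{2}\right)} = \frac{\log 0.01}{\log 0.75} \approx 16.0, \]

so at least 17 samples would be drawn in practice to reach the desired confidence.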
A.3 RANSAC efficiency
The speed of RANSAC depends on two factors (Matas and Chum, 2004): firstly, the number of samples that have to be drawn to guarantee a certain confidence of obtaining a good estimate, and secondly, the time spent evaluating the quality of each hypothetical model. The latter is proportional to the size of the data set.
Typically, a very large number of erroneous model parameters obtained from contaminated samples are evaluated. Such models are consistent with only a small fraction of the data (Cantzler, 2004). The evaluation of the models can be computationally optimised by randomising the evaluation (Matas and Chum, 2004). Every hypothetical model is first tested only with a small number of random data points from the data set. If a model does not get enough support from this random point set, then one can assume with high confidence that the model is not a good estimate. Models passing the randomised evaluation are then evaluated on the full data set.
The performance of RANSAC degrades with increasing sample size, or when multiple models are supported by the data, due to the decreasing probability of sampling a set that is composed entirely of inliers. A very common observation is that outliers possess a diffuse distribution, whereas inliers tend to be located close together. Therefore, the uniform sampling of points can be replaced by the selection of sample sets based on proximity, taking spatial relationships into account (Myatt et al., 2002). The first sample point is selected randomly, and the remaining points are random points lying within a hypersphere centred on the first one. The selection of sample sets of adjacent points can significantly improve the probability of selecting a set of inliers and thus reduce the number of samples required to find a good model estimate (Matas and Chum, 2004).
Appendix B
Tensorial Notation
This brief introduction to tensor notation is based on the explanations presented in Hartley (1997b) and Hartley and Zisserman (2003). For more details about this topic, the reader is referred to Sharipov (2004) and Triggs (1995). For the sake of simplicity, we will present the concepts involved in tensorial notation in the context of low-dimensional projective spaces, rather than in a general context, exactly as done in Hartley (1997a) and Hartley and Zisserman (2003).
Consider a set of basis vectors $\mathbf{e}_i$, $i = 1, \ldots, 3$, for a 2-dimensional projective space $\mathcal{P}^2$. Let their indices be written as subscripts. With respect to this basis, a point in $\mathcal{P}^2$ is represented by a set of coordinates $q^i$, which represents the point $\sum_{i=1}^{3} q^i \mathbf{e}_i$. The coordinates are written with an upper index. Let $\mathbf{q}$ represent the triple of coordinates, $\mathbf{q} = (q^1, q^2, q^3)^\top$.
Now, consider a change of coordinate axes in which the basis vectors $\mathbf{e}_i$ are replaced by a new basis set $\hat{\mathbf{e}}_j$, where $\hat{\mathbf{e}}_j = \sum_i H^i_j \mathbf{e}_i$, and $H$ is the basis transformation matrix with entries $H^i_j$. If $\hat{\mathbf{q}} = (\hat{q}^1, \hat{q}^2, \hat{q}^3)^\top$ are the coordinates of the vector with respect to the new basis, then we may verify that $\hat{\mathbf{q}} = H^{-1}\mathbf{q}$. Thus, if the basis vectors transform according to $H$, the coordinates of points transform according to the inverse transform $H^{-1}$.
Next, consider a line in $\mathcal{P}^2$ represented by coordinates $\boldsymbol{\lambda}$ with respect to the original basis. With respect to the new basis, it may be verified that the line is represented by a new set of coordinates $\hat{\boldsymbol{\lambda}} = H^\top \boldsymbol{\lambda}$. Thus, the coordinates of the line transform according to $H^\top$.
Finally, let $P$ be a matrix representing a mapping between vector spaces. If $G$ and $H$ represent basis transformations in the domain and range spaces, then with respect to the new bases the mapping is represented by a new matrix $\hat{P} = H^{-1} P G$. Note in these examples that sometimes the matrix $H$ or $H^\top$ is used in the transformation, and sometimes $H^{-1}$.
These three examples of coordinate transformations may be written as follows:
$$\hat{q}^{\,i} = (H^{-1})^i_j\, q^j \qquad \text{(B.1)}$$
$$\hat{\lambda}_i = H^j_i\, \lambda_j \qquad \text{(B.2)}$$
$$\hat{P}^i_j = (H^{-1})^i_k\, G^l_j\, P^k_l \qquad \text{(B.3)}$$
where we use the tensor summation convention that an index repeated in upper and lower positions in a product represents summation over the range of the index. Note that those indices that are written as superscripts transform according to $H^{-1}$, whereas those that are written as subscripts transform as $H$ (or $G$). Note that there is no distinction in tensor notation between indices that are transformed by $H$ and those that are transformed by $H^\top$. In general, tensor indices will transform by either $H$ or $H^{-1}$. Those indices that transform according to $H$ are known as covariant indices and are written as subscripts (Hartley, 1997a). Those indices that transform according to $H^{-1}$ are known as contravariant indices, and are written as superscripts (Hartley, 1997a).
Figure B.1: 3D representation of the trifocal tensor. The picture represents $\lambda_i = \lambda'_j \lambda''_k T^{jk}_i$, which is the contraction of the tensor with the lines $\lambda'$ and $\lambda''$ to produce a line $\lambda$. In pseudo-matrix notation this can be written as $\lambda_i = \boldsymbol{\lambda}'^\top T_i\, \boldsymbol{\lambda}''$, where $(T_i)_{jk} = T^{jk}_i$ (Hartley and Zisserman, 2003).
The number of indices is the valency of the tensor. The sum over an index, e.g. $H^j_i \lambda_j$, is referred to as a contraction; in this case the tensor $H^j_i$ is contracted with the line $\lambda_j$.
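To make these rules concrete, the following NumPy sketch (our illustration, not part of the original text) applies the transformation rules (B.1) and (B.2) to a point and a line and checks that their contraction $\lambda_i q^i$ is unchanged by the change of basis:

    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.standard_normal((3, 3))            # basis transformation, assumed invertible
    q = rng.standard_normal(3)                 # contravariant point coordinates q^i
    lam = np.cross(q, rng.standard_normal(3))  # a line through q, so lam_i q^i = 0

    q_hat = np.linalg.inv(H) @ q               # Equation (B.1): points transform by H^-1
    lam_hat = H.T @ lam                        # Equation (B.2): lines transform by H^T

    # The contraction lam_i q^i (incidence of point and line) is basis-independent.
    print(np.isclose(lam @ q, 0.0), np.isclose(lam_hat @ q_hat, 0.0))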
B.1 Tensorial notation and the trifocal tensor
The trifocal tensor $T^{jk}_i$ has one covariant and two contravariant indices. For vectors and matrices, such as $q^i$, $\lambda_i$ and $P^i_j$, it is possible to write the transformation rules using standard linear algebra notation, e.g. $\mathbf{q}' = H\mathbf{q}$. However, for tensors with three or more indices, this cannot conveniently be done. A vector $\mathbf{q}$ may be thought of as a set of numbers arranged in a column or row, and a matrix $H$ as a 2D array of numbers. Similarly, a tensor with three indices may be thought of as a 3D array of numbers. In particular, the trifocal tensor is a $3 \times 3 \times 3$ cube of cells, as illustrated in Figure B.1.
Appendix C
Multiple View Geometry
C.1 Three-View Geometry
The geometry of three perspective views may be acquired simultaneously
as in a trinocular rig, or acquired sequentially, for example by a camera
moving relative to the scene. Exactly as in the case of two-view geometry, these two situations are geometrically equivalent and will not be differentiated here.
A new multiple-view object, the trifocal tensor, plays an analogous role in three views to that played by the fundamental matrix in two views
(Hartley, 1997b). It encapsulates all the (projective) geometric relations
between three views that are independent of scene structure. In the following
we present its derivation as well as some of its main properties.
C.1.1 The Trifocal Tensor T
The trifocal tensor may be approached in several different manners (Shashua and Werman, 1995; Hartley, 1997b; Hartley and Zisserman, 2003; Canterakis, 2000), but in this section we present its derivation following the description given by Hartley (1997b), where the starting point is taken to be the incidence relationship of three corresponding lines. Finally, we present the application of the trifocal tensor as a tool for transferring points.
Essentially, the trifocal tensor is a triply indexed 3×3×3 array of values.
Therefore, it is fully natural to treat it as a tensor (Viéville and Luong,
1993). In this section we will make use of tensorial notation, in particular
the standard summation convention for repeated indices. Some basics on
tensorial notation are presented in Appendix B.
Line Transfer
Consider the three cameras with image planes $\pi$, $\pi'$ and $\pi''$ in Figure C.1. Their corresponding projection matrices will be denoted by $M = [I\,|\,0]$, $M' = [a^j_i]$ and $M'' = [b^j_i]$. Observe that we are picking the coordinate system attached to the camera $\pi$ as the world reference frame. Let $\boldsymbol{\lambda}$, $\boldsymbol{\lambda}'$ and $\boldsymbol{\lambda}''$ be the image lines of the scene line illustrated in Figure C.1. Their corresponding planes in space are given by $\boldsymbol{\varphi} = M^\top \boldsymbol{\lambda}$, $\boldsymbol{\varphi}' = M'^\top \boldsymbol{\lambda}'$ and $\boldsymbol{\varphi}'' = M''^\top \boldsymbol{\lambda}''$. Our goal here is to find a relationship between the coordinates of these three lines.
Given that the three image lines are derived from a single line in space, it follows that $\boldsymbol{\varphi}$, $\boldsymbol{\varphi}'$ and $\boldsymbol{\varphi}''$ must meet at this line in space. This fact leads to a linear dependency between the coordinates of these three planes. In particular, there exist constants $\rho_1$ and $\rho_2$ such that we have
$$\boldsymbol{\varphi} = \rho_1 \boldsymbol{\varphi}' + \rho_2 \boldsymbol{\varphi}''. \qquad \text{(C.1)}$$
Figure C.1: Three images of a line define it as the intersection of three planes in the same pencil.
Writing Equation (C.1) in terms of the planes' coordinates we obtain
$$\lambda_i = \rho_1\, a^j_i \lambda'_j + \rho_2\, b^k_i \lambda''_k, \quad (i = 1, \ldots, 3) \qquad \text{(C.2)}$$
$$0 = \rho_1\, a^j_4 \lambda'_j + \rho_2\, b^k_4 \lambda''_k. \qquad \text{(C.3)}$$
From Equation (C.3) we deduce that $\rho_1 \cong (b^k_4 \lambda''_k)$ and $\rho_2 \cong -(a^j_4 \lambda'_j)$. The $\cong$ notation denotes equality up to an unknown scale factor. Thus, we may
rewrite Equation (C.2) as
$$\lambda_i \cong (b^k_4 \lambda''_k)(a^j_i \lambda'_j) - (a^j_4 \lambda'_j)(b^k_i \lambda''_k), \qquad \text{(C.4)}$$
$$\;\;\; = \lambda'_j \lambda''_k\, (a^j_i b^k_4 - a^j_4 b^k_i). \qquad \text{(C.5)}$$
If we define a $3 \times 3 \times 3$ tensor $T^{jk}_i$ by the expression
$$T^{jk}_i = a^j_i b^k_4 - a^j_4 b^k_i, \qquad \text{(C.6)}$$
we obtain the following equation
$$\lambda_i \cong \lambda'_j \lambda''_k\, T^{jk}_i. \qquad \text{(C.7)}$$
The tensor $T^{jk}_i$ is the so-called trifocal tensor (Hartley and Zisserman, 2003) and may be thought of as a set of three $3 \times 3$ matrices of rank 2, which is evident from Equation (C.6), where it is expressed as the sum of two outer products. Given $T^{jk}_i$ and the coordinates $\lambda'_j$, $\lambda''_k$ of corresponding lines, Equation (C.7) may be used to compute the line in the other image. This process is called line transfer.
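As a numerical sanity check of Equations (C.6) and (C.7), the NumPy sketch below (our own illustration; the cameras, the scene line and all variable names are synthetic assumptions) builds the trifocal tensor from two projection matrices, with the first camera fixed at $[I\,|\,0]$, and transfers a line from the second and third views back to the first:

    import numpy as np

    rng = np.random.default_rng(1)
    Mp = rng.standard_normal((3, 4))    # M'  = [a^j_i], second camera
    Mpp = rng.standard_normal((3, 4))   # M'' = [b^j_i], third camera; first camera is [I | 0]

    # Equation (C.6): T[i, j, k] = a^j_i b^k_4 - a^j_4 b^k_i (0-based indices here).
    T = (np.einsum('ji,k->ijk', Mp[:, :3], Mpp[:, 3])
         - np.einsum('j,ki->ijk', Mp[:, 3], Mpp[:, :3]))

    # A synthetic scene line through two 3D points, imaged in the three views.
    Q1, Q2 = rng.standard_normal(4), rng.standard_normal(4)
    M = np.hstack([np.eye(3), np.zeros((3, 1))])
    image_line = lambda P: np.cross(P @ Q1, P @ Q2)   # line through the two projected points
    lam, lam_p, lam_pp = image_line(M), image_line(Mp), image_line(Mpp)

    # Equation (C.7): line transfer to the first view, correct up to scale.
    lam_transferred = np.einsum('j,k,ijk->i', lam_p, lam_pp, T)
    unit = lambda v: v / np.linalg.norm(v)
    print(np.allclose(np.cross(unit(lam_transferred), unit(lam)), 0.0, atol=1e-10))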
Importantly, if at least 13 line matches are known, it is possible to solve for the entries of the tensor $T^{jk}_i$, since each line match provides two linear equations in the 27 unknown tensor entries. In particular, if the line $\boldsymbol{\lambda}$ is specified by two points on the line, then each such point $\mathbf{q} = (q^1, q^2, q^3)^\top$ generates an equation such as
$$q^i \lambda'_j \lambda''_k\, T^{jk}_i = 0. \qquad \text{(C.8)}$$
Point Transfer
Suppose that a scene point $\mathbf{Q}$ is seen at positions $\mathbf{q}$, $\mathbf{q}'$ and $\mathbf{q}''$ in the three image planes $\pi$, $\pi'$ and $\pi''$, respectively, where $\mathbf{q}$ (and similarly $\mathbf{q}'$ and $\mathbf{q}''$) is represented in homogeneous coordinates. Now, we wish to find a relationship between the coordinates of those three image points. It is important to say that at any point in the following derivation we may set the coordinates $q^3$, $q'^3$ and $q''^3$ to 1 to obtain equations relating to measured image coordinates.
Consider that the cameras' projection matrices are denoted by $M = [I\,|\,0]$, $M' = [a^j_i]$ and $M'' = [b^j_i]$. Since $\mathbf{q} \cong [I\,|\,0]\,\mathbf{Q}$, we have
$$\mathbf{Q} = \begin{pmatrix} \mathbf{q} \\ w \end{pmatrix}, \qquad \text{(C.9)}$$
for some $w$ yet to be determined.
Therefore, projecting the scene point $\mathbf{Q}$ in the image plane $\pi'$ by the usual formula $q'^i \cong a^i_j Q^j$, we have
$$q'^i \cong a^i_k q^k + a^i_4 w. \qquad \text{(C.10)}$$
As performed by Hartley (1997b), we may eliminate this scale factor to obtain the equations
$$q'^i \left( a^j_k q^k + a^j_4 w \right) = q'^j \left( a^i_k q^k + a^i_4 w \right). \qquad \text{(C.11)}$$
Each choice of the free indices i and j gives a separate equation. Of the
three resulting equations, only two are independent. Thus, we obtain three
separate estimates for w as follows
$$w = \frac{q^k \left( q'^i a^j_k - q'^j a^i_k \right)}{q'^j a^i_4 - q'^i a^j_4}. \qquad \text{(C.12)}$$
Substituting $w$ in Equation (C.9) by its value in Equation (C.12), we can write the scene point $\mathbf{Q}$ as
$$\mathbf{Q} = \begin{pmatrix} \mathbf{q} \\[4pt] \dfrac{q^k \left( q'^i a^j_k - q'^j a^i_k \right)}{q'^j a^i_4 - q'^i a^j_4} \end{pmatrix}. \qquad \text{(C.13)}$$
Finally, projecting the scene point $\mathbf{Q}$ in the image plane $\pi''$ ($q''^l \cong b^l_k Q^k$), we obtain
$$q''^l \cong b^l_k q^k \left( q'^j a^i_4 - q'^i a^j_4 \right) + b^l_4\, q^k \left( q'^i a^j_k - q'^j a^i_k \right) \qquad \text{(C.14)}$$
$$\;\;\;\cong q^k q'^i \left( a^j_k b^l_4 - a^j_4 b^l_k \right) - q^k q'^j \left( a^i_k b^l_4 - a^i_4 b^l_k \right). \qquad \text{(C.15)}$$
Observe that the tensor coefficients $T^{jk}_i$ can be easily identified in Equation (C.15):
$$q''^l \cong q^k \left( q'^i T^{jl}_k - q'^j T^{il}_k \right). \qquad \text{(C.16)}$$
As before, we may eliminate the unknown scale factor to obtain the equations
$$q^k \left( q'^i q''^l T^{jm}_k - q'^j q''^l T^{im}_k - q'^i q''^m T^{jl}_k + q'^j q''^m T^{il}_k \right) = 0^{ijlm}. \qquad \text{(C.17)}$$
As mentioned by Hartley (1997b), these are the trilinearity relationships of Shashua and Werman (1995). The indices $i$, $j$, $l$ and $m$ are free variables, and there is one equation for each choice of indices with $i \neq j$ and $l \neq m$. We may assume that $i < j$ and $l < m$, since we obtain the same relation by interchanging $i$ and $j$, or $l$ and $m$. Therefore, from Expression (C.17) we obtain nine possible equations. As only two of the three choices of the pair $(i,j)$ give independent equations, and the same is true for the pairs $(l,m)$, we get only four linearly independent equations.
One natural choice of the four independent equations from Expression (C.17) is obtained by setting $j = m = 3$ and letting $i$ and $l$ range freely, given the conditions stated before ($i \neq j$, $l \neq m$, $i < j$ and $l < m$). Thus, if we set $q^3$, $q'^3$ and $q''^3$ to 1, we obtain the relationship between image coordinates in Equation (C.18)
$$q^k \left( q'^i q''^l T^{33}_k - q''^l T^{i3}_k - q'^i T^{3l}_k + T^{il}_k \right) = 0^{i3l3} \quad (i, l = 1, 2). \qquad \text{(C.18)}$$
Equations (C.8) and (C.18) show how the entries of the trifocal tensor $T$ are involved in the processes of transferring lines and points. In particular, each line correspondence provides two linear constraints on the entries $T^{jk}_i$, whereas each point correspondence provides four linear equations. Since $T$ has 27 entries, 26 equations are needed to solve for the $T^{jk}_i$ up to scale; that is, provided that $2 \cdot \#\text{lines} + 4 \cdot \#\text{points} \geq 26$ (Hartley and Zisserman, 2003), we have enough matches to completely determine the entries of the trifocal tensor. The main methods for computing trifocal tensors (Hartley, 1997b; Torr and Zisserman, 1997; Faugeras and Papadopoulo, 1997) are based on that restriction, and most of them are described in detail by Hartley and Zisserman (2003).
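A similar numerical check can be made for point transfer via Equation (C.16). The sketch below is our own illustration with synthetic cameras; the particular choice of free indices ($i = 1$, $j = 2$) is one generic option and may need to be changed in degenerate configurations:

    import numpy as np

    rng = np.random.default_rng(2)
    Mp, Mpp = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
    # Trifocal tensor from Equation (C.6), with the first camera fixed at [I | 0].
    T = (np.einsum('ji,k->ijk', Mp[:, :3], Mpp[:, 3])
         - np.einsum('j,ki->ijk', Mp[:, 3], Mpp[:, :3]))

    Q = rng.standard_normal(4)                         # a synthetic scene point
    q = np.hstack([np.eye(3), np.zeros((3, 1))]) @ Q   # its image in the first view
    q_p, q_pp = Mp @ Q, Mpp @ Q                        # images in the second and third views

    # Equation (C.16) with free indices i = 1, j = 2 (0-based: 0 and 1):
    i, j = 0, 1
    q_pp_est = np.einsum('k,kl->l', q, q_p[i] * T[:, j, :] - q_p[j] * T[:, i, :])

    unit = lambda v: v / np.linalg.norm(v)
    print(np.allclose(abs(unit(q_pp_est) @ unit(q_pp)), 1.0))  # equal up to scale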
C.2 N-View Geometry
In the study of the geometry of multiple views, the fundamental matrix
and the trifocal tensor have proven to be essential tools. In the case of four
views, the quadrifocal tensor was proposed as a natural extension of those
techniques (Heyden, 1995a; Hartley, 1998). Because of the added stability of
a fourth view, use of the quadrifocal tensor should lead to greater accuracy
than two- and three-view techniques (Heyden, 1995a). However, the four-view tensor has not been given much attention in the literature (Hartley and Zisserman, 2003). One of the main reasons for its rare use is its over-parametrization: 81 components of the tensor are used to describe a
geometric configuration that depends only on 29 parameters (Hartley, 1998).
This fact can lead to significant inaccuracies if additional constraints are not
applied.
With regard to the general case of multiple images taken by N differ-
ent cameras, several different approaches have been proposed in the litera-
ture (Luong and Vieville, 1994; Faugeras and Mourrain, 1995; Heyden and
Åström, 1997; Heyden, 1998). In Heyden (1998), a common framework was
proposed for the definition and operations on the different multiple view ten-
sors. The author showed that there are essentially three different ways to
encode an N-view geometry, namely by using bifocal tensors or fundamen-
tal matrices, trifocal tensors and quadrifocal tensors (combinations of these
three tools are also possible).
According to his work, if only bifocal tensors are used for a sequence of temporally corresponding images, it is sufficient to use the tensors $^{i,i+1}F$ and $^{i,i+2}F$ for every triplet $(^{i,i+1}F,\, ^{i,i+2}F,\, ^{i+1,i+2}F)$, where $i$ denotes the $i$-th view and $^{i,i+1}F$ denotes the fundamental matrix between views $i$ and $i+1$. Using this representation, and remembering that each bifocal tensor has 7 independent parameters (Hartley and Zisserman, 2003), we have $(N-1) + (N-2) = 2N-3$ bifocal tensors and $N-2$ such triplets obeying three constraints each, giving in total $7(2N-3) - 3(N-2) = 11N-15$ parameters describing the N-view geometry, i.e. the minimal number (Heyden, 1998).
On the other hand, if only trifocal tensors are used, it is sufficient to use the tensors $^{i+1,i+2}_{\;\;i}T$. Using this representation we have $N-2$ trifocal tensors with 18 independent parameters each, and for every consecutive pair $(^{i+1,i+2}_{\;\;i}T,\, ^{i+2,i+3}_{\;\;i+1}T)$ of trifocal tensors we get 7 constraints from the compatibility of $^{i+1,i+2}F$, which can be calculated from both. Since there are $N-3$ such pairs, we get $18(N-2) - 7(N-3) = 11N-15$ parameters describing the N-view geometry, i.e. the minimal number (Heyden, 1998).
Finally, when only quadrifocal tensors (represented by the letter $Q$) are used, it is sufficient to use the tensors $^{i,i+1,i+2,i+3}Q$. Using this representation we have $N-3$ quadrifocal tensors with 29 independent parameters each, and for every consecutive pair $(^{i,i+1,i+2,i+3}Q,\, ^{i+1,i+2,i+3,i+4}Q)$ of quadrifocal tensors we get 18 constraints from the compatibility of $^{i+2,i+3}_{\;\;i+1}T$, which can be calculated from both. Since there are $N-4$ such pairs, we get $29(N-3) - 18(N-4) = 11N-15$ parameters describing the N-view geometry, i.e. the minimal number (Heyden, 1998).
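These three counts can be verified with a few lines of arithmetic; the short Python check below (our own) confirms that each representation yields the minimal $11N - 15$ parameters:

    def bifocal(N):     return 7 * (2 * N - 3) - 3 * (N - 2)
    def trifocal(N):    return 18 * (N - 2) - 7 * (N - 3)
    def quadrifocal(N): return 29 * (N - 3) - 18 * (N - 4)

    for N in range(4, 11):
        assert bifocal(N) == trifocal(N) == quadrifocal(N) == 11 * N - 15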
An enumeration of the complete set of multilinear relations, formulae
for the multiview tensors, and the analysis of the number of independent
equations derived from point correspondences can be found in Hartley and
Zisserman (2003) and Heyden (1998).