Universidade Federal de Minas Gerais
Instituto de Ciências Exatas
Departamento de Ciência da Computação
Alinhamento Espaço-Temporal de Sequências
de Vídeo Capturadas a Partir de
Múltiplos Pontos de Vista
Flávio Luis Cardeal Pádua
Tese apresentada ao Curso de Pós-Graduação em
Ciência da Computação da Universidade Federal de
Minas Gerais, como requisito parcial para obtenção
do título de Doutor em Ciência da Computação.
Orientador: Prof. Rodrigo Lima Carceroni
Belo Horizonte, 20 de maio de 2005
Universidade Federal de Minas Gerais
Instituto de Ciências Exatas
Departamento de Ciência da Computação
Spatio-Temporal Alignment of Video Sequences
Captured from Multiple Viewpoints
Flávio Luis Cardeal Pádua
Thesis presented to the Graduate Program in
Computer Science of the Federal University
of Minas Gerais in partial fulfillment of the
requirements for the degree of Doctor in
Computer Science.
Advisor: Prof. Rodrigo Lima Carceroni
Belo Horizonte, May 20, 2005
Resumo
Esta tese aborda o problema de se estimar o alinhamento espaço-temporal
entre N sequências de vídeo não-sincronizadas referentes à mesma cena dinâ-
mica 3D e capturadas a partir de pontos de vista distintos. Diferentemente
dos métodos existentes, os quais funcionam somente para N = 2, este traba-
lho apresenta uma abordagem inovadora que reduz o problema caracterizado
por um N qualquer ao problema de se estimar uma única reta em $\mathbb{R}^N$. Esta
reta captura todas as relações temporais entre os vídeos, podendo ser cal-
culada sem qualquer conhecimento a priori sobre as mesmas. Considerando
que o alinhamento espacial é capturado por parâmetros de tensores bilineares
(matrizes fundamentais), um algoritmo iterativo é usado para refinar simul-
taneamente os parâmetros temporais e espaciais que definem o alinhamento
entre as sequências, uma vez que o refinamento exclusivo dos parâmetros
temporais é subótimo. Resultados experimentais obtidos com sequências de
vídeo reais demonstram que a metodologia proposta é capaz de recuperar efi-
cazmente o alinhamento entre as sequências mesmo diante da existência de
grandes desalinhamentos, diante da presença de ambiguidades (por exemplo,
cenas com movimentos periódicos) e quando um alinhamento manual preciso
é inviável. Finalmente, experimentos com sequências sintéticas demonstram
a escalabilidade e acurácia de nossa abordagem, fornecendo medidas quanti-
tativas para a qualidade dos alinhamentos estimados.
Abstract
This thesis addresses the problem of estimating the spatio-temporal align-
ment between N unsynchronized video sequences of the same dynamic 3D
scene, captured from distinct viewpoints. Unlike existing methods, which
work for N = 2 and rely on a computationally-intensive search in the space
of temporal alignments, we present a novel approach that reduces the prob-
lem for general N to the robust estimation of a single line in $\mathbb{R}^N$. This
line captures all temporal relations between the sequences and can be com-
puted without any prior knowledge of these relations. Considering that the
parameters of fundamental matrices capture the spatial alignment, we use
an iterative algorithm to refine simultaneously the parameters representing
the temporal and spatial relations between the sequences, since the ex-
clusive refinement of the temporal parameters is suboptimal. Experimental
results with real-world sequences show that our method can accurately align
videos even when they have large misalignments (e.g., hundreds of frames),
when the problem is seemingly ambiguous (e.g., scenes with roughly peri-
odic motion), and when accurate manual alignment is difficult (e.g., due to
slow-moving objects). Finally, experiments with synthetic sequences demon-
strate the scalability and accuracy of our approach, providing quantitative
measurements for the quality of the spatio-temporal alignments estimated.
Resumo Estendido
O texto a seguir consiste em um resumo estendido sobre o trabalho
desenvolvido nesta tese. Primeiramente, este texto introduz o problema
abordado, a principal motivação para se resolvê-lo e alguns dos principais
trabalhos relacionados. Em seguida, é feita uma breve descrição da me-
todologia desenvolvida e dos experimentos realizados que comprovam sua
aplicabilidade, escalabilidade e exatidão. Finalmente, são apresentadas
conclusões e propostas de trabalhos futuros.
Introdução
Esta tese aborda o problema de se estimar o Alinhamento Espaço-
Temporal entre Múltiplas Sequências de Vídeo referentes a uma mesma cena
3D, as quais são capturadas a partir de pontos de vista distintos. A di-
nâmica da cena bem como características estáticas presentes na mesma são
utilizadas como poderosas pistas para se estimar a sincronização temporal
(alinhamento temporal) e o alinhamento espacial entre as sequências. Ti-
picamente, o desalinhamento temporal entre sequências de vídeo origina-se
por duas razões principais. A primeira relaciona-se com o fato de que as
sequências de entrada podem possuir diferentes taxas de quadros, enquanto
a segunda relaciona-se com a existência de um deslocamento temporal en-
tre as sequências frequentemente criado quando as câmeras não são ativadas
simultaneamente. Por outro lado, o desalinhamento espacial resulta das dife-
rentes posições, orientações e parâmetros internos de calibração das câmeras.
Em muitas aplicações atuais que se beneficiam da disponibilidade de re-
gistros de vídeo simultâneos de um mesmo evento físico, como por exemplo,
tele-imersão (Vedula et al., 2002), segurança baseada em vídeo (Zelnik-Manor
and Irani, 2001), criação de mosaicos a partir de múltiplos vídeos (Caspi and
Irani, 2001) e análise de lances duvidosos em eventos esportivos (Reid and
Zisserman, 1996), observa-se a necessidade da estimação numa fase anterior
do alinhamento espaço-temporal das múltiplas sequências de vídeo referentes
ao evento físico monitorado.
Neste contexto, nota-se uma demanda crescente por métodos automáti-
cos eficazes para a estimação do alinhamento espaço-temporal entre múlti-
plas sequências, especialmente sequências previamente gravadas onde o uso
de hardwares de sincronização é inviável. Sendo assim, esta tese propõe uma
nova abordagem cujo objetivo principal consiste em avançar no desenvolvi-
mento de novas metodologias para se estimar com grande exatidão o alinha-
mento espaço-temporal entre não somente duas sequências, como a maioria
dos métodos existentes, mas entre um conjunto genérico de N sequências.
Especificamente, este trabalho considera conjuntos de sequências que pos-
suam uma certa sobreposição entre seus campos de visão, isto é, dadas N
sequências, é possível identificar tuplas correspondentes de pixels $(x_1, y_1, t_1) \in S_1, \ldots, (x_N, y_N, t_N) \in S_N$, onde todas estas tuplas são formadas pelas pro-
jeções de um único ponto na cena. Além disso, a abordagem apresentada
neste trabalho considera que todas as regiões de sobreposição contenham
algum movimento não-rígido.
Nós acreditamos que qualquer solução suficientemente genérica para o
problema de alinhamento espaço-temporal deva tratar os seguintes casos:
Taxas de quadros desconhecidas: as taxas de quadros das sequên-
cias são desconhecidas e podem apresentar qualquer valor.
Deslocamento temporal arbitrário: o deslocamento temporal entre
as sequências é desconhecido e pode ser arbitrariamente grande.
Movimento desconhecido: o movimento 3D dos objetos na cena é
desconhecido e suas características são arbitrárias.
Falhas no rastreamento: pontos de interesse na cena não podem ser
rastreados de forma confiável ao longo de toda a sequência.
Geometria epipolar desconhecida: a geometria epipolar das câme-
ras de vídeo é desconhecida.
Escalabilidade: a eficiência computacional da metodologia utilizada
deve cair proporcionalmente ao aumento no número de sequências.
Ausência de pontos estáticos: nenhum ponto visível na cena per-
manece estático para dois ou mais quadros.
Neste sentido, esta tese apresenta uma abordagem inovadora que trata to-
dos os casos acima mencionados, com exceção do último caso. Em particular,
nós assumimos que para cada par de sequências de vídeo é possível identi-
ficar pontos estáticos suficientes na cena que permitam a obtenção de uma
estimativa inicial da geometria epipolar das câmeras. Além disso, com o ob-
jetivo de se assegurar que os parâmetros desta estimativa inicial permaneçam
constantes durante a aplicação de nossa abordagem, nós consideramos um
cenário no qual as câmeras são estáticas e apresentam parâmetros intrínsecos
e extrínsecos constantes e desconhecidos.
A idéia básica de nossa abordagem está fundamentada na definição de
uma reta N-dimensional que captura globalmente as relações temporais en-
tre todas as N sequências de vídeo. Uma propriedade fundamental desta
reta é que, embora seu conhecimento implique no conhecimento do
alinhamento temporal entre as sequências, a estimativa de pontos sobre a
mesma pode ser realizada sem o conhecimento prévio de tal alinhamento.
Utilizando-se esta propriedade como ponto de partida, a abordagem pro-
posta neste trabalho reduz o problema de se estimar o alinhamento temporal
entre N sequências para o problema de se estimar de forma robusta uma
única reta N-dimensional a partir de um conjunto de pontos gerados em $\mathbb{R}^N$.
Grande parte das soluções existentes na literatura para o problema de se
adquirir o alinhamento temporal realizam uma pesquisa explícita em todo o
espaço de soluções de alinhamentos possíveis (Caspi et al., 2002; Rao et al.,
2003; Wolf and Zomet, 2002a,b; Lee et al., 2000; Stein, 1998). Infelizmente,
a natureza combinatória desta pesquisa requer o estabelecimento de várias
hipóteses adicionais para torná-la gerenciável, como por exemplo, deve-se
conhecer a priori as taxas de quadros das sequências, deve-se assumir sem-
pre N = 2, que o desalinhamento temporal é um inteiro e ainda que tal
desalinhamento resida dentro de uma pequena faixa limitada definida pelo
usuário (tipicamente menor do que cinquenta quadros). Consequentemente,
ainda que a maior parte destas soluções possam ser utilizadas diante da ine-
xistência de pontos estáticos na cena, suas eficiências computacionais podem
limitar bastante suas aplicabilidades. Diferentemente de tais técnicas, nossa
abordagem alinha N sequências num único passo, pode tratar desalinhamen-
tos temporais arbitrariamente grandes e não requer qualquer informação a
priori sobre as relações temporais entre as mesmas.
A metodologia proposta nesta tese está mais diretamente relacionada com
a abordagem proposta em Caspi et al. (2002). Neste trabalho, a técnica pro-
posta pelos autores recupera a geometria epipolar e o desalinhamento tem-
poral entre as sequências a partir da trajetória no plano de imagem de um
único ponto na cena que é visível em ambas as sequências. Posteriormente,
a geometria epipolar e o desalinhamento temporal estimados são refinados
usando-se mais pontos. Para fazerem isso, os autores assumem que as taxas
de quadros são conhecidas e formulam um problema de otimização não-linear
para estimar os parâmetros refinados que capturam a geometria epipolar e
o desalinhamento temporal. Infelizmente, a natureza altamente não-linear
deste processo de otimização necessita da aquisição de boas estimativas ini-
ciais para a geometria epipolar e o desalinhamento temporal. É importante
ainda mencionar que a abordagem proposta em Caspi et al. (2002) assume
que um único ponto na cena possa ser rastreado de forma confiável ao longo
de toda a sequência, o que pode ser difícil de se realizar no caso de vídeos
de cenas complexas, onde o rastreamento pode falhar devido a diversos pro-
blemas, como por exemplo, problemas de oclusão. Nossa solução, por outro
lado, requer a habilidade de se rastrear pontos na cena somente ao longo de
dois quadros consecutivos da mesma sequência. Além disso, ela não requer a
habilidade de se estabelecer correspondência de pontos entre as sequências.
A seguir será apresentada uma breve descrição da metodologia desen-
volvida nesta tese, a qual pode ser dividida em duas etapas principais. Em
sua primeira etapa, chamada de Alinhamento Temporal, realiza-se uma
estimativa inicial dos parâmetros que recuperam o alinhamento temporal
entre as sequências, utilizando-se para isso as trajetórias de pontos de inte-
resse que se movem na cena bem como as estimativas iniciais das matrizes
fundamentais que recuperam o alinhamento espacial entre as sequências.
Na segunda e última etapa de nossa metodologia, chamada de Alinhamento
Espaço-Temporal, um processo de otimização linear é formulado para se
refinar os parâmetros temporais e espaciais estimados inicialmente, os
quais podem possuir erros significativos relacionados com as limitações
impostas pelas técnicas utilizadas para se recuperar as estimativas iniciais
das geometrias epipolares de pares de câmeras bem como pelas técnicas
utilizadas para se adquirir as trajetórias dos objetos móveis.
Alinhamento Temporal
Considere uma cena dinâmica monitorada simultaneamente por N câme-
ras localizadas em diferentes pontos de vista, onde tais câmeras obedecem
ao modelo de projeção em perspectiva. Assuma que cada câmera capture
quadros a uma taxa constante e desconhecida. Além disso, considere que
as câmeras não estejam sincronizadas, isto é, elas começam a capturar qua-
dros em diferentes instantes de tempo e possivelmente com taxas de quadros
distintas. Com o objetivo de se alinhar temporalmente as sequências de ví-
deo resultantes, deve-se determinar as correspondências entre os números dos
quadros de uma sequência de “referência” com os números dos quadros em
todas as outras sequências. Esta correspondência pode ser expressa como
um conjunto de equações lineares do tipo:
$$f_i = \alpha_i f_r + \beta_i, \qquad (1)$$
onde $f_i$ e $f_r$ são os números dos quadros da i-ésima sequência e da sequência
de referência, respectivamente, e $\alpha_i$, $\beta_i$ são constantes desconhecidas que
representam a dilatação e o deslocamento temporal, respectivamente, entre
as sequências. Em geral, estas constantes não são números inteiros.
As relações temporais entre pares de sequências capturadas pela Equação
(1) induzem uma relação global entre os números dos quadros das sequên-
cias de entrada. Especificamente, nós representamos esta relação global pela
reta N-dimensional L a seguir:
$$L = \left\{\, [\alpha_1 \ldots \alpha_N]^T t + [\beta_1 \ldots \beta_N]^T \;\middle|\; t \in \mathbb{R} \,\right\}. \qquad (2)$$
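A título de ilustração, o esboço a seguir (em Python, com nomes hipotéticos) mostra como os parâmetros de L, uma vez estimados, mapeiam um quadro da sequência de referência nos quadros correspondentes das demais sequências, conforme as Equações (1) e (2).

```python
import numpy as np

# Esboço hipotético: dado o quadro f_r da sequência de referência e os
# coeficientes (alpha_i, beta_i) de cada sequência, o ponto correspondente
# sobre a reta L fornece os números de quadro das demais sequências.
def ponto_sobre_L(f_r, alphas, betas):
    """Retorna [f_1, ..., f_N], com f_i = alpha_i * f_r + beta_i (Equação (1))."""
    return np.asarray(alphas, dtype=float) * f_r + np.asarray(betas, dtype=float)

# Exemplo: duas sequências com a mesma taxa de quadros (alpha = 1) e
# deslocamento temporal de 55 quadros em relação à referência.
print(ponto_sobre_L(100.0, alphas=[1.0, 1.0], betas=[0.0, 55.0]))  # [100. 155.]
```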
A propriedade chave desta reta N-dimensional, pela qual a estimativa
de pontos sobre a mesma pode ser realizada sem o conhecimento prévio do
alinhamento temporal entre as sequências, permite a elaboração de um algo-
ritmo simples para sua reconstrução a partir das trajetórias de características
de interesse que se movem na cena e são visíveis por duas ou mais sequências.
Especificamente, seja $q_1(f_1)$ a projeção instantânea na câmera 1, no quadro $f_1$,
de um ponto móvel na cena, expressa em coordenadas 2D homogêneas,
como ilustrado na Figura 3.1. Além disso, seja $q_i(\cdot)$ a trajetória formada pela
projeção deste ponto móvel na câmera i e suponha que a matriz fundamental
$F_{1i}$ entre as câmeras 1 e i seja conhecida para todo i, onde i = 2...N.
Neste caso, a câmera 1 é considerada a câmera de referência.
Se o ponto de interesse na cena é visível por todas as câmeras quando
o quadro $f_1$ é capturado pela câmera 1, tem-se a seguinte restrição:
O conjunto:
$$V_{q_1(f_1)} = \left\{\, [f_1 \ldots f_N]^T \;\middle|\; q_1(f_1)^T F_{1i}\, q_i(f_i) = 0,\ i = 2 \ldots N \,\right\}$$
contém pelo menos um ponto sobre a reta N-dimensional L.
Intuitivamente, a restrição acima pode ser imaginada como um procedi-
mento para se gerar um conjunto $V_{q_1(f_1)}$ de alinhamentos temporais “candi-
datos”, o qual deve conter pelo menos um ponto sobre a reta N-dimensional
procurada. Além disso, esta restrição nos diz que tal conjunto pode ser criado
de acordo com os seguintes passos principais: (1) intersectar a reta epipolar
de $q_1(f_1)$ com a trajetória $q_i(\cdot)$ na câmera i, (2) armazenar o(s) número(s)
dos quadro(s) correspondente(s) ao(s) ponto(s) de interseção na câmera i e
(3) gerar vetores de alinhamento temporal a partir dos números dos quadros
armazenados.
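Um esboço simplificado desse procedimento, para um único par de câmeras, é apresentado a seguir (interface e nomes hipotéticos); assume-se que a trajetória na câmera i está amostrada quadro a quadro, em coordenadas homogêneas.

```python
import numpy as np

# Esboço hipotético do passo de votação para um par de câmeras: intersecta a
# reta epipolar l = F_1i^T q1 com cada segmento [q_i(f), q_i(f+1)] da trajetória
# na câmera i e devolve os números de quadro (possivelmente fracionários)
# correspondentes aos pontos de interseção.
def quadros_candidatos(q1, F_1i, trajetoria_i):
    """q1: ponto homogêneo (3,); trajetoria_i: lista de pontos homogêneos, um por quadro."""
    l = F_1i.T @ q1                      # reta epipolar de q1 na câmera i
    candidatos = []
    for f in range(len(trajetoria_i) - 1):
        a, b = trajetoria_i[f], trajetoria_i[f + 1]
        da, db = l @ a, l @ b            # valores algébricos (com sinal) em relação à reta
        if da * db <= 0 and da != db:    # o segmento cruza a reta epipolar
            s = da / (da - db)           # fração ao longo do segmento
            candidatos.append(f + s)     # número de quadro fracionário f_i
    return candidatos
```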
Para que se possa aplicar a restrição acima apresentada, deve-se conhecer
a priori as matrizes fundamentais $F_{ij}$, as quais descrevem a geometria
epipolar de cada par de câmeras (i, j). Na prática, nossa abordagem obtém
uma estimativa inicial de $F_{ij}$ por meio da identificação de pontos estáticos
no plano de fundo da cena, os quais sejam visíveis por duas ou mais câmeras.
Uma vez que a reta N-dimensional L tenha sido reconstruída, isto é, uma
vez que a estimação do alinhamento temporal entre as sequências tenha sido
realizada, nossa abordagem realiza um processo de otimização linear dos
parâmetros de L juntamente com os parâmetros das matrizes fundamentais
que descrevem a geometria da cena. A seguir, será descrito o algoritmo
proposto nesta tese para se reconstruir L.
Reconstrução da reta N-dimensional L
Pode-se notar que a restrição descrita na seção anterior nos leva a um
algoritmo baseado em um processo de votação para se reconstruir a reta
N-dimensional que recuperará o alinhamento temporal entre N sequências.
Especificamente, este algoritmo opera em duas etapas principais. Na primeira
etapa, escolhe-se uma das sequências de vídeo como sendo a sequência de
“referência” e utiliza-se as posições instantâneas $q_r(f_r)$ de cada trajetória
$q_r(\cdot)$ desta sequência de referência juntamente com as trajetórias completas
$q_i(\cdot)$ das outras sequências com o objetivo de se estimar $V_{q_r(f_r)}$ para cada
$q_r(f_r)$. Na segunda etapa, estima-se a reta N-dimensional L a partir da
nuvem de pontos formada pela união dos conjuntos $V_{q_r(f_r)}$. Sendo assim,
para especificarmos completamente este algoritmo, três questões principais
devem ser respondidas:
1. Como são calculadas as trajetórias $q_i(\cdot)$ dos pontos de interesse na cena, para i = 1, ..., N?
2. Como é estimado o conjunto $V_{q_r(f_r)}$ para cada $q_r(f_r)$?
3. Como são calculados os parâmetros de L?
Para se calcular as trajetórias $q_i(\cdot)$, nossa metodologia utiliza um rastrea-
dor de características, o qual é tratado por nosso algoritmo como uma “caixa
preta” responsável por retornar uma lista de segmentos de reta de caracterís-
ticas correspondentes para todo par de quadros consecutivos. É importante
ressaltar que nosso algoritmo independe do rastreador utilizado. Assim, a
escolha de uma determinada metodologia de rastreamento depende somente
da complexidade da cena e das propriedades das características de interesse.
No que se refere ao cálculo do conjunto $V_{q_r(f_r)}$ para um dado $q_r(f_r)$, nosso
algoritmo utiliza as estimativas iniciais das matrizes fundamentais $F_{ij}$, entre
cada par (i, j) de câmeras, assim como também os segmentos de reta forne-
cidos pelo rastreador. Quando um segmento de reta específico intersecta a
reta epipolar de $q_r(f_r)$, define-se um número de quadro possivelmente não
inteiro $f_i$, o qual representa a i-ésima coordenada de um elemento potencial
de $V_{q_r(f_r)}$. Para se gerar $V_{q_r(f_r)}$, agrupa-se todas as coordenadas $f_i$ calcula-
das para todas as sequências e concatena-se as mesmas de tal forma que elas
construam vetores N-dimensionais válidos, os quais representam alinhamen-
tos temporais candidatos em um espaço de votos. Estes passos são ilustrados
na Figura 3.4(a)-(d) para o conjunto de duas sequências de vídeo reais exibidas
na Figura 5.1.
Observe que se a reta epipolar de $q_r(f_r)$ intersecta dois ou mais segmentos
de reta em cada uma das $N-1$ câmeras restantes, tem-se um total de $2^{N-1}$
possíveis maneiras de se concatenar em um vetor N-dimensional as coorde-
nadas $f_i$ estimadas. Para se evitar a inclusão de um número exponencial de
vetores em $V_{q_r(f_r)}$, nosso algoritmo somente inclui aqueles que sejam consis-
tentes com a geometria epipolar das câmeras. Em particular, concatena-se
as coordenadas $f_i$ e $f_j$ para as câmeras i e j, respectivamente, se os pontos
de interseção que as definiram estão próximos de suas retas epipolares corres-
pondentes. Observe que nosso procedimento de concatenação é conservador,
isto é, ele garante que o conjunto de vetores gerado desta maneira será um
superconjunto de $V_{q_r(f_r)}$.
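O teste de consistência usado nessa concatenação pode ser esboçado como a seguir (implementação hipotética, apenas ilustrativa): as coordenadas só são combinadas se cada ponto de interseção está suficientemente próximo da reta epipolar induzida pelo outro, dentro de uma tolerância dada pelo erro médio das matrizes fundamentais.

```python
import numpy as np

def distancia_ponto_reta(p, l):
    """Distância euclidiana do ponto homogêneo p (última coordenada 1) à reta homogênea l."""
    return abs(l @ p) / np.hypot(l[0], l[1])

# Esboço hipotético do teste conservador; assume-se a convenção q_i^T F_ij q_j = 0.
def consistentes(q_i, q_j, F_ij, e):
    d_j = distancia_ponto_reta(q_j, F_ij.T @ q_i)   # q_j vs. reta epipolar de q_i
    d_i = distancia_ponto_reta(q_i, F_ij @ q_j)     # q_i vs. reta epipolar de q_j
    return d_i <= e and d_j <= e                    # e: erro médio das matrizes fundamentais
```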
Em geral, o conjunto de todos os alinhamentos temporais candidatos con-
tém um grande número de dados espúrios (do inglês, outliers), como ilustrado
na Figura 3.4(d). Neste contexto, para se reconstruir a reta N-dimensional
L, nossa metodologia se baseia no uso de um algoritmo bastante robusto
à presença de tais dados, conhecido como RANSAC (Fischler and Bolles,
1981), o qual é descrito detalhadamente no Apêndice A. Basicamente, o al-
goritmo RANSAC escolhe aleatoriamente um par de alinhamentos temporais
candidatos para se definir a reta L e, em seguida, calcula o número total de
candidatos que se encontram a uma distância máxima desta reta. Estes
dois passos são repetidos durante um número determinado de iterações. Por-
tanto, os dois parâmetros críticos deste algoritmo são o número z de iterações
e a distância máxima. Para se determinar z, utiliza-se a seguinte equação:

$$z = \frac{\log(1-p)}{\log(1-r^2)}, \qquad (3)$$
onde p é um parâmetro especificado a priori cujo valor varia de 0 a 1 e r é
a probabilidade de um candidato aleatoriamente selecionado ser um ponto
sobre L (do inglês, inlier). A Equação (3) expressa o fato de que z deva
ser suficientemente grande para assegurar que, com probabilidade p, pelo
menos um dos pares de candidatos selecionados aleatoriamente esteja sobre
L. Em nossos experimentos, foram utilizados p = 0.99 e r = 0.05, os quais são
valores conservadores que levaram aos resultados mais exatos de alinhamento
espaço-temporal. A distância máxima é calculada pela distância média
entre características estáticas identificadas nas sequências de entrada e suas
respectivas retas epipolares.
Finalmente, após a estimativa realizada pelo RANSAC do potencial con-
junto de candidatos sobre os quais a reta L deva passar, nossa metodologia
utiliza o método dos mínimos quadrados sobre este conjunto para se estimar
os parâmetros de L. A seguir, será apresentada a segunda etapa de nossa
metodologia, a qual consiste na realização de um processo de otimização dos
parâmetros temporais e espaciais estimados na primeira etapa.
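Um esboço dessa etapa de estimação robusta (em Python, com interface hipotética) é mostrado a seguir; o ajuste final por mínimos quadrados é feito aqui via componente principal dos inliers, uma das formas possíveis de realizar esse ajuste.

```python
import numpy as np

# Esboço hipotético do ajuste robusto da reta L em R^N: RANSAC com o número
# de iterações z dado pela Equação (3), seguido de mínimos quadrados sobre os inliers.
def ajustar_reta_L(votos, dist_max, p=0.99, r=0.05, rng=np.random.default_rng(0)):
    votos = np.asarray(votos, dtype=float)               # (M, N) candidatos
    z = int(np.ceil(np.log(1 - p) / np.log(1 - r**2)))   # Equação (3)
    melhor_inliers = None
    for _ in range(z):
        a, b = votos[rng.choice(len(votos), 2, replace=False)]
        d = b - a
        n = np.linalg.norm(d)
        if n == 0:
            continue
        d /= n
        # distância de cada voto à reta que passa por a com direção d
        res = (votos - a) - np.outer((votos - a) @ d, d)
        inliers = np.linalg.norm(res, axis=1) <= dist_max
        if melhor_inliers is None or inliers.sum() > melhor_inliers.sum():
            melhor_inliers = inliers
    # mínimos quadrados: a reta passa pelo centróide dos inliers, na direção
    # do primeiro componente principal
    X = votos[melhor_inliers]
    centro = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - centro)
    return centro, Vt[0]                                  # ponto e direção de L
```

No texto, foram utilizados p = 0.99 e r = 0.05, e a distância máxima é dada pela distância média entre características estáticas e suas retas epipolares.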
Alinhamento Espaço-Temporal
Embora as imagens de uma cena dinâmica possam conter pontos estáti-
cos em seus planos de fundo, observa-se que tais pontos frequentemente não
representam a maior parte das características detectáveis na cena. Qualquer
metodologia que busque estimar a geometria epipolar entre um par de câme-
ras exclusivamente a partir deste conjunto de características estáticas estará
provavelmente ignorando um conjunto significativamente grande de informa-
ções relevantes disponíveis nas imagens. Na prática, esta conduta poderá
causar erros nas matrizes fundamentais calculadas e, em última análise, na
reta N-dimensional L calculada por nosso método. A seguir, será brevemente
mostrado como nossa metodologia refina as matrizes fundamentais $F_{ij}$ e os
parâmetros de L por meio da consideração de todas as características detec-
tadas nas sequências. Sem perda de generalidade, a câmera 1 é assumida
como sendo a câmera de referência.
Seja $q_1(f_1)$ a projeção de um ponto da cena no plano de imagem da
câmera 1 durante a aquisição do quadro $f_1$. Além disso, suponha que a
projeção instantânea de tal ponto da cena em uma câmera i no quadro $f_i$
possa ser parametrizada como a seguir:

$$q_i(f_i) = (1 - f_i)\, q_i(f_a) + f_i\, q_i(f_b), \qquad (4)$$
onde $q_i(f_a)$ e $q_i(f_b)$ são os extremos de um segmento linear, o qual contém
a posição $q_i(f_i)$. Observe que esta equação permite estabelecer uma restri-
ção envolvendo os parâmetros de L e os parâmetros da matriz fundamental
$F_{1i}$. Especificamente, pode-se utilizar os parâmetros de L para se calcular
a posição instantânea $q_i(f_i)$ correspondente à posição instantânea $q_1(f_1)$ na
câmera 1, como se segue:
$$q_i(f_i) = \left[1 - (\alpha_i f_1 + \beta_i)\right] q_i(f_a) + (\alpha_i f_1 + \beta_i)\, q_i(f_b). \qquad (5)$$
Além disso, sendo $q_i(f_i)$ e $q_1(f_1)$ pontos correspondentes, tem-se que:

$$q_1(f_1)^T F_{1i}\, q_i(f_i) = 0. \qquad (6)$$
Combinando-se as Equações (5) e (6), obtém-se uma equação homogênea
que leva em conta os parâmetros estimados de L, isto é, os parâmetros $\alpha_i$,
$\beta_i$, bem como os parâmetros estimados da matriz fundamental $F_{1i}$:
$$(1 - \alpha_i f_1 - \beta_i)\, q_1(f_1)^T F_{1i}\, q_i(f_a) + (\alpha_i f_1 + \beta_i)\, q_1(f_1)^T F_{1i}\, q_i(f_b) = 0. \qquad (7)$$
Na prática, erros nos parâmetros de L e $F_{1i}$ fazem com que a Equação
(7) não seja satisfeita exatamente, originando-se um resíduo que representa
a distância algébrica entre $q_i(f_i)$ e sua reta epipolar correspondente. Para
se refinar a estimativa inicial dos parâmetros temporais e espaciais, nossa
técnica realiza a expansão da Equação (7) ao introduzir as incógnitas $\Delta F_{1i}$,
$\Delta\alpha_i$ e $\Delta\beta_i$ na mesma, de forma que:

$$F_{1i} \rightarrow F_{1i} + \Delta F_{1i}, \qquad \alpha_i \rightarrow \alpha_i + \Delta\alpha_i, \qquad \beta_i \rightarrow \beta_i + \Delta\beta_i.$$

Dada a equação expandida, determina-se as incógnitas $\Delta F_{1i}$, $\Delta\alpha_i$ e $\Delta\beta_i$
que minimizam seu resíduo. Note que para cada ponto na cena e cada qua-
dro no qual ele é visível, nossa técnica obtém uma equação em função dos
parâmetros desconhecidos que especificam completamente $\Delta F_{1i}$, $\Delta\alpha_i$ e $\Delta\beta_i$.
Esta equação é não linear pelo fato de conter os termos de segunda ordem
$(\Delta\alpha_i\,\Delta F_{1i})$ e $(\Delta\beta_i\,\Delta F_{1i})$. Em nossa implementação, nós simplificamos os cál-
culos a serem realizados ao ignorarmos estes termos não lineares e resolvermos
o sistema resultante sobredeterminado de equações lineares.
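A título de ilustração, segue um esboço hipotético de uma iteração desse refinamento para um par de câmeras; por simplicidade, restrições adicionais sobre F (escala e posto) e a acumulação de observações de vários pontos ao longo do tempo são omitidas.

```python
import numpy as np

# Esboço hipotético: para cada observação (q1, q_a, q_b, f1), a Equação (7) é
# expandida em F <- F + dF, alpha <- alpha + dalpha, beta <- beta + dbeta; os
# termos de segunda ordem são descartados e o sistema linear sobredeterminado
# resultante é resolvido por mínimos quadrados.
def refinar_uma_iteracao(F, alpha, beta, observacoes):
    A, b = [], []
    for q1, qa, qb, f1 in observacoes:                   # pontos homogêneos (3,)
        s = alpha * f1 + beta
        g = (1 - s) * (q1 @ F @ qa) + s * (q1 @ F @ qb)  # resíduo atual da Eq. (7)
        coef_F = ((1 - s) * np.outer(q1, qa) + s * np.outer(q1, qb)).ravel()
        dg_ds = q1 @ F @ (qb - qa)                       # derivada em relação a (alpha*f1+beta)
        A.append(np.concatenate([coef_F, [f1 * dg_ds, dg_ds]]))
        b.append(-g)
    x, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    dF, dalpha, dbeta = x[:9].reshape(3, 3), x[9], x[10]
    return F + dF, alpha + dalpha, beta + dbeta
```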
Em geral, as trajetórias de um objeto móvel na cena são altamente não
lineares. Para superarmos este problema, a técnica de refinamento acima
descrita é aplicada a um conjunto de segmentos de reta que aproximam a
trajetória original realizada pelo ponto na cena (Horst and Beichl, 1997),
como ilustrado na Figura 4.1. A seguir são apresentados os resultados
experimentais que comprovam a eficácia da metodologia proposta neste
trabalho.
Resultados experimentais
Os experimentos realizados neste trabalho são divididos em dois grupos
principais. No primeiro grupo, realizamos experimentos com sequências
de vídeo reais para avaliarmos e demonstrarmos a aplicabilidade de nossa
abordagem, sobretudo em cenários desafiadores para os demais métodos
encontrados na literatura, como por exemplo, cenários nos quais o sistema
de rastreamento não é confiável ao longo de toda a sequência e estimativas
iniciais exatas para as matrizes fundamentais não são possíveis. Por
outro lado, para avaliarmos cuidadosamente a escalabilidade e exatidão de
nossa metodologia diante da variação de alguns parâmetros críticos para a
qualidade dos alinhamentos espaço-temporais calculados, um segundo grupo
de experimentos com sequências sintéticas de uma cena artificial foi reali-
zado. A seguir, os resultados experimentais obtidos são brevemente descritos.
Sequências reais
As sequências de vídeo reais utilizadas em nossos experimentos con-
tém dois e três pontos de vista distintos do evento físico monitorado,
sendo representantes de cenários bastante diversos onde, por exemplo,
as sequências possuem comprimentos que variam de 55 a 605 quadros,
os desalinhamentos temporais variam de 21 a 285 quadros, as imagens
podem apresentar desde uma alta qualidade (sequências capturadas em
laboratório) a uma baixa qualidade (video clipes de transmissões de TV),
as características de interesse podem se mover desde vários pixels a menos
do que um pixel por quadro e as razões entre as taxas de quadros de
pares de sequência variam entre 1 e 2. Em todos os casos, as dimensões
das imagens são 320 × 240 pixels. A exatidão dos alinhamentos temporais
calculados é avaliada com base no cálculo do erro de alinhamento temporal
médio. Este erro é dado pela média das diferenças absolutas entre as
coordenadas temporais calculadas utilizando-se a reta L estimada por nosso
método e as coordenadas temporais calculadas usando-se a reta L de refe-
rência, a qual é determinada a partir de um processo de alinhamento manual.
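Essa métrica pode ser expressa, de forma esquemática (nomes hipotéticos), como a seguir.

```python
import numpy as np

# Esboço hipotético da métrica: erro médio de alinhamento temporal entre a reta
# L estimada e a reta L de referência (obtida por alinhamento manual), avaliado
# sobre os quadros da sequência de referência.
def erro_medio_alinhamento(alpha, beta, alpha_ref, beta_ref, quadros_ref):
    f_r = np.asarray(quadros_ref, dtype=float)
    return np.mean(np.abs((alpha * f_r + beta) - (alpha_ref * f_r + beta_ref)))
```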
Sequência 1
Como um primeiro experimento, nossa técnica de alinhamento espaço-
temporal foi aplicada a duas sequências de vídeo (veja Figura 5.1) também
utilizadas em Caspi and Irani (2000). Basicamente, o objetivo deste expe-
rimento inicial era comparar a eficácia de nossa técnica com a eficácia da
metodologia desenvolvida por estes autores. As duas sequências adquirem
quadros a uma taxa de 25qps (α = 1) e possuem um desalinhamento tempo-
ral de referência dado por: β = 55 ± 0.5 quadros.
A Figura 3.4(d) exibe a reta L reconstruída utilizando-se a primeira
etapa de nossa metodologia. Esta estimativa inicial de L leva a um erro de
alinhamento temporal médio de 0.66 quadros. Entretanto, após a aplicação
de nossa técnica de refinamento, a nova reta estimada levou a um erro
ainda menor de 0.35 quadros. Note que os parâmetros calculados por nosso
método são tão exatos quanto aqueles calculados em Caspi and Irani (2000),
embora nossa metodologia se baseie na solução de um problema muito me-
nos restritivo (por exemplo, α é desconhecido e a cena não precisa ser planar).
Sequência 2
O segundo experimento foi realizado com duas sequências de mesma taxa
de quadros (30qps) adquiridas em laboratório. Neste caso, os objetivos prin-
cipais eram mostrar que nossa metodologia independe do fato da cena ser
ou não planar e do tamanho do desalinhamento temporal entre as sequên-
cias. Especificamente, a cena monitorada é constituída por objetos (robôs)
que se movem em dois planos distintos, como ilustrado na Figura 5.3. Os
parâmetros temporais são α = 1 e β = 284.5 ± 2 quadros. Observe que o
desalinhamento temporal entre as sequências é significativamente grande.
A reta L inicialmente reconstruída é mostrada na Figura 5.4, a qual leva
a um erro de alinhamento temporal médio de 5.84 quadros. Entretanto, após
a aplicação da técnica de refinamento, a nova reta estimada diminuiu este
erro para 4.43 quadros. Dado que os robôs se movem muito lentamente, este
erro de alinhamento temporal aparentemente significativo não é perceptível
visualmente. A Figura 5.7 confirma a boa qualidade do alinhamento
temporal recuperado.
Sequência 3
Neste experimento, duas pessoas são monitoradas por um par de câmeras
com mesma taxa de quadros (30qps), enquanto realizam um malabarismo
com cinco bolas coloridas em suas mãos (veja Figura 5.5). Estas sequências
representam um cenário bastante difícil para todos os métodos existentes,
uma vez que (1) as trajetórias das diferentes bolas se sobrepõem no mundo
3D, (2) as trajetórias individuais são aproximadamente cíclicas e (3) as bolas
se movem muito rapidamente (mais de 9 pixels por quadro). Os parâmetros
temporais são α = 1 e β = 41 ± 0.5 quadros. Para tornar o problema de se
estimar o alinhamento temporal ainda mais desafiador, uma das sequências de
entrada foi inicialmente modificada por meio da inserção de novos quadros e,
posteriormente, por meio da exclusão de quadros. Estas modificações foram
realizadas para se simular sequências que apresentam mais do que uma taxa
de quadros e sequências que possuem vídeo clipes intermediários, como por
exemplo, comerciais de TV.
As Figuras 5.6(a)-(c) mostram as retas reconstruídas antes do refina-
mento, as quais capturam os alinhamentos temporais entre as sequências.
Para as sequências originais, obtivemos erros de alinhamento temporal de
0.75 e 0.26 quadros antes e após o refinamento, respectivamente.
Sequência 4
Neste último experimento com sequências reais, nossa técnica foi aplicada
a três sequências de vídeo referentes a uma partida de futebol (veja Figura
5.9). As sequências apresentam baixa qualidade e foram adquiridas a partir
de câmeras que se moviam. Neste contexto, para que pudéssemos utilizar
nossa metodologia, foi preciso compensarmos os movimentos das câmeras
numa fase anterior (Brown and Lowe, 2003).
Uma vez que agora três sequências de vídeo são consideradas, a reta L
calculada por nossa metodologia é uma reta 3D, como ilustrada nas Figuras
5.10(b) e 5.10(c). Uma percepção visual da boa qualidade do alinhamento
temporal estimado por nossa técnica é fornecida pela Figura 5.11.
Sequências sintéticas
Na segunda etapa de nossos experimentos, um software foi desenvolvido
para se simular as cinemáticas de partículas 3D com movimentos indepen-
dentes, bem como a visualização das mesmas a partir de múltiplos pontos de
vista, gerando-se assim sequências de vídeo sintéticas. Por meio da utilização
de tais dados sintéticos, foi possível controlar e avaliar os impactos de valores
específicos atribuídos a parâmetros considerados críticos para a exatidão e
escalabilidade de nossa metodologia.
Em particular, com estes experimentos foram analisados os efeitos da
variação de três parâmetros principais: (1) número de objetos rastreados,
(2) erro na geometria epipolar de cada par de câmeras e, finalmente, (3) o
nível de ruído do sistema de rastreamento.
A cena artificial gerada simula a cena ilustrada na Figura 5.12(a), a qual
contém duas câmeras calibradas, cujos parâmetros intrínsecos e extrínsecos
são aqueles das câmeras utilizadas nos experimentos com sequências reais
obtidas em laboratório. As trajetórias das partículas aleatórias eram geradas
no interior de uma esfera 3D, cujo raio foi definido empiricamente de forma a
maximizar suas projeções nos planos de imagem de ambas as câmeras (veja
Figura 5.12(b)). Cada partícula era iniciada aleatoriamente em uma posição
uniformemente distribuída no interior desta esfera. O modelo utilizado para
a cinemática de uma partícula específica é ilustrado na Figura 5.13.
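Apenas como ilustração do procedimento de inicialização (implementação hipotética, não a do software original), posições uniformemente distribuídas no interior de uma esfera podem ser geradas como a seguir.

```python
import numpy as np

# Esboço hipotético: k posições iniciais uniformemente distribuídas no interior
# de uma esfera 3D de raio `raio` centrada em `centro`.
def posicoes_iniciais(k, raio, centro, rng=np.random.default_rng(0)):
    direcoes = rng.normal(size=(k, 3))
    direcoes /= np.linalg.norm(direcoes, axis=1, keepdims=True)
    raios = raio * rng.random(k) ** (1 / 3)   # r ~ U^(1/3) garante densidade uniforme no volume
    return np.asarray(centro) + direcoes * raios[:, None]
```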
Considerando-se conjuntos de k partículas ($k \in \{1, 2, 4, 8, 16, 32\}$), dois
cenários principais foram concebidos para a realização dos experimentos com
sequências sintéticas. Primeiramente, considerou-se um cenário no qual a
geometria epipolar do par de câmeras continha um erro fixo de 2 pixels e
variou-se o desvio padrão R (em pixels) do ruído gaussiano de média zero
adicionado ao rastreador ($R \in \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$). Por outro lado, no
segundo cenário, fixou-se o desvio padrão do ruído do rastreador em 2 pixels
e variou-se o erro $\varepsilon_f$ da matriz fundamental ($\varepsilon_f \in \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$).
Fixando-se o erro da matriz fundamental e o desvio padrão do ruído do
rastreador em 2 pixels, simulou-se situações mais realísticas, uma vez que
tal valor foi o pior caso para as matrizes fundamentais calculadas para as
sequências reais e representa na média o desvio padrão do ruído observado
em alguns dos rastreadores usados na prática. Para cada tupla $(\varepsilon_f, R, k)$
em cada um dos dois cenários acima descritos, foram simulados 100 eventos
dinâmicos distintos na cena artificial. Em seguida, dados os espaços de votos
dos eventos dinâmicos simulados, foram calculados os percentuais das retas
que capturam o alinhamento temporal entre as sequências que levam a erros
médios de alinhamento temporal menores ou iguais a 1, 2 e 5 quadros.
Analisando-se os resultados obtidos e supondo-se que o limite mínimo
desejável para o percentual de retas que levam a um determinado erro médio
de alinhamento temporal seja 95%, nota-se que para valores de erro na geo-
metria epipolar e no sistema de rastreamento próximos àqueles verificados na
prática, nossa metodologia não somente alcançou como também superou o li-
mite desejável de 95% para diversos valores de número de objetos rastreados,
como ilustrado na Figura 5.16(a). Observe nesta figura que o uso da técnica
de refinamento mostrou-se ser de grande importância, sendo que a melhoria
trazida aos resultados pela mesma caiu bruscamente quando se aumentou o
erro no sistema de rastreamento e na geometria epipolar das câmeras (por
exemplo, veja Figura 5.20(d), onde R = 10).
Considerando-se os valores de erro na geometria epipolar e no rastreador
que foram efetivamente observados na prática ($\varepsilon_f = R = 2$ pixels), observa-se
a partir das Figuras 5.30(a) e 5.33(a) que para aplicações onde se necessita de
alinhamentos temporais bastante exatos ($\varepsilon_f \leq 1$ quadro), nossa metodologia
alcançou um percentual de sucessos de 60%. Por outro lado, considerando-se
aplicações mais flexíveis quanto ao valor do erro de alinhamento temporal
(por exemplo, $\varepsilon_f \leq 5$ quadros), nota-se que até mesmo para rastreadores
bastante corrompidos por ruído (R = 10) e matrizes fundamentais com erros
severos ($\varepsilon_f = 4$), nossa metodologia superou o limite desejável de 95% para
o percentual de sucessos, como ilustrado nas Figuras 5.30(c) e 5.33(c).
Finalmente, nota-se a partir de nossos resultados que para conjuntos
muito pequenos de objetos rastreados (por exemplo, k = 1) ou conjun-
tos muito grandes (por exemplo, k = 32), os percentuais de sucesso de
nossa metodologia diminuem significativamente, como exemplificado na Fi-
gura 5.20(d)-(f). As razões para tal fato estão relacionadas com os volumes
de pontos gerados nos correspondentes espaços de votos. Por exemplo, ob-
serve a Figura 5.14(a)-(f). Quando se tem um pequeno número de objetos
rastreados (veja Figura 5.14(a)), obtém-se consequentemente um espaço de
votos com poucos pontos e, em particular, poucos inliers, o que dificulta o
trabalho do RANSAC para encontrar a reta L desejada. Por outro lado,
quando um número muito grande de objetos rastreados é considerado (veja
Figura 5.14(f)), obtém-se um espaço de votos bastante denso com um volume
grande de dados espúrios, o que também leva a uma queda na qualidade do
conjunto de votos estimados pelo RANSAC e, consequentemente, nos parâ-
metros da reta estimada pelo método dos mínimos quadrados. Em geral,
os melhores resultados de nossa metodologia foram alcançados quando se
considerou conjuntos de aproximadamente 4 objetos rastreados.
Conclusões e propostas de trabalhos futuros
Esta tese propõe uma metodologia inovadora para se alinhar no tempo e
no espaço múltiplas sequências de vídeo adquiridas a partir de pontos de vista
distintos. Especificamente, a abordagem apresentada neste trabalho reduz
o problema de se estimar o alinhamento espaço-temporal entre sequências a
dois subproblemas mais simples: um problema de regressão linear e um pro-
blema de otimização linear, enquanto todos os demais trabalhos relacionados
ao nosso se baseiam na pesquisa da solução desejada em todo o espaço de
alinhamentos possíveis, o que faz com que os mesmos possuam uma natureza
combinatória.
A qualidade dos alinhamentos estimados por nossa abordagem e seu cor-
respondente custo computacional são invariantes à magnitude dos desalinha-
mentos temporais entre as sequências. Além disso, diferentemente dos demais
métodos existentes na literatura, os quais são dedicados a conjuntos de ape-
nas duas sequências de vídeo, nossa abordagem é capaz de alinhar num único
passo um número arbitrário de sequências.
Os experimentos realizados nesta tese demonstraram que nossa aborda-
gem é adequada para se resolver diversos problemas encontrados em aplica-
ções atuais que se beneficiam da disponibilidade de registros simultâneos de
um mesmo evento físico, fornecendo assim uma alternativa interessante às
tecnologias de sincronização de vídeos baseadas em hardware, as quais são
mais caras e mais difíceis de se utilizar em ambientes externos.
Do ponto de vista teórico, este trabalho demonstra sua relevância ao
fornecer evidências adicionais de que, por meio da consideração de pistas tem-
porais e espaciais em uma única metodologia, muitos eventos físicos que são
inerentemente ambíguos para métodos tradicionais de alinhamento imagem-
a-imagem são resolvidos eficazmente por técnicas de alinhamento sequência-
a-sequência.
Como trabalhos futuros, pesquisas teóricas adicionais precisam ser consi-
deradas. Primeiramente, deve-se estudar a criação de um modelo matemá-
tico alternativo para o desalinhamento temporal entre sequências de vídeo
quando as mesmas são adquiridas a partir de câmeras que não trabalham a
taxas de quadros constantes, um fato muitas vezes comum em aplicações de
robótica. Além disso, técnicas alternativas para a estimação inicial da geo-
metria epipolar de cada par de câmeras devem ser concebidas. Atualmente,
nossa metodologia considera a existência de um conjunto suficiente de pontos
estáticos na cena que sejam visíveis pelas câmeras, os quais possam ser uti-
lizados pelo algoritmo dos oito pontos normalizado para se estimar a matriz
fundamental. Sabe-se, entretanto, que este cenário pode não acontecer em
alguns casos.
Finalmente, uma outra direção importante para pesquisa futura consiste
na combinação de nossa metodologia com técnicas de reconstrução de cenas
3D para melhorar a eficácia de tais métodos. Atualmente, estamos pesqui-
sando a extensão da abordagem proposta nesta tese para a realização do
alinhamento espaço-temporal entre não somente câmeras baseadas no mo-
delo de projeção em perspectiva como também câmeras catadióptricas.
To my parents, brothers and sister, who taught me to persevere
and prepared me to face challenges with faith and humility. They
are a constant source of inspiration to my life.
Acknowledgments
It’s been quite a ride getting to this point, and I owe many thanks to
several people for many things.
Special thanks to my advisor, Rodrigo Carceroni, for sharing with me his
great ideas and for giving me the opportunity to get this far. Without his
support, advice and close supervision, this thesis would not be possible.
Thanks a lot to Geraldo Massahud for his invaluable help in designing
and building the system proposed in this thesis for aligning both in time and
space multiple video sequences. It was really funny to work with someone
who has an amazing ability to disagree with almost all conceivable points of
view and a complex nonlinear way of thinking!
Thanks to Professor Kyros Kutulakos, who also contributed significantly
to the development of this work. By working with him when writing our
paper for CVPR, I could understand what Professor Carceroni meant by:
“Professor Kyros has an unstoppable determination to reach perfection and
an extensive knowledge of the rules of the game...”.
I would like to thank the members of my thesis committee, professors
Guilherme Pereira, Hani Yehia, Paulo Carvalho and Siome Goldenstein, for
their suggestions to improve this work.
Many thanks to my colleagues and friends at VERLab, for the excellent
work atmosphere and the valuable discussions during the development of this
thesis. In special, I would like to thank the support of Pedro Shiroma, Vilar
Neto, Guilherme Pereira, Erikson Morais, Wagner Barros, José Pinheiro and
Bruno Pimentel.
I would like to express my gratitude to professors Mario Campos and
Antônio de Pádua Braga for showing me the beauty of a research career.
I would also like to thank my family for the support they provided me
through my entire life and in particular, I must acknowledge my brothers
Pedro and Paulo, who had a direct influence on my choice to pursue a
scientific career.
Special thanks to my girlfriend and best friend, Fabiana, without whose
love and encouragement, I would not have finished this thesis.
Thanks to the administrative staff of the Department of Computer Science
at UFMG, for making things run smoothly.
I thank CNPq for the financial support.
Finally, I thank all people with whom I’ve worked and who participated
directly or indirectly in this work.
Contents
List of Figures xxx
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Outline of this work . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 12
2.1 Spatio-Temporal Alignment . . . . . . . . . . . . . . . . . . . 12
2.2 Single View Geometry . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Extrinsic parameters . . . . . . . . . . . . . . . . . . . 19
2.2.2 Intrinsic parameters . . . . . . . . . . . . . . . . . . . 20
2.3 Multiple View Geometry . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Two-View Geometry . . . . . . . . . . . . . . . . . . . 22
3 Temporal Alignment 30
3.1 The Timeline Constraint . . . . . . . . . . . . . . . . . . . . . 30
3.2 The Temporal Alignment Algorithm . . . . . . . . . . . . . . 33
4 Spatio-Temporal Alignment 41
4.1 The Refinement Algorithm . . . . . . . . . . . . . . . . . . . . 42
4.1.1 Two-View Refinement . . . . . . . . . . . . . . . . . . 42
4.1.2 N-View Refinement . . . . . . . . . . . . . . . . . . . . 47
5 Experimental Results 49
5.1 Real-world Sequences . . . . . . . . . . . . . . . . . . . . . . . 50
5.1.1 Two-view Car Dataset . . . . . . . . . . . . . . . . . . 51
5.1.2 Two-view Robots Dataset . . . . . . . . . . . . . . . . . 53
5.1.3 Two-view Juggling Dataset . . . . . . . . . . . . . . . . 54
5.1.4 Three-view Soccer Dataset . . . . . . . . . . . . . . . . 60
5.2 Synthetic Sequences . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.1 Scene dynamics simulation . . . . . . . . . . . . . . . . 64
5.2.2 Error measurements and experiments’ description . . . 68
5.2.3 Evaluation of the experimental results . . . . . . . . . 71
6 Conclusions 98
6.1 Summary of the Accomplished Work . . . . . . . . . . . . . . 98
6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Bibliography 104
A The RANSAC Algorithm 109
A.1 Hypotheses evaluation . . . . . . . . . . . . . . . . . . . . . . 110
A.2 RANSAC parameters . . . . . . . . . . . . . . . . . . . . . . . 111
A.3 RANSAC efficiency . . . . . . . . . . . . . . . . . . . . . . . . 111
B Tensorial Notation 113
B.1 Tensorial notation and the trifocal tensor . . . . . . . . . . . . 115
C Multiple View Geometry 116
C.1 Three-View Geometry . . . . . . . . . . . . . . . . . . . . . . 116
C.1.1 The Trifocal Tensor T . . . . . . . . . . . . . . . . . . 116
C.2 N-View Geometry . . . . . . . . . . . . . . . . . . . . . . . . 122
List of Figures
1.1 The famous Hand of God goal scored by Maradona in the
1986 FIFA World Cup match between England and Argentina
(FIFA, 2004b). . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Images from two available sequences from the incident regard-
ing the supposed goal scored by the English player Geoff Hurst
in the 1966 FIFA World Cup (Reid and Zisserman, 1996). . . 5
1.3 Example of application of spatio-temporal alignment of multi-
ple sequences for multi-sensor fusion. (a) and (b) are tempo-
rally corresponding frames from the visible-light and Infra-Red
sequences, respectively. (c) shows the results of fusing the two
sequences after spatio-temporal alignment. (Caspi and Irani,
2001). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Alignment and integration of information across multiple
video sequences to exceed the limited spatial-resolution and
limited temporal-resolution of video cameras. (a)-(c) Display
the event captured by three low-resolution sequences. (d)
The reconstructed event as captured by the generated high-
resolution sequence. (e) A close-up image of the distorted ball
in one of the low resolution frames. (f) A close-up image
of the ball at the exact corresponding frame in time in the
high-resolution output sequence (Shechtman et al., 2002). . . . 7
2.1 The pinhole camera model (Hartley and Zisserman, 2003). . . 17
2.2 The homography map induced by a world plane (Hartley and
Zisserman, 2003). The ray corresponding to a point q is ex-
tended to meet the world plane π in a point Q. This point
is projected to a point $q'$ in the other image. The map from
q to $q'$ is the homography induced by the plane π. If $H_1$ and
$H_2$ are the perspectivities from the world plane π to the first
and second image planes, respectively, we have $q = H_1 Q$ and
$q' = H_2 Q$. It is the composition of these two perspectivities
that defines a homography H, $q' = H_2 H_1^{-1} q = Hq$, between
the image planes. . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 The epipolar geometry. . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Geometry of the Timeline Constraint. In this two-
camera example, the point’s trajectory in camera i inter-
sects the epipolar line, $q_1(f_1)^T F_{1i}$, twice. Given the inter-
section points $q_i(f_i)$ and $q_i(f'_i)$, we have the set
$V_{q_1(f_1)} = \{ [f_1 \; f_i]^T, [f_1 \; f'_i]^T \}$. . . . . . . . . . . . . . . . . . . 32
3.2 (a) Trajectory of the car’s pixel centroid in Sequence 1. (b)
Trajectory of the car’s pixel centroid in Sequence 2. The car
was tracked by a simple blob tracker that relies on foreground-
background detection to label all foreground pixels in each frame. 34
3.3 Two intersection points $q_i(f_i)$ and $q_{i+1}(f_{i+1})$ of cameras i and
i+1, respectively, are considered by our approach as potential
representations of the same scene point only if $d_i \leq e$ and
$d_{i+1} \leq e$, where $d_i$ and $d_{i+1}$ are distance values that measure
how close $q_i(f_i)$ and $q_{i+1}(f_{i+1})$ are to each others’ epipolar
lines and e is the fundamental matrices’ average error. . . . . . 36
3.4 (a) Trajectory of a feature in Sequence 1 of the Car dataset,
which is presented in Chapter 5 and illustrated in Figure 3.2.
The feature was the centroid of all pixels labeled as “fore-
ground” by a color-based foreground-background detector. (b)
Trajectory of the foreground pixel centroid in Sequence 2 of the
dataset. Also shown is the epipolar line corresponding to pixel
$q_1(363)$ in (a). (c) Magnified view of the trajectory/epipolar
line intersection in (b). The individual line segments con-
necting feature locations in adjacent frames are now visible.
Note that the epipolar line of $q_1(363)$ intersects multiple line
segments along the trajectory. (d) Exploiting the Timeline
Constraint for two-sequence alignment. Each point represents
a candidate temporal alignment, i.e., an element of $V_{q_1(f_1)}$ for
the feature location, $q_1(f_1)$, in (a). The reconstructed time-
line, drawn as a solid line, describes the temporal alignment
of the two sequences in the Car dataset. . . . . . . . . . . . . 38
4.1 Geometry of our refinement method. . . . . . . . . . . . . . . 43
5.1 Four representative frames (100, 200, 300, 400) from the
cameras 1 and 2, of the two-view Car dataset (Caspi
and Irani, 2000). We can identify the spatial misalign-
ment by observing near image boundaries, where different
static objects are visible in each sequence. The tempo-
ral misalignment is easily identified by comparing the po-
sition of the gate in frames 400. This dataset, along
with more experimental results and videos, are available at
http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . . . 52
5.2 The recovered epipolar geometry for the two-view Car dataset.
Points and their epipolar lines are displayed in each image
for verification. Accuracy of the computed fundamental ma-
trix can be appreciated by the closeness of each point to the
epipolar line of its corresponding point. . . . . . . . . . . . . . 53
5.3 Three representative frames (000, 150, 300) from the cam-
eras 1 and 2, of the two-view Robots dataset. The spa-
tial misalignment can be easily identified by observing
the distinct orientations of the robots’ soccer field. On
the other hand, the temporal misalignment can be no-
ticeable by comparing the position of the robot with the
dark green circle in frames 300. This dataset, along
with more experimental results and videos, are available at
http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . . . 54
5.4 Voting space, timeline, and timeline equation recovered prior
to refinement for the two-view Robots dataset. Each point is
an element of $V_{q_1(f_1)}$ for some feature $q_1(f_1)$ in sequence 1. . . 55
5.5 Four representative frames (100, 115, 120, 130) from the cam-
eras 1 and 2, of the two-view Juggling dataset. This dataset,
along with more experimental results and videos, are available
at http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . 56
5.6 Voting spaces, timelines, and timeline equations recovered
prior to refinement for the two-view Juggling dataset. (a)
Juggling dataset without modification, (b) simulation of a se-
quence with more than one frame rate and (c) simulation of a
sequence with spurious clips. . . . . . . . . . . . . . . . . . . . 57
5.7 Before alignment images were created by superimposing
the green band of a frame $f_2$ with the red and blue bands
of frame $f_1 = (f_2 - \beta^*)/\alpha^*$, using ground truth timeline coeffi-
cients $\alpha^*$ and $\beta^*$. After alignment images were created by
replacing the green band of the images above them with that
of frame $\hat{f}_1 = (f_2 - \beta)/\alpha$, with α, β computed by our algo-
rithm. For both types of images, deviations from the ground-
truth alignment cause “double exposure” artifacts (i.e., when
$f_1 \neq f_2$ or $\hat{f}_1 \neq f_1$, respectively). . . . . . . . . . . . . . . . . 58
5.8 Distribution of distances of inlier votes from the reconstructed time-
line. Left column: Distribution before the timeline refinement
stage. Right column: Distribution after the refinement stage.
Note that the updated epipolar geometry and updated timeline
parameters reduce the distance between inliers and the timeline
and cause more votes to be labeled as inliers. . . . . . . . . . . . 59
5.9 Two representative frames (8, 46) from the cameras 1, 2
and 3, of the three-view Soccer dataset. This dataset, along
with more experimental results and videos, are available at
http://www.dcc.ufmg.br/˜cardeal/research/timeline/ . . . . . . 61
5.10 (a),(b) Two views of the 3D voting space and 3D timeline
computed for the Soccer dataset. . . . . . . . . . . . . . . . . 62
5.11 Before alignment images were created by superimposing
the green band of a frame $f_2$ with the red and blue bands
of frame $f_1 = (f_2 - \beta^*)/\alpha^*$, using ground truth timeline coeffi-
cients $\alpha^*$ and $\beta^*$. After alignment images were created by
replacing the green band of the images above them with that
of frame $\hat{f}_1 = (f_2 - \beta)/\alpha$, with α, β computed by our algo-
rithm. For both types of images, deviations from the ground-
truth alignment cause “double exposure” artifacts (i.e., when
$f_1 \neq f_2$ or $\hat{f}_1 \neq f_1$, respectively). . . . . . . . . . . . . . . . . 63
5.12 Simulation of scene dynamics. (a) Experimental setup. (b)
All features start in random positions uniformly distributed
within the illustrated sphere, which is viewed from both cameras. 65
5.13 Feature dynamics. . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.14 Examples of voting spaces for k = 1, 2, 4, 8, 16 and 32 features.
In all these cases, the fundamental matrix has an average error
of 2 pixels ($\varepsilon_f = 2$) and the tracker is corrupted by a gaussian
noise with mean zero and standard deviation of ±2 pixels (R =
2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.15 Percentage of inliers for each voting space of Figure 5.14, where
k represents the number of tracked features (k = 1, 2, 4, 8, 16
and 32). A specific vote is considered an inlier if its coordinates
$f_1$ and $f_2$ satisfy $|f_2 - f_1 + 32| \leq 1$ frame. . . . . . . . . . . . .
5.16 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard
deviation R of the tracker’s gaussian noise is ±1 pixel. (d), (e)
and (f) Standard deviation R of the tracker’s gaussian noise is
±2 pixels. In all cases, a fundamental matrix with an average
error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . . . . . 79
5.17 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Stan-
dard deviation R of the tracker’s gaussian noise is ±3 pixels.
(d), (e) and (f) Standard deviation R of the tracker’s gaussian
noise is ±4 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . 80
5.18 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Stan-
dard deviation R of the tracker’s gaussian noise is ±5 pixels.
(d), (e) and (f) Standard deviation R of the tracker’s gaussian
noise is ±6 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . 81
5.19 Percentages of timelines that lead to average temporal align-
ment errors smaller than or equal to 1, 2 and 5 frame(s), for
k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Stan-
dard deviation R of the tracker’s gaussian noise is ±7 pixels.
(d), (e) and (f) Standard deviation R of the tracker’s gaussian
noise is ±8 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used ($\varepsilon_f = 2$). . . . . . . . . . . . . 82
5.20 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Standard deviation R of the tracker's gaussian
noise is ±9 pixels. (d), (e) and (f) Standard deviation R of the tracker's
gaussian noise is ±10 pixels. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 83
5.21 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 1 pixel. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 2 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 84
5.22 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 3 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 4 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 85
5.23 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 5 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 6 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 86
5.24 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 7 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 8 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 87
5.25 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32
feature(s). (a), (b) and (c) Average error of the fundamental matrix:
ε_f = 9 pixels. (d), (e) and (f) Average error of the fundamental matrix:
ε_f = 10 pixels. In all cases, the tracker is corrupted by gaussian noise with
a standard deviation of ±2 pixels (R = 2). . . . . 88
5.26 Space of estimated parameters for (a) k = 1 feature and (b) k = 2
features. Each point shown corresponds to the parameters of an estimated
timeline with temporal alignment error smaller than or equal to 1 frame.
m_α and m_β represent the average values, and σ²_α and σ²_β the variances,
of the temporal dilation (α) and the temporal shift (β), respectively. In both
cases, the average error of the fundamental matrix and the standard deviation
of the gaussian noise added to the tracker are 2 pixels. . . . . 89
5.27 Space of estimated parameters for (a) k = 4 features and (b) k = 8
features. Each point shown corresponds to the parameters of an estimated
timeline with temporal alignment error smaller than or equal to 1 frame.
m_α and m_β represent the average values, and σ²_α and σ²_β the variances,
of the temporal dilation (α) and the temporal shift (β), respectively. In both
cases, the average error of the fundamental matrix and the standard deviation
of the gaussian noise added to the tracker are 2 pixels. . . . . 90
5.28 Space of estimated parameters for (a) k = 16 features and (b) k = 32
features. Each point shown corresponds to the parameters of an estimated
timeline with temporal alignment error smaller than or equal to 1 frame.
m_α and m_β represent the average values, and σ²_α and σ²_β the variances,
of the temporal dilation (α) and the temporal shift (β), respectively. In both
cases, the average error of the fundamental matrix and the standard deviation
of the gaussian noise added to the tracker are 2 pixels. . . . . 91
5.29 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's
noise level. (a), (b) and (c) Results for k = 1 feature. (d), (e) and (f)
Results for k = 2 features. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 92
5.30 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's
noise level. (a), (b) and (c) Results for k = 4 features. (d), (e) and (f)
Results for k = 8 features. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 93
5.31 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's
noise level. (a), (b) and (c) Results for k = 16 features. (d), (e) and (f)
Results for k = 32 features. In all cases, a fundamental matrix with an
average error of 2 pixels is used (ε_f = 2). . . . . 94
5.32 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the average
error of the fundamental matrix. (a), (b) and (c) Results for k = 1 feature.
(d), (e) and (f) Results for k = 2 features. In all cases, the tracker is
corrupted by gaussian noise with a standard deviation of ±2 pixels (R = 2). . 95
5.33 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the average
error of the fundamental matrix. (a), (b) and (c) Results for k = 4 features.
(d), (e) and (f) Results for k = 8 features. In all cases, the tracker is
corrupted by gaussian noise with a standard deviation of ±2 pixels (R = 2). . 96
5.34 Percentages of timelines that lead to average temporal alignment errors
smaller than or equal to 1, 2 and 5 frame(s), as a function of the average
error of the fundamental matrix. (a), (b) and (c) Results for k = 16 features.
(d), (e) and (f) Results for k = 32 features. In all cases, the tracker is
corrupted by gaussian noise with a standard deviation of ±2 pixels (R = 2). . 97
B.1 3D representation of the trifocal tensor. The picture represents
λ_i = λ'_j λ''_k T_i^{jk}, which is the contraction of the tensor with the
lines λ' and λ'' to produce a line λ. In pseudo-matrix notation this can be
written as λ_i = λ'^T T_i λ'', where (T_i)_{jk} = T_i^{jk}
(Hartley and Zisserman, 2003). . . . . 115
C.1 Three images of a line define it as the intersection of three
planes in the same pencil. . . . . . . . . . . . . . . . . . . . . 118
Chapter 1
Introduction
I don’t want to achieve immortality through my work.
I want to achieve it through not dying.
Woody Allen
In this work we consider the problem of Spatio-Temporal Alignment of
Multiple Video Sequences of the same 3D scene, captured from distinct viewpoints.
The scene dynamics as well as static scene features are used as powerful cues
for estimating the temporal synchronization (temporal alignment) and the spatial
alignment between the sequences. Typically, the temporal misalignment between
video sequences arises for two main reasons. The first is that the input sequences
may have different frame rates (e.g., NTSC and PAL); the second is the existence
of a time shift or offset between the sequences, frequently created when the
cameras are not activated simultaneously. The spatial misalignment, on the other
hand, results from the different positions, orientations and internal calibration
parameters of all the cameras.
Real-world scenes where objects move and deform in 3D space are often
so complex that, in order to understand them completely, it is necessary
to observe them simultaneously from multiple viewpoints (Carceroni, 2001;
Kutulakos, 2000; Kutulakos and Seitz, 2000; Szeliski, 1999). Consider, for
instance, a sports event. Even a well-trained referee who is closely observing
a scene of this type sometimes fails to capture pieces of information that
are essential for accurate judgment. Figure 1.1 illustrates a famous example
of such a failure, where two views of The Hand of God goal are presented
(FIFA, 2004b). That goal was scored by Maradona, who knocked the ball
into the net with the back of his left hand rather than with his head, during
the 1986 FIFA World Cup quarter-final match between England and Argentina
at the Azteca Stadium, Mexico City.

Figure 1.1: The famous Hand of God goal scored by Maradona in the 1986
FIFA World Cup match between England and Argentina (FIFA, 2004b).
(a) First viewpoint. (b) Second viewpoint.
In many of these situations, the missing visual clues are later revealed
clearly in video sequences captured by strategically positioned cameras, making
the single-observer error evident. In particular, one of the major obstacles
to be overcome in the analysis of events such as the one in Figure 1.1 is
the need for a prior spatio-temporal alignment between the sequences, that is,
the need to establish correspondences in time and in space between the
different image sequences.
The demand for effective automatic methods for temporally and spatially
aligning multiple videos, mainly pre-recorded videos (e.g., regular and
slow-motion clips of the same penalty kick), where video synchronization
hardware and calibration techniques cannot be applied, has directed us in
our quest to develop a novel solution that takes into account the constraints
related to the monitored scene, as well as the constraints related to the set
of cameras. Within this context, our main goal in this work is to advance
towards the development of new practical methods for accurately aligning,
both in time and space, not only two sequences, as most existing methods do,
but a general set of N video sequences captured from distinct viewpoints.
Therefore, the problem that we are addressing in this thesis can be stated
as follows:
Given a dynamic scene, viewed simultaneously by N perspective cam-
eras located at distinct viewpoints, which work with constant (albeit not
necessarily identical) temporal sampling rates and constant (but unknown)
intrinsic and extrinsic parameters, recover the N-dimensional function that
captures the temporal relations between all sequences as well as the spatial
transformations that align them in space.
Specifically, our work focuses on sets of video sequences that have overlap
between their fields of view, i.e., given N sequences S_1, ..., S_N, we can
identify corresponding tuples of pixels (x_1, y_1, t_1) ∈ S_1, ..., (x_N, y_N, t_N) ∈ S_N,
where each such tuple is formed by projections of a single point Q in the
scene space-time, in such a way that these overlapping parts contain some
non-rigid motion.
1.1 Motivation
Many applications today benefit from the availability of simultaneous
video recordings of the same physical event. Examples include tele-immersion
(Vedula et al., 2002), video-based surveillance (Zelnik-Manor and Irani,
2001), video mosaicing (Caspi and Irani, 2001), video metrology from television
broadcasts of athletic events (Reid and Zisserman, 1996) and recovery
of the affine structure of non-rigid motion (Tresadern and Reid, 2003).
A critical task in all of these applications is the spatio-temporal alignment
between the videos involved. The use of special synchronization hardware has
been the most common solution adopted to acquire temporally aligned sequences.
However, alignment based on the content of the image sequences themselves has
proved to be a more interesting alternative, since it is less expensive, easier
to use outside laboratories, and applicable to the many multi-view sequences
that already exist in video databases, such as those of sports events (FIFA,
2004a; Reid and Zisserman, 1996), artistic performances, or crimes captured
on surveillance tapes.
In Reid and Zisserman (1996), for instance, the authors combined infor-
mation from two independent sequences, illustrated in Figure 1.2, to resolve
the controversy regarding the supposed goal scored by the English player Ge-
off Hurst in the 1966 FIFA World Cup. In particular, they wanted to know
if the ball had actually crossed the goal line, and if not, how close it came to
crossing it. In that work, the authors manually synchronized the sequences
and then computed spatial alignment between selected corresponding images.
This is an interesting example where automatic spatio-temporal alignment
may provide better results.
In Caspi and Irani (2001), the authors illustrate an interesting applica-
tion by performing the alignment of multi-sensor sequences for sensor fusion.
Figure 1.2: Images from two available sequences from the incident regarding
the supposed goal scored by the English player Geoff Hurst in the 1966 FIFA
World Cup (Reid and Zisserman, 1996).
Figure 1.3 shows the example presented by the authors. Two sequences of
an outdoor scene were aligned, one captured by an Infra-Red camera and the
other by a regular (visible-light) video camera. The result of the spatio-temporal
alignment is illustrated by fusing temporally corresponding frames. The Infra-Red
camera provides only intensity information, which was therefore fused with the
intensity component of the visible-light camera.
A final example of the importance of effective video alignment techniques
Figure 1.3: Example of application of spatio-temporal alignment of multiple
sequences for multi-sensor fusion. (a) and (b) are temporally corresponding
frames from the visible-light and Infra-Red sequences, respectively. (c)
shows the result of fusing the two sequences after spatio-temporal alignment
(Caspi and Irani, 2001).
is given in Shechtman et al. (2002), where the authors propose a novel method
for constructing a video sequence of high space-time resolution by combining
information from multiple low-resolution video sequences of the same dynamic
scene. Some results are illustrated in Figure 1.4. In that work, the
spatio-temporal alignment method proposed in Caspi and Irani (2000) is
applied as a preliminary step.
Importantly, content-based alignment of sequences acquired with station-
ary cameras is possible only if (a) these sequences depict parts of the scene
space-time that have some overlap and (b) the regions visible in both se-
quences move in a way that is not completely rigid. If both conditions above
hold, then the existence of a rigid transformation between the overlapping
parts of any two frames (one from each sequence) indicates that these frames
were probably acquired simultaneously. Every solution to the alignment
problem that we address here (ours included) exploits this constraint.
Figure 1.4: Alignment and integration of information across multiple video
sequences to exceed the limited spatial resolution and limited temporal
resolution of video cameras. (a)-(c) Display the event captured by three
low-resolution sequences. (d) The reconstructed event as captured by the
generated high-resolution sequence. (e) A close-up image of the distorted
ball in one of the low-resolution frames. (f) A close-up image of the ball at
the exact corresponding frame in time in the high-resolution output sequence
(Shechtman et al., 2002).
1.2 Approach
We believe that any general solution to the spatio-temporal alignment
problem should handle the following cases:
Unknown frame rate: The relative frame rate of the video sequences
is unknown and unconstrained.
Arbitrary time shift: The time shift between the sequences is un-
known and can be arbitrarily large.
Unknown motion: The 3D motion of objects in the scene is unknown
and unconstrained.
Tracking failures: Individual scene points cannot be tracked reliably
over many frames.
Unknown epipolar geometry: The relative camera geometry of the
video sequences is unknown.
Scalability: Computational efficiency should degrade gracefully with
increasing number of sequences.
No static points: No visible point in the scene remains stationary for
two or more frames.
As a step toward this goal, we present a novel approach that operates
under all of the above conditions except the last one. In particular, we assume
that for every pair of video sequences we can identify enough static scene
points to obtain an initial estimate of the cameras' epipolar geometry. Moreover,
in order to ensure that the parameters of that initial estimate remain constant
during the application of our approach, we consider a scenario where the
video cameras are stationary, with fixed (but unknown) intrinsic and extrinsic
parameters.
Because of that last assumption, every corresponding set of N pixels is
related by the same spatio-temporal transformation, whose spatial components
are temporally invariant. Moreover, as long as each individual image
in each sequence is acquired instantaneously, the temporal component of this
transformation is spatially invariant, which allows us to decouple its recovery
from the recovery of the spatial alignment. Importantly, our methodology
can still be used in the case of moving cameras, as long as a preliminary
video stabilization process is applied.
We further assume that all cameras follow the projective pinhole model,
and that they acquire frames at constant (albeit not necessarily identical)
temporal sampling rates. The constant sampling rate assumption implies
that the temporal coordinates (frame numbers) in one reference sequence
and the temporal coordinates in all other sequences are related by a one-dimensional
affine transformation:

    f_i = α_i f_r + β_i,                                               (1.1)

where f_i and f_r are the frame numbers of the i-th sequence and the reference
sequence, respectively, and α_i, β_i are unknown constants describing the
temporal dilation and the temporal shift, respectively, between the sequences.
In general, these constants will not be integers.
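As a purely illustrative example (with made-up numbers, not data from our
experiments): suppose the reference sequence is captured at 25 fps and the
i-th sequence at 30 fps, with camera i activated 2 seconds before the reference
camera. Frame f_r of the reference, captured f_r/25 seconds after the reference
camera started, then corresponds to frame f_i = 30 (f_r/25 + 2), so α_i = 1.2
and β_i = 60; frame f_r = 1, for instance, maps to the fractional frame
f_i = 61.2 of sequence i.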
The pairwise temporal relations captured by Equation (1.1) induce a
global relationship between the frame numbers of the input sequences. In
fact, at the heart of our approach lies the concept of a timeline. Given N
sequences, the timeline is an N-dimensional line that completely describes all
temporal relations between the sequences. A key property of the timeline
is that even though knowledge of the timeline implies knowledge of the
sequences' temporal alignment, we can compute points on the timeline without
knowing this alignment. Using this property as a starting point, we reduce
the temporal alignment problem for N sequences to the problem of robustly
estimating a single N-dimensional line from a set of appropriately-generated
points in R^N.
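To make this reduction concrete, the following minimal sketch (in Python with
NumPy; an illustration under simplifying assumptions and with our own function
names, not the estimator detailed in Chapter 3) fits a line robustly to a cloud
of candidate points in R^N with a simple RANSAC-style loop:

    # Minimal sketch: robustly fit an N-dimensional line to candidate temporal
    # alignments ("votes").  Each vote is a point [f_1, ..., f_N] in R^N.
    import numpy as np

    def fit_timeline_ransac(votes, n_iters=1000, inlier_tol=1.0, rng=None):
        """votes: (M, N) array of candidate alignments. Returns (point, direction)."""
        rng = np.random.default_rng() if rng is None else rng
        best_inliers = None
        for _ in range(n_iters):
            i, j = rng.choice(len(votes), size=2, replace=False)
            p, q = votes[i], votes[j]
            d = q - p
            norm = np.linalg.norm(d)
            if norm < 1e-9:
                continue
            d = d / norm
            # Distance of every vote to the candidate line through p along d.
            diff = votes - p
            dist = np.linalg.norm(diff - np.outer(diff @ d, d), axis=1)
            inliers = dist <= inlier_tol
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # Refine with a total least-squares fit to the inlier votes.
        pts = votes[best_inliers]
        centroid = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - centroid)
        return centroid, vt[0]      # a point on the timeline and its direction

Given a point p on the fitted line and its (unit) direction d, rescaling d so
that its reference-sequence component equals one yields the α_i directly, and
β_i = p_i - α_i p_r, as in Equation (1.1).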
Finally, we assume in this work that the overlapping sequence parts con-
tain features such as uniformly-colored blobs or corners that move along
(piecewise) smooth trajectories, and these trajectories can be captured by
trackers that output trajectory segments detected in all the sequences as
parametric curves. We do not assume that correspondences between trajec-
tories are known a priori: recovering such correspondences is part of the job
done by the method that we will introduce in the next chapters. Therefore, in
practice the data that we need can be obtained with one of the many single-
view, real-time, multi-feature trackers available in the literature (Lowe, 2004;
Jepson et al., 2003; Isard and MacCormick, 2001; Shi and Tomasi, 1994).
1.3 Contributions
From a practical standpoint, this work has ushered in two major contributions
to the area of video analysis, especially to the field of applications
where multiple video sequences must be temporally and spatially aligned:
A generalized framework for solving the problem of spatio-temporal
alignment between N video sequences captured from distinct viewpoints.
The framework (1) can handle arbitrarily large misalignments
between the sequences, (2) does not require any a priori information
about their temporal relations, (3) does not assume that a single scene
point can be tracked reliably over the entire sequence, (4) does not
require the ability to establish feature correspondences between the
sequences, (5) can handle sequences with feature trajectories that nearly
overlap in 3D, that are approximately cyclical and that contain features
with quite large image velocities (up to 9 pixels per frame), (6)
can handle sequences with multiple frame rates and that contain spurious
clips, and (7) can handle sequences where the tracked features
move in completely distinct planes in the scene.
A new iterative algorithm to refine simultaneously the parameters
representing the temporal and spatial relations between the sequences,
since the exclusive refinement of the temporal parameters is suboptimal.
From a theoretical point of view, this work is important because it provides
additional theoretical and empirical evidence that, by combining temporal
and spatial cues in a single alignment framework, many events that
are inherently ambiguous for traditional image-to-image alignment techniques
can be uniquely resolved by sequence-to-sequence alignment methods.
1.4 Outline of this work
This document is organized into six chapters (including this one) and three
appendices. Chapter 2 discusses the state-of-the-art in Spatio-Temporal
Alignment of Multiple Video Sequences and introduces some fundamental
concepts in single and multiple view geometries that we use as a theoretical
basis for our work. In Chapter 3, we present the timeline constraint and
our temporal alignment methodology associated with that concept. Chapter
4 presents an iterative algorithm for refining simultaneously the parameters
representing the temporal and spatial relations between the sequences. In
Chapter 5, we present and discuss experimental results with real-world and
synthetic video sequences. Chapter 6 presents our conclusions and perspectives
for future work. Appendix A describes the RANSAC algorithm in detail.
Appendix B gives a brief introduction to tensorial notation. Finally, Appendix C
describes an important tool applied in the analysis of three-view geometries,
the trifocal tensor, and some techniques for deriving multi-linear constraints
on correspondences in the case of N views.
Chapter 2
Background
Everything has been said before, but since nobody
listens we have to keep going back and beginning all
over again.
Andre Gide
This chapter presents a survey of the current state-of-the-art in Spatio-Temporal
Alignment of Multiple Video Sequences and introduces some fundamental
concepts in single and multiple view geometries, especially two-view
geometries, that we use as a theoretical basis for our work. Related work that
supports specific topics of this thesis will be surveyed wherever necessary.
2.1 Spatio-Temporal Alignment
In spite of the fact that spatial alignment is one of the problems that have
been most heavily researched by the Computer Vision community, so far only
a handful of works have dealt with the problem of temporal alignment.
Based on the analysis of the most relevant articles found in the literature,
we classify the existing spatio-temporal alignment methods into two main
groups: feature-based methods and direct methods. The feature-based methods
(Caspi et al., 2002; Rao et al., 2003; Wolf and Zomet, 2002a,b;
Lee et al., 2000; Stein, 1998) extract all information needed to perform
spatio-temporal alignment from the trajectories of tracked features, such
as uniformly-colored blobs or corners. On the other hand, direct methods
(Caspi and Irani, 2000, 2001) extract that information from the intensities
and intensity gradients of all pixels that belong to overlapping regions.
Therefore, direct methods tend to align sequences more accurately if their
appearances are similar, while feature-based methods are widely prescribed
(Caspi et al., 2002; Rao et al., 2003; Torr and Zisserman, 1999) for sequences
with dissimilar appearance, such as those acquired with wide baselines, different
magnifications, or by cameras with distinct spectral sensitivities.
In this work, we propose a novel feature-based methodology for sequence-to-sequence
alignment. More specifically, a major novelty of our methodology
is that it reduces the computation of temporal and spatio-temporal
alignments between sequences to linear regression and linear optimization
problems, while existing feature-based techniques (Caspi et al., 2002; Rao
et al., 2003; Wolf and Zomet, 2002a,b; Lee et al., 2000; Stein, 1998) search
the entire space of possible temporal alignments. Unfortunately, the combinatorial
nature of this search requires several additional assumptions to make
it manageable, such as (1) known camera frame rates, (2) a number of
video sequences N restricted to two, (3) an integer temporal misalignment,
and (4) a temporal misalignment within a small user-specified range
(typically less than fifty frames).

Unlike previous feature-based techniques, our approach aligns N sequences
in a single step. It can handle arbitrarily large misalignments between
them and does not require any a priori information about their temporal
relations.
The quality of the alignments that we obtain and the computational cost
of our alignment process are invariant to the magnitude of the initial temporal
offsets between sequences. To achieve this breakthrough, we derive alignment
constraints by matching instantaneous positions of features in one sequence
against entire feature trajectories in the other sequences, while most feature-based
techniques rely on matches between pairs of instantaneous positions.
Our key observation is that, because feature trajectories are (piecewise-)
smooth curves parameterized by time, once the geometric relations of a set
of cameras have been estimated, each match between an instantaneous feature
in one reference sequence and a trajectory in one of the other sequences
constrains the feature's temporal coordinate to be aligned with one among a
finite set of instants in the other sequence: those instants where the feature's
epipolar line (see the concept of epipolar line in Section 2.3.1) intersects the
matching trajectory. Importantly, the instants where intersections occur do
not necessarily correspond to the second sequence's frames, which means that
our methodology yields temporal alignment at sub-frame accuracy, contrary
to techniques based on position-to-position matches.
In this thesis, we exploit the observation above to develop a sequence–to–
sequence alignment approach based on two techniques: (a) one that builds
large sets of those temporal constraints from a rough spatial alignment be-
tween sequences and then performs a robust linear regression in the temporal
domain to recover the globally correct temporal alignment, and (b) one that
linearizes feature trajectories around the points of intersection with epipolar
lines to reduce the problem of finding the complete spatio–temporal align-
ment between two sequences to a problem of solving a linear system.
Our work is most closely related to the approach of Caspi, Simakov and
Irani (Caspi et al., 2002). In their approach, the epipolar geometry and
temporal misalignment between two sequences are recovered from the im-
age trajectory of a single scene point that is visible in both sequences, and
are subsequently refined using more features. To achieve this, they assume
known frame rates and formulate a non-linear optimization problem to jointly
estimate epipolar geometry and temporal misalignment. Unfortunately, the
highly non-linear nature of this optimization necessitates good initial esti-
mates for the temporal misalignment and the epipolar geometry.
Importantly, that approach still assumes that a single scene point can be
tracked reliably over the entire sequence. This may be difficult to achieve
when aligning videos of complex scenes, where feature tracking can fail often
because of occlusions or large inter-frame motions. Our solution, on the other
hand, requires the ability to track scene points only across two consecutive
frames of the same sequence. Moreover, it does not require the ability to
establish feature correspondences between the sequences.
All other feature-based methods for spatio-temporal alignment that we
are aware of (Rao et al., 2003; Wolf and Zomet, 2002a,b; Lee et al., 2000;
Stein, 1998) use position-to-position constraints. It is known (Hartley and
Zisserman, 2003; Ma et al., 2003) that in the case of two-view geometries,
except in degenerate cases, a match between static points in images
acquired by two stationary cameras generates a linear constraint on the
parameters of the cameras' fundamental matrix. It is also known (Hartley and
Zisserman, 2003; Ma et al., 2003) that in static scenes (or in cases where
two sequences are correctly synchronized) the matrix formed by all such
constraints is singular. The methods cited above use this fact to check whether
any possible temporal transformation between two sequences is consistent with
the instantaneous feature positions in the sequences. They do this by assembling
large constraint matrices from multiple position-to-position matches
and then testing these matrices for rank deficiency.
Thus, all those methods need to search the space of possible temporal
transformations to find the optimum alignment. Rao et al. (Rao et al.,
2003) have recently proposed a way to perform this search in an incremental,
non-exhaustive way, but their solution assumes that the initial frames of the
two sequences are aligned a priori. Moreover, because of this search–based
nature, none of the techniques based on position–to–position constraints can
synchronize sequences at sub–frame accuracy and only one of them (Wolf
and Zomet, 2002b) can tolerate outliers in the matched position pairs, albeit
by assuming that both cameras are orthographic.
Finally, there is the method proposed by Caspi and Irani (Caspi and
Irani, 2000) that aligns sequences directly from pixel intensities and their
spatio-temporal gradients. Because it uses only linear terms of the intensities'
Taylor-series expansions to approximate spatial and temporal intensity variations
that are in general non-linear, it only works if the initial sequence-to-sequence
misalignments in space and time are small enough to fall within
the range of validity of the intensity linearizations. In addition, it models
spatial transformations between sequences as homographies, which, contrary
to fundamental matrices, are not appropriate to align sequences with significant
depth discontinuities. Homographies and fundamental matrices are
geometric relations between two views, which are introduced in Section 2.3.1.
Caspi and Irani also developed a direct method that can align sequences
without any overlap (Caspi and Irani, 2001), but this latter method does not
work with stationary cameras: it only works with sequences acquired by pairs
of cameras that remain rigidly attached to each other while moving relative
to a mostly rigid scene.
Figure 2.1: The pinhole camera model (Hartley and Zisserman, 2003).
2.2 Single View Geometry
This section concentrates on the introduction of some fundamental concepts
in the geometry of a single perspective camera. Basically, a camera
is a mapping between the 3D world (object space) and a 2D image (Trucco
and Verri, 1998). There are several camera models in the literature, which
are matrices with particular properties that represent the camera mapping.
Specifically, we consider in this work that all cameras follow the projective
pinhole model, which is illustrated in Figure 2.1. The basic pinhole model
consists of a plane π, the image plane, and a 3D point O, the center of
projection. The distance f between π and O is the focal length. The line
through O and perpendicular to π is the optical axis, and o, the intersection
between π and the optical axis, is named the image center. As illustrated in
Figure 2.1, q, the image of Q, is the point at which the straight line through
Q and O intersects the image plane π. Consider the 3D reference frame
in which O is the origin and the plane π is orthogonal to the Z axis, and
let Q = (Q_1, Q_2, Q_3)^T. By similar triangles, one quickly computes that the
point Q is mapped to the point q = (fQ_1/Q_3, fQ_2/Q_3, f)^T on the image plane.
Ignoring the final image coordinate, we see that

    (Q_1, Q_2, Q_3)^T  →  (fQ_1/Q_3, fQ_2/Q_3)^T                       (2.1)

describes the central projection mapping from world to image coordinates
(Hartley and Zisserman, 2003). This is a mapping from Euclidean 3D-space R^3
to Euclidean 2D-space R^2.
If world and image points are represented by homogeneous vectors, then
the central projection is expressed as a linear mapping between their homogeneous
coordinates (Hartley and Zisserman, 2003). In particular, Equation
(2.1) may be written in terms of matrix multiplication as

    (Q_1, Q_2, Q_3, 1)^T  →  (fQ_1, fQ_2, Q_3)^T  =  [ f  0  0  0 ] (Q_1, Q_2, Q_3, 1)^T.      (2.2)
                                                     [ 0  f  0  0 ]
                                                     [ 0  0  1  0 ]
Computer vision algorithms for reconstructing the 3D structure of a scene
or computing the position of objects in space need equations such as Equation (2.2)
to link the coordinates of points in 3D space with the coordinates of their
corresponding image points (Trucco and Verri, 1998). In these applications
it is often assumed that the coordinates of the image points in the camera
reference frame can be obtained from pixel coordinates and that the camera
reference frame can be located with respect to some other frame, known as the
world reference frame (Trucco and Verri, 1998). This is equivalent to assuming
knowledge of some of the camera's characteristics, known in vision as the camera's
extrinsic and intrinsic parameters. In the following we briefly introduce
the definitions of the extrinsic and intrinsic parameters in practical terms. The
problem of estimating their values is called camera calibration (Hartley and
Zisserman, 2003).
2.2.1 Extrinsic parameters
The extrinsic parameters are defined as any set of geometric parameters
that identify uniquely the transformation between the unknown camera ref-
erence frame and a known reference frame, named the world reference frame
(Trucco and Verri, 1998).
A typical choice for describing the transformation between camera and
world frame is to use (Trucco and Verri, 1998; Hartley and Zisserman, 2003):
a 3 × 1 translation vector t, describing the relative positions of the
origins of the two reference frames, and
a 3 × 3 rotation matrix R, an orthogonal matrix that brings the corre-
sponding axes of the two frames onto each other.
The relation between the coordinates of a point Q in the world and camera frames,
Q_w and Q_c respectively, is

    Q_c = R Q_w + t.                                                   (2.3)
Now, if we add a “1” as a fourth coordinate of Q_w (that is, express Q_w in
homogeneous coordinates), we may group the extrinsic parameters in a single
3 × 4 matrix M_e, called the matrix of extrinsic parameters,

    M_e = [ R | t ],                                                   (2.4)

and obtain the following linear matrix equation for linking the coordinates
of points in the world reference frame with their corresponding coordinates
in the camera reference frame:

    Q_c = M_e (Q_w^T, 1)^T.                                            (2.5)
2.2.2 Intrinsic parameters
The intrinsic parameters can be defined as the set of parameters needed to
characterize the optical, geometric, and digital characteristics of the viewing
camera (Trucco and Verri, 1998). They are the focal length f, the location
of the image center in pixel coordinates (o_x, o_y), the effective pixel sizes
in the horizontal and vertical directions (s_x, s_y), and, if required, a radial
distortion coefficient k.

Neglecting radial distortion, we can group all the intrinsic parameters in
a single 3 × 3 matrix M_i, called the matrix of intrinsic parameters,

    M_i = [ f/s_x    0      o_x ]
          [   0    f/s_y    o_y ]                                      (2.6)
          [   0      0       1  ].
Again, if we express Q_w in homogeneous coordinates, we obtain the following
linear matrix equation describing perspective projection:

    q = M_i M_e (Q_w^T, 1)^T.                                          (2.7)

What is interesting about the vector q = [q_1, q_2, q_3]^T is that the ratios
(q_1/q_3) and (q_2/q_3) are nothing but the pixel coordinates of the image point.
Therefore, M_i performs the transformation between the camera reference
frame and the image reference frame.
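As an illustration of Equations (2.3) to (2.7), the following short sketch
(in Python with NumPy; all calibration values are made-up, arbitrary numbers)
builds M_e and M_i and projects a world point to pixel coordinates:

    # Illustrative sketch of Equations (2.3)-(2.7): project a 3D world point to
    # pixel coordinates with made-up intrinsic and extrinsic parameters.
    import numpy as np

    # Extrinsic parameters: rotation R (30 degrees about the Y axis) and
    # translation t, so that Q_c = R Q_w + t as in Equation (2.3).
    theta = np.deg2rad(30.0)
    R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    t = np.array([0.1, -0.2, 2.0])
    M_e = np.hstack([R, t.reshape(3, 1)])            # 3 x 4 matrix of Eq. (2.4)

    # Intrinsic parameters: focal length f, pixel sizes (s_x, s_y), image center.
    f, s_x, s_y, o_x, o_y = 0.008, 1e-5, 1e-5, 320.0, 240.0
    M_i = np.array([[f / s_x, 0.0,     o_x],
                    [0.0,     f / s_y, o_y],
                    [0.0,     0.0,     1.0]])        # matrix of Eq. (2.6)

    Q_w = np.array([0.5, 0.3, 4.0, 1.0])             # world point, homogeneous
    q = M_i @ M_e @ Q_w                              # Eq. (2.7)
    pixel = q[:2] / q[2]                             # (q_1/q_3, q_2/q_3)
    print(pixel)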
2.3 Multiple View Geometry
A basic problem in computer vision is to understand the structure of a real-world
scene given several images of it (Hartley and Zisserman, 2003). Over
the past decade there has been a rapid development in the understanding
and modelling of the geometry of multiple views, especially due to new
achievements in our theoretical understanding and improvements in the estimation
of mathematical objects from images (Hartley and Zisserman, 2003;
Ma et al., 2003; Forsyth and Ponce, 2002).

While several problems of scene reconstruction have already reached reasonable
solutions, such as the problem of estimating a multifocal tensor from
image point correspondences, particularly the fundamental matrix and the
trifocal tensor (Hartley and Zisserman, 2003), other relevant problems still
call for more careful study. Examples include: (1) the application of bundle
adjustment to solve more general reconstruction problems, and (2) the automatic
detection of correspondences in image sequences, with elimination of outliers
and false matches, using the multifocal tensor relationships.
Next, we will briefly describe some important concepts and definitions in
two-view geometry, which constitute the basis of our spatio-temporal align-
ment methodology, such as the epipolar geometry of two cameras. In Ap-
pendix C, we present more general frameworks that are natural extensions
for three-, four- and N-views. Particularly, we introduce in this appendix the
trifocal tensor, which plays an analogous role in three views to that played by
the fundamental matrix in two. Finally, we also describe some techniques for
deriving multi-linear constraints on correspondences in the case of N-views.
2.3.1 Two-View Geometry
The two perspective views may be acquired simultaneously, as in a stereo rig,
or sequentially, for example by a camera moving relative to the scene. We say
that these two situations are geometrically equivalent, and they will not be
differentiated here. Basically, there are two relations between two views of a
scene (Hartley and Zisserman, 2003; Ma et al., 2003):

1. The homography: a point in one view determines a point in the other,
which is the image of the intersection of its ray with a plane, as illustrated
in Figure 2.2.

2. The epipolar geometry: a point in one view determines a line in the
other, which is the image of the ray through that point, as illustrated
in Figure 2.3.

In the following, we briefly introduce both of these relations.
The homography map
Images of points on a plane are related to corresponding image points
in a second view by a (planar) homography as shown in Figure 2.2. This
is a projective relation since it depends only on the intersections of planes
with lines (Forsyth and Ponce, 2002). We say that the world plane induces a
homography between the views and that the homography map is responsible
for transferring points from one view to the other.
In Hartley and Zisserman (2003), it is shown that for world planes in
general position the homography is determined uniquely by the plane and
vice versa. In this case, general position means that the world plane does
not contain either of the camera centers. If the world plane does contain one
of the camera centers, then the induced homography is degenerate (Hartley
and Zisserman, 2003; Ma et al., 2003; Forsyth and Ponce, 2002).

Figure 2.2: The homography map induced by a world plane (Hartley and
Zisserman, 2003). The ray corresponding to a point q is extended to meet
the world plane π in a point Q. This point is projected to a point q' in the
other image. The map from q to q' is the homography induced by the plane
π. If H_1 and H_2 are the perspectivities from the world plane π to the first
and second image planes, respectively, we have q = H_1 Q and q' = H_2 Q. It
is the composition of these two perspectivities that defines a homography H,
q' = H_2 H_1^{-1} q = H q, between the image planes.
In the following we derive an explicit expression for the induced homography
between the two views. Suppose a world plane π as the one illustrated in
Figure 2.2, which is specified by its coordinates in the world frame. Consider
the following projection matrices for the two views:

    M = [ I | 0 ],    M' = [ A | a ],

where I is a 3 × 3 identity matrix, 0 = [0, 0, 0]^T is a null 3-vector and A
and a are the parameters of the projection matrix M'. Moreover, let π be a
world plane with π = [v^T, 1]^T. Then the homography induced by the plane
is q' = H q with

    H = A - a v^T.                                                     (2.8)

We may assume the fourth coordinate of π equal to 1 since the plane does
not pass through the center of the first camera at [0, 0, 0, 1]^T (Hartley and
Zisserman, 2003).

Observe that there is a three-parameter family of planes in the 3D world,
and correspondingly a three-parameter family of homographies between two
views induced by planes in the 3D world. These three parameters are specified
by the elements of the vector v, which is not a homogeneous 3-vector
(Hartley and Zisserman, 2003).
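As a small numerical check of Equation (2.8) (a sketch with made-up values for
A, a and v; it is not part of our alignment method), one can verify that H
transfers the first-view image of any point on the plane to its second-view image:

    # Illustrative check of Equation (2.8): H = A - a v^T transfers image points
    # of world points lying on the plane pi = [v^T, 1]^T.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))          # made-up parameters of M' = [A | a]
    a = rng.standard_normal(3)
    v = rng.standard_normal(3)               # plane pi = [v^T, 1]^T

    H = A - np.outer(a, v)                   # Equation (2.8)

    # Pick a world point Q on the plane, i.e. satisfying v . Q_xyz + 1 = 0.
    Q_xyz = rng.standard_normal(3)
    Q_xyz -= v * (v @ Q_xyz + 1) / (v @ v)   # project onto the plane
    Q = np.append(Q_xyz, 1.0)

    q = np.hstack([np.eye(3), np.zeros((3, 1))]) @ Q    # first view,  M = [I | 0]
    q_prime = np.hstack([A, a.reshape(3, 1)]) @ Q       # second view, M' = [A | a]

    # H q should equal q_prime up to scale.
    print(np.cross(H @ q, q_prime))          # approximately [0, 0, 0]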
Epipolar Geometry
The epipolar geometry between two views is essentially the geometry of
the intersection of the image planes with the pencil of planes having the
baseline as axis, where the baseline is the line joining the camera centers
(Hartley and Zisserman, 2003; Ma et al., 2003; Trucco and Verri, 1998). The
epipolar geometry is independent of scene structure, and only depends on
the cameras’ internal parameters and relative pose (Trucco and Verri, 1998).
The geometric entities involved in epipolar geometry are illustrated in
Figure 2.3. This figure shows two pinhole cameras, their projection centers,
O and O', and image planes, π and π'. As usual, each camera identifies a
3D reference frame, the origin of which coincides with the projection center,
and the Z-axis with the optical axis. The vectors Q and Q' refer to the
scene point, while the vectors q and q' refer to the projections of the scene
point onto the image planes π and π', respectively, and are expressed in
the corresponding reference frame. We call epipole the point of intersection
of the baseline with the image plane. In Figure 2.3, we denote the epipoles
by e and e'. The plane identified by the scene point and the camera centers
O and O' is called the epipolar plane, while its intersections λ and λ' with
the image planes π and π', respectively, are called epipolar lines. With
the exception of the epipoles, only one epipolar line goes through any image
point.

Figure 2.3: The epipolar geometry.
The reference frames of both cameras in Figure 2.3 are related via the
extrinsic parameters. Therefore, given a point in space, the relation between
its representations Q and Q' in the camera reference frames is

    Q' = R (Q - t),                                                    (2.9)

where t = (O' - O) is the translation vector and R is the rotation matrix.
The relation between the scene point and its projections is described by
the usual equations of perspective projection, in vector form:

    q  = (f / Z) Q,                                                    (2.10)
    q' = (f' / Z') Q',                                                 (2.11)

where f, f' are the cameras' focal lengths and Z, Z' are the coordinates
along the Z-axis of Q and Q', respectively.
Finally, another important concept is the so-called epipolar constraint.
Consider the triplet formed by the scene point and its projections q and
q'. Given q, the scene point can lie anywhere on the ray from O through
q. However, since the image of this ray in π' is the epipolar line through
the corresponding point, q', the correct match must lie on that epipolar line.
This fact is known as the epipolar constraint and establishes a mapping between
points in π and lines in π', and vice versa.
The Fundamental Matrix F
The fundamental matrix is an algebraic representation of the epipolar
geometry (Hartley and Zisserman, 2003; Ma et al., 2003). It is a basic tool
in the development of our technique for spatio-temporal alignment between
video sequences. In the following, in order to derive the fundamental matrix,
we first derive another important matrix known as the essential matrix
(Longuet-Higgins, 1981).

The equation of the epipolar plane through the scene point can be written
as the coplanarity condition of the vectors Q, t, and Q - t, or

    (Q - t)^T (t × Q) = 0.                                             (2.12)
By using Equation (2.9), we obtain

    (R^T Q')^T (t × Q) = 0.                                            (2.13)
Given that a vector product can be rewritten as a multiplication by a rank-deficient
matrix, we can write

    t × Q = J Q,                                                       (2.14)

where

    J = [   0   -t_z    t_y ]
        [  t_z    0    -t_x ]                                          (2.15)
        [ -t_y   t_x     0  ].
By using this fact, Equation (2.13) becomes

    Q'^T E Q = 0,                                                      (2.16)

with

    E = R J.                                                           (2.17)

The matrix E is known as the essential matrix (Longuet-Higgins, 1981) and
establishes a link between the epipolar constraint and the extrinsic parameters
of the stereo system. Observe that, by using Equations (2.10) and (2.11),
and dividing by the product Z Z', Equation (2.16) can be rewritten as

    q'^T E q = 0.                                                      (2.18)
Consider now the matrices M_i and M'_i of the intrinsic parameters of
the cameras in Figure 2.3. Let q̄ and q̄' be the points in pixel coordinates
corresponding to q and q' in camera coordinates, that is,

    q̄  = M_i q,                                                        (2.19)
    q̄' = M'_i q'.                                                      (2.20)
By substituting Equations (2.19) and (2.20) into Equation (2.18), we have

    q̄'^T F q̄ = 0,                                                      (2.21)

where

    F = M'_i^{-T} E M_i^{-1}.                                          (2.22)

F is the so-called fundamental matrix. Observe that F q̄ in Equation (2.21)
can be thought of as the equation of the projective epipolar line, λ', that
corresponds to the point q̄, or

    λ' = F q̄.                                                          (2.23)

Observe that if it is possible to estimate the fundamental matrix from a
number of point matches in pixel coordinates, we can reconstruct the epipolar
geometry with no information at all on the intrinsic or extrinsic parameters.
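To make the chain from Equation (2.15) to Equation (2.23) concrete, the
following small numerical sketch (Python with NumPy; all calibration values are
made up) builds E and F from given extrinsic and intrinsic parameters, following
the convention Q' = R(Q - t) of Equation (2.9), and verifies the epipolar
constraint for one scene point:

    # Illustrative sketch of Equations (2.15)-(2.23); all values are made up.
    import numpy as np

    def skew(t):
        """The rank-deficient matrix J of Equation (2.15), so that t x Q = J Q."""
        return np.array([[0.0, -t[2], t[1]],
                         [t[2], 0.0, -t[0]],
                         [-t[1], t[0], 0.0]])

    # Made-up extrinsics (rotation about Y, small baseline) and intrinsics.
    th = np.deg2rad(10.0)
    R = np.array([[np.cos(th), 0, np.sin(th)], [0, 1, 0], [-np.sin(th), 0, np.cos(th)]])
    t = np.array([0.5, 0.0, 0.1])
    Mi = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])   # first camera
    Mi2 = Mi.copy()                                                # second camera

    E = R @ skew(t)                                  # Equation (2.17)
    F = np.linalg.inv(Mi2).T @ E @ np.linalg.inv(Mi) # Equation (2.22)

    # A scene point in the first camera's frame and its two pixel projections.
    Q = np.array([0.3, -0.2, 4.0])
    Q2 = R @ (Q - t)                                 # Equation (2.9)
    q = Mi @ Q;  q = q / q[2]
    q2 = Mi2 @ Q2;  q2 = q2 / q2[2]

    print(q2 @ F @ q)                                # approximately 0, Eq. (2.21)
    print(F @ q)                                     # epipolar line of q, Eq. (2.23)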
We present below some of the most important properties of the fundamental
matrix (Hartley and Zisserman, 2003):

If F is the fundamental matrix of the pair of cameras (π, π'), then F^T
is the fundamental matrix of the pair in the opposite order: (π', π).

For any point q in the first image, the corresponding epipolar line is
λ' = F q. Similarly, λ = F^T q' represents the epipolar line corresponding
to q' in the second image.

For any point q (other than e), the epipolar line λ' = F q contains
the epipole e'. Therefore, e' satisfies e'^T F q = (e'^T F) q = 0 for
all q. It follows that e'^T F = 0 and, similarly, F e = 0.

F is a rank 2 homogeneous matrix with seven degrees of freedom.

F is a correlation, a projective map taking a point to a line. However,
F is not a proper correlation, that is, F is not invertible.
Several methods for computing the fundamental matrix have been presented
in the literature (Zhang, 1998; Hartley and Zisserman, 2003) and many
researchers still work on the development of new techniques. Of the various
current methods, the eight-point algorithm (Hartley, 1997a) is by far the
simplest. The idea behind this technique is based on the solution of a homogeneous
linear system. Given a set of n point correspondences between two
images, the fundamental matrix F satisfies the condition q'_i^T F q_i = 0, for
i = 1, ..., n. With the q_i and q'_i known, this equation is linear in the entries
of the matrix F. Thus, given at least 8 point correspondences it is possible
to solve linearly for the entries of F up to scale. With more than 8 equations
a least-squares solution is found. More detailed studies and characterizations
of the best methods for computing the fundamental matrix can be found in
Hartley and Zisserman (2003) and Zhang (1998).
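As an illustration, the following is a minimal sketch (in Python with NumPy;
function and variable names are ours) of this linear estimation step; practical
implementations also normalize the image coordinates before solving, as discussed
by Hartley (1997a):

    # Minimal sketch of the linear (eight-point) estimation of F from n >= 8
    # correspondences q_i <-> q'_i given in homogeneous pixel coordinates.
    import numpy as np

    def eight_point(q, q_prime):
        """q, q_prime: (n, 3) arrays of homogeneous points with q'_i^T F q_i = 0."""
        n = q.shape[0]
        A = np.zeros((n, 9))
        for i in range(n):
            x, y, w = q[i]
            xp, yp, wp = q_prime[i]
            # Row encoding q'_i^T F q_i = 0, linear in the 9 entries of F.
            A[i] = [xp * x, xp * y, xp * w,
                    yp * x, yp * y, yp * w,
                    wp * x, wp * y, wp * w]
        # Least-squares solution: right singular vector for the smallest value.
        _, _, vt = np.linalg.svd(A)
        F = vt[-1].reshape(3, 3)
        # Enforce the rank-2 constraint by zeroing the smallest singular value.
        u, s, vt = np.linalg.svd(F)
        s[2] = 0.0
        return u @ np.diag(s) @ vt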
Chapter 3
Temporal Alignment
Time poses other difficulties. One, perhaps the greatest, that of
synchronizing each person's individual time with the general time of
mathematics, has been amply proclaimed by the recent relativist alarm,
and everyone recalls it, or remembers having recalled it until quite
recently.
Jorge Luis Borges
This chapter presents our framework for temporally aligning multiple se-
quences acquired from distinct viewpoints. We begin this chapter by describing
a key concept in our method: the timeline, an N-dimensional line responsible
for capturing all the temporal relations between the video sequences.
3.1 The Timeline Constraint
Suppose that a dynamic scene is viewed simultaneously by N perspective
cameras located at distinct viewpoints. We assume that each camera captures
frames with a constant, unknown frame rate. We also assume that the cameras
are unsynchronized, i.e., they began capturing frames at different times,
with possibly-distinct frame rates. In order to temporally align the resulting
sequences, we must determine the correspondence between frame numbers
in one “reference” sequence and frame numbers in all other sequences. This
correspondence can be expressed as a set of linear equations,

    f_i = α_i f_r + β_i,                                               (3.1)

where f_i and f_r denote the frame numbers of the i-th sequence and the
reference sequence, respectively, and α_i, β_i are unknown constants describing
the temporal dilation and the temporal shift, respectively, between the sequences.
In general, these constants will not be integers.
The pairwise temporal relations captured by Equation (3.1) induce a
global relationship between the frame numbers of the sequences. We represent
this relationship by an N-dimensional line L that we call the timeline:

    L = { [α_1 . . . α_N]^T t + [β_1 . . . β_N]^T  |  t ∈ R }.          (3.2)
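As a purely illustrative example (with made-up numbers): for N = 3 sequences,
with sequence 1 taken as the reference (so α_1 = 1 and β_1 = 0), α_2 = 1.2,
β_2 = 10, α_3 = 0.8 and β_3 = -5, the timeline is the line
{ (t, 1.2 t + 10, 0.8 t - 5) | t ∈ R }; frame 100 of the reference sequence
then corresponds to frame 130 of sequence 2 and frame 75 of sequence 3.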
A key property of the timeline is that even though knowledge of L implies
knowledge of the temporal alignment of the sequences, we can compute points
on the timeline without knowing the sequences’ alignment. This observation
leads to a simple algorithm for reconstructing the timeline from dynamic
features in the scene that are visible in two or more of the sequences.
Specifically, let q_1(f_1) be the instantaneous projection of a moving scene
point in camera 1 at frame f_1, expressed in homogeneous 2D coordinates, as
illustrated in Figure 3.1. Furthermore, let q_i(·) be the trajectory traced by
the point's projection in camera i and suppose that the fundamental matrix,
F_{1i}, between cameras 1 and i is known for all i, where i = 2 ... N. Observe
that in this case we are considering camera 1 as our reference camera. If
the scene point is visible to all cameras when frame f_1 is captured by camera
1, we have the following constraint:
Figure 3.1: Geometry of the Timeline Constraint. In this two-camera
example, the point's trajectory in camera i intersects the epipolar line,
q_1(f_1)^T F_{1i}, twice. Given the intersection points q_i(f_i) and q_i(f'_i),
we have the set V_{q_1(f_1)} = { [f_1 f_i]^T, [f_1 f'_i]^T }.
Timeline Constraint: The set

    V_{q_1(f_1)} = { [f_1 . . . f_N]^T  |  q_1(f_1)^T F_{1i} q_i(f_i) = 0,  i = 2 . . . N }

contains at least one point on the timeline L.
Intuitively, the Timeline Constraint can be thought of as a procedure for
generating a set V_{q_1(f_1)} of “candidate” temporal alignments that is guaranteed
to contain at least one point on the timeline. The constraint tells us that we
can create such a set by (1) intersecting the epipolar line of q_1(f_1) in camera i
with the trajectory q_i(·), (2) recording the frame number(s) corresponding to
each intersection point for camera i, and (3) generating temporal alignment
vectors from the recorded frame numbers. To see why the Timeline Constraint
holds, observe that if [f_1 . . . f_N]^T is on the timeline it must represent
the “true” temporal alignment between the frame f_1 of pixel q_1(f_1) and the
remaining cameras.
Hence, pixels q_1(f_1) and q_i(f_i) must satisfy the epipolar
constraint equation, q_1(f_1)^T F_{1i} q_i(f_i) = 0. Since, by definition, the set
V_{q_1(f_1)} contains all temporal alignments that satisfy the epipolar constraint
equation across the N cameras, it must also contain the true alignment, which
is a point on the timeline L. In this respect, the Timeline Constraint can
be thought of as the converse of the epipolar constraint for the case of N
unaligned sequences.
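As an illustration of how a single candidate coordinate f_i can be generated in
practice (a minimal sketch with our own function and variable names; it assumes
the trajectory q_i(·) is given as straight segments between consecutive frames,
as produced by the two-frame tracker described in Section 3.2):

    # Minimal sketch: intersect the epipolar line of q_1(f_1) with one straight
    # trajectory segment of camera i and recover a (possibly fractional) frame
    # number f_i, as required by the Timeline Constraint.
    import numpy as np

    def segment_vote(q1, F_1i, p_a, p_b, f_a):
        """q1: homogeneous pixel in the reference camera; F_1i: fundamental matrix;
        p_a, p_b: 2D feature positions in camera i at frames f_a and f_a + 1.
        Returns the fractional frame f_i where the segment crosses q1's epipolar
        line, or None if it does not cross it."""
        line = F_1i.T @ q1                 # epipolar line of q1 in camera i
        a = np.append(p_a, 1.0)            # homogeneous endpoints of the segment
        b = np.append(p_b, 1.0)
        da, db = line @ a, line @ b        # signed distances (up to scale)
        if da * db > 0 or da == db:        # same side of the line, or degenerate
            return None
        s = da / (da - db)                 # interpolation parameter in [0, 1]
        return f_a + s                     # fractional frame number f_i

Concatenating such f_i values for i = 2, ..., N, subject to the consistency
test described in Section 3.2, yields the candidate vectors of V_{q_1(f_1)}.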
In order to apply the Timeline Constraint, we must know the fundamental
matrices, F_{ij}, describing the cameras' epipolar geometry between each
pair (i, j) of cameras. In practice, we obtain an initial estimate of F_{ij} by
finding “background features,” i.e., points in the scene that remain stationary
and are jointly visible by two or more cameras. Once the timeline L
is reconstructed, that is, once the estimation of the temporal alignment is
performed, we jointly optimize L and the parameters of the fundamental
matrices that describe the scene geometry, by using a linear, iterative refinement
procedure. We describe our temporal alignment algorithm in the
next section, and consider in Chapter 4 the joint optimization of L and the
pre-computed fundamental matrices F_{ij}, which gives us the spatio-temporal
alignment between the sequences.
3.2 The Temporal Alignment Algorithm
The Timeline Constraint leads directly to a voting-based algorithm for
reconstructing the timeline of N sequences, which provides the temporal
alignment. The algorithm operates in two phases. In the first phase, we
choose one of the image sequences to be the reference sequence and use
the instantaneous positions q_r(f_r) from each feature trajectory q_r(·) of that
sequence, together with the entire trajectories q_i(·) of the other sequences,
to estimate V_{q_r(f_r)} for each q_r(f_r). In the second phase, we fit an
N-dimensional line L to the union of the estimated sets V_{q_r(f_r)}.

Figure 3.2: (a) Trajectory of the car's pixel centroid in Sequence 1. (b)
Trajectory of the car's pixel centroid in Sequence 2. The car was tracked by
a simple blob tracker that relies on foreground-background detection to label
all foreground pixels in each frame.

Therefore, to fully specify this algorithm we must ask three questions:
this algorithm we must ask three questions:
1. How do we compute the feature trajectories q
i
(·), for i = 1, ..., N?
2. How do we estimate the set V
q
r
(f
r
)
for each q
r
(f
r
)?
3. How do we compute the timeline L?
To compute the feature trajectories q_i(·), we use a two-frame feature
tracker that is treated by our algorithm as a “black box” responsible for
returning a list of line segments of corresponding features for every pair of
consecutive frames. Each line segment connects the locations of a feature
that was detected in some frame of the i-th sequence and was successfully
tracked to the next frame, as illustrated on a real-world sequence in Figure
3.2. Importantly, our algorithm does not depend on a specific tracker. Thus,
the choice of a particular tracking methodology depends exclusively on the
scene's complexity and on the properties of the features of interest.
Next, to compute the set V_{q_r(f_r)} for a given q_r(f_r), our algorithm uses the initial estimates of the fundamental matrices F_ij between each pair (i, j) of cameras, as well as the line segments provided by the feature tracker. When a specific line segment intersects the epipolar line of q_r(f_r), it defines a possibly-fractional frame number, f_i, corresponding to the instant that q_r(f_r)'s epipolar line intersects the image trajectory of a point in the scene. Hence, f_i is the i-th coordinate of a potential element of V_{q_r(f_r)}. To generate V_{q_r(f_r)}, we collect all the f_i coordinates computed for all sequences and concatenate them so that they form valid N-dimensional vectors, which represent candidate temporal alignments in a voting space. These steps are illustrated in Figure 3.4a-d for the two-camera example of Figure 3.2.
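As an illustration of this intersection step, the sketch below shows one way it could be coded for a single non-reference camera. It is only a sketch: the function names, the polyline representation and the homogeneous-coordinate convention are assumptions made for this example, not a description of our implementation.

    import numpy as np

    def epipolar_line(q_r, F_ri):
        """Epipolar line of the reference feature q_r (homogeneous 3-vector) in
        camera i, following the convention q_r F_ri q_i = 0 used in the text."""
        return q_r @ F_ri

    def candidate_frame_numbers(line, trajectory):
        """trajectory: list of (frame_number, np.array([x, y, 1])) samples of q_i(.).
        Returns the possibly-fractional frame numbers f_i at which consecutive
        trajectory samples straddle the epipolar line."""
        candidates = []
        for (f_a, p_a), (f_b, p_b) in zip(trajectory[:-1], trajectory[1:]):
            d_a, d_b = line @ p_a, line @ p_b      # signed distances, up to a common scale
            if d_a * d_b < 0:                      # the segment crosses the epipolar line
                t = d_a / (d_a - d_b)              # interpolation parameter in [0, 1]
                candidates.append(f_a + t * (f_b - f_a))
        return candidates

Each returned value is one coordinate f_i of a potential element of V_{q_r(f_r)}; collecting such values across all non-reference cameras and concatenating the consistent ones produces the candidate vectors described above.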
Note that, according to the algorithm described above, our approach may result in a large number of intersections of the epipolar line of q_r(f_r) with the line segments of the trajectories in each one of the N − 1 cameras, since we may have several possible ways of "concatenating" the computed f_i coordinates into an N-dimensional vector. However, to avoid including an exponential number of vectors in V_{q_r(f_r)}, we only include vectors that are consistent with the cameras' epipolar geometry. In particular, let [f_1 ... f_N]^T be a candidate vector for a set of N cameras, where f_1 represents the temporal coordinate of a feature position in the reference camera, that is, q_r(f_r) = q_1(f_1). Given fundamental matrices with an average error of e pixels, where the error of a fundamental matrix is measured as the average of the distances between background feature projections in the image plane of the reference camera and their corresponding epipolar lines, we assume that the afore-mentioned candidate vector is consistent with the cameras' epipolar geometry only if the intersection points that defined each pair of its consecutive temporal coordinates (f_i, f_{i+1}), for 2 ≤ i ≤ N − 1, satisfy d_i ≤ e and d_{i+1} ≤ e, where d_i and d_{i+1} are illustrated in Figure 3.3 and are distance values that measure how close the intersection points that defined f_i and f_{i+1} are to each others' epipolar lines.

Figure 3.3: Two intersection points q_i(f_i) and q_{i+1}(f_{i+1}) of cameras i and i+1, respectively, are considered by our approach as potential representations of the same scene point only if d_i ≤ e and d_{i+1} ≤ e, where d_i and d_{i+1} are distance values that measure how close q_i(f_i) and q_{i+1}(f_{i+1}) are to each others' epipolar lines and e is the fundamental matrices' average error.
Therefore, given a set of N cameras, our approach evaluates the constraints d_i ≤ e and d_{i+1} ≤ e N − 2 times. If all the N − 2 evaluations performed obtain a positive answer, the candidate vector is considered as a potential inlier and is added to the voting space; otherwise it is rejected. Note that our procedure does not test those constraints for each possible pair of temporal coordinates, but rather it considers consecutive temporal coordinates in the candidate vector. That is, we make use of the transitive property that allows us to go from step to step in our reasoning process for inferring that a candidate point is consistent with the cameras' epipolar geometry.
For instance, consider that the intersection points that defined f_i and f_{i+1} are, according to our approach (see Figure 3.3), potential representations of a specific scene point Q. If the intersection points that defined f_{i−1} and f_i potentially represent a scene point P, we may infer that P and Q are probably the same point, since the intersection point of f_i is considered in both cases. Moreover, we may also conclude that the intersection points of f_{i−1} and f_{i+1} are probable representations of the same scene point. By using this reasoning process along consecutive temporal coordinates of the candidate point, we ensure that it is a potential representation of the temporal misalignment between the cameras. Note that our concatenation procedure is conservative, i.e., it guarantees that the set of vectors generated will be a superset of V_{q_r(f_r)}.
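For concreteness, a minimal sketch of one such pairwise test is given below. It assumes homogeneous image points, the convention q_i F_{i,i+1} q_{i+1} = 0, and hypothetical helper names; it is not intended as a definitive implementation.

    import numpy as np

    def point_to_line_distance(p, line):
        """Euclidean distance from a homogeneous point p to a homogeneous line."""
        return abs(line @ p) / np.hypot(line[0], line[1])

    def pair_is_consistent(q_i, q_ip1, F, e):
        """Check d_i <= e and d_{i+1} <= e for intersection points in cameras i and
        i+1, where F is the fundamental matrix F_{i,i+1}."""
        d_ip1 = point_to_line_distance(q_ip1, q_i @ F)  # distance of q_{i+1} to the epipolar line of q_i
        d_i = point_to_line_distance(q_i, F @ q_ip1)    # distance of q_i to the epipolar line of q_{i+1}
        return d_i <= e and d_ip1 <= e

A candidate vector is accepted only if this test succeeds for all N − 2 consecutive pairs (f_2, f_3), ..., (f_{N−1}, f_N).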
The set of candidate temporal alignments is the union of the sets V_{q_r(f_r)} for all q_r(f_r). In general, this union will contain a large number of outliers, as illustrated in Figure 3.4d. To reconstruct the timeline in the presence of outliers, we use the RANSAC algorithm (Fischler and Bolles, 1981), which is described in detail in Appendix A.
RANSAC can be regarded as an algorithm for robust fitting of models in the presence of many data outliers (Fischler and Bolles, 1981). Since it gives us the opportunity to evaluate any estimate of a set of parameters no matter how accurate the method that generated this solution might be, the RANSAC method represents an interesting approach to the solution of many computer vision problems (Cantzler, 2004). The algorithm randomly chooses a pair of candidate temporal alignments to define the timeline L, and then computes the total number of candidates that fall within an ε-distance of this line. These two steps are repeated for a number of iterations. Provided sufficient repetitions are performed, RANSAC is expected to identify solutions computed from outlier-free data. Therefore, the two critical parameters of the algorithm are the number z of RANSAC iterations and the distance ε.
To determine z, we use the formula

    z = log(1 − p) / log(1 − r²),     (3.3)

where p is the probability that at least one of our random selections is an error-free set of candidates and r is the probability that a randomly-selected candidate is an inlier.

Figure 3.4: (a) Trajectory of a feature in Sequence 1 of the Car dataset, which is presented in Chapter 5 and illustrated in Figure 3.2. The feature was the centroid of all pixels labeled as "foreground" by a color-based foreground-background detector. (b) Trajectory of the foreground pixel centroid in Sequence 2 of the dataset. Also shown is the epipolar line corresponding to pixel q_1(363) in (a). (c) Magnified view of the trajectory/epipolar line intersection in (b). The individual line segments connecting feature locations in adjacent frames are now visible. Note that the epipolar line of q_1(363) intersects multiple line segments along the trajectory. (d) Exploiting the Timeline Constraint for two-sequence alignment. Each point represents a candidate temporal alignment, i.e., an element of V_{q_1(f_1)} for the feature location q_1(f_1) in (a). The reconstructed timeline, drawn as a solid line (f_2 = 1.0030 f_1 + 54.7708), describes the temporal alignment of the two sequences in the Car dataset.
Equation (3.3) expresses the fact that z should be large enough to ensure that, with probability p, at least one randomly-selected pair of candidates is outlier-free. We used p = 0.99 and r = 0.05 (z = 1840 iterations) for all experiments, which are conservative values that lead to accurate results in our experiments, as demonstrated in Chapter 5. To compute the distance ε, we observe that ε can be thought of as a bound on the distance between detected feature locations in the input cameras and their associated epipolar lines. This allows us to approximate ε by the average distance between static features in the scene and their associated epipolar lines. Next, we summarize our algorithm for recovering the temporal alignment between multiple video sequences.
Algorithm 1 The Temporal Alignment Algorithm

Input
 1: N; {Number of cameras.}
 2: F_ij; {Fundamental matrices between cameras i and j.}
Output
 3: [α_1 ... α_N]; {Temporal dilation parameters of the timeline L.}
 4: [β_1 ... β_N]; {Temporal shift parameters of the timeline L.}
BEGIN
 5: {Step 1 - Generate the voting space: V_{q_r(f_r)}, ∀ q_r(f_r).}
 6: for i ← 1 to N do
 7:   Compute feature trajectories q_i(·).
 8: end for
 9: {Consider the camera 1 as the reference camera.}
10: for (each instantaneous position q_1(f_1)) do
11:   for i ← 2 to N do
12:     for (each line segment in sequence i) do
13:       if (the epipolar line q_1(f_1)F_1i intersects the line segment) then
14:         Obtain the frame number f_i of the intersection point.
15:         if (the corresponding pixels of (f_1, f_i) are consistent with the cameras' epipolar geometry) then
16:           Store f_i in the intersections' vector of sequence i.
17:         end if
18:       end if
19:     end for
20:   end for
21:   Construct the set V_{q_1(f_1)} by collecting all the f_i coordinates computed and concatenating them so that they form valid N-dimensional vectors [f_1 ... f_N].
22: end for
23: Generate the voting space by performing the union of the sets V_{q_1(f_1)}.
24: {Step 2 - Fit the timeline L to the union of the estimated sets V_{q_1(f_1)}.}
25: Apply the RANSAC algorithm to the voting space in order to determine the data set that best fits the searched timeline L.
26: Apply the least-squares method over the data set estimated by RANSAC to compute the timeline parameters: [α_1 ... α_N] and [β_1 ... β_N].
27: return([α_1 ... α_N], [β_1 ... β_N]).
END
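To make Step 2 of Algorithm 1 concrete, the sketch below fits the timeline to a voting space with RANSAC and then refines it by least squares; it reproduces z = 1840 iterations for the parameter values discussed above. The array layout and helper names are assumptions of this example, and camera 1 is taken as the reference (α_1 = 1, β_1 = 0); this is an illustration, not our implementation.

    import numpy as np

    def ransac_iterations(p=0.99, r=0.05):
        """Number of iterations z from Equation (3.3); z = 1840 for the defaults."""
        return int(np.ceil(np.log(1 - p) / np.log(1 - r**2)))

    def point_line_distances(votes, a, b):
        """Distance of each N-dimensional vote to the line through points a and b."""
        u = (b - a) / np.linalg.norm(b - a)
        diff = votes - a
        return np.linalg.norm(diff - np.outer(diff @ u, u), axis=1)

    def fit_timeline(votes, eps=0.5, p=0.99, r=0.05, rng=np.random.default_rng(0)):
        """votes: (M, N) array of candidate temporal alignments [f_1 ... f_N]."""
        best_inliers = None
        for _ in range(ransac_iterations(p, r)):
            i, j = rng.choice(len(votes), size=2, replace=False)
            inliers = point_line_distances(votes, votes[i], votes[j]) <= eps
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # Least-squares refit: f_i = alpha_i * f_1 + beta_i for i = 2..N.
        f1 = votes[best_inliers, 0]
        A = np.column_stack([f1, np.ones_like(f1)])
        coeffs, _, _, _ = np.linalg.lstsq(A, votes[best_inliers, 1:], rcond=None)
        alpha = np.concatenate([[1.0], coeffs[0]])
        beta = np.concatenate([[0.0], coeffs[1]])
        return alpha, beta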
Chapter 4
Spatio-Temporal Alignment
Time and space are modes by which we think and not
conditions in which we live.
Albert Einstein
While images of a dynamic scene may contain stationary points in the
background, these points cannot be expected to represent the majority of
detected features. Any procedure that attempts to estimate the geometric
relations between the views from those features alone is likely to ignore a
significant portion of the available image information. In practice, this will
cause errors in the computed fundamental matrices that encapsulate the
geometric relations and, ultimately, in the reconstructed timeline. In this
chapter we show how to refine the pre-computed fundamental matrices F_ij and the timeline L by incorporating all features detected in the sequences, i.e., both the tracked dynamic features and the static features detected on
the background. Without loss of generality, we assume that the camera
represented by the number 1 is our reference camera.
4.1 The Refinement Algorithm
Even though the solution presented in Chapter 3 for temporally aligning
multiple video sequences is robust, it is not very accurate: it does not use
any information from the dynamic features to refine the spatial alignment
between sequences and even the temporal alignment is affected by errors in
the initial spatial alignment, because the RANSAC threshold for regarding
a point in the voting space as an inlier is set proportional to the magnitude
of those errors.
In the following, we present a method that can refine both the temporal
and the spatial transformations between sequences, using information from
all features available. Firstly, we present its derivation for the case of scenes
monitored by two distinct viewpoints (N = 2) and, next, we present the
general ideas behind its application for N > 2.
4.1.1 Two-View Refinement
The geometric basis of the refinement method in the case of two different viewpoints is presented in Figure 4.1. The multilinear tensor that encapsulates the geometric relations between the two views is represented by a fundamental matrix. To make the problem linear, we approximate each tracked trajectory q_2(·) with a polygonal spline s_2(·) (represented by the red line segments in Figure 4.1) using a method proposed by Horst and Beichl that guarantees an upper bound on the difference between the length of the original curve and the length of its polygonal approximation (Horst and Beichl, 1997). Importantly, we parameterize these polygonal approximations in a way that is consistent with the (temporal) parameterization of the original curves, i.e., each point that lies both on the original curve and on the approximation keeps its original coordinate, so that each linear spline segment has endpoints with well-defined temporal coordinates f_{2a} and f_{2b}.

Figure 4.1: Geometry of our refinement method.
The key steps of our method to refine the current estimates for the spatial parameters represented by the entries of the fundamental matrix F̂, and the temporal parameters α̂ and β̂ of the timeline, are:

1. Create the epipolar lines q_1(f_1)F̂ in camera 2 from their corresponding instantaneous feature positions q_1(f_1) in camera 1, by using F̂.

2. Intersect the epipolar lines q_1(f_1)F̂ with the polygonal trajectory approximations s_2(t_2) in camera 2.

3. Screen the resulting intersections s_2(f_2) for consistency with the current estimates for α̂ and β̂.

4. Use only the consistent intersections to generate algebraic equations that jointly constrain the (unknown) transformation parameters F, α and β.
More specifically, according to Equation (3.1), an estimated temporal transformation with parameters α̂ and β̂ implies that any instantaneous feature q_1(f_1) in camera 1 should correspond to a feature in camera 2 whose temporal coordinate is

    f_2 = α̂ f_1 + β̂.     (4.1)
Thus, an intersection s_2(f_2) in camera 2 (see Figure 4.1) between an epipolar line q_1(f_1)F̂ and an approximated trajectory segment whose endpoints have temporal coordinates f_{2a} and f_{2b} is consistent with the current estimated temporal alignment only if

    f_{2a} < α̂ f_1 + β̂ < f_{2b}.     (4.2)
Every intersection that satisfies the consistency condition in Eq. (4.2) yields a linear constraint on α, β, and the entries of F. To obtain this constraint, we note that an arbitrary point s_2(f_2) on an intersected spline segment q_2(f_{2a})q_2(f_{2b}) (Figure 4.1) is, according to the segment's parameterization, given by

    s_2(f_2) = q_2(f_{2a}) + (f_2 − f_{2a}) [q_2(f_{2b}) − q_2(f_{2a})] / (f_{2b} − f_{2a}).     (4.3)
Observe that if f_2 = f_{2a} then s_2(f_2) = q_2(f_{2a}), which means that the epipolar line q_1(f_1)F̂ intersects the endpoint q_2(f_{2a}). Similarly, if f_2 = f_{2b}, then s_2(f_2) = q_2(f_{2b}), meaning that the epipolar line q_1(f_1)F̂ intersects the other endpoint q_2(f_{2b}).
Now, consider that s_2(f_2) in camera 2 represents the corresponding point of q_1(f_1) in camera 1. Given that assumption, those points together must satisfy the following equation:

    q_1(f_1) F s_2(f_2) = 0.     (4.4)
Observe that we have considered F instead of F̂ in Equation (4.4), where F = F̂ + ΔF, i.e., the entries of F represent the spatial parameters obtained after the refinement process, F̂ is the current estimate of the fundamental matrix and ΔF is the refinement term computed by our method. Substituting Equation (4.3) into Equation (4.4), we obtain:

    q_1(f_1) F { q_2(f_{2a}) + (f_2 − f_{2a}) [q_2(f_{2b}) − q_2(f_{2a})] / (f_{2b} − f_{2a}) } = 0.     (4.5)
Equation (4.5) may be rewritten in a more concise manner as follows:

    q_1(f_1) {F k f_2 + F m} = 0,     (4.6)

where

    k = [q_2(f_{2b}) − q_2(f_{2a})] / (f_{2b} − f_{2a}),     (4.7)

    m = q_2(f_{2a}) − f_{2a} k.     (4.8)
Now, by considering f_2 = α f_1 + β, where α = α̂ + Δα and β = β̂ + Δβ, i.e., α and β are the temporal parameters obtained after the refinement process, α̂ and β̂ are the current estimates and Δα, Δβ are the refinement terms, and similarly by writing F = F̂ + ΔF, the factor enclosed by curly brackets in Equation (4.6) becomes

    F̂(f_1 α̂ k + β̂ k + m) + f_1 F̂ k Δα + F̂ k Δβ + ΔF(f_1 α̂ k + β̂ k + m) + f_1 Δα ΔF k + Δβ ΔF k.
Disregarding the second-order terms f_1 Δα ΔF k and Δβ ΔF k, we obtain the following linear constraint on ΔF, Δα and Δβ:

    q_1(f_1) [ f_1 F̂ k Δα + F̂ k Δβ + ΔF h ] = − q_1(f_1) F̂ h,     (4.9)

where h = (f_1 α̂ + β̂) k + m.
After straightforward algebraic manipulation, Equation (4.9) may be rewritten as the (inner) product of two vectors: an 11-element row vector that contains only known coefficients and an 11-element column vector that contains the 9 unknown coefficients of ΔF followed by the scalar unknowns Δα and Δβ. Linear constraints of this form, yielded by all the temporally consistent intersections between epipolar lines and approximated trajectories, are finally assembled into an over-constrained linear system. Importantly, traditional linear constraints of the type used in the eight-point algorithm can also be added to this system, just by concatenating two zeros at the end of their coefficient vectors, as the coefficients of the temporal parameters Δα and Δβ. Therefore, our solution allows us to use all available constraints both from static features and from dynamic features in order to refine the estimated spatio-temporal alignment.
The set of equations that we construct is of the form A_{n×11} x_{11×1} = b_{n×1}, where the number of linear constraints n is frequently much larger than the 11 unknowns. Our task now consists in finding the best solution x for that linear system. There are many different techniques for solving such a system (Press et al., 1988; Atkinson, 1989; Tomasi, 2000; Hefferon, 2001), and a good way to find its solution is by using the Singular Value Decomposition (SVD) (Press et al., 1988), although linear least-squares methods also represent very interesting alternatives (Atkinson, 1989). By using one of the above-mentioned numerical methods, our spatio-temporal approach computes the system's solution in an iterative way, until the unknowns ΔF, Δα and Δβ converge to zero.
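As an illustration of how such a system might be assembled and solved, the sketch below builds the 11-element constraint row of Equation (4.9) for one consistent intersection and solves the stacked system by linear least squares. The row-major vectorization of ΔF and the function names are assumptions made for this example only.

    import numpy as np

    def refinement_row(q1, f1, q2a, f2a, q2b, f2b, F_hat, alpha_hat, beta_hat):
        """One linear constraint of Eq. (4.9) on x = [vec(dF); d_alpha; d_beta].
        q1, q2a, q2b are homogeneous 3-vectors; f2a, f2b bound the intersected segment."""
        k = (q2b - q2a) / (f2b - f2a)                 # Eq. (4.7)
        m = q2a - f2a * k                             # Eq. (4.8)
        h = (f1 * alpha_hat + beta_hat) * k + m       # h as defined after Eq. (4.9)
        row = np.empty(11)
        row[:9] = np.outer(q1, h).ravel()             # coefficients of the 9 entries of dF
        row[9] = f1 * (q1 @ F_hat @ k)                # coefficient of d_alpha
        row[10] = q1 @ F_hat @ k                      # coefficient of d_beta
        rhs = -(q1 @ F_hat @ h)
        return row, rhs

    def solve_refinement(rows, rhs):
        """Least-squares solution of A x = b; returns dF (3x3), d_alpha, d_beta."""
        x, _, _, _ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
        return x[:9].reshape(3, 3), x[9], x[10]

In the same spirit, a static-feature constraint could be appended to the system by using np.outer(q1, q2).ravel() for the first nine coefficients, zeros for the last two, and −q1 F̂ q2 as the right-hand side, matching the remark above about concatenating two zeros to eight-point-style constraints.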
4.1.2 N-View Refinement
Consider now a dynamic scene viewed by N distinct cameras, where N > 2, and suppose that our reference camera is the camera labeled by the number 1, which is denoted by c_1. In this case, we may simply use the two-view refinement technique previously described for each pair of cameras (c_1, c_i), where c_i is the i-th camera, for i = 2, ..., N. Thus, by combining the computed equations t_i = α_i t_1 + β_i with refined parameters α_i and β_i, we may obtain new equations that capture the temporal relation between any two arbitrary sequences i and j.
As far as our methodology for refining the spatio-temporal alignment between N different video sequences is concerned, one may argue that the use of other tools such as the trifocal and quadrifocal tensors could also be considered, since they represent natural extensions of the fundamental matrix to the case of three and four views, respectively. Because of the added stability of a third or even a fourth view, and the fact that they constrain the position of reconstructed points in space more tightly, use of the trifocal and quadrifocal tensors should lead to greater accuracy than two-view techniques (Hartley, 1998). This hypothesis is supported by the results of Heyden (Heyden, 1995a,b). In order to evaluate the use of those alternative tools, we derived and tested a three-view refinement methodology similar to the one derived for the two-view case. The main difference of this alternative approach relates to the multilinear tensor (a trifocal tensor instead of a fundamental matrix) that encapsulates the geometric relations between the views. Our experiments showed that this strategy is not effective, since the optimization of the temporal and spatial parameters is quite unstable. One of the main impediments to using the trifocal tensor (and the quadrifocal tensor) is its overparameterization (Hartley, 1998): 27 components of the tensor describe a geometric configuration that depends only on 18 parameters. This is probably the main reason for the inaccuracies of our results and the observed instability in the optimization process.
Chapter 5
Experimental Results
No amount of experimentation can ever prove me
right; a single experiment can prove me wrong.
Albert Einstein
In this chapter, we present and discuss several experimental results with
real-world and synthetic sequences. Firstly, in Section 5.1, we illustrate the
applicability of our approach by testing it on several challenging two- and
three-view datasets of real-world dynamic scenes. Scenarios that may be critical for most of the current spatio-temporal alignment methodologies are successfully handled by our approach, such as situations where reliable feature tracking cannot be performed over the entire sequence, the initial estimates of the cameras' epipolar geometry are inaccurate, the video sequences have large temporal misalignments, and the scene points move along three-dimensional, overlapping and cyclical trajectories.
However, although these experiments with real-world sequences demonstrate how effective our method can be in some common and challenging scenarios, their results do not allow a careful analysis of the scalability, efficiency and accuracy of our approach, since many critical variables, such as the error in the initial estimate of the cameras' epipolar geometry, had their values tested only in a limited range.
Therefore, to obtain additional experimental results for a better evalua-
tion of our method, we have performed experiments with synthetic sequences
of an artificial scene, presented in Section 5.2, where we could simulate and
control some of the main parameters that affect the results of our approach.
In particular, based on that simulation, we obtained quantitative measure-
ments of the quality of the estimated spatio-temporal alignments as functions
of three key factors: (1) the accuracy of the initial estimates of the fundamental matrices that capture the geometric relations between the views, (2) the amount of noise in the tracking system and (3) the number of tracked features that are considered by the method.
5.1 Real-world Sequences
To demonstrate the applicability of our timeline reconstruction algorithm,
we tested it on various challenging two- and three-view real-world datasets.
Image dimensions in all datasets were about 320 × 240 pixels. The sequences
represented a wide variety of conditions, including sequences that ranged
from 55 to 605 frames; temporal misalignments of 21 to 285 frames; relative
frame rates between 1 and 2; image quality that ranged from quite high (i.e.,
sequences captured by laboratory-based color cameras) to rather low (i.e.,
clips from a low-quality, MPEG-compressed video of a broadcast TV signal);
and object motions ranging from several pixels per frame to less than a pixel.
Since no single tracker was able to handle all of our datasets, and since
our algorithm does not depend on a specific tracker, we experimented with
several—a simple color-based blob tracker, a blob tracker based on back-
ground subtraction, and the WSL tracker (Jepson et al., 2003). In each case,
we treated the tracker as a “black box” that returned a list of corresponding
features for every pair of consecutive frames.
Alignment accuracy was evaluated by measuring the average temporal
misalignment. This is the average difference between the computed time of
each frame and the frame’s “ground-truth” time, i.e., when it was actually
captured. Since our sequences were acquired with unsynchronized cameras,
the ground-truth time of each frame could only be known to within ±0.5
frames. This is because even if we could perfectly align the sequences at
frame resolution, corresponding frames could have been captured up to 0.5
frame intervals apart. This lower bound on ground-truth accuracy is critical
in evaluating the presented results.
5.1.1 Two-view Car Dataset
As a first test, we applied our technique to a two-view sequence used by
Caspi and Irani (Caspi and Irani, 2000) for evaluating their method (Figure
The data was acquired by two cameras with an identical frame rate of
25fps, implying a unit ground-truth temporal dilation (α = 1). The ground-
truth temporal shift was β = 55 ± 0.5 frames.
Most frames in the resulting sequences contain a single rigid object (a
car) moving over a static background (a parking lot), along a fairly smooth
trajectory. We therefore used a simple blob tracker that relied on foreground-
background detection to label all foreground pixels in each frame. The cen-
troid of the foreground pixels was the only “feature” detected and tracked
(Figures 3.2(a) and 3.2(b)). To compute the cameras’ fundamental matrix
we used the normalized eight-point algorithm (Hartley, 1997a), which was
provided with twenty six correspondences between background pixels in the
two views illustrated in Figure 5.2.
Figure 5.1: Four representative frames (100, 200, 300, 400) from the cameras 1 and 2, of the two-view Car dataset (Caspi and Irani, 2000). We can identify the spatial misalignment by observing near image boundaries, where different static objects are visible in each sequence. The temporal misalignment is easily identified by comparing the position of the gate in frames 400. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .
Figure 3.4(d) shows the timeline reconstructed using the RANSAC-based algorithm of Chapter 3, with the RANSAC parameter ε set to 2.0. The reconstructed timeline gives an average temporal misalignment of 0.66 frames, almost within the 0.5-frame uncertainty of the ground-truth measurements.
By applying the refinement procedure of Chapter 4 we obtained updated
values of α = 1.0027 and β = 54.16 for the timeline coefficients. These co-
efficients correspond to an improved average temporal misalignment of 0.35
frames, i.e., below the accuracy of the ground-truth alignment. Note that
these results are at least as accurate as those of Caspi and Irani, even though
we are solving a less constrained problem (i.e., α is unknown and scene pla-
narity is not required). Moreover, the results were obtained from raw results
of a tracker that was not particularly robust (e.g., the centroid of the fore-
ground pixels drifts off the moving car for approximately 30 frames in each
sequence).
Figure 5.2: The recovered epipolar geometry for the two-view Car dataset.
Points and their epipolar lines are displayed in each image for verification.
Accuracy of the computed fundamental matrix can be appreciated by the
closeness of each point to the epipolar line of its corresponding point.
5.1.2 Two-view Robots Dataset
In a second experiment, we used two cameras operating at 30fps to acquire images of four small robots, as they executed small random movements on two planes (Figure 5.3). The ground-truth timeline coefficients were α = 1 and β = −284.5 ± 2. We used a uniform-color blob tracker to track these robots between consecutive frames. The resulting data was challenging for four reasons. First, the robots' inter-frame motion was imperceptibly small (roughly 0.25 pixels per frame), making precise manual alignment by a human observer virtually impossible. Second, the temporal shift of the sequences was large, making it inefficient to find this shift via exhaustive search. Third, the uniformly-colored regions on each robot were small, causing our tracker to generate fragmented and noisy trajectories. Fourth, the robots' motion was designed to produce trajectories that self-intersect and that are non-smooth, complicating the shape of each blob's trajectory.
Figure 5.3: Three representative frames (000, 150, 300) from the cameras 1 and 2, of the two-view Robots dataset. The spatial misalignment can be easily identified by observing the distinct orientations of the robots' soccer field. On the other hand, the temporal misalignment can be noticed by comparing the position of the robot with the dark green circle in frames 300. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .

The timeline reconstructed with ε = 2.0 prior to refinement is shown in Figure 5.4. This line gives an average temporal misalignment error of 5.84 frames. Our refinement stage reduced this error to 4.43 frames, with α = 1.015 and β = −286.89. Given the robots' image velocity, this translates to a misalignment of about one pixel. Figure 5.7 confirms that the computed alignment is quite good, despite the robots' slow motion and the tracker's poor performance.

Figure 5.4: Voting space, timeline, and timeline equation (f_2 = 1.0329 f_1 − 305.7445) recovered prior to refinement for the two-view Robots dataset. Each point is an element of V_{q_1(f_1)} for some feature q_1(f_1) in sequence 1.
5.1.3 Two-view Juggling Dataset
In this dataset, two people are observed by a wide-baseline camera pair while juggling five uniformly-colored balls (see Figure 5.5). Both sequences were acquired at a rate of 30fps. This dataset represents a difficult case for existing direct- or feature-based methods because (1) the trajectories of different balls nearly overlap in 3D, (2) individual trajectories are approximately cyclical, (3) image velocities are quite large, up to 9 pixels per frame, making long-range feature tracking difficult, and (4) the ground-truth temporal shift between the sequences is β = −41 ± 0.5 frames, or about 1.5 periods of a ball's motion. This shift is likely to cause difficulties for techniques based on non-exhaustive search (Rao et al., 2003) or non-linear optimization (Caspi et al., 2002) because of the possibility of getting trapped in deep local minima.
Figure 5.5: Four representative frames (100, 115, 120, 130) from the cameras 1 and 2, of the two-view Juggling dataset. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .

To make the alignment problem even more challenging, we modified this dataset by deleting or adding frames to one of the sequences. These modifications were intended to simulate sequences with more than one frame rate (e.g., containing a slow-motion segment) and sequences that contain spurious clips (e.g., a TV commercial). We used a uniform-color blob tracker to track four of the balls in each sequence, providing us with the location of four features per frame. No information about feature correspondences between cameras was given to the algorithm (i.e., color information was not used). Figures 5.6(a)-(c) show the reconstructed timelines before the refinement stage, with ε = 0.5. The average temporal misalignment error was 0.75 frames for the original dataset. The refinement stage brought this error down to 0.26 frames, with α = 1.0004 and β = −40.80.
Figure 5.6: Voting spaces, timelines, and timeline equations recovered prior to refinement for the two-view Juggling dataset. (a) Juggling dataset without modification (f_2 = 1.0035 f_1 − 40.8382), (b) simulation of a sequence with more than one frame rate (f_2 = 1.0067 f_1 − 42.2728 and f_2 = 0.49677 f_1 + 44.874) and (c) simulation of a sequence with spurious clips (f_2 = 1.0086 f_1 + 57.5611 and f_2 = 0.98712 f_1 − 39.4849).
Figure 5.7: Before alignment images were created by superimposing the green band of a frame f_2 with the red and blue bands of frame f_1 = (f_2 − β*)/α*, using ground truth timeline coefficients α* and β*. After alignment images were created by replacing the green band of the images above them with that of frame f_1 = (f_2 − β)/α, with α, β computed by our algorithm. For both types of images, deviations from the ground-truth alignment cause "double exposure" artifacts (i.e., when f_1 ≠ f_2 or f_1 ≠ f*_1, respectively). The frames shown are f_2 = 300 (Robots dataset) and f_2 = 115 (Juggling dataset), together with magnified views.
Figure 5.8: Distribution of distances of inlier votes from the reconstructed timeline (rows: Car, Robots and Juggling datasets; each histogram plots the number of temporal votes against the residual error in frames). Left column: distribution before the timeline refinement stage. Right column: distribution after the refinement stage. Note that the updated epipolar geometry and updated timeline parameters reduce the distance between inliers and the timeline and cause more votes to be labeled as inliers.
5.1.4 Three-view Soccer Dataset
As a final experiment, we applied our technique to three video clips extracted from a single MPEG-compressed TV broadcast of a soccer match (FIFA, 2002). The clips were replays of the same goal filmed from three distinct viewpoints (Figure 5.9). Each sequence contained a significant panning motion to maintain the moving players within the field of view. To ensure that the pairwise fundamental matrices remained constant for all frames, we stabilized each sequence by computing the frame-to-frame homography using Brown and Lowe's system (Brown and Lowe, 2003). We used the WSL tracker to track the same player in each sequence, thereby obtaining one feature trajectory per camera. WSL was initialized manually in the first frame of each sequence. Even though it was able to track the chosen player for most frames, the player's small size and jitter artifacts caused by the video's poor quality resulted in noisy measurements of his location. These measurements were given as input to the basic timeline reconstruction algorithm with ε = 1.5 and no timeline refinement.
Since this dataset contained N = 3 views, the timeline is a 3D line with 3-
vectors as its coefficients (see Eq. (3.2) and Figures 5.10(a) and 5.10(b)). To
evaluate the timeline’s accuracy in the absence of ground-truth information,
we attempted to estimate the ground-truth alignment by visual inspection:
we identified three easily-distinguishable events (e.g., a player stepping on a
field line, as in Figure 5.9) and recorded the frame where each event occurred
in each sequence. These frames were used as “ground-truth” event times for
each camera. To evaluate the timeline’s accuracy, we used it to predict the
event times in cameras 1 and 2 from the ground-truth time in camera 3.
Figure 5.9: Two representative frames (8, 46) from the cameras 1, 2 and 3, of the three-view Soccer dataset; the panels label these frames as Event #1 and Event #2 and include a magnified view of camera 3. This dataset, along with more experimental results and videos, are available at http://www.dcc.ufmg.br/~cardeal/research/timeline/ .
Figure 5.10: (a),(b) Two views of the 3D voting space and 3D timeline computed for the Soccer dataset (the axes are the frame numbers of cameras 1, 2 and 3).
The minimum difference between the predictions and the ground-truth
times across all three events was 0.22 frames in camera 1 and 0.86 frames
in camera 2; the maximum difference was 1.66 and 1.33 frames, respectively.
This confirms that the sequences were aligned quite well (see Figure 5.11),
despite the low quality of the videos and their unequal frame rates.
Figure 5.11: Before alignment images were created by superimposing the green band of a frame f_2 with the red and blue bands of frame f_1 = (f_2 − β*)/α*, using ground truth timeline coefficients α* and β*. After alignment images were created by replacing the green band of the images above them with that of frame f_1 = (f_2 − β)/α, with α, β computed by our algorithm. For both types of images, deviations from the ground-truth alignment cause "double exposure" artifacts (i.e., when f_1 ≠ f_2 or f_1 ≠ f*_1, respectively). The frame shown is f_2 = 46, together with a magnified view.
5.2 Synthetic Sequences
In the second phase of our experiments, we developed a software system for simulating the dynamics of 3D particles with independent motion and their visualization from multiple viewpoints. Through the use of synthetic data, the values of some key parameters, such as the number of tracked features, could be controlled and their impact on the accuracy, stability and scalability of our approach could be carefully analyzed. In particular, by using the developed simulator we answer in this section three fundamental questions:
1. What is the scalability of our method against an increasing number of
tracked features?
2. How is the accuracy of our method affected by errors in the initial esti-
mates of the fundamental matrices that capture the geometric relations
between the views?
3. How does the noise level in the tracking system affect the accuracy of
our technique?
5.2.1 Scene dynamics simulation
We consider in this analysis an artificial scene monitored by two calibrated cameras, whose intrinsic and extrinsic parameters are listed in Table 5.1 and are those of the real cameras used in our indoor experiments presented in Section 5.1, namely, a Hitachi KP-D50 Color CCD Camera (C_1) and a Sony DCR-TRV320 Digital Camcorder (C_2). Those cameras were positioned according to the scheme illustrated in Figure 5.12(a).

Table 5.1: Matrices of intrinsic and extrinsic parameters for the cameras used during the simulation of the artificial scene.

Camera | Intrinsic parameters          | Extrinsic parameters
C_1    | [675.00    0     438.69]      | [0.71  0.01  0.70  1237.20]
       | [  0     674.61  260.08]      | [0.04  1.00  0.06     2.57]
       | [  0       0       1   ]      | [0.70  0.07  0.71  2636.10]
C_2    | [1835.30   0     352.39]      | [0.78  0.28  0.55   222.31]
       | [  0     1834.20 220.85]      | [0.12  0.81  0.58   115.09]
       | [  0       0       1   ]      | [0.61  0.52  0.60 10130.00]
Figure 5.12: Simulation of scene dynamics. (a) Experimental setup. (b) All features start in random positions uniformly distributed within the illustrated sphere, which is viewed from both cameras.
Random object trajectories were generated in the world coordinate system
and projected in the image planes by using the afore-mentioned intrinsic and
extrinsic parameters. All features were started in random positions uniformly
distributed within a sphere that was visible from both cameras, as illustrated
in Figure 5.12(b). Note that the sphere’s central point was defined by the
center of the black circle located in the middle of a planar calibration target
that was positioned in the 3D world, so that the projections of the central
point of that circle were located near the centers of the cameras’ CCDs.
The sphere’s radius was determined empirically in order to maximize its
projections in both cameras, subject to the constraint that such projections
should be entirely visible.
In our simulation, each feature moves for l frames, where l is drawn from a uniform distribution in the interval [1, 2L]. Specifically, we consider that both cameras acquire image sequences of the same length (256 frames) and that the average feature lifetime L is 128 frames. By defining a potentially distinct lifetime for each feature, we simulate situations where the tracker may lose some features, either because they actually become occluded or go out of bounds, or because the computation fails for one reason or another. As it is frequently desired to maintain a certain number of features, every time a feature dies, a new one is started in a random position within the volume visible from both cameras.

Figure 5.13: Feature dynamics.
The features move with the following dynamics:

    p_{t+1} = p_t + v cos(φ) ρ + v sin(φ) τ,     (5.1)

which is illustrated in Figure 5.13 and whose notation is described below:

- p_t and p_{t+1} are the current frame and next frame positions, respectively.

- v is the magnitude of the feature displacement in the 3D world (in cm), simulated as additive gaussian noise with zero mean and standard deviation V. In the current simulation, we defined V = ±2.5 cm, resulting in an average feature displacement of about 2 pixels in the image planes of both cameras.

- φ is an angular variation in the direction of movement of the feature in the 3D world (in radians), simulated as additive gaussian noise with mean zero and standard deviation A. Specifically, we defined A = 0.09 radians, that is, about 5 degrees.

- ρ is a unit direction vector along the current direction of movement.

- τ is a unit direction vector perpendicular to ρ, given by the cross product between ρ and a unit direction vector uniformly sampled on the sphere.
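A minimal sketch of one simulation step is shown below, using the V and A values given above. The way ρ is updated after the step (reset to the realized displacement direction) is an assumption of this sketch, not a statement about our simulator.

    import numpy as np

    rng = np.random.default_rng(0)

    def dynamics_step(p_t, rho, V=2.5, A=0.09):
        """One step of Eq. (5.1): p_{t+1} = p_t + v cos(phi) rho + v sin(phi) tau.
        V is the displacement standard deviation (cm); A the angular one (radians)."""
        v = rng.normal(0.0, V)                           # displacement magnitude
        phi = rng.normal(0.0, A)                         # angular variation of the direction
        u = rng.normal(size=3)
        u /= np.linalg.norm(u)                           # unit vector uniformly sampled on the sphere
        tau = np.cross(rho, u)
        tau /= np.linalg.norm(tau)                       # unit vector perpendicular to rho
        p_next = p_t + v * np.cos(phi) * rho + v * np.sin(phi) * tau
        rho_next = (p_next - p_t) / (np.linalg.norm(p_next - p_t) + 1e-12)  # assumed direction update
        return p_next, rho_next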
Regarding the video sequences obtained by the visualization process performed by the cameras, we defined the following "ground-truth" affine transformation for modelling their temporal misalignment:

    f_2 = f_1 − 32,     (5.2)

that is, we have cameras with the same frame rate (α = 1) and whose corresponding video sequences have a temporal misalignment of 32 frames (β = −32). Importantly, the quality of the temporal alignments estimated by our method as well as its computational cost are invariant to the magnitude of the temporal shift between the sequences, since our approach derives alignment constraints by matching instantaneous positions of features in one sequence against entire feature trajectories in the other sequences. Therefore, we have arbitrarily defined a temporal misalignment of 32 frames, which represents approximately the average value of most of the temporal shifts of the real-world sequences presented in Section 5.1. Given that both cameras used in this experiment work at a frame rate of 30 frames per second, we note that this temporal shift simulates cameras that were activated at distinct time instants with an interval of about 1 second.
5.2.2 Error measurements and experiments’ description
In the current analysis, we use the average temporal alignment error as our basic measurement for evaluating the accuracy, scalability and stability of our approach. Specifically, its value is given by the average of the absolute values of the differences between the temporal coordinates computed by the estimated timeline and the temporal coordinates computed by the "ground-truth" affine transformation in Equation (5.2). That is, if f_1 represents the temporal coordinate of the reference sequence, t_g(f_1) represents its corresponding temporal coordinate computed by the affine transformation in Equation (5.2) and t_e(f_1) is its corresponding temporal coordinate computed by using the timeline estimated by our method, the average temporal alignment error ε_t is given by:

    ε_t = (1/256) Σ_{f_1=0}^{255} |t_e(f_1) − t_g(f_1)| = (1/256) Σ_{f_1=0}^{255} |(α − 1) f_1 + β + 32|.     (5.3)
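For reference, Equation (5.3) amounts to the following small computation, a sketch under the assumption that the estimated timeline is given by its two coefficients:

    import numpy as np

    def average_temporal_alignment_error(alpha, beta, n_frames=256):
        """Average temporal alignment error of Eq. (5.3): compare the estimated
        timeline f_2 = alpha*f_1 + beta with the ground truth f_2 = f_1 - 32 of Eq. (5.2)."""
        f1 = np.arange(n_frames)
        return np.mean(np.abs((alpha - 1.0) * f1 + beta + 32.0))

A perfect estimate (α = 1, β = −32) yields ε_t = 0.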
To measure the error of the fundamental matrix, we used the same background features that were used to compute its initial estimate in the real scene. Specifically, this error was measured as the average of the distances between each background feature projection in the image plane of the reference camera (C_1) and its corresponding epipolar line. That is, if e_i represents the distance between a static feature i and its corresponding epipolar line, the error ε_f (in pixels) of the fundamental matrix is given by:

    ε_f = (1/m) Σ_{i=1}^{m} e_i,     (5.4)

where m is the total number of background features.
In order to simulate the error in the cameras' epipolar geometry, new fundamental matrices with pre-determined errors (in pixels) were generated by adding a small multiple Δ_F of a 3 × 3 unit matrix J to the original fundamental matrix F estimated with the normalized eight-point algorithm. That is, we added the term Δ_F = δ_F J to F, where δ_F simulates gaussian noise with mean zero and standard deviation 1e−5. This operation was repeated until the new fundamental matrix had the desired error.
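The sketch below illustrates both the error measure of Equation (5.4) and the perturbation procedure just described. The correspondence format, the convention q_1 F q_2 = 0, the relative scale of F and δ_F, and the stopping logic are assumptions made for this example.

    import numpy as np

    def fundamental_matrix_error(F, points1, points2):
        """Eq. (5.4): average distance between each background feature in the
        reference camera (points1, homogeneous 3-vectors) and the epipolar line
        F q_2 of its correspondence in the other camera (points2)."""
        distances = []
        for q1, q2 in zip(points1, points2):
            line = F @ q2
            distances.append(abs(line @ q1) / np.hypot(line[0], line[1]))
        return np.mean(distances)

    def perturb_fundamental_matrix(F, points1, points2, target_error,
                                   rng=np.random.default_rng(0)):
        """Repeatedly add delta_F * J (J the 3x3 matrix of ones, delta_F ~ N(0, 1e-5))
        until the perturbed matrix reaches the desired average error."""
        J = np.ones((3, 3))
        F_new = F.copy()
        while fundamental_matrix_error(F_new, points1, points2) < target_error:
            F_new = F_new + rng.normal(0.0, 1e-5) * J
        return F_new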
On the other hand, to simulate distinct noise levels in the tracker, we considered the following model:

    x_n = x_o + d cos(ϕ),     (5.5)
    y_n = y_o + d sin(ϕ),     (5.6)

where:

- (x_o, y_o) represents the original coordinates of a feature.

- (x_n, y_n) represents the new corrupted coordinates of a feature.

- d represents the magnitude of the feature displacement, simulated as additive gaussian noise with zero mean and standard deviation R.

- ϕ defines the direction of the feature displacement in the image plane and is drawn from a uniform distribution in the interval [0, 2π].

In particular, we controlled the variation of the tracker's noise level in our experiments by defining specific values for the standard deviation R of the random variable d described above.
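A one-function sketch of this noise model (function and parameter names are illustrative only):

    import numpy as np

    def corrupt_feature(x_o, y_o, R, rng=np.random.default_rng(0)):
        """Tracker noise model of Eqs. (5.5)-(5.6): displace a feature by a distance
        d ~ N(0, R) along a direction phi drawn uniformly from [0, 2*pi)."""
        d = rng.normal(0.0, R)
        phi = rng.uniform(0.0, 2.0 * np.pi)
        return x_o + d * np.cos(phi), y_o + d * np.sin(phi)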
Using the above-described mechanisms for defining and measuring the errors of the estimated temporal alignments, of the cameras' epipolar geometry and of the feature positions estimated by the tracker, we performed the following two groups of experiments:

1. Varying number of features and tracker's error with a fixed fundamental matrix's error of 2 pixels (ε_f = 2). In this case, for each tuple (ε_f, R, k), where R ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and k ∈ {1, 2, 4, 8, 16, 32}, we simulated 100 distinct 256-frame 3D sequences, computed their corresponding voting spaces and measured the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), before and after the timeline refinement stage.

2. Varying number of features and fundamental matrix's error for a tracker with a gaussian noise that has a fixed standard deviation of ±2 pixels (R = 2). Similarly to the previous group of experiments, for each tuple (ε_f, R, k), where ε_f ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and k ∈ {1, 2, 4, 8, 16, 32}, we simulated 100 distinct 256-frame 3D sequences, computed their voting spaces and measured the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s).
Since a fundamental matrix with an average error of 2 pixels represented
the worst case in the real-world sequences presented in Section 5.1, we decided
to use that value in our first group of experiments above, in order to guarantee
a realistic perception about the scalability of our approach in cases where the
computation of very accurate fundamental matrices is impossible. Similarly,
a tracker with a standard deviation of ±2 pixels was used in the second group
of experiments, since that value represents roughly the noise level observed
in some trackers used in our real-world sequences.
By measuring the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), we illustrate the applicability of our approach in scenarios where we need timelines with highly accurate parameters (ε_t ≤ 1 frame) or in less challenging situations where lower accuracies are allowed (ε_t ≤ 2 frames or even ε_t ≤ 5 frames).
The values used in all experiments for the RANSAC parameters were p = 0.99, r = 0.05 and ε = 0.5. According to Equation (3.3), those values lead to a RANSAC execution of 1840 iterations. Importantly, they represent conservative values that were defined in order to maximize the accuracy of the timeline's parameters, even though they also result in a lower efficiency for our method. Moreover, in order to ensure the acquisition of accurate results, we restricted the search of the temporal alignment algorithm to timelines whose angular coefficient α satisfies 0.2 ≤ α ≤ 5, that is, we used a-priori information that the video sequences have frame rates that differ by at most a factor of 5.
5.2.3 Evaluation of the experimental results
As a first step, we evaluated the scalability of our temporal alignment and refinement methods against an increasing number of moving features per frame. Specifically, we considered scenes with k = 1, 2, 4, 8, 16 and 32 feature(s). In Figure 5.14 we illustrate some examples of voting spaces for those values of k, where the fundamental matrix has an average error of 2 pixels and the tracker is corrupted by a gaussian noise with zero mean and standard deviation of ±2 pixels.
Note that the larger the number of features, the denser the voting spaces. In fact, by increasing the number of features, the ratio between inliers and outliers decreases visibly. Figure 5.15 quantifies this visual perception by presenting the percentage of inliers for each voting space of Figure 5.14, where a specific vote is regarded as an inlier if its temporal coordinates f_1 and f_2 satisfy |f_2 − f_1 + 32| ≤ 1 frame.
Figure 5.14: Examples of voting spaces for k = 1, 2, 4, 8, 16 and 32 features. In all these cases, the fundamental matrix has an average error of 2 pixels (ε_f = 2) and the tracker is corrupted by a gaussian noise with mean zero and standard deviation of ±2 pixels (R = 2).
Figure 5.15: Percentage of inliers for each voting space of Figure 5.14, plotted against log2(k), where k represents the number of tracked features (k = 1, 2, 4, 8, 16 and 32). A specific vote is considered an inlier if its coordinates f_1 and f_2 satisfy |f_2 − f_1 + 32| ≤ 1 frame.
The voting spaces for 16 and 32 features represent especially challenging
cases, where it is impossible to identify visually the actual timeline. Accord-
ing to Figure 5.15, we have less than 2% of inliers in both cases.
Consider now Figures 5.16 to 5.25. Those figures illustrate the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the number of tracked features. In particular, Figures 5.16 to 5.20 illustrate scenarios with distinct tracker errors, but that have in common a fundamental matrix with an average error of 2 pixels. On the other hand, Figures 5.21 to 5.25 illustrate situations where different fundamental matrix errors are considered and a tracker that is corrupted by a gaussian noise with standard deviation of ±2 pixels is used.
By observing the curves that illustrate the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s) in all those figures, both before and after the refinement stage, we note that although sets containing from 2 to 16 features bring similar results when the tracker's noise level and the fundamental matrix's error have small values, as illustrated in Figures 5.16(a)-(f) and 5.21(a)-(f), our approach tends to be more accurate when sets of 4 features are used, especially in scenarios where the tracking system and the initial estimate of the cameras' epipolar geometry have errors with more significant magnitudes, as illustrated in Figure 5.20(d)-(f), where the tracker is corrupted by a noise with standard deviation of ±10 pixels, and in Figures 5.22(a)-(f) and 5.23(a)-(f) for fundamental matrix errors varying from 3 to 6 pixels.
Specifically, if we consider sets of about 4 features, a fundamental matrix error of about 2 pixels and applications that need timelines whose temporal alignment errors are smaller than or equal to 1 frame, we note from Figures 5.16 to 5.20 that the percentage of solutions computed by our method that satisfy this error constraint varies from 100%, as illustrated in Figure 5.16(a) for a tracker with a low noise level (R = 1), to about 30%, as illustrated in Figure 5.20(d) for a case where the tracker was severely corrupted by noise (R = 10). On the other hand, by fixing the tracker's noise level (R = 2) and varying the fundamental matrix's error, we observe from Figures 5.21 to 5.25 that when sets with 4 features are used, the percentage of timelines whose temporal alignment errors are smaller than or equal to 1 frame varies from 90%, as shown in Figure 5.21(a) for a fundamental matrix error of 1 pixel (ε_f = 1), to about 5%, as illustrated in Figure 5.25(d) for a fundamental matrix error of 10 pixels (ε_f = 10). Therefore, note that the accuracy of our approach, in particular the accuracy of the temporal alignment algorithm, is much more negatively affected by an increase in the error of the fundamental matrix than by an equivalent increase of the tracker's noise level.
Importantly, by assuming less challenging upper bounds of 2 and 5 frames for the temporal alignment error, we observe that the percentage of timelines computed by our method that satisfy these error constraints increases significantly. In fact, by using a set with 4 features and considering a temporal alignment error of at most 2 frames, our method succeeds in 65% of the total number of trials (after the refinement stage) even when a tracker with a high noise level (R = 10) is used, as shown in Figure 5.20(e), and in 55% of the trials (after the refinement stage) when a highly inaccurate fundamental matrix (ε_f = 6) is computed, as illustrated in Figure 5.23(e). The results are even better in cases where the temporal alignment error may be up to 5 frames as, for example, in Figures 5.20(f) and 5.25(f), where our approach succeeded in about 100% and 50% of the total number of trials, respectively, even though the tracker's noise level (R = 10) and the fundamental matrix's error (ε_f = 10) had large values.
Additionally, note from Figures 5.16 to 5.25 that the accuracy of the timeline's parameters may decrease significantly when feature sets that are too small (e.g., 1 feature) or too large (e.g., 32 features) are used. This behavior of our methodology is explained by the direct relation between the complexity of the voting space and the number of tracked features considered. When more features are added, the number of outliers increases faster than the number of inliers, as illustrated in Figures 5.14 and 5.15. Therefore, with a higher amount of spurious information in the voting space due to the higher number of features, our approach may compute timelines whose parameters are more inaccurate. On the other hand, the smaller the set of features, the smaller the number of votes in the voting space. In this case, an insufficient number of inliers may also result in the estimation of inaccurate parameters for the timeline that models the temporal alignment between the sequences.
However, Figure 5.16(a)-(c) suggests that in cases where trackers with low noise levels are used, an increase in the number of tracked features does not have much impact on the accuracy of the estimated timeline parameters, especially before the application of the refinement technique. Note in Figure 5.16(a), for example, the small variations in the percentage of correct timelines before the refinement stage as the number of features increases.
To illustrate how the values of the timeline's parameters vary as the number of features increases, we present Figures 5.26 to 5.28. For all the cases illustrated in those figures, we consider that the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels. Moreover, only the parameters of timelines that lead to an average temporal alignment error smaller than or equal to 1 frame were considered. We note that for all numbers of features illustrated in those figures, the refinement stage played an important role: after its application, the average values of the timeline's parameters became closer to the ground-truth parameters and their corresponding variances decreased significantly. This fact is especially noticeable in the cases of sets with 4 and 16 features, which are illustrated in Figures 5.27(a) and 5.28(a), respectively.
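To make this accuracy criterion concrete, the following sketch shows one straightforward way of computing the average temporal alignment error of an estimated timeline against a ground-truth one in the two-sequence case. It is only an illustrative sketch with purely hypothetical parameter values, not the evaluation code used in our experiments.

```python
import numpy as np

def mean_temporal_alignment_error(alpha_est, beta_est, alpha_gt, beta_gt, num_frames):
    """Average absolute difference, in frames, between the correspondences
    t2 = alpha * t1 + beta induced by an estimated timeline and by the
    ground-truth one, over the frames of the reference sequence."""
    t1 = np.arange(num_frames)
    return np.mean(np.abs((alpha_est * t1 + beta_est) - (alpha_gt * t1 + beta_gt)))

# Purely hypothetical values, used only to illustrate the computation.
err = mean_temporal_alignment_error(1.0003, -31.63, 1.0, -31.77, 1000)
print(f"average temporal alignment error: {err:.2f} frames")
```

With the hypothetical values above, the estimate would be counted among the timelines whose average error does not exceed 1 frame.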
Now, consider Figures 5.29 to 5.34. Essentially, those figures provide the same information already presented in Figures 5.16 to 5.25. However, unlike those previous figures, Figures 5.29 to 5.34 illustrate the percentage of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s) as a function of the standard deviation of the Gaussian noise added to the tracker (Figures 5.29 to 5.31) and as a function of the error in the initial estimate of the fundamental matrix (Figures 5.32 to 5.34). Therefore, by analyzing Figures 5.29 to 5.34 together with Figures 5.16 to 5.25, we can better assess the impact that errors in the cameras' epipolar geometry and in the tracking system have on the accuracy and stability of our results.
By observing the curves in Figures 5.29 to 5.34 that illustrate the percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), both before and after the application of the refinement technique, we note that the higher the tracker's noise level and/or the fundamental matrix's error, the greater the tendency of the temporal alignment and refinement methodologies to present lower accuracies. This expected behavior of our approach is more critical when more features are considered, as illustrated in Figures 5.31(a)-(f) and 5.34(a)-(f) for sets with 16 and 32 features.
Two main reasons explain this degradation in the accuracy of our results caused by increases in the tracker's noise level and in the error of the cameras' epipolar geometry. The first reason is that the higher those errors, the lower the quality of the information regarding the actual correspondences between temporal coordinates of feature positions in both image planes. That is, considering the binary representation of a candidate point in the voting space, those errors produce an effect that is equivalent to losing some of its least significant bits. In this context, potential inliers are shifted away from their actual positions in the voting space, with the magnitudes of those shifts proportional to the errors' magnitudes. This behavior may affect the estimation process performed by RANSAC, resulting in timelines with inaccurate parameters.
The second reason is similar to the one pointed out previously regarding increases in the number of features. That is, when trackers and fundamental matrices with higher errors are used, we observe a significant increase of spurious information in the voting spaces. Consequently, this higher number of outliers may negatively affect the effectiveness of RANSAC and lead our approach to compute timelines with inaccurate parameters.
Importantly, the more inaccurate the timeline's parameters, the harder it is for the optimization task formulated in the refinement stage to converge to their actual values. This fact explains the observation from Figures 5.29 to 5.34 that the higher the tracker's noise level and the fundamental matrix's error, the smaller the improvement brought by the application of the refinement technique. Consider, for instance, Figure 5.29(a), where we have only 1 feature in the scene. When the standard deviation of the tracker's noise increases and, consequently, timelines with less accurate parameters are estimated due to the reasons mentioned above, the effectiveness of the refinement technique is severely degraded. Specifically, Figure 5.29(a) suggests that for trackers with standard deviations larger than or equal to 6 pixels, our refinement method should not be used.
By observing Figures 5.29 to 5.31 and establishing that our approach should ideally estimate timelines with average temporal alignment errors smaller than or equal to 1 frame in 95% of the total number of trials, we note that for a fundamental matrix error of about 2 pixels, that goal is achieved only when trackers with low noise levels (standard deviation of about 1 pixel) are used, as illustrated in Figures 5.30(a),(d) and 5.31(a) for sets containing 4, 8 and 16 tracked features.
Finally, we note from Figures 5.29(a),(d), 5.30(a),(d) and 5.31(a),(d) that the exclusive use of the temporal alignment technique is not appropriate when we need timelines whose temporal alignment errors must be smaller than or equal to 1 frame, since for any number of features and noise level considered, the percentage of timelines that satisfied that error constraint before the refinement stage was usually smaller than 30%.
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.16: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±1 pixel. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±2 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.17: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±3 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±4 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.18: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±5 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±6 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.19: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±7 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±8 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.20: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Standard deviation R of the tracker's Gaussian noise is ±9 pixels. (d), (e) and (f) Standard deviation R of the tracker's Gaussian noise is ±10 pixels. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.21: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 1 pixel. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 2 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.22: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 3 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 4 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.23: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 5 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 6 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.24: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 7 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 8 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against log2(k); each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.25: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), for k = 1, 2, 4, 8, 16 and 32 feature(s). (a), (b) and (c) Average error of the fundamental matrix: ε_f = 9 pixels. (d), (e) and (f) Average error of the fundamental matrix: ε_f = 10 pixels. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: scatter plots of temporal shift (β) versus temporal dilation (α), before and after refinement, for each feature set.]
(a) k = 1. Before ref.: m_α = 1.0004, σ²_α = 1.1300e−4, m_β = −31.1767, σ²_β = 2.1003. After ref.: m_α = 1.0004, σ²_α = 3.9847e−5, m_β = −31.7825, σ²_β = 0.6514.
(b) k = 2. Before ref.: m_α = 1.0008, σ²_α = 8.9396e−5, m_β = −31.0969, σ²_β = 2.0429. After ref.: m_α = 1.0009, σ²_α = 1.2767e−5, m_β = −31.7608, σ²_β = 0.2588.
Figure 5.26: Space of estimated parameters for (a) k = 1 feature and (b) k = 2 features. Each point shown corresponds to the parameters of an estimated timeline with temporal alignment error smaller than or equal to 1 frame. m_α and m_β represent the average values, and σ²_α and σ²_β the variances, of the temporal dilation (α) and the temporal shift (β), respectively. In both cases, the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels.
[Figure omitted: scatter plots of temporal shift (β) versus temporal dilation (α), before and after refinement, for each feature set.]
(a) k = 4. Before ref.: m_α = 0.9995, σ²_α = 5.9613e−5, m_β = −30.7143, σ²_β = 0.9581. After ref.: m_α = 1.0003, σ²_α = 1.3732e−5, m_β = −31.6266, σ²_β = 0.1964.
(b) k = 8. Before ref.: m_α = 0.9991, σ²_α = 7.7272e−5, m_β = −30.8503, σ²_β = 1.6661. After ref.: m_α = 1.0001, σ²_α = 1.4973e−5, m_β = −31.6023, σ²_β = 0.3413.
Figure 5.27: Space of estimated parameters for (a) k = 4 features and (b) k = 8 features. Each point shown corresponds to the parameters of an estimated timeline with temporal alignment error smaller than or equal to 1 frame. m_α and m_β represent the average values, and σ²_α and σ²_β the variances, of the temporal dilation (α) and the temporal shift (β), respectively. In both cases, the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels.
[Figure omitted: scatter plots of temporal shift (β) versus temporal dilation (α), before and after refinement, for each feature set.]
(a) k = 16. Before ref.: m_α = 0.9997, σ²_α = 1.7003e−4, m_β = −30.7789, σ²_β = 3.4365. After ref.: m_α = 1.0003, σ²_α = 1.7867e−5, m_β = −31.7365, σ²_β = 0.2663.
(b) k = 32. Before ref.: m_α = 0.9982, σ²_α = 1.2828e−4, m_β = −30.8495, σ²_β = 1.7812. After ref.: m_α = 0.9994, σ²_α = 5.2500e−5, m_β = −31.4488, σ²_β = 0.6915.
Figure 5.28: Space of estimated parameters for (a) k = 16 features and (b) k = 32 features. Each point shown corresponds to the parameters of an estimated timeline with temporal alignment error smaller than or equal to 1 frame. m_α and m_β represent the average values, and σ²_α and σ²_β the variances, of the temporal dilation (α) and the temporal shift (β), respectively. In both cases, the average error of the fundamental matrix and the standard deviation of the Gaussian noise added to the tracker are 2 pixels.
[Figure omitted: six panels plotting the percentage of timelines against the standard deviation of the tracker's noise; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.29: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's noise level. (a), (b) and (c) Results for k = 1 feature. (d), (e) and (f) Results for k = 2 features. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against the standard deviation of the tracker's noise; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.30: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's noise level. (a), (b) and (c) Results for k = 4 features. (d), (e) and (f) Results for k = 8 features. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against the standard deviation of the tracker's noise; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.31: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the tracker's noise level. (a), (b) and (c) Results for k = 16 features. (d), (e) and (f) Results for k = 32 features. In all cases, a fundamental matrix with an average error of 2 pixels is used (ε_f = 2).
[Figure omitted: six panels plotting the percentage of timelines against the fundamental matrix's error; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.32: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the average error of the fundamental matrix. (a), (b) and (c) Results for k = 1 feature. (d), (e) and (f) Results for k = 2 features. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against the fundamental matrix's error; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.33: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the average error of the fundamental matrix. (a), (b) and (c) Results for k = 4 features. (d), (e) and (f) Results for k = 8 features. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
[Figure omitted: six panels plotting the percentage of timelines against the fundamental matrix's error; each panel shows the curves before and after refinement together with the desired level of 95%.]
Figure 5.34: Percentages of timelines that lead to average temporal alignment errors smaller than or equal to 1, 2 and 5 frame(s), as a function of the average error of the fundamental matrix. (a), (b) and (c) Results for k = 16 features. (d), (e) and (f) Results for k = 32 features. In all cases, the tracker is corrupted by a Gaussian noise with standard deviation of ±2 pixels (R = 2).
Chapter 6
Conclusions
Living, one learns; but what one learns, most of all, is how to ask other, bigger questions.
João Guimarães Rosa
6.1 Summary of the Accomplished Work
We have proposed a novel feature-based methodology for aligning multiple video sequences acquired from distinct viewpoints both in time and in space. More specifically, a major contribution of our methodology is that it reduces the computation of temporal and spatio-temporal alignments between sequences to linear regression and linear optimization problems, while previous feature-based techniques need to search the entire space of possible temporal alignments. The quality of the computed alignments and the computational cost of our techniques are invariant to the magnitude of the initial temporal offsets between sequences. Moreover, unlike existing methods, which work for only two video sequences, our approach can handle an arbitrary number of sequences in a single step. Basically, our sequence-to-sequence alignment approach consists of two techniques: (a) one that builds large sets of temporal constraints from a rough spatial alignment between sequences and then performs a robust linear regression in the temporal domain to recover the globally correct temporal alignment, and (b) one that linearizes feature trajectories around the points of intersection with epipolar lines in order to reduce the problem of finding the complete spatio-temporal alignment between two sequences to that of solving a linear system.
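To give a concrete flavour of the first technique, the sketch below shows, in simplified form, how a single line in R^N can be robustly fitted to a cloud of candidate temporal correspondences (the votes). It is an illustrative sketch only: the function name, the fixed iteration count and the inlier tolerance are our own simplifications, not the exact implementation evaluated in Chapter 5.

```python
import numpy as np

def fit_timeline_ransac(votes, num_iters=500, inlier_tol=1.0, rng=None):
    """Robustly fit a line in R^N to candidate temporal correspondences.

    votes: array of shape (M, N); each row is a candidate correspondence
    between frame numbers of the N sequences.
    Returns a point on the line, its unit direction, and the inlier mask."""
    rng = np.random.default_rng() if rng is None else rng
    best_inliers = None
    for _ in range(num_iters):
        i, j = rng.choice(len(votes), size=2, replace=False)
        p, q = votes[i], votes[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-9:          # degenerate sample, skip
            continue
        d = d / norm
        # Perpendicular distance of every vote to the line through p with direction d.
        diff = votes - p
        dist = np.linalg.norm(diff - np.outer(diff @ d, d), axis=1)
        inliers = dist <= inlier_tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-estimate the line from all inliers (principal direction of the inlier cloud).
    pts = votes[best_inliers]
    center = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - center)
    return center, vt[0], best_inliers
```

In our method, each row of votes would be one candidate correspondence between frame numbers of the N sequences, and the recovered line is the timeline.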
We have shown that our work is suitable for solving several types of problems presented in current applications that benefit from the availability of simultaneous video recordings of the same physical event. It is an attractive alternative, since it is less expensive and easier to use outside laboratories than video synchronization hardware, and it can be applied to the many multi-view sequences that already exist in video databases, such as those of sport events.
From a theoretical standpoint, this work is relevant because it provides additional theoretical and empirical evidence that, by considering temporal and spatial cues within a single alignment framework, many physical events that are inherently ambiguous for traditional image-to-image alignment methods are uniquely resolved by sequence-to-sequence alignment techniques.
Our experimental results suggest that our timeline reconstruction algorithm provides a simple and effective method for temporally aligning multiple video sequences. Unlike previous approaches, it is able to handle temporal dilations and large time shifts with no degradation in accuracy, even when scene points move along three-dimensional, overlapping and almost-cyclical trajectories. Importantly, by reducing the alignment problem to a RANSAC-based procedure, our algorithms are able to tolerate large proportions of outliers in the data, high levels of noise, discontinuities in feature trajectories, complete absence of stereo correspondences for moving features, and sequences that contain multiple frame rates. Finally, our approach only requires the ability to track scene points across two consecutive frames of the same sequence, which makes it robust to common tracking problems, such as occlusion, and it does not require the ability to establish feature correspondences between the sequences.
The above-mentioned contributions of the present work and the experiments performed with real-world sequences presented in Section 5.1 were published in one of the most important international peer-reviewed conferences in computer vision:
R. Carceroni, F. Pádua, G. Massahud, K. Kutulakos, "Linear Sequence-to-Sequence Alignment", in Proc. of IEEE Computer Vision and Pattern Recognition Conference, Washington, USA, pp. 746-753, 2004.
Moreover, we are currently preparing a complete journal article, which will be submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). This article includes a careful analysis of the scalability and accuracy of our approach, and discusses the experimental results obtained with synthetic sequences, which are described in Section 5.2.
6.2 Discussion
Although our experimental results demonstrate that our approach can be applied successfully in several challenging scenarios, it has some failure modes that must be pointed out. Firstly, our approach cannot be applied to dynamic scenes in which no moving features can be detected (e.g., the monitored scene is an empty room whose lights have been turned off), or to cases where the scene features move in such a way that their corresponding trajectories in the image plane degenerate to a single point (e.g., a feature moves in a direction that is perpendicular to the image plane). In those scenarios, it would not be possible to generate the voting space for estimating the sought timeline. Another critical situation occurs when the epipolar geometry of a pair of cameras defines epipolar lines that do not intersect the feature trajectories (e.g., the epipolar lines are parallel to the trajectories). Again, in that scenario, we cannot compute the space of candidate temporal alignments. Fortunately, the situations mentioned above do not represent a significant portion of the scenarios found in practice.
Regarding the accuracy of our methodology, we believe that some simple additional changes to our technique could lead to even better results. Firstly, we believe that by using specific object characteristics, such as color and texture, we could avoid the insertion of spurious information into the voting spaces created by our temporal alignment algorithm. In particular, a candidate point should concatenate the coordinates f_i and f_j for cameras i and j, respectively, only if the intersection points that defined them belong to blobs that present the same characteristic of interest (e.g., color).
Moreover, we believe that in cases where it is possible to provide the optimization process formulated by our methodology with a priori information about the range in which the true temporal misalignment lies, we could improve the accuracy of the refined temporal parameter estimates. In this case, by using that a priori information, the optimization algorithm would know when it was diverging from the actual solution and corrective actions could be taken.
6.3 Future Work
Additional theoretical investigations should be considered in future work. Firstly, the methodology proposed in Chapter 3 assumes that all cameras acquire frames at constant (albeit not necessarily identical) temporal sampling rates. Based on that assumption, our approach models the temporal misalignment between a pair of video sequences as a one-dimensional affine transformation. The pairwise temporal relations modelled by that transformation induce a global relationship between the frame numbers of the input sequences, which is captured by the N-dimensional line that we call the timeline. However, this kind of mathematical model is not appropriate when some sequences have variable frame rates. Therefore, the development of an alternative mathematical model that can cope with this problem is an important topic for future research.
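For reference, the affine temporal model assumed above can be stated compactly as follows (a condensed restatement of the model from Chapter 3, with sequence 1 taken as the reference; the exact parameterization used there may differ slightly):

\[ t_i \;=\; \alpha_i\, t_1 + \beta_i, \qquad i = 2, \ldots, N, \]

so that the N-tuple of corresponding frame numbers

\[ \big( t_1,\; \alpha_2 t_1 + \beta_2,\; \ldots,\; \alpha_N t_1 + \beta_N \big) \]

traces a straight line in R^N as t_1 varies: this line is the timeline. When frame rates vary over time, the constants α_i and β_i would become functions of time and the resulting curve would no longer, in general, be a straight line, which is precisely why an alternative model is needed.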
Also, it is necessary to conceive alternative techniques for obtaining initial estimates of the cameras' epipolar geometry. Currently, our approach relies on background features whose image coordinates are processed by the normalized eight-point algorithm in order to estimate the fundamental matrices that capture the geometric relations between the views. However, there are cases where we cannot identify enough static scene points for every pair of video sequences, making the computation of an initial estimate of the cameras' epipolar geometry impossible and, consequently, preventing the use of our methodology.
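For completeness, the sketch below outlines the normalized eight-point algorithm mentioned above (following Hartley, 1997a). It is a simplified sketch: the function names, the (n, 2) array shapes and the absence of degeneracy checks are assumptions of this illustration, not part of our implementation.

```python
import numpy as np

def normalize(pts):
    """Translate points to zero mean and scale them so that the average
    distance to the origin is sqrt(2); returns normalized points and T."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - centroid, axis=1))
    T = np.array([[scale, 0, -scale * centroid[0]],
                  [0, scale, -scale * centroid[1]],
                  [0, 0, 1]])
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ homog.T).T, T

def eight_point(x1, x2):
    """Estimate the fundamental matrix from >= 8 point correspondences
    x1 <-> x2 (arrays of shape (n, 2)), using Hartley's normalization."""
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence x2^T F x1 = 0 gives one linear constraint on F.
    A = np.column_stack([p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
                         p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
                         p1[:, 0], p1[:, 1], np.ones(len(p1))])
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    # Undo the normalization.
    return T2.T @ F @ T1
```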
Finally, another important direction for future work is 3D scene reconstruction. By combining our temporal alignment approach with multi-view stereo techniques, important advances could be achieved in the development of robust systems for reconstructing dynamic 3D scenes, especially those found in old video footage, where multiple replays of the same event are shown from different viewpoints. Moreover, as the capabilities of video standards and receiver hardware advance towards integrated 3D animations, generating realistic content is becoming a limiting factor. In this context, there is an increasing demand for techniques that generate 3D content from reality, i.e., from video sequences acquired with TV cameras, providing the TV viewer with animated 3D reconstructions of physical events and allowing for an immersive experience via free interaction on the receiver side.
Bibliography
Atkinson, K. (1989). An Introduction to Numerical Analysis. John Wiley
and Sons, second edition.
Brown, M. and Lowe, D. (2003). Recognising Panoramas. In Proc. IEEE International Conference on Computer Vision, pages 1218–1225.
Canterakis, N. (2000). A Minimal Set of Constraints for the Trifocal Tensor.
In Proc. European Conference on Computer Vision, pages 84–99. Springer
LNCS 1842.
Cantzler, H. (2004). Random Sample Consensus (RANSAC). Technical
report, Institute for Perception, Action and Behaviour, Division of Infor-
matics, University of Edinburgh.
Carceroni, R. L. (2001). Recovering Non-Rigid 3D Motion, Shape and Re-
flectance from Multi-View Image Sequences: A Differential-Geometric Ap-
proach. PhD thesis, Department of Computer Science - The College Arts
and Sciences, University of Rochester, Rochester, New York.
Caspi, Y. and Irani, M. (2000). A Step Towards Sequence-to-Sequence Alignment. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 682–689, Hilton Head Island, South Carolina.
Caspi, Y. and Irani, M. (2001). Alignment of Non-Overlapping Sequences.
In Proc. IEEE International Conference on Computer Vision, volume 2,
pages 76–83, Vancouver, Canada.
Caspi, Y., Simakov, D., and Irani, M. (2002). Feature-Based Sequence-
to-Sequence Matching. In VAMODS (Vision and Modelling of Dynamic
Scenes) workshop with ECCV, Copenhagen, Denmark.
Clarke, J., Carlsson, S., and Zisserman, A. (1996). Detecting and Tracking
Linear Features Efficiently. In Proc. British Machine Vision Conference,
pages 415–424.
Faugeras, O. and Mourrain, B. (1995). On the Geometry and Algebra of the Point and Line Correspondences Between N Images. In Proc. IEEE International Conference on Computer Vision, pages 951–956.
Faugeras, O. and Papadopoulo, T. (1997). Grassmann-Cayley Algebra for
Modeling Systems of Cameras and the Algebraic Equations of the Mani-
fold of Trifocal Tensors. Technical Report 3225, INRIA, Sophia-Antipolis,
France.
FIFA (2002). FIFA World Cup Archives: Goal of the Century. http://
fifaworldcup.yahoo.com/02/en/pf/h/gotc/launch.html.
FIFA (2004a). FIFA World Cup Archives: Classic Games. http://
fifaworldcup.yahoo.com/06/en/p/cg/index.html.
FIFA (2004b). FIFA World Cup Archives: Photo Gallery. http://
fifaworldcup.yahoo.com/02/en/011221/4/44c.html.
Fischler, M. and Bolles, R. (1981). Random Sample Consensus: A Paradigm
for Model Fitting with Applications to Image Analysis and Automated
Cartography. Communications of the ACM, 24(6):381–395.
Forsyth, D. and Ponce, J. (2002). Computer Vision: A Modern Approach.
Prentice Hall, ISBN: 0130851981, first edition.
Hartley, R. (1997a). In Defense of the Eight-point Algorithm. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 19(6):580–593.
Hartley, R. (1997b). Lines and Points in Three Views and the Trifocal Tensor.
International Journal of Computer Vision, 22(2):125–140.
Hartley, R. (1998). Computation of the Quadrifocal Tensor. In Proc. Euro-
pean Conference on Computer Vision, pages 20–35. Springer-Verlag.
Hartley, R. and Zisserman, A. (2003). Multiple View Geometry in Computer
Vision. Cambridge University Press, ISBN: 0521540518, second edition.
Hefferon, J. (2001). Linear Algebra. Mathematics Department of Saint
Michael’s College, first edition.
Heyden, A. (1995a). Geometry and Algebra of Multiple Projective Transformations. PhD thesis, Department of Mathematics, Lund University, Sweden.
Heyden, A. (1995b). Reconstruction from Multiple Images Using Kinetic
Depths. International Journal of Computer Vision.
Heyden, A. (1998). Algebraic Varieties in Multiple View Geometry. In
Proc. 5th European Conference on Computer Vision, pages 3–19, Freiburg,
Germany.
Heyden, A. and Åström, K. (1997). Algebraic Properties of Multilinear Constraints. Mathematical Methods in the Applied Sciences, 20:1135–1162.
Horst, J. and Beichl, I. (1997). A Simple Algorithm for Efficient Piecewise
Linear Approximation of Space Curves. In Proc. IEEE International Con-
ference on Image Processing.
Isard, M. and MacCormick, J. (2001). BraMBLe: A Bayesian Multiple-Blob
Tracker. In Proc. IEEE International Conference on Computer Vision,
volume 2, pages 34–41, Vancouver, Canada.
Jepson, A., Fleet, D., and El-Maraghi, T. (2003). Robust On-line Appearance Models for Visual Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1296–1311.
Kutulakos, K. and Seitz, S. (2000). A Theory of Shape by Space Carving.
International Journal of Computer Vision, 38(3):199–218.
Kutulakos, K. N. (2000). Approximate N-View Stereo. In Proc. European
Conference on Computer Vision, pages 67–83.
Lee, L., Romano, R., and Stein, G. (2000). Monitoring Activities from Multiple Video Streams: Establishing a Common Coordinate Frame. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22:758–767.
Longuet-Higgins, H. (1981). A Computer Algorithm for Reconstructing a
Scene from Two Projections. Nature, 293:133–135.
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Key-
points. International Journal of Computer Vision, 60(2):91–110.
Luong, Q.-T. and Vieville, T. (1994). Canonic Representations for the
Geometries of Multiple Projective Views. In Proc. 4th European Con-
ference on Computer Vision, volume 800, pages 589–599, Cambridge, UK.
Ma, Y., Soatto, S., Kosecka, J., and Sastry, S. S. (2003). An Invitation to 3-D
Vision - From Images to Geometric Models. Springer-Verlag, first edition.
Matas, J. and Chum, O. (2004). Randomized RANSAC with T_{d,d} Test. Image and Vision Computing, 22(10):837–842.
McLauchlan, P. and Jaenicke, A. (2002). Image Mosaicing Using Sequential
Bundle Adjustment. Image and Vision Computing, 20(9-10):751–759.
Myatt, D., Torr, P., Nasuto, S., Bishop, J., and Craddock, R. (2002). Napsac:
High Noise, High Dimensional Robust Estimation - It’s in the Bag. In Proc.
British Machine Vision Conference, pages 458–467.
Press, W., Flannery, B., Teukolsky, S., and Vetterling, W. (1988). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, first edition.
Pritchett, P. and Zisserman, A. (1998). Wide Baseline Stereo Matching. In
Proc. IEEE International Conference on Computer Vision, pages 754–760.
Rao, C., Gritai, A., Shah, M., and Syeda-Mahmood, T. (2003). View-Invariant Alignment and Matching of Video Sequences. In Proc. of IEEE International Conference on Computer Vision, volume 2, pages 939–945, Nice, France.
Reid, I. and Zisserman, A. (1996). Goal Directed Video Metrology. In Proc.
European Conference on Computer Vision, pages 647–658.
Schaffalitzky, F. and Zisserman, A. (2001). Viewpoint Invariant Texture
Matching and Wide Baseline Stereo. In Proc. 8th IEEE International
Conference on Computer Vision, Vancouver, Canada.
Sharipov, R. (2004). Quick Introduction to Tensor Analysis. Freely distrib-
uted on-line - http://samizdat.mines.edu/tensors/, first edition.
Shashua, A. and Werman, M. (1995). Fundamental Tensor: On the Geome-
try of Three Perspective Views. In Proc. IEEE International Conference
on Computer Vision, pages 920–925.
Shechtman, E., Caspi, Y., and Irani, M. (2002). Increasing Video Resolution in Time and Space. In Proc. European Conference on Computer Vision, Copenhagen, Denmark.
Shi, J. and Tomasi, C. (1994). Good Features to Track. In Proc. IEEE
Conference on Computer Vision and Pattern Recognition, Seattle.
Stein, G. (1998). Tracking from Multiple View Points: Self-Calibration of Space and Time. In DARPA Image Understanding Workshop, pages 521–527, Monterey, USA.
Szeliski, R. (1999). A Multi-View Approach to Motion and Stereo. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, volume 1,
pages 157–163.
Tomasi, C. (2000). Mathematical Methods for Robotics and Vision. Technical
Report CS 205, Stanford University.
Torr, P. (1995). Outlier Detection and Motion Segmentation. PhD thesis,
Department of Engineering Science, University of Oxford.
Torr, P. and Zisserman, A. (1997). Robust Parametrization and Computation
of the Trifocal Tensor. Image and Vision Computing, 15:591–605.
Torr, P. and Zisserman, A. (1999). Feature-Based Methods for Structure
and Motion Estimation. In International Workshop on Vision Algorithms,
pages 278–295.
Tresadern, P. and Reid, I. (2003). Synchronizing Image Sequences of Non-
rigid Objects. In Proc. British Machine Vision Conference, volume 2,
pages 629–638, Norwich.
Triggs, B. (1995). Matching Constraints and the Joint Image. In Proc. IEEE International Conference on Computer Vision, pages 338–343.
Trucco, E. and Verri, A. (1998). Introductory Techniques for 3-D Computer
Vision. Prentice Hall, ISBN: 0132611082, first edition.
Vedula, S., Baker, S., and Kanade, T. (2002). Spatio-Temporal View Inter-
polation. In Proc. Eurographics Workshop on Rendering, pages 65–76.
Viéville, T. and Luong, Q. (1993). Motion of Points and Lines in the Uncalibrated Case. Technical Report RR-2054, INRIA.
Wolf, L. and Zomet, A. (2002a). Correspondence-Free Synchronization and
Reconstruction in a Non-Rigid Scene. In Workshop on Vision and Mod-
elling of Dynamic Scenes, Copenhagen, Denmark.
Wolf, L. and Zomet, A. (2002b). Sequence-to-Sequence Self Calibration. In
Proc. European Conference on Computer Vision, volume 2, pages 370–382.
Zelnik-Manor, L. and Irani, M. (2001). Event-Based Analysis of Video. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, USA.
Zhang, Z. (1998). Determining the Epipolar Geometry and its Uncertainty -
A Review. International Journal of Computer Vision, 27(2):161–195.
Appendix A
The RANSAC Algorithm
The Random Sample Consensus (RANSAC) algorithm is a paradigm for robust fitting of models that was introduced by Fischler and Bolles in 1981 (Fischler and Bolles, 1981). It is robust in the sense of having good tolerance to outliers in the experimental data, and it is capable of interpreting and smoothing data containing a significant percentage of gross errors (Fischler and Bolles, 1981). The estimate is only correct with a certain probability, since RANSAC is a randomised estimator. The algorithm has been applied to a wide range of model parameter estimation problems in computer vision, such as detection of geometric primitives (Clarke et al., 1996), mosaicing (McLauchlan and Jaenicke, 2002), wide baseline stereo matching (Schaffalitzky and Zisserman, 2001; Pritchett and Zisserman, 1998) and motion segmentation (Torr, 1995).
Although RANSAC is a quite simple algorithm, it is a very powerful tool. In an iterative way, it randomly selects subsets of the input data and computes the model parameters that best fit each sample (Matas and Chum, 2004). Samples are drawn uniformly from the input data set, so each point has the same probability of being selected (uniform point sampling). For each sample, a model hypothesis is constructed by computing the model parameters from the sample data. The size of the sample depends on the model one wants to find. Typically, it is the smallest size sufficient for determining the model parameters (Matas and Chum, 2004). For example, to find circles in the data set, one has to draw three points, since three points are required to determine the parameters of a circle.
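As an illustration of this paradigm, the sketch below fits a circle to noisy 2D points by repeatedly drawing the three-point samples mentioned above. It is a minimal didactic sketch (fixed iteration count, simple inlier counting, illustrative function names), not the RANSAC variant used elsewhere in this thesis.

```python
import numpy as np

def circle_from_three_points(p1, p2, p3):
    """Return (center, radius) of the circle through three non-collinear points."""
    A = 2 * np.array([p2 - p1, p3 - p1])
    b = np.array([p2 @ p2 - p1 @ p1, p3 @ p3 - p1 @ p1])
    center = np.linalg.solve(A, b)     # circumcenter from the perpendicular bisectors
    return center, np.linalg.norm(p1 - center)

def ransac_circle(points, num_iters=1000, tol=1.0, rng=None):
    """Fit a circle to 2D points (array of shape (n, 2)) with RANSAC."""
    rng = np.random.default_rng() if rng is None else rng
    best, best_count = None, -1
    for _ in range(num_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        try:
            center, radius = circle_from_three_points(*sample)
        except np.linalg.LinAlgError:  # collinear sample: singular system
            continue
        residuals = np.abs(np.linalg.norm(points - center, axis=1) - radius)
        count = int((residuals <= tol).sum())
        if count > best_count:         # keep the hypothesis with most support
            best, best_count = (center, radius), count
    return best, best_count
```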
A.1 Hypotheses evaluation
In a second phase, the quality of the model parameters is evaluated on the full data set. Different cost functions may be used for the evaluation, the standard one being the number of data points consistent with the model (Matas and Chum, 2004). The process is terminated when the likelihood of finding a better model becomes low (Fischler and Bolles, 1981). Usually, the model parameters estimated by RANSAC are not very precise. Therefore, the estimated model parameters are recomputed by, for example, a least-squares fit to the data subset that supports the best estimate. The input data may support several distinct models. In this case, the parameters of the first model are estimated, the data points supporting that model are removed from the input data, and the algorithm is simply repeated with the remainder of the data set to find the next best model. The strength of the algorithm is that it is likely to draw at least one set of points consisting only of inliers (Cantzler, 2004). Depending on the size of the random samples, RANSAC can handle contamination levels well above 50%, which is commonly assumed to be a practical limit in robust statistics (Matas and Chum, 2004).
A.2 RANSAC parameters
The RANSAC technique uses three parameters to control the model estimation process (Cantzler, 2004). The first parameter is an error tolerance that determines a volume within which all compatible points must fall. The second parameter (p) is the probability that at least one of the random selections is an error-free set of candidates. Finally, the third parameter (r) is the probability that a randomly-selected candidate is an inlier. The parameters p and r define the number of iterations z of RANSAC as follows:

\[ z = \frac{\log(1 - p)}{\log\left(1 - r^{2}\right)} \tag{A.1} \]

Equation (A.1) expresses the fact that z should be large enough to ensure that, with probability p, at least one randomly-selected set of candidates consists only of inliers.
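As a quick worked example with illustrative values (not values used elsewhere in this thesis): for p = 0.99 and r = 0.5, Equation (A.1) gives

\[ z = \frac{\log(1 - 0.99)}{\log\left(1 - 0.5^{2}\right)} = \frac{\log 0.01}{\log 0.75} \approx 16.0, \]

so at least 17 samples would be drawn in practice to reach the desired confidence.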
A.3 RANSAC efficiency
The speed of RANSAC depends on two factors (Matas and Chum, 2004): firstly, the number of samples that have to be drawn to guarantee a certain confidence of obtaining a good estimate, and secondly, the time spent evaluating the quality of each hypothetical model. The latter is proportional to the size of the data set.
Typically, a very large number of erroneous model parameters obtained from contaminated samples are evaluated. Such models are consistent with only a small fraction of the data (Cantzler, 2004). The evaluation of the models can be computationally optimised by randomising the evaluation (Matas and Chum, 2004). Every hypothetical model is first tested only with a small number of random data points from the data set. If a model does not get enough support from this random point set, then one can assume with high confidence that the model is not a good estimate. Models passing the randomised evaluation are then evaluated on the full data set.
The performance of RANSAC degrades with increasing sample size, or when multiple models are supported by the data, due to the decreasing probability of sampling a set that is composed entirely of inliers. A very common observation is that outliers possess a diffuse distribution, whereas inliers tend to be located close together. Therefore, the uniform sampling of points can be replaced by the selection of sample sets based on proximity, taking spatial relationships into account (Myatt et al., 2002). The first sample point is selected randomly, and the remaining points are random points lying within a hypersphere centred on the first one. The selection of sample sets of adjacent points can significantly improve the probability of selecting a set of inliers and thus reduce the number of samples required to find a good model estimate (Matas and Chum, 2004).
Appendix B
Tensorial Notation
This brief introduction to tensor notation is based on the explanations presented in Hartley (1997b) and Hartley and Zisserman (2003). For more details about this topic, the reader is referred to Sharipov (2004) and Triggs (1995). For the sake of simplicity, we will present the concepts involved in tensorial notation in the context of low-dimensional projective spaces, rather than in a general context, exactly as done in Hartley (1997a) and Hartley and Zisserman (2003).
Consider a set of basis vectors $\mathbf{e}_i$, $i = 1, \ldots, 3$, for a 2-dimensional projective space $\mathcal{P}^2$. Let their indices be written as subscripts. With respect to this basis, a point in $\mathcal{P}^2$ is represented by a set of coordinates $q^i$, which represents the point $\sum_{i=1}^{3} q^i \mathbf{e}_i$. The coordinates are written with an upper index. Let $\mathbf{q}$ represent the triple of coordinates, $\mathbf{q} = (q^1, q^2, q^3)^\top$.
Now, consider a change of coordinate axes in which the basis vectors $\mathbf{e}_i$ are replaced by a new basis set $\hat{\mathbf{e}}_j$, where $\hat{\mathbf{e}}_j = \sum_i H^i_j \mathbf{e}_i$, and $H$ is the basis transformation matrix with entries $H^i_j$. If $\hat{\mathbf{q}} = (\hat{q}^1, \hat{q}^2, \hat{q}^3)^\top$ are the coordinates of the vector with respect to the new basis, then we may verify that $\hat{\mathbf{q}} = H^{-1}\mathbf{q}$. Thus, if the basis vectors transform according to $H$, the coordinates of points transform according to the inverse transform $H^{-1}$.
Next, consider a line in $\mathcal{P}^2$ represented by coordinates $\boldsymbol{\lambda}$ with respect to the original basis. With respect to the new basis, it may be verified that the line is represented by a new set of coordinates $\hat{\boldsymbol{\lambda}} = H^\top \boldsymbol{\lambda}$. Thus, the coordinates of the line transform according to $H^\top$.
Finally, let $P$ be a matrix representing a mapping between vector spaces. If $G$ and $H$ represent basis transformations in the domain and range spaces, then with respect to the new bases the mapping is represented by a new matrix $\hat{P} = H^{-1} P G$. Note in these examples that sometimes the matrix $H$ or $H^\top$ is used in the transformation, and sometimes $H^{-1}$.
These three examples of coordinate transformations may be written as follows:
$$\hat{q}^{\,i} = (H^{-1})^i_j\, q^j \qquad \text{(B.1)}$$
$$\hat{\lambda}_i = H^j_i\, \lambda_j \qquad \text{(B.2)}$$
$$\hat{P}^i_j = (H^{-1})^i_k\, G^l_j\, P^k_l \qquad \text{(B.3)}$$
where we use the tensor summation convention that an index repeated in upper and lower positions in a product represents summation over the range of the index. Note that those indices that are written as superscripts transform according to $H^{-1}$, whereas those that are written as subscripts transform as $H$ (or $G$). Note that there is no distinction in tensor notation between indices that are transformed by $H$ and those that are transformed by $H^\top$. In general, tensor indices will transform by either $H$ or $H^{-1}$. Those indices that transform according to $H$ are known as covariant indices and are written as subscripts (Hartley, 1997a). Those indices that transform according to $H^{-1}$ are known as contravariant indices, and are written as superscripts (Hartley, 1997a).
Figure B.1: 3D representation of the trifocal tensor. The picture represents $\lambda_i = \lambda'_j \lambda''_k T^{jk}_i$, which is the contraction of the tensor with the lines $\lambda'$ and $\lambda''$ to produce a line $\lambda$. In pseudo-matrix notation this can be written as $\lambda_i = \boldsymbol{\lambda}'^\top T_i\, \boldsymbol{\lambda}''$, where $(T_i)_{jk} = T^{jk}_i$ (Hartley and Zisserman, 2003).
The number of indices is the valency of the tensor. The sum over an index, e.g. $H^j_i \lambda_j$, is referred to as a contraction; in this case the tensor $H^j_i$ is contracted with the line $\lambda_j$.
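To make these rules concrete, the following NumPy sketch (our illustration, not part of the original text) applies the transformation rules (B.1) and (B.2) to a point and a line and checks that their contraction $\lambda_i q^i$ is unchanged by the change of basis:

    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.standard_normal((3, 3))            # basis transformation, assumed invertible
    q = rng.standard_normal(3)                 # contravariant point coordinates q^i
    lam = np.cross(q, rng.standard_normal(3))  # a line through q, so lam_i q^i = 0

    q_hat = np.linalg.inv(H) @ q               # Equation (B.1): points transform by H^-1
    lam_hat = H.T @ lam                        # Equation (B.2): lines transform by H^T

    # The contraction lam_i q^i (incidence of point and line) is basis-independent.
    print(np.isclose(lam @ q, 0.0), np.isclose(lam_hat @ q_hat, 0.0))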
B.1 Tensorial notation and the trifocal tensor
The trifocal tensor $T^{jk}_i$ has one covariant and two contravariant indices. For vectors and matrices, such as $q^i$, $\lambda_i$ and $P^i_j$, it is possible to write the transformation rules using standard linear algebra notation, e.g. $\mathbf{q}' = H\mathbf{q}$. However, for tensors with three or more indices, this cannot conveniently be done. A vector $\mathbf{q}$ may be thought of as a set of numbers arranged in a column or row, and a matrix $H$ as a 2D array of numbers. Similarly, a tensor with three indices may be thought of as a 3D array of numbers. In particular, the trifocal tensor is a $3 \times 3 \times 3$ cube of cells, as illustrated in Figure B.1.
Appendix C
Multiple View Geometry
C.1 Three-View Geometry
The geometry of three perspective views may be acquired simultaneously
as in a trinocular rig, or acquired sequentially, for example by a camera
moving relative to the scene. Exactly as in the case of two-view geometry, these two situations are geometrically equivalent and will not be differentiated here.
A new multiple-view object, the trifocal tensor, plays an analogous role in three views to that played by the fundamental matrix in two views
(Hartley, 1997b). It encapsulates all the (projective) geometric relations
between three views that are independent of scene structure. In the following
we present its derivation as well as some of its main properties.
C.1.1 The Trifocal Tensor T
The trifocal tensor may be approached in several different manners (Shashua and Werman, 1995; Hartley, 1997b; Hartley and Zisserman, 2003; Canterakis, 2000), but in this section we present its derivation following the description given by Hartley (1997b), where the starting point is taken to be the incidence relationship of three corresponding lines. Finally, we present the application of the trifocal tensor as a tool for transferring points.
Essentially, the trifocal tensor is a triply indexed 3×3×3 array of values.
Therefore, it is fully natural to treat it as a tensor (Viéville and Luong,
1993). In this section we will make use of tensorial notation, in particular
the standard summation convention for repeated indices. Some basics on
tensorial notation are presented in Appendix B.
Line Transfer
Consider the three cameras with image planes $\pi$, $\pi'$ and $\pi''$ in Figure C.1. Their corresponding projection matrices will be denoted by $M = [I\,|\,0]$, $M' = [a^j_i]$ and $M'' = [b^j_i]$. Observe that we are picking the coordinate system attached to the camera $\pi$ as the world reference frame. Let $\boldsymbol{\lambda}$, $\boldsymbol{\lambda}'$ and $\boldsymbol{\lambda}''$ be the image lines of the scene line illustrated in Figure C.1. Their corresponding planes in space are given by $\boldsymbol{\varphi} = M^\top \boldsymbol{\lambda}$, $\boldsymbol{\varphi}' = M'^\top \boldsymbol{\lambda}'$ and $\boldsymbol{\varphi}'' = M''^\top \boldsymbol{\lambda}''$. Our goal here is to find a relationship between the coordinates of these three lines.
Given that the three image lines are derived from a single line in space, it follows that $\boldsymbol{\varphi}$, $\boldsymbol{\varphi}'$ and $\boldsymbol{\varphi}''$ must meet at this line in space. This fact leads to a linear dependency between the coordinates of these three planes. In particular, there exist constants $\rho_1$ and $\rho_2$ such that we have
$$\boldsymbol{\varphi} = \rho_1 \boldsymbol{\varphi}' + \rho_2 \boldsymbol{\varphi}''. \qquad \text{(C.1)}$$
Figure C.1: Three images of a line define it as the intersection of three planes in the same pencil.
Writing Equation (C.1) in terms of the planes' coordinates we obtain
$$\lambda_i = \rho_1\, a^j_i \lambda'_j + \rho_2\, b^k_i \lambda''_k, \quad (i = 1, \ldots, 3) \qquad \text{(C.2)}$$
$$0 = \rho_1\, a^j_4 \lambda'_j + \rho_2\, b^k_4 \lambda''_k. \qquad \text{(C.3)}$$
From Equation (C.3) we deduce that $\rho_1 \cong (b^k_4 \lambda''_k)$ and $\rho_2 \cong -(a^j_4 \lambda'_j)$. The $\cong$ notation denotes equality up to an unknown scale factor. Thus, we may
rewrite Equation (C.2) as
$$\lambda_i \cong (b^k_4 \lambda''_k)(a^j_i \lambda'_j) - (a^j_4 \lambda'_j)(b^k_i \lambda''_k), \qquad \text{(C.4)}$$
$$\;\;\; = \lambda'_j \lambda''_k\, (a^j_i b^k_4 - a^j_4 b^k_i). \qquad \text{(C.5)}$$
If we define a $3 \times 3 \times 3$ tensor $T^{jk}_i$ by the expression
$$T^{jk}_i = a^j_i b^k_4 - a^j_4 b^k_i, \qquad \text{(C.6)}$$
we obtain the following equation
$$\lambda_i \cong \lambda'_j \lambda''_k\, T^{jk}_i. \qquad \text{(C.7)}$$
The tensor $T^{jk}_i$ is the so-called trifocal tensor (Hartley and Zisserman, 2003) and may be thought of as a set of three $3 \times 3$ matrices of rank 2, which is evident from Equation (C.6), where it is expressed as the sum of two outer products. Given $T^{jk}_i$ and the coordinates $\lambda'_j$, $\lambda''_k$ of corresponding lines, Equation (C.7) may be used to compute the line in the other image. This process is called line transfer.
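As a numerical sanity check of Equations (C.6) and (C.7), the NumPy sketch below (our own illustration; the cameras, the scene line and all variable names are synthetic assumptions) builds the trifocal tensor from two projection matrices, with the first camera fixed at $[I\,|\,0]$, and transfers a line from the second and third views back to the first:

    import numpy as np

    rng = np.random.default_rng(1)
    Mp = rng.standard_normal((3, 4))    # M'  = [a^j_i], second camera
    Mpp = rng.standard_normal((3, 4))   # M'' = [b^j_i], third camera; first camera is [I | 0]

    # Equation (C.6): T[i, j, k] = a^j_i b^k_4 - a^j_4 b^k_i (0-based indices here).
    T = (np.einsum('ji,k->ijk', Mp[:, :3], Mpp[:, 3])
         - np.einsum('j,ki->ijk', Mp[:, 3], Mpp[:, :3]))

    # A synthetic scene line through two 3D points, imaged in the three views.
    Q1, Q2 = rng.standard_normal(4), rng.standard_normal(4)
    M = np.hstack([np.eye(3), np.zeros((3, 1))])
    image_line = lambda P: np.cross(P @ Q1, P @ Q2)   # line through the two projected points
    lam, lam_p, lam_pp = image_line(M), image_line(Mp), image_line(Mpp)

    # Equation (C.7): line transfer to the first view, correct up to scale.
    lam_transferred = np.einsum('j,k,ijk->i', lam_p, lam_pp, T)
    unit = lambda v: v / np.linalg.norm(v)
    print(np.allclose(np.cross(unit(lam_transferred), unit(lam)), 0.0, atol=1e-10))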
Importantly, if at least 13 line matches are known, it is possible to solve for the entries of the tensor $T^{jk}_i$, since each line match provides two linear equations in the 27 unknown tensor entries. In particular, if the line $\boldsymbol{\lambda}$ is specified by two points on the line, then each such point $\mathbf{q} = (q^1, q^2, q^3)^\top$ generates an equation such as
$$q^i \lambda'_j \lambda''_k\, T^{jk}_i = 0. \qquad \text{(C.8)}$$
Point Transfer
Suppose that a scene point $\mathbf{Q}$ is seen at positions $\mathbf{q}$, $\mathbf{q}'$ and $\mathbf{q}''$ in the three image planes $\pi$, $\pi'$ and $\pi''$, respectively, where $\mathbf{q}$ (and similarly $\mathbf{q}'$ and $\mathbf{q}''$) is represented in homogeneous coordinates. Now, we wish to find a relationship between the coordinates of those three image points. It is important to say that at any point in the following derivation we may set the coordinates $q^3$, $q'^3$ and $q''^3$ to 1 to obtain equations relating to measured image coordinates.
Consider that the cameras' projection matrices are denoted by $M = [I\,|\,0]$, $M' = [a^j_i]$ and $M'' = [b^j_i]$. Since $\mathbf{q} \cong [I\,|\,0]\,\mathbf{Q}$, we have
$$\mathbf{Q} = \begin{pmatrix} \mathbf{q} \\ w \end{pmatrix}, \qquad \text{(C.9)}$$
for some $w$ yet to be determined.
Therefore, projecting the scene point $\mathbf{Q}$ in the image plane $\pi'$ by the usual formula $q'^i \cong a^i_j Q^j$, we have
$$q'^i \cong a^i_k q^k + a^i_4 w. \qquad \text{(C.10)}$$
As performed by Hartley (1997b), we may eliminate this scale factor to obtain the equations
$$q'^i \left( a^j_k q^k + a^j_4 w \right) = q'^j \left( a^i_k q^k + a^i_4 w \right). \qquad \text{(C.11)}$$
Each choice of the free indices i and j gives a separate equation. Of the
three resulting equations, only two are independent. Thus, we obtain three
separate estimates for w as follows
$$w = \frac{q^k \left( q'^i a^j_k - q'^j a^i_k \right)}{q'^j a^i_4 - q'^i a^j_4}. \qquad \text{(C.12)}$$
Substituting $w$ in Equation (C.9) by its value in Equation (C.12), we can write the scene point $\mathbf{Q}$ as
$$\mathbf{Q} = \begin{pmatrix} \mathbf{q} \\[4pt] \dfrac{q^k \left( q'^i a^j_k - q'^j a^i_k \right)}{q'^j a^i_4 - q'^i a^j_4} \end{pmatrix}. \qquad \text{(C.13)}$$
Finally, projecting the scene point $\mathbf{Q}$ in the image plane $\pi''$ ($q''^l \cong b^l_k Q^k$), we obtain
$$q''^l \cong b^l_k q^k \left( q'^j a^i_4 - q'^i a^j_4 \right) + b^l_4\, q^k \left( q'^i a^j_k - q'^j a^i_k \right) \qquad \text{(C.14)}$$
$$\;\;\;\cong q^k q'^i \left( a^j_k b^l_4 - a^j_4 b^l_k \right) - q^k q'^j \left( a^i_k b^l_4 - a^i_4 b^l_k \right). \qquad \text{(C.15)}$$
Observe that the tensor coefficients $T^{jk}_i$ can be easily identified in Equation (C.15):
$$q''^l \cong q^k \left( q'^i T^{jl}_k - q'^j T^{il}_k \right). \qquad \text{(C.16)}$$
As before, we may eliminate the unknown scale factor to obtain the equations
$$q^k \left( q'^i q''^l T^{jm}_k - q'^j q''^l T^{im}_k - q'^i q''^m T^{jl}_k + q'^j q''^m T^{il}_k \right) = 0^{ijlm}. \qquad \text{(C.17)}$$
As mentioned by Hartley (1997b), these are the trilinearity relationships of Shashua and Werman (1995). The indices $i$, $j$, $l$ and $m$ are free variables, and there is one equation for each choice of indices with $i \neq j$ and $l \neq m$. We may assume that $i < j$ and $l < m$, since we obtain the same relation by interchanging $i$ and $j$, or $l$ and $m$. Therefore, from Expression (C.17) we obtain nine possible equations. As only two of the three choices of the pair $(i,j)$ give independent equations, and the same is true for the pairs $(l,m)$, we get only four linearly independent equations.
One natural choice of the four independent equations from Expression (C.17) is obtained by setting $j = m = 3$ and letting $i$ and $l$ range freely, given the conditions stated before ($i \neq j$, $l \neq m$, $i < j$ and $l < m$). Thus, if we set $q^3$, $q'^3$ and $q''^3$ to 1, we obtain the relationship between image coordinates in Equation (C.18)
$$q^k \left( q'^i q''^l T^{33}_k - q''^l T^{i3}_k - q'^i T^{3l}_k + T^{il}_k \right) = 0^{i3l3} \quad (i, l = 1, 2). \qquad \text{(C.18)}$$
Equations (C.8) and (C.18) show how the entries of the trifocal tensor $T$ are involved in the processes of transferring lines and points. In particular, each line correspondence provides two linear constraints on the entries $T^{jk}_i$, whereas each point correspondence provides four linear equations. Since $T$ has 27 entries, 26 equations are needed to solve for the $T^{jk}_i$ up to scale; that is, provided that $2 \cdot \#\text{lines} + 4 \cdot \#\text{points} \geq 26$ (Hartley and Zisserman, 2003), we have enough matches to completely determine the entries of the trifocal tensor. The main methods for computing trifocal tensors (Hartley, 1997b; Torr and Zisserman, 1997; Faugeras and Papadopoulo, 1997) are based on that restriction, and most of them are described in detail by Hartley and Zisserman (2003).
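A similar numerical check can be made for point transfer via Equation (C.16). The sketch below is our own illustration with synthetic cameras; the particular choice of free indices ($i = 1$, $j = 2$) is one generic option and may need to be changed in degenerate configurations:

    import numpy as np

    rng = np.random.default_rng(2)
    Mp, Mpp = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
    # Trifocal tensor from Equation (C.6), with the first camera fixed at [I | 0].
    T = (np.einsum('ji,k->ijk', Mp[:, :3], Mpp[:, 3])
         - np.einsum('j,ki->ijk', Mp[:, 3], Mpp[:, :3]))

    Q = rng.standard_normal(4)                         # a synthetic scene point
    q = np.hstack([np.eye(3), np.zeros((3, 1))]) @ Q   # its image in the first view
    q_p, q_pp = Mp @ Q, Mpp @ Q                        # images in the second and third views

    # Equation (C.16) with free indices i = 1, j = 2 (0-based: 0 and 1):
    i, j = 0, 1
    q_pp_est = np.einsum('k,kl->l', q, q_p[i] * T[:, j, :] - q_p[j] * T[:, i, :])

    unit = lambda v: v / np.linalg.norm(v)
    print(np.allclose(abs(unit(q_pp_est) @ unit(q_pp)), 1.0))  # equal up to scale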
C.2 N-View Geometry
In the study of the geometry of multiple views, the fundamental matrix
and the trifocal tensor have proven to be essential tools. In the case of four
views, the quadrifocal tensor was proposed as a natural extension of those
techniques (Heyden, 1995a; Hartley, 1998). Because of the added stability of
a fourth view, use of the quadrifocal tensor should lead to greater accuracy
than two- and three-view techniques (Heyden, 1995a). However, the four-view tensor has not been given much attention in the literature (Hartley and Zisserman, 2003). One of the main reasons for its rare use is its over-parametrization: 81 components of the tensor are used to describe a
geometric configuration that depends only on 29 parameters (Hartley, 1998).
This fact can lead to significant inaccuracies if additional constraints are not
applied.
With regard to the general case of multiple images taken by N differ-
ent cameras, several different approaches have been proposed in the litera-
ture (Luong and Vieville, 1994; Faugeras and Mourrain, 1995; Heyden and
Åström, 1997; Heyden, 1998). In Heyden (1998), a common framework was
proposed for the definition and operations on the different multiple view ten-
sors. The author showed that there are essentially three different ways to
encode an N-view geometry, namely by using bifocal tensors or fundamen-
tal matrices, trifocal tensors and quadrifocal tensors (combinations of these
three tools are also possible).
According to his work, if only bifocal tensors are used for a sequence of temporally corresponding images, it is sufficient to use the tensors $^{i,i+1}F$ and $^{i,i+2}F$ for every triplet $(^{i,i+1}F,\, ^{i,i+2}F,\, ^{i+1,i+2}F)$, where $i$ denotes the $i$-th view and $^{i,i+1}F$ denotes the fundamental matrix between views $i$ and $i+1$. Using this representation, and remembering that each bifocal tensor has 7 independent parameters (Hartley and Zisserman, 2003), we have $(N-1) + (N-2) = 2N-3$ bifocal tensors and $N-2$ such triplets obeying three constraints each, giving in total $7(2N-3) - 3(N-2) = 11N-15$ parameters describing the N-view geometry, i.e. the minimal number (Heyden, 1998).
On the other hand, if only trifocal tensors are used, it is sufficient to use the tensors $^{i+1,i+2}_{\;\;i}T$. Using this representation we have $N-2$ trifocal tensors with 18 independent parameters each, and for every consecutive pair $(^{i+1,i+2}_{\;\;i}T,\, ^{i+2,i+3}_{\;\;i+1}T)$ of trifocal tensors we get 7 constraints from the compatibility of $^{i+1,i+2}F$, which can be calculated from both. Since there are $N-3$ such pairs, we get $18(N-2) - 7(N-3) = 11N-15$ parameters describing the N-view geometry, i.e. the minimal number (Heyden, 1998).
Finally, when only quadrifocal tensors (represented by the letter $Q$) are used, it is sufficient to use the tensors $^{i,i+1,i+2,i+3}Q$. Using this representation we have $N-3$ quadrifocal tensors with 29 independent parameters each, and for every consecutive pair $(^{i,i+1,i+2,i+3}Q,\, ^{i+1,i+2,i+3,i+4}Q)$ of quadrifocal tensors we get 18 constraints from the compatibility of $^{i+2,i+3}_{\;\;i+1}T$, which can be calculated from both. Since there are $N-4$ such pairs, we get $29(N-3) - 18(N-4) = 11N-15$ parameters describing the N-view geometry, i.e. the minimal number (Heyden, 1998).
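These three counts can be verified with a few lines of arithmetic; the short Python check below (our own) confirms that each representation yields the minimal $11N - 15$ parameters:

    def bifocal(N):     return 7 * (2 * N - 3) - 3 * (N - 2)
    def trifocal(N):    return 18 * (N - 2) - 7 * (N - 3)
    def quadrifocal(N): return 29 * (N - 3) - 18 * (N - 4)

    for N in range(4, 11):
        assert bifocal(N) == trifocal(N) == quadrifocal(N) == 11 * N - 15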
An enumeration of the complete set of multilinear relations, formulae
for the multiview tensors, and the analysis of the number of independent
equations derived from point correspondences can be found in Hartley and
Zisserman (2003) and Heyden (1998).