During my final year at Oxford I spent many nights ruminating on the topic of computer vision. Haggard, bearded, and jaded by the prospect of partying and gallivanting that once dominated my life, I was left with only my thesis. Over the course of a year, I transformed from a youthful Frodo into a schizophrenic Gollum, hunched over a laptop and typing furiously as freshers frolicked beyond the library window.

All jokes and prose aside, my time at Oxford was awesome and irreplaceable. But given that my master’s thesis did, in fact, consume a sizeable chunk of my lifespan, it would be unfortunate if it got lost in the colossal annals of academia. So I shall post it here for the world (i.e. 3 people tops) to read, where it will instead become lost in the deep web.


Paper

Poster

Abstract

This report develops a large-scale, offline localisation and mapping scheme for textured surfaces using the bag-of-visual-words model. The proposed system builds heavily on the work of Philbin et al. and Vedaldi and Fulkerson, taking as input a corpus of images that are rapidly indexed and correlated to construct a coherent map of the imaged scene. Thereafter, new images can be reliably queried against the map to obtain their corresponding 3D camera pose. A simple bundle adjustment algorithm is formulated to enforce global map consistency, exhibiting good performance on real datasets and large loop closures. Furthermore, a proposed submapping scheme provides dramatic computational improvements over the baseline without any deterioration in localisation performance. A comprehensive implementation written in MATLAB as proof of concept is pressure-tested against a variety of textures to examine the conditions for system failure. The application unambiguously outperforms the naked eye, demonstrating an ability to discriminate between very fine details in diverse settings. Tests validate its robustness against broad changes in lighting and perspective, as well as its notable resilience to high levels of noise. Areas for further development to augment the original work and achieve real-time performance are also suggested.
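As an aside, "bundle adjustment" generally means jointly refining all camera poses and 3D point positions so as to minimise the total reprojection error over the map. In its standard textbook form (my notation here, not necessarily the exact formulation used in the thesis) the objective is

\[
\min_{\{C_i\},\,\{X_j\}} \; \sum_{(i,j)} \bigl\lVert \mathbf{x}_{ij} - \pi(C_i, X_j) \bigr\rVert^2,
\]

where the sum runs over the observed point-image correspondences, \(C_i\) is the pose of camera \(i\), \(X_j\) is the position of the \(j\)-th map point, \(\mathbf{x}_{ij}\) is the observed image location of point \(j\) in image \(i\), and \(\pi\) is the camera projection function.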

I conducted my research in the Visual Geometry Group under the supervision of Prof Andrea Vedaldi. The project focused on building a system to coherently reconstruct large-scale “textured” scenes using a single camera. The idea was to see whether a computer could reconstruct an environment that, to the human eye, contains very little distinctive physical or visual information. A complete architecture for offline localisation and mapping using the bag-of-visual-words (BoW) model was proposed and subsequently implemented in MATLAB.

The poster above provides a good overview of the system, but to give you a better sense of what the algorithm does at a high level, consider the image below. First, “interesting” features are extracted from a corpus of training images using the Scale-Invariant Feature Transform (SIFT); these are typically distinctive regions such as areas of high contrast, corners and edges. The features are then quantised into visual words using k-means clustering. Once a “vocabulary” has been established, features can be extracted from a large set of images and geometrically matched, so that the scene can be stitched together and reconstructed. Thereafter, a new input image can be rapidly localised within the scene. A rough sketch of these steps is given below.
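To make this concrete, here is a minimal MATLAB sketch of the feature extraction and quantisation steps using the VLFeat toolbox by Vedaldi and Fulkerson, whose work the system builds on. The file names, vocabulary size and toolbox path are illustrative assumptions rather than values from the thesis:

% Sketch of the vocabulary-building pipeline using VLFeat.
% vl_sift, vl_kmeans, vl_kdtreebuild and vl_kdtreequery are real VLFeat
% functions; the inputs below are placeholders.
run('vlfeat/toolbox/vl_setup');                 % assumed VLFeat install path

trainImages = {'scene01.jpg', 'scene02.jpg'};   % hypothetical training corpus
descriptors = [];
for i = 1:numel(trainImages)
    im = im2single(rgb2gray(imread(trainImages{i})));
    [~, d] = vl_sift(im);                       % 128-D SIFT descriptors, one per column
    descriptors = [descriptors, single(d)];     % vl_sift returns uint8 descriptors
end

% Quantise the descriptors into a visual vocabulary with k-means
numWords = 1000;                                % illustrative vocabulary size
vocab = vl_kmeans(descriptors, numWords);

% Assign the features of a new image to their nearest visual words,
% using a kd-tree for fast approximate nearest-neighbour search
kdtree = vl_kdtreebuild(vocab);
imNew  = im2single(rgb2gray(imread('query.jpg')));
[~, dNew] = vl_sift(imNew);
words  = vl_kdtreequery(kdtree, vocab, single(dNew));

The resulting visual-word histograms can then be compared across images, typically with tf-idf weighting in the style of Philbin et al., to shortlist candidate matches before geometric verification.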

The MATLAB implementation can be downloaded from the link below.

Download Code

Alternatively, if you use Git, you can clone the repository from GitHub using the command:

git clone https://github.com/jaijuneja/texture-localisation-matlab.git