Indexing Very Large Text Data
With the emergence of digital libraries with non-textual content, there is a clear need for improved techniques to organize large quantities of information. It appears that the textual descriptions associated with non-textual content are an important source of information when judging topical relevance. The CoPhIR (Content-based Photo Image Retrieval) data set, a multimedia collection of over 100 million images with associated metadata, serves as the basis of the experiments. The main objective of this thesis is to study Lucene technology for indexing text data and, using this technology, to implement indexing and content-based image searching of the CoPhIR collection. The procedure for creating an index from the initial data set is described in detail, including possible pitfalls. This work also surveys efforts that focus on information retrieval and presents some challenges for retrieval in both image indexing and searching.