SIGIR 2012 Workshop on

Open Source Information Retrieval

16 August 2012


Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval


OpenSearchLab and the Lucene Ecosystem

Grant Ingersoll (Lucid Imagination)


Many of the stated goals of OpenSearchLab can be pieced together using open source tools from the Lucene ecosystem. While not an official packaging of software, the Lucene ecosystem collectively includes tools for large-scale crawling (Nutch), content extraction (Tika), indexing, search, faceting, highlighting, and spellchecking (all covered by Lucene/Solr), and machine learning and NLP (Mahout/OpenNLP). These tools have been proven to work at scale while saving time and development effort. Additionally, Lucene 4 in particular has a number of improvements that make Lucene more amenable to IR research involving scoring models, compression, faceting, storage and more. In this talk, Lucene committer Grant Ingersoll will take a quick tour of the ecosystem, review what's new in Lucene 4, and then sketch a potential architecture for OpenSearchLab.

The Lemur Project and its ClueWeb12 Dataset

Jamie Callan (CMU)


For a dozen years the Lemur Project has provided open-source software, datasets, and information services that support research and education related to information retrieval and other language technologies. The Lemur Project's most successful products include the Lemur Toolkit, the Indri and Galago search engines, the ClueWeb09 dataset, and ClueWeb09 search services. This talk begins with a brief overview of the Lemur Project, the range of products and services that it provides, and how the project operates.

The Lemur Project's newest creation is ClueWeb12, a large collection of English web pages that complements the project's previous ClueWeb09 dataset. This talk provides an in-depth description of the goals for the dataset, how the crawl was conducted, and how raw crawl data was post-processed into a useful research dataset.

Full Papers

• L. Zhao, X. Liu, J. Callan
WikiQuery - An Interactive Collaboration Interface for Creating, Storing and Sharing Effective CNF Queries

• T. Beckers, S. Dungs, N. Fuhr, M. Jordan, S. Kriewel
ezDL: An Interactive Search and Evaluation System

• A. Bialecki, R. Muir, G. Ingersoll
Apache Lucene 4

• M.-A. Cartright, S. Huston, H. Field
Galago: A Modular Distributed Processing and Retrieval System

• M. Khabsa, S. Carman, S. R. Choudhury, C. L. Giles
A Framework for Bridging the Gap Between Open Source Search Tools

• A. Trotman, X. Jia, M. Crane
Towards an Efficient and Effective Search Engine


• M.-D. Albakour, C. Macdonald, I. Ounis, A. Pnevmatikakis, J. Soldatos
SMART: An Open Source Framework for Searching the Physical World

• T. Gollub, S. Burrows, B. Stein
First Experiences with TIRA for Reproducible Evaluation in Information Retrieval

• P. Jourlin, R. Deveaud, E. Sanjuan-Ibekwe, J.-M. Francony, F. Papa
Design, Implementation and Experiment of a YeSQL Web Crawler

• C. Macdonald, R. McCreadie, R. L. T. Santos, I. Ounis
From Puppy to Maturity: Experiences in Developing Terrier

• H. Turtle, Y. Hegde, S. Rowe
Yet Another Comparison of Lucene and Indri Performance

• M. Yasukawa, J. S. Culpepper, F. Scholer
Phonetic Matching in Japanese


• T. Beckers, S. Dungs, N. Fuhr, M. Jordan, S. Kriewel, V. Tran
Demo of ezDL

We will demonstrate our ezDL search system for digital libraries, the Khresmoi variant with search for similar images, and a special Khresmoi interface for radiologists. We will also show a web interface to the ezDL backend and the use of ezDL for evaluations. The ezDL search system allows for searching in heterogeneous digital libraries and working with results by sorting, filtering, grouping, analyzing, and filing. Within the Khresmoi system, users can drag and drop images into the search query and search for similar images, possibly using relevance feedback. For radiologists, images can be 3D volumes; ezDL can be used to inspect those images by zooming, panning, and scrolling in all 3 dimensions.

• S. Gog, M. Petri, J. S. Culpepper, A. Moffat
SDSL: Succinct Data Structure Library

A recent trend in IR research is to build IR systems based on succinct data structures such as the wavelet tree. In theory the properties of succinct data structures are amazing: they take space close to the compressed size of the underlying object and still provide all functionality in almost the same time complexity. We have developed a C++ template library which offers a wide range of useful succinct data structures to foster practical IR research in this area. The software is open source, highly efficient, 64-bit, tested, and has already been used in IR projects. For instance, it was used in the SIGIR 2012 paper of Culpepper and Petri. The source code is available at

• T. Gollub
Showcasing the Open Source Experimentation Platform TIRA

We will demonstrate TIRA, our open source experimentation platform for information retrieval research. Based on an example experiment program running on the local demonstration computer, we will first explain how to start TIRA and how to deploy experiment programs to it. Once the example experiment program is deployed, we will showcase the web service TIRA provides for it. The web service allows the execution of experiment runs with individual parameter settings and provides basic result retrieval facilities. Every experiment run is made accessible under a unique URL, providing a convenient way for disseminating experiment results. Further on, we will demonstrate how to integrate experiment programs from remote TIRA instances as well as how to deploy experiment programs onto remote TIRA instances. In the second part of the demonstration, we will show how TIRA can be utilized as an evaluation platform for competitions. Here, the PAN 2012 competition on plagiarism detection we conducted using TIRA serves as a vivid example.

• A. Trotman, X.-F. Jia, M. Crane
The ATIRE Open Source Search Engine

The ATIRE open source search engine was designed and implemented at the University of Otago with code additionally contributed by postgraduate students at the Queensland University of Technology. The search engine is written in C++ and is object oriented. It supports a number of modern ranking functions including: BM25; language models; non-parametric divergence from randomness models; PageRank; and combined PageRank with BM25. It uses impact-ordered indexes and supports many index compression algorithms including: variable byte; word-based (such as Simple-9); and bit-based (such as Elias delta). Standard features such as: stemming; relevance feedback; and snippet generation are also supported. Our search engine has proven to scale: we recently indexed ClueWeb09 Category A (with spam removal) on a single server in under 1 day, and can directly generate TREC runs for it. We will demonstrate the search engine and discuss its design and implementation with participants.
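
For readers unfamiliar with the variable byte compression mentioned in the ATIRE abstract, the following is a minimal sketch of one common textbook variant, applied to the gaps between sorted document IDs. It is not taken from ATIRE: the function names (`to_gaps`, `from_gaps`, `vbyte_encode`, `vbyte_decode`) and the choice to mark the final byte of each number with the high bit are assumptions made for illustration, and real engines may use a different byte layout.

```python
# Illustrative variable byte (vbyte) coding of a postings list.
# Assumption: 7 payload bits per byte, least significant group first,
# with the high bit SET on the last byte of each number. This is one
# common convention and is not claimed to match any particular engine.

def to_gaps(doc_ids):
    """Convert strictly increasing document IDs into gaps (differences)."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Invert to_gaps by taking a running sum."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """Encode non-negative integers into a compact byte string."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("vbyte encodes non-negative integers only")
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, high bit clear: more bytes follow
            n >>= 7
        out.append(n | 0x80)      # last group, high bit set: number ends here
    return bytes(out)

def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers = []
    n = 0
    shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:           # terminator byte: emit and reset
            numbers.append(n)
            n = 0
            shift = 0
        else:
            shift += 7
    return numbers

doc_ids = [3, 7, 8, 130, 1000]
encoded = vbyte_encode(to_gaps(doc_ids))
decoded = from_gaps(vbyte_decode(encoded))
```

Small gaps fit in a single byte, which is why gap encoding is usually applied before byte-level compression of this kind. In this example only the gap of 870 needs two bytes.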