SIGIR 2012 Workshop on Open Source Information Retrieval, 16 August 2012
Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval
OpenSearchLab and the Lucene Ecosystem
Grant Ingersoll (Lucid Imagination)
[Slides]
Many of the stated goals of OpenSearchLab can be pieced together using open source tools from the Lucene ecosystem. While not an official packaging of software, the Lucene ecosystem collectively includes tools for large-scale crawling (Nutch), content extraction (Tika), indexing, search, faceting, highlighting, and spellchecking (all covered by Lucene/Solr), and machine learning and NLP (Mahout/OpenNLP), all of which have been proven to work at scale while saving time and development effort. Additionally, Lucene 4 in particular has a number of improvements that make Lucene more amenable to IR research involving scoring models, compression, faceting, storage, and more. In this talk, Lucene committer Grant Ingersoll will take a quick tour of the ecosystem, look at what's new in Lucene 4, and then look at what a potential architecture might look like for OpenSearchLab.
The Lemur Project and its ClueWeb12 Dataset
Jamie Callan (CMU)
[Slides]
For a dozen years the Lemur Project has provided open-source software, datasets, and information services that support research and education related to information retrieval and other language technologies. The Lemur Project's most successful products include the Lemur Toolkit, the Indri and Galago search engines, the ClueWeb09 dataset, and ClueWeb09 search services. This talk begins with a brief overview of the Lemur Project, the range of products and services that it provides, and how the project operates.
The Lemur Project's newest creation is ClueWeb12, a large collection of English web pages that complements the project's previous ClueWeb09 dataset. This talk provides an in-depth description of the goals for the dataset, how the crawl was conducted, and how raw crawl data was post-processed into a useful research dataset.
• L. Zhao, X. Liu, J. Callan
WikiQuery - An Interactive Collaboration Interface for Creating, Storing and Sharing Effective CNF Queries
[PDF]
• T. Beckers, S. Dungs, N. Fuhr, M. Jordan, S. Kriewel
ezDL: An Interactive Search and Evaluation System
[PDF]
• A. Bialecki, R. Muir, G. Ingersoll
Apache Lucene 4
[PDF]
• M.-A. Cartright, S. Huston, H. Field
Galago: A Modular Distributed Processing and Retrieval System
[PDF]
• M. Khabsa, S. Carman, S. R. Choudhury, C. L. Giles
A Framework for Bridging the Gap Between Open Source Search Tools
[PDF]
• A. Trotman, X. Jia, M. Crane
Towards an Efficient and Effective Search Engine
[PDF]
• M.-D. Albakour, C. Macdonald, I. Ounis, A. Pnevmatikakis, J. Soldatos
SMART: An Open Source Framework for Searching the Physical World
[PDF]
• T. Gollub, S. Burrows, B. Stein
First Experiences with TIRA for Reproducible Evaluation in Information Retrieval
[PDF]
• P. Jourlin, R. Deveaud, E. Sanjuan-Ibekwe, J.-M. Francony, F. Papa
Design, Implementation and Experiment of a YeSQL Web Crawler
[PDF]
• C. Macdonald, R. McCreadie, R. L.T. Santos, I. Ounis
From Puppy to Maturity: Experiences in Developing Terrier
[PDF]
• H. Turtle, Y. Hegde, S. Rowe
Yet Another Comparison of Lucene and Indri Performance
[PDF]
• M. Yasukawa, J. S. Culpepper, F. Scholer
Phonetic Matching in Japanese
[PDF]
• T. Beckers, S. Dungs, N. Fuhr, M. Jordan, S. Kriewel, V. Tran
Demo of ezDL
• S. Gog, M. Petri, J. S. Culpepper, A. Moffat
SDSL: Succinct Data Structure Library
• T. Gollub
Showcasing the Open Source Experimentation Platform TIRA
• A. Trotman, X.-F. Jia, M. Crane
The ATIRE Open Source Search Engine