SIGIR 2012 Workshop on

Open Source Information Retrieval

16 August 2012

Introduction

The open source IR community has been strong for many years. Early search engines (such as MG) continue to be used in larger open source projects (such as Greenstone). More recent open source search engines (such as Apache Lucene) are used to power the search facilities of some of the largest technology companies (such as IBM, AOL, and Apple). In the academic community, such search engines are routinely used to test ranking functions, compression algorithms, user interfaces, and so on. Open Source IR is now an essential component of research and commerce. This workshop is providing a venue for users and authors of open source IR tools to get together and discuss their joint future.

Of particular interest is how to work together to build OpenSearchLab, an open source, live and functioning, online web search engine for research purposes. We believe that the tools to build it mostly exist, that it can be built by working together, and that it will transform the future of research in IR.

Topics of Interest

Position papers and posters on open source IR as well as demos of exciting packages are sought. Topics include, but are not limited to: Software Engineering; Hardware Engineering; Evaluation; Needs, Desires, and Demos; Protocols.

The selection process will give extra consideration to papers on search engines, particularly on building a web-scale live search engine; however, all aspects of Information Retrieval will be considered.

Software Engineering

Software engineering is not normally discussed at the SIGIR conference, but this workshop will provide such an opportunity. Writing a scalable search engine is not a trivial task. The author must consider HTML parsing, stemming, index compression, relevance feedback, and so on. Designing maintainable software of this magnitude is outside the skill set of most graduate students, but there is likely to be consensus on the design. This workshop will provide a venue to discuss proven designs.

Software maintenance, including the use of version control, regression tests, release schedules, bug tracking, and so on, is an important aspect of any software project. IR has some special issues. For example, parallel indexing may lead to non-deterministic indexes, which in turn cause problems for regression tests.

Selection and implementation of algorithms and data structures can lead to significant differences in the performance of large-scale IR systems, yet retrieval efficiency issues are not well represented at SIGIR. For example, dynamic pruning techniques such as MaxScore and WAND (which allow efficient document scoring without decreasing effectiveness at rank K) have implementation intricacies rarely properly described in the literature. While "implementation details" are not generally interesting in a research paper, they are critical when designing and evaluating new experiments. Discussion on these topics will be sought.

Hardware Engineering

The open source community has several approaches to hardware. The Hadoop community believes that hardware is inexpensive and easily obtainable. The smart phone community believes that hardware is expensive and a user will have only one machine. The cloud community believes that hardware is infinite and need only be paid for if used. Each philosophy brings a different approach to the architecture and design of a search engine. Open debate on these philosophies could result in increased efficiency for cloud services as well as increased scalability of smart phone services.

Evaluation

There is variation in the performance of the same algorithms implemented in different systems. It is well understood that two different BM25 implementations are unlikely to produce the same MAP scores, but the variance is unknown. It is academically important to understand this variance, and bringing together a number of open source systems that (purportedly) implement the same algorithms is one way to understand it. It will also allow us to explore the best way to implement each search engine component.
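As one illustration of how two implementations of "the same" ranking function can disagree, the sketch below scores two toy documents with BM25 using two common formulations of the IDF component. It is a minimal sketch, not taken from any particular system: the function names, the toy collection statistics, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```python
import math

def idf_classic(n_docs, doc_freq):
    # Robertson/Sparck Jones style IDF; it becomes negative when a term
    # occurs in more than half of the documents.
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def idf_plus_one(n_docs, doc_freq):
    # Variant that adds 1 inside the logarithm so the weight stays positive.
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len,
               n_docs, doc_freqs, idf_fn, k1=1.2, b=0.75):
    # Sum the BM25 contribution of every query term present in the document.
    score = 0.0
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        score += idf_fn(n_docs, doc_freqs[term]) * tf * (k1 + 1.0) / (tf + norm)
    return score

# Hypothetical collection: 10 documents, "search" is common, "lucene" is rare.
N_DOCS = 10
DOC_FREQS = {"search": 8, "lucene": 1}
QUERY = ["search", "lucene"]
DOC_A = {"search": 3, "lucene": 1}   # matches both query terms
DOC_B = {"lucene": 1}                # matches only the rare term

def rank(idf_fn):
    # Return document names ordered from highest to lowest score.
    scores = {
        "A": bm25_score(QUERY, DOC_A, 100, 100, N_DOCS, DOC_FREQS, idf_fn),
        "B": bm25_score(QUERY, DOC_B, 100, 100, N_DOCS, DOC_FREQS, idf_fn),
    }
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    print("classic IDF ranking: ", rank(idf_classic))
    print("plus-one IDF ranking:", rank(idf_plus_one))
```

In this toy setting the two IDF choices order the documents differently, because the classic form penalises the document that contains the very common term. Real systems also differ in tokenisation, stemming, and length statistics, so a difference like this is only one of several possible sources of variance.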

Needs, Desires, and Demos

This workshop will provide a venue for members of the community to demonstrate their tools and discuss the direction of those tools. It will allow users to discuss requirements, and developers and users to work together on a future for those tools.

Protocols

This is the first time the open source IR community has come together with the intention of working together and interoperating. To do so requires standard protocols, and these are solicited. Such protocols include network-based protocols as well as object interfaces. As an example, the standard model of a distributed search engine has three components: the client, brokers, and search engines. The client issues a search request to a broker, which distributes it to several search engines, which search in parallel. The broker then merges the results so that the set of search engines appears as one. If the communications protocols in this distributed environment were standardized, then it would be possible to mix and match search engines, brokers, and clients. This would permit parallel development of systems by those who are expert in each part, and it would allow those with search engines to easily provide a distributed search engine. Moreover, it would allow for the in-place comparison of different functional components.
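The client, broker, and search engine model described above can be sketched in a few lines. This is a minimal in-process sketch under stated assumptions, not a proposed standard: the `SearchEngine` interface, the `InMemoryEngine` and `Broker` classes, and the term-count scoring are all hypothetical, and the merge step assumes that scores from different engines are directly comparable, which is not true in general.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Protocol, Sequence, Tuple

Hit = Tuple[float, str]  # (score, document identifier)

class SearchEngine(Protocol):
    # Hypothetical shared interface: any engine exposing this method
    # can be placed behind the broker.
    def search(self, query: str, k: int) -> List[Hit]: ...

class InMemoryEngine:
    """Toy engine that scores documents by counting query-term occurrences."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self.docs = docs

    def search(self, query: str, k: int) -> List[Hit]:
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            score = float(sum(words.count(t) for t in terms))
            if score > 0:
                hits.append((score, doc_id))
        return heapq.nlargest(k, hits)

class Broker:
    """Sends the query to every engine in parallel and merges the results."""

    def __init__(self, engines: Sequence[SearchEngine]) -> None:
        self.engines = list(engines)

    def search(self, query: str, k: int) -> List[Hit]:
        workers = max(1, len(self.engines))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partial = list(pool.map(lambda e: e.search(query, k), self.engines))
        # Merge step: assumes scores are comparable across engines.
        merged = [hit for hits in partial for hit in hits]
        return heapq.nlargest(k, merged)

if __name__ == "__main__":
    engine_1 = InMemoryEngine({"d1": "open source search", "d2": "closed source"})
    engine_2 = InMemoryEngine({"d3": "open source open source", "d4": "hardware"})
    broker = Broker([engine_1, engine_2])
    for score, doc_id in broker.search("open source", 2):
        print(doc_id, score)
```

The client only ever talks to the broker, so engines written by different groups could be swapped in as long as they honour the same interface. A networked version would replace the direct method call with an agreed wire protocol.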

Agreeing on object level interfaces will help increase the interoperability of source code for standard IR tasks such as stemming, relevance feedback, ranking, compression, and so on.

OpenSearchLab

Many of the open source IR research tools have been developed in isolation with separate goals. At the same time many software authors have raised the issue of open global web search (a common goal). The next step in open source IR is a fully-functional online web search engine. Such an engine would provide research opportunities in several areas including: web crawling, document parsing (including decoration removal), searching, user interfaces, click-log mining, and so on. This will advance academic research in search engines beyond the silos of isolated individuals and into a global community of researchers working together.

Planned Activities

Submitted Papers, Posters, and Demos will be fully peer-reviewed by an international Program Committee. The program committee will select submissions for presentation in the form most appropriate for that submission. Time will be allocated for presentation by full paper (with discussion), by demonstration, or by poster discussion.

Discussion on OpenSearchLab will form the basis of one session.

Goals

We will discuss the issues of Open Source IR in an open forum. This face-to-face discussion is invaluable when considering the future direction of the movement. It will provide an opportunity to agree on standards, on unaddressed issues (gaps), and on ways to share engineering. It will provide an opportunity for those who use open source IR to work with software authors. Importantly, it will allow us to work as a community to discuss the viability of OpenSearchLab and to plan a future.

Schedule

July 2, 2012	Deadline for Paper and Poster Submissions
	Prepare your PDF using the ACM format (8-page full papers, 4-page posters). Submit online using the EasyChair OSIR 2012 Site.

July 23, 2012	Notification of Acceptance

July 30, 2012	Deadline for Camera Ready Copies

August 6, 2012	Deadline for 2-Paragraph Demo Abstracts
	Submit a one-paragraph description and a one-paragraph statement of resource requirements on the EasyChair OSIR 2012 Site.

August 8, 2012	Notification of Demo Acceptance

August 16, 2012	SIGIR 2012 Workshop on Open Source Information Retrieval

Submission Details

Submissions must be original work, not previously published elsewhere, and not currently submitted to any other conference, workshop, or journal. Submission of a paper should be regarded as an undertaking that, should the paper be accepted, at least one of the authors will attend the workshop to present the work.

PDFs should be submitted by the deadline. Full papers are expected to be 8 pages in length in the ACM format. Posters and demos are expected to be 4 pages in length in the ACM format.

Submitted Papers, Posters, and Demos will be fully peer-reviewed by an international Program Committee. The selection process will give extra consideration to papers on search engines, particularly on building a web-scale live search engine; however, all aspects of Information Retrieval will be considered.