Patent Searching and Landscaping present many challenges to the research community. Patents are complex technical documents whose content appears in many languages and contains images, chemical and genomic structures, and other forms of data, intermixed and cross-referenced with the text. Further, much patent work involves the integration of other forms of technical information, such as scientific papers and linked open data, with the patent data. Finally, the realistic presentation of search and analysis results to often non-technical and time-poor audiences, for the purpose of strategic decision making, presents particular challenges.
The patent-related business is worth billions, so all these problems are solved on a daily basis by the patent community, often in somewhat inadequate, labor-intensive ways. The challenge for the research community is to provide better solutions without increasing the already heavy burden on the relevant technical, legal and language experts.
The course will review the state of the art and point out where the key challenges lie, especially for early-stage researchers in MUMIA-related disciplines.
1) Understand Patent Searching and Landscaping and the State of the Art
2) Understand the key limitations and challenges for the research community in the development of patent retrieval
3) Understand how recent developments in Multilingual and Multimedia Information Access may be applied to patent searching and landscaping research.
The course will comprise the following sessions:
1. Introduction to Distributed Search and relevant issues: Introduction to Distributed Information Research. Disambiguation of terminology. Discussion of important issues pertinent to Distributed Search.
2. A review of the state-of-the-art in Distributed Search: An overview of the leading approaches in the field.
3. Presentation of existing frameworks for Distributed Search: Presentation and exploration of freely available Information Retrieval Toolkits. Advantages and limitations of each.
4. Conclusions and Wrap-up: Final thoughts, closing remarks and a summary of the tutorial.
The course will introduce its audience to the basic principles of Distributed Search. It will particularly explore issues related to Patent Search within a distributed framework.
Relevance to COST project
The course will directly discuss issues pertinent to the “Multilingual and Multifaceted Interactive Information Access (MUMIA)” project, as patents are often distributed in geographically distant repositories, each one with different underlying searching frameworks. In this setting, a Distributed system is often the only way to bring those resources together.
The intended audience of the course includes Information Retrieval researchers (including MSc and PhD students). A prior knowledge of basic Information Retrieval principles is useful but not essential.
To overcome the “one size fits all” behavior of most search engines, a great deal of recent research has addressed techniques for tailoring the search outcome to the user context, with the aim of improving the quality of search. The main idea is to produce context-dependent and user-tailored search results. Search tasks are subjective and often complex; the user–system interaction based on keyword querying and on the presentation of search results as a list of web pages, ordered according to their estimated relevance, is often unsatisfactory. The aim of this course is to present a short overview of the main issues related to contextual search.
Genre-analysis experiments in the field of information retrieval can be traced back to the 1990s. Recent years have shown a growing interest in automatic genre analysis of Web documents, especially in the context of Web search. Document genre and other non-topical features can be used for modeling relevance, for diversifying and personalizing search results, and for improving the representation of search results. The course gives a brief overview of the 20-year history of genre experiments in IR and cognate areas and surveys modern applications (which are often far from those envisioned in the pioneering works).
To make students familiar with genre analysis, an interdisciplinary field at the intersection of IR, NLP, and information extraction. The course consists of two parts.
Part 1: Definition of document genre, main concepts and techniques; related areas. Genre: an elusive concept. Document genres and IR tasks: a brief history of the research area. Genre classification: corpora and genre palettes, classification features and methods. Cognate areas: authorship attribution and plagiarism detection, readability measures, gender classification, mood classification and sentiment analysis.
Part 2: Incorporation of genres and genre-related document features into IR systems. Vertical search. Specification of information need. Search query intent and document genres. Search results representation. Transparent use of genre-related features in ranking. Diversity in document retrieval and genres. Genre-related aspects in TREC evaluation experiments. Readability and personalization of web search results. Genres beyond search: summarization, question answering.
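As a toy illustration of the feature side of Part 1, the sketch below computes a few surface features often used in genre classification. The feature set and the function-word list are illustrative assumptions only, not a recommended palette.

```python
import re

# A tiny, hypothetical function-word list (real genre work uses much larger sets).
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "it", "for"}

def genre_features(text):
    """Compute three simple non-topical features of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens or not sentences:
        return {"avg_sent_len": 0.0, "type_token": 0.0, "func_word_rate": 0.0}
    return {
        "avg_sent_len": len(tokens) / len(sentences),       # syntactic complexity proxy
        "type_token": len(set(tokens)) / len(tokens),       # lexical richness
        "func_word_rate": sum(t in FUNCTION_WORDS for t in tokens) / len(tokens),
    }

legal = "The party of the first part shall indemnify the party of the second part."
chat = "hey! you coming? yes! ok cool."
print(genre_features(legal)["avg_sent_len"] > genre_features(chat)["avg_sent_len"])  # True
```

Feature vectors like these would then be fed to any standard classifier; the point is that they capture *how* a text is written rather than what it is about.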
The course will assume that students are familiar with basic IR and machine learning concepts.
The course includes an overview of the techniques employed in multimedia search, including feature extraction from audiovisual content, machine learning techniques, implicit and explicit user feedback gathering, indexing and retrieval. The course will focus on image and video search systems and discuss how these techniques can be employed in both cases. Specifically, the course will discuss video shot segmentation and the representation of video with image keyframes, as well as low-level visual feature extraction from images to support visual similarity, and high-level concept extraction (e.g. beach, mountain) based on supervised classification methods. Since the focus is on interactive systems, we will also discuss how explicit and implicit user relevance feedback can be gathered and exploited by employing graph representations and machine learning techniques. The presentation will also include live demonstrations of interactive multimedia systems, with extensive discussion and lower-level explanation of how the aforementioned techniques are integrated and applied.
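As a minimal sketch of the low-level feature extraction and visual-similarity steps mentioned above: a coarse RGB colour histogram per image, compared with L1 distance. The toy "images" (flat pixel lists) and the histogram granularity are illustrative assumptions, not the systems demonstrated in the course.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Quantise each RGB channel into a few bins and count pixels per colour cell."""
    hist = [0] * (bins_per_channel ** 3)
    step = 256 // bins_per_channel
    for (r, g, b) in pixels:
        idx = (r // step) * bins_per_channel ** 2 + (g // step) * bins_per_channel + (b // step)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist]  # normalise so image size does not matter

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Hypothetical images: beach scenes share sand/sea colours, forest is mostly green.
beach = [(200, 180, 120)] * 50 + [(80, 140, 220)] * 50
beach2 = [(210, 170, 110)] * 40 + [(70, 150, 210)] * 60
forest = [(30, 120, 40)] * 100
hb, hb2, hf = map(colour_histogram, (beach, beach2, forest))
print(l1_distance(hb, hb2) < l1_distance(hb, hf))  # True: the two beaches look alike
```

In a real system such histograms would be one of several descriptors (alongside texture and shape), and the concept classifiers (e.g. "beach") would be trained on top of them.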
The objective of the course is to give an overview of existing techniques for multimedia search and to explain how these techniques could be integrated in an interactive multimedia search system. The course can also serve as a background for building interactive video and image search engines.
Part I: Introduction and overview of IR Evaluation
The first session will cover the motivation, basics and details of the evaluation of information retrieval tools. The aim of this session is to provide the students with an understanding of the basic concepts, as well as the general best practices in this domain, including the most used measures. It will also briefly cover the standardized evaluation campaigns (TREC, CLEF, NTCIR, INEX).
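As a small illustration of two of the most used measures, the sketch below computes precision at k and non-interpolated average precision for a single query; averaging the latter over a set of queries gives MAP. The document IDs and relevance judgements are invented.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d7", "d1", "d5", "d2"]   # hypothetical system output
relevant = {"d3", "d5", "d9"}             # hypothetical ground-truth judgements
print(precision_at_k(ranked, relevant, 5))            # 0.4
print(average_precision(ranked, relevant))            # 0.5
```

Note that average precision also penalises relevant documents that were never retrieved (here d9), since the denominator is the full relevant set.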
Part II: Specific evaluation tasks
The second session will take the students through the different kinds of standardized evaluations currently and formerly organised. The presentation is structured along two axes: first, domain-specific (patent, healthcare, genomics) and, second, task-type-specific (interactive, session-based), pointing out the overlaps. We will see how the measures described in the first session are adapted to cater to different needs and user behaviours. The presentation will also discuss how laboratory-style evaluations may, and should, be complemented by user studies.
Provide the students with:
1. An understanding of the necessity and best practices of laboratory-like IR evaluation, and
2. An understanding of the different threads of research in this context, as well as the different alternatives available / needed in different domains.
In a series of four lectures I plan to cover a number of applications of text processing other than core Information Retrieval. The topics will be selected from the following:
1. Corpus Linguistics. A corpus is a large body of text stored on a computer, sampled for a specific purpose such as linguistic or socio-demographic analysis. Corpora are much richer when they are annotated with information such as parts of speech or semantic codes, which include emotion words. Corpora in one language can be aligned automatically with their translations in another language, a precursor of multilingual terminology extraction.
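A keyword-in-context (KWIC) concordance is the classic corpus-linguistics view of such data: every occurrence of a word shown with a window of surrounding words. A minimal sketch, with an invented miniature corpus:

```python
def kwic(tokens, keyword, window=2):
    """Return one concordance line per occurrence of `keyword` in `tokens`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

corpus = "the patent office granted the patent after a long patent examination".split()
for line in kwic(corpus, "patent"):
    print(line)
# the [patent] office granted
# granted the [patent] after a
# a long [patent] examination
```

Sorting such lines by their right (or left) context is a standard way of spotting collocations and usage patterns.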
2. Clustering and Classification. Clustering comprises a family of algorithms which can automatically assign entities (such as documents) to categories, which may be newly discovered in the process. Classification is the process of assigning entities to their correct, predefined categories. We will look at the theory of different clustering algorithms and get some hands-on experience with a statistical programming language. Texts can be clustered by topic, genre or writing style.
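As a hands-on flavour of clustering, here is a bare-bones k-means sketch on toy term-frequency vectors; the two-word vocabulary, the counts and the starting centroids are invented, and real work would use a library implementation. The loop shows the assign/update cycle at the heart of the algorithm.

```python
def kmeans(vectors, centroids, iterations=10):
    """Alternate between assigning vectors to the nearest centroid and
    recomputing each centroid as the mean of its assigned vectors."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Toy vectors: counts of ("patent", "football") in six short documents.
docs = [[5, 0], [4, 1], [6, 0], [0, 5], [1, 4], [0, 6]]
clusters = kmeans(docs, centroids=[[5, 0], [0, 5]])
print([len(c) for c in clusters])  # [3, 3]: patent docs vs. football docs
```

With real documents the vectors would have thousands of dimensions (one per term), but the cycle is identical.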
3. Text Data Mining is the extraction, from large volumes of data, of information which may never have been explicitly recorded by any author. This puts it in contrast with standard information retrieval, where every text retrieved has been authored by someone. Some methods in data mining come from standard statistics, while others use market basket analysis and still others use clustering algorithms. We will use a case study of audiology patient records.
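A market-basket-style analysis can be sketched as pair support counting over "transactions". The records below are hypothetical stand-ins for sets of findings in audiology-style notes, not real data; the output is a pattern no single author ever wrote down.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (fraction of transactions containing
    both items) meets the threshold."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

records = [
    {"tinnitus", "hearing_loss"},
    {"tinnitus", "hearing_loss", "vertigo"},
    {"hearing_loss", "vertigo"},
    {"tinnitus", "hearing_loss"},
]
print(frequent_pairs(records, min_support=0.6))
# {('hearing_loss', 'tinnitus'): 0.75}
```

Real association-rule miners (Apriori and successors) extend this to larger itemsets and add confidence measures, but support counting is the core operation.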
4. Natural Language Processing – while conventional information retrieval is restricted to the “bag of words” model, natural language is much richer than that. We will introduce the concept of linguistic levels above lexis (individual words) such as parts of speech, syntax and parsers, semantics, and pragmatics. In the light of this, we will discuss why the bag of words model still works as well as it does, and the ways in which techniques from information retrieval can be used in linguistic processing.
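The bag-of-words model itself is easy to demonstrate: word order is discarded entirely, yet cosine similarity over raw term counts still separates related from unrelated sentences, which is part of why the model works as well as it does. The example sentences are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts under the bag-of-words model."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

s1 = "the patent describes a novel engine"
s2 = "a novel engine is described in the patent"  # same content, different order
s3 = "rainfall was heavy across the region"
print(cosine(s1, s2) > cosine(s1, s3))  # True, despite the reordering
```

What the model cannot capture, of course, is exactly what the higher linguistic levels add: "the dog bit the man" and "the man bit the dog" are identical bags of words.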
5. Forensic Text Analysis, covering disputed authorship (who was the most probable author of a disputed text such as a novel or a will), the detection of plagiarism (the unauthorised and unacknowledged use of text written by someone else) and spam filtering (the blocking of bulk, unsolicited email).
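The list above can be made concrete with a toy Naive Bayes classifier with add-one smoothing, the standard baseline for the spam-filtering task; the same machinery, with different features, underlies many authorship-attribution studies. The training messages are invented.

```python
import math
from collections import Counter

def train(messages):
    """Collect per-class word counts, class priors and the shared vocabulary."""
    vocab, counts, priors = set(), {}, Counter()
    for text, label in messages:
        priors[label] += 1
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return vocab, counts, priors

def classify(text, model):
    """Pick the class with the highest smoothed log-probability."""
    vocab, counts, priors = model
    n = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in priors:
        total = sum(counts[label].values())
        score = math.log(priors[label] / n)
        for w in text.lower().split():
            # add-one smoothing so unseen words do not zero out the product
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
print(classify("claim your free money", model))  # spam
```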
The explosive growth of cheap but highly powerful audio-visual devices enables us to create huge amounts of multimedia data. Storing, transferring and processing such amounts of data has become an extraordinarily difficult task. The main problem now is how to manage digital data rather than how to create it.
Retrieval from large image datasets is a very difficult task. Early image/video retrieval systems were based on text retrieval: keywords are manually assigned to images or key frames and, for any particular query, images labeled by matching keywords are retrieved. Such an approach is traditionally used in the classic movie and entertainment industry: key frames and shots are manually annotated with appropriate captions. Although text-based image retrieval techniques may be very efficient and can be easily automated, they suffer from at least two major drawbacks, especially when working with large databases. One is the need both for precise (and time-consuming) manual annotation of the database and for finding the right combination of keywords when retrieving: a problem already recognized in web-based search engines. The second drawback originates in linguistic barriers and annotation imprecision, where the text-based approach may cause severe mismatches between the user's needs and the retrieved results. To overcome the drawbacks recognized in the text-based approach, content-based image retrieval (CBIR) techniques have been proposed. These techniques exploit low-level image features such as color, texture and shape from the individual images in the database. Low-level features are described numerically, and the obtained numbers are arranged in the form of vectors, called feature vectors (FVs). The objective similarity between different images is then estimated from the distance between their FVs. Such an approach may be efficient in some cases, but a serious problem known as the "semantic gap" often degrades the retrieval: low-level objective features and high-level user needs are in conflict in many cases. Low-level features alone are therefore not sufficient to describe exactly what the user wants to find in a particular image.
The semantic gap in CBIR systems may be minimized by introducing the user into the retrieval loop and by applying weighted distances, as suggested in systems such as Photobook, QBIC, irage and NETRA. From an initial set of images which are objectively closest to a query, the user selects the subjectively best-matched samples and annotates them appropriately.
From these samples, the weights of the pre-extracted features are updated according to the subjective perception of visual content. Such a procedure, usually called relevance feedback (RF), combined with neural networks and/or fuzzy systems, effectively bridges the gap between the low-level image features and the high-level semantics of human perception. The problem in CBIR with RF is that a query image has to be compared to all images in the dataset. This initial step is highly time-consuming. Several methods can be applied to reduce this initial search, and this will be the topic of this lecture:
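The weight-update step of relevance feedback can be sketched as follows. Re-weighting each feature dimension by the inverse of its variance over the user-marked relevant set is one common heuristic, used here purely for illustration; the feature values are invented.

```python
def update_weights(relevant_fvs, epsilon=1e-6):
    """Weight each feature dimension by the inverse variance over the
    relevant examples: features the user's choices agree on count more."""
    dims = len(relevant_fvs[0])
    weights = []
    for d in range(dims):
        vals = [fv[d] for fv in relevant_fvs]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        weights.append(1.0 / (var + epsilon))
    return weights

def weighted_distance(q, fv, w):
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, q, fv))

# The user's relevant images agree on feature 0 but scatter on feature 1.
relevant = [[0.9, 0.1], [0.9, 0.8], [0.9, 0.4]]
w = update_weights(relevant)
print(w[0] > w[1])  # True: the consistent feature dominates the distance
```

After each feedback round the weights are recomputed, so the distance function gradually reflects the user's subjective notion of similarity.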
1. Methods based on reduction of the feature vector.
2. Methods based on clustering the images in the database.
In the first approach, instead of the full FV, only its dominant components are included in the reduced FV. In our research we found that a reduction of even 90% has no serious influence on retrieval accuracy.
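A minimal sketch of this reduction: keep only the dominant (largest-magnitude) components and zero the rest, here dropping 90% of the dimensions. The 20-dimensional vector is invented for illustration.

```python
def reduce_fv(fv, keep_fraction=0.1):
    """Keep the `keep_fraction` largest-magnitude components; zero the rest."""
    k = max(1, int(len(fv) * keep_fraction))
    threshold = sorted((abs(x) for x in fv), reverse=True)[k - 1]
    kept = 0
    reduced = []
    for x in fv:
        if abs(x) >= threshold and kept < k:
            reduced.append(x)   # dominant component survives
            kept += 1
        else:
            reduced.append(0.0)
    return reduced

fv = [0.01] * 18 + [5.0, 3.2]           # 20 dimensions, 2 dominant components
reduced = reduce_fv(fv, keep_fraction=0.1)
print(sum(1 for x in reduced if x != 0))  # 2
```

Distances computed on such sparse vectors are far cheaper, since only the non-zero positions contribute.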
In the second approach, images subjectively similar to a given query are clustered, and only a representative element of the whole cluster is used for the first step of retrieval. Also, instead of the full FV describing color, texture, lines, shape, etc., we can base retrieval on only one feature or some combination of features. For instance, for radiology images color is a non-relevant feature, but texture and shape are of great importance. By an appropriate choice of features we can obtain a highly efficient retrieval tool.
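The two-stage search of the second approach can be sketched as follows: the query is first compared only to one representative (centroid) per cluster, and then only the winning cluster is scanned image by image. The clusters and feature vectors below are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Component-wise mean of the cluster's feature vectors."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def two_stage_search(query, clusters):
    # Stage 1: compare against one representative per cluster.
    reps = [centroid(c) for c in clusters]
    best = min(range(len(reps)), key=lambda i: squared_distance(query, reps[i]))
    # Stage 2: scan only the winning cluster image by image.
    return min(clusters[best], key=lambda fv: squared_distance(query, fv))

# Hypothetical 2-D feature vectors (e.g. texture vs. colourfulness).
xray_like = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
nature = [[0.9, 0.1], [0.8, 0.2]]
best_match = two_stage_search([0.18, 0.84], [xray_like, nature])
print(best_match)  # [0.15, 0.85]
```

With c clusters of roughly n/c images each, the first pass costs c comparisons instead of n, which is where the speed-up of this approach comes from.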
The course is addressed to PhD students, but also to professionals who are interested in retrieval from large databases. Although the course considers mainly image retrieval, the basic ideas may be applied to other kinds of data - audio, for instance.
The first part of the course will address the choice of features necessary for describing the image content as well as possible, while in the second part practical CBIR solutions will be described.
Searching for information in large-scale datasets has become one of the most pervasive and powerful applications of computing, and it continuously reveals unprecedented challenges for a large number of computer science disciplines related to information access. Searching for information has become even more difficult today because useful information is usually available from many distributed information sources and dynamic Web 2.0+ data streams, and perhaps in different languages. Another important issue is the diagnosis that information finding on the client side is hindered by current search systems. Search engines have proved extremely efficient using the “query box” paradigm and the simple ranked list of search results; however, the underlying “isolated” model they suggest (both in terms of data and interaction) does not address important issues such as the integration, and possibly synchronization, of multiple concurrent search views produced from different datasets, search tools and UIs. Very recent Web 2.0 developments, such as mashup applications that combine information from various web sources, are just a simple indicative step towards addressing this deficiency.
The aim of this short tutorial is to present models of information-seeking behaviour, techniques and methods for better modelling, analysing and understanding search behaviours, and how current search systems can support different types of search. The tutorial will present relevant research findings together with examples derived from the practical experience of designing and/or using patent search systems. The main issues and challenges in designing integrated, new-generation search systems, which will be used for searching multiple (patent) data sources through a variety of patent search tools and UIs, will be presented analytically. These issues and challenges span a large number of computer science disciplines and relate to the efficient integration of heterogeneous datasets, the development of new algorithms (e.g. for source selection, results merging and personalized results presentation) and the development of new interaction paradigms for richer information-seeking environments.
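One of the integration steps named above, results merging, can be sketched minimally: scores from different engines are min-max normalised before interleaving, since raw scores are not comparable across sources. The engines, documents and scores below are invented, and real merging algorithms (e.g. CORI-style) are considerably more sophisticated.

```python
def normalise(results):
    """Min-max normalise (doc, score) pairs so scores fall in [0, 1]."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [(doc, (s - lo) / span) for doc, s in results]

def merge(result_lists):
    """Pool normalised results from all sources and rank by score."""
    pooled = [item for results in result_lists for item in normalise(results)]
    return [doc for doc, _ in sorted(pooled, key=lambda x: -x[1])]

engine_a = [("p1", 42.0), ("p2", 30.0), ("p3", 10.0)]   # arbitrary score scale
engine_b = [("p4", 0.95), ("p5", 0.20)]                 # probability-like scores
merged = merge([engine_a, engine_b])
print(merged[:2])  # ['p1', 'p4']: top documents from both engines surface
```

Without the normalisation step, engine_a's raw scores would drown out engine_b entirely, which is exactly the kind of integration problem a distributed patent search system must solve.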