Improving search engines by query clustering

November 21, 2017 | Author: Marcelo Mendoza | Category: Information Systems, Information Retrieval, Library and Information Studies, Search Engine, Cluster, The American
Introduction Query Clustering Framework Applications Conclusions

Improving Search Engines by Query Clustering
Ricardo Baeza-Yates and Carlos Hurtado
Journal of the American Society for Information Science and Technology

João Manuel Maia Duarte ([email protected])


Outline

1. Introduction
2. Query Clustering Framework
   - Query Representation
   - Clustering Process
   - Comparison of Query Distance Functions
3. Applications
   - Answer Ranking
   - Query Recommendation
4. Conclusions


Motivation

Generally, queries (as lists of keywords) are not good descriptors of the user's information need:
- polysemic words
- very short queries (around 2 terms per query)

To formulate effective queries, the user:
- (may) need to be familiar with the terminology of the specific knowledge domain
- (sometimes) is not certain about what to search for

It would be useful to capture sets of common preferences and information needs in groups of queries.


Related work

Searching for common preferences and information needs using the query log:
- DirectHit
- MASEL
- Association rules
- Clustering similar queries to recommend URLs, based on:
  1. keywords or phrases of the query string
  2. matching of keywords
  3. common clicked URLs
  4. distance of common clicked URLs in some predefined hierarchy


Query Representation Clustering Process Comparison of Query Distance Functions


Query Session

Definition: a query session consists of:
- one query,
- a list of URLs,
- the URLs the user clicked on.

Figure: Relationships of entities in query logs.
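A query session as defined above can be sketched as a small record; the field and URL names below are illustrative assumptions, not the paper's notation:

```python
from dataclasses import dataclass

@dataclass
class QuerySession:
    """One entry of the query log: the query, the ranked answer URLs,
    and the subset of those URLs the user clicked on."""
    query: str
    urls: list      # ranked list of URLs returned for the query
    clicked: list   # URLs the user actually clicked

session = QuerySession(
    query="jaguar",
    urls=["example.org/jaguar-car", "example.org/jaguar-animal"],
    clicked=["example.org/jaguar-animal"],
)
print(len(session.clicked))  # 1
```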


Query Representation

\[
\vec{q}[i] = \sum_{u \in \mathrm{URLs}} \frac{\mathrm{Pop}(u, q) \times \mathrm{Tf}(t_i, u)}{\max_t \mathrm{Tf}(t, u)} \tag{1}
\]

where:
- Pop(u, q) is the number of clicks on URL u in the sessions related to query q,
- Tf(t, u) is the number of occurrences of term t in the excerpt of URL u.

Queries may then be viewed simply as documents.
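Equation (1) translates almost directly into code; the toy vocabulary, click counts, and term frequencies below are made up for illustration:

```python
def query_vector(pop, tf, urls, vocab):
    """Eq. (1): q[i] = sum over URLs u of Pop(u,q) * Tf(t_i,u) / max_t Tf(t,u)."""
    q = [0.0] * len(vocab)
    for u in urls:
        max_tf = max(tf[u].values())               # max_t Tf(t, u)
        for i, term in enumerate(vocab):
            q[i] += pop[u] * tf[u].get(term, 0) / max_tf
    return q

# Toy data: one query with two clicked URLs.
vocab = ["car", "animal"]
pop = {"u1": 3, "u2": 1}                           # Pop(u, q): clicks on u for this query
tf = {"u1": {"car": 4, "animal": 1}, "u2": {"animal": 2}}
print(query_vector(pop, tf, ["u1", "u2"], vocab))  # [3.0, 1.75]
```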


Query Representation

Advantages:
- simple to compute
- captures semantic relationships between queries
- redirects the search process toward related information that interested previous users
- avoids sparse similarity matrices

But URLs are returned by search engines in some ordering, and top-ranked URLs are more likely to be clicked. So the term Pop(u, q) is biased by the ordering of the URLs.


Unbiased Query Representation

\[
P_{\mathrm{rank}}(u \mid q) = P_{\mathrm{visit}}(X \ge \mathrm{Rank}(u, q)) \, P_{\mathrm{likes}}(u \mid q) \tag{2}
\]

where:
- P_rank(u | q) is the probability that a user selects u given query q,
- P_visit(X ≥ Rank(u, q)) is the probability that a user's search reaches the Rank(u, q)-th URL,
- P_likes(u | q) is the probability that u is relevant to the user given q.

Since P_rank(u | q) is observable in the log, to estimate P_likes(u | q) one only needs to estimate P_visit(X ≥ Rank(u, q)).


Estimating P_likes(u | q)

Assumptions:
- Each successive position in the rank is less likely to be reached by the users.
- The distribution follows a power-law decay, which can be well fit by a Pareto distribution: P_visit(X ≥ x) = 1/x^b, where b is a parameter.

Thus,

\[
P_{\mathrm{likes}}(u \mid q) = P_{\mathrm{rank}}(u \mid q) \times \mathrm{Rank}(u, q)^{b} \tag{3}
\]

where b is estimated by fitting the power-law distribution. It is assumed that the user stops searching at position Rank(u_last, q), where u_last is the last clicked URL.
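The exponent b can be estimated with an ordinary least-squares line fit in log-log space over the last-click histogram (as in the figure that follows). The counts below are synthetic, generated to decay roughly as 1/rank²:

```python
import math

def fit_power_law_exponent(rank_counts):
    """Least-squares fit of log(count) = a - b*log(rank); returns the decay exponent b."""
    pts = [(math.log(r), math.log(c)) for r, c in rank_counts if c > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    return -slope

# Hypothetical histogram: (rank of last clicked URL, number of users).
counts = [(1, 1000), (2, 250), (3, 111), (4, 62), (5, 40)]
b = fit_power_law_exponent(counts)
print(round(b, 2))  # close to 2 for this synthetic data
```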


Last Click vs. Number of Users

Figure: Logarithm of the rank of the last clicked URL vs. logarithm of the number of users.


Unbiased Similarity Function

Using the previous results, the adjusted popularity is defined as

\[
\mathrm{AdjPop}(u, q) = \mathrm{Pop}(u, q) \times \mathrm{Rank}(u, q)^{b} \tag{4}
\]

and the unbiased vectorial representation of a query as

\[
\vec{q}[i] = \sum_{u \in \mathrm{URLs}} \frac{\mathrm{AdjPop}(u, q) \times \mathrm{Tf}(t_i, u)}{\max_t \mathrm{Tf}(t, u)} \tag{5}
\]
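Moving from Eq. (1) to Eq. (5) only replaces Pop with the AdjPop of Eq. (4); a minimal sketch:

```python
def adj_pop(pop, rank, b):
    """Eq. (4): boost clicks at lower-ranked positions to compensate for rank bias."""
    return pop * rank ** b

# With b = 2, ten clicks at rank 5 weigh as much as 250 clicks at rank 1.
print(adj_pop(pop=10, rank=5, b=2))  # 250
```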



Clustering Process

The query clustering is performed using the (spherical) k-means provided by the CLUTO software package, a data clustering package developed at the University of Minnesota. The k-means algorithm starts by selecting k random objects (queries) and then iterates two steps until a stop condition is reached:

1. Assign each query to the cluster whose centroid is closest to the query.
2. Update each cluster centroid by averaging the queries belonging to the cluster.

k is the desired number of clusters and is defined by the user.
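CLUTO is a standalone C toolkit; the two-step loop above can be sketched in plain Python/NumPy as a spherical k-means (cosine similarity on unit-normalized vectors). This is an illustrative reimplementation, not CLUTO itself:

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """k-means on unit vectors: assignment by cosine similarity (step 1),
    centroid update by re-normalized mean of assigned queries (step 2)."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = X[rng.choice(len(X), size=k, replace=False)]   # k random queries as seeds
    for _ in range(iters):
        labels = (X @ C.T).argmax(axis=1)              # step 1: closest centroid
        for r in range(k):                             # step 2: recompute centroids
            members = X[labels == r]
            if len(members):
                c = members.mean(axis=0)
                C[r] = c / np.linalg.norm(c)
    return labels, C

# Two obvious groups of toy query vectors.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
labels, _ = spherical_kmeans(X, k=2)
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```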


Clustering Process (2)

Which k should be selected? Usually, the goal of clustering is to find a data partition with:
1. high intra-cluster similarity, and
2. high inter-cluster dissimilarity.

In this case, one is only interested in cluster homogeneity, so k is selected such that the clustering quality I is greater than a threshold:

\[
I = \sum_{r=1}^{k} \sum_{q_i \in C_r} \mathrm{sim}(q_i, C_r) \tag{6}
\]
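Equation (6) scores how homogeneous a clustering is; a sketch using cosine similarity of each query to its (normalized) cluster centroid, with toy vectors:

```python
import numpy as np

def cluster_quality(X, labels, k):
    """Eq. (6): I = sum_r sum_{q_i in C_r} sim(q_i, C_r), with cosine
    similarity of each query to its cluster centroid."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    total = 0.0
    for r in range(k):
        members = X[labels == r]
        if len(members) == 0:
            continue
        c = members.mean(axis=0)
        c = c / np.linalg.norm(c)
        total += float((members @ c).sum())
    return total

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
good = cluster_quality(X, np.array([0, 0, 1, 1]), k=2)   # homogeneous clusters
bad = cluster_quality(X, np.array([0, 1, 0, 1]), k=2)    # mixed clusters
print(good > bad)  # True
```

In the paper's setting, k is grown until I exceeds a user-chosen threshold, since only homogeneity matters.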


Clustering Quality vs Number of Clusters

Figure: Quality of clusters (I) versus number of clusters (k )


Queries per Cluster

Figure: Histogram of the number of queries per cluster: (a) excerpt terms, (b) unbiased excerpt terms.

All clusters have more than 10 queries.


Comparison of Query Distance Functions

Four distance functions were compared:
1. Excerpt terms distance (cosine distance over the biased vectorial representation)
2. Unbiased excerpt terms distance (cosine distance over the unbiased vectorial representation)
3. Query terms distance (standard cosine distance using a TF-IDF representation)
4. Cocitation distance (degree of cocitation over the collection of documents selected by users for each query)
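Two of the four functions are easy to sketch: cosine distance over vectors, and a cocitation distance over sets of clicked URLs. The overlap formula below is a common Jaccard-style choice; the paper's exact normalization may differ:

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def cocitation_distance(clicks_a, clicks_b):
    """1 minus the fraction of commonly clicked URLs (Jaccard-style overlap)."""
    a, b = set(clicks_a), set(clicks_b)
    return 1.0 - len(a & b) / len(a | b)

print(cosine_distance([1, 0], [1, 0]))                  # identical vectors -> 0.0
print(cocitation_distance(["u1", "u2"], ["u2", "u3"]))  # one shared URL of three
```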


Histogram of Distances for the Query Collection

Figure: Histograms of distances over the query collection: (a) query terms, (b) cocitation, (c) excerpt terms, (d) unbiased excerpt terms.


Evaluation using quasi-synonym queries

Table: Distance distributions (percentages) for a set of 600 pairs of quasi-synonym queries

|                        |   d1 |  d2 |  d3 |  d4 |  d5 |  d6 |  d7 |  d8 | d9 |  d10 |
|------------------------|------|-----|-----|-----|-----|-----|-----|-----|----|------|
| Query terms            | 20.7 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |  0 | 79.3 |
| Cocitation             |   53 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |  0 |   47 |
| Excerpt terms          | 68.5 | 9.5 | 4.7 | 0.2 | 9.5 | 2.8 | 4.7 | 0.1 |  0 |    0 |
| Unbiased excerpt terms |   69 | 9.7 | 8.8 | 4.5 | 3.5 | 4.2 | 0.3 |   0 |  0 |    0 |

Table: Medians for each decile of the distance distributions

|                        |    m1 |   m2 |    m3 |   m4 |    m5 |    m6 |    m7 |    m8 | m9 | m10 |
|------------------------|-------|------|-------|------|-------|-------|-------|-------|----|-----|
| Query terms            |  0.95 |    1 |     1 |    1 |     1 |     1 |     1 |     1 |  1 |   1 |
| Cocitation             |  0.95 |    1 |     1 |    1 |     1 |     1 |     1 |     1 |  1 |   1 |
| Excerpt terms          | 0.575 | 0.85 | 0.875 |  0.9 | 0.911 | 0.925 |  0.95 | 0.975 |  1 |   1 |
| Unbiased excerpt terms | 0.575 |  0.8 | 0.825 | 0.85 | 0.875 |   0.9 | 0.925 |  0.95 |  1 |   1 |


Answer Ranking Query Recommendation


Ranking Algorithm

The ranking algorithm works in two phases:

1. Preprocessing phase, at periodic intervals:
   - Extract query sessions from the query log and cluster queries using the text of all the clicked URLs.
   - For each cluster C_i, compute a list Q_i of its queries and a list U_i containing the popularity of its k most popular URLs.

2. Online searching phase:
   - Given a query q, find its cluster C_i.
   - Return the URLs that have been selected for C_i, ordered by a rank score considering both similarity and support:

\[
\mathrm{Rank}(u) = \mathrm{Sim}(q, u) \times \sum_{q_j \in C_i} \mathrm{AdjPop}(u, q_j) \tag{7}
\]

\[
\mathrm{NewRank}(u) = \beta\,\mathrm{OrigRank}(u) + (1 - \beta)\,\mathrm{Rank}(u) \tag{8}
\]


Ranking Algorithm Results

Figure: Average retrieval precision of the proposed ranking algorithm.



Query Recommendation

The query recommendation algorithm also works in two phases:

1. Preprocessing phase, at periodic intervals: similar to the answer ranking algorithm.

2. Online searching phase:
   - Given a query q, find its cluster C_i.
   - Compute a rank for each query in C_i based on notions of similarity and support:

\[
\mathrm{Rank}(q_i) = \sum_{u \in U} \mathrm{Sim}(q, u) \times \mathrm{AdjPop}(u, q_i) \tag{9}
\]

where U is the set of URLs belonging to C_i.
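The recommendation score of Eq. (9) in the same style, again with assumed data shapes and toy values:

```python
def recommend_score(q_sim, candidate, cluster_urls, adjpop):
    """Eq. (9): sum over the cluster's URLs U of Sim(q, u) * AdjPop(u, q_i)."""
    return sum(q_sim.get(u, 0.0) * adjpop.get((u, candidate), 0.0)
               for u in cluster_urls)

q_sim = {"u1": 0.9, "u2": 0.5}                          # Sim(q, u)
adjpop = {("u1", "q_rec"): 2.0, ("u2", "q_rec"): 4.0}   # AdjPop(u, q_i)
score = recommend_score(q_sim, "q_rec", ["u1", "u2"], adjpop)
print(round(score, 1))  # 0.9*2 + 0.5*4 = 3.8
```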


Query Recommendation Results

Figure: Average retrieval precision of the query recommendation.


Conclusions

The proposed representation:
- allows a better similarity assessment between queries,
- allows finding groups of semantically related queries,
- improves clustering quality, which results in better retrieval precision and better query recommendations.

Traditional document retrieval techniques can be applied within the presented framework.
