
Search Engines


Document summary

Search Engines, Link Analysis, and User’s Web Behavior.

Document preview

George Meghabghab and Abraham Kandel
Search Engines, Link Analysis, and User's Web Behavior
Studies in Computational Intelligence, Volume 99

Editor-in-chief: Prof. Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland. E-mail: kacprzyk@ibspan.waw.pl

Further volumes of this series can be found on our homepage: springer.com

Vol. 75. Crina Grosan, Ajith Abraham and Hisao Ishibuchi (Eds.), Hybrid Evolutionary Algorithms, 2007. ISBN 978-3-540-73296-9
Vol. 76. Subhas Chandra Mukhopadhyay and Gourab Sen Gupta (Eds.), Autonomous Robots and Agents, 2007. ISBN 978-3-540-73423-9
Vol. 77. Barbara Hammer and Pascal Hitzler (Eds.), Perspectives of Neural-Symbolic Integration, 2007. ISBN 978-3-540-73953-1
Vol. 78. Costin Badica and Marcin Paprzycki (Eds.), Intelligent and Distributed Computing, 2008. ISBN 978-3-540-74929-5
Vol. 79. Xing Cai and T.-C. Jim Yeh (Eds.), Quantitative Information Fusion for Hydrological Sciences, 2008. ISBN 978-3-540-75383-4
Vol. 80. Joachim Diederich, Rule Extraction from Support Vector Machines, 2008. ISBN 978-3-540-75389-6
Vol. 81. K. Sridharan, Robotic Exploration and Landmark Determination, 2008. ISBN 978-3-540-75393-3
Vol. 82. Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.), Engineering Evolutionary Intelligent Systems, 2008. ISBN 978-3-540-75395-7
Vol. 83. Bhanu Prasad and S.R.M. Prasanna (Eds.), Speech, Audio, Image and Biomedical Signal Processing using Neural Networks, 2008. ISBN 978-3-540-75397-1
Vol. 84. Marek R. Ogiela and Ryszard Tadeusiewicz, Modern Computational Intelligence Methods for the Interpretation of Medical Images, 2008. ISBN 978-3-540-75399-5
Vol. 85. Arpad Kelemen, Ajith Abraham and Yulan Liang (Eds.), Computational Intelligence in Medical Informatics, 2008. ISBN 978-3-540-75766-5
Vol. 86. Zbigniew Les and Magdalena Les, Shape Understanding Systems, 2008. ISBN 978-3-540-75768-9
Vol. 87. Yuri Avramenko and Andrzej Kraslawski, Case Based Design, 2008. ISBN 978-3-540-75705-4
Vol. 88. Tina Yu, David Davis, Cem Baydar and Rajkumar Roy (Eds.), Evolutionary Computation in Practice, 2008. ISBN 978-3-540-75770-2
Vol. 89. Ito Takayuki, Hattori Hiromitsu, Zhang Minjie and Matsuo Tokuro (Eds.), Rational, Robust, Secure, 2008. ISBN 978-3-540-76281-2
Vol. 90. Simone Marinai and Hiromichi Fujisawa (Eds.), Machine Learning in Document Analysis and Recognition, 2008. ISBN 978-3-540-76279-9
Vol. 91. Horst Bunke, Abraham Kandel and Mark Last (Eds.), Applied Pattern Recognition, 2008. ISBN 978-3-540-76830-2
Vol. 92. Ang Yang, Yin Shan and Lam Thu Bui (Eds.), Success in Evolutionary Computation, 2008. ISBN 978-3-540-76285-0
Vol. 93. Manolis Wallace, Marios Angelides and Phivos Mylonas (Eds.), Advances in Semantic Media Adaptation and Personalization, 2008. ISBN 978-3-540-76359-8
Vol. 94. Arpad Kelemen, Ajith Abraham and Yuehui Chen (Eds.), Computational Intelligence in Bioinformatics, 2008. ISBN 978-3-540-76802-9
Vol. 95. Radu Dogaru, Systematic Design for Emergence in Cellular Nonlinear Networks, 2008. ISBN 978-3-540-76800-5
Vol. 96. Aboul-Ella Hassanien, Ajith Abraham and Janusz Kacprzyk (Eds.), Computational Intelligence in Multimedia Processing: Recent Advances, 2008. ISBN 978-3-540-76826-5
Vol. 97. Gloria Phillips-Wren, Nikhil Ichalkaranje and Lakhmi C. Jain (Eds.), Intelligent Decision Making: An AI-Based Approach, 2008. ISBN 978-3-540-76829-9
Vol. 98. Ashish Ghosh, Satchidananda Dehuri and Susmita Ghosh (Eds.), Multi-Objective Evolutionary Algorithms for Knowledge Discovery from Databases, 2008. ISBN 978-3-540-77466-2
Vol. 99. George Meghabghab and Abraham Kandel, Search Engines, Link Analysis, and User's Web Behavior, 2008. ISBN 978-3-540-77468-6
George Meghabghab
Abraham Kandel

Search Engines, Link Analysis, and User's Web Behavior

With 114 Figures and 74 Tables

Dr. George Meghabghab, Professor of Computer Science, Roane State University, Oak Ridge, TN 37830, USA. MeghabghaGV@roanestate.edu
Dr. Abraham Kandel, Endowed Eminent Scholar, Distinguished University Research Professor, Computer Science and Engineering, University of South Florida, Tampa, FL 33620, USA. kandel@cse.usf.edu

ISBN 978-3-540-77468-6    e-ISBN 978-3-540-77469-3
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2008921398

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: Deblik, Berlin, Germany
Printed on acid-free paper
springer.com

Preface

[Taxonomy diagram: Web mining on the WWW branches into three components - Link Analysis (HITS algorithm, PageRank algorithm), Search Engines (radial basis functions, interior point methods, regression models), and User's Web Behavior (fuzzy cognitive maps, fuzzy bags, rough sets, information theory).]

This book presents a specific, unified framework for three major components: search engine performance, link analysis, and the user's Web behavior. The explosive growth and the widespread accessibility of the WWW have led to a surge of research activity in the area of information retrieval on the WWW. The three aspects of Web mining follow the taxonomy of the diagram above: link analysis, search engines, and the user's Web behavior are considered in one unifying approach. The book is organized in three sections as follows:

1. In Section I of the book (Chapters 2-4) we study link analysis within the hubs-and-authorities framework. Link analysis is the science of ranking hyperlink structures; such rankings are used to determine the relative authority of a Web page and to produce improved algorithms for ranking Web search results. We use the HITS algorithm developed by Kleinberg and propose to study HITS in a new 2-D space: the in-degree and out-degree variables. After categorizing each Web page into a specific topology, we study the impact of each Web topology on HITS in the new 2-D space and describe why HITS does not fare well in almost all of the different topologies of Web graphs. We also describe the PageRank algorithm in this new 2-D space and detail why PageRank does well in only a few of the Web page topologies. We then propose to improve the ranking of the different Web pages in relation to their topologies.
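To make the hubs-and-authorities idea concrete, the following is a minimal sketch of HITS-style power iteration on a toy adjacency matrix, together with the in-degree/out-degree view used in Section I. The four-page graph, the variable names, and the use of NumPy are illustrative assumptions, not the book's own code or data.

```python
import numpy as np

# Toy directed web graph: adj[i, j] = 1 if page i links to page j.
# This 4-page graph is a made-up example, not one of the book's graphs.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [1, 0, 0, 0]], dtype=float)

def hits(adj, iterations=50):
    """HITS power iteration: authority <- A^T h, hub <- A a, normalized each step."""
    n = adj.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iterations):
        auth = adj.T @ hub            # pages pointed to by good hubs
        hub = adj @ auth              # pages pointing to good authorities
        auth /= np.linalg.norm(auth)
        hub /= np.linalg.norm(hub)
    return hub, auth

hub, auth = hits(adj)
in_deg = adj.sum(axis=0)    # the 2-D view discussed in Section I:
out_deg = adj.sum(axis=1)   # in-degree and out-degree per page
print("hub scores      :", np.round(hub, 3))
print("authority scores:", np.round(auth, 3))
print("in/out degrees  :", list(zip(in_deg, out_deg)))
```

PageRank can be sketched in the same style as the principal eigenvector of a damped link matrix; the point here is only that both algorithms read the same in-degree/out-degree structure that Section I uses to classify Web graph topologies.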
2. In Section II (Chapter 5) we present search engines and compare the performance of four of them: Northern Light, Alta Vista, Yahoo, and Google. The evaluation of the quality of a search engine should involve not only the resulting set of Web pages but also an estimate of the rejected set of Web pages. A goodness measure of a search engine for a given set of queries is proposed as a function of the coverage of the search engine and the normalized age of a new document in the result set for the query. We use radial basis functions (RBF) to simulate search engines based on their characteristics, and at the end of the simulation we add another algorithm that uses interior point methods, which reduce noise and improve the performance of the simulation a great deal. We compare those performances to second-order regression analysis. We conclude that unless search engine designers address the issues of rejected Web pages, indexing, and crawling, the usage of the Web as a research tool for academic and educational purposes will stay hindered.
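As a rough illustration of the kind of RBF meta-model Section II describes, here is a minimal sketch that fits a Gaussian radial basis function network to a handful of synthetic (characteristic, response) pairs by least squares. The data, the fixed centers and spread, and the NumPy-based fitting are assumptions made for illustration; they correspond loosely to the "RBF without correction of centers, spreads, and weights" baseline, not to the book's full training procedure.

```python
import numpy as np

# Made-up 1-D training data standing in for a search-engine characteristic
# (e.g. normalized query difficulty) and a measured response (e.g. coverage).
x = np.linspace(0.0, 1.0, 20)
y = np.exp(-3 * x) + 0.05 * np.sin(12 * x)       # synthetic target curve

centers = np.linspace(0.0, 1.0, 6)               # fixed RBF centers (assumption)
spread = 0.15                                    # fixed spread (assumption)

def design_matrix(x, centers, spread):
    """Gaussian RBF activations, one column per center."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * spread ** 2))

Phi = design_matrix(x, centers, spread)
weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # linear output layer

x_test = np.array([0.05, 0.5, 0.95])
y_hat = design_matrix(x_test, centers, spread) @ weights
print("predictions:", np.round(y_hat, 3))
```

A fuller treatment would also adjust the centers and spreads during training, by gradient descent or by an interior point step, which is exactly the refinement Chapter 5 compares against plain regression models.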
3. Section III (Chapter 6) focuses on the user and his/her behavior while looking for information on the WWW. This is a major and innovative chapter of the book. The WWW is full of information, and mining the Web for different kinds of information is still in its infancy. The difficulty in evaluating information on the Web has to do with our cognitive abilities combined with personal likings, which are hard to quantify. Users surf the unbounded space of hyperlinks that is the World Wide Web (WWW). We discuss the strategies people use when surfing through it. Users form cognitive neurological networks that result in a mental pathway, or cognitive map, which makes the route to further information, as well as to the information they set out to find, more navigable. We propose three new innovative approaches to quantify the user's behavior:
   (a) Fuzzy cognitive maps (FCM) can model the causal graph of an expert's perception of how users behave on the Web. We establish a stable dynamical system that behaves well in a limit cycle. When the original FCM causal system is left to evolve in time, the system learns new rules or reinforces old ones. If Differential Hebbian Learning (DHL) is used, the system uncovers new rules both in the case of failure and in the case of success. This shows that the original causal matrix, when left to evolve in time, can uncover dynamical rules about learning that were not apparent at the beginning. The dynamical system can move from failure to success under certain conditions imposed on the user, and the same rules that switch users from failure to success can keep successful users successful.
   (b) In all studies of users' behavior when searching the Web, researchers have found that the search strategy, or traversal process, shows a pattern of moving back and forth between searching and browsing. Searching and browsing are both part of completing a goal-oriented task: finding the answer to the query. The moving back and forth, including backtracking to a former Web page, constitutes the measurable, physical states or actions that a user takes while searching the Web. The structure of a bag as a framework can help study the behavior of users and uncover some of their intrinsic properties. We first consider queries that depend on one variable, or "uni-variable" queries. An example of a "uni-variable" query is: "Find the number of Web pages of users whose searching is less than the average." Fuzzy operators were used successfully in comparing the "fuzzy bag" of the successful users to the "fuzzy bag" of the unsuccessful users to answer such a query. The research then considered queries that depend on more than one variable, or "multi-variable" queries. An example of a "multi-variable" query is: "Find the number of Web pages of users whose number of searches is less than the average and whose hyperlink navigation is less than the average." One-dimensional fuzzy bags were extended to two-dimensional fuzzy bags, and two-dimensional fuzzy bag operators were used to answer such a query.
   (c) We turn the comparison of successful students to unsuccessful students on a specific factual query into a decision table of twenty rows. Rough sets offer a unique framework to extract rules from the decision table. We consider a decision table with three sets of attribute variables:
      i. The case of 3 condition (attribute) variables: Searches (SE), Hyperlinks (H), Web Pages (W). We reduce the number of attributes and generate the rules that explain the behavior of the users using these 3 variables. We discover that the two variables H and W alone are sufficient to decide whether a user succeeded or failed; the attribute SE becomes superfluous. Six deterministic rules, with an average of 1.67 conditions per rule, were extracted, along with two non-deterministic rules.
      ii. The case of 3 condition (attribute) variables: Web Pages (W), Time (T), Searches (SE). We discover that all three variables are needed to decide whether a user succeeded or failed. Seven deterministic rules, with an average of 1.55 conditions per rule, were extracted, along with two non-deterministic rules.
      iii. The case of 4 condition (attribute) variables: Web Pages (W), Hyperlinks (H), Time (T), Searches (SE). We discover that these four attributes could not be reduced without compromising the decision of whether a user succeeded or failed. Nine deterministic rules, with an average of 1.77 conditions per rule, were extracted, along with one non-deterministic rule.
   (d) We turn the comparison of successful and unsuccessful students into a decision table of twenty rows, but this time we use the information-theoretic (ID3) framework. We compare the results of rough sets to ID3 in the three cases 3(c)i through 3(c)iii; rough sets performed better than ID3 in all three cases.
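To show the flavor of the ID3 side of that comparison, here is a minimal sketch that computes entropy and information gain for a tiny, made-up decision table with condition attributes W, H, and SE and a Success/Failure decision. The rows are illustrative placeholders, not the book's twenty-user table.

```python
from collections import Counter
from math import log2

# Hypothetical rows: (W, H, SE, decision) with discretized values, standing in
# for the kind of table Section III builds (not the book's actual data).
rows = [
    ("low", "low", "high", "F"),
    ("low", "high", "low", "S"),
    ("high", "high", "low", "S"),
    ("high", "low", "high", "F"),
    ("high", "high", "high", "S"),
    ("low", "low", "low", "F"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """Gain of splitting the decision (last column) on one condition attribute."""
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

for i, name in enumerate(["W", "H", "SE"]):
    print(f"gain({name}) = {information_gain(rows, i):.3f}")
```

ID3 splits on the attribute with the largest gain first (here H), which parallels the rough-set finding above that H and W carry most of the decision.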
Four mathematical complements expand on the needed theoretical background and present descriptions of very specific examples, including detailed results.

The book provides both theoretical and practical coverage of all three major components introduced above:
1. Includes an extensive number of integrated examples and figures.
2. Includes an end-of-chapter summary and plenty of examples to cover the ideas discussed.
3. Assumes a modest knowledge of linear algebra, fuzzy sets, fuzzy bags, and statistical analysis.
4. Fits nicely in an upper-level undergraduate course on Web mining.

This book can be used by researchers in the fields of information sciences, engineering (especially software), computer science, statistics, and management who are looking for a unified theoretical approach to finding relevant information on the WWW and a way of interpreting it from a data perspective to a user perspective. It specifically stresses how the involvement of the user looking for information, and the relevance of the information sought, bear on the performance of the medium used to find information on the WWW. In addition, the social sciences, psychology, and other fields that are interested in understanding underlying phenomena from data can benefit from the proposed approach. The book can also serve as a reference for an advanced undergraduate/graduate-level course in data mining on the WWW. Practitioners may be particularly interested in the examples used throughout the book.

USA, April 2007
George Meghabghab and Abraham Kandel

Contents

1 Basic WWW Technologies
1.1 General Introduction
1.1.1 The Internet is Big
1.1.2 Playing Smart in a Global Trade
1.1.3 EDI for the Rest of Us
1.1.4 Merger-Proofing Your Business Intelligence
1.1.5 Technical Challenges Abound
1.2 Web Documents
1.2.1 Information Overload
1.2.2 Hyperlinks
1.2.3 Heterogeneous Nature of the Web
1.3 Search Engines
1.3.1 Listed Features
1.3.2 Search Engine Statistics ([1])
1.4 Users

2 Web Graphs
2.1 Introduction
2.2 Basic Graph Theory Applied to Web Pages
2.2.1 Adjacency Matrix Representation of a Web Graph
2.2.2 Incidence Matrix Representation of a Web Graph
2.2.3 Bipartite Graphs
2.2.4 Web Topology
2.2.5 Web Graphs Topological Difference
2.3 Bow Tie Graphs
2.3.1 Introduction
2.3.2 Parametric Characteristics of Bow Tie Graphs
2.3.3 In/Out Degrees Characteristics of Bow Tie Graphs
2.4 Final Observations on Web Graphs

3 Link Analysis of Web Graphs
3.1 Introduction
3.2 Linear Algebra Applied to Link Analysis
3.3 Kleinberg's Algorithm of Link Analysis
3.4 Applying HITS to the Graphs of 2.2
3.4.1 HITS Applied to General Graphs
3.4.2 HITS Applied to In-degree Graphs
3.4.3 HITS Applied to Out-degree Graphs
3.4.4 HITS Applied to "Complete Bipartite" Graphs
3.4.5 HITS Applied to Bipartite Graphs
3.4.6 Summary of the Application of the HITS Algorithm to the Graphs of 2.2
3.5 Hubs and Authorities in Bow Tie Graphs
3.6 Similarity of Bow Tie Graphs
3.7 Limitations of Link Analysis

4 PageRank Algorithm Applied to Web Graphs
4.1 Introduction
4.2 PageRank Algorithm Applied to the Graphs of 2.2
4.2.1 PageRank Applied to Complete Bipartite Graphs
4.2.2 PageRank Applied to Out-degree Graphs
4.2.3 PageRank Applied to In-degree Graphs
4.2.4 PageRank Applied to Bipartite Graphs
4.2.5 PageRank Applied to General Graphs
4.2.6 Summary of the Application of the PageRank Algorithm to the Graphs of 2.2
4.3 PageRank Algorithm Applied to Bow Tie Graphs
4.4 Limitations of the PageRank Algorithm

5 Stochastic Simulations of Search Engines
5.1 Introduction
5.1.1 Artificial Neural Networks (ANN) on the WWW
5.1.2 Rejected World Wide Web Pages
5.2 ANN Meta-model Approach for the WWW
5.2.1 Training RBF without Correction of Centers, Spreads and Weights (RBF without CCSW)
5.2.2 Training RBF with Correction of Centers, Spreads, and Weights using Gradient Descent (RBF with CCSW using GD)
5.2.3 Training RBF with Correction of Centers, Spreads and Weights using Interior Point Methods (RBF with CCSW using IPM)
5.2.4 New RBF Interior Point Method Learning Rule
5.2.5 RBF Training Algorithm for Web Search Engines
5.2.6 Regression Models Approach for the WWW ([87])
5.3 Comparing the Results of Regression Models and RBF
5.4 Summary

6 Modeling Human Behavior on the Web
6.1 Introduction
6.2 An Experimentation of User's Web Behavior
6.2.1 User Cognitive Model of a Browsing Session
6.2.2 Results on the Fact Based Query
6.3 Fuzzy Sets Modeling of User's Web Behavior: A New Model
6.3.1 Review of Literature on FCM
6.3.2 Applying FCM to User's Web Behavior
6.3.3 The Case of Positive Influence of C7 on the Other Concepts
6.3.4 Final Observation on FCM as a Model for User's Web Behavior
6.4 Multi-dimensional Fuzzy Bags Modeling User's Web Behavior
6.4.1 Introduction
6.4.2 "Crisp Bag" Modeling User's Actions ([86])
6.4.3 One Dimensional "Fuzzy Bag" Modeling User's Actions
6.4.4 Multidimensional Fuzzy Bags Modeling of User's Actions ([93])
6.4.5 Discussion and Concluding Remarks
6.5 A Rough Set Theory to Interpret User's Web Behavior
6.5.1 General Introduction to Rough Set Theory and Decision Analysis
6.5.2 Rough Set Modeling of User Web Behavior
6.6 Information Theory to Interpret User's Web Behavior
6.6.1 ID3 Applied to Case 1: Searches, Hyperlinks, and Web Pages ([98])
6.6.2 ID3 Applied to Case 2: Searches, Time and Web Pages
6.6.3 ID3 Applied to Case 3: Web Pages, Hyperlinks, Searches and Time
6.6.4 Final Observation on Rough Set Theory and Information Theory to Interpret User's Web Behavior
6.6.5 General Summary of Rough Set Theory

References

A Graph Theory

B Interior Point Methods
B.1 Introduction
B.2 Applying IPM to the Backpropagation Learning Rule
B.3 Numerical Simulation

C Regression Models
C.1 Linear Regression
C.2 Polynomial Regression
C.3 Multiple Regression
C.4 Factorial Regression Analysis of Time
C.5 Quadratic Response Surface Regression
C.6 Final Observations on Regression Models

D Fuzzy Set Theory and Fuzzy Similarity Measures
D.1 The Emergence of Fuzzy Set Theory
D.2 Extending Blondel's Measure
D.3 The New Fuzzy Centrality Score

E Information Theory
E.1 Application of Information Theory to a Production Line Problem
E.2 Application of Rough Set Theory to a Production Line Problem
Index

List of Figures

1.1 High Level Google Architecture
1.2 Total Hits from 25 searches
1.3 Freshness
1.4 Overlap
1.5 Size Change
1.6 Unique Hits
1.7 Analytical Model of information seeking and retrieving
1.8 Augmenting Task Performance
2.1 A General Graph
2.2 Out/In Degrees for Figure 2.1
2.3 A Bipartite Graph G1
2.4 An In-Degree Graph
2.5 Out/In Degrees for Figure 2.4
2.6 An Out-Degree Graph
2.7 Out/In Degrees for Figure 2.6
2.8 A Complete Bipartite Graph
2.9 Out/In Degrees for Figure 2.8
2.10 A Bipartite Graph G1
2.11 Out/In Degrees for Figure 2.10
2.12 Bow Tie Graphs
2.13 A 10*1*10 Bow Tie Graph
2.14 A circle
2.15 A single bow tie graph
2.16 A 2/3 bow tie graph
2.17 A double bow tie graph
2.18 A 5 bow tie graph
2.19 An 8 bow tie graph
2.20 Out/In Degrees of the BTG in Figure 2.13
2.21 A 10*1 Complete Bipartite Graph (L)
2.22 In/Out for CBG(10×1) of Figure 2.21
2.23 A 1*10 Complete Bipartite Graph (R)
2.24 In/Out for CBG(1×10) of Figure 2.23
3.1 A bipartite graph
3.2 A bipartite graph G1
3.3 An In Degree Graph
3.4 An Out Degree Graph
3.5 A complete bipartite graph
3.6 A bipartite graph
3.7 The BTG of "likely" and "probable"
3.8 A mesh of the centrality matrix of "likely" and "probable"
3.9 A mesh of the centrality matrix of a BTG(50 × 2 × 100)
4.1 A General Graph
4.2 A Complete Bipartite Graph
4.3 An Out Degree Graph
4.4 An In Degree Graph
4.5 A Bipartite Graph G1
4.6 A 10*1*10 bow tie graph
5.1 A generalized RBF
5.2 Results of the 3 Algorithms
5.3 Alta Vista's Performance
5.4 Northern Light's Performance
5.5 Yahoo's Performance
5.6 Google's Performance
6.1 Markovian Model of the User cognitive map in a browsing session
6.2 Number of words vs. frequency
6.3 Query sizes in the case of Failure
6.4 Query sizes in the case of Success
6.5 Statistics of All Summary data
6.6 Detailed Statistics in the case of S and F
6.7 A User Cognitive Map showing all the moves in case of a failure
6.8 A User Cognitive Map showing all the moves in case of success
6.9 FCM of User's experience in the case of positive influence of C7
6.10 FCM of User's experience in the case of negative influence of C7
6.11 Nested FCMs can divide the concept Search into 3 categories
6.12 Nested FCMs can divide the concept Browsing into 3 categories
6.13 Fuzzy Bag for Successful Users
6.14 Fuzzy Bag for Unsuccessful Users
6.15 Fuzzy Bag of the Intersection of S and U Users (Yager's)
6.16 Fuzzy Bag of the Intersection of S and U Users (Connan's)
6.17 Fuzzy Bag of the Difference of S and U Users (Yager's)
6.18 Fuzzy Bag of the Difference of S and U Users (Connan's)
6.19 Fuzzy Bag of the Difference of S and U Users (Cardinality)
6.20 Multi Dimensional fuzzy Bag of S Users
6.21 Multi Dimensional fuzzy Bag of F Users
6.22 Multi Dimensional fuzzy Bag of the intersection of S and U users (α cuts)
6.23 Multi Dimensional fuzzy Bag of the difference between S and F Users (Yager's)
6.24 Multi Dimensional fuzzy Bag of the difference between S and F Users (Connan's)
6.25 Multi Dimensional fuzzy Bag of the difference between S and F Users (α cuts)
6.26 Case 1
6.27 Approximating SU and F using W, H, and SE
6.28 Case 2
6.29 Approximating SU and F using W, T, and SE
6.30 Case 3
6.31 Approximating SU and F using W, T, H, and SE
6.32 Case 1: Rules extracted by ID3
6.33 Case 2: Rules extracted by ID3
6.34 Case 3: Rules extracted by ID3
A.1 A General graph
B.1 Comparison of BP, LILR, and QILR on the parity problem
C.1 Time in Seconds as a function of the number of Moves for all 20 users
C.2 Linear Regression of Time with residuals. Users 5, 8, 13, and 14 have the largest residual values
C.3 Seventh Polynomial regression of Time with residuals as a function of Moves
C.4 Actual Time in seconds after removing outliers
C.5 Linear regression analysis of Time after removing outliers, with residuals
C.6 A seventh polynomial regression of Time after removing outliers, with residuals
C.7 A 3-D plot of Actual Time in seconds as a function of the number of Moves and the number of Web Searches
C.8 A multiple regression analysis of Time: Large square = regressed time, small square = actual time
C.9 A factorial time regression analysis of time: Large square = regressed time, smaller square = actual time
C.10 A Response Surface regression model of time: Large square = surface-regressed time, smaller square = actual time
C.11 A 3-D plot of Actual Time in seconds as a function of the number of Moves and the number of Web Searches after removing outliers
C.12 A multiple regression analysis of Time after removing outliers: Large square = regressed time, small square = actual time
C.13 A factorial time regression analysis of time after removing the outliers: Large square = factorial regressed time, smaller square = actual time
C.14 A Response Surface regression model of time after removing the outliers: Large square = surface-regressed time, smaller square = actual time
D.1 A precise sharp set for the synonyms of "Probable"
D.2 The BTG of "likely" and "probable"
D.3 A mesh of the centrality matrix of "likely" and "probable"
D.4 The fuzzification of the BTG of "likely" and "probable"
D.5 A mesh of the centrality matrix corresponding to Fig. D.4
D.6 A mesh of the fuzzy centrality matrix corresponding to Fig. D.4
D.7 A mesh of the transitive FCS of Fig. D.6
E.1 A decision tree
E.2 A modified decision tree
E.3 Decision Tree using Information Theory
E.4 Decision Tree using Rough Set theory

List of Tables

2.1 Size vs. Region in a Bow Tie graph
3.1 Hubs and authorities in the case of the size of the core m << N and m << M
3.2 Hubs and authorities in the case of the size of the core m < N and m < M
3.3 Similarity between the central score of the core of the BTG made of 2 nodes
4.1 The values of r1, r2, r3, r4, r5, and r6
4.2 Google's ranking applied to Figure 4.5
4.3 Modified Google's ranking
4.4 Ranking of the nodes of Figure 4.6
4.5 Google's ranking on a general bow tie graph
4.6 Google's ranking does work in the case of N > M
4.7 Google's ranking for a bow tie graph where N < M
4.8 Google's ranking of a bow tie graph where m = M and m = N
5.1 Data Presentation Methods Used
5.2 Regression Models
5.3 Comparison of Test Results (M.A.E. stands for mean absolute error)
6.1 Edges between the different states
6.2 Occurrence of queries for the fact based query
6.3 Misspelled queries used and their frequency
6.4 Occurrence of Queries as a function of the length of the fact-based query
6.5 Change in word length vs. the number of users that failed
6.6 Change in word length vs. the number of users that succeeded
6.7 Modified Queries (MQ) vs. the length of terms on the fact-based query (F=Failure, S=Success)
6.8 Repeated Queries (RQ) as a function of the length (L) of terms on the fact-based query (S=Success, F=Failure)
6.9 Rules on success uncovered by the original model, CE, and DHL model (C.=Constrained, R.I.=Relevant Information)
6.10 Rules on failure uncovered by the original model, CE, and DHL model (C.=Constrained, R.I.=Relevant Information)
6.11 The greatest fuzzy bag such that Max of w(S1)α(1W) + w(AS∩AF)α(1W) ≤ w(AS)α(1W)
6.12 The greatest fuzzy bag such that Max of w(S1)α(1W) + w(AS∩AF)α(1W) ≤ w(AS)α(1W)
6.13 S1(1W) = {<1, 1>, <.7, .4> // 1W} is the largest fuzzy bag where w(S1)α(1W) ≤ w(S1)α(1W)
6.14 Detailed Actions of Successful Users
6.15 Detailed Actions of Unsuccessful Users
6.16 Summary of Successful Users
6.17 Summary of Unsuccessful Users
6.18 Summary of All Users
6.19 Result of applying steps a-b of SSP
6.20 Eliminating all contradictory users
6.21 9 users are left after applying steps a-d of SSP
6.22 Reordered decision table of Table 6.21
6.23 Table 6.22 with only 2 attributes W and H
6.24 Applying steps a-b to Table 6.23
6.25 Core of the set of final data (Step e of SSP)
6.26 Set of reduct set (Step f of SSP)
6.27 Table 6.20 with 2 attributes W and H
6.28 Applying steps a-b to Table 6.27
6.29 Applying steps e-f to Table 6.28
6.30 Summary of All Users
6.31 Result of applying steps a-b of SSP
6.32 Eliminating all contradictory users
6.33 9 users are left after applying steps a-d of SSP to Table 6.32
6.34 Core of the set of final data (Step e of SSP)
6.35 Set of reduct set (Step f of SSP)
6.36 Core Set of Table 6.32
6.37 Reduct Set of Table 6.32
6.38 Table with all 4 attributes: W, H, SE, and T
6.39 Result of applying steps a-b of SSP
6.40 Eliminating contradictory rules in Table 6.39
6.41 Core of the set of final data (Step e of SSP)
6.42 Set of reduct set (Step f of SSP)
6.43 Applying steps a-e of SSP to Table 6.40
6.44 Set of reduct set (Step f of SSP)
6.45 Comparison of generated rules between RS Theory and ID3
6.46 Comparison of generated rules between RS Theory and ID3
6.47 Comparison of generated rules between RS Theory and ID3
6.48 Summary of Comparison between RS theory and ID3
B.1 Results of the parity problem
C.1 Summary of the 20 users
C.2 Summary of time and Moves after removing users 5, 8, 13, and 14
C.3 Summary of the 20 users
C.4 Summary of the 16 users after removing the outliers
E.1 Summary of the outputs
E.2 Data supplied by the expert's company
E.3 Set of examples with indispensable attributes
E.4 Modified Table
E.5 Modified Table
E.6 Modified Table

1 Basic WWW Technologies

1.1 General Introduction

Internet search service google.com recently announced that its indexes span an unprecedented 8 billion Web documents. The total includes a few billion Web pages, a few hundred million images, and about a decade's worth - some 1 billion - of newsgroup messages. Promoting technology that scours this vast collection in less than a second, Google says that obtaining a similar result by hand would take a human searcher 15,180 years - assuming someone could be found to scrutinize one document every minute, twenty-four hours a day. If we were to put 15,180 humans on the task, it could be done in one year; roughly 5.6 million people could finish it in a day; and even 333 million people - about 1 in every 20 people on the face of the earth - doing nothing but scrutinizing the 8 billion Web pages Google can index for now would still need about 24 minutes.
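A quick back-of-the-envelope check of those head counts, under the stated one-document-per-minute assumption (the round numbers below are approximations of the figures quoted above, not authoritative statistics):

```python
# Rough scale check for the one-document-per-minute illustration above.
docs = 8_000_000_000            # documents Google reported indexing
docs_per_person_per_min = 1

total_minutes = docs / docs_per_person_per_min
years_single_person = total_minutes / (60 * 24 * 365)
people_for_one_year = docs / (60 * 24 * 365)
people_for_one_day = docs / (60 * 24)
minutes_for_333_million = docs / 333_000_000

print(f"one person, non-stop: ~{years_single_person:,.0f} years")       # ~15,200
print(f"to finish in a year : ~{people_for_one_year:,.0f} people")      # ~15,200
print(f"to finish in a day  : ~{people_for_one_day:,.0f} people")       # ~5.6 million
print(f"333 million people  : ~{minutes_for_333_million:.0f} minutes")  # ~24
```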
As stark as this illustration may be, it points to the burgeoning scale of Web-based information resources. In a few short years, the Web has become our most compelling technological accomplishment. It is the world's largest library and telephone network, the world's largest jukebox and photo album. It is global, and at the same time perfectly local. It is the fastest growing enterprise on Earth, and yet no stockbroker, media pundit, or taxman can tell us who is actually in charge. Social scientist Ithiel de Sola Pool [27], in a posthumous 1990 collection, "Technologies Without Boundaries," called the Internet "part of the largest machine that man has ever constructed." A dozen years later, de Sola Pool appears eloquent in his understatement. As we learn to use this machine for more than merely searching an impending googolplex of indexed terms, our ability to repurpose, aggregate, and cross-reference individual data items inexpensively offers new impetus to economic and societal velocity. Where today the Web presents a loosely coupled periodical file or a waste-bin of one-way screeds, we are discovering ways to unearth semantic clues that yield new knowledge, fresh applications and, perhaps, astounding insights. For enterprises doing business on the Web, the network is also becoming the world's largest cash register, a development that will only accelerate as we find new ways to mine value from the content institutions and individuals place there.

Information architects use the term Web mining to describe the process of locating and extracting usable information from Web-based repositories, and doing so on a repeatable, continuous basis. In this book, we describe this new discipline and explain "hubs" and "authorities" on the Web and Google's Web ranking, so as to arrive at better mining of the Web for information as well as at users who understand their own needs better.

1.1.1 The Internet is Big

In the decade following the release of the NCSA Mosaic Web browser, the Worldwide Web has become the world's de facto information repository. For 2002, internet marketing statistician eMarketer estimates that there are already 168 million internet users in North America, lagging slightly behind the 175 million users in Europe and easily doubling Asia's estimated 85 million users. With Asia pacing the 38% growth in these numbers, we can expect some 590 million Web-surfing individuals in the three regions by 2004. Despite public perceptions that the internet juggernaut is fueled by hype, the scope and scale of this deployment is unprecedented in human experience. Today, virtually no one disputes that the worldwide Web is becoming an integral component in the way advanced societies give and get information. Here are just a few measures of the Web's utility to consumers:
1. 85% of North American internet users report having bought goods or services online within the past year (up from 77% a year ago).
2. Online retail sales in the US totaled $51.3 billion in 2001 and are expected to jump 41% to $72.1 billion this year.
3. 49% of North American internet users have conducted a bank transaction online (up from 29% a year ago).
4. 19% of all US people aged 12 and older report having downloaded digital music files from an online file-sharing service (that's 41.5 million people).
5. 13% have invested online (up from 10% a year ago).

These numbers are dwarfed, however, by internet-enabled activity by and between businesses. Whether explicitly global or durably local, growth in business-to-business ecommerce is exploding.
eMarketer projects that worldwide business-to-business (B2B) ecommerce revenues will grow by almost 74 percent in 2002. From a base of $278 billion in 2000, B2B revenues will reach $823.4 billion by the end of this year and eclipse the trillion-dollar mark in 2003. All this activity, of course, generates new data. A recent University of California-Berkeley study found that the world's annual production of information exceeds 250 megabytes per living person, the largest proportion of which now first appears in digital form, making it a candidate for inclusion in Web-based applications. Whether we're talking about transaction details along an industrial supply chain, bid/asked quotations in an electronic market, or customer complaints through a CRM portal, much of this content either originates in or is destined for a structured database management system and its dependent applications. There - safely defined in rows and columns and validated against application and data-type logic - it can be queried, extracted, extrapolated, or conjoined in new ways that the original program designers scarcely contemplated. Such is the power of structured data management tools. What is less easily accomplished is sharing this data beyond an enterprise's firewalls in anything other than static form. Every commercial database reporting tool now generates Web browser output, but it stands as one of the great ironies of the relational database era that so much of any system's output becomes unstructured again at the first point where it actually becomes useful to someone.

Unlike the orderly universe of database management, Web mining is concerned with the detection, capture, and subsequent re-use of information content that we first encounter "in the wild," that is to say, in a Web-page presentation accessible via browser. Thanks to their ubiquity and low cost, browsers have become our indispensable keyholes into enterprise- and customer-facing applications, with the result that a growing proportion of business data are destined to pass through one at some point during the digital half-life. A primary premise of Web miners is that this content is "low-hanging fruit" that can be harvested and re-used. Where do we find the low-hanging fruit? We use Web mining when:
1. We need access to external counter-party or partner data.
2. We need to aggregate many values from sources that are spread across independent systems.
3. Our applications require loosely structured or unstructured data, such as text in headlines or resumes.

1.1.2 Playing Smart in a Global Trade

Global energy markets offer significant pricing and strategic leverage to participants who come to trade armed with well-head production figures, tanker routing, or other operational data that support accurate commodity value discovery. Specialized (and profitable) information services have long existed to satisfy this demand, which they accomplish by gleaning breaking information from market participants, then filtering it through analytical models and broadcasting it over subscriber newsfeeds. Today, energy producers routinely record notices of vessel sailing dates or production data on their own Websites, making the collection and assembly of up-to-date information a matter of careful diligence more than hardcore sleuthing. This process is complicated, however, by the sheer number of players involved and the frequent changes in critical information.
Web mining provides these specialized publishers with a way to scale their market coverage and improve the timeliness and accuracy of their services without adding new researchers.

1.1.3 EDI for the Rest of Us

"Just-in-time" manufacturers in the semiconductor and electronics industries publish catalog or inventory data on their Websites as a pragmatic way to accomplish non-invasive EDI inexpensively. They've learned that monitoring "loosely-coupled" supply chain relationships through a partner's finished-goods or parts inventories facilitates their own ordering, production planning, or billing processes. Unfortunately, most users of this data find themselves manually extracting information (and more often than not, re-keying it) from their partners' disparate Websites. One businessman described the dilemma this way: "We have access to our most-critical suppliers' inventories and use this information constantly in our production planning and purchasing decisions. Problem is, our divisions, product-line managers and departmental functions all have different requirements and there isn't any single business application that consolidates a global view of our marketplace in a way that suits everybody. If we had just one supplier, we'd 'hardwire' our systems to his and get everything we need in real-time. But we've identified 40 key external participants in our production processes and putting together a 'global' view of our supply-side marketplace from, say, a Financial or Purchasing or Production Control or TQM perspective is still largely a matter of cut-and-paste. It's a lot of handwork."

Automated Web mining tools allow much closer, round-the-clock scrutiny of vertical supply chains and eliminate the handwork. Partners can assemble a complete view of the components used in manufacturing processes and make better decisions about re-supply agreements, accounts payable issues, or potential sources of production delays. Where "short-run" production relationships or nimble marketing strategies make hard-wiring EDI protocols difficult, Web mining can mean the difference between operating efficiently and operating in the dark.

1.1.4 Merger-Proofing Your Business Intelligence

Global enterprises doing business in distant markets have long understood the imperative to know their customers thoroughly. That means anticipating customer demand, correlating services across a history of customer interactions, and staying atop developments in a customer's downstream markets. CRM software provides a powerful means to track all this data, but it requires interaction with separate accounting and order-management systems, customer call centers, and business intelligence applications both within and outside an organization's firewalls. Where all the pieces exist in a single enterprise, this is a significant integration chore. But add the urgency of mergers or other strategic business reorganizations and this hard-won transparency is often the first element to fail. Companies that know everything about their markets are the first to discover how little they know about customers acquired in a business combination. Web mining is a pragmatic way to fill in those blanks, incorporating data that would otherwise require complicated software integration and expensive marketplace assessments by an army of consultants.
Whether viewed as a pragmatic solution to information architecture problems or as a new approach to automating business intelligence, Web mining unlocks valuable nuggets of information from the looming info-glut and delivers essential content to the heart of critical business processes.

1.1.5 Technical Challenges Abound

General-purpose Web mining applications are, of course, only just now becoming feasible. Where Web applications have made great strides in improving the timeliness or accessibility of data across the extended enterprise, browsers and visual representation remain their primary output state. Getting specific content out of the browser and into an application's database, a spreadsheet model, an email alert, or a mobile employee's wireless device is the new challenge. Where data must be exchanged and re-purposed for new uses, the visual representation we experience in our browser sessions becomes a virtual speed-bump. The manual cut-and-paste experience shared by nearly everyone who has ever circulated an internet joke or captured a URL in a browser bookmark offers a conceptual foretaste of Web mining. Simple as it is to understand, though, Web mining is very difficult to replicate in automated processes. That's because the Web is not constructed like a database, where standard access- and data-type rules, sharing conventions, or navigational procedures enforce predictability. As implemented, the Web is the Wild, Wild West, ruled not by gunslingers but by graphical considerations and constant ad hoc construction that long ago defied machine parsing. In "Digerati: Encounters With the Cyber Elite" by John Brockman [14], former Thinking Machines Corporation founder Danny Hillis writes: "When computers pass around Web pages, they don't know what they're talking about. They're just passing around bits, and they can be meaningless bits as far as they're concerned. They're just acting like a big telephone system. It's much more interesting when computers actually understand what they're talking about."

A fierce observer of modern media and the societies they serve, Marshall McLuhan [79] would have loved the worldwide Web. As a result of the Web's early dominance, the internet today exists largely as a simple global conduit connecting human audiences to rich graphical or multimedia content organized around human sensory expectations. Web designers have seen their craft evolve from composing static pages that must be created, published, and stored ahead of any authentic user demand, to dynamic applications that interoperate with backend databases, instrumentation packages, and contextual information supplied by viewers. State-of-the-art Websites composite many applications or data sources into aggregated "portal" pages and populate them freshly for us, on the fly. These applications are richer by far, but they are fundamentally tied to our human capacity to inquire, to comprehend, and to absorb. In part, that's because the internet as we know it is designed to be keyboard-driven, transmuting mouse clicks, button and scrollbar functions, and visible hypertext clues into the logic that defines our interaction with a distant server. Even where content is audible (as with streaming media) or report-structured (as with the tabular output of database inquiries), we still navigate by pages, borrowing a visual publishing metaphor older than moveable type. All information on the Web thus takes its shape in conformance with the tyranny of browsers.
Since this is exclusively a presentation-layer service, the horsepower trapped in millions of desktop computers does little more than host session-level page presentations, squandering one of the largest peacetime capital investments in the history of civilization. Unfortunately, until applications routinely comply with industrywide peer-to-peer computing standards, machines will do a poor job parsing Web-based applications for intelligence. That’s because today’s graphicallyrich Web applications make use of HTML extensively, commingling actual data content with display and formatting instructions that are meaningful only when a page is rendered for human consumption. A generation of graphics designers has taken to heart the lessons of Edward Tufte’s [128] “The Visual Display of Quantitative Information,” turning tables of dull numbers into rich visual metaphors, all to better stimulate our cognitive appreciation of the relationships they reveal. We know what thy mean when we see them and, driven by concern for our cognitive bandwidth, Web designers take generous liberties composing pages that “look right.” They routinely pad tables with non-displaying characters or deviate from previous style sheet declarations to make objects “fit” visually. If this were as bad as it gets, the machine-based parsing tools that make Web mining practical could be assumed to eventually conquer the problem with brute computational force. But it’s much worse than that. Brute force presumes the existence of standard text labels from standard glossaries, or at worst precise positional repetition throughout a file. What is a machine to do, for instance, when a five-digit numeric sequence that one Web-designer labels “mail:” is pre-pended on another site by a graphical object or? Are they both zip codes? If simplicity and broad application across hundreds or thousands of Websites is our goal, the need to record complex parsing rules for every such example defeats our purpose. Positional data, too, is meaningless when Web pages routinely adjust themselves to accommodate browser versions or even a user’s screen settings. It’s fair to say that business transaction records - arguably among some of the most valuable content on the Web - change too frequently for such simple screen-scraping techniques to work reliably. The problem becomes even more difficult when information is dispersed in bits and pieces all over a Website. Web mining tools need to be able to follow links and perform multi-page aggregations. Secure access provisions also pose issues for software that must scour a distant site at regular intervals. Getting through password authentication or firewalls is intentionally difficult. As if these challenges weren’t enough, much of the information destined for Web mining applications is stored and displayed in other formats, such as PDF files. A robust Web mining software solution is one that recognizes content within such “deep Web” structures and, similar to parsing HTML, converts markup characteristics into semantic structure. 1.2 Web Documents 1.2 Web Documents 7 “The Web has grown into a vital channel of communication and an important vehicle for information dissemination and retrieval.”(Martzoukou [77]). Whilst more than one would be persuaded that information retrieval on the Web is easy and at hand with a simple click of the mouse, still, the situation is not as bright. 
Information retrieval on the Web is far more difficult and different from other forms of information retrieval than one would believe since Web documents (pages) are voluminous, heterogeneous and dynamic with a lot of hyperlinks including useless information such as spams, have more than a dozen search engines, make use of free text searching, are mostly without metadata and standards, and contain multimedia elements. The principles of searching a catalogue are different from that of a WWW and the WWW is devoid of cataloguing and classification as opposed to a library system. Even the evaluation of information retrieval on the Web is different. All these points contribute to make information retrieval on the Web different from say, a library catalogue or even an electronic database. “Information retrieval deals with the representation, storage, organization of and access to information items. The representation and organization of the information should provide the user with easy access to the information in which he/she is interested ([4]). Information retrieval on the Web differs greatly from other forms of information retrieval and this because of the very nature of the Web itself and its documents. Indeed, “the WWW has created a revolution in the accessibility of information” (National Information Standard Organization, 2004). It goes without saying that the Internet is growing very rapidly and it will be difficult to search for the required information in this gigantic digital library. 1.2.1 Information Overload Information overload can be defined as “the inability to extract needed knowledge from an immense quantity of information for one of many reasons” (Nelson [105]). One of the main points that would differentiate information retrieval on the Web from other forms of information retrieval is information overload since the Web would represent items throughout the universe. “The volume of information on the Internet creates more problems than just trying to search an immense collection of data for a small and specific set of knowledge” ([105]), An OPAC will only represent the items of a specific library or a union catalogue will only represent items of participating libraries within a region or country. It is known that large volumes of data, especially uncontrolled data, are full of errors and inconsistencies. Contrast the situation on the Web by Nelson [[105]], “when we try to retrieve or search for information, we often get conflicting information or information which we do not want”; and by Meghabghab [90] “finding authoritative information on the Web is a challenging problem” to the situation of a library where we would get both authoritative and relevant information. Information overload on the Web makes 8 1 Basic WWW Technologies retrieval become a more laborious task for the user. The latter will either have to do several searches or refine searches before arriving at the relevant and accurate document. Therefore, as opposed to a search in a catalogue, with the Web, there is no instantaneous response as such. There is no instantaneous response in the sense that there are no instant relevant results and this is not to be mixed up with instantaneous response in the sense that it is true that whilst we tap in keywords, we do get a thousand of hits. With the advent of the Web, users will need easier access to the thousands of resources that are available but yet hard to find. 
1.2.2 Hyperlinks “With the advent of the Web, new sources of information became available, one of them being hyperlinks between documents” (Henzinger et al. [53]). In an OPAC for example, one would not find hyperlinks. The Web would be the only place to find hyperlinks and this brings a new dimension to information retrieval. With hyperlinks, one information leads to another. If the user were viewing a library catalogue, he/she would be directed to let’s say a book and that’s the end to it. But if that same user does a search on the Web, he/she would be presented with a thousand of hits, he/she would then need to choose what to view and then the Web pages viewed would refer the user to yet other pages and so on. As pointed out in [53], “hyperlinks provide a valuable source of information for Web information retrieval” hyperlinks are thus providing us with more or even newer information retrieval capabilities. 1.2.3 Heterogeneous Nature of the Web “A Web page typically contains various types of materials that are not related to the topic of the Web page” (Yu et al. [133]). As such, the heterogeneous nature of the Web affects information retrieval. Most of the Web pages would consist of multiple topics and parts such as pictures, animations, logos, advertisements and other such links. “Although traditional documents also often have multiple topics, they are less diverse so that the impact on retrieval performance is smaller” (Yu et al. [133]). For instance, whilst searching in an OPAC, one won’t find any animations or pictures interfering with the search. Documents on the Web are presented in a variety of formats as opposed to catalogues and databases. We have HTML, pdf, MP3, text formats, etc. The latter can be barriers to information retrieval on the Web. For one to retrieve a pdf document on the Web, one must have Acrobat Reader software installed and enough space on the computer’s hard disk to install it. 1.3 Search Engines A search engine is a very complex machine. To engineer one is a very challenging task. From the early days of yahoo, and google, creating a search engine which scales to then’s Web presented many challenges. Fast crawling 1.3 Search Engines 9 technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. As opposed to other forms of retrieval, the Web makes use of sophisticated information retrieval tools known as search engines. A search engine is simply “a Web site used to easily locate Internet resources”. Search engines have facilitated the information retrieval process by adopting techniques such as Artificial Intelligence, Bayesian Statistics and probability theory, weighting and also, query by example (see Mathematical Complements) . We can add that without search engines, information retrieval would be impossible. The Web relies upon search engines just like libraries rely upon catalogues. These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. A search engine would consist of three parts, namely, an interface, an index and the Web crawler or spider. 
The interface is the Web page where one would normally formulate his/her queries whereas the index would be the database operating behind the Web page. The crawlers or spiders are programs that would crawl throughout the Web, visiting each site and gathering information. The role of the search engine is to provide more control for the user in performing a search. Those search engines make use of the index to fetch terms of the query. The higher the data in the index, the higher would be the number of hits. However, the size of the index would vary from search engine to search engine but the bigger the index the better and the more often it is updated the better. Search engines are different in nature to electronic databases or library catalogues. Search engines would include a number of free Web pages from around the world since no search engine would include every Web page whereas electronic databases would include citations to some of the articles published in a particular subject or journal. The latter may be a fee-based service. Library catalogues of whatever format would record the items of libraries. The advantage with electronic databases and library catalogues is that they are regularly updated whereas search engines do not have a definite timeframe as to when they are updated and contain links that no longer exist. Statements as those of Hersh [55], “unlike MEDLINE and CD-ROM textbooks, the Web is not a single database”. Therefore, given the size of the Web, no single search system will be able to search the entire Web. As far as the searching process is concerned, search engines would make use of keywords or subject headings that they have themselves established whereas library catalogues would make use of standard subject headings such as the Library of Congress Subject Headings. Carrying out a search using a search engine would be different from that of a catalogue in the sense that we would hardly find any fields for titles, names and subjects in a search engine. We can 10 1 Basic WWW Technologies do a title search using a search engine but it will operate on the title as is the title and not all Web sources have proper titles. Whilst searching in a library catalogue, knowing a title is enough and we can get to the source directly. But nevertheless, with a search engine, the depth of searching is deeper than with a library catalogue. For instance, search engines would search parts of a book if the book is available on the Web but a library catalogue would search the book as a whole and it is up to the user to find the relevant chapters. Search engines also present the hits in order of relevance but it is to be noted that only the person doing the search can judge the relevance of a document. While moat OPACs would present result sets sorted by either author, title or in chronological order. As opposed to other forms of information retrieval tools, search engines provide us with more up to date information at the click of a mouse despite the fact that not all the information are useful ones. Once someone writes a piece of a material and puts it online, anyone in the world will be able to reach it. In designing google for example, Brin and Page [12] have considered both the rate of growth of the Web and technological changes. From the early days, google was designed with the intentions of scaling well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see figure 1.1). 
Further, they expected that the cost to index and store text or HTML will eventually decline relative to the amount that will be available. This will result in favorable scaling properties for centralized systems like google. In google, the Web crawling (downloading of Web pages) is done by several distributed crawlers. There is an URLserver that sends lists of URLs to be fetched to the crawlers. The Web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the Web pages into a repository. Every Web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a Web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of “barrels”, creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every Web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into 1.3 Search Engines 11 Fig. 1.1. High Level google Architecture the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a Web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries. The principles of searching a library catalogue, is also very different from carrying out a search on the Web. We are familiar with the library catalogue which would consist of a call number, a title or author or subject entry and all these in a standardized format. If the user knows one of this information, then the user will be able to retrieve the exact information. However, searching the Web would depend upon keywords and the Boolean operators and, “OR” and “NOT” in order to either broaden or narrow a 12 1 Basic WWW Technologies search. The Boolean method is a fast and fairly easy method of retrieval used in search engines provided the user knows how to do the search. However, the problem is that the user should have some knowledge of the search topic in order for the search to be efficient and effective. If the user enters the wrong term in the search, then the relevant documents might not be retrieved. As opposed to a search in a library catalogue, with the Web, the information is only at a click of the mouse. 
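Returning for a moment to the crawling-and-indexing pipeline of figure 1.1, the forward-index/inverted-index step and the default AND treatment of multiple query terms can be made concrete with a deliberately tiny sketch. The miniature document collection, tokenizer and Boolean helpers below are hypothetical stand-ins, not google's actual data structures (there are no barrels, wordIDs or hit lists here); the point is only to show how a forward index is inverted and how AND and exclusion queries reduce to set operations.

```python
# A toy sketch (not google's actual data structures) of the indexing steps
# described above: documents are parsed into word occurrences, collected into
# a forward index (docID -> words), then inverted into an inverted index
# (word -> docIDs), which also supports simple Boolean queries.
from collections import defaultdict

# Hypothetical miniature document collection (docID -> text).
docs = {
    1: "search engines crawl and index the web",
    2: "link analysis ranks web pages",
    3: "users query search engines with keywords",
}

def tokenize(text):
    """Lowercase and split a document into word tokens."""
    return text.lower().split()

# Forward index: docID -> list of (position, word) pairs.
forward_index = {doc_id: list(enumerate(tokenize(text)))
                 for doc_id, text in docs.items()}

# Inverted index: word -> set of docIDs containing it.
inverted_index = defaultdict(set)
for doc_id, postings in forward_index.items():
    for _position, word in postings:
        inverted_index[word].add(doc_id)

def boolean_and(*terms):
    """AND query: documents containing every term (the google default)."""
    sets = [inverted_index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_not(include, exclude):
    """'include -exclude' query: documents with one term but not the other."""
    return inverted_index.get(include, set()) - inverted_index.get(exclude, set())

if __name__ == "__main__":
    print(boolean_and("search", "engines"))   # {1, 3}
    print(boolean_not("web", "link"))         # {1}
```

In a real engine the postings would of course be compressed, sharded and sorted by document and word identifiers, which is what the barrels and the sorter described above take care of.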
With a library catalogue, one must be physically present in the library, carry out the search, memorize the call number and go to the shelves to retrieve the book or document. The next problem is that the document might not be on the shelves. With the Web, one can have access to every other document on the globe provided the document is online. The main problems are the quantity and quality of information. That is to say, a lot of ‘rubbish’ is being published on the Web whereas in a library, we would not get that sort of problem. Given the amount of information on the Web, searching for the appropriate and relevant document is no easy task. With a single search on the Web, one might get millions of hits and opening every page would be time-consuming. Library catalogues, somehow, whether online or manual, provide the user with more bibliographic information, thus facilitating information retrieval. In a catalogue for example, one would get some very brief bibliographic information for the material searched. On the Web, one would have recognized the fact that the hits give us only one line of information that is at times even unintelligible until you click on the hit in order to see what it is. Through the years search engines have died and others have stayed the course. Yet some of the statistics of the early years of 2000 is helpful to gage the characteristics of search engines. Search Engines showdown [1] kept a list of 10 features to compare search engines. This section looks closely at all these features especially that some of them will be used in more details in Chapter 5. Since google has dominated the search engine showdown in the mid of 2004, the way google defines these terms is used as a framework. 1.3.1 Listed Features 1. Defaults: What happens when multiple terms are entered for a search using no Boolean operators, + or − symbols, phrase marks, or other specialized features. Example: two terms could be processed as two AND terms two OR terms or “two terms” as a phrase. In google for example, multiple search terms are processed as an AND operation by default. Phrase matches are ranked higher 2. Boolean: refers to how multiple terms are combined in a search. “AND” requires that both terms be found. “OR” lets either term be found. “NOT” means any record containing the second term will be excluded. ( ) means the Boolean operators can be nested using parentheses. + is equivalent to AND, requiring the term; the + should be placed directly in front of the search term. - is equivalent to NOT and means to exclude the term; 1.3 Search Engines 13 the - should be placed directly in front of the search term. Operators can be entered in the case shown by the example. Example: “(Meghabghab and (authorities or hubs) and not Kleinberg)” returned 34 hits in google (March 28, 2007). while “(Meghabghab and (authorities or hubs))” returned 247 hits (March 28, 2007). Google always searches for pages containing all the words in your query, so you do not need to use + in front of words. 3. Proximity: refers to the ability to specify how close within a record multiple terms should be to each other. The most commonly used proximity search option in Internet finding aids is a “phrase search” that requires terms to be in the exact order specified within the phrase markings. The default standard for identifying phrases is to use double quotes (“ ”) to surround the phrase. Phrase searching example: “soul searching is good” in google returned 35 hits. 
Beyond phrase searching, other proximity operators can specify how close terms should be to each other. Some will also specify the order of the search terms. Each search engine can define them differently and use various operator names such as NEAR, ADJ, W, or AFTER. Example: Meghabghab NEAR Kleinberg returned 23 hits in google (March 28, 2007). 4. Truncation and Stemming: Truncation refers to the ability to search just a portion of a word. Typically, a symbol such as the asterisk is used to represent the rest of the term. End truncation is where several letters at the beginning of a word are specified but the ending can vary. With internal truncation, a symbol can represent one or more characters within a word. While stemming related to truncation, usually refers to the ability of a search engine to find word variants such as plurals, singular forms, past tense, present tense, etc. Some stemming only covers plural and singular forms. End Truncation Examples: Meghabgha* finds 3 hits in yahoo for Meghabghab (nothing else) but none in google since the * feature did not work in google (April 6, 2005). Since then google added that feature and as of June 6, 2006 Meghabgha* returned 3 hits in google. Truncation Examples: Megha*ghab finds none in yahoo. In google, as of June 6, 2006, one hit is returned. Yet, if you were to type “soul * is good” yahoo returned 9600 hits among with the following phrase ranking second “Like cold water to a weary soul, so is good news from a far country”, while google returned 61,500 hits with “As cold waters to a faint soul, so is good news from a far countrysoul searching is good” ranked fourth. 5. Case: In general, most search engines will match upper case, lower case, and mixed case as all the same term. Some search engines have the capability to match exact case. Entering a search term in lower case will usually find all cases. In a case sensitive search engine, entering any upper case letter in a search term will invoke the exact case match. Example: meghabghab, Meghabghab returns equal hits while MEGHABGHAB returns more hits in yahoo. Google has no case sensitive searching. 14 1 Basic WWW Technologies 6. Fields: Fields searching allows the searcher to designate where a specific search term will appear. Rather than searching for words anywhere on a Web page, fields define specific structural units of a document. The title, the URL, an image tag, and a hypertext link are common fields on a Web page. Example: title:Meghabghab will look for the word ‘Meghabghab’ in the title of a Web page. yahoo returns 30 hits. While google uses intitle to get similar results to that of yahoo. Example: intitle:Meghabghab returns 60 Hits in google (June 6, 2006). 7. Limits: The ability to narrow search results by adding a specific restriction to the search. Commonly available limits are the date limit and the language limit. The latter would restrict the search results to only those Web pages identified as being in the specified language. google has language, domain, date, filetype, and adult content limits. The date limit, added in July 2001, is only available on the Advanced Search page. Only three options are available: Past 3 Months, Past 6 Months, or Past Year. Meghabghab past 3 months returns 11,400 hits (as of March 28, 2007), returns 11,300 hits for past 6 months (as of March 28, 2007), and returns 11,400 hits for the past year (as of March 28, 2007). As for the file type you can specify a file type: Meghabghab filetype:pdf returns 164 hits in google. 8. 
Stop Words: Frequently occurring words that are not searchable. Some search engines include common words such as ‘the’ while others ignore such words in a query. Stop words may include numbers and frequent HTML strings. Some search engines only search stop words if they are part of a phrase search or the only search terms in the query. Examples: the, a, is, of, be, l, html, com. 9. Sorting: The ability to organize the results of a search. Typically, Internet search engines sort the results by “relevance” determined by their proprietary relevance ranking algorithms. Other options are to arrange the results by date, alphabetically by title, or by root URL or host name. In google, results are sorted by relevance which is determined by google’s PageRank analysis, determined by links from other pages with a greater weight given to authoritative sites. Pages are also clustered by site. Only two pages per site will be displayed, with the second indented. Others are available via the [More results from · · ·] link. If the search finds less than 1,000 results when clustered with two pages per site and if you page forward to the last page, after the last record the following message will appear: In order to show you the most relevant results, we have omitted some entries very similar to the 63 already displayed. If you like, you can repeat the search with the omitted results included. Clicking the “repeat the search” option will bring up more pages, some of which are near or exact duplicates of pages already found while others are pages that were clustered under a site listing. However, clicking on that link will not necessarily retrieve all results that have been clustered under a site. To see all results available on google, you need to check under each site cluster as well as using the “repeat this search” option. 1.3 Search Engines 15 10. Family Filters: The ability to limit or filter search results to exclude adult or other material that may be objectionable. In particular, these are made available to exclude results deemed potentially inappropriate for viewing by children. 1.3.2 Search Engine Statistics ([1]) 1. Size: The size feature compared nine search engines, with MSN Search and HotBot representing the Inktomi database. This analysis used 25 small single word queries. google found more total hits than any other search engine. In addition, it placed first on 25 of the 25 searches, more than any of the others and the first time that any search engine placed first on every single search. AlltheWeb moved back into second place with significant growth since March. AltaVista also had significant growth and moved up to third. WiseNut dropped to fourth and HotBot is up to fifth. Despite sharing an Inktomi source, HotBot found more than MSN and included PDF files not available from MSN. Note that the number given in the chart reflects only the number of Web pages found, not the total number of results. The chart in figure 1.2 gives the total verified number of search results (including Web pages, PDFs, other file types, and even google’s unindexed URLs) from all 25 searches. Since the exact same queries were used in March 2002 and August 2001, the other columns give previous totals. Fig. 1.2. Total Hits from 25 searches 16 1 Basic WWW Technologies 2. Freshness: All search engines are pictures of the past, but which search engine has taken its picture most recently? This comparison tries to begin to answer that question and tracks how the search engines change over time. 
Key points from results: (a) most have some results indexed in the last few days; (b) but the bulk of most of the databases is about 1 month old; (c) and some pages may not have been re-indexed for much longer. This freshness showdown evaluated the search engines based on 6 searches (with 10-46 total useful matches per search engine). The useful matches were hits for specific pages that (a) are updated daily and (b) report that date, where the reported date is also visible from the search engine results. The chart in figure 1.3 shows how old the most recently refreshed pages were and the age of the most ancient page retrieved.
Fig. 1.3. Freshness
All URLs analyzed showed the current date at the time of the test, but the search engines had all indexed older versions. Since the freshest search engines are so crowded at the top, a second graph gives a close-up of the five most current databases, looking at the recent hits from about the last month and a half. 3. Overlap: This analysis compares the results of four small searches run on ten different search engines. The four searches found a total of 334 hits, 141 of which represented specific Web pages. Of those 141 hits, 71 were found by only one of the ten search engines while another 30 were found by only two. Several of the largest search engines have shown significant growth since Feb. 2000, when the overlap comparison was last run. Even so, almost half of all pages found were only found by one of the search engines, and not always the same one. Over 78% were found by three search engines at most. Each pie slice in the chart (see figure 1.4) represents the number of hits found by the given number of search engines. For example, the by 1 (71) slice represents the number of unique hits found by one (and only one) search engine. Even with three Inktomi-based databases (iWon, MSN Search, and HotBot), there was not identical overlap between the three. However, the Inktomi databases are relatively similar.
Fig. 1.4. Overlap of 4 searches, 141 pages, Mar. 6, 2002, across ten engines (AllTheWeb, AltaVista, Direct Hit, Google, HotBot, iWon, MSN, NLResearch, Teoma, WiseNut): by 1 (71), by 2 (30), by 3 (10), by 4 (14), by 5 (3), by 6 (3), by 7 (3), by 8 (2), by 9 (3), by 10 (2)
Fig. 1.5. Size change from Aug. 01 to Mar. 02 to Dec. 02 for Google, AlltheWeb, AltaVista, WiseNut, HotBot, MSN Search, and Teoma
4. Size Change over Time: The chart in figure 1.5 compares how the sizes of the databases have changed over time, showing the results for the same eight search statements. Note how some shrank, some stayed the same, and others grew. 5. Unique Hits Report: Based on an analysis of the same four small searches used for the overlap analysis, the chart in figure 1.6 shows how the results found by only one search engine (unique hits) were divided up between the search engines. google, WiseNut, AllTheWeb, Teoma, NLResearch, and MSN were the only ones of the ten search engines that found unique results that none of the other search engines found. For more detailed analysis, the table below shows the number and percent of unique hits for each search engine. Unique hits were those URLs that were found by only one of the ten search engines. The percentage figure refers to the percentage of unique hits as compared to the total hits which that particular search engine found on that search. Times refers to how many times in the five searches a unique hit was found.
Only google found unique hits on each search. Given these results from this very small sample, google found the most unique hits (41) but AllTheWeb, WiseNut, NLResearch, Teoma, and MSN all found hits that neither google nor any of the others found.
Fig. 1.6. Unique Hits
This example is from a total of 334 hits from four searches found by ten search engines. Of the 334 total hits, only 141 represented non-duplicative Web pages. And of those, 71 were unique hits in that they were only found by one of the ten search engines. 1.4 Users As well expounded in Johnson [61], it is an almost universal finding in studies investigating human information behavior that people choose other people as their preferred source of information. Studies of academic researchers in both the sciences and the humanities have revealed the importance of consulting with colleagues at different stages of their research (Ellis, [33]). Professionals, such as engineers, nurses, physicians and dentists rely on co-workers and knowledgeable colleagues in their search for work-related information (Leckie et al. [74]). The explanation for the use of people as information sources has often been that they are 'typically easier and more readily accessible than the most authoritative printed sources' (Case [17], p. 142). The use of people is a least-effort option in the search for information and, therefore, they may not be the best sources available. The theory of social capital, however, suggests that the use of people as information sources is not necessarily an easy option, but may also require a considerable effort. The findings do suggest, however, that the choice of a person as a source of information is complex and needs to be investigated at a more finely-grained level to determine what factors affect who is chosen and under what circumstances (see chapter 6). The theory of social capital provides a useful framework within which to examine information behavior and to understand how social structure affects choice. Figure 1.7 presents a general analytical model of information seeking and retrieval proposed in Ingwersen & Jarvelin (forthcoming, [59]).
Fig. 1.7. Analytical model of information seeking and retrieving: cognitive actor(s) surrounded by interfaces, information systems, information objects, and the organizational, social and cultural context; arrows (1-8) denote interactive communication of cognitive structures and cognitive transformation and influence
It shows various players and their interaction in context in the field of information seeking and retrieval. We may observe cognitive actors such as information seekers in the middle surrounded by several kinds of contexts. These are formed by the actors' social, organizational and cultural affiliations, information objects, information systems and the interfaces for using them. The context for any node in the diagram consists of all the other nodes. For example, algorithmic and interactive information retrieval processes do not stand alone, but are special cases of information seeking behavior. Algorithmic information retrieval, that is, the interaction between information objects and algorithms, arrow (4), has no real meaning without human interaction with information retrieval systems, arrows (2-3). Interactive information retrieval itself functions as a special case of information seeking.
Today, all information seeking becomes increasingly technology-driven because progressively more and more informal communication channels, such as mail, become formalized in systems because of the digitalization of communication. Information seeking behavior means the acquisition of information from knowledge sources. For instance, one may ask a colleague, through (in)formal channels such as social interaction in context (arrow 1), or through an information system (arrows 2-4). Secondly, actors operate in a dual context: that of information systems and information spaces surrounding them, and the socio-cultural-organizational context to the right. Over time, the latter context influences and, to a large degree, creates the information object space on the one hand (arrow 6) and the information technology infrastructure (arrow 8) on the other. In different roles, the actors themselves are not just information seekers but also authors of information objects (arrow 5) and creators of systems (arrow 7). Often, from the actor's point-of-view a proper system for analyzing information access may be a system of activities: work task, leisure interests and the social organization of these activities. We may take a look at augmenting task performance. After all, augmenting task performance is an important goal in working life and also in leisure activities. In the former it may be related to effectiveness and efficiency, and in the latter, if not efficiency, then at least quality. Engelbart [34] suggested a conceptual framework for augmenting human intellect. Figure 1.8 shows a new way of looking at it and presents an exemplary means-end hierarchy focusing on information seeking. At the top of figure 1.8 we have an actor's work task in context. In order to augment the actor's performance, one may improve her tools, her knowledge, her methods and/or her training in the application of the former.
Fig. 1.8. Augmenting Task Performance
Improving one's knowledge and methods means acquiring new knowledge, which can be done by creating it, remembering it or through information seeking. Information seeking, again, may reach towards education, documents or other sources. If one chooses documents, these may be documents immediately to hand, documents from colleagues or nearby collections, or documents retrieved from various databases through searching. If one chooses other sources than documents, these may be factual databases, colleagues, etc. In this figure, information seeking and its manifestations may seem somewhat remote from the work task - with document retrieval even more remote and behind many decisions. Nevertheless, our view is that information seeking and retrieval belongs to a context in real life, such as the work task context. The distance does not make information seeking and retrieval independent of work tasks or other activities; it needs to contribute to them. This sets many requirements on augmentation through information seeking and retrieval if it is going to be useful and used in the long run. 2 Web Graphs 2.1 Introduction A Web page is dynamic and bi-directional. It points to other Web links just as other Web links point to it. A Web page gets updated and so new Web links are discovered as new links are added. It is a living community of links. The contextual amount of variability between two Web pages could be very high.
Not only do they deal with different subjects, but they could be in different languages, have nothing in common, have no standards imposed on their content, and reflect subjective ideas rather than commonly established scientific explorations. Human judgment on Web content is more subjective, and noisier, than in a citation. It is hard to keep control of the quality of the Web pages that are created. Web pages serve many purposes, not just to link 2 or more Web pages. Hubs and authorities (Kleinberg [64]) have been developed to better assess the influence that a Web page has on other Web pages that link to it. Web pages also reflect the "signatures of intelligence" in our era and contain rich information on our collective society as a whole. Citation analysis is a very important field for information scientists for mining citations in journals. The comparison above sheds some light on the parallel between a Web page and a citation in a journal ([64]). This is just to emphasize the interest that Web mining experts ought to have in analyzing Web links, even though it is a task in its infancy and even though the principles that govern Web pages and Web links are different from those of scientific citations. Web pages are not scrutinized like scientific journals are. New links in a Web page can have an impact on the whole search done, and the always changing size of the Web can reflect patterns of indexing and crawling for search engine designers that feed on these Web links (Meghabghab [90]). More importantly, users need to be aware not only of the set of results returned, but also of the set of results not returned or the percentage of "rejected Web pages" for each query. To fully comprehend and assess the discovery of hubs and authorities and the influence that hyperlinks between different Web pages have over other Web pages, Graph Theory can be used to better understand the measures developed in Kleinberg ([64]). Nodes represent static html pages and hyperlinks represent directed edges. But no attempt has ever been made to study Web graphs in the (Out-degree, in-degree) coordinate space, nor has citation analysis on the Web been applied in this new coordinate space to reveal the complexity of citation analysis in a variety of Web graphs. Graph Theory can also be used to discover new patterns that appear in a citation graph. The same idea can be used to measure the distance between 2 Web pages. Measuring the topological structure richness of a collection of Web pages is an important aspect of Web pages never explored before and never applied before on hubs and authorities. The next section is a reminder on concepts borrowed from Graph Theory to help analyze hyperlinks, and the richness of the WWW as an information network of ideas and decisions. 2.2 Basic Graph Theory Applied to Web Pages A Web graph is a set of pages connected by directed links. A link on a Web page connects one document to another. A link is not only of a navigational nature but can also represent an endorsement of the target page. When we consider more than just one link, we can explore characteristics of the Web space. Spatial relations between Web pages can help understand the topology of a Web page and, in general, of the Web space. In the space of all Web pages W, let A ∈ W mean that a page A belongs to the space of all Web pages.
The Web page A represents a graph. In that graph, if there is a link to another Web page B, we can say that A is related to B by the link. In symbolic terms we can write (A, B) ∈ R, where R is the relation "points to". We can add the following observations on the relation R: 1. If every Web page is related to itself, we say that R is reflexive. 2. For all Web pages X and Y that belong to W, if (X, Y) ∈ R → (Y, X) ∈ R. Web pages X and Y in that case represent mutual endorsement. The relation is then said to be symmetric. 3. For all Web pages X and Y that belong to W, if (X, Y) ∈ R → (Y, X) ∉ R. Web pages X and Y are linked in a unidirectional way. The relation is then said to be anti-symmetric. 4. For all Web pages that belong to W, when a Web page X cites another Web page Y and that last page cites another Web page Z, we can say that R is transitive: (X, Y) ∈ R and (Y, Z) ∈ R → (X, Z) ∈ R. 5. When a Web page X cites another Web page Y and Y does not cite X, X endorses Y and Y does not endorse X; we can say that R is not symmetric: (X, Y) ∈ R but (Y, X) ∉ R. 6. When 2 Web pages X and Y point to a distinct 3rd Web page Z, then we could say that the 2 Web pages are related through a very special relationship similar to a filtering relationship or bibliographic coupling (Kessler [63]). This kind of relationship does not have a name in the algebraic properties of R: (X, Z) ∈ R and (Y, Z) ∈ R. 7. Conversely, when one Web page X points to 2 distinct Web pages Y and Z, then we say that X co-cites Y and Z. Co-citation is a term borrowed from the field of biblio-metric studies (Small [122]). Co-citation has been used as a measure of similarity between Web pages by Larson [71] and Pitkow & Pirolli [113]. Small & Griffith [124] used breadth-first search to compute the connected components of the uni-directed graphs in which 2 nodes are joined by an edge if and only if they have a positive co-citation value. This kind of relationship does not have a name in the algebraic properties of R: (X, Y) ∈ R and (X, Z) ∈ R. These 7 properties are the simplest common patterns that can be perceived on the Web. These 7 properties can blend together to form more complex patterns that are indicative of emerging links or communities on the Web. These complex patterns can model properties of Web pages that can be qualified as "authoritative Web pages". An authoritative Web page is a Web page that is pointed at by many other Web pages. Other emerging complex patterns can model Web pages that can be qualified as survey Web pages or "hub Web pages" because they cite "authoritative Web pages". 2.2.1 Adjacency Matrix Representation of a Web Graph To further apply all these properties, consider a directed graph G that represents hyperlinks between Web pages. Consider also its adjacency matrix A. An entry apq is defined by the following: apq = 1 if there is a link from Web page p to Web page q, and apq = 0 otherwise. Here are some of the properties that could be discovered from an adjacency matrix perspective: 1. A graph is said to be reflexive if every node in the graph is connected back to itself, i.e., app = 1. This situation will happen if a page points back to itself. 2. A graph is said to be symmetric if for all nodes p and q in G: if apq = 1 then aqp = 1. We say in this case that there is mutual endorsement. 3. A graph is said to be not symmetric if there exist 2 nodes p and q in G such that apq = 1 and aqp = 0.
We say in this case that there is endorsement in one direction. 4. A graph is said to be transitive if for all nodes p, q, and r: if apq = 1 and aqr = 1 then apr = 1. We say in this case that all pages p endorse pages r even though not directly. 5. A graph is said to be anti-symmetric if for all nodes p and q: if apq = 1 then aqp = 0. 6. If 2 different Web pages p and q point to another Web page r, then we say that there is social filtering. This means that these Web pages are related through a meaningful relationship: apr = 1 and aqr = 1. 7. If a single page p points to 2 different Web pages q and r, then we say that there is co-citation: apq = 1 and apr = 1. Now consider 2 linear transformations defined on unit vectors a and h as follows: a = A^T h (2.1) and: h = A a (2.2) thus: a = A^T A a (2.3) and: h = A A^T h (2.4) By examining closely the entries of the product matrices A A^T and A^T A, we find that the 2 matrices are symmetric, with the following properties observed: 1. An entry (p, p) in A A^T means the number of Web pages that come out of p. We call that number the out-degree or od. 2. An entry (p, p) in A^T A means the number of Web pages that point towards p. We call that number the in-degree or id. 3. An entry (p, q) in A^T A represents the number of Web pages in common that point towards both p and q. 4. An entry (p, q) in A A^T represents the number of Web pages that p and q point to in common. To illustrate the above points let us look at the following graph G.
Fig. 2.1. A General Graph (6 vertices, edges e1 through e8)
Here are the algebraic properties of R in G: 1. R is not reflexive. 2. R is not symmetric. 3. R is not transitive. 4. R is anti-symmetric. 5. (1,4) ∈ R, (3,4) ∈ R, and (5,4) ∈ R: we say that 4 is the vertex with the highest number of Web pages pointing to it. 6. (5,1) ∈ R and (5,4) ∈ R: 5 co-cites 1 and 4. Here is the corresponding adjacency matrix for figure 2.1, with the out-degree od of each vertex listed on the right:

A =
  0 1 0 1 0 1   od = 3
  0 0 1 0 0 0   od = 1
  1 0 0 1 0 0   od = 2
  0 0 0 0 0 0   od = 0
  1 0 0 1 0 0   od = 2
  0 0 0 0 0 0   od = 0

Building A^T yields:

A^T =
  0 0 1 0 1 0
  1 0 0 0 0 0
  0 1 0 0 0 0
  1 0 1 0 1 0
  0 0 0 0 0 0
  1 0 0 0 0 0

Building C = A A^T yields:

C = A A^T =
  3 0 1 0 1 0
  0 1 0 0 0 0
  1 0 2 0 2 0
  0 0 0 0 0 0
  1 0 2 0 2 0
  0 0 0 0 0 0

Building D = A^T A yields:

D = A^T A =
  2 0 0 2 0 0
  0 1 0 1 0 1
  0 0 1 0 0 0
  2 1 0 3 0 1
  0 0 0 0 0 0
  0 1 0 1 0 1

The next figure illustrates the in-degrees and out-degrees for the graph G (see figure 2.2).
Fig. 2.2. Out/In Degrees for Figure 2.1
How far away are 2 Web pages in a Web graph? The adjacency matrix A can be used to calculate the length of the path that can separate 2 distinct Web pages. To further explore such an idea, consider the power matrices of A, i.e., A^2, A^3, A^4, ..., A^n for a graph of n vertices. If we calculate the value of A^2 for Graph1 we have:

A^2 =
  0 0 1 0 0 0
  1 0 0 1 0 0
  0 1 0 1 0 1
  0 0 0 0 0 0
  0 1 0 1 0 1
  0 0 0 0 0 0

Every non-zero element in A^2 means that to travel from vertex i to vertex j we will need 2 Web links to get there. Thus A^2(2, 4) = 1 means that the distance from vertex 2 to vertex 4 is 2. This can be verified on Figure 2.1 where (2,3) ∈ R and (3,4) ∈ R. If we calculate the rest of the powers of A, i.e., A^3, A^4, ..., A^n, and we reach a value m < n such that A^m = A, then we say that any 2 Web pages in that graph are at most m pages or clicks away. Applying this to Graph1, one can see that A^4 = A, as the short check below confirms.
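The following short check is a sketch using numpy, under the assumption that the adjacency matrix of Figure 2.1 is the one reconstructed above; it verifies that the diagonal of A A^T gives the out-degrees, the diagonal of A^T A gives the in-degrees, and that A^4 = A for this graph.

```python
# A short numerical check of the claims above for the graph of Figure 2.1:
# diag(A·A^T) = out-degrees, diag(A^T·A) = in-degrees, and A^4 = A.
import numpy as np

# Adjacency matrix of Figure 2.1 as reconstructed in Section 2.2.1
# (edges 1->2, 1->4, 1->6, 2->3, 3->1, 3->4, 5->1, 5->4).
A = np.array([
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

od = np.diag(A @ A.T)          # out-degrees: [3 1 2 0 2 0]
id_ = np.diag(A.T @ A)         # in-degrees:  [2 1 1 3 0 1]
print("od =", od, " id =", id_)

A2 = A @ A                     # non-zero entries of A^2 are 2-click paths
print("A^2(2,4) =", A2[1, 3])  # 1, i.e. vertex 4 is 2 clicks from vertex 2

A4 = np.linalg.matrix_power(A, 4)
print("A^4 == A ?", np.array_equal(A4, A))   # True
```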
This means that for any 2 Web pages, the furthest away they are, is 4 Web pages. An example in Graph1 is Web page 1 and 6, where to reach 6 from 1 we can travel directly with a path of length 1 or through vertices 2, 3, 1, 6. Expanding the idea of the distances of Web pages over the WWW, Albert et al. [3] were able to show that the distribution of Web pages over the Web constitutes a power law and that the distance between far away connected Web pages is 19. In other words, to move along the whole WWW, it will take approximately 19 Web pages or clicks at most. Thus the diameter of the WWW is 19 clicks. 2.2 Basic Graph Theory Applied to Web Pages 29 2.2.2 Incidence Matrix Representation of a Web Graph Another interesting representation of a graph is an incidence matrix. Let us consider the same graph G in Figure 2.1 and its corresponding incidence matrix I. To build the incidence matrix of a graph we label the rows with the vertices and the columns with the edges (in any order). The entry ive for row r (vertex v) and column c (edge e) is such: ive = 1 =0 if e is incident to v Otherwise (2.5) Notice that id, which is the number of edges incident on a vertex v, can be deduced from the incidence matrix. We also added to I the row s which the sum of all the values in a given column. ⎡ e1 e2 e3 e4 e5 e6 e7 e8 ⎤id 1001000102 I= 2 3 4 5 6 ⎢⎢⎢⎢⎢⎢⎢⎢⎣ 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 ⎥⎥⎥⎥⎥⎥⎥⎥⎦ 1 1 3 0 1 s111111117 We could deduce from I that Web page 4 or vertex 4 is the one with the highest incidence of links to it. Web page 4 is an authoritative Web page since it is the Web page with most links pointing to it. We can deduce through the value of s that there are no bi-directional links on this graph. That is why this graph is anti-symmetric. One way to look at how matrix A and vector id relate is by considering the following matrix-vector multiplication AT U where AT is the transpose of A already computed in 2.2.1 and U is the Unit vector 1. Applying AT U to Graph1 results in the following: AT × U = id (2.6) The matrix I has been ignored in the literature on Web Graph Theory ([64]). Looking closely at IIT yields the following matrix: ⎡ ⎤ 200000 IIT = ⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 1 0 0 0 0 1 0 0 0 0 3 0 0 0 0 0 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎦ 000001 30 2 Web Graphs IIT is a diagonal matrix. Its eigenvalues vector is equal to the in-degree vector id. Its eigenvectors constitute the columns of the unit matrix of size 6 × 6: Eigenvalue(IIT ) = id (2.7) By looking at equations 2.6 and 2.7 we conclude that: Eigenvalue(IIT ) = id = AT U (2.8) 2.2.3 Bipartite Graphs A Bipartite graph G is a graph where the set of vertices can be divided into sets V1 and V2 such that each edge is incident on one vertex in V1 and one vertex in V2. Graph G in Figure 2.1 is not an actual Bipartite graph. To make G an actual bipartite graph, a possible bipartite graph G1 can be designed. If we let V1 = {1, 3, 5} and V2 = {2, 4, 6}, then we can take out the 2 edges e3 and e7 that were in and then the new graph G1 will become a bipartite graph as in Figure 2.3. In other related works, tree structures have been used to design better hyperlink structures (Botafogo et al. [13]). The reverse process of discovering tree structures from hyperlink Web pages and discover hierarchical structures has also been studied (Mukherjea et al. [103]; Pirolli et al. [112]). In case of topic search, we do not need to extract a Web structure from the Web. 
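Before moving on to ranking methods, the sketch below ties together the matrix relationships of Sections 2.2.1 and 2.2.2 for the graph of Figure 2.1. The edge list and the ordering of the columns e1, ..., e8 are assumptions (the exact labelling depends on the figure), but the property being checked does not depend on them: I·I^T is diagonal, its diagonal (and hence its eigenvalue vector) equals the in-degree vector id, and the same vector is obtained as A^T·U for the unit vector U, as in equation 2.8.

```python
# A sketch of the incidence-matrix relationships of Section 2.2.2 for the
# graph of Figure 2.1, with one possible ordering of the eight edges.
import numpy as np

n = 6
edges = [(1, 2), (1, 4), (1, 6), (2, 3), (3, 1), (3, 4), (5, 1), (5, 4)]

# Adjacency matrix A and incidence matrix I (entry i_ve = 1 if edge e points to v).
A = np.zeros((n, n), dtype=int)
I = np.zeros((n, len(edges)), dtype=int)
for e, (u, v) in enumerate(edges):
    A[u - 1, v - 1] = 1
    I[v - 1, e] = 1

IIT = I @ I.T
print(np.all(IIT == np.diag(np.diag(IIT))))   # True: I·I^T is diagonal
print(np.diag(IIT))                           # [2 1 1 3 0 1] = id
print(A.T @ np.ones(n, dtype=int))            # [2 1 1 3 0 1] = A^T·U
```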
Often the user is interested in finding a small number of authoritative pages on the search topic. These pages would play an important role in a tree, had we extracted the tree structure itself. An alternative to extracting trees from a Web graph is to apply a ranking method to the nodes of the Web graph. In this section we review such methods proposed in the literature. Some basic concepts have to be laid down before doing that. We conclude that the topology of the links in Web pages affects search performance and strategies on the WWW.
Fig. 2.3. A Bipartite Graph G1
2.2.4 Web Topology In this section, we will explore how different Web graphs can be. Can we classify these different Web pages? How complex can these Web pages appear to be? Different categories of Web graphs can emerge according to their od value and id value, or what we defined in section 2.1 as Out-degree and in-degree correspondingly. Even though we do not pretend that this classification is exhaustive, different kinds of graphs were gathered to represent the possible variety of Web pages. Emerging Web graphs can be more complex and rich in structure and links than Web page designers realize. In-degree Web Graphs Complex pages can emerge with a large in-degree that look like Figure 2.4. Figure 2.5 illustrates such in-degree Web pages by looking at their In/Out Degree chart.
Fig. 2.4. An In-Degree Graph
Fig. 2.5. Out/In Degrees for Figure 2.4
Out-degree Web Graphs Complex Web pages with a large Out-degree can emerge that look like Figure 2.6. Such a graph becomes a tree where the starting node is the root of the tree. Figure 2.7 illustrates such Out-degree Web pages by looking at their In/Out Degree chart, which is the complement of Figure 2.5.
Fig. 2.6. An Out-Degree Graph
Fig. 2.7. Out/In Degrees for Figure 2.6
Complete Bipartite Web Graphs Other complex Web pages can emerge as complete bipartite graphs that look like Figure 2.8, with 3 nodes in the first set V1 and 3 nodes in the second set V2. Remember that the topology of complete bipartite graphs like the one in Figure 2.8 is unique. Figure 2.9 illustrates such Complete Bipartite Web pages by looking at their In/Out Degree chart.
Fig. 2.8. A Complete Bipartite Graph
Fig. 2.9. Out/In Degrees for Figure 2.8
Bipartite Web Graphs Other complex Web pages can emerge as bipartite graphs that look like Figure 2.10, with 4 nodes in the first set V1 and 2 nodes in the second set V2. The difference between complete bipartite Web graphs and bipartite graphs is the fact that not all nodes between set V1 and V2 are connected, as is seen in Figure 2.10. Pages with a large in-degree or Out-degree play an important role in Web algorithms in general. Section 3.4 will apply Kleinberg's algorithm [64] on these different graphs. Figure 2.11 illustrates such Bipartite Web pages by looking at their In/Out Degree chart.
Fig. 2.10. A Bipartite Graph G1
Fig. 2.11. Out/In Degrees for Figure 2.10
2.2.5 Web Graphs Topological Difference The signature of a Web graph lies in its in/out degree, as can be seen from all these different charts.
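To make these per-vertex degree profiles concrete, here is a minimal sketch. The two trees are plausible reconstructions of the in-degree and Out-degree graphs of Figures 2.4 and 2.6 (their exact shapes are assumptions, chosen to be consistent with the charts in Figures 2.5 and 2.7); the complete bipartite graph is the K(3,3) of Figure 2.8, with V1 = {1, 2, 3} pointing to V2 = {4, 5, 6}.

```python
# A sketch of the per-vertex in/out-degree profiles charted in this section.
import numpy as np

def degrees(edges, n=6):
    """Return the (out-degree, in-degree) vectors of a directed edge list."""
    A = np.zeros((n, n), dtype=int)
    for u, v in edges:
        A[u - 1, v - 1] = 1
    return A.sum(axis=1), A.sum(axis=0)

in_tree  = [(2, 1), (3, 1), (4, 2), (5, 3), (6, 3)]        # edges point toward the root
out_tree = [(1, 2), (1, 3), (2, 4), (3, 5), (3, 6)]        # edges point away from the root
k33      = [(u, v) for u in (1, 2, 3) for v in (4, 5, 6)]  # complete bipartite graph

for name, edges in [("in-degree graph", in_tree),
                    ("out-degree graph", out_tree),
                    ("complete bipartite", k33)]:
    od, id_ = degrees(edges)
    print(f"{name:20s} od = {od}  id = {id_}")
```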
The in/out degree of a graph can be used to measure the differences or similarities between different topological Web structures; that is where the signature of a Web graph lies. The Euclidean distance will help classify different graphs. We will be interested not only in the size of a graph but also in its structure. A graph will be assumed to be symbolized by 2 vectors: the in-degree vector id and the Out-degree vector od. Next, the average value of the in-degree vector id and the average value of the Out-degree vector od are calculated, and we say that for a vertex v:

od(v) = 1 if its value is above the average value, and 0 otherwise   (2.9)
id(v) = 1 if its value is above the average value, and 0 otherwise   (2.10)

By applying 2.9 and 2.10 to figure 2.1, which is a general graph, we could deduce: od = (1, 0, 1, 0, 1, 0), id = (1, 0, 0, 1, 0, 0). By applying 2.9 and 2.10 to figure 2.10, which is a possible Bipartite graph of figure 2.1, we deduce: od = (1, 0, 1, 0, 1, 0), id = (0, 1, 0, 1, 0, 1). By applying 2.9 and 2.10 to figure 2.8, which is the only possible complete Bipartite graph on figure 2.10, we deduce: od = (1, 1, 1, 0, 0, 0), id = (0, 0, 0, 1, 1, 1). By applying 2.9 and 2.10 to figure 2.4, which is an in-degree graph, we deduce: od = (0, 1, 1, 1, 1, 1), id = (1, 1, 1, 0, 0, 0). By applying 2.9 and 2.10 to figure 2.6, which is an Out-degree graph, we deduce: od = (1, 1, 1, 0, 0, 0), id = (0, 1, 1, 1, 1, 1). The following matrix M summarizes the difference between these different Web graphs with the same number of nodes in their (Out-degree, in-degree) coordinate space:

M =      GG       BG       CBG      ID       OD
  GG   (0, 0)   (0, 3)   (2, 3)   (3, 3)   (2, 5)
  BG   (0, 3)   (0, 0)   (2, 2)   (3, 4)   (2, 2)
  CBG  (2, 3)   (2, 2)   (0, 0)   (4, 6)   (0, 2)
  ID   (3, 3)   (3, 4)   (4, 6)   (0, 0)   (4, 4)
  OD   (2, 5)   (2, 2)   (0, 2)   (4, 4)   (0, 0)

The smallest value in this matrix is the value (0,2), which says that Graphs 2.8 and 2.6 are the closest, because a complete bipartite graph is a form of an Out-degree graph with many roots. The next smallest value in the matrix M is (0,3), which says that general graphs and bipartite graphs are the closest among all other graphs. The largest value in M is (4,6), which says that complete bipartite graphs and in-degree trees are the farthest apart. The next biggest difference is (4,4), which says that the next largest difference is between in-degree trees and Out-degree trees, which is evident from the structure of the trees. It also shows that bipartite graphs are closer to Out-degree trees and complete bipartite graphs than to in-degree trees, which can be concluded from the statement before. We conclude that in the coordinate space of (Out-degree, In-degree) the following metric of graph topology stands: |(CBG)| < |(OD)| < |(BG)| < |(GG)| < |(ID)| (2.11) Where CBG = complete bipartite graph, OD = Out-degree Trees, BG = bipartite graph, GG = General Graphs, and ID = In-Degree Trees. 2.3 Bow Tie Graphs 2.3.1 Introduction Broder et al.'s [15] study analyzed the connectivity of more than 1.5 billion links on over 200 million Web pages. By creating a map showing all links pointing to and from this large sample of pages, the researchers found a comprehensive topography of the Web. We now give a more detailed description of the structure in Figure 2.12, which represents the connectivity of the Web. One can pass from any node of IN through SCC to any node of OUT.
Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC. It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE – a passage from a portion of IN to a portion of OUT without touching SCC. The sizes of the various components are as seen in table 2.1 Unlike previous models portraying the Web as clusters of sites forming a well-connected sphere, the results showed that the Web’s structure more closely resembles a bow tie consisting of three major regions (a knot and two bows), and a fourth, smaller region of pages that are disconnected from the basic bow-tie structure. 1. At the center of the bow tie is the knot, which the study calls the strongly connected core. This core is the heart of the Web. pages within the core are strongly connected by extensive cross-linking among and between themselves. Links on core pages enable Web surfers to travel relatively easily from page to page within the core. 2. left bow consists of origination pages that eventually allow users to reach the core, but that cannot themselves be reached from the core. Origination 36 2 Web Graphs Fig. 2.12. Bow Tie Graphs Region Size Table 2.1. Size vs. Region in a Bow Tie graph SCC 56 ∗ 106 IN 43 ∗ 106 OUT 43 ∗ 106 TENDRILS 43 ∗ 106 DISC. 16 ∗ 106 Total 203 ∗ 106 pages are typically new or obscure Web pages that haven’t yet attracted interest from the Web community (they have no links pointing to them from pages within the core), or that are only linked to from other origination pages. Relatively closed community sites such as Geocities and Tripod are rife with origination pages that often link to one another but are seldom linked to from pages within the core. 3. The right bow consists of “termination” pages that can be accessed via links from the core but that do not link back into the core. Many commercial sites don’t point anywhere except to themselves. Instead, corporate sites exist to provide information, sell goods or services, and otherwise serve as destinations in themselves, and there is little motivation to have them link back to the core of the Web. 2.3.2 Parametric Characteristics of Bow Tie Graphs Parametric equations of the form x = cos(at) and y = sin(bt), where a and b are constants, occur in electrical theory. The variables x and y could represent voltages or currents at time t. The resulting curve is often difficult to sketch, 2.3 Bow Tie Graphs 37 Fig. 2.13. A 10*1*10 Bow Tie Graph cos(t), sin(t) 1 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1 Fig. 2.14. A circle but the graph can be represented on the screen of an oscilloscope when voltages or currents are imposed on the input terminals. These figures are called Lissajous curves. The times will be restricted to t = 0 to t = 2*π. Here are different cases: 1. For x = cos(at) and y = sin(bt) with a = b, x = a cos(t) and y = a sin(t), the graph will be always a circle. (Figure 2.14). 38 2 Web Graphs 2. x = cos(at) and y = sin(at) with a < b, we consider different values of a and b and obtain bow tie graphs with the number of bows in the graph equal to the ratio of b/a as follows: (a) For a = 1 and b = 2, we get a bow tie graph (Figure 2.15). (b) For a = 2 and b = 3, we get a 2/3 of a bow tie graph (Figure 2.16). cos(t), sin(2*t) 1 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1 Fig. 2.15. A single bow tie graph cos(2*t), sin(3*t) 1 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1 Fig. 2.16. 
A 2/3 bow tie graph 2.3 Bow Tie Graphs 39 (c) For a = 1 when b = 4, we get a double bow tie or 4 bows (Figure 2.17). (d) For a = 1 when b = 5 we get 5 bows (Figure 2.18). (e) For a = 1 and b = 8, we get 8 bows (Figure 2.19). 1 0.5 0 -0.5 -1 -1 cos(t), sin(4*t) -0.5 0 0.5 1 Fig. 2.17. A double bow tie graph 1 0.5 0 -0.5 -1 -1 cos(t), sin(5*t) -0.5 0 0.5 1 Fig. 2.18. A 5 bow tie graph 40 2 Web Graphs 1 cos(t), sin(8*t) 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1 Fig. 2.19. An 8 bow tie graph 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Vertix od(10110) id(10110) Fig. 2.20. Out/In Degrees of the BTG in figure 2.13 2.3.3 In/Out Degrees Characteristics of Bow Tie Graphs As seen in Figure 2.13, a BTG is made out of 2 graphs, a left graph which is a complete bipartite graph or CBG(L) which we number as nodes 1,2,3,· · ·,10 which are the origination nodes, and node 11 which is the core; and a CBG(R) which is made out of node 11, and the nodes 12 through 21. Figure 2.20 is the In/Out degree characteristic of Figure 2.13. (These numbers do not show up on the graph of 2.13 but will be used in this section in the matrices below). 2.3 Bow Tie Graphs 41 0d ⎡ ⎤ 000000000010000000000 1 ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 1 1 1 1 1 1 1 1 1 10 1 1 1 1 1 1 1 1 1 000000000000000000000 1 The above matrix is the adjacency matrix p10∗1∗10 of figure 2.13. p10x1 is the adjacency matrix of figure 2.21. Fig. 2.21. A 10*1 Complete Bipartite Graph(L) 42 2 Web Graphs 0d ⎡ ⎤ 000000000011 p10x1 = ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 1 1 1 1 1 1 1 1 1 000000000000 The CBG(L) looks like (See Figure 2.21) and its In/Out degree characteristic in Figure 2.22. p1x10 is the adjacency matrix of figure 2.23. 0d ⎡ ⎤ 0 1 1 1 1 1 1 1 1 1 1 10 p1x10 = ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 0 0 0 0 0 0 0 0 0 00000000000 0 In/Out Degrees for the Complete Bipartite Graph(10’1) 12 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 Vertix Fig. 2.22. In/Out for CBG(10×1) of figure 2.21 od(101) id(101) 2.3 Bow Tie Graphs 43 Fig. 2.23. A 1*10 Complete Bipartite Graph(R) In / Out Degrees for the Complete Bipartite Graph(1’10) 12 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 Ver tix Fig. 2.24. In/Out for CBG(1×10) of figure 2.23 od(110) id(110) The CBG(R) looks like (See Figure 2.23) and its In/Out degree characteristic in Figure 2.24. 
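The adjacency matrix p10∗1∗10 and the degree vectors of the BTG(10*1*10) can be reproduced with a few lines of code. The sketch below (Python, illustration only) fills in the left complete bipartite block, origination pages 1 through 10 into the core 11, and the right complete bipartite block, core 11 into termination pages 12 through 21, and then reads od and id off the row and column sums:

import numpy as np

N = 10                        # 10 origination pages and 10 termination pages
n = 2 * N + 1                 # 21 nodes in all; node 11 (index 10) is the core
A = np.zeros((n, n), dtype=int)
A[0:N, N] = 1                 # CBG(L): pages 1..10 each link to the core
A[N, N + 1:n] = 1             # CBG(R): the core links to pages 12..21

od = A.sum(axis=1)            # [1,1,1,1,1,1,1,1,1,1,10,0,...,0]
id_ = A.sum(axis=0)           # [0,...,0,10,1,1,1,1,1,1,1,1,1,1]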
We can conclude in general that: ⎡ ⎤ A(CBG(L)) 0 A[BT G] = ⎣ 0 A(CBG(R)) ⎦ The od of a BTG is related to the od of the CBG(L) and the od of CBG(R): od(BTG) = [od(CBG(L) od(CBG(R)]T od(BTG)=A(BTG).unityT21=[1,1,1,1,1,1,1,1,1,1,10,0,0,0,0,0,0,0,0,0,0]T where unity21 is the vector of 21 ones. Applying it to the BTG(10*1*10), we conclude that: 44 2 Web Graphs p10∗1∗10 = p10∗1 0 0 p1∗10 and: od(p10∗1∗10) = [od(p10∗1) od(p1∗10)]T od(p10∗1∗10) = p10∗1∗10.unityT21=[1,1,1,1,1,1,1,1,1,1,10,0,0,0,0,0,0,0,0,0,0]T id(BTG)T = AT .unityT21 from 2.3. id(BTG)T = [0,0,0,0,0,0,0,0,0,0,10,1,1,1,1,1,1,1,1,1,1]T id(CBG(L))T = A(CBG(L))T .unityT11 id(CBG(L))T = [0,0,0,0,0,0,0,0,0,0,10]T id(CBG(R))T = A(CBG(R))T .unityT11 id(CBG(R))T = [0,1,1,1,1,1,1,1,1,1,1]T where p10∗1∗10 is the (21,21) matrix below, and id(p10∗1∗10)= pT10∗1∗10*unityT21 id(p10∗1∗10) = [0,0,0,0,0,0,0,0,0,0,10,1,1,1,1,1,1,1,1,1,1]T id(p10∗1) = pT10∗1.unityT =[0,0,0,0,0,0,0,0,0,0,10]T id(p1∗10) = pT1∗10.unityT id(p1∗10)T = [0,1,1,1,1,1,1,1,1,1,1]T id[BT G] = [id(CBG(L)) id (C B G(R))]T 0d ⎡ ⎤ 000000000000000000000 0 ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 0 0 0 0 0 0 0 0 0 10 1 1 1 1 1 1 1 1 1 000000000010000000000 1 2.4 Final Observations on Web Graphs 45 2.4 Final Observations on Web Graphs Would the classification of these different Web graphs developed in equation 2.11 help shed a light on the behavior of Kleinberg’s algorithm ([64]) even though Kleinberg’s algorithm is not related to the (Out-degree, In-degree) space in which all these different graphs have been studied. According to equation 2.11 complete bipartite graphs(CBG) have a behavior close to Outdegree graphs(OD), and that OD graphs have a behavior close to general graphs(GG), and that GG have a behavior close to in-degree graphs (ID). Chapter 3 will reveal whether the general classification scheme proposed was able to help predict the behavior of these graphs. This chapter focuses on discovering important topological structures in Web pages and to help predict the behavior of Web algorithms (see chapter 3) in such environment especially that the WWW is rich not only in information but in topological structures. Last of all, what if a user’s Web page did not fit in any of the different Web graphs already described? Can we evaluate such an activity in a given Web page? May be different types of graphs are needed for that? Further research needs to answers the following questions: 1. How do we categorize existing simple graphs such the one already in use in many areas of research? 2. How do we uncover Web algorithms that are efficient on such graphs? 3. How do we devise new graphs, to better characterize user creativity in a given Web page on the WWW? 4. 
How do we devise new algorithms for these graphs? The WWW is full of information and mining the Web for different kinds of information is still in its infancy. The difficulty in evaluating information on the Web has to do with our cognitive abilities combined with personal likings, which is hard to quantify always. We are on the verge of unveiling creativity as a graph and what best describes the birth and death of an idea which happens in the birth and death of Web pages and the changes of their content. 3 Link Analysis of Web Graphs 3.1 Introduction Scientific citations have been studied for a long time. Citation analysis in the field of bibliometrics (Egghe & Rousseau [32]) is the science of studying citations, their structure, and the evolution of a specific domain of knowledge from its citations. Many information sciences journals, e.g., JASIS (Small [122, 123]) have devoted issues and complete volumes to the exploration of such a field of knowledge. A citation is static and unidirectional. Once an article is published, no new references can be added to it. A citation can be used to evaluate its impact and influence over the whole body of knowledge in a given field. A citation fulfilling such a role in a given field becomes an authority in that field. Citations in a given journal follow the same principles, standards, and forms. Thus, the standard deviation of the content of two articles in a given journal is usually small. Human judgment of the technicality of an article keeps the quality of publication at high level although the noise is present, and thus a citation is more relevant and more objective to a given topic. Citations link articles that are relevant to a given research. Garfield’s impact factor ([38]) is the most important factor ever developed to assess a journal j influence over the publication in that field. The impact factor is the average number of citations a journal j in a given year receives from other journal articles after it has been published for the last 2 years. It becomes the in-degrees of nodes in a given network of publications in that field. Pinski & Narin ([111]) argued that a journal is influential if it is heavily cited by other influential journals. Citations of a given publication are “signatures of intelligence” of the space and time of a collective group of people. Hubs and authorities (Meghabghab & Kandel [87]) have been developed to better assess the influence that a Web page has on other Web pages that link to it. Web pages also reflect the “signatures of intelligence” in our era and contain rich information on our collective society as a whole. Citation analysis is a very important field for information scientists for mining citations in journals. The comparison above sheds some light on the parallel between a G. Meghabghab and A. Kandel: Search Engines, Link Analysis, and User’s Web Behavior, Studies in Computational Intelligence (SCI) 99, 47–68 (2008) www.springerlink.com c Springer-Verlag Berlin Heidelberg 2008 48 3 Link Analysis of Web Graphs Web page and a citation in a journal ([87]). This just to emphasize the interest that Web mining experts ought to have to analyze Web links even though it is a task in its infancy, and even though that the principles that govern Web pages and Web links are different from those of scientific citations. Web pages are not scrutinized like scientific journals are. 
New links in a Web page can have an impact on the whole search done and the always changing size of the Web can reflect patterns of indexing, crawling for search engine designers that feed on these Web links. To fully comprehend and assess the discovery of hubs and authorities and the influence that hyperlinks between different Web pages have over other Web pages, graph theory can be used to better understand the measures developed by [64]. The same idea can be used to measure the distance between 2 Web pages. Measuring the topological structure richness of a collection of Web pages is an important aspect of Web pages never explored before and never applied before on hubs and authorities. 3.2 Linear Algebra Applied to Link Analysis From linear algebra (Golub & Van Loan [41]): if A is a symmetric n x n matrix, then an eigenvalue of A is a number g with the property for some vector v we have Av = gv. The set of all such v is a subspace of Rn, which we refer to as the eigenspace associated with g; the dimension of this space is the multiplicity of g. A has n distinct eigenvalues, each of them is a real number, and the sum of their multiplicity is n. These eigenvalues are denoted by g1(A), g2(A), .., gn(A) indexed in order of decreasing absolute value and with each value listed in a number of times equal to its multiplicity. For each distinct eigenvalue, we choose an orthonormal basis of its eigenspace. Considering the vectors in all these bases, we obtain a set of eigenvectors v1(A), v2(A), v3(A), ...vn(A) where vi(A) belongs to the eigenspace of gi(A). The following assumption can be made without any violation of any substantial principle in what follows that: g1(A) > g2(A) (3.1) When the last assumption holds 3.1, v1(A) is referred to as the principal eigenvector and all other vi(A) are non-principal eigenvectors. By applying this analysis to IIT we can conclude: 1. There are 4 distinct eigenvalues mainly: 0, 1, 2, 3 with the multiplicity of g = 1 being 3. 2. The eigenspace can be defined by: 3(IIT ) > 2(IIT ) > 1(IIT ) > = 1(IIT ) > = 1(IIT ) > 0(IIT ) 3. There is one principal eigenvector which corresponds to the eigenvalue of g=3. This vector has a value v3 = [0, 0, 0, 1, 0, 0]T . All other eigenvectors are not principal eigenvectors. 3.3 Kleinberg’s Algorithm of Link Analysis 49 3.3 Kleinberg’s Algorithm of Link Analysis In [64], Kleinberg built a search technique procedure which he called HITS, which stands for “Hyperlink Induced Topic Search”. It was meant to rank the result set of links by any commercial search engine on broad search topics. The first step in HITS is a sampling step. The result of such a sampling is called the root set. Each page tends to be considered a hub or an authority or neither. HITS try to find “hubs” Web pages and “authorities” Web pages. An authority page is a page that has the highest number of links pointing to it in a given topology. A hub is a Web page that points to a high number of authority Web pages. Authorities and hubs reinforce each other mutually: since Web pages tend to be part of what makes an authority or so close to an authority that they are hubs. There are pages that are neither authorities nor hubs. Good authority Web pages are found close to good hubs, which in turn are pointing towards good sources of information. A typical representation of hubs and authorities can be seen as a bipartite graph. Figure 3.1 represents an ideal situation of hubs and authorities. The root set mentioned above tends to contain few hundred Web pages. 
To the root set is added links that point to it. This new set can grow to form up to few thousands of Web pages and is called the base set. The second step in HITS is to compute the weight for each Web page that can help rank the links as their relevance to the original query. Even though that HITS is applied to a large number of Web pages not 1 E6 ! 2 ‚ E5 ! 3 4 Fig. 3.1. A bipartite graph 50 3 Link Analysis of Web Graphs 6 1 5 ©2 4‚c  ‚3 Fig. 3.2. A bipartite graph G1 individual Web pages, for illustration only, we can look at the Web pages in Figure 3.2. From the actual definition of authority Web pages and hub Web pages we can conclude: 1. Web page 4 is an “authority” Web page because it has the highest number of pages that point to it. Web page 1 is also an authority Web page. 2. Web pages 5 and 3 are good “hub” pages. Web pages 5 and 3 cannot be considered authorities. 3. Web pages 2 and 6 are neither hubs nor authorities. 4. Thus, a Web page in HITS tends to have a hub weight and authority weight. HITS assigns an authority weight ap and a hub weight hp for every Web page that reinforce each other: ap = (hq ) (q,p)∈R (3.2) hp = (aq ) (q,p)∈R (3.3) These values of ap and hp are maintained normalized on all the Web pages for that particular graph. Thus: (a2p) = 1 and (p) (h2p) = 1 (p) (3.4) The larger the value of ap the better representative of an authority is p. The higher the value of hp the better representative of a hub is p. These values get initialized first to 1 and then get adjusted and the first few hubs and the first few authorities are then filtered out on a given graph. The procedure Filter looks like the following: Filter F: 3.4 Applying HITS to the Graphs of 2.2 51 1. Let z is a vector z = [1, 1, 1, ..1]T of Rn where n is the number of Web pages considered. 2. Initialize ap0 and hp0 to z 3. For i = 1, 2, ..k (a) Use 3.2 to (api−1, hpi−1) obtaining new a weights ai (b) Use 3.3 to (api, hpi−1) obtaining new h weights hi (c) Normalize api, and get a pi (d) Normalize hi, and get hi (e) End (f) Return (apk, hpk) We are proposing the following three conjectures: 1. conjecture 1: It turns out that the returned set (apk, hpk) is made out of 2 values where apk is the principal eigenvector over AT A, and hpk is the eigenvector over AAT . 2. conjecture 2: It turns out that that a Web page cannot be an authority Web page and a hub Web page after the first iteration according to Kleinberg’s Filter Procedure. 3. conjecture 3: It turns out that that a Web page cannot be an authority Web page and a hub Web page at the end of procedure Filter according to Kleinberg’s Filter Procedure. 3.4 Applying HITS to the Graphs of 2.2 3.4.1 HITS Applied to General Graphs Let us apply the procedure Filter F to Figure 3.2. The number of Web pages in Figure 3.2 is n = 6. Parameters a and h after one single iteration of step 3 in F and before any normalization have the following values: a = [2,1,1,3,0,1]T and h = [5,1,5,0,5,0]T . After normalization: a = [.5,.25,.25,.75,0,.25]T and h = [.5735,.1147,.5735,0,.5735,0]T . Here are some of the observations after the first iteration: 1. Web page 4 is the best authority Web page. 2. Web pages 1, 3 and 5 are the best hub pages. 3. Web page 2 is the second best hub Web page. 4. Web pages 2, 3, and 6 are the 3rd best authority Web pages. 5. Web pages 4 and 6 are not hub Web pages. 6. Web Pages 1, 2, and 3 are authority and hub Web pages at the same time with different ranks. 
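The procedure Filter is easy to transcribe. The sketch below (Python, given as an illustration only; the edge set of Figure 3.2, namely 1→{2,4,6}, 2→3, 3→{1,4} and 5→{1,4}, is inferred from the in/out degree values used in the text and should be read as an assumption) iterates the updates 3.2 and 3.3 with the normalization 3.4, and then checks conjecture 1 against the principal eigenvector of AT A:

import numpy as np

# Assumed adjacency matrix of Figure 3.2: 1 -> 2,4,6   2 -> 3   3 -> 1,4   5 -> 1,4.
A = np.array([[0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0]], dtype=float)

def filter_hits(A, k=20):
    # Procedure Filter: k rounds of the updates 3.2 and 3.3 with the normalization 3.4.
    a = np.ones(A.shape[0])
    h = np.ones(A.shape[0])
    for _ in range(k):
        a = A.T @ h                    # a_p = sum of h_q over the pages q that point to p   (3.2)
        h = A @ a                      # h_p = sum of a_q over the pages q that p points to  (3.3)
        a = a / np.linalg.norm(a)      # keep the sum of squares equal to 1                  (3.4)
        h = h / np.linalg.norm(h)
    return a, h

a, h = filter_hits(A)
# a ~ [0.516, 0.258, 0, 0.775, 0, 0.258]     h ~ [0.577, 0, 0.577, 0, 0.577, 0]

# conjecture 1: a should be the principal eigenvector of A^T A (and h that of A A^T).
w, V = np.linalg.eigh(A.T @ A)
principal = np.abs(V[:, np.argmax(w)])       # same direction as a, up to sign

With this assumed edge set, the first pass of the loop reproduces, before normalization, the values a = [2,1,1,3,0,1]T and h = [5,1,5,0,5,0]T quoted above.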
General Graphs make appear the idea of a Web page that is both an authority Web page and a hub page in the first iteration. Web pages 1,2, and 3 in Figure 3.2 prove that to a certain extent. Would that idea persist after a number of iterations? conjecture 2 is not verified in the case of General 52 3 Link Analysis of Web Graphs Graphs. What happens after a number of iterations? The same idea we used in the subsection on the “distance between Web pages” can be applied here. Iterate until the values of vectors a and h do not change any more. What happened to conjecture 3 in the case of General Graphs? Here are the final values of a and h after k = 6 iterations with a change of almost 0: a = [.5164,.2582,0,.7746,0,.2582]T , h = [.5774,0,.5774,0,.5774,0]T . Here are some final observations after k=6 iterations in the procedure Filter: 1. Web Page 4 is still the best authority Web page. 2. Web pages 1, 3 and 5 are still the best and only hub Web pages. 3. Web page 1 is now the 2nd best authority Web page. 4. Web pages 2 and 6 are now the 3rd best authority Web pages. 5. Web page 3 is not an authority Web page any more. 6. Web page 1 is now the only hub Web page and authority Web page. Web page 1 is still the key issue to tackle in the whole subject of hubs and authorities. It was not well expounded in the literature whether that is a major obstacle to the full expansion of hubs and authorities in general. conjecture 3 does not hold for General Graphs. Being satisfied with the early set of iterations for a and h would have been deceiving. Some more values which had a non zero value now have a zero value. The answer to that question lies in the fact that we want to verify that the final values of vectors a and h correspond to the eigenvectors of both AT A and AAT correspondingly. From D = AT A in section I. we can calculate the eigenvalues and eigenvectors of D. The columns of the following matrix M1 constitute the eigenvectors: ⎡ 0 M1 = ⎢⎢⎢⎢⎢⎢⎣−−−0001 −0 −.1647 −.2578 0 .1647 .9331 .0931 −.607 −.182 −0 .607 −.2221 −.425 −.0665 −.7074 −0 .0665 −.2829 .6409 .5774 −.5774 −0 −0 0 −.5774 ⎤ .5164 .2582 −0 .7746 0 ⎥⎥⎥⎥⎥⎥⎦ .2582 And the corresponding eigenvalue vector: eigenvector(D)=(1.0,0.0,0.0,0.0,2.0,5.0)T According to our notations in section 2 we have: 5(AT A) > 2(AT A) > 1(AT A) > 0(AT A) ≥ 0(AT A) ≥ 0(AT A) So the eigenvalue is 5.0 which is the last value of the eigenvector(D). The cor- responding eigenvector is a vector denoted by w6(M1) which the last column of M1: w6(M1) = [0.5164, 0.2582, −0.0, 0.7746, 0.0, 0.2582]T By comparing the last vector and the vector a already calculated above we conclude: w6(M1) = a (3.5) 3.4 Applying HITS to the Graphs of 2.2 53 From C = AAT in section 2, we can calculate the eigenvalues and eigenvectors of C. The columns of the following matrix M2 constitute the eigenvectors: M2 = ⎡ ⎤ 00 0 .8165 .5774 0 ⎢⎢⎢⎢⎢⎢⎣ 1 0 0 0 0 .2571 .9316 −.2571 0 −.6587 .3635 .6587 0 −.4082 −0 −.4082 0 .5774 0 5774 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎦ 00 0 0 01 Here is the corresponding eigenvalue vector:[1.0, 0.0, 0.0, 2.0, 5.0, 0.0]T So according to our notations in section 2, we have: 5(AAT ) > 2(AAT ) > 1(AAT ) > 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) So the eigenvalue is 5.0 which is the 5th value in the vector of eigenvalue. 
So the principal eigenvector denoted by w5(M2) which the fifth column of D would be:[0.5774,0,0.5774,0,0.5774,0]T By comparing the last vector and the vector h already calculated above we conclude: w5(M2) = h (3.6) 3.4.2 HITS applied to In-degree Graphs Let us apply the Filter procedure to an In-degree Graph like the one in Figure 3.3. The number of Web pages in Figure 3.3 is n = 8. Parameters a and h after one single iteration of step 3 in F and before any normalization have the following values: a = [0,0,0,0,0,2,2,3]T and h = [2,2,3,2,2,3,3,0]T . After normalization a = [0,0,0,0,0.4851,0.4851,0.7276]T and h = [0.305,0.305,0.4575, 0.305,0.305, 0.4575,0.4575,0]T . Here some of the observations that can be made from a and h after one single iteration (k = 1): 1. Web Page 8 is the best authority Web page. 2. Web pages 6, and 7 are the 2nd best authority Web pages. sT 8 6 7 T Ts 1 2 3 4 5 Fig. 3.3. An In Degree Graph 54 3 Link Analysis of Web Graphs 3. Web pages 3, 6 and 7 are the best hub Web pages. 4. Web pages 1, 2, 4, and 5 are the 2nd best hub Web pages. 5. Web pages 6 and 7 are authority Web pages and hub Web pages at the same time. The first iteration of the procedure Filter shows a Web page that is both an authority Web page and a hub page. Web pages 6 and 7 in that example show clearly that conjecture 2 does not hold for In-degree Graphs. Would that idea persist after a number of iterations? What happened to conjecture 3 in the case of In-degree Graphs. Here are the final values of a and h after k = 8 iterations with a change of almost 0: a = [0, 0, 0, 0, 0, 0.039, 0.039, 0.9985]T h = [0.02, 0.02, 0.5768, 0.02, 0.02, 0.5768, 0.5768, 0]T . Here are the final observations after k = 8 iterations on the procedure Filter: 1. Web Page 8 is the only authority Web page. 2. Web pages 3,6, and 7 are the only hub pages. The process of identifying hubs and authorities is an iterative process and the first iteration is just the beginning of the process of filtering out weak hubs and authorities. In-Degree trees seem to be ideal for the procedure Filter since no one Web page can be considered both a hub and an authority at the same time. conjecture 3 holds for In-degree Graphs. Being satisfied with the early set of iterations for a and h would have been deceiving. Some more values which had a non zero value now have a zero value. The answer to that question lies in the fact that we want to verify that the final values of vectors a and h correspond to the eigenvectors of both AT A and AAT correspondingly. We want still to verify conjecture 1. It turns out that D = AT A is a diagonal matrix. The eigenvalues and eigenvectors can be calculated simply with no other transformations like what we did in the case of General Graphs. ⎡ ⎤ 00000000 M1 = ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 00000003 The corresponding matrix M1 contains the eigenvectors is the unit matrix of 8*8. Here is the corresponding eigenvalue vector: [0, 0, 0, 0, 0, 2, 2, 3]T So ac- cording to our notations in 3.2 we have: 3(AT A) > 2(AT A) > 2(AT A) > 0(AT A > 0(AT A) ≥ 0(AT A) ≥ 0(AT A ≥ 0(AT A). 3.4 Applying HITS to the Graphs of 2.2 55 The eigenvalue is 5.0 which is the 8th value in the vector of eigenvalue. So the principal eigenvector denoted by w8(M1) which the last column of M1 would be:[0,0,0,0,0,0,0,1]T . 
By comparing the last vector and the vector a already calculated above we conclude: w8(M1) = a (3.7) From C = AAT , we can calculate the eigenvalues and eigenvectors of C. The columns of the following matrix M2 constitute the eigenvectors: ⎡ ⎤ .707 .707 0 0 0 0 00 M2 = ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ −.707 0 0 0 0 0 .707 0 0 0 0 0 0 0 −.707 .707 0 0 0 0 .707 .707 0 0 0 .785 0 0 −.587 −.198 0 .225 0 0 .568 −.792 0 .578 0 0 .578 .578 0 0 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 0 0 0 0 0 0 01 The corresponding eigenvalue vector = [0, 2, 0, 2, 0, 0, 3, 0]T According to our notations in 2, we have: 3(AAT ) > 2(AAT ) > 2(AAT ) > 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) The eigenvalue is 3 which is the 7th value in the vector of eigenvalue. The principal eigenvector denoted by w7(M2) = [0, 0, 0.5774, 0, 0, 0.5774, 0.5774, 0]T By comparing the last vector and the vector h already calculated above we conclude: w7(M2) = h (3.8) 3.4.3 HITS applied to Out-degree Graphs Let us apply the Filter procedure to an out-degree Graph like the one in Figure 3.4. The number of Web pages in Figure 3.4 is n = 8. Parameters a and h after one single iteration of step 3 in F and before any normalization have the following values: a = [0, 1, 1, 1, 1, 1, 1, 1]T and h = [2, 3, 2, 0, 0, 0, 0, 0]T . The values of a and h after normalization are: a = [0, .378, .378, .378, .378, .378, 0, .378, .378]T and h = [.4581, .7276, .4581, 0, 0, 0, 0, 0]T . Here are some observations that can be made from the first iterative values of a and h: 1. Web Pages 2,3,4,5,6,7, and 8 are authority pages. 2. Web page 2 is the best hub Web page. 3. Web pages 1and 3 are the 2nd best hub Web pages. 56 3 Link Analysis of Web Graphs 1 2© ‚3 4© 5c ‚6 c7 ‚8 Fig. 3.4. An Out Degree Graph 4. Web pages 2 and 3 are hub Web pages and authority Web pages at the same time. The first iteration of the procedure Filter shows a Web page that is both an authority Web page and a hub page. Web pages 2 and 3 in that example show clearly that. conjecture 2 does not hold for Out-degree graphs. Would that idea persist after a number of iterations. What happened to conjecture 3 in the case of out-degree graphs. Here are the final values of a and h after k = 8 iterations with a change of almost 0: a = [.03, .03, .03, .576, .576, .576, .03, .03]T , h = [.04, .999, .04, 0, 0, 0, 0, 0]T . Here are some final observations after k = 8 iterations on the procedure Filter: 1. Web Pages 4,5, and 6 are the only authority Web pages. 2. Web page 2 is the only hub Web page. This final result requires some interpretations. Even though Web page 3 could have been considered a hub Web page but because in the same graph more nodes came out of Web page 2 than Web page 3, the measure of a hub for a Web page is a global measure and not a local one. Second, that because 2 is the nub page, Web pages 4, 5, and 6 become the only authority Web pages and not any more 7 and 8 like it was in the first iteration of the procedure Filter. Again the idea of an authority Web page is a global measure and not a local one. This idea could be applied to Citation analysis in general, which says that in a given literature on a given topic, being an authority is not a second best but really the best. The more you are cited, the more links come to your Web page and the more attention you will receive, and that make you an authority in that field regardless of the second best. Further studies have reflected on the fact that there may be more than just authority in a given graph, which should be considered. 
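The global character of the hub and authority weights can be checked numerically with the same sketch used for Figure 3.2. Reusing the filter_hits function given in the sketch of section 3.4.1, and with the edge set of Figure 3.4 inferred from the degree values in the text (1→{2,3}, 2→{4,5,6}, 3→{7,8}, which is therefore an assumption):

import numpy as np
# filter_hits is the function defined in the sketch of section 3.4.1.

# Assumed out-degree tree of Figure 3.4.
A_out = np.zeros((8, 8))
for parent, child in [(1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (3, 7), (3, 8)]:
    A_out[parent - 1, child - 1] = 1

a, h = filter_hits(A_out, k=50)
# a ~ [0, 0, 0, 0.577, 0.577, 0.577, 0, 0] : only the children of page 2 remain authorities
# h ~ [0, 1, 0, 0, 0, 0, 0, 0]             : page 2, with the larger out-degree, is the only hub

Page 3, a reasonable hub locally, is driven to zero because the iteration rewards the sub-tree with the larger out-degree, which is exactly the global behavior discussed above.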
conjecture 3 does hold for Out-degree graphs. Being satisfied with the early set of iterations for a and h would have been deceiving. Some more values which had a non zero value now have a zero value. The answer to that question lies in the fact that we want to verify 3.4 Applying HITS to the Graphs of 2.2 57 that the final values of vectors a and h correspond to the eigenvectors of both AT A and AAT correspondingly. To verify conjecture 1, we calculate D = AT A. we can calculate the eigenvalues and eigenvectors of D. The columns of the following matrix M1 constitute the eigenvectors: ⎡ 10 0 0 0 00 M1 = ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 .707 .707 0 0 0 0 .707 −.707 0 0 0 0 0 0 .785 −.587 −.198 0 0 0 .225 .568 −.792 0 0 0 .577 .577 .577 0 0 0 0 0 0 .707 ⎤ 0 0 0 0 0 0 .707 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 00 0 0 0 0 .707 −.707 Here is the corresponding eigenvalue vector: [0,2,0,0,0,3,2,0]T . Acording to our notations in 3.2 we conclude that: 3(AAT ) > 2(AAT ) > 2(AAT ) > 0(AAT ) > 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) The eigenvalue is 3 which is the 6th value in the vector of eigenvalue. The principal eigenvector denoted by w6(M1) would be the principal eigenvector. The principal eigenvector denoted by w6(M1)=[0,0,0,.5774,.5774,.5774,0,0]T By comparing the last vector and the vector a already calculated above we conclude: w6(M1) = a (3.9) It turns out that C = AAT is a diagonal matrix. The eigenvalues and eigen- vectors can be calculated simply with no other transformations like what we did in the case of General Graphs. ⎡ ⎤ 20000000 M1 = ⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 3 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦ 00000000 The corresponding matrix M2 contains the eigenvectors is the unit matrix of 8*8. Here is the corresponding eigenvalue vector:[2,3,2,0,0,0,0,0]T Thus according to our notations in section 3.2, we conclude that: 3(AT A) > 2(AT A) > 2(AT A) > 0(AT A) > 0(AT A) = 0(AT A) = 0(AT A) = 0(AT A). 58 3 Link Analysis of Web Graphs The eigenvalue is 3 which is the 2th value in the vector of eigenvalue. The principal eigenvector denoted by w2(M2) would be the principal eigenvector. The principal eigenvector denoted by w2(M2)=[0,1,0,0,0,0,0,0]T By comparing the last vector and the vector h already calculated above we conclude: w2(M2) = h (3.10) 3.4.4 HITS Applied to “Complete Bipartite” Graphs Let us apply the Filter procedure to a “complete bipartite” Graph like the one in Figure 3.5. The number of Web pages in Figure 3.5 is n=7. Parameters a and h after one single iteration of step 3 in F and before any normalization have the following values: a=[0,0,0,0,4,4,4]T and h=[12,12,12,12,0,0,0]T . After normalization: a=[0,0,0,0,0.5774,0.5774,0.5774]T and h=[0.5,0.5,0.5,0.5,0,0,0]T . The following observations can be made from a and h after the first iteration: 1. Web Pages 5, 6, and 7 are the only authority pages. 2. Web pages 1,2,3 and 4 are the only hub pages. In this perfect case, no pages are both authority and hub pages at the same time. In this case also, there are no Web pages that are neither authority Web pages nor hub Web pages. The observations made in the first iteration are evident from the complete bipartite graph itself. conjecture 2 holds for complete bipartite graphs. What happened to conjecture 3 in the case of complete bipartite graphs? Complete bipartite graphs are ideal cases for finding hubs and authorities. We need not iterate any further the procedure Filter since 1 E5 ! ‚ E6 2 ! ‚ E…7 3  4 Fig. 3.5. 
A complete bipartite graph 3.4 Applying HITS to the Graphs of 2.2 59 no change will occur any way. But still one can iterate to verify that the values of a and h will stay the same. Here are the final values of a and h after k=8 iterations with no change: a=[0,0,0,0,0.5774,0.5774,0.5774]T and h=[0.5,0.5,0.5,0.5,0,0,0]T . The early observations made on a and h will stand after a number of iterations. Conjecture 3 holds for complete bipartite graphs. 3.4.5 HITS Applied to Bipartite Graphs Let us apply the Filter procedure to a bipartite graph like the one in Figure 3.6. The number of Web pages in Figure 3.6 is n=6. Parameters a and h after one single iteration of step 3 in F and before any normalization have the following values: a=[0,2,0,2,0,1]T and h=[3,0,4,0,2,0]T . After normalization the values of a and h are: a=[0,0.6667,0,0.6667,0,0.3333]T and h=[0.5571,0,0.7428,0, 0.3714,0]T . The following observations can be made from the first values of a and h: 1. Web Page 2 and 4 are the best authority pages. 2. Web page 3 is the best hub page. 3. Web page 6 is the 2nd best authority Web page. 4. Web page 1 is the 2nd best hub Web page. 5. Web page 5 is the 3rd best hub Web page. 6. No Web pages are authority Web pages and hub pages at the same time. 7. All Web pages are either authority Web pages or hub Web pages. conjecture 2 holds for bipartite graphs. What happened to conjecture 3 for bipartite graphs? Here are the final values of a and h after k=12 iterations with almost no change: a=[0,.737,0,.591,0,.328]T , h=[.591,0,.7374,0,.328,0]T . The final observations can be made from the final values of a and h: 1. Web Page 2 is the only best authority Web page. 2. Web page 3 is still the best hub Web page. 6 1 5 ©2 4‚c  ‚3 Fig. 3.6. A bipartite graph 60 3 Link Analysis of Web Graphs 3. Web page 4 becomes the 2nd best authority Web page 4. Web page 6 is the 3rd best authority Web page. 5. Web page 1 is still the 2nd best hub Web page. 6. Web page 5 is the still 3rd best hub Web page. 7. No Web pages are authority Web pages and hub pages at the same time. 8. All Web pages are either authority Web pages or hub Web pages. conjecture 3 holds for bipartite graphs. Being satisfied with the early set of iterations for a and h would have been deceiving. Some more values which had a non zero value now have a zero value. The answer to that question lies in the fact that we want to verify that the final values of vectors a and h correspond to the eigenvectors of both AT A and AAT correspondingly. To verify conjecture 1, we calculate D = AT A. Also we need to calculate the eigenvalues and eigenvectors of D. The columns of the following matrix M1 constitute the eigenvectors: ⎡ ⎤ 1.0 0 0 0 0 0 M1 = ⎢⎢⎢⎢⎢⎢⎣ 0 0 0 0 0 0 0 1.0 0 1.0 0 0 −.737 0 −.591 0 −.328 0 .737 0 .591 0 −.328 0 ⎥⎥⎥⎥⎥⎥⎦ 0 0 0 −.328 −.591 −.737 Here is the corresponding eigenvalue vector: [0,0,0,3.247,1.555,0.1981]T . Thus according to our notations in section 2.3, we have: 3.247(AAT ) > 1.555(AAT ) > 0.1981(AAT ) > 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) Thus 3.247 is the eigenvalue which is the 4th value in the vector of eigenvalue. The principal eigenvector would be w4(M1). The value of the principal eigenvector denoted by w4(M1) would be:[0,-0.737,0,-0.5910,0,-0.3280]T . The reflections made before on using just the value vi which corresponds to gi is contested here. We should use the absolute value of vi(w) as we did with gi(A). 
By comparing the last vector and the vector a already calculated above we conclude: w4(M1) = a (3.11) From C = AAT , we can calculate the eigenvalues and eigenvectors of C. The columns of the following matrix M2 constitute the eigenvectors: ⎡ ⎤ 0 0 −.737 .328 .591 0 M2 = ⎢⎢⎢⎢⎢⎢⎣ 0 0 1 0 1 0 0 0 0 .328 0 .591 0 0 −.591 .737 0 0 .737 .328 0 0 0 0 ⎥⎥⎥⎥⎥⎥⎦ 00 0 0 0 1.0 3.5 Hubs and Authorities in Bow Tie Graphs 61 Here is the corresponding eigenvalue vector: [0,0,1.555,0.1981,3.247,0]T . Thus according to our notations in 2.3 we have: 3.247(AAT ) > 1.555(AAT ) > 0.1981(AAT ) > 0(AAT ) ≥ 0(AAT ) ≥ 0(AAT ) The eigenvalue is 3.247 which is the 5th value in the vector of eigenvalue. The principal eigenvector would be w5(M2). The value of the principal eigenvector denoted by w5(M2) would be:[0.591,0,0.737,0,0.328,0]T . By comparing the last vector and the vector h already calculated above we conclude: w5(M2) = h (3.12) 3.4.6 Summary of the Application of the HITS Algorithm to the Graphs of 2.2 Did the classification of the web graphs of 2.2 developed in equation 2.11 in section 2.5 of chapter 2 help shed a light on the behavior of Kleinberg’s algorithm ([64]) even though Kleinberg’s algorithm is not related to the (OutDegree, In-degree) space in which all these different graphs have been studied. According to equation 2.11 complete bipartite graphs(CBG) have a behavior close to out-degree graphs(OD), and that OD graphs have a behavior close to general graphs(GG), and that GG have a behavior close to in-degree graphs (ID). Conjecture 2 held for CBG, BG but not OD, GG, and ID. Conjecture 2 violated the classification scheme established in that space in the case of OD but held it in the case of GG and ID. Conjecture 3 held for CBG, OD, BG, ID but not GG. According to 2.11 conjecture 3 should not have held ground for ID given the fact that their behavior is close to GG than the other graphs. New classification schemes are needed to help predict the behavior of HITS web algorithms in the new space. The application of HITS algorithm on Bow tie graphs will be considered next. 3.5 Hubs and Authorities in Bow Tie Graphs In Chapter 2 section 3, we established the relationship between a BTG and the left complete bipartite graph and the right complete bipartite graph. We established that the adjacency matrix of the BTG is made out of the adjacency of the left CBG and the right CBG. The number of inlinks of the BTG is made out of the inlinks of the CBG(L) and the CBG(R). The number of outlinks of the BTG is made out of the outlinks of the CBG(L) and the number of outlinks of the CBG(R). The latter was a special case to begin the study of BTG’s. In this section, we start out with a BTG (N*m*N) as a set of 2 complete bipartite graphs CBG with m << N , a left one CBG(L) and a right one CBG(R). (They share a core in the middle as seen in figure 2.13). In this ideal case, the BTG behaves very well according to (Meghabghab [95]). And then we consider the challenging cases: 62 3 Link Analysis of Web Graphs 1. A general bow tie graph: BTG(N × m × M ) with m << N and m << M , with a CBG(L) and a CBG(R) but where m << N and m << M and we consider the cases of N > M and N < M . In the case of N < M , we discover that all hubs in the left side of the BTG are lost while the authority of the core is lost. In the case of N > M , we discover that all authorities in the right side of the BTG are lost while the hub of the core is lost. 2. 
A general bow tie graph: BTG (N × m × M ) with m << N and m << M , as made out of 2 graphs: a complete BG on the left CBG(L) and bipartite graph on the right BG(R) or a BG(L) and CBG(R), which are reasonable cases to study and with N > M or N < M . If N < M or N > M , and a CBG(L) and a BG(R), we discover that the hub of the core is lost while all the authorities on the right side are lost. If N < M or N > M and a BG(L) and a CBG(R), we discover that the authority of the core is lost while all the hubs on the left are lost. 3. A general bow tie graph: BTG(N × m × M ) with m < N and m < M , as made out of 2 graphs: a CBG(L) and BG(R) or a BG(L) and CBG(R), which are reasonable cases to study and with N > M or N < M . In the case of N < M and a BG(L) and a CBG(R), we discover that the hubs of BG(L) are lost. In the case of a CBG(L) and BG(R) and N = M , all hubs of CBG(L) are lost. In the case of a CBG(L) and a BG(R) and N > M , then all authorities are lost including the ones of the core. There is less loss of hubs and authorities in this case than in the cases of (1) and (2). 4. Blondel’s central score (Blondel and VanDooren [9]) was applied to the bow tie graphs of (1), (2), and (3) to help verify whether these cases are different and whether these BTG’s are similar. We show that where m < N and m < M , the central score(s) identifies that the bow tie graphs of the form N*m*M are similar whether N > M or N < M as long as both are CBG(L) and CBG(R). Also the case of: N > M , CBG(L) and BG(R) and the case of: N < M , BG(L) and CBG(R), then both BTG are similar. It turns out that bow tie graphs fall into the category of General Graphs. As seen from table 3.1, in many of the different categories of BTG, the hubs in the left side and the core of the BTG were lost, the authorities of the core and the right side of the BTG were lost. This was never before researched in the study of hubs and authorities in the literature. In the case of table 3.2, there were 2 cases of hubs being lost on the left side of the BTG, and 1 case where the authorities of the core and the right side of the BTG were lost. A study of similarity between these different cases is needed as seen in the next section. Here are the results of hubs and authorities applied to different sizes of bow tie graphs: 3.6 Similarity of Bow Tie Graphs 63 Table 3.1. Hubs and authorities in the case of the size of the core m << N and m << M bow tie graphs m << N, m << M N ∗m∗N N ∗ m ∗ M (N > M ) N ∗ m ∗ M (N < M ) N ∗m∗N N ∗ m is a CBG m ∗ N is not CBG N ∗m∗N N ∗ m is not CBG m ∗ N is a CBG N ∗m∗M N ∗ m is a CBG m ∗ M is not CBG and N > M N ∗m∗M N ∗ misnotaCBG m ∗ M is a CBG, N > M N ∗m∗M N ∗ m is CBG m ∗ M is not CBG, N < M N ∗m∗M N ∗ m is not CBG m ∗ M is a CBG, N < M hubs Does work Lost the hubs of the core Lost the hubs in BTG (L) Lost the hubs of the core. Lost the hubs in BTG(L) Lost the hubs of the core Lost the hubs in BTG(L) Lost the hubs of the core Lost all the hubs in BTG(L) authorities Does Work Lost the authorities in the BTG(R) Lost authorities of the core Lost authorities in BTG(R) Lost the authorities of the core Lost the authorities in BTG(R) Lost the authorities of the core Lost the authorities in BTG(R) Lost the authorities of the core 3.6 Similarity of Bow Tie Graphs Blondel & VanDooren [9] extended the idea of hubs and authority to the general case of a line graph 1 → 2 → 3. The mutually reinforcing updating iteration used above 3.2 and 3.3 can be generalized to the line graph. 
With each vertex j of G we now associate three scores xi1, xi2, andxi3; one for each vertex of the structure graph. We initialize these scores with some positive value and then update them according to the following mutually reinforcing relation. Assume that we have two directed graphs GA and GB with nA and nB vertices and edge sets EA and EB. We think of GA as a structure graph that plays the role of the graphs hub → authority and 1 → 2 → 3 in the above examples. We consider real scores xij for i = 1...nB and j = 1..nA and simultaneously update all scores according to the following updating equations: xij ← (xrs) + (xrs) r:(r,i)∈EB ,s:(s,j)∈EA r:(i,r)∈EB ,s:(j,s)∈EA (3.13) 64 3 Link Analysis of Web Graphs Table 3.2. Hubs and authorities in the case of the size of the core m < N and m M ) N ∗ m ∗ M (N < M ) N ∗ m ∗ N (N < M ) N ∗m∗N N ∗ m is a CBG m ∗ N is not CBG N ∗m∗N N ∗ m is not CBG m ∗ N is a CBG N ∗m∗M N ∗ m is a CBG m ∗ M is not CBG and N > M N ∗m∗M N ∗ misnotaCBG m ∗ M is a CBG, N > M N ∗m∗M N ∗ m is CBG m ∗ M is not CBG, N < M N ∗m∗M N ∗ m is not CBG m ∗ M is a CBG, N < M hubs None were Lost None were Lost Lost the hubs in BTG (L) None were Lost None were Lost None were Lost None were Lost in BTG(L) None were Lost Lost all the hubs in BTG(L) authorities None were Lost None were Lost None were Lost None were lost Lost the authorities of the core and BTG(R) None were lost None were lost None were lost None were Lost The updating equation 3.13 can also be written in more compact matrix form. Let Xk be the nB ∗ nA matrix of entries xij at iteration k. Then the updating equations take the simple form: Xk+1 = BXkAT + BT XkA; k = 0, 1 (3.14) Where A and B are the adjacency matrices of GA and GB. Blondel & VanDooren [9] proved that the normalized even and odd iterates of this updating equation converge, and that the limit of all even iterates is among all possible limits the only one with largest 1-norm. We take this limit as definition of the similarity matrix. A direct algorithmic transcription of 3.14 leads to an approximation algorithm for computing similarity matrices of graphs: Procedure Similarity ([9]) 1. Set Z0 = 1. 2. Iterate an even number of times: Zk+1=BZkAT + BT ZkAk/|BZkAT + BT ZkAk| and stop upon convergence of Zk. 3. Output S. invidious 3.6 Similarity of Bow Tie Graphs 65 adapted truthy likely giving belief Probable vervsimilar probably Fig. 3.7. The BTG of “likely” and “probable” 0.8 0.6 0.4 0.2 0 10 10 5 8 6 4 2 00 Fig. 3.8. A mesh of the centrality Matrix of Likely and probable Blondel & VanDooren [9] applied such a similarity to the extraction of synonyms between terms. Figure 3.7 illustrates the graph of “likely” as compared to “probable”. Figure 3.8 illustrates a 3-D map of the centrality scores of the nodes 1,2,3,4,5,6,7,8,9 calculated according to the procedure Similarity as they correspond to {invidious, truthy, verisimilar, likely, probable, adapted, 66 3 Link Analysis of Web Graphs Table 3.3. 
Similarity between the central score of the core of the BTG made of 2 nodes bow tie graphs m << N , m << M m=2 N ∗m∗N N ∗ m ∗ M (N > M ) N ∗ m ∗ M (N < M ) N ∗m∗N N ∗ m is a CBG m ∗ N is not CBG N ∗m∗N N ∗ m is not CBG m ∗ N is a CBG N ∗m∗M N ∗ m is a CBG m ∗ M is not CBG and N > M N ∗m∗M N ∗ m is not a CBG m ∗ M is a CBG, N > M N ∗m∗M N ∗ m is CBG m ∗ M is not CBG, N < M N ∗m∗M N ∗ m is not CBG m ∗ M is a CBG,N < M Blondel’s central score for the core (m = 2) .51538 .4123 .51783 .40276 .51783 .40276 .53431 .33395 .4092 .35074 .53061 .3032 .54749 .31285 .61738 .26459 .53061 .3032 .4123 .51538 .40276 .51783 .40276 .51783 .33395 .6011 .35074 .58457 .3032 .60641 .31285 .62571 .26459 .61738 .3032 .60641 giving, belief, probably}. The peak correspond to the fact that likely is first very similar to itself followed by probable, and that probable is first very similar to itself and then to likely. This central score has been used for the purpose of examining the similarity of the BTGs mentioned earlier. Blondel’s central score does seem to help identify the similarity between BTGs. Table 3.3 gives a summary of the central score at the core of the BTG. In the case where m < N or m < M , the central score(s) identifies that the BTG of the form (N × m × M ) are similar whether N > M or N < M as long as both are CBG(L) and CBG(R). In the case of N > M , CBG(L) and BG(R) and the case of: N < M , BG(L) and CBG(R), then both BTG are similar. Figure 3.9 illustrates a 3-D map of the central scores of the BT G(50 × 2 × 100) as calculated in the procedure Similarity. The peak in fig. 3.9 corresponds to the nodes 51 and 52, the 2 nodes at the core of the BTG, where 51 is very similar to itself first and then to 52, and 52 is very similar to itself first and then to 51. 3.7 Limitations of Link Analysis 67 0.5 0.4 0.3 0.2 0.1 0 200 150 100 50 00 200 150 100 50 Fig. 3.9. A mesh of the centrality Matrix of a BTG(50 × 2 × 100) 3.7 Limitations of Link Analysis conjecture 2 was true for some graphs and not true for other graphs. Web pages that started out to be both hub Web pages and authority Web pages in the early stages of the procedure Filter were soon changed into either hub Web pages or authority Web pages but not both any more at the end of iterations. conjecture 3 held ground for all graphs with the exception of general Web graphs. The idea that a Web page can be both a hub page and an authority Web page is intriguing and worth further consideration. According to Jon Kleinberg ([65] sept 14, 2000) “nothing in the algorithm prevents this from happening but it is not a very common phenomenon”. Furthermore, we only considered in this study the influence of principal eigenvectors on the selection of hubs and authorities. If we expand the role of these secondary eigenvectors, we will have not only just a set of hubs and authorities, but a multiple sets of hubs and authorities. Further research needs to focus on the structure of the different kind of graphs studied in this chapter influence the idea of a multiple sets of hubs and authorities. We showed that hubs and authorities can be lost in some BTG more than others especially when m << N or m << M . In order to decrease the instabilities in such cases of BTG, i.e., “topic drifting”, more Web pages needs to be added to the core so the ratio of the size of the core to the size of the left or right number of Web pages will increase. 
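The central scores used in these comparisons follow equation 3.14 directly. A minimal transcription of the procedure Similarity is sketched below (Python; the Frobenius norm is used for the normalization step, and the small BTG(3*1*3) taken as input is an illustrative assumption). The structure graph is the line graph 1 → 2 → 3, so the middle column of the limit matrix gives the central score of each vertex:

import numpy as np

def similarity(B, A, iters=20):
    # Even iterates of  Z <- B Z A^T + B^T Z A, normalized  (equation 3.14), with Z0 = all ones.
    Z = np.ones((B.shape[0], A.shape[0]))
    for _ in range(iters - iters % 2):     # stop after an even number of steps
        Z = B @ Z @ A.T + B.T @ Z @ A
        Z = Z / np.linalg.norm(Z)          # Frobenius normalization of the iterate
    return Z

# Structure graph 1 -> 2 -> 3; similarity with its middle vertex is the central score.
A_path = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]], dtype=float)

# A small illustrative bow tie BTG(3*1*3): pages 1-3 -> core 4 -> pages 5-7.
B = np.zeros((7, 7))
B[0:3, 3] = 1
B[3, 4:7] = 1

central = similarity(B, A_path)[:, 1]      # middle column of the limit = central scores

As expected, the core comes out with by far the largest central score, which is the quantity reported for the cores of the larger bow tie graphs in Table 3.3.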
Also through the concept of centrality scores between 2 graphs, we showed how 68 3 Link Analysis of Web Graphs similar are the different cases of BTGs that we studied. Thus, some of these cases are more similar than others and thus the original idea to consider the different BTGs. More research on graph similarity can shed more light on the different structures of the BTG’s and could be helpful in improving ranking. A more in depth study of the stability (Chung [23]) of BTG is needed. This chapter focused on just using links as a mean to evaluate Web pages and uncover hubs and authorities. No heuristics or any other contextual information was used to further enhance the idea of hubs and authorities. In an early study, McBryan [78] used searching hyperlinks based on an anchor text, in which one treats the text surrounding a hyperlink as a descriptor of the page being pointed to when assessing the relevance of that Web page. Frisse [37] considered the problem of document retrieval in single-authored, stand-alone works of hypertext. He proposed heuristics by which hyperlinks can enhance notions of relevance (Van Rijsbergen [129]), and hence the performance of retrieval heuristics. In recent studies (Bharat & Henzinger [6]; Chakrabarti, et al. [18, 19]), three distinct user studies were performed to help evaluate the HITS system to evaluate information found on the WWW. Each one of the studies employed additional heuristics to further enhance relevance judgments. As such, these 3 studies cannot enhance the direct evaluation of the pure link-based method described here; rather they assess its performance as the core component of a WWW search tool. For example in [18], the CLEVER system was used to create an automatic resource compilation or the construction of lists of high-quality WWW pages related to a broad search topic and the goal was to see whether the output of CLEVER compared to that of a manually generated compilation such as the WWW search service of Yahoo for a set of 26 topics. A collection of 37 users was assembled; the users were required to be familiar with the use of a Web browser, but were not experts in the topics picked. The users were asked to judge each Web page as “bad”, “fair”, “good” or “fantastic” in terms of their utility of learning about the topic. For approximately 31% of the topics, the evaluations of Yahoo and CLEVER were equivalent to within a threshold of statistical significance; for approximately 50% of the topics CLEVER was evaluated higher; and for the remaining 19% yahoo was evaluated higher. Many of the users of these studies reported that they used the lists as starting points from which to explore, but that they visited many pages not on the original topic lists generated by the various techniques. Of course it is hard to draw definitive conclusions from these 3 studies. We believe more insight in the structure of the graphs under consideration will help improve the success of Web algorithms, i.e., HITS and CLEVER, to refine the original concepts and make place for better understanding of hubs and authorities in a variety of topological structures. This study does not pretend to have improved finding information on the WWW. This study focuses on discovering important topological structures in Web pages and to predict the behavior of Web algorithms in such environment especially that the WWW is rich not only in information but in topological structures. 
4 PageRank Algorithm Applied to Web Graphs 4.1 Introduction Commercial search engines are a key access point to the Web and have the difficult task of trying to find the most useful of the billions of Web pages for each user query entered. Google ’s PageRank ([12]) was an attempt to resolve this dilemma based upon the assumptions that: 1. more useful pages will have more links to them and 2. links from well linked to pages are better indicators of quality. The continued rise of google to its current dominant position and the proliferation of other link based algorithms seem to make an unassailable argument for the PageRank algorithm, despite the paucity of clear cut results. Modern Web IR algorithms are probably a highly complex mixture of different approaches, perhaps optimized using probabilistic techniques to identify the best combination. It is not possible to be definitive about commercial search engine algorithms, however, since they are kept secret apart from the broadest details. In fact academic research into Web IR is in a strange situation since research budgets and data sets could be expected to be dwarfed by those of the commercial giants, whose existence depends upon high quality results in an incredibly competitive market place. One research that compared the two found that the academic systems were slightly better but the authors admitted that the tasks were untypical for Web users ([50]). Nevertheless, google is one case amongst many of search algorithms gaining from approaches and developments in information science in general and bibliometrics in particular. Traditional information retrieval techniques can give poor results on the Web, with its vast scale and highly variable content quality. Recently, however, it was found that Web search results can be much improved by using the information contained in the link structure between pages. The two best-known algorithms which do this are HITS ([64]) and PageRank ([12]). The latter is used in the highly successful google search engine. The heuristic underlying both of these approaches is that pages with many inlinks are more likely to G. Meghabghab and A. Kandel: Search Engines, Link Analysis, and User’s Web Behavior, Studies in Computational Intelligence (SCI) 99, 69–81 (2008) www.springerlink.com c Springer-Verlag Berlin Heidelberg 2008 70 4 PageRank Algorithm Applied to Web Graphs be of high quality than pages with few inlinks, given that the author of a page will presumably include in it links to pages that s/he believes are of high quality. Given a query (set of words or other query terms), HITS invokes a traditional search engine to obtain a set of pages relevant to it, expands this set with its inlinks and outlinks, and then attempts to find two types of pages, hubs (pages that point to many pages of high quality) and authorities (pages of high quality). Because this computation is carried out at query time, it is not feasible for today’s search engines, which need to handle tens of millions of queries per day. In contrast, PageRank computes a single measure of quality for a page at crawl time. This measure is then combined with a traditional information retrieval score at query time. Compared with HITS, this has the advantage of much greater efficiency, but the disadvantage that the PageRank score of a page ignores whether or not the page is relevant to the query at hand. Traditional information retrieval measures like TFIDF ([4]) rate a document highly if the query terms occur frequently in it. 
PageRank rates a page highly if it is at the center of a large sub-web (i.e., if many pages point to it, many other pages point to those, etc.). Intuitively, however, the best pages should be those that are at the center of a large sub-web relevant to the query. If one issues a query containing the word "Clinton", then pages containing the word "Clinton" that are also pointed to by many other pages containing "Clinton" are more likely to be good choices than pages that contain "Clinton" but have no inlinks from pages containing it. The PageRank score of a page can be viewed as the rate at which a surfer would visit that page if it surfed the Web indefinitely, blindly jumping from page to page. A problem common to both PageRank and HITS is topic drift: because they give the same weight to all edges, the pages with the most inlinks in the network being considered tend to dominate, whether or not they are the most relevant to the query. Haveliwala ([49]) proposes applying an optimized version of PageRank to the subset of pages containing the query terms, and suggests that users do this on their own machines. We first describe PageRank. We then introduce our topology-dependent version of PageRank, which, if modified properly, can work efficiently.

Imagine a Web surfer who jumps from Web page to Web page, choosing with uniform probability which link to follow at each step. In order to reduce the effect of dead ends or endless cycles, the surfer will occasionally jump to a random page with some small probability b, or whenever on a page with no outlinks. To reformulate this in graph terms, consider the Web as a directed graph, where nodes represent Web pages and edges between nodes represent links between Web pages. Let W be the set of nodes, N = |W|, F_i be the set of pages that page i links to, and B_i be the set of pages which link to page i. For pages which have no outlinks we add a link to all pages in the graph; in this way, rank which is lost due to pages with no outlinks is redistributed uniformly to all pages. If averaged over a sufficient number of steps, the probability that the surfer is on page j at some point in time is given by the formula:

PR(p_j) = (1 - b)/N + b \sum_{p_i \in B_j} PR(p_i)/|F_i|     (4.1)

PageRank is a simple and intuitive algorithm, yet it admits a mathematical implementation that scales to the billions of pages currently on the Web.

4.2 PageRank Algorithm Applied to the Graphs of 2.2

Let r_p be the rank of a page p and od_i be the out-degree of node i, i.e., the number of outgoing links from a Web page i. The rank of a page p is computed as specified by Google's search engine:

r_p = (1 - c) + c (r_1/od_1 + r_2/od_2 + r_3/od_3 + ... + r_n/od_n)     (4.2)

over all nodes 1, 2, 3, ..., n where (1,p) ∈ R, (2,p) ∈ R, (3,p) ∈ R, ..., (n,p) ∈ R, and where c is a constant, 0 < c < 1 (ideally selected at 0.85). Note that the r_p values form a probability distribution over all Web pages, so that the total probability over all Web pages is one. Applying equation (4.2) to the graph in Figure 4.1 yields the following equations for its vertices:

r_1 = (1 - c) + c (r_3/2 + r_5/2)     (4.3)
r_3 = (1 - c) + c r_2     (4.4)
r_5 = (1 - c)     (4.5)
r_2 = (1 - c) + c r_1/3     (4.6)

Fig. 4.1. A General Graph
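The recurrence in equation (4.2) can be evaluated by simple fixed-point iteration; because every contribution is scaled by c = 0.85, the iteration is a contraction and converges from any starting values. The sketch below is a minimal Python illustration of that computation, not code from the book; the edge set used for Figure 4.1 is an assumption reconstructed from equations (4.3)–(4.6) and the in-degree column of Table 4.1, since the figure itself is not legible here.

```python
# Minimal sketch of the ranking recurrence of equation (4.2):
#   r_p = (1 - c) + c * sum(r_i / od_i for every page i that links to p)
# solved by fixed-point iteration with c = 0.85.

def page_rank(edges, nodes, c=0.85, iterations=100):
    """edges: iterable of (source, target) pairs; returns a dict node -> rank."""
    out_degree = {n: 0 for n in nodes}
    in_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        in_links[dst].append(src)
    rank = {n: 1.0 for n in nodes}   # any starting point converges
    for _ in range(iterations):
        rank = {n: (1 - c) + c * sum(rank[i] / out_degree[i] for i in in_links[n])
                for n in nodes}
    return rank

# Edge set assumed for the general graph of Figure 4.1 (reconstructed from
# equations (4.3)-(4.6) and the in-degrees of Table 4.1):
#   1 -> 2, 4, 6;  2 -> 3;  3 -> 1, 4;  5 -> 1, 4
edges = [(1, 2), (1, 4), (1, 6), (2, 3), (3, 1), (3, 4), (5, 1), (5, 4)]
ranks = page_rank(edges, nodes=range(1, 7))
print({n: round(r, 3) for n, r in ranks.items()})
# approximately {1: 0.367, 2: 0.254, 3: 0.366, 4: 0.473, 5: 0.15, 6: 0.254},
# matching Table 4.1 below up to rounding
```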
Replacing equations (4.3) to (4.6) in (4.2) yields:

r_1 = 3(1 - c)(c^2 + 2c + 2) / (6 - c^3)     (4.7)

Table 4.1 shows the values of r_1, r_2, r_3, r_4, r_5, and r_6 as a function of c:

Table 4.1. The values of r_1, r_2, r_3, r_4, r_5, and r_6
  Rank of vertex    Value of c   Value of id
  r_1 = 0.367       0.85         2
  r_2 = 0.254       0.85         1
  r_3 = 0.366       0.85         1
  r_4 = 0.47325     0.85         3
  r_5 = 0.15        0.85         0
  r_6 = 0.254       0.85         1

The following observations can be made:
• Vertex 4, being the vertex with the most links pointing to it (id = 3), yielded the highest ranking among all the pages.
• Vertex 1 follows vertex 4 in the ranking because of its id = 2.
• Vertices 3, 2, and 6 rank behind vertex 1. Although these 3 vertices all have id = 1 and should therefore have equal rank, they differ greatly: r_3 is closer to r_1 than it is to the rank of vertices 2 and 6, which are equal (r_2 = r_6).
• Vertex 5 has the lowest rank because it did not have any nodes pointing to it.
• According to equation (4.2), vertices with the same id should yield equal ranking.

The last observation is violated in the graph of Figure 4.1: vertices 2, 3, and 6 should have very similar rankings. To observe whether Google's ranking behaves better on other topological structures, we simulated Google's algorithm on other graph patterns: complete bipartite graphs like Figure 4.2, out-degree Web graphs like Figure 4.3, in-degree Web graphs like Figure 4.4, and bipartite graphs like Figure 4.5.

Fig. 4.2. A Complete Bipartite Graph
Fig. 4.3. An Out-Degree Graph
Fig. 4.4. An In-Degree Graph
Fig. 4.5. A Bipartite Graph G1

4.2.1 PageRank Applied to Complete Bipartite Graphs

Theorem 1. Nodes at the same tree level in complete bipartite graphs have equal PageRank value:

r_0 = 0.15
r_1 = 0.15 + 0.85 r_0 m / k     (4.8)

where m is the number of nodes at level 0, and k is the number of children of each node at level 0.

Proof. Since complete bipartite graphs are made of 2 levels, one level with id = 0 and another level with id higher than 0, all vertices of degree higher than 0 will have equal ranking regardless of the number of nodes or the configuration.

Example: In Figure 4.2, m = 4 and k = 3. The following observations can be made:
1. All vertices in V1 = {1, 2, 3, 4} have id = 0. They are also at level 0. According to Theorem 1 their ranks are all the same and equal to r_0 = 0.15 (for c = 0.85).
2. All vertices in V2 = {5, 6, 7} have id = 4. They are all at level 1. According to Theorem 1 their ranks are the same and equal to r_1 = (1 - c) + c (r_0 · 4/3) = 0.32.
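To check Theorem 1 numerically, the closed form of equation (4.8) can be evaluated for the complete bipartite graph of Figure 4.2, where each of the m = 4 source vertices links to all k = 3 sink vertices. The following is a small illustrative sketch, not code from the book:

```python
# Sketch: ranks of the complete bipartite graph of Figure 4.2 via equation (4.8).
def rank_complete_bipartite(m=4, k=3, c=0.85):
    sources = range(1, m + 1)             # vertices 1..4, no inlinks
    sinks = range(m + 1, m + k + 1)       # vertices 5..7, each with id = m
    r0 = 1 - c                            # level 0: r0 = 0.15
    r1 = (1 - c) + c * m * r0 / k         # level 1, equation (4.8)
    return {**{v: r0 for v in sources}, **{v: r1 for v in sinks}}

print(rank_complete_bipartite())
# vertices 1-4 get 0.15 and vertices 5-7 get approximately 0.32,
# matching the example above
```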
4.2.2 PageRank Applied to Out-degree Graphs

Theorem 2. Nodes at the same sub-tree level in out-degree trees have equal PageRank value:

r_n = 0.15     for n = 0
r_n = 0.15 + 0.85 r_{n-1}/m     for n = 1
One sub-tree:   r_n = 0.15 + 0.85 r_{n-1}/m     for n > 1 and m ≠ k
Other sub-tree: r_n = 0.15 + 0.85 r_{n-1}/k     for n > 1 and m ≠ k     (4.9)

where m is the number of children of node n−1 in one sub-tree, k is the number of children of node n−1 in the other sub-tree, and n indexes the level of the node.

Proof. Out-degree graphs (or trees) having more than 2 levels will never provide equal ranking for vertices with equal degree. Instead of the degree of a node, the level of a node in a given tree is the determining factor.

Example. Consider the out-degree tree in Figure 4.3:
1. r_0 = 0.15
2. r_1 = 0.15 + 0.85(r_0/2) = 0.214
3. For n = 2 we have 2 ranks because m ≠ k (m = 3, k = 2):
   r_2 = 0.15 + 0.85(r_1/3) = 0.221
   r_2 = 0.15 + 0.85(r_1/2) = 0.241

Thus the rank of node 1 is r_0, the rank of nodes 2 and 3 is the same r_1, the rank of nodes 4, 5, and 6 is the same r_2 = 0.221, and the rank of nodes 7 and 8 is the same r_2 = 0.241.

Interpretation: Even though all vertices have the same degree id = 1, except the root of the tree which has id = 0, vertices 2 and 3 at level 1 have equal rank, which is 0.214. Vertices at level 2 are divided into 2 ranks because more children come out of node 2 in the left sub-tree than out of node 3 in the right sub-tree. Vertices 4, 5, and 6, which are at level 2 of the left sub-tree, have a lower rank than those on the right part of the tree, and both ranks are still higher than that of the vertices at level 1. This is expected, since what feeds into vertices 4, 5, and 6 are nodes of a degree higher than the node feeding into vertices 2 and 3, which is node 1. If 1 cites 2 and 3, and 2 in turn cites three others such as 4, 5, and 6, then 4, 5, and 6 will have a higher ranking than 2 because they are considered more important than 2.

4.2.3 PageRank Applied to In-degree Graphs

Theorem 3. Nodes at the same tree level in in-degree trees have equal PageRank value:

r_0 = 0.15
r_n = 0.15 + 0.85 (m r_{n-1} + k r_{n-2})     (4.10)

where m is the number of parents of node n from level n−1 and k is the number of parents of node n from level n−2.

Proof. In-degree graphs (or trees) having more than 2 levels will never provide equal ranking for vertices with equal in-degree. Instead of the in-degree of a node, the level of a node in a given tree is the determining factor.

Example. Consider the general in-degree tree in Figure 4.4:
1. r_0 = 0.15
2. r_1 = 0.15 + 0.85(2 r_0) = 0.405
3. r_2 = 0.15 + 0.85(2 r_1 + r_0) = 0.555

Thus the rank of node 8 is r_2, the rank of nodes 6 and 7 is r_1, and the rank of nodes 1, 2, 3, 4, and 5 is r_0.

Interpretation: Even though vertices 6, 7, and 8 have the same degree id = 2, only vertices 6 and 7 at level 1 have equal rank, which is 0.405. Vertex 8, which is at level 2, has the highest rank, 0.555. Vertices 1, 2, 3, 4, and 5, which are at level 0, have equal rank, which is much lower than that of all other vertices. This is expected, since what feeds into vertex 8 is not only a low-level node such as 3 but also vertices 6 and 7, which are nodes of higher in-degree than those feeding into 6 and 7, namely vertices 1, 2, 4, and 5, which occupy level 0. If 1 and 2 point to 6, 4 and 5 point to 7, and 6 and 7 point to 8 besides 3, that makes 8 a more important link than all other links in the tree.

4.2.4 PageRank Applied to Bipartite Graphs

Theorem 4. PageRank's algorithm will not work on bipartite graphs.

Proof. Even though we have seen that bipartite graphs allow the separation of the set of vertices into 2 sets, nothing has changed in the fact that nodes with equal in-degree id can have different rankings, which is contrary to the interpretation of equation (4.1).

Example. Consider the example of a bipartite graph like the one in Figure 4.5. Table 4.2 summarizes Google's ranking applied to Figure 4.5.

Table 4.2. Google's ranking applied to Figure 4.5
  Rank of vertex   Value of c   Value of id
  r_1 = 0.15       0.85         0
  r_2 = 0.21       0.85         1
  r_3 = 0.33       0.85         1
  r_4 = 0.62       0.85         3
  r_5 = 0.15       0.85         0
  r_6 = 0.21       0.85         1
The same remark that was made for the graph in Figure 4.1 still applies to the graph in Figure 4.5: vertex 3 has a higher ranking than vertices 2 and 6, even though these 3 vertices have the same in-degree. Making the graph a bipartite graph did not change anything in the ranking of links. A more involved ranking scheme will be needed to further explore the complexity of ranking in bipartite and general graphs.

4.2.5 PageRank Applied to General Graphs

Theorem 5. PageRank's algorithm will not work on general graphs.

Table 4.3 summarizes the results of the modified Google ranking procedure on these different topological structures. Such an algorithm therefore needs to be applied carefully, depending upon the topological structure of the Web pages being studied. No other studies have reported the influence of the topology of the Web links on Google's page ranking technique.

Table 4.3. Modified Google ranking
  Topological structure       Google's Web ranking
  In-degree trees             Works well per tree level (Theorem 3)
  Out-degree trees            Works well per sub-tree level (Theorem 2)
  Bipartite graphs            Does not work (Theorem 4)
  General graphs              Does not work (Theorem 5)
  Complete bipartite graphs   Works well per graph level (Theorem 1)

4.2.6 Summary of the Application of the PageRank Algorithm to the Graphs of 2.2

Did the classification of the Web graphs of 2.2, developed in equation 2.11 in Section 2.5 of Chapter 2, help shed light on the behavior of PageRank's algorithm ([12]), even though PageRank is not related to the (out-degree, in-degree) space in which all these different graphs have been studied? According to equation 2.11, complete bipartite graphs (CBG) have a behavior close to out-degree graphs (OD), OD graphs have a behavior close to general graphs (GG), and GG have a behavior close to in-degree graphs (ID). The PageRank algorithm works well per graph level for CBG, per sub-tree level for OD, and per tree level for ID, but it does not work for GG and BG. According to 2.11, PageRank should not have held its ground for ID, given that their behavior is closer to GG than to the other graphs. New classification schemes are needed to help predict the behavior of PageRank algorithms in the new space. The application of the PageRank algorithm to bow tie graphs is considered next.

4.3 PageRank Algorithm Applied to Bow Tie Graphs

As seen in Figure 4.6, a BTG is made out of 2 graphs: a left graph, a bipartite graph BG(L), which includes the nodes 1, 2, 3, ..., 10 (the origination nodes) and node 11 (the core); and a right graph BG(R), which is made out of node 11 and nodes 12, 13, ..., 21. In Section 2.3, we established the relationship between a BTG and its left and right complete bipartite graphs. We established that the adjacency matrix of the BTG is made out of the adjacency matrices of the left CBG and the right CBG, that the number of inlinks of the BTG is made out of the inlinks of the CBG(L) and the CBG(R), and that the number of outlinks of the BTG is made out of the outlinks of the CBG(L) and the CBG(R). The latter was a special case with which to begin the study of BTGs. In this section, we start out with a BTG(N*m*N) as a set of 2 complete bipartite graphs with m << N, a left one CBG(L) and a right one CBG(R), sharing a core in the middle as seen in Figure 4.6.

Fig. 4.6. A 10*1*10 bow tie graph
Example: applying PageRank to a BTG(10*1*10). The BTG of Figure 4.6 is made out of 2 CBGs, a CBG(L) and a CBG(R). We use equation (4.8), applied over 3 levels:
1. At level L = 0, the ranking of nodes 1 through 10 is r_0 = 0.15.
2. At level L = 1, using (4.8): r_1 = 0.15 + 0.85 (m r_0/k) = 1.425.
3. At level L = 2, using (4.8): r_2 = 0.15 + 0.85 (m r_1/k) = 0.2775.

Table 4.4. Ranking of the nodes of Figure 4.6
  Rank of vertex                              Value of c   Value of id
  r_1,2,3,4,5,6,7,8,9,10 = 0.15               0.85         0
  r_11 = 1.425                                0.85         10
  r_12,13,14,15,16,17,18,19,20,21 = 0.2775    0.85         1

As seen in Table 4.4, the nodes at level 0, 1 through 10, have the lowest rank. The core node, 11, has the highest rank, while the nodes at level 2, 12 through 21, have a rank higher than the nodes of level 0 but lower than those of level 1. It seems that Google's ranking is working: nodes with the highest id have the highest rank, while the nodes with the lowest id have the lowest rank. We consider the following cases:

1. A general bow tie graph BTG(N*m*M) with m << N and m << M, with a CBG(L) and a CBG(R), where we consider the cases N > M and N < M.
In the case of N < M, we find that:
For level L = 0: r_0 = 0.15 for all vertices from 1 through N.
For level L = 1, using (4.8): r_1 = 0.15 + 0.85 (r_0 N/m) = 0.15 + 0.85 (0.15 N/m) ≈ 0.15 + N/(10m) ≈ N/(10m) for all vertices from N+1 through N+m.
For level L = 2, using (4.8): r_2 = 0.15 + 0.85 (r_1 m/M) = 0.15 + 0.85 N m/(10 m M) = 0.15 + 0.85 N/(10M) ≈ 0.15 for all vertices from N+m+1 through N+M.

Table 4.5. Google's ranking on a general bow tie graph (N < M)
  Rank of vertex                   Value of c   Value of id
  r_1..N = 0.15                    0.85         0
  r_N+1..N+m = N/(10m) > 0.15      0.85         N
  r_N+m+1..N+M ≈ 0.15              0.85         m, with 0 << m << N

According to Table 4.5, although the id of all nodes from N+m+1 through N+M is m > 0, the rank of all these nodes is very close to 0.15. Thus Google's ranking does not work in that case.
In the case of N > M, we find that (see Table 4.6):
For level L = 0: r_0 = 0.15 for all vertices from 1 through N.
For level L = 1, using equation (4.8): r_1 = 0.15 + 0.85 (0.15 N/m) = 0.15 + N/(10m) ≈ N/(10m).
For level L = 2, using (4.8): r_2 = 0.15 + 0.85 (r_1 m/M) = 0.15 + 0.85 N m/(10 m M) = 0.15 + 0.85 N/(10M) > 0.15.
It appears that Google's ranking works in this case.

Table 4.6. Google's ranking does work in the case N > M
  Rank of vertex                   Value of c   Value of id
  r_1..N = 0.15                    0.85         0
  r_N+1..N+m = N/(10m) > 0.15      0.85         N
  0.15 < r_N+m+1..N+M < N/(10m)    0.85         m, with 0 << m << N

2. A general bow tie graph BTG(N*m*M) with m << N and m << M, made out of 2 graphs: a complete bipartite graph on the left, CBG(L), and a bipartite graph on the right, BG(R), or a BG(L) and a CBG(R), which are reasonable cases to study, with N > M or N < M.
If N < M or N > M with a CBG(L) and a BG(R), we still use equation (4.8) for the CBG(L):
r_0 = 0.15 for all nodes 1..N
r_1 = 0.15 + 0.85 (0.15 N/m) = 0.15 + N/(10m)
For the BG(R) we have a bipartite graph, and we know from Theorem 4 that in that case Google's ranking does not work. If we use a BG(L) and a CBG(R), with N > M or N < M, we know that Google's ranking will not work on the left but will work on the right.

3. A general bow tie graph BTG(N*m*M) with m < N and m < M, made out of 2 graphs, a CBG(L) and a BG(R) or a BG(L) and a CBG(R), which are reasonable cases to study, with N > M or N < M. Case 3 follows Case 2, since we have a BG(R) one time and a BG(L) the other time. In case 3,
in general, we conclude that Google's ranking will not work.

4. A general bow tie graph where m >> M and m >> N, made out of 2 graphs: a CBG(L) and a CBG(R). In the case of N < M or N > M, we find that:
For level L = 0: r_0 = 0.15 for all vertices from 1 through N.
For level L = 1, using (4.8): r_1 = 0.15 + 0.85 (r_0 N/m) = 0.15 + 0.85 (0.15 N/m) ≈ 0.15 for all vertices from N+1 through N+m (since m >> N).
For level L = 2, using (4.8): r_2 = 0.15 + 0.85 (r_1 m/M) ≈ 0.15 + 0.85 (0.15 m/M) = 0.15 + 0.1275 m/M, so r_2 > 0.15 for all vertices from N+m+1 through N+M.

Table 4.7. Google's ranking for a bow tie graph where N < M
  Rank of vertex                   Value of c   Value of id
  r_1..N = 0.15                    0.85         0
  r_N+1..N+m = N/(10m) > 0.15      0.85         N
  r_N+m+1..N+M > 0.15              0.85         m, with 0 < m << N

According to Table 4.7, although the id of all nodes from N+1 through N+m is N > 0, the rank of all these nodes is very close to 0.15, which is equal to the ranking of the nodes with id = 0. Thus Google's ranking does not work in that case.

5. A general bow tie graph where m = M and m = N, made out of 2 graphs: a CBG(L) and a CBG(R). In that case, we find that:
For level L = 0: r_0 = 0.15 for all vertices from 1 through N.
For level L = 1, using (4.8): r_1 = 0.15 + 0.85 (0.15 N/N) ≈ 0.15 + N/(10N) = 0.25 for all vertices from N+1 through N+m.
For level L = 2, using (4.8): r_2 = 0.15 + 0.85 (r_1 m/M) = 0.15 + 0.85 · 0.25 = 0.15 + 0.2125 = 0.3625 for all vertices from N+m+1 through N+M.

Table 4.8. Google's ranking for a bow tie graph where m = M and m = N
  Rank of vertex             Value of c   Value of id
  r_1..N = 0.15              0.85         0
  r_N+1..N+m = 0.25          0.85         N
  r_N+m+1..N+M = 0.3625      0.85         N

According to Table 4.8, although the id of all nodes from 2N+1 through 3N is N, the rank of all these nodes is 0.3625, which is greater than the 0.25 of the nodes from N+1 through 2N that have the same id. Google's ranking does not work in that case either.

4.4 Limitations of PageRank Algorithm

Google's ranking algorithm needs to be adjusted to different topological Web structures in order to rank Web pages successfully without any contextual information added. Given that Google is the foremost indexing search engine, with more than eight billion Web pages, and that it has been adopted by major search services such as Yahoo, a study of its ranking algorithm is timely, to further explore its applicability to a variety of Web graphs. This study focused first on categorizing different Web graphs and classifying them according to their in-degree/out-degree properties. We then applied Google's Web ranking algorithm to complete bipartite graphs, followed by bipartite graphs, then out-degree trees, in-degree trees, and finally general graphs. Google's ranking algorithm worked best on complete bipartite graphs, by ranking vertices at the same level equally. The algorithm did not fare as well on the other Web graph structures in the ranking of the remaining vertices: more specifically, vertices with equal degrees, e.g., an equal number of incoming links, did not rank equally. Different theorems were established for these different topological structures, readjusting Google's ranking to fit them. Other ranking algorithms could have been studied and applied to the different topological Web graphs, but the popularity of Google and its wide indexing power, more than a billion Web pages, make it a very powerful search engine, one that has been adopted by other search services such as Yahoo.
All of these factors contributed to the consideration of Google's ranking algorithm in this study.

5 Stochastic Simulations of Search Engines

5.1 Introduction

Computer simulations are widely used in a variety of applications, including the military, service industries, manufacturing companies, nuclear power plants, and transportation organizations. In nuclear power plants, for example, computer simulations are often used to train personnel on failures and on normal operation, to study operation plans, to support testing of the heart of the plant, and to evaluate and compare future design changes. Other techniques that are used to examine systems generally do not have the advantages that computer simulations bring, mainly because computer simulations provide cheaper and more realistic results than other approaches. In some cases, computer simulation is the only means of examining a system, as in nuclear power plants, since it is either too dangerous to bring a system into such failure conditions or too costly, especially in combat situations, to experiment with the system. Computer simulations also permit us to study systems over a long period of time, to learn from real-world past experiences, and to have control over experimental conditions. Actual Internet simulation models are expensive to develop and use in terms of personnel, time, and resources. Large memory requirements and slow response times can prevent companies from considering them useful tools. Developing Internet simulation models for training requires models that are very close to reality, while their speed is secondary. During testing, speed and reproducibility become the primary issues, which pushes us to make the different internal simulation modules as efficient and accurate as possible. Computer simulations have provided companies with descriptions of the input settings needed to produce the best output value for a given set of inputs in a specific domain of study. Response surface methodologies, using regression-model approximations of the computer simulation, have been the means of achieving computer simulation optimization. As Myers et al. ([104]) stated in Technometrics, "There is a need to develop non-parametric techniques in response surface methods (RSM). The use of model-free techniques would avoid the assumption of model accuracy or low-level polynomial approximations and in particular, the imposed symmetry, associated with a second degree polynomial (page 139)." One possible non-parametric approach is to use artificial neural networks (ANN).

5.1.1 Artificial Neural Networks (ANN) on the WWW

An ANN learns to imitate the behavior of a complex function by being presented with examples of the inputs and outputs of the function. This operation is called training. A successfully trained ANN can predict the correct output of a function when presented with the input (or inputs, for a multivariate function). ANNs have been applied successfully in a variety of fields, including industry, mining and manufacturing, business, finance and marketing, and medicine and health; an up-to-date Web page on applications of neural networks to medicine is available at http://www.phil.gu.se/ANN/ANNworld.html.
Searching for information is, in general, a complex, fuzzy, and uncertain process. Information retrieval models have been applied to online library catalogs in the past, and these models can still be used for any online information service. The WWW is a system for the storage and retrieval of information characterized by its enormous size, hypermedia structure, and distributed architecture. Computer and information scientists are trying to model the behavior of users and to examine their successes and failures in using search engines on the WWW. While understanding users' information-seeking behavior on the Web is important, evaluating the quality of search engines is an equally important issue to tackle. WWW users would be more successful if they were made aware of how to interpret their search results when they query the WWW, and of how these results are linked directly to the search engine they use. ANNs have been used to help users negotiate their queries in an information retrieval system to better assess their subject needs. Meghabghab & Meghabghab ([82]) designed an ANN to act as an electronic information specialist capable of learning to negotiate a patron's query and translate it into a true, well-formulated statement prior to accessing an online public access catalog (OPAC). Meghabghab & Meghabghab ([99]) proposed a new neural network paradigm called the "Selection Paradigm" (SP) to solve the problem of algorithm selection in multi-version information retrieval. SP learns to choose a preferred pattern for a given situation from a set of n alternatives, based on examples of a human expert's preferences. The testing results after training are compared to the Dempster-Shafer (DS) model. DS computes the agreement between a document belief and two or more query beliefs reformulated by two or more different algorithms. Final testing results indicate that SP can produce consistent rank orderings and perform better in certain information retrieval situations than the DS model. More recently, connectionist methods such as fuzzy clustering and Kohonen Self-Organizing Maps (SOM) were also used for log data analysis in a geo-referenced information service of a digital library. There are not many examples of ANNs being utilized in the evaluation of the quality of search engines. This study is a first not only in using neural networks in the evaluation of search engines, but also in comparing their results to other simulation techniques that have been used in other applications. In the next section, we apply ANNs to approximating computer simulations of the quality of search engines.

5.1.2 Rejected World Wide Web Pages

Real-world queries, that is, queries that have been used in real-world situations, are the only way to test the WWW so as to estimate the relative coverage of search engines with respect to existing Web pages. This can be very different from the relative number of pages indexed by each search engine. Indexing and retrieval vary from one engine to another. For example, there may be different limits on the size of pages indexed by an engine, on the words that are indexed, and on the processing time per query. Google has a lexicon of 14 million words and a limit of 1 to 10 seconds per query (dominated by disk input/output over the UNIX file system). Once 40,000 matching documents are found for a given query (to put a limit on response time), sub-optimal results are returned, based not on the user's needs or query formulation but on how well the pages are indexed.
If all pages were indexed equally, information access across all Web pages would be equal. But designers tend to choose which Web pages are to be indexed: the Web page of President Bill Clinton, for example, will be indexed before the Web page of Joe Doe in Arkansas. The bias of indexing is compounded by the decision of what to crawl. The ratio of crawling to indexing is a challenge because of hardware costs and software (operating system) limitations. Crawling is important in the beginning, so that a new search engine covers as many Web pages as possible to attract users to come back, but it then tapers off as the cost of keeping the databases updated keeps increasing. Web pages are "living communities" of hyperlinks (Kumar et al. [70]). They are updated, modified, deleted, and changed constantly. The nature of Web pages complicates the evaluation of the returned set of results of search requests. To date, studies of Web search engines have mainly focused on retrieval performance and on the evaluation of search features. No studies have evaluated the quality of search engines based on the proportion of Web pages rejected and, therefore, excluded from search results. We investigate the proportion of rejected Web pages for a set of search requests that were submitted to 4 search engines: AltaVista, Google, Northern Light, and Yahoo. We applied two techniques to the resulting set of data: RBF neural networks and second-order multiple linear regression models.

5.2 ANN Meta-model Approach for the WWW

A meta-model is a model of a model. Typical simulation meta-modeling approaches involve the use of regression models in response surface methods. A few attempts have been made to employ neural networks as the meta-modeling technique. Pierreval & Huntsinger ([110]) were successful in using ANNs as meta-models of stochastic computer simulations. The common feature of these models was an ANN baseline, which involves using a backpropagation-trained, multi-layer ANN to learn the relationship between the simulation inputs and outputs. The baseline ANN meta-model approach was developed on an (s, S) inventory computer simulation and was applied to a larger application in the domain under consideration. RBF neural networks are emerging as a viable architecture for implementing neural network solutions to many problems. RBF neural networks are deterministic global non-linear minimization methods. These methods detect sub-regions not containing the global minimum and exclude them from further consideration. In general, this approach is useful for problems requiring a solution with guaranteed accuracy, but it is computationally very expensive. The mathematical basis for RBF networks is provided by Cover's Theorem (Cover [25]), which states: "A complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space." RBF uses a curve-fitting scheme for learning; that is, learning is equivalent to finding a surface in a multi-dimensional space that represents a best fit to the training data. The approach considered here is a generalized RBF neural network having a structure similar to that of Figure 5.1.

Fig. 5.1. A generalized RBF

A generalized RBF network has the following characteristics:
1. Every radial basis function φ_i at the hidden layer has a center position t_i and a width, or spread, around the center.
It connects to the output layer through a weight value w_i.
2. The number of radial basis functions at the hidden layer is M, where M is smaller than the number of input training patterns N.
3. The linear weights w_i which link an RBF node to the output layer, the positions of the centers of the radial basis functions, and the widths of the radial basis functions are all unknown parameters that have to be learned.

A supervised learning process using a Gradient Descent (GD) procedure is implemented to adapt the positions of the centers, their spreads (or widths), and the weights of the output layer. To initialize GD, the authors begin the search from a structured initial condition that limits the region of parameter space to an already known useful area, by using a standard pattern-classification method. The likelihood of converging to an undesirable local minimum in weight space is thereby reduced. A supervised learning process using an Interior Point Method (IPM) is also implemented to adapt the positions of the centers, their spreads (or widths), and the weights of the output layer, while reducing the amount of computation compared to the GD procedure. A standard Gaussian classifier is used; it assumes that each pattern in each class is drawn from a full Gaussian distribution.

5.2.1 Training RBF without Correction of Centers, Spreads and Weights (RBF without CCSW)

The initial step in training an RBF is to pursue a maximum likelihood method, which views parameters as fixed but unknown. To be more specific, suppose that we have a two-class problem (it can be generalized to the multi-class case), where the underlying distribution of the input observation vector is Gaussian. The observation, or training pattern, x is then characterized as follows:

x ∈ X1: mean vector = µ_1, covariance matrix = C
x ∈ X2: mean vector = µ_2, covariance matrix = C

where µ = E[x], C = E[(x − µ)(x − µ)^T], and E denotes the statistical expectation operator, or mean. The input-output relation of the network is defined by:

y(x) = \sum_{i=1}^{M} w_i φ_i(x) + b     (5.1)

where φ_i(x) is a radial basis function and b is the bias. A radial basis function has a center t_i and is defined as follows:

φ_i(x) = G(||x − t_i||_C),  i = 1, 2, ..., M     (5.2)

where G is a Gaussian function centered at t_i with a general weighted norm defined as follows:

||x||_C^2 = (Cx)^T(Cx) = x^T C^T C x     (5.3)

where C is a norm-weighting square matrix having the same dimension as the input vector x. Replacing (5.2) in (5.1) yields:

y(x) = \sum_{i=1}^{M} w_i G(||x − t_i||_C) + b     (5.4)

where M is the number of centers that belong to class X1. To fit the training data with the desired output, we require:

y(x_j) = d_j,  j = 1, ..., N     (5.5)

where x_j is an input vector and d_j is the desired output. Let:

g_{ji} = G(||x_j − t_i||_C)     (5.6)

where i = 1, ..., M and j = 1, ..., N. G(||x_j − t_i||_C) can be expressed as follows:

G(||x_j − t_i||_C) = exp[−(x_j − t_i)^T C^T C (x_j − t_i)]     (5.7)

Defining Σ_i^{-1} by (1/2) Σ_i^{-1} = C_i^T C_i, equation (5.7) yields:

G(||x_j − t_i||_C) = exp[−(1/2)(x_j − t_i)^T Σ_i^{-1} (x_j − t_i)]     (5.8)

Equation (5.8) represents a multivariate Gaussian distribution with mean vector t_i and covariance matrix Σ_i. In matrix form, G can be written as G = [g_{ji}], with rows g_j = [g_{j1}, g_{j2}, ..., g_{jM}, 1]^T, where j = 1, ..., N and i = 1, ..., M.
Combining equations (5.5) and (5.6), and defining w = [w_1, w_2, w_3, w_4, ..., w_M]^T, equation (5.4) can be written in matrix form as

G w = d     (5.9)

To invert (5.9) we have to use a pseudo-inverse:

w = G^+ d = (G^T G)^{-1} G^T d     (5.10)

where G^T G is a square matrix with a unique inverse of its own.

5.2.2 Training RBF with Correction of Centers, Spreads, and Weights using Gradient Descent (RBF with CCSW using GD)

For the training of the proposed network, we used the modified generalized delta rule to update the output weights as well as the RBF centers and spreads. We train the network by minimizing the sum of the squares of the errors over all output neurons. Let E(n) be the sum of the squares of the errors at time n:

E(n) = (1/2) \sum_{j=1}^{N} (e_j(n))^2     (5.11)

where e_j(n) is the error signal of the output unit at time n, defined by:

e_j(n) = d_j(n) − y(x_j) = d_j(n) − \sum_{i=1}^{M} w_i(n) G(||x_j − t_i(n)||_C)     (5.12)

The problem is to find the free parameters w_i, t_i, and Σ_i^{-1} so as to minimize E. The results of the minimization can be summarized as follows. For the linear weights of the output layer:

∂E(n)/∂w_i(n) = \sum_{j=1}^{N} e_j(n) G(||x_j − t_i||_C)     (5.13)

and:

w_i(n+1) = w_i(n) − ζ_1 ∂E(n)/∂w_i(n),  i = 1, 2, ..., M     (5.14)

For the positions of the centers at the hidden layer:

∂E(n)/∂t_i(n) = 2 w_i(n) \sum_{j=1}^{N} e_j(n) G'(||x_j − t_i(n)||_C) Σ_i^{-1} (x_j − t_i(n))     (5.15)

where G' is the derivative of G with respect to its argument, and:

t_i(n+1) = t_i(n) − ζ_2 ∂E(n)/∂t_i(n),  i = 1, 2, ..., M     (5.16)

For the spreads of the centers at the hidden layer:

∂E(n)/∂Σ_i^{-1}(n) = w_i(n) \sum_{j=1}^{N} e_j(n) G'(||x_j − t_i(n)||_C) Q_{ji}(n)     (5.17)

where Q_{ji}(n) = [x_j − t_i(n)][x_j − t_i(n)]^T, and:

Σ_i^{-1}(n+1) = Σ_i^{-1}(n) − ζ_3 ∂E(n)/∂Σ_i^{-1}(n),  i = 1, 2, ..., M     (5.18)

where ζ_1, ζ_2, and ζ_3 are different learning-rate parameters that could themselves be made adaptive.

5.2.3 Training RBF with Correction of Centers, Spreads and Weights using Interior Point Methods (RBF with CCSW using IPM)

To evaluate the promise of the IPM for RBF, we trained the centers, spreads, and weights as follows. The learning error function E can be approximated by a polynomial P(x) of degree 1 in a neighborhood of x_i:

E ≈ P(x) ≡ E(x_i) + g^T (x − x_i),  where g = ∇_x E = \sum_{j=1}^{N} ∇_x E_j     (5.19)

So the linear programming problem for a given search direction is:

minimize g^T (x − x_i)  subject to −α o ≤ x − x_i ≤ α o     (5.20)

where o is the vector of all ones with the same dimension as the state vector x and α > 0 is a constant. In this study we rely on the method applied in Meghabghab & Ji ([100]) and apply it three times, as follows:
1. Search for the direction that will minimize the error corresponding to the weights.
2. Search for the direction that will minimize the error corresponding to the centers of the neurons.
3. Search for the direction that will minimize the error corresponding to the spreads of the neurons.

The linear interior point method and its dual can be expressed as follows:

minimize g^T x          maximize b^T y
subject to x + z = b    subject to s_x = c − y ≥ 0
x, z ≥ 0                s_z = −y ≥ 0     (5.21)

where u = x − x_i + α o and b = 2α o. The vector x could represent the weights of the neurons, the centers of the neurons, or the spreads of the neurons. Thus (x, z, y, s_x, s_z) is an optimal feasible solution if and only if:

z = b − x
s_x = c − y
s_z = −y
X s_x = 0
Z s_z = 0     (5.22)

where X is a diagonal matrix whose diagonal elements are the values of the vector x, and Z is a diagonal matrix whose diagonal elements are the values of the vector z.
Given a strictly feasible point (x^k, z^k, y^k, s_x^k, s_z^k), we apply Newton's method to the perturbed nonlinear system of (5.22), resulting in the following system of equations:

δx = (X^{-1} S_x + Z^{-1} S_z)^{-1} [Z^{-1}(Z s_z − µe) − X^{-1}(X s_x − µe)]
δz = −δx
δs_x = −X^{-1}(S_x δx + X s_x − µe)
δs_z = Z^{-1}(S_z δx − Z s_z − µe)
δy = −δs_x     (5.23)

We then set:

(x^{k+1}, z^{k+1}, y^{k+1}, s_x^{k+1}, s_z^{k+1}) = (x^k, z^k, y^k, s_x^k, s_z^k) + θ_k (δx, δz, δy, δs_x, δs_z)     (5.24)

The choices of µ and θ_k determine different interior point algorithms. With a judicious choice of µ and θ_k, we define an algorithm with polynomial complexity.

5.2.4 New RBF Interior Point Method Learning Rule

In our problem an initial point is easily obtained by exploiting its special structure. We choose:

x^0 = α e,  z^0 = α e,  s_z^0 = ρ e,  y^0 = −ρ e,  s_x^0 = c − ρ e     (5.25)

where ρ = max(…) > 0. The learning rule is then:
A1. Choose a starting weight vector w^0 and a constant α > 0.
A2. For k = 0, 1, 2, ...
  A2.1 Compute E_avg(w^k) and the gradient of E_avg(w) at w^k.
  A2.2 Call Procedure 1, and let (x^*, z^*, y^*, s_x^*, s_z^*) be the solution to (5.21).
  A2.3 Update the weights by w^{k+1} = w^k + x^* − α e.

The same interior point learning rule applied to the weights can be applied to the positions of the centers of the neurons and to their spreads. Meghabghab & Nasr ([101]) applied the 3 RBF training algorithms to the following benchmark problems:

Sonar Problem: This problem discriminates between sonar signals bounced off a metallic cylinder and those bounced off a roughly cylindrical rock.
Two-Spiral Problem: This problem discriminates between two sets of training points that lie on two distinct spirals in the x-y plane. Each spiral has 94 points in both the training and the test sets.
Vowel Recognition Problem: This problem trains a network for speaker-independent recognition of the 11 steady-state vowels of English. Vowels are classified correctly when the distance of the correct output to the actual output is the smallest among the distances from the actual output to all possible target outputs.
10-Parity Problem: This problem trains a network that computes the modulo-two sum of 10 binary digits. There are 1024 training patterns and no testing set.

Figure 5.2 summarizes the results of the 3 algorithms on the four benchmark problems.

Fig. 5.2. Results of the 3 algorithms (RBF, RBF with CCSW using GD, RBF with CCSW using IPM) on the 4 benchmark problems (training and test accuracy per benchmark problem)

Meghabghab & Nasr ([102]) developed an iterative RBF meta-model approach to approximating discrete-event computer simulations in order to provide accurate meta-models of computer simulations. An RBF neural network was used to learn the relationship between the simulation inputs and outputs. The iterative RBF meta-model approach starts with small training and testing sets, and builds RBF meta-models to use in performing factor screening, eliminating those input factors that do not appear to make much difference to the simulation output. After eliminating the irrelevant factors, the baseline approach can be used on the remaining factors, with substantial savings in total computer simulation runs. The iterative RBF neural network was developed and used to evaluate the quality of 4 different search engines on a variety of queries.
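As a concrete illustration of the basic training step of Section 5.2.1 (equations (5.6)–(5.10)), the following sketch fits the output weights of a Gaussian RBF network by the pseudo-inverse, with fixed centers and a single scalar spread. This is a simplified sketch under those assumptions, not the authors' implementation: the full method uses a weighted norm per center and also adapts centers and spreads by gradient descent (Section 5.2.2) or the interior point method (Sections 5.2.3–5.2.4).

```python
import numpy as np

def gaussian_design_matrix(X, centers, spread):
    """g_ji = exp(-||x_j - t_i||^2 / (2*spread^2)), plus a column of ones for the bias b."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2.0 * spread ** 2))
    return np.hstack([G, np.ones((X.shape[0], 1))])

def train_rbf_weights(X, d, centers, spread):
    """Solve Gw = d in the least-squares sense, w = (G^T G)^{-1} G^T d (equation 5.10)."""
    G = gaussian_design_matrix(X, centers, spread)
    w, *_ = np.linalg.lstsq(G, d, rcond=None)
    return w

def predict(X, w, centers, spread):
    return gaussian_design_matrix(X, centers, spread) @ w

# Toy usage on synthetic data (illustrative only; the search-engine inputs of
# Section 5.2.5 would replace X and d here):
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 2))                       # N = 40 patterns, 2 inputs
d = np.sin(3 * X[:, 0]) + X[:, 1] ** 2                    # desired outputs
centers = X[rng.choice(len(X), size=8, replace=False)]    # M = 8 < N centers
w = train_rbf_weights(X, d, centers, spread=0.3)
print("training RMSE:", np.sqrt(np.mean((predict(X, w, centers, spread=0.3) - d) ** 2)))
```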
5.2.5 RBF Training Algorithm for Web Search Engines

This study employed RBF networks to estimate the proportion of Web pages or URLs rejected after being tested against the maximum allowable threshold on four search engines. In selecting Web search engines for this study, particular attention was paid to evaluating their retrieval performance. Only 4 search engines were evaluated: AltaVista, Google, Northern Light, and Yahoo. Nine variables were chosen as inputs for the simulation, based on studies of other search engines. These 9 variables and their denotations are:
1. Precision (X1).
2. Overlap (X2): the common portion of the Web that is indexed among all the engines selected for the study.
3. Response time (X3).
4. Coverage (X4): coverage in terms of Web pages indexed.
5. Coverage update interval (X5).
6. Boolean logic (X6).
7. Truncation (X7).
8. Multi-word search (X8).
9. Portion of the Web pages indexed (X9) (e.g., titles plus the first several lines, or the entire Web page).

The output variable is the proportion of rejected Web pages or URLs for a given query or search question. Such a query or search question is called an exemplar. An exemplar is a query format that can represent queries of the same kind regardless of the content involved in the query. An example of such an exemplar would be "violence among teenagers". That same query could be "violence among athletes", "drugs among teenagers", or "drugs among athletes", and would be considered of the same degree of fuzziness or vagueness. An exemplar is said to be unsuccessful, or returned with no answers from the search engine, if its proportion of rejected Web pages exceeds a certain threshold. The set of queries used to search the Web is important in studying the performance of the search engines on the WWW. Query selection for testing the WWW must be different from that for any well-controlled collection, because pages on the WWW are not homogeneous. The reason the authors did not use the TREC (Text REtrieval Conference, see http://trec.nist.gov) queries, for example, to test the Web is the fact that the Web is a collection of heterogeneous documents. The standard deviation between 2 randomly selected Web pages is so high that it is extremely difficult to apply the results of the TREC experiment to the Web. A wide range of users access the WWW, with different cultural backgrounds, different language skills, and different languages, which makes it hard to express the same thing with the same vocabulary in a Web page. This does not mean that the extracted meta-information cannot be homogeneous, but it is generated information about the actually existing information. ZD Labs recently conducted a study of 5 search engines, Google, Yahoo, Direct Hit, AltaVista, and Northern Light, to determine which search engine produced the most relevant results (the queries used in the study were not disclosed). AltaVista fared better than Google, Direct Hit, Northern Light, and Yahoo: it had the most relevant results for multi-word queries. Google had the best results for single-word queries (see http://cbs.marketwatch.com/archive/20000503/news/current/altavista). Given the intense competition among commercial search services, the list of all existing URLs is highly prized proprietary information. The Web-page rejection threshold adopted by different Web engine designers is even harder to determine.
What is even worse is that such a threshold varies among the different Web engines. The authors of this study will assume that this number does not change and is set around 80%. This number might seem high, but based on all the parameters considered during testing by the different search engines, it represents an accurate estimate of what constitutes the definition of when an exemplar is accepted or rejected by a given search engine. The data used was limited to 70 different exemplars, because all inputs and the output were complete for these exemplars. The numbering of these different exemplars is no secret, merely an easy way to reference the queries in question.

5.2.6 Regression Models Approach for the WWW ([87])

In this section, we followed the experimental design and optimization chapter on an inventory system developed by Law & Kelton ([72]). The mechanistic model of the simulation used Simulink (a toolbox provided by Matlab). The data used to develop the empirical approximation models consist of 2 different sets over the 9 input variables, which are denoted X1, X2, X3, X4, X5, X6, X7, X8, and X9. The first data set consists of the 2^9 combinations of the variables X1–X9 with 2 set values for each variable. The precision X1 can vary from 0.1 at the lowest to 1.0 at the highest, e.g., X1 = {0.1, 1.0}. As reported by Bharat & Broder ([6]), the overlap X2 varies from 25% at the least up to 35% at the maximum among all search engines, e.g., X2 = {0.25, 0.35}. The response time X3 varies from 1 to 5 seconds, e.g., X3 = {1, 5}. The coverage X4, at the time the queries were simulated, varied from 250 million Web pages at the least up to 512 million, even though the latest estimate shows Google now reaching up to 1 billion Web pages in its coverage, e.g., X4 = {250000000, 512000000}. The coverage update interval X5 can vary from weekly up to 1 month, e.g., X5 = {7, 30}. The Boolean logic variable X6 varies from a single logical operator (NOT, OR, or AND) up to 3 different combinations of these operators supported in a given query, e.g., X6 = {1, 3}. The truncation X7 varies from 1 truncation per word up to 5 truncations per word, e.g., X7 = {1, 5}. The multi-word search variable X8 varies from 1 word searched by one of the search engines studied up to 5 different words per search, e.g., X8 = {1, 5}. The portion of the Web page indexed, X9, varies from the title of the Web page to other sections of the Web page, including the whole page, e.g., X9 = {0.20, 1.0}. The second set consists of the 6^9 combinations of the variables X1–X9 with 6 set values for each variable. The test set of 20^9 points was constructed with all possible combinations of 20 equally spaced values of each input variable from 0 to 100. Ten replications of the computer simulation were made for each combination of X1–X9 for the test set and both training sets; the mean t_c of each set of ten provided the correct, or target, answer, as was done in ([72]).
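The 2^9 two-level training design just described can be enumerated directly from the low/high values listed for X1–X9. The sketch below is a small illustration of that enumeration; the variable names and the list-of-dicts layout are assumptions of this sketch, not notation from the book.

```python
from itertools import product

# Low/high levels of the nine input factors, as listed in the text above.
levels = {
    "X1_precision":        (0.1, 1.0),
    "X2_overlap":          (0.25, 0.35),
    "X3_response_time_s":  (1, 5),
    "X4_coverage_pages":   (250_000_000, 512_000_000),
    "X5_update_days":      (7, 30),
    "X6_boolean_logic":    (1, 3),
    "X7_truncation":       (1, 5),
    "X8_words_per_search": (1, 5),
    "X9_portion_indexed":  (0.20, 1.0),
}

# All 2^9 = 512 combinations of the two-level factors (the first training set).
design = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(design))   # 512
print(design[0])     # the all-low design point
```

Each design point would then be run through the simulation ten times, and the mean t_c over the replications taken as the target output, as described above.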
Models A and B test the ability of the approximation techniques to extrapolate beyond the 2^9 data points used to develop the approximation tool. Models C and D examine the ability of the approximation techniques to interpolate between the 6^9 data points. Two different methods of presenting the data to the approximation tool were used: one used the mean t_c value over the ten replications (Models A and C), and the other used the individual t_c values from each replication (Models B and D). These are summarized in Table 5.1.

Table 5.1. Data Presentation Methods Used
  Model label   Data set      Target data presentation method
  A             2^9 points    Mean
  B             2^9 points    Individual replications
  C             6^9 points    Mean
  D             6^9 points    Individual replications

Models A and B used full, first-order multiple linear regression models. Models C and D used full, second-order multiple linear regression models. These regression models were selected because they are typically used in response surface methods ([72]). The regression models obtained in each case are shown in Table 5.2. The F test for regression was significant at α = 0.01 for regression models B, C, and D. Note that regression models A and B are the same model, and that C and D are the same model; this is just a manifestation of the regression attempting to estimate the expected value of the output for any particular input value. The benefit of including all of the data in regression models B and D is that lack-of-fit tests can be performed, although this was not done here.

Table 5.2. Regression Models (PR = percentage rejected)
  Model label   Regression equation                               P-value of statistic   R^2
  A             PR = \sum_{i=1}^{9} A_i X_i                       0.45                   0.78
  B             PR = \sum_{i=1}^{9} A_i X_i                       4.3E−12                0.74
  C             PR = \sum_{i=1}^{9} A_i X_i + \sum_{i=1, (i …     …                      …
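A first-order meta-model of the form PR = \sum_i A_i X_i (Models A and B) is an ordinary least-squares fit. The sketch below illustrates such a fit in Python; since the simulated rejection rates are not reproduced in the text, random placeholder responses are used purely as a stand-in, and the design-matrix construction is an assumption of the sketch.

```python
import numpy as np

def fit_first_order(X, pr):
    """Least-squares coefficients A_i for the meta-model PR ~ sum_i A_i * X_i.
    X: (n_runs, 9) matrix of input settings; pr: (n_runs,) proportions rejected."""
    coeffs, *_ = np.linalg.lstsq(X, pr, rcond=None)
    return coeffs

# 2^9 = 512 two-level design points built from the low/high values of X1-X9.
lows  = np.array([0.1, 0.25, 1, 250e6,  7, 1, 1, 1, 0.20])
highs = np.array([1.0, 0.35, 5, 512e6, 30, 3, 5, 5, 1.00])
design = np.array([[highs[i] if (row >> i) & 1 else lows[i] for i in range(9)]
                   for row in range(512)])

rng = np.random.default_rng(1)
pr = rng.uniform(0.0, 1.0, size=512)   # placeholder for the simulated rejection rates
A = fit_first_order(design, pr)
print("fitted coefficients A_1..A_9:", np.round(A, 4))
```

A second-order model (Models C and D) would add squared and cross-product columns to the design matrix before the same least-squares step.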
