Chapter 1: Introduction to Web Scraping 7
When is web scraping useful? 7 Is web scraping legal? 8 Python 3 9 Background research 10
Checking robots.txt 10 Examining the Sitemap 11 Estimating the size of a website 11 Identifying the technology used by a website 13 Finding the owner of a website 16
Crawling your first website 17 Scraping versus crawling 17 Downloading a web page 18
Retrying downloads 19
Setting a user agent 20 Sitemap crawler 21 ID iteration crawler 22 Link crawlers 25
Advanced features 28 Parsing robots.txt 28 Supporting proxies 29 Throttling downloads 30 Avoiding spider traps 31 Final version 32
Using the requests library 33 Summary 34
Chapter 2: Scraping the Data 35
Analyzing a web page 36 Three approaches to scrape a web page 39 Regular expressions 39 Beautiful Soup 41 Lxml 44 CSS selectors and your Browser Console 45 XPath Selectors 48
LXML and Family Trees 51 Comparing performance 52 Scraping results 53
Overview of Scraping 55
Adding a scrape callback to the link crawler 56 Summary 59
Chapter 3: Caching Downloads 60
When to use caching? 60 Adding cache support to the link crawler 61 Disk Cache 63
Implementing DiskCache 65 Testing the cache 67 Saving disk space 68 Expiring stale data 69 Drawbacks of DiskCache 70
Key-value storage cache 71 What is key-value storage? 72 Installing Redis 72 Overview of Redis 73 Redis cache implementation 75 Compression 76 Testing the cache 77 Exploring requests-cache 78
Summary 80
Chapter 4: Concurrent Downloading 81
One million web pages 81 Parsing the Alexa list 82 Sequential crawler 83 Threaded crawler 85 How threads and processes work 85 Implementing a multithreaded crawler 86 Multiprocessing crawler 88 Performance 92 Python multiprocessing and the GIL 93 Summary 94
Chapter 5: Dynamic Content 95
An example dynamic web page 95
Reverse engineering a dynamic web page 98
Edge cases 102 Rendering a dynamic web page 104
PyQt or PySide 105 Debugging with Qt 105 Executing JavaScript 106 Website interaction with WebKit 107 Waiting for results 110 The Render class 111 Selenium 113 Selenium and Headless Browsers 115 Summary 117
Chapter 6: Interacting with Forms 119
The Login form 120 Loading cookies from the web browser 124 Extending the login script to update content 128 Automating forms with Selenium 132 Summary 135
Chapter 7: Solving CAPTCHA 136
Registering an account 137 Loading the CAPTCHA image 138 Optical character recognition 140 Further improvements 143 Solving complex CAPTCHAs 144 Using a CAPTCHA solving service 144 Getting started with 9kw 144 The 9kw CAPTCHA API 145 Reporting errors 150 Integrating with registration 151 CAPTCHAs and machine learning 152 Summary 153
Chapter 8: Scrapy 154
Installing Scrapy 154 Starting a project 155 Defining a model 156 Creating a spider 157
Tuning settings 158
Testing the spider 159 Different Spider Types 161 Scraping with the shell command 162
Checking results 164 Interrupting and resuming a crawl 166 Scrapy Performance Tuning 168 Visual scraping with Portia 168 Installation 169 Annotation 171 Running the Spider 176 Checking results 177 Automated scraping with Scrapely 178
Summary 179
Chapter 9: Putting It All Together 180
Google search engine 180 Facebook 185 The website 186 Facebook API 188 Gap 190 BMW 194 Summary 198
Index 199