
Python.Web.Scraping.2nd.Edition.2017.5.pdf

  • Rating: 1 star
  • Date: 2018-04-01
  • Size: 14.78 MB
  • Points required: 0
  • Downloads: 39
Tags: python, scraping

Chapter 1: Introduction to Web Scraping

  • When is web scraping useful?
  • Is web scraping legal?
  • Python 3
  • Background research
  • Checking robots.txt
  • Examining the Sitemap
  • Estimating the size of a website
  • Identifying the technology used by a website
  • Finding the owner of a website
  • Crawling your first website
  • Scraping versus crawling
  • Downloading a web page
  • Retrying downloads
  • Setting a user agent
  • Sitemap crawler
  • ID iteration crawler
  • Link crawlers
  • Advanced features
  • Parsing robots.txt
  • Supporting proxies
  • Throttling downloads
  • Avoiding spider traps
  • Final version
  • Using the requests library
  • Summary

Chapter 2: Scraping the Data

  • Analyzing a web page
  • Three approaches to scrape a web page
  • Regular expressions
  • Beautiful Soup
  • Lxml
  • CSS selectors and your Browser Console
  • XPath Selectors
  • LXML and Family Trees
  • Comparing performance
  • Scraping results
  • Overview of Scraping
  • Adding a scrape callback to the link crawler
  • Summary

Chapter 3: Caching Downloads

  • When to use caching?
  • Adding cache support to the link crawler
  • Disk Cache
  • Implementing DiskCache
  • Testing the cache
  • Saving disk space
  • Expiring stale data
  • Drawbacks of DiskCache
  • Key-value storage cache
  • What is key-value storage?
  • Installing Redis
  • Overview of Redis
  • Redis cache implementation
  • Compression
  • Testing the cache
  • Exploring requests-cache
  • Summary

Chapter 4: Concurrent Downloading

  • One million web pages
  • Parsing the Alexa list
  • Sequential crawler
  • Threaded crawler
  • How threads and processes work
  • Implementing a multithreaded crawler
  • Multiprocessing crawler
  • Performance
  • Python multiprocessing and the GIL
  • Summary

Chapter 5: Dynamic Content

  • An example dynamic web page
  • Reverse engineering a dynamic web page
  • Edge cases
  • Rendering a dynamic web page
  • PyQt or PySide
  • Debugging with Qt
  • Executing JavaScript
  • Website interaction with WebKit
  • Waiting for results
  • The Render class
  • Selenium
  • Selenium and Headless Browsers
  • Summary

Chapter 6: Interacting with Forms

  • The Login form
  • Loading cookies from the web browser
  • Extending the login script to update content
  • Automating forms with Selenium
  • Summary

Chapter 7: Solving CAPTCHA

  • Registering an account
  • Loading the CAPTCHA image
  • Optical character recognition
  • Further improvements
  • Solving complex CAPTCHAs
  • Using a CAPTCHA solving service
  • Getting started with 9kw
  • The 9kw CAPTCHA API
  • Reporting errors
  • Integrating with registration
  • CAPTCHAs and machine learning
  • Summary

Chapter 8: Scrapy

  • Installing Scrapy
  • Starting a project
  • Defining a model
  • Creating a spider
  • Tuning settings
  • Testing the spider
  • Different Spider Types
  • Scraping with the shell command
  • Checking results
  • Interrupting and resuming a crawl
  • Scrapy Performance Tuning
  • Visual scraping with Portia
  • Installation
  • Annotation
  • Running the Spider
  • Checking results
  • Automated scraping with Scrapely
  • Summary

Chapter 9: Putting It All Together

  • Google search engine
  • Facebook
  • The website
  • Facebook API
  • Gap
  • BMW
  • Summary

Index

Document excerpt


Python Web Scraping
Second Edition

Fetching data from the web

Katharine Jarmul
Richard Lawson

BIRMINGHAM - MUMBAI

Python Web Scraping, Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2015
Second edition: May 2017
Production reference: 1240517

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78646-258-9
www.packtpub.com

Credits

  • Authors: Katharine Jarmul, Richard Lawson
  • Reviewers: Dimitrios Kouzis-Loukas, Lazar Telebak
  • Commissioning Editor: Veena Pagare
  • Acquisition Editor: Varsha Shetty
  • Content Development Editor: Cheryl Dsa
  • Technical Editor: Danish Shaikh
  • Copy Editor: Manisha Sinha
  • Project Coordinator: Nidhi Joshi
  • Proofreader: Safis Editing
  • Indexer: Francy Puthiry
  • Production Coordinator: Shantanu Zagade

About the Authors

Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups who use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam) or on her blog: https://blog.kjamistan.com.

Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he built a business specializing in web scraping while travelling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational in Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones. You can find him on LinkedIn at https://www.linkedin.com/in/richardpenman.

