Scrapy: Web Scraping at Scale with Python
by Shriira Press
A comprehensive, self-contained guide to Scrapy, the Python framework for web scraping and crawling — extracting structured data from websites, at scale, reliably. Where a quick script with requests + BeautifulSoup scrapes one page, Scrapy is a full framework for crawling thousands or millions of pages: an asynchronous engine, spiders that define what to crawl and extract, selectors (CSS and XPath) for pulling data out of HTML, item pipelines for cleaning and storing it, and the middleware, politeness, and deployment machinery real scraping needs. This book teaches it end to end — spiders, selectors, following links, items and pipelines, the architecture, middlewares, robustness, ethics and legality, and deployment — blending intuition, the concepts behind the framework, and runnable code.
Contents
- Preface
- Chapter 1 — What Is Scrapy?
- Chapter 2 — How the Web Works for Scraping: HTTP, HTML, and the DOM
- Chapter 3 — Your First Spider
- Chapter 4 — Selectors: CSS and XPath
- Chapter 5 — Spiders in Depth: Requests, Responses, and Callbacks
- Chapter 6 — Following Links and Crawling
- Chapter 7 — Items, Item Loaders, and Structured Data
- Chapter 8 — Item Pipelines: Processing and Storing Data
- Chapter 9 — The Scrapy Architecture: Engine, Scheduler, Downloader
- Chapter 10 — Middlewares: Customizing Requests and Responses
- Chapter 11 — Robustness: Politeness, Anti-Bot, and Dynamic Content
- Chapter 12 — The Ethics and Legality of Web Scraping
- Chapter 13 — Deployment, Scaling, and the Profession
- Appendix A — Glossary and Quick Reference
- Appendix B — Further Reading and Resources
