Scrapy: Web Scraping at Scale with Python

Shriira Press

Preface

Welcome to Scrapy: Web Scraping at Scale with Python.

This book is a comprehensive, self-contained guide to Scrapy, the Python framework for web scraping and crawling: extracting structured data from websites, at scale, reliably. Where a quick script with requests and BeautifulSoup is enough to scrape a single page, Scrapy is a full framework for crawling thousands or millions of pages. It provides an asynchronous engine, spiders that define what to crawl and extract, selectors (CSS and XPath) for pulling data out of HTML, item pipelines for cleaning and storing that data, and the middleware, politeness, and deployment machinery that real scraping needs. The book teaches Scrapy end to end (spiders, selectors, following links, items and pipelines, the architecture, middlewares, robustness, ethics and legality, and deployment), blending intuition, the concepts behind the framework, and runnable code.

This title is part of the Shriira library and is free to read in full, right here. It is our small contribution to making world-class knowledge easy to reach.

A note on reading it: open the Contents menu at the top of the reader to jump between chapters, and use the Aa menu to set a comfortable text size, a theme (light, sepia, or night), and a single- or two-page layout. Your place is saved automatically, so you can always pick up where you left off.

We hope it serves you well.

— Shriira Press

Contents

  1. Chapter 1 — What Is Scrapy?
  2. Chapter 2 — How the Web Works for Scraping: HTTP, HTML, and the DOM
  3. Chapter 3 — Your First Spider
  4. Chapter 4 — Selectors: CSS and XPath
  5. Chapter 5 — Spiders in Depth: Requests, Responses, and Callbacks
  6. Chapter 6 — Following Links and Crawling
  7. Chapter 7 — Items, Item Loaders, and Structured Data
  8. Chapter 8 — Item Pipelines: Processing and Storing Data
  9. Chapter 9 — The Scrapy Architecture: Engine, Scheduler, Downloader
  10. Chapter 10 — Middlewares: Customizing Requests and Responses
  11. Chapter 11 — Robustness: Politeness, Anti-Bot, and Dynamic Content
  12. Chapter 12 — The Ethics and Legality of Web Scraping
  13. Chapter 13 — Deployment, Scaling, and the Profession
  14. Appendix A — Glossary and Quick Reference
  15. Appendix B — Further Reading and Resources