Beautiful Soup: Parsing HTML and XML with Python
by Shriira Press
A comprehensive, self-contained guide to Beautiful Soup (the beautifulsoup4 package, imported as bs4), one of Python's friendliest libraries for parsing HTML and XML and pulling data out of them. Where Scrapy is a full framework for crawling thousands of pages, Beautiful Soup is a focused library for the more common case: you already have some HTML (a page you fetched, a file, an API response) and you need to navigate it, search it, and extract the data, even when the markup is messy or broken. This book covers the library end to end: the parse tree, parsers, the four object types, navigating and searching, CSS selectors, extracting and modifying, real-world patterns, robustness, and where it fits in the ecosystem. Throughout, it blends intuition, the concepts behind the library, and runnable code.
Contents
- Preface
- Chapter 1 — What Is Beautiful Soup?
- Chapter 2 — HTML, the Parse Tree, and Parsers
- Chapter 3 — Making Soup: Your First Parse
- Chapter 4 — The Four Objects: Tag, NavigableString, BeautifulSoup, Comment
- Chapter 5 — Navigating the Tree
- Chapter 6 — Searching the Tree: find and find_all
- Chapter 7 — CSS Selectors with select()
- Chapter 8 — Extracting Text and Attributes
- Chapter 9 — Modifying the Tree
- Chapter 10 — Real-World Parsing Patterns
- Chapter 11 — Robustness and Common Pitfalls
- Chapter 12 — Beautiful Soup vs. the Ecosystem
- Chapter 13 — In Practice and the Profession
- Appendix A — Glossary and Quick Reference
- Appendix B — Further Reading and Resources
