Topology-Based Content Clustering for Web Scraping with Python (Requests + BeautifulSoup)

Stop writing custom web scrapers for every single site. 🛑
One of the biggest headaches in web scraping is maintaining selectors. The moment a site updates its CSS, your script breaks.
I’ve been experimenting with a "Repeated Topology" approach: instead of looking for specific IDs or classes, the script looks for structural patterns.
How it works:
- Noise Reduction: Strips out headers, footers, and scripts.
- Topology Mapping: It scans for containers whose children share the same HTML signature (e.g., a list of news cards or product tiles); see the sketch after this list.
- Automated Extraction: It pulls text, links, and images from those clusters automatically.
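Here’s a minimal sketch of the "signature" idea on a toy fragment. The sample HTML and variable names are made up for illustration; the signature expression is the same one used in the full script below.

from bs4 import BeautifulSoup

# Toy fragment (illustration only): three children share a signature, one differs
html = """
<div>
  <article class="card"><h2>A</h2></article>
  <article class="card"><h2>B</h2></article>
  <article class="card"><h2>C</h2></article>
  <aside class="ad">Sponsored</aside>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
children = soup.div.find_all(recursive=False)

# Signature = tag name + class list, exactly as in the full script below
signatures = [f"{c.name}.{'.'.join(c.get('class', []))}" for c in children]
print(signatures)            # ['article.card', 'article.card', 'article.card', 'aside.ad']
print(len(set(signatures)))  # 2 distinct signatures -> passes the "<= 2" heuristic

Because the three cards collapse to a single signature, the container qualifies as a cluster even though the sponsored block sneaks in.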
🤗 It’s not just a scraper; it’s a way to find the "heart" of a webpage without being told where it is. 🧠
Check out the code below! 👇
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def detailed_cluster_scrape(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    base_url = url

    # 1. Clean up noise
    for noise in soup(['script', 'style', 'nav', 'footer', 'header', 'svg', 'noscript']):
        noise.decompose()

    clusters = []

    # 2. Find "Repeated Topology" (the core of the technique)
    # We look for containers where multiple children share the same class or structure
    for container in soup.find_all(['div', 'ul', 'section', 'ol', 'p']):
        children = container.find_all(recursive=False)
        if len(children) < 3:
            continue  # Ignore small sections

        # Analyze child similarity (topology)
        child_signatures = [f"{c.name}.{'.'.join(c.get('class', []))}" for c in children]

        # If the majority of children share the same tag/class signature, it's a cluster
        if len(set(child_signatures)) <= 2:  # Heuristic for high similarity
            item_blocks = []

            for child in children:
                # Extract detailed data from each block
                block_data = {
                    "text": " ".join(child.get_text(separator=" ", strip=True).split()),
                    "links": [urljoin(base_url, a['href']) for a in child.find_all('a', href=True)],
                    "images": [urljoin(base_url, img['src']) for img in child.find_all('img', src=True)],
                    "headings": [h.get_text(strip=True) for h in child.find_all(['h1', 'h2', 'h3', 'h4'])]
                }

                # Only add if there is actual content
                if block_data["text"]:
                    item_blocks.append(block_data)

            if item_blocks:
                clusters.append({
                    "container": f"{container.name}.{'.'.join(container.get('class', []))}",
                    "items": item_blocks
                })

    # 3. Token saving: sort by item count so the largest cluster
    # (usually the main content) comes first
    clusters.sort(key=lambda x: len(x['items']), reverse=True)
    return clusters
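To try it out, a minimal driver might look like the following. The URL is a placeholder; point it at any listing-style page you’re allowed to scrape.

if __name__ == "__main__":
    # Placeholder URL for illustration only
    clusters = detailed_cluster_scrape("https://example.com/news")

    for cluster in clusters:
        print(f"\nContainer: {cluster['container']} ({len(cluster['items'])} items)")
        for item in cluster['items'][:3]:  # Preview the first few items per cluster
            print(" -", item["text"][:80])
            if item["links"]:
                print("   first link:", item["links"][0])

Since the clusters come back sorted by item count, the first one is usually the main content, and the stragglers (sidebars, related-links widgets) trail behind.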