Topology-Based Content Clustering for Web Scraping with Python (Requests + BeautifulSoup)

Stop writing custom web scrapers for every single site. 🛑

One of the biggest headaches in web scraping is maintaining selectors. The moment a site updates its CSS, your script breaks.

I’ve been experimenting with a "Repeated Topology" approach. Instead of looking for specific IDs or classes, this script looks for structural patterns.

How it works:

  1. Noise Reduction: Strips out headers, footers, and scripts.
  2. Topology Mapping: Scans for containers whose children share the same HTML signature (e.g., a list of news cards or product tiles); see the sketch after this list.
  3. Automated Extraction: Pulls text, links, and images from those clusters automatically.
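
To make the "signature" idea concrete, here's a tiny standalone sketch (the HTML is made up for illustration, but the signature format matches the full script below). Three sibling <li class="card"> elements collapse to a single unique signature, which is exactly what flags their parent <ul> as a cluster:

from bs4 import BeautifulSoup

html = "<ul><li class='card'>A</li><li class='card'>B</li><li class='card'>C</li></ul>"
soup = BeautifulSoup(html, "html.parser")
children = soup.ul.find_all(recursive=False)

# Signature = tag name plus its classes, e.g. "li.card"
signatures = [f"{c.name}.{'.'.join(c.get('class', []))}" for c in children]
print(set(signatures))  # {'li.card'} -> one unique signature, so this <ul> is a cluster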

🤗 It’s not just a scraper; it’s a way to find the "heart" of a webpage without being told where it is. 🧠

Check out the code below! 👇

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def detailed_cluster_scrape(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Fail fast on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    base_url = url

    # 1. Cleanup Noise: drop non-content tags before scanning
    for noise in soup(['script', 'style', 'nav', 'footer', 'header', 'svg', 'noscript']):
        noise.decompose()

    clusters = []

    # 2. Find "Repeated Topology" (the core of the technique):
    #    containers where multiple children share the same class or structure
    for container in soup.find_all(['div', 'ul', 'section', 'ol', 'p']):
        children = container.find_all(recursive=False)
        if len(children) < 3:
            continue  # Ignore small sections

        # Analyze child similarity (topology): tag name plus joined class list
        child_signatures = [f"{c.name}.{'.'.join(c.get('class', []))}" for c in children]

        # If nearly all children share the same tag/class signature, it's a cluster
        if len(set(child_signatures)) <= 2:  # Loose heuristic for high similarity
            item_blocks = []
            for child in children:
                # Extract detailed data from each block
                block_data = {
                    "text": " ".join(child.get_text(separator=" ", strip=True).split()),
                    "links": [urljoin(base_url, a['href']) for a in child.find_all('a', href=True)],
                    "images": [urljoin(base_url, img['src']) for img in child.find_all('img', src=True)],
                    "headings": [h.get_text(strip=True) for h in child.find_all(['h1', 'h2', 'h3', 'h4'])],
                }
                # Only keep blocks with actual text content
                if block_data["text"]:
                    item_blocks.append(block_data)

            if item_blocks:
                clusters.append({
                    "container": f"{container.name}.{'.'.join(container.get('class', []))}",
                    "items": item_blocks,
                })

    # 3. Token Saving: rank clusters by item count so the largest one
    #    (usually the main content) comes first; consumers can keep just the top cluster
    clusters.sort(key=lambda x: len(x['items']), reverse=True)
    return clusters
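
A note on the threshold: len(set(child_signatures)) <= 2 is a loose heuristic. It tolerates one odd child per container (an ad slot, a "load more" tile); tighten it to == 1 if you want stricter matching.

And a quick usage sketch, assuming the function above is in scope (the URL is just a placeholder; any page with repeated cards or tiles works):

if __name__ == "__main__":
    # Placeholder URL: substitute any page with repeated cards or tiles
    clusters = detailed_cluster_scrape("https://example.com")
    for cluster in clusters[:1]:  # The first cluster is usually the main content
        print("Container:", cluster["container"])
        for item in cluster["items"][:5]:
            print("-", item["text"][:80])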