
Adaptive Web Scraping with Topology Clustering in Python
Stop fixing broken CSS selectors. Use Python & BeautifulSoup to scrape sites by identifying structural patterns and "repeated topology" automatically.
Stop Babysitting Your Web Scrapers: Let Structure Do the Work
If you've spent any real time writing web scrapers, you've lived through what I call the "Selector Cycle of Death." You spend an hour carefully tuning CSS selectors until everything extracts perfectly. Then the site ships a minor frontend update — a renamed class, a restructured wrapper div — and your script quietly stops working. No error, no warning, just empty results.
The root problem is how most scrapers are designed: we tell them exactly where to look instead of what to look for. Hard-coded selectors are brittle by nature, because they assume a page's structure will stay frozen in time. It rarely does.
A Different Approach: Repeated Topology
Lately I've been experimenting with what I'm calling a "Repeated Topology" approach. Instead of targeting specific IDs or class names, the scraper looks for structural patterns that tend to show up regardless of how a site is styled.
Here's the insight: whether you're looking at a news homepage, a real estate listings page, or an e-commerce category, the "heart" of the page is usually a list of similar-looking items: articles, listings, or products repeated one after another. The visual design changes from site to site, but the underlying structure, a container with several children that all look alike, tends to stay consistent.
The Strategy: Topology Mapping
The approach breaks down into three steps.
First, kill the noise. Headers, footers, navigation bars, scripts, and other boilerplate get stripped out early. None of that is the "main dish," so it gets removed before any analysis happens.
Second, find the "echo." The script scans through the remaining HTML looking for containers whose children share a structural signature — for example, five div elements in a row that all use the same tag and class combination. That repetition is a strong signal: it's very likely a product grid, article list, or news feed, even without knowing anything about the site's naming conventions.
Third, auto-harvest. Once a repeating cluster is found, the script pulls out text, links, headings, and images from each item automatically. It needs no site-specific selectors, only generic tags such as a, img, and h1 through h4.
The result isn't a scraper tied to one site's markup — it's a more general way of locating the "pulse" of a page based on its shape rather than its labels.
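To make the "structural signature" idea concrete, here is a minimal sketch on a hand-written HTML fragment. The markup, the class names, and the helper name child_signatures are invented for illustration, and the signature format (tag name plus dot-joined classes) is one reasonable choice rather than the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a product grid with three cards.
SAMPLE_HTML = """
<div class="grid">
  <div class="card promo"><h2>Alpha</h2></div>
  <div class="card promo"><h2>Beta</h2></div>
  <div class="card promo"><h2>Gamma</h2></div>
</div>
"""


def child_signatures(container):
    """Return a 'tag.class1.class2' signature for each direct child element."""
    signatures = []
    for child in container.find_all(recursive=False):
        classes = child.get("class") or []
        signatures.append(".".join([child.name, *classes]))
    return signatures


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
grid = soup.find("div", class_="grid")
signatures = child_signatures(grid)

print(signatures)            # ['div.card.promo', 'div.card.promo', 'div.card.promo']
print(len(set(signatures)))  # 1
```

A single distinct signature across three or more siblings is the kind of "echo" the second step looks for.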
A Working Implementation
Here's a basic Python implementation using requests and BeautifulSoup that demonstrates the idea:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def detailed_cluster_scrape(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    # A timeout avoids hanging forever; raise_for_status surfaces HTTP errors
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Use the final URL (after redirects) to resolve relative links
    base_url = response.url

    # 1. Cleanup Noise: Strip the fluff
    for noise in soup(['script', 'style', 'nav', 'footer', 'header', 'svg', 'noscript']):
        noise.decompose()

    clusters = []

    # 2. Topology Mapping: Find repeated patterns
    for container in soup.find_all(['div', 'ul', 'section', 'ol']):
        # Direct children only; nested descendants are visited as containers later
        children = container.find_all(recursive=False)
        if len(children) < 3:
            continue

        # Generate a "signature" for each child based on tag and class
        child_signatures = [f"{c.name}.{'.'.join(c.get('class') or [])}" for c in children]

        # Heuristic: at most two distinct signatures among the children
        # suggests a content cluster
        if len(set(child_signatures)) <= 2:
            item_blocks = []
            for child in children:
                block_data = {
                    "text": " ".join(child.get_text(separator=" ", strip=True).split()),
                    "links": [urljoin(base_url, a['href']) for a in child.find_all('a', href=True)],
                    "images": [urljoin(base_url, img['src']) for img in child.find_all('img', src=True)],
                    "headings": [h.get_text(strip=True) for h in child.find_all(['h1', 'h2', 'h3', 'h4'])]
                }
                # Skip items with no visible text
                if block_data["text"]:
                    item_blocks.append(block_data)

            if item_blocks:
                clusters.append({
                    "container": f"{container.name}.{'.'.join(container.get('class') or [])}",
                    "items": item_blocks
                })

    # 3. Sort so the largest cluster (usually the main feed) comes first
    clusters.sort(key=lambda x: len(x['items']), reverse=True)
    return clusters
Why This Holds Up Better Over Time
The biggest advantage of this approach is resilience. A traditional scraper breaks the moment a site renames a class or restructures a wrapper element, because it's looking for that exact name. A topology-based scraper doesn't depend on any particular name; it only needs sibling items to share the same tag and classes, whatever those happen to be. As long as a site keeps presenting its main content as a list of similarly structured items, which is common on content-driven sites, the scraper is likely to keep working through cosmetic redesigns.
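A small experiment illustrates the claim. The sketch below runs the same repetition check over two invented versions of a page, before and after a hypothetical redesign that renames every class and changes the list tags. The helper largest_repeated_group and both HTML fragments are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Two hypothetical versions of the same listing page. The redesign renames
# every class and swaps the <ul>/<li> markup for <section>/<div>.
BEFORE = """
<ul class="article-list">
  <li class="article"><a href="/a">First</a></li>
  <li class="article"><a href="/b">Second</a></li>
  <li class="article"><a href="/c">Third</a></li>
</ul>
"""
AFTER = """
<section class="feed-v2">
  <div class="feed-v2__item"><a href="/a">First</a></div>
  <div class="feed-v2__item"><a href="/b">Second</a></div>
  <div class="feed-v2__item"><a href="/c">Third</a></div>
</section>
"""


def largest_repeated_group(html, min_children=3, max_distinct=2):
    """Return the texts of the biggest group of look-alike sibling elements."""
    soup = BeautifulSoup(html, "html.parser")
    best = []
    for container in soup.find_all(["div", "ul", "ol", "section"]):
        children = container.find_all(recursive=False)
        if len(children) < min_children:
            continue
        signatures = {
            ".".join([child.name, *(child.get("class") or [])]) for child in children
        }
        if len(signatures) <= max_distinct and len(children) > len(best):
            best = children
    return [child.get_text(strip=True) for child in best]


print(largest_repeated_group(BEFORE))  # ['First', 'Second', 'Third']
print(largest_repeated_group(AFTER))   # ['First', 'Second', 'Third']
```

A hard-coded selector such as "ul.article-list > li.article" would match nothing in the second version, while the repetition check finds the same three items in both.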
It's also a useful starting point for exploring an unfamiliar site. Instead of opening dev tools and manually hunting for the right selectors, you can run this kind of cluster detection first to quickly surface where the "real" content lives on the page, then refine from there if you need more specific fields.
Where It Falls Short
This isn't a silver bullet. A few caveats worth keeping in mind:
The heuristic can produce false positives on pages with other repeated structures, like sidebars full of similar-looking widgets or ad blocks. Sorting by cluster size helps, but it's not foolproof, and you may need additional filtering depending on the site.
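One possible way to reduce false positives is to rank clusters by how much content their items carry instead of by item count alone. The sketch below assumes clusters shaped like the output of detailed_cluster_scrape (a list of dicts with an "items" list whose entries have "text" and "links"). The scoring rule, the min_avg_chars threshold, and the function names are illustrative guesses that would need tuning per site.

```python
def cluster_score(cluster):
    """Score a cluster by total text length, counting only items that link somewhere."""
    return sum(len(item["text"]) for item in cluster["items"] if item["links"])


def rank_clusters(clusters, min_avg_chars=20):
    """Drop clusters whose items are mostly tiny, then sort by content score."""
    kept = []
    for cluster in clusters:
        items = cluster["items"]
        if not items:
            continue
        avg_chars = sum(len(item["text"]) for item in items) / len(items)
        if avg_chars >= min_avg_chars:
            kept.append(cluster)
    return sorted(kept, key=cluster_score, reverse=True)


# Hypothetical input: a tag-cloud widget with many short items and a
# smaller feed whose items contain real text.
clusters = [
    {
        "container": "ul.tags",
        "items": [{"text": "python", "links": ["/t/python"]}] * 8,
    },
    {
        "container": "div.feed",
        "items": [
            {"text": "Adaptive scraping with repeated structure, part one", "links": ["/p/1"]},
            {"text": "Adaptive scraping with repeated structure, part two", "links": ["/p/2"]},
            {"text": "Adaptive scraping with repeated structure, part three", "links": ["/p/3"]},
        ],
    },
]

ranked = rank_clusters(clusters)
print([c["container"] for c in ranked])  # ['div.feed']
```

Sorting by item count alone would have put the eight-item tag widget first; filtering on average text length removes it here, though a different site may need a different signal.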
It also won't handle JavaScript-rendered content out of the box. Sites that build their feeds client-side will need a headless browser (like Playwright or Selenium) to render the page before this kind of analysis can run on the resulting HTML.
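For client-rendered pages, one option is to let a headless browser produce the HTML and then hand that string to BeautifulSoup. The sketch below uses Playwright's synchronous API and assumes Playwright and its Chromium build are installed (pip install playwright, then playwright install chromium). The function name render_html, the "networkidle" wait strategy, and the timeout value are choices made for this example; some sites may need a wait on a specific element instead.

```python
def render_html(url, timeout_ms=30_000):
    """Load a page in headless Chromium and return the rendered HTML."""
    # Imported lazily so that Playwright stays an optional dependency.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so client-side feeds can load.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned string can then be parsed with BeautifulSoup(html, "html.parser") in place of response.text, and the cluster analysis runs on it unchanged.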
And as always, scraping should be done responsibly — check a site's robots.txt and terms of service, avoid hammering servers with rapid requests, and respect any rate limits or access restrictions that are in place.
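As a rough sketch of what "responsibly" can look like in code, the example below checks URLs against robots.txt rules with the standard library's urllib.robotparser and pauses between requests. The helper names, the one-second default delay, and the sample robots.txt text are assumptions for illustration. In real use you would more likely call RobotFileParser.set_url() and read() to fetch the site's actual file.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt, user_agent="*"):
    """Build a can_fetch(url) function from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


def polite_urls(urls, can_fetch, delay_seconds=1.0, sleep=time.sleep):
    """Yield only allowed URLs, pausing between consecutive ones."""
    first = True
    for url in urls:
        if not can_fetch(url):
            continue
        if not first:
            sleep(delay_seconds)
        first = False
        yield url


# Hypothetical robots.txt content and URLs, for illustration only.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""

can_fetch = make_robot_checker(ROBOTS_TXT)
urls = ["https://example.com/news", "https://example.com/private/admin"]
print(list(polite_urls(urls, can_fetch, delay_seconds=0)))
# ['https://example.com/news']
```

Passing sleep as a parameter is only a convenience for testing; the default uses time.sleep.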
Final Thoughts
The "Selector Cycle of Death" is one of the most frustrating parts of maintaining scrapers, and most of that frustration comes from designing scrapers that depend on details a site owner never promised to keep stable. Shifting the question from "where is this data?" to "what does this data look like structurally?" produces something noticeably more durable — not perfect, but a lot less likely to need a fix every time a site gets a facelift.