DatafetchPro
    Jun 18, 2026 · 5 min read · 5 views

    Building an Aggressive Amazon Search Results Scraper

    A deep dive into a production-ready Amazon scraper that extracts every field from search result pages — ASINs, prices, ratings, delivery info, images, and more.

    Building an Aggressive Amazon Search Results Scraper with Python, Selenium & BeautifulSoup

    Amazon is one of the richest sources of product data on the internet — prices, ratings, review counts, brand names, delivery timelines, and purchase trends are all surfaced directly on search result pages. Whether you're building a price comparison tool, doing market research, or training a product recommendation model, being able to extract this data programmatically is a genuine competitive advantage.

    In this post, I'll walk through a production-ready Amazon scraper that takes an "aggressive" approach: pulling every extractable field from a search result page, handling pagination automatically, getting past basic bot detection, and writing clean JSON and CSV files ready for analysis.





    What This Scraper Extracts

    Before diving into the architecture, let's talk about what actually gets captured. Most Amazon scrapers stop at title and price. This one doesn't.

    For each product card on a search results page, the scraper extracts:

    • ASIN — Amazon's unique product identifier, the key to every downstream API call
    • Title — both the full title and a short cleaned version
    • Brand / byline — parsed from the card or inferred from the title
    • Current price — assembled from separate symbol, whole, and fraction HTML spans
    • List price — the struck-through original price, when present
    • Discount percentage — read from the page or computed from the two prices above
    • Star rating — parsed from ARIA labels for reliability
    • Review count — both raw text and a numeric value for easy sorting
    • Bought last month — the "X+ bought in past month" social proof badge
    • Badges — Best Seller, Amazon's Choice, #1 Best Seller, and custom badges
    • Sponsored flag — a boolean indicating whether the listing is a paid placement
    • Prime eligibility — detected via the Prime icon's CSS class
    • Delivery date and cost — extracted from the delivery block on each card
    • Ships to — location info when available
    • Product URL — a clean /dp/ASIN link, stripped of tracking parameters
    • Primary image URL — upgraded from thumbnail to full resolution
    • All image URLs — including dynamic high-res variants from data-a-dynamic-image
    • Variants — color, size, pattern option counts listed on the card
    • Seller — the "by BrandName" or "sold by" attribution

    That's more than 20 fields per product once the multi-part entries (title, review count, delivery, images) are counted separately. On a single page of 48–60 results, you're pulling a dense, structured dataset in seconds.
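Several of the derived fields in the list above reduce to small pure functions. The sketch below is illustrative rather than the project's actual code: the function names are hypothetical, and the URL shapes (a `/dp/ASIN` path segment, a size modifier such as `._AC_UL320_` before the image extension, percent-encoded targets inside sponsored links) are assumptions about how Amazon links commonly look.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Assumed link shape: ".../dp/<10-character ASIN>/..." (hypothetical pattern).
ASIN_RE = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})")
COUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)([KkMm])?")


def clean_product_url(href: str, domain: str = "www.amazon.com") -> Optional[str]:
    """Reduce a search-result link to a bare /dp/ASIN URL, or None if no ASIN is found."""
    # unquote() lets the pattern also see ASINs inside percent-encoded redirect links.
    match = ASIN_RE.search(unquote(href))
    return f"https://{domain}/dp/{match.group(1)}" if match else None


def discount_percent(current: Optional[float], list_price: Optional[float]) -> Optional[int]:
    """Fallback: compute the discount as a whole percentage from the two prices."""
    if not current or not list_price or list_price <= current:
        return None
    return round((list_price - current) / list_price * 100)


def parse_count(text: str) -> Optional[int]:
    """Turn '(1,234)' or '2K+ bought in past month' into an integer."""
    match = COUNT_RE.search(text)
    if not match:
        return None
    multiplier = {"k": 1_000, "m": 1_000_000}.get((match.group(2) or "").lower(), 1)
    return int(round(float(match.group(1).replace(",", "")) * multiplier))


def full_resolution_image(url: str) -> str:
    """Drop the size modifier that sits just before the file extension."""
    return re.sub(r"\._[^/.]+_(?=\.\w+$)", "", url)
```

Keeping these as plain functions means each one can be unit-tested against strings copied from a saved page, with no browser involved.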


    Architecture: Three Classes, Clean Separation

    The scraper is organised into three distinct classes, each with a single responsibility.

    1. AmazonParser — Pure HTML Parsing

    The AmazonParser class is entirely stateless and has no dependency on Selenium. It takes raw HTML as input and outputs structured dictionaries. This is intentional: it means you can run the parser against locally saved HTML files without spinning up a browser at all.

    bash

    python main.py --local page.html

    The parser identifies product cards by the data-component-type="s-search-result" attribute — a stable anchor point on Amazon's DOM that has persisted across multiple layout changes. From there, each extraction method targets specific CSS classes, ARIA attributes, or regex patterns.
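To make the anchor attribute concrete, here is a dependency-free sketch that finds result cards with the standard library's `html.parser`. This is a stand-in for illustration, not the project's BeautifulSoup code; the class and function names are hypothetical, and reading the ASIN from a `data-asin` attribute on the same element is an assumption.

```python
from html.parser import HTMLParser
from typing import List


class CardCollector(HTMLParser):
    """Collect the data-asin of every element marked as a search result card."""

    def __init__(self) -> None:
        super().__init__()
        self.asins: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("data-component-type") == "s-search-result":
            asin = attr_map.get("data-asin")
            if asin:  # Skip cards whose data-asin is missing or empty.
                self.asins.append(asin)


def find_asins(html: str) -> List[str]:
    """Return the ASINs of all result cards in document order."""
    collector = CardCollector()
    collector.feed(html)
    return collector.asins
```

With BeautifulSoup, the same selection would typically be written as `soup.select('div[data-component-type="s-search-result"]')`.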

    Price extraction is notably tricky on Amazon because the price is split across three separate <span> elements — one for the currency symbol, one for the whole number, and one for the decimal fraction. The parser assembles them in order rather than relying on the full formatted price string, which is sometimes hidden or absent in the raw DOM.
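The assembly step might look roughly like the following. It takes the text of the three spans as plain strings, so how they are located in the DOM is left out; the function name is hypothetical, and the output uses US-style separators regardless of the marketplace.

```python
import re
from typing import Optional, Tuple


def assemble_price(symbol: str, whole: str, fraction: str = "") -> Optional[Tuple[str, float]]:
    """Join the symbol, whole, and fraction span texts into (display text, numeric value)."""
    # The whole-number span often carries separators, e.g. "1,299."; keep digits only.
    whole_digits = re.sub(r"\D", "", whole)
    fraction_digits = re.sub(r"\D", "", fraction) or "00"
    if not whole_digits:
        return None  # No usable price on this card.
    value = float(f"{whole_digits}.{fraction_digits}")
    return f"{symbol.strip()}{value:,.2f}", value
```

For example, `assemble_price("$", "1,299.", "99")` yields `("$1,299.99", 1299.99)`. If exact cent arithmetic matters downstream, `decimal.Decimal` would be a safer numeric type than `float`.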

    For ratings, the code reads the aria-label attribute on the rating anchor element rather than the visible star graphic. This approach is more reliable because the ARIA label normally contains the numeric value in the format "4.3 out of 5 stars", which is easy to parse with a simple regex.
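A regex along these lines would do the job. The pattern and function name are my own illustration of the idea, not necessarily what the project uses.

```python
import re
from typing import Optional

# Matches labels such as "4.3 out of 5 stars" or "5 out of 5 stars".
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)\s+out of\s+5")


def parse_rating(aria_label: Optional[str]) -> Optional[float]:
    """Extract the numeric rating from an ARIA label, or None when absent."""
    if not aria_label:
        return None
    match = RATING_RE.search(aria_label)
    return float(match.group(1)) if match else None
```

Returning `None` instead of raising keeps a card with no reviews from aborting the whole page.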

    2. AmazonDriver — Selenium Browser Automation

    The AmazonDriver class wraps Selenium's Chrome WebDriver and handles the mechanics of loading pages and evading basic bot detection.

    Two techniques are used to reduce the scraper's fingerprint:

    Disabling the webdriver navigator flag. When Chrome is launched by Selenium, it sets navigator.webdriver = true — a flag that many bot detection systems check. The driver injects a small JavaScript snippet on every new document that overrides this property and sets it to undefined.

    python

    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
    )

    Excluding automation extension switches. Chrome adds enable-automation to its switches when launched by WebDriver. Setting excludeSwitches removes this telltale sign.
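In Selenium's Python bindings, the switch is excluded through an experimental option on the Chrome options object. The helper below is a sketch: its name and the `headless` parameter are assumptions, and Selenium is imported inside the function so that parser-only code paths never need it installed.

```python
AUTOMATION_SWITCHES = ["enable-automation"]


def build_chrome_options(headless: bool = True):
    """Return Chrome options with the enable-automation switch excluded."""
    # Lazy import: only live scraping needs Selenium on the machine.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_experimental_option("excludeSwitches", AUTOMATION_SWITCHES)
    if headless:
        options.add_argument("--headless=new")
    return options
```

The returned object would then be passed as `options=` when constructing `webdriver.Chrome`.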

    After loading each page, the driver also scrolls to the middle and bottom of the page before harvesting HTML. This matters because Amazon uses lazy loading for product images — the image URLs won't be in the DOM until the element scrolls into the viewport.
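The scroll-then-harvest step can be sketched as a small function that works with any object exposing Selenium's `execute_script` and `page_source`. The function name and the `pause` parameter are hypothetical.

```python
import time


def scroll_and_harvest(driver, pause: float = 1.0) -> str:
    """Scroll to the middle, then the bottom, and return the rendered HTML."""
    for fraction in (0.5, 1.0):
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight * arguments[0]);",
            fraction,
        )
        time.sleep(pause)  # Give lazy-loaded images time to populate.
    return driver.page_source
```

Because the driver is passed in, the function can be exercised in tests with a stub object instead of a real browser.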

    The configurable --delay parameter (default 3 seconds) adds a polite pause between page loads, which both reduces the chance of rate limiting and gives JavaScript-rendered content time to fully settle.

    3. AmazonScraper — Orchestration

    The AmazonScraper class ties the parser and driver together. It walks through paginated results by finding the "Next" button's href after each page load and following it, accumulating products across pages into a single flat list. Each product is tagged with its page number and global_position in the full result set, so you know exactly where it appeared in the rankings.
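Stripped of browser details, the orchestration loop might look like the sketch below. `fetch_html` and `parse_page` are hypothetical callables standing in for the driver and the parser, and numbering `global_position` from 1 is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urljoin

Product = Dict[str, object]
# parse_page returns (products on the page, href of the "Next" link or None).
ParsePage = Callable[[str], Tuple[List[Product], Optional[str]]]


def crawl(
    start_url: str,
    fetch_html: Callable[[str], str],
    parse_page: ParsePage,
    max_pages: Optional[int] = None,
) -> List[Product]:
    """Follow "Next" links, tagging each product with its page and overall rank."""
    products: List[Product] = []
    url: Optional[str] = start_url
    page = 1
    while url and (max_pages is None or page <= max_pages):
        page_products, next_href = parse_page(fetch_html(url))
        for product in page_products:
            product["page"] = page
            product["global_position"] = len(products) + 1
            products.append(product)
        # "Next" hrefs are often relative, so resolve them against the current URL.
        url = urljoin(url, next_href) if next_href else None
        page += 1
    return products
```

The loop ends either when a page has no "Next" link or when the page limit is reached, which mirrors the behaviour described for the `--pages` flag.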






    Three Scraping Modes

    One of the most useful design decisions in this project is supporting three distinct input modes via CLI flags.

    Local mode is great for development and testing. Save an Amazon search page locally (Ctrl+S in your browser), then point the scraper at it. No browser startup, no rate limiting, instant results — perfect for iterating on your parsing logic.

    Keyword mode launches Chrome, constructs the Amazon search URL from your keyword, and walks all pages automatically until results are exhausted or you hit the --pages limit.

    URL mode does the same thing but lets you start from a specific Amazon search URL — useful when you want to target a particular category, apply specific filters, or use an international Amazon domain.
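One way to wire the three modes together is an `argparse` parser with a mutually exclusive group. The flag names come from the commands in this post; the defaults for `--pages` and `--headless`, and the helper names, are assumptions made for the sketch.

```python
import argparse
from urllib.parse import urlencode


def str_to_bool(value: str) -> bool:
    """Accept the '--headless false' style of boolean flag."""
    if value.lower() in {"true", "1", "yes"}:
        return True
    if value.lower() in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Amazon search results scraper")
    # Exactly one input mode must be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--local", metavar="FILE", help="parse a saved HTML file")
    mode.add_argument("--keyword", help="search term for a live scrape")
    mode.add_argument("--url", help="Amazon search URL to start from")
    parser.add_argument("--pages", type=int, default=None, help="maximum pages to walk")
    parser.add_argument("--delay", type=float, default=3.0, help="seconds between page loads")
    parser.add_argument("--headless", type=str_to_bool, default=True)
    return parser


def search_url(keyword: str, domain: str = "www.amazon.com") -> str:
    """Build the search URL used in keyword mode."""
    return f"https://{domain}/s?{urlencode({'k': keyword})}"
```

Here `search_url("hair dryer")` produces `https://www.amazon.com/s?k=hair+dryer`, and passing a different `domain` covers international storefronts.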


    Output: JSON, CSV, and a Summary Report

    Every scrape produces three output files:

    • JSON — the full-fidelity payload, including both the per-page metadata array and the flat products array, preserved with all nested data structures intact
    • CSV — a flattened row-per-product format, ready to open in Excel or load into pandas for analysis
    • Summary TXT — a human-readable table showing ASIN, rating, price, and truncated title for every product, useful for a quick sanity check without opening a spreadsheet

    Files are timestamped with the UTC scrape time, so repeated runs don't overwrite each other.
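A minimal sketch of the JSON and CSV writers follows; the summary report is omitted. The filename pattern, the `prefix` parameter, and the choice to serialise nested lists as JSON strings inside CSV cells are all assumptions, not necessarily what the project does.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Tuple


def write_outputs(products: List[Dict], out_dir: str, prefix: str = "amazon") -> Tuple[Path, Path]:
    """Write products to timestamped JSON and CSV files and return both paths."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{prefix}_{stamp}.json"
    csv_path = out / f"{prefix}_{stamp}.csv"

    # JSON keeps nested structures (image lists, badges) as they are.
    json_path.write_text(
        json.dumps({"products": products}, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )

    # CSV needs one flat row per product: use the union of keys in first-seen order.
    fieldnames = list(dict.fromkeys(key for product in products for key in product))
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for product in products:
            writer.writerow({
                key: json.dumps(value) if isinstance(value, (list, dict)) else value
                for key, value in product.items()
            })
    return json_path, csv_path
```

The resulting CSV loads directly with `pandas.read_csv`, and the list-valued columns can be decoded with `json.loads` when needed.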


    Installation and Quick Start

    bash

    pip install -r requirements.txt

    Or install the key dependencies directly:


    pip install selenium webdriver-manager beautifulsoup4 pandas

    Chrome must be installed on your machine for live scraping — webdriver-manager handles downloading the matching ChromeDriver automatically.


    # Parse a locally saved HTML file
    python main.py --local temp/page.html

    # Live scrape by keyword, first 3 pages, browser visible
    python main.py --keyword "hair dryer" --pages 3 --headless false

    # Live scrape from a direct URL
    python main.py --url "https://www.amazon.com/s?k=hair+dryer"

    A Note on Responsible Scraping

    Amazon's Terms of Service prohibit automated scraping. This project is intended for educational and research purposes — learning how browser automation and HTML parsing work together, understanding DOM structures, and exploring data extraction techniques. If you're using scraped data in a production system, review Amazon's terms carefully, consider their official Product Advertising API, and always scrape at a rate that doesn't impact service availability. The built-in delay parameter exists for a reason: use it generously.


    What You Can Build With This

    Once you have structured product data flowing out of Amazon search results, the applications are broad:

    • Price tracking dashboards — monitor how prices shift over time for a product category
    • Competitor analysis tools — track ranking changes for specific brands or ASINs
    • Review count trend monitoring — identify products gaining rapid social proof
    • Product research pipelines — surface high-rating, high-review-count products with significant discounts for resale or affiliate opportunities
    • Training datasets — structured e-commerce product data for NLP or recommendation models

    The combination of Selenium for rendering and BeautifulSoup for parsing is a reliable workhorse for exactly this kind of structured data extraction — especially on sites that rely on JavaScript rendering and dynamic content.


    Wrapping Up

    What makes this scraper stand out from simpler examples is the attention to extraction completeness. Price assembly from split spans, ARIA-based rating parsing, lazy-loaded image URL resolution, automatic discount computation as a fallback, and sponsored-listing detection are all details that matter when you're building a dataset you can actually trust.

    The three-class architecture also keeps the code maintainable: when Amazon changes a CSS class on the price block (and they will), you update one method in AmazonParser without touching the browser or orchestration logic.

    If you're building anything in the web automation or data extraction space, this is a solid foundation to study and extend. The full source code with all files is available for download — grab it, run it against a local HTML file first, and see what comes out.
