Product Intelligence Engine
Transform product feeds into an AI-powered growth engine with 4-tier cascading extraction, deterministic staleness detection, event-driven feed sync, and a native MCP server for AI agents.
The Product Intelligence Engine transforms Serpwise from a basic product feed extractor into a complete AI-powered e-commerce optimization platform.
Instead of manually maintaining complex PIM (Product Information Management) tools or relying on limited schema data, Serpwise automatically enriches your products by analyzing the page content and product images using advanced multimodal AI. Furthermore, it natively exposes your entire catalog to AI agents via a Model Context Protocol (MCP) server backed by semantic vector search.
Architecture & Data Flow
1. Crawler discovers pages → page_index
2. Product detection (3-tier: custom regex → JSON-LD @type → OG meta) flags product pages
3. 4-tier cascading extraction (JSON-LD → Open Graph → CSS selectors → image fallback)
4. Deterministic fingerprint computed from 14 extracted fields
5. Staleness detection: fingerprint comparison + field-level diff
6. If stale → event-driven feed regeneration (all active feeds + single cache invalidation)
7. If stale + schedule enabled → auto-queued AI re-enrichment job
8. AI analysis (text + vision) enriches product data & generates vector embeddings
9. Edge proxy serves feeds, injects HTML content, and runs the MCP server
Cascading Data Extraction
Every product page passes through a 4-tier cascading extraction pipeline. Each tier fills only the fields that earlier tiers left empty, so a higher-priority source is never overwritten by a lower-priority one.
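The fill-only-the-gaps rule can be sketched as a small merge function. This is a minimal illustration, not Serpwise's implementation: the function name, the field names, and the convention that None or an empty string means "not found" are all assumptions.

```python
from typing import Optional

# Hypothetical tier output: field name -> value (None or "" means "not found").
Fields = dict[str, Optional[str]]


def cascade_merge(*tiers: Fields) -> Fields:
    """Merge tier results in priority order; later tiers only fill gaps."""
    merged: Fields = {}
    for tier in tiers:
        for name, value in tier.items():
            if value in (None, ""):
                continue  # this tier found nothing for the field
            if merged.get(name) in (None, ""):
                merged[name] = value  # fill only if still empty
    return merged


# Illustrative tier outputs (made-up values).
json_ld = {"title": "Trail Shoe", "price": "89.00", "brand": None}
open_graph = {"title": "Trail Shoe | Shop", "brand": "Acme"}
css = {"price": "95.00", "description": "Lightweight trail runner."}

product = cascade_merge(json_ld, open_graph, css)
```

Here the JSON-LD title and price win, the brand comes from Open Graph, and the description comes from the CSS selector tier.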
Tier 1: JSON-LD Product Schema (Highest Priority)
Serpwise parses all application/ld+json script blocks on the page. It walks @graph arrays (common in Yoast, WooCommerce, and Shopify themes) to locate the Product node. From it, the system extracts:
- Product identity: name, description, sku, mpn, category
- Brand: handles both string values ("brand": "Nike") and typed objects ("brand": { "@type": "Brand", "name": "Nike" })
- GTINs: tries gtin, then falls through to gtin13, gtin12, gtin14, and isbn
- Offers: price, priceCurrency, availability (normalized from schema.org URLs like https://schema.org/InStock → in stock), and item condition
- Aggregate rating: ratingValue and reviewCount
- Images: single string or array of image URLs, resolved against the page URL
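As a rough sketch of Tier 1, the helpers below locate the Product node (including inside an @graph array), read the brand in either form, and apply the GTIN fallthrough order. All function names are hypothetical, and the availability normalizer is an assumed rule (split the schema.org enum name into lowercase words); the text above only confirms the InStock → in stock case.

```python
import json
import re
from typing import Any, Optional

GTIN_KEYS = ("gtin", "gtin13", "gtin12", "gtin14", "isbn")


def find_product_node(data: Any) -> Optional[dict]:
    """Return the first node typed Product, walking lists and @graph arrays."""
    if isinstance(data, list):
        for item in data:
            found = find_product_node(item)
            if found:
                return found
        return None
    if not isinstance(data, dict):
        return None
    types = data.get("@type")
    types = types if isinstance(types, list) else [types]
    if "Product" in types:
        return data
    return find_product_node(data.get("@graph", []))


def brand_name(node: dict) -> Optional[str]:
    """Accept both "brand": "Nike" and "brand": {"@type": "Brand", ...}."""
    brand = node.get("brand")
    return brand.get("name") if isinstance(brand, dict) else brand


def first_gtin(node: dict) -> Optional[str]:
    """Try gtin first, then the more specific keys in order."""
    return next((str(node[k]) for k in GTIN_KEYS if node.get(k)), None)


def normalize_availability(url: Optional[str]) -> Optional[str]:
    """Assumed rule: https://schema.org/InStock -> "in stock"."""
    if not url:
        return None
    name = url.rstrip("/").rsplit("/", 1)[-1]
    return re.sub(r"(?<!^)(?=[A-Z])", " ", name).lower()


block = """{"@context": "https://schema.org", "@graph": [
  {"@type": "WebPage", "name": "Shop"},
  {"@type": "Product", "name": "Trail Shoe",
   "brand": {"@type": "Brand", "name": "Nike"},
   "gtin13": "0012345678905",
   "offers": {"price": "89.00",
              "availability": "https://schema.org/InStock"}}
]}"""
node = find_product_node(json.loads(block))
```

A production parser would also need to handle offers given as an array and malformed JSON blocks, which this sketch leaves out.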
Tier 2: Open Graph Meta Tags
Any fields left empty after JSON-LD are filled from Open Graph meta tags:
- og:title, og:description, og:image
- product:price:amount, product:price:currency
- product:brand, product:availability, product:condition
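A minimal sketch of reading these tags with Python's standard-library HTML parser follows. The class name and the mapping from tag names to internal field names are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed mapping from Open Graph properties to internal field names.
OG_FIELD_MAP = {
    "og:title": "title",
    "og:description": "description",
    "og:image": "image",
    "product:price:amount": "price",
    "product:price:currency": "currency",
    "product:brand": "brand",
    "product:availability": "availability",
    "product:condition": "condition",
}


class OpenGraphParser(HTMLParser):
    """Collect the first non-empty value for each mapped meta property."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = OG_FIELD_MAP.get(attr.get("property"))
        content = (attr.get("content") or "").strip()
        if name and content and name not in self.fields:
            self.fields[name] = content


html_doc = """<head>
<meta property="og:title" content="Trail Shoe">
<meta property="product:price:amount" content="89.00">
<meta property="product:price:currency" content="USD">
<meta name="viewport" content="width=device-width">
</head>"""

parser = OpenGraphParser()
parser.feed(html_doc)
og_fields = parser.fields
```

The resulting dictionary would then be offered only to fields that JSON-LD left empty.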
Tier 3: AI-Mapped CSS Selectors
During the 3-step E-Commerce Onboarding wizard, you provide a homepage, a category page, and a product page. Serpwise deeply scrapes these reference URLs and learns your site's CSS selectors for title, price, image, and description. These selectors are stored per-domain and applied as a third extraction fallback layer for any product page where structured data is missing or incomplete.
Tier 4: Content Image Discovery
If no product images were found by the previous tiers, Serpwise scans all <img> tags on the page, excluding those inside <nav>, <footer>, <aside>, and <header> regions. Up to 10 resolved image URLs are collected.
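The image fallback can be sketched with the standard-library parser by tracking whether the current position is inside an excluded region. The class name is hypothetical, and skipping duplicate URLs is an added assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

EXCLUDED_REGIONS = {"nav", "footer", "aside", "header"}
MAX_IMAGES = 10


class ContentImageParser(HTMLParser):
    """Collect up to MAX_IMAGES <img> URLs outside excluded regions."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.depth = 0  # number of excluded regions currently open
        self.images: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_REGIONS:
            self.depth += 1
            return
        if tag != "img" or self.depth > 0 or len(self.images) >= MAX_IMAGES:
            return
        src = dict(attrs).get("src")
        if src:
            url = urljoin(self.page_url, src)  # resolve relative paths
            if url not in self.images:  # assumption: skip duplicates
                self.images.append(url)

    def handle_endtag(self, tag):
        if tag in EXCLUDED_REGIONS and self.depth > 0:
            self.depth -= 1


page = (
    '<header><img src="/logo.png"></header>'
    '<main><img src="/img/shoe.jpg">'
    '<img src="https://cdn.example.com/side.jpg"></main>'
    '<footer><img src="/badge.png"></footer>'
)
finder = ContentImageParser("https://shop.example.com/products/trail-shoe")
finder.feed(page)
images = finder.images
```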
Extraction Output
The pipeline extracts 14 core fields per product:
| Field | Source Priority |
|---|---|
| Title | JSON-LD → OG → CSS selector |
| Description | JSON-LD → OG → CSS selector |
| Price | JSON-LD offers → OG product:price:amount → CSS selector |
| Currency | JSON-LD priceCurrency → OG product:price:currency |
| Availability | JSON-LD offers (normalized) → OG |
| Condition | JSON-LD offers (normalized) → OG |
| Brand | JSON-LD (string or object) → OG |
| Images | JSON-LD → OG → CSS selector → content image discovery |
| GTIN | JSON-LD (gtin → gtin13 → gtin12 → gtin14 → isbn) |
| MPN | JSON-LD |
| SKU | JSON-LD |
| Category | JSON-LD |
| Rating | JSON-LD aggregateRating |
| Review Count | JSON-LD aggregateRating |
Product Detection
Before extraction runs, Serpwise determines whether a crawled URL is a product page using a 3-tier detection heuristic:
- Custom Regex (highest priority): If you've configured a productUrlRegex for your domain (e.g., /products/.+ or /p/\d+), the URL is tested against it. If the regex matches, the result is authoritative and no fallback runs. Invalid regexes fall through gracefully.
- JSON-LD Schema Type: If the page's detected schema types include Product, it's flagged as a product page.
- Open Graph Type: If the page contains <meta property="og:type" content="product">, it's flagged as a product page.
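The detection order can be sketched as one function. This is an interpretation rather than the actual code: the function and parameter names are hypothetical, and it assumes that a valid regex which does not match falls through to the next tier, just as an invalid regex does.

```python
import re
from typing import Optional, Sequence


def is_product_page(
    url: str,
    schema_types: Sequence[str],
    og_type: Optional[str],
    product_url_regex: Optional[str] = None,
) -> bool:
    """Apply the three detection tiers in priority order."""
    # Tier 1: custom regex; a match is authoritative.
    if product_url_regex:
        try:
            if re.search(product_url_regex, url):
                return True
        except re.error:
            pass  # invalid regex: fall through to the next tier
    # Tier 2: JSON-LD schema types detected on the page.
    if "Product" in schema_types:
        return True
    # Tier 3: Open Graph type.
    return og_type == "product"
```

For example, `is_product_page("https://shop.example.com/p/123", [], None, r"/p/\d+")` is decided by the regex alone, while a broken pattern such as `r"/p/("` is ignored and the schema and Open Graph checks decide.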
Deterministic Staleness Detection
Most monitoring tools diff raw HTML, triggering false positives on every layout change or A/B test variation. Serpwise takes a fundamentally different approach.
How It Works
After extraction, the system computes a deterministic fingerprint of the extracted product fields (not the raw HTML). The raw JSON-LD schema is intentionally excluded from the fingerprint to avoid false positives from non-semantic schema changes.
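One way to build such a fingerprint is to hash a canonical serialization of the extracted fields. The sketch below assumes SHA-256 over sorted-key JSON and uses hypothetical field names; the source does not specify the hash function or the serialization.

```python
import hashlib
import json

# Hypothetical names for the 14 extracted fields.
FINGERPRINT_FIELDS = (
    "title", "description", "price", "currency", "availability",
    "condition", "brand", "images", "gtin", "mpn", "sku",
    "category", "rating", "review_count",
)


def compute_fingerprint(product: dict) -> str:
    """Hash only the extracted fields, in a canonical order."""
    canonical = {name: product.get(name) for name in FINGERPRINT_FIELDS}
    payload = json.dumps(
        canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = {"title": "Trail Shoe", "price": "89.00", "availability": "in stock"}
with_raw_schema = {**base, "raw_json_ld": "{...}"}  # ignored by the hash
price_drop = {**base, "price": "84.00"}
```

Because keys outside the field list never enter the hash, `base` and `with_raw_schema` produce the same fingerprint, while `price_drop` produces a different one.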
The fingerprint comparison uses a three-step algorithm:
- Fast-path match: If the new fingerprint equals the stored fingerprint → not stale. Processing stops immediately with zero wasted compute.
- Null stored fingerprint: This is the first crawl for this product → flagged as stale (needs initial analysis).
- Fingerprint differs: A field-by-field comparison of all 14 field pairs runs, producing a changedFields array (e.g., ["price", "availability"]). Both the new fingerprint and the changed field names are persisted.
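The three steps can be sketched as follows. The type and function names are hypothetical, and fingerprints are passed in as opaque strings so that the example stays independent of how they are computed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StalenessResult:
    stale: bool
    changed_fields: list[str] = field(default_factory=list)


def check_staleness(
    new_fingerprint: str,
    stored_fingerprint: Optional[str],
    new_fields: dict,
    stored_fields: dict,
) -> StalenessResult:
    # Step 1: fast path, identical fingerprints mean nothing changed.
    if new_fingerprint == stored_fingerprint:
        return StalenessResult(stale=False)
    # Step 2: no stored fingerprint means this is the first crawl.
    if stored_fingerprint is None:
        return StalenessResult(stale=True)
    # Step 3: fingerprints differ, so find out which fields changed.
    changed = [
        name for name in new_fields
        if new_fields.get(name) != stored_fields.get(name)
    ]
    return StalenessResult(stale=True, changed_fields=changed)


stored = {"title": "Trail Shoe", "price": "89.00", "availability": "in stock"}
new = {"title": "Trail Shoe", "price": "84.00", "availability": "out of stock"}
result = check_staleness("fp-new", "fp-old", new, stored)
```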
What This Means
A CSS redesign, an A/B test, or a footer change will not trigger a false stale flag. But a $5 price drop, an out-of-stock change, or a new product image will be caught instantly — and the system knows exactly which fields changed.
Event-Driven Feed Regeneration
When a product is flagged stale, feed regeneration fires immediately — not on a cron job, not in a batch overnight.
Behavior
- All active feeds for that domain (XML, CSV, JSON) are regenerated in a single pass, tagged with triggeredBy: "auto".
- Fault isolation: If one feed fails to regenerate, the error is logged but doesn't block the other feeds. Partial success is still progress.
- Single cache invalidation: After all feeds are regenerated, one gateway cache invalidation request fires (not one per feed), ensuring the edge proxy serves updated feeds within seconds.
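The behavior above amounts to a loop with per-feed error handling followed by one invalidation call. In this sketch the `regenerate` and `invalidate_cache` callables are hypothetical stand-ins for the real feed writer and gateway client, and firing the invalidation even when some feeds failed is an assumption.

```python
import logging
from typing import Callable

logger = logging.getLogger("feeds")


def regenerate_feeds(
    feeds: list[str],
    regenerate: Callable[[str, str], None],
    invalidate_cache: Callable[[], None],
) -> dict[str, bool]:
    """Regenerate every feed, isolate failures, invalidate the cache once."""
    results: dict[str, bool] = {}
    for feed in feeds:
        try:
            regenerate(feed, "auto")  # second argument mirrors triggeredBy
            results[feed] = True
        except Exception:
            logger.exception("feed %s failed to regenerate", feed)
            results[feed] = False  # keep going with the remaining feeds
    invalidate_cache()  # one request for the whole pass, not one per feed
    return results


# Demo with stand-in callables: the CSV feed fails, the others succeed.
invalidations: list[str] = []


def fake_regenerate(feed: str, triggered_by: str) -> None:
    if feed == "csv":
        raise RuntimeError("disk full")


outcome = regenerate_feeds(
    ["xml", "csv", "json"],
    fake_regenerate,
    lambda: invalidations.append("purge"),
)
```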
Your Google Merchant feed reflects a price change within seconds of the crawler detecting it.
Automated AI Re-Enrichment
Staleness detection does more than update feeds. If you've configured a domain analysis schedule, the system automatically queues AI re-enrichment jobs.
Immediate Re-Analysis (On Staleness)
When a product is flagged stale and the domain has an enabled analysis schedule (with an interval other than "never"), the system immediately queues an AI job:
- Text analysis: Regenerates the 30+ attribute profile (SEO titles, Google Taxonomy classification, highlights, specs, FAQ, keywords, custom labels)
- Vision analysis: Re-examines the product image for color, material, pattern, style, and brand
- The job type is determined by your schedule's analysisType setting
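The queueing condition can be written as a small guard. The class and function names are hypothetical, as is the "daily" interval, and the three analysis type values ("text", "vision", "full") are inferred from the credit pricing rather than confirmed as the actual setting values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalysisSchedule:
    enabled: bool
    interval: str        # hypothetical values such as "daily"; "never" disables
    analysis_type: str   # assumed values: "text", "vision", or "full"


def job_type_for_stale_product(
    stale: bool, schedule: Optional[AnalysisSchedule]
) -> Optional[str]:
    """Return the AI job type to queue, or None when no job should be queued."""
    if not stale or schedule is None:
        return None
    if not schedule.enabled or schedule.interval == "never":
        return None
    return schedule.analysis_type
```

For a stale product on a domain with `AnalysisSchedule(True, "daily", "full")`, this returns `"full"`; with the interval set to `"never"`, nothing is queued.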
Scheduled Batch Re-Analysis
A cron job processes all domains whose scheduled nextRunAt timestamp has passed:
- Queries products where the current fingerprint differs from the last-analyzed fingerprint (stale-first priority)
- Limits the batch to the configured maxProductsPerRun (default 25)
- Checks the organization's credit balance against the total cost (text: 5 credits, vision: 3 credits, full: 7 credits per product)
- Pre-deducts credits in a single transaction, then bulk-inserts AI jobs into the queue
- Advances the schedule timestamps (lastRunAt → now, nextRunAt → now + interval)
The process is safe to run repeatedly: if a run finds no stale products, it queues nothing and simply advances the timestamps.
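The batch sizing and credit arithmetic can be sketched as below. The function names are hypothetical, and skipping the whole run when the balance cannot cover the batch is an assumption; the source does not say what happens in that case.

```python
from datetime import datetime, timedelta, timezone

CREDITS_PER_PRODUCT = {"text": 5, "vision": 3, "full": 7}


def plan_batch(
    stale_product_ids: list[str],
    analysis_type: str,
    credit_balance: int,
    max_products_per_run: int = 25,
) -> tuple[list[str], int]:
    """Return (product IDs to queue, credits to pre-deduct)."""
    batch = stale_product_ids[:max_products_per_run]
    total_cost = CREDITS_PER_PRODUCT[analysis_type] * len(batch)
    if total_cost > credit_balance:
        return [], 0  # assumption: skip the run when credits are short
    return batch, total_cost


def advance_schedule(now: datetime, interval: timedelta) -> dict:
    """lastRunAt becomes now; nextRunAt becomes now + interval."""
    return {"lastRunAt": now, "nextRunAt": now + interval}


stale_ids = [f"p{n}" for n in range(40)]
batch, cost = plan_batch(stale_ids, "full", credit_balance=200)
# 25 products at 7 credits each is 175 credits, within the 200 balance.
schedule = advance_schedule(
    datetime(2024, 1, 1, tzinfo=timezone.utc), timedelta(days=1)
)
```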
Multimodal AI Analysis
Text Analysis
The AI text analysis acts like an e-commerce data specialist. It reads the extracted source data and full page content, then generates:
- SEO-optimized title (max 70 characters)
- Detailed product description (500-1000 characters)
- Short description for feed previews (under 200 characters)
- Google Product Taxonomy classification (numeric ID + full category path)
- Physical attributes: color, material, pattern, size, weight, dimensions, brand, gender, age group
- 3-10 bullet-point selling highlights
- Key-value specification pairs
- 3-8 customer FAQ question-answer pairs
- 5-20 SEO keywords and phrases
- 0-5 custom labels for feed segmentation (e.g., seasonal, bestseller)
If your site has a category tree, it's included in the analysis context to improve taxonomy classification accuracy.
Vision Analysis
The vision analysis examines the primary product image and extracts:
- Primary/dominant color and all visible colors
- Material (leather, cotton, metal, wood, plastic, etc.)
- Pattern (striped, floral, solid, geometric, etc.)
- Style (casual, formal, athletic, vintage, modern, etc.)
- Brand markings (logo, label, printed text)
- All identifiable visual features
- Image quality rating (high/medium/low)
The system reconciles text and vision attributes logically — vision is prioritized for color and pattern, text for exact material and dimensions.
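A simple way to picture this reconciliation is a per-attribute preference table. The sketch below uses hypothetical names and treats text as the default source for any attribute not named in the rule above, which is an assumption.

```python
# Attributes where the vision result wins when both sources have a value.
VISION_FIRST = {"color", "pattern"}


def reconcile(text_attrs: dict, vision_attrs: dict) -> dict:
    """Merge text and vision attributes using per-attribute priority."""
    merged = {}
    for name in text_attrs.keys() | vision_attrs.keys():
        text_value = text_attrs.get(name)
        vision_value = vision_attrs.get(name)
        if name in VISION_FIRST:
            merged[name] = vision_value or text_value
        else:
            # Material, dimensions, and (by assumption) everything else.
            merged[name] = text_value or vision_value
    return merged


text_attrs = {
    "color": "red",
    "material": "full-grain leather",
    "dimensions": "30 x 20 cm",
}
vision_attrs = {
    "color": "burgundy",
    "pattern": "solid",
    "material": "leather",
    "style": "casual",
}
attributes = reconcile(text_attrs, vision_attrs)
```

In this example the color comes from vision ("burgundy"), while the more precise material string comes from the page text.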
The Agentic Suite (MCP & UCP)
Native MCP Server
The edge proxy exposes a Model Context Protocol (MCP) server with three tools:
| Tool | Description |
|---|---|
| semantic_product_search | Natural language semantic search against your product embeddings (cosine similarity) |
| get_product_details | Full enriched product data by ID |
| list_categories | All AI-classified categories across your catalog |
Authentication: Agents connect via scoped API keys with products:read permission. Rate limits are enforced per minute and per day.
Semantic search flow: The agent sends a natural language query (e.g., "lightweight red summer dress under $50"). Serpwise generates a query embedding, runs cosine similarity against your product embeddings, and returns ranked results with similarity scores.
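The ranking step is plain cosine similarity, $\cos(q, p) = \frac{q \cdot p}{\lVert q \rVert \, \lVert p \rVert}$. The sketch below uses toy 3-dimensional vectors in place of real embeddings; the function names and product IDs are made up, and a production system would typically use a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(
    query_embedding: list[float],
    product_embeddings: dict[str, list[float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Return (product ID, similarity) pairs, best match first."""
    scored = [
        (product_id, cosine_similarity(query_embedding, embedding))
        for product_id, embedding in product_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings.
catalog = {
    "dress-red": [0.9, 0.1, 0.0],
    "boots": [0.0, 1.0, 0.0],
    "dress-blue": [0.6, 0.0, 0.8],
}
results = semantic_search([1.0, 0.0, 0.0], catalog, top_k=2)
```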
To connect an agent, generate an API key from your Organization settings and point your agent to https://your-proxy.serpwise.com/serpwise/mcp.
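MCP messages are JSON-RPC 2.0, and tools are invoked with the tools/call method. The sketch below only builds the request body and headers for a semantic_product_search call. The `query` argument name and the Bearer authorization scheme are assumptions, not confirmed details of the Serpwise server, and a real client would normally use an MCP SDK and complete the initialize handshake before calling tools.

```python
import json

MCP_ENDPOINT = "https://your-proxy.serpwise.com/serpwise/mcp"


def build_tool_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that invokes an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def build_headers(api_key: str) -> dict:
    """Assumed auth scheme: the API key sent as a Bearer token."""
    return {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
        "Authorization": f"Bearer {api_key}",
    }


payload = build_tool_call(
    1,
    "semantic_product_search",
    {"query": "lightweight red summer dress under $50"},  # assumed argument name
)
body = json.dumps(payload)
headers = build_headers("YOUR_API_KEY")
```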
UCP Semantic Search Widget
The Universal Content Provider (UCP) widget drops an AI-powered smart search bar onto your storefront via a simple <script> tag. Queries run against the same semantic search infrastructure that powers the MCP server.
Automated Cross-Sells
Using the vector embeddings, the edge proxy dynamically queries for semantic similarity and injects a "You Might Also Like" section directly into the page response, bypassing your CMS entirely.
Edge HTML Content Injection
Because Serpwise operates as a reverse proxy, it modifies the HTTP response in transit, before it reaches the visitor or search engine crawler:
- Specs Table: A clean, structured table of all extracted specifications
- FAQ Section: AI-generated Q&As with FAQPage schema microdata for rich search results
- Enhanced JSON-LD: A merged and upgraded version of your product's structured data
- Related Products: Semantically similar product blocks injected on-page
Zero CMS plugins, zero theme edits, zero developer time.
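To make the FAQ injection concrete, the sketch below renders Q&A pairs with schema.org FAQPage microdata and splices the fragment into a response body. The function names, the exact markup, and the choice to insert just before the closing body tag are assumptions for illustration only.

```python
from html import escape


def render_faq(faqs: list[tuple[str, str]]) -> str:
    """Render question/answer pairs with schema.org FAQPage microdata."""
    items = []
    for question, answer in faqs:
        items.append(
            '<div itemscope itemprop="mainEntity" '
            'itemtype="https://schema.org/Question">'
            f'<h3 itemprop="name">{escape(question)}</h3>'
            '<div itemscope itemprop="acceptedAnswer" '
            'itemtype="https://schema.org/Answer">'
            f'<p itemprop="text">{escape(answer)}</p></div></div>'
        )
    return (
        '<section itemscope itemtype="https://schema.org/FAQPage">'
        + "".join(items)
        + "</section>"
    )


def inject_before_body_end(html_doc: str, fragment: str) -> str:
    """Insert the fragment before the last </body>, or append if absent."""
    index = html_doc.lower().rfind("</body>")
    if index == -1:
        return html_doc + fragment
    return html_doc[:index] + fragment + html_doc[index:]


original = "<html><body><h1>Trail Shoe</h1></body></html>"
faq_html = render_faq([("Care & cleaning?", "Wipe with a damp cloth.")])
modified = inject_before_body_end(original, faq_html)
```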
Feed Automation
Feeds are available at: https://your-proxy.serpwise.com/serpwise/feeds/[feed-name].xml
Feeds are regenerated automatically when:
- A product is flagged stale (event-driven, immediate)
- You manually trigger regeneration from the dashboard
- An AI re-analysis job completes and updates the enriched product data