Product Intelligence Engine

Transform product feeds into an AI-powered growth engine with 4-tier cascading extraction, deterministic staleness detection, event-driven feed sync, and a native MCP server for AI agents.

The Product Intelligence Engine transforms Serpwise from a basic product feed extractor into a complete AI-powered e-commerce optimization platform.

Instead of manually maintaining complex PIM (Product Information Management) tools or relying on limited schema data, Serpwise automatically enriches your products by analyzing the page content and product images using advanced multimodal AI. Furthermore, it natively exposes your entire catalog to AI agents via a Model Context Protocol (MCP) server backed by semantic vector search.

Architecture & Data Flow

1. Crawler discovers pages → page_index
2. Product detection (3-tier: custom regex → JSON-LD @type → OG meta) flags product pages
3. 4-tier cascading extraction (JSON-LD → Open Graph → CSS selectors → image fallback)
4. Deterministic fingerprint computed from 14 extracted fields
5. Staleness detection: fingerprint comparison + field-level diff
6. If stale → event-driven feed regeneration (all active feeds + single cache invalidation)
7. If stale + schedule enabled → auto-queued AI re-enrichment job
8. AI analysis (text + vision) enriches product data & generates vector embeddings
9. Edge proxy serves feeds, injects HTML content, and runs the MCP server

Cascading Data Extraction

Every product page passes through a 4-tier cascading extraction pipeline. Each tier fills only the fields that earlier tiers left empty, so higher-priority data is never overwritten and lower-priority sources only close gaps.
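The fill-only-what-is-missing rule can be sketched as a small merge function. This is an illustrative sketch, not Serpwise's implementation: the field names, the `cascade` helper, and the definition of "missing" (None, empty string, empty list) are assumptions.

```python
from typing import Any

Fields = dict[str, Any]

def is_missing(value: Any) -> bool:
    """Assumption: None, empty strings, and empty lists count as 'not extracted'."""
    return value is None or value == "" or value == []

def cascade(tiers: list[Fields]) -> Fields:
    """Merge tier outputs in priority order; later tiers only fill gaps."""
    merged: Fields = {}
    for tier in tiers:
        for field, value in tier.items():
            if is_missing(merged.get(field)) and not is_missing(value):
                merged[field] = value
    return merged

# Hypothetical tier outputs for one product page.
json_ld = {"title": "Trail Shoe", "price": "89.00", "brand": None}
open_graph = {"title": "Trail Shoe | Shop", "brand": "Acme", "images": []}
css = {"price": "89,00", "images": ["https://example.com/a.jpg"]}

product = cascade([json_ld, open_graph, css])
```

Here the JSON-LD title and price win, Open Graph supplies the brand, and the CSS tier contributes only the images.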

Tier 1: JSON-LD Product Schema (Highest Priority)

Serpwise parses all application/ld+json script blocks on the page. It walks @graph arrays (common in Yoast, WooCommerce, and Shopify themes) to locate the Product node. From it, the system extracts:

Product identity: name, description, sku, mpn, category
Brand: handles both string values ("brand": "Nike") and typed objects ("brand": { "@type": "Brand", "name": "Nike" })
GTINs: tries gtin, then falls through to gtin13, gtin12, gtin14, and isbn
Offers: price, priceCurrency, availability (normalized from schema.org URLs like https://schema.org/InStock → in stock), and item condition
Aggregate rating: ratingValue and reviewCount
Images: single string or array of image URLs, resolved against the page URL
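A minimal sketch of the Tier 1 rules above, assuming a single `offers` object and hypothetical helper names. The availability normalizer generalizes the one documented mapping (InStock → in stock) by splitting CamelCase; the labels Serpwise produces for other values are not stated here.

```python
import json
from typing import Any, Optional

GTIN_KEYS = ("gtin", "gtin13", "gtin12", "gtin14", "isbn")

def find_product(doc: Any) -> Optional[dict]:
    """Return the first node typed Product, walking lists and @graph arrays."""
    if isinstance(doc, list):
        for item in doc:
            found = find_product(item)
            if found:
                return found
        return None
    if not isinstance(doc, dict):
        return None
    types = doc.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "Product" in types:
        return doc
    return find_product(doc.get("@graph", []))

def brand_name(brand: Any) -> Optional[str]:
    """Accept both a plain string and a typed Brand object."""
    if isinstance(brand, dict):
        return brand.get("name")
    return brand if isinstance(brand, str) else None

def first_gtin(node: dict) -> Optional[str]:
    """Try gtin first, then gtin13, gtin12, gtin14, and isbn."""
    for key in GTIN_KEYS:
        if node.get(key):
            return str(node[key])
    return None

def normalize_availability(value: Optional[str]) -> Optional[str]:
    """Map https://schema.org/InStock to 'in stock' (assumed label format)."""
    if not value:
        return None
    token = value.rstrip("/").rsplit("/", 1)[-1]  # e.g. "InStock"
    words = "".join(" " + c.lower() if c.isupper() else c for c in token)
    return words.strip()

raw = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "WebPage"},
        {"@type": "Product", "name": "Air Zoom",
         "brand": {"@type": "Brand", "name": "Nike"},
         "gtin13": "0123456789012",
         "offers": {"price": "120.00",
                    "availability": "https://schema.org/InStock"}},
    ],
})
node = find_product(json.loads(raw))
```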

Tier 2: Open Graph Meta Tags

Any fields left empty after JSON-LD are filled from Open Graph meta tags:

og:title, og:description, og:image
product:price:amount, product:price:currency
product:brand, product:availability, product:condition
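As a rough illustration of this tier, the sketch below reads the meta properties listed above with Python's standard HTML parser and fills only the fields that are still empty. The internal field names and the `fill_from_open_graph` helper are assumptions for the example.

```python
from html.parser import HTMLParser

# Meta properties from the list above, mapped to hypothetical field names.
OG_FIELDS = {
    "og:title": "title",
    "og:description": "description",
    "og:image": "image",
    "product:price:amount": "price",
    "product:price:currency": "currency",
    "product:brand": "brand",
    "product:availability": "availability",
    "product:condition": "condition",
}

class OpenGraphParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        field = OG_FIELDS.get(attr.get("property") or "")
        content = attr.get("content")
        # Keep the first occurrence of each property.
        if field and content and field not in self.fields:
            self.fields[field] = content

def fill_from_open_graph(product: dict, html: str) -> dict:
    """Return a copy of product with empty fields filled from OG tags."""
    parser = OpenGraphParser()
    parser.feed(html)
    filled = dict(product)
    for field, value in parser.fields.items():
        if not filled.get(field):
            filled[field] = value
    return filled
```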

Tier 3: AI-Mapped CSS Selectors

During the 3-step E-Commerce Onboarding wizard, you provide a homepage, a category page, and a product page. Serpwise scrapes these reference URLs and learns your site's CSS selectors for title, price, image, and description. These selectors are stored per domain and applied as the third extraction tier for any product page where structured data is missing or incomplete.

Tier 4: Content Image Discovery

If no product images were found by the previous tiers, Serpwise scans all <img> tags on the page, excluding those inside <nav>, <footer>, <aside>, and <header> regions. Up to 10 resolved image URLs are collected.
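The scan described above could look roughly like the following sketch, which tracks nesting depth inside the excluded regions and resolves each src against the page URL. Skipping duplicate URLs is an assumption added for the example, as are the class and function names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

EXCLUDED_REGIONS = {"nav", "footer", "aside", "header"}
MAX_IMAGES = 10

class ContentImageParser(HTMLParser):
    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.excluded_depth = 0  # > 0 while inside an excluded region
        self.images: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_REGIONS:
            self.excluded_depth += 1
        elif tag == "img" and self.excluded_depth == 0:
            src = dict(attrs).get("src")
            if src and len(self.images) < MAX_IMAGES:
                url = urljoin(self.page_url, src)
                if url not in self.images:  # assumption: skip duplicates
                    self.images.append(url)

    def handle_endtag(self, tag):
        if tag in EXCLUDED_REGIONS and self.excluded_depth > 0:
            self.excluded_depth -= 1

def discover_images(html: str, page_url: str) -> list[str]:
    parser = ContentImageParser(page_url)
    parser.feed(html)
    return parser.images
```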

Extraction Output

The pipeline extracts 14 core fields per product:

Field	Source Priority
Title	JSON-LD → OG → CSS selector
Description	JSON-LD → OG → CSS selector
Price	JSON-LD offers → OG `product:price:amount` → CSS selector
Currency	JSON-LD `priceCurrency` → OG `product:price:currency`
Availability	JSON-LD offers (normalized) → OG
Condition	JSON-LD offers (normalized) → OG
Brand	JSON-LD (string or object) → OG
Images	JSON-LD → OG → CSS selector → content image discovery
GTIN	JSON-LD (`gtin` → `gtin13` → `gtin12` → `gtin14` → `isbn`)
MPN	JSON-LD
SKU	JSON-LD
Category	JSON-LD
Rating	JSON-LD `aggregateRating`
Review Count	JSON-LD `aggregateRating`

Product Detection

Before extraction runs, Serpwise determines whether a crawled URL is a product page using a 3-tier detection heuristic:

1. Custom Regex (highest priority): If you've configured a productUrlRegex for your domain (e.g., /products/.+ or /p/\d+), the URL is tested against it. If the regex matches, the result is authoritative and no fallback runs. An invalid regex is skipped, and detection falls through to the next tier.
2. JSON-LD Schema Type: If the page's detected schema types include Product, it's flagged as a product page.
3. Open Graph Type: If the page contains <meta property="og:type" content="product">, it's flagged as a product page.
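The three detection tiers can be sketched as one function. Two points are assumptions of this sketch rather than documented behavior: the regex is applied with a search anywhere in the URL, and a valid regex that does not match also falls through to the later tiers.

```python
import re
from typing import Optional

def is_product_page(
    url: str,
    schema_types: list[str],
    og_type: Optional[str],
    product_url_regex: Optional[str] = None,
) -> bool:
    # Tier 1: custom regex. A match is authoritative.
    if product_url_regex:
        try:
            if re.search(product_url_regex, url):
                return True
        except re.error:
            pass  # invalid regex: fall through to the next tier
    # Tier 2: JSON-LD schema types detected on the page.
    if "Product" in schema_types:
        return True
    # Tier 3: Open Graph og:type.
    return og_type == "product"
```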

Deterministic Staleness Detection

Most monitoring tools diff raw HTML, triggering false positives on every layout change or A/B test variation. Serpwise takes a fundamentally different approach.

How It Works

After extraction, the system computes a deterministic fingerprint of the extracted product fields (not the raw HTML). The raw JSON-LD schema is intentionally excluded from the fingerprint to avoid false positives from non-semantic schema changes.

The fingerprint comparison uses a three-step algorithm:

1. Fast-path match: If the new fingerprint equals the stored fingerprint → not stale. Processing stops immediately with zero wasted compute.
2. Null stored fingerprint: This is the first crawl for this product → flagged as stale (needs initial analysis).
3. Fingerprint differs: A field-by-field comparison of all 14 field pairs runs, producing a changedFields array (e.g., ["price", "availability"]). Both the new fingerprint and the changed field names are persisted.
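A minimal sketch of the fingerprint and the three-step comparison. The hash function (SHA-256 over canonical JSON) and the snake_case field names are assumptions; the text above only specifies that the fingerprint is deterministic, covers the 14 extracted fields, and excludes the raw JSON-LD.

```python
import hashlib
import json
from typing import Optional

# The 14 extracted fields (names are illustrative).
FIELDS = (
    "title", "description", "price", "currency", "availability",
    "condition", "brand", "images", "gtin", "mpn", "sku", "category",
    "rating", "review_count",
)

def fingerprint(product: dict) -> str:
    """Hash only the extracted fields; anything else (raw JSON-LD) is ignored."""
    canonical = json.dumps(
        {f: product.get(f) for f in FIELDS},
        sort_keys=True, separators=(",", ":"), ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_staleness(new: dict, stored: Optional[dict],
                    stored_fingerprint: Optional[str]) -> dict:
    new_fp = fingerprint(new)
    # 1. Fast path: identical fingerprint, nothing to do.
    if new_fp == stored_fingerprint:
        return {"stale": False, "fingerprint": new_fp, "changedFields": []}
    # 2. First crawl: no stored fingerprint yet.
    if stored_fingerprint is None:
        return {"stale": True, "fingerprint": new_fp, "changedFields": []}
    # 3. Fingerprint differs: compare field by field.
    old = stored or {}
    changed = [f for f in FIELDS if new.get(f) != old.get(f)]
    return {"stale": True, "fingerprint": new_fp, "changedFields": changed}
```

Because only the 14 fields feed the hash, a change to markup or to the raw schema blob leaves the fingerprint untouched.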

What This Means

A CSS redesign, an A/B test, or a footer change will not trigger a false stale flag. A $5 price drop, an out-of-stock change, or a new product image is caught as soon as the page is re-crawled, and the system knows exactly which fields changed.

Event-Driven Feed Regeneration

When a product is flagged stale, feed regeneration fires immediately — not on a cron job, not in a batch overnight.

Behavior

All active feeds for that domain (XML, CSV, JSON) are regenerated in a single pass, tagged with triggeredBy: "auto".
Fault isolation: If one feed fails to regenerate, the error is logged but doesn't block the other feeds. Partial success is still progress.
Single cache invalidation: After all feeds are regenerated, one gateway cache invalidation request fires (not one per feed), ensuring the edge proxy serves updated feeds within seconds.
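The behavior above can be sketched as a loop with per-feed error handling followed by a single invalidation call. The `regenerate` and `invalidate_cache` callables are hypothetical stand-ins for the real feed writer and gateway client, and skipping invalidation when every feed failed is an assumption of this sketch.

```python
import logging
from typing import Callable

logger = logging.getLogger("feeds")

def regenerate_feeds(
    feeds: list[dict],
    regenerate: Callable[[dict, str], None],
    invalidate_cache: Callable[[], None],
) -> dict:
    """Regenerate every active feed in one pass, then invalidate the cache once."""
    succeeded, failed = [], []
    for feed in feeds:
        try:
            regenerate(feed, "auto")  # tagged triggeredBy: "auto"
            succeeded.append(feed["name"])
        except Exception:
            # Fault isolation: log and continue with the remaining feeds.
            logger.exception("feed %s failed to regenerate", feed["name"])
            failed.append(feed["name"])
    if succeeded:
        invalidate_cache()  # one request for the whole pass, not one per feed
    return {"succeeded": succeeded, "failed": failed}
```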

Your Google Merchant feed reflects a price change within seconds of the crawler detecting it.

Automated AI Re-Enrichment

Staleness detection does more than update feeds. If you've configured a domain analysis schedule, the system automatically queues AI re-enrichment jobs.

Immediate Re-Analysis (On Staleness)

When a product is flagged stale and the domain has an enabled analysis schedule (with an interval other than "never"), the system immediately queues an AI job:

Text analysis: Regenerates the 30+ attribute profile (SEO titles, Google Taxonomy classification, highlights, specs, FAQ, keywords, custom labels)
Vision analysis: Re-examines the product image for color, material, pattern, style, and brand
The job type is determined by your schedule's analysisType setting
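The queuing condition can be written as a small guard. The schedule keys `enabled` and `interval` and the shape of the returned job are assumptions for illustration; only analysisType is named in the text above.

```python
from typing import Optional

def immediate_job(stale: bool, schedule: Optional[dict]) -> Optional[dict]:
    """Return an AI job to queue right away, or None if no job should run."""
    if not stale or not schedule:
        return None
    if not schedule.get("enabled") or schedule.get("interval") == "never":
        return None
    # The job type follows the schedule's analysisType setting.
    return {"type": schedule["analysisType"], "reason": "stale"}
```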

Scheduled Batch Re-Analysis

A cron job processes all domains whose scheduled nextRunAt timestamp has passed:

1. Queries products where the current fingerprint differs from the last-analyzed fingerprint (stale-first priority)
2. Limits the batch to the configured maxProductsPerRun (default 25)
3. Checks the organization's credit balance against the total cost (text: 5 credits, vision: 3 credits, full: 7 credits per product)
4. Pre-deducts credits in a single transaction, then bulk-inserts AI jobs into the queue
5. Advances the schedule timestamps (lastRunAt → now, nextRunAt → now + interval)

Runs are safe to repeat: if a run finds no stale products, it queues nothing and only advances the timestamps.
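The batch steps can be sketched as a pure planning function. The record shapes, the `plan_run` name, and the handling of an insufficient balance (queue nothing, charge nothing) are assumptions; the credit costs and the default batch size come from the steps above.

```python
from datetime import datetime, timedelta, timezone

# Per-product credit costs as listed above.
CREDIT_COST = {"text": 5, "vision": 3, "full": 7}

def plan_run(schedule: dict, products: list[dict],
             balance: int, now: datetime) -> dict:
    """Plan one scheduled run for a domain whose nextRunAt has passed."""
    # Stale-first: current fingerprint differs from the last-analyzed one.
    stale = [p for p in products
             if p["fingerprint"] != p.get("lastAnalyzedFingerprint")]
    # Cap the batch at maxProductsPerRun.
    batch = stale[: schedule.get("maxProductsPerRun", 25)]
    # Check the balance against the total cost.
    cost = len(batch) * CREDIT_COST[schedule["analysisType"]]
    if cost > balance:
        batch, cost = [], 0  # assumption: an unaffordable run queues nothing
    # One deduction for the whole batch, then all jobs for a bulk insert.
    jobs = [{"productId": p["id"], "type": schedule["analysisType"]}
            for p in batch]
    # Advance the schedule timestamps.
    return {
        "jobs": jobs,
        "charged": cost,
        "balance": balance - cost,
        "lastRunAt": now,
        "nextRunAt": now + schedule["interval"],
    }

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
schedule = {"analysisType": "full", "interval": timedelta(days=1)}
```

With 30 stale products and the default limit, a full run queues 25 jobs at 7 credits each, for a single deduction of 175 credits.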

Multimodal AI Analysis

Text Analysis

The AI text analysis acts like an e-commerce data specialist. It reads the extracted source data and full page content, then generates:

SEO-optimized title (max 70 characters)
Detailed product description (500-1000 characters)
Short description for feed previews (under 200 characters)
Google Product Taxonomy classification (numeric ID + full category path)
Physical attributes: color, material, pattern, size, weight, dimensions, brand, gender, age group
3-10 bullet-point selling highlights
Key-value specification pairs
3-8 customer FAQ question-answer pairs
5-20 SEO keywords and phrases
0-5 custom labels for feed segmentation (e.g., seasonal, bestseller)

If your site has a category tree, it's included in the analysis context to improve taxonomy classification accuracy.

Vision Analysis

The vision analysis examines the primary product image and extracts:

Primary/dominant color and all visible colors
Material (leather, cotton, metal, wood, plastic, etc.)
Pattern (striped, floral, solid, geometric, etc.)
Style (casual, formal, athletic, vintage, modern, etc.)
Brand markings (logo, label, printed text)
All identifiable visual features
Image quality rating (high/medium/low)

When text and vision results overlap, the system reconciles them by attribute: vision takes priority for color and pattern, text for exact material and dimensions.
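A sketch of that priority rule, assuming both analyses return flat attribute dictionaries. Preferring text for attributes not mentioned above is an assumption of this example.

```python
# Attributes where the vision result wins when both sources have a value.
VISION_PREFERRED = {"color", "pattern"}

def reconcile(text_attrs: dict, vision_attrs: dict) -> dict:
    """Merge text and vision attributes using per-attribute priority."""
    merged = {}
    for field in sorted(set(text_attrs) | set(vision_attrs)):
        text_value = text_attrs.get(field)
        vision_value = vision_attrs.get(field)
        if field in VISION_PREFERRED:
            merged[field] = vision_value or text_value
        else:
            # Text wins for material and dimensions (and, by assumption, the rest).
            merged[field] = text_value or vision_value
    return merged
```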

The Agentic Suite (MCP & UCP)

Native MCP Server

The edge proxy exposes a Model Context Protocol (MCP) server with three tools:

Tool	Description
`semantic_product_search`	Natural language semantic search against your product embeddings (cosine similarity)
`get_product_details`	Full enriched product data by ID
`list_categories`	All AI-classified categories across your catalog

Authentication: Agents connect via scoped API keys with the products:read permission. Rate limiting is enforced per minute and per day.

Semantic search flow: The agent sends a natural language query (e.g., "lightweight red summer dress under $50"). Serpwise generates a query embedding, runs cosine similarity against your product embeddings, and returns ranked results with similarity scores.
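The ranking step of this flow can be illustrated with plain cosine similarity over toy vectors. Real embeddings have far more dimensions and would normally be served from a vector index; the function names and the in-memory dictionary are assumptions for the sketch.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query_embedding: list[float],
                    product_embeddings: dict[str, list[float]],
                    top_k: int = 5) -> list[tuple[str, float]]:
    """Return (product_id, similarity) pairs, best match first."""
    scored = [(pid, cosine_similarity(query_embedding, emb))
              for pid, emb in product_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional embeddings.
catalog = {
    "boots": [0.0, 1.0, 0.0],
    "dress": [0.9, 0.1, 0.0],
    "skirt": [0.5, 0.5, 0.0],
}
results = semantic_search([1.0, 0.0, 0.0], catalog)
```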

To connect an agent, generate an API key from your Organization settings and point your agent to https://your-proxy.serpwise.ai/serpwise/mcp.
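MCP tool invocations are JSON-RPC 2.0 messages with the method tools/call. The sketch below only builds such a request without sending it. The Bearer authorization header and the `query` argument name are assumptions, since the header scheme and tool input schemas are not documented here. In practice an MCP client SDK would also handle the initialization handshake before any tool call.

```python
import json
import urllib.request

def build_tool_call(endpoint: str, api_key: str, tool: str,
                    arguments: dict, request_id: int = 1) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 tools/call request."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed header scheme
        },
        method="POST",
    )

request = build_tool_call(
    "https://your-proxy.serpwise.ai/serpwise/mcp",
    "YOUR_API_KEY",
    "semantic_product_search",
    {"query": "lightweight red summer dress under $50"},  # assumed argument name
)
```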

UCP Search Widget

The Universal Content Provider (UCP) widget drops an AI-powered smart search bar onto your storefront via a simple <script> tag. Queries run against the same semantic search infrastructure that powers the MCP server.

Automated Cross-Sells

Using the vector embeddings, the edge proxy dynamically queries for semantic similarity and injects a "You Might Also Like" section directly into the page response, bypassing your CMS entirely.

Edge HTML Content Injection

Because Serpwise operates as a reverse proxy, it modifies the HTTP response in-transit before it reaches the visitor or search engine crawler:

Specs Table: A clean, structured table of all extracted specifications
FAQ Section: AI-generated Q&As with FAQPage schema microdata for rich search results
Enhanced JSON-LD: A merged and upgraded version of your product's structured data
Related Products: Semantically similar product blocks injected on-page
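For the FAQ Section item, FAQPage microdata follows the schema.org pattern of a FAQPage containing Question items, each with an accepted Answer. The sketch below renders that structure from question-answer pairs. The surrounding tags (section, h3, p) and the function name are illustrative choices, not the markup Serpwise injects.

```python
from html import escape

def faq_microdata(faqs: list[tuple[str, str]]) -> str:
    """Render question-answer pairs as schema.org FAQPage microdata."""
    items = []
    for question, answer in faqs:
        items.append(
            '<div itemscope itemprop="mainEntity" '
            'itemtype="https://schema.org/Question">'
            f'<h3 itemprop="name">{escape(question)}</h3>'
            '<div itemscope itemprop="acceptedAnswer" '
            'itemtype="https://schema.org/Answer">'
            f'<p itemprop="text">{escape(answer)}</p>'
            "</div></div>"
        )
    return (
        '<section itemscope itemtype="https://schema.org/FAQPage">'
        + "".join(items)
        + "</section>"
    )

markup = faq_microdata([
    ("Is it <waterproof>?", "Yes, up to 30 minutes of light rain."),
    ("What sizes are available?", "EU 36 to 46."),
])
```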

Zero CMS plugins, zero theme edits, zero developer time.

Feed Automation

Feeds are available at: https://your-proxy.serpwise.ai/serpwise/feeds/[feed-name].xml

Feeds are regenerated automatically when:

A product is flagged stale (event-driven, immediate)
You manually trigger regeneration from the dashboard
An AI re-analysis job completes and updates the enriched product data