AI Website Parser — Architecture
Phase 2 of the scraping pipeline: AI-powered data extraction from fetched page content
Extraction Flow
Fetched Job → Read Content → AI Extraction → Validation → Store Results
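The flow above can be sketched as a single processing function. This is a minimal illustration, not the actual implementation; all helper names are assumptions, and `ai_extract` is stubbed where the real code would call an AI provider.

```python
def read_content(job: dict) -> str:
    # Phase 1 fetcher stored the page text in combined_text
    return job["combined_text"]

def ai_extract(text: str) -> dict:
    # Stub: the real implementation sends the text to an AI provider
    # with the shared prompt template and parses the JSON response.
    return {"company": {"name": text.split()[0]}, "employees": []}

def validate(raw: dict) -> dict:
    # Stub: the real implementation applies the post-extraction filters
    return raw

def store_results(job: dict, data: dict) -> None:
    job["result"] = data
    job["status"] = "completed"

def process_job(job: dict) -> None:
    """Run one fetched job through the Phase 2 extraction flow."""
    content = read_content(job)
    data = validate(ai_extract(content))
    store_results(job, data)
```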
AI Providers
- OpenRouter — Qwen 2.5 72B (free tier)
- Google Gemini — gemini-2.0-flash-lite / flash / pro
- Upstage Solar — solar-pro (free tier)
All providers use the same prompt template and return identical JSON schema.
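Because the prompt template and response schema are shared, provider selection can reduce to a config lookup. A sketch under assumptions: the exact model identifiers and the template text are illustrative, not taken from the real config.

```python
# Shared prompt template (illustrative wording)
PROMPT_TEMPLATE = (
    "Extract company info, employees, job listings, social profiles "
    "and external links as JSON from this page:\n{page}"
)

# Per-provider model config; model IDs are assumptions
PROVIDERS = {
    "openrouter": {"model": "qwen/qwen-2.5-72b-instruct:free"},
    "gemini":     {"model": "gemini-2.0-flash-lite"},
    "solar":      {"model": "solar-pro"},
}

def build_request(provider: str, page_text: str) -> dict:
    """Build a provider-agnostic request: only the model name varies."""
    return {
        "model": PROVIDERS[provider]["model"],
        "prompt": PROMPT_TEMPLATE.format(page=page_text),
    }
```

Since every provider receives the same prompt and returns the same schema, providers can be swapped on failure or quota exhaustion without touching the extraction code.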
Job Status State Machine
fetched → extract_processing → completed
                 ↓
           extract_failed (retry up to 3×)
                 ↓
           failed (permanent)
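The transitions above, including the rate-limit reset described under Key Architecture Decisions, can be expressed as one function. A sketch only; field names and the `last_result` encoding are assumptions.

```python
MAX_RETRIES = 3

def next_status(job: dict) -> str:
    """Decide the next job status after an extraction attempt."""
    if job["last_result"] == "ok":
        return "completed"
    if job["last_result"] == "rate_limited":
        # 429 from the provider: requeue without burning a retry
        return "fetched"
    if job["retries"] + 1 >= MAX_RETRIES:
        return "failed"            # permanent after 3 attempts
    return "extract_failed"        # retryable
```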
Post-Extraction Validation
- Filter placeholder emails (noreply@, webmaster@, etc.)
- Move generic emails (info@, kontakt@) to company_contact
- Detect phone numbers misclassified as emails
- Validate social profile URLs
- Anti-testimonial filter: rejects names matched by German "von/bei [company]" ("from/at [company]") patterns, which indicate quoted customers rather than employees
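The email-related filters above can be sketched as a single classifier. The prefix lists use the examples given ("etc." is not expanded), and the phone regex is an illustrative heuristic, not the production rule.

```python
import re

PLACEHOLDER_PREFIXES = ("noreply@", "webmaster@")   # dropped outright
GENERIC_PREFIXES = ("info@", "kontakt@")            # routed to company_contact
# Heuristic: leading + or digit, then 5+ phone-ish characters
PHONE_RE = re.compile(r"^[+\d][\d\s/()-]{5,}$")

def classify_email(value: str) -> str:
    """Route an AI-extracted 'email' value per the validation rules."""
    v = value.strip().lower()
    if PHONE_RE.match(v):
        return "phone"             # phone number misclassified as email
    if v.startswith(PLACEHOLDER_PREFIXES):
        return "drop"              # placeholder address
    if v.startswith(GENERIC_PREFIXES):
        return "company_contact"   # generic inbox, not a person
    return "employee"
```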
Data Extracted
- Company Info: name, description, industry, revenue, headcount
- Employees: name, title, email, phone, social profiles
- Job Listings: title, department, location, seniority
- Social Profiles: LinkedIn, Xing, Instagram, Facebook, etc.
- External Links: partner websites, press mentions
Key Architecture Decisions
- No VPN needed: AI extraction only calls AI APIs (OpenRouter, Gemini, Solar) — no web scraping.
- Content from DB: Reads combined_text stored by Phase 1 fetcher — no re-fetching.
- 500-char minimum: Pages with less than 500 chars of content are skipped to prevent hallucinations.
- NULLs combined_text after success: Saves DB space — raw content only needed during extraction.
- Rate limit handling: On 429 errors, resets job to 'fetched' without incrementing retry count.
- FOR UPDATE SKIP LOCKED: Multiple extractor instances safely compete for jobs.
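The FOR UPDATE SKIP LOCKED pattern lets each extractor instance atomically claim one fetched job; rows locked by a competing instance are skipped instead of blocking. A sketch of the claiming query as a Python constant; the table and column names are assumptions.

```python
# Claim one fetched job for this extractor instance (PostgreSQL).
# Concurrent instances skip each other's locked rows rather than blocking.
CLAIM_JOB_SQL = """
UPDATE jobs
   SET status = 'extract_processing'
 WHERE id = (
         SELECT id
           FROM jobs
          WHERE status = 'fetched'
          ORDER BY id
          LIMIT 1
          FOR UPDATE SKIP LOCKED
       )
RETURNING id, combined_text;
"""
```

After a successful extraction the worker would also set `combined_text` to NULL, per the space-saving decision above.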