Sales Intelligence

AI Website Parser — Architecture

Phase 2 of the scraping pipeline: AI-powered data extraction from fetched page content

Extraction Flow

Fetched Job → Read Content → AI Extraction → Validation → Store Results

AI Providers

  • OpenRouter — Qwen 2.5 72B (free tier)
  • Google Gemini — gemini-2.0-flash-lite / flash / pro
  • Upstage Solar — solar-pro (free tier)

All providers use the same prompt template and return JSON in the same schema.
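One way to picture the shared template is a thin provider interface. The sketch below is illustrative only: PROMPT_TEMPLATE, Provider, extract, and StubProvider are hypothetical names, the prompt wording is invented, and the stub stands in for a real OpenRouter, Gemini, or Solar client.

```python
import json
from typing import Protocol

# Hypothetical prompt wording; the real template is not shown in this document.
PROMPT_TEMPLATE = (
    "Extract company info, employees, job listings, social profiles and "
    "external links from the page text below. Respond with JSON only.\n\n"
    "PAGE TEXT:\n{content}"
)


class Provider(Protocol):
    """Anything that can turn a prompt into a raw JSON string."""

    name: str

    def complete(self, prompt: str) -> str: ...


def extract(provider: Provider, content: str) -> dict:
    """Build the shared prompt, call the provider, and parse its JSON reply."""
    prompt = PROMPT_TEMPLATE.format(content=content)
    return json.loads(provider.complete(prompt))


class StubProvider:
    """Stand-in for a real API client; returns a fixed payload."""

    name = "stub"

    def __init__(self) -> None:
        self.last_prompt = ""

    def complete(self, prompt: str) -> str:
        self.last_prompt = prompt
        return '{"company": {"name": "Acme GmbH"}, "employees": []}'
```

Because every provider receives the same prompt and is expected to answer in the same schema, the calling code does not need to know which provider handled a given job.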

Job Status State Machine

fetched → extract_processing → completed

extract_failed (retry up to 3x)

failed (permanent)
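The statuses above can be sketched as a small transition table. The status names come from the diagram; the exact edges are an assumption, as is the reading of "retry up to 3x" as three retries after the first attempt. JobStatus, ALLOWED, can_transition, and status_after_error are hypothetical names.

```python
from enum import Enum


class JobStatus(str, Enum):
    FETCHED = "fetched"
    EXTRACT_PROCESSING = "extract_processing"
    COMPLETED = "completed"
    EXTRACT_FAILED = "extract_failed"
    FAILED = "failed"


MAX_RETRIES = 3  # "retry up to 3x"

# Assumed edges; the diagram names the states but not every arrow.
ALLOWED = {
    JobStatus.FETCHED: {JobStatus.EXTRACT_PROCESSING},
    JobStatus.EXTRACT_PROCESSING: {
        JobStatus.COMPLETED,
        JobStatus.EXTRACT_FAILED,
        JobStatus.FAILED,
        JobStatus.FETCHED,  # reset on rate limiting
    },
    JobStatus.EXTRACT_FAILED: {JobStatus.EXTRACT_PROCESSING, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),  # terminal
    JobStatus.FAILED: set(),  # terminal
}


def can_transition(src: JobStatus, dst: JobStatus) -> bool:
    """Return True if the assumed table allows moving from src to dst."""
    return dst in ALLOWED[src]


def status_after_error(retries_used: int) -> JobStatus:
    """Pick the status after a failed attempt, given retries already used."""
    if retries_used < MAX_RETRIES:
        return JobStatus.EXTRACT_FAILED  # still eligible for another try
    return JobStatus.FAILED  # permanent
```

Under this reading a job gets one initial attempt plus up to three retries before it is marked failed; if the real limit counts total attempts instead, the comparison shifts by one.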

Post-Extraction Validation

  • Filter placeholder emails (noreply@, webmaster@, etc.)
  • Move generic emails (info@, kontakt@) to company_contact
  • Detect phone numbers misclassified as emails
  • Validate social profile URLs
  • Anti-testimonial filter ("von/bei [company]", German for "from/at [company]")
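The email rules above might look roughly like the following. This is a minimal sketch, not the actual validator: the prefix lists contain only the examples named above (the real lists are longer, per the "etc."), the phone pattern is a guess, and classify_email and route_emails are hypothetical names.

```python
import re

# Only the prefixes named in the list above; the real lists are not shown.
PLACEHOLDER_PREFIXES = ("noreply@", "webmaster@")
GENERIC_PREFIXES = ("info@", "kontakt@")

# Assumed heuristic: digits and common phone punctuation only, at least 6 chars.
PHONE_LIKE = re.compile(r"^\+?[\d\s()./-]{6,}$")


def classify_email(value: str) -> str:
    """Return 'phone', 'drop', 'company_contact', or 'keep' for one value."""
    v = value.strip().lower()
    if PHONE_LIKE.match(v):
        return "phone"  # a phone number that landed in the email field
    if v.startswith(PLACEHOLDER_PREFIXES):
        return "drop"
    if v.startswith(GENERIC_PREFIXES):
        return "company_contact"
    return "keep"


def route_emails(emails: list) -> dict:
    """Group values by classification, discarding placeholder addresses."""
    routed = {"keep": [], "company_contact": [], "phone": []}
    for email in emails:
        label = classify_email(email)
        if label != "drop":
            routed[label].append(email.strip())
    return routed
```

What happens to a value flagged as a phone number (moved to the phone field or discarded) is not stated above, so the sketch only separates it out and leaves that decision to the caller.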

Data Extracted

  • Company Info: name, description, industry, revenue, headcount
  • Employees: name, title, email, phone, social profiles
  • Job Listings: title, department, location, seniority
  • Social Profiles: LinkedIn, Xing, Instagram, Facebook, etc.
  • External Links: partner websites, press mentions
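The field groups above could map onto a result type along these lines. The key and class names here (company, employees, job_listings, ExtractionResult, and so on) are hypothetical; only the field lists themselves come from this section, and the real JSON schema may nest or name things differently.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CompanyInfo:
    name: Optional[str] = None
    description: Optional[str] = None
    industry: Optional[str] = None
    revenue: Optional[str] = None
    headcount: Optional[str] = None


@dataclass
class Employee:
    name: Optional[str] = None
    title: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    social_profiles: List[str] = field(default_factory=list)


@dataclass
class JobListing:
    title: Optional[str] = None
    department: Optional[str] = None
    location: Optional[str] = None
    seniority: Optional[str] = None


@dataclass
class ExtractionResult:
    company: CompanyInfo
    employees: List[Employee] = field(default_factory=list)
    job_listings: List[JobListing] = field(default_factory=list)
    social_profiles: List[str] = field(default_factory=list)
    external_links: List[str] = field(default_factory=list)


def parse_extraction(raw: str) -> ExtractionResult:
    """Parse a provider's JSON reply; unknown keys raise TypeError."""
    data = json.loads(raw)
    return ExtractionResult(
        company=CompanyInfo(**(data.get("company") or {})),
        employees=[Employee(**e) for e in data.get("employees") or []],
        job_listings=[JobListing(**j) for j in data.get("job_listings") or []],
        social_profiles=list(data.get("social_profiles") or []),
        external_links=list(data.get("external_links") or []),
    )
```

Every field is optional in this sketch because a given page may supply only part of the data, for example a careers page with job listings but no employee details.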

Key Architecture Decisions

  • No VPN needed: AI extraction only calls AI APIs (OpenRouter, Gemini, Solar) — no web scraping.
  • Content from DB: Reads combined_text stored by Phase 1 fetcher — no re-fetching.
  • 500-char minimum: Pages with fewer than 500 characters of content are skipped to reduce the risk of hallucinated output.
  • NULLs combined_text after success: Saves DB space, since the raw content is needed only during extraction.
  • Rate limit handling: On HTTP 429 responses, the job is reset to 'fetched' without incrementing the retry count.
  • FOR UPDATE SKIP LOCKED: Multiple extractor instances safely compete for jobs.
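Several of these decisions can be sketched together. The following assumes a PostgreSQL-style database; the table name jobs and the columns id and retry_count are hypothetical, while combined_text and the status values come from this document. The helper names should_skip and update_after_attempt are also invented for illustration.

```python
from typing import Optional

MIN_CONTENT_CHARS = 500  # the 500-char minimum

# Claim one job atomically. SKIP LOCKED makes concurrent extractors pass over
# rows another transaction has already locked instead of waiting on them.
# Assumption: only 'fetched' jobs are claimed here; how retries of
# 'extract_failed' jobs are picked up is not specified in this document.
CLAIM_SQL = """
UPDATE jobs
SET status = 'extract_processing'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'fetched'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, combined_text, retry_count;
"""


def should_skip(content: Optional[str]) -> bool:
    """True when there is too little text to extract from reliably."""
    return content is None or len(content) < MIN_CONTENT_CHARS


def update_after_attempt(ok: bool, http_status: Optional[int], retry_count: int) -> dict:
    """Return the column values to write back after one extraction attempt."""
    if ok:
        # Success: drop the raw content to save space.
        return {"status": "completed", "combined_text": None, "retry_count": retry_count}
    if http_status == 429:
        # Rate limited: requeue without spending a retry.
        return {"status": "fetched", "retry_count": retry_count}
    # Other errors: count the failure. Promotion to permanent 'failed' once
    # the retry limit is reached is omitted for brevity.
    return {"status": "extract_failed", "retry_count": retry_count + 1}
```

CLAIM_SQL would be executed inside a transaction by each extractor instance; two instances running it at the same moment receive different rows, or no row if the queue is empty.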