AI Website Parser — Architecture
Phase 2 of the scraping pipeline: AI-powered data extraction from fetched page content
Extraction Flow
Fetched Job → Read Content → AI Extraction → Validation → Store Results
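The flow above can be sketched as a single processing function. This is a minimal illustration, not the actual implementation; all helper names are assumptions, and `ai_extract` is stubbed where the real code would call an AI provider.

```python
def read_content(job: dict) -> str:
    # Phase 1 fetcher stored the page text in combined_text
    return job["combined_text"]

def ai_extract(text: str) -> dict:
    # Stub: the real implementation sends the text to an AI provider
    # with the shared prompt template and parses the JSON response.
    return {"company": {"name": text.split()[0]}, "employees": []}

def validate(raw: dict) -> dict:
    # Stub: the real implementation applies the post-extraction filters
    return raw

def store_results(job: dict, data: dict) -> None:
    job["result"] = data
    job["status"] = "completed"

def process_job(job: dict) -> None:
    """Run one fetched job through the Phase 2 extraction flow."""
    content = read_content(job)
    data = validate(ai_extract(content))
    store_results(job, data)
```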
AI Providers
- OpenRouter — Qwen 2.5 72B (free tier)
- Google Gemini — gemini-2.0-flash-lite / flash / pro
- Upstage Solar — solar-pro (free tier)
All providers use the same prompt template and return identical JSON schema.
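Because the prompt template and response schema are shared, provider selection can reduce to a config lookup. A sketch under assumptions: the exact model identifiers and the template text are illustrative, not taken from the real config.

```python
# Shared prompt template (illustrative wording)
PROMPT_TEMPLATE = (
    "Extract company info, employees, job listings, social profiles "
    "and external links as JSON from this page:\n{page}"
)

# Per-provider model config; model IDs are assumptions
PROVIDERS = {
    "openrouter": {"model": "qwen/qwen-2.5-72b-instruct:free"},
    "gemini":     {"model": "gemini-2.0-flash-lite"},
    "solar":      {"model": "solar-pro"},
}

def build_request(provider: str, page_text: str) -> dict:
    """Build a provider-agnostic request: only the model name varies."""
    return {
        "model": PROVIDERS[provider]["model"],
        "prompt": PROMPT_TEMPLATE.format(page=page_text),
    }
```

Since every provider receives the same prompt and returns the same schema, providers can be swapped on failure or quota exhaustion without touching the extraction code.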
Job Status State Machine
fetched → extract_processing → completed
                 ↓
           extract_failed (retry up to 3×)
                 ↓
           failed (permanent)
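The transitions above, including the rate-limit reset described under Key Architecture Decisions, can be expressed as one function. A sketch only; field names and the `last_result` encoding are assumptions.

```python
MAX_RETRIES = 3

def next_status(job: dict) -> str:
    """Decide the next job status after an extraction attempt."""
    if job["last_result"] == "ok":
        return "completed"
    if job["last_result"] == "rate_limited":
        # 429 from the provider: requeue without burning a retry
        return "fetched"
    if job["retries"] + 1 >= MAX_RETRIES:
        return "failed"            # permanent after 3 attempts
    return "extract_failed"        # retryable
```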
Post-Extraction Validation
- Filter placeholder emails (noreply@, webmaster@, etc.)
- Move generic emails (info@, kontakt@) to company_contact
- Detect phone numbers misclassified as emails
- Validate social profile URLs
- Anti-testimonial filter: rejects names matched by German "von/bei [company]" ("from/at [company]") patterns, which indicate quoted customers rather than employees
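The email-related filters above can be sketched as a single classifier. The prefix lists use the examples given ("etc." is not expanded), and the phone regex is an illustrative heuristic, not the production rule.

```python
import re

PLACEHOLDER_PREFIXES = ("noreply@", "webmaster@")   # dropped outright
GENERIC_PREFIXES = ("info@", "kontakt@")            # routed to company_contact
# Heuristic: leading + or digit, then 5+ phone-ish characters
PHONE_RE = re.compile(r"^[+\d][\d\s/()-]{5,}$")

def classify_email(value: str) -> str:
    """Route an AI-extracted 'email' value per the validation rules."""
    v = value.strip().lower()
    if PHONE_RE.match(v):
        return "phone"             # phone number misclassified as email
    if v.startswith(PLACEHOLDER_PREFIXES):
        return "drop"              # placeholder address
    if v.startswith(GENERIC_PREFIXES):
        return "company_contact"   # generic inbox, not a person
    return "employee"
```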
Data Extracted
- Company Info: name, description, industry, revenue, headcount
- Employees: name, title, email, phone, social profiles
- Job Listings: title, department, location, seniority
- Social Profiles: LinkedIn, Xing, Instagram, Facebook, etc.
- External Links: partner websites, press mentions
Key Architecture Decisions
- No VPN needed: AI extraction only calls AI APIs (OpenRouter, Gemini, Solar) — no web scraping.
- Content from DB: Reads combined_text stored by Phase 1 fetcher — no re-fetching.
- 500-char minimum: Pages with less than 500 chars of content are skipped to prevent hallucinations.
- NULLs combined_text after success: Saves DB space — raw content only needed during extraction.
- Rate limit handling: On 429 errors, resets job to 'fetched' without incrementing retry count.
- FOR UPDATE SKIP LOCKED: Multiple extractor instances safely compete for jobs.
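The FOR UPDATE SKIP LOCKED pattern lets each extractor instance atomically claim one fetched job; rows locked by a competing instance are skipped instead of blocking. A sketch of the claiming query as a Python constant; the table and column names are assumptions.

```python
# Claim one fetched job for this extractor instance (PostgreSQL).
# Concurrent instances skip each other's locked rows rather than blocking.
CLAIM_JOB_SQL = """
UPDATE jobs
   SET status = 'extract_processing'
 WHERE id = (
         SELECT id
           FROM jobs
          WHERE status = 'fetched'
          ORDER BY id
          LIMIT 1
          FOR UPDATE SKIP LOCKED
       )
RETURNING id, combined_text;
"""
```

After a successful extraction the worker would also set `combined_text` to NULL, per the space-saving decision above.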