Website Scraper Architecture
AI-powered extraction of employees, social profiles, and contact info from company websites
Internal Components
Service Configuration
Replicas: 5
VPN: Gluetun (Frankfurt)
Network Mode: service:vpn
AI: OpenRouter / Gemini / Solar
Database Outputs
websites: Company metadata
employees: Contact persons
social_profiles: Social media links
tracked_pages: Pages for change detection
Pipeline Position
Upstream: Result Processor (polls every 10 s)
Downstream: Social Collectors, Change Detector
Component Breakdown
Page Fetcher
Fetches the main page plus all internal links. Rotates the User-Agent header across 10 variants.
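The link discovery and user-agent rotation might look roughly like the sketch below. It uses only the standard library and does no network I/O. The names `USER_AGENTS`, `next_headers`, and `internal_links` are hypothetical, the user-agent strings are placeholders (three rather than the service's 10), and round-robin rotation is an assumption; the real service may pick variants differently.

```python
import itertools
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Placeholder user-agent pool (assumption: the real service has 10 variants).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleAgent/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleAgent/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleAgent/3.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def next_headers():
    """Return request headers with the next user agent (round-robin)."""
    return {"User-Agent": next(_ua_cycle)}


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(base_url, html):
    """Return unique absolute http(s) links on the same host as base_url."""
    parser = _LinkParser()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(base_url, href).split("#", 1)[0]
        parsed = urlparse(absolute)
        if (
            parsed.scheme in ("http", "https")
            and parsed.netloc == base_host
            and absolute not in seen
        ):
            seen.add(absolute)
            result.append(absolute)
    return result
```

Each outgoing request would pass `next_headers()` to whatever HTTP client the service uses, and `internal_links` would be applied to the main page's HTML to build the list of pages to fetch next.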
AI Extractor
Multi-provider fallback: OpenRouter (Qwen3-235B) → Gemini → Solar. Applies a 500-character minimum content threshold.
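A minimal sketch of such a fallback chain follows. It assumes the threshold is checked before any provider is called and that a provider failure simply moves on to the next one; neither detail is confirmed above. `extract_with_fallback` and the provider callables are hypothetical stand-ins, not real client code for OpenRouter, Gemini, or Solar.

```python
from typing import Callable, Optional, Sequence, Tuple

MIN_CONTENT_CHARS = 500  # threshold stated in the component description

# A provider is a (name, callable) pair; the callable takes page text
# and returns extracted data as a dict. Both are hypothetical here.
Provider = Tuple[str, Callable[[str], dict]]


def extract_with_fallback(text: str, providers: Sequence[Provider]) -> Optional[dict]:
    """Try each provider in order and return the first non-empty result."""
    # Assumption: content below the threshold is never sent to a provider.
    if len(text.strip()) < MIN_CONTENT_CHARS:
        return None
    for name, call in providers:
        try:
            result = call(text)
        except Exception:
            # Any provider error (timeout, quota, bad response) triggers fallback.
            continue
        if result:
            return {"provider": name, "data": result}
    return None
```

The provider list would be passed in priority order, for example `[("openrouter", ...), ("gemini", ...), ("solar", ...)]`, so the ordering stays a configuration concern rather than being hard-coded.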
Employee Extractor
Extracts contacts with roles, emails, and phone numbers. An anti-testimonial filter catches "von/bei [company]" patterns (German for "from/at [company]").
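One way the anti-testimonial check could work is sketched below. The regex, the function name `is_testimonial`, and the exemption for the site's own company are all assumptions made for illustration; the actual filter rules are not spelled out here. The check is meant for a role or attribution string, not for a person's name, since German surnames can contain "von".

```python
import re

# Assumed pattern: "von" or "bei", an optional article, then a capitalized name.
TESTIMONIAL_RE = re.compile(
    r"\b(?:von|bei)\s+(?:(?:der|dem|den)\s+)?([A-ZÄÖÜ][^,;\n]*)"
)


def is_testimonial(attribution: str, own_company: str = "") -> bool:
    """Return True if the text attributes the person to another company."""
    match = TESTIMONIAL_RE.search(attribution)
    if match is None:
        return False
    named = match.group(1).strip().lower()
    own = own_company.strip().lower()
    # Assumption: a mention of the site's own company is not a testimonial.
    return not (own and named.startswith(own))
```

For example, `is_testimonial("Geschäftsführer bei Kunde GmbH", own_company="Acme")` flags a customer quote, while `is_testimonial("CTO bei Acme AG", own_company="Acme")` lets a real employee through.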
Social Link Extractor
Detects and cleans social media URLs via regex. Supports 10+ platforms.
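The sketch below shows one possible regex-based detection and cleaning step. The platform list is an illustrative subset, and the cleaning rules (force https, normalize to a `www.` host, drop query strings, fragments, and trailing slashes) are assumptions. `SOCIAL_RE` and `extract_social_links` are hypothetical names.

```python
import re

# Illustrative subset of platforms; the real extractor supports 10+.
SOCIAL_RE = re.compile(
    r"https?://(?:[\w-]+\.)*"
    r"(linkedin\.com|xing\.com|facebook\.com|instagram\.com|youtube\.com|twitter\.com)"
    r"(?![\w.-])"               # the domain must end here (rejects linkedin.com.evil.org)
    r"(/[^\s\"'<>?#]*)?",       # path only, stopping before any query or fragment
    re.IGNORECASE,
)


def extract_social_links(html: str) -> dict:
    """Return {platform: cleaned_url}, keeping the first profile link per platform."""
    links = {}
    for match in SOCIAL_RE.finditer(html):
        domain = match.group(1).lower()
        path = (match.group(2) or "").rstrip("/")
        if not path:
            continue  # bare domain without a profile path
        platform = domain.split(".")[0]
        links.setdefault(platform, f"https://www.{domain}{path}")
    return links
```

Keeping only the first link per platform is a simplification; a site that links several profiles on the same platform would need a list per platform instead.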
Email Classifier
Generic emails (info@, kontakt@) → company_contact. Personal emails → employee records.
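The routing rule can be sketched as a lookup on the local part of the address. Only `info` and `kontakt` come from the description above; the other entries in `GENERIC_LOCAL_PARTS` and the function name `classify_email` are assumptions added for illustration.

```python
# "info" and "kontakt" are from the description; the rest are assumed examples.
GENERIC_LOCAL_PARTS = {"info", "kontakt", "contact", "office", "mail", "service"}


def classify_email(address: str) -> str:
    """Return "company_contact" for generic mailboxes, otherwise "employee"."""
    local_part = address.strip().lower().partition("@")[0]
    return "company_contact" if local_part in GENERIC_LOCAL_PARTS else "employee"
```

With this rule, `classify_email("Info@example.de")` is routed to the company contact and `classify_email("anna.schmidt@example.de")` to an employee record.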
Database Writer
Writes to websites, employees, and social_profiles. Handles deduplication and tracked-page registration.
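Deduplication on write could be handled with uniqueness constraints, as in the sketch below. The database engine is not stated here, so the sketch uses in-memory SQLite purely to stay self-contained. The table names come from the description, but the column names, the uniqueness keys, and `write_results` are assumptions.

```python
import sqlite3

# Hypothetical minimal schema; the real tables likely carry more columns.
SCHEMA = """
CREATE TABLE employees (
    website_id INTEGER NOT NULL,
    name       TEXT NOT NULL,
    email      TEXT,
    UNIQUE (website_id, name)
);
CREATE TABLE tracked_pages (
    website_id INTEGER NOT NULL,
    url        TEXT NOT NULL,
    UNIQUE (website_id, url)
);
"""


def write_results(conn, website_id, employees, page_urls):
    """Insert employees and tracked pages, silently skipping duplicates."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO employees (website_id, name, email) VALUES (?, ?, ?)",
            [(website_id, e["name"], e.get("email")) for e in employees],
        )
        conn.executemany(
            "INSERT OR IGNORE INTO tracked_pages (website_id, url) VALUES (?, ?)",
            [(website_id, url) for url in page_urls],
        )
```

Because the constraints live in the schema, running the same scrape twice leaves the row counts unchanged. On PostgreSQL the equivalent statement would be `INSERT ... ON CONFLICT DO NOTHING`.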