LLM-Driven Smart Web Crawler Development

Client: AI | Published: 05.12.2025
Budget: $250

We need an LLM-driven web crawler that behaves like a smart agent: it should explore a target site, prioritize logging in when it finds auth flows, fill forms with traceable values, stay within the allowed domain, and record both the network traffic and the actions it takes. It should keep track of what it has already visited so it doesn't repeat itself, and it should stop when it is no longer making progress.

Key behaviors and expectations:
- Planning + execution loop: use an LLM to propose small batches of high-priority actions (auth/login first, then forms with submits, then navigation links, then other clicks) and execute them step by step in the browser (see the plan/execute sketch below).
- Coverage-aware: maintain a memory of interactive elements seen on each page and skip repeats; detect when a page looks the same as one seen before and move on (see the coverage-memory sketch below).
- No hardcoding: navigation comes from links discovered on the current page; no fixed URL lists or regex scraping. Stay within the allowed domain/path patterns and avoid logout and off-domain links (see the scope-filter sketch below).
- Forms and payloads: when forms appear, fill them with marker-tagged values (and credentials if provided) so submissions are traceable. Keep a log of what was submitted where.
- Logging and outputs: capture action logs and network logs in a usable format; record coverage progress per page so we know what was explored (see the network-capture sketch below).
- Controls and caps: limit how many actions are proposed at once and how many actions are taken per page, and add small random waits to avoid hammering a site.
- Configuration: driven by environment variables (start URL, allowed domain, max pages/actions, logging paths, headless mode, marker prefix); see the config sketch below.

Validation target:
- Run headless against a test site (e.g., testphp.vulnweb.com) and show that the crawler can discover multiple pages, attempt login if available, submit forms with marker-tagged payloads, and produce action/network logs while respecting domain boundaries.

Deliverables:
- A runnable crawler script that ties the LLM planner to browser actions with the behaviors above.
- Logging outputs (actions, network, payloads) and a way to see coverage per page.
- Clear configuration via env vars so it can be pointed at any target site and allowed domain.

Essentially, we need to adapt https://github.com/browser-use/browser-use/tree/main/browser_use so that it crawls a website instead of performing one-off tasks we ask for. A planner decides what to do next and what to prioritize (mainly authenticated sections), and browser-use handles execution. Think of it as a task queue of jobs for browser-use, focused on crawling the entire website and intercepting all HTTP network requests triggered by each interaction (clicks, form submissions, and so on).
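
To make the configuration requirement concrete, here is a minimal sketch of an env-var-driven config object. The variable names (CRAWLER_START_URL, CRAWLER_ALLOWED_DOMAIN, and so on) and the defaults are illustrative assumptions, not a fixed contract.

```python
import os
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    start_url: str
    allowed_domain: str
    max_pages: int
    max_actions_per_page: int
    headless: bool
    marker_prefix: str
    action_log_path: str
    network_log_path: str

    @classmethod
    def from_env(cls) -> "CrawlerConfig":
        # Required values raise a KeyError if missing; the rest fall back to defaults.
        return cls(
            start_url=os.environ["CRAWLER_START_URL"],
            allowed_domain=os.environ["CRAWLER_ALLOWED_DOMAIN"],
            max_pages=int(os.environ.get("CRAWLER_MAX_PAGES", "50")),
            max_actions_per_page=int(os.environ.get("CRAWLER_MAX_ACTIONS_PER_PAGE", "10")),
            headless=os.environ.get("CRAWLER_HEADLESS", "true").lower() == "true",
            marker_prefix=os.environ.get("CRAWLER_MARKER_PREFIX", "XCRWL"),
            action_log_path=os.environ.get("CRAWLER_ACTION_LOG", "actions.jsonl"),
            network_log_path=os.environ.get("CRAWLER_NETWORK_LOG", "network.jsonl"),
        )
```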
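For the domain boundary, a small helper along these lines would keep the crawler on the allowed host and away from logout links. The logout keyword list is an assumed heuristic and would likely need tuning per target; path-pattern filtering could be layered on top.

```python
from urllib.parse import urljoin, urlparse

LOGOUT_HINTS = ("logout", "signout", "log-out", "sign-out")

def in_scope(url: str, base_url: str, allowed_domain: str) -> bool:
    # Resolve relative links against the current page before checking scope.
    absolute = urljoin(base_url, url)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return False
    # Accept the allowed domain itself and its subdomains, nothing else.
    if not (parsed.netloc == allowed_domain or parsed.netloc.endswith("." + allowed_domain)):
        return False
    # Skip anything that looks like a logout action so the session survives.
    lowered = absolute.lower()
    return not any(hint in lowered for hint in LOGOUT_HINTS)
```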
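For coverage-awareness, one plausible approach is to fingerprint each page by the set of interactive elements it exposes and to remember which elements have already been acted on. This is only a sketch of that idea, not the required design.

```python
import hashlib

class CoverageMemory:
    def __init__(self) -> None:
        self.seen_pages: set[str] = set()
        self.seen_elements: set[str] = set()

    @staticmethod
    def _fingerprint(items: list[str]) -> str:
        # Order-insensitive hash of the page's interactive elements.
        return hashlib.sha256("\n".join(sorted(items)).encode()).hexdigest()

    def page_is_new(self, interactive_selectors: list[str]) -> bool:
        fp = self._fingerprint(interactive_selectors)
        if fp in self.seen_pages:
            return False  # page looks the same as one already explored
        self.seen_pages.add(fp)
        return True

    def element_is_new(self, url: str, selector: str) -> bool:
        key = f"{url}::{selector}"
        if key in self.seen_elements:
            return False  # already clicked/filled this element on this page
        self.seen_elements.add(key)
        return True
```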
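The plan/execute loop could be organized as a priority queue: the LLM planner proposes a small batch of actions, the executor performs them through browser-use, and newly discovered pages feed more actions back into the queue. `planner.propose` and `executor.run` below are hypothetical interfaces standing in for the LLM call and the browser-use integration; the priority ordering mirrors the brief (auth first, then form submits, then navigation, then other clicks).

```python
import heapq
import itertools
import random
import time

PRIORITY = {"login": 0, "form_submit": 1, "navigate": 2, "click": 3}

def crawl(cfg, planner, executor, memory):
    counter = itertools.count()  # tie-breaker so heapq never has to compare dicts
    queue: list[tuple[int, int, dict]] = []
    heapq.heappush(queue, (PRIORITY["navigate"], next(counter),
                           {"type": "navigate", "target": cfg.start_url}))
    pages_visited = 0
    while queue and pages_visited < cfg.max_pages:
        _, _, action = heapq.heappop(queue)
        page_state = executor.run(action)          # perform the click/fill/navigate
        time.sleep(random.uniform(0.5, 2.0))       # small random wait between actions
        if not memory.page_is_new(page_state.interactive_selectors):
            continue                               # looks like a page we already covered
        pages_visited += 1
        # Ask the planner for a capped batch of prioritized follow-up actions.
        proposed = planner.propose(page_state, max_actions=cfg.max_actions_per_page)
        for item in proposed:
            heapq.heappush(queue, (PRIORITY.get(item["type"], 4), next(counter), item))
```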
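Since browser-use drives a Chromium instance via Playwright/CDP under the hood, one way to capture per-interaction network traffic and submit marker-tagged values is to hook Playwright's request/response events on the page object. How this gets wired into the final browser-use-based executor is left to the implementer; the snippet below only illustrates the mechanism, and the marker format is an assumption.

```python
import json
import uuid
from playwright.sync_api import sync_playwright

MARKER_PREFIX = "XCRWL"  # assumed to come from CRAWLER_MARKER_PREFIX in practice

def marker_value(field_name: str) -> str:
    # Traceable payload: prefix + field name + short unique id.
    return f"{MARKER_PREFIX}-{field_name}-{uuid.uuid4().hex[:8]}"

def demo(start_url: str, network_log_path: str) -> None:
    with sync_playwright() as p, open(network_log_path, "a") as log:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Append every request/response as one JSON line to the network log.
        page.on("request", lambda req: log.write(json.dumps({
            "event": "request", "method": req.method,
            "url": req.url, "post_data": req.post_data}) + "\n"))
        page.on("response", lambda res: log.write(json.dumps({
            "event": "response", "status": res.status, "url": res.url}) + "\n"))
        page.goto(start_url)
        # Fill the first text input (if any) with a traceable marker value.
        inputs = page.locator("input[type=text]")
        if inputs.count() > 0:
            inputs.first.fill(marker_value("input0"))
        browser.close()

# Example validation run against the test target from the brief:
#   demo("http://testphp.vulnweb.com/", "network.jsonl")
```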