Spanish PDF Data & Web Extraction -- 3

Budget: $100

I need a fully reproducible, two-phase pipeline that starts by mining Spanish-language PDF catalogs and ends with a consolidated pricing intelligence file.

Phase 1 is all about precise PDF parsing. From each file you will pull the product name and its key features. The cleaned, well-structured data should arrive in a neatly formatted Excel workbook.

Phase 2 takes every extracted product name and submits it as a Google query, restricted to a single domain that I will supply. Analyse only the first five organic results per query. From each URL you must capture one accurate price (five per product in total), the brand, and the full product description. Similarity between each extracted product and its web results must also be demonstrated with Levenshtein distance, cosine similarity, or another transparent metric.

Large language models are encouraged for parsing, enrichment, and similarity scoring, but the final workflow must be repeatable end-to-end on my side. Think Python, Pandas, PyPDF, BeautifulSoup or Playwright, plus whichever LLM API you prefer.

Deliverables
• Excel (.xlsx) and JSON (.json) files containing all cleaned PDF data, web-extracted pricing, and similarity notes
• A README.md that walks through the environment setup, key libraries, model calls, and step-by-step commands to rerun the job

Acceptance will hinge on:
• No missing or misaligned columns
• Five verifiably correct prices per product
• Clear audit trail in the README so I can replicate results without guesswork

If this sounds straightforward, let’s get moving.
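To illustrate what a "transparent metric" could look like in Phase 2, here is a minimal standard-library sketch of the two metrics named above. It is only one possible approach, not a required implementation. The function names (`normalize`, `levenshtein`, `levenshtein_ratio`, `cosine_similarity`), the accent-stripping step, and the use of character trigrams for the cosine vectors are all assumptions made for illustration.

```python
import math
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip accents, collapse whitespace.

    Assumption: accent-insensitive matching is wanted for Spanish names.
    Note that this also maps 'ñ' to 'n'.
    """
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1, where 1.0 means identical after normalizing."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between character n-gram count vectors."""
    def ngrams(s: str) -> Counter:
        s = normalize(s)
        return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 1)))

    va, vb = ngrams(a), ngrams(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical product names, for illustration only.
    pdf_name = "Taladro Eléctrico 500W"
    web_title = "taladro electrico 500 w"
    print(levenshtein_ratio(pdf_name, web_title))
    print(cosine_similarity(pdf_name, web_title))
```

Both scores could be written as separate columns next to each of the five results, so that every match in the workbook can be checked by hand.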

Python
