Spanish Procurement PDF Data Extraction

Budget: $100

I am seeking a freelancer to build a fully automatic data extraction and enrichment pipeline for Spanish procurement PDF documents.

Scope of Work

Phase 1: PDF Extraction
- Extract product names and key features from Spanish PDF files.
- Clean and structure the data.
- Output results in Excel (.xlsx) format.

Phase 2: Web Search & Price Extraction
- Use each product name as a Google search query, restricted to a specific domain.
- Analyze the top 5 search results per product.
- From each URL, extract with high precision:
  - Price (5 accurate prices per product are required)
  - Brand
  - Product features and description
- Identify similar products, not only exact matches.
- Measure similarity using methods such as Levenshtein distance or cosine similarity.

Deliverables
- Final datasets in Excel (.xlsx) and JSON (.json) formats.
- A detailed README.md explaining the full workflow, tools, and how to reproduce the process.

Note: price extraction must be precise and complete. Each product requires 5 successfully scraped URLs.

Technical Notes
- Use of LLMs for extraction, similarity analysis, and enrichment is highly recommended.
- The solution must be accurate, efficient, and fully reproducible.
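
To illustrate the similarity requirement, here is a minimal sketch of the two measures named above, using only the Python standard library. The function names, the word-level tokenization, and the sample product names are assumptions made for illustration and are not part of the specification. A real solution may prefer an established library or LLM embeddings.

```python
import math
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the edit distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; \\w also matches accented Spanish letters."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words term-count vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical product names, for demonstration only.
    pdf_name = "Silla de oficina ergonómica"
    web_name = "Silla ergonómica de oficina con reposabrazos"
    print(levenshtein_similarity(pdf_name.lower(), web_name.lower()))
    print(cosine_similarity(pdf_name, web_name))
```

Levenshtein similarity works on characters, so it is sensitive to word order. The bag-of-words cosine score ignores word order, which tends to suit product titles whose terms are rearranged between the PDF and the web listing.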

Python
