I am launching a new project that requires pulling large volumes of text data directly from two sources: public-facing websites that publish reports and a collection of PDF documents. Every piece of extracted information has to end up in a neatly structured Excel (XLSX) file so my internal team can run their analytics without additional cleaning.

Here’s the flow I have in mind. Your script (Python with BeautifulSoup, Scrapy, or any framework you prefer) should crawl the designated report sites, locate the specific sections I flag, and capture the text exactly as it appears online. In parallel, the same script — or a companion utility using pdfminer, PyPDF2, or tabula — will parse the PDFs I supply, extracting all relevant text blocks while preserving basic structural cues such as headings and bullet points.

Deliverables I need from you:

• A fully commented scraping script (or modular set of scripts) that handles both website and PDF inputs
• One consolidated XLSX file per run, with columns aligned to the schema I’ll provide (title, publication date, body text, source URL/file name, and so on)
• A brief README explaining setup, required libraries, and how to rerun the job on new sources

I’ll supply example URLs, sample PDFs, and the target column schema once we begin. Clean, reliable extraction and an error log so I can trace any failures will be the key acceptance criteria. If this matches your skill set, let’s get started.
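To make the flow concrete, here is a minimal stdlib-only sketch of the extraction-and-export pipeline. The selector (`<section class="report">`), column names, and example URL are placeholders, not the real schema; a production version would use BeautifulSoup/Scrapy for crawling and openpyxl for true XLSX output (the `csv` module stands in for the spreadsheet writer here), but the shape of the pipeline would be the same.

```python
import csv
import io
import logging
from html.parser import HTMLParser

log = logging.getLogger("extractor")  # failures land in the error log


class SectionExtractor(HTMLParser):
    """Collects visible text inside flagged sections of a report page.

    The tag/class pair below is a placeholder; the real selectors come
    from the sections flagged on each report site.
    """

    def __init__(self, tag="section", cls="report"):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.capturing = False
        self.chunks = []

    def handle_starttag(self, t, attrs):
        if t == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.capturing = True

    def handle_endtag(self, t):
        if t == self.tag:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.chunks.append(data.strip())


def extract_row(html, source):
    """Turn one fetched page into a row dict matching the agreed schema."""
    parser = SectionExtractor()
    parser.feed(html)
    return {"source": source, "body_text": " ".join(parser.chunks)}


def write_rows(rows, fh, fieldnames=("source", "body_text")):
    """CSV stand-in for the consolidated per-run XLSX writer."""
    writer = csv.DictWriter(fh, fieldnames=list(fieldnames))
    writer.writeheader()
    for row in rows:
        try:
            writer.writerow(row)
        except Exception:
            # Acceptance criterion: every failure is traceable in the log.
            log.error("failed to write row: %r", row)


if __name__ == "__main__":
    page = (
        '<html><section class="report"><h2>Q3</h2>'
        "<p>Revenue rose.</p></section><p>site nav</p></html>"
    )
    row = extract_row(page, "https://example.com/q3")  # placeholder URL
    out = io.StringIO()
    write_rows([row], out)
    print(out.getvalue())
```

The PDF side would feed the same `write_rows` step: a pdfminer (or PyPDF2) pass yields text blocks per file, which become row dicts with the file name in the source column, so both inputs consolidate into one output per run.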