PDF Invoice Data Extraction Tool

I need a small application that reads invoice-style PDFs and pulls every line item (description, quantity, unit price, extended price) into an Excel workbook. The invoices come from multiple vendors, so layouts and fonts vary; accuracy across these differing templates is essential.

Here’s how I picture the flow: I drop one or many PDFs into an input folder, run the script (Python would be ideal, but I’m open to other stacks), and receive an .xlsx file populated with the captured line items. A simple CLI is fine, though a lightweight GUI would be a welcome bonus.

Behind the scenes you’re free to combine tools like pdfplumber, Tesseract, Amazon Textract, Azure Form Recognizer, or any other layout-aware OCR/LLM approach, as long as the final solution can be installed and executed locally without heavy monthly fees.

Deliverables
• Well-commented source code and any required models
• README with setup steps and usage examples
• A sample run on my initial batch of invoices proving the extraction works
• Optional: config file so I can tweak column names or add new vendors later

Acceptance criteria
• Captures item details and prices from at least 95% of test invoices, including multi-page files
• Produces a clean Excel output: one structured table per invoice, ready for further analysis
• Handles unseen vendor templates with minimal additional training or rule tweaks

If this aligns with your skill set, tell me which libraries you’d leverage and any prior experience parsing complex PDFs.
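To make the expected output concrete, here is a rough sketch of the line-item parsing step, not a required design. It assumes the page text has already been extracted (for example with pdfplumber or an OCR pass) and that each item sits on one line ending in quantity, unit price, and extended price. The names `LineItem`, `LINE_RE`, and `parse_line_items` are hypothetical, and real vendor layouts will likely need different or additional rules.

```python
import re
from decimal import Decimal
from typing import List, NamedTuple, Tuple


class LineItem(NamedTuple):
    description: str
    quantity: Decimal
    unit_price: Decimal
    extended_price: Decimal


def _num(name: str) -> str:
    # One numeric token: optional "$", digits with optional thousands commas,
    # optional decimal part, captured under the given group name.
    return rf"\$?(?P<{name}>-?\d[\d,]*(?:\.\d+)?)"


# Assumed layout: free-text description followed by exactly three numbers.
LINE_RE = re.compile(
    rf"^(?P<desc>.+?)\s+{_num('qty')}\s+{_num('unit')}\s+{_num('ext')}\s*$"
)


def _to_decimal(token: str) -> Decimal:
    return Decimal(token.replace(",", ""))


def parse_line_items(text: str) -> Tuple[List[LineItem], List[str]]:
    """Return (items, flagged_lines).

    A line is accepted only if quantity * unit price equals the extended
    price to the cent; lines that match the pattern but fail this check
    are returned in flagged_lines for manual review.
    """
    items: List[LineItem] = []
    flagged: List[str] = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # headers, totals, and other non-item lines
        qty = _to_decimal(match["qty"])
        unit = _to_decimal(match["unit"])
        ext = _to_decimal(match["ext"])
        if (qty * unit).quantize(Decimal("0.01")) == ext:
            items.append(LineItem(match["desc"], qty, unit, ext))
        else:
            flagged.append(raw)
    return items, flagged
```

The arithmetic check is one simple way to catch OCR misreads before they reach the workbook; the accepted rows could then be written to one worksheet per invoice with a library such as openpyxl.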
