Long-Term OCR Developer for Data Extraction from Documents

I’m building a production-grade pipeline that turns messy invoices and packing lists into clean, structured data. The job goes far beyond running standard OCR: irregular layouts, low-resolution scans, and deeply nested tables are the norm here, so every stage (pre-processing, recognition, post-OCR validation, and normalization) needs to be engineered for resilience.

The core information I must capture on every document is clear: item details, pricing information, and shipping details. Whatever approach you prefer (Tesseract, AWS Textract, Google Vision, computer-vision preprocessing with OpenCV, or a custom deep-learning model), what matters is that the final output is consistently accurate and delivered through a reproducible workflow (CLI script, API, or microservice).

This is a long-term build. I’ll need you available 3–5 full days each week, communicating promptly and sticking to deadlines we agree on. In return, there’s room for ongoing, well-compensated collaboration as the system scales to new document types and higher volumes.

Deliverables I expect in each milestone:
• A working extraction module that handles real samples with inconsistent layouts, low-quality scans, and complex tables
• Post-processing logic that validates fields and normalizes them to our schema
• Clear documentation and sample output so I can plug the module straight into the broader platform

If this kind of deep, iterative problem-solving excites you, and you have verifiable experience doing it, let’s talk.
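
To give a rough idea of what I mean by post-OCR validation and normalization, here is a minimal Python sketch. It is an illustration only: the field names (`description`, `quantity`, `unit_price`, `line_total`), the `LineItem` type, the separator heuristic, and the tolerance are all hypothetical placeholders, not our actual schema or rules.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
import re


@dataclass
class LineItem:
    # Hypothetical schema for one invoice line; the real schema will differ.
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


def parse_amount(raw: str) -> Decimal:
    """Turn an OCR'd numeric string such as '$1,234.50' or '1.234,50' into a Decimal."""
    cleaned = re.sub(r"[^\d,.\-]", "", raw)
    if "," in cleaned and "." in cleaned:
        # Assumption: whichever separator appears last is the decimal mark.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        # Assumption: a lone comma followed by exactly two digits is a decimal mark.
        head, _, tail = cleaned.rpartition(",")
        if len(tail) == 2:
            cleaned = f"{head.replace(',', '')}.{tail}"
        else:
            cleaned = cleaned.replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {raw!r}") from exc


def normalize_item(
    raw: dict[str, str], tolerance: Decimal = Decimal("0.01")
) -> tuple[LineItem, list[str]]:
    """Normalize raw OCR fields and return the item plus any validation issues."""
    item = LineItem(
        description=" ".join(raw["description"].split()),  # collapse stray whitespace
        quantity=parse_amount(raw["quantity"]),
        unit_price=parse_amount(raw["unit_price"]),
        line_total=parse_amount(raw["line_total"]),
    )
    issues: list[str] = []
    if not item.description:
        issues.append("description is empty")
    if item.quantity <= 0:
        issues.append("quantity must be positive")
    # Cross-check the arithmetic to catch misread digits.
    if abs(item.quantity * item.unit_price - item.line_total) > tolerance:
        issues.append("line_total does not match quantity * unit_price")
    return item, issues
```

Returning a list of issues instead of raising on the first failed check is deliberate in this sketch: on low-quality scans I would rather flag a suspicious line for review than drop the whole document.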

Python
