Styled PDF OCR Extractor

I need a small, dependable utility that can read scanned (image-based) PDF files and output their text while keeping the original look as closely as possible. The tool must recognise each character through OCR, then return the results in a formatted form (HTML, RTF, or any equivalent you propose), as long as it preserves font family and size information and flags bold or italic segments accurately.

Key expectations
• Works on typical multi-page PDFs without manual intervention
• Maintains correct reading order and paragraph breaks
• Shows the same fonts (or close web-safe substitutes) and reflects bold / italics exactly where they appear in the source

Deliverables
1. Source code and brief instructions to run it on Windows or cross-platform (Python, Java, or a language you’re comfortable with)
2. A sample run using one of my PDFs that demonstrates the preserved styling
3. Short read-me explaining any open-source libraries used (e.g. Tesseract, pdfminer, PyMuPDF) and how to train or tweak them for better accuracy

I will supply representative PDFs once we start. If you have prior OCR or PDF-processing experience, let me know so we can move quickly.
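
To make the styling requirement concrete, here is a minimal sketch of the kind of HTML output described above. It assumes the OCR stage has already produced text runs with font, size, bold, and italic attributes. The `Span` class and both function names are hypothetical, chosen only for illustration; they are not part of Tesseract, pdfminer, or PyMuPDF, and the OCR step itself is not shown.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Span:
    """One run of text with uniform styling (hypothetical structure)."""
    text: str
    font: str = "Times New Roman"  # assumed default; a real tool would detect this
    size_pt: float = 12.0
    bold: bool = False
    italic: bool = False


def span_to_html(span: Span) -> str:
    """Render one styled run as an inline HTML element."""
    inner = escape(span.text)
    if span.italic:
        inner = f"<i>{inner}</i>"
    if span.bold:
        inner = f"<b>{inner}</b>"
    style = f"font-family: {escape(span.font)}; font-size: {span.size_pt:g}pt"
    return f'<span style="{style}">{inner}</span>'


def paragraphs_to_html(paragraphs: List[List[Span]]) -> str:
    """Render paragraphs (each a list of spans in reading order) as <p> blocks."""
    return "\n".join(
        "<p>" + "".join(span_to_html(s) for s in para) + "</p>"
        for para in paragraphs
    )
```

Each inner list represents one paragraph in reading order, so paragraph breaks and inline bold / italic runs survive in the output. RTF or another format could be produced from the same intermediate structure.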
