PDF Ledger To Clean Database

I have a single PDF that holds several hundred pages of ledger entries recorded over multiple years. Because of irregular spacing, merged columns, and stray narrative comments, the file can’t be queried, reconciled, or audited in its current form. Your task is to turn every line in that PDF into a structured dataset that I can drop straight into accounting software or run analysis on.

Scope of work
• Parse the PDF and capture every transaction (including dates, descriptions, reference numbers, debit/credit amounts, and running balances) without loss of detail.
• Correct inconsistent number formats (e.g., minus signs, comma placement, mixed currencies) and standardise dates to ISO.
• Isolate any narrative comments so they appear in a separate “Notes” field rather than inside numeric columns.
• Flag and log any rows that fail numeric checks (unbalanced debits vs credits, non-numeric characters inside amount columns, etc.) so I can inspect them quickly.
• Deliver the cleaned output as a single, flat-file database. CSV is fine, but feel free to suggest a lightweight relational structure if you think it will add value. Include the transformation script (Python, R, or similar) so the process is fully reproducible.

Acceptance criteria
1. Row counts in the final dataset match the original ledger pages (no dropped or duplicated lines).
2. All numeric fields import into Excel or a SQL table as numbers, not text.
3. Your anomaly log lists every transaction you could not confidently parse and explains why.
4. The script runs end-to-end on my machine with only standard open-source libraries.

If you have experience wrangling messy PDFs with tools like Python (pandas, tabula-py, camelot) or R (tidyverse, tabulizer), that will be a plus, but feel free to use any stack you prefer as long as the deliverables meet the criteria above.
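To illustrate the kind of cleaning and flagging described above, here is a minimal Python sketch of the normalisation step only. It is not a prescribed implementation. It assumes each row has already been extracted from the PDF into a dict of strings, that commas are thousands separators and the dot is the decimal mark, and that negatives may appear as a leading minus, a trailing minus, or accounting-style parentheses. The field names (`date`, `debit`, `credit`, `balance`, `notes`), the `DATE_FORMATS` list, and all function names are hypothetical, and currency symbols are deliberately left unhandled so that they show up in the anomaly log.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed input date layouts; the real ledger may need others.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %b %Y")


def parse_amount(raw):
    """Return a Decimal for a ledger amount, or None if it cannot be parsed."""
    text = raw.strip()
    negative = False
    # Accounting-style negative: (45.00)
    if text.startswith("(") and text.endswith(")"):
        negative = True
        text = text[1:-1]
    # Trailing or leading minus sign: 100- or -100
    if text.endswith("-"):
        negative = True
        text = text[:-1]
    elif text.startswith("-"):
        negative = True
        text = text[1:]
    # Assumption: commas and spaces are grouping characters only.
    text = text.replace(",", "").replace(" ", "")
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        return None
    value = Decimal(text)
    return -value if negative else value


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Return (cleaned_row, problems); problems feed the anomaly log."""
    problems = []
    cleaned = {
        "date": to_iso_date(row.get("date", "")),
        "notes": row.get("notes", "").strip(),
    }
    if cleaned["date"] is None:
        problems.append(f"unparseable date: {row.get('date', '')!r}")
    for field in ("debit", "credit", "balance"):
        raw = row.get(field, "")
        value = parse_amount(raw)
        # An empty cell is allowed; anything else that fails to parse is flagged.
        if value is None and raw.strip():
            problems.append(f"non-numeric {field}: {raw!r}")
        cleaned[field] = value
    return cleaned, problems
```

A row such as `{"date": "31/01/2020", "debit": "1,234.50", "credit": "", "balance": "12.3O"}` would come back with an ISO date, a numeric debit, and one logged problem for the balance, where the letter O stands in for a zero. `Decimal` is used instead of `float` so that amounts are not subject to binary rounding.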

Python
