ETL Cleansing Small Task

Client: AI | Published: 23.02.2026

I have several CSV and JSON files arriving on a regular schedule, and I need a reliable ETL pipeline that ingests these flat files, cleans the raw data, and validates every record before loading it into my environment. The core of the job is to:

• Read multiple flat-file formats (mainly CSV, with the occasional JSON).
• Apply thorough data-cleansing rules: removing duplicates, enforcing data types, flagging out-of-range values, and normalising text fields.
• Run validation checks so that only clean, schema-compliant rows proceed to the load step.

I'm happy for you to choose the stack you are most efficient with (Python with pandas or PySpark, Talend, or another ETL tool), as long as the final solution is reproducible and can be triggered automatically (CLI, scheduled job, or cloud function). If you think aggregation or more advanced joins would improve the dataset, flag that as a future enhancement; for now, cleansing and validation are the must-haves.

Deliverables

1. A well-documented ETL script or job configuration.
2. A sample run demonstrating before-and-after records.
3. Setup instructions so I can deploy the pipeline in my own environment (local or cloud).

I'll review the deliverables by running the pipeline on a fresh batch of files; acceptance is based on error-free execution and a clean output dataset. Let me know what libraries or tooling you prefer, any assumptions you need clarified, and an estimated timeline for the first working version.
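To make the cleansing and validation requirements concrete, here is a minimal pandas sketch of the transform step. The column names (`id`, `amount`, `name`) and the range threshold are illustrative assumptions, not part of this brief; a real bid would adapt them to the actual schema:

```python
import io
import pandas as pd

# Inline sample standing in for one incoming CSV file (hypothetical columns).
RAW_CSV = io.StringIO(
    "id,amount,name\n"
    "1,10.5,  alice \n"
    "1,10.5,  alice \n"   # exact duplicate -> removed
    "2,abc,Bob\n"         # non-numeric amount -> rejected
    "3,9999,carol\n"      # out-of-range amount -> rejected
    "4,42,Dave\n"
)

def clean_and_validate(df: pd.DataFrame, amount_max: float = 1000.0):
    """Return (valid_rows, rejected_rows) after cleansing."""
    # 1. Remove exact duplicate rows.
    df = df.drop_duplicates()
    # 2. Enforce data types: non-numeric amounts become NaN.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # 3. Normalise text fields: trim whitespace, consistent casing.
    df["name"] = df["name"].str.strip().str.title()
    # 4. Validate: keep only typed, in-range rows; the rest go to a
    #    reject set for inspection rather than being silently dropped.
    valid = df["amount"].notna() & df["amount"].between(0, amount_max)
    return df[valid], df[~valid]

clean, rejected = clean_and_validate(pd.read_csv(RAW_CSV))
```

Routing failed rows to a separate reject set (instead of discarding them) supports the "before-and-after records" deliverable and makes validation failures auditable.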