Hadoop Data Pipeline Specialist Needed

Client: AI | Published: 16.03.2026

I’m upgrading our analytics stack and need an expert who can own the Hadoop side and turn raw, high-volume feeds into analysis-ready datasets. The core objective is to design and build end-to-end data pipelines on a Hadoop cluster; this is where I expect Hadoop to be most valuable for the project.

Here’s what I need from you:

• An architecture that takes terabyte-scale log files, lands them in HDFS, applies basic cleansing, and outputs partitioned Parquet tables queryable from Hive or Spark (a rough sketch of what I mean follows at the end of this post)
• All scripts, configs, and scheduling (Oozie, Airflow, or your preferred orchestrator) committed to Git with clear documentation
• A deployment guide plus a brief hand-over session so I can reproduce the setup on another cluster

Acceptance criteria: the pipeline ingests a 100 GB test set, completes in under 60 minutes on my provided environment, and the resulting tables are fully accessible for downstream analytics.

If you’ve previously worked with MapReduce, Hive, Spark, or similar tools, include code samples or screenshots so I can quickly verify fit. I’m ready to start as soon as the right approach and timeline are laid out.
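To make the first bullet concrete, here is a minimal PySpark sketch of the cleanse-and-land step I have in mind. The HDFS path, the field names (event_ts, event_id), and the analytics.logs_clean table are placeholders for illustration only, not our actual schema:

```python
# Sketch of the cleansing step: read raw JSON-lines logs from HDFS,
# drop malformed and duplicate records, and publish partitioned Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("log-cleansing")       # placeholder app name
    .enableHiveSupport()            # so the output table is queryable from Hive
    .getOrCreate()
)

# Placeholder input path; the real feed layout will differ.
raw = spark.read.json("hdfs:///data/raw/logs/")

cleaned = (
    raw
    .filter(F.col("event_ts").isNotNull())         # drop records missing a timestamp
    .dropDuplicates(["event_id"])                  # placeholder dedup key
    .withColumn("event_date", F.to_date("event_ts"))  # derive the partition column
)

(
    cleaned.write
    .mode("overwrite")
    .partitionBy("event_date")
    .format("parquet")
    .saveAsTable("analytics.logs_clean")           # placeholder Hive table name
)
```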
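For the scheduling bullet, something along these lines would meet the requirement. This sketch assumes Airflow with the Apache Spark provider installed; the DAG id and script path are again placeholders, and an equivalent Oozie workflow would be equally acceptable:

```python
# Sketch of daily orchestration with Airflow: one task that submits
# the cleansing job above to the cluster via spark-submit.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="log_pipeline",            # placeholder DAG id
    start_date=datetime(2026, 3, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    cleanse = SparkSubmitOperator(
        task_id="cleanse_logs",
        application="/opt/jobs/cleanse_logs.py",  # placeholder script path
        conn_id="spark_default",                  # assumes a configured Spark connection
    )
```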
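And to clarify what “fully accessible for downstream analytics” means in the acceptance criteria: simple analytical queries against the published table (same placeholder name as above) should run cleanly, for example:

```python
# Quick accessibility check: the cleansed table should answer
# a basic aggregation from Spark SQL without errors.
spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM analytics.logs_clean
    GROUP BY event_date
    ORDER BY event_date
""").show()
```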