The core of my remote-sensing crop-yield project is in place, but the code will not run from start to finish. I need a fresh set of eyes to hunt down and eliminate the blockers so that the pipeline executes smoothly both on Databricks and locally.

Current state
• Repository already contains:
  – Spark-based preprocessing notebooks (PySpark)
  – Trained ML model scripts and saved artefacts
  – A handful of Databricks experiment notebooks for exploration

What I need most
Debugging is the priority. I am not after a full rewrite; I want the existing pieces to work together. You are free to suggest refactors where they remove obvious bottlenecks, but the first milestone is simply getting the code to run cleanly.

Focus areas
• Spark preprocessing notebooks: fix schema-drift issues, broken joins, I/O errors, and any serialization problems that stop the job (a sketch of the kind of defensive check I have in mind appears at the end of this brief).
• Trained model scripts: ensure they load the processed data correctly, reproduce the training pipeline, and deliver consistent predictions (see the second sketch below).

You are welcome to inspect the Databricks notebooks once those two areas are solid, but the main review targets are the preprocessing and model sections.

Deliverables
1. A fully functioning end-to-end run: raw satellite data → Spark preprocessing → model inference → evaluation metrics (the third sketch below shows the kind of metric summary I expect).
2. Updated notebooks/scripts pushed to the repo with clear, concise comments on the fixes applied.
3. A brief changelog outlining what was corrected, plus any refactor suggestions for future work.

Tools & environment
The repo is in Python, leveraging PySpark 3.x on Databricks Runtime, plus common ML libraries (scikit-learn, TensorFlow saved models). If you are comfortable dropping into Databricks, stepping through jobs, and using Git for version control, you will feel at home.

I will provide access to the code immediately after kickoff and stay available for quick clarifications. Let's get this pipeline producing yield predictions without a hitch.
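Illustrative sketches
First, the preprocessing focus area: a minimal sketch of the kind of fail-fast schema check I would like before any joins run. This is not the repo's actual code; the column names (scene_id, ndvi, captured_at) and the input path are placeholders I made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("crop-yield-preprocess").getOrCreate()

# Placeholder schema; the real notebooks define the actual columns.
expected = StructType([
    StructField("scene_id", StringType(), False),
    StructField("ndvi", DoubleType(), True),
    StructField("captured_at", TimestampType(), True),
])

scenes = spark.read.parquet("/mnt/raw/satellite/scenes/")  # placeholder path

# Fail fast on drift: a renamed or missing column surfaces here, not as a
# cryptic AnalysisException deep inside a downstream join.
missing = set(expected.fieldNames()) - set(scenes.columns)
if missing:
    raise ValueError(f"Schema drift in raw scenes: missing columns {sorted(missing)}")

# Cast to the expected types so later joins compare like with like.
scenes = scenes.select([scenes[f.name].cast(f.dataType) for f in expected.fields])
```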
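Second, the model focus area: a sketch of the "consistent predictions" check, assuming the artefact is a Keras/TensorFlow saved model. The paths and file names are placeholders for whatever the repo actually stores.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("/dbfs/models/yield_model")   # placeholder path
X = np.load("/dbfs/processed/holdout_features.npy")              # placeholder file

# Score the same fixed batch twice; the outputs should match exactly.
first = model.predict(X, verbose=0)
second = model.predict(X, verbose=0)

# If this trips, look for dropout layers left in training mode, non-seeded
# augmentation, or preprocessing that differs between runs.
assert np.allclose(first, second), "Predictions are not reproducible across runs"
```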
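Third, the evaluation step at the end of the pipeline, assuming yield prediction is framed as regression and using scikit-learn from the listed stack. The y_true/y_pred files are placeholders standing in for whatever the inference step actually emits.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.load("/dbfs/processed/holdout_targets.npy")  # placeholder file
y_pred = np.load("/dbfs/processed/holdout_preds.npy")    # placeholder file

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # version-safe RMSE
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R²: {r2:.3f}")
```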