My current stack needs a purpose-built Large Language Model that plugs straight into our infrastructure and immediately sharpens decision-making. The core assignment is twofold: first, develop or customise an LLM that consumes our data streams (APIs, SQL warehousing, and a few flat-file drops) and returns context-aware recommendations we can feed into our dashboards; second, design and execute comprehensive black-box tests that validate output accuracy, latency, and edge-case stability end to end.

You will be working against production-like replicas, so isolation and repeatability matter. I have internal SRE support for environment spin-ups, but the modelling, the inference pipeline, and the full suite of opaque-input/observable-output tests will be yours to architect and code. Kubernetes, Docker, Python, and whichever LLM framework you prefer (Hugging Face Transformers, LangChain, or comparable) are all in play; just keep dependencies sane and containerised.

Demonstrated experience shipping decision-support LLMs and hardening them through black-box testing is essential; sample repos or live references will carry weight in selection. I'll review your approach to model alignment, monitoring hooks, and test-result reporting before green-lighting milestones, so bring a concise outline of how you normally tackle similar builds.

The final hand-off is a reproducible pipeline, a documented test harness, and clear metrics showing that the system genuinely improves decision-making rather than just generating text.
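To make the testing expectation concrete, here is a rough sketch of the kind of opaque-input/observable-output check I have in mind. The endpoint URL, route, response schema, and latency budget below are illustrative placeholders only, not our actual contract; the real thresholds and schema will be agreed per milestone.

```python
"""
Minimal black-box check against a hypothetical recommendation endpoint.
All names here (RECS_URL, the /recommend route, the JSON fields, the
2-second latency budget) are placeholders for illustration.
"""
import os
import time

import requests

# Endpoint of the containerised inference service under test (placeholder).
RECS_URL = os.environ.get("RECS_URL", "http://localhost:8080/recommend")
LATENCY_BUDGET_S = 2.0  # illustrative budget, to be agreed per milestone


def test_recommendation_contract_and_latency():
    payload = {
        "context": {"account_id": "demo-123", "window_days": 30},
        "question": "Which accounts should we prioritise this week?",
    }

    start = time.monotonic()
    resp = requests.post(RECS_URL, json=payload, timeout=10)
    elapsed = time.monotonic() - start

    # Observable-output assertions only: status, schema, latency.
    assert resp.status_code == 200
    body = resp.json()
    assert isinstance(body.get("recommendations"), list)
    assert all("rationale" in rec for rec in body["recommendations"])
    assert elapsed <= LATENCY_BUDGET_S, f"latency {elapsed:.2f}s over budget"


def test_edge_case_empty_input_is_handled_gracefully():
    # Edge-case stability: empty or degenerate input must not 500 or hang.
    resp = requests.post(RECS_URL, json={"context": {}, "question": ""}, timeout=10)
    assert resp.status_code in (200, 400, 422)
```

The real harness will need broader coverage than this, including accuracy scoring against labelled decisions and adversarial inputs, but this is the level of black-box rigour I expect to see documented in the final hand-off.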