My current web-scraping pipeline is functional but slow, and the data-processing stage in particular is maxing out the GPU. I want to cut total runtime, shrink GPU memory consumption, and have every run automatically log its own performance so I can measure gains over time.

You'll start by profiling the existing Python code (pandas, NumPy, CUDA-accelerated routines) to pinpoint the true hotspots. From there, I need refactored or parallelised logic, smarter batching, and any lightweight caching that prevents redundant computation. I'm open to revisiting earlier extraction steps if a quick tweak there compounds the speed-up, but the main brief is processing-level optimisation.

Deliverables:
• Optimised processing script(s) with clear inline comments
• A small logging module that records start/end time, GPU utilisation, and memory footprint to a flat file or lightweight DB
• A short report covering the changes made, before-and-after benchmarks, and advice for future scaling

Please keep the solution OS-agnostic and rely only on widely available Python libraries so I can port it between servers without hassle.
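To make the logging deliverable concrete, here is a rough sketch of the sort of module I have in mind. It assumes an NVIDIA GPU with the nvidia-ml-py (pynvml) bindings installed; when they are absent the GPU columns should simply come out blank so the module still runs on CPU-only boxes. The file name pipeline_runs.csv and the stage names are placeholders, not fixed requirements.

```python
# run_logger.py -- rough sketch of the logging module described above.
# Assumes an NVIDIA GPU with the nvidia-ml-py ("pynvml") bindings installed;
# if they are missing, GPU fields are logged as empty values so the module
# stays OS-agnostic and usable on CPU-only servers.
import csv
import os
import time
from contextlib import contextmanager
from datetime import datetime, timezone

try:
    import pynvml
    pynvml.nvmlInit()
    _NVML = True
except Exception:  # no NVIDIA driver / library -> degrade gracefully
    _NVML = False


def _gpu_snapshot(device_index=0):
    """Return (utilisation %, memory used in MiB), or (None, None) without NVML."""
    if not _NVML:
        return None, None
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle).used // (1024 * 1024)
    return util, mem


@contextmanager
def log_run(stage_name, log_path="pipeline_runs.csv"):
    """Log start time, wall-clock duration, and GPU stats for one pipeline stage."""
    start_wall = datetime.now(timezone.utc).isoformat()
    start = time.perf_counter()
    util_before, mem_before = _gpu_snapshot()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        util_after, mem_after = _gpu_snapshot()
        new_file = not os.path.exists(log_path)
        with open(log_path, "a", newline="") as fh:
            writer = csv.writer(fh)
            if new_file:
                writer.writerow(["stage", "start_utc", "duration_s",
                                 "gpu_util_before_pct", "gpu_util_after_pct",
                                 "gpu_mem_before_mib", "gpu_mem_after_mib"])
            writer.writerow([stage_name, start_wall, f"{duration:.3f}",
                             util_before, util_after, mem_before, mem_after])
```

Usage would be a simple wrapper around each stage, e.g. `with log_run("processing"): process_batch(df)`. A flat CSV like this is enough for me; swapping the writer for a lightweight DB (e.g. sqlite3 from the standard library) is fine too.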
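On the caching point, something as light as a disk memoiser keyed on a hash of the raw input would satisfy me, so already-seen pages skip recomputation on reruns. The sketch below is illustrative only; expensive_transform is a stand-in name, not a function from the existing pipeline.

```python
# cache_sketch.py -- illustrative only: one way "lightweight caching" could look.
# Results of an expensive, pure transformation are memoised on disk, keyed by a
# hash of the input bytes; deleting the .pipeline_cache directory is always safe.
import hashlib
import pickle
from functools import wraps
from pathlib import Path

CACHE_DIR = Path(".pipeline_cache")
CACHE_DIR.mkdir(exist_ok=True)


def disk_cached(func):
    """Memoise a pure function of raw bytes onto disk."""
    @wraps(func)
    def wrapper(raw: bytes):
        key = hashlib.sha256(raw).hexdigest()
        cache_file = CACHE_DIR / f"{func.__name__}_{key}.pkl"
        if cache_file.exists():
            with open(cache_file, "rb") as fh:
                return pickle.load(fh)
        result = func(raw)
        with open(cache_file, "wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


@disk_cached
def expensive_transform(raw: bytes):
    # Placeholder for the real processing step in the pipeline.
    return raw.decode("utf-8", errors="ignore").lower()
```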