I want to see a working proof-of-concept that can watch a live webcam feed in an indoor setting and reliably decide whether someone is merely holding a phone or actively using it. The prototype must process the video stream in real time, recognise the presence of a smartphone, then look for behavioural cues—hand placement, posture and, ideally, gaze direction—to confirm active usage. Whenever the model judges that the phone is being used, it should trigger an audible or visible alarm on the host machine instantly; no other logging or alert channels are required for this first iteration.

I am happy for you to choose your preferred computer-vision stack (e.g. OpenCV, MediaPipe, PyTorch, TensorFlow, ONNX) as long as the end result runs on a typical workstation without specialised hardware. Pre-trained networks are welcome, but please include any fine-tuning scripts so I can reproduce the results. If additional datasets are needed, point me to openly licensed sources or provide clear collection guidelines.

Deliverables

• Source code with clear setup instructions
• A short demo video or live call showing the system detecting phone usage and firing the alarm in real time
• Brief technical notes explaining the model architecture, input preprocessing and the logic you use to distinguish “holding” from “using”

I will test by pointing a webcam at volunteers in an office, so accuracy in ordinary indoor lighting is critical. Let me know how quickly you can turn around an initial build and what dependencies I should have in place.
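To make the “holding” vs “using” distinction concrete, here is a minimal, dependency-free Python sketch of the decision layer only. It assumes an upstream detector has already produced per-frame geometry (e.g. a phone bounding box from an object detector and wrist/gaze landmarks from a pose or face model); the class names, thresholds, and the majority-vote smoothing window are illustrative assumptions, not a fixed specification.

```python
from dataclasses import dataclass
from collections import deque
from typing import Deque, Optional, Tuple

Point = Tuple[float, float]  # normalised image coordinates in [0, 1]


@dataclass
class FrameObservation:
    """Per-frame geometry assumed to come from upstream detectors
    (phone box centre from an object detector, landmarks from a
    pose/face model). All fields are None when not detected."""
    phone_center: Optional[Point]
    wrist: Optional[Point]        # wrist landmark nearest the phone
    gaze_target: Optional[Point]  # rough point the face is oriented towards


def _dist(a: Point, b: Point) -> float:
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def classify_frame(obs: FrameObservation,
                   hand_thresh: float = 0.10,
                   gaze_thresh: float = 0.15) -> str:
    """Return 'none', 'holding', or 'using' for one frame.
    'using' requires both a hand near the phone and gaze roughly on it;
    a hand near the phone without gaze counts only as 'holding'."""
    if obs.phone_center is None:
        return "none"
    in_hand = (obs.wrist is not None
               and _dist(obs.wrist, obs.phone_center) < hand_thresh)
    looking = (obs.gaze_target is not None
               and _dist(obs.gaze_target, obs.phone_center) < gaze_thresh)
    if in_hand and looking:
        return "using"
    if in_hand:
        return "holding"
    return "none"


class UsageAlarm:
    """Temporal smoothing: fire only when a majority of recent frames
    say 'using', which suppresses single-frame detector flicker."""

    def __init__(self, window: int = 15, ratio: float = 0.6):
        self.history: Deque[bool] = deque(maxlen=window)
        self.ratio = ratio

    def update(self, frame_label: str) -> bool:
        self.history.append(frame_label == "using")
        if len(self.history) < self.history.maxlen:
            return False  # not enough evidence yet
        return sum(self.history) / len(self.history) >= self.ratio
```

In a full build, `classify_frame` would be called once per frame from the capture loop (e.g. OpenCV's `VideoCapture`), and a `True` return from `UsageAlarm.update` would trigger the audible or visible alarm.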