AI Science PDF Question Generator

My goal is to build an AI-driven pipeline that can read PDFs of medical subjects packed with both text and images, understand the material, and turn every chapter into student-friendly assessments. For each file the system should automatically deliver short-answer questions that capture key facts, long-answer prompts encouraging deeper explanations, and well-structured multiple-choice questions, complete with one correct option and plausible distractors.

Because many of the PDFs include labelled diagrams, charts, or photographs, the engine must combine standard text extraction with reliable OCR so information buried in images feeds directly into at least one generated question. The tone and vocabulary need to suit secondary-school students, keeping accuracy on scientific terminology while staying readable.

Behind the scenes, I expect a clean workflow: PDF parsing, OCR for images, semantic parsing, and question generation through a large-language-model layer (GPT-4, Llama 2, or an equivalent locally hosted model), followed by quality checks that filter hallucinations and enforce a grade-appropriate readability score. Output should come back in a structured format such as JSON or CSV so I can drop it straight into my LMS.

Acceptance criteria
• Handles a 20-page, mixed-media medicine PDF with at least 90% extraction accuracy
• Generates a minimum of fifteen questions per section: five short-answer, five long-answer, five MCQs
• Flags the correct answer for every MCQ
• Processes at least ten pages in under two minutes on a mid-range laptop

Please break the work into prototype, refinement, and final hand-over milestones, provide well-commented Python code, and list any third-party libraries with their licences. If you already have demos of similar NLP or OCR projects, linking to them will help me gauge fit quickly.
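To make the output format and quality checks described above more concrete, here is a minimal sketch of what the final stage might look like, using only the Python standard library. Everything in it is an assumption for illustration: the `Question` record, the field names, the `validate` and `section_meets_quota` helpers, and the default grade ceiling of 12 are hypothetical, and the syllable counter is a crude vowel-group heuristic feeding the standard Flesch-Kincaid grade formula. PDF parsing, OCR, the LLM call, and hallucination filtering are not covered here.

```python
import json
import re
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Question:
    """Hypothetical record for one generated question."""
    kind: str                                          # "short", "long", or "mcq"
    prompt: str
    answer: str
    options: List[str] = field(default_factory=list)   # used by MCQs only
    correct_index: Optional[int] = None                # used by MCQs only


def count_syllables(word: str) -> int:
    """Crude estimate: number of consecutive-vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Approximate US school grade level using the Flesch-Kincaid formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)


def validate(q: Question, max_grade: float = 12.0) -> List[str]:
    """Return a list of problems; an empty list means the question passes.

    max_grade is an assumed ceiling, not a figure from the brief.
    """
    problems = []
    if q.kind not in {"short", "long", "mcq"}:
        problems.append("unknown question kind")
    if q.kind == "mcq":
        if len(q.options) < 2 or len(set(q.options)) != len(q.options):
            problems.append("MCQ needs at least two distinct options")
        if q.correct_index is None or not 0 <= q.correct_index < len(q.options):
            problems.append("MCQ must flag one valid correct option")
    if flesch_kincaid_grade(q.prompt) > max_grade:
        problems.append("prompt reads above the target grade level")
    return problems


def section_meets_quota(questions: List[Question], per_kind: int = 5) -> bool:
    """Check the per-section minimum: per_kind questions of each type."""
    counts = Counter(q.kind for q in questions)
    return all(counts[k] >= per_kind for k in ("short", "long", "mcq"))


def to_json(questions: List[Question]) -> str:
    """Serialise questions to a JSON array for import into an LMS."""
    return json.dumps([asdict(q) for q in questions], indent=2)


if __name__ == "__main__":
    sample = Question(
        kind="mcq",
        prompt="Which organ pumps blood around the body?",
        answer="Heart",
        options=["Heart", "Liver", "Lung", "Kidney"],
        correct_index=0,
    )
    print(validate(sample))
    print(to_json([sample]))
```

In a pipeline like the one requested, a check of this kind would run after the LLM layer, so that questions with a missing correct-answer flag or an overly difficult prompt are rejected or regenerated before export. A CSV export could be built on the same records with the standard `csv` module.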

Python
