# Build an AI Comic Translator (GUI app + optional Discord bot integration)

## Goal

Create a desktop application (Windows) that **detects comic text**, performs **high-accuracy OCR**, and does **glossary-aware translation** for **Korean manhwa, Chinese manhua, Japanese manga, and Mangatoon** images. The app should feel like a blend of **Ballon-translator-portable** and **PanelCleaner** (UI + workflow), focused on translation only (no cleaning). Optionally, expose a minimal **Discord bot** interface that triggers the same core pipeline on a shared Drive folder.

---

## What I need (Scope)

### 1) Input & Project Handling

* I paste a **Google Drive link** (or select a local folder) containing page images (JPG/PNG/WebP).
* The app loads pages into a **left side panel** (thumbnail list) with page name/number.
* Supported reading directions:
  * **Manga (JP):** right-to-left, top-to-bottom.
  * **Manhwa/Manhua (KR/CN):** left-to-right, top-to-bottom.
* The app should **infer the correct reading order** for detected text boxes/bubbles based on the work's direction and visual layout. Use an AI/heuristic ordering model (e.g., graph ordering, attention over positions).

### 2) Text Detection (No cleaning)

* High-recall **text region detection** on comic pages (speech bubbles, narration boxes, side text).
* Acceptable approaches:
  * a **YOLO-family** detector (v8/v9/RT-DETR) trained/fine-tuned for comic text and bubble masks, or
  * a **CTD/CRAFT/EAST**-style detector with solid performance on thin fonts and vertical text.
* Respect the **reading direction** to produce an **ordered list of text segments** per page.

### 3) Bubble Type Classification

* Classify each detected region as one of: **Speech**, **Shout**, **Narration**, **Thoughts**, **SideText**.
* Output must **prepend a marker** to each raw and translated line:
  * Shout → `S:`
  * Speech → `"":`
  * Thoughts → `():`
  * Narration → `[]:`
  * SideText → `ST:`

### 4) OCR (Offline-first)

* **Offline OCR is required** for privacy and speed.
* Use **MangaOCR** (JP) and best-in-class offline OCR for **KR/CN** (e.g., **PaddleOCR** multilingual, custom KR/CN models, or comparable).
* Handle vertical Japanese text and mixed punctuation.
* Provide a **pluggable OCR layer** so models can be swapped or upgraded later.

### 5) Translation (Glossary-aware)

* Core problem: APIs often ignore series glossaries and produce inconsistent character/term names.
* Requirements:
  * A **Series Glossary** (CSV/JSON) loaded per project, with **term → preferred translation** pairs, compound names, and forbidden variants.
  * **Glossary enforcement** during translation:
    * At minimum: post-processing **terminology mapping** with **token-boundary-aware** replacements and case handling.
    * Preferably: **constrained decoding** or **prompt-level forcing** if using an LLM API.
  * Support for multiple backends:
    * **offline NMT** (if practical), or
    * an **API** (OpenAI/Google/etc.) with **strict glossary injection** and deterministic settings (temperature, style guide).
  * The UI must let me **preview raw + translated** text pairs for each segment, with quick edit boxes.

### 6) UI/UX (Blend of Ballon-translator + PanelCleaner)

* **Left sidebar**: page thumbnails + status (Not processed / OCR done / Translation done).
* **Main viewer**: page image with overlay boxes; clicking a box jumps to its text pair.
* **Right panel**:
  * Ordered list of segments for the current page.
  * For each segment: bubble type label, **RAW** text, **TRANSLATED** text (editable).
* **Run** controls:
  * Per-page run and **Run All** (batch).
  * Progress bar + GPU/CPU indicator.
* **Glossary Manager**: import/export CSV/JSON, live apply.
* **Settings**: reading direction, OCR backend per language, translation backend, glossary rules (strict/lenient), output format options.
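The "at minimum" glossary requirement above (token-boundary-aware terminology mapping with case handling) can be sketched as follows. Function names here are illustrative, not part of the spec; note that `\b` boundaries only help on scripts where word characters sit next to spaces/punctuation, so this pass is best applied to the *translated* (English) text, mapping forbidden variants to the preferred terms:

```python
import re

def compile_glossary(glossary: dict[str, str]) -> tuple[re.Pattern, dict[str, str]]:
    # Longest keys first so compound names ("Kim Dokja") win over substrings ("Dokja").
    keys = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in glossary.items()}
    return pattern, lookup

def enforce_glossary(text: str, glossary: dict[str, str]) -> str:
    """Single-pass, token-boundary-aware replacement. Because re.sub never
    re-scans its own replacements, a preferred term may safely contain
    another glossary key without cascading rewrites."""
    if not glossary:
        return text
    pattern, lookup = compile_glossary(glossary)
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

In a real project the pattern would be compiled once per glossary load (e.g., in the Glossary Manager) rather than per call; the single-pass alternation is what makes enforcement deterministic, which matters for the 100%-exact-key target below.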
### 7) Output

* Final export as a **clean .docx** or **.txt** with this exact structure (nothing else):
  * **Page line**: `PAGE: 001.jpg` (or page number)
  * Then, **for each ordered segment**:
    * a **RAW line** with marker, e.g. `[]: ばっはっは!元気があってよろしいこったあ!`
    * a **TR line** with the **same marker**, e.g. `[]: Bwahaha! Full of energy, that’s the way I like it!`
* No headers/footers/logos/metadata: **only** page names and marked lines.

### 8) Optional: Discord Bot Wrapper

* Minimal bot that takes a **Drive link** command, queues the job, and returns the exported **docx/txt** when done.
* Same core library as the GUI (no duplicated logic).
* Admin-only commands and simple status messages.

---

## Deliverables

1. **Windows desktop app** (prefer **Python + PySide6/Qt** or **Electron + Python backend**) with an installer.
2. **Core inference library** (separate module) for detection → OCR → translation → export.
3. **Model configs & weights** loading code; easy to swap models.
4. **Glossary system** with import/export and deterministic enforcement.
5. **Export module** that guarantees the exact output format.
6. **Optional Discord bot** using the same library.
7. **Documentation**: setup, model downloads, how to add new OCR/translation backends, glossary usage.
8. **Test data & test plan** covering KR/JP/CN pages, vertical JP, dense side text, multi-bubble pages.
9. **Source code** in a clean repo (readme, comments, type hints), plus a short handover video.

---

## Tech Preferences & Environment

* **OS:** Windows 10/11.
* **GPU:** NVIDIA RTX 4060 Ti (CUDA acceleration expected for detection/OCR where supported).
* **Language:** Python 3.11+ preferred for fast iteration (PyTorch/ONNX Runtime).
* **Models:** YOLO-family (or CRAFT/EAST) for detection; MangaOCR + PaddleOCR (KR/CN) or equivalent.
* **Translation:** pluggable; offline if feasible, or an API with strong glossary forcing.
* **No image cleaning** features required.
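To make the exact-format guarantee in §7 concrete, here is a minimal sketch of the export module's text path (the .docx path via `python-docx` would emit the same lines as paragraphs). The `Segment` dataclass and function name are assumptions for illustration, not part of the spec:

```python
from dataclasses import dataclass

# Bubble-type markers exactly as defined in §3.
MARKERS = {
    "Shout": "S:",
    "Speech": '"":',
    "Thoughts": "():",
    "Narration": "[]:",
    "SideText": "ST:",
}

@dataclass
class Segment:
    bubble_type: str   # one of MARKERS' keys
    raw: str           # OCR output
    translated: str    # translation output

def export_txt(pages: dict[str, list[Segment]]) -> str:
    """Emit exactly one PAGE line per image, then a RAW line and a TR line
    per segment, both sharing the segment's marker. No headers, footers,
    logos, or metadata of any kind."""
    lines: list[str] = []
    for page_name, segments in pages.items():
        lines.append(f"PAGE: {page_name}")
        for seg in segments:
            marker = MARKERS[seg.bubble_type]
            lines.append(f"{marker} {seg.raw}")
            lines.append(f"{marker} {seg.translated}")
    return "\n".join(lines) + "\n"
```

Keeping the format in one pure function like this is what allows the acceptance criterion "contains only page names and marked lines" to be verified with a simple round-trip test.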
---

## Quality & Performance Targets

* **Detection recall** on speech/narration bubbles: ≥ 95% on provided samples.
* **OCR accuracy** (character-level) on clean scans: JP ≥ 95%, KR/CN ≥ 92% (on the test set).
* **Ordering correctness**: ≥ 95% of segments exported in the human-expected reading sequence.
* **Latency**: ≤ 3 s per 1500×2200 image on my GPU for the full pipeline (averaged across a chapter).
* **Glossary enforcement**: 100% for exact glossary keys; fuzzy tolerance for minor punctuation/spacing.

---

## Nice-to-Haves (not required, quote separately)

* Automatic language detection per page/segment.
* Confidence scores and “review first” filters for low-confidence OCR or translation.
* Batch merge: a single export for the entire folder with per-page headers.
* Hotkeys for speedy manual fixes.
* Autosave and crash-safe resume.

---

## What I will provide

* Sample chapters (KR/JP/CN) with ground-truth expectations.
* Initial glossary files for a few series.
* Feedback and rapid testing during development.

---

## Acceptance Criteria (must pass to mark complete)

* I can load a Drive link/folder, click **Run All**, and get a **.docx**/**.txt** that contains only:
  * `PAGE: <name>` lines, and
  * for each segment, **RAW** + **TRANSLATED** lines prefixed with the correct bubble **marker**.
* Segments are in the correct **reading order** for the chosen language family.
* Glossary terms appear **consistently** across the output.
* The app runs offline for detection + OCR; translation can be offline or via API (with glossary forcing).
* Installer + documentation provided; I can run it on my RTX 4060 Ti machine.

---

## Milestones (suggested)

1. **Design + Prototype**: model choices, UI skeleton, one sample page through the full pipeline.
2. **Detection + Ordering** solid on test pages; UI overlays + sidebar complete.
3. **OCR + Translation** with glossary enforcement; inline editing UX.
4. **Export module** (exact format), batch processing, project save/load.
5. **Polish & Handover**: installer, docs, optional Discord wrapper, test pass.

---

## Please include in your bid

* Your proposed **model stack** for detection/OCR and how you’ll ensure ordering accuracy.
* How you’ll implement **glossary enforcement** (exact method).
* Any prior work on OCR/comics/NLP.
* A quick plan for reaching the performance targets.
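As a reference point for the ordering requirement in §1 and the ≥ 95% ordering target: a simple band-based baseline is sketched below (all names are hypothetical; the brief asks for an AI/heuristic ordering model, which would replace or extend a baseline like this). Boxes whose vertical spans overlap are grouped into a band; bands are read top-to-bottom, and each band is read right-to-left for manga or left-to-right for manhwa/manhua:

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # left edge in page pixels
    y: float  # top edge
    w: float
    h: float

def reading_order(boxes: list[Box], rtl: bool) -> list[Box]:
    """Baseline row-banding heuristic: a box joins the topmost unplaced
    box's band when their vertical spans overlap; bands are emitted
    top-to-bottom, and sorted right-to-left (manga) or left-to-right
    (manhwa/manhua) within each band."""
    remaining = sorted(boxes, key=lambda b: b.y)
    ordered: list[Box] = []
    while remaining:
        anchor = remaining[0]
        band = [b for b in remaining
                if b.y < anchor.y + anchor.h and b.y + b.h > anchor.y]
        band.sort(key=lambda b: -(b.x + b.w) if rtl else b.x)
        ordered.extend(band)
        remaining = [b for b in remaining if b not in band]
    return ordered
```

A pure-geometry pass like this breaks down on overlapping panels and diagonal gutters, which is exactly where the learned ordering model (graph ordering, attention over positions) is expected to close the gap to the 95% target.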