Open-Source Text-to-Video Solution

Client: AI | Published: 21.09.2025

I’m building BlessVideo.com and need to move away from closed, pay-walled models like Veo3. My goal is to assemble a fully open-source pipeline that turns plain text into short video clips and, optionally, layers in synced audio. I want to know which current projects (e.g. Stable Video, ModelScope, Pika, or any other promising repo you discover) can be chained together, where the gaps are, and how to host everything on my own hardware or a modest cloud instance.

Here’s what I’d like delivered:

• A concise comparative report on at least three viable open-source text-to-video models, covering their licenses, strengths, and hardware requirements.
• A working proof-of-concept: type a prompt, receive a 5–10 sec video with an auto-generated audio track (voice-over, music, or both, whichever the chosen stack supports best). A simple Streamlit, Gradio, or similar front-end is fine; a rough sketch of the kind of pipeline I have in mind follows this list.
• Step-by-step setup notes so I can reproduce the environment from scratch on Ubuntu, including any model weights, Dockerfiles, or inference scripts you customize.

Python, PyTorch, ffmpeg, and common TTS libraries such as Coqui-TTS or Bark are all fair game; just keep the dependencies open and redistributable. I’ll test the demo locally; once it runs smoothly and the documentation is clear, the project is complete.
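To make the scope concrete, here is a rough, untested sketch of the pipeline shape I have in mind: ModelScope’s damo-vilab/text-to-video-ms-1.7b checkpoint via Hugging Face diffusers for the video, a Coqui-TTS voice for the narration, and ffmpeg to mux the two. The model IDs, frame counts, and file names are my own illustrative choices, not requirements; swap in whatever stack your report recommends.

# Sketch only, assuming: pip install torch diffusers transformers accelerate TTS
# plus a system-level ffmpeg. Model IDs below are examples, not requirements.
import subprocess
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from TTS.api import TTS

def text_to_video(prompt: str, video_path: str = "clip.mp4") -> str:
    # ModelScope's 1.7B text-to-video model fits a single ~16 GB GPU in fp16.
    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
    )
    pipe.enable_model_cpu_offload()  # trade speed for lower VRAM use
    frames = pipe(prompt, num_frames=64).frames[0]  # ~8 seconds at 8 fps
    export_to_video(frames, video_path, fps=8)
    return video_path

def text_to_speech(text: str, audio_path: str = "voiceover.wav") -> str:
    # Any Coqui-TTS voice works here; this English single-speaker model is small.
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text=text, file_path=audio_path)
    return audio_path

def mux(video_path: str, audio_path: str, out_path: str = "final.mp4") -> str:
    # Copy the video stream, encode the audio to AAC, stop at the shorter input.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
         "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
        check=True,
    )
    return out_path

if __name__ == "__main__":
    prompt = "a golden retriever surfing a wave at sunset"
    mux(text_to_video(prompt), text_to_speech(prompt))

The -shortest flag is there so that a voice-over longer than the clip doesn’t leave trailing frozen frames; feel free to handle the length mismatch differently if your stack supports music beds or longer clips.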
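For the front-end, something as thin as this Gradio wrapper is all I need; it assumes the three functions above are saved in a hypothetical pipeline.py next to it:

import gradio as gr

# Hypothetical module name: the sketch above saved as pipeline.py.
from pipeline import text_to_video, text_to_speech, mux

def generate(prompt: str) -> str:
    # Chain the three stages and hand Gradio the finished file path.
    return mux(text_to_video(prompt), text_to_speech(prompt))

demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt", placeholder="Describe the clip..."),
    outputs=gr.Video(label="Generated clip"),
    title="BlessVideo open-source text-to-video demo",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://127.0.0.1:7860 by default

Anything equivalent in Streamlit is equally fine; the front-end just needs a text box in and a playable video out.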