Multimodal Dialogue Generation for The Office TV Show
Deep Learning Pipeline Architecture

For the final project of CS2470: Deep Learning, we built a multimodal pipeline to generate TV show dialogue.
The system works in three stages:
- Visual Context: Uses CLIP and VGG16 to analyze a video frame, detecting the objects it contains and identifying which characters are present (via transfer learning on a character dataset); see the first sketch after this list.
- Text Generation: Feeds the visual context into a GPT-2 model fine-tuned on the entire script of The Office; see the second sketch after this list.
- Output: Generates dialogue lines that match the personality and speaking style of the characters (e.g., Michael Scott vs. Jim Halpert).
- Tools: PyTorch, Hugging Face Transformers, OpenCV
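
A minimal sketch of how the visual-context stage might be wired together, assuming a CLIP checkpoint from Hugging Face and a VGG16 whose classifier head was replaced for transfer learning. The object labels, character list, and description format below are illustrative placeholders, not the project's actual configuration:

```python
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import CLIPModel, CLIPProcessor

# Candidate object labels for zero-shot CLIP scoring (illustrative list).
OBJECT_LABELS = ["a desk", "a phone", "a stapler", "a computer", "a conference table"]
# Character classes assumed for the fine-tuned VGG16 head (hypothetical ordering).
CHARACTERS = ["Michael", "Jim", "Pam", "Dwight", "Kevin"]

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def build_character_model(num_classes: int = len(CHARACTERS)) -> torch.nn.Module:
    """VGG16 transfer learning: swap the final classifier layer for a new head.
    In practice this model would be fine-tuned on a character dataset and
    loaded from a checkpoint; here it is untrained, for illustration only."""
    model = models.vgg16(weights="IMAGENET1K_V1")
    model.classifier[6] = torch.nn.Linear(4096, num_classes)
    model.eval()
    return model

def describe_frame(frame: Image.Image, character_model: torch.nn.Module) -> str:
    """Build a text description of a frame: top CLIP object + predicted character."""
    # Zero-shot object detection: CLIP ranks the candidate labels against the frame.
    inputs = processor(text=OBJECT_LABELS, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # shape (1, num_labels)
    top_object = OBJECT_LABELS[logits.argmax().item()]

    # Character recognition with the fine-tuned VGG16 classifier.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    with torch.no_grad():
        char_logits = character_model(preprocess(frame).unsqueeze(0))
    character = CHARACTERS[char_logits.argmax().item()]

    # The description format fed to the language model is an assumption.
    return f"{character} is in a scene with {top_object}."
```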
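And a sketch of the generation stage, which prompts GPT-2 with the visual-context string plus a speaker tag. The base `gpt2` checkpoint stands in for the project's fine-tuned weights, and the prompt format and sampling parameters are assumptions, not the project's exact setup:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for the fine-tuned checkpoint

def generate_dialogue(visual_context: str, character: str, max_new_tokens: int = 60) -> str:
    """Condition the language model on the visual context and a speaker tag."""
    prompt = f"{visual_context}\n{character}:"  # hypothetical prompt format
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,          # sampling keeps the dialogue varied across runs
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_dialogue("Michael is in a scene with a conference table.", "Michael"))
```

Conditioning on a speaker tag like `Michael:` is what lets a single fine-tuned model produce lines in different characters' voices.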
For more details, view the poster, final report, or GitHub code by clicking the links at the top of the page.