Multimodal Dialogue Generation for The Office TV Show

Dec 1, 2022 · 1 min read
Deep Learning Pipeline Architecture

For the final project of CS2470: Deep Learning, we built a multimodal pipeline to generate TV show dialogue.

The system works in three stages:

  1. Visual Context: Uses CLIP and VGG16 to analyze a video frame, detecting both the objects it contains and which characters are present (the character detector is trained via transfer learning on a labeled character dataset); see the sketches after this list.
  2. Text Generation: Feeds the visual context into a GPT-2 model fine-tuned on the complete script of The Office; a minimal generation sketch appears below.
  3. Output: Generates dialogue lines that match each character's personality and speaking style (e.g., Michael Scott vs. Jim Halpert).

Tools: PyTorch, Hugging Face Transformers, OpenCV
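
As a rough illustration of the visual-context stage, the sketch below scores a frame against candidate object and character labels with CLIP via Hugging Face Transformers. The checkpoint name, label lists, and frame path are illustrative placeholders rather than the exact ones we used:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical label sets; the project scored frames against its own lists.
characters = ["Michael Scott", "Jim Halpert", "Dwight Schrute", "Pam Beesly"]
objects = ["a desk", "a computer", "a stapler", "a conference room"]
labels = characters + objects

frame = Image.open("frame_0042.png")  # placeholder path to an extracted frame
inputs = processor(text=labels, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, len(labels))
probs = logits.softmax(dim=-1).squeeze(0)

# Keep the top-scoring labels as the textual "visual context" for GPT-2.
top = probs.topk(3).indices
visual_context = ", ".join(labels[int(i)] for i in top)
print(visual_context)
```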
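
The character detector follows the standard transfer-learning recipe: start from a pretrained VGG16, freeze the convolutional backbone, and retrain the classification head on frames labeled with character names. A minimal sketch, assuming torchvision's ImageNet weights (the number of character classes is a placeholder):

```python
import torch.nn as nn
from torchvision import models

NUM_CHARACTERS = 8  # placeholder: size of the character label set

# Load VGG16 with ImageNet weights and freeze the convolutional backbone
# so only the classifier layers are updated during fine-tuning.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in vgg.features.parameters():
    param.requires_grad = False

# Swap the final fully connected layer for a character classifier head.
vgg.classifier[6] = nn.Linear(4096, NUM_CHARACTERS)
```

Training this head with an ordinary cross-entropy loop over labeled frames then yields per-frame character predictions.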

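For the generation stage, the sketch below prepends the detected visual context to a prompt and samples a continuation. It loads the stock `gpt2` checkpoint as a stand-in for our fine-tuned weights, and the prompt format and sampling settings are illustrative assumptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stock GPT-2 as a stand-in for the checkpoint fine-tuned on the scripts.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Prepend the detected visual context to steer the generated dialogue;
# this "[Scene: ...]" prompt format is a hypothetical example.
visual_context = "Michael Scott, Jim Halpert, a conference room"
prompt = f"[Scene: {visual_context}]\nMichael:"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,      # sampling keeps lines varied between runs
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
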
For more details, view the poster, final report, or GitHub code by clicking the links at the top of the page.