Multimodal Dialogue Generation for The Office TV Show

Dec 1, 2022 · 1 min read
Deep Learning Pipeline Architecture

For the final project of CS2470: Deep Learning, we built a multimodal pipeline to generate TV show dialogue.

The system works in three stages:

  1. Visual Context: Uses CLIP and VGG16 to analyze a video frame, detecting both the objects it contains and which characters are present (the character detector is trained via transfer learning on a labeled character dataset); see the sketches after this list.
  2. Text Generation: Feeds the visual context into a GPT-2 model fine-tuned on the complete script of The Office; a minimal generation sketch appears below.
  3. Output: Generates dialogue lines that match each character's personality and speaking style (e.g., Michael Scott vs. Jim Halpert).

Tools: PyTorch, Hugging Face Transformers, OpenCV
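
As a rough illustration of the visual-context stage, the sketch below scores a frame against candidate object and character labels with CLIP via Hugging Face Transformers. The checkpoint name, label lists, and frame path are illustrative placeholders rather than the exact ones we used:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical label sets; the project scored frames against its own lists.
characters = ["Michael Scott", "Jim Halpert", "Dwight Schrute", "Pam Beesly"]
objects = ["a desk", "a computer", "a stapler", "a conference room"]
labels = characters + objects

frame = Image.open("frame_0042.png")  # placeholder path to an extracted frame
inputs = processor(text=labels, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, len(labels))
probs = logits.softmax(dim=-1).squeeze(0)

# Keep the top-scoring labels as the textual "visual context" for GPT-2.
top = probs.topk(3).indices
visual_context = ", ".join(labels[int(i)] for i in top)
print(visual_context)
```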
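
The character detector follows the standard transfer-learning recipe: start from a pretrained VGG16, freeze the convolutional backbone, and retrain the classification head on frames labeled with character names. A minimal sketch, assuming torchvision's ImageNet weights (the number of character classes is a placeholder):

```python
import torch.nn as nn
from torchvision import models

NUM_CHARACTERS = 8  # placeholder: size of the character label set

# Load VGG16 with ImageNet weights and freeze the convolutional backbone
# so only the classifier layers are updated during fine-tuning.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in vgg.features.parameters():
    param.requires_grad = False

# Swap the final fully connected layer for a character classifier head.
vgg.classifier[6] = nn.Linear(4096, NUM_CHARACTERS)
```

Training this head with an ordinary cross-entropy loop over labeled frames then yields per-frame character predictions.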

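For the generation stage, the sketch below prepends the detected visual context to a prompt and samples a continuation. It loads the stock `gpt2` checkpoint as a stand-in for our fine-tuned weights, and the prompt format and sampling settings are illustrative assumptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stock GPT-2 as a stand-in for the checkpoint fine-tuned on the scripts.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Prepend the detected visual context to steer the generated dialogue;
# this "[Scene: ...]" prompt format is a hypothetical example.
visual_context = "Michael Scott, Jim Halpert, a conference room"
prompt = f"[Scene: {visual_context}]\nMichael:"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,      # sampling keeps lines varied between runs
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
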
For more details, view the poster, final report, or GitHub code by clicking the links at the top of the page.