We introduce MiniSeg, our smart chaptering model: a text segmentation model focussed on speech and video. The model helps you structure both short and long videos and talks into chapters, giving a comprehensive overview of the content. We see applications both for end users navigating content and for content creators, podcasters, and educators preparing content for their audience.

This demo showcases our recent text segmentation models, based on our research published at EACL 2024. Our state-of-the-art text segmentation model, MiniSeg, is practical and efficient: it segments long documents and talks into functionally or topically coherent chapters almost instantly. It is trained on a large collection of YouTube videos collected as part of our research. The resulting dataset, YTSeg, is made available here and is an important addition to the text segmentation landscape, which currently lacks robust benchmarks.

As part of this demo, you can select from three different video data sources:

  • NeurIPS 2021 talks: AI research presentations from the NeurIPS 2021 conference, transcribed with Whisper
  • TED talks with their human-annotated transcripts
  • KIT lecture series on machine translation, transcribed using KIT’s Lecture Translator

After pressing “Process Transcript”, you are presented with a structured version of the transcript, enriched with summaries generated by LLaMA 3 as part of this demo application. We note that the models have not been trained on any of these data sources and only receive the plain transcript, without any structural information.
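
To make the idea of chaptering a plain transcript more concrete, here is a minimal sketch of how per-sentence boundary decisions can be turned into chapters. It is an illustration only and not MiniSeg's actual interface: the function name `group_into_chapters` and the 0/1 boundary encoding (1 marks a sentence that opens a new chapter) are assumptions made for this example, and the labels are written by hand rather than predicted by a model.

```python
from typing import List


def group_into_chapters(sentences: List[str], boundaries: List[int]) -> List[List[str]]:
    """Group sentences into chapters.

    Hypothetical helper: boundaries[i] == 1 is assumed to mean that
    sentence i starts a new chapter; 0 means it continues the current one.
    """
    if len(sentences) != len(boundaries):
        raise ValueError("sentences and boundaries must have the same length")

    chapters: List[List[str]] = []
    for sentence, starts_chapter in zip(sentences, boundaries):
        # Open a new chapter at a boundary, or at the very first sentence.
        if starts_chapter or not chapters:
            chapters.append([])
        chapters[-1].append(sentence)
    return chapters


# Hand-written toy input; a segmentation model would supply the labels.
sentences = [
    "Welcome to the talk.",
    "Today we cover text segmentation.",
    "First, some background.",
    "Earlier systems relied on heuristics.",
    "Now to our model.",
]
boundaries = [1, 0, 1, 0, 1]

chapters = group_into_chapters(sentences, boundaries)
for number, chapter in enumerate(chapters, start=1):
    print(f"Chapter {number}: {' '.join(chapter)}")
```

In a setup like the demo's, each resulting chapter could then be passed to a summarization model to produce the chapter summary; that step is omitted here.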