XMem++: Production-level Video Segmentation From Few Annotated Frames


Maksym Bekuzarov, Ariana Bermudez, Joon-Young Lee, Hao Li
MBZUAI   Pinscreen   Adobe Research
ICCV 2023

*These authors equally contributed to the work.

Inspired by use cases from the movie industry, we present XMem++, an interactive video object segmentation tool that performs high-fidelity segmentation in complex and challenging scenes with only a few labeled frames.

Someone wore the wrong shirt in a scene? A tattoo that shouldn't be there? Need to add CGI to a very specific part of an object?
Or maybe you want to quickly label a video segmentation dataset with unusual or unique targets?

This tool is for you.
We also have a dataset for benchmarking purposes: PUMaVOS.

Abstract

Despite advancements in user-guided video segmentation, extracting complex objects consistently for highly complex scenes is still a labor-intensive task, especially for production. It is not uncommon that a majority of frames need to be annotated. We introduce a novel semi-supervised video object segmentation (SSVOS) model, XMem++, that improves existing memory-based models, with a permanent memory module. Most existing methods focus on single frame annotations, while our approach can effectively handle multiple user-selected frames with varying appearances of the same object or region. Our method can extract highly consistent results while keeping the required number of frame annotations low. We further introduce an iterative and attention-based frame suggestion mechanism, which computes the next best frame for annotation. Our method is real-time and does not require retraining after each user input. We also introduce a new dataset, PUMaVOS (Partial and Unusual Masks for Video Object Segmentation), which covers new challenging use cases not found in previous benchmarks. We demonstrate SOTA performance on challenging (partial and multi-class) segmentation scenarios as well as long videos, while ensuring significantly fewer frame annotations than any existing method.

Overview

XMem++ is a memory-based interactive segmentation model: it keeps a set of reference frames (feature maps) together with their masks, either predicted or provided as ground truth, and predicts masks for new frames based on how similar they are to the already processed frames whose segmentation is known.

Just like XMem, we use two types of memory inspired by the Atkinson-Shiffrin model of human memory: a working memory and a long-term memory. The former stores recent convolutional feature maps with rich details; the latter stores heavily compressed features that capture long-term dependencies between frames that are far apart in the video.
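To make this concrete, below is a minimal sketch of the kind of attention-based memory read used by STCN/XMem-style models: the current frame's key features are compared against all memorized keys, and the matching memory values (which carry the mask information) are aggregated into a readout that the decoder turns into a mask. The tensor names and shapes are simplified for illustration and are not the exact XMem++ implementation.

import torch
import torch.nn.functional as F

def memory_read(query_key, memory_keys, memory_values):
    # query_key:     (B, Ck, HW)  - key features of the current frame
    # memory_keys:   (B, Ck, N)   - key features of all memorized frames (N locations)
    # memory_values: (B, Cv, N)   - value features, carrying the mask information
    # Affinity between every query location and every memory location
    affinity = torch.einsum('bcq,bcn->bnq', query_key, memory_keys)   # (B, N, HW)
    affinity = affinity / (query_key.shape[1] ** 0.5)                 # scale by sqrt(Ck)
    affinity = F.softmax(affinity, dim=1)                             # normalize over memory
    # Each query location aggregates value features from similar memory locations
    readout = torch.einsum('bcn,bnq->bcq', memory_values, affinity)   # (B, Cv, HW)
    return readout  # decoded into the segmentation mask downstream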

However, existing segmentation methods that use memory mechanisms to predict the mask for the current frame (XMem, TBD, AoT, DeAOT, STCN, etc.) typically process frames one by one, and therefore suffer from a common issue: "jumps" in visual quality whenever a new ground-truth annotation is encountered in the video. XMem++ avoids this by keeping all user-provided annotations in a permanent memory module, so every annotated frame can influence every prediction, not just the frames that follow it.

XMem++ architecture.
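The sketch below illustrates the permanent-memory idea in simplified form: frames annotated by the user go into a memory that is never evicted, while ordinary predicted frames cycle through a bounded working memory. This is an illustrative simplification and omits XMem++'s actual consolidation and long-term memory handling.

from collections import deque

class SimplifiedMemoryBank:
    # Illustrative simplification of a permanent + working memory split.
    def __init__(self, working_capacity=5):
        self.permanent = []                             # user-annotated frames, never evicted
        self.working = deque(maxlen=working_capacity)   # recent predicted frames, FIFO eviction

    def add_annotated(self, key, value):
        self.permanent.append((key, value))             # ground-truth frames stay forever

    def add_predicted(self, key, value):
        self.working.append((key, value))               # oldest prediction is dropped when full

    def entries(self):
        # A memory read attends to permanent and working memory jointly,
        # so information from every annotated frame reaches every frame.
        return self.permanent + list(self.working)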

Frame annotation candidate selector

XMem++ is equipped with a simple yet powerful algorithm that selects which frames the user should annotate next, to maximize quality while saving annotation time. It is based on the idea of diversity: choose the frames that capture the widest variety of the target object's appearance, so that annotating them gives the network the most new information.

The selector takes into account which object we are trying to segment and recommends the frames in which that specific object's appearance varies the most.
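As a rough illustration of this diversity idea (not the attention-based algorithm from the paper), a greedy farthest-point selection over per-frame descriptors of the target region could look like the sketch below; frame_features is a hypothetical (T, D) array of pooled target-object features, one row per frame.

import numpy as np

def suggest_frames(frame_features, annotated, num_suggestions=3):
    # frame_features: hypothetical (T, D) array of per-frame target-object descriptors
    # annotated: indices of frames already annotated by the user (at least one)
    chosen = list(annotated)
    suggestions = []
    for _ in range(num_suggestions):
        # Distance of every frame to its nearest already-chosen frame
        dists = np.linalg.norm(
            frame_features[:, None] - frame_features[chosen][None], axis=-1
        ).min(axis=1)
        dists[chosen] = -np.inf              # never re-suggest an already-chosen frame
        best = int(np.argmax(dists))         # frame with the most "novel" target appearance
        chosen.append(best)
        suggestions.append(best)
    return suggestions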

Frame selector at work: given a different target person in rows 1 and 2, it selects different frames in which that specific person moves. In the third row, the selected frames capture the target object across a variety of backgrounds, expressions, and lighting conditions.

Performance

  • XMem++ achieves strong visual results across a variety of use cases using only a few annotations (typically fewer than 10 for a 30-60 s video).
  • It treats ground-truth annotations as references and infers masks for new frames based on their similarity to the known references.
  • No fine-tuning, plug-and-play, ~30 FPS on 480p video on a single GPU; check out the [Code] on GitHub for more details.

Video Presentation

PUMaVOS dataset

We used XMem++ to collect and annotate PUMaVOS - a dataset of challenging and practical use cases inspired by the movie production industry.

Example sequences: Billie Shoes, Short Chair, Dog Tail, Workout Pants, SKZ, Tattoo, Ice Cream, Vlog.

The Partial and Unusual Masks for Video Object Segmentation (PUMaVOS) dataset has the following properties:

  • 24 videos, 21,187 densely annotated frames;
  • Covers complex practical use cases such as object parts, frequent occlusions, fast motion, deformable objects, and more;
  • The average video is 883 frames (about 29 s) long, with the longest spanning around a minute;
  • Fully densely annotated at 30 FPS;
  • Benchmark-oriented: no training/test split; designed to be as diverse as possible for testing your models;
  • 100% open and free to download.
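A minimal loading sketch, assuming the DAVIS-style JPEGImages/Annotations layout commonly used for VOS datasets (verify against the downloaded archive, since the exact folder names here are an assumption):

from pathlib import Path
import numpy as np
from PIL import Image

def load_sequence(root, sequence):
    # Assumed layout: root/JPEGImages/<sequence>/*.jpg and root/Annotations/<sequence>/*.png
    frame_dir = Path(root) / 'JPEGImages' / sequence
    mask_dir = Path(root) / 'Annotations' / sequence
    frames, masks = [], []
    for frame_path in sorted(frame_dir.glob('*.jpg')):
        frames.append(np.array(Image.open(frame_path)))
        mask_path = mask_dir / (frame_path.stem + '.png')
        # Masks are typically palettized PNGs where each pixel value is an object id (0 = background)
        masks.append(np.array(Image.open(mask_path)) if mask_path.exists() else None)
    return frames, masks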

Download

Separate sequences and masks are available here: [Google Drive] [Mirror Google Drive]


PUMaVOS .zip download link: [Google Drive] [Mirror Google Drive]


PUMaVOS license

PUMaVOS is released under the CC BY 4.0 license: you can use it for any purpose (including commercial); you only need to credit the authors (us) and indicate if you've made any modifications. See the full license text in LICENSE_PUMaVOS.


PUMaVOS contains 5 videos from YouTube. We do not claim ownership of them, and here are the links to the original videos and their creators:

BibTeX

If you are using XMem++ or the PUMaVOS dataset in your work, please cite us:

@misc{bekuzarov2023xmem,
    title={XMem++: Production-level Video Segmentation From Few Annotated Frames}, 
    author={Maksym Bekuzarov and Ariana Bermudez and Joon-Young Lee and Hao Li},
    year={2023},
    eprint={2307.15958},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}