Text-based Editing of Talking-head Video

✧ Stanford University, ✻ Max Planck Institute for Informatics, ⚘ Princeton University, § Adobe Research

Abstract: Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
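
To make the structure of the pipeline described above easier to follow, here is a minimal Python-style sketch of its main stages. All class and function names (FrameAnnotation, annotate, select_segments, blend_parameters, render_intermediate, neural_render, edit_video) are illustrative placeholders introduced for exposition, not the system's actual implementation or API; each stage is left as a documented stub.

# Hypothetical sketch of the text-based editing pipeline.
# Names and signatures are placeholders, not the authors' code.

from dataclasses import dataclass
from typing import List


@dataclass
class FrameAnnotation:
    """Per-frame parameters recovered from the input video."""
    phoneme: str             # transcript-aligned phoneme label
    viseme: str              # visual speech unit derived from the phoneme
    pose: List[float]        # rigid 3D head pose
    geometry: List[float]    # shape coefficients of a parametric face model
    reflectance: List[float]
    expression: List[float]
    illumination: List[float]


def annotate(video_frames, transcript) -> List[FrameAnnotation]:
    """Step 1: align the transcript to the audio and fit a parametric
    3D face model to every frame (placeholder stub)."""
    raise NotImplementedError


def select_segments(annotations, edited_transcript):
    """Step 2: an optimization chooses snippets of the input whose visemes
    match the edited transcript and can be stitched without visible seams."""
    raise NotImplementedError


def blend_parameters(segments) -> List[FrameAnnotation]:
    """Step 3: smoothly blend pose/expression parameters across segment
    boundaries so the composite has no jump cuts."""
    raise NotImplementedError


def render_intermediate(params, background_frames):
    """Step 4: re-render the lower half of the face with the parametric
    model, composited over retimed background frames."""
    raise NotImplementedError


def neural_render(intermediate_frames):
    """Step 5: a recurrent video generation network converts the synthetic
    composite into photorealistic output frames."""
    raise NotImplementedError


def edit_video(video_frames, transcript, edited_transcript):
    annotations = annotate(video_frames, transcript)
    segments = select_segments(annotations, edited_transcript)
    params = blend_parameters(segments)
    intermediate = render_intermediate(params, video_frames)
    return neural_render(intermediate)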

Ethical Considerations:

Our text-based editing approach lays the foundation for better editing tools for movie post production. Filmed dialogue scenes often require re-timing or editing based on small script changes, which currently involves tedious manual work. Our editing technique also enables easy adaptation of audio-visual video content to specific target audiences: e.g., instruction videos can be fine-tuned to audiences of different backgrounds, or a storyteller video can be adapted to children of different age groups purely based on textual script edits. In short, our work was developed for storytelling purposes.

However, the availability of such technology — at a quality that some might find indistinguishable from source material — also raises important and valid concerns about the potential for misuse. Although methods for image and video manipulation are as old as the media themselves, the risks of abuse are heightened when applied to a mode of communication that is sometimes considered to be authoritative evidence of thoughts and intents. We acknowledge that bad actors might use such technologies to falsify personal statements and slander prominent individuals. We are concerned about such deception and misuse.

Therefore, we believe it is critical that video synthesized using our tool clearly presents itself as synthetic. The fact that the video is synthesized may be obvious from context (e.g. if the audience understands they are watching a fictional movie), directly stated in the video, or signaled via watermarking. We also believe that it is essential to obtain permission from the performers for any alteration before sharing a resulting video with a broad audience. Finally, it is important that we as a community continue to develop forensics, fingerprinting and verification techniques (digital and non-digital) to identify manipulated video. Such safeguarding measures would reduce the potential for misuse while allowing creative uses of video editing technologies like ours.

We hope that publication of the technical details of such systems can spread awareness and knowledge regarding their inner workings, sparking and enabling associated research into the aforementioned forgery detection, watermarking and verification systems. Finally, we believe that a robust public conversation is necessary to create a set of appropriate regulations and laws that would balance the risks of misuse of these tools against the importance of creative, consensual use cases.


Fig. 1: We propose a novel text-based editing approach for talking-head video. Given an edited transcript, our approach produces a realistic output video in which the dialogue of the speaker has been modified and the resulting video maintains a seamless audio-visual flow (i.e. no jump cuts).

If you find this work useful, please consider citing it:

@article{Fried:2019:TET:3306346.3323028,
  author = {Fried, Ohad and Tewari, Ayush and Zollh\"{o}fer, Michael and Finkelstein, Adam and Shechtman, Eli and Goldman, Dan B and Genova, Kyle and Jin, Zeyu and Theobalt, Christian and Agrawala, Maneesh},
  title = {Text-based Editing of Talking-head Video},
  journal = {ACM Trans. Graph.},
  issue_date = {July 2019},
  volume = {38},
  number = {4},
  month = jul,
  year = {2019},
  issn = {0730-0301},
  pages = {68:1--68:14},
  articleno = {68},
  numpages = {14},
  url = {http://doi.acm.org/10.1145/3306346.3323028},
  doi = {10.1145/3306346.3323028},
  acmid = {3323028},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {dubbing, face parameterization, face tracking, neural rendering, talking heads, text-based video editing, visemes},
}