daVinciNet: Joint Prediction of Motion and Surgical State in Robot-Assisted Surgery

In robot-assisted surgery, it is important to predict the trajectories of robotic surgical instruments: such predictions help to prevent collisions between instruments or with nearby obstacles and to implement multi-agent surgical systems. Predicting upcoming surgical states, in turn, enables a seamless operational workflow and synchronized collaboration between surgeons and operating room staff.

Image credit: U.S. Air Force, Andrew D. Sarver / Public Domain via nellis.af.mil

A recent paper proposes daVinciNet: a model that jointly predicts instrument trajectories in the endoscopic reference frame and future surgical states. It draws on multiple data sources, including robot kinematics, endoscopic vision, and system events, and captures the temporal structure of these data sequences with learning-based methods. The model makes multi-step predictions up to 2 seconds in advance. Its surgical state prediction accuracy is comparable to that of human annotators, and its trajectory prediction error is as low as 1.64 mm.
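As a rough illustration of how the two reported metric types might be computed, here is a toy evaluation in NumPy. All shapes and numbers below are made up for the example and are not taken from the paper; the paper's actual evaluation protocol may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Surgical-state prediction: accuracy is the fraction of correctly
# predicted discrete state labels over the evaluation window.
true_states = rng.integers(0, 8, size=100)   # 8 hypothetical states
pred_states = true_states.copy()
pred_states[:10] = (pred_states[:10] + 1) % 8  # corrupt 10 labels
state_accuracy = np.mean(pred_states == true_states)

# Trajectory prediction: error is the mean Euclidean distance (mm)
# between predicted and ground-truth 3-D end-effector positions.
true_traj = rng.normal(size=(100, 3))
pred_traj = true_traj + rng.normal(scale=0.5, size=(100, 3))
traj_error_mm = np.mean(np.linalg.norm(pred_traj - true_traj, axis=1))

print(f"state accuracy: {state_accuracy:.2%}")        # 90.00%
print(f"mean trajectory error: {traj_error_mm:.2f} mm")
```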

This paper presents a technique to concurrently and jointly predict the future trajectories of surgical instruments and the future state(s) of surgical subtasks in robot-assisted surgeries (RAS) using multiple input sources. Such predictions are a necessary first step towards shared control and supervised autonomy of surgical subtasks. Minute-long surgical subtasks, such as suturing or ultrasound scanning, often have distinguishable tool kinematics and visual features, and can be described as a series of fine-grained states with transition schematics. We propose daVinciNet – an end-to-end dual-task model for robot motion and surgical state predictions. daVinciNet performs concurrent end-effector trajectory and surgical state predictions using features extracted from multiple data streams, including robot kinematics, endoscopic vision, and system events. We evaluate our proposed model on an extended Robotic Intra-Operative Ultrasound (RIOUS+) imaging dataset collected on a da Vinci Xi surgical system and the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS). Our model achieves up to 93.85% short-term (0.5s) and 82.11% long-term (2s) state prediction accuracy, as well as 1.07mm short-term and 5.62mm long-term trajectory prediction error.
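The dual-task idea described above — features from the kinematics, vision, and event streams feeding a shared encoder with separate heads for end-effector trajectories and surgical-state labels — can be sketched as a toy model. The dimensions, the single linear encoder, and the random weights below are placeholder assumptions for illustration only; daVinciNet's actual architecture uses learned temporal models over the full sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative, not from the paper)
KIN_DIM, VIS_DIM, EVT_DIM = 6, 16, 4
HIDDEN = 32
N_STATES = 8      # number of fine-grained surgical states
HORIZON = 20      # future steps (e.g., 2 s at 10 Hz)

def extract_features(kinematics, vision, events):
    """Concatenate per-timestep features from the three data streams."""
    return np.concatenate([kinematics, vision, events], axis=-1)

class DualTaskPredictor:
    """Toy joint predictor: a shared encoder feeding two heads, one for
    future (x, y, z) end-effector positions and one for surgical-state
    logits. A stand-in for daVinciNet's learned dual-task model."""

    def __init__(self, in_dim):
        self.W_enc = rng.normal(0.0, 0.1, (in_dim, HIDDEN))
        self.W_traj = rng.normal(0.0, 0.1, (HIDDEN, HORIZON * 3))
        self.W_state = rng.normal(0.0, 0.1, (HIDDEN, N_STATES))

    def __call__(self, feats):
        h = np.tanh(feats[-1] @ self.W_enc)  # encode latest timestep
        traj = (h @ self.W_traj).reshape(HORIZON, 3)  # future positions
        state_logits = h @ self.W_state               # next-state scores
        return traj, state_logits

T = 50  # observed history length
feats = extract_features(
    rng.normal(size=(T, KIN_DIM)),
    rng.normal(size=(T, VIS_DIM)),
    rng.normal(size=(T, EVT_DIM)),
)
model = DualTaskPredictor(feats.shape[-1])
traj, logits = model(feats)
print(traj.shape, logits.shape)  # (20, 3) (8,)
```

The key design point is that both heads share one encoded representation, so the motion and state tasks can reinforce each other during training rather than being solved independently.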

Link: https://arxiv.org/abs/2009.11937