Driving Computer Vision with Deep Learning
When we started working on scene understanding in 2015, we proposed SegNet, one of the first deep learning architectures for semantic segmentation. It is remarkable to look at the progress of state-of-the-art scene understanding over just a few years. In 2018, we can now understand semantics, geometry and motion from a single deep learning model, and the accuracy, robustness and performance of these models have improved dramatically. In this blog, we discuss the progression of deep learning for scene understanding, and release a new 2018 web demo (see below).
The video on the left shows SegNet, which was state-of-the-art when we developed it in 2015. On the right is our latest model, which can predict depth, semantic segmentation and optical flow, all from a single monocular video. This drastic improvement in computer vision technology is what makes it possible to drive on the road with camera-only perception.
Our new model encodes the input video feed into a single, high-dimensional tensor representing semantics, geometry and motion. We can use this tensor to make driving decisions. We can also decode it to various outputs, to visualise what it understands. Our visualisation shows (clockwise from top left):
input video feed to the algorithm from a monocular front-facing camera,
semantic segmentation, which predicts the semantic class of each pixel and the spatial layout of the scene,
optical flow prediction, showing how the world moves, where the colour shows the direction of motion and the colour’s intensity shows the magnitude of motion,
monocular depth prediction, showing the geometry of the scene, predicted from a single monocular image (with no stereo cameras or LIDAR).
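The flow colour-coding described above follows a common convention: the direction of motion maps to hue and the magnitude to intensity. Below is a minimal NumPy sketch of such a visualiser; the function name and the exact hue mapping are ours for illustration, not necessarily the mapping used in our model's output.

```python
import numpy as np
import colorsys

def flow_to_colour(flow):
    """Map a dense optical flow field (H, W, 2) to an RGB image.

    Direction of motion -> hue; magnitude of motion -> intensity (HSV value).
    """
    dx, dy = flow[..., 0], flow[..., 1]
    angle = np.arctan2(dy, dx)                       # direction in radians
    magnitude = np.hypot(dx, dy)
    hue = (angle + np.pi) / (2 * np.pi)              # normalise to [0, 1]
    value = magnitude / max(magnitude.max(), 1e-8)   # scale intensity
    rgb = np.zeros(flow.shape[:2] + (3,))
    for i in range(flow.shape[0]):                   # per-pixel HSV -> RGB
        for j in range(flow.shape[1]):
            rgb[i, j] = colorsys.hsv_to_rgb(hue[i, j], 1.0, value[i, j])
    return rgb
```

Static pixels come out black (zero intensity), while the fastest-moving pixel is rendered at full intensity in the hue corresponding to its direction.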
We can also demonstrate these models performing robustly, in real-time on an autonomous vehicle. This video shows a single deep learning algorithm operating in real-time at 25 Hz on our car using a small mobile GPU (NVIDIA Drive PX2).
Real-time scene understanding with computer vision
To further illustrate what computer vision can do today, we have created a web demo which runs a small deep learning algorithm live in your web browser - even on your mobile! It will turn an image into a 3D semantic point cloud, without any need for video information or LiDAR scans.
To produce the web demo, we took our perception model built in PyTorch, converted it to Keras, then converted the Keras model to TensorFlow.js so it could run directly in the browser. We then produced a point cloud from the model's depth output by projecting the depth values into 3D world space, and rendered it with Three.js.
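The projection step can be sketched with a standard pinhole camera model: each pixel is back-projected into 3D using its predicted depth and the camera intrinsics. The NumPy sketch below illustrates the geometry only; the function name is ours, and the intrinsics (focal lengths `fx`, `fy` and principal point `cx`, `cy`) are placeholders, not the demo's actual TensorFlow.js implementation.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into a 3D point cloud (H*W, 3)
    with a pinhole camera model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

The resulting (x, y, z) points can then be handed to any renderer, which is the role Three.js plays in the browser demo.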
For computer vision to become truly ubiquitous in intelligent robotic decision making, we would like to highlight three key research challenges.
Robust representation that generalises to the world
We can never rely on having a supervised label for every possible example we encounter in the world. One of the greatest challenges for a real-world computer vision system is to be able to efficiently learn a representation from limited data which generalises to novel situations. This is especially true for autonomous driving - the variety of road scenes that exist is substantial. There are a number of promising approaches to learning robust representations such as self-supervision, multi-task learning and simulation (more on this soon!).
Combining perception and action
We believe that the most effective system will come from jointly optimising perception and decision making algorithms to achieve a task. Today, these systems are often considered separately. Engineers will try to build a perception system to extract a predefined representation of the world, such as 3D object detection and lane segmentation.
But this is probably not sufficient. We need to understand not just the position and orientation of other vehicles, but their behaviour, intent, indicators and many other subtle cues, which makes this state prohibitively hard to hand-engineer. A perception system needs to learn a state which is complex, probably high-dimensional, and which represents semantics, motion and geometry.
Application to self-driving and mobile robotics
Deploying computer vision systems on robotics platforms presents a number of safety challenges. We cannot develop and validate these algorithms by brute force testing them. Instead, we need to quantify uncertainty, understand saliency for decisions, interpret intermediate representations and reason with multiple sensors in real-time.
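One widely used way to quantify a model's uncertainty is to keep dropout active at test time and average many stochastic forward passes (Monte Carlo dropout). The sketch below illustrates the idea in NumPy on a single hypothetical linear layer; the function and its parameters are our own illustration, not the system described above, which would apply the same idea to a deep network.

```python
import numpy as np

def mc_dropout_predict(weights, x, n_samples=100, drop_p=0.5, seed=0):
    """Estimate a predictive mean and an uncertainty measure by averaging
    stochastic forward passes of one linear layer under random dropout
    masks (Monte Carlo dropout, shown here on a toy one-layer model)."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        # Inverted dropout: randomly zero inputs, rescale to preserve the mean.
        mask = (rng.random(x.shape) > drop_p) / (1.0 - drop_p)
        preds.append(weights @ (x * mask))
    preds = np.array(preds)
    # The spread across stochastic passes serves as the uncertainty estimate.
    return preds.mean(axis=0), preds.std(axis=0)
```

A prediction with a large standard deviation across passes is one the downstream decision-making system should treat with caution.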
We also care far more about robustness than precision. For example, for humans to drive, we do not need to know the position of other cars to the nearest millimetre. We simply need to detect the car and its rough spatial layout within the scene. The same is true for the algorithms we build: we care about robustness rather than millimetre-accurate geometry from a LiDAR laser scanner.
To conclude, over the last few years computer vision technology has matured to the point where it works robustly in the wild. We’re betting that camera-only perception will drive the intelligent robots of the future.