From Mixed Reality To Autonomous Vehicles
It is three months since I joined Wayve to lead its deep learning and AI organisation. Personally, it has been a return to working with road scenes after an immensely fruitful collaborative journey which began many years ago at Cambridge, UK. During my time at Wayve, I have been meaning to write down some of my current thoughts on building AI for autonomous vehicles (AVs), and to share what motivated me to make this transition from Magic Leap, a company at the frontier of mixed reality (MR) development. So here they are:
AV and MR are currently the two major problem domains for deep learning researchers and engineers. The last 4-5 years in MR have been invested in building the best SLAM systems for head-mounted devices, efficient 3D reconstruction algorithms, and hand gesture and eye gaze estimation algorithms. In these years, I have personally contributed to several deep learning models in production and have also been extremely lucky to be immersed in a great team which produced some state-of-the-art results in deep SLAM and 3D reconstruction. It has been amazing to see deep learning grow in might within production settings, pushing the boundaries of performance under highly constrained compute (~10-100X fewer FLOPs than a ResNet-50), memory and battery power (~2W for all inference tasks). In particular, the deep learning community has made significant progress on low-bit-width model training and inference in the last few years. What seemed highly challenging to productise four years ago is now commonplace on many mobile devices. More interestingly, the current generation of core perception and SLAM/geometry production systems has adopted end-to-end deep learning and multi-task deep learning at its core.
In hindsight, I observe a number of similarities between the MR and AV domains. From a deep learning perspective, new architectural improvements and training methods which originate in one domain are transferable to the other. The close coupling between sensors, their calibration (online calibration is required in both domains) and the ensuing neural network training is similar in both domains. There is an effort in both domains to rely less on active sensors by using learnt neural networks, because of power and/or cost constraints. Mixed precision training in the MR domain is something which will eventually find its place in AV model deployment. Path planning, which is a central problem in AV, will eventually find applications as virtual agents emerge in MR. In general, the challenges of efficient, embodied and continual learning need to be addressed in both domains. What then motivated me to make the transition to the AV space?
It is perhaps fair to say that the pace of deep learning has been much faster than that of efficient compute development in the MR domain. One drawback for many deep learning engineers and scientists in the MR domain is that any trained deep model has to be compute-optimised (usually by 100-1000X) before embodied testing can be carried out on head-mounted displays (HMDs). This means that getting a feel for the performance of a model is a slow process. We are also yet to see, at any reasonable scale, the emergence of an MR cloud which gathers data from a fleet of devices while also providing shared services and experiences built on high-quality maps.
In contrast to MR, the domain of AVs is one where the time to deploy trained models on a car is relatively short. Compute and battery power are more scalable in a car, and models trained in the cloud can often be directly deployed and tested in the real world. This is very attractive and can provide quick insights for researchers interested in real-world machine learning and, in particular, embodied intelligence. In fact, at Wayve it is common for us to test new models on the road several times a week to gain real insights.
Research in vision over the last decade was largely dominated by training and testing models on large static datasets like ImageNet and Cityscapes, for classification and road scene segmentation respectively. The top-k accuracy on the ImageNet challenge using neural architecture search is 97.5%, and accuracy on image segmentation is similarly high. Although these datasets are still valuable for researchers, the general consensus within the research community is that these challenges are no longer the prime drivers of research. We have now entered a decade where the key challenge is to build embodied intelligence, i.e., agents which learn to work in real, everyday environments where there is also human activity. This brings new and exciting challenges, including the gathering, storing and cloud processing of extremely large datasets (e.g. data from a large and diverse fleet of vehicles). The ability to train models on such large and continually growing datasets requires a closely coordinated effort between data engineers, machine learning practitioners and machine learning researchers. At Wayve, we term this fleet learning and it is our core product.
With the scale of data we collect at Wayve, both with our own autonomous vehicle fleet and in partnership with other large external fleets, our focus as researchers is on achieving end-to-end learning for robust self-driving. Here, I am personally spurred on by the successes we achieved in the MR domain, for example with end-to-end 3D reconstruction, and more generally by how multi-stage systems are being simplified by more end-to-end learning. Similarly, in our road tests at Wayve, we are seeing how end-to-end trained models can handle complex structures such as roundabouts, traffic islands and traffic light intersections. In addition, we have seen some exciting results with end-to-end trained models working in entirely new environments.
In the near future, there are several important axes of exploration towards continually improving the on-road driving performance of AVs and making it more robust. These include the use of emerging deep architectures, learning geometry, novel parameterisations of input and action spaces, self-supervised and unsupervised training of representations, learning from on-policy data, and continual learning. That said, the unwritten rule that bigger models, bigger data and more compute equal better performance has stood the test of time over the last decade. I am fairly confident this will be the case for robotics as well in this decade. To all researchers and engineers, come join us now in this generational journey!