14 September 2023 | Research
LINGO-1: Exploring Natural Language for Autonomous Driving
At Wayve, we use natural language to enhance the learning and explainability of our foundation driving models. In this blog, we introduce LINGO-1, an open-loop driving commentator that combines vision, language and action to enhance how we interpret, explain and train our foundation driving models.
Introduction to Vision-Language-Action Models (VLAMs)
The influence of large language models (LLMs) in this era of artificial intelligence is all around us. In recent years, we’ve seen the rise of LLMs that use self-supervised learning on vast internet-scale datasets to produce human-like responses to natural language queries. These models have transformed deep learning and generative AI and are now being used to automate numerous tasks. They have also led to growing interest in vision-language models (VLMs), which blend the reasoning capabilities of LLMs with images and videos to perform tasks such as image classification, text-to-image retrieval, and visual question answering (i.e. answering questions about an image using natural language) with impressive accuracy. At Wayve, we are taking this one step further by exploring vision-language-action models (VLAMs) that incorporate three kinds of information: images, driving data, and now language.
The use of natural language in training robots is still in its infancy, particularly in autonomous driving. Incorporating language along with vision and action may have an enormous impact as a new modality to enhance how we interpret, explain and train our foundation driving models. By foundation driving models, we mean models that can perform several driving tasks, including perception (perceiving the world around them), causal and counterfactual reasoning (making sense of what they see), and planning (determining the appropriate sequence of actions). We can use language to explain the causal factors in the driving scene, which may enable faster training and generalisation to new environments.
We can also use language to probe models with questions about the driving scene to more intuitively understand what it comprehends. This capability can provide insights that could help us improve our driving models’ reasoning and decision-making capabilities. Equally exciting, VLAMs open up the possibility of interacting with driving models through dialogue, where users can ask autonomous vehicles what they are doing and why. This could significantly impact the public’s perception of this technology, building confidence and trust in its capabilities.
In addition to having a foundation driving model with broad capabilities, it is also eminently desirable for it to efficiently learn new tasks and quickly adapt to new domains and scenarios where only a small number of training samples is available. Here is where natural language could add value in supporting faster learning. For instance, we can imagine a scenario where a corrective driving action is accompanied by a natural language description of incorrect and correct behaviour in that situation. This extra supervision can enhance few-shot adaptations of the foundation model. With these ideas in mind, our Science team is exploring using natural language to build foundation models for end-to-end autonomous driving.
Introducing LINGO-1, an open-loop driving commentator
In this first blog on VLAMs, we showcase how we’re building a natural language driving dataset using our fleet, and the results of using this dataset to train LINGO-1, an open-loop driving commentator.
A key feature in the development of LINGO-1 is our creation of a scalable and diverse dataset that incorporates image, language and action data gathered from our expert drivers commentating as they drive around the UK. The commentary technique is reminiscent of the roadcraft used by professional driving instructors in their lessons: instructors call out interesting aspects of the scene and justify their driving actions with short phrases, helping their students learn by example. Here are some examples of what our expert drivers might describe, covering events that affect their actions or draw attention to safe driving:
- slowing down for a lead vehicle or a change in traffic lights,
- changing lanes to follow a route,
- accelerating to the speed limit,
- noticing other cars coming onto the road or stopped at an intersection,
- approaching hazards such as roundabouts, Give Way signs, parked cars, traffic lights or schools,
- actions other road users are taking, such as changing lanes or overtaking parked vehicles,
- cyclists and pedestrians waiting at zebra crossings or coming up from behind the car in a cycle lane.
When these phrases are temporally synchronised with sensory images and low-level driving actions, we obtain a rich vision-language-action dataset to train models for diverse tasks.
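To make the synchronisation step concrete, here is a minimal Python sketch of how spoken commentary phrases might be temporally aligned with camera frames and low-level driving actions. The field names and the nearest-phrase matching rule are our own illustrative assumptions, not Wayve’s actual data schema or pipeline:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class VLASample:
    """One vision-language-action record (illustrative fields only)."""
    timestamp_s: float                # capture time of the camera frame
    image_path: str                   # synchronised camera image
    speed_mps: float                  # low-level driving state
    steering_angle_rad: float         # low-level driving action
    commentary: Optional[str] = None  # expert phrase active at this time

def attach_commentary(frames: List[VLASample],
                      phrases: List[Tuple[float, str]],
                      tolerance_s: float = 0.5) -> List[VLASample]:
    """Pair each frame with the nearest spoken phrase within a time window."""
    for frame in frames:
        if not phrases:
            break
        t, text = min(phrases, key=lambda p: abs(p[0] - frame.timestamp_s))
        if abs(t - frame.timestamp_s) <= tolerance_s:
            frame.commentary = text
    return frames
```

Frames with no phrase close enough in time simply keep `commentary=None`, so the language signal stays sparse rather than being force-fitted to every image.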
Our driving commentary data enhances our standard expert driving dataset collection without compromising the rate at which we collect expert driving data—enabling a cost-effective approach to gather another layer of supervision through natural language. We train each expert driver to follow a commentary protocol to maintain dataset quality. This protocol includes paying attention to the relevance and density of words spoken, the temporal synchronisation between commentary and driving actions, and the terminology used to describe events.
As depicted in the diagram below, we train LINGO-1, our open-loop driving model, on various vision and language data sources to perform visual question answering (VQA) on tasks such as perception, counterfactuals, planning, reasoning and attention. LINGO-1 can perform many tasks through simple prompt changes. This allows us to prompt LINGO-1 with questions regarding scene understanding and reasoning about the primary causal factors in the scene affecting the driving decision. In other words, LINGO-1 can provide a description of the driving actions and reasoning.
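As a rough illustration of how one model can be steered toward different tasks through simple prompt changes, a table of task-specific templates suffices. The task names below mirror those in the text, but the wording and structure are our own assumption, not LINGO-1’s actual prompts:

```python
# Hypothetical prompt templates for a single VQA model; the task names
# mirror those above, but the wording is assumed, not LINGO-1's prompts.
PROMPT_TEMPLATES = {
    "perception":     "Describe the key objects and road users in the current scene.",
    "attention":      "What are you paying attention to right now, and why?",
    "planning":       "What driving action will you take next?",
    "reasoning":      "Why did you take that action?",
    "counterfactual": "What would you do if {event}?",
}

def build_query(task: str, **slots: str) -> str:
    """Select a template by task name and fill in any placeholders."""
    return PROMPT_TEMPLATES[task].format(**slots)
```

Switching tasks is then just a matter of changing the prompt, with no change to the model itself.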
LINGO-1 can generate a continuous commentary that explains the reasoning behind driving actions. This can help us understand in natural language what the model is paying attention to and what it is doing. Below are a few examples:
In this first video, LINGO-1 describes the actions it takes when it overtakes a parked car.
LINGO-1: I’m edging in due to the slow-moving traffic.
LINGO-1: I’m overtaking a vehicle that’s parked on the side.
LINGO-1: I’m accelerating now since the road ahead is clear.
Below, you can see LINGO-1’s explanation as the car approaches a zebra crossing.
LINGO-1: I’m maintaining my speed; the road continues to be clear.
LINGO-1: I’m now decelerating, braking, and coming to a stop.
LINGO-1: Remaining stopped at the zebra crossing.
LINGO-1: I’m now accelerating from a stopped position.
LINGO-1: I’m accelerating as the road is clear.
Finally, this is LINGO-1’s explanation as it turns left at an intersection.
LINGO-1: I’m remaining stationary as the lead vehicle is also stopped.
LINGO-1: I’m accelerating now, as we’re clear.
LINGO-1: I’m applying the brakes to stop at the junction.
LINGO-1: I’m moving now because the lane is clear.
LINGO-1: Completing a left turn to follow the route.
Visual Question Answering (VQA)
In addition to commentary, we can also ask LINGO-1 questions about various driving scenes to evaluate the model’s scene comprehension and understand its reasoning. In the following video clips, you can see how it describes what to do in various situations.
In this first example, we ask LINGO-1 about what it’s paying attention to at this intersection.
We can also ask LINGO-1 about the weather and how it affects its driving.
Finally, here’s an example of how LINGO-1 understands safe driving behaviour around cyclists.
We track LINGO-1’s performance against a set of comprehensive benchmarks that measure its question-answering accuracy on various perception, reasoning and driving knowledge tasks. On these benchmarks, LINGO-1 currently reaches around 60% of human-level performance.
In this graph, you can see how LINGO-1’s performance has increased over the last few weeks: average validation accuracy has doubled to almost 60% as we have improved the model architecture and the training datasets. We continue to iterate rapidly on LINGO-1 and expect further substantial improvements in the coming weeks.
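As a sketch of how an average validation accuracy can be computed across several task benchmarks, one can weight each task equally regardless of how many questions it contains. The task names and scores below are made up for illustration and are not LINGO-1’s actual benchmark results:

```python
def average_accuracy(per_task: dict) -> float:
    """Mean of per-task accuracies, weighting each task equally.

    per_task maps a task name to (num_correct, num_questions).
    """
    accuracies = [correct / total for correct, total in per_task.values()]
    return sum(accuracies) / len(accuracies)

# Illustrative numbers only, not LINGO-1's actual benchmark scores.
scores = {
    "perception":        (70, 100),
    "reasoning":         (55, 100),
    "driving_knowledge": (55, 100),
}
print(average_accuracy(scores))  # approximately 0.6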
Including natural language as a new modality can revolutionise autonomous driving in several ways.
Advancing AI explainability of end-to-end models
The lack of explainability in machine learning models is a common concern, as the decision-making process often seems like a black box. However, by leveraging language, we can shed light on how AI systems make decisions.
Creating natural language interfaces could allow users to engage in meaningful conversations with AI models, enabling them to question choices and gain insight into scene understanding and decision-making. This unique dialogue between passengers and autonomous vehicles could increase transparency, making it easier for people to understand and trust these systems. Furthermore, integrating language may enhance the model’s ability to adapt and learn from human feedback. Like a driving instructor guiding a student driver, corrective instructions and user feedback could refine the model’s understanding and decision-making processes over time.
Improving driving performance through better planning and reasoning
In the near future, we aim to harness LINGO-1’s natural language, reasoning and planning capabilities to enhance our closed-loop driving models. We are examining many different integration architectures, but we show a high-level architecture below.
A critical aspect of integrating the language and driving models is grounding between them. The two main factors affecting driving performance are the ability of the language model to accurately interpret scenes using various input modalities and the proficiency of the driving model in translating mid-level reasoning into effective low-level planning.
Efficient learning for handling new or long-tail scenarios
While a picture may be worth a thousand words, a paragraph is worth a thousand images when it comes to training. Let’s look at why. With natural language, we can explain causal factors in the driving scene. For example, instead of needing thousands of training examples of a car slowing down for a pedestrian in the road, we can use a few examples accompanied by a short text description of how to act in a particular situation and other factors to consider. In other words, we can accelerate learning by incorporating a description of the driving actions and causal reasoning into the model’s training.
Causal reasoning is vital in autonomous driving, enabling the system to understand the relationships between elements and actions within a scene. A well-grounded VLAM may significantly improve a foundation model’s ability to identify key causal components and understand the connections between entities in a driving scene. For instance, the system could link pedestrians waiting at a zebra crossing to the traffic signal indicating “do not cross.” This advancement has the potential to significantly improve planning, especially in challenging scenarios with limited data.
In addition, we can incorporate language-derived general-purpose knowledge from LLMs into driving models to enable enhanced generalisation to new, never-before-seen situations. LLMs already possess vast knowledge of human behaviour from internet-scale datasets, making them capable of understanding concepts like identifying objects, traffic regulations, and driving manoeuvres. For example, language models know the difference between a tree, a shop, a house, a dog chasing a ball, and a bus that’s stopped in front of a school. We can integrate this knowledge into our foundation models to improve the raw intelligence of our system, which could help it tackle challenging, long-tail scenarios where we have limited training examples.
While applying this knowledge to real driving situations remains a challenge, VLAMs offer the potential for better, safer autonomous driving by encoding image data with a broader range of information. This could accelerate the learning process, enhance the model’s accuracy and increase its capability to handle diverse driving tasks.
LINGO-1 represents a beachhead opportunity, opening up many new possibilities for the safety and interpretability of Embodied AI. While we are excited by the rapid progress of LINGO-1, we want to provide a few words of caution and describe the current limitations of the model.
Limited generalisation capabilities
Currently, LINGO-1 is trained on Central London driving experiences and internet-scale text. It is best at commenting on UK road rules, although it has learned aspects of global driving cultures from the broad base of knowledge it is trained on. For example, it can state which side of the road you should drive on in various countries worldwide. However, grounding this knowledge in actual driving experience from different countries is one of our next steps.
Hallucinations
Hallucinations are a well-known problem in large language models, and LINGO-1 is no exception. We continue to drive down the frequency and severity of hallucinations through RLHF and other common techniques. However, LINGO-1 has an advantage over standard LLMs that is worth noting. Since LINGO-1 is grounded in vision, language and action, it has more sources of supervision that allow it to understand the world better. It can learn to reason and align its understanding between text descriptions, what it sees in the video and how it interacts with the world, increasing the sources of information that can allow it to understand causal relationships.
Limited temporal context
Video deep learning is challenging because video data is typically orders of magnitude larger than image or text datasets. In particular, video-based multimodal language models require long context lengths to embed the many video frames needed to reason about complex and dynamic driving scenes. We have found that working with very-long-context transformer models is critically important. Alongside our partner, Microsoft Azure, we are working hard to continue pushing the boundaries of scale to improve the performance of LINGO-1.
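A back-of-envelope calculation shows why context length quickly becomes the bottleneck for video. Every number below is an illustrative assumption, not LINGO-1’s actual configuration:

```python
# Rough token budget for reasoning over a short driving clip.
# All constants are illustrative assumptions.
TOKENS_PER_FRAME = 256   # visual tokens produced per embedded camera frame
FRAMES_PER_SECOND = 5    # sampling rate of the video stream
CLIP_SECONDS = 20        # length of the clip the model must reason over
TEXT_TOKENS = 512        # budget for the question and the answer

context_tokens = TOKENS_PER_FRAME * FRAMES_PER_SECOND * CLIP_SECONDS + TEXT_TOKENS
print(context_tokens)  # 26112, well beyond a typical 2k-4k token window
```

Even a 20-second clip at a modest sampling rate consumes tens of thousands of tokens, which is why long-context architectures matter so much here.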
Currently, we are working on the explainability of our models, but ultimately, we want the reasoning capabilities of large language models to affect our driving. We are working on a closed-loop architecture that will allow us to run LINGO-1 on-board our fleet of autonomous vehicles in the future.
LINGO-1, our open-loop driving commentator, offers an essential first step for enhancing the learning and explainability of our foundation driving models using natural language. While the integration of natural language in training robots is still in its early stages, we are excited to advance this research for end-to-end autonomous driving.
The potential future applications of this research are thrilling. Imagine the ability to update an autonomous driving system to changes in the Highway Code through a simple text prompt or the possibility of asking a robot car about the upcoming road conditions. Natural language holds great promise in developing safer and more reliable autonomous vehicles.
If you share our passion for shaping the future of autonomous driving, we invite you to explore our open roles in London or the Bay Area and join us on this exciting journey!