17 April 2024  |  Research

LINGO-2: Driving with Natural Language

This blog introduces LINGO-2, a driving model that links vision, language, and action to explain and determine driving behavior, opening up a new dimension of control and customization for an autonomous driving experience. LINGO-2 is the first closed-loop vision-language-action driving model (VLAM) tested on public roads.

In September 2023, we introduced natural language for autonomous driving in our blog on LINGO-1, an open-loop driving commentator that was a first step towards trustworthy autonomous driving technology. In November 2023, we further improved the accuracy and trustworthiness of LINGO-1’s responses by adding a “show and tell” capability through referential segmentation. Today, we are excited to present the next step in Wayve’s pioneering work on incorporating natural language into our driving models: LINGO-2, a closed-loop vision-language-action driving model (VLAM) and the first language-trained driving model to be tested on public roads. In this blog post, we share the technical details of our approach and examples of how LINGO-2 combines language and action to accelerate the safe development of Wayve’s AI driving models.

Introducing LINGO-2, a closed-loop Vision-Language-Action Model (VLAM)

Our previous model, LINGO-1, was an open-loop driving commentator that leveraged vision-language inputs to perform visual question answering (VQA) and driving commentary on tasks such as scene description, reasoning, and attention, providing only language as an output. This research model was an important first step in using language to understand what the model comprehends about the driving scene. LINGO-2 takes this a step further, providing visibility into the decision-making process of a driving model. It combines vision and language as inputs and outputs both a driving action and language, providing a continuous commentary on its motion planning decisions. LINGO-2 adapts its actions and explanations in accordance with various scene elements, a strong first indication of the alignment between its explanations and its decision-making. By linking language and action directly, LINGO-2 sheds light on how AI systems make decisions and opens up a new level of control and customization for driving.

The above video is taken from a LINGO-2 drive through Central London. The same deep learning model generates the driving behavior and textual predictions in real time.

While LINGO-1 could retrospectively generate commentary on driving scenarios, its commentary was not integrated with the driving model, so its observations were not informed by actual driving decisions. LINGO-2, by contrast, can both generate real-time driving commentary and control a car. Linking these fundamental modalities underscores the model’s understanding of the contextual semantics of the situation, for example, explaining that it’s slowing down for pedestrians on the road or executing an overtaking maneuver. It is a crucial step towards enhancing trust in our assisted and autonomous driving systems, and it opens up new possibilities for accelerating learning with natural language by incorporating descriptions of driving actions and causal reasoning into the model’s training. In the future, natural language interfaces could allow users to engage in conversations with the driving model, making it easier for people to understand these systems and build trust.

LINGO-2 Architecture: Multi-modal Transformer for Driving

LINGO-2 architecture

LINGO-2 consists of two modules: the Wayve vision model and an auto-regressive language model. The vision model processes camera images from consecutive timestamps into a sequence of tokens. These tokens, together with additional conditioning variables such as route, current speed, and speed limit, are fed into the language model. Equipped with these inputs, the language model is trained to predict a driving trajectory and commentary text. The car’s controller then executes the predicted driving trajectory.
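
As an illustrative sketch of this two-module layout, the toy PyTorch code below tokenizes camera frames, concatenates them with conditioning variables, and predicts a trajectory alongside commentary-token logits. The module names, shapes, and prediction heads are assumptions for exposition (the real language model decodes its text auto-regressively); this is not Wayve’s implementation.

```python
# Toy sketch of the two-module layout described above: a vision model that
# tokenizes camera frames, and a language model that consumes those tokens
# plus conditioning variables and predicts a trajectory and commentary.
# All names, shapes, and heads are illustrative assumptions.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Maps a short clip of camera frames to a sequence of image tokens."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Patchify each frame into non-overlapping 16x16 patches.
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) -> tokens: (batch, time * patches, embed_dim)
        b, t, _, _, _ = frames.shape
        feats = self.patchify(frames.flatten(0, 1))  # (b*t, d, h', w')
        return feats.flatten(2).transpose(1, 2).reshape(b, -1, feats.shape[1])


class DrivingLanguageModel(nn.Module):
    """Consumes image tokens plus conditioning (route, speed, speed limit)
    and predicts a driving trajectory and commentary-token logits."""

    def __init__(self, embed_dim: int = 256, vocab_size: int = 1000, horizon: int = 10):
        super().__init__()
        self.conditioning = nn.Linear(3, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.trajectory_head = nn.Linear(embed_dim, horizon * 2)  # (x, y) waypoints
        self.text_head = nn.Linear(embed_dim, vocab_size)         # commentary logits

    def forward(self, image_tokens: torch.Tensor, conditioning: torch.Tensor):
        cond = self.conditioning(conditioning).unsqueeze(1)       # (b, 1, d)
        hidden = self.transformer(torch.cat([cond, image_tokens], dim=1))
        trajectory = self.trajectory_head(hidden.mean(dim=1))     # pooled summary
        commentary_logits = self.text_head(hidden)                # per-token logits
        return trajectory, commentary_logits


# Toy forward pass: two 64x64 frames, conditioned on route id, speed, and speed limit.
frames = torch.randn(1, 2, 3, 64, 64)
conditioning = torch.tensor([[1.0, 5.0, 13.4]])
tokens = VisionEncoder()(frames)
trajectory, commentary_logits = DrivingLanguageModel()(tokens, conditioning)
# The trajectory would then be handed to the car's controller for execution.
```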

LINGO-2’s New Capabilities

The integration of language and driving opens up new capabilities for autonomous driving and human-vehicle interaction, including:

  1. Adapting driving behavior through language prompts: We can prompt LINGO-2 with constrained navigation commands (e.g., “pull over,” “turn right,” etc.) and adapt the vehicle’s behavior. This has the potential to aid model training or, in some cases, enhance human-vehicle interaction. 
  2. Interrogating the AI model in real time: LINGO-2 can respond to questions about the scene and its decisions while driving.
  3. Capturing real-time driving commentary: By linking vision, language, and action, LINGO-2 can leverage language to explain what it’s doing and why, shedding light on the AI’s decision-making process.

We’ll explore these use cases in the sections below, showing examples of how we’ve tested LINGO-2 in our neural simulator Ghost Gym. Ghost Gym creates photorealistic 4D worlds for training, testing, and debugging our end-to-end AI driving models. Given the speed and complexity of real-world driving, we leverage offline simulation tools like Ghost Gym to evaluate the robustness of LINGO-2’s features first. In this setup, LINGO-2 can freely navigate through an ever-changing synthetic environment, where we can run our model against the same scenarios with different language instructions and observe how it adapts its behavior. We can gain deep insights and rigorously test how the model behaves in complex driving scenarios, communicates its actions, and responds to linguistic instructions.
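
As a minimal sketch of this evaluation pattern, one might replay the same scenario under several language prompts and collect the trajectories the model produces; the `Simulator` and `driving_model` interfaces below are hypothetical stand-ins, not the actual Ghost Gym or LINGO-2 APIs.

```python
# Hypothetical sketch of running one simulated scenario under several
# language instructions and collecting the resulting trajectories.
# `Simulator` and `driving_model` are illustrative placeholders, not the
# real Ghost Gym or LINGO-2 interfaces.
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]


class Simulator:
    """Stand-in for a neural simulator that renders a scenario and applies
    the model's planned trajectory at each step."""

    def __init__(self, scenario: str):
        self.scenario = scenario

    def reset(self) -> dict:
        return {"scenario": self.scenario, "frame": 0}

    def step(self, plan: List[Waypoint]) -> dict:
        # A real simulator would roll the world forward along `plan` and
        # render new camera frames; this placeholder only advances a counter.
        return {"scenario": self.scenario, "frame": 1}


def rollout(scenario: str,
            instruction: str,
            driving_model: Callable[[dict, str], List[Waypoint]],
            steps: int = 20) -> List[List[Waypoint]]:
    """Run one scenario with a fixed language prompt and log every plan."""
    sim = Simulator(scenario)
    obs, plans = sim.reset(), []
    for _ in range(steps):
        plan = driving_model(obs, instruction)  # the prompt conditions the plan
        plans.append(plan)
        obs = sim.step(plan)
    return plans


# Dummy model: drives straight regardless of the prompt.
dummy_model = lambda obs, instruction: [(0.0, float(i)) for i in range(10)]

prompts = ["turning left, clear road",
           "turning right, clear road",
           "stopping at the give way line"]
plans_by_prompt = {p: rollout("junction_scene", p, dummy_model) for p in prompts}
```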

Adapting Driving Behavior through Linguistic Instructions

LINGO-2 uniquely allows driving instruction through natural language. To do this, we swap the order of the text tokens and the driving action so that language becomes a prompt for the driving behavior (sketched in the code below). This section demonstrates the model’s ability to change its behavior in our neural simulator in response to language prompts for training purposes. This capability opens up a new dimension of control and customization: the user can give commands or suggest alternative actions to the model. It is of particular value for training our AI and shows promise for enhancing human-vehicle interaction in applications related to advanced driver assistance systems. In the examples below, we observe the same scenes repeated, with LINGO-2 adapting its behavior to follow the linguistic instructions.
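
A minimal sketch of that token-ordering idea follows; the token names and sequence layout are assumptions for illustration, not LINGO-2’s actual tokenization.

```python
# Illustrative sketch: placing the text before the driving-action tokens
# turns language into a prompt that the predicted action is conditioned on,
# whereas placing it after yields commentary about an already-chosen action.
# Token names and layout are assumptions, not LINGO-2's actual tokenization.
from typing import List


def build_sequence(image_tokens: List[str],
                   text_tokens: List[str],
                   action_tokens: List[str],
                   language_as_prompt: bool) -> List[str]:
    """Order the token stream for an auto-regressive model."""
    if language_as_prompt:
        # Instruction mode: text precedes the action, so the action is
        # generated conditioned on the instruction.
        return image_tokens + text_tokens + action_tokens
    # Commentary mode: the action comes first and the text explains it.
    return image_tokens + action_tokens + text_tokens


# Commentary mode: drive, then explain.
commentary = build_sequence(["<img>"], ["slowing", "for", "a", "cyclist"], ["<traj>"], False)

# Instruction mode: "turning left, clear road" prompts the trajectory.
instructed = build_sequence(["<img>"], ["turning", "left,", "clear", "road"], ["<traj>"], True)
```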

Example 1: Navigating a junction

In the three videos below, LINGO-2 navigates the same junction but is given different instructions: “turning left, clear road,” “turning right, clear road,” and “stopping at the give way line.” We observe that LINGO-2 follows each instruction, reflected in different driving behavior at the intersection.

Example of LINGO-2 driving in Ghost Gym and being prompted to turn left on a clear road.
Example of LINGO-2 driving in Ghost Gym and being prompted to turn right on a clear road.
Example of LINGO-2 driving in Ghost Gym and being prompted to stop at the give-way line.

Example 2: Navigating around a bus

In the two videos below, LINGO-2 navigates around a bus. We observe that LINGO-2 follows the instruction to either hold back and “stop behind the bus” or “accelerate and overtake the bus.”

Example of LINGO-2 in Wayve’s Ghost Gym stopping behind the bus when instructed.
Example of LINGO-2 in Wayve’s Ghost Gym overtaking a bus when instructed by text.

Example 3: Driving in a residential area

In the two videos below, LINGO-2 responds to linguistic instructions while driving in a residential area, correctly following the prompts “continue straight to follow the route” and “slow down for an upcoming turn.”

Example of LINGO-2 in Wayve’s Ghost Gym driving straight when instructed by text.
Example of LINGO-2 in Wayve’s Ghost Gym turning right when instructed by text.

Interrogating an AI model in real time: Video Question Answering (VQA)

Another possibility for language is to develop a layer of interaction between the robot car and the user that can give confidence in the decision-making capability of the driving model. Unlike our previous LINGO-1 research model, which could only answer questions retrospectively and was not directly connected to decision-making, LINGO-2 allows us to interrogate and prompt the actual model that is driving.
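
As a rough sketch of what such an interaction loop could look like, the snippet below queries the same model that produces the driving plan; the `LingoLikeModel` interface and its canned answer are invented for illustration only.

```python
# Hypothetical interaction sketch: the same model that plans the trajectory
# can be queried about the scene between control steps. The interface and
# the canned answer below are invented for illustration only.
from typing import List, Tuple

Waypoint = Tuple[float, float]


class LingoLikeModel:
    def plan(self, frames) -> List[Waypoint]:
        """Return the planned trajectory for the latest camera frames."""
        return [(0.0, float(i)) for i in range(10)]

    def answer(self, frames, question: str) -> str:
        """Decode a textual answer conditioned on the same visual tokens."""
        return "The traffic lights are green."  # placeholder response


model = LingoLikeModel()
frames = None  # stand-in for the latest camera clip
trajectory = model.plan(frames)
reply = model.answer(frames, "What is the color of the traffic lights?")
```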

Example 4: Traffic Lights

In this example, we show LINGO-2 driving through an intersection. When we ask the model, “What is the color of the traffic lights?” it correctly responds, “The traffic lights are green.”

Example of LINGO-2 VQA in Ghost Gym

Example 5: Hazard Identification

In this example, LINGO-2 is prompted with the question, “Are there any hazards ahead of you?” It correctly responds, “Yes, there is a cyclist ahead of me, which is why I am decelerating.”

Example of LINGO-2 VQA in Ghost Gym

Example 6: Weather

In the following three examples, we ask LINGO-2, “What is the weather like?” Across the clips, it correctly identifies conditions ranging from “very cloudy, there is no sign of the sun” to “sunny” to “the weather is clear with a blue sky and scattered clouds.”

Example of LINGO-2 VQA in Ghost Gym

Limitations

LINGO-2 marks a step change in our progress in leveraging natural language to enhance our AI driving models. While we are excited about the progress we are making, we also want to describe the model’s current limitations.

Language explanations from the driving model give us a strong idea of what the model might be thinking. However, more work is needed to quantify the alignment between explanations and decision-making. Future work will quantify and strengthen the connection between language, vision, and driving to reliably debug and explain model decisions. We expect to show in the real world that adding intermediate language reasoning in “chain-of-thought” driving helps solve edge cases and counterfactuals.
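
As one hypothetical illustration of what such a quantitative check could look like (not a method described in this post), a simple consistency test might compare the model’s stated behavior with the speed profile of its predicted trajectory:

```python
# Hypothetical consistency check between commentary and action: if the model
# says it is slowing down, the predicted waypoints should show decreasing
# speed. This is an illustrative metric only, not a method from the post.
from typing import List, Tuple

Waypoint = Tuple[float, float]


def segment_speeds(waypoints: List[Waypoint], dt: float = 0.5) -> List[float]:
    """Speeds between consecutive (x, y) waypoints spaced dt seconds apart."""
    return [
        ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / dt
        for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:])
    ]


def explanation_consistent(commentary: str, waypoints: List[Waypoint]) -> bool:
    """True if a stated deceleration is reflected in the planned speeds."""
    speeds = segment_speeds(waypoints)
    if "decelerating" in commentary or "slowing" in commentary:
        return speeds[-1] < speeds[0]
    return True  # no explicit claim about speed to verify


plan = [(0.0, 0.0), (0.0, 4.0), (0.0, 7.0), (0.0, 9.0)]  # slowing plan
print(explanation_consistent("I am decelerating for a cyclist.", plan))  # True
```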

Additionally, we plan to investigate whether controlling the car’s behavior with language in real-world settings can be done reliably and safely. Ghost Gym provides a safe off-road environment for testing, but more work needs to be done to ensure the model is robust to noise and to misinterpretation of commands. It should understand the context of human instructions while never violating appropriate limits of safe and responsible driving behavior. This functionality will be better suited to aiding model testing and training for fully automated driving systems.

Conclusion

In this post, we have introduced LINGO-2, the first driving model trained on language that has driven on public roads. We are excited to showcase how LINGO-2 can respond to language instructions and explain its driving actions in real time. This is a first step towards building embodied AI that can perform multiple tasks, starting with language and driving. Join us at CVPR 2024 to explore this and other topics at our “End-to-End Autonomy: A New Era of Self-Driving” Tutorial.
