15 November 2023 | Research
Unveiling LINGO-1's Show and Tell Capability with Referential Segmentation
We have developed a new “show and tell” feature in LINGO-1 that uses referential segmentation to visually highlight the focus of the model’s attention. This strengthens the connection between language and vision tasks, ultimately allowing for more accurate and trustworthy responses.

In September, we introduced the use of natural language for autonomous driving in our blog on LINGO-1, an open-loop driving commentator that combines vision, language and action to enhance how we interpret, explain and train our foundation driving models. Through LINGO-1, we’ve shown how we can use language to probe models with questions about the driving scene to better understand what it comprehends.

But how do we know that LINGO-1 is paying attention to the right parts of the driving scene?
In the example above, how do we know which “pedestrians crossing the road” the AI model is focusing on? A related question concerns the road rules relevant to the scene. For instance, we might ask, “What does the box with the cycle image indicate?” It’s important for us to verify that LINGO-1 is paying attention to these elements alongside producing its textual response.
Therefore, to go beyond textual descriptions of objects in the scene, we’ve developed a new “show and tell” capability that allows LINGO-1 to visually indicate its focus of attention using referential segmentation: marking the region of the image that is relevant to its textual response. This new development establishes a stronger “grounding” (or connection) between language and vision tasks, increasing confidence in LINGO-1’s responses.
Through referential segmentation, LINGO-1 can not only identify and describe objects in a scene but also provide visual proof of its attention and understanding.
In the video below, when we ask about the colour of the traffic light, LINGO-1 can accurately describe that “the traffic light is currently red” and simultaneously highlight the traffic light in the image frame, denoted by [A]. Furthermore, LINGO-1 can elaborate on the correct driving response to a red traffic light by stating, “We are waiting for it to turn green before proceeding.” As you can see, this new feature strengthens the link between language and vision tasks, improving both our confidence in LINGO-1’s responses and our ability to assess their accuracy and relevance.
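To make the idea concrete, here is a minimal sketch of how a grounded response could pair reference tags in the text (such as [A]) with segmentation masks over the image frame. This is our own illustration, assuming a simple tag-to-mask data structure; it is not LINGO-1’s actual interface, and the mask coordinates are invented.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GroundedResponse:
    """A textual answer whose inline reference tags (e.g. "[A]") point at image regions."""
    text: str                      # answer containing inline tags such as "[A]"
    masks: dict[str, np.ndarray]   # tag -> boolean segmentation mask of shape (H, W)


# Hypothetical example mirroring the traffic-light exchange described above.
h, w = 576, 1024
traffic_light_mask = np.zeros((h, w), dtype=bool)
traffic_light_mask[80:140, 600:630] = True  # invented region covering the traffic light

response = GroundedResponse(
    text="The traffic light [A] is currently red. We are waiting for it to turn green before proceeding.",
    masks={"A": traffic_light_mask},
)

# Every tag mentioned in the answer should resolve to a non-empty mask.
for tag, mask in response.masks.items():
    assert f"[{tag}]" in response.text and mask.any()
```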

Referential segmentation paves the way for multimodal explainability through cross-verification. By dynamically highlighting objects of interest, LINGO-1 demonstrates a holistic understanding of the scene. In the following example, LINGO-1 accurately identifies and associates elements within the scene to specific road rules outlined in the Highway Code. When we ask, “What do those squiggly lines indicate?” LINGO-1 correctly replies, “The zig-zag lines [A] you see are typically found near pedestrian crossings. They indicate that drivers should not park or overtake in this area.”

Here, we can ask LINGO-1 what it knows about safe stopping distances based on the weather. LINGO-1 accurately states, “The weather is a bit overcast, with some rain. It’s not too heavy, but it’s enough to make the road a bit slippery.” Given the weather, when we ask, “How many seconds of distance is recommended if the road is slippery?” LINGO-1 replies, “The recommended distance is at least 4 seconds. This allows for enough reaction time in case the motorcyclist [A] makes any sudden moves.” It also knows that “During normal conditions, I would maintain a 2-second drivable distance area [A] behind the motorcyclist [B]. This is the standard safe distance recommended by the UK Highway Code.”
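As an aside, those time-gap rules are easy to convert into distances: the gap in metres is simply speed multiplied by the time gap. The sketch below is our own back-of-the-envelope illustration (the speed chosen is arbitrary), not part of LINGO-1’s output.

```python
def following_distance_m(speed_kmh: float, time_gap_s: float) -> float:
    """Distance in metres covered at the given speed over the chosen time gap."""
    return speed_kmh / 3.6 * time_gap_s


# At roughly 48 km/h (about 30 mph), the gaps quoted above correspond to:
dry = following_distance_m(48, 2.0)  # ~27 m for the standard 2-second gap
wet = following_distance_m(48, 4.0)  # ~53 m for the 4-second gap on a slippery road
print(f"2-second gap: {dry:.0f} m, 4-second gap: {wet:.0f} m")
```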

In this example, we ask if driving over a bus lane is advisable to get around the vehicle turning right. LINGO-1 responds, “Yes, that’s a good observation. However, it’s important to note that the bus lane [A] is marked by a solid white line. According to the UK Highway Code, we’re not allowed to enter a lane marked by a solid white line unless it’s necessary and safe to do so. So, we will continue to follow our current lane.”

Here, we can see that LINGO-1 is “focusing on the drivable area [A] to ensure we stay within the lane. I’m also monitoring the traffic [B] in front of our lead car [C] to anticipate my next action.” It can also correctly identify that the broken white lines are lane dividers, which “indicate that drivers can change lanes when it’s safe to do so.”

This final example demonstrates how LINGO-1 reasons about potential hazards while driving at night. You can see the model first verifies that the scenario involves driving in the dark and acknowledges the need to be cautious of the following potential hazards: [A] a narrow drivable area, [B] a cyclist on the road, and [C] the presence of a car in front, which is challenging to spot in this image. It also correctly recognises that “the cyclist could make sudden moves or turns, so it’s important to keep a safe distance and be prepared to react accordingly.”

LINGO-1 Show & Tell Architecture
To develop this new capability, we extended the architecture of LINGO-1 and trained it using datasets specific to the AV domain. These datasets imbue the model with multi-tasking capabilities: it can respond with text and perform referential segmentation simultaneously. Importantly, we have developed efficient protocols for gathering and curating such multimodal datasets for training and validation.
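Wayve has not published the implementation details of this extension, so the snippet below is only a sketch under our own assumptions: a hypothetical multi-task head (module names, sizes and structure are invented) that produces next-token logits for the textual response and per-pixel logits for a referential segmentation mask from a shared vision-language representation.

```python
import torch
import torch.nn as nn


class ShowAndTellHead(nn.Module):
    """Illustrative multi-task head: a shared vision-language representation feeds
    both a language head (next-token logits) and a referential-segmentation head
    (per-pixel mask logits). Names and dimensions are hypothetical."""

    def __init__(self, d_model: int = 512, vocab_size: int = 32_000,
                 mask_hw: tuple[int, int] = (64, 64)):
        super().__init__()
        self.text_head = nn.Linear(d_model, vocab_size)               # predicts answer tokens
        self.mask_head = nn.Linear(d_model, mask_hw[0] * mask_hw[1])  # predicts a coarse mask
        self.mask_hw = mask_hw

    def forward(self, fused: torch.Tensor):
        # fused: (batch, seq, d_model) features from the vision-language backbone
        token_logits = self.text_head(fused)         # (batch, seq, vocab_size)
        mask_logits = self.mask_head(fused[:, -1])   # (batch, H * W), from the final token
        return token_logits, mask_logits.view(-1, *self.mask_hw)


head = ShowAndTellHead()
token_logits, mask_logits = head(torch.randn(2, 16, 512))
```

During training, a text loss (cross-entropy on the answer tokens) would typically be combined with a segmentation loss (for example per-pixel binary cross-entropy or Dice) so that the model learns to respond with text and segment the referenced region simultaneously.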

Driving Forward with this New Capability
The inclusion of “show and tell” opens up new possibilities for autonomous driving and human-vehicle interaction. Here are a few potential use cases:
- Enhanced training and adaptability: By linking language and vision, we can speed up the learning process by providing more efficient supervision to these models and more information about the scene. This learning ability is critical for addressing long-tail driving scenarios, where examples are limited.
- Addressing hallucinations with LLMs: Hallucinations are a well-known problem in large language models, and LINGO-1 is no exception. By developing the capability to ground language with vision, we introduce additional sources of supervision that compel our model to respond only to objects it can accurately identify in the scene, helping mitigate the risk of hallucinations; a simple consistency check along these lines is sketched after this list.
- Improved safety communication: By visually highlighting potential hazards in the scene, LINGO-1 can effectively communicate critical information such as the position of pedestrians, cyclists or obstacles. This can play a vital role in building trust and confidence in the capabilities of our AV technology.
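To illustrate how grounding could act as the consistency check mentioned in the hallucination point above (our own sketch, not a description of Wayve’s pipeline), the snippet below flags any reference tag mentioned in a response that lacks a sufficiently large supporting mask. The tag format, threshold and function name are all hypothetical.

```python
import re

import numpy as np

MIN_PIXELS = 50  # hypothetical threshold for a "supported" segment


def grounding_check(text: str, masks: dict[str, np.ndarray]) -> list[str]:
    """Return the reference tags mentioned in the text that lack a sufficiently
    large segmentation mask, i.e. candidate hallucinations to flag for review."""
    mentioned = set(re.findall(r"\[([A-Z])\]", text))
    return sorted(tag for tag in mentioned
                  if tag not in masks or masks[tag].sum() < MIN_PIXELS)


# Example: an answer that cites a segment [B] with no visual support gets flagged.
masks = {"A": np.ones((64, 64), dtype=bool)}
print(grounding_check("The cyclist [A] is ahead; a pedestrian [B] is crossing.", masks))  # ['B']
```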
Pioneering the Future of Autonomous Driving with Vision-Language-X Models
The incorporation of LINGO-1’s “show and tell” capability marks a significant advance in using natural language to enhance autonomous vehicle performance. This feature represents a crucial step towards the development of Vision-Language-X Models (VLXMs), where ‘X’ encompasses diverse multimodal outputs such as language, referential segmentation, driving plans, and raw control signals.
By integrating these elements, Wayve is enhancing not only how AVs perceive and interpret their surroundings but also how they communicate this understanding, bringing a more human-like approach to driving. As a result, Wayve is setting the stage for a future where self-driving vehicles can clearly and effectively convey their intentions and actions to passengers, pedestrians, and other road users, ultimately contributing to safer and more trustworthy autonomous driving systems.