14 March 2024  |  Leadership

Solving the long-tail with e2e AI: “The revolution will not be supervised”

Erez Dagan, President of Wayve, shares his thoughts on how end-to-end (e2e) Embodied AI is, by design, uniquely equipped to solve the long-tail problem of driving automation.

Erez Dagan, President

The road to achieving fully autonomous driving has been a challenging journey as the industry strives to unlock the tremendous personal, economic, and societal benefits that driving automation promises. Throughout my career of over two decades in the Automated Driving (AD) and Advanced Driver-Assistance System (ADAS) industry, I have witnessed firsthand the difficulties of addressing the “long-tail” scenarios: those rare ‘edge case’ events that have a low probability of occurrence yet still require careful consideration to ensure the utility and safety of driving assistance and automation systems. Although such events are infrequent, inadequate handling of the long-tail can pose challenges to the marketability of these systems by diminishing their usefulness to customers. Moreover, in the worst-case scenario, failure to respond correctly in a safety-critical long-tail situation can have disastrous consequences for consumers and automakers and even lead to widespread societal rejection, setting the entire industry back.

This persistence of long-tail challenges has long been the Achilles’ heel of driving automation efforts.

For almost 20 years, the different approaches to the problem of driving automation have struggled with the long-tail because they all rely on a broad set of predefined human-made concepts, accompanied by logical/mathematical models that attempt to translate these concepts into optimal driving decisions. While the concepts and models are meticulously designed to cover a set of encountered and reasonably predictable scenarios, they also suffer, by that same design, from the inability to generalize to unexpected situations—situations that may not be effectively covered by these constructs.

Overcoming this challenge requires a fresh approach.

Enter the revolutionary approach of Wayve’s end-to-end (e2e) Embodied AI.

Unlike traditional methods, e2e Embodied AI is built on the premise of learning how to drive end-to-end from rich and extensive exposure to recordings of driving in various environments: from sensing, as an input on one end, to driving actions, as an output on the other. Importantly, it does so in a self-supervised manner.

This innovative approach uniquely harnesses the power of state-of-the-art AI to address the long-tail problem head-on, giving rise to emergent, rich, and nuanced latent representations of real-world complexities that maximize the potential to generalize to the unseen. In doing so, it offers a distinct solution, one optimally suited to the complexity of driving automation.

This blog post will delve into Wayve’s pioneering approach of e2e Embodied AI and highlight its transformative potential. We will explore five foundational pillars of Wayve’s approach, which make it fully equipped to master the long-tail. These include:

  1. Harnessing the tremendous expressive power and generalization capacities of state-of-the-art AI models.
  2. Domain-optimized architecture that places automotive safety at its core.
  3. Efficient, large-scale learning unleashed through self-supervision.
  4. Active-learning assets, securing convergent and predictably rewarding training cycles.
  5. Frequent and seamless model deployment cycles, propelling true fleet learning.


Understanding the Long-tail in Autonomous Driving

As drivers, we have all encountered unexpected situations on the road. These rare but realistic events form the “long-tail” of scenarios that have a low likelihood of occurrence but demand careful consideration to ensure the utility and safety of driving systems. Unlike humans, who possess a deep common-sense understanding of the world that enables them to adapt to unexpected situations with relative ease, traditional AV technology faces difficulties in handling these outliers.

Challenges for Current Methods

Traditional AV1.0 approaches rely heavily on predefined human-made concepts and logical models designed to cover a set of encountered and predictable scenarios. However, this reliance becomes a double-edged sword, as the very design that enables AVs to navigate known challenges also restricts their ability to generalize to unforeseen situations not captured by these predefined concepts and logical models.

This is a problem of under-modeling.

When we refer to ‘concepts,’ there’s an important distinction between:

  1. Human-defined concepts: linguistic, human-communicable representations of the world. 
  2. Emergent AI representations: abstract representations of the environment generated by AI through various mathematical transformations of the raw input stream. These, by design, are optimally informative for maximizing the learning objective.

E2e AI doesn’t discard concepts; instead, it exposes the learning algorithm to vast amounts of data aligned with a carefully crafted learning objective. Through this process, it automatically generates optimal actionable representations of the driving environment. These representations are the ‘secret sauce’ of e2e AI’s generalization capabilities, and play a crucial role in successfully predicting the next ‘data token’ (in this case, the next driving command).

In addition to the core problem of under-modeling, where edge-case complexity simply exceeds the power of human-made models, there are two additional factors that significantly limit the generalization capability of AV1.0 approaches.

Complexity at the “sub-concept” level in an attempt to simplify: Managing these concepts introduces a wide range of learning sub-objectives, such as detecting pedestrians, lane markings, traffic lights, and other elements, adding another critical layer of complexity. The accuracy in defining and detecting these concepts varies greatly, making these sub-objectives far from trivial. For instance, the real-world manifestations of “pedestrians,” “road construction,” or “police cars” vary greatly (known as ‘high class-complexity’). This variability introduces second-order long-tail challenges at the concept level. These inaccuracies translate to information bottlenecks that obstruct the ultimate objective of driving.

Economic burden of supervised learning: Addressing the second-order long-tail issue by more and more granularly defining and detecting a plethora of sub-concepts necessitates extensive data collection, curation, and labeling. The operational and economic burden associated with each hour of driving data effectively limits the number of driving hours available for training AI models and restricts the models’ exposure to diverse driving scenarios. This underexposure perpetuates and amplifies the risks associated with the long-tail in driving automation.

Beyond sustaining the long-tail problem, this superficially logical, sound, and communicable structuring of the driving scene is, by design, inefficient in both information and computation. For example, in AV1.0, the system is tasked with explicitly detecting all pedestrians and vehicles within the visible field of view. However, this burden could be avoided if the detection objective were more precisely defined as “identify road users that affect my driving decisions.” Indeed, if human drivers were asked to count the pedestrians they see at a given moment, they would identify only those near or within their driving path. This purpose-fit conceptualization is thus sufficient and optimal for human drivers to safely and fluently navigate the driving environment.

The Industry has Already Shifted Toward AI

Fortunately, the industry is evolving and has recognized the risks of under-modeling and domain under-sampling brought about by uncritical reliance on human-made concepts. Over the past decade, the field of ADAS perception has steadily discarded a wide array of human-defined sub-concepts, for instance:

  • Various deep-convolutional network architectures have replaced human-defined approaches, such as edge detection techniques used for lane detection, methods for hand-crafting features for pedestrian detection, and tedious ‘optical flow’ operations.
  • AI models have proven effective in metric modeling tasks that translate 2D images into 3D representations. This is particularly noteworthy because sophisticated projective-geometry and statistical signal-processing tools, built on closed-form mathematical models, were assumed to provide information-efficient estimators for these problems, yet AI consistently outperforms even these rigorously crafted mathematical models.

In fact, automated driving solutions already trust AI to handle perception. Many AV1.0 systems have replaced human-designed sub-concepts from the perception pipeline with a single end-to-end perception network. This network is trained to model the environment in great human-defined detail. This output is sent to another AI network, which interprets this particular model of the environment to inform the vehicle’s actions.

The natural next step in this unstoppable trend is to remove the remaining bottlenecks and allow the representation of the environment to be learned so that it directly optimizes for the driving task: in other words, an end-to-end learning approach.

The e2e AI Alternative: Embracing Self-Supervised Learning

Ironically, it is the field of large language models (LLMs) that has provided undeniable empirical evidence that human-defined concepts and constructs (linguistic theories, in this case) are inefficient relative to the internal representations of an e2e AI model, given exposure to large-scale data. This realization fully resonates with the revolutionary e2e approach to autonomous driving, which Wayve has been championing since its inception in 2017, well ahead of the rise of LLMs.

Wayve has pioneered the development of an e2e learned driving model, setting a trend that the industry is now converging on, coined “AV2.0.” This approach uses a single, highly expressive AI model to convert raw vehicle sensing inputs directly into driving actions, eliminating the need for human-defined concepts to interpret sensor data.

Central to this approach is self-supervised learning. By framing the learning problem as predicting the next data token in the sequence, whether it’s the next word in LLMs or the next driving command in Wayve’s model, the system thrives on unsupervised learning from raw, unlabelled data. The more data fed into the model, the richer and more expressive the AI model becomes for its specific application. The ability to train on a vast array of driving recordings without human input constitutes the power and magic of self-supervised learning.
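To make the "predict the next data token" framing concrete, here is a minimal Python sketch of how an unlabelled sequence can supply its own training targets. It is only an illustration under simplifying assumptions: the function name `next_token_pairs`, the command names, and the fixed context length are hypothetical, and a real driving model would condition on sensor data rather than on past commands alone.

```python
from typing import List, Sequence, Tuple


def next_token_pairs(
    sequence: Sequence[str], context_len: int
) -> List[Tuple[Tuple[str, ...], str]]:
    """Turn one unlabelled sequence into (context, next-token) training pairs.

    No human annotation is involved: the target for each context is simply
    the token that actually came next in the recording.
    """
    pairs = []
    for i in range(context_len, len(sequence)):
        context = tuple(sequence[i - context_len:i])
        target = sequence[i]
        pairs.append((context, target))
    return pairs


# Hypothetical recorded command stream (stand-in for a real driving log).
drive_log = ["keep_lane", "keep_lane", "slow_down", "stop", "go"]
pairs = next_token_pairs(drive_log, context_len=2)

for context, target in pairs:
    print(context, "->", target)
```

Every additional hour of recording yields more pairs at no labelling cost, which is the property the paragraph above relies on.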

“The revolution will not be supervised.”

Alexei (Alyosha) Efros, Professor, EECS Department, UC Berkeley

AV2.0 unleashes a rapid, continuous, and seamless fleet-learning loop: recording data, feeding it into model training, deploying an updated model, and repeating the process. In doing so, it also serves as a crowdsourced “data-to-value” engine, efficiently gathering real-world driving data from a diverse range of vehicles within an OEM’s fleet, uploading and processing this data in a cloud-based training infrastructure, and converting it into refined driving capabilities, which are downloaded back into the vehicles, upgrading their automated driving functions.

Wayve’s fleet learning loop is optimally and economically designed to support the graduation from ‘eyes-on’ driving functions (L2+) to ‘eyes-off’ as driving data exposure builds verifiably robust driving abilities. We can use it to demonstrate that our AI model is capable of driving without needing to hand over to the human driver before the system graduates to ‘eyes-off’ within the designated operational design domain (ODD).

As a crucial means of learning amplification, Wayve has invested in building the foundation model for driving. This model leverages multimodal data, including text and non-driving video sources, to optimize its internal representation of driving environments. This greatly enhances the AI model’s driving capabilities, allowing for cross-learning of driving-relevant concepts from different sources and improving alignment with driving task objectives. In simpler terms, by learning from multiple sources of data, we can improve a vehicle’s understanding of the most meaningful and actionable aspects of the sensor stream: the representations that are bound to lead to a fluent and safe driving function.

Taming the AI “Tiger” for Automotive

Cognitive AI tools like LLMs and multimodal chatbots, which similarly leverage e2e learning, have demonstrated impressive abilities to master various fields like coding, law, healthcare, and others. They excel at tasks such as answering questions, generating text and images, summarizing content, and brainstorming ideas. But would I trust a ChatGPT-like (Cognitive AI) model to drive my car? The answer is absolutely not.

Cognitive AI is trained towards objective functions that are fundamentally unsuitable for real-time, safety-critical engagement with the real physical world, such as driving. These models are honed for coherence, correctness, and human-like relevance in text generation, not for making split-second decisions on the road.

This raises a pivotal question: How would we optimally harness and tame this powerful e2e AI “tiger” for embodied AI applications, specifically in the domain of automated driving? 

To effectively apply AI’s tremendous capabilities to the problem of autonomous driving, we need to carefully define the correct learning processes, model architectures, and most importantly, set up learning objectives that fit the domain of driving. When developing embodied AI applications, the machine-learning objective function must be built upon three essential pillars that distinguish it from Cognitive AI:

  1. Utility: Execute the intended mission of accurately navigating from point A to point B with precise routing at the lane level.  
  2. Flow: Beyond reaching the destination, the AI driving model must make reasonable human-like driving decisions to ensure a driving experience that feels natural and comfortable, such as whether to overtake, when to linger, and how to execute a lane shift.
  3. Safety: At the core of the driving objective function is safety, which supersedes the two other pillars. This encompasses:
    • Continuous risk management and emergency avoidance. The AI model must continuously assess hazards and take measures to proactively avoid getting into emergency situations, prioritizing the safety and well-being of all road users.
    • Optimal emergency-event response to minimize its consequences. In an unavoidable emergency, the AI model must be trained to respond optimally to fully mitigate/minimize the consequences, safeguarding passengers and other road users.
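One simple way to picture "safety supersedes the two other pillars" is a lexicographic comparison: candidates are ranked by risk first, and utility and flow only break ties. The Python sketch below is a toy illustration of that ordering, not Wayve's learning objective; the `Candidate` fields, the scores, and the `pick_action` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    risk: float     # hypothetical estimated risk, lower is safer
    utility: float  # hypothetical progress toward the destination, higher is better
    flow: float     # hypothetical comfort/naturalness score, higher is better


def pick_action(candidates: Iterable[Candidate]) -> Candidate:
    """Choose the lowest-risk candidate; utility and flow only break ties."""

    def key(c: Candidate) -> Tuple[float, float]:
        # Tuples compare element by element, so risk always dominates.
        return (c.risk, -(c.utility + c.flow))

    return min(candidates, key=key)


options = [
    Candidate("overtake", risk=0.30, utility=0.9, flow=0.6),
    Candidate("follow", risk=0.02, utility=0.6, flow=0.8),
    Candidate("linger", risk=0.02, utility=0.4, flow=0.7),
]
best = pick_action(options)
print(best.name)
```

In a learned system this priority would be expressed through the training objective rather than an explicit selection rule; the sketch only shows the intended ordering of the pillars.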

By anchoring Embodied AI’s objectives in utility, flow, and safety, we harness e2e AI’s generalization power to meet the unique applicative requirements of automated driving functions, far exceeding the potential of traditional approaches to serving these requirements.

Ensuring Predictability and Trust in the Performance of Automotive AI

Actively Nurturing a Safety Reflex

By leveraging a natural distribution of human-driving videos and optimizing the model architecture, training processes, and learning objectives, we can theoretically expect to ultimately achieve a driving model that perfectly fulfills the three pillars of utility, flow, and safety. However, safety, which includes responding to emergency scenarios, presents a unique challenge compared with the other pillars. In a natural distribution of training data, examples of emergency response driving are rarely sampled, yet flawless mastery of this driving skill is a zero-tolerance requirement from the very first real-world deployments on the road.

Mastering capabilities that are rarely called for in real-world driving lies at the core of what AV2.0 can do exceptionally well, because we can leverage the unparalleled expressive and generalization power of state-of-the-art AI. To that end, we use various tools to proactively ‘design in’ emergency response capabilities to our model architecture and learning objectives, and we boost the model’s exposure to critical scenarios to actively nurture the driving model’s emergency reflex. In other words, we can accelerate the learning curve around the sub-problem of emergency response and master it as a subdomain, to build an innate safety reflex that can be applied to assure the safety of the driving task.
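As a rough illustration of boosting exposure to critical scenarios, the Python sketch below re-weights a dataset so that a rare category is drawn about as often as a common one. This is a generic class-balanced sampling technique, not a description of Wayve's sampling strategy; the labels, the 98-to-2 split, and the `balanced_weights` helper are assumptions made for the example.

```python
import random
from collections import Counter
from typing import List


def balanced_weights(labels: List[str]) -> List[float]:
    """Weight each sample by the inverse frequency of its label.

    Every label then carries the same total weight, so rare scenarios are
    drawn as often as common ones.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical dataset: emergencies make up 2% of the recorded segments.
labels = ["nominal"] * 98 + ["emergency"] * 2
weights = balanced_weights(labels)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch = rng.choices(labels, weights=weights, k=1000)
share = batch.count("emergency") / len(batch)
print(f"emergency share in sampled batch: {share:.2f}")
```

With uniform sampling the emergency share would stay near 2%; with these weights it moves toward one half, at the cost of revisiting the same few rare segments often, which is one reason synthetic generation is a useful complement.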

This aspect also signifies one of the core domain-specificities of Wayve’s AV2.0 e2e architecture and design, which rests on extensive proprietary research, development, and testing assets designed to tackle the unique characteristics of the driving automation domain: 

  • We have developed highly effective dataset sampling strategies, smart active-learning loops that make sense of the available data’s relevance, and Generative AI video simulation tools like GAIA-1 that allow us to optimally enrich our Emergency Reflex subsystem’s latent representations towards the objective functions of emergency avoidance and optimal emergency response. 
  • Moreover, these techniques allow us to measurably affirm that our model exhibits the proper emergency skills and demonstrates its emergent generalization to handle previously unseen emergency situations correctly.
  • We also benefit from our foundation model work, which leverages multimodal data sources to enhance the model’s internal representations. For example, to supercharge our Emergency Reflex, we can incorporate additional sources of information, such as other sensor modalities. We can also draw great value from third-party driving footage, such as accident data captured on dashcams. 

Differences in Embodied AI vs. Cognitive AI Problem-setting

Embodied AI has intrinsic traits, particularly in the context of automated driving, that clearly distinguish it from the problem domain of Cognitive AI. Consequently, certain concerns, such as hallucinations and misaligned learning, do not apply in the same way.

  • Clarity in the objective function: Unsafe driving is much easier to identify than unsafe speech. This makes the objective function of “driving safely,” against which we evaluate our models, more measurable compared to navigating the ambiguity of misinformation, toxic responses, and inappropriate content.
  • Data safety and curation: Similarly, the varied nature of language data on the internet, where freedom of speech allows for a wide range of expression, presents challenges in filtering out offensive content. However, with driving, there is no equivalent “freedom of driving.” Real-world driving data is inherently more consistent due to legal and societal frameworks. Moreover, we have developed ways to safely curate and synthesize data to minimize our driving model’s exposure to reckless driving behaviors. 
  • Closed-loop feedback eliminates “hallucinations”: Unlike LLMs, which can generate text without needing additional input from the user, driving models maintain a consistent grounding in the real world. For example, a language model could generate an extensive essay from a short prompt, but a driving model is not expected to blindly generate driving commands for, let’s say, the next 2 kilometers based on a past video stream. Instead, real-time updates from the vehicle’s sensors continuously inform the driving model of the current situation, ensuring a high-frequency reality check. This core difference makes concerns about “hallucinations” non-applicable to e2e driving models.
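The difference between generating a long plan blindly and re-checking reality at every step can be sketched with a toy lane-keeping example in Python. The one-dimensional model, the constant drift, and the gain value are all assumptions chosen for illustration; the sketch shows only how frequent feedback bounds accumulated error, not how a real driving model behaves.

```python
def open_loop(initial_offset: float, drift_per_step: float, steps: int) -> float:
    """Plan every correction up front from the first observation only."""
    offset = initial_offset
    planned_correction = -initial_offset / steps  # fixed plan, never revised
    for _ in range(steps):
        offset += planned_correction + drift_per_step  # unseen drift accumulates
    return offset


def closed_loop(
    initial_offset: float, drift_per_step: float, steps: int, gain: float = 0.5
) -> float:
    """Re-observe the lane offset at every step and correct from that."""
    offset = initial_offset
    for _ in range(steps):
        command = -gain * offset  # based on the current observation
        offset += command + drift_per_step
    return offset


# Hypothetical scenario: 1.0 m initial offset, 0.1 m of sideways drift per step.
open_end = open_loop(1.0, 0.1, steps=20)
closed_end = closed_loop(1.0, 0.1, steps=20)
print(f"open-loop final offset:   {open_end:.3f} m")
print(f"closed-loop final offset: {closed_end:.3f} m")
```

The open-loop plan cancels the initial offset but ends far off because it never sees the drift, while the closed-loop version settles at a small bounded error.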

Ensuring a Predictably Convergent and Cost-efficient Training Process

So far, we’ve covered our model’s driving qualities and how we enhance performance through two core elements: (1) a self-supervised learning problem that enables unparalleled generalization from unsupervised exposure to a huge data corpus, and (2) leveraging AI’s expressive power to proactively nurture our driving model’s emergency reflex sub-expertise beyond the natural learning curve.

Performance aside, the training and development process must meet specific criteria for the beneficial deployment of Embodied AI in automotive applications. It should be economically feasible, measurable, and able to produce ‘introspectable’ models (whose behavior can be systematically examined and reviewed). 

To enable safe and controlled over-the-air (OTA) updates for consumer vehicles, the training process needs to evolve into predictable enhancement cycles, driving a convergent and regression-resistant development process, both in duration and performance levels. Simply put, each major training cycle should progressively improve toward the driving objective without regression, and we should have well-estimated completion times and performance levels for each cycle. These qualities will ensure seamless and regular OTA updates as part of the fleet learning loop.
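A "no regression" requirement of this kind can be pictured as a release gate that compares a candidate model with the deployed one, scenario by scenario. The Python sketch below is a hypothetical illustration: the scenario names, the scores, and the `regressions` and `can_deploy` helpers are invented for the example and do not describe Wayve's release process.

```python
from typing import Dict, List


def regressions(
    current: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.0
) -> List[str]:
    """List scenarios where the candidate scores worse than the deployed model.

    A scenario missing from the candidate's results counts as a regression.
    """
    return sorted(
        name
        for name, score in current.items()
        if candidate.get(name, float("-inf")) < score - tolerance
    )


def can_deploy(
    current: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.0
) -> bool:
    """Allow an update only if no tracked scenario got worse."""
    return not regressions(current, candidate, tolerance)


# Hypothetical per-scenario scores (higher is better).
deployed = {"roundabout": 0.91, "cut_in": 0.88}
improved = {"roundabout": 0.93, "cut_in": 0.88}
lopsided = {"roundabout": 0.95, "cut_in": 0.80}

print(can_deploy(deployed, improved))  # no scenario is worse
print(regressions(deployed, lopsided))  # names the scenario that got worse
```

Gating on every scenario rather than on an average prevents a large gain in one area from masking a loss in another.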

Scenario Intelligence Tools: As mentioned earlier, our e2e AI models replace human-defined concepts with auto-generated representations of the driving environment. This advancement requires us to adopt new methods for understanding the ‘emergent concepts’ derived from the model’s unsupervised exposure to large amounts of data. Wayve has developed new ‘scenario intelligence’ tools for dataset introspection and control. These tools harness the emergent concepts generated by our model and transform them into a framework for deeper performance analysis.

Examining how our AI model clusters information pinpoints what it deems as the most relevant aspects of a scene for driving tasks. In the images below, associated with two of the clusters, we can discern that the model prioritizes behaviorally relevant elements, such as following the vehicle at mid-range (left image) or following a vehicle at a distant range (right image), despite variations of elements like lane markings, lighting, weather, and vehicle size/shape/color.

By learning meaningful representations, we can navigate the dataset purposefully, aligned with our driving objective. This approach allows us to systematically evaluate data coverage and performance across different scenario clusters and identify rare ‘edge cases’.  Scenario intelligence improves our understanding of the model’s learning, enables us to fill data gaps through data collection or synthetic generation, and much more. It serves as one of the core means of supervising an unsupervised learning process, which we plan to share more about in a future blog. 
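Evaluating coverage and performance per scenario cluster can be sketched in a few lines of Python. The sketch assumes each evaluated segment has already been assigned a cluster id (for example, by clustering the model's learned representations); the ids, counts, and helper names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_report(records: Iterable[Tuple[int, bool]]) -> Dict[int, Dict[str, float]]:
    """Summarise (cluster_id, passed) records into a count and pass rate per cluster."""
    totals: Dict[int, int] = defaultdict(int)
    passes: Dict[int, int] = defaultdict(int)
    for cluster_id, passed in records:
        totals[cluster_id] += 1
        passes[cluster_id] += int(passed)
    return {
        cid: {"count": totals[cid], "pass_rate": passes[cid] / totals[cid]}
        for cid in totals
    }


def rare_clusters(report: Dict[int, Dict[str, float]], min_share: float) -> List[int]:
    """Return clusters whose share of the dataset falls below min_share."""
    total = sum(stats["count"] for stats in report.values())
    return sorted(
        cid for cid, stats in report.items() if stats["count"] / total < min_share
    )


# Hypothetical evaluation results: cluster 2 is both rare and failing.
records = (
    [(0, True)] * 88 + [(0, False)] * 2
    + [(1, True)] * 9
    + [(2, False)] * 1
)
report = cluster_report(records)
print(report)
print("under-covered clusters:", rare_clusters(report, min_share=0.05))
```

Clusters flagged this way are natural candidates for targeted data collection or synthetic generation.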

Training Data Quality and Control: In addition to systematically sampling informative video segments that contribute to learning innovation, we also filter out low-quality or risky driving scenarios based on a set of indications that we derive from our Emergency Reflex skills and analysis of driver action sequences, etc.   

Model Introspection: Wayve is pioneering methods for model introspection using natural language and other side information, such as segmentation for testing and performance characterization, to gain insights into our model’s internal decision-making process. One interesting way to achieve model introspection is to analyze ‘saliency’. In the image below, a heat map reveals the pixels the model focuses on to deduce driving actions. This helps us identify root causes and potential undersampling of specific road elements. In many respects, it allows us to break our e2e model down, at the evaluation stage, into its ‘perception’ and ‘planning’ components.
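The article does not say how the saliency heat map is computed, so the Python sketch below uses one common, generic technique, occlusion sensitivity, purely as an illustration: blank out one pixel at a time and record how much the model's output changes. The tiny image, the `toy_steering_model` stand-in, and the function names are all hypothetical.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_saliency(
    model: Callable[[Image], float], image: Image, baseline: float = 0.0
) -> Image:
    """Score each pixel by how much the output moves when that pixel is blanked."""
    reference = model(image)
    saliency: Image = []
    for r in range(len(image)):
        row = []
        for c in range(len(image[r])):
            occluded = [list(pixels) for pixels in image]  # copy, then blank one pixel
            occluded[r][c] = baseline
            row.append(abs(model(occluded) - reference))
        saliency.append(row)
    return saliency


def toy_steering_model(image: Image) -> float:
    """Hypothetical stand-in for a driving model: it only reads the middle column."""
    return sum(row[1] for row in image)


image = [
    [0.2, 1.0, 0.3],
    [0.1, 0.5, 0.9],
]
heat = occlusion_saliency(toy_steering_model, image)
for row in heat:
    print(row)
```

Pixels the model ignores score zero, while the middle column lights up in proportion to its influence, which is the kind of signal a heat map makes visible.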

Model introspection can also be achieved through natural language. LINGO-1, Wayve’s driving commentator, combines vision, language, and action to enhance our interpretation, explanation, and training of our driving models. By using language to question the model about the driving scene, LINGO-1 provides an understanding of the model’s comprehension.

LINGO-1’s “show and tell” capability goes beyond providing text descriptions. It visually indicates its focus of attention using referential segmentation, establishing a stronger “grounding” (or connection) between language and vision tasks. This increases confidence in LINGO-1’s responses.

Riding the Tiger: How Wayve is Poised to Navigate the Future of Driving with Embodied AI

At Wayve, we continue to pioneer e2e Embodied AI for driving, positioning ourselves at the forefront of AV2.0 and creating a strong foundation for our approach. Through extensive research and innovation, we are developing the capabilities to produce effective, fluent, and safe driving behaviors from unlabeled driving videos—demonstrating measurable and powerful generalization capabilities to support seamless expansion across different geographies, vehicle types, sensor configurations, and in the near future, different embedded computing platforms.

To accomplish this, we incorporate meticulous domain-optimized model architectures, intelligent active-learning processes, realistic and industry-leading data generation capabilities, and closed-loop environments to re-drive in. We place great emphasis on safety, utility, and flow when crafting our driving models while developing R&D processes that yield predictable, controllable, convergent, and trusted training loops—maintaining high-quality standards that are required for automotive deployment and mass production.

Yet, while we are taming our AI “tiger,” we have in no way constrained its potential!

Our system includes state-of-the-art intelligence that brings extraordinary potential for consumer value, to be realized through our partnership with OEMs. This includes:

  • Developing language-responsive assisted and automated driving solutions. 
  • Advancing research into in-context learning of certain driving styles, allowing for unique OEM driving signatures, personalized driving preferences, and adaptation to culturally specific driving behaviors (such as assertiveness). By having the sensitivity to adjust driving decisions to conform with local norms and social contracts regarding ‘duty of care,’ our system promotes safety and cultural integration/acceptance. It can also identify abnormal driving patterns in certain drivers that may indicate impairment, prompting the system to offer co-piloting assistance for enhanced safety measures.
  • As we progress to higher levels of automation, graduating from ‘eyes on’ to ‘eyes off,’ our aim is to develop the model’s “self-awareness” to provide the human driver with a stronger understanding of the model’s capabilities. For instance, in situations where there may be limitations, the model can flag, in real-time, scenarios that may require closer monitoring by the driver or proactively hand over the driving task.

Driving a Data-Driven Evolution with OEM Partners

Wayve is developing an AV2.0 platform that brings the unique value of Embodied AI to the automotive industry. Our platform allows for the robust deployment of hardware-agnostic, vehicle-tailored driving models, supporting brand-specific features and driving styles that enable each OEM to differentiate itself.

For Embodied AI, data is vital. OEM partners play a crucial role in building competitive and efficient data assets from their vehicle fleet. They can provide a rich, geographically distributed data sample of the driving domain, which is essential for the iterative refinement and OTA updates of the vehicle’s automated driving capabilities.

Wayve’s AI assets enhance the intelligent collection of real-world driving data by focusing on informative scenarios and identifying long-tail edge cases. Our approach includes efficient data management, active-learning tools, coverage control, and data filters. This strategy empowers OEMs to create effective data assets for active ‘e2e’ learning while minimizing costs.    

To ensure maximum performance visibility, model introspection, and control during training, we are introducing our Model Deployment Dashboard (MDD). This interactive interface gives our OEM partners unparalleled access and control over their data assets and Wayve’s learning process. Our platform and tools maximize data-to-value conversion, ensuring optimal driving functions emerge from well-crafted data assets. Smart data management tools and performance-data metrics facilitate asset creation and enhance their quality and impact.

In summary, Wayve provides more than just assisted and automated driving technology; we’re offering a transformative Embodied AI platform that bridges from raw data to unparalleled driving capabilities. We are committed to collaboration, innovation, and OEM empowerment to drive the industry toward achieving safe, useful, and fluent autonomous driving solutions that are fully equipped to master the long-tail. 
