Visualization of Tesla’s fleet. Image courtesy of Tesla.

What Tesla can do that Waymo can’t

Training data is one of the fundamental factors that determine how well deep neural networks perform. (The other two are the network architecture and optimization algorithm.) As a general principle, more training data leads to better performance. This is why I believe Tesla, not Waymo, has the most promising autonomous vehicles program in the world.

With approximately 500,000 vehicles on the road equipped with what Tesla claims is full self-driving hardware, Tesla’s fleet drives about as many miles each day – around 15 million – as Waymo’s fleet has driven in its entire existence. 15 million miles a day extrapolates to 5.4 billion miles a year, or about 200x more than Waymo’s current total. The fleet is also growing by approximately 5,000 cars per week.

There are three key areas where data makes a difference:

  • Computer vision
  • Prediction
  • Path planning/driving policy

Computer vision

One important computer vision task is object detection. Some objects, such as horses, only appear on the road rarely. Whenever a Tesla encounters what the neural network thinks might be a horse (or perhaps just an unrecognized object obstructing a patch of road), the cameras will take a snapshot, which will be uploaded later over wifi. It helps to have vehicles driving billions of miles per year because you can source many examples of rare objects. It stands to reason that, over time, Teslas will become better at recognizing rare objects than Waymo vehicles.
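A fleet-side trigger of this kind might look something like the sketch below. The detector interface, labels, and thresholds are illustrative assumptions, not Tesla’s actual software; the idea is simply that a snapshot fires on a rare label or an uncertain on-road detection:

```python
# Hypothetical sketch of a fleet-side snapshot trigger for rare objects.
# Interfaces and thresholds are illustrative assumptions, not Tesla's code.
from dataclasses import dataclass, field

@dataclass
class Detection:
    label: str          # e.g. "car", "pedestrian", "unknown"
    confidence: float   # 0.0 .. 1.0
    on_road: bool       # object overlaps the drivable corridor

@dataclass
class SnapshotQueue:
    frames: list = field(default_factory=list)

    def enqueue(self, frame_id: str) -> None:
        # A real system would persist the image for later Wi-Fi upload.
        self.frames.append(frame_id)

def should_snapshot(dets: list, rare_labels: set,
                    conf_threshold: float = 0.5) -> bool:
    """Trigger when the network sees a rare label, or an uncertain
    object obstructing the road (a possible unrecognized obstacle)."""
    for d in dets:
        if d.label in rare_labels:
            return True
        if d.on_road and d.confidence < conf_threshold:
            return True
    return False

queue = SnapshotQueue()
dets = [Detection("horse", 0.62, True), Detection("car", 0.97, True)]
if should_snapshot(dets, rare_labels={"horse", "moose"}):
    queue.enqueue("frame_001")
```

The labelling bottleneck discussed below is exactly why a trigger like this matters: it filters billions of miles down to the frames worth paying humans to label.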

For common objects, the bottleneck for Waymo and Tesla is most likely paying people to manually label the images. It’s easy to capture more images than you can pay people to label. But for rare objects, the bottleneck for Waymo is likely collecting images in the first place, whereas for Tesla the bottlenecks are likely just labelling and developing the software to trigger snapshots at the right time. This is a much better position to be in.

Tesla’s Director of AI, Andrej Karpathy, explains in this clip (taken from his Autonomy Day presentation) how Tesla sources images to train object detection:


Prediction

Prediction is the ability to anticipate the movements and actions of cars, pedestrians, and cyclists a few seconds ahead of time. Anthony Levandowski, who for years was one of the top engineers at Waymo, recently wrote that “the reason why nobody has achieved” full autonomy “is because today’s software is not good enough to predict the future.” Levandowski claims the main category of failures for autonomous vehicles is incorrectly predicting the behaviour of nearby cars and pedestrians.

Tesla’s fleet of approximately 500,000 vehicles is a fantastic resource here. Any time a Tesla makes an incorrect prediction about a car or pedestrian, the Tesla can save a data snapshot to later upload and add to Tesla’s training set. Tesla may be able to upload an abstract representation of the scene (wherein objects are visualized as colour-coded cuboid shapes and pixel-level information is thrown away) produced by its computer vision neural networks, rather than upload video. This would radically reduce the bandwidth and storage requirements of uploading this data.
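A rough back-of-the-envelope sketch shows why an abstract scene representation is so much cheaper to upload than raw video. The cuboid fields and the camera resolution here are illustrative assumptions, not Tesla’s actual format:

```python
# Back-of-the-envelope comparison: an abstract cuboid scene vs. one raw
# video frame. Field names and sizes are illustrative assumptions.
import json

def cuboid(obj_id, cls, x, y, z, length, width, height, yaw):
    # A colour-coded cuboid reduces to a class plus pose and size numbers.
    return {"id": obj_id, "cls": cls,
            "pose": [x, y, z, yaw], "size": [length, width, height]}

scene = [cuboid(1, "car", 12.0, -3.2, 0.0, 4.5, 1.8, 1.5, 0.1),
         cuboid(2, "pedestrian", 6.5, 2.1, 0.0, 0.5, 0.5, 1.7, 1.6)]

abstract_bytes = len(json.dumps(scene).encode())
raw_frame_bytes = 1280 * 960 * 3   # one uncompressed 1280x960 RGB frame

# The abstract scene is a few hundred bytes; a single raw frame is
# megabytes, and video is many frames per second from multiple cameras.
print(abstract_bytes, raw_frame_bytes)
```

Even against compressed video the gap remains several orders of magnitude, since the cuboid list grows with the number of objects, not the number of pixels.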

Whereas images used to train object detection require human labelling, a prediction neural network can learn correlations between past and future just from temporal sequences of events. What behaviour precedes what behaviour is inherent in any recording (video or abstracted). Andrej Karpathy explains the process in the clip below:

Since there is no need for humans to label the data, Tesla can train its neural networks on as much useful data as it can collect. This means the size of its training dataset will correlate with its overall mileage. As with object detection, the advantage over Waymo isn’t just more data for predicting common behaviours, but the ability to collect data on rare behaviours seen in rare situations in order to predict those as well.
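The self-supervised setup described above can be sketched in a few lines: slice a recorded trajectory into past/future pairs, and the future becomes the label with no human annotation. The window sizes and the toy pedestrian track are illustrative:

```python
# Sketch of self-supervised labelling for prediction: in a recorded
# sequence, the future *is* the label. Window sizes are illustrative.
def make_training_pairs(track, past=3, future=2):
    """track: time-ordered list of object states (e.g. positions).
    Returns (input, target) pairs: `past` states in, `future` states out."""
    pairs = []
    for t in range(past, len(track) - future + 1):
        pairs.append((track[t - past:t], track[t:t + future]))
    return pairs

# A pedestrian's position over six timesteps:
track = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
pairs = make_training_pairs(track)
# First pair: past [0.0, 0.5, 1.0] -> future [1.5, 2.0]
```

Every additional mile of recorded driving yields more pairs like these for free, which is why dataset size tracks fleet mileage rather than labelling budget.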

Path planning/driving policy

Path planning and driving policy refer to the actions that a car takes: staying centred in its lane at the speed limit, changing lanes, passing a slow car, making a left turn on a green light, nudging around a parked car, stopping for a jaywalker, and so on. It seems fiendishly difficult to specify a set of rules that encompass every action a car might ever need to take under any circumstance. One way around this fiendish difficulty is to get a neural network to copy what humans do. This is known as imitation learning (also sometimes called apprenticeship learning, or learning from demonstration).

The training process is similar to how a neural network learns to predict the behaviour of other road users by drawing correlations between past and present. In imitation learning, a neural network learns to predict what a human driver would do by drawing correlations between what it sees (via the computer vision neural networks) and the actions taken by human drivers.
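As a toy illustration of this correlation-fitting, the sketch below trains a linear “policy” to reproduce a human driver’s steering from a perceived state. The features, demonstrations, and learning rate are all made up, and a real system would use deep networks rather than a linear model:

```python
# Minimal imitation-learning sketch: fit a linear "policy" mapping a
# perceived state to the steering action a human driver took. A toy
# stand-in for a neural network; all numbers are illustrative.
def train_policy(states, actions, lr=0.05, epochs=2000):
    """states: lists of features; actions: human steering commands.
    Minimizes squared error between predicted and human actions."""
    w = [0.0] * len(states[0])
    for _ in range(epochs):
        for s, a in zip(states, actions):
            err = sum(wi * si for wi, si in zip(w, s)) - a
            w = [wi - lr * err * si for wi, si in zip(w, s)]
    return w

# State = [lane-centre offset, scaled road curvature].
# These demonstrations follow steering = -1.0*offset + 2.0*curvature.
states = [[1.0, 0.0], [0.0, 2.0], [-2.0, 1.0], [3.0, -1.0]]
actions = [-1.0, 4.0, 4.0, -5.0]
w = train_policy(states, actions)   # w approaches [-1.0, 2.0]
```

The point is that the target signal is the human’s own action, so the more demonstrations the fleet records, the more of the state space the policy sees.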

Imitation learning recently met with arguably its greatest success yet: AlphaStar. DeepMind used examples from a database of millions of human-played games of StarCraft to train a neural network to play like a human. The network learned the correlations between the game state and human players’ actions, and thereby learned to predict what a human would do when presented with a game state. Using only this training, AlphaStar reached a level of ability that DeepMind estimates would put it roughly in the middle of StarCraft’s competitive rankings. (Afterward, AlphaStar was augmented using reinforcement learning, which is what allowed it to ascend to pro-level ability. A similar augmentation may or may not be possible with self-driving cars – that’s another topic.)

Tesla is applying imitation learning to driving tasks such as handling the steep curves of a highway cloverleaf or making a left turn at an intersection. It sounds like Tesla plans to extend imitation learning to more tasks over time, such as how and when to change lanes on the highway. Karpathy describes how Tesla uses imitation learning in this clip:

As with prediction, it may be sufficient to upload an abstract representation of the scene surrounding the car, rather than upload video. This would imply much lower bandwidth and storage requirements.

Also as with prediction, no human labelling is needed once the data is uploaded. Since the neural network is predicting what a human driver would do given a world state, all it needs are the world state and the driver’s actions. Imitation learning is, in essence, predicting Tesla drivers’ behaviour rather than predicting the behaviour of the other road users that Teslas see around them. As with AlphaStar, all the information needed is contained within the “replay” of what happened.

Based on Karpathy’s comments about predicting cut-ins, Tesla can trigger a car to save a “replay” when it fails to correctly predict whether a vehicle ahead will cut into the Tesla’s lane. Similarly, Tesla may capture “replay” data when a neural network involved in path planning or driving policy fails to correctly predict the Tesla driver’s actions. Elon Musk has alluded to this capability (or something similar) in the past, although it’s not clear if it’s currently running in Tesla cars.

The inverse would be when a Tesla is on Autopilot or in the upcoming urban semi-autonomous mode and the human driver takes over. This could be a rich source of examples where the system does something incorrectly and the human driver promptly demonstrates how to do it correctly.

Other ways to capture interesting “replays” include: sudden braking or swerving, automatic emergency braking, crashes or collision warnings, and more sophisticated techniques in machine learning known as anomaly detection and novelty detection. (These same conditions could also be used to trigger “replay” captures for prediction or camera snapshots for object detection.) If Tesla already knows what it wants to capture, such as left turns at intersections, it can set up a trigger to capture a “replay” whenever the vision neural networks see a traffic light and the left turn signal is activated, or the steering wheel turns left.
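Put together, those trigger conditions might be sketched as follows. The field names and thresholds are hypothetical, not Tesla’s software:

```python
# Hypothetical sketch of the "replay" triggers listed above.
# Field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FrameState:
    hard_braking: bool = False       # sudden braking or swerving
    aeb_fired: bool = False          # automatic emergency braking
    collision_warning: bool = False
    driver_override: bool = False    # human took over from Autopilot
    prediction_error: float = 0.0    # how wrong the behaviour prediction was
    traffic_light_seen: bool = False
    left_signal_on: bool = False

def replay_triggers(f: FrameState, pred_threshold: float = 0.5) -> list:
    """Return the name of every trigger that fires on this frame."""
    fired = []
    if f.hard_braking or f.aeb_fired or f.collision_warning:
        fired.append("safety_event")
    if f.driver_override:
        fired.append("human_correction")
    if f.prediction_error > pred_threshold:
        fired.append("mispredicted_behaviour")
    if f.traffic_light_seen and f.left_signal_on:
        fired.append("left_turn_at_intersection")
    return fired

frame = FrameState(driver_override=True,
                   traffic_light_seen=True, left_signal_on=True)
# replay_triggers(frame) -> ["human_correction", "left_turn_at_intersection"]
```

Designing these triggers well is the engineering work that turns raw fleet mileage into a curated training set.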


Tesla has an advantage over Waymo (and other competitors) in three key areas thanks to its fleet of roughly 500,000 vehicles:

  • Computer vision
  • Prediction
  • Path planning/driving policy

Concerns about collecting the right data, paying people to label it, or paying for bandwidth and storage don’t negate these advantages. These concerns are addressed by designing good triggers, using data that doesn’t need human labelling, and using abstracted representations (“replays”) instead of raw video.

The majority view among business analysts, journalists, and the general public appears to be that Waymo is far in the lead with autonomous driving, and Tesla isn’t close. This view doesn’t make sense when you look at the first principles of neural networks.

What’s more, AlphaStar is a proof of concept of large-scale imitation learning for complex tasks. If you are skeptical that Tesla’s approach is the right one, or that path planning/driving policy is a tractable problem, you have to explain why imitation learning worked for StarCraft but won’t work for driving.

I predict that – barring a radical move by Waymo to increase the size of its fleet – in the next 1–3 years, the view that Waymo is far in the lead and Tesla is far behind will be widely abandoned. People have been focusing too much on demos that don’t inform us about system robustness, deeply limited disengagement metrics, and Google/Waymo’s access to top machine learning engineers and researchers. They have been focusing too little on training data, particularly for rare objects and behaviours where Waymo doesn’t have enough data to do machine learning at all.

Simulation isn’t an advantage for Waymo because Tesla (like all autonomous vehicle companies) also uses simulation, and, more importantly, because a simulation can’t generate the rare objects and rare behaviours that its creators can’t anticipate or don’t know how to model accurately. Pure reinforcement learning didn’t work for AlphaStar because the action space is too large for random exploration to hit upon good strategies, so DeepMind had to bootstrap with imitation learning. This undercuts the supposition that, as with AlphaGo Zero, pure simulated experience can solve any problem, especially one like driving, where anticipating the behaviour of humans is a key component.

Observers of the autonomous vehicles space may be underestimating Tesla’s ability to attract top machine learning talent. A survey of tech workers found that Tesla is the 2nd most sought-after company in the Bay Area, one rank behind Google. It also found Tesla is the 4th most sought-after company globally, two ranks behind Google at 2nd place. (Shopify is in 3rd place globally, and SpaceX is in 1st.) It also bears noting that fundamental advances in machine learning are often shared openly by academia, OpenAI, and corporate labs at Google, Facebook, and DeepMind. The difference between what Tesla can do and what Waymo can do may not be that big.

The big difference between the two companies is data. As Tesla’s fleet grows to 1 million vehicles, its monthly mileage will be about 1 billion miles, 1000x more than Waymo’s rate of about 1 million miles. What that 1000x difference implies for Tesla is superior detection for rare objects, superior prediction for rare behaviours, and superior path planning/driving policy for rare situations. The self-driving challenge is more about handling the 0.001% of miles that contain edge cases than the 99.999% of miles that are unremarkable. So, it stands to reason that the company that can collect a large number of training examples from this 0.001% of miles will do better than the companies that can’t.

Source: Artificial Intelligence on Medium