Udacity had the unique opportunity to have two of our thought leaders join a panel discussion on training data for machine learning, entitled AI-AI-Oh!, at SXSW 2019. The discussion sparked an exchange of viewpoints among the expert panelists, ranging from how training data is used in various industries and how much of it you need to apply machine learning, to practical tips for the audience to consider.

You can listen to the entirety of our panel discussion here.

The discussion started with a framing of machine learning. Machine learning (ML) is about teaching computers to learn from data in order to make decisions or predictions. For true machine learning, the computer must learn to identify patterns without being explicitly programmed to do so.

An easy example of a machine learning algorithm is an on-demand music streaming service. To decide which new songs or artists to recommend to a listener, machine learning algorithms associate that listener’s preferences with those of other listeners who have similar musical taste.
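To make that concrete, here is a minimal sketch of the idea behind such a recommendation: compare a listener’s play counts against other listeners and score the songs they haven’t heard yet. The ratings matrix, similarity measure, and function names are all made up for illustration; this is not how any particular streaming service actually works.

```python
import numpy as np

# Toy listener-by-song play counts (rows: listeners, columns: songs).
# All values are made up for illustration.
plays = np.array([
    [5, 3, 0, 1],   # listener 0
    [4, 0, 0, 1],   # listener 1
    [1, 1, 0, 5],   # listener 2
    [0, 1, 5, 4],   # listener 3
])

def cosine_similarity(a, b):
    """Similarity between two listeners' play-count vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def recommend(listener, plays, top_n=1):
    """Recommend songs the listener hasn't played, weighted by
    how much similar listeners played them."""
    sims = np.array([cosine_similarity(plays[listener], other)
                     for other in plays])
    sims[listener] = 0.0                   # ignore self-similarity
    scores = sims @ plays                  # weighted sum over other listeners
    scores[plays[listener] > 0] = -np.inf  # only suggest unheard songs
    return np.argsort(scores)[::-1][:top_n]

print(recommend(listener=1, plays=plays))  # suggests a song listener 1 hasn't heard
```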

Machine learning fuels all sorts of automated tasks and spans multiple industries, from data security firms hunting down malware to finance professionals looking out for favorable trades. These systems are designed to work like virtual personal assistants, and they work quite well.

Machine learning serves a mechanical function the same way a flashlight, a car, or a television does. When something is capable of “machine learning”, it means it performs a function with the data given to it and gets progressively better at that function. It’s as if you had a flashlight that turned on whenever you said “it’s dark” and, with enough examples, learned to recognize other phrases containing the word “dark”.
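As a rough illustration of that flashlight analogy, the sketch below trains a tiny text classifier on a handful of made-up phrases. The phrases, labels, and choice of model are purely illustrative; the point is that the learned pattern generalizes to a phrase it never saw, rather than matching one hard-coded command.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up example phrases labeled 1 ("turn the light on") or 0 ("leave it off").
phrases = ["it's dark", "so dark in here", "getting dark outside",
           "it's bright", "plenty of light", "sunny afternoon"]
labels = [1, 1, 1, 0, 0, 0]

# The "flashlight" learns from labeled phrases instead of matching one fixed string.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(phrases, labels)

# A phrase it never saw during training still triggers the light,
# because the learned pattern (the word "dark") generalizes.
print(model.predict(["wow, it's really dark tonight"]))  # expected: [1]
```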

In machine learning projects, we need a training data set. It is the actual data set used to teach the model how to perform its task.
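In practice, a training data set is just labeled examples handed to a learning algorithm, with some data held back to check how well the learned patterns generalize. Here is a minimal, hypothetical sketch using scikit-learn and synthetic data; the dataset and model choice are placeholders, not anything discussed on the panel.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real labeled examples: feature vectors X, labels y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# The training set is what the model learns from; the held-out test set
# checks how well the learned patterns generalize to unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # learn patterns from the training data

print("accuracy on unseen data:", model.score(X_test, y_test))
```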

ML relies heavily on data; without data, it is impossible for an “AI” to learn. Data is the most crucial ingredient that makes algorithm training possible. The panelists discussed three different types of training data:

Client services data – data generated by customers. “At HubSpot, we gather user-generated training data for ML that informs everything from email send time optimization to audience targeting,” stated Hector Urdiales.

User-generated data – data created by users on their own, without being prompted. “We train data based on patterns,” said Rob McGrorty.

Simulated data – sensor data that self-driving cars, for example, collect in the real world and later replay to test their software. “A test vehicle’s cameras might record video of pedestrians crossing the street at night. Software developers can use that video every time they update their self-driving software, to verify that the software still detects the pedestrians correctly,” explained David Silver.
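A rough sketch of what that replay-style testing could look like is below. The detector, the recorded frames, and the loader are placeholders (no real self-driving stack is this simple); the point is that previously collected data is run through the updated software and the results are compared against known labels.

```python
# Hypothetical regression test: replay recorded night-time frames through the
# latest pedestrian detector and confirm nothing previously labeled is missed.
# `detect_pedestrians` and `load_recorded_frames` are placeholders for whatever
# a real perception stack would provide; they are not real library calls.

def run_replay_test(recording_path, detect_pedestrians, load_recorded_frames):
    """Return True if every labeled pedestrian in the recording is still detected."""
    missed = 0
    for frame, labeled_boxes in load_recorded_frames(recording_path):
        detected_boxes = detect_pedestrians(frame)
        for box in labeled_boxes:
            if not any(overlaps(box, d) for d in detected_boxes):
                missed += 1
    return missed == 0

def overlaps(a, b, threshold=0.5):
    """Crude intersection-over-union check between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return union > 0 and inter / union >= threshold
```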

Essentially, training data is the textbook that will teach your AI to do its assigned task, and will be used over and over again to fine-tune its predictions and improve its success rate. Your AI will use training data in several different ways, all with the aim of improving the accuracy of its predictions.
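One common way that reuse happens is simply making repeated passes (epochs) over the same training set, nudging the model’s parameters a little each time. The toy logistic-regression loop below, on made-up data, is only meant to show accuracy improving as the same examples are revisited.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up training set: 200 points, 2 features, roughly separable labels.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.1

def predict(X, w, b):
    """Logistic-regression probabilities."""
    return 1 / (1 + np.exp(-(X @ w + b)))

# The same training data is reused on every pass (epoch); each pass nudges the
# weights a little further, and accuracy on the training set climbs.
for epoch in range(1, 51):
    p = predict(X, w, b)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b
    if epoch % 10 == 0:
        accuracy = np.mean((p > 0.5) == y)
        print(f"epoch {epoch}: accuracy {accuracy:.2f}")
```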

Quite simply, without training data there is no AI. The cleanliness, relevance, and quality of your data have a direct impact on whether your AI will achieve its goals.

Be sure to listen to this informative panel discussion and learn more about training data and practical use cases.