When getting started with machine learning (ML), you need data, and lots of it. Data forms the foundation of your ML strategy, so in this post we'll walk through some of the key considerations around working with it. In a previous post we built a dog identification microservice in Python; we'll use that same use case here to look at how to handle the data behind it.
Volume of Data
In that example, we retrained an Inception model from generalized object recognition to identify specific dog breeds. One of the first considerations to keep in mind is that you need a large volume of data points to train an ML model successfully: enough data to draw statistically significant conclusions. In this case, a single image of a particular dog breed wouldn't be enough to train the model; it takes hundreds of images to reach an acceptable level of accuracy.
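As a rough illustration of what that retraining looks like, here's a minimal transfer-learning sketch with TensorFlow/Keras. It assumes a dog_breeds/ directory containing one sub-folder of images per breed; the directory name, image counts, and training settings are illustrative assumptions, not the exact setup from the earlier post:

```python
import tensorflow as tf

# Assumed layout: dog_breeds/<breed_name>/*.jpg, with hundreds of images per breed.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dog_breeds/",
    image_size=(299, 299),   # InceptionV3's usual input size
    batch_size=32,
)
num_breeds = len(train_ds.class_names)

# Start from Inception weights trained on ImageNet and freeze them,
# so only the new classification head is trained on our (much smaller) dataset.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False)
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # Inception expects inputs in [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_breeds, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```

With only a handful of images per breed, a model like this will overfit badly; the volume of labeled examples is what makes the retraining work at all.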
Preparing your Data
ML is not magic; data preparation is a big part of working with ML effectively, and one of the first steps is cleaning your data. It might come from disparate sources with different levels of precision, different formats, and all sorts of other inconsistencies, all of which need to be unified before your model can ingest the data coherently.
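For image data like ours, a lot of that unification is mechanical: forcing every file into the same format, color space and resolution before it reaches the model. A minimal sketch using Pillow, with assumed directory names and target size:

```python
from pathlib import Path
from PIL import Image

RAW_DIR = Path("raw_images")      # assumed source of mixed-format, mixed-size images
CLEAN_DIR = Path("clean_images")  # unified output the model will actually ingest
TARGET_SIZE = (299, 299)          # match the input size the model expects

CLEAN_DIR.mkdir(exist_ok=True)

for path in RAW_DIR.glob("**/*"):
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".bmp"}:
        continue  # skip anything that isn't an image format we recognize
    try:
        with Image.open(path) as img:
            img = img.convert("RGB")       # unify color space (drops alpha, greyscale, etc.)
            img = img.resize(TARGET_SIZE)  # unify resolution
            out = CLEAN_DIR / path.relative_to(RAW_DIR).with_suffix(".jpg")
            out.parent.mkdir(parents=True, exist_ok=True)
            img.save(out, format="JPEG")   # unify file format
    except OSError:
        print(f"Skipping unreadable file: {path}")
```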
And for supervised learning models, you also need to label your training data. Simply feeding our model pictures of 100 poodles won't allow it to magically work out that they are poodles; we need to label each image with the 'answer' to what it is in order to train the model.
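One common lightweight convention (assumed here, not prescribed by the earlier post) is to encode the label in the directory name and derive an explicit labels file from it, so the 'answer' travels with the data:

```python
import csv
from pathlib import Path

CLEAN_DIR = Path("clean_images")  # assumed layout: clean_images/<breed>/<file>.jpg

# Write a filename -> label manifest for the training set.
with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "breed"])
    for image_path in sorted(CLEAN_DIR.glob("*/*.jpg")):
        writer.writerow([str(image_path), image_path.parent.name])
```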
Bias
We also need to be mindful of the choices we make when gathering and presenting data and labels to our model; otherwise we can introduce bias we never intended. If I only show my model photographs of white poodles, it may fail to recognize brown- or black-coated poodles as the same breed. Many forms of subtle bias can creep into a model, and since the model only knows what you feed it, we need to be sure we are providing an accurate representation of the problem space if we want meaningful results.
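A simple guard against this is to check how your training examples are distributed across the attributes you care about before training. A minimal sketch, assuming the labels.csv from above has been extended with a coat_color column (that column is an assumption for illustration):

```python
import csv
from collections import Counter

# Count examples per (breed, coat color) combination.
# A heavily skewed table is an early warning that the model will only
# "know" the over-represented variants.
counts = Counter()
with open("labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[(row["breed"], row.get("coat_color", "unknown"))] += 1

for (breed, color), n in sorted(counts.items()):
    print(f"{breed:>20} / {color:<10} {n:>5} images")
```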
Reproducibility
Without a reliable way to reproduce your training sets, you will have difficulty debugging, training and scaling your models. If you want to compare results between two different iterations of the model, do you have a dependable way of recreating the previous one?
Source code has source control systems that allow you to reproduce a particular build, but if you have an issue with a model, can you reproduce the specific model from a specific day? At the moment, many training data sets are built up in ways that evolve over time and can be somewhat ad hoc.
Solutions to this problem are still in their infancy, but tools like pachyderm.io act as a form of “source control” for your training data sets, letting you create a reproducible data pipeline for your training data and offering tools to automate deploying that data to your models in production.
Once in production, keeping your models updated with evolving training data is just as important. To be effective, you need an automated, stable, reproducible pipeline that allows you to continuously update and iterate on your models as new data arrives.
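Dedicated tools handle this far more robustly, but as a bare-bones sketch of the underlying idea (not Pachyderm's API, just an illustration with assumed file names), you can fingerprint the exact training set a model was built from and stamp the saved model artifact with that fingerprint, so any model can be traced back to, and rebuilt from, the data that produced it:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file path and its contents so two identical snapshots
    of the training set always produce the same fingerprint."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).glob("**/*")):
        if path.is_file():
            digest.update(str(path).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]

# e.g. in a retraining job (train_breed_model is a hypothetical stand-in
# for the transfer-learning step sketched earlier):
#
#   fingerprint = dataset_fingerprint("clean_images")
#   model = train_breed_model("clean_images")
#   model.save(f"breed_model_{fingerprint}.keras")
```

In a real pipeline, this kind of snapshotting, together with the retraining and deployment steps, would run automatically whenever the training data changes.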
Laying the Foundation
Data is the foundation of your ML strategy, and you'd be surprised how much data you already have on hand to begin working with. As we've seen, getting that data massaged into shape, so it's clean, balanced, consistent and integrated into a reproducible pipeline, lays the groundwork for gaining invaluable insights from ML.
Watch our on-demand webinar, Making Machine Learning Accessible, to learn the business “why” and technical “how” of implementing machine learning (ML) in your organization.
Watch Now