Discovering Machine Learning with the Iris flower data set
Today I want to show you how you can use the Amazon Machine Learning service to train a model with supervised learning so that it can categorize data (multiclass classification).
Introduction to Machine Learning
Imagine you have a spreadsheet with data. One column is the outcome of your model (also called the class or label), while all the other columns (also called features or attributes) are used by the model as input for prediction.
class | weight | height |
---|---|---|
human | 75 | 180 |
cat | 2.8 | 23 |
dog | 5 | 50 |
The model is a function that takes a weight and a height and outputs a class: (weight, height) => class. Supervised Machine Learning is about learning this function by training with a data set that you provide.
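To make the idea of a model as a function concrete, here is a minimal sketch in Python. It uses a nearest-neighbor lookup over the three example rows. That technique is my own choice for illustration and not necessarily what the Amazon Machine Learning service does internally. The names `TRAINING_DATA` and `predict` are hypothetical.

```python
import math

# The three example rows from the table: (weight, height) -> class.
TRAINING_DATA = [
    ((75.0, 180.0), "human"),
    ((2.8, 23.0), "cat"),
    ((5.0, 50.0), "dog"),
]


def predict(weight, height):
    """Return the class of the training row closest to (weight, height)."""
    def distance(row):
        (w, h), _label = row
        return math.hypot(weight - w, height - h)

    _features, label = min(TRAINING_DATA, key=distance)
    return label


print(predict(70, 175))  # closest training row is the human
print(predict(3, 25))    # closest training row is the cat
```

With only three rows this is a toy, but it has the same shape as the real thing: features go in, a class comes out.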
Iris flower data set example
In our case we want to predict the species of a flower called Iris by looking at four features. We will use the Iris flower data set which you can download to train our model.
The data set contains 50 records for each of 3 species of Iris (150 records in total):
Iris setosa | Iris versicolor | Iris virginica |
---|---|---|
![Iris setosa](/images/2016/01/Kosaciec_szczecinkowaty_Iris_setosa.jpg) | ![Iris versicolor](/images/2016/01/220px-Iris_versicolor_3.jpg) | ![Iris virginica](/images/2016/01/220px-Iris_virginica.jpg) |
Each record contains 4 features:
- Sepal length
- Sepal width
- Petal length
- Petal width
and each record has a species (class) assigned.
The data set is provided in CSV format and looks like this:
5.1,3.5,1.4,0.2,Iris-setosa
The first 4 columns are the features while the 5th column is the class. In the end we want a model that predicts a class from the 4 features. So if you discover an Iris in nature, you can predict the species by putting the 4 features into the model.
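If you want to look at a record outside of the console, a small sketch like the following splits one CSV line into features and class. The field names in `FEATURE_NAMES` are my own labels, and I assume the columns appear in the same order as the feature list above. The file itself has no header row in this example.

```python
import csv
import io

# Hypothetical names for the four feature columns, in the assumed column order.
FEATURE_NAMES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def parse_record(line):
    """Split one CSV line into a feature dict and the class (species)."""
    row = next(csv.reader(io.StringIO(line)))
    *values, species = row  # the last column is the class
    features = dict(zip(FEATURE_NAMES, (float(v) for v in values)))
    return features, species


features, species = parse_record("5.1,3.5,1.4,0.2,Iris-setosa")
print(features)
print(species)
```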
So let’s get started.
WARNING This example is not covered by the free tier. See the pricing page for more details. I spent 0.64 USD on this experiment.
Upload data set to S3
I assume that you have aws-cli installed and configured. You need to create an S3 bucket and upload the CSV file. Make sure to replace $YourName with your name or something that makes your bucket name unique.
$ aws --region us-east-1 s3 mb s3://$YourName-iris- |
Machine Learning
We are going to create three things with the Machine Learning service:
- Datasource: This links to the S3 bucket and defines the schema of our data
- Model: That’s the actual model that is generated
- Evaluation: We also test how accurate our model is
Datasource
Open the Machine Learning Console. Make sure that you are in the us-east-1 (N. Virginia) region.
Click the Get started button.
Click the Launch button.
Define the location of your data set and Verify the data set.
Click the Continue button.
The service is smart enough to create the schema of our data automatically. Just Continue.
Now you need to define the column that is the prediction target (class). Select the last column and click Continue.
Click the Review button.
Click the Continue button.
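The schema that the wizard infers in the Datasource steps can be pictured roughly like this. The sketch builds a schema document in the JSON layout that I recall from the Amazon Machine Learning documentation, so please verify the field names against the official docs before relying on them. The attribute names are hypothetical. The console generates its own names when the CSV file has no header row.

```python
import json

# Hypothetical attribute names; the console picks its own for header-less files.
schema = {
    "version": "1.0",
    "targetAttributeName": "species",  # the prediction target (the last column)
    "dataFormat": "CSV",
    "dataFileContainsHeader": False,
    "attributes": [
        {"attributeName": "sepal_length", "attributeType": "NUMERIC"},
        {"attributeName": "sepal_width", "attributeType": "NUMERIC"},
        {"attributeName": "petal_length", "attributeType": "NUMERIC"},
        {"attributeName": "petal_width", "attributeType": "NUMERIC"},
        {"attributeName": "species", "attributeType": "CATEGORICAL"},
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

A categorical target with more than two values is what makes this a multiclass problem.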
Model
The service recognizes that it is dealing with a multiclass prediction problem. We only need to make one adjustment, so please select Custom Training and evaluation settings and click the Continue button.
Click the Continue button.
Click the Continue button.
Evaluation
Now we tell the service that it should randomly split our data set into a training data set (70% of the data) and a validation data set (30% of the data). The idea is that the training data set is used to train the model while the validation data set is used to determine the accuracy of the model. So the accuracy is calculated with data that the model has never seen before.
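The random 70/30 split can be sketched in a few lines of Python. This only illustrates the idea and is not how the service implements it. The function name `split_records` and the fixed seed are my own assumptions.

```python
import random


def split_records(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(150))  # stand-ins for the 150 Iris records
training, validation = split_records(records)
print(len(training), len(validation))
```

Every record ends up in exactly one of the two parts, so the validation records never take part in training.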
Click the Finish button to start the model training process.
Now you need to wait a few minutes until your model is ready.
Now it’s time to check the accuracy of the model. Click Evaluation: ML model: Iris flow data set and then Explore performance on the left, and you will get a matrix that shows you how well your model works.
My model has an overall accuracy of 86%. If you like you can explore in depth what mistakes your model made by looking at the result matrix.
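To see how an accuracy figure and a result matrix relate to each other, here is a small sketch. The labels are made up for illustration and are not the results of my evaluation. The function names are hypothetical as well.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of records where the predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up labels, for illustration only.
actual = ["setosa", "setosa", "versicolor", "versicolor",
          "virginica", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica",
             "virginica", "virginica", "versicolor"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:<10} predicted={p:<10} count={count}")
print(f"accuracy: {accuracy(actual, predicted):.2f}")
```

The cells where actual and predicted differ are the mistakes. They tell you which species the model confuses with each other.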
WARNING Depending on the randomization of the data it is possible that you get different results than me!
Why is the model not 100% accurate? Simplified explanation: We are either training with too little data (the model has not seen enough real-world data) or not all relevant features are in our data set to really distinguish the species of Iris.
Now we need to predict something. Open the Try real-time predictions link on the left, enter the four values and click the Create prediction button. After that you should see the prediction result on the right. In my case the model is over 99% confident that the right class is Iris-virginica.
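A multiclass prediction boils down to one score per class, and the class with the highest score wins. The following sketch shows that last step. The scores are hypothetical values that I picked to resemble my result, and `pick_label` is my own helper, not part of any AWS SDK.

```python
def pick_label(scores):
    """Return the class with the highest score together with that score."""
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical per-class scores, for illustration only.
scores = {
    "Iris-setosa": 0.001,
    "Iris-versicolor": 0.004,
    "Iris-virginica": 0.995,
}

label, confidence = pick_label(scores)
print(f"{label} ({confidence:.1%})")
```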
Cleanup
Make sure to delete your Evaluation, Model and the three Datasources. Don’t forget to delete your S3 bucket.