Discovering Machine Learning with the Iris flower data set

Michael Wittig – 29 Jan 2016

Today I want to show you how you can use the Amazon Machine Learning service to train (supervised learning) a model that can categorize data (multiclass classification).

Introduction to Machine Learning

Assume you have a spreadsheet with data. One column is the outcome your model should predict (also called class or label), while all the other columns (also called features or attributes) are used by the model as input for the prediction.

class weight height
human 75 180
cat 2.8 23
dog 5 50

The model is a function that takes a weight and a height and outputs a class: (weight, height) => class. Supervised Machine Learning is about learning this function by training with a data set that you provide.
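To make the idea of (weight, height) => class more concrete, here is a minimal Python sketch that "learns" from the three rows of the table by returning the class of the closest known record (1-nearest-neighbor). This is only an illustration that I picked because it is short. It is not the algorithm Amazon Machine Learning uses, and the name `predict` is my own.

```python
from math import dist

# Training data copied from the table above: (weight, height) -> class
TRAINING = [
    ((75.0, 180.0), "human"),
    ((2.8, 23.0), "cat"),
    ((5.0, 50.0), "dog"),
]

def predict(weight: float, height: float) -> str:
    """Return the class of the closest training record (1-nearest-neighbor)."""
    _features, label = min(
        TRAINING,
        key=lambda record: dist(record[0], (weight, height)),
    )
    return label
```

Calling `predict(70, 175)` picks the class of the record nearest to that weight and height. With only three records the function is crude, which is exactly why a real model needs more training data.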

Iris flower data set example

In our case we want to predict the species of a flower called Iris by looking at four features. We will use the Iris flower data set which you can download to train our model.

The data set contains 50 records for each of 3 species of Iris (150 records in total):

Iris setosa Iris versicolor Iris virginica
![Iris setosa](/images/2016/01/Kosaciec_szczecinkowaty_Iris_setosa.jpg) ![Iris versicolor](/images/2016/01/220px-Iris_versicolor_3.jpg) ![Iris virginica](/images/2016/01/220px-Iris_virginica.jpg)

Each record contains 4 features:

  • Sepal length
  • Sepal width
  • Petal length
  • Petal width

and each record has a species (class) assigned.

The data set is provided in CSV format and looks like this:

5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica

The first 4 columns are the features while the 5th column is the class. In the end we want a model that predicts the class from the 4 features. So if you discover an Iris in nature you can predict the species by putting the 4 features into the model.
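If you want to look at the records outside of the service, the following Python sketch splits each CSV row into the four features and the class. The function name `parse_records` and the `SAMPLE` constant are mine, and the sample only contains the three rows shown above.

```python
import csv
from io import StringIO

# The three example rows from above
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def parse_records(text):
    """Turn CSV text into a list of (features, species) tuples."""
    records = []
    for row in csv.reader(StringIO(text)):
        if not row:  # skip blank lines
            continue
        *features, species = row  # first 4 columns, then the class
        records.append(([float(value) for value in features], species))
    return records

records = parse_records(SAMPLE)
```

Each entry in `records` is a pair of a feature list and a species name, which is the shape most training code expects.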

So let’s get started.

WARNING This example is not covered by the free tier. See the pricing page for more details. I spent 0.64 USD on this experiment.

Upload data set to S3

I assume that you have aws-cli installed and configured. You need to create an S3 bucket and upload the CSV file. Make sure to replace $YourName with your name or something that makes your bucket name unique.

$ aws --region us-east-1 s3 mb s3://$YourName-iris-data
$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
$ aws --region us-east-1 s3 cp iris.data s3://$YourName-iris-data/iris.data
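Before you continue, you may want to check that the download is complete. The Python sketch below counts the records per species, assuming that the class is the last column of every row. The function name `count_species` is my own. For the complete file it should report 50 records for each species.

```python
import csv
from collections import Counter

def count_species(lines):
    """Count records per class; the class is the last column of each row."""
    counts = Counter()
    for row in csv.reader(lines):
        if row:  # the file may contain blank lines
            counts[row[-1]] += 1
    return counts

# Usage (assumes iris.data is in the current directory):
# with open("iris.data", newline="") as f:
#     print(count_species(f))
```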

Machine Learning

We are going to create three things with the Machine Learning service:

  • Datasource: This links to the S3 bucket and defines the schema of our data
  • Model: That’s the actual model that is generated
  • Evaluation: We also test how accurate our model is

Datasource

Open the Machine Learning Console. Make sure that you are in the us-east-1 (N. Virginia) region.

Click the Get started button.

Step 1

Click the Launch button.

Step 2

Define the location of your data set and Verify the data set.

Step 3

Click the Continue button.

Step 4

The service is smart enough to create the schema of our data automatically. Just Continue.

Step 5

Now you need to define the column that is the prediction target (class). Select the last column and click Continue.

Step 6

Click the Review button.

Step 7

Click the Continue button.

Step 8

Model

The service recognizes that it deals with a multi-class prediction problem. We only need to make one adjustment, so please select Custom Training and evaluation settings and click the Continue button.

Step 9

Click the Continue button.

Step 10

Click the Continue button.

Step 11

Evaluation

Now we tell the service that it should randomly split our data set into a training data set (70% of the data) and a validation data set (30% of the data). The idea is that the training data set is used to train the model while the validation data set is used to determine the accuracy of the model. So the accuracy is calculated with data that the model has never seen before.
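The idea behind the random 70/30 split can be sketched in a few lines of Python. This is not how the service implements the split internally (I don't know that). The function name `split_records` and the fixed `seed` are assumptions of mine to make the example repeatable.

```python
import random

def split_records(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and validation parts."""
    shuffled = list(records)               # keep the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 150 placeholder records stand in for the rows of the Iris data set
training, validation = split_records(range(150))
```

With 150 records you end up with 105 records for training and 45 for validation. A different seed gives a different split, which is one reason why two runs can report different accuracies.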

Step 12

Click the Finish button to start the model training process.

Step 13

Now you need to wait a few minutes until your model is ready.

Step 14

Now it’s time to check the accuracy of the model. Click Evaluation: ML model: Iris flow data set and then Explore performance on the left and you will get a matrix that shows you how well your model works.

Step 15

My model has an overall accuracy of 86%. If you like you can explore in depth what mistakes your model made by looking at the result matrix.
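The overall accuracy is simply the share of validation records where the predicted class equals the real class, and the matrix counts every (real, predicted) combination. Here is a small Python sketch of both. The function names are mine, and the usage example below uses made-up labels, not the results of my evaluation.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Share of records on the diagonal, where actual equals predicted."""
    total = sum(matrix.values())
    correct = sum(
        count for (actual_class, predicted_class), count in matrix.items()
        if actual_class == predicted_class
    )
    return correct / total
```

For example, with the hypothetical labels `actual = ["a", "a", "b", "b"]` and `predicted = ["a", "b", "b", "b"]`, three of four predictions are correct, so `accuracy(confusion_matrix(actual, predicted))` is 0.75. The off-diagonal entry `("a", "b")` tells you which mistake was made.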

WARNING Depending on the randomization of the data it is possible that you get different results than me!

Why is the model not 100% accurate? Simplified explanation: We are either training with too little data (the model has not seen enough real-world data) or not all relevant features are in our data set to really distinguish the species of Iris.

Now we need to predict something. Open the Try real-time predictions link on the left, enter the four values and click the Create prediction button. After that you should see the prediction result on the right. In my case the model is over 99% confident that the right class is Iris-virginica.

Step 16

Cleanup

Make sure to delete your Evaluation, Model and the three Datasources. Don’t forget to delete your S3 bucket.

Michael Wittig

I’ve been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.
