11.0: Pre-requisites for ML lab

This lab will walk you through the pre-requisite steps needed for the ML lab.

Set up a sample schema and data for the ML lab

Using the Cloud9 IDE environment, run the psql client and connect to the Aurora cluster.

psql

Run the following SQL statements from the psql prompt. This will create the mltest database, where we will run the rest of the SQL queries.

DROP DATABASE IF EXISTS mltest;
CREATE DATABASE mltest;
\c mltest;
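
If you want to confirm that you are connected to the new database, a quick check such as the following (standard PostgreSQL, nothing lab-specific) prints the current database name:

SELECT current_database();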

Next, we will create the comments table and insert sample data, which we will use later in the Comprehend lab.

DROP TABLE IF EXISTS comments;
CREATE TABLE comments (
 comment_id SERIAL PRIMARY KEY,
 comment_text VARCHAR(255) NOT NULL
);

INSERT INTO comments (comment_text) VALUES
('This is very useful, thank you for writing it!'),
('Awesome, I was waiting for this feature.'),
('An interesting write up, please add more details.'),
('I don''t like how this was implemented.'),
('Horrible writing, I have read better books.');
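
To verify the sample data, you can optionally run a simple query such as the following, which should return the five comments inserted above:

SELECT comment_id, comment_text FROM comments ORDER BY comment_id;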

Execute the following SQL commands to create the churn table for the SageMaker portion of the ML lab.

DROP TABLE IF EXISTS churn;
CREATE TABLE churn (
 state varchar(2048) DEFAULT NULL,
 acc_length bigint DEFAULT NULL,
 area_code bigint DEFAULT NULL,
 phone varchar(20) DEFAULT NULL,
 int_plan varchar(2048) DEFAULT NULL,
 vmail_plan varchar(2048) DEFAULT NULL,
 vmail_msg bigint DEFAULT NULL,
 day_mins  double precision DEFAULT NULL,
 day_calls bigint DEFAULT NULL,
 day_charge varchar(200) DEFAULT NULL,
 eve_mins double precision DEFAULT NULL,
 eve_calls bigint DEFAULT NULL,
 eve_charge varchar(200) DEFAULT NULL,
 night_mins double precision DEFAULT NULL,
 night_calls bigint DEFAULT NULL,
 night_charge varchar(200) DEFAULT NULL,
 int_mins double precision DEFAULT NULL,
 int_calls bigint DEFAULT NULL,
 int_charge varchar(200) DEFAULT NULL,
 cust_service_calls bigint DEFAULT NULL,
 Churn varchar(2048) DEFAULT NULL
);
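
Optionally, you can inspect the new table with the psql \d meta-command to confirm the columns were created as expected:

\d churn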

Exit psql by executing the following command or by pressing Ctrl+D.

\q

Next, we will provision the data we need to pass to the SageMaker model, so that we can see the inference results using SQL queries. To do that, we will import roughly 3,000 rows of telecom customer churn data from S3 to Aurora.

The telecom customer churn dataset is publicly available and mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. Each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. To learn more, visit this AWS Machine Learning blog.

Following are the attributes of the Telecom Customer churn dataset:

  • State: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
  • Account Length: the number of days that this account has been active
  • Area Code: the three-digit area code of the corresponding customer’s phone number
  • Phone: the remaining seven-digit phone number
  • Int’l Plan: whether the customer has an international calling plan: yes/no
  • VMail Plan: whether the customer has a voice mail feature: yes/no
  • VMail Message: presumably the average number of voice mail messages per month
  • Day Mins: the total number of calling minutes used during the day
  • Day Calls: the total number of calls placed during the day
  • Day Charge: the billed cost of daytime calls
  • Eve Mins, Eve Calls, Eve Charge: the minutes, number of calls, and billed cost for calls placed during the evening
  • Night Mins, Night Calls, Night Charge: the minutes, number of calls, and billed cost for calls placed during nighttime
  • Intl Mins, Intl Calls, Intl Charge: the minutes, number of calls, and billed cost for international calls
  • CustServ Calls: the number of calls placed to Customer Service
  • Churn?: whether the customer left the service: true/false

To import the dataset into Aurora, we will go through the following steps:

  1. Create an IAM role
  2. Create and attach a policy to the role
  3. Add the role to the Aurora cluster, so it can be used to communicate with S3
  4. Import data from the S3 bucket

First, we will create an IAM role by executing the following command from the prompt.

aws iam create-role \
   --role-name Cloud9-rds-s3-import-role-forlab \
   --assume-role-policy-document '{
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
            "Service": "rds.amazonaws.com"
          },
         "Action": "sts:AssumeRole"
       }
     ] 
   }'  
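
To confirm the role was created, you can optionally retrieve its ARN with a command such as the following (get-role is a standard IAM CLI call):

aws iam get-role --role-name Cloud9-rds-s3-import-role-forlab \
   --query 'Role.Arn' --output text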

Next, we will create the IAM policy and attach it to the role we created in the previous step.

aws iam create-policy \
   --policy-name rds-s3-import-policy \
   --policy-document '{
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "s3import",
         "Action": [
           "s3:GetObject",
           "s3:ListBucket"
         ],
         "Effect": "Allow",
         "Resource": [
           "arn:aws:s3:::*",
           "arn:aws:s3:::*/*"
         ] 
       }
     ] 
   }'

Then, attach the policy to the role:

aws iam attach-role-policy \
   --role-name Cloud9-rds-s3-import-role-forlab \
   --policy-arn $(aws iam list-policies --query 'Policies[?PolicyName==`rds-s3-import-policy`].Arn' --output text)
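
You can optionally verify the attachment with list-attached-role-policies, which should show rds-s3-import-policy:

aws iam list-attached-role-policies \
   --role-name Cloud9-rds-s3-import-role-forlab --output table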

Next, we will add the role to the Aurora cluster, so Aurora can use it to communicate with S3.

aws rds add-role-to-db-cluster \
   --db-cluster-identifier $(echo $DBENDP | cut -d'.' -f1) \
   --feature-name s3Import \
   --role-arn $(aws iam list-roles --query 'Roles[?RoleName==`Cloud9-rds-s3-import-role-forlab`].Arn' --output text)   \
   --region $AWSREGION

Execute the following command and wait until the output shows the cluster status as available, before moving on to the next step.

aws rds describe-db-clusters --db-cluster-identifier $(echo $DBENDP | cut -d'.' -f1) \
--query 'DBClusters[*].[Status]' --output text
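
If you prefer not to re-run the command by hand, a small polling loop like the following sketch (built on the same describe call) waits until the cluster reports available:

until [ "$(aws rds describe-db-clusters --db-cluster-identifier $(echo $DBENDP | cut -d'.' -f1) \
  --query 'DBClusters[0].Status' --output text)" = "available" ]; do
  sleep 10   # check the cluster status every 10 seconds
done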

Validate that the IAM role is active by running the following command:

aws rds describe-db-clusters --db-cluster-identifier $(echo $DBENDP | cut -d'.' -f1) \
--query 'DBClusters[*].[AssociatedRoles]' --output table

The output should list the role ARN with a status of ACTIVE.

Next, we will execute the following commands to:

  1. Connect to the Aurora cluster
  2. Enable the aws_s3 extension
  3. Import data from the S3 bucket

psql -d mltest
CREATE EXTENSION aws_s3 CASCADE;


SELECT aws_s3.table_import_from_s3(
   'churn',
   'state,acc_length,area_code,phone,int_plan,vmail_plan,vmail_msg,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,night_calls,night_charge,int_mins,int_calls,int_charge,cust_service_calls,Churn', 
   '(format csv, HEADER true)',
   aws_commons.create_s3_uri('awsauroralabsmy-us-east-1', 'samples/mldata/churn.txt', 'us-east-1')
);

You should see a success message reporting the number of rows imported into the churn table.
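
Optionally, you can peek at a few of the imported rows with a query such as:

SELECT state, area_code, cust_service_calls, churn FROM churn LIMIT 5;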

Execute the following command to make sure the comments and churn tables exist.

\dt

You should see both tables listed.

Count the number of rows stored in each table by executing:

select count(*) from comments;
select count(*) from churn;

The comments table should contain the 5 rows we inserted, and the churn table should contain the roughly 3,000 rows imported from S3.

Exit from psql.

\q