Lab 2: Fast Cloning

In this lab, we will walk through the process of creating an Aurora fast clone. We will observe the divergence of data and compare the performance between the original and cloned Aurora clusters.

Prerequisites

This lab requires the following lab modules to be completed first:

1. Setting up the Fast Clone divergence test

Let's review the tables we created in the Configure Cloud9 and Initialize Database step; they will be used throughout this lab.

Resource name    Description
cloneeventtest   Table to store the counter and the timestamp
statusflag       Table to store the status flag that starts/stops the counter
eventerrormsg    Table to store error messages
cloneeventproc   Function to add data to the cloneeventtest table based on the start counter flag
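The actual definitions come from the Configure Cloud9 and Initialize Database step; as a rough reference only, the objects might look something like the following sketch (column names are inferred from the query output shown later in this lab, and the real DDL may differ):

```sql
-- Hypothetical sketch; the actual DDL was created in the prerequisite lab.
CREATE TABLE cloneeventtest (
    counter         integer,       -- incremented once per iteration
    seconds_elapsed integer,       -- counter * 60
    data            timestamptz    -- timestamp of the insert
);

CREATE TABLE statusflag (
    delflag char(1)                -- 'N' = keep counting, 'Y' = stop
);

CREATE TABLE eventerrormsg (
    errmsg text                    -- error messages captured by the function
);
```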

2. Creating and verifying performance impact of the Fast Clone

To verify whether the clone cluster impacts the performance of the primary (source) cluster, we will perform the following steps. We will explore these steps in more detail in the corresponding sections below.

a. Start pgbench (a PostgreSQL benchmarking tool) to generate a synthetic workload on the primary cluster for 30 minutes.
b. Execute the function cloneeventproc to start adding sample data on the source cluster.
c. After about 5 minutes, stop the cloneeventproc function and kick off creation of the fast clone cluster. The pgbench workload will continue to execute.
d. The clone cluster should be ready after about 15 minutes. Execute the function cloneeventproc on the primary cluster again to start data divergence.
e. Verify the output from the sample table cloneeventtest on both the primary and the clone cluster to see the data divergence.
f. Run the same pgbench workload on the clone cluster as the one running on the primary cluster (step a).
g. Verify the pgbench transactions per second (TPS) metrics on the primary and the clone cluster.

2.1 Running the pgbench workload on the primary cluster

Before creating a fast clone of the primary cluster, we are going to start a pgbench test to measure transactions per second (TPS) on the primary cluster. Open a Cloud9 terminal window (session #1) and run the following command. It will run for 30 minutes (-T 1800), reporting progress every 60 seconds (-P 60) with 8 client connections (-c 8) and 8 worker threads (-j 8).

pgbench --progress-timestamp -M prepared -n -T 1800 -P 60 -c 8 -j 8 -b tpcb-like@1 -b select-only@20 > Primary_results.log
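The two -b options weight the built-in scripts 1:20, so pgbench targets roughly 1/21 of transactions at the tpcb-like (read/write) script and 20/21 at select-only, which matches the 4.8%/95.2% split reported in the result logs in step 2.8. A quick sanity check of that arithmetic:

```shell
# Script weights from the pgbench command above: tpcb-like@1, select-only@20
awk 'BEGIN {
  total = 1 + 20
  printf "tpcb-like:   %.1f%%\n", 100 * 1  / total   # read/write share
  printf "select-only: %.1f%%\n", 100 * 20 / total   # read-only share
}'
```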

2.2 Verify the environment and run the sample divergence test

To verify the data divergence between the primary and the clone cluster, we will add sample data using the tables and function listed in step 1.

We need to open one more Cloud9 terminal window (session #2) to connect to Aurora and run the function. To open another terminal window in your Cloud9 environment, click the Window menu and select New Terminal.

Run the following commands to verify that the delflag column is set to N in the statusflag table and that there is no data in the table cloneeventtest. Then execute the function cloneeventproc() to start adding sample data. This function will add a row to the table cloneeventtest every 60 seconds.

psql
select * from statusflag;
select * from cloneeventtest;
select cloneeventproc();
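The real cloneeventproc was created in the prerequisite lab; conceptually it behaves like the following hypothetical sketch (the exact logic and names are assumptions for illustration): it loops, sleeping 60 seconds per iteration and inserting a row into cloneeventtest, until delflag in statusflag is set to 'Y'.

```sql
-- Hypothetical sketch of the behavior; the actual function was created in the
-- Configure Cloud9 and Initialize Database step and may differ.
CREATE OR REPLACE FUNCTION cloneeventproc() RETURNS void AS $$
DECLARE
    i integer := 0;
BEGIN
    LOOP
        -- Stop as soon as the flag is flipped to 'Y' (see step 2.3)
        EXIT WHEN (SELECT delflag FROM statusflag LIMIT 1) = 'Y';
        PERFORM pg_sleep(60);
        i := i + 1;
        INSERT INTO cloneeventtest (counter, seconds_elapsed, data)
        VALUES (i, i * 60, clock_timestamp());
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```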

At this time (which we will call "T1"), the pgbench workload is running on the source DB cluster, and we are also adding sample data to the table on the primary cluster every 60 seconds.

2.3 Stop the sample data generation

After about 5 minutes, we will kick off creation of the fast clone cluster.

First, at T1+5 minutes, we will stop the function execution by manually setting the delflag column in the statusflag table to Y. Open one more Cloud9 terminal window to connect to Aurora (session #3). The pgbench workload will continue to execute on the primary source cluster in session #1.

From session #3:

psql
select pg_sleep(300); update statusflag set delflag='Y';

Go back to session #2, where we ran the cloneeventproc function. Wait for ~60 seconds until you see the function complete its execution:

mylab=> select cloneeventproc();
 cloneeventproc 
----------------
 
(1 row)

Let’s check the number of rows in the table cloneeventtest:

select count(*) from cloneeventtest;

We should see 5 or more rows in the table:

 count 
-------
     5
(1 row)

Let’s set proper timezone and check rows in cloneeventtest table:

SET timezone = 'America/Los_Angeles';
select * from cloneeventtest;
 counter | seconds_elapsed |             data              
---------+-----------------+-------------------------------
       1 |              60 | 2020-01-20 13:57:13.473963-08
       2 |             120 | 2020-01-20 13:58:13.473963-08
       3 |             180 | 2020-01-20 13:59:13.473963-08
       4 |             240 | 2020-01-20 14:00:13.473963-08
       5 |             300 | 2020-01-20 14:01:13.473963-08
(5 rows)

2.4 Create Fast Clone Cluster

Once the function execution has stopped (after time T1+5 minutes), we will start creating the fast clone of the primary cluster. The pgbench workload will continue running on the primary cluster in session #1.

Now, we will walk you through the process of cloning a DB cluster. Cloning creates a separate, independent DB cluster, with a consistent copy of your data set as of the time you cloned it. Database cloning uses a copy-on-write protocol, in which data is copied at the time that data changes, either on the source databases or the clone databases. The two clusters are isolated, and there is no performance impact on the source DB cluster from database operations on the clone, or the other way around.

Following are the steps to create a database fast clone of your Aurora PostgreSQL cluster:

a. Sign in to the AWS Management Console and open the Amazon RDS console.
b. In the navigation pane, choose Databases and select the DB identifier with the cluster name you created as part of the CloudFormation stack. Click the Actions menu at the top and select Create clone:

1-fastclone-3

c. Enter Labstack-clone as the DB instance identifier, and pick the DB cluster parameter group and DB parameter group created by the CloudFormation template in the DB cluster parameter group and DB parameter group drop-down menus. Leave the rest of the input fields at their default values and click Create clone.

You can find the DB cluster parameter group (Key: apgcustomclusterparamgroup) and DB parameter group (Key: apgcustomdbparamgroup) names by clicking the CloudFormation stack with the description "Amazon Aurora PostgreSQL Labs Stackset" and going to the Outputs tab.

1-fastclone-4 1-fastclone-5 1-fastclone-6

d. Once you click Create clone, the Status column will show "Creating".

1-fastclone-8

e. The clone cluster should be ready after about 10-15 minutes. The Status column will show "Available" once the cloned cluster is ready.

2.5 Start the sample data divergence process on the primary cluster

Once the clone cluster creation process has been kicked off, we will restart the sample data generation process on the primary cluster. Any sample data added from this point onwards should only be available on the primary cluster, not on the clone cluster.

psql 
truncate cloneeventtest;
update statusflag set delflag='N';
select count(*) from cloneeventtest;
select cloneeventproc();

We should see the following result:

 count 
-------
     0
(1 row)

 cloneeventproc 
----------------

The function is now running; we will leave it running for some time while the cloned cluster is being created.

2.6 Verify the data divergence on the Clone Cluster

The clone cluster should be ready after about 10-15 minutes (time T1 + ~10-15 minutes).

The table cloneeventtest on the cloned cluster should contain a snapshot of the data as it existed on the primary cluster at ~T1+5 minutes, because that is when we started creating the clone.

Copy the writer endpoint of your cloned Aurora cluster by clicking the cluster name and going to the Connectivity & security tab.

1-fastclone-13

Connect to the cloned Aurora cluster from the session #3 window. Replace <Cloned Cluster Writer Endpoint> below with the writer endpoint of your cloned Aurora cluster that you copied above.

psql -h <Cloned Cluster Writer Endpoint>

Then run the following SQL commands to check the contents of the data:

select count(*) from cloneeventtest;
SET timezone = 'America/Los_Angeles';
select * from cloneeventtest;
 count 
-------
     5
(1 row)

 counter | seconds_elapsed |             data              
---------+-----------------+-------------------------------
       1 |              60 | 2020-01-20 13:57:13.473963-08
       2 |             120 | 2020-01-20 13:58:13.473963-08
       3 |             180 | 2020-01-20 13:59:13.473963-08
       4 |             240 | 2020-01-20 14:00:13.473963-08
       5 |             300 | 2020-01-20 14:01:13.473963-08
(5 rows)

Stop the function that is currently running on the primary Aurora cluster (follow step 2.3) and select the data from the cloneeventtest table there. We should see more rows than on the clone, as expected.

2.7 Run the pgbench workload on the Clone Cluster

We are going to start a pgbench workload on the newly created clone cluster similar to the one we ran on the primary cluster in step 2.1. Replace <Cloned Cluster Writer Endpoint> below with the writer endpoint of your cloned Aurora cluster.

pgbench --progress-timestamp -M prepared -n -T 1800 -P 60 -c 8 -j 8 --host=<Cloned Cluster Writer Endpoint> -b tpcb-like@1 -b select-only@20 > Clone_results.log

2.8 Verify the pgbench metrics on the primary and the clone cluster

Once the pgbench workload completes on both the primary and the clone cluster, we can verify the TPS metrics from both the clusters by looking at the output file.

[ec2-user@ip-xxxxxxxx ~]$ more Primary_results.log 
transaction type: multiple scripts
scaling factor: 100
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 19494742
latency average = 0.739 ms
latency stddev = 1.361 ms
tps = 10830.345301 (including connections establishing)
tps = 10830.428982 (excluding connections establishing)
SQL script 1: <builtin: TPC-B (sort of)>
 - weight: 1 (targets 4.8% of total)
 - 928533 transactions (4.8% of total, tps = 515.848479)
 - latency average = 6.228 ms
 - latency stddev = 2.127 ms
SQL script 2: <builtin: select only>
 - weight: 20 (targets 95.2% of total)
 - 18561932 transactions (95.2% of total, tps = 10312.120725)
 - latency average = 0.464 ms
 - latency stddev = 0.371 ms
[ec2-user@ip-10-0-0-204 ~]$ more Clone_results.log
transaction type: multiple scripts
scaling factor: 100
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 19494742
latency average = 0.739 ms
latency stddev = 1.361 ms
tps = 10830.345301 (including connections establishing)
tps = 10830.428982 (excluding connections establishing)
SQL script 1: <builtin: TPC-B (sort of)>
 - weight: 1 (targets 4.8% of total)
 - 928533 transactions (4.8% of total, tps = 515.848479)
 - latency average = 6.228 ms
 - latency stddev = 2.127 ms
SQL script 2: <builtin: select only>
 - weight: 20 (targets 95.2% of total)
 - 18561932 transactions (95.2% of total, tps = 10312.120725)
 - latency average = 0.464 ms
 - latency stddev = 0.371 ms
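To pull just the headline TPS figures out of the two logs for a side-by-side comparison, a small grep/awk helper can be used. The helper itself is an assumption (not part of the lab), so it is demonstrated here against a sample line copied from the output above; in the lab you would point it at Primary_results.log and Clone_results.log:

```shell
# Print the overall TPS (including connection establishment) from a pgbench log.
extract_tps() {
  grep 'tps = .*including' "$1" | awk '{ print $3 }'
}

# Demonstration against a sample line from the output above:
printf 'tps = 10830.345301 (including connections establishing)\n' > /tmp/sample_pgbench.log
extract_tps /tmp/sample_pgbench.log

# In the lab itself you would run:
#   extract_tps Primary_results.log
#   extract_tps Clone_results.log
```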

2.9 Sample graph on the fast clone performance impact

Below is a sample comparison chart showing the performance impact of the clone.

1-fastclone-13

3.0 Remove the Cloned Aurora cluster

In a production environment, you might want to delete the cloned Aurora cluster when it's no longer required, to save cost.

a. To remove the cloned cluster, select the writer node and then click Delete from the Actions menu. The Aurora cluster itself will be removed automatically afterwards.

remove-clone

b. In a production environment, you might want to take a final snapshot before deleting the cloned cluster. For this lab, we will deselect the final snapshot option and confirm the deletion.

remove-clone

c. The Status will change to Deleting and the cloned cluster will be removed after some time.

remove-clone