9.2: Monitor a Global Database

Amazon Aurora exposes a variety of Amazon CloudWatch metrics that you can use to monitor the health and performance of your Aurora Global Database. In this lab you will create an Amazon CloudWatch dashboard to monitor the replication lag, the replicated write I/O, and the cross-region replication data transfer (in bytes) of your Aurora Global Database.

This lab contains the following tasks:

  1. Generate load on the primary DB cluster
  2. Monitor cluster load and replication lag

Generate load on the primary DB cluster

You will use pgbench, a popular benchmarking tool, to generate load.

Open a terminal window in your Cloud9 console in the primary AWS region by referring to the Configure the Cloud9 workstation section.

Please ensure you are still working in the primary AWS region.

Using pgbench, we will put some load on the Aurora cluster in the primary region. We will set up pgbench to run for 30 minutes, report progress every 10 seconds, and store the results in a file called results.log. We are using the database called mylab, which was created by the CloudFormation stack.

pgbench -M prepared -n -T 1800 -P 10 -c 10 -j 2 -b tpcb-like mylab > results.log
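
For readers who prefer to script the load test, the pgbench invocation above can be assembled from named parameters so that the meaning of each flag is explicit. This is only a minimal sketch: the helper name build_pgbench_command and its parameter names are hypothetical and not part of the lab, and the defaults simply mirror the values used in this section.

```python
import shlex


def build_pgbench_command(dbname, duration_s=1800, progress_s=10,
                          clients=10, threads=2):
    """Return the pgbench argument list used in this lab (hypothetical helper)."""
    return [
        "pgbench",
        "-M", "prepared",        # query mode: use prepared statements
        "-n",                    # skip the vacuum step before the run
        "-T", str(duration_s),   # run for this many seconds
        "-P", str(progress_s),   # print a progress report every N seconds
        "-c", str(clients),      # number of concurrent client sessions
        "-j", str(threads),      # number of pgbench worker threads
        "-b", "tpcb-like",       # built-in TPC-B-like script
        dbname,
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for the lab's mylab database.
    print(shlex.join(build_pgbench_command("mylab")) + " > results.log")
```

Changing duration_s or clients is then a one-argument change rather than an edit to a long command line.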

In the RDS service console in the primary region, click the aupg-labs-cluster (Primary) Aurora cluster and switch to the Monitoring tab. You will see a combined view of both the writer and reader DB instances in that cluster. We are not using the reader at this time; the load is directed only to the writer. Navigate through the metrics, specifically reviewing CPU Utilization, DB Connections, Replica Lag, and Commit Throughput, and notice that they are fairly stable after the initial spike caused by the pgbench tool populating an initial data set. Take another minute to review some of the other metrics to get a good feel for CloudWatch.

Next, we will focus on our secondary DB cluster. You will create a CloudWatch dashboard to monitor three key metrics relevant to global clusters, and to secondary DB clusters in particular, as listed below:

CloudWatch Metric Name          | Description
--------------------------------+---------------------------------------------------------------------------------------------
AuroraGlobalDBReplicatedWriteIO | The number of write I/O operations replicated to the secondary region
AuroraGlobalDBDataTransferBytes | The amount of redo log data transferred to the secondary region, in bytes
AuroraGlobalDBReplicationLag    | How far, in milliseconds, the secondary region lags behind the writer in the primary region

Open the Amazon CloudWatch service console in the secondary AWS region.

Verify that you are using the intended secondary AWS region. Since you are going to be working in two different AWS regions in the subsequent steps, make sure you are always working in the correct region.

Click Dashboards on the left navigation bar and then click Create dashboard.

Let’s name our new dashboard auroralab-postgres-global, then click the Create dashboard button again.

Select the Number widget type and click Next.

In the Add metric graph screen, under the All Metrics tab, select RDS, and then select the metrics group named DBClusterIdentifier, SourceRegion.

You should now see a filtered metric named AuroraGlobalDBReplicationLag, with the SourceRegion column showing the name of the primary region of your global cluster. Select this metric using the checkbox.

The widget should now be added at the top with a sample of the lag time in milliseconds. Let’s further update the widget. Give it a friendly name by clicking the edit icon (pencil icon) and renaming the widget from Untitled Graph to Global DB Replication Lag (avg, 1min). Press the tick/check icon to submit your changes.

Click the Graphed metrics tab to further customize the view. In the Statistic column, select Average, and in the Period column, select 1 Minute.
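
To make the Statistic and Period settings concrete: with Average and a 1 minute period, each displayed value is the mean of the samples whose timestamps fall within the same 60-second window. The sketch below reproduces that kind of aggregation on made-up data. The function name average_per_period and the sample values are illustrative assumptions; this is not CloudWatch output, nor a description of how CloudWatch is implemented internally.

```python
from collections import defaultdict


def average_per_period(samples, period_s=60):
    """Group (epoch_seconds, value) samples into fixed windows and average each."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of its window.
        window_start = timestamp - (timestamp % period_s)
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical replication lag samples: (seconds, lag in milliseconds).
    lag_samples = [(0, 100), (30, 200), (60, 150), (90, 250), (120, 300)]
    for start, avg in average_per_period(lag_samples).items():
        print(f"window starting at {start}s: {avg:.1f} ms")
```

Choosing Maximum instead of Average would correspond to replacing the mean with max(vals), which is why the time series widget later in this lab plots both.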

Confirm your settings are similar to the example below, and then click Create widget.

You have now created your first widget. You can set the dashboard to auto-refresh at a fixed interval using the refresh menu at the top right.

Click Save dashboard to save your changes.

You can add widgets individually to the dashboard, to build a more complete monitoring dashboard. However, to save some time we will simply update the source of the dashboard with the below JSON specification.

First, click the Actions dropdown on the dashboard, and choose View/edit source.

In the text box that appears on the screen, paste the following JSON code, replacing its existing contents. If you used a different DB cluster identifier for the secondary DB cluster than the one indicated in this lab guide, or your secondary region is not us-east-1, you will have to update those values in the JSON below.

{
    "widgets": [
        {
            "type": "metric",
            "x": 0,
            "y": 3,
            "width": 24,
            "height": 6,
            "properties": {
                "metrics": [
                    [ "AWS/RDS", "AuroraGlobalDBReplicationLag", "DBClusterIdentifier", "auroralab-postgres-secondary" ],
                    [ "...", { "stat": "Maximum" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "title": "Global DB Replication Lag (max vs. avg, 1min)",
                "stat": "Average",
                "period": 60
            }
        },
        {
            "type": "metric",
            "x": 0,
            "y": 0,
            "width": 9,
            "height": 3,
            "properties": {
                "metrics": [
                    [ "AWS/RDS", "AuroraGlobalDBReplicationLag", "DBClusterIdentifier", "auroralab-postgres-secondary" ]
                ],
                "view": "singleValue",
                "region": "us-east-1",
                "title": "Global DB Replication Lag (avg, 1min)",
                "stat": "Average",
                "period": 60
            }
        },
        {
            "type": "metric",
            "x": 9,
            "y": 0,
            "width": 15,
            "height": 3,
            "properties": {
                "metrics": [
                    [ "AWS/RDS", "AuroraGlobalDBReplicatedWriteIO", "DBClusterIdentifier", "auroralab-postgres-secondary", { "label": "Global DB Replicated Write IOs" } ],
                    [ ".", "AuroraGlobalDBDataTransferBytes", ".", ".", { "label": "Global DB DataTransfer Bytes" } ]
                ],
                "view": "singleValue",
                "region": "us-east-1",
                "stat": "Sum",
                "period": 86400,
                "title": "Billable Replication Metrics (aggregate, last 24 hr)"
            }
        }
    ]
}
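
If your secondary DB cluster identifier or region differs from the values in the JSON above, one option is to generate the dashboard source instead of editing it by hand. The following is a minimal sketch that rebuilds the same three widgets from two parameters. The function name lag_dashboard_body is hypothetical, and the script only prints JSON for you to paste into the View/edit source text box; it does not call any AWS API.

```python
import json


def lag_dashboard_body(cluster_id, region):
    """Build the three-widget dashboard source for a secondary DB cluster."""
    def metric(name, *extra):
        return ["AWS/RDS", name, "DBClusterIdentifier", cluster_id, *extra]

    def widget(x, y, width, height, properties):
        # Every widget in this lab points at the same (secondary) region.
        return {"type": "metric", "x": x, "y": y, "width": width,
                "height": height, "properties": {**properties, "region": region}}

    lag = "AuroraGlobalDBReplicationLag"
    return {"widgets": [
        widget(0, 3, 24, 6, {
            "metrics": [metric(lag), ["...", {"stat": "Maximum"}]],
            "view": "timeSeries", "stacked": False,
            "title": "Global DB Replication Lag (max vs. avg, 1min)",
            "stat": "Average", "period": 60}),
        widget(0, 0, 9, 3, {
            "metrics": [metric(lag)],
            "view": "singleValue",
            "title": "Global DB Replication Lag (avg, 1min)",
            "stat": "Average", "period": 60}),
        widget(9, 0, 15, 3, {
            "metrics": [
                metric("AuroraGlobalDBReplicatedWriteIO",
                       {"label": "Global DB Replicated Write IOs"}),
                [".", "AuroraGlobalDBDataTransferBytes", ".", ".",
                 {"label": "Global DB DataTransfer Bytes"}]],
            "view": "singleValue", "stat": "Sum", "period": 86400,
            "title": "Billable Replication Metrics (aggregate, last 24 hr)"}),
    ]}


if __name__ == "__main__":
    body = lag_dashboard_body("auroralab-postgres-secondary", "us-east-1")
    print(json.dumps(body, indent=4))
```

The printed JSON can be pasted into the text box in place of the hand-edited version.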

Click Update to change the dashboard.

Click Save dashboard to make sure the new changes are saved.

You now have a dashboard that monitors the Global DB replication lag along with some billable Global DB metrics, all in one place.

Monitoring Aurora PostgreSQL-based Aurora Global Databases

The Aurora PostgreSQL-based Aurora Global Database also provides aurora_global_db_status and aurora_global_db_instance_status functions to monitor the replication between the primary and secondary Regions.

Only Aurora PostgreSQL supports the aurora_global_db_status and aurora_global_db_instance_status functions.

To monitor an Aurora PostgreSQL-based global database

The following commands assume the primary region is us-west-2. Make sure to use the appropriate DB cluster endpoint.

  1. Connect to the global database primary cluster endpoint using psql installed on Cloud9.
psql
  2. Use the aurora_global_db_status function to list the primary and secondary volumes. This shows the lag times of the global database secondary DB clusters.
mylab=> select * from aurora_global_db_status();
 aws_region | highest_lsn_written | durability_lag_in_msec | rpo_lag_in_msec | last_lag_calculation_time  | feedback_epoch | feedback_xmin 
------------+---------------------+------------------------+-----------------+----------------------------+----------------+---------------
 us-west-2  |           230707749 |                     -1 |              -1 | 1970-01-01 00:00:00+00     |              0 |             0
 us-east-1  |           230707746 |                    243 |               0 | 2021-04-28 03:19:41.794+00 |              0 |         65730

Here, the most important business metric from an availability standpoint is rpo_lag_in_msec. This lag is the time difference between the most recent user transaction commit stored on a secondary DB cluster and the most recent user transaction commit stored on the primary DB cluster.

  3. Use the aurora_global_db_instance_status function to list all secondary DB instances for both the primary DB cluster and secondary DB clusters.
mylab=> select * from aurora_global_db_instance_status();
          server_id           |              session_id              | aws_region | durable_lsn | highest_lsn_rcvd | feedback_epoch | feedback_xmin | oldest_read_view_lsn | visibility_lag_in_msec 
------------------------------+--------------------------------------+------------+-------------+------------------+----------------+---------------+----------------------+------------------------
 aupg-labs-node-02            | MASTER_SESSION_ID                    | us-west-2  |   230707969 |                  |                |               |                      |                       
 aupg-labs-node-01            | 99855f36-af51-4a15-b624-a343a5b37baf | us-west-2  |   230707965 |        230707969 |              0 |         65789 |            230707951 |                     17
 auroralab-postgres-instance1 | 05eb93ef-4231-4164-971f-e10383087fe8 | us-east-1  |   230707947 |        230707951 |              0 |         65785 |            230707940 |                     16
 

For more details on what each column means, refer to Monitoring Aurora PostgreSQL-based Aurora global databases.
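
The rpo_lag_in_msec column returned by aurora_global_db_status lends itself to a simple automated check. The sketch below flags secondary regions whose RPO lag exceeds a threshold. It assumes the query rows have already been fetched into dictionaries keyed by column name (how you fetch them, for example with a PostgreSQL driver, is left out). The function name lagging_regions and the 1000 ms threshold are arbitrary illustrations, not recommendations from the lab.

```python
def lagging_regions(status_rows, max_rpo_lag_ms):
    """Return secondary regions whose rpo_lag_in_msec exceeds the threshold."""
    lagging = []
    for row in status_rows:
        rpo_lag = row["rpo_lag_in_msec"]
        # In the sample output above the primary region reports -1; skip such rows.
        if rpo_lag < 0:
            continue
        if rpo_lag > max_rpo_lag_ms:
            lagging.append(row["aws_region"])
    return lagging


if __name__ == "__main__":
    # Rows copied from the sample aurora_global_db_status() output in this lab.
    rows = [
        {"aws_region": "us-west-2", "durability_lag_in_msec": -1, "rpo_lag_in_msec": -1},
        {"aws_region": "us-east-1", "durability_lag_in_msec": 243, "rpo_lag_in_msec": 0},
    ]
    print(lagging_regions(rows, max_rpo_lag_ms=1000))
```

With the sample rows, no region exceeds the threshold, so the list is empty. The same pattern could be applied to visibility_lag_in_msec from aurora_global_db_instance_status.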

Let’s move on to the next section.