Amazon RDS is one of the best MySQL-based DBaaS offerings, provided by AWS. It provides high availability, resizable capacity, and consistent performance to your applications. To take full advantage of RDS, you need to design and operate it according to best practices.
In the last two months, many security issues were identified in Linux, hypervisors, and MySQL, and they affected Amazon's infrastructure too. To mitigate them, Amazon needs to perform maintenance on the EC2 instances underlying RDS MySQL and to patch MySQL to supported versions. These activities affect the availability of the RDS instance during the maintenance window.
Amazon keeps adding new features to its existing services to provide top-notch solutions to its customers. Recently, it introduced General Purpose (SSD) and Provisioned IOPS (SSD) storage volumes for RDS instances to deliver fast, predictable, and consistent performance for I/O-intensive transactional database workloads. It also introduced other new features such as memory-optimized DB instances, pre-warming of the InnoDB buffer pool on reboot, and more. Yet again, you need to reboot your RDS instances to take advantage of these new features.
To mitigate the risk of RDS instance unavailability during the maintenance window, some good practices come in handy. Let’s see how to apply them:
1. Turn on Multi-AZ mode. This is the first and foremost thing to do to improve availability. It enables the built-in automated failover from your primary database to a synchronously replicated secondary database in case of a failure, a reboot, or any maintenance activity.
2. Enable Event Subscriptions to get notifications of all the events happening on the RDS instance.
When you subscribe to events, the event details are delivered to the notification email addresses you provide.
3. Enable CloudWatch metrics on RDS Instances to monitor the replication status between the Master and Read-Replicas. Replication may fail because of changes to the master RDS instance or DB Instance shutdown, so it’s good to have this feature on.
4. Check the RDS DB instance's reachability, memory usage, and number of DB connections to confirm that it is accepting connections.
5. In Multi-AZ mode, RDS might take some 30 to 300 seconds to switch to the failover node, so notify the respective stakeholders about the maintenance activity and the approximate downtime.
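As a rough illustration of steps 1 to 3, the same settings can be applied with the AWS SDK for Python (boto3). This is a minimal sketch, not a definitive setup: the helper names, the 60-second lag threshold, and the identifiers you pass in are assumptions, and it expects an existing SNS topic to which your notification email addresses are already subscribed.

```python
def multi_az_request(instance_id, apply_immediately=False):
    """Arguments for rds.modify_db_instance that turn on Multi-AZ (step 1)."""
    return {
        "DBInstanceIdentifier": instance_id,
        "MultiAZ": True,
        # False defers the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


def event_subscription_request(name, sns_topic_arn, instance_id):
    """Arguments for rds.create_event_subscription (step 2)."""
    return {
        "SubscriptionName": name,
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "db-instance",
        "SourceIds": [instance_id],
        # A subset of the RDS event categories; adjust to your needs.
        "EventCategories": ["availability", "failover", "failure", "maintenance"],
        "Enabled": True,
    }


def replica_lag_alarm_request(replica_id, sns_topic_arn, max_lag_seconds=60):
    """Arguments for cloudwatch.put_metric_alarm on ReplicaLag (step 3)."""
    return {
        "AlarmName": f"{replica_id}-replica-lag",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": float(max_lag_seconds),  # assumed threshold, in seconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def apply(region, instance_id, replica_id, sns_topic_arn):
    """Send the three requests to AWS. Needs boto3 and valid credentials."""
    import boto3  # third-party; imported here so the helpers work without it

    rds = boto3.client("rds", region_name=region)
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    rds.modify_db_instance(**multi_az_request(instance_id))
    rds.create_event_subscription(
        **event_subscription_request(f"{instance_id}-events", sns_topic_arn, instance_id)
    )
    cloudwatch.put_metric_alarm(**replica_lag_alarm_request(replica_id, sns_topic_arn))
```

Keeping the request builders separate from `apply` lets you review or log the exact arguments before anything is changed on a production instance.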
This is the minimum set of things that you should enable to deal with the AWS RDS maintenance and minimize your downtime. Nevertheless, additional hints and best practices should be deployed to further increase both the availability and performance of your infrastructure:
- Read-Replica in Cross-Region: create a cross-region read replica to maximize availability. If the primary region suffers an outage, you can promote the read replica to a master instance and keep the database available.
- General Purpose (SSD) storage for Consistent Performance: for small and medium database workloads, modify the Storage Type to the General Purpose (SSD) Storage for consistent IOPS delivery for Database operations.
- Change the Instance type to Current Generation: change the RDS instance type to one of the Current Generation instance types (T2, M3, R3) as per your workload requirements. The newer generation instances offer better RAM, CPU, and networking capabilities than the previous generation instance types T1, M1, C1, M2, etc.
- Tune the MySQL Parameters of RDS as you Scale: when you scale your RDS instance up or down, many parameters depend on the instance's IOPS, memory, CPU, and networking. Tune them accordingly; otherwise, the RDS instance will perform poorly.
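To make the last point concrete, here is a small sketch of how memory-dependent parameters move with the instance size. It assumes the formulas RDS MySQL uses in its default parameter groups, `{DBInstanceClassMemory*3/4}` for `innodb_buffer_pool_size` and `{DBInstanceClassMemory/12582880}` for `max_connections`. The function names are hypothetical, and `DBInstanceClassMemory` is somewhat lower than the instance's nominal RAM, so treat the numbers as estimates.

```python
GIB = 1024 ** 3


def estimate_memory_parameters(db_instance_class_memory_bytes):
    """Estimate values produced by the default-style RDS formulas."""
    return {
        # {DBInstanceClassMemory*3/4}
        "innodb_buffer_pool_size": db_instance_class_memory_bytes * 3 // 4,
        # {DBInstanceClassMemory/12582880}
        "max_connections": db_instance_class_memory_bytes // 12582880,
    }


def parameter_group_request(group_name):
    """Arguments for rds.modify_db_parameter_group (boto3).

    Storing formulas instead of fixed numbers lets the values follow the
    instance class when you scale up or down.
    """
    return {
        "DBParameterGroupName": group_name,
        "Parameters": [
            {
                "ParameterName": "innodb_buffer_pool_size",
                "ParameterValue": "{DBInstanceClassMemory*3/4}",
                "ApplyMethod": "pending-reboot",
            },
            {
                "ParameterName": "max_connections",
                "ParameterValue": "{DBInstanceClassMemory/12582880}",
                "ApplyMethod": "pending-reboot",
            },
        ],
    }


if __name__ == "__main__":
    # Compare an assumed 8 GiB and 16 GiB of memory available to the DB.
    for gib in (8, 16):
        print(gib, "GiB:", estimate_memory_parameters(gib * GIB))
```

If you have hard-coded values in a custom parameter group, they stay fixed when the instance class changes, which is the typical source of the poor performance mentioned above.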