TL;DR The Multi-AZ RDS automatic failover mechanism is measured. RDS downtime, scheduled or not, is confirmed to be around 2 minutes, as per AWS documentation, while Single-AZ downtime turns out to be much longer (~10 minutes). An uncommon stuck-read-replica problem is reported and debugged. Finally, the AWS Management Console reflects changes in DB state with a noticeable latency.
So the time has come when the core of your infrastructure, your database instances, has to be resized. Unlike with other AWS products, such an operation cannot be performed without downtime, as the database is by definition the single, stateful source of truth, and therefore zero-downtime procedures such as Elastic Beanstalk rolling deployments and rolling configuration updates cannot be applied.
As a consequence, in order to supervise the failover procedure without panic and accurately predict the production downtime, it is important to simulate and time the whole process.
What AWS says
In the event of a planned or unplanned outage of your DB instance, Amazon RDS automatically switches to a standby replica in another Availability Zone if you have enabled Multi-AZ. The time it takes for the failover to complete depends on the database activity and other conditions at the time the primary DB instance became unavailable. Failover times are typically 60-120 seconds. However, large transactions or a lengthy recovery process can increase failover time. When the failover is complete, it can take additional time for the RDS console UI to reflect the new Availability Zone. The failover mechanism automatically changes the DNS record of the DB instance to point to the standby DB instance.
Sweet. We then want to induce a failover in order to be prepared for what will happen in production. We start by restoring a Multi-AZ deployment from a snapshot. Little to no traffic streams in and out of the testing DB, but no significant difference was detected between test and production downtime during failover (the data below refer to the actual production resizing procedure).
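The restore can be done from the console or through the API; a minimal boto3 sketch, where the region, identifiers and instance class are all placeholders:

```python
# restore_test_db.py - restore a Multi-AZ test instance from an existing snapshot.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # region is an assumption

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="mydb-failover-test",  # placeholder name for the test copy
    DBSnapshotIdentifier="mydb-snapshot",       # placeholder snapshot identifier
    DBInstanceClass="db.m4.large",              # placeholder instance class
    MultiAZ=True,                               # the whole point of the test
)
```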
Cool in action
We set up a naive TCP check to monitor the health of the RDS endpoint.
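A minimal sketch of such a check, assuming a PostgreSQL-style endpoint (hostname, port and polling interval are placeholders):

```python
# naive_tcp_check.py - poll the RDS endpoint with a plain TCP connect
# and log the timestamp of every success/failure.
import socket
import time
from datetime import datetime

HOST = "mydb.xxxxxxxxxxxx.eu-west-1.rds.amazonaws.com"  # hypothetical endpoint
PORT = 5432       # PostgreSQL port; use 3306 for MySQL
TIMEOUT = 2       # seconds before a connect attempt is considered failed
INTERVAL = 1      # seconds between polls

while True:
    try:
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            status = "OK"
    except OSError:
        status = "FAIL"
    print(f"{datetime.now().isoformat()} {status}")
    time.sleep(INTERVAL)
```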
We then modify the instance using the AWS console (selecting Apply Now) and start the process. The database instance state in the AWS console soon changes to Modifying…
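The same modification can also be triggered through the API; a boto3 sketch with placeholder identifier, instance class and region (we used the console, not this script):

```python
# resize_rds.py - apply a new instance class immediately, the API equivalent
# of ticking "Apply Now" in the console.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # region is an assumption

rds.modify_db_instance(
    DBInstanceIdentifier="mydb",     # placeholder identifier
    DBInstanceClass="db.m4.xlarge",  # placeholder target class
    ApplyImmediately=True,           # do not wait for the maintenance window
)
```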
We sit back and watch.
We can see that the RDS endpoint still replies. Another simple script, not included in the post, also monitors DB health at the application level (SELECT NOW()) and shows the same behaviour described here.
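A sketch of what such an application-level check might look like, assuming a PostgreSQL engine and psycopg2 (endpoint and credentials are placeholders):

```python
# app_level_check.py - verify the database actually answers queries,
# not just TCP connections, by running SELECT NOW() in a loop.
import time
from datetime import datetime

import psycopg2

# hypothetical endpoint and credentials
DSN = ("host=mydb.xxxxxxxxxxxx.eu-west-1.rds.amazonaws.com "
       "dbname=mydb user=monitor password=secret connect_timeout=2")

while True:
    try:
        conn = psycopg2.connect(DSN)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT NOW()")
                (db_time,) = cur.fetchone()
            status = f"OK db_time={db_time}"
        finally:
            conn.close()
    except psycopg2.Error:
        status = "FAIL"
    print(f"{datetime.now().isoformat()} {status}")
    time.sleep(1)
```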
… After more than 6 minutes, the TCP check fails …
Finally, 2 minutes and 13 seconds after the first failed poll, we regain connectivity (at the application level too).
The six minutes before the endpoint failure suggest that, in a Multi-AZ deployment, instance modifications are not applied to the primary instance but to the standby, most probably by shutting down the standby's virtual machine and restarting it with the modified resources.
Once the standby instance is up and running, and more importantly replicating from the primary, the primary instance can be marked as unhealthy, triggering the failover process. As the documentation says, failover duration can depend on the database state, but in any case it leaves clients without access to the database. At this point, the system is ready to promote the newly resized standby instance as the new primary. Assuming that the final state of the failing ex-primary has been successfully replicated to the ex-standby, RDS then points the DNS endpoint to the promoted instance, which starts serving the reconnecting clients.
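The DNS swap can be observed from the client side by resolving the endpoint periodically and logging when the returned address changes; a small sketch (the endpoint name is a placeholder, and local DNS caching may delay what the client actually sees):

```python
# watch_dns.py - resolve the RDS endpoint in a loop and report when the
# underlying IP address changes, i.e. when the failover flips the CNAME.
import socket
import time
from datetime import datetime

HOST = "mydb.xxxxxxxxxxxx.eu-west-1.rds.amazonaws.com"  # hypothetical endpoint
last_ip = None

while True:
    try:
        ip = socket.gethostbyname(HOST)
    except socket.gaierror:
        ip = None
    if ip != last_ip:
        print(f"{datetime.now().isoformat()} endpoint now resolves to {ip}")
        last_ip = ip
    time.sleep(5)
```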
…
We repeated the whole process a few times, resizing instances of different types/sizes, and we can confirm that the downtime lasts on average a little more than 2 minutes.
The not so cool things
- During the tests, it was evident that the AWS Management Console lagged behind the actual state of the instance. Most notably, up to ~10 minutes after the failover had completed successfully, i.e. after clients could already query the endpoint of the new instance, the console would still report the instance as ‘Modifying’ (see the status-polling sketch after this list).
- At times, replica instances attached to the master node, if any, can lose the master after a failover. It happened only once in around ten test/live resizings, actually during the production resize. Contacting AWS support solved the issue. However, this is a known issue, and a workaround is suggested to minimize the likelihood of a lost replica, at the cost of a probable performance penalty.
- Quite obviously, modifying a Single-AZ instance takes much longer, as there is no standby to absorb the work: the downtime sums the time to apply the modification and the time to reboot the instance. A rough estimate is about 10 minutes.
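On the first point, the instance status can also be polled through the RDS API rather than the console; a minimal boto3 sketch (identifier and region are placeholders, and whether the API view is fresher than the console is not something we measured):

```python
# watch_status.py - poll DBInstanceStatus through the RDS API instead of
# relying on the console view.
import time
from datetime import datetime

import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # region is an assumption

while True:
    resp = rds.describe_db_instances(DBInstanceIdentifier="mydb")  # placeholder
    status = resp["DBInstances"][0]["DBInstanceStatus"]
    print(f"{datetime.now().isoformat()} {status}")
    time.sleep(15)
```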
Conclusions
AWS RDS offers a great and seamless service when it comes to high availability: the possibility of a one-click, automatic and relatively short downtime to recover from failures or to scale database instances up and down is extremely appealing to anyone who has had to put up with resizing database servers, replication failures, and the like. No more long cold hours at night in data centers.