In this post we will look at a case study of a Galera Cluster migration to AWS Aurora and a quick solution to the replication issue that came up.
A friend received the following error in a master-master replication setup:
Could not execute Write_rows event on table _database._table; Duplicate entry '65eJ8RmzASppBuQD2Iz73AAy8gPKIEmP-2018-08-03 08:30:03' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin-changelog.000010, end_log_pos 1107814
We talked about it and I immediately started blabbering about the cons of writing to both masters, how to handle them, the role of the applications, and so on. I pointed him to the duplicate key error on the primary key:
Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY
He intervened and revealed that it was a Galera Cluster replicating to Aurora, and that he was not writing to Aurora at all.
He was actually migrating his Galera Cluster to Aurora, and the master-master setup was there to support his rollback plan. Consider the following diagram for simplicity:
Okay, so we had replication errors on the glr-3 node. As the next check, we made sure the Aurora instance was not being written to. We then looked into the binary log to confirm the source, i.e. the server-id, of the events being replicated.
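As a rough illustration of those checks (host names and credentials are placeholders, not the actual setup), the error on glr-3 and the absence of application writes on Aurora could be confirmed like this:
$] # On glr-3: confirm it is the SQL thread failing, and with which error
$] mysql -h glr-3 -u mysql_user -pmysql_pass -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Host|Slave_SQL_Running|Last_SQL_Errno|Last_SQL_Error'
$] # On Aurora: confirm nothing besides the replication user is connected and writing
$] mysql -h aurora.cluster-xxxx.us-east-1.rds.amazonaws.com -u mysql_user -pmysql_pass -e "SHOW FULL PROCESSLIST;"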
Download the Aurora binary log and review it:
$] mysqlbinlog --read-from-remote-server --host=aurora.cluster-xxxx.us-east-1.rds.amazonaws.com \
       --port=3306 --user=mysql_user --password=mysql_pass \
       --result-file=/tmp/mysql-bin-changelog.000010 mysql-bin-changelog.000010
$] less /tmp/mysql-bin-changelog.000010
...
#180803  8:03:41 server id 1  end_log_pos 2279 CRC32 0x3e912bce  Xid = 42555
COMMIT/*!*/;
# at 2279
#180803  8:03:41 server id 1  end_log_pos 2347 CRC32 0xd193b55a  Query  thread_id=7214612  exec_time=722  error_code=0
...
#180803  8:03:54 server id 2  end_log_pos 10455 CRC32 0xa4dca5a2  Query  thread_id=986300  exec_time=709  error_code=0
SET TIMESTAMP=1533283434/*!*/;
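As a quick side note (a small helper I would add, not part of the original troubleshooting), listing every distinct server-id present in the downloaded binlog is a one-liner:
$] # Count events per originating server-id in the decoded binlog text
$] grep -oE 'server id [0-9]+' /tmp/mysql-bin-changelog.000010 | sort | uniq -c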
That revealed a second server-id, which gave another pointer: all the nodes of the Galera Cluster were being written to. Of course this was bound to happen: when the replica sees a server-id different from its own, it applies those changes, eventually resulting in duplicate key errors.
Asynchronous replication in MySQL depends on the server-id to determine the origin of each replicated event. This is particularly important to avoid endless loops in master-master or circular replication: by default, a server skips replicated events that carry its own server-id and applies everything else.
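As a minimal sketch (values are illustrative, not taken from this setup), each participant in asynchronous replication normally carries a distinct server-id in my.cnf, and it is the "skip my own id" rule that breaks the loop:
# /etc/my.cnf on one replication participant (illustrative values)
[mysqld]
server-id = 1            # must be unique among async replication participants
log-bin   = mysql-bin
# replicate-same-server-id defaults to 0: events carrying this server's own id
# are skipped by the SQL thread, preventing endless master-master loops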
Now in this scenario, it turned out that all the nodes of the Galera Cluster had unique server-ids and all of them were being written to. Thus the binary log had entries from all the nodes, while only one node was participating in the master-master replication with Aurora. Since the changes for server-ids 2 and 3 had already been applied by Galera replication, we were getting duplicate key errors.
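Something along these lines (the glr-1/glr-2 host names are assumed from the diagram, credentials are placeholders) makes the layout obvious by printing the server-id of every Galera node:
$] for h in glr-1 glr-2 glr-3; do
       mysql -h "$h" -u mysql_user -pmysql_pass -N -e "SELECT @@hostname, @@server_id;"
   done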
So, here I thought of two solutions:
1. Write on only one node (and replicate from there).
2. Use the same server-id on all the Galera nodes.
A third option could have been disabling log-slave-updates, but that is not possible on an Aurora cluster (or, really, on any server that has further chained replication). Also, consider that this is a problem only because we are setting up master-master replication; had it been master-slave, we would not have noticed it.
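For completeness, whether a server logs replicated changes is exposed by the standard log_slave_updates variable; checking it (shown here against the Aurora endpoint purely as an illustration, since the point is that we cannot change it there) looks like:
$] mysql -h aurora.cluster-xxxx.us-east-1.rds.amazonaws.com -u mysql_user -pmysql_pass -e "SHOW GLOBAL VARIABLES LIKE 'log_slave_updates';"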
1. Write to only one Galera node and replicate from there.
This can be configured in MaxScale or any other load balancer so that all write traffic is routed to a single node, though that option was turned down.
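For reference, here is a minimal sketch of what that could have looked like in MaxScale; all section names, addresses and credentials below are assumptions, not the actual configuration. The galeramon monitor labels one Galera node as the write target, and readwritesplit sends every write there.
# /etc/maxscale.cnf (illustrative sketch only)
[glr-1]
type=server
address=10.0.0.1
port=3306

[glr-2]
type=server
address=10.0.0.2
port=3306

[glr-3]
type=server
address=10.0.0.3
port=3306

[Galera-Monitor]
type=monitor
module=galeramon
servers=glr-1,glr-2,glr-3
user=maxscale_monitor
password=maxscale_monitor_pass

[Split-Service]
type=service
router=readwritesplit
servers=glr-1,glr-2,glr-3
user=maxscale_svc
password=maxscale_svc_pass

[Split-Listener]
type=listener
service=Split-Service
protocol=MariaDBClient
port=4006
With this, applications connect to MaxScale on port 4006 and only the node galeramon marks as the master ever receives writes, which is exactly the "write on only one node" behaviour.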
2. Use the same server-id on all Galera nodes.
Galera uses certification-based replication. Before a transaction commits, a certification test is performed on each node of the cluster to determine whether or not the write-set (the changes) can be applied on every node. If certification fails, the write-set is dropped and the cluster rolls back the transaction.
All of this is driven by a GTID, which comprises a UUID (the dataset/cluster id) and a sequence number.
Sample: kdar8582-r271-00d2-8582-1jdgnv05e21d:8518
where kdar8582-r271-00d2-8582-1jdgnv05e21d is the dataset-id and 8518 is the sequence number.
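On a live cluster, the two halves of that GTID can be inspected (a small illustrative check, not part of the original write-up) through the wsrep status counters:
$] mysql -u mysql_user -pmysql_pass -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_state_uuid','wsrep_last_committed');"
# wsrep_local_state_uuid -> the cluster/state UUID, wsrep_last_committed -> the latest applied seqno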
You can read more about Galera replication on its documentation page. My point in covering some of the basics here was to show that server-id plays no role in Galera replication. We can safely configure the same server-id on each Galera Cluster node. That nullifies the multiple server-ids and hence tackles the duplicate key errors in the master-master replication with the Aurora cluster.
My friend went that way and completed the migration successfully. Consider sharing in the comments if you have more information on using the same server-id in Galera and its effects.
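A minimal sketch of what that could look like on each node (paths and values are assumptions; only the identity-related settings are shown):
# /etc/my.cnf on every Galera node (illustrative; note the identical server-id)
[mysqld]
server-id          = 1               # the same value on glr-1, glr-2 and glr-3
log-bin            = mysql-bin
log_slave_updates  = 1               # still needed so Galera-applied writes reach the binlog for Aurora
wsrep_node_name    = glr-3           # Galera identity stays unique per node (glr-1/glr-2 on the others)
wsrep_node_address = 10.0.0.3
After a rolling restart, every Galera-originated event lands in Aurora's binary log under that single server-id, so the replicating node skips it on the way back instead of re-applying it.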
Hope this helps.