Galera Cluster to AWS Aurora migration & HA_ERR_FOUND_DUPP_KEY

In this post we will look at a case study of a Galera Cluster migration to AWS Aurora, and a quick solution to the replication issue we hit along the way.

A friend received the following error in a master-master replication setup:

Could not execute Write_rows event on table _database._table; Duplicate entry '65eJ8RmzASppBuQD2Iz73AAy8gPKIEmP-2018-08-03 08:30:03' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin-changelog.000010, end_log_pos 1107814

We talked about it, and I immediately started rattling off the cons of writing on both masters, how to handle conflicts, the role of the applications, and so on. I pointed him to the duplicate key error on the primary key:

Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY

He intervened and revealed that it was a Galera Cluster replicating to Aurora, and that he was not writing to Aurora at all.

He was actually migrating his Galera Cluster to Aurora, and the master-master setup was there to support his rollback plan. Consider the following diagram for ease of understanding:

[Figure: Galera to AWS migration – sample architecture]

Okay, so we had replication errors on the glr-3 node. As a first check, we made sure the Aurora instance was not being written to. We then looked into the binary log to confirm the source, i.e. the server-id of the events being replicated.

Download the Aurora binary log and review it:

$] mysqlbinlog --read-from-remote-server --host=aurora.cluster-xxxx.us-east-1.rds.amazonaws.com --port=3306 --user=mysql_user --password=mysql_pass --result-file=/tmp/mysql-bin-changelog.000010 mysql-bin-changelog.000010

$] less /tmp/mysql-bin-changelog.000010
...
#180803 8:03:41 server id 1 end_log_pos 2279 CRC32 0x3e912bce Xid = 42555
COMMIT/*!*/;
# at 2279
#180803 8:03:41 server id 1 end_log_pos 2347 CRC32 0xd193b55a Query thread_id=7214612 exec_time=722 error_code=0
...
#180803 8:03:54 server id 2 end_log_pos 10455 CRC32 0xa4dca5a2 Query thread_id=986300 exec_time=709 error_code=0
SET TIMESTAMP=1533283434/*!*/;
[Figure: Galera to Aurora – replication error]

That revealed a second server-id, which gave us another pointer: all the nodes of the Galera Cluster were being written to. Of course, then, this is bound to happen: when MySQL sees a server-id different from its own, it applies the changes, eventually resulting in duplicate key errors.

Asynchronous replication in MySQL relies on server-id to determine the source of each event. This is particularly important for avoiding endless loops in master-master or circular replication: a replica skips relayed events that carry its own server-id.
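For reference, a node's replication identity is just the server_id system variable, which you can inspect on each node (output below is illustrative; a replica ignores events bearing its own server-id unless replicate-same-server-id is enabled):

mysql> SHOW VARIABLES LIKE 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 3     |
+---------------+-------+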

Now, in this scenario, we found that all the nodes of the Galera Cluster had unique server-ids and all of them were being written to. Thus the binary log had entries from all the nodes, while only one node was participating in the master-master replication with Aurora.

Since the changes from server-ids 2 and 3 had already been applied via Galera replication, we were getting duplicate key errors.
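A quick way to confirm this is to count the distinct server-ids in the binlog dump we downloaded earlier (counts below are illustrative):

$] grep -o 'server id [0-9]*' /tmp/mysql-bin-changelog.000010 | sort | uniq -c
   8214 server id 1
   7968 server id 2
   7711 server id 3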

So, here I thought of two solutions:
1. Write on only one node (and replicate from there).
2. Use the same server-id on all the Galera nodes.

A third option could have been disabling log-slave-updates, but that is not possible on an Aurora cluster (or, for that matter, on any server that has further chained replication, since its downstream replicas would then miss the replicated changes). Also, note that this is a problem only because we are setting up master-master replication; had it been plain master-slave, we would never have noticed it.

[Figure: log-slave-updates]
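You can verify the setting the usual way (output below is illustrative; as noted above, on Aurora it cannot simply be switched off):

mysql> SHOW VARIABLES LIKE 'log_slave_updates';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| log_slave_updates | ON    |
+-------------------+-------+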

1. Write on only one Galera node and replicate from there.
This can be configured in MaxScale or any load balancer able to route all write traffic to a single node, though that option was turned down.
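For illustration, here is a minimal MaxScale sketch of such routing (server names, addresses, and credentials below are hypothetical): the galeramon monitor designates a single node as the write master, and the readwritesplit router sends all writes there.

# maxscale.cnf -- minimal sketch; names and addresses are hypothetical
[glr-1]
type=server
address=10.0.0.1
port=3306

[glr-2]
type=server
address=10.0.0.2
port=3306

[glr-3]
type=server
address=10.0.0.3
port=3306

# galeramon picks one node as "master"; writes are routed only there
[Galera-Monitor]
type=monitor
module=galeramon
servers=glr-1,glr-2,glr-3
user=maxscale_user
password=maxscale_pass

[RW-Split-Service]
type=service
router=readwritesplit
servers=glr-1,glr-2,glr-3
user=maxscale_user
password=maxscale_pass

[RW-Split-Listener]
type=listener
service=RW-Split-Service
protocol=MariaDBClient
port=4006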

2. Use the same server-id on all Galera nodes.
Galera uses certification-based replication: before a transaction commits, a certification test is performed on each node to determine whether the write-set (the changes) can be applied there. If certification fails, the write-set is dropped and the cluster rolls back the transaction.
All of this is keyed on a GTID, which comprises a UUID (the dataset/cluster ID) and a sequence number.

Sample: kdar8582-r271-00d2-8582-1jdgnv05e21d:8518

where kdar8582-r271-00d2-8582-1jdgnv05e21d is the dataset-id and 8518 is the sequence number.
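On a live cluster, both pieces are visible in the wsrep status variables (values below reuse the sample above):

mysql> SHOW STATUS LIKE 'wsrep_cluster_state_uuid';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_state_uuid | kdar8582-r271-00d2-8582-1jdgnv05e21d |
+--------------------------+--------------------------------------+

mysql> SHOW STATUS LIKE 'wsrep_last_committed';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| wsrep_last_committed | 8518  |
+----------------------+-------+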

You can read more about Galera replication on its documentation page. My idea in covering some of the basics here was to show that server-id plays no role in Galera replication, so we can safely configure the same server-id on each Galera Cluster node. That removes the multiple server-ids from the binary log and hence tackles the duplicate key errors in master-master replication with the Aurora cluster.
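As a minimal sketch (paths and values below are assumptions for this setup), the relevant my.cnf stanza would then be identical on glr-1, glr-2, and glr-3, except for the node-specific wsrep settings:

# /etc/my.cnf -- effectively the same on every Galera node
[mysqld]
server-id         = 1         # deliberately identical on all three nodes
log-bin           = mysql-bin
log-slave-updates = 1         # glr-3 must binlog Galera write-sets for Aurora
binlog-format     = ROW       # Galera requires ROW format anyway
wsrep_node_name   = glr-1     # node-specific; this one stays unique per node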

My friend went that way and completed the migration successfully. Do share in the comments if you have more information on using the same server-id on all Galera nodes and its effects.

Hope this helps.
