
Galera cluster to AWS Aurora migration & HA_ERR_FOUND_DUPP_KEY

In this post we will look at a case study of a Galera Cluster migration to AWS Aurora and a quick solution to the replication issue that came up.

A friend received the following error in a master-master replication setup:

Could not execute Write_rows event on table _database._table; Duplicate entry '65eJ8RmzASppBuQD2Iz73AAy8gPKIEmP-2018-08-03 08:30:03' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin-changelog.000010, end_log_pos 1107814

We talked about it and I immediately started rattling off the cons of writing on both masters, how to handle them, the role of the applications, and so on. He intervened and revealed that it is a Galera Cluster replicating to Aurora, and that he is not writing on Aurora.

He was actually migrating his Galera Cluster to Aurora, and the master-master replication was there to support his rollback plan. Consider the following diagram for ease of understanding:

galera to aws migration – sample architecture

Okay, so we had replication errors on the glr-3 node. As a next check, we made sure the Aurora instance was not being written to. We then looked into the binary log to confirm the source, i.e. the server-id of the events being replicated.

Download the Aurora binary log and review it:

$] mysqlbinlog --read-from-remote-server --host=aurora.cluster-xxxx.us-east-1.rds.amazonaws.com --port=3306 --user=mysql_user --password=mysql_pass --result-file=/tmp/mysql-bin-changelog.000010 mysql-bin-changelog.000010

$] less /tmp/mysql-bin-changelog.000010
...
#180803 8:03:41 server id 1 end_log_pos 2279 CRC32 0x3e912bce Xid = 42555
COMMIT/*!*/;
# at 2279
#180803 8:03:41 server id 1 end_log_pos 2347 CRC32 0xd193b55a Query thread_id=7214612 exec_time=722 error_code=0
...
#180803 8:03:54 server id 2 end_log_pos 10455 CRC32 0xa4dca5a2 Query thread_id=986300 exec_time=709 error_code=0
SET TIMESTAMP=1533283434/*!*/;
Galera to Aurora – replication error

That revealed a second server-id, which gave another pointer: all the nodes of the Galera Cluster are being written to. Of course, this is bound to happen: when MySQL finds a server-id different from its own, it applies the changes, eventually resulting in duplicate key errors.
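A quick way to list every server-id present in the downloaded binlog is to count them in the decoded output (a sketch, using the result file from the mysqlbinlog command above):

$] grep -oE 'server id [0-9]+' /tmp/mysql-bin-changelog.000010 | sort | uniq -c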

Asynchronous replication in MySQL depends on the server-id to determine the source of each event: by default, a replica skips events carrying its own server-id and applies everything else. This is particularly important to avoid endless loops in master-master or circular replication.
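As a sanity check, you can compare the server-id configured on each node. The hostnames and credentials below are illustrative, following the diagram above:

$] for h in glr-1 glr-2 glr-3; do
     mysql -h "$h" -u mysql_user -pmysql_pass -N -e "SELECT @@hostname, @@server_id;"
   done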

Now in this scenario, it was found that all the nodes of the Galera Cluster had unique server-ids and all of them were being written to. Thus the binary log had entries from all the nodes, while only one node was participating in the master-master replication.

Since the changes for server-ids 2 and 3 had already been applied through Galera replication, we were getting duplicate key errors.

So, here I thought of two solutions:
1. Write on only one node (and replicate from there).
2. Use the same server-id on all the Galera nodes.

A third option could have been disabling log-slave-updates, but that is not possible on an Aurora cluster (or really on any server that has further chained replication). Also, consider that this is a problem only because we're setting up master-master replication; had it been master-slave, we'd not have noticed this.

log-slave-updates
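For reference, you can verify the setting on the Aurora instance itself; as far as I know, log_slave_updates is not a modifiable parameter in Aurora MySQL when binary logging is enabled:

mysql> SELECT @@log_slave_updates;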

1. Write on only one Galera node and replicate from there.
This can be arranged with MaxScale or any load balancer configured to route all write traffic to a single node, though that option was turned down.
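For completeness, here is a minimal sketch of how that could look with MaxScale: the galeramon monitor marks exactly one synced node as the master, and readwritesplit routes all writes to it. Addresses, credentials, and ports below are placeholders:

# /etc/maxscale.cnf (sketch)
[glr-1]
type=server
address=192.0.2.11
port=3306

[glr-2]
type=server
address=192.0.2.12
port=3306

[glr-3]
type=server
address=192.0.2.13
port=3306

[Galera-Monitor]
type=monitor
module=galeramon
servers=glr-1,glr-2,glr-3
user=maxscale_user
password=maxscale_pass

[RW-Split-Service]
type=service
router=readwritesplit
servers=glr-1,glr-2,glr-3
user=maxscale_user
password=maxscale_pass

[RW-Split-Listener]
type=listener
service=RW-Split-Service
protocol=MariaDBClient
port=4006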

2. Use the same server-id on all Galera nodes.
Galera uses certification-based replication. In a cluster, before a transaction commits, a certification test is performed to determine whether the write-set (the changes) can be applied on every node. If certification fails, the write-set is dropped and the cluster rolls back the transaction.
All of this happens based on a GTID, which comprises a UUID (the dataset-id / cluster-id) and a sequence number.

Sample: kdar8582-r271-00d2-8582-1jdgnv05e21d:8518

where kdar8582-r271-00d2-8582-1jdgnv05e21d is the dataset-id and 8518 is the sequence number.
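Both pieces are visible on any Galera node through the standard wsrep status variables:

mysql> SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_state_uuid', 'wsrep_last_committed');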

You can read more about Galera replication on the documentation page. My idea in covering some of the basics here was to show that the server-id has no role to play in Galera replication. We can therefore safely configure the same server-id on each Galera Cluster node. That nullifies the multiple server-ids and hence tackles the duplicate key errors in the master-master replication with the Aurora cluster.
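A minimal sketch of the change, in my.cnf on every Galera node (the value itself is arbitrary; it only has to match across the nodes and differ from Aurora's server-id):

# /etc/my.cnf — identical on glr-1, glr-2 and glr-3
[mysqld]
server-id = 1

Since server_id is a dynamic variable, it can also be applied at runtime with SET GLOBAL server_id = 1 while persisting the change in my.cnf.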

My friend went that way and completed the migration successfully. Do consider sharing in the comments if you have more information on "using the same server-id in Galera and its effects".

Hope this helps.


-- Kedar Vaijanapurkar --