21 Jan

Debugging Percona Xtradb (Galera) Cluster node startup / SST errors

We were working on setting up Percona XtraDB (Galera) Cluster following the Documentation for version 5.6+. This blog post is a use-case of a debugging / troubleshooting & fixing different SST issues we faced while bringing-up the cluster nodes.

We already have a successfully bootstrapped node-1 and now we’re adding second node (node-2) which was causing following error.

"2015-01-15 17:33:19 24618 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup --role 'joiner' --address '10.10.10.1' --auth 'sstuser:xxx' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '24618'  '' : 32 (Broken pipe)
2015-01-15 17:33:19 24618 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-01-15 17:33:19 24618 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2015-01-15 17:33:19 24618 [ERROR] Aborting"


Verified that the SST user has proper privileges and concluded to check on the donor side.
Checked “innobackup.backup.log” file on master which logs the xtrabackup output during SST:

[root@node-1 mysql]# tail -5 innobackup.backup.log

150115 17:33:19  innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_file=/etc/my.cnf;mysql_read_default_group=xtrabackup;mysql_socket=/var/lib/mysql/mysql.sock' as 'sstuser'  (using password: YES).
innobackupex: got a fatal error with the following stacktrace: at /usr//bin/innobackupex line 2990
        main::mysql_connect('abort_on_error', 1) called at /usr//bin/innobackupex line 1530
innobackupex: Error: Failed to connect to MySQL server as DBD::mysql module is not installed at /usr//bin/innobackupex line 2990.

As the above log, innobackup.backup.log, conveys it’s missing DBD module causing the error. Checked to install perl’s DBD module for MySQL through yum as follows:

[root@node-1 mysql]# yum install perl-DBD-MySQL
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.netaddicted.ca
 * epel: mirror.csclub.uwaterloo.ca
 * extras: mirror.netaddicted.ca
 * updates: mirror.netaddicted.ca
Setting up Install Process
Package perl-DBD-MySQL-4.013-3.el6.x86_64 already installed and latest version
Nothing to do

Well this resulted that we already have the DBD installed! Tried running the backup command manually and received the same error described above:

[root@node-1 mysql]# innobackupex --user=sstuser --password=xxxx --defaults-file=/etc/my.cnf --slave-info  --tmpdir=/tmp --stream=tar ./   2> /tmp/xtrabackup.log  |gzip > backup.tar.gz
[root@node-1 mysql]# vi /tmp/xtrabackup.log

Attempted checking mysql version using following command, pointed missing dependencies for DBD:

[root@node-1 mysql]#  perl -MDBD::mysql -e 'print $DBD::mysql::VERSION'
Can't load '/usr/lib64/perl5/auto/DBD/mysql/mysql.so' for module DBD::mysql: libmysqlclient.so.16: cannot open shared object file: No such file or directory at /usr/lib64/perl5/DynaLoader.pm line 200.
 at -e line 0
Compilation failed in require.
BEGIN failed--compilation aborted.

Now as it’s clear about the missing dependencies, tried looking, identifying & fixing the dependencies:

[root@node-1 mysql]# locate libmysqlclient.so.16
[root@node-1 mysql]# yum list installed| grep perl-DBD-MySQL
perl-DBD-MySQL.x86_64  4.013-3.el6      @base
[root@node-1 mysql]# yum deplist perl-DBD-MySQL.x86_64|grep mysql
  dependency: libmysqlclient.so.16()(64bit)
   provider: mysql-libs.x86_64 5.1.73-3.el6_5
  dependency: libmysqlclient.so.16(libmysqlclient_16)(64bit)
   provider: mysql-libs.x86_64 5.1.73-3.el6_5

Installed missing dependencies:

[root@node-1 mysql]# yum install mysql-libs.x86_64
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.netaddicted.ca
 * epel: mirror.csclub.uwaterloo.ca
 * extras: mirror.netaddicted.ca
 * updates: mirror.netaddicted.ca
Setting up Install Process
Package mysql-libs is obsoleted by Percona-Server-shared-51, trying to install Percona-Server-shared-51-5.1.73-rel14.12.624.rhel6.x86_64 instead
Resolving Dependencies
--> Running transaction check
---> Package Percona-Server-shared-51.x86_64 0:5.1.73-rel14.12.624.rhel6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================================================================================================================
 Package                                      Arch                       Version                                       Repository                                  Size
========================================================================================================================================================================
Installing:
 Percona-Server-shared-51                     x86_64                     5.1.73-rel14.12.624.rhel6                     percona-release-x86_64                     2.1 M

Transaction Summary
========================================================================================================================================================================
Install       1 Package(s)

Total download size: 2.1 M
Installed size: 5.9 M
Is this ok [y/N]: y
Downloading Packages:
Percona-Server-shared-51-5.1.73-rel14.12.624.rhel6.x86_64.rpm                                                                                    | 2.1 MB     00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : Percona-Server-shared-51-5.1.73-rel14.12.624.rhel6.x86_64                                                                                            1/1
  Verifying  : Percona-Server-shared-51-5.1.73-rel14.12.624.rhel6.x86_64                                                                                            1/1

Installed:
  Percona-Server-shared-51.x86_64 0:5.1.73-rel14.12.624.rhel6

Complete!

Now as the perl-DBD-MySQL dependency is fixed, ran the backup manually on node-1 and it completed successfully.

[root@node-1 mysql]# innobackupex --user=sstuser --password=xxx --defaults-file=/etc/my.cnf --slave-info  --tmpdir=/tmp --stream=tar ./   2> /tmp/xtrabackup.log  |gzip > backup.tar.gz

I hoped this should resolve the issue but…
On node-1, cleared datadir and attempted restarting the node again failed complaining as follows:

2016-15 18:54:14 25385 [Note] WSREP: Requesting state transfer: success, donor: 0
2015-01-15 18:54:16 25385 [Note] WSREP: (ce9728c5, 'tcp://0.0.0.0:4567') turning message relay requesting off
2015-01-15 18:54:19 25385 [Note] WSREP: 0.0 (node-1.com): State transfer to 1.0 (node-2.com) complete.
2015-01-15 18:54:19 25385 [Note] WSREP: Member 0.0 (node-1.com) synced with group.
WSREP_SST: [ERROR] xtrabackup process ended without creating '/var/lib/mysql//xtrabackup_galera_info' (20150115 18:54:19.507)
...
...
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20150115 18:54:19.543)
WSREP_SST: [INFO] Removing the sst_in_progress file (20150115 18:54:19.548)
2015-01-15 18:54:19 25385 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup --role 'joiner' --address '10.10.10.1' --auth 'sstuser:s3cret' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '25385'  '' : 32 (Broken pipe)
2015-01-15 18:54:19 25385 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-01-15 18:54:19 25385 [ERROR] WSREP: SST failed: 32 (Broken pipe)

Looked further in details of configuration and found that the xtrabackup_sst_method was different on node-1 vs on node-2:

Node-1:

mysql> show global variables like '%sst%';
+---------------------------------+---------------+
| Variable_name                   | Value         |
+---------------------------------+---------------+
| wsrep_sst_auth                  | xxxxxxxx      |
| wsrep_sst_donor                 |               |
| wsrep_sst_donor_rejects_queries | OFF           |
| wsrep_sst_method                | xtrabackup-v2 |
| wsrep_sst_receive_address       | AUTO          |
+---------------------------------+---------------+

Node-2:

$] grep wsrep_sst_method /etc/my.cnf
wsrep_sst_method=xtrabackup

The wsrep_sst_method variable sets up the method for taking the State Snapshot Transfer (SST). Correced wsrep_sst_method=xtrabackup-v2 and the node came up clean.

From the docs:

The xtrabackup-v2 wsrep_sst_method is same as xtrabackup SST except that it uses newer protocol, hence is not compatible. This is the recommended option for PXC 5.5.34 and above.

There are other SST methods like rsync & mysqldump depending on the method you prefer. (Check references)

We progressed further but while adding a third node (node-3) we faced another issue and it did not join the cluster. We again checked the donor’s innobackup.backup.log file to trace the issue:

[root@node-1 mysql]# vi innobackup.backup.log
...
...
innobackupex:  Using server version 5.6.21-70.1-56-log

innobackupex: Created backup directory /tmp/tmp.fFyzHDRH1f
^Gxbstream: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
innobackupex: 'xbstream -c' returned with exit code 1.
innobackupex: got a fatal error with the following stacktrace: at /usr//bin/innobackupex line 4865
        main::backup_file_via_stream('/tmp/tmp.HE5T8DeArs', 'backup-my.cnf') called at /usr//bin/innobackupex line 4914
        main::backup_file('/tmp/tmp.HE5T8DeArs', 'backup-my.cnf', '/tmp/tmp.HE5T8DeArs/backup-my.cnf') called at /usr//bin/innobackupex line 4938
        main::write_to_backup_file('/tmp/tmp.HE5T8DeArs/backup-my.cnf', '# This MySQL options file was generated by innobackupex.\x{a}\x{a}# T...') called at /usr//bin/innobackupex line 3746
        main::write_backup_config_file('/tmp/tmp.HE5T8DeArs/backup-my.cnf') called at /usr//bin/innobackupex line 3673
        main::init() called at /usr//bin/innobackupex line 1557
innobackupex: Error: Failed to stream '/tmp/tmp.HE5T8DeArs/backup-my.cnf': 1 at /usr//bin/innobackupex line 4865.

This time we have confirmed all the issues we faced earlier are answered but still the SST was not progressing.
Running the xtrabackup command manually doesn’t cause any error and backup finished correctly:

[root@node-1 mysql]# innobackupex --user=root --password=$pw --defaults-file=/etc/my.cnf --slave-info  --tmpdir=/tmp --stream=tar ./   2> /tmp/xtrabackup.log  |gzip > backup.tar.gz
[root@node-1 mysql]# vi /tmp/xtrabackup.log

As the manual backup goes well it hints that the issue was somewhere on new node though innobackup.backup.log expressed the errors.

Rechecked the configuration file here and noted an important difference. On node-1 we had following innodb settings while on node-3 we did not:

innodb_log_file_size=48M
innodb_data_file_path=ibdata1:12M:autoextend
innodb_buffer_pool_size=400M

Here, difference in values of innodb_log_file_size & innodb_data_file_path caused the xtrabackup restore fail.
innodb_log_file_size is pretty much the default value but innodb_data_file_path has default as “ibdata1:10M:autoextend” which was a differentiating factor causing the (restore) SST fail. Adding these parameters in my.cnf on node-3 fixed the issue and node joined the cluster.

So, here we faced three issues causing the State Snapshot Transfer to fail and hence stopping the nodes to join the cluster:
1. The perl-DBD-MySQL was missing dependencies and that we could identify following the logs on node-1.
2. The configuration for xtrabackup_sst_method on node-2 was incorrectly set than what was recommended (i.e. xtrabackup-v2).
3. Differentiating InnoDB related configuration variables can cause SST to fail.

Correcting them made the Galera happy…
Hope above troubleshooting steps are helpful. Recently we saw Percona’s blog conveying that we can have multiple reasons for an SST to fail; this post practically agree to that :).

References:

  • http://www.percona.com/doc/percona-xtradb-cluster/5.5/wsrep-system-index.html#wsrep_sst_method
  • http://www.percona.com/doc/percona-xtradb-cluster/5.6/manual/state_snapshot_transfer.html

Also see:

  • http://www.percona.com/blog/2014/12/30/diagnosing-sst-errors-with-percona-xtradb-cluster-for-mysql/

2 thoughts on “Debugging Percona Xtradb (Galera) Cluster node startup / SST errors

  1. Another important one – in case the server has multiple IPs and wsrep_sst_receive_address is set to AUTO it may use the wrong one. For instance in a test configuration with Vagrant / OVM you may have two IP address, one internal on the 10.x.x.x range and other on 192.x.x.x and PXC may use the 10.x.x.x rather than the desired 192.x.x.x address.

    In this case SST will fail with a message like:


    150921 14:09:40 [ERROR] WSREP: gcs/src/gcs_group.cpp:long int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():717: Will never receive state. Need to abort.

Leave a Reply

Your email address will not be published. Required fields are marked *


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

-- Kedar Vaijanapurkar --