pt-table-checksum & Waiting to check replicas for differences: 0% 00:00 remain

This post discuss the pt-table-checksum & error “Waiting to check replicas for differences: 0% 00:00 remain” with a use-case and solution.

I recently came across an issue where the task was to checksum databases using Percona’s pt-table-checksum. The command continued well for all the databases but when it came to specific DB, it kept on showing following error:

Waiting to check replicas for differences: 0% 00:00 remain

As this did not change for hours (was in screen session), I cross-checked that nothing blocks the process and other options!

I killed pt-table-checksum & re-ran the command with PTDEBUG to get detailed debug info:

[kedar@nitty-witty ~] PTDEBUG=1 /usr/bin/pt-table-checksum --config /home/kedar/pt-table-checksum-config   --recursion-method dsn=h=localhost,D=DB_A,t=dsns --replicate DB_A.checksums --no-check-replication-filters --no-check-binlog-format --chunk-time=3 --max-load Threads_running=18 --check-interval=1 --max-lag=1000 --no-check-plan --ignore-databases mysql,performance_schema,information_schema --databases DB_A,DB_B,DB_C,DB_D --resume > checksum.log 2>&1

The environment variable PTDEBUG enables verbose debugging output to STDERR.

From the logs I found that the checksum command was stuck because following query was returning NULL:

mysql> SELECT MAX(chunk) FROM `DB_A`.`checksums` WHERE db='DB_D' AND tbl='a_c' AND master_crc IS NOT NULL;
| MAX(chunk) |
|       NULL |
1 row in set (0.00 sec)

I ran the query with only db=’DB_D’ condition and found that all the tables of this database returned NULL! The database DB_D was very well present on master and all the slaves.

Debugging further get me to following status output of the slave:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_User: slave2
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000908
          Read_Master_Log_Pos: 576546786
               Relay_Log_File: mysqld-relay-bin.002738
                Relay_Log_Pos: 576546932
        Relay_Master_Log_File: mysql-bin.000908
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: DB_A,DB_B,DB_C
                   Last_Errno: 0
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 576546786
              Relay_Log_Space: 576547132
              Until_Condition: None
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
               Last_SQL_Errno: 0
             Master_Server_Id: 1
1 row in set (0.00 sec)

Look at “Replicate_Do_DB” closely -> “DB_A,DB_B,DB_C”. That’s right, DB_D is not replicated !!
So, that explains us the NULL results of max-chunk finding query and our error of waiting-to-check-replicas-for-differences.

Well, so the solution here is, as DB_D does not replicate, we shall remove it from the list of databases to checksum. Add DB_D in to the list of –ignore-databases and rerun the command to get things started!

[kedar@nitty-witty ~] PTDEBUG=1 /usr/bin/pt-table-checksum --config /home/kedar/pt-table-checksum-config   --recursion-method dsn=h=localhost,D=DB_A,t=dsns --replicate DB_A.checksums --no-check-replication-filters --no-check-binlog-format --chunk-time=3 --max-load Threads_running=18 --check-interval=1 --max-lag=1000 --no-check-plan --ignore-databases DB_D,mysql,performance_schema,information_schema --databases DB_A,DB_B,DB_C --resume > checksum.log 2>&1

This worked well and killed “Waiting to check replicas for differences: 0% 00:00 remain”.

Note that we’re using –no-check-replication-filters so that our checksum will continue over the replication filters. Also review –replicate-database option which can also be used to specify databases to be USEd.

I think fix in pt-table-checksum here could be that: In a condition where max-chunk returns NULL and slave is in sync (seconds_behind_master: 0), pt-table-checksum should exit with appropriate warning.
I may be overlooking many other options but that’s something I could quickly think. Comment for corrections / if-and-buts.

Hope this helps.

Leave a Reply

Your email address will not be published. Required fields are marked *