04 Nov

Replication slave lag monitoring using heartbeat and windows batch scripts

“Show Slave Status” command has a last column “Seconds_Behind_Master”, which gives us idea about how much time slave is lagging behind master. It is an important to be considered parameter in monitoring and maintaining replication.

This article explains us a way to monitor replication slave lag time. It also includes a sample batch scripts to automate the monitoring process, makes it easy to understand.

Whats wrong with “Seconds_Behind_Master”:

Show Slave Status command; does shows us Seconds_Behind_Master.

Now Documentation says: The field measures the time difference in seconds between the slave SQL thread and the slave I/O thread. If the network is slow, this is not a good approximation; the slave SQL thread may quite often be caught up with the slow-reading slave I/O thread, so Seconds_Behind_Master often shows a value of 0, even if the I/O thread is late compared to the master. In other words, this column is useful only for fast networks.

Now this has been discussed at a lot of places for a lot of times, the solution is to periodically insert a row into a “heartbeat” table on the master server. Let it get replicated on slave and check the “heartbeat”s on the slave. This will surely explain you replication time and the slave behind master.

So whats the idea?

Consider descriptions of following two mysql functions:

current_timestamp() : The function when used in a query replicates its output same across the slaves. Thus even if the query is executed at a later time on slave, the value stored would be same as what is there in the master.

sysdate() : The behavior of this function is different from the above. It stores the current output of the function at the slave when executed. Thus the output produced could be different from what was produced at the master.

Considering behavior, if we have both values inserted on master server; we will get it replicated on slave. But for sysdate() function, the time will be of slave’s and that will help us calculating the slave lag comparing with current_timestamp() value.

Implementing Slave Behind Master Monitoring:

I’ve tried to implement replication slave lag monitoring through “heartbeat” method.

Here for making an example I’ve considered MySQL Server with port 3307 as Master and Server with port 3306 (Default) as Slave for monitoring Slave Lag.

Create table on Master MySQL Server:

1:
2:
3:
4:
5:
create table heartbeat
(
  master_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  slave_time TIMESTAMP NOT NULL
) ENGINE=MyISAM;

Periodical value insertion – Generate Heartbeat:

1:
insert into heartbeat(slave_time) values(SYSDATE());

One more case is, when slave is intentionally kept behind master. We can surely add up that time and make according insertion.

1:
2:
3:
Insert Into heartbeat(slave_time)
Values( from_unixtime((unix_timestamp(sysdate()) + time_to_sec('HH:MM:SS'))) ;

Here, HH:MM:SS is slave’s known time of lagging behind master.

We have following queries to monitor slave-lag:

First query will tell us the time difference of slave and master query execution time. Which is actually caused by the functions explained above.

1:
2:
3:
4:
5:
Select master_time, timediff(slave_time, master_time)
From heartbeat
Where DATE(master_time) = DATE(NOW())
Order By master_time;

This second query will tell us the time difference of last query execution time from replication to now.

1:
select timediff(NOW(), max(master_time)) from heartbeat;

Batch Scripts For Windows:

To understand above setup and making it little automated, I wrote couple of batch scripts.

Following batch script will auto insert one row into heartbeat table on per minute basis on Master.

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
##insert-master.bat
:begin
@echo off
set StartTime=%TIME%
set StartMin=%StartTime:~3,2%
:start
set EndTime=%TIME%
set EndMin=%EndTime:~3,2%
set /a Hour_Diff=StartMin - EndMin >nul
if %Hour_Diff%==0 (
goto start )
echo Beating...
mysql -uroot -pXXXX -P3307 -e "use master;insert into heartbeat(slave_time) values(SYSDATE());"
goto begin
pause

Similar batch created to monitor the lag time on Slave MySQL Server.

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:




19:
20:
21:
22:
23:
##monitor-slave.bat
@echo off
echo Observing slave delay - per minute basis
echo SBM -  Slave Behind Master / Last Query Executed.
echo ETD - Execution Time Difference
echo ----------------------------------
echo SBM  -  ETD - Slave Status
echo ----------------------------------
:begin
set StartTime=%TIME%
set StartMin=%StartTime:~3,2%
:start
set EndTime=%TIME%
set EndMin=%EndTime:~3,2%
set /a Hour_Diff=StartMin - EndMin >nul
if %Hour_Diff%==0 (
         goto start )
FOR /F "skip=1 tokens=*" %%I IN ('mysql -uroot -pXXXX -e
"use master;select TIME_TO_SEC(timediff(NOW(), max(master_time)) )
last_query_time, second(timediff(max(slave_time), max(master_time)))
master_slave_difference from heartbeat where DATE(master_time) = DATE(NOW())
order by master_time desc;"') DO set var=%%I%
@echo %var%
FOR /F "tokens=11 delims='|        '" %%G IN ('mysql -uroot -pXXXX -e "show slave status;"') do @echo                  %%G
FOR /F "tokens=12 delims='|        '" %%G IN ('mysql -uroot -pXXXX -e "show slave status;"') do @echo                  %%G
goto begin
pause

PS: Set passwords accordingly.

Using sample batch scripts we will be able to understand the replication lag.
ETD, the Execution Time Difference from master to slave shows us how long slave takes to replicate once master has executed the statement. By query we can understand it is difference between current_timestamp() and sysdate().
SBM, Slave Behind Master, points us the time in seconds last query executed on slave from master.
The monitoring script will also tell you the IO and SQL thread status after each minute.


Leave a Reply

-- Kedar Vaijanapurkar --