Chain Copying to Multiple Hosts
This week I was given the task of repopulating our entire primary database cluster. This was due to an ALTER that had to be performed on our largest table. It was easiest to run it on one host and then populate the dataset from that host everywhere else.
I recalled reading a blog post from Tumblr a while back about how to chain a copy to multiple hosts using a combination of nc, tar, and pigz. I used this, along with a few other things, to greatly speed up our repopulation process. As I was repopulating production servers, I did a combination of raw data copies and xtrabackup streams across our servers, depending on each server's position in our replication setup.
For a normal straight copy, here’s what I did:
On the last host, configure netcat to listen and then pipe the output through pigz and tar to uncompress and untar. This needs to be run in the destination directory:
nc -l 1337 | pigz -d | tar xvf -
On any host in the middle of the chain, you do the same thing with one extra step: create a fifo and use it to redirect the stream to the next host:
mkfifo copy_redirect
nc next_host_in_chain 1337 <copy_redirect &
nc -l 1337 | tee copy_redirect | pigz -d | tar -xvf -
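The fifo fan-out can be tried on a single machine before pointing it at real servers. This is only a rough sketch, not what I ran in production: gzip stands in for pigz, a background reader stands in for the nc connection to the next host, and every path is a throwaway temp directory.

```shell
#!/bin/sh
# Local dry run of the middle-host fan-out. Assumptions: gzip stands in for
# pigz, and a background reader stands in for `nc next_host_in_chain 1337`.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/src" "$work/local" "$work/next"
printf 'hello\n' > "$work/src/a.txt"

mkfifo "$work/copy_redirect"

# Start the fifo reader first, in the background: opening a fifo blocks
# until both a reader and a writer are attached.
( gzip -dc < "$work/copy_redirect" | tar -xf - -C "$work/next" ) &

# tee writes the still-compressed stream into the fifo and also passes it
# down the pipe, where it is decompressed and unpacked locally.
tar -cf - -C "$work/src" . | gzip -c | tee "$work/copy_redirect" | gzip -dc | tar -xf - -C "$work/local"

wait  # let the background reader finish unpacking
ls "$work/local" "$work/next"
```

Because tee forwards the compressed bytes untouched, each middle host only pays for decompressing its own copy; the next host receives exactly what this host received.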
And on the source host you actually make the stream. This is where I differed the most from what Tumblr had written. I added a progress bar using pv.
tar -c /data/mysql/ | pv --size $( du -sh /data/mysql/ | cut -f1 ) | pigz | nc first_host_in_chain 1337
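One caveat on the size estimate: du -sh prints a rounded, human-readable figure, so the progress bar is only as precise as that rounding. A possible refinement, which I did not use in the run above, is to hand pv an exact byte count instead. The sketch below assumes GNU du (for the -b flag), and dir_bytes is a hypothetical helper name of my own.

```shell
#!/bin/sh
# dir_bytes is a hypothetical helper: total apparent size of a directory
# tree in bytes. Assumes GNU du, where -b means --apparent-size --block-size=1.
dir_bytes() {
    du -sb "$1" | cut -f1
}

# Small self-contained check against a throwaway directory.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
head -c 100000 /dev/zero > "$demo/table.ibd"
size=$(dir_bytes "$demo")
echo "pv would be given: --size $size"

# On the source host the same value could replace the du -sh substitution:
#   tar -c /data/mysql/ | pv --size "$(dir_bytes /data/mysql/)" | pigz | nc first_host_in_chain 1337
```

The tar stream is still slightly larger than this figure because of per-file headers and padding, so the bar remains an estimate either way.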
To do this with an xtrabackup stream, the commands are similar. On each receiving host, tar needs the extra “i” flag (making it “tar xvfi -”), which tells it to ignore zeroed blocks in the stream. The progress bar here became slightly less accurate, but was still a good rough estimate of the progress. On the source host, the command became:
innobackupex --stream=tar /tmp/ --slave-info | pv --size $( du -sh /data/mysql/ | cut -f1 ) | pigz | nc first_host_in_chain 1337
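The “i” flag is easy to forget, so here is a small local illustration of what it changes. As I understand it, the xtrabackup tar stream contains zeroed blocks that plain tar reads as the end of the archive. The sketch below does not use xtrabackup at all: it assumes GNU tar and reproduces the same symptom by concatenating two tiny archives.

```shell
#!/bin/sh
# Why -i matters, shown with GNU tar (assumed) and two concatenated archives.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf 'one\n' > "$work/a.txt"
printf 'two\n' > "$work/b.txt"
tar -cf "$work/a.tar" -C "$work" a.txt
tar -cf "$work/b.tar" -C "$work" b.txt

# Each archive ends with zero-filled blocks. Without -i, tar treats the first
# such block as end of archive and never reaches b.txt.
without_i=$(cat "$work/a.tar" "$work/b.tar" | tar -tf - | grep -c .)

# With -i (--ignore-zeros) tar skips the zeroed blocks and keeps reading.
with_i=$(cat "$work/a.tar" "$work/b.tar" | tar -tif - | grep -c .)

echo "entries listed without -i: $without_i, with -i: $with_i"
```

If a restored backup looks suspiciously small, a missing “i” on one of the receiving hosts is a reasonable first thing to check.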
I found that using this method, for a raw copy, I was able to achieve between 300 and 350 MB/sec copying large tables. Smaller tables averaged slower speeds. I didn’t do enough testing here to see where the bottleneck was. I can say that it was not network, CPU, or I/O. The servers involved have 10 Gbit networking and FusionIO drives. Increasing the compression level may have helped add some throughput here as well. Copying a 1.4 TB dataset to 3 destination servers took under 2 hours.
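As a quick sanity check on those figures, the arithmetic can be done in the shell. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 1,000,000 MB) and uses integer division, so the results are rounded down.

```shell
#!/bin/sh
# Back-of-the-envelope check of the numbers above, in decimal megabytes.
dataset_mb=1400000   # 1.4 TB
elapsed_s=7200       # 2 hours, the upper bound on the copy time

# Average rate needed to finish in exactly 2 hours (integer division).
avg_mb_s=$(( dataset_mb / elapsed_s ))

# Time the same dataset would take at the peak rates seen on large tables.
min_at_300=$(( dataset_mb / 300 / 60 ))
min_at_350=$(( dataset_mb / 350 / 60 ))

echo "average: at least ${avg_mb_s} MB/sec"
echo "at 300 MB/sec: about ${min_at_300} min; at 350 MB/sec: about ${min_at_350} min"
```

Finishing in under 2 hours implies an average of at least about 194 MB/sec, which fits with the smaller tables pulling the overall rate below the 300 to 350 MB/sec seen on large ones.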
This is definitely a tool that I will be adding to my arsenal to use on a regular basis.