Well, 3TB in 13 hrs is about 80 hours to sync 20TB, i.e. 3-4 days, and it could be a lot longer with a large number of small files (a good chunk, but not all, of our data is composed of hundreds of thousands of small .jpg image files of 100 Kbytes or so). Overall, there are millions of files that need to be transferred.

A good thing about rsyncing directly to the single server is that if we do have to stop the rsync for any reason, it will be very fast to restart later. Restarting an rsync part-way through a large transfer to a gluster volume can be incredibly slow, as it has to stat all the files that have already made it onto the gluster in order to work out where to restart. Just working out where to restart could take hours on glusterfs, whereas rsync direct to an xfs filesystem will tear through millions of stat operations and work out where to restart in a matter of minutes. So for these reasons, it seems like we should be able to save an enormous amount of time by rsyncing directly to the xfs bricks and adding the bricks to gluster later (a rough sketch of the commands I have in mind is at the bottom of this thread)…

Basically, our setup has 2 (soon to be 3) reasonably powerful nodes, set up like this:

1) Each node is a Supermicro chassis with 12 x 4TB Hitachi disks on an LSI 9280-4i4e RAID controller, with a large RAID6 array formatted as 4 XFS bricks of 9TB each, for a total of 36.5TB per node.
2) 10GbE connecting the nodes.
3) Xeon E3-1245 quad-core (8 HT) CPU @ 3.4 GHz, 16GB RAM.

These nodes definitely do not have the most powerful CPUs ever, nor do they have huge quantities of RAM, but the disk arrays should be capable of some good speed, and we hope they will be adequate for a gluster that is just a huge archive. We just want to move data onto it, and then access it when needed, or back up data from it (to tape).

From: Ryan Nix [mailto:ryan.nix@xxxxxxxxx]

Interesting. Still, I think it's better to let the Gluster client handle the syncing. What happens if, for some strange reason, the rsync process dies in the middle of the night? Gluster, on the other hand, will keep working to get the data onto the other bricks without human intervention. I recently used Gluster to sync 3 TB of data to another brick over a 1Gbps link in about 13 hours on decent hardware.

On Wed, Oct 15, 2014 at 9:04 PM, SINCOCK John <J.Sincock@xxxxxxxxx> wrote:

We have 20 terabytes to rsync onto a new server (which will have 32 TB capacity), and we then want to add that server to an existing 2-node gluster of 73TB (53 TB used, 20 TB free), to give a 3-node gluster with 105TB capacity, 73TB used.

The reason I want to do it this way, if possible, is that Gluster is slow on writes, especially for small files, and we have a LOT of small files, so I’m pretty sure it will be a LOT faster to rsync directly to the new server (which is the one that has free space anyway), and then add that server to the gluster – if it is possible to have gluster recognise those files.

From: Ryan Nix [mailto:ryan.nix@xxxxxxxxx]
So Gluster, at its core, uses rsync to copy the data to the other bricks. Why not let Gluster do the heavy lifting?

On Wed, Oct 15, 2014 at 7:35 PM, SINCOCK John <J.Sincock@xxxxxxxxx> wrote:
I've never added a brick with existing files, but I did start a new Gluster volume on disks that already contained data, and I was able to access the files without problems. Of course the files will be out of place, but the first time you access them, Gluster will add links to speed up future lookups.
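
For what it's worth, those links are visible on the bricks themselves: DHT drops a zero-byte, sticky-bit "linkto" file on the brick the filename hashes to, with an xattr pointing at the brick that really holds the data. A minimal way to poke at it (the mount point, volume and paths here are made up for illustration):

    # trigger a lookup through the gluster mount so DHT can create its link file
    stat /mnt/myvolume/images/photo0001.jpg

    # then, on a brick that does NOT physically hold the data, you may see a
    # zero-byte mode ---------T file carrying the linkto xattr:
    getfattr -n trusted.glusterfs.dht.linkto -e text /bricks/brick1/images/photo0001.jpg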
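
And to sketch the overall approach discussed at the top of this thread, assuming a plain distribute volume and hypothetical volume/host/path names (only one brick shown, and loading bricks behind Gluster's back is not the officially blessed path, so try it on scratch data first):

    # 1) rsync straight onto the new server's xfs brick, bypassing gluster
    #    (-a preserves perms/times, -H keeps hard links, --partial helps restarts)
    rsync -aH --partial /source/data/ newserver:/bricks/brick1/data/

    # 2) add the pre-loaded brick to the existing volume
    gluster volume add-brick myvolume newserver:/bricks/brick1

    # 3) rewrite the directory layouts so the new brick participates in lookups
    #    (fix-layout does not move any data)
    gluster volume rebalance myvolume fix-layout start
    gluster volume rebalance myvolume status

    # 4) optionally walk the mount once, so the link files described above are
    #    created up front rather than on first access
    find /mnt/myvolume -exec stat {} + > /dev/null

Whether step 1 is actually safe depends on the volume type and on how your gluster version handles pre-existing data on bricks, so treat this as a starting point rather than a recipe.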