On 11/15/16 22:13, Thomas Danan wrote:
I think there are some versions where changing this rebalances a bunch of data to even things out... I don't know why I think that, or where I read it. Maybe it was only argonaut vs. newer. But having to rebalance 75% of the data makes me feel more confident. (And keep in mind it significantly changes the client version compatibility requirements, especially for kernel clients, which possibly don't even exist in a compatible version.)

And looking at iostat etc. at the times of the blocking, it seems like 1-2 disks are at 100% util% and the rest are nearly idle, and the SSD journals rarely go above 10% or so (I bought 2 expensive Micron DC ones per node). So I think balance is the most important thing I need, and plain efficiency is the next thing (which might come from bluestore when it's ready, especially for rbd snapshot CoW). Having 2 disks at 100% is something like 300-560 iops, whereas the whole server ought to do about 1700 iops (3 disks that do about 280 and 6 more that do about 150 direct sync 4k random-write iops, per node). That's about 21% utilization before it blocks.

You could try getting the data out of my ganglia here (sda,sdb are the SSDs, and ceph2 sdg is broken and missing, with bogus data on the graphs):
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=ceph.*&mreg%5B%5D=sd%5Bc-z%5D_util&gtype=line&glegend=show&aggregate=1
But it's not that easy to get this info out of ganglia... highly customized graphing isn't its strong point.

I can't guarantee a bug-free experience, but you can change it, look at the % of objects being rebalanced, and if you don't like it, change it back (maybe it will be much less going from firefly to hammer than to jewel like me). But if you wait an hour before changing it back, you can bet it takes another hour to settle again (or maybe set nobackfill first). I don't like this, but I don't know what to do other than rate-limit it and accept the enormous wait.

Well... it was at 77.65% or so (the tunables made it 75%, plus more from the extra pgs), and now after almost 3 hours it's at 75.141%... so I imagine it'll take somewhere between 75 hours and forever minus a day or two. But with the sleep settings it doesn't seem to cause any issues, so if there's any chance of it balancing out the load on the OSDs, I'll try it. (And these numbers are with me fiddling with it and watching it every now and then... I'll set max backfills back to 1 and the sleep back to about 0.6 when I go to bed; the exact commands are at the bottom of this mail. Maybe then it'll run at half speed.)

Also FYI I only have 31% of the space used (most of the disks I added were to make it not horribly slow rather than to add space, since it was so slow with only 3 OSD disks per node). The cluster is just 3 nodes, each with 2x Micron S630DC-400, 3x HUS724040ALS640, and 6x Hitachi HUA722020ALA330 (minus one dead one; the last model is SATA... just some old stuff I added to speed things up, which helped even though they're slower).

# ceph df

And as for impact... I could tell you more tomorrow. But with the sleep settings, the 4k randwrite iops in fio benchmarks (roughly the job sketched at the bottom of this mail) seem maybe half of, or the same as, before, and other behavior doesn't seem so bad... maybe even better than before on average, with a few more hiccups than before, but less of the blocking that kills qemu VMs (which I can't explain... do the tunables do that right away? Or did the snap trim sleep do something? I doubt the recovery sleep did, since there was no recovery until I decided to change things. Or it's just luck so far, and tomorrow morning some VMs will be dead like every morning for the past week, needing SIGKILL).
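For reference, the throttling I keep mentioning is just the usual OSD options injected at runtime; roughly this (the values are the ones I gave above, with the option names as they are on my Jewel cluster, so double-check them on firefly/hammer; osd-snap-trim-sleep is set the same way, and put the same settings in ceph.conf if you want them to survive OSD restarts):

# ceph tell osd.* injectargs '--osd-max-backfills 1'
# ceph tell osd.* injectargs '--osd-recovery-sleep 0.6'

and to pause the data movement entirely while you decide whether to keep the new tunables:

# ceph osd set nobackfill
# ceph osd unset nobackfill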
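And the per-disk iops figures above come from fio 4k direct sync random-write runs; something along these lines, though not my exact job (note it writes to the raw device, so /dev/sdX here is just a placeholder for a disk you can afford to scribble on):

# fio --name=4k-sync-randwrite --filename=/dev/sdX --rw=randwrite --bs=4k --direct=1 --sync=1 --iodepth=1 --numjobs=1 --runtime=60 --time_based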