Hi Peter,

Thanks for those graphs and the thorough explanation. I guess a lot of your performance increase is down to the fact that a fair amount of your workload is cacheable?

I got my new node online late last week with 12x 8TB drives and a 200GB bcache partition. Given my very uncacheable workload I was expecting the hit ratio to be very low, but surprisingly it's hovering around 50%! Digging further into the stats, though, it's possible that sequential bypasses are counting towards that figure (see the quick check at the bottom of this mail):

cat /sys/block/bcache2/bcache/stats_day/cache_hit_ratio
48
cat /sys/block/bcache2/bcache/stats_day/cache_bypass_hits
3000830
cat /sys/block/bcache2/bcache/stats_day/cache_hits
444625

So I guess a lot of the OSD metadata is being cached, which is actually what I wanted. I've also had an initial look at the disk stats, and the disks appear to be doing about half as much work as those in the existing nodes, which is encouraging. I'll try to get some good graphs to share over the course of this week.

Nick

> -----Original Message-----
> From: Peter Maloney [mailto:peter.maloney@xxxxxxxxxxxxxxxxxxxx]
> Sent: 06 April 2017 16:04
> To: nick@xxxxxxxxxx; 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Preconditioning an RBD image
>
> On 03/25/17 23:01, Nick Fisk wrote:
> >
> >> I think I owe you another graph later when I put all my VMs on there
> >> (probably finally fixed my rbd snapshot hanging VM issue... worked
> >> around it by disabling exclusive-lock,object-map,fast-diff). The
> >> bandwidth-hungry ones (which hung the most often) were moved shortly
> >> after the bcache change, and it's hard to explain how it affects the
> >> graphs... easier to see with iostat while changing it, with a mix of
> >> cache and not, than in ganglia afterwards.
> >
> > Please do, I can't resist a nice graph. What I would be really
> > interested in is answers to these questions, if you can:
> >
> > 1. Has your per-disk bandwidth gone up, due to removing random writes?
> >    I.e. I struggle to get more than about 50MB/s writes per disk due to
> >    the extra random IO per request.
> > 2. Any feeling on how it helps with dentry/inode lookups? As mentioned
> >    above, I'm using 8TB disks and cold data has an extra penalty for
> >    reads/writes as it has to look up the FS data first.
> > 3. I assume with the 4.9 kernel you don't have the bcache fix which
> >    allows partitions. What method are you using to create OSDs?
> > 4. As mentioned above, any stats on the percentage of MB/s that is
> >    hitting your cache device vs the journal (assuming the journal is
> >    100% of IO)? This is to calculate extra wear.
> >
> > Thanks,
> > Nick
>
> So it's graph time...
>
> Here's basically what you saw before, but I made it stacked, so 900 on
> the %util graph means roughly 18 of the 27 disks in the whole cluster
> are at an average of 50% in the sample period for that one pixel width
> of the graph. Remove gtype=stack and it won't be stacked, or go to
> http://www.brockmann-consult.de/ganglia/?c=ceph and fill out the
> aggregate report form yourself; I manually added the dates (cs and ce)
> copied from another URL, since that form doesn't have them and only does
> the last x time periods. You can also find more metrics in the
> drop-downs on that page. sda and sdb have always been the SSDs; disk
> metrics are 30s averages from iostat.
>
> With no bcache until a bit at the end, plus possibly some load from
> migrating to bcache in there (I didn't record dates for that):
> %util -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_util&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F2016+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=1000
> await -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_await&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F2016+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=1000
> wMBps -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_wMBps&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F2016+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=300
>
> And here is the period since most VMs were on ceph (more than in the
> before graphs), with some osd reweight-by-utilization started a few days
> ago (but scrub disabled during that), which makes the last part look
> higher. The last VMs were moved today, also visible on the graph, plus
> some extra backup load some time later.
>
> %util -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_util&glegend=show&aggregate=1&_=1491205396888&cs=3%2F24%2F2017+23%3A3&z=xlarge&gtype=stack&x=1000
> await -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_await&glegend=show&aggregate=1&_=1491205396888&cs=3%2F24%2F2017+23%3A3&z=xlarge&gtype=stack&x=1000
> wMBps -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_wMBps&glegend=show&aggregate=1&_=1491205396888&cs=3%2F24%2F2017+23%3A3&z=xlarge&gtype=stack&x=300
>
> Looking at the wMBps graph, you can see the cluster doesn't really have
> that high a load on average, only in bursts, but the before and after
> periods carry a similar load, so the other graphs should be at least
> somewhat comparable.
>
> I think the %util graph speaks for itself, but I don't know how to show
> you what it does in the VMs. I figure it will smooth out performance at
> times when lots of requests arrive that hdds are bad at but ssds are
> good at (snap trimming, directory splitting, etc.). Lots of the issues I
> find show up clearly in %util.
>
> Or both time ranges together on the main reports page:
>
> http://www.brockmann-consult.de/ganglia/?r=year&cs=10%2F21%2F2016+20%3A33&ce=4%2F7%2F2017+7%3A6&c=ceph&h=&tab=m&vn=&hide-hf=false&m=load_one&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
>
> And be sure to share some of your own results. :)
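
P.S. In case it's useful to anyone else looking at these counters, below is the quick check referred to above. It's only a rough sketch: it assumes the stats directory also exposes cache_misses and cache_bypass_misses alongside the counters I quoted, and bcache2/stats_day are just the device and period from my example, so adjust to suit.

# Hit ratio over the IO bcache actually tried to cache (which is what I
# think the hit_ratio file is based on), versus over all IO with the
# sequentially bypassed requests counted as misses.
S=/sys/block/bcache2/bcache/stats_day
hits=$(cat $S/cache_hits)
misses=$(cat $S/cache_misses)
bp_hits=$(cat $S/cache_bypass_hits)
bp_misses=$(cat $S/cache_bypass_misses)
echo "cacheable IO only: $(( 100 * hits / (hits + misses) ))%"
echo "all IO:            $(( 100 * hits / (hits + misses + bp_hits + bp_misses) ))%"

Comparing the two numbers should make it clearer whether the bypassed sequential IO is inflating the headline figure or simply being left out of it.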