Hi Peter,

Thanks for those graphs and the thorough explanation. I guess a lot of your performance increase is down to the fact that a fair amount of your workload is cacheable?

I got my new node online late last week with 12x 8TB drives and a 200GB bcache partition. Given my very uncacheable workload I was expecting the hit ratio to be very low, but surprisingly it's hovering around 50%! Digging further into the stats, though, it's possible that sequential bypasses are counting towards that figure (see the quick check at the bottom of this mail):

cat /sys/block/bcache2/bcache/stats_day/cache_hit_ratio
48
cat /sys/block/bcache2/bcache/stats_day/cache_bypass_hits
3000830
cat /sys/block/bcache2/bcache/stats_day/cache_hits
444625

So I guess a lot of the OSD metadata is being cached, which is actually what I wanted. I've also had an initial look at the disk stats, and the disks appear to be doing about half as much work as those in the existing nodes, which is encouraging. I'll try to get some good graphs to share over the course of this week.

Nick

> -----Original Message-----
> From: Peter Maloney [mailto:peter.maloney@xxxxxxxxxxxxxxxxxxxx]
> Sent: 06 April 2017 16:04
> To: nick@xxxxxxxxxx; 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Preconditioning an RBD image
>
> On 03/25/17 23:01, Nick Fisk wrote:
> >
> >> I think I owe you another graph later when I put all my VMs on there
> >> (probably finally fixed my rbd snapshot hanging VM issue... worked
> >> around it by disabling exclusive-lock,object-map,fast-diff). The
> >> bandwidth-hungry ones (which hung the most often) were moved shortly
> >> after the bcache change, and it's hard to explain how it affects the
> >> graphs... easier to see with iostat while changing it, with a mix of
> >> cache and not, than in ganglia afterwards.
> >
> > Please do, I can't resist a nice graph. What I would be really
> > interested in is answers to these questions, if you can:
> >
> > 1. Has your per-disk bandwidth gone up, due to removing random writes?
> >    I.e. I struggle to get more than about 50MB/s writes per disk due to
> >    the extra random IO per request.
> > 2. Any feeling on how it helps with dentry/inode lookups? As mentioned
> >    above, I'm using 8TB disks and cold data has an extra penalty for
> >    reads/writes as it has to look up the FS data first.
> > 3. I assume with the 4.9 kernel you don't have the bcache fix which
> >    allows partitions. What method are you using to create OSDs?
> > 4. As mentioned above, any stats on the percentage of MB/s that is
> >    hitting your cache device vs the journal (assuming the journal is
> >    100% of IO)? This is to calculate extra wear.
> >
> > Thanks,
> > Nick
>
> So it's graph time...
>
> Here's basically what you saw before, but I made it stacked, so 900 on
> the %util graph means roughly 18 of the 27 disks in the whole cluster
> are at an average of 50% in the sample period for that one pixel width
> of the graph. Remove gtype=stack and it won't be stacked, or go to
> http://www.brockmann-consult.de/ganglia/?c=ceph and fill out the
> aggregate report form yourself; I manually added the dates (cs and ce)
> copied from another URL, since that form doesn't have them and only does
> the last x time periods. You can also find more metrics in the
> drop-downs on that page. sda and sdb have always been the SSDs; disk
> metrics are 30s averages from iostat.
>
> With no bcache until a bit at the end, plus possibly some load from
> migrating to bcache in there (I didn't record dates for that):
> %util -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_util&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F2016+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=1000
> await -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_await&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F2016+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=1000
> wMBps -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_wMBps&glegend=show&aggregate=1&_=1491205396888&cs=11%2F1%2F2016+21%3A18&ce=12%2F15%2F2016+4%3A21&z=xlarge&gtype=stack&x=300
>
> And here is the period since most VMs were on ceph (more than in the
> before graphs), with some osd reweight-by-utilization started a few days
> ago (but scrub disabled during that), which makes the last part look
> higher. The last VMs were moved today, also visible on the graph, plus
> some extra backup load some time later.
>
> %util -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_util&glegend=show&aggregate=1&_=1491205396888&cs=3%2F24%2F2017+23%3A3&z=xlarge&gtype=stack&x=1000
> await -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_await&glegend=show&aggregate=1&_=1491205396888&cs=3%2F24%2F2017+23%3A3&z=xlarge&gtype=stack&x=1000
> wMBps -
> http://www.brockmann-consult.de/ganglia/graph.php?hreg[]=ceph.*&mreg[]=sd[c-z]_wMBps&glegend=show&aggregate=1&_=1491205396888&cs=3%2F24%2F2017+23%3A3&z=xlarge&gtype=stack&x=300
>
> Looking at the wMBps graph, you can see the cluster doesn't really have
> that high a load on average, only in bursts, but the before and after
> periods carry a similar load, so the other graphs should be at least
> somewhat comparable.
>
> I think the %util graph speaks for itself, but I don't know how to show
> you what it does in the VMs. I figure it will smooth out performance at
> times when lots of requests arrive that hdds are bad at but ssds are
> good at (snap trimming, directory splitting, etc.). Lots of the issues I
> find show up clearly in %util.
>
> Or both time ranges together on the main reports page:
>
> http://www.brockmann-consult.de/ganglia/?r=year&cs=10%2F21%2F2016+20%3A33&ce=4%2F7%2F2017+7%3A6&c=ceph&h=&tab=m&vn=&hide-hf=false&m=load_one&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
>
> And be sure to share some of your own results. :)
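
P.S. In case it's useful to anyone else looking at these counters, below is the quick check referred to above. It's only a rough sketch: it assumes the stats directory also exposes cache_misses and cache_bypass_misses alongside the counters I quoted, and bcache2/stats_day are just the device and period from my example, so adjust to suit.

# Hit ratio over the IO bcache actually tried to cache (which is what I
# think the hit_ratio file is based on), versus over all IO with the
# sequentially bypassed requests counted as misses.
S=/sys/block/bcache2/bcache/stats_day
hits=$(cat $S/cache_hits)
misses=$(cat $S/cache_misses)
bp_hits=$(cat $S/cache_bypass_hits)
bp_misses=$(cat $S/cache_bypass_misses)
echo "cacheable IO only: $(( 100 * hits / (hits + misses) ))%"
echo "all IO:            $(( 100 * hits / (hits + misses + bp_hits + bp_misses) ))%"

Comparing the two numbers should make it clearer whether the bypassed sequential IO is inflating the headline figure or simply being left out of it.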