Re: Preconditioning an RBD image

Nick Fisk <nick@xxxxxxxxxx> · Sat, 25 Mar 2017 22:01:29 -0000

Thanks for your response, Peter. Comments inline.
> -----Original Message-----
> From: Peter Maloney [mailto:peter.maloney@xxxxxxxxxxxxxxxxxxxx]
> Sent: 23 March 2017 22:45
> To: nick@xxxxxxxxxx; 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Preconditioning an RBD image
> 
> Hi Nick,
> 
> I didn't test with a colocated journal. I figure ceph knows what it's doing with
> the journal device, and it has no filesystem, so there's no xfs journal, file
> metadata, etc. to cache due to small random sync writes.

Sure, I guess I was more interested in the simplicity of not having lots of partitions everywhere. Although with a colocated journal, I guess you run the risk of large sequential journal IOs not being cached, which would effectively halve the speed of the disk.

> 
> I tested the bcache and journals on some SAS SSDs (rados bench was ok but
> real clients were really low bandwidth), and journals on NVMe (P3700) and
> bcache on some SAS SSDs, and also tested both on the NVMe. I think the
> performance is slightly better with it all on the NVMe (hdds being the
> bottleneck... tests in VMs show the same, but rados bench looks a tiny bit
> better). The bcache partition is shared by the osds, and the journals are
> separate partitions.
> 
> I'm not sure it's really triple overhead. bcache doesn't write all your data to
> the writeback cache... just as many small sync writes as it can, as long as the
> cache doesn't fill up or get too busy (based on await). And the bcache device
> flushes very slowly to the hdd, not overloading it (unless cache is full). And
> when I make it do it faster, it seems to do it more quickly than without
> bcache (like it does it more sequentially, or without sync; but I didn't really
> measure... just looked at, eg. 400MB dirty data, and then it flushes in 20
> seconds). And if you overwrite the same data a few times (like a filesystem
> journal, or some fs metadata), you'd think it wouldn't have to write it more
> than once to the hdd in the end. Maybe that means something small like
> leveldb isn't written often to the hdd.

Yes, that makes sense, thanks for the explanation. I guess it depends on your IO profile. I've recently been hit with slowdowns when trying to copy large amounts of data into the cluster, mainly relating to random IO overheads on the disks. So this bcache thing looks interesting. I also suffer from slow dentry/inode lookups, despite vfs_cache_pressure being set to 1, so again, being able to cache this makes sense.

I'm guessing lots of small writes + leveldb updates are going to get written Journal->Bcache->Disk. I guess I'm just a bit nervous about putting too much wear on my NVMe devices by trying this. Do you have any stats showing Journal partition vs HDD/Bcache writes, to see if there is much amplification?

> 
> And it's not just a write cache. The default is 10% writeback, which means the
> rest is read cache. And it keeps read stats so it knows which data is the most
> popular. My nodes right now show 33-44% cache hits (cache is too small I
> think). And bcache reorders writes on the cache device so they are
> sequential, and can write to both at the same time so it can actually go faster
> than a pure ssd in specific situations (mixed sequential and random, only
> until the cache fills).
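
For anyone wanting to check the same numbers on their own nodes, here is a minimal Python sketch that reads the writeback percentage, dirty data and hit ratio from sysfs. It assumes the bcache device is called bcache0 (a placeholder, not Peter's actual device name) and uses the sysfs file names from the kernel bcache documentation; treat it as a rough illustration rather than a tested tool.

```python
import os


def read_bcache_stats(bcache_dir="/sys/block/bcache0/bcache", window="stats_total"):
    """Read a few bcache counters for one backing device.

    bcache_dir and the bcache0 name are assumptions; window can also be
    stats_five_minute, stats_hour or stats_day.
    """
    def read(rel):
        with open(os.path.join(bcache_dir, rel)) as f:
            return f.read().strip()

    hits = int(read(os.path.join(window, "cache_hits")))
    misses = int(read(os.path.join(window, "cache_misses")))
    total = hits + misses
    return {
        # Share of the cache bcache tries to keep dirty (default 10)
        "writeback_percent": int(read("writeback_percent")),
        # Human-readable amount of dirty data, e.g. "400.0M"
        "dirty_data": read("dirty_data"),
        "cache_hits": hits,
        "cache_misses": misses,
        # Recomputed from the raw counters to keep the fraction
        "hit_ratio_percent": 100.0 * hits / total if total else 0.0,
    }
```

Calling read_bcache_stats() for each bcache device and comparing the stats_hour and stats_total windows should show whether the hit rate is still climbing or the cache is simply too small.
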
> 
> I think I owe you another graph later when I put all my VMs on there
> (probably finally fixed my rbd snapshot hanging VM issue ...worked around it
> by disabling exclusive-lock,object-map,fast-diff). The bandwidth hungry ones
> (which hung the most often) were moved shortly after the bcache change,
> and it's hard to explain how it affects the graphs... easier to see with iostat
> while changing it and having a mix of cache and not than ganglia afterwards.

Please do, I can't resist a nice graph. What I would be really interested in is answers to these questions, if you can:

1. Has your per-disk bandwidth gone up due to removing random writes? I.e. I struggle to get more than about 50MB/s of writes per disk due to the extra random IO per request.
2. Any feeling on how it helps with dentry/inode lookups? As mentioned above, I'm using 8TB disks, and cold data carries an extra penalty for reads/writes as it has to look up the FS data first.
3. I assume that with the 4.9 kernel you don't have the bcache fix that allows partitions. What method are you using to create OSDs?
4. As mentioned above, any stats on the percentage of MB/s that is hitting your cache device vs the journal (assuming the journal is 100% of IO)? This is to calculate the extra wear.
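
For question 4, one rough way to get the number is to compare the sectors-written counters in /proc/diskstats for the journal partition and the bcache cache partition over the same interval. A minimal Python sketch, assuming hypothetical partition names (nvme0n1p1 for a journal, nvme0n1p5 for the cache; substitute your own layout) and treating the journal as 100% of client write IO:

```python
SECTOR_BYTES = 512  # /proc/diskstats counts in 512-byte sectors


def sectors_written(diskstats_text):
    """Map device name -> cumulative sectors written."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 10:
            # fields: major, minor, name, then the counters;
            # index 9 is "sectors written"
            result[fields[2]] = int(fields[9])
    return result


def cache_vs_journal(before, after, journal_dev, cache_dev):
    """Return (journal_bytes, cache_bytes, cache_percent_of_journal)
    for the interval between two /proc/diskstats snapshots."""
    b, a = sectors_written(before), sectors_written(after)
    journal = (a[journal_dev] - b[journal_dev]) * SECTOR_BYTES
    cache = (a[cache_dev] - b[cache_dev]) * SECTOR_BYTES
    percent = 100.0 * cache / journal if journal else float("nan")
    return journal, cache, percent


def read_diskstats(path="/proc/diskstats"):
    """Take one snapshot of the kernel's per-device counters."""
    with open(path) as f:
        return f.read()
```

Taking one read_diskstats() snapshot, waiting a representative period (an hour of normal load, say), taking a second one and passing both to cache_vs_journal() gives the percentage. With several journal partitions the journal deltas would need summing first. Note that the cache partition counter also includes data bcache writes when it populates the read cache, so it somewhat overstates the wear caused by writeback alone.
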

Thanks,
Nick

> 
> Peter
> 
> On 03/23/17 21:18, Nick Fisk wrote:
> Hi Peter,
> 
> Interesting graph. Out of interest, when you use bcache, do you then just
> leave the journal collocated on the combined bcache device and rely on the
> writeback to provide journal performance, or do you still create a separate
> partition on whatever SSD/NVME you use, effectively giving triple write
> overhead?
> 
> Nick
> 
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Peter Maloney
> Sent: 22 March 2017 10:06
> To: Alex Gorbachev mailto:ag@xxxxxxxxxxxxxxxxxxx; ceph-users
> mailto:ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Preconditioning an RBD image
> 
> Does iostat (eg.  iostat -xmy 1 /dev/sd[a-z]) show high util% or await during
> these problems?
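
As a rough illustration of what those two iostat columns mean, the Python sketch below derives approximate await and util% figures from two /proc/diskstats snapshots. It is a simplification under stated assumptions (classic 11-counter layout, discard and flush counters on newer kernels ignored), not a replacement for iostat itself.

```python
def parse_diskstats(text):
    """Map device name -> (ios_completed, ms_spent_on_ios, ms_busy)."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue
        ios = int(f[3]) + int(f[7])     # reads + writes completed
        io_ms = int(f[6]) + int(f[10])  # ms spent reading + writing
        busy_ms = int(f[12])            # ms the device had IO in flight
        stats[f[2]] = (ios, io_ms, busy_ms)
    return stats


def await_and_util(before, after, interval_ms):
    """Return {device: (await_ms, util_percent)} between two snapshots
    taken interval_ms milliseconds apart."""
    b, a = parse_diskstats(before), parse_diskstats(after)
    out = {}
    for dev in a:
        if dev not in b:
            continue
        d_ios = a[dev][0] - b[dev][0]
        d_io_ms = a[dev][1] - b[dev][1]
        d_busy = a[dev][2] - b[dev][2]
        # await: average time per completed request, queueing included
        await_ms = d_io_ms / d_ios if d_ios else 0.0
        # util: share of the interval the device was busy
        util = 100.0 * d_busy / interval_ms
        out[dev] = (await_ms, util)
    return out
```

An HDD that sits near 100% util with await in the tens of milliseconds while throughput stays low is the pattern to look for; the iostat command above prints both columns directly.
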
> 
> Ceph filestore requires lots of metadata writing (directory splitting for
> example), xattrs, leveldb, etc. which are small sync writes that HDDs are bad
> at (100-300 iops), and SSDs are good at (cheapo would be 6k iops, and not so
> crazy DC/NVMe would be 20-200k iops and more). So in theory, these things
> are mitigated by using an SSD, like bcache on your osd device. You could also
> try something like that, at least to test.
> 
> I have tested with bcache in writeback mode and found hugely obvious
> differences seen by iostat, for example here's my before and after (heavier
> load due to converting week 49-50 or so, and the highest spikes being the
> scrub infinite loop bug in 10.2.3):
> 
> http://xo4t.mj.am/lnk/AEQAIfcunacAAAAAAAAAAF3gdw0AADNJBWwAAAAA
> AACRXwBY1DvAh8uPYe5LRJaO473StLfNWAAAlBI/1/_KQLt2QHZGOUvRQTr45
> 7rQ/aHR0cDovL3d3dy5icm9ja21hbm4tY29uc3VsdC5kZS9nYW5nbGlhL2dyYXB
> oLnBocD9jcz0xMCUyRjI1JTJGMjAxNisxMCUzQTI3JmNlPTAzJTJGMDklMkYyMD
> E3KzE3JTNBMjYmej14bGFyZ2UmaHJlZw[]=ceph.*&mreg[]=sd[c-
> z]_await&glegend=show&aggregate=1&x=100
> 
> But when you share a cache device, you get a single point of failure (and
> bcache, like all software, can be assumed to have bugs too). And I
> recommend vanilla kernel 4.9 or later which has many bcache fixes, or
> Ubuntu's 4.4 kernel which has the specific fixes I checked for.
> 
> On 03/21/17 23:22, Alex Gorbachev wrote:
> I wanted to share a recent experience, in which a few RBD volumes,
> formatted as XFS and exported via Ubuntu NFS-kernel-server, performed
> poorly and even generated "out of space" warnings on a nearly empty
> filesystem.  I tried a variety of hacks and fixes to no effect, until things started
> magically working just after some dd write testing.
> 
> The only explanation I can come up with is that preconditioning, or
> thickening, the images with this benchmarking is what caused the
> improvement.
> 
> Ceph is Hammer 0.94.7 running on Ubuntu 14.04, kernel 4.10 on OSD nodes
> and 4.4 on NFS nodes.
> 
> Regards,
> Alex
> Storcium
> --
> --
> Alex Gorbachev
> Storcium
> 
> 
> --
> 
> --------------------------------------------
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: mailto:peter.maloney@xxxxxxxxxxxxxxxxxxxx
> Internet:
> http://xo4t.mj.am/lnk/AEQAIfcunacAAAAAAAAAAF3gdw0AADNJBWwAAAAA
> AACRXwBY1DvAh8uPYe5LRJaO473StLfNWAAAlBI/3/nNYiN8Wg-
> QCZi0bq10AfKQ/aHR0cDovL3d3dy5icm9ja21hbm4tY29uc3VsdC5kZQ
> --------------------------------------------
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com