Re: Shared WAL/DB device partition for multiple OSDs?

As for whether you should put only the WAL on the NVMe vs. use a filestore journal, that depends on your write patterns, use case, etc.  In my clusters with 10TB disks I use 2GB partitions for the WAL and leave the DB on the HDD with the data.  Those are archival RGW use cases and that works fine for the throughput.  The pain of filestore subfolder splitting is too severe for us to consider using our 10TB disks with filestore and journals, but we have hundreds of millions of tiny objects.
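
Something like this is roughly what a WAL-only layout looks like with ceph-volume (just a sketch; the device names are placeholders, not from any real setup):

     # Hypothetical devices: /dev/sdb is the 10TB HDD, /dev/nvme0n1p2 is a
     # small (~2GB) NVMe partition used only for the WAL; data and DB stay
     # on the HDD.
     ceph-volume lvm create --bluestore \
         --data /dev/sdb \
         --block.wal /dev/nvme0n1p2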

The WAL is pretty static, so putting the DB and WAL on the same device wouldn't be a problem even if the DB fills up the device.  I'm fairly certain that ceph prioritizes things such that the WAL won't spill over at all and only the DB will go over to the HDD.  I didn't want to deal with speed differentials between OSDs; troubleshooting the slow requests that would come out of that just sounds awful.
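
If you ever want to check whether a DB has spilled onto the slow device, the bluefs perf counters should show it (a sketch; osd.0 is just an example, and I believe the relevant counter is slow_used_bytes, but check on your version):

     # Rough check on the node hosting osd.0: a non-zero slow_used_bytes in the
     # bluefs section suggests RocksDB data has spilled over to the slow (HDD) device.
     ceph daemon osd.0 perf dump | grep -E 'slow_used_bytes|db_used_bytes|wal_used_bytes'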

With ceph-volume the partition type id doesn't matter at all.  I honestly don't know what the ids of my WAL partitions are.  That was one of the goals of ceph-volume: to remove all of the magic ids that used to be required everywhere for things to start up on system boot.  It's a lot more deterministic, with fewer things like partition type ids needing to all be perfect.
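
If you want to see how an OSD maps to its devices, ceph-volume can tell you directly (sketch; output omitted):

     # ceph-volume keeps the osd id/fsid and device roles in LVM tags and its
     # own metadata, so nothing depends on GPT partition type GUIDs being set.
     ceph-volume lvm list
     # At boot the OSDs are brought up from that stored metadata, e.g.:
     ceph-volume lvm activate --all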

On Fri, May 11, 2018 at 2:14 PM Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx> wrote:
Dear David,

thanks a lot for the detailed answer(s) and clarifications!
Can I ask just a few more questions?

On 11.05.2018 18:46, David Turner wrote:
> partitions is 10GB per 1TB of OSD.  If your OSD is a 4TB disk you should
> be looking closer to a 40GB block.db partition.  If your block.db
> partition is too small, then once it fills up it will spill over onto
> the data volume and slow things down.

Ooops ... I have 15 x 10 TB disks in the servers, and one Optane
SSD for all of them - so I don't have 10GB SSD per TB of HDD. :-(
Will I still get a speed-up if only part of the block.db fits? Or
should I use the SSD for WAL only? Or even use good old filestore
with 10GB journals, instead of bluestore?
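
Just to put numbers on the 10GB-per-TB rule of thumb for my case (a back-of-the-envelope sketch):

     # 15 OSDs x 10 TB x 10 GB/TB = 1500 GB of block.db per server,
     # far more than a single Optane SSD holds, so only part of each DB fits.
     echo $(( 15 * 10 * 10 ))   # -> 1500 (GB)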


>> And just to make sure - if I specify "--osd-db", I don't need
>> to set "--osd-wal" as well, since the WAL will end up on the
>> DB partition automatically, correct?
> This is correct.  The wal will automatically be placed on the db if not
> otherwise specified.

Would there still be any benefit to having separate WAL
and DB partitions (so that the DB doesn't compete
with the WAL for space, or something like that)?
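
In other words, would something like this (made-up partition names, just to illustrate) gain anything over a single --block.db partition?

     # Sketch: explicit, separate DB and WAL partitions on the NVMe
     ceph-volume lvm create --bluestore \
         --data /dev/sdb \
         --block.db /dev/nvme0n1p2 \
         --block.wal /dev/nvme0n1p3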


> I don't use ceph-deploy, but the process for creating the OSDs should be
> something like this.  After the OSDs are created it is a good idea to
> make sure that the OSD is not looking for the db partition with the
> /dev/nvme0n1p2 distinction as that can change on reboots if you have

Yes, I just put that in as an example. I had thought about creating
the partitions with

     sgdisk -n 0:0:+10G -t 0:8300 -c 0:"osd-XYZ-db" -- /dev/nvme0n1

and then use "/dev/disk/by-partlabel/osd-XYZ-db" (or the partition
UUIDs) for "ceph-volume ...". Thanks for the tip about checking
the symlinks! Btw, is "-t 0:8300" OK? I guess the type number won't
really matter, though?
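
So the full sequence I have in mind would be roughly this (osd-XYZ-db and /dev/sdb are placeholders):

     # Create a labelled DB partition on the NVMe, then reference it via its
     # stable /dev/disk/by-partlabel path when creating the OSD.
     sgdisk -n 0:0:+10G -t 0:8300 -c 0:"osd-XYZ-db" -- /dev/nvme0n1
     ceph-volume lvm create --bluestore \
         --data /dev/sdb \
         --block.db /dev/disk/by-partlabel/osd-XYZ-db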


Cheers,

Oliver
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
