Re: Bluestore "separate" WAL and DB (and WAL/DB size?)

Hi Richard,

Regarding recovery speed, have you looked through any of Neha's results on recovery sleep testing earlier this summer?

https://www.spinics.net/lists/ceph-devel/msg37665.html

She tested bluestore and filestore under a couple of different scenarios. The gist of it is that time to recover changes pretty dramatically depending on the sleep setting.
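
If it's useful for comparison, a quick way to see what sleep values a given OSD is actually running with, and to try a different value for a test, is something like this (assuming you have admin socket access on the OSD hosts; revert the override afterwards):

ceph daemon osd.0 config show | grep osd_recovery_sleep   # current values on a local OSD
ceph tell osd.* injectargs '--osd_recovery_sleep 0'       # temporary cluster-wide override for testing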

I don't recall if you said earlier, but are you comparing filestore and bluestore recovery performance on the same version of ceph with the same sleep settings?

Mark

On 09/12/2017 05:24 AM, Richard Hesketh wrote:
Thanks for the links. They largely confirm that I haven't horribly misunderstood anything and haven't been doing anything obviously wrong while converting my disks: there's no point specifying separate WAL/DB partitions if they're going to end up on the same device; throw as much space as you have available at the DB partitions and they'll use whatever they can; and significantly reduced I/O on the DB/WAL device compared to Filestore is expected, since Bluestore has eliminated as much of the write amplification as possible.
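
For what it's worth, the way I've been sanity-checking how much of the DB device is actually in use on a converted OSD is via the bluefs perf counters (assuming they're present in your build; N is a local OSD id):

ceph daemon osd.N perf dump | grep -A 20 '"bluefs"'   # look at db_total_bytes, db_used_bytes and slow_used_bytes

If my reading is right, slow_used_bytes climbing above zero would mean the DB has spilled over onto the data disk.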

I'm still seeing much reduced recovery speed on my newly Bluestored cluster, but I guess that's a tuning issue rather than evidence of catastrophe.

Rich

On 12/09/17 00:13, Brad Hubbard wrote:
Take a look at these which should answer at least some of your questions.

http://ceph.com/community/new-luminous-bluestore/

http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/

On Mon, Sep 11, 2017 at 8:45 PM, Richard Hesketh
<richard.hesketh@xxxxxxxxxxxx> wrote:
On 08/09/17 11:44, Richard Hesketh wrote:
Hi,

Reading the ceph-users list I'm obviously seeing a lot of people talking about using bluestore now that Luminous has been released. I note that many users seem to be under the impression that they need separate block devices for the bluestore data block, the DB, and the WAL... even when they are going to put the DB and the WAL on the same device!

As per the docs at http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ this is nonsense:

If there is only a small amount of fast storage available (e.g., less than a gigabyte), we recommend using it as a WAL device. If there is more, provisioning a DB device makes more sense. The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fix). [sic, I assume that should be "fit"]

I understand that if you've got three speeds of storage available, there may be some sense to dividing these. For instance, if you've got lots of HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD, DB on SSD and WAL on NVMe may be a sensible division of data. That's not the case for most of the examples I'm reading; they're talking about putting DB and WAL on the same block device, but in different partitions. There's even one example of someone suggesting to try partitioning a single SSD to put data/DB/WAL all in separate partitions!
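
For clarity, that kind of three-way split is the only case where I can see an explicit WAL specification making sense, and (device names hypothetical) it would look something like:

ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY --block.wal /dev/nvme0n1 # data on HDD, DB on SSD, WAL on NVMe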

Are the docs wrong and/or am I missing something about optimal bluestore setup, or do people simply have the wrong end of the stick? I ask because I'm just going through switching all my OSDs over to Bluestore now; I've been reusing the partitions I set up for journals on my SSDs as DB devices for the Bluestore HDDs, without specifying anything to do with the WAL, and I'd like to know sooner rather than later if I'm making some sort of horrible mistake.

Rich

Having had no explanatory reply so far, I'll ask further...

I have been continuing to update my OSDs and so far the performance offered by bluestore has been somewhat underwhelming. Recovery operations after replacing the Filestore OSDs with Bluestore equivalents have been much slower than expected, not even half the speed of recovery ops when I was upgrading Filestore OSDs with larger disks a few months ago. This contributes to my sense that I am doing something wrong.

I've found that if I let ceph-disk partition my DB SSDs, rather than reusing the rather large journal partitions I originally created for Filestore, it only creates very small 1GB partitions. Searching for bluestore configuration parameters pointed me towards the bluestore_block_db_size and bluestore_block_wal_size settings. Unfortunately these are completely undocumented, so I'm not sure what their functional purpose is. In any event, my running config seems to have the following default values:

# ceph-conf --show-config | grep bluestore
...
bluestore_block_create = true
bluestore_block_db_create = false
bluestore_block_db_path =
bluestore_block_db_size = 0
bluestore_block_path =
bluestore_block_preallocate_file = false
bluestore_block_size = 10737418240
bluestore_block_wal_create = false
bluestore_block_wal_path =
bluestore_block_wal_size = 100663296
...

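If I wanted ceph-disk to carve out bigger partitions itself, my assumption (going purely by the option names, since they're undocumented) is that something like this in ceph.conf before running prepare would do it:

[osd]
bluestore_block_db_size = 32212254720    # 30GB DB partition - size picked arbitrarily for illustration
bluestore_block_wal_size = 1073741824    # 1GB WAL partition
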
I have been creating bluestore osds by:

ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY1 --osd-id Z # re-using existing partitions for DB
or
ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY --osd-id Z # letting ceph-disk partition DB, after zapping original partitions

Are these sane values? What does it mean that block_db_size is 0 - is it just using the entire block device specified or not actually using it at all? Is the WAL actually being placed on the DB block device? And is that 1GB default really a sensible size for the DB partition?
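
For reference, the only check I've thought to do so far is to look at the symlinks in the OSD data directory (Z being the OSD id); my assumption is that if there's a block.db symlink pointing at the SSD partition but no block.wal symlink, the WAL is sharing the DB device:

ls -l /var/lib/ceph/osd/ceph-Z/ | grep block   # shows which devices block and block.db point at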

Rich



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
