On Fri, 20 May 2016, Mark Nelson wrote:
> On 05/20/2016 05:08 AM, Sage Weil wrote:
> > On Fri, 20 May 2016, Ramesh Chander wrote:
> > > Thanks Sage and Allen,
> > >
> > > > > Unless we want to make bluestore smart enough to push object data on a
> > > > > fast device (i.e., do ssd/hdd tiering internally), I'm not sure we
> > > > > need per-device min_alloc_size.
> > >
> > > If we make it per block device it becomes even simpler to set it at
> > > device open time and read it whenever required.
> > >
> > > In most places we already read the block size from
> > > bdev->get_block_size(); this new one would go along with it.
> > >
> > > I think the g_conf->* options are read-only parameters, so to circumvent
> > > that I need to set this info in the BlueStore structure or globally
> > > somewhere.
> > >
> > > Whatever you suggest is fine.
> >
> > FWIW the github.com/liewebas/wip-bluestore-write branch already moves
> > block_size and min_alloc_size to BlueStore class members so that it's not
> > always pulling them out of g_conf and bdev.
> >
> > > > > Don't worry about legacy at all since bluestore has no users. :)
> > >
> > > That simplifies it and I can simply remove it. Or do we still need to
> > > keep the old parameter around and make that take precedence over the two
> > > new ones? I mean, the old option is applicable to both, so we need to
> > > break the tie between the new specific and the old general parameter.
> >
> > I think having bluestore_min_alloc_size, bluestore_min_alloc_size_hdd, and
> > bluestore_min_alloc_size_ssd still makes it easier to change for users.
> > It'll only go in one bit of code that updates the BlueStore min_alloc_size
> > member.
>
> If we really want to go down this road, would it make sense to create storage
> class templates rather than global configuration parameters? Presumably you
> might want different compression, read ahead, or writeback caching depending
> on the device class as well.

That sounds appealing.  How would it work?

sage

> Mark
>
> >
> > Perhaps you can base this PR on the wip-bluestore-write branch.  It's
> > getting rebased still frequently but I think it's less than a
> > week away from being mergeable.
> >
> > > > > Currently it is transient everywhere, and so far I've been trying to
> > > > > keep it that way.  However, we might want to change this: if we make
> > > > > min_alloc_size fixed at mkfs time, we could possibly collapse down the
> > > > > size of the allocation bitmap(s) by a factor of 16 on HDD (1 bit per
> > > > > min_alloc_size instead of per block).  I'm not sure that it's worth
> > > > > it, though... thoughts?
> > > >
> > > > Collapsing the bitmap provides little DRAM savings and probably not much
> > > > CPU time savings (though some additional (low risk) coding might be
> > > > required to make this statement true), so I don't see much point in it.
> > > > Seems like extra complexity with little value.
> > >
> > > I think as long as our min alloc size does not reduce from the previous
> > > value, or our bitmap vector has a bit per minimum possible value of
> > > min_alloc_size, we are good. But as Allen said, we will not have
> > > significant savings from this as far as cost (cpu and dram) is concerned.
> >
> > Yeah, let's not worry about it then.
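(For illustration only: the "one bit of code" mentioned above could look
roughly like the sketch below. The option and accessor names --
bluestore_min_alloc_size*, bdev->is_rotational(), _set_min_alloc_size() --
are assumptions for the example, not necessarily the actual BlueStore code.)

    void BlueStore::_set_min_alloc_size()
    {
      if (g_conf->bluestore_min_alloc_size) {
        // an explicitly set (legacy) value wins over the per-media defaults
        min_alloc_size = g_conf->bluestore_min_alloc_size;
      } else if (bdev->is_rotational()) {
        min_alloc_size = g_conf->bluestore_min_alloc_size_hdd;   // e.g. 64k
      } else {
        min_alloc_size = g_conf->bluestore_min_alloc_size_ssd;   // e.g. 4k
      }
    }

With the old option defaulting to 0, as proposed further down in the thread,
existing configs that set bluestore_min_alloc_size explicitly would keep
behaving as before, while fresh installs pick the hdd/ssd default
automatically.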
> >
> > sage
> > >
> > > -Ramesh Chander
> > >
> > > -----Original Message-----
> > > From: Allen Samuels
> > > Sent: Friday, May 20, 2016 2:00 AM
> > > To: Sage Weil; Ramesh Chander
> > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: RE: Min alloc size according to media type
> > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > > > owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > > > Sent: Thursday, May 19, 2016 12:20 PM
> > > > To: Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>
> > > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > Subject: Re: Min alloc size according to media type
> > > >
> > > > On Thu, 19 May 2016, Ramesh Chander wrote:
> > > > > Hi Sage,
> > > > >
> > > > > I am doing changes in Bluestore related to minimum allocation size
> > > > > according to ssd and hdd. This change involves:
> > > > >
> > > > > 1. There are three min alloc sizes now:
> > > > >        a. min_alloc_size: old one, default changed to 0
> > > > >        b. min_alloc_size_hdd: for rotational media, default 64k
> > > > >        c. min_alloc_size_ssd: for ssd, default 4k.
> > > > >
> > > > > 2. Making changes in BlockDevice to maintain its own min_alloc_size.
> > > > >    This allows maintaining a different min_alloc_size for different
> > > > >    devices.
> > > > >
> > > > > 3. Making changes in the allocator (stupid, bitmap) interfaces to take
> > > > >    min_alloc_size from the corresponding devices.
> > > >
> > > > This makes sense if some devices are hdd and some are ssd (e.g., main
> > > > vs db/wal), but in practice the only separation currently possible is
> > > > to have a separate device for the WAL and for rocksdb, both of which are
> > > > managed by bluefs and not bluestore directly. And bluefs currently
> > > > has a min_alloc_size of 1MB since all files are generally big (usually
> > > > 4MB each), there are no random writes, etc.
> > > >
> > > > Unless we want to make bluestore smart enough to push object data on a
> > > > fast device (i.e., do ssd/hdd tiering internally), I'm not sure we
> > > > need per-device min_alloc_size.
> > >
> > > I think this is / will be valuable -- in the future. I don't see that this
> > > item significantly simplifies the future problem.
> > >
> > > > > I have the following questions regarding this parameter and its use in
> > > > > bluestore:
> > > > >
> > > > > 1. I assume this parameter is transient, and that different values
> > > > >    (say changed from 4k to 64k or vice versa) across reboots or
> > > > >    different ceph versions have no effect?
> > > > >    Is it ondisk anywhere, in metadata or in the freelist manager,
> > > > >    in a direct or indirect manner? Because an on-disk presence could
> > > > >    cause confusion from the new options when existing users move to a
> > > > >    build with this change.
> > > >
> > > > Currently it is transient everywhere, and so far I've been trying to
> > > > keep it that way.  However, we might want to change this: if we make
> > > > min_alloc_size fixed at mkfs time, we could possibly collapse down the
> > > > size of the allocation bitmap(s) by a factor of 16 on HDD (1 bit per
> > > > min_alloc_size instead of per block).  I'm not sure that it's worth
> > > > it, though... thoughts?
> > >
> > > Collapsing the bitmap provides little DRAM savings and probably not much
> > > CPU time savings (though some additional (low risk) coding might be
> > > required to make this statement true), so I don't see much point in it.
> > > Seems like extra complexity with little value.
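(To put rough numbers on the factor-of-16 point above -- these figures are
illustrative, not from the thread: with 4k blocks, a 4TB HDD has about 10^9
blocks, so a 1-bit-per-block allocation bitmap is on the order of 120MB;
tracking 1 bit per 64k min_alloc_size unit instead divides that by
64k / 4k = 16, down to roughly 7-8MB. A saving of that size per device is
real but modest, which is the "little DRAM savings" argument.)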
> > > > >
> > > > > 2. While figuring out the min_alloc_size for devices, I give
> > > > >    preference to the old config parameter so that existing configs
> > > > >    are not changed by this code change. Is this right, or is it not
> > > > >    required?
> > > >
> > > > Don't worry about legacy at all since bluestore has no users. :)
> > > >
> > > > sage
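(For reference, a minimal sketch of what items 2 and 3 of the proposal above
could look like. The method names and the way the rotational flag is passed
in are assumptions for the example, not the actual BlockDevice or Allocator
interfaces.)

    #include <cstdint>

    class BlockDevice {
      uint64_t block_size = 4096;
      uint64_t min_alloc_size = 0;
    public:
      // decided once, at open time, from the media type of this device
      void set_min_alloc_size(bool rotational) {
        min_alloc_size = rotational ? 64 * 1024   // hdd default
                                    : 4 * 1024;   // ssd default
      }
      uint64_t get_min_alloc_size() const { return min_alloc_size; }
      uint64_t get_block_size() const { return block_size; }
    };

    // An allocator (stupid or bitmap) would then take its allocation
    // granularity from the device it serves rather than from a single
    // global option, e.g.:
    //
    //   alloc->init(bdev->get_size(), bdev->get_min_alloc_size());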