Instead of guessing, I took a look at one of my OSDs.

TL;DR: I’m going to bump the inode size to 512, which should fit the majority of xattrs; no need to touch the filestore parameters.

Short news first - I can’t find a file with more than 2 xattrs (and that’s good).

Then I extracted all the xattrs on all the ~100K files, counted their size and counted the occurrences. The largest xattrs I have are 705 chars in base64 (so let’s say it’s half), and that particular file has about 512B total in xattrs (that’s more than was expected with an RBD-only workload, right?)

# file: var/lib/ceph/osd/ceph-55//current/4.1ad7_head/rbd134udata.1a785181f15746a.000000000005a578__head_E5C51AD7__4
117
user.ceph._=0sCwjyAAAABANKAAAAAAAAACkAAAByYmRfZGF0YS4xYTc4NTE4MWYxNTc0NmEuMDAwMDAwMDAwMDA1YTU3OP7/////////1xrF5QAAAAAABAAAAAAAAAAFAxQAAAAEAAAAAAAAAP////8AAAAAAAAAAAAAAADrEKMAAAAAADB2DQAiDaMA AAAAAG11DQACAhUAAAAI1xSoAQAAAAD9CwAMAAAAAAAAAAAAAEAAAAAAABAgpFWoa6QVAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA6xCjAAAAAAAwdg0AAAAAAAA=
347
user.ceph.snapset=0sAgL5AQAAgt8HAAAAAAABBgAAAILfBwAAAAAAb94HAAAAAAC23AcAAAAAAEnPBwAAAAAA470HAAAAAAB4ugcAAAAAAAQAAAC1ugcAAAAAAOO9BwAAAAAAStAHAAAAAACC3wcAAAAAAAQAAAC1ugcAAAAAAAQAAAAAAAAAAAAAAABQFAAAAAAAAGAUAAAAAAAAwAoAAAAAAAAwHwAAAAAAAJAZAAAAAAAA4DgAAAAAAAAgBwAAAAAA470HAAAAAAAFAAAAAAAAAAAAAAAAEA8AAAAAAAAgDwAAAAAAACAFAAAAAAAASBQAAAAAAABADgAAAAAAAJAiAAAAAAAAoAIAAAAAAAA4JQAAAAAAAMgaAAAAAABK0AcAAAAAAAQAAAAAAAAAAAAAAADgAQAAAAAAAOgBAAAAAAAAeCYAAAAAAACAKAAAAAAAAHAAAAAAAAAAACkAAAAAAAAAFwAAAAAAgt8HAAAAAAAFAAAAAAAAAAAAAAAAoAEAAAAAAADAAQAAAAAAAIAMAAAAAAAAUA4AAAAAAAAQBgAAAAAAAIAUAAAAAAAA4AAAAAAAAACAFQAAAAAAAIAqAAAAAAAEAAAAtboHAAAAAAAAAEAAAAAAAOO9BwAAAAAAAABAAAAAAABK0AcAAAAAAAAAQAAAAAAAgt8HAAAAAAAAAEAAAAAAAA==
705

(If anyone wants to enlighten me on the contents that would be great - is this expected to grow much?)

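As a rough cross-check of the "let’s say it’s half" estimate: base64 packs 3 bytes into 4 characters, so the decoded value is closer to three quarters of the character count than to half. Below is a minimal sketch, assuming input lines in the name=0s<base64> form that getfattr prints with base64 encoding. The function name xattr_value_size and the sample attribute user.example are made up for illustration and are not part of any Ceph or attr tooling.

```python
import base64


def xattr_value_size(line: str) -> int:
    """Return the decoded size in bytes of one 'name=0s<base64>' line."""
    _name, _, value = line.partition("=")
    if not value.startswith("0s"):
        raise ValueError("expected a base64 value prefixed with 0s")
    # Drop whitespace that a mail client may have inserted when wrapping.
    payload = "".join(value[2:].split())
    return len(base64.b64decode(payload))


# Hypothetical attribute holding 100 bytes, not data from a real OSD.
sample = "user.example=0s" + base64.b64encode(b"x" * 100).decode()
print(len(sample), xattr_value_size(sample))  # prints: 151 100
```

Summing xattr_value_size over all attribute lines of one file gives the payload that has to fit, to which the attribute names and per-entry overhead still need to be added.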
BUT most of the files have much smaller xattrs, and if I researched it correctly, ext4 uses the free space in the inode (which should be something like inode_size-128-28=free), and if that’s not enough it will allocate one more block.

In other words, if I format ext4 with 2048 inode size and 4096 block size, there will be 2048-(128+28)=1892 bytes available in the inode, and 4096 bytes can be allocated from another block. With the default format, there will be just 256-(128+28)=100 bytes in the inode + 4096 bytes in another block.

In my case, the majority of the files have xattr size <200B, which is more than fits inside a default inode but not really that large, so it should be beneficial to bump the inode size to 512B (that leaves a comfortable 356 bytes for xattrs).

Jan

> On 14 Jul 2015, at 12:18, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>
> On Tue, Jul 14, 2015 at 10:53 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> Thank you for your reply.
>> Comments inline.
>>
>> I’m still hoping to get some more input, but there are many people running ceph on ext4, and it sounds like it works pretty well out of the box. Maybe I’m overthinking this, then?
>
> I think so; somebody did a lot of work making sure we were well-tuned
> on the standard filesystems; I believe it was David.
> -Greg
>
>>
>> Jan
>>
>>> On 13 Jul 2015, at 21:04, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>>
>>> <<inline
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>>> Sent: Monday, July 13, 2015 2:32 AM
>>> To: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re: xattrs vs omap
>>>
>>> Sorry for reviving an old thread, but could I get some input on this, pretty please?
>>>
>>> ext4 has 256-byte inodes by default (at least according to docs) but the fragment below says:
>>> OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
>>>
>>> The default 512b is too much if the inode is just 256b, so shouldn’t that be 256b in case people use the default ext4 inode size?
>>>
>>> Anyway, is it better to format ext4 with larger inodes (say 2048b) and set filestore_max_inline_xattr_size_other=1536, or leave it at defaults?
>>> [Somnath] Why 1536? Why not 1024 or any power of 2? I am not seeing any harm though, but, curious.
>>
>> AFAIK there is other information in the inode besides xattrs; also you need to count the xattr labels into this - so if I want to store 1536B of “values” it would cost more, and there still needs to be some space left.
>>
>>> (As I understand it, on ext4 xattrs are limited to one block, inode size + something can spill to one different inode - maybe someone knows better).
>>>
>>>
>>> [Somnath] The xattr size ("_") is now more than 256 bytes and it will spill over, so a bigger inode size will be good. But I would suggest doing your benchmark before putting it into production.
>>>
>>
>> Good point and I am going to do that, but I’d like to avoid the guesswork. Also, not all patterns are always replicable….
>>
>>> Is filestore_max_inline_xattr_size an absolute limit, or is it filestore_max_inline_xattr_size*filestore_max_inline_xattrs in reality?
>>>
>>> [Somnath] The *_size is tracking the xattr size per attribute and *inline_xattrs keeps track of the max number of inline attributes allowed. So, if an xattr size is > *_size, it will go to omap, and also if the total number of xattrs > *inline_xattrs, it will go to omap.
>>> If you are only using rbd, the number of inline xattrs will always be 2 and it will not cross that default max limit.
>>
>> If I’m reading this correctly then with my setting of filestore_max_inline_xattr_size_other=1536, it could actually consume 3072B (2 xattrs), so I should in reality use 4K inodes…?
>>
>>
>>>
>>> Does OSD do the sane thing if for some reason the xattrs do not fit? What are the performance implications of storing the xattrs in leveldb?
>>>
>>> [Somnath] Even though I don't have the exact numbers, it has a significant overhead if the xattrs go to leveldb.
>>>
>>> And lastly - what size of xattrs should I really expect if all I use is RBD for OpenStack instances? (No radosgw, no cephfs, but heavy on rbd image and pool snapshots). This overhead is quite large
>>>
>>> [Somnath] It will be 2 xattrs: the default "_" will be a little bigger than 256 bytes, and "_snapset" is small, depending on the number of snaps/clones, but is unlikely to cross the 256-byte range.
>>
>> I have few pool snapshots and lots (hundreds) of (nested) snapshots for rbd volumes. Does this come into play somehow?
>>
>>>
>>> My plan so far is to format the drives like this:
>>> mkfs.ext4 -I 2048 -b 4096 -i 524288 -E stride=32,stripe-width=256
>>> (2048b inode, 4096b block size, one inode for 512k of space) and set filestore_max_inline_xattr_size_other=1536
>>> [Somnath] Not much idea on ext4, sorry..
>>>
>>> Does that make sense?
>>>
>>> Thanks!
>>>
>>> Jan
>>>
>>>
>>>
>>>> On 02 Jul 2015, at 12:18, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>>
>>>> Does anyone have a known-good set of parameters for ext4? I want to try it as well but I’m a bit worried what happens if I get it wrong.
>>>>
>>>> Thanks
>>>>
>>>> Jan
>>>>
>>>>
>>>>
>>>>> On 02 Jul 2015, at 09:40, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>>>>>> Behalf Of Christian Balzer
>>>>>> Sent: 02 July 2015 02:23
>>>>>> To: Ceph Users
>>>>>> Subject: Re: xattrs vs omap
>>>>>>
>>>>>> On Thu, 2 Jul 2015 00:36:18 +0000 Somnath Roy wrote:
>>>>>>
>>>>>>> It is replaced with the following config option..
>>>>>>>
>>>>>>> // Use omap for xattrs for attrs over
>>>>>>> // filestore_max_inline_xattr_size or
>>>>>>> OPTION(filestore_max_inline_xattr_size, OPT_U32, 0) //Override
>>>>>>> OPTION(filestore_max_inline_xattr_size_xfs, OPT_U32, 65536)
>>>>>>> OPTION(filestore_max_inline_xattr_size_btrfs, OPT_U32, 2048)
>>>>>>> OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
>>>>>>>
>>>>>>> // for more than filestore_max_inline_xattrs attrs
>>>>>>> OPTION(filestore_max_inline_xattrs, OPT_U32, 0) //Override
>>>>>>> OPTION(filestore_max_inline_xattrs_xfs, OPT_U32, 10)
>>>>>>> OPTION(filestore_max_inline_xattrs_btrfs, OPT_U32, 10)
>>>>>>> OPTION(filestore_max_inline_xattrs_other, OPT_U32, 2)
>>>>>>>
>>>>>>>
>>>>>>> If these limits are crossed, xattrs will be stored in omap..
>>>>>>>
>>>>>> Sounds fair.
>>>>>>
>>>>>> Since I only use RBD I don't think it will ever exceed this.
>>>>>
>>>>> Possibly, see my thread about performance difference between new and
>>>>> old pools. Still not quite sure what's going on, but for some reason
>>>>> some of the objects behind RBDs have larger xattrs, which is causing
>>>>> really poor performance.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Chibi
>>>>>>> For ext4, you can use either filestore_max*_other or
>>>>>>> filestore_max_inline_xattrs/ filestore_max_inline_xattr_size. In any
>>>>>>> case, the latter two will override everything.
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Christian Balzer [mailto:chibi@xxxxxxx]
>>>>>>> Sent: Wednesday, July 01, 2015 5:26 PM
>>>>>>> To: Ceph Users
>>>>>>> Cc: Somnath Roy
>>>>>>> Subject: Re: xattrs vs omap
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Wed, 1 Jul 2015 15:24:13 +0000 Somnath Roy wrote:
>>>>>>>
>>>>>>>> It doesn't matter, I think filestore_xattr_use_omap is a 'noop'
>>>>>>>> and not used in the Hammer.
>>>>>>>>
>>>>>>> Then what was this functionality replaced with, esp. considering
>>>>>>> EXT4 based OSDs?
>>>>>>>
>>>>>>> Chibi
>>>>>>>> Thanks & Regards
>>>>>>>> Somnath
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>>>>>>>> Behalf Of Adam Tygart
>>>>>>>> Sent: Wednesday, July 01, 2015 8:20 AM
>>>>>>>> To: Ceph Users
>>>>>>>> Subject: xattrs vs omap
>>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> I've got a coworker who put "filestore_xattr_use_omap = true" in
>>>>>>>> the ceph.conf when we first started building the cluster. Now he
>>>>>>>> can't remember why. He thinks it may be a holdover from our first
>>>>>>>> Ceph cluster (running dumpling on ext4, iirc).
>>>>>>>>
>>>>>>>> In the newly built cluster, we are using XFS with 2048 byte
>>>>>>>> inodes, running Ceph 0.94.2. It currently has production data in it.
>>>>>>>>
>>>>>>>> From my reading of other threads, it looks like this is probably
>>>>>>>> not something you want set to true (at least on XFS), due to
>>>>>>>> performance implications. Is this something you can change on a running cluster?
>>>>>>>> Is it worth the hassle?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Adam
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>>
>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>> message is intended only for the use of the designated
>>>>>>>> recipient(s) named above. If the reader of this message is not the
>>>>>>>> intended recipient, you are hereby notified that you have received
>>>>>>>> this message in error and that any review, dissemination,
>>>>>>>> distribution, or copying of this message is strictly prohibited.
>>>>>>>> If you have received this communication in error, please notify
>>>>>>>> the sender by telephone or e-mail (as shown above) immediately and
>>>>>>>> destroy any and all copies of this message in your possession
>>>>>>>> (whether hard copies or electronically stored copies).
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
>>>>>>> http://www.gol.com/
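
For anyone who wants to redo the arithmetic from this thread, here is a minimal sketch. It assumes the estimate used above (inode size minus 128 bytes for the base inode and 28 bytes of extra fields) and a simplified reading of Somnath's description of the inline limits. It is not FileStore's actual code and not taken from the ext4 sources; the function names are made up for illustration, and the default limits are the *_other values quoted in the thread.

```python
# Constants taken from the estimate in this thread, not from the ext4 sources.
BASE_INODE_BYTES = 128
EXTRA_FIELD_BYTES = 28


def in_inode_xattr_space(inode_size: int) -> int:
    """Bytes left for xattrs inside one inode, per the thread's estimate."""
    return inode_size - (BASE_INODE_BYTES + EXTRA_FIELD_BYTES)


def any_xattr_goes_to_omap(sizes, max_size=512, max_count=2) -> bool:
    """Simplified reading of the rule described in the thread: an xattr larger
    than max_size, or more xattrs than max_count, means omap gets used."""
    return len(sizes) > max_count or any(s > max_size for s in sizes)


for inode_size in (256, 512, 2048):
    # prints: 256 100, then 512 356, then 2048 1892
    print(inode_size, in_inode_xattr_space(inode_size))
```

Note that the two limits are independent of the inode size: an xattr can stay out of omap and still not fit in the inode, in which case ext4 uses the extra block mentioned above.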