Hello,

[reducing MLs to ceph-users]

On Wed, 13 Apr 2016 14:51:58 +0200 Michael Metz-Martini | SpeedPartner GmbH wrote:

> Hi,
>
> On 13.04.2016 at 04:29, Christian Balzer wrote:
> > On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner
> > GmbH wrote:
> >> On 11.04.2016 at 23:39, Sage Weil wrote:
> >>> ext4 has never been recommended, but we did test it. After Jewel is
> >>> out, we would like to explicitly recommend *against* ext4 and stop
> >>> testing it.
> >> Hmmm. We're currently migrating away from xfs as we had some strange
> >> performance issues which were resolved / got better by switching to
> >> ext4. We think this is related to our high number of objects (4358
> >> Mobjects according to ceph -s).
> > It would be interesting to see how this maps out to the OSDs/PGs.
> > I'd guess loads and loads of subdirectories per PG, which is probably
> > where Ext4 performs better than XFS.
> A simple "ls -l" takes ages on XFS while ext4 lists a directory
> immediately. According to our findings this seems to be "normal"
> behavior for XFS.
>
Just for the record, this is also influenced (for Ext4 at least) by how
much memory you have and the "vm/vfs_cache_pressure" setting.
Once Ext4 runs out of space in SLAB for dentry and ext4_inode_cache
(amongst others) it will become slower as well, since it has to go to
the disk.

Another thing to remember is that "ls" by itself is also a LOT faster
than "ls -l", since it accesses less data.

> pool name       category              KB      objects
> data            -                   3240   2265521646
> document_root   -                 577364        10150
> images          -            96197462245   2256616709
> metadata        -                1150105     35903724
> queue           -              542967346       173865
> raw             -            36875247450     13095410
>
> total of 4736 pgs, 6 pools, 124 TB data, 4359 Mobjects
>
> What would you like to see?
> A tree? du per directory?
>
Just an example tree and the typical size of the first "data layer".

For example, on my very lightly loaded/filled test cluster (45000
objects) the actual objects are in the "top" directory of the PG in
question, like this:
---
ls -lah /var/lib/ceph/osd/ceph-3/current/2.fa_head/
total 289M
drwxr-xr-x   2 root root 8.0K Mar  8 11:06 .
drwxr-xr-x 106 root root 8.0K Mar 30 12:08 ..
-rw-r--r--   1 root root 4.0M Mar  8 11:05 benchmark\udata\uirt03\u16185\uobject586__head_60D672FA__2
-rw-r--r--   1 root root    0 Mar  8 10:50 __head_000000FA__2
-rw-r--r--   1 root root 4.0M Mar  8 11:06 rb.0.1034.74b0dc51.0000000000c6__head_C147A6FA__2
[79 further objects]
---

Whereas on my main production cluster I've got 2000 Kobjects and things
are nested a lot more, like this:
---
ls -lah /var/lib/ceph/osd/ceph-2/current/2.35e_head/DIR_E/DIR_5/DIR_3/DIR_0
total 128M
drwxr-xr-x  2 root root 4.0K Mar 10 09:20 .
drwxr-xr-x 18 root root  32K Dec 14 15:15 ..
-rw-r--r--  1 root root    0 Feb 21 01:15 __head_0000035E__2
-rw-r--r--  1 root root 4.0M Jun  3  2015 rb.0.11eb.238e1f29.000000010b1b__head_AD6E035E__2
[36 further 4MB objects]
---

> As you can see we have one data object in pool "data" per file saved
> somewhere else. I'm not sure what this is related to, but maybe it is
> something cephfs requires.
>
That's rather confusing (even more so since I don't use CephFS), but it
feels wrong.

From what little I know about CephFS, you can have only one FS per
cluster and the pools can be arbitrarily named (default "data" and
"metadata").

Looking at your output above, I'm assuming that "metadata" is actually
what the name implies and that you have quite a few files (as in CephFS
files) in there, at 35 million objects.
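If you want to verify that instead of guessing, something along these
lines should tell you which pools the filesystem is actually wired to and
where the bytes really live (just a rough sketch, untested against your
setup; the pool names are taken from your listing and "/mnt/cephfs" plus
the file path are only placeholders for wherever you mount CephFS):
---
# Which metadata/data pools does the CephFS use? (recent releases)
ceph fs ls
# Output should look roughly like:
#   name: cephfs, metadata pool: metadata, data pools: [data images ]

# Per-pool usage, to see which pool actually holds the bulk of the data
ceph df detail

# On a CephFS mount, show the layout (including the target data pool)
# of a single file
getfattr -n ceph.file.layout /mnt/cephfs/path/to/some/file
---
If your version doesn't have "ceph fs ls" yet, "ceph mds dump" should
list the metadata/data pool IDs as well.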
Furthermore, the actual DATA for these files seems to reside in "images",
not in "data" (which is nearly empty at 3.2MB). My guess is that you
somehow managed to create things in a way that puts references (not the
actual data) to everything in "images" into "data".
Hell, it might even be a bug where a pool named "data" is always used by
Ceph in that fashion, even if the actual data-holding pool is named
differently.

I don't think that's normal at all, and I wonder if you could just remove
"data", after checking with more knowledgeable people than me of course
(and after having a look at what's actually in there, see below).
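A read-only way to peek at what those objects in "data" actually are
could look like this (again just a sketch; <some-object-name> is whatever
the first command prints, and listing a pool with billions of objects
takes a while, hence the head):
---
# Sample a few object names from the suspicious pool (read-only)
rados -p data ls | head -20

# Inspect one of them; with only ~3 MB spread over billions of objects
# they should all be (near) zero-length
rados -p data stat <some-object-name>

# Compare with the pool that appears to hold the real file data
rados -p images ls | head -20
---
If those "data" objects turn out to be empty markers pointing at objects
in "images", that would at least confirm the reference theory before
anyone considers touching the pool.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/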