Re: Deprecating ext4 support

Hello,

[reducing MLs to ceph-user]

On Wed, 13 Apr 2016 14:51:58 +0200 Michael Metz-Martini | SpeedPartner
GmbH wrote:

> Hi,
> 
> Am 13.04.2016 um 04:29 schrieb Christian Balzer:
> > On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner
> > GmbH wrote:
> >> Am 11.04.2016 um 23:39 schrieb Sage Weil:
> >>> ext4 has never been recommended, but we did test it.  After Jewel is
> >>> out, we would like to explicitly recommend *against* ext4 and stop
> >>> testing it.
> >> Hmmm. We're currently migrating away from xfs as we had some strange
> >> performance issues which were resolved / got better by switching to
> >> ext4. We think this is related to our high number of objects (4358
> >> Mobjects according to ceph -s).
> > It would be interesting to see how this maps out to the OSDs/PGs.
> > I'd guess loads and loads of subdirectories per PG, which is probably
> > where Ext4 performs better than XFS.
> A simple ls -l takes "ages" on XFS while ext4 lists a directory
> immediately. According to our findings regarding XFS this seems to be
> "normal" behavior.
> 
Just for the record, this is also influenced (for Ext4 at least) by how
much memory you have and by the "vm.vfs_cache_pressure" setting.
Once Ext4 runs out of SLAB space for the dentry and ext4_inode_cache caches
(amongst others), it will become slower as well, since it has to go to the
disk.
Another thing to remember is that a plain "ls" is also a LOT faster than
"ls -l", since it only has to read the directory, while "ls -l" stats every
entry and thus pulls in the inodes as well.

> pool name       category                 KB      objects
> data            -                       3240   2265521646
> document_root   -                     577364        10150
> images          -                96197462245   2256616709
> metadata        -                    1150105     35903724
> queue           -                  542967346       173865
> raw             -                36875247450     13095410
> 
> total of 4736 pgs, 6 pools, 124 TB data, 4359 Mobjects
> 
> What would you like to see?
> tree? du per Directory?
> 
Just an example tree and the typical size of the first "data layer".
For example, on my very lightly loaded/filled test cluster (45000 objects)
the actual objects sit in the "top" directory of the PG in question, like:
---
ls -lah /var/lib/ceph/osd/ceph-3/current/2.fa_head/
total 289M
drwxr-xr-x   2 root root 8.0K Mar  8 11:06 .
drwxr-xr-x 106 root root 8.0K Mar 30 12:08 ..
-rw-r--r--   1 root root 4.0M Mar  8 11:05 benchmark\udata\uirt03\u16185\uobject586__head_60D672FA__2
-rw-r--r--   1 root root    0 Mar  8 10:50 __head_000000FA__2
-rw-r--r--   1 root root 4.0M Mar  8 11:06 rb.0.1034.74b0dc51.0000000000c6__head_C147A6FA__2
[79 further objects]
---

Whereas on my main production cluster I have 2000 Kobjects and things are
nested a lot more, like this:
---
ls -lah /var/lib/ceph/osd/ceph-2/current/2.35e_head/DIR_E/DIR_5/DIR_3/DIR_0
total 128M
drwxr-xr-x  2 root root 4.0K Mar 10 09:20 .
drwxr-xr-x 18 root root  32K Dec 14 15:15 ..
-rw-r--r--  1 root root    0 Feb 21 01:15 __head_0000035E__2
-rw-r--r--  1 root root 4.0M Jun  3  2015 rb.0.11eb.238e1f29.000000010b1b__head_AD6E035E__2
[36 further 4MB objects]
---
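
If you want the same kind of overview for your own cluster, something along
these lines run against a single OSD should do. The OSD/PG paths below are
just the ones from my example, and GNU find is assumed:
---
# how many levels deep the DIR_* splitting goes in this PG
find /var/lib/ceph/osd/ceph-2/current/2.35e_head -type d -name 'DIR_*' \
  -printf '%d\n' | sort -n | tail -1

# total number of objects (files) in this PG
find /var/lib/ceph/osd/ceph-2/current/2.35e_head -type f | wc -l
---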

> As you can see we have one data-object in pool "data" per file saved
> somewhere else. I'm not sure what this is related to, but maybe this is
> something cephfs requires.
> 
That's rather confusing (even more so since I don't use CephFS), but it
feels wrong.
From what little I know about CephFS, you can have only one FS per cluster
and its pools can be named arbitrarily (the defaults being data and
metadata).

Looking at your output above I'm assuming that "metadata" is actually what
the name implies and that you have quite a few files (as in CephFS files)
in there, at 35 million objects.
Furthermore the actual DATA for these files seems to reside in "images",
not in "data" (which is nearly empty at 3.2MB).
My guess is that you somehow managed to create things in a way that puts a
reference (not the actual data) for everything in "images" into "data".
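
If you have the CephFS mounted somewhere you could verify that directly:
the layout of a file, including the data pool it points at, is exposed as a
virtual xattr. The paths are of course just examples:
---
getfattr -n ceph.file.layout /mnt/cephfs/some/file
# for directories (only present if a layout was explicitly set)
getfattr -n ceph.dir.layout /mnt/cephfs/some/dir
---
That should tell you whether your files really point at "images" or at
"data".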

Hell, it might even be a bug where a "data" pool will always be used by
Ceph in that fashion, even if the actual data-holding pool is named
differently.

I don't think that's normal at all, and I wonder if you could just remove
"data", after checking with more knowledgeable people than me of course.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


