Re: Deprecating ext4 support

Hi,

Am 14.04.2016 um 03:32 schrieb Christian Balzer:
> On Wed, 13 Apr 2016 14:51:58 +0200 Michael Metz-Martini | SpeedPartner GmbH wrote:
>> Am 13.04.2016 um 04:29 schrieb Christian Balzer:
>>> On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner GmbH wrote:
>>>> Am 11.04.2016 um 23:39 schrieb Sage Weil:
>>>>> ext4 has never been recommended, but we did test it.  After Jewel is
>>>>> out, we would like explicitly recommend *against* ext4 and stop
>>>>> testing it.
>>>> Hmmm. We're currently migrating away from xfs as we had some strange
>>>> performance issues which were resolved, or at least got better, by
>>>> switching to ext4. We think this is related to our high number of
>>>> objects (4358 Mobjects according to ceph -s).
>>> It would be interesting to see how this maps out to the OSDs/PGs.
>>> I'd guess loads and loads of subdirectories per PG, which is probably
>>> where Ext4 performs better than XFS.
>> A simple ls -l takes "ages" on XFS while ext4 lists a directory
>> immediately. According to our findings regarding XFS this seems to be
>> "normal" behavior.
> Just for the record, this is also influenced (for Ext4 at least) by how
> much memory you have and the "vm/vfs_cache_pressure" setting.
> Once Ext4 runs out of space in SLAB for dentry and ext4_inode_cache
> (amongst others), it will become slower as well, since it has to go to the
> disk.
> Another thing to remember is that "ls" by itself is also a LOT faster than
> "ls -l" since it accesses less data.
128 GB RAM for 21 OSDs (4 TB each). The kernel is so far "untuned"
regarding cache pressure / inode cache.
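In case anyone wants to check this on their own nodes, something like the
following should show the current cache pressure and how much of the SLAB
the dentry / ext4 inode caches take up (standard Linux sysctls assumed; the
value 50 below is only an example, not a recommendation):

# current setting; the default is 100, lower values keep dentries/inodes cached longer
cat /proc/sys/vm/vfs_cache_pressure

# current size of the dentry and ext4 inode caches
grep -E '^(dentry|ext4_inode_cache) ' /proc/slabinfo

# experiment: favour keeping filesystem metadata in memory
sysctl -w vm.vfs_cache_pressure=50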



>> pool name       category                 KB      objects
>> data            -                       3240   2265521646
>> document_root   -                     577364        10150
>> images          -                96197462245   2256616709
>> metadata        -                    1150105     35903724
>> queue           -                  542967346       173865
>> raw             -                36875247450     13095410
>>
>> total of 4736 pgs, 6 pools, 124 TB data, 4359 Mobjects
>>
>> What would you like to see?
>> tree? du per Directory?
> Just an example tree and typical size of the first "data layer".
> [...]

First levels seem to be empty, so:
./DIR_3
./DIR_3/DIR_9
./DIR_3/DIR_9/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_D
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_E
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_A
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_C
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_1
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_4
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_2
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_B
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_5
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_3
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_9
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_6
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_F
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_7
./DIR_3/DIR_9/DIR_0/DIR_0/DIR_8
./DIR_3/DIR_9/DIR_0/DIR_D
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_0
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_D
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_E
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_A
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_C
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_1
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_4
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_2
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_B
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_5
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_3
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_9
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_6
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_F
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_7
./DIR_3/DIR_9/DIR_0/DIR_D/DIR_8
...

/var/lib/ceph/osd/ceph-58/current/6.93_head/DIR_3/DIR_9/DIR_C/DIR_0$ du -ms *
99      DIR_0
102     DIR_1
105     DIR_2
102     DIR_3
101     DIR_4
105     DIR_5
106     DIR_6
102     DIR_7
105     DIR_8
98      DIR_9
99      DIR_A
105     DIR_B
103     DIR_C
100     DIR_D
103     DIR_E
104     DIR_F
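
With about 4359 Mobjects spread over 4736 PGs that is, very roughly, 0.9
million objects per PG on average (ignoring the per-pool pg_num split). If
it is of any interest, the number of object files per leaf directory can be
counted with something like this (the path is just the example PG from
above, adjust as needed):

cd /var/lib/ceph/osd/ceph-58/current/6.93_head/DIR_3/DIR_9/DIR_C/DIR_0
# print each subdirectory and the number of plain files directly inside it
for d in DIR_*; do
    printf '%s\t%s\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
done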



>> As you can see we have one data object in pool "data" per file saved
>> somewhere else. I'm not sure what this is related to, but maybe it is
>> something cephfs requires.
> That's rather confusing (even more so since I don't use CephFS), but it
> feels wrong.
> From what little I know about CephFS, you can have only one FS per
> cluster and the pools can be arbitrarily named (default data and metadata).
[...]
> My guess is that you somehow managed to create things in a way that
> puts references (not the actual data) for everything in "images" into
> "data".
You can set the pool for a directory tree with e.g.
cephfs /mnt/storage/docroot set_layout -p 4

We thought this was a good idea so that we could use different replication
sizes for doc_root and the raw data if we liked. It seems this was a bad
idea for all those objects.
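
For what it's worth, newer clients can apparently set this via the virtual
layout xattrs instead of the old cephfs tool. The syntax below is from
memory and assumes pool 4 is "document_root", so treat it as a sketch
rather than a recipe:

# the pool has to be registered as an additional data pool first;
# the exact command may differ per release
ceph mds add_data_pool document_root

# then point the directory at that pool via its layout xattr
setfattr -n ceph.dir.layout.pool -v document_root /mnt/storage/docroot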

-- 
Kind regards
 Michael Metz-Martini
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


