Re: cephfs: num_stray growing without bounds (octopus)

Hi Dhairya,

I think we are reaching a point where we can move parts of this discussion to the tracker. Some of the points we discussed should surely be copied there; I have a bit more information and will think about posting a summary to the tracker. Sorry, this e-mail got a bit long. I added replies and new observations.

> I was thinking about a way more aggressive number ...

The reason why a 24h window is likely to work is that data deletion, and even more so snapshot deletion, is a rather rare event that doesn't accumulate a lot of stray entries over a day. My benchmark is very special: it basically treats the file system as a large FIFO and, due to its low capacity and an enormous usage amplification (https://tracker.ceph.com/issues/56949), it rotates through the complete storage in less than 12 hours. This results in an unusually high stray count, which you will never see on a production system. We have users cleaning up migrated data at the moment and the stray count is 165693 on a file system with 2578Ti usage and 431M inodes. The special case of orphaned stray entries (https://tracker.ceph.com/issues/53724) does not seem to hurt that much.

In fact, it looks almost like no user apart from me ever noticed.
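
For reference, ./mds-stray-num is just a tiny helper of mine; it essentially boils down to reading num_strays from the MDS perf counters, roughly like this (run on the host of the active rank-0 MDS; the daemon name is from my test cluster, adjust as needed):

# current number of stray dentries held by the MDS
ceph daemon mds.tceph-02 perf dump mds_cache | jq '.mds_cache.num_strays'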

> I was also thinking about a solution where we evaluate strays as soon as we delete a snap.
> What do you think about this on production clusters?

This can work; however, see below. I don't know the details of how snapshot removal works internally. You would want to start this process a bit after all snapshot-removal-induced operations on the stray directories have completed (that is, after all orphaned stray dirs have been created). It would also require that snapshot removal is the only source of such stray dirs. And this is where I would start arguing against such an approach.

The reason is that on a storage system you want redundancy not only in the form of raiding up storage, but also in the form of raiding up orthogonal algorithms. The software equivalent of the four-eyes principle is the two-independent-algorithm principle. I see this stray evaluation more like a crawler that goes through the data and checks whether certain things are as they should be. A bug introduced in one algorithm will then be caught by the second, independent algorithm. This, however, requires that both algorithms share no more than type declarations - well, in an ideal world. An example of this is the IO path (algorithm 1) and scrubbing (algorithm 2).

What you are asked to implement is a starting point of something like a scrubbing algorithm (or part of it). Therefore, having detection and removal of orphaned stray entries as part of an fs scrub, or a lightweight fs scrub that runs periodically or on demand, makes a lot of sense following the two-independent-algorithm principle. The fact that this check finds leftover data indicates a bug in the IO path. So, instead of considering a periodic stray evaluation as a workaround to "fix" an IO path problem, consider it as a crawler that finds something wrong and reports to the devs that they should actually fix the IO path (specifically, the snapshot removal algorithm, so that it cleans out all trash). The crawler should keep running even after that, in case a regression re-introduces the bug.
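
To make this concrete: the forward scrub interface that exists today would be a natural home for such a crawler. I would expect it to be driven roughly like below (the scrub commands exist already; "cephfs" stands for the file system name, and the stray/orphan check itself is of course hypothetical at this point):

# existing forward scrub of the whole tree
ceph tell mds.cephfs:0 scrub start / recursive
ceph tell mds.cephfs:0 scrub status
# a periodic check for orphaned stray dirs could hang off the same
# interface, e.g. as an additional scrub option - this does not exist yet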

New observations:

* Possible performance impact.

I made two contradictory observations. These are preliminary and I need to confirm them with modified tests. I added an mds fail to the benchmark. Unfortunately, it seems that performance collapses significantly after that and never recovers again. I need to run clean tests with and without the mds-fail to get a proper comparison. A more lightweight way to trigger stray evaluation would likely avoid that.

I needed to clean up the entire file system again after stopping the benchmark. Removal was extremely slow following the mds-fails during the test. However, after snap_trim finished, I did another mds fail to clear out all stray entries, and it seems that this time it actually improved performance. Unfortunately, I did not take timestamps and need to repeat the procedure.
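
By the way, the ./mds-fail helper that shows up in the transcripts further down is nothing special; in essence it does something like this (a sketch, assuming a single active rank 0 and at least one standby):

#!/bin/sh
# fail rank 0 and wait until a standby has taken over and is up:active
echo "mds fail rank 0 on $(date)"
ceph mds fail 0
sleep 5   # give the mons a moment to process the failure
until ceph mds stat | grep -q 'up:active'; do sleep 2; done
echo "rank 0 up:active again on $(date)"
ceph mds stat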

Something weird seems to be going on with stray entries and MDS performance, and I will try to figure out what is happening. It looks as if certain operations, executed in a certain situation, put the MDS into a performance-degraded state that is difficult to leave again.
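
To narrow this down I will watch the stray-related counters and the reply latency while reproducing, something like the following (on the host of the active MDS, daemon name from my cluster):

# stray counters: num_strays, strays_created/enqueued/reintegrated/migrated
watch -n 10 "ceph daemon mds.tceph-02 perf dump mds_cache | jq .mds_cache"
# average reply latency of the MDS
ceph daemon mds.tceph-02 perf dump mds | jq '.mds.reply_latency'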

* This time the cleanup didn't work entirely.

After the above cleanup, this time I have stray entries left over:

[ansible@rit-tceph bench]$ ./mds-stray-num 
2781

All directories and .snap directories are empty. I did an MDS fail and an ls -R in all existing directories and .snap directories to trigger reintegration if required, to no avail. The entries in the stray buckets look like this (from an mds dump cache):

[root@tceph-02 ceph]# grep "~mds0" mds-cache-01 | grep -e "10000a97dac"
[inode 0x10000a97dac [...5a,head] ~mds0/stray6/10000a97dac/ auth v7815075 snaprealm=0x55b633a83900 dirtyparent f(v0 m2022-08-09T06:35:43.959997+0200) n(v0 rc2022-08-09T06:35:43.959997+0200 1=0+1) (iversion lock) | dirfrag=1 openingsnapparents=0 dirtyparent=1 dirty=0 0x55b6334fb000]
 [dir 0x10000a97dac ~mds0/stray6/10000a97dac/ [2,head] auth v=8 cv=0/0 state=1073741824 f(v0 m2022-08-09T06:35:43.959997+0200) n(v0 rc2022-08-09T06:35:43.959997+0200) hs=0+1,ss=0+0 dirty=1 | child=1 0x55b63551cb00]
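
(For completeness: the cache dump above was produced with the dump cache admin socket command on the MDS host, roughly as below; the output file path is just where I happened to put it.)

# write the MDS cache to a file, then look at the stray buckets
ceph daemon mds.tceph-02 dump cache /var/log/ceph/mds-cache-01
grep "~mds0/stray" /var/log/ceph/mds-cache-01 | less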

These are directories with their name equal to their inode number. I enabled distributed ephemeral pinning on a directory and my best bet is that these stray entries have to do with that. If so, this looks like a bug in the code. When you talk to Greg, could you ask him whether this is expected or a bug? It's quite a lot of entries. I have only 1 active MDS and had 67 directories under the ephemeral pin for the lifetime of the benchmark.
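
To check whether all of the leftover entries really are of this kind, they can be counted in the cache dump (the second grep assumes that dirfrag= shows up only for directory inodes, as in the example above):

# stray inodes in the cache dump, and how many of them are directories
grep -c '\[inode 0x.*~mds0/stray' mds-cache-01
grep '\[inode 0x.*~mds0/stray' mds-cache-01 | grep -c 'dirfrag='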

I disabled ephemeral pinning (well, I hope I did; the docs say nothing about how to disable it), but the stray count stays the same:

[ansible@rit-tceph bench]$ setfattr -n ceph.dir.pin.distributed -v 0 /mnt/adm/cephfs/data/blobs
[ansible@rit-tceph bench]$ ./mds-stray-num 
2781
[ansible@rit-tceph bench]$ ./mds-stray-num 
2781
[ansible@rit-tceph bench]$ ./mds-fail
mds fail rank 0: failing mds.tceph-02 on Tue Aug  9 16:02:31 CEST 2022
ceph mds fail: failed mds gid 449819
mds new rank 0: mds.tceph-03 is up:active on Tue Aug  9 16:02:38 CEST 2022
[ansible@rit-tceph bench]$ ./mds-stray-num 
2781
[ansible@rit-tceph bench]$ ./mds-stray-num 
2781

Not so good this time :(

I will re-run the benchmarks now and hope to have some timings for you in a couple of days.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



