Re: cephfs: num_stray growing without bounds (octopus)

Hi Dhairya,

I have some new results (below) and also some wishes as an operator that might even help with the decision you mentioned in your e-mails:

- Please implement both: a way to trigger an evaluation manually via a "ceph tell|daemon" command, and a periodic evaluation.
- For the periodic evaluation, please introduce a tuning parameter, for example mds_gc_interval (in seconds), with 0 disabling periodic evaluation (sketched below).

Reasons:

- On most production systems, running this once per 24 hours seems enough (my benchmark is a special case; it needs to delete aggressively). The default for mds_gc_interval could therefore be 86400 (24h).
- On my production system I would probably disable periodic evaluation and instead run a single-shot manual evaluation synchronised with snapshot removal (which is when the "lost" entries are created): some time after the snapshots are removed but before users start working.

This follows a general software design principle: whenever there is a choice like this, it is best to implement an API that supports all use cases and to leave the decision about what fits their workloads best to the operators. Try not to restrict operators by hard-coding decisions; rather, pick reasonable defaults but also empower operators to tune things to their special needs. One-size-fits-all never works.
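
To make these wishes concrete, this is roughly how I imagine the operator interface; the "evaluate strays" subcommand and the mds_gc_interval option are made up, nothing like this exists yet:

# wished-for: trigger a one-shot stray evaluation on the active MDS
ceph tell mds.tceph-03 evaluate strays      # hypothetical subcommand
ceph daemon mds.tceph-03 evaluate strays    # hypothetical per-daemon variant

# wished-for: periodic evaluation, interval in seconds, 0 disables it
ceph config set mds mds_gc_interval 86400   # hypothetical option, once per 24h
ceph config set mds mds_gc_interval 0       # hypothetical option, disabled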

Now to the results: Indeed, a restart triggers complete removal of all orphaned stray entries:

[root@rit-tceph bench]# ./mds-stray-num
962562
[root@rit-tceph bench]# ceph mds fail 0
failed mds gid 371425
[root@rit-tceph bench]# ./mds-stray-num
767329
[root@rit-tceph bench]# ./mds-stray-num
766777
[root@rit-tceph bench]# ./mds-stray-num
572430
[root@rit-tceph bench]# ./mds-stray-num
199172
[root@rit-tceph bench]# ./mds-stray-num
0
# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    2.4 TiB  2.4 TiB  896 MiB    25 GiB       0.99
TOTAL  2.4 TiB  2.4 TiB  896 MiB    25 GiB       0.99

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1  205 KiB        9  616 KiB      0    785 GiB
fs-meta1                2   64  684 MiB       44  2.0 GiB   0.09    785 GiB
fs-meta2                3  128      0 B        0      0 B      0    785 GiB
fs-data                 4  128      0 B        0      0 B      0    1.5 TiB
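
For reference, the ./mds-stray-num helper used above is just a thin wrapper around the MDS perf counters; a minimal sketch of what it does, assuming the active MDS is mds.tceph-03 (the same command is quoted further below in the thread):

#!/bin/bash
# print the current number of stray entries held by the active MDS
ssh tceph-03 "ceph daemon mds.tceph-03 perf dump | jq .mds_cache.num_strays"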

Good to see that the bookkeeping didn't lose track of anything. I will add a periodic mds fail to my benchmark (something like the loop below) and report back how all of this works under heavy load.
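
Something along these lines is what I have in mind; the interval is just a guess for now:

# fail the active rank once per hour so that a standby takes over
# and the orphaned stray entries get cleaned out on the restart
while sleep 3600; do
    ceph mds fail 0
done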

Best regards and thanks for your help!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dhairya Parmar <dparmar@xxxxxxxxxx>
Sent: 05 August 2022 22:53:09
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  cephfs: num_stray growing without bounds (octopus)

On Fri, Aug 5, 2022 at 9:12 PM Frank Schilder <frans@xxxxxx> wrote:
Hi Dhairya,

thanks for pointing me to this tracker. I can try an MDS fail to see if it clears the stray buckets or if there are still left-overs. Before doing so:

> Thanks for the logs though. It will help me while writing the patch.

I couldn't see if you were asking for logs. Do you want me to collect something, or do you mean the session logs included in my e-mail? Also, is it on purpose that the ceph-users list was left out of the CC (e-mail address)?

Nah, the session logs included are good enough. I missed CCing ceph-users. Done now.

For my urgent needs, failing the MDS periodically during the benchmark might be an interesting addition anyway - if it helps with the stray count.

Yeah it might be helpful for now. Do let me know if that works for you.

Thanks for your fast reply and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dhairya Parmar <dparmar@xxxxxxxxxx>
Sent: 05 August 2022 16:10
To: Frank Schilder
Subject: Re:  cephfs: num_stray growing without bounds (octopus)

Hi Frank,

This seems to be related to a tracker (https://tracker.ceph.com/issues/53724) that I'm working on. I've got some rough ideas in mind: a simple solution would be to run a single thread that regularly evaluates strays (maybe every 1 or 2 minutes?), but a much better approach would be to evaluate strays whenever snapshot removal takes place. That is not as easy as it looks, so I'm currently going through the code to understand the whole snapshot-removal process. I'll try my best to come up with something as soon as possible. Thanks for the logs though; they will help me while writing the patch.

Regards,
Dhairya

On Fri, Aug 5, 2022 at 6:55 PM Frank Schilder <frans@xxxxxx> wrote:
Dear Gregory, Dan and Patrick,

this is a reply to an older thread about num_stray growing without limits (thread https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW, message https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FYEN2W4HGMC6CGOCS2BS4PQDRPGUSNOO/). I'm opening a new thread for a better matching subject line.

I now started testing octopus and I am afraid I came across a very serious issue with unlimited growth of stray buckets. I'm running a test that puts constant load on the file system by cyclically adding a blob of data, creating a snapshot, deleting a blob of data and deleting a snapshot. A blob of data contains about 330K hard links to make it more interesting.
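
For illustration, a rough sketch of one benchmark cycle; the blob path matches the layout shown below, but the link count per cycle and which blob/snapshot gets deleted are assumptions made for the example:

#!/bin/bash
# one cycle: add a blob with ~330K hard links, take a snapshot,
# delete an older blob, delete an older snapshot
i=$1                                    # cycle number
fs=/mnt/adm/cephfs
mkdir -p $fs/data/blobs/blob-$i
echo data > $fs/data/blobs/blob-$i/file
for n in $(seq 1 330000); do
    ln $fs/data/blobs/blob-$i/file $fs/data/blobs/blob-$i/link-$n
done
mkdir $fs/.snap/snap-$i                 # snapshot at the file system root
rm -rf $fs/data/blobs/blob-$((i-1))     # delete the previous blob
rmdir $fs/.snap/snap-$((i-1))           # delete the previous snapshot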

The benchmark crashed after half a day in rm with "no space left on device", which was due to the stray buckets being too full (old thread). OK, so I increased mds_bal_fragment_size_max and cleaned out all data to start fresh. However, this happened:

[root@rit-tceph ~]# df -h /mnt/adm/cephfs
Filesystem                             Size  Used Avail Use% Mounted on
10.41.24.13,10.41.24.14,10.41.24.15:/  2.5T   35G  2.5T   2% /mnt/adm/cephfs

[root@rit-tceph ~]# find /mnt/adm/cephfs/
/mnt/adm/cephfs/
/mnt/adm/cephfs/data
/mnt/adm/cephfs/data/blobs

[root@rit-tceph ~]# find /mnt/adm/cephfs/.snap
/mnt/adm/cephfs/.snap

[root@rit-tceph ~]# find /mnt/adm/cephfs/data/.snap
/mnt/adm/cephfs/data/.snap

[root@rit-tceph ~]# find /mnt/adm/cephfs/data/blobs/.snap
/mnt/adm/cephfs/data/blobs/.snap

All snapshots were taken in /mnt/adm/cephfs/.snap. Snaptrimming finished a long time ago. Now look at this:

[root@rit-tceph ~]# ssh "tceph-03" "ceph daemon mds.tceph-03 perf dump | jq .mds_cache.num_strays"
962562

Whaaaaat?

There is data left over in the fs pools and the stray buckets are clogged up.

[root@rit-tceph ~]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38
TOTAL  2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1  170 KiB        9  509 KiB      0    781 GiB
fs-meta1                2   64  2.2 GiB  160.25k  6.5 GiB   0.28    781 GiB
fs-meta2                3  128      0 B  802.40k      0 B      0    781 GiB
fs-data                 4  128      0 B  802.40k      0 B      0    1.5 TiB

There is either a very serious bug with cleaning up stray entries when their last snapshot is deleted, or I'm missing something important here when deleting data. Just for completeness:

[root@rit-tceph ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 10d)
    mgr: tceph-01(active, since 10d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 4d), 9 in (since 4d)

  data:
    pools:   4 pools, 321 pgs
    objects: 1.77M objects, 256 MiB
    usage:   35 GiB used, 2.4 TiB / 2.4 TiB avail
    pgs:     321 active+clean

I would be most grateful for both an explanation of what happened here and a way to get out of this. To me it looks very much like unlimited growth of garbage that is never cleaned out.

Many thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: 08 February 2022 18:22
To: Dan van der Ster
Cc: Frank Schilder; Patrick Donnelly; ceph-users
Subject: Re:  Re: cephfs: [ERR] loaded dup inode

On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but always global to the entire FS, despite the fact that you can only access the sub-tree in the snapshot; this easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though the sub-trees were completely disjoint.
>
> So are you saying that if I do this I'll have 1M files in stray?

No, happily.

The thing that's happening here post-dates my main previous stretch on
CephFS and I had forgotten it, but there's a note in the developer
docs: https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
(I fortuitously stumbled across this from an entirely different
direction/discussion just after seeing this thread and put the pieces
together!)

Basically, hard links are *the worst*. For everything in filesystems.
I spent a lot of time trying to figure out how to handle hard links
being renamed across snapshots[1] and never managed it, and the
eventual "solution" was to give up and do the degenerate thing:
If there's a file with multiple hard links, that file is a member of
*every* snapshot.
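
For example (made-up paths, just to illustrate the consequence):

mkdir /mnt/cephfs/a /mnt/cephfs/b
echo data > /mnt/cephfs/a/y
ln /mnt/cephfs/a/y /mnt/cephfs/a/y-link   # y now has multiple hard links
mkdir /mnt/cephfs/b/.snap/s1              # snapshot of an unrelated, empty dir
rm /mnt/cephfs/a/y /mnt/cephfs/a/y-link   # remove all links to y
# y had multiple hard links, so it is a member of *every* snapshot,
# including s1; it stays in the stray directory until s1 is deleted
# and the strays are evaluated again.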

Doing anything about this will take a lot of time. There's probably an
opportunity to improve it for users of the subvolumes library, as
those subvolumes do get tagged a bit, so I'll see if we can look into
that. But for generic CephFS, I'm not sure what the solution will look
like at all.

Sorry folks. :/
-Greg

[1]: The issue is that, if you have a hard linked file in two places,
you would expect it to be snapshotted whenever a snapshot covering
either location occurs. But in CephFS the file can only live in one
location, and the other location has to just hold a reference to it
instead. So say you have inode Y at path A, and then hard link it in
at path B. Given how snapshots work, when you open up Y from A, you
would need to check all the snapshots that apply from both A and B's
trees. But 1) opening up other paths is a challenge all on its own,
and 2) without an inode and its backtrace to provide a lookup resolve
point, it's impossible to maintain a lookup that scales and is
possible to keep consistent.
(Oh, I did just have one idea, but I'm not sure if it would fix every
issue or just that scalable backtrace lookup:
https://tracker.ceph.com/issues/54205)

>
> mkdir /a
> cd /a
> for i in {1..1000000}; do touch $i; done  # create 1M files in /a
> cd ..
> mkdir /b
> mkdir /b/.snap/testsnap  # create a snap in the empty dir /b
> rm -rf /a/
>
>
> Cheers, Dan




--
Dhairya Parmar

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc.<https://www.redhat.com/>

dparmar@xxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



