Dear Gregory, Dan and Patrick,

this is a reply to an older thread about num_stray growing without limit (thread https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW, message https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FYEN2W4HGMC6CGOCS2BS4PQDRPGUSNOO/). I'm opening a new thread for a better matching subject line.

I have now started testing Octopus and am afraid I came across a very serious issue with unlimited growth of stray buckets. I'm running a test that puts constant load on a file system by adding a blob of data, creating a snapshot, deleting a blob of data and deleting a snapshot, in a cyclic process. Each blob of data contains about 330K hard links to make it more interesting. (A sketch of one benchmark cycle and the config change mentioned below are at the end of this mail.)

The benchmark crashed after half a day in rm with "no space left on device", which was due to the stray buckets being full (see the old thread). OK, so I increased mds_bal_fragment_size_max and cleaned out all data to start fresh. However, this happened:

[root@rit-tceph ~]# df -h /mnt/adm/cephfs
Filesystem                             Size  Used Avail Use% Mounted on
10.41.24.13,10.41.24.14,10.41.24.15:/  2.5T   35G  2.5T   2% /mnt/adm/cephfs

[root@rit-tceph ~]# find /mnt/adm/cephfs/
/mnt/adm/cephfs/
/mnt/adm/cephfs/data
/mnt/adm/cephfs/data/blobs

[root@rit-tceph ~]# find /mnt/adm/cephfs/.snap
/mnt/adm/cephfs/.snap
[root@rit-tceph ~]# find /mnt/adm/cephfs/data/.snap
/mnt/adm/cephfs/data/.snap
[root@rit-tceph ~]# find /mnt/adm/cephfs/data/blobs/.snap
/mnt/adm/cephfs/data/blobs/.snap

All snapshots were taken in /mnt/adm/cephfs/.snap. Snap trimming finished a long time ago. Now look at this:

[root@rit-tceph ~]# ssh "tceph-03" "ceph daemon mds.tceph-03 perf dump | jq .mds_cache.num_strays"
962562

Whaaaaat? There is data left over in the fs pools and the stray buckets are clogged up:

[root@rit-tceph ~]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38
TOTAL  2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1  170 KiB        9  509 KiB      0    781 GiB
fs-meta1                2   64  2.2 GiB  160.25k  6.5 GiB   0.28    781 GiB
fs-meta2                3  128      0 B  802.40k      0 B      0    781 GiB
fs-data                 4  128      0 B  802.40k      0 B      0    1.5 TiB

Either there is a very serious bug with cleaning up stray entries when their last snapshot is deleted, or I'm missing something important about how data gets deleted. Just for completeness:

[root@rit-tceph ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 10d)
    mgr: tceph-01(active, since 10d), standbys: tceph-02, tceph-03
    mds: fs:1 {0=tceph-03=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 4d), 9 in (since 4d)

  data:
    pools:   4 pools, 321 pgs
    objects: 1.77M objects, 256 MiB
    usage:   35 GiB used, 2.4 TiB / 2.4 TiB avail
    pgs:     321 active+clean

I would be most grateful for both an explanation of what happened here and a way to get out of it. To me it looks very much like unlimited growth of garbage that is never cleaned up.
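P.S. For reference, one cycle of the benchmark looks roughly like the sketch below. The paths match the mount point above, but the blob generator is a simplified stand-in for the actual benchmark code, so take it as an illustration of the access pattern rather than the real thing:

# one benchmark cycle (sketch); the real test runs this in a loop,
# keeping several blobs and snapshots in rotation
FS=/mnt/adm/cephfs
TAG=$(date +%s)

# add a blob of data: one base file plus ~330K hard links to it
mkdir -p "$FS/data/blobs/blob-$TAG"
dd if=/dev/urandom of="$FS/data/blobs/blob-$TAG/base" bs=4M count=1
for i in $(seq 1 330000); do
    ln "$FS/data/blobs/blob-$TAG/base" "$FS/data/blobs/blob-$TAG/link-$i"
done

# create a snapshot at the fs root
mkdir "$FS/.snap/snap-$TAG"

# delete the oldest blob and the oldest snapshot
rm -rf "$(ls -d "$FS"/data/blobs/blob-* | head -n 1)"
rmdir "$(ls -d "$FS"/.snap/snap-* | head -n 1)"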
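And this is how mds_bal_fragment_size_max was raised, via the central config (the value shown here is just an example, not a recommendation):

ceph config set mds mds_bal_fragment_size_max 1000000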
Many thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: 08 February 2022 18:22
To: Dan van der Ster
Cc: Frank Schilder; Patrick Donnelly; ceph-users
Subject: Re: Re: cephfs: [ERR] loaded dup inode

On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <frans@xxxxxx> wrote:
> > The reason for this seemingly strange behaviour was an old static snapshot taken in an entirely different directory. Apparently, ceph fs snapshots are not local to an FS directory sub-tree but always global to the entire FS, despite the fact that you can only access the sub-tree in the snapshot, which easily leads to the wrong conclusion that only data below the directory is in the snapshot. As a consequence, the static snapshot was accumulating the garbage from the rotating snapshots even though these sub-trees were completely disjoint.
>
> So are you saying that if I do this I'll have 1M files in stray?

No, happily. The thing that's happening here post-dates my main previous stretch on CephFS and I had forgotten it, but there's a note in the developer docs:
https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
(I fortuitously stumbled across this from an entirely different direction/discussion just after seeing this thread and put the pieces together!)

Basically, hard links are *the worst*. For everything in filesystems. I spent a lot of time trying to figure out how to handle hard links being renamed across snapshots[1] and never managed it, and the eventual "solution" was to give up and do the degenerate thing: if there's a file with multiple hard links, that file is a member of *every* snapshot.

Doing anything about this will take a lot of time. There's probably an opportunity to improve it for users of the subvolumes library, as those subvolumes do get tagged a bit, so I'll see if we can look into that. But for generic CephFS, I'm not sure what the solution will look like at all.

Sorry folks. :/
-Greg

[1]: The issue is that, if you have a hard-linked file in two places, you would expect it to be snapshotted whenever a snapshot covering either location occurs. But in CephFS the file can only live in one location, and the other location has to just hold a reference to it instead. So say you have inode Y at path A, and then hard link it in at path B. Given how snapshots work, when you open up Y from A, you would need to check all the snapshots that apply from both A's and B's trees. But 1) opening up other paths is a challenge all on its own, and 2) without an inode and its backtrace to provide a lookup resolve point, it's impossible to maintain a lookup that scales and is possible to keep consistent. (Oh, I did just have one idea, but I'm not sure if it would fix every issue or just that scalable backtrace lookup: https://tracker.ceph.com/issues/54205)

>
> mkdir /a
> cd /a
> for i in {1..1000000}; do touch $i; done   # create 1M files in /a
> cd ..
> mkdir /b
> mkdir /b/.snap/testsnap   # create a snap in the empty dir /b
> rm -rf /a/
>
>
> Cheers, Dan
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx