Hi Mathias, have you made any progress on this? Did the capacity become available eventually?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx>
Sent: Friday, October 27, 2023 3:52 PM
To: ceph-users@xxxxxxx; Frank Schilder
Subject: Re: [ext] CephFS pool not releasing space after data deletion

Dear Ceph users,

We are wondering if this might be the same issue as in this bug report:
https://tracker.ceph.com/issues/52581

Except that we seem to have snapshots dangling on the old pool, while the bug report has snapshots dangling on the new pool. But maybe it's both? I mean, once the global root layout was switched to the new pool, the new pool became responsible for snapshotting at least the new data, right? What about data which is overwritten? Is there a conflict of responsibility?

We do have similar listings of snaps with "ceph osd pool ls detail", I think:

0|0[root@osd-1 ~]# ceph osd pool ls detail | grep -B 1 removed_snaps_queue
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 115 pgp_num 107 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 803558 lfor 0/803250/803248 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 1 application cephfs
        removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 3 'hdd_ec' erasure profile hdd_ec size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 803558 lfor 0/87229/87229 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application cephfs
        removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 20 'hdd_ec_8_2_pool' erasure profile hdd_ec_8_2_profile size 10 min_size 9 crush_rule 5 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode off last_change 803558 lfor 0/0/681917 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 32768 application cephfs
        removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]

Here, pool hdd_ec_8_2_pool is the one we recently assigned to the root layout. Pool hdd_ec is the one which was assigned before and which won't release space (at least as far as I know).

Is this removed_snaps_queue the same as removed_snaps in the bug issue (i.e. was the label just renamed)? And is it normal that all queues list the same info, or should this differ per pool? Might this be related to the pools now sharing responsibility for some snaps due to the layout changes?

And for the big question: How can I actually trigger or speed up the removal of those snaps? I found removed_snaps/removed_snaps_queue mentioned a few times on the user list, but never with a conclusive answer on how to deal with them. And the only mentions in the docs are in the changelogs.

I also looked into and started CephFS stray scrubbing:
https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub
But according to the status output, no scrubbing is actually active.

I would appreciate any further ideas. Thanks a lot.
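For completeness, checking for leftover snaptrim activity and starting the stray scrub looked roughly like the following. The FS name is a placeholder, and the pg state check is just one way of counting snaptrimming PGs, so please treat this as a sketch rather than the exact commands:

# count PGs that are still in a snaptrim or snaptrim_wait state
ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim

# evaluate strays via a recursive scrub of the MDS-private directory, as per the docs linked above
ceph tell mds.<fs_name>:0 scrub start ~mdsdir recursive
ceph tell mds.<fs_name>:0 scrub status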
Best Wishes,
Mathias

On 10/23/2023 12:42 PM, Kuhring, Mathias wrote:
> Dear Ceph users,
>
> Our CephFS is not releasing/freeing up space after deleting hundreds of
> terabytes of data.
> By now, this drives us into a "nearfull" OSD/pool situation and thus
> throttles IO.
>
> We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
> quincy (stable).
>
> Recently, we moved a bunch of data to a new pool with better EC.
> This was done by adding a new EC pool to the FS,
> then assigning the FS root to the new EC pool via the directory layout xattr
> (so all new data is written to the new pool),
> and finally copying the old data to new folders.
>
> I swapped the data as follows to retain the old directory structure.
> I also made snapshots for validation purposes.
>
> So basically:
> cp -r mymount/mydata/ mymount/new/  # this creates the copy on the new pool
> mkdir mymount/mydata/.snap/tovalidate
> mkdir mymount/new/mydata/.snap/tovalidate
> mv mymount/mydata/ mymount/old/
> mv mymount/new/mydata mymount/
>
> I could see the increase of data in the new pool as expected (ceph df).
> I compared the snapshots with hashdeep to make sure the new data is intact.
>
> Then I went ahead deleting the old data, basically:
> rmdir mymount/old/mydata/.snap/*  # this also included a bunch of other
> older snapshots
> rm -r mymount/old/mydata
>
> At first we had a bunch of PGs with snaptrim/snaptrim_wait,
> but they have been done for quite some time now.
> And now, two weeks later, the size of the old pool still hasn't
> really decreased.
> I'm still waiting for around 500 TB to be released (and much more is
> planned).
>
> I honestly have no clue where to go from here.
> From my point of view (i.e. the CephFS mount), the data is gone.
> I also never hard- or soft-linked it anywhere.
>
> This doesn't seem to be a common issue.
> At least I couldn't find anything related or resolved in the docs or
> on the user list yet.
> If anybody has an idea how to resolve this, I would highly appreciate it.
>
> Best Wishes,
> Mathias
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576
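For context on the layout step described above: pointing the FS root at a new data pool is typically done by setting the directory layout xattr on the mounted root, roughly as sketched here. The mount point mirrors the "mymount" placeholder from the thread and the pool name matches the hdd_ec_8_2_pool mentioned earlier, but treat this as an illustration under those assumptions, not as the commands actually run:

setfattr -n ceph.dir.layout.pool -v hdd_ec_8_2_pool mymount/
getfattr -n ceph.dir.layout mymount/  # verify which pool new files will be written to

Changing the layout only affects newly written files; existing file data stays in the pool it was originally written to, which is why the data had to be copied and the old copies deleted.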