Re: [ext] CephFS pool not releasing space after data deletion

"Kuhring, Mathias" <mathias.kuhring@xxxxxxxxxxxxxx> · Fri, 27 Oct 2023 13:52:03 +0000

Dear ceph users,

We are wondering whether this might be the same issue as in this bug:
https://tracker.ceph.com/issues/52581

Except that we seem to have snapshots dangling on the old pool,
whereas the bug report describes snapshots dangling on the new pool.
But maybe it's both?

I mean, once the global root layout was pointed to a new pool,
the new pool became responsible for snapshotting, at least of new data, right?
What about data which is overwritten? Is there a conflict of responsibility?

I think we do have similar listings of snaps in "ceph osd pool ls detail":

[root@osd-1 ~]# ceph osd pool ls detail | grep -B 1 removed_snaps_queue
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 115 pgp_num 107 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 803558 lfor 0/803250/803248 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 1 application cephfs
         removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 3 'hdd_ec' erasure profile hdd_ec size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 803558 lfor 0/87229/87229 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application cephfs
         removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 20 'hdd_ec_8_2_pool' erasure profile hdd_ec_8_2_profile size 10 min_size 9 crush_rule 5 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode off last_change 803558 lfor 0/0/681917 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 32768 application cephfs
         removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]

Here, pool hdd_ec_8_2_pool is the one we recently assigned to the root
layout.
Pool hdd_ec is the one which was assigned before and which won't release
space (at least as far as I know).
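
For readers who want to confirm which pool a directory layout points to and
compare per-pool usage, here is a minimal sketch. It assumes a mounted CephFS
and the getfattr tool from the attr package. The function names
layout_pool_of and pool_usage are hypothetical helpers, not part of Ceph, and
the mount path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not Ceph commands).

# Print the data pool named in a directory's CephFS layout xattr.
# Note: getfattr reports an error if no layout was set explicitly on
# this directory (inherited layouts are not shown).
layout_pool_of() {
    getfattr --only-values -n ceph.dir.layout.pool "$1"
}

# Show per-pool usage so the old and new pool can be compared over time.
pool_usage() {
    ceph df detail
}

# Usage (placeholder mount path):
#   layout_pool_of /path/to/cephfs/mount
#   pool_usage
```

Running layout_pool_of on the root of the mount should print the pool that new
files are written to; pool_usage can then be repeated over a few days to see
whether the old pool shrinks at all.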

Is this removed_snaps_queue the same as removed_snaps in the bug report
(i.e. was the label renamed)?
And is it normal that all queues list the same info, or should this be
different per pool?
Might this be related to pools now sharing responsibility for some
snaps due to layout changes?
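
To compare the queues of different pools more easily, the interval lists can
be expanded into individual snapshot IDs. The following is a minimal sketch,
assuming each entry has the form <start>~<length> with both numbers in
hexadecimal (an assumption about the output format, not something the docs
confirm). The function name expand_snap_intervals is made up for illustration.

```shell
#!/bin/sh
# Expand an interval list such as "[3541~1,379f~2]" into one snapshot ID
# (hex) per line. Assumption: entries are <start>~<length>, both in hex.
expand_snap_intervals() {
    list=${1#\[}          # strip the leading "["
    list=${list%\]}       # strip the trailing "]"
    printf '%s\n' "$list" | tr ',' '\n' |
    while IFS='~' read -r start_hex len_hex; do
        # Skip empty or malformed entries.
        [ -n "$start_hex" ] && [ -n "$len_hex" ] || continue
        start=$((0x$start_hex))
        len=$((0x$len_hex))
        i=0
        while [ "$i" -lt "$len" ]; do
            printf '%x\n' "$((start + i))"
            i=$((i + 1))
        done
    done
}

# Usage:
#   expand_snap_intervals '[3541~1,379f~2]'
#   expand_snap_intervals '[3541~1,379f~2]' | wc -l
```

Under that assumption, the queue shown above expands to 22 snapshot IDs.
Identical expanded output for two pools means their queues list exactly the
same snapshots.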

And now for the big question:
How can I actually trigger or speed up the removal of those snaps?
I found removed_snaps/removed_snaps_queue mentioned a few times on
the user list,
but never with a conclusive answer on how to deal with them.
And the only mentions in the docs are in change logs.

I also looked into CephFS stray scrubbing and started a scrub:
https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub
But according to the status output, no scrubbing is actually active.
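
For reference, the commands from that documentation page can be wrapped as in
the sketch below. The function names start_stray_scrub and stray_scrub_status
are hypothetical, the file system name is a placeholder argument, and rank 0
is assumed to be the MDS rank of interest. Whether the ~mdsdir scrub behaves
the same on every release is not something this sketch verifies.

```shell
#!/bin/sh
# Hypothetical wrappers around the scrub commands from the CephFS docs.

# Start a recursive scrub of the MDS stray directory (~mdsdir) on rank 0.
# $1: file system name (placeholder).
start_stray_scrub() {
    ceph tell "mds.$1:0" scrub start '~mdsdir' recursive
}

# Ask rank 0 whether any scrub is currently active.
# $1: file system name (placeholder).
stray_scrub_status() {
    ceph tell "mds.$1:0" scrub status
}

# Usage (replace <fsname> with the actual file system name):
#   start_stray_scrub <fsname>
#   stray_scrub_status <fsname>
```

The tilde is quoted so the shell passes ~mdsdir through literally instead of
attempting tilde expansion.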

I would appreciate any further ideas. Thanks a lot.

Best Wishes,
Mathias

On 10/23/2023 12:42 PM, Kuhring, Mathias wrote:
> Dear Ceph users,
>
> Our CephFS is not releasing/freeing up space after deleting hundreds of
> terabytes of data.
> By now, this has driven us into a "nearfull" OSD/pool situation and thus
> throttles I/O.
>
> We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
> quincy (stable).
>
> Recently, we moved a bunch of data to a new pool with better EC.
> We did this by adding a new EC pool to the FS, then assigning the FS
> root to the new EC pool via the directory layout xattr (so all new data
> is written to the new pool), and finally copying the old data to new
> folders.
>
> I swapped the data as follows to retain the old directory structure.
> I also made snapshots for validation purposes.
>
> So basically:
> cp -r mymount/mydata/ mymount/new/ # this creates copy on new pool
> mkdir mymount/mydata/.snap/tovalidate
> mkdir mymount/new/mydata/.snap/tovalidate
> mv mymount/mydata/ mymount/old/
> mv mymount/new/mydata mymount/
>
> I could see the increase of data in the new pool as expected (ceph df).
> I compared the snapshots with hashdeep to make sure the new data is alright.
>
> Then I went ahead deleting the old data, basically:
> rmdir mymount/old/mydata/.snap/* # this also included a bunch of other older snapshots
> rm -r mymount/old/mydata
>
> At first we had a bunch of PGs in snaptrim/snaptrim_wait.
> But they have been done for quite some time now.
> Now, two weeks later, the size of the old pool still hasn't
> really decreased.
> I'm still waiting for around 500 TB to be released (and much more is
> planned).
>
> I honestly have no clue where to go from here.
> From my point of view (i.e. the CephFS mount), the data is gone.
> I also never hard/soft-linked it anywhere.
>
> This doesn't seem to be a common issue.
> At least I couldn't find anything related or resolved in the docs or
> on the user list yet.
> If anybody has an idea how to resolve this, I would highly appreciate it.
>
> Best Wishes,
> Mathias
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576