Re: [ext] CephFS pool not releasing space after data deletion

Hey Frank, hey Venky,

Thanks for looking into this.
We are not yet sure whether all of the expected capacity has been or will be released.

Eventually, we just continued cleaning out more old data from the old 
pool.
This is still in progress, but for other data sets in this old pool we 
indeed observed reasonable capacity releases.
So we plan to have another look at the leftover objects once we have removed 
all data in the CephFS known to be associated with the old pool.
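
In case it is useful, this is roughly how we intend to look at the leftovers
(just a sketch; 'hdd_ec' is the old data pool from the listings further below,
and listing all objects of a large pool takes a while):

    # per-pool usage and object counts
    ceph df detail
    rados df

    # sample of the object names still left in the old data pool
    rados -p hdd_ec ls | head -n 20

    # count the remaining objects (slow on a large pool)
    rados -p hdd_ec ls | wc -l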

Some more things we tried and observed:
We tried to trigger some snapshot purging using the OSD daemon command 
'scrub_purged_snaps'.
I looked for PGs on the old pool which were in state snaptrim or 
snaptrim_wait
and then triggered purging on their primary active OSD, e.g.: 'ceph tell 
osd.313 scrub_purged_snaps'
Looking at the OSD debug logs (after 'ceph tell osd.313 config set 
debug_osd 20/20'),
we got a lot of 'snap_mapper.run' output for the new pool like the 
following, but never for the old pool:
`Nov 07 18:29:47 ceph-2-8 bash[11952]: debug 
2023-11-07T17:29:47.282+0000 7fbae2af1700 20 snap_mapper.run ok 
1:e4e26584:::10017209955.00000000:3a11 snap 39b8 precedes pool 20 
purged_snaps [2a4c,2a4d)`
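
For completeness, the rough sequence of commands was like this (the PG id
3.1f is just a made-up example, 'hdd_ec' stands for the old pool and osd.313
for the primary of the chosen PG):

    # PGs of the old pool still in snaptrim/snaptrim_wait
    ceph pg ls-by-pool hdd_ec snaptrim snaptrim_wait

    # look up the acting set / primary OSD of one of those PGs
    ceph pg map 3.1f

    # raise debug logging on that OSD and trigger the purged-snaps scrub
    ceph tell osd.313 config set debug_osd 20/20
    ceph tell osd.313 scrub_purged_snaps

    # reduce logging again afterwards (20/20 is very verbose)
    ceph tell osd.313 config set debug_osd 1/5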

Another thing worth mentioning:
We stumbled upon some older manual snapshots nested in different 
sub-directories.
They were not in sub-directories of the particular deleted folder, but 
of another directory with scheduled snapshots.
So our working theory right now is that such nested snapshots might 
block proper snapshot trimming/purging in general.
After all, if I remember correctly, nested snapshots are not actually a 
supported feature.
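
In case somebody wants to check for such leftover snapshots on their side:
since the '.snap' directories are virtual and do not show up in normal
listings, we basically walked the tree and listed '.snap' per directory,
roughly like this ('/mnt/cephfs' is just a placeholder for the mount point,
and walking a large tree this way is of course slow):

    # list snapshots created directly on each directory under the mount;
    # entries starting with '_' (which, at least on our version, are the
    # ones inherited from snapshots taken further up the tree) are skipped
    find /mnt/cephfs -xdev -type d 2>/dev/null | while read -r d; do
        snaps=$(ls -A "$d/.snap" 2>/dev/null | grep -v '^_')
        [ -n "$snaps" ] && echo "$d: $snaps"
    done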

Let us know if these additional observations are helpful.
Looking forward to hearing from you.

Best Wishes,
Mathias

On 12/5/2023 7:29 AM, Venky Shankar wrote:
> Hi Mathias/Frank,
>
> (sorry for the late reply - this didn't get much attention including
> the tracker report and eventually got parked).
>
> Will have this looked into - expect an update in a day or two.
>
> On Sat, Dec 2, 2023 at 5:46 PM Frank Schilder <frans@xxxxxx> wrote:
>> Hi Mathias,
>>
>> have you made any progress on this? Did the capacity become available eventually?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx>
>> Sent: Friday, October 27, 2023 3:52 PM
>> To: ceph-users@xxxxxxx; Frank Schilder
>> Subject: Re: [ext]  CephFS pool not releasing space after data deletion
>>
>> Dear ceph users,
>>
>> We are wondering if this might be the same issue as in this bug:
>> https://tracker.ceph.com/issues/52581
>>
>> Except that we seem to have snapshots dangling on the old pool,
>> while the bug report describes snapshots dangling on the new pool.
>> But maybe it's both?
>>
>> I mean, once the global root layout was pointed to the new pool,
>> the new pool became responsible for snapshots of at least the new data, right?
>> What about data which is overwritten? Is there a conflict of responsibility?
>>
>> We do have similar listings of snaps with "ceph osd pool ls detail", I
>> think:
>>
>> 0|0[root@osd-1 ~]# ceph osd pool ls detail | grep -B 1 removed_snaps_queue
>> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1
>> object_hash rjenkins pg_num 115 pgp_num 107 pg_num_target 32
>> pgp_num_target 32 autoscale_mode on last_change 803558 lfor
>> 0/803250/803248 flags hashpspool,selfmanaged_snaps stripe_width 0
>> expected_num_objects 1 application cephfs
>>           removed_snaps_queue
>> [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
>> --
>> pool 3 'hdd_ec' erasure profile hdd_ec size 3 min_size 2 crush_rule 3
>> object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off
>> last_change 803558 lfor 0/87229/87229 flags
>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application
>> cephfs
>>           removed_snaps_queue
>> [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
>> --
>> pool 20 'hdd_ec_8_2_pool' erasure profile hdd_ec_8_2_profile size 10
>> min_size 9 crush_rule 5 object_hash rjenkins pg_num 8192 pgp_num 8192
>> autoscale_mode off last_change 803558 lfor 0/0/681917 flags
>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 32768
>> application cephfs
>>           removed_snaps_queue
>> [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
>>
>>
>> Here, pool hdd_ec_8_2_pool is the one we recently assigned to the root
>> layout.
>> Pool hdd_ec is the one which was assigned before and which won't release
>> space (at least as far as I know).
>>
>> Is this removed_snaps_queue the same as removed_snaps in the bug issue
>> (i.e. the label was renamed)?
>> And is it normal that all queues list the same info or should this be
>> different per pool?
>> Might this be related to the pools now sharing responsibility for some
>> snaps due to the layout change?
>>
>> And for the big question:
>> How can I actually trigger/speedup the removal of those snaps?
>> I find the removed_snaps/removed_snaps_queue mentioned a few times in
>> the user list.
>> But never with some conclusive answer how to deal with them.
>> And the only mentions in the docs are just change logs.
>>
>> I also looked into and started cephfs stray scrubbing:
>> https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub
>> But according to the status output, no scrubbing is actually active.
>>
>> I would appreciate any further ideas. Thanks a lot.
>>
>> Best Wishes,
>> Mathias
>>
>> On 10/23/2023 12:42 PM, Kuhring, Mathias wrote:
>>> Dear Ceph users,
>>>
>>> Our CephFS is not releasing/freeing up space after deleting hundreds of
>>> terabytes of data.
>>> By now, this drives us into a "nearfull" osd/pool situation and thus
>>> throttles IO.
>>>
>>> We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
>>> quincy (stable).
>>>
>>> Recently, we moved a bunch of data to a new pool with better EC.
>>> This was done by adding a new EC pool to the FS.
>>> Then assigning the FS root to the new EC pool via the directory layout xattr
>>> (so all new data is written to the new pool).
>>> And finally copying old data to new folders.
>>>
>>> I swapped the data as follows to retain the old directory structures.
>>> I also made snapshots for validation purposes.
>>>
>>> So basically:
>>> cp -r mymount/mydata/ mymount/new/ # this creates copy on new pool
>>> mkdir mymount/mydata/.snap/tovalidate
>>> mkdir mymount/new/mydata/.snap/tovalidate
>>> mv mymount/mydata/ mymount/old/
>>> mv mymount/new/mydata mymount/
>>>
>>> I could see the increase of data in the new pool as expected (ceph df).
>>> I compared the snapshots with hashdeep to make sure the new data is alright.
>>>
>>> Then I went ahead deleting the old data, basically:
>>> rmdir mymount/old/mydata/.snap/* # this also included a bunch of other
>>> older snapshots
>>> rm -r mymount/old/mydata
>>>
>>> At first we had a bunch of PGs in snaptrim/snaptrim_wait.
>>> But they have been done for quite some time now.
>>> And now, two weeks later, the size of the old pool still hasn't
>>> really decreased.
>>> I'm still waiting for around 500 TB to be released (and much more is
>>> planned).
>>>
>>> I honestly have no clue where to go from here.
>>> From my point of view (i.e. the CephFS mount), the data is gone.
>>> I also never hard- or soft-linked it anywhere.
>>>
>>> This doesn't seem to be a common issue.
>>> At least I couldn't find anything related or resolved in the docs or
>>> user list, yet.
>>> If anybody has an idea how to resolve this, I would highly appreciate it.
>>>
>>> Best Wishes,
>>> Mathias
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> --
>> Mathias Kuhring
>>
>> Dr. rer. nat.
>> Bioinformatician
>> HPC & Core Unit Bioinformatics
>> Berlin Institute of Health at Charité (BIH)
>>
>> E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
>> Mobile: +49 172 3475576
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
-- 
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



