Re: Removing secondary data pool from mds

Dear Michael,

Good to hear that it is over.

I'm a bit surprised and also worried that you lost data again. Was the cluster rebalancing when the restarts happened? I have had OSDs restart all over the place due to bugs, OOM kills or admin accidents and never lost anything (except data access for a while). A PG should go read-only or even become inaccessible as soon as the number of OSDs in its acting set drops below min_size. With min_size>=2 for replicated pools and min_size>=k+1 for EC pools everywhere, it should be impossible to end up with incomplete and/or lost objects.

During a benchmark I had an SFP transceiver go bad, and it took at least 30% of the OSDs down constantly (proper crashes + restarts) for more than an hour while the benchmark was hammering away. After we removed the transceiver, everything recovered and moved on as if nothing had happened.
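
As a quick check, the size/min_size of every pool is visible with the standard commands below (the pool name is just an example from your earlier mails, substitute your own):

   ceph osd pool ls detail
   # or per pool:
   ceph osd pool get fs.data.archive.frames size
   ceph osd pool get fs.data.archive.frames min_size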

It would be really helpful for judging whether a cluster is at risk if you could give details about the exact condition your cluster was in just before the restarts. I assume it was not HEALTH_OK. Information of interest would include (a few commands that would capture most of this are sketched below the list):

- did remapped PGs exist
- were OSDs added and was the cluster rebalancing
- did undersized PGs already exist
- did any other degraded PGs exist
- was degraded redundancy reported
- did misplaced objects exist
- size and min_size of each pool
- anything else that is non-standard and could be relevant
- did all PGs come back active+clean (or to their original state) after the servers were back up
- after the restart, did extra degraded states show up, for example more incomplete PGs or more misplaced/degraded objects than before
- do you happen to have the output of 'ceph status' from just before the restarts?
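
For reference, roughly the following would capture most of this (nothing exotic, just the standard status commands; run them now and, if you saved anything from before the restarts, compare):

   ceph status
   ceph health detail
   ceph osd pool ls detail      # size/min_size and EC profile per pool
   ceph pg dump_stuck           # stuck undersized/degraded/unclean PGs
   ceph osd df tree             # recently added OSDs / ongoing rebalance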

It is really odd that you run into such dramatic problems just from restarting servers. This should go unnoticed even if your cluster is not HEALTH_OK.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 12 March 2021 22:29:48
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Removing secondary data pool from mds

Hi Frank,

I finally got around to removing the data pool.  It went without a hitch.

Ironically, about a week before I got around to removing the pool, I
suffered the same problem as before, except this time it wasn't a power
glitch that took out the OSDs; it was my own careless self, who decided
to reboot too many OSD hosts at the same time.  Multiple OSDs went
down while I was copying a lot of data into Ceph, and as before, this
left a bunch of corrupted files that caused stat() and unlink() to hang.

I recovered it the same as before, by removing the files from the
filesystem, then removing the lost objects from the PGs.  Unlike last
time, I did not try to copy the good files into a new pool.
Fortunately, this cleanup process worked fine.

For those watching from home, here are the steps I took to clean up:

* Restart all mons (I rebooted all of them, but it may have been enough
to simply restart the mds).  Reboot the client that is experiencing the
hang.  This didn't fix the problem with stat() hanging, but did allow
unlink() (and /usr/bin/unlink) to remove the files without hanging.  I'm
not sure which of these steps is the necessary one, as I did all of them
before I was able to proceed.
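
If you only want to restart the daemons instead of rebooting whole hosts, the stock systemd units should be enough; the daemon IDs below are just placeholders for my hosts:

   systemctl restart ceph-mon@ceph1    # on each mon host
   systemctl restart ceph-mds@ceph1    # on the mds host
   # then reboot the client that has the hung stat()/unlink() calls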

* Make a list of the affected PGs:
   ceph pg dump_stuck  | grep recovery_unfound > pg.txt

* Make a list of the affected OIDs:
   cat pg.txt | awk '{print $1}' | while read pg ; do
       echo $pg
       ceph pg $pg list_unfound | jq '.objects[].oid.oid'
   done | sed -e 's/"//g' > oid.txt

* Convert the OID numbers to inodes:
   cat oid.txt | awk '{print $2}' | sed -e 's/\..*//' | while read oid ; do
       printf "%d\n" 0x${oid}
   done > inum.txt

* Find the filenames corresponding to the affected inodes (requires the
/ceph filesystem to be mounted):
   cat inum.txt | while read inum ; do
       echo -n "${inum} "
       find /ceph/frames/O3/raw -inum ${inum}
   done > files.txt

* Call /usr/bin/unlink on each of the files in files.txt.  Don't use
/usr/bin/rm, as it will hang when calling stat() before unlink().
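
Something like this loop worked, assuming each line of files.txt has the form '<inum> <path>' and none of the paths contain spaces (lines without a path are skipped):

   cat files.txt | awk 'NF > 1 {print $2}' | while read f ; do
       unlink "$f"
   done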

* Remove the unfound objects:
   cat pg.txt | awk '{print $1}' | while read pg ; do
       echo $pg
       ceph pg $pg mark_unfound_lost delete
   done

* Watch the output of 'ceph -s' to see the cluster become healthy again
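   For example, either of these standard commands will do:
   watch -n 10 ceph -s
   # or stream status/log updates as they happen:
   ceph -w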

--Mike

On 2/12/21 4:55 PM, Frank Schilder wrote:
> Hi Michael,
>
> I also think it would be safe to delete. The object count might be an incorrect reference count of lost objects that didn't get decremented. This might be fixed by running a deep scrub over all PGs in that pool.
>
> I don't know rados well enough to find out where such an object count comes from. However, ceph df is known to be imperfect. Maybe it's just an accounting bug there. I think there were a couple of cases where people deleted all objects in a pool and ceph df would still report non-zero usage.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Michael Thomas <wart@xxxxxxxxxxx>
> Sent: 12 February 2021 22:35:25
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re:  Removing secondary data pool from mds
>
> Hi Frank,
>
> We're not using snapshots.
>
> I was able to run:
>       ceph daemon mds.ceph1 dump cache /tmp/cache.txt
>
> ...and scan for the stray object to find the cap id that was accessing
> the object.  I matched this with the entity name in:
>       ceph daemon mds.ceph1 session ls
>
> ...to determine the client host.  The strays went away after I rebooted
> the offending client.
>
> With all access to the objects now cleared, I ran:
>
>       ceph pg X.Y mark_unfound_lost delete
>
> ...on any remaining rados objects.
>
> At this point (at long last) the pool was able to return to the
> 'HEALTHY' status.  However, there is one remaining bit that I don't
> understand.  'ceph df' returns 355 objects for the pool
> (fs.data.archive.frames):
>
> https://pastebin.com/vbZLhQmC
>
> ...but 'rados -p fs.data.archive.frames ls --all' returns no objects.
> So I'm not sure what these 355 objects were.  Because of that, I haven't
> removed the pool from cephfs quite yet, even though I think it would be
> safe to do so.
>
> --Mike
>
>
> On 2/10/21 4:20 PM, Frank Schilder wrote:
>> Hi Michael,
>>
>> out of curiosity, did the pool go away or did it put up a fight?
>>
>> I don't remember exactly, it's a long time ago, but I believe stray objects on fs pools come from files that are still in snapshots but were deleted at the fs level. Such files are moved to special stray pools until the snapshot containing them is deleted as well. Not sure if this applies here, though; there might be other occasions when objects go to stray.
>>
>> I updated the case concerning the underlying problem, but not too much progress there either: https://tracker.ceph.com/issues/46847#change-184710 . I had PG degradation even when using the recovery technique with before- and after-crush maps. I was just lucky that I lost only 1 shard per object, so ordinary recovery could fix it.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Michael Thomas <wart@xxxxxxxxxxx>
>> Sent: 21 December 2020 23:12:09
>> To: ceph-users@xxxxxxx
>> Subject:  Removing secondary data pool from mds
>>
>> I have a cephfs secondary (non-root) data pool with unfound and degraded
>> objects that I have not been able to recover[1].  I created an
>> additional data pool and used 'setfattr -n ceph.dir.layout.pool' and a
>> very long rsync to move the files off of the degraded pool and onto the
>> new pool.  This has completed, and using find + 'getfattr -n
>> ceph.file.layout.pool', I verified that no files are using the old pool
>> anymore.  No ceph.dir.layout.pool attributes point to the old pool either.
>>
>> However, the old pool still reports that there are objects in the old
>> pool, likely the same ones that were unfound/degraded from before:
>> https://pastebin.com/qzVA7eZr
>>
>> Based on an old message from the mailing list[2], I checked the MDS for
>> stray objects (ceph daemon mds.ceph4 dump cache file.txt ; grep -i stray
>> file.txt) and found 36 stray entries in the cache:
>> https://pastebin.com/MHkpw3DV.  However, I'm not certain how to map
>> these stray cache objects to clients that may be accessing them.
>>
>> 'rados -p fs.data.archive.frames ls' shows 145 objects.  Looking at the
>> parent of each object shows 2 strays:
>>
>> for obj in $(cat rados.ls.txt) ; do
>>     echo $obj
>>     rados -p fs.data.archive.frames getxattr $obj parent | strings
>> done
>>
>>
>> [...]
>> 10000020fa1.00000000
>> 10000020fa1
>> stray6
>> 10000020fbc.00000000
>> 10000020fbc
>> stray6
>> [...]
>>
>> ...before getting stuck on one object for over 5 minutes (then I gave up):
>>
>> 1000005b1af.00000083
>>
>> What can I do to make sure this pool is ready to be safely deleted from
>> cephfs (ceph fs rm_data_pool archive fs.data.archive.frames)?
>>
>> --Mike
>>
>> [1]https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#7YQ6SSTESM5LTFVLQK3FSYFW5FDXJ5CF
>>
>> [2]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005233.html
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



