Re: CephFS no longer mounts and asserts in MDS after upgrade to 0.67.3

Gregory Farnum <greg@xxxxxxxxxxx> · Wed, 11 Sep 2013 11:36:11 -0700

On Wed, Sep 11, 2013 at 7:48 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Sep 11, 2013 at 10:06 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>> Hey Yan,
>>
>> On 11-09-13 15:12, Yan, Zheng wrote:
>>> On Wed, Sep 11, 2013 at 7:51 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>> Hey Gregory,
>>>>
>>>> I wiped and re-created the MDS-cluster I just mailed about, starting out
>>>> by making sure CephFS is not mounted anywhere, stopping all MDSs,
>>>> completely cleaning the "data" and "metadata"-pools using "rados
>>>> --pool=<pool> cleanup <prefix>", then creating a new cluster using `ceph
>>>> mds newfs 1 0 --yes-i-really-mean-it' and starting all MDSs again.
>>>> Directly afterwards, I saw this:
>>>> # rados --pool=metadata ls
>>>> 1.00000000
>>>> 2.00000000
>>>> 200.00000000
>>>> 200.00000001
>>>> 600.00000000
>>>> 601.00000000
>>>> 602.00000000
>>>> 603.00000000
>>>> 605.00000000
>>>> 606.00000000
>>>> 608.00000000
>>>> 609.00000000
>>>> mds0_inotable
>>>> mds0_sessionmap
>>>>
>>>> Note the missing objects, right from the start.  I was able to mount the
>>>> CephFS at this point, but after unmounting it and restarting the
>>>> MDS-cluster, it failed to come up, with the same symptoms as before.  I
>>>> didn't place any files on CephFS at any point between newfs and failure.
>>>> Naturally, I tried initializing it again, but now, even after more than
>>>> 5 tries, the "mds*"-objects simply no longer show up in the
>>>> "metadata"-pool at all.  In fact, it remains empty.  I can mount CephFS
>>>> after the first start of the MDS-cluster after a newfs, but on restart,
>>>> it fails because of the missing objects.  Am I doing anything wrong
>>>> while initializing the cluster, maybe?  Is cleaning the pools and doing
>>>> the newfs enough?  I did the same on the other cluster yesterday and it
>>>> seems to have all objects.
>>>>
>>>
>>> Thank you for your detailed information.
>>>
>>> The objects are missing because the MDS IDs for the old FS and the new
>>> FS are the same (the incarnations are the same). When an OSD receives
>>> MDS requests for the newly created FS, it silently drops them because
>>> it thinks they are duplicates.  You can get around the bug by
>>> creating new pools for the newfs.
>>
>> Thanks for this very useful info, I think this solves the mystery!
>> Could I get around it any other way?  I'd rather not have to re-create
>> the pools and switch to new pool-IDs every time I have to do this.
>> Does the OSD store this info in its meta-data, or might restarting the
>> OSDs be enough?  I'm quite sure that I re-created MDS-clusters on the
>> same pools many times, without all the objects going missing.  This was
>> usually as part of tests, where I also restarted other
>> cluster-components, like OSDs.  This could explain why only some files
>> went missing.  If some OSDs were restarted and processed the requests,
>> while others dropped the requests, it would appear as if some, but not
>> all objects are missing.  The problem then persists until the active MDS
>> in the MDS-cluster is restarted, after which the missing objects get
>> noticed, because things fail to restart.  IMHO, this is a bug.  Why
>
> Yes, it's a bug. Fixing it should be easy.
>
>> would the OSD ignore these requests, if the objects the MDS tries to
>> write don't even exist at that time?
>>
>
> The OSD uses information in the PG log to detect duplicate requests, so
> restarting the OSD does not help. Another way to get around the bug is
> to generate lots of writes to the data/metadata pools, so that each PG
> trims the old entries from its log.
>
> Regards
> Yan, Zheng

This definitely explains the symptoms seen here on a
not-very-busy/long-lived cluster; I wish I had the notes to figure out
if it could have caused the problem for other users as well. I'm not
sure of the best way to work around the problem in the code, though. We
could add an "fs generation" number to every object or every mds
incarnation, but that seems a bit icky. Did you have other ideas,
Zheng?
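
For illustration, here is a toy Python model of the behaviour discussed
in this thread. It is only a sketch under stated assumptions, not Ceph
code: ReqId, ToyPG, the fs_generation field and the log size of 4 are
all hypothetical names and values, and cleanup() is simplified so that
it leaves the log untouched. It shows a write being dropped as a
duplicate after a newfs on the same pool, and the same write being
applied once the request id carries a generation number or once the old
log entry has been trimmed by other writes.

```python
from collections import deque
from typing import NamedTuple


class ReqId(NamedTuple):
    """Hypothetical request id: sender, incarnation, and sequence number."""
    client: str             # e.g. "mds.0"
    incarnation: int        # identical for the old and the new FS in this bug
    tid: int                # per-sender transaction counter
    fs_generation: int = 0  # proposed extra field; 0 models "not present"


class ToyPG:
    """Toy placement group with a bounded log of applied request ids."""

    def __init__(self, max_log_entries=4):
        # A deque with maxlen drops the oldest entry on overflow,
        # which stands in for PG log trimming.
        self.log = deque(maxlen=max_log_entries)
        self.objects = {}

    def write(self, reqid, name, data):
        """Apply the write unless reqid is still logged; True if applied."""
        if reqid in self.log:
            return False  # treated as a resend of an already-applied request
        self.objects[name] = data
        self.log.append(reqid)
        return True

    def cleanup(self):
        """Simplified pool cleanup: objects go away, log entries stay."""
        self.objects.clear()


# Old FS writes an object, the pool is cleaned, and newfs reuses the same ids.
pg = ToyPG()
pg.write(ReqId("mds.0", 1, 1), "mds0_inotable", b"old fs")
pg.cleanup()
applied_same_id = pg.write(ReqId("mds.0", 1, 1), "mds0_inotable", b"new fs")

# With a generation number in the id, the request is no longer a duplicate.
applied_new_gen = pg.write(
    ReqId("mds.0", 1, 1, fs_generation=1), "mds0_inotable", b"new fs"
)

print("same id applied:", applied_same_id)
print("new generation applied:", applied_new_gen)
```

In this model, pushing at least max_log_entries unrelated writes through
the PG between the cleanup and the newfs has the same effect as the
generation number, because the stale entry falls off the end of the log.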

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com