Re: Fwd: Re: Fwd: Latest firefly: osd not joining cluster after re-creation

On Thu, Oct 23, 2014 at 9:18 PM, Joao Eduardo Luis
<joao.luis@xxxxxxxxxxx> wrote:
> Let me re-CC the list, as this may be worth keeping for the archives.
>
> On 10/23/2014 04:19 PM, Andrey Korolyov wrote:
>>
>> Doing off-list post again.
>>
>> So I was inaccurate in my initial bug description:
>> - mkfs goes just fine
>> - on first start the OSD crashes with SIGABRT and the trace from my
>> previous message, after changing the fsid in the mon store
>> - on next start it refuses to join due to an fsid mismatch, and no
>> longer crashes.
>>
>> On Thu, Oct 23, 2014 at 5:56 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>
>>> It is not so easy. When I added the fsid under the selected osd section
>>> and reformatted the store/journal, it aborted at start in
>>> FileStore::_do_transaction (see attachment). On the next launch, the
>>> fsid in the mon store for this OSD magically changes to something else
>>> and I am back at the same doorstep (if I shut down the osd process and
>>> recreate the journal with the new fsid inserted in the fsid option, or
>>> recreate the entire filestore too, it will abort; otherwise it simply
>>> does not join due to the *next* mismatch).
>>> As far as I can see, the problem is in the behavior of legacy clusters
>>> which inherited their fsid from a filesystem created by a third party,
>>> not as a result of ceph-deploy work, so it is not fixed at all after
>>> such an update. Any suggestions?
>
>
> I'm not sure what you mean by 'changing fsid in the mon store', but I
> suspect you have a few misconceptions about 'fsid' and the 'osd uuid'.
>
> The error you have below, regarding the osd fsid, refers to the osd's uuid,
> which is passed to '--mkfs' using '--osd-uuid X'.  'X' is also the uuid you
> would pass when adding the osd to the monitors using 'ceph osd create
> <uuid>'.
>
> Then there's the cluster 'fsid', which refers to the cluster.  This 'fsid'
> is kept in the monmap and is used to identify the cluster the monitors
> belong to and to allow clients (such as the osd) to correctly contact the
> monitors of the cluster they too belong to.

Yes, that is what I was referring to. The problem is that I called the
monmap the 'mon store', which is not quite correct in terms of the
documentation.

>
> Changing the 'fsid' option in ceph.conf results in changing the perceived
> value the clients and daemons have of the cluster fsid.  If this value is
> different from the monmap's you're bound to have trouble.  If you only
> change the 'fsid' option in the 'osd' section of ceph.conf, you're basically
> telling the osds that they belong to a different cluster, which will
> probably cause issues when they contact the monitors to obtain the monmap
> during mkfs.
>
> What you clearly want is to remove the contents of the osd data directory,
> generate a uuid 'X', run 'ceph osd create X', save the value it will return
> (it will be used as the OSD's id) and then run ceph-osd --mkfs with
> --osd-uuid X.
>
> Also, I don't believe that the 'clashing' message is a bug.  IMO we should
> assume that it's the operator's responsibility to remove the data if it's no
> longer of any use, instead of just assuming what the operator may have meant
> when running mkfs repeatedly over a given osd store.
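
The re-creation steps Joao describes above could be sketched roughly as
follows. This is only an illustration, not a tested procedure: the
function name recreate_osd and the CEPH_BIN / CEPH_OSD_BIN variables are
made up for the example, the data directory path in the usage comment is
an assumed default, and the sketch assumes 'ceph osd create <uuid>' prints
only the osd id on stdout.

```shell
#!/bin/sh
# Hypothetical helper; the binaries can be overridden for a dry run.
CEPH_BIN=${CEPH_BIN:-ceph}
CEPH_OSD_BIN=${CEPH_OSD_BIN:-ceph-osd}

# recreate_osd DATA_DIR UUID
#   1. remove the old contents of the osd data directory
#   2. register UUID with the monitors and capture the returned osd id
#   3. run mkfs for that id with the same UUID
recreate_osd() {
    osd_data=$1
    osd_uuid=$2

    [ -d "$osd_data" ] || { echo "no such directory: $osd_data" >&2; return 1; }

    # Step 1: wipe the old store (dotfiles are not matched by the glob).
    rm -rf "${osd_data:?}"/*

    # Step 2: 'ceph osd create <uuid>' is assumed to print the osd id.
    osd_id=$("$CEPH_BIN" osd create "$osd_uuid") || return 1

    # Step 3: format the store with the uuid the monitors now know about.
    "$CEPH_OSD_BIN" -i "$osd_id" --mkfs --osd-uuid "$osd_uuid"
}

# Example (path is an assumption, adjust to your layout):
#   recreate_osd /var/lib/ceph/osd/ceph-0 "$(uuidgen)"
```

The point of passing the same uuid to both commands is that the monitors
and the on-disk store then agree, which is what the 'clashes with existing
osd' check compares.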


Thanks, I see; using the existing UUID from 'osd dump' worked well. The
problem probably came from my previous experience with OSD re-creation,
which did not require a UUID to be specified when re-formatting an OSD
(and I believe there is some inconsistency anyway: if I specify an
existing osd id in the mkfs call, why not just fetch and reuse its UUID
for the filestore?). The crash with SIGABRT takes place only with
debug_ms set to 10 or higher, so I am probably hitting an independent
bug there.
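
For the archives, reusing the existing UUID can be sketched like this. It
is a rough illustration only: osd_uuid_from_dump is a made-up helper name,
and it assumes that each "osd.N ..." line of 'ceph osd dump' ends with
that OSD's uuid as the last whitespace-separated field, which may not hold
for every release.

```shell
#!/bin/sh
# osd_uuid_from_dump ID
#   Reads 'ceph osd dump' text on stdin and prints the last field of the
#   "osd.ID ..." line, assumed here to be the uuid recorded for that OSD.
osd_uuid_from_dump() {
    awk -v name="osd.$1" '$1 == name { print $NF }'
}

# Example, for osd.0:
#   uuid=$(ceph osd dump | osd_uuid_from_dump 0)
#   ceph-osd -i 0 --mkfs --osd-uuid "$uuid"
```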


>
> Hope this helps.
>
>   -Joao
>
>>>
>>> Trace is attached if someone is interested in it.
>>>
>>> On Thu, Oct 23, 2014 at 5:25 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>>
>>>> Sorry, I see the problem.
>>>>
>>>> osd.0 10.6.0.1:6800/32051 clashes with existing osd: different fsid
>>>> (ours: d0aec02e-8513-40f1-bf34-22ec44f68466 ; theirs:
>>>> 16cbb1f8-e896-42cd-863c-bcbad710b4ea). Anyway it is clearly a bug and
>>>> fsid should be silently discarded there if OSD contains no epochs
>>>> itself.
>
>
>
> --
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



