Re: [ceph-users] Message sequence overflow

Ilya Dryomov <idryomov@xxxxxxxxx> · Thu, 2 Jun 2016 11:13:15 +0200

On Thu, Jun 2, 2016 at 4:46 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Jun 1, 2016 at 10:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@xxxxxxxxxxx> wrote:
>>> >> > Dear ceph-users...
>>> >> >
>>> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>>> >> >
>>> >> > Our infrastructure is the following:
>>> >> > - We use CEPH/CEPHFS (10.2.1)
>>> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>>> >> > - We use enterprise SSDs for everything including journals
>>> >> > - We have one main mds and one standby mds.
>>> >> > - We are using ceph kernel client to mount cephfs.
>>> >> > - We have upgraded to Ubuntu 16.04 (4.4.0-22-generic kernel)
>>> >> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>>> >> > - These are physical machines with 8 cores & 32GB memory
>>> >> >
>>> >> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>>> >> > We can’t tell if our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning the NFS server has not solved our issues.
>>> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>>> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>>> >> >
>>> >> >
>>> >> > Here are the relevant log entries.
>>> >> >
>>> >> > From my primary MDS server, I start seeing these entries start to pile up:
>>> >> >
>>> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago
>>> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago
>>> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>> >> >
>>> >> > From my NFS server, I see these entries from dmesg also start piling up:
>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>> >> >
>>> >>
>>> >> 4294967296 is 0x100000000, so this looks like a sequence overflow.
>>> >>
>>> >> In src/msg/Message.h:
>>> >>
>>> >> class Message {
>>> >> ...
>>> >>   unsigned get_seq() const { return header.seq; }
>>> >>   void set_seq(unsigned s) { header.seq = s; }
>>> >> ...
>>> >> }
>>> >>
>>> >> in src/msg/simple/Pipe.cc
>>> >>
>>> >> class Pipe {
>>> >> ...
>>> >>   __u32 get_out_seq() { return out_seq; }
>>> >> ...
>>> >> }
>>> >>
>>> >> Is this a bug or intentional?
>>> >
>>> > That's a bug.  The seq values are intended to be 32 bits.
>>> >
>>> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>>> > checks, which does a sloppy comparison so that a 31-bit signed difference
>>> > is used to determine > or <.  It sounds like in this case we're just
>>> > failing an equality check, though.)
>>> >
>>>
>>> struct ceph_msg_header {
>>>         __le64 seq;       /* message seq# for this session */
>>>         ...
>>> }
>>>
>>> Do you mean we should leave the upper 32 bits unused?
>>
>> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).
>>
>> I think we can safely go either way.. the question is which path is
>> easier.  If we move to 32 bits used on the kernel side, will userspace
>> also need to be patched to make reconnect work?  That unsigned get_seq()
>> is only 32-bits wide.
>
> I don't think userspace needs to be patched.
>
>>
>> If we go with 64 bits, userspace still needs to be fixed to change that
>> unsigned to uint64_t.
>>
>> What do you think?
>> sage
>>
>
> I like the 64 bits approach. Here is userspace code that checks
> message sequence.
>
> Pipe::reader() {
>
> ...
>       if (m->get_seq() <= in_seq) {
>         ldout(msgr->cct,0) << "reader got old message "
>                 << m->get_seq() << " <= " << in_seq << " " << m << " " << *m
>                 << ", discarding" << dendl;
>
>         msgr->dispatch_throttle_release(m->get_dispatch_throttle_size());
>         m->put();
>
>         if (connection_state->has_feature(CEPH_FEATURE_RECONNECT_SEQ) &&
>             msgr->cct->_conf->ms_die_on_old_message)
>           assert(0 == "old msgs despite reconnect_seq feature");
>         continue;
>       }
>       if (m->get_seq() > in_seq + 1) {
>         ldout(msgr->cct,0) << "reader missed message?  skipped from seq "
>                            << in_seq << " to " << m->get_seq() << dendl;
>         if (msgr->cct->_conf->ms_die_on_skipped_message)
>           assert(0 == "skipped incoming seq");
>       }
>       m->set_connection(connection_state.get());
>       // note last received message.
>       in_seq = m->get_seq();
> ...
> }
>
> It looks like the code works fine when the two ends of the connection
> use different seq widths. We don't need to worry about the change
> breaking interoperability between patched and unpatched userspace.

Are you sure?  It seems to me that if that were true, we wouldn't have
this thread in the first place.  An unpatched m->get_seq() in

    if (m->get_seq() <= in_seq) {

would truncate the seq to 32 bits and the "old message" branch would be
taken.  The same goes for an unpatched m->set_seq(): it would put
a truncated seq on the wire, and the other (patched) side would trip
over it.
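
To make the concern concrete, here is a rough sketch of both mismatch
cases at the 2^32 boundary.  It is not the Ceph code: Header,
UnpatchedMessage, PatchedMessage, looks_old() and demo() are
hypothetical stand-ins, and I'm assuming the receiver keeps in_seq as
a 64-bit value and that "unsigned" is 32 bits wide.  Only the
comparison itself is taken from the Pipe::reader() snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ceph_msg_header: the on-wire seq is 64 bits.
struct Header {
  uint64_t seq = 0;
};

// Hypothetical model of the unpatched accessors: values pass through 32 bits.
struct UnpatchedMessage {
  Header header;
  uint32_t get_seq() const { return static_cast<uint32_t>(header.seq); }
  void set_seq(uint32_t s) { header.seq = s; }
};

// Hypothetical model of the patched accessors: full 64-bit width.
struct PatchedMessage {
  Header header;
  uint64_t get_seq() const { return header.seq; }
  void set_seq(uint64_t s) { header.seq = s; }
};

// Same comparison as the "old message" check in Pipe::reader().
template <typename Msg>
bool looks_old(const Msg& m, uint64_t in_seq) {
  return m.get_seq() <= in_seq;
}

// Walks both mismatch cases, then the case where both ends are patched.
inline void demo() {
  const uint64_t in_seq = 0xFFFFFFFFull;  // last seq the receiver saw
  const uint64_t next = in_seq + 1;       // 4294967296

  // Unpatched receiver, patched sender: the getter truncates to 0.
  UnpatchedMessage rx;
  rx.header.seq = next;
  assert(rx.get_seq() == 0);
  assert(looks_old(rx, in_seq));

  // Unpatched sender, patched receiver: the setter already stored 0.
  UnpatchedMessage tx;
  tx.set_seq(static_cast<uint32_t>(next));
  PatchedMessage seen;
  seen.header = tx.header;
  assert(seen.get_seq() == 0);
  assert(looks_old(seen, in_seq));

  // Both ends patched: the message is accepted as the next in sequence.
  PatchedMessage ok;
  ok.set_seq(next);
  assert(!looks_old(ok, in_seq));
}
```

In this model, either kind of mismatch makes the message at seq
4294967296 look like seq 0 to the receiver, so it gets discarded as
old.  Only the case where both ends use 64 bits gets past the check.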

Thanks,

                Ilya