Re: [ceph-users] Message sequence overflow

On Thu, Jun 2, 2016 at 5:13 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Thu, Jun 2, 2016 at 4:46 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> On Wed, Jun 1, 2016 at 10:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>>> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@xxxxxxxxxxx> wrote:
>>>> >> > Dear ceph-users...
>>>> >> >
>>>> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>>>> >> >
>>>> >> > Our infrastructure is the following:
>>>> >> > - We use CEPH/CEPHFS (10.2.1)
>>>> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>>>> >> > - We use enterprise SSDs for everything including journals
>>>> >> > - We have one main mds and one standby mds.
>>>> >> > - We are using ceph kernel client to mount cephfs.
>>>> >> > - We have upgraded to Ubuntu 16.04 (4.4.0-22-generic kernel)
>>>> >> > - We are using kernel NFS to serve NFS clients from a ceph mount (~32 nfs threads, 0 swappiness)
>>>> >> > - These are physical machines with 8 cores & 32GB memory
>>>> >> >
>>>> >> > On a regular basis, we lose all IO via CephFS. We’re still trying to isolate the issue, but it surfaces as an issue between the MDS and the ceph client.
>>>> >> > We can’t tell if our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning the NFS server has not solved our issues.
>>>> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>>>> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>>>> >> >
>>>> >> >
>>>> >> > Here are the relevant log entries.
>>>> >> >
>>>> >> > From my primary MDS server, I start seeing these entries start to pile up:
>>>> >> >
>>>> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago
>>>> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago
>>>> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>>> >> >
>>>> >> > From my NFS server, I see these entries from dmesg also start piling up:
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>>> >> >
>>>> >>
>>>> >> 4294967296 is 0x100000000; this looks like a sequence overflow.
>>>> >>
>>>> >> In src/msg/Message.h:
>>>> >>
>>>> >> class Message {
>>>> >> ...
>>>> >>   unsigned get_seq() const { return header.seq; }
>>>> >>   void set_seq(unsigned s) { header.seq = s; }
>>>> >> ...
>>>> >> }
>>>> >>
>>>> >> in src/msg/simple/Pipe.cc
>>>> >>
>>>> >> class Pipe {
>>>> >> ...
>>>> >>   __u32 get_out_seq() { return out_seq; }
>>>> >> ...
>>>> >> }
>>>> >>
>>>> >> Is this a bug or intentional?
>>>> >
>>>> > That's a bug.  The seq values are intended to be 32 bits.
>>>> >
>>>> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>>>> > checks, which does a sloppy comparison so that a 31-bit signed difference
>>>> > is used to determine > or <.  It sounds like in this case we're just
>>>> > failing an equality check, though.)
>>>> >
>>>>
>>>> struct ceph_msg_header {
>>>>         __le64 seq;       /* message seq# for this session */
>>>>         ...
>>>> }
>>>>
>>>> You mean we should leave the upper 32 bits unused?
>>>
>>> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).
>>>
>>> I think we can safely go either way... the question is which path is
>>> easier.  If we move to 32 bits on the kernel side, will userspace
>>> also need to be patched to make reconnect work?  That unsigned get_seq()
>>> is only 32 bits wide.
>>
>> I don't think userspace needs to be patched.
>>
>>>
>>> If we go with 64 bits, userspace still needs to be fixed to change that
>>> unsigned to uint64_t.
>>>
>>> What do you think?
>>> sage
>>>
>>
>> I like the 64-bit approach. Here is the userspace code that checks the
>> message sequence.
>>
>> Pipe::reader() {
>>
>> ...
>>       if (m->get_seq() <= in_seq) {
>>         ldout(msgr->cct,0) << "reader got old message "
>>                 << m->get_seq() << " <= " << in_seq << " " << m << " " << *m
>>                 << ", discarding" << dendl;
>>
>>         msgr->dispatch_throttle_release(m->get_dispatch_throttle_size());
>>         m->put();
>>
>>         if (connection_state->has_feature(CEPH_FEATURE_RECONNECT_SEQ) &&
>>             msgr->cct->_conf->ms_die_on_old_message)
>>           assert(0 == "old msgs despite reconnect_seq feature");
>>         continue;
>>       }
>>       if (m->get_seq() > in_seq + 1) {
>>         ldout(msgr->cct,0) << "reader missed message?  skipped from seq "
>>                            << in_seq << " to " << m->get_seq() << dendl;
>>         if (msgr->cct->_conf->ms_die_on_skipped_message)
>>           assert(0 == "skipped incoming seq");
>>       }
>>       m->set_connection(connection_state.get());
>>       // note last received message.
>>       in_seq = m->get_seq();
>> ...
>> }
>>
>> Looks like the code works perfectly when the two ends of a connection
>> use different sequence widths. We don't need to worry about this change
>> breaking interoperability between patched and un-patched userspace.
>
> Are you sure?  It seems to me that if that were true, we wouldn't have
> this thread in the first place.  An unpatched m->get_seq() in
>
>     if (m->get_seq() <= in_seq) {
>
> would truncate and the "old message" branch would be taken.  The same
> goes for the unpatched m->set_seq() - the other (patched) side is going
> to trip over.


The userspace code never increments in_seq; instead, it assigns
m->get_seq() to in_seq. Since m->get_seq() returns a 32-bit value, the
upper 32 bits of in_seq are always zero.
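
For illustration, here is a standalone sketch of why two un-patched
userspace ends keep agreeing with each other even past 2^32 messages:
both sides truncate the same way, so they effectively count mod 2^32.
(Simplified stand-ins, not the real Message/Pipe code.)

#include <cstdint>
#include <iostream>

// header.seq is 64 bits on the wire, but get_seq()/set_seq() go through
// 'unsigned' (32 bits), as in src/msg/Message.h today.
struct FakeMessage {
  uint64_t header_seq = 0;
  unsigned get_seq() const { return header_seq; }  // truncates to 32 bits
  void set_seq(unsigned s) { header_seq = s; }     // upper bits never set
};

int main() {
  uint64_t in_seq = 0;  // receiver state, only ever assigned from get_seq()

  // pretend the sender has already sent 2^32 messages on this session
  for (uint64_t s = (1ULL << 32) + 1; s <= (1ULL << 32) + 5; ++s) {
    FakeMessage m;
    m.set_seq(s);                 // un-patched sender: upper 32 bits dropped
    if (m.get_seq() <= in_seq)    // both sides compare truncated values,
      std::cout << "old message " << m.get_seq() << std::endl;
    else
      in_seq = m.get_seq();       // so in_seq's upper 32 bits stay zero
  }
  std::cout << "in_seq = " << in_seq << std::endl;  // prints 5, nothing dropped
}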

I did some tests (setting the initial value of out_seq to 0xffffff00). It
seems that the userspace code also starts to malfunction when the seq
overflows. There is code that compares {newly_acked_seq, out_seq} and
{in_seq, in_seq_acked}, but it does not take the overflow into
consideration.
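
If we keep 32-bit sequence numbers, those comparisons would need to become
wrap-aware, along the lines of the sloppy 31-bit comparison Sage mentioned.
A minimal sketch, assuming a signed-difference helper similar to the
kernel's ceph_seq_cmp (not the actual userspace helper):

#include <cassert>
#include <cstdint>

// Wrap-tolerant comparison of two 32-bit sequence numbers: take the
// difference as a signed value, so numbers that are "close" compare
// correctly even across the 2^32 boundary.
static inline int32_t seq_cmp(uint32_t a, uint32_t b) {
  return static_cast<int32_t>(a - b);
}

int main() {
  uint32_t before = 0xfffffffeu;  // just before the 32-bit wrap
  uint32_t after  = 0x00000002u;  // four messages later, after the wrap

  assert(after < before);              // a plain compare orders them backwards
  assert(seq_cmp(after, before) > 0);  // the signed difference does not
  assert(seq_cmp(before, after) < 0);
  return 0;
}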

In addition, I found that Pipe::randomize_out_seq() never actually randomizes out_seq.
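
What I would expect the randomization to do is roughly the following; this
is only a hypothetical sketch of the intended behaviour, not the actual
Pipe code, and the 31-bit mask is an assumption:

#include <cstdint>
#include <iostream>
#include <random>

// Hypothetical sketch: start each session's out_seq at a random value,
// masked so it stays well below the wrap point.  Not the real
// Pipe::randomize_out_seq().
static uint64_t random_initial_seq() {
  std::random_device rd;
  constexpr uint64_t SEQ_MASK = 0x7fffffffULL;  // 31 bits, assumed
  return rd() & SEQ_MASK;
}

int main() {
  std::cout << "initial out_seq = " << random_initial_seq() << std::endl;
}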

Regards
Yan, Zheng

>
> Thanks,
>
>                 Ilya


