Hi,

On 10/15/2018 10:43 PM, Igor Fedotov wrote:
> Hi Wido,
>
> Once you apply the PR you'll probably see the initial error in the log
> that triggers the dump, which is most probably the lack of space
> reported by the _balance_bluefs_freespace() function. If so, this means
> that the BlueFS rebalance is unable to allocate a contiguous 1M chunk
> on the main device to gift to BlueFS, i.e. your main device's free
> space is very fragmented.
>
> Unfortunately I don't know of any way to recover from this state other
> than OSD redeployment or data removal.
>

We are moving data away from these OSDs. Luckily we have HDD OSDs in
this cluster as well, so we are moving a lot of data there.

How would re-deployment work? Just wipe the OSDs and bring them back
into the cluster again? That would be a very painful task.. :-(

> The upcoming PR that adds offline BlueFS volume manipulation
> (https://github.com/ceph/ceph/pull/23103) will probably help to recover
> from this issue in the future by migrating BlueFS data to a new, larger
> DB volume. (Targeted for Nautilus, not sure about backporting to Mimic
> or Luminous.)
>
> For now the only preventive measure I can suggest is to have enough
> space on your standalone DB volume, so that the main device isn't used
> for the DB at all, or as little as possible. Then no rebalance is
> needed and no fragmentation builds up.
>

I see, but these are SSD-only OSDs.

> BTW, I'm wondering if you have one for your OSDs? How large, if so?
>

The cluster consists of 96 OSDs backed by Samsung PM863a 1.92TB SSDs.

The fullest OSD is currently 78% full, which leaves 348GiB free on the
1.75TiB device.

Does this information help?

Thanks!

Wido

> Everything above is "IMO", there's some chance that I missed
> something..
>
> Thanks,
>
> Igor
>
> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>
>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>> I don't know anything about the BlueStore code, but given the
>>> snippets you've posted this appears to be a debug thing that doesn't
>>> expect to be invoked (or perhaps only in an unexpected case that it's
>>> trying hard to recover from). Have you checked where the dump()
>>> function is invoked from? I'd imagine it's something about having to
>>> try extra-hard to allocate free space or something.
>>
>> It seems BlueFS is having a hard time finding free space.
>>
>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>
>> It will stop the spamming, but that's not the root cause. The OSDs in
>> this case are at most 80% full and they do have a lot of OMAP (RGW
>> indexes) in them, but that's all.
>>
>> However, I'm not sure why this is suddenly happening in this cluster.
>>
>> Wido
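
As a side note, one quick way to see how much space BlueFS currently
occupies on each device is the OSD admin socket. The sketch below is
only an illustration: it assumes 'ceph daemon osd.<id> perf dump' can be
run on the OSD host and that the 'bluefs' section exposes the usual
db/wal/slow *_total_bytes and *_used_bytes counters (exact counter names
can differ between releases). On an SSD-only OSD without a separate DB
device, the "db" figures should reflect the space BlueFS has been gifted
from the shared main device.

#!/usr/bin/env python
# Sketch: print BlueFS space usage for one OSD via the admin socket.
# Assumes "ceph daemon osd.<id> perf dump" works on this host and that a
# "bluefs" perf section exists; counter names may differ per release.
import json
import subprocess
import sys

osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"
out = subprocess.check_output(["ceph", "daemon", "osd." + osd_id, "perf", "dump"])
bluefs = json.loads(out.decode("utf-8")).get("bluefs", {})

for dev in ("db", "wal", "slow"):
    total = bluefs.get(dev + "_total_bytes", 0)
    used = bluefs.get(dev + "_used_bytes", 0)
    if total:
        print("%-4s used %6.2f GiB of %6.2f GiB (%.0f%%)" % (
            dev, used / 2.0**30, total / 2.0**30, 100.0 * used / total))

On OSDs that do have a separate DB device, non-zero "slow" numbers would
be a sign of DB data spilling over onto the main device, which is where
the rebalancing described above comes into play.
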
>>
>>> -Greg
>>>
>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>
>>> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>> > Hi,
>>> >
>>> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
>>> > seeing OSDs writing heavily to their logfiles, spitting out these
>>> > lines:
>>> >
>>> > 2018-10-10 21:52:04.019037 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2078000~34000
>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd22cc000~24000
>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2300000~20000
>>> > 2018-10-10 21:52:04.019039 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2324000~24000
>>> > 2018-10-10 21:52:04.019040 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd26c0000~24000
>>> > 2018-10-10 21:52:04.019041 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2704000~30000
>>> >
>>> > It goes so fast that the OS disk in this case can't keep up and
>>> > becomes 100% utilized.
>>> >
>>> > This causes the OSD to slow down, causes slow requests and makes it
>>> > start to flap.
>>> >
>>>
>>> I've set 'log_file' to /dev/null for now, but that doesn't solve it
>>> either. Randomly OSDs just start spitting out slow requests and have
>>> these issues.
>>>
>>> Any suggestions on how to fix this?
>>>
>>> Wido
>>>
>>> > It seems that this is *only* happening on OSDs which are the fullest
>>> > (~85%) on this cluster, and they have about ~400 PGs each (yes, I
>>> > know, that's high).
>>> >
>>> > Looking at StupidAllocator.cc I see this piece of code:
>>> >
>>> > void StupidAllocator::dump()
>>> > {
>>> >   std::lock_guard<std::mutex> l(lock);
>>> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>> >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>> >                   << free[bin].num_intervals() << " extents" << dendl;
>>> >     for (auto p = free[bin].begin();
>>> >          p != free[bin].end();
>>> >          ++p) {
>>> >       ldout(cct, 0) << __func__ << " 0x" << std::hex << p.get_start()
>>> >                     << "~" << p.get_len() << std::dec << dendl;
>>> >     }
>>> >   }
>>> > }
>>> >
>>> > I'm just wondering why it would spit out these lines and what's
>>> > causing it.
>>> >
>>> > Has anybody seen this before?
>>> >
>>> > Wido
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
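
Since StupidAllocator::dump() prints every free extent as
"0x<offset>~<length>" (both in hex), the flood of dump lines quoted
above can at least be turned into a rough fragmentation estimate: how
many free extents there are, how big the largest one is, and whether
anything still reaches the contiguous 1M chunk the BlueFS rebalance
wants to gift. A minimal sketch, assuming the log format shown in the
thread; feed it a single dump's worth of lines, since every dump() call
re-prints the whole free list:

#!/usr/bin/env python
# Sketch: summarise free-extent fragmentation from stupidalloc dump lines
# in an OSD log ("... stupidalloc 0x... dump 0x<offset>~<length>", hex).
# Pipe in one dump's worth of lines; each dump() call re-prints all extents.
import re
import sys

EXTENT_RE = re.compile(r"stupidalloc .* dump 0x[0-9a-f]+~([0-9a-f]+)")
CHUNK = 1 << 20  # the contiguous 1M unit the BlueFS rebalance tries to gift

count = total = largest = big_enough = 0
for line in sys.stdin:
    m = EXTENT_RE.search(line)
    if not m:
        continue
    length = int(m.group(1), 16)
    count += 1
    total += length
    largest = max(largest, length)
    if length >= CHUNK:
        big_enough += 1

print("free extents: %d, total free: %.2f GiB" % (count, total / 2.0**30))
print("largest extent: %d KiB, extents >= 1 MiB: %d" % (largest // 1024, big_enough))

If the largest extent stays well under 1 MiB, that matches the situation
Igor describes: free space too fragmented for the rebalance to gift
anything to BlueFS.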