Re: OSD log being spammed with BlueStore stupidallocator dump

Igor Fedotov <ifedotov@xxxxxxx> · Mon, 15 Oct 2018 23:43:13 +0300

Hi Wido,

Once you apply the PR, you'll probably see the initial error in the log
that triggers the dump. Most likely it is the lack of space reported by
the _balance_bluefs_freespace() function. If so, BlueFS rebalance is
unable to allocate a contiguous 1M chunk on the main device to gift to
BlueFS, i.e. your main device space is very fragmented.

Unfortunately, I don't know of any way to recover from this state other
than OSD redeployment or data removal.

An upcoming PR that adds the ability to manipulate BlueFS volumes
offline (https://github.com/ceph/ceph/pull/23103) will probably help to
recover from this issue in the future by migrating BlueFS data to a new,
larger DB volume. (It is targeted for Nautilus; I'm not sure about
backporting to Mimic or Luminous.)

For now, the only preventive measure I can suggest is to have enough
space on your standalone DB volume, so that the main device isn't used
for DB at all, or as little as possible. Then no rebalance is needed and
no fragmentation is present.

BTW, I'm wondering whether you have a standalone DB volume for your
OSDs. If so, how large is it?

Everything above is "IMO"; there is some chance that I missed something.

Thanks,

Igor

On 10/15/2018 10:12 PM, Wido den Hollander wrote:

On 10/15/2018 08:23 PM, Gregory Farnum wrote:
> I don't know anything about the BlueStore code, but given the snippets
> you've posted this appears to be a debug thing that doesn't expect to be
> invoked (or perhaps only in an unexpected case that it's trying hard to
> recover from). Have you checked where the dump() function is invoked
> from? I'd imagine it's something about having to try extra-hard to
> allocate free space or something.

It seems that BlueFS is having a hard time finding free space.

I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

It will stop the spamming, but that's not the root cause. The OSDs in
this case are at max 80% full and they do have a lot of OMAP (RGW
indexes) in them, but that's all.

I'm however not sure why this is happening suddenly in this cluster.

Wido

-Greg

On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido@xxxxxxxx> wrote:

     On 10/11/2018 12:08 AM, Wido den Hollander wrote:
     > Hi,
     >
     > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
     > seeing OSDs writing heavily to their logfiles spitting out these
     > lines:
     >
     >
     > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2078000~34000
     > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd22cc000~24000
     > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2300000~20000
     > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2324000~24000
     > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd26c0000~24000
     > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2704000~30000
     >
     > It goes so fast that the OS disk in this case can't keep up and
     > becomes 100% utilized.
     >
     > This causes the OSD to slow down, produce slow requests, and start
     > to flap.
     >

     I've set 'log_file' to /dev/null for now, but that doesn't solve it
     either. Randomly OSDs just start spitting out slow requests and have
     these issues.

     Any suggestions on how to fix this?

     Wido

     > It seems that this is *only* happening on OSDs which are the fullest
     > (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
     > that's high).
     >
     > Looking at StupidAllocator.cc I see this piece of code:
     >
     > void StupidAllocator::dump()
     > {
     >   std::lock_guard<std::mutex> l(lock);
     >   for (unsigned bin = 0; bin < free.size(); ++bin) {
     >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
     >                   << free[bin].num_intervals() << " extents" << dendl;
     >     for (auto p = free[bin].begin();
     >          p != free[bin].end();
     >          ++p) {
     >       ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
     >                     << "~" << p.get_len() << std::dec << dendl;
     >     }
     >   }
     > }
     >
     > I'm just wondering why it would spit out these lines and what's
     > causing it.
     >
     > Has anybody seen this before?
     >
     > Wido
     > _______________________________________________
     > ceph-users mailing list
     > ceph-users@xxxxxxxxxxxxxx
     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
     >