Re: OSD log being spammed with BlueStore stupidallocator dump

On 10/16/2018 6:57 AM, Wido den Hollander wrote:

On 10/16/2018 12:04 AM, Igor Fedotov wrote:
On 10/15/2018 11:47 PM, Wido den Hollander wrote:
Hi,

On 10/15/2018 10:43 PM, Igor Fedotov wrote:
Hi Wido,

once you apply the PR you'll probably see, in the log, the initial error
that triggers the dump. It is most probably the lack of space reported by
the _balance_bluefs_freespace() function. If so, this means that the
BlueFS rebalance is unable to allocate a contiguous 1M chunk on the main
device to gift to BlueFS, i.e. your main device's space is very
fragmented.

Unfortunately I don't know of any way to recover from this state other
than OSD redeployment or data removal.
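[Editor's note: a toy model of the failure mode described above, not Ceph code. It shows how a free list can hold plenty of space in aggregate while no single extent is large enough for a contiguous 1 MiB allocation. All names and sizes here are illustrative assumptions.]

```python
# Toy model of allocator fragmentation (NOT the actual StupidAllocator code).
# The free list totals 4 MiB, but no single extent is >= 1 MiB,
# so a contiguous 1 MiB request cannot be satisfied.

MB = 1024 * 1024

# (offset, length) free extents: many small runs scattered across the device
free_extents = [(i * MB, 256 * 1024) for i in range(16)]  # 16 x 256 KiB

def alloc_contiguous(extents, want):
    """First-fit: return the index of an extent that can hold `want` bytes, else None."""
    for idx, (off, length) in enumerate(extents):
        if length >= want:
            return idx
    return None

total_free = sum(length for _, length in free_extents)
print(total_free >= MB)                    # True: plenty of space in aggregate
print(alloc_contiguous(free_extents, MB))  # None: no contiguous 1 MiB run
```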

We are moving data away from these OSDs. Luckily we have HDD OSDs in
there as well, so we are moving a lot of data there.

How would redeployment work? Just wipe the OSDs and bring them back into
the cluster again? That would be a very painful task.. :-(
There's a good chance you'll face the same issue again one day.
Maybe allocate some SSDs to serve as DB devices?
Maybe, but this is a very common use-case where people run WAL+DB+DATA
on a single SSD.
Yeah, but I'd consider that a workaround until a better solution is provided. Before your case, this high-fragmentation issue had been rather a theoretical concern with a very small probability.

Now we are running into it, but aren't the chances high that other people
will run into it as well?

An upcoming PR that adds the ability to manipulate BlueFS volumes offline
(https://github.com/ceph/ceph/pull/23103) will probably help to recover
from this issue in the future by migrating BlueFS data to a new, larger
DB volume (targeted for Nautilus; not sure about backporting to Mimic or
Luminous).

For now, the only preventive measure I can suggest to avoid this case is
to have enough space on your standalone DB volume, so that the main
device isn't used for DB at all, or as little as possible. Then no
rebalancing is needed and no fragmentation occurs.
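[Editor's note: a sketch of the layout suggested above, provisioning a BlueStore OSD with a standalone DB device via ceph-volume. The device paths are placeholders, and sizing is deployment-specific; neither comes from this thread.]

```shell
# Sketch only: /dev/sdb and /dev/nvme0n1 are placeholder device paths.
# Putting the RocksDB DB (and, when no separate WAL device is given, the
# WAL as well) on a separate, generously sized device keeps the main data
# device free of DB data, so no BlueFS rebalancing is needed there.
ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1
```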

I see, but these are SSD-only OSDs.

BTW, I'm wondering if you have one for your OSDs? If so, how large?

The cluster consists of 96 Samsung PM863a 1.92TB OSDs.

The fullest OSD currently is 78% full, which leaves 348GiB free on the
1.75TiB device.

Does this information help?
Yeah, thanks for sharing.
Let me know if you need more!
Additionally, I'm wondering if you know how much data (on average) has been written to these OSDs in total. I mean an aggregation over all writes (even data that has already been removed), not the current usage.
Usage patterns, object sizes, etc. would be interesting as well.



Wido

Thanks!

Wido

Everything above is IMO; there's a chance I missed something..


Thanks,

Igor


On 10/15/2018 10:12 PM, Wido den Hollander wrote:
On 10/15/2018 08:23 PM, Gregory Farnum wrote:
I don't know anything about the BlueStore code, but given the snippets
you've posted this appears to be a debug thing that doesn't expect to be
invoked (or perhaps is only invoked in an unexpected case that it's
trying hard to recover from). Have you checked where the dump() function
is invoked from? I'd imagine it's something about having to try
extra-hard to allocate free space or something.
It seems BlueFS is having a hard time finding free space.

I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

It will stop the spamming, but that's not the root cause. The OSDs in
this case are at most 80% full and they do have a lot of OMAP (RGW
indexes) in them, but that's all.

I'm however not sure why this is happening suddenly in this cluster.

Wido

-Greg

On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido@xxxxxxxx> wrote:



       On 10/11/2018 12:08 AM, Wido den Hollander wrote:
       > Hi,
       >
       > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8
       > I'm seeing OSDs writing heavily to their logfiles, spitting out
       > these lines:
       >
       >
       > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2078000~34000
       > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd22cc000~24000
       > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2300000~20000
       > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2324000~24000
       > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd26c0000~24000
       > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2704000~30000
       >
       > It goes so fast that the OS disk in this case can't keep up and
       > becomes 100% utilized.
       >
       > This causes the OSD to slow down, causes slow requests, and it
       > starts to flap.
       >

       I've set 'log_file' to /dev/null for now, but that doesn't solve it
       either. Randomly, OSDs just start spitting out slow requests and
       have these issues.

       Any suggestions on how to fix this?

       Wido

       > It seems that this is *only* happening on OSDs which are the
       > fullest (~85%) on this cluster, and they have about ~400 PGs each
       > (yes, I know, that's high).
       >
       > Looking at StupidAllocator.cc I see this piece of code:
       >
       > void StupidAllocator::dump()
       > {
       >   std::lock_guard<std::mutex> l(lock);
       >   for (unsigned bin = 0; bin < free.size(); ++bin) {
       >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
       >                   << free[bin].num_intervals() << " extents" << dendl;
       >     for (auto p = free[bin].begin();
       >          p != free[bin].end();
       >          ++p) {
       >       ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start() << "~"
       >                     << p.get_len() << std::dec << dendl;
       >     }
       >   }
       > }
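[Editor's note: for readers unfamiliar with the structure being dumped: StupidAllocator keeps free extents in a vector of interval sets binned roughly by extent size, and dump() walks every bin printing every extent, which is why a fragmented device produces so many lines. A rough Python sketch of that idea follows; the binning function and sizes are a simplification, not the real implementation.]

```python
# Simplified sketch of a size-binned free list, in the spirit of (but not
# identical to) StupidAllocator's `free` vector of interval sets.

def bin_for(length, num_bins=10, unit=4096):
    """Pick a bin index from the extent length (larger extents -> higher bins)."""
    b = 0
    length //= unit
    while length > 1 and b < num_bins - 1:
        length >>= 1
        b += 1
    return b

def dump(bins):
    """Render every extent in every bin, like StupidAllocator::dump() logs."""
    lines = []
    for i, extents in enumerate(bins):
        lines.append(f"free bin {i}: {len(extents)} extents")
        for off, length in extents:
            lines.append(f"  0x{off:x}~{length:x}")
    return lines

bins = [[] for _ in range(10)]
for off, length in [(0x1000, 0x4000), (0x10000, 0x1000), (0x40000, 0x40000)]:
    bins[bin_for(length)].append((off, length))
for line in dump(bins):
    print(line)
```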
       >
       > I'm just wondering why it would spit out these lines and what's
       > causing it.
       >
       > Has anybody seen this before?
       >
       > Wido
       > _______________________________________________
       > ceph-users mailing list
       > ceph-users@xxxxxxxxxxxxxx
       > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
       >




