On 10/16/2018 12:04 AM, Igor Fedotov wrote:
>
> On 10/15/2018 11:47 PM, Wido den Hollander wrote:
>> Hi,
>>
>> On 10/15/2018 10:43 PM, Igor Fedotov wrote:
>>> Hi Wido,
>>>
>>> once you apply the PR you'll probably see the initial error in the
>>> log that triggers the dump, which is most probably the lack of space
>>> reported by the _balance_bluefs_freespace() function. If so, this
>>> means that the BlueFS rebalance is unable to allocate a contiguous
>>> 1M chunk on the main device to gift to BlueFS, i.e. your main
>>> device's free space is very fragmented.
>>>
>>> Unfortunately I don't know of any way to recover from this state
>>> other than OSD redeployment or data removal.
>>>
>> We are moving data away from these OSDs. Luckily we have HDD OSDs in
>> there as well, so we are moving a lot of data to those.
>>
>> How would re-deployment work? Just wipe the OSDs and bring them back
>> into the cluster again? That would be a very painful task.. :-(
>
> There's a good chance you'll face the same issue again one day.
> Maybe allocate some SSDs to serve as DB devices?

Maybe, but this is a very common use case where people run WAL+DB+DATA
on a single SSD. We are running into it now, but isn't it likely that
other people will run into it as well?

>>
>>> The upcoming PR that adds offline BlueFS volume manipulation
>>> (https://github.com/ceph/ceph/pull/23103) will probably help recover
>>> from this issue in the future by migrating BlueFS data to a new,
>>> larger DB volume (targeted for Nautilus, not sure about backporting
>>> to Mimic or Luminous).
>>>
>>> For now the only preventive measure I can suggest is to have enough
>>> space on your standalone DB volume, so that the main device isn't
>>> used for the DB at all, or as little as possible. Then no rebalance
>>> is needed and no fragmentation occurs.
>>>
>> I see, but these are SSD-only OSDs.
>>
>>> BTW, I'm wondering whether you have one for your OSDs? If so, how
>>> large?
>>>
>> The cluster consists of 96 OSDs on Samsung PM863a 1.92TB SSDs.
>>
>> The fullest OSD is currently 78% full, which leaves 348GiB free on
>> the 1.75TiB device.
>>
>> Does this information help?
> Yeah, thanks for sharing.

Let me know if you need more!

Wido

>>
>> Thanks!
>>
>> Wido
>>
>>> Everything above is "IMO"; there's some chance I've missed something.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>>>> I don't know anything about the BlueStore code, but given the
>>>>> snippets you've posted this appears to be a debug thing that isn't
>>>>> expected to be invoked (or perhaps only in an unexpected case that
>>>>> it's trying hard to recover from). Have you checked where the
>>>>> dump() function is invoked from? I'd imagine it's something about
>>>>> having to try extra hard to allocate free space or something.
>>>> It seems it is BlueFS that is having a hard time finding free space.
>>>>
>>>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>>>
>>>> It will stop the spamming, but that's not the root cause. The OSDs
>>>> in this case are at most 80% full and they do have a lot of OMAP
>>>> (RGW indexes) in them, but that's all.
>>>>
>>>> However, I'm not sure why this is suddenly happening in this cluster.
>>>>
>>>> Wido
>>>>
>>>>> -Greg
>>>>>
>>>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>
>>>>> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>>>> > Hi,
>>>>> >
>>>>> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8
>>>>> > I'm seeing OSDs writing heavily to their logfiles, spitting out
>>>>> > these lines:
>>>>> >
>>>>> > 2018-10-10 21:52:04.019037 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2078000~34000
>>>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd22cc000~24000
>>>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2300000~20000
>>>>> > 2018-10-10 21:52:04.019039 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2324000~24000
>>>>> > 2018-10-10 21:52:04.019040 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd26c0000~24000
>>>>> > 2018-10-10 21:52:04.019041 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2704000~30000
>>>>> >
>>>>> > It goes so fast that the OS disk in this case can't keep up and
>>>>> > becomes 100% utilized.
>>>>> >
>>>>> > This causes the OSD to slow down, produce slow requests and
>>>>> > start to flap.
>>>>> >
>>>>>
>>>>> I've set 'log_file' to /dev/null for now, but that doesn't solve
>>>>> it either. OSDs randomly just start spitting out slow requests
>>>>> and showing these issues.
>>>>>
>>>>> Any suggestions on how to fix this?
>>>>>
>>>>> Wido
>>>>>
>>>>> > It seems that this is *only* happening on the fullest OSDs
>>>>> > (~85%) in this cluster, and they have about ~400 PGs each
>>>>> > (yes, I know, that's high).
>>>>> >
>>>>> > Looking at StupidAllocator.cc I see this piece of code:
>>>>> >
>>>>> > void StupidAllocator::dump()
>>>>> > {
>>>>> >   std::lock_guard<std::mutex> l(lock);
>>>>> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>>>> >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>>>> >                   << free[bin].num_intervals() << " extents" << dendl;
>>>>> >     for (auto p = free[bin].begin();
>>>>> >          p != free[bin].end();
>>>>> >          ++p) {
>>>>> >       ldout(cct, 0) << __func__ << " 0x" << std::hex << p.get_start()
>>>>> >                     << "~" << p.get_len() << std::dec << dendl;
>>>>> >     }
>>>>> >   }
>>>>> > }
>>>>> >
>>>>> > I'm just wondering why it would spit out these lines and what's
>>>>> > causing it.
>>>>> >
>>>>> > Has anybody seen this before?
>>>>> >
>>>>> > Wido
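
[Editor's note] To make Igor's fragmentation point concrete, here is a
minimal, self-contained sketch in plain C++. It is not Ceph's allocator
code, and the extent sizes are made up for illustration: it models a main
device whose free space consists of many small extents, then does a simple
first-fit search for one contiguous 1 MiB chunk, the size of the gift the
BlueFS rebalance is described above as failing to allocate.

// Illustration only (not Ceph code): plenty of total free space can still
// fail a contiguous allocation if no single free extent is large enough.
#include <cstdint>
#include <iostream>
#include <vector>

struct Extent { uint64_t offset; uint64_t length; };

// Return the offset of the first free extent that can hold `want` bytes
// in one piece, or UINT64_MAX if no single extent is large enough.
uint64_t alloc_contiguous(const std::vector<Extent>& free_extents,
                          uint64_t want) {
  for (const auto& e : free_extents)
    if (e.length >= want)
      return e.offset;
  return UINT64_MAX;
}

int main() {
  const uint64_t MiB = 1ULL << 20;

  // A heavily fragmented free list: 400,000 free extents of 256 KiB each,
  // one per 1 MiB of device offset (the rest is occupied by data). That is
  // roughly 98 GiB free in total, yet nothing contiguous >= 1 MiB.
  std::vector<Extent> free_extents;
  for (uint64_t i = 0; i < 400000; ++i)
    free_extents.push_back({i * MiB, 256 * 1024});

  uint64_t total = 0;
  for (const auto& e : free_extents)
    total += e.length;

  std::cout << "total free: " << (total >> 30) << " GiB in "
            << free_extents.size() << " extents\n";

  if (alloc_contiguous(free_extents, 1 * MiB) == UINT64_MAX)
    std::cout << "no contiguous 1 MiB extent: the gift to BlueFS fails "
                 "even though plenty of space is free\n";
  return 0;
}

Running it reports ~97 GiB free across 400,000 extents and then that the
1 MiB contiguous allocation fails, which appears to be the situation the
OSDs in this thread are in: lots of free space on paper, but none of it
usable for the BlueFS gift.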
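In the same spirit, a rough diagnostic sketch (not an official Ceph tool,
and it assumes the exact "dump 0x<offset>~<length>" log format quoted
above): pipe an affected OSD's log through it to see how many free extents
the allocator dumped, how much space they add up to, and how large the
biggest single extent is. For reference, the six extents quoted at the top
of the thread are only 128-208 KiB each, all well below the 1M chunk the
rebalance reportedly needs.

// Rough sketch: summarize StupidAllocator dump lines from an OSD log.
// Reads log lines on stdin and only looks at "... dump 0x<off>~<len>"
// extent lines; the "dump free bin N: M extents" summary lines are skipped
// automatically because they contain no "0x" after "dump".
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>

int main() {
  std::string line;
  uint64_t total = 0, largest = 0, count = 0;

  while (std::getline(std::cin, line)) {
    auto d = line.find(" dump ");
    if (d == std::string::npos) continue;
    auto h = line.find("0x", d);   // start of "<offset-hex>~<length-hex>"
    if (h == std::string::npos) continue;
    auto t = line.find('~', h);
    if (t == std::string::npos) continue;

    // Only the length matters for fragmentation statistics.
    uint64_t len = std::stoull(line.substr(t + 1), nullptr, 16);
    total += len;
    largest = std::max(largest, len);
    ++count;
  }

  std::cout << "free extents dumped: " << count
            << ", total free: " << total << " bytes"
            << ", largest single extent: " << largest << " bytes\n";
  return 0;
}

Compile it under any name (the name below is arbitrary) and feed it a log,
e.g.: ./dump-stats < /var/log/ceph/ceph-osd.N.log. A largest extent far
below 1 MiB, despite a large total, would match the fragmentation
explanation given earlier in the thread.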