Hi,

On 10/15/2018 10:43 PM, Igor Fedotov wrote:
> Hi Wido,
>
> Once you apply the PR you'll probably see the initial error in the log
> that triggers the dump, which is most probably the lack of space
> reported by the _balance_bluefs_freespace() function. If so, this means
> that the BlueFS rebalance is unable to allocate a contiguous 1M chunk
> on the main device to gift to BlueFS, i.e. your main device's free
> space is very fragmented.
>
> Unfortunately I don't know of any way to recover from this state other
> than OSD redeployment or data removal.
>

We are moving data away from these OSDs. Luckily we have HDD OSDs in
this cluster as well, so we are moving a lot of data there.

How would re-deployment work? Just wipe the OSDs and bring them back
into the cluster again? That would be a very painful task.. :-(

> The upcoming PR that adds offline BlueFS volume manipulation
> (https://github.com/ceph/ceph/pull/23103) will probably help to recover
> from this issue in the future by migrating BlueFS data to a new, larger
> DB volume. (Targeted for Nautilus, not sure about backporting to Mimic
> or Luminous.)
>
> For now the only preventive measure I can suggest is to have enough
> space on your standalone DB volume, so that the main device isn't used
> for the DB at all, or as little as possible. Then no rebalance is
> needed and no fragmentation builds up.
>

I see, but these are SSD-only OSDs.

> BTW, I'm wondering if you have one for your OSDs? How large, if so?
>

The cluster consists of 96 OSDs backed by Samsung PM863a 1.92TB SSDs.

The fullest OSD is currently 78% full, which leaves 348GiB free on the
1.75TiB device.

Does this information help?

Thanks!

Wido

> Everything above is "IMO", there's some chance that I missed
> something..
>
> Thanks,
>
> Igor
>
> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>
>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>> I don't know anything about the BlueStore code, but given the
>>> snippets you've posted this appears to be a debug thing that doesn't
>>> expect to be invoked (or perhaps only in an unexpected case that it's
>>> trying hard to recover from). Have you checked where the dump()
>>> function is invoked from? I'd imagine it's something about having to
>>> try extra-hard to allocate free space or something.
>>
>> It seems BlueFS is having a hard time finding free space.
>>
>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>
>> It will stop the spamming, but that's not the root cause. The OSDs in
>> this case are at most 80% full and they do have a lot of OMAP (RGW
>> indexes) in them, but that's all.
>>
>> However, I'm not sure why this is suddenly happening in this cluster.
>>
>> Wido
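
As a side note, one quick way to see how much space BlueFS currently
occupies on each device is the OSD admin socket. The sketch below is
only an illustration: it assumes 'ceph daemon osd.<id> perf dump' can be
run on the OSD host and that the 'bluefs' section exposes the usual
db/wal/slow *_total_bytes and *_used_bytes counters (exact counter names
can differ between releases). On an SSD-only OSD without a separate DB
device, the "db" figures should reflect the space BlueFS has been gifted
from the shared main device.

#!/usr/bin/env python
# Sketch: print BlueFS space usage for one OSD via the admin socket.
# Assumes "ceph daemon osd.<id> perf dump" works on this host and that a
# "bluefs" perf section exists; counter names may differ per release.
import json
import subprocess
import sys

osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"
out = subprocess.check_output(["ceph", "daemon", "osd." + osd_id, "perf", "dump"])
bluefs = json.loads(out.decode("utf-8")).get("bluefs", {})

for dev in ("db", "wal", "slow"):
    total = bluefs.get(dev + "_total_bytes", 0)
    used = bluefs.get(dev + "_used_bytes", 0)
    if total:
        print("%-4s used %6.2f GiB of %6.2f GiB (%.0f%%)" % (
            dev, used / 2.0**30, total / 2.0**30, 100.0 * used / total))

On OSDs that do have a separate DB device, non-zero "slow" numbers would
be a sign of DB data spilling over onto the main device, which is where
the rebalancing described above comes into play.
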
>>
>>> -Greg
>>>
>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>
>>> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>> > Hi,
>>> >
>>> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
>>> > seeing OSDs writing heavily to their logfiles, spitting out these
>>> > lines:
>>> >
>>> > 2018-10-10 21:52:04.019037 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2078000~34000
>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd22cc000~24000
>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2300000~20000
>>> > 2018-10-10 21:52:04.019039 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2324000~24000
>>> > 2018-10-10 21:52:04.019040 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd26c0000~24000
>>> > 2018-10-10 21:52:04.019041 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2704000~30000
>>> >
>>> > It goes so fast that the OS disk in this case can't keep up and
>>> > becomes 100% utilized.
>>> >
>>> > This causes the OSD to slow down, causes slow requests and makes it
>>> > start to flap.
>>> >
>>>
>>> I've set 'log_file' to /dev/null for now, but that doesn't solve it
>>> either. Randomly OSDs just start spitting out slow requests and have
>>> these issues.
>>>
>>> Any suggestions on how to fix this?
>>>
>>> Wido
>>>
>>> > It seems that this is *only* happening on OSDs which are the fullest
>>> > (~85%) on this cluster, and they have about ~400 PGs each (yes, I
>>> > know, that's high).
>>> >
>>> > Looking at StupidAllocator.cc I see this piece of code:
>>> >
>>> > void StupidAllocator::dump()
>>> > {
>>> >   std::lock_guard<std::mutex> l(lock);
>>> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>> >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>> >                   << free[bin].num_intervals() << " extents" << dendl;
>>> >     for (auto p = free[bin].begin();
>>> >          p != free[bin].end();
>>> >          ++p) {
>>> >       ldout(cct, 0) << __func__ << " 0x" << std::hex << p.get_start()
>>> >                     << "~" << p.get_len() << std::dec << dendl;
>>> >     }
>>> >   }
>>> > }
>>> >
>>> > I'm just wondering why it would spit out these lines and what's
>>> > causing it.
>>> >
>>> > Has anybody seen this before?
>>> >
>>> > Wido
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
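
Since StupidAllocator::dump() prints every free extent as
"0x<offset>~<length>" (both in hex), the flood of dump lines quoted
above can at least be turned into a rough fragmentation estimate: how
many free extents there are, how big the largest one is, and whether
anything still reaches the contiguous 1M chunk the BlueFS rebalance
wants to gift. A minimal sketch, assuming the log format shown in the
thread; feed it a single dump's worth of lines, since every dump() call
re-prints the whole free list:

#!/usr/bin/env python
# Sketch: summarise free-extent fragmentation from stupidalloc dump lines
# in an OSD log ("... stupidalloc 0x... dump 0x<offset>~<length>", hex).
# Pipe in one dump's worth of lines; each dump() call re-prints all extents.
import re
import sys

EXTENT_RE = re.compile(r"stupidalloc .* dump 0x[0-9a-f]+~([0-9a-f]+)")
CHUNK = 1 << 20  # the contiguous 1M unit the BlueFS rebalance tries to gift

count = total = largest = big_enough = 0
for line in sys.stdin:
    m = EXTENT_RE.search(line)
    if not m:
        continue
    length = int(m.group(1), 16)
    count += 1
    total += length
    largest = max(largest, length)
    if length >= CHUNK:
        big_enough += 1

print("free extents: %d, total free: %.2f GiB" % (count, total / 2.0**30))
print("largest extent: %d KiB, extents >= 1 MiB: %d" % (largest // 1024, big_enough))

If the largest extent stays well under 1 MiB, that matches the situation
Igor describes: free space too fragmented for the rebalance to gift
anything to BlueFS.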