On 10/16/18 11:32 AM, Igor Fedotov wrote:
>
> On 10/16/2018 6:57 AM, Wido den Hollander wrote:
>>
>> On 10/16/2018 12:04 AM, Igor Fedotov wrote:
>>> On 10/15/2018 11:47 PM, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> On 10/15/2018 10:43 PM, Igor Fedotov wrote:
>>>>> Hi Wido,
>>>>>
>>>>> once you apply the PR you'll probably see the initial error in the log
>>>>> that triggers the dump, which is most probably the lack of space
>>>>> reported by the _balance_bluefs_freespace() function. If so, this means
>>>>> that BlueFS rebalance is unable to allocate a contiguous 1M chunk on
>>>>> the main device to gift to BlueFS, i.e. your main device space is very
>>>>> fragmented.
>>>>>
>>>>> Unfortunately I don't know of any way to recover from this state other
>>>>> than OSD redeployment or data removal.
>>>>>
>>>> We are moving data away from these OSDs. Luckily we have HDD OSDs in
>>>> there as well, so we are moving a lot of data there.
>>>>
>>>> How would re-deployment work? Just wipe the OSDs and bring them back
>>>> into the cluster again? That would be a very painful task.. :-(
>>> There's a good chance that you'll face the same issue again one day.
>>> Maybe allocate some SSDs to serve as DB devices?
>> Maybe, but this is a very common use case where people run WAL+DB+DATA
>> on a single SSD.
> Yeah, but I'd consider that a workaround until a better solution is
> provided.
> Before your case this high-fragmentation issue had been rather a
> theoretical concern with a very small probability.
>

Understood.

What I see now is that after offloading the RGW bucket indexes from
these OSDs there is free space.

The OSDs are now randomly going to 100% utilization on their disk,
reading heavily from it.

Cranking up the logging shows me:

2018-10-16 11:50:20.325674 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cd93c~f0b
2018-10-16 11:50:20.325836 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547ce847~ef9
2018-10-16 11:50:20.325997 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cf740

BlueFS / bdev seem to be reading a lot, a lot.

From what I can tell they are discarding OMAP data from RocksDB, which
causes a lot of reads on BlueFS.
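To get a rough idea of how much the bdev layer is actually reading, a quick
throw-away parser over those read_random lines can tally calls and bytes. A
minimal, untested sketch, assuming exactly the log format shown above (this is
not a Ceph tool, just a one-off helper):

// Sum up "read_random 0x<offset>~<length>" lines from an OSD log.
// Offsets and lengths are hex; lines without a '~' (truncated) are skipped.
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <osd-log-file>" << std::endl;
    return 1;
  }
  std::ifstream in(argv[1]);
  std::string line;
  uint64_t reads = 0, bytes = 0;
  const std::string tag = "read_random 0x";
  while (std::getline(in, line)) {
    auto pos = line.find(tag);
    if (pos == std::string::npos)
      continue;
    auto sep = line.find('~', pos + tag.size());
    if (sep == std::string::npos)
      continue;
    ++reads;
    bytes += std::strtoull(line.c_str() + sep + 1, nullptr, 16);
  }
  std::cout << reads << " read_random calls, "
            << bytes / (1024.0 * 1024.0) << " MiB total" << std::endl;
  return 0;
}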
>> Now we are running into it, but isn't there a good chance that other
>> people will run into it as well?
>>
>>>>> The upcoming PR that brings the ability to manipulate BlueFS volumes
>>>>> offline (https://github.com/ceph/ceph/pull/23103) will probably help
>>>>> to recover from this issue in the future by migrating BlueFS data to
>>>>> a new, larger DB volume (targeted for Nautilus, not sure about
>>>>> backporting to Mimic or Luminous).
>>>>>
>>>>> For now the only preventive measure I can suggest to avoid this case
>>>>> is to have enough space on your standalone DB volume, so that the
>>>>> main device isn't used for DB at all, or as little as possible. Then
>>>>> no rebalance is needed and no fragmentation occurs.
>>>>>
>>>> I see, but these are SSD-only OSDs.
>>>>
>>>>> BTW, wondering if you have one for your OSDs? How large, if so?
>>>>>
>>>> The cluster consists of 96 OSDs on Samsung PM863a 1.92TB SSDs.
>>>>
>>>> The fullest OSD is currently 78% full, which leaves 348 GiB free on
>>>> the 1.75 TiB device.
>>>>
>>>> Does this information help?
>>> Yeah, thanks for sharing.
>> Let me know if you need more!
> Additionally, wondering if you know how much data (on average) has been
> written to these OSDs in total. I mean an aggregation over all writes
> (even ones that have already been removed), not the current usage.
> Usage patterns, object sizes etc. would be interesting as well.
>

These OSDs were migrated from FileStore to BlueStore in March 2018.

Using blkdiscard the SSDs (Samsung PM863a 1.92TB) were wiped and then
deployed with BlueStore.

They run:
- RBD (a couple of pools)
- RGW indexes

The actual RGW data is on HDD OSDs in the same cluster.

The whole cluster does a steady 20k IOps (R+W) during the day, I would
say: roughly 200MB/sec write and 100MB/sec read.

Suddenly OSDs started to grind to a halt last week and they have been
flapping and showing slow requests ever since.

The RGW indexes have been offloaded to the HDD nodes in the meantime.

Wido
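As an aside on Igor's fragmentation point above: total free space says little
about the largest contiguous extent. A minimal, purely hypothetical sketch of
a StupidAllocator-style binned free list (not the real allocator; the extent
sizes are made up to roughly match the numbers in this thread) shows how ~348
GiB of free space can still fail a contiguous 1M request:

// Hypothetical illustration only -- this is NOT Ceph's StupidAllocator,
// just the binned-free-list idea behind it. Free extents are grouped in
// bins by size; an allocation that needs one contiguous chunk (like the
// ~1M unit BlueFS rebalance gifts) only succeeds if a single extent is
// big enough, no matter how much total free space the bins add up to.
#include <cstdint>
#include <iostream>
#include <vector>

struct Extent { uint64_t off, len; };

static unsigned bin_for(uint64_t len)
{
  // arbitrary bin boundaries for the illustration
  if (len < 64 * 1024)   return 0;
  if (len < 256 * 1024)  return 1;
  if (len < 1024 * 1024) return 2;
  return 3;
}

int main()
{
  std::vector<std::vector<Extent>> free_bins(4);

  // Simulate a badly fragmented ~1.8 TiB device: ~348 GiB free in total
  // (roughly the fullest OSD in this thread), but every free extent is a
  // 768K hole, so nothing ever lands in the >= 1M bin.
  uint64_t total_free = 0;
  for (uint64_t i = 0; i < 475000; ++i) {
    Extent e{i * 4 * 1024 * 1024, 768 * 1024};
    free_bins[bin_for(e.len)].push_back(e);
    total_free += e.len;
  }

  const uint64_t want = 1024 * 1024;   // contiguous 1M, as for the BlueFS gift
  bool ok = false;
  for (const auto& e : free_bins[3])   // only the >= 1M bin can satisfy it
    if (e.len >= want) { ok = true; break; }

  std::cout << "total free: " << total_free / double(1ull << 30) << " GiB, "
            << "contiguous 1M allocation "
            << (ok ? "succeeds" : "fails") << std::endl;
  return 0;
}

(The real allocator keeps per-bin interval sets and merges neighbouring
extents, but the basic point is the same: once no single free extent reaches
the size BlueFS asks for, the gift allocation fails however much space is
nominally free.)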
>>>
>> Wido
>>
>>>> Thanks!
>>>>
>>>> Wido
>>>>
>>>>> Everything above is "IMO"; there's some chance that I missed something..
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>>>>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>>>>>> I don't know anything about the BlueStore code, but given the
>>>>>>> snippets you've posted this appears to be a debug thing that
>>>>>>> doesn't expect to be invoked (or perhaps only in an unexpected
>>>>>>> case that it's trying hard to recover from). Have you checked
>>>>>>> where the dump() function is invoked from? I'd imagine it's
>>>>>>> something about having to try extra-hard to allocate free space
>>>>>>> or something.
>>>>>> It seems it's BlueFS that is having a hard time finding free space.
>>>>>>
>>>>>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>>>>>
>>>>>> It will stop the spamming, but that's not the root cause. The OSDs in
>>>>>> this case are at max 80% full and they do have a lot of OMAP (RGW
>>>>>> indexes) in them, but that's all.
>>>>>>
>>>>>> I'm however not sure why this is suddenly happening in this cluster.
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>>> -Greg
>>>>>>>
>>>>>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>>
>>>>>>> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8
>>>>>>> > I'm seeing OSDs writing heavily to their logfiles, spitting out
>>>>>>> > these lines:
>>>>>>> >
>>>>>>> > 2018-10-10 21:52:04.019037 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
>>>>>>> > dump 0x15cd2078000~34000
>>>>>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
>>>>>>> > dump 0x15cd22cc000~24000
>>>>>>> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
>>>>>>> > dump 0x15cd2300000~20000
>>>>>>> > 2018-10-10 21:52:04.019039 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
>>>>>>> > dump 0x15cd2324000~24000
>>>>>>> > 2018-10-10 21:52:04.019040 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
>>>>>>> > dump 0x15cd26c0000~24000
>>>>>>> > 2018-10-10 21:52:04.019041 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0
>>>>>>> > dump 0x15cd2704000~30000
>>>>>>> >
>>>>>>> > It goes so fast that the OS disk in this case can't keep up and
>>>>>>> > becomes 100% utilized.
>>>>>>> >
>>>>>>> > This causes the OSD to slow down, produce slow requests and start
>>>>>>> > to flap.
>>>>>>> >
>>>>>>>
>>>>>>> I've set 'log_file' to /dev/null for now, but that doesn't solve it
>>>>>>> either. Randomly OSDs just start spitting out slow requests and
>>>>>>> have these issues.
>>>>>>>
>>>>>>> Any suggestions on how to fix this?
>>>>>>>
>>>>>>> Wido
>>>>>>>
>>>>>>> > It seems that this is *only* happening on the fullest OSDs (~85%)
>>>>>>> > in this cluster, and they have about ~400 PGs each (yes, I know,
>>>>>>> > that's high).
>>>>>>> >
>>>>>>> > Looking at StupidAllocator.cc I see this piece of code:
>>>>>>> >
>>>>>>> > void StupidAllocator::dump()
>>>>>>> > {
>>>>>>> >   std::lock_guard<std::mutex> l(lock);
>>>>>>> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>>>>>> >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>>>>>> >                   << free[bin].num_intervals() << " extents"
>>>>>>> >                   << dendl;
>>>>>>> >     for (auto p = free[bin].begin();
>>>>>>> >          p != free[bin].end();
>>>>>>> >          ++p) {
>>>>>>> >       ldout(cct, 0) << __func__ << "  0x" << std::hex
>>>>>>> >                     << p.get_start() << "~" << p.get_len()
>>>>>>> >                     << std::dec << dendl;
>>>>>>> >     }
>>>>>>> >   }
>>>>>>> > }
>>>>>>> >
>>>>>>> > I'm just wondering why it would spit out these lines and what's
>>>>>>> > causing it.
>>>>>>> >
>>>>>>> > Has anybody seen this before?
>>>>>>> >
>>>>>>> > Wido
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com