On 10/11/2018 03:12 AM, David Turner wrote:
> Not a resolution, but an idea that you've probably thought of.
> Disabling logging on any affected OSDs (possibly just all of them) seems
> like a needed step to be able to keep working with this cluster to
> finish the upgrade and get it healthier.

Thanks for the tip! But I wouldn't know how to silence the stupidalloc:

debug_osd = 0/0
debug_bluefs = 0/0
debug_bluestore = 0/0
debug_bdev = 0/0

It's all set, but the logs keep coming.

Right now we have found a work-around by offloading some data from these
OSDs. The cluster is a mix of SSDs and HDDs, and the problem is with the
SSD OSDs.

So we moved a pool from SSD to HDD and that seems to have fixed the
problem for now. But it will probably come back as soon as some OSDs go
above 80% again.

Wido

> On Wed, Oct 10, 2018 at 6:37 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> > Hi,
> >
> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> > seeing OSDs writing heavily to their logfiles, spitting out these lines:
> >
> > 2018-10-10 21:52:04.019037 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2078000~34000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd22cc000~24000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2300000~20000
> > 2018-10-10 21:52:04.019039 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2324000~24000
> > 2018-10-10 21:52:04.019040 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd26c0000~24000
> > 2018-10-10 21:52:04.019041 7f90c2f0f700 0 stupidalloc 0x0x55828ae047d0 dump 0x15cd2704000~30000
> >
> > It goes so fast that the OS disk in this case can't keep up and becomes
> > 100% utilized.
> >
> > This slows the OSD down, causes slow requests, and the OSD starts to flap.
> >
> > It seems that this is *only* happening on OSDs which are the fullest
> > (~85%) on this cluster, and they have about ~400 PGs each (yes, I know,
> > that's high).
>
> After some searching I stumbled upon this Bugzilla report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1600138
>
> That seems to be the same issue, although I'm not 100% sure.
>
> Wido
>
> > Looking at StupidAllocator.cc I see this piece of code:
> >
> > void StupidAllocator::dump()
> > {
> >   std::lock_guard<std::mutex> l(lock);
> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
> >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
> >                   << free[bin].num_intervals() << " extents" << dendl;
> >     for (auto p = free[bin].begin();
> >          p != free[bin].end();
> >          ++p) {
> >       ldout(cct, 0) << __func__ << " 0x" << std::hex << p.get_start()
> >                     << "~" << p.get_len() << std::dec << dendl;
> >     }
> >   }
> > }
> >
> > I'm just wondering why it would spit out these lines and what's causing it.
> >
> > Has anybody seen this before?
> >
> > Wido

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
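
For reference, a minimal sketch of pushing those debug settings to running
OSDs and verifying they took effect (osd.12 is just a placeholder id; run the
daemon commands on that OSD's host). Note that the dump lines in the quoted
code are emitted via ldout(cct, 0), i.e. at log level 0, and level-0 messages
are still written with debug_bluestore = 0/0, which would explain why these
settings don't silence the output; they only quiet higher-level chatter:

    # apply the settings to all running OSDs without a restart
    ceph tell osd.* injectargs '--debug_osd=0/0 --debug_bluefs=0/0 --debug_bluestore=0/0 --debug_bdev=0/0'

    # confirm on one daemon (via its admin socket) that the values were picked up
    ceph daemon osd.12 config get debug_bluestore
    ceph daemon osd.12 config get debug_bluefs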
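
And a sketch of the work-around described above, i.e. moving a pool from the
SSD OSDs to the HDD OSDs using Luminous device classes. The pool name and rule
name here are made up for the example, and free capacity on the HDDs should be
checked before triggering the migration:

    # check per-OSD utilization to find the SSD OSDs that are near full
    ceph osd df tree

    # create a replicated CRUSH rule restricted to the hdd device class
    ceph osd crush rule create-replicated replicated_hdd default host hdd

    # point the (example) pool at the new rule; this starts the data migration
    ceph osd pool set mypool crush_rule replicated_hdd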