On Thu, Jun 22, 2017 at 5:31 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> On 06/22/2017 10:40 AM, Dan van der Ster wrote:
>>
>> On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>>
>>> On 06/22/2017 04:00 AM, Dan van der Ster wrote:
>>>>
>>>> I'm now running the three relevant OSDs with that patch. (Recompiled,
>>>> replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
>>>> then restarted the osds.)
>>>>
>>>> It's working quite well, trimming 10 entries at a time instead of
>>>> 1000, and no more timeouts.
>>>>
>>>> Do you think it would be worth decreasing this hardcoded value in
>>>> ceph proper?
>>>>
>>>> -- Dan
>>>
>>> I do, yeah. At least, the trim operation should be able to pass in its
>>> own value for that. I opened a ticket for that at
>>> http://tracker.ceph.com/issues/20382.
>>>
>>> I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
>>> operation to trim a range of keys in a single osd op, instead of
>>> generating a different op for each key. I have a PR that does this at
>>> https://github.com/ceph/ceph/pull/15183. But it's still hard to
>>> guarantee that leveldb can process the entire range inside of the
>>> suicide timeout.
>>
>> I wonder if that would help. Here's what I've learned today:
>>
>> * Two of the 3 relevant OSDs have something screwy with their leveldb.
>>   The primary and 3rd replica are only ~quick at trimming for a few
>>   hundred keys, whilst the 2nd OSD is always very fast.
>> * After manually compacting the two slow OSDs, they are fast again for
>>   just a few hundred trims. So I'm compacting, trimming, ..., in a loop now.
>> * I moved the omaps to SSDs -- it doesn't help. (iostat confirms this is
>>   not IO bound.)
>> * CPU util on the slow OSDs gets quite high during the slow trimming.
>> * perf top output is below [1]. leveldb::Block::Iter::Prev and
>>   leveldb::InternalKeyComparator::Compare are notable.
>> * The always-fast OSD shows no leveldb functions in perf top while
>>   trimming.
>>
>> I've tried bigger leveldb cache and block sizes, compression on and off,
>> and played with the bloom size up to 14 bits -- none of these changes
>> make any difference.
>>
>> At this point I'm not confident this trimming will ever complete --
>> there are ~20 million records to remove at maybe 1 Hz.
>>
>> How about I just delete the meta.log object? Would this use a different,
>> perhaps quicker, code path to remove those omap keys?
>>
>> Thanks!
>>
>> Dan
>>
>> [1]
>>
>>   4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023e8d
>>   4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
>>   4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00000000000273bb
>>   3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
>>   3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
>>   2.84%  [kernel]                                 [k] port_inb
>>   2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
>>   2.75%  libstdc++.so.6.0.19                      [.] std::string::append
>>   2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
>>   1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023e77
>>   0.85%  [kernel]                                 [k] _raw_spin_lock
>>   0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
>>   0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023a05
>>   0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
>>   0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023a09
>>   0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
>>   0.51%  [kernel]                                 [k] __schedule
>>   0.48%  libruby.so.2.1.0                         [.] ruby_yyparse
>
> Hi Dan,
>
> Removing an object will try to delete all of its keys at once, which
> should be much faster. It's also very likely to hit your suicide timeout,
> so you'll have to keep retrying until it stops killing your osd.

Well, that was quick. The object delete took around 30s. I then restarted
the osd to compact it, and now the leveldb is ~100MB. Phew!

In summary, if someone finds themselves in this predicament (a huge mdlog
on a single-region rgw cluster), I'd advise turning the mdlog off, then
just deleting the meta.log objects.

Thanks!

Dan
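
For anyone who wants to script that cleanup, here's a rough sketch using
the python-rados bindings. The pool name (default.rgw.log here, or .log on
older deployments) and the meta.log. object prefix are assumptions, so
check them with "rados -p <pool> ls" on your own cluster first, and make
sure the metadata log is disabled before deleting anything:

#!/usr/bin/env python
# Sketch: remove the meta.log.* objects from the rgw log pool.
# Assumptions: pool name and object prefix; verify both before running.
import rados

LOG_POOL = 'default.rgw.log'   # older clusters may use '.log'
PREFIX = 'meta.log.'

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx(LOG_POOL)
    try:
        for obj in ioctx.list_objects():
            if obj.key.startswith(PREFIX):
                # Removing the object drops all of its omap keys in one
                # operation, which is what made the cleanup quick here.
                print('removing %s' % obj.key)
                ioctx.remove_object(obj.key)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()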
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com