On 06/22/2017 10:40 AM, Dan van der Ster wrote:
On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
On 06/22/2017 04:00 AM, Dan van der Ster wrote:
I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the OSDs.)
It's working quite well, trimming 10 entries at a time instead of
1000, and with no more timeouts.
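For reference, the patch is essentially a one-line change in
src/cls/log/cls_log.cc -- sketched below, assuming the cap is the
MAX_TRIM_ENTRIES constant in the tree I built against:

    // src/cls/log/cls_log.cc (sketch of the local patch)
    // was:
    #define MAX_TRIM_ENTRIES 1000
    // now, so each trim op finishes well inside the suicide timeout:
    #define MAX_TRIM_ENTRIES 10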
Do you think it would be worth decreasing this hardcoded value in
Ceph proper?
-- Dan
I do, yeah. At the least, the trim operation should be able to pass in
its own value for that. I've opened a ticket for it at
http://tracker.ceph.com/issues/20382.
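Roughly, I'm thinking of something like the sketch below for
cls_log.cc (the max_entries field on cls_log_trim_op is hypothetical,
it doesn't exist yet):

    // Sketch only: let the client bound the work done by one trim op.
    static int cls_log_trim(cls_method_context_t hctx, bufferlist *in,
                            bufferlist *out)
    {
      cls_log_trim_op op;
      // ... decode op from *in, as the current code does ...

      size_t max_entries = MAX_TRIM_ENTRIES;   // current hardcoded 1000
      if (op.max_entries > 0 && op.max_entries < max_entries)
        max_entries = op.max_entries;          // hypothetical new field

      // ... then remove at most max_entries omap keys in
      // [op.from_time, op.to_time), as the current code does ...
      return 0;
    }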
I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
operation to trim a whole range of keys in a single OSD op, instead of
generating a separate op for each key. I have a PR that does this at
https://github.com/ceph/ceph/pull/15183, but it's still hard to guarantee
that leveldb can process the entire range inside the suicide timeout.
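The difference is roughly the following; the cls-side wrapper name and
signature in the second half are assumptions about what the PR would
expose:

    // Today: one ObjectStore operation is queued per omap key.
    // keys_to_trim here stands in for the keys the trim op selected.
    for (const auto& key : keys_to_trim) {
      int r = cls_cxx_map_remove_key(hctx, key);
      if (r < 0)
        return r;
    }

    // With OP_OMAP_RMKEYRANGE: one operation covers the whole range.
    int r = cls_cxx_map_remove_range(hctx, first_key, last_key);
    if (r < 0)
      return r;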
I wonder if that would help. Here's what I've learned today:
* Two of the three relevant OSDs have something screwy with their
leveldb: the primary and third replica are quick at trimming for only
a few hundred keys, while the second OSD is always very fast.
* After manually compacting the two slow OSDs, they are fast again for
just a few hundred trims. So I'm compacting, trimming, compacting,
... in a loop now.
* I moved the omaps to SSDs -- it doesn't help (iostat confirms this
is not IO bound).
* CPU utilization on the slow OSDs gets quite high during the slow
trimming.
* perf top is below [1]; leveldb::Block::Iter::Prev and
leveldb::InternalKeyComparator::Compare are notable.
* The always-fast OSD shows no leveldb functions in perf top while
trimming.

I've tried bigger leveldb cache and block sizes, compression on and
off, and played with the bloom size up to 14 bits -- none of these
changes makes any difference.

At this point I'm not confident this trimming will ever complete --
there are ~20 million records to remove at maybe 1Hz.

How about I just delete the meta.log object? Would that use a
different, perhaps quicker, code path to remove those omap keys?
Thanks!
Dan
[1]
   4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023e8d
   4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
   4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00000000000273bb
   3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
   3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
   2.84%  [kernel]                                 [k] port_inb
   2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
   2.75%  libstdc++.so.6.0.19                      [.] std::string::append
   2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
   1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023e77
   0.85%  [kernel]                                 [k] _raw_spin_lock
   0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
   0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023a05
   0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
   0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023a09
   0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
   0.51%  [kernel]                                 [k] __schedule
   0.48%  libruby.so.2.1.0                         [.] ruby_yyparse
Hi Dan,
Removing an object will try to delete all of its keys at once, which
should be much faster. It's also very likely to hit your suicide
timeout, so you'll have to keep retrying until it stops killing your OSD.
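If you go that route, a minimal librados sketch of the delete is below.
The pool and object names are illustrative (check your zone's log pool),
and note that librados resends the op itself, so the call may simply
block across OSD restarts until the delete finally goes through:

    #include <rados/librados.hpp>
    #include <iostream>

    int main() {
      librados::Rados cluster;
      cluster.init2("client.admin", "ceph", 0);
      cluster.conf_read_file(nullptr);   // default /etc/ceph/ceph.conf
      cluster.connect();

      librados::IoCtx ioctx;
      // Log pool name is an assumption -- adjust to your zone's config.
      cluster.ioctx_create("default.rgw.log", ioctx);

      // One meta.log shard object; the shard suffix here is made up.
      int r = ioctx.remove("meta.log.44");
      std::cout << "remove returned " << r << std::endl;

      cluster.shutdown();
      return 0;
    }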