Hi,

I am investigating a "suicide timeout" during scrubbing reported by our users. The cluster is running jewel 10.2.10, and a distinguishing feature of it is that it has a cache pool for which, for some reason, a pg scrub generates many errors like this:

osd.779 found snap mapper error on pg 56.11b oid 56:d8ba1d80:::rbd_data.924e5c33f6228e7.0000000000038670:2f3 snaps missing in mapper, should be: 2f3...repaired

And after one such error the scrub thread gets stuck waiting for the apply_transaction call that fixes the snap mapper to return:

#0  0x00007f64d251d0ff in pthread_cond_wait@@GLIBC_2.3.2 () from ./lib64/libpthread.so.0
#1  0x000055ba77c099eb in Cond::Wait (mutex=..., this=0x7f6490a0d640) at common/Cond.h:56
#2  ObjectStore::apply_transactions (this=this@entry=0x55ba81fd2000, osr=osr@entry=0x55ba81ee4db0, tls=..., ondisk=ondisk@entry=0x0) at os/ObjectStore.cc:173
#3  0x000055ba77a1c5ed in ObjectStore::apply_transaction (ondisk=0x0, t=..., osr=0x55ba81ee4db0, this=0x55ba81fd2000) at os/ObjectStore.h:1855
#4  PG::_scan_snaps (this=this@entry=0x55ba9631a800, smap=...) at osd/PG.cc:3924
#5  0x000055ba77a1d031 in PG::build_scrub_map_chunk (this=this@entry=0x55ba9631a800, map=..., start=..., end=..., deep=deep@entry=true, seed=seed@entry=4294967295, handle=...) at osd/PG.cc:3967
#6  0x000055ba77a2c56d in PG::chunky_scrub (this=this@entry=0x55ba9631a800, handle=...) at osd/PG.cc:4431
#7  0x000055ba77a2df5c in PG::scrub (this=0x55ba9631a800, queued=<optimized out>, handle=...) at osd/PG.cc:4166
#8  0x000055ba779283c7 in PGQueueable::RunVis::operator() (this=this@entry=0x7f6490a0e820, op=...) at osd/OSD.cc:171

It looks like the transaction successfully reached FileStore::_finish_op, and the `onreadable` context (which the scrub thread is waiting for) was queued to the filestore apply finisher.
But the filestore apply finisher got stuck completing a PG::C_UpdateLastRollbackInfoTrimmedToApplied context that wants the pg lock, which is currently held by the scrub thread:

#0  0x00007f64d251fb7c in __lll_lock_wait () from ./lib64/libpthread.so.0
#1  0x00007f64d251a7d3 in pthread_mutex_lock () from ./lib64/libpthread.so.0
#2  0x000055ba77f72548 in Mutex::Lock (this=this@entry=0x55ba9631ad28, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
#3  0x000055ba779eb650 in PG::lock (this=0x55ba9631a800, no_lockdep=no_lockdep@entry=false) at osd/PG.cc:275
#4  0x000055ba77a40f74 in PG::C_UpdateLastRollbackInfoTrimmedToApplied::finish (this=0x55ba8f3542b0) at osd/PG.h:534
#5  0x000055ba77986199 in Context::complete (this=0x55ba8f3542b0, r=<optimized out>) at include/Context.h:64
#6  0x000055ba779c483b in finish_contexts<Context> (cct=0x0, finished=..., result=0) at include/Context.h:120
#7  0x000055ba77986ad9 in Context::complete (r=<optimized out>, this=0x55ba99ab8c80) at include/Context.h:64
#8  C_ContextsBase<Context, Context>::complete (this=0x55ba99ab8c80, r=<optimized out>) at include/Context.h:194
#9  0x000055ba77ef40a6 in Finisher::finisher_thread_entry (this=0x55ba81fd62c0) at common/Finisher.cc:68
#10 0x00007f64d2518724 in start_thread () from ./lib64/libpthread.so.0
#11 0x00007f64d0067e8d in clone () from ./lib64/libc.so.6

To confirm that the lock is held by the scrub thread, the holder thread id:

#fr 2
#2  0x000055ba77f72548 in Mutex::Lock (this=this@entry=0x55ba9631ad28, no_lockdep=no_lockdep@entry=false) at common/Mutex.cc:110
110       r = pthread_mutex_lock(&_m);
#p/x locked_by
$12 = 0x7f6490a10700

And from the crash log, 0x7f6490a10700 is our scrub thread:

7f6490a10700 -1 log_channel(cluster) log [ERR] : osd.779 found snap mapper error on pg 56.11b oid 56:d88e91fb:::rbd_data.924e5c33f6228e7.000000000003ab75:2f3 snaps missing in mapper, should be: 2f3...repaired
7f6484bb4700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6490a10700' had timed out after 15
7f64cc502700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6490a10700' had suicide timed out after 150

And to confirm that this finisher (0x55ba81fd62c0) is FileStore::apply_finishers[0]:

#bt
#0  0x00007f64d251d4a8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from ./lib64/libpthread.so.0
#1  0x000055ba77cb9ac9 in Cond::WaitUntil (when=..., mutex=..., this=0x55ba81fd2530) at common/Cond.h:72
#2  Cond::WaitInterval (this=this@entry=0x55ba81fd2530, cct=<optimized out>, mutex=..., interval=...) at common/Cond.h:81
#3  0x000055ba77c95d8e in FileStore::sync_entry (this=0x55ba81fd2000) at os/filestore/FileStore.cc:3717
#4  0x000055ba77cc062d in FileStore::SyncThread::entry (this=<optimized out>) at os/filestore/FileStore.h:185
#5  0x00007f64d2518724 in start_thread () from ./lib64/libpthread.so.0
#6  0x00007f64d0067e8d in clone () from ./lib64/libc.so.6

#fr 3
#3  0x000055ba77c95d8e in FileStore::sync_entry (this=0x55ba81fd2000) at os/filestore/FileStore.cc:3717
3717      sync_cond.WaitInterval(g_ceph_context, lock, max_interval);
#p *(apply_finishers._M_impl._M_start)
$10 = (Finisher *) 0x55ba81fd62c0

So if my analysis is correct, it is unsafe to call ObjectStore::apply_transaction in PG::_scan_snaps while holding the pg lock, or we should be careful not to acquire the pg lock in contexts that are completed by FileStore::apply_finishers (like PG::C_UpdateLastRollbackInfoTrimmedToApplied).

I am not sure if this still applies to master, as the code has diverged, but the same issue looks to be present at least in luminous.

I suppose a workaround for this particular user could be to try increasing filestore_apply_finisher_threads (are there any potential issues with this?). Or could temporarily disabling snap trim (until all pgs are scrubbed) avoid triggering PG::C_UpdateLastRollbackInfoTrimmedToApplied?

--
Mykola Golub