Hi Sage et al,
After monitoring the Ceph mailing lists (both devel and users) for a while,
and based on some customer reports, I've come to believe that BlueStore
might cause operation slowdowns under specific circumstances. These
often result in 'slow ops' alerts or suicide timeouts (e.g.
http://tracker.ceph.com/issues/34526).
One related observation is the presence of a more or less intensive read
load when such slowdowns happen, e.g. with scrubbing enabled or during
cluster backfilling. I also recall several cases where disabling scrubbing
was recommended to fight the issues, and it helped.
Unfortunately I haven't collected all the cases systematically, so I'm
sharing an impression rather than facts.
But today I had an idea that might explain the behavior. I would
like to get community feedback on whether it makes sense.
Here it is:
BlueStore has a per-collection (aka PG) read/write lock that protects
both read and write operations. It allows concurrent reads from the
objects belonging to the same collection and enforces exclusive write
access.
An attempt to acquire the R/W lock makes a new writer wait if any reader(s)
are in progress, whereas a new reader is allowed to acquire a lock that is
already held by other readers (but not by a writer).
Hence one can imagine a situation where BlueStore gets a massive read
flow that is served continuously while some writers stay pending
for an indefinite period of time (actually until there is a gap in this
read flow).
The requirement is the presence of multiple read operations on the same
collection that overlap in time.
Here is the sample picture:
read1  <---processing--->
write1     <waiting------------------------------------------------------
read2               <-----processing ------>
read3                                 <-----processing ------>
read4                                                   <-----processing ------>
and so on.
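
To make the picture concrete, here is a minimal sketch in Python (not
BlueStore code). It assumes a strictly reader-preferring lock: a reader is
admitted whenever no writer holds the lock, even if a writer is already
waiting. The function name and the time units are made up for illustration.

```python
def writer_acquire_time(reads, write_arrival):
    """Return the time at which a waiting writer finally gets the lock.

    reads: iterable of (start, duration) pairs for reads on one collection.
    Assumption: readers are always admitted while no writer holds the lock,
    and a reader arriving at the same instant as a gap wins the tie.
    """
    t = write_arrival
    for start, duration in sorted(reads):
        if start > t:
            # Gap in the read flow: nobody holds the lock at time t.
            break
        # This read is (or was) active at or before t, so the writer
        # cannot acquire the lock before it finishes.
        t = max(t, start + duration)
    return t


# Four reads, each starting before the previous one ends (as in the picture).
overlapping = [(0, 10), (8, 10), (16, 10), (24, 10)]
# Same first read, but the second one starts after a gap.
gapped = [(0, 10), (12, 10)]

print(writer_acquire_time(overlapping, 2))  # 34: waits for the whole chain
print(writer_acquire_time(gapped, 2))       # 10: acquires in the first gap
```

With the overlapping reads the writer that arrived at t=2 is held until
t=34, although every single read only takes 10 units. Adding more
overlapping reads extends the wait without bound.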
A side note (worth mentioning since it might make the situation a bit
worse): a read keeps the lock even while it performs potentially
long operations like csum verification and decompression. Hence the
probability of another reader overlapping with an existing one increases.
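
A rough way to see how much the hold time matters, as a sketch only: assume
reads on one collection arrive as a Poisson process and each holds the lock
for a fixed time. Both are simplifying assumptions of mine, not measurements,
and the numbers below are invented. Under those assumptions the chance that
another read arrives while a given read still holds the lock is
1 - exp(-rate * hold), and the mean length of an uninterrupted run of
overlapping reads is the standard M/G/infinity busy-period value
(exp(rate * hold) - 1) / rate.

```python
import math


def overlap_probability(read_rate, hold_time):
    """Probability that at least one new read arrives during one hold.

    Assumes Poisson arrivals at read_rate (reads/s) and a fixed hold_time (s).
    """
    return 1.0 - math.exp(-read_rate * hold_time)


def mean_read_run(read_rate, hold_time):
    """Mean duration (s) of a gap-free run of overlapping reads.

    Uses the M/G/infinity mean busy period under the same assumptions.
    """
    return math.expm1(read_rate * hold_time) / read_rate


# Hypothetical numbers: 100 reads/s on one PG, lock held for 10 ms vs 50 ms.
for hold in (0.010, 0.050):
    print(hold, overlap_probability(100, hold), mean_read_run(100, hold))
```

With these made-up numbers, stretching the hold time from 10 ms to 50 ms
raises the mean gap-free read run from roughly 17 ms to roughly 1.5 s, so a
modest increase in per-read lock time can lengthen writer stalls
disproportionately.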
Does this make sense? Or have I missed something?
Thanks,
Igor