'DDOS on BlueStore'?

Igor Fedotov <ifedotov@xxxxxxx> · Wed, 20 Feb 2019 00:56:09 +0300

Hi Sage et al,

After monitoring the Ceph mailing lists (both devel and users) for a
while, and based on some customer reports, I've come to believe that
BlueStore might cause operation slowdowns under specific circumstances.
These often result in 'slow ops' alerts or suicide timeouts (e.g.
http://tracker.ceph.com/issues/34526).

One related observation is the presence of a more or less intensive
read load when such slowdowns happened, e.g. scrubbing enabled or
cluster backfilling in progress. I also recall several cases where
disabling scrubbing was recommended to fight the issues, and this helped.

Unfortunately I haven't collected all the cases systematically, hence
I'm just sharing my impression rather than facts.

But today I've got an idea that might explain the behavior. I'd like to
get community feedback on whether it makes sense.

Here it is:

BlueStore has a per-collection (aka PG) read/write lock that protects 
both read and write operations. It allows concurrent reads from the 
objects belonging to the same collection and enforces exclusive write 
access.

An R/W lock acquisition attempt puts a new writer on wait if reader(s)
are in progress, while a new reader is allowed to acquire a lock that is
already held by other readers (but not by a writer).
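
To make the rule above concrete, here is a minimal single-threaded
sketch of the admission logic as I described it. It is only a model of
the described behavior, not BlueStore's actual code; the class and
method names are made up for illustration, and there is no real
blocking, a refused request simply returns False.

```python
class ReaderPreferringLock:
    """Toy model of the admission rule only (hypothetical, no blocking)."""

    def __init__(self):
        self.active_readers = 0
        self.writer_active = False

    def try_read(self):
        # A reader is admitted unless a writer currently holds the lock.
        # Writers that are merely waiting do not keep new readers out.
        if self.writer_active:
            return False
        self.active_readers += 1
        return True

    def try_write(self):
        # A writer needs the lock to be completely free.
        if self.writer_active or self.active_readers > 0:
            return False
        self.writer_active = True
        return True

    def release_read(self):
        assert self.active_readers > 0
        self.active_readers -= 1

    def release_write(self):
        assert self.writer_active
        self.writer_active = False


if __name__ == "__main__":
    lock = ReaderPreferringLock()
    print(lock.try_read())   # read1 is admitted
    print(lock.try_write())  # write1 is refused, a reader is active
    print(lock.try_read())   # read2 is admitted despite the waiting writer
    lock.release_read()      # read1 finishes
    print(lock.try_write())  # write1 is still refused, read2 is active
```

The point of the sketch is the second try_read(): it succeeds even
though a writer has already asked for the lock.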

Hence one can imagine a situation where BlueStore gets a massive read
flow that is processed again and again while some writers are pending
for an indefinite period of time (actually until there is a gap in this
read flow).

The requirement is the presence of multiple read operations on the same
collection that overlap in time.

Here is a sample picture:

read1  <---processing--->

write1 <waiting------------------------------------------------------

read2             <-----processing ------>

read3                                <-----processing ------>

read4                                                   <-----processing ------>

and so on.
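
The picture can also be expressed as a small deterministic calculation.
This is just a sketch under my assumptions: reads are given as
(start, duration) pairs in arbitrary time units, a reader is always
admitted while no writer holds the lock, and a reader arriving exactly
when the lock becomes free wins the tie. The function name and the
numbers are hypothetical.

```python
def writer_acquire_time(reads, write_arrival):
    """Return the time at which a waiting writer finally gets the lock.

    reads: iterable of (start, duration) pairs for read operations.
    write_arrival: time at which the writer starts waiting.
    """
    t = write_arrival
    for start, duration in sorted(reads):
        if start > t:
            # Gap in the read flow: the lock is free at time t.
            break
        end = start + duration
        if end > t:
            # This read is still running at t, so the writer waits for it.
            t = end
    return t


if __name__ == "__main__":
    # Four reads of 10 units each, every one overlapping the previous.
    overlapping = [(0, 10), (8, 10), (16, 10), (24, 10)]
    print(writer_acquire_time(overlapping, 1))  # 34

    # Same flow with a gap before the third read.
    with_gap = [(0, 10), (8, 10), (20, 10)]
    print(writer_acquire_time(with_gap, 1))     # 18
```

In the first case no single read takes longer than 10 units, yet the
writer waits 33 units, and would wait longer with every additional
overlapping read.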

And a side note (just to mention, but it might make the situation a bit
worse): a read keeps the lock even while it performs potentially long
ops like csum verification and decompression. Hence the probability of
another reader overlapping with the existing one is increased.
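
A rough way to quantify this side note, assuming (and this is purely my
assumption, not a measurement) that reads on one collection arrive as a
Poisson process and each holds the lock for a fixed time. The first
function gives the chance that at least one more read arrives while a
given read holds the lock; the second is the standard M/G/infinity
result for the mean length of a continuous stretch with at least one
reader active. The function names and the rates are hypothetical.

```python
import math


def overlap_probability(read_rate, hold_time):
    """P(at least one new read arrives during one lock hold)."""
    return 1.0 - math.exp(-read_rate * hold_time)


def mean_busy_period(read_rate, hold_time):
    """Mean length of a stretch during which readers hold the lock."""
    return (math.exp(read_rate * hold_time) - 1.0) / read_rate


if __name__ == "__main__":
    rate = 100.0  # hypothetical: 100 reads per second on one collection
    for hold in (0.005, 0.05):  # hypothetical: 5 ms vs 50 ms under the lock
        print(hold, overlap_probability(rate, hold),
              mean_busy_period(rate, hold))
```

With these made-up numbers, a 5 ms hold gives stretches of about 6.5 ms
on average, while a 50 ms hold gives about 1.5 s. So under this model
the time a writer may be locked out grows exponentially, not linearly,
with the time a read spends under the lock.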

Does this make sense? Or have I missed something?

Thanks,

Igor