Re: 'DDOS on BlueStore'?

Sage Weil <sweil@xxxxxxxxxx> · Tue, 19 Feb 2019 22:53:27 +0000 (UTC)

On Wed, 20 Feb 2019, Igor Fedotov wrote:
> Hi Sage et al,
> 
> After monitoring the Ceph mailing lists (both devel and users) for a while,
> and from some customer reports, I've come to believe that BlueStore might
> cause operation slowdowns under specific circumstances, which often result in
> 'slow ops' alerts or suicide timeouts (e.g. http://tracker.ceph.com/issues/34526).
> 
> One related observation is that a fairly intensive read load was present when
> such slowdowns happened, e.g. scrubbing was enabled or the cluster was
> backfilling. I also recall several cases where disabling scrubbing was
> recommended to fight the issue, and this helped.
> 
> Unfortunately I haven't collected all the cases systematically, so I'm just
> sharing an impression rather than facts.
> 
> But today I had an idea that might explain the behavior. I'd like to get
> community feedback on whether it makes sense.
> 
> Here it is:
> 
> BlueStore has a per-collection (aka PG) read/write lock that protects both
> read and write operations. It allows concurrent reads from the objects
> belonging to the same collection and enforces exclusive write access.
> 
> An attempt to acquire the R/W lock puts a new writer on wait if reader(s) are
> in progress, while a new reader is allowed to acquire a lock that is already
> held by other readers (but not by a writer).
> 
> Hence one can imagine a situation where BlueStore gets a massive read flow
> that is processed again and again while some writers stay pending for an
> indefinite period of time (actually until there is a gap in this read flow).
> 
> The requirement is the presence of multiple read operations on the same
> collection that overlap in time.
> 
> Here is the sample picture:
> 
> read1  <---processing--->
> 
> write1 <waiting------------------------------------------------------
> 
> read2             <-----processing ------>
> 
> read3                                <-----processing ------>
> 
> read4                                                   <-----processing ------>
> 
> and so on.

I don't think this is it, because both reads and writes happen while the PG 
lock is held, which is just a regular mutex.  So if we're seeing a locking 
pattern like the above, it would be at the PG mutex level, not bluestore's 
per-collection rw lock.  At the bluestore layer, I think we only have a 
single caller into read or write at any time.
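
To illustrate the difference, here is a toy model in Python. It is not Ceph
code: the function names and the timings are made up, and "plain mutex serving
callers in arrival order" is a simplifying assumption about the PG lock. It
computes when a writer gets in under the reader-preferring policy described
above versus under a mutex that admits one caller at a time.

```python
def writer_start_rwlock(reads, write_arrival):
    """Earliest time a writer acquires a reader-preferring rwlock.

    reads: list of (start, duration) tuples. Readers are always admitted
    while no writer holds the lock, so the writer has to wait for a moment
    when no read is in progress.
    """
    t = write_arrival
    moved = True
    while moved:
        moved = False
        for start, duration in reads:
            if start <= t < start + duration:
                # A read is active at time t: wait until it finishes.
                t = start + duration
                moved = True
    return t


def writer_start_mutex(reads, write_arrival):
    """Earliest time a writer acquires a plain mutex served in arrival order."""
    ops = sorted([(s, d, "r") for s, d in reads] + [(write_arrival, 0, "w")])
    free_at = 0
    for arrival, duration, kind in ops:
        begin = max(arrival, free_at)
        if kind == "w":
            return begin
        free_at = begin + duration


# Four overlapping reads, each 4 time units long, and a write arriving at t=1.
reads = [(0, 4), (3, 4), (6, 4), (9, 4)]
print(writer_start_rwlock(reads, 1))  # 13: waits for a gap in the read flow
print(writer_start_mutex(reads, 1))   # 4: waits only for the read ahead of it
```

With an outer mutex the reads never overlap inside the store, so the chained
wait in the first function cannot build up at the rwlock level.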

At least, that's the case on master (and mimic), which has the scrub 
preemption.  That first appeared in 12.2.6.  Before that, sometime in the 
distant past, we dropped the lock for a scrub read, but we don't do that 
now...

In any case, if we can get an OSD into the state where it is exhibiting 
long latencies, turning up debug_bluestore and/or debug_osd for a few 
seconds might give us some clues.  If there is some read/write interaction 
going on, we should be able to see it in that log?
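
A minimal sketch of capturing such a short debug window, assuming the
`ceph tell osd.N injectargs` command is available on the node. The helper
names are hypothetical, and the level 20 and the restore value 1/5 are
assumptions; substitute whatever the OSD was actually running with.

```python
import subprocess
import time


def injectargs_cmd(osd_id, bluestore_level, osd_level):
    """Build the argv for 'ceph tell osd.<id> injectargs ...'."""
    args = f"--debug_bluestore {bluestore_level} --debug_osd {osd_level}"
    return ["ceph", "tell", f"osd.{osd_id}", "injectargs", args]


def capture_debug_window(osd_id, seconds=5, high="20", restore="1/5",
                         run=subprocess.run, sleep=time.sleep):
    """Raise both debug levels, wait a few seconds, then put them back.

    'restore' is an assumed previous level, not something read from the OSD.
    """
    run(injectargs_cmd(osd_id, high, high), check=True)
    try:
        sleep(seconds)
    finally:
        # Always lower the levels again, even if the wait is interrupted.
        run(injectargs_cmd(osd_id, restore, restore), check=True)
```

Calling `capture_debug_window(3)` while osd.3 is showing long latencies would
leave a few seconds of verbose output in that OSD's log.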

sage

> And a side note (just to mention but it might make the situation a bit worse)
> is that reading keeps the lock even when it performs potentially long ops like
> csum verification and decompression. Hence the probability for another reader
> to overlap with the existing one is increased.
> 
> 
> Does this make sense? Or have I missed something?
> 
> 
> Thanks,
> 
> Igor