Re: Bluestore caching oddities, again

Christian Balzer <chibi@xxxxxxx> · Thu, 8 Aug 2019 12:26:00 +0900

Hello Sage,

On Thu, 8 Aug 2019 02:23:15 +0000 (UTC) Sage Weil wrote:

> On Thu, 8 Aug 2019, Christian Balzer wrote:
> > 
> > Hello again,
> > 
> > Getting back to this:
> > On Sun, 4 Aug 2019 10:47:27 +0900 Christian Balzer wrote:
> >   
> > > Hello,
> > > 
> > > while preparing the first production bluestore cluster, based on the
> > > latest nautilus, I've run into the same things that other people and I
> > > ran into before.
> > > 
> > > Firstly HW, 3 nodes with 12 SATA HDDs each, IT mode LSI 3008, wal/db on
> > > 40GB SSD partitions. (boy do I hate the inability of ceph-volume to deal
> > > with raw partitions).
> > > SSDs aren't a bottleneck in any scenario.
> > > Single E5-1650 v3 @ 3.50GHz, cpu isn't a bottleneck in any scenario, less
> > > than 15% of a core per OSD.
> > > 
> > > Connection is via 40Gb/s InfiniBand, IPoIB, no issues here as numbers later
> > > will show.
> > > 
> > > Clients are KVMs on Epyc based compute nodes, maybe some more speed could
> > > be squeezed out here with different VM configs, but the cpu isn't an issue
> > > in the problem cases.
> > > 
> > > 
> > > 
> > > 1. 4k random I/O can cause degraded PGs
> > > I've run into the same/similar issue as Nathan Fish here:
> > > https://www.spinics.net/lists/ceph-users/msg526
> > > During the first 2 tests with 4k random I/O I briefly got degraded PGs as
> > > well, with nothing in CPU or SSD utilization accounting for this.
> > > HDDs were of course busy at that time.
> > > Wasn't able to reproduce this so far, but it leaves me less than
> > > confident. 
> > > 
> > >   
> > This happened again yesterday when rsyncing 260GB of files averaging 4MB
> > each into a Ceph image backed VM.
> > Given the nature of this rsync nothing on the ceph nodes was the least bit
> > busy, the HDDs were all below 15% utilization, CPU bored, etc.
> > 
> > Still we got:
> > ---
> > 2019-08-07 15:38:23.452580 osd.21 (osd.21) 651 : cluster [DBG] 1.125 starting backfill to osd.9 from (0'0,0'0] MAX to 1297'21584
> > 2019-08-07 15:38:24.454942 mon.ceph-05 (mon.0) 182756 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
> > 2019-08-07 15:38:25.396756 mon.ceph-05 (mon.0) 182757 : cluster [DBG] osdmap e1302: 36 total, 36 up, 36 in
> > 2019-08-07 15:38:23.452026 osd.12 (osd.12) 767 : cluster [DBG] 1.105 starting backfill to osd.25 from (0'0,0'0] MAX to 1297'6782
> > ---  
> 
> Is the balancer enabled?  Maybe it is adjusting the PG distribution a bit.
> 
It is indeed, and that would explain things. I did run it manually a few
times, though, and the PG counts per OSD are all within one of each other,
so I didn't really expect any further adjustments, as this cluster only has
a single pool, RBD.
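
For reference, the "within one of each other" check can be sketched in a few
lines of Python. This is only a rough sketch: it assumes the output of
`ceph osd df --format json` has been parsed into a dict with a "nodes" list
whose entries carry a "pgs" count, and the function name pg_spread is made
up for illustration.

```python
def pg_spread(osd_df):
    """Return (lowest, highest) PG count per OSD.

    osd_df is assumed to be the parsed JSON of `ceph osd df --format json`,
    i.e. a dict with a "nodes" list where each entry has a "pgs" field.
    """
    counts = [node["pgs"] for node in osd_df["nodes"]]
    return min(counts), max(counts)


def is_balanced(osd_df, tolerance=1):
    """True if no two OSDs differ by more than `tolerance` PGs."""
    lowest, highest = pg_spread(osd_df)
    return highest - lowest <= tolerance
```

Feeding it something like json.load() of the saved command output should be
enough to confirm the distribution before and after a balancer run.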

Would be nice if it spoke up not just in the audit.log:
---
2019-08-07 15:38:21.092104 mon.ceph-05 (mon.0) 182680 : audit [INF] from='mgr.196195 10.0.8.25:0/960' entity='mgr.ceph-05' cmd=[{"item": "osd.0", "prefix": "osd crush weight-set reweight-compat", "weight": [2.504257053831929], "format": "json"}]: dispatch
---
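
Until it does, the reweights can be pulled out of the audit log by hand. A
minimal sketch, assuming audit lines look like the one quoted above (two
timestamp fields first, then a cmd=[...] JSON list); the function name
balancer_reweights is hypothetical and the regex is only fitted to that one
sample line.

```python
import json
import re

# Captures the JSON command list from "cmd=[...]:" in an audit log line.
# The greedy match runs to the last "]:" so nested lists stay intact.
CMD_RE = re.compile(r"cmd=(\[.*\]):")


def balancer_reweights(lines):
    """Yield (timestamp, osd, weight) for each compat weight-set reweight."""
    for line in lines:
        match = CMD_RE.search(line)
        if not match:
            continue
        try:
            cmds = json.loads(match.group(1))
        except ValueError:
            # Not valid JSON after all; skip rather than guess.
            continue
        for cmd in cmds:
            if cmd.get("prefix") != "osd crush weight-set reweight-compat":
                continue
            timestamp = " ".join(line.split()[:2])
            weights = cmd.get("weight") or [None]
            yield timestamp, cmd.get("item"), weights[0]
```

Running this over the audit log and comparing the timestamps with the
"starting backfill" entries in the cluster log should show whether the two
line up.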

I turned it off now, as I don't expect significant variances going forward.

Thanks,

Christian

> > Unfortunately all I have in the OSD log is this:
> > ---
> > 2019-08-07 15:38:23.461 7f155e71b700  1 osd.9 pg_epoch: 1299 pg[1.125( empty local-lis/les=0/0 n=0 ec=189/189 lis/c 1286/1286 les/c/f 1287/1287/0 1298/1299/189) [21,9,28]/[21,28,3] r=-1 lpr=1299 pi=[1286,1299)/1 crt=0'0 unknown mbc={}] state<Start>: transitioning to Stray
> > 2019-08-07 15:38:24.353 7f155e71b700  1 osd.9 pg_epoch: 1301 pg[1.125( v 1297'21584 (1246'18584,1297'21584] local-lis/les=1299/1300 n=5 ec=189/189 lis/c 1299/1299 les/c/f 1300/1300/0 1298/1301/189) [21,9,28] r=1 lpr=1301 pi=[1299,1301)/1 luod=0'0 crt=1297'21584 active mbc={}] start_peering_interval up [21,9,28] -> [21,9,28], acting [21,28,3] -> [21,9,28], acting_primary 21 -> 21, up_primary 21 -> 21, role -1 -> 1, features acting 4611087854031667199 upacting 4611087854031667199
> > 2019-08-07 15:38:24.353 7f155e71b700  1 osd.9 pg_epoch: 1301 pg[1.125( v 1297'21584 (1246'18584,1297'21584] local-lis/les=1299/1300 n=5 ec=189/189 lis/c 1299/1299 les/c/f 1300/1300/0 1298/1301/189) [21,9,28] r=1 lpr=1301 pi=[1299,1301)/1 crt=1297'21584 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
> > ---
> > 
> > How can I find out what happened here? Given that it might not happen
> > again anytime soon, cranking up debug levels now is a tad late.
> 
> In the past we had "problems" where the degraded count would increase in 
> cases where we were migrating PGs, even though there aren't actually any 
> objects with too few replicas.  I think David Zafman ironed most/all 
> of these out, but perhaps they weren't all in Nautilus? I can't quite 
> remember.
> 
> s
> 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Mobile Inc.