Re: osd_memory_target=level0 ?

I don't have much experience in recovering struggling EC pools,
unfortunately. It looks like it can't find OSDs for 2 out of the 6
shards. Since you run EC 4+2 the data isn't lost, but I'm not 100% sure
how to make it healthy.
There was a thread a while back about a similar issue, albeit with
possibly different underlying causes, but maybe there is some helpful
advice in there:
https://www.mail-archive.com/ceph-users@xxxxxxx/msg06854.html
I suspect your OSDs frequently suiciding is what has put these PGs into this state.
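If it helps, the first thing I'd check is what the PG itself reports about
the missing objects, roughly along these lines (read-only commands, PG id
taken from your output below):

    ceph health detail            # lists the PGs with unfound objects
    ceph pg 28.5b query           # shows which OSDs have been probed for the missing shards
    ceph pg 28.5b list_unfound    # lists the unfound objects themselves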

Maybe someone else on the list has some ideas.

On Fri, 1 Oct 2021 at 18:13, Szabo, Istvan (Agoda)
<Istvan.Szabo@xxxxxxxxx> wrote:
>
> Thank you very much Christian. Maybe you have an idea how I can get the cluster out of this state? Something is blocking the recovery and the rebalance, something is stuck somewhere, and that's why I can't increase the PG count further.
> I don't have the auto PG scaler enabled or anything; it is just in warn mode.
>
> If I set the min_size of the pool to 4, will this PG be recovered? Or how can I get the cluster out of a health error like this?
>
> Marking as lost seems risky based on some mailing-list experience - even after marking lost you can still have issues - so I'm curious what the way is to get the cluster out of this and let it recover:
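> (For clarity, the commands I am asking about would be roughly the following; the pool name is just a placeholder:)
>
>     ceph osd pool set <ec-data-pool> min_size 4      # let EC 4+2 PGs go active with only 4 shards
>     ceph pg 28.5b mark_unfound_lost revert|delete    # the risky "mark as lost" step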
>
>
>
> Example problematic pg:
>
> dumped pgs_brief
> PG_STAT  STATE                                                 UP                   UP_PRIMARY  ACTING                              ACTING_PRIMARY
> 28.5b    active+recovery_unfound+undersized+degraded+remapped    [18,33,10,0,48,1]          18  [2147483647,2147483647,29,21,4,47]              29
>
>
>
> Cluster state:
>
>   cluster:
>     id:     5a07ec50-4eee-4336-aa11-46ca76edcc24
>     health: HEALTH_ERR
>             10 OSD(s) experiencing BlueFS spillover
>             4/1055070542 objects unfound (0.000%)
>             noout flag(s) set
>             Possible data damage: 2 pgs recovery_unfound
>             Degraded data redundancy: 64150765/6329079237 objects degraded (1.014%), 10 pgs degraded, 26 pgs undersized
>             4 pgs not deep-scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)
>     mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02
>     osd: 49 osds: 49 up (since 36m), 49 in (since 4d); 28 remapped pgs
>          flags noout
>     rgw: 3 daemons active (mon-2s01.rgw0, mon-2s02.rgw0, mon-2s03.rgw0)
>
>   task status:
>
>   data:
>     pools:   9 pools, 425 pgs
>     objects: 1.06G objects, 66 TiB
>     usage:   158 TiB used, 465 TiB / 623 TiB avail
>     pgs:     64150765/6329079237 objects degraded (1.014%)
>              38922319/6329079237 objects misplaced (0.615%)
>              4/1055070542 objects unfound (0.000%)
>              393 active+clean
>              13  active+undersized+remapped+backfill_wait
>              8   active+undersized+degraded+remapped+backfill_wait
>              3   active+clean+scrubbing
>              3   active+undersized+remapped+backfilling
>              2   active+recovery_unfound+undersized+degraded+remapped
>              2   active+remapped+backfill_wait
>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   181 MiB/s rd, 9.4 MiB/s wr, 5.38k op/s rd, 2.42k op/s wr
>     recovery: 23 MiB/s, 389 objects/s
>
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------
>
> On 2021. Oct 1., at 1:25, Christian Wuerdig <christian.wuerdig@xxxxxxxxx> wrote:
>
>
> That is - one thing you could do is rate-limit PUT requests on your
> haproxy down to a level where your cluster is stable. At least that
> gives you a chance to finish the PG scaling without OSDs dying on you
> constantly.
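> Roughly something like this in the haproxy frontend (untested sketch -
> frontend/backend names, ports and thresholds are just placeholders, tune
> them to what your cluster can absorb):
>
>     defaults
>         mode http
>         timeout connect 5s
>         timeout client  60s
>         timeout server  60s
>
>     frontend rgw_fe
>         bind *:80
>         acl is_put method PUT
>         # count PUTs per client IP over a 10s sliding window
>         stick-table type ip size 100k expire 60s store http_req_rate(10s)
>         http-request track-sc0 src if is_put
>         # answer 429 once a client exceeds ~50 PUTs per 10s
>         http-request deny deny_status 429 if is_put { sc_http_req_rate(0) gt 50 }
>         default_backend rgw_be
>
>     backend rgw_be
>         balance roundrobin
>         server rgw1 10.118.199.1:8080 check
>         server rgw2 10.118.199.2:8080 check
>         server rgw3 10.118.199.3:8080 check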
>
> On Fri, 1 Oct 2021 at 11:56, Christian Wuerdig
> <christian.wuerdig@xxxxxxxxx> wrote:
>
>
> Ok, so I guess there are several things coming together that end up making your life a bit miserable at the moment:
>
> - PG scaling causing increased IO
> - Ingesting a large number of objects into RGW, causing lots of IOPS
> - Usual client traffic
> - The NVMe you use for WAL/DB has only about half the listed random-write IOPS of your backing SSDs, while it should be the other way around - the WAL/DB device is supposed to be the faster device. You'd probably be better off replacing your NVMes with something like a P20096
> - A small number of large drives - Ceph traditionally scales better with more but smaller OSDs, especially if you plan on hosting truckloads of RGW blobs
>
>
> Don't have a good solution - maybe you can stop the PG scaling until the
> big data load has finished, or arrange a schedule: load data at night and
> pause during the day to continue the PG scaling.
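> A crude way to do that scheduling is just toggling the backfill flags
> (illustrative only):
>
>     ceph osd set nobackfill       # pause backfill/rebalance while the data load runs
>     ceph osd set norebalance
>     ceph osd unset nobackfill     # resume so the PG scaling can continue
>     ceph osd unset norebalance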
>
> Try to get your hands on a couple of faster NVMe drives and replace the
> WAL/DB drives in one node to see how much of a difference it makes.
>
>
> Also, I wouldn't lower the osd_memory_target if you can afford the RAM -
> you only have 6 OSDs per server, and with a memory target of 32 GB that's
> 192 GB of RAM, so if you have at least 256 GB in your servers I would
> leave it. It won't help with writes, but it should help with reducing
> read IOPS - you probably don't want to make your existing problems even
> bigger by chucking more read IO onto the system due to smaller
> in-memory buffers.
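> If you want to double-check what an OSD actually runs with and where that
> memory goes, the admin socket shows it, e.g.:
>
>     ceph daemon osd.0 config get osd_memory_target   # effective setting for this OSD
>     ceph daemon osd.0 dump_mempools                  # memory breakdown incl. the bluestore caches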
>
>
> On Thu, 30 Sept 2021 at 21:02, Szabo, Istvan (Agoda)
>
> <Istvan.Szabo@xxxxxxxxx> wrote:
>
>
> Hi Christian,
>
>
> Yes, I know very well what spillover is; I have read that GitHub leveled-compaction document multiple times every day over the last couple of days. (Answers to your questions are after the cluster background information.)
>
>
> About the cluster:
>
> - users are doing continuously put/head/delete operations
>
> - cluster iops: 10-50k read, 5000 write iops
>
> - throughput: 142MiB/s  write and 662 MiB/s read
>
> - Not containerized deployment, 3 clusters in multisite
>
> - 3x mon/mgr/rgw (5 RGWs on each mon node, altogether 15 behind the haproxy VIP)
>
>
> 7 nodes and in each node the following config:
>
> - 1x 1.92TB nvme for index pool
>
> - 6x 15.3 TB osd SAS SSD (hpe VO015360JWZJN read intensive ssd, SKU P19911-B21 in this document: https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)
>
> - 2x 1.92TB nvme  block.db for the 6 ssd (model: HPE KCD6XLUL1T92 SKU: P20131-B21 in this document https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)
>
> - osd deployed with dmcrypt
>
> - data pool is on ec 4:2 other pools are on the ssds with 3 replica
>
>
> Config file that we have on all nodes (the mon nodes additionally carry the rgw definitions):
>
> [global]
> cluster network = 192.168.199.0/24
> fsid = 5a07ec50-4eee-4336-aa11-46ca76edcc24
> mon host = [v2:10.118.199.1:3300,v1:10.118.199.1:6789],[v2:10.118.199.2:3300,v1:10.118.199.2:6789],[v2:10.118.199.3:3300,v1:10.118.199.3:6789]
> mon initial members = mon-2s01,mon-2s02,mon-2s03
> osd pool default crush rule = -1
> public network = 10.118.199.0/24
> rgw_relaxed_s3_bucket_names = true
> rgw_dynamic_resharding = false
> rgw_enable_apis = s3, s3website, swift, swift_auth, admin, sts, iam, pubsub, notifications
> #rgw_bucket_default_quota_max_objects = 1126400
>
> [mon]
> mon_allow_pool_delete = true
> mon_pg_warn_max_object_skew = 0
> mon_osd_nearfull_ratio = 70
>
> [osd]
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 1
> osd_memory_target = 31490694621
> # the settings below have been added due to OSD reboots, to survive the suicide timeout
> osd_scrub_during_recovery = true
> osd_op_thread_suicide_timeout=3000
> osd_op_thread_timeout=120
>
>
> The stability issues I mean:
>
> - The PG increase is still in progress; it hasn't finished going from 32 to 128 on the erasure-coded data pool. It is at 103 currently, and the degraded objects always get stuck when it is almost finished, but then an OSD dies and the recovery process starts again.
>
> - Compaction is happening all the time, so all the NVMe drives are generating iowait continuously because they are 100% utilized (iowait is around 1-3). If I try to compact with ceph tell osd.x compact, it is hopeless - it never finishes, I can only stop it with Ctrl+C.
>
> - At the beginning, when we didn't have so many spilled-over disks, I didn't mind it; actually I was happy about the spillover, because the underlying SSD can take some load off the NVMe. But then the OSDs started to reboot and, I'd say, collapse one by one. When I monitored which OSDs were collapsing, it was always one that had spilled over. The op-thread and suicide timeouts can keep the OSDs up a bit longer.
>
> - Now ALL the RGWs start to die once one specific OSD goes down, and this causes a total outage. There isn't anything about it in the logs - neither in messages nor in the RGW log, the connections just seem to time out. This is unacceptable from the users' perspective: they need to wait 1.5 hours until my manual compaction (sketched below) finishes and I can start the OSD again.
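> (The manual compaction I mean is roughly the following, done with the OSD stopped; the path is the usual /var/lib/ceph location and may differ with dmcrypt/ceph-volume:)
>
>     systemctl stop ceph-osd@27
>     ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-27 compact
>     systemctl start ceph-osd@27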
>
>
> Current cluster state ceph -s:
>
>    health: HEALTH_ERR
>            12 OSD(s) experiencing BlueFS spillover
>            4/1055038256 objects unfound (0.000%)
>            noout flag(s) set
>            Possible data damage: 2 pgs recovery_unfound
>            Degraded data redundancy: 12341016/6328900227 objects degraded (0.195%), 16 pgs degraded, 21 pgs undersized
>            4 pgs not deep-scrubbed in time
>
>  services:
>    mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)
>    mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02
>    osd: 49 osds: 49 up (since 101m), 49 in (since 4d); 23 remapped pgs
>         flags noout
>    rgw: 15 daemons active (mon-2s01.rgw0, mon-2s01.rgw1, mon-2s01.rgw2, mon-2s01.rgw3, mon-2s01.rgw4, mon-2s02.rgw0, mon-2s02.rgw1, mon-2s02.rgw2, mon-2s02.rgw3, mon-2s02.rgw4, mon-2s03.rgw0, mon-2s03.rgw1, mon-2s03.rgw2, mon-2s03.rgw3, mon-2s03.rgw4)
>
>  task status:
>
>  data:
>    pools:   9 pools, 425 pgs
>    objects: 1.06G objects, 67 TiB
>    usage:   159 TiB used, 465 TiB / 623 TiB avail
>    pgs:     12032346/6328762449 objects degraded (0.190%)
>             68127707/6328762449 objects misplaced (1.076%)
>             4/1055015441 objects unfound (0.000%)
>             397 active+clean
>             13  active+undersized+degraded+remapped+backfill_wait
>             4   active+undersized+remapped+backfill_wait
>             4   active+clean+scrubbing+deep
>             2   active+recovery_unfound+undersized+degraded+remapped
>             2   active+remapped+backfill_wait
>             1   active+clean+scrubbing
>             1   active+undersized+remapped+backfilling
>             1   active+undersized+degraded+remapped+backfilling
>
>  io:
>    client:   256 MiB/s rd, 94 MiB/s wr, 17.70k op/s rd, 2.75k op/s wr
>    recovery: 16 MiB/s, 223 objects/s
>
> Ty
>
>
> -----Original Message-----
>
> From: Christian Wuerdig <christian.wuerdig@xxxxxxxxx>
>
> Sent: Thursday, September 30, 2021 1:01 PM
>
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>
> Cc: Ceph Users <ceph-users@xxxxxxx>
>
> Subject: Re:  osd_memory_target=level0 ?
>
>
>
>
> Bluestore memory targets have nothing to do with spillover. It's already been said several times: The spillover warning is simply telling you that instead of writing data to your supposedly fast wal/blockdb device it's now hitting your slow device.
>
>
> You've stated previously that your fast device is nvme and your slow device is SSD. So the spill-over is probably less of a problem than you think. It's currently unclear what your actual problem is and why you think it's to do with spill-over.
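> If you want to quantify it, the bluefs perf counters show how much of the DB actually sits on the slow device, e.g.:
>
>     ceph daemon osd.27 perf dump bluefs | grep -E '(db|slow)_(total|used)_bytes'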
>
>
> What model are your NVMEs and SSDs - what IOPS can each sustain (4k random write direct IO), what's their current load? What are the actual problems that you are observing, i.e. what does "stability problems" actually mean?
>
>
> On Thu, 30 Sept 2021 at 18:33, Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:
>
>
> Hi,
>
>
> Still suffering from spilled-over disks and stability issues in 3 of my clusters after uploading 6-900 million objects to the cluster (Octopus 15.2.10).
>
>
> I've set the memory target to around 31-32 GB, so could it be that the spillover issue is coming from that?
> So with a mem target of 31 GB the next level would be 310 GB, and after that it goes to the underlying SSD disk. So the 4th level doesn't have space on the NVMe.
>
> Let's say I set it to the default 4 GB; then levels 0-3 would be 444 GB, so it should fit on the 600 GB LVM assigned on the NVMe for DB with WAL.
>
> This is how it looks; e.g. osd.27 is still spilled over even after two manual compactions :(
>
>
>     osd.1 spilled over 198 GiB metadata from 'db' device (303 GiB used of 596 GiB) to slow device
>     osd.5 spilled over 251 GiB metadata from 'db' device (163 GiB used of 596 GiB) to slow device
>     osd.8 spilled over 61 GiB metadata from 'db' device (264 GiB used of 596 GiB) to slow device
>     osd.11 spilled over 260 GiB metadata from 'db' device (242 GiB used of 596 GiB) to slow device
>     osd.12 spilled over 149 GiB metadata from 'db' device (238 GiB used of 596 GiB) to slow device
>     osd.15 spilled over 259 GiB metadata from 'db' device (195 GiB used of 596 GiB) to slow device
>     osd.17 spilled over 10 GiB metadata from 'db' device (314 GiB used of 596 GiB) to slow device
>     osd.21 spilled over 324 MiB metadata from 'db' device (346 GiB used of 596 GiB) to slow device
>     osd.27 spilled over 12 GiB metadata from 'db' device (486 GiB used of 596 GiB) to slow device
>     osd.29 spilled over 61 GiB metadata from 'db' device (306 GiB used of 596 GiB) to slow device
>     osd.31 spilled over 59 GiB metadata from 'db' device (308 GiB used of 596 GiB) to slow device
>     osd.46 spilled over 69 GiB metadata from 'db' device (308 GiB used of 596 GiB) to slow device
>
>
> Also, is there a way to speed up compaction? It takes 1-1.5 hours per OSD to compact.
>
>
> Thank you
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



