Hi,

yes, I had to change the procedure also:

1. Stop the OSD daemon
2. Mark the OSD out in the CRUSH map

(commands for these two steps are sketched at the very end of this thread)

But as you are writing, that makes PGs degraded. However, it still looks
like a bug to me.

20. 5. 2022 17:25:47 Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>:

> This sounds similar to an inquiry I submitted a couple of years ago [1],
> in which I discovered that the choose_acting function does not consider
> primary affinity when choosing the primary OSD. I had assumed it would
> when developing my procedure for replacing failing disks. After that
> discovery I changed my process: I now stop the failing OSD daemon (which
> leaves the PGs degraded) to ensure it is no longer participating in the
> PG. I am not sure whether any of the relevant code has changed since that
> initial report, but what you describe here seems similar.
>
> [1] https://tracker.ceph.com/issues/44400
>
> Respectfully,
>
> *Wes Dillingham*
> wes@xxxxxxxxxxxxxxxxx
> *LinkedIn [http://www.linkedin.com/in/wesleydillingham]*
>
>
> On Fri, May 20, 2022 at 7:53 AM Denis Polom <denispolom@xxxxxxxxx> wrote:
>> Hi,
>>
>> I have been observing high latencies and mount points hanging while
>> draining an OSD since the Octopus release, and the problem is still
>> present on the latest Pacific.
>>
>> Cluster setup:
>>
>> Ceph Pacific 16.2.7
>>
>> CephFS with an EC data pool
>>
>> EC profile setup:
>>
>> crush-device-class=
>> crush-failure-domain=host
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=10
>> m=2
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>>
>> Description:
>>
>> If we have a broken drive, we remove it from the Ceph cluster by
>> draining it first. That means changing its CRUSH weight to 0:
>>
>> ceph osd crush reweight osd.1 0
>>
>> On Nautilus this did not affect clients. But after the upgrade to
>> Octopus (and from Octopus up to the current Pacific release) I observe
>> very high IO latencies on clients while an OSD is being drained
>> (10 seconds and higher).
>>
>> By debugging I found out that the drained OSD is still listed as
>> ACTING_PRIMARY, and that this happens only on EC pools and only since
>> Octopus. To be sure, I tested it back on Nautilus, where the behavior
>> is correct and the drained OSD is no longer listed in the UP and
>> ACTING sets of the PGs.
>>
>> Even setting the primary-affinity of the given OSD to 0 has no effect
>> on the EC pool.
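>>
>> For reference, the check can be reproduced with something like the
>> following (osd.70 and PG 16.1fff are simply the examples from the dumps
>> below; any affected OSD/PG pair will do):
>>
>>     ceph osd primary-affinity osd.70 0    # affinity 0: osd.70 should no longer be preferred as primary
>>     ceph pg map 16.1fff                   # prints the up and acting sets for the PG
>>     ceph pg dump pgs | grep ^16.1fff      # full row, including UP_PRIMARY and ACTING_PRIMARY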
>>
>> Below are my debug outputs.
>>
>> Buggy behavior on Octopus and Pacific:
>>
>> Before draining osd.70:
>>
>> PG_STAT 16.1fff   OBJECTS 2269   MISSING_ON_PRIMARY 0   DEGRADED 0   MISPLACED 0   UNFOUND 0
>> BYTES 8955297727   OMAP_BYTES* 0   OMAP_KEYS* 0   LOG 2449   DISK_LOG 2449
>> STATE active+clean   STATE_STAMP 2022-05-19T08:41:55.241734+0200
>> VERSION 19403690'275685   REPORTED 19407588:19607199
>> UP [70,206,216,375,307,57]   UP_PRIMARY 70
>> ACTING [70,206,216,375,307,57]   ACTING_PRIMARY 70
>> LAST_SCRUB 19384365'275621   SCRUB_STAMP 2022-05-19T08:41:55.241493+0200
>> LAST_DEEP_SCRUB 19384365'275621   DEEP_SCRUB_STAMP 2022-05-19T08:41:55.241493+0200
>> SNAPTRIMQ_LEN 0
>> dumped pgs
>>
>> After setting the CRUSH weight of osd.70 to 0 (osd.70 is still acting primary):
>>
>> PG_STAT 16.1fff   OBJECTS 2269   MISSING_ON_PRIMARY 0   DEGRADED 0   MISPLACED 2269   UNFOUND 0
>> BYTES 8955297727   OMAP_BYTES* 0   OMAP_KEYS* 0   LOG 2449   DISK_LOG 2449
>> STATE active+remapped+backfill_wait   STATE_STAMP 2022-05-20T08:51:54.249071+0200
>> VERSION 19403690'275685   REPORTED 19407668:19607289
>> UP [71,206,216,375,307,57]   UP_PRIMARY 71
>> ACTING [70,206,216,375,307,57]   ACTING_PRIMARY 70
>> LAST_SCRUB 19384365'275621   SCRUB_STAMP 2022-05-19T08:41:55.241493+0200
>> LAST_DEEP_SCRUB 19384365'275621   DEEP_SCRUB_STAMP 2022-05-19T08:41:55.241493+0200
>> SNAPTRIMQ_LEN 0
>> dumped pgs
>>
>> Correct behavior on Nautilus:
>>
>> Before draining osd.10:
>>
>> PG_STAT 2.4e   OBJECTS 2   MISSING_ON_PRIMARY 0   DEGRADED 0   MISPLACED 0   UNFOUND 0
>> BYTES 8388608   OMAP_BYTES* 0   OMAP_KEYS* 0   LOG 2   DISK_LOG 2
>> STATE active+clean   STATE_STAMP 2022-05-20 02:13:47.432104
>> VERSION 61'2   REPORTED 75:40
>> UP [10,0,7]   UP_PRIMARY 10
>> ACTING [10,0,7]   ACTING_PRIMARY 10
>> LAST_SCRUB 0'0   SCRUB_STAMP 2022-05-20 01:44:36.217286
>> LAST_DEEP_SCRUB 0'0   DEEP_SCRUB_STAMP 2022-05-20 01:44:36.217286
>> SNAPTRIMQ_LEN 0
>>
>> After setting the CRUSH weight of osd.10 to 0 (behavior is correct, osd.10
>> is no longer listed and no longer used):
>>
>> root@nautilus1:~# ceph pg dump pgs | head -2
>>
>> PG_STAT 2.4e   OBJECTS 14   MISSING_ON_PRIMARY 0   DEGRADED 0   MISPLACED 0   UNFOUND 0
>> BYTES 58720256   OMAP_BYTES* 0   OMAP_KEYS* 0   LOG 18   DISK_LOG 18
>> STATE active+clean   STATE_STAMP 2022-05-20 02:18:59.414812
>> VERSION 75'18   REPORTED 80:43
>> UP [22,0,7]   UP_PRIMARY 22
>> ACTING [22,0,7]   ACTING_PRIMARY 22
>> LAST_SCRUB 0'0   SCRUB_STAMP 2022-05-20 01:44:36.217286
>> LAST_DEEP_SCRUB 0'0   DEEP_SCRUB_STAMP 2022-05-20 01:44:36.217286
>> SNAPTRIMQ_LEN 0
>>
>> Now the question is: is this an intended, implemented feature?
>>
>> Or is it a bug?
>>
>> Thank you!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
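
For reference, a command-level sketch of the workaround procedure described
at the top of this thread (stop the failing OSD first, then mark it out),
assuming osd.70 is the failing disk and a systemd/package-based deployment
(unit names differ under cephadm):

    systemctl stop ceph-osd@70   # stop the failing OSD daemon; its PGs become degraded
    ceph osd out osd.70          # mark it out so its data backfills to the remaining OSDs
    # wait for backfill to finish (all PGs active+clean) before removing the OSD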