Re: Blocked requests problem

Ramazan Terzi <ramazanterzi@xxxxxxxxx> · Tue, 22 Aug 2017 19:58:33 +0300

Hi Ranjan,

Thanks for your reply. I have set the noscrub and nodeep-scrub flags, but the scrub that was already running still does not finish. It is always stuck on the same PG (20.1e).

$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat	objects	mip	degr	misp	unf	bytes	log	disklog	state	state_stamp	v	reported	up	up_primary	acting	acting_primary	last_scrub	scrub_stamp	last_deep_scrub	deep_scrub_stamp
20.1e	25189	0	0	0	0	98359116362	3048	3048	active+clean+scrubbing	2017-08-21 04:55:13.354379	6930'23966663	6930:20949058	[29,31,3]	29	[29,31,3]	29	6712'22950171	2017-08-20 04:46:59.208792	6712'22950171	2017-08-20 04:46:59.208792

$ ceph -s
    cluster ****
     health HEALTH_WARN
            33 requests are blocked > 32 sec
            noscrub,nodeep-scrub flag(s) set
     monmap e9: 3 mons at {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0}
            election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
     osdmap e6930: 36 osds: 36 up, 36 in
            flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
      pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects
            70497 GB used, 127 TB / 196 TB avail
                1407 active+clean
                   1 active+clean+scrubbing
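
To keep an eye on this without reading the wide table by hand, here is a small sketch that pulls out only the scrubbing PGs. The function name scrubbing_pgs is made up for illustration, and the column positions assume the Jewel-era plain layout shown above (state in column 10, state_stamp split into date and time, acting set after that). Other releases may order the columns differently.

```shell
# Sketch only: assumes the plain `ceph pg dump` column layout shown above.
# Reads the dump on stdin and prints, for every PG whose state contains
# "scrubbing": pgid, state, state_stamp (date and time), and the acting set.
scrubbing_pgs() {
    # With whitespace splitting, state_stamp occupies fields 11 and 12,
    # which pushes the acting set to field 17.
    awk '$10 ~ /scrubbing/ { print $1, $10, $11, $12, $17 }'
}

# Usage on a node with an admin keyring (not executed here):
#   ceph pg dump 2>/dev/null | scrubbing_pgs
```

If the state_stamp printed for a PG stays the same across many hours, as it does for 20.1e here, that PG has not left the scrubbing state in that time.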

Thanks,
Ramazan

> On 22 Aug 2017, at 18:52, Ranjan Ghosh <ghosh@xxxxxx> wrote:
> 
> Hi Ramazan,
> 
> I'm no Ceph expert, but what I can say from my experience using Ceph is:
> 
> 1) During "Scrubbing", Ceph can be extremely slow. This is probably where your "blocked requests" are coming from. BTW: Perhaps you can even find out which processes are currently blocked on I/O with: ps aux | grep "D" (look for "D", uninterruptible sleep, in the STAT column). You might even want to kill some of those and/or shut down services in order to relieve some stress on the machine until it recovers.
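
A plain grep "D" also matches every line that merely contains a capital D somewhere (including the header), so a slightly stricter filter may be easier to read. This is only a sketch: the function name d_state_procs is hypothetical, and it assumes a procps-style ps that accepts -eo pid,stat,comm.

```shell
# Sketch only: reads "PID STAT COMMAND" rows on stdin (first row is the
# header) and prints the pid and command of every process whose state
# starts with D (uninterruptible sleep, usually waiting on I/O).
d_state_procs() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Usage (not executed here):
#   ps -eo pid,stat,comm | d_state_procs
```

Processes in D state generally cannot be killed until the I/O they are waiting on returns, so the list is more useful for seeing what is stuck than for choosing what to kill.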
> 
> 2) I usually have the following in my ceph.conf. This restricts scrubbing to between midnight and 6 AM (hopefully the time of least demand; adjust as necessary; osd_scrub_begin_hour defaults to 0) and runs it at the lowest I/O priority.
> 
> #Reduce impact of scrub.
> osd_disk_thread_ioprio_priority = 7
> osd_disk_thread_ioprio_class = "idle"
> osd_scrub_end_hour = 6
> 
> 3) The scrub begin and end hours will always work. The low-priority mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current scheduler like this (replace sda with your device; the active one is shown in square brackets):
> 
> cat /sys/block/sda/queue/scheduler
> 
> You can also echo to this file to set a different scheduler.
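
For checking many disks at once, the active scheduler can be picked out of that file's content, since the kernel marks it with square brackets (for example "noop deadline [cfq]"). A minimal sketch follows; the function name active_scheduler is made up for illustration.

```shell
# Sketch only: reads the content of /sys/block/<dev>/queue/scheduler on
# stdin and prints the scheduler name found inside the square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage (not executed here; sda is just an example device):
#   active_scheduler < /sys/block/sda/queue/scheduler
# Switching scheduler, as root, by echoing to the same file:
#   echo cfq > /sys/block/sda/queue/scheduler
```

A change made by echoing to the file does not survive a reboot, so it would have to be made persistent separately (udev rule or kernel command line, depending on the distribution).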
> 
> 
> With these settings you can perhaps alleviate the problem to the extent that scrubbing is spread over many nights until it has finished. Again, AFAIK, it doesn't have to finish in one night. It will continue the next night, and so on.
> 
> The Ceph experts say scrubbing is important. Don't know why, but I just believe them. They've built this complex stuff after all :-)
> 
> Thus, you can use "noscrub"/"nodeep-scrub" to quickly get a hung server back to work, but you should not leave these flags set forever and a day.
> 
> Hope this helps at least a bit.
> 
> BR,
> 
> Ranjan
> 
> 
> On 22.08.2017 at 15:20, Ramazan Terzi wrote:
>> Hello,
>> 
>> I have a Ceph cluster with the following specifications:
>> 3 x monitor nodes
>> 6 x storage nodes (6 disks per storage node, 6TB SATA disks, all disks have SSD journals)
>> Separate public and private networks. All NICs are 10Gbit/s
>> osd pool default size = 3
>> osd pool default min size = 2
>> 
>> Ceph version is Jewel 10.2.6.
>> 
>> My cluster is in production, with a lot of virtual machines running on it (Linux and Windows VMs, database clusters, web servers, etc.).
>> 
>> During normal use, the cluster slowly went into a state of blocked requests. The number of blocked requests keeps increasing. All OSDs seem healthy. Benchmarks, iowait checks, and network tests all pass.
>> 
>> Yesterday, 08:00:
>> $ ceph health detail
>> HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
>> 1 ops are blocked > 134218 sec on osd.31
>> 1 ops are blocked > 134218 sec on osd.3
>> 1 ops are blocked > 8388.61 sec on osd.29
>> 3 osds have slow requests
>> 
>> Today, 16:05:
>> $ ceph health detail
>> HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
>> 1 ops are blocked > 134218 sec on osd.31
>> 1 ops are blocked > 134218 sec on osd.3
>> 16 ops are blocked > 134218 sec on osd.29
>> 11 ops are blocked > 67108.9 sec on osd.29
>> 2 ops are blocked > 16777.2 sec on osd.29
>> 1 ops are blocked > 8388.61 sec on osd.29
>> 3 osds have slow requests
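
To see at a glance which OSD is accumulating the blocked ops, the per-OSD lines of that output can be summed. A small sketch, assuming the "N ops are blocked > T sec on osd.X" line format shown above; the function name blocked_per_osd is hypothetical.

```shell
# Sketch only: reads `ceph health detail` output on stdin and prints the
# total number of blocked ops per OSD, one "osd.X N" line per OSD.
blocked_per_osd() {
    # $1 is the op count and the last field is the OSD name; the summary
    # lines ("... requests are blocked ...") do not match the pattern.
    awk '/ops are blocked/ { n[$NF] += $1 }
         END { for (o in n) print o, n[o] }' | LC_ALL=C sort
}

# Usage (not executed here):
#   ceph health detail | blocked_per_osd
```

Applied to the 16:05 output above, this gives 30 for osd.29 and 1 each for osd.3 and osd.31, which matches the 32 blocked requests in total and shows that nearly all of them sit on osd.29, the primary of pg 20.1e.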
>> 
>> $ ceph pg dump | grep scrub
>> dumped all in format plain
>> pg_stat	objects	mip	degr	misp	unf	bytes	log	disklog	state	state_stamp	v	reported	up	up_primary	acting	acting_primary	last_scrub	scrub_stamp	last_deep_scrub	deep_scrub_stamp
>> 20.1e	25183	0	0	0	0	98332537930	3066	3066	active+clean+scrubbing	2017-08-21 04:55:13.354379	6930'23908781	6930:20905696	[29,31,3]	29	[29,31,3]	29	6712'22950171	2017-08-20 04:46:59.208792	6712'22950171	2017-08-20 04:46:59.208792
>> 
>> The active scrub does not finish (it has been running for about 24 hours). I have not restarted any OSD in the meantime.
>> I'm thinking of setting the noscrub, nodeep-scrub, norebalance, nobackfill, and norecover flags and restarting OSDs 3, 29, and 31. Would this solve my problem? Or does anyone have another suggestion?
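
Written out as commands, the plan would look roughly like the sketch below. It only prints the commands so they can be reviewed first; nothing is run. The function name plan_osd_restart is made up, the systemd unit name ceph-osd@<id> is an assumption about how the OSDs are managed on these hosts, and nothing in this thread confirms that the sequence fixes the stuck scrub.

```shell
# Sketch only: prints (does not run) one possible sequence for the plan
# above. Arguments are the OSD ids to restart.
plan_osd_restart() {
    flags="noscrub nodeep-scrub norebalance nobackfill norecover"
    for flag in $flags; do
        echo "ceph osd set $flag"
    done
    for id in "$@"; do
        # Assumes systemd-managed OSDs; run each restart on the host
        # that carries that OSD.
        echo "systemctl restart ceph-osd@$id"
    done
    for flag in $flags; do
        echo "ceph osd unset $flag"
    done
}

# Usage: plan_osd_restart 3 29 31
```

In practice it is probably safer to restart one OSD at a time and wait for all PGs to return to active+clean before the next, and to leave out the unset lines for any flag that should stay set for a while.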
>> 
>> Thanks,
>> Ramazan
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
