Re: Slow requests from bluestore osds

Marc Schöchlin <ms@xxxxxxxxxx> · Mon, 13 May 2019 06:59:18 +0200

Hello cephers,

one week ago we replaced the bluestore cache size by "osd memory target" and removed the detail memory settings.
This storage class now runs 42*8GB spinners with a permanent write workload of 2000-3000 write IOPS, and 1200-8000 read IOPS.

Out new setup is now:
(12.2.10 on Ubuntu 16.04)

[osd]
osd deep scrub interval = 2592000
osd scrub begin hour = 19
osd scrub end hour = 6
osd scrub load threshold = 6
osd scrub sleep = 0.3
osd snap trim sleep = 0.4
pg max concurrent snap trims = 1

[osd.51]
osd memory target = 8589934592
...

After that (restarting the entire cluster with these settings) we were very happy to not seeany slow request for 7 days.

Unfortunately this night the slow requests returned on one osd without any known change of the workload of the last 14 days
(according to our detailed monitoring)

2019-05-12 22:00:00.000117 mon.ceph-mon-s43 [INF] overall HEALTH_OK
2019-05-12 23:00:00.000130 mon.ceph-mon-s43 [INF] overall HEALTH_OK
2019-05-13 00:00:00.000129 mon.ceph-mon-s43 [INF] overall HEALTH_OK
2019-05-13 00:00:44.069793 mon.ceph-mon-s43 [WRN] Health check failed: 416 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:00:50.151190 mon.ceph-mon-s43 [WRN] Health check update: 439 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:00:59.750398 mon.ceph-mon-s43 [WRN] Health check update: 452 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:04.750697 mon.ceph-mon-s43 [WRN] Health check update: 283 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:10.419801 mon.ceph-mon-s43 [WRN] Health check update: 230 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:19.751516 mon.ceph-mon-s43 [WRN] Health check update: 362 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:24.751822 mon.ceph-mon-s43 [WRN] Health check update: 324 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:30.675160 mon.ceph-mon-s43 [WRN] Health check update: 341 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:38.759012 mon.ceph-mon-s43 [WRN] Health check update: 390 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:44.858392 mon.ceph-mon-s43 [WRN] Health check update: 366 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:54.753388 mon.ceph-mon-s43 [WRN] Health check update: 352 slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
2019-05-13 00:01:59.045220 mon.ceph-mon-s43 [INF] Health check cleared: REQUEST_SLOW (was: 168 slow requests are blocked > 32 sec. Implicated osds 51)
2019-05-13 00:01:59.045257 mon.ceph-mon-s43 [INF] Cluster is now healthy
2019-05-13 01:00:00.000114 mon.ceph-mon-s43 [INF] overall HEALTH_OK
2019-05-13 02:00:00.000130 mon.ceph-mon-s43 [INF] overall HEALTH_OK

The output of a "ceph health detail" loop at the time the problem occurred:

Mon May 13 00:01:27 CEST 2019
HEALTH_WARN 324 slow requests are blocked > 32 sec. Implicated osds 51
REQUEST_SLOW 324 slow requests are blocked > 32 sec. Implicated osds 51
    324 ops are blocked > 32.768 sec
    osd.51 has blocked requests > 32.768 sec

The logfile of the OSD:

2019-05-12 23:57:28.767463 7f38da4e2700  4 rocksdb: (Original Log Time 2019/05/12-23:57:28.767419) [/build/ceph-12.2.10/src/rocksdb/db/db_impl_compaction_flush.cc:132] [default] Level summary: base level 1 max b
ytes base 268435456 files[2 4 21 122 0 0 0] max score 0.94

2019-05-12 23:57:28.767511 7f38da4e2700  4 rocksdb: [/build/ceph-12.2.10/src/rocksdb/db/db_impl_files.cc:388] [JOB 2991] Try to delete WAL files size 256700142, prev total WAL file size 257271487, number of live
 WAL files 2.

2019-05-12 23:58:07.816376 7f38ddce9700  0 log_channel(cluster) log [DBG] : 34.ac scrub ok
2019-05-12 23:59:54.070025 7f38de4ea700  0 log_channel(cluster) log [DBG] : 34.236 scrub starts
2019-05-13 00:02:21.818689 7f38de4ea700  0 log_channel(cluster) log [DBG] : 34.236 scrub ok
2019-05-13 00:04:37.613094 7f38ead03700  4 rocksdb: [/build/ceph-12.2.10/src/rocksdb/db/db_impl_write.cc:684] reusing log 422507 from recycle list

2019-05-13 00:04:37.613186 7f38ead03700  4 rocksdb: [/build/ceph-12.2.10/src/rocksdb/db/db_impl_write.cc:725] [default] New memtable created with log file: #422511. Immutable memtables: 0.

Any hints how to find more details about the origin of this problem?
How can we solve that?

Regards
Marc

Am 28.01.19 um 22:27 schrieb Marc Schöchlin:
> Hello cephers,
>
> as described - we also have the slow requests in our setup.
>
> We recently updated from ceph 12.2.4 to 12.2.10, updated Ubuntu 16.04 to the latest patchlevel (with kernel 4.15.0-43) and applied dell firmware 2.8.0.
>
> On 12.2.5 (before updating the cluster) we had in a frequency of 10min to 30minutes in the entire deepscrub-window between 8:00 PM and 6:00 AM.
> Especially between 04:00AM and 06:00 AM when when we sequentially create a rbd snapshots for every rbd image and delete a outdated snapshot (we hold 3 snapshots per rbd device).
>
> After the upgrade to 12.2.10 (and the other patches) slow requests seems to be reduced, but they still occur after the snapshot creation/deletion procedure.
> Today we changed the time of the creation/deletion procedure from 4:00 AM to 7:30PM and we experienced slow request right in the the snapshot process at 8:00PM.
>
> The slow requests only happen on a certain storage class osds (30 * 8GB spinners)  - i.e ssd osds do not have this problem on the same cluster
> The pools which use this storage class are loaded by 80% write requests.
>
> Our configuration looks like this:
> ---
> bluestore cache kv max = 2147483648
> bluestore cache kv ratio = 0.9
> bluestore cache meta ratio = 0.1
> bluestore cache size hdd = 10737418240
> osd deep scrub interval = 2592000
> osd scrub begin hour = 19
> osd scrub end hour = 6
> osd scrub load threshold = 4
> osd scrub sleep = 0.3
> osd max trimming pgs = 2
> ---
> We do not have so much devices in this storage class (a enhancement is in progress to get more iops)
>
> What can i do to decrease the impact of snaptrims to prevent slow requests?
> (i.e. reduce "osd max trimming pgs" to "1")
>
> Regards
> Marc Schöchlin
>
> Am 03.09.18 um 10:13 schrieb Marc Schöchlin:
>> Hi,
>>
>> we are also experiencing this type of behavior for some weeks on our not
>> so performance critical hdd pools.
>> We haven't spent so much time on this problem, because there are
>> currently more important tasks - but here are a few details:
>>
>> Running the following loop results in the following output:
>>
>> while true; do ceph health|grep -q HEALTH_OK || (date;  ceph health
>> detail); sleep 2; done
>>
>> Sun Sep  2 20:59:47 CEST 2018
>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>>     4 ops are blocked > 32.768 sec
>>     osd.43 has blocked requests > 32.768 sec
>> Sun Sep  2 20:59:50 CEST 2018
>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>>     4 ops are blocked > 32.768 sec
>>     osd.43 has blocked requests > 32.768 sec
>> Sun Sep  2 20:59:52 CEST 2018
>> HEALTH_OK
>> Sun Sep  2 21:00:28 CEST 2018
>> HEALTH_WARN 1 slow requests are blocked > 32 sec
>> REQUEST_SLOW 1 slow requests are blocked > 32 sec
>>     1 ops are blocked > 32.768 sec
>>     osd.41 has blocked requests > 32.768 sec
>> Sun Sep  2 21:00:31 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>     7 ops are blocked > 32.768 sec
>>     osds 35,41 have blocked requests > 32.768 sec
>> Sun Sep  2 21:00:33 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>     7 ops are blocked > 32.768 sec
>>     osds 35,51 have blocked requests > 32.768 sec
>> Sun Sep  2 21:00:35 CEST 2018
>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>     7 ops are blocked > 32.768 sec
>>     osds 35,51 have blocked requests > 32.768 sec
>>
>> Our details:
>>
>>   * system details:
>>     * Ubuntu 16.04
>>      * Kernel 4.13.0-39
>>      * 30 * 8 TB Disk (SEAGATE/ST8000NM0075)
>>      * 3* Dell Power Edge R730xd (Firmware 2.50.50.50)
>>        * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>>        * 2*10GBITS SFP+ Network Adapters
>>        * 192GB RAM
>>      * Pools are using replication factor 3, 2MB object size,
>>        85% write load, 1700 write IOPS/sec
>>        (ops mainly between 4k and 16k size), 300 read IOPS/sec
>>   * we have the impression that this appears on deepscrub/scrub activity.
>>   * Ceph 12.2.5, we alread played with the osd settings OSD Settings
>>     (our assumtion was that the problem is related to rocksdb compaction)
>>     bluestore cache kv max = 2147483648
>>     bluestore cache kv ratio = 0.9
>>     bluestore cache meta ratio = 0.1
>>     bluestore cache size hdd = 10737418240
>>   * this type problem only appears on hdd/bluestore osds, ssd/bluestore
>>     osds did never experienced that problem
>>   * the system is healthy, no swapping, no high load, no errors in dmesg
>>
>> I attached a log excerpt of osd.35 - probably this is useful for
>> investigating the problem is someone owns deeper bluestore knowledge.
>> (slow requests appeared on Sun Sep  2 21:00:35)
>>
>> Regards
>> Marc
>>
>>
>> Am 02.09.2018 um 15:50 schrieb Brett Chancellor:
>>> The warnings look like this. 
>>>
>>> 6 ops are blocked > 32.768 sec on osd.219
>>> 1 osds have slow requests
>>>
>>> On Sun, Sep 2, 2018, 8:45 AM Alfredo Deza <adeza@xxxxxxxxxx
>>> <mailto:adeza@xxxxxxxxxx>> wrote:
>>>
>>>     On Sat, Sep 1, 2018 at 12:45 PM, Brett Chancellor
>>>     <bchancellor@xxxxxxxxxxxxxx <mailto:bchancellor@xxxxxxxxxxxxxx>>
>>>     wrote:
>>>     > Hi Cephers,
>>>     >   I am in the process of upgrading a cluster from Filestore to
>>>     bluestore,
>>>     > but I'm concerned about frequent warnings popping up against the new
>>>     > bluestore devices. I'm frequently seeing messages like this,
>>>     although the
>>>     > specific osd changes, it's always one of the few hosts I've
>>>     converted to
>>>     > bluestore.
>>>     >
>>>     > 6 ops are blocked > 32.768 sec on osd.219
>>>     > 1 osds have slow requests
>>>     >
>>>     > I'm running 12.2.4, have any of you seen similar issues? It
>>>     seems as though
>>>     > these messages pop up more frequently when one of the bluestore
>>>     pgs is
>>>     > involved in a scrub.  I'll include my bluestore creation process
>>>     below, in
>>>     > case that might cause an issue. (sdb, sdc, sdd are SATA, sde and
>>>     sdf are
>>>     > SSD)
>>>
>>>     Would be useful to include what those warnings say. The ceph-volume
>>>     commands look OK to me
>>>
>>>     >
>>>     >
>>>     > ## Process used to create osds
>>>     > sudo ceph-disk zap /dev/sdb /dev/sdc /dev/sdd /dev/sdd /dev/sde
>>>     /dev/sdf
>>>     > sudo ceph-volume lvm zap /dev/sdb
>>>     > sudo ceph-volume lvm zap /dev/sdc
>>>     > sudo ceph-volume lvm zap /dev/sdd
>>>     > sudo ceph-volume lvm zap /dev/sde
>>>     > sudo ceph-volume lvm zap /dev/sdf
>>>     > sudo sgdisk -n 0:2048:+133GiB -t 0:FFFF -c 1:"ceph block.db sdb"
>>>     /dev/sdf
>>>     > sudo sgdisk -n 0:0:+133GiB -t 0:FFFF -c 2:"ceph block.db sdc"
>>>     /dev/sdf
>>>     > sudo sgdisk -n 0:0:+133GiB -t 0:FFFF -c 3:"ceph block.db sdd"
>>>     /dev/sdf
>>>     > sudo sgdisk -n 0:0:+133GiB -t 0:FFFF -c 4:"ceph block.db sde"
>>>     /dev/sdf
>>>     > sudo ceph-volume lvm create --bluestore --crush-device-class hdd
>>>     --data
>>>     > /dev/sdb --block.db /dev/sdf1
>>>     > sudo ceph-volume lvm create --bluestore --crush-device-class hdd
>>>     --data
>>>     > /dev/sdc --block.db /dev/sdf2
>>>     > sudo ceph-volume lvm create --bluestore --crush-device-class hdd
>>>     --data
>>>     > /dev/sdd --block.db /dev/sdf3
>>>     >
>>>     >
>>>     > _______________________________________________
>>>     > ceph-users mailing list
>>>     > ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>>>     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>     >
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com