Re: ceph fs crashes on simple fio test

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Thu, 5 Sep 2019 08:23:07 -0700

On Tue, Sep 3, 2019 at 11:33 AM Frank Schilder <frans@xxxxxx> wrote:
Hi Robert and Paul,

sad news. I did a 5 seconds single thread test after setting osd_op_queue_cut_off=high on all OSDs and MDSs. Here the current settings:

[root@ceph-01 ~]# ceph config show osd.0

NAME                                    VALUE              SOURCE   OVERRIDES IGNORES 

bluestore_compression_min_blob_size_hdd 262144             file                       

bluestore_compression_mode              aggressive         file                       

cluster_addr                            192.168.16.68:0/0  override                   

cluster_network                         192.168.16.0/20    file                       

crush_location                          host=c-04-A        file                       

daemonize                               false              override                   

err_to_syslog                           true               file                       

keyring                                 $osd_data/keyring  default                    

leveldb_log                                                default                    

mgr_initial_modules                     balancer dashboard file                       

mon_allow_pool_delete                   false              file                       

mon_pool_quota_crit_threshold           90                 file                       

mon_pool_quota_warn_threshold           70                 file                       

osd_journal_size                        4096               file                       

osd_max_backfills                       3                  mon                        

osd_op_queue_cut_off                    high               mon                        

osd_pool_default_flag_nodelete          true               file                       

osd_recovery_max_active                 8                  mon                        

osd_recovery_sleep                      0.050000           mon                        

public_addr                             192.168.32.68:0/0  override                   

public_network                          192.168.32.0/19    file                       

rbd_default_features                    61                 default                    

setgroup                                disk               cmdline                    

setuser                                 ceph               cmdline                    

[root@ceph-01 ~]# ceph config get osd.0 osd_op_queue

wpq

Unfortunately, the problem is not resolved. The fio job script is:

=====================

[global]

name=fio-rand-write

filename_format=fio-$jobname-${HOSTNAME}-$jobnum-$filenum

rw=randwrite

bs=4K

numjobs=1

time_based=1

runtime=5

[file1]

size=100G

ioengine=sync

=====================

That's a random write test on a 100G file with write size 4K. Note that fio uses "direct=0" by default. Using "direct=1" is absolutely fine.

Running this short burst of load, I already get the cluster unhealthy:

cluster log:

2019-09-03 20:00:00.000160 [INF]  overall HEALTH_OK

2019-09-03 20:08:36.450527 [WRN]  Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)

2019-09-03 20:08:59.867124 [INF]  MDS health message cleared (mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 49 secs

2019-09-03 20:09:00.373050 [INF]  Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)

2019-09-03 20:09:00.373094 [INF]  Cluster is now healthy

/var/log/messages: loads of these (all OSDs!)

Sep  3 20:08:39 ceph-09 journal: 2019-09-03 20:08:39.269 7f6a3d63c700 -1 osd.161 10411 get_health_metrics reporting 354 slow ops, oldest is osd_op(client.4497435.0:38244 5.f7s0 5:ef9f1be4:::100010ed9bd.0000390c:head [write 8192~4096,write 32768~4096,write 139264~4096,write 172032~4096,write 270336~4096,write 512000~4096,write 688128~4096,write 876544~4096,write 1048576~4096,write 1257472~4096,write 1425408~4096,write 1445888~4096,write 1503232~4096,write 1552384~4096,write 1716224~4096,write 1765376~4096] snapc 12e=[] ondisk+write+known_if_redirected e10411)

It looks like the MDS is pushing waaaayyy too many requests onto the HDDs instead of throttling the client.

An ordinary user should not have so much power in his hands. This makes it trivial to destroy a ceph cluster.

This very short fio test is probably sufficient to reproduce the issue on any test cluster. Should I open an issue?

Best regards,

Are your metadata pools on SSD, or HDD? Usually for us, as long as the blocked I/O fluctuates and goes up and down, the cluster run fine even with the warnings. Usually on an idle cluster a client will send a bunch of data which fills up the queues, then the HDDs have to work through them, at that point the client realizes that the storage is 'slow' and starts throttling the traffic and will then match the speed at which the HDDs can perform the work. If you run the job for a long time, are you still seeing the trimming errors, or just a steady rise in blocked IO on the OSDs without any drops in count? What about the client, does it have a fairly even distribution of latencies, or does it have a lot that are just really long?
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx