Hi David,

Actually this PG subfolder splitting has not been explored yet; I will have a look. In our setup OSDs are never marked down, probably because of the following settings:

mon_osd_adjust_heartbeat_grace = false
mon_osd_adjust_down_out_interval = false
mon_osd_min_down_reporters = 5
mon_osd_min_down_reports = 10

Thomas
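For reference, those values can be double-checked on a live monitor through its admin socket. This is only a rough sketch; it assumes the default admin-socket path and that the mon id matches the short hostname:

# Confirm the down-reporting settings are active on a running mon
# (assumes the mon id equals the short hostname):
ceph daemon mon.$(hostname -s) config get mon_osd_min_down_reporters
ceph daemon mon.$(hostname -s) config get mon_osd_min_down_reports
ceph daemon mon.$(hostname -s) config get mon_osd_adjust_heartbeat_grace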
From: David Turner [mailto:david.turner@xxxxxxxxxxxxxxxx]

This thread has gotten quite large and I haven't read most of it, so I apologize if this is a duplicate idea/suggestion. Whenever our cluster has blocked requests and we aren't increasing pg_num, adding storage, or losing a disk... it is PG subfolder splitting. 100% of the time, every time, this is our cause of blocked requests. It is often accompanied by drives being marked down by the cluster even though the OSD daemon is still running.
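For anyone wanting to check whether subfolder splitting is the culprit, the relevant filestore thresholds and the PG directory sizes can be inspected like this. A rough sketch only; the OSD id and PG id are placeholders and the path assumes a default filestore layout:

# Thresholds that control when filestore splits a PG's subfolders (filestore only):
ceph daemon osd.0 config get filestore_split_multiple
ceph daemon osd.0 config get filestore_merge_threshold

# Count objects under one PG directory to see how close it is to splitting
# (osd.0 and pg 0.4e are placeholders):
find /var/lib/ceph/osd/ceph-0/current/0.4e_head -type f | wc -l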
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Nick Fisk [nick@xxxxxxxxxx]

Hi Thomas,

I'm afraid I can't offer any more advice; there isn't anything that I can see which could be the trigger. I know we spoke about downgrading the kernel, did you manage to try that?

Nick
From: Thomas Danan [mailto:Thomas.Danan@xxxxxxxxxxxxx]

Hi all,

Still not able to find any explanation for this issue. I recently tested the network and I am seeing some retransmits in the output of iperf, but overall the bandwidth during a 10 s test is around 7 to 8 Gbps. I was not sure whether it was the test itself that was overloading the network or whether my network switches were having an issue. The switches have been checked and show no congestion or other errors. I really don't know what to check or test; any idea is more than welcome ...

Thomas
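For reproducibility, a test along these lines between two OSD hosts shows per-interval retransmits alongside the raw throughput. A sketch only; it assumes iperf3, and the target IP is just an example taken from the cluster-network addresses quoted later in the thread:

# On one OSD host, start an iperf3 server:
iperf3 -s

# From another OSD host, run a 10-second test with per-second intervals;
# the "Retr" column shows TCP retransmits (target IP is only an example):
iperf3 -c 192.168.228.7 -t 10 -i 1

# Kernel-wide retransmission counters on the sending host, before and after the test:
netstat -s | grep -i retrans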
From: Thomas Danan
Hi Nick,

Here are some logs. The system is in the IST TZ and I have filtered the logs to keep only the last 2 hours during which we can observe the issue. In this particular case the issue is illustrated with the following OSDs:

Primary:    ID:607 PID:2962227 HOST:10.137.81.18
Secondary1: ID:528 PID:3721728 HOST:10.137.78.194
Secondary2: ID:771 PID:2806795 HOST:10.137.81.25

In this specific example, the first slow request message is detected at 16:18:

2016-11-18 16:18:51.991185 7f13acd8a700 0 log_channel(cluster) log [WRN] : 7 slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700 0 log_channel(cluster) log [WRN] : slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.00000000002b85a7 [set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for subops from 528,771

I see that it is about replicating a 4MB object with a snapc context, but in my environment I have no snapshots (actually they were all deleted). Also, I was told those messages were not necessarily related to object replication to a snapshot image. Every slow request message I get is formatted as described, with a 4MB object and a snapc context. rados df shows that I have 4 cloned objects; I do not understand why. 15 minutes later the ops seem to be unblocked after an "initiating reconnect" message:

2016-11-18 16:34:38.120918 7f13acd8a700 0 log_channel(cluster) log [WRN] : slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.0000000000111eec [set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700 0 -- 10.137.81.18:6840/2962227 >> 10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700 0 -- 192.168.228.36:6841/2962227 >> 192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs. Thanks for your help ...

Thomas
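While a request is stuck like this, the primary OSD's admin socket can show exactly which stage the op is waiting in. A sketch, assuming the default admin-socket path; osd.607 is the primary from the example above:

# List ops currently blocked on the primary, with per-op event timestamps:
ceph daemon osd.607 dump_ops_in_flight

# Cluster-wide summary of which OSDs currently have blocked requests:
ceph health detail | grep -i blocked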
From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Hi Thomas,

Do you have the OSD logs from around the time of that slow request (the 13:12 to 13:29 period)? Do you also see anything about OSDs going down in the mon ceph.log file around that time? 480 seconds is far too long for a disk to be busy; I'm wondering if the OSD is either dying and respawning, or if you are running out of some type of system resource, e.g. TCP connections or something like that, which means the OSDs can't communicate with each other.
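Both of those checks can be scripted quickly. A rough sketch; the log path assumes a default installation, the cluster-log wording varies by version, and the date filter matches the incident window below:

# Did the cluster log record any OSD failing, booting or being marked down
# during the incident window?
grep -E 'marked down|failed|boot' /var/log/ceph/ceph.log | grep '2016-11-17 13:'

# Rough view of socket and file-descriptor pressure on an OSD host:
ss -s
for pid in $(pidof ceph-osd); do
    echo "ceph-osd pid $pid: $(ls /proc/$pid/fd | wc -l) open fds"
done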
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Thomas Danan

Hi,

I have rechecked the pattern when slow requests are detected. I have an example with the following OSDs (primary: 411, secondaries: 176, 594).

On the primary, slow requests are detected: waiting for subops (176, 594) during 16 minutes!!!!

2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] : 7 slow requests, 7 included below; oldest blocked for > 480.477315 secs
2016-11-17 13:29:27.209777 7f001d414700 0 log_channel(cluster) log [WRN] : slow request 480.477315 seconds old, received at 2016-11-17 13:21:26.732303: osd_op(client.2407558.1:206455044 rbd_data.66ea12ae8944a.00000000001acbbc [set-alloc-hint object_size 4194304 write_size 4194304,write 1257472~368640] 0.61fe279f snapc 3fd=[3fd,3de] ondisk+write e210553) currently waiting for subops from 176,594

So the primary OSD has been waiting for subops since 13:21 (13:29 - 480 seconds).

2016-11-17 13:36:33.039691 7efffd8ee700 0 -- 192.168.228.23:6800/694486 >> 192.168.228.7:6819/3611836 pipe(0x13ffd000 sd=33 :17791 s=2 pgs=131 cs=7 l=0 c=0x13251de0).fault, initiating reconnect
2016-11-17 13:36:39.570692 7efff6784700 0 -- 192.168.228.23:6800/694486 >> 192.168.228.13:6858/2033854 pipe(0x17009000 sd=60 :52188 s=2 pgs=147 cs=7 l=0 c=0x141159c0).fault, initiating reconnect

After these logs the ops seem to be unblocked and I no longer see "currently waiting for subops from 176,594", so the primary OSD was blocked for 15 minutes in total!!!!

On the secondary OSDs I can see the following messages during the same period (but also before and after):

2016-11-17 13:34:58.125076 7fbcc7517700 0 -- 192.168.228.7:6819/3611836 >> 192.168.228.42:6832/2873873 pipe(0x12d2e000 sd=127 :6819 s=2 pgs=86 cs=5 l=0 c=0xef18c00).fault with nothing to send, going to standby

In some other examples, with some DEBUG messages activated, I was also able to see many of the following messages on the secondary OSDs:

2016-11-15 03:53:04.298502 7ff9c434f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7ff9bdb42700' had timed out after 15

Thomas
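Those op-thread timeouts on the secondaries are worth counting and correlating with disk latency while the issue is live. A sketch only; the OSD id and the default log path are placeholders:

# How often is the op worker thread stalling on this secondary? (osd.528 as an example)
grep -c "had timed out after" /var/log/ceph/ceph-osd.528.log

# Correlate with disk latency on the OSD's backing device
# (10 one-second samples; watch the await/util columns):
iostat -x 1 10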
From: Nick Fisk [mailto:nick@xxxxxxxxxx]

Hi,

Yes, I can't think of anything else at this stage. Could you maybe repost some dump_historic_ops dumps now that you have turned off snapshots? I wonder if they might reveal anything.
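For completeness, this is roughly how those dumps can be captured, with a larger history window while debugging. A sketch; the history values are illustrative and osd.411 is the primary from the earlier example:

# Keep more (and longer) op history while debugging; values are illustrative:
ceph tell osd.* injectargs '--osd_op_history_size 200 --osd_op_history_duration 1200'

# Dump the slowest recent ops from the primary:
ceph daemon osd.411 dump_historic_ops > osd.411_historic_ops.json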
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Thomas Danan

Hi Nick,

We have deleted all snapshots and observed the system for several hours. From what I see this did not help to reduce the blocked ops and the IO freezes on the Ceph client side. We have also tried to increase the PGs a little (by 8, then 128), because this is something we should do anyway and we wanted to see how the cluster behaved. During recovery, the number of blocked ops and their duration increased significantly, and the number of impacted OSDs was much higher.

I don't really know what to conclude from all of this ... Again, we have checked disk / network and everything seems fine ...

Thomas
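If further pg_num increases are planned, the recovery/backfill impact can at least be throttled while the cluster rebalances. A sketch using hammer-era option names; the values are conservative examples, not recommendations:

# Limit concurrent backfill/recovery work per OSD while PGs split and re-peer:
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'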
From: Nick Fisk [mailto:nick@xxxxxxxxxx]

Snapshots work by using copy-on-write. If you dirty even a 4 kB section of a 4 MB object in the primary RBD, that entire 4 MB object then needs to be read and then written into the snapshot RBD.
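To put a number on that amplification, a back-of-the-envelope figure, assuming a freshly snapshotted 4 MB object and a 4 kB random write:

# Worst-case COW amplification for the first small write to an object after a snapshot:
# whole-object copy size / client write size (4 MB / 4 kB)
echo $(( 4 * 1024 * 1024 / 4096 ))   # -> 1024x for that first write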
From: Thomas Danan [mailto:Thomas.Danan@xxxxxxxxxxxxx]

Hi Nick,

Actually I was wondering, is there any difference between a snapshot and a simple RBD image? With a simple RBD image, when doing random IO we are asking the Ceph cluster to update one or several 4 MB objects, no? So snapshotting is multiplying the load by 2 but not more, am I wrong?

Thomas
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Thomas Danan

Hi Nick,

Yes, our application is doing small random IO and I did not realize that the snapshotting feature could degrade performance so much in that case. We have just deactivated it and deleted all snapshots. I will let you know if it drastically reduces the blocked ops and consequently the IO freezes on the client side.

Thanks
Thomas

From: Nick Fisk [mailto:nick@xxxxxxxxxx]
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Thomas Danan

Very interesting ... Any idea why optimal tunables would help here? On our cluster we have 500 TB of data; I am a bit concerned about changing them without taking a lot of precautions ... I am curious to know how much time it took you to change the tunables, the size of your cluster, and the observed impact on client IO ...

Yes, we do have daily RBD snapshots from 16 different Ceph RBD clients. Snapshotting the RBD image is quite immediate, while we are seeing the issue continuously during the day ...

Just to point out that when you take a snapshot, any write to the original RBD will mean that the full 4MB object is copied into the snapshot. If you have a lot of small random IO going on the original RBD this can lead to massive write amplification across the cluster and may cause issues such as what you describe. Also be aware that deleting large snapshots puts significant strain on the OSDs as they try to delete hundreds of thousands of objects.

Will check all of this tomorrow ... Thanks again

Thomas

Sent from my Samsung device
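On the tunables question, the current profile can be inspected before committing to a change, and snapshot deletion pressure can be softened a little. A sketch; the sleep value is an illustrative hammer-era knob, not a recommendation:

# Show which CRUSH tunables profile the cluster is currently running
# (switching to 'optimal' can trigger a large amount of data movement):
ceph osd crush show-tunables

# Soften the impact of deleting big snapshots by pacing snap trimming
# (hammer-era option; the value is only illustrative):
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'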
On 11/15/16 14:05, Thomas Danan wrote: