Re: Strange Data Issue - Unexpected client hang on OSD I/O Error

Good Morning,

I re-ran the verification and it exactly matches the original data that
was backed up (approx. 300GB).  There were no further messages issued on
the client or any OSD originally involved (2,9,18).  I believe the data
to be OK.  The cluster is currently healthy (all PGs 'active+clean').

I would still like to understand the following:

1. What does the original message on the client mean?
(libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping)

2. What actions does the OSD take after reporting '4.35 missing primary copy'?

3. Why did the client read operation hang?  I have never had a client
   hang due to an OSD disk I/O error.

Further, I'm wondering about:

PG '4.35' was automatically deep-scrubbed with no errors on
'2018-12-24 18:45'.  The error 'missing primary copy' occurred at
'2018-12-25 20:26'.  The PG was then manually deep-scrubbed, also with no
errors, at '2018-12-25 21:47'.

In the past, when a read error occurred, the PG went inconsistent and the
admin had to repair it.  Client operations were unaffected, because the
data on the remaining 2 OSDs was available.

In this case, there was data missing and Ceph detected it, but the PG did
not go inconsistent.  Rather, the client operation was impacted, forcing
a hard reboot of the client to recover.

1. Why did the PG not go inconsistent?

2. Did Ceph automatically correct the data error?  I believe so, but
   there was no message in the logs that such a repair had completed.

3. I believe the client operation should have been unaffected by this,
   regardless of what kind of error occurred on the OSD side, since two
   copies of this data existed on other OSDs.  Thus, I would recommend
   that the issue of the client hanging (stuck in 'D+' state) and
   requiring a hard reboot to recover be treated as a bug and
   investigated.  I saw a similar issue (similar client kernel message)
   in the 4.9.x kernels regarding CephFS, but this is RBD.

Thank you,
Dyweni




On 2018-12-25 22:55, Dyweni - Ceph-Users wrote:
Hi again!

Prior to rebooting the client, I found this file (and its contents):

# cat
/sys/kernel/debug/ceph/8abf116d-a710-4245-811d-c08473cb9fb4.client7412370/osdc
REQUESTS 1 homeless 0
1459933	osd2	4.3120c635	[2,18,9]/2	[2,18,9]/2	rbd_data.6b60e8643c9869.000000000000157f	0x400011	1	0'0	read
LINGER REQUESTS
18446462598732840963	osd8	4.1662b47d	[8,18,3]/8	[8,18,3]/8	rbd_header.711beb643c9869	0x24	0	WC/0
18446462598732840961	osd10	4.2a8391bc	[10,18,4]/10	[10,18,4]/10	rbd_header.357247238e1f29	0x24	0	WC/0
18446462598732840962	osd12	4.906668e1	[12,2,18]/12	[12,2,18]/12	rbd_header.6b60e8643c9869	0x24	0	WC/0
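
In case it is useful to anyone else: the same information can be dumped
for every kernel client instance on the box with a quick loop, assuming
debugfs is mounted at /sys/kernel/debug:

# for f in /sys/kernel/debug/ceph/*/osdc; do echo "== $f"; cat "$f"; done

On this client, the only entry under REQUESTS is the stuck read shown
above.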


When I search for '6b60e8643c9869' in the Ceph logs across all OSDs, I
only find the two error lines listed previously.
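
For reference, the search was essentially this, run on each OSD host and
assuming the default OSD log location:

# grep 6b60e8643c9869 /var/log/ceph/ceph-osd.*.log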


Thanks,
Dyweni



On 2018-12-25 22:38, Dyweni - Ceph-Users wrote:
Hi Everyone/Devs,

Would someone please help me troubleshoot a strange data issue
(unexpected client hang on OSD I/O Error)?

On the client, I had a process reading a large amount of data from a
mapped RBD image.  I noticed tonight that it had stalled for a long
period of time (which never happens).  The process is currently in an
'uninterruptible sleep' state ('D+').  When I checked the kernel logs
(dmesg), I found this entry:

libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping
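
For anyone trying to spot the same thing, a rough sketch of how the stuck
process and the related kernel message show up (generic commands, nothing
Ceph-specific assumed beyond the libceph module):

# ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'
# dmesg | grep libceph

The first lists processes in uninterruptible sleep ('D'); the second pulls
out libceph lines such as the one above.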

I checked the kernel log on the referenced OSD (osd2) and noted the
usual kernel output regarding I/O errors on the disk.  These errors
occurred 1 second prior to the message being issued on the client.  This
OSD has a drive that is developing bad sectors.  This is known and
tolerated.  The data sits in a pool with 3 replicas.

Normally, when I/O errors occur, Ceph reports the PG as
active+clean+inconsistent and 'rados list-inconsistent-obj' lists a
'read error' for that OSD.  The clients proceed onward, oblivious to the
issue; I review the 'ceph health detail' and
'rados list-inconsistent-obj' outputs and issue the 'ceph pg repair'
commands.  Everything turns out OK.
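
Concretely, the sequence I am describing is roughly the following (using
this incident's PG id, 4.35, purely for illustration; normally the PG id
comes out of 'ceph health detail'):

# ceph health detail
# rados list-inconsistent-obj 4.35 --format=json-pretty
# ceph pg repair 4.35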

However, tonight I received no such inconsistent messages from Ceph.
When I looked at the Ceph logs on that OSD, I found these, and only
these, lines regarding the corruption:

2018-12-25 20:26:19.665945 b0b540c0 -1 bdev(0x5db2b00
/var/lib/ceph/osd/ceph-2/block) _aio_thread got (5) Input/output error
2018-12-25 20:26:19.676377 a1b250c0 -1 log_channel(cluster) log [ERR]
: 4.35 missing primary copy of
4:ac63048c:::rbd_data.6b60e8643c9869.000000000000157f:head, will try
copies on 9,18

To be proactive, I then issued 'ceph pg deep-scrub 4.35'.  This
completed normally and all PGs still show 'active+clean' (as they were
prior to issuing the deep-scrub command).
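
For reference, the PG's state and scrub timestamps can be double-checked
with something along these lines (grepping for the 'state' and
'last_deep_scrub_stamp' fields in the JSON output):

# ceph pg 4.35 query | grep -E '"state"|"last_deep_scrub_stamp"'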

So here are my questions:

1. What does the message on the client mean?
   (libceph: get_reply osd2 tid 1459933 data 3248128 > preallocated
131072, skipping)

2. What occurred on the OSD (osd2) after reporting '4.35 missing primary copy'?

3. Why did the client read operation hang?  I have never had a client
   hang due to an OSD disk I/O error.

4. How do I verify that the data in the cluster is OK (i.e. beyond
   forcing deep-scrub on all PGs)?

Note about 4:
   The client operation in question was performing a verification of
   a disk image backup (i.e. computing MD5 sums of the original disk and
   the backup image and verifying that both match).  I will restart the
   client machine and repeat the verification step of that backup.  This
   will tell me if this small section of cluster data is OK (and
   indirectly exercise the Ceph execution/data paths that originally
   failed).
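
Roughly, the verification amounts to the following, where the device and
path names are only illustrative:

# md5sum /dev/sdX                    # original disk (name illustrative)
# md5sum /mnt/backup/disk-image.img  # backup image on the mapped RBD (path illustrative)

and then comparing the two sums.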


For reference:

All Ceph versions are 12.2.5.
Client kernel version is 4.9.95.


Thank you,
Dyweni


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


