Re: Potential OSD deadlock?

Is there some way to tell from the logs that this is happening? I'm not
seeing much I/O or CPU usage during these times. Is there some way to
prevent the splitting, and is there a negative side effect to doing so?
We've had I/O blocked for over 900 seconds, and as soon as the sessions
are aborted, they are reestablished and complete immediately.
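For context, a rough sketch of the FileStore settings that appear to
govern this splitting (the values shown are my understanding of the
stock defaults, not a recommendation):

    [osd]
    # FileStore splits a PG collection into subdirectories once it holds
    # roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16
    # objects (2 * 10 * 16 = 320 per directory with these assumed defaults)
    filestore merge threshold = 10
    filestore split multiple = 2

The values an OSD is actually running with can be checked over the admin
socket with something like "ceph daemon osd.0 config show | grep filestore".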

The fio test is just a sequential write, and starting it over
(rewriting from the beginning) still causes the issue. I would have
expected that a rewrite does not have to create new files and therefore
would not split collections. This is on my test cluster with no other
load.
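For what it's worth, the job is roughly along these lines (a minimal
sketch only; the device path, block size, and queue depth here are
assumptions rather than the exact job file):

    ; seq-write.fio -- illustrative sketch; device, bs, and iodepth are assumed
    [seq-write]
    ; sequential write, restarted from the beginning for the rewrite pass
    rw=write
    bs=4M
    size=1T
    ioengine=libaio
    iodepth=32
    direct=1
    filename=/dev/rbd0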

I'll be doing a lot of testing today. Which log options and debug
levels would be the most helpful for tracking this issue down?
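Something along these lines is roughly what I had in mind for raising
the levels at runtime; which subsystems and levels are actually worth
the overhead is the open question:

    # raise OSD, FileStore, and messenger debug levels on all OSDs
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    # ...reproduce the blocked I/O, then put the levels back to whatever
    # they were before
    ceph tell osd.* injectargs '--debug-osd 0/5 --debug-filestore 0/5 --debug-ms 0/5'
    # look for slow-request and split activity in the OSD logs
    grep -iE 'slow request|split' /var/log/ceph/ceph-osd.*.log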

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 8:09 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Sep 21, 2015 at 11:43 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>
>> I'm starting to wonder if this has to do with some OSDs getting full
>> or the 0.94.3 code. Earlier this afternoon, I cleared out my test
>> cluster so there were no pools. I created a new rbd pool and started
>> filling it with six 1 TB fio jobs at replication 3, with six spindles
>> across six servers. It was running 0.94.2 at the time. After several
>> hours of writes, we had the newly patched 0.94.3 binaries ready for
>> testing, so I rolled the update onto the test cluster while the fio
>> jobs were running. There were a few blocked I/Os as the services were
>> restarted (nothing I'm concerned about). Now that the OSDs are about
>> 60% full, the blocked I/O is becoming very frequent even with the
>> backports. The write bandwidth was consistently 200 MB/s until this
>> point; now it is fluctuating between 200 MB/s and 75 MB/s, mostly
>> around 100 MB/s. Our production cluster uses XFS on the OSDs; this
>> test cluster uses EXT4.
>>
>> I'll see if I can go back to 0.94.2 and fill the cluster up again....
>> Going back to 0.94.2 and 0.94.0 still shows the issue (although I
>> didn't refill the cluster; I didn't delete what was already there).
>> I'm building the latest hammer-backports now and will see if it
>> resolves the issue.
>
> You're probably running into the FileStore collection splitting and
> that's what is slowing things down in that testing.
> -Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


