Re: Inconsistent PG automatically got "repaired"?


 



On 2018-05-10 00:39, Gregory Farnum wrote:
On Wed, May 9, 2018 at 8:21 AM Nikos Kormpakis <nkorb@xxxxxxxxxxxx> wrote:
1) After how much time does RADOS try to read from a secondary replica? Is
   this timeout configurable?
2) If a primary shard is missing, does Ceph try to recreate it somehow
   automatically?
3) If Ceph recreates the primary shard (either automatically or with
   `ceph pg repair`), why did we not observe IO errors again? Does BlueStore
   know which disk blocks are bad and somehow avoid them, or can the same
   object be stored on different blocks when recreated? Unfortunately, I'm
   not familiar with its internals.
4) Is there any reason why the slow requests appeared? Can we correlate
   these requests somehow with our problem?

This behavior looks very confusing at first sight, and we'd really like to
know what is happening and what Ceph is doing internally. I'd really
appreciate any insights or pointers.


David and a few other people have been making a lot of changes around this area lately to make Ceph handle failures more transparently, and I haven't
kept up with all of it. But I *believe* what happened is:
1) the scrub caused a read of the object, and BlueStore returned a read error
2) the OSD would have previously treated this as a catastrophic failure and
crashed, but now it handles it by marking the object as missing and needing
recovery
— I don't quite remember the process here. Either 3') it tries to do
recovery on its own when there are available resources for it, or
3) the user requested an object the OSD had marked as missing, so
4) the recovery code kicked off and the OSD grabbed it from another replica.
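
If you want to see that state on your own cluster, commands along these
lines should show it (1.23 below is only a placeholder for the PG that was
flagged; substitute the id from your health output):

  # find the inconsistent PG and the object(s) the scrub flagged
  ceph health detail
  rados list-inconsistent-obj 1.23 --format=json-pretty

  # what the primary currently considers missing and still needs to recover
  ceph pg 1.23 list_missing
  ceph pg 1.23 query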

In particular reference to your questions:
1) It's not about time; a read error means the object is marked as gone
locally, and when that happens the OSD will try to recover the object from
elsewhere.
2) not a whole shard, but an object, sure. (I mean, it will also try to
recover a shard, but that's the normal peering, recovery, backfill sort of
thing...)
3) I don't know the BlueStore internals well enough to say for sure if it
marks the blocks as bad, but most modern *disks* will do that transparently
to the upper layers, so BlueStore just needs to write the data out again. To
BlueStore, the write will look like a completely different object, so the
fact that a previous bit of the hard drive was bad won't matter.
4) Probably your cluster was already busy, and ops got backed up on either
the primary OSD or one of the others participating in recovery? I mean,
that generally shouldn't occur, but slow requests tend to happen if you
overload a cluster and maybe the recovery pushed it over the edge...
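
If you want to check where those ops piled up, the OSD admin socket should
show it, and for point 3 the drive's own SMART counters will usually tell
you whether it remapped the bad sector (osd.12 and /dev/sdX below are only
placeholders for your OSD id and backing device):

  # ops currently queued on the OSD, and recently completed ops with durations
  ceph daemon osd.12 dump_ops_in_flight
  ceph daemon osd.12 dump_historic_ops

  # (re point 3) reallocated / pending sector counts reported by the drive
  smartctl -a /dev/sdX
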
-Greg

I was accustomed to the old behavior, where OSDs crashed when hitting an IO
error, so this behavior surprised me; that's why I wrote this mail.

About the slow requests, our cluster does not have any serious load, but
again, I'm not 100% sure and I'll try to reproduce it.
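
My rough plan for reproducing it is simply to force another deep scrub on
the affected PG while watching the cluster (again, 1.23 is just a
placeholder for the PG id):

  # re-run a deep scrub on the PG that was flagged inconsistent
  ceph pg deep-scrub 1.23

  # watch for the inconsistency, recovery and any slow request warnings
  ceph -w
  ceph health detail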

Thanks for your info,
Nikos.




Best regards,
--
Nikos Kormpakis - nkorb@xxxxxxxxxxxx
Network Operations Center, Greek Research & Technology Network
Tel: +30 210 7475712 - http://www.grnet.gr
7, Kifisias Av., 115 23 Athens, Greece
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





