Re: BlueStore and BlueFS warnings after upgrade to 19.2.1

On Fri, 7 Mar 2025 at 17:05, Nicola Mori <mori@xxxxxxxxxx> wrote:
>
> Dear Ceph users,
>
> after upgrading from 19.2.0 to 19.2.1 (via cephadm) my cluster started
> showing some warnings never seen before:
>
>       29 OSD(s) experiencing slow operations in BlueStore
>       13 OSD(s) experiencing stalled read in db device of BlueFS
>
> I searched for these messages but didn't find much. I also noticed that
> when browsing the CephFS folders (using the kernel module for the
> client) sometimes the client gets stuck for a long time before showing
> the folder content; however I don't know if this can be related to the
> above warnings.

Would it be possible for the people who implement these warnings (or
lower the thresholds significantly so that they suddenly trigger) to
put something visible somewhere? These kinds of warnings (like "too
many PGs per OSD" around Luminous, and "Large omaps" a bit later) seem
to pop out of nowhere for us admins in a minor release. I could find
https://github.com/ceph/ceph/pull/59464/files only by really, really
looking through the changelog for 19.2.1, and it is by no means easy
for Nicola above to know whether this warning has been in there for 3
major releases or whether the condition appeared by coincidence at the
same time as the minor upgrade.

Is this the way we Ceph admins are expected to "learn" how these
things work, left to wonder whether a warning is related to something
someone recently did, indicates a bad set of drives, or is just new
Ceph code that isn't correctly tuned yet?

It seems a bit like a pattern to just drop surprises like this on us
with no info on what to do. I would like to think that this is just a
series of "random" accidents that look very much alike, but there seem
to be few good explanations for why it happens so often. As the pull
request shows, someone did a lot of writing about the rationale for
the addition and about the values chosen, yet all the changelog had
was the line "squid: os/bluestore: Warning added for slow operations
and stalled read (pr#59464, Md Mahamudur Rahaman Sajib)", hidden among
all the other changes.

If we want people to dare run the latest releases so we can notice the
real bugs early, we need to be told something like "you might see the
text 'experiencing slow operations in BlueStore', and this means you
should read URL-GOES-HERE". Otherwise we look at
https://docs.ceph.com/en/latest/releases/squid/#notable-changes and
see nothing. Googling for this error message will be super useful 2.5
years from now, when "everyone" has had time to post about it on
Reddit and the mailing lists and paste it on Slack/IRC. But for the
early adopters of Ceph 19.2.x, having hard-to-find warnings suddenly
pop up feels like a bad way to start validating a cluster upgrade.

-- 
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


