Re: XFS attempt to access beyond end of device

Brad Hubbard <bhubbard@xxxxxxxxxx> · Fri, 28 Jul 2017 12:06:24 +1000

An update on this.

The "attempt to access beyond end of device" messages are caused by a
kernel bug, which is fixed by the following patches.

  - 59d43914ed7b9625 (vfs: make guard_bh_eod() more generic)
  - 4db96b71e3caea (vfs: guard end of device for mpage interface)

An upgraded Red Hat kernel including these patches is pending.
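
For anyone wondering what condition produces the message, here is a rough
sketch in Python of the kind of bounds check that runs before
"handle_bad_sector" is called. It is an illustration only, not the kernel
source: the function name, the parameter names and the 1000-sector device
are made up for the example.

```python
SECTOR_SIZE = 512  # the block layer counts in 512-byte sectors


def beyond_end_of_device(start_sector, nr_sectors, device_bytes):
    """Return True if a request would run past the end of the device.

    Hypothetical helper for illustration; all names are assumptions.
    """
    if nr_sectors == 0:
        return False  # empty request, nothing to check
    max_sector = device_bytes // SECTOR_SIZE
    if max_sector == 0:
        return False  # device size unknown, nothing to check against
    # Written as a subtraction so the comparison cannot overflow in C.
    return max_sector < nr_sectors or max_sector - nr_sectors < start_sector


if __name__ == "__main__":
    dev = 1000 * SECTOR_SIZE                  # made-up 1000-sector device
    print(beyond_end_of_device(992, 8, dev))  # request ends at the last sector
    print(beyond_end_of_device(996, 8, dev))  # last 4 sectors are past the end
```

A request that ends exactly on the last sector passes; one that starts inside
the device but extends past its end is the case that gets logged.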

There was also discussion of the upstream tracker issue
http://tracker.ceph.com/issues/14842. However, that has been ruled out for
all of the devices analysed while investigating this issue, since their
partitions are correctly aligned.
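
If you want to check your own partitions, something along these lines may
help. It is a sketch only, and it assumes that "correctly aligned" means both
the partition start and its length are multiples of a 4096-byte block; that
definition, the function name and the sample values are assumptions for the
example. The start and size can be read from /sys/class/block/<partition>/start
and /sys/class/block/<partition>/size, both given in 512-byte sectors.

```python
SECTOR_SIZE = 512  # sysfs reports start and size in 512-byte sectors


def is_aligned(start_sector, size_sectors, block_bytes=4096):
    """Return True if start and length are both multiples of block_bytes.

    Hypothetical helper; the 4096-byte default is an assumption.
    """
    start_bytes = start_sector * SECTOR_SIZE
    size_bytes = size_sectors * SECTOR_SIZE
    return start_bytes % block_bytes == 0 and size_bytes % block_bytes == 0


if __name__ == "__main__":
    print(is_aligned(2048, 1953521664))  # 1 MiB start, length a multiple of 4 KiB
    print(is_aligned(63, 1000))          # legacy 63-sector start offset
```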

On Sun, Jul 23, 2017 at 10:49 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> Blair,
>
> I should clarify that I am *now* aware of your support case =D
>
> For anyone willing to run a systemtap script, the following should give us
> more information about the problem.
>
> stap --all-modules -e 'probe kernel.function("handle_bad_sector"){ printf("handle_bad_sector(): ARGS is %s\n",$$parms$$);  print_backtrace()}'
>
> In order to run this you will need to install some non-trivial packages such as
> the kernel debuginfo package and kernel-devel. This is generally best
> accomplished as follows, at least on RPM-based systems.
>
> (yum|dnf) install systemtap
> stap-prep
>
> The systemtap script needs to be running when the error is generated, as it
> monitors calls to "handle_bad_sector", which is the function generating the
> error message. Once that function is called, the probe will dump all
> information about the bio structure passed as a parameter to
> "handle_bad_sector", as well as the call stack. This would give us a good
> idea of the specific code involved.
>
>
> On Sat, Jul 22, 2017 at 9:45 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>> On Sat, Jul 22, 2017 at 9:38 AM, Blair Bethwaite
>> <blair.bethwaite@xxxxxxxxx> wrote:
>>> Hi Brad,
>>>
>>> On 22 July 2017 at 09:04, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>>> Could you share what kernel/distro you are running and also please test whether
>>>> the error message can be triggered by running the "blkid" command?
>>>
>>> I'm seeing it on RHEL7.3 (3.10.0-514.2.2.el7.x86_64). See Red Hat
>>> support case #01891011 for sosreport etc.
>>
>> Thanks Blair,
>>
>> I'm aware of your case and the Bugzilla created from it and we are
>> investigating further.
>>
>>>
>>> No, blkid does not seem to trigger it. So far I haven't figured out
>>> what does. It seems to be showing up roughly once for each disk every
>>
>> Thanks, that appears to exclude any link to an existing Bugzilla that
>> was suggested as being related.
>>
>>> 1-2 weeks, and there is a clear time correlation across the hosts
>>> experiencing it.
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>
>>
>>
>> --
>> Cheers,
>> Brad

-- 
Cheers,
Brad