> -----Original Message-----
> From: Lukas Kolbe [mailto:lkolbe@xxxxxxxxxxxxxxxxxxxxxxxx]
> Sent: Wednesday, December 01, 2010 3:10 PM
> To: Kai Makisara
> Cc: Boaz Harrosh; linux-scsi@xxxxxxxxxxxxxxx; Desai, Kashyap
> Subject: Re: After memory pressure: can't read from tape anymore
>
> On Tuesday, 30.11.2010, at 21:53 +0200, Kai Makisara wrote:
> > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
>
> I'm Cc'ing Desai Kashyap from LSI, maybe he can comment on the
> hardware limitations of the SAS1068E?

Lukas,

No, it is not a hardware limitation that CONFIG_FUSION_MAX_SGE needs to
be 128, but our code is written in such a way that even if you set it
to more than 128, it falls back to 128 again. To change this value you
need to make the change below in mptbase.h:

--
-#define MPT_SCSI_SG_DEPTH	CONFIG_FUSION_MAX_SGE
+#define MPT_SCSI_SG_DEPTH	256
--

128 is a good number of scatter/gather elements, and it has long been
the standard value for MPT Fusion. The value is reflected in
sg_tablesize, and the Linux scatter/gather code uses it when building
the sg table for the HBA. See:

  cat /sys/class/scsi_host/host<x>/sg_tablesize

If a single I/O does not fit into sg_tablesize, the Linux
scatter/gather code splits it into multiple I/Os for the low-level
driver. So I do not see any problem with the CONFIG_FUSION_MAX_SGE
value. Our driver internally converts the sg list into the SGE format
understood by the LSI hardware.

Thanks,
Kashyap

> > > ...
> > > I looked at enlarge_buffer() and it looks fragile and broken. If
> > > you really need a pointer, e.g.:
> > >   STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> >
> > If you think it is broken, please fix it.
> >
> > > Then why not use vmalloc() for buffers larger than PAGE_SIZE? But
> > > better yet, avoid it by keeping a pages array or an sg list and
> > > operating with aio-type operations.
> >
> > vmalloc() is not a solution here. Think about this from the HBA
> > side. Each s/g segment must be contiguous in the address space the
> > HBA uses. In many cases this is the physical memory address space.
> > Any solution must make sure that the HBA can perform the requested
> > data transfer.
> >
> > Kai
> >
> > > But I understand this is a lot of work on an old driver. Perhaps
> > > pre-allocate something big at startup, with the size specified by
> > > the user?
> >
> > This used to be possible at some time, and it could be made possible
> > again. But I don't like this option because it means that users must
> > explicitly set boot parameters.
> >
> > And it is difficult for me to believe that modern SAS HBAs only
> > support 128 s/g segments.
> >
> > Kai
>
> For reference, here's my original message with Kai's reply:
>
> > Hi,
> >
> > On our backup system (2 LTO4 drives/Tandberg library via an
> > LSISAS1068E, kernel 2.6.36 with the stock Fusion MPT SAS host driver
> > 3.04.17 on debian/squeeze), we see reproducible tape read and write
> > failures after the system has been under memory pressure:
> >
> > [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> > [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> > [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> > [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> > [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> > [342873.737130] st: from_buffer offset overflow.
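To make the failures above concrete, here is a minimal sketch, in
kernel-style C, of the kind of allocation st's enlarge_buffer() has to
perform here: building a 2 MB tape buffer out of physically contiguous
chunks, each of which later becomes one scatter/gather segment for the
HBA. This is illustrative, not the actual st source; everything except
alloc_pages(), get_order(), and the 128-segment limit is made up for
the example.

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define EXAMPLE_MAX_SEGS 128	/* the HBA's sg_tablesize */

/*
 * Grab up to 128 physically contiguous 16 kB chunks (order 2 with
 * 4 kB pages).  128 x 16 kB = 2097152 bytes, the buffer size in the
 * log above.  Under fragmentation, alloc_pages() fails for order > 0
 * long before memory actually runs out, so the whole buffer
 * allocation fails.
 */
static int example_enlarge_buffer(struct page **pages, int target_bytes)
{
	unsigned int order = get_order(16 * 1024);
	int got = 0, seg = 0;

	while (got < target_bytes && seg < EXAMPLE_MAX_SEGS) {
		pages[seg] = alloc_pages(GFP_KERNEL, order);
		if (!pages[seg])
			return -ENOMEM;	/* no contiguous chunk left */
		got += PAGE_SIZE << order;
		seg++;
	}
	return got >= target_bytes ? seg : -ENOMEM;
}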
> >
> > Bacula is spewing this message every time it tries to access the
> > tape drive:
> > 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error
> > on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0).
> > ERR=Input/output error
> >
> > By memory pressure, I mean that the KVM processes containing the
> > postgres DB (~20 million files) and the bacula director had used all
> > available RAM; one of them used ~4 GiB of its 12 GiB swap for an
> > hour or so (by selecting a full restore, it seems that the whole
> > directory tree of the 15-million-file backup gets read into memory).
> > After this, I wasn't able to read from the second tape drive anymore
> > (/dev/st0), whereas the first tape drive was restoring the data
> > happily (it is currently about halfway through a 3 TiB restore from
> > 5 tapes).
> >
> > This same behaviour appears when we're doing a few incremental
> > backups; after a while, it just isn't possible to use the tape
> > drives anymore - every I/O operation gives an I/O error, even a
> > simple dd bs=64k count=10. After a restart, the system behaves
> > correctly until - seemingly - another memory pressure situation
> > occurs.
>
> This is predictable. The maximum number of scatter/gather segments
> seems to be 128. The st driver first tries to set up the transfer
> directly from the user buffer to the HBA. The user buffer is usually
> fragmented, so one scatter/gather segment is used for each page.
> Assuming a 4 kB page size, the maximum size of the direct transfer is
> 128 x 4 kB = 512 kB.
>
> When this fails, the driver tries to allocate a kernel buffer that
> consists of physically contiguous segments larger than 4 kB. Let's
> assume it can find 128 segments of 16 kB each. In this case the
> maximum block size is 2048 kB. Memory pressure results in memory
> fragmentation, the driver can't find large enough segments, and
> allocation fails. This is what you are seeing.
>
> So, one solution is to use a 512 kB block size. Another is to find
> out whether the 128-segment limit is a physical limitation or just a
> choice. In the latter case the mptsas driver could be modified to
> support larger block sizes even after memory fragmentation.
>
> Kai
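To double-check Kai's arithmetic, here is a trivial standalone C
program (userspace, illustrative values only; the 128 is the
sg_tablesize discussed above, the segment sizes are the two cases Kai
assumes):

#include <stdio.h>

int main(void)
{
	unsigned int sg_segments = 128;	/* sg_tablesize of the HBA */
	unsigned int page_kb = 4;	/* one page per segment (direct I/O) */
	unsigned int contig_kb = 16;	/* contiguous chunk st can allocate */

	/* direct transfer from a fragmented user buffer: 128 x 4 kB */
	printf("max direct block size:   %u kB\n", sg_segments * page_kb);

	/* kernel buffer built from 16 kB contiguous chunks: 128 x 16 kB */
	printf("max buffered block size: %u kB\n", sg_segments * contig_kb);
	return 0;
}

This prints 512 kB and 2048 kB, matching the two limits above.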