df hangs on down nfs server mounted with hard,intr, can't kill

Ron Herardian rherardi at gssnet.com
Mon Mar 8 19:01:06 UTC 2004


"On a hard-mounted file system, NFS operations are retried until they are acknowledged by the server. A side effect of hard-mounting NFS file systems is that processes block (or "hang") in a high-priority disk wait state until their NFS RPC calls complete. If an NFS server goes down, the clients using its file systems hang if they reference these file systems before the server recovers. Using -intr in conjunction with the -hard mount option allows users to interrupt system calls that are blocked waiting on a crashed server. The system call is interrupted when the process making the call receives a signal, usually sent by the user typing Ctrl-C or using the kill command.

On a soft-mounted file system, an NFS RPC call returns a timeout error if it fails the number of times specified by the retrans option. You should not use the -soft option on any file system that is writeable, nor on any file system from which you load executables. NFS only guarantees the consistency of data after a server crash if the NFS file system was hard-mounted by the client."

[http://www.brandonhutchinson.com/nfs_timeouts.html]
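
Since a process that touches a hard mount can block until the server answers, one way to protect a monitoring script is to make the blocking call in a throwaway child process and give up on it after a deadline. The following is only a minimal sketch in Python 3.7+, not a tested solution for this setup: the helper name statvfs_with_timeout and the 5 second default are made up for illustration. It protects only the parent; the child may stay stuck until the server returns.

```python
import subprocess
import sys

# Code run in the child: statvfs the path in argv[1], print total and available bytes.
_CHILD = (
    "import os, sys\n"
    "s = os.statvfs(sys.argv[1])\n"
    "print(s.f_frsize * s.f_blocks, s.f_frsize * s.f_bavail)\n"
)


def statvfs_with_timeout(path, timeout=5.0):
    """Hypothetical helper: return (total_bytes, available_bytes),
    or None if the child fails or does not answer within `timeout` seconds."""
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do NOT wait for the child: if it is blocked on
        # the NFS call, the signal may not take effect until the call returns.
        proc.kill()
        proc.stdout.close()
        return None
    if proc.returncode != 0:
        return None
    total, avail = out.split()
    return int(total), int(avail)


if __name__ == "__main__":
    print(statvfs_with_timeout(sys.argv[1] if len(sys.argv) > 1 else "/"))
```

Note that subprocess.run() with a timeout is deliberately avoided here, because it waits for the child after killing it, and that wait could itself hang.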



Wade Hampton wrote:
> 
> I have a Fedora server with kernel 2.4.22-1-2163 SMP mounting a
> remote solaris server (hence choice of options):
> 
>    rsize=32768,ro,hard,intr,tcp,nfsvers=3
> 
> When the remote is down or disconnected, a "df" hangs (as expected),
> but I can't kill it, even as root or with kill -9.  The docs for mount
> indicate that the INTR option should allow for killing apps that are
> blocked on file systems mounted with HARD.
> 
> I also coded a test program that calls statvfs(2), and it hangs on
> the statvfs(2) call when run against a down NFS server.  It too
> can't be interrupted or killed.
> 
> My questions are:
> 
> 1)  Is there a safe and reliable means to check for a down NFS server
>      (e.g., is showmount -e <server> safe enough -- it is interruptible,
>      hence one could wrap it with a timer, and if it times out, the
>      server would be down)?
> 
> 2)  Is the non-interruptible operation (even with INTR option)
>      a bug or feature?
> 
> 3)  Is there a simple kernel call, /proc entry, or similar that can
>     be used for this purpose?
> 
> 4)  Is there a perl module to accomplish this?
> 
> This would be very useful for network monitoring, e.g., when the
> server goes down and stays down for >1 minute, generate an SNMP
> trap and write to a log file.  It would be good if you can't put an SNMP
> agent on the server, but only on the client.  It is also useful for writing
> a highly reliable client application.
> 
> As I have no control over the remote system, when it went down,
> I had to do a hard reboot of my Linux box to stop the hung apps.  This
> is a Windows solution, not a Linux solution.
> 
> Note, I found this when writing some scripts for MRTG to check
> the disk utilization of partitions.  My df's hung so I didn't even get
> the proper values for my local partitions.  After a few days, I had
> LOTS of hung MRTG apps.
> 
> Thanks
> --
> Wade Hampton
> 
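
On question 1, a check that never touches the mounted file system cannot hang in the same way. Below is a rough sketch in Python that only tries to open a TCP connection with a timeout. It assumes the server offers NFS over TCP on the standard port 2049, which the "tcp" mount option above suggests but which I have not verified for this server. The function name nfs_port_reachable is made up for illustration. A successful connect only shows that the port accepts connections, not that NFS requests are being served.

```python
import socket

NFS_PORT = 2049  # assumption: the server uses the standard NFS port over TCP


def nfs_port_reachable(host, port=NFS_PORT, timeout=5.0):
    """Hypothetical helper: True if a TCP connection to host:port
    can be opened within `timeout` seconds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure and timeout.
        return False
```

A monitoring job could call this once a minute and raise the SNMP trap or write the log entry after the check has failed for longer than the chosen threshold.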

-- 

Global System Services Corporation (GSS)
650 Castro Street, Suite 120, Number 268, Mountain View, CA 94041, USA
+1 (650) 965-8669 phone, +1 (650) 965-8679 fax, +1 (650) 283-5241 mobile
rherardi at gssnet.com, http://www.gssnet.com

"The best way to predict your future is to create it." - Stephen Covey




