Hi,
I use NFS-mounted (defaults, so v4) /home directories from a simple server over Gigabit Ethernet, all running Fedora 33. This has been working fine for 25+ years through various Fedora versions. However, in the last month or so all of the client computers have been getting KDE GUI lockups every few hours that last for around 40 secs. /home is not accessible during this time and it feels/looks like an NFS lockup issue. There are no "NFS server not responding" or similar messages in either the server's or the clients' /var/log/messages, and the network communication seems fine.
1. Have there been some changes to NFS recently in the kernel?
2. Any idea where to begin to try and debug this?
Terry
On 25/09/2021 13:04, Terry Barnaby wrote:
Hi,
I use NFS-mounted (defaults, so v4) /home directories from a simple server over Gigabit Ethernet, all running Fedora 33. This has been working fine for 25+ years through various Fedora versions. However, in the last month or so all of the client computers have been getting KDE GUI lockups every few hours that last for around 40 secs. /home is not accessible during this time and it feels/looks like an NFS lockup issue. There are no "NFS server not responding" or similar messages in either the server's or the clients' /var/log/messages, and the network communication seems fine.
Have there been some changes to NFS recently in the kernel?
Any idea where to begin to try and debug this?
A few questions.
1. Are you saying your NFS server HW is the same for the past 25 years? Couldn't have been all Fedora, right?
2. How many clients? Connected on a single or multiple switches?
3. Do the lockups happen during a given time of day, or random?
4. Have you checked for possible disk errors on the server?
-- Nothing to see here
On 25/09/2021 06:42, Ed Greshko wrote:
On 25/09/2021 13:04, Terry Barnaby wrote:
Hi,
I use NFS-mounted (defaults, so v4) /home directories from a simple server over Gigabit Ethernet, all running Fedora 33. This has been working fine for 25+ years through various Fedora versions. However, in the last month or so all of the client computers have been getting KDE GUI lockups every few hours that last for around 40 secs. /home is not accessible during this time and it feels/looks like an NFS lockup issue. There are no "NFS server not responding" or similar messages in either the server's or the clients' /var/log/messages, and the network communication seems fine.
Have there been some changes to NFS recently in the kernel?
Any idea where to begin to try and debug this?
Thanks for the reply:
A few questions.
1. Are you saying your NFS server HW is the same for the past 25 years? Couldn't have been all Fedora, right?
No ( :) ), I was using other Linux and Unix systems before then. OS versions and hardware have certainly changed over the years, but the setup is the same and no hardware has changed in the last couple of years; certainly there have been no hardware/software changes in the last couple of months when the problem started to occur, apart from Fedora 33 updates.
2. How many clients? Connected on a single or multiple switches?
5 clients, two switches, but clients on the same switch as the server have the issue as well as the others. Pings still work in the locked-up condition.
3. Do the lockups happen during a given time of day, or random?
They appear to be random, although they happen more frequently when first logging in (more /home accesses then?).
4. Have you checked for possible disk errors on the server?
No disk related error messages and RAID file systems show as ok. Smartctl shows no issue on disks.
On 25/09/2021 14:07, Terry Barnaby wrote:
A few questions.
1. Are you saying your NFS server HW is the same for the past 25 years? Couldn't have been all Fedora, right?
No ( :) ), I was using other Linux and Unix systems before then. OS versions and hardware have certainly changed over the years, but the setup is the same and no hardware has changed in the last couple of years; certainly there have been no hardware/software changes in the last couple of months when the problem started to occur, apart from Fedora 33 updates.
OK. Kinda sounded like the server HW was "old".
2. How many clients? Connected on a single or multiple switches?
5 clients, two switches, but clients on the same switch as the server have the issue as well as the others. Pings still work in the locked-up condition.
Since all clients are being affected in the same manner, it would point more towards a server issue, as you've already concluded.
3. Do the lockups happen during a given time of day, or random?
They appear to be random, although they happen more frequently when first logging in (more /home accesses then?).
Not that it matters, but everyone isn't logging in at the same time, correct? So it's at login that folks are getting lockups most frequently.
4. Have you checked for possible disk errors on the server?
No disk related error messages and RAID file systems show as ok. Smartctl shows no issue on disks.
Are you running sysstat and collecting system information? You may want to consider doing that to see, for example, if "sar -n NFS" or "sar -n NFSD" shows anything unusual. Like excessive re-transmissions.
-- Nothing to see here
On 25/09/2021 09:00, Ed Greshko wrote:
On 25/09/2021 14:07, Terry Barnaby wrote:
A few questions.
1. Are you saying your NFS server HW is the same for the past 25 years? Couldn't have been all Fedora, right?
No ( :) ), I was using other Linux and Unix systems before then. OS versions and hardware have certainly changed over the years, but the setup is the same and no hardware has changed in the last couple of years; certainly there have been no hardware/software changes in the last couple of months when the problem started to occur, apart from Fedora 33 updates.
OK. Kinda sounded like the server HW was "old".
2. How many clients? Connected on a single or multiple switches?
5 clients, two switches, but clients on the same switch as the server have the issue as well as the others. Pings still work in the locked-up condition.
Since all clients are being affected in the same manner, it would point more towards a server issue, as you've already concluded.
3. Do the lockups happen during a given time of day, or random?
They appear to be random, although they happen more frequently when first logging in (more /home accesses then?).
Not that it matters, but everyone isn't logging in at the same time, correct? So it's at login that folks are getting lockups most frequently.
4. Have you checked for possible disk errors on the server?
No disk related error messages and RAID file systems show as ok. Smartctl shows no issue on disks.
Are you running sysstat and collecting system information? You may want to consider doing that to see, for example, if "sar -n NFS" or "sar -n NFSD" shows anything unusual. Like excessive re-transmissions.
Thanks for the info. Yes, sysstat is running; I will try "sar -n NFS" and "sar -n NFSD", as well as "mountstats /home" which I have just found, after a lockup has occurred. Although sysstat does not seem to list maximum latency, which would be the pointer to look for. Actually I was thinking it may be the clients rather than the server, as normally there are "NFS server not responding" messages on the clients if the server is down for some reason, but obviously it could be either.
Random login times and it occurs during the day as well.
On Sat, 25 Sept 2021 at 02:04, Terry Barnaby terry1@beam.ltd.uk wrote:
Hi,
I use NFS-mounted (defaults, so v4) /home directories from a simple server over Gigabit Ethernet, all running Fedora 33. This has been working fine for 25+ years through various Fedora versions. However, in the last month or so all of the client computers have been getting KDE GUI lockups every few hours that last for around 40 secs. /home is not accessible during this time and it feels/looks like an NFS lockup issue. There are no "NFS server not responding" or similar messages in either the server's or the clients' /var/log/messages, and the network communication seems fine.
Have there been some changes to NFS recently in the kernel?
Any idea where to begin to try and debug this?
Think about what has changed. There is always a possibility of "old age"
hardware problems. I once had a PCNFS system that was having NFS issues while connected to a big server with 100's of PCNFS clients. The vendor's ethernet diagnostics didn't find a problem, and other network services were working. A visual inspection of the network card found visibly heat-damaged components. Replacing the card solved the problem.
With multiple clients having problems, look for ways to check all the upstream gear common to clients with issues: server, ethernet switches, and wiring. I have seen cables develop issues that were found using high-end cable test gear.
On Sat, 2021-09-25 at 06:04 +0100, Terry Barnaby wrote:
in the last month or so all of the client computers are getting KDE GUI lockups every few hours that last for around 40 secs.
Might one of them have a cron job that's scouring the network?
e.g. locate databasing
On Sun, 26 Sept 2021 at 01:44, Tim via users users@lists.fedoraproject.org wrote:
On Sat, 2021-09-25 at 06:04 +0100, Terry Barnaby wrote:
in the last month or so all of the client computers are getting KDE GUI lockups every few hours that last for around 40 secs.
Might one of them have a cron job that's scouring the network?
e.g. locate databasing
If you have cron jobs that use a lot of network bandwidth, they may work fine until some network issue causing lots of retransmits bogs them down. "netstat -i <interface>" will show non-zero values for the X-DRP, X-ERR, and/or X-OVR columns when that is happening. In the good old days one bad patch cable might only affect that workstation (and the user would say "the network is fine"), but now one laptop can shovel packets fast enough to bring down a network. At my former work we had smart switches that would cut off a workstation using excessive bandwidth. SGI workstations processing "large" data sets were given an exemption, but after every update to the switches we had to remind IT to reconnect the SGI boxes. IT did have very fancy software that could show where the LAN was congested, but generally nobody looked at it unless users reported problems. I had cron jobs that generated daily reports for each workstation in our group, including netstat output. Bad cables were causing enough downtime that IT banned old cables and provided new high-quality cables for all (our environment was hard on cables due to users going to sea, exposing cables to high vibration and humidity seasoned with salt).
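For reference, two quick ways to eyeball those interface counters on a current Fedora box (the interface name eno1 below is just an example, substitute your own):

  # kernel interface table: look at the RX/TX -ERR, -DRP and -OVR columns
  netstat -i
  # per-interface packet/error/drop counters from iproute2
  ip -s link show eno1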
On Sun, 26 Sep 2021 10:26:19 -0300 George N. White III wrote:
If you have cron jobs that use a lot of network bandwidth, they may work fine until some network issue causing lots of retransmits bogs them down.
Which is why you should check the dumb stuff first! Has a critter chewed on the ethernet cable to the server?
Are there network switches under your control? It sounds similar to what happens when the MTU settings on the systems do not match, or when one system's MTU is set above the value on the switch ports.
Next time the issue occurs, use ping with the do-not-fragment flag, e.g. $ ping -M do -s 8972 ip.address
8972 should be the highest value that works with an MTU of 9000, since there is 28 bytes of overhead (IPv4 + ICMP headers) per packet.
Second, are you sure no one is attaching to the network and duplicating the MAC address of your NFS server, or perhaps of the system that is stalled? If the switches are manageable you would want to ensure that the MAC addresses are being learned on the correct ports.
-Jamie
On Sun, Sep 26, 2021 at 10:24 AM Tom Horsley horsley1953@gmail.com wrote:
On Sun, 26 Sep 2021 10:26:19 -0300 George N. White III wrote:
If you have cron jobs that use a lot of network bandwidth, they may work fine until some network issue causing lots of retransmits bogs them down.
Which is why you should check the dumb stuff first! Has a critter chewed on the ethernet cable to the server?
Make sure you have sar/sysstat enabled and changed to take 1-minute samples.
sar -d will show disk performance. If one of the disks "blips" at the firmware level (working on a hard-to-read block, maybe), the %util on that device will be significantly higher than on all the other disks, so it will stand out. Then you can look deeper at the SMART data.
sar generically will show your cpu/system time and sar -n DEV will show detailed network traffic, sar -n EDEV will show network errors.
With it set to 1 minute you should be able to detect most blips.
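As a side note, on current Fedora the sar collection interval is driven by a systemd timer rather than cron, so switching to 1-minute samples is roughly a drop-in override like the following (assuming the stock sysstat-collect.timer packaging):

  # systemctl edit sysstat-collect.timer
  [Timer]
  OnCalendar=
  OnCalendar=*:*:00

  # systemctl daemon-reload && systemctl restart sysstat-collect.timer

That makes sa1 collect every minute instead of the default every 10 minutes.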
On Sun, Sep 26, 2021 at 10:26 AM Jamie Fargen jamie@fargenable.com wrote:
Are there network switches under your control? It sounds similar to what happens when the MTU settings on the systems do not match, or when one system's MTU is set above the value on the switch ports.
Next time the issue occurs, use ping with the do-not-fragment flag, e.g. $ ping -M do -s 8972 ip.address
8972 should be the highest value that works with an MTU of 9000, since there is 28 bytes of overhead (IPv4 + ICMP headers) per packet.
Second, are you sure no one is attaching to the network and duplicating the MAC address of your NFS server, or perhaps of the system that is stalled? If the switches are manageable you would want to ensure that the MAC addresses are being learned on the correct ports.
-Jamie
On Sun, Sep 26, 2021 at 10:24 AM Tom Horsley horsley1953@gmail.com wrote:
On Sun, 26 Sep 2021 10:26:19 -0300 George N. White III wrote:
If you have cron jobs that use a lot of network bandwidth, they may work fine until some network issue causing lots of retransmits bogs them down.
Which is why you should check the dumb stuff first! Has a critter chewed on the ethernet cable to the server?
Thanks for the feedback everyone.
This is a very lightly loaded system with just 3 users ATM and very little going on across the network (just editing code files etc.). The problem occurred again yesterday. For about 10 minutes my KDE desktop locked up in 20-second bursts, and then the problem went away for the rest of the day. During that time the desktop and server were about 98.5% idle and pings continued fine. A Konsole window doing an "ls /home" every 5 seconds was locked up in the ls. I had Konsole windows open running the pings, tops and ls's, and although I couldn't operate the desktop (move virtual desktops etc.) the ping and top windows were updating fine. There were no error messages in /var/log/messages on either system, and the sar stats showed nothing out of the ordinary.
I am pretty sure the Ethernet network is fine, including cables, switches, Ethernet adapters etc.; pings are fine. It just appears that the client programs get a hugely delayed response (> 20 secs) to accesses to /home every now and then, which points to NFS issues. Most of the system stats counters just give the number of accesses, not the latency of an access, which is what I need to track down the problem, as there are few disk and network accesses going on.
As I said, all has been fine on this system until about a month ago and the only obvious changes are the Fedora updates, so I wondered if anyone knew whether there had been changes to the NFS stack recently and/or how to log peak NFS latencies?
Terry
On 26/09/2021 18:06, Roger Heflin wrote:
Make sure you have sar/sysstat enabled and changed to take 1-minute samples.
sar -d will show disk performance. If one of the disks "blips" at the firmware level (working on a hard-to-read block, maybe), the %util on that device will be significantly higher than on all the other disks, so it will stand out. Then you can look deeper at the SMART data.
sar generically will show your cpu/system time and sar -n DEV will show detailed network traffic, sar -n EDEV will show network errors.
With it set to 1 minute you should be able to detect most blips.
On Sun, Sep 26, 2021 at 10:26 AM Jamie Fargen jamie@fargenable.com wrote:
Are there network switches under your control? It sounds similar to what happens when the MTU settings on the systems do not match, or when one system's MTU is set above the value on the switch ports.
Next time the issue occurs, use ping with the do-not-fragment flag, e.g. $ ping -M do -s 8972 ip.address
8972 should be the highest value that works with an MTU of 9000, since there is 28 bytes of overhead (IPv4 + ICMP headers) per packet.
Second, are you sure no one is attaching to the network and duplicating the MAC address of your NFS server, or perhaps of the system that is stalled? If the switches are manageable you would want to ensure that the MAC addresses are being learned on the correct ports.
-Jamie
On Sun, Sep 26, 2021 at 10:24 AM Tom Horsley horsley1953@gmail.com wrote:
On Sun, 26 Sep 2021 10:26:19 -0300 George N. White III wrote:
If you have cron jobs that use a lot of network bandwidth, they may work fine until some network issue causing lots of retransmits bogs them down.
Which is why you should check the dumb stuff first! Has a critter chewed on the ethernet cable to the server?
On 30/09/2021 16:35, Terry Barnaby wrote:
This is a very lightly loaded system with just 3 users ATM and very little going on across the network (just editing code files etc.). The problem occurred again yesterday. For about 10 minutes my KDE desktop locked up in 20-second bursts, and then the problem went away for the rest of the day. During that time the desktop and server were about 98.5% idle and pings continued fine. A Konsole window doing an "ls /home" every 5 seconds was locked up in the ls. I had Konsole windows open running the pings, tops and ls's, and although I couldn't operate the desktop (move virtual desktops etc.) the ping and top windows were updating fine. There were no error messages in /var/log/messages on either system, and the sar stats showed nothing out of the ordinary.
I am pretty sure the Ethernet network is fine, including cables, switches, Ethernet adapters etc.; pings are fine. It just appears that the client programs get a hugely delayed response (> 20 secs) to accesses to /home every now and then, which points to NFS issues. Most of the system stats counters just give the number of accesses, not the latency of an access, which is what I need to track down the problem, as there are few disk and network accesses going on.
As I said, all has been fine on this system until about a month ago and the only obvious changes are the Fedora updates, so I wondered if anyone knew whether there had been changes to the NFS stack recently and/or how to log peak NFS latencies?
First of all, pings are at the hardware level and pretty much useless for doing anything other than confirming connectivity.
How are the mounts achieved? Hard mounts or soft mounts, and what version are you using for the mounts?
I use systemd automounts for home directories and have
Options=rw,soft,fg,x-systemd.mount-timeout=30,v4.2
Type=nfs4
I have not seen any issues, but all the systems are VMs. When faced with this type of problem, even though I swear there is nothing wrong with my physical set-up, I do tend to reseat cables and switch things around to see if something changes.
-- Nothing to see here
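For anyone wanting to try the same, a rough sketch of such a mount/automount unit pair might look like this (server name and paths are placeholders, not Ed's actual config, and the version option is spelled vers=4.2 here):

  # /etc/systemd/system/home.mount
  [Unit]
  Description=NFS mount of /home

  [Mount]
  What=nfsserver:/home
  Where=/home
  Type=nfs4
  Options=rw,soft,fg,x-systemd.mount-timeout=30,vers=4.2

  # /etc/systemd/system/home.automount
  [Unit]
  Description=Automount for /home

  [Automount]
  Where=/home

  [Install]
  WantedBy=multi-user.target

  # systemctl daemon-reload && systemctl enable --now home.automount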
On mine when I first access the NFS volume it takes 5-10 seconds for the disks to spin up. Mine will spin down later in the day if little or nothing is going on and I will get another delay.
I have also seen delays if a disk gets bad blocks and corrects them. About half of the time that produces a message, but some of the time there are no messages at all about it, and I have had to resort to using sar to figure out which disk is causing the issue.
So on my machine I see this (sar -d):
05:29:01 AM       DEV      tps     rkB/s    wkB/s  dkB/s  areq-sz  aqu-sz  await  %util
05:29:01 AM    dev8-0    36.16     94.01   683.65   0.00    21.51    0.03   0.67   1.11
05:29:01 AM   dev8-16     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM   dev8-32     0.02      0.00     0.00   0.00     0.00    0.00   1.00   0.00
05:29:01 AM   dev8-48   423.65  71239.92   198.64   0.00   168.63   12.73  29.72  86.07
05:29:01 AM   dev8-64     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM   dev8-80     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM  dev8-144  2071.22  71311.58   212.22   0.00    34.53   11.37   5.47  54.81
05:29:01 AM   dev8-96     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM  dev8-128  1630.99  71389.49   198.18   0.00    43.89   15.72   9.62  57.05
05:29:01 AM  dev8-112  2081.05  71426.01   182.48   0.00    34.41   11.32   5.42  55.68
There is a 4 disk raid6 check going on.
You will notice that dev8-48 is busier than the other 3 disks; in this case that is because it is a 3 TB disk while the other 3 are all newer 6 TB disks with more data per revolution.
If you have sar set up with 60-second samples, the one disk that pauses should stand out more obviously than this, since the 3 TB here seems to be only marginally faster than the 6 TB disks.
On 30/09/2021 11:32, Ed Greshko wrote:
On 30/09/2021 16:35, Terry Barnaby wrote:
This is a very lightly loaded system with just 3 users ATM and very little going on across the network (just editing code files etc.). The problem occurred again yesterday. For about 10 minutes my KDE desktop locked up in 20-second bursts, and then the problem went away for the rest of the day. During that time the desktop and server were about 98.5% idle and pings continued fine. A Konsole window doing an "ls /home" every 5 seconds was locked up in the ls. I had Konsole windows open running the pings, tops and ls's, and although I couldn't operate the desktop (move virtual desktops etc.) the ping and top windows were updating fine. There were no error messages in /var/log/messages on either system, and the sar stats showed nothing out of the ordinary.
I am pretty sure the Ethernet network is fine, including cables, switches, Ethernet adapters etc.; pings are fine. It just appears that the client programs get a hugely delayed response (> 20 secs) to accesses to /home every now and then, which points to NFS issues. Most of the system stats counters just give the number of accesses, not the latency of an access, which is what I need to track down the problem, as there are few disk and network accesses going on.
As I said, all has been fine on this system until about a month ago and the only obvious changes are the Fedora updates, so I wondered if anyone knew whether there had been changes to the NFS stack recently and/or how to log peak NFS latencies?
First of all, pings are at the hardware level and pretty much useless for doing anything other than confirming connectivity.
How are the mounts achieved? Hard mounts or soft mounts, and what version are you using for the mounts?
I use systemd automounts for home directories and have
Options=rw,soft,fg,x-systemd.mount-timeout=30,v4.2
Type=nfs4
I have not seen any issues, but all the systems are VMs. When faced with this type of problem, even though I swear there is nothing wrong with my physical set-up, I do tend to reseat cables and switch things around to see if something changes.
--
Yes, the pings are to determine that the network interface chips, cables and switches are basically working, which they are with no obvious issues.
Mounts are normal fstab entries with "king.kingnet:/home /home nfs defaults,async 0 0", so defaults apart from async, and with the Ethernet interfaces set to the default 1500 MTU etc. So it is using the default NFSv4; I think I might try forcing it to NFSv3 to see if that changes anything.
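(For reference, pinning the version would just mean adding a vers option to that fstab line, per nfs(5), e.g.:

  king.kingnet:/home /home nfs defaults,async,vers=3 0 0

Removing it again goes back to negotiating v4.)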
Yes, problems often occur due to you having done something, but I am pretty sure nothing has changed apart from Fedora updates.
On 30/09/2021 11:42, Roger Heflin wrote:
On mine when I first access the NFS volume it takes 5-10 seconds for the disks to spin up. Mine will spin down later in the day if little or nothing is going on and I will get another delay.
I have also seen delays if a disk gets bad blocks and corrects them. About half of the time that produces a message, but some of the time there are no messages at all about it, and I have had to resort to using sar to figure out which disk is causing the issue.
So on my machine I see this (sar -d):
05:29:01 AM       DEV      tps     rkB/s    wkB/s  dkB/s  areq-sz  aqu-sz  await  %util
05:29:01 AM    dev8-0    36.16     94.01   683.65   0.00    21.51    0.03   0.67   1.11
05:29:01 AM   dev8-16     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM   dev8-32     0.02      0.00     0.00   0.00     0.00    0.00   1.00   0.00
05:29:01 AM   dev8-48   423.65  71239.92   198.64   0.00   168.63   12.73  29.72  86.07
05:29:01 AM   dev8-64     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM   dev8-80     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM  dev8-144  2071.22  71311.58   212.22   0.00    34.53   11.37   5.47  54.81
05:29:01 AM   dev8-96     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM  dev8-128  1630.99  71389.49   198.18   0.00    43.89   15.72   9.62  57.05
05:29:01 AM  dev8-112  2081.05  71426.01   182.48   0.00    34.41   11.32   5.42  55.68
There is a 4 disk raid6 check going on.
You will notice that dev8-48 is busier than the other 3 disks; in this case that is because it is a 3 TB disk while the other 3 are all newer 6 TB disks with more data per revolution.
If you have sar set up with 60-second samples, the one disk that pauses should stand out more obviously than this, since the 3 TB here seems to be only marginally faster than the 6 TB disks.
In my case the server's /home is on a partition of the two main Raid0 disks, which is shared with the OS, so they are active most of the time. No errors reported.
I will try setting up sar with a 60 second sample time on the client, thanks for the idea.
On Thu, 30 Sep 2021 17:50:01 +0100 Terry Barnaby wrote:
Yes, problems often occur due to you having done something, but I am pretty sure nothing has changed apart from Fedora updates.
But hardware is sneaky. It waits for you to install software updates, then breaks itself to make you think the software was at fault :-).
Raid0, so there is no redundancy on the data?
And what kind of underlying hard disks? Desktop drives will try for a long time (i.e. a minute or more) to read any bad blocks. Those disks will not report an error unless they reach the default OS timeout or hit the disk firmware timeout.
The sar data will show if one of the disks is being slow on the server end.
On the client end you are unlikely to get anything useful from any samples as it seems pretty likely the server is not responding to nfs and/or the disks are not responding.
It could be as simple as on login it tries to read a badish/slow block and that block takes a while to finally get it to read. If that is happening it will probably eventually stop being able to read it, and if you really are using raid0 then some data will be lost.
All of the NFSv4 issues I have run into involve it just breaking and staying broken (usually when the server reboots). I have never had it produce big sudden pauses, but using v3 won't hurt and I still try to avoid v4.
On Thu, Sep 30, 2021 at 11:55 AM Terry Barnaby terry1@beam.ltd.uk wrote:
On 30/09/2021 11:42, Roger Heflin wrote:
On mine when I first access the NFS volume it takes 5-10 seconds for the disks to spin up. Mine will spin down later in the day if little or nothing is going on and I will get another delay.
I have also seen delays if a disk gets bad blocks and corrects them. About half of the time that produces a message, but some of the time there are no messages at all about it, and I have had to resort to using sar to figure out which disk is causing the issue.
So on my machine I see this (sar -d):
05:29:01 AM       DEV      tps     rkB/s    wkB/s  dkB/s  areq-sz  aqu-sz  await  %util
05:29:01 AM    dev8-0    36.16     94.01   683.65   0.00    21.51    0.03   0.67   1.11
05:29:01 AM   dev8-16     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM   dev8-32     0.02      0.00     0.00   0.00     0.00    0.00   1.00   0.00
05:29:01 AM   dev8-48   423.65  71239.92   198.64   0.00   168.63   12.73  29.72  86.07
05:29:01 AM   dev8-64     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM   dev8-80     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM  dev8-144  2071.22  71311.58   212.22   0.00    34.53   11.37   5.47  54.81
05:29:01 AM   dev8-96     0.02      0.00     0.00   0.00     0.00    0.00   0.00   0.00
05:29:01 AM  dev8-128  1630.99  71389.49   198.18   0.00    43.89   15.72   9.62  57.05
05:29:01 AM  dev8-112  2081.05  71426.01   182.48   0.00    34.41   11.32   5.42  55.68
There is a 4 disk raid6 check going on.
You will notice that dev8-48 is busier than the other 3 disks; in this case that is because it is a 3 TB disk while the other 3 are all newer 6 TB disks with more data per revolution.
If you have sar set up with 60-second samples, the one disk that pauses should stand out more obviously than this, since the 3 TB here seems to be only marginally faster than the 6 TB disks.
In my case the server's /home is on a partition of the two main Raid0 disks, which is shared with the OS, so they are active most of the time. No errors reported.
I will try setting up sar with a 60 second sample time on the client, thanks for the idea.
Trivial thoughts from reading this thread. Please don't take the triviality as an insult.
Perhaps the best way to determine if the problem is from a software update is to downgrade likely packages. In the case of the kernel, you can just boot an older one (assuming that an old enough one is still installed -- fedora sure has a lot of package churn).
In case the HDDs are the problem, consider running S.M.A.R.T. drive self-tests on them. I know you said that smartctl reports no errors but you didn't say whether you've run the drive self-tests.
Is the pause long enough for you to figure out what is hanging? On either side? (I haven't used NFS for a couple of decades so I'm pretty rusty on the tooling.)
On 30/09/2021 19:27, Roger Heflin wrote:
Raid0, so there is no redundancy on the data?
And what kind of underlying hard disks? Desktop drives will try for a long time (i.e. a minute or more) to read any bad blocks. Those disks will not report an error unless they reach the default OS timeout or hit the disk firmware timeout.
The sar data will show if one of the disks is being slow on the server end.
On the client end you are unlikely to get anything useful from any samples as it seems pretty likely the server is not responding to nfs and/or the disks are not responding.
It could be as simple as on login it tries to read a badish/slow block and that block takes a while to finally get it to read. If that is happening it will probably eventually stop being able to read it, and if you really are using raid0 then some data will be lost.
All of the NFSv4 issues I have run into involve it just breaking and staying broken (usually when the server reboots). I have never had it produce big sudden pauses, but using v3 won't hurt and I still try to avoid v4.
Sorry, I meant Raid1. They are WD Red WD30EFRX-68N32N0 disks; I have found them pretty good for 24/7 RAID usage on a few different systems and have had no issues like this until about a month ago. Unfortunately I don't think sar will show latency, only the amount of disk usage. Yes, NFSv4 does have issues with things like directory access performance over slow connections etc.
On 01/10/2021 13:31, D. Hugh Redelmeier wrote:
Trivial thoughts from reading this thread. Please don't take the triviality as an insult.
Perhaps the best way to determine if the problem is from a software update is to downgrade likely packages. In the case of the kernel, you can just boot an older one (assuming that an old enough one is still installed -- fedora sure has a lot of package churn).
In case the HDDs are the problem, consider running S.M.A.R.T. drive self-tests on them. I know you said that smartctl reports no errors but you didn't say whether you've run the drive self-tests.
Is the pause long enough for you to figure out what is hanging? On either side? (I haven't used NFS for a couple of decades so I'm pretty rusty on the tooling.)
It's probably getting too hard to downgrade the server and clients now; it's been more than a month and, as you say, Fedora updates are frequent!
I think I will write some programs to perform live tests and logging of things.
It will show latency: await is the average IO time in ms, and %util is calculated from await and IOPS. So long as you turn sar down to 1-minute samples it should tell you which of the 2 disks had the higher await/%util. With a 10-minute sample the 40-sec pause may get spread out across enough IOPS that you cannot see it.
If one disk pauses, that disk's utilization will be significantly higher than the other disk's, and if utilization is much higher for the same or fewer IOPS that is generally a bad sign. Two similar disks with similar IOPS will generally have similar utilization. The math is close to (iops * await / 10), which gives a percentage.
Are you using MDraid or hardware raid? Doing a "grep mddevice /var/log/messages" will show if md forced a rewrite and/or had a slowdown.
you can do this on those disks: smartctl -l scterc,20,20 /dev/<device>
I believe 20 (2.0 seconds) is as low as a WD Red lets you go, according to my tests. If the disk hangs it will hang for 2 seconds versus the current default (which should be 7 seconds, and really depends on how many bad blocks there are together that it tries to read). Setting it to 2 will make the overall timeout 3.5x smaller, so if that reduces the hang time by about that factor, that is a confirmation that it is a disk issue.
and do this on the disks: smartctl --all /dev/sdb | grep -E '(Reallocated|Current_Pen|Offline Uncor)'
If any of those 3 is nonzero in the last column, that may be the issue. The SMART firmware will fail disks that are perfectly fine, and it will fail to fail horribly bad disks. The PASS/FAIL absolutely cannot be trusted no matter what it says. FAIL is more often right, but PASS is often unreliable.
So if any are nonzero, note the number, and at the next pause look again and see whether the numbers have changed.
On 01/10/2021 19:05, Roger Heflin wrote:
It will show latency: await is the average IO time in ms, and %util is calculated from await and IOPS. So long as you turn sar down to 1-minute samples it should tell you which of the 2 disks had the higher await/%util. With a 10-minute sample the 40-sec pause may get spread out across enough IOPS that you cannot see it.
If one disk pauses, that disk's utilization will be significantly higher than the other disk's, and if utilization is much higher for the same or fewer IOPS that is generally a bad sign. Two similar disks with similar IOPS will generally have similar utilization. The math is close to (iops * await / 10), which gives a percentage.
Are you using MDraid or hardware raid? Doing a "grep mddevice /var/log/messages" will show if md forced a rewrite and/or had a slowdown.
you can do this on those disks: smartctl -l scterc,20,20 /dev/<device>
I believe 20 (2.0 seconds) is as low as a WD Red lets you go, according to my tests. If the disk hangs it will hang for 2 seconds versus the current default (which should be 7 seconds, and really depends on how many bad blocks there are together that it tries to read). Setting it to 2 will make the overall timeout 3.5x smaller, so if that reduces the hang time by about that factor, that is a confirmation that it is a disk issue.
and do this on the disks: smartctl --all /dev/sdb | grep -E '(Reallocated|Current_Pen|Offline Uncor)'
If any of those 3 is nonzero in the last column, that may be the issue. The SMART firmware will fail disks that are perfectly fine, and it will fail to fail horribly bad disks. The PASS/FAIL absolutely cannot be trusted no matter what it says. FAIL is more often right, but PASS is often unreliable.
So if any are nonzero, note the number, and at the next pause look again and see whether the numbers have changed.
Thanks for the info, I am using MDraid. There are no "mddevice" messages in /var/log/messages and smartctl -a lists no errors on any of the disks. The disks are about 3 years old; I change them in servers when they are between 3 and 4 years old.
I will create a program to measure the effective sar output and detect any discrepancies, as this problem only occurs now and then, along with measuring the IO latency of NFS accesses on the clients, to see if I can track down whether it is a server disk issue or an NFS issue. Thanks again for the info.
You need to replace mddevice with the name of your md device, probably md0.
3-5 years is about when they start to go. I have 2-3 TB WD Reds sitting on the floor because their correctable/offline-uncorrectable errors kept happening and blipping my storage (a few-second pause). I even removed the disks from the raid and tried to force-rewrite the blocks, and once a long self-test would run without errors I would put the disk back. But then, once back in the array, within a week or 2 it would start it all over on the exact same disk/blocks/issues. The disk firmware does not seem to want to replace them with spare blocks and keeps using the blocks that will never work, so the only option was to buy new disks and put these on the pile of useless old disks.
You can do a smartctl -t long <devicename> and that will take several hours. If it is not successful you have unreadable spots on the disk. I have all my disks doing those tests at least once per week, and I keep in a <serialnumber> directory a dated file of each day's smartctl report for that disk so I can compare them when something starts acting up. I have reports on the various disks going back to 2014.
Success looks like this in the smartctl --all output for the device:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error  00%        4391             -
# 2  Extended offline    Completed without error  00%        4366             -
# 3  Extended offline    Completed without error  00%        4342             -
# 4  Extended offline    Completed without error  00%        4320             -
# 5  Extended offline    Completed without error  00%        4291             -
# 6  Extended offline    Completed without error  00%        4178             -
# 7  Extended offline    Completed without error  00%        4013             -
# 8  Extended offline    Completed without error  00%        3845             -
If there is an LBA where the "-" is, it failed.
The script looks like this and runs 1x per day:

#!/bin/bash
stamp=`date +%Y%m%d-%H`
for disk in a b c d e f g h i j k l m n o p ; do
    smartctl --all /dev/sd${disk} > /var/log/smartctl/tmp.out
    serial=`grep Serial /var/log/smartctl/tmp.out | awk '{print $3}'`
    mkdir -p /var/log/smartctl/${serial}
    mv /var/log/smartctl/tmp.out /var/log/smartctl/${serial}/${serial}.${stamp}.sd${disk}.out
    if [ "${disk}" == "a" ] ; then
        smartctl -l ssd /dev/sd${disk} >> /var/log/smartctl/${serial}/${serial}.${stamp}.sd${disk}.out
    fi
    if [ $? -eq 2 ] ; then
        rm -f /var/log/smartctl/sd${disk}.${stamp}.out
    fi
done
On Fri, Oct 1, 2021 at 2:20 PM Terry Barnaby terry1@beam.ltd.uk wrote:
On 01/10/2021 19:05, Roger Heflin wrote:
It will show latency: await is the average IO time in ms, and %util is calculated from await and IOPS. So long as you turn sar down to 1-minute samples it should tell you which of the 2 disks had the higher await/%util. With a 10-minute sample the 40-sec pause may get spread out across enough IOPS that you cannot see it.
If one disk pauses, that disk's utilization will be significantly higher than the other disk's, and if utilization is much higher for the same or fewer IOPS that is generally a bad sign. Two similar disks with similar IOPS will generally have similar utilization. The math is close to (iops * await / 10), which gives a percentage.
Are you using MDraid or hardware raid? Doing a "grep mddevice /var/log/messages" will show if md forced a rewrite and/or had a slowdown.
you can do this on those disks: smartctl -l scterc,20,20 /dev/<device>
I believe 20 (2.0 seconds) is as low as a WD Red lets you go, according to my tests. If the disk hangs it will hang for 2 seconds versus the current default (which should be 7 seconds, and really depends on how many bad blocks there are together that it tries to read). Setting it to 2 will make the overall timeout 3.5x smaller, so if that reduces the hang time by about that factor, that is a confirmation that it is a disk issue.
and do this on the disks: smartctl --all /dev/sdb | grep -E '(Reallocated|Current_Pen|Offline Uncor)'
If any of those 3 is nonzero in the last column, that may be the issue. The SMART firmware will fail disks that are perfectly fine, and it will fail to fail horribly bad disks. The PASS/FAIL absolutely cannot be trusted no matter what it says. FAIL is more often right, but PASS is often unreliable.
So if any are nonzero, note the number, and at the next pause look again and see whether the numbers have changed.
Thanks for the info, I am using MDraid. There are no "mddevice" messages in /var/log/messages and smartctl -a lists no errors on any of the disks. The disks are about 3 years old; I change them in servers when they are between 3 and 4 years old.
I will create a program to measure the effective sar output and detect any discrepancies, as this problem only occurs now and then, along with measuring the IO latency of NFS accesses on the clients, to see if I can track down whether it is a server disk issue or an NFS issue. Thanks again for the info.
On Fri, 1 Oct 2021 at 16:20, Terry Barnaby terry1@beam.ltd.uk wrote:
Thanks for the info, I am using MDraid. There are no "mddevice" messages in /var/log/messages and smartctl -a lists no errors on any of the disks. The disks are about 3 years old; I change them in servers when they are between 3 and 4 years old.
When I was managing systems in a group doing data-intensive processing, I replaced drives at failure or "end-of-warranty", whichever came first. Over time, failure rates decreased, but more drives failed shortly after end-of-warranty (when I wasn't able to get replacements on schedule due to budgets or supply chain issues). Current disk drive manufacturing seems to be dialed in to warranty periods, but there will always be some early failures.
I was usually able to get a warranty return authorization on the basis of the smartctl report for the failed drive.
I am getting more sure this is an NFS/networking issue rather than an issue with disks in the server.
I created a small test program that, given a directory, finds a random file in a random directory three levels below it, opens it and reads up to a block (512 bytes) of data from it, and times how long it took to find the file (opendir/readdir) and read the block from the file, printing the results if the time is greater than previous ones (so capturing the peak times). This is repeated every 10 seconds. The first figure is the average time to find a file (there may not be a file 3 levels down, so it repeats those searches until it finds one that the user can access), the second is the time it took to find the file (3 x opendir/readdir) that existed, and the last is how long it took to open, read and close the file.
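(The program itself isn't attached to the thread; purely as a sketch, a rough shell equivalent might look like the script below. It is heavier than the original because find walks the whole tree to depth 3 rather than descending into random subdirectories, but it exercises the same opendir/readdir and read paths.)

#!/bin/bash
# Rough latency probe sketch, not the actual program from this thread.
# Every 10 seconds pick a random file up to 3 levels below the given
# directory, time the lookup and a 512-byte read, and print a line
# whenever either time exceeds the worst seen so far.
top=${1:-/home}
maxlookup=0 maxread=0
while sleep 10; do
    t0=$(date +%s.%N)
    file=$(find "$top" -mindepth 1 -maxdepth 3 -type f -readable 2>/dev/null | shuf -n 1)
    t1=$(date +%s.%N)
    [ -n "$file" ] || continue
    dd if="$file" of=/dev/null bs=512 count=1 2>/dev/null
    t2=$(date +%s.%N)
    lookup=$(echo "$t1 - $t0" | bc)
    readt=$(echo "$t2 - $t1" | bc)
    # GNU bc supports the || operator
    if [ "$(echo "$lookup > $maxlookup || $readt > $maxread" | bc)" = 1 ]; then
        echo "$(date -Is) lookup=${lookup}s read=${readt}s $file"
        [ "$(echo "$lookup > $maxlookup" | bc)" = 1 ] && maxlookup=$lookup
        [ "$(echo "$readt > $maxread" | bc)" = 1 ] && maxread=$readt
    fi
done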
I set one of these processes running on the server starting at the /home dir and did the same on one of my clients that has /home NFS V4 mounted with defaults + async.
The server after 12 hours had peak timings of (file paths hidden):
2021-10-02T09:26:38 0.008858 0.043513 0.031735 /home/...
2021-10-02T09:26:58 0.005384 0.050870 0.039186 /home/...
2021-10-02T09:38:09 0.006684 0.081707 0.014616 /home/...
2021-10-02T10:18:42 0.037394 0.144025 0.012603 /home/...
The client had timings of:
2021-10-02T08:48:45 0.056195 0.110149 0.019353 /home/...
2021-10-02T09:06:31 0.098647 0.098647 0.015171 /home/...
2021-10-02T09:28:38 1.060605 0.001996 0.000422 /home/...
2021-10-02T09:31:28 4.896196 2.037488 0.000836 /home/...
2021-10-02T11:48:44 4.423502 7.087917 1.111684 /home/...
2021-10-02T11:51:02 27.711746 45.646627 0.021321 /home/...
So at one point the NFS-mounted client took 45 seconds to find a file (opendir/readdir 3 times), and once before that 7.08 seconds, with 1.1 seconds to read a block. The actual file it accessed is 46819 bytes long and can normally be accessed/copied quickly.
"sar -d" reported no issues.
"mountstats /home" reported no issues
"/var/log/messages" in both systems reported no issues.
Generally the desktop system has been responsive all day (no other users and nothing obvious going on on either the server or the client) and I have not noticed a "lockup" in the GUI I have been using (intermittently). No noticeable network errors, no noticeable hard disk read issues, but occasional very long NFS opendir/readdir times, which would match up with when I see the desktop lock up for around 30 secs or more.
What did the sar -d look like for the 2 minutes before and 2 minutes afterward?
If it is slow or not may depend on if the directory/file fell out of cache and had to be reread from the disk.
I have also seen really large dirs take a really long time to search, but typically that takes thousands of files in a dir. If you do ls -ld <dirname> you will see how big the dir is; if the dir is really big, that can be slow under some conditions, but usually not 45 seconds.
On Sat, Oct 2, 2021 at 12:00 PM Terry Barnaby terry1@beam.ltd.uk wrote:
I am getting more sure this is an NFS/networking issue rather than an issue with disks in the server.
I created a small test program that, given a directory, finds a random file in a random directory three levels below it, opens it and reads up to a block (512 bytes) of data from it, and times how long it took to find the file (opendir/readdir) and read the block from the file, printing the results if the time is greater than previous ones (so capturing the peak times). This is repeated every 10 seconds. The first figure is the average time to find a file (there may not be a file 3 levels down, so it repeats those searches until it finds one that the user can access), the second is the time it took to find the file (3 x opendir/readdir) that existed, and the last is how long it took to open, read and close the file.
I set one of these processes running on the server starting at the /home dir and did the same on one of my clients that has /home NFS V4 mounted with defaults + async.
The server after 12 hours had peak timings of (file paths hidden):
2021-10-02T09:26:38 0.008858 0.043513 0.031735 /home/...
2021-10-02T09:26:58 0.005384 0.050870 0.039186 /home/...
2021-10-02T09:38:09 0.006684 0.081707 0.014616 /home/...
2021-10-02T10:18:42 0.037394 0.144025 0.012603 /home/...
The client had timings of:
2021-10-02T08:48:45 0.056195 0.110149 0.019353 /home/...
2021-10-02T09:06:31 0.098647 0.098647 0.015171 /home/...
2021-10-02T09:28:38 1.060605 0.001996 0.000422 /home/...
2021-10-02T09:31:28 4.896196 2.037488 0.000836 /home/...
2021-10-02T11:48:44 4.423502 7.087917 1.111684 /home/...
2021-10-02T11:51:02 27.711746 45.646627 0.021321 /home/...
So at one point the NFS-mounted client took 45 seconds to find a file (opendir/readdir 3 times), and once before that 7.08 seconds, with 1.1 seconds to read a block. The actual file it accessed is 46819 bytes long and can normally be accessed/copied quickly.
"sar -d" reported no issues.
"mountstats /home" reported no issues
"/var/log/messages" in both systems reported no issues.
Generally the desktop system has been responsive all day (no other users and nothing obvious going on on either the server or the client) and I have not noticed a "lockup" in the GUI I have been using (intermittently). No noticeable network errors, no noticeable hard disk read issues, but occasional very long NFS opendir/readdir times, which would match up with when I see the desktop lock up for around 30 secs or more.
You might retest with NFSv3; the code handling v3 should be significantly different, since v3 is stateless and does not maintain long-term connections.
And if the long-term connection had some sort of issue then 45 seconds may be how long it takes to figure that out and re-initiate the connection.
I know that in the 2004 time range NFSv3/TCP had some bugs where, if you had >250-ish connections, the server harvested the TCP connections and the client never seemed to realize the connection was gone and never recreated it. And this all worked perfectly fine when below 250 clients or so. I know this because we expanded an NFS setup from 240 nodes (which ran for months) to 270 or so, and after that v3/TCP never worked right; tcpdumps and other info showed that the server was harvesting the "unused" connections once it had too many and the client was never handling it.
It could be that NFSv4 + persistent connections is creating a connection with each new dir/file access, and eventually you hit the magic limit and NFS reconnections need to happen.
sar -n NFSD on the server, sar -n SOCK and sar -n SOCK6 on both client/server and sar -n NFS on a client might show something abnormal during the issue.
On Sat, Oct 2, 2021 at 12:29 PM Roger Heflin rogerheflin@gmail.com wrote:
What did the sar -d look like for the 2 minutes before and 2 minutes afterward?
If it is slow or not may depend on if the directory/file fell out of cache and had to be reread from the disk.
I have also seen really large dirs take a really long time to search, but typically that takes thousands of files in a dir. If you do ls -ld <dirname> you will see how big the dir is; if the dir is really big, that can be slow under some conditions, but usually not 45 seconds.
On Sat, Oct 2, 2021 at 12:00 PM Terry Barnaby terry1@beam.ltd.uk wrote:
I am getting more sure this is an NFS/networking issue rather than an issue with disks in the server.
I created a small test program that, given a directory, finds a random file in a random directory three levels below it, opens it and reads up to a block (512 bytes) of data from it, and times how long it took to find the file (opendir/readdir) and read the block from the file, printing the results if the time is greater than previous ones (so capturing the peak times). This is repeated every 10 seconds. The first figure is the average time to find a file (there may not be a file 3 levels down, so it repeats those searches until it finds one that the user can access), the second is the time it took to find the file (3 x opendir/readdir) that existed, and the last is how long it took to open, read and close the file.
I set one of these processes running on the server starting at the /home dir and did the same on one of my clients that has /home NFS V4 mounted with defaults + async.
The server after 12 hours had peak timings of (file paths hidden):
2021-10-02T09:26:38 0.008858 0.043513 0.031735 /home/...
2021-10-02T09:26:58 0.005384 0.050870 0.039186 /home/...
2021-10-02T09:38:09 0.006684 0.081707 0.014616 /home/...
2021-10-02T10:18:42 0.037394 0.144025 0.012603 /home/...
The client had timings of:
2021-10-02T08:48:45 0.056195 0.110149 0.019353 /home/...
2021-10-02T09:06:31 0.098647 0.098647 0.015171 /home/...
2021-10-02T09:28:38 1.060605 0.001996 0.000422 /home/...
2021-10-02T09:31:28 4.896196 2.037488 0.000836 /home/...
2021-10-02T11:48:44 4.423502 7.087917 1.111684 /home/...
2021-10-02T11:51:02 27.711746 45.646627 0.021321 /home/...
So at one point the NFS-mounted client took 45 seconds to find a file (opendir/readdir 3 times), and once before that 7.08 seconds, with 1.1 seconds to read a block. The actual file it accessed is 46819 bytes long and can normally be accessed/copied quickly.
"sar -d" reported no issues.
"mountstats /home" reported no issues
"/var/log/messages" in both systems reported no issues.
Generally the desktop system has been responsive all day (no other users and nothing obvious going on on either the server or the client) and I have not noticed a "lockup" in the GUI I have been using (intermittently). No noticeable network errors, no noticeable hard disk read issues, but occasional very long NFS opendir/readdir times, which would match up with when I see the desktop lock up for around 30 secs or more.
The 45-second event happened at 2021-10-02T11:51:02 UTC. Not sure what sar time is based on (maybe local time, BST, rather than UTC, so it would be 2021-10-02T12:51:02 BST).
"sar -d" on the server:
11:50:02  dev8-0    4.67  0.01  46.62   0.00   9.99  0.12  14.03  5.75
11:50:02  dev8-32   0.01  0.00   0.00   0.00   0.25  0.00   1.62  0.00
11:50:02  dev8-16   4.85  5.46  46.62   0.00  10.74  0.13  14.25  5.92
11:50:02  dev8-48   0.01  0.00   0.00   0.00   0.25  0.00   1.75  0.00
11:50:02  dev252-0  0.00  0.00   0.00   0.00   0.00  0.00   0.00  0.00
12:00:02  dev8-0    5.61  0.06  99.39   0.00  17.74  0.15  15.41  6.17
12:00:02  dev8-32   0.01  0.00   0.00   0.00   0.30  0.00   1.60  0.00
12:00:02  dev8-16   5.76  4.48  99.39   0.00  18.03  0.14  14.77  6.26
12:00:02  dev8-48   0.01  0.00   0.00   0.00   0.30  0.00   1.60  0.00
12:00:02  dev252-0  0.00  0.00   0.00   0.00   0.00  0.00   0.00  0.00
12:10:02  dev8-0   11.41  0.07  139.77  0.00  12.25  0.20  12.15  6.26
12:10:02  dev8-32   0.04  0.02   0.00   0.00   0.37  0.00   1.12  0.01
12:10:02  dev8-16  11.69  3.72  139.77  0.00  12.28  0.19  10.74  6.50
12:10:02  dev8-48   0.04  0.02   0.00   0.00   0.37  0.00   1.12  0.01
12:10:02  dev252-0  0.12  0.00   0.47   0.00   4.00  0.00   0.00  0.00
12:20:02  dev8-0    8.69  0.51  84.42   0.00   9.77  0.18  14.21  6.15
12:20:02  dev8-32   0.01  0.00   0.00   0.00   0.25  0.00   1.62  0.00
12:20:02  dev8-16   8.94  5.27  84.42   0.00  10.04  0.15  10.27  6.33
12:20:02  dev8-48   0.01  0.00   0.00   0.00   0.25  0.00   1.62  0.00
12:20:02  dev252-0  0.00  0.00   0.00   0.00   0.00  0.00   0.00  0.00
12:30:02  dev8-0    5.44  0.08  95.99   0.00  17.65  0.13  13.68  5.91
12:30:02  dev8-32   0.01  0.00   0.00   0.00   0.30  0.00   1.80  0.00
12:30:02  dev8-16   5.60  5.17  95.99   0.00  18.05  0.14  13.75  6.17
12:30:02  dev8-48   0.01  0.00   0.00   0.00   0.30  0.00   1.80  0.00
12:30:02  dev252-0  0.00  0.00   0.00   0.00   0.00  0.00   0.00  0.00
12:40:02  dev8-0    4.62  0.04  88.43   0.00  19.14  0.12  13.70  5.48
12:40:02  dev8-32   0.01  0.00   0.00   0.00   0.30  0.00   1.60  0.00
12:40:02  dev8-16   4.73  4.15  88.43   0.00  19.57  0.12  14.01  5.70
12:40:02  dev8-48   0.01  0.00   0.00   0.00   0.30  0.00   1.60  0.00
12:40:02  dev252-0  0.00  0.00   0.00   0.00   0.00  0.00   0.00  0.00
12:50:02  dev8-0    8.25  3.26  213.70  0.00  26.29  0.22  17.96  7.47
12:50:02  dev8-32   0.01  0.00   0.00   0.00   0.25  0.00   1.62  0.00
12:50:02  dev8-16   8.50  8.12  213.70  0.00  26.09  0.24  19.29  7.63
12:50:02  dev8-48   0.01  0.00   0.00   0.00   0.25  0.00   1.75  0.00
12:50:02  dev252-0  0.44  0.00   1.78   0.00   4.00  0.00   0.01  0.00
13:00:00  dev8-0   10.36  0.09  200.16  0.00  19.33  0.23  15.19  7.59
13:00:00  dev8-32   0.01  0.00   0.00   0.00   0.30  0.00   1.80  0.00
13:00:00  dev8-16  10.72  3.70  200.16  0.00  19.02  0.23  14.65  7.72
13:00:00  dev8-48   0.01  0.00   0.00   0.00   0.30  0.00   1.60  0.00
13:00:00  dev252-0  0.09  0.00   0.36   0.00   4.00  0.00   0.02  0.00
Other bits in next email ...
Terry
A 45 second event happened at 2021-10-02T11:51:02 UTC. I am not sure what time base sar uses (probably local time, i.e. BST rather than UTC, so that would be 2021-10-02T12:51:02 BST).
Continuing info ...
sar -n NFSD on the server
           scall/s badcall/s packet/s    udp/s    tcp/s    hit/s   miss/s  sread/s swrite/s saccess/s sgetatt/s
11:00:01     24.16      0.00    24.16     0.00    24.16     0.00     0.00     0.35     1.48      2.07     21.08
11:10:01     21.13      0.00    21.13     0.00    21.13     0.00     0.00     0.28     0.89      1.72     19.58
11:20:02     17.85      0.00    17.85     0.00    17.85     0.00     0.00     0.27     0.69      0.82     16.65
11:30:02     20.66      0.00    20.66     0.00    20.66     0.00     0.00     0.29     0.83      1.42     19.15
11:40:02     39.80      0.00    39.80     0.00    39.80     0.00     0.00     0.89     2.05      3.67     25.51
11:50:02     35.40      0.00    35.40     0.00    35.40     0.00     0.00     0.39     0.65      1.22     18.21
12:00:02     41.85      0.00    41.85     0.00    41.85     0.00     0.00     0.84     1.14      2.08     20.50
12:10:02     38.54      0.00    38.54     0.00    38.54     0.00     0.00     0.48     0.82      1.48     19.62
12:20:02     39.85      0.00    39.85     0.00    39.85     0.00     0.00     0.37     1.50      1.29     19.44
12:30:02     39.84      0.00    39.84     0.00    39.84     0.00     0.00     0.70     1.03      2.28     19.78
12:40:02     38.29      0.00    38.29     0.00    38.29     0.00     0.00     0.46     0.81      1.26     19.37
12:50:02     71.38      0.00    71.38     0.00    71.37     0.00     0.00     1.12     2.41      8.19     34.87
13:00:00     77.46      0.00    77.46     0.00    77.45     0.00     0.00     1.43     3.30      7.36     38.31
13:10:00     67.62      0.00    67.63     0.00    67.62     0.00     0.00     3.85     2.84      4.68     29.66
sar -n SOCK on the server

            totsck   tcpsck   udpsck   rawsck  ip-frag   tcp-tw
11:20:02       480       41       32        1        0        0
11:30:02       482       41       32        1        0        4
11:40:02       480       41       32        1        0        0
11:50:02       480       41       32        1        0        1
12:00:02       480       41       32        1        0        1
12:10:02       480       41       32        1        0        1
12:20:02       480       41       32        1        0        1
12:30:02       480       41       32        1        0        1
12:40:02       480       41       32        1        0        1
12:50:02       480       41       32        1        0        1
13:00:00       480       41       32        1        0        1
13:10:00       490       43       32        1        0        1
sar -n NFS on the client

            call/s retrans/s   read/s  write/s access/s getatt/s
11:10:02     19.82      0.00     0.28     0.34     0.71    15.13
11:20:03     16.53      0.00     0.27     0.15     0.34    13.80
11:30:04     17.20      0.00     0.13     0.08     0.30    15.17
11:40:04     37.46      0.00     0.89     1.47     2.07    14.42
11:50:05     32.97      0.00     0.28     0.11     0.34    15.00
12:00:05     36.31      0.00     0.59     0.47     0.75    14.17
12:10:06     34.77      0.00     0.36     0.26     0.65    14.95
12:20:07     37.55      0.00     0.27     0.97     0.35    15.36
12:30:07     33.90      0.00     0.46     0.37     0.62    13.47
12:40:07     35.67      0.00     0.36     0.28     0.64    15.44
12:50:07     68.97      0.00     1.01     1.89     6.64    13.28
13:00:07     71.08      0.00     1.17     2.58     4.32    17.87
13:10:07     49.28      0.00     0.84     1.55     1.81    15.23
13:20:00     41.87      0.00     0.50     1.24     0.87    14.98
sar -n SOCK client

            totsck   tcpsck   udpsck   rawsck  ip-frag   tcp-tw
11:20:03      1166       39       14        0        0        0
11:30:04      1164       35       14        0        0        2
11:40:04      1191       50       15        0        0        1
11:50:05      1182       40       14        0        0        0
12:00:05      1182       39       14        0        0        2
12:10:06      1182       39       14        0        0        3
12:20:07      1171       39       14        0        0        1
12:30:07      1179       40       15        0        0        0
12:40:07      1179       39       15        0        0        0
12:50:07      1200       45       17        0        0        1
13:00:07      1188       40       14        0        0        2
Nothing obvious I can see there ...
Terry
With 10 minute samples, anything that happened gets averaged out enough that even the worst event is almost impossible to see.
Sar reports the same time base as date, i.e. local time, so a 12:51 event would be in the 13:00 sample (which started at about 12:50 and ended at 13:00).
What I do see is that during that window your IO rate was about 2x the prior 10 minute windows. With 1 minute data we would be able to see if the disk was excessively busy. Your average iops were about 10% of the disk capacity.
I have debugged issues where badly behaving IO was maxing everything out for 10 seconds on / 10 seconds off. In the 1 minute data there appeared to be nothing interesting to see (around 50% capacity), but it was playing hell with the interactive apps, since during the 10 second "on" window operations that normally took 0.5 seconds were taking 1-2 seconds, which was clearly slow for the users. With the sample size (60 sec) close to the event size (45 sec) your event should be visible in 1 minute data, but much less clear in 10 minute data (9.25 minutes of normal activity average it out and hide/mask it).
Do "systemctl edit sysstat-collect.timer" and add this to the override file:

[Timer]
OnCalendar=*:00/1

That will change the collection interval to 1 minute.
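(A quick way to check that the override took effect, and later to pull the window around a stall back out of the daily data file - /var/log/sa is the Fedora default location and sa02 is just an example name for the 2nd of the month:)

    systemctl cat sysstat-collect.timer                   # shows the drop-in with OnCalendar=*:00/1
    systemctl list-timers sysstat-collect.timer           # next run should now be on a 1 minute boundary
    sar -d -f /var/log/sa/sa02 -s 12:45:00 -e 13:00:00    # 1 minute disk stats around the event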
If you do this:

#!/bin/bash
# capture 10 second iostat samples, one file per hour of the day
while true ; do
    hour=$(date +%H)
    iostat -t -x 10 360 > filename.${hour}    # 360 samples x 10 sec = 1 hour per file
done
That will give you 10 second iostat data, start a new file each hour, and overwrite each hour's file again the next day.
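(And a rough way to scan those hourly files afterwards for slow device samples - a sketch that assumes the filename.HH naming from the loop above and the iostat -x column order shown later in this thread, where r_await is field 6 and w_await field 12 of each device line; check the header in your own output first:)

    awk '$1 ~ /^(sd|nvme|dm-)/ && ($6+0 > 100 || $12+0 > 100) {print FILENAME": "$0}' filename.*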
Thanks Roger, I will set those up/running.
My disklatencytest showed a longish (14 secs) NFS file system directory/stat lookup again today on a desktop:
2021-10-04T05:26:19   0.069486   0.069486   0.000570  /home/...
2021-10-04T05:28:19   0.269743   0.538000   0.001019  /home/...
2021-10-04T09:48:00   1.492158   0.003314   0.000907  /home/...
2021-10-04T09:49:02   2.581025   0.159358   0.000836  /home/...
2021-10-04T09:50:44   2.657260   0.076560   0.027128  /home/...
2021-10-04T09:51:30  14.889837  14.889837   0.022132  /home/...
A disklatencytest running on the server shows no long latency today so far.
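(The disklatencytest script itself isn't posted in this thread; for anyone wanting to reproduce the measurement, a minimal bash sketch of the idea might be something like the following - the directory and the 1 second threshold are just placeholders:)

    #!/bin/bash
    # Crude NFS directory-latency probe: list a directory once a second and
    # log any listing that takes longer than the threshold.
    dir=/home/someuser        # placeholder: a directory on the NFS mount
    threshold=1.0             # seconds
    while true ; do
        start=$(date +%s.%N)
        ls -l "$dir" > /dev/null
        end=$(date +%s.%N)
        elapsed=$(echo "$end - $start" | bc -l)
        if [ "$(echo "$elapsed > $threshold" | bc -l)" -eq 1 ]; then
            echo "$(date -Is) ${elapsed}s listing $dir"
        fi
        sleep 1
    done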
The sar -d on the server around this time is:
          DEV        tps    rkB/s    wkB/s   dkB/s  areq-sz  aqu-sz   await   %util
10:49:00  dev8-0    20.14   152.57   246.05    0.00    19.80    0.51   21.48    9.81
10:49:00  dev8-32    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:49:00  dev8-16   39.36  1277.08   246.51    0.00    38.71    0.52   11.03   13.06
10:49:00  dev8-48    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:49:00  dev252-0   2.35     0.00     9.39    0.00     4.00    0.00    0.00    0.02
10:50:00  dev8-0     8.38   134.51    80.09    0.00    25.60    0.14    8.90    6.89
10:50:00  dev8-32    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:50:00  dev8-16   14.08   286.15    80.09    0.00    26.01    0.16    6.53    7.35
10:50:00  dev8-48    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:50:00  dev252-0   0.80     0.00     3.20    0.00     4.00    0.00    0.00    0.01
10:51:00  dev8-0     7.99    75.65   110.65    0.00    23.31    0.26   23.22    7.33
10:51:00  dev8-32    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:51:00  dev8-16   44.98  1704.41   110.65    0.00    40.35    0.30    5.00   12.43
10:51:00  dev8-48    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:51:00  dev252-0   3.86     0.00    15.45    0.00     4.00    0.00    0.01    0.03
10:52:00  dev8-0    21.20   265.73   415.69    0.00    32.15    0.42   16.08    8.64
10:52:00  dev8-32    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:52:00  dev8-16   56.56  1603.80   415.69    0.00    35.71    0.45    6.62   12.77
10:52:00  dev8-48    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:52:00  dev252-0   5.23     0.00    20.91    0.00     4.00    0.00    0.02    0.02
10:53:00  dev8-0    12.94   265.40    94.51    0.00    27.82    0.25   14.24    7.29
10:53:00  dev8-32    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:53:00  dev8-16   46.35  1747.85    94.51    0.00    39.75    0.32    5.27   11.99
10:53:00  dev8-48    0.00     0.00     0.00    0.00     0.00    0.00    0.00    0.00
10:53:00  dev252-0   3.60     0.00    14.39    0.00     4.00    0.00    0.01    0.02
and iostats in next email
So nothing obvious on the disks of the server. I am pretty sure this is an NFS issue ...
and iostats:
04/10/21 10:51:14
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.09    0.00    1.56    0.02    0.00   96.33

Device     r/s   rkB/s  rrqm/s  %rrqm r_await rareq-sz    w/s   wkB/s  wrqm/s  %wrqm w_await wareq-sz    d/s   dkB/s  drqm/s  %drqm d_await dareq-sz    f/s f_await  aqu-sz  %util
nvme0n1   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00
nvme1n1   0.00    0.00    0.00   0.00    0.00     0.00   7.10   39.20    1.40  16.47    1.68     5.52   0.00    0.00    0.00   0.00    0.00     0.00   0.10    1.00    0.01   0.19
sda       0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00
zram0     0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00

04/10/21 10:51:40
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.31    0.00    1.60   10.86    0.00   85.24

Device     r/s   rkB/s  rrqm/s  %rrqm r_await rareq-sz    w/s   wkB/s  wrqm/s  %wrqm w_await wareq-sz    d/s   dkB/s  drqm/s  %drqm d_await dareq-sz    f/s f_await  aqu-sz  %util
nvme0n1   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00
nvme1n1   0.00    0.00    0.00   0.00    0.00     0.00   0.71    3.43    0.15  17.39   12.16     4.84   0.00    0.00    0.00   0.00    0.00     0.00   0.04    1.00    0.01   0.19
sda       0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00
zram0     0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00

04/10/21 10:51:50
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.59    0.00    1.69    0.17    0.00   95.55

Device     r/s   rkB/s  rrqm/s  %rrqm r_await rareq-sz    w/s   wkB/s  wrqm/s  %wrqm w_await wareq-sz    d/s   dkB/s  drqm/s  %drqm d_await dareq-sz    f/s f_await  aqu-sz  %util
nvme0n1   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00
nvme1n1   0.00    0.00    0.00   0.00    0.00     0.00   1.40   11.20    1.40  50.00   12.36     8.00   0.00    0.00    0.00   0.00    0.00     0.00   0.20    1.00    0.02   0.50
sda       0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00
zram0     0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00    0.00     0.00   0.00    0.00    0.00   0.00
So nothing obvious on the disks of the server. I am pretty sure this is an NFS issue ...
Since it is recovering from it, maybe it is losing packets inside the network. What do "sar -n DEV" and "sar -n EDEV" look like during that time, on both the client seeing the pause and the server?
EDEV is typically all zeros unless something is lost. If something is being lost and it matches the times of the hangs, that could be it.
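(The drop counters can also be read directly on both machines, for example - eno1 being the interface name mentioned elsewhere in this thread:)

    sar -n EDEV 10 6                              # a minute of error/drop rates at 10 second resolution
    ip -s link show dev eno1                      # cumulative RX/TX errors and drops
    ethtool -S eno1 | grep -iE 'drop|err|miss'    # NIC-level counters, names vary by driver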
sar -n EDEV reports all 0's all around then. There are some rxdrop/s of 0.02 occasionally on eno1 through the day (about 20 of these with minute based sampling). Today ifconfig lists 39 dropped RX packets out of 2357593. Not sure why there are some dropped packets. "ethtool -S eno1" doesn't seem to list any particular issues.

sar -n DEV does not appear to show anything at 10:51:30:

              IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
10:44:04       eno1     18.29     19.54      5.81      5.25      0.00      0.00      0.00      0.00
10:45:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:45:04       eno1     20.45     22.52      5.96      5.79      0.00      0.00      0.00      0.00
10:46:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:46:04       eno1     22.50     24.26      7.52      7.88      0.00      0.00      0.00      0.01
10:47:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:47:04       eno1     21.53     22.75      7.27      5.71      0.00      0.00      0.00      0.01
10:48:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:48:04       eno1    222.03    284.24    173.49    367.55      0.00      0.00      0.00      0.30
10:49:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:49:04       eno1     11.83     12.28      2.74      3.98      0.00      0.00      0.00      0.00
10:50:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:50:04       eno1     15.72     14.13      4.33      3.80      0.00      0.00      0.00      0.00
10:51:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:51:04       eno1     11.00     10.53      3.48      2.63      0.00      0.00      0.00      0.00
10:52:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:52:04       eno1     13.48     13.45      4.21      4.56      0.00      0.00      0.00      0.00
10:53:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:53:04       eno1     21.76     23.98      6.99     10.26      0.00      0.00      0.00      0.01
10:54:04         lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Also, NFSv4 uses TCP/IP by default I think, and TCP/IP retries would be much quicker than 45 seconds. I do feel there is an issue in the NFS code somewhere, but I am biased about the speed of NFS directory access these days!
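(The transport and retry parameters actually negotiated for the mount can be confirmed on the client, e.g.:)

    nfsstat -m             # shows vers=, proto=, timeo= and retrans= for each NFS mount
    grep nfs /proc/mounts  # raw mount options as the kernel sees them (fstype nfs or nfs4)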
That network looks fine to me.
I would try v3. I have had bad luck many times with v4 on a variety of different kernels. If the code is recovering from something related to a bug, 45 seconds might be about the right amount of time for it to decide that something that was working is no longer working.
I am not sure any amount of debugging would help (without having really verbose kernel debugging).
What is the current kernel you are running? Trying a new one might be worth it, though I don't see NFS changes/fixes listed in the 5.14.* or 5.13.* kernel changelogs in the rpm file (rpm -q --changelog) and there are only a few listed at kernel.org for those kernels.
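(Forcing v3 on a client is just a mount option; for example, an fstab line along these lines, with the server name and other options only as placeholders:)

    # /etc/fstab - force NFSv3 for the /home mount
    server:/home   /home   nfs   vers=3,rw,hard,noatime   0 0

    # or, for a quick test without editing fstab:
    mount -o vers=3 server:/home /mnt/test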
Hi Roger,
Thanks for looking. I will try NFS v3 with my latency tests running. I did try NFS v3 before and I "think" there were still desktop lockups but for a much shorter time. But this is just a feeling. Current kernel on both systems is: 5.13.19-100.fc33.x86_64. If I find the time, I will try and add some kernel NFS RPC call timers with some printk's and maybe try Fedora35 on another system.
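(Rather than adding printk's, the existing sunrpc/nfs debug switches may already show which RPC is stalling - a sketch, to be enabled only briefly since it is very chatty and logs to the kernel log:)

    rpcdebug -m nfs -s all                # turn on NFS client debugging
    rpcdebug -m rpc -s call xprt trans    # and the RPC call/transport paths
    # ... reproduce the ~40 second stall, then turn it off again:
    rpcdebug -m nfs -c all
    rpcdebug -m rpc -c all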
Terry
https://release-monitoring.org/project/2081/ Well, it is a pre-release version: 2.5.5.rc3.
Since some Fedora33 update in the last couple of weeks the problem has gone away. I haven't changed anything as far as I am aware.
One change is that the kernel moved from 5.13.x to 5.14.x ...
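(Fedora normally keeps a couple of older kernels installed, so one way to confirm that would be to boot a 5.13 entry from the GRUB menu and see whether the stalls come back, e.g.:)

    rpm -q kernel                                        # installed kernel versions
    uname -r                                             # kernel currently running
    sudo grubby --info=ALL | grep -E '^(index|title)'    # boot entries to pick an older kernel from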
Terry