Hi,
On August 8 I filed a bug report for a machine that thrashes approximately every 2 weeks because of the kernel eating memory. You can find it here:
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=97546
I've also made some graphs available together with some command output that clearly show it is a kernel problem:
http://dag.wieers.com/rmon-breeg-io-3months-800x120.png
http://dag.wieers.com/rmon-breeg-kernel-3months-800x120.png
http://dag.wieers.com/rmon-breeg-load-3months-800x120.png
http://dag.wieers.com/rmon-breeg-paging-3months-800x120.png
http://dag.wieers.com/rmon-breeg-swap-3months-800x120.png
http://dag.wieers.com/rmon-breeg-mem-3months-800x120.png
The system has behaved like this since I installed RH9 (kernel 2.4.20-8), and it continues to act this way with newer kernels (2.4.20-18.9, 2.4.20-19.9 and 2.4.20-20.9) ;(
I know at least two other people who have the same problem (on different machines with a similar workload/functionality), so it seems to be a kernel bug triggered by a particular use of the VM.
Anyone else noticed something like this?

-- dag wieers, dag@wieers.com, http://dag.wieers.com/
-- [Any errors in spelling, tact or fact are transmission errors]
On Mon, 8 Sep 2003, Dag Wieers wrote:
On August 8 I filed a bug report for a machine that thrashes approximately every 2 weeks because of the kernel eating memory. You can find it here:
I know at least two other people who have the same problem (on different machines with a similar workload/functionality), so it seems to be a kernel bug triggered by a particular use of the VM.
The thing is, RHL8 and 9 have a completely different VM from Severn. Also, many people cannot reproduce this problem at all.
This, together with your graphs, suggests it's more likely to be a memory leak in some driver than a bug in the core VM code.
Do you have anything suspicious in /proc/slabinfo when your system gets close to crashing ?
In the RHL9 and RHL8 kernels, how big is free + active + inactive* together? How much space is unaccounted for ?
On Sun, 7 Sep 2003, Rik van Riel wrote:
On Mon, 8 Sep 2003, Dag Wieers wrote:
On August 8 I filed a bug report for a machine that thrashes approximately every 2 weeks because of the kernel eating memory. You can find it here:
I know at least two other people who have the same problem (on different machines with a similar workload/functionality), so it seems to be a kernel bug triggered by a particular use of the VM.
The thing is, RHL8 and 9 have a completely different VM from Severn. Also, many people cannot reproduce this problem at all.
Would the best thing be to try out the Severn kernel on RH9? I cannot simply install Severn on this system ;/
This, together with your graphs, suggests it's more likely to be a memory leak in some driver than a bug in the core VM code.
By driver, you don't mean a loaded module? Is there some way to find out which driver/module is allocating this memory?
Do you have anything suspicious in /proc/slabinfo when your system gets close to crashing ?
It's now up 7 days (36MB used, 25MB free) so we'll see in another 5 days. These are the higher numbers:
inode_cache              812   1232    480   154   154    1
dentry_cache             572   1380    128    46    46    1
filp                     620    630    128    21    21    1
buffer_head             4349  12363    100   162   317    1
mm_struct               6038   6069    224   356   357    1
vm_area_struct          1315   4920     96    38   123    1
pte_chain                729   3277     32    14    29    1
I'm not sure what I have to look for. I guess I better save this also directly after booting up.
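That save-and-compare idea can be sketched as a small script. This is only an illustration, assuming the usual 2.4 slabinfo field order (cache name, active objects, total objects, object size in bytes, active slabs, total slabs, pages per slab); the sample input is the output pasted above:

```python
# Sketch: diff two 2.4-style /proc/slabinfo snapshots to spot a growing
# slab cache. Assumed field order: name, active objects, total objects,
# object size (bytes), active slabs, total slabs, pages per slab.

PAGE_KB = 4  # page size on i386, in kB

def parse_slabinfo(text):
    """Map cache name -> total memory held by that cache, in kB."""
    caches = {}
    for line in text.strip().splitlines():
        fields = line.split()
        num_slabs, pages_per_slab = int(fields[5]), int(fields[6])
        caches[fields[0]] = num_slabs * pages_per_slab * PAGE_KB
    return caches

def diff_slabinfo(at_boot, later):
    """Per-cache growth in kB, biggest first."""
    growth = {name: kb - at_boot.get(name, 0) for name, kb in later.items()}
    return sorted(growth.items(), key=lambda item: -item[1])

# The figures posted above, after 7 days of uptime:
after_7_days = parse_slabinfo("""
inode_cache 812 1232 480 154 154 1
dentry_cache 572 1380 128 46 46 1
buffer_head 4349 12363 100 162 317 1
""")
print(after_7_days["inode_cache"])  # 154 slabs x 1 page x 4 kB = 616
```

Saving one snapshot right after boot and diffing it against another taken just before the machine starts thrashing would point at whichever cache is actually growing.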
In the RHL9 and RHL8 kernels, how big is free + active + inactive* together? How much space is unaccounted for ?
I'm not sure how it all adds up. The information I get from ps/top for all the processes is almost constant (around 12MB), whereas roughly 3.5MB extra seems to go unaccounted for each day.
This is after 7 days running:
[root@breeg root]# free
             total       used       free     shared    buffers     cached
Mem:         61676      60648       1028          0       2212      20416
-/+ buffers/cache:      38020      23656
Swap:       457836       8876     448960
[root@breeg root]# cat /proc/meminfo
        total:     used:     free:   shared:  buffers:   cached:
Mem:  63156224  62107648   1048576         0   2277376  25321472
Swap: 468824064   9084928 459739136
MemTotal:        61676 kB
MemFree:          1024 kB
MemShared:           0 kB
Buffers:          2224 kB
Cached:          20416 kB
SwapCached:       4312 kB
Active:          20688 kB
ActiveAnon:       7660 kB
ActiveCache:     13028 kB
Inact_dirty:         0 kB
Inact_laundry:    7036 kB
Inact_clean:       580 kB
Inact_target:     5660 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:        61676 kB
LowFree:          1024 kB
SwapTotal:      457836 kB
SwapFree:       448964 kB
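For what it's worth, the free + active + inactive* sum Rik asked about can be checked directly against this dump. A quick sketch (the numbers are the kB values pasted above):

```python
# Sketch: sum the free + active + inactive* fields from the /proc/meminfo
# dump above and see how much of MemTotal is left unaccounted for
# (i.e. memory the kernel holds outside the pageable lists).

meminfo_kb = {
    "MemTotal":      61676,
    "MemFree":        1024,
    "Active":        20688,
    "Inact_dirty":       0,
    "Inact_laundry":  7036,
    "Inact_clean":     580,
}

pageable = sum(meminfo_kb[k] for k in
               ("MemFree", "Active", "Inact_dirty", "Inact_laundry", "Inact_clean"))
unaccounted = meminfo_kb["MemTotal"] - pageable
print(pageable, unaccounted)  # 29328 kB pageable, 32348 kB unaccounted
```

That is about 29MB pageable and over 30MB unaccounted for on a 61MB box.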
Kind regards,

-- dag wieers, dag@wieers.com, http://dag.wieers.com/
-- [Any errors in spelling, tact or fact are transmission errors]
On Mon, 8 Sep 2003, Dag Wieers wrote:
On Sun, 7 Sep 2003, Rik van Riel wrote:
This, together with your graphs, suggests it's more likely to be a memory leak in some driver than a bug in the core VM code.
By driver, you don't mean a loaded module? Is there some way to find out which driver/module is allocating this memory?
There's no really easy way. The best way would be to look at which drivers you are using and compare that list with the drivers being used by the other people who have this problem.
Then look at people who don't have the problem at all. Scratch the drivers on problemless systems from the list of suspects.
Hopefully, you'll end up with just one or a few suspect drivers...
Do you have anything suspicious in /proc/slabinfo when your system gets close to crashing ?
Hmmm ok, so nothing bad in /proc/slabinfo.
In the RHL9 and RHL8 kernels, how big is free + active + inactive* together? How much space is unaccounted for ?
I'm not sure how it all adds up. The information I get from ps/top for all the processes is almost constant (around 12MB), whereas roughly 3.5MB extra seems to go unaccounted for each day.
This is after 7 days running:
MemTotal:        61676 kB
MemFree:          1024 kB
Active:          20688 kB
Inact_dirty:         0 kB
Inact_laundry:    7036 kB
Inact_clean:       580 kB
OK, that's 1 + 20.5 + 7 + .5 = 29 MB pageable memory.
In other words, 30+ MB is already taken by the kernel!
Definitely looks like a memory leak.
On Mon, 8 Sep 2003, Rik van Riel wrote:
On Mon, 8 Sep 2003, Dag Wieers wrote:
On Sun, 7 Sep 2003, Rik van Riel wrote:
This, together with your graphs, suggests it's more likely to be a memory leak in some driver than a bug in the core VM code.
By driver, you don't mean a loaded module? Is there some way to find out which driver/module is allocating this memory?
There's no really easy way. The best way would be to look at which drivers you are using and compare that list with the drivers being used by the other people who have this problem.
Ok, this is fairly easy for the modules: ext3 and jbd. Since one of the other crashing machines was on token ring, I'm sure it's not the mii module or any of the NIC drivers. And no other modules match the other systems.
One of the other people (and I hope they'll join this discussion soon) reported that he has already tried running the same setup on three different systems (completely different HW), and the system thrashes every 6 to 8 hours (he now manages to work around it by rebooting often).
Then look at people who don't have the problem at all. Scratch the drivers on problemless systems from the list of suspects.
Well, I'm almost certain that doing the same stuff on almost any RH9 system will make it thrash (either slowly or quickly, depending on the amount of memory and the I/O), so it may be related to memory usage or to I/O utilisation.

Maybe IDE? I'm not sure if all the machines are IDE.
Thanks for your time,

-- dag wieers, dag@wieers.com, http://dag.wieers.com/
-- [Any errors in spelling, tact or fact are transmission errors]
On Mon, 2003-09-08 at 07:09, Rik van Riel wrote:
MemTotal:        61676 kB
MemFree:          1024 kB
Active:          20688 kB
Inact_dirty:         0 kB
Inact_laundry:    7036 kB
Inact_clean:       580 kB
OK, that's 1 + 20.5 + 7 + .5 = 29 MB pageable memory.
In other words, 30+ MB is already taken by the kernel!
Definitely looks like a memory leak.
Would it be normal on a machine with 512MB memory to have a kernel using this much memory? Mine's ~33MB:
MemTotal:       513284 kB
MemFree:         36460 kB
...
Active:         362564 kB
...
Inact_dirty:      3080 kB
Inact_laundry:   65652 kB
Inact_clean:     10916 kB
...
This is:

$ uptime
 19:16:56 up 2 days, 1:43, 4 users, load average: 0.20, 0.12, 0.05
$ uname -a
Linux denk.nakedape.priv 2.4.20-19.9 #1 Tue Jul 15 17:18:13 EDT 2003 i686 i686 i386 GNU/Linux
Wil
On Mon, 8 Sep 2003, Rik van Riel wrote:
Definitely looks like a memory leak.
Would it help if I said I have a feeling it happens when running out of physical memory? Or when allocating/using a lot of memory in a short timespan and releasing it under that condition?
I removed memory from my desktop system and I seem to suffer from it there too (albeit less severely).
I can also add that it wasn't present in 2.4.18-18.7.x (or earlier) kernels, but it started when I upgraded to RH9 with kernel 2.4.20-8.
-- dag wieers, dag@wieers.com, http://dag.wieers.com/
-- [Any errors in spelling, tact or fact are transmission errors]
On Fri, 12 Sep 2003, Dag Wieers wrote:
On Mon, 8 Sep 2003, Rik van Riel wrote:
Definitely looks like a memory leak.
Would it help if I said I have a feeling it happens when running out of physical memory? Or when allocating/using a lot of memory in a short timespan and releasing it under that condition?
This output was generated while it was thrashing heavily:
[root@breeg root]# cat /proc/meminfo
        total:     used:     free:   shared:  buffers:   cached:
Mem:  63156224  62545920    610304         0    733184   6836224
Swap: 468824064  21938176 446885888
MemTotal:        61676 kB
MemFree:           596 kB
MemShared:           0 kB
Buffers:           716 kB
Cached:           4704 kB
SwapCached:       1972 kB
Active:           4912 kB
ActiveAnon:       3728 kB
ActiveCache:      1184 kB
Inact_dirty:      1104 kB
Inact_laundry:     528 kB
Inact_clean:       964 kB
Inact_target:     1500 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:        61676 kB
LowFree:           596 kB
SwapTotal:      457836 kB
SwapFree:       436412 kB
Is there anything I can try to help you find the cause? This machine has only ext3 and jbd loaded (the modules that also match the other machines with the same problems).
-- dag wieers, dag@wieers.com, http://dag.wieers.com/
-- [Any errors in spelling, tact or fact are transmission errors]
On Wed, 17 Sep 2003, Dag Wieers wrote:
MemTotal:        61676 kB
MemFree:           596 kB
Active:           4912 kB
Inact_dirty:      1104 kB
Inact_laundry:     528 kB
Inact_clean:       964 kB
Oh dear, that is only 8MB free&pageable memory, out of 61MB total available memory after bootup...
Is there anything I can try to help you find the cause? This machine has only ext3 and jbd loaded (the modules that also match the other machines with the same problems).
Exactly which kernel are you running ?
I wonder if there's some leak in reclaiming ext3 pages after a truncate. Maybe they get erroneously removed from the pageout lists and we never find them...
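A hypothetical way to probe that theory would be to hammer an ext3 filesystem with write-then-truncate cycles and watch whether the meminfo numbers recover afterwards. A rough sketch; the path, sizes and iteration counts are made up for illustration:

```python
# Hypothetical stress sketch for the truncate theory: repeatedly populate
# and truncate a scratch file so the kernel has to reclaim its pages.
# Run it on an ext3 mount and compare /proc/meminfo before and after.
import os
import tempfile

def churn(path, rounds=100, size=1 << 20):
    data = b"x" * size
    for _ in range(rounds):
        with open(path, "wb") as f:
            f.write(data)      # fill the page cache for this file
        os.truncate(path, 0)   # drop the data; its pages should be reclaimable
    os.unlink(path)

# Illustration only; on the affected box this would live on the ext3 partition.
churn(os.path.join(tempfile.gettempdir(), "truncate-churn.dat"), rounds=10)
```

If the unaccounted-for memory grows with every run and never comes back, that would point at the truncate path.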
On Thu, 18 Sep 2003, Rik van Riel wrote:
On Wed, 17 Sep 2003, Dag Wieers wrote:
MemTotal:        61676 kB
MemFree:           596 kB
Active:           4912 kB
Inact_dirty:      1104 kB
Inact_laundry:     528 kB
Inact_clean:       964 kB
Oh dear, that is only 8MB free&pageable memory, out of 61MB total available memory after bootup...
Is there anything I can try to help you find the cause? This machine has only ext3 and jbd loaded (the modules that also match the other machines with the same problems).
Exactly which kernel are you running ?
2.4.20-19.9 currently, but I have the same effect with 2.4.20-18 and 2.4.20-20. The graph is exactly the same as it was weeks ago (every 2 weeks this happens, only this time the system was responsive enough to return the information ;)
If I go to runlevel 1 when the system isn't thrashing (with only bash and init running), the memory isn't freed (and isn't claimed by bash or init).
I wonder if there's some leak in reclaiming ext3 pages after a truncate. Maybe they get erroneously removed from the pageout lists and we never find them...
Well, the system is running rrdtool for about 100 counters (2 machines) every 5 minutes. But once every evening (at 22h) it generates graphs and html-pages from all the data (roughly 10 months' worth) in the RRDs, and that seems to be what causes the kernel to leak memory, as is clearly shown on these older graphs:
http://dag.wieers.com/rmon-breeg-mem-2weeks-200x80.png
http://dag.wieers.com/rmon-breeg-mem-3months-800x120.png
-- dag wieers, dag@wieers.com, http://dag.wieers.com/
-- [Any errors in spelling, tact or fact are transmission errors]
Hi,
On Fri, 2003-09-19 at 18:30, Dag Wieers wrote:
Exactly which kernel are you running ?
2.4.20-19.9 currently, but I have the same effect with 2.4.20-18 and 2.4.20-20. The graph is exactly the same as it was weeks ago (every 2 weeks this happens, only this time the system was responsive enough to return the information ;)
We recently found one problem which could affect the size of the inode cache. It's not exactly a leak, because the resources can still be reclaimed, but the inodes were filed away in a place where the VM was least likely to go looking to reclaim them.
Could you "cat /proc/slabinfo" on the affected systems and see what the inode_cache entry looks like, please?
Cheers, Stephen
On Fri, 2003-09-19 at 14:58, Stephen C. Tweedie wrote:
Hi,
On Fri, 2003-09-19 at 18:30, Dag Wieers wrote:
Exactly which kernel are you running ?
2.4.20-19.9 currently, but I have the same effect with 2.4.20-18 and 2.4.20-20. The graph is exactly the same as it was weeks ago (every 2 weeks this happens, only this time the system was responsive enough to return the information ;)
We recently found one problem which could affect the size of the inode cache. It's not exactly a leak, because the resources can still be reclaimed, but the inodes were filed away in a place where the VM was least likely to go looking to reclaim them.
Could you "cat /proc/slabinfo" on the affected systems and see what the inode_cache entry looks like, please?
I'm seeing this behavior under 7.3 using kernel 2.4.20-18.7. Here is inode_cache from that machine:
inode_cache 20400 21014 512 3002 3002 1
Is this related to this bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=100680
at all?
-sv
Hi,
On Fri, 2003-09-19 at 20:07, seth vidal wrote:
Could you "cat /proc/slabinfo" on the affected systems and see what the inode_cache entry looks like, please?
I'm seeing this behavior under 7.3 using kernel 2.4.20-18.7. Here is inode_cache from that machine:
inode_cache 20400 21014 512 3002 3002 1
64MB ram, and you've got 12MB (4k * 3002) already in the inode cache. The bug we just fixed could easily be causing your problems.
If those numbers continue to rise directly as performance degrades, then that's almost certainly the cause.
Is this related to this bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=100680
No, I don't think so.
Cheers, Stephen
On Fri, 2003-09-19 at 15:31, Stephen C. Tweedie wrote:
Hi,
On Fri, 2003-09-19 at 20:07, seth vidal wrote:
Could you "cat /proc/slabinfo" on the affected systems and see what the inode_cache entry looks like, please?
I'm seeing this behavior under 7.3 using kernel 2.4.20-18.7. Here is inode_cache from that machine:
inode_cache 20400 21014 512 3002 3002 1
64MB ram, and you've got 12MB (4k * 3002) already in the inode cache. The bug we just fixed could easily be causing your problems.
Well, on this system it's got 1GB of RAM.
How should I read that line?
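For reference, a sketch of how that line decodes, assuming the standard 2.4 slabinfo field order (name, active objects, total objects, object size in bytes, active slabs, total slabs, pages per slab):

```python
# Decode one 2.4-style slabinfo line; the total memory held by the cache
# is total_slabs * pages_per_slab * page size (4 kB on i386).
line = "inode_cache 20400 21014 512 3002 3002 1"
fields = line.split()
name = fields[0]
(active_objs, total_objs, objsize_bytes,
 active_slabs, total_slabs, pages_per_slab) = map(int, fields[1:7])
cache_kb = total_slabs * pages_per_slab * 4
print(name, cache_kb)  # inode_cache 12008 -> roughly 12 MB
```

So 3002 slabs of one 4 kB page each is where Stephen's "12MB (4k * 3002)" figure comes from, independent of how much RAM the box has.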
If those numbers continue to rise directly as performance degrades, then that's almost certainly the cause.
I'll keep an eye on it.
Thanks -sv
On 19 Sep 2003, Stephen C. Tweedie wrote:
On Fri, 2003-09-19 at 18:30, Dag Wieers wrote:
Exactly which kernel are you running ?
2.4.20-19.9 currently, but I have the same effect with 2.4.20-18 and 2.4.20-20. The graph is exactly the same as it was weeks ago (every 2 weeks this happens, only this time the system was responsive enough to return the information ;)
We recently found one problem which could affect the size of the inode cache. It's not exactly a leak, because the resources can still be reclaimed, but the inodes were filed away in a place where the VM was least likely to go looking to reclaim them.
Could you "cat /proc/slabinfo" on the affected systems and see what the inode_cache entry looks like, please?
The system was rebooted 2 days ago. The effect will be much bigger if I wait at least a week.
This is what I pasted earlier on Rik's request:
| > Do you have anything suspicious in /proc/slabinfo when your
| > system gets close to crashing ?
|
| It's now up 7 days (36MB used, 25MB free) so we'll see in another
| 5 days. These are the higher numbers:
|
| inode_cache              812   1232    480   154   154    1
| dentry_cache             572   1380    128    46    46    1
| filp                     620    630    128    21    21    1
| buffer_head             4349  12363    100   162   317    1
| mm_struct               6038   6069    224   356   357    1
| vm_area_struct          1315   4920     96    38   123    1
| pte_chain                729   3277     32    14    29    1
|
| I'm not sure what I have to look for. I guess I better save this also
| directly after booting up.
I'll give you some other numbers later. If there's more stuff you need from that system just before it starts thrashing, please tell me now ;)
Thanks,

-- dag wieers, dag@wieers.com, http://dag.wieers.com/
-- [Any errors in spelling, tact or fact are transmission errors]
Would it help if I said I have a feeling it happens when running out of physical memory? Or when allocating/using a lot of memory in a short timespan and releasing it under that condition?
I am having this exact same problem, with a machine that has 2G of swap and 768M of physical memory. I filed a bug in Red Hat Bugzilla, and was told to produce the output of /proc/slabinfo, but it takes the machine a while to consume all of its resources to the point where everything essentially is being run out of the swap file.
I will produce the Bug number once Bugzilla is alive again.
FYI: kernel <= 2.4.20-20.9 is not affected by the refile_inode bug
On Fri, 2003-09-19 at 20:58, Stephen C. Tweedie wrote:
Hi,
On Fri, 2003-09-19 at 18:30, Dag Wieers wrote:
Exactly which kernel are you running ?
2.4.20-19.9 currently, but I have the same effect with 2.4.20-18 and 2.4.20-20. The graph is exactly the same as it was weeks ago (every 2 weeks this happens, only this time the system was responsive enough to return the information ;)
We recently found one problem which could affect the size of the inode cache. It's not exactly a leak, because the resources can still be reclaimed, but the inodes were filed away in a place where the VM was least likely to go looking to reclaim them.
Could you "cat /proc/slabinfo" on the affected systems and see what the inode_cache entry looks like, please?
Cheers, Stephen
--
Rhl-devel-list mailing list
Rhl-devel-list@redhat.com
http://www.redhat.com/mailman/listinfo/rhl-devel-list