Hello.
For the last N systems I've installed, I have usually been assembling software RAID1 at install time. The disk layout I'm using is trivial: 100MB + the rest, where the first 100MB is /dev/md0 (ext3, /boot) and the rest is /dev/md1 (LVM, everything else).
For whatever reason (I haven't looked at the code yet :() anaconda issues the mdadm commands in such an order that the primary replica of the unsynced md0 is the second drive, and the RAID1 resync starts on md1 first.
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I've not tried FC5 test1 yet, but I'm pretty sure that without additional steps to resolve this issue it won't just vaporize. So, is it possible to change anaconda with respect to the described issue?
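For reference, the layout described above can be sketched roughly like this (device names and command order are assumptions for illustration; these mirror what anaconda sets up, not its actual invocations):

```shell
# Assumed partitioning: sda1/sdb1 (~100MB each) and sda2/sdb2 (the rest).
# /boot mirror:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# everything-else mirror, used as an LVM physical volume:
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

mkfs.ext3 /dev/md0      # /boot
pvcreate /dev/md1       # LVM PV for the rest
vgcreate vg0 /dev/md1
```

The complaint in the message is about which array md starts resyncing first after the two --create calls, and which drive md treats as the authoritative copy while the arrays are still out of sync.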
On Mon, 2005-12-19 at 23:16 +0300, Paul P Komkoff Jr wrote:
Hello.
For the last N systems I've installed, I have usually been assembling software RAID1 at install time. The disk layout I'm using is trivial: 100MB + the rest, where the first 100MB is /dev/md0 (ext3, /boot) and the rest is /dev/md1 (LVM, everything else).
For whatever reason (I haven't looked at the code yet :() anaconda issues the mdadm commands in such an order that the primary replica of the unsynced md0 is the second drive, and the RAID1 resync starts on md1 first.
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I'm pretty sure that anaconda at some point deliberately did things to avoid that sync. So I suspect it got reenabled for a good reason...
Replying to Arjan van de Ven:
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I'm pretty sure that anaconda at some point deliberately did things to avoid that sync. So I suspect it got reenabled for a good reason...
There's nothing wrong with the sync itself, only with the order. If /boot is synced first, then it is safe to reboot as soon as the actual installation is complete (and root volume reconstruction can happen later).
On Tue, 2005-12-20 at 11:56 +0300, Paul P Komkoff Jr wrote:
Replying to Arjan van de Ven:
So, to boot without changing BIOS settings back and forth (or without disabling the first disk drive by some other means), it is necessary to wait until the md1 RAID1 resync completes, which can take as long as 2 hours (250GB HDD).
I'm pretty sure that anaconda at some point deliberately did things to avoid that sync. So I suspect it got reenabled for a good reason...
There's nothing wrong with the sync itself, only with the order. If /boot is synced first, then it is safe to reboot as soon as the actual installation is complete (and root volume reconstruction can happen later).
well... if the root is brand spanking new.. why would it need a sync?
On Tue, 2005-12-20 at 11:13 +0100, Ralf Ertzinger wrote:
On Tue, Dec 20, 2005 at 10:46:18AM +0100, Arjan van de Ven wrote:
well... if the root is brand spanking new.. why would it need a sync?
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
Replying to Arjan van de Ven:
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
RAID1 means that the contents of both mirrors are exactly the same. Just after creation, this is obviously not true. The Linux MD implementation solves this problem by syncing them.
Another solution might be to store a list of written blocks and work from that, but I don't think that would be reliable/scalable.
Anyway, as I said previously, root volume reconstruction can happily continue after firstboot (and the second boot too), but inconsistency in /boot makes the system unbootable.
Paul P Komkoff Jr wrote:
Replying to Arjan van de Ven:
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
RAID1 means that the contents of both mirrors are exactly the same. Just after creation, this is obviously not true. The Linux MD implementation solves this problem by syncing them.
No, but if you plan to format the partition right after creating the RAID, which is usually the case, you can force the RAID clean: sectors will only be written (and written identically to both disks), never read first. Well, at least on paper.
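Concretely, "forcing the RAID clean" at creation time is what mdadm's --assume-clean flag does. A minimal sketch (device names assumed; as noted later in the thread, skipping the resync like this is only safe when every sector will be written before it is ever read, i.e. you mkfs the array immediately):

```shell
# Create the mirror but skip the initial resync entirely.
# Safe only if the array is formatted right away, so no sector
# is read before both halves of the mirror have been written.
mdadm --create /dev/md1 --level=1 --raid-devices=2 \
      --assume-clean /dev/sda2 /dev/sdb2
mkfs.ext3 /dev/md1
```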
On Tue, 2005-12-20 at 11:19 +0100, Arjan van de Ven wrote:
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
Because it's RAID, and it doesn't know anything about the file system -- it's just trying naïvely to keep the _block_ device synchronised, without any of that helpful knowledge about which blocks the file system actually cares about, or might have touched since the array was last known to be consistent.
This is one of the reasons I believe RAID is the wrong solution; a better option would be to have the file system natively handle redundancy across block devices, rather than having two separate layers each of which know nothing about the other.
Imagine a 'RAID resync' which only needed to resync those blocks which the file system's journal says have been modified since the state was last consistent...
On Tue, 20 December 2005 13:35, David Woodhouse wrote:
On Tue, 2005-12-20 at 11:19 +0100, Arjan van de Ven wrote:
but.. it might as well be marked clean. there's no information in never-used parts of the disk after all... so why sync it instead of pretending it's synced perfectly fine....
Because it's RAID, and it doesn't know anything about the file system -- it's just trying naïvely to keep the _block_ device synchronised, without any of that helpful knowledge about which blocks the file system actually cares about, or might have touched since the array was last known to be consistent.
I don't know how anaconda handles it, but the default RAID I/O limits are very conservative (optimized for background work on a live system).
On a system being created it's probably a better idea to tell the RAID system to pump as many blocks as possible, since this is a blocking operation in this context and we don't care about anything else.
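The limits in question are md's global resync throttles under /proc/sys/dev/raid/. A sketch of raising them for an install-time resync (the "typical" default values are an assumption from kernels of that era; check your own /proc before relying on them):

```shell
# Per-device resync bandwidth limits, in KB/s.
cat /proc/sys/dev/raid/speed_limit_min   # typically 1000
cat /proc/sys/dev/raid/speed_limit_max   # typically 200000

# During an install nothing else competes for the disks, so the
# guaranteed minimum can be raised to let the resync run flat out.
echo 50000 > /proc/sys/dev/raid/speed_limit_min
```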
On Tue, 2005-12-20 at 14:06 +0100, Nicolas Mailhot wrote:
I don't know how anaconda handles it, but the default RAID I/O limits are very conservative (optimized for background work on a live system).
On a system being created it's probably a better idea to tell the RAID system to pump as many blocks as possible, since this is a blocking operation in this context and we don't care about anything else.
I think you miss the point. On a file system which is being created, the bandwidth limits _ought_ to be pretty much irrelevant. The file system is _empty_, and the total amount of I/O needed to create an empty RAID-1 ought to be almost exactly what it takes just to run 'mkfs' twice. The blocks which aren't yet used by the filesystem don't matter; you shouldn't need to sync their contents.
RAID is just a hack which hides the redundancy from the file system; the kind of thing we always used to have to do for DOS compatibility so that we could provide an INT 13h handler and have it 'just work'. It's much the same reasoning which led to the built-in 'translation layers' on cheap CF cards and USB sticks which so often eat your data and which have a tendency to use up their limited lifetime in copying obsolete sectors belonging to deleted files around on the underlying medium.
We don't _need_ that kind of layering in Linux. Have a library of functions which can be used for such redundancy, by all means -- but don't just make RAID pretend to be a 'standard' block device and prevent all possibility of sharing knowledge between the file system and the layer providing the redundancy. That's just insane.
Replying to David Woodhouse:
RAID is just a hack which hides the redundancy from the file system; the kind of thing we always used to have to do for DOS compatibility so that we could provide an INT 13h handler and have it 'just work'. It's much the same reasoning which led to the built-in 'translation layers' on cheap CF cards and USB sticks which so often eat your data and which have a tendency to use up their limited lifetime in copying obsolete sectors belonging to deleted files around on the underlying medium.
We don't _need_ that kind of layering in Linux. Have a library of functions which can be used for such redundancy, by all means -- but don't just make RAID pretend to be a 'standard' block device and prevent all possibility of sharing knowledge between the file system and the layer providing the redundancy. That's just insane.
Well, everything you said above is correct. But I have yet to see any "redundant block device filesystem" project anywhere. Anybody willing to start one?
Meanwhile, maybe we can fix the md sync ordering? :)))
Paul P Komkoff Jr wrote:
Replying to David Woodhouse:
RAID is just a hack which hides the redundancy from the file system; the kind of thing we always used to have to do for DOS compatibility so that we could provide an INT 13h handler and have it 'just work'. It's much the same reasoning which led to the built-in 'translation layers' on cheap CF cards and USB sticks which so often eat your data and which have a tendency to use up their limited lifetime in copying obsolete sectors belonging to deleted files around on the underlying medium.
We don't _need_ that kind of layering in Linux. Have a library of functions which can be used for such redundancy, by all means -- but don't just make RAID pretend to be a 'standard' block device and prevent all possibility of sharing knowledge between the file system and the layer providing the redundancy. That's just insane.
Well, everything you said above is correct. But I have yet to see any "redundant block device filesystem" project anywhere.
Sun zfs, Isilon OneFS, possibly others.
On Tue, 20 December 2005 14:20, David Woodhouse wrote:
On Tue, 2005-12-20 at 14:06 +0100, Nicolas Mailhot wrote:
I don't know how anaconda handles it but the default raid IO limits are very conservative (optimized for background work on a live system)
On a system being created it's probably a better idea to tell the RAID system to pump as many blocks as possible, since this is a blocking operation in this context and we don't care about anything else.
I think you miss the point. On a file system which is being created, the bandwidth limits _ought_ to be pretty much irrelevant.
No, they're not. If I remember correctly, during an upgrade anaconda can transform some existing logical volumes into redundant volumes, so you can have a fairly big mass of data to transfer.
And the bandwidth limits are *very* conservative. The gain isn't just 5%, or even a factor of two; sometimes it's a 10x (or more) difference.
On Tue, Dec 20, 2005 at 12:35:17PM +0000, David Woodhouse wrote:
Imagine a 'RAID resync' which only needed to resync those blocks which the file system's journal says have been modified since the state was last consistent...
The fact that Linux md doesn't currently deal with that isn't about the fs/md split; it's that nobody has written it. High-end RAID implementations tend to keep dirty bitmaps and journal them, so that the consistency of any stripe can be established.
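As it happens, a dirty-bitmap mechanism did land in md as "write-intent bitmaps" (mdadm 2.0 / kernel 2.6.13 era). A sketch of using them (device names assumed):

```shell
# With a write-intent bitmap, md records which regions are dirty;
# after an unclean shutdown only the flagged regions are resynced,
# not the whole array.

# Add an internal bitmap to an existing array:
mdadm --grow /dev/md0 --bitmap=internal

# Or create a new array with one from the start:
mdadm --create /dev/md1 --level=1 --raid-devices=2 \
      --bitmap=internal /dev/sda2 /dev/sdb2
```

Note this addresses crash recovery, not the initial post-creation sync the thread started with.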
On Thu, 2005-12-22 at 06:25 -0500, Alan Cox wrote:
The fact that Linux md doesn't currently deal with that isn't about the fs/md split; it's that nobody has written it. High-end RAID implementations tend to keep dirty bitmaps and journal them, so that the consistency of any stripe can be established.
The same information is already available in the file system's own data structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
But yes, there are always band-aids which can help to improve any design flaw.
On Thu, 22-12-2005 at 11:55 +0000, David Woodhouse wrote:
On Thu, 2005-12-22 at 06:25 -0500, Alan Cox wrote:
The fact that Linux md doesn't currently deal with that isn't about the fs/md split; it's that nobody has written it. High-end RAID implementations tend to keep dirty bitmaps and journal them, so that the consistency of any stripe can be established.
The same information is already available in the file system's own data structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
What if I use the RAID device without any filesystem on it?
On Thu, 22 Dec 2005, David Woodhouse wrote:
The same information is already available in the file system's own data structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
But yes, there are always band-aids which can help to improve any design flaw.
I rather like the separation between file systems and RAID devices. It gives more choice and makes both components simpler. Now I can choose raid1+ext2, raid5+ext3, swap+raid0, and reiser+device.
I would rather not lose those options. Otherwise it would be like combining "uniq" and "sort" just because you could make one case work better.
christof
On Thu, Dec 22, 2005 at 11:55:44AM +0000, David Woodhouse wrote:
structures; it shouldn't be necessary for the RAID layer to do that for itself. _That_ kind of redundancy isn't what RAID is supposed to be about.
It's a necessary part of RAID, because duplicating RAID functionality in every file system would be incredibly inefficient and lead to a lot of code duplication and bugs. It's also necessary because the RAID functionality may not even be on the same host as the file system.
But yes, there are always band-aids which can help to improve any design flaw.
You need to learn the difference between a design flaw and pragmatic design.
"Perfection is the enemy of success"
Alan Cox wrote:
It's a necessary part of RAID, because duplicating RAID functionality in every file system would be incredibly inefficient and lead to a lot of code duplication and bugs. It's also necessary because the RAID functionality may not even be on the same host as the file system.
Seen the RAID-Z system in Solaris 10?
http://blogs.sun.com/roller/page/bonwick?entry=raid_z
Or the Write Anywhere File Layout (WAFL) used by NetApp?
http://www.netapp.com/library/tr/3002.pdf
It's clear that you can get better performance and reliability by co-designing a filesystem with the block layer. I think of the difference between old-school Unix (KISS) and System/370: do you build a system that's simple and clean, or do you take the extra effort to maximize performance? Sun can afford to do this because it isn't wasting energy on maintaining 30 filesystems that almost work, but rather focusing on one that does.
On Thu, 2005-12-22 at 09:33 -0500, Alan Cox wrote:
It's a necessary part of RAID, because duplicating RAID functionality in every file system would be incredibly inefficient and lead to a lot of code duplication and bugs. It's also necessary because the RAID functionality may not even be on the same host as the file system.
A bit like the way that duplicating ATA functionality in every SATA host driver also leads to a lot of code duplication and bugs? Or the way that duplicating zlib in everything that needs compression leads to a lot of code duplication and bugs?
The former doesn't happen; the latter was a bug which got fixed.
On Tue, 2005-12-20 at 11:13 +0100, Ralf Ertzinger wrote:
On Tue, Dec 20, 2005 at 10:46:18AM +0100, Arjan van de Ven wrote:
well... if the root is brand spanking new.. why would it need a sync?
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
Why does that matter? The parts of it we've written data out to are supposed to be in sync already. The resync should only be making the parts that aren't in use yet identical.
Peter Jones wrote:
On Tue, 2005-12-20 at 11:13 +0100, Ralf Ertzinger wrote:
On Tue, Dec 20, 2005 at 10:46:18AM +0100, Arjan van de Ven wrote:
well... if the root is brand spanking new.. why would it need a sync?
Because it _is_ brand spanking new (and thus unclean from a RAID point of view)?
Why does that matter? The parts of it we've written data out to are supposed to be in sync already. The resync should only be making the parts that aren't in use yet identical.
Exactly; it's like trying to sync uninitialized memory. Now, I believe the --assume-clean option of mdadm only works for RAID1, unfortunately.
Also, the layered approach of RAID is why RAID "just works"; trying to build the redundancy into the filesystem itself would be pure madness. There are many excellent OS design books that will explain this much better than I ever could.
On Tue, 2005-12-20 at 10:17 -0800, Denis Leroy wrote:
Also, the layered approach of RAID is why RAID "just works", trying to build the redundancy into the filesystem itself would be pure madness. There are many excellent OS design books that will explain this much better than i ever could.
RAID 'just works' in much the same way that microkernels and the OSI networking model 'work' -- suboptimally. It's a symptom of a popular design methodology which, if allowed free rein, would make Modula-3 programmers of us all.
I'm sure there are many OS design books which Linus could have read that would have explained why microkernels were better, too. Personally, I'm happy that he wasn't persuaded.
The next file system I write will probably do this correctly, but it'll be another flash file system, so won't be relevant to block devices.